Handling profanity in text data with Python.

Lubna Khan
5 min read · Feb 21, 2021

Dealing with raw, unstructured text data is a bit of a challenge. Conversational text arrives with stray characters, punctuation, misspellings, abbreviations and emojis, to name a few. The presence of profanity in text data is the main focus of this article: ways to handle profanity in text, and the impact that profanity/censorship is likely to have on sentiment analysis.

The dataset used here has 2 fields containing tweet text: df['tweet_raw'] and df['tweet_clean_text']. The former is the data extracted from Twitter with no preprocessing/cleaning performed; the latter is pre-processed, with mentions, hashtags, emails, phone numbers and URLs removed. The following snapshot shows a few lines of data containing profanity.

Profanity handling starts with detecting profanity:

1. Flag texts containing profanity, so that they can be filtered and passed through a data pipeline for a certain action to be taken, such as warning the user or removing the content from public view.

To identify profanity in text, we use the Python library profanity-check. This library has been deprecated as of May 2022; a newer library, alt-profanity-check, has been released as an updated replacement.

Profanity-check can be used to detect the presence of profanity in a sample text using the predict method. The output is a Boolean value: 1 (True) if profanity is detected, 0 (False) otherwise. The example below demonstrates this:

The predict_prob method returns the probability of detecting profanity in a sentence, as can be seen in the following examples:

86% probability of containing profanity
6.6% probability of containing profanity

For the purpose of demonstration, I have created a function using the predict method to add a field to the dataframe, df['contains_profanity'], containing Boolean values (1s and 0s).

2. Mask inappropriate words within the textual content to make it safer/ reader-friendly.

As an alternative to completely removing swear words from text, words identified as inappropriate for plain view can be masked with special characters. Many of us will have seen this widely used in subtitles on TV shows and in transcripts from virtual meetings.

To achieve this, we use the Python library better-profanity 0.7.0.

The profanity.censor method is used to mask inappropriate words with special characters. The default setting uses "*" as shown below:

Another useful Python library is profanity-filter 1.3.3.

The profanity-filter library has provision for user-specified characters to mask inappropriate words. (Example shown below)

Both libraries mentioned here can be used for profanity handling, bearing in mind that their sensitivity to certain words may differ. The 2 examples given above reveal that the word "stupid" is not treated as inappropriate by the library profanity-filter, while it is masked by the library better-profanity.

Profanity-filter is my personal preference for its ease of use and the ability to include additional words to censor. Check out the example below!

Let's proceed to examine how sentiment analysis is impacted by the use of the 2 modules for profanity handling:

a. Sentiment analysis on text containing profanity.

b. Sentiment analysis on text with censorship

The sentiment scores for the dataset with and without censorship are different (as expected), but the censored scores are also likely to vary depending on which module is used, owing to differences in the words each detects and in its sensitivity. A limitation of both modules explored here is that mixed-language (polyglot) text and misspelled profanities may not be addressed.

To end this article, a snapshot of polarity scores for my dataset, with profanity and with censorship, is shown below to illustrate the relevance of censorship modules to sentiment analysis of text data.

Polarity scores comparison: Text containing profanity versus censorship

Note: The data used was gathered from Twitter using Tweepy, as explained in the article Working with unstructured text data - Part 1. The raw tweets have not been fabricated to include or exclude certain words shown in the examples in this article. The pre-processed tweets represent raw tweet data with some ETL performed, as discussed in Working with unstructured text data using Python - Part 2 and Part 3.

Disclaimer: The inappropriate language and profanity seen in this article are used only to highlight the challenges of working with text data such as that from Twitter; they are not aimed at directing hate or causing hurt to anyone.

Hope you liked this article! :)

Connect with me on https://www.linkedin.com/in/lubna-khan-59843569/

References:

[1] profanity-check, Available at: https://pypi.org/project/profanity-check/ (Accessed: 21 February 2021).

[2] better_profanity, Available at: https://pypi.org/project/better-profanity/ (Accessed: 21 February 2021).

[3] profanity, Available at: https://pypi.org/project/profanity/ (Accessed: 21 February 2021).

[4] profanity-filter: A Python library for detecting and filtering profanity, Available at: https://pypi.org/project/profanity-filter/ (Accessed: 21 February 2021).

[5] Victor Zhou (2019) Building a Better Profanity Detection Library with scikit-learn, Available at: https://towardsdatascience.com/building-a-better-profanity-detection-library-with-scikit-learn-3638b2f2c4c2 (Accessed: 21 February 2021).
