Working with raw, unstructured text is challenging. Conversational text arrives with stray characters, punctuation, misspellings, abbreviations, and emojis, to name a few. This article focuses on profanity in text data: ways to handle it, and the impact that profanity or its censorship is likely to have on sentiment analysis.

The dataset used here has two fields containing tweet text: df['tweet_raw'] and df['tweet_clean_text']. The former is the data as extracted from Twitter, with no preprocessing or cleaning; the latter has been preprocessed to remove mentions, hashtags, emails, phone…
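To make the difference between the two fields concrete, here is a minimal sketch of how a cleaned field could be derived from a raw one. The article does not show its cleaning rules, so the function name `clean_tweet` and the regular expressions are assumptions for illustration only, and the sketch covers just mentions, hashtags, and emails.

```python
import re

# Hypothetical patterns: the article's real cleaning rules are not shown.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip emails, @mentions, and #hashtags, then tidy whitespace."""
    # Emails go first so the "@domain" part is not mistaken for a mention.
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "@bob check this #wow mail me at a.b@example.com now"
    print(clean_tweet(raw))
```

Assuming df is a pandas DataFrame, a function like this could be applied to the raw column with `df['tweet_raw'].apply(clean_tweet)`. Whether the original dataset was cleaned this way is not stated.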

Lubna Khan

Data Analyst, Language Tutor, AI enthusiast, Polyglot and Artist.
