Dealing with raw unstructured text data is a challenge. Conversational text arrives with stray characters, punctuation, inconsistent spellings, abbreviations, emojis and more. The presence of profanity in text data is the main focus of this article: ways to handle profanity in text, and the impact that profanity, or its censorship, is likely to have on sentiment analysis.
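One common way to handle profanity is to mask offending words before analysis. A minimal sketch using only the standard library is shown below; the word list here is a placeholder stand-in (a real project would use a curated lexicon or a dedicated package such as better-profanity):

```python
import re

# Placeholder word list for illustration only; swap in a real profanity lexicon.
PROFANE_WORDS = ["darn", "heck"]

def censor(text, words=PROFANE_WORDS):
    """Mask each listed word with asterisks, preserving its length."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, words)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "*" * len(m.group()), text)

print(censor("What the heck is this darn thing?"))
# What the **** is this **** thing?
```

Note that masking alters token counts and may shift sentiment scores, which is exactly the effect this article sets out to observe.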
The dataset used here has two fields containing tweet text: df['tweet_raw'] and df['tweet_clean_text']. The former holds the data extracted from Twitter with no preprocessing or cleaning; the latter has been pre-processed, stripped of mentions, hashtags, emails, phone…
This article aims to demonstrate why cleantext can be particularly useful for addressing emojis and for handling ASCII/Unicode and HTML codes, which are often overlooked or tedious to deal with in text preprocessing.
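The cleantext library bundles this kind of normalization behind a single call. A standard-library-only sketch of the same idea, decoding HTML entities and folding non-ASCII characters, looks like this (the function name and sample strings are illustrative):

```python
import html
import unicodedata

def normalize_text(text):
    # Decode HTML entities such as &amp; and &eacute;
    text = html.unescape(text)
    # Decompose accented characters (e.g. é -> e + combining accent), then
    # drop everything outside ASCII, which also removes emojis and Latin-1 symbols.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii").strip()

print(normalize_text("caf&eacute; &amp; crème 😀"))
# cafe & creme
```

The `ignore` error handler is deliberately lossy; it is a quick baseline, whereas cleantext exposes finer-grained options for what to keep or replace.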
The steps detailed in the Colab notebook for the previous article, Working with unstructured text data using Python — Part 2, demonstrate the use of a user-defined function for removing ASCII codes, Latin-1 and Hex characters. This article aims to achieve the same objective by handling emojis with an alternate approach.
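One such alternate approach is to match emoji code-point ranges directly with a regex instead of filtering byte encodings. A sketch is below; the range list covers the most common emoji blocks only, and a real project might prefer the third-party `emoji` package:

```python
import re

# Common emoji blocks; not exhaustive, shown for illustration.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, extended symbols
    "\u2600-\u27BF"          # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F"                 # variation selector often attached to emojis
    "]+"
)

def remove_emojis(text):
    """Strip emoji characters, leaving the rest of the text intact."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Great day ☀️ at the beach 🏖️!"))
```

Unlike the encode/decode trick, this keeps legitimate non-ASCII text (accented names, other scripts) untouched and removes only the matched emoji ranges.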
The dataset created in Part 1 is reused here, focusing on the field containing tweets: df['tweet_raw']…
Tweets in their raw unstructured form are far from ready for any text mining or NLP project. Here’s a walkthrough of a few pre-processing steps I’d normally take while working with tweet data. By the end of this article, we’ll have a fairly usable dataset containing clean tweet text, mentions, hashtags, URLs and email addresses.
To begin with, we reuse the dataset created in Part 1 and narrow our focus to the field containing raw tweets: df['tweet_raw']. A snapshot of the dataset is shown below:
Close examination reveals the presence of HTML tags and stray ASCII, Latin-1 and Hex…
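Before stripping those artifacts, it helps to pull the useful entities out of each raw tweet. A sketch of the extraction step using simple regexes is below; the patterns and sample tweet are illustrative, and real tweet parsing has more edge cases:

```python
import re

# Illustrative patterns; production code would handle more edge cases.
PATTERNS = {
    "mentions": r"(?<!\w)@\w+",      # lookbehind avoids matching inside emails
    "hashtags": r"#\w+",
    "urls": r"https?://\S+",
    "emails": r"[\w.+-]+@[\w-]+\.[\w.]+",
}

def extract_entities(tweet):
    """Collect mentions, hashtags, URLs and email addresses from a raw tweet."""
    return {name: re.findall(pat, tweet) for name, pat in PATTERNS.items()}

tweet = "@alice check https://example.com #nlp mail me at bob@example.com"
print(extract_entities(tweet))
```

Applying this per row (e.g. with `df['tweet_raw'].apply(extract_entities)`) yields the mention/hashtag/URL/email columns described above, after which the matched spans can be removed from the text itself.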
The main focus of this article is to expose some of the underlying challenges of working with unstructured data, particularly text from social media channels, with its wide range of language use, styles and vocabulary. The references provided at the end of the article cover the Twitter API, its setup and authorisation, and the use of the Tweepy module for tweet extraction. The article aims to deliver a snapshot of a working approach to “get the job done” and is by no means the only right way of doing this. …