Working with unstructured text data containing emojis using Python — Part 3
Pre-process raw tweets for information extraction.
This article demonstrates why clean-text can be particularly useful for handling emojis, along with the ASCII, Unicode and HTML codes that are often overlooked or tedious to deal with in text preprocessing.
The steps detailed in the Colab notebook for the previous article, Working with unstructured text data using Python — Part 2, demonstrate the use of a user-defined function for removing ASCII codes, Latin-1 and hex characters. This article achieves the same objective by handling emojis with an alternative approach.
The dataset created in Part 1 is reused here, using the field containing the tweets, df['tweet_raw']. A snapshot of the dataset is shown below:
One might wonder what happened to all the emojis, symbols and other stray characters that are so common in tweets. They exist in this dataset too, but are represented as ASCII/Unicode/HTML codes that we are unlikely to interpret without processing them.
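As an illustrative stand-in (these are not the actual tweets from Part 1), a dataframe in this state might look like the following, with emojis surviving only as escaped byte sequences and HTML entities:

```python
import pandas as pd

# Hypothetical examples: the raw field carries emojis and symbols
# as escaped byte codes and HTML entities rather than rendered glyphs.
df = pd.DataFrame({
    "tweet_raw": [
        "Loving the update \\xf0\\x9f\\x98\\x8d #tech",
        "Great read! &lt;3 https://example.com",
    ]
})
print(df["tweet_raw"].head())
```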
For this purpose, we use the package clean-text 0.3.0, available from https://pypi.org/project/clean-text/.
The package can be installed via pip using the command:
pip install clean-text
On colab, it would work as:
!pip install clean-text
The following snippet shows the working of this package:
The resulting text is converted to lowercase by default. A bit of fine-tuning of the parameters can handle ASCII and Unicode codes, and strip numerical content, punctuation, emails, phone numbers and pretty much everything that is non-alphabetical. Saves a ton of time and energy!
Check out the following few examples:
Keyboard emoticons (example: <3) are not handled by the pre-processing function when clean-text runs with default settings, but they can be removed by setting the parameters that strip numerical components and punctuation from the text.
The steps detailed in the Colab notebook for the previous article Working with unstructured text data using Python — Part 2 can still be used for the extraction of mentions, hashtags, emails and URLs, by replacing the regular-expression-based user-defined function below:
with the function using the module cleantext:
The resultant field df['tweet_text'] is shown in the snapshot below; it is now readable and interpretable, and is as close as it can be to the original tweet.
The challenge that remains is the presence of profanity, which may need to be excluded depending on the use case and on who gets to access the data. Check out my article Profanity: To be or not to be for a glimpse of profanity handling with Python!
Hope you liked this article :)
Connect with me on: