Working with unstructured text data using Python — Part 2

Lubna Khan
3 min read · Feb 17, 2021

Pre-process raw tweets for information extraction.

Tweets in their raw, unstructured form are far from ready for any text mining or NLP project. Here’s a walkthrough of a few pre-processing steps I’d normally take when working with tweet data. By the end of this article, we’ll have a fairly usable dataset containing clean tweet texts, mentions, hashtags, URLs and email addresses.

To begin, we reuse the dataset created in Part 1 and narrow our focus to the field containing the raw tweets, df['tweet_raw']. A snapshot of the dataset is shown below:

A close look reveals HTML tags and ASCII, Latin-1 and hex character codes, in addition to the mentions, hashtags, URLs and email addresses normally expected in a tweet. So let’s create an additional field in the dataset to start the text cleaning process. The image below marks the parts of each tweet that need to be excluded from the actual tweet text.

1. Remove Latin-1, ASCII & Hex characters
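
A minimal sketch of this step, under one assumption that is mine rather than the article's: the raw field holds the text form of a bytes object, so encoded characters show up as literal escape sequences such as \xe2\x80\x99 and \n. The function name strip_encoded_chars and the column name tweet_clean are hypothetical.

```python
import re

# Literal escape sequences left over in the text, e.g. "\xe2" or "\n"
# (a backslash followed by characters, not the decoded character itself).
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")
WHITESPACE_ESCAPE = re.compile(r"\\[nrt]")
# Any real character outside the 7-bit ASCII range (Latin-1 and beyond).
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_encoded_chars(text: str) -> str:
    """Replace escape sequences and non-ASCII characters with spaces."""
    text = HEX_ESCAPE.sub(" ", text)
    text = WHITESPACE_ESCAPE.sub(" ", text)
    text = NON_ASCII.sub(" ", text)
    # Collapse the runs of spaces created by the substitutions above.
    return re.sub(r"\s+", " ", text).strip()


sample = r"Great launch today \xf0\x9f\x9a\x80\n#space"
print(strip_encoded_chars(sample))  # Great launch today #space

# On the dataframe this could be applied as (hypothetical column name):
# df['tweet_clean'] = df['tweet_raw'].apply(strip_encoded_chars)
```

Note that this approach is lossy: accented letters and emojis are dropped rather than decoded, which is acceptable only if they are not needed later.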

2. Remove HTML tags
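
One way this step might look, using only the standard library. Decoding HTML entities such as &amp;amp; is an addition of mine, on the assumption that they appear alongside the tags; the function name strip_html is hypothetical.

```python
import html
import re

# Anything between angle brackets is treated as a tag. This is a rough
# pattern that suits short tweet text, not a full HTML parser.
TAG = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Drop HTML tags, then decode entities such as &amp; into characters."""
    without_tags = TAG.sub(" ", text)
    unescaped = html.unescape(without_tags)
    # Collapse the extra spaces left where tags used to be.
    return re.sub(r"\s+", " ", unescaped).strip()


sample = 'Tom &amp; Jerry <a href="https://example.com">link</a> rocks'
print(strip_html(sample))  # Tom & Jerry link rocks

# On the dataframe this could be applied as (hypothetical column name):
# df['tweet_clean'] = df['tweet_clean'].apply(strip_html)
```

Tags are removed before entities are decoded so that an escaped "&amp;lt;" in the tweet is not mistaken for the start of a tag.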

3. Extract mentions, URLs, hashtags and email addresses

It is a good idea to extract all the key components we have in mind (mentions, hashtags, email addresses, URLs and so on) before removing punctuation and normalising the text for further text mining, because each of these components is identified by a pattern that involves punctuation.
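
The extraction can be sketched with regular expressions, as below. The patterns are deliberately simple approximations rather than complete URL or email grammars, and the function name extract_components and the dictionary keys are hypothetical. URLs and email addresses are pulled out first so that the "@" in an email or the "#" in a URL fragment is not picked up as a mention or hashtag.

```python
import re

URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# The lookbehind keeps "@" or "#" in the middle of a word from matching.
MENTION = re.compile(r"(?<!\w)@(\w+)")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def extract_components(text: str) -> dict:
    """Split a tweet into URLs, emails, mentions, hashtags and leftover text."""
    urls = URL.findall(text)
    remainder = URL.sub(" ", text)
    emails = EMAIL.findall(remainder)
    remainder = EMAIL.sub(" ", remainder)
    mentions = MENTION.findall(remainder)
    hashtags = HASHTAG.findall(remainder)
    remainder = HASHTAG.sub(" ", MENTION.sub(" ", remainder))
    return {
        "urls": urls,
        "emails": emails,
        "mentions": mentions,
        "hashtags": hashtags,
        "text": re.sub(r"\s+", " ", remainder).strip(),
    }


sample = ("Thanks @nasa for the #space update "
          "https://example.com/live#now contact press@example.org")
parts = extract_components(sample)
print(parts["mentions"], parts["hashtags"])  # ['nasa'] ['space']
print(parts["urls"], parts["emails"])
print(parts["text"])  # Thanks for the update contact
```

On the dataframe, each key could become its own column, for example by applying the function to the cleaned text field and expanding the resulting dictionaries.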

The tweets data now appears usable for further exploration and feature extraction. Let’s save this dataframe for further pre-processing and more exciting work!

The source code to run this exercise on Google Colab is available here. More confident Python users may want to rework the code from GitHub in an IDE such as VS Code.

An alternate approach to handling ASCII codes, Hex & Latin-1 characters is discussed in my article Working with unstructured text data containing emojis using Python — Part 3. Feel free to check it out!

Connect with me on:

https://www.linkedin.com/in/lubna-khan-59843569
