Working with unstructured text data: Part 1

Lubna Khan
3 min read · Feb 16, 2021

Retrieve tweets by search term using Twitter API in Python.

This article exposes some of the underlying challenges of working with unstructured data, particularly text from social media channels, which varies widely in language use, style and vocabulary. The references at the end of the article cover the Twitter API, its setup and authorisation, and the Tweepy module used to extract tweets. The article offers a snapshot of one working approach to “get the job done”; it is by no means the only right way of doing this. It assumes the reader has some working knowledge of pandas and Python programming.

Let’s begin by generating a dataset to work with: text data fetched in its rawest form from Twitter, which is freely available in the public domain and accessible via an API request. Think of a keyword or search term that you would like to use to fetch related tweets. I went with the term “Royal”, which was trending in the UK the day I decided to publish this post. Let’s get started!

  1. Set up a developer account on Twitter to obtain Twitter credentials. To create a developer account and request credentials, follow the link below:

https://developer.twitter.com/en/apply-for-access

Follow the steps as directed when you select “Apply for a developer account”. Obtaining the credentials may take a few days.

2. Follow the script “Retrieve tweets using a search term.ipynb” to collect tweets and the relevant fields into a dataframe using the Twitter API. More confident Python users can rework the code for use in an IDE such as VSCode. For the benefit of new learners and those preferring a notebook interface, the code is available for use here on Google Colab. We use the package Tweepy to gather the Twitter data.
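For readers who want a feel for what such a script involves, here is a minimal sketch. It is not the notebook's code. The function names, the environment variable names and the column set are assumptions made for illustration; the article itself only names the retweet_count and favorite_count fields. The search method is called `search` in Tweepy 3.x and `search_tweets` in Tweepy 4.x, so the sketch looks up whichever one is present, and access to the search endpoint depends on your developer account.

```python
import os
from types import SimpleNamespace


def status_to_row(status):
    """Flatten one Tweepy Status-like object into a plain dict (one dataframe row)."""
    # With tweet_mode="extended" the untruncated text is in `full_text`;
    # otherwise fall back to the (possibly truncated) `text` attribute.
    text = getattr(status, "full_text", None) or getattr(status, "text", "")
    return {
        "id": status.id,
        "created_at": status.created_at,
        "user": status.user.screen_name,
        "text": text,
        "retweet_count": status.retweet_count,
        "favorite_count": status.favorite_count,
    }


def fetch_tweets(search_term, max_tweets=1000):
    """Search recent tweets for `search_term` and return them as a dataframe."""
    # Imported here so status_to_row can be tried without these packages installed.
    import pandas as pd
    import tweepy

    # Hypothetical environment variable names holding the developer credentials.
    auth = tweepy.OAuthHandler(
        os.environ["TWITTER_API_KEY"], os.environ["TWITTER_API_SECRET"]
    )
    auth.set_access_token(
        os.environ["TWITTER_ACCESS_TOKEN"], os.environ["TWITTER_ACCESS_SECRET"]
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # `API.search` in Tweepy 3.x, `API.search_tweets` in Tweepy 4.x.
    search = getattr(api, "search_tweets", None) or api.search
    cursor = tweepy.Cursor(search, q=search_term, lang="en", tweet_mode="extended")
    return pd.DataFrame([status_to_row(s) for s in cursor.items(max_tweets)])


# Offline check with a stand-in object; no API call is made here.
fake_status = SimpleNamespace(
    id=1,
    created_at="2021-02-16",
    user=SimpleNamespace(screen_name="someone"),
    full_text="Royal news",
    retweet_count=3,
    favorite_count=7,
)
row = status_to_row(fake_status)
```

With credentials in place, a call such as `tweet_df = fetch_tweets("Royal")` would return one row per tweet.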

The script generates a dataframe ‘tweet_df’ with columns as shown:

Dataset description- schema

At the end of the exercise, we’d like to answer a few questions about the tweets that rank highest on two metrics: retweet_count and favorite_count.

Q1: What do the most liked/retweeted tweets have in common: mentions, keywords, hashtags, underlying themes?

Q2: What are the most distinguishing features of these two groups of tweets versus all other tweets in the dataset? Does anything stand out?

Q3: What sentiments or emotions do the most liked/retweeted tweets evoke?

Topline exploration (descriptive stats) of the 2 numeric fields for count of retweets and likes per tweet:
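As a rough sketch of how such a summary could be produced, pandas' `describe()` gives the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column. The small dataframe below is made-up stand-in data, not the article's dataset; only the two column names come from the article.

```python
import pandas as pd

# Made-up stand-in for tweet_df; the real one comes from the retrieval script.
tweet_df = pd.DataFrame(
    {
        "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "retweet_count": [0, 2, 10, 4],
        "favorite_count": [1, 0, 25, 6],
    }
)

# Descriptive statistics for the two engagement metrics only.
summary = tweet_df[["retweet_count", "favorite_count"]].describe()
print(summary)
```

Engagement counts on Twitter tend to be heavily skewed, so comparing the median (the 50% row) with the mean is usually more telling than the mean alone.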

The most liked and retweeted tweets, extracted in their raw form, don’t reveal much.

100 most liked tweets
100 most retweeted tweets
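One way to pull out these two groups, sketched here with made-up stand-in data, is `DataFrame.nlargest`, which returns the rows with the highest values in a given column. The helper name `top_tweets` is hypothetical, and `n=2` is used only because the toy dataframe is tiny; the article works with the top 100.

```python
import pandas as pd

# Made-up stand-in for tweet_df; the real one comes from the retrieval script.
tweet_df = pd.DataFrame(
    {
        "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "retweet_count": [0, 12, 10, 4],
        "favorite_count": [1, 0, 25, 6],
    }
)


def top_tweets(df, metric, n=100):
    """Return the n rows with the highest value of `metric`, highest first."""
    return df.nlargest(n, metric)


most_liked = top_tweets(tweet_df, "favorite_count", n=2)
most_retweeted = top_tweets(tweet_df, "retweet_count", n=2)

print(most_liked[["text", "favorite_count"]])
print(most_retweeted[["text", "retweet_count"]])
```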

As can be seen, tweets in their raw form are far from being useful for any information extraction.

3. Export this dataset and save it for further cleaning and pre-processing before we can start mining. Check out the next articles: Parts 2 and 3.
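A simple way to do the export, assuming a CSV file is good enough for the next step, is `DataFrame.to_csv`. The file name below is hypothetical and the dataframe is made-up stand-in data. Tweet text often contains commas and line breaks; pandas quotes such fields on writing, so they survive a round trip through `read_csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for tweet_df, including text with a comma and a line break.
tweet_df = pd.DataFrame(
    {
        "text": ["Royal news,\nline two", "plain tweet"],
        "retweet_count": [3, 0],
        "favorite_count": [7, 1],
    }
)

# Hypothetical output location; a temporary folder keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "tweet_df_raw.csv")

# index=False leaves out the row index, which carries no information here.
tweet_df.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm nothing was lost on the way.
reloaded = pd.read_csv(out_path, encoding="utf-8")
print(reloaded.shape)
```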

Time for text cleaning and pre-processing!

References:

  1. Tweepy Documentation. https://docs.tweepy.org/en/latest/
  2. Twitter. 2021. Twitter API — Tap into what’s happening. [ONLINE] Available at: https://developer.twitter.com/en/products/twitter-api. [Accessed 16 February 2021].
  3. Twitter. 2021. Api reference index. [ONLINE] Available at: https://developer.twitter.com/en/docs/api-reference-index. [Accessed 16 February 2021].
  4. Twitter. 2021. Getting started. [ONLINE] Available at: https://developer.twitter.com/en/docs/twitter-api/getting-started/guide. [Accessed 16 February 2021].

Connect with me on:

https://www.linkedin.com/in/lubna-khan-59843569


Lubna Khan

Data Scientist/ Analyst, Language Tutor, AI enthusiast, Polyglot, Artist and lifelong learner.