How to Create Pandas DataFrames with Tweets Scraped by Locations

2 Years Ago usmanmalik57 0 142 Views

Introduction

I was working on a problem where I had to scrape tweets related to the T20 Cricket World Cup 2022, which is currently taking place in Australia.

I wanted tweets containing location names (cities) and the keyword “T20”. In the response, I wanted the usernames of the tweet authors, the tweet texts, the creation times, and the location keyword used to search for each tweet. Finally, I wanted to create a Python Pandas DataFrame that contains these values in columns.

In this article, I will explain how you can scrape tweets containing location information and how to store these tweets in a Pandas DataFrame.

Developers can scrape tweets from Twitter using the Twitter REST API. In most cases, the Twitter API returns tweet-type objects that contain various attributes for extracting tweet information. However, by default, the Twitter API doesn't return a Pandas Dataframe.

Simple Example of Scraping Tweets

You must sign up for a Twitter Developer Account and create your API key and token to access the Twitter REST API. The official documentation explains how to sign up for a Twitter Developer Account.

I will use the Python Tweepy library for accessing the Twitter API. Tweepy is an unofficial Python client for accessing the Twitter API.

The following script demonstrates a basic example of Twitter scraping with the Python Tweepy library.

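I use the search_all_tweets() function to search for up to 100 English-language tweets containing the keywords Sydney and T20. Because the script wraps the keywords in double quotes, the query matches them as the phrase "Sydney T20". I also set a filter that removes retweets.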

import os

import pandas as pd
import tweepy

# Read the bearer token from an environment variable
bt = os.environ['twitter-bt']

client = tweepy.Client(bearer_token=bt,
                       wait_on_rate_limit=False)

location = "Sydney T20"
language = "lang:en "
no_tweet = "-is:retweet "

# Resulting query: "Sydney T20" lang:en -is:retweet
query = '"' + location + '" ' + language + no_tweet

tweets = client.search_all_tweets(query=query,
                                  max_results=100)

tweets.data

You can use the data attribute of the response object to access the returned tweets. Each tweet object in that list exposes its text through the text attribute.
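The query string in the script above is assembled by hand from three pieces. As a minimal sketch, the same string could be produced by a small helper; the function name build_query and its parameters are my own illustration and are not part of Tweepy or the Twitter API.

```python
def build_query(location, keyword="T20", language="en", exclude_retweets=True):
    """Compose a search query string from a location and a keyword.

    Hypothetical helper: it only concatenates the operators already
    used in this article (quoted phrase, lang:, -is:retweet).
    """
    parts = [f'"{location} {keyword}"', f"lang:{language}"]
    if exclude_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


print(build_query("Sydney"))
# "Sydney T20" lang:en -is:retweet
```

The returned string can be passed as the query argument of search_all_tweets() in place of the manual concatenation.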

Scraping Tweets into Pandas DataFrames

Now that you know how to scrape tweets with Tweepy, let’s scrape tweets with the following information and store them in a Pandas DataFrame:

Username of the tweet author.
Tweet text.
Creation time of the tweet.

Scraping Tweets with a Single Location

I will scrape tweets with the following filters:

Location keywords, e.g., Sydney T20
English language
No Retweets
Posted between October 1, 2022 and October 30, 2022

In this case, I passed values for the tweet_fields, user_fields, and expansions parameters of the search_all_tweets() function. These parameters request the tweet creation time and user information, e.g., the username.

location = "Sydney T20"
language = "lang:en "
no_tweet = "-is:retweet "

start_time = '2022-10-01T00:00:00Z'
end_time = '2022-10-30T00:00:00Z'


query = '"'+location+'" ' + language + no_tweet

tweets = client.search_all_tweets(query = query, 
                                  tweet_fields=['created_at'],
                                  user_fields=['username'],
                                  expansions=['author_id'],
                                  start_time=start_time,
                                  end_time=end_time, 
                                  max_results=100)

tweets.includes['users']

In addition to the data attribute, which contains tweet information, you can retrieve user information using the tweets.includes['users'] list.
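One detail worth keeping in mind: the users list contains each author only once, so it can be shorter than the list of tweets when one account posted several matching tweets. The sketch below uses stand-in objects built with SimpleNamespace (not real Tweepy objects, and with made-up values) to show how tweets can be matched to their authors through author_id instead of by position.

```python
from types import SimpleNamespace

# Stand-in tweets and users that mimic only the attributes used here
tweets_data = [
    SimpleNamespace(text="first", author_id=1),
    SimpleNamespace(text="second", author_id=2),
    SimpleNamespace(text="third", author_id=1),  # same author as the first tweet
]
users = [
    SimpleNamespace(id=1, username="alice"),
    SimpleNamespace(id=2, username="bob"),
]

# Pairing by position stops at the shorter list and can mismatch authors
paired_by_position = list(zip(tweets_data, users))

# Pairing by ID keeps every tweet and finds the right author
users_by_id = {user.id: user for user in users}
paired_by_id = [(users_by_id[t.author_id].username, t.text) for t in tweets_data]

print(len(paired_by_position))  # 2
print(paired_by_id)
```

With the lookup dictionary, all three tweets are kept and the third one is correctly attributed to the first author.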

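You can iterate through the tweet data, look up each tweet's author in the user data, and append the username, tweet text, and tweet creation time to Python lists. The user list contains each author only once, so the script matches users to tweets by author ID rather than by position.

Finally, you can create a Pandas DataFrame using these lists. Here is an example script:

user_name = []
text = []
date_time = []

# Map user IDs to user objects so each tweet finds its own author
users_by_id = {user.id: user for user in tweets.includes.get('users', [])}

# tweets.data is None when the search returns no tweets
for tweet in tweets.data or []:
    user = users_by_id.get(tweet.author_id)
    # username is the account handle; name would be the display name
    user_name.append(user.username if user is not None else None)
    text.append(tweet.text)
    date_time.append(tweet.created_at)

df = pd.DataFrame(list(zip(user_name, text, date_time)),
                  columns=['User', 'Text', 'Date Time'])

df.head(20)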

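Keeping three parallel lists in step is easy to get wrong, so an alternative is to collect one dictionary per tweet. The following is a rough sketch under my own assumptions: response_to_rows is a hypothetical helper, and the response objects are SimpleNamespace stand-ins with invented values that mimic only the data and includes attributes used in this article.

```python
from types import SimpleNamespace


def response_to_rows(response):
    """Turn a search response into a list of row dictionaries."""
    users = (response.includes or {}).get("users", [])
    users_by_id = {user.id: user for user in users}
    rows = []
    # response.data is treated as empty when the search found nothing
    for tweet in response.data or []:
        user = users_by_id.get(tweet.author_id)
        rows.append({
            "User": user.username if user is not None else None,
            "Text": tweet.text,
            "Date Time": tweet.created_at,
        })
    return rows


# Stand-in responses: one without results, one with a single tweet
empty = SimpleNamespace(data=None, includes={})
one = SimpleNamespace(
    data=[SimpleNamespace(text="Great match", author_id=7,
                          created_at="2022-10-23T08:00:00Z")],
    includes={"users": [SimpleNamespace(id=7, username="fan")]},
)

print(response_to_rows(empty))
print(response_to_rows(one))
```

A list of dictionaries like this can be passed directly to pd.DataFrame(), which uses the dictionary keys as column names.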

Scraping Tweets with Multiple Locations

Since I wanted to extract tweets by multiple location keywords, I defined a Python function that accepts a list of locations, along with the start and end date of tweets. The function iterates through all the locations and appends the extracted tweet information in Python lists.

The get_location_tweets() function in the following script returns a Pandas dataframe containing the tweet text, the tweet author's username, the tweet's creation time, and the location keyword that parameterizes the tweet search.

import time

def get_location_tweets(loc_list, st, et):

    # Create the lists once so results from all locations accumulate
    user_name = []
    text = []
    date_time = []
    loc_keyword = []

    for loc in loc_list:

        location = loc + " T20"
        language = "lang:en "
        no_tweet = "-is:retweet "

        query = '"' + location + '" ' + language + no_tweet

        tweets = client.search_all_tweets(query=query,
                                          tweet_fields=['created_at'],
                                          user_fields=['username'],
                                          expansions=['author_id'],
                                          start_time=st,
                                          end_time=et,
                                          max_results=100)

        # Map user IDs to user objects; each author is listed only once
        users_by_id = {user.id: user for user in tweets.includes.get('users', [])}

        # tweets.data is None when a search returns no tweets
        for tweet in tweets.data or []:
            user = users_by_id.get(tweet.author_id)
            user_name.append(user.username if user is not None else None)
            text.append(tweet.text)
            date_time.append(tweet.created_at)
            loc_keyword.append(loc)

        # Pause briefly between consecutive search requests
        time.sleep(3)

    df = pd.DataFrame(list(zip(user_name, text, date_time, loc_keyword)),
                      columns=['User', 'Text', 'Date Time', 'Loc Keyword'])

    return df

The following script calls the get_location_tweets() function to search for tweets containing Sydney, Perth, or Melbourne together with the keyword T20.

loc_list = ["Sydney", "Perth", "Melbourne"]

start_time = '2022-10-01T00:00:00Z'
end_time = '2022-10-30T00:00:00Z'

locations_df = get_location_tweets(loc_list, start_time, end_time)

print(locations_df.shape)
locations_df.sample(frac=1).head(10)

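The start_time and end_time values used in this article are written by hand as UTC timestamp strings. If you prefer to derive them from Python datetime objects, a small formatter along these lines may help; to_api_timestamp is a hypothetical name, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone


def to_api_timestamp(dt):
    """Format a datetime as a UTC string such as 2022-10-01T00:00:00Z."""
    if dt.tzinfo is None:
        # Assumption: a datetime without a time zone is already in UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


start_time = to_api_timestamp(datetime(2022, 10, 1))
end_time = to_api_timestamp(datetime(2022, 10, 30))

print(start_time, end_time)
# 2022-10-01T00:00:00Z 2022-10-30T00:00:00Z
```

The two strings match the format of the start_time and end_time values passed to get_location_tweets().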

Extracting tweets using the Twitter API is straightforward. However, storing the desired information from tweets in a data structure, e.g., a Pandas DataFrame, is trickier. In this tutorial, I explained how you can scrape tweets using the Twitter API and store them in a Pandas DataFrame in your desired format.
