How to Scrape Twitter

We’ll provide a step-by-step example of how to scrape a Twitter feed using Python.

If you’re planning to scrape social media, Twitter is one of the best platforms to target. With 230 million monthly users, it has a lot of valuable information. There are many ways you can use it – from performing sentiment analysis to discovering market trends and improving your marketing strategy.

Though Twitter is pretty generous with giving access to its data, the official API requires a screening process and imposes quite a few limitations. To overcome these issues, you’ll have to look for alternatives, web scraping being the best one.

This guide will teach you all about Twitter scraping, introduce you to API alternatives, and cover the challenges you might encounter along the way. Moreover, you’ll find a step-by-step example of how to scrape publicly available Twitter data using SNScrape.

What Is Twitter Scraping?

Twitter scraping is a way of gathering public data from the social media platform automatically. It’s usually done using pre-made scraping tools or custom-built web scrapers. Twitter is one of the few platforms to offer an official API, but it can be a pain in the arse to use, since it limits the number of tweets you can get (3,200) and their recency (the last 7 days).

Social media marketers use Twitter’s popularity to their advantage. They collect information like (re)tweets, shares, URLs, likes, threads, and followers, to name a few. Scraping Twitter can yield many insights into influencer marketing, brand and reputation monitoring, sentiment analysis, or market trends.

Is Scraping Twitter Legal?

Even though there’s no regulation prohibiting scraping as such, you have to be careful when scraping social media platforms because things can get hairy.

We’re not lawyers, but the US Ninth Circuit Court of Appeals ruled that you can scrape social media data if 1) it’s publicly available (doesn’t hide behind a login), and 2) the content isn’t subject to intellectual property rights. There may also be additional requirements if you’ll be working with personal information.

Twitter has more leeway than other platforms; its terms of service (ToS) don’t forbid web scraping outright, but they do require prior consent. Even though these terms aren’t legally binding when scraping without an account, you still might get your IP address banned from the platform.

Since the legality of web scraping isn’t always straightforward, it’s wise to seek legal advice. Each use case is considered separately: you’ll have more freedom when collecting data for research purposes than for commercial use.

So, What Data Can You Scrape Without Logging In?

Let’s break down the Twitter data points you can scrape into three categories:

  • Tweets: text and visual media, tweet URLs, tweet IDs, retweets; you can also filter tweets by location or likes.
  • Profiles: name, image, follower and tweet count, user bio, latest post data like content, time, retweets, replies, etc.
  • Hashtags and keywords: tweet URL, time created, mentions, location, username, reply and quote count, conversation ID, retweets, media data like links, type, and others.

However, Twitter has been increasingly requiring a login to view its content. Over the past year, Reddit communities have reported many issues concerning Twitter’s decision to move more of its content behind a login form. You may experience this in various areas of the website, such as when scrolling down a Twitter thread.

For now, there’s an easy workaround: clearing your browser cookies. But what does gated content mean for the web scraping community? Predictably, there’ll be fewer publicly available data points to gather in the future.

How to Scrape Twitter without an API

There are multiple ways to collect publicly available data from Twitter aside from using the official API.

One is to build a scraper yourself using web scraping libraries. It’s probably the hardest method, but it also gives you the most control, especially if you’re experiencing limitations with the other approaches. While scraping Twitter is easier than gathering data on Instagram, you’ll still need to use a headless browser to render JavaScript, together with Twitter proxies to avoid getting your IP blocked.
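To give you an idea of that approach, here’s a minimal sketch using Playwright as the headless browser. The proxy endpoint and search URL are placeholders, and you’d still need to parse the rendered HTML with a tool of your choice:

from playwright.sync_api import sync_playwright

# Hypothetical proxy endpoint - replace with your provider's details
PROXY = {"server": "http://proxy.example.com:8080"}

with sync_playwright() as p:
    # A headless browser renders Twitter's JavaScript-heavy pages
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    page = browser.new_page()
    page.goto("https://twitter.com/search?q=residential%20proxies")
    page.wait_for_timeout(5000)  # crude wait for tweets to render
    html = page.content()  # rendered HTML, ready for parsing
    browser.close()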

If you don’t like to meddle with code, you can use commercial no-code scrapers like PhantomBuster and ParseHub. These visual tools provide convenient templates or a point-and-click interface. While easy for simple tasks, they become increasingly complex and inefficient once you scale.

Finally, you can use one of the pre-made Twitter web scraping libraries. These tools let you extract data by entering commands into a code editor or the command line. They require no API authentication or proxies to work. SNScrape is one example; Twint used to be another, but it’s no longer maintained.

How to Scrape Twitter Data Using Python: A Step-By-Step Guide

For this example, we’ll scrape Twitter using the SNScrape Python library. It’s a popular choice for Twitter scraping because:

  • You can start retrieving useful data in minutes,
  • It’s not bound by the API’s tweet limits, and
  • It returns historical data that’s older than seven days.

SNScrape has Twitter modules for different aspects of the platform like search, hashtags, and user profiles. All in all, you can use it to extract the following data points: users, user profiles, hashtags, searches, tweets (single or surrounding thread), list posts, and trends.

Preliminaries

For the code below to work, you’ll need to install the SNScrape library. You can do this by entering the following line into your operating system’s terminal:

pip3 install snscrape

Before you begin, create a new Python file. We named it scraping1.py, but you can choose any name. Open the file to start coding.

How to Scrape Twitter Search

Let’s begin with Twitter search. We’ll scrape the first 100 tweets for the query residential proxies.

Step 1. Import the necessary modules. We’ll need the Twitter module from SNScrape and JSON:

from snscrape.modules import twitter
import json

Step 2. Let’s start with a Twitter search to get the bigger picture from the data. You’ll have to define the scrape_search function.

1) First, we’ll focus on the variables. Begin by typing in the queries variable and assigning keywords as a string value.

queries = ['residential proxies']

2) Then, define the number of tweets you want to scrape. If you leave this out, the session won’t end until you’ve scraped all the relevant tweets. Let’s set the max_results variable to 100, so you won’t drown in data.

max_results = 100

3) Following that, implement the scraper itself. This will create a new scraper object instance for each query.

def scrape_search(query):
    scraper = twitter.TwitterSearchScraper(query)
    return scraper

Step 3. Next, let’s create a dedicated output file for each query and iterate over the scraped tweets in the following lines. This way, you’ll get the results in a text file.

for query in queries:
    output_filename = query.replace(" ", "_") + ".txt"
    with open(output_filename, 'w') as f:
        scraper = scrape_search(query)
        i = 0
        for i, tweet in enumerate(scraper.get_items(), start=1):

Step 4. Now, convert each scraped tweet into a JSON object for data parsing. You can access the tweet’s content only after the conversion.

            tweet_json = json.loads(tweet.json())

Print out the content of a tweet, such as its text, the creator’s username, date, and so on.

            print(f"\nScraped tweet: {tweet_json['content']}")

The following lines are for writing data to the output file.

            f.write(tweet.json())
            f.write('\n')
            f.flush()

Step 5. And last, let’s add the code that terminates the loop when max_results is reached.

            if max_results and i > max_results:
                break

Congrats, you’ve just defined the function, so it’s time to move on to scraping.

Save your code, open the command prompt, and change to the directory where the code is saved. Then, run the script. When the results are in, you should see parameters such as usernames, dates, URLs, and the tweet itself under the content key.
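For example (the path is just a placeholder for wherever you saved the file):

cd /path/to/your/project
python3 scraping1.py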

How to Scrape Twitter Hashtags

Following the example above, you can easily scrape other data, too: for example, hashtags. Simply switch TwitterSearchScraper to TwitterHashtagScraper. And instead of a query variable, write a tag of your choice. Just be careful not to include the ‘#’ symbol in the value because the code won’t work.

Here’s how to define the scrape_hashtag function. You’ll also need to change the scraper in the loop – from scraper = scrape_search(query) to scraper = scrape_hashtag(query); you’ll find a sketch of the adapted loop after the code below. The same applies to the later examples.

hashtag = ['scraping']
max_results = 50

def scrape_hashtag(hashtag):
    scraper = twitter.TwitterHashtagScraper(hashtag)
    return scraper
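For instance, here’s what the output loop from the search example might look like with the hashtag scraper – a minimal sketch reusing the code above:

for tag in hashtag:
    output_filename = tag + ".txt"
    with open(output_filename, 'w') as f:
        scraper = scrape_hashtag(tag)
        for i, tweet in enumerate(scraper.get_items(), start=1):
            f.write(tweet.json())
            f.write('\n')
            if max_results and i > max_results:
                break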

How to Scrape Twitter Users

You can also scrape by username or user ID. However, things get tricky here, so take a look at the user ID example. When defining your input variables, don’t forget the Boolean flag that tells the scraper whether you’re passing a user ID. Also, make sure to list usernames and user IDs as strings.

UserId = ['1097450610864123904', True]
max_results = 50

def scrape_user(user, isUserId):
    scraper = twitter.TwitterUserScraper(user, isUserId)
    return scraper
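To put the function to use, you could unpack the variable defined above and loop over the results – a quick sketch based on the search example:

user, is_user_id = UserId  # '1097450610864123904', True
scraper = scrape_user(user, is_user_id)
for i, tweet in enumerate(scraper.get_items(), start=1):
    print(json.loads(tweet.json())['content'])
    if max_results and i > max_results:
        break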

How to Scrape a Single Tweet on Twitter

If you’re searching for a specific tweet, it’s also possible to scrape by tweet ID. This might be useful if a post holds important information like a product review. The approach here is similar, but this time the input value is an integer rather than a string. And naturally, we have to use another scraper.

TweetId = 1516359017374887940

def scrape_tweet(tweetId):
    scraper = twitter.TwitterTweetScraper(tweetId)
    return scraper
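Since the scraper still returns a generator, you can fetch the tweet’s content the same way as before – a minimal sketch based on the functions above:

scraper = scrape_tweet(TweetId)
for tweet in scraper.get_items():
    print(json.loads(tweet.json())['content'])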

Here’s the full script:

from snscrape.modules import twitter
import json

# List of queries to scrape
queries = ['residential proxies']

# Max results to get for each query
max_results = 100

# Scrapes Twitter search
# Input: Twitter search (string)
# Example query: "covid"
def scrape_search(query):
    scraper = twitter.TwitterSearchScraper(query)
    return scraper

# -- Other scrapers that can be implemented

# Scrapes a single tweet
# Input: Tweet ID (integer)
# Example tweet_id: 1516359017374887940
def scrape_tweet(tweet_id):
    scraper = twitter.TwitterTweetScraper(tweet_id)
    return scraper

# Scrapes a user's tweets
# Input:
# If scraping by username - username (string), False (Boolean)
# Example username: "Proxyway1", False
# If scraping by user ID - user_id (string), True (Boolean)
# Example user ID: "1097450610864123904", True
def scrape_user(user, isUserId):
    scraper = twitter.TwitterUserScraper(user, isUserId)
    return scraper

# Scrapes a hashtag
# Input: Twitter hashtag (string) (without '#')
# Example hashtag: "scraping"
def scrape_hashtag(hashtag):
    scraper = twitter.TwitterHashtagScraper(hashtag)
    return scraper

for query in queries:
    # Creating an output file for each query
    output_filename = query.replace(" ", "_") + ".txt"
    with open(output_filename, 'w') as f:
        scraper = scrape_search(query)
        i = 0
        for i, tweet in enumerate(scraper.get_items(), start=1):
            # Converting the scraped tweet into a JSON object
            tweet_json = json.loads(tweet.json())
            # Printing out the content of a tweet
            print(f"\nScraped tweet: {tweet_json['content']}")
            # Writing to file
            f.write(tweet.json())
            f.write('\n')
            f.flush()
            # Terminate the loop if we reach max_results
            if max_results and i > max_results:
                break

Isabel Rivera
Caffeine-powered sneaker enthusiast