How to Scrape Instagram
This is a step-by-step guide on how to scrape Instagram using Python.
Social media scraping provides a great way to collect valuable data, whether for research or commercial purposes. And Instagram is probably the most lucrative platform today. However, it's also tricky to scrape due to both technical and legal challenges.
In this guide, you’ll learn what Instagram data you can scrape without getting in trouble and which tools you should choose to avoid an IP address ban. Moreover, you’ll find two step-by-step guides for building a basic Instagram scraper with Python – one using Requests, and the other Selenium. If you prefer video, we have one as well:
What Is Instagram Scraping? The Definition
Instagram scraping is the process of automatically collecting publicly available data from the social media platform. Depending on your programming knowledge, it’s done using pre-made scraping tools or custom-built web scrapers.
Savvy social media marketers know that data gathering can bring a lot to the table. Just by collecting information like hashtags or posts, you can perform market and sentiment analysis, monitor your brand online, or find influencers for your business.
How to Scrape Instagram Legally
So, What Data Can You Scrape Without Logging In?
There are three main categories of publicly available data:
- Hashtags: post URL, media URL, post author ID.
- Profiles: latest posts, external URLs, images, comments, the number of likes per post, and followers.
- Posts: latest posts, date, URL, comments, likes, author ID.
But keep in mind that Instagram often changes the rules, so it’s always a good idea to check what you can scrape before actually doing so.
Choosing Your Instagram Scraping Tools
There are generally three types of tools you can use to scrape Instagram: 1) a custom-built web scraper, 2) a web scraping API, or 3) a ready-made web scraper.
If you have programming knowledge, you can try to build your own web scraper using browser automation frameworks like Selenium or Playwright. They can handle complex automation, and since you're the one looking after your scraping bot, you can adapt it to all the structural changes Instagram throws your way.
Instagram no longer offers its old public API (not that it was of much use). But there are many reliable providers out there who offer web scraping APIs. For example, Apify provides different APIs to collect various Instagram data points, such as Instagram Profile Scraper and Post Scraper. Or, you can use general-purpose web scrapers based on large proxy pools, like Smartproxy's Web Scraping API or Zyte's Smart Proxy Manager.
If you don't have any programming skills, you can buy ready-made scrapers like ParseHub, Octoparse, or Bright Data's Data Collector. These tools let you extract data by visually clicking on elements or by using convenient templates.
How to Scrape Instagram Data: A Step-By-Step Guide Using Python
So let’s say you want to try scraping Instagram yourself. How do you go about it?
We'll build two simple web scrapers. One uses Requests, a popular Python library for making HTTP requests. The second uses Selenium to launch a headless Chrome instance. Here's how they differ.
Selenium simulates a real browser: it opens the webpage the way a person would, so scraping with Selenium gives you a higher rate of successful requests. The Requests library, on the other hand, simply sends an HTTP request to Instagram's servers. This method has a lower success rate, but you can scrape Instagram much faster.
Other Necessary Tools to Start Scraping Instagram
If you want to start scraping Instagram safely, you must also consider hiding your IP address since the platform limits the amount of information one can access without logging in.
The best way to do so is by using a rotating proxy server. Depending on your proxy provider, it will give you a different IP address every five or ten minutes, or with each connection request. If you don't know where to get proxies, take a look at our list of great Instagram proxy providers.
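To give you an idea of how this works in practice, here's a minimal sketch in Python. It assumes your provider exposes a single rotating gateway endpoint (the address and credentials below are placeholders, not a real service); every request goes through the same endpoint, but the exit IP behind it can change each time:
import requests

# Hypothetical rotating-proxy gateway; replace it with your provider's endpoint.
proxy = "http://username:password@gate.example.com:7000"
proxies = {"http": proxy, "https": proxy}

# Each request uses the same gateway, but the exit IP behind it rotates.
for _ in range(3):
    response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    print(response.text)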
Managing Expectations
Scraping social media platforms is a hard and complicated process. You’ll have to arm yourself with patience to increase the chance of successful Instagram data gathering.
Some of your requests will inevitably fail – 20%, 30%, or even more until you get the hang of it. A failed request will throw a proxy error that will stop the scraper. You can add retry functionality to try again. However, to do so, you’ll need to change your IP address with every failed request.
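As a rough illustration (not part of the scrapers built below), a retry helper that catches the proxy error and switches to a new proxy on every attempt could look something like this. The proxy addresses are placeholders, and the function simply gives up after a few tries:
import random
import requests

# Placeholder proxy list; in practice, these come from your proxy provider.
PROXIES = [
    "http://username:password@proxy1.example.com:8000",
    "http://username:password@proxy2.example.com:8000",
]

def fetch_with_retries(url, max_retries=3):
    # Try the request a few times, picking a fresh proxy (a new IP) on every attempt,
    # and catch the proxy error so a single failure doesn't stop the whole scraper.
    for attempt in range(max_retries):
        proxy = random.choice(PROXIES)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if response.status_code == 200:
                return response
            print(f"Attempt {attempt + 1} failed with status {response.status_code}")
        except requests.RequestException as error:
            print(f"Attempt {attempt + 1} failed: {error}")
    return None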
Nowadays, most online media, including Instagram, ask users to provide personal information to access a website or specific content. They place data behind lead capture forms that ask for a phone number, an email address, or answers to a few questions. There will likely be fewer data points to scrape soon, as Instagram gates ever more content.
How to Scrape Instagram’s Public Profiles with Selenium
This is a real-life example using proxies. We'll be scraping Instagram profiles with the ?__a=1&__d=dis query parameters, which return a page's content as JSON instead of HTML. In this guide, the parameters will return the content of a profile page.
Step 1. Start by installing Selenium, Chromedriver, and Selenium-Stealth, for example with pip install selenium selenium-stealth (recent Selenium versions can download Chromedriver for you automatically).
1) First, import webdriver from the Selenium module.
from selenium import webdriver
2) Then import web driver using By selector module from Selenium to simplify selection.
from selenium.webdriver.common.by import By
3) Import pprint so you can neatly format the console output when printing the results.
from pprint import pprint
4) Since we'll be using the JSON module, you'll need to import it, too.
import json
5) Then import Selenium-Stealth for a more real-looking browser.
from selenium_stealth import stealth
Step 2. Set up the usernames of the Instagram profiles you wish to scrape.
usernames = ["jlo", "shakira", "beyonce", "katyperry"]
Create a variable for your proxies. They’ll help you reach a higher success rate.
proxy = "server:port"
You can create a new dictionary variable to store the scraped results.
output = {}
Step 3. Then write the entry point of the script: call the main() function and add another line to print out the results once the scrape is finished. (In the complete script, this block goes at the very end of the file.)
if __name__ == '__main__':
    main()
    pprint(output)
Write the code to iterate over the usernames you're going to scrape. The main() function will loop through the list of Instagram usernames and pass each one to a scrape() function that we'll write later.
def main():
    for username in usernames:
        scrape(username)
Step 4. Next, you'll need to define a prepare_browser() function by following these steps:
1) Create a new function. It will allow you to make changes to the browser settings before each scrape, like changing the user agent or rotating a proxy.
def prepare_browser():
2) Initialize Chrome options.
chrome_options = webdriver.ChromeOptions()
3) Add proxies to the browser options.
chrome_options.add_argument(f'--proxy-server={proxy}')
4) Now, let's specify the settings needed for Selenium-Stealth to work.
chrome_options.add_argument("start-maximized")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
5) Create the Chrome browser with the options you’ve set earlier.
driver = webdriver.Chrome(options= chrome_options)
6) Apply more Selenium-Stealth settings. For extra anonymity, you can also rotate your digital fingerprint or user agent (see the sketch after this step).
stealth(driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=False,
        run_on_insecure_origins=False,
        )
7) Return the Chrome driver with all the options and settings you've set up so far.
return driver
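As mentioned in step 6), you could rotate the user agent instead of hard-coding one. Here's a minimal sketch of that idea, assuming you keep a small pool of user-agent strings (the two strings below are only examples): pick one at random inside prepare_browser() and pass it to stealth().
import random

# Illustrative pool of user-agent strings; extend it however you like.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
]

# Inside prepare_browser(), pass a random choice instead of the hard-coded string:
# stealth(driver, user_agent=random.choice(user_agents), ...)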
Step 5. Now let’s move on to scraping.
1) Create a new function. The scrape() function requires only a single argument – a username that you’ve passed from the loop in the main() function.
def scrape(username):
2) Build a URL. Adding ?__a=1&__d=dis to the end allows you to get the response from the Instagram backend directly, without parsing HTML content.
url = f'https://instagram.com/{username}/?__a=1&__d=dis'
3) Then call the prepare_browser() function and assign a driver to a variable.
chrome = prepare_browser()
4) Open the browser and make a request.
chrome.get(url)
5) To tell whether the request failed, you'll need to check that you weren't redirected to the login page. You can do that by looking for the string login in the URL: if it's present, the request wasn't successful. Additional retry functionality could be added here to try scraping a username again later.
print (f"Attempting: {chrome.current_url}")
if "login" in chrome.current_url:
print ("Failed/ redir to login")
chrome.quit()
6) Otherwise, the request has been successful. This means we can extract the body text from the response and parse it as JSON. The result can then be passed to a parse_data() function together with the Instagram username you’ve scraped.
else:
    print("Success")
    resp_body = chrome.find_element(By.TAG_NAME, "body").text
    data_json = json.loads(resp_body)
    user_data = data_json['graphql']['user']
    parse_data(username, user_data)
    chrome.quit()
Step 6. Let's move on to parsing the data. Create the parse_data() function mentioned earlier to extract the data you want from the JSON response.
def parse_data(username, user_data):
For example, you can get some post captions from publicly available posts.
captions = []
if len(user_data['edge_owner_to_timeline_media']['edges']) > 0:
    for node in user_data['edge_owner_to_timeline_media']['edges']:
        if len(node['node']['edge_media_to_caption']['edges']) > 0:
            if node['node']['edge_media_to_caption']['edges'][0]['node']['text']:
                captions.append(
                    node['node']['edge_media_to_caption']['edges'][0]['node']['text']
                )
In addition to the post captions, you can get users’ full names, the category they belong to, and the number of followers they have. All of this information can finally be written into the output dictionary.
output[username] = {
    'name': user_data['full_name'],
    'category': user_data['category_name'],
    'followers': user_data['edge_followed_by']['count'],
    'posts': captions,
}
Here's the complete script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
import json
from selenium_stealth import stealth

usernames = ["jlo", "shakira", "beyonce", "katyperry"]
output = {}

def prepare_browser():
    chrome_options = webdriver.ChromeOptions()
    proxy = "server:port"
    chrome_options.add_argument(f'--proxy-server={proxy}')
    chrome_options.add_argument("start-maximized")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=chrome_options)
    stealth(driver,
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36',
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=False,
            run_on_insecure_origins=False,
            )
    return driver

def parse_data(username, user_data):
    captions = []
    if len(user_data['edge_owner_to_timeline_media']['edges']) > 0:
        for node in user_data['edge_owner_to_timeline_media']['edges']:
            if len(node['node']['edge_media_to_caption']['edges']) > 0:
                if node['node']['edge_media_to_caption']['edges'][0]['node']['text']:
                    captions.append(
                        node['node']['edge_media_to_caption']['edges'][0]['node']['text']
                    )
    output[username] = {
        'name': user_data['full_name'],
        'category': user_data['category_name'],
        'followers': user_data['edge_followed_by']['count'],
        'posts': captions,
    }

def scrape(username):
    url = f'https://instagram.com/{username}/?__a=1&__d=dis'
    chrome = prepare_browser()
    chrome.get(url)
    print(f"Attempting: {chrome.current_url}")
    if "login" in chrome.current_url:
        print("Failed/ redir to login")
        chrome.quit()
    else:
        print("Success")
        resp_body = chrome.find_element(By.TAG_NAME, "body").text
        data_json = json.loads(resp_body)
        user_data = data_json['graphql']['user']
        parse_data(username, user_data)
        chrome.quit()

def main():
    for username in usernames:
        scrape(username)

if __name__ == '__main__':
    main()
    pprint(output)
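If you'd rather save the results than just print them, you could, for example, dump the output dictionary to a file at the end of the script. This small addition isn't part of the script above, but it reuses the json module that's already imported:
# Optional: write the scraped profiles to a JSON file once main() has finished.
with open("instagram_profiles.json", "w", encoding="utf-8") as json_file:
    json.dump(output, json_file, indent=4, ensure_ascii=False)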
How to Scrape Instagram's Public Profiles with the Requests Library
This is another step-by-step example, this time using the Requests library. This method is significantly faster and more lightweight, as you don't need to simulate a web browser. However, it also fails much more often. But even with a low success rate, you can scrape quite a bit of data by simply retrying failed requests with new proxies.
Step 1. Start by importing Requests, JSON, and Random.
import requests, json, random
Then import pprint to format the console output when printing the results.
from pprint import pprint
Step 2. Now let’s set up a usernames list that will contain all of the Instagram users we’re going to scrape.
usernames = ["jlo", "shakira", "beyonce", "katyperry"]
After that, set up your proxy.
proxy = "http://username:password@server:port"
You can create a new dictionary variable to store the scraped results.
output = {}
Step 3. Then write the entry point that calls the main() function and prints out the results. (As before, this block goes at the very end of the complete script.)
if __name__ == '__main__':
    main()
    pprint(output)
Now prepare the headers to mask the fact that you're sending requests through a scraper. The get_headers() function will also rotate between a few user agents.
def get_headers(username):
    headers = {
        "authority": "www.instagram.com",
        "method": "GET",
        "path": "/{0}/".format(username),
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
        "upgrade-insecure-requests": "1",
        "Connection": "close",
        "user-agent": random.choice([
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
        ])
    }
    return headers
Step 4. Write the code to iterate over the usernames you're going to scrape. The main() function will loop through the Instagram username list and build the request URL for each profile.
def main():
    for username in usernames:
        url = f"https://instagram.com/{username}/?__a=1&__d=dis"
Step 5. Now we’re going to write the line that will send the request and apply the headers and proxy to it.
response = requests.get(url, headers=get_headers(username), proxies = {'http': proxy, 'https': proxy})
To tell whether the request failed, you'll need to check that you weren't redirected to the login page. You can do that by checking whether the response is valid JSON. The lines below also parse the response text.
if response.status_code == 200:
    try:
        resp_json = json.loads(response.text)
If you didn't get your results in JSON, it means you were redirected to the login page; the script then moves on to the next username.
    except:
        print("Failed. Response not JSON")
        continue
Additional retry functionality could be added here to try scraping a username later.
If you got your results in JSON, then you can clean (parse) your data.
    else:
        user_data = resp_json['graphql']['user']
        parse_data(username, user_data)
There might be other errors along the way, so let's catch them as well. If a request fails, you can apply the same retry logic with a new proxy.
elif response.status_code == 301 or response.status_code == 302:
    print("Failed. Redirected to login")
else:
    print("Request failed. Status: " + str(response.status_code))
Step 6. Create a parse_data() function to get the data you want from the JSON response.
def parse_data(username, user_data):
For example, you can get some post captions from publicly available posts and assign a list of them to a variable.
captions = []
if len(user_data['edge_owner_to_timeline_media']['edges']) > 0:
    for node in user_data['edge_owner_to_timeline_media']['edges']:
        if len(node['node']['edge_media_to_caption']['edges']) > 0:
            if node['node']['edge_media_to_caption']['edges'][0]['node']['text']:
                captions.append(
                    node['node']['edge_media_to_caption']['edges'][0]['node']['text']
                )
output[username] = {
    'name': user_data['full_name'],
    'category': user_data['category_name'],
    'followers': user_data['edge_followed_by']['count'],
    'posts': captions,
}
Here's the complete script:
import requests, json, random
from pprint import pprint

usernames = ["jlo", "shakira", "beyonce", "katyperry"]
proxy = "http://username:password@proxy:port"
output = {}

def get_headers(username):
    headers = {
        "authority": "www.instagram.com",
        "method": "GET",
        "path": "/{0}/".format(username),
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
        "upgrade-insecure-requests": "1",
        "Connection": "close",
        "user-agent": random.choice([
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
        ])
    }
    return headers

def parse_data(username, user_data):
    captions = []
    if len(user_data['edge_owner_to_timeline_media']['edges']) > 0:
        for node in user_data['edge_owner_to_timeline_media']['edges']:
            if len(node['node']['edge_media_to_caption']['edges']) > 0:
                if node['node']['edge_media_to_caption']['edges'][0]['node']['text']:
                    captions.append(
                        node['node']['edge_media_to_caption']['edges'][0]['node']['text']
                    )
    output[username] = {
        'name': user_data['full_name'],
        'category': user_data['category_name'],
        'followers': user_data['edge_followed_by']['count'],
        'posts': captions,
    }

def main():
    for username in usernames:
        url = f"https://instagram.com/{username}/?__a=1&__d=dis"
        response = requests.get(url, headers=get_headers(username), proxies={'http': proxy, 'https': proxy})
        if response.status_code == 200:
            try:
                resp_json = json.loads(response.text)
            except:
                print("Failed. Response not JSON")
                continue
            else:
                user_data = resp_json['graphql']['user']
                parse_data(username, user_data)
        elif response.status_code == 301 or response.status_code == 302:
            print("Failed. Redirected to login")
        else:
            print("Request failed. Status: " + str(response.status_code))

if __name__ == '__main__':
    main()
    pprint(output)
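As noted earlier, a fair share of these requests will fail, so it's worth retrying failed usernames with a fresh proxy. One simple, purely illustrative way to do that is to keep a small pool of proxy URLs and wrap the request in a retry loop. The sketch below reuses get_headers() from the script above; the proxy addresses are placeholders, and main() would call get_with_retries() instead of requests.get() directly.
import random

# Illustrative proxy pool; in practice, these come from your provider.
proxy_pool = [
    "http://username:password@server1:port",
    "http://username:password@server2:port",
]

def get_with_retries(url, username, max_retries=3):
    # Request the same URL a few times, switching to a new proxy on each attempt.
    for _ in range(max_retries):
        current_proxy = random.choice(proxy_pool)
        response = requests.get(url, headers=get_headers(username),
                                proxies={'http': current_proxy, 'https': current_proxy})
        if response.status_code == 200:
            return response
    return None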