Web Scraping with Node.js: A Practical Guide for Beginners
This is a step-by-step guide to web scraping with Node.js. Here you’ll find two tutorials – one for scraping static pages, the other for dynamic websites.

Node.js is one of the most popular runtimes used for building web applications and scraping web pages.
In this step-by-step guide, you’ll learn why Node.js gained popularity in recent years, what tools you should use, and how to get more successful requests when extracting data. You’ll also find two step-by-step tutorials.
The first one will guide you through scraping static web pages with the Node.js Axios and Cheerio libraries. The second tutorial will show you how to build a Node.js web scraper with Puppeteer – a headless browser library for scraping dynamic web pages.
Contents
- What Is Node.js Web Scraping?
- Why Should You Choose Node.js for Web Scraping?
- Steps to Build a Web Scraper with Node.js
- Web Scraping Static Pages Using Node.js (Axios and Cheerio)
- Web Scraping Dynamic Pages Using Node.js and Puppeteer
What Is Node.js Web Scraping?
Node.js is a runtime environment that allows you to run JavaScript on the server side. Its primary focus is building web applications, but Node.js has also gained popularity for scraping websites, as much of the web now relies on JavaScript.
Web scraping with Node.js can be split into gathering data from 1) static and 2) dynamic web pages. The main difference between the two is that static pages don’t need JavaScript rendering to show content, while dynamic pages execute JavaScript before loading the information.
To scrape static websites with Node.js, you'll need to make a request and download your target page's HTML code using an HTTP client like Axios. Once you've downloaded the page, you can extract the data points and structure them with a parser such as Cheerio.
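As a minimal sketch of that flow (assuming your project runs as an ES module, with example.com standing in for your target page), the whole request-and-parse cycle fits in a few lines:
import axios from 'axios'
import { load } from 'cheerio'
// Download the raw HTML of the target page
const { data: html } = await axios.get('https://example.com/')
// Load the HTML into Cheerio and pull out a data point with a CSS selector
const $ = load(html)
console.log($('h1').first().text())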
To scrape a modern website or a single page application, you first need to render the entire page; traditional scripts can’t help you with that. So, you’ll need to use a headless browser like Puppeteer to deal with elements like infinite scrolling or lazy loading. In this case, Node.js is one of a few languages that makes dynamic scraping a walk in the park.
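As a rough sketch (again with example.com standing in for a real dynamic page, in an ES module), letting Puppeteer render the page before you grab its HTML looks like this:
import puppeteer from 'puppeteer'
// Launch a headless browser, render the page, then read the resulting HTML
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://example.com/', { waitUntil: 'networkidle2' })
const renderedHtml = await page.content()
console.log(renderedHtml.length)
await browser.close()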
Why Should You Choose Node.js for Web Scraping?
Node.js is probably the first choice for many when it comes to web scraping JavaScript-rendered websites like social media or news outlets. Here are the main reasons why you should choose the runtime for web scraping over other programming languages:
- Handles dynamic websites. Node.js is the top option for scraping websites that rely on JavaScript to load and render content.
- Highly scalable. The runtime uses a non-blocking I/O model, which allows you to handle multiple connections and requests simultaneously. It can also deal with large amounts of data without sacrificing performance. This makes Node.js a good choice for scraping multiple pages (see the sketch after this list).
- Relatively easy to learn. Node.js is based on JavaScript, so if you're already familiar with the language, it will be easy to pick up. Node.js also tends to need fewer lines of code than other programming languages that can handle dynamic websites.
- Great libraries and frameworks. Node.js has many tools that you can access via npm (Node Package Manager). For example, Axios is a popular library for handling HTTP requests, while Puppeteer and Playwright control a headless browser and deal with JavaScript rendering. The libraries also include packages for spoofing browser fingerprints and handling anti-bot systems.
- Large community. Node.js has a pretty large community of developers and users, extensive documentation, and many tutorials. You can also find discussions about specific issues on forums like StackOverflow.
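To illustrate the scalability point above, here's a minimal sketch of fetching several pages concurrently with Promise.all (in an ES module) – the URLs are placeholders:
import axios from 'axios'
const urls = [
  'https://example.com/page/1',
  'https://example.com/page/2',
  'https://example.com/page/3',
]
// Fire all requests at once; Node.js handles them concurrently thanks to non-blocking I/O
const responses = await Promise.all(urls.map((url) => axios.get(url)))
responses.forEach((resp, i) => console.log(urls[i], resp.status))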
While Node.js can deal with JavaScript-rendered websites, I wouldn't say it's the best option for static websites. It requires writing more code compared to, say, the Python programming language.
Steps to Build a Web Scraper with Node.js
Step 1. Know When to Use a Headless Browser
Headless browsers are either essential for your web scraping project or irrelevant. Here’s why.
If you’re dealing with a website that doesn’t use dynamic elements to show you content or doesn’t include JavaScript-based fingerprinting techniques, a headless browser won’t be much of a help; it will only slow down your scraper. In this case, use an HTTP client (for example, Axios) and a parser (Cheerio).
But a headless browser is a smart choice if your website relies on dynamic elements. Regular HTML extraction tools won't work with dynamic sites – a server can check whether your browser is able to render JavaScript, and that's one way website owners distinguish real users from bots.
Step 2. Choose a Node.js Web Scraping Library
Node.js has many great libraries for web scraping, and the choice depends on your project requirements, including the complexity of the websites you want to scrape. Here are some popular options:
- Puppeteer is a powerful headless browser library primarily designed for web testing, but it works just fine for web scraping. The library controls Chrome and Chromium browsers over the Chrome DevTools Protocol, which lets you drive the browser directly, so it's noticeably faster than WebDriver-based tools like Selenium. Puppeteer is well-documented and relatively easy to use.
- Playwright is one of the newest cross-browser libraries, primarily used for browser automation. In terms of web scraping, it can drive the three major browser engines: Chromium, Firefox, and WebKit. The tool has an inbuilt driver, so you won't need other dependencies for it to work. Playwright is asynchronous by default, which means it can easily handle multiple pages at once.
- Selenium is another web automation framework often used for scraping dynamic websites. It uses more resources than Puppeteer and Playwright but is flexible regarding browser support and programming languages. Selenium is one of the oldest tools, so you won’t lack support from the community.
- Cheerio is a data parsing library – it extracts data from the HTML code you’ve downloaded and transforms it into a structured format. It can’t send requests to web pages, so you should pair the tool with an HTTP client or a headless browser.
- Axios is the most popular HTTP client for making requests in Node.js. Usually, Axios is used when you don’t need to automate your browser. Axios can be paired with other Node.js libraries like Cheerio for a full web scraping experience – downloading and cleaning data.
Step 3. Devise Your Web Scraping Project and Guidelines
You can gather data from real targets like eBay or practice your skills on websites designed to be scraped.
The first approach is for experienced users – you'll encounter more web scraping challenges along the way, like CAPTCHAs, but you can get a lot of useful product information. If you're a web scraping newbie, go with a web scraping sandbox; we've made a list of the best websites to practice web scraping.
Whenever possible, look for API endpoints. Some websites offer publicly available APIs. If that's not the case, you may still find a "hidden" one. Web scraping JavaScript-rendered websites normally involves loading JavaScript and parsing HTML to extract data, but if you reverse engineer the API endpoint by inspecting network requests, you can get structured data and use less bandwidth. For example, many dynamic websites serve large amounts of data through a GraphQL endpoint.
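For instance, if the Network tab shows the site filling its pages from a JSON endpoint, you can often call that endpoint directly and skip HTML parsing altogether. The endpoint below is purely hypothetical – substitute whatever request you see the site making:
import axios from 'axios'
// Hypothetical "hidden" JSON endpoint discovered by inspecting network requests
const resp = await axios.get('https://example.com/api/products?page=1', {
  headers: { Accept: 'application/json' },
})
// The data arrives already structured – no HTML parsing needed
console.log(resp.data)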
Don't forget to respect the website you're scraping. Check its robots.txt file to understand which pages are off-limits for scrapers. Additionally, avoid overloading the server with too many requests, and use proxies to hide your real IP address and location. I'd recommend buying rotating proxies – they switch the IP with every request, and some providers also let you hold sticky sessions.
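As a rough sketch, throttling requests and routing them through a proxy with Axios could look like the snippet below. The proxy host, port, and credentials are placeholders for whatever your provider gives you:
import axios from 'axios'
// Placeholder proxy details – substitute your provider's gateway and credentials
const proxyConfig = {
  protocol: 'http',
  host: 'proxy.example.com',
  port: 8080,
  auth: { username: 'user', password: 'pass' },
}
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
for (const url of ['https://example.com/page/1', 'https://example.com/page/2']) {
  const resp = await axios.get(url, { proxy: proxyConfig })
  console.log(url, resp.status)
  // Pause between requests so you don't hammer the server
  await sleep(2000)
}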
Web Scraping Static Pages Using Node.js (Axios and Cheerio)
In this step-by-step tutorial, we'll scrape a list of books – their title, price, rating, stock, and URL – from books.toscrape.com. Even though the runtime has built-in modules for making HTTP requests, they aren't very convenient to use, so few people choose them for fetching data. For this reason, we'll be using the popular Node.js libraries Axios and Cheerio.

The main page of books.toscrape.com
Prerequisites
- Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
- Axios. You can add it by running npm install axios in your operating system's terminal.
- Cheerio. You can add it by running npm install cheerio.
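One more setup note: the code below uses ES module import syntax and top-level await, so your project needs to run as an ES module. The simplest way is to add "type": "module" to the package.json in your project folder (or save the script with an .mjs extension):
{
  "type": "module"
}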
Importing the Libraries
Step 1. First, let's import the necessary libraries.
1) Import Node.js HTTP client Axios.
import axios from 'axios'
2) Import Node.js parser Cheerio.
import { load } from 'cheerio'
3) Import the built-in Node.js file system module for writing results into the CSV file.
import fs from 'fs'
Downloading the Page
Step 1. Let’s download the target page.
const start_url = "http://books.toscrape.com/"
Step 2. Create a list to store the data.
const books_list = []
Step 3. Define the scrape() function.
async function scrape(url) {
Step 4. Make an HTTP request and wait for the response.
let resp = await axios.get(url)
Step 5. Extract HTML from the response.
let resp_html = resp.data
Extracting the Data Points
Step 1. Load the HTML into the Cheerio $ object.
const $ = load(resp_html)
Step 2. Pass the Cheerio instance to the parse() function.
parse($)
Step 3. Find the next page selector and the href attribute to scrape the next page.
try {
let next_href = $('.next > a').attr("href")
// In case the '/catalogue/' part of the URL is not found within
// the href attribute value, add it to the href
if (!next_href.includes('catalogue')){
next_href = `catalogue/${next_href}`
}
Step 4. Format the absolute URL of the next page we’re going to scrape.
let next_url = start_url + next_href
console.log('Scrape: ' + next_url)
Step 5. Call the scrape() function again and pass the URL.
await scrape(next_url)
} catch {
// Next page selector not found, end job
return
}
}
Parsing the HTML
Step 1. Define the parsing function.
function parse($){
Step 2. Now, we need to figure out where the data points are located. Let's scrape four elements: the book title, price, rating, and availability. Right-click anywhere on the page and press Inspect. You can see that they're all under a class called product_pod:
Step 3. We can select every element with that class and iterate through them:
$('.product_pod').map((i, element) => {
But the data you get will be messy, so let’s be more specific.
1) Extract the book title by finding the H3 tag within the element.
const book_title = $(element).find('h3').text()
2) Then, extract the book price by getting rid of the pound sign.
const book_price = $(element).find('.price_color').text().replace('£', '')
3) Now, get the book rating from the p tag with the classes star-rating and Num, where Num is the book's rating. This part is a bit trickier because the rating is stored in a class name made up of two words, and we only need one of them.
So, first find the element with that class and get the value of its class attribute, which returns a string such as "star-rating One". Then split that string into a list of words using spaces as separators and grab the second word.
const book_rating = $(element).find('p.star-rating').attr("class")
.split(' ')[1]
4) Extract the book's stock information by finding the element with the instock class, extracting its text, and trimming unnecessary whitespace.
const book_stock = $(element).find('.instock').text().trim()
5) Get the book URL by finding the a tag within the product_pod element and getting its href attribute, which you'll append to the start_url.
const book_url = start_url + $(element).find('a').attr("href")
Step 4. Now, let’s append our data points to the list:
books_list.push({
"title": book_title,
"price": book_price,
"rating": book_rating,
"stock": book_stock,
"url": book_url
})
Step 5. End the iteration.
})
//console.log(books)
}
Saving the Output to a CSV File
Step 1. Now, let’s structure all our data.
function write_to_csv(){
Step 2. Get the keys from the first element of books_list – they'll become the header line of the CSV file.
var csv = Object.keys(books_list[0]).join(', ') + '\n'
Step 3. Iterate through each book dictionary element.
books_list.forEach(function(book) {
Step 4. Add a new line to the csv variable with the line break at the end.
csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']}\n`
})
//console.log(csv)
Step 5. Write the output to a CSV file.
fs.writeFile('output.csv', csv, (err) => {
if (err)
console.log(err)
else {
console.log("Output written successfully")
}
})
}
Step 6. Then, pass the URL to the scrape() function and tell Node.js to await it so that all of the scrapes finish before we move on to writing the output.
await scrape(start_url)
Step 7. Call the function to write the output.
write_to_csv()
Here’s the full code:
import axios from 'axios'
import { load } from 'cheerio'
// For writing into the output file
import fs from 'fs'
const start_url = "http://books.toscrape.com/"
const books_list = []
// The scrape() function requests a page, parses it, and follows the link to the next page
async function scrape(url) {
// Requesting the page with the help of Axios and waiting for the response
let resp = await axios.get(url)
let resp_html = resp.data
// Loading the html into Cheerio. $ - Cheerio object
const $ = load(resp_html)
// Passing the Cheerio instance to the parse() function
parse($)
try {
// Try finding the next page selector and
// extract the href attribute for scraping the next page
let next_href = $('.next > a').attr("href")
// In case the '/catalogue/' part of the URL is not found within
// the href attribute value, add it to the href
if (!next_href.includes('catalogue')){
next_href = `catalogue/${next_href}`
}
// Formatting the absolute URL of the next page we are going to scrape
let next_url = start_url + next_href
console.log('Scrape: ' + next_url)
// Calling the scrape() function again and passing it the URL
await scrape(next_url)
} catch {
// Next page selector not found, end job
return
}
}
// Function for parsing the html of the page.
function parse($){
// The selector for each distinct book element on the page is an article
// tag with the class of "product_pod". This line finds all such elements
// and begins iterating through them.
$('.product_pod').map((i, element) => {
// To get the title, we find the h3 tag within the element and
// extract its text.
const book_title = $(element).find('h3').text()
// Price is also simple, we just get rid of the pound sign
const book_price = $(element).find('.price_color').text().replace('£', '')
// The book ratings are easily scraped from the p tag with the classes
// "star rating" and "Num" where "Num" is the rating the book has
// received. To extract the rating, we first find the element with that
// class, get the value of the "class" attribute which returns a string:
// e.g. "star-rating One", split that string by whitespaces and assign
// the second element of the resulting list to our variable.
const book_rating = $(element).find('p.star-rating').attr("class")
.split(' ')[1]
// Simply finding the element by the "instock" class, extracting the
// text and trimming the resulting string to strip away unnecessary
// whitespaces.
const book_stock = $(element).find('.instock').text().trim()
// To extract the url of the book, we find the a tag within the
// product_pod element and get its "href" attribute which we append to
// the start_url
const book_url = start_url + $(element).find('a').attr("href")
// Appending the results dictionary to the books_list
books_list.push({
"title": book_title,
"price": book_price,
"rating": book_rating,
"stock": book_stock,
"url": book_url
})
})
//console.log(books)
}
function write_to_csv(){
// Getting the keys from the first book object; these become the first line of the csv file
var csv = Object.keys(books_list[0]).join(', ') + '\n'
// Iterating through each book dictionary element
books_list.forEach(function(book) {
// Adding a new line to the csv variable with the line break at the end
csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']}\n`
})
//console.log(csv)
// Writing the output to a output.csv file
fs.writeFile('output.csv', csv, (err) => {
if (err)
console.log(err)
else {
console.log("Output written successfully")
}
})
}
// Script starts here. We pass the URL we are going to start our scrape on to
// the scrape function and tell Node.js to await it so that all of the
// scrapes finish before we move on to writing the output
await scrape(start_url)
// Calling the function to write the output
write_to_csv()
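With the script saved as, say, index.js (and "type": "module" set in package.json as noted earlier), you can run the scraper from your terminal:
node index.js
If everything goes well, the console logs each next page as it's scraped, prints "Output written successfully", and output.csv appears next to the script.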
Web Scraping Dynamic Pages Using Node.js and Puppeteer
In this step-by-step tutorial, we're going to scrape quotes – their text, author, and tags – from two URLs on quotes.toscrape.com with the Node.js library Puppeteer:
Both links include dynamic elements. The difference? The second page demonstrates delayed rendering. This is useful when a page takes time to load, or when you need to wait until a specific condition is satisfied before extracting the data.
Prerequisites
- Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
- Puppeteer. Since we’ll be using Puppeteer, you’ll also need to install it. Refer to the official website to learn how to install it on your system.
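In most setups, running npm install puppeteer in your project folder is enough – it installs the library together with a compatible browser build. As in the first tutorial, the code uses ES module imports, so keep "type": "module" in your package.json.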
Importing the Libraries
Step 1. First, let’s import the necessary elements.
1) Import the Puppeteer library.
import puppeteer from 'puppeteer'
2) Since we’ll be using the built-in Node.js file system module, you’ll need to import it, too.
import fs from 'fs'
3) Then, import the URLs so you can scrape them.
const start_url = 'http://quotes.toscrape.com/js/'
//const start_url = 'http://quotes.toscrape.com/js-delayed/'
Setting Up CSS Selectors
Step 1. Inspect the page source of quotes.toscrape.com/js by right-clicking anywhere on the page and pressing “Inspect”.

Inspecting the web page.
Step 2. You'll need to select all elements with the quote class and, within them, grab the text of the elements with the text, author, and tag classes. You'll also need a selector for the next-page link.
const quote_elem_selector = '.quote'
const quote_text_selector = '.text'
const quote_author_selector = '.author'
const quote_tag_selector = '.tag'
const next_page_selector = '.next > a'
Step 3. Set up the list where you’ll write the scraped quotes.
var quotes_list = []
Preparing to Scrape
Step 1. Write a function that launches Puppeteer. In this tutorial, we launch it in headful mode (headless: false) so you can watch the browser while it scrapes.
async function prepare_browser() {
const browser = await puppeteer.launch({
headless: false,
})
return browser
}
NOTE: If you want to add puppeteer-extra-plugin-stealth to hide your digital fingerprint or set up proxies to avoid an IP ban, here’s the place to do so. If you don’t know how to set up proxies with Puppeteer, we prepared a step-by-step tutorial.
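For reference, here's a minimal sketch of wiring in the stealth plugin. It assumes you've installed the puppeteer-extra and puppeteer-extra-plugin-stealth packages and keeps the same prepare_browser() shape used in this tutorial:
import puppeteer from 'puppeteer-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
// Register the stealth plugin before launching the browser
puppeteer.use(StealthPlugin())
async function prepare_browser() {
  const browser = await puppeteer.launch({
    headless: false,
  })
  return browser
}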
Step 2. Then, write the main() function.
async function main() {
1) Call the prepare_browser() function to get the browser object.
var browser = await prepare_browser()
var page = await browser.newPage()
2) Now, let’s start to scrape with the start_url string.
await get_page(page, start_url)
3) Close the browser after the scraping is done.
await browser.close()
4) Print out the JSON output in the terminal window.
console.log(quotes_list)
}
5) Finally, call the main() function to start the script.
main()

The output.
Scraping Multiple Pages with Node.js
Step 1. Let’s use the get_page() function to go to the URL, get the HTML output, and move to the next page.
async function get_page(page, url) {
await page.goto(url)
1) Now, we'll tell Puppeteer to wait for the content to appear using the quote_elem_selector. Once an element with class="quote" appears, scraping will begin. We'll set the timeout value to 20 seconds.
await page.waitForSelector(quote_elem_selector, {timeout: 20_000})
2) Then, call the scrape function to parse the HTML.
await scrape(page)
3) Check for a next page selector and extract the href attribute to scrape it.
try {
let next_href = await page.$eval(next_page_selector, el => el.getAttribute('href'))
let next_url = `https://quotes.toscrape.com${next_href}`
console.log(`Next URL to scrape: ${next_url}`)
4) Call the get_page() function again and pass it the next_url to scrape.
await get_page(page, next_url)
} catch {
// Next page button not found, end job
return
}
}
Step 2. Now, let’s move to the parsing part. We’re going to use the scrape() function.
async function scrape(page) {
1) Find all quote elements and put them in the quote_elements list.
let quote_elements = await page.$$(quote_elem_selector)
2) Then, iterate through the list to find all the values for each quote.
for (let quote_element of quote_elements) {
3) Now, find the elements we need using the selectors and extract their text content.
let quote_text = await quote_element.$eval(quote_text_selector, el => el.innerText)
let quote_author = await quote_element.$eval(quote_author_selector, el => el.innerText)
let quote_tags = await quote_element.$$eval(quote_tag_selector, els => els.map(el => el.textContent))
//console.log(quote_text)
//console.log(quote_author)
//console.log(quote_tags)
4) Put the output into a dictionary.
var dict = {
'author': quote_author,
'text': quote_text,
'tags': quote_tags,
}
5) Push the dictionary into the quotes_list to get the output.
quotes_list.push(dict)
}
}
Here’s the output:
PS C:\node-projects\scraping> node .\index.js
Next URL to scrape: https://quotes.toscrape.com/js/page/2/
Next URL to scrape: https://quotes.toscrape.com/js/page/3/
Next URL to scrape: https://quotes.toscrape.com/js/page/4/
Next URL to scrape: https://quotes.toscrape.com/js/page/5/
Next URL to scrape: https://quotes.toscrape.com/js/page/6/
Next URL to scrape: https://quotes.toscrape.com/js/page/7/
Next URL to scrape: https://quotes.toscrape.com/js/page/8/
Next URL to scrape: https://quotes.toscrape.com/js/page/9/
Next URL to scrape: https://quotes.toscrape.com/js/page/10/
[
{
author: 'Albert Einstein',
text: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
tags: [ 'change', 'deep-thoughts', 'thinking', 'world' ]
},
{
author: 'J.K. Rowling',
text: '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
tags: [ 'abilities', 'choices' ]
},
{
author: 'Albert Einstein',
text: '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
tags: [ 'inspirational', 'life', 'live', 'miracle', 'miracles' ]
},
Saving the Output to a CSV File
Step 1. This function does the writing:
function write_to_csv(){
1) Get the keys from the first element of quotes_list. They'll become the header line of the CSV file.
var csv = Object.keys(quotes_list[0]).join(', ') + '\n'
2) Iterate through each quote dictionary element.
quotes_list.forEach(function(quote) {
3) Add a new line to the CSV variable with the line break at the end.
csv += `${quote['author']}, "${quote['text']}", "${quote['tags']}"\n`
})
4) Write the output to the output.csv file
fs.writeFile('output.csv', csv, (err) => {
if (err)
console.log(err)
else {
console.log("Output written successfully")
}
})
}
Step 2. Call this function from main() after everything else is done.
async function main() {
1) Call the prepare_browser() function and get the browser object.
var browser = await prepare_browser()
var page = await browser.newPage()
2) Start to scrape.
await get_page(page, start_url)
3) Close the browser after the scraping is done.
await browser.close()
4) Print out the output json in the terminal window.
console.log(quotes_list)
5) Write the output to CSV.
write_to_csv()
}
Here’s the full code:
import puppeteer from 'puppeteer'
// For writing the output file
import fs from 'fs'
// URL to start the scrape
const start_url = 'http://quotes.toscrape.com/js/'
//const start_url = 'http://quotes.toscrape.com/js-delayed/'
// All of the CSS selectors
const quote_elem_selector = '.quote'
const quote_text_selector = '.text'
const quote_author_selector = '.author'
const quote_tag_selector = '.tag'
const next_page_selector = '.next > a'
// List we will be appending the scraped quotes to
var quotes_list = []
// Launching Puppeteer in headful mode (headless: false)
async function prepare_browser() {
const browser = await puppeteer.launch({
headless: false,
})
return browser
}
// get_page takes two parameters:
// page - a Puppeteer page (browser tab)
// url - the URL to be scraped
async function get_page(page, url) {
await page.goto(url)
// Telling Puppeteer to wait for the content to appear
// Once an element with a class=quote appears, the scrape will begin
// The timeout value is set to 20 seconds
await page.waitForSelector(quote_elem_selector, {timeout: 20_000})
// Calling the scrape function to parse the HTML
await scrape(page)
try {
// Await next page selector if it exists and extract the href attribute to scrape the next page
let next_href = await page.$eval(next_page_selector, el => el.getAttribute('href'))
let next_url = `https://quotes.toscrape.com${next_href}`
console.log(`Next URL to scrape: ${next_url}`)
// Calling the get_page() function again and passing the new URL
await get_page(page, next_url)
} catch {
// Next page button not found, end job
return
}
}
async function scrape(page) {
// Finding all of the quote elements and putting them in the quote_elements list
let quote_elements = await page.$$(quote_elem_selector)
// Iterating through the list to find all of the values we need for each quote
for (let quote_element of quote_elements) {
// Here we find the elements by using the selectors and extracting their text content
let quote_text = await quote_element.$eval(quote_text_selector, el => el.innerText)
let quote_author = await quote_element.$eval(quote_author_selector, el => el.innerText)
let quote_tags = await quote_element.$$eval(quote_tag_selector, els => els.map(el => el.textContent))
//console.log(quote_text)
//console.log(quote_author)
//console.log(quote_tags)
// Putting the output into a dictionary
var dict = {
'author': quote_author,
'text': quote_text,
'tags': quote_tags,
}
// Pushing the dictionary into the quotes_list for output
quotes_list.push(dict)
}
}
function write_to_csv(){
// Getting the keys from the first quote object; these become the first line of the csv file
var csv = Object.keys(quotes_list[0]).join(', ') + '\n'
// Iterating through each quote dictionary element
quotes_list.forEach(function(quote) {
// Adding a new line to the csv variable with the line break at the end
csv += `${quote['author']}, "${quote['text']}", "${quote['tags']}"\n`
})
//console.log(csv)
// Writing the output to a output.csv file
fs.writeFile('output.csv', csv, (err) => {
if (err)
console.log(err)
else {
console.log("Output written successfully")
}
})
}
async function main() {
// Calling the prepare_browser function and getting the browser object
var browser = await prepare_browser()
var page = await browser.newPage()
// Starting the scrape with the start url
await get_page(page, start_url)
// Closing the browser after scraping is done
await browser.close()
// Printing out the output json in the terminal window
console.log(quotes_list)
// Writing the output to csv
write_to_csv()
}
// Code starts running here, calling the main() function
main()