Scrapy vs Beautiful Soup vs Selenium – Which One to Use?
We compare three popular web scraping tools.
Scrapy, Beautiful Soup, and Selenium are three popular web scraping tools. If you’re new to scraping – or simply haven’t tried all of them yet – you might find it hard to figure out which one you might need. This guide briefly goes through their capabilities and situations where each of the tools works best. If you’re in a hurry, you can jump straight to the comparison table below.
- Scrapy – a Full Package for Large-Scale Scraping
- Beautiful Soup – a Simple Parser for Beginners and Small Jobs
- Comparing the Three Options
Scrapy – a Full Package for Large-Scale Scraping
Scrapy is a so-called web crawling and scraping framework. Unlike some other tools, it doesn’t require any further dependencies to work (unless you’re dealing with JS). In other words, it contains everything you need to crawl pages, download and parse, and finally store the data you’ve scraped.
Scrapy is Python-based and open source. These two features make it a popular choice for web scraping, meaning that you’ll find much information about using it: both by reading comprehensive documentation and consulting with other scrapers on sites like StackOverflow.
By design, Scrapy was made to be greatly extensible. Aside from its rich basic features, it supports middleware and extensions, both of which bring custom functionality to Scrapy scripts (also called spiders). The framework allows adding proxies, controlling crawl depth, cookie and session handling. There’s even an interactive console for seeing if you’ve selected the right CSS or XPath expressions live. In short, Scrapy is powerful.
One of Scrapy’s best features is that it can handle and process requests asynchronously. So, you can extract data from many pages at once, which makes the framework very fast and well suited for large-scale scraping.
However, all this functionality and extensibility means that Scrapy isn’t the easiest tool to use. Despite the generous docs, you’ll have to invest a good deal of time to get the hang of things.
Overall, you should be looking at Scrapy if you have a big web scraping project to do, or with a view to scraping at scale in the future.
Beautiful Soup – a Simple Parser for Beginners and Small Jobs
As Beautiful Soup’s developers put it, they’re here to help you extract data from that awfully-written page. And they do so by giving you a Python-based data parsing library.
In essence, that’s what Beautiful Soup is – a library that structures an HTML or XML page, selects the data you need, and helps you extract it in a proper format. Unlike Scrapy, it can’t crawl pages or make GET requests. So, you’ll have to use another library like Requests to achieve that.
In reality, Beautiful Soup isn’t one parser, but rather a bundle of several parsing tools in one. It uses html.parser, HTML5lib, and lxml in the backend, so you can experiment with various parsing approaches. For example, lxml is the fastest, while HTML5lib is slow but very flexible.
Beautiful Soup’s biggest benefit is that it’s very simple to use: you can write a basic scraper in minutes and with few lines of code. It also doesn’t break very easily. These features give Beautiful Soup great appeal among web scrapers, so you can benefit both from great documentation and a lively community of developers online.
Beautiful Soup allows parallelizing requests. But setting it up isn’t very easy, and it still fails to compare with Scrapy in speed.
Overall, Beautiful Soup is a great choice for one-off or smaller web scraping jobs, where you don’t need to consistently extract data on a large scale.
Selenium lets you not only load a website, but also interact with it: emulate actions, fill in forms, click on buttons, and do other things. In other words, it includes full functionality of a proper headless browser.
However, precisely because it controls a whole headless browser, Selenium isn’t light on resources. It doesn’t help that the only way to multithread it is to fire up a new browser instance. The efficiency simply isn’t there. Nowadays, there are arguably better options for headless web scraping, such as Puppeteer or Playwright.
Comparing the Three Options
Here’s a brief table that displays the main features of Scrapy, Beautiful Soup, and Selenium side by side:
|Selectors||CSS, XPath||CSS||CSS, XPath|
|Proxies||Yes||With external libraries||Yes|
are you sure?? is scrapy faster than selenium? I have tested both on my setup and scrapy seems to be running not that much faster, are there anything else that might impact the speed of the scraper? ngl i might be doing something wrong with scrapy
How many parallel requests did you run? Now multiple that by ten and try comparing them again. 🙂