What Is an AI Data Parser?
The Oxford Dictionary of English describes parsing as – just kidding! Parsing means turning an abstract jumbled ball of information into a nice and structured collection of data. Of course, you can do it yourself by manually entering the details of your lunch receipts – or 69,000 pages of laptops on sale on Amazon – into a spreadsheet. But AI parsing is much more powerful – and a lot better suited for scraping the web.

AI Data Parsing in Short
AI parsing is the method of turning unstructured information – like prices scattered across a bunch of web pages – into nice and orderly data fit for a database by using LLMs (Large Language Models). Traditional methods already offer accuracy and speed; the added flexibility of LLMs greatly reduces maintenance requirements and the difficulty of scaling.
But to really explain the benefits of AI-assisted data parsing, we have to first look into the ways data was structured before AI/LLMs entered the field.
Traditional Methods of Data Parsing in Web Scraping Explained
Pre-Machine Learning Parsing
The basic model of parsing a website means taking a programmer, sitting them down in front of the HTML structure of a web page, and having them write a CSS-, XPath-, or regex-based algorithm for extracting data out of that page. Ideally, once written, the algorithm will reliably parse all the necessary data from any page of the same type on that domain.
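To make that concrete, here's a minimal sketch of such a hand-written parser in Python using lxml – the XPath expressions and the product-page markup are made up for illustration, not taken from any real site:

```python
from lxml import html

# Hand-written, static parser for one (hypothetical) product-page layout.
# The XPath expressions are assumptions about the markup and will break
# as soon as the site's HTML structure changes.
TITLE_XPATH = '//h1[@class="product-title"]/text()'
PRICE_XPATH = '//span[@class="price"]/text()'

def parse_product(page_html: str) -> dict:
    tree = html.fromstring(page_html)
    return {
        "title": tree.xpath(TITLE_XPATH)[0].strip(),
        "price": tree.xpath(PRICE_XPATH)[0].strip(),
    }

sample = '<html><h1 class="product-title">Laptop X</h1><span class="price">$850</span></html>'
print(parse_product(sample))  # {'title': 'Laptop X', 'price': '$850'}
```

Rename one class in the site's HTML and both XPaths silently stop matching – which is exactly the maintenance problem described below.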
The parsing algorithm you get is both static and deterministic:
- Static: it doesn’t change unless you change it manually.
- Deterministic: run it on the same web page a thousand times, and it will always get the same output; if the listed laptop price is $850, then the database entry for the price will always be $850.
There are two downsides to this method:
- Maintenance: a static algorithm can’t handle any changes to the web page – just like you, but with less drama. So, someone needs to keep an eye on the web design and then rewrite the algorithm to adapt to any changes.
- Non-scalable: let’s say it takes one developer one day to write a parser for a single domain. That’s not bad if you’re only scraping/parsing data from one domain. What if you want to hit 10,000 different domains? Then you’ll need either 10,000 developers, 10,000 days – or, more realistically, a combination of the two. Oh, and don’t forget the maintenance.
Classic Machine Learning Parsing
When machine learning (ML) became more commonplace, a new method was employed:
- You sit down with a web page, look at the HTML code, and split it into elements.
- You label the elements: this is the price field, this is the product photo, etc.
- You train ML models on all this data before letting them loose to parse websites.
After the training is done, you get a model that is mostly domain-agnostic – so you don't need to retrain it for every new domain.
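As a rough sketch of what that labeling stage produces – the field names and features here are assumptions for illustration, not any specific tool's format – the training set boils down to thousands of HTML elements paired with human-assigned labels:

```python
# Hypothetical labeled examples for training a classic ML parsing model.
# Each record pairs an HTML element (plus some context features) with the
# label a human assigned to it; the field names are illustrative only.
labeled_examples = [
    {"html": '<span class="p-val">$850</span>', "tag": "span", "depth": 7, "label": "price"},
    {"html": '<img src="/img/laptop.jpg" alt="Laptop X">', "tag": "img", "depth": 6, "label": "product_photo"},
    {"html": '<h1>Laptop X 15-inch</h1>', "tag": "h1", "depth": 4, "label": "product_title"},
]
# Multiply this by thousands of pages across many domains and you get the
# manual-labor problem described below.
```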
The downsides are thus:
- Intensive training: before your ML model can start parsing websites, you need to train it. And to train it, you need to process and label thousands of websites. That's a lot of manual labor.
- Data drift: websites change over time, but the model doing the parsing can't adapt on its own, so you will have to invest in maintaining it as well.
Visual Parsing
Visual parsing is a novel take on ML parsing, and it made Diffbot famous. Instead of rooting through the code to identify elements your ML model needs to seek out, visual parsing renders the page in the browser. The model then parses the page via computer vision and returns structured contents. It’s kind of like what you do as a human when viewing a website.
- The big upside of the Diffbot approach is that you don’t need to know how to code to train the model: you mark all the segments on a website as you visually understand them, and then the ML model will learn from that.
- Since it doesn’t look into the code of the web page, just the visual output, it’s less sensitive to any changes that may happen in the background that are invisible to the eye.
- On the other hand, it still needs a lot of human work to prepare the training materials, and the maintenance requirement isn’t going anywhere either.
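Diffbot's actual pipeline is proprietary, but as a rough sketch of the render-then-look workflow, the rendering half might look like this with Playwright – the vision model call at the end is a hypothetical placeholder:

```python
from playwright.sync_api import sync_playwright

def render_page(url: str) -> bytes:
    """Render the page in a headless browser and return a full-page screenshot.
    The screenshot, not the HTML, is what a visual-parsing model consumes."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        screenshot = page.screenshot(full_page=True)
        browser.close()
        return screenshot

# A trained computer-vision model (hypothetical here) would then take the
# screenshot and return labeled regions such as title, price, and photo:
# regions = vision_model.detect(render_page("https://example.com/laptop-x"))
```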
With that in mind, we can consider AI web parsing.
Using AI for Web Parsing
AI web parsing involves large language models. There are currently two main methods at play: LLM-based instruction generation and LLM-based JSON parsing.
LLM-Based Instruction Generation
This method may also be called LLM-based parser generation – it's what Oxylabs' OxyCopilot runs on. You take the HTML of a target page and feed it into an LLM together with instructions to generate a parser (including a list of the fields you want to extract). The LLM will then write a parser – XPaths and all – for you.
In this setup, the LLM replaces the programmer who would otherwise have to write that algorithm manually. You do it once for a single page on the domain, and you now have a static and deterministic parser that can snag data from any page on the same website.
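Here's a minimal sketch of that one-off generation step – the prompt, model name, and output handling are assumptions, not how OxyCopilot actually works under the hood:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GENERATION_PROMPT = """You are given the HTML of a product page.
Write XPath expressions that extract the product title and the price.
Return them as a JSON object: {"title": "<xpath>", "price": "<xpath>"}.

HTML:
"""

def generate_parser(page_html: str) -> str:
    # One-off call: the LLM writes the parser, then steps out of the loop.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable model works here
        messages=[{"role": "user", "content": GENERATION_PROMPT + page_html}],
    )
    # What comes back is a static, deterministic parser definition you can
    # reuse on every other page of the same domain without further LLM calls.
    return response.choices[0].message.content
```

The XPaths that come back get plugged into an ordinary static parser like the one shown earlier, so every page after the first costs you plain HTML parsing rather than an LLM call.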
So, this approach:
- Saves labor and time: you don’t need a specialist to painstakingly code the parser for every domain you want to scrape.
- Has a measure of self-healing: if you set up an alert for when changes to the pages you scrape are detected, the LLM can be instructed to rewrite the parser, making maintenance that much faster.
The downsides:
- You need a new parser for each domain, just like with the write-the-parser-yourself methods. However, this is alleviated by the fact that you can just make the AI write more parsers.
- Human-written algorithms still come out ahead on accuracy. To bring an AI-generated parser up to par (or at least close), you'll need to implement validation strategies, which add complexity and cost (see the sketch after this list).
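One simple validation pass – the field rules here are illustrative assumptions – sanity-checks every parsed record before it touches the database:

```python
import re

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for one parsed product record.
    The rules are illustrative; real pipelines often add schema checks,
    cross-field checks, and spot-checks against a human-written parser."""
    errors = []
    if not record.get("title"):
        errors.append("missing title")
    price = record.get("price", "")
    if not re.fullmatch(r"\$\d+(\.\d{2})?", price):
        errors.append(f"price looks malformed: {price!r}")
    return errors

print(validate_record({"title": "Laptop X", "price": "$850"}))  # []
print(validate_record({"title": "", "price": "Add to cart"}))   # two errors
```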
LLM-Based JSON Parser
But what about skipping the middleman – or the middle code, to be precise? Method two, the LLM-based JSON parser, cuts out the whole "having to build a parser" part. What you do is take the HTML of the page, define your scraping requirements in JSON, and feed them both into a cheap LLM.
AI is much better at following rules than writing them. Once the LLM is done parsing, it presents the output as the structured data you need. You can use your own LLM for this! And with the wide variety of MCP servers available these days, all that data can then be sent straight to your database without you having to lift a finger.
Plus, unlike a static parser, which breaks the moment the website changes, an LLM will keep parsing the site – with no alterations to the JSON instructions – through pretty much any redesign.
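Here's a minimal sketch of that per-page call, assuming the OpenAI Python SDK and an illustrative set of requirements (the field names and model choice are assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The "parser" is nothing more than a schema plus instructions;
# the field names here are illustrative.
REQUIREMENTS = {
    "title": "the product name",
    "price": "the listed price, including currency symbol",
    "in_stock": "true/false availability",
}

def parse_with_llm(page_html: str) -> dict:
    # One LLM call per page: no hand-written or generated XPath parser involved.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any cheap model works here
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": "Extract the following fields from the HTML and answer "
                       "with a JSON object:\n"
                       + json.dumps(REQUIREMENTS, indent=2)
                       + "\n\nHTML:\n" + page_html,
        }],
    )
    return json.loads(response.choices[0].message.content)
```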
A couple of downsides, however:
- It is non-deterministic: you may ask the LLM to extract the price, but the results aren't guaranteed to be identical even when parsing the same page twice.
- It's also a little expensive: you're making an AI query per HTML page parsed, and those aren't cheap. A single LLM request can also take 5-8 seconds to process, while a prebuilt parser gets through the same page in a fraction of that time.
- Local models require expensive infrastructure: you’re not running a million requests on a MacBook. You have to consider at which point it becomes more economical to have a home scraping setup vs. just buying more tokens.
Still, this method is employed by Crawl4AI, SpiderScrape, Firecrawl, AI Studio and many others. That’s because there are scenarios where it is actually more efficient.
Imagine scenario #1: you have a single domain and one million parsing requests to make:
- Method one runs the AI once, gets the parser, and the parser then scrapes those 1 million pages on the cheap.
- Method two would make one million AI queries – you pay for each one (and remember: queries take more time than scraping).
But what about scenario #2: 100,000 domains and 10 requests per domain?
- Method one generates 100,000 algorithms that you then have to match to their specific domains before running the one million scraping requests. And if you don't have the self-healing setup in place, you now have to maintain all of those parsers yourself.
- Method two runs that single set of JSON instructions against every page, at which point the price question comes down to whether you're using a local model or not, how much you paid for the infrastructure, and the alternative cost of following method one. The rough cost sketch below makes this concrete.
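Here's a back-of-the-envelope cost model – every number in it is a placeholder assumption, not a real price – comparing both methods across the two scenarios:

```python
def method_one_cost(domains: int, pages: int, llm_call: float = 0.01,
                    plain_parse: float = 0.0001, upkeep_per_domain: float = 0.05) -> float:
    """Method one: one LLM call per domain to generate the parser, cheap
    deterministic parsing per page, plus some ongoing upkeep per domain
    (monitoring, regenerating parsers when sites change).
    All per-unit costs are placeholder assumptions."""
    return domains * (llm_call + upkeep_per_domain) + pages * plain_parse

def method_two_cost(pages: int, llm_call: float = 0.01) -> float:
    """Method two: one LLM call per page, nothing to generate or maintain."""
    return pages * llm_call

# Scenario 1: one domain, one million pages -> method one is far cheaper.
print(method_one_cost(1, 1_000_000), method_two_cost(1_000_000))
# Scenario 2: 100,000 domains, 10 pages each (still one million pages) ->
# the gap narrows, and the answer now hinges on the assumed per-call cost,
# per-domain upkeep, and whether you run a local model.
print(method_one_cost(100_000, 1_000_000), method_two_cost(1_000_000))
```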
In Conclusion
AI web parsing is the logical next step in the evolution of web parsing. The previous methods were already good at the parsing itself. The introduction of LLMs solves the issues of scaling and maintenance, making it easier to increase the scope of web scraping operations and to keep them running in the face of constant change.