What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
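Roughly, the idea is that the LLM runs once to emit a rule set, and the scrape itself is then just cheap selector application. A minimal sketch of what that could look like (the keys and selectors here are made up for illustration):

    # Hypothetical rule set an LLM might generate once for a product page.
    # The actual scrape then runs with plain CSS selectors and no LLM calls.
    from bs4 import BeautifulSoup

    rules = {
        "title": "h1.product-title",
        "price": "span.price",
        "description": "div.product-description p",
    }

    def apply_rules(html: str, rules: dict) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for key, selector in rules.items():
            node = soup.select_one(selector)
            out[key] = node.get_text(strip=True) if node else None
        return out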
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long tail is difficult, though, and it's not uncommon for users to back out to raw HTML and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
Code is also hard. You have to generate code that accounts for all possible exceptions or errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.
Here's a project that describes using an LLM to generate crawling rules and then capturing data with them, but it looks like it's still in the early stages of research.
Most of the top LLMs already do this very well. It's because they've been trained on web data, and also because they're being used internally for precisely this task to grab data.
The complicated ops side of scraping is running headless browsers, managing IP ranges, bypassing bot detection, filling captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.
It also seems obvious that one would want to simply drag a box around the content you want, and the tool would provide some examples to help you refine the rule set.
Ad blockers have had something very close to this for some time, without any sparkly AI buttons.
I'm sure someone is working on a subscription-based service using corporate models on the backend, but it's something that could easily be implemented with a very small model.
That's an interesting take. I've been experimenting with reducing the overall rendered HTML down to just structure and content and using an LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.
One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
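A rough sketch of the diff idea, assuming two pages fetched from the same domain (the similarity cutoff is arbitrary):

    # Drop lines that a sibling page from the same domain also contains
    # (headers, footers, nav), keeping only page-specific content.
    import difflib

    def strip_common_parts(page_a: str, page_b: str, cutoff: float = 0.9) -> list[str]:
        lines_b = page_b.splitlines()
        kept = []
        for line in page_a.splitlines():
            # Keep a line only if no near-identical line exists on the sibling page.
            if not difflib.get_close_matches(line, lines_b, n=1, cutoff=cutoff):
                kept.append(line)
        return kept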
Parsing HTML is a solved and, frankly, not very interesting problem. Writing xpath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.
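For the script-variable case, a minimal sketch (the variable name here is just a common convention, not universal):

    import json
    import re

    def extract_script_json(html: str, var_name: str = "__INITIAL_STATE__") -> dict | None:
        # Grab the JSON blob assigned to a JS variable inside the page source.
        # Assumes the assignment ends with "};", which isn't guaranteed everywhere.
        match = re.search(rf"{re.escape(var_name)}\s*=\s*(\{{.*?\}});", html, re.DOTALL)
        return json.loads(match.group(1)) if match else None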
The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be performed by a human in a couple of hours.
LLM use in web scraping is still purely educational and assistive as the biggest problem in scraping is not scraping itself but scraper scaling and blocking which is becoming extremely common.
What is the point of using LLMs for the scraping itself instead of using them to generate the boring code for mimicking HTTP requests, CSS/xpath selectors, etc.?
I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.
It is potentially expensive, but here's a different take.
Instead of writing a bunch of selectors that break often, imagine just being able to write a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site, or to fetch the images, titles, and prices off a storefront.
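A rough sketch of that, using the OpenAI Python client (model name and prompt wording are placeholders):

    from openai import OpenAI

    client = OpenAI()

    def extract_headlines(page_html: str) -> str:
        # Describe what you want in plain language instead of writing selectors.
        prompt = (
            "From the following HTML of a news site, return the top 10 headlines "
            "and their links as a JSON array of {\"headline\", \"url\"} objects.\n\n"
            + page_html
        )
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # any sufficiently capable model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content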
At my job we are scraping using LLMs for a 10M sector of the company. GPT-4 Turbo has not hallucinated once out of 1.5 million API requests. We use it to parse data and interpret it from webpages, which is something you wouldn't be able to do with a regular scraper. Not well, at least.
I guess the claim is based on statistical sampling at reasonably high level to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?
Do you have any workflow tools etc. to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort through bad and good results.
In this case we had 1.5 million ground truths for our testing purposes. We have now run it over 10 million, but I didn't want to claim it had 0 hallucinations on those since technically we can't say for sure; still, considering the hallucination rate was 0% for the 1.5 million compared against ground truths, I'm fairly confident.
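A minimal sketch of that kind of ground-truth check (the record format and field-by-field comparison here are assumptions, not necessarily how they do it):

    def hallucination_rate(extracted: list[dict], ground_truth: list[dict], fields: list[str]) -> float:
        # Count field-level mismatches against known-correct values.
        mismatches = total = 0
        for got, expected in zip(extracted, ground_truth):
            for field in fields:
                total += 1
                if got.get(field) != expected.get(field):
                    mismatches += 1
        return mismatches / total if total else 0.0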
This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little bit too high for scraping, but we expect that to change in the next year
Would be nice if docs had a comparison between traditional scraping (e.g. using headless browsers, beautifulsoup, etc) versus this approach. Exactly how is AI used?
A lot of larger LLMs have been trained on millions of pages of HTML. They have the ability to understand raw HTML structure and extract content from it. I've been having some success with this using Mixtral 8x7B.
A lot of websites (like online shopping) won't let you scrape for long unless you are logged in. Some (like real estate sites) won't tolerate you for long even if you are logged in. Some (like newspapers) won't accept a simple request; they will try to detect browser and user behavior. Some will even detect data center IP blocks to get rid of you.
I don't believe scraping is such a solved problem that you can slap AI and some cute vector spiders on it and claim that everything works.
Writing a few selector queries is probably easier than having an LLM output the selector queries by prompting it with the webpage and the desired output.
I do scraping for a living. CasperJS/PhantomJS is still my best friend for scraping dynamic websites. Selector queries are the least of my problems.
I don't see the benefit. I don't want to say what I built 2markdown.com with (it converts websites to markdown for LLMs), but it has pretty decent performance without any high-cost (and sometimes erroneous) LLMs thrown on top of the scraping.
I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.
There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
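A minimal sketch of that preprocessing step with BeautifulSoup (which tags to drop is a judgment call):

    from bs4 import BeautifulSoup

    def shrink_html(html: str) -> str:
        # Strip tags that rarely carry the data you want before handing
        # the page to an LLM, so more of it fits in the context window.
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "svg", "noscript", "header", "footer", "nav"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)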