What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
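Roughly, the idea is that the LLM runs once to emit a rule set, and the scrape itself is then just cheap selector application. A minimal sketch of what that could look like (the keys and selectors here are made up for illustration):

    # Hypothetical rule set an LLM might generate once for a product page.
    # The actual scrape then runs with plain CSS selectors and no LLM calls.
    from bs4 import BeautifulSoup

    rules = {
        "title": "h1.product-title",
        "price": "span.price",
        "description": "div.product-description p",
    }

    def apply_rules(html: str, rules: dict) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for key, selector in rules.items():
            node = soup.select_one(selector)
            out[key] = node.get_text(strip=True) if node else None
        return out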
Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).
We currently use this at Magic Loops[2] and it works _most_ of the time.
The long tail is difficult, though, and it's not uncommon for users to back out to raw HTML and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.
This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
Code is also hard. You have to generate code that accounts for all possible exceptions or errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.
Here's a project that describes using an LLM to generate crawling rules and then capturing data with them, but it looks like it's still in the early stages of research.
Most of the top LLMs already do this very well. It's because they've been trained on web data, and also because they're being used internally for precisely this task to grab data.
The complicated ops side of scraping is running headless browsers, managing IP ranges, bypassing bot detection, filling captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.
It also seems obvious that one would want to simply drag a box around the content you want, and the tool would provide some examples to help you refine the rule set.
Ad blockers have had something very close to this for some time, without any sparkly AI buttons.
I'm sure someone is working on a subscription-based service using corporate models on the backend, but it's something that could easily be implemented with a very small model.
That's an interesting take. I've been experimenting with reducing the overall rendered HTML down to just structure and content and using an LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.
One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
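A rough sketch of the diff idea, assuming two pages fetched from the same domain (the similarity cutoff is arbitrary):

    # Drop lines that a sibling page from the same domain also contains
    # (headers, footers, nav), keeping only page-specific content.
    import difflib

    def strip_common_parts(page_a: str, page_b: str, cutoff: float = 0.9) -> list[str]:
        lines_b = page_b.splitlines()
        kept = []
        for line in page_a.splitlines():
            # Keep a line only if no near-identical line exists on the sibling page.
            if not difflib.get_close_matches(line, lines_b, n=1, cutoff=cutoff):
                kept.append(line)
        return kept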
Parsing HTML is a solved and, frankly, not very interesting problem. Writing xpath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.
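For the script-variable case, a minimal sketch (the variable name here is just a common convention, not universal):

    import json
    import re

    def extract_script_json(html: str, var_name: str = "__INITIAL_STATE__") -> dict | None:
        # Grab the JSON blob assigned to a JS variable inside the page source.
        # Assumes the assignment ends with "};", which isn't guaranteed everywhere.
        match = re.search(rf"{re.escape(var_name)}\s*=\s*(\{{.*?\}});", html, re.DOTALL)
        return json.loads(match.group(1)) if match else None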
The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be performed by a human in a couple of hours.
LLM use in web scraping is still purely educational and assistive as the biggest problem in scraping is not scraping itself but scraper scaling and blocking which is becoming extremely common.
What is the point of using LLMs for the scraping itself instead of using them to generate the boring code for mimicking HTTP requests, CSS/xpath selectors, etc.?
I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.
It is potentially expensive, but here's a different take.
Instead of writing a bunch of selectors that break often, imagine just being able to write a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site, or to fetch the images, titles, and prices off a storefront.
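A rough sketch of that, using the OpenAI Python client (model name and prompt wording are placeholders):

    from openai import OpenAI

    client = OpenAI()

    def extract_headlines(page_html: str) -> str:
        # Describe what you want in plain language instead of writing selectors.
        prompt = (
            "From the following HTML of a news site, return the top 10 headlines "
            "and their links as a JSON array of {\"headline\", \"url\"} objects.\n\n"
            + page_html
        )
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # any sufficiently capable model
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content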
At my job we are scraping using LLMs for a 10M sector of the company. GPT-4 Turbo has not hallucinated once out of 1.5 million API requests. We use it to parse data and interpret it from webpages, which is something you wouldn't be able to do with a regular scraper. Not well, at least.
I guess the claim is based on statistical sampling at reasonably high level to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?
Do you have any workflow tools etc. to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort through bad and good results.
In this case we had 1.5 million ground truths for our testing purposes. We have now run it over 10 million, but I didn't want to claim it had 0 hallucinations on those since technically we can't say for sure; still, considering the hallucination rate was 0% for the 1.5 million compared against ground truths, I'm fairly confident.
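A minimal sketch of that kind of ground-truth check (the record format and field-by-field comparison here are assumptions, not necessarily how they do it):

    def hallucination_rate(extracted: list[dict], ground_truth: list[dict], fields: list[str]) -> float:
        # Count field-level mismatches against known-correct values.
        mismatches = total = 0
        for got, expected in zip(extracted, ground_truth):
            for field in fields:
                total += 1
                if got.get(field) != expected.get(field):
                    mismatches += 1
        return mismatches / total if total else 0.0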
This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little bit too high for scraping, but we expect that to change in the next year
Would be nice if docs had a comparison between traditional scraping (e.g. using headless browsers, beautifulsoup, etc) versus this approach. Exactly how is AI used?
A lot of larger LLMs have been trained on millions of pages of HTML. They have the ability to understand raw HTML structure and extract content from it. I've been having some success with this using Mixtral 8x7B.
A lot of websites (like online shopping) won't let you scrape for long unless you are logged in. Some (like real estate sites) won't tolerate you for long even if you are logged in. Some (like newspapers) won't accept a simple request; they will try to detect browser and user behavior. Some will even detect data center IP blocks to get rid of you.
I don't believe scraping is such a solved problem that you can slap AI and some cute vector spiders on it and claim that everything works.
Writing a few selector queries is probably easier than having an LLM output the selector queries by prompting it with the webpage and the desired output.
I do scraping for a living. CasperJS/PhantomJS is still my best friend for scraping dynamic websites. Selector queries are the least of my problems.
I don't see the benefit. I don't want to say what I built 2markdown.com with (it converts websites to markdown for LLMs), but it has pretty decent performance without any high-cost (and sometimes erroneous) LLMs thrown on top of the scraping.
I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.
There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
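A minimal sketch of that preprocessing step with BeautifulSoup (which tags to drop is a judgment call):

    from bs4 import BeautifulSoup

    def shrink_html(html: str) -> str:
        # Strip tags that rarely carry the data you want before handing
        # the page to an LLM, so more of it fits in the context window.
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "svg", "noscript", "header", "footer", "nav"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)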