ScrapeGraphAI: Web scraping using LLM and direct graph logic (onrender.com)
194 points by ulrischa 6 months ago | 63 comments



What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance.
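
For illustration, the builder's output could be as simple as a small set of rules mapping CSS selectors to output keys, applied with plain BeautifulSoup and no model calls at scrape time; a minimal sketch with hypothetical selectors and keys:

    # Hypothetical LLM-generated rules: CSS selectors mapped to output keys.
    # At scrape time only cheap, deterministic code runs; no model calls.
    from bs4 import BeautifulSoup

    RULES = {
        "title": {"selector": "h1.article-title", "attr": "text"},
        "author": {"selector": "span.byline a", "attr": "text"},
        "image": {"selector": "meta[property='og:image']", "attr": "content"},
    }

    def apply_rules(html: str, rules: dict) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for key, rule in rules.items():
            node = soup.select_one(rule["selector"])
            if node is None:
                out[key] = None
            elif rule["attr"] == "text":
                out[key] = node.get_text(strip=True)
            else:
                out[key] = node.get(rule["attr"])
        return out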


Agreed!

Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).

We currently use this at Magic Loops[2] and it works _most_ of the time.

The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).

Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.

[0] https://apify.com/apify/website-content-crawler

[1] https://github.com/extractus/article-extractor

[2] https://magicloops.dev/

[3] https://reworkd.ai/


This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.


Awesome to hear! Looking forward to the launch -- the waitlist form was too long to complete; I'd need another LLM to fill that out :)


1 month away ;)


All around, automation with an LLM thrown on top of it sucks.

The statistics are not in its favour.


Code is also hard. You have to generate code that accounts for all possible exceptions and errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.


Yep, until you generate code—it's harder from a technical POV but you can get way higher performance & reliability.


Here's a project that describes using an LLM to generate crawling rules and then capture data with them, but it looks like it's still in the early stages of research.

https://github.com/EZ-hwh/AutoCrawler


Thanks, will look into it, looks promising


Most of the top LLMs already do this very well. That's because they've been trained on web data, and also because they're being used for precisely this task internally to grab data.

The complicated part of scraping is the ops: running headless browsers, IP ranges, bot bypass, filling captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.


Agreed there are several complexities, but I'm not sure which 'this' you mean - specifically, updating selectors is one of the areas I had in mind earlier.


There was one I remember out of UF/FSU called Intoli that seems to have pivoted into consulting.


It also seems obvious that one would want to simply drag a box around the content you want, and the tool would then provide some examples to help you refine the rule set.

Ad blockers have had something very close to this for some time, without any sparkly AI buttons.

I'm sure someone is working on a subscription-based offering using corporate models in the backend, but it's something that could easily be implemented with a very small model.


Mozenda does something like that. I haven't used it in many years, so I'm not up to date on what it currently offers.


That's an interesting take. I've been experimenting with reducing the overall rendered html size to just structure and content and using the LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.


One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
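
A rough sketch of that diff idea, assuming Python's difflib (naive line-level matching; real leniency would need fuzzier comparison):

    # Keep only the lines unique to the target page; lines shared with a
    # sibling page from the same domain (headers, footers, nav) are dropped.
    import difflib

    def strip_common(target_page: str, sibling_page: str) -> str:
        a = target_page.splitlines()
        b = sibling_page.splitlines()
        matcher = difflib.SequenceMatcher(None, a, b)
        unique = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                unique.extend(a[i1:i2])
        return "\n".join(unique)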


I have been working on this. Feel free to DM me.


Parsing HTML is a solved and, frankly, not very interesting problem. Writing xpath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.

The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control an indistinguishable web browser to perform all the steps needed to retrieve the data as a full package. Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be performed by a human in a couple of hours.

LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not the scraping itself but scraper scaling and blocking, which is becoming extremely common.


Exactly. Are you aware of any current efforts by people trying to do that?


Not anything in open source yet.


What is the point of using LLMs for the scraping itself instead of using them to generate the boring code for mimicking HTTP requests, CSS/xpath selectors, etc.?

I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.


It is potentially expensive, but here's a different take.

Instead of writing a bunch of selectors that break often, imagine just being able to write a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site, or to fetch the images, titles, and prices off a storefront.

It abstracts away a lot of manual fragile work.
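
As a sketch of what that can look like (not this library's API; the model name, prompt, and output shape are illustrative), assuming the OpenAI Python client and a pre-trimmed page:

    # Describe the extraction in plain language and let the model return JSON.
    import json
    from openai import OpenAI

    client = OpenAI()

    def top_headlines(page_text: str) -> list:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any capable model
            messages=[
                {"role": "system", "content": (
                    "Return a JSON object {\"headlines\": [{\"title\": ..., "
                    "\"url\": ...}]} with the top 10 headlines on the page."
                )},
                {"role": "user", "content": page_text},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)["headlines"]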


I get that and LLMs are expected to get better.

Today, would you build a scraper with current LLMs that randomly hallucinate? I wouldn't.

The idea of an LLM-powered scraper adapting the selectors every time the website owner updates the site is pretty cool, though.


At my job we are scraping using LLMs, for a 10M sector of the company. GPT-4 Turbo has not once, out of 1.5 million API requests, hallucinated. However, we use it to parse and interpret data from webpages, which is something you wouldn't be able to do with a regular scraper. Not well, at least.


Bold claim, did you review all 1.5 million requests?


I guess the claim is based on statistical sampling at a reasonably high level, to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?

Do you have any workflow tools etc. to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort the good results from the bad.


In this case we had 1.5 million ground truths for our testing purposes. We have now run it over 10 million, but I didn't want to claim it had zero hallucinations on those, as technically we can't say for sure. But considering the hallucination rate was 0% for the 1.5 million when compared to ground truths, I'm fairly confident.
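
A ground-truth check like that can be as simple as the sketch below (not their pipeline; field names are made up for illustration):

    # Compare each extracted record against its labelled ground truth and
    # report the fraction of mismatched fields as the "hallucination rate".
    def hallucination_rate(extracted, ground_truth, fields=("title", "price")):
        mismatches, total = 0, 0
        for got, expected in zip(extracted, ground_truth):
            for field in fields:
                total += 1
                if got.get(field) != expected.get(field):
                    mismatches += 1
        return mismatches / total if total else 0.0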


How do you know that's true?


The 1.5 million was our test set. We had 1.5 million ground truths, and it didn't make up fake data for a single one.


That's not what I asked. I asked "How did you determine that it didn't make up/get information wrong for all 1.5m?"


I've written thousands of scrapers and trust me, they don't break often.


Me too, but for adversaries that obfuscate and change their site often to prevent scraping. It can happen depending on what you are looking at.


Well-written scrapers should be able to cope with site changes.


https://github.com/Skyvern-AI/skyvern

This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little too high for scraping, but we expect that to change in the next year.


Would be nice if the docs had a comparison between traditional scraping (e.g. headless browsers, BeautifulSoup, etc.) and this approach. Exactly how is AI used?


A lot of the larger LLMs have been trained on millions of pages of HTML. They have the ability to understand raw HTML structure and extract content from it. I've been having some success with this using Mixtral 8x7B.


A lot of websites (like online shopping sites) won't let you scrape for long unless you are logged in. Some (like real estate sites) won't tolerate you for long even if you are logged in. Some (like newspapers) won't accept a simple request; they will try to detect browser and user behavior. Some will even detect data center IP blocks to get rid of you.

I don't believe scraping is such a solved problem that you can slap AI and some cute vector spiders on it and claim that everything works.


Typo on your homepage: "You just have to implment just some lines of code and the work is done"


There’s also llm-scraper in TypeScript

https://github.com/mishushakov/llm-scraper


Something similar I worked on in the past: https://github.com/lucgagan/auto-playwright/


Does it use ChatGPT every time you run the test or only when a test fails (to check if the selector has changed)?


Writing a few selector queries is probably easier than letting an LLM output them by feeding it the webpage and the output request. I do scraping for a living. CasperJS/PhantomJS are still my best friends for scraping dynamic websites. Selector queries are the least of the problems.


tell me more about your hobby?


I don't see the benefit. I don't want to say what I built 2markdown.com with (it converts websites to markdown for LLMs), but it has pretty decent performance without any high-cost (and sometimes erroneous) LLMs thrown on top of the scraping.


The "Integrates Easily" links to Langchain and Pipedream are flipped on your website.


Ouch, thanks! Will fix asap


At jobstash.xyz we have similar tech as part of our generalized scraping infra, and it’s been live for half a year performing optimally.


I'd love to try the demo but there's no way I'm putting my openai key into that site.


This is completely unrealistic unless you want to burn money.


In practice it's actually very good for lower volume tasks with non-fixed sources.

I haven't tried this library but I do use an LLM based scraper in addition to more traditional ones.


I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.


Which LLM do you use? Because I can't see a scraper running daily without being very expensive.


Llama-3 70B on my local MacBook works wonderfully for these tasks.


How's the pipeline? Do you pass all the HTML to the LLM? Isn't the context window a problem?


There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
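
Something along these lines works as a crude first pass (a sketch assuming BeautifulSoup; the tag list is a guess, not any specific tool's behaviour):

    # Strip tags that are almost never data, then collapse to one line per
    # remaining text node so the result fits a small context window.
    from bs4 import BeautifulSoup

    NOISE_TAGS = ["script", "style", "noscript", "svg", "nav", "footer", "iframe"]

    def slim_html(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(NOISE_TAGS):
            tag.decompose()
        return "\n".join(
            line for line in soup.get_text("\n").splitlines() if line.strip()
        )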


At a very generous 50 tokens per second, doesn't that still leave you with more than two and a half minutes (160 s) of processing time per document?


GPT-3.5/GPT-4 ain't the only LLMs available. A Flan-T5/T5 or Llama 2/3 8B model can be fine-tuned for this use case and run much more cheaply.


How do you handle the context window limit? If you push the entire DOM to the LLM it will far exceed the context window in most cases, won't it?


My guess is you do some preprocessing on the DOM to get it down to text that still retains some structure.

Something like https://github.com/Alir3z4/html2text.

I'm sure there are other (better?) options as well.
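
A minimal html2text pass might look like this (options are a guess at sensible defaults, not a recommendation from the thread):

    # Convert the DOM to markdown-ish text: headings, links and lists survive,
    # most token-heavy markup does not.
    import html2text

    def to_markdown(html: str) -> str:
        h = html2text.HTML2Text()
        h.ignore_images = True  # images rarely help the LLM
        h.body_width = 0        # don't hard-wrap lines
        return h.handle(html)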


I wrote https://markdown.download as a general helper for this


Trim unwanted html elements + convert to markdown. Significantly reduces token counts while retaining structure.


Again, it depends on the volume of the scraping and the value of the data within it. Even GPT-3.5 can be cost-effective for certain workflows, depending on the data's value.



