Show HN: DataFuel.dev – Turn websites into LLM-ready data (datafuel.dev)
43 points by sachou 26 days ago | 34 comments
Just launched DataFuel.dev on Product Hunt last Sunday, and I landed in the top 3!

I built this API after working on an AI chatbot builder.

Scraping can be a pain, but we need clean markdown data for fine-tuning or doing RAG with new LLMs.

DataFuel API helps you transform websites into LLM-ready data. I've already got my first paying users.

Would love your feedback to improve my product and my marketing!




I'm noticing a big increase in crawling activity on the sites I manage, likely from bots collecting data for LLMs. Most of them don't use proper user agents and of course don't stick to any scraping best practices that the industry has developed over the past two decades.

This trend is creating a lot of headaches for developers responsible for maintaining heavily scraped sites.

related:

- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182


Projects like this should recognize that if a site's robots.txt contains a long list of Disallow entries for other AI scrapers, they are probably not welcome to scrape either.

(of course this project doesn't do that)
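Respecting that signal is not much code, either. A rough sketch using the robots-parser npm package (the package choice and user agent string are mine, purely for illustration):

    import robotsParser from "robots-parser";

    // Returns false if the site's robots.txt disallows this URL for our bot.
    async function mayScrape(url: string, userAgent = "DataFuelBot"): Promise<boolean> {
      const robotsUrl = new URL("/robots.txt", url).href;
      const res = await fetch(robotsUrl);
      if (!res.ok) return true; // no robots.txt is commonly treated as "allowed"
      const robots = robotsParser(robotsUrl, await res.text());
      return robots.isAllowed(url, userAgent) ?? true;
    }

Checking whether the file already disallows a long list of other AI crawlers (GPTBot, CCBot, etc.) and backing off would be a reasonable extra step on top of this.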


Good point! Thanks for the feedback.

Let me add that to my todos.


It boggles my mind that you would launch without that as a prime directive.


OP just graciously accepted that feedback, no need to be condescending :)


Let me translate what OP wrote:

> Good point! Thanks for the feedback.

> Nothing like this will be added to the product. Money comes from scraping content, so content will be scraped regardless of any anti-scraping hints, and we will be actively working on countering anti-scraping measures.


Well... while true, how do you reconcile this with OP's statement in another thread?

> It has an extensive proxy IP and retry system in place to bypass bot detection.

Seems like a bit of "talking out of both sides of your mouth".


It's kind of tone-deaf to launch a tool like this without considering that in the current climate. Not a popular take on Hacker News, but everyone outside the tech space is pretty pissed about this stuff.


And proxy farms exist solely to get around this problem. If you believe the rights of content creators are the be-all and end-all, don't complain the next time Disney tries to extend IP expiration dates.


Using the behavior of one bad actor to excuse the abuse of everyone else is pretty bad.


I was recently on a project where, out of the 10+ devs on it, I was the only one who really knew about robots.txt, or at least the only one who pointed out that our robots.txt needed to handle internationalized routes: the default paths we disallowed were all in English.

I'm not saying that makes them bad developers; they just knew other things. So it doesn't boggle my mind that someone launched a product like this without taking robots.txt into consideration and then added it to the todos when someone complained.
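To make the internationalized-routes point concrete, the fix looked something like this (paths invented for illustration):

    User-agent: *
    Disallow: /admin/
    Disallow: /search/
    # Localized variants of the same routes need their own entries:
    Disallow: /fr/admin/
    Disallow: /fr/recherche/
    Disallow: /de/admin/
    Disallow: /de/suche/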


Programmers have no institutional memory.


Agreed. I often had to explain that lower-level simple stuff was there for them, because they didn't happen to know about it and were surprised.


I thought this might be interesting to share and potentially useful for the author of Datafuel as a comparison. I recently built something similar for a small app [1].

I use Bun.js's fetch to crawl pages, process them with Mozilla's Readability (via JSDOM), and convert the cleaned content to Markdown using Turndown. I also strip href attributes from links since they're unnecessary for my use case, and I don't recurse into links. My implementation is basic, with minimal error handling and pretty dumb content trimming to stay within prompt token limits, which could use improvement! I also found this Python library that seems a lot fancier than what I need, but also a lot more powerful [2].
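The core of that flow looks roughly like this (a simplified sketch, not the exact code in [1]; error handling and trimming omitted):

    import { JSDOM } from "jsdom";
    import { Readability } from "@mozilla/readability";
    import TurndownService from "turndown";

    async function pageToMarkdown(url: string): Promise<string> {
      const html = await (await fetch(url)).text();
      // Readability needs a Document; JSDOM provides one outside the browser.
      const dom = new JSDOM(html, { url });
      const article = new Readability(dom.window.document).parse();
      const turndown = new TurndownService();
      // Strip href attributes: render links as their plain text content.
      turndown.addRule("stripLinks", {
        filter: "a",
        replacement: (content) => content,
      });
      return turndown.turndown(article?.content ?? "");
    }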

I'm curious where a solution like Datafuel excels, especially since it already has customers. Off the top of my head, the real complexity in scraping appears when you process a sizable number of URLs regularly, and it becomes more of a background processing / scheduling problem.

I feel like something like Datafuel could see wider adoption if it were nicely put together as a library to crawl locally; then, if you find yourself crawling regularly and want to delegate the scheduling of those crawls, you could buy into the service: "ping me back when these 10_000 URLs are done crawling", or something like that.
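Something like this hypothetical API shape (endpoint and field names are invented):

    // Submit a batch crawl; get pinged on a webhook when it's done.
    await fetch("https://api.datafuel.example/v1/crawls", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: "Bearer <token>",
      },
      body: JSON.stringify({
        urls: ["https://example.com/a", "https://example.com/b"], // up to 10_000
        webhook_url: "https://myapp.example.com/crawl-complete",
      }),
    });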

--

1: https://github.com/EmmanuelOga/plangs2/blob/main/packages/ai...

2: https://github.com/adbar/trafilatura


Yes, exactly.

The main issues in scraping:

- If you scrape a lot, you will be blocked based on your IP; you need to use proxies.

- Scraping an entire website needs specific logic, retries, and more.

- It becomes a heavy background job.

All of the above takes time, so if scraping is not a core feature of your business, it is likely better to outsource it.
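To make the first two points concrete, a minimal retry-with-proxy-rotation sketch (the proxy pool is hypothetical; the `proxy` option is a Bun-specific fetch extension, other runtimes need an agent):

    // Hypothetical proxy pool; real pools come from a provider.
    const PROXIES = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"];

    async function fetchWithRetry(url: string, maxAttempts = 4): Promise<string> {
      let lastError: unknown;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          // Rotate proxies across attempts; `proxy` is Bun-specific.
          const res = await fetch(url, { proxy: PROXIES[attempt % PROXIES.length] });
          if (res.status === 403 || res.status === 429) throw new Error(`blocked: ${res.status}`);
          if (!res.ok) throw new Error(`HTTP ${res.status}`);
          return await res.text();
        } catch (err) {
          lastError = err;
          // Exponential backoff before retrying.
          await new Promise((r) => setTimeout(r, 2 ** attempt * 500));
        }
      }
      throw lastError;
    }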

Good job doing it tho!


Some ideas:

Highlight the advantages of your service over DIY solutions prominently on your marketing site. The site looks great, but I think it could better focus on convincing developers to adopt your product rather than just listing features.

Consider reaching out to clients to quantify the time saved using your service. Emphasize how it eliminates the hassle of setting up custom background job processes, proxies, and other complexities that can snowball into a full-fledged project.

Good luck on your journey!


Great, congrats on your launch.

1. Does it take care of bot detection? Most sites will have it.

2. Is this similar to Firecrawl? https://www.firecrawl.dev/


Yes, it has an extensive proxy IP and retry system in place to bypass bot detection.

I’m also trying to gather more feedback to identify the killer feature:

- Adding vectorization to Pinecone out of the box?

- Adding multiple integrations like n8n, etc.?

Any crucial pain points to avoid?


Are you concerned about making a product that does this? The legal aspect of accessing a computer system that intends to block your use seems worrisome.


It is the responsibility of the user. Everyone should be responsible for their own actions. We still allow knives to be sold, and most people use them for good.


Now imagine that knife stabbings became so common that almost everyone started wearing body armor, and you started explicitly selling body-armor-defeating knives. I can honestly see why most people would be upset about that.


I don't see that as a good analogy. There's very limited space for this functionality to be used legitimately / legally - anyone permitted to scrape content is likely able to access the data without the protection measures in the way.

I'm fairly sure circumvention is a (prosecuted!) crime in several countries - curious if you're across that angle, and/or have legal advice/direction you can share?


> Please make it easy for users to try your thing out, preferably without having to sign up, get a confirmation email, and other such barriers. You'll get more feedback that way, plus HN users get ornery if you make them jump through hoops. https://news.ycombinator.com/item?id=22336638

> Off topic: blog posts, sign-up pages, newsletters, lists, and other reading material. Those can't be tried out, so can't be Show HNs. Make a regular submission instead. https://news.ycombinator.com/showhn.html

This looks like quite an interesting project, but Show HNs need to be usable without sign-up pages.


Noted, thank you for the nice reminder. Good point. I'll add more free tools and an open playground.


I am interested, but why should I use this one over Jina AI Reader (which is also free), or Firecrawl, or the ten other Puppeteer + Readability + Turndown pipelines (or even an AWS Lambda doing the same)? This is not sarcastic; I am genuinely looking for something fresh in the field.


Do you need to embed it directly into Pinecone?

If yes, then DataFuel is the right choice. I'm adding this feature as we speak.
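For context, the out-of-the-box flow would look roughly like this with the official Pinecone TypeScript client (index name, metadata shape, and the embed() helper are placeholders):

    import { Pinecone } from "@pinecone-database/pinecone";

    const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
    const index = pc.index("datafuel-docs"); // hypothetical index name

    // Placeholder: swap in a real embedding call (OpenAI, Cohere, etc.).
    async function embed(text: string): Promise<number[]> {
      throw new Error("wire up an embedding model here");
    }

    // `chunks` would come from the scraped markdown.
    async function upsertChunks(chunks: { id: string; text: string; url: string }[]) {
      const vectors = await Promise.all(
        chunks.map(async (c) => ({
          id: c.id,
          values: await embed(c.text),
          metadata: { url: c.url, text: c.text },
        })),
      );
      await index.upsert(vectors);
    }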

Please let me know :)


Interesting but we process documents before embedding them, and have specific requirements for the embedder.

Having developed a couple of page-to-markdown converters myself, I think the bigger challenge is making sense of the many pages that rely on spatial organisation of information that only makes sense to humans, or even on the presence of images. One way to do it is to render the page as an image and extract data with a vision LLM. But you need a heuristic for when to do classic extraction and when to use vision, plus you have to get rid of cookie banners and overlays. This is more complex and costly, but it has real business value for whoever can pull it off.
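The heuristic could be as simple as comparing how much text classic extraction recovers against how much text the rendered page actually shows (a sketch; the thresholds are guesses to tune per corpus):

    // Fall back to the costlier vision-LLM path when Readability-style
    // extraction recovers suspiciously little of the rendered text.
    function needsVisionExtraction(extractedText: string, renderedTextLength: number): boolean {
      const recovered = extractedText.trim().length;
      const ratio = recovered / Math.max(renderedTextLength, 1);
      return recovered < 500 || ratio < 0.3;
    }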


What would be your specific requirements?

Right now I'm adding chunk size and choice of embedding model; what else?

Images are a great challenge; as you mentioned, they can be handled with OCR or vision extraction.


We, like many players, have custom embedding pipelines. We don't split docs based on chunk size but do semantic chunking and chunk augmentation. We embed everything with two embedding services so we always have a fallback if one provider is unavailable.

If I were in your shoes, I would not treat embedding and inserting into a vector store as my responsibility, especially since there are so many different stores on the market.
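A minimal sketch of the fallback half of that (provider functions are placeholders; our actual pipeline embeds with both providers up front):

    type Embedder = (texts: string[]) => Promise<number[][]>;

    // Vectors from different models live in different spaces, so the caller
    // must track which provider produced each batch (e.g. separate indexes).
    async function embedWithFallback(
      texts: string[],
      primary: Embedder,
      secondary: Embedder,
    ): Promise<{ provider: "primary" | "secondary"; vectors: number[][] }> {
      try {
        return { provider: "primary", vectors: await primary(texts) };
      } catch {
        return { provider: "secondary", vectors: await secondary(texts) };
      }
    }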


> Scraping can be a pain, but we need clean markdown data for fine-tuning or doing RAG with new LLM models.

And normally it's still a pain even if you sign up for a scraping service, and I don't see how this will be different.


Will this benefit sites or internal wikis that have well-written content, good search, and SEO? I interviewed at a few companies where managers apparently use AI as an excuse to implement text search.


I guess so. If your goal is to have people know about your content, you might see a small SEO bump.


This is a pretty crowded market, e.g. Brightdata (most feature-complete), Firecrawl (focused on API and SDK), etc.


Yes, exactly. I'm working on building a competitive advantage, but the AI space is so big and only getting bigger.

Any feedback that could help DataFuel become more unique?



