I thought this might be interesting to share and potentially useful for the author of Datafuel as a comparison. I recently built something similar for a small app [1].
I use Bun.js's fetch to crawl pages, process them with Mozilla’s Readability (via JSDOM), and convert the cleaned content to Markdown using Turndown. I also strip href attributes from links since they’re unnecessary for my use case, and I don't recurse into links. My implementation is basic, with minimal error handling and pretty dumb content trimming to stay within the prompt token limit, which could use improvement! I also found this Python library that seems a lot fancier than what I need, but also a lot more powerful [2].
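For anyone curious, the pipeline is roughly the following. This is a minimal sketch, not the exact code from [1]; the package names are the usual npm ones and the character-based trimming is a stand-in for my actual (equally dumb) logic:

```ts
// Sketch: Bun fetch -> Readability (via JSDOM) -> Turndown -> trim.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

const turndown = new TurndownService();
// Drop hrefs: render links as their text content only.
turndown.addRule("strip-links", {
  filter: "a",
  replacement: (content) => content,
});

async function pageToMarkdown(url: string, maxChars = 8_000): Promise<string | null> {
  const res = await fetch(url); // Bun's built-in fetch
  if (!res.ok) return null;
  const html = await res.text();

  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article?.content) return null;

  const markdown = turndown.turndown(article.content);
  // Naive trimming to stay under a prompt budget; a real token counter would be better.
  return markdown.length > maxChars ? markdown.slice(0, maxChars) : markdown;
}
```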
I’m curious where a solution like Datafuel excels, especially since it already has customers. Off the top of my head, the real complexity in scraping appears when you process a sizable number of URLs regularly and it becomes more of a background processing / scheduling problem.
I feel like something like Datafuel could see more adoption if it were nicely packaged as a library to crawl locally; then, if you find yourself crawling regularly and want to delegate the scheduling of those crawls, you could buy into the service: "ping me back when these 10_000 URLs are done crawling", or something like that.
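Something along these lines is the shape I have in mind. This is purely hypothetical; none of the endpoints or field names come from Datafuel's actual API:

```ts
// Hypothetical "delegate the crawl, get pinged back" flow.
// api.example.com and the payload fields are made up for illustration.
const tenThousandUrls: string[] = [/* ... */];

const job = await fetch("https://api.example.com/crawl-jobs", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    urls: tenThousandUrls,                        // the batch to crawl
    webhook: "https://myapp.example.com/crawled", // "ping me back here when done"
  }),
}).then((res) => res.json());

console.log(job.id); // poll or wait for the webhook
```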
- If you scrape a lot, you will get blocked based on your IP; you need to use proxies
- Scraping an entire website needs specific logic, retries, and more
- It becomes a heavy background job
All of the above takes time, so if scraping is not a core feature of your business, it is likely better to outsource it; the sketch below gives a feel for the plumbing you would otherwise end up writing yourself.
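A minimal retry-with-backoff wrapper, rotating through a proxy pool, is roughly what the DIY route accumulates. The proxy URLs are placeholders, and the `proxy` fetch option is a Bun-specific extension:

```ts
// Illustrative only: retry with exponential backoff, rotating through a proxy pool.
// Bun's fetch accepts a non-standard `proxy` option; the pool below is a placeholder.
const proxies = ["http://proxy-1:8080", "http://proxy-2:8080"];

async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url, { proxy: proxies[i % proxies.length] });
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    await Bun.sleep(2 ** i * 1_000); // back off before the next attempt
  }
  throw lastError;
}
```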
Highlight the advantages of your service over DIY solutions prominently on your marketing site. The site looks great, but I think it could better focus on convincing developers to adopt your product rather than just listing features.
Consider reaching out to clients to quantify the time saved by using your service. Emphasize how it eliminates the hassle of setting up custom background job processing, proxies, and other complexities that can snowball into a full-fledged project.
--
1: https://github.com/EmmanuelOga/plangs2/blob/main/packages/ai...
2: https://github.com/adbar/trafilatura