Agreed!

Apify's Website Content Crawler[0] does a decent job of this for most websites, in my experience. It lets you extract content via different built-in methods (e.g. Extractus[1]).

We currently use this at Magic Loops[2] and it works _most_ of the time.

The long tail is difficult, though. It's not uncommon for users to fall back to raw HTML and then have our tool write custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).
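
For a rough idea of what that "custom logic" fallback can look like, here is a minimal sketch using only Python's standard library. It is not what Magic Loops or Apify actually run; the `TextExtractor` class, the `extract_text` helper, and the choice of `<article>` as the target tag are all assumptions for illustration. It keeps only the visible text inside one tag, which also tends to shrink the page a lot before it goes anywhere near a context window.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text inside one target tag, skipping script/style.

    Hypothetical helper for illustration only.
    """

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self, target_tag="article"):
        super().__init__()
        self.target_tag = target_tag
        self.target_depth = 0  # > 0 while inside the target tag
        self.skip_depth = 0    # > 0 while inside a script/style/noscript tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.target_depth += 1
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.target_depth > 0:
            self.target_depth -= 1
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the target and outside skipped tags.
        if self.target_depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html, target_tag="article"):
    """Return the whitespace-joined text found inside `target_tag`."""
    parser = TextExtractor(target_tag)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Menu</nav>"
        "<article><h1>Title</h1><script>var x = 1;</script>"
        "<p>Hello <b>world</b></p></article>"
        "<footer>Foot</footer></body></html>"
    )
    print(extract_text(page))
```

In practice the hard part is picking the right tag or selector per site, which is exactly where the long tail bites.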

Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but I'm not sure how much is public yet.

[0] https://apify.com/apify/website-content-crawler

[1] https://github.com/extractus/article-extractor

[2] https://magicloops.dev/

[3] https://reworkd.ai/