
One of my projects is a virtual agency of multiple LLMs for a variety of back-office services (copywriting, copy-editing, social media, job ads, etc.).

We ingest your data from wherever you point our crawlers and then clean it for use in RAG pipelines or chained LLMs.

One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.

It works well for LLM input as well as for generating embeddings and other downstream tasks.

[1] - https://trafilatura.readthedocs.io/en/latest/
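
For anyone curious what that looks like in practice, here is a rough sketch, not our actual pipeline. It assumes `trafilatura` is installed and uses its `extract()` call to pull the main text out of raw HTML. The `extract_main_text` and `chunk_text` helpers, and the 1000-character chunk size, are made up for illustration; the chunker is just one simple way to prepare text for an embedding model.

```python
def extract_main_text(html: str) -> str:
    """Return the main text of an HTML page, or "" if nothing was extracted."""
    # Imported lazily so chunk_text() below can be used without trafilatura.
    import trafilatura

    # extract() returns None when it cannot find any main content.
    text = trafilatura.extract(html, include_comments=False)
    return text or ""


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars characters.

    Hypothetical helper: a single paragraph longer than max_chars is kept
    whole rather than split mid-sentence.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n")):
        if not para:
            continue  # skip blank lines
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Typical use would be `chunk_text(extract_main_text(html))`, with `html` coming from your own crawler or from `trafilatura.fetch_url(url)`, and each chunk then passed to whatever embedding model you use.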




+1 for Trafilatura. Simple, no fuss, and rarely breaks.

I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).


The paper is good, too, for understanding how it works. The author also mentions many related tools in it.

https://aclanthology.org/2021.acl-demo.15/



