We ingest your data wherever you point our crawlers and then clean it for use in RAG pipelines or chained LLMs.
One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.
It works well for LLM prompting as well as for generating embeddings and other downstream tasks.
[1] - https://trafilatura.readthedocs.io/en/latest/
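For anyone who hasn't tried it, a minimal sketch of the basic usage (the URL is just a placeholder; fetch_url and extract are the core calls from the Trafilatura docs):

```python
import trafilatura

# Placeholder URL for illustration.
url = "https://example.com/article"

# Download the raw HTML page.
downloaded = trafilatura.fetch_url(url)

if downloaded is not None:
    # extract() strips boilerplate (nav, ads, footers) and returns the main text.
    text = trafilatura.extract(downloaded)
    print(text)
```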
I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).
https://aclanthology.org/2021.acl-demo.15/