We ingest your data wherever you point our crawlers and then clean it for use in RAG pipelines or chained LLMs.
One library we like a lot is Trafilatura [1]. It does a great job of taking the full HTML page and returning the most semantically relevant parts.
It works well for LLM prompting as well as for generating embeddings and other downstream tasks.
[1] - https://trafilatura.readthedocs.io/en/latest/
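For anyone who hasn't tried it, a minimal sketch of the basic usage (the URL is just a placeholder; fetch_url and extract are the core calls from the Trafilatura docs):

```python
import trafilatura

# Placeholder URL for illustration.
url = "https://example.com/article"

# Download the raw HTML page.
downloaded = trafilatura.fetch_url(url)

if downloaded is not None:
    # extract() strips boilerplate (nav, ads, footers) and returns the main text.
    text = trafilatura.extract(downloaded)
    print(text)
```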
I use it nearly hourly for my HN summarizer HackYourNews (https://hackyournews.com).
https://aclanthology.org/2021.acl-demo.15/