
Nothing too crazy, just downloading a dump, splitting it into manageable batch sizes, and using a lightweight embedding model to vectorize each article. On the best GPU available on Colab it takes maybe 8 hours, if I remember correctly. The vectors can be saved as NPY files and loaded into something like FAISS for fast querying.
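
A minimal sketch of the save-and-query step described above, not the commenter's actual code. Random vectors stand in for the embedding model's output, and the batch size, dimension, file naming, and helper names (`save_batch`, `load_all`, `search`) are all assumptions made for illustration. It uses FAISS's exact inner-product index if `faiss` is installed and falls back to a brute-force NumPy search otherwise.

```python
import os
import tempfile

import numpy as np

try:
    import faiss  # optional; the NumPy fallback below is used if it is missing
except ImportError:
    faiss = None


def normalize(vectors):
    """L2-normalize rows so that inner product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return (vectors / np.clip(norms, 1e-12, None)).astype(np.float32)


def save_batch(vectors, directory, batch_id):
    """Write one batch of embeddings to its own NPY file (hypothetical naming)."""
    path = os.path.join(directory, f"batch_{batch_id:05d}.npy")
    np.save(path, vectors.astype(np.float32))
    return path


def load_all(directory):
    """Load every batch file in name order and stack them into one matrix."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".npy"))
    return np.concatenate([np.load(os.path.join(directory, n)) for n in names])


def search(vectors, queries, k):
    """Return (scores, ids) of the k most similar rows for each query."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    queries = np.ascontiguousarray(queries, dtype=np.float32)
    if faiss is not None:
        index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
        index.add(vectors)
        return index.search(queries, k)
    scores = queries @ vectors.T
    ids = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, ids, axis=1), ids


# Stand-in data: 3 batches of 100 random 64-dimensional "article embeddings".
rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    for batch_id in range(3):
        save_batch(normalize(rng.normal(size=(100, 64))), tmp, batch_id)
    vectors = load_all(tmp)

# Query with the first five stored vectors; each should match itself first.
scores, ids = search(vectors, vectors[:5], k=3)
```

A flat index does an exact scan, which is simple but linear in the number of vectors. For a full dump an approximate index may be worth it, but that is a tuning decision this sketch does not cover.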



This probably deserves its own article and might be of interest to the HN community.


I will probably make a post when I launch my app! For now I’m trying to figure out how I can host the whole system for cheap, because I don’t anticipate generating much revenue.


I'm interested in hearing about what you will be hosting.

Would DigitalOcean or Hetzner meet your needs?


I was going to use EC2 and S3. Should I look at DigitalOcean or Hetzner instead?


What is the end task (e.g. RAG, or just vector search for question answering)? Are you satisfied with the quality of the results?


The end result is a recommendation algorithm, so basically just vector similarity search (with a bunch of other logic too ofc). The quality is great, and if anything a little bit of underfitting is desirable to avoid the “we see you bought a toilet seat, here’s 50 other toilet seats you might like” effect.


Did you chunk the articles? If so, in what way?


Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk is roughly the same size. Then I just do a mean pool of the chunk vectors for each article.
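
A rough sketch of that chunking scheme, under several assumptions: the regex sentence splitter and whitespace token count are crude stand-ins for whatever the real pipeline uses, and the function names are made up for illustration. The per-article target is computed by dividing the article's total length evenly across the fewest chunks that fit under the maximum.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(sentence):
    """Whitespace word count as a stand-in for the model's tokenizer."""
    return len(sentence.split())


def chunk_article(text, max_tokens):
    """Greedily pack sentences into chunks of roughly equal size."""
    sentences = split_sentences(text)
    counts = [count_tokens(s) for s in sentences]
    total = sum(counts)
    if total == 0:
        return []
    # Fewest chunks that fit under max_tokens, then an even per-chunk target.
    n_chunks = math.ceil(total / max_tokens)
    target = math.ceil(total / n_chunks)
    chunks, current, size = [], [], 0
    for sentence, n in zip(sentences, counts):
        # Close the chunk when the next sentence would push it past the target.
        # A single sentence longer than the target still becomes its own chunk.
        if current and size + n > target:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def mean_pool(chunk_vectors):
    """Average the chunk vectors element-wise into one article vector."""
    dim = len(chunk_vectors[0])
    return [sum(v[i] for v in chunk_vectors) / len(chunk_vectors) for i in range(dim)]


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_article(text, max_tokens=7)
article_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

With `max_tokens=7` the 12-word example needs two chunks, so the target becomes 6 words per chunk instead of one chunk of 6 and a smaller leftover.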


Thanks for the answer.

Wikipedia has a lot of tables so I was wondering if content-aware sentence chunking would be good enough for Wikipedia.

https://www.pinecone.io/learn/chunking-strategies/


mwparserfromhell can parse the text content without including tables.


How big is the resulting vector data?


Roughly 8 GB.



