Nothing too crazy, just downloading a dump, splitting it into manageable batches, and using a lightweight embedding model to vectorize each article.
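Roughly something like this, as a sketch (the model name and batch size are just placeholders, not necessarily what I used):

```python
# Rough sketch: embed one batch of articles with a small embedding model.
# "all-MiniLM-L6-v2" and batch_size=256 are placeholder choices.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any lightweight embedding model

def embed_batch(articles, batch_size=256):
    """articles: list of article texts from one slice of the dump."""
    return model.encode(
        articles,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
```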
Using the best GPU available on Colab, it takes maybe 8 hours if I remember correctly. Vectors can be saved as NPY files and loaded into something like FAISS for fast querying.
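The save/load side is about this simple (assuming `vectors` is the float32 array from the embedding step and that cosine similarity is the metric you want; the filenames are made up):

```python
# Persist each batch of vectors as NPY, then build a FAISS index from them.
import faiss
import numpy as np

np.save("batch_000.npy", vectors)            # one NPY file per batch

vectors = np.load("batch_000.npy").astype("float32")
faiss.normalize_L2(vectors)                   # normalize so inner product == cosine
index = faiss.IndexFlatIP(vectors.shape[1])   # flat index, exact search
index.add(vectors)
```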
I will probably make a post when I launch my app! For now I'm trying to figure out how to host the whole system for cheap, because I don't anticipate generating much revenue.
The end result is a recommendation algorithm, so basically just vector similarity search (with a bunch of other logic too ofc). The quality is great, and if anything a little bit of underfitting is desirable to avoid the “we see you bought a toilet seat, here’s 50 other toilet seats you might like” effect.
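The core search step is just a top-k query against the index; this toy version assumes the normalized `IndexFlatIP` from the earlier snippet and leaves out all the "other logic" (filtering, diversity, etc.):

```python
# Toy similarity search: return the k most similar articles to a given vector.
def recommend(article_vector, k=10):
    q = article_vector.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)                      # same normalization as the index
    scores, ids = index.search(q, k)
    return list(zip(ids[0], scores[0]))        # (article id, cosine similarity) pairs
```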
Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk ends up roughly the same size. Then I just do a mean pool of the chunk embeddings for each article.
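In code it looks roughly like this (a sketch, not my exact implementation: the sentence splitter, the whitespace token count, and the 256-token limit are all assumptions):

```python
# Sketch of dynamic chunking + mean pooling over chunk embeddings.
import math
import numpy as np
from nltk.tokenize import sent_tokenize  # needs nltk.download("punkt") once

def chunk_article(text, model_max_tokens=256):
    sentences = sent_tokenize(text)
    total_tokens = sum(len(s.split()) for s in sentences)   # crude token count
    n_chunks = max(1, math.ceil(total_tokens / model_max_tokens))
    target = math.ceil(total_tokens / n_chunks)              # per-article chunk size

    chunks, current, count = [], [], 0
    for s in sentences:
        current.append(s)
        count += len(s.split())
        if count >= target:                                  # close the chunk
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed_article(text):
    chunk_vecs = model.encode(chunk_article(text), convert_to_numpy=True)
    return chunk_vecs.mean(axis=0)                           # mean pool over chunks
```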