Nothing too crazy, just downloading a dump, splitting it into manageable batches, and using a lightweight embedding model to vectorize each article.
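Roughly something like this, as a sketch (the model name and batch size are just placeholders, not necessarily what I used):

```python
# Rough sketch: embed one batch of articles with a small embedding model.
# "all-MiniLM-L6-v2" and batch_size=256 are placeholder choices.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any lightweight embedding model

def embed_batch(articles, batch_size=256):
    """articles: list of article texts from one slice of the dump."""
    return model.encode(
        articles,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
    )
```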
Using the best GPU available on Colab, it takes maybe 8 hours if I remember correctly. Vectors can be saved as NPY files and loaded into something like FAISS for fast querying.
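The save/load side is about this simple (assuming `vectors` is the float32 array from the embedding step and that cosine similarity is the metric you want; the filenames are made up):

```python
# Persist each batch of vectors as NPY, then build a FAISS index from them.
import faiss
import numpy as np

np.save("batch_000.npy", vectors)            # one NPY file per batch

vectors = np.load("batch_000.npy").astype("float32")
faiss.normalize_L2(vectors)                   # normalize so inner product == cosine
index = faiss.IndexFlatIP(vectors.shape[1])   # flat index, exact search
index.add(vectors)
```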
I will probably make a post when I launch my app! For now I'm trying to figure out how to host the whole system for cheap, because I don't anticipate generating much revenue.
The end result is a recommendation algorithm, so basically just vector similarity search (with a bunch of other logic too ofc). The quality is great, and if anything a little bit of underfitting is desirable to avoid the “we see you bought a toilet seat, here’s 50 other toilet seats you might like” effect.
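The core search step is just a top-k query against the index; this toy version assumes the normalized `IndexFlatIP` from the earlier snippet and leaves out all the "other logic" (filtering, diversity, etc.):

```python
# Toy similarity search: return the k most similar articles to a given vector.
def recommend(article_vector, k=10):
    q = article_vector.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)                      # same normalization as the index
    scores, ids = index.search(q, k)
    return list(zip(ids[0], scores[0]))        # (article id, cosine similarity) pairs
```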
Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk ends up roughly the same size. Then I just do a mean pool of the chunk embeddings for each article.
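In code it looks roughly like this (a sketch, not my exact implementation: the sentence splitter, the whitespace token count, and the 256-token limit are all assumptions):

```python
# Sketch of dynamic chunking + mean pooling over chunk embeddings.
import math
import numpy as np
from nltk.tokenize import sent_tokenize  # needs nltk.download("punkt") once

def chunk_article(text, model_max_tokens=256):
    sentences = sent_tokenize(text)
    total_tokens = sum(len(s.split()) for s in sentences)   # crude token count
    n_chunks = max(1, math.ceil(total_tokens / model_max_tokens))
    target = math.ceil(total_tokens / n_chunks)              # per-article chunk size

    chunks, current, count = [], [], 0
    for s in sentences:
        current.append(s)
        count += len(s.split())
        if count >= target:                                  # close the chunk
            chunks.append(" ".join(current))
            current, count = [], 0
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed_article(text):
    chunk_vecs = model.encode(chunk_article(text), convert_to_numpy=True)
    return chunk_vecs.mean(axis=0)                           # mean pool over chunks
```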