Did you chunk the articles? If so, in what way?

Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk is roughly the same size. Then I just mean-pool the chunk embeddings for each article.
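
For concreteness, here's a minimal sketch of that scheme in Python. The regex sentence splitter, the whitespace word count used as a token proxy, and the all-MiniLM-L6-v2 model are my assumptions, not details from the comment above:

    import re
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def chunk_sentences(text, max_tokens=256):
        # Split on sentence-ending punctuation (crude but dependency-free).
        sentences = re.split(r"(?<=[.!?])\s+", text)
        # Derive a per-article budget so all chunks come out roughly equal:
        # ceil(total / max_tokens) chunks, each holding about
        # ceil(total / n_chunks) words.
        total = sum(len(s.split()) for s in sentences)
        n_chunks = max(1, -(-total // max_tokens))
        budget = -(-total // n_chunks)
        chunks, current, count = [], [], 0
        for sent in sentences:
            n = len(sent.split())
            if current and count + n > budget:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sent)
            count += n
        if current:
            chunks.append(" ".join(current))
        return chunks

    def embed_article(text, model):
        # Embed each chunk, then mean-pool into a single article vector.
        vectors = model.encode(chunk_sentences(text))
        return np.asarray(vectors).mean(axis=0)

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    article_vector = embed_article("Some article text. More text here.", model)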


Thanks for the answer.

Wikipedia has a lot of tables, so I was wondering whether content-aware sentence chunking would be good enough for Wikipedia.

https://www.pinecone.io/learn/chunking-strategies/


mwparserfromhell can parse out the text content without including the tables.
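
A quick sketch of how that might look; extract_prose is a hypothetical helper name, but parse, filter_tags, remove, and strip_code are real mwparserfromhell calls:

    import mwparserfromhell

    def extract_prose(wikitext):
        # Parse the raw wikitext into a node tree.
        code = mwparserfromhell.parse(wikitext)
        # Wikitext tables ({| ... |}) are parsed as Tag nodes with tag
        # "table"; drop them before stripping the remaining markup.
        for table in code.filter_tags(matches=lambda node: node.tag == "table"):
            code.remove(table)
        # strip_code() removes templates, links, and formatting, leaving prose.
        return code.strip_code(normalize=True, collapse=True)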
