
For use in retrieval/RAG, an emerging paradigm is to not parse the PDF at all.

By using a multi-modal foundation model, you convert visual representations ("screenshots") of the PDF directly into searchable vector representations.

Paper: ColPali: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449

Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)
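
Roughly, the pipeline looks like this. A minimal sketch, assuming the pdf2image and colpali-engine packages and the vidore/colpali-v1.2 checkpoint (those specifics are mine; the paper and blog post describe the approach, not this exact code):

    # Embed PDF page screenshots with a vision language model and score
    # them against a text query -- no text extraction anywhere.
    import torch
    from pdf2image import convert_from_path          # requires poppler
    from colpali_engine.models import ColPali, ColPaliProcessor

    model_name = "vidore/colpali-v1.2"                # assumed checkpoint
    model = ColPali.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    processor = ColPaliProcessor.from_pretrained(model_name)

    # 1. Render each PDF page to an image.
    pages = convert_from_path("report.pdf", dpi=150)  # placeholder file

    # 2. Embed the page images (multi-vector: one embedding per image patch).
    with torch.no_grad():
        page_embeddings = model(**processor.process_images(pages).to(model.device))

    # 3. Embed the query and score it against every page.
    with torch.no_grad():
        query_embeddings = model(
            **processor.process_queries(["quarterly revenue by region"]).to(model.device)
        )

    scores = processor.score_multi_vector(query_embeddings, page_embeddings)
    best_page = scores.argmax(dim=1)                  # highest-scoring page per query

Each page becomes a bag of patch-level vectors rather than a single embedding, so the scoring step is ColBERT-style late interaction (MaxSim) instead of one cosine similarity.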




I do something similar in my file-renamer app (sort.photos if you want to check it out):

1. Render the first two pages of the PDF into a JPEG, offline, in the Mac app.

2. Upload the JPEG to ChatGPT Vision and ask what would be a good file name for the document.

It works surprisingly well.
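
In Python terms the flow is roughly this. It's a rough sketch of the same two steps, not the app's actual implementation (which is native Mac code); it assumes pdf2image and the OpenAI Python SDK with a vision-capable model such as gpt-4o:

    import base64
    import io
    from openai import OpenAI
    from pdf2image import convert_from_path

    client = OpenAI()

    # 1. Render the first pages of the PDF to a JPEG, locally.
    pages = convert_from_path("scan.pdf", dpi=120, first_page=1, last_page=2)
    buf = io.BytesIO()
    pages[0].save(buf, format="JPEG")   # just the first page here for brevity
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    # 2. Ask the vision model for a file name suggestion.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Suggest a short, descriptive file name for this document."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)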


I'm sure this will change over time, but I have yet to see an LMM that performs (on average) as well as decent text extraction pipelines.

Text embeddings over extracted text also have much better recall in my tests.


No multi-modal model is really ready for that yet. The accuracy of dedicated tools for extracting tables and text is far superior.


You have detractors, but this is the future.


Is anyone actually having success with this approach? If so, how and with what models (and prompts)?


Claude.ai handles tables very well, at least in my tests. It could easily convert a table from a financial document into a markdown table, among other things.
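
For what it's worth, the same thing works over the API rather than the claude.ai UI. A rough sketch using the Anthropic Python SDK; the model name and file path are placeholder assumptions:

    import base64
    import anthropic

    client = anthropic.Anthropic()

    # Screenshot of a table from a financial document (placeholder path).
    with open("balance_sheet.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",   # assumed model alias
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text",
                 "text": "Convert the table in this image to a Markdown table."},
            ],
        }],
    )
    print(message.content[0].text)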



