
I just tried this on all the papers I downloaded over the past couple months - cool stuff.

How well would this work in a production setting, e.g. when searching over millions of PDFs on arxiv (soon to be tens of millions)? Follow-up: have you tried using a vector database such as Milvus as the key piece of underlying infrastructure to avoid having to implement deletes, failover, scaling, etc? https://zilliz.com/learn/what-is-vector-database




For matching embeddings and running similarity search over text/images, folks are already using the framework (Jina) and getting decent results.
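For anyone wondering what "matching embeddings" boils down to, here is a minimal brute-force sketch in plain Python. It is not Jina's API: the `cosine` and `top_k` helpers and the toy `index` of 3-dimensional vectors are made up for illustration, and a real setup would use model-generated embeddings and an approximate nearest-neighbour index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query, index, k=3):
    """Score every document against the query and return the k best (doc_id, score) pairs."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Hypothetical toy index: doc id -> embedding (real embeddings have hundreds of dimensions).
index = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
```

The linear scan is fine for a folder of downloaded papers; at millions of documents this is the part a vector database or ANN index would replace.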

As for processing the PDFs and extracting that data, I don't know. That depends on a lot of factors, e.g. do you need to OCR the PDFs or can you just extract the text directly? Either way, it should be possible to write a module and then easily scale it up (Jina supports shards/replicas). Anyway, let me know. I'm in talks with folks about this kind of shitshow...uh...use case now.
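To make the shards idea concrete, here is a rough sketch of the general scatter-gather pattern: each shard holds part of the index and returns its own top hits, and the partial results are merged. This is only an illustration of the concept, not how Jina implements it; `search_shard`, `search_all`, the dot-product scoring, and the toy data are all assumptions.

```python
import heapq

def search_shard(shard, query, k):
    """Return the k best (score, doc_id) pairs from one shard, scored by dot product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), doc_id)
        for doc_id, vec in shard.items()
    )
    return heapq.nlargest(k, scored)

def search_all(shards, query, k=2):
    """Query every shard, then merge the partial results into a global top k."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partials)]

# Hypothetical index split across two shards.
shards = [
    {"doc1": [1.0, 0.0], "doc2": [0.2, 0.8]},
    {"doc3": [0.9, 0.1], "doc4": [0.0, 1.0]},
]

hits = search_all(shards, [1.0, 0.0], k=2)
```

In a real deployment the per-shard searches would run in parallel on separate machines, and replicas would be copies of the same shard used for throughput and failover.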

Jina supports multiple vector database backends, like Weaviate, Qdrant and others. For others (like Milvus), I'd suggest asking on the Slack [0]; responses tend to be fast.

[0] https://slack.jina.ai


We should probably try to implement a PDF search demo on top of Milvus... LOL



