
- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If tables too, might need some additional jiggery-pokery, but definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.

- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different, and because it's pretty easy: I'm just using pre-defined building blocks, not manually integrating stuff.

- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough

- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing

- Long-term: The notebook I wrote has a few dependencies (each of which has its own), but compared to others they're relatively lightweight.

- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.
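To make the exact word search point concrete, here's a rough sketch in plain Python (standard library only). It is not how Jina or the PDFSegmenter does it; the `pages` dict is hypothetical stand-in data for text already extracted from PDFs, and the function names are made up for illustration. It builds a tiny inverted index and returns pages that contain every query word, case-insensitively.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def exact_search(index, query):
    """Return ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical extracted text, standing in for real PDF segmenter output.
pages = {
    "doc1-p1": "The color of the sky",
    "doc2-p1": "The colour of the sea",
}
index = build_index(pages)

print(exact_search(index, "color"))
print(exact_search(index, "colour of the sea"))
```

Searching for "color" only finds the first page and misses "colour" entirely, which is the gap that fuzzier, embedding-based search is meant to close.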

I'm more into the search side than the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism).

[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...




Thanks for the detailed answer.

Most of my use cases deal with 10-100 small PDF documents, some with 1000-2000, but I don't want the solution to choke on 10GB of huge PDFs (I was just uploading those to Google Pinpoint). So Go or Rust for the backend should be a good fit.

By cross-platform frontend I meant web/ios/android/desktop. That probably leaves only Flutter, but I'm looking for plugins other than Syncfusion's to try. I know that sounds like overkill for many people (a website with search would suffice), but I already have cross-platform apps that would benefit from this functionality, and web is a fallback there, not the main option.


I know folks with thousands to millions of PDFs using the Jina framework and it works fine. I hear what you're saying about frontends and lightweight though. Jina doesn't come with any cross-platform frontends, though Jina NOW has a Streamlit interface that's responsive (so works across devices)



