I worked on a neural search engine just when deep networks were taking off, and we knew that it worked because we had test data that said certain documents were relevant for certain queries, so we could compute precision and recall curves. My experience was that if the AUC metric is substantially improved, customers really notice the difference.
Very few search vendors do this kind of testing because it is expensive and because enterprise customers seem to care more that there are connectors to 800+ external systems than if the search results are any good.
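For anyone who wants to try this kind of testing, here is a minimal sketch of the idea, assuming you have a ranked result list per query and a hand-labeled set of relevant documents. The function names and document ids are made up for illustration, and average precision is just one common way to summarize the area under a precision-recall curve; it is not necessarily the exact metric used above.

```python
def precision_recall_points(ranked_doc_ids, relevant_ids):
    """Return a (recall, precision) pair after each rank position."""
    relevant = set(relevant_ids)
    if not relevant:
        return []
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical judgments for one query: d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
curve = precision_recall_points(ranked, relevant)
score = average_precision(ranked, relevant)
```

Averaging that score over all test queries gives a single number you can compare before and after a model change.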
The main trouble I see with PDF search is that text extracted from PDF files is full of junk punctuation, including spaces, so if you are trying a bag-of-words based search the words are corrupted. Seems to me you could build a neural model that works around the brokenness of PDF, but that isn’t ‘download a model from spacy and pray’; it would be a big job that starts with getting 10 GB+ of PDF text.
I'll agree that there's quite a bit of junk punctuation in the extracted sentences (and sentence fragments), quite often from short footnotes in the Wiki articles. Getting "good" PDFs with open usage rights was a bit tricky, especially in a super simple PDF format. I ended up PDF-printing from Chrome.
Needless to say, working with PDFs makes me want to pull my hair out.
I also ended up writing the SpacySentencizer Executor instead of using a "vanilla" sentencizer. That led to consistent sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would be one sentence, not 5)
For testing, Jina allows you to swap out encoders with just a couple of lines of code, so trying different methods out should work just fine.
I dunno, you can download a million or so PDFs from arxiv.org and even more from archive.org. They aren't hard to find.
There is something to say for roundtripping PDFs from source you control (you can accurately model the corruption produced by a particular system) but you will certainly see new and different phenomena if you try more.
I'd agree that spacy's sentence segmentation is better than many of the alternatives.
If new and different phenomena means new kinds of corruption and downright weird behavior I'll end up having no hair left!
Even printing the same page to PDF with Chrome and Firefox delivers quite different results. Firefox was often combining "f" and "i" into the fi ligature [0], which totally changed the meaning of "finished", for example.
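One cheap way to cope with the ligature problem, sketched here with only the Python standard library, is to run extracted text through Unicode NFKC normalization before indexing. The function name is hypothetical and this is not what the notebook does; it only handles compatibility characters and whitespace, not the other kinds of junk mentioned in this thread.

```python
import unicodedata


def normalize_extracted_text(text):
    """Fold compatibility characters and collapse whitespace in extracted text."""
    # NFKC maps compatibility characters to their plain equivalents,
    # e.g. the single-codepoint ligature U+FB01 becomes the two letters "fi".
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (including non-breaking spaces) into one space.
    return " ".join(text.split())


cleaned = normalize_extracted_text("\ufb01nished  the\u00a0job")
```

Applying the same normalization to both the indexed text and the query means a search for "finished" can match regardless of which browser produced the PDF.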
Downloading a lot of random PDFs from arxiv would be great for making something battle-hardened and robust (and I'd love to get the chance to do it sometime) but I didn't have the time (or the remaining hair) to do it this time round.
And +1 to spaCy. I typically use it over Transformers because it's SO much faster. I just used Transformers in this example for a change. My Stack Overflow search notebook [0] uses spaCy.
Oh, nifty! This is more a demo of a PDF search engine that you could (in parts 1 thru x of the series) deploy to an intranet (for internal knowledge search) or internet (for general search), rather than a collaborative tool.
For handwritten/math symbols, I'm sure it wouldn't be too hard to integrate something. The Jina Flow [0] concept makes integrating new Executors [1] pretty easy.
Mathpix snip for pdf to Latex is excellent. Thank you for the free tier. It is helpful transcribing pdf math homework sets to use in the solution document without bugging the instructor for their source.
I just tried this on all the papers I downloaded over the past couple months - cool stuff.
How well would this work in a production setting, e.g. when searching over millions of PDFs on arxiv (soon to be tens of millions)? Follow-up: have you tried using a vector database such as Milvus as the key piece of underlying infrastructure to avoid having to implement deletes, failover, scaling, etc? https://zilliz.com/learn/what-is-vector-database
In terms of matching embeddings and performing similarity search on text/images - folks are already using the framework (Jina) for that and getting decent results.
In terms of processing the PDFs and extracting that data, idk. That depends on a lot of factors - e.g. do you need to OCR the PDFs or can you just extract text directly? Either way, it should be possible to write a module and then easily scale it up (Jina supports shards/replicas). Anyway, lemme know. I'm in talks with folks about this kind of shitshow...uh...use case now.
Jina supports multiple vector database backends, like Weaviate, Qdrant and others. For others (like Milvus), suggest you ask on the Slack [0] - responses tend to be fast.
Can anyone recommend how to build the following solution?
- Full-text search on modern era PDFs (i.e. no need for OCR)
- Exact word search would suffice (fuzzy/contextual search actually is less desirable)
- Cross-platform frontend part that highlights and jumps to the found text within the document. Frontend should be embeddable (i.e. not a SaaS or just standalone UI)
- As lightweight as possible (i.e. no Java, Python or Ruby)
I'm looking at Meilisearch or Bleve for indexing/backend, and Syncfusion Flutter PDF viewer for frontend, but it still needs a lot of gluing code and I would love to explore more options.
Google Pinpoint is pretty cool, and I use it a lot, but there is only a hosted Google version, plus it's too smart (still can't get it to do exact word search).
If you hadn't ruled out Python I'd be suggesting using Datasette + SQLite FTS - I've been building a whole bunch of different search engines on that (including ones for searching within OCRd PDF files) and the cost to host is trivial, since you just need to run a Python process somewhere with a binary SQLite database file. I usually use Vercel, Cloud Run or Fly for that.
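To give a feel for the SQLite FTS part, here is a minimal sketch using only Python's built-in `sqlite3` module rather than Datasette itself. It assumes your SQLite build includes the FTS5 extension (most recent Python builds do, but that is not guaranteed), and the table name, column names and sample rows are made up for illustration.

```python
import sqlite3

# One row per PDF page, so a hit can point back to a document and page number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(doc, page UNINDEXED, body)")
conn.executemany(
    "INSERT INTO pages (doc, page, body) VALUES (?, ?, ?)",
    [
        ("a.pdf", 1, "The colour of the sky at dusk"),
        ("b.pdf", 3, "Color theory basics for print"),
    ],
)


def search(conn, word):
    """Return (doc, page) rows whose body contains the given word."""
    # Wrap the term in double quotes so FTS5 treats it as a literal string
    # instead of parsing it as query syntax.
    query = '"' + word.replace('"', '""') + '"'
    return conn.execute(
        "SELECT doc, page FROM pages WHERE pages MATCH ? ORDER BY doc, page",
        (query,),
    ).fetchall()


hits = search(conn, "colour")
```

The default FTS5 tokenizer is case-insensitive but does no stemming or spelling folding, so "colour" and "color" stay separate, which is the exact-word behaviour asked for above.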
- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If tables too, might need some additional jiggery-pokery, but definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.
- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different. Also just because it's pretty easy since I'm just using pre-defined building blocks, not manually integrating stuff
- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough
- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing
- Long-term: The notebook I wrote has a few (each of which have their own), but compared to others they're relatively lightweight.
- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.
I'm more into the search side than the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism)
Most of my use cases have to deal with 10-100 small PDF documents, some with 1000-2000, but I don't want the solution to choke on 10GB of huge PDFs (I was just uploading those to Google Pinpoint). So Go or Rust for backend should be a good fit.
By cross-platform frontend I meant web/ios/android/desktop. It's probably only Flutter, but I'm looking for other plugins than Syncfusion's one to try. I know that sounds like overkill for many people (a website with search would suffice), but I already have cross-platform apps that would benefit from this functionality, and web is a fallback there, not the main option.
I know folks with thousands to millions of PDFs using the Jina framework and it works fine. I hear what you're saying about frontends and lightweight though. Jina doesn't come with any cross-platform frontends, though Jina NOW has a Streamlit interface that's responsive (so works across devices)
> - As lightweight as possible (i.e. no Java, Python or Ruby)
I don't have suggestions for you, but I do have a question regarding this point. Why wouldn't Java be considered lightweight? Java literally runs on your SIM card, which is a very bare-bones environment to run something on, I'd probably consider something like that pretty lightweight.
Ha, I'm from that generation of developers who have the mental model of what is actually happening on the hardware level when you run the program. Doesn't necessarily mean I overoptimize or think about struct field offsets or cache branching, but I do have this in my mental model and just can't unlearn it.
When I think about how much stuff needs to be moved in cpu/memory/io bus just to launch a simple "Hello, World" in Java - I just cannot accept it. I do realize that for large programs that overhead is small, but still the JVM concept is something I want to avoid as much as possible. Plus the sheer scale of the Java SDK and the amount of legacy and complexity behind it exceeds my threshold of "avoiding complexity" by orders of magnitude. And the nail in the coffin of the "no java" stance is, of course, experience with desktop Java applications. Consistently the worst UX and performance I've seen in 25 years among desktop apps.
Don't remind me of desktop Java. What was that toolkit, swing(?) that was used in all the apps back in the day. PDFs have a special place in Hell, but Java desktop UXen deserve a whole special circle
Getting the URI of original PDF would be straightforward enough - I could whack that into the code tomorrow with a few lines.
Opening up the correct page? I don't know of any standardized PDF reader that supports that kind of thing. And the format has such a history that even if it were supported (technically by Adobe - don't even get me started on what PDF readers support what formats), there's no guarantee the file itself would even have that cooked in.
Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.
Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.
Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.
I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.
I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.
Based on some testing just now, it looks like for synchronous mode single-page PDF is supported, but if you try to OCR a multi-page document, it throws:
An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format
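For reference, a minimal sketch of the synchronous call, assuming boto3's Textract client and the response shape as I understand it (a "Blocks" list whose entries carry a "BlockType" and, for lines and words, a "Text"). The helper name is hypothetical, and the client is passed in so the function can be tried without AWS credentials.

```python
def detect_lines(textract_client, document_bytes):
    """Run synchronous text detection and return the detected lines in order."""
    response = textract_client.detect_document_text(
        Document={"Bytes": document_bytes}
    )
    # Blocks come back as PAGE, LINE and WORD entries; keep whole lines only.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


# Assumed usage with real credentials configured:
#   import boto3
#   with open("page.png", "rb") as f:
#       lines = detect_lines(boto3.client("textract"), f.read())
```

For multi-page PDFs, my understanding is that you need the asynchronous, S3-based operations (`start_document_text_detection` followed by `get_document_text_detection`), or you split the document into single pages first.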
Normally, I just use pdfsandwich [0] for PDFs, which has been good enough generally, but I'm tempted to switch to using textract, because it's been much better at rough scans for my tests.
1. PDFSegmenter (in the notebook) - extract the images of the text (yup, it does images too)
2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images
3. Actually splice the text chunks together to be what you'd expect - that's the tricky part. Even text splitting over pages can be tricky to reassemble properly. PDFs are a pain in the butt frankly.
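To illustrate why step 3 is tricky, here is a naive sketch of the splicing, assuming each page arrives as a plain string. The function name is hypothetical and this is not the notebook's approach. It rejoins words hyphenated across a line or page break and treats remaining newlines as soft wraps, which also means it wrongly merges genuine hyphenated compounds that happen to break at a line end.

```python
import re


def splice_pages(pages):
    """Join per-page text into one string, undoing line-break hyphenation."""
    text = "\n".join(page.strip() for page in pages)
    # "en-\ngine" -> "engine": a hyphen directly before a newline is assumed
    # to be a line-break hyphen, not part of the word.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Any remaining newline is treated as a soft wrap and becomes a space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text


joined = splice_pages(["The search en-", "gine indexes\nevery page."])
```

Headers, footers, footnotes and multi-column layouts all break this simple approach, which is where most of the real work goes.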
The version in the notebook is just for simple text-based PDFs. I wrote some posts on our company blog[1] about the sheer agonies of dealing with PDF as a data format, so wanted to stick with as simple as possible for now.
That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.
Don't. PDF is a terrible format for storing machine-readable data. You lose a ton of information when you create the PDF, which you then painstakingly have to get back later (if that's even possible).
I may have misworded it (if I wrote those words - PDF rots the brain and my memory likewise).
Agreed on the rest. PDFs don't store machine-readable data. Often just pixelated scanned hot garbage dumpster fire text.
I hate PDFs but have to work with the satanforsaken things. Hence the notebook. It's my little way of trying to give my little PDF-bespoked-hellscape a tiny little glow-up.
I probably didn’t read your comment closely enough. When I hear about PDF parsing or PDF as data I immediately get flashbacks from a project years ago where I had to parse PDF files. I think I am still traumatized by this experience so whenever I hear somebody wants to do this I just want to scream “Nooo. Don’t do this”
Incidentally Jina Hub [0] has a few OCR Executors [1][2] you could integrate into my notebook (though you'd have to do some rewiring to take images into account since it's a text-based notebook)
Wow, this post really took off! If anyone wants to read some of my blog posts on building PDF search engines (and the pain, torment and anguish that it causes) read:
Great stuff, I went down the rabbit hole of building something similar for synthesizing flash cards + Q/A pairs from textbook PDFs about a year ago, and I would also emphasize that PDF search is a janky nightmare to get within the ballpark of usability :')
I feel your pain my brother(?) [0] in suffering. That's why I started simple in the notebook. Even trying to go a little more complex just leads to exponential rabbit holes and footguns.
[0] based on typical HN demographics, no assumptions here