Search PDFs with Transformers and Python Notebook (colab.research.google.com)
132 points by alexcg1 on July 25, 2022 | 51 comments



Does it really work better than a simple tf-idf?

I worked on a neural search engine just as deep networks were taking off, and we knew that it worked because we had test data saying certain documents were relevant for certain queries, so we could compute precision and recall curves. My experience was that if the AUC metric was substantially improved, customers really noticed the difference.
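
Standing up a tf-idf baseline to compare against is cheap, at least; a minimal sketch with scikit-learn (the documents and query here are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["text of pdf one ...", "text of pdf two ...", "text of pdf three ..."]
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)

    # Rank documents for a query; score this ranking against your labeled
    # (query, relevant document) pairs and compare with the neural model's
    # precision/recall curves.
    query_vector = vectorizer.transform(["some query"])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    print(scores.argsort()[::-1])  # document indices, best match first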

Very few search vendors do this kind of testing, because it is expensive and because enterprise customers seem to care more that there are connectors to 800+ external systems than whether the search results are any good.

The main trouble I see with PDF search is that text extracted from PDF files is full of junk punctuation, including stray spaces, so if you are trying a bag-of-words search the words themselves are corrupted. Seems to me you could build a neural model that works around the brokenness of PDF, but that isn’t ‘download a model from spaCy and pray’; it would be a big job that starts with getting 10 GB+ of PDF text.


I'll agree that there's quite a bit of junk punctuation in the extracted sentences (and sentence fragments), quite often from short footnotes in the Wiki articles. Getting "good" PDFs with open usage rights was a bit tricky, especially in a super simple PDF format. I ended up PDF-printing from Chrome.

Needless to say, working with PDFs makes me want to pull my hair out.

I also ended up writing the SpacySentencizer Executor instead of using a "vanilla" sentencizer. That led to consistent sentence splitting (so "J.R.R. Tolkien turned to pg. 3" would be one sentence, not 5).
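
The difference is easy to see with spaCy's model-based segmentation (as opposed to the rule-based Sentencizer); a minimal sketch, assuming the small English model is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("J.R.R. Tolkien turned to pg. 3 and kept reading.")

    # The statistical model treats "J.R.R." and "pg." as abbreviations,
    # so this prints one sentence; a naive split on "." would give five
    # fragments.
    print([sent.text for sent in doc.sents])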

For testing, Jina allows you to swap out encoders with just a couple of lines of code, so trying different methods out should work just fine.


I dunno, you can download a million or so PDFs from arxiv.org and even more from archive.org. They aren't hard to find.

There is something to be said for round-tripping PDFs from a source you control (you can accurately model the corruption produced by a particular system), but you will certainly see new and different phenomena if you try more sources.

I'd agree that spacy's sentence segmentation is better than many of the alternatives.


If "new and different phenomena" means new kinds of corruption and downright weird behavior, I'll end up with no hair left!

Even printing the same page to PDF with Chrome and Firefox delivers quite different results. Firefox was often combining "f" and "i" into the fi ligature [0], which totally changed the underlying text of words like "finished", for example.
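
If anyone else hits this: Unicode NFKC normalization folds most ligatures back to plain letters, which makes a cheap first pass before indexing. A minimal sketch:

    import unicodedata

    s = "\ufb01nished"  # "finished" with the U+FB01 fi ligature
    print(unicodedata.normalize("NFKC", s))  # -> "finished"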

Downloading a lot of random PDFs from arxiv would be great for making something battle-hardened and robust (and I'd love to get the chance to do it sometime) but I didn't have the time (or the remaining hair) to do it this time round.

[0] https://www.compart.com/en/unicode/U+FB01


And +1 to spaCy. I typically use it over Transformers because it's SO much faster. I just used Transformers in this example for a change. My Stack Overflow search notebook [0] uses spaCy.

[0] https://colab.research.google.com/github/jina-ai/workshops/b...


Mathpix Snip also supports PDF search, including for handwritten content, and including math symbols in equations.

Disclaimer: I’m the founder.


Oh, nifty! This is more a demo of a PDF search engine that you could (in parts 1 thru x of the series) deploy to an intranet (for internal knowledge search) or the internet (for general search), rather than a collaborative tool.

For handwritten/math symbols, I'm sure it wouldn't be too hard to integrate something. The Jina Flow [0] concept makes integrating new Executors [1] pretty easy.

I LOVE the testimonials on the site btw!

[0] https://docs.jina.ai/fundamentals/flow/

[1] https://docs.jina.ai/fundamentals/executor/


Mathpix Snip for PDF-to-LaTeX is excellent. Thank you for the free tier. It is helpful for transcribing PDF math homework sets to use in the solution document without bugging the instructor for their source.


I just tried this on all the papers I downloaded over the past couple months - cool stuff.

How well would this work in a production setting, e.g. when searching over millions of PDFs on arxiv (soon to be tens of millions)? Follow-up: have you tried using a vector database such as Milvus as the key piece of underlying infrastructure to avoid having to implement deletes, failover, scaling, etc? https://zilliz.com/learn/what-is-vector-database


In terms of matching embeddings and performing similarity search on text/images - folks are already using the framework (Jina) for that and getting decent results.

In terms of processing the PDFs and extracting the data: idk. That depends on a lot of factors - e.g. do you need to OCR the PDFs, or can you just extract text directly? Either way, it should be possible to write a module and then easily scale it up (Jina supports shards/replicas). Anyway, lemme know. I'm in talks with folks about this kind of shitshow... uh... use case now.

Jina supports multiple vector database backends, like Weaviate, Qdrant and others. For the rest (like Milvus), I suggest you ask on the Slack [0] - responses tend to be fast.
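
To give a flavor, swapping the storage backend looks roughly like this with DocArray (a rough sketch; the config keys and values here are illustrative, check the DocArray docs for the exact schema):

    from docarray import DocumentArray

    # Same DocumentArray API as in-memory, but vectors are stored in
    # Qdrant. Host, port and n_dim are illustrative values.
    da = DocumentArray(
        storage="qdrant",
        config={"host": "localhost", "port": 6333, "n_dim": 768},
    )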

[0] https://slack.jina.ai


We should probably try to implement a PDF search demo on top of Milvus.. LOL


Can anyone recommend how to build the following solution?

- Full-text search on modern era PDFs (i.e no need for OCR)

- Exact word search would suffice (fuzzy/contextual search is actually less desirable)

- Cross-platform frontend part that highlights and jumps to the found text within the document. Frontend should be embeddable (i.e. not a SaaS or just standalone UI)

- As lightweight as possible (i.e. no Java, Python or Ruby)

- Long-term oriented stack (i.e. minimum dependencies, ideally promise of compatibility)

I'm looking at Meilisearch or Bleve for the indexing/backend, and the Syncfusion Flutter PDF viewer for the frontend, but that still needs a lot of glue code and I would love to explore more options.

Google Pinpoint is pretty cool, and I use it a lot, but there is only the hosted Google version, plus it's too smart (I still can't get it to do an exact word search).


If you hadn't ruled out Python I'd suggest Datasette + SQLite FTS - I've built a whole bunch of different search engines on that (including ones for searching within OCRd PDF files) and the cost to host is trivial, since you just need to run a Python process somewhere with a binary SQLite database file. I usually use Vercel, Cloud Run or Fly for that.

One example of a search engine I've built like this is the one on the Datasette website: https://datasette.io/-/beta?q=fts - I wrote about how that works here: https://simonwillison.net/2020/Dec/19/dogsheep-beta/
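
The core pattern is tiny. A minimal FTS5 sketch (the table, file name and query are placeholders):

    import sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(path, body)")
    conn.execute(
        "INSERT INTO pages VALUES (?, ?)",
        ("report.pdf", "text extracted from the PDF ..."),
    )
    conn.commit()

    # Full-text match, best results first
    for (path,) in conn.execute(
        "SELECT path FROM pages WHERE pages MATCH ? ORDER BY rank", ("extracted",)
    ):
        print(path)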


Interesting, thanks! I'll take a look (datasette is amazing).


- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If you need tables too, it might take some additional jiggery-pokery, but it's definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.

- Exact word search - pretty simple. I've focused on the more advanced stuff because color vs colour is same same but different. It's also pretty easy since I'm using pre-defined building blocks, not manually integrating stuff.

- Cross-platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways, so it can't be too tough.

- Lightweight? I mean, how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing.

- Long-term: The notebook I wrote has a few dependencies (each of which has its own), but compared to the alternatives they're relatively lightweight.

- Glue code: I've been using pre-existing building blocks, and writing new Executors (i.e. new building blocks) is relatively straightforward; scaling them up with shards, replicas, etc. is just a parameter away.

I'm more into the search side than the PDF stuff. The PDF side I've experienced through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism).

[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...


Thanks for the detailed answer.

Most of my use cases deal with 10-100 small PDF documents, some with 1000-2000, but I don't want the solution to choke on 10 GB of huge PDFs (I was just uploading those to Google Pinpoint). So Go or Rust for the backend should be a good fit.

By cross-platform frontend I meant web/iOS/Android/desktop. It's probably only Flutter, but I'm looking for plugins other than Syncfusion's to try. I know that sounds like overkill to many people (a website with search would suffice), but I already have cross-platform apps that would benefit from this functionality, and the web is a fallback there, not the main option.


I know folks using the Jina framework with thousands to millions of PDFs, and it works fine. I hear what you're saying about frontends and keeping things lightweight, though. Jina doesn't come with any cross-platform frontends, though Jina NOW has a Streamlit interface that's responsive (so it works across devices).


> - As lightweight as possible (i.e. no Java, Python or Ruby)

I don't have suggestions for you, but I do have a question regarding this point. Why wouldn't Java be considered lightweight? Java literally runs on your SIM card, which is a very bare-bones environment to run something on; I'd consider something like that pretty lightweight.


Ha, I'm from that generation of developers who have a mental model of what is actually happening at the hardware level when you run a program. That doesn't necessarily mean I over-optimize or think about struct field offsets or cache and branch behavior, but I do have it in my mental model and just can't unlearn it.

When I think about how much stuff needs to be moved through the CPU/memory/IO bus just to launch a simple "Hello, World" in Java - I just cannot accept it. I do realize that for large programs that overhead is small, but the JVM concept is still something I want to avoid as much as possible. Plus, the sheer scale of the Java SDK and the amount of legacy and complexity behind it exceeds my threshold for avoiding complexity by orders of magnitude. And the nail in the coffin of my "no Java" stance is, of course, experience with desktop Java applications. Consistently the worst UX and performance I've seen among desktop apps in 25 years.


Don't remind me of desktop Java. What was that toolkit, Swing(?), that was used in all the apps back in the day? PDFs have a special place in Hell, but Java desktop UXen deserve a whole special circle.


PDF history is pretty amazing, actually. The fact that PDF survived over so many decades is something worth reflecting upon :)


Cholera has survived for thousands of years. Doesn't mean I want to deal with it :)

Seriously though, I agree. The fact that a file format can stick around that long is impressive.


pdfgrep, with some formatting to add links that open the correct page?


Getting the URI of the original PDF would be straightforward enough - I could whack that into the code tomorrow with a few lines.

Opening up the correct page? I don't know of any standardized PDF reader support for that kind of thing. And the format has such a history that even if it were supported (technically, by Adobe - don't even get me started on which PDF readers support which parts of the format), there's no guarantee the file itself would even have that cooked in.


Okay, but what if my PDF is non-parseable? Not sure if Python's any good for that.


Mathpix PDF search is fully visually powered and does not use underlying PDF metadata, even working on handwriting. It’s a great choice for researchers (especially in STEM) who want to build a searchable archive of PDFs.


Amazon Textract does a phenomenal job of extracting text from dodgy scanned PDFs - I've been running it against scanned typewritten text and even handwritten journal text from the 1880s with great results.

I built a tool for running OCR against every PDF in an S3 bucket (which costs about $1.50/thousand pages) here: https://simonwillison.net/2022/Jun/30/s3-ocr/


Textract really does do a good job of balancing cost, ease of use, and quality, at least for my hobbyist needs.

I was inspired by another recent comment you posted on HN, and after some testing of the Textract console [0] I wrote a simple "local only" command-line version [1] (Python, boto3) that does similar things to your tool.

I used my tool to OCR a few hundred comic strip images I've been meaning to OCR for a while now - the service did beautifully where other tools I've tried in the past struggled with the handwritten text on the comics. Textract is fast enough that running serially was fine for a one-off without involving the more parallelized S3 workflow.

[0] https://us-east-1.console.aws.amazon.com/textract/home?regio...

[1] https://github.com/mbafford/textract-cli


That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.


Based on some testing just now, it looks like synchronous mode supports single-page PDFs, but if you try to OCR a multi-page document, it throws:

   An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format
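
For reference, the synchronous call that throws this looks roughly like the following sketch (per the error, the document bytes have to be an image or a single-page PDF):

    import boto3

    textract = boto3.client("textract")

    with open("page-1.pdf", "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})

    # Multi-page PDFs need the async S3-based API instead:
    # start_document_text_detection / get_document_text_detection.
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            print(block["Text"])
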
Normally I just use pdfsandwich [0] for PDFs, which has generally been good enough, but I'm tempted to switch to Textract, because it's been much better at rough scans in my tests.

[0] http://www.tobias-elze.de/pdfsandwich/


In that case I'd use:

1. PDFSegmenter (in the notebook) to extract the images along with the text (yup, it does images too)

2. An OCR Executor [0][1] from Jina Hub [2] to extract the text from the images

3. Actually splicing the text chunks back together into what you'd expect - that's the tricky part. Even text split across pages can be tricky to reassemble properly. PDFs are a pain in the butt, frankly.
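
Wired together as a Flow, that would look roughly like this sketch (the names after jinahub:// are illustrative stand-ins for the Executors linked below, and step 3 would be a custom Executor of your own):

    from jina import Flow, Document

    flow = (
        Flow()
        .add(uses="jinahub://PDFSegmenter", name="segmenter")
        # illustrative name; see the OCR Executor links below
        .add(uses="jinahub://OCRExecutor", name="ocr", replicas=2)
        # step 3 (splicing chunks back into readable text) would be
        # a custom Executor of your own
    )

    with flow:
        flow.post(on="/index", inputs=Document(uri="scanned.pdf"))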

[0] https://hub.jina.ai/executor/78yp7etm

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai


Wow, this is a really cool way to build full-fledged search, and in a notebook at that!

Does it work end-to-end with PDF as a data structure, or do we have to use OCR and parse the text first to be able to search it? Really curious.


The version in the notebook is just for simple text-based PDFs. I wrote some posts on our company blog [1] about the sheer agonies of dealing with PDF as a data format, so I wanted to stick with something as simple as possible for now.

That said, I'm planning future notebooks where you can perform text-to-image or image-to-image search, integrate OCR, scale it up, serve it, deploy it, etc.

[1] https://medium.com/jina-ai


Awesome, will be on the lookout for that!


We've got quite a few other notebooks for other kinds of search on the blog. Would love to hear your thoughts!


Under the hood, it uses https://github.com/pdfminer/pdfminer.six which expects the text to be stored as text.
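
i.e. roughly this under the hood; a minimal sketch of pdfminer.six's high-level API:

    from pdfminer.high_level import extract_text

    # Only works when the PDF actually embeds text; scanned,
    # image-only pages come back empty.
    text = extract_text("document.pdf")
    print(text[:500])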


You mean the PDFSegmenter Executor in the notebook?


Yes


PDFSegmenter also extracts images, which can then be OCR'ed in the next step of the pipeline.


"PDF as a data structure"

Don't. PDF is a terrible format for storing machine-readable data. You lose a ton of information when you create the PDF, which you then painstakingly have to get back later (if that's even possible).


I may have misworded it (if I wrote those words - PDF rots the brain and my memory likewise).

Agreed on the rest. PDFs don't store machine-readable data. Often just pixelated scanned hot garbage dumpster fire text.

I hate PDFs but have to work with the satan-forsaken things. Hence the notebook. It's my little way of trying to give my bespoke PDF hellscape a tiny little glow-up.


I probably didn't read your comment closely enough. When I hear about PDF parsing or PDF-as-data I immediately get flashbacks to a project years ago where I had to parse PDF files. I think I am still traumatized by that experience, so whenever I hear somebody wants to do this I just want to scream "Nooo. Don't do this!"


I think you and I should start a support group!


Incidentally, Jina Hub [0] has a few OCR Executors [1][2] you could integrate into my notebook (though you'd have to do some rewiring to take images into account, since it's a text-based notebook).

[0] https://hub.jina.ai/

[1] https://hub.jina.ai/executor/w4p7905v

[2] https://hub.jina.ai/executor/78yp7etm


Congratulations Alex, super cool!


Thanks man!


Nice to meet another person in the super-obvious-username club


Wow, this post really took off! If anyone wants to read some of my blog posts on building PDF search engines (and the pain, torment and anguish they cause), read:

- https://medium.com/jina-ai/building-an-ai-powered-pdf-search...

- https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...

- https://medium.com/jina-ai/search-pdfs-with-ai-and-python-pa...


Great stuff, I went down the rabbit hole of building something similar for synthesizing flash cards + Q/A pairs from textbook PDFs about a year ago, and I would also emphasize that PDF search is a janky nightmare to get within the ballpark of usability :')


I feel your pain my brother(?) [0] in suffering. That's why I started simple in the notebook. Even trying to go a little more complex just leads to exponential rabbit holes and footguns.

[0] based on typical HN demographics, no assumptions here


Pardon me while I go add Optimus Prime to my corporate letterhead.



