Running OCR against PDFs and images directly in the browser

aabhay · 2024-03-30T20:29:01 1711830541

I was really impressed until I realized that the app is basically a wrapper around tesseract.js, which is the actually cool part. Tesseract has a wasm port that can operate inside of a webworker.

Not saying that the article was being misleading about this, just saying that the LLM part is basically doing some standard interfacing and HTML/CSS/JS around that core engine, which wasn’t immediately obvious to me when scanning the screenshots.

simonw · 2024-03-30T20:58:59 1711832339

The LLM part is almost irrelevant to the final result to be honest: I used LLMs to help me build an initial prototype in five minutes that would otherwise have taken me about an hour, but the code really isn't very complex.

The point here is more about highlighting that browsers can do this stuff, and it doesn't take much to wire it all together into a useful interface.

authorfly · 2024-03-31T07:00:14 1711868414

Simon - hope you don't mind me commenting on you in third person in relation to the above. Simon is a great explainer, but I wish he would credit the underlying technology or library (like tesseract.js) a bit more upfront, like you.

It matters in this case because for tesseract, the exact model is incredibly important. For example, v4 is pretty bad (but what is available on most linux distros when ran serverside) whereas v5 is decent. So I would have had a more accurate interest in this post if it was a bit more upfront that "Tesseract.js lets you run OCR against PDFs fairly quickly now, largely because of better processors we as devs have, not because of any real software change in the last 2-3 years".

I felt this before for his NLP content too - but clearly it works because he's such a great explainer and one for teasing content later that you do read it! I must say I've never been left confused by Simons work.

simonw · 2024-03-31T14:52:15 1711896735

I was pretty careful to credit Tesseract.js - it's linked at the top of the tool itself https://tools.simonwillison.net/ocr and prominently in the article I wrote: https://simonwillison.net/2024/Mar/30/ocr-pdfs-images/

What else should I have done?

kuschkufan · 2024-03-31T17:52:08 1711907528

It's all subjective. Reading your linked blog made it perfectly clear for me you built this using tesseract.js. No idea what the other guys are complaining about.

authorfly · 2024-04-05T06:03:05 1712296985

I drafted a few longer responses about the feeling that I had after reading your post after that headline but they were a bit unavoidably asshole-y! So really, just something like "with Tesseract.js" in the headline is all I would think could be helpful to people on sites like HN. I do like your writing. But I do enjoy knowing what I'm reading specifically if possible, when it's tech.

targafarian · 2024-03-31T13:48:03 1711892883

You act like you were misled, but the article, within the first few sentences, says he realized the tools are available to do this (including naming tesseract.js explicitly!), he just needed to glue them together. Then he details how he does that, and only then mentions he used an LLM to help him in that process. The author's article title is equally not misleading.

Was an earlier headline or subtitle here on HN what was misleading, but then that was changed to not be misleading?

spullara · 2024-03-31T03:12:52 1711854772

Using the built-in browser OCR is usually much better but it is still behind an experimental API.

giovannibonetti · 2024-03-31T00:08:30 1711843710

In the same vein, I'm building a tool [1] to extract tables from PDFs (no OCR yet) and spreadsheets. The end goal is to make it easy to combine data from multiple sources by joining tables and produce some useful reports out of it.

The PDF parsing is done by the excellent PDFplumber Python library [2], the web app is built with Elixir's Phoenix framework and it is all hosted on Fly.io.

[1] https://data-tools.fly.dev [2] https://github.com/jsvine/pdfplumber

serjester · 2024-03-31T00:28:51 1711844931

I recently built a similar tool except it’s configured to use some deep learning libraries for the table extraction. I’m excited to integrate unitable which has state of the art performance later this week.

I built this because most of the basic layout detection libraries have terrible performance on anything non trivial. Deep learning is really the long term solution here.

https://github.com/Filimoa/open-parse

pants2 · 2024-03-31T05:43:31 1711863811

This is extremely cool and exactly what I've been looking for. Looking forward to trying it out.

codazoda · 2024-03-30T20:31:31 1711830691

This is timely. I just completed a few experiments and wrote a little about doing OCR on my handwritten notes.

https://notes.joeldare.com/handwritten-text-recognition

Tesseract was one of the tools I tested, although I used the CLI instead of the WASM version.

martin82 · 2024-04-01T00:47:04 1711932424

Link does not seem to work

ignoramous · 2024-03-30T21:26:33 1711833993

Wow, this is promising. I tried on a few poorly scanned papers I've lying about. A few observations:

1. Pre-process PDF images to detect letters better?

2. Use LLMs to spell/grammar check and perhaps even auto-complete missing pieces?

3. Employ rich text to capture style (ex: lexical.dev)?

Unsure if it is feasible to bundle it all up for web.

See also: https://github.com/RajSolai/TextSnatcher / https://github.com/VikParuchuri/surya

yjftsjthsd-h · 2024-03-31T01:38:10 1711849090

> Use LLMs to spell/grammar check and perhaps even auto-complete missing pieces?

I would really want human review. Remember that copier that changed digits because it was being clever with compression?

simonw · 2024-03-30T21:56:53 1711835813

I've been trying out alternative versions of this that pass images through to e.g. the Claude 3 vision models, but they're harder to share with people because they need an API key!

euazOn · 2024-03-30T21:57:43 1711835863

In case you wanted to add a pre-processing step, I found this ImageMagick script useful: https://www.fmwconcepts.com/imagemagick/textcleaner/index.ph...

Not sure how difficult it is to run it in the browser, though.

CharlesW · 2024-03-30T22:51:33 1711839093

FYI, cert is expired.

kordlessagain · 2024-03-31T13:05:25 1711890325

Here's an EasyOCR service: https://github.com/MittaAI/mitta-community/tree/main/service.... It will run locally, or leverage GPUs if they are available.

A PDF to image processor is being built and should be out in a few weeks

No docs, but happy to help anyone wanting to use either. Email is kord @ the company I'm working on.

Oras · 2024-03-30T22:47:24 1711838844

This is nice but Tesseract does not perform well when it comes to tables, at least when I tried it on multiple documents.

It would miss some cells from a table, or does not recognise all the numbers when they have commas.

simonw · 2024-03-30T22:55:05 1711839305

Tables are still the big unsolved problem for me.

There are a ton of potential tools out there like Tabula and AWS Textract table mode but none of them have felt like the perfect solution.

I've been trying Gemini Pro 1.5 and Claude 3 Opus and they looked like they worked... but in both cases I spotted them getting confused and copying in numbers form the wrong rows.

I think the best I've tried is the camera import mode in iOS Excel! Just wish there was an API for calling that one programmatically.

CharlesW · 2024-03-30T23:27:51 1711841271

Out of curiosity have you tried ocrs by Robert Knight? https://github.com/robertknight/ocrs

simonw · 2024-03-30T23:33:33 1711841613

No I hadn't heard of that one!

f_k · 2024-03-31T12:50:16 1711889416

If you're on Windows try https://table2xl.com (disclosure: I'm the founder), it's more accurate than Excel's camera import. No API though.

lennxa · 2024-03-31T11:11:12 1711883472

Would this be helpful? https://github.com/facebookresearch/nougat

Seems like it can handle tables.

constantinum · 2024-04-02T02:26:49 1712024809

As mentioned above, give Unstract a try: https://github.com/Zipstack/unstract

maCDzP · 2024-03-31T06:02:43 1711864963

I think the camera import on Excel MacOS works pretty well. You could probably call that version through an API.

sumedh · 2024-03-30T23:11:00 1711840260

Google and Azure have their own PDF Table extraction service but I have noticed Textract is a bit better.

fbdab103 · 2024-03-30T22:16:09 1711836969

The example on the Tesseract.js page shows it highlighting the rectangles of where the selected text originated. Does this level of information get surfaced through the library for consumption?

I just grabbed a two-column academic PDF, which performed as well as you would expect. If I was returned a json list of text + coordinates, I could do some dirty munging (eg footer is anything below this y index, column 1 is between these x ranges, column 2 is between these other x ranges) to self-assemble it a bit better.

simonw · 2024-03-30T22:21:51 1711837311

Yes it does, but I've not dug into the more sophisticated parts of the API at all yet. I'm using it in the most basic way possible right now:

    const {data: {text}} = await worker.recognize(imageUrl);

kgbcia · 2024-03-30T19:59:53 1711828793

I was thinking of doing something like this for visually impaired users. The next step is to pipe it into the JavaScript web speech synthesis API.

https://mdn.github.io/dom-examples/web-speech-api/speak-easy...

mwcampbell · 2024-03-30T21:05:03 1711832703

We already have our own tools for that, either integrated into screen readers or available as add-ons. Thanks for the thought, though.

joeevans1000 · 2024-04-02T20:37:00 1712090220

I am the primary caregiver for a blind academic and I can tell you that any work in this area by you will be more than welcome, by both he and I. There is a lot of progress to be made in this area. There are tools, but they are primitive to where it is headed (hopefully in part to your efforts).

constantinum · 2024-04-01T19:03:09 1711998189

There is a Open source tool that does document(pdf) processing at an enterprise scale. https://github.com/Zipstack/unstract

reliablereason · 2024-03-30T20:23:28 1711830208

Safari already does that. Quite a useful feature.

minimaxir · 2024-03-30T20:37:24 1711831044

Specifically, only Apple Silicon allows automatic OCR. Works on iOS too.

pvg · 2024-03-30T21:23:29 1711833809

It works on Intel Safari as well.

minimaxir · 2024-03-30T21:34:28 1711834468

It doesn’t work on my Intel Macs unless it’s really slow.

__jonas · 2024-03-31T00:19:36 1711844376

It works for me on an Intel MPB (2020) but it's probably a lot slower.

On this page: https://en.wikipedia.org/wiki/Typeface it takes almost ~10 seconds for the text in the first image to become selectable after page load.

With local images / PDFs in Preview it's really quick though

pvg · 2024-03-31T00:10:52 1711843852

Maybe it is! I mostly use it to select text in social media images for which it feels reasonably responsive.

Noumenon72 · 2024-03-31T02:27:02 1711852022

Neat.

You can also use MacOS's OCR capability to create a shortcut that allows you to copy and paste the text out of any region on the screen -- for example, a stack trace someone is showing you in a screen share.

https://apple.stackexchange.com/a/468362

CaffeinatedDev · 2024-03-30T22:10:58 1711836658

This is cool! I've also used tesseract OCR and found it to be pretty amazing in terms of speed and accuracy.

I use it for ingest of image and pdf type files for my own website chatting tool: tinydesk.ai!

I run the backend on an express js server so all js as well.

Smaller docs I do on the client side, but larger ones (>1.5mb) I've found take forever so those process in the backend.

ein0p · 2024-03-31T06:22:21 1711866141

Tesseract is way outdated though, to the point of being borderline useless when compared to alternatives. What’s the current deep learning based FOSS SOTA, does anyone know? I want something that does what FineReader does - create a high quality searchable text underlay for scanned PDFs.

voisin · 2024-03-31T00:26:26 1711844786

Is there something I can run on my Mac that will systematically OCR every PDF on my drive for easy searching?

simonw · 2024-03-31T00:48:15 1711846095

My s3-ocr tool could do that with quite a bit of extra configuration.

https://github.com/simonw/s3-ocr

You would need to upload them all to S3 first though, which is a bit of a pain just to run OCR (that's Textract's fault).

You could try stitching together a bunch of scripts to run the CLI version of Tesseract locally.

xrd · 2024-03-31T11:17:16 1711883836

If it's s3 and the url is configurable, you could probably run minio inside a docker container locally and save a few minutes?

simonw · 2024-03-31T19:32:41 1711913561

Sadly to use AWS Textract with PDFs you have to push to an AWS S3 bucket and then pass the bucket and file name to the Textract API.

alexwilde · 2024-03-31T01:57:00 1711850220

This is cool! I built something similar but it's CLI based. https://github.com/lifeiswilde/textract-ai

pyuser583 · 2024-03-31T04:55:31 1711860931

This behavior really freaks me out. It raises the potential that hostile urls will be introduced.

smusamashah · 2024-03-31T00:10:34 1711843834

The amazing thing here is that this tool is almost all compiled using LLM. This is very exciting. I have been using GPT-4 a lot lately to make tiny utilities. Things I wouldn't have even tried because of how much effort it takes to get started on those simple things.

I always wanted to make a chrome extension for one thing or another, but all the learning involved around the boilerplate always drained the motivation. But with GPT I built the initial POC in an hour and then polished and published it on store even. Recently I compiled some bash and cmd helper scripts, I don't know either of these enough (do know some bash) and don't have it in me to learn them. Specially the windows batch scripts. Using LLM it was matter of an hour to write a script for my need as either a windows batch script or even bash script.

Oh I even used GPT it to write 2-3 AutoHotKey scripts. LLMs are amazing. If you know what you are looking for, you can direct them to your advantage.

Very exciting to see that people are using LLMs similarly to build things they want and how they want.

zzz999 · 2024-03-30T23:28:03 1711841283

Does that run locally without the need to share information?

simonw · 2024-03-30T23:32:08 1711841528

Yes. Nothing leaves your browser.

jgalt212 · 2024-03-31T11:38:21 1711885101

Are these paid posts?

simonw · 2024-03-31T22:12:42 1711923162

What gave you that idea?

jgalt212 · 2024-04-01T00:15:13 1711930513

This quote.

> The LLM part is almost irrelevant to the final result to be honest

simonw · 2024-04-01T02:44:08 1711939448

Why would that suggest I'm being paid by anyone?

Oh I think I see. No, I'm not being paid to promote LLMs.

The point of my blog post was two-fold: first, to introduce the OCR tool I built. And second, to provide yet another documented example of how I use LLMs in my daily development work.

The tool doesn't use LLMs itself, they were just a useful speed-up in building it.

It's part of a series of posts, see also: https://simonwillison.net/tags/aiassistedprogramming/

jgalt212 · 2024-04-01T13:38:23 1711978703

My reasoning: There are hundreds of billions of dollars at stake getting the wider world to embrace LLMs on a long-term basis. If I were a VC, or Nvidia/OpenAI/MS marketing person, I'd be paying trusted names such as yourself to post about using LLMs. That coupled with the loose link between OCR and LLMs in your latest post created an itch I thought was worthy of a scratch.

> No, I'm not being paid to promote LLMs.

This is good enough for me, thanks.

_ache_ · 2024-03-31T07:34:50 1711870490

For now, I'm still using OCRmyPDF as it maybe slow but incredible usefull.

The files become big but it just works.

If an alternative is quicker / lighter I will use it but it must just works.