Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project that involved OCRing tens of thousands of spreadsheet screenshots and ingesting them into a Postgres database. I tried other CPU-based OCR methods (since macOS and Nvidia still don't play nice together) such as Tesseract, but found the output to be incorrect too often. The Vision framework not only produced the highest-quality output I had seen, it also used the least amount of compute. It was fairly unstable, but I can chalk that up to user error w/ my implementation.
I used a combination of RhetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700k hackintosh.
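Roughly, the approach looks like this. This is a minimal sketch, not the gist's exact code; it assumes pyobjc with the Vision and Quartz bindings installed, and the file path is just an example:

    import Quartz
    import Vision
    from Cocoa import NSURL

    def ocr_image(path):
        # Load the image and hand it to a Vision text-recognition request.
        image = Quartz.CIImage.imageWithContentsOfURL_(NSURL.fileURLWithPath_(path))
        handler = Vision.VNImageRequestHandler.alloc().initWithCIImage_options_(image, {})

        results = []

        def completion(request, error):
            # Each observation carries ranked candidates; take the top one.
            for observation in request.results() or []:
                candidate = observation.topCandidates_(1)[0]
                results.append((candidate.string(), candidate.confidence()))

        request = Vision.VNRecognizeTextRequest.alloc().initWithCompletionHandler_(completion)
        request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
        handler.performRequests_error_([request], None)
        return results

    for text, confidence in ocr_image("screenshot.png"):
        print(f"{confidence:.2f}  {text}")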
I wouldn't call myself a programmer, but I can generally troubleshoot anything given enough time; it did cost time, though.
Notably, Apple seems to attach some very unfriendly restrictions to some of the built-in stuff, such as the voices. When I researched it, it appeared you can't use those commercially.
Tesseract alone is widely known to be "meh" at this point.
If you look at RAG frameworks as one example, they'll typically use/support a variety of OCR implementations. Tesseract is almost always supported, but it's rarely ideal, with projects like Unstructured[0] and DocTR[1] being preferred. By leveraging more-or-less SOTA vision models[2][3] they embarrass Tesseract (rough DocTR sketch below).
I haven't compared them to the Apple Vision framework but they're absolutely better than Tesseract and potentially even Apple Vision.
There are also various approaches to use these in conjunction but that gets involved.
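For a sense of what the DocTR path looks like in practice, here is a minimal sketch. It assumes the python-doctr package; the pretrained weights are downloaded on first use, and the result is nested pages -> blocks -> lines -> words:

    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    # Detection + recognition with pretrained weights (downloaded on first run).
    model = ocr_predictor(pretrained=True)
    doc = DocumentFile.from_images(["page.png"])   # from_pdf("file.pdf") also works
    result = model(doc)

    # Walk the nested structure: pages -> blocks -> lines -> words.
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                print(" ".join(word.value for word in line.words))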
Happy to see OCR is advancing lately, but I really need HWR.
I am looking for something this polished and reliable for handwriting, does anyone have any pointers? I want to integrate it in a workflow with my eink tablet I take notes on. A few years ago, I tried various models, but they performed poorly (around 80% accuracy) on my handwriting, which I can read almost 90% of the time.
How well it works on your handwriting is for you to test, but if you can't read it well yourself, with all the contextual information you have, I doubt a model will either.
I have found Tesseract to be both better than I expect (it feels great when it works most of the time) and worse than I expect (not quite enough correct data to fully rely on).
Does anyone know which languages Apple supports? The docs don't have a list. Tesseract might be "meh", but it is probably the best open source option available for Devanagari scripts or Persian, for example.
I've used it on a number of Cyrillic languages (Russian, Bulgarian, etc.), Hungarian, and Turkish, along with the typical ones (Spanish, German, French, Italian, Portuguese). I've heard it supports Chinese. I just tried Persian and Devanagari samples on my Mac and it could not do either.
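If you want to check programmatically what your OS build supports, Vision will tell you at runtime. A small pyobjc sketch, assuming macOS 12 or later (where VNRecognizeTextRequest gained supportedRecognitionLanguagesAndReturnError_):

    import Vision

    # Ask the installed Vision framework which recognition languages it supports.
    request = Vision.VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
    languages, error = request.supportedRecognitionLanguagesAndReturnError_(None)
    print(list(languages))  # e.g. ['en-US', 'fr-FR', ...]; the list varies by macOS version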
Is there a tutorial on how to extract tables from a PDF or image with the Apple Vision framework? I tried the two links in your post, and they just extract the text without maintaining the table structure.
AWS Textract provides sample Python code to extract tables into CSV, which works great.
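For reference, the table analysis itself is a single boto3 call; a minimal sketch is below (Amazon's full sample then walks the TABLE -> CELL -> WORD relationships to rebuild each row into CSV, which is omitted here, and the file name is just an example):

    import boto3

    # Ask Textract to analyze an image for tables (synchronous API, single page).
    client = boto3.client("textract")
    with open("scan.png", "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )

    # The response is a flat list of blocks; TABLE blocks reference CELL blocks,
    # which reference the WORD blocks that make up each cell.
    cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
    print(f"{len(cells)} table cells detected")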
I tried doing something similar on Windows, and realized that PowerToys[1], a Microsoft project I already had installed, actually contains a very good OCR tool[2]. Just press Win+Shift+T and select the area to scan, and the text will be copied to the clipboard.
It's not so well known that one of the original rationales for "offside rule" programming languages is that it works just as easily for handwritten code as it does for typed.
Will we ever have programming languages that are primarily designed to take input from whiteboard grabs? (ie where not only handwriting, but also placement, connectivity, and maybe shape are meaningful?)
I did notice that many Mac apps, including Safari and Preview and Notes, do OCR on images automatically. It's pretty neat that I can easily select text in an image and copy and paste it somewhere else.
It’s kinda ridiculous how good it is, you can even select text from inside a YouTube video while it’s playing (or pause if needed).
Also if it’s text of a URL/domain or a QR code (eg in a photo of a poster, or in a video) you can hold-press/hold-click to open the link directly from the image.
The Photos app too. It's just so good at conferences or when you need a long string digitised (ISP default router password!). Photo > select > copy > then paste on phone or Mac (via that actually awesome Handoff feature).
PyXA uses the Vision framework to extract text from one or more images at a time. It's only a small part of the package, so it might be overkill for a one-off operation, but it's an option.
macOS Ventura and newer actually have basic OCR functionality integrated into the Image Capture UI. When using an AirPrint-compatible scanner and scanning to PDF, the checkbox "OCR" is shown in the right pane.
Awesome! Is there a similar technique for the Apple vision ‘Copy Subject’ feature? I’ve become extremely reliant on it, but it feels very limited in access.
I had to Google this, do you mean the feature in Photos on mobile where you can "extract" items from a picture and make them into stickers? Apple seems to call it "lifting subjects" [0] [1].
I've always had good results from Preview.app. I wonder how this engine compares, in terms of number of errors on a difficult source, versus free alternatives.
I would really love an `ocrmypdf` like tool which uses Apple Vision to create searchable PDFs from scanned images. I've been searching every week or so for some kind of project but so far haven't found anything. Perhaps it's time to make it myself...
I have played around with the OCR on my Mac and have been very impressed. It has been consistently better than Tesseract for my purposes.
However, when creating a PDF from images using Preview and exporting using ‘Embed Text’ option to OCR, I have noticed the text is worse than if you OCR the exact same images using the shortcut above or using a script. Presumably Preview is using the Vision framework’s less accurate fast path when preparing the PDF. https://developer.apple.com/documentation/vision/recognizing...
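If you want to test that theory, you can run the same image through both recognition levels and compare the output. A rough pyobjc sketch (the two constants are the framework's own fast/accurate levels; the image path is just an example):

    import Quartz
    import Vision
    from Cocoa import NSURL

    def recognize(path, level):
        # Run a single VNRecognizeTextRequest at the given recognition level.
        image = Quartz.CIImage.imageWithContentsOfURL_(NSURL.fileURLWithPath_(path))
        handler = Vision.VNImageRequestHandler.alloc().initWithCIImage_options_(image, {})
        request = Vision.VNRecognizeTextRequest.alloc().init()
        request.setRecognitionLevel_(level)
        handler.performRequests_error_([request], None)
        return [o.topCandidates_(1)[0].string() for o in request.results() or []]

    fast = recognize("page.png", Vision.VNRequestTextRecognitionLevelFast)
    accurate = recognize("page.png", Vision.VNRequestTextRecognitionLevelAccurate)
    print("fast:    ", fast)
    print("accurate:", accurate)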
Oddly enough if you enable it as a "quick action", when you run it, Finder creates a file in the same directory as the image containing the OCRed text (and named according to the first line of OCRed text).
I went back into my shortcut, and Shortcuts had added a pseudo-action "Stop and output <copy to clipboard>; if there's nowhere to output: <Do Nothing>". I would think "Do Nothing" would mean don't create a file, but I guess Quick Actions treats it specially, given that all the other actions seem to be intransitive, implying that the user wants a file as the output.
The article was posted... yesterday, and the entire reason given for not using the built-in Shortcuts sharing feature is... an article from 2 years ago, about a bug in the shortcuts hosting service, which has obviously been fixed.
I get that some people will want to create it from scratch themselves or incorporate the actual meat of it into a larger shortcut... but not sharing one that does what the article says, because of a bug 2 years ago, is a bit of a weird take.
sorry, that link may have been a cheap shot... but I did try to export the shortcut I created, and kept getting an error about not being signed in to iCloud...! and I am signed in to iCloud. it's just so confusing.
why can't shortcuts be exported as ... shortcut files?
it's not ideal to have people recreate the shortcut step by step (which is what I ended up describing in my post) but... I couldn't find a better way..! :-)
if you'd be able to recreate the shortcut and share it, and post the link here (and/or email it to me), I'd love to place that in the blog article! thank you
Raycast (macOS only) is also nice as it's able to search images by text. It also allows you to copy text from those images. Quick official demo here: https://www.youtube.com/watch?v=c96IXGOo6E4
How do you interact with the built-in OCR via the CLI? "Doing" something is (to me) choosing which OCR tooling, what fonts it recognises, and all the associated package management and tuning, not "how I configure the GUI and UI to let me use the tool they shipped with the OS".
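One route, since macOS Monterey: the OS ships a `shortcuts` command, so once the Extract Text shortcut exists you can drive it from any script. A rough sketch follows; the shortcut name is whatever you saved yours as, and the exact flags are worth confirming with `shortcuts help run`:

    import pathlib
    import subprocess
    import tempfile

    def ocr_via_shortcut(image_path, shortcut_name="Extract Text"):
        # Run the saved shortcut against an image file and read its text output back.
        out_path = pathlib.Path(tempfile.mkdtemp()) / "ocr.txt"
        subprocess.run(
            ["shortcuts", "run", shortcut_name,
             "--input-path", str(image_path), "--output-path", str(out_path)],
            check=True,
        )
        return out_path.read_text()

    print(ocr_via_shortcut("screenshot.png"))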
I made a Shortcut + PHP to get text from a screenshot, ask ChatGPT to make a task name from the text, and create a new task in ClickUp with the screenshot attached. I use it often.
Are iOS and macOS shortcuts cross-compatible? I didn't know there were Shortcuts for the Mac; it seems pretty powerful to be able to run them from the terminal too. Thanks OP
Yes, they are compatible as long as you use actions available on both platforms. For example, you can use AppleScript or shell on macOS, but those will not work on iOS. However, if you stick to cross-platform app actions it works, even when you write files into the iCloud folder. For example, I made a shortcut that takes today's events from the Calendar and appends the list to a Markdown file in an Obsidian vault on iCloud. I use it to scaffold meeting notes, and it works on my phone too.
Surprisingly, the Extract Text from Image action is available on Intel Macs: normally, features like automatic image OCR are limited to Apple Silicon Macs.
[1]: https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac5...
[2]: https://github.com/straussmaximilian/ocrmac