We desperately need a modern open source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs — which, aside from being the wrong tool and far too overpowered for the job, are also prone to hallucinations and have insanely expensive training and inference costs — because the "best" non-LLM solution is so bad it can't even correctly OCR monospaced hi-res scans of ASCII text with sufficient accuracy.
It was just to show that you can run it locally, as opposed to the "cloud APIs" referred to in the thread, but you are right, the more correct term is "local".
I think that's Baidu. I remember https://github.com/PaddlePaddle/ from when Ernie 3.0 was released, back before text encoder models were forgotten amid the progress of decoder-only ones.
Holy Crap! You were right about PaddleOCR. My personal benchmark for OCR tools is to submit several random pages from the first edition Moody's Manual for Railroads.
The reason I use it is to test whether it's just analyzing letter-by-letter (even if they claim it does more) or if it's actually scanning the letter/word in its context. If it's letter-by-letter, I get hilariously awful results.
Sure, it got things wrong. But it also figured out some things even I couldn't decipher.
There are certainly smaller and even better models for OCR.
But the whole "point" of LLMs (forget it, it's not AGI) is that you don't need to build many specialized models and cursed pipelines anymore to solve a problem that is definitely in reach without an LLM and that your farmer neighbor wants to pay $500 for.
Before LLMs it wasn't going to get done, because it takes more than $500 worth of engineering hours. Now we just brute-force it. Sure, it takes more compute, but we get it done!
We're trying to do something similar with VLM-1 https://vlm-docs.nos.run/guides/guide-pdf-presentations. We've found that a lot of the peculiarities of LLMs for text parsing (hallucinations etc.) can be avoided with structured output that restricts everything to a known schema/output range while constraining the number of output tokens required.
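For what it's worth, the "known schema" part can be as simple as validating the model's JSON against a fixed schema and rejecting anything that doesn't fit. A minimal sketch of that idea (field names are invented for illustration; this is not VLM-1's actual API):

```python
# Rough sketch of restricting LLM output to a known schema.
# The schema fields (vendor, total_cents, etc.) are made up for this example.
import json
from pydantic import BaseModel, ValidationError

class InvoiceRow(BaseModel):
    description: str
    quantity: int
    total_cents: int          # integers avoid free-form currency strings

class Invoice(BaseModel):
    vendor: str
    rows: list[InvoiceRow]

def parse_model_output(raw: str) -> Invoice | None:
    """Accept the LLM's output only if it is valid JSON matching the schema."""
    try:
        return Invoice.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None           # caller can retry or flag the page for review
```

Constraining the value ranges and types up front also tends to cap the number of output tokens, which is where a lot of the drift happens.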
The United States Postal Service probably has the best in the world, though its training probably restricts it to a subset of possible inputs. I wonder if it would be possible to get a senator or congressman to push for open sourcing it.
I believe the USPS system makes extensive use of knowledge of possible valid addresses so you're probably right that it wouldn't be generally applicable. Their _dataset_ must be glorious (and extremely confidential) though.
In my experience it works remarkably well for features like scanning documents in Notes and in copying or translating text embedded in images in Safari.
It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms.
Text extraction is included (including the ability to specify custom words not found in the dictionary) but there are also utilities for face detection, classification, etc.
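For anyone who wants to poke at it without the wrapper, the Vision framework is reachable from Python via pyobjc. A rough sketch, assuming `pyobjc-framework-Vision` is installed (macOS only; exact constants can vary by OS version):

```python
# Sketch: call Apple's Vision text recognition from Python via pyobjc.
import Vision
from Foundation import NSURL

request = Vision.VNRecognizeTextRequest.alloc().init()
request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
request.setCustomWords_(["PaddleOCR", "tesseract"])  # words not in the dictionary

handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
    NSURL.fileURLWithPath_("scan.png"), {}
)
success, error = handler.performRequests_error_([request], None)

for observation in request.results() or []:
    # Each observation carries ranked candidates plus a bounding box.
    print(observation.topCandidates_(1)[0].string())
```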
Has anyone tried Kosmos [0] ? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.
Okay, I got Kosmos-2.5 running - here's a mini review:
It's _extremely_ slow, about 30 seconds/page on an A10G. Maybe there's room to improve that, I don't know, but for now that's a problem.
The actual character recognition is superb, probably comparable to the big cloud offerings. On a sample of pages of moderately challenging typed text it was literally flawless aside from non-ascii characters.
It can do _neat_ handwriting with reasonable accuracy. This surprised me since there doesn't seem to be any handwriting in the training data (for Pix2Struct either). However, it will sometimes just skip handwriting.
The structured (markdown) output is sometimes impressive, occasionally a mess. I noticed two weaknesses in particular: it often starts a table as an HTML table and then switches to markdown, and it struggles to distinguish multi-column layouts unless they're pure book-like paragraphs or a table with clear gridlines. This is probably a result of sane/straightforward layouts from READMEs and scientific papers being most represented in the training data (the industry I'm in produces lots of documents with layouts that are wild even to a human).
One other thing: as a generative model it can and will go off the rails. One document I gave it with a lot of messy handwriting produced the typed header and then just 1500 lines of greater-than symbols. To be fair I couldn't read it either. While I didn't see it produce any valid-looking but "hallucinated" output, that's a possibility too.
It works really well for captioning. The few attempts I made at OCR failed miserably on CCTV images (camera label at top and datetime stamp on bottom).
LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it's not strictly necessary for the model to read the text out of the image.
CLIP embeddings can absolutely "read" text if the text is large enough. Tiling enables the model to read small text.
Do you know of any guides or tutorials to doing this? I tried using the MiniCPM model for this task, but it just OCRed a tiny bit of information then told me that it couldn't extract the rest.
Love this curious and open-minded exploration of how this stuff works.
The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models:
I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive.
> Interestingly enough, it’s actually more efficient to send text as images: A 512x512 image with a small but readable font can easily fit 400-500 tokens worth of text, yet you’re only charged for 170 input tokens plus the 85 for the ‘master thumbnail’ for a grand total of 255 tokens—far less than the number of words on the image.
Sounds like an arbitrage opportunity for all those gpt wrappers. Price your cost per token the same, send over the prompt via image, pocket the difference?
Something I don't get is why OpenAI doesn't provide clear, comprehensive documentation on how this actually works.
I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.
To make good judgements about how to use this stuff I need to know how it works!
I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
I was trying to figure out this exact same issue. OCR on a PDF worked great, up until a certain point when it just started hallucinating like crazy. I was working on a whole pipeline to feed in a PDF one page at a time to try to get around this issue. Otherwise, the OCR works absolutely fantastic compared to all the other tools I've been trying lately. These include OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.
I've also seen some people recommend PaddleOCR, but I find its documentation lacking and I haven't got it working yet to evaluate.
For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly, but the accuracy is much higher than what you will get from the OpenAI API.
But they do document that the images are resized, and they give you some rough guidelines on how you should be sizing your images. Low resolution is 1024 x 1024 with no tiling, and high resolution starts at 2048 x 2048, which then gets tiled. It could use further documentation, but it's enough to know that more than one page should never be sent in a single image via the API.
Right. But I still have a lot of questions. How does the model handle when something important overlaps multiple tiles in high-resolution mode? Am I better off doing the tiling myself with some overlap?
The fact that it's so eager to hallucinate random things that sound plausible enough if you're not paying attention, without warning you or raising any error, should make people reconsider using it for "data journalism" or similar.
If you build your system and it "works", then how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?
> how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?
You don’t. You treat it like you would a human worker: set your process to detect or tolerate wrong output. If you can't, don't apply this tool to your work.
This is true but misses a key fact: typical LLM errors are different from human errors. Not that they're worse or better, just that you need to understand where and when they're more likely to make mistakes and how to manage that.
There is an effectively infinite number of possibilities of things people could throw at it and they can't know ahead of time whether your use case will work or not. Even if they told you exactly how it worked, you wouldn't know for sure until you tried it. And giving a vague explanation wouldn't help you either.
Why is this not the top comment? FAIR published their CM3leon paper about decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.
I love how well this is written. Definitely "look how interesting this is" rather than "look how much I know". And it dives as deep as it needs to, while staying accessible for almost everyone. You really need to master a topic to be able to describe it simply. Great job.
An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), meaning it's very likely trained on sequential image tokens and the images are quantized using a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches where it would output 13x13 tokens for a 512x512 image (unless they just added a padding of 4 to the entire image and the patches are not overlapping).
How do we know it generates the images itself and isn't passing the text to DALL·E? That's supposedly how the current GPT-4 model does listen mode (with Whisper, but same idea).
Two reasons: the shown capabilities are way beyond what DALL·E is capable of, and they've been clear that this "omni" model by the "omni team" is natively multimodal.
One possibility is that mapping images to a token embedding consumes ~170x more compute+space than mapping a token id.
Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.
At some point we're going to go from tokens to embeddings for everything. I saw some research on variable length embeddings, I wouldn't be surprised if someone generated a huge embedding space, did some form of PCA on generated embeddings, threw away low eigenvalue vectors, then trained a distilled model that generated variable length embeddings directly from that.
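As a toy illustration of the PCA-then-truncate idea (random data standing in for real embeddings; this has nothing to do with how any production model actually works):

```python
# Toy sketch: compress a bank of embeddings by keeping only high-variance directions.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 512))          # pretend these came from a model

# PCA via SVD on mean-centered embeddings.
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Keep enough components to explain ~95% of the variance, drop the rest.
variance = singular_values ** 2
keep = int(np.searchsorted(np.cumsum(variance) / variance.sum(), 0.95)) + 1
compressed = centered @ components[:keep].T          # shorter vectors, most info retained
print(compressed.shape)
```

A distilled model trained to emit the truncated vectors directly would then give you variable-length embeddings without the post-hoc PCA step.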
> At some point we're going to go from tokens to embeddings for everything.
Yes, I agree.
Further down the road, I imagine we will end up finding interesting connections to the symbolic approaches of GOFAI, given that the embedding of a token, object, concept, or other entity in some vector space is basically a kind of symbol that represents that token, object, concept, or entity in that vector space.
Interestingly, old terms like "representation" and "capsule," which didn't become as widely adopted as "embedding," tried more explicitly to convey this idea of using vectors/matrices of feature activations to stand in for objects, concepts, and other entities.
I don't think a 13x13 tiling (of N channels/features) can be ruled out just because it can't recognize a grid of 13x13 objects. There is presumably a lot of overlap between the receptive fields of the tiles (due to kernel step sizes).
A pyramid of overlapped tiling resolutions is of course possible too.
The way this tests GPT-4o performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about half way down the page) is really clever.
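If you want to reproduce that kind of probe, generating the test image is only a few lines of Pillow (the grid size, cell size, and palette here are arbitrary choices, not the article's exact setup):

```python
# Sketch: render an NxN grid of randomly colored shapes to probe what a VLM can read back.
import json
import random
from PIL import Image, ImageDraw

N, CELL = 7, 64                      # 7x7 grid, 64px cells -> 448x448 image
COLORS = ["red", "green", "blue", "orange", "purple", "black"]
SHAPES = ["circle", "square"]

img = Image.new("RGB", (N * CELL, N * CELL), "white")
draw = ImageDraw.Draw(img)
truth = []
for row in range(N):
    for col in range(N):
        shape, color = random.choice(SHAPES), random.choice(COLORS)
        box = [col * CELL + 8, row * CELL + 8, (col + 1) * CELL - 8, (row + 1) * CELL - 8]
        (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
        truth.append({"row": row, "col": col, "shape": shape, "color": color})

img.save("grid.png")
json.dump(truth, open("truth.json", "w"))  # compare against the model's JSON answer
```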
I did something similar when GPT-4V came out, partially with the goal to figure out the input format (I did not get anywhere other than "magic vectors"), but also to roughly estimate the amount of data you can get back out of a 512x512 (the low quality option) image.
What I found is that you can sometimes get more text out of an 85-token image than you can out of 85 tokens of text! That said, I think there are plenty of edge cases where it actually loses some information, and maybe you could argue that if you removed every other word from the text, the model could still restore it.
I never went deeper on this, but I believe there's something clever to be done in the context window with the fact that images are relatively cheap tokens-wise.
I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you find its embedding in the lookup table.
The author is speculating about an embedding model but in reality they're speculating about the image-tokenizer.
If I'm not wrong the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large dictionary size or a very small dictionary size. The 170 tokens this image tokenizer generates might actually have repeating tokens!
EDIT: PS. What I meant to say was that input embeddings do not come from another trained model. Tokens come from other trained models. The input embedding matrix undergoes back propagation (learning). This is very important. This allows the model to move the embeddings of the tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
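In PyTorch terms, the point about input embeddings being a learned lookup table is just this (a generic sketch, not GPT-4o's actual setup; vocabulary size and dimensions are arbitrary):

```python
# Sketch: input embeddings are a trainable lookup table indexed by token id.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # one learnable vector per token id

token_ids = torch.tensor([[17, 42, 42, 256]])   # tokenizer output: ids, not vectors
vectors = embedding(token_ids)                  # shape (1, 4, 768); repeated ids share a row

# The table is a parameter, so backprop moves the rows together or apart as the model sees fit.
print(embedding.weight.requires_grad)           # True
```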
I've pondered it a bit more and I was the one who was mistaken. I think the author made great observations. It's just that I don't want to go back to non-token thinking. I don't want there to be a 13x13xE final output from the CNN. I really want there to be a visual vocabulary from which tokens are chosen. And I want this visual vocabulary to be fixed/untrainable/deterministic. That'd be very cool.
But why only choose 13x13 + 1? :(
I'm willing to bet that the author's conclusion that the embeddings come from a CNN is wrong. However, I cannot get the 13x13 + 1 observation out of my head; they've definitely hit on something there. I agree that there is very likely a CNN involved, and I'm going to put my bet on the final filters and kernels being the visual vocabulary.
And how do you go from 50k convolutional kernels (think tokens) to always 170 chosen tokens for any image? I don't know...
But don't input embeddings need to undergo backprop during training? Won't the external-model's embeddings just be noise since they don't share embedding space with the model that is being trained?
If the external-model also undergoes training along with the model then I think that might work.
I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the exact pixel dimensions of the image (post-resize since there is a max size for images in the pipeline, besides 512x512), but I'm 99% sure it's not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from system prompt to tool details, etc. but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get it to give you the dimensions is to tell it to output a structure that contains width and height and just pick something reasonable, and they will "randomly" be the correct values:
I'm well aware of that, but there are plenty of ways to induce verbatim quoting from "hidden" information, and mostly verify it (through sampling a large number of times in separate runs).
Models are improving in truly hiding or ignoring information these days though. As the author of the article states, you'll have a hard time tricking GPT-4o to read text in images as instructions, most likely thanks to this research:
https://openai.com/index/the-instruction-hierarchy/
I do feel pretty confident that when the model is happily spitting out its system prompt and all the metadata around the image, but not its pixel dimensions, those dimensions were probably not provided in any system/assistant/tool message. So maybe the image embeddings also encode the pixel dimensions somehow (it would also help the model not think of the image as a squished square for non-1:1 images that have been resized to 512x512).
I think that would have pretty serious implications for the transformer architecture though. If they're not embedded like text tokens, how would attention, etc work? And a conversation with multiple images back and forth? Not to mention with GPT-4o now having audio support as well. I would assume it does become tokens.
Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to compute use per token in the LM).
It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy.
I'm not sure how GPT-4o routes information. If a picture containing text is submitted, does the text then get resubmitted to GPT-4o as a textual query, or do the model weights themselves essentially transform the image of text into textual tokens? I do wonder whether a response to an image of text is similar to a response to a text query, i.e. processed by the same weights.
I've always wondered how text-to-image models like Stable Diffusion work. Do they just encode RGB values into a matrix and then have a helper tool convert that data into a JPG?
Shouldn't this theory be testable? The response time for an image of the same size should remain constant (assuming a generated response of constant size). You could then try to put an increasing amount of text inside of the image. If this text is fed to the LLM using OCR, the total amount of tokens grows. You should then be able to observe an increase in response time.
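A rough sketch of that test, with the caveat that wall-clock latency is noisy (network, server load) and mostly dominated by output length, so you'd want many repetitions. The model name and `detail` flag are assumptions about the current API, not something the thread confirms:

```python
# Sketch: render increasing amounts of text into a fixed-size image and time the responses.
import base64, io, textwrap, time
from PIL import Image, ImageDraw
from openai import OpenAI

client = OpenAI()

def text_image(words: int) -> str:
    """Render `words` dummy words into a 512x512 PNG and return it base64-encoded."""
    img = Image.new("RGB", (512, 512), "white")
    body = textwrap.fill(" ".join(f"word{i}" for i in range(words)), width=60)
    ImageDraw.Draw(img).multiline_text((10, 10), body, fill="black")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

for words in (50, 200, 400):
    start = time.time()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Transcribe the text in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{text_image(words)}",
                           "detail": "low"}},
        ]}],
    )
    print(words, "words ->", round(time.time() - start, 2), "s")
```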
Even if tesseract accuracy is low, if the tesseract result in addition to the image is then passed to the LLM, it can result in a much more accurate OCR.
For example, GPT4 with some vision capability would be able to fill in the incorrect OCR with the additional word co-occurrence understanding.
I've tested this approach with purely text LLM to correct OCR mistakes and it works quite well.
Also note that in some newer OCR pipelines that don't involve LLMs, there is a vision component and then a text correcting model that is in some ways similar to some forms of spell check, which can further improve results.
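A minimal version of the Tesseract-plus-LLM correction step described above might look like this (prompt and model name are illustrative, not a recommendation):

```python
# Sketch: run Tesseract, then ask an LLM to fix likely OCR errors in the raw output.
import pytesseract
from PIL import Image
from openai import OpenAI

raw_text = pytesseract.image_to_string(Image.open("page.png"))

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You correct OCR output. Fix obvious character-level errors using "
                    "context, but do not invent text that is not plausibly in the source."},
        {"role": "user", "content": raw_text},
    ],
)
print(response.choices[0].message.content)
```

Passing the original image alongside the raw OCR text (as in the GPT-4 vision case above) tends to help further, since the model can check ambiguous characters against the pixels.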
You can tell that the OCR fails more in cases without natural language, like code or random characters. OpenAI seems to claim 4o is a fully end-to-end multimodal model, but we will never know for sure; we can't trust a single word OpenAI says.
I once uploaded a giant image to ChatGPT asking it to transcribe it, but the request failed, and the error message referenced a Python script related to Tesseract. Since then I'm 100% sure Tesseract is used there in some way for text recognition.
Because no one knows how to prep the images. With the right file type and resolution I get fewer than one character error per 10 pages, and it's been that good since the late '00s.
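For reference, "prepping the images" usually means something along these lines (a generic sketch; the scale factor, threshold method, and page-segmentation mode are examples, not universal settings):

```python
# Sketch of typical Tesseract preprocessing: grayscale, upscale, binarize, then OCR.
import cv2
import pytesseract

img = cv2.imread("scan.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upscale low-resolution scans so glyphs are large enough for Tesseract.
gray = cv2.resize(gray, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# Otsu binarization separates ink from background without hand-tuned thresholds.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 6 assumes a single uniform block of text; pick the mode that fits the layout.
text = pytesseract.image_to_string(binary, config="--psm 6")
print(text)
```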
With handwriting? With mixed fonts? Tesseract requires heavy customization and extension to perform reasonably on these workloads. The off-the-shelf options from major cloud providers blow it out of the water.
Agreed. Tesseract is not able to handle handwriting or text that is distorted well, e.g. colored text over an image background — to the point that it would hurt any downstream LLM trying to make sense of the contents. It won’t even pick out bounding boxes.
I doubt they are running an OCR model, but if they actually were it would likely be an in-house one trained with more modern techniques.
A multimodal LLM would of course blow it all out of the water, so some Llama 3-like model is probably SOTA in terms of what you can run yourself. Something like https://huggingface.co/blog/idefics2
I really hope we see improvements to the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldy workarounds in applications.
I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
I would assume it has a "mode" token where it switches between text/image (and now audio), or you'd have to try to maximize the number of reserved tokens between multiple modes. GPT-4o did go from 100K to 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost for non-English).
I would be disappointed if OpenAI had a separate model for OCR, though I guess that is believable. Much cooler if the LLM just understands language from text