We desperately need a modern open source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs — which, aside from being the wrong tool and far too overpowered for the job, are also prone to hallucinations and have insanely expensive training and inference costs — because the "best" non-LLM solution is so bad it can't even correctly OCR monospaced hi-res scans of ASCII text with sufficient accuracy.
It was just to show that you can run it locally, as opposed to the "cloud APIs" referred to in the thread, but you are right, the more correct term is "local".
I think that's Baidu. I remember https://github.com/PaddlePaddle/ from when Ernie 3.0 was released, back before text encoder models were forgotten amid the progress of decoder-only ones.
Holy Crap! You were right about PaddleOCR. My personal benchmark for OCR tools is to submit several random pages from the first edition Moody's Manual for Railroads.
The reason I use it is to test whether it's just analyzing letter-by-letter (even if they claim it does more) or if it's actually scanning the letter/word in its context. If it's letter-by-letter, I get hilariously awful results.
Sure, it got things wrong. But it also figured out some things even I couldn't decipher.
There are certainly smaller and even better models for OCR.
But the whole "point" of LLMs (forget it, it's not AGI) is that you don't need to build many specialized models and cursed pipelines anymore to solve a problem that is definitely in reach without an LLM and that your farmer neighbor wants to pay $500 for.
Before LLMs it wasn't going to get done, because it takes more than $500 worth of engineering hours. Now we just brute-force it. Sure, it takes more compute, but we get it done!
We're trying to do something similar with VLM-1 https://vlm-docs.nos.run/guides/guide-pdf-presentations. We've found that a lot of the peculiarities of LLMs for text parsing (hallucinations etc.) can be avoided with structured output that restricts everything to a known schema/output range while constraining the number of output tokens required.
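For what it's worth, the "known schema" part can be as simple as validating the model's JSON against a fixed schema and rejecting anything that doesn't fit. A minimal sketch of that idea (field names are invented for illustration; this is not VLM-1's actual API):

```python
# Rough sketch of restricting LLM output to a known schema.
# The schema fields (vendor, total_cents, etc.) are made up for this example.
import json
from pydantic import BaseModel, ValidationError

class InvoiceRow(BaseModel):
    description: str
    quantity: int
    total_cents: int          # integers avoid free-form currency strings

class Invoice(BaseModel):
    vendor: str
    rows: list[InvoiceRow]

def parse_model_output(raw: str) -> Invoice | None:
    """Accept the LLM's output only if it is valid JSON matching the schema."""
    try:
        return Invoice.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        return None           # caller can retry or flag the page for review
```

Constraining the value ranges and types up front also tends to cap the number of output tokens, which is where a lot of the drift happens.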
The United States Postal Service probably has the best in the world, though its training probably restricts it to a subset of possible inputs. I wonder if it would be possible to get a senator or congressman to push for open sourcing it.
I believe the USPS system makes extensive use of knowledge of possible valid addresses so you're probably right that it wouldn't be generally applicable. Their _dataset_ must be glorious (and extremely confidential) though.
In my experience it works remarkably well for features like scanning documents in Notes and in copying or translating text embedded in images in Safari.
It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms.
Text extraction is included (including the ability to specify custom words not found in the dictionary) but there are also utilities for face detection, classification, etc.
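For anyone who wants to poke at it without the wrapper, the Vision framework is reachable from Python via pyobjc. A rough sketch, assuming `pyobjc-framework-Vision` is installed (macOS only; exact constants can vary by OS version):

```python
# Sketch: call Apple's Vision text recognition from Python via pyobjc.
import Vision
from Foundation import NSURL

request = Vision.VNRecognizeTextRequest.alloc().init()
request.setRecognitionLevel_(Vision.VNRequestTextRecognitionLevelAccurate)
request.setCustomWords_(["PaddleOCR", "tesseract"])  # words not in the dictionary

handler = Vision.VNImageRequestHandler.alloc().initWithURL_options_(
    NSURL.fileURLWithPath_("scan.png"), {}
)
success, error = handler.performRequests_error_([request], None)

for observation in request.results() or []:
    # Each observation carries ranked candidates plus a bounding box.
    print(observation.topCandidates_(1)[0].string())
```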
Has anyone tried Kosmos [0] ? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.
Okay, I got Kosmos-2.5 running - here's a mini review:
It's _extremely_ slow, about 30 seconds/page on an A10G. Maybe there's room to improve that, I don't know, but for now that's a problem.
The actual character recognition is superb, probably comparable to the big cloud offerings. On a sample of pages of moderately challenging typed text it was literally flawless aside from non-ascii characters.
It can do _neat_ handwriting with reasonable accuracy. This surprised me since there doesn't seem to be any handwriting in the training data (for Pix2Struct either). However, it will sometimes just skip handwriting.
The structured (markdown) output is sometimes impressive, occasionally a mess. I noticed two weaknesses in particular: it often starts a table as an HTML table and then switches to markdown, and it struggles to distinguish multi-column layouts unless they're pure book-like paragraphs or a table with clear gridlines. This is probably a result of sane/straightforward layouts from READMEs and scientific papers being most represented in the training data (the industry I'm in produces lots of documents with layouts that are wild even to a human).
One other thing: as a generative model it can and will go off the rails. One document I gave it with a lot of messy handwriting produced the typed header and then just 1500 lines of greater-than symbols. To be fair I couldn't read it either. While I didn't see it produce any valid-looking but "hallucinated" output, that's a possibility too.
It works really well for captioning. The few attempts I made at OCR failed miserably on CCTV images (camera label at top and datetime stamp on bottom).
LLaVA 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it's not strictly necessary for the model to read the text out of the image.
CLIP embeddings can absolutely "read" text if the text is large enough. Tiling enables the model to read small text.
Do you know of any guides or tutorials to doing this? I tried using the MiniCPM model for this task, but it just OCRed a tiny bit of information then told me that it couldn't extract the rest.
Love this curious and open-minded exploration of how this stuff works.
The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models:
I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive.
> Interestingly enough, it’s actually more efficient to send text as images: A 512x512 image with a small but readable font can easily fit 400-500 tokens worth of text, yet you’re only charged for 170 input tokens plus the 85 for the ‘master thumbnail’ for a grand total of 255 tokens—far less than the number of words on the image.
Sounds like an arbitrage opportunity for all those gpt wrappers. Price your cost per token the same, send over the prompt via image, pocket the difference?
Something I don't get is why OpenAI doesn't provide clear, comprehensive documentation on how this actually works.
I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.
To make good judgements about how to use this stuff I need to know how it works!
I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...
If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.
I was trying to figure out this exact same issue. OCR on a PDF worked great, up until a certain point when it just started hallucinating like crazy. I was working on a whole pipeline to feed in a PDF one page at a time to try to get around this issue. Otherwise, the OCR works absolutely fantastic compared to all the other tools I've been trying lately. These include OCRmyPDF (Tesseract), SuryaOCR, and some of the models on the Visual LLM Leaderboard.
I've also seen some people recommend PaddleOCR, but I find its documentation lacking and I haven't got it working yet to evaluate.
For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly, but the accuracy is much higher than what you will get from the OpenAI API.
But they do document that the images are resized, and they give you some rough guidelines on how you should be sizing your images. Low resolution is 1024 x 1024 with no tiling, and high resolution starts at 2048 x 2048, which then gets tiled. It could use further documentation, but it's enough to know that more than one page should never be sent in a single image via the API.
Right. But I still have a lot of questions. How does the model handle when something important overlaps multiple tiles in high-resolution mode? Am I better off doing the tiling myself with some overlap?
The fact that it's so eager to hallucinate random things that sound plausible enough if you're not paying attention, without warning you or raising any error, should make people reconsider using it for "data journalism" or similar.
If you build your system and it "works", then how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?
> how will you see the one time out of X where it confidently provides you false information that you happily use because it usually works?
You don’t. You treat it like you would a human worker: set your process to detect or tolerate wrong output. If you can't, don't apply this tool to your work.
This is true but misses a key fact: typical LLM errors are different from human errors. Not that they're worse or better, just that you need to understand where and when they're more likely to make mistakes and how to manage that.
There is an effectively infinite number of possibilities of things people could throw at it and they can't know ahead of time whether your use case will work or not. Even if they told you exactly how it worked, you wouldn't know for sure until you tried it. And giving a vague explanation wouldn't help you either.
Why is this not the top comment? FAIR published their CM3leon paper about decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.
I love how well this is written. Definitely "look how interesting this is" rather than "look how much I know". And it dives as deep as it needs to, while staying accessible for almost everyone. You really need to master a topic to be able to describe it simply. Great job.
An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), meaning it's very likely trained on sequential image tokens and the images are quantized using a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches where it would output 13x13 tokens for a 512x512 image (unless they just added a padding of 4 to the entire image and the patches are not overlapping).
How do we know it generates the images itself and isn't passing the text to DALL·E? That's supposedly how the current GPT-4 model does listen mode (with Whisper, but same idea).
Two reasons: the shown capabilities are way beyond what DALL·E is capable of, and they've been clear that this "omni" model by the "omni team" is natively multimodal.
One possibility is that mapping images to a token embedding consumes ~170x more compute+space than mapping a token id.
Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.
At some point we're going to go from tokens to embeddings for everything. I saw some research on variable length embeddings, I wouldn't be surprised if someone generated a huge embedding space, did some form of PCA on generated embeddings, threw away low eigenvalue vectors, then trained a distilled model that generated variable length embeddings directly from that.
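As a toy illustration of the PCA-then-truncate idea (random data standing in for real embeddings; this has nothing to do with how any production model actually works):

```python
# Toy sketch: compress a bank of embeddings by keeping only high-variance directions.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 512))          # pretend these came from a model

# PCA via SVD on mean-centered embeddings.
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Keep enough components to explain ~95% of the variance, drop the rest.
variance = singular_values ** 2
keep = int(np.searchsorted(np.cumsum(variance) / variance.sum(), 0.95)) + 1
compressed = centered @ components[:keep].T          # shorter vectors, most info retained
print(compressed.shape)
```

A distilled model trained to emit the truncated vectors directly would then give you variable-length embeddings without the post-hoc PCA step.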
> At some point we're going to go from tokens to embeddings for everything.
Yes, I agree.
Further down the road, I imagine we will end up finding interesting connections to the symbolic approaches of GOFAI, given that the embedding of a token, object, concept, or other entity in some vector space is basically a kind of symbol that represents that token, object, concept, or entity in that vector space.
Interestingly, old terms like "representation" and "capsule," which didn't become as widely adopted as "embedding," tried more explicitly to convey this idea of using vectors/matrices of feature activations to stand in for objects, concepts, and other entities.
I don't think a 13x13 tiling (of N channels/features) can be ruled out just because it can't recognize a grid of 13x13 objects. There is presumably a lot of overlap between the receptive fields of the tiles (due to kernel step sizes).
A pyramid of overlapped tiling resolutions is of course possible too.
The way this tests GPT-4o performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about half way down the page) is really clever.
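If you want to reproduce that kind of probe, generating the test image is only a few lines of Pillow (the grid size, cell size, and palette here are arbitrary choices, not the article's exact setup):

```python
# Sketch: render an NxN grid of randomly colored shapes to probe what a VLM can read back.
import json
import random
from PIL import Image, ImageDraw

N, CELL = 7, 64                      # 7x7 grid, 64px cells -> 448x448 image
COLORS = ["red", "green", "blue", "orange", "purple", "black"]
SHAPES = ["circle", "square"]

img = Image.new("RGB", (N * CELL, N * CELL), "white")
draw = ImageDraw.Draw(img)
truth = []
for row in range(N):
    for col in range(N):
        shape, color = random.choice(SHAPES), random.choice(COLORS)
        box = [col * CELL + 8, row * CELL + 8, (col + 1) * CELL - 8, (row + 1) * CELL - 8]
        (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
        truth.append({"row": row, "col": col, "shape": shape, "color": color})

img.save("grid.png")
json.dump(truth, open("truth.json", "w"))  # compare against the model's JSON answer
```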
I did something similar when GPT-4V came out, partially with the goal to figure out the input format (I did not get anywhere other than "magic vectors"), but also to roughly estimate the amount of data you can get back out of a 512x512 (the low quality option) image.
What I found is that you can sometimes get more text out of an 85-token image than you can out of 85 tokens of text! That said, I think there are plenty of edge cases where it actually loses some information, and maybe you could argue that if you removed every other word from the text, the model could still restore it.
I never went deeper on this, but I believe there's something clever to be done in the context window with the fact that images are relatively cheap tokens-wise.
I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you find its embedding in the lookup table.
The author is speculating about an embedding model but in reality they're speculating about the image-tokenizer.
If I'm not wrong the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large dictionary size or a very small dictionary size. The 170 tokens this image tokenizer generates might actually have repeating tokens!
EDIT: PS. What I meant to say was that input embeddings do not come from another trained model. Tokens come from other trained models. The input embedding matrix undergoes back propagation (learning). This is very important. This allows the model to move the embeddings of the tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
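In PyTorch terms, the point about input embeddings being a learned lookup table is just this (a generic sketch, not GPT-4o's actual setup; vocabulary size and dimensions are arbitrary):

```python
# Sketch: input embeddings are a trainable lookup table indexed by token id.
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # one learnable vector per token id

token_ids = torch.tensor([[17, 42, 42, 256]])   # tokenizer output: ids, not vectors
vectors = embedding(token_ids)                  # shape (1, 4, 768); repeated ids share a row

# The table is a parameter, so backprop moves the rows together or apart as the model sees fit.
print(embedding.weight.requires_grad)           # True
```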
I've pondered it a bit more and I was the one who was mistaken. I think the author made great observations. It's just that I don't want to go back to non-token thinking. I don't want there to be a 13x13xE final output from the CNN. I really want there to be a visual vocabulary from which tokens are chosen. And I want this visual vocabulary to be fixed/untrainable/deterministic. That'd be very cool.
But why only choose 13x13 + 1? :(
I'm willing to bet that the author's conclusion that the embeddings come from a CNN is wrong. However, I cannot get the 13x13 + 1 observation out of my head; they've definitely hit on something there. I agree that there is very likely a CNN involved, and I'm going to put my bet on the final filters and kernels being the visual vocabulary.
And how do you go from 50k convolutional kernels (think tokens) to always 170 chosen tokens for any image? I don't know...
But don't input embeddings need to undergo backprop during training? Won't the external-model's embeddings just be noise since they don't share embedding space with the model that is being trained?
If the external-model also undergoes training along with the model then I think that might work.
I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the exact pixel dimensions of the image (post-resize since there is a max size for images in the pipeline, besides 512x512), but I'm 99% sure it's not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from system prompt to tool details, etc. but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get it to give you the dimensions is to tell it to output a structure that contains width and height and just pick something reasonable, and they will "randomly" be the correct values:
I'm well aware of that, but there are plenty of ways to induce verbatim quoting from "hidden" information, and mostly verify it (through sampling a large number of times in separate runs).
Models are improving in truly hiding or ignoring information these days though. As the author of the article states, you'll have a hard time tricking GPT-4o to read text in images as instructions, most likely thanks to this research:
https://openai.com/index/the-instruction-hierarchy/
I do feel pretty confident that when the model is happily spitting out its system prompt and all the metadata around the image, but not its pixel dimensions, those dimensions were probably not provided in any system/assistant/tool message. So maybe the image embeddings also encode the pixel dimensions somehow (it would also help the model not think of the image as a squished square for non-1:1 images that have been resized to 512x512).
I think that would have pretty serious implications for the transformer architecture though. If they're not embedded like text tokens, how would attention, etc work? And a conversation with multiple images back and forth? Not to mention with GPT-4o now having audio support as well. I would assume it does become tokens.
Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to compute use per token in the LM).
It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy.
I'm not sure how GPT-4o routes information. If a picture containing text is submitted, does the text then get resubmitted to GPT-4o as a textual query, or do the model weights themselves essentially transform the image of text into textual tokens? I do wonder whether a response to an image of text is similar to a response to a text query, i.e. processed by the same weights.
I've always wondered how text-to-image models like Stable Diffusion work. Do they just encode RGB values into a matrix and then have a helper tool convert that data into a JPG?
Shouldn't this theory be testable? The response time for an image of the same size should remain constant (assuming a generated response of constant size). You could then try to put an increasing amount of text inside of the image. If this text is fed to the LLM using OCR, the total amount of tokens grows. You should then be able to observe an increase in response time.
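A rough sketch of that test, with the caveat that wall-clock latency is noisy (network, server load) and mostly dominated by output length, so you'd want many repetitions. The model name and `detail` flag are assumptions about the current API, not something the thread confirms:

```python
# Sketch: render increasing amounts of text into a fixed-size image and time the responses.
import base64, io, textwrap, time
from PIL import Image, ImageDraw
from openai import OpenAI

client = OpenAI()

def text_image(words: int) -> str:
    """Render `words` dummy words into a 512x512 PNG and return it base64-encoded."""
    img = Image.new("RGB", (512, 512), "white")
    body = textwrap.fill(" ".join(f"word{i}" for i in range(words)), width=60)
    ImageDraw.Draw(img).multiline_text((10, 10), body, fill="black")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

for words in (50, 200, 400):
    start = time.time()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Transcribe the text in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{text_image(words)}",
                           "detail": "low"}},
        ]}],
    )
    print(words, "words ->", round(time.time() - start, 2), "s")
```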
Even if tesseract accuracy is low, if the tesseract result in addition to the image is then passed to the LLM, it can result in a much more accurate OCR.
For example, GPT4 with some vision capability would be able to fill in the incorrect OCR with the additional word co-occurrence understanding.
I've tested this approach with purely text LLM to correct OCR mistakes and it works quite well.
Also note that in some newer OCR pipelines that don't involve LLMs, there is a vision component and then a text correcting model that is in some ways similar to some forms of spell check, which can further improve results.
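A minimal version of the Tesseract-plus-LLM correction step described above might look like this (prompt and model name are illustrative, not a recommendation):

```python
# Sketch: run Tesseract, then ask an LLM to fix likely OCR errors in the raw output.
import pytesseract
from PIL import Image
from openai import OpenAI

raw_text = pytesseract.image_to_string(Image.open("page.png"))

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You correct OCR output. Fix obvious character-level errors using "
                    "context, but do not invent text that is not plausibly in the source."},
        {"role": "user", "content": raw_text},
    ],
)
print(response.choices[0].message.content)
```

Passing the original image alongside the raw OCR text (as in the GPT-4 vision case above) tends to help further, since the model can check ambiguous characters against the pixels.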
You can tell that the OCR fails more in cases without natural language, like code or random characters. OpenAI seems to claim 4o is a fully end-to-end multimodal model, but we will never know for sure; we can't trust a single word OpenAI says.
I once uploaded a giant image to ChatGPT asking it to transcribe it, but the request failed, and the error message referenced a Python script related to Tesseract. Since then I'm 100% sure Tesseract is used there in some way for text recognition.
Because no one knows how to prep the images. With the right file type and resolution I get fewer than one character error per 10 pages, and it's been that good since the late '00s.
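For reference, "prepping the images" usually means something along these lines (a generic sketch; the scale factor, threshold method, and page-segmentation mode are examples, not universal settings):

```python
# Sketch of typical Tesseract preprocessing: grayscale, upscale, binarize, then OCR.
import cv2
import pytesseract

img = cv2.imread("scan.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upscale low-resolution scans so glyphs are large enough for Tesseract.
gray = cv2.resize(gray, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

# Otsu binarization separates ink from background without hand-tuned thresholds.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --psm 6 assumes a single uniform block of text; pick the mode that fits the layout.
text = pytesseract.image_to_string(binary, config="--psm 6")
print(text)
```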
With handwriting? With mixed fonts? Tesseract requires heavy customization and extension to perform reasonably on these workloads. The off-the-shelf options from major cloud providers blow it out of the water.
Agreed. Tesseract is not able to handle handwriting or text that is distorted well, e.g. colored text over an image background — to the point that it would hurt any downstream LLM trying to make sense of the contents. It won’t even pick out bounding boxes.
I doubt they are running an OCR model, but if they actually were it would likely be an in-house one trained with more modern techniques.
A multimodal LLM would of course blow it all out of the water, so some Llama 3-like model is probably SOTA in terms of what you can run yourself. Something like https://huggingface.co/blog/idefics2
I really hope we see improvements to the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldy workarounds in applications.
I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?
I would assume it has a "mode" token where it switches between text/image (and now audio), or you'd have to try to maximize the number of reserved tokens between multiple modes. GPT-4o did go from 100K to 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost for non-English).
I would be disappointed if OpenAI had a separate model for OCR, though I guess that is believable. Much cooler if the LLM just understands language from text