Hacker News

Tangential question: did anyone ever use GPT-4V in production for visual tasks? It has never been consistent enough to be useful for me.



It's very reliable for GUI segment understanding; see e.g. https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (scroll down to `gpt-4-vision-preview`).


Can it be used for automatic annotation?

As in: you tell it that these and these parts should be masked such and such, and then it does that?


We have not had success with that unfortunately.


Thank you, your comment will save me some trouble ;)


Don’t use it for anything OCR-related that needs perfect accuracy. For tasks where some errors are OK, we’ve had great success. Depending on your budget, you can also run it multiple times to catch errors.
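One way to implement "run it multiple times to catch errors" is a simple majority vote over repeated extractions. A minimal sketch, assuming each model call returns a flat dict of fields (the `runs` data below is hypothetical, standing in for independent GPT-4V responses):

```python
from collections import Counter

def majority_vote(runs: list[dict]) -> dict:
    """Keep each field's most common value across runs.

    Fields without strict-majority agreement come back as None,
    flagging them for manual review.
    """
    result = {}
    for field in runs[0]:
        counts = Counter(r.get(field) for r in runs)
        value, count = counts.most_common(1)[0]
        result[field] = value if count > len(runs) / 2 else None
    return result

# Hypothetical outputs from three independent vision-model calls
# on the same document; one run misread the date.
runs = [
    {"total": "42.00", "date": "2024-01-05"},
    {"total": "42.00", "date": "2024-01-06"},
    {"total": "42.00", "date": "2024-01-05"},
]
```

Three runs triple the API cost, so this only makes sense where per-field accuracy matters more than budget.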


> you can also run it multiple times to catch errors.

Does this require a slight offset and/or rotation of the image, or just a literal rerun, with a different seed or whatever giving a different result?


How does it compare to Tesseract?

Edit: Thank you!


I’ve done a lot of OCR work, and Tesseract is nearly a decade out of date at this point. It is not a serious technology for anything requiring good accuracy or even minor complexity. From what I’ve seen, GPT-4V completely smokes Tesseract, but then again, most modern OCR systems do. If you want fast and pretty powerful OCR, check out PaddleOCR. If you want slower but higher accuracy, check out transformer-based models such as TrOCR.


See this for a comparison of PaddleOCR, TrOCR, and various cloud ones (note: on documents of typed and handwritten text):

https://news.ycombinator.com/item?id=32077375


Caveat: being from 2022, the Tesseract version used was almost certainly v4 (on Linux) rather than v5, which is much better (v5 was widely available on Windows in 2022, but not yet on Linux).

However, as you note, Tesseract is still quite far behind even with v5.


Running PaddleOCR in production now, I would suggest contrasting Tesseract v4 and v5, since v5 is a lot better (though until recently it was not available on Linux). PaddleOCR does still smoke it, you are right, especially for concurrency: it's fairly easy to assign different workers to different GPUs for the best concurrent batching.


How is Paddle on complex data tables? This is my biggest challenge at the moment.


What format? The entire data table in one image, or, say, an 8-page printed PDF where the user chose to put the header only on the first page? Or decent formatting, font size 8+, on an image with decent resolution? With the latter you are probably fine, although you will need some manual implementation for parsing the output; you get bounding boxes at word level. If I were starting today, I would use basic columns (x coordinates) to add '|' in between the outputs (including detecting empty span positions), keep items with similarish y coordinates together on lines, and feed the result to ChatGPT to format as desired. I suspect this would avoid misreading.
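The grouping idea above can be sketched roughly like this. This is a simplified illustration, not PaddleOCR's actual output shape: it assumes each word box has been reduced to `(text, x, y)`, and that column boundary x coordinates have already been estimated:

```python
def boxes_to_table_text(boxes, columns, y_tol=10):
    """Turn word-level boxes into '|'-separated table lines.

    boxes:   list of (text, x, y) tuples (assumed simplified format)
    columns: sorted x coordinates acting as column separators
    y_tol:   words within this y distance are treated as one line
    """
    # Group words with similar y coordinates into lines
    lines = []
    for text, x, y in sorted(boxes, key=lambda b: (b[2], b[1])):
        if lines and abs(y - lines[-1][0]) <= y_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    out = []
    for _, words in lines:
        # Bucket each word into the column its x coordinate falls in;
        # empty buckets naturally produce empty cells
        cells = [[] for _ in range(len(columns) + 1)]
        for x, text in sorted(words):
            idx = sum(1 for c in columns if x >= c)
            cells[idx].append(text)
        out.append(" | ".join(" ".join(c) for c in cells))
    return "\n".join(out)
```

The resulting text preserves column alignment, so a downstream LLM sees cell boundaries explicitly rather than guessing them from raw concatenated words.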

I would say PaddleOCR is good for tables in general. It's much better than Tesseract (in terms of recall rate) at recognising numerical digits and symbols, although I notice it sometimes misrecognises the letter "l" (as in "Lullaby", "ml", "million") as "1".

The cloud providers have better table extraction iff you can guarantee the same format each time for the document.


A wide variety of PDFs (both in length and content) that can have a variety of different tables, real estate related with a lot of financial content. And I need to be able to run on local models / software (no parsing as a service, no OpenAI, etc).

Here's just one example: https://www.totalflood.com/samples/residential.pdf (I struggle to get accurate data out of the Sales Comp section; basically all approaches mix up the properties).


Sorry, this will be very hard to do. You can't really segment the images based on lines, as the table layouts probably vary. The floor plans and the rest... this data is very, very challenging.

I would suggest your best bet is waiting 2 years for the next version of LLaVA to come out, which may be able to interpret these very accurately on-device. The progress with LLaVA has been fast recently, but for now it's still a bit too inaccurate.


Tesseract's true value is being one apt-get command away (i.e. open source). Does Debian host more modern OCR systems in their repos?


Tesseract the tool is one apt-get away but the trained models are not, and I've found that they are a starting point, not a final destination. You still have to do more training on top of them for anything that isn't black text on a crisp white background.


Big mistake on my part; I should clarify that I fine-tuned both PaddleOCR and TrOCR on large amounts of data specific to my domain. I cannot speak to the best out-of-the-box "ready to go" solutions (besides cloud ones, which were quite good with the right pre- and post-processing).


The coolest use case I've seen is this: https://github.com/ddupont808/GPT-4V-Act


I feel like this is the beginning of the end for all captchas


Twitter / X has a very interesting captcha: you get to see 10 objects that have weird colors and are slightly deformed, and then you have to match them (1 at a time) with another row that has the same objects but seen from a different angle.

Of course eventually this will be defeated too, but for now it seems to work pretty well.


Image-based, or any kind of visual captcha, will never be extremely effective. I think we will see more PoW captchas in the coming years (just like Cloudflare's Turnstile captcha).


I'm not sure about that; can't you already give GPT-4 a math problem in an image and have it solve it correctly most of the time?

And these haven't even been trained to defeat captchas or logic-problem captchas yet; if one were fine-tuned on their general patterns, I imagine any form of captcha is bust.


Not sure about PoW. What about people with slower phones etc?

I don't see how an older smartphone could meaningfully outcompute a spamming infra.


Nope, I tried it for graph and diagram understanding and it wasn't good enough. Planning to repeat the evaluation with 4o when I have time.


I'm using 4o to convert visual diagrams into mermaid, and it's been almost perfectly accurate in my experience.


This is the out-of-the-box thinking I love about HN. What do you do with the mermaid?


The resulting mermaid is used for... more LLM processing. Converting to mermaid first is more cost-effective, consistent, and accurate for my purposes.



