
Great stuff! Nice to see that Remotion is becoming more popular in projects like this.


I really appreciate you sharing your hands-on experience with a real-world scenario. It's interesting how people unfamiliar with traditional OCR often doubt LLMs, but having worked with actual documents, I know how inefficient classic OCR methods can be. So these minor errors don't alarm me at all. Your use case sounds fascinating - I might just incorporate it into my own benchmarks. Thanks again for your insightful comment!


I've found this method really useful for prepping PDFs before running them through AI. I mix it with traditional OCR for a hybrid approach. It's a game-changer for getting info from tricky pages. Sure, you wouldn't bet the farm on it for a big, official project, but it's still pretty solid. If you're willing to spend a bit more, you can use extra prompts to check for any context skips. It's a lot of work, though - probably best left to companies that specialize in this stuff.

I've been testing it out on pitch decks made in Figma and saved as JPGs. Surprisingly, the LLM OCR outperformed top dogs like SolidDocuments and PDFtron. Since I'm mainly after getting good context for the LLM from PDFs, I've been using this hybrid setup, bringing in the LLM OCR for pages that need it. In my book, this API is perfect for these kinds of situations.


You're absolutely right. I use PDFTron (through CloudConvert) for full-document OCR, but for pages with fewer than 100 characters, I switch to this API. It's a great combo – I get the solid OCR performance of SolidDocuments for most content, but I can also handle tricky stuff like stats, old-fashioned text, or handwriting that regular OCR struggles with. That's why I added page numbers upfront.
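
For illustration, a rough sketch of that routing heuristic – the 100-character threshold is from above, but PyMuPDF and the function names are just my assumptions, not the actual pipeline:

    import fitz  # PyMuPDF, assumed here just to read per-page text

    CHAR_THRESHOLD = 100  # pages below this go to the LLM OCR path

    def route_pages(pdf_path):
        """Split pages into 'classic OCR' and 'LLM OCR' buckets by extractable text length."""
        doc = fitz.open(pdf_path)
        classic, llm = [], []
        for page_number, page in enumerate(doc, start=1):
            text = page.get_text().strip()
            if len(text) < CHAR_THRESHOLD:
                llm.append(page_number)      # sparse text: likely scans, charts, handwriting
            else:
                classic.append(page_number)  # dense text: traditional OCR/extraction is fine
        return classic, llm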


I ought to test this with Sonnet too and compare the results. I feel it might perform better on OCR tasks. While I went with Azure OpenAI due to fewer rate restrictions, you've got a point - Sonnet could really shine here.


You're spot on. We shouldn't lump all LLMs together. This approach might work wonders for Anthropic and OpenAI's top-tier models, but it could fall flat with smaller, less complex ones.

I purposely set the temperature to 0.1, thinking the LLM might need a little wiggle room when whipping up those markdown tables. You know, just enough leeway to get creative if needed.
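
For reference, a minimal sketch of that setting against the Azure OpenAI chat completions API – the deployment name, prompt, and file paths are placeholders:

    import base64
    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        api_key="...",                    # placeholder credentials
        api_version="2024-02-15-preview",
        azure_endpoint="https://your-resource.openai.azure.com",
    )

    with open("page_12.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",                   # your Azure deployment name
        temperature=0.1,                  # small, non-zero wiggle room for markdown tables
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to markdown."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)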


Yes, you can customize this as you wish by adding it to your prompt.


I did a ton of Googling before writing this code, but I couldn't find you guys anywhere. If I had, I'd have definitely used your stuff. You might want to think about running some small-scale Google Ads campaigns - they could be especially effective if you target people searching for LLM and OCR together. Great product, congrats!


Hey, thanks! DM me if you want to test it out (sudeep@vlm.run).

Agreed on SEO - we’re redoing our landing page and searchability. We recently rebranded, hence the lack of direct search hits for LLM / OCR.


Manually transcribing the most challenging page of the document (or refining a transcription over multiple rounds of dialogue with Claude) and incorporating it as a few-shot example can dramatically improve overall consistency.
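
A rough sketch of how that few-shot setup could be wired up in an OpenAI-style message list – the prompts, helper names, and file paths are all illustrative:

    import base64

    SYSTEM = "You transcribe document pages to clean markdown. Preserve tables and headings."

    def image_part(path):
        """Build an image content part in the OpenAI-style message format."""
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

    def build_messages(new_page_path, example_page_path, example_transcript):
        """Few-shot: show one hand-checked page + transcript before the real request."""
        return [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [{"type": "text", "text": "Transcribe this page to markdown."},
                                         image_part(example_page_path)]},
            {"role": "assistant", "content": example_transcript},  # your manual transcription
            {"role": "user", "content": [{"type": "text", "text": "Transcribe this page to markdown."},
                                         image_part(new_page_path)]},
        ]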


I get your worries about LLMs and their consistency problems. But I think we can fix a lot of that using LLMs themselves for checks. If you're after top-notch accuracy, you could throw in another prompt, add both the visual and the extracted text as input, and double-check that nothing got dropped along the way. The cheaper models are actually great for this kind of quality control. LLMs have come a long way since they first showed up, and I reckon they've stepped up their game enough to shake off that reputation for inconsistent output.
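
As an example, a second-pass check along those lines could look roughly like this – the model name and prompt are placeholders, not any particular product's API:

    # Second pass: hand a cheaper vision model both the page image and the first-pass
    # transcription, and ask it only to flag omissions or misreads.
    CHECK_PROMPT = (
        "Compare the attached page image against the transcription below. "
        "List any missing lines, numbers, or table cells. Reply 'OK' if nothing is missing.\n\n"
        "TRANSCRIPTION:\n{transcription}"
    )

    def verify_page(client, page_image_part, transcription, model="gpt-4o-mini"):
        response = client.chat.completions.create(
            model=model,            # a cheaper model is fine for QA-style checks
            temperature=0,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": CHECK_PROMPT.format(transcription=transcription)},
                    page_image_part,  # same format as image_part() above
                ],
            }],
        )
        return response.choices[0].message.content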


How would you know something is missing?

I tried multiple OCRs before, and it's hard to tell whether the output is accurate other than by comparing manually.

I created a tool to visualise the output of OCR [0] to see what's missing, and there are many cases that would be quite concerning, especially when working with financial data.

This tool wouldn't work with LLMs, as they don't return character-level recognition results (to my knowledge), which makes it harder to evaluate them at scale.

If I were to use LLMs for this task, I would use them to help train an ML model to do OCR better, for example by generating thousands of synthetic samples to train on (rough sketch below).

[0] https://github.com/orasik/parsevision
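
A rough sketch of that synthetic-data idea – the sample strings stand in for LLM-generated text, and the fonts and paths are placeholders:

    from PIL import Image, ImageDraw, ImageFont  # pip install pillow
    import csv, random

    def render_sample(text, out_path, size=(800, 60)):
        """Render one text string onto a plain image, returning (image_path, ground_truth)."""
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()          # swap in real document fonts for variety
        draw.text((10, random.randint(5, 20)), text, fill="black", font=font)
        img.save(out_path)
        return out_path, text

    # `texts` could come from an LLM asked to produce realistic invoice lines,
    # amounts, dates, etc. Here it's just a placeholder list.
    texts = ["Invoice total: $12,480.55", "Due date: 2024-03-31", "Account no. 0081-2245"]

    with open("labels.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for i, t in enumerate(texts):
            writer.writerow(render_sample(t, f"sample_{i}.png"))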


Wow, you knocked it out of the park! I'll be sure to use this when I tackle that evaluation.


If you can use an LLM for sanity checking, why can't you use it for extraction in the first place?


Because current models output a stream of tokens directly, and those tokens are the unit of both performance and billing. Better models can do a better job of producing reasonable output, but there is a limit to what can be done "on the fly".

Some models, like OpenAI's o1, have started employing internal "thinking" tokens, which may or may not be equivalent to performing multiple passes with the same or different models, but the effect is similar.

One way to look at it is that if you want better results, you have to put more computational resources into thinking. Also, just like with humans, a team effort yields more well-rounded results because you combine the strengths and offset the weaknesses of the different team members.

You can technically wrap all this into a single black box and have it converse with you as if it were a single entity that internally uses multiple models to think, cross-check, etc. (a toy sketch of this is at the end of this comment). The output is likely not going to be real-time though, and real-time conversation has until now been a very important feature.

In the future we may, on one hand, relax the real-time constraint and accept that for some tasks accuracy is more important than real-time results.

Or we may eventually have faster machines or cleverer algorithms that can "think" more in shorter amounts of time.

(Or a combination of the two)
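
To make the black-box idea above concrete, a toy sketch – the model names and prompts are purely illustrative:

    # One public call, several internal passes: a draft model answers, a second
    # (possibly cheaper) model critiques, and the draft model revises once.
    def black_box_answer(client, question, draft_model="gpt-4o", critic_model="gpt-4o-mini"):
        draft = client.chat.completions.create(
            model=draft_model,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content

        critique = client.chat.completions.create(
            model=critic_model,
            messages=[{"role": "user", "content":
                       f"Question: {question}\n\nDraft answer: {draft}\n\n"
                       "Point out any errors or omissions, briefly."}],
        ).choices[0].message.content

        final = client.chat.completions.create(
            model=draft_model,
            messages=[{"role": "user", "content":
                       f"Question: {question}\n\nDraft: {draft}\n\nReview notes: {critique}\n\n"
                       "Write the corrected final answer."}],
        ).choices[0].message.content
        return final  # caller sees one answer; the extra latency is the cost of the extra passes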

