
We've been building something similar with https://vlm.run/: we're starting out with documents, but we think the real killer app will involve agentic workflows grounded in visual inputs like websites. The challenge is that even the best foundation models still struggle with hallucination and rate limits, which means you have to chain together both OCR and LLMs to get a good result. Tools like Tesseract work fine for simple, dense documents, but don't help with more complex visual media like charts and graphs. LLMs are great, but even OpenAI's release of JSON-schema structured outputs hasn't really fixed 'making things up' or 'giving up halfway through'.
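To make the chaining concrete, here's a minimal sketch of the pattern: run OCR, hand the text to an LLM, and validate the returned JSON against a schema, retrying when the model hallucinates a field or truncates its output. The field names, the `call_llm` callable, and the retry policy are all hypothetical, not vlm.run's actual pipeline:

```python
import json

# Hypothetical required schema for an invoice-extraction task.
REQUIRED_FIELDS = {"invoice_number": str, "total": float}

def validate(raw: str):
    """Parse LLM output; reject truncated JSON or missing/mistyped fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # 'gave up halfway through': truncated output
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            return None  # missing or hallucinated field
    return data

def extract(ocr_text: str, call_llm, max_retries: int = 3) -> dict:
    """Retry the LLM until its JSON output passes the schema check."""
    for _ in range(max_retries):
        result = validate(call_llm(ocr_text))
        if result is not None:
            return result
    raise RuntimeError("LLM output failed schema validation after retries")

# Fake LLM that truncates its first response, then succeeds.
responses = iter([
    '{"invoice_number": "INV-1"',
    '{"invoice_number": "INV-1", "total": 99.5}',
])
print(extract("Invoice INV-1  Total: $99.50", lambda _: next(responses)))
```

The point of the retry-with-validation loop is that JSON-schema-constrained decoding guarantees well-formed output at best, not *correct* output, so you still need an application-level check before trusting the result.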


