I think OCR tools are good at what it says on the box: recognizing characters on a piece of paper, etc. If I understand this right, the advantage of using a vision language model is the added reasoning, so you can ask things like: "Clearly this is a string, but does it look like a timestamp or something else?"
VLMs are able to take context into account when filling in fields, following either a global or field-specific prompt. This is great for e.g. unlabeled axes, checking a legend for the units to append after a number, etc. You also catch lots of really simple errors with type hints (e.g. dates, addresses, country codes).
This has always been part of the complete OCR package as far as I know. The raw output of an OCR pass frequently fails to differentiate 1, l, I, i, | and other similar-looking symbols/letters.
Maybe this necessary step can be improved or replaced with a VLM. There is also the preprocessing step where the image gets its perspective corrected; I'm not sure how well a VLM performs there.
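For reference, the classical perspective-correction step is just a homography fitted to four corner correspondences (OpenCV's `getPerspectiveTransform` does this for you); a minimal pure-NumPy sketch of the same math, with made-up corner coordinates:

```python
import numpy as np

def homography(src, dst):
    """Solve for the 8-parameter perspective transform mapping four
    source points to four destination points (standard DLT setup)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, pt):
    """Apply the homography to one (x, y) point."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)

# A skewed document quad (hypothetical corner detections) mapped
# onto a clean upright 100x150 page:
src = [(10, 12), (210, 30), (225, 290), (5, 270)]
dst = [(0, 0), (100, 0), (100, 150), (0, 150)]
H = homography(src, dst)
```

Finding those four corners reliably is the hard part, and that's where a VLM might actually help more than with the warp itself.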
As you said, I think combining these techniques will be the most efficient way forward.
You can also use it for robustness. Looking at e.g. historical censuses, it's amazing how many ways people found to not follow the written instructions for filling them out. Often the information you want is still there, but woe to you if you look at the columns one by one and assume the information in them to be accurate and neatly within its bounding box.
What do you think is the main problem it solves there?
The cool thing is that we can extract XPaths from the agent runs and re-run these scripts deterministically. I think that's a big advantage over pure vision-based systems like Operator.
and if you want to know the reason - I work in small engineering teams, where clear communication is paramount. Being indirect or beating around the bush wastes time, leads to misunderstandings, and erodes trust. We need to be direct and concise to ensure everyone's on the same page and projects stay on track - respectfully, of course :)
lol, I initially thought Dylibso was the author; I was mistaken.
That being said - WASM has been steadily improving over time, yet it hasn't quite achieved mainstream adoption.
I'm curious what's holding it back?
It seems like despite its technical advancements, WASM hasn't captured the public's interest in the same way some other technologies have. Perhaps it's a lack of easily accessible learning resources, or maybe the benefits haven't been clearly articulated to a broader audience. There's also the possibility that developers haven't fully embraced WASM due to existing toolchains and workflows.
as a Dylibso employee, I am wondering what made you think that :D at Dylibso we advocate for Wasm for software extensions, rather than an alternative to containers!
Toolhouse.ai is on a mission to democratize access to function calling for developers worldwide. We're looking for a passionate and articulate Developer Advocate to join our growing team in the San Francisco Bay Area.
What You'll Do:
* Contribute to the design and development of developer tools: use-cases/demos, SDKs and APIs.
* Work closely with our team to ensure that developer needs are met from a product perspective.
* Build and maintain internal tools and infrastructure to support developer workflows.
* Identify and implement improvements to our existing products and services.
Who You Are:
* A passionate, extroverted developer with a product mindset and an understanding of LLMs and their potential applications.
* Excellent communication and presentation skills, with the ability to engage a technical audience.
* Experience building and using developer tools.
* A self-starter with the ability to manage your own schedule and workload (part-time).
If you're passionate about AI, startups and dev tools, we want to hear from you! Please get in touch with us by emailing hello@toolhouse.ai subject "Product Engineer - HN" and include any relevant links for us to understand who you are and what you have done/can do.