Hacker News: orliesaurus's comments

ToDesktop vulnerability: not surprised. Trust broken.

I think OCR tools are good at what it says on the box: recognizing characters on a piece of paper, etc. If I understand this right, the advantage of using a vision language model is the added reasoning, so you can ask things like: "Clearly this is a string, but does it look like a timestamp or something else?"

VLMs are able to take context into account when filling in fields, following either a global or field specific prompt. This is great for e.g. unlabeled axes, checking a legend for units to be suffixed after a number, etc. Also, you catch lots of really simple errors with type hints (e.g. dates, addresses, country codes etc.).
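A minimal sketch of the "type hints" idea, assuming the VLM hands back raw strings per field: coerce each string into a typed value and collect every failure for review instead of trusting the model. The `ExtractedRecord` schema, the `validate` helper and the five-entry country set are hypothetical; a real pipeline would load the full ISO 3166-1 alpha-2 list.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

# Tiny stand-in for the ISO 3166-1 alpha-2 list (assumption, for illustration only).
KNOWN_COUNTRY_CODES = {"DE", "FR", "GB", "JP", "US"}


@dataclass
class ExtractedRecord:
    """Hypothetical target schema: the 'type hints' the VLM output must satisfy."""
    issued_on: date
    country: str


def validate(raw: dict[str, str]) -> tuple[ExtractedRecord | None, list[str]]:
    """Coerce raw VLM strings into typed fields and collect every failure."""
    errors: list[str] = []
    issued_on = None
    try:
        issued_on = date.fromisoformat(raw.get("issued_on", "").strip())
    except ValueError:
        errors.append(f"issued_on: not an ISO date: {raw.get('issued_on')!r}")
    country = raw.get("country", "").strip().upper()
    if country not in KNOWN_COUNTRY_CODES:
        errors.append(f"country: unknown code {country!r}")
    if errors:
        return None, errors
    return ExtractedRecord(issued_on, country), errors


print(validate({"issued_on": "2024-03-05", "country": "de"}))  # record with country 'DE', no errors
print(validate({"issued_on": "05/03/2024", "country": "Germany"}))  # None plus two errors
```

Flagged records could then be sent back to the model with the error messages, or to a human.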

This has always been part of the complete OCR package, as far as I know. Raw OCR output routinely fails to distinguish 1, l, I, i, | and other similar-looking symbols/letters.
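A rough sketch of that classic post-OCR correction step, assuming the caller already knows a field should be numeric: swap commonly confused glyphs for digits, but keep the result only if it actually parses as a number. The confusable mapping below is an illustrative assumption, not taken from any particular OCR engine.

```python
# Glyphs that raw OCR commonly confuses with digits (illustrative assumption).
DIGIT_CONFUSABLES = str.maketrans({
    "l": "1", "I": "1", "i": "1", "|": "1",
    "O": "0", "o": "0",
    "S": "5",
    "B": "8",
})


def normalize_numeric(token: str) -> str:
    """Map confusable letters to digits, but only if the result is numeric.

    The context (this field is known to hold a number) is what makes the
    substitution safe; tokens that still aren't numbers are left untouched.
    """
    candidate = token.translate(DIGIT_CONFUSABLES)
    # Allow a single decimal point, e.g. "l2.5O" -> "12.50".
    if candidate.replace(".", "", 1).isdigit():
        return candidate
    return token


print(normalize_numeric("l2.5O"))     # 12.50
print(normalize_numeric("Illinois"))  # Illinois
```

This is exactly the kind of rule a VLM could replace, since it can decide from the surrounding layout whether a field is numeric in the first place.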

Maybe this necessary step can be improved or replaced with a VLM. There is also the preprocessing step where the image gets its perspective corrected; I'm not sure how well a VLM performs there.
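For context, classic perspective correction usually estimates a homography from the four detected page corners and maps them onto an upright rectangle. A dependency-free sketch of that 4-point direct linear transform follows; corner detection and pixel resampling are assumed to happen elsewhere (in practice a library call such as OpenCV's `getPerspectiveTransform` computes the same matrix), and the corner coordinates are made up.

```python
def solve_linear(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corners (three or more collinear?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def homography(src, dst):
    """Return the 3x3 matrix H (normalised so H[2][2] == 1) mapping src[i] -> dst[i].

    Each of the four point pairs gives two linear equations, e.g.
    u * (h6*x + h7*y + 1) = h0*x + h1*y + h2, and likewise for v.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def warp_point(h, point):
    """Map one point through H, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Page corners detected in a skewed photo (made-up numbers) -> upright 200 x 300 page.
corners = [(10, 20), (200, 30), (210, 300), (5, 280)]
target = [(0, 0), (200, 0), (200, 300), (0, 300)]
H = homography(corners, target)
print([warp_point(H, p) for p in corners])  # approximately equal to target
```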

As you said, I think combining these techniques will be the most efficient way forward.
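One way such a combination might look, sketched under assumptions: run the cheap OCR pass first and escalate a field to the VLM only when OCR reports low confidence. The engine signature, the `FieldResult` type and the 0.9 threshold are all hypothetical; the stub engines stand in for real OCR/VLM calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by whichever engine produced it
    source: str        # "ocr" or "vlm"


# Hypothetical engine signature: (image bytes, field name) -> FieldResult.
Engine = Callable[[bytes, str], FieldResult]


def extract_field(image: bytes, field: str, ocr: Engine, vlm: Engine,
                  threshold: float = 0.9) -> FieldResult:
    """Run the cheap OCR pass first; escalate to the VLM only when OCR is unsure."""
    result = ocr(image, field)
    if result.confidence >= threshold:
        return result
    return vlm(image, field)


# Stub engines standing in for real OCR/VLM calls.
def fake_ocr(image: bytes, field: str) -> FieldResult:
    return FieldResult("2O24-01-05", 0.60, "ocr")


def fake_vlm(image: bytes, field: str) -> FieldResult:
    return FieldResult("2024-01-05", 0.95, "vlm")


print(extract_field(b"", "date", fake_ocr, fake_vlm).text)  # 2024-01-05
```

The threshold trades cost against accuracy: lowering it sends fewer fields to the (slower, pricier) VLM.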


You can also use it for robustness. Looking at e.g. historical censuses, it's amazing how many ways people found to not follow the written instructions for filling them out. Often the information you want is still there, but woe to you if you look at the columns one by one and assume the information in them to be accurate and neatly within its bounding box.
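Even without a VLM, the "don't trust the bounding box" point can be illustrated with a simple heuristic: assign each OCR token to a column by its horizontal centre and clamp stray marks to the nearest outer column, instead of dropping anything that crosses a ruled line. The `(text, x_min, x_max)` token format and the column-edge list are assumptions for this sketch, not any specific OCR output format.

```python
from bisect import bisect_right


def assign_to_columns(tokens, column_edges):
    """Assign OCR tokens to form columns by each token's horizontal centre.

    tokens: list of (text, x_min, x_max) tuples (hypothetical format).
    column_edges: sorted x positions of the column boundaries,
        with len(column_edges) == number_of_columns + 1.
    A strict bounding-box test would drop entries that spill over a ruled
    line; using the centre, and clamping to the outer columns, keeps them.
    """
    n_cols = len(column_edges) - 1
    columns = [[] for _ in range(n_cols)]
    for text, x_min, x_max in tokens:
        centre = (x_min + x_max) / 2
        idx = bisect_right(column_edges, centre) - 1
        idx = min(max(idx, 0), n_cols - 1)  # clamp writing outside the table
        columns[idx].append(text)
    return columns


# Name | Age | Occupation, with entries written across the column lines.
edges = [0, 100, 200, 300]
tokens = [("Smith", 10, 90), ("42", 150, 230), ("farmer", 280, 340), ("M", -20, 10)]
print(assign_to_columns(tokens, edges))  # [['Smith', 'M'], ['42'], ['farmer']]
```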

OmniParser is pretty amazing! Thanks for sharing! It parsed the Space Jam (1996) website pretty well, despite that website being extremely out of date.

this is the best thing to ever happen to the RPA world

What do you think is the main problem it solves there?

The cool thing is that we can extract XPaths from the agent runs and re-run these scripts deterministically. I think that's a big advantage over pure vision-based systems like Operator.
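A toy sketch of the record-and-replay idea, assuming each agent action was logged as an (action, XPath, value) step: replaying is then plain selector lookups with no model in the loop. It uses the standard library's limited XPath support on a well-formed XHTML snippet; the `Step` format and the "fill"/"read" actions are hypothetical, and a real setup would replay through a browser driver such as Playwright or Selenium.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str     # hypothetical action vocabulary: "fill" or "read"
    xpath: str      # selector captured during the agent run
    value: str = ""


def replay(document: str, steps: list[Step]) -> list[str]:
    """Re-run recorded steps against a page with no model in the loop."""
    root = ET.fromstring(document)
    outputs: list[str] = []
    for step in steps:
        el = root.find(step.xpath)
        if el is None:
            # The page changed since recording: fail loudly instead of guessing.
            raise LookupError(f"selector no longer matches: {step.xpath}")
        if step.action == "fill":
            el.set("value", step.value)
        elif step.action == "read":
            outputs.append(el.get("value") or el.text or "")
        else:
            raise ValueError(f"unknown action: {step.action}")
    return outputs


page = "<form><input id='email'/><span id='status'>ready</span></form>"
recorded = [
    Step("fill", ".//input[@id='email']", "me@example.com"),
    Step("read", ".//input[@id='email']"),
    Step("read", ".//span[@id='status']"),
]
print(replay(page, recorded))  # ['me@example.com', 'ready']
```

A failed lookup is the natural point to hand control back to the agent, which keeps the expensive vision step for the cases that actually need it.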


Is Lindy voice-activated?

Does Mastra support tool libraries for agents, like toolhouse.ai or Agentic (https://github.com/transitive-bullshit/agentic)?

Agentic's tool library _should_ also work for Mastra via its AI SDK adapter.

(We haven't tested this, so if you do try let us know if you see quirks!)


What about Toolhouse and/or composeio?

On npm there's a mastra/composeio package that might work; they also seem to have some MCP support.

I hate this. Sorry.


And if you want to know the reason: I work in small engineering teams, where clear communication is paramount. Being indirect or beating around the bush wastes time, leads to misunderstandings, and erodes trust. We need to be direct and concise to ensure everyone's on the same page and projects stay on track. Respectfully, of course :)


lol I initially thought dylibso was the author; I was mistaken. That being said, WASM has been steadily improving over time, yet it hasn't quite achieved mainstream adoption.

I'm curious: what's holding it back?

Despite its technical advances, WASM hasn't captured the public's interest the way some other technologies have. Perhaps there's a lack of accessible learning resources, or maybe the benefits haven't been clearly articulated to a broader audience. It's also possible that developers haven't fully embraced WASM because they're invested in their existing toolchains and workflows.

[1] https://github.com/dylibso


> lol I initially thought dylibso was the author

As a Dylibso employee, I'm wondering what made you think that :D At Dylibso we advocate for Wasm as a way to build software extensions, rather than as an alternative to containers!


Because of the topic. As far as I can tell, you guys are the only people publicly advocating for Wasm in general.


the quality of this code reminds me of mine 10 years ago


Toolhouse | SF/Bay Area | Remote | Product Engineer | https://toolhouse.ai | Full-time

Toolhouse.ai is on a mission to democratize access to function calling for developers worldwide. We're looking for a passionate and articulate Product Engineer to join our growing team in the San Francisco Bay Area.

What You'll Do:

* Contribute to the design and development of developer tools: use cases/demos, SDKs, and APIs.

* Work closely with our team to ensure developer needs are met from a product perspective.

* Build and maintain internal tools and infrastructure to support developer workflows.

* Identify and implement improvements to our existing products and services.

Who You Are:

* An extroverted, passionate developer with a product mindset and an understanding of LLMs and their potential applications.

* Excellent communication and presentation skills, with the ability to engage a technical audience.

* Experience building and using developer tools.

* A self-starter with the ability to manage your own schedule and workload (part-time).

If you're passionate about AI, startups and dev tools, we want to hear from you! Please get in touch by emailing hello@toolhouse.ai with the subject "Product Engineer - HN", and include any relevant links that help us understand who you are and what you have done or can do.

