I've built a couple of experiments using it so far and it has been really interesting.
On one hand, it has really helped me with prototyping incredibly fast.
On the other, it is prohibitively expensive today. Essentially you pay per click, in some cases per keystroke. I tried to get it to find a flight for me. So it opened the browser, navigated to Google Flights, entered the origin, destination, etc. By the time it saw a price, there had already been more than a dozen LLM calls. And then it crashed due to a rate-limit issue. By the time I finally got a list of flight recommendations, it had already cost $5.
But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback will help them make it better. Very quickly it will get more efficient, less expensive, and more reliable.
So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.
I guess I'm confused that there's even a use case there. It's like "let me google that for you". I mean, Siri can return search results for flights for me.
A real killer app would be something adaptive and smart enough to deal with all the SEO/walled gardens in the travel search space, actually understanding which airlines are available and searching directly with them as well as at aggregators. It could also be integrated with your airline miles accounts and suggest options to pay with miles, miles & cash, or cash, etc.
All of that is far more complex than... clicking around Google Flights on your behalf and crashing.
Further, the real killer app is one that is bulletproof enough that you'd entrust it to book said best flight for you. This requires getting the product to 99.99% rather than the perpetual 70-80% we are seeing all these LLM use cases hit.
The airline booking + awards redemption use case is a mostly solved problem. Hardcore mileage redemption enthusiasts use paid tools like ExpertFlyer that present a UI and API for peeking into airline reservation backends. It has a steep learning curve, for sure.
ThePointsGuy blog tried to implement something that directly tied into airline accounts to track mileage/points and redemption options, but I believe they got slapped down by several airlines for unauthorized scraping. Airlines do NOT like third parties having access to frequent flier accounts.
While the strategy to find good deals / award space is a solved problem, the search tools to do so aren't. Tools like ExpertFlyer are super inefficient: you can search for at most one origin + one destination + one airline per search. What if you're happy to go anywhere in Western Europe? Or if you want to check several different airlines? Then all of a sudden your one EF search might turn into dozens. And as you say, pretty much all of the aggregator tools are getting slapped down by airlines, so they have increasingly limited availability and some are shutting down completely.
And then add the complexity that you might be willing to pay cash if the price is right ... so then you add dozens more searches to that on potentially many websites.
All of this is "easy" and a solved problem but it's incredibly monotonous. And almost none of these services offer an API, so it's difficult to automate without a browser-based approach. And a travel agent won't work this hard for you. So how amazing would it be instead to tell an AI agent what you want, have it pretend to be you for a few minutes, and get a CSV file in your inbox at the end.
Whether this could be commercialised is a different question but I'm certainly going to continue building out my prototype to save myself some time (I mean, to be fair, it will probably take more time to build something to do this on my behalf but I think it's time well spent).
Yes, I think this points to the need for adaptiveness, which remains humans' edge.
We don't need PBs of training data, vast amounts of compute, and hours upon hours of training.
You could sit down a moderately intelligent intern as a mechanical turk to perform this workflow with only a few minutes of instruction and get a reasonably good result.
Ah, but I think you're overlooking one major factor. Convenience. A lot of the spontaneous stuff we do ("hey why don't we pop down to x tomorrow?", or "do you fancy a quick curry?") are things you're not going to book with a Turk. BUT you definitely would fire up a quick agent on your way to the shower and have it do all the work for you while you're waxing your armpits. :) Agentic work is starting super slow, but once the wrinkles are worked out, we'll see a world where they're doing a huge amount of heavy lifting for the drudge stuff. For an example see Her - sorry! :)
This kind of stuff is an existential threat to ad-based business models and upselling. If users no longer browse the web themselves, you can't show them ads. It's a monumental, Earth-shattering problem for behemoths like Google, but also for normal websites. Lots of websites (such as booking.com) rely on shady practices to mislead users and upsell them, etc. If you have a dispassionate, smart computer agent doing the transaction, it will only buy what's needed to accomplish the task.
There will be enormous push towards steering these software agents towards similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.
Ads will move to the layer of the new interface when that happens. Also a computer can't watch a youtube video for you or look at funny cat pictures. You can still put ads next to things people want to look at.
Care to elaborate on the idea? I suppose you mean that ads will come to this "computer use" tool itself. Now, will users keep it in the foreground, when they already expect the tool to do (almost) everything for them?
I think the point is - don't be so naive.
Companies are investing near-trillions into developing models, training models, compute, datacenters, nuclear reactors, etc.
Is the endgame some free/cheap tool that abstracts away the entire ad based web economy to the benefit of end users?
Imagine something closer to a super duper smart useful Siri/Alexa that feeds you product recommendations, paid placement, and other ads interspersed with your actual request response.
Hey Siri what temperature is it?
It's 45 and going to be chilly today; a North Face jacket might be handy... can I recommend a few models? What's your size?
Or it's just simply going to make purchases and flight bookings based on paid boosts from online stores and airlines. They will simply say that it's making a holistic assessment, not based purely on the final price but on overall reputability, etc. It's to avoid fraud and to streamline the experience based on a personalized machine-learning algorithm, yadda yadda.
I mean, what really are ads and dark tactics (like those observed on accommodation booking websites)? They are ways to influence purchasing decisions.
If the decision is offloaded to AI, then logically ways to sway the AI decision will be developed. Such as backroom deals, hidden prompts and rules governing the assessment of the AI in making choices.
Still not a problem for Meta/TikTok/YouTube though, as people go there to consume content on purpose. But I agree, will be fun to see how Google and others will deal with it.
And if purchases become agentic, fine print or other shady tricks hidden in business terms will be how businesses draw consumers in.
Also, none of this will be existential, earth-shattering or enormous until compute power per watt reaches a point where all of this is economical at scale.
and/or the ad dollars will move into the decision layer and the AI will make different decisions / recommendations for your request, depending on who is bidding the most...
Imagine the most dystopian outcomes and you'll probably be closer than "well I don't have to see ads anymore!"
I'm all for the MVP approach and shipping quickly, though I'm really surprised they went with image recognition and tooling for injecting mouse/keyboard events for automating human tasks.
I wonder why leveraging accessibility tools for this wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive tooling for accessibility tools like screen readers, and the whole point of those tools is to act as a middle man to programmatically interpret and interact with what's on screen.
I think the reason is that this is the most general implementation. It doesn't need playwright or have access to the DOM or anything else, if it has a screen and mouse/keyboard, then it will work. That's quite powerful (if slow and pricey, at the moment).
Unless I'm mistaken, playwright doesn't actually use the accessibility tree directly. It does have quite a few APIs for accessing nodes based on a11y attributes, but I could have sworn those were glorified query selectors rather than directly accessing the accessibility tree.
Last I checked on it, maybe a year ago, there were browser proposals for standardizing the accessibility tree APIs but they were very early discussions and seemed pretty well stuck.
That would be a good reason for Anthropic using image processing here though, short of forking open source a11y tools there may not have been a simple way to use accessibility data to interact.
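For reference, here's roughly what the role-based path looks like in Playwright's Python API. Whether that's resolved through the browser's accessibility tree or is effectively a glorified query selector over ARIA attributes, I'm still not certain, but either way the locator is semantic rather than pixel-based. The URL and link name below are just placeholders:

    # Sketch: semantic locators vs. pixel coordinates in Playwright (Python).
    # The URL and accessible name are placeholders.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")

        # Playwright can dump an accessibility snapshot of the page.
        print(page.accessibility.snapshot())

        # Locate by ARIA role + accessible name instead of screen position.
        page.get_by_role("link", name="More information").click()

        browser.close()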
Those sound like stopgaps at best. It's pretty clear what the intended goal is here. APIs are easy to integrate with, but most systems in existence only have a visual interface intended for humans.
The end goal here is clear: being able to interface with anything available on the screen.
Accessibility tools are made for humans. If there is information only available visually and not via a screen reader or other accessibility tools, that is a problem that needs to be addressed.
Accessibility tools, I find, are never as great as the source. Just because they are made for humans does not mean they are an improvement. I imagine they're at best a stopgap as image models improve.
From what I've seen of this new product (I've never used it), it sounds like it's specifically trying to mimic a human user, and they went with image recognition plus faked input devices.
That approach is a weird one to me, though only as long as it's limited to the current use. If this is just another test bed for a much broader tool that could rely on accessibility APIs, that makes sense.
This is basically RPA with LLMs. And RPA is basically the worst possible solution to any problem.
Agents won't get anywhere because any user process you want to automate is better done by creating APIs and creating a proper guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.
There is plenty of legacy software out there that has no and will never have a nice API to integrate with. Those are the situations where the terrible solutions are either let a human do it or automate the human tool chain from a high level. This is the LLM spin on it. Is it an efficient or even good solution? Hell no, but if there is no other solution to automation (assuming that's the goal) then does that matter?
This is a severely under-appreciated perspective. A lot of software, especially in industries that are slow to change, is just not programming-friendly. There are no APIs and no access to underlying databases, just user-focused point-and-click.
My take is that those industries are also going to be very slow to adopt any AI tools, especially these, and for good reasons. We are looking at integrating LLMs into our products, but we have customers that told us flat out they can't use any of those.
I'd argue that this is not even a solution to begin with. If the LLM gets even one pixel value wrong, then at best the whole process breaks down. At worst, you could do some irreversible damage.
I could see this coming into Apple Intelligence, for example; you could simply ask the browser to buy stuff from your favorite store, or even do a chain of tasks like informing a contact from your list that you've bought said thing, etc.
The possibilities are quite exciting, in fact, even though the technology isn't quite there yet.
Apple should hook into app functions themselves instead of relying on UI. I would be really surprised if Apple made a browser automation tool, since that would be the complete opposite of the "it just works" credo
Captchas were already outsourced to cheap labor, maybe 10 or 20 cents a pop? AI using image interpretation is not any cheaper, so the captchas' efficacy is unchanged.
The product I would like to see out of this is a way to automate UI QA.
Ideally it would be given a persona and a list of use cases, try to accomplish each task and save the state where you/it failed.
Something like Chrome Lighthouse but for usability. Bonus points if it can highlight what part of my documentation is using mismatched terminology, making it difficult for newcomers to understand what button I am referring to.
I've seen similar sentiment even pre-LLM that AI would help automate other forms of testing, and I just don't quite see it.
Implementing tests is not the hard part. You could make that an intern project or hire a consultant for 3 months. The hard part is the interpretation of results.
That is - making a thing that spits out tickets/alerts is easy. The signal/noise tuning and actual investigation workflows are the hard part and still very manual & human operated. I don't see LLM mouse/keyboard control changing that yet.
> making a thing that spits out tickets/alerts is easy.
I don't really believe that what I am asking for is hard, yet I still can't buy it as far as I know.
> actual investigation workflows are the hard part and still very manual & human operated.
Sure, but it would give your QA worker a set of pre-tested, use-case-based paths, each with a flag on whether or not it may be problematic, plus a screen recording and a timestamp of where it went wrong.
These will always need a human in the loop to vet the findings before cutting a ticket to the development team.
Fair - I'm not personally familiar with the state of the art in UI QA automation, but I know there's been various screen-recording type tools available for a decade+ with mixed success.
I come more from a "big data" background, and have dealt with CTOs who think "can't we just use AI?" is the answer to data quality checking multi-PB data lakes with 1000s of unique datasets from 100s of vendors. That is - they don't want to staff a data quality team, they think you can just magic it all away.
The answer was always - sure, but you are fixated on the easy part - anomaly detection.
Actual data analysis on what broke, when, how, why, and escalating to the data provider was always 95% of the work. Someone needs to look at the exhaust, and there will be exhaust every single day... so you can kill your dev team's productivity or actually staff an operations team responsible for the tickets the thing spits out.
That's fair and I don't think I have a good counter to this, it would be very easy for such a UI QA product to become just another "security vulnerability scanner" that cuts low severity tickets that nobody looks at.
Do y'all see a way to ramp from mostly-human-in-the-loop to mostly-AI? Can you take a system that does 1% of the hard part of signal/noise tuning and teach it to get better over time?
I'm thinking for a single particular application under test and a mostly-static group of SMEs who might be involved to respond/tune
It’s both. Most manual tests are required to be run whenever the underlying code has changed. And that’s pretty slow and annoying. Interpreting results is usually pretty trivial, like checking the http code or checking against an assert. I don’t think most companies use/should use manual testing but where it’s unavoidable, this is a great workaround.
I've been hacking on a web browsing agent the last few weeks and it's given me some decent understanding of what it'd take to get this working. My approach has been to make it general-purpose enough that I describe the mechanics of surfing the web, without building in specific knowledge about tasks or websites. Some things I've learned:
1. Pixels and screenshots (video really) and keyboard/mouse events are definitely the purest and most proper way to get agents working in the long term, but it's not practical today. Cost and speed are big obvious issues, but accuracy is also low. I found that GPT-4o (08-06) is just plain bad at coordinates and bounding boxes, and naively feeding it screenshots just doesn't work. As a practical example, another comment mentions trying to get a list of flight recommendations from Claude computer use and it costing $5; if my agent is up for that task (haven't tested this), it would cost $0.10-$0.25.
2. "Feature engineering" helps a lot right now. Explicitly highlighting things and giving the model extra context and instructions on how to use that context, how to augment the info it sees on screenshots, etc. It's hard to understand things like hover text, show/hide buttons, etc. from pure pixels. (There's a rough sketch of what I mean below.)
3. You have to heavily constrain and prompt the model to get it to do the right thing now, but when it does it, it feels magic.
4. It makes naive, but quite understandable mistakes. The kinds of mistakes a novice user might make and it seems really hard to get this working. A mechanism to correct itself and learn is probably the better approach rather than trying to make it work right from the get-go in every situation. Again, when you see the agent fail, try again and succeed the second time based on the failure of the previous action, it's pretty magical. The first time it achieved its objective, I just started laughing out loud. I don't know if I've ever laughed at a program I've written before.
It's been very interesting working on this. If traditional software is like building legos, this one is more like training a puppy. Different, but still fun. I also wonder how temporary this type of work is, I'm clearly doing a lot of manual work to augment the model's many weaknesses, but also models will get substantially better. At the same time, I can definitely see useful, practical computer use from model improvements being 2-3 years away.
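To make point 2 above concrete, here's a rough sketch of that kind of feature engineering, just illustrative, not exactly what my agent does: enumerate the visible interactive elements, number them, and let the model answer with "click element 3" instead of raw coordinates.

    # Illustrative sketch: number the visible interactive elements and build a
    # text summary to send alongside the screenshot. The selector list is a
    # rough guess at "interactive", not a complete one.
    from playwright.sync_api import sync_playwright

    INTERACTIVE = "a, button, input, select, textarea, [role='button']"

    def describe_page(page):
        elements = [e for e in page.query_selector_all(INTERACTIVE) if e.is_visible()]
        lines = []
        for i, el in enumerate(elements):
            tag = el.evaluate("e => e.tagName").lower()
            label = (el.inner_text() or el.get_attribute("aria-label") or "").strip()[:60]
            box = el.bounding_box() or {"x": 0, "y": 0}
            lines.append(f"[{i}] <{tag}> '{label}' at ({int(box['x'])}, {int(box['y'])})")
        return elements, "\n".join(lines)

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com")
        elements, summary = describe_page(page)
        print(summary)  # goes into the prompt; a "click 3" answer maps to elements[3].click()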
It seems like a cheaper intermediate capability would be to give Claude the ability to SSH to your computer or to a cloud container. That would unlock a lot of possibilities, without incurring the cost of the vision model or the difficulty of cursor manipulation.
Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
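Roughly what I have in mind, as a sketch, is just ordinary tool use with a "run a shell command" tool and a loop. The tool name and schema here are made up, and you'd obviously sandbox it in a throwaway container and keep the confirmation step:

    # Sketch: give Claude a shell as a plain tool instead of a screen. The
    # model proposes commands, we run them and feed the output back.
    import subprocess
    import anthropic

    client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

    shell_tool = {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }

    messages = [{"role": "user", "content": "Show me the disk usage of /var/log."}]

    while True:
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=[shell_tool],
            messages=messages,
        )
        tool_calls = [b for b in resp.content if b.type == "tool_use"]
        if not tool_calls:
            print(resp.content[0].text)  # final answer
            break
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for call in tool_calls:
            cmd = call.input["command"]
            if input(f"run `{cmd}`? [y/N] ").lower() != "y":
                output = "user declined to run this command"
            else:
                proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
                output = (proc.stdout + proc.stderr)[:4000]
            results.append({"type": "tool_result", "tool_use_id": call.id, "content": output})
        messages.append({"role": "user", "content": results})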
Just the other day someone used Claude to write a script to configure a server. It left a port open and the server was hacked hours later and used to attack other servers. Hetzner almost banned the hosting account.
I had GPT-4o walk me through configuring my RAID array, a simple two-drive duplication affair, and some command broke the configuration in new and mysterious ways - I can no longer get the drives to appear at all. So that will be the last time I copy-paste anything from an AI into a shell.
IIUC, Claude's "Computer Use" is roughly a remote desktop, which is a superset of a remote shell. I don't think I'm proposing anything with a greater risk than already exists.
Based on the flow diagram, that doesn't seem to be the same thing. Webwright seems to be a shell as a tool for me, enhanced with AI features. I'm suggesting the shell as a tool for AI.
Webwright is a front-end shell that presents to me; I'm suggesting a back-end shell that presents to Claude.
It doesn't appear that Webwright enables tool-use. In other words, there's no task-oriented feedback loop between AI-provided shell commands and the results of those shell commands. Please correct me if that's not right.
Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before being sent to Sonnet, and does it respond with a box name or a coordinate? Is SAM2 used for segmenting everything before sending to Sonnet?
I really, really like this new product/API offering. Still crashes quite a bit for me and obviously makes mistakes, but shows what's possible.
For the folks who are more savvy on the Docker / Linux front...
1. Did Anthropic have to write its own "control" for the mouse and keyboard? I've tried using `xdotool` and related things in the past and they were very unreliable. (See the sketch after this list.)
2. I don't want to dismiss the power and innovation going into this model, but...
(a) Why didn't Adept or someone else focused on RPA build this?
(b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
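On question 1, for anyone who wants to poke at it: the kind of xdotool control I mean looks roughly like this from Python. The coordinates and text are placeholders, it assumes xdotool is installed and DISPLAY points at an X/Xvfb session, and I don't know what Anthropic's reference container actually uses under the hood:

    # Sketch: driving mouse/keyboard on an X display by shelling out to xdotool.
    import subprocess

    def xdotool(*args):
        return subprocess.run(["xdotool", *args], check=True,
                              capture_output=True, text=True).stdout

    xdotool("mousemove", "400", "300")                 # move cursor to (400, 300)
    xdotool("click", "1")                              # left click
    xdotool("type", "--delay", "50", "hello from the agent")
    xdotool("key", "Return")                           # press Enter
    print(xdotool("getmouselocation"))                 # e.g. "x:400 y:300 screen:0 ..."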
At the end of the day, the fundamental dynamic here is human creativity. We are taking a tool, the LLM, and stretching it to its limit. That’s great, but that doesn’t mean we are close to AGI. It means we are AGI.
This is an insightful comment, though it just goes to show how rigid the framing is of "natural vs. artificial" or "human vs. machine". None of this stuff has any vitality outside of _some_ relationship or interface with people.
Yeah, it makes the owner class richer while driving the marginal cost of labor to zero, at which point the working class can't sell their labor at all and starve.
This would assume the rich somehow oppress everyone to pieces. If I have access to all this wonderful automation tech, I'm sure as fuck not going to sit around and starve, I'm going to try to automate my food production to make more food, more efficiently.
> If I have access to all this wonderful automation tech
But "you" don't, that is precisely the point. The speed at which the gap between rich and poor grows keeps increasing, after all (the rest is commentary), and people who right now send others to die and kill in wars for oil and whatnot will not suddenly start sharing once they've fully captured all means of production for good. That's like hoping the person who keeps stealing your shit every chance he gets, leaving you in sickness or death without a thought, will give you a billion dollars once all the locks on your house have rusted off completely and you no longer have means to call the police.
This is a step towards a human-machine hybrid world. Putting a human in the loop can do wonders. Sure, it is expensive now, but the subsequent iterations will crush it.
Have you heard of Centaur chess? A human and a machine would team up to find the best chess moves against another similar team. It's not a thing anymore. Computers have advanced so much that humans can't really contribute in any meaningful sense.
All these AI models do quite well in games because there are set rules, finite moves, and they can iterate in a tight loop (without humans) to get immediate feedback on pass/fail.
I think this is what differentiates the speed at which AIs have gotten from ok -> good -> great -> better than humans at say chess, versus say driving a car, summarizing a paper, understanding human requests, recommending music, etc.
I think a lot of people are extrapolating the rate of progress & possible accuracy rates from chess bots to domains that do not compare.
Is the point of your comment to make people feel depressed?
Either we're going to use these tools to augment our abilities or basically just get wiped out (at least our jobs will be), and there is no plan to provide support for anyone. Maybe the tech will make the transition to a post-employment world so swift we don't even feel any negative economic effects at all, but let's see.
Depressing hasn’t been the reality for the majority of people over the last 100 years of technological progress. You could die from a scratch or a kidney stone 100s ago.
Once we realize we can make machines that can beat us in ways we can't even understand, I wonder if we will question whether we have always been influenced this way by an exterior force.
I wonder if I can hook up `scrcpy` with this and give it control over an Android. Can it drag the mouse? That'd be needed to navigate the phone at least.
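From what I can tell, scrcpy would just be the mirror; the input side could go through adb instead, which can take screenshots and inject taps and drags (a swipe with a duration is effectively a drag). Rough sketch with placeholder coordinates:

    # Sketch: screenshots + taps/drags over adb, no mouse required.
    import subprocess

    def adb(*args):
        return subprocess.run(["adb", "shell", *args], check=True,
                              capture_output=True).stdout

    # Screenshot bytes to feed the vision model (scrcpy is only for you to watch).
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         check=True, capture_output=True).stdout

    adb("input", "tap", "540", "1200")                         # tap at (540, 1200)
    adb("input", "swipe", "540", "1600", "540", "400", "300")  # drag/scroll over 300 ms
    adb("input", "text", "hello")                              # type into the focused field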
I saw those demoed yesterday. The model was asked to create a cool visualization. It ultimately tried to install streamlit and go to its page, only to find its own Claude software already running streamlit, so as part of debugging it killed itself. Not ready to let that go wild on my own computer!
Its main use case is making the average office worker feel the same existential dread that some programmers feel when they see an LLM spit out a bunch of code in mere seconds.
I don't normally go for comedy on HN but this one got an audible chuckle.
TBH, while I giggle at the thought of anybody being replaced, I don't think it's likely; it's just that the standards and expectations have shifted in some domains. I think if anything LLMs raised the tide for everyone (in relevant roles) and we're all able to move a little faster now, like when we went from abacus to calculator a while back, just at a different order of magnitude.
You're not far off. Anecdotal, but I shared the Anthropic demo video and a few articles in a company Slack, and a lot of PMs/admin folks that are only tangentially aware of LLM-powered use cases at this point shared that sentiment. Welcome to the party, folks!
I’m not sure if anyone else has really tried, but I’ve tested it a few times and never hit meaningful results.
1) I tried using it for QA for my SaaS, but the agent failed multiple times to fill out a simple form, then ended by saying the task was successfully completed.
2) It couldn’t scrape contact information from a website where the details weren’t even that hidden.
3) I also tried sending a message on Discord, but it refused, saying it couldn’t do so on someone else's behalf.
I mean, I’m excited for what the future holds, but right now, it’s not even in beta.
What is "this stuff"? Computer use was released 3 days ago and I would say the opposite is true for LLMs in general: it's overused in production and shoehorned into stuff that doesn't need it.
This is such an idiotic hype cycle; they just fine-tuned a model over a vision API. I really don't understand why everyone is losing their mind over this.
Which one? The article has four examples, none of which are particularly "cool" or impressive.
If anything, the examples involving moving the mouse to the address bar or getting CSVs of results are very poor examples, because we can already do that much better without "computer use".
Because this is the last thing we can think of right now and after this is an abyss for the stock market that everyone knows is inevitable, but thinks we can avoid.