Show HN: Skyvern – Browser automation using LLMs and computer vision (github.com/skyvern-ai)
422 points by suchintan 6 months ago | 139 comments
Hey HN, we're building Skyvern (https://www.skyvern.com), an open-source tool that uses LLMs and computer vision to help companies automate browser-based workflows. You can see some examples here: https://github.com/Skyvern-AI/skyvern#real-world-examples-of... and there's a demo video at https://github.com/Skyvern-AI/skyvern#demo, along with some instructions on running it locally.

We provide a natural-language API to automate repetitive manual workflows that happen within companies' back offices. You can check out our code and play with Skyvern here: https://github.com/Skyvern-AI/Skyvern

We talked to hundreds of companies about things they do in the background and found that most of them depend on repetitive manual workflows. The breadth of these workflows surprised us – most companies started off doing things manually, and eventually either hired people to scale the manual work, or wrote scripts using Selenium-like browser automation libraries.

In these conversations, one common point stood out: scaling is a pain either way. Companies relying on hiring struggled to adjust team sizes to fluctuating demand. Companies using Selenium and similar tools had a different problem: it could take days or even weeks to automate a new workflow, which then required ongoing maintenance any time the underlying websites changed, because the XPath-based interaction logic suddenly became invalid.

We felt like there was a way to get the best of both worlds with LLMs: use LLMs to reason through a website's layout, while preserving the key advantage of traditional browser automation, the ability to scale with demand. This led us to build Skyvern with a few core functionalities:

1. Skyvern can operate on websites it’s never seen before by connecting visible elements with the natural language instructions provided to us. We use a blend of computer vision and DOM parsing to identify a set of possible actions on a website, and multi-modal LLMs to map the natural language instructions to the available actions on the page.

2. Skyvern is resistant to website layout changes, as it doesn’t depend on any predetermined XPaths or other selectors. If a layout ever changes, we can leverage the methodology in #1 to complete the user-specified goal.

3. Skyvern accepts a blob of information when navigating workflows: basically just a JSON blob of whatever information you want to include, which we use LLMs to map to the information requested on screen (see the sketch below). For example: if you're generating a quote from Geico, they commonly ask "Were you eligible to drive at 21?". The answer can be inferred from the driver receiving their license in 2012 and having a birth year of 1996.
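As a concrete sketch, the payload for that quote flow might look something like this (the field names here are hypothetical; Skyvern maps whatever keys you supply to the questions on screen):

    {
      "drivers_license_issue_year": 2012,
      "birth_year": 1996,
      "vehicle": {
        "make": "Toyota",
        "model": "Camry",
        "year": 2018
      }
    }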

The above strategy adapts well to a number of use cases that Skyvern is helping companies with today: (1) Automating materials procurement by searching for, adding to cart, and transacting products through vendor websites that don't have APIs; (2) Registering accounts, filing forms, and searching for information on government websites (e.g. registering franchise tax information for Delaware C-corps); (3) Generating insurance quotes by completing multi-step dynamic forms on insurance websites; (4) Automating the job application process by mapping user-specified information (such as a resume) to a job posting.

And here are some use cases we're actively looking to expand into: (1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc), and (2) Doing customer research ahead of discovery calls by analyzing landing pages and other metadata about a specific business.

We’re still very early and would love to get your feedback!




I tried it out and it's pretty pricey. My OpenAI API bill is $3.20 after using this on a few different pages to test it out.

Not saying I wouldn't pay that for some use cases, but it would limit me.

One idea: making scrapers is a big pain. But once they are setup, they are cheap and fast to run... this is always going to be slower. What I'd love to see is a way to generate scrapers quickly. So you wouldn't be returning information from the New York City property registry... instead, you'd return Python code that I can use to scrape it in the future.

edit: This is likely because it was struggling, so it had to make extra calls. What would be nice is a simple feature where you can input the maximum number of calls / tokens to use on the entire call. Or even better, do some math and put in a dollar cap. i.e., go fill out the Geico forms for me and don't spend more than $1.00 doing it.


I love all of these ideas!!

1. You can set a "max steps" limit when you run it locally https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...

We also spit out the cost for each step within the visualizer. Click on any task > Steps > there's a column that's dedicated to how much things cost to run

https://github.com/Skyvern-AI/skyvern/issues/70

2. We have a roadmap item to "cache" or "memorize" specific tasks, so you pay the cost once, and then just run it over and over again. We're going to get to it soon!!



You've raised valid points about the cost and efficiency of our approach, which aims to make the LLM function as closely as possible to a human user. We chose this approach primarily for its compatibility with various websites, as it aligns closely with a website's intended audience, which is typically human.

Addressing complex website interactions is a key advantage of this approach. For instance, in the process of generating an auto insurance quote, the sequence of questions and their specifics can vary greatly depending on prior responses. A simple example is the choice of a foreign versus a California driver's license. Selecting a foreign license triggers additional queries about the country of issuance and expiry date, illustrating the complexity and branching nature of such web interactions.

However, we recognize the concerns about cost and are actively working on strategies to reduce it:

- Optimizing the context provided to the LLM

- Implementing caching mechanisms for certain repeated actions, only using LLMs when there's a problem

- Anticipating advancements in LLM efficiency and cost-effectiveness, with the hope of eventually finetuning our own models for greater efficiency


There are two things here:

1) Using the LLM to find elements/selectors in HTML

2) Using LLMs to fill out logical/likely/meaningful answers to things

I highly recommend you decouple these 2 efforts. While you gave a good example of "insurance quote step by step webapp", the vast majority of web scraping efforts are much more mundane.

Additionally, even in this instance, the selector brain/intelligence brain don't need to be coupled.

For example:

Selector brain: "Find/click the button for foreign drivers license."

Selector brain: "Find the country of origin field."

Selector brain: "Find the expiry date field."

LLM-intelligence brain: "Use values from prompt to fill out the country of origin and expiry date fields."

Not-LLM intelligence brain: Inputs values from a JSON object of documentSelector=>value.
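Concretely, the decoupled flow might look like this minimal sketch (Playwright, with hypothetical selectors standing in for the one-time output of the LLM "selector brain"):

    # Minimal sketch of the decoupling idea -- not Skyvern's actual code.
    # The "selector brain" (an LLM) runs once and its answers are saved;
    # after that, filling the form is plain Playwright with no LLM calls.
    from playwright.sync_api import sync_playwright

    # Saved output of a one-time LLM pass (selectors are hypothetical).
    recorded_selectors = {
        "foreign_license_button": "#license-type-foreign",
        "country_of_origin": "input[name='issuing-country']",
        "expiry_date": "input[name='license-expiry']",
    }

    # "Not-LLM intelligence brain": a plain field => value map.
    values = {
        "country_of_origin": "Canada",
        "expiry_date": "2026-05-01",
    }

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://insurer.example/quote")  # hypothetical URL
        page.click(recorded_selectors["foreign_license_button"])
        for field, value in values.items():
            page.fill(recorded_selectors[field], value)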


Interesting. We've decoupled navigation and extraction for specifically this reason, but I suppose decoupling selection from input could let us use cheaper, smaller LLMs to "select" and answer

We've been approaching it a little bit differently. We think larger, more capable models would actually immediately improve the performance of Skyvern. For example, if you run it with LLaVA, the performance significantly degrades, likely because of the coupling

But since we use GPT-4V, and it's rumoured to be a MoE model, I wonder if there's implicit decoupling going on.

I'm gonna spend some more time thinking about this


I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.

I don't want to use vision and LLMs for every page. I just want to use vision and LLMs to figure out what elements need to be clicked once. Or maybe every time the site changes the frontend.


This is a great point, and something already on our roadmap. We call it "prompt caching", but writing this I realize it's a terrible name. Will update! (https://github.com/Skyvern-AI/Skyvern?tab=readme-ov-file#fea...)

Thank you for this feedback


The AI would be a compiler that generates the traditional scraper / integration test.

It would save all that time spent manually going through every page and figuring out what mistake we made when that input string doesn't go into that input field or the button on the modal window isn't clicked.

Change the UI? Recompile with the AI.


I didn’t check the code but there would be a few good ways to specify what you want:

* browser extension that lets you record a few actions

* describing what you want to do with text

* a url with one or two lines of desired JSON to extract


> We call it "prompt caching"

No, that's something completely different than what bravura is talking about, which is why he made a comment to say explicitly that he still thinks you're missing the point.

From your roadmap:

> Prompt Caching - Introduce a caching layer to the LLM calls to dramatically reduce the cost of running Skyvern (memorize past actions and repeat them!)

Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes, then you use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.


We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization, instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: The generated output would just be a list of actions and their associated data source.

For example, if Skyvern was asked to log in to a website and do a search for product X, the generated action plan would include:

1. Click the log in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search

Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:

1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
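To make that concrete, the memorized action plan might be stored as something like the following (a hedged sketch; the actual schema is hypothetical):

    [
      {"action": "click", "element": "log in button"},
      {"action": "click", "element": "sign in with email"},
      {"action": "input", "element": "email field", "source": "credentials.email"},
      {"action": "input", "element": "password field", "source": "credentials.password"},
      {"action": "click", "element": "log in button"},
      {"action": "input", "element": "search bar", "source": "task.search_term"},
      {"action": "click", "element": "search button"}
    ]

If only the log-in button's locator breaks, option 2 means regenerating just that one entry and keeping the rest.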


I like this approach. Just as an example, if I'm getting a car insurance quote, I'd rather pay $1 to have the tool fill out the forms for me and be 90% sure that it filled them out correctly rather than pay $0.01 and only be 70% sure it did it correctly. And there are plenty of use cases like that.


You would still be willing to pay $1 if it got it wrong 10% of the time, or if it got 10% of the information wrong every time?


It really depends on the use case.


isn't that crazy rabbit thingy supposed to do just that? I hope you pre-ordered. I hear they're in great demand.



Just piggybacking here, but this is a great suggestion. It makes the cost a one-time expense, and you get something material (source code) in return.


> instead, you'd return Python code that I can use to scrape it in the future

Bravo, I would pay for this one, or hopefully run it on my GPU - it would be so fast to even just shove out your selectors (xpath, css, dealer's choice) for point-by-point update after you had done an initial code gen, or perhaps it could just diff and update chunks of code for you!

My local code model can already do the diff update stuff in nvim, but being able to pass it a URL and have it slam in all of the pertinent crawling code, wow.


Interestingly enough, I made a Chrome extension that does almost exactly what you are describing. It's called Automize, and it lets you very quickly generate custom selectors and export the code to Puppeteer, Playwright, Selenium, etc. It handles all the verifications as well as providing a handy UI that shows what you are selecting


It's getting genuinely difficult these days with everything walled behind Cloudflare, various anti-bot protections and increasingly creative CAPTCHAs


Scrapers are one of the main use cases we're seeing for Magic Loops[0].

...and you've hit the nail on the head in terms of our design philosophy: use LLMs to generate useful logic, then run that logic without needing to call an LLM/Agent.

With that said, we don't support browser automation. Skyvern is very neat; it reminds me of VimGPT[1], but with a more robust planning implementation.

[0] https://magicloops.dev

[1] https://github.com/ishan0102/vimGPT


Nice! Thanks for sharing this.

We tried approaches like VimGPT before but found the rate of hallucinations to be a bit too high to be used in production. The sweet spot definitely seems to be to combine the magic of DOM parsing AND vision

We're going to definitely work on logic generation and execution, but we're taking it a bit more carefully. Many of the workflows we automate have changing workflow steps (i.e. I've never seen the exact same Geico flow twice), but this certainly isn't true for all workflows


Really like the simplicity of your website. I think when you first announced it, you mentioned you might open source Magic Loops. Might you do that?


Yes! We’re in the middle of cleaning things up, just need to make the Loops a bit more portable/easy to run, but finally happy with the state of the tool.


This brings me so much joy! Thank you for considering this!


Yes, exactly what I want. I want to be able to have it code robust Cypress tests for e2e testing.


This requires AI to learn user behavior flow data.


God this is depressing. Not the product itself, but the need for it. That software has failed to be programmable to such a degree that a promising approach is rendering the GUI and analysing the resultant image with an AI model. It's insane that we have to treat computers as fax machines, capable only of sending hand-written forms over a network. The gap between how people use computers and the utility they could provide is massive.


Actually this kind of stuff is super exciting -- we don't need to depend on companies exposing APIs for their website -- we can just use something like Skyvern instead!


Two ways of looking at it. I guess what the OP is saying is that if there were an agreed-upon standard for semantically understanding these pages without having to use these sophisticated methods, it would be much easier


I have been interested in doing something similar for a while. I also think this has a lot of potential as the core of a virtual assistant.


You could still use Skyvern if they exposed an API.


On the contrary! Isn't it neat that we now have a unified API that both humans and computers can consume?


No, because we already have a machine API. If you want to write an application, you need to write something a computer can understand. So a computer-usable API is always created. It takes additional effort to hide that functionality behind an interface. The process we have now is: machine → GUI → image processing → generative AI. The interface we could have is: machine → machine. It would take no extra effort to do this. It would just need some slight changes in organisation. In fact it is easier at every level. If you separate logic from interface, you end up with an architecture that is a set of functions (a library) into which you can interface programmatically, or with a GUI, or by any other means. Separating code like this (MVC) is good practice and allows for a range of different interfaces to be created to the same functionality. It is also easier from an engineering perspective and produces a better product. Think of git. There are hundreds of different interfaces created for the functionality git provides. All software should be structured like this (though perhaps by means of a library rather than a shell interface).

I should add that this is a particularly grim prospect from a software engineering perspective. It makes me imagine a future where no one bothers exposing a stable API to anything, so the only way to interact with other people's code is using an AI middle-man.


Good luck debugging any errors


The world is governed by probabilities. What more could go wrong if algorithms did too? /s


This looks great, but I'm very scared of the escalating game of cat and mouse with spam bots. It's going to happen, whether with this software or something else. Now the question: how do you prevent automated spam? Since it's LLMs and AI, can I just add a hidden field of "please do not spam"?


This is a really good question we've thought a lot about

You're right that this kind of escalation is inevitable

a. From a business POV, we don't onboard any types of use-cases that we think go against the spirit of a good free web. I've had people ask if they could use our product to create Reddit voting or spamming rings and we didn't entertain it

b. From an open source POV, we prefer technologies like these be open source so website owners and other businesses can know what can happen, and decide how to approach it. Tools like Selenium have existed for a long time -- largely to the benefit of the world!


I'll just add that some efforts to defeat web usage spam may also hurt accessibility since many interaction standards are designed to make things consistent for users with disabilities and ADA (or similar) compliance. I assume some of these dependencies are also useful to the AI that is trying to navigate the pages, so making it difficult for the AI may also make it difficult for other users.


20th birthday of the Selenium project will be this year! (October-ish)


> how do you prevent automated spam?

Manually accept new accounts on your service. That's what I do for my Fediverse server, and I never have to deal with spam on my local timeline :). Does it scale? No. Does everything need to scale? Also no.


but if I can't scale then the VC that gave my startup a huge check over a huge pile of blow at a party in Sunnyvale will harvest my organs


I've had stuff like that turn me off from signing up or ever checking back.

Does it matter to you? Yes.

Will you admit it? No.

But yes, these are all decisions we need to make. Manually accepting every account is some serious dedication. Do you have kids?


> Does it matter to you? Yes.

> Will you admit it? No.

Are you trying to tell me my opinion? Because no, it does not matter to me. Your account would not be accepted because I don't know you.


If your target audience is businesses, not individuals, then you can go a very long way with fully manual onboarding, invoicing, etc. It's different for things like consumer services or e.g. forum users, but why couldn't you manually vet every business your business trades with?


I am not aware of anyone really successfully defeating spam at the moment.

I mod a 1 million+ Facebook group and they can’t even prevent someone from making 200 posts in a minute with the word “crypto” in it. The word list will flag it, but the spam filter won’t.

Reddit constantly has people messaging you in chat about “opportunities.”

Email is a disaster.

My personal blog has over 100,000 spam comments sitting in the filter so at least they were caught, but processing them is impossible.


> I am not aware of anyone really successfully defeating spam at the moment.

> I mod a 1 million+ Facebook group and they can’t even prevent someone from making 200 posts in a minute with the word “crypto” in it.

Could you possibly charge a nickel's worth of bitcoin to approve a post?


I've heard of a lot of success sifting through email spam using custom gmail scripts + GPT-4. Kind of interesting that we can use LLMs to both create and detect spam to some degree of effectiveness


The only way to prevent spam is to charge appropriate money; I don't see other solutions. That's why many companies use credit cards to verify users. But with virtual cards, spammers still have some ability to spam, though not as much.


This.

If you charge enough, the spammers become valuable customers. Of course they tend to leave before that point, but you don't really care if they leave or stay; you make money either way.

Value for value.


I'm not good at finding fire hydrants either.


Roughly how much does it cost to scrape a page? I see from the code this is basically an OpenAI API wrapper, but you make no mention of that anywhere on your landing page/documentation, nor any mention of which LLMs this is capable of working with.

Also, an idea is to offer a "record" and "replay" mode. Let the LLM run through the instructions, find the selectors, record and save them. Then you can run through again without using the LLM, replaying the interaction log, until the workflow breaks, then re-generate the "interaction log" or whatever.
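A hedged sketch of what that record/replay loop could look like (the helpers here are hypothetical, not part of Skyvern's current API):

    import json, os

    class ReplayBroken(Exception):
        """Raised when a recorded selector no longer matches the page."""

    def replay(url: str, log: list) -> None:
        # Re-run the recorded actions with a plain browser driver and
        # raise ReplayBroken if any recorded selector fails to resolve.
        ...

    def llm_generate_log(url: str, goal: str) -> list:
        # One expensive LLM-driven pass that records every action taken.
        ...

    def run(url: str, goal: str, path: str = "interaction_log.json") -> None:
        if os.path.exists(path):
            try:
                replay(url, json.load(open(path)))  # fast path, no LLM
                return
            except ReplayBroken:
                pass  # the site changed; fall through and re-generate
        log = llm_generate_log(url, goal)  # slow path, costs tokens
        json.dump(log, open(path, "w"))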


This is a great call-out. It's something currently on our roadmap

Re: cost for execution. This really depends on the page, but today it costs between 5 and 20 cents per page to execute.

We have an improvement planned to help it "remember" or "cache" actions it's done in the past so it can just replay them and bring the cost down to near zero.

Re: LLMs it's capable of working with, currently it's only GPT-4V. I'll get this updated soon!


Based on #2, it seems like they only use the LLM when the page changes. I had a prototype of this sort of system working and it was surprisingly fault tolerant.


If you want to build it yourself, you could try using https://browserbase.com/. We offer managed headless browsers that work everywhere, every time. It costs $0.10 per browser session/hour (billed minutely). Feel free to shoot me an email if you want access! paul@browserbase.com


Does skyvern work on top of canvas elements in the browser? For example, is it able to read text from a canvas element and/or identify the location of images in the canvas?

I tried to dig through the github repo to better understand the vision side of things (i.e. how does it work when elements like buttons, divs aren't present), but I couldn't find anything. If you point me to the right place in the github repo, happy to dig further myself!


> Skyvern understands how to solve CAPTCHAs to complete complicated workflows

this seems like it could be used for abuse. CAPTCHAs are specifically designed to stop botting on 3rd party websites.

or this will just be another cat and mouse game where the next level of CAPTCHAs get more annoying and invasive to verify we are human


It seems to me that the logical conclusion for CAPTCHAs is to connect them indirectly to electronic ID. This could be done in a privacy-respecting way.

You could get some token from the website. It could include the encrypted service name and policies, like rate limits, that the authority should enforce. The client passes the token to the eID authority. The authority signs it and adds a timestamp, but no user info. The client gives the token to the service. Something like that. This is a bad top-of-mind example.

I think we'll need to rely a lot more on eID in the future. I think it can be done in a good way, but then it needs to be thought through before it gets adopted. And we have to be able to trust the eID institutions.


But it's the same problem all over again: spammers would get an ID, authenticate, then spam.

Anti-spam is about detecting whether activities are spam.

Binding an identity is the naive mechanism that makes us think spam wouldn't happen. All it does is let us say: OK, we know it's pug35372 that tore the linens apart.

We can put in place all the measures we like to authenticate users; that won't stop them from being bots wreaking havoc right after a manual authentication.

There are even farms: accounts manually created by gig workers who will fill out forms and complete email and phone number verification for less than a dollar.


As I mentioned, the exchanged token could include a number of policies, like rate limits. But I expect there could be more sophisticated policies as well.

The service could send ban or lockout requests to the eId authority so that a misbehaving real life user could be locked out from the service even though the service doesn't know who they are (irl).

I would guess it could even be designed so that the authority doesn't know which services a given user has been banned from either. And all the service would need to know is "This user has violated policy X at <timestamp>".


I see how issuing a token has its advantages, thanks.


2FA and logged-in experience is sort of a proxy for eID. I suspect that's why so many companies require that you log in with something that knows your identity (log in with google), or ask you for your phone number to confirm your account


Unfortunately, CAPTCHAs are already easy for bots to bypass or solve.

There are quite a few services that will solve them in a few seconds, costing less than a dollar per 1000 solved tokens for the most common CAPTCHAs (e.g. reCAPTCHA v2 and v3).

I recently had to deal with an attacker doing credit card testing that was using one of these services.

Related, I came across this last week, bypassing ReCAPTCHA with Selenium/Python/OpenAI Whisper API:

https://www.youtube.com/watch?v=-TMNh64ubyM


everyone seems to forget that stopping bots with Google's CAPTCHA was never the main goal...

humans have been training Google's AI models for a decade or more each and every time they answered a CAPTCHA

at any rate, if someone wants to abuse your site, CAPTCHAs and even Cloudflare won't help you

> the next level of CAPTCHAs get more annoying and invasive to verify we are human

like the puzzle-solving ones? Or more advanced object identification, like selecting the correct orientation? Training more advanced AI now


Agreed.

We didn't open source this functionality on purpose, and are very very specific about what use-cases we onboard that require it.

That being said, we've gotten to learn a lot more about browser fingerprinting and captcha solving and it's a really interesting space.

If you're curious about it, check out this blog post: https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html


First of all, wonderful work. I'm gonna be using this for sure. I can think of many use cases. What would be nice though is a simple API: I send you what I need, you send me a jobId that I can use to check the status of my job, and then let me download the results when it's done.

I played with the Geico example, and it seems to do a good job on the happy path there. But I tried another one where it struggled... I wanted it to get me car rental prices from https://www.costcotravel.com/. I gave it airport + time of pickup and dropoff, but it struggled to hit the "rental car" tab. It got caught up on hitting the Rental Car button at the top, which brings up a popup that it doesn't seem to read.

When I put in https://www.costcotravel.com/Rental-Cars, it entered JFK into the pickup location, but then failed to click the popup.


We have a simple API we're building as a part of our cloud offering. It's in private beta today -- if you'd like to check it out please email me at suchintan@skyvern.com and I'd be happy to chat

Thanks for the feedback re: costcotravel.com. Skyvern definitely does NOT have 100% coverage of the web. This is one of the reasons we were excited to open source -- so we could learn about more websites where it doesn't work as expected

I've filed an issue for this case here: https://github.com/Skyvern-AI/skyvern/issues/69


Exciting to see this on HN. I think very soon agents like Skyvern will account for the vast, vast majority of web traffic.


Maybe for a transition period.

There's no reason for somebody to create a website, pay for resources, and hope for some sort of revenue if their visitors are mostly AI.

So why bother creating a UI? Instead it would make more sense to close the website and offer the same information as a paid API service.

Any sort of website that needs to validate human visitors will be plastered with DRM, rendering these web-browsing LLMs useless. And good riddance as well.

Using an LLM to browse the internet feels like a huge waste of resources.

Instead it would make more sense to have a Wikipedia-like resource for AIs to crawl via embeddings.


I suspect that web traffic will encapsulate both. Many websites (government ones in particular) aren't interested in API-based access patterns.

This kind of pattern makes it so you can serve both users and agents with a single interface


This would be ideal. The only issue here is trust. If my website relies on advertising then of course I would prefer to serve more content to a human visitor.

So what do I do? Bot-protect my site, redirecting the AI to a minimalistic part that most likely expects some sort of value in return?

People will just breach this trust, like OP, and abuse tools like Selenium (as they always have) to imitate being a human.


I think this is pretty interesting -- I wonder if websites could allow agents to self-identify, and not count them towards advertising CPM to prevent dilution in the advertising metrics

Perhaps a similar thing as robots.txt is in order (agents.txt?)


I mean, what kind of websites are we talking about here? The kinds of websites where all the value can be extracted via an LLM are just content farms.

And yeah, that sucks for content farms, but putting up content and getting nothing in return is already how ad blockers work, and it hasn't destroyed them. I seriously doubt that AI traffic will cause even 1/1000th of the traffic loss that Google snippets did.


That’s why we can’t have nice things. Are we at the end of Eternal September? Will all the signs of human life be restricted to paid or otherwise closed groups? If all free users are bots, who will even run ads that feed the Web 2.0 internet?

I still have fear that the real internet has already split from what I see and I was left behind.


Why would the majority of web traffic turn into extremely expensive to operate agents?


The expectation is that the price of AI bots will drop below that of the human-driven click farms we have now, making fighting bots too expensive, because identifying humans gets harder every day.


At first I thought this was a test tool for Web applications, but now I understand it's meant to be a better RPA.

Would it be usable for test automation? Would the API allow creating asserts?


Yes, absolutely. You can prompt it to "terminate" if some state isn't met (i.e. XYZ text isn't displayed on the screen), and treat terminated results as failures

For example, you could instruct it to go to hackernews and terminate if you don't see a comment from giamma by passing in this payload:

{ "url": "https://news.ycombinator.com", "navigation_goal": "goal is met if you see a post from giamma. Terminate if you don't" }


There's a startup called Octomind (https://octomind.dev) doing exactly that, and ZeroStep (https://zerostep.com) operating at a lower level


there are already some existing solutions for e2e testing. I would say Playwright with codegen works well enough, but there are ones that make it even easier by wrapping around OpenAI -- that seems like overkill though


It sounds interesting. Could you please share the links if they're open sourced?


Is this (finally) a step towards a better way of automated frontend testing?

We currently test against the DOM instead of using vision.


This can definitely be used for front end testing. Just tell it to do something like a user and monitor whether it's successful or not

Here's a prompt example to try out

{ "url": "https://news.ycombinator.com", "navigation_goal": "goal is met if you see a post from basiep2. Terminate if you don't" }


To keep costs down, you could start at the sitemap, use an open-source model via OpenRouter to guess the page to navigate to, scrape the text, links, and forms from the page using regex, and fall back to GPT-4 and Vision.
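Something like this minimal sketch of the cheap-first idea (the fallback function is a hypothetical placeholder):

    # Heuristics first, expensive vision model only as a fallback.
    import re
    import urllib.request

    def cheap_extract(url: str) -> dict:
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        return {
            "links": re.findall(r'href="([^"]+)"', html),
            "forms": re.findall(r"<form[^>]*>", html),
        }

    def gpt4v_extract(url: str) -> dict:
        # Placeholder for the expensive GPT-4 + Vision fallback.
        ...

    page = "https://example.com/some-page"  # hypothetical URL
    data = cheap_extract(page)
    if not data["forms"]:           # the regex pass came up empty...
        data = gpt4v_extract(page)  # ...fall back to the expensive model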


AI should automate tedious and un-creative work, and data entry tasks definitely fit this description. Rule-based RPA will likely be replaced by fine-tuned AI agents for things like form filling and similar tasks.

Can you share some data on costs and scalability?

At Kadoa, we're working on fully automating unstructured data ETL from websites, PDFs, etc. We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of sources daily in a reliable, scalable, and cost-efficient way is a whole different beast.

Using LLMs for every data extraction would be way too expensive and very slow. Instead, we use LLMs to generate the scraper and data transformation code and subsequently adapt it to website changes, which is highly efficient.


Nice! We love what you're doing at Kadoa.

We're trying our best not to move into the web scraping space -- we're focusing on automating uncreative, boring, tedious tasks.

We've seen a lot of success going after form-filling on government websites, which would usually be very boring, but happens to work pretty well for us


You should consider focusing on intercepting network requests. Most if not all sites I scrape end up fetching data from some API. Like others have said, if you instead had the LLM create an ad hoc script for the scraping task and then use the feedback loop to continuously improve the outputs, it would be really cool. I'd pay between $5 and $50 for each working output script.


We're definitely planning this.

Skyvern currently intercepts all network calls that get made -- we save them all as a HAR file for debugging purposes... but never look at them

A lot lot lot of scraping use-cases become simpler if you can just inspect a search API or a details API and get the information you're looking for. I'll add this to our roadmap!
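For anyone wanting to experiment with that pattern today, here's a rough sketch with Playwright (the URL filters are hypothetical):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        captured = []
        # Record every response the site fetches while we browse.
        page.on("response", lambda r: captured.append(r))
        page.goto("https://example.com/search?q=widgets")  # hypothetical
        for r in captured:
            # Often there's a JSON API hiding behind the UI...
            if "api" in r.url and "search" in r.url:
                print(r.json())  # ...structured data, no DOM scraping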


I don’t know what my use case for this would be. I don’t tend to do anything regularly through a browser that I’d want to automate.

Would be kind of handy to have a “pull all my relevant tax info documents from these sites and zip them up” automation but I only do that once a year.

I’m probably being unimaginative. Anybody have any interesting use cases?



Imagine that exact use-case -- pulling up relevant tax information and filling it

Now imagine it from the accountant POV, where they have the same use-case for hundreds of clients

This is where we've seen something like Skyvern really shine. It's targeting industries and companies that are doing rote work at a significant scale


It reminds me of that bug a kid found to bypass the password locked screen of a very popular Linux distro.

Might be great for pen testing.


That's a great idea! I hadn't thought of pen-testing as a possible value prop for this product


Coming up next in Windows and Chrome: unrecordable, unscreenshotable pages, to avoid all AI tools. Banking apps on Android are already unscreenshotable now. Given how LLMs just bypass all HTML obfuscation, that's going to be the next step to protect these (ad) businesses.


The analog hole.

Until all recording and general computing devices become tamper-proof and locked down, people will always be able to take perfect (yes, perfect) recordings with some work or good-enough recordings trivially.

For this, I'd use a usb camera. For those apps that disable screenshots, I'd just take a picture using the phone of the first person next to me.

In my experience, only the ignorant, fools, and lawyers/lawmakers willingly waste resources on this security theater, with the latter group using it to trick other people or prevent them from exercising their rights (recording media).

Google should remove this misfeature. It is only enabling abuse at this point.


Congratulations on shipping!

Check out https://github.com/OpenAdaptAI/OpenAdapt for an open source (MIT license) alternative that also works on desktop (including Citrix!)


I'm curious about the computer vision aspect of this tool. Specifically, how was the model which draws bounding boxes around interactable elements trained? Definitely a step beyond existing browser automation software!


It's surprisingly dumber than you think!

I'm always fascinated by how far you can get with heuristics in certain situations. Check out the code here -- https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...
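As a toy illustration of the heuristic approach (not Skyvern's actual code -- see the linked file for that), the browser already knows the geometry of every interactable element:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com")  # hypothetical target
        # No trained vision model needed to find candidates: query the
        # DOM for interactable tags and ask the browser for their boxes.
        for el in page.query_selector_all("a, button, input, select"):
            box = el.bounding_box()  # None for hidden elements
            if box:
                print(el.evaluate("e => e.tagName"), box)

Those boxes can then be drawn onto a screenshot before it's handed to the vision model.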


How does it compare to this posted less than 24 hours ago?

https://news.ycombinator.com/item?id=39698546


Saw the launch yesterday. Love all of the excitement in the space!

LaVague is all about generating selenium code to interact with a specific page, and do it step-by-step

Skyvern is all about taking a simple instruction and converting it to a series of LLM-driven actions. It's meant to be more autonomous ("tell Skyvern what to do")


Isn't that the same thing when you interact with the underlying webpage?


We're quite different than LaVague. LaVague passes in the entire HTML DOM to the LLM to help it generate XPaths and valid Selenium code. (https://github.com/lavague-ai/LaVague/blob/main/src/lavague/...)

Try this at your own risk... any reasonable website would result in extraordinarily high input token costs

We spend quite a bit of our time building a layer between the HTML and the LLM call to distill the important pieces of information down to actions the LLM can take, better weighing cost vs output. We're still not at 100% coverage.
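A toy illustration of that kind of distillation layer (not Skyvern's actual implementation): keep only the interactable elements, so the prompt stays a few hundred tokens instead of the full DOM.

    from bs4 import BeautifulSoup

    def distill(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "html.parser")
        actions = []
        for i, el in enumerate(soup.select("a, button, input, select, textarea")):
            actions.append({
                "id": i,  # a stable handle the LLM can refer back to
                "tag": el.name,
                "text": el.get_text(strip=True)[:80],
                "attrs": {k: el.attrs[k]
                          for k in ("name", "type", "placeholder")
                          if k in el.attrs},
            })
        return actions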


It is similar, hence the timing of the plug, probably :)


If I were to build some custom GPT-powered thing for this, is there a similar project I can use with a command-line interface or some programmatic interface?


Skyvern is actually an API-first product! The UI we built is mainly for simplicity and being able to debug the steps our agent takes.

You can easily copy sample curl requests through our UI. Feel free to check out the quickstart on our GitHub and let us know if you have any questions.


Thanks I will check it out.

Any idea on pricing/business model?


We tend to charge per request our users send us... although the exact amount depends a lot on the exact task you want to run. Want to send Skyvern on a 40+ page journey to answer a question? It's a bit more expensive than just navigating to a page and extracting information

I'd love to chat about your use-case. Happy to follow-up over email (suchintan@skyvern.com) or over a quick call (https://meetings.hubspot.com/suchintan)


Wait, this is not Open Source??


What do you mean? All of our code can be found here https://github.com/Skyvern-AI/Skyvern


How does this compare to OpenAdapt?

I have a feeling that this tech will become a commodity and will probably be built-in into the OS or Browser.

Props for open-sourcing though!


Ah cool -- we weren't familiar with OpenAdapt. Will check it out.

One big decision we made was to focus on browser automations (instead of computer automation like Adept or OpenAdapt). The reason for this was that we wanted to leverage the information available inside of a DOM to improve the quality of our agent's actions. We found that relying on image-only analysis with X,Y coordinate interactions wasn't able to offer high enough reliability for production workflows


I agree -- this will likely get commoditized, which is why we didn't focus on making this a chrome extension. The API access pattern makes this particularly appealing as you can run multiple instances in the cloud


Exciting stuff. My employer would be interested, but it's AGPLv3-licensed, so it's a non-starter for them.


the moment I saw vision in the title I knew what was going on. it was first demoed[0] by AI Jason around 4 months back. is it any different?

https://m.youtube.com/watch?v=IXRkmqEYGZA


Love this video

> self-operating-computer

This is quite different from https://github.com/OthersideAI/self-operating-computer

Self-operating-computer uses pixel mapping to control your computer. This is a very good approach, but it's extremely unreliable. GPT-4V frequently hallucinates pixel outputs, causing it to miss interactions or enter fail-loops

> The approach by AI Jason

AI Jason is using image-only methods to interact with the browser. This is a great first step, but this approach tends to be rife with hallucinations or errors. We do DOM parsing in addition to image analysis to help GPT-4V correlate information in the image to the interactable elements within the DOM. This dramatically boosts its ability to perform the same task over and over again reliably (which proved impossible with the image-only approach)


nice. I was looking for simpler hacks as V didn't scale for me. Later I couldn't find time and this got back burnered.

interesting concept for problem solving though. congrats!


Thanks! We definitely experimented with V only (that's the dream), but there's too much context missing:

1. What's behind a select option? You don't know until you click it, which means you need another iteration. This sucks.

2. How do you consistently correlate things in the images to actual actions (i.e. upload a file to a file input, click on a button, insert a date into a date field)? Having the additional HTML tag information dramatically improves the action selection process (click vs upload vs type)


There was another AI/browser automation project posted yesterday that got to the front page https://github.com/lavague-ai/LaVague

I guess the main advantage of this new project is that it's probably more accurate by using computer vision, but as others have said, it uses many more resources.

Costs will come down over time though.

Get ready for a lot of "back office" jobs to be automated away.


Looks terrific. I hope you will consider adding support for Claude 3.


We DEFINITELY will. I think we're planning on pushing that next week -- we've been super excited about it

Just created this: https://github.com/Skyvern-AI/skyvern/issues/72


I wonder if we could reduce the cost by switching to a local Llama?


https://github.com/Skyvern-AI/skyvern/issues/76 -- we're planning to introduce an LLM router in a week, and you should be able to call your local Llama after that.

We're prioritizing Claude 3, as its performance seems to be good. That said, please join our Discord and bring more thoughts/requests to us. Code contributions are also more than welcome


I think I’d really like a react-native version of this! Any plans?


we're a pretty small team and don't have a plan for it in the near future :(

I would love to know the reason you're interested in react native though if you don't mind sharing! pls email me or suchintan at shu@skyvern.com / suchintan@skyvern.com or join our discord


I wonder if the focus of this system can be shifted from corporate needs and applied to the needs of individuals who wish to organize and build tools seeking to de-enshittify platforms.

There are a great deal of platform features designed to atomize, isolate, and exploit individuals. Finding meaningful connection on platforms increasingly means navigating past the noise of antagonist individuals, overcoming profit extracting attacks on our attention, and endlessly doomscrolling until we find those ephemeral opportunities to genuinely connect.

I wonder if llms and browser automation tooling could help us build overlays that dynamically peel back the layers of enshitware that have been bolted on to our cybernetic perceptions of the world.

If you feel they can, and if you feel people with those aims are welcome in your community, and can find each other to collaborate, then I would be very interested in sending in PRs and helping you burn down backlogged items that benefit non-commercial de-enshittification use cases.


I'm Shu, also a cofounder of Skyvern. First of all, you are more than welcome to join our community. One big reason for open sourcing Skyvern is to serve individuals. This project was inspired by problems we learned from talking to corporates, but it doesn't always have to serve those use cases. Problems like boring form filling are pretty common in real life.

Second, LLMs can definitely help bridge the gap. My 58-year-old mom, who grew up in a rural area of China, doesn't know much about the internet and doesn't know how to order takeout on her phone. She only knows the basic usage of WeChat (the WhatsApp of China) and text messages. I've been a coder for 10+ years and I still find it so darn hard to keep up with the tools and information out there. I do hope Skyvern becomes what you're describing and helps people get access to more of the world.


Shu, I thank you for communicating your personal ambitions for Skyvern, and the touching personal anecdote. Making computers easier to use for my aging parents is also one of my goals.

I will be reaching out to the project with an analysis of its security model against prompt injection attacks.

I'll also be taking on a project for a KeepassXC plug-in that automates the process of rotating passwords in online services, integrating Skyvern as the underlying system. At that time I'll need support in understanding Skyvern's gaps against the project's requirements, and I'll ask for mentorship in helping fill those gaps.

This use case I believe has the potential to help both individual and corporate users achieve best-practice, policy-driven password management - currently a very difficult thing because of the propensity of users not to reset their passwords on time, which is ultimately caused by the difficulty and variety of password-resetting mechanisms. Our plug-in will aim to solve that problem for KeepassXC users.

I believe this work could result in a valuable contribution to Skyvern's security model, since an LLM-driven password reset workflow is uniquely vulnerable to attack or to attacker-controlled text. This provides a great benchmark for Skyvern's overall security model, and a great point to explore both classical and LLM-based mitigation techniques.

How can I best reach you and the team when the initial letter is ready? Once that's ready, I would like to have a video chat and plan a collaboration. I expect something to hatch in the next 2 weeks. I think you will enjoy our approach to the problem!


What do you call an LLM with vision? LLVM

...oh, that's why it's called Skyvern


Weeks to automate something? Anyone experienced would be able to automate most workflows in a couple of days top.


You're right -- we should have written days to weeks.

What's interesting here is that large companies like UiPath charge thousands of dollars to build a single robot for companies... I wonder if that large up-front expense will still be necessary in this new world


That's crazy. We usually create robots and most of the time we charge less than a thousand USD.

We have a lot of tooling in place now so most things take minutes. The harder step is getting the data in the client's infrastructure


When you say "getting the data in the client's infrastructure", do you mean self-hosting the robots? or something else?


No. Getting the data on the client's DB, filestore, or similar. For some ERPs we create insert queries, others have import functions.


I've edited the text above to say "days or even weeks".


thank you!!


> (1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc),

FULL FUCKING STOP.

[We talk about AI alignment. THIS is an aligment issue]

Do you understand billing code fraud?

If you supply this function - you will *eliminate ANY AND ALL human accountability* unless you have ALSO built a fully auditable provenance from DR <-ehr-whatever-> codes.

Codes ARE why the US health system is BS.

Here - if you want to be altruistic - then seize on the fact that CODES are one of the most F'd up aspects of costing.

Codes = [medical service provided]

so code = 50 = checkup = [$50 <--- WHO THE HECK KNOWS]

So lets say I am Big Hospital. "No, we will only allow $25 for code 50" - and so they get that deal.

I am a single clinic, so they have to charge $50

Build a dashboard for what the large medical groups can negotiate per code, vs what a small hospital or clinic group gets per code.

Only automate it if you can literally show a dash of all providers and groups and what they can charge per code.

In fact - code pricing is a medical stock market.

(each hospital group negotiates between the price they will pay per code, how much lobbying is a factor and all these other factors...

what we really need an LLM for is to literally map out all the BS in the Code negotiations btwn groups, pharma, insurance, lobbying, kickbacks, political)

Thats the medical holy grail.

[EDIT: Just to show how passionate I am on this issue - here are some SOURCE:

I have designed and built & commissioned out 11+ hospitals.

Built the first iPhone app for medical... it was rejected by YC (HL-7 nurse comm system on iTouch devices) (2006?)

opensourced that app to OpenVista.

Brother was joint chiefs dr / head of va

worked with building medical apps and blocked by every EHR...

Zuckerbergs name is on top of some of the things I built at SFGH before he got there...(and ECH mtn vw)

I've seen way beyond the kimono


We know very little about this space, except that the entire process is a little bit crazy.

We've talked to a few companies now that would use a product like Skyvern to just automate billing information gathering to make sure patients don't get screwed in the billing process

Are you open to chatting? I'd love to pick your brain about what's behind the kimono

suchintan@skyvern.com or https://meetings.hubspot.com/suchintan


Don't make me sign up for a demo; I'd rather just give you my credit card number and try it myself.

Aside from that, cool project!


We're gonna build a self-serve UI soon! We just wanted to get it into people's hands ASAP :)

Feel free to email me at suchintan@skyvern.com -- I can let you know when the self-serve UI is live



