OpenDevin: An Open Platform for AI Software Developers as Generalist Agents (arxiv.org)
198 points by geuds 3 months ago | 107 comments



Tried it a few weeks ago for a task (had a few dozen files in an open source repo I wanted to write tests for in a similar way to each other).

I gave it one example and then asked it to do the work for the other files.

It was able to do about half the files correctly. But it ended up taking an hour, costing >$50 in OpenAI credits, and it took me longer to debug, fix, and verify the work than it would have taken to do it manually.

My take: good glimpse of the future after a few more Moore’s Law doublings and model improvement cycles make it 10x better, 10x faster, and 10x cheaper. But probably not yet worth trying to use for real work vs playing with it for curiosity, learning, and understanding.

Edit: writing the tests in this PR given the code + one test as an example was the task: https://github.com/roboflow/inference/pull/533

This commit was the manual example: https://github.com/roboflow/inference/pull/533/commits/93165...

This commit adds the partially OpenDevin-written ones: https://github.com/roboflow/inference/pull/533/commits/65f51...


OpenDevin maintainer here. This is a reasonable take.

I have found it immensely useful for a handful of one-off tasks, but it's not yet a mission-critical part of my workflow (the way e.g. Copilot is).

Core model improvements (better, faster, cheaper) will definitely be a tailwind for us. But there are also many things we can do in the abstraction layer _above_ the LLM to drive these things forward. And there's also a lot we can do from a UX perspective (e.g. IDE integrations, better human-in-the-loop experiences, etc.).

So even if models never get better (doubtful!) I'd continue to watch this space--it's getting better every day.


As a comparison, I use aider every day to develop aider.

Aider wrote 61% of the new code in its last release. It’s been averaging about 50% since the new Sonnet came out.

Data and graphs about aider’s contribution to its own code base:

https://aider.chat/HISTORY.html


It’d be really great to see a video or cast of you using aider to work on aider.

I can’t get anything useful out of these AI tools for my tasks and I’d really like to see what someone who can does.

I’d like to know if it’s me or my tasks that aren’t working for the llm.


Can I ask what language/stack you’re using for your project? More specifically, is it in Python? I’ve had mediocre (though at least partly usable) results on JavaScript repos, and relatively poor ones on anything less popular.


Aider is written in Python (they have a great Discord community, btw). My experience matches yours: aider/Sonnet seems to do much better for Python than for JavaScript so far. Despite current LLM limitations, I strongly recommend aider to anyone interested in this space.

It's also very sensitive, unsurprisingly, to development documentation that is moving quickly, e.g., most AI APIs right now. A lot of manual intervention is still required here because of out-of-date references to imports, etc.


How heavy are the API costs for that?

For a project like yours I guess you should be given free credits. I hope that happens, but so far nobody has even given Karpathy a good standalone mic.


If you use DeepSeek Coder V2 0724 (which is #2 after Claude 3.5 Sonnet on the Aider leaderboard), the costs are very, very small. https://aider.chat/2024/07/25/new-models.html


Not much. I spent $25 on Anthropic in July.


Using sonnet?


I'm an active aider user, I spent ~$120 last month on a combo of Sonnet and Opus. It was much more expensive, as you probably know, with Opus. Now it's rather reasonably priced and more sustainable, IMO.


Aider is great, I also use it almost daily. Thanks for writing it, Paul!


> 10x better, 10x faster, and 10x cheaper

Which is the elephant in the room.

There is no roadmap for any of these to happen and a strong possibility that we will start to see diminishing returns with the current LLM implementation and available datasets. At which point all of the hype and money will come out of the industry. Which in turn will cause a lull in research until the next big breakthrough and the cycle repeats.


While we have started seeing diminishing returns on rote data ingestion, especially with synthetic data leading to collapse, there is plenty of other work being done to suggest that the field will continue to thrive. Moore’s law isn’t going anywhere for at least a decade - so as we get more computing power, faster memory interconnects, and purpose built processors, there is no reason to suspect AI is going to stagnate. Right now the bottleneck is arguably more algorithmic than compute bound anyways. No one will ever need more than 640kb of RAM, right?


I feel like the GP and this response are a common exchange right before the next AI Winter hits.



a) It's been widely acknowledged that we are approaching a limit on useful datasets.

b) Synthetic data sets have been shown to not be a substitute.

c) I have no idea why you are linking Moore's Law with AI. Especially when it has never applied to GPUs and we are in a situation where we have a single vendor not subject to normal competition.


Synthetic data absolutely does work well for code.

While Moore's Law probably doesn't strictly apply to GPUs, it's not far off. See [1] where they find "We find that FLOP/s per dollar for ML GPUs double every 2.07 years (95% CI: 1.54 to 3.13 years) compared to 2.46 years for all GPUs." (Moore's law predicts doubling every 2 years)

https://epochai.org/blog/trends-in-gpu-price-performance#tre...
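As a rough back-of-the-envelope (a sketch, assuming the 2.07-year doubling figure quoted above), a 10x improvement in FLOP/s per dollar from hardware alone takes about log2(10) ≈ 3.3 doublings, i.e. roughly 7 years:

    import math

    # Years of hardware progress needed for a 10x improvement in FLOP/s per
    # dollar, assuming the ~2.07-year doubling time for ML GPUs quoted above.
    doubling_time_years = 2.07
    target_factor = 10

    doublings_needed = math.log2(target_factor)            # ~3.32 doublings
    years_needed = doublings_needed * doubling_time_years  # ~6.9 years

    print(f"{doublings_needed:.2f} doublings -> ~{years_needed:.1f} years for 10x cheaper FLOPs")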


It’d be really nice to see research in this area from somewhere without a financial interest in hyping AI.

That incentive doesn’t invalidate research, but AI results are so easy to nudge in any direction that it’s hard to ignore.


I wonder, when people mention Moore's law, whether they mean it literally or figuratively. I.e., literally as having to do with the shrinking of transistors, or figuratively as any and all efforts to increase overall computational speedup.


In this context it’s the latter, but practically speaking they’re the same thing.


(b) is made up. Synthetic datasets have absolutely not been shown to not be a substitute. It's just a big flood of bad research which people treat as summing up to a good argument.


Maybe not 10x yet, but deepcoder has done some impressive things recently. Instead of a generic LLM, they have a relatively small one that is coding-specific and GPT-4-class in quality. This makes it cheaper. In addition, they can do caching, which reduces the cost of follow-up requests by ~10x. And there are still improvements around STaR, which reduces the need for training data (models can self-reflect and improve without additional data).

So while we're not 10x-ing everything, it's not like there's no significant improvements in many places.


I meant deepseek coder. Can't edit anymore.


Unfortunately the smaller model is not anywhere near GPT-4 in quality, and no one seems to want to host the bigger model (it was even removed from Fireworks AI this week). And no one in their right mind wants to send their code to DeepSeek's Chinese API hosting.


I'm perfectly fine sending my open source code to them. I'm also happy to send 95% of my private repos. Let's be honest, it's just boilerplate code not doing anything fancy, just routing/validating data for the remaining 5%. Nobody cares about that and it's exactly why I want AI to handle it. But I wouldn't send that remaining 5% to OpenAI either.


Much of Nvidia's marketing material covers this, if you want to believe it. They at minimum claim that there will be a million-fold increase in compute available specifically for ML over the next decade.


You don't know where it will go, just as people didn't know the development of LLMs would happen at all. There are no real oracles at this level of detail (in broad strokes and over decades, some sci-fi authors do a reasonable job, and even they get a lot wrong).

There have been a lot of people making these sorts of claims for years, and they nearly never end up accurately predicting what will actually happen. That's what makes observing what happens exciting.


Actually the improvement graphs are still scaling exponentially with training/compute being the bottleneck. So there isn't yet any evidence of diminishing returns.

source: https://youtu.be/zjkBMFhNj_g?feature=shared&t=1545


I just watched an Andrew Ng video (he's the guy I tend to learn the latest prompting, agentic, and visual-agentic practices from) saying that hardware companies as well as software companies are working on making these improvements happen, especially at the inference stage.


Can you include a link to Andrew Ng's video, please?


I think this was the relevant video, but I'm not 100% sure. https://www.youtube.com/watch?v=8lH1mUcxODw&t=2013s


Guessing you used 4o and not 4o-mini. For stuff like this you are better off letting it use mini, which is practically free, and then having it double- and triple-check everything.
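A minimal sketch of that pattern with the OpenAI Python client (the verification prompt and retry count are just placeholders, not a recommendation):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_and_check(task: str, max_checks: int = 2) -> str:
        """Draft with the cheap model, then have it re-check its own output."""
        draft = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": task}],
        ).choices[0].message.content

        for _ in range(max_checks):
            review = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "user", "content": task},
                    {"role": "assistant", "content": draft},
                    {"role": "user", "content": "Check the answer above for mistakes. "
                                                "Reply OK if it is correct, otherwise reply with a corrected version."},
                ],
            ).choices[0].message.content
            if review.strip() == "OK":
                break
            draft = review
        return draft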


This assumes that the model knows it is wrong. It doesn't.

It only knows statistically what is the most likely sequence of words to match your query.

For rarer domains, e.g. when I had Claude/OpenAI help out with an IntelliJ plugin, it would continually invent methods for classes that never existed. And it could never articulate why.


This is where supporting machinery & RAG are very useful.

You can auto-lint and test code before you set eyes on it, then re-run the prompt with either more context or an altered prompt. With local models there are options like steering vectors, fine-tuning, and constrained decoding as well.
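A minimal sketch of such a lint-and-retry loop (the ask_llm callable and the exact lint/test commands are placeholders, not any specific tool's API):

    import pathlib
    import subprocess
    import tempfile

    def lint_and_test(code: str) -> tuple[bool, str]:
        """Run a linter and the test suite on generated code; return (ok, log)."""
        path = pathlib.Path(tempfile.mkdtemp()) / "generated.py"
        path.write_text(code)
        for cmd in (["ruff", "check", str(path)], ["pytest", "-q"]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return False, result.stdout + result.stderr
        return True, ""

    def generate_until_clean(prompt: str, ask_llm, max_attempts: int = 3) -> str:
        """Re-prompt with the lint/test output appended until the checks pass."""
        code = ask_llm(prompt)
        for _ in range(max_attempts):
            ok, log = lint_and_test(code)
            if ok:
                break
            code = ask_llm(f"{prompt}\n\nYour previous attempt failed these checks:\n{log}\nPlease fix it.")
        return code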

There's also evidence that multiple models of different lineages, when their outputs are rated and you take the best one at each input step, can surpass the performance of better models. So if one model knows something the others don't you can automatically fail over to the one that can actually handle the problem, and typically once the knowledge is in the chat the other models will pick it up.

Not saying we have the solution to your specific problem in any readily available software, but that there are approaches specific to your problem that go beyond current methods.


It doesn't make sense that the solution here is to put more load on the user to continually adjust the prompt or try different models.

I asked Claude and OpenAI models over 30 times to generate code. Both failed every time.


If Claude and OpenAI are so useless, why does every company ban them during interviews?


Managers make most of those decisions and they have no idea what is achievable, reasonable or even particularly likely.


Do you think that says more about the tools or the interview process?


This is a really complicated (and more expensive) setup that doesn't fundamentally fix any of the problems with these systems.


Yep when I read stuff like this I think, "nah I'll just write the damn code." Looking forward to being replaced by a robot, myself.


Popular programming in a nutshell.

It’s the new pop psych.


4o-mini is cheap, but is not practically free. At scale it will still rack up a cost, although I acknowledge that we are currently in the honeymoon phase with it. Computing is the kind of thing that we just do more of when it becomes cheaper, with the budget being constant.


It doesn't work like that. By "double" or "triple checking everything" you're more likely to end up with a fractal pattern of token waste, potentially veering off into hallucinations, than with actual progress.


Strong chance Moore's law stops this decade due to the physical limits on the size of atoms, lol.


I’m hopeful that there are some possible model topologies that don’t just stack matmuls.

Maybe there are some wins to be had on the software side still.


I've heard variations on this argument for the past two decades, and it's amusing every time.


I’ve been hearing that for at least a decade.


And now it's here.


I’ll check back in 2030


Instead of using the OpenAI API, can it use the locally hosted Ollama HTTP API?


Yes. It's not really "open" if it depends on a non-libre service. To be legit, they must at least enable this experimentally.


Nice.

The "Browsing agent" is a bit worrisome. That can reach outside the sandboxed environment. "At each step, the agent prompts the LLM with the task description, browsing action space description, current observation of the browser using accessibility tree, previous actions, and an action prediction example with chain-of-thought reasoning. The expected response from the LLM will contain chain-of-thought reasoning plus the predicted next actions, including the option to finish the task and convey the result to the user."
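A rough sketch of how one such step's prompt might be assembled (the field names here are invented for illustration; the actual implementation in the paper will differ):

    def build_browsing_prompt(task, action_space_desc, accessibility_tree,
                              previous_actions, cot_example):
        """Assemble one browsing-agent step as described in the quoted passage."""
        return "\n\n".join([
            f"Task: {task}",
            f"Available browser actions:\n{action_space_desc}",
            f"Current page (accessibility tree):\n{accessibility_tree}",
            "Previous actions:\n" + "\n".join(previous_actions),
            f"Example with chain-of-thought reasoning:\n{cot_example}",
            "Think step by step, then output the next action(s), "
            "or finish and report the result to the user.",
        ])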

How much can that do? Is it smart enough to navigate login and signup pages? Can it sign up for a social media account? Buy things on Amazon?


There is a pull request to add a security monitor that makes sure it does not do anything unreasonable: https://github.com/OpenDevin/OpenDevin/pull/3058


Good that they are thinking about it. Now the question is whether the LLM is smarter than the firewall.


I used this to scaffold out 5 HTML pages for a web app, having it iterate on building the UX. Did a pretty good job and took about 10 minutes of iterating with it, but cost me about $10 in API credits which was more than I expected.


Cost is one of our biggest issues right now. There's a lot we can do to mitigate, but we've been focused on getting something that works well before optimizing for efficiency.


I think that’s correct – even at a “high” cost (relative to what? A random SaaS app or an hour of a moderately competent Full Stack Dev?) the ROI will already be there for some projects, and as prices naturally improve a larger and larger portion of projects will make sense while we also build economies of scale with inference infrastructure.


This is a bigger issue than folks realize: visual inputs to GPT-4 are really expensive (like several cents per dozen images in some cases), which means that you can't just spam the API to iterate on HTML/webpages with a software agent. We're trying to tackle this for web screenshots (also documents) with a custom model geared towards structured schemas, designed to be fed into a feedback loop like the above while keeping costs down.


It's gross that this has a person's first name. How dehumanizing that will be for real Devins as this kind of thing becomes productized. How tempting to compare yourself to a "teammate" your employer pays a cloud tenant subscription for.


It's a reference to Devin, one of the earlier (and most hyped) "autonomous" AI-agent-based software devs, which this project attempts to replicate/match in the open.

Your interestingly different ire would be better-directed at the original project.

https://www.cognition.ai/blog/introducing-devin

Previous discussions on that fwiw include:

https://news.ycombinator.com/item?id=39679787


Odd take. There are plenty of products, restaurants and services that use a first name as their name. I don't think it's a big deal, or negative at all.


The Alexas and Siris of the world feel their pain.

You want something unique, but not so unique as to be weird.

I work with like 6 Matts.


"Devin" is a substantive which is used as a first name in the Celtic world. Pretty sure it's used here because of its meaning.


Is it dehumanising to give a dog a name that a person could have?


I don't like to discourage or be a naysayer, but:

Don't build a platform for software on something inherently unreliable. If there is one lesson I have learned, it is that systems and abstractions are built on interfaces which are reliable and deterministic.

Focus on LLM use cases where accuracy is not paramount - there are tons of them: OCR, summarization, reporting, recommendations.


People are already unreliable and non-deterministic. Looking at that aspect, we're not losing anything.


As a result of human unreliability, we had to invent bureaucracy and qualifications for society at large, and design patterns and automated testing for software engineers in particular.

I have a suspicion that there's a "best design pattern" and "best architecture" for getting the most out of existing LLMs (and some equivalents for non-software usage of LLMs and also non-LLM AI), but I'm not sure it's worth the trouble to find out what that is rather than just wait for AI models to get better.


People may be unreliable, but the software they produce needs to work reliably.

A software system is like Legos: it forms a system of dependencies. Each component in the chain has interfaces which other components depend on. 99% reliability doesn't cut it for software components.


I'm not sure, but you may be misunderstanding the project, or trying to make some point I'm missing. This project just automates some code tasks. The developer is still responsible for the design / reliability / component interfaces. If you see the result doesn't match the expectations, you can either finish it yourself, or send this tool for another loop with new instructions.


Let me test it out, and then provide better feedback.


>the software they produce needs to work reliably

The word "need" is an extreme overstatement here. The vast majority of software out there is unreliable. If anything, I believe it is AI that can finally bring formally verified software into the industry, because us regular human devs definitely aren't doing that.


That's a fair statement: humans cannot be the gatekeepers for accuracy or reliability.

But why should the solution involve AI (that's just the latest bandwagon)? Formal verification of software has a long history which has nothing to do with AI.


Probably because of Google's recent math olympiad results using AI-directed search in formal proof systems.


> but why should the solution involve AI

Because AI is able to produce lots of results, covering a wide range of domains, and it can do so cheaply.

Sure, there are some quality issues. But that is the case for most software.


What part of “AI” implies “formally verified?”


And that's precisely why we don't use people to do tests and to ensure that things work reliably. We use code instead.


I've had trouble trying to convince a few different people of this over the years.

One case: the other dev refused to allow a commit (fine) because some function had known flaws and should no longer be used for new code (good reason), but this fact wasn't documented anywhere (raising flags), so I tried to add a deprecation tag as well as changing the thing. They refused to allow any deprecation tags "because committed code should not generate warnings" (putting the cart before the horse), and even refused to accept that such a warning might be a useful thing for anyone. So they became a human compiler in the mode of all-warnings-are-errors… but only they knew what the warnings were, because they refused to allow them to be entered into code. No sense of irony. And of course, they didn't like it when someone else approved a commit before they could get in and say "no, because ${thing nobody else knew}".

A different case: years after Apple had switched ObjC to use ARC, the other dev was refusing to update despite the semi-automated tool Apple provided to help with the ARC transition. The C++ parts of their codebase were even worse, as they didn't know anything about smart pointers and were using raw pointers, new, and delete everywhere. I still don't count myself as a C++ dev despite having occasionally used it in a few workplaces, and yet I knew about it even then.

And, I'm sure like everyone here has experience of, I've seen a few too many places that rely on manual testing.


That's not universal. QA teams exist for things which are not easy to automatically test. We also continuously test subjective areas like "does this website look good".


Agree, but the boundaries of automation are progressing year after year. We won't be able to replace everything humans do for testing anytime soon, but still a lot can and will be done.


Yes, they are, and that's precisely why we use computers and deterministic code for many tasks instead of people.


I really don't like the denigration of humanity to sell these products. The current state of LLMs is so far behind the average human on "reliability" that these marketing lines are insulting.

It really seems like the tech-bro space hates humans so much that their motivation in working on these products is replacing them to never have to work with a human again.


>I really don’t like the denigration of humanity to sell these products.

Sure, but then humanity was denigrated the first time a calculator was used to compute a sum instead of asking John Q Human to do it.

I'd argue that the more we find ways to replace humans with AI, the more clearly we're defining what humanity is. It's not about denigration or elevation, just truth.


> systems and abstractions are built on interfaces which are reliable and deterministic.

Are you sure we live in the same world? The world where there is Crowdstrike and a new zero day every week?

Software engineering is beautifully chaotic, I like it like that.


I suspect that the pursuit of LLM agents is rooted in falling for the illusion of a mind which LLMs so easily weave.

So much of the stuff being built on LLMs in general seems fixated on making that illusion more believable.


This is an interesting take, but I don't think it quite captures the idea of "agents".

I prefer to think of agents as _feedback loops_, with an LLM as the engine. An agent takes an action in the world, sees the results, then takes another action. This is what makes them so much more powerful than a raw LLM.
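In code, that framing is just a loop (a sketch; llm and execute here stand in for whatever model call and tooling you wire up, and the action dict shape is assumed for illustration):

    def run_agent(goal: str, llm, execute, max_steps: int = 20) -> str:
        """Feedback loop: propose an action, run it, feed the observation back in."""
        history = []
        for _ in range(max_steps):
            # action is assumed to be a dict like {"tool": ..., "args": ..., "finish": bool, "result": str}
            action = llm(goal=goal, history=history)  # LLM proposes the next action
            if action.get("finish"):
                return action["result"]
            observation = execute(action)             # take the action in the world
            history.append((action, observation))     # the result shapes the next step
        return "step limit reached"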


I think "sees the results" also embeds the idea of a mind. An LLM doesn't have a mind to see or plan or think with.

An LLM in a loop creates agency much like a car rolling downhill is self driving.


That works if the LLM has adequate external feedback from a terminal and browser, with the past trials in context, etc.

It can't self-correct its own reasoning: https://arxiv.org/abs/2310.01798


I tried OpenDevin for a sort of one-off script that did some file processing.

It was a bit inscrutable what it did, but it worked no problem. Much like the ChatGPT interpreter looping on Python errors until it has a working solution, including pip-installing the right libs and reading the lib's docs for usage errors.

N of 1, and a small freestanding task I had already done myself, but I was impressed.



So does arxiv.org just let anyone publish a paper now? It seems to be used by AI researchers a lot more now instead of just writing a blog post.


They always let anyone publish a paper, as long as the submitter has an email address from a known institution OR an endorsement from someone who does. Any edu-email may actually suffice if I'm not mistaken.


Yes, that's the whole point of arXiv: to allow anyone to publish.


arxiv.org is not a peer-reviewed publication but an archive of scientific documents. Notably, it includes preprints, conference papers, and a fair bit of bachelor's and master's projects.

The best way to use arxiv.org is to find a paper you want to read from a "real" publication and get the pdf from arxiv.org so you can read it without the publication subscription.

That is not to say arxiv.org is all horseshit though. Plenty of good stuff gets added there; you just need to keep your bullshit radar active when reading. Even some stuff published in Nature or IEEE smells like unwashed feet once you read them, let alone what arxiv.org accepts.

Good citation count and decent writing are often better indicators than a reputable publication.


The exact same thing happened with crypto and "whitepapers". I think it's because both these fields have so many grifters that believe an arxiv paper provides them much-needed legitimacy. A blog post doesn't have the same aura to it...


Does it have different goals than: https://aider.chat ?


Probably to be fully autonomous, vs guided like aider.

I still think a tool like aider is where AI is heading; these "agents" are built upon running systems that are 15% error-prone and just compound errors with little ability to actually correct them.


Yeah, it has more agency, looks up docs, installs dependencies, writes and runs tests.

Aider is more understandable to me, doing small chunks of work, but it won't do a Google search to find usage, etc. It depends on you to choose which files to put in context and so on.

I wish aider had a bit more of the self directedness of this, but API calls and token usage would be greatly increased.

Edit: or maybe an agency loop like this steering aider based on a larger goal would be useful?
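Something along those lines could probably be prototyped today by shelling out to aider's CLI from an outer planning loop (a sketch; plan_tasks is a hypothetical helper, and the flag names should be checked against aider's docs):

    import subprocess

    def steer_aider(goal: str, plan_tasks, max_tasks: int = 10) -> None:
        """Hypothetical outer loop: break a goal into tasks and hand each to aider."""
        for task in plan_tasks(goal)[:max_tasks]:  # plan_tasks: your own LLM planning call
            # --message runs a single non-interactive instruction; --yes auto-confirms
            # prompts (verify these flags against the current aider documentation).
            subprocess.run(["aider", "--yes", "--message", task], check=False)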


My project Plandex[1] fits somewhere between aider and opendevin in terms of autonomy, so you might find it interesting. It attempts to complete a task autonomously in terms of implementing all the code, regardless of how many steps that takes, but it doesn’t yet try to auto-select context, execute code, or debug its own errors. Though it does have a syntax validation step and a general verification step that can auto-fix common issues.

1 - https://plandex.ai


I don't need OpenDevin. I just need AI to reliably write a function or unit test or create a small UI component. It needs to check the latest documentation, as its answers are often outdated. It needs to be able to pass the tests and debug itself without getting into a loop of repetitive errors it can't get out of. If an LLM could do that, it would save me so much time. But the latest models are all bad at this currently.


Heh, reliably.


Please don’t give any tools, AI or not, the freedom to run away like this. You’re inviting a new era of runaway worm-style viruses by giving such autonomy to easily manipulated programs.

To what end anyway? This is massively resource heavy, and the end goal seems to be to build a program that would end your career. Please work on something that will actually make coding easier and safer rather than building tools to run roughshod over civilization.


While I agree, that ship seems to have sailed for the time being. There will be a lot of very dubious code for the coming years/decade. Currently, using Claude Projects or Copilot Workspace, you can write fully working software, but every time you ask for a change, it will duplicate or mess up some part of the code. You can just ask it to fix things, but then you have the following:

- fix A please

- hmm, ok A fixed, B broken; fix B please

- hmm, ok B fixed, A now a bit broken, fix A please

- A & B working

But when you check the code, you often see that it wrote code for A that broke B, then fixed B while leaving the old code for A in place, now basically dead code but not necessarily detectable. Then it wrote code for A again, after the code for B, and the user thinks all is fine because it works. And this happens 1000x/day in normal projects.

I see it everywhere. Good for me (my company troubleshoots and fixes code/systems), but not for the world.


Why isn't this integrated with an IDE? Or am I missing that?


I don't believe so; it's meant to run in its own Docker container sandbox. If you're looking for something that is integrated with an IDE, my current favorite plugin is https://www.continue.dev/. Apache 2.0 license, local or remote LLM integration, automatic documentation scraping (with a hefty list of docs preinstalled), and the ability to selectively add context to your prompts (@docs, @codebase, @terminal, etc.). I haven't seen any great human-in-the-loop-in-the-IDE options quite yet.


Last time I used Continue, it was still phoning home by default; you had to opt out of telemetry.


It's on the roadmap! Stay tuned...



