Tried it a few weeks ago for a task (I had a few dozen files in an open source repo that I wanted to write similar tests for).
I gave it one example and then asked it to do the work for the other files.
It was able to do about half the files correctly. But it ended up taking an hour, costing >$50 in OpenAI credits, and it took me longer to debug, fix, and verify the work than it would have taken to do the work manually.
My take: a good glimpse of the future, after a few more Moore’s Law doublings and model improvement cycles make it 10x better, 10x faster, and 10x cheaper. But probably not yet worth trying to use for real work vs playing with it for curiosity, learning, and understanding.
Edit: writing the tests in this PR, given the code + one test as an example, was the task: https://github.com/roboflow/inference/pull/533
This commit was the manual example: https://github.com/roboflow/inference/pull/533/commits/93165...
This commit adds the partially OpenDevin-written ones: https://github.com/roboflow/inference/pull/533/commits/65f51...
OpenDevin maintainer here. This is a reasonable take.
I have found it immensely useful for a handful of one-off tasks, but it's not yet a mission-critical part of my workflow (the way e.g. Copilot is).
Core model improvements (better, faster, cheaper) will definitely be a tailwind for us. But there are also many things we can do in the abstraction layer _above_ the LLM to drive these things forward. And there's also a lot we can do from a UX perspective (e.g. IDE integrations, better human-in-the-loop experiences, etc.).
So even if models never get better (doubtful!) I'd continue to watch this space--it's getting better every day.
Can I ask what language/stack you’re using for your project? More specifically, is it in Python? I’ve had mediocre (though at least partly usable) results on JavaScript repos, and relatively poor ones on anything less popular.
Aider is written in Python (they have a great Discord community, btw). My experience matches yours: aider/Sonnet seems to do much better for Python than for JavaScript so far. Despite current LLM limitations, I strongly recommend aider to anyone interested in this space.
It's also very sensitive, unsurprisingly, to development documentation that is moving quickly, e.g., most AI APIs right now. A lot of manual intervention is still required here because of out-of-date references to imports, etc.
For a project like yours I guess you should be given free credits. I hope that happens, but so far nobody has even given Karpathy a good standalone mic.
I'm an active aider user, I spent ~$120 last month on a combo of Sonnet and Opus. It was much more expensive, as you probably know, with Opus. Now it's rather reasonably priced and more sustainable, IMO.
There is no roadmap for any of these improvements to happen, and there's a strong possibility that we will start to see diminishing returns with the current LLM approach and available datasets. At which point all of the hype and money will come out of the industry. Which in turn will cause a lull in research until the next big breakthrough, and the cycle repeats.
While we have started seeing diminishing returns on rote data ingestion, especially with synthetic data leading to model collapse, there is plenty of other work being done to suggest that the field will continue to thrive. Moore’s law isn’t going anywhere for at least a decade, so as we get more computing power, faster memory interconnects, and purpose-built processors, there is no reason to suspect AI is going to stagnate. Right now the bottleneck is arguably more algorithmic than compute-bound anyway. No one will ever need more than 640kb of RAM, right?
a) It's been widely acknowledged that we are approaching a limit on useful datasets.
b) Synthetic data sets have been shown to not be a substitute.
c) I have no idea why you are linking Moore's Law with AI. Especially when it has never applied to GPUs and we are in a situation where we have a single vendor not subject to normal competition.
Synthetic data absolutely does work well for code.
While Moore's Law probably doesn't strictly apply to GPUs, it's not far off. See [1] where they find "We find that FLOP/s per dollar for ML GPUs double every 2.07 years (95% CI: 1.54 to 3.13 years) compared to 2.46 years for all GPUs." (Moore's law predicts doubling every 2 years)
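For a rough sense of what that difference means, here's a quick back-of-the-envelope comparison of the two doubling times over a decade (illustrative arithmetic only):

    # Compare FLOP/s-per-dollar growth: 2.07-year doubling (ML GPUs, per the
    # quoted estimate) vs the classic 2-year Moore's law doubling.
    years = 10
    ml_gain = 2 ** (years / 2.07)     # ~28x over a decade
    moore_gain = 2 ** (years / 2.0)   # 32x over a decade
    print(f"ML GPUs: ~{ml_gain:.0f}x, Moore's law: {moore_gain:.0f}x")

So over a single decade the two are close; the gap only compounds over longer horizons.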
I wonder whether people who mention Moore's law mean it literally or figuratively, i.e. literally as in the shrinking of transistors, or figuratively as in any and all efforts to increase overall computational speed.
b is made up. They have absolutely not been shown to not be a substitute. It's just a big flood of bad research which people treat as summing up to a good argument.
Maybe not 10x yet, but DeepSeek Coder has done some impressive things recently. Instead of a generic LLM, they have a relatively smaller one which is coding-specific and GPT-4-class in quality. This makes it cheaper. In addition, they can do caching, which reduces the cost of follow-up requests by ~10x. And there are still improvements around STaR, which reduces the need for training datasets (models can self-reflect and improve without additional data).
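To illustrate where that ~10x on follow-ups can come from (prices below are made up for illustration, not any provider's actual rates): if a follow-up request mostly re-sends an already-cached prefix, and cached input tokens are billed at roughly a tenth of the normal rate, the input side of the bill drops by about 10x.

    # Illustrative only: made-up per-million-token prices, not a real rate card.
    INPUT, CACHED_INPUT, OUTPUT = 1.00, 0.10, 2.00  # $ per 1M tokens

    def cost(prompt_tokens, cached_tokens, output_tokens):
        uncached = prompt_tokens - cached_tokens
        return (uncached * INPUT + cached_tokens * CACHED_INPUT
                + output_tokens * OUTPUT) / 1e6

    first = cost(50_000, 0, 1_000)           # initial request: nothing cached yet
    follow_up = cost(51_000, 50_000, 1_000)  # same 50k prefix now served from cache
    print(first, follow_up)                  # the cached prefix costs ~1/10th of before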
So while we're not 10x-ing everything, it's not like there's no significant improvements in many places.
Unfortunately the smaller model is not anywhere near GPT-4 in quality, and no one seems to want to host the bigger model (it was even removed from Fireworks AI this week). And no one in their right mind wants to send their code to DeepSeek's Chinese API hosting.
I'm perfectly fine sending my open source code to them. I'm also happy to send 95% of my private repos. Let's be honest, it's just boilerplate code not doing anything fancy, just routing/validating data for the remaining 5%. Nobody cares about that and it's exactly why I want AI to handle it. But I wouldn't send that remaining 5% to OpenAI either.
Much of Nvidia's marketing material covers this, if you want to believe it. At minimum, they claim that there will be a million-fold increase in compute available specifically to ML over the next decade.
You don't know where it will go, just as people didn't know that LLMs would be developed at all. There are no real oracles at this level of detail (more vaguely, in broad strokes and over decades, some sci-fi authors do a reasonable job, and even they get a lot wrong).
There have been a lot of people making these sorts of claims for years, and they nearly never end up accurately predicting what will actually happen. That's what makes observing what happens exciting.
Actually the improvement graphs are still scaling exponentially with training/compute being the bottleneck. So there isn't yet any evidence of diminishing returns.
I just watched an Andrew Ng video (he's the guy I tend to learn the latest prompting, agentic, and visual-agentic practices from) saying that hardware companies as well as software companies are working on making these improvements happen, especially at the inference stage.
Guessing you used 4o and not 4o-mini. For stuff like this you are better off letting it use mini, which is practically free, and then having it double- and triple-check everything.
This assumes that the model knows it is wrong. It doesn't.
It only knows statistically what is the most likely sequence of words to match your query.
For rarer datasets, e.g. when I had Claude/OpenAI help out with an IntelliJ plugin, it would continually invent methods for classes that never existed. And it could never articulate why.
This is where supporting machinery & RAG are very useful.
You can auto-lint and test code before you set eyes on it, then re-run the prompt with either more context or an altered prompt. With local models there are options like steering vectors, fine-tuning, and constrained decoding as well.
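As a rough sketch of what that loop looks like in practice (hypothetical helper names, not any particular tool's API; assumes ruff and pytest are installed):

    import subprocess, tempfile

    def check(code: str) -> str | None:
        """Lint and test generated code; return the error output, or None if clean."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        for cmd in (["ruff", "check", path], ["pytest", "-q", path]):
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                return result.stdout + result.stderr
        return None

    def generate_with_retries(ask_llm, prompt: str, attempts: int = 3) -> str:
        """ask_llm is a hypothetical prompt -> code callable (whatever model you use)."""
        code = ask_llm(prompt)
        for _ in range(attempts):
            errors = check(code)
            if errors is None:
                return code  # only now does a human set eyes on it
            # Re-run the prompt with the failure output as added context.
            code = ask_llm(prompt + "\n\nThe previous attempt failed with:\n" + errors)
        return code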
There's also evidence that multiple models of different lineages, when you rate their outputs and take the best one at each step, can surpass the performance of any single better model. So if one model knows something the others don't, you can automatically fail over to the one that can actually handle the problem, and typically once the knowledge is in the chat the other models will pick it up.
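A sketch of that "rate the outputs, take the best one at each step" idea, where the model callables and the scoring function (tests passing, a judge model, etc.) are placeholders:

    def best_of_models(models, score, prompt, history):
        """models: dict of name -> callable(prompt, history), ideally different lineages.
        score: rates a candidate output; higher is better (e.g. tests passed)."""
        candidates = {name: ask(prompt, history) for name, ask in models.items()}
        best = max(candidates, key=lambda name: score(candidates[name]))
        # Once the winning answer is in the shared history, the other models
        # can pick that knowledge up on subsequent steps.
        history.append(candidates[best])
        return candidates[best]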
Not saying we have the solution to your specific problem in any readily available software, but that there are approaches specific to your problem that go beyond current methods.
4o-mini is cheap, but is not practically free. At scale it will still rack up a cost, although I acknowledge that we are currently in the honeymoon phase with it. Computing is the kind of thing that we just do more of when it becomes cheaper, with the budget being constant.
It doesn't work like that. You're more likely to end up with a fractal pattern of token waste, potentially veering off into hallucinations, than with actual progress from "double and triple checking everything".
The "Browsing agent" is a bit worrisome. That can reach outside the sandboxed environment. "At each step, the agent prompts the LLM with the task description, browsing action space description, current observation
of the browser using accessibility tree, previous actions, and an action prediction example with
chain-of-thought reasoning. The expected response from the LLM will contain chain-of-thought
reasoning plus the predicted next actions, including the option to finish the task and convey the result
to the user."
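Concretely, the scheme quoted above amounts to assembling a prompt per step along these lines (a paraphrase, not the project's actual code; the field names are illustrative):

    def build_browsing_prompt(task, action_space, accessibility_tree,
                              previous_actions, cot_example):
        """One step's prompt, paraphrasing the scheme quoted above."""
        return "\n\n".join([
            f"Task: {task}",
            f"Available browser actions:\n{action_space}",        # click, type, goto, ...
            f"Current observation (accessibility tree):\n{accessibility_tree}",
            f"Previous actions:\n{previous_actions}",
            f"Example with chain-of-thought reasoning:\n{cot_example}",
            "Respond with your reasoning, then the next action(s); "
            "you may finish the task and convey the result to the user.",
        ])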
How much can that do? Is it smart enough to navigate login and signup pages? Can it sign up for a social media account? Buy things on Amazon?
I used this to scaffold out 5 HTML pages for a web app, having it iterate on building the UX. Did a pretty good job and took about 10 minutes of iterating with it, but cost me about $10 in API credits which was more than I expected.
Cost is one of our biggest issues right now. There's a lot we can do to mitigate, but we've been focused on getting something that works well before optimizing for efficiency.
I think that’s correct: even at a “high” cost (relative to what? A random SaaS app, or an hour of a moderately competent full-stack dev?), the ROI will already be there for some projects. As prices naturally improve, a larger and larger portion of projects will make sense, while we also build economies of scale with inference infrastructure.
This is a bigger issue than folks realize. Visual inputs to GPT-4 are really expensive (like several cents per dozen images in some cases), which means that you can't just spam the API to iterate on HTML/webpages with a software agent. We're trying to tackle this for web screenshots (and documents) with a custom model geared towards structured schemas, designed to be fed into a feedback loop like the above while keeping costs down.
It's gross that this has a person's first name. How dehumanizing that will be for real Devins as this kind of thing becomes productized. How tempting to compare yourself to a "teammate" your employer pays a cloud tenant subscription for.
It's a reference to Devin, one of the earlier (and most hyped) "autonomous" AI-agent-based software devs, which this project attempts to replicate/match in the open.
Your interestingly different ire would be better directed at the original project.
Odd take. There are plenty of products, restaurants and services that use a first name as their name. I don't think it's a big deal, or negative at all.
Don't build a platform for software on something inherently unreliable. If there is one lesson I have learnt, it is that systems and abstractions are built on interfaces which are reliable and deterministic.
Focus on LLM use cases where accuracy is not paramount; there are tons of them: OCR, summarization, reporting, recommendations.
As a result of human unreliability, we had to invent bureaucracy and qualifications for society at large, and design patterns and automated testing for software engineers in particular.
I have a suspicion that there's a "best design pattern" and "best architecture" for getting the most out of existing LLMs (and some equivalents for non-software usage of LLMs and also non-LLM AI), but I'm not sure it's worth the trouble to find out what that is rather than just wait for AI models to get better.
People may be unreliable, but the software they produce needs to work reliably.
A software system is like Legos: it forms a system of dependencies. Each component in the chain has interfaces which other components depend on. 99% reliability doesn't cut it for software components.
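That last point is easy to make concrete: a chain of components that must all work compounds quickly, even at 99% each.

    # End-to-end reliability of a chain where every component must work.
    per_component = 0.99
    for n in (10, 50, 100):
        print(n, round(per_component ** n, 3))
    # 10 -> 0.904, 50 -> 0.605, 100 -> 0.366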
I'm not sure, but you may be misunderstanding the project, or trying to make some point I'm missing. This project just automates some code tasks. The developer is still responsible for the design / reliability / component interfaces. If you see the result doesn't match expectations, you can either finish it yourself or send this tool on another loop with new instructions.
The word "need" is an extreme overstatement here. The vast majority of software out there is unreliable. If anything, I believe it is AI that can finally bring formally verified software into the industry, because us regular human devs definitely aren't doing that.
That's a fair statement: humans cannot be the gatekeepers for accuracy or reliability.
But why should the solution involve AI (that's just the latest bandwagon)? Formal verification of software has a long history which has nothing to do with AI.
I've had trouble trying to convince a few different people of this over the years.
One case: the other dev refused to allow a commit (fine) because some function had known flaws and should no longer be used for new code (good reason), but this fact wasn't documented anywhere (raising flags), so I tried to add a deprecation tag as well as changing the thing. They refused to allow any deprecation tags "because committed code should not generate warnings" (putting the cart before the horse), and even refused to accept that such a warning might be useful to anyone. So they became a human compiler in the mode of all-warnings-are-errors… but only they knew what the warnings were, because they refused to allow them to be entered into the code. No sense of irony. And of course, they didn't like it when someone else approved a commit before they could get in and say "no, because ${thing nobody else knew}".
A different case: years after Apple had switched ObjC to use ARC, the other dev was still refusing to update, despite the semi-automated tool Apple provided to help with the ARC transition. The C++ parts of their codebase were even worse, as they didn't know anything about smart pointers and were using raw pointers, new, and delete everywhere. I still don't count myself as a C++ dev despite having occasionally used it in a few workplaces, and yet I knew about smart pointers even then.
And, I'm sure like everyone here has experience of, I've seen a few too many places that rely on manual testing.
That's not universal. QA teams exist for things which are not easy to automatically test. We also continuously test subjective areas like "does this website look good".
Agreed, but the boundaries of automation are progressing year after year. We won't be able to replace everything humans do for testing anytime soon, but still a lot can and will be done.
I really don’t like the denigration of humanity to sell these products. The current state of LLMs is so far behind the average human on “reliability” that these marketing lines are insulting.
It really seems like the tech-bro space hates humans so much that their motivation in working on these products is replacing them to never have to work with a human again.
>I really don’t like the denigration of humanity to sell these products.
Sure, but then humanity was denigrated the first time a calculator was used to compute a sum instead of asking John Q Human to do it.
I'd argue that the more we find ways to replace humans with AI, we're more clearly defining what humanity is. Not about denigration or elevation, just truth.
This is an interesting take, but I don't think it quite captures the idea of "agents".
I prefer to think of agents as _feedback loops_, with an LLM as the engine. An agent takes an action in the world, sees the results, then takes another action. This is what makes them so much more powerful than a raw LLM.
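In rough pseudocode (not any particular framework's API), the loop looks something like this:

    def run_agent(llm, tools, goal, max_steps=20):
        """Minimal agent loop: the LLM picks an action, the world answers back."""
        history = [f"Goal: {goal}"]
        for _ in range(max_steps):
            action = llm("\n".join(history))      # e.g. "run_tests()", "edit_file(...)"
            if action.startswith("finish"):
                break
            observation = tools.execute(action)   # the result of acting on the world
            history.append(f"Action: {action}")
            history.append(f"Observation: {observation}")
        return history

The observation feeding back into the next prompt is the whole trick; it's what lets the model correct course instead of emitting one shot in the dark.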
I tried OpenDevin for a sort of one-off script that did some file processing.
It was a bit inscrutable what it did, but it worked no problem. Much like the ChatGPT code interpreter looping on Python errors until it has a working solution, including pip-installing the right libs and reading the libs' docs for usage errors.
An N of 1, and a small freestanding task I had already done myself, but I was impressed.
They always let anyone publish a paper, as long as the submitter has an email address from a known institution OR an endorsement from someone who does. Any edu-email may actually suffice if I'm not mistaken.
arxiv.org is not a peer-reviewed publication but an archive of scientific documents. Notably, it includes preprints, conference papers, and a fair bit of bachelor's and master's projects.
The best way to use arxiv.org is to find a paper you want to read from a "real" publication and get the pdf from arxiv.org so you can read it without the publication subscription.
That is not to say arxiv.org is all horseshit though. Plenty of good stuff gets added there; you just need to keep your bullshit radar active when reading. Even some stuff published in Nature or IEEE smells like unwashed feet once you read them, let alone what arxiv.org accepts.
Good citation count and decent writing are often better indicators than a reputable publication.
The exact same thing happened with crypto and "whitepapers". I think it's because both these fields have so many grifters that believe an arxiv paper provides them much-needed legitimacy. A blog post doesn't have the same aura to it...
Probably to be fully autonomous, vs guided like aider.
I still think a tool like aider is where AI is heading; these "agents" are built on top of systems that are 15% error-prone and just compound errors, with little ability to actually correct them.
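The compounding is the killer: at a 15% error rate per step, the chance of a clean multi-step run drops fast.

    # Probability that every step of an agent run succeeds at 85% per-step accuracy.
    per_step = 0.85
    for steps in (3, 5, 10):
        print(steps, round(per_step ** steps, 2))
    # 3 -> 0.61, 5 -> 0.44, 10 -> 0.2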
Yeah, it has more agency, looks up docs, installs dependencies, writes and runs tests.
Aider is more understandable to me, doing small chunks of work, but it won't do a Google search to find usage info, etc. It depends on you to choose which files to put in context and so on.
I wish aider had a bit more of the self directedness of this, but API calls and token usage would be greatly increased.
Edit: or maybe an agency loop like this steering aider based on a larger goal would be useful?
My project Plandex[1] fits somewhere between aider and opendevin in terms of autonomy, so you might find it interesting. It attempts to complete a task autonomously in terms of implementing all the code, regardless of how many steps that takes, but it doesn’t yet try to auto-select context, execute code, or debug its own errors. Though it does have a syntax validation step and a general verification step that can auto-fix common issues.
I don't need OpenDevin. I just need AI to reliably write a function, a unit test, or a small UI component. It needs to check the latest documentation, as its answers are often outdated. It needs to be able to pass the tests and debug itself without getting into a loop of repetitive errors it can't get out of.
If an LLM could do that, it would save me so much time. But the latest models are all bad at this currently.
Please don’t give any tools, AI or not, the freedom to run away like this. You’re inviting a new era of runaway worm-style viruses by giving such autonomy to easily manipulated programs.
To what end anyway? This is massively resource heavy, and the end goal seems to be to build a program that would end your career. Please work on something that will actually make coding easier and safer rather than building tools to run roughshod over civilization.
While I agree, that ship seems to have sailed for the time being. There will be a lot of very dubious code for the coming years/decade. Currently, using Claude Projects or Copilot Workspace, you can write fully working software, but every time you ask for a change, it will duplicate or mess up some part of the code. You can just ask it to fix things, but say you have the following exchange:
- fix A please
- hmm, ok A fixed, B broken; fix B please
- hmm, ok B fixed, A now a bit broken, fix A please
- A & B working
But when you check the code, you often see that it wrote code for A that broke B, then fixed B while leaving the code for A in place, now basically dead code but not necessarily detectable as such. Then it wrote code for A again, after the code for B, and the user thinks all is fine because it works. And this happens 1000x / day in normal projects.
I see it everywhere. Good for me (my company troubleshoots and fixes code/systems), but not for the world.
I don't believe so; it's meant to run in its own Docker container sandbox. If you're looking for something that's integrated with an IDE, my current favorite plugin is https://www.continue.dev/. Apache 2.0 license, local or remote LLM integration, automatic documentation scraping (with a hefty list of docs preinstalled), and the ability to selectively add context to your prompts (@docs, @codebase, @terminal, etc.). I haven't seen any great human-in-the-loop-in-the-IDE options quite yet.