How to tackle unreliability of coding assistants (martinfowler.com)
159 points by ingve 9 months ago | 152 comments



I think this is "how to think about coding assistants and your task" but none of this is "tackling" their unreliability.

While coding assistants seem to do well in a range of situations, I continue to believe that for coding specifically, merely training on next-token-prediction is leaving too much on the table. Yes, source code is represented as text, but computer programs are an area where there's available information which is _so much richer_. We can know not only the text of the program but the type of every expression, which variables are in scope at any point, what is the signature of a method we're trying to call, etc. These assistants should be able to make predictions about program _traces_, not just program source text. A step further would be to guess potential loop invariants, pre/post conditions, etc, confirm which are upheld by existing tests, and down-weight recommending changes which introduce violations to those inferred conditions.

ChatGPT and tab-completion assistants have both given me things that are not even valid programs (e.g. will not compile, use a variable that isn't actually in scope, etc). ChatGPT even told me that an example it generated wasn't compiling for me b/c I wasn't using a new enough version of the language, and then referenced a language version which does not yet exist. All of this is possible in part b/c these tools are engaging only at the level of text, and are structurally isolated from the rich information available inside an interpreter or debugger. "Tackling" unreliability should start with reframing tasks in a way which lets tools better see the causes of their failures.
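
To make the first point concrete, here is a minimal sketch (mine, not the commenter's) of how much of that richer structural information is already extractable from plain source text, using nothing but Python's standard-library symtable module to list the names bound in each scope:

    import symtable

    # Example source text to analyze (any Python module would do).
    code = "def mean(xs):\n    total = sum(xs)\n    return total / len(xs)\n"

    table = symtable.symtable(code, "<example>", "exec")

    def dump(scope, indent=0):
        # Print each scope and the identifiers it defines or references.
        names = sorted(scope.get_identifiers())
        print(" " * indent + f"{scope.get_type()} {scope.get_name()}: {names}")
        for child in scope.get_children():
            dump(child, indent + 2)

    dump(table)
    # Roughly:
    # module top: ['mean']
    #   function mean: ['len', 'sum', 'total', 'xs']

Type checkers, debuggers, and tracers expose far more than this; the point is that very little of it currently reaches the model.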


I absolutely agree, but I find the situation incredibly funny. There are three characters here: me, the LLM, and the compiler. Two of them are robots, but they refuse to talk to each other - it's up to me to tell the LLM the bad news from the compiler.


that's a frontend issue though. If you use the python interpreter in ChatGPT, you can tell it to run the code until it works, at which point it'll do a couple of iterations before giving you code.


How many people are putting service credentials into a ChatGPT Python interpreter to see if their generated code "works"?


One of our very large customers was hit with ransomware recently. So…best practice is don’t do things like that.


You can just use your own.... There is still the risk GPT-4 gives you ~ "delete all the things" and you blindly run it on your infra, but.. it's better than what you describe.


This sounds like the worst game of telephone


don't you already cut and paste your error message into Google like everyone else?


I decided to cut out the middleman:

https://github.com/skorokithakis/sysaidmin/


This is really nice. Thumbs up


Thank you!


Hm, that’s true. I guess someone could have hooked up pattern matching and google search results and integrated that directly into some IDE.


It will come.

In fact, GPT is really good at correcting its code given the compiler output. We just need to automate that.

I think this is not done yet because LLM costs are O(n+1), so you really don’t want it to get stuck in some loop.
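
For what it's worth, the automation being described is roughly a bounded compile-and-retry loop. A rough sketch, with ask_llm() as a hypothetical placeholder for whatever model call you use, and a hard cap on iterations precisely because of the cost concern above:

    import subprocess, tempfile

    def ask_llm(prompt):
        # Placeholder: plug in whatever model/API call you actually use.
        raise NotImplementedError

    def generate_until_it_compiles(task, max_rounds=3):
        code = ask_llm(f"Write a single Python module that does the following:\n{task}")
        for _ in range(max_rounds):
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            # Syntax check only; correctness still needs tests and review.
            result = subprocess.run(["python", "-m", "py_compile", path],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return code
            # Feed the compiler's complaint back and ask for a fix.
            code = ask_llm(f"This code failed to compile:\n{code}\n\n"
                           f"Compiler output:\n{result.stderr}\n\n"
                           "Return a corrected version of the full module.")
        return code  # give up after max_rounds so costs stay bounded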


What does O(n+1) mean?


O of infinity


Indeed. These are basically tech demos at this point. A marvel to see, and sometimes useful, but still extremely crude.

There's a lot of headroom for sophistication as a few more research insights are made and an experienced tooling team commits a few years to making rich multimodal/ensemble code assistants that can perform smart analyses, transformations, boilerplating, etc on your project instead of just adding some text to your file.

But it'll take years of insight and labor to build that system up, adapt it to different industries/uses, and prove it out as engineering-ready.

People get caught up in the novelty of Copilot and ChatGPT and then imagine that the revolution arrives when some new paper comes out tomorrow (or that none will arrive because of today's limitations), but the far more likely reality is that the revolution paces as something more like the internet's -- real and profound, but unfolding gradually over the span of decades as people work hard to make new things with it.


Frankly, we may be barking up the wrong tree with LLMs. Sure, they deliver novel and very marketable results, hence the insane funding, but I can’t help but feel it’s really just a parlor trick and there is a yet-undiscovered algorithm that can actually deliver the AGI that we seek - AGI that is reliable and precise, like fictional AIs.


I think a lot of progress in the field of AI comes from random breakthroughs rather than evolution of existing ideas. For that reason, I think AGI will probably be another random breakthrough, and not just an LLM with more parameters. Good if you're NVidia, bad if you're OpenAI. The random breakthrough could happen at Google, or it could happen in some professor's basement. There is no way to know, and no way to throw money at the problem and guarantee a return.


> Good if you're NVidia, bad if you're OpenAI

Not necessarily good for Nvidia. If the new algorithm is branch-heavy, doesn’t parallelize well, etc., and operates better on CPUs rather than GPUs, it will be Intel, AMD, and ARM that enjoy the windfall.


Good point!


> bad if you're OpenAI

Just a few days ago there was a (somewhat cryptic) report about OpenAI developing a model that could do simple math. I'm certain they're pursuing quite a lot of research that's not directly related to LLMs.

I don't think we're heading towards AGI any time soon, but I can imagine a complex system in the next couple years which uses an LLM for text generation, but offloads its serious "thinking" on to predictable, specialized subsystems. It's easy to imagine a lot of possibilities for code in particular.


I look at it like finding a lost treasure or something like that. The government expedition with a 1 trillion dollar budget might find it first... or it might be in some guy's attic and his kid came over and happened to notice it. Spending infinite money on research increases your chances, but it's still a chance.


The protein folding problem was basically solved due to money and ML.

AlphaFold.

The leap that area took thanks to it was enormous.

It's more than just something potentially better.


It would be a very poor AGI if it didn't create a company that does AGI ASICs in the first few days. Minutes?


A cat is AGI.

(Well, not artificial, but you get my point.) It couldn’t design ASICs in minutes.


I hope that computerized AGIs also get the zoomies at 3AM.


A cat is a very narrow form of intelligence.


True. And yet so incredibly wider than any AI.


I can't use my cat to make a painting in the style of Van Gogh. Nor can I use my cat to write an email to my engineering team that I'm laying off 45% of them, in corporate speak. Nor can I use my cat to refactor a class into a more functional pattern. Hell, I can't even get my cat to hunt mice. He's a cute fucker though.


> there is a yet-undiscovered algorithm that can actually deliver the AGI that we seek - AGI that is reliable and precise, like fictional AIs.

Nobody wants to hear this, but AGI could very well be the next invention after practical cold fusion and faster-than-light travel.

Yes, we're moving faster, but does anyone know how long the road is?

A space rocket is moving very quickly (50k kmph?), much faster than any other technology such as airliners or maglev trains, but the closest star is still several light-years away.


> These are basically tech demos at this point.

I disagree with this conclusion. Crude as it is, Copilot _does_ make me faster in practice, even though I have to proofread its output. I think this also agrees with your second statement, that progress will come in small but useful improvements. If you treat Copilot like improved auto-completion, then it is very useful today.

Actually, while using syntax trees as the underlying model is a very interesting approach, there is still a lot of low-hanging fruit to make Copilot much more useful than it is today, just in terms of performance (waiting 1-2 seconds to see if there is a completion, and if it makes sense, interrupts my typing flow) as well as low-level features (e.g. mark and insert part of a suggestion, instead of inserting the whole and then removing the wrong parts).


I think the ultimate jump in capability would be achieved by making a dedicated language to be used by an LLM and training it in parallel on mainly that. The problem is that to make this actually useful you need to establish a new programming language. And making a useful programming language that is actually used by people is even more expensive and takes more time than training a state-of-the-art LLM...


The internet was hard-limited by getting access in the physical world.

The first tech giving me reliable internet was DSL, at about 60 kbytes/sec.

That tech could only be replaced by replacing or adding cables.

Nvidia is making a ton more petaflops per month than last year, and a lot of money and people have moved/shifted in a very short period of time.

Even my company saw the writing on the wall, and the only/primary thing we talk about is ML.

I also don't have to wait for services like a bank or anyone else to support email or internet services.

We already have everything in place for much faster adoption than any tech before.


"Any sufficiently advanced technology is indistinguishable from magic."

-Arthur C. Clarke


In order to really solve this, you couldn't use an LLM, at least not one in any shape that we have now.

You'd need something that actually understands the language. What is a lifetime, what is scope, what are types, functions, variables, etc. Something that can contextually look at correct, incomplete, or broken programs and reason about what they're doing by knowing the rules that the language operates on and following them to a conclusion. It would also need to understand high-level design patterns and structure to not just know what's happening in the literal sense "this variable is being incremented" but also in a more abstract sense "this variable keeps track of references because this is mixed C/Python code that needs that to handle deallocation". Something that recognizes patterns both within and outside of the code, with appropriately fuzzy matching where needed.

And I think importantly you'd need to be able to query it at a variety of levels. What's happening on a line-by-line basis and what's happening at the high level at a given point of code. One OR the other isn't sufficient.

That is not a simple ask. We're a long way away from something that smart existing.


Absolutely it's not a simple ask. But research in program synthesis was making interesting progress before LLMs came along. I think it would be better to ask how ML can improve or speed up those efforts, rather than trying to make a general ML model "solve" such a complex problem as a special case.

A step in this direction, which I've been trying to figure out as a side project, and which I would love someone to scoop me on and build a real thing, is to stitch an ML model into a (mini)kanren. Minikanren folk have built relational interpreters for small languages but not for a "real" industrial language (so far as I'm aware). These small relational interpreters can generate programs given a set of relational statements about them (e.g. "find f so that f(x1) == y1, f(x2) == y2, ..."). Because they're actually examining the full universe of programs in their small language, they will eventually find all programs that satisfy the constraints. But their search is in an order that's given by the structure/definition of the interpreter, and so for complex sets of requirements, finding the _first_ satisfying example can be unacceptably slow. But what if the search were based on the softmaxed outputs of an ML model, and you do a least-cost search? Then (roughly) in the case that beam-search generation from the ML model would have produced a valid answer, you still find it in roughly the same time, but you _know_ it's valid. In the case where a valid answer requires that in a small number of key places, we have to take a choice that the model assigns low probability, then we'll find it but it takes longer. And in the case that all the choices needed to construct a valid answer are low-probability under the model, then we're basically in the same place that the vanilla minikanren search was.
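
A toy illustration of the least-cost-search idea (very much a sketch, with a hypothetical table of model probabilities standing in for real softmaxed outputs, and nothing minikanren-specific): enumerate expressions over a tiny grammar in order of total negative log probability, and only return a candidate once it actually satisfies the examples.

    import heapq, itertools, math

    EXAMPLES = [(1, 2), (3, 4), (10, 11)]    # find f with f(x) == y for all pairs

    # Hypothetical model scores: probability of picking each production next.
    PROPOSAL_PROBS = {"x": 0.4, "1": 0.3, "add": 0.2, "mul": 0.1}
    BODIES = {"add": "(? + ?)", "mul": "(? * ?)"}

    def satisfies(expr, examples):
        try:
            return all(eval(expr, {"x": x}) == y for x, y in examples)
        except Exception:
            return False

    def search(max_steps=10_000):
        tie = itertools.count()              # tie-breaker so the heap never compares strings
        heap = [(0.0, next(tie), "?")]       # (cost, tie, partial expression with "?" holes)
        while heap and max_steps:
            cost, _, expr = heapq.heappop(heap)
            max_steps -= 1
            if "?" not in expr:
                if satisfies(expr, EXAMPLES):
                    return expr              # first complete AND verified program
                continue
            for prod, p in PROPOSAL_PROBS.items():
                child = expr.replace("?", BODIES.get(prod, prod), 1)
                heapq.heappush(heap, (cost - math.log(p), next(tie), child))
        return None

    print(search())   # e.g. "(x + 1)"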


I've thought a lot about this as well, and I'm convinced it's a really great way forward assuming the model can have some inference of the search space of the queries that it's going to run.

It's too easy to fall into infinite loops for something with only a naïve understanding of the questions.


What LLM text generation has shown is that you don't actually have to understand English to generate pretty decent English. You just have to have enough examples.

This is where the massive corpus of source code available on the Internet can help generate an "LSM" (large software model), if you can expose the tokens as the lexer understands them in the training set.

If your LSM sees a trillion examples of correct usage of lifetime and scope and types and so on, then in the same way that an LLM trained on English grammar will emit text with correct grammar as if it understands English, your LSM will generate software with correct syntax as if it understands the software. Whatever the definition of "understands" is in the context of an LLM.


But:

- natural language is flexible, computer languages are less so.

- "pretty decent English" still includes hallucinations. I've seen companies whose product demo for generating marketing copy just makes up a plausible review. Hallucinating methods, variables, other packages/modules yields broken code.

- the human thought behind natural language is not feasible to directly provide to a model. An IR corresponding to the source of the program is feasible to provide. A trace of the program executing is feasible to provide. Grounding an LLM in the rich exterior world that humans talk about is hard; grounding an LSM in the rich internal representations accessible to an IDE or a debugger is achievable.


"pretty decent english" is a pretty fuzzy bar.

Indeed, GPT-4 and Copilot can generate "pretty decent code" that will look fine to the average human coder even when it's incorrect (making up methods or getting params wrong or slightly missing requirements or similar).

The level of precision required for "pretty decent non-trivial code" is much higher than prose that looks like it was written by an educated human, so I share the idea that if it was augmented - even in really stupid ways like asking the IDE if it would even compile, in the case of Copilot, before suggesting it to the user - it would work much better, with much lower effort, than increasing its understanding implicitly by orders of magnitude.


> you don't actually have to understand English to generate pretty decent English. You just have to have enough examples.

I would have thought babies have been showing this beyond a doubt since time immemorial.


No, because we can't look into their skulls, to figure out whether they 'understand', whatever that means.


Right. We're already abstracting from English words and characters into tokens; piping code through half a compiler so the LSM is given the AST to train on doesn't seem all that far-fetched.


I keep seeing these takes and I still think OthelloGPT disproves them.

So far it really looks like a sufficiently large LLM with a sufficiently large / high-quality dataset can learn basically anything (given a slightly generous interpretation of the word "sufficiently").

In case of code completion, I think just training on program output in addition to source code would already unlock huge capability boosts.


> You'd need something that actually understands the language. What is a lifetime, what is scope, what are types, functions, variables, etc.

Why is that more important for programming languages than for human natural languages?


Because if you use a somewhat odd phrasing, uncommon terms, weird sentence structure, etc - human beings are very good at figuring out what is meant. We can interpolate, we can extrapolate, we can estimate and understand context. Very incomplete language can still be understood with the right context.

A compiler cannot do these things. The code must follow the rules of the language perfectly, or the code won't compile. And further, it doesn't understand intention like a human can. It doesn't know that you intended to loop over all items in a collection - if you have an off-by-one error it'll happily compile that regardless.


An interface layer is 70% of the value for 10% of the work.


Good point about program traces.

Another area I wish Copilot understood: where my cursor will go next. Right now it can only feed me new lines, but I bet the underlying LLM is already powerful enough that with little fine-tuning, it could guess that after I edit a call to foobar() to add an argument the next thing I'll do is probably edit the definition of foobar.

It feels like there's a trove of UX improvements in there; we've barely scratched the surface.


IDEs already have this built in. The problem is more an effect of us deciding that source code is text instead of something integrated, like Smalltalk.


Well yeah, but IDEs have code completion built-in as well.

My point is, I think Copilot-style tools could do more than insert text. They could recommend lines of code to remove, give you a quick keyboard shortcut to "place cursor in next file / line you're likely to want to edit", etc.

IDEs can do some rigid versions of that (eg "Jump to next diagnostic", "Refactor", "Jump to implementation", etc) but Copilot could be the more general, more powerful version of these features.


I've been working on some basics in this direction... taking the expected type at the cursor and using this to inform exactly what type & function definitions to splice into the prompt: https://andrewblinn.com/papers/2023-MWPLS-Type-directed-Prom...

We're currently working on two forks off this... one, using the expected type information in conjunction with existing grammar-constrained generation to enforce per-token type-correct generations, and two, using some program synthesis unevaluation techniques to also provide runtime trace data relevant to the current program hole we're trying to generate a completion for.

But yeah, in general there is so much existing work on static and dynamic program analysis which can be applied here... I think a lot of the interesting challenges are going to be UI/UX ones... interactive processes to help more precisely specify intent and iteratively validate generations as more and more code is written autonomously.


That's interesting. In my experience, although syntactic errors do happen, they're not nearly as common as, say, hallucinating methods or tripping over the exact API of a library.

One thought I've had but haven't experimented with yet is that we could leverage a lot of the existing tooling that was made for humans - e.g. tab-complete providers like Jedi, which do some type inference behind the scenes. They're able to provide suggestions of valid members for a given cursor position, and so logits could be warped to prefer tokens which match the suggestions (so if the output at a given time is `math.sq`, `math.sqrt` would be much preferred over `math.square_root`, which doesn't exist). You'd have to be a bit smart around this though, because in some situations, such as when using an identifier for the first time, it's not yet in scope, and you don't want the LLM to never create variables. Maybe some beam search shenanigans and heuristics could be enough to get useful output, but at that point it no longer seems like a quick thing to just try out :)
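
A toy sketch of that logit-warping idea (everything here - the vocabulary, the suggestions, the logits - is made up for illustration; a real version would query Jedi and a real tokenizer):

    VOCAB = {0: "rt", 1: "uare_root", 2: "rimp", 3: "("}
    SUGGESTIONS = {"math.sq": ["sqrt"]}      # what Jedi might return at this cursor

    def warp_logits(prefix, logits, boost=5.0):
        ident = prefix.split(".")[-1]        # the partial identifier, e.g. "sq"
        suggested = SUGGESTIONS.get(prefix, [])
        warped = {}
        for token_id, logit in logits.items():
            candidate = ident + VOCAB[token_id]   # what the identifier would become
            valid = any(s.startswith(candidate) or candidate.startswith(s)
                        for s in suggested)
            warped[token_id] = logit + boost if valid else logit
        return warped

    raw = {0: 1.0, 1: 2.5, 2: 0.1, 3: 0.0}   # model initially prefers "square_root"
    warped = warp_logits("math.sq", raw)
    print(VOCAB[max(warped, key=warped.get)])   # "rt", i.e. it completes math.sqrt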


it's been done, in a limited way: https://arxiv.org/abs/2306.10763; we're working on extending this, but yeah, there are a lot of details to work out


Generating invalid code is a big hassle. I have used ChatGPT for generating R code & sometimes it refers to functions that don't exist. The standard deviation of [1,2,1] is 0.577, given by sd(c(1,2,1)). Sometimes I get stdev(c(1,2,3)) - there is no stdev in R. Why not have a slower mode where you run the code thru the R interpreter first, & only emit valid code that passes that step ?
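
That "slower mode" is roughly: run the candidate code through the interpreter before emitting it. A rough sketch, assuming Rscript is installed and that you're comfortable executing the generated code:

    import subprocess, tempfile

    def r_code_is_valid(code):
        # Write the generated R to a temp file and let the real interpreter judge it.
        with tempfile.NamedTemporaryFile("w", suffix=".R", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["Rscript", path], capture_output=True, text=True)
        return result.returncode == 0

    print(r_code_is_valid("sd(c(1, 2, 1))"))     # True  (sd exists; ~0.577)
    print(r_code_is_valid("stdev(c(1, 2, 1))"))  # False ("could not find function")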


The Python one that currently exists can use Pandas, but then you're using Python and not R. Other language support must be on their roadmap, only question is how far down the list it is.


This can be done pretty easily with their api if you are willing to spend some time on it


> these tools are engaging only at the level of text

Couldn't we say the same thing about almost all UNIX/Linux coreutils? There's no way to get a strongly-typed array/tree of folders, files, and metadata from `ls`; it has to be awk-ed and sed-ed into compliance. There's no way to get a strongly-typed dictionary of command-line options and what they do for pretty much every coreutil; you have to invoke `something --help` and then awk/sed that output again.

These coreutils are at their core, only stringly-typed, which makes life more difficult than it strictly needs to be.

Everything is a bag of bytes, or a stream to be manipulated. This philosophy simply leaked into LLMs.


I think GP's point is we could get much better results if we fed the LLMs with more than the input text.

It could still be text, or at least bytestreams: program traces, OpenTelemetry logs, objdump of the compiled binaries, LLVM IR dumps, compiler errors, syntax highlighting markers, etc.


This is, indeed, also a weakness of those tools.


At least ChatGPT can already use a Python interpreter and can also write unit tests for it.

With Copilot alone, the amount of money flowing into this area is much greater now than just a year ago.

And based on a Google Research blog article, they use it internally, with over 50% of suggestions being accepted.

The race is on and no one can afford not to play the game.


> These assistants should be able to make predictions about program _traces_, not just program source text.

Are you sure 'traces' is the right word? Not something more like ASTs?

Btw, the predict-next-token approach has the benefit of also being able to deal with broken code.

You might also want to compare http://www.incompleteideas.net/IncIdeas/BitterLesson.html and https://gwern.net/scaling-hypothesis with your idea of adding more domain specific knowledge.


You know, I've been thinking about this for a while, and I honestly don't think LLMs will ever be the right choice for a coding assistant. Where I've had a strange idea they could help, although it's completely unproven yet, is in replacing what we traditionally think of as a developer and code. I'm envisioning, for simple applications, a completely non-technical person having a conversation with a coding assistant that produces something directly executable - think tokens that are JVM bytecode or something similar. I'm sure there are innumerable flaws with my thinking/approach, but so far it's been fun to think and code around.


Very well put. The current generation of these tools are incredibly useful but clearly leaving a huge amount on the table.

Just makes me more excited for future iterations!


Would be interesting to see them trained on actual syntax trees from text.

Then maybe have a separate model trained on going from syntax tree to source code.


> Then maybe have a separate model trained on going from syntax tree to source code.

I don't see why you need a model for this. But yes, this is a very cool idea.


Presumably so you don’t need to write a code generator for every language you support.


> then referenced a language version which does not yet exist

It must have gaming forums in its training data. Gamers know that the solution to all problems is to upgrade (game/drivers/OS) to the latest version.


I just started using copilot last week. I was blown away. I would barely start typing and it would show me anywhere from a single line to 15 lines of exactly what I had been intending to write. It was mind blowing. In one instance, it found reference to data in a totally different part of the page and correctly used it in the code it was creating. I still can't figure out how it did that. I was beyond impressed. Sometimes it was 100% spot on, other times it was 90% of the way there, but it was crazy how much time it saved. Though, I have been programming for many decades, so I was able to tell pretty easily if what it was creating was good or not. I think this could lead a new coder astray.


Yeah, you really need to experience it first hand to get to that "wow" moment. For me, that moment was when I was writing some code in my over-engineered codebase, with some prematurely generalized abstractions that even I can barely understand. I'd pause for a second to think "wtf am I doing again?" and Copilot would suggest a whole block of code that aligned perfectly with my intent. This happened repeatedly and it felt like I was being guided to a solution. The incredible thing was that I hadn't even thought about it yet. It's like Copilot was thinking a step ahead of me.

I found this effect to be most pronounced when writing tests. I think Copilot shines in codebases with static typing, clear interfaces, docstrings and unit tests. That's really about the densest, most richly annotated context you could give to an LLM. And that's before adding capabilities for more well-defined reasoning about static languages, types, etc. - there is potential for it to get even better at this.


Writing unit tests is the most effective use of copilot hands-down.

For pure functions, it is 100% correct and complete essentially 100% of the time, allowing me to write descriptive test names and nothing more. Even when mocks or spies have to come into the picture, it is usually 95% accurate. The key is that you should always have the file you are testing open in an editor tab, as well as another test class that demonstrates the testing style you want it to emulate.

Forget everything else about Copilot, it's worth the cost for this alone. Time writing tests reduced ~80%. They could remove all other functionality and rebrand it as Test Copilot, and they would still get my money.


Completely agree. When I talk to others about GitHub Copilot/ChatGPT, however, I am rather surprised to hear some of the criticisms. As in, it's unreliable, it's wrong, not good for beginners, etc. I don't understand that criticism at all. Yes, it's occasionally, perhaps frequently, wrong, but so is virtually any other source of information. What's the alternative? Well, you Google your way around half a dozen websites, read through forum posts, bloated/incomplete doc websites, Q&A sites, etc. It's absolute hell to me, stuff that can take 10-15 min on average to deal with, so comparatively, using LLMs versus not is like night and day. Maybe it's due in part to people being inexperienced and lacking the ability to independently judge whether content seems right or not, so I will admit you do need to be a bit of an expert to fully utilize these tools.

But even as a beginner--when I started coding some 10 years ago, I'd have killed for something like ChatGPT. Had some programming problem you needed to solve? You'd better hope that someone wrote something about it online, that it's on StackOverflow or been discussed on some other discussion board. Otherwise, it's up to you to post on StackOverflow, wait a week or two, hope you don't get ignored/downvoted, or try to find some IRC channel to post the question to. Comparatively, having the ability nowadays to talk to an AI about the most niche programming concepts in your specific use case, in English, without being judged? It's straight up magical to me.


I started coding back in the days when literally all we had were physical book documentation and maybe a physical book textbook or how-to book. Having the internet certainly is faster. Having copilot is even faster. I wonder what the next leap forward in speed will be? Back in the days of physical books, I did not foresee online documentation or things like stackoverflow. In the days of stackoverflow, I did not foresee something like copilot. So I assume we cannot yet foresee what will be the next big thing.


I feel it’s worth noting, for others who have tried Copilot before and were not impressed: I’ve experienced a massive increase in its abilities within the last two weeks, both speed- and quality-wise, that took it from kinda useful to being able to autocomplete code using a homebrewed, convoluted framework.


I’ve been using Copilot pretty much since it came out, and I am regularly shocked when some devs tell me they aren’t using it. Usually it’s because of silly company policies.

In fact, the Copilot plugin in PyCharm is also awesome for just writing normal text, like for an article.


My current hypothesis here is that the way to make coding assistants as reliable as possible is to shift the balance towards making their output rely on context provided in-prompt rather than information stored in LLM weights. As all the major providers shift towards larger context-windows, it seems increasingly viable to give the LLM the necessary docs for whatever libraries are being used in the current file. I've been working on an experiment in this space[0], and while it's obviously bottle-necked by the size of the documentation index, even a couple-hundred documentation sources seems to help a ton when working with less-used languages/libraries.

[0]: https://indexical.dev/
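
A very rough sketch of that approach (retrieve_docs() and ask_llm() are hypothetical placeholders, not Indexical's API): detect which libraries the current file imports and prepend their documentation to the prompt, so the model leans on context rather than on its weights.

    import ast

    def imported_modules(source):
        # Collect the top-level package names the file imports.
        mods = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                mods.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                mods.add(node.module.split(".")[0])
        return mods

    def complete_with_docs(source, retrieve_docs, ask_llm):
        docs = "\n\n".join(retrieve_docs(m) for m in sorted(imported_modules(source)))
        prompt = ("Relevant library documentation:\n" + docs +
                  "\n\nComplete the following file, using only APIs shown above:\n" +
                  source)
        return ask_llm(prompt)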


I like that your solution is basically telling the LLM to RTFM.


Yeah I've been using it with prompts that ask it to cite sources as well, honestly I think the best results are when I'm still interacting w/ the docs directly in addition to having the LLM look at em - still can't quite replace needing to RTFM!


This is the way forward imo. Particularly as we've started to flesh out the relationship between model size and true context reliability. We've found that raw context-window size is not representative of what the model can actually consistently recall, but we've also found the recall is consistently reliable out to a point. I suspect more robust theoretical models around superposition will move us a long way towards understanding the limits of context reliability rather than what would currently be an experimental approach.


Several times when I’ve asked ChatGPT for an approach to something, it has spit out code that uses an API that looks perfect for my use case, but doesn’t actually exist.

So I’m thinking someone should be building an automated product manager!


This is what ChatGPT does to me whenever I ask anything non-trivial. I find it funny people think it'll take over our jobs, it simply can't even do the most basic things beyond the beaten path.


are you using GPT-4? because it's been incredibly helpful to me. major refactoring gets done in a matter of seconds, and when it doesn't work, I can just paste whatever compiler error I'm getting and it usually fixes the problem


Can you give an example of what you consider “major refactoring”?


Have had the same experience several times. “Yes ChatGPT, it makes perfect sense for that to exist and would be wonderful if it did, but unfortunately it does not.”


Perhaps that means you should implement that API.


I view these LLM tools as just another code generator. I have no idea how many bespoke single-use code generators I've written, but I know that I've spent a nontrivial amount of time writing code that writes code. If an LLM can do it with a natural language interface, it'll save me a lot of time.

But also, I would never trust a script I threw together in 15 minutes to actually produce real code and solve the problem. All it does is generate the text I tell it to. It can't understand the system or how it works, it just procedurally spits out text.

In the same way that a dozen lines of Python cannot understand your program, an LLM is also fundamentally incapable of understanding it. That's the crucial part of programming. An LLM will give you text all day, but it can't write your program.

Sure an LLM can produce trivial scripts, but in my experience so far, it can only reliably generate trivial programs that I could write blindfolded in the same amount of time.

If we just treat these tools like the text generators they are instead of insisting they're code generators, we'll all be better off.


In practice I find a bigger problem is perversity - the AI assistant does OK with incremental prompting, but sometimes decides to just remove all comments from the code, or, if asked to focus its effort on one section, deletes all other code. Code assistants need to be much more tightly integrated with IDEs, and I think you probably need 2 or 3 running in parallel, maybe more.


Yes, this really irritates me.

Me: <prompt 1: modify this function>

AI assistant (either ChatGPT or Copilot-X): <attempt 1>

Me: <feedback>

AI assistant: <attempt 2>, fixed with feedback, but deleted something crucial from attempt 1, for no reason at all


To avoid stuff like this I've tried to prompt for only the necessary changes. I haven't found a good prompt that does this repeatably. It might be worth some experimentation.


One can usually get around it by modifying the prompt, but one really shouldn't have to include things like 'do not remove comments' over and over. My IDE (shoutout to Jetbrains) lets me make custom prompts, but the real problems here are the transformer model's inability to maintain context or ask questions of its own, more readily, and a lack of visual grammar for 'what are we both talking about, and how'. Pair programming has gone some way toward improving that in recent years, but realistically transformers are getting to where they can spit out code faster than most people can compose or type it (not you of course). Looking at live diffs is not ideal, especially late in a coding session when fatigue is setting in. Some sort of shared context map for functions and variables might be a better approach. As things are I find myself using a clumsy mix of save-as and commits and often throwing away hours of work when an assistant is unable to backtrack.


I've used Aider <https://aider.chat> a fair amount, and I have read about Paul's experiments with different "diff producing" methods. I think a good avenue to experiment with is two-pass: first, ask the LLM to output only the changed code, then (in a separate conversation) ask the LLM just to update the relevant lines with the change the first LLM provided. I suspect this would clear up a lot of the "LLM failed to transcribe old code" and "diff isn't well-formed" errors that this tool has.
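
A sketch of that two-pass idea, with ask_llm() as a placeholder and prompts that are purely illustrative (not Aider's actual prompts):

    def two_pass_edit(original, instruction, ask_llm):
        # Pass 1: only the new/changed lines, so the model isn't tempted to
        # retype (and silently mangle) the untouched parts of the file.
        change = ask_llm(f"{instruction}\n\nFile:\n{original}\n\n"
                         "Reply with ONLY the lines that need to change.")
        # Pass 2: a fresh conversation whose only job is mechanical splicing.
        return ask_llm("Apply this change to the file and return the complete "
                       "updated file, leaving every other line exactly as it was.\n\n"
                       f"Change:\n{change}\n\nFile:\n{original}")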


This seems ripe for using "cores" or specialist pipelines/agents to handle stuff like that. I may see what I can scrape together.


It's confusing that you had to scroll up to realize the article isn't by Martin Fowler.


I see lots of people saying the code is suspect, but I never hear about using LLMs for writing tests to validate their own code or LLM code. My success has been in generating tests - often faster and simpler than I would write by myself. What I test I can trust, especially if it comes out of an LLM.


I'm happy that more and more people embrace these tools for more and more critical software.

It keeps me employed, and even increases my rate quite a bit.


Are you an EMT?


I feel like the work on using CFGs with LLMs should be low-hanging fruit for improving code assistants, but perhaps someone more knowledgeable could chime in[1], [2], [3].

A lot of the confabulations we see today - non-existent variables, imports and APIs, out-of-scope variables, etc - would seem (to me) to be meaningfully addressable with these techniques.

Relatedly, I have gotten surprisingly great mileage out of treating confabulations, in the first instance, as a user (ie, my own) failure to adequately contextualise, rather than error.

In a sense, CFGs give you sharper tools to do that contextualisation. I wonder how far the “sharper tools” approach will carry us. It seems, to this interested layman, consistent with Shannon’s work in statistical language modelling.[4]

The term “prompt engineering” implies, to me, a focus on the instructive aspects of prompting, which is a bit unfortunate but also wholly consistent with the way I see colleagues and friends trying to interact with LLMs. Perhaps we should instead call it “context composition” or something, to emphasise the constructive nature.

[1] https://github.com/outlines-dev/outlines

[2] https://github.com/ggerganov/llama.cpp/pull/1773

[3] https://www.reddit.com/r/LocalLLaMA/comments/156gu8c/d_const...

[4] https://hedgehogreview.com/issues/markets-and-the-good/artic...
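
A tiny toy of the constrained-decoding idea from [1]-[3], with a fake uniform "model" over a four-token vocabulary; a real implementation would mask actual LLM logits with a real grammar:

    import random

    VOCAB = ["foo", "bar", "(", ")"]

    # A deliberately tiny "grammar": a call must be NAME "(" NAME ")".
    def allowed(tokens):
        return [{"foo", "bar"}, {"("}, {"foo", "bar"}, {")"}, set()][min(len(tokens), 4)]

    def sample_call(seed=0):
        random.seed(seed)
        out = []
        while allowed(out):
            # Real version: take LLM logits, set disallowed tokens to -inf, softmax,
            # sample. Here: uniform choice over whatever the grammar permits.
            out.append(random.choice(sorted(allowed(out))))
        return "".join(out)

    print(sample_call())   # e.g. foo(bar) -- never an unbalanced or misplaced token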


If you're going to be productive with LLM coding assistants, the skill you most need to develop is a strong QA process. You need to be really good at running code, trying edge-cases and quickly getting to a point where you have proven to yourself that the code works. Having good habits around automated testing helps with this a lot.

If you're already good at this stuff, you'll find the risk of coding assistants getting things wrong is pretty minimal for you.

If you have bad habits where you frequently write and commit code without first executing it and trying to poke holes in it, you'll find AI assisted coding full of traps.


Are there tools yet that put the compiler in the loop? Maybe even a check-compile-test loop that bails on the first failure and then tries to refine it in the background based on what failed?


Yes. ChatGPT Code Interpreter mode does exactly this - it writes code, runs it through a Python interpreter, then if there are any errors it rewrites the code and runs it until it works.

... and you can use it to run a C compiler too, if you know what you are doing! https://simonwillison.net/2023/Oct/17/open-questions/#open-q...


Does that compile my local projects?


It can if you zip the code up and upload it to Code Interpreter.


That seems like a joke of a workflow to me. Surrender everything, including your build process to openai so it can provide a bit of context assist?


Then maybe the model could be fine-tuned on that loop, haha. Could be a fun thing to try at least.


It's a bit of a fool's errand because they train on information which is no longer valid, and will get stuck if you don't inform them. For instance, GPT cannot write a connection to an OpenAI endpoint, because the API was upgraded to 1.0 and broke compatibility with all the code it learned from.
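
Roughly the break in question: models trained on pre-1.0 examples keep emitting the old module-level call, which no longer exists in openai>=1.0.

    # Old style, which models trained on pre-1.0 code keep producing:
    #
    #     import openai
    #     resp = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    #
    # Current style (openai-python >= 1.0):
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.choices[0].message.content)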


Is it possible for an LLM to have knowledge/input marked as "deprecated/obsolete"? Either by a user or some sort of "retraining" process?


I've been using chat gpt plus quite a bit for supporting me in programming tasks. It's too flaky to trust blindly and you need to narrowly scope what it does. It can also be a bit hand wavy with some things and I've had issues getting it to generate working code for some more complicated things. But where it really shines is in documenting things, explaining things, and suggesting problem areas in code and possible solutions.

And you can do fun things like point it at an openapi schema and asking it some questions about that API. I've given it screenshots of websites and asked it to criticize the design or document what is visible. It's amazingly good at supporting localization work.

I was working with some geospatial code for an algorithm that generates UTM coordinates from GPS coordinates a few weeks ago. I needed a Kotlin implementation that I could use on multiple of its platforms (js and jvm, i.e. no java dependencies allowed).

It was kind of useless writing the code for me for this (that's the first thing I tried obviously) but it unblocked me a couple of times and helped me figure out some details.

I ultimately found several old Java implementations that contained a lot of undocumented magical numbers and I just asked it the meaning of those numbers (about a dozen) and it came back with good explanations (basically things like wgs84 ellipsoid parameters, radius of the earth, etc.). The code wasn't quite working (it failed a test I wrote) so I asked it to identify possible causes for that and it came up with some uncovered edge cases that I was able to cross confirm with other implementations. In the end I was able to piece together a working implementation by combining different elements from several implementations. Each of them individually had issues. A lot of this code is ancient and there is apparently a lot of copy paste reuse in this space.


I have several issues with coding assistants.

Over time, will less skilled programmers produce more critical code? I think so. At some point a jet will fall out of the sky because the coding assistant wasn't correct and the whole profession will have a black eye.

The programmers will be less skilled because the (up until recently) lack of coding assistants provides a more rigorous and thorough training environment. With coding assistants the profession will be less intellectually selective because a wider range of people can do it. I don't think this is good given how critical some code is.

There is another related issue. Studies have shown that use of google maps has resulted in a deterioration of people's ability to navigate. Human mapping and navigation ability needs training which it doesn't get when google maps are used. This sort of thing will be an issue with coding.


A wide range of people can build things. We trust only a few to build jets and skyscrapers, etc.

I think much the same will happen with regards to programming. Sure, most people will be able to bust out a simple script to do X. But if you want to do a "serious task", you're going to get a professional.


I spent 10 years pair programming. It's a similar situation in some ways.

Like, I can't know if the code my pair writes has flaws, just like an AI coding assistant.

I've never learned so much about programming as when pairing. Having someone else to ask or suggest improvements is just invaluable. It's very rarely I learn something new from myself, but another human will always know things I don't, just like an AI.

Of course, you don't blindly accept the code your pair/assistant writes. You have to understand it, ask questions, and most of all write tests that confirm it actually works.


This will be an issue with all knowledge work. The machines will have the knowledge and more and more we will just trust them because we don't know ourselves. Google Maps is a great example.


I want to build a "Google Maps that doesn't make you dumber."

For local navigation, first and foremost. The goal is to teach you how to navigate your locale, so you use it less and less. You still will want to ask it for traffic updates, but you talk to it like you would between locals who know all the roads.

As a model for how to do AI in a way that enhances your thinking rather than softens/replaces it.


I thought about something like this as well. Maybe it can suggest new routes so you can know new streets. It could work by vibration (1 vibration right, 2 left) like some running apps do so you don't even have to look at your phone and can focus on what is in front of you.


Honestly I’d rather just RTFM and code it myself than invent a game to deal with these issues.


The way you tackle the unreliability of coding assistants is to know what you're doing so that you don't need a coding assistant, so that you're only using it to save time and effort.

Roughly speaking, if you stick to using AI for writing code you could have written yourself, you're okay.


I've only used a very small amount of AI assistance (mostly Anthropic's Claude), and always to learn, not to do. That is, I will ask it what's happening, why are things breaking, etc. It doesn't have to have the right answers, it just needs to unblock me.

I hear it is also quite useful for doing things which you know extremely well but are tedious to do. Anything in between is certainly a danger zone.


Though regular use of the assistant will degrade your ability to program without it.


I think in the same way that autocomplete does - I may not have typed out `.length` or whatever it happens to be in the language I used last year enough to remember a year later if it's len or length or size and if it's a property or a method after a year goes by without me touching that language... but then it's a simple search away to refresh my memory, and overall the autocomplete saves a LOT of time that makes up for this.

Yeah, if you never knew what the code that got generated did in the first place, that's not gonna apply, but if you're using it as basically just a code expander for things you could do the pseudocode for in your sleep, you're probably gonna be ok.


I'm just waiting for people to catch on that using an inherently unreliable tool that cannot gauge its own reliability to generate code or answer fact-based queries is a fool's errand. But I expect we'll spend a lot of effort on "verifying" the results before we just give up entirely.


I've written a few pieces of code with the help of AI. The trick is that I don't need the help. I can see where the bugs are.

The AI could do certain things much faster than I would be able to by hand. For instance, in a certain hash table containing structures, it turned out that the deletion algorithm which moves elements couldn't be used because other code retains (reference counted) pointers to the structures, so the addresses of the object have to be stable. AI easily and correctly rewrote the code from "array of struct" to "array of pointer to struct" representation: the declaration of the thing and all the code working on it was correctly adjusted. I can do that myself, but not in two seconds.


I could not disagree more. Having something generate code which is 90-100% correct is extremely valuable.

E.g. creating a new page in an app.

Feed an LLM the design, the language/framework/component library you're using, and get a page which is 90% of the way there. Some tweaks and you're good.

Far, far quicker than going by hand, and often better quality code than a junior developer.

Now I would never deploy that code without reading it, understanding it, and testing it, but I would always do that anyway. GPT-4 is close enough to a good software engineer that to resist it is to disadvantage your business.

Now if you're coding for pleasure and the fun of creation and creativity, then ditch the LLMs, they take some of that fun away for sure. But most of the time, I'm focused on achieving business outcomes quickly. That's way more productive with a modern LLM


I strongly disagree. Having something which is 90% reliable doesn't save me any effort over just doing it myself. If I have to spend the time to check everything the tool generates (which I do, because LLMs are unreliable) then I may as well have written it myself to start with.

I firmly believe that the worst kind of help is unreliable help. If your help never does its job, then you know you have to do everything yourself. If your help always does its job, you know you can trust it. If your help only sometimes does its job, you get the worst deal of all because you never know what you'll get and have to check every single time.


I have nearly 12 months of personal experience now that having LLMs produce code for me - even when I only use ~20% of the code they generate - is still a huge personal productivity boost.

Unreliable help is still useful if you know that it's unreliable - you learn to keep a critical eye on what it's doing and correct when necessary. Still saves a ton of time, and I'm not guessing that, I'm saying that from my own experience.


> If I have to spend the time to check everything the tool generates (which I do, because LLMs are unreliable) then I may as well have written it myself to start with.

Do you feel the same way about a PR from a coworker?


Depends on the coworker. I would say that I trust ChatGPT code much less than the average coworker, because coworkers usually don't do something like use imaginary variables or check-in code that doesn't compile.


Just wait a few months. Verifiability is the technique du jour and will make its way into a variety of models soon enough.


Seems like there are certain fundamental limits to what can be done here though. Much of the advantage to using these models is being able to come up with a vague/informal spec and most of the time have it get what you mean and come up with something serviceable with very little effort. If the spec you have in mind to begin with is fuzzy and informal, what do you use to perform verification?

After all, whether a result is correct or not depends on whether it matches the user's desire, so verification criteria must come from that source too.

Sure there are certain types of relatively objective correctness that most of the time will line up with a user's desires, but this kind of verification can never be complete afaict.


It’s a multi turn conversation, just workshop the idea and make changes along the way


.


is that a vi repeat? lol


yyp


Don’t humans fit that definition? We’ve managed okay for 1000s of years under those conditions.


Humans are unreliable, but we are also under normal circumstances thoroughly and continually grounded in an external world whose mechanics we interact with, make predictions about, and correct our beliefs about.

The specific way we're training coding assistants for next-token-prediction would also be an incredibly difficult context for humans to produce code.

Suppose you were dropped off in a society of aliens whose perceptual, cultural and cognitive universe is meaningfully different from our own; you don't have a grounding in concepts of what they're trying to _do_ with their programs. You receive a giant dump of reams and reams of source code, in their unfamiliar script, where none of the names initially mean anything to you. In the pile of training material handed to you, you might find some documentation about their programming language, but it's written in their (foreign, weird to you) natural language, and is mixed with everything else. You never get a teacher who can answer questions, never get access to an IDE/repl/interpreter/debugger/compiler, never get to _run_ a program on different inputs to see its outputs, never get to add a log line to peek at the program's internal state, etc. After a _lot_ of training, you can often predict the next symbol in a program text. But shouldn't we _expect_ you to be "unreliable"? You don't have the ability to run checks against the code you produce! You don't get a warning if you use a variable that doesn't exist! You just produce _tokens_, and get no feedback.

To the degree humans are reliable at coding, it's because we can simulate what program execution will do, with a level of abstraction which we vary in a task dependent way. You can mentally step through every line in a program carefully if you need to. But you can also mentally choose to trust some abstraction and skip steps which you infer cannot be related to some attribute or condition of interest if that abstraction is upheld. The most important parts of your attention are on _what the program does_. This is fully hidden in the next-token-prediction scenario, which is totally focused on _what tokens are used to write the program_.


I hear this argument applied often when people bring up the deficiencies of AI, and I don't find it convincing. Compare an AI coding assistant to reaching out to another engineer on my team as an example. If I know this engineer, I will likely have an idea of their relative skill level, their familiarity with the problem at hand, their propensity to suggest one type of solution over another, etc. People are pretty good at developing this kind of sense because we work with other people constantly. The AI assistant, on the other hand, is very much not like a human. I have a limited capacity to understand its "thought process," and I consider myself far more technical than the average person. This makes a verification step troublesome, because I don't know what to expect.

This difference is even more stark when it comes to driving assistants. Video compilations of Teslas with FSD behaving erratically and most importantly, unpredictably, are all over the place. Experienced Tesla drivers seem to have some limited ability to predict the weaknesses of the FSD package, but the issue is that the driving assistant is so unlike a human. I've seen multiple examples of people saying "well, humans cause car crashes too," but the key difference is that I have to sit behind the wheel and deal with the fact that my driving assistant may or may not suddenly swerve into oncoming traffic. The reasons for it doing so are likely obscure to me, and this is a real problem.


They do fit this definition. One result of the rise of generative AI is exposing just how severely and commonly people misperceive their own capabilities and the functioning of their cognitive powers.


They do not. A human can verify its own reliability.


A human cannot do this self-sufficiently. This is why we work so hard to implement risk mitigation measures and checks and balances on changes.


Yes, not all humans at all tasks all the time. And some things are important enough to implement checks regardless.

However, there's a lot we can just throw humans at and trust that the thing will get completed and be correct. And even with the checks and balances, we can have the human perform those checks and balances. A human is a pretty autonomous unit on average.

So far, AI can't really say "let me double check that for you" for instance. You ask it a thing, it does a thing, and that's it. If it's wrong, you have to tell it to do the thing again, but differently.

In all the rush to paint these LLMs as "pretty much human", we've instead taken to severely downplaying just how adaptable and clever most sentient beings can be.


AIs can double check. Agent to agent confirmation and validation is a thing.

In any case, the point is that we have learned techniques to compensate for human fallibility. We will learn techniques to compensate for gen AI fallibility, as well. The objection that AIs can be wrong is far less a barrier to the rise of their utility than is often supposed.


You see how that's worse, right?

The original argument put forth was that "an inherently unreliable tool cannot gauge its own reliability".

Someone responded that humans fit that description as well.

I said we don't. We can and do verify our own reliability. We can essentially test our assumptions against the real world and fix them.

You then claimed we couldn't do that "self-sufficiently".

I responded that while that is true for some tasks, for a lot of tasks, we can. That an AI can't check itself and won't even try.

And now you're telling me that they can check against each other.

But if you can't trust the originals, asking them if they trust each other is kind of pointless. You're not really doing anything more than adding another layer.

For example: If I put my pants on backwards, I correct that myself. Without the need to check and/or balance against any other person. I am self-correcting to a large degree. The AI would not even know to check its pants until someone told it it was wrong.

The objection isn't that "AIs can be wrong", the objection is that AIs can't really tell the difference between correct and incorrect. So everything has to be checked, often with as much effort as it would take to do the thing in the first place.


This is a very narrow view of how LLMs can interact to improve inference accuracy. Different LLMs have different capabilities. They can be used in coordination to improve results.

Your objections seem to rely on a restricted view that says "we can't do better" but with no evidence. Whereas we have plenty of evidence of massive, continual improvement in the very areas you are holding up as problematic.


If that was even close to true then I would have had to fire far fewer people over the years.


I basically refuse to use any sort of AI-assisted autocomplete for this very reason. I want consistency more than accuracy. Consistency enables power and reliability; if it changes every third time on the same input, it slows me down.


>How to tackle unreliability of coding assistants

One of the well-established techniques was always to give more worthwhile incentives that are especially meaningful to the hardworking assistants.


fairly off-topic, sorry: Midjourney gets the cartoon donkey's snout wrong, so wrong. It looks more like a cartoon dog. For some reason I'm really bothered by it.


I use it like a man page on steroids. “Write an example bash program that implements a set data structure, and demonstrate it with examples.”


I dunno, maybe hammer out a couple dozen lines of code sourced from your own brain?


Sounds like an exercise in frustration


Not use them?


[]


Why? And why allow autocomplete? Why not ban anything including compilers, especially optimizing compilers? How do you draw the line?


Maybe learn to code?


My good-faith interpretation of your comment is: "Maybe learn to code (without external resources or tools)?" LLMs are another tool and resource, just like StackOverflow, linters, autocomplete, Google, etc. None of these tools are infallible, but they provide value. Just like all other tools, you don't need to use LLMs because of their issues - but we want them to be as useful as possible - what the author is trying to do.



