I don't know to what extent you can self-verify your own system based on proof code you also write yourself. However, I do know this topic of merging AI, deep learning, and proof assistants is an up-and-coming research area. I mainly follow publications from Talia Ringer [1] and collaborators on this topic.
Maybe I misunderstand, but my understanding is that programs (in at least some languages) can be given proofs of correctness for some pre- and post-conditions. See e.g., https://arxiv.org/pdf/2303.05491.pdf
Conceptually, you tell the computer how to validate that the conditions you state are correct with your proof, and it checks that each step follows.
To me, it seems like there will be a finite set of pre- and post-conditions ever needed for 99% of future programming, and eventually everything will be a catalog mapping conditions to working code.
One just needs a language to write those conditions in.
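To make the idea concrete: real verifiers (Dafny, Coq-extracted code, etc.) prove pre- and post-conditions statically for all inputs, but the shape of a condition-annotated function can be sketched with a runtime-checked decorator. This is just an illustration, not a verifier; the `contract` helper here is hypothetical:

```python
import functools

def contract(pre, post):
    """Attach a pre/post-condition pair to a function.
    A static verifier would prove these hold for every input;
    this sketch only spot-checks them at call time."""
    def wrap(f):
        @functools.wraps(f)
        def inner(*args):
            assert pre(*args), "precondition violated"
            result = f(*args)
            assert post(result, *args), "postcondition violated"
            return result
        return inner
    return wrap

@contract(pre=lambda xs: len(xs) > 0,
          post=lambda r, xs: r in xs and all(r <= x for x in xs))
def minimum(xs):
    return min(xs)
```

The "language to write those conditions in" is then just the host language's boolean expressions, which is roughly what tools like Dafny formalize.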
I'm not sure how to phrase this, but it seems like there's some trade-off between how easy it is to express a condition and how easy it is to verify that it holds in an imperative language.
On the one hand you have declarative languages, and on the other imperative languages.
Not sure where I'm going with this, but I think there's something there.
Essentially, the AI's output is fed into a checker, whose output is fed back into the AI for "reflexion". The AI then often corrects itself (leading to a noticeable improvement in GPT-4 performance).
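The loop itself is simple; here's a minimal sketch where the "checker" is just Python's `compile()` testing for syntax errors, and the model is stubbed out with canned attempts (a real setup would call an LLM API and run tests, not just compile):

```python
def check(code: str):
    """Toy checker: try to compile the candidate; return (ok, feedback)."""
    try:
        compile(code, "<candidate>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"SyntaxError: {e.msg} on line {e.lineno}"

def reflexion_loop(generate, prompt: str, max_rounds: int = 3):
    """Feed the checker's feedback back into the model until the code passes."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate(prompt, feedback)
        ok, feedback = check(code)
        if ok:
            return code
    return None

# Stand-in for an LLM: the first attempt is broken, the retry is fixed.
attempts = iter(["def f(:\n    pass", "def f():\n    pass"])
fixed = reflexion_loop(lambda prompt, fb: next(attempts), "write f")
```

The interesting engineering is all in `check` (compilers, linters, unit tests) and in how the feedback string is folded into the next prompt.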
Very cool project. How reliable are you finding your prompts? They look like good choices based on my experience prompting GPT-3.5 and 4 for code editing.
FYI, I think my open source tool aider would work out of the box to serve this use case. You would just run:
aider file.py --msg "implement the comments"
Of course aider works with any popular language, not just python. And it can do a lot of other coding tasks. It's like pair programming with an AI.
Aider provides experimental support for LLMs other than OpenAI's GPT-3.5 and GPT-4. The support is currently only experimental for two reasons:
1. GPT-3.5 is just barely capable of editing code to provide aider's interactive "pair programming" style workflow. None of the other models seem to be as capable as GPT-3.5 yet.
2. Just "hooking up" aider to a new model by connecting to its API is almost certainly not enough to get it working in a useful way. Getting aider working well with GPT-3.5 and GPT-4 was a significant undertaking, involving specific code editing prompts and backends for each model and extensive benchmarking [0].
Officially supporting each new LLM will probably require a similar effort to tailor the prompts and editing backends.
Numerous users have done experiments with numerous models. None of these experiments have yet identified other models that look capable of working well with aider. Claude has been the most promising so far, and the new Code Llama looks very interesting at first glance.
Once we see signs that a particular model is capable of code editing, it would be reasonable for aider to attempt to officially support such a model. Until then, aider will simply maintain experimental support for using alternative models.
There is more information on connecting aider to other models, local models and Azure models in the FAQ [1]. There are also ongoing discussions about LLM integrations in the aider discord [2]:
Paul's work on this problem space is worth following, or at least reading the thoughts and iterative engineering results such as this comparison of the GPT models and the new functions API:
Copilot currently keeps in context only the file you are editing. Cross-file support is coming but isn't here yet (https://githubnext.com/projects/copilot-view/). It would be very useful.
One concept that's been noodling around in my brain is to construct a DFA (deterministic finite automaton) from the code seen across all the files, then offer the first n-1 tokens to the language model and constrain the nth token's selection to the valid ones. I recall someone did this for formats that produce fairly small DFAs (like JSON), and it essentially produced 100% valid JSON without hallucinations (though it could be garbage JSON).
So for example if I had a `class ABC` then typing `abc.` could produce:
1. all the methods on it that were valid and
2. had arguments from the surrounding code informed by the LLM.
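A toy sketch of that constrained-decoding idea, using a hand-built DFA for the `abc.` example (in practice the DFA would be derived from the codebase and the scores would be real LLM logits; both are faked here):

```python
# Tiny DFA for calls like "abc.read(": each state maps allowed tokens
# to the next state. "read"/"write" stand in for methods found in source.
DFA = {
    "start":  {"abc": "obj"},
    "obj":    {".": "dot"},
    "dot":    {"read": "method", "write": "method"},
    "method": {"(": "args"},
}

def constrained_pick(state, scores):
    """Mask out tokens the DFA forbids in `state`, then take the
    model's highest-scoring survivor. `scores` fakes LLM logits."""
    allowed = DFA[state]
    best = max((t for t in scores if t in allowed), key=scores.get)
    return best, allowed[best]

state, out = "start", []
for scores in [{"abc": 0.9, "xyz": 0.99},     # xyz scores higher but is invalid
               {".": 0.5, ";": 0.8},
               {"read": 0.7, "delete": 0.9},  # delete isn't a real method
               {"(": 1.0}]:
    tok, state = constrained_pick(state, scores)
    out.append(tok)

print("".join(out))  # abc.read(
```

The model never gets to hallucinate a nonexistent method, because invalid tokens are masked before selection; whether the result is *useful* code is still up to the model's scores.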
I'd like to make something more constrained. Instead of a fully-general programming language, let the LLM configure data-flows between pre-defined modules, field mappings, or presentations.
Then, hopefully, we could let the end-user more directly edit the prompt.
> You should be squeamish about running the code without reading it first, given that you're pair-programming with a bot.
It's funny, the first version of this project[0] let you do exactly that, e.g.,
def main(path: str):
    #<<filenames = a list of filenames under path>>
    for fn in filenames:
        #<<size = size of fn in bytes>>
        print(fn, size)

#<<use argparse and call main>>
and then run that program like any other Python script (using import magic).
Decorators are a runtime construct, while this is a code-writing-time feature; wouldn't it be confusing to have Maccarone find decorators at write time that then did nothing at runtime?
There are examples of decorators having no runtime effects, such as typing.overload. But comments are probably more flexible here, since they allow arbitrary blocks, not just function/class scopes.
How would not having the source code be present/pre-generated, and thus needing to generate it at runtime, be an example of a decorator having no runtime effects?
OP has to be doing some parsing somewhere, so you just switch to seeking decorators rather than magic comments. It's still before the code reaches the interpreter.
The potential issue I see here is that comments are valid anywhere while decorators may not be, but the parser is hopefully resilient to that. You could imagine a multi-phase LLM setup that uses the interpreter to ensure the code runs and works as expected.
I'm not objecting to what you say. The latter part of my comment is simply considering cases where you only want to replace a part of the function (the author has some examples on the project page), which a decorator wouldn't be flexible enough to support. Of course you can also argue those can easily be rewritten to be function-scoped, but that is a design trade-off you make that the author did not (or didn't want to).
In my question, there is no need for the decorator to be handled at runtime; the tool that does the LLM work has to parse the Python code to know what to generate. It can just key off decorators rather than comments, or at least that's my question & hypothesis.
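A minimal version of that hypothesis: the tool parses the file with the `ast` module and keys off a (hypothetical) `@llm` decorator instead of magic comments, entirely before the code ever runs:

```python
import ast

SOURCE = '''
@llm("list files under path and print their sizes")
def main(path: str):
    ...
'''

def find_llm_targets(source: str):
    """Return (function name, instruction) pairs for every @llm(...) decorator.
    Parsing never executes the code, so `llm` needn't exist at runtime."""
    targets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for dec in node.decorator_list:
                if (isinstance(dec, ast.Call)
                        and isinstance(dec.func, ast.Name)
                        and dec.func.id == "llm"):
                    targets.append((node.name, dec.args[0].value))
    return targets

print(find_llm_targets(SOURCE))
```

This only reaches function/class granularity, though, which is exactly the flexibility gap the comment approach avoids.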
Comments exist in most languages, so I can see that angle, but you still have to be able to parse all supported languages, which is no small feat.
You can alternatively split generated code from human-written code by file, and keep the mapping in something more structured like a config file.
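For example, such a mapping could look something like this (the schema here is made up purely for illustration):

```json
{
  "generated/listing.py": { "prompt": "list files under path and print sizes" },
  "generated/cli.py":     { "prompt": "use argparse and call main" },
  "src/main.py":          { "human": true }
}
```

The tool then only ever rewrites files marked as generated, and no language-specific comment parsing is needed at all.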
I just normally see a better way to do what a magic comment does, generally speaking. There is typically a better construct within the language itself (most common), and config files offer much more structure with existing tooling (you can decode them in your preferred language).
Assuming you're using source control properly and read the diff before running it, I guess this is one way to make sure that a comment matches the code? If the bot changes it, maybe your comment wasn't clear enough?
I think they're going for some sort of high-level semantic description as a workflow concept to guide you, instead of using AI at each individual step.
I'd much rather focus on teaching people how to think about tests and what they want to do and, when they're stuck on syntax or patterns, make sure they're thinking enough about the problem so that they know why they take the decisions they do. They can then use a search engine or AI to cover a specific technical gap.
Implementing or writing code is rarely the bottleneck in software development once you have enough experience, so something like this doesn't seem useful to me, but it's definitely cool to see people trying to integrate new technologies in different ways.
I tried implementing something like this over the summer but couldn't make progress with the 20-30 second minimum response time for each OpenAI-generated block. From the demo video it looks like this runs pretty fast -- or does it?
I think they're spiritually related, but my understanding is that Marvin actually invokes the LLM at runtime to execute an AI Function (which is why it can perform, e.g., sentiment analysis). Maccarone invokes the LLM to generate code during development.
Ha, I just want to clarify that the quoted text isn't actually from the README or something like that, I'm not quite that crazy.
But no real argument with the concern. An LLM will generate bugs, and that may be a reason this kind of thing never makes sense in practice (isn't that an argument against copilot, too, though?).
It finds _some_ bugs. CI/CD and a massive investment in automated testing have probably had the largest impact in moving software quality forward. (See e.g. "Accelerate", Forsgren, Humble & Kim.)
Code review is an excellent tool to socialize knowledge and train up more junior engineers, but in terms of preventing bugs, it's low-value.
I'm fairly certain we ship far more bugs now than we ever did.
Before we had the ability to just add a patch and let the user download it, the end result needed to be very solid, because once that disk was purchased and taken home, it was static.
Now less attention is paid to these things, because it's just assumed to be tomorrow's problem.
It would be nice to make it clear that you're not actually quoting someone when you use quotation marks to paraphrase them. My concern is that it can come across as a bit of straw-manning otherwise.
There are so many languages with awesome type systems which can help guide AI to generate better code — and yet, these experiments always choose Python.
https://en.wikipedia.org/wiki/Macaronic_language
I finally have the right term to describe the warning signs from 1960s-era mainframes that coined "blinkenlights".
https://en.wikipedia.org/wiki/Blinkenlights
There should be a German term for this, but "gefälschtes Deutsch" ("fake German") doesn't quite capture it.