Maccarone: AI-managed code blocks in Python (github.com/bsilverthorn)
181 points by silverthorn 11 months ago | 70 comments



It's so awesome to learn a new term like "macaronic language".

https://en.wikipedia.org/wiki/Macaronic_language

I finally have the right term to describe the warning signs from 1960s-era mainframes that coined "blinkenlichten".

https://en.wikipedia.org/wiki/Blinkenlights

There should be a German term for this, but "gefälschter deutscher" doesn't quite capture it.


Was about to say that sounds very German, yet with a sprinkling of Dutch.


> What prevents my program from behaving differently after each preprocessing run?

> The strength of your faith in GPT-4.

I got a chuckle out of that


The strength of my faith in GPT-4 has been getting stronger every year since 2016.


This is a GPT-4-generated post, huh? Hallucinating the dates.


What happened in 2016?


Right, but its knowledge cuts off in 2021.


In theory, an AI that wrote proofs for its code (à la Coq) could be used to validate preconditions specified by the developer, right?


I don't know to what extent you can self-verify your own system based on proof code you also write yourself. However, I do know that merging AI, deep learning, and proof assistants is an up-and-coming research area. I mainly follow publications from Talia Ringer [1] and collaborators on this topic.

[1] https://dependenttyp.es/


Maybe I misunderstand, but my understanding is that at least some languages can be given proofs of correctness for some pre- and post-conditions. See e.g. https://arxiv.org/pdf/2303.05491.pdf

Conceptually, you tell the computer how to validate that the conditions you state are correct with your proof, and it checks that each step follows.
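As a very loose illustration of pre- and post-conditions in plain Python (runtime assertions rather than machine-checked proofs; a proof assistant like Coq would let you discharge these statically), something like:

  def is_sorted(xs):
      return all(a <= b for a, b in zip(xs, xs[1:]))

  def my_sort(xs):
      # precondition: the input is a list
      assert isinstance(xs, list)
      result = sorted(xs)
      # postconditions: same length as the input, in ascending order
      assert len(result) == len(xs)
      assert is_sorted(result)
      return result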


To me, it seems like there is a finite set of pre- and post-conditions that will ever be needed for 99% of future programming, and eventually everything will be a catalog mapping those conditions to working code.

One just needs a language to write those conditions in.


I'm not sure how to phrase this, but it seems like there's some trade-off between how easy it is to express a condition and how easy it is to verify that it holds in an imperative language.

On the one hand you have declarative languages, and on the other, imperative languages.

Not sure where I'm going with this, but I think there's something there.


Reflexion can serve as a good reference here:

https://arxiv.org/abs/2303.11366

Essentially, the AI's output is fed into a checker, whose output is fed back into the AI for "reflexion". The AI then often corrects itself (leading to a noticeable improvement in GPT-4 performance).
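A minimal sketch of that loop, with hypothetical generate/check/revise helpers standing in for the LLM calls and the external checker:

  def reflexion_loop(task, max_rounds=3):
      attempt = generate(task)              # LLM's first attempt (hypothetical helper)
      for _ in range(max_rounds):
          ok, feedback = check(attempt)     # e.g. run tests or a verifier (hypothetical)
          if ok:
              break
          # feed the checker's feedback back into the model and try again
          attempt = revise(task, attempt, feedback)
      return attempt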


Very cool project. How reliable are you finding your prompts? They look like good choices based on my experience prompting GPT-3.5 and 4 for code editing.

FYI, I think my open source tool aider would work out of the box to serve this use case. You would just run:

  aider file.py --msg "implement the comments"
Of course aider works with any popular language, not just Python. And it can do a lot of other coding tasks. It's like pair programming with an AI.

https://github.com/paul-gauthier/aider


I absolutely love aider and tell everyone I run into about it. Keep up the great work!

Question: can aider work with Ooba/llama.cpp/Meta Code Llama on a local LLM? If not yet, are you planning on it?

So many users just can't use GPT-4/Copilot because of corporate policy. But they have Macs with M2 chips.


Aider provides experimental support for LLMs other than OpenAI's GPT-3.5 and GPT-4. The support is currently only experimental for two reasons:

1. GPT-3.5 is just barely capable of editing code to provide aider's interactive "pair programming" style workflow. None of the other models seem to be as capable as GPT-3.5 yet.

2. Just "hooking up" aider to a new model by connecting to its API is almost certainly not enough to get it working in a useful way. Getting aider working well with GPT-3.5 and GPT-4 was a significant undertaking, involving specific code editing prompts and backends for each model and extensive benchmarking [0]. Officially supporting each new LLM will probably require a similar effort to tailor the prompts and editing backends.

Numerous users have experimented with a wide range of models. None of these experiments have yet identified other models that look capable of working well with aider. Claude has been the most promising so far, and the new Code Llama looks very interesting at first glance.

Once we see signs that a particular model is capable of code editing, it would be reasonable for aider to attempt to officially support such a model. Until then, aider will simply maintain experimental support for using alternative models.

There is more information on connecting aider to other models, local models, and Azure models in the FAQ [1]. There are also ongoing discussions about LLM integrations in the aider Discord [2]:

[0] https://aider.chat/docs/benchmarks.html

[1] https://aider.chat/docs/faq.html#can-i-use-aider-with-other-...

[2] https://discord.com/channels/1131200896827654144/11330607806...


Paul's work in this problem space is worth following, or at least reading his thoughts and iterative engineering results, such as this comparison of the GPT models and the new functions API:

https://aider.chat/docs/benchmarks.html


Isn't this 'just' how Copilot works, except with comments? What's the advantage over Copilot?


Copilot doesn't continuously update your code when you make changes.


You get to delay code review.


Copilot currently keeps only the file you are editing in context. Cross-file support is coming, but isn't here yet (https://githubnext.com/projects/copilot-view/). It would be very, very useful.

One concept that's also been noodling around in my brain is to construct a DFA (deterministic finite automaton) from the code seen in all the files, then offer the n-1 tokens to the language model and constrain the nth token's selection to the valid ones. I recall someone did this for things that produce fairly small DFAs (like JSON), and it essentially produced 100% valid JSON without hallucinations (it could still be garbage JSON).

So, for example, if I had a `class ABC`, then typing `abc.` could produce: 1. only the methods that are valid on it, and 2. arguments drawn from the surrounding code as informed by the LLM.
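A rough sketch of that constrained-decoding idea, assuming hypothetical lm_scores (the model's per-token scores) and valid_next_tokens (tokens the DFA accepts from its current state):

  def constrained_next_token(prefix_tokens, dfa_state):
      scores = lm_scores(prefix_tokens)         # model's score for every vocabulary token
      allowed = valid_next_tokens(dfa_state)    # tokens the DFA will accept next
      # mask out everything the DFA rejects, then take the best remaining token
      return max(allowed, key=lambda tok: scores[tok])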


It's like in-painting, but for code :)


I love it.

I'd like to make something more constrained. Instead of a fully-general programming language, let the LLM configure data-flows between pre-defined modules, field mappings, or presentations.

Then, hopefully, we could let the end-user more directly edit the prompt.


Would Python decorators be better for something like this?

I always get squeamish when I see magic comments


This is leaving a comment for another programmer, not the compiler or interpreter. It's what comments are for, actually, like writing a TODO.

You should be squeamish about running the code without reading it first, given that you're pair-programming with a bot.


> You should be squeamish about running the code without reading it first, given that you're pair-programming with a bot.

It's funny, the first version of this project[0] let you do exactly that, e.g.,

  def main(path: str):
      #<<filenames = a list of filenames under path>>
  
      for fn in filenames:
          #<<size = size of fn in bytes>>
  
          print(fn, size)
  
  #<<use argparse and call main>>
and then run that program like any other Python script (using import magic):

  $ python -m examples.file_sizes /etc
  …
  /etc/wgetrc 4942
  /etc/nsswitch.conf 542
  /etc/adduser.conf 3028
  /etc/ethertypes 1816
But yeah, it never felt exactly practical. :)

[0] https://github.com/bsilverthorn/maccarone/tree/v0.1.3


Decorators are a runtime construct, while this is a code-writing-time feature. Wouldn't it be confusing to have Maccarone find decorators at write time that then did nothing at runtime?


There are examples of decorators having no runtime effects, such as typing.overload. But comments are probably more flexible here, since they allow arbitrary blocks, not just function/class scopes.


How would not having the source code present/pre-generated, and thus needing to generate it at runtime, be an example of a decorator having no runtime effects?


I had the urge to try this out a while back, here's what I came up with: https://gist.github.com/nkrumm/2b154ea2041511233079222373c83...

The decorator invokes AI completion only the first time the function is run.
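A minimal sketch of that pattern (not the gist verbatim; llm_complete is a hypothetical stand-in for the actual completion call):

  import functools
  import inspect

  def ai_implemented(func):
      """Generate the function body with an LLM the first time it is called."""
      impl = None

      @functools.wraps(func)
      def wrapper(*args, **kwargs):
          nonlocal impl
          if impl is None:
              prompt = "Implement this Python function:\n" + inspect.getsource(func)
              source = llm_complete(prompt)   # hypothetical LLM call
              namespace = {}
              exec(source, namespace)         # define the generated function
              impl = namespace[func.__name__]
          return impl(*args, **kwargs)

      return wrapper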

edit: I lost interest before I was able to get arguments to work ¯\_(ツ)_/¯


Yeah, that's cool and very possible, but it's a runtime effect.


OP has to be doing some parsing somewhere, so you just switch to seeking decorators rather than magic comments. It's still before the code reaches the interpreter.

The potential issue I see here is that comments are valid anywhere, while decorators may not be, but hopefully the parser is resilient to that. You could see a multi-phase LLM that uses the interpreter to ensure the code runs and works as expected.


I'm not objecting to what you say. The latter part of my comment is simply considering cases where you only want to replace part of a function (the author has some examples on the project page), which a decorator wouldn't be flexible enough to support. Of course, you could argue those cases can easily be rewritten to be function-scoped, but that is a design tradeoff the author did not make (or didn't want to).


It literally just adds the code to the file between the comments. There’s nothing more magic than copy+paste and no runtime component.

It’s completely different from a decorator that generates code at runtime? As in, when the code runs?


Which comments should be generated?

What if the output has comments?

In my question, there is no need for the decorator to be handled at runtime: the tool that does the LLM work has to parse the Python code anyway to know what to generate. It can just key off decorators rather than comments, or at least that is my question and hypothesis.


Here is a framework which uses decorators to delegate runtime behavior to an LLM. Not quite what you meant but the closest I’ve seen.

https://askmarvin.ai/components/ai_function/


I guess comments provide a simple paradigm for many languages


Comments exist in most languages, so I can see that angle, but you still have to be able to parse all supported languages, which is no small feat.

You could alternatively split generated code from human-written code into separate files and keep the mapping in something more structured, like a config file.

I just usually see a better way to do the same thing a magic comment does, generally speaking. There is typically a better language construct if you limit yourself to that language (most common), and config files offer much more structure with existing tooling (mostly decoding in your preferred language).


Nice, it's like cog (https://pypi.org/project/cog/), but automatic.

It could replace template rendering in the long run.
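For anyone who hasn't used cog: it keeps generator Python inside comments and splices that code's output into the file when you run the tool, so generation is explicit rather than automatic. Roughly, from memory (the two assignment lines are what cog writes between the markers):

  # [[[cog
  # import cog
  # for name in ("alpha", "beta"):
  #     cog.outl(f"{name.upper()} = '{name}'")
  # ]]]
  ALPHA = 'alpha'
  BETA = 'beta'
  # [[[end]]]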


Assuming you're using source control properly and read the diff before running it, I guess this is one way to make sure that a comment matches the code? If the bot changes it, maybe your comment wasn't clear enough?


How hard would it be to use Code Llama instead? https://ai.meta.com/blog/code-llama-large-language-model-cod...


Not sure I see the benefit over standard AI integration in an editor. What am I missing?


I think they're going for some sort of high-level semantic description that guides your workflow, instead of using AI at each individual step.

I'd much rather focus on teaching people how to think about tests and what they want to do and, when they're stuck on syntax or patterns, make sure they're thinking enough about the problem so that they know why they take the decisions they do. They can then use a search engine or AI to cover a specific technical gap.

Implementing or writing code is rarely the bottleneck for software development after enough experience, so something like this doesn't seem useful to me, but it's definitely cool to see people trying to integrate new technologies in different ways.


I tried implementing something like this over the summer but couldn't make progress with the 20-30 second minimum response time for each OpenAI-generated block. From the demo video it looks like this runs pretty fast -- or does it?


The name is foretelling of the end result: a flying spaghetti monster.


Looks amazing. Would you ever consider using Claude as well?

I prefer to use Claude for code generation when using a newer framework or language (the 2021 cutoff with GPT-4 is unfortunate).


I'm usually just copy-pasting my entire file into ChatGPT and asking it for help; it rewrites the whole damn thing, so there's no need for managed sections.


Seems likely that dev work moves towards this sort of thing - boilerplate being AI-managed.

It'll be hell to debug an ever-shifting codebase, though.


This is really cool.

Practically, how often does this lead to new errors from the AI-managed code blocks when you update code elsewhere?



I think they're spiritually related, but my understanding is that Marvin actually invokes the LLM at runtime to execute an AI Function (which is why it can perform, e.g., sentiment analysis). Maccarone invokes the LLM to generate code during development.


What happens if you start editing the code in a block?


This answer in the FAQ is wonderful:

    What prevents my program from behaving differently after each preprocessing run?

    - The strength of your faith in GPT-4.


Seems like we should bring back our old friend: the cache-invalidation key!
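Sketching what that could look like here (purely hypothetical: hash each block's instruction plus the surrounding code, and only call the LLM again when the key changes):

  import hashlib

  def block_cache_key(instruction: str, context: str) -> str:
      """Identify an AI-managed block by its instruction and surrounding code."""
      return hashlib.sha256((instruction + "\n" + context).encode()).hexdigest()

  def fill_block(instruction, context, cache):
      key = block_cache_key(instruction, context)
      if key not in cache:
          cache[key] = generate_block(instruction, context)   # hypothetical LLM call
      return cache[key]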


"Hallucination isn't a real problem, people will always scrutinize the generated code!"

Sigh...


Ha, I just want to clarify that the quoted text isn't actually from the README or anything like that; I'm not quite that crazy.

But no real argument with the concern. An LLM will generate bugs, and that may be a reason this kind of thing never makes sense in practice (isn't that an argument against copilot, too, though?).


As far as I understand, this is worse, because it'll regenerate everything in the file every time, right?

But yeah, I'm personally not comfortable with Copilot or its ilk, either.


Thorough code review is why we as an industry stopped shipping bugs.


We, as an industry, didn't stop shipping bugs. (Small example: https://github.com/CVEProject/cvelistV5/releases)

And that thorough code review prevents bugs is, at best, a debatable assertion. See e.g. https://www.microsoft.com/en-us/research/publication/code-re...

It finds _some_ bugs. CI/CD and a massive investment in automated testing have probably had the largest impact in moving software quality forward. (See e.g. "Accelerate", Forsgren, Humble & Kim.)

Code review is an excellent tool to socialize knowledge and train up more junior engineers, but in terms of preventing bugs, it's low-value.


The parent comment is sarcastic


Maybe. I can't read the OP's mind, and it's a common enough trope throughout the industry that I figured adding some evidence could be useful.


Sorry, I thought that was too obvious to warrant an /s, but I suppose not.


I'm fairly certain we ship far more bugs now than we ever did.

Before we had the ability to just add a patch and let the user download it, the end result needed to be very solid, because once that disk was purchased and taken home, it was static.

Now less attention is paid to these things, because it's just assumed to be tomorrow's problem.


Where's this from? What are you quoting?


Where is the above quote from? If I search for it online I can only find your comment here.


Paraphrasing the people defending Copilot and ChatGPT when they came out.


It would be nice to make it clear that you're not actually quoting someone when you use quotation marks to paraphrase them. My concern is that it can come across as a bit of straw-manning otherwise.


Fair, though it's too late to edit now.


There are so many languages with awesome type systems which can help guide AI to generate better code — and yet, these experiments always choose Python.


Python appears to be the best represented language in OpenAI's training set.



