How to generate tested software packages using LLMs, a sandbox and a while loop (github.com/modal-labs)
133 points by pierremenard on Aug 7, 2023 | 47 comments



This should give a second life to Test-Driven Development.

One of the under-appreciated wisdoms of TDD is that there is a complexity asymmetry in many problems between finding a solution and (fully or partially) verifying it. Examples of asymmetric problems: inverting matrices, sorting an array, computing a function's gradient, compressing a byte stream, etc.

The human writes the easier part -- the test suite -- and the language model writes the harder part -- the solution. This can be a net gain in productivity.
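For example (a minimal sketch, assuming a hypothetical `llmSort` produced by the model), verifying a sort takes a linear scan plus an element count, even though writing a good sort is harder:

  test('llmSort returns a sorted permutation of its input', () => {
    const input = [5, 3, 8, 3, -1];
    const output = llmSort(input);

    // Non-decreasing order: a single linear scan checks the "hard" part.
    for (let i = 1; i < output.length; i++) {
      expect(output[i - 1]).toBeLessThanOrEqual(output[i]);
    }

    // Same elements with the same counts: the output is a permutation of the input.
    const counts = (arr) => arr.reduce((m, x) => m.set(x, (m.get(x) || 0) + 1), new Map());
    expect(counts(output)).toEqual(counts(input));
  });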


The problem is that without a formal definition of the program's semantics, you run the risk of overfitting to the tests, or of leaving behaviors uncovered that, for a human developer who understands the intent of the program, would be handled implicitly.

And given how hard formal verification is, I don't know that you'll ever get away with not having to manually check these programs, at which point I question just how much productivity you've gained.

It's kinda like self-driving cars: when they work, they work great. But when they fail, they fail in ways a human never would, and therefore a human struggles to anticipate or trust their behaviour.

That said, I'm waiting to see the rise of programming languages designed with LLMs in mind, where a human could use contract oriented programming or similar (think: Ada) combined with TDD methods to more formally specify the problem that an LLM is being asked to solve.
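For a rough idea of what that could mean today, here's a hypothetical sketch in JS (not an actual LLM-oriented language): the human writes the contract and the tests, and the model only fills in the body.

  // Hypothetical sketch of contract-style checks, in the spirit of Ada's Pre/Post aspects.
  function withContract({ pre, post }, fn) {
    return (...args) => {
      if (!pre(...args)) throw new Error('precondition violated');
      const result = fn(...args);
      if (!post(result, ...args)) throw new Error('postcondition violated');
      return result;
    };
  }

  // Human-written contract; the LLM only supplies the wrapped implementation.
  const safeDivide = withContract(
    { pre: (a, b) => b !== 0, post: (q, a, b) => Math.abs(q * b - a) < 1e-9 },
    (a, b) => a / b  // <- model-generated body would go here
  );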


> I question just how much productivity you've gained.

Me too. It's an empirical question to be answered by those who will dare to try.

> It's kinda like self-driving cars

Strong disagree. Yes, neural nets are black boxes, but the generated code can be idiomatic, modular, easy to inspect with a debugger, etc.

> more formally specify the problem that an LLM is being asked to solve.

That would be a great direction to explore.


> Strong disagree. Yes, neural nets are black boxes, but the generated code can be idiomatic, modular, easy to inspect with a debugger, etc.

I think you missed my point.

If I'm inspecting code from another human, I'm going to make assumptions about the kinds of errors they're gonna make. There's probably obvious dumb stuff I won't look for because a human would never typically make certain classes of mistake. They're the self-driving car equivalent of driving into the back of a stopped semi truck because it was mistaken for a billboard, an error no human of sound mind and body would make.

So if I'm inspecting code written by a computer, I'll either 1) make those same assumptions and run the risk of missing unexpected problems in the code, or 2) be overly cautious (because I don't trust the machine) and examine the code with a fine-tooth comb, which will take a great deal more time.


Based on my experience with Autopilot and Copilot, I think this is way less of a problem in code.

You can put code mistakes on a gradient, from subtle to obvious. Obvious bugs are like when the LLM finds a pattern and repeats it for 100 lines. Subtle mistakes are like misnaming a variable so you use one left over from earlier, not the correct one.

Obvious mistakes are easy to catch because they're obvious, and the LLM makes more of those. But, I think because of the way LLMs work, I have never seen Copilot make a subtle mistake that I wouldn't expect of a person. People are so good at making surprising bugs that it's really hard for Copilot to beat us.


Finally, something will kill the last somewhat satisfying part of the job, for the greater profit of the corporate overlords!


>The human writes the easier part -- the test suite -- and the language model writes the harder part -- the solution. This can be a net gain in productivity.

I've been doing just this with Llama 2 running locally lately, and I'm convinced that anyone who isn't will be a dinosaur in a couple of years.

Fact is, these things are damn good (at JS at least) when dealing with pure unit-testable functions. To the point that I can accept that practically any solution it comes up with will be better than mine. Everything is still verified of course, but the days of writing algorithms from scratch are basically over for me.


How do you avoid overfitting and how do you test performance?


It would be quite hard to overfit with property based tests.
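A minimal sketch of the idea, using fast-check (one JS property-based testing library); `llmCompress`/`llmDecompress` stand in for whatever the model produced:

  const fc = require('fast-check');

  test('decompress inverts compress for arbitrary strings', () => {
    fc.assert(
      fc.property(fc.string(), (s) => {
        // A hundred random inputs per run makes it much harder to overfit to specific cases.
        expect(llmDecompress(llmCompress(s))).toBe(s);
      })
    );
  });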


What plugins or other setup is needed to get Llama 2 integrated with your IDE?


There's a good one called "Continue" for VS Code. There are also a million Copilot chat clones if you search the VS Code extensions; probably some of those do it. The VS Code extension "Cody" uses Anthropic.


Any tips on how to use it locally this way?


Yeah I'd be interested in watching a demo video, or something like that.


How readable is the code? Does it follow SOLID design and the likes?


>How readable is the code? Does it follow SOLID design and the likes?

More readable and better commented than my own. The thing is, this isn't generating any kind of "compiler output" type code that you're used to. It writes exactly as a person would. Here's an example I just ran on llama2-7b-chat on my Macbook:

##PROMPT

Llama: My name is Llama. I will write a Javascript function that implements the "emailTest" function, given the following test case that it is expected to pass. I am sure to always fully explain all of my code, and include comments to provide context.

  describe('it takes a string as input, validates that it is an email address, and returns the host name of the email address.', () => {
    const testVal = emailTest('test@google.com');
    expect(testVal).toBe('google.com')
  });
##END PROMPT

##RESPONSE

Llama: Here are the test cases:

  test('it takes a string as input, validates that it is an email address, and returns the host name of the email address.', () => {
      const testVal = emailTest('test@google.com');
      expect(testVal).toBe('google.com')
  })

  test('it should validate email addresses with different domains', () => {
      const testVal1 = emailTest('hello@example.com');
      const testVal2 = emailTest('hello@example.net');
      expect(testVal1).toBe('example.com')
      expect(testVal2).toBe('example.net')
  })

  test('it should validate email addresses with special characters', () => {
      const testVal3 = emailTest('hello@$%^&*()_!');
      expect(testVal3).toBe('example.com')
  })

  test('it should validate email addresses with typos', () => {
      const testVal4 = emailTest('hellow@example.com');
      expect(testVal4).toBe('example.com')
  });
Here is the function that I have written to implement the emailTest function:

  function emailTest(input) {
    const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
  
    if (emailRegex.test(input)) {
      // If the input is a valid email address, return the host name of the email address
      const hostName = input.split('@')[1];
      return hostName;
    } else {
      // Otherwise, return an error message indicating that the input is not a valid email address
      return 'Invalid email address';
    }
  }
Can you please provide feedback on my implementation? Specifically, I would like to know if there are any improvements I could make to the function's performance or accuracy.

##END RESPONSE

And from there you can have a conversation with it, and ask it to fix anything that's mistaken. You can also have it extend your test suite with fuzzing and negative cases. This is obviously a trivial example, but you get the idea. I haven't been able to stump it yet.
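For instance, a negative case for the emailTest above might look like this (my own sketch, not model output; expected values follow from the regex in the generated function):

  test('it should reject strings that are not email addresses', () => {
    expect(emailTest('not-an-email')).toBe('Invalid email address');
    expect(emailTest('missing@tld')).toBe('Invalid email address');
  });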


    const testVal3 = emailTest('hello@$%^&*()_!');
    expect(testVal3).toBe('example.com')
At the risk of terribly embarrassing myself, shouldn't the expected result be "invalid email address" in this case?


> At the risk of terribly embarrassing myself, shouldn't the expected result be "invalid email address" in this case?

Indeed. This particular test case got jumbled up, but you should always be verifying the output anyways. You can ask it to correct itself when this happens, i.e.:

  User: That's wrong. The test would return 'Invalid email address' here, not 'example.com'

  Llama: Ah, I see! Thank you for correcting me! In that case, here's the modified test case:

  test('it should validate email addresses with special characters', () => {
    const testVal3 = emailTest('hello@$%^&*()_!');
    expect(testVal3).toBe('Invalid email address');
  })

  Llama: Now, when the `emailTest` function is called with the input 'hello@$%^&\*()_!', it will return 'Invalid email address' instead of 'example.com', because the
  input does not match the regular expression pattern defined in the function.


There are other problems in there in addition to that one, but it isn't clear to me that the GP is intended to be an example of complete, good output.


Is writing System prompts in first person better than writing them in third person?


> Is writing System prompts in first person better than writing them in third person?

That's what I've found. Llama doesn't really seem to have a concept of "you" or "they". But if you make it assume that it is something specific, and that it is doing something, it will continue doing that thing.


What is the most difficult thing you’ve gotten it to do?


>What is the most difficult thing you’ve gotten it to do?

It's particularly good at recursive tasks I don't want to expend the mental effort on. E.g.:

  Llama: I am Llama, a programming assistant. Please provide me with a task.

  User: Thanks. Here is a data structure. Write a Javascript algorithm which will recursively parse the structure to collect all 'id' properties into a single array.
  Be sure to check the input and not cause any errors which could result in an infinite loop.

  ```
  const data = {
      'id': 'test1',
      'nodes': [{
          'data': {},
          'list': [{
              'id': 'test2',
              'values': [{
                  'id': 'test3',
                  'name': 'val'
              }]
          }]
      }]
  }
  ```

  Llama: Sure thing! Here is the algorithm in Javascript:

  ```
  function parseId(data) {
    const idArray = [];

    // Recursively traverse the data structure
    function traverse(data) {
      Object.keys(data).forEach((key) => {
        if (key === 'id') {
          idArray.push(data[key]);
        } else if (typeof data[key] === 'object') {
          traverse(data[key]);
        }
      });
    }

    // Start the recursion
    traverse(data);

    return idArray;
  }

  parseId(data) // Output: ["test1","test2","test3"]
  ```


I think what people aren't getting yet is that this is generalizable beyond software. We can do this with LLMs and anything that can be designed.

https://michaelfeathers.silvrback.com/prompt-hoisting-for-gp...


And with code that already exists, we can have the AI help us write the tests for TDD. Many large companies have huge amounts of code that is essential to their operation, but changing it is difficult, and maintaining it or replacing it with a new system is difficult or even impossible, since it's very hard to even know what it does.


Using AI to automate TDD is a great idea, but unfortunately AIs can get stuck, and we are far from solving this problem. There are many pieces of information that need to be manipulated exactly, while LLMs are all soft and fuzzy.


the LLM should write the test suite too, from the description of the app


Program synthesis by a language model is certainly cool, but I don't think it's really novel. After all, the mutator could just as well be a random AST transformer.

So what's the real deal here? Does the mutator end up with a usable program much quicker? Or does it cheat by looking at the test cases?

As an example, if I specify an md5 hash function on strings, what will that tool produce?
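For what it's worth, one way to specify that would be the RFC 1321 test vectors (hypothetical `md5` function name); a handful of vectors like these effectively rules out anything but a real implementation, so the interesting question is how many loop iterations the mutator needs to land on one:

  describe('md5', () => {
    test('matches the RFC 1321 test suite', () => {
      expect(md5('')).toBe('d41d8cd98f00b204e9800998ecf8427e');
      expect(md5('a')).toBe('0cc175b9c0f1b6a831c399e269772661');
      expect(md5('abc')).toBe('900150983cd24fb0d6963f7d28e17f72');
    });
  });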


Ah, so we're turning "a million monkeys at a million keyboards could produce the complete works of Shakespeare" into actual, real thing? What a truly disruptive synergy.


No - we are the monkeys. This is something different.


This definitely seems like a potentially powerful approach. That said, maybe I'm missing something obvious, but is the LLM generating both the tests and the implementation? If that's the case, then it seems like there could be issues caused by the generated tests not matching what's specified in the initial prompt. Manually writing highly focused unit tests doesn't seem like the best way to work with this, but being able to manually write some sort of high-level, machine-checkable specs might be useful.


This is exactly what happened when I tried this approach personally: on a non-trivial codebase, the tests were just as likely to be wrong as the code itself, and during "debugging" the two would converge until the tests passed but the results were useless.

I still have my prompt from before and a Modal account, so I'm going to give this one a shot to see if I missed something or if it really only works in the Mickey Mouse case.


When given code that mixes languages, LLMs will typically carry on as if it worked anyway. I have a theory that this is because correct examples are the majority on the web. Since LLMs don't have the data from the feedback all the humans saw while constructing these examples, there is no way they could understand. We need to feed LLMs these types of input/error loops explicitly to improve performance or reduce iterations in this area. Otherwise they'll be stuck making many small mistakes forever. Again, this is my theory.


Immediately after seeing ChatGPT, I felt that evolving programs in an OODA loop was going to be a big deal, so it's great to see someone iterating on the concept.

I hope they try it with QuickCheck / property-based testing.


No need to fix things, just make it search for leaks or null pointer references and explain them to the user, and you have a million dollar product.


This takes programming by coincidence to a whole new level.


How can this possibly generate maintainable code?

And if the tests change, does it have to fully rewrite the code via AI?

If you had code that was reasonably working before, but you change the tests and the AI does the "full rewrite" for you, how do you have confidence that the code works? Go ahead, say "it passed the test suite".

One of the things that happens in software dev is the devs grokking the overall desire/aims of the software and its users, and often anticipating things that aren't even in the requirements.

The test suites are approximations of validating the requirements / derived intent.

But what faith do you have in the AI black box, or the black boxes it generates?


Very cool. I played around with something similar a few months ago (without the sandbox). The tricky part is finding a diff format that the AI can use intuitively to patch code in place, as otherwise you can blow out the token limits quite fast.

I wonder to what extent AI is going to kill comprehensibility of the infrastructure. Over the weekend I did a bit of hacking with a friend. He wants to move out of AWS to Hetzner to save money, but wasn't experienced with Linux sysadmin so I was teaching him some stuff, and simultaneously pitching a little prototype I made of a server config tool. It's sort of like a higher level Ansible: you specify what packages and containers you want, the URLs they should be served on, and it goes ahead and configures unattended upgrades, Docker and Traefik to deliver that outcome. The idea was to extend it to configuring backups with BorgBackup and other common sysadmin-y tasks.

He was enthused! The next day though, things changed. He'd signed up for GPT-4 and it'd just blasted out a pile of Ansible configs for him. He doesn't know Ansible but it didn't matter, he just iterated with the AI a few times and now the problem is solved.

This makes me wonder if there's much point anymore in improving the usability of systems software (programming languages, databases, operating systems, clouds, anything driven by textual configuration). A basic assumption that underlies making better tools is that user's time is valuable, that intuitive and simple systems therefore have value. But the AI's time isn't valuable. It can read some poorly written docs, then the source code of the system, then people getting frustrated on StackOverflow, synthesize all that together and spit out whatever pile of inscrutable config files are needed and it can do so within seconds.

Given this experience, if we extrapolate it forwards, then maybe within a couple of decades many organizations will be running on infrastructure that nobody understands at all. You get cases today where maybe individual programs or workflows are only understood by one guy, but those are extremes and it's understood that it's a bad situation. Maybe in future it'll be normal and to do anything companies will have to ask an AI to do it for them. We already see skills atrophy around things like UNIX sysadmin because nowadays everyone learns AWS instead, which is one reason they can charge so much money for it, so I think as people retire knowledge of how to run Linux systems will steadily disappear. But conceivably people will stop learning the cloud too, and then to do anything with servers it'll be AI or the highway.

You can also apply this to programming languages. Why engage in a high effort project like inventing a Java or Rust if an AI can spit out correct C++ for you, given enough tries at writing tests? Does the motivation to produce anything beyond incremental improvements to existing tools disappear now?

I keep flipping back and forth on the answer to this. On one hand, it seems pretty pointless to further develop this prototype now. Maybe Ansible or AI-written shell scripts are the end of the road for making Linux easier to use. On the other hand, AI is remarkably human-like. It can also make mistakes, and also benefits from well-written docs, simple interfaces and good error messages. So maybe usability still has a purpose, albeit maybe we now need to master a new field of "usability studies for AI".


> maybe within a couple of decades many organizations will be running on infrastructure that nobody understands at all

We're already there. Not a single person alive today can fully comprehend any production deployment, from the atoms in the server room all the way up to the atoms projecting rendered browser output to an Amazon shopper's eyeballs. And even this seems like an arbitrary place to draw the limits of our infrastructure. But never mind the blank spots in physics and psychology; even if we stick just to the server processes from conventional OS and chips, hardware and firmware, data and databases, there is nobody alive for whom all of this is a white box.


No single person, sure, but at every level there are people who do understand what's going on at that level. Abstractions help but behind the abstraction there are people who know how it works.

What I'm getting at is, what if we get to the point where at some levels of the stack there are literally no people who understand that level? If your company's prod infrastructure is miles and miles of AI written and maintained shell scripts, or config files for tools that haven't been worked on for decades because AI-generated boilerplate is cheaper than adding new features? To the extent that even if you hire the right people they can't reverse engineer the stuff and even if they could the cost of making changes to it without AI is prohibitive, because it's all so low level? It's a weird thing to contemplate but seems like one of the more plausible scenarios in which society becomes as dependent on AI as we are today on electricity.


Ultimately, the goal of all the systems is not really within the systems, but the applications.

And one of the reasons why we have such an excess of configuration - which includes general-purpose programmability itself - is because we're trying to late bind the entire application.

So the question of debugging is more one of "yes, we're going to replace a panoply of hand-written abstractions with an LLM-generated solution" than it is about the LLM working in an unknowable black-box paradigm. What you're getting is a way to retarget the domain and precision of the solution:

1. Act as an oracle that answers questions directly. This works fine if your goal is to restate the form of a simple dataset. But the black box effect becomes too large really quickly.

2. Assisted code generation. This can write algorithmic boilerplate for you. But it's still relying on the framing of an existing protocol - "in language X, do Y". We got past our amazement at the oracle but are still mostly looking at this step, where it still looks like a human organization is needed to write a software stack.

3. Assisted data model and protocol design. LLMs are protocol geniuses - if you tell them about a new language and its rules, and give them some examples and logical feedback, they will happily go up and down through the abstraction stack as necessary to restate things in that language. This lets you bound the kinds of solutions that the LLM could generate later by the logic of the defined interface. This also aids the human by enabling all solutions to be defined legibly.

Right now we tend to avoid original protocol design because it's human-expensive and so we end up with inertia towards standardized designs, but the LLM does not care. You could tell it to regenerate all the necessary dependencies for your application, and if the system fails "somewhere in the middle" you don't have to debug what it did in more than a cursory sense - you just regenerate the whole system with added precision and legibility until you've drilled down to the specifics. It can even generate documentation!

That is, in the future, protocol-level friction can be trivially overcome, which means the "early binding effect" of hardware and software platforms will evaporate. You just have the data. The data could be stored in an illegible form - but that's a general human-language issue, and not really about the incidental complexities of human-written software.


Thank you for the very thoughtful response. I'll be pondering your points for the rest of the day!


> We already see skills atrophy around things like UNIX

Don't worry about Unix skills, worry about your AI skills. The first is only getting more automated and the second is growing and expanding fast.

> running on infrastructure that nobody understands

AI is good at writing docs, summarising, monitoring, rewriting - there is no reason the code base should be opaque. Also, we will be much better at picking up new skills quickly - force of habit, we will be better at grokking things with AI.


UNIX isn't getting automated (or at least not well). I've been using it for decades and the usability has hardly improved. UNIX is actually being replaced by cloud vendors, as the AWS/Azure/GCP/OCI API isn't UNIX. That's kinda the point - I was thinking of automating it a bit more, but it turns out my friend doesn't need better automation because why "automate" something in the conventional sense when you can just throw AI bodies at doing all the work manually.

The risk here is not that AI can't explain what a particular script does. The risk is that we stop creating simplifying abstractions. Let me try to explain what I'm getting at here because it's a new thought and I'm not doing it well.

Humans have limited and roughly equal mental and physical capacity. Beyond a certain point you have to add more people to get more done. In turn those people need abstractions because they can't grok the whole system (AI assistance or not), and so to organize into teams you need people to abstract away essential complexity. We build on top of a tower of abstractions as a consequence, and to a large degree progress comes via better and higher level abstractions becoming available.

What if this progress stops because AI doesn't need it? The scenario here is that without people really noticing at first the flow of improvements to our tools dry up. PEPs, JEPs, RFCs, new libraries etc stop appearing. Maybe even bugs stop getting fixed if there are workarounds available. After all, why invest time and effort into improving the syntax of Python or JavaScript if the primary user would only be AIs? They don't care, they'll just write things out the inconvenient way without complaining. Why try to make UNIX or the cloud easier to use, if all your configs are written by AI? Why introduce new abstractions at all?

Assume that tech continues to move forward and people's demands continue to get more complex. But now, instead of inventing new tools, we just pay a bit more for AI to brute force through the limitations of old tools. Why would an AI write and maintain a library on GitHub with a nice API when it can just knock off the code it happens to need right there as part of its own task? So code gets more and more verbose over time, until one day the number of missing abstractions becomes so large that the code is incomprehensible. And asking AI to explain won't help either because at best you'd get a book sized explanation. There are no teams, no projects, no abstractions. It's all just an endless jungle of boilerplate and copy/paste coding.


This is an interesting thought and I have to agree it's a legitimate issue. I have also considered something similar - that in the near future, if all software is designed with AI, or "generated" rather than explicitly engineered, we would certainly reap some benefits from that, but we would also cease understanding how anything actually works. Obviously no one person fully understands the abstractions we have currently engineered when considering the totality... but at least currently we have them.


> This makes me wonder if there's much point anymore in improving the usability of systems software (programming languages, databases, operating systems, clouds, anything driven by textual configuration). A basic assumption that underlies making better tools is that user's time is valuable, that intuitive and simple systems therefore have value.

Yes, but the nature of the problem changes a little. Because LLMs are not guaranteed to produce correct results, this might incentivize configuration and programming languages where you can provide a separate specification, possibly generated by another LLM, against which the program or configuration is checked. This then potentially saves the user a lot of iteration to correct simple mistakes.

Another aspect is that simpler systems/languages are also simpler for LLMs to understand, so an LLM generating a program in a simpler, more expressive programming language has a higher chance of being correct.


> Given this experience, if we extrapolate it forwards, then maybe within a couple of decades many organizations will be running on infrastructure that nobody understands at all.

100%. I fully expect to see the fickle AI spirits of the Warhammer 40K universe retconned into the grandchildren of LLM-driven program synthesis.


Modal-labs seems like an interesting cloud compute solution that I hadn't heard of before. Could anyone summarize what its value props are, e.g. is it like GCP but better because …



