Claude has been pretty great. I stood up an 'auto-script-writer' recently that iteratively sends a Python script + prompt + test results to either GPT-4 or Claude, takes the output as a script, runs tests on that, and sends those results back for another loop. (It usually took about 10-20 loops to get it right.) After "writing" 5-6 Python scripts this way, it became pretty clear that Claude is far, far better - if only because I often ended up using Claude to clean up GPT-4's attempts. GPT-4 would eventually go off the rails - changing the goal of the script, getting stuck in a local minimum with bad outputs, pruning useful functions - while Claude stayed on track and reliably produced good output. Makes sense that it's more expensive.
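For anyone curious, here is a minimal sketch of that kind of loop. The prompt text, file names, test command, and use of the Anthropic SDK are my assumptions for illustration, not necessarily the setup described above:

    import re
    import subprocess
    import anthropic  # assumed client; the same loop works with any chat API

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PROMPT = "Write a Python script that <task>. Reply with a single ```python code block."

    def extract_script(reply: str) -> str:
        # Pull the first fenced code block out of the model's reply.
        m = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
        return m.group(1) if m else reply

    script, results = "", ""
    for _ in range(20):  # usually converges in 10-20 iterations
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=4096,
            messages=[{"role": "user", "content":
                       f"{PROMPT}\n\nCurrent script:\n{script}\n\nTest results:\n{results}"}],
        )
        script = extract_script(msg.content[0].text)
        with open("candidate.py", "w") as f:
            f.write(script)
        # Run the tests against the candidate and feed the output into the next turn.
        proc = subprocess.run(["pytest", "tests/"], capture_output=True, text=True)
        results = proc.stdout + proc.stderr
        if proc.returncode == 0:
            break  # tests pass; keep candidate.py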
Edit: yes, I was definitely making sure to use gpt-4o
I installed Aider last week - out of the box it does this same prompt-write-run-ingest-errors-restart cycle. Since it works with git, you can also undo code changes if something goes wrong. It's free and open source.
I've found that GPT-4o is better than Sonnet 3.5 at writing in certain languages like Rust, though maybe that's just because I'm better at prompting OpenAI models.
The latest example I ran was a Rust task that went 20 loops without a successful compile on Sonnet 3.5, but compiled and was correct with GPT-4o on the second loop.
Weird. I actually used the same prompt with both and just swapped out the model API. I used Python because GPT-4 seemed to gravitate towards it. I wonder if OpenAI went for newer training data? Maybe Sonnet 3.5 just hasn't seen enough recent Rust code.
Also curious: I run into trouble when the output program is >8000 tokens on Sonnet. Did you ever find a way around that?
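(For context, the limit in question is Sonnet's output cap: at the time, Claude 3.5 Sonnet returned at most 4,096 output tokens by default, with an opt-in beta header raising that to 8,192. A minimal sketch, assuming the Anthropic Python SDK:

    import anthropic

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=8192,  # above the 4096 default, so the beta header below is required
        extra_headers={"anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15"},
        messages=[{"role": "user", "content": "..."}],
    )

That only raises the ceiling, though; programs longer than ~8k tokens still get cut off.)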
I break most tasks down into parts. Aider[1] is essential to my workflow and helps with this as well, and it's a fantastic tool to learn from. In fact, as of v0.52 I've been able to remove some of my custom run-and-test code.
I've also started playing around with adding Nous[2] (Aider is its code-editing agent), but not enough that I'm using it practically yet.
Yeah, I know about the max tokens. As long as the code stays below that limit, I can get Sonnet to emit complete Python scripts, run them directly, and return the results to Sonnet - a great feedback loop. That breaks down when Sonnet can't emit all the code at once, because then you have to figure out how to predictably assemble a larger script from smaller pieces... that's where I couldn't find a decent solution.
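One possible workaround, sketched under assumptions rather than settled here: check the API's stop_reason, and when a reply was truncated at max_tokens, replay the partial answer and ask the model to continue, concatenating the chunks. The stitching is the fragile part, which may be exactly the failure mode described above:

    import anthropic

    client = anthropic.Anthropic()

    def complete_long(prompt: str, model: str = "claude-3-5-sonnet-20240620") -> str:
        # Accumulate output across turns until the model stops on its own.
        messages = [{"role": "user", "content": prompt}]
        parts = []
        while True:
            msg = client.messages.create(model=model, max_tokens=4096, messages=messages)
            text = msg.content[0].text
            parts.append(text)
            if msg.stop_reason != "max_tokens":
                break  # finished naturally, not truncated
            # Replay the partial answer and ask for the remainder.
            messages += [{"role": "assistant", "content": text},
                         {"role": "user", "content": "Continue exactly where you left off."}]
        return "".join(parts)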