
After reading, I don't think a difference of less than 5 percentage points adds much to the discussion here unless it's pointed out explicitly; people regularly assert much wilder claims.



I haven't come across any other systematic, quantitative benchmarking of the OpenAI models' performance over time, so I thought I would share my results. I think my results suggest there has been some degradation, but nowhere near the amount you often hear in people's anecdata.

Unfortunately, you have to read fairly far into the doc and understand a lot of details about the benchmark. Here's a direct link to and excerpt of the relevant portion:

https://aider.chat/docs/benchmarks.html#the-0613-models-seem...

The benchmark results have me fairly convinced that the new gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k-0613 models are a bit worse at code editing than the older gpt-3.5-turbo-0301 model.

This is visible in the “first attempt” portion of each result, before GPT gets a second chance to edit the code. Look at the horizontal white line in the middle of the first three blue bars. Performance with the “whole” edit format was 46% for the February model and only 39% for the June models.

But also note how much the solid green “diff” bars degrade between the February and June GPT-3.5 models. They drop from 30% down to about 19%.

I saw other signs of this degraded performance in earlier versions of the benchmark as well.
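
If you want to run a rough comparison like this yourself, here's a minimal sketch; it is not aider's actual benchmark harness. It pins each model snapshot, gives it one attempt per exercise, and reports the first-attempt pass rate. The `exercises` list and `run_tests` function are hypothetical stand-ins for a real task set and test runner.

    # Minimal sketch (not aider's harness): compare first-attempt pass rates
    # of two pinned model snapshots on the same set of coding exercises.
    # `exercises` and `run_tests` are hypothetical stand-ins.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def first_attempt_pass_rate(model, exercises, run_tests):
        """Ask `model` to solve each exercise once; return the fraction that pass."""
        passed = 0
        for ex in exercises:
            resp = client.chat.completions.create(
                model=model,          # pinned snapshot, e.g. "gpt-3.5-turbo-0613"
                temperature=0,        # reduce run-to-run variance
                messages=[
                    {"role": "system", "content": "Return only the completed code."},
                    {"role": "user", "content": ex["prompt"]},
                ],
            )
            code = resp.choices[0].message.content
            if run_tests(code):
                passed += 1
        return passed / len(exercises)

    # Compare the two snapshots discussed above on identical tasks
    # (note: these older snapshots have since been retired by OpenAI):
    # for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    #     print(model, first_attempt_pass_rate(model, exercises, run_tests))

The key point is holding everything else constant (same prompts, same tests, temperature 0) so any difference in pass rate is attributable to the model snapshot.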



