I also suspect as much, but obviously can't know for sure. IMHO it's intellectually lazy, if not dishonest, to benchmark against 3.5 without making that fact clear up front.
A better benchmark would have had two entries for ChatGPT, showing both 3.5 and 4 results.