Ideally, it should also include the problem statement, but that's not in their JSON file and I can't be arsed to continue working on it – it's just a quick script I cooked up.
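If anyone wants to pick it up, here's a minimal sketch of how the problem statements could be joined back in. It assumes the Hugging Face copy of SWE-bench Lite (`princeton-nlp/SWE-bench_Lite`) exposes `instance_id` and `problem_statement` fields, and that the results JSON is keyed by instance id – that last part is a guess about the file layout, not something I've checked:

```python
# Sketch only: join a results JSON with problem statements from SWE-bench Lite.
# Assumes results.json maps instance_id -> result dict (hypothetical layout).
import json
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
# Lookup table from instance id to its problem statement.
statements = {row["instance_id"]: row["problem_statement"] for row in lite}

with open("results.json") as f:
    results = json.load(f)

for instance_id, result in results.items():
    # Attach the problem statement where we have one for this instance.
    result["problem_statement"] = statements.get(instance_id, "")

with open("results_with_statements.json", "w") as f:
    json.dump(results, f, indent=2)
```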
I find it very hard to judge the quality of most of these patches because I'm not familiar with these projects.
However, looking at the SWE-bench dataset, I don't think it's representative of real-world issues, so "22% of real-world GitHub issues" is not really accurate regardless.
SWE-bench Lite is a subset of extremely simple issues from a cherry-picked subset (SWE-bench) of a handful of large (presumably well-run) Python-only projects.
Here are some of the rules they used to trim SWE-bench down to the Lite set (see the sketch after the list for roughly what these amount to):
* We remove instances with images, external hyperlinks, references to specific commit shas and references to other pull requests or issues.
* We remove instances that have fewer than 40 words in the problem statement.
* We remove instances that edit more than 1 file.
* We remove instances where the gold patch has more than 3 edit hunks (see patch).
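To make those criteria concrete, here's a rough approximation of what filters like these look like when run over the full SWE-bench test split. The field names (`problem_statement`, `patch`) are assumed from the Hugging Face dataset, and the regexes only loosely approximate the actual rules (e.g. image detection is skipped), so treat it as illustrative rather than a reproduction of their filtering code:

```python
# Rough sketch of Lite-style filters over the full SWE-bench test split.
import re
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

def passes_lite_style_filters(row):
    text = row["problem_statement"]
    patch = row["patch"]  # gold patch as a unified diff
    # Hyperlinks, issue/PR references, or full commit shas in the statement.
    if re.search(r"https?://|#\d+|\b[0-9a-f]{40}\b", text):
        return False
    # Fewer than 40 words in the problem statement.
    if len(text.split()) < 40:
        return False
    # Gold patch touches more than one file.
    if patch.count("diff --git") > 1:
        return False
    # Gold patch has more than 3 edit hunks.
    if len(re.findall(r"^@@", patch, flags=re.M)) > 3:
        return False
    return True

kept = [row for row in swebench if passes_lite_style_filters(row)]
print(f"{len(kept)} of {len(swebench)} instances survive these filters")
```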
You can't demonstrate whether a dataset is representative or not by "an example or two". You need to look at all the data.
And all of this is fine. It's just a benchmark suite and doesn't need to be fully representative. The dataset itself doesn't even claim to be, as far as I can find. All I'm saying is that the title wasn't really accurate.