
Here's a list of all the successful and unsuccessful patches: https://gist.github.com/arp242/0dc5dab0f7cd10e663cfc26866651...

Ideally, it should also include the problem statement, but that's not in their JSON file and I can't be arsed to continue working on it; it's just a quick script I cooked up.

I find it very hard to judge the quality of most of these patches because I'm not familiar with these projects.

However, looking at the SWE-bench dataset, I don't think it's representative of real-world issues, so "22% of real-world GitHub issues" is not really accurate regardless.




The problem statement of each issue is included in each result folder as `problem_statement.txt` (such as: https://github.com/nus-apr/auto-code-rover/blob/main/results...).

The developer patch for each issue is similarly included as `developer_patch.diff`.
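
For anyone who wants to read the statements and developer patches side by side, here is a rough sketch of how they could be collected. Only the two file names come from the repository as described above; the function name `load_results`, the `results_root` argument, and the assumption that both files sit in the same folder are mine, so adjust to the actual layout.

```python
from pathlib import Path


def load_results(results_root):
    """Yield (folder_name, problem_statement, developer_patch) tuples.

    Assumption: each result folder holds both `problem_statement.txt`
    and `developer_patch.diff`. Folders missing the patch are skipped.
    """
    # rglob is used so the sketch does not depend on how deeply the
    # result folders are nested under the root.
    for statement in sorted(Path(results_root).rglob("problem_statement.txt")):
        folder = statement.parent
        patch = folder / "developer_patch.diff"
        if not patch.is_file():
            continue
        yield (
            folder.name,
            statement.read_text(encoding="utf-8"),
            patch.read_text(encoding="utf-8"),
        )
```

Calling `list(load_results("results"))` on a local checkout would give one tuple per issue, which is enough to dump everything into a single file for reading.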


What makes you say it's not representative?


SWE-bench Lite is a subset of extremely simple issues from a cherry-picked subset (SWE-bench) of a handful of large (presumably well-run) Python-only projects.

Here are some rules they used to trim down the SWE-bench Lite problems:

* We remove instances with images, external hyperlinks, references to specific commit shas and references to other pull requests or issues.

* We remove instances that have fewer than 40 words in the problem statement.

* We remove instances that edit more than 1 file.

* We remove instances where the gold patch has more than 3 edit hunks (see patch).

See https://www.swebench.com/lite.html
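
To make it concrete how much these rules cut away, here is a minimal sketch of them as a filter. This is not the SWE-bench authors' code: the `Instance` class, the `keep` function, and all of the regexes are my own rough approximations (for example, a "commit sha" is guessed as a 7 to 40 character hex string containing both a digit and a letter), and the gold patch is assumed to be a git-style unified diff.

```python
import re
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical container; field names are assumptions for illustration.
    problem_statement: str
    patch: str  # gold patch as a git-style unified diff


# Crude heuristics, not the official detection logic.
URL_RE = re.compile(r"https?://\S+")
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img\b", re.IGNORECASE)
SHA_RE = re.compile(r"\b(?=[0-9a-f]*\d)(?=[0-9a-f]*[a-f])[0-9a-f]{7,40}\b")
ISSUE_REF_RE = re.compile(r"(?<!\w)#\d+\b")


def files_touched(patch):
    """Count files in the diff by counting 'diff --git' header lines."""
    return sum(1 for line in patch.splitlines() if line.startswith("diff --git "))


def hunk_count(patch):
    """Count edit hunks by counting '@@ ' hunk header lines."""
    return sum(1 for line in patch.splitlines() if line.startswith("@@ "))


def keep(inst):
    """Return True if the instance survives all four rules quoted above."""
    text = inst.problem_statement
    # Rule 1: no images, hyperlinks, commit SHAs, or PR/issue references.
    if any(r.search(text) for r in (URL_RE, IMAGE_RE, SHA_RE, ISSUE_REF_RE)):
        return False
    # Rule 2: problem statement must have at least 40 words.
    if len(text.split()) < 40:
        return False
    # Rule 3: the patch may edit at most one file.
    if files_touched(inst.patch) > 1:
        return False
    # Rule 4: the patch may have at most three hunks.
    if hunk_count(inst.patch) > 3:
        return False
    return True
```

Something like `[i for i in instances if keep(i)]` would then give the trimmed set. Any issue that links to a related report, mentions another PR, or needs a change in two files is gone, which is the point I'm making.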


That's... rather limiting.


Look at the data. Does that seem like the average bug report to you?


It would help if you were to provide a specific example or two.


You can't demonstrate whether a dataset is representative or not by "an example or two". You need to look at all the data.

And all of this is fine. It's just a benchmark suite and doesn't need to be fully representative. The dataset itself doesn't even claim to be that, as far as I can find. All I'm saying is that the title wasn't really accurate.



