Here's a list of all the successful and unsuccessful patches: https://gist.githu...

yuntong · 2024-04-09T13:13:08 1712668388

The problem statement of each issue is included in each result folder as `problem_statement.txt` (such as: https://github.com/nus-apr/auto-code-rover/blob/main/results...).

The developer patch for each issue is similarly included as `developer_patch.diff`.

wsdookadr · 2024-04-09T12:23:07 1712665387

What makes you say it's not representative?

skywhopper · 2024-04-09T14:21:04 1712672464

SWE-bench Lite is a subset of extremely simple issues from a cherry-picked subset (SWE-bench) of a handful of large (presumably well-run) Python-only projects.

Here are some rules they used to trim down the SWE-bench Lite problems:

* We remove instances with images, external hyperlinks, references to specific commit shas and references to other pull requests or issues.

* We remove instances that have fewer than 40 words in the problem statement.

* We remove instances that edit more than 1 file.

* We remove instances where the gold patch has more than 3 edit hunks (see patch).

See https://www.swebench.com/lite.html
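
To make the effect of those rules concrete, here is a rough sketch of how such a filter might look. It is not the actual SWE-bench Lite code: the function names, the regexes for links, images, commit shas and issue references, and the assumption that each instance carries its issue text and gold patch as plain strings are all illustrative guesses.

```python
import re

# Heuristic patterns (assumptions, not the benchmark's real filters).
LINK_OR_IMAGE_RE = re.compile(r"https?://|!\[")
# A 7-40 char hex run containing at least one digit and one letter.
SHA_RE = re.compile(r"\b(?=[0-9a-f]*\d)(?=[0-9a-f]*[a-f])[0-9a-f]{7,40}\b")
ISSUE_REF_RE = re.compile(r"#\d+")


def count_files(patch: str) -> int:
    """Count files touched by a unified git diff."""
    return sum(1 for line in patch.splitlines() if line.startswith("diff --git "))


def count_hunks(patch: str) -> int:
    """Count edit hunks (lines that open with a '@@ ' header)."""
    return sum(1 for line in patch.splitlines() if line.startswith("@@ "))


def passes_lite_filters(problem_statement: str, patch: str) -> bool:
    """Return True if an instance survives the four quoted rules."""
    text = problem_statement
    # Rule 1: no images, hyperlinks, commit shas, or PR/issue references.
    if LINK_OR_IMAGE_RE.search(text) or SHA_RE.search(text) or ISSUE_REF_RE.search(text):
        return False
    # Rule 2: at least 40 words in the problem statement.
    if len(text.split()) < 40:
        return False
    # Rule 3: the gold patch edits at most one file.
    if count_files(patch) > 1:
        return False
    # Rule 4: the gold patch has at most 3 hunks.
    if count_hunks(patch) > 3:
        return False
    return True
```

Under these assumptions, any issue that links to a related PR, or whose fix touches two files, is dropped before the benchmark ever sees it.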

kevindamm · 2024-04-09T21:05:51 1712696751

That's... rather limiting.

arp242 · 2024-04-09T12:32:13 1712665933

Look at the data. Does that seem like the average bug report to you?

falcor84 · 2024-04-09T12:35:04 1712666104

It would help if you were to provide a specific example or two.

arp242 · 2024-04-09T14:57:20 1712674640

You can't demonstrate whether a dataset is representative or not by "an example or two". You need to look at all the data.

And all of this is fine. It's just a benchmark suite and doesn't need to be fully representative. The dataset itself doesn't even claim to be, as far as I can find. All I'm saying is that the title wasn't really accurate.