A Large Scale Study of Programming Languages and Code Quality in GitHub [pdf] (ucdavis.edu)
132 points by oskarth on Nov 4, 2014 | 40 comments



The claim:

“Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages.”

But how did they determine that?

The authors looked at the 50 most starred repos on github, for each of the 20 most popular languages plus typescript (minus CSS, shell, and vim). For each of these projects, they looked at the languages used (e.g., projects that aren’t primarily javascript often have some javascript).

They then looked at commit/PR logs to figure out how many bugs there were for each language used. As far as I can tell, open issues with no associated fix don’t count towards the bug count. Only commits that are detected by their keyword search technique were counted.
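
For concreteness, the counting step presumably boils down to something like this. A minimal sketch; the keyword list approximates the one described in the paper, not their exact implementation:

    import re

    # Approximate keyword list; the paper's exact search terms and matching
    # rules may differ.
    BUG_KEYWORDS = re.compile(
        r"\b(error|bug|fix(e[sd])?|issue|mistake|incorrect|fault|defect|flaw)\b",
        re.IGNORECASE)

    def is_defect_fix(commit_message):
        # A commit counts as a defect fix if its message matches any keyword.
        return bool(BUG_KEYWORDS.search(commit_message))

    def count_defect_fixes(commit_messages):
        # Total number of defect-fixing commits in one project's log.
        return sum(is_defect_fix(m) for m in commit_messages)

Anything that happens to match a keyword counts, whether or not it fixes a real defect, which is part of why these counts are hard to compare across communities.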

After determining the number of bugs, the authors ran a regression, controlling for project age, number of developers, number of commits, and lines of code.
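
Roughly, and as a sketch only (I’m assuming a negative-binomial-style count regression and inventing the column names; the paper’s exact model specification may differ):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def fit_defect_model(df: pd.DataFrame):
        # df: one row per (project, language) with hypothetical columns
        # defect_fixes, age, n_devs, n_commits, loc, language.
        return smf.glm(
            "defect_fixes ~ np.log(age) + np.log(n_devs)"
            " + np.log(n_commits) + np.log(loc) + C(language)",
            data=df,
            family=sm.families.NegativeBinomial(),
        ).fit()

The per-language coefficients from something like this are presumably what feed the table discussed below.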

That gives them a table (covered in RQ1) that correlates language to defect rate. There are a number of logical leaps here that I’m somewhat skeptical of. I might believe them if the results were plausible, but a number of the results in their table are odd.

The table “shows” that Perl and Ruby are as reliable as each other and significantly more reliable than Erlang and Java (which are also equally reliable), which are significantly more reliable than Python, PHP, and C (which are similarly reliable), and that typescript is the safest language surveyed.

They then aggregate all of that data to get to their conclusion.

I find the data pretty interesting. There are lots of curious questions here, like why are there more defects in Erlang and Java than Perl and Ruby? The interpretation they seem to come to from their abstract and conclusion is that this intermediate data says something about the languages themselves and their properties. It strikes me as more likely that this data says something about community norms (or that it's just noise), but they don’t really dig into that.

For example, if you applied this methodology to the hardware companies I’m familiar with, you’d find that Verilog is basically the worst language ever (perhaps true, but not for this reason). I remember hitting bug (and fix) #10k on a project. Was that because we had sloppy coders or a terrible language that caused a ton of bugs? No, we were just obsessive about finding bugs and documenting every fix. We had more verification people than designers (and unlike at a lot of software companies, test and verification folks are first-class citizens), and the machines in our server farm spent the majority of their time generating and running tests (1000 machines at a 100-person company). You’ll find a lot of bugs if you run test software that’s more sophisticated than Quickcheck on 1000 machines for years on end.

If I had to guess, I would bet that Erlang is “more defect prone” than Perl and Ruby not because the language is defect prone, but because the culture is prone to finding defects. That’s something that would be super interesting to try to tease out of the data, but I don't think that can be done just from github data.


You've done a good job of pointing out weaknesses in the paper, but I have a question about your defect rate argument: Would you hold the same position if the study had said the opposite? In other words, if it claimed that Ruby and Perl had more bug fixes (and thus defects) than Erlang and Java, would you claim the study was flawed due to a culture of meticulous bug-finding in Perl and Ruby?

From Conservation of Expected Evidence[1]:

If you try to weaken the counterevidence of a possible "abnormal" observation, you can only do it by weakening the support of a "normal" observation, to a precisely equal and opposite degree.
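
In symbols (the general principle, nothing specific to this study): for a hypothesis H and a possible observation E,

    P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E)

The prior is a weighted average of the posteriors, so if a high fix count is declared not to be evidence against a language, then a low fix count cannot be evidence for it, to a correspondingly weighted degree.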

It really seems like a stretch to say that higher bug fix counts aren't due to higher defect rates, or that higher defect rates are a sign of a better language. Language communities have a ton of overlap, so it seems unlikely that language-specific cultures can diverge enough to drastically affect their propensity to find bugs.

1. http://lesswrong.com/lw/ii/conservation_of_expected_evidence...


Personally, I have only read the abstract, and was pleased to see it confirmed my own preferences: static typing is better, strong typing is better. But then I saw this disclaimer:

> It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size.

So I dismissed the paper as "not conclusive", at least for the time being. I wouldn't be surprised if their findings were mostly noise, or a confounding factor they missed.

By the way, I recall some other paper saying that code size is the single most significant factor for predicting everything else. That would mean more concise and expressive languages, which yield smaller programs, also reduce the time to completion as well as the bug rate. But if their study corrects for project size while ignoring the problems being solved, then it overlooks one of the most important effects of a programming language on a project: its size.
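
To spell out the worry with a toy model (mine, purely illustrative): suppose defects scale roughly with code size,

    E[\mathrm{defects}] \approx k \cdot \mathrm{LOC}

If a more expressive language solves the same problem in half the lines, its entire advantage arrives through the LOC term, and a regression that conditions on LOC rather than on the problem being solved has controlled that advantage away.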


I'd love to see a methodology similar to the one here analyzing concurrency bugs: http://www.cs.columbia.edu/~junfeng/09fa-e6998/papers/concur... . Akin to what's done in social sciences, they applied simple labels to bugs in the bug repos -- a grad student and some undergrads can label a lot in a couple weeks -- and regress on that.


What's striking is the comment that the "mysql" project has the highest bug fix density of the C programs. That seems unexpected, because MySQL is very heavily used and reasonably stable.

It may be simply that MySQL bugs actually get fixed. The work behind this paper counts bug fixes, not bug reports. Unfixed bugs are not counted.


Having used MySQL in the past and vowed never to use it again, I would not be especially surprised if the MySQL code base was unusually buggy. On the other hand, high use rates could well lead to high bug discovery rates.


Just the initial data set alone is going to be incredibly unrepresentative of software in general.

* Most-starred projects mean these are all successful projects. Much software is not successful.

* A successful project means there are likely more-experienced-than-average engineers coding.

* Most-starred projects will be older, more stable code-bases than most software.

* Open Source development is a small slice of software development.

* Github is a sub-set of Open Source development.

And I expect there are a myriad of holes to poke elsewhere. In general I distrust any research that surveys GitHub and tries to make claims about software development in general. It is lazy.


>> why are there more defects in Erlang and Java than Perl and Ruby?

I have no experience with Erlang, but one reason I'd expect Java to have more defects than Ruby and Perl is that Java is more verbose, i.e. it takes more code to get something done. One would naively expect to find an association between the size of commits and their propensity to contain errors.
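
A toy version of that expectation (illustrative numbers only): if each line independently carries a small defect probability p, a commit of n lines contains at least one defect with probability

    1 - (1 - p)^n

With p = 0.01, that is about 18% for a 20-line commit versus about 33% for a 40-line commit, so more verbose changes for the same task would mechanically generate more defect fixes later.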


The nice thing about the paper is that they try to put data where others put guesses, rants, expectations, and beliefs. Arguing with data is actually difficult (and rare) in the software engineering field.

The practical result is that language choice doesn't matter since the effects are very small.


> There are lots of curious questions here, like why are there more defects in Erlang and Java than Perl and Ruby?

Maybe because Erlang and Java are used for projects of higher complexity (larger scope, more interacting components, etc.)? Did the authors try to address this issue at all?


The standout result for me that led me to believe this was very likely the case was that Erlang scored _horribly_ on concurrency bugs. To me, that makes sense: you're seeing all the bugs that come from trying to tackle tricky high concurrency situations, which is why people picked Erlang in the first place. If people tried to tackle those exact same problems in other languages, we'd probably see them doing worse at concurrency.


Not to mention the Erlang mantra of 'let it fail'. They're tracking bugs, not severity. Someone may file an issue "hey, in this instance, this thing goes wrong", but because of the supervision process, it doesn't actually cause anything to break (just a logged error message and a bit of system churn). The language actively encourages you to code the happy path and address failure conditions only as necessary, relying on the supervision tree to handle errors that stem from deviating off the happy path.


> For those with positive coefficients we can expect that the language is associated with, ceteris paribus, a greater number of defect fixes. These languages include C, C++, JavaScript, Objective-C, Php, and Python. The languages Clojure, Haskell, Ruby, Scala, and TypeScript, all have negative coefficients implying that these languages are less likely than the average to result in defect fixing commits.

Notice how top projects in popular languages do have a tendency to have more fixes than top projects in more obscure languages. Perhaps these projects simply have more users, leading to more reported bugs and community pressure? Interesting paper, but it is all too easy to jump to conclusions.


From the conclusion:

The data indicates functional languages are better than procedural languages; it suggests that strong typing is better than weak typing; that static typing is better than dynamic; and that managed memory usage is better than unmanaged.

Obviously not the last word, but interesting study nonetheless.


The worst outcome was for Procedural-Static-Weak-Unmanaged, aka C and C++. There is no OO category, since most OO languages can be (and are) used in a procedural style.


Because there are no Procedural-Dynamic-Weak-Unmanaged languages :)


Forth! (I haven't read the paper.)


Forth is untyped, not dynamically typed. ;)


The sample size is too low. It also seems not to weigh the age of the project enough. Older projects are usually more prone to defects because of the amount of spec changes they have had to go through.

The results for Typescript are completely wrong. Bitcoin, litecoin, qBittorrent do not have any Typescript code http://bitcoin.stackexchange.com/questions/22311/why-does-gi...

Double Face palm!


Not only are they not TypeScript, but the ones I checked were mostly C++, which is on the other end of the defect-proneness spectrum, according to this study.


There's a similar problem with the Perl projects in the paper.


This seems to be really well done and I'm looking forward to reading it in more detail, but one thing jumped out at me: one of the keywords used to identify fixes for programming errors is "refactoring". Refactoring doesn't necessarily (and actually shouldn't) refer to fixing a defect.

EDIT: The Procedural/Scripting split seems overdetermined. All of the procedural languages have static typing, and all the scripting languages have dynamic typing.


Indeed. Refactoring means restructuring code in a way that doesn't change its external behavior. If we define bug as incorrect external behavior of code, then refactoring excludes bug fixes.

Refactoring doesn't necessarily even mean that the original code was structurally bad. When new features are added, code gets more complicated, requiring more abstraction. I like to abstract code via refactoring when we need it, not before. Then refactoring changes good code to good code. It's just that the new situation presents different demands to the code than the old situation.


The thing is, we already know that some languages are much more error prone than others. But languages are not entirely to blame. Before we can find proper correlations, we should answer the question: how did these errors come to be?

For example, what caused a typo in a string somewhere, so that it now contains "foo" instead of "bar"? Likely cognitive overload, because the code was too complex to keep entirely in memory; i.e., the author was too busy processing the code in his head to notice the typo he just made. Therefore code with lower cognitive load is likely to have fewer bugs like this, or even fewer bugs overall. Some programmers have learned to prevent string typos with appropriate test cases, and some programmers keep their code as simple as possible, i.e. with the lowest cognitive load. So we should see correlations with which languages such programmers prefer and which languages encourage such coding practices.
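
For the "foo"/"bar" case, the kind of test I mean is trivially small (hypothetical names, just to illustrate):

    def greeting(name):
        return "foo, " + name  # typo: should have been "bar, "

    def test_greeting():
        # Pins down the exact string, so the typo fails in CI instead of
        # surviving until a user notices it.
        assert greeting("world") == "bar, world"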

This is a complex subject that has much more to do with psychology than with technology, and it should be studied as such. Trying to study bugs without touching psychology is pretty much bs.


It was interesting to observe that Go/Golang, while supposedly designed for concurrency, ranks the "worst" in terms of concurrency bugs (see Table 8 in the paper). I presume this is due to the absence of any way to specify immutability in the language. Of course, it could mean that because it has concurrency primitives built in, people are willing to use concurrency more. It would be interesting to see what effect a stronger type system would have on such error rates.

I don't get this statement though: The enrichment of race condition errors in Go is likely because the Go is distributed with a race-detection tool that may advantage Go developers in detecting races. I thought that including a race detector would reduce race condition errors. Am I missing something?

It was also interesting to see it rank low in the security correlation ranking too.


They aren't actually analyzing whether the software has race condition errors; they are analyzing whether there are commits to fix race condition errors. If more errors can be detected, more will show up in commit logs and so the language will be deemed error-prone.


Scala, Go and Erlang are all shown to be comparatively high in concurrency errors... so this may only show languages with a concurrency focus find the concurrency errors early and often.

And the authors at least noted that "The enrichment of race condition errors in Go is likely because the Go is distributed with a race-detection tool that may advantage Go developers in detecting races"


Isn't Go also more likely to be used for concurrent tasks and therefore more likely to exhibit concurrency bugs? I admit I skimmed the article and may have missed this being dealt with in the methodology...


I imagine he's saying that they're more likely to be detected and raised as concurrency bugs because of the tool, as opposed to remaining mystery "I swear it works on my local box" bugs like so many race conditions do.


After seeing many nullity-handling and race bugs in Java and PHP code that I wouldn't expect to see in a C project, I've wondered if safety tools don't achieve their full promise because they make it easier to be sloppy in general.

E.g., are some of the gains lost because the developer is asleep at the wheel? Effects like this are established in other domains: with cars and bikes there is some evidence that improved safety equipment increases risky behaviour.


First off, props to this krewe for tackling a large and scary topic, for exceptional clarity, and especially for their humility. At about a 1% spread, the differences among languages are negligible.

Even a sample this size is too small:

A first glance at Figure 1(a) reveals that defect proneness of the languages indeed depends on the domain. For example, in the Middleware domain JavaScript is most defect prone (31.06% defect proneness). This was little surprising to us since JavaScript is typically not used for Middleware domain. On a closer look, we find that JavaScript has only one project, v8 (Google’s JavaScript virtual machine), in Middleware domain that is responsible for all the errors.


1% is not the "spread" (whatever that is), it's the proportion of the variance in bug fixes that is attributable to languages. The other 99% is attributable to the number of commits (i.e. projects with more commits have more bugs; well, gee).

Once you factor out the level of commit activity, language influence is actually quite large.


The statement by the authors that the language effects are "small" is very strange: the only non-language "big" effect they found was that projects with lots of commits have lots of bugs, and that 99% of the bug count variance was due to commit count variance. Well gee.

Once you factor out commit count, the impact of language turns out to be quite large. A Haskell project can expect to see 63% of the bug fixes that a C++ project would see. I don't call a 37% drop in bugs "small".
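
For reference, if the quoted 63% is the exponentiated difference between the two language coefficients in a log-linear count model, the implied coefficient gap is

    \exp(\beta_{\mathrm{Haskell}} - \beta_{\mathrm{C++}}) \approx 0.63
    \quad\Rightarrow\quad
    \beta_{\mathrm{Haskell}} - \beta_{\mathrm{C++}} \approx \ln 0.63 \approx -0.46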


This is dumb. You cannot compare different projects, of different difficulty and of different scope, with one another. You also can't isolate the simple randomness of different programmers doing the work on each of these, which could simply be the cause of the difference. Also, having a lot of bugs reported can actually be a good sign, as opposed to having a lot of bugs go unreported.

Anyhow, I don't think this says anything about anything.


I just love this stuff. I am seriously considering a research MSc on FOSS / methodologies, so this is catnip to me.

However, it screams "well, we did not find anything much" (which is good science anyway, so props to them).

Choice of language contributes just 1% to the variance in bugs; that is, if you want to improve your project's quality, don't bother looking at language choice.

It's an interesting result, but it does rather raise the question: what drives the other 99%?


The other 99% is people. If I write terrible code (which I probably do), I'll write terrible PHP, terrible Perl, and terrible Erlang, and maybe a little bit of terrible C or C++ from time to time, and some terrible Java if I must.


No, the other 99% is project size.


I wonder if a generalized linear mixed model approach to the analysis might be better; I am curious whether there was different between-project variation in error rate within each language. And interpreting the proportion of model deviance explained ("1%") is deceptive: committer number and project size are nuisance variables; we want to know what the size of the differences between languages would be when applied to the same project. We should also be interested in interactions between project size and language. The paper needs summary plots of the raw data.
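
For concreteness, a sketch of the kind of model I have in mind (notation mine), with languages i and projects j:

    \log E[\mathrm{bugs}_{ij}] = \beta_0 + \beta_i + \gamma^{\top} x_{ij} + u_{ij},
    \qquad u_{ij} \sim N(0, \sigma_i^2)

Here x_{ij} collects the nuisance covariates (log commits, log size, and so on), and letting the random-effect variance sigma_i^2 differ by language is what would expose differences in between-project variation within each language.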


For Perl, there are three projects in the paper: gitolite, showdown, rails-dev-box.

gitolite is a perl project.

showdown? (https://github.com/showdownjs/showdown) It has some perl code in it, but it is a js project.

rails-dev-box? It's for Rails dev; do you agree with that classification?

So I will not read the paper.


Developer/team experience is not a confounder? Very believable, nevertheless.



