Why code review beats testing: evidence from decades of programming research (inburke.com)
117 points by kevinburke on Oct 4, 2011 | 47 comments



In my previous job at VMware I found code reviews to be very useful. Apart from finding bugs at an early stage, as pointed out by this article:

- Code reviews are an effective means of knowledge transfer among engineers.

- It makes engineers more careful while writing code, since the patch will be read by others; they are less likely to cut corners or put in hacks.

- Since the patch should be understood by others, the author needs to ensure the code and the change/commit description are well documented.

- It lets your co-workers/manager know about the amount of progress made on a task.


Completely agree with the above but would add:

- Provides a mechanism for developers to receive feedback to promote learning and improve code quality.


Another witness of the same thing here. It worked great.

I'd also add something: "knowledge transfer" not only means that more experienced programmers teach techniques to newer ones, but also that we knew the code for all the projects in the team. In case someone leaves or is transferred to another team, it's pretty easy for another programmer to take over.

Edit: it also adds a way to check that people are working, both through the boss and through peer pressure.


I gave a talk a couple of years ago on this topic (I read Code Complete and some of the same papers referenced). My goal was to get people to think more about doing things in addition to developer testing to find problems. Personally, I've found (cheap, informal) usability tests to be pretty amazing at finding fundamental problems with software -- the kind of stuff that doesn't show up in a unit test. After all, 100% test coverage can't tell you if your app sucks.

http://www.infoq.com/presentations/francl-testing-overrated

http://railspikes.com/2008/7/11/testing-is-overrated

http://www.scribd.com/doc/8585284/Testing-is-Overrated-Hando...


Comparing code reviews against software testing is a bit odd to me. They are both necessary in a professional software environment.

Code reviews are great for ensuring a consistent code base that follows a certain coding standard. They are also great for passing on knowledge between developers (avoiding common pitfalls, better ways of doing things, etc.)

Testing is such a big field that it is hard to compare directly with reviewing code. Manual tests help in verifying the user experience and end-to-end scenarios, especially for UI (something that code reviews can't do). Automated functionality tests are great for verifying daily builds and catching regressions (which is also very hard to do in code reviews of fairly complicated systems). Various other types of automated tests are necessary for different considerations: performance tests, security tests, etc.


The first paragraph of the article says exactly that: you need to invest in more than one of testing, reviews and QA, and different techniques may detect different types of errors.


Given how much more popular automated testing and TDD have become in the past 7 years since that book was published (and much longer for some of the sources cited for those data), I'd be very surprised if we haven't gotten much better at that particular form of testing.

Also I'm quite curious how you can actually count the number of bugs found by each of these methods. Test-after unit testing would even be preventative if you wrote tests immediately after writing the code: if doing that made you think about more edge cases and thus fix up the code, then it seems like those bugs would not get counted.

Seems like many of these would have different amounts of preventative effect that would be difficult to measure, especially if you wanted to isolate their effects from each other.


As I understand the paper, the researchers asked programmers to use certain techniques to debug a program with known bugs, then measured the number of bugs they caught against the known total.

In addition, they went to larger companies and checked which bug-detection techniques they were using and how well each one detected bugs.

I have the gated PDF, email me if you want a copy.


To me, unit testing is a way to prevent regressions and try to enforce contracts. Because the same developer usually writes the tests and the code, they have 'the curse of knowledge' and seem to me unlikely to find a lot of bugs this way.

That said, preventing regression is tremendously important. Especially after some turnover on who works on the code.


"It’s well known that it’s more expensive to find bugs early on in program development."

I think that's a little backwards, shouldn't it say less expensive?


Yes it should, nice catch, it's fixed now.


What I like about this is that by writing it the wrong way round you have collected plenty of evidence that your intended meaning is indeed well known!


Number of bugs found seems to me a poor metric. Some bugs are much (much) more important than others, and different bug-finding methods will find different kinds of bugs.


There have been a number of papers showing that total number of bugs is a pretty good heuristic for code quality.

I agree that it's hard to tell which is picking out more important bugs. My guess is that less prominent papers by the same authors get into the topic in more detail.


Be careful about the statistics that are quoted. If they are from Code Complete, it may be that the code being reviewed was C or C++. In my experience formal code inspection does catch a fair number of bugs in C or C++, especially where the developer is not that experienced, but this is much less the case with other languages.


It may be even worse than that... one study was using FORTRAN.


Modeling or prototyping scored high and sounds interesting. For my Android and iPhone apps, I've started to add some unit tests in the hope of getting some benefit months down the road. I'm doing all the development myself and I'd like to reduce the testing effort before each release.

Anyway, has anyone applied "modeling" or "prototyping" to mobile development?


To be perfectly honest I'm just quoting Steve McConnell there and I have no idea what those terms mean. I would appreciate it if someone could provide some insight.


Problem definition is a part of the problem solution, especially in software design. To prototype is to partially solve a problem to find a better definition of it. I guess modelling is simply a synonym or a slightly different approach with the same results.


Doing unit testing to reduce testing effort before releases sounds fishy to me. Unit tests should be fast and you should run them often. I even have them tied to the main target build, so that the compiler immediately tells me when I break something.

You can do paper UI prototypes with good results; they will make you think more about the corner cases, which in turn improves code quality.

When we were making games I used to prototype some of the algorithms in Perl to benefit from the faster development cycle and expressivity (being able to quickly hack in whatever conditions were needed).

And of course prototyping is a perfect way to try the high-risk parts of the system first, so that you don't spend too much effort only to hit a blocker later in the project. But that's off the topic of quality.


The unit tests are run on a regular basis during development, of course. At some point I want to have Jenkins running so I can test after every commit.

I wasn't implying that I would simply run unit tests instead of testing at the end, once I'm code complete. I want to arrive at code complete a little closer to shippable.


Most of this content is already in Code Complete.

There are two anti-bug ideas I strongly agree with: that bugs are found by use of the code, whether through testing or actual use; and that great, unhindered programmers tend to write solid code. It's very logical that bugs will be found by usage - this is basically the test-driven reasoning. As for programmer quality, it is also logical, but it is underemphasized in the article.

I distrust articles that emphasize one-size-fits-all processes for software development. In the end, there's talent and something like culture. Government agencies that can't go out of business follow policy, and provide results just above the legally required standards. Old companies with seniority-based hierarchies feel much the same to the code monkeys working there. A small company with respectful, productive, and creative builders is ripe for a culture of sincerely-desired quality. This is what I mean by the importance of culture over process.


I agree wholeheartedly about the importance of culture over process. I live in a process-heavy environment, but I still think that culture matters more.

Where process definitely helps is in finding those bugs that can't be found "by use of the code." i.e., the bug will be experienced by the user, but it may be invisible to the developer. In just the last year, we've uncovered two nasty race conditions that have been in our codebase for almost a decade, but only just started to show up because some unrelated code changed the timing of certain behaviors. Even having knowledge of the problem (thank heavens for good log files!!), we could not design tests to verify that the bugs were fixed: the windows of opportunity were just too small. Code inspection was the only way to verify that the fix matched the problem.


There's an assumption that there is a formal specification, and that it itself has no bugs in it:

'Giving programmers a specification, measuring how long it took them to write the code, and how many bugs existed in the code base'


I'm on my iPhone and so the material is a little hard to read, but to my eye the chart separates testing into four segments while formal code inspection gets just one. The thesis tends to rely on this graph to make its point. It seems to me this is incorrect; rather, code review beats the best single method of testing. In aggregate, testing is hard to beat.

Like another poster said, this was just linkbait. Would rather see a more reasoned analysis.


The headline is pure link bait. The article concludes that you should do more than 'just testing'.

It does make that one statement about code review being faster at finding bugs than testing. But it discounts TDD (i.e., it looks at programs that already have bugs, and times how long it takes the programmer to find them).

So code reviews are faster at finding bugs than testing, as long as you discount a number of testing techniques.


It should be noted that code reviews, though useful for catching defects, are incredibly expensive.

I think there are cheaper ways -- along the lines of automated testing and design reviews in lieu of code reviews -- to reduce risk and defects, and obtain high quality software, than to spend such massive time/$ on code reviews.


They are, but having users find them is even more so.

We use design reviews, code reviews, unit tests, integration tests and final validation tests. Our bug rate is well below 1/kloc, but bugs still get out the door.

In the end it depends on how you calculate Cost of Quality. In some environments, having a customer experience a bug can have disastrous consequences, in others it's not a big deal at all. We're in the former category :-(


I'm a little amazed by an "X beats Y" post showing up that's thoughtful, well-argued, and acknowledges value in both X and Y. Or one that goes beyond X and Y partisanship to what is the actual goal, here.


I think this chart has cause and effect backwards. Places with programmers capable of doing formal code reviews are less likely to have bugs in the first place.

It's really hard to do a (proper) code review; a programmer who is capable of it is probably a high-caliber programmer (I assume programmers review each other's code).

Also I don't like this sentence: "code reading detected 80 percent more faults per hour than testing".

If you look at the article that sentence links to you will find that the problem is poor technique in creating the tests. You can't do "white box" testing without reading and understanding the code in the first place. And if their "black box" testing didn't find bugs, they didn't write the test properly in the first place.


And if their "black box" testing didn't find bugs, they didn't write the test properly in the first place.

That's practically begging the question: a test isn't a real test unless it catches all possible bugs? Really, any code for which such a magical test can be written is probably so simple that it only exists inside Fizzbuzz interviews.

Years of experience have consistently shown me that 10 minutes of looking over code by another developer will find more bugs than days of testing. A huge number of bugs are obvious to the eye, but have high odds of escaping tests. Doubly so in code where there's no "right" output to test for: as in the case of any non-optimal optimization function.

This doesn't mean tests are bad. Tests are good for the sort of code that's easily testable, and large-scale regression testing is needed for any project, even if unit tests aren't feasible. But testing is no substitute for having people read the bloody code.


>>10 minutes of looking over code by another developer will find more bugs than days of testing

I find this surprising, unless you are talking about new developers. Automated test cases should be more than unit test cases; they should include system-wide test cases as much as possible. For a game, for example, you would write test cases against the physics engine, which includes collision detection and collision response. Of course, you would also have test cases for each of those modules independently.
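Something like this, say (a rough JUnit 4 sketch; CollisionDetector and Circle are made-up stand-ins for whatever the engine actually exposes):

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class CollisionDetectorTest {
      @Test
      public void overlappingCirclesCollide() {
          Circle a = new Circle(0, 0, 5);   // center (0,0), radius 5
          Circle b = new Circle(3, 0, 5);   // centers 3 apart, radii sum to 10
          assertTrue(CollisionDetector.intersects(a, b));
      }

      @Test
      public void distantCirclesDoNotCollide() {
          Circle a = new Circle(0, 0, 1);
          Circle b = new Circle(100, 0, 1); // far apart, radii sum to 2
          assertFalse(CollisionDetector.intersects(a, b));
      }
  }

The same fixtures can then feed a collision-response test one level up.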


in Code:

  static final String HELLO = "helo wolrd";
  ...
  out.println(HELLO);
in test:

  String input = in.readLine(); // yeah massively simplified, I know
  assertEquals(Code.HELLO, input);
pass.

But a developer will (should!) find the error immediately.


That's a bad test though. It's not brittle, but assuming it was written in a TDD manner there should be at least two chances for the developer to catch the bug, if they don't use the constant they are testing in the test itself.

I have no source on this but isn't it bad practice to use constants in your tests? I personally use a new string with the expected value. It forces me to look twice and really consider what I am doing.
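Roughly what I mean, reusing the example above (still just a sketch; Code.HELLO is the constant from the parent comment):

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class HelloTest {
      @Test
      public void printsHelloWorld() {
          // Spell the expected value out instead of reusing Code.HELLO;
          // now the typo in the constant actually fails the test.
          assertEquals("hello world", Code.HELLO);
      }
  }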


It's a comment on hacker news illustrating a point, not a guide of how to write test cases.

The point is that it is possible to have good test coverage and still have bugs that are best found by humans.


> A huge number of bugs are obvious to the eye, but have high odds of escaping tests. Doubly so in code where there's no "right" output to test for: as in the case of any non-optimal optimization function.

I agree in principle, but your example is off. Optimization is testable, and even provable. If your optimizer doesn't rely on ad-hoc techniques, you can also usually give an estimate of how far away your solution is from the true optimum.
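Roughly (a sketch only; Optimizer and Problems are hypothetical stand-ins for the solver under test and its instances):

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class OptimizerTest {
      @Test
      public void matchesKnownOptimumOnSmallInstance() {
          // On an instance small enough to brute-force, the heuristic's
          // result can be checked against the exact optimum.
          double exact = Problems.bruteForceOptimum(Problems.SMALL_INSTANCE);
          double found = Optimizer.solve(Problems.SMALL_INSTANCE);
          assertEquals(exact, found, 1e-9);
      }

      @Test
      public void staysCloseToTheProvableBoundOnLargeInstances() {
          // Where the true optimum is unknown, a relaxation still gives a
          // lower bound the heuristic must respect and should come close to.
          double bound = Problems.relaxationBound(Problems.LARGE_INSTANCE);
          double found = Optimizer.solve(Problems.LARGE_INSTANCE);
          assertTrue(found >= bound);        // can't beat the bound
          assertTrue(found <= bound * 1.10); // and lands within 10% of it
      }
  }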


You are thinking of unit tests; I am thinking of functionality tests.

i.e., you know what the code is supposed to do, then you test that it really does it. Then you think about edge cases and test those.


i.e., you know what the code is supposed to do, then you test that it really does it.

Maybe if your code is "Hello World", but what if your code is a motion search, or a facial recognition algorithm, or a band-pass filter, or any other sort of real-world code whose operation can't be summed up exactly in two lines of code?


You write tests for it. Really, the complexity of the whole doesn't necessitate complexity of the parts. Hell, I'll go as far as to say that a facial recognition algorithm that is a single piece of complex code is also a horribly bad piece of code. And yes, if you are writing that sort of code, it will be hard to test. Because it's bad code.

> or any other sort of real-world code whose operation can't be summed up exactly in two lines of code?

By this, I assume you mean real-world code whose operations consist of 10,000 lines of un-modularized spaghetti code.


>>By this, I assume you mean real-world code whose operations consist of 10,000 lines of un-modularized spaghetti code.

That is the same thing that came to my mind. Indeed, if you write code like that it will be really hard to write automated test cases, debug, maintain, etc. etc.


>>Maybe if your code is "Hello World", but what if your code is a motion search, or a facial recognition algorithm, or a band-pass filter, or any other sort of real-world code whose operation can't be summed up exactly in two lines of code?

Then you break the problem down as much as possible, so that you know exactly what the expected output should be for a given input. You have to keep breaking the problem down until it becomes obvious how to automate the tests. Otherwise it'll be really hard to write good, clean, modular code.
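For example, one small piece of a recognizer might be a feature-distance function whose contract you can pin down exactly (JUnit 4 sketch; FeatureDistance is a hypothetical building block):

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class FeatureDistanceTest {
      @Test
      public void distanceToItselfIsZero() {
          double[] v = {0.2, 0.5, 0.3};
          assertEquals(0.0, FeatureDistance.euclidean(v, v), 1e-12);
      }

      @Test
      public void distanceIsSymmetric() {
          double[] a = {1.0, 0.0};
          double[] b = {0.0, 1.0};
          assertEquals(FeatureDistance.euclidean(a, b),
                       FeatureDistance.euclidean(b, a), 1e-12);
      }

      @Test
      public void matchesKnownValue() {
          double[] a = {0.0, 0.0};
          double[] b = {3.0, 4.0};
          assertEquals(5.0, FeatureDistance.euclidean(a, b), 1e-12); // 3-4-5 triangle
      }
  }

The recognizer as a whole still needs end-to-end checks on sample images, but most of the code ends up being small pieces like this.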


Then you run some samples through it and make sure it works correctly.

As many samples as possible, and since you wrote it you know which samples are hardest for it, and which trigger edge cases.

I'm not saying never review code, I'm just not convinced that a programmer's own testing works so poorly.


Then you run some samples through it and make sure it works correctly.

What if the exact meaning of "correctly" isn't defined?

Here's what happens when you blindly apply this approach to real-world applications: you end up with hardcoded "correct" md5sums that are "whatever the function happened to output last time, and we checked and thought it was right". I've seen this happen, repeatedly, with real-world apps, and it is useless at best and usually harmful.

For a huge span of real-world applications, there is NO RIGHT ANSWER. You can't blindly apply unit-testing to that in the same way you would for a web framework.


So I guess the question is, if the meaning of "correct" isn't defined, how did you arrive at code to do that thing? Are we talking neural networks here or what? Even those get tested, maybe not in an automated way, but you could at least make sure the results are better than your last iteration.

Take the facial recognition example. Give the whole thing a few images that it should fail on, and some it should pass on. As you learn more about the system and where things fail or just don't add up, add tests there. It's likelier those parts of the code are bad too, so refactoring isn't unheard of.

To a degree I completely agree with you that some things don't have an easily known answer in the real world. But adding tests for known things like regressions just seems like it should be a minimum of testing. I can't count the real-world apps I've seen without a single "proof" that they don't regress state as things change. And they do, often and repeatedly. But dealing with external products and teams can be frustrating.


It seems to me that if there's no right answer, then whatever you do you can't possibly be wrong. That's pretty easy code to write (and make really fast at the same time).

I'm guessing there is a right answer, it just might be poorly specified. And automated or unit testing doesn't guarantee that even simple things are correct. It's just a trade-off of how much time you put into the automation versus the time you spend chasing bugs you could have caught with automation (which isn't all of them). I'd think in most real world code (for all the usual real world reasons) that people err on the too-little automated-testing side of things.

What do you see as the problems with checking the md5sums? I'm guessing they're brittle if the answer isn't supposed to be exactly the same, but that reduces down to the same thing as checking a number is 5 when it might be between 4.8 and 5.2. It's not rocket science, and getting it wrong doesn't invalidate all automated testing. At the very least, the first time it throws an error the developer who wrote the bad test is going to learn something new about his system, and you might sensibly use such cheaply written tests to locate errors when running code on a different OS (or version) or after minor cosmetic changes where you would expect the exact same result.
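In JUnit that kind of check is just the delta overload of assertEquals; for instance (Pipeline here is a hypothetical stand-in for whatever produced the "golden" md5 before):

  import static org.junit.Assert.*;
  import org.junit.Test;

  public class ToleranceTest {
      @Test
      public void outputStaysWithinExpectedRange() {
          // Compare the number with a tolerance instead of byte-for-byte:
          // anything in [4.8, 5.2] passes.
          double result = Pipeline.run("sample-input");
          assertEquals(5.0, result, 0.2);
      }
  }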


TDD while pairing is even better.


The headline is misleading. The article confirms conventional wisdom: a good product requires a sieve of testing techniques.



