Even Tetris is hard to test

thu · on Oct 26, 2014

The interesting bit that isn't mentioned is that all the given examples are actually part of the game specification, i.e. those behaviours are there on purpose. It means that if you have an accurate list of the desired features, you could probably also achieve 100% coverage. It is also possible that testing some features would not increase the code coverage.

"Hard to test" in the submission title didn't mean what I thought: I though it meant it was hard to write tests for Tetris, not that it was hard to recover a complete specification of the game while playing it.

Joeri · on Oct 26, 2014

The corrolary is that you must have a specification in order to test comprehensively. Either you use some form of TDD and your tests are your specification (and any behavior not under test is undefined), or you have exact and complete specifications that the as-built software can be compared to, and any hehavior not in the specficiation is undefined.

fidotron · on Oct 26, 2014

This is amusingly naive. I've seen the spec for Tetris, and it's surprisingly big - much larger than the smallest known fully conforming code, which is still surprisingly large. There are also a non tiny number of people knocking around for whom it's their entire job to test it. "Even" Tetris? No, sorry - that's nonsense.

The big problem here is that code coverage tests don't help you cover what you should have explicitly defined or tested but didn't. As a result a lot of things end up still defined by implementation and not specification, as all sorts of important details only got defined during implementation.

mgraczyk · on Oct 26, 2014

Nitpick: As an avid Tetris player, I have to say that the examples given for "extremely rare" events actually happen quite often during normal play. In particular, clearing 4 lines at once, two times in row will generally happen within the first few minutes of game play and is sort of the whole goal (if you're trying to maximize points). Wall kicks are less common but certainly not rare enough that their processing code would be left uncovered after playing a few games.

clebio · on Oct 26, 2014

Are there more rare scenarios that you can think of?

shalmanese · on Oct 26, 2014

There's a lot of special conditions around having almost the entire playing field filled in.

nanomage · on Oct 26, 2014

* The rotation interactions agains the wall. * The rotation interactions when there are hanging blocks. * Sliding. * Slide rotations.

any others?

userbinator · on Oct 26, 2014

One way to reduce the number of possible paths is to reduce the number of distinct cases, often via refactoring or a simpler yet more general algorithm; to use an extremely contrived example, it's like the difference between

    int addOne(int x) {
      if(x == 0)
        return 1;
      else if(x == 1)
        return 2;
      ...
    }

and

    int addOne(int x) {
      return x + 1;
    }

I always keep in mind the famous Dijkstra quotes about testing and program complexity:

"Program testing can be used to show the presence of bugs, but never to show their absence!"

"Simplicity is prerequisite for reliability."

Too · on Oct 26, 2014

This is a perfect example of why line/function coverage is a silly measurement. It doesn't take into account global state and function input. Take the second function above, you could get 100% line coverage in just one run. But the function does exactly the same thing as the one where you only got 50% coverage, already things smell fishy. You can also test the function with millions of different inputs, all of them giving the same 100% coverage, and it will work perfectly fine, until suddenly you try addOne(int.max) and it will fail.

What you want to be really sure is state coverage, or input range coverage assuming your functions are pure. Now testing every function from int.min-int.max might seem unrealistic but what you have to do then is constrain the possible range of input or divide into ranges with special cases that you can somehow group together. Say for example int.min, negative numbers, zero, positive numbers and int.max.

Also, just because you covered a line doesn't mean it's correct, the only thing you've really tested is that the program doesn't crash. For the test to be really useful you also need a correct result of the output, added by a human. You can't just randomize input to increase the coverage.

asveikau · on Oct 26, 2014

I came to this thread hoping somebody would say pretty much exactly your first paragraph. The author appears to be selling a code coverage tool so of course he falls into the trap. In real life you have to remember that hitting a line once is not the same as showing it to be correct. People who buy into code coverage tools make this mistake a lot.

aikah · on Oct 26, 2014

Sorry but your exemple makes very little sense in the real world. Because what you wrote is obvious.

Imagine now you have a switch with 50 different conditions that have nothing to do with each other and cant be reduced to a simple arithmetic operation.You'd have to test all the paths if you want a high test coverage rate.

All you can do is abstract decision making through FP or OOP(chain of responsability).

BrandonM · on Oct 26, 2014

I disagree that FP and OOP are the only abstraction methods. Arithmetic can be an excellent abstraction method.

For me, it's a code smell to have a function called "updateScore4". It's often possible to come up with an algebraic statement that gives the same result as a bunch of logic (code paths).

I think userbinator's point is that better code is easier to test as a result of having fewer code paths. Of course there are some gotchas to be aware of (e.g., overflow), but overall I agree.

nawitus · on Oct 26, 2014

On the other hand, you can "hide" the different code paths in a single algebraic statement, but even if your tests always execute that statement, you've lost information that you've actually tested the both "paths" of the algebraic statement.

I guess it's easier to use a boolean logic example (in JavaScript):

return myNormalObject || myDefaultObject();

Your code might always execute the myNormalObject half and return it, and even though your code coverage is 100% of the lines, your tests might miss myDefaultObject() code path.

Perhaps some code coverage tools can take this into consideration, though, but then you're back to the original problem..

BrandonM · on Oct 26, 2014

That's not arithmetic, though. That's logic represented as a one-liner... Not the sort of change I'm talking about.

judk · on Oct 27, 2014

Most real world problems need extra logic attached to each case of the conceptually simple arithmetic.

All of coloring is just arithmetic, after all, but is rather complex

Dewie · on Oct 26, 2014

> I disagree that FP and OOP are the only abstraction methods. Arithmetic can be an excellent abstraction method.

Arithmetic in programming is typically about functions taking in numbers and returning new numbers, ie. returning a new number instead of mutating one of the numbers in-place. That sounds like the spirit of FP, to me. (Though of course arithmetic on fixed-size numbers falls short of this ideal when it comes to overflow and such, and we often don't handle this possibility.)

I don't doubt that there are other abstractions than FP and OOP. But arithmetic looks like FP, to me.

BrandonM · on Oct 27, 2014

> I don't doubt that there are other abstractions than FP and OOP. But arithmetic looks like FP, to me.

When aikah said "FP", I took that to mean something like using a higher-level "updateScore" function that accepts a "calculateScore" function as an argument. I certainly may have been mistaken, though.

singingfish · on Oct 26, 2014

Bah one line of BBC basic. One line of incomprehensible BBC basic more like. When I was at school I implemented a game in three lines of spectrum basic. 5 if you wanted scoring. And it was readable. And it was the most popular computer game in the school by virtue of it being quicker to type than loading a tape, and being moderately entertaining. Just bound the cursor keys to the values in the line() function. The game crashed if the line went out of bounds :).

zach · on Oct 26, 2014

This is a great example of the value of test plans. This is basically technology to reconstruct missing test plans via the code. But of course someone already knew about the subtle "wall kick" feature since she or he wrote that code. It shouldn't be this hard, with some effective communication.

And actually, especially in games, test plans are still poorly communicated. In the old days, it was awful -- you would have the publisher doing all the QA, and barely speaking to the development team apart from bug reports. QA still often doesn't get involved until the last half of the project, before which time nobody has been thinking much about testing.

As studios improve their production, this is getting better. As a programmer, I've had more collaboration with QA as I work, at its best including having our QA liaison talk out a test plan with me while I'm working the feature. With enough communication, hopefully this kind of detective work to figure out what to test can be avoided.

nanomage · on Oct 26, 2014

Agreed. Dev's see the specs, and QA sees the holes. I get to participate from the initial planning into release. As a QA person I feel very lucky.

In regards to the article, i would wager the better scores will be found by QA than developers :-P

forrestthewoods · on Oct 26, 2014

Interesting. I do wonder what the actual value of 100% coverage. I'm not saying it's not substantial, I'm just curious how many cases that still misses. There are a lot of permutations of data that can be used by code, how many are possible and how many cause issues?

alkonaut · on Oct 26, 2014

I have yet to see a compelling argument that 100% test coverage would be "better" than 50% without knowing what the quality of the tests are In those cases. I'm always suspicious of code bases that claim 100% test coverage because they usually have it because someone has decided it must have, which invariably means it has tests that exist only to execute some corner case lines of code.

To have good test coverage you should both test all possible inputs and have proper asserts for those inputs.

Testing is hard, covering lines of code isn't. To put it another way: in a test the hard bit is the assert, not the call into the tested code. And coverage only reports on the former.

lucaspiller · on Oct 26, 2014

One of the things I've found as a side effect of people using code coverage tools is that instead of testing the behaviour of a method they end up testing the implementation. I think this is because they initially test the behaviour, but then see that one path is missing, so add a test to ensure that path is ran - instead of just testing the behaviour that calls that path and checking the code coverage tool. This ends up causing trouble if you ever want to change the implementation as you end up having to throw away half the tests, which means a lot of effort you spent to get 100% code coverage is now gone.

canadev · on Oct 26, 2014

I've faced this problem in my own tests. I want to achieve total coverage so that I know that I've got all the cases covered, but then I end up testing the implementation rather than the contract. I'm not sure what to do about it.

Someone · on Oct 26, 2014

You cannot fully avoid that, but you can keep things organized. Do it Eiffel-ish.

At the start of your function, check all the prerequisites, e.g:

   if(x<0) throw "x should be non-negative. Got $x"
   if(x>=n)throw "x should be smaller than n. Got $x and $n"

Add tests for edge conditions that do not throw, but, say, increase a global counting edge conditions hit:

  if(x=0) edgeConditionsHit += 1
  if(x=n) edgeConditionsHit += 1

Then, write tests so that you hit all paths in the condition tests.

If that doesn't hit 100% in the rest of the function, the function has code it doesn't need, or your precondition checks aren't complete.

Think about other edge conditions. For example, does your code special-case x=n/2? Add a check on top. And yes, that is implementation-specific, but there is nothing you can do about that.

Of course, you don't need the edge condition and implementation-specific checks in release builds.

With these in hand, you can also split tests into implementation-specific ones and contract-based ones.

aikah · on Oct 26, 2014

You cant have it both ways in my opinion. High test coverage == testing every possible paths == looking at implementation details. If you are testing an algorithm(and games are full of these)you want it to be a 100% accurate therefore you dont have much choice.

alkonaut · on Oct 26, 2014

Tests should be based on the specification. If I want to change some internal implementation detail I should only have to verify that the current tests pass.

If a e.g game contains a sorting in some place in the renderer, I can replace the quicksort with a mergesort as long as the renderer interface is still testing ok. The new sort algorithm may have new special case paths (even number of items vs odd for example) but it's not a concern of the renderer public interface. I may however have introduced a bug with an odd number of items here and the old code was 100% covered and now it isn't. So there is a potential problem and the 99% has actually helped spot it.

If the sorting is a private implementation detail of the renderer then there is no other place to test it than to add a new test to the renderer component only because the sorting algo requires it for a code path. This is BAD.

The proper action here is NOT to add tests to the renderer component to test the sorting code path, but instead to make the sorting visible and testable in isolation via its own public interface.

So one of the positive things about requiring coverage is that if you do it right, it will lead to smaller and more decoupled modules of code.

The bad thing is that if you do it wrong you will have your God classes and a bunch of tests coupled tightly to them.

canadev · on Oct 26, 2014

This is sort of the conclusion I was coming to. I am glad to hear someone else express it :)

GFK_of_xmaspast · on Oct 26, 2014

If you write a test that forces a particular codepath in the original implementation, like "parameter one is an empty string and parameter two is a null pointer, and verify the return value of the function call", then it should still be a perfectly valid test if the implementation changes, it just might not be a very meaningful test.

gamegoblin · on Oct 26, 2014

This is why I like using a randomized framework such as QuickCheck (or its derivatives in non-Haskell languages, such as Java) along with typical unit testing. It's often easier to write than tedious unit tests, and it can catch a lot of funny corner cases you don't even think to test.

Of course, this still isn't a guarantee. But it does make me feel better about my code.

sargun · on Oct 26, 2014

I love QuickCheck. It might be a little hard to do for Tetris.

gamegoblin · on Oct 26, 2014

Definitely true, but with enough abstraction to small, composable, pure functions, you can get pretty far.

dack · on Oct 26, 2014

One thing I like to say when people bring up test coverage is that code being "covered" only says it was run, not that it was correct, so it's only a weak statement of quality. However, code that is not covered is completely unknown, so there is a bigger chance of bugs there. Obviously though, if that code is trivial, it may not be worth the maintenance overhead of another test for just that.

So yeah, when a team claims 100% code coverage, usually that is just a signal that they care about testing and the quality of the code, therefore it tends to be less buggy. Not necessarily because 100% coverage itself made it so.

Really I only use a code coverage tool to check for important places that aren't covered at all, AFTER I have tried to think of the proper behavior/spec of the code from an outside perspective. It's like a secondary check after you think you are already done. That keeps you focused on what correct input and output are, and then patching up the little areas that you missed with a tool.

ygra · on Oct 26, 2014

  »Our customers tend to be makers of aircraft or car
   parts. Both businesses have strict safety standards
   which involve coverage testing, and our tools help you
   produce the relevant reports for certification, like
   DO-178B for the aviation industry.«

I guess in both industries the value of more testing cannot be understated.

marcosdumay · on Oct 26, 2014

I guess you meant overstated...

And yes, it can. Are you trying to state their value compared to what? A good schema for task division, with encapsulation and whatchdogs? Proofs of correctness? Proofs of halting? Testing is much less valuable than any of those.

rrrrob · on Oct 26, 2014

For example, 100% coverage won't help with the classic Windows Tetris bug where the score overflows at 32768.

xxs · on Oct 26, 2014

Yes, overflows and data races are bane for the coverage tests - just a false sense of security.

clebio · on Oct 26, 2014

Nice to see a context and link to code specifications in a high-risk and highly-regulated industry. 'Work slowly and don't break anything' would make a good poster.

(http://en.wikipedia.org/wiki/DO-178B)

ajuc · on Oct 26, 2014

Meanwhile BBC Basic one-line version has 100% code coverage the moment you start it :)

davidrusu · on Oct 26, 2014

well, if your code coverage scope is one line that's true, but it would make more sense to scope on statements.

nathancahill · on Oct 26, 2014

TL;DR: Use a coverage tool for tests.