Love this article. Very interesting, and I love how the results are able to be quantified.
Coming from some projects with a very heavy unit testing culture, I can agree that unit testing doesn't necessarily have the intended effect on "quality", and that it can really shape a codebase.
I feel the root cause is that testing is not the same thing as development. Truly testing things requires a different mentality: it involves understanding how multiple pieces fit together and operate in reality, not in some unit testing framework. A lot of developers look at unit tests the same way they look at development: once they work, they're done. But testing is proving things work in all the right conditions, not just getting them to work once.
I've personally seen very simple code convoluted so that it could be unit tested, and seen how those convolutions can easily add bugs and extra layers of indirection that hurt performance (as well as add unnecessary code, which needs maintenance). Many times these tests could have been written against the existing code, letting it use its real collaborators... but then it wouldn't be "unit testing", and we'd still watch those components fail to work together the first time they actually ran. (For some reason, places that rely heavily on unit testing seem to have fewer overall or integration tests, because they spend so much time and effort on unit testing.)
I'd love to see a study on how many lines of code change because of unit tests vs in non unit tested codebases. Both in terms of how many unit tests need to change and the size of the change itself.
It's crucial to keep in mind that unit tests are only one part of a healthy test diet, even a TDD diet. Especially if you write "true" unit tests that use doubles in place of collaborators.
The plus side of using doubles in your unit tests, though, is that the outcome implied by your last question should likely be different. I've found that changes typically only impact the unit test of the class, or sometimes direct collaborators, and a very small number of integration or acceptance tests. If you follow the open/closed principle, those changes become even fewer.
But everywhere I've worked in the past 10 years has always asked me to unit test everything, and unit testing is specifically called out all the time.
Other types of tests are rarely even discussed, because many developers don't do a lot of testing beyond unit testing. Sometimes there are other organizations that pick up this slack, like a dedicated test organization, but many times there are not.
Perhaps it's just my experience, but I've found unit tests often give a false sense of security when most of the issues I have to fix (which are frankly the hard issues, not simple logic bugs) heavily involve how the code interacts with its environment. I'm talking OS level, user level, privilege escalation (such as UAC in Windows), networking, load, perf and stress.
It's not that I think unit tests aren't good, they are, but I think people focus on them too much, to the detriment of other testing. Integration testing should catch everything the unit tests would catch as well; the only difference is that you have two copies of the test, and the unit test gives you more granularity into what is failing when you look at the test suite results. But as soon as you have a call stack or a logging facility (especially around error conditions), which is best practice, most of this is obvious as soon as you look at the result of the test.
I totally agree that unit tests for bug prevention in new code are overrated. I mainly like them because they 1. Encourage decoupled code and 2. Make future refactoring easier especially in dynamically typed languages.
Not having other tests is a huge mistake. I think I'm spoiled because I've spent my entire career at places that cared deeply about testing and proper TDD.
I think it's a mistake to assume that integration tests will capture all of the bugs unit tests will. Even with a full whitebox approach, the analogy I'd make would be testing from the next room over with a waldo arm that can only move in a preset number of directions.
It's just much harder to put together tests that hit any level of internal granularity than it is by controlling closer to the target. Theoretically it can be done, but that doesn't make it a good idea. Plus you potentially end up with a large body of brittle tests that can be invalidated by any hiccups in the stack.
Integration tests really should concentrate on the behavior of the integration. It's not just a philosophical approach, it's about not hitting the hockey stick of diminishing returns in your testing while trying to maximize value of the results.
I'm curious what language and techniques you mean this applies to.
In my experience, I've seen a lot of code that was written in a way that was not unit-testable, and that, for the same reasons, later turned out to be poorly maintainable.
Some seem to use (their personal idea of) unit testability as a proxy for maintainability or quality. That is often invalid.
I've written some high-quality code and tested it to death¹, only to be told that it probably wasn't unit testable (because dependencies were hard coded instead of injected, IIRC). Making that code unit testable would have complicated it, thus lowering its maintainability.
[1] Not a single bug has ever been found after my tests.
That strict definition of “unit” testing (advocating for isolationism and banning collaboration with dependencies) is too narrow for my taste. DI can make a lot of sense, but I don’t think that means you have to mock everything (inviting a maintenance cost) for the sake of purity.
What makes unit tested code maintainable has nothing to do with how application code is structured and is entirely about the ability to know in an automated fashion whether the previous expectations for how the code should function still hold after a refactoring. Business realities change over time and there are lessons learned that get woven into the code. If we're not careful, refactoring code can lead to losing those lessons. Encoding those lessons as executable tests allows us to better preserve that knowledge as a piece of code evolves.
Where unit tests have an advantage and where you might have been criticized is their ability to test permutations. If you write three tests for one function and three tests for a dependency of that function, you've essentially covered nine cases with six tests. In a complex application with a large dependency graph, that ability to multiply the value of each test can be difficult to overcome when testing at a higher level. Integration tests rarely cover edge cases. Additionally, unit tests are often quite simple to read and have limited setup which means that a programmer coming to the code base without prior knowledge of the code (I always pictured that being me six months in the future so that I'd be as kind as possible :-) can easily see exactly what is expected of the code.
But we shouldn't be dogmatic about the kind of testing we do and lose sight of the goal of that testing in the first place. Any test that furthers a future programmer's ability to rip the application code apart, put it back together and quickly know whether all the previous expectations of the code are still met is a good test. There's no hard and fast rule and judgment and experience are still necessary to know what to do in a specific situation. Too many people focus on the details of automated testing and lose sight of the goal.
I agree with the latter parts of your comment, but not that "What makes unit tested code maintainable has nothing to do with how application code is structured".
My original point was that code that's hard to test is often hard to refactor.
> My original point was that code that's hard to test is often hard to refactor
I agree with that. The poster I was replying to said somewhat the opposite. He said:
> Making that code unit testable would have complicated it, thus lowering its maintainability
My basic point was that less complicated doesn't mean maintainable. There's no perfect design in the face of future changes in requirements. So the most important quality of the code we write today is the ability to refactor it and know that it still satisfies all the initial requirements that haven't changed. Unit tests give you that property, lack of complexity doesn't. Simple code can be better than complex, but it's a secondary concern to a comprehensive test suite. I'll take a full test suite over code quality any day because it allows me to go in and add the quality later without fear of breaking things.
I guess I misspoke when I said "how the application is structured", since you've interpreted it differently than I meant it. It was a reference to the line I quoted above... that complicating code lowers its maintainability. I just meant that complexity and testedness are separate concerns, and that tests, more so than simplicity, make code maintainable. So I think we agree more than we disagree.
What I said hinged on a very narrow definition of unit testability, one that I believe is quite useless. My code was highly testable, and thoroughly tested. Those may not have been unit tests™ (I hard coded my A->B->C dependency chain instead of injecting it), but I don't care: it works, it's simple, and it's easy to read.
> There's no perfect design in the face of future changes in requirements. So the most important quality of the code we write today is the ability to refactor it and[…]
I take issue with the idea of such protean requirements. I'm more the YAGNI type: those changes you're trying to anticipate? They likely won't happen. I'd rather keep the code simple, and add the flexibility only when needed. Simple code is easier to modify anyway.
(Of course, I'd rather not ossify an architecture if I can avoid it, but an ossified architecture is often the sign of tight couplings, which to me are a form of complexity. If we keep the interfaces between modules small and simple, we are more likely to end up with something flexible.)
> I hard coded my A->B->C dependency chain instead of injecting it
Read my above point about the Cartesian product of tests... if you're doing this, you're almost certainly not getting particularly good coverage of edge cases, even if your test coverage percentage is high. If I have tests A1, A2, A3 that use a mocked B, and B1, B2, B3 that use a mocked C, and C1, C2, C3, I've essentially tested 27 different cases that you'd have to write in your hard-coded dependency style, while writing only 9 tests. When an application reaches even a moderate amount of complexity and that dependency graph grows beyond 3, that ability for unit tests to cover a geometrically-increasing number of ways the software behaves is going to be hard to match with integration tests like you seem to have been writing. But, again, maybe your dependency graph wasn't big enough to need this. Maybe there was some other mitigating factor that made your situation different. I don't want to speculate on any individual situation since I haven't seen the code and there's no universally correct answer.
> I'm more the YAGNI type: those changes you're trying to anticipate? They likely won't happen.
The one virtually certain thing about the future is that there will be change. Of course you mostly can't predict what will change, but you can be damn near certain that something will. YAGNI is not about predicting that nothing will ever change, it's only about avoiding guessing at what will. Tests on current functionality are not guessing about what will change. The whole point of a test suite is that it avoids ossification. What can be less rigid than something that can easily be pulled apart and put back together in some other form while being sure that it still works? Simple and flexible without that property of safe reorganization is still a landmine waiting to go off in the face of future change.
OK, let's get practical. A happened to be a reactor base class inspired by http://ithare.com ; B was a MWSR concurrent queue (scary mutexes and all) ; C was a ring buffer on which the queue was built. The goal of the whole thing was to pass plain data objects between reactors. The one "requirement" that ended up changing was that plain data objects weren't enough (we want to pass anything that has a move constructor). This means scrapping the ring buffer, and modifying the MWSR queue to use something else (possibly an `std::queue`).
I used property based tests. With relatively little effort, I ran thousands of tests. Allow me to laugh at your measly 27 from atop my high horse.
More seriously, though, I don't know how I could have mocked my ring buffer without writing another full featured ring buffer. Mocking looks a bit ridiculous in this case. Same for the queue. And the reason why this is so is not tight coupling. It's because my queue relied on the functionality provided by the ring buffer, and my reactor relied on the functionality provided by the queue. If this wasn't the case, I would have written a simpler ring buffer and queue to begin with.
Also, why mock stuff when I can test my code under real conditions?
> The whole point of a test suite is that it avoids ossification.
This feels backwards. If you just mean that test suites prevent the unintended consequences of change, sure. They're like a homemade type system in that respect. But if you suggest we should bend a code base just to make it "let's mock everything" testable™, then no. Don't mock stuff for its own sake; it's a waste of time.
I'd rather concentrate on making the code usable in the first place. With small, well defined interfaces. In my experience, such designs are naturally testable.
I'm not going to comment on how you should have done it without seeing the code. You can throw out terms like reactor base class, MWSR (which I've never heard of but am assuming is just an MPSC using non-standard terminology) and ring buffer, but that's not enough to go on to talk about your specific situation. Moreover, it sounds like you're using C++ which isn't a language I've spent much time coding in, so I'm ill equipped to talk about mocking frameworks available to you.
> Allow me to laugh at your measly 27 from atop my high horse
If you're going to make snide comments, I'll stop this discussion. Suffice it to say you missed the point. 27 isn't just a number, it's a number arrived at by applying an exponent. Exponents get really big in a hurry. As the author of a crypto suite, I'd expect you to pay exponents a bit more respect. Testing 3 different cases per dependency gives you 27. Testing 10 gives you 1000. Property based testing is not a substitute for this kind of coverage. It is, however, complementary: it allows you to increase the base number which is raised to the exponent.
> I don't know how I could have mocked my ring buffer without writing another full featured ring buffer. Mocking looks a bit ridiculous in this case.
It's safe to assume from this comment that you don't really understand mocking. Mocks aren't an implementation of an interface to be used in tests. Mocks verify that they were called in an expected way/order and give back the appropriate value. They don't actually do anything. They're highly coupled with each test case. This is why most mocking frameworks dynamically generate the mock per-test and let you simply verify the call(s) and tell it what to return.
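To make that concrete, here's a rough hand-rolled sketch in C++ (no framework; Queue, MockQueue and Reactor are made-up names, not from your code). The mock does nothing real: it records how it was called and hands back a canned value, and the test verifies the interaction.

#include <cassert>
#include <string>
#include <vector>

// Interface the class under test depends on.
struct Queue {
    virtual ~Queue() = default;
    virtual bool push(const std::string& msg) = 0;
};

// The "mock": no real behaviour, just call recording plus a canned return value.
struct MockQueue : Queue {
    std::vector<std::string> calls;   // what was pushed, in order
    bool canned_result = true;        // what push() should pretend to return
    bool push(const std::string& msg) override {
        calls.push_back(msg);
        return canned_result;
    }
};

// Class under test.
struct Reactor {
    explicit Reactor(Queue& q) : queue(q) {}
    bool announce(const std::string& event) { return queue.push("event: " + event); }
    Queue& queue;
};

int main() {
    MockQueue mock;
    Reactor reactor(mock);
    bool ok = reactor.announce("started");
    // Verification is against the interaction, not against real queue behaviour.
    assert(ok);
    assert(mock.calls.size() == 1);
    assert(mock.calls[0] == "event: started");
}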
> This feels backwards
It's absolutely not backwards. The main purpose of a test suite is to find a very specific type of bug... regressions. It doesn't catch other sorts of bugs, except insofar as forcing developers to write tests forces them to think about their code in ways that make them realize they missed something during implementation.
But the most important thing is to be able to rely on them to tell you when you've broken something that previously worked. Every single time I've come to a code base that was either missing tests or relied too heavily on integration tests, we've had a significant problem with regressions and, over time, developers shied away from making significant changes to the code base because they were afraid to break it. And the result is exactly the ossification you're trying to avoid. But when you see a code base with a commitment to unit tests, you also see one where developers are free to make the necessary radical changes without fear of breaking existing functionality. They make the changes, run the suite and let it tell them everywhere they forgot to make a change, whether that be in application code or test code.
Multiple Writer Single Reader queue. Each on its own thread. A reactor is just an actor that reads input from such a queue, then reacts to it. It's the basis for an actor model.
---
Yeah, C++. I don't know of any mocking framework there, and I'm not going to look for one; my current opinion on that stuff veers towards "worse than useless". (You just reinforced that notion, by the way; see below.)
---
27 is a number you arrived at by fantasizing about how mocking makes your tests more thorough than they actually are, and you're not even testing under real conditions! Just an outsider's scepticism.
Your exponent is also driven by the number of dependencies, whose increasing number requires an exponentially growing number of tests. Property based testing on the other hand (search for "QuickCheck") generates as many random tests as one feels like using. (Of course, one has to test for several properties, but then each property can be thoroughly tested.)
> It's safe to assume from this comment that you don't really understand mocking.
Indeed, thanks for the explanation. But that just showed mocks are fundamentally incompatible with property based testing: I'd need to generate a mock for each random test case. Nope. I'm keeping my property based tests; they're much more powerful.
---
So you were talking about using your test suite like a type system on steroids. Good. Just keep in mind that there's a stark difference between maintaining a good test suite and bending the code around some narrow idea of testability. I only do the former. Oh, and I tend to write tests after I write the code. Property based tests don't make sense before the module is done anyway.
A quick note about unit vs integration tests: in our little A->B->C example, testing C, then B (without mocks), then A (without mocks) are all unit tests in my book. A few internal dependencies aren't going to change that.
By the way, I'm talking about compile-time dependencies, of the kind that don't involve runtime shared state. Shared state opens a whole other can of worms, and I do see the need to separate unit and integration tests there. (I also see the need to avoid shared state as much as possible.)
I'm not going to try to refute what you've said since you've clearly demonstrated a lack of understanding of unit testing (your "book" isn't correct) and mocking and a lack of willingness to be convinced. Going point-by-point would just be restating things I've said above. I'll just say that to declare something as popular and influential as those two concepts "worse than useless" without bothering to understand them is, well, ballsy.
Incidentally, here are two C++ mocking frameworks found via Google, in case you're more open minded than I'm giving you credit for:
You on the other hand seem to completely ignore property based testing. Do look it up, it's not just for the FP crowd (even though it did come from there).
Here's a primer: you think of some property that must hold for your object. Take a ring buffer, for instance: if you push N elements, it must be of size N (assuming its capacity is N or more). Then you generate random input and use it to test your property. One property, thousands of tests. Once that's done, you can move on to some other property (like: the ring buffer must have FIFO behaviour).
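A rough C++ sketch of what that looks like in practice (the RingBuffer here is a tiny made-up stand-in, not my real one, and the test loop is hand-rolled rather than using a QuickCheck-style library): generate random inputs, then check each property against a trivially correct reference.

#include <cassert>
#include <cstddef>
#include <deque>
#include <random>
#include <vector>

// Minimal stand-in ring buffer (for illustration only).
struct RingBuffer {
    explicit RingBuffer(std::size_t capacity) : buf(capacity) {}
    bool push(int x) {
        if (count == buf.size()) return false;   // full
        buf[(head + count) % buf.size()] = x;
        ++count;
        return true;
    }
    int pop() {                                  // caller checks size() first
        int x = buf[head];
        head = (head + 1) % buf.size();
        --count;
        return x;
    }
    std::size_t size() const { return count; }
    std::vector<int> buf;
    std::size_t head = 0, count = 0;
};

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> value(-1000, 1000);
    std::uniform_int_distribution<int> length(0, 64);

    for (int test = 0; test < 10000; ++test) {   // thousands of random cases
        std::size_t n = static_cast<std::size_t>(length(rng));
        RingBuffer rb(64);                       // capacity >= n by construction
        std::deque<int> model;                   // trivially correct reference

        // Property 1: after pushing N elements (N <= capacity), size() == N.
        for (std::size_t i = 0; i < n; ++i) {
            int x = value(rng);
            bool pushed = rb.push(x);
            assert(pushed);
            (void)pushed;                        // keep side effects out of assert
            model.push_back(x);
        }
        assert(rb.size() == n);

        // Property 2: FIFO behaviour, checked against the reference model.
        while (rb.size() > 0) {
            int got = rb.pop();
            assert(got == model.front());
            (void)got;
            model.pop_front();
        }
    }
}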
> to declare something as popular and influential as those two concepts "worse than useless" without bothering to understand them is, well, ballsy.
It wouldn't be the first time our industry demonstrated blatant inadequacy, at least in hindsight. We're still a young field; popularity doesn't mean much. Maybe I would have considered mocking if I didn't know of property based testing. But the power difference between one-off unit tests and property based tests is too great to ignore.
The reason I call mocking frameworks "worse than useless" is that they turn you away from property based tests. If they were at least compatible with them… I'll allow one exception: if you can use the same mock with all the random input, or at least easily generate a mock per random input, then it wouldn't be too bad.
It wouldn't be too good either, though. I'd rather test with the real stuff. If there's a regression, it will be revealed, with or without mocking. Actually, if some dependency isn't properly tested (distracted dev or something), not mocking it gives you a second chance at detecting bugs. On the other hand, have you ever seen a regression that would have gone undetected if you hadn't used mocking? That doesn't seem plausible.
The only use I see for mocks is to track down bugs once they are detected, by narrowing the search space. But I've never needed that in practice, with one possible exception: Lua.
Are you using dynamic typing? That would explain a lot.
I know what property based testing is. And it's not a substitute for unit testing. It's basically just a way of generating many tests from a single test template and is an alternative to example based testing. But it doesn't relate to what code you're testing, so what you're describing is still an integration test which serves a much different purpose from unit testing. Again, I think you just haven't had enough experience with unit testing and had that "aha" moment yet. It's not that I don't understand your point, I just doubt you have enough experience in testing strategies because you're trying to compare two things that aren't comparable. If I suggested your code use a function instead of a queue, you'd think I was ignorant and inexperienced because you can't replace a queue with a function...they're not concepts that can be swapped for one another. You suggesting property based testing instead of unit testing is likewise nonsensical. Unit testing and integration testing are different concepts of the same type. They cover what code gets tested. Property based testing is one of many strategies (example based testing, mutation testing, etc) that cover how code gets tested.
Choosing to write unit tests instead of integration tests does not preclude the use of property based testing or any other technique that covers how you test code. As I stated before, if you were to use property based testing at the unit test level, it would be all the more powerful because you'd get that same exponential coverage just with significantly larger number of effective tests per unit of code. From the A, B, C dependency graph example, if you used property based testing to test A thousands of times, B thousands of times and C thousands of times, you'd effectively be testing thousands times thousands times thousands (aka billions) of different scenarios. And you'd still be able to use mocks because mocks are created in test code and can be as dynamic as you need them to be.
I've done testing in many languages, both static and dynamic. Dynamic languages, thanks to the lack of static type checking, often don't need a mocking framework because the language is flexible enough that you can swap in a mock-like object at runtime. But the concepts are the same. And there really is a ton of value in testing code you've written in isolation. It's the testing equivalent of separation of concerns. And just like some developers never grok the single responsibility principle, some developers never grok the need to test code in isolation. You've obviously understood how to encapsulate your concerns based on the design you implemented. And yet the testing strategy you described is the opposite...trying to test everything at once rather than focusing on each test covering one specific bit of code.
This exponential madness really irks me. I confess I have no idea how you can even imagine something like that, especially considering you test "in isolation". When you test A, B, and C three times each, you've got 9 tests. And since you were careful to test them in isolation, you can be sure testing A doesn't miraculously test C as well. Your 3^3 = 27 exponent comes out of nowhere.

I, on the other hand, could pretend something to this effect: by testing A, B, and C three times each, I effectively test B 6 times and C 9 times, for a total of 3+6+9 = 18 "tests". But even then I would hesitate to pull such a number out of my ass, because the extra "free" tests are likely redundant with the real ones down the line. (And if my coverage is any good, those are guaranteed to be redundant.)

Surely you don't pretend mocking lets you test more things than my approach? Surely the mocks don't behave any differently than the real thing on the inputs they are given? How come, then, that my "integration" tests don't supersede your unit tests?
> Choosing to write unit tests instead of integration tests does not preclude the use of property based testing

I assume at this point that your definition of unit test means mocking everything, except perhaps the standard library. You said earlier:

> Mocks verify that they were called in an expected way/order and give back the appropriate value. They don't actually do anything. They're highly coupled with each test case.

So, if tests are generated randomly, mocks must be adapted to each random test, right? Now you need some mock generator that will generate a mock for each test case. This veers eerily close to actually implementing the interface we were trying to mock.

Now I realise I did miss something:

> Mocks verify that they were called in an expected way/order and give back the appropriate value.

Why would they do that? Oh, I see: we are mocking a mutable object that is used somewhere else. In other words, shared state. The one case for which I understand isolation could be good.

This may all be a big misunderstanding. The A->B->C dependency I was thinking of is pretty simple: C is an implementation detail of B. Each B contains and uses a C, but that C isn't used anywhere else. Thus, mocking B or C would mean testing the implementation details of A, and that is a big no-no, the kind of thing Uncle Bob would call TDD gone wrong.

> And there really is a ton of value in testing code you've written in isolation.

What value exactly? Does isolating objects in the tests make those tests catch more bugs? What kind of bugs? Does isolating objects make detected bugs faster to track down? Does isolating objects in the tests influence how the code is written in the first place? How? Would the test suite run faster?

> And yet the testing strategy you described is the opposite...trying to test everything at once rather than focusing on each test covering one specific bit of code.

Now that's a strawman. I don't test everything at once. If I did, I would merely test A. No no no: I test and debug C first. When I'm confident it is bug-free, I test and debug B. Bugs are easily tracked down, because C is bug-free, so if something went wrong it must be B. I only test A last, when I'm confident B and C are both bug-free.
I've had a lot of cases where no bugs were found despite huge amounts of unit testing over trivial code. It would have been easier and more useful to do manual inspection (code review) with a bunch of people instead of writing tests at that level, and to instead write tests at a higher level of complexity involving more code/components.
In a perfect world you'd test all the branches and possible values, but out in the real world, if you aren't finding bugs with your testing and there are still bugs in the code, you need to be spending your testing "dollars" (time/effort/talent) more efficiently.
I failed to mention that the tests themselves found lots of bugs. I needed those tests. Also, those weren't mere individual tests: I ran a number of property-based tests as well, and they tend to be much more thorough.
I wrote those tests because I was writing foundational code. The outer layers hardly have any tests, which is good enough for the proof of concept we were asked for. We'll need to be a bit more rigorous to polish this into production.
I have developed some kind of instinct that is non-trivial to teach, and impossible to convey in a small HN comment.
But my primary proxy is size. The less code the better. Even that has to give way to other concerns, though, such as style constraints and straight up performance.
While I can't show the code I was talking about above (it was proprietary), I can show my crypto library, Monocypher¹. I think it is a good example of what I mean by high-quality code. (Yes, I am boasting. I believe this is justified. Also, quality requirements for crypto code are kinda off the chart.)
I think crypto code is a bad example. This[1] would not be considered "idiomatic" anywhere, except in crypto.
And for example, if one was optimizing for code readability over performance, I expect that would look very different (although you could do much of the same with macros, but I'm not sure if that's any better).
static int neq0(u64 diff)
{ // constant time comparison to zero
// return diff != 0 ? -1 : 0
u64 half = (diff >> 32) | ((u32)diff);
return (1 & ((half - 1) >> 32)) - 1;
}
I am incredibly skeptical that you've done better than the compiler with this bit twiddling stuff. You realize that on x86_64, it probably compiles down to 1-2 instructions to write `diff != 0`, right?
This isn't an effort to beat the compiler on speed or optimization - rather, the opposite. It's an effort to produce a method which executes in the exact same amount of time for any input - whether or not it is zero. This is key in implementing crypto while avoiding side-channel timing attacks and this isn't a particularly uncommon way to do so.
Now, with that being said, this strikes me as a domain issue: the OP seems well-versed in crypto and foundation / backend code, which has relatively constrained input and output behaviors and is testable using smart reasoning and fuzzing. This is somewhat different from front-end or user-facing code, which involves handling epic quantities of mutable state and requires validation of a feature end-to-end. I suspect OP writes great, "bug-free" code in the backend / systems domain, but I don't think their testing practice or anecdotes are applicable across the industry.
> I suspect OP writes great, "bug-free" code in the backend / systems domain, but I don't think their testing practice or anecdotes are applicable across the industry.
I'll grant OP a Magic Cloak of Error Abolishment and an enchanted ring with +4 INT, +4 WIS, and +3 VARIABLE-NAMING...
Regardless, what's the very first thing I'm gonna do when I inherit his project for support or further development? Start ripping it apart to add tests so that I can verify the behaviour of the system pre, and post, change.
If that code base was tested before, even poorly, that means extending the tests and improving their web of assertions, and I will be making positive changes to the code within hours. If that code base is untested, then my maintenance activities will involve unsupported refactoring and WAGs about system behaviour until the new tests come online. Days of work will hamper the start of maintenance activities. From experience, it will also likely mean finding a bunch of previously hidden issues, bugs, and suspect behaviour that has to be re-analyzed and re-tested before maintenance work can begin in earnest.
A strong feeling that something is "bug-free" does not tell me that everything works unchanged on a new platform/OS/architecture, or let me throw in a newly conceived corner case, or check some environmental oddity... If we start out with the premise that systems spend most of their lifetime in maintenance, preparing the system for maintenance activities and trying to eliminate risk in that delicate window make a lot of sense. Not for a lone developer who has a bunch of domain knowledge in their heads, but for teams where new individuals are exposed to new domains and code at the same time.
> I'll grant OP a Magic Cloak of Error Abolishment and an enchanted ring with +4 INT, +4 WIS, and +3 VARIABLE-NAMING...
Most of those come from my test suite. Without that, I'm back to -2 INT, -3 WIS, and a Cursed Cloak of Mistakes. I did let some bugs slip through the first time around…
> what's the very first thing I'm gonna do when I inherit his project for support or further development? Start ripping it apart to add tests so that I can verify the behaviour of the system pre, and post, change.
As far as Monocypher is concerned, I already did that. Test vectors, property based tests, sanitisers, code coverage analysis, the works. The public interface is testable enough that you don't need to take anything apart to thoroughly test the library (it already is thoroughly tested).
> A strong feeling that something is "bug-free" does not tell me that everything works unchanged on a new platform/OS/architecture, or let me throw in a newly conceived corner case, or check some environmental oddity...
It helps that Monocypher has zero dependencies (not even the standard library), uses fixed size integers almost exclusively, compiles without warnings as both C and C++ under GCC and Clang, and that crypto code is abnormally straight line.
> If we start out with the premise that systems spend most of their lifetime in maintenance
I made sure Monocypher required next to no maintenance (I won't have the time for significant maintenance work). It's mostly a matter of size and scope.
The FOR macro is more readable than the original C for loops. Given their sheer number in this code, I think this is worth the 10-second learning curve. I don't use that when I work with others, though.
The "used twice" macro makes clear this is the exact same code. Would be hard to make sure it is otherwise. Easier to review that way (carry code is a nightmare to test and review).
The constant time comparison is necessary to ensure timing attacks cannot happen. The arithmetic trick may be a tad slower than a conditional branch, but its timings are consistent. Note: the compiler is still allowed to just use a branch, but in practice they don't.
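Working through the neq0() snippet quoted above: half = (diff >> 32) | (u32)diff fits in 32 bits and is zero exactly when diff is zero. If half == 0, then half - 1 wraps around to all ones, (half - 1) >> 32 has its low bit set, and the function returns 1 - 1 = 0. If half != 0, then half - 1 < 2^32, the shift gives 0, and the function returns 0 - 1 = -1. The same instructions execute either way, with no data-dependent branch.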
Constant time comparison is potentially important in crypto because of timing attacks that can be used to gain information. I haven't looked at the code here to see if that's the case, nor am I a crypto expert, but see https://en.wikipedia.org/wiki/Timing_attack
[1] can easily mean that no one used the software, so it doesn't really tell us much. Also, unless you had actually re-written the code to be unit testable, how do you know it's more complicated that way, or less maintainable? You're responding to a study with actual data, on a generally "preference" topic, with an anecdote, so yes, I do feel justified in pointing this out.
> [1] can easily mean that no one used the software so it doesn't really tell us much.
This was foundational code, used for inter-module communication all over the place.
> unless you had actually re-written the code as unit testable how do you know it's more complicated that way or less maintainable?
Because the code was small enough to allow me to envision the necessary modifications for dependency injections. I had 3 classes, with an A->B->C dependency chain, and no reason to inject anything if it weren't for some cargo cult about unit tests. Testing the hell out of C, then B, then A, proved quite sufficient without mocking or injecting anything.
> You're responding to a study with actual data, on a generally "preference" topic, with an anecdote so yes I do feel justified in pointing this out.
Whatever evidence the study actually has is weak. Small sample size, and the failure to analyse actual outcomes (bugs, speed of development…) mean we cannot possibly get much out of it. It's a good starting point.
I don't believe I have contradicted this study's evidence. But even if I did, my personal experience gives me way more evidence than such a study, so I feel perfectly justified in contradicting it on that basis. (More solidly settled science, that's another story.)
Problem is, you do not have a privileged access to my personal experience. You only know what I just wrote. And my written report of my personal experience means little, next to that study. You'd better believe the study before you believe my report.
Well stated, and I agree with the final paragraphs, but why drop the half-written anecdote?
Also, I think you may be conflating DI with testability. Code can be unit testable and easy to DI without being built for it. I write code that is testable, and it generally happens to be DI-able as well, but that's not the goal. From my own experience.
Anecdotes are still valuable in that they point out where you might be interested in looking further. You can't trust anecdotes, but they may indicate where to look.
> I think you may be conflating DI with testability.
I'm conflating DI with a stupidly narrow idea of testability. I'm fully aware that my code was easily testable (and thoroughly tested!) despite hard coded dependencies.
I think good code usually results in the test suite being useless the vast majority of the time. That is, only let's say one run in a thousand of any given test should fail after adding or refactoring code, if that. Even then, the failures are usually commensurate with the changed expectations of the changed code and not indicative of a bug.
> But taking the data at face value, the trend lines speak clearly. More test methods in codebases predict more cyclomatic complexity per method and more lines of code per method
My initial reaction to this is that having more tests means your code accommodates more and increasingly complex cases and edge-cases in general, because you’re not going to write 10 test-cases testing the same logic (or combination of logic).
That’s going to mean more code and more complex code, whether you like it or not.
The projects with fewer test-cases may not handle as many edge-cases, and maybe they don’t have this complexity (and the associated tests) because they don’t need it yet?
If so, tests themselves are not bad, but may just be indicative of the real world cost of real world complexity?
I'm a heavy proponent of writing tests at work, even if it's not full-on TDD, sometimes to the annoyance of my co-workers, and my reaction before even reading the results was that, for those two, his prediction was wrong.
It seems obvious to me: If you have (good) automated tests for correctness, a method can handle more complexity without the programmer needing to handle all paths in their head at once. The tests should let them know if they made a mistake somewhere. On the other hand, if no such tests exist, those methods would be more likely to be written as smaller methods that call each other (maybe one "driver" and a couple "helpers"), so the programmer doesn't need to hold the whole thing in their head at once.
The odd part is that the second way can often be better for clarity/maintainability, since those helpers can be reused and the driver can be written at a higher level. But it seems like automated tests give people an out where they don't have to think about that anymore.
Just to expound on your hypothesis (from another OOP TDD-ish proponent):
Google defines Cyclomatic Complexity as "a software metric... [that] is a quantitative measure of the number of linearly independent paths through a program's source code."
It seems to me that swapping out a dependency, abstracting it, and then feeding multiple variants of the same dependency to some code would greatly increase the number of linearly independent paths through the code. These kinds of refactorings/improvements also tend to greatly increase maintainability, clarity, flexibility, and your programs featureset...
By way of example: taking some DB insert code, updating it to write to a simplified interface, and then injecting components to handle writing to the DB, a local file for testing, AWS S3, and such... A well-factored, clean, reusable, persistence ignorant solution will have a higher cyclomatic complexity than a hard-coded, raw INSERT. I know which one I want to maintain, though ;)
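A rough sketch of that kind of refactoring (all names made up): the caller writes to a small interface, and the concrete sink, database, local file for tests, S3, whatever, gets injected.

#include <fstream>
#include <iostream>
#include <string>

// Made-up record and sink interface; a real one would carry more detail.
struct Order { std::string id; double total; };

struct OrderSink {
    virtual ~OrderSink() = default;
    virtual void save(const Order& order) = 0;
};

// One implementation per backend: database, local file for testing, S3, ...
struct FileSink : OrderSink {
    explicit FileSink(const std::string& path) : out(path) {}
    void save(const Order& order) override {
        out << order.id << "," << order.total << "\n";   // stand-in for the raw INSERT
    }
    std::ofstream out;
};

struct ConsoleSink : OrderSink {   // handy fake for tests and demos
    void save(const Order& order) override {
        std::cout << "would persist " << order.id << "\n";
    }
};

// Application code only sees the interface.
void checkout(OrderSink& sink, const Order& order) {
    sink.save(order);
}

int main() {
    ConsoleSink sink;
    checkout(sink, Order{"A-1001", 42.50});
}

More independent paths through the code than the hard-coded INSERT, but each one is swappable and testable on its own.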
Quite possibly. I will assert, though, from my experience and the wisdom I've garnered from talking to people about unit testing, that as code asymptotically approaches 100% code coverage, the complexity of said code tends to go up exponentially (or worse).
I don't have significant proof other than to say that things like DI and IoC start to get used in places that make little sense, which means complexity that wouldn't exist without tests. I also have anecdotes of defensive code I've written, with code branches that 'will never execute'. Branches that are there in case I missed something subtle, that simply log an error, or otherwise throw some exception. I can verify the code works before walling it off, but if you ask me to unit test those lines of code, now I need ways to set the internals of a function externally, or otherwise add code to allow variables to be set in a way that shouldn't be possible. This leads to complexity. And if you asked me to test it, I'd just quietly delete the defensive code rather than poison internal state variables.
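To make the defensive-code point concrete, here's a made-up sketch of the kind of branch I mean: the default case should be unreachable if the rest of the program is correct, so unit testing it means corrupting internal state on purpose.

#include <cstdio>
#include <stdexcept>

enum class Phase { Init, Running, Stopped };

// next_phase() is only ever called with the three values above, so the default
// branch "will never execute"; it exists purely as a tripwire that logs and throws.
Phase next_phase(Phase p) {
    switch (p) {
        case Phase::Init:    return Phase::Running;
        case Phase::Running: return Phase::Stopped;
        case Phase::Stopped: return Phase::Stopped;
        default:
            std::fprintf(stderr, "next_phase: impossible phase %d\n", static_cast<int>(p));
            throw std::logic_error("next_phase: unreachable phase value");
    }
}

int main() {
    // Covering the default branch in a unit test would mean forging an out-of-range
    // value, e.g. static_cast<Phase>(42): poisoning state normal code can never produce.
    return next_phase(Phase::Init) == Phase::Running ? 0 : 1;
}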
> Where a "reasonable" cutoff point may be... Now that's a harder discussion :)
I think this is something that should really be reflected in your systems overriding architecture...
A 100% tested, persistence ignorant Domain library is A Good Thing. But your Infrastructure library, or your build scripts, or things that only really make sense as integration tests? Only as much as makes sense.
While it's an individual heuristic: if I can see myself arguing for more than 30 minutes why it's meaningless to test something then I have no problem letting it slide ;)
I love that this post is quantifying results over actual data—a huge improvement over the speculative/anecdotal pattern that a lot of these analyses have.
But the conclusion that complexity and LOC both are higher in codebases with much unit testing... seems very weak. The complexity graph in particular looks very level, with the trendline driven almost entirely by the very low datum for 0-10%, and the LOC graph looks almost as suspicious. That low datum definitely demands further explanation before any conclusions are drawn on those two.
To me, it also inspires two further questions about the LOC methodology:
1) how sensitive is the measure to different coding styles? Does it include the method header? Does it include the closing bracket for the whole function? Does it include open bracket for the function when on a line by itself? Does it include any bracket on a line by itself? With the average method length ranging from 2 to 5, coding conventions on brackets could make a substantial difference, and if coding conventions correlate with testing philosophy at all (for whatever reason), that's a possible threat to validity.
2) Could these averages be dominated by a few larger projects? That is, is the average computed as

average(length of all methods, across all projects)

or as

average(for each project, average(length of that project's methods))?

If the former, larger projects would dominate the average. (True of all the other questions that involve counts and averages too, actually.)
In addition to the methodology concerns you raised, I have to question the use of a "linear" trend line, particularly when one axis is arbitrary buckets of unequal size.
The consistent bump in the 0% bucket suggests that there is a significant measurement error going on here.
The author acknowledges this:
> So if any of these codebases are using a different framework, they’ll be counted as non-test codebases (which I recognize as a threat to validity).
I'm inclined to say the 0% bucket should be thrown out, since it's impossible to disambiguate between "no tests" and "tests I can't detect". All of the other buckets are quite likely to be correctly classified.
> I think it is the other way around. People are forced to write unit tests for complex and big methods.
Intuitively this seems correct to me, at least for the code-first paradigm. I do wonder if this holds in the TDD paradigm though, since in that case there are (should be?) tests for everything regardless of complexity.
It would be interesting to try to answer this question with further experimentation along the lines of the OP, though I can't think of any way of detecting/classifying projects as TDD vs. not-TDD without contacting the repository authors (which would be time-consuming, potentially prohibitively so).
The results are compatible with "risk homeostasis" or "risk compensation" theory. People who use unit tests have an increased sense of safety, so are more comfortable living with less-safe designs and code-bases.
"Risk compensation is a theory which suggests that people typically adjust their behavior in response to the perceived level of risk, becoming more careful where they sense greater risk and less careful if they feel more protected. Although usually small in comparison to the fundamental benefits of safety interventions, it may result in a lower net benefit than expected"
-- https://en.wikipedia.org/wiki/Risk_compensation
I like the idea of this study. I wish there were more data behind it. I’d love to see comparisons based on language, or testing tool, for example. Or code base size, age, author count, etc. There also needs to be some weighting of confidence based on how many code bases are in each group. eg, Since there are only four code bases in the >50% group, it’s dangerous to draw conclusions based on a trend line when that group is equally weighted with the far more populous groups.
Still, I’m hoping the author expands his datasets and analytical methods as time goes on and I’d be interested to see this topic revisited with a larger sample.
What about causality? It seems plausible that people would be more motivated to write unit tests for complex functions that are hard to gain confidence in otherwise.
This was my thought too. The findings are written as "unit tests cause X" when really what is shown is "unit tests are associated with X".
Another explanation for the results could be common cause:
Maybe packages with more unit tests are also professionally developed, or written by people interested in software methodology, or are older; and those aspects are also associated with method length/decomposition patterns/etc.
It seems plausible that people would write shorter, clearer functions that have no possibility of being incorrect if they had no tests to give them confidence in their correctness.
Also, designing for testability only gives you testability. If you ditch this requirement, suddenly you're free to design directly for ease of use, not for some proxy like testability.
> It seems plausible that people would write shorter, clearer functions that have no possibility of being incorrect
This seems exceedingly implausible. If people were capable of writing perfectly correct functions consistently, they’d do so.
Also, in my experience the people who write tests are generally the ones who write good small functions. The people writing massive dense functions are rarely writing good (or often any) tests. (This obviously diverges from the study’s results, though, so grain of salt.)
In my experience the people who are the strongest advocates of unit testing write small functions. I left off "good" because I don't agree that "small functions" are "good" necessarily.
Often the functions are in fact too small. They properly belong in a larger function as part of that function's implementation, but are "factored out" solely for the purpose of unit testing that piece of the algorithm inside the function.
Usually these small functions are also only called once. In that case especially it leads to messy, bloated code that is in fact harder to reason about, harder to maintain and less efficient besides.
> Usually these small functions are also only called once. In that case especially it leads to messy, bloated code that is in fact harder to reason about, harder to maintain and less efficient besides.
the flipside is that if this approach is done well, you end up with things broken down along the natural abstraction lines you'd use, and if you trust that your small functions do the right thing (are tested themselves) and are called correctly (their callers are tested well), it tends to make it much easier to reason about the code, IMHO.
but yeah, also, sometimes larger functions and more integration-y tests are the way to go. or the above fine breakdown + unit tests + integration tests. depends on the situation, i think.
The tendency is to lock in the first abstraction line that you thought of... which rarely ages well with future experience.
My rule of thumb is that something needs to be a function if it is called 3x, doesn't if it is only called from one place, and if it happens 2x it is a judgement call whether to leave cross-referencing comments between the spots or to factor it out into a function. (If it is unlikely that I will need a third and the common code is difficult to extract, I will leave comments.)
You are absolutely correct. GLSL shaders were excruciatingly difficult to test and debug, before tooling caught up. As a result, I took great care to write small and easily understood functions. It did much to improve my work in Python and C++, where debugging and unit testing were available.
It's irritating to devise guidelines for good development, in general, and then have people build a cult around them. People honestly believe that unit testing is the only incentive for writing good code. This is equivalent to ascribing all good things to Dear Leader and all bad things to an out group.
There is no guarantee that all complex problems can be solved in an obviously correct way if you use small enough functions. As the number of functions increases, the complexity tends to get pushed into the interactions between them - and you need integration testing to look for bugs in how they interact. If you are writing small functions - and, generally speaking, you should - integration testing becomes more important than unit testing.
Unit testing tells you what you already know. It's kind of pointless, when you think about it (unpopular opinion). If you have a handle on your code there's rarely any need for it.
To echo the original comment, I already know half my assumptions don't hold anymore, and I don't care about them. Unit tests are just more code to maintain. It's like vastly expanding the requirements, unnecessarily.
I've been using a program I wrote for 18 years now. The code today is not the same as the code 18 years back. I have slowly modified the code as ideas for new features come up, old features are removed because they aren't used anymore, and foundational code is rewritten as I find better implementations (or the previous implementation was a mistake [1][2]). Making a sweeping change (like re-implementing some core code) is dangerous because of the threat of new bugs, which is why tests are important (although I tend to prefer integration tests over unit tests).
[1] I used to have an error logging mechanism in place that would record where in the code the error happened (in addition to other information) and how it propagated up through the code.
While it wasn't that hard to use, per se, it was bothersome when I had to add an error (each error had a unique ID and because of language issues and tooling, that was a manual process). And it really never paid back its implementation cost in terms of reporting, so I finally ripped it out.
[2] I had my own version of C's FILE* [3] that ended up being a horrible abstraction: it was so confusing that I could never remember how to use it, and I wrote the code. I got fed up with it and ripped that out, replacing it with native IO calls.
[3] For stupid reasons now that I think back on it.
The simple way to say it is that once your code gets too complicated to keep all the execution paths in your head (sooner than you think), tests are what enable you to make sure you didn't forget an assumption you made about how it should work.
Do you like going back to manually verify things still worked after you make a foundational change? Or do you just trust yourself to know you've not impacted anything negatively? Do you honestly believe that's good use of your brain power even if true? Do you enjoy being married to that code? Because that's exactly what happens as nobody but you can work on it. That strategy works out well for employees to become key men and get big retention bonuses so there is some merit to your approach :-)
Not to mention the team impact to cowboy coding. It's a selfish approach to software development and is often undertaken by bully developers that say things like "use the source" as a way to make you feel dumb for not wanting to spend all day trying to untangle their likely insane code. All because they were too lazy to use a little empathy and be a good teammate.
To recap: Tests document and improve your design, verify functionality, provide leverage to accelerate development, and reduce "bus factor" risk. I don't care what a small sample data set says to the contrary there's really little debate that when wielded properly these tools and techniques do improve software on multiple dimensions that go beyond the software itself and heavily impact the business.
20+ years of product development experience in lots of teams/products inform this opinion. Let's see the data on how your business grinds to a halt when these tools aren't used and the code base gets increasingly larger and complex and people leave. Tell me about how you don't need those automated tests as you approach technical debt bankruptcy and all your key people have left. "Use the source" is a terrible response to that problem.
Rewrites can be fun though so maybe that's the answer. It sure worked out well for Netscape. ;-)
Unit tests are a poor way to validate code behavior. 30+ years of development has taught me that validation tests within the code itself are far better.
1) They run whenever the product runs. Good logging code tells you when it fails, even on customers devices.
2) By being part of the production code, they are far easier to keep up to date as the code changes even through refactoring.
3) They don’t force you to change anything about how you code.
Why not just say that unit tests are a management methodology for mixed-skill teams? There's no need to invoke 'bully devs' just because inexperienced or un-conscientious developers exist.
Usually I feel like the only person who doesn't unit test. I've heard all of the arguments and I like the theory, but practically I don't want to spend the time on it, and I maintain a large codebase. Regressions are easily spotted and corrected in a well structured code base. Could occasional bugs be spotted before production with unit tests? Sure! But it isn't worth the time in my case.
I’ve tried to write unit tests. They just haven’t been worth the effort.
1) It takes more time.
2) In any complex application, unit testing requires creating mocks to separate the code you are testing from live subsystems.
3) That requires modifying your production code to be easily mockable in ways that don’t benefit the production code base.
4) Unit tests are by default artificial. They don’t test real world usage.
5) As you refactor production code, you have to rewrite unit tests, adding more overhead.
A far better way to write tests is within your production code. Validate every parameter and every assumption. Raise errors when possible, log everything else.
This is far easier to develop and maintain. It runs when you, your testers, and your customers use the application.
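Roughly what I mean, as a toy sketch (the function and logger are made up): the checks run on every real invocation, and failures get logged instead of only ever being exercised in a test harness.

#include <cstdio>
#include <string>

// Stand-in logger; a real one would ship these logs off the device.
static void log_error(const std::string& msg) {
    std::fprintf(stderr, "[ERROR] %s\n", msg.c_str());
}

// Made-up example: validate every parameter and assumption, on every call, in production.
bool apply_discount(double* price, double percent) {
    if (price == nullptr) { log_error("apply_discount: null price"); return false; }
    if (*price < 0.0)     { log_error("apply_discount: negative price"); return false; }
    if (percent < 0.0 || percent > 100.0) {
        log_error("apply_discount: percent out of range");
        return false;
    }
    *price -= *price * (percent / 100.0);
    return true;
}

int main() {
    double price = 20.0;
    apply_discount(&price, 150.0);        // rejected and logged; price left unchanged
    apply_discount(&price, 25.0);         // applied normally
    std::printf("price: %.2f\n", price);  // 15.00
}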
Now I write desktop and mobile apps. If I wrote APIs and servers unit tests would become far more useful. But on device most of my problems can only be caught by integration tests, not unit tests.
> A far better way to write tests is within your production code. Validate every parameter and every assumption. Raise errors when possible, log everything else.
> It runs when you testers, and customers use the applications.
Do you rely on any sort of automated testing to catch errors before it gets to your customers? From a business standpoint it doesn't feel right to offload testing to your users, especially in more critical systems.
Our testing is by hand; we have a QA lead, and the entire team helps.
Verification isn’t offloading testing, it’s monitoring real-world usage to find out where it differs from testing. The hard part is that the logs can be a torrent of data, so you need a process to monitor and escalate.
I’d like to automate testing. But it has to be no more work than manual testing, and/or it has to be as effective as or better than manual testing. And I can’t see either of those being true.
Most of my bugs can’t be caught by unit tests. For example, one common category is where previous devs made assumptions that hold when the code runs normally but fail in specific edge cases involving thread timing or view controllers being disposed. I don’t know how to automate those tests.
You're not the only one, but it can be an unpopular opinion. Of course we wrote very large code bases with no unit tests for decades, before it became a blessed cult of sorts. I know when I've written a method that's so inherently tricky, that I need a unit test for peace of mind. It's rare, and I prefer not to write tricky methods, or complicated architectures. This is practically impossible to avoid on a large team of mixed skill and weak leadership. I prefer not to work on teams like that.
No unit tests. With good code structure, regressions are minimal; most are caught with practical testing. Occasionally one slips through to production, is reported, and is quickly patched. It's rare and something that I'm sure happens to everyone, even with unit tests. I don't tell people to do it my way, but it works for me and my company.
I just mean a manual test of application functionality. Maybe some manual function calls. Nothing automated or saved.
BTW, the code is an ecommerce platform: not just a website but dozens of integrated applications that handle everything from product selection to order fulfillment, and everything in between. 11 years of code that dozens of people use daily to process hundreds of thousands of orders per year.
Interesting article, but too many uncertainties: only 100 examples, the 30 code bases w/o unit tests might have tests the author didn't check or see, and there is likely a cluster effect to be found in this sample. But it's interesting enough to warrant a large-scale investigation to see whether those results would really hold.
If you define unit tests as tests which test non-divisible units on a low level, then by definition they are a form of tight coupling.
The most loosely coupled form of test would be an end to end test (most loosely coupled, but obviously not without its disadvantages).
Since the pain of dealing with one form of tight coupling is magnified by having another form present, the presence of unit tests acts as a sort of canary in the coal mine: if your unit tests are extra painful today, that's probably because the part they are connecting with has a high degree of tight coupling. That might (or might not) lead you to refactor and decouple the rest of the code base in an attempt to mute that pain. Either way you'll feel the pain.
I believe this effect is real. I also don't do this myself, because I don't believe in self flagellation as a means to enlightenment. The best thing to couple your tests to depends upon a range of factors, IMHO, and getting religious about where to couple anything is the worst thing you could do, especially if that religion dictates tight coupling (unit testing all the things).
Method length and complexity probably go up because, if you focus on loose coupling (to mute the "unit testing pain") to the exclusion of other good code qualities, such as less code and less complex code, then those other qualities will be sacrificed.
Very cool, I'd love to see a more scientific approach to understanding software development methodologies.
What might be interesting is a metric for how 'refactorable' a codebase is, i.e. once a class is created, what is the probability there will be significant changes to it in the future for highly tested vs untested code.
Also the same metrics along a dimension of typed vs untyped languages (my hypothesis would be that unit testing would show more benefits for untyped languages like Python than .NET).
The author mentions that they only analyzed the collection of C# repositories that they used in their study on Singletons, but I would expect the results to be different depending on the language used.
In my experience with Python, unit testing is essential to catch careless errors (typos, bad ducks, etc. that can't be caught by a linter) that would cause your program to crash at runtime, so there's a slightly different motivation for writing unit tests. It's a less restrictive language so you don't have the issues with private methods like the author mentioned, and is (at least to me) a bit less of a strain to write tests. Unit testing might correlate more with good hygiene for a Python codebase because testing is vital to check correctness and there are fewer situations where you have to mangle your code for the tests, whereas it may not in C# because you get guarantees from the type system that makes testing less vital and writing tests is more of a nuisance.
I was curious about "You can’t (without a lot of chicanery) unit test private methods."
Why is that? Maybe you can't test them directly, but private methods should be executed, and thus indirectly testable, by public methods. If there is code in private methods that can't be executed by any public method invocation, it's unnecessary code, isn't it?
> private methods should be executed and thus indirectly testable by public methods.
Many people would count that as integration tests. Because if you don't call the method directly, it's not a unit test, for some reason. (I guess it depends whether you consider the unit to be the class or the method.)
Some of those people will also say that not being able to "unit" test a method is actually a bad thing. I personally don't care much, as long as I can make sure the code works somehow.
I would argue that those people are wrong, and defeating the purpose of object-oriented programming, by breaking a couple of basic principles like encapsulation and abstraction.
They also defeat the whole purpose of unit testing, which is being able to refactor with confidence that there won't be any regressions.
If you "unit test" the private implementation details of a class, you'll need to fix the tests every time you want to refactor it. That's an unacceptable tax.
Instead, classes should encapsulate and abstract away a problem for you; from the outside, you shouldn't need to know how it works. It should expose a clear interface, ideally as loosely coupled to other classes as possible. And this interface is the bit that needs to be unit tested: the "contract" with the outside world.
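As a hedged illustration of testing the contract rather than the private internals, here's a small C# sketch assuming the xUnit test framework; PriceFormatter and its private Normalize helper are made up for the example.

```csharp
using System.Globalization;
using Xunit;

// Invented example class: Normalize is a private implementation detail.
public class PriceFormatter
{
    public string Format(decimal amount) =>
        "$" + Normalize(amount).ToString("0.00", CultureInfo.InvariantCulture);

    private decimal Normalize(decimal amount) => amount < 0 ? 0 : amount;
}

public class PriceFormatterTests
{
    // Only the public contract is exercised; Normalize is covered indirectly.
    [Fact]
    public void Format_ClampsNegativeAmountsToZero() =>
        Assert.Equal("$0.00", new PriceFormatter().Format(-5m));

    [Fact]
    public void Format_RendersTwoDecimalPlaces() =>
        Assert.Equal("$3.50", new PriceFormatter().Format(3.5m));
}
```

Because the tests only touch Format, the Normalize helper can be renamed, inlined, or rewritten without any test churn, which is exactly the refactoring safety being argued for above.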
I've had a similar debate. Someone kept trying to exclude some of our data objects from test coverage reports. I agreed we shouldn't bother testing getters/setters. But those objects had better be used in other tests we are writing.
The typical argument for getters/setters is that there is some logic that should be performed and that's why the fields should not be public. If there's logic, they should be tested. If there's no logic, they should be public fields.
No, the point of getters and setters is that the interface to get and set fields should be the same whether or not logic is being performed behind the scenes, and the caller should not need to know.
Making fields public makes it impossible to add logic in the future without changing other code. That breaks encapsulation and is generally non-idiomatic in languages with getters and setters.
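A tiny hypothetical C# example of that point (Account and Balance are invented names): the call site writes account.Balance = value either way, so logic can be introduced behind the property later without touching callers.

```csharp
using System;

public class Account
{
    private decimal _balance;

    // This started life as a plain auto-property: public decimal Balance { get; set; }
    // Validation was added later, and no caller had to change.
    public decimal Balance
    {
        get => _balance;
        set
        {
            if (value < 0)
                throw new ArgumentOutOfRangeException(nameof(value), "Balance cannot be negative");
            _balance = value;
        }
    }
}
```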
My point is when we checked the report, the file was already getting covered. If your tests of the system aren't using the data objects of your system, you have larger problems.
My guess is that both variables in these results are almost entirely driven by external factors. Team size, whether it's pro or side project, deadlines, experience/talent of developers, whether tests are required or optional, the language. Hard to draw direct correlations given the sample size.
I can speculate as to why this would happen. People expect unit-tested code to have shorter methods because it's easier to write tests for shorter methods.
However, writing a new test for a new method can be harder than just stuffing that code into some other method that already has a test. Depending on how testing is enforced, you might be able to get away with not having to do anything at all if the existing test still passes. If you're in a hurry and the test fails, just change the assertion to whatever happens. Is it right? I dunno, the test says it's working.
I believe you came to a conclusion too quickly, given the way you measured the chosen code bases. Lots of things here are run over very quickly without anything shown to back up your right/wrong results.
And sure, tests of course have an impact on design, as they should. There is consensus, at least within the OOP paradigm, that this is the case.
I also believe you try to make a one-size-fits-all measure for tests, and that is simply not fair given how differently these codebases are designed and implemented.
Awesome article, so interesting. I wonder how language design affects this, I write much better code in Elixir than I do in Node for example.
Maybe, though, the solution is simple: if you value having low cyclomatic complexity, method length, etc., don't let people commit stuff unless it passes the required level. This is certainly working well for me with prettier+eslint... I'll look into other tools based on the code smells in this article, I think!
I would love analysis like this based on all the data that codeclimate must have. The fact that the test coverage buckets end at 50%+ makes me think that the number of projects that were actually TDDed is very low. I've never seen a project that was truly TDDed with less than 98% coverage. Typically the remaining 2% comes from a config file or something like that.
Projects might only invest in unit tests once a project exceeds a certain complexity, or has seen refactorings with age which made the need for tests clear.
I'm not sure how to normalize properly, but maybe projects solving a similar problem could be compared. Like different ORMs or Databases or Kernels.
This is a great direction for research, but wake me when there's been any measurement of significance in the results. Trendlines are interesting but until we've proven that the variation we're seeing is actually statistically significant with regard to the sample size, these aren't actionable results.
When I was writing device drivers for the middleware people, I found that unit tests were useful for testing the code and making sure the hardware was doing what it was supposed to. Along with the drivers, I released the test code to middleware so they could use it in their own code.
This is just my experience, but the engineers and managers who are the biggest proponents of unit tests don't write tests themselves for their own work. I've noticed this many times in many different jobs, and I'm curious as to why that is.
While an interesting read, I can’t help but wonder about two things:
1. This is obviously .NET projects only
2. 100 is a drop in the bucket even if we take C#-only projects. How good are the statistical results? Is the sample representative?
No, a function can be long but have a low cyclomatic complexity score [1] (being a list of do_this(), do_that(), do_somethingelse(), etc) while a short function can have a higher cyclomatic complexity score (lots of fiddly logic).
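A contrived C# sketch of that distinction, with all names invented: the first method is long but branch-free, the second is short but full of decision points.

```csharp
// Contrived illustration: length and cyclomatic complexity can diverge.
public class ReportJob
{
    // Long-ish but straight-line: no branches, so a cyclomatic complexity of 1.
    public void Run()
    {
        LoadConfiguration();
        ConnectToDatabase();
        FetchRows();
        AggregateTotals();
        RenderTemplate();
        WriteOutputFile();
        SendNotification();
    }

    // Short but branchy: the ifs, ternaries, and && push the cyclomatic
    // complexity to roughly 5 or 6, depending on the counting rules used.
    public static string Classify(int age, bool isMember, bool isWeekend)
    {
        if (age < 0) return "invalid";
        if (age < 18) return isWeekend ? "child-weekend" : "child";
        return isMember && isWeekend ? "member-weekend" : "standard";
    }

    private void LoadConfiguration() { }
    private void ConnectToDatabase() { }
    private void FetchRows() { }
    private void AggregateTotals() { }
    private void RenderTemplate() { }
    private void WriteOutputFile() { }
    private void SendNotification() { }
}
```

Most static analyzers would score Run() at a complexity of 1 despite its length, while Classify() scores noticeably higher despite being only a handful of lines.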
You could deliberately do that. But most of the time length and complexity correlate. More importantly, when you're splitting up functions to make them testable, you're very much reducing complexity and length in tandem.
Splitting up functions just to make them testable is a code smell, in my experience. It has correlated highly with cultish adherence to testing for the sake of testing, rather than to achieve an engineering or quality goal.
It might be reducing "cyclomatic complexity" but not necessarily overall complexity. Often the opposite is true: reduced locality imposes its own costs in maintenance and efficiency for example.
I don't think that makes sense. If I had a longish method or a task that could be decomposed into smaller pieces to clearly identify what's happening, I would prefer to break it up and test those pieces if it made sense. Obviously this depends on how you do it. DDD talks about this in terms of services and such, which might be what your long function really is. In which case you _do_ have to test this large chunk of logic.
It isn't harder to test, necessarily. It's harder to unit-test. The distinction is important. Perhaps a single test that TDD people might call an "integration test" suffices for that function. Often it does, in fact.
An integration test isn't a test for a function. It's a test of the system as a whole. On a per-function level, more code in a function makes it harder to test, whether you are unit testing it or somehow manually testing it.