"But thanks to the Pareto principle, you find in practice that with well-factored software you need fewer tests than you'd imagine."
Thanks to the "Devs are Human" principle, you never have the test coverage you thought you'd have. I've never worked on a project where we've thought, "Wow, turns out that we needed fewer tests than we thought."
"If you have well-factored abstraction layers, each of which has good unit tests for the abstractions it expects from below and for the abstractions it promises to export to above, the likelyhood of random combinations breaking things goes way down. Sure, a random dangling pointer from one driver can scribble over random memory for anywhere else, but there are techniques for catching that type of mistake."
Well-factored abstraction layers do help. That's the only way you write 100-million-line applications that run on over a billion computers with over 100,000 unique configurations. The problem is team size: with dev teams as large as the ones on Windows (or presumably at Google), even if each dev makes only one mistake every five years, and each mistake takes half a day to resolve, the checked-in version is ALWAYS broken. Even with great testing, abstraction layers, etc., mistakes will happen -- sometimes, for a given dev, even more often than once every five years.
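The arithmetic behind "the checked-in version is ALWAYS broken" is easy to sketch. The numbers below (team sizes, mistake rate, resolution time) are illustrative assumptions for the back-of-envelope, not figures from the discussion:

```python
# Toy model: fraction of workdays on which the shared branch has at
# least one unresolved break, assuming breaks arrive independently at a
# steady rate and each takes a fixed time to fix.

WORKDAYS_PER_YEAR = 250

def broken_fraction(devs, mistakes_per_dev_per_year, days_to_resolve):
    """Expected fraction of workdays the checked-in version is broken
    (capped at 1.0, since broken-days can't exceed all days)."""
    breaks_per_year = devs * mistakes_per_dev_per_year
    broken_days_per_year = breaks_per_year * days_to_resolve
    return min(1.0, broken_days_per_year / WORKDAYS_PER_YEAR)

rate = 1 / 5   # one mistake per dev every five years
resolve = 0.5  # half a day to resolve each break

# Small team: the branch is rarely broken.
print(broken_fraction(100, rate, resolve))   # 0.04

# Windows-scale team: the branch is effectively always broken.
print(broken_fraction(5000, rate, resolve))  # 1.0
```

The point of the sketch is that the per-dev mistake rate barely matters at scale; expected broken time grows linearly with head count, so a rate that is negligible for one dev saturates the shared branch for thousands.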
"In fact the smaller and more rapid the integration cycle is, the less overall pain there seems to be associated with them. It isn't that you do 100 integrations with 1% of the pain each time. It is that you do 100 integrations with 0.1% of the pain each time, which means your overall integration pain is a tenth of what it used to be."
All you've done here is require me to integrate 10x more often. Checking in and requiring a sync after every line of code doesn't fix the underlying issue; it just makes the process a lot more tedious before we hit the same eventual problem.
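The quoted claim rests on an unstated assumption: that integration pain grows superlinearly with the amount of unintegrated change. A toy model makes the disagreement concrete; the exponent below is my assumption chosen to reproduce the quote's numbers, not anything measured:

```python
# Toy model of integration pain. One unit of total change is merged in
# equal chunks; the pain of a chunk of size s is modeled as s ** exponent.
# With exponent > 1 (the quote's implicit assumption), many small merges
# cost less in total than one big-bang merge. With exponent == 1, total
# pain is identical either way -- which is the rebuttal's position.

def total_pain(num_integrations, exponent=1.5):
    """Total pain of integrating one unit of change in equal chunks."""
    chunk = 1.0 / num_integrations
    return num_integrations * chunk ** exponent

# exponent=1.5 reproduces the quote exactly: 100 integrations, each
# 0.1% of the big-bang pain, totaling one tenth of it.
print(total_pain(1))                  # 1.0
print(total_pain(100))                # 0.1

# exponent=1.0 models the rebuttal: splitting the work changes nothing.
print(total_pain(100, exponent=1.0))  # 1.0
```

Whether the exponent is really above 1 for a given codebase is exactly what the two sides of this exchange are disputing.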
"I strongly believe that you are over-estimating the natural immunity that Google has to this type of problem."
Why wouldn't cloud-based services be immune to it? Your millions of lines of code run on a small set of easily testable configurations. And based on what I've seen working on other, admittedly smaller-scale cloud services, this is exactly the case. It's just an easier problem. There's nothing wrong with that; in fact, it's a great selling point.
"For an interesting public example, look at the LLVM project. They produce critical software that OS X is dependent on, that operates in layers where each piece depends on the ones below it, that is multi-platform, and do it with developing on one branch with rapid integration."
But LLVM is a product that is largely isolated from the system, right? It reads input from a very standard source and writes output. gcc, Visual C++, and the Intel compiler are probably all one-branch systems too.
Linux is probably a better example, as they don't sit on an abstraction layer so much as they are the abstraction layer. Is that one branch for a whole distribution? Probably not, for at least the obvious reason that many parts of a distribution aren't even created by the Linux team.
Thanks to the "Devs are Human" principle, you never have the test coverage you thought you'd have. I've never worked on a project where we've thought, "Wow, turns out that we needed fewer tests than we thought."
"If you have well-factored abstraction layers, each of which has good unit tests for the abstractions it expects from below and for the abstractions it promises to export to above, the likelyhood of random combinations breaking things goes way down. Sure, a random dangling pointer from one driver can scribble over random memory for anywhere else, but there are techniques for catching that type of mistake."
Well-factored abstraction layers do help. That's the only way you write 100 million line applications that run on over a billion computers with over 100,000 unique configurations. The problem is that if you have the size of dev teams you have on Windows (or presumably at Google), if a dev makes a mistake once every five years, that takes half a day to resolve, the checked-in version is ALWAYS broken. Even with great testing, abstraction layers, etc... mistakes will happen -- sometimes for a given dev even more often than once every five years.
"In fact the smaller and more rapid the integration cycle is, the less overall pain there seems to be associated with them. It isn't that you do 100 integrations with 1% of the pain each time. It is that you do 100 integrations with 0.1% of the pain each time, which means your overall integration pain is a tenth of what it used to be."
The only thing you've done here is required me to integrate 10x more often. Checking in and requiring syncing after you write each line of code doesn't fix the issue, it just makes the process a lot more tedious as we hit our eventual problem.
"I strongly believe that you are over-estimating the natural immunity that Google has to this type of problem."
Why wouldn't cloud-based services be immune to it? Your millions of lines of code run over a small set of easily testable configurations. And based on what I've seen working on other, admittedly smaller scale cloud services, this is exactly the case. It's just an easier problem. There's nothing wrong with that. In fact its a great selling point..
"For an interesting public example, look at the LLVM project. They produce critical software that OS X is dependent on, that operates in layers where each piece depends on the ones below it, that is multi-platform, and do it with developing on one branch with rapid integration."
But LLVM is a product that is largely isolated from the system, right? It reads in input from a very standard source and writes out output. gcc, Visual C++, and the Intel compiler are probably all in one branch systems.
Linux is probably a better example, as they don't sit on an abstraction layer so much as they are the abstraction layer. Is that one branch for a whole distribution? Probably not, for at least the obvious reason that many parts of a distribution aren't even created by the Linux team.