I’m still not following what the issue is. If you refactor some code and change its behaviour, and a test that checks the expected behaviour now fails, then you have one of two problems:
1. You had a bug you didn’t know about and your test was invalid (in which case the test is useless! Fix the issue, then fix the test…)
or
2. You had no bug and you just introduced a new one, in which case the test has done its job and alerted you to the problem so you can fix your mistake.
What is the exact problem?
Now if this is an issue with changing the behaviour of the system, that’s not a refactor. In that case, your tests are testing old behaviour, and yes, they are going to have to be changed.
The point is that you're not changing the interface to the system; you're changing implementation details that don't affect the interface semantics. TDD does lead you to a sort of coupling to implementation details, which results in breaking a lot of unit tests when you change those details. What this yields is hesitancy to undertake positive refactorings, because you have to either update all of those tests or delete them altogether; and if deleting them is an option, were they really useful to begin with? The upshot is that it's apparently wasted work, and possibly an active impediment to positive change, and I haven't seen much discussion around avoiding this outcome, or what to do about it.
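To make the kind of coupling concrete, here's a minimal sketch (Python, with hypothetical names): both tests pass today, but the second one breaks the moment you inline or rename the helper, even though the observable behaviour hasn't changed.

```python
from unittest.mock import patch

def _base_price(item):
    return item["price"] * item["qty"]

def order_total(items):
    # Implementation detail: delegates to a private helper.
    return sum(_base_price(i) for i in items)

# Behaviour-level test: only inputs and outputs. Survives refactoring.
def test_order_total_behaviour():
    assert order_total([{"price": 10, "qty": 2}]) == 20

# Implementation-level test: asserts *how* the result is computed.
# Inline or rename _base_price and this breaks, with behaviour unchanged.
def test_order_total_uses_helper():
    with patch(__name__ + "._base_price", return_value=20) as helper:
        assert order_total([{"price": 10, "qty": 2}]) == 20
        helper.assert_called_once()
```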
This was being discussed more than a decade ago by people like Dan North and Liz Keogh. I think it’s widely accepted that strict TDD can reduce agility when projects face a lot of uncertainty and flux (at both the requirements and implementation levels). I will maintain that functional and integration tests are more effective than low-level unit tests in most cases, because they’re more likely to test things customers care about directly, and are less volatile than implementation-level specifics. But there’s no free lunch; all we’re ever trying to do is get value for our investment of time and reduce what risks we can. Sometimes you’ll work on projects where you build low-level capabilities that are very valuable, and the actual requirements vary wildly as stakeholders navigate uncertainty. In those cases you’re glad to have solid foundations even if everything above is quite wobbly. Time, change and uncertainty are part of your domain, and you have to reason about them the same as everything else.
> I will maintain that functional and integration tests are more effective than low-level unit tests in most cases
Right, that's pretty much the only advice I've seen that makes sense. The only possible issue is that these tests may have a broader state space, so you may not be able to exhaustively test all cases.
Absolutely right. If you’re lucky, those are areas where you can capture the complexity in some sort of policy or calculator class and use property-based testing to cover as much as possible - that’s a level of unit testing I’m definitely on board with. Sometimes it’s enough to just trust that your functional tests react appropriately to different _types_ of output from those classes (mocked), without having to drive every possible case (as you might have seen done in tabular test cases). For example, I have an app that tests various ways of fetching and visualising data, and one output is via k-means clustering. I test that the right number of clusters gets displayed, but I would never test the correctness of the actual clustering at that level. Treat complexity the same way you treat external dependencies: as something to be contained carefully.
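As a rough sketch of containing that complexity (hypothetical names, using the hypothesis library for the property-based part): the calculator class gets exhaustive property-based coverage in isolation, and the functional tests above it only need a few representative (or mocked) outputs.

```python
from hypothesis import given, strategies as st

class DiscountPolicy:
    """Pure calculation, isolated so it can be tested exhaustively."""
    def discount(self, spend: float) -> float:
        rate = 0.10 if spend >= 100 else 0.0
        return round(spend * rate, 2)

# Property-based test: hypothesis generates the cases rather than a table.
@given(st.floats(min_value=0, max_value=1e6, allow_nan=False))
def test_discount_is_bounded(spend):
    d = DiscountPolicy().discount(spend)
    assert 0 <= d <= spend
```

At the functional-test level you'd then stub the policy to return a couple of representative values, rather than re-driving every case.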
Why does testing behavior matter? I don’t care if my tests exhaustively test each if branch of the code to make sure the correct function gets called when entering that branch. That’s inane.
I care about whether the code is correct. A more concrete example: say I’m testing a float-to-string function. I don’t care how it converts the floating-point binary value 1.23 into the string representation “1.23”. All I care about is that it correctly turns that binary value into the correct string. I also care about the edge cases. Does 0.1E-20 correctly use scientific notation? What about rounding behavior? Is this converter intended to represent binary numbers with perfect precision, or is precision loss ok?
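As a sketch of what behaviour-level tests for that converter could look like (the float_to_str function here is a hypothetical stand-in, delegating to Python's repr): the tests pin down observable outputs and edge cases, and say nothing about how the conversion is done.

```python
# Hypothetical stand-in for the converter under test; swap in the real
# implementation. The tests only look at inputs and outputs.
def float_to_str(x: float) -> str:
    return repr(x)

def test_simple_value():
    assert float_to_str(1.23) == "1.23"

def test_tiny_value_uses_scientific_notation():
    assert "e-21" in float_to_str(0.1e-20)

def test_round_trip_preserves_value():
    # One possible precision contract: parsing the output returns the same float.
    assert float(float_to_str(1.23)) == 1.23
```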
If your tests simply check that you call the log function and the power function x times, your tests are crap. And this is what I believe the parent commenter was talking about. All too often, tests are written to fulfill arbitrary code coverage requirements or to obsequiously adhere to a paradigm like TDD. These are bad tests, because they’ll break when you refactor code.
One last example: I recently wrote a code syntax highlighter. I had dozens of test cases that essentially tested the system end to end and made sure that if I parsed a code block, I ended up with a tree of styles that looked a certain way. I recently had to refactor it to accommodate some new rules, and it was painless and easy. I could try stuff out, run my tests, and very quickly validate that my changes did not break previously correct behavior. This is probably the most value I’ve gotten out of testing so far in my coding career.
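For flavour, a rough sketch of that style of end-to-end test (the toy highlight function below is made up just to show the shape): parse a snippet, compare the resulting spans of styles, and leave the internals free to change.

```python
import re

# Toy stand-in for the real highlighter: the test only looks at the
# resulting (style, text) spans, not at how they were produced.
def highlight(src: str):
    spans = []
    for tok in re.findall(r'"[^"]*"|\w+|\s+|\S', src):
        if tok.startswith('"'):
            spans.append(("string", tok))
        elif tok in {"let", "if", "else"}:
            spans.append(("keyword", tok))
        else:
            spans.append(("text", tok))
    return spans

def test_keyword_and_string_spans():
    assert highlight('let x = "hi"') == [
        ("keyword", "let"),
        ("text", " "),
        ("text", "x"),
        ("text", " "),
        ("text", "="),
        ("text", " "),
        ("string", '"hi"'),
    ]
```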