Meh. I keep running into this thing.. I want to test a (legacy) system but it wasn't designed with testability in mind. Every time I look for advice, I run into a stumbling block where the advice seems to apply to systems that were designed to be testable.
In this case, right at step 1. Capturing the output and making sense of it while simultaneously allowing for normal variation e.g. due to timing is one heck of a big job and requires a lot of mocking. It's not a job that can be done in hours; probably weeks, if you want something that doesn't get thrown away the next time you have time to spend on testing.
With these systems, I've often found that black-box testing is what is required. This is at a systems level, even higher than this blog suggests. Yes, there are specific error scenarios and timing situations you may not cover with such tests. However, if you tackle the system holistically and make your tests from an external actor perspective, you can get a long way without concerning yourself with implementation details beyond the ones needed to deploy it and all dependencies in a repeatable environment. It does take time to build the harness to have reusable test pieces, but it's often worth it. As a bonus, you'll have a specification to meet as you try to replace the system (though the suite is rarely enough to give confidence on its own).
Yes, that's where I'd like to start (and I sort of did, but the approach I took was nowhere near good enough). Like you say, it does take time to build the harness. Kind of a hard sell with our client..
I'd need to mock hardware I don't have, possibly mock some kernel interfaces, set up a physical test endpoint that can observe things on the wire (as in raw ethernet packets, but also application level protocols..). And well, really capturing timing issues and such is kinda important because things like flow control, multipath routing & failover are critical features of the software.
Of course some of the things you want to test are inherently timing related issues. Like, a certain signal must be asserted if a particular message isn't received within a specified (possibly configurable or not) time window after some event.
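For what it's worth, that particular shape of requirement (assert a signal if a message doesn't arrive within a window after an event) is one of the timing behaviours tests can handle well, provided the code takes its clock from somewhere substitutable. A minimal sketch of the shape of such a test, using Jest fake timers; the Watchdog component here is a hypothetical stand-in:

```js
// Hypothetical watchdog: armed by an event, must assert a signal if no
// matching message arrives within the configured window.
const { Watchdog } = require("../src/watchdog"); // hypothetical module

test("asserts ALARM if no ACK arrives within 500 ms of the event", () => {
  jest.useFakeTimers();
  const signals = [];
  const dog = new Watchdog({ windowMs: 500, onSignal: (s) => signals.push(s) });

  dog.onEvent("REQUEST_SENT");          // starts the 500 ms window
  jest.advanceTimersByTime(499);
  expect(signals).toEqual([]);          // still inside the window: no signal

  jest.advanceTimersByTime(1);          // window elapses with no message
  expect(signals).toEqual(["ALARM"]);   // the signal must now be asserted
});
```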
The core of the software is a heavily asynchronous & multithreaded mess..
Getting all the scaffolding in place and then automating everything is going to require lots and lots of time. And reverse engineering.
I tried playing with bash scripts and docker based virtual deployments but it completely misses the hardware side of things and really doesn't get me far.
I did this once on a large project I recently joined. It took a long time to build these sophisticated mocks (3-4 weeks) and it was a fairly hard sell but it did have the advantage that it's somewhat immune from mythical man month issues.
That is, you can hire a new person, they can do all this work and it doesn't affect the speed that the rest of the team delivers at much. They don't need to know extensive amounts about the underlying core code - just the way it interfaces with the outside world.
It's also a pretty effective way of onboarding somebody on to the project.
That isn't the case for any kind of unit testing approach, where you have to be able to work with, understand and refactor the core code. That requires core dev team members and will take their attention away from customer facing tasks.
Hm, do you know of any tool/framework for capturing and then re-running (mocking) syscall responses? Could be interesting to keep in a toolbox for future...
What I run into, over and over again, is code that can't be executed in isolation. I mostly work in Java, and I keep ending up maintaining codebases where nearly everything is declared "static"... including the code that connects to databases and remote services and initializes data to start the service. I've seen this "pattern" across four different employers over two decades now.
That approach is almost guaranteed to introduce bugs into complex systems. It’s fine if you’re doing something directly useful at the same time, but just be aware of the risks.
Though that depends on what you mean by legacy. If nobody's touched it in 5+ years, it may well stay unchanged until the system is completely replaced with something else. It's well worth adding tests if something is or will be in active development.
> BUT you know that you should. If you don’t, you’re making things worse! If you don’t write the tests, you may even break something and you won’t realize!!!
Slowly adding tests is fine if your goal is to increase the total code under test, but it's not going to give you much confidence that you didn't just break something when making updates to the system.
This is precisely the focus of the book "Working Effectively With Legacy Code", and my impression on reading it was that the advice seemed sensible (although I haven't tested any of it deliberately). Worth looking into if you haven't.
I don't really understand the obsession around snapshot testing. Seems like a lazy way of doing testing that doesn't add any confidence that you're doing the right thing. It just adds confidence that the initial value you got the snapshot from is the same going forward, which is not super valuable.
Snapshot testing is basically what used to be referred to as "master-knows testing" (don't remember if this is the exact term), where only the one who initially created the test knows if it's correct, because the intent is not exposed at all. Instead, a properly written unit test exposes what you want that unit to do, and focuses on only that part, so you can change things in the unit without breaking those specific parts. Snapshot testing would ruin this.
> You want to add tests, but you don’t have time to!
This also seems like a weird thing. The idea behind testing is to save you time. If you're taking longer because you're adding tests, you're doing something wrong! The whole idea is to make the feedback on whether something is working faster, not slower.
Snapshot testing is useful for catching regressions and nothing else. It’s better than nothing, and in some cases is the only thing you can do if you don’t properly understand what all the inputs and outputs of a system mean.
For example if working with a large legacy codebase, it’s important to maintain previous functionality and only change it if you can rationalize why. People and other systems might be relying on something that’s a bug, or a side effect that’s not obvious.
If you don’t do this, then you can either choose to have no tests, or have to tear down the whole system to understand it.
>Snapshot testing is useful for catching regressions and nothing else.
Visualizing and eyeballing snapshots in behavioral tests is a highly effective way of catching bugs and defining behavior (edit code - run test - verify output is correct, if it is, lock it down).
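Concretely, inline snapshots make that edit/run/eyeball/lock-down loop very tight, because the framework writes the observed output into the test file where you can read it. A minimal sketch, assuming Jest; renderInvoice is a hypothetical function under test:

```js
const { renderInvoice } = require("../src/invoice"); // hypothetical module under test

test("renders a basic invoice", () => {
  const output = renderInvoice({ customer: "ACME", items: [{ sku: "A1", qty: 2 }] });
  // On the first run Jest writes the actual output into this call; you read
  // it, decide whether it looks right, and commit it. From then on it is
  // locked down and any change shows up as a readable diff.
  expect(output).toMatchInlineSnapshot();
});
```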
I find them to be vastly more effective and cheaper to build than unit tests even on greenfield code bases.
The painful part is "verify output is correct": only the person who initially wrote the test can actually say whether it's correct or not.
If a new person joins the codebase and sees that a snapshot is now different, how do they know what's correct? What I've seen in the wild is 1) talk with the person who authored the test, or 2) just say yes and move on.
I mean, you should have the context of your change, right? If you're submitting a change to reduce the margins on everything, you should expect to need a new snapshot.
If you're doing a pure refactor of your css, you shouldn't see a change. Unless your css rules are order dependent, and now you've caught that.
> This also seems like a weird thing. The idea behind testing is to save you time. If you're taking longer because you're adding tests, you're doing something wrong! The whole idea is to make the feedback on whether something is working faster, not slower.
This is true when you're writing the code initially, but the fact is some people don't write tests, and then you might inherit a huge pile of untested code. Adding tests to it would take a lot of time, and sometimes you need to start making changes fast.
When there are no tests it's very likely you won't be able to add tests without changing the code to be something that you can test. This is where snapshot testing shines. Snapshot tests are the only type of test you can add without needing to modify any of the code. You can add it in easily, and it gives you some confidence that you're not breaking things.
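That "no modification needed" property is the whole appeal: you import the legacy entry point as-is, feed it an input captured from production or from logs, and pin whatever comes out. A minimal sketch, assuming Jest; the module and fixture names are hypothetical:

```js
const { processOrder } = require("../legacy/orders");            // untouched legacy code
const capturedInput = require("./fixtures/captured-order.json"); // real input captured from prod/logs

test("processOrder output is pinned for a captured input", () => {
  // Whatever the legacy code does today becomes the expected value; any
  // future change to this output fails the test and forces a decision.
  expect(processOrder(capturedInput)).toMatchSnapshot();
});
```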
> When there are no tests it's very likely you won't be able to add tests without changing the code to be something that you can test
Indeed, it will most likely require refactoring the code to support the testing. But the initial idea of _why_ you want to add testing there in the first place is so you can refactor with confidence and not have to manually test all the cases. Compare "refactor to test -> add tests -> refactor behaviour" with "refactor behaviour -> manual tests" and the first one will most likely be faster.
A few years ago our team rewrote a critical part of our system in a more resource efficient language to cut a large chunk out of our AWS bill.
The system handled around 500K requests per minute. The inputs/outputs of the system looked like this.
1. Receive HTTP request
2. Make a bunch of HTTP requests
- Call external web services for business logic
- Call reporting services for analytics
3. Render result
On a single production box of the legacy system, we instrumented the http clients to dump the requests/replies to disk for 15 minutes.
Using the snapshots, we wrote some scripts to allow us to replay the data through the new system. It was a totally ad hoc manual process that took about a week. It found a TON of bugs and greatly increased our confidence in the new system. After we were done we threw it away.
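The core of such a replay harness is small: for each captured exchange, send the request to the new system and diff the reply against the captured one. A rough sketch of that shape (file layout, endpoint, and the naive body comparison are all assumptions; a real harness would also need to normalise legitimate variation like timestamps and IDs before diffing):

```js
// replay.js - push captured request/response pairs through the new system
// (Node 18+ for the global fetch)
const fs = require("fs");
const path = require("path");

const CAPTURE_DIR = "captures";              // one JSON file per captured exchange
const NEW_SYSTEM = "http://localhost:8080";  // candidate system under test

async function main() {
  let mismatches = 0;
  for (const file of fs.readdirSync(CAPTURE_DIR)) {
    const { request, response } = JSON.parse(
      fs.readFileSync(path.join(CAPTURE_DIR, file), "utf8")
    );
    const res = await fetch(NEW_SYSTEM + request.path, {
      method: request.method,
      headers: request.headers,
      body: request.body,
    });
    const body = await res.text();
    if (body !== response.body) {            // naive; normalise variation first in practice
      mismatches++;
      console.log(`MISMATCH ${file}: ${request.method} ${request.path}`);
    }
  }
  console.log(`${mismatches} mismatching replies`);
}

main();
```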
I am convinced that standard unit testing alone would not have found many of these bugs, and they would have surfaced in production.
The headline of the article should be a hint. "3 steps to add tests on existing code when you have short deadlines"
So, if you're on a short deadline and maybe you didn't write the code the first place, so you might not be familiar with the internals, snapshot testing is the quickest way to get confidence that you aren't causing further regressions.
These aren't meant to be used forever more. Just to stop things from sliding backwards.
After all, the title of the site is understandlegacycode.com. It's all about getting to a point where you can refactor and do things properly when you do get time.
If that doesn't make sense, what would you suggest to do?
I made a similar objection the last time this came up (might have been on another site.) Someone pointed out that they can still be useful for confirming that refactoring for readability/testability has not changed behavior, at least for the paths the snapshots exercise. That could be valuable when dealing with a hairy code base. Still, you wouldn't want to keep such tests around for long.
Enough with the memes. I enjoyed the Boromir meme the first thousand times, but I'm only clicking this article because I want to see what you have to say, not to browse r/memes lite. It's also a poor deflection of the fact that a deep topic such as this is getting such a light treatment. If the article isn't much in plain text, the gifs, colours and box shadows don't make it good.
This is a shortcut to achieve better coverage, brittle, possibly wrong, and not maintainable; and it is not really giving you any additional confidence on the code. If you're pressed for time, why waste it on something with close to zero value? Coverage by itself is an empty goal, you're better off with no tests than this.
Suppose you're the first person to touch this code in a while. You don't really know how solid it is, but given there's no tests you would be justified in being a bit nervous.
Now you need to change it.
Any future bug discovered is probably going to come back to you, given you are now the last person to touch it and the perceived expert on this legacy monstrosity. Having proof of how the code worked before you touched it not only prevents regressions while you're working on it, but also prevents having to do this sleuthing on the fly with someone wagging a finger in your face. Not only that, but depending on dependency drift you may not be able to easily go back and prove the old functionality after the fact, so having this in a CI log will be valuable for your sanity and reputation.
I don't think that scenario is realistic. If you're making a change that doesn't alter a function's output, why is the code for that function being changed at all? That would only be possible if the snapshot didn't capture all the relevant outputs to begin with.
Most likely the output will change and you'll be left with a random blob of data to figure out what went wrong instead of clear specifications.
If your goal is "I have a blackbox of code, I must make a change, and I will assume existing behaviors are correct", this approach should be reasonable to ensure that the code doesn't break after your change.
You don't have an external spec to test against in this case, the existing implementation details are the spec, and so the goal is to preserve them.
I'd say going through:
* Find one input
* Get coverage
* Mutation test to ensure you have assertion density
is probably a fine way to accomplish the goal.
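To make the mutation step concrete: even without a dedicated mutation-testing tool, the idea can be sketched by hand as "break one thing on purpose, rerun the suite, and make sure it fails". A rough sketch; the file name and the operator flip are hypothetical:

```js
// mutation-smoke-test.js - crude, single-mutant version of mutation testing
const { execSync } = require("child_process");
const fs = require("fs");

const file = "src/pricing.js";               // hypothetical module under test
const original = fs.readFileSync(file, "utf8");
const mutant = original.replace(">=", "<");  // flip one operator on purpose

fs.writeFileSync(file, mutant);
try {
  execSync("npx jest --silent", { stdio: "ignore" });  // the suite should FAIL now
  console.log("Mutant SURVIVED: the tests do not constrain this code path.");
} catch {
  console.log("Mutant killed: the tests notice when this behaviour changes.");
} finally {
  fs.writeFileSync(file, original);          // always restore the original source
}
```

Proper mutation-testing tools automate exactly that loop over many generated mutants; the hand-rolled version is just to show what "assertion density" is measuring.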
Eventually you'd want to define an external specification for the code that isn't just "the service works, therefor ensure the service does what it's been doing", at which point some of these tests may be removed, or at least enhanced.
That's part of the problem. The original test was created based on implemented behaviour and not test/functional requirements, so you don't have any confidence that what you're preserving is the right behaviour. It tells you you haven't changed the output, but once you've covered your codebase with these, and make a commit that triggers 46 snapshot failures, you have zero knowledge to decide whether that change is desirable or not.
In the absence of a specification, documentation, or access to the people who originally implemented the program, the program itself is the spec. With no way to decide what the program "should" be doing, whatever it's doing right now is the right behavior.
The point is that the existing behavior is de-facto correct by virtue of being the existing behavior.
The goal here is not "Prove this software meets a spec", it is "This software is the spec, and we must not regress". There are many more systems that act as a de facto spec than are built to an external one.
If you choose to then move to a world where you want an external spec, as I said, at that point some of these tests will be thrown away or refactored. Most systems never make it to that stage.
This is especially common when you have legacy software where the person or team that wrote the code is no longer there. It takes a monstrous amount of effort to derive a meaningful spec, and step 0 is always preserving existing behavior.
Often I see tests that spend a lot of time mocking internal components of the application. Such tests are verifying the implementation details instead of the functionality. They often need to be thrown out if you do some refactoring, even if that refactoring didn't change the behavior of the program at all. The kind of test described in this article should work just fine in that case, achieving the desired goal of showing whether your refactor caused any behavior changes.
In my experience, tests like this actually aren’t hard to maintain. The test can print out a diff between the expected and actual result when there is a failure, so you can check if the new behavior is expected and easily copy-paste to update the test if desired.
I like mocking internals because I can control the flow of the logic.
I will write a mocking test to ensure a business flow is followed at the test level just because I have fallen too many times for a developer updating snapshots because they made a change... but they update all the snapshots even the ones that show they introduced a new bug.
It's not meant to give you confidence in the code. It's meant to give you a quick test to make sure you don't break something while you change the code, if you have to make changes to code that lacks any tests at all.
As the author said:
> It’s fine to delete useless tests afterwards. They were useful when you had to change the code. They’re not here to replace proper tests in your codebase!
> It's meant to [...] make sure you don't break something
It helps you see if you didn't break 'something', but 1) you have no clue what that something is by reading the test 2) you have no information to fix it. Plus the snapshot itself might have already captured the wrong implementation. It's a false sense of security.
Snapshots are perfectly acceptable for testing Legacy Apps.
If it truly is a legacy app, snapshots should rarely be updated (because you're not updating the code) and any introduced bugs will trip a snapshot. The issue is always about people blindly updating snapshots because 'I made a change and the snapshots are wrong'.
If the failed tests don't line up with what you believe you changed, you are the reason it's Legacy Code.
It allows you to refactor your code and tell whether or not you have changed behaviour. This is practically what unit tests are for, full stop. It also allows you to modify behaviour and show that the behaviour has been modified as you expected without writing any additional tests. If you are finding them too brittle, then you can remove the ones that are bothering you and replace them with better tests.
It's a very good technique that I've used many times in the past.
So code that is in production should have no protection from inadvertent changes because that is worse than no tests at all? Wow. Do you code for a living?
I disagree that tests of this form are worse than no tests at all. It depends on the purpose of the test suite. The article itself mentions most of these disadvantages to this approach at the end. But recall, the stated purpose of the test suite in the article is "regression", in the context of a legacy code situation. This isn't a situation we'd aspire to be in, but it is a useful tool to help locally improve a bad situation where you are required to change legacy code.
Ideally, in a healthy software project, you want to define automated tests where each test checks that an invariant is satisfied by the implementation, and links that invariant to requirements that the software is meant to implement. Then, if a test fails, you know that one of your functional requirements has been broken, in at least some scenario. If the tests and test harness are well structured, it should be very quick and easy for you to learn which requirement has been broken, in exactly what scenario. Fantastic. You can take that information and iterate, and try again.
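As a toy illustration of what "a test checks an invariant and links it to a requirement" can look like, here is a minimal sketch; the function under test and the requirement ID are made up:

```js
const { applyDiscount } = require("../src/cart"); // hypothetical function under test

// Made-up requirement ID: a discount may never push the total below zero.
test("REQ-PRICING-07: discounted total is never negative", () => {
  for (const total of [0, 1, 9.99, 100]) {
    for (const discount of [0, 0.5, 1, 2]) {   // includes an over-100% discount
      expect(applyDiscount(total, discount)).toBeGreaterThanOrEqual(0);
    }
  }
});
```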
Unfortunately, in a legacy code situation, we commonly would have no idea what the functional requirements of the software were, or what the relationship is between the functional requirements and the code. There is often no existing automated test suite. If we are under pressure to make a change to the codebase, having a dumb automated regression test suite that can check if _the behaviour of the code is the same as it was before_ is very valuable. Note that the purpose of this regression test suite is very different to the purpose of the earlier test suite that we would build in a healthy software project.
You can think about the feedback from an automated test suite as giving some information as you modify the code and move about the space of possible programs. A very high quality test suite will give you feedback that helps you steer in the right direction: towards modifications that produce possible programs that do not break any existing requirements.

However, in a legacy code situation, we often don't have that luxury of knowing any requirements, or knowing internal invariants that the existing code is meant to maintain. So in this case, the feedback from our automated regression test suite will not be very informative: it does not give us a signal about whether we are moving in a direction that breaks requirements or not. It does, however, let us know if our change (from our tiny bug fix / high priority feature we're trying to jam in) causes the behaviour of the existing system to change in a way we did not otherwise expect. I.e. we can use it to detect if we are "moving" in the space of behaviours of the program.

If a regression test breaks, this alerts us to an unplanned consequence of our change, so we can abort rolling the change out and go back to the drawing board with more analysis and debugging, to try to understand what the system is doing and whether we actually want the behaviour to change or not.
This is some of the best testing advice I've read. I follow the first two points with enormous success. The third idea seems like it would solve a lot of the cases where i've stumbled. Much appreciated!
Every time I forget there is always enough time to write tests, I quickly find I’ve been grinding gears trying to sort something out that a test would have helped me visualize much faster.
>> You can’t afford spending days to write the tests that should be here in the first place.
We had an outage because of some code that got pushed to production, and no tests were written for it. It was a very simple bug, basically a list was being appended to instead of replaced. Nobody noticed for a while because of the right data 'being there', which made things worse because a lot of things had to be re-worked after the bug was fixed, as they were going off the bad data.
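For what it's worth, a bug of that shape is about the cheapest thing there is to cover; a hypothetical reconstruction of it, plus the one assertion that would have caught it:

```js
// Hypothetical reconstruction: the latest batch should replace the old one,
// but the buggy version kept appending to it.
let results = [];

function storeResults(latest) {
  results.push(...latest);    // BUG: appends to the previous batch
  // results = [...latest];   // intended behaviour: replace the previous batch
}

test("storeResults replaces the previous batch", () => {
  storeResults([1, 2]);
  storeResults([3, 4]);
  expect(results).toEqual([3, 4]); // buggy version yields [1, 2, 3, 4]
});
```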
I mentioned that any new code should have tests. And the manager replied with "We don't have time to write tests", and I had to bite my tongue to not reply, "because we're spending all our time fixing simple bugs and reworking data"
I don't understand the line of thought that automated tests are too time consuming to add. You know what's more time consuming? Manual testing. It takes forever. And if you have to fix a bug somewhere, have fun manually testing everything again, if you can even remember all the manual tests you should be running.
For example, a project I work on is available in Asia. They put their last name first. So anything related to names needs 2 tests to make sure it works right. Oh, there is legacy data, so customers might be missing first, last, or both names. So add another 3 tests onto the pile. You're looking at 5 tests right there.
Let's say the feature you're working on related to names has 3 dimensions. Well, now you potentially have to run 5*3=15 tests. If one of those fails and you bugfix, you have to re-run those tests. Have fun with that if your tests are manual.
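This is exactly where parameterized tests earn their keep: write the name variants down once and reuse them across every dimension. A sketch with Jest's test.each; formatDisplayName is a hypothetical stand-in for whatever the feature actually does:

```js
const { formatDisplayName } = require("../src/names"); // hypothetical function under test

// The five name situations described above, written down once.
const nameCases = [
  ["western order",      { first: "Ada",     last: "Lovelace" }],
  ["last name first",    { first: "Takeshi", last: "Yamada", lastNameFirst: true }],
  ["missing first name", { first: null,      last: "Curie" }],
  ["missing last name",  { first: "Grace",   last: null }],
  ["missing both names", { first: null,      last: null }],
];

test.each(nameCases)("display name: %s", (_label, person) => {
  // Snapshots keep this cheap; swap in explicit expectations where the
  // correct answer is actually known.
  expect(formatDisplayName(person)).toMatchSnapshot();
});
```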
When I hear "add tests" it's always a red flag to me. Imho tests are not optional, they are implied, they are a tool to build your stuff (like scaffolding). If tests are added at the end, how would one know the code written before that was correct?
The later tests are added, the less of their value is leveraged.
IMO, where tests are adding most productivity is when they enable rapid refactoring. When you test with that goal in mind it doesn't really matter much where in the cycle tests are added.
Yes, it should not matter whether it's 3 or 30; once you start testing at the higher level/API, some classes don't need a separate test, because they are exercised when invoked by other tested code.
I'd argue that if you feel like you have to use snapshot testing to meet a deadline, you probably don't even need the test in the first place.
It's one thing if you're using snapshot testing like your average test, to make sure things don't break when you make changes. But if you're in a bind, is a snapshot test really better than your own eyes?
Maybe it's better to release the thing in that case and add better tests asynchronously to the deadline.
I suppose the snapshot test in a pinch is not so bad if it's just changes being made to legacy code. But I'm not sure I'd do such a thing when building new features.
It seems the author is using legacy code to mean "bad code I didn't write". My experience with legacy code is that it exists on one server set up by someone who left the company 10 years ago, only runs on a particular version of a runtime that isn't available to download any more, and if you run it locally, it makes hard coded calls to the live environment, breaking it. Having a unit test framework is the dream.
A "snapshot test" like this would save me time if I wanted to check that the same input yields the same output for some module. -- I can see wanting that if I'm adding new code which doesn't change the old part of the module much, or if I'm changing the implementation (but not the interface) of the module.
But, yes, in many cases it's not a good automated check.
I like to "make the change easy, then make the easy change". This looks like 2 commits: 1st a refactor to make code more straightforward around the area I'm about to change, and 2nd to actually make the change. Ideally, the 2nd should be a small, local, easy to reason about change, enabled by code cleanup in the 1st.
The 1st commit in this sequence is a pure refactor, and definitionally should change no behavior. The "snapshot test" described sounds perfect for this, especially in unfamiliar code. Ideally, I'd go even further, and have a compiler prove for me that my refactor produced a perfectly equivalent program from a black-box perspective. Snapshot testing is great because it gets pretty close, very cheaply, whereas the full program equivalence problem is impossible.
When the snapshots are somewhat human readable this technique is also very useful for validating changes.
Hook up the last known good snapshot and the new output to your favorite diff UI and you'll get a pretty good idea about the actual effects of the code changes. I feel more confident signing off a new "known good" based on reviewing a full document diff than after iterating a set of meticulously hand-crafted expectations alongside the production code until convergence.
At a previous job, we had a law called "Peeler's Law", named after an engineer who was a top front-line bug-fixer: if you don't have time to add a test, that's when you need to add a test the most.
I find this to be a double-edged sword. It's undoubtedly true that the system needs tests, and if they can't be easily added, it needs refactoring. And it often saves time in the mid/long run.
All of this is true. But what's best for the system isn't necessarily best for the programmer. If you're under pressure to show results in order to keep your job or get a good performance review, buggy untestable code is better than having no code to show because you spent the time writing tests. Remember, the goal is to keep your job and/or get that raise.
"But," you may argue, "in that case your job is terrible and you should quit."
I often see this advice in the bubble that is HN. The reality is that a lot of programmers can't change their jobs that easily because of a multitude of problems (age, difficulty interviewing, social anxiety, bad economy, disabilities, pressing financial problems, etc).
Well, even if you decide not to quit the decision will be made for you if you choose to continue to develop a system without proper testing. Eventually your customers will fire you.
You would think so, but this is very often not true.
You'd be surprised by in how many businesses you can simply plod on with no repercussions, but if you try to test (and reduce your apparent immediate output) you can get a bad review or a PIP. In some, automated testing is not even something they are aware of. But what happens to the software, you say? Well, all software eventually ends up either working/used or not working/abandoned, and tests often have very little impact in this decision...
This was for example the case in the "shared services" unit of a MAJOR oil & energy company. Probably the first you will think of.
"X considered harmful", barf. What is with these pretentious pricks who feel the need to hold "strong opinions" (TM) to be considered a "senior" (TM)? Sometimes testing the black box is all you've got, that's it. A change-detector test is an absolutely valid way to prevent code from changing behavior; that seems to be a pretty basic need and a very common form of regression. If you haven't seen the case for when that might be necessary, then you live in unit-test heaven, or maybe you don't have a whole lot of experience dealing with other people's legacy code or with real business deadlines. Good fuckin luck!
I think that even the most "pretentious" opinions are worth learning about, but people need to have the confidence to go their own way and do their own thing when an opinion doesn't make sense to them. In programming, there are very few things that are objectively bad, and even then what's "bad" is often limited to who will be working on a given project.
The best example at the top of my mind is semicolons in JavaScript.
Opinions on semicolons seem to have changed a lot since I began to seriously learn JavaScript. I remember most people saying to "always use semicolons" without really explaining why. Some of the more seasoned developers would point out a few reasons why not using semicolons could lead to issues, yet even though anyone with a fair amount of knowledge about how JavaScript is interpreted could simply avoid those issues in the first place, people would still use said reasons to label semicolonless JavaScript code as "objectively bad". It didn't even matter to these people whether the code actually worked perfectly well, or if the code was just as readable, or if it was the preference of the authors of the code. The zealotry towards semicolons reached the point of obvious seniority signaling; most of these people just came off as pretentious and unhelpful.
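For reference, the issues those seasoned developers were pointing at are mostly automatic-semicolon-insertion corner cases like this one (a small, self-contained example):

```js
const total = 1 + 2
// Without a semicolon above, a line starting with "(" is parsed as a call
// continuing the previous statement, i.e. as:
//   const total = 1 + 2(function report() { ... })()
// which throws "TypeError: 2 is not a function" at runtime.
(function report() { console.log(total) })()
```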
If someone says you shouldn't do something, but what you want to do makes perfect sense to you, go ahead and do it. You might run into issues, or you might have no problems at all. I use semicolons in JavaScript today, but I never had problems when I used to write it without semicolons despite how I was told by everyone that it was "dangerous".
For applications where state drives output (i.e., every single webapp) I find that change detection is a decent chunk of issue detection.
Change Detection + Business Flow Testing will probably deal with your stakeholder requirements far cheaper than almost every other method of 'ensuring quality'... even the classic lie of 'developers would never make that mistake'