We need to talk about testing (dannorth.net)
126 points by glenjamin on July 27, 2021 | 72 comments



"For every complex problem there is an answer that is clear, simple, and wrong." -- H. L. Mencken

The saddest thing about articles like these is that they are so full of wisdom, but they can never seem to compete with the "hot new methodology."

Like it or not, the "agile" movement had a huge impact on the way people do their jobs, in part because it had a name, it had advocates, and it had principles that you could fit on a PowerPoint slide.

But good testing (like good design) just can't be encapsulated in a handful of principles you can teach to a newbie. Good testing requires wisdom.

I can write a lot of words about what I've learned about testing over the years, but it probably won't do a good job of conveying that wisdom to others. The best testers I've ever worked with have a strong gut instinct, developed by experience, for where the bugs will be found, and how to effectively spend our limited time-budget for tests.

Is it worth adding more end-to-end UI tests? What if they run slowly? (How slowly?) Is it better to add more unit tests? What if adding unit tests requires aggressively mocking out dependencies? Which parts of the product require more testing, and which ones have been adequately tested?

These questions don't have quick, easy answers, but the quick, easy answers keep winning mindshare, and we're all impoverished as a result.


Well said. But wisdom goes hand in hand with dedication. Let me paraphrase someone wiser than me:

"Dedication brings wisdom; lack of dedication leaves ignorance. Know what leads you forward and what holds you back and choose the path that leads to wisdom."


> Good testing requires wisdom.

I don’t disagree, but I’d be willing to settle for diligence and discipline.


Writing an article claiming that "it's complex" and "it depends" is the easiest thing in the world; even when these claims are false, few will dare call them out as false.

In my experience good testing actually doesn't require wisdom, or at least there's a lot more value in quick easy principles than there is in wisdom (which is very hard to assess whether it's actually achieving anything).

We have plenty of articles that pontificate and cover everything in shades of grey; actually committing to a stance and giving concrete, actionable advice (that may be wrong for occasional edge cases) is far more valuable.

(To answer your questions: have half an hour's worth of tests, start with purely end-to-end UI tests and convert cases to more unitlike tests as necessary to keep that half hour runtime. Don't ever mock, stub if you must but only after you've tried improving the code so that you don't have to. You already know which parts require more testing and which have been adequately tested, give yourself permission to trust your instincts on that one)


Half an hour is 30 times longer than I want to wait for good quality feedback.

I’ll settle for under 5 minutes, but I always shoot for 1 minute.

30 minutes is not really compatible with continuous deployment approaches


IMO a deployment that doesn't contain a user-facing feature is meaningless - so rather than being able to deploy literally every second, you only need to be able to deploy fast enough to make each user-facing feature (which is probably, what, 1/week/developer?) its own deployment. So assuming you have the kind of team size that I think is right (which is a whole other discussion), half an hour is good enough. (I'd still call that continuous deployment, because master is continuously deployed - but you only have feature branches landing in master at a rate of one per 30 minutes).

Your test-edit cycle should definitely be much shorter than half an hour, but you don't need to run the full test suite for that.


I like this joke from a while ago about testing: "QA Engineer walks into a bar and he orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd. First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone."

Which leads into being very clear about what you want to achieve with testing (correctness, robustness, fitness for purpose, etc.) and how much effort you want to put into each area. Often I have seen people put a lot of effort into testing things that really don't have much payback compared to what they could be testing instead. So be careful of the opportunity costs of your testing efforts.


> Often I have seen people put a lot of effort into testing things that really don't have much payback compared to what they could be testing instead.

90% of the "unit tests" that I've observed in the wild are checking for things that a modern type system would easily prevent you from doing.

The "unit testing is a magic bullet" cults seem form in environments that use weakly typed or highly dynamic languages like Javascript that let you pass anything to anything and only blow up when you execute one particular branch at runtime.

A good reason to use Typescript, Rust, Python's optional type hints, etc. is that they point out these problems for you as you're writing your code, so you don't have to unravel your mess three days later as you're cranking out 10 pages of boilerplate unit tests that only cover two functions and aren't even close to being exhaustive.

Use better languages, stop wasting your time testing for typos and brain farts, and focus on testing higher-level aspects of your design that your language and tools can't possibly know about.


> only blow up when you execute one particular branch at runtime

On a related note: I'm a big fan of property-based testing, where each test is basically a function 'args => boolean', and we're asserting that it always returns true regardless of the given arguments (in reality there is usually more structure, for nicer error messages, skipping known edge cases, etc.). The test framework (e.g. Hypothesis, Quickcheck, Scalacheck, etc.) will run these functions on many different values, usually generated at random.

This works really well for 'algorithmic' code, e.g. '(x: Foo) => parse(print(x)) == x', but it can sometimes be difficult to think of general statements we can make about more complicated systems. In that case, it's often useful to test "code FOO returns without error". This is essentially fuzzing, and can be useful to find such problematic branches (at least, the "obvious", low-hanging fruit cases).
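For concreteness, a minimal sketch of both flavours using Hypothesis (parse/pretty_print here are toy stand-ins, not any particular codebase):

```python
# Toy round-trip property plus a fuzz-style "doesn't crash" property.
from hypothesis import given, strategies as st

def pretty_print(xs: list) -> str:
    return ",".join(str(x) for x in xs)

def parse(s: str) -> list:
    return [int(part) for part in s.split(",")] if s else []

@given(st.lists(st.integers()))
def test_roundtrip(xs):
    # The invariant: printing then parsing gets us back where we started.
    assert parse(pretty_print(xs)) == xs

@given(st.text())
def test_parse_fails_gracefully(s):
    # Fuzz-style property: we don't assert anything about the result, only
    # that parse fails in a controlled way (ValueError) rather than crashing.
    try:
        parse(s)
    except ValueError:
        pass
```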


Just use mocks for complicated systems


No offense, but I think you've completely missed the point of what my comment was talking about.

Mocks are a way to decouple components of a system during tests, e.g. to avoid having tests hammer a real database.

I'm talking about the problem of figuring out what to test; in particular, which properties of a system are invariant.

For example, lots of unit tests are based on coincidences, for example:

    {
      uid  = createUser('Alice');
      page = renderProfilePage(uid);
      assert(page.contains('Hello Alice'));
    }
We can turn this into a property test by taking an arbitrary username as an argument:

    (username: String) => {
      uid = createUser(username);
      page = renderProfilePage(uid);
      assert(page.contains('Hello ' + username));
    }
However, this test will fail, since this is not an invariant of the system. In particular, it will fail when given an argument like "&<>", since the resulting page will not contain the text 'Hello &<>' (instead, it will contain "Hello &amp;&lt;&gt;").

All sorts of issues like this will crop up as we test more complicated code. For example, certain components might fail gracefully when given certain values (e.g. constructing a '404' return value); that might be perfectly correct behaviour, but it makes useful invariants harder to think of (since they must still hold, even in those cases!)
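For concreteness, one way to restate that invariant is to compare against the escaped username. This is a toy, self-contained sketch: create_user/render_profile_page are in-memory stand-ins for the hypothetical functions above, just so the property actually runs:

```python
# The corrected invariant holds even for inputs like "&<>", because we
# compare against the *escaped* username.
import html
from hypothesis import given, strategies as st

_users = {}

def create_user(name: str) -> int:
    uid = len(_users) + 1
    _users[uid] = name
    return uid

def render_profile_page(uid: int) -> str:
    return f"<h1>Hello {html.escape(_users[uid])}</h1>"

@given(st.text())
def test_profile_page_greets_user(username):
    uid = create_user(username)
    page = render_profile_page(uid)
    assert f"Hello {html.escape(username)}" in page
```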

Mocking is completely orthogonal to that.

PS: I consider mocking to be a code smell. It can be very useful when trying to improve processes around a legacy system; but things which are designed to use mocking tend to be correlated with bad design.


I believe we are not on the same page about what we consider a complicated system to be. Do you assume that $developer is the owner of all the code in this mental exercise?


Common practice is to write your own generators: (username: ArbUsernameValid) => ...

ArbUsernameValid covers all the relevant Strings that can be used here; e.g. take a dictionary of names from 100 countries and feed it to your arb generator.
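In Hypothesis terms, a sketch of that idea could look like this (NAMES is a tiny placeholder for the name dictionary mentioned above):

```python
# A custom strategy that only produces usernames the system is meant to accept.
import string
from hypothesis import given, strategies as st

NAMES = ["Alice", "Bob", "José", "Mei", "Søren", "Aaliyah"]

valid_usernames = st.one_of(
    st.sampled_from(NAMES),
    st.text(alphabet=string.ascii_letters, min_size=1, max_size=32),
)

@given(valid_usernames)
def test_profile_page_for_valid_usernames(username):
    # ... same kind of property as above, but only over accepted usernames.
    assert username != ""
```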


> 90% of the "unit tests" that I've observed in the wild are checking for things that a modern type system would easily prevent you from doing.

I'm a heavy Clojure user, so I often hear how static types are unnecessary and not useful, dynamic types are much more convenient, and aren't you using clojure.spec anyway to make sure your data is correct?

But I'm a strong static type proponent and wish a statically typed Clojure existed. The reason is exactly as you say: with dynamic types, type errors are only found at runtime, which means the only way to find them ahead of time is to exercise the code thoroughly. So you rely on unit tests to find type errors.

The problem with that is, testing is labour intensive and also non-exhaustive. You have to come up with all the ways the types might become mismatched, and in my personal experience it's super easy for something to slip through. These kinds of bugs seem to be the biggest cause of production errors in my code, and they are the exact types of errors that a static type checker would have prevented.

So I don't think unit tests are a good substitute for a static type system.

Many people argue that the overhead of static typing makes them too slow, takes too much effort and makes it harder to write code. In my personal opinion, static types make me think about my data much more deeply and help me design better software. Yes, it's slower, but it's slower because it forces me to think about the problem space more. The actual extra typing (har har) necessary to add type annotations is a very small overhead. If that's really what's making you slow, then consider learning to touch-type, switching to a better keyboard layout or a better physical keyboard, or an IDE with better autocompletion; or just learn to use the slowdown to think about your code more, or use a language with type inference. Actually writing code is a small part of my day anyway, so it's not been an issue for me when I use a statically typed language.

A problem I have with optional type hints is that they are optional: not all code (standard library, third-party libraries) will have them, so you only get a small part of the benefits.
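As a minimal illustration of the point about catching type errors ahead of runtime (assuming a checker such as mypy is run over the code; the functions are made up):

```python
# With the annotations, a static checker rejects the bad call before the code
# ever runs; without them, the same mistake only surfaces at runtime, and only
# when that branch actually executes.
def monthly_total(amounts: list[float]) -> float:
    return sum(amounts)

def report(month: str, amounts: list[float]) -> str:
    # Deliberate typo: passing the month instead of the amounts.
    total = monthly_total(month)
    return f"{month}: {total:.2f}"
```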


> "unit testing is a magic bullet" cults

The thing is, those unit tests achieve practical utility in those language environments, because without the protections of the type system those kinds of errors bring down systems all the time.

I agree that more rigorous type systems can spare you from having to write a lot of boilerplate tests, and I prefer to work in more strictly typed language environments. Nevertheless, "unit" tests that prove the functionality of individual elements, even if you write fewer of them, still offer a huge ROI. They are damn close to a "magic bullet", and every project that neglects them pays the price.


Another common problem with unit testing is that the function being tested should actually be four different functions, and the setup is so complex that it becomes more of an integration test at the wrong level.

Then the developer will say "unit testing is a pain in the ass, I'd rather do integration tests". But the problem is that the code being tested is calling out its problems through these tests.


I hear this often but I don't think it's true. Sometimes a good integration test will do you more favours than a dozen unit tests. Also, a unit test can be as much of a pain in the ass as a bad integration test.


Unit testing has become a bit like a diploma. Something society requires to acknowledge your quality, but that is mostly pro-forma and very poorly correlated with domain skills or intelligence.

Those are systems that have firmly crossed from "simulation" territory to "simulacra" territory.

People who think about the actual value of specific tests like you are few and far between. Most feel constrained by peer pressure and the need to do what they perceive is correct by definition.


Tests are to stop the future programmer from breaking your code.


Tests (specifically unit tests) are a way of documenting your assumptions about how a piece of code will be used.


By making tests so brittle you can't change any code without updating tests. And who wants to do that?


If the behavior changes, the requirements have changed or you previously had a bug. That's not brittle tests.


Someone wants to test:

  function do_add(a, b, add) {
    return add(b, a);
  }
And does something like:

  do_add(1, 2, add)
  expect(add) to be called with (2, 1)
Then the function is changed to:

  function do_add(a, b, add) {
    return add(a, b);
  }
Oops, broke the test. Is this a type of behavior change that you expect would break the test? Often there are many ways to correctly do something and a good test should allow any of them.
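For illustration, roughly the same example in Python: a behaviour-based test that any correct implementation passes, next to the brittle interaction-based version built on unittest.mock:

```python
# The first test pins down the result, so any correct do_add passes; the
# second pins down the argument order, so a harmless refactor from
# add(b, a) to add(a, b) breaks it.
from unittest.mock import Mock

def do_add(a, b, add):
    return add(b, a)  # or add(a, b); the behavioural test below doesn't care

def test_do_add_returns_the_sum():        # robust: tests the behaviour
    assert do_add(1, 2, lambda x, y: x + y) == 3

def test_do_add_calls_collaborator():     # brittle: tests the interaction
    add = Mock(return_value=3)
    assert do_add(1, 2, add) == 3
    add.assert_called_once_with(2, 1)
```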


But that's not a behavioral change, only bad testing. The test wouldn't be brittle if the behavior of adding was locked down completely.


Any test that uses mocks is therefore "bad testing".


The very problem is that tests often fail not because behavior changed, but because implementation details that don't affect behavior changed.

This is especially the case when tests rely on mocks, which don't really implement the BEHAVIOR of a dependency, but rather represent a "replay" of a specific call stack sequence.


QED


Rings true. I often write more unit tests in a team than I otherwise would, even if I don't think they add that much value, just because no pull request ever failed review for having too many unit tests, and any time spent arguing about it is time I'd rather spend building the next feature.


> Often I have seen people put a lot of effort into testing things that really don't have much payback compared to what they could be testing instead.

Do you have examples of this? What kind of tests do you use, and to test what? I've seen people test _literally everything_, and others test only the happy path, failure cases, and critical units like user input / protocol assumptions / algorithms.


Someone was testing my code and used a Jira plugin (Xray) to document all of their "evidence". This gave non-technical stakeholders a lot of confidence because the evidence looked so fantastic and neat. My business analyst found a defect and I raised it with the tester, as it was relevant to another stream of work the tester had completed earlier. The tester showed that they were unfamiliar with the business requirement relevant to the defect. I dug and prodded the tester a little, only to uncover that the tester felt that referring to my code repository and basically re-running my code to compare dataframe row counts, etc. was adequate test coverage. Don't be fooled by "evidence".


I see tests like this all the time:

    // TestConstructor
    Object o = new Object();
    assertNotNull(o);

Completely bonkers, but mention it and people just look at you blankly: "....but the test coverage".

Cult-like thinking.


I've always felt the best approach for (automated) testing is:

Unit test style tests should test the specification. That is, you test that good input creates the results that the specification states it should and you test that the known failure cases that the specification says should be handled are, indeed, handled as the specification states. This means that most of the tests are generally happy path tests with some testing the boundaries and known bad cases. The goal is not to find bugs, but to prove in code that the implementation does in fact meet the requirements as set out in the specification.

Regression tests. Any time a bug is found, add a failing test that reproduces the bug. Fix the bug and the test should pass. This both proves that the bug is actually fixed and prevents the bug from creeping back in again later. Again, the goal is not to find bugs, it's to prevent recurrence, and it's completely reactive.
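A minimal sketch of that pattern (the bug number and parse_price are invented for illustration):

```python
# The test is named after the bug report, fails before the fix, and stays in
# the suite afterwards so the bug cannot quietly come back.
def parse_price(text: str) -> float:
    # Fix for bug #1234: prices with thousands separators ("1,299.00")
    # used to raise ValueError.
    return float(text.replace(",", ""))

def test_regression_bug_1234_thousands_separator():
    assert parse_price("1,299.00") == 1299.00
```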

Finally, property-based generative testing. Define properties and invariants of your code and schemas of your data, then generate random test cases (fuzzing essentially). Run these on a regular basis (overnight) and make sure that the properties always hold for good input data, and that error states are handled correctly for bad input data. You can also apply this to the overall system, by simulating network failures between docker containers [1]. The goal of this is to find bugs, since it will test things you won't have thought of, but since it generates random test cases, you aren't guaranteed that it will find anything. It's also notoriously hard to write these tests and come up with good properties. I don't often do these tests, only when the payoff vs effort seems worth it. E.g. for a low-impact system it's not worth the effort, but for something that's critical, it may be.

For human QA, I think it makes most sense to test the workflows that users do. Make sure everything works as expected, make sure the workflow isn't obtuse or awkward, make sure everything visually looks ok and isn't annoyingly slow. Test common mistakes due to user error. Stuff like that. I don't think we can expect this to be thorough, and it's unrealistic to think that it will find many bugs, just that it will make sure that most users' experiences will be as designed.

So, test for what you expect (to prove that it's what you expect), and test known bugs to prevent regression and to prove that they're fixed. Then, only if your software is critical enough to warrant the effort, use property-based testing as necessary to try and weed out actual bugs. Most software can skip that though.

[1] For example https://github.com/agladkowski/docker-network-failure-simula... or https://bemowski.github.io/docker-iptables/ I've personally successfully used https://github.com/IG-Group/Havoc to test a fault tolerant distributed system I worked on a few years ago, using Clojure's generative testing to create test cases and failure scenarios and Havoc to handle injecting of network errors.


IMO a key thing for human QA testing is to test unhappy paths and error handling: what happens if the user does something unexpected? The happy path is generally well tested by developers during development, but testing the unhappy paths is much more difficult and time consuming, and that's where the QAs come in.


Sure, agreed. That's why I said "Test common mistakes due to user error. Stuff like that." My main point is that it's unrealistic to expect human QA to find bugs, but what humans shine at, and computers don't, is workflow issues, visual issues and "does it feel slow/bad/whatever". I suppose leaning more heavily on the mistakes side of the workflow makes sense though, yes.


I don't mean to be too critical, but this article is part of a larger trend. I can nod along with all of its central claims and walk away with exactly zero actionable advice or practical path toward integrating this into my team's workflow.


I would agree with you, and say that while there is no set advice or plan to follow (that would make things all too easy), there are a couple of things here that we could all apply to our teams.

> Tracking automated test coverage (unit, integration, UI) is a performative task that doesn't provide hard evidence to increase confidence with stakeholders. Instead we should shift automated testing from functional testing to compliance, accessibility, and security-based testing and track coverage there.

> Shift testing to the left. This has been a major problem in the organizations I've worked at, and we should continue to keep a close eye on it. Establish processes to get QA as early as possible into architecture reviews, design reviews and other early processes that tend to be only dev, product, and design focused.

> Continue to build up our embedded QA unit to be sources of insight for multiple stakeholders and provide domain knowledge for our products. As QA we should always be asking two questions: are we building the correct product, and are we building it correctly?


This speaks to the artfulness of testing, as with the discipline of programming in general. There is no science of testing. Testing well is a skill learned over decades, as in the case of the author.

Testing is a fraught activity, too often leaving stakeholders without confidence, and leaving programmers feeling like they are just going through the motions. Yet there is clearly some value to testing; who prefers untested code to tested code?

Our lack of definitive answers regarding how to best test should not discourage us from testing. We should instead appreciate the inexactness of good testing, and seek to develop a fine sensitivity for how to test our software well.


What would you expect as actionable advice? The software world is so diverse, how could a reasonably sized article cover all those needs?

The best I can think of would be: regression tests are ok and mean "don't you ever do _that_ again". But they are not sufficient to catch the funny ways your customer will use the software. For that you need some people who strike a balance, using it in new but realistic enough ways.


Anytime someone creates actionable testing advice, I automate it. However, that still isn't enough to ensure quality, so I need humans to find the bugs that the actionable advice didn't.


Can't stand this headline format "We need to talk about X" because it implies the author speaks for the group rather than to the group, and to my ear it always carries a condescending tone.

See also "We're all [person] right now."


Yeah, I haven't read the article, because that headline format is one of those that I've decided to never reward with a click.


My initial thought on this was that the negative changes the author describes in the post were driven by continuous delivery pipelines. But agile delivery practices have a similar effect of increasing the frequency of delivering software.

If you have a system to deliver software updates in one hour, you will use it to ship 'just in time' updates. Structurally it is hard to argue against fixing the customer pain when you can easily deploy a fix.

But the long term effect of having CD is that testers get pushed out of the system.

If we deliver code every day, we can no longer afford to spend 3 days testing each release. If we can't spend 3 days testing for defects in the entire product, we end up only testing a small amount of the actual product on each release.

If we deliver code multiple times per day we probably cannot afford to do 'any' manual testing.

CI/CD encourages spending as little time as possible on testing. If deploying code takes 10 minutes, why let testing make it a 3 day process?

The long run effect is that the quality of software decreases in general. Bugs are continuously deployed into the system and you never reach a state of doneness.


The way I've run CD is we had continuous deployment to a staging environment, where final testing and stakeholder review happens, then periodic deployments to production.

This works best when you do a good job of getting the bigger/riskier changes in earlier and do a good job scheduling other changes around anything that is time sensitive.


> The way I've run CD is we had continuous deployment to a staging environment, where final testing and stakeholder review happens, then periodic deployments to production.

As you hint at, it is called continuous delivery, not continuous deployment.

The best strategy IMO is to mob with testers, and clarify how a thing will be tested before implementing it. It saves you a lot of headaches.


What happens next when feature A,B,C,D look fine in staging, but feature E is a blocker?


I have had CI/CD pipelines, in offshored projects, where the amount of issues that would creep in was unbearable, despite unit tests and whatever best practices.

So, for the teams to have anything at the end of the CI/CD that could be either demoed or tested, multiple integration stages of green builds were introduced.

A full multi-stage CI/CD integration pipeline from a dev's machine, into what became known as the diamant build, would take about a day, assuming all in-between stages were green on the first build.


> A full multi-stage CI/CD integration pipeline from a dev's machine, into what became known as the diamant build, would take about a day, assuming all in-between stages were green on the first build.

A day is indeed a bit on the long side, but CI/CD was never a synonym for instantaneous.

Let's put things into perspective: staged deployments, or simple blue/green deployments, do require a baking period to verify whether a new deployment leads to alarms going off. Depending on how you phase your deployment, that last stage alone can take hours. And that's deployment alone.

If we look into it, unit tests barely register in the time that it takes a commit to go from git push to the client starting to use the feature/bugfix. I've worked on webapps deployed globally where unit tests took far less time to run than the baking time of a single deployment stage.


I think this article has some really good points, especially "The purpose of testing is to increase confidence for stakeholders through evidence" resonates with me.

I'm personally a huge proponent of End 2 End Testing. You can think you covered everything in isolation, but the user is presented with bugs nonetheless. I have near 100% E2E test coverage. Yes, it takes time to run; yes, third-party integrations can result in failed tests (which probably signals an issue with you or your third party anyway). I believe a full E2E test suite requires less maintenance and has broader coverage than unit tests.

For instance, I run TestCafe E2E tests that test all possible user interactions, from signup and paying for a subscription to account termination. I go as far as reading expected emails through a web email client and verifying the content of the received email using TestCafe. I test cache invalidation, a host of automated member expiration, warning, and authentication / security scenarios. All this should be done anyway, so why not do it using E2E tests? When I did a rewrite, switching from Vue.js/SSR to React on Next.js, I had only minimal code changes to my E2E tests. And by minimal I mean a couple of ID/class DOM references to account for different components in Vue vs React.

Not only do you test code paths, but also all the components in your stack and the interaction between them, which for me are Node.js, Nginx, PostgreSQL, Redis, Amazon SES, etc. If something fails, you will have to do some digging, more so than with a failed unit test. But more often than not you know what has recently changed, and so where to look for possible issues. Your E2E tool makes a screenshot of the browser fail state, often highlighting the issue at hand.

Of course these types of tests work best on websites, and not so great on, say, low-level GPU driver code. They may also become too cumbersome and slow on highly complex sites. The biggest drawback is probably the time it takes for a full test run, which in my case can take up to 20 minutes in a headless browser.
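To give a feel for the shape of such a test: this is not the TestCafe suite described above, just a comparable sketch using Playwright for Python, with a hypothetical URL and selectors:

```python
# A signup flow driven through a real headless browser.
from playwright.sync_api import sync_playwright

def test_signup_flow():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://app.example.test/signup")    # hypothetical URL
        page.fill("#email", "alice@example.test")        # hypothetical selectors
        page.fill("#password", "correct horse battery staple")
        page.click("button[type=submit]")
        page.wait_for_url("**/welcome")
        assert "Welcome" in page.content()
        browser.close()
```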


I 100% agree with the central thesis that tests are about evidence-based correctness.

I think keeping this in mind is important as a product and codebase scales, since a lot of testing approaches shift in importance over time. The article's sidebar about TDD is a micro-example of this, but there are ones on the order of months/quarters/years that I've come across as well.

One example is how you likely only want end to end and integration tests early on in a development lifecycle, rather than excessive subsystem tests and mocks. This is because your end to end product experience is likely to change less than the subsystems that implement specific features (eg user auth). Over time, as your subsystems solidify you’ll likely want to reprioritize and keep end to end tests lower in proportion since they take longer to execute. In other words, your test suite should be proportional to the expected requirement changes over time.

Another more out-there example is the disabling of tests over time as test coverage starts to overlap. A lot of tests in a large codebase exercise the same codepaths, and this is usually not a net benefit (it sometimes can be, eg if one test runs really fast vs another more comprehensive one that's slow). So having a way to evaluate whether a test is actionable or useful is something that emerges organically as large codebases begin to change.


I also personally like to have hierarchies of tests, where higher-level tests don't need to test specific details covered by the lower-level tests like edge cases. Take implementing a language (lexer, parser, static analysis, etc.) for an IDE or compiler.

For the lexer, I have the tests cover the different EBNF fixed string tokens (keywords and symbols) in a given token, along with the more complex tokens for numbers, etc. For the more complex tokens I have lexer tests covering the different valid number forms, identifier characters, etc. This leads to overlap in things like keywords, but not an overlap in implementing different specifications (e.g. CSS modules).

For the parser tests, because I have the lexical coverage, I can just use a representative example of a number, etc. That allows the parser tests to focus on the different paths (optional parts, loops, etc.) of the EBNF and things like error recovery when different symbols are missing. With these, I validate the AST is correct, including the AST class that implements that node in the tree.

For the AST tests, because I have the lexer and parser coverage, I can focus on the data model and how the information in the parse tree is exposed to the AST data model.

For the higher-level tests (resolving variables to their declarations, etc.) I can focus on the cases relevant to that level instead of having to also test combinations of the lower-level tests.
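A toy sketch of that layering, with a tiny expression grammar standing in for a real language (tokenize/parse are minimal stand-ins, not anyone's actual implementation):

```python
# Lexer tests enumerate the token forms; the parser test reuses one
# representative number and focuses on structure and precedence.
import re

TOKEN = re.compile(r"\s*(?:(?P<num>\d+(?:\.\d+)?)|(?P<op>[+*()]))")

def tokenize(src: str) -> list:
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}")
        kind = "num" if m.group("num") else "op"
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

def parse(src: str):
    tokens, pos = tokenize(src), 0
    def atom():
        nonlocal pos
        kind, text = tokens[pos]
        pos += 1
        if kind == "num":
            return float(text)
        node = expr()   # assume "(": parse the inner expression
        pos += 1        # consume ")"
        return node
    def term():
        nonlocal pos
        node = atom()
        while pos < len(tokens) and tokens[pos] == ("op", "*"):
            pos += 1
            node = ("*", node, atom())
        return node
    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == ("op", "+"):
            pos += 1
            node = ("+", node, term())
        return node
    return expr()

# Lexer tests: exhaustive over token forms (integers, decimals, each operator).
def test_lexer_number_forms():
    assert tokenize("7") == [("num", "7")]
    assert tokenize("3.14") == [("num", "3.14")]

def test_lexer_operators():
    assert tokenize("+ * ( )") == [("op", "+"), ("op", "*"), ("op", "("), ("op", ")")]

# Parser test: one representative number is enough, since the lexer tests
# already cover number forms; the focus here is precedence and tree shape.
def test_parser_precedence():
    assert parse("1 + 2 * 3") == ("+", 1.0, ("*", 2.0, 3.0))
```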


> Another more out-there example is the disabling of tests over time as test coverage starts to overlap. A lot of tests in a large codebase exercise the same codepaths, and this is usually not a net benefit (it sometimes can be, eg if one test runs really fast vs another more comprehensive one thats slow). So having a way to evaluate whether a test is actionable or useful is something that emerges organically as large codebases begin to change.

If the cost of maintaining these tests exceeds the value they provide, don’t just disable them: delete them. But if the maintenance isn’t an issue, and it’s just the compute resources that are an issue, keep them enabled and schedule your tests more effectively! For example, at my company we have tests which are run the moment you want to commit, and will block your commit if they fail, and then we have a separate service which continually runs tests against tip 24/7. In order to keep low friction in the dev process, we only run a subset of tests as part of the commit process, and assume that a few failures will make it into tip but will be caught within a couple days (we ship every two weeks so that timing’s been pretty safe).


The idea that the tests of TDD are not the same thing as the tests of test coverage is a great insight that I don't think I'd quite realized before. In my experience, that seems in fact accurate - writing a short tool that uses the code you want to write is a good way to break down the problem and also make sure your solution actually works, but it's different from having confidence in the system.

On the other hand, I'm a little confused at the idea of non-automated tests that provide value. If you're interested in security/compliance/accessibility/etc., those seem like general software quality things, but they don't seem like "tests." To me, a "test" is a thing that either passes or fails, by analogy to tests in education, tests in law, etc. It's definitely valuable to review code for security issues or ensure that you're taking accessibility into account, but unless you're asking a question that can produce a yes or no, then you're not increasing stakeholder confidence through evidence, per the article's formulation (which I agree with). If you think there's value in manual testing because a human will provide you with "insights and feedback," great, but if you do hallway "testing" and everyone successfully completes the task but provides no feedback, is that a failure? If you can't find any security bugs from reading the code, does that mean there are none?

If you want unstructured information from humans as part of inputs to your design, great (and that's a great reason to shift feedback left), but I think that's a different category from testing.

You can't automate everything but you can automate a whole lot. Static analysis tools, better languages, modeling tools, etc. can find or prevent security problems more reliably than a human. Linters can check for accessibility issues (missing alt tags, missing ARIA attributes, etc.), and you can write integration tests that try to drive your entire product using only accessibility APIs to make sure your workflows are covered. Automated tests can look at emitted logs and metrics and make sure they're emitting what you expect and not emitting what you don't. And so forth.
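As a toy illustration of how such checks become ordinary pass/fail tests, here is a sketch that scans rendered HTML for <img> tags missing an alt attribute (a real project would lean on dedicated accessibility tooling):

```python
# The accessibility check becomes a plain assertion like any other test.
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            self.missing_alt.append(dict(attrs).get("src", "<unknown>"))

def test_every_image_has_alt_text():
    # In a real suite this HTML would come from rendering a page or component.
    page = "<img src='logo.png' alt='Company logo'><img src='team.jpg' alt='The team'>"
    checker = ImgAltChecker()
    checker.feed(page)
    assert checker.missing_alt == []  # fails as soon as an <img> loses its alt
```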


> The purpose of testing is to increase confidence for stakeholders through evidence

This is a good, succinct description of why you should write tests.

Though, as a solo developer who's considered sinking some time into writing tests for Report Card Writer, I've chosen not to. No amount of testing will help me make sales, which is my primary problem at the moment.


Does anyone have experience in testing "data science workflows". I made up that term because I can't think of anything better to call it. For example, a programmer starts off with some business requirements (that would be written by a business analyst) that might cover rules on which customers should receive marketing newsletters. By SQL wrangling a few different databases using some pretty complicated logic, the programmer arrives at a list of customers that they feel satisfies the business analyst's requirements. Would a tester test that the programmer's code "works" by "unit testing" or should a tester have the knowledge and skills to build their own "independent version" of what the programmer has built and compare results?


I don't have a real "answer", but in my experience, being able to arrive at the same solution two different ways definitely helps increase confidence.

It also helps to have somebody with domain experience who is able to validate that the results make sense. This isn't sensible in a "software development" workflow, but for one-off data analysis and data set generation tasks it's worth doing for complicated queries.


I've been leaning more towards extensive tracing and automatic error-based rollbacks than writing lots of tests. It seems to make more sense with the tools available today.


Agreed! Rollbacks are easily explainable, show commitment to safeguarding the user experience and make your business users way more happy.


> It seems to make more sense with the tools available today.

Aren't unit test frameworks ubiquitous, low-effort, and quick to run, even completely on a developer's desktop? I mean, why would an overly complex system like tracing replace simple standalone tests?


I've been working with teams with testers and with teams completely without them. My experience is that there are always bugs, and about the same amount of them. It seems like it never matters how much effort you put in. I think it has to be about ownership and who is going to have to wake up in the morning to fix things.

One thing that I believe is also left out here is that test-driven code looks different from code that isn't. All TDD code is by definition testable, but other code isn't always testable. (Mind you, I say code here and not programs.)

I also feel like there's a missing point in the article, which is how easy something is to test. If it's not easy to test something or to set up a test for it, it will probably never happen. I think you should write systems to be easily testable, since then people will do actual testing. And if you don't continuously test that it is testable, you will end up with a system that is eventually harder to test. Automated tests will not happen for something that isn't easy to test, and automated tests are kind of a way to acknowledge that "well, at least this works".


> If it's not easy to test something or to set up a test for it, it will probably never happen

Absolutely right. And it's never easier to set up a test than when the code is fresh in your mind.

I'm also of the opinion that bug fixes, and especially regression fixes, should always have tests included. If you've tested your code in the first place, adding an extra test case is really easy. It's much worse when you realize that the author of the code (even if it's past-you) couldn't figure out how to test it, and now that's your problem.


I'm a huge believer in testing, both unit and "monkey".

A lot of my stuff has both. If it is heavily UI-centric, or involves device interfaces and/or communications, I tend to use test harnesses.

Unit testing is good. Combined with code coverage tools, it can be quite useful.

But having unit tests is not any kind of quality assurance. It means that the "low-hanging fruit" has been picked.

Test harnesses and "monkey testing" require a lot of discipline, but are often well worth it.

I write about the approach I take, here: https://littlegreenviper.com/miscellany/testing-harness-vs-u...


Unit tests were never considered a silver bullet. The only thing they do is automatically run specific sanity checks on small subcomponents with everything else faked out.

At best, they are a convenient way to ensure that if an invariant suddenly varies unexpectedly and in an unplanned way, a red flag is thrown into play.


I'm happy to read encouragement to consider the wider group of stakeholders in the scope of testing. In many organizations, incentives are out of alignment to find issues beyond the obvious user-facing correctness ones, so "testing theater" wins, because it allows teams to declare victory more quickly. Software as a discipline would benefit from more testing attention to security, accessibility, usability, etc.


I don't see maintainability as a bad thing: software does and has to change. So testing is there also to allow one of the main stakeholders, the software makers themselves, to operate with some confidence. Which, I agree, may not be part of the shipped software per se, but of the process of software making.


I don't think you can automate all the testing, since, almost by definition, you always test AGAINST something. That something can be a specification, or a different implementation, or just a clock, but it has to come from outside the subject (the program) being tested.

A different take is that we want tests to have two qualities - correctness (the test cases should match the desired program behavior) and comprehensiveness (the test cases should cover as much of behavior as possible). The only way these are both 100% attained is if the source of your test cases is effectively another implementation of the same specifications, and you compare the behavior of the two.

The comparison itself can be automated, but the creation of the other implementation cannot. So, logically, if you want to automate it anyway, you have to compromise one of the two qualities: either be less correct (less strict when comparing program output) or less comprehensive (verify fewer possible inputs).

So it seems to me that the "automate all the tests" folks want to have their cake and eat it too. They want to have an automated, comprehensive and correct test suite, without the effort of writing an approximation of another implementation (used to compare).

In the past (I love how the blog post says "I am not advocating to returning to the Dark Ages of software delivery", as if that was necessarily terrible), you had a QA team to do exactly that - approximate another implementation of the same program directly from the specs. The better the approximation, the better the verification. But the cost is duplication of effort, in some sense. If you try to remove the duplication (i.e. for example by having the same person create both at the same time), you're likely to compromise one of the two qualities, without realizing it.

Let me state the above yet differently. The "let's automate testing" approach is based on the assumption that a human tester is running the same tests over and over. But that's not the case; the manual testing is actually different each time, so what you invisibly lose by automation is comprehensiveness.

In fact, the QA's job in the past was to have another person (other than the developer) trying to make sense of the specification (and presumably approximate the implementation with the test design), and to check whether the specification was translated correctly by comparing their understanding to the developers'. While the comparison itself can be automated, the second look is important for discovering which parts of the specification are actually weak and can be understood differently. I don't think testing a program is solely about its intrinsic properties, but rather about checking the correctness of the translation from specification to executable code.


I think if we had an automated way to check two implementations against each other, it would beat almost every other form of testing that exists.

Then testing would come down to simply writing everything twice and hoping you got it right at least once.


I don't think there is anything theoretical that prevents you from doing that. In fact, it has been tried: https://en.wikipedia.org/wiki/N-version_programming

The problem in practice, though, is that for larger programs it's really difficult to delineate what the inputs and outputs are (so they could be recorded and compared), especially since they take so many different forms.

But on small scale (like functions or modules), it should be possible, but the tooling is not widespread. IMHO it would save lots of time on doing unit tests for refactoring, and would be a real progress.

And I will go out on a limb and claim that what people (in the OOP world) mean by testability is really just referential transparency (in FP parlance): in other words, our ability to delineate all the inputs and outputs of a module. Thus, adopt more FP and this will become increasingly possible.
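At function scale the comparison harness arguably already exists in the form of property-based testing. A sketch of the idea, with dedupe as a stand-in example:

```python
# An obviously-correct reference implementation is checked against the one
# we actually ship, on randomly generated inputs.
from hypothesis import given, strategies as st

def dedupe_reference(xs: list) -> list:
    out = []                        # quadratic, but easy to convince yourself of
    for x in xs:
        if x not in out:
            out.append(x)
    return out

def dedupe_fast(xs: list) -> list:
    return list(dict.fromkeys(xs))  # the implementation we actually ship

@given(st.lists(st.integers()))
def test_fast_matches_reference(xs):
    assert dedupe_fast(xs) == dedupe_reference(xs)
```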


I'm irrationally passionate about testing. Shameless plug: (rendered version at https://gist.github.com/androidfred/501d276c7dc26a5db09e893b...)

# Test your tests

## Summary: test interfaces, not implementations

Tests that pass when something is actually broken, and fail when things actually do work, are worse than useless: they are positively harmful.

## The requirement

Let's say we've been tasked with returning `400` when `GET` `/users/<userId>` is called with a negative `userId`.

## The test

The requirement can be turned into a test that hits the endpoint with a negative `userId` and checks that a `400` is returned:

```java
@Test
public void getUser_InvalidUserId_400() {
    expect().statusCode(400).when().get("/users/-1");
}
```

## Implementation

A ubiquitous style of implementation may look something like this: (ignore whether you like the style of implementation or not, it's just an example)

```java
public class UserResource {

    @Inject
    private UserService userService;
    
    @GET
    @Path("/users/{userId}")
    public Response getUser(@PathParam("userId") final Long userId) {
        try {
            return Response.ok(userService.findById(userId)).build();
        } catch (final UserException e) {
            return Response.status(e.getErrorCode()).build();
        }
    }
}
```

```java
public class UserService {

    @Inject
    private UserDao userDao;

    public User findById(final Long userId) throws UserException {
        if ((userId == null) || (userId <= 0)) {
            throw new UserException("Invalid arguments", 400);
        }
        return userDao.findById(userId);
    }
}

```

## Another test

Everyone knows a test hitting the endpoint is not enough: more tests are required. A ubiquitous style of additional test may look something like this:

```java
public class UserResourceTest {

    @InjectMocks
    private UserResource userResource = new UserResource();

    @Mock
    private UserService userService;

    @Test
    public void getUser_InvalidParams_400() throws UserException {
        doThrow(new UserException(INVALID_ARGUMENTS, 400)).when(userService).findById(-1L);
        assertThat(userResource.getUser(-1L).getStatus(), is(equalTo(400)));
    }
}
```

## Test the tests

### Break something

Let's say the negative `userId` check is removed from the `UserService`. (by mistake or whatever)

The test that hits the endpoint will fail, because it will no longer get a `400` when `GET` `/users/<userId>` is called with a negative `userId`. This is exactly what we want out of a test!

The additional test however will pretend the negative `userId` check is still in the service and pass. *This is literally the exact opposite of what we want out of a test!*

### Refactor something

Or, instead, let's say the `UserResource` is refactored to use a `NewBetterUserService`. (the `NewBetterUserService` still throws an exception on negative `userId`)

The test that hits the endpoint will pass, because it will still get a `400` when `GET` `/users/<userId>` is called with a negative `userId`. This is exactly what we want out of a test!

The additional test however will fail because it expects the `UserResource` to call the (old) `UserService`. *This is literally the exact opposite of what we want out of a test!*

### But... The `UserServiceTest` would fail

Maybe there's a `UserServiceTest` that would fail if the negative `userId` check is removed from the `UserService`:

```java
public class UserServiceTest {

    @Rule
    public ExpectedException expectedException = ExpectedException.none();

    private UserService userService = new UserService();

    @Test
    public void findById_InvalidParams_400() throws UserException {
        expectedException.expect(UserException.class);
        expectedException.expectMessage("Invalid arguments");
        expectedException.expect(hasProperty("errorCode", is(400)));

        userService.findById(-1L);
    }
}
```

It's irrelevant. The `UserResourceTest` test is still a bad test, because it fails on refactoring. And even if there is a `UserServiceTest` that fails if the negative `userId` check is removed from the `UserService`, the `UserResourceTest` doesn't fail if there is no such `UserServiceTest`, or if there is but it's incorrectly implemented etc. It just pretends the negative `userId` check is in the `UserService` and passes.

If you were absolutely adamant about testing the implementation (which you shouldn't be, because implementation tests fail on refactoring, making them bad tests), the "correct" way of testing that would be:

```java
public class UserResourceTest {

    private UserResource userResource = new UserResource(new UserService()); //not a mock

    @Test
    public void getUser_InvalidParams_400() throws UserException {
        assertThat(userResource.getUser(-1L).getStatus(), is(equalTo(400)));
    }
}
```

Because now, if the negative `userId` check is removed from the `UserService` the test will fail. But since it also fails on refactoring, it's still a bad test.

### But... Mocks are useful

Yes, mocks are useful, and they do have a place. Eg, instead of connecting to a real db, the DAO methods could be mocked, and REST calls to other services etc could be mocked too. They're external dependencies not part of the core of the system under test. The case made here isn't that mocks are never useful or always bad, it's that mock testing is often used excessively for internals of the system under test, which doesn't make sense.

Also, I'd still argue that, by default, there should still be a strong preference for, rather than mocking DAO calls to the db and REST calls to other services, actually spinning up a real in-memory db (e.g. https://github.com/vorburger/MariaDB4j lets you do this), and actually WireMocking calls to other services. Then, more of the system under test is being tested, with less effort.

Ie unlike with regular mock testing, it's verified that the app doesn't just call a DAO method but also that DAO method actually connects to the db, runs the expected query and returns the expected result based on the primed content in the db, and that the app actually hits the configured external service url with the expected request and handles the primed response based on the Wiremock stub etc etc.

While such tests take a bit longer to run, you write far fewer, higher quality, less brittle tests, because you're not having to write and maintain `nn` mock tests between every layer of your app internals to check every single call on the way through the stack. You can freely refactor literally all* the internals all you want without having to change the tests at all - they will keep passing as long as everything works and they will fail if something doesn't. (which, again, is exactly what you want)

## Terminology

The test that hits the endpoint is commonly referred to as an "integration test", and the other test is commonly referred to as a "unit test".

Kent Beck (the originator of test-driven development) defines a unit test as "a test that runs in isolation from other tests". This is very different from the definition of a unit test as "a test that tests a class/method in isolation from other classes/methods". *The test that hits the endpoint satisfies the original definition.*

But it doesn't really matter if you want to call a given test an "integration" test or a "unit" test. The point is the test fails when something breaks and passes when something is improved. If it does the opposite, it's not a good test.

## More on the topic

* http://googletesting.blogspot.com.au/2013/08/testing-on-toil...

* http://codebetter.com/iancooper/2011/10/06/avoid-testing-imp...

## Misc

There are other problems with the example, such as eg

* Primitive Obsession: using a Long to represent `userId` and using procedural inline checks in the service to check for negative `userId`. Instead, create a class `UserId` that encapsulates and enforces its own checks.

* Hidden Dependencies: `UserResource` has a hidden dependency on `UserService` and `UserService` has a hidden dependency on `UserDao`. Dependency Injection frameworks like Spring encourage such hidden dependencies, and the excessive use of mocks in unit tests.

* The service shouldn't know about HTTP error codes.

These are simplifications in order to keep the example short - the scope of this article is testing.


Highly recommend publishing this as a blog post and sharing it as a post on HN so it isn't lost to the ether. Hard to consume this as a comment, but there's good nuggets in there.


Thank you! It was just published here :)

https://news.ycombinator.com/item?id=28135917


This seems a little bit disconnected from the article but it would be a good submission on its own I think!


Appreciate the encouragement, I've posted it now :P https://news.ycombinator.com/item?id=28135917



