Property-based testing in practice [pdf] (harrisongoldste.in)
70 points by Smaug123 5 months ago | 52 comments



While agreeing with the results of this article, I’ve found that convincing other developers to write tests with properties isn’t easy: coming up with good properties is not always trivial.

Here is an informal testing maturity ladder in increasing order:

- code can only be tested with an integration test

- code is tested by comparing the stdout with an earlier version (hopefully it’s deterministic!)

- code is retrofitted with tests in mind

- code is written to be testable first, maybe with lots of mocks

- code is testable, and pure functions are tested with unit tests, leaving the rest to integration tests. Fewer mocks, some stubs.

- property based tests, assuming the unit tests are fast in the first place

- fuzzing

- mutation based testing

Not to speak of formal specs, performance testing, or anything else.


The problem with your hierarchy is that there's no empirical evidence supporting it. Small unit tests have not been empirically shown to have benefits over integration tests, and test-driven development has failed to show a benefit over tests written after the fact. The only thing that seems to matter is that tests are written, and the more tests, the better the chances of finding a bug. That's it. So your list is actually:

* integration and unit tests: since these are manually written, they scale poorly but are simple.

* property tests: since these are semi-automatic they scale better but are a bit more complicated to set up.

* fuzzing: almost fully automatic, although I don't differentiate this much from property-based testing.

* mutation based testing


Is mutation based testing only better because it forces more tests to be written to kill the mutants?

Also, mutation based testing is really orthogonal to the others, since it's a way of evaluating the adequacy of the tests, not of actually testing. One could easily imagine using PBT/fuzzing to generate (and then simplify) tests with the express goal of killing mutants.


> Small unit tests have not empirically been shown to have benefits over integration tests, [...]

Do you have any links to these studies?

> * property tests: since these are semi-automatic they scale better but are a bit more complicated to set up.

Once you are in the groove, I find property-based tests to be simpler (or at least no harder) than example-based tests. But that's when I am writing tests as I am developing the system, i.e. I take testability into account in the design.


> code is written to be testable first, maybe with lots of mocks

If you mean what I think you mean, this is the bottom rung of the ladder. Code that is only testable with lots of mocks is in practice worse than code with no tests.

Tests should do two things: catch undiscovered bugs and enable refactoring. Tests mocked to high heaven do one thing: confirm that the code is written the way it’s currently written. That is diametrically opposed to and completely incompatible with those two stated goals. Most importantly, the code can’t be changed without breaking and rewriting the tests.

Mocks are okay for modeling owners of external state. Even better are dummy/fake implementations that look and behave like the real thing (but with highly simplified logic).


I really like this list, and it's a great idea to explain testing this way.

Perhaps there is also a level -1, when the tests actually make things worse. I see this when tests are extremely brittle, flaky, don't test the most complex or valuable bits of code, are very slow to run, or unmaintained with a list of "tests we know fail but haven't fixed".


There might even be a whole ladder going further down. You know, something about knowing the cost of everything and the value of nothing.

I've seen tests with bugs in them, hiding bugs in the code and giving a false sense of robustness. Because, you know, code coverage is 100%, so we cannot have any bugs, right?

I had to work with tests that tested many trivial things, tightly coupled to implementation details. Those discouraged small refactorings, because it took a lot of time to understand and fix the failing tests whenever a change to an implementation detail led to unit test failures. It also slowed down the pace of development for no good reason.

Unless a test has a value that exceeds the cost, it is a net negative.


>- code can only be tested with an integration test

Some code only makes sense to test with integration tests. Testing is no more effective in a code base where somebody has decided to fatten up the SLOC with dependency inversion just so they can write some unit tests which verify that x = x.

>- code is tested by comparing the stdout with an earlier version (hopefully it’s deterministic!)

Making the code deterministic or adapting the tests to accommodate that it isn't should be next on the ladder, not hoping that it is.


> coming up with good properties is not always trivial

This is difficult, but one technique that might make it easier for real-world applications, beyond simple invariants, is to build a simple model of the system under test and, in the PBT, check that your system's behavior matches the model's [1].

[1] https://dl.acm.org/doi/10.1145/3477132.3483540
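
To make that concrete, here is a minimal sketch of the approach using Hypothesis's stateful testing (KeyValueStore is a made-up stand-in for the real system; a plain dict plays the model):

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, rule

    class KeyValueStore:
        # hypothetical system under test; imagine something much hairier
        def __init__(self):
            self._data = {}
        def put(self, k, v):
            self._data[k] = v
        def get(self, k):
            return self._data.get(k)

    class StoreMatchesModel(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.system = KeyValueStore()
            self.model = {}  # a plain dict is the model

        @rule(k=st.text(), v=st.integers())
        def put(self, k, v):
            self.system.put(k, v)
            self.model[k] = v

        @rule(k=st.text())
        def get(self, k):
            # the system's behavior must match the model's
            assert self.system.get(k) == self.model.get(k)

    TestStore = StoreMatchesModel.TestCase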


Testing behaviour against an 'oracle' is a great class of properties to check.

Especially useful when you want to test an optimized version against a simpler (but slower) baseline version. Or when you have a class of special cases that you can solve in a simpler way.

Testing the system against itself, but under symmetry, is also useful, though that comes close to general properties. A symmetry could be flipping labels around, shuffling input order, etc.; it depends on your system.
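
As a sketch of the optimized-vs-baseline pattern with Hypothesis (top_k_fast and top_k_slow are made-up examples; imagine the fast one being genuinely hairy):

    import heapq
    from hypothesis import given, strategies as st

    def top_k_slow(xs, k):
        # trivially correct oracle: sort everything, take the front
        return sorted(xs, reverse=True)[:k]

    def top_k_fast(xs, k):
        # stand-in for the clever optimized version under test
        return heapq.nlargest(k, xs)

    @given(st.lists(st.integers()), st.integers(min_value=0, max_value=10))
    def test_fast_matches_oracle(xs, k):
        assert top_k_fast(xs, k) == top_k_slow(xs, k)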


I don't understand why PBT is above mutation testing. It seems like it's more of a popularity contest kind of thing, and not a matter of engineering tradeoffs or how useful it is.


In my experience, adding PBT once you've got a codebase amenable to unit testing in general is a relatively easy step: you can add property tests to your existing unit tests without changing much about your setup, and assuming your tests in general are quick, PBT won't add much overall time to your testing process.

But adding mutation testing tends to be harder: it's not just an extra thing you can add in without changing any existing code, it's testing whether the tests are useful in the first place. Which means when you introduce it, you'll probably need to spend some time fixing everything that's currently wrong. This makes it a next step in the testing process beyond just adding a new technique to your existing repertoire.

That said, I've used PBT for a while and not had as much success with mutation testing, so maybe this is a personal bias.


The setup for MT should be none. You just start it and see what you get.

> But adding mutation testing tends to be harder: it's not just an extra thing you can add in without changing any existing code

But it is...

> it's testing whether the tests are useful in the first place.

Not useful. Complete. MT checks that you have tests for all the behavior of the code.


Maybe we're talking about different things then, but in my experience, MT is a reasonably involved procedure that requires configuring the MT harness to understand where the code lives, how to run the test suite, how to interpret its results, etc., then running the mutation tests for upwards of an hour as it repeatedly makes various changes and runs the tests. If you want to do something more complicated like only test a given region of the codebase, then the configuration becomes even more involved.

This is all significantly more involved than my experience with PBT, which tends to be something that can be added without much ceremony to an existing test suite when it makes sense.

To be clear, I love the idea behind mutation testing and I have given it a go a few times with limited success, but I think your comment is overselling its simplicity.

That said, I'd love your advice: how do you introduce mutation testing to a large codebase that currently has an extensive set of tests but hasn't used mutation tests yet? And how do you maintain the MT side of things? It seems far too slow to regularly run in CI: do you just run the MT tool every now and then to make sure that the tests are still covering all the mutations? Or do you have a more structured approach?


> MT is a reasonably involved procedure that requires configuring the MT harness to understand where the code lives, how to run the test suite, how to interpret the test suite

mutmut autodetects this for a majority of setups, but yeah, if you need to configure all that then it can be annoying.

> then running the mutation tests for upwards of an hour as it repeatedly makes various changes and runs the tests

Yeah, MT is slow, heh. But it can be simple to start with. I generally recommend doing it mostly for libraries, since they tend to have small and fast test suites, which makes it much more fun. Or extract the code into a throwaway project, do MT there, and then move it back. It's a bit crap, but it works.

> To be clear, I love the idea behind mutation testing and I have given it a go a few times with limited success, but I think your comment is overselling its simplicity.

I had to write my own mutation tester because I couldn't get the existing ones to work, so I do feel your pain there :P

> That said, I'd love your advice: how do you introduce mutation testing to a large codebase that currently has an extensive set of tests but hasn't used mutation tests yet?

I partly answered this above, but I would only test extremely limited parts that are critical. And I would make sure not to run the entire test suite.

> And how do you maintain the MT side of things? It seems far too slow to regularly run in CI: do you just run the MT tool every now and then to make sure that the tests are still covering all the mutations? Or do you have a more structured approach?

This is exactly how I use it, yes. People ask for CI support for mutmut and I've accepted PRs for it, but I just assume they will get it working and then later throw it all away, as it's useless. I try to convince people it's the wrong approach but I have trouble getting them to listen.

If MT were WAY faster then maybe you could use it for regular validation, but mutmut at least is too slow. I have an experimental branch of mutmut that is much faster, but I just don't have the time/interest to make that a reality right now. I don't particularly need MT in my current job...


PBT is below mutation testing; the list is ordered with the lowest tier first.


Sorry, I put that badly. What I meant was that I don't understand why PBT is something you apply before MT.


> While agreeing with the results of this article, I’ve found that convincing other developers to write tests with properties isn’t easy: coming up with good properties is not always trivial.

Yes, but there are some easy targets:

Your example-based tests often have some values that are supposed not to matter. You can replace those with 'arbitrary' values from your property-based testing library.
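
A sketch of what that looks like with Hypothesis (greet is a made-up example; before, the test hard-coded a name that didn't matter):

    from hypothesis import given, strategies as st

    def greet(name):
        return "Hello, " + name + "!"

    # before: an example-based test with a "don't care" value
    def test_greet_example():
        assert greet("Alice").endswith("!")

    # after: same assertion, arbitrary name
    @given(st.text())
    def test_greet_property(name):
        assert greet(name).endswith("!")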

Another easy test that's surprisingly powerful: just chuck 'arbitrary' input at your functions, and check that they don't crash. (Or at least, only throw the expected errors.) You can refine what 'arbitrary' means.

The implied property you are testing is that the system doesn't crash. With a lot of asserts etc. in your code, that's surprisingly powerful.
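
A minimal sketch of that "doesn't crash" property (parse_version is a toy stand-in for your function):

    from hypothesis import given, strategies as st

    def parse_version(s):
        major, _, minor = s.partition(".")
        return int(major), int(minor or "0")

    @given(st.text())
    def test_only_expected_errors(s):
        try:
            parse_version(s)
        except ValueError:
            pass  # int() rejecting junk is the documented failure mode
        # any other exception escapes and fails the test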


I've never seen anyone do mutation testing with software (it's pretty common for hardware though). Does it require language level support?


I'm the author of mutmut, the primary mutation tester for Python, so I think I can speak a bit on this.

It's quite straightforward to do MT in software. I've done quite a bit of it for specific libraries that I've built (iommi, tri.struct, and internal code). A big advantage of MT over PBT, in my book, is the much lower cognitive overhead, and that you can know you have done it completely. The second may be just a false sense of security or an emotional blanket, but still.

I have written about mutation testing a few times: https://kodare.net/2019/04/10/mutation-vs-property-based-tes... https://kodare.net/2018/11/18/the-missing-mutant-a-performan... https://kodare.net/2016/12/12/mutation-testing-in-practice.h... and my talk on mutmut from PyCon Sweden is on YouTube: https://www.youtube.com/watch?v=fZwB1gQBwnU


Does that only work in Python though? What about compiled languages? You don't really want to have to recompile your whole project again for every line you change...


mutmut right now is implemented by writing changes to disk and starting new processes. But you can implement it via "mutation schemata" where you functionally compile all possible mutants ahead of time, plus the original function, and replace the original function with a trampoline that either calls the original or one of the mutants depending on some external state.

I have a prototype of mutmut that does this and it's 10 to 100x faster. It does have the downside of not being able to mutate stuff like static global variables and such though.
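
To illustrate, a toy sketch of the schemata/trampoline idea in Python (this is not mutmut's actual implementation; the MUTANT environment variable is made up):

    import os

    _MUTANT = os.environ.get("MUTANT")  # the "external state"

    def add(a, b):
        # trampoline: dispatch to one of the pre-compiled mutants,
        # or fall through to the original
        if _MUTANT == "add_1":
            return a - b      # mutated operator
        if _MUTANT == "add_2":
            return a + b + 1  # mutated constant
        return a + b          # original

    # the harness then runs the suite once per mutant, e.g.:
    #   MUTANT=add_1 pytest   ->  should fail (mutant killed)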


[I've never used any mutation testing tools either.]

In general it doesn't require language-level support, of course - you can just make a change and rebuild it, à la Stryker https://stryker-mutator.io/docs/stryker-net/technical-refere... . PITest operates on JVM bytecode (https://pitest.org/) for orders of magnitude speedup.


Yeah, I meant language-level support to make it viable, which rebuilding a gazillion times doesn't really sound like.


Where does automatic failing-test-input reduction fall on that ladder? Or is that assumed as part of PBT?


Nope. It's this disparaging of integration tests that gets us into hundreds of useless unit tests that in reality test nothing.

Integration tests must be much higher on the list.

Also: nothing is stopping you from running your prop tests in integration tests, too.


Nothing stopping you. I don’t recommend it. Tests need to be fast or they don’t get written.

Adding property tests into integration tests will make things slower and flakier.


It depends. For a lot of applications, the integration is 99% of the app. Trying to unit test that just ends up testing whether you believe the other end of the integration behaves the way you believe it behaves. It's like having a test that verifies the hash of the executable binary. It tells you that something has changed, but not whether or not that change is desirable.


Test speed fetishism is a bad idea. It leads people to write unrealistic tests which end up passing when there is a bug and failing when there isn't.

Treating flakiness in integration tests as a fait accompli is also a bad idea - a bit like treating flakiness in the app itself as a fait accompli.


Tests that are slow to run don't get run often. People don't write tests first. They write fewer tests. They tend to test the happy paths.

I'm not saying we should avoid integration testing. It's necessary.

It doesn't mix well with PBT for a main test suite, in my experience.

You're better off making sure your domain types and business logic are encapsulated in your types and functions and don't require a database, a network, etc. PBT is great at specifying invariants and relations between functions and sets.
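
For example, a relation between two pure functions (a round-trip property) needs no database or network at all:

    import json
    from hypothesis import given, strategies as st

    @given(st.dictionaries(st.text(), st.integers()))
    def test_json_roundtrip(d):
        # decode is the inverse of encode on this domain
        assert json.loads(json.dumps(d)) == d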

Use your integration testing budget for testing the integration boundaries with the database and network.

Update: there's no cure for writing bad tests. You just have to write better ones.


Tests get run as often as they get run. There's no rule that says that you have to run them less frequently if they are slower. It's not 1994. CPU time is cheap.

Realism is much more important than speed. A test that catches 20% more bugs and takes 5 seconds instead of 0.05s is just more valuable. It's simply not worth economizing on CPU cycles in 2024.

Yes, DI can make sense sometimes but definitely not all of the time.

The people who have ultra fast unrealistic tests inevitably end up leaning on manual testing as they experience bugs which their unit tests could never have caught.

>there's no cure for writing bad tests. You just have to write better ones.

Better means more realistic, hermetic, and less flaky. It doesn't necessarily mean ultra-fast.


> will make things slower and flakier.

If running tests with different data makes them flaky, your system is bad.

Regardless of property-based testing, I've seen too many systems that had hundreds of unit tests and which would still fail for the most trivial reasons, because the actual integration of those units was never tested properly.


I love PBT and use it frequently.

So integration tests are generally IO-bound. They'll be slow.

With PBT you will generate 100 or 1,000 of those slow cases.

Integration tests are flaky in the sense that your components might not generate just the right inputs to trigger an error. They may not be configured under test the same way as they are in production. And so on.

PBT can be flaky in the sense that, even with carefully selected distributions, you can get a test that ran fine for months to fail on a run. Add that on top of integration test flakiness.

There's no silver bullet for writing bad specifications.


If you're interested in property-based testing, I highly recommend "How to Specify It!" by John Hughes, which is available both as a talk and as a paper.

https://research.chalmers.se/publication/517894/file/517894_... https://www.youtube.com/watch?v=G0NUOst-53U

He gives fantastic guidance about how to write PBTs.


Can't read the PDF right now, but I'm a big fan of property-based testing.

One thing I find people struggle with is coming up with "good properties" to test.

That's the wrong way to think about it. The properties you want to test are the function's contract. Checking that contract is the goal.

You can be as specific with the contract as you want. The more specific the more bugs you'll find.

Property-based tests are just one way to check the contract, as are hand-written unit tests. You could use a static analyzer or a model checker as well, they're all different approaches to do the same thing.

EDIT: by contract I mean the guarantees the function makes about its output. A contract for a sorting function could be as simple as the length of the output being the same as the input's. That's one property. Another is that every element in the output is also in the input. You can go all the way and say that for every element at index i, the element at index i+1 (if any) is at least as large.

But you don't need a perfect contract to start with nor to end with. You can add more guarantees/properties as you wish. The more specific, the better (but also slower) the tests.
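
Spelled out with Hypothesis, those three properties of a sorting function look something like this:

    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_contract(xs):
        out = sorted(xs)
        assert len(out) == len(xs)            # same length
        assert Counter(out) == Counter(xs)    # same elements
        assert all(out[i] <= out[i + 1]       # ordered (ties allowed)
                   for i in range(len(out) - 1))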


I found that writing new property based tests was a skill I picked up relatively quickly. But learning how to retro-fit existing tests was a whole 'nother skill that I had to learn almost independently afterwards.


Over time, while trying to adopt or even reinvent PBT, I discovered that there isn't a good language for formally describing software at the level where it becomes interesting to test. TLA+ gets somewhat close... but it's both too difficult to write and hard to adapt to e.g. interactive systems.

I don't mean to say that PBT is a bad idea. I actually think it's very good. I wish there were a way to make it really useful, though. As the paper mentions, PBT in their experience is used for "component testing", which is just another name for "unit testing", where automation of this kind isn't all that important. Integration and E2E testing is a lot more important, but doesn't really have a good way of being approached right now.


Have you looked at P? https://p-org.github.io/P/


No, I haven't. Thanks for the link. I'd have to spend some time reading about it.

Ugh...

> The P compiler is implemented in C# and hence the tool chain requires dotnet.

Nah, sorry. I'm not going to read further about it.


Why?


I started my familiarity with programming using a bunch of Microsoft's products, but later was able to completely migrate away from them. Today, if I'm required to use a Microsoft product, it's kind of like asking a North Korean who valiantly fought to get away from that cursed country to go back and live there again.

You see, my understanding of what's happening with MS is that their sales department is exceptionally successful, which lowers the requirements for everything else in the product. A particular quality of this sales department is that they rarely sell to individual end users: to be efficient, they sell to people who have absolute control over the end users, but are also incompetent when it comes to software. Think governments, hospitals, schools etc.

Another aspect of MS software is that they, internally, don't really allow anything that's not made by MS, and they are big enough for this policy to succeed. An unfortunate side effect of this policy is the Stockholm syndrome: MS developers not only get used to the MS tools, they genuinely seem to like the suffering incurred by using them. Another unfortunate side effect is that MS developers have no idea what normal users actually like. So they are kind of like Bender (a robot who has only a very faint idea of how humans eat) in the Futurama episode where he enters a cooking contest. He then proceeds to put all kinds of inedible components into his meals (but eventually wins by adding some LSD to the mix).

So... I will never touch any MS products with a ten-foot pole unless forced to by my employer. And if the employer goes as far as requiring me to use MS products a lot, I'll look for another place to work. My job today is quite meaningless and boring, and I don't get to work with things I like or people I respect... but having to use MS tools for work would still be a lot, lot worse. It's a kind of torture... well, of course, that's an exaggeration: it doesn't cause me physical pain, but it's the kind of frustration that makes me want to punch a wall every few seconds.

And to top it off: using C# for anything is kind of like... starting a cooking recipe with "put a handful of chopped garlic into the pen". You just know it's either a mistake, or a prank, or total incompetence. You just don't keep reading after that line: there are plenty of recipes around, and it's not worth your time to try to figure out why this one is defective.


How do I learn to hurt people in such a peculiar way for them to write essays of this kind?


I've never been able to understand the "other" side, even though I've talked to many people at various levels of the hierarchy at MS and similar organizations. With the rank-and-file folks, my impression was usually that they are pretty clueless about the org they work for, and if it came to discussing their allegiance to the org, they would often act like radical political activists.

With the higher-ups... I had this weird feeling they weren't really human. Not that they ever had to be honest with me or had to even pretend to care about me. I only met such people in a work setting (outside of one time when I got invited to a Bar Mitzvah because the email was sent to the entire company, and I was the only idiot to show up at the event since everyone else knew it wasn't intended for my kind of people :D)

I honestly cannot imagine the thought process of someone like that, or what they do on a weekend, or how they interact with people of their own social class. One weird thing about these kinds of people is that oftentimes they'd dress like a "simple person", but once you have a closer look at their jeans / t-shirt / sneakers, it'd be something that looks "plain" but in fact isn't even sold in the places a regular person shops for clothes, because it costs ten times the price of a similar-looking item.

So... how do they do that? How do they justify doing what they do? It's honestly easier for me to put myself in the shoes of a religious fanatic or a hardcore criminal than to get any insight into the minds of these people. One common explanation is sociopathy, but these people aren't really that: they have friends, get married, have children, and they don't kill others, not any more than an average person would. They just have the ability to completely ignore the negative and widespread consequences of their decisions. Something I'd be terrified out of my mind about, to the point of not being able to close my eyes at night.


Sounds idiotic, and as for the fact that you refused to read a repo README because something something "people wearing jeans scares me"... well, I'm speechless.


TIL: Jane Street use OCaml in production


They maintain their own standard library for OCaml, don't they?


Not just the standard library, but a whole ecosystem around it.

And their own compiler. https://github.com/ocaml-flambda/ocaml-jst


Wow, Hypothesis seems to have reached 4% of Python users? That's actually pretty impressive.


That's the same percentage of users who have Python package management figured out!


*of Python users who completed the JetBrains Python survey.


I expect there are more data-science/conda etc. users than users to whom "JetBrains survey" even registers.

Not to be negative about Hypothesis, though. I like it, and tried to help increase adoption at my previous job after a colleague introduced me to it, but that was shortly after PyCon UK 2018/19 and I haven't used it since. Not that I don't think we'd benefit from it; there's just probably lower-hanging fruit atm.


Yeah, most Python users don't even use static type annotations yet... Property testing is ambitious!




