With the advent of coverage-based fuzzing, and how well supported it is in Go, what am I missing by not using one of the property-based testing libraries?
The distinction between property based testing and fuzzing is basically just a rough cluster of vibes. It describes a real difference, but the borders are pretty vague and precisely deciding which things are fuzzing and which are PBT isn’t really that critical.
- Quick running tests, detailed assertions —> PBT
- Longer tests, just looking for a crash —> fuzzing.
Interesting read. I guess I'm mostly interested in number 2 here:
Under this point of view, a property-based testing library is really two parts:
1. A fuzzer.
2. A library of tools for making it easy to construct property-based tests using that fuzzer.
What should I (we?) be building on top of Go's fuzzer to make it easier to construct property-based tests?
My current strategy is to interpret a byte stream as a sequence of "commands", let those be transitions that drive my "state machine", then test all invariants I can think of at every step.
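Concretely, a minimal sketch of that strategy (IntSet and its methods are placeholders for whatever structure is under test):

    func FuzzSetCommands(f *testing.F) {
        f.Fuzz(func(t *testing.T, data []byte) {
            s := NewIntSet()         // hypothetical implementation under test
            model := map[byte]bool{} // trivial reference model
            for i := 0; i+1 < len(data); i += 2 {
                op, key := data[i], data[i+1] // two bytes = one "command"
                if op%2 == 0 {
                    s.Add(int(key))
                    model[key] = true
                } else {
                    s.Remove(int(key))
                    delete(model, key)
                }
                // Invariant checked after every step: implementation agrees with the model.
                if s.Contains(int(key)) != model[key] {
                    t.Fatalf("Contains(%d) disagrees with model after %d commands", key, i/2+1)
                }
            }
        })
    }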
> Python is Turing complete, but does [TLA,] need to be? Is there an in-Python syntax that can be expanded in place by tooling for pretty diffs; How much overlap between existing runtime check DbC decorators and these modeling primitives and feature extraction transforms should there be? (In order to: minimize cognitive overload for human review; sufficiently describe the domains, ranges, complexity costs, inconstant timings, and the necessary and also the possible outcomes given concurrency,)
> But formal methods (and TLA+ for distributed computation) don't eliminate side channels. [in CPUs e.g. with branch prediction, GPUs, TPUs/NPUs, Hypervisors, OS schedulers, IPC,]
The Fuzzing computational task is similar to the Genetic Algorithm computational task, in that both explore combinatorial Hilbert spaces of potentially infinite degree, and thus there is need for parallelism and thus for partitioning for distributed computation. (But there is no computational oracle to predict that any particular sequence of combinations of inputs under test will deterministically halt on any of the distributed workers, so second-order methods like gradient descent help to skip over apparently desolate territory when the error hasn't changed in a while.)
The Fuzzing computational task: partition the set of all combinations of inputs for distributed execution with execute-once or consensus to resolve redundant results.
DbC Design-By-Contract patterns include Preconditions and Postconditions (which include tests of Invariance)
We test Preconditions to exclude Inputs that do not meet the specified Ranges, and we verify the Ranges of Outputs in Postconditions.
We test Invariance to verify that there haven't been side-effects in other scopes; that variables and their attributes haven't changed after the function - the Command - returns.
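A rough sketch of those three checks around a single command, in Go for concreteness (the Account/Withdraw example is made up):

    type Account struct {
        Owner   string
        Balance int
    }

    func Withdraw(a *Account, amount int) int {
        // Precondition: exclude inputs outside the specified range.
        if amount <= 0 || amount > a.Balance {
            panic("precondition violated: amount out of range")
        }
        ownerBefore := a.Owner // snapshot state that must not change
        a.Balance -= amount
        // Postcondition: verify the range of the output.
        if a.Balance < 0 {
            panic("postcondition violated: negative balance")
        }
        // Invariance: no side effects in other scopes.
        if a.Owner != ownerBefore {
            panic("invariant violated: unrelated field changed")
        }
        return a.Balance
    }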
While the distinction between fuzzing and PBT is, er, fuzzy, what is unquestionable is that PBT is not meant to be coverage driven. It is a means to document behaviour for other developers in a way that is, in some cases, more descriptive than providing a fixed set of examples or other "traditional" ways of providing such documentation. Seeking coverage and "interesting" cases, rather than providing documentation, is decidedly the role of fuzzing.
You can totally combine coverage based fuzzing with property based tests. When I was at Google, I really enjoyed their internal tooling for combining both. You simply write a property based test as usual, but when it comes to execution, the testing framework compiles your test in a special way to get coverage and then adjusts the random input to hit increased coverage. Of course, they run the test across a cluster of machines completely automatically.
Traditional property based testing frameworks are implemented simply as libraries, so they don't necessarily have coverage information to guide their random input generation.
I'm not sure how Go's fuzz tests differ from what you linked in your article, but your article said proper fuzzers need to run for days or weeks, and that PBT should approximately always be chosen over fuzz testing.
But I'd take one step back, and ask a more meta question about testing: does a successful test mean successful code, and vice-versa? Is there anything in Go's contract that specifies that the same inputs to the same code will yield the same output?
A successful test implies that the documentation is true. That is, after all, why you are writing tests: Documentation – To explain to future developers the intent and usage behind the code you wrote. You could just as well write the documentation in Word instead, but then you'd lose the ability to perform machine verification of the documentation, which is a huge boon to moving codebases where the documentation and code are likely to fall out of sync if one is not incredibly careful (although perhaps LLMs will some day bridge that gap?).
The documentation being true does not imply that the documentation is useful, of course. Writing useful documentation is still an art.
This is a reason I’m very keen on making sure tests are focused on requirements rather than code.
A few jobs ago I would often be in a team where the entire team had turned over several times, and we would be asked to do large updates to a legacy application or bump lots of dependencies and just not break anything. When pushed, the product owner wouldn't be able to describe what the app was supposed to do, particularly for unusual types of users (the account/user modelling was chaotic: say, several billing accounts per user, each with different products and access levels). At that point "foo calls bar" doesn't clarify much intent.
As far as the API is concerned, the main thing you get is a combinator library for generating whatever random data structures you want to use. Working with Arbitrary types (which represent sets of random objects) makes it easy to write reusable functions for generating your test inputs. A library like that could probably be used along with Go’s fuzzing framework fairly easily?
I still think the usual combinators (map, filter, chain, and oneOf, to use fast-check’s names) can be a bit awkward, so I’m writing a new property testing library for JavaScript that hopefully works a bit nicer. (Experimental and not published yet.)
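As a sketch of what the core of such a combinator library might look like in Go (Gen, Map and OneOf are invented names, not an existing package):

    // Gen produces a value of type T from a source of randomness; to drive it
    // from Go's fuzzer, the *rand.Rand could be seeded from the fuzzer's bytes.
    type Gen[T any] func(r *rand.Rand) T

    // Map transforms the output of an existing generator.
    func Map[A, B any](g Gen[A], f func(A) B) Gen[B] {
        return func(r *rand.Rand) B { return f(g(r)) }
    }

    // OneOf picks one of the given generators at random.
    func OneOf[T any](gs ...Gen[T]) Gen[T] {
        return func(r *rand.Rand) T { return gs[r.Intn(len(gs))](r) }
    }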
It kind of is. I often use randomised tests (fuzzing) for data structures and algorithm implementations. Throw a lot of asserts in and see if an hour of random input can find any edge cases I’ve missed. Usually the first time I run something like this it finds problems instantly. It’s very humbling.
I find when I do this I don’t need to write by hand anywhere near as many tests to get my software working well. I also usually turn any failures found by the fuzzer into standalone unit tests, to make any regressions easier to find later.
I agree, it can be. My example above matches yours (fuzzing a data structure).
I coded some tests that ensure the data structure is useful, many of them test examples from papers describing the data structure, but that don't necessarily cover all the corner cases.
Then I fuzzed it. I used Go's fuzzer, which is geared towards parsers and stuff. It can generate a stream of bytes and use that for fuzzing. The data structure is a set/map. So I interpret the stream of bytes as commands to add/remove random elements from the set/map. After I add an element, contains needs to return true; after I remove one, contains needs to return false; if I put in a mapping, finding it must return what I just put there, etc. And at every step all the data structure invariants (that ensure logarithmic search, etc) need to hold.
That was stupidly effective at finding a few bugs, all within seconds, all with sequences of less than a dozen operations. Then it stops. And you get 100% coverage.
I'm assuming that, apart from the ergonomics (where I kinda have to build my own state machine transitions out of a stream of bytes), the tooling is actually more effective than property testing libraries.
There's a fun property-testing library for Haskell called LazySmallcheck (and a more powerful version called LazySmallcheck2012) which is sort-of coverage-based. Since Haskell values are lazy, it uses exceptions as its test data: if a particular exception gets thrown, the test must have forced that part of the test data, so it replaces that part and tries again.
For example, let's say our test data is an assoc-list like `[(Int, Customer)]`, and there's a bug that's triggered by repeated keys. LazySmallcheck will run the test on the most general assoc-list there is, a _|_ value!
myTest (exception A) --> throws A
Since A was thrown, LSC will try again, making the list slightly more specific, which means trying both constructors `[]` and `:`:
myTest [] --> test passes
myTest (exception A:exception B) --> throws A
The input `[]` is completely specified, and it didn't lead to any counterexample, so LSC is finished with that branch. In the `:` branch, the general pair `exception A` caused an error, so LSC will make it slightly more specific and try again:
myTest ((exception A, exception C):exception B) --> throws A
Since `exception A` is an Int, LSC can't use laziness to keep things general; it has to try concrete values, so it proceeds like Smallcheck, trying the "smallest" Int values:
myTest ((0, exception C):exception B) --> throws B
Again, LSC will make the general list `exception B` more specific:
myTest ((0, exception C):[]) --> passes
myTest ((0, exception C):exception A:exception B) --> throws A
Since the input `(0, exception C):[]` passed the test without triggering the exception, we know the second element of that pair must be irrelevant to the result. In other words, LSC has proved the property holds for inputs `(0, X):[]` for all X!
LSC hence won't try specialising `exception C`; but it will try "larger" Int values, e.g. `(1, exception C):[]`, `(-1, exception C):[]`, etc. up to a pre-defined "depth". I'll ignore these cases and focus on the branch with the longer list:
myTest ((0, exception C):(exception A, exception D):exception B) --> throws A
Again, LSC will specialise the general Int `exception A` into specific values, starting with the "smallest":
myTest ((0, exception C):(0, exception D):exception B) --> FAIL!
This input has duplicate keys, which triggered the bug and caused our test to fail. Again, the fact that the exceptions inside this value didn't get thrown means those parts were irrelevant to the result. LSC indicates irrelevant values with `_`, so it will report this counterexample as `(0, _):(0, _):_`.
That's more informative for understanding this bug than e.g. Smallcheck or QuickCheck, whose "minimal" counterexamples would be fully-specified values, like `(0, Customer { name = "", orders = [], discount = Nothing, joined = 1970-01-01 }):(0, Customer { name = "", orders = [], discount = Nothing, joined = 1970-01-01 }):[]`.
Hence LSC's approach has three interesting features:
- It has a coverage-like approach that homes in on interesting cases, by gradually generating the parts of a value that get forced (which are usually the parts being branched on)
- It can indicate which parts of a counterexample are irrelevant for the failure
- It can prove some properties universally (i.e. when that input is irrelevant to the pass result)
Anecdotally, I had a fantastic experience with `clojure.spec.alpha` (with or without `test.check`), and when I went to use python's `hypothesis` it was just... abysmal.
It seems like Hypothesis is unable to handle simple but "large" data sets >>by design<<, where "large" is really not so large. [0] It was such a pain that we ripped out Hypothesis (and generative testing altogether, sadly) completely from our python test suite at work.
it doesn't sound like hypothesis was unable to handle large data sets in this case, though that is indeed not its forte; it sounds like you were rejecting a large proportion of the shrunk instances, so hypothesis would try shrinking by setting a generated integer to zero, in order to see if the bug still existed for zero, and your test would reject it because it had a zero in it. not fail, but reject. for small instances this was just inefficient, but for larger ones it got to the point that hypothesis gave up
someone in that thread suggested that you use a different instance generation strategy that can't generate zeroes, instead of rejecting the instances hypothesis's shrinker most loves to generate, once they are generated. did you try that?
how does clojure.spec.alpha handle this differently?
> The disadvantage is that the generators are now parsers from the lists of bytes that can fail (introducing some inefficiency) and user can make a crazy generator that the internal shrinker will not be able to shrink perfectly. Nevertheless, it's the best DX of the three approaches, ...
though presumably you wouldn't agree that your test was written in a 'crazy' way
yes, but if it's too hard to generate an instance your test will accept as a valid instance (not passing, valid), i'd instead expect it to tell you to fix your test, so you can get properly shrunk results, and that's what it did. did you try the different instance generation strategy suggested in the thread? how does clojure.spec.alpha handle this differently?
On the surface, clojure.spec.alpha seems to handle things more or less similarly (you get random results skewing heavily towards small/trivial/edge values), but the main difference is that it can be used and abused without problem. As long as you can write predicates and compose specs, it handles it for you and “just works”.
did you try the different instance generation strategy suggested in the thread?
you make it sound like clojure.spec.alpha doesn't do shrinking at all. possibly you don't know what shrinking is? shrinking keeps your results from looking "random" by simplifying them until all the simpler versions of the test case pass, so it finds the simplest failing case instead of a "random" one. this also enables hypothesis to be a lot more aggressive at looking for failing cases, for example by generating much larger test data sets than you would want to pore over, because if it finds one that fails, it will shrink it before showing it to you
if i've misunderstood you and clojure.spec.alpha does do shrinking, what does it do in the case where your test case is written to reject all but an exponentially small fraction of shrunk cases? does it just take an exponentially long time to run your test suite? is that what you mean by 'just works'?
Yep. I loved clojure’s spec as it was really easy to build around and then I moved to elixir and found that I have to… drop down to an old erlang library (propEr) to write tests like that. Pretty disappointing.
Are you sure things are executed sequentially in that order? I would expect hypothesis strategies composed together to be lazily evaluated.
In any case, even in the case you describe, I would not expect the strategy to fail with probability 1. Remember that this generates multiple samples, and only one of these need to be valid.
The simple answer to a question posed in the article "On the other hand one could ask why there isn’t a requirement that published research should be reproducible using open source tools (or at least tools that are freely available to the public and other researchers)?" is that the obvious immediate outcome of such a requirement is that papers failing that requirement - like the Quviq QuickCheck papers, which seem to have been useful to the author and others - simply would not get published, and the community would lose out on that gift of information.
I think it would be good to have some publishers require reproducibility, and some publishers that don't. Every requirement is exclusionary, and there are always edge cases where a paper can be useful even if it doesn't fulfill a requirement.
As an intermediate point in the Require Reproducibility ↔ No Requirement spectrum, strong encouragement to have reproducible artifacts is attractive as well. I just got a paper accepted to ECOOP (European Conference on OO Programming; it has lost its OO focus and is now a general PL conference) and it's trying something relatively new: artifact evaluations are considered as part of the submission process. Our paper had a reproducible and reusable artifact, and I think that helped our case with the reviewers.
That’s already the case. Eurosys includes an Artifact Evaluation for submitted papers - which typically includes code and data to allow reviewers to reproduce the work described in the paper. It’s optional, but encouraged.
This page lists the artifact evaluation criteria for a handful of conferences:
Obviously this is not a clear-cut question (one might even call it a political question, and those are notoriously not clear-cut), but this is hardly a valid defence. Because if you consider it valid, you can go however far you like using this statement as a shield. If you abandon reproducibility as a requirement, you don't have to explain anything you don't want to explain. You don't have to provide any data about your samples, no statistical significance tests. A vague abstract claiming such and such result has been achieved would be enough. Heck, the famous Fermat's margin note on his private copy of Arithmetica is a totally valid research paper then: after all, we wouldn't want to lose out on the precious gift of information that a particular famous mathematician thought he had a concise and elegant proof of a theorem (even though he probably hadn't)!
FWIW, my own opinion on that political question is that current standards are way too loose. After all, nobody is forced to publish anything. There is a lot of research happening in the world that is never published anywhere (e.g. because of proprietary value). It will never go away (unless the world reaches some communist utopia, but even then it will be doubtful for other reasons). But if one works in academia (and, God forbid, receives grants for his work!) and his stated goal is to move the world's scientific knowledge further, it's only fair to ask him to actually follow that goal, and not simply pretend to follow it in order to move up that twisted ladder of his academic career.
I wholly agree with you, and on the other hand I have seen enough researcher code to know why they don't share it. Usually it's such bad quality, hammered on by multiple students to get to some goal, that by the end I was surprised they even trusted the output. Requiring publication would at least mean they know from the start that the code will have to be open sourced, and would probably even increase the success rate of research by improving coding standards.
Papers get published because authors want to increase their "importance index", which is in very direct correlation with how much they will get paid and whether they'll be making a career in academia.
Having more requirements towards that end is unlikely to impact the number of papers published.
The more severe problem with published papers is mistakes often deliberately overlooked in order to publish as much and as fast as possible. Having papers being easier to verify has a chance of improving this situation... but I wouldn't hold my breath. People are very good at cutting corners.
I do a lot of manual proptesting in Rust that all look something like:
let mut rng = rand::thread_rng();        // top-level true randomness
for action_count in 1..4 {               // outer loop: low- to high-complexity cases
    for _ in 0..10_000 {
        let seed = rng.gen::<u64>();
        eprintln!("let seed = {seed};"); // paste a failing seed back in to replay it
        let mut rng = ChaChaRng::seed_from_u64(seed);
        // ... generate `action_count` random actions with this deterministic rng ...
i.e. top level true randomness, then a bunch of nested loops (only 2 here, but some tests have more) to go from low-complexity cases to high-complexity
then generate a seed to seed a deterministic PRNG, and print it out so if the test fails, I just copy and paste the error seed to replay the error case
I have found doing this manual proptesting to be faster, more flexible, and generally less fuss than using any frameworks or libraries
I do this sort of thing too. I don’t have an automated way to shrink my input - but that’s usually fine in practice.
For example, say I’m testing a data structure. I’ll have an outer loop that picks a seed and an inner loop that does ~100 mutations of a data structure instance, testing assertions each time. If there’s a failure, I’ll try a bunch of seeds to look for one that fails the fastest. (In the fewest possible inner loop iterations).
It’s not a perfect system, but for most bugs I can usually get a test case that only needs 5-10 steps before a crash occurs. And if the reduction step finds a different bug? That’s no problem at all. I fix what I found and go back to trying more seeds.
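Something like the following, in Go (NewIntSet, mutate and checkInvariants are placeholders for whatever is being exercised):

    func TestRandomized(t *testing.T) {
        for run := 0; run < 1_000; run++ {
            seed := time.Now().UnixNano() + int64(run) // outer loop: fresh seed per run
            t.Logf("seed = %d", seed)                  // paste a failing seed back in to replay it
            rng := rand.New(rand.NewSource(seed))
            s := NewIntSet()
            for step := 0; step < 100; step++ { // inner loop: ~100 mutations
                mutate(s, rng)        // hypothetical random mutation
                checkInvariants(t, s) // hypothetical assertions after each step
            }
        }
    }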
That's why my outer loop goes from low-complexity cases to high-complexity cases, it has basically the same effect as shrinking (without actually having to do any work)
I had a quick look through the linked "Testing Telecoms Software with Quviq QuickCheck" paper, but couldn't immediately see anything to answer my question: "why is this stateful stuff not better rolled by hand?". The OP gestures at this with the key-value pair model of a key-value store, but why would one not just write a state machine; why does it require a framework? I literally did this last week at work to test a filesystem interaction; it boiled down to:
type Instruction = | Read of stuff | Write of stuff | Seek of stuff | …
And then the property is "given this list of instructions, blah". The `StateModel` formalism requires doing basically the same work! I really don't see that `StateModel` is pulling its weight: it's adding a bunch more framework code to understand, and the benefit is that it's getting rid of what is in my experience a very small amount of test code.
There are some tests where you’re right, but often the tricky part is shrinking failing cases. If you only want to generate sequences of state transitions that are considered “valid” then you typically have to have some sort of model state that dictates which test steps are valid for a given state, and you need to make sure that you don’t remove test steps during shrinking in a way that triggers a spurious failure by violating the preconditions that you followed to generate each step originally.
If any operation is valid in any state and you just want a totally random sequence of arbitrary operations, then yeah a stateful proptest framework may be overkill. But if you need to maintain a model state and specify preconditions for different operations, having a dedicated framework saves a lot of work.
Also, as others have mentioned, parallel state machine testing is another really cool benefit you can get from having a dedicated framework, but it’s not the only benefit.
My understanding is that the parallel QuickCheck will check that in your multithreaded program, all possible interleavings result in a final state that can also be reached by a sequential invocation of the commands.
Is it common to work in a runtime that lets the test harness control the order in which threads race to access mutable state? I wouldn't be surprised if Haskell could do it, but I think I'd have to write a custom interpreter of .NET or JVM bytecode, for example.
I believe Java has exactly that, but I couldn't recall the project name.
In the case of .NET, there is https://github.com/microsoft/coyote which works by rewriting IL to inject hooks that control concurrent execution state.
It would have been much more expensive to have a custom interpreter specifically for this task (CoreCLR never interprets IL).
This approach somewhat reminds me of precise release-mode debugging and tracing framework for C++ a friend of mine was talking about, which relies on either manually adding the hooks to source files or doing so automatically with a tool.
Given that the F# compiler outputs legal IL that is expected to execute in a particular way, and CoreCLR doesn't fail on importing and then executing it, it probably has more to do with either Coyote or other tooling that interacts with this setup.
I don't understand why there is less tolerance for having to make accommodations or take extra steps when working with F# than, let's say, when using Kotlin or Scala. Both follow a similar pattern where they can use pretty much every Java library by virtue of targeting JVM but can't use a huge amount of Java-only tooling that does meta-programming or instrumentation beyond what is provided by JVM bytecode specification, and yet it's not seen as such a huge ordeal from my surface impressions.
Does that issue mention F#? I may be blind but I see no evidence that it does.
In general the F# way is to write completely different stuff, right, either hiding away the underlying C# or completely replacing it. Witness the existence of the SAFE stack, Giraffe to hide the egregious ASP.NET, the totally different approaches with statically resolved type parameters in F# vs enormous hierarchies of interface types in C#, Myriad AST generation vs Roslyn stamping out strings, etc. I have no opinion on the languages hosted on the JVM, but there is a really nontrivial impedance mismatch between idiomatic C# and F#!
Another one, which I'm biased towards, is making shrinking automatic while keeping all the invariants you created while generating the values. I have a talk on this but the TL;DR is that
- derived QuickCheck-like shrinker functions that work on values (shrink : a -> [a]) have issues with the constraints, making people turn off shrinking instead of dealing with the issues
- rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind (using results of generators to dispatch to another generator)
- the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which uses a layer of indirection to shrink lists of random choices instead of the values themselves. The disadvantage is that the generators are now parsers from the lists of bytes that can fail (introducing some inefficiency), and the user can make a crazy generator that the internal shrinker will not be able to shrink perfectly. Nevertheless, it's the best DX of the three approaches, and given that it's a small miracle people are testing at all, it feels to me like the approach most worth building on as a testing library author.
> - rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind.
We're at the limits of my amateur knowledge, but I believe this is a fundamental limitation of monadic bind/generators. Instead, you should prefer applicative generators for optimal shrinking. https://github.com/hedgehogqa/haskell-hedgehog/issues/473#is...
In other words, applicative generators do not use "results of generators to dispatch to another generator", but instead shrinking is optimal due to the "parallel" nature of applicatives (I'm using "parallel" in the monadic sense, and not the sense of article's "threading" sense). Since applicatives are "parallel", they can shrink the generators independently. (Whereas monadic generators are in "series" and therefore shrinking one necessarily changes the behavior of the subsequent generator, as you noted.)
Please feel free to link your talk if it's public!
Re the talk: it's unfortunately one of the Haskell Exchange talks that got pulled from the internet when the organizer, SkillsMatter, imploded. The Haskell Foundation is trying to revive and reupload those recordings, but my particular one isn't back on YouTube yet. But here's at least the slide deck: https://drive.google.com/drive/folders/1DjfhdeFW-qCCSlJUv_Xw...
I agree using applicatives helps somewhat, it basically minimizes your usage of bind.
Hypothesis works around this (and gets a bind that shrinks nicely) by recording those choices into (in principle) a flat list, and shrinking that. So your shrinker can work on both "before the bound fn" and "after it" at the same time, but the result might be something that the generator can't parse back into a value.
Sequence points in C and CPS/ANF compilation in Lisp are kind of related to this. Both monads and applicatives enable you to define f(g(), h()), where f, g, and h are stateful computations of some sort, but monads force you to specify which of g and h is invoked first [so it’s more like x = g(), y = h(), f(x, y)] while with applicatives the implementor of the stateful model can decide what happens in such cases.
[Disclaimer: I don’t know how QuickCheck actually works, so I’d appreciate a sanity check (hah) of the following from somebody with actual knowledge on the matter.]
GP’s point, if I understood it correctly, is as follows. If you’re doing randomized testing, then g and h could perhaps be defined as { for (i=0; i<n; i++) { int x = rand(); if (!fork()) return x; } } (and yes, in monad-land the moral equivalent of fork() is admissible as what I vaguely called a “stateful computation”). You see how strict ordering of side effects enforced by monads essentially forces you into a depth-first search of the state space (although I guess you could do iterative deepening?), while applicatives can allow for different exploration strategies (ones more interesting than C’s official behaviour of “that’s UB, you fool”).
C sequence points are arbitrarily imposed by imperial decree in the standard; they are not linked to any evaluation causality.
Dependency-driven de facto evaluation orders have always existed in C though: things that you can figure out must be ordered even though there isn't a sequence point.
The standard spells out that expressions like i = i + i are okay, but actually it's not necessary to do so. The assignment cannot take place until the assigned value is known. The assigned value is not known until the + is evaluated and the + cannot be evaluated until the operand values are retrieved. Therefore, the retrieval of the prior value of i necessarily precedes the movement of the new value into i.
This, rather than sequence points, is rather a bit like monadic sequencing.
> The assignment cannot take place until the assigned value is known.
Of course it can. It's retrieving the value out of the lvalue that cannot be done, and even then you can still stretch this delay further with the "as-if" rule. Haskell does something similar with its lazy evaluation IIRC.
> C sequence points are arbitrarily imposed [and] not linked to any evaluation causality.
That’s exactly my point. Writing g >>= \x -> h >>= \y -> f x y is like writing (x = g(), y = h(), f(x,y)), with an arbitrary ordering between computations g and h that the programmer is forced to choose, whereas f <$> g <*> h is like f(g(), h()), with no ordering between g and h imposed by the applicative model itself, similar to the absence of a sequence point between sibling arguments in C.
Let’s say your test data is a record with two fields that can be generated independently. After the property test framework finds a bug, it can independently shrink each field to find a simpler failing test.
Contrast with a test data generator that randomly chooses the first field, and then calculates the second field based on the first field’s value. The framework can’t directly mutate the second field because there’s a data dependency. If it just changes the second field, it will get a record that couldn’t have been generated.
The Hypothesis test framework took a different approach. Instead of mutating the test data directly, it mutates the “random” input and reruns the generator. If you do shrinking that way, you always get valid test data, though it might not be very similar to the original.
This is similar to what fuzzers do - you start with unstructured binary data and build data structures out of it. Your “random” test data generator is really a parser of random (or maybe not so random) input.
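A toy illustration of that indirection in Go (not any particular library's API):

    // choices is the recorded stream of "random" bytes the generator consumed.
    type choices []byte

    // genPair parses two small ints out of the choice sequence; missing
    // choices default to zero, so any shrunk sequence still parses.
    func genPair(c choices) (first, second int) {
        if len(c) > 0 {
            first = int(c[0])
        }
        if len(c) > 1 {
            second = int(c[1])
        }
        return first, second
    }

    // Shrinking deletes or zeroes bytes in the choice sequence and reruns
    // genPair: whatever comes out is, by construction, a value the generator
    // could have produced.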
To have an idea of monads being sequential, think about a JS Promise:
promise.then(...).then(...).then(...)...
Promises are (almost) monadic, and the chain is sequential. You can't make it parallel, and that's the point: monadic binding represents sequential computation, aka "the semicolon operator".
To have an idea about applicatives being parallel, think about a list of functions, and a list of values; each function would be applied to a corresponding value, resulting in a list of results:
results = []
for ix in range(len(functions)):
f = functions[ix]
x = values[ix]
results.append(f(x))
It's pretty obvious that you can do each f(x) in parallel and in any order, instead of the sequential code above.
(Why would one care about this all? Because many daily things are functors, applicatives, and monads, e.g. a list is usually all three.)
My usage of "parallel/await" is entirely metaphorical; I was kinda going for a Javascript-esque syntax with "await Parallel" - I'm assuming most people aren't familiar with F#'s `and!`. It doesn't make sense for applicative generators to (necessarily) use the thread pool.
Anecdotally, Hypothesis was very far from "just working" for me. I don't think it's really production-ready, and that seems to be by design.[0]
I did have quite a bit of prior experience with clojure.spec.alpha (with or without test.check), so despite some differences it's not like I was completely alien to the general ideas.
I've tried to replicate your issue in the Elm testing library I'm a maintainer of, which uses some of the core ideas from Hypothesis.
It was able to generate values just fine ([1]), so your problems might have been just some issue of the Hypothesis _implementation_ instead of something wrong with the general algorithm?
Note that your usage of Hypothesis was "weird" in two aspects:
1. You're using the test runner for generation of data. That's sometimes useful, particularly when you need to find a specific example passing some specific non-trivial property (eg. how people solve the wolf, goat, cabbage puzzle with PBT - "computer, I'm declaring it's impossible to solve the puzzle, prove me wrong!"), but your particular example had just "assert True" inside, so you were just generating data.
There might be libraries better suited to the task. Elm has packages elm/random (for generation of data) and elm-explorations/test (for PBT and testing in general). Maybe Clojure or Python have a faker library that doesn't concern itself with shrinking, which would be a better tool for your task. The below gist [1] has an example of using a generator library instead of a testing library as well.
2. You were generating then filtering (for "at least one non-0 values across a column"). That's usually worse than constructively generating the data you want. An example alternative to the filtering approach might be to generate data, and then if a column has all zeroes, generate a number from range 1..MAXINT to replace the first row's column with. Anyways, "all zeroes" shouldn't happen all that often so it's probably not a huge concern that you filtered.
The code provided was not real test code. The only reason `assert True` is there is to highlight the fact that the issue was with generation, and not any other part of the test.
> Anyways, "all zeroes" shouldn't happen all that often so it's probably not a huge concern that you filtered.
This is really the point.
Like I said, Clojure’s spec (with or without test.check) has no trouble.
Hypothesis doesn't try to give you a consistent probability distribution, especially not a uniform one. It has lots of heuristics to give you 'nasty' numbers in order to trigger more bugs.
That's actually one big reason I prefer Hypothesis over eg Rust's proptest, which makes no attempt to give you 'nasty' numbers. Last time I checked, QuickCheck didn't make any attempts to give you floats well-known to cause trouble either (like inf, negative inf, different NANs, 0, 1, -1, the least positive float, the largest negative float, the largest finite float just before infinity etc.)
There has been some research (can't find the paper right now but it was about F-metric or something?) suggesting uniform generators will trigger bugs more often than the edge-case-preferring ones. I'm still not too sure about whether I want to believe it though.
Anecdotally, I have triggered many more bugs with the edge-case-preferring generation:
Basically, that was me using Rust's proptest library, which does uniform generation by default, and I hacked it up to prefer edge cases. For simplicity, for e.g. u32 I set it up to pick, with something like 50% probability, one of 0, 1, 0xFFFF_FFFF, 2, and a few other special values, or else to pick uniformly at random.
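The same idea sketched in Go (not the actual patch to proptest):

    // biasedUint32 returns a well-known "nasty" value about half the time,
    // and a uniformly random value otherwise.
    func biasedUint32(rng *rand.Rand) uint32 {
        special := []uint32{0, 1, 2, 0x7FFF_FFFF, 0x8000_0000, 0xFFFF_FFFF}
        if rng.Intn(2) == 0 {
            return special[rng.Intn(len(special))]
        }
        return rng.Uint32()
    }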
I vaguely remember some research that compared carefully hand-picked values vs property based testing. And the carefully hand-picked values performed slightly better at the same number of test cases. But if you modestly increased the number of test cases generated for property based testing, that swamped the advantage of careful hand picking.
Similarly, I suspect that any disadvantage that might exist for the scheme I outlined above compared to uniform sampling would go away, if you increased the number of test cases slightly. At least as long as you give a decent chunk of probability weight in my scheme to uniform sampling.
> - rose tree "integrated shrinking" (eg. Hedgehog) follows the constraints of the generators, but has issues with monadic bind
> - the only approach that seems to magically "just work" is Hypothesis' "internal shrinking", which uses a layer of indirection to shrink lists of random choices instead of the values themselves.
My preferred framework is falsify, which provides "internal integrated shrinking". It's similar to Hypothesis, but uses a tree of generators rather than a linear sequence (it's based on "selective functors", which are a nice interface that's also useful for e.g. validators).
> This library provides property based testing with support for internal integrated shrinking: integrated in the sense of Hedgehog, meaning that there is no need to write a separate shrinker and generator; and internal in the sense of Hypothesis, meaning that this works well even across monadic bind.
I've been talking with Edsko de Vries about Falsify when he was implementing it / writing the paper. Very cool stuff, although somewhat mind-bending :)
I've tried to use property based testing but always found it falls between two stools. If I understand a property well enough to write a strict test of it, I can generally push it into the type system and make it true by construction. And if I just want a "smoke test" then a single arbitrary input is easier.
For example, a lot of times I might have two implementations, a naive one that's slow and simple, and an optimized one. I can compare the output of both against each other on arbitrary input. That's a simple property that easy to understand, but hard to put in the type system in general.
Similarly, I often have properties that eg the order that input is presented in shouldn't matter, or sometimes you have ways to partition your data 'max(maximum over A, maximum over B)) = maximum(A union B)'. How do you encode that in the type system?
Or you have something like 'for any A and B: (find some optimal solution over A) is worse than (find some optimal solution over (A union B))'. Or you have idempotency like 'f(f(A)) = f(A)'.
These are all easy to understand properties, but not trivial to describe in most type systems.
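For illustration, two of those properties written against Go's fuzzer (sortFast, sortNaive and normalize are stand-ins for real code, assumed to return fresh slices):

    func FuzzProperties(f *testing.F) {
        f.Fuzz(func(t *testing.T, data []byte) {
            xs := make([]int, len(data))
            for i, b := range data {
                xs[i] = int(b)
            }
            // Oracle property: the optimized and naive implementations agree.
            if !slices.Equal(sortFast(xs), sortNaive(xs)) {
                t.Fatalf("implementations disagree on %v", xs)
            }
            // Idempotency: f(f(x)) = f(x).
            once := normalize(xs)
            if !slices.Equal(normalize(once), once) {
                t.Fatalf("normalize is not idempotent on %v", xs)
            }
        })
    }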
Clearly it's better to enforce constraints at compile-time, if possible. However, there are many kinds of constraints that mainstream type checkers just can't handle. (Dependent types would help a lot, but still seem limited to niches, such as theorem provers.)
I wonder if the original Quviq Erlang QuickCheck is missing from the list? The full product is proprietary, but there is a freeware version, QuickCheck Mini, available as well: http://www.quviq.com/downloads/
For my C#/.NET testing I've been using CsCheck[0] and I've enjoyed it quite a bit. It was a lot more approachable compared to Hedgehog and FsCheck and is also quite fast.
I'm going to note also after digesting the article further that CsCheck has support for "stateful"/"model-based" testing in addition to property-based testing. It also supports the linearizable/parallel testing that is described in the article.
The obvious downside is the number of examples computed. A simple unit test can easily take 100 times as long when property-tested.
Besides, it would be really cool to have property-based integration or hybrid unit/integration tests. Or even property-based E2E tests. Unfortunately, the setup of the example will almost always take too long for a relevant set of runs.
For instance: If you have a basic data model (say in sqlalchemy) and want to write property-based tests (say in hypothesis), you can relatively quickly derive strategies for the models (but beware of recursion and primary keys). But writing that model instance into the DB for running an example just takes too long for swift testing.
I tried writing a Hedgehog-inspired library in Rust and realized how complex the underlying domain is.
For example, the RNG is "splittable" [1][2] which means you can take a deterministic seed and split it in two, e.g. for parallel generators to work independently but still deterministically. The effort that has gone into this feature level is a little numbing. I have an awe similar to that of the "fast inverse square root" hack when I see code like this:
-- | A predefined gamma value's needed for initializing the "root" instances of
-- 'Seed'. That is, instances not produced by splitting an already existing
-- instance.
--
-- We choose: the odd integer closest to @2^64/φ@, where @φ = (1 + √5)/2@ is
-- the golden ratio.
--
goldenGamma :: Word64
goldenGamma =
0x9e3779b97f4a7c15
when realizing that those numbers don't come easily. [3]
Wonder why most property-testing libraries don't have features like this?
The libraries require training to use. And they're not that easy to write.
> the current state-of-the-art when it comes to property-based testing is stateful testing via a state machine model and reusing the same sequential state machine model combined with linearisability to achieve parallel testing
Okay, okay. I admit I've never performed property-based stateful testing, nor in parallel. So that may be the coolest feature out there, because it addresses one of the hardest problems in testing.
But I think that yet other things have happened with modern property-testing libraries (e.g. Hypothesis, PropEr, Hedgehog, Validity):
Shrinking for free [4], generators for free [5], defining the probability distribution of your sub-generators in a composable way.
Maybe those features are not as significant, but they're equally missing from almost all property-test libraries.
Huh. I wonder why they don't use one of those random-access PRNGs like PRNS, which is basically a hash of a counter. Maybe not good enough in the speed-quality space?
Not all PRNGs are created equal; eg linear congruential ones take many steps until values of initial seeds that are close to each other become decorrelated.
Eg try java’s random prng library, and generate a few booleans with two seeds very close to each other.
In JS I often want to test that a parameter value does what is expected in the positive case: foo("A"), but also in the negative case, i.e. foo([not "A"]). Where [not "A"] is some arbitrary string(s) representing the negative condition, like say "ZZ". Can the JS property based testing library help me generate an arbitrary not "A" elegantly?
PBT libraries usually have some support for filtering ("if you generate a value x such that P(x), discard it and try again"), but it's frowned upon. You should only use it when you're relatively certain it's not going to happen very often. But, it's an option.
You could try to somehow be clever about how you generate the inputs: ("A" + nonempty string | "B-Z" ++ possibly empty string)
Or you could generate a string, check it for "A"-ness, and if true, fix it before running the actual test by appending some random non-empty string after it.
The three approaches might look like (in pseudocode)
1. Filtering
gen.string().filter(s => s !== "A")
2. "Smart" construction
gen.oneOf([
"A" + gen.string().nonempty(),
gen.charBetween("B","Z") + gen.string()
])
3. Naive generation + fix
gen.string().andThen(s => (s === "A") ? s + gen.string().nonempty() : s)
It slows down the testing. All the invalid inputs are generated and then discarded, so it’s much more efficient to come up with a construct that succeeds all the time or at least most of the time.
If you ask for an integer but only want odds (as an example), 50% of the generated values will be tossed out with filtering. If you instead take every integer and make them odd (double and add 1, for instance) then all the generated inputs will be usable.
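The same contrast as a pair of toy Go generators (names invented for illustration):

    // oddByFilter discards about half of the candidates it draws.
    func oddByFilter(rng *rand.Rand) int {
        for {
            if n := rng.Int(); n%2 != 0 {
                return n
            }
        }
    }

    // oddByMapping makes every draw usable: double and add one.
    func oddByMapping(rng *rand.Rand) int {
        return 2*rng.Intn(1<<30) + 1
    }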
The major omission in this article is fuzzing. Not only is it practical and in wide (and growing use), it’s also far more advanced than QuickCheck’s approach of generating random inputs, because fuzzing can be _coverage-driven_. Property-based testing came out of academia and fuzzing came out of security research, initially they were not connected. But with the advent of in-process fuzzing (through libFuzzer), which encourages writing small fuzz tests rather than testing entire programs; and structure-aware fuzzing, which enables testing more than just functions that take a bytestring as input, in my view the two techniques have converged. It’s just that the two separate communities haven’t fully realized this yet.
One pitfall with non-coverage-driven randomized testing like QuickCheck, is that how good your tests are depends a lot on the generator. It may be very rarely generating interesting inputs because you biased the generator in the wrong way, and depending on how you do the generation, you need to be careful to ensure the generator halts. With coverage-driven fuzzing all of these problems go away; you don’t have to be smart to choose distributions so that interesting cases are more common, coverage instrumentation will automatically discover new paths in your program and drill down on them.
But isn’t fuzzing about feeding a large program or function random bytestrings as inputs, whereas property-based testing is about testing properties about data structures? It is true that fuzzers operate on bytestrings, but there is no rule that says we can’t use that bytestring to generate a data structure (in a sense, replacing the role of the random seed). And indeed this is what the Arbitrary crate [1] in Rust does, it gives tools and even derive macros to automatically generate values of your data types in the same way that QuickCheck can. The fuzzing community calls this Structure-Aware Fuzzing and there is a chapter about it in the Rust fuzzing book [2]. There are also tools like libprotobuf-mutator [3] that substitute fuzzers’ naive mutation strategies, but even with naive strategies fuzzers can usually get to 100% coverage with appropriate measures (e.g. recomputing checksums after mutation, if the data structure contains checksums).
I am using this extensively in my own projects. For example, RCL (a configuration language that is a superset of json) contains multiple fuzzers that test various properties [4], such as idempotency of the formatter. In the beginning it used the raw source files as inputs, but I also added a more advanced generator that wastes fewer cycles on inputs that get rejected by the parser. The fuzzer has caught several bugs, and most of them would have had no hope of being found with naive randomized testing, because they required a cascade of unlikely events.
Structure-aware fuzzing is not limited to generating data structures either, you can use it to generate reified commands to test a stateful API, as you describe in the _Stateful property-based testing_ section. The Rust fuzzing book has an example of this [5], and I use this approach to fuzz a tree implementation in Noblit [6].
Agreed. A while back I played around with fuzzcheck [1], which lets you write coverage-guided, structure-aware property tests, but the generation is smarter than just slamming a fuzzer's `&[u8]` input into `Arbitrary`. It also supports shrinking, which is nice. Don't know that I would recommend it though. It seemed difficult to write your own `Mutator`s. It also looks somewhat unmaintained nowadays, but I think the direction is worth exploring.
JS is single threaded, but the process lives in an event loop.
You can easily write a JS function that can have a race condition by launching two asynchronous processes (such as writing to local storage in a browser or the file system both operating on the same data).
I had this happen just today while scraping different websites writing to the same files and overwriting each other.
Concurrency is not the same as parallelism. I’ve understood TFA to mean concurrency when it mentions parallelism. This would explain the presence of JS libraries which could test for concurrency problems in asynchronous code.
I don't think the problem is primarily a lack of good OSS implementations or UX lacking polish. To me, it's more that there is a mismatch between what programming language/CS enthusiasts find interesting and useful, and what industry software developers find useful.
First off, the "training" problem is a very real and fundamental blocker. I think academics or less experienced engineers see it as a problem of education, but I believe it's actually an engineering/organizational problem: for a company or even OSS project, when you use niche/unadopted tools like this you either need to provide training for it or you massively shrink the set of people able to work on your project by requiring or expecting them to learn it on their own. This introduces a practical headache bigger than the ones a tool like this solves (since these are just incrementally better ways to test software) - you get less contributions or need to spend more on getting people onboarded. Note that even if "training" people is free in the sense that there is no paid training material, you still pay a cost in lower productivty for new people and the time spent training.
Even once people are trained, you now have a process/institutional need to support the property based tests you wrote. That may mean paying for licenses for software, or maintenance of whatever infrastructure/integrations you write to get property based testing working in your developer workflow. And you also now need to rely on people using these tools correctly - the problem with formal verification is that it doesn't verify that you're actually solving the real problem correctly, just that your program operates as expected (ie it only verifies that your software is Working As Implemented, not Working as Intended). The more you rely on average joes to wield complex things like this well, the more headaches you introduce - if the bottom 10% of people you hire use it sloppily or wrongly, once you're past a few dozen regular contributors you're basically going to have constant issues. You see this all the time with even just regular unit and integration tests - less capable developers constantly introduce flakes or write tests so bad/basic that it'd be better for them to not even be writing tests at all.
Even if after considering all that it still seems a worthy tradeoff, there's the question of whether property based testing solves the problem you think it does. As I mentioned, it can moreso verify software is Working as Implemented rather than Working as Intended. But there is the bigger problem of major software projects not conforming to the simpler use cases that sell people on using this tool. In my experience stateless functions almost never have bugs, and when they do, it's mostly a "requirements discovery" problem you wouldn't catch with any kind of verification tooling.
Stateful verification becomes a problem in the kinds of real systems programming projects like the Linux Kernel or Qemu which stand to benefit most from verification. If you're verifying things at the level of a "higher level" abstraction like a scheduler, you almost always have really high dimensionality/a lot of state, perhaps deceptively more than you think, because of composition. And you tend to also have a devloop that is already very slow - builds take a long time, and running tests of higher level components can be "expensive" because it may take 100ms-1s for something like a kernel to finish starting in a test environment. For a single test that's nothing, but multiplied across all tests makes testing painful enough already; adding 3x or 100x by doing property based testing with new tests, even if pruned before submitting, could be either hugely expensive or way too slow. And similarly, the increased resource requirements to maintain state machines can be very expensive. TSAN does essentially parallel property-based testing and introduces 128-512 bytes of overhead for every 8 bytes of memory under test: https://github.com/google/sanitizers/wiki/ThreadSanitizerAlg....
Is it cool and underrated? Definitely. But it's not a silver bullet and honestly just isn't worth it for most software. Even when it is worth it, it has its drawbacks.
Oh man, talk about a walk down memory lane: a polite disagreement about the intersection of the public and private sectors in which commercial control of source code written by a founder who is also a researcher is being sold because the researcher found that to be more effective than opening the code (which he tried and didn’t get much attention as a result). Both reasonable arguments, I can see both sides. I remember when the software business wasn’t a dystopian nightmare.
Today a weird, scary splinter faction of the already terrifying Effective Altruism religion/cult is talking about building digital gods they can’t control and saying with a straight face that this is imminent (absurdly extravagant claim with no evidence more convincing than a logarithmic scale and a ruler) and using this as an argument to redefine everything from the definition of science to the very epistemology of factuality and getting the public to directly or indirectly subsidize it into the bargain.
People who make Ray Kurzweil’s Singularity magical thinking look flat fucking normal and frankly sound like L Ron Hubbard more every day are unilaterally deciding that all research into, ya know, what they claim is the most important event in human history is totally black box propriety because, ya know, for our own good.
I’ll take the polite Swiss functional programming folks over Anthropic’s Mechanical Interpretability people seven days a week and twice on Sunday.
Can we go back to this please? The new thing sucks.
My guy, read the blog post before responding. Not only has Stevan written a stateful property-based testing library, the article you didn't read addresses at great length the question of "why".
https://www.tedinski.com/2018/12/11/fuzzing-and-property-tes...
Like, with the below fuzz test, and the corresponding invariant checks, isn't this all but equivalent to property tests?
https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...
https://github.com/ncruces/aa/blob/505cbbf94973042cc7af4d6be...