Chasing a Bug in a SAT Solver (ochagavia.nl)
75 points by wofo | 31 comments



If you have a large input that triggers a bug and you want to reduce it, it might be worth looking into the Delta Debugging[0] algorithm used by C-Reduce[1] to make minimal test cases for C-like languages. C-Reduce "happens to do a pretty good job reducing the size of programs in languages other than C/C++, such as JavaScript and Rust" so it might work out of the box for you. (Aside: there's also ddSMT for SMT solvers that take SMT-LIB2 input.)

0: https://www.st.cs.uni-saarland.de/dd/

1: https://github.com/csmith-project/creduce
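
For a flavour of how it works, here's a minimal Python sketch of the ddmin idea (a simplified variant that only tests complements; "is_failing" is a predicate you supply, not C-Reduce's actual interface):

    def ddmin(chunks, is_failing):
        # shrink `chunks` (e.g. the lines of a file) to a small subset
        # that still makes is_failing() return True
        n = 2  # granularity: number of pieces to split the input into
        while len(chunks) >= 2:
            size = len(chunks) // n
            progress = False
            for i in range(n):
                # try the complement of the i-th piece
                complement = chunks[:i * size] + chunks[(i + 1) * size:]
                if is_failing(complement):
                    chunks = complement       # keep the smaller failing input
                    n = max(n - 1, 2)         # coarsen the granularity
                    progress = True
                    break
            if not progress:
                if n >= len(chunks):
                    break                     # can't split further: done
                n = min(n * 2, len(chunks))   # refine the granularity
        return chunks

    # e.g.: ddmin(open("crash_input.txt").read().splitlines(), triggers_bug)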


A problem with test case reduction occurs when your test generator also ensures the test input satisfies some properties. This means naive reduction can produce invalid inputs for which "failure" doesn't represent a bug. An example is testing C compilers, where the C code must not exhibit any of the many undefined behaviors described in the C standard. C-Reduce goes to some lengths to work around this problem.

A solution to this is to do the reduction on the tree of random numbers provided to the test generator, so the output is always a legitimate test input. The Hypothesis testing system takes this approach.
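
A toy illustration of the idea (not Hypothesis's actual internals): the generator consumes a sequence of random numbers, and the reducer shrinks that sequence rather than the generated output, so every candidate is re-run through the generator and is valid by construction:

    import random

    def generate(choices):
        # build a well-formed input (here a sorted list, standing in for
        # "input satisfying the generator's invariants") from raw numbers
        return sorted(c % 100 for c in choices)

    def fails(xs):
        # stand-in bug: the system under test chokes on inputs containing 7
        return 7 in xs

    def shrink(choices):
        # greedily delete choices; whatever we delete, generate() still
        # produces a legitimate test input
        improved = True
        while improved:
            improved = False
            for i in range(len(choices)):
                candidate = choices[:i] + choices[i + 1:]
                if fails(generate(candidate)):
                    choices, improved = candidate, True
                    break
        return choices

    while True:  # draw random seeds until one triggers the "bug"
        seed = [random.randrange(1000) for _ in range(20)]
        if fails(generate(seed)):
            break
    print(generate(shrink(seed)))  # prints [7]: minimal and still valid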


I think Hypothesis uses a string of random numbers rather than a tree?


It partitions the string into an implicit tree and uses that structure to reduce the string, if I understand correctly. The partitioning is obtained from the stack's structure during generation.


The Delta Debugging algorithm is an underused tool, applicable not only to debugging but to any task where you need to find a minimal satisfying subset. For example, it can also be used to automatically find an (approximately) minimal set of cloud permissions needed for some purpose: https://github.com/KanoComputing/aws-tools/blob/master/bin/a... This is otherwise quite a tedious exercise.

(It's approximate because theoretically you can find a local minimum - but for permissions, there's probably only one minimum)
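
A hedged sketch of how that looks in code, using a one-permission-at-a-time variant of the same idea (action_succeeds is a placeholder for "run the operation under a role restricted to these permissions", not the linked tool's actual interface):

    def minimize_permissions(perms, action_succeeds):
        # greedy reduction: drop each permission in turn and keep the
        # drop if the operation still succeeds; finds a local minimum
        perms = list(perms)
        changed = True
        while changed:
            changed = False
            for p in list(perms):
                trial = [q for q in perms if q != p]
                if action_succeeds(trial):
                    perms = trial
                    changed = True
        return perms

Delta Debugging's divide-and-conquer does the same job in fewer trials when most of the permissions turn out to be droppable.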


I wrote a single-file Python 3 implementation with no other dependencies here:

https://github.com/andrewchambers/ddmin-python

It can be used as a CLI program or as a Python library.


Not sure if this makes sense, but is that similar to fuzzing? Or can it be used alongside fuzzing?

What I am familiar with is property-testing libraries that automatically shrink failing cases, like QuickCheck for Haskell, quickcheck for Rust, and Hypothesis for Python. I suppose that if we modelled C code as an AST, one could use QuickCheck to shrink a test case. Is creduce like this, but operating on strings rather than trees?

https://stackoverflow.com/questions/16968549/what-is-a-shrin...


Yes, the kind of test case reduction that C-Reduce does is analogous to QuickCheck's shrinking. Some of C-Reduce's transformations are purely "on strings", e.g., "delete a physical line from the file"; others are a bit more syntax-aware, e.g., "delete N tokens"; and some actually build an AST using Clang and do semantic transformations like "replace function parameter by global variable".

And yes, this is used alongside fuzzing, once a fuzzer (like Csmith from the same group as C-Reduce) has found a file that provokes a bug.

A reducer is not a fuzzer itself, although with some ingenuity it can be made to behave like a fuzzer in certain contexts: https://blog.regehr.org/archives/1284
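
The purely string-level passes are simple enough to sketch; assuming a still_triggers_bug(source) predicate that recompiles and checks for the crash (a placeholder, not C-Reduce's API):

    def delete_lines_pass(source, still_triggers_bug):
        # one C-Reduce-style pass: try deleting each physical line and
        # keep the deletion if the bug still reproduces
        lines = source.splitlines()
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + 1:]
            if still_triggers_bug("\n".join(candidate)):
                lines = candidate  # keep the smaller reproducer
            else:
                i += 1             # this line is needed; move on
        return "\n".join(lines)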


AFL provides a nice standalone minimizer, afl-tmin; I've used it on several occasions and it works great for certain types of inputs.


I've used creduce (and cvise) to reduce test cases for all kinds of inputs, even ones that aren't C-like. If it has newline-delimited text, it'll work well (though you should adjust the knobs accordingly in order to converge faster).


thanks, this is fantastic!


I wrote a CSP solver a while ago. The bugs were alarmingly prone to being incomplete enumeration of solutions. The solver would say "Yep, here's all 2412 solutions to that problem", but there were more solutions out there that it missed. I didn't usually know how many solutions there should be, so it was hard to distinguish between "all of them" and "95% of them". That also threw a lot of doubt on the solver returning unsat: was the problem unsat, or was the solver failing to look in part of the search space?

I didn't find a plausible testing strategy for convincing myself that the last of that class of error was gone and ultimately stopped working on it as a result.


The best practical technique is to generate random problems, compare your solver with another (or two), and hope you don't both have the same bugs :)

You can also often make progress by "shuffling" problems (reordering constraints / variables), or by solving sub-problems and super-problems.
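
A sketch of both ideas for a toy CNF solver, with a brute-force enumerator as the trusted reference (solver_under_test is a placeholder, assumed here to return the set of all satisfying assignments):

    import itertools, random

    def brute_force_models(clauses, n_vars):
        # trusted reference: enumerate every assignment; a clause is a
        # list of non-zero ints, literal v meaning "var v is true" and
        # -v meaning "var v is false"
        models = set()
        for bits in itertools.product([False, True], repeat=n_vars):
            if all(any(bits[abs(l) - 1] == (l > 0) for l in clause)
                   for clause in clauses):
                models.add(bits)
        return models

    def shuffled(clauses):
        # reorder clauses and the literals inside them; the set of
        # solutions must not change
        cs = [random.sample(clause, len(clause)) for clause in clauses]
        random.shuffle(cs)
        return cs

    def check(solver_under_test, trials=1000, n_vars=5, n_clauses=8):
        for _ in range(trials):
            clauses = [[random.choice([1, -1]) * random.randint(1, n_vars)
                        for _ in range(3)] for _ in range(n_clauses)]
            expected = brute_force_models(clauses, n_vars)
            assert solver_under_test(clauses, n_vars) == expected
            assert solver_under_test(shuffled(clauses), n_vars) == expected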


This is my go-to technique for tricky code.

Most of the time it's pretty easy to write a dumb, slow, almost certainly correct version of the thing in question and just fuzz the two implementations on small inputs. E.g., if you're speeding up floating-point division by 2 with bit-hacks on the exponent field, compare it to just dividing by 2 (and for 32-bit floats, just check all of them). If you're implementing a sort, compare it to bubble sort. If you're implementing a SAT solver, compare it to just checking every possible solution; even with a comparison solver that slow, you can check every possible problem up to ~4-6 terms (depending a bit on how you define a "term") and spot-check problems with many terms and 30 independent variables.
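
For instance, a sketch of the float case in Python (the full 2^32 sweep is cheap in C but slow in pure Python, so this strides; half_bithack is a deliberately naive implementation whose edge-case bugs the sweep exposes):

    import struct

    def f32(bits):
        # reinterpret a 32-bit pattern as a float
        return struct.unpack('<f', struct.pack('<I', bits))[0]

    def to_f32(x):
        # round a Python double to the nearest 32-bit float
        return struct.unpack('<f', struct.pack('<f', x))[0]

    def half_bithack(x):
        # "fast" halving: subtract 1 from the exponent field; wrong for
        # zeros, denormals and infinities, which the sweep will flag
        (bits,) = struct.unpack('<I', struct.pack('<f', x))
        return f32((bits - (1 << 23)) & 0xFFFFFFFF)

    for bits in range(0, 1 << 32, 100_003):  # strided sample; step 1 in C
        x = f32(bits)
        if x != x:                           # skip NaNs (NaN != NaN)
            continue
        if half_bithack(x) != to_f32(x / 2):
            print(f"mismatch at {bits:#010x}: x={x!r}")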


Yes, I've done exactly this with my constraint solver ( https://github.com/minion/minion ). Famous last words, but almost all bugs that have been reported have been reproducible with 6 variables of domain size at most 6 (maybe even fewer).

The only exceptions are cases where integer overflow occurs, and those really need separate tests. I mostly handle that by putting fairly conservative bounds on maximum integer values and asserting if those are ever violated, because (particularly for academic software) an assertion firing where your solver might have worked but is undertested is, in my mind, infinitely better than a wrong answer.


What sorts of code normally causes integer overflows in minion? I have to be super cautious of float overflows in most of my code, but usually with integers I'm implementing a bitmask or a 64-bit counter or something similarly not prone to such problems.


Just boring old arithmetic; it's quite common for people to write maths problems which end up overflowing a 64-bit integer.

Here's one I remember someone writing:

There is a "numbers game" on the UK show Countdown, where you are given a series of 6 numbers and need to add, multiply, subtract, and divide them to make a target. They were considering generalisations.

They wanted to allow 500 as an input number and also allow 8 numbers. While their targets were less than 10,000, 500^8 overflows a 64-bit integer, and we often want to find things like upper and lower bounds early on.
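
The arithmetic is easy to confirm in a Python REPL; 500^8 is roughly 400 times larger than the largest signed 64-bit value:

    >>> 500 ** 8
    3906250000000000000000
    >>> 2 ** 63 - 1   # largest signed 64-bit integer
    9223372036854775807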


I'd be interested to know if you could find problems by looking at invariants over partitions of the space.

Like

Is the number of solutions with v1=false plus the number with v1=true equal to the number with v1 free?

If we have n constraints then there are 2^n variants where you negate a subset of the constraints. Does the sum of solution counts over the 2^n subproblems equal the size of the variable space?
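
A hedged sketch of that second check for a toy constraint counter (count_solutions stands in for the solver's counting entry point; against a real solver you'd call that instead). Each assignment satisfies exactly one sign pattern, so the counts must sum to 2^(number of variables):

    import itertools

    def count_solutions(constraints, n_vars):
        # stand-in solver: a constraint is a function from an
        # assignment (tuple of bools) to bool
        return sum(all(c(bits) for c in constraints)
                   for bits in itertools.product([False, True], repeat=n_vars))

    def check_partition_invariant(constraints, n_vars):
        total = 0
        for signs in itertools.product([False, True], repeat=len(constraints)):
            # negate the constraints selected by `signs`
            variant = [lambda b, c=c, s=s: c(b) != s
                       for c, s in zip(constraints, signs)]
            total += count_solutions(variant, n_vars)
        assert total == 2 ** n_vars, total

    check_partition_invariant([lambda b: b[0] or b[1],
                               lambda b: not b[2]], n_vars=3)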


That's a good idea.

The search method is: propagate constraints for a while, then pick a variable and bisect it, solve each side, and combine.

Generalising slightly from your suggestion: varying the order in which the search runs, and whether various propagations or optimisations are active, should only change the runtime, not the set of results.

The other thing I've discovered since that project is clang's AST-based branch coverage recording. Getting to the point where smallish unit tests, perturbed or enumerated as above, exercise each path through the solver would probably shake out the bugs.


Maybe one could attempt to use a proof assistant to prove that the algorithm is correct?


For what it's worth, I'm now working on a proof assistant, so yeah, that's the game plan. I think the straight-line case for the solver is probably OK, but there's a huge range of optimisations out there and it's far too easy to introduce errors there.


At present we are a long way from proving general constraint solvers correct, as they are built from so many little algorithms that interact in weird ways.

It's an interesting area, and I hope we'll get there in the future.


Or just prove to yourself with pen and paper that the algorithm works, using structural induction and other techniques. Then it's just a matter of whether the code follows the spec.


Resolvo (the SAT solver discussed here) has been really good for us. It helped make some conda-forge bots up to 10x faster than the previous C-based solver (libsolv), while being memory-safe.

The specific bot's tests went from taking 60 minutes to ~6 minutes, which is quite remarkable.


i've used drmaciver's property testing library hypothesis to find bugs in c libraries before; it has shrinking. 'doesn't crash' is the easiest property to define, but a bit fiddly to connect to property testing libraries, because you have to spawn a subprocess. in one case i just used fork, _exit, and wait, but in some cases things get ugly then (posix threads guarantee you almost nothing, but fortunately i wasn't using threads)
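
a sketch of the idea with subprocess rather than raw fork (hypothesis's api is real; './target' is a hypothetical c binary under test that reads stdin):

    import subprocess
    from hypothesis import given, strategies as st

    @given(st.binary(max_size=4096))
    def test_does_not_crash(data):
        # a negative returncode means the process died from a signal
        # (e.g. SIGSEGV), which is the "crash" we're hunting for
        result = subprocess.run(['./target'], input=data,
                                capture_output=True, timeout=10)
        assert result.returncode >= 0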

his 'minithesis' is the best way i've found to learn about implementing property-based testing

sat and smt solvers have gotten astoundingly good, and most of the leading ones are free software. smt-comp shows off new gains every year and has a lineup of the leaders

z3 is a particularly easy smt solver to get started with, in part because of its convenient python binding
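
for instance, a few lines of z3py:

    from z3 import Ints, Solver, sat

    x, y = Ints('x y')
    s = Solver()
    s.add(x > 0, y > 0, x*x + y*y == 25)  # find a pythagorean pair
    if s.check() == sat:
        print(s.model())                  # e.g. [y = 4, x = 3]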

a thing sat/smt and property-based testing have in common is that they both solve inverse problems, allowing you to program at a more declarative level


I don't see it mentioned here, but SAT solvers are also good targets for property based testing, providing failure cases for subsequent minimization. There are many properties that can be checked, for example that satisfaction-preserving transformations on an input do not alter the answer the solver returns.
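
One cheap but effective property of this kind, sketched for a solver that returns a satisfying assignment (solve() is a placeholder call): whenever the answer is SAT, independently verify the model against every clause.

    def verify_model(clauses, model):
        # check a returned assignment against DIMACS-style clauses:
        # literal v means variable v is true, -v means false;
        # `model` maps variable numbers to booleans
        return all(any(model[abs(l)] == (l > 0) for l in clause)
                   for clause in clauses)

    # after each run on a generated instance:
    # answer, model = solve(clauses)
    # if answer == "SAT":
    #     assert verify_model(clauses, model)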


Ah, counting! One of the hardest problems; that's why I suck at probability and stats. Knuth's Volume 4


This article has almost zero content on the bug, the debugging, or the fix. I expected more details.


Yeah, he says "it's too short to be an interesting story", and then doesn't get into any specifics


Having ended up with a critical bug in the SAT solver I wrote for my undergrad thesis, I know it really can be a challenge to fix one without clear logs. So it's always nice to see a little love for contributing through issues and finding minimal ways to reproduce edge cases.

While we do acknowledge that good issue contributions are significant and meaningful, we often forget that there's more to it than the initial filing, and may overlook the contributions of those who join lengthier issue threads later.

(Oh, and yes, that critical bug did impact the undergrad thesis. It could be worked around, but that meant I couldn't show the full benefits of the solver.)


I thought this was about SAT physics (the separating axis theorem, used in games etc.)



