Disorderfs: FUSE-based filesystem that introduces non-determinism into metadata (debian.org)
127 points by mmastrac on Jan 29, 2021 | 31 comments



Sorry for the plug. NixOS has a plan to use DisorderFS to make its builds 100% reproducible.

https://r13y.com/

As I understand it, several packages are currently not reproducible (e.g. python, pytest, gcc), so this is not a priority yet; but once those large packages are done, r13y will start using DisorderFS to uncover the remaining reproducibility bugs.

It may be idealistic, but it's a pleasure to see this happening across the package space.


Strictly speaking, this is nothing new. Debian has been doing it for 5 years.

https://reproducible-builds.org/citests/

All distros there essentially have reproducibility-fuzzing CI systems that introduce non-determinism through disorderfs, locale changes and so on. These changes are fine to fix, but they're nothing you'd normally encounter when reproducing packages for a distribution.

Personally, I think the important part is whether the patches are upstreamed or not. That isn't something that is a priority among distributions.

Results from fuzzing in Arch:

https://tests.reproducible-builds.org/archlinux/archlinux.ht...

Results from just chroot recreation:

https://reproducible.archlinux.org


Instead of introducing non-determinism, shouldn't it try to enforce determinism in every possible way? (E.g. by running all processes under ptrace or under virtualization, thereby making the OS behave deterministically during a build.)


> shouldn't it instead try to enforce determinism in any possible way

That would only fix the build machine's problem, it wouldn't fix anyone else's builds.

A repeatable build without determinism is a fix for all people everywhere.


> That would only fix the build machine's problem, it wouldn't fix anyone else's builds.

It would fix everyone else's builds if they added determinism to their builds (as opposed to adding non-determinism in the test procedure). Besides, a test procedure never gives a guarantee, because a bug depending on non-determinism can be subtle.


> a bug depending on non-determinism can be subtle.

That's the point - to uncover bugs dependent on nondeterminism by using a filesystem that introduces it. This is for fuzz testing at the filesystem level, not literally reproducing the builds correctly multiple times.

From the linked README:

"This is useful for detecting non-determinism in the build process."


The goal is to ensure that builds can be deterministic despite non-determinism.

Once such criterion is enforced, then everybody can reproduce the build with that set of source files and build instructions, without requiring a special environment that forces a specific order of events.


> The goal is to ensure that builds can be deterministic despite non-determinism

...by using non-determinism?

Very mind bending for me, I'm not sure I understand, but I'm glad smart people are figuring this stuff out.


You use disorderfs as part of a CI process. The CI builds the package once without disorderfs, and once with disorderfs. If they produce the same output, the package is reproducible (at least with respect to filesystem order). Otherwise, something in the build process is depending on filesystem order and should be fixed to sort directory entries before using them. You wouldn't use disorderfs when building a package normally.

(At least this is how Debian uses disorderfs. I wrote the first version of disorderfs 6 years ago in a hacking session at DebConf15 in Heidelberg. I never expected to see it on the front page of HN!)
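
For illustration, a minimal Python sketch of the kind of fix such a CI failure points to (the function and path names here are hypothetical, not from any real package): a build step that archives a source directory must not rely on whatever order the kernel returns entries in, so it sorts them first.

    import os
    import tarfile

    def archive_sources(src_dir: str, out_path: str) -> None:
        # os.listdir() returns entries in whatever order the filesystem
        # provides; under disorderfs that order is reversed or shuffled,
        # so an unsorted loop would produce a different tarball.
        entries = sorted(os.listdir(src_dir))  # the fix: sort explicitly
        with tarfile.open(out_path, "w") as tar:
            for name in entries:
                tar.add(os.path.join(src_dir, name), arcname=name)

Built once on a plain filesystem and once under disorderfs, the unsorted variant yields different archives, while the sorted one yields identical bytes (assuming the rest of the build is already deterministic).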


I think you need to build many times to be sure.

Therefore (the original question), instead of using "disorderfs", why not write and use an "orderedfs" for every build?


The original behavior of disorderfs was to randomly shuffle directory entries, but we quickly realized that this meant that sometimes the shuffle wouldn't do anything, so I changed the default behavior to simply reverse the directory entries instead. Therefore, you only have to build twice. (Ironically, disorderfs' "non-determinism" is actually deterministic.)

As to your original question, there are so many sources of nondeterminism that trying to emulate them all away would make builds more complicated, less performant (FUSE adds overhead), and less safe (since there would be more components that could potentially be backdoored).
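
To make the reversal concrete, here is a rough Python sketch of a read-only passthrough filesystem that serves directory entries in reverse. It uses the fusepy library purely as an illustration; disorderfs itself is a separate C++ FUSE filesystem, and the class and mountpoint names below are made up.

    import os
    from fuse import FUSE, Operations  # fusepy

    class ReverseDirFS(Operations):
        """Read-only passthrough that serves directory entries in reverse."""

        def __init__(self, root):
            self.root = root

        def _real(self, path):
            return os.path.join(self.root, path.lstrip("/"))

        def readdir(self, path, fh):
            # Reversing (rather than shuffling) guarantees that a build run
            # under this mount sees a different order than a normal run.
            entries = [".", ".."] + os.listdir(self._real(path))
            return list(reversed(entries))

        def getattr(self, path, fh=None):
            st = os.lstat(self._real(path))
            return {key: getattr(st, key) for key in (
                "st_mode", "st_nlink", "st_size", "st_uid", "st_gid",
                "st_atime", "st_mtime", "st_ctime")}

        def read(self, path, size, offset, fh):
            with open(self._real(path), "rb") as f:
                f.seek(offset)
                return f.read(size)

    # FUSE(ReverseDirFS("/path/to/source"), "/mnt/disordered", foreground=True, ro=True)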


Because doing it that way will make building (and verifying) a deterministic build more difficult for users, while forcing the builds to be deterministic in the face of randomised non-determinism means that anyone can build the project and get the same output without needing any complicated build configuration. That end goal (all builds are deterministic even if you don't have some magical reproducible build machine) is the holy grail of reproducible builds.

And since this is run as part of a CI process, you will get lots of builds over time and will root out all sorts of issues caused by non-determinism.


Because then your software is relying on guarantees not provided by the POSIX API and it would be incorrect.


Thanks! That makes more sense now.


Repro build folks are introducing variation in the build environment (including with disorderfs) in order to uncover reproducibility bugs and then fix them. Here are the variations Debian is introducing:

https://tests.reproducible-builds.org/debian/index_variation...

It is similar to how Chaos Monkey increases the resilience of Netflix's service by introducing random failures and then, for each of those failures, working out how to prevent it from affecting the overall status of the service.


It's to root out "works [deterministically] on my machine" bugs earlier.


I think the idea is you can build from source on any file system or drive and get the same results. By adding artificial purposeful nondeterminism you can fix your builds to account for unintentional natural nondeterminism.


It’s going to be part of the testing suite. The packages should be able to build exactly the same even in a hostile environment. You want the environment “outside” your build system to be as chaotic as possible so you know there aren’t any accidental dependencies you don’t catch.


It's a form of fuzzing. A bug free build system would not be affected by non-determinism.


Buildbarn, a build cluster implementation for Bazel that I maintain, can also run build actions (compilation steps, unit tests) in a FUSE file system. Though the primary motivation is that it makes constructing a build action's file system nearly instant, it also has the advantage that I can do things similar to disorderfs. Shuffling directory listings is actually something that I also added. Pretty useful!

https://github.com/buildbarn/bb-remote-execution/blob/eb1150...


If you are confident that certain build steps are deterministic, can you enable Dockerfile-like caching for intermediate steps? The way docker does it is take a hash of the "input" filesystem and the command, and see if there's an associated "result" filesystem, and if there is then just jump to evaluating the next command with the previous result as input.
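
A rough Python sketch of that idea (the cache layout and helper names are made up, not Docker's or Bazel's actual implementation): the cache key is a digest of the input tree plus the command, and the command only runs on a miss.

    import hashlib
    import os

    def tree_digest(root: str) -> str:
        """Digest an input tree: relative paths plus file contents."""
        h = hashlib.sha256()
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # walk in a stable order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                h.update(os.path.relpath(path, root).encode())
                with open(path, "rb") as f:
                    h.update(f.read())
        return h.hexdigest()

    def run_cached(input_root, command, cache_dir, run_action):
        # Cache key = digest of the inputs plus the command line.
        h = hashlib.sha256()
        h.update(tree_digest(input_root).encode())
        h.update("\0".join(command).encode())
        key = h.hexdigest()

        hit = os.path.join(cache_dir, key)
        if os.path.isdir(hit):                     # hit: reuse stored outputs
            return hit
        out_dir = run_action(input_root, command)  # miss: actually build
        os.rename(out_dir, hit)                    # store the result under the key
        return hit

Bazel keys each action on the digests of its individual declared inputs rather than a whole tree, which is what enables the fine-grained invalidation discussed further down the thread.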


That's precisely what bazel does (and remote build execution systems like buildbarn allow this cache to be effectively shared among users)


Ah, cool! IIRC Bazel predates docker, so perhaps it's more correct to say that dockerfiles use a Bazel-like command caching strategy.


Yep; moreover, Bazel has a fine-grained dependency graph and can thus invalidate only a minimal part of the action graph when some inputs change.

Docker instead has to re-build all subsequent layers when the input of even just one layer changes.

The two tools sit at different points on the spectrum of ease of use, though. Maintaining build files for Bazel (and dealing with the other constraints of hermetic execution) is time-consuming, and it's hard to convince many teams it's worth the effort.

Docker apparently struck a sweet spot in that it's easy to explain how to craft linear build steps and it does a half-decent job of actually caching stuff. Sometimes it doesn't work well, depending on your workflow, but people then have an incentive to read about how to improve the "cacheability" of their dockerfiles (multistage, reorder, dockerignore, ...). As with many things in our craft, the human aspect trumps technical brilliance.


From the headline, I was picturing EXIF tags, which would be pretty amusing. Store your photos at this mountpoint here, read 'em from that mountpoint over there, and their locations and camera details get anonymized. But the real data still lives on disk.

Alas that's not what this is about, but now I wonder how hard it would be to make the thing I had in my head.


Not too hard. You should build it!


A FUSE mount, or even a webdav virtual directory hierarchy, wouldn't be that tricky. Stripping metadata is trivial, thanks to ExifTool.
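
A rough Python sketch of the stripping step, assuming ExifTool is installed (the function name and the temp-copy approach are mine, not part of any existing tool):

    import os
    import shutil
    import subprocess
    import tempfile

    def stripped_copy(src: str) -> str:
        """Return a temporary copy of `src` with all metadata removed."""
        dst = os.path.join(tempfile.mkdtemp(), os.path.basename(src))
        shutil.copy2(src, dst)
        # "-all=" clears every writable tag (EXIF, GPS, XMP, ...);
        # "-overwrite_original" stops ExifTool from keeping a backup file.
        subprocess.run(["exiftool", "-all=", "-overwrite_original", dst],
                       check=True)
        return dst

A read-only FUSE layer could call something like this the first time a file is read and serve the stripped copy, leaving the original bytes untouched on disk.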


On GitLab it isn't even possible to read the README without JavaScript enabled. Too bad!



The work done in Debian around reproducible builds is really impressive.


nixbuild.net ("Nix Builds as a Service") can run any Nix build with fs randomization similar to disorderfs:

https://blog.nixbuild.net/posts/2021-01-13-finding-non-deter...

It isn't enabled by default but can be turned on with a setting:

https://docs.nixbuild.net/settings/#inject-fs-randomness



