Running the "Reflections on Trusting Trust" Compiler (swtch.com)
287 points by rsc on Oct 26, 2023 | 67 comments



What a coincidence, I was reading the GNU Mes project's documentation today, which is very relevant: https://www.gnu.org/software/mes/manual/mes.html


I thought of the term "bootstrap pilgrimage" years ago to concisely refer to such projects --- both to highlight the journey of learning they give, and to acknowledge that not everyone has the time or the skill to embark on one.


I like that term. I've tried a couple of bootstrap pilgrimages.

For a job, we needed Go on FreeBSD 8, which entailed a local patch to revert pipe2 to pipe in Go, and building from source. That was my first bootstrap. Go, as Russ mentions, is easy to bootstrap, and I've done it a couple of other times.

I've tried to bootstrap Rust from its old compiler written in OCaml, but that one is trickier. It has not been maintained like Go's bootstrap compiler, and the bootstrap chain has long since been broken. Furthermore, rustc is only guaranteed to be buildable with the previous release. As far as I can tell, no one has built rustboot in many years. I like doing software archaeology, so I'll probably try again with that project sometime.

mrustc takes another approach, closer to the aforementioned GNU Mes, in that it's a reimplementation of Rust, intended for bootstrapping rustc. https://github.com/thepowersgang/mrustc


Funny story: the way my compiler ( https://github.com/neat-lang/neat ) used to build was this: three years ago there was an initial minimal compiler written in D. The repo had a file with a list of breaking commits, and every time you checked it out on a new system, it would:

- git clone itself in a subfolder

- git checkout and build the initial D compiler

- install it in a temporary prefix

- git checkout the first breaking commit and build it with the initial compiler

- install it over the previous compiler

- git checkout ...

and so on. In normal operation all these stages would be cached, so before I abandoned this approach, I think I was up to a hundred or so intermediate versions.
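
Roughly, the chain looked like this (a hypothetical sketch in Go, not the actual bootstrap code; the commit names and build/install commands are placeholders):

    package main

    // Sketch of the chain described above: clone the repo into a subfolder,
    // then for each breaking commit, build it with the previously installed
    // compiler and install the result over it.

    import (
        "fmt"
        "os"
        "os/exec"
    )

    func run(dir, name string, args ...string) {
        cmd := exec.Command(name, args...)
        cmd.Dir = dir
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            fmt.Fprintln(os.Stderr, "step failed:", err)
            os.Exit(1)
        }
    }

    func main() {
        breakingCommits := []string{"commit-1", "commit-2", "commit-3"} // placeholders
        repo, prefix := "bootstrap/neat", "bootstrap/prefix"
        run(".", "git", "clone", ".", repo) // clone itself into a subfolder
        for _, c := range breakingCommits {
            run(repo, "git", "checkout", c)               // step to the next breaking commit
            run(repo, "./build.sh")                       // build with the previous compiler (placeholder)
            run(repo, "./install.sh", "--prefix", prefix) // install over the previous one (placeholder)
        }
    }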

Eventually I started doing git releases, which needed a better solution. So since I have an optional C backend, I just build the compiler with the C backend, then zip up all the C files to make the release. Then to bootstrap from it, I just do (effectively) gcc *.c -o build/neat_bootstrap.

edit: Ah, here it is: https://github.com/Neat-Lang/neat/blob/v0.1.6/bootstrap.sh


That's a lot of steps in your bootstrap chain. I think tying it to releases can make it easier, but it's good your process is automated.

Rust has a long chain too, but rustc itself only uses features that the previous release supports. Releases are every 6 weeks.

Go had been bootstrapping from 1.4 (the last C compiler release), until the release of 1.20 this year, when the bootstrap compiler was bumped to 1.17.13 and will be bumped yearly [0]. That meant go1.4 had to be able to compile new versions, keeping the compiler from using new features in itself for 8 years. Notably, this now allows for generics to be used in the compiler.

[0]: https://github.com/golang/go/issues/54265


I did that too, many compilers ago! Of course, I broke something somewhere and forgot where. My compiler tastes have changed since, and I think next time I'll maintain a basic bootstrap from Scheme or Rust, but I always thought it was a neat way to do it.


Relevant is https://github.com/NixOS/nixpkgs/pull/85542 , a (very stale) PR that uses mrustc to get straight to Rust 1.29 and then walks the chain.


nowadays you can skip straight to 1.54, that's relatively fresh


I believe Guix has a shiny and clean Rust bootstrap using mrustc, no?



A couple years ago, I also worked on bootstrapping the Rust compiler from the last OCaml version. I managed to get it up to an early 2014 snapshot, before I got stalled on a particularly tricky issue with the forked LLVM, and decided to drop the project for the time being. I still have all the 200-odd scripts to patch and build everything; I should probably get around to publishing those.


You should definitely publish that and comment a link! That would be very useful, since you got much further than me.


I've always struggled to find a succinct way to describe the benefits of installing Arch Linux. "bootstrap pilgrimage" is the perfect term!


Arch Linux has gotten substantially easier to install in the last decade or so. I no longer get any 'bootstrap pilgrimage' feelings from it. Lots of stuff now works out of the box, even.

But I guess I'm too used to it, too?


I haven't installed it since ~2016/2017, so my knowledge might be outdated. If you use one of the arch-based distros with a GUI, then you're right that it's very easy to install.

If you follow the Wiki though, I think you still learn quite a bit: https://wiki.archlinux.org/title/installation_guide


I follow that guide whenever I do a new install. I've never used one of those GUI-enabled installers for Arch.

I started with Arch around 2010, and last did a fresh install on a new computer a few months ago.

Yes, you can still learn quite a bit from the guide if you are new to it. But it's gotten a lot easier over the years. I remember eg X and wifi used to be a bit of a pain, but now they mostly just work. Thanks to UEFI, you can also boot directly into your kernel, and no longer need to muck around with an extra boot manager like grub or syslinux. (Assuming your computer's UEFI implementation is not too buggy.)


See also the Bootstrappable Builds project, especially their stage0 projects:

https://bootstrappable.org/ https://github.com/oriansj/stage0 https://github.com/oriansj/stage0-posix/


Reminds me of https://elephly.net/posts/2017-01-09-bootstrapping-haskell-p...

The difficulty of bootstrapping GHC


There are of course a couple of notes on combating this attack: most compilers of today don’t actually produce exactly the same code if you run them twice. In broad strokes they do but often they’ll have randomness creep in, such as different build hashes or iteration order of associative containers. To truly get bit-for-bit identical output you may need to do some extra work, or perhaps run yet another step in a controlled environment to protect against this.
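
(A minimal Go illustration of the iteration-order point, not from the parent: Go deliberately randomizes map iteration, so a tool that emits output straight from a map produces different bytes on different runs; sorting the keys first restores determinism.)

    package main

    import (
        "fmt"
        "sort"
    )

    // emit writes "symbol=value" lines. Iterating the map directly would give
    // a different order on each run; sorting the keys makes the output stable.
    func emit(symbols map[string]int) string {
        names := make([]string, 0, len(symbols))
        for name := range symbols {
            names = append(names, name)
        }
        sort.Strings(names) // without this, output order varies run to run
        out := ""
        for _, name := range names {
            out += fmt.Sprintf("%s=%d\n", name, symbols[name])
        }
        return out
    }

    func main() {
        fmt.Print(emit(map[string]int{"main": 0x100, "init": 0x200, "println": 0x300}))
    }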

Second, and more importantly, many people carry a copy of a trusted compiler around, though it’s rarely mentioned in attacks like these: their head. In a pinch people can do spot checks to verify codegen to see whether it looks correct, unless the backdoor is incredibly subtle. But experience shows us that the more complex and hidden a backdoor is, the more likely it is to break when subjected to unfamiliar examination.


most compilers of today don’t actually produce exactly the same code if you run them twice.

Timestamps are one of the biggest offenders, but this is also why reproducible builds are important. Nondeterministic codegen is just scary.
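
For example (a sketch only, assuming the reproducible-builds.org SOURCE_DATE_EPOCH convention), a build tool can take its timestamp from the environment instead of the wall clock, so two builds of the same source embed the same time:

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "time"
    )

    // buildTimestamp prefers SOURCE_DATE_EPOCH (seconds since the epoch) over
    // time.Now(), so rebuilding the same source yields the same embedded time.
    func buildTimestamp() time.Time {
        if s := os.Getenv("SOURCE_DATE_EPOCH"); s != "" {
            if secs, err := strconv.ParseInt(s, 10, 64); err == nil {
                return time.Unix(secs, 0).UTC()
            }
        }
        return time.Now().UTC() // non-reproducible fallback
    }

    func main() {
        fmt.Println("build timestamp:", buildTimestamp().Format(time.RFC3339))
    }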

Second, and more importantly, many people carry a copy of a trusted compiler around, though it’s rarely mentioned in attacks like these: their head.

This is also why I'm against inefficient bloated software in general: the bigger a binary is, the easier it is to hide something in it.

Along the same lines, a third idea I have for defending against such attacks is better decompilers --- ideally, repeatedly decompiling and recompiling should converge to a fixed point, whereas a backdoor in a compiler should cause it to decompile into source that's noticeably divergent.


Mike Pall somewhat famously wrote a Lua interpreter in assembler. Assuming that you can write a compiler for your language in Lua, you don't really have to trust trust, but you do have to trust Mike Pall. I'm not aware of any other raw assembly implementations of modern programming languages, but I suspect there are other examples. The overall scheme (^,~) could probably be replicated by a sufficiently dedicated team if there was interest.

Not exactly easy, but probably easier than a decompiler that produces human-equivalent source code.


Now this has me wondering if we should write more useless software. In this case, specifically, more terrible compilers and interpreters. Like, sure, technically, everything could be backdoored, but what if there are a whole bunch of C-compilers written by amateurs in languages ranging from assembly to python to Haskell to Common Lisp? Good luck compromising all of them.


The C4 compiler [https://github.com/rswier/c4] is a self-hosting compiler for a subset of the C programming language that produces executable x86 code. You can understand and audit this code in a couple of hours (it's 528 lines).

It could be an interesting exercise to bootstrap up from something like this to a working Linux environment based solely on source-code compilation: no binary inputs. Of course a full Linux environment has way too much source code for one person or team to audit, but at least it rules out RoTT-style binary compiler contamination.


It's an interpreter.


You still have to trust your assembler and linker in that case.


In tune with the parent's point about decompilation, the transformation from assembly to machine code is more or less completely reversible and often local, allowing the assembly to be verified in chunks in ways that are harder for a backdoor to detect and target. Linking is also "reversible", though attacks on the linker are actually much more common in practice than attacks on the compiler (LD_PRELOAD injection etc). So the verification he was concerned with becomes much easier when using assembly for bootstrapping.


> decompilers ... converge to a fixed point

But why would you assume that the decompiler is not backdoored? It would know to remove the backdoor code, so the fixed point is not going to show anything.


In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).

In practice, you can push the likelihood of this down to arbitrarily small levels. Pick a decompiler that's as unrelated to your compiler as you can find. Then pick a second decompiler that's maximally unrelated to both the compiler and the first decompiler. It's highly unlikely all three tools will be backdoored in a fully compatible way.


> In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).

This is the whole point of Thompson’s lecture, though.


Not if the backdoor is inserted into fopen/mmap calls in such a way that when any executable opens the binary for reading, it sees a version without the backdoor.


That can't be done without either changing the file size or being really obvious.


Malware does that all the time - return different stat() results, change the contents for anything which isn’t execution, etc. It’s detectable but fundamentally you’re in a nasty race since the attacker can use the same tools you do.


>Timestamps are one of the biggest offenders

>Nondeterministic codegen is just scary.

Why not focus on deterministic codegen only? Caring about some metadata changing does not seem to be as useful from a security perspective.


If you can get your binaries bit-for-bit identical, any idiot (read: even a computer) can tell they are the same.

If you have to make exceptions for some metadata, all of a sudden there's judgement and smarts involved. And judgement can go wrong.
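
That's the appeal: with bit-for-bit reproducibility, verification really is just a hash comparison (a sketch below; the file names are placeholders):

    package main

    import (
        "crypto/sha256"
        "fmt"
        "os"
    )

    // digest returns the SHA-256 of a file's contents.
    func digest(path string) ([32]byte, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return [32]byte{}, err
        }
        return sha256.Sum256(data), nil
    }

    func main() {
        official, err := digest("official-release.bin")
        if err != nil {
            panic(err)
        }
        rebuilt, err := digest("my-rebuild.bin")
        if err != nil {
            panic(err)
        }
        fmt.Println("identical:", official == rebuilt) // no judgement calls needed
    }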


When a lot of software is mostly reproducible it may be easier to make the matching logic smarter instead of needing to chase down every possible build variation in every single application on Earth.

>And judgement can go wrong.

Which is why it is important to have a good design and threat model.


In practice, chasing down the build variations has been proven workable with a good track record. See eg https://reproducible-builds.org/


>workable

That page has progress reports that are 8.5 years old, and the project is not close to being done. There are still thousands of Debian packages to go.


Definitely, and they only have a limited pool of manpower.

However, at least the problems with this approach are all in one direction: Bits differing might be due to an actual attack or because of weird compiler issues, but if you get the bits to be identical, you know that no funny business is going on.

The proposal to have more complicated rules doesn't have this upside. And you still need to make sure that the things that actually matter are the same (like instructions in your binaries), even if you successfully separated out irrelevant metadata changes.


> Caring about some metadata changing does not seem to be as useful from a security perspective.

If the metadata were neatly separated from the executable portions of an artifact then it indeed wouldn't be too much of an issue[1], but it often isn't. It sometimes gets embedded into strings and other places in the executable itself where it theoretically could have an impact on the runtime (and of course makes executable signatures unreliable, etc.). Without that separation comparing two potentially-the-same executables becomes equivalent to the Halting Problem.

[1] You could sign only portions of the executable, for example... but such signature schemes are brittle and error-prone in practice, so best avoided. So you'd ideally want any and all metadata totally separate from the executable.
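
As a concrete example of how much metadata ends up inside the executable itself: Go binaries embed build information (toolchain version, VCS revision, build flags), and the debug/buildinfo package can read it back out of an arbitrary binary. A minimal sketch (the path is a placeholder):

    package main

    import (
        "debug/buildinfo"
        "fmt"
    )

    func main() {
        // Read the build metadata embedded in a Go binary on disk.
        info, err := buildinfo.ReadFile("some-go-binary")
        if err != nil {
            panic(err)
        }
        fmt.Println("built with:", info.GoVersion)
        for _, s := range info.Settings {
            fmt.Printf("%s=%s\n", s.Key, s.Value) // e.g. vcs.revision, vcs.time, -ldflags
        }
    }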


>where it theoretically could have an impact on the runtime

If there is a backdoor in the code, there is a backdoor regardless of whether an extra timestamp exists.


In order to trust signed binaries, it is very valuable to be able to verify a binary came from some source.

This way anyone who signs modified code can be caught. This is especially easy with open source projects. Incidentally, Debian is working very hard on deterministic builds.


>it is very valuable to be able to verify a binary came from some source.

Code signatures have already existed for decades and do not require reproducible builds.


Yes, but reproducible builds allow anyone to rebuild the code and verify that the compiled binary release corresponds to the upstream source code.

Without deterministic builds, it's possible for the attacker to switch the compiled binary release and sign it with a stolen key with nobody noticing.


But if your threat model includes the key being stolen, then signing with said key would be meaningless.


No. In the latter case the "upstream source code" is the key, and the compiler is the functional transformation under scrutiny. Think of a triangle where we want two fixed (trustworthy) points to ascertain the third. Here our trusted principles are a repository (better multiple repositories) of source code, and a deterministic executable from someone who has already compiled using a known good compiler.


It really limits the value of a stolen or leaked key. Similarly it limits the amount of trust you need in the key holder.

If the NSA could steal a Debian key (or coerce a packager), then without deterministic builds they could put a backdoor in a popular Debian package with barely any chance of getting caught.

Deterministic builds make it much easier to detect this. That makes the value of such a stolen/coerced key much lower. Making it much less likely this gets attempted, and likely stopping anyone who used to try this attack.


That argument would mean that SSL transparency logs are pointless.

I'd argue that if the key being stolen/misused becomes more easily detectable, signing with said key becomes more meaningful.


SSL transparency logs are not a thing.

Certificate Transparency logs only show when new certificates are issued. If a certificate is stolen it is of no use.


They help people discover if the certificate authority's private key has been stolen/misused.


They are meant to detect issuers misbehaving, not individual SSL certs leaking.

It lets issuers show they are trustworthy.


Perhaps, but with deterministic builds, anyone can read the source code, recompile, verify integrity of the signed code, and even sign it themselves as a confirmation of its integrity. The more (independent) signatures there are, the more you can trust the precompiled binary.


clang+llvm releases do a three-stage build and compare the output of the third stage with the second. This is fairly effective at rooting out many sources of indeterminate behavior.

> In broad strokes they do but often they’ll have randomness creep in, such as different build hashes or iteration order of associative containers.

In order to find defects related to codegen being altered by incidental ordering of objects in containers, a build option called LLVM_ENABLE_REVERSE_ITERATION was created. Builders periodically run with this mode to verify that the regression test suite still passes when containers iterate in reverse.
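
The same idea can be sketched outside LLVM (a rough Go analogy, not the LLVM code): run the "codegen" step over its inputs in forward and reversed order and compare the results, so any dependence on iteration order shows up as a test failure instead of being masked.

    package main

    import (
        "fmt"
        "sort"
    )

    // emit concatenates names in the order given; if normalize is true it sorts
    // them first, making the result independent of the caller's iteration order.
    func emit(names []string, normalize bool) string {
        names = append([]string(nil), names...)
        if normalize {
            sort.Strings(names)
        }
        out := ""
        for _, n := range names {
            out += n + "\n"
        }
        return out
    }

    func reversed(s []string) []string {
        r := make([]string, len(s))
        for i, v := range s {
            r[len(s)-1-i] = v
        }
        return r
    }

    func main() {
        names := []string{"main", "init", "println"}
        for _, normalize := range []bool{true, false} {
            same := emit(names, normalize) == emit(reversed(names), normalize)
            fmt.Printf("normalize=%v order-independent=%v\n", normalize, same)
        }
    }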

That said, it's true that there are probably remaining sources of variation among builds. There is a significant effort [1] to find these and avoid them.

[1] https://reproducible-builds.org/


Not all containers, right? Just unordered containers. It's a bit of a weird strategy though - most other applications use a random seed for unordered hash tables.


Rather than change the implementation of the DenseMap/DenseSet/StringMap/StringSet containers to mask such codegen defects, the community opted to create a test strategy to find the defects instead.

I would guess that this may also end up favoring the execution performance of the compiler over other design choices, but I could be wrong about that one.


A random seed for unordered hash tables would make a mess of reproducibility. Bad thing in this context.


The iteration order is undefined for these containers. The whole point of the reverse iteration is to find when reproducibility is dependent on undefined iteration order. If the reverse iteration accomplishes this, so would a random seed.


> most compilers of today don’t actually produce exactly the same code if you run them twice.

By default, but https://reproducible-builds.org/ has made excellent headway on allowing that to be fixed if you care.


The more I read about Ken, and about Russ's admiration for Ken's work, the more I think there is an Easter egg in the Go compiler. It looks like Russ has long planted the seed for his own Turing Award.

From the article:

> Today, however, the Go compiler does compile itself, and that prompts the important question of why it should be trusted, especially when a backdoor is so easy to add. The answer is that we have never required that the compiler rebuild itself. Instead the compiler always builds from an earlier released version of the compiler. This way, anyone can reproduce the current binaries by starting with Go 1.4 (written in C), using Go 1.4 to compile Go 1.5, Go 1.5 to compile Go 1.6, and so on. There is no point in the cycle where the compiler is required to compile itself, so there is no place for a binary-only backdoor to hide.

I think this paragraph just gives hints on where to look:

* there is no "binary-only backdoor", so I'm expecting the backdoor to be available in the source distribution of Go.

* the backdoor is either in each release and/or is propagated from one release to the other

There is a feature of the Go compiler that I mistrust: //go:embed. I expect it could be a good vehicle to inject source code that isn't in the repository as a file. Especially as Russ himself is deeply involved in the design (see https://youtu.be/rmS-oWcBZaI?).
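
For context (this is just the documented behaviour, not evidence of anything): //go:embed bakes a file from the source tree into the binary at compile time. A minimal example, with a hypothetical file name:

    package main

    import (
        _ "embed" // blank import required for the //go:embed directive
        "fmt"
    )

    //go:embed notice.txt
    var notice string // contents of notice.txt, embedded at compile time

    func main() {
        fmt.Print(notice)
    }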

I also mistrust the vendored packages as a way to have backdoor code in the distribution. Normal Go programs are not allowed to import code from packages under cmd/vendor/..., but what if the compiler were lifting that limit for itself?


Wow, the attack was implemented nearly ten years before the lecture, maybe right after reading https://seclab.cs.ucdavis.edu/projects/history/papers/karg74... about "the compiler trap door"?


What a fantastic analysis. I've loved the Trusting Trust paper for decades now, it's so short and sweet and mind-blowing. But I honestly assumed it was a kind of thought experiment, not something Thompson actually created. Amazing to see the actual code and also learn how it sort of got into the wild but didn't quite pervade. (At least, we don't think so.)


I never expected this to actually exist either. I thought it was just some hypothetical thought experiment. The implications are astounding.

Code already decides the future of nations. My country uses voting machines which run software. People keep asking for the source code even though it would prove nothing...


I love trivia like this: how Russ named the article link after Ken's original naming.


Trusting Russ


Usually rsc’s blog posts are intimations of what he’s going to propose next for Go. Um, er, oh boy.


It is indirectly about the new reproducible Go toolchain builds introduced with Go 1.21. https://go.dev/blog/rebuild


In this case, I think it is more about an Easter egg in the Go compiler.

I'm expecting we will finally discover a backdoor propagated since at least Go 1.4.


Haha, good one ;)



