most compilers today don't actually produce exactly the same code if you run them twice.
Timestamps are one of the biggest offenders, which is also why reproducible builds are important. Nondeterministic codegen is just scary.
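As a rough illustration of what a reproducibility check looks like in practice, here is a minimal sketch in Python that builds the same tree twice with a pinned SOURCE_DATE_EPOCH and compares the resulting hashes. The ./build.sh entry point and out/app artifact path are placeholders, not a real project layout.

    # Minimal sketch: build the same source twice with a pinned timestamp and
    # compare the resulting binaries. "./build.sh" and "out/app" are placeholders
    # for whatever the project's build entry point and artifact actually are.
    import hashlib
    import os
    import shutil
    import subprocess

    def build_once(workdir: str) -> str:
        env = dict(os.environ, SOURCE_DATE_EPOCH="0")  # pin embedded timestamps
        subprocess.run(["./build.sh"], cwd=workdir, env=env, check=True)
        with open(os.path.join(workdir, "out", "app"), "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def is_reproducible(srcdir: str) -> bool:
        # Copy the tree so the two builds cannot share intermediate state.
        copies = []
        for name in ("build-a", "build-b"):
            dst = f"/tmp/{name}"
            shutil.rmtree(dst, ignore_errors=True)
            shutil.copytree(srcdir, dst)
            copies.append(dst)
        return build_once(copies[0]) == build_once(copies[1])

    if __name__ == "__main__":
        print("reproducible" if is_reproducible(".") else "binaries differ")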
Second, and more importantly, many people carry a copy of a trusted compiler around, though it's rarely mentioned in attacks like these: their head.
This is also why I'm against inefficient bloated software in general: the bigger a binary is, the easier it is to hide something in it.
Along the same lines, a third idea I have for defending against such attacks is better decompilers --- ideally, repeatedly decompiling and recompiling should converge to a fixed point, whereas a backdoor in a compiler should cause it to decompile into source that's noticeably divergent.
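To make the fixed-point idea concrete, here is a hedged sketch of the loop: decompile, recompile, decompile again, and check whether the recovered source stabilizes. The "decompile" and "cc" commands are stand-ins for whatever real tools would be used, not specific products.

    # Sketch of the fixed-point idea: decompile, recompile, decompile again, and
    # see whether the recovered source settles down. "decompile" and "cc" stand in
    # for hypothetical tools; neither name refers to a specific product.
    import hashlib
    import subprocess

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def round_trip(binary: str, iterations: int = 3) -> bool:
        prev = None
        for i in range(iterations):
            subprocess.run(["decompile", binary, "-o", f"src{i}.c"], check=True)
            digest = sha256(f"src{i}.c")
            if digest == prev:
                return True            # converged: recovered source is stable
            prev = digest
            subprocess.run(["cc", f"src{i}.c", "-o", binary], check=True)
        return False                    # never converged: worth a closer look

    print("stable" if round_trip("a.out") else "divergent")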
Mike Pall somewhat famously wrote a Lua interpreter in assembler. Assuming that you can write a compiler for your language in Lua, you don't really have to trust trust, but you do have to trust Mike Pall. I'm not aware of any other raw assembly implementations of modern programming languages, but I suspect there are other examples. The overall scheme (^,~) could probably be replicated by a sufficiently dedicated team if there was interest.
Not exactly easy, but probably easier than a decompiler that produces human-equivalent source code.
Now this has me wondering if we should write more useless software. In this case, specifically, more terrible compilers and interpreters. Like, sure, technically, everything could be backdoored, but what if there are a whole bunch of C compilers written by amateurs in languages ranging from assembly to Python to Haskell to Common Lisp? Good luck compromising all of them.
The C4 compiler [https://github.com/rswier/c4] is a self-hosting compiler for a subset of the C programming language that produces executable x86 code. You can understand and audit this code in a couple of hours (it's 528 lines).
It could be an interesting exercise to bootstrap up from something like this to a working Linux environment based solely on source code compilation: no binary inputs. Of course a full Linux environment has way too much source code for one person or team to audit, but at least it rules out RoTT-style binary compiler contamination.
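A hedged sketch of what auditing such a bootstrap chain might look like: each stage compiler builds the next stage from source, and every artifact hash gets logged so the whole chain from the auditable seed can be replayed and compared. The stage names and file paths below are invented for illustration.

    # Hedged sketch of a bootstrap chain audit: each stage compiler builds the
    # next stage from source, and we log a hash of every artifact so the chain
    # from the auditable seed (something c4-sized) can be replayed and compared.
    # Stage names and file paths are invented for illustration.
    import hashlib
    import subprocess

    STAGES = [
        ("./stage0-c4", "tinycc.c", "stage1-tcc"),     # seed small enough to read in full
        ("./stage1-tcc", "gcc-core.c", "stage2-gcc"),  # grows toward a real toolchain
    ]

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    for compiler, source, output in STAGES:
        subprocess.run([compiler, source, "-o", output], check=True)
        print(f"{output}: built by {compiler} from {source}, sha256={sha256(output)}")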
In tune with the parent's point about decompilation, the transformation from assembly to machine code is more or less completely reversible and often local, allowing the assembly to be verified in chunks, which makes tampering much easier to detect. Linking is also "reversible", though attacks on the linker are actually much more common in practice than attacks on the compiler (LD_PRELOAD injection, etc.). So the verification he was concerned with becomes much easier when using assembly for bootstrapping.
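As a small illustration of that local reversibility, here is a sketch that assembles one auditable chunk and disassembles the result for comparison. It assumes GNU binutils (as, objdump) are on PATH; the chunk file name is a placeholder.

    # Rough sketch of the "assembly is locally checkable" point: assemble a small
    # chunk, disassemble the result, and diff the mnemonics against the input.
    # Assumes GNU binutils ("as", "objdump") are installed; "chunk.s" is a placeholder.
    import subprocess

    CHUNK = "chunk.s"   # a small, hand-auditable piece of the bootstrap assembly

    subprocess.run(["as", CHUNK, "-o", "chunk.o"], check=True)
    listing = subprocess.run(
        ["objdump", "-d", "chunk.o"],
        check=True, capture_output=True, text=True,
    ).stdout

    # The disassembly should correspond line-for-line with what was written,
    # which is what makes chunk-by-chunk verification tractable.
    print(listing)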
but why would you assume that the decompiler is not backdoored? It would know to remove the backdoor code, so the fixpoint is not going to show anything.
In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).
In practice, you can push the likelihood of this down to arbitrarily small levels. Pick a decompiler that's as unrelated to your compiler as you can find. Then pick a second decompiler that's maximally unrelated to both the compiler and the first decompiler. It's highly unlikely all three tools will be backdoored in a fully compatible way.
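A sketch of that cross-check, assuming two decompilers with no shared lineage; the decompiler command names below are made up, and the comparison heuristic is only illustrative.

    # Sketch of the "unrelated tools" cross-check: run the same binary through two
    # decompilers with no shared lineage and compare the recovered sources. The
    # decompiler commands here are invented names, not real tool invocations.
    import subprocess

    def recover_source(decompiler: str, binary: str, out: str) -> str:
        subprocess.run([decompiler, binary, "-o", out], check=True)
        with open(out) as f:
            return f.read()

    a = recover_source("decompiler-a", "suspect-cc", "a.c")
    b = recover_source("decompiler-b", "suspect-cc", "b.c")

    # The sources won't match textually, but a backdoor that one tool silently
    # hides and the other does not should show up as a gross structural difference.
    print("outputs differ wildly" if abs(len(a) - len(b)) > 0.2 * max(len(a), len(b))
          else "roughly consistent")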
> In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).
This is the whole point of Thompson’s lecture, though.
Not if the backdoor is inserted into fopen/mmap calls in such a way that when any executable opens the binary for reading, it sees a version without the backdoor.
Malware does that all the time - return different stat() results, change the contents for anything which isn’t execution, etc. It’s detectable but fundamentally you’re in a nasty race since the attacker can use the same tools you do.
When a lot of software is mostly reproducible it may be easier to make the matching logic smarter instead of needing to chase down every possible build variation in every single application on Earth.
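One hedged sketch of what "smarter matching" could mean: normalize known-benign variation (embedded timestamps, say) out of the printable strings before diffing, instead of demanding bit-identical files. The timestamp pattern below is just a guess at one common build-date format, not a complete normalizer.

    # Sketch of "smarter matching": instead of demanding bit-identical binaries,
    # strip known-benign variation (here, embedded timestamps in the printable
    # strings) before comparing. Uses binutils "strings"; the timestamp regex is
    # only an illustrative guess at one common build-date format.
    import re
    import subprocess

    TIMESTAMP = re.compile(r"\b\w{3} [ \d]\d \d{4} \d{2}:\d{2}:\d{2}\b")  # e.g. "Jan  1 2024 00:00:00"

    def normalized_strings(binary: str) -> list[str]:
        out = subprocess.run(["strings", binary], check=True,
                             capture_output=True, text=True).stdout
        return [TIMESTAMP.sub("<timestamp>", line) for line in out.splitlines()]

    print("match" if normalized_strings("build-a/app") == normalized_strings("build-b/app")
          else "real difference")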
> And judgement can go wrong.
Which is why it is important to have a good design and threat model.
Definitely, and they only have a limited pool of manpower.
However, at least the problems with this approach are all in one direction: Bits differing might be due to an actual attack or because of weird compiler issues, but if you get the bits to be identical, you know that no funny business is going on.
The proposal to have more complicated rules doesn't have this upside. And you still need to make sure that the things that actually matter are the same (like instructions in your binaries), even if you successfully separated out irrelevant metadata changes.
> Caring about some metadata changing does not seem to be as useful from a security perspective.
If the metadata were neatly separated from the executable portions of an artifact then it indeed wouldn't be too much of an issue[1], but it often isn't. It sometimes gets embedded into strings and other places in the executable itself where it theoretically could have an impact on the runtime (and of course makes executable signatures unreliable, etc.). Without that separation comparing two potentially-the-same executables becomes equivalent to the Halting Problem.
[1] You could sign only portions of the executable, for example... but such signature schemes are brittle and error-prone in practice, so best avoided. So you'd ideally want any and all metadata totally separate from the executable.
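A minimal sketch of the "compare only what executes" idea, assuming ELF binaries and binutils objcopy; real artifacts have more sections that matter than .text, so this is illustrative only.

    # Minimal sketch: extract just the .text section from two builds and hash
    # those, so metadata baked elsewhere in the file can't mask or mimic a real
    # code change. Assumes ELF binaries and binutils "objcopy"; paths are placeholders.
    import hashlib
    import subprocess

    def text_section_hash(binary: str) -> str:
        raw = f"{binary}.text.bin"
        subprocess.run(
            ["objcopy", "-O", "binary", "--only-section=.text", binary, raw],
            check=True,
        )
        with open(raw, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    print("code identical" if text_section_hash("build-a/app") == text_section_hash("build-b/app")
          else "code differs")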
In order to trust signed binaries, it is very valuable to be able to verify that a binary actually came from a given source.
This way anyone who signs modified code can be caught. This is especially easy with open source projects. Incidentally, Debian is working very hard on deterministic builds.
No. In the latter case the "upstream source code" is the key, and the compiler is the functional transformation under scrutiny. Think of a triangle where we want two fixed (trustworthy) points to ascertain the third. Here our trusted principles are a repository (better, multiple repositories) of source code, and a deterministic executable from someone who has already compiled using a known good compiler.
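A small sketch of that triangle check: hash the deterministic reference binary from the trusted party, rebuild the same source with the compiler under scrutiny, and compare. File names and the compiler invocation are placeholders.

    # Sketch of the "triangle" check described above: two trusted corners (the
    # source repositories and a deterministic reference binary from a known-good
    # compiler) pin down the third (the compiler under scrutiny). Paths and the
    # compiler invocation are placeholders.
    import hashlib
    import subprocess

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    REFERENCE_HASH = sha256("reference-build/app")   # from someone you already trust

    subprocess.run(["suspect-cc", "app.c", "-o", "rebuilt/app"], check=True)

    if sha256("rebuilt/app") == REFERENCE_HASH:
        print("compiler under scrutiny reproduces the trusted build")
    else:
        print("mismatch: either the compiler or one trusted corner is wrong")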
It really limits the value of a stolen or leaked key. Similarly it limits the amount of trust you need in the key holder.
If the NSA could steal a debian key (or coerce a packager), without deterministic builds, they could put a backdoor in a popular Debian package with barely any chance of getting caught.
Deterministic builds make it much easier to detect this, which makes the value of such a stolen or coerced key much lower. That makes the attack much less likely to be attempted, and likely catches anyone who does try it.
Perhaps, but with deterministic builds, anyone can read the source code, recompile, verify integrity of the signed code, and even sign it themselves as a confirmation of its integrity. The more (independent) signatures there are, the more you can trust the precompiled binary.