
Most compilers today don't actually produce exactly the same code if you run them twice.

Timestamps are one of the biggest offenders, but this is also why reproducible builds are important. Nondeterministic codegen is just scary.
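
To make that concrete, here's a minimal C sketch (nothing compiler-specific assumed): __DATE__ and __TIME__ expand to the wall-clock time of compilation, so two back-to-back builds of the same source already differ bit-for-bit. Reproducible-builds tooling pins these down, e.g. via SOURCE_DATE_EPOCH.

    /* stamp.c -- minimal illustration of timestamp nondeterminism.
     * __DATE__ and __TIME__ are expanded to the time of compilation,
     * so the resulting string lands in the binary and two builds of
     * identical source are no longer bit-for-bit identical. */
    #include <stdio.h>

    int main(void) {
        printf("built on %s at %s\n", __DATE__, __TIME__);
        return 0;
    }

Compile it twice and cmp the two binaries: they differ even though the source never changed.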

Second, and more importantly, many people carry a copy of a trusted compiler around, though it's rarely mentioned in attacks like these: their head.

This is also why I'm against inefficient bloated software in general: the bigger a binary is, the easier it is to hide something in it.

Along the same lines, a third idea I have for defending against such attacks is better decompilers --- ideally, repeatedly decompiling and recompiling should converge to a fixed point, whereas a backdoor in a compiler should cause it to decompile into source that's noticeably divergent.
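
A rough sketch of what that convergence check could look like; the decompile command is a hypothetical stand-in for whatever decompiler you pick, and the file names are made up:

    /* fixed_point.c -- rough sketch of the decompile/recompile convergence
     * check described above.  "decompile" is a hypothetical stand-in for
     * whatever decompiler you trust; "suspect" is the binary under test. */
    #include <stdio.h>
    #include <stdlib.h>

    static void run(const char *cmd) {
        if (system(cmd) != 0) { fprintf(stderr, "failed: %s\n", cmd); exit(2); }
    }

    int main(void) {
        char cmd[256];
        run("cp ./suspect round0.bin");
        for (int i = 0; i < 5; i++) {
            /* decompile round i, then recompile the result into round i+1 */
            snprintf(cmd, sizeof cmd, "decompile round%d.bin > round%d.c", i, i + 1);
            run(cmd);
            snprintf(cmd, sizeof cmd, "cc -o round%d.bin round%d.c", i + 1, i + 1);
            run(cmd);
            /* converged: two successive decompilations agree */
            snprintf(cmd, sizeof cmd, "cmp -s round%d.c round%d.c", i, i + 1);
            if (i > 0 && system(cmd) == 0) { puts("fixed point reached"); return 0; }
        }
        puts("no convergence -- the divergence is what you'd want to inspect");
        return 1;
    }

If successive rounds never settle, the diff between them is exactly the thing you'd want a human to look at.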




Mike Pall somewhat famously wrote a Lua interpreter in assembler. Assuming that you can write a compiler for your language in Lua, you don't really have to trust trust, but you do have to trust Mike Pall. I'm not aware of any other raw assembly implementations of modern programming languages, but I suspect there are other examples. The overall scheme (^,~) could probably be replicated by a sufficiently dedicated team if there was interest.

Not exactly easy, but probably easier than a decompiler that produces human-equivalent source code.


Now this has me wondering if we should write more useless software. In this case, specifically, more terrible compilers and interpreters. Like, sure, technically, everything could be backdoored, but what if there are a whole bunch of C compilers written by amateurs in languages ranging from assembly to Python to Haskell to Common Lisp? Good luck compromising all of them.


The C4 compiler [https://github.com/rswier/c4] is a self-hosting compiler for a subset of the C programming language that produces executable x86 code. You can understand and audit this code in a couple of hours (it's 528 lines).

It could be an interesting exercise to bootstrap up from something like this to a working Linux environment based solely on source code compilation: no binary inputs. Of course a full Linux environment has way too much source code for one person or team to audit, but at least it rules out RoTT-style binary compiler contamination.


It's an interpreter.


You still have to trust your assembler and linker in that case.


In tune with the parent's point about decompilation, the transformation from assembly to machine code is more or less completely reversible and often local, allowing the assembly to be verified in chunks, which makes tampering much harder to hide. Linking is also "reversible", though attacks on the linker are actually much more common in practice than attacks on the compiler (LD_PRELOAD injection etc.). So the verification he was concerned with becomes much easier when using assembly for bootstrapping.


> decompilers ... converge to a fixed point

But why would you assume that the decompiler is not backdoored? It would know to remove the backdoor code, so the fixed point is not going to show anything.


In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).

In practice, you can push the likelihood of this down to arbitrarily small levels. Pick a decompiler that's as unrelated to your compiler as you can find. Then pick a second decompiler that's maximally unrelated to both the compiler and the first decompiler. It's highly unlikely all three tools will be backdoored in a fully compatible way.


> In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).

This is the whole point of Thompson’s lecture, though.


Not if the backdoor is inserted into fopen/mmap calls in such a way that when any executable opens the binary for reading, it sees a version without the backdoor.
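
For illustration, here's roughly what the simplest user-space (LD_PRELOAD) version of that trick looks like; the paths are invented, and a real attack would live in libc or the kernel and would also have to cover open(), mmap(), stat() and friends:

    /* hide.c -- conceptual sketch only.  An interposed fopen() that serves
     * a pristine copy of the compiler whenever something opens it for
     * reading, while execution still uses the trojaned file on disk.
     * Paths are made up; build with: cc -shared -fPIC hide.c -o hide.so -ldl
     * and inject with LD_PRELOAD=./hide.so */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <dlfcn.h>

    FILE *fopen(const char *path, const char *mode) {
        /* look up the real fopen so everything else behaves normally */
        FILE *(*real_fopen)(const char *, const char *) =
            (FILE *(*)(const char *, const char *))dlsym(RTLD_NEXT, "fopen");

        /* anyone *reading* the trojaned binary gets the clean copy instead */
        if (strcmp(path, "/usr/bin/cc") == 0 && strchr(mode, 'r'))
            return real_fopen("/usr/lib/.cc.clean", mode);

        return real_fopen(path, mode);
    }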


That can't be done without either changing the file size or being really obvious.


Malware does that all the time - return different stat() results, change the contents for anything which isn’t execution, etc. It’s detectable but fundamentally you’re in a nasty race since the attacker can use the same tools you do.


>Timestamps are one of the biggest offenders

>Nondeterministic codegen is just scary.

Why not focus on deterministic codegen only? Caring about some metadata changing does not seem to be as useful from a security perspective.


If you can get your binaries bit-for-bit identical, any idiot (read: even a computer) can tell they are the same.

If you have to make exceptions for some metadata, all of a sudden there's judgement and smarts involved. And judgement can go wrong.
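
And the "even a computer" check really is this dumb, which is the whole appeal (cmp(1) already does it; this is just to show there's no policy involved):

    /* bitcmp.c -- the whole "any idiot (even a computer)" check: two
     * builds either match byte-for-byte or they don't.  No exceptions,
     * no judgement, nothing to get wrong. */
    #include <stdio.h>

    int main(int argc, char **argv) {
        if (argc != 3) { fprintf(stderr, "usage: %s a.bin b.bin\n", argv[0]); return 2; }
        FILE *a = fopen(argv[1], "rb"), *b = fopen(argv[2], "rb");
        if (!a || !b) { perror("fopen"); return 2; }
        int ca, cb;
        do {
            ca = fgetc(a);
            cb = fgetc(b);
            if (ca != cb) { puts("DIFFERENT"); return 1; }
        } while (ca != EOF);
        puts("IDENTICAL");
        return 0;
    }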


When a lot of software is mostly reproducible it may be easier to make the matching logic smarter instead of needing to chase down every possible build variation in every single application on Earth.

>And judgement can go wrong.

Which is why it is important to have a good design and threat model.


In practice, chasing down the build variations has been proven workable with a good track record. See eg https://reproducible-builds.org/


>workable

That page has progress reports that are 8.5 years old and the project is not close to being done. There are still thousands of Debian packages to go.


Definitely, and they only have a limited pool of manpower.

However, at least the problems with this approach are all in one direction: Bits differing might be due to an actual attack or because of weird compiler issues, but if you get the bits to be identical, you know that no funny business is going on.

The proposal to have more complicated rules doesn't have this upside. And you still need to make sure that the things that actually matter are the same (like instructions in your binaries), even if you successfully separated out irrelevant metadata changes.


> Caring about some metadata changing does not seem to be as useful from a security perspective.

If the metadata were neatly separated from the executable portions of an artifact then it indeed wouldn't be too much of an issue[1], but it often isn't. It sometimes gets embedded into strings and other places in the executable itself where it theoretically could have an impact on the runtime (and of course makes executable signatures unreliable, etc.). Without that separation comparing two potentially-the-same executables becomes equivalent to the Halting Problem.

[1] You could sign only portions of the executable, for example... but such signature schemes are brittle and error-prone in practice, so best avoided. So you'd ideally want any and all metadata totally separate from the executable.


>where it theoretically could have an impact on the runtime

If there is a backdoor in the code there is a backdoor, regardless of whether an extra timestamp exists.


In order to trust signed binaries, it is very valuable to be able to verify that a binary was actually built from a given source tree.

This way anyone who signs modified code can be caught. This is especially easy with open source projects. Incidentally, Debian is working very hard on deterministic builds.


>it is very valuable to be able to verify a binary came from some source.

Code signatures have already existed for decades and do not require reproducible builds.


Yes, but reproducible builds allow anyone to rebuild the code and verify that the compiled binary release corresponds to the upstream source code.

Without deterministic builds, it's possible for the attacker to switch the compiled binary release and sign it with a stolen key with nobody noticing.


But if your threat model includes the key being stolen, then signing with said key would be meaningless.


No. In the latter case the "upstream source code" is the key, and the compiler is the functional transformation under scrutiny. Think of a triangle where we want two fixed (trustworthy) points to ascertain the third. Here our two trusted points are a repository (better, multiple repositories) of source code, and a deterministic executable from someone who has already compiled it with a known-good compiler.


It really limits the value of a stolen or leaked key. Similarly it limits the amount of trust you need in the key holder.

If the NSA could steal a Debian key (or coerce a packager), then without deterministic builds they could put a backdoor in a popular Debian package with barely any chance of getting caught.

Deterministic builds make it much easier to detect this. That makes the value of such a stolen/coerced key much lower, making it much less likely this gets attempted, and likely stopping anyone who does try this attack.


That argument would mean that SSL transparency logs are pointless.

I'd argue that if the key being stolen/misused becomes more easily detectable, signing with said key becomes more meaningful.


SSL transparency logs are not a thing.

Certificate Transparency logs only show when new certificates are issued. If an existing certificate is stolen, they are of no use.


They help people discover if the certificate authority's private key has been stolen/misused.


They are meant to detect issuers misbehaving, not individual SSL certs leaking.

It lets issuers show they are trustworthy.


Perhaps, but with deterministic builds anyone can read the source code, recompile, verify the integrity of the signed code, and even sign it themselves as a confirmation of its integrity. The more (independent) signatures there are, the more you can trust the precompiled binary.



