most compilers today don't actually produce exactly the same code if you run them twice.
Timestamps are one of the biggest offenders, which is also why reproducible builds are important. Nondeterministic codegen is just scary.
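As a rough illustration of what a reproducibility check looks like in practice, here is a minimal sketch in Python that builds the same tree twice with a pinned SOURCE_DATE_EPOCH and compares the resulting hashes. The ./build.sh entry point and out/app artifact path are placeholders, not a real project layout.

    # Minimal sketch: build the same source twice with a pinned timestamp and
    # compare the resulting binaries. "./build.sh" and "out/app" are placeholders
    # for whatever the project's build entry point and artifact actually are.
    import hashlib
    import os
    import shutil
    import subprocess

    def build_once(workdir: str) -> str:
        env = dict(os.environ, SOURCE_DATE_EPOCH="0")  # pin embedded timestamps
        subprocess.run(["./build.sh"], cwd=workdir, env=env, check=True)
        with open(os.path.join(workdir, "out", "app"), "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def is_reproducible(srcdir: str) -> bool:
        # Copy the tree so the two builds cannot share intermediate state.
        copies = []
        for name in ("build-a", "build-b"):
            dst = f"/tmp/{name}"
            shutil.rmtree(dst, ignore_errors=True)
            shutil.copytree(srcdir, dst)
            copies.append(dst)
        return build_once(copies[0]) == build_once(copies[1])

    if __name__ == "__main__":
        print("reproducible" if is_reproducible(".") else "binaries differ")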
Second, and more importantly, many people carry a copy of a trusted compiler around, though it's rarely mentioned in attacks like these: their head.
This is also why I'm against inefficient bloated software in general: the bigger a binary is, the easier it is to hide something in it.
Along the same lines, a third idea I have for defending against such attacks is better decompilers --- ideally, repeatedly decompiling and recompiling should converge to a fixed point, whereas a backdoor in a compiler should cause it to decompile into source that's noticeably divergent.
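To make the fixed-point idea concrete, here is a hedged sketch of the loop: decompile, recompile, decompile again, and check whether the recovered source stabilizes. The "decompile" and "cc" commands are stand-ins for whatever real tools would be used, not specific products.

    # Sketch of the fixed-point idea: decompile, recompile, decompile again, and
    # see whether the recovered source settles down. "decompile" and "cc" stand in
    # for hypothetical tools; neither name refers to a specific product.
    import hashlib
    import subprocess

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def round_trip(binary: str, iterations: int = 3) -> bool:
        prev = None
        for i in range(iterations):
            subprocess.run(["decompile", binary, "-o", f"src{i}.c"], check=True)
            digest = sha256(f"src{i}.c")
            if digest == prev:
                return True            # converged: recovered source is stable
            prev = digest
            subprocess.run(["cc", f"src{i}.c", "-o", binary], check=True)
        return False                    # never converged: worth a closer look

    print("stable" if round_trip("a.out") else "divergent")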
Mike Pall somewhat famously wrote a Lua interpreter in assembler. Assuming that you can write a compiler for your language in Lua, you don't really have to trust trust, but you do have to trust Mike Pall. I'm not aware of any other raw assembly implementations of modern programming languages, but I suspect there are other examples. The overall scheme (^,~) could probably be replicated by a sufficiently dedicated team if there was interest.
Not exactly easy, but probably easier than a decompiler that produces human-equivalent source code.
Now this has me wondering if we should write more useless software. In this case, specifically, more terrible compilers and interpreters. Like, sure, technically, everything could be backdoored, but what if there are a whole bunch of C compilers written by amateurs in languages ranging from assembly to Python to Haskell to Common Lisp? Good luck compromising all of them.
The C4 compiler [https://github.com/rswier/c4] is a self-hosting compiler for a subset of the C programming language that produces executable x86 code. You can understand and audit this code in a couple of hours (it's 528 lines).
It could be an interesting exercise to bootstrap up from something like this to a working Linux environment based solely on source code compilation: no binary inputs. Of course a full Linux environment has way too much source code for one person or team to audit, but at least it rules out RoTT-style binary compiler contamination.
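A hedged sketch of what auditing such a bootstrap chain might look like: each stage compiler builds the next stage from source, and every artifact hash gets logged so the whole chain from the auditable seed can be replayed and compared. The stage names and file paths below are invented for illustration.

    # Hedged sketch of a bootstrap chain audit: each stage compiler builds the
    # next stage from source, and we log a hash of every artifact so the chain
    # from the auditable seed (something c4-sized) can be replayed and compared.
    # Stage names and file paths are invented for illustration.
    import hashlib
    import subprocess

    STAGES = [
        ("./stage0-c4", "tinycc.c", "stage1-tcc"),     # seed small enough to read in full
        ("./stage1-tcc", "gcc-core.c", "stage2-gcc"),  # grows toward a real toolchain
    ]

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    for compiler, source, output in STAGES:
        subprocess.run([compiler, source, "-o", output], check=True)
        print(f"{output}: built by {compiler} from {source}, sha256={sha256(output)}")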
In tune with the parent's point about decompilation, the transformation from assembly to machine code is more or less completely reversible and often local, allowing the assembly to be verified in chunks, which makes tampering much easier to detect. Linking is also "reversible", though attacks on the linker are actually much more common in practice than attacks on the compiler (LD_PRELOAD injection, etc.). So the verification he was concerned with becomes much easier when using assembly for bootstrapping.
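As a small illustration of that local reversibility, here is a sketch that assembles one auditable chunk and disassembles the result for comparison. It assumes GNU binutils (as, objdump) are on PATH; the chunk file name is a placeholder.

    # Rough sketch of the "assembly is locally checkable" point: assemble a small
    # chunk, disassemble the result, and diff the mnemonics against the input.
    # Assumes GNU binutils ("as", "objdump") are installed; "chunk.s" is a placeholder.
    import subprocess

    CHUNK = "chunk.s"   # a small, hand-auditable piece of the bootstrap assembly

    subprocess.run(["as", CHUNK, "-o", "chunk.o"], check=True)
    listing = subprocess.run(
        ["objdump", "-d", "chunk.o"],
        check=True, capture_output=True, text=True,
    ).stdout

    # The disassembly should correspond line-for-line with what was written,
    # which is what makes chunk-by-chunk verification tractable.
    print(listing)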
but why would you assume that the decompiler is not backdoored? It would know to remove the backdoor code, so the fixpoint is not going to show anything.
In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).
In practice, you can push the likelihood of this down to arbitrarily small levels. Pick a decompiler that's as unrelated to your compiler as you can find. Then pick a second decompiler that's maximally unrelated to both the compiler and the first decompiler. It's highly unlikely all three tools will be backdoored in a fully compatible way.
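A sketch of that cross-check, assuming two decompilers with no shared lineage; the decompiler command names below are made up, and the comparison heuristic is only illustrative.

    # Sketch of the "unrelated tools" cross-check: run the same binary through two
    # decompilers with no shared lineage and compare the recovered sources. The
    # decompiler commands here are invented names, not real tool invocations.
    import subprocess

    def recover_source(decompiler: str, binary: str, out: str) -> str:
        subprocess.run([decompiler, binary, "-o", out], check=True)
        with open(out) as f:
            return f.read()

    a = recover_source("decompiler-a", "suspect-cc", "a.c")
    b = recover_source("decompiler-b", "suspect-cc", "b.c")

    # The sources won't match textually, but a backdoor that one tool silently
    # hides and the other does not should show up as a gross structural difference.
    print("outputs differ wildly" if abs(len(a) - len(b)) > 0.2 * max(len(a), len(b))
          else "roughly consistent")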
> In theory, any computer tool you use could've already been compromised. You can't be 100% certain (even if you bootstrap from hand-made punchcards, the backdoor could be hidden in hardware).
This is the whole point of Thompson’s lecture, though.
Not if the backdoor is inserted into fopen/mmap calls in such a way that when any executable opens the binary for reading, it sees a version without the backdoor.
Malware does that all the time - return different stat() results, change the contents for anything which isn’t execution, etc. It’s detectable but fundamentally you’re in a nasty race since the attacker can use the same tools you do.
When a lot of software is mostly reproducible it may be easier to make the matching logic smarter instead of needing to chase down every possible build variation in every single application on Earth.
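One hedged sketch of what "smarter matching" could mean: normalize known-benign variation (embedded timestamps, say) out of the printable strings before diffing, instead of demanding bit-identical files. The timestamp pattern below is just a guess at one common build-date format, not a complete normalizer.

    # Sketch of "smarter matching": instead of demanding bit-identical binaries,
    # strip known-benign variation (here, embedded timestamps in the printable
    # strings) before comparing. Uses binutils "strings"; the timestamp regex is
    # only an illustrative guess at one common build-date format.
    import re
    import subprocess

    TIMESTAMP = re.compile(r"\b\w{3} [ \d]\d \d{4} \d{2}:\d{2}:\d{2}\b")  # e.g. "Jan  1 2024 00:00:00"

    def normalized_strings(binary: str) -> list[str]:
        out = subprocess.run(["strings", binary], check=True,
                             capture_output=True, text=True).stdout
        return [TIMESTAMP.sub("<timestamp>", line) for line in out.splitlines()]

    print("match" if normalized_strings("build-a/app") == normalized_strings("build-b/app")
          else "real difference")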
> And judgement can go wrong.
Which is why it is important to have a good design and threat model.
Definitely, and they only have a limited pool of manpower.
However, at least the problems with this approach are all in one direction: Bits differing might be due to an actual attack or because of weird compiler issues, but if you get the bits to be identical, you know that no funny business is going on.
The proposal to have more complicated rules doesn't have this upside. And you still need to make sure that the things that actually matter are the same (like instructions in your binaries), even if you successfully separated out irrelevant metadata changes.
> Caring about some metadata changing does not seem to be as useful from a security perspective.
If the metadata were neatly separated from the executable portions of an artifact then it indeed wouldn't be too much of an issue[1], but it often isn't. It sometimes gets embedded into strings and other places in the executable itself where it theoretically could have an impact on the runtime (and of course makes executable signatures unreliable, etc.). Without that separation comparing two potentially-the-same executables becomes equivalent to the Halting Problem.
[1] You could sign only portions of the executable, for example... but such signature schemes are brittle and error-prone in practice, so best avoided. So you'd ideally want any and all metadata totally separate from the executable.
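A minimal sketch of the "compare only what executes" idea, assuming ELF binaries and binutils objcopy; real artifacts have more sections that matter than .text, so this is illustrative only.

    # Minimal sketch: extract just the .text section from two builds and hash
    # those, so metadata baked elsewhere in the file can't mask or mimic a real
    # code change. Assumes ELF binaries and binutils "objcopy"; paths are placeholders.
    import hashlib
    import subprocess

    def text_section_hash(binary: str) -> str:
        raw = f"{binary}.text.bin"
        subprocess.run(
            ["objcopy", "-O", "binary", "--only-section=.text", binary, raw],
            check=True,
        )
        with open(raw, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    print("code identical" if text_section_hash("build-a/app") == text_section_hash("build-b/app")
          else "code differs")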
In order to trust signed binaries, it is very valuable to be able to verify that a binary actually came from a given source.
This way anyone who signs modified code can be caught. This is especially easy with open source projects. Incidentally, Debian is working very hard on deterministic builds.
No. In the latter case the "upstream source code" is the key, and the compiler is the functional transformation under scrutiny. Think of a triangle where we want two fixed (trustworthy) points to ascertain the third. Here our trusted principles are a repository (better, multiple repositories) of source code, and a deterministic executable from someone who has already compiled using a known good compiler.
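A small sketch of that triangle check: hash the deterministic reference binary from the trusted party, rebuild the same source with the compiler under scrutiny, and compare. File names and the compiler invocation are placeholders.

    # Sketch of the "triangle" check described above: two trusted corners (the
    # source repositories and a deterministic reference binary from a known-good
    # compiler) pin down the third (the compiler under scrutiny). Paths and the
    # compiler invocation are placeholders.
    import hashlib
    import subprocess

    def sha256(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    REFERENCE_HASH = sha256("reference-build/app")   # from someone you already trust

    subprocess.run(["suspect-cc", "app.c", "-o", "rebuilt/app"], check=True)

    if sha256("rebuilt/app") == REFERENCE_HASH:
        print("compiler under scrutiny reproduces the trusted build")
    else:
        print("mismatch: either the compiler or one trusted corner is wrong")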
It really limits the value of a stolen or leaked key. Similarly it limits the amount of trust you need in the key holder.
If the NSA could steal a debian key (or coerce a packager), without deterministic builds, they could put a backdoor in a popular Debian package with barely any chance of getting caught.
Deterministic builds make it much easier to detect this, which makes the value of such a stolen or coerced key much lower. That makes the attack much less likely to be attempted, and likely catches anyone who does try it.
Perhaps, but with deterministic builds, anyone can read the source code, recompile, verify integrity of the signed code, and even sign it themselves as a confirmation of its integrity. The more (independent) signatures there are, the more you can trust the precompiled binary.