Hacker News
Reproducing Go binaries byte by byte (filippo.io)
193 points by FiloSottile on April 23, 2017 | 30 comments



> Note: the default GOROOT, the one that the compiler will use if the environment variable is not set, must also match, since it will be copied into binaries

This seems worrisome for the privacy of builders. What if someone wishing to stay anonymous built and distributed a Go binary from a system where GOROOT included their home directory and revealed their identity? Am I misinterpreting something?


Yes, it's worrisome. An issue like this was brought up on golang-nuts in 2014 [1]; the response from the devs was that this behavior is intentional [2], and they note that this happens with other compilers as well.

I'm attempting to say this without sounding facetious: it's generally bad for privacy if you install tools and compilers to a personally-identifiable path, compile programs from personally-identifiable directories, perhaps using personally-identifiable accounts. Using generic paths (which is in this case the Go default), generically-named work directories, and generic-sounding user accounts is a viable mitigation, one that many programmers have adopted after getting burned by this in various environments.

As you suggest, perhaps there's an education gap here, because this is well-known by security professionals and intelligence services (like the CIA [3]), and less well-known by developers. Furthermore, most of these privacy-leaking behaviors are being (re-)discovered or discussed as part of the 'reproducible builds' movement (like this one), which sets out to identify context-specific behavior, rather than through a concerted effort focused on privacy in particular.

[1] https://groups.google.com/forum/#!topic/golang-nuts/oVDD8oPv...

[2] https://groups.google.com/d/msg/golang-nuts/oVDD8oPvDIY/fQ_r...

[3] https://wikileaks.org/ciav7p1/cms/page_27721733.html


The devs were right that this happens with other compilers as well. It has been an issue with a great many compilers across a great many languages for as long as I can recall (I remember tearing Windows binaries apart in the 90s to glean information about the developer's environment).

Thankfully this is less of an issue these days due to the proliferation of containerisation technologies and automated build pipelines, meaning the build system can be completely separated from the developer's working environment (e.g. a Jenkins box).


I used to enjoy digging into strings buried in binaries trying to find paths from the development environment.


At least for Go, I feel like this should be quite obvious to most developers.

The first time I saw a go stack trace I strings-d the binary to confirm, as I suspected, that they embedded this information in the binary. I filed it away and didn't think about it much since then (I'm not saying nobody could care, I just don't do things that require that kind of privacy)


> This seems worrisome for the privacy of builders. What if someone wishing to stay anonymous built and distributed a Go binary from a system where GOROOT included their home directory and revealed their identity? Am I misinterpreting something?

Anyone who cares enough about this must be sure to use a nondescript username / path for creating binaries.

Sniffing around for paths and filenames (and other compiler artifacts, like date/date formats/time zone/time of build, version numbers, locale/language markers, debug symbols, UUIDs, copyright strings, format strings) is an extremely well-known technique for investigating the origin of a suspect binary -- anyone who has ever heard of "malware attribution" is sure to know about this.
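As a rough illustration of how little effort the path part takes, here is a sketch of a strings-style scan in Go. The regular expression and the name `findHomePaths` are my own assumptions for illustration; real attribution tooling looks at far more than this, and the pattern only covers `/home/...` and `/Users/...` style paths.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// homePath matches byte runs that look like Unix or macOS home directory
// paths. The pattern is an illustrative guess, not an exhaustive one.
var homePath = regexp.MustCompile(`/(?:home|Users)/[A-Za-z0-9._-]+(?:/[A-Za-z0-9._+-]+)*`)

// findHomePaths returns the distinct home-directory-like paths found in
// data, in order of first appearance.
func findHomePaths(data []byte) []string {
	seen := make(map[string]bool)
	var out []string
	for _, m := range homePath.FindAll(data, -1) {
		s := string(m)
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Scan the file named on the command line, or this program's own
	// executable when no argument is given.
	var target string
	if len(os.Args) > 1 {
		target = os.Args[1]
	} else {
		exe, err := os.Executable()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		target = exe
	}
	data, err := os.ReadFile(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, p := range findHomePaths(data) {
		fmt.Println(p)
	}
}
```

Run it against a binary you built on your own machine and see whether your username turns up.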


The safest way to avoid leaking information like that is to build inside of a docker container.

Even if the GOROOT is not in a suspicious location, the actual location of each source file on disk is compiled in as well (which almost certainly will be in your home directory).

Copying it into `/usr/app/src/$pkgPath` in a docker container makes the path generic and also ensures various other details of your environment are less likely to leak out due to the mount namespacing.

Alternately, building it on a CI system like travis and ensuring those are the only binaries you ever distribute is marginally safer.


This. If you actually NEED to care about your privacy for whatever reason, you should have better opsec than building Go binaries on your daily driver. It's a complicated process that you probably don't fully understand, and the output may contain system artifacts.


Are you serious? I have a pretty unique name. On some of my PCs I use my name as my account name. I only care about my privacy as much as the next guy, but providing release binaries from my PC shouldn't mean I'm sharing my personal information.

Honestly this thread is going in the same direction most other threads I see about any flaw in GO seem to take. Users pop up to defend GO and try to downplay the issue or shift blame towards something else, like you not framing the issue in the "GO way" (I guess not wanting to expose your username in release binaries is just not the "GO way"; if you're not using Docker for release builds, your username showing up is your own fault).


The reason people are defending Go is because this isn't a new phenomenon in AOT compiled languages. Popular compilers for a whole plethora of different programming languages have been leaking environmental information into their binaries for as long as I've been a developer (~30 years). It's actually pretty normal practice - particularly if you want to provide debugging data (as this is used for). It's also the kind of analysis security researchers use to establish where malware originates.

This is also a non-issue because if you're compiling binaries for redistribution you'd disable the debugging references (via compiler flags) and be building your source in a separate environment to your development environment (eg using Jenkins). Heck, even just a docker container on the dev machine would work.

So to recap:

1) not a problem with Go; a problem with debugging references in all AOT compiled languages.

2) not a new issue; I've been observing this in other languages for at least 2 decades.

3) it's easily fixed by implementing any or all of the following solutions:

i) build automation / CI pipeline (your build environment becomes separate from your dev environment)

ii) docker or other build container or VM (same reason as above)

iii) just disabling the debug references at compile time


The language used in the Google groups thread above strongly implies that release builds should have the same information embedded:

> I think this approach sounds to be the cleanest. Using -ldflags -s is probably untenable for releases since people expect all of the normal reflection and debugging to work properly. GOROOT_FINAL gets around the privacy issue without making any other major changes.

I'm no stranger to debugging symbols having personal info; that's why I specifically mention `release` builds no less than 3 times. They're even admitting it's a privacy issue, though the fix they mention is a non-default option. I'll give them that having a fix at all is good.

And to this: > The reason people are defending Go is because this isn't a new phenomenon in AOT compiled languages

I'd correct that to the reason you're defending Go in that comment. Plenty of people clearly don't think it's an issue, not because they're used to debugging symbols having personal info, but because of what boils down to "What's the big deal, you're using Docker, right?".

All of that aside, my comment about my experiences with Go on HN go well past this single thread...


> All of that aside, my comment about my experiences with Go on HN go well past this single thread...

Honestly I see just as many people being highly critical of Go as well. Go is like the Apple of programming languages: it's opinionated, arguably spare in design and often polarises people's opinions.

But with regards to your main point, I was talking about production builds as well. I've lost count of the number of times I've pulled build data from a Windows PE or Linux / FreeBSD ELF. I went to college in the 90s and it was a hobby of mine to reverse engineer the build environment of Windows executables (a bit like in this article) because I wanted to know what technologies the professionals were using.


My point is there are thousands of pieces to how computers work that people don't understand.

If you RELY on privacy you should have better opsec than just using your daily driver to build your go binaries. Compartmentalizing your life is important if you NEED the privacy.


That or try using a generic username such as "user" or "admin" on a system you control, that way the paths will show up as they should but will be too generic to be useful.


Do you know enough about your environment to confidently say that this is all that could be leaked? I don't.

Modern environments are too complex to manage safely with whitelists. Someone will always forget something.


I imagine people who really don't want to be identified digitally would do their research and likely use stripped-down VMs, or whatever other options there are.


Practically speaking, I don't think it's a problem. You'd probably want to reproduce the builds in an environment that was, itself, reproducible and isolated, such as Docker or a chroot or a VM (Maybe for security reasons rkt would be a good candidate since it has signed images on top of normal cryptographic hashing.) That's my opinion anyway. It's also useful to do so for projects that use a CI pipeline.


Another option for this is Repeatr [1].

Repeatr is for -- like it says on the tin -- repeating things; and hopefully, reproducing them. It's all the containment (runc, underneath), plus the ability to pull in various filesystems, specified by hash.

For concrete examples, you can see Repeatr building Repeatr reproducibly here [2], for its own releases (it's a Go project, so this looks a lot like what's going on in Filippo's blog here). Another example is this formula for reproducible builds of Runc [3] -- it uses cgo, and became reproducible with go1.7, which was very exciting!

The big gain is Repeatr's syntax for multiple inputs and built-in hash verification means you can assemble reproducible environments like this out of multiple filesystems (say, busybox from one place, golang compiler separately), they all cache separately, they all download in parallel, you never have to write your own "tar -xcvf" boilerplate... and most importantly, you get built-in sanity-checking that the upstream tarballs you're using haven't changed either. No centralized "registry" server required, either. Give it a try!
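To make the sanity-checking idea concrete for anyone rolling it by hand, here is a minimal sketch of pinning an input by hash in Go. This is not Repeatr's code or its hash format; SHA-256, the lowercase hex encoding, and the name `verifyPinned` are all assumptions picked for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyPinned checks data against a pinned, lowercase hex-encoded
// SHA-256 digest and returns an error describing any mismatch.
func verifyPinned(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if got != wantHex {
		return fmt.Errorf("hash mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	// Stand-in for a downloaded tarball; a real build would read the file.
	tarball := []byte("pretend these are the bytes of an upstream tarball")

	// Pin the digest once, when the input is first reviewed...
	sum := sha256.Sum256(tarball)
	pinned := hex.EncodeToString(sum[:])

	// ...and refuse to build later if the bytes have changed.
	fmt.Println("unchanged input:", verifyPinned(tarball, pinned))
	tampered := append([]byte("x"), tarball...)
	fmt.Println("tampered input: ", verifyPinned(tampered, pinned))
}
```

The point is only that the pin lives next to the build description, so a changed upstream input fails loudly instead of silently producing a different binary.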

Disclaimer: author, of course :)

---

[1] http://repeatr.io , https://github.com/polydawn/repeatr

[2] https://github.com/polydawn/repeatr/blob/master/meta/release...

[3] https://github.com/polydawn/formulary/blob/master/formulas/r...


For getting true value from reproducible builds, the environment should be as free as possible. If a build requires a certain image to be used, you need to trust that image not to be backdoored. Ideally, the only binary dependency in the environmental requirements is the compiler itself. Everything else should be source-code or plain-text.

Getting the compiler out of the chain would be even better, but trusting trust is a hard problem.


Yes, when you turn it up to 11, reproducible builds can be seen as a recursive problem. However, that's not a doom and gloom issue. Like the old saying goes: How do we eat an elephant? One bite at a time. The saying still works even if it's a recursive stack of elephants: just keep going one bite at a time.

Repeatr formulas are a precise identification of one build. That doesn't mean you shouldn't make a build that's even more portable. It means you can use more than one formula to precisely describe, and test, how portable and flexible your build is. You can also use one formula to describe one step of a build (say, compiling a base image), and then output the resulting filesystem... it's assigned a hash, and you can use this as an input to another formula. Thus, you can use binary images, AND have a complete (recursive!) reproducibility path.

You can use reppl [1] for creating pipelines like this (though it's not yet geared for public re-sharing and comparing of results -- that's a big todo, if you want to lend a hand!).

> trusting trust is a hard problem.

Imagine you had two formulas, both with fully hash-pinned input filesystems, which both build GCC. One uses a GCC binary as an input. The other uses a clang binary as an input instead. When you run each of them, you get a hash of the result filesystem. The remaining step to address Trusting Trust is left as an exercise to the reader :)

---

[1] https://github.com/polydawn/reppl/


It's actually possible to work around this by using the -trimpath option to the assembler via '-asmflags -trimpath'.
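For a build script written in Go, the invocation might be assembled roughly like this. It is a sketch under assumptions: `trimpathArgs` is a hypothetical helper, the prefix is passed to both the compiler and the assembler, and the directory is assumed to contain no spaces. Newer toolchains (Go 1.13 and later) also accept a single `go build -trimpath`, which postdates this thread.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// trimpathArgs builds a `go build` argument list that asks the compiler
// and the assembler to strip the prefix dir from recorded source paths.
func trimpathArgs(dir, output string) []string {
	return []string{
		"build",
		"-gcflags=-trimpath=" + dir,
		"-asmflags=-trimpath=" + dir,
		"-o", output,
		".",
	}
}

func main() {
	dir, err := os.Getwd()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Print the command rather than running it, so the sketch has no
	// side effects; pass the same arguments to os/exec to run it.
	fmt.Println("go", strings.Join(trimpathArgs(dir, "app"), " "))
}
```

It is worth running strings over the result afterwards, since other inputs (such as the default GOROOT mentioned in the article) can still carry paths.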

I do this when I build Go for Solaris:

https://github.com/oracle/solaris-userland/blob/master/compo...

https://github.com/golang/go/issues/13616

See also:

https://github.com/golang/go/issues/16860


Note that builds with GCC and debug information are also potentially identifiable as they may contain full source paths (/home/my_actual_name/projects/my_project/src/my_module.c).


I believe that in tracebacks, or whatever they are called, the full path of the file that caused the issue is shown. So yeah, it will reveal your name if your GOPATH is in your home directory.


GOROOT is the root directory of the Go installation itself.

More about GOROOT here[1].

If you install Go into your home directory, then yes that path would show up in resultant binaries.

[1]: https://dave.cheney.net/2013/06/14/you-dont-need-to-set-goro...


This is why my username in every machine is "user".


user@host here :)


I can understand having full paths in binaries for the debug versions (which you shouldn't be distributing anyway), but if you're compiling a release version, does the path still get included? If so, that sounds like a bug.


At least this one

  Interestingly, the build host architecture does not matter. 
  In other words, builds are reproducible across cross-
  compiling.
Is there a list of which compilers embed which identifiers in binaries? Thinking about it, a compiler could embed the CPU ID, MAC address, etc. to make you traceable - but do they? (Like color printers/photocopiers that embed an almost invisible code on every page to make you traceable.)


I don't know of a list but people have been aware of the potential for decades: https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thomp...


This is a bug that should be fixed.

In the meantime, if you want to be anonymous, build in a non-specific path like /build. And perhaps more generally, you might want to use a generic username like "user".



