When coding style survives compilation: De-anonymizing programmers from binaries (freedom-to-tinker.com)
218 points by randomwalker on Dec 29, 2015 | hide | past | favorite | 67 comments



This statement seems debatable: "Since all the contestants implement the same functionality, the main difference between their samples is their coding style."

All the (winning) contestants implement the same functionality, yes, but with possibly wildly different approaches/algorithms, so the main difference between code samples is not just "style" but what could be called "general thinking in and around the problem".

But this statement seems the most interesting: "By comparing advanced and less advanced programmers’, we found that more advanced programmers are easier to de-anonymize and they have a more distinct coding style."

Beginners tend to think alike, while experts develop an original line of thinking, that is identifiable. The paper could be called "Fingerprints of Thought"...


> Beginners tend to think alike, while experts develop an original line of thinking, that is identifiable.

I'm not sure I agree with your conclusion. Beginners tend to look to others' code more frequently (often including copying and pasting), which means their "style" is really an amalgamation of multiple styles.

On the other hand, even when more experienced programmers look to others' code, they will still fold it into their own style, even when including large amounts of somebody else's code. Their own style shows through the entire project.

However, both of us are just hypothesizing. A great follow-up to this work would be to look at why beginners are harder to identify. I'm sure the other obvious follow up, "how to anonymize yourself", is in the works somewhere.

The other thing I'd be interested in is how well this holds up for multi-person projects. Would it be possible to identify if I submitted code to the Tor project? How much would I have to contribute before identification becomes likely?


Given the goal, that still counts as style. That thought patterns are reflected in code just makes it easier to tell who wrote it.


As a professional, I tend to think a distinct coding style is bad. You should try to write code that is plain and unsurprising and reproducible. To me, this study gives us another reason why we need a more uniform/standardized methodology and good education for the software industry.


Writing code that is plain, unsurprising, and reproducible not only is, itself, a style but also still leaves a lot of room for creativity/individuality.


Writing code is in some ways a lot like writing in general. From "The Elements of Style":

> Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that he make every word tell.

https://en.wikipedia.org/wiki/The_Elements_of_Style


Concise, sure, but at the very least reasonably easy to decipher, and preferably readable.


There's always more than one way to skin a cat in these things; we're talking about choices much deeper than where one puts their curly braces. For projects I work on, I can guess with a high degree of certainty who wrote what just based on general patterns, naming conventions, phrasing in comments, etc. Most of these could be called "plain and unsurprising and reproducible", but people tend to leave little fingerprints here and there.


Is this view unique to programmers, or do you extend it to artists and writers as well?


Artists' and writers' products are supposed to be aesthetic; code is supposed to be readable and maintainable.

I apply it to journalism, though. Newspapers have long had "house" style guides to keep things consistent across reporters.


Call it a best practice for inexperienced artists and writers. If I, for instance, were to take up painting as a hobby, I would try to find some artistic formula or method to follow before branching out into a more unique style.

I would consider this more important for writing than for art, since reading a novel demands much more of a time commitment from your audience, whereas art gives more flexibility in deciding how long your audience wants to spend with it.

However, there's a difference between bad art and writing and bad programming. Bad art and writing merely go unappreciated. Bad programming is full of errors, hard to maintain, and runs slowly, giving an overall poor user experience.


Artists and writers don't usually make things that will be modified in the future. Art also faces less risk of dependency hell.


I think you present a false dichotomy.

A distinct coding style does not indicate that the code is surprising or difficult to follow.

A diversity of styles should help overall code quality in an organization because you get to pick and choose the 'best approach'* for a situation, and more importantly, learn from your peers.

* whatever that heuristic happens to be... probably unsurprising and reproducible for you. I'd say the heuristic for me is 'concise, readably correct'. I've always thought of programming as closer to writing mathematical/logical proofs than anything else. If I can eyeball a program and say 'yup, that does what it claims to do,' then it's good. I don't think that heuristic always holds up in practice, simply because successful code-bases blossom in proportion to their success.


Totally agree; in the paper they only look at competitive programming contests. It would be interesting to see whether the same could be applied to a larger code base with more authors, and whether it's possible to find a particular author in such a code base.


One might argue that the things that could be taught as part of such a standardized methodology would either be extremely generic, or can and should be automated. "Design patterns" and "idioms" are definitely helpful and necessary for practical reasons, but to me they frequently seem like good candidates for automation. Maybe in an ideal world, the parts of a solution that are "unique", i.e., not well-known and established, are even the only ones that should have to be specified by a human.


The original "design patterns" book, where the term comes from, talks exactly about that.


Why don't you consider different approaches and algorithms part of the definition of style?


It doesn't seem to be the primary meaning of the word; style is considered superficial, a coating over substance.

Of course that's a matter of interpretation, but I find the summary of the paper suffers from the ambiguity.

There is also the fact, that I alluded to, that winning entries in a programming contest should be more alike than non-winning entries, in that they all actually solve the problem; if non-winning entries were included, it could explain some of the discrepancy in identification success.


I think the correct logic is more like this: Because the winning entries are all solving the same problem, any discrepancies in how they solved it (logically or stylistically) are magnified when grouped together.


This is an interesting article, but there's one part that I don't really understand: 'After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy.'.

Isn't 52% close to a random guess? I don't get how this ties in with the rest of the paragraph either.


If they were saying that in 52% of the cases, they could guess the correct choice out of 600, then random chance alone would have been <.2%.

If they were saying that in 52% of cases, they could guess if some code did or did not belong to a given programmer, then it is basically a random guess.
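
Just to spell out the arithmetic behind those two readings (a throwaway sketch, assuming uniform random guessing):

    # Illustrative only: compare the reported 52% against chance baselines,
    # assuming every guess is uniformly random.
    n_programmers = 600
    multiclass_baseline = 1 / n_programmers   # pick the right author out of 600: ~0.17%
    binary_baseline = 0.5                     # yes/no for a single candidate author
    reported = 0.52

    print(f"multiclass chance: {multiclass_baseline:.2%}")                        # 0.17%
    print(f"binary chance:     {binary_baseline:.2%}")                            # 50.00%
    print(f"lift over multiclass chance: {reported / multiclass_baseline:.0f}x")  # ~312x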


> If they were saying that in 52% of cases, they could guess if some code did or did not belong to a given programmer, then it is basically a random guess.

That entirely depends on the distribution of the original sample.


Does it? If you're asked "does this code belong to this programmer?" and flip a coin, you'll approximate 50% regardless of distribution of programmers.

Of course, "const No" might do quite a bit better than 50%, depending on the distribution of the original sample (for that matter, so might "const Yes", but those distributions seem less likely).


Well, if there were 600 programmers, one of whom had written 649 programs and 599 of whom had written 1 program each, you could achieve 52% accuracy by always guessing the same guy (649 of the 1,248 programs ≈ 52%).

Of course, the paper explicitly says "For this experiment, we use 600 contestants from [Google Code Jam] with 9 files" so I think in this case the distribution was probably fairly even?


I think you missed some context. Up-thread, someone raised the point that it's unclear whether the question was "Of these 600 programmers, which wrote this code?" or "Did this programmer write this code?", and this subthread is discussing the latter case.


Yeah, I had an underlying assumption in my mental model that they would have split the questions so that half were yes and half were no. But you are right that that assumption may not hold. A random programmer with a random program will be a 'no' far more than 50% of the time.


Considering there are 600 programmers, 0.166% (1/600) would be a random guess.


No, a random guess would give you 0.16% accuracy.


52% is close to a random guess, yes, if there are only two alternative outcomes, such as heads or tails in a coin toss.

If the question is "which of these 20 programmers wrote this code?", then a random guess has only a 5% chance (1/20) of being right! So 52% is more than 10x better than random.


Don't miss the talk by Aylin starting in 50 minutes here:

http://streaming.media.ccc.de/32c3/hallg/

(Recording available later ...)




Often, just running `strings | grep '/home/'` on a binary will reveal the home directory of at least one programmer involved in the compilation process.
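
If you want to poke at a binary without the shell tools, a rough Python equivalent of that one-liner might look like this (just a sketch; strings(1) has more options than this heuristic):

    import re
    import sys

    # Mimic `strings <binary> | grep '/home/'`: find printable ASCII runs of
    # length >= 4 (the default strings(1) heuristic) that contain "/home/".
    data = open(sys.argv[1], "rb").read()
    for match in re.finditer(rb"[\x20-\x7e]{4,}", data):
        text = match.group().decode("ascii")
        if "/home/" in text:
            print(text)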


Or rather, this is what you will find if the company in question has amateurish procedures and doesn't use automated build tools and continuous integration.

Then yeah, you might very well find some guy's /home/ path in firmware running on production basebands.

(Hi, Dojip Kim! https://www.linkedin.com/in/dojip-kim-7b0b1b6a)


This seems to indicate that modern optimizers still have a long way to go to be as effective as they could be, since broadly similar code even in different styles would end up with the same, most efficient end result(?)


They did this with optimizations off:

> The above mentioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style.
Everyone's still safe with some -O* code massage.


Saying everyone is 'still safe with some -O* code massage' is a very strong claim. In fact the paper in question deals with this issue.

Taken from Section VI.A, "Compiler Optimization: Programmers of optimized executable binaries can be de-anonymized":

"[...] programming style is preserved to a great extent even in the most aggressive level-3 optimization. This shows that programmers of optimized executable binaries can be de-anonymized and optimization is not a highly effective code anonymization method."

Please try and do a basic level of investigation before making claims.


Did you read the next sentences? Even with optimizations turned on they could still predict authorship, just not quite as well.


I think you're assuming there is one maximally efficient ASM product that can be known at compile time. The same program will work on different inputs, and it's possible that different ASM products will handle some inputs more efficiently than others.


Well, this puts a damper on my plan to create a secret identity on the internet under which I release software the way Banksy releases art.

Unless I stop publicly writing software for a few years.


My suspicion is that most of these features wouldn't survive in an adversarial setting -- either by consciously changing your coding style or (better) using automated tools to rewrite your source code before compilation to alter the control flow structure (e.g., control flow flattening [1]).

http://reverseengineering.stackexchange.com/questions/2221/w...
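
As a toy illustration of what control flow flattening does (hand-written sketch, not the actual tool behind the link): both functions below compute the same thing, but the flattened one routes every step through a single dispatcher loop, hiding the original branch structure that a fingerprinting feature might latch onto.

    def gcd_structured(a, b):
        # ordinary, structured control flow
        while b:
            a, b = b, a % b
        return a

    def gcd_flattened(a, b):
        # same computation, but every transition goes through a state dispatcher
        state = 0
        while True:
            if state == 0:      # test: keep looping?
                state = 1 if b else 2
            elif state == 1:    # loop body
                a, b = b, a % b
                state = 0
            else:               # exit
                return a

    assert gcd_structured(1071, 462) == gcd_flattened(1071, 462) == 21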


Another option would be safety in numbers. Take a large executable written by someone else and wear its dead code like it's your own skin, hiding your true self in there somewhere.


That might be an interesting way to defeat it! Thanks.


I am not so sure that the coding style doesn't survive the automated tools. It's also about what features you implement and how you do it.


Running `strip` on the binary decreased classification accuracy by 12%. They quite correctly point out that this isn't a large drop, but strip is really the lowest of the low hanging fruit when it comes to code obfuscation. I'd expect much bigger drops from anything that's trying to make code harder to fingerprint.


> I release software the way Banksy releases art.

It's pretty easy to vandalize a wiki tho.


Wha


you can use another language


One of the points of the talk/article is that style survives compilation, obfuscation and optimization. I think (but the talk did not support that directly) that it would survive another language, at least to a certain degree.


I doubt that their system would be able to recognize programmers if their training samples are in C and their test samples are in Haskell.


Maybe we can use this method to find Satoshi, if he uses github.


Leaving aside that finding Satoshi isn't a particularly laudable goal, isn't the original Bitcoin client open source? It's almost certainly easier to detect programming style from source code than from binaries.


The bitcoin client would be the training set.


In the CCC talk, the author actually talks about this. But she mentioned that their primary suspect doesn't have any code samples online, and they left it at that.

Their code should be online, but I am not sure exactly what is there.


Depending on how much of Satoshi's fingerprint is left and how early they started version-controlling the code.


That was the first thing that came to my mind too :)


Really neat, especially since the source could be from different languages and compilers. I'd be interested in the de-anonymization accuracy within the Go language, where the standard code formatting tool is widely adopted, and in seeing whether that has any impact.


Go embeds paths to libraries and some temporary paths in its binaries. Feel free to run strings and possibly find the user's home directory, and skip de-anonymization of the source code entirely.


I write almost exclusively Go these days, and I think most of the people I work with could tell you if I wrote a Go program.


From only the binary code? That's quite a feat for the unassisted human :P (mind sharing what kind of people you work with?)


This sounds interesting. Could this be applied to Stuxnet, Duqu and malware in general, or would you require more information?


I thought about looking at the bitcoin implementation for finding Satoshi.


We have the commit history for the source code of bitcoin to profile Satoshi. No need to try and do it based off of the binary.


Sure. I thought of the general method.

I saw that this has been tried to some extent: https://en.wikipedia.org/wiki/Satoshi_Nakamoto#Nick_Szabo


The features given to their random forest classifier appear to be:

* assembly language instruction features: "token unigrams and bigrams"

* decompiled lexical features (unparsed decompiled text): "word unigrams, which capture the integer types used in a program, names of library functions, and names of internal functions when symbol information is available"

* syntactic features (from the parsed AST of the decompiled text): "AST node unigrams, labeled AST edges, AST node term frequency inverse document frequency (TFIDF), and AST node average depth"

* basic block features: TF-IDF weighted "unigrams and bigrams, that is, single basic blocks and sequences of two basic blocks"

source: http://www.princeton.edu/~aylinc/papers/caliskan-islam_when....
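
A minimal sketch of how one might wire features like these into a random forest (purely illustrative, using scikit-learn; not the paper's actual code, which extracts the richer feature set listed above from a disassembler and decompiler):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    # Hypothetical inputs: raw disassembly text per binary, plus author labels.
    disassembly = [
        "push rbp\nmov rbp, rsp\ncall printf\nleave\nret",
        "xor eax, eax\ncall puts\nret",
    ]
    authors = ["alice", "bob"]

    pipeline = make_pipeline(
        # token unigrams and bigrams over the disassembly, TF-IDF weighted
        TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)),
        RandomForestClassifier(n_estimators=500, random_state=0),
    )
    pipeline.fit(disassembly, authors)
    print(pipeline.predict(["xor eax, eax\ncall puts\nret"]))  # most likely ['bob']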


Interesting, but if your code is open source, a lot of people will be contributing to it, bringing in their own styles.

Depending on how popular the code is, your fingerprint could be completely hidden amongst hundreds of others.


If the code is open source, there is likely a history of contributions. But this research is about "De-anonymizing programmers from executable binaries" anyway, so it's really not an OSS scenario.



