Hacker News
A native code to C/C++ decompiler (derevenets.com)
137 points by fla on Oct 8, 2014 | 54 comments



No source? The first thing I did was try it on itself... which I suppose is somewhat of an "acid test". It took a few minutes and an enormous amount of memory, but finally it told me that the function at the very beginning of the executable, definitely a nontrivial one, decompiles to...

    void fun_401000() {
    }
I'm sure I hit upon some edge case and much better output can be had from this tool if I play with it some more, but for a first impression, not so good. But I'm definitely going to keep this one around, it looks promising.


For all we know, that result is a complete facsimile of the original source. But then again, we don't have the source.


The original and still the most powerful disassembler is IDA Pro. The project was started in the 90s and has been used for security analysis, antivirus work, protection analysis/research, and hacks, as well as for normal dev work in closed-source ecosystems.

https://www.hex-rays.com/products/ida/index.shtml

The author has implemented a decompiler plugin on top of IDA, and it works on real-world code. The point here is to annotate the disassembly bottom-up and then decompile.

https://www.hex-rays.com/products/decompiler/index.shtml

I don't want to bash the author of Snowman - this kind of research is serious fun. Yet, IDA has an insane lead.


IDA is great, but for day-to-day work, especially if you're casual, Hopper is a really strong contender, and it's (amazingly) even cheaper than IDA. Hopper also was designed from the get-go to have a first-class Python interface, and it includes a workable decompiler.


IDA was not the first one. There have been a few others since the 80's.


AFAIK it was the first one that got the model right.

The only other one I remember is Sourcer (sr) for MS-DOS.


Sourcerer iirc. Purple floppy.


It's also expensive. I'm sure it's worth it from all I've heard, but it's unlikely I'll ever have the money for hobby work.


For what IDA does, it's incredibly inexpensive, so much so that it's distorted the market for reverse engineering tools. Consider that people who use IDA on a day-to-day basis have $250/hr+ bill rates, and if they use IDA, they rely on it. Meanwhile, the set of people who use IDA on a day-to-day basis is very small relative to the whole industry.

I'm not saying you should buy IDA, just that I think IDA is severely mispriced.


I hear you. For the sake of completeness, here is the other side of the argument, from students who use IDA for reverse engineering. Those activities are really about cracking freemium/shareware apps, and the associated subculture is a little... well... special. I have heard countless times that IDA should cost $200 and the author should work with the community to improve the tool...

My stance is that the tool is very specialized and unique, and the cost is reasonable for professional use cases.


I know I'm repeating myself here, but I want to make sure I communicate this:

IDA's price is so low that it actually harms the market for professional reverse engineering tools. Most useful products you can build --- tracers, visualizers, emulators, pattern matchers, debuggers --- fit into IDA's orbit. As products, as "feature/function/benefit" statements, they are subsets of IDA. But they're chained down by IDA's price. Just like IDA, they have to serve a small market of users who make tens of thousands of dollars per week using the tools, but the market optics make it hard to charge even a significant fraction of the (low) total cost of IDA.

It's sort of hilarious to me to see what Hopper is doing to the market. "Ruining it entirely" wouldn't be far from the truth. I'm only sort of complaining. Viscerally, I'm thrilled that Hopper exists.


Who is making tens of thousands per week? Outside of selling exploits to government agencies or cybercrime. Or is that what you meant?

AV jobs pay shit.


The software security bill rate for people who are competent with IDA and can find bugs black-box with it exceeds $3k/day. Source: until Friday, I'm a principal at a very large software security consulting firm.

That's for projects denominated in billable days. Talented specialists do even better, on fixed-price projects for specific targets. Rates get higher for cryptographic work, as well.

I am not talking about selling vulnerabilities. I have never sold a vulnerability to anyone, nor have I (to my knowledge) done any software security work for any division of the USG or any other government, nor would I.

Don't work in AV. There are worse things about AV than the pay scale.


Right, both products are expensive. I get them through work (for work). You can get the free (and very limited) version of IDA for hobby stuff:

https://www.hex-rays.com/products/ida/support/download_freew...


Unfortunately, support for e.g. ARM or x86-64 is missing from the free version (while I have no problem with the ban on commercial use - if I really made money with it, of course I'd pay).


It's unlikely they'd let you license it for hobby work; last I heard they're very restrictive about who they're willing to license it to in order to prevent pirate copies leaking.


Definitely not true. I know of someone who licensed it personally for a hobby, without a corporation.


C and C++ are different languages. I only saw C examples. How would you even decompile to C++? The only C++ information you have in the object code is the mangled names. How do you use that to get C++ code?
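To make the question concrete, here is roughly how much a mangled name gives you (a hand-written sketch assuming the Itanium C++ ABI used by g++/clang on 64-bit Linux; MSVC's scheme differs, and none of this is output from the tool):

    // sketch.cpp -- hypothetical class, just to show what survives in the symbols
    #include <cstddef>
    namespace net {
        struct Socket {
            int send(const char*, std::size_t);   // declared here, defined elsewhere
        };
    }
    int call(net::Socket& s) { return s.send("hi", 2); }
    // `g++ -c sketch.cpp && nm sketch.o` shows an undefined reference to
    //     _ZN3net6Socket4sendEPKcm
    // and `c++filt` turns that back into
    //     net::Socket::send(char const*, unsigned long)
    // Namespaces, class names, method names and parameter types can be read
    // straight out of the symbols; return types, parameter names and function
    // bodies cannot.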


And vtables, rtti data, systematic sequences of ops for invoking base classes, intrinsic/library functions emitted by the compiler in specific situations, exception handling tables, ...


But none of those is something only a C++ compiler would specifically do, right? How would you distinguish a vtable emitted by a C++ compiler from one hand-rolled in C? I suppose you could just offer a reasonable guess. I wonder if decompiled GTK+ code would end up turning into C++.


Most C++ platform ABIs are pretty trivial to recognize. A tool like this could distinguish a C++ binary by looking at its initialization/housekeeping sections, static constructors, __cxa_atexit, exception handling tables, linked libraries, etc.

For example, the Itanium C++ ABI (http://mentorembedded.github.io/cxx-abi/, perhaps Itanium's only real legacy) adopted by Linux/ELF and other platforms, leaves a huge amount of fingerprints on a binary. It'd take a very conscious effort, including hand-crafting linker scripts, to generate a C binary that a tool would incorrectly think was C++.
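As a concrete illustration of those fingerprints (a hedged sketch; the symbol names assume g++ targeting Linux/ELF):

    // fingerprint.cpp -- even one static object drags in Itanium ABI machinery
    #include <cstdio>
    struct Logger {
        Logger()  { std::puts("ctor"); }
        ~Logger() { std::puts("dtor"); }
    };
    static Logger global_logger;   // static storage duration, nontrivial ctor/dtor
    int main() { return 0; }
    // g++ emits a hidden initializer (run via .init_array) that constructs
    // global_logger and registers ~Logger with __cxa_atexit(&dtor, &obj,
    // &__dso_handle). Add a virtual function or an exception and you also pick
    // up _ZTV*/_ZTI* vtable/RTTI symbols and __gxx_personality_v0 - things a
    // plain C build essentially never produces.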


As some other decompilers do, by having a knowledge pool of specific compilers.

Back in the 90's there was one targeted specifically to executables produced by Borland compilers.


The first C++ compiler (Cfront) compiled from C++ to C.

Once you get some C code that does things with structures and function pointers like a C++ compiler would do, I think it's not impossible to turn those back into classes if you can recognise the patterns that a C++ compiler uses to compile C++ constructs like classes, virtual functions, etc.
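Roughly the kind of pattern a decompiler would have to recognise, sketched by hand (illustrative C, not actual Cfront or compiler output):

    /* C++ original:
     *     struct Shape { virtual double area() const; };
     *     double f(const Shape* s) { return s->area(); }
     * Plausible C lowering: */
    struct Shape;
    struct Shape_vtbl { double (*area)(const struct Shape*); };
    struct Shape      { const struct Shape_vtbl* vptr; };

    double f(const struct Shape* s) {
        /* virtual dispatch: load the vptr, pick slot 0, call through it */
        return s->vptr->area(s);
    }

Spotting that shape (a struct whose first member points to a table of function pointers, with every call passing the object itself as the first argument) is exactly the kind of pattern recognition involved.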


It's actually a lot harder than C. There is a roughly 1:1 correspondence between C and assembly. There is no such 1:1 correspondence between assembly and C++.


> There is a roughly 1:1 correspondence between C and assembly.

I'm afraid that hasn't been true for a number of years. Turn on the optimizer and the resulting assembly can be totally unrecognizable as a translation of what you actually wrote. The compiler takes all kinds of steps both to eliminate redundant operations (via CSE, loop unrolling, etc.) and to take advantage of the weird quirks of modern CPU architectures, like out-of-order execution.
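A tiny example of the size of that gap (hedged: the exact output depends on the compiler version and flags):

    /* What the programmer wrote: */
    long sum_to(long n) {
        long s = 0;
        for (long i = 0; i < n; ++i)
            s += i;
        return s;
    }
    /* At -O2, clang is known to replace the whole loop with a closed-form
     * n*(n-1)/2 computation, and gcc typically unrolls and/or vectorizes it,
     * so the "C" you get back out of a decompiler shows multiplies and shifts
     * (or SIMD shuffles) where the source had a three-line loop. */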


Just yesterday I was debugging a piece of code I was writing. I started with a printf where I suspected a segfault was occurring.

  printf(...);
  for (...) {...}
The printf never executed. So I fired up gdb and confirmed the segfault was happening where I thought it was, but the for loop was being initialized before the printf. Even with debugging on and optimization off. It could be a bug in clang (I didn't think to try it with gcc), but even so, it shows that the way things should work in theory doesn't necessarily dictate the way they do work in practice.


There is a roughly 1:1 correspondence between C++ and C, so it's just one more step to go from C to C++. Just off the top of my head...

    class data fields -> structure members
    non-static methods (ctors/dtors included) -> functions with an extra 'this' pointer parameter
    inheritance -> extended structures with matching prefixes
    virtual functions -> structure with function pointers
    operator overloading -> methods with special names
    default arguments -> automatically inserted by compiler
    function overloading -> types encoded in name of function
    templates -> whatever code they generate
    exceptions -> special library functions are used, e.g. _CxxThrowException()
Some of those don't decompile so well (e.g. template-generated code, although a refactoring tool might be of help), and things like operators and overloaded functions are probably more of a stylistic choice than a difference in the generated code, but for the most part it doesn't look impossible to get some C++ features to show up in decompiled code. (And given that many of the C++-level optimisations revolve around removing unnecessary code, they could make the C to C++ step even easier.)
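For the 'extra this pointer' and default-argument rows above, the mechanical picture is roughly this (a hand-written sketch with made-up names, not real compiler output):

    /* C++ original:
     *     struct Buf { int put(int c, int retries = 3); };
     *     int g(Buf* b) { return b->put('x'); }
     * Approximate C equivalent the decompiler actually sees: */
    struct Buf { int fill; /* whatever data members exist */ };

    int Buf_put(struct Buf* this_, int c, int retries);   /* a mangled name in practice */

    int g(struct Buf* b) {
        return Buf_put(b, 'x', 3);   /* default argument materialised at the call site */
    }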


Interesting that gcc removes the \n from the string and calls puts() directly - this avoids the overhead of parsing the string for non-existent format specifiers.

The decompiler could do with a bit of work making dynamic library imports more symbolic. Following the puts call chain, you quickly disappear into a non-local jump to an address with no further references.


Another native code decompiler, although apparently abandoned long ago: http://boomerang.sourceforge.net


Thanks for sharing, I did not know about it. Of course, making everyone on this thread aware of IDA's Hex-Rays too: http://www.hex-rays.com/products/ida/


That "hello world" decompilation is complex!

[EDIT: Very informative replies below, thanks!]


It's because of the #include <stdio.h>, isn't it?


No, it's because it's decompiling from _start downwards. From main downwards it's actually very straightforward.

You can also see that GCC did strength reduction of printf("thing\n") to puts("thing").
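Concretely, the rewrite being referred to (both gcc and clang do it when the format string contains no conversion specifiers and ends in a newline; the exact optimization level it kicks in at is a detail I'd hedge on):

    #include <stdio.h>

    int main(void) {
        printf("Hello, world!\n");   /* nothing to format, trailing newline */
        return 0;
    }
    /* The optimised binary calls puts("Hello, world!") instead; puts appends
     * the newline itself, which is why the '\n' vanishes from the string
     * constant that the decompiler recovers. */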


Pretty impressive! I haven't used any disassembly tools in years, and I only remember that the last time I did, I found them useless. Not sure whether that was due to my lack of understanding, the generated output, or a combination of both. This thing, however: I fed it an OpenGL test app which doesn't do much but still has hundreds of lines of modern C++ spread over different libraries, and I could clearly recognize lots of my functions in it and follow some program flow starting from main. Still hard, but at least I didn't feel completely lost like years ago.


This is extremely useful for analysis. Even if you understand x86 ASM, it allows you to quickly jump around a lot more efficiently than you otherwise would.

It won't, for me, recompile back into the source application. So that is a limitation, but even with that limitation it is extremely useful (and the fact that it maps the C/C++ back to the ASM makes altering the ASM directly trivial).


This and also the fact it's available as an IDA plugin.


I’m wondering whether one could use machine learning and C/C++ code from GitHub to find reasonable variable names automatically.


I wonder if a statistical machine translation approach could be usefully applied here. Get tons of source from GitHub, compile it with every compiler available, and train on the result. It would be challenging to compile automatically at scale, to align the code with the source, and to get a source representation invariant to identifiers, but it should be doable.


It's both harder and easier. The transformation is mechanical, without the untranslatable or only approximately translatable idioms of natural language. On the other hand, the dependency chain is much more complex - with something like link-time optimization, a change to one part of the code can completely change the result (for instance, if it suddenly allows inlining of a function everywhere). There is also the problem of, if not "idiom translation", "idiom generation" - people write code in a particular style that may not be captured by the generated output, even if it compiles the same.

Targeting something like Clang specifically, where you have access not only to the assembler & a potential source, but also a whole AST & intermediate data structures, would be pretty interesting.


I'm brand new to C, but wouldn't this from the hello world example always eval true?

    if (__JCR_END__ == 0 || 1) { return;


It does always evaluate to true. I honestly can't figure out why it's there; I've been googling what the 'frame_dummy' function is supposed to do, and the only information I've found is something about 'setting up the exception frame'. All the code in that function does, though, is force a seg-fault if that test code you posted fails, so I'm not sure what it accomplishes.


If __JCR_END__ were always a boolean, yes.


Am I missing something? '==' has a higher precedence than '||', so it should always evaluate as true.
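A minimal check of the precedence claim (nothing here is specific to the decompiled binary):

    #include <stdio.h>

    int main(void) {
        int x = 42;
        if (x == 0 || 1)        /* parses as (x == 0) || 1, never x == (0 || 1) */
            puts("always taken");
        return 0;
    }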


I think the catch he's getting at is that || imposes an ordering: the left side is checked before the right side. This also implies that any side effects of the left side have to happen before the right side is evaluated.

That said, I still don't know how you could get this code generated. If you make an equivalent piece of code with __JCR_END__ as a volatile int, you still get an infinite loop which has the mov op for reading the __JCR_END__ value, but it doesn't bother to test it. I.e. gcc still reads the variable but optimizes the loop to a while (1). I can't think of any way to trick gcc into generating asm like this.
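The experiment probably looked something like this (a hypothetical reconstruction, not the actual crt code):

    /* volatile keeps the read observable, but `|| 1` keeps the condition constant */
    volatile int __JCR_END__;

    void frame_dummy_like(void) {
        while (__JCR_END__ == 0 || 1) {
            /* gcc still emits the load of __JCR_END__ on every iteration,
             * but drops the compare-and-branch, i.e. effectively while (1). */
        }
    }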


This will totally fail for optimised code if it is just using object code without debug information. There is no information in the resulting machine code that can indicate whether some code has been inlined or not. Basically any optimisations performed by the compiler will throw this decompilation off.

I question whether you can get any real use out of this...


I don't think the market for this is to get the actual original code. It's more like understanding what a particular program does: when you see it on a higher level it's much easier to understand the code than reading raw assembly.


Absolutely... I get that, but it's only functional for non-optimised code. My point is that for anything non-trivial, it's not going to be terribly useful. You're still gonna need to understand what really is going on; optimisers mangle the code out of all recognisability for this decompiler.


Unless the assembly is using clever tricks like code rewriting, it is always possible to at the very least decompile into some form of pseudo-code.

Just giving symbolic names to memory addresses and replacing assembly opcodes with more meaningful instructions can work wonders when trying to understand some code.


I disagree... inlining will remove all evidence that a function call existed. Loop unrolling will (potentially) remove all evidence that a loop existed. Those are basic optimisations; compilers will transform the code out of all recognisability, to the point where a decompiler isn't worthwhile.
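For instance (hedged: whether this particular call really gets inlined depends on the compiler and flags, though at -O2 it almost always will):

    /* helper is small, static and called once: a prime inlining candidate */
    static int clamp(int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    int scale(int v) {
        return clamp(v, 0, 255) * 4;
    }
    /* After inlining, the binary contains a single function whose comparisons
     * and multiply are fused together (often branchless); nothing marks where
     * clamp began or ended, so the decompiler has no call to recover. */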


Why only Windows? I couldn't get it.


There's an enormous community of people who spend all their time worrying about the contents or behavior of Windows binaries. I've met some of them through my work, like malware analysts who deal with malware that's part of phishing attacks. The phishers will often prefer to create Windows-only attacks because Windows has such a commanding market share lead among most populations of phishing targets; in turn, that's what the people trying to defend against or mitigate those attacks will study. To folks in that sort of field, "binary" is virtually synonymous with "Windows binary"!

I guess also historically most of the tools for creating, modifying, and examining binaries for a given platform have been native to that platform, rather than cross tools. That's surely because most people (with the exception of embedded developers) do much more native development than cross development. I can get a small number of packages on my Linux machine that will deal with Windows executables in some relatively shallow way, but I have tons of programs already installed that do complicated and specific things to Linux ELF binaries even though I don't typically use those programs on a day-to-day basis.


"The standalone decompiler runs fine under Wine."


You can use Hopper on OS X: http://www.hopperapp.com

Or try and fix up Boomerang on other OSes, I suppose.


I'm kind of disappointed that there isn't a version available for IDA 5.0. Yeah, I'm cheap.



