Hacker News
A native code to C/C++ decompiler (derevenets.com)
137 points by fla on Oct 8, 2014 | 54 comments



No source? The first thing I did was try it on itself... which I suppose is somewhat of an "acid test". It took a few minutes and an enormous amount of memory, but finally it told me that the function at the very beginning of the executable, definitely a nontrivial one, decompiles to...

    void fun_401000() {
    }
I'm sure I hit upon some edge case and much better output can be had from this tool if I play with it some more, but for a first impression, not so good. But I'm definitely going to keep this one around, it looks promising.


For all we know, that result is a complete facsimile of the original source. But then again, we don't have the source.


The original and still the most powerful disassembler is IDA Pro. The project was started in the 90s and has been used for security analysis, antivirus work, protection analysis/research, and hacks, as well as for normal dev work in closed-source ecosystems.

https://www.hex-rays.com/products/ida/index.shtml

The author has implemented a decompiler plugin on top of IDA, and it works on real-world code. The point here is to annotate the disassembly bottom-up and then decompile.

https://www.hex-rays.com/products/decompiler/index.shtml

I don't want to bash the author of Snowman - this kind of research is serious fun. Yet, IDA has an insane lead.


IDA is great, but for day-to-day work, especially if you're casual, Hopper is a really strong contender, and it's (amazingly) even cheaper than IDA. Hopper also was designed from the get-go to have a first-class Python interface, and it includes a workable decompiler.


IDA was not the first one. There have been a few others since the 80's.


AFAIK it was the first one that got the model right.

The only other one I remember is Sourcer (sr) for MS-DOS.


Sourcerer iirc. Purple floppy.


It's also expensive. I'm sure it's worth it from all I've heard, but it's unlikely I'll ever have the money for hobby work.


For what IDA does, it's incredibly inexpensive, so much so that it's distorted the market for reverse engineering tools. Consider that people who use IDA on a day-to-day basis have $250/hr+ bill rates, and if they use IDA, they rely on it. Meanwhile, the set of people who use IDA on a day-to-day basis is very small relative to the whole industry.

I'm not saying you should buy IDA, just that I think IDA is severely mispriced.


I hear you. For the sake of completeness, here is the other side of the argument, from students who use IDA for reverse engineering. Those activities are really about cracking freemium/shareware apps, and the associated subculture is a little... well... special. I have heard countless times that IDA should cost $200 and the author should work with the community to improve the tool...

My stance is that the tool is very specialized and unique, and the cost is reasonable for professional use cases.


I know I'm repeating myself here, but I want to make sure I communicate this:

IDA's price is so low that it actually harms the market for professional reverse engineering tools. Most useful products you can build --- tracers, visualizers, emulators, pattern matchers, debuggers --- fit into IDA's orbit. As products, as "feature/function/benefit" statements, they are subsets of IDA. But they're chained down by IDA's price. Just like IDA, they have to serve a small market of users who make tens of thousands of dollars per week using the tools, but the market optics make it hard to charge even a significant fraction of the (low) total cost of IDA.

It's sort of hilarious to me to see what Hopper is doing to the market. "Ruining it entirely" wouldn't be far from the truth. I'm only sort of complaining. Viscerally, I'm thrilled that Hopper exists.


Who is making tens of thousands per week? Outside of selling exploits to government agencies or cybercrime. Or is that what you meant?

AV jobs pay shit.


The software security bill rate for people who are competent with IDA and can find bugs black-box with it exceeds $3k/day. Source: until Friday, I'm a principal at a very large software security consulting firm.

That's for projects denominated in billable days. Talented specialists do even better, on fixed-price projects for specific targets. Rates get higher for cryptographic work, as well.

I am not talking about selling vulnerabilities. I have never sold a vulnerability to anyone, nor have I (to my knowledge) done any software security work for any division of the USG or any other government, nor would I.

Don't work in AV. There are worse things about AV than the pay scale.


Right, both products are expensive. I get them through work (for work). You can get the free (and very limited) version of IDA for hobby stuff:

https://www.hex-rays.com/products/ida/support/download_freew...


Unfortunately, support for e.g. ARM or x86-64 is missing from the free version (while I have no problem with the ban on commercial use - if I really made money with it, of course I'd pay).


It's unlikely they'd let you license it for hobby work; last I heard they're very restrictive about who they're willing to license it to in order to prevent pirate copies leaking.


Definitely not true. I know of someone who licensed it personally for a hobby, without a corporation.


C and C++ are different languages. I only saw C examples. How would you even decompile to C++? The only C++ information you have in the object code is the mangled names. How do you use that to get C++ code?
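To make the question concrete, here is roughly how much a mangled name gives you (a hand-written sketch assuming the Itanium C++ ABI used by g++/clang on 64-bit Linux; MSVC's scheme differs, and none of this is output from the tool):

    // sketch.cpp -- hypothetical class, just to show what survives in the symbols
    #include <cstddef>
    namespace net {
        struct Socket {
            int send(const char*, std::size_t);   // declared here, defined elsewhere
        };
    }
    int call(net::Socket& s) { return s.send("hi", 2); }
    // `g++ -c sketch.cpp && nm sketch.o` shows an undefined reference to
    //     _ZN3net6Socket4sendEPKcm
    // and `c++filt` turns that back into
    //     net::Socket::send(char const*, unsigned long)
    // Namespaces, class names, method names and parameter types can be read
    // straight out of the symbols; return types, parameter names and function
    // bodies cannot.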


And vtables, rtti data, systematic sequences of ops for invoking base classes, intrinsic/library functions emitted by the compiler in specific situations, exception handling tables, ...


But none of those is something only a C++ compiler would specifically do, right? How would you distinguish a vtable emitted by a C++ compiler from one hand-rolled in C? I suppose you could just offer a reasonable guess. I wonder if decompiled GTK+ code would end up turning into C++.


Most C++ platform ABIs are pretty trivial to recognize. A tool like this could distinguish a C++ binary by looking at its initialization/housekeeping sections, static constructors, __cxa_atexit, exception handling tables, linked libraries, etc.

For example, the Itanium C++ ABI (http://mentorembedded.github.io/cxx-abi/, perhaps Itanium's only real legacy) adopted by Linux/ELF and other platforms, leaves a huge amount of fingerprints on a binary. It'd take a very conscious effort, including hand-crafting linker scripts, to generate a C binary that a tool would incorrectly think was C++.
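As a concrete illustration of those fingerprints (a hedged sketch; the symbol names assume g++ targeting Linux/ELF):

    // fingerprint.cpp -- even one static object drags in Itanium ABI machinery
    #include <cstdio>
    struct Logger {
        Logger()  { std::puts("ctor"); }
        ~Logger() { std::puts("dtor"); }
    };
    static Logger global_logger;   // static storage duration, nontrivial ctor/dtor
    int main() { return 0; }
    // g++ emits a hidden initializer (run via .init_array) that constructs
    // global_logger and registers ~Logger with __cxa_atexit(&dtor, &obj,
    // &__dso_handle). Add a virtual function or an exception and you also pick
    // up _ZTV*/_ZTI* vtable/RTTI symbols and __gxx_personality_v0 - things a
    // plain C build essentially never produces.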


As some other decompilers do, by having a knowledge pool of specific compilers.

Back in the 90's there was one targeted specifically to executables produced by Borland compilers.


The first C++ compiler (Cfront) compiled from C++ to C.

Once you get some C code that does things with structures and function pointers like a C++ compiler would do, I think it's not impossible to turn those back into classes if you can recognise the patterns that a C++ compiler uses to compile C++ constructs like classes, virtual functions, etc.
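Roughly the kind of pattern a decompiler would have to recognise, sketched by hand (illustrative C, not actual Cfront or compiler output):

    /* C++ original:
     *     struct Shape { virtual double area() const; };
     *     double f(const Shape* s) { return s->area(); }
     * Plausible C lowering: */
    struct Shape;
    struct Shape_vtbl { double (*area)(const struct Shape*); };
    struct Shape      { const struct Shape_vtbl* vptr; };

    double f(const struct Shape* s) {
        /* virtual dispatch: load the vptr, pick slot 0, call through it */
        return s->vptr->area(s);
    }

Spotting that shape (a struct whose first member points to a table of function pointers, with every call passing the object itself as the first argument) is exactly the kind of pattern recognition involved.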


It's actually a lot harder than C. There is a roughly 1:1 correspondence between C and assembly. There is no such 1:1 correspondence between assembly and C++.


> There is a roughly 1:1 correspondence between C and assembly.

I'm afraid that hasn't been true for a number of years. Turn on the optimizer and the resulting assembly can be totally unrecognizable as a translation of what you actually wrote. The compiler takes all kinds of steps both to eliminate redundant operations (via CSE, loop unrolling, etc.) and to take advantage of the weird quirks of modern CPU architectures, like out-of-order execution.
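A tiny example of the size of that gap (hedged: the exact output depends on the compiler version and flags):

    /* What the programmer wrote: */
    long sum_to(long n) {
        long s = 0;
        for (long i = 0; i < n; ++i)
            s += i;
        return s;
    }
    /* At -O2, clang is known to replace the whole loop with a closed-form
     * n*(n-1)/2 computation, and gcc typically unrolls and/or vectorizes it,
     * so the "C" you get back out of a decompiler shows multiplies and shifts
     * (or SIMD shuffles) where the source had a three-line loop. */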


Just yesterday I was debugging a piece of code I was writing. I started with a printf where I suspected a segfault was occurring.

  printf(...);
  for (...) {...}
The printf never executed. So I fired up gdb and confirmed the segfault was happening where I thought it was, but the for loop was being initialized before the printf. Even with debugging on and optimization off. It could be a bug in clang (I didn't think to try it with gcc), but even so, it shows that the way things should work in theory doesn't necessarily dictate the way they do work in practice.


There is a roughly 1:1 correspondence between C++ and C, so it's just one more step to go from C to C++. Just off the top of my head...

    class data fields -> structure members
    non-static methods (ctors/dtors included) -> functions with an extra 'this' pointer parameter
    inheritance -> extended structures with matching prefixes
    virtual functions -> structure with function pointers
    operator overloading -> methods with special names
    default arguments -> automatically inserted by compiler
    function overloading -> types encoded in name of function
    templates -> whatever code they generate
    exceptions -> special library functions are used, e.g. _CxxThrowException()
Some of those don't decompile so well (e.g. template-generated code, although a refactoring tool might be of help), and things like operators and overloaded functions are probably more of a stylistic choice than a difference in the generated code, but for the most part it doesn't look impossible to get some C++ features to show up in decompiled code. (And given that many of the C++-level optimisations revolve around removing unnecessary code, they could make the C to C++ step even easier.)
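For the 'extra this pointer' and default-argument rows above, the mechanical picture is roughly this (a hand-written sketch with made-up names, not real compiler output):

    /* C++ original:
     *     struct Buf { int put(int c, int retries = 3); };
     *     int g(Buf* b) { return b->put('x'); }
     * Approximate C equivalent the decompiler actually sees: */
    struct Buf { int fill; /* whatever data members exist */ };

    int Buf_put(struct Buf* this_, int c, int retries);   /* a mangled name in practice */

    int g(struct Buf* b) {
        return Buf_put(b, 'x', 3);   /* default argument materialised at the call site */
    }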


Interesting that gcc removes the \n from the string and calls puts() directly - this avoids the overhead of parsing the string for non-existent format specifiers.

The decompiler could do with a bit of work making dynamic library imports more symbolic. Following the puts call chain, you quickly disappear into a non-local jump to an address with no further references.


Another native code decompiler, although apparently abandoned long ago: http://boomerang.sourceforge.net


Thanks for sharing, I did not know about it. Of course, making everyone on this thread aware of IDA's Hex-Rays too: http://www.hex-rays.com/products/ida/


That "hello world" decompilation is complex!

[EDIT: Very informative replies below, thanks!]


It's because of the #include <stdio.h>, isn't it?


No, it's because it's decompiling from _start downwards. From main downwards it's actually very straightforward.

You can also see that GCC did strength reduction of printf("thing\n") to puts("thing").
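Concretely, the rewrite being referred to (both gcc and clang do it when the format string contains no conversion specifiers and ends in a newline; the exact optimization level it kicks in at is a detail I'd hedge on):

    #include <stdio.h>

    int main(void) {
        printf("Hello, world!\n");   /* nothing to format, trailing newline */
        return 0;
    }
    /* The optimised binary calls puts("Hello, world!") instead; puts appends
     * the newline itself, which is why the '\n' vanishes from the string
     * constant that the decompiler recovers. */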


Pretty impressive! I haven't used any disassembly tools in years, and I only remember that the last time I did, I found them useless. Not sure whether that was due to my lack of understanding, the generated output, or a combination of both. This thing, however: I fed it an OpenGL test app which doesn't do much but still has hundreds of lines of modern C++ spread over different libraries, and I could clearly recognize lots of my functions in it and follow some program flow starting from main. Still hard, but at least I didn't feel completely lost like years ago.


This is extremely useful for analysis. Even if you understand x86 ASM, it allows you to quickly jump around a lot more efficiently than you otherwise would.

It won't, for me, recompile back into the source application. So that is a limitation, but even with that limitation it is extremely useful (and the fact that it maps the C/C++ back to the ASM makes altering the ASM directly trivial).


This and also the fact it's available as an IDA plugin.


I’m wondering whether one could use machine learning and C/C++ code from GitHub to find reasonable variable names automatically.


I wonder if a statistical machine translation approach could be usefully applied here. Get tons of source from GitHub, compile it with every compiler available, and train on the result. It would be challenging to compile automatically at scale, to align the code with the source, and to get a source representation invariant to identifiers, but it should be doable.


It's both harder and easier. The transformation is mechanical, without the untranslatable or only approximately translatable idioms of natural language. On the other hand, the dependency chain is much more complex - with something like link-time optimization, a change to one part of the code can completely change the result (for instance, if it suddenly allows inlining of a function everywhere). There is also the problem of, if not "idiom translation", "idiom generation" - people write code in a particular style that may not be captured by the generated output, even if it compiles the same.

Targeting something like Clang specifically, where you have access not only to the assembler & a potential source, but also a whole AST & intermediate data structures, would be pretty interesting.


I'm brand new to C, but wouldn't this from the hello world example always eval true?

    if (__JCR_END__ == 0 || 1) { return;


It does always evaluate to true. I honestly can't figure out why it's there; I've been googling what the 'frame_dummy' function is supposed to do, and the only information I've found is something about 'setting up the exception frame'. All the code in that function does, though, is force a seg-fault if that test code you posted fails, so I'm not sure what it accomplishes.


If __JCR_END__ were always a boolean, yes.


Am I missing something? '==' has a higher precedence than '||', so it should always evaluate as true.
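A minimal check of the precedence claim (nothing here is specific to the decompiled binary):

    #include <stdio.h>

    int main(void) {
        int x = 42;
        if (x == 0 || 1)        /* parses as (x == 0) || 1, never x == (0 || 1) */
            puts("always taken");
        return 0;
    }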


I think the catch he's getting at is that || imposes an ordering: the left side is checked before the right side. This also implies that any side effects of the left side have to happen before the right side is evaluated.

That said, I still don't know how you could get this code generated. If you make an equivalent piece of code with __JCR_END__ as a volatile int, you still get an infinite loop which has the mov op for reading the __JCR_END__ value, but it doesn't bother to test it. I.e. gcc still reads the variable but optimizes the loop to a while (1). I can't think of any way to trick gcc into generating asm like this.
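The experiment probably looked something like this (a hypothetical reconstruction, not the actual crt code):

    /* volatile keeps the read observable, but `|| 1` keeps the condition constant */
    volatile int __JCR_END__;

    void frame_dummy_like(void) {
        while (__JCR_END__ == 0 || 1) {
            /* gcc still emits the load of __JCR_END__ on every iteration,
             * but drops the compare-and-branch, i.e. effectively while (1). */
        }
    }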


This will totally fail for optimised code if it is just using object code without debug information. There is no information in the resulting machine code that can indicate whether some code has been inlined or not. Basically any optimisations performed by the compiler will throw this decompilation off.

I question whether you can get any real use out of this...


I don't think the market for this is to get the actual original code. It's more like understanding what a particular program does: when you see it on a higher level it's much easier to understand the code than reading raw assembly.


Absolutely... I get that, but it's only functional for non-optimised code. My point is that for anything non-trivial, it's not going to be terribly useful. You're still gonna need to understand what really is going on; optimisers mangle the code out of all recognisability for this decompiler.


Unless the assembly is using clever tricks like code rewriting, it is always possible to at the very least decompile into some form of pseudo-code.

Just giving symbolic names to memory addresses and replacing assembly opcodes with more meaningful instructions can work wonders when trying to understand some code.


I disagree... inlining will remove all evidence that a function call existed. Loop unrolling will (potentially) remove all evidence that a loop existed. Those are basic optimisations; compilers will transform the code out of all recognisability, to the point where a decompiler isn't worthwhile.
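For instance (hedged: whether this particular call really gets inlined depends on the compiler and flags, though at -O2 it almost always will):

    /* helper is small, static and called once: a prime inlining candidate */
    static int clamp(int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    int scale(int v) {
        return clamp(v, 0, 255) * 4;
    }
    /* After inlining, the binary contains a single function whose comparisons
     * and multiply are fused together (often branchless); nothing marks where
     * clamp began or ended, so the decompiler has no call to recover. */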


Why only Windows? I couldn't get it.


There's an enormous community of people who spend all their time worrying about the contents or behavior of Windows binaries. I've met some of them through my work, like malware analysts who deal with malware that's part of phishing attacks. The phishers will often prefer to create Windows-only attacks because Windows has such a commanding market share lead among most populations of phishing targets; in turn, that's what the people trying to defend against or mitigate those attacks will study. To folks in that sort of field, "binary" is virtually synonymous with "Windows binary"!

I guess also historically most of the tools for creating, modifying, and examining binaries for a given platform have been native to that platform, rather than cross tools. That's surely because most people (with the exception of embedded developers) do much more native development than cross development. I can get a small number of packages on my Linux machine that will deal with Windows executables in some relatively shallow way, but I have tons of programs already installed that do complicated and specific things to Linux ELF binaries even though I don't typically use those programs on a day-to-day basis.


"The standalone decompiler runs fine under Wine."


You can use Hopper on OS X: http://www.hopperapp.com

Or try and fix up Boomerang on other OSes, I suppose.


I'm kind of disappointed that there isn't a version available for IDA 5.0. Yeah, I'm cheap.



