Dogbolt Decompiler Explorer (dogbolt.org)
449 points by ingve 6 months ago | 89 comments



Can I just say, thanks to the person who posted this for waiting until this week to do so. (Side note: I suspect it was due to the recent coverage from C++ Weekly which is a great resource: https://www.youtube.com/watch?v=h3F0Fw0R7ME)

As recently as last week we had some horrible performance problems but it looks like the queue (https://dogbolt.org/queue) is mostly still fine! Other than the long pole of a few of the decompilers being backed up, things are humming along quite smoothly! Josh + Glenn have done some great work on it! (https://github.com/decompiler-explorer/decompiler-explorer/c...)


Related:

Decompiler Explorer - https://news.ycombinator.com/item?id=32079227 - July 2022 (82 comments)


Wow, I really could have used this for my Ph.D. research (deep learning for obfuscated code).

I ditched Ghidra in my experiments in favor of angr early on because Ghidra did not play nicely with multiprocessing and I had a lot of data to process. Well maybe it does but it was much easier for me to achieve the same thing with angr.

Love the name! Although I feel compelled to point out that Compiler Explorer is the name of the project and Godbolt is its author's last name, but I suppose if people are to the point of using Godbolt as a verb the ship has sailed.


We know! Similarly, the GH repo is actually the Decompiler Explorer:

https://github.com/decompiler-explorer/decompiler-explorer/


I like the name, it's cute and a nice homage.


Has there been any good progress in deobfuscating/decompiling machine code using Machine Learning techniques?


Short answer: not where it counts.

My work focuses on recognizing known functions in obfuscated binaries, but there are some papers you might want to check out related to deobfuscation, if not necessarily using ML for deobfuscation or decompilation.

My take is that ML can soundly defeat the "easy" and more static obfuscation types (encodings, control flow flattening, splitting functions). It's low hanging fruit, and it's what I worked on most, but adoption is slow. On the other hand, "hard" obfuscations like virtualized functions or programs which embed JIT compilers to obfuscate at runtime... as far as I know, those are still unsolved problems.
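For anyone unfamiliar with control flow flattening, here is a minimal hand-written sketch of the idea. It is not output from Tigress or any other real obfuscator, and the function names and state numbers are made up for illustration. The same routine is shown in its natural form and with its basic blocks moved into a single dispatcher loop driven by a state variable.

```python
def gcd_plain(a: int, b: int) -> int:
    """Euclid's algorithm with ordinary structured control flow."""
    while b != 0:
        a, b = b, a % b
    return a


def gcd_flattened(a: int, b: int) -> int:
    """Same algorithm, flattened by hand: each basic block becomes a case
    in one dispatcher loop, and `state` selects which block runs next."""
    state = 0
    while True:
        if state == 0:      # block 0: loop condition
            state = 1 if b != 0 else 2
        elif state == 1:    # block 1: loop body
            a, b = b, a % b
            state = 0
        else:               # block 2: exit
            return a


print(gcd_plain(48, 18), gcd_flattened(48, 18))
```

Both versions compute the same result; only the shape of the control flow graph changes, which is why this kind of transformation counts as "static".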

This is a good overview of the subject, but pretty old and doesn't cover "hard" obfuscations: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1566145.

https://www.jinyier.me/papers/DATE19_Obf.pdf uses deobfuscation for RTL logic (FPGA/ASIC domain) with SAT solvers. Might be useful for a point of view from a fairly different domain.

https://advising.cs.arizona.edu/~debray/Publications/generic... uses "semantics-preserving transformations" to shed obfuscation. I think this approach is the way to go, especially when combined with dynamic/symbolic analysis to mitigate virt/jit types of transformations.

I'll mention this one as a cautionary tale: https://dl.acm.org/doi/pdf/10.1145/2886012 has some good general info but glosses over the machine learning approach. It considers Hex-Rays' FLIRT to be "machine learning", but FLIRT just hashes signatures, can be spoofed (e.g. https://siliconpr0n.org/uv/issues_with_flirt_aware_malware.p...), and is useless against obfuscation.
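To show why exact byte signatures are brittle, here is a toy sketch. It is not FLIRT's real signature format or matching algorithm; `signature`, the prefix length, and the sample function are all invented for illustration. It hashes the leading bytes of a function and shows that one inserted NOP is enough to miss the lookup.

```python
import hashlib


def signature(code: bytes, prefix_len: int = 32) -> str:
    """Hash the first `prefix_len` bytes of a function body (toy scheme)."""
    return hashlib.sha256(code[:prefix_len]).hexdigest()


# x86-64: push rbp; mov rbp, rsp; mov eax, edi; pop rbp; ret
original = bytes.fromhex("554889e589f85dc3")
# Same behavior, with a one-byte NOP (0x90) inserted after the prologue.
padded = bytes.fromhex("554889e59089f85dc3")

# A hypothetical signature database mapping hashes to function names.
known = {signature(original): "identity"}

print(known.get(signature(original), "unknown"))
print(known.get(signature(padded), "unknown"))
```

The second lookup fails even though the two byte strings behave identically, which is the basic reason hash-style matching does not survive even mild obfuscation.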

Eventually I think SBOM tools like Black Duck[1] and SLSA[2] will incorporate ML to improve the accuracy of even figuring out what dependencies a piece of software actually has.

[1]: https://www.synopsys.com/software-integrity/software-composi...

[2]: https://slsa.dev/


Very cool - thank you very much!

> My take is that ML can soundly defeat the "easy" and more static obfuscation types (encodings, control flow flattening, splitting functions). It's low hanging fruit, and it's what I worked on most, but adoption is slow.

If I wanted to implement my own toy HexRays-like decompiler using a few of these techniques to decompile x86-64 binaries is there any high quality up-to-date paper/resource you would recommend?

Or do you think that "A Generic Approach to Automatic Deobfuscation of Executable Code" paper is a good enough start?

Also, what do you think about https://tigress.wtf/ ?


"A Generic Approach" seems like a good starting point for a classical approach: building a set of reusable components and heuristics to recognize idioms, etc.

Might also be worth considering an approach integrating LLMs for summarizing code. Maybe you could fine-tune a pretrained model that already "understands" source code to associate sources with generated code? If going this route I would still probably use a disassembler to preprocess, and maybe also extract basic blocks to use as my "target" domain for fine-tuning.

As for Tigress, I used it extensively and found it to be really great most of the time. There are some limitations to be aware of: it only works with C code, and you have to turn your multi-file projects into a single file with a main() function. Also, its C parser (CIL) has some limitations (e.g. doesn't recognize the word static in "struct foo x[static 1]") so you might need to translate your C code first. I translated manually because it was a really rare issue for the code I started with. I also had mixed results using Virtualize and JIT. Sometimes they would emit invalid code, so I ended up just throwing out that data.

In my view, the up-and-coming Tigress challenger is obfuscator-llvm. I think it is very promising for future work because it inherently supports more languages than only C. But currently obfuscator-llvm is much more limited (~3 transformations compared to ~48). So if you're using C, today I would pick Tigress.


Thanks again! :)


Sometimes we must look back in angr


That better be a Bowie reference and not an Oasis reference.


John Osborne actually. But never Oasis ;0


I really wish there were a similar tool for exploring binary lifting to different IRs. Like Ghidra p-code with sleigh, LLVM Machine IR, Qemu TCG etc


IRs generally aren't suited to a human examining small snippets when you're starting with a full binary. I would imagine something like that would only work well for very small bits of assembly. Relatedly, you might be interested in BNIL, which is an entire stack of ILs that Binary Ninja is based on. (You can see it exposed in the cloud.binary.ninja UI or the demo)


Qemu works by translating a binary to an IR then doing stuff with it. Valgrind likewise. There's an optimiser called bolt (associated with facebook) which has the same idea.


Yup, I'm aware of both of those, but none of the tools listed so far intend their IR to be human-consumable, unlike disassemblers and decompilers. You think disassembly is verbose compared to a decompiler? Go look at the equivalent Vex (Valgrind's IR) for any non-trivial disassembly. It's suuuper verbose.

As far as I know, BNIL (https://docs.binary.ninja/dev/bnil-overview.html) is the only one that is designed to be readable and it still wouldn't make sense to include it in an IL comparison such as the one done here for decompilation in my opinion.


Speaking of decompilers, would Binary Ninja be a safe bet to pick? I've been told IDA is the gold standard, but it's also expensive for someone who wants to recreationally reverse engineer.


Binja decompiler is more-or-less fine. It's not as mature as IDA or Ghidra but it's not a bad decompiler.

Though for me the big selling point on Binja is the Intermediate Languages (ILs). High-level IL is the decompiler but you also get Low-level and Medium-level ILs as steps between assembly and source. If the decompiler is a bit funky you can look at the ILs to get a better idea of what is happening. The ILs are also just much nicer to read than plain assembly so I tend to use them a lot.

It's a feature that isn't really matched on any other platform. Ghidra and IDA both have a single IL that is more machine readable compared to Binja's human-readable ones.


IDA Free has essentially all the features of Pro nowadays, if you're only looking to do x86_64 on Windows/Linux.

https://hex-rays.com/ida-free/

The only thing you lose out on is Python scripting, which is kind of big, but for a free tool you really can't complain.

You probably want to use both IDA and Ghidra since they have different strengths/weaknesses and community plugins.


Honestly just use Ghidra. It has its quirks but it's pretty good. And open source. If it's good enough for the NSA it's probably good enough for recreational use.


If Ghidra is made by NSA, does it mean that it can have backdoors for non-US users?


The code is open source and has been looked at by several people over the years. It would be quite hard for the NSA to sneak in a backdoor but it is never out of the question. However, the risk is so extremely minuscule when compared to other alternatives since they are not even open source.


Very nice. A parallel, I've been working on an emulator project recently, implementing my own disassembler, and I keep thinking about how I would turn patterns of machine code into a generalized form, which could then be turned into something like C-like pseudo-code, so it's been really compelling me lately to implement my own toy decompiler


BinaryNinja does this. They have several layers of intermediate representations[1], which they build their decompiler on top of. Ghidra does something similar with their PCode. They disassemble to PCode and then decompile the PCode[2].

[1] https://docs.binary.ninja/dev/bnil-overview.html

[2] https://riverloopsecurity.com/blog/2019/05/pcode/ (an example)
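As a rough idea of what "lift to an IL, then decompile the IL" can look like for a toy decompiler, here is a minimal sketch. It is not BNIL or PCode, and nothing here is an API of Binary Ninja or Ghidra; the instruction tuples, `lift`, `fold`, and `render` are all hypothetical. It only handles straight-line code with `mov`, `add`, and `imul`.

```python
# Map toy mnemonics to the operator used in the IL expression tree.
OPS = {"add": "+", "imul": "*"}


def lift(insns):
    """Lower (mnemonic, dst, src) tuples to IL statements of the form
    (dst, expr), where expr is an operand string or (op, left, right)."""
    il = []
    for mnemonic, dst, src in insns:
        if mnemonic == "mov":
            il.append((dst, src))
        elif mnemonic in OPS:
            il.append((dst, (OPS[mnemonic], dst, src)))
        else:
            raise ValueError(f"unsupported instruction: {mnemonic}")
    return il


def substitute(expr, env):
    """Replace registers in expr with their currently known definitions."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return (op, substitute(left, env), substitute(right, env))
    return env.get(expr, expr)


def fold(il):
    """Forward-substitute definitions so each register maps to one tree."""
    env = {}
    for dst, expr in il:
        env[dst] = substitute(expr, env)
    return env


def render(expr):
    """Print an expression tree as C-like pseudo-code."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return f"({render(left)} {op} {render(right)})"
    return expr


program = [("mov", "eax", "edi"), ("imul", "eax", "esi"), ("add", "eax", "4")]
print("return " + render(fold(lift(program))["eax"]) + ";")
```

A real lifter also has to model flags, memory, and control flow, which is where most of the work goes; this only shows the layering.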


Thanks for sharing!


Any good and thorough decompiler tutorials for non-expert users?


There seems to be demand!

- Anyone care to give some pointers?


Now take the output of dogbolt and feed into godbolt.


Machine translation, for machine code.

Theoretically, a fixed point should be reached.


And reinforcement-train an LLM to reconstruct the original code...


That would be dogebolt


Love this - I can almost imagine the convincing for other companies wasn't even needed when they realized a small binary size and comparison to competitors would net them more business. A perfect little solution for triaging issues between services and comparing solutions.


That was indeed the logic. The two main commercial solutions included, Binary Ninja (made by Vector 35, where I'm one of the founders) and Hex-Rays, both pay for all the hosting costs. And it's not particularly cheap -- there's a fair amount of compute to drive the decompilers especially as some of them are... not very efficient.


I wish I saw this when it was posted last year. This is awesome and really convenient.


HexRays online? Is that allowed?


From the FAQ, Hex-Rays actually sponsors the project:

> Vector 35 and Hex-Rays jointly sponsor the hosting on Digital Ocean as a community service.


It makes sense, it's a perfect advertisement of their superiority.


Indeed, looking at the samples HexRays really did a great job compared to the others, much more readable code.


When this first came out a year(ish?) ago, I remember seeing somewhere that they had received permission from Hexrays/Ilfak Guilfanov.


Not anymore!

angrily writes a letter to his congressman who won't understand a word of it


Your congressman doesn’t yet have hexrays to decompile your letter


From what I can tell in observation, they don't parse English either.


His brain is relegated to spewing out the Matrix unparsed as he receives it. He gets none of the blondes, brunettes or redheads.


OMG I am so happy

Of note: HexRays is not only cleaner, but right now their queue is mostly empty while others are backed up.


Binary Ninja likewise is empty and keeps up just fine as well. It's not a coincidence that the two commercial products that are funding it are both confident enough to put their stuff online like this.

And it's no conspiracy theory or intentional sandbagging, you can see the implementation: https://github.com/decompiler-explorer/decompiler-explorer

and if anyone can improve the other tools' performance, we'd be happy to accept it. We reached out to the Ghidra devs: https://github.com/NationalSecurityAgency/ghidra/issues/5228 but they didn't have any silver bullets for us either.


Is there a similar project for javascript? That is, de-obfuscating large javascript codebases?


> de-obfuscating large javascript codebases

Impossible, sadly.


> All submitted binaries are saved and made available to any of the authors of the tools used so they may improve their decompilers. If you're such an author who would like access, let us know!.

oof


Good that this is clearly mentioned up-front on their site.


If you believe that content you submit to websites is not examined by interested parties associated with that website, then - I have a bridge to sell you... or perhaps I should say a Google account to give you, free of charge.


Compare this policy to godbolt’s policy:

> In short: your source code is stored in plaintext for the minimum time feasible to be able to process your request. After that, it is discarded and is inaccessible. In very rare cases your code may be kept for a little longer (at most a week) to help debug issues in Compiler Explorer.


My bias may be showing, being a ctf-scene enthusiast. Most of these (tools on dogbolt) look like foss utilities you can run yourself. The rest, I'd imagine you are welcome to pay for licenses. Binary Ninja in particular, while maybe not cheap for everybody, isn't sky-high.


While it is possible they throw it all away:

1. If a third-party does their link-shortening, which gets the program text, then - it doesn't matter how nice they are. And if that party is Google then, well...

2. The language you quoted still allows them to keep effectively all information through mining aspects of it rather than keeping the entire code as a stretch of plain text.

3. If GodBolt or its servers are subject to US law, then there might be National Security Letters which compel it to pass information on to the US government, and keep that secret. And this is not a conspiracy theory, this is what Snowden has exposed about Google, Apple, Microsoft, Yahoo etc.

So - I respect and like the GodBolt'ers, but you don't have a good guarantee of your data being kept private.


Pretty sure links work basically forever


I think they changed it recently, but all of the code you submit is embedded in the URL. (after an anchor) So, it's stored by google's link shortening service, but is resubmitted to the site every time you load it.
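A minimal sketch of what "the code is embedded in the URL after an anchor" can mean in practice. This is not Compiler Explorer's actual link format; the base URL, the compression, and the function names are assumptions chosen for illustration. One relevant fact is that browsers do not include the fragment (the part after `#`) in the HTTP request, so a page using this approach has to read it client-side and submit the code itself.

```python
import base64
import zlib
from urllib.parse import urlsplit

BASE_URL = "https://example.invalid/"  # placeholder, not the real site


def encode_link(source: str) -> str:
    """Pack source text into the URL fragment (hypothetical scheme)."""
    packed = base64.urlsafe_b64encode(zlib.compress(source.encode("utf-8")))
    return BASE_URL + "#" + packed.decode("ascii")


def decode_link(url: str) -> str:
    """Recover the source text from a link made by encode_link."""
    fragment = urlsplit(url).fragment
    return zlib.decompress(base64.urlsafe_b64decode(fragment)).decode("utf-8")


link = encode_link("int main() { return 0; }\n")
print(link)
print(decode_link(link))
```

With a scheme like this, a link shortener ends up holding the full program text, because the program is the URL.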


Sweet, free file hosting


Yep. Remember that that means you are not allowed to submit any binaries for which you don't have the license to redistribute.


They make it very clear. If you don't notice that before uploading some private binaries, that's on you.


so like vscode?


I threw some 16 bit files at it, all of them puked in one way or another.

One was a CP/M-86 "small model" executable, the other an object file (16 bit intel OMF object file) - i.e. compiler output.

Boomerang looked like it'd have most chance of getting somewhere since it mentioned having a DOS .EXE analyser.

I'm surprised that "Hex Rays" (i.e. IDA-Pro) got nowhere...


The name of this is a reference to the incredibly useful godbolt compiler explorer. If you are interested in this you will likely enjoy the other as well:

https://godbolt.org/


and for those who don't know it, that one is named after the author, Matt Godbolt.

I thought for a long time it was some joke I wasn't getting related to deities smithing people.


> deities smithing people.

That's "deities smiting people.", but I really like the idea of deities smithing people :)


The Dwarven god in DnD is so good at crafting he can literally make new souls in his forge. :)


This happens in the Norse myths.


There's a joke about Adam and Eve in here somewhere. Genesis 2 for reference.


Sculpty terracotta would be a fitting choice. It's pretty easy to sculpt when kneaded, bakes in a traditional oven, keeps its details. Perfect for silicone mold making.


> bakes in a traditional oven

Now that reminds me of a verse from a song I heard on the radio as a teenager:

  Had a meeting with my maker
  The superhuman baker
  He popped me in the oven
  And set the dial to lovin'


Damn. To just name something your last name.

I thought it was the sibling part to the Jesus Nut. https://en.wikipedia.org/wiki/Jesus_nut


It's never been called anything but either "GCC Explorer" or "Compiler Explorer", by me, anyway... The URL it's accessible for is an accident of the one I had hanging around :) (it's now available at compiler-explorer.com too, but...the name other people use has stuck so I'll never be able to reclaim my own domain...)


I think you _could_ reclaim your own domain if you wanted. You'd want to have a banner at the top with a clear note directing people to the new domain for the compiler explorer, so that people realize immediately that you're not domain squatting. A few people might put up a stink, but I'm pretty confident that most people wouldn't mind, especially since the tool itself is so useful. The name, for those who don't know it as your last name, is fun, but it isn't the reason people use the tool. Eventually, over enough time, people would start remembering the new URL, and you could shrink or remove the banner (and/or put a note elsewhere on the page).


Honestly "godbolt" is so memorable I can find it instantly even though I rarely use it; but "compiler-explorer" sounds like some generic SEO spam site that I'd probably never click on.


Even then, the internet (and even books) is full of "godbolt" links, to the tool itself and to specific code samples. It will take quite some time until all of those become irrelevant.

As a data point: Search on stack overflow yields "500" hits. https://stackoverflow.com/search?q=godbolt


Links to specific examples are less of a problem as he could redirect those to compiler-explorer.com and just keep that redirect up forever. Really the only URL that would need to be "reclaimed" is https://godbolt.org/ and having a prominent link to compiler-explorer.com there would solve that issue.

OTOH, the godbolt domain is at least not actively used under a number of other TLDs, so getting one of those might be an easier option.


It’s such a memorable name for a tool like that. Other than losing your domain name to the topic, how do you feel about the de facto name?

To a far far lesser degree, I’ve experienced many examples of “you named it X but everyone at work calls it Y and now you have to live with that.” It used to really irk me for some reason.


It is a fantastic name for an otherwise fantastic tool. The day I found out it was your last name, it made me chuckle and I liked it even more. And since I am here, thank you very much for it!

I always call it the compiler explorer but the url, as a sibling comment says, is memorable.


Could be misremembering, but IIRC it was called Compiler Explorer and used to live only on a subdomain of godbolt.org. But, it was so useful that it became presumably vastly higher traffic than the personal homepage part and people often referred to it as just "Godbolt" probably because it sounds cooler and is shorter than saying "Compiler Explorer" (and it may not be obvious the domain name is a last name rather than just a cool name for something.)


Now that’s a pretty cool origin story for a name. What a compliment!


To be fair it's an amazing last name and it feels like there probably is a story, it just has to do with this guy's ancestors rather than the assembler tool we all know and love.


There's also RMSbolt, which is a Compiler explorer for Emacs, where Richard M. Stallman is regarded as the "creator".


It makes for a nice parallel, since the original version of godbolt was just a split tmux session with vim running on one side, and "watch 'gcc -S -o /dev/stdout'" on the other. The main advantage of putting it online is not needing all of the compilers locally.
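For anyone who wants the local equivalent without tmux, here is a small sketch that wraps the same idea in Python. It assumes `gcc` is installed and on the PATH; the function names and the default `-O2` are my own choices, and it passes `-o -` (write to standard output) in place of `-o /dev/stdout`.

```python
import subprocess
from typing import List


def asm_command(source_path: str, opt_level: str = "-O2") -> List[str]:
    """Build the gcc command line that emits assembly to standard output."""
    return ["gcc", "-S", opt_level, "-o", "-", source_path]


def compile_to_asm(source_path: str, opt_level: str = "-O2") -> str:
    """Run gcc and return the generated assembly as text.

    Raises subprocess.CalledProcessError if compilation fails.
    """
    result = subprocess.run(
        asm_command(source_path, opt_level),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `compile_to_asm("example.c")` on a file of your own returns the assembly text; rerunning it whenever the file changes gives roughly the `watch` behavior described above.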


> Richard M. Stallman

That's St IGNUcius to you.

[0] https://stallman.org/saint.html


It might also be a bit of a portmanteau with a second reference to dogpile.com which was a pre-Google "search engine" that compiled search results from multiple search engines. Back in the day you often had to separately search altavista.com, lycos.com, askjeeves.com, yahoo.com, etc. because some of them would work for your query but others would not and it was difficult to predict the performance of any particular search engine, but usually at least one of them would have the result you wanted/needed.

Dogpile was an automated way to search all of the search engines at the same time with one query.

https://web.archive.org/web/19990429194414/http://dogpile.co...


Look no further than https://dogbolt.org/faq

> It's meant to be the reverse of the amazing Compiler Explorer.

With a link to https://godbolt.org/

It’s very obvious that Dogbolt Decompiler Explorer is primarily named after Godbolt Compiler Explorer.


I do remember dogpile, but as one of the folks who named it, nope, that wasn't a conscious influence!


Oh, it you! Hi Jordan I miss you let’s hang out sometime :)


Yes, let's! And before hacker summer camp when we're way way too busy! :-)


nice



