
An x86 disassembler isn't that hard (I wrote one in the '80s and still use it every day); it's just a boatload of special cases.

https://www.digitalmars.com/ctg/obj2asm.html




To explain in more detail:

The x86 machine code format is decently simple to decode. It has a few opcode maps (one for one-byte opcodes, another for two-byte opcodes, and two more for three-byte opcodes). Each opcode needs to note which register bank the reg field encodes, which one the r/m field encodes, and which of the two maps to the first argument and which to the second. It also needs to note whether it uses the ModR/M byte in the first place and the size of the immediate it takes (if any).
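A minimal sketch of what such a per-opcode table entry could look like, assuming 32-bit operand size and register-to-register forms only. The names `OpcodeEntry`, `ONE_BYTE_MAP`, and `decode` are made up for illustration and are not taken from any real disassembler.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical register banks; a real table would also carry 8/16/64-bit,
# segment, x87, MMX and XMM banks.
BANKS = {
    "gpr32": ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"],
}

@dataclass(frozen=True)
class OpcodeEntry:
    mnemonic: str
    has_modrm: bool            # does a ModR/M byte follow the opcode?
    reg_bank: Optional[str]    # bank selected by the ModR/M reg field
    rm_bank: Optional[str]     # bank selected by the ModR/M r/m field
    reg_is_first: bool         # True if the reg field is the first operand
    imm_size: int              # immediate size in bytes, 0 if none

# A tiny slice of the one-byte opcode map.
ONE_BYTE_MAP = {
    0x01: OpcodeEntry("add", True, "gpr32", "gpr32", False, 0),  # ADD r/m32, r32
    0x03: OpcodeEntry("add", True, "gpr32", "gpr32", True, 0),   # ADD r32, r/m32
    0x6A: OpcodeEntry("push", False, None, None, False, 1),      # PUSH imm8
}

def decode(code: bytes) -> str:
    """Decode one instruction; memory operands are left out of this sketch."""
    entry = ONE_BYTE_MAP[code[0]]
    pos = 1
    operands = []
    if entry.has_modrm:
        modrm = code[pos]
        pos += 1
        mod, reg, rm = (modrm >> 6) & 3, (modrm >> 3) & 7, modrm & 7
        if mod != 3:
            raise NotImplementedError("memory operands omitted from this sketch")
        reg_name = BANKS[entry.reg_bank][reg]
        rm_name = BANKS[entry.rm_bank][rm]
        operands = [reg_name, rm_name] if entry.reg_is_first else [rm_name, reg_name]
    if entry.imm_size:
        imm = int.from_bytes(code[pos:pos + entry.imm_size], "little", signed=True)
        operands.append(hex(imm))
    return f"{entry.mnemonic} {', '.join(operands)}"

print(decode(bytes([0x01, 0xD8])))  # ModR/M D8: mod=3, reg=3 (ebx), r/m=0 (eax)
print(decode(bytes([0x03, 0xD8])))  # same ModR/M byte, operand order flipped
```

The two ADD entries differ only in `reg_is_first`, which is the "which one maps to the first and second argument" part of the table.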

There are two complications to this basic scheme. The first is that some prefixes are "mandatory prefixes" (66h/F2h/F3h being the most common) for some instructions as opposed to simple modifications. The second is that some instructions take a ModR/M byte but use the reg field of that byte to distinguish instructions rather than to name a register. For example, 0FAE esp, [mem] (reg field 4) would actually be XSAVE [mem] while 0FAE ebp, [mem] (reg field 5) would be XRSTOR [mem].
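To make the second complication concrete, here is a rough sketch of how the 0F AE group could be decoded from the reg field. It covers only the memory forms without a mandatory prefix; the function name `decode_0fae` and the table name are invented for this example.

```python
# Memory forms of 0F AE (no mandatory prefix), indexed by the ModR/M reg field.
GROUP_0FAE_MEM = {
    0: "fxsave",
    1: "fxrstor",
    2: "ldmxcsr",
    3: "stmxcsr",
    4: "xsave",     # reg=4 is where a naive decoder would print "esp"
    5: "xrstor",    # reg=5 is where a naive decoder would print "ebp"
    6: "xsaveopt",
    7: "clflush",
}

def decode_0fae(modrm: int) -> str:
    """Return the mnemonic selected by the ModR/M byte following 0F AE."""
    mod = (modrm >> 6) & 3
    reg = (modrm >> 3) & 7
    if mod == 3:
        # Register forms of this group encode other instructions (fences etc.);
        # they are left out of this sketch.
        raise NotImplementedError("register forms omitted from this sketch")
    return GROUP_0FAE_MEM[reg]

print(decode_0fae(0x20))  # mod=0, reg=4, r/m=0
print(decode_0fae(0x28))  # mod=0, reg=5, r/m=0
```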

Personally, I've considered building a kind of "world's worst x86 decoder" where ignoring the latter considerations is actually a feature of the decoder.


> The first is that some prefixes are "mandatory prefixes"

If you look at the structure of the encoding you'll see that the 66 prefix, which is traditionally known as the "operand size prefix", is quite sensibly used to select between different register widths, e.g. the wider (128-bit) SSE/SSE2 registers as opposed to the 64-bit MMX ones for the same instruction.

https://www.sandpile.org/x86/opc_2.htm has a nice summary of this.
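As a small illustration of that reading, PADDB is 0F FC for the MMX form and 66 0F FC for the SSE2 form, so the 66 prefix can be treated as picking the register bank. This is only a sketch for register-to-register operands, and `decode_paddb` is a name invented here.

```python
def decode_paddb(code: bytes) -> str:
    """Decode PADDB reg, reg, letting the 66 prefix pick the register bank."""
    has_66 = code[0] == 0x66
    body = code[1:] if has_66 else code
    if body[:2] != b"\x0f\xfc":
        raise ValueError("not a PADDB encoding")
    modrm = body[2]
    if (modrm >> 6) != 3:
        raise NotImplementedError("memory operands omitted from this sketch")
    # 66 selects the 128-bit XMM bank; no prefix selects the 64-bit MMX bank.
    bank = "xmm" if has_66 else "mm"
    reg = (modrm >> 3) & 7
    rm = modrm & 7
    return f"paddb {bank}{reg}, {bank}{rm}"

print(decode_paddb(bytes.fromhex("0ffcc1")))    # MMX form
print(decode_paddb(bytes.fromhex("660ffcc1")))  # SSE2 form
```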


In my "world's worst x86 decoder" idea, the prefixes (save possibly the segment register prefix) become part of the instruction opcode, along with the REX.W and VEX.L bits. A nice side effect of this definition is that it makes operations on differently-sized registers different opcodes entirely, which isn't necessarily a bad idea.
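A rough sketch of how such a flattened lookup might work, assuming 64-bit mode and ignoring VEX, segment prefixes, and three-byte opcodes. The key layout (legacy prefixes, REX.W, opcode bytes) and the names `flat_key`, `FLAT_MAP`, and `lookup` are my own guesses at the idea, not an existing design.

```python
def flat_key(code: bytes):
    """Fold 66/F2/F3 and REX.W into the lookup key alongside the opcode bytes."""
    pos = 0
    legacy = []
    while code[pos] in (0x66, 0xF2, 0xF3):
        legacy.append(code[pos])
        pos += 1
    rex_w = False
    if 0x40 <= code[pos] <= 0x4F:       # REX prefix (64-bit mode assumed)
        rex_w = bool(code[pos] & 0x08)  # only the W bit becomes part of the key
        pos += 1
    length = 2 if code[pos] == 0x0F else 1
    opcode = bytes(code[pos:pos + length])
    return (tuple(sorted(legacy)), rex_w, opcode)

# Every prefix combination is simply a separate table entry.
FLAT_MAP = {
    ((), False, b"\x01"): "add r/m32, r32",
    ((0x66,), False, b"\x01"): "add r/m16, r16",
    ((), True, b"\x01"): "add r/m64, r64",
    ((), False, b"\x0f\x10"): "movups xmm, xmm/m128",
    ((0x66,), False, b"\x0f\x10"): "movupd xmm, xmm/m128",
    ((0xF3,), False, b"\x0f\x10"): "movss xmm, xmm/m32",
    ((0xF2,), False, b"\x0f\x10"): "movsd xmm, xmm/m64",
}

def lookup(code: bytes) -> str:
    return FLAT_MAP[flat_key(code)]

print(lookup(bytes.fromhex("01d8")))
print(lookup(bytes.fromhex("6601d8")))
print(lookup(bytes.fromhex("4801d8")))
print(lookup(bytes.fromhex("f30f10c1")))
```

With this layout there is no distinction between a "mandatory" prefix and an operand-size prefix: both just select a different row in the table.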


When you get to the edges of the ISA, finding flaws in disassemblers is common enough that you often need to use more than one for RE-ing.

Christopher Domas, for example, used 5, I think, when he was fuzzing CPUs.


I found doing the new prefixes for 64-bit to be a pain--so much so, that I basically rewrote mine from scratch.



