X86 Addressing Under the Hood

userbinator · on Sept 27, 2018

x86 instruction encoding is best viewed in octal, not just the ModRM byte but the primary opcode too:

http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt

You can mentally assemble/disassemble the bulk of the commonly encountered instructions by memorising a few tables (in octal), the addressing modes being one of them. In 16-bit mode the memory addressing modes can be described as "one or more of {displacement}{BX,BP}{SI,DI}", and in 32-bit mode as "one or more of {displacement}{register}{scaled register}", with (e)BP as a special case.
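As a rough illustration of why octal fits, here is a minimal sketch that splits a ModRM byte into its 2-3-3 bit fields, which are exactly its three octal digits. The helper names (`octal_fields`, `describe_modrm`) are made up for this example, and it only covers 32-bit addressing without decoding the SIB byte or the displacement.

```python
# Register numbering used by the reg and r/m fields in 32-bit mode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def octal_fields(byte):
    """Split a byte into (mod, reg, rm): 2 bits, 3 bits, 3 bits."""
    return (byte >> 6) & 0o3, (byte >> 3) & 0o7, byte & 0o7

def describe_modrm(byte):
    """Describe a ModRM byte for 32-bit addressing (hypothetical helper)."""
    mod, reg, rm = octal_fields(byte)
    if mod == 3:
        operand = REG32[rm]              # register-direct operand
    elif rm == 4:
        operand = "[SIB follows]"        # esp slot is reused as the SIB escape
    elif mod == 0 and rm == 5:
        operand = "[disp32]"             # the (e)BP special case: no base register
    else:
        disp = {0: "", 1: "+disp8", 2: "+disp32"}[mod]
        operand = f"[{REG32[rm]}{disp}]"
    return f"{byte:03o}: reg={REG32[reg]}, r/m={operand}"

# mov eax, ebx assembles to 89 D8, i.e. 211 330 in octal.
print(describe_modrm(0o330))
print(describe_modrm(0o105))
```

Read in octal, the first digit is the mode, the second the register, and the third the r/m slot, so the special cases show up as the digits 4 and 5 in the last position.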

vardump · on Sept 27, 2018

Good homework is to implement an x86-64 mod/rm encoder.

There are quite a few corner cases. Sometimes you need to widen an input, because some combinations can't use an 8-bit offset and require a 32-bit one instead. RSP can't be used as a scaled index, etc.
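To give a flavour of those corner cases, here is a minimal sketch of an encoder for the simplest shape only, `[base + disp]` with no index register. The function name `encode_mem` and its return convention (the low REX bits plus the ModRM/SIB/displacement bytes) are assumptions made for this example; a real encoder would also handle index registers, RIP-relative addressing, and the opcode and REX prefix themselves.

```python
import struct

def encode_mem(reg, base, disp=0):
    """Encode a [base + disp] operand; reg and base are register numbers 0..15.

    Returns (rex_bits, encoded): rex_bits holds REX.R (bit 2) and REX.B (bit 0),
    encoded holds ModRM, an optional SIB byte, and the displacement.
    """
    rex_bits = ((reg >> 3) << 2) | (base >> 3)
    reg3, base3 = reg & 7, base & 7

    # rbp/r13 (low bits 101) with mod=00 means "no base register",
    # so a zero displacement still has to be emitted as a disp8 of 0.
    if disp == 0 and base3 != 5:
        mod, tail = 0, b""
    elif -128 <= disp <= 127:
        mod, tail = 1, struct.pack("<b", disp)
    else:
        mod, tail = 2, struct.pack("<i", disp)

    out = bytes([(mod << 6) | (reg3 << 3) | base3])

    # rsp/r12 (low bits 100) in the r/m field means "SIB follows",
    # so emit a SIB byte with scale=1, no index, and the same base.
    if base3 == 4:
        out += bytes([(0 << 6) | (4 << 3) | 4])

    return rex_bits, out + tail

print(encode_mem(0, 4))       # [rsp] needs a SIB byte
print(encode_mem(0, 5))       # [rbp] needs a zero disp8
print(encode_mem(0, 12, 8))   # [r12+8] needs REX.B and a SIB byte
```

Even this restricted case needs two special paths, one for the rbp/r13 slot and one for the rsp/r12 slot, which is where most of the fiddliness comes from.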

There's also plenty of redundancy: https://www.strchr.com/machine_code_redundancy

dsamarin · on Sept 27, 2018

I once had to multiply the `rsi` register by 3 so you bet I used `imul rsi, 3`, right? Nope. This does the exact same thing with an unknown or small performance advantage:

    lea rsi, [rsi + rsi*2]

userbinator · on Sept 27, 2018

Definitely an advantage because AFAIK that doesn't go through the ALU but instead uses the dedicated address calculation unit, which is going to be faster than a multiply instruction.

A related lea trick is combining a shift, add, add (or subtract) immediate, and mov into a single instruction that doesn't modify flags, i.e.

    lea eax, [1234+ebx+ecx*4]

performs the equivalent of:

    eax = ebx + ecx*4 + 1234
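For anyone who wants to check the arithmetic, here is a small sketch that models what lea computes: base + index*scale + displacement, truncated to the destination width, with no flags involved. The function `lea` and its keyword arguments are invented for this example and are not any real assembler API.

```python
def lea(base=0, index=0, scale=1, disp=0, bits=64):
    """Model the value lea writes to its destination register."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only allows scale factors of 1, 2, 4 or 8")
    # The result wraps around at the destination register width.
    return (base + index * scale + disp) & ((1 << bits) - 1)

# lea rsi, [rsi + rsi*2] with rsi = 7
print(lea(base=7, index=7, scale=2))                       # 21
# lea eax, [1234+ebx+ecx*4] with ebx = 10, ecx = 3
print(lea(base=10, index=3, scale=4, disp=1234, bits=32))  # 1256
```

Since the scale is limited to 1, 2, 4 or 8, using the same register as base and index gives multiplication by 2, 3, 5 or 9 in one instruction.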

BeeOnRope · on Sept 27, 2018

Not anymore, at least on modern Intel - lea instructions go through the ALU like other ALU instructions.

It's still an advantage over the multiply in this case, because it is a so-called "two component lea" (base reg and index reg only), so it has 1 cycle of latency and can execute at a throughput of 2 per cycle, versus 3 cycles of latency and 1 per cycle for imul.

CalChris · on Sept 27, 2018

That is a complex rather than simple LEA instruction. There's a difference for Haswell and above.

Being complex (3 address components), it has 3 cycles of latency rather than the 1 cycle of fast LEAs (1 or 2 components).

It is also restricted to the slow path port, port 1, whereas fast path LEAs can use port 1 or port 5. ADD can issue on 4 ports.

So Intel says in Section 3.5.1.2 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.
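To make the fast/slow distinction concrete, here is a rough sketch that counts address components and reports the figures quoted in this thread: 1 cycle on ports 1 or 5 for one or two components, 3 cycles on port 1 for three. The function names are made up, and the numbers are the commenters' claims for recent Intel cores, not measurements; real hardware has further cases this ignores.

```python
def lea_components(base=None, index=None, disp=0):
    """Count how many of base, index and displacement are present."""
    return (base is not None) + (index is not None) + (disp != 0)

def lea_cost(base=None, index=None, disp=0):
    """Return the latency and ports claimed in the thread (an assumption)."""
    if lea_components(base, index, disp) <= 2:
        return {"latency": 1, "ports": (1, 5)}   # fast path
    return {"latency": 3, "ports": (1,)}         # slow path

print(lea_cost(base="rsi", index="rsi"))             # lea rsi, [rsi + rsi*2]
print(lea_cost(base="ebx", index="ecx", disp=1234))  # lea eax, [1234+ebx+ecx*4]
```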

microcolonel · on Sept 27, 2018

The problem with CISC is that it gives assembler users a near-infinite pit of clever tricks with which to consume their time. Though for what it's worth, on RV*C you might use:

    c.mv t1, t0
    c.slli t0, 1
    c.add t0, t1

or something like that, if you were trying to be clever (though it may not gain you anything). It'd give you two whole instruction bytes to do what you were going to do with t0 before you hit another alignment boundary, or maybe it's a0 and your last instruction is a c.jr x1!

Ambition is the mother of depravity.