X86 Addressing Under the Hood (bone.id.au)
87 points by ScottWRobinson on Sept 26, 2018 | 7 comments



x86 instruction encoding is best viewed in octal --- not just the ModRM, but the primary opcode too:

http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt

You can mentally assemble/disassemble the bulk of the commonly encountered instructions by memorising a few tables (in octal), the addressing modes being one of them. In 16-bit the memory addressing modes can be described as "one or more of {displacement}{BX,BP}{SI,DI}" and 32-bit "one or more of {displacement}{register}{scaled register}" with (e)BP as a special case.
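To make the octal point concrete, here is a minimal sketch of the 2-3-3 bit split that ModRM and SIB bytes share. The helper name `split_modrm` is made up for illustration; the example byte is the second byte of `01 D8` (`add eax, ebx`).

```python
def split_modrm(byte):
    """Split a ModRM (or SIB) byte into its 2-3-3 bit fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    mod = (byte >> 6) & 0o3  # top two bits: addressing mode
    reg = (byte >> 3) & 0o7  # middle three bits: register or opcode extension
    rm = byte & 0o7          # low three bits: register or memory operand
    return mod, reg, rm

# 32-bit register names in encoding order (index = register number)
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

modrm = 0xD8  # ModRM byte of `01 D8`, i.e. add eax, ebx
mod, reg, rm = split_modrm(modrm)
print(oct(modrm), (mod, reg, rm), REG32[rm], REG32[reg])
```

Written in octal the byte is 330, so each digit is one field: mod 3 (register direct), reg 3 (ebx), rm 0 (eax).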


Good homework is to implement an x86-64 mod/rm encoder.

There are quite a few corner cases. Sometimes you'll need to widen input parameters, because some combinations can't use an 8-bit offset and require a 32-bit value instead. RSP can't be used as a scaled index, etc.

There's also plenty of redundancy: https://www.strchr.com/machine_code_redundancy
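For anyone attempting that homework, a rough sketch of the corner cases might look like the following. It is an illustration under assumptions, not a complete encoder: the function name `encode_mem` and its parameters are invented here, it only handles `[base + index*scale + disp]` with a base register present, and it leaves REX prefixes, RIP-relative and base-less forms to the caller.

```python
import struct

def encode_mem(reg, base, disp=0, index=None, scale=1):
    """Return ModRM [+ SIB] [+ displacement] bytes for [base + index*scale + disp].

    Registers are numbers 0-15 (rax=0, rcx=1, rdx=2, rbx=3, rsp=4, rbp=5, ...).
    Only the low three bits are encoded; REX.R/X/B are out of scope here.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    if index == 4:
        # index field 100 without REX.X means "no index", so rsp cannot be scaled
        raise ValueError("rsp cannot be used as an index register")
    if not -2**31 <= disp < 2**31:
        raise ValueError("displacement must fit in 32 bits")

    # Pick the shortest displacement form. With mod=00, a base field of 101
    # does not mean rbp/r13, so those bases need an explicit zero disp8.
    if disp == 0 and (base & 7) != 5:
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, struct.pack("<b", disp)
    else:
        mod, disp_bytes = 0b10, struct.pack("<i", disp)

    # rm=100 is the SIB escape, so rsp/r12 as base always need a SIB byte.
    need_sib = index is not None or (base & 7) == 4
    if not need_sib:
        return bytes([mod << 6 | (reg & 7) << 3 | (base & 7)]) + disp_bytes

    ss = {1: 0, 2: 1, 4: 2, 8: 3}[scale]
    idx = 0b100 if index is None else index & 7  # 100 = no index
    modrm = mod << 6 | (reg & 7) << 3 | 0b100
    sib = ss << 6 | idx << 3 | (base & 7)
    return bytes([modrm, sib]) + disp_bytes

# lea rsi, [rsi + rsi*2] is 48 8D 34 76; this prints the 34 76 part
print(encode_mem(6, 6, index=6, scale=2).hex())
```

The three special cases it covers are the forced SIB byte for an rsp/r12 base, the forced zero disp8 for an rbp/r13 base, and the rejection of rsp as an index.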


I once had to multiply the `rsi` register by 3 so you bet I used `imul rsi, 3`, right? Nope. This does the exact same thing with an unknown or small performance advantage:

lea rsi, [rsi + rsi*2]


Definitely an advantage because AFAIK that doesn't go through the ALU but instead uses the dedicated address calculation unit, which is going to be faster than a multiply instruction.

A related lea trick is combining a shift, add, add (or subtract) immediate, and mov into a single instruction that doesn't modify flags, e.g.

    lea eax, [1234+ebx+ecx*4]
performs the equivalent of:

    eax = ebx + ecx*4 + 1234
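As a sanity check on the arithmetic only (not on timing), here is a small model of what a 32-bit lea computes, including wraparound. The function `lea32` and its keyword names are hypothetical, chosen just for this sketch.

```python
MASK32 = 0xFFFFFFFF

def lea32(base=0, index=0, scale=1, disp=0):
    """Model the value a 32-bit lea produces: base + index*scale + disp, mod 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32

def times3(x):
    """Multiply by 3 the way `lea r, [r + r*2]` does (32-bit version)."""
    return lea32(base=x, index=x, scale=2)

ebx, ecx = 100, 7
# lea eax, [1234+ebx+ecx*4]
print(lea32(base=ebx, index=ecx, scale=4, disp=1234))
print(times3(5))
```

The result matches the wrapped 32-bit product that a multiply by 3 would give, which is why the two are interchangeable as long as the flags are not needed.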


Not anymore, at least on modern Intel - lea instructions go through the ALU like other ALU instructions.

It's still an advantage over mul in this case, because it is a so-called "two-component lea" (just a base reg and an index reg), so it has 1 cycle latency and can execute at a throughput of 2 per cycle, versus 3 cycles latency and 1 per cycle for mul.


That is a complex rather than simple LEA instruction. There's a difference for Haswell and above.

Being complex (3 operands), it takes 3 cycles rather than 1 cycle for fast LEAs (1 or 2 operands).

It is also restricted to only the slow path port, port 1, instead of using ports 1 or 5 for fast path LEAs. ADD has 4 ports.

So Intel says in Section 3.5.1.2 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.


The problem with CISC is that it gives assembler users a near-infinite pit of clever tricks with which to consume their time. Though for what it's worth, on RV*C you might use:

    c.mv t1, t0
    c.slli t0, 1
    c.add t0, t1
or something like that, if you were trying to be clever (though it may not gain you anything). It'd give you two whole instruction bytes to do what you were going to do with t0 before you hit another alignment boundary, or maybe it's a0 and your last instruction is a c.jr x1!

Ambition is the mother of depravity.



