The x86 architecture is the weirdo (2004) (microsoft.com)
122 points by signa11 on April 19, 2022 | 81 comments



Let's see how things have changed since 2004 when this was published:

> The x86 has a small number (8) of general-purpose registers

x86-64 added more general-purpose registers.

> The x86 uses the stack to pass function parameters; the others use registers.

OS vendors switched to registers for x86-64.

> The x86 forgives access to unaligned data, silently fixing up the misalignment.

Now ubiquitous on application processors.

> The x86 has variable-sized instructions. The others use fixed-sized instructions.

ARM introduced Thumb-2, with a mix of 2-byte and 4-byte instructions, in 2003. PowerPC and RISC-V also added some form of variable-length instruction support. On the other hand, ARM turned around and dropped variable-length instructions with its 64-bit architecture released in 2011.

> The x86 has a strict memory model … The others have weak memory models

Still x86-only.

> The x86 supports atomic load-modify-store operations. None of the others do.

As opposed to load-linked/store-conditional, which is a different way to express the same basic idea? Or is he claiming that other processors didn't support any form of atomic instructions, which definitely isn't true?

At any rate, ARM previously had load-linked/store-conditional but recently added a native compare-and-swap instruction with ARMv8.1.

> The x86 passes function return addresses on the stack. The others use a link register.

Still x86-only.


Apple M1 supports optional x86-style memory event ordering, so that its x86 emulation could be made to work without penalty.

When SPARC got new microcode supporting unaligned access, it turned out to be a big performance win, as the alignment padding had made for a bigger cache footprint. That was an embarrassment for the whole RISC industry. Nobody today would field a chip that enforced alignment.

The alignment penalty might have been smaller back when clock rates were closer to memory latency, but caches were radically smaller then, too, so even more affected by inflated footprint.


> as the alignment padding had made for a bigger cache footprint

I argued with some of the Rust compiler team members the other day about wanting to ditch almost all alignment restrictions because of this exact thing. They laughed and basically told me I didn't know what I was talking about. I remember about 15 years ago, when I worked at a market-making firm, we tested this and it was a great gain; we started packing almost all our structs after that.

Now, at another MM shop, we're trying to push the same thing but having to fight these arguments again (the only alignments I want to keep are for AVX and hardware-accessed buffers).
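For a rough illustration of the kind of win involved, here's a Rust sketch; the tick struct and its field names are made up, but the sizes are what repr(C) and repr(C, packed) give on a typical 64-bit target:

    use std::mem::{align_of, size_of};

    // Natural alignment pads the struct out to a multiple of 8 because of the u64.
    #[repr(C)]
    struct TickAligned { ts_ns: u64, price: u32, qty: u32, side: u8 }   // 24 bytes

    // Packing removes the padding, so more records fit per cache line.
    #[repr(C, packed)]
    struct TickPacked  { ts_ns: u64, price: u32, qty: u32, side: u8 }   // 17 bytes

    fn main() {
        println!("aligned: {} bytes, align {}", size_of::<TickAligned>(), align_of::<TickAligned>());
        println!("packed:  {} bytes, align {}", size_of::<TickPacked>(),  align_of::<TickPacked>());
    }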


There are other things you need to take into account too - padding can make it more likely for a struct to divide evenly into cache lines, which can trigger false sharing. Changing the size of a struct from 128 bytes to 120 or 122 bytes will cause it to be misaligned on cache lines and reduce the impact of false sharing and that can significantly improve performance.

The last time I worked on a btree-based data store, changing the nodes from ~1024 bytes to ~1000 delivered something like a 10% throughput improvement. This was done by reducing the number of entries in each node, and not by changing padding or packing.


True. Another reason to avoid too much aligning is to help reduce reliance on N-way cache collision avoidance.

Caches on modern chips can keep some small fixed number of objects, often 4, in cache when their addresses fall at the same offset into a page, but performance may collapse if that number is exceeded. It is quite hard to tune to avoid this, but by making things not line up on power-of-two boundaries, we can at least avoid out-and-out inviting it.


FWIW, it's still better to lay out your critical structures carefully, so that padding isn't needed. That way, you win both the cache efficiency and the efficiencies for aligned accesses.
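Something like this, say, with invented field names: under #[repr(C)] declaration order matters, and putting the widest fields first leaves no interior padding, so nothing needs packing.

    use std::mem::size_of;

    #[repr(C)]
    struct Sloppy  { flag: u8, count: u32, id: u16, total: u64 } // interior padding -> 24 bytes
    #[repr(C)]
    struct Careful { total: u64, count: u32, id: u16, flag: u8 } // widest first -> 16 bytes, every field aligned

    fn main() {
        assert_eq!(size_of::<Sloppy>(), 24);
        assert_eq!(size_of::<Careful>(), 16);
    }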


One of the forms of 'premature optimization' that's often worth doing. Just align everything you can to the biggest power-of-two data size you can. Also, always use fixed-size data types, e.g. (modular) uint32 or (signed two's-complement) sint32 rather than int.


WebAssembly ditched alignment restrictions and we don't regret it. There is an alignment hint, but I am not aware of any engine that uses it.


Superstition is as powerful as it ever was.


It's definitely received wisdom that may once have been right and no longer is.

Most people are not used to facts having a half-life, but many facts do, or rather, much knowledge does.

We feel very secure in knowing what we know, and the reality is that we need to be willing to question a lot of things, like authority, including our very own. Now, we can't be questioning everything all the time because that way madness lies, but we can't never question anything we think we know either!

Epistemology is hard. I want a doll that says that when you pull the cord.


Sort of depends on the knowledge.

It's certainly true that in the tech industry things are CONSTANTLY shifting.

However, talk physics and you'll find that things rarely change, especially the physics that most college graduates learn.


There was a famous study about the half-lives of "facts" in different fields. They do seem to vary by field.


Is this superstition, or more received wisdom that may have been true at one point in the past and is now just orthodoxy?


Fifty bucks says it isn't even about performance, but is instead about passing pointers to C code. Zero-overhead FFI has killed a lot of radical performance improvements that Rust could have otherwise made.

I don't know, because nobody's actually posting a link to it.


This strikes me as likely. Bitwise compatibility with machine ABI layout rules has powerful compatibility advantages even in places where it might make code slower. (And, for the large majority of code, slower doesn't matter anyway.)

Of course C and C++ themselves have to keep to machine ABI layout rules for backward compatibility to code built when those rules were (still) thought a good idea. Compilers offer annotations to dictate packing for specified types, and the Rust compiler certainly also offers such a choice. So, maybe such annotations should just be used a lot more in Rust, C, and C++.

This is not unlike the need to write "const" everywhere in C and C++ because the inherited default (from before it existed) was arguably wrong. We just need to get used to ignoring the annotation clutter.

But there is no doubt there are lots of people who think padding to alignment boundaries is faster. And, there can be other reasons to align even more strictly than the machine ABI says, even knowing what it costs.


Rust structs have non-C layouts. You can optionally specify that a struct should be laid out in the same way that C does it.
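A quick sketch of the difference (the repr(Rust) size is what current rustc happens to do, not a guarantee):

    use std::mem::size_of;

    // Default layout: the compiler is free to reorder fields, and today it
    // sorts roughly by alignment, so this usually ends up at 16 bytes.
    struct Reordered { flag: u8, count: u32, id: u16, total: u64 }

    // #[repr(C)] pins the declared order for FFI, keeping whatever padding
    // that order implies (24 bytes here).
    #[repr(C)]
    struct CLayout { flag: u8, count: u32, id: u16, total: u64 }

    fn main() {
        println!("repr(Rust): {} bytes", size_of::<Reordered>()); // typically 16
        println!("repr(C):    {} bytes", size_of::<CLayout>());   // 24
    }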


Structs aren’t the problem here. It’s the primitives.

You can take a pointer to an i32 and pass that pointer to C code as int32_t. This means it has to have the same alignment in Rust that it has in C.


The topic at hand is, specifically, that nobody makes cores that enforce alignment restrictions anymore. So, it doesn't matter where the pointer goes. All that matters is if your compiler lays out its structs the same way as whoever compiled the code a pointer to one ends up in.

There are embedded targets that still enforce alignment restrictions, but you are even less likely to pass pointers between code compiled with different compilers, there.


> The topic at hand is, specifically, that nobody makes cores that enforce alignment restrictions anymore. So, it doesn't matter where the pointer goes.

Compilers can rely on alignment even if the CPU doesn't. LLVM does, which is why older versions of rustc had segfaults when repr(packed) used to allow taking references. While it would be pretty easy to get rustc to stop emitting aligned loads, getting Clang and GCC to stop emitting aligned loads might be trickier. https://github.com/rust-lang/rust/issues/27060
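For what it's worth, a sketch of the workaround rustc pushes you toward now: never form a reference to a packed field, only a raw pointer plus an explicitly unaligned load (the struct here is hypothetical).

    use std::ptr;

    #[repr(C, packed)]
    struct Header { tag: u8, length: u32 }

    fn read_length(h: &Header) -> u32 {
        // `&h.length` would create a possibly misaligned &u32, which the
        // backend may then load with an aligned instruction; newer rustc
        // rejects it. Take a raw pointer and do an unaligned read instead
        // (a plain by-value copy of the field also works).
        let p = ptr::addr_of!(h.length);
        unsafe { p.read_unaligned() }
    }

    fn main() {
        let h = Header { tag: 1, length: 0xDEADBEEF };
        assert_eq!(read_length(&h), 0xDEADBEEF);
    }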


People arguing against changing struct layout rules are probably mainly interested, then, in maintaining backward compatibility with older Rust, itself.

Anyway, same policy applies: annotate your structs "packed" anywhere performance matters and bitwise compatibility with other stuff doesn't.


Rust does not keep backward compatibility like that. In the absence of any statement forcing a specific ABI, the only guaranteed compatibility is with code that's part of the same build, or else part of the current Rust runtime and being linked into said build.


Even in Rust, mapping memory between processes, or via files between past and future processes, happens. Although structure-blasting is frowned upon in some circles, it is very fast where allowed.

But... when you are structure-blasting, you are probably also already paying generous attention to layout issues.


TSO has a performance cost; on M1 there is a 10-15% loss [1] from enabling TSO on native arm64 code (not emulated).

[1]: https://blog.yiningkarlli.com/2021/07/porting-takua-to-arm-p...


Yes, there are sound reasons for it to be optional. It is remarkable how little the penalty is, on M1 and on x86. Apparently it takes a really huge number of extra transistors in the cache system to keep the overhead tolerable.


TIL. I should have known this... Maybe I'll start packing my structs too.


> ARM introduced Thumb-2, with a mix of 2-byte and 4-byte instructions, in 2003. PowerPC and RISC-V also [...]

x86 is still the weirdo. Both Thumb-2 and the RISC-V C extension (I don't know about PowerPC) have only 2-byte and 4-byte instructions, aligned to 2 bytes; x86 instructions can vary from 1 to 15 bytes, with no alignment requirement.


Power10 has prefixed instructions. These are essentially 64-bit instructions in two pieces. They are odd, even (or especially) to those of us who have worked with the architecture for a long time, and not much else supports them yet. Their motivation is primarily to represent constants and offsets more efficiently.


I suspect variable-length instructions are a big gain because you get to pack instructions more tightly and so have fewer cache misses. Though, obviously, it's going to depend on having an instruction set that yields shorter text for typical assembly than fixed-sized instructions would. (In a way, opcodes need a bit of Huffman encoding!)

Any losses from having to do more decoding work are probably offset by having sufficiently deep pipelines and enough decoders.


The counterpoint is that variable-length decoding introduces sequential dependence in the decoding, i.e. you don’t know where instruction 2 starts until you’ve decoded instruction 1. This probably limits how many decoders you can have. If you know all your instructions are 4B you can basically decode as many as you want in parallel.


A larger problem is that they're bad for security; you can hide malicious instructions from static analysis by jumping into the middle of a cleaner one. Or use it to find more ROP gadgets, etc.

I can imagine ways to deal with this, but x86 doesn't have them.


I think what happens in practice is that the decoders still speculatively decode in parallel and then drop all misdecoded instructions. Easy when instructions come in only a couple of sizes; hard and wasteful for something like x86.


ARM Thumb actually licensed patents from Hitachi Super-H, who did this first.

Supposedly, "MIPS processors [also] have a MIPS-16 mode."

https://en.m.wikipedia.org/wiki/SuperH


I vaguely recall that LL/SC solves the ABA problem whereas load-modify-store does not.

It's been a while, so I'm going to define my understanding of the ABA problem in case I misunderstood it:

x86 only supplies cmpxchg instructions, which will update a value only if it matches the passed-in previous value. There's a class of concurrency bugs where the value is modified away from its initial value and then modified back to that value again. cmpxchg can't detect that condition, so where the difference matters, the 128-bit cmpxchg will often be used with a counter in the second 64 bits that is incremented on each write to catch this case.

LL/SC will trigger on any write, rather than comparing the value, providing the stronger guarantee.

(Please correct me if this is inaccurate; it's been a hot minute since I learned this and I'd love to be more current on it.)
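For illustration, a rough Rust sketch of the counter trick mentioned above, done with a 32-bit index plus a 32-bit generation packed into one AtomicU64 so a single compare_exchange covers both; the free-list plumbing is hand-waved:

    use std::sync::atomic::{AtomicU64, Ordering};

    // Pack a 32-bit slot index and a 32-bit generation counter into one word.
    fn pack(index: u32, gen: u32) -> u64 { ((gen as u64) << 32) | index as u64 }
    fn unpack(word: u64) -> (u32, u32) { (word as u32, (word >> 32) as u32) }

    const EMPTY: u32 = u32::MAX;

    // Pop the head of a lock-free free list. `next_of` looks up the successor
    // of a slot; how slots are stored is out of scope for the sketch.
    fn pop_head(head: &AtomicU64, next_of: impl Fn(u32) -> u32) -> Option<u32> {
        let mut old = head.load(Ordering::Acquire);
        loop {
            let (index, gen) = unpack(old);
            if index == EMPTY { return None; }
            let new = pack(next_of(index), gen.wrapping_add(1));
            // If another thread popped `index` and pushed it back in between
            // (the A-B-A case), the bumped generation makes this CAS fail.
            match head.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return Some(index),
                Err(seen) => old = seen,
            }
        }
    }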


AIUI, a cmpxchg loop is enough to implement read-modify-write of any atomically sized value. The ABA problem becomes relevant when trying to implement more complex lock-free data structures.
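E.g., any single-word read-modify-write falls out of a compare-exchange loop (this is essentially what std's fetch_update does); the saturating add below is a made-up example:

    use std::sync::atomic::{AtomicU64, Ordering};

    // Saturating add, which typically has no dedicated atomic instruction.
    fn saturating_fetch_add(counter: &AtomicU64, delta: u64) -> u64 {
        let mut current = counter.load(Ordering::Relaxed);
        loop {
            let new = current.saturating_add(delta);
            match counter.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(prev) => return prev,     // we won; return the old value
                Err(seen) => current = seen, // lost a race; retry with the fresh value
            }
        }
    }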


Thank you for writing this. I was going to cover quite a lot of these points and you have done it so very succinctly.

It may be obvious, but I think it bears repeating: this blog entry should not reflect badly on Raymond C, as he was reporting on the architecture as it stood at the time.


The 2022 follow-up also said "And by x86 I mean specifically x86-32." Also, I don't think he was on the AMD64 team yet at that time (still Itanium), so that probably factors in too.


In regards to memory alignment, it's even worse. Most instructions work on unaligned data, but some require 8-byte, 16-byte, 32-byte, or 64-byte alignment, and I think there is even some 128- and 256-byte alignment. It's one of the more common pitfalls someone can find themselves in when coding x86-64 asm.


There is still one big thing that hasn't changed but has been the subject of debate over whether x86-64 fundamentally bottlenecks CPU architecture: variable-length instructions mean decoder complexity scales quadratically rather than linearly. It's been speculated that this is one reason why even the latest x86 architectures stick with relatively narrow decode, while Arm CPUs at lower performance levels (e.g. Cortex-X1/X2) are already 5-wide and Apple is 8-wide.


"As opposed to load-linked/store-conditional, which is a different way to express the same basic idea? Or is he claiming that other processors didn't support any form of atomic instructions, which definitely isn't true?"

It refers specifically to things like fetch_and_add, which is supported by RISC-V and IA-64.


>>The x86 has a strict memory model … The others have weak memory models

> Still x86-only.

SPARC supported (eventually exclusively) TSO well before x86 committed to it. For a while Intel claimed to support some form of Processor Ordering which I understand is slightly weaker than TSO, although no Intel CPU ever took advantage of the weakened constraints.


>> The x86 supports atomic load-modify-store operations. None of the others do.

> As opposed to load-linked/store-conditional, which is a different way to express the same basic idea?

Perhaps that x86 has atomic addition, where other architectures must use a cas/ll+sc loop?


Thanks for summarizing this. Did they do any other clean-up when moving to 64 bit?


PC-relative code.



> The x86 has a strict memory model

x86 doesn't really impose sequential consistency between cores/threads. It imposes Total Store Order (TSO), in which stores are always ordered relative to each other, but a later load can be reordered ahead of an earlier store.

SPARC had TSO on later chips whereas earlier chips had weaker models. MIPS developed the other way: with older versions having stronger memory ordering and later getting relaxed memory ordering.

RISC-V chips can optionally support TSO, but it seems the motivation is programs ported from x86. IBM's z/Architecture (with lineage back to the IBM System/360) is still alive and also has TSO. BTW, the Mill is supposed to offer sequential consistency, but it remains to be seen whether that will be a performance bottleneck.


That strictness of x86 execution order has been substantially relaxed in the last two decades and can be a bit of a pain to deal with in multithreaded code for the novice. The Pentium 3 added SFENCE (with SSE) and the Pentium 4 added LFENCE and MFENCE (with SSE2). I believe that prior to that, only the LOCK prefix was available.


From the memory-model point of view, LFENCE and SFENCE are only relevant for SSE non-temporal loads and stores. A novice is never going to stumble on them by mistake.

MFENCE was added for convenience, but the same effect can be had with any locked instruction on a dummy memory location. In fact XCHG is often still faster than MFENCE.
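A small sketch of that difference from Rust; the noted instructions are typical x86-64 codegen, not a guarantee:

    use std::sync::atomic::{fence, AtomicU32, Ordering};

    static READY: AtomicU32 = AtomicU32::new(0);

    pub fn publish_with_seqcst_store() {
        // Typically compiled to an XCHG on the target location: the locked
        // RMW itself provides the full barrier.
        READY.store(1, Ordering::SeqCst);
    }

    pub fn publish_with_fence() {
        READY.store(1, Ordering::Release);
        // Typically compiled to MFENCE, which is often the slower option.
        fence(Ordering::SeqCst);
    }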

In fact, the x86 memory model has been strengthened over the last couple of decades: some reorderings that were theoretically possible (but never implemented in any real hardware) have finally been documented to be impossible, now that TSO has been embraced.


> x86 doesn't really impose sequential consistency between cores/threads. It imposes a Total Store Order (TSO) in which stores are always in order to each other but a store can be reordered after a load.

To be more pedantic (and hoping I remember this correctly): TSO is indistinguishable in software from full sequential consistency. Any code to detect the difference must by definition be subject to race conditions (or must be an atomic read/write operation that on x86 would be serializing anyway). So x86 in fact does provide SC semantics "between cores/threads". It does have visible reordering artifacts from the perspective of hardware designs (e.g. MMIO registers) where a load has side effects.


That doesn't sound right to me. Dekker's algorithm is broken on x86 without explicit barriers, but it works on an SC machine.
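The minimal store-load litmus test, sketched in Rust: on an SC machine at least one thread must observe the other's flag, while on x86 with plain (Relaxed, barrier-free) accesses both can read false, because each store can sit in the store buffer past the following load.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::thread;

    static A: AtomicBool = AtomicBool::new(false);
    static B: AtomicBool = AtomicBool::new(false);

    fn main() {
        let t1 = thread::spawn(|| { A.store(true, Ordering::Relaxed); B.load(Ordering::Relaxed) });
        let t2 = thread::spawn(|| { B.store(true, Ordering::Relaxed); A.load(Ordering::Relaxed) });
        let (saw_b, saw_a) = (t1.join().unwrap(), t2.join().unwrap());
        // (false, false) is a legal outcome under TSO; SeqCst on all four
        // accesses rules it out, which is what Dekker's algorithm needs.
        println!("saw_b={saw_b}, saw_a={saw_a}");
    }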


And there are actual SC machines in the wild. For example, the Carmel cores on the Tegra Xavier processor provide SC guarantees.


An article by the Raymond Chen. Here's Joel Spolsky's contemporary take on Chen:

https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost...


I wouldn't say the x86 is the weirdo in these cases, it's just not a RISC.


Notice that it's the only one remaining?

Yes, it's a weirdo.


Not defending the weirdo, but IBM Z is a thing that is shipping, heavily used, and still evolving. It's more CISC than even x86.

I do find it sad that we are stuck in an 80-year-old paradigm that is a horrible fit for how CPUs are actually implemented in 2022, but the inertia is strong in this one.


Inertia is the same as backward compatibility.

Kinda like gas engines and fuel pumps. Or wall socket formats. Inertia.


I think you are illustrating my point, thanks. Gas engines are slowly but surely being replaced with much better, cleaner, and more efficient alternatives. Wall socket formats are here to stay, but low-power DC loads are increasingly coalescing around USB, with USB-C really accelerating the change.

So yes, we can overcome inertia; it just takes a lot of energy.


ARM isn't a pure RISC anymore; it has complex addressing modes, and ARMv7 has different instruction lengths. Complicated instructions are still worth adding if they happen to be hardware-efficient; instruction fusing and cracking are expensive too and aren't optimization miracles.

I don't think RISC is a very useful term.


You mean as opposed to the other architectures mentioned in the article: PPC, MIPS, Itanium, and Alpha? Two of which are utterly dead, one looks like it's dying (MIPS), and one is little more than a niche product (PPC).


It seems that MIPS has been almost entirely relegated to the academic space, for courses on processor architecture.


Implying that so-called RISC CPUs are actually RISC. There is nothing even remotely "reduced" about a modern ARM CPU.


> Notice that it's the only one remaining?

Motorola 68K is still around too.


Came here to say that. At the time the post was written it was really just comparing x86 to the already dying RISC. Nowadays that comparison makes even less sense as the two schools of thought have more or less converged.


In essence, it has been a RISC since the days of the Pentium Pro. Each instruction is decoded into several micro-operations, and the trick to making the CPU run fast lies in the decoder.

This also leads to situations where many simple x86 instructions will run faster than fewer complex instructions, which the decoder may not decode into their most optimal micro-operation sequence.


No, this take is incorrect. x86-64 instructions do get lowered into µops, but all the common ones will only execute one, or maybe two. The real win for µops is letting Intel microcode instructions that nobody cares about, but you really shouldn't be using them anyways.


To what extent is the decoding consistent? It always felt weird to me that we'd do something in hardware on every execution that could be done once in software at compile time. But if the decoder does smart things that would make more sense to me. For example, with SMT the decoder could choose not to use certain ops if the compute units are in use by the other thread.


That post isn't true, common instructions only take a few µops (or even one) and the main difference is register renaming. Microcoding mainly lets them support old and very complicated instructions that nobody actually needs to be fast.

You can see instruction latencies at https://www.agner.org/optimize/.


I am not sure of any arch that does this, but having a second-stage decode buys two things off the top of my head: better instruction-cache usage, especially for small loops, and the flexibility to decode ops differently on different versions of the same arch from one binary. Otherwise you would have to have tons of different versions for each flavor of x86 out there.


Intel and AMD CPUs have micro op caches for tight loops.


> on every execution that could be done once in software at compile time

Which is to me, essentially, VLIW. Those architectures didn't work out very well. There's a lot of memory pressure just to move instructions around and the costs didn't seem to outweigh the benefits.


The Mill CPU and architecture is a fascinating approach here. Who knows if we'll ever see a real piece of hardware from them, but as I understand it they require a compiler tuned to the exact cpu model to maximize on-die memory and other aspects.


It is very inconsistent between major microarchitectural revisions, sometimes even between minor steppings. Sometimes firmware updates can change the decoding (usually to fix bugs).


The firmware can only do this by turning on an extremely slow fallback path. It's not common and it's seen as an avenue of last resort.


That's an implementation detail. The submission was about architecture.


> Note from the future: At the time of writing, the term “x86” was used exclusively to refer to what later became known as “x86-32”. The name “x86-64” wouldn’t be invented until 2006.

Notes from the future are always a risky business. Now we're in the future and we call them x86 and x64 - which makes far less sense than x86-32 and x86-64, but I guess is shorter and doesn't require you to learn a new name for an old concept.

I guess the actual future always turns out a bit more messy than the science fiction version of it.


> The name “x86-64” wouldn’t be invented until 2006.

While x86-32 is AFAIK a backronym which would only be invented later, the original name of the AMD64 architecture is x86-64, and has been since that architecture was first revealed at the turn of the millennium. If you weren't following these developments back then, you can still see it for yourself in the Internet Archive: https://web.archive.org/web/20000817014037/http://www.x86-64... is the earliest snapshot of the official page for the x86-64 Linux porting effort, which links to https://web.archive.org/web/20000817071303/http://www.amd.co... which is a snapshot of a press release dated 2000-08-15 announcing that porting effort. Quoting from that press release: "AMD publicly released its 64-bit architecture specification, the x86-64™ Architecture Programmers Overview, last week to enable the industry to begin incorporating x86-64 technology support into their operating systems, applications, drivers, and development tools."

That is, the x86-64 name was first used publicly, as the official name of that architecture, no later than early August of the year 2000.


Even weirder, x86 is only a superficial layer: all x86 code has been translated to RISC-like micro-ops on the fly since the 1990s. This is an economic artifact of the unpredicted success of the IBM PC.


I would argue that in 2004 there were still more processor architectures that used the stack for function parameters and return addresses than those that used registers.


x86 is a hack on top of a chip design that was originally intended as a low-cost calculator CPU that gradually grew up to power big-iron servers. Most of its competitors were designed for medium or large computers from the get-go, so had far less backward-compatibility baggage to deal with.


Not at all; the original 16-bit x86 was well designed, and the 32-bit extensions were pretty decent too. It's not until much later that it really started to turn to shit. The real disaster started when Intel began haphazardly adding terribly thought-out SIMD instructions. AMD64 truly sucks, though; that is where the original encoding should have been replaced.


The 16-bit chip had to be backward compatible with the 8-bit one, no?


Kind of, but not really: while it shares some design choices, the 8086 is neither source- nor binary-compatible with the 8080.



