How about undocumented instructions? Sandsifter[1] is an interesting project and the video from BlackHat[2] is a good watch. There's also a previous discussion of it on HN[3].
> To not leave you hanging: Intel has an official x86 encoder/decoder library called XED. According to Intel’s XED, as of this writing, there are 1503 defined x86 instructions (“iclasses” in XED lingo), from AAA to XTEST (this includes AMD-specific extensions too, by the way). Straightforward, right?
Hopefully this will have either saved you a click or validated your time in reading the article.
For me the article was well worth it; where else but in ISA discussions can you find gems like the following?
> Does a non-instruction that is non-defined and unofficially guaranteed to non-execute exactly as if it had never been in the instruction set to begin with count as an x86 instruction? For that matter, does UD2 itself, the defined undefined instruction, count as an instruction?
Curious: does anyone actually care primarily about the number itself? I thought pretty much everyone who clicks on an article with that title does so because they're interested in the insights gathered on the way to that number.
I doubt most people coming to the article from HN are writing disassemblers, and anyone who is would have to read it anyway, because the number by itself isn't enough to validate that you've enumerated all of them (as my sibling comment points out, it's more complicated than that). The specific number is the least interesting part.
While having instructions for everything that are slow in early models but can be significantly improved in silicon over time is one way to look at CISC, I genuinely wonder how much silicon is spent on instructions that are so rarely used they'd be better in software.
Or to ask it another way: how many instructions in the billions of deployed x86 cores rarely, if ever, get used? Hmmm...
It's really more like 6000, if you do the accounting right. And, as someone who just ate a 2x (customer-facing) slow down porting a piece of x86 assembly to ARM, because that "one nifty instruction" was missing, I'm going to say: we should keep all of 'em.
_mm256_movemask_epi8, i.e., the "fast lexer" instruction. That instruction takes ~3c (depending on uarch), and the ARM equivalent (7-8 instructions) takes ~5-6c (depending on uarch). It's just annoying.
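For reference, the pattern looks roughly like this (a minimal sketch with made-up names, assuming AVX2: compare a 32-byte chunk against a delimiter and collapse the result into one 32-bit mask):

    #include <immintrin.h>
    #include <stdint.h>

    /* bit i of the result is set iff p[i] == delim */
    static uint32_t delimiter_mask(const char *p, char delim) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)p);
        __m256i hits  = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(delim));
        return (uint32_t)_mm256_movemask_epi8(hits);
    }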
PMOVMSKB is a great instruction, and 3c understates how cheap it is - if you have a throughput problem (rather than a latency problem) it's even more efficient relative to the ARM equivalent.
I have a blog post about coping strategies for working around the absence of PMOVMSKB on NEON:
We used these techniques in simdjson (which I presume still uses them; the code has changed considerably since I built this): https://github.com/simdjson/simdjson
The best techniques for mitigating the absence of PMOVMSKB require that you use LD4, which results in interleaved inputs. This can sometimes make things easier, sometimes harder for your underlying lexing algorithm - sadly, it's not a 1:1 transformation of the original x86 code.
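For illustration only (this is not the LD4-based technique from the post, just the straightforward emulation most people reach for, which shows where the extra instructions go; AArch64 NEON, comparison input assumed to be 0x00/0xFF per lane):

    #include <arm_neon.h>
    #include <stdint.h>

    /* emulate PMOVMSKB: AND each lane with its bit weight, then three
       rounds of pairwise adds collapse 16 lanes into a 16-bit mask */
    static uint16_t neon_movemask(uint8x16_t cmp) {
        static const uint8_t weights[16] = { 1, 2, 4, 8, 16, 32, 64, 128,
                                             1, 2, 4, 8, 16, 32, 64, 128 };
        uint8x16_t bits = vandq_u8(cmp, vld1q_u8(weights));
        uint8x16_t sum  = vpaddq_u8(bits, bits);  /* AArch64-only intrinsic */
        sum = vpaddq_u8(sum, sum);
        sum = vpaddq_u8(sum, sum);
        return vgetq_lane_u16(vreinterpretq_u16_u8(sum), 0);
    }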
I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is a HPC machine from Fujitsu, that for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps). There's only so much space in my brain.
AVX-512's masks are OK. They're quite cheap. There are some infelicities. I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again. This may be a symptom of AVX-512's origin with Knights*.
It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations. I suppose this is inevitable given that they need to have physical proximity to their units - so explicit mask ops are on the same ports as the SIMD ops.
I also wish that there were 512b compares that produced zmm registers like the old compares used to; sometimes that's the behavior you want. However, you can reconstruct that in another cheap operation iirc.
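Something like this, if I recall the intrinsics correctly (assumes AVX-512BW; the compare yields a k-mask and VPMOVM2B fans it back out into a zmm of 0x00/0xFF lanes):

    #include <immintrin.h>

    /* old-style compare result reconstructed from a k-mask */
    static __m512i cmpeq_bytes_as_vector(__m512i a, __m512i b) {
        __mmask64 k = _mm512_cmpeq_epi8_mask(a, b);  /* one bit per byte lane */
        return _mm512_movm_epi8(k);                  /* back to a full-width vector */
    }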
> I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is a HPC machine from Fujitsu, that for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps).
Fair enough. I have high hopes for SVE, though. The first-faulting memory ops and predicate bisection features look like a vectorization godsend.
> There's only so much space in my brain.
I'm still going to attempt a nerd-sniping with the published architecture manual. Fujitsu includes a detailed pipeline description, including instruction latencies. Granted, it's just one part, and an HPC-focused part at that. But it's not every day that this level of detail gets published in the ARM world.
> I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again.
SVE is blessedly uniform in this regard.
> It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations.
This goes both ways, though. A64FX has two vector execution pipelines and one dedicated predicate execution pipeline. Since the vector pipelines cannot execute predicate ops, I expect it is not difficult to construct cases where code gets starved for predicate execution resources.
The Fujitsu manuals are really good. Like you say, it's not often that you see that level of detail in the ARM world - or, frankly, the non-x86 world in general. From my prehistory as a Hyperscan developer back in the days before the Intel acquisition and the x86-only open source port, I have a lot of experience chasing around vendors for latency/throughput/opt-guide material. Most of it was non-public and/or hopelessly incomplete.
I salute your dedication to nerd-sniping. I need my creature comforts these days too much to spend days out there in the nerd-ghillie-suit waiting for that one perfect nerd-shot. That may be stretching the analogy hopelessly, but working with just architecture manuals and simulators is tough.
I am more aiming for nerd-artillery ("flatten the entire battlefield") these days: as my powers wane, I'm hoping that my superoptimizer picks up the slack. Despite my skepticism about SVE, I will retarget the superoptimizer to generate SVE/SVE2.
I mostly just think of the test set they must make use of at Intel. It makes my head hurt. Maybe you end up with a wad of legacy code so large that no one knows how it really works. That ends up being the real definition of the part.
If you cheat: all of them. GCC supports in-line assembly.
Of course, that’s not an interesting observation.
It gets (somewhat) interesting when you realize that one may inadvertently include in-line assembly. Do you call including system headers that happen to contain inline assembly cheating? Including headers for a standard C library? Compiling with old-fashioned compiler flags (for example to generate FPU instructions for the x87)?
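A minimal example of the "cheat", for the curious: GNU extended asm lets plain C emit any instruction the assembler knows, here RDTSC, regardless of what the compiler itself would generate:

    #include <stdint.h>

    /* read the time-stamp counter via inline assembly */
    static inline uint64_t read_tsc(void) {
        uint32_t lo, hi;
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }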
And by the way, I used x86 as an example here, but don't believe for a second the same thing doesn't apply to, say, the ARM chip in your phone. Modern ARM chips support multiple encodings and also come in at over 1000 instructions if you count them at the same level of granularity as XED's “iforms”.
Indeed, those who think x86 is complex should also take a detailed look at the ARM64 instruction set, particularly its instruction encoding. If you thought making sense of x86 instruction encoding was hard and expected a RISC to be simpler, AArch64 will puzzle you even more.
To use the MOV example, the closest ARM equivalent might be the 40 variants of LD, which the reference manual (5000+ pages) enumerates as: LDAR, LDARB, LDARH, LDAXP, LDAXR, LDAXRB, LDAXRH, LDNP, LDP, LDPSW, LDR (immediate), LDR (literal), LDR (register), LDRB (immediate), LDRB (register), LDRH (immediate), LDRH (register), LDRSB (immediate), LDRSB (register), LDRSH (immediate), LDRSH (register), LDRSW (immediate), LDRSW (literal), LDRSW (register), LDTR, LDTRB, LDTRH, LDTRSB, LDTRSH, LDTRSW, LDUR, LDURB, LDURH, LDURSB, LDURSH, LDURSW, LDXP, LDXR, LDXRB, LDXRH. Some, like LDP, are then further split into different encodings depending on the addressing mode.
My suspicion is that to achieve acceptable code density with a fixed-length instruction encoding, they just made the individual instructions more complex. For example, the add instruction can also do a shift on one of its operands, which would require a second instruction on x86.
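A toy illustration of that point (my own example, not from the article): on AArch64 the shift is folded into the add, giving a single "add x0, x0, x1, lsl #5", while x86-64 needs a separate shift first (LEA only covers scale factors of 1, 2, 4 and 8):

    #include <stdint.h>

    uint64_t add_shifted(uint64_t a, uint64_t b) {
        return a + (b << 5);   /* one instruction on AArch64, two on x86-64 */
    }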
I've been getting a lot of enjoyment out of 6502/z80 projects like rc2015 and ben eaters breadboard 6502 kit/videos. There's also nand2tetris, if you prefer a more academic approach.
Last time I wrote assembly, and it was a long while ago, it was way faster. But let's be honest, 95% of it was doing manual buffering on top of OS api's rather than use C stdlib. And the other 5% were by skipping itoa calls, by doing arithmetic directly on the string representation.
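The string-arithmetic idea, in case it's not obvious, is something like this (a sketch, not the original code: fixed-width ASCII counter, overflow ignored):

    /* bump a decimal counter stored as ASCII digits, no itoa() needed */
    static void ascii_increment(char *digits, int len) {
        for (int i = len - 1; i >= 0; i--) {
            if (digits[i] != '9') { digits[i]++; return; }
            digits[i] = '0';   /* carry into the next digit to the left */
        }
    }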
I think this is why assembler can often be faster: not because I'm better than a compiler, but because the structure of the language nudges you into faster approaches.
I've always been able to beat the compiler, and that's usually after trying to optimize using C. Admittedly, it's a whole lot harder to understand what's fast than it used to be. Access to SSE has its own benefits.
It's been a problem (optimizing) for some time though. I remember it being some work to beat the compiler on the i960CA. OTOH, I seem to remember the i860 being not-so-great and for sure the TI C80 C compiler was downright awful (per usual for DSPs).
One should never lose to the compiler; after all, you can see its output and it can't see yours.
Also, the programmer can "cheat" by doing things the compiler would consider invalid but are known to be ok given the larger context of the application.
The problem is that the ROI is usually pretty bad: in my experience these assumptions rarely hold as the code evolves, and the optimization usually only lasts a finite (sometimes shockingly short) amount of time, i.e. OS changes, hardware changes, memory changes, etc.
Back in the Pentium 1 and earlier days I could beat the compiler. But then it got hard.
And it changes so often, instructions that are fast on one CPU are not so fast on the next one, and vice versa.
Not to mention that branch prediction and out-of-order execution make it very difficult to benchmark meaningfully. Is my code really faster, or does it just seem that way because some address got better aligned, or similar?
I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (ie not my code) with the plain C code equivalent. The assembly was probably faster 10-15 years ago, but not anymore...
>I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (ie not my code) with the plain C code equivalent.
That's an interesting point, plus there's the portability issue.
My own breadcrumb trail for this kind of inner-loop stuff has been to write a straightforward C implementation (and time it), an optimized C version (which itself can depend on the processor used), and a hand-tuned assembly version where really needed.
It allows you to back out of the tricky stuff plus acts as a form of documentation.
The assembly bug bit me again a few months back, but instead of writing it I had a grand time hand-disassembling some ARMv4T binaries - scratched something of the same itch as solving a sudoku.
ARM has all these variations, which make it seem as complicated as x86, but they are distinct variations, and future CPUs can, for example, drop 16-bit Thumb fairly cleanly.
It's way easier to determine instruction length on ARM. It's usually fixed. That eliminates a lot of the brute-force thrashing that x86 decoders have to do. It doesn't impact transistor count all that much on a huge modern CPU, but it saves a decent amount of power. It's one of the things that factors into why ARM is so power efficient.
ARM has also been willing to drop older optional legacy stuff like the Java-oriented instructions that almost nobody used, and Thumb. x86 supports nearly all legacy opcodes, even things like MMX and other obsolete vector operations that modern programs never use.
It's not that variable length is expensive, it's that variable length the way Intel does it is expensive. For instance — not that this is a good idea — you could burn the top 2b to mark instructions as 2/4/6/8 bytes (or whatever) in length. Then you can have your variable-width-cake-and-eat-your-fast-decode-too.
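To make the toy scheme concrete (purely hypothetical, not any real ISA):

    /* length in bytes from the top two bits of the first byte:
       00 -> 2, 01 -> 4, 10 -> 6, 11 -> 8 */
    static int hypothetical_length(unsigned char first_byte) {
        return 2 * ((first_byte >> 6) + 1);
    }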
Each bit used halves the potential encoding space, e.g. 2^32 -> 2^31 possible instruction encodings.
For thumb, 32-bit instructions can be allocated up to 3/32 of the potential 32-bit space, and 16-bit instructions can use 29/32 of the 16-bit space (3 of the potential 5-bit opcodes denote a 32-bit instruction.) Which is probably a better ratio than 1/2 or 1/4 of each, for instance. Though I'm not sure how much of that encoding space is actually allocated or still reserved.
Related, I believe ARM has allocated about half of the 32-bit encoding space for current A64 instructions.
Further, if you want single-byte opcodes, then you took that space from 256 opcodes down to 64. It cost you 192 single-byte opcodes to use a 2b marker. This wouldn't be possible with the current x86 encodings [1].
Ah OK, I think I understand now. You are specifically referring to the ARM Thumb instruction set as an example of this encoding scheme in both your comments?
Only because a single instruction would take the "space" of multiple instructions in the I$-fetch-to-decode. The idea with variable-length encodings is that, for example, an 8B encoding does more than twice the work of a 4B encoding, so you lose out on the instruction slot, but win on the work done.
How is this so? I thought RISC-V was fixed-length, except for 16-bit compressed instructions. And AFAIK those aren't identified by a single particular bit.
See section 1.5 ("Base Instruction-Length Encoding") of the RISC-V spec. It's actually a bit more complex than just using 2 bits (I had forgotten those details), but the basic idea is the same in that there is a fixed cascade of bits identifying the instruction length.
There aren't any standard extensions with instructions >32b yet, but the extensibility is there in the base spec.
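The cascade boils down to something like this (a sketch covering only the lengths defined so far; longer encodings are reserved in the base spec):

    #include <stdint.h>

    /* instruction length from the low bits of the first 16-bit parcel */
    static int rv_instr_bytes(uint16_t first_parcel) {
        if ((first_parcel & 0x03) != 0x03) return 2;  /* compressed (C extension) */
        if ((first_parcel & 0x1f) != 0x1f) return 4;  /* standard 32-bit */
        if ((first_parcel & 0x3f) == 0x1f) return 6;  /* 48-bit */
        if ((first_parcel & 0x7f) == 0x3f) return 8;  /* 64-bit */
        return -1;                                    /* longer / reserved */
    }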
Due to how the instruction set evolved, on the Intel x86 architecture you have to look at a lot of bits of the instruction stream to determine where the next instruction starts. To execute multiple instructions per clock cycle you also have to decode multiple instructions per clock cycle.
I think this old Intel patent talks about one of their decoder implementations:
Intel x86 isn't self-synchronizing. In theory you have to decode every byte of the instruction stream that came before to be sure where the boundaries of an instruction are. Normally it stabilizes after a while in real-world instruction streams, but you can craft malicious instruction streams which yield two different valid sets of instructions depending on whether you start reading them at an offset or not.
Contrast that to something like utf8 where that isn't possible.
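For anyone who hasn't seen it, the UTF-8 property amounts to this (a minimal sketch): continuation bytes always match 10xxxxxx, so from an arbitrary offset you hit a code-point boundary after skipping at most three bytes, and no byte sequence yields two different valid decodings depending on where you start.

    /* advance to the next code-point boundary (or to end) */
    static const char *utf8_resync(const char *p, const char *end) {
        while (p < end && ((unsigned char)*p & 0xC0) == 0x80)
            p++;   /* skip continuation bytes */
        return p;
    }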
I think that one of the things that distinguishes Arm from Intel (for good or bad) is that Arm _has_ left behind a lot of the legacy. aarch64 has no Thumb, Jazelle etc
This complexity is pushed down to operating systems, compilers, assemblers, and debuggers. It ends up causing brutal human-time overhead throughout the chain, and its cost effectively prevents security and high-assurance work.
This more than justifies moving away from x86 into RISC architectures, such as the rising open and royalty-free RISC-V.
Back in the 70s, a friend of mine said he loved the Z80 instruction set the best. I replied that it was the most complicated and kludgy one. He laughed and said the complexity allowed his code to shine above others and he got paid more.
Of course, these days the Z80 instruction set looks trivial :-)
> This complexity is pushed down to operating systems, compilers, assemblers, debuggers.
Is that really true though? In my experience, the amount of complexity in a layered system is roughly constant, and if you make one layer simpler, you have to make another more complex.
I would expect that a compiler for a RISC architecture, for example, needs to do a lot of work figuring out which set of 'simple' instructions encodes an intent most efficiently. It also needs to deal with instruction scheduling and other issues that a CISC compiler does not care about.
These days RISC is mostly about a regular instruction encoding that doesn't present challenges to doing precise interrupts on a deeply pipelined machine, rather than just having few instructions per se.
What do you mean by "precise" interrupts here? Do some types of interrupts cause worse pipeline stalls than other types? Are the interrupt handlers in RISC faster because of more efficient decoding? Is that the issue?
"An interrupt is precise if the saved process state corresponds to a sequential model of program execution in which one instruction completes before the next begins. In a pipelined processor, precise interrupts are difficult to implement because an instruction may be initiated before its predecessors have completed."
No, it's just that RISC not having any load-op-store instructions made it easier to pipeline back in the day while still handling interrupts as if instructions executed one at a time: there's a clean point where one instruction has fully executed and the following one hasn't executed at all. That's possible to do with CISC architectures too, of course; we do it all the time these days. But back in the day, being able to implement that easily-ish in a reasonable number of transistors was a big factor in early RISC processors having so much better performance than the CISCs of the time.
In the modern day it mostly just means that you can implement fancy out-of-order shenanigans with a bit fewer engineer-years than non-RISC ISAs.
[1] https://github.com/Battelle/sandsifter
[2] https://www.youtube.com/watch?v=KrksBdWcZgQ
[3] https://news.ycombinator.com/item?id=18179212