How many x86 instructions are there? (2016) (fgiesen.wordpress.com)
132 points by sandinmyjoints on April 21, 2021 | 87 comments



How about undocumented instructions? Sandsifter[1] is an interesting project and the video from BlackHat[2] is a good watch. There's also a previous discussion of it on HN[3].

[1] https://github.com/Battelle/sandsifter

[2] https://www.youtube.com/watch?v=KrksBdWcZgQ

[3] https://news.ycombinator.com/item?id=18179212



Ah cool, seems I missed that when it appeared on HN a few weeks back. Thanks for the pointer.


Yes, and sandsifter keeps finding more; recently, undocumented microcode-modifying instructions were found with it.


Very cool project! Thanks for the pointer.


> To not leave you hanging: Intel has an official x86 encoder/decoder library called XED. According to Intel’s XED, as of this writing, there are 1503 defined x86 instructions (“iclasses” in XED lingo), from AAA to XTEST (this includes AMD-specific extensions too, by the way). Straightforward, right?

Hopefully this will have either saved you a click or validated your time in reading the article.


For me the article was well worth it; where else but in ISA discussions can you find gems like the following?

> Does a non-instruction that is non-defined and unofficially guaranteed to non-execute exactly as if it had never been in the instruction set to begin with count as an x86 instruction? For that matter, does UD2 itself, the defined undefined instruction, count as an instruction?


This was why I posted it -- I learned a lot more than the answer to the title.


Curious, does anyone actually care about the actual number primarily? I thought pretty much everyone who clicks on an article with that title would do so because they are interested in the insights gathered when getting to that number.


If you are writing a disassembler or binary program decoder, such a number will help you be sure that you enumerate all the instructions.


I doubt most people reading the article coming from HN are writing disassemblers, and all such people would have to read it anyway because the number itself isn't sufficient to validate that you've enumerated all of them (because as my sibling points out, it's more complicated than that). The specific number is the least interesting part.


I think if you were writing such a program, this article would show that using such a number is a much more complicated idea than it sounds.


Enumerating all instructions would be useful to check whether you can decode all legal instructions.


The issue is that the list of all legal instructions is hard to define.


More than 1,500! Holy cow!

While having instructions for everything that are slow in early models but can be significantly improved in silicon over time is one way to look at CISC, I genuinely wonder how much silicon is spent on instructions that are so rarely used they'd be better in software.

Or to ask another way: how many instructions are in billions of x86 cores that rarely if ever get used? Hmmm...



It's really more like 6000, if you do the accounting right. And, as someone who just ate a 2x (customer-facing) slow down porting a piece of x86 assembly to ARM, because that "one nifty instruction" was missing, I'm going to say: we should keep all of 'em.


Which one?


_mm256_movemask_epi8, i.e., the "fast lexer" instruction. That instruction takes ~3c (depending on uarch), and the ARM equivalent (7-8 instructions) takes ~5-6c (depending on uarch). It's just annoying.


PMOVMSKB is a great instruction, and 3c understates how cheap it is - if you have a throughput problem (rather than a latency problem) it's even more efficient relative to the ARM equivalent.

I have a blog post about coping strategies for working around the absence of PMOVMSKB on NEON:

https://branchfree.org/2019/04/01/fitting-my-head-through-th...

We used these techniques in simdjson (which I presume still uses them; the code has changed considerably since I built this): https://github.com/simdjson/simdjson

The best techniques for mitigating the absence of PMOVMSKB require that you use LD4, which results in interleaved inputs. This can sometimes make things easier, sometimes harder for your underlying lexing algorithm - sadly, it's not a 1:1 transformation of the original x86 code.
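For readers who haven't seen it, one common way to emulate a 16-byte movemask on AArch64 NEON looks roughly like this (a sketch, not necessarily the technique from the post above; it assumes the input lanes are already 0x00/0xFF compare results):

  #include <arm_neon.h>
  #include <stdint.h>

  /* Sketch: emulate SSE2's PMOVMSKB for one 16-byte vector whose lanes
     are 0x00 or 0xFF (e.g. the result of vceqq_u8). Each lane is ANDed
     with its bit weight, then each 8-byte half is summed horizontally. */
  static inline uint16_t neon_movemask_u8(uint8x16_t input) {
      static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                          1, 2, 4, 8, 16, 32, 64, 128};
      uint8x16_t masked = vandq_u8(input, vld1q_u8(weights));
      uint16_t lo = vaddv_u8(vget_low_u8(masked));   /* bits 0..7  */
      uint16_t hi = vaddv_u8(vget_high_u8(masked));  /* bits 8..15 */
      return (uint16_t)(lo | (hi << 8));
  }

There are cheaper variants if a 4-bits-per-lane "nibble mask" is good enough (a single narrowing shift gets you there), which is often the better trade on NEON.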


Yep. My use-case is from your work. Brilliant stuff!


Have you had a chance to experiment with the SVE and/or AVX512 mask systems, yet?


No and yes, respectively.

I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is an HPC machine from Fujitsu, for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps). There's only so much space in my brain.

AVX-512's masks are OK. They're quite cheap. There are some infelicities. I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again. This may be a symptom of AVX-512's origin with Knights*.

It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations. I suppose this is inevitable given that they need to have physical proximity to their units - so explicit mask ops are on the same ports as the SIMD ops.

I also wish that there were 512b compares that produced zmm registers like the old compares used to; sometimes that's the behavior you want. However, you can reconstruct that in another cheap operation iirc.
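If I'm remembering the right instruction, the reconstruction is something like this (a sketch assuming AVX512BW is available; the function name is just illustrative):

  #include <immintrin.h>

  /* Old SSE/AVX compares produced an all-ones/all-zeros vector directly.
     With AVX-512 the compare yields a k-mask; VPMOVM2B (via the
     _mm512_movm_epi8 intrinsic) expands it back into 0x00/0xFF lanes. */
  static inline __m512i cmpeq_bytes_as_vector(__m512i a, __m512i b) {
      __mmask64 k = _mm512_cmpeq_epi8_mask(a, b);
      return _mm512_movm_epi8(k);
  }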


> I'm somewhat curmudgeonly w.r.t. SVE, insisting that while the sole system in existence is an HPC machine from Fujitsu, for practical purposes it doesn't really exist and isn't worth learning. I will likely revise this opinion when ARM vendors decide to ship something (likely soon, by most roadmaps).

Fair enough. I have high hopes for SVE, though. The first-faulting memory ops and predicate bisection features look like a vectorization godsend.

> There's only so much space in my brain.

I'm still going to attempt a nerd-sniping with the published architecture manual. Fujitsu includes a detailed pipeline description including instruction latencies. Granted, it's just one part, and it's an HPC-focused part at that. But it's not every day that this level of detail gets published in the ARM world.

https://github.com/fujitsu/A64FX/tree/master/doc

> I was irate to discover that you can't do logic ops on 8b/16b lanes with masking; as usual the 32b/64b mafia strike again.

SVE is blessedly uniform in this regard.

> It would be nice if the explicit mask operations were cheaper. Unfortunately, they crowd out SIMD operations.

This goes both ways, though. A64FX has two vector execution pipelines and one dedicated predicate execution pipeline. Since the vector pipelines cannot execute predicate ops, I expect it is not difficult to construct cases where code gets starved for predicate execution resources.


The Fujitsu manuals are really good. Like you say, it's not often that you see that level of detail in the ARM world - or, frankly, the non-x86 world in general. From my prehistory as a Hyperscan developer back in the days before the Intel acquisition and the x86-only open source port, I have a lot of experience chasing around vendors for latency/throughput/opt-guide material. Most of it was non-public and/or hopelessly incomplete.

I salute your dedication to nerd-sniping. I need my creature comforts these days too much to spend days out there in the nerd-ghillie-suit waiting for that one perfect nerd-shot. That may be stretching the analogy hopelessly, but working with just architecture manuals and simulators is tough.

I am more aiming for nerd-artillery ("flatten the entire battlefield") these days: as my powers wane, I'm hoping that my superoptimizer picks up the slack. Despite my skepticism about SVE, I will retarget the superoptimizer to generate SVE/SVE2.


This instruction does quite a bit of leg work, and it becomes obvious why a RISC architecture would need 7-8 instructions to do the same, see: https://software.intel.com/sites/landingpage/IntrinsicsGuide...


This looks like a task for gorc: https://five-embeddev.com/riscv-bitmanip/draft/bext.html#gen...

(granted, that’s only 32/64 bits, but still…)


I'd also be curious to discover how many distinct x86 instructions gcc can even emit? I expect the answer is "a lot less than all of them."


I mostly just think of the test set they must make use of at Intel. It makes my head hurt. Maybe you end up with a wad of legacy code so large that no one knows how it really works. That ends up being the real definition of the part.


If you cheat: all of them. GCC supports in-line assembly.

Of course, that’s not an interesting observation.

It gets (somewhat) interesting when you realize that one may inadvertently include inline assembly. Do you call including system headers that happen to contain inline assembly cheating? Including headers for a standard C library? Compiling with old-fashioned compiler flags (for example, to generate FPU instructions for the x87)?
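For instance, a minimal sketch of explicit GCC extended inline assembly (the function name is just illustrative):

  #include <stdint.h>

  /* Hand-written inline assembly makes GCC emit an instruction (CPUID)
     that it would rarely, if ever, generate from plain C. */
  static inline uint32_t cpuid_eax(uint32_t leaf) {
      uint32_t eax, ebx, ecx, edx;
      __asm__ volatile("cpuid"
                       : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                       : "a"(leaf), "c"(0));
      return eax;
  }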


Probably a couple hundred at most, and for common programs several dozen.


Tried it out on Debian Bullseye x86_64:

  $ objdump -w -j .text --no-show-raw-insn -d /usr/bin/emacs-gtk | egrep '^ *[0-9]+:' | awk '{print $2}' | sort | uniq | wc -l
  130

  $ objdump -w -j .text --no-show-raw-insn -d /bin/ls | egrep '^ *[0-9]+:' | awk '{print $2}' | sort | uniq | wc -l
  97

  $ objdump -w -j .text --no-show-raw-insn -d firefox-bin | egrep '^ *[0-9]+:' | awk '{print $2}' | sort | uniq | wc -l
  136


Wow thanks, this is exactly the type of comment I come to HN for.

I wonder if the numbers would change significantly for gentoo or anything compiled manually that had more knowledge of the CPU specifics?


From my fairly riced gentoo system:

  $ objdump -w -j .text --no-show-raw-insn -d /usr/lib64/firefox/firefox-bin | egrep '^ *[0-9]+:' | awk '{print $2}' | sort | uniq | wc -l
  134
So actually, less!


> And by the way, I used x86 as an example here, but don’t believe for a second the same thing doesn’t apply to, say, the ARM chip in your phone. Modern ARM chips support multiple encodings and also rank over 1000 instructions if you count them at the same level of granularity as XEDs “iforms”.

Indeed, those who think x86 is complex should also take a detailed look at the ARM64 instruction set, particularly its instruction encoding. If you thought making sense of x86 instruction encoding was hard, and that a RISC might seem simpler, AArch64 will puzzle you even more.

To use the MOV example, the closest ARM equivalent might be the 40 variants of LD, which the reference manual (5000+ pages) enumerates as: LDAR, LDARB, LDARH, LDAXP, LDAXR, LDAXRB, LDAXRH, LDNP, LDP, LDPSW, LDR (immediate), LDR (literal), LDR (register), LDRB (immediate), LDRB (register), LDRH (immediate), LDRH (register), LDRSB (immediate), LDRSB (register), LDRSH (immediate), LDRSH (register), LDRSW (immediate), LDRSW (literal), LDRSW (register), LDTR, LDTRB, LDTRH, LDTRSB, LDTRSH, LDTRSW, LDUR, LDURB, LDURH, LDURSB, LDURSH, LDURSW, LDXP, LDXR, LDXRB, LDXRH. Some, like LDP, are then further split into different encodings depending on the addressing mode.

My suspicion is that to achieve acceptable code density with a fixed-length instruction encoding, they just made the individual instructions more complex. For example, the add instruction can also do a shift on one of its operands, which would require a second instruction on x86.
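A small illustration of that last point (a sketch; the exact instruction selection depends on the compiler):

  #include <stdint.h>

  /* On AArch64 this typically compiles to a single instruction,
   *   add x0, x0, x1, lsl #5
   * whereas x86-64 has no add-with-arbitrary-shift, so it generally
   * needs a separate shift (LEA only covers scales of 1/2/4/8). */
  uint64_t add_shifted(uint64_t a, uint64_t b) {
      return a + (b << 5);
  }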


I feel a sudden urge to write some assembly for fun. I haven't done it for at least a couple of years, I think.



I've been getting a lot of enjoyment out of 6502/Z80 projects like RC2014 and Ben Eater's breadboard 6502 kit/videos. There's also nand2tetris, if you prefer a more academic approach.


Then when checking the results are half the speed of what the compiler spits out and the fun is gone. At least that's what happens to me...


Last time I wrote assembly, and it was a long while ago, it was way faster. But let's be honest, 95% of it was doing manual buffering on top of OS APIs rather than using the C stdlib. And the other 5% came from skipping itoa calls by doing arithmetic directly on the string representation.

I think this is why assembly can often be faster: not because I'm better than a compiler, but because the structure of the language nudges you into faster approaches.


Good point, though I usually found that if I then go back and restructure my high-level code, the compiler beat my ASM again.


I've always been able to beat the compiler, and that's usually after trying to optimize using C. Admittedly, it's a whole lot harder to understand what's fast than it used to be. Access to SSE has its own benefits.

It's been a problem (optimizing) for some time though. I remember it being some work to beat the compiler on the i960CA. OTOH, I seem to remember the i860 being not-so-great and for sure the TI C80 C compiler was downright awful (per usual for DSPs).


One should never lose to the compiler; after all, you can see its output and it can't see yours.

Also, the programmer can "cheat" by doing things the compiler would consider invalid but are known to be ok given the larger context of the application.

The restrict keyword in C gives an example of how one can optimize code by knowing the larger context. https://cellperformance.beyond3d.com/articles/2006/05/demyst...
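A minimal sketch of the sort of thing restrict enables (illustrative code, not from the linked article):

  #include <stddef.h>

  /* Without restrict, the compiler must assume dst and src may overlap,
     so it either adds runtime overlap checks or falls back to cautious
     scalar code. The restrict qualifiers promise no aliasing, letting it
     vectorize unconditionally and keep values in registers. */
  void scale(float *restrict dst, const float *restrict src,
             float k, size_t n) {
      for (size_t i = 0; i < n; i++)
          dst[i] = k * src[i];
  }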

The problem is the ROI is usually pretty bad as these assumptions rarely hold as the code evolves, in my experience, and the optimization usually only lasts for finite (sometimes shockingly short) amount of time. i.e. OS changes, hardware changes, memory changes, etc. etc. etc.


Back in the Pentium 1 and earlier days I could beat the compiler. But then it got hard.

And it changes so often, instructions that are fast on one CPU are not so fast on the next one, and vice versa.

Not to mention branch prediction and out-of-order execution makes it very difficult to meaningfully benchmark. Is my code really faster, or just seems like it because some address got better aligned or similar.

I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (ie not my code) with the plain C code equivalent. The assembly was probably faster 10-15 years ago, but not anymore...


>I've gotten significant speed gains in certain projects by simply replacing certain hand-optimized assembly in libraries (ie not my code) with the plain C code equivalent.

That's an interesting point, plus there's the portability issue.

My own way of leaving breadcrumbs in legacy code for this kind of innerloopish stuff has been to write a straightforward 'C' implementation (and time it), an optimized 'C' version (which itself can depend on the processor used), and a hand-tuned assembly version where really needed.

It allows you to back out of the tricky stuff plus acts as a form of documentation.


I've started injecting code fragments into spectre exploits as a way of seeing whether the CPU likes them or not.


That's why it's more fun to try to optimize for size


For this case all fun is in the process ;)


The assembly bug bit me again a few months back, but instead of writing it I had a grand time hand-disassembling some ARMv4T binaries - scratched something of the same itch as solving a sudoku.


You'll feel less miserable if you target a modern RISC architecture such as RISC-V or a cleaner CISC architecture such as the venerable 68000.

Whenever bored, I read or watch chibiakumas tutorials. There's no such thing as knowing too many assemblers.


ARM has all these variations, which make it seem as complicated as x86, but they are distinct variations, and future CPUs can, for example, drop 16-bit Thumb fairly cleanly.


It's way easier to determine instruction length on ARM. It's usually fixed. That eliminates a lot of brute force thrashing that X86 decoders have to do. It doesn't impact transistor count all that much on a huge modern CPU but it saves a decent amount of power. It's one of the things that factors into why ARM is so power efficient.

ARM has also been willing to drop older optional legacy stuff like Java oriented instructions that almost nobody used and Thumb. X86 supports nearly all legacy opcodes, even things like MMX and other obsolete vector operations that modern programs never use.


It's not that variable length is expensive, it's that variable length the way Intel does it is expensive. For instance — not that this is a good idea — you could burn the top 2b to mark instructions as 2/4/6/8 bytes (or whatever) in length. Then you can have your variable-width-cake-and-eat-your-fast-decode-too.


> you could burn the top 2b to mark

Which seems reasonable.. but you just burned 3/4 of the single opcode instruction space, which may not be worth it for most general purpose loads.


Would you mind elaborating on the math of how the "top 2b" ends up burning 3/4 of the single opcode instruction space?


Each bit used halves the potential encoding space, e.g. 2^32 -> 2^31 possible instruction encodings.

For thumb, 32-bit instructions can be allocated up to 3/32 of the potential 32-bit space, and 16-bit instructions can use 29/32 of the 16-bit space (3 of the potential 5-bit opcodes denote a 32-bit instruction.) Which is probably a better ratio than 1/2 or 1/4 of each, for instance. Though I'm not sure how much of that encoding space is actually allocated or still reserved.

Related, I believe ARM has allocated about half of the 32-bit encoding space for current A64 instructions.


Further, if you want single-byte opcodes, then you took that space from 256 opcodes down to 64. It cost you 192 single-byte opcodes to use a 2b marker. This wouldn't be possible with the current x86 encodings [1].

[1]: https://www.sandpile.org/x86/opc_1.htm


Ah OK, I think I understand now. You are specifically referring to the ARM Thumb instruction set as an example of this encoding scheme in both your comments?


Variable length is always going to restrict the parallelism of your instruction decode (for a given chip area/power/etc cost).


Only because a single instruction would take the "space" of multiple instructions in the I$-fetch-to-decode. The idea with variable-length encodings is that, for example, an 8B encoding does more than twice the work of a 4B encoding, so you lose out on the instruction slot, but win on the work done.

I mean ... that's the theory.


> you could burn the top 2b to mark instructions as 2/4/6/8 bytes (or whatever) in length.

FWIW, this is exactly what RISC-V does.


How is this so? I thought RISC-V was fixed length, except for 16 bit compressed instructions. And afaik those aren't identified by a singular particular bit.


See section 1.5 ("Base Instruction-Length Encoding") of the RISC-V spec. It's actually a bit more complex than just using 2 bits (I had forgotten those details), but the basic idea is the same in that there is a fixed cascade of bits identifying the instruction length.

There aren't any standard extensions with instructions >32b yet, but the extensibility is there in the base spec.
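Roughly, the length cascade looks like this (a sketch of my reading of that section; it covers only the lengths defined so far):

  #include <stdint.h>

  /* Instruction length in bytes from the first 16 bits of an encoding:
       xxxxxxxxxxxxxxaa  aa != 11      -> 16-bit (compressed)
       xxxxxxxxxxxbbb11  bbb != 111    -> 32-bit
       xxxxxxxxxx011111                -> 48-bit
       xxxxxxxxx0111111                -> 64-bit
     Longer/reserved encodings are omitted here. */
  static int rv_insn_len(uint16_t first_halfword) {
      if ((first_halfword & 0x03) != 0x03) return 2;
      if ((first_halfword & 0x1c) != 0x1c) return 4;
      if ((first_halfword & 0x3f) == 0x1f) return 6;
      if ((first_halfword & 0x7f) == 0x3f) return 8;
      return 0; /* >= 80-bit or reserved */
  }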


Could you elaborate - what is it about the Intel design that makes the decode so inefficient? Is "2b" bits here?

Are there examples of ISA or chips that handle variable length instruction encoding efficiently?


Due to how the instruction set evolved, for the Intel x86 architecture you have to look at a lot of bits of the instruction stream to determine where the next instruction starts. To execute multiple instructions per clock cycle, you also have to decode multiple instructions per clock cycle.

I think this old Intel patent talks about one of their decoder implementations:

https://patents.google.com/patent/US5758116A/en


Intel x86 isn't self-synchronizing. In theory you have to decode every byte of the instruction stream that came before to be sure where the boundaries of an instruction are. Normally it stabilizes after a while in real-world instruction streams, but you can craft malicious instruction streams which yield two different valid sets of instructions depending on whether you start reading them at an offset or not.

Contrast that to something like utf8 where that isn't possible.
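A sketch of what that self-synchronization looks like in practice: UTF-8 continuation bytes are all tagged 10xxxxxx, so from any offset you can skip at most a few bytes to land on the next code-point boundary; no prefix byte of an x86 instruction gives you that guarantee.

  #include <stddef.h>
  #include <stdint.h>

  /* Advance from an arbitrary offset to the next UTF-8 code-point
     boundary by skipping continuation bytes (0b10xxxxxx). */
  static size_t utf8_resync(const uint8_t *buf, size_t pos, size_t len) {
      while (pos < len && (buf[pos] & 0xC0) == 0x80)
          pos++;
      return pos;
  }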


I think that one of the things that distinguishes Arm from Intel (for good or bad) is that Arm _has_ left behind a lot of the legacy. AArch64 has no Thumb, Jazelle, etc.


The only real answer is: Too Many.

This complexity is pushed down to operating systems, compilers, assemblers, and debuggers. It ends up causing brutal human-time overhead throughout the chain, and its cost effectively prevents building secure, high-assurance systems.

This more than justifies moving away from x86 into RISC architectures, such as the rising open and royalty-free RISC-V.


Back in the 70s, a friend of mine said he loved the Z80 instruction set the best. I replied that it was the most complicated and kludgy one. He laughed and said the complexity allowed his code to shine above others and he got paid more.

Of course, these days the Z80 instruction set looks trivial :-)


> This complexity is pushed down to operating systems, compilers, assemblers, debuggers.

Is that really true though? In my experience, the amount of complexity in a layered system is roughly constant, and if you make one layer simpler, you have to make another more complex.

I would expect that a compiler for a RISC architecture, for example, needs to do a lot of work figuring out which set of 'simple' instructions encodes an intent most efficiently. It also needs to deal with instruction scheduling and other issues that a CISC compiler does not care about.


There is complexity, but it’s not nearly as large as you claim it is, nor does RISC magically resolve the issues you’ve brought up.


Now ask how many unique x86 instructions do standard compilers emit...


Past related threads:

How Many X86-64 Instructions Are There Anyway? - https://news.ycombinator.com/item?id=14233296 - April 2017 (133 comments)

How many x86 instructions are there? - https://news.ycombinator.com/item?id=12358050 - Aug 2016 (39 comments)

Does a compiler use all x86 instructions? (2010) - https://news.ycombinator.com/item?id=12352959 - Aug 2016 (189 comments)

How Many X86-64 Instructions Are There Anyway? - https://news.ycombinator.com/item?id=11535178 - April 2016 (1 comment)



An interesting (if horrifying) exercise:

$ ndisasm -b 64 /dev/urandom


I still love the 33 instructions in the Microchip PIC instruction set.

https://www.microchipdeveloper.com/8bit:bl-instruction-set


In short: the endless battle between RISC and CISC.


These days RISC is mostly about regular instruction encodings that don't present challenges to implementing precise interrupts on a deeply pipelined machine, rather than just having few instructions per se.


What do you mean by "precise" interrupts here? Do some types of interrupts cause worse pipeline stalls than other types? Are the interrupt handlers in RISC faster because of more efficient decoding? Is that the issue?


This is the classic definition

https://dl.acm.org/doi/10.1109/12.4607

"An interrupt is precise if the saved process state corresponds to a sequential model of program execution in which one instruction completes before the next begins. In a pipelined processor, precise interrupts are difficult to implement because an instruction may be initiated before its predecessors have completed."


No, it's just that the fact that RISC didn't have any load-op-store instructions made it easier to pipeline them back in the day while still being able to handle interrupts as if instructions executed one at a time: there was one instruction that had fully executed while the following one hadn't executed at all. That's possible to do with CISC architectures too, of course; we do it all the time these days. But back in the day, being able to implement that easily-ish in a reasonable number of transistors was a big factor in early RISC processors having so much better performance than the CISCs of the time.

In the modern day it mostly just means that you can implement fancy out-of-order shenanigans with a few fewer engineer-years than non-RISC ISAs require.


Thanks these are both very helpful and good bits of historical perspective as well. Cheers.


Oh well, there's RISC and there's RISC.

I always think of the AMD 29000.


Or i860 & i960?


I mentioned the 29k since (as I remember, it's been a while) it didn't have a multiply instruction.


Endless? It ended a long time ago. RISC won.

There hasn't been any new CISC architecture worth mentioning in decades.


The legacy of variable length x86 is why Intel will always struggle now to keep up with ARM chips and Apple’s M1.



