Multi-byte NOP opcode made official (intel.com)
107 points by based2 on Aug 26, 2016 | 34 comments



I wrote about long nops in http://www.agner.org/optimize/blog/read.php?i=25#82

I had a private email thread with H. Peter Anvin (formerly of Transmeta) about this.

I think there was an Intel patent about them.


Raymond Chen has a nice article about NOPs that had to be used on the 386 in Windows 95:

https://blogs.msdn.microsoft.com/oldnewthing/20110112-00/?p=...


> For example, there was one bug that manifested itself in incorrect instruction decoding if a conditional branch instruction had just the right sequence of taken/not-taken history, and the branch instruction was followed immediately by a selector load, and one of the first two instructions at the destination of the branch was itself a jump, call, or return.

My brain hurts imagining how they figured out and verified that this bug exists.


Judging from the Matt Dillon case, a lot of "nah, the bug must be in my code" must have been involved.


What is the Matt Dillon case?


The guy who spent half a year looking for a bug in his own code, then another half a year looking for one in GCC (or something like that), only to find out that it was a bug in an AMD CPU.


I used crashme to discover a bug in the Broadcom Sibyte CPU that depended on branch prediction history.


I'd love to read details about your experience. Have you had a chance to write it up?


I never wrote it up. The only weird thing about it was that unmodified crashme wouldn't run because of a prefetch bug in the silicon, plus there was a debugging trap that wasn't handled in the kernel. I added a trivial handler for the trap and pre-processed the randomly-generated crashme code to nop prefetches... 15 seconds later, I saw the hang.


Man, that guy has some war stories—and an incredible memory.


Intriguing, but the original posts are from 2006 with a bump from 2008 and definitive answers are never given, just "this may require an NDA to discuss."

If you care about optimising NOPs, you're probably writing a compiler, so I am curious whether these instructions have found their way into any mainstream compiler such as GCC or Clang. Does this post explain some odd compiler behaviour?

Has newer information been published making this anything other than abuse of an undocumented quirk? I.e., liable to blow up on new processors, as it does on the Pentium MMX, according to the second post.

I mean, it's interesting but I'm not sure why it's here.


Yes, major compilers use the multi-byte NOP: GCC, Clang, the MS compilers, etc.


> If you care about optimising NOPs, you're probably writing a compiler

Or low-level systems code. For instance, the Linux kernel makes extensive use of multi-byte NOPs.


I remember gaining a fair bit of bounty for recommending multi-byte NOPs as an optimisation (use one instead of multiple single-byte NOPs).

http://stackoverflow.com/a/18279617/319204

As part of researching this, I discovered an empirically verified list of NOPs (all the way from 1 byte to 10 bytes each).

https://android.googlesource.com/toolchain/binutils/+/f22651...


> pipeline always stalls on a conditional far jump.

That doesn't make sense. Segment:offset jumps are always unconditional.


Maybe it's because there is a possibility the page table has done something wicked to the destination.


Have you read my comment?


I used multiple NOPs all the time when learning to code 6502 on the VIC-20. With no assembler and typing in hex-valued opcodes by hand, it was useful to leave gaps during development to avoid retyping everything too often.

6502 actually had pretty good performance at the time on a per cycle basis.


I'm somewhat surprised that I can remember that the NOP opcode was EA. Amazing what sticks in your mind for almost forty years...


>> I'm somewhat surprised that I can remember that the NOP opcode was EA. Amazing what sticks in your mind for almost forty years...

poke 19215,25 It was used on MS basic on the Interact computer (1978/9) to enable peek and poke. Oddly the poke instruction would execute and THEN error, but poking this one value disabled the error. There was also a set of 3 pokes that disabled address checking which was used to prevent people from reading the system ROM and the basic interpreter itself.

I believe I still have quite a collection of the "Interaction" newsletter that was published in the Detroit area about this machine, which is where the above info came from. Yeah, the stuff we remember...


I know I'm being stupid somehow, but why isn't 0x90 0x90 0x90 0x90 0x90 0x90 0x90 0x90 0x90 a good enough 9-byte NOP? It's not like a modern CPU is going to spend a cycle individually fetching and pondering over every 0x90...


Modern Intel CPUs can decode 4 instructions per cycle. 9 single-byte NOPs in a row would therefore take 3 cycles to decode (though the last cycle can also decode up to 3 more instructions).

A multi-byte NOP counts as one instruction, taking just one cycle to decode and leaving room for up to 3 more instructions to decode in that cycle.
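
The arithmetic above can be sketched in a few lines of Python. This is only a toy model: it assumes a fixed decode width of 4 instructions per cycle (the figure quoted above) and ignores fetch bandwidth, instruction length limits and the uop cache. The function names are made up for illustration.

```python
import math

DECODE_WIDTH = 4  # instructions decoded per cycle, the figure quoted above


def decode_cycles(instruction_count, width=DECODE_WIDTH):
    """Cycles needed to decode a run of instructions in this toy model."""
    return math.ceil(instruction_count / width)


def spare_slots(instruction_count, width=DECODE_WIDTH):
    """Decode slots left in the final cycle for the instructions that follow."""
    return decode_cycles(instruction_count, width) * width - instruction_count


print(decode_cycles(9), spare_slots(9))  # nine 1-byte NOPs: prints "3 3"
print(decode_cycles(1), spare_slots(1))  # one 9-byte NOP:   prints "1 3"
```

Both paddings leave 3 spare slots in their last cycle, but the nine single-byte NOPs burn two extra decode cycles getting there.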


Upvoted - thank you for the answer. However, I feel I can't get by without empirically testing it in the morning...


The 0F 18 has been known for a long time, and they are not "true NOPs" but PREFETCH instructions:

http://x86.renejeschke.de/html/file_module_x86_id_252.html


0F 1A and 0F 1B have recently become memory bounds-checking instructions, as part of MPX. So really, only 0F 1F is likely to be long-term safe to use as a multi-byte NOP.

Likewise, after Pentium 4 REP NOP (F3 90) became PAUSE, the spinlock waiting hint instruction, so one needs to be careful about prefixes with the old NOP as well.
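
As a rough summary of the opcode ranges mentioned in this subthread, here is a small Python lookup. It is a deliberate simplification, not a decoder: it matches only the leading bytes, ignores the ModRM byte and any other prefixes (such as 66), and the `classify` helper and its labels are invented for illustration.

```python
# Leading opcode bytes discussed in this subthread, longest matches first.
KNOWN_PREFIXES = [
    (bytes.fromhex("f390"), "PAUSE (formerly REP NOP)"),
    (bytes.fromhex("0f18"), "PREFETCH hint group"),
    (bytes.fromhex("0f1a"), "MPX bounds instruction"),
    (bytes.fromhex("0f1b"), "MPX bounds instruction"),
    (bytes.fromhex("0f1f"), "multi-byte NOP"),
    (bytes.fromhex("90"), "single-byte NOP"),
]


def classify(code):
    """Return a coarse label for an instruction based on its first bytes."""
    for prefix, label in KNOWN_PREFIXES:
        if code.startswith(prefix):
            return label
    return "unknown"


print(classify(bytes.fromhex("0f1f440000")))  # the 5-byte NOP form
print(classify(bytes.fromhex("f390")))        # what used to be REP NOP
```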


PAUSE only has a performance impact, and that was part of its design: you need only one binary for your spinlock, and it executes correctly even on processors that don't know anything about PAUSE.


I think they were introduced in the P6 as NOPs, and some of them (but not all of them!) later turned into PREFETCH instructions.


The thread seems to say nothing about them being faster. I guess if they were faster they would be quite useful for trampolines.

Not sure you should ever use a multi-byte instruction for alignment, but then you shouldn't use NOP for alignment in general. That's what 0xCC is for.


Since 0xCC is an interrupt, it's only useful for padding that isn't executed. If you want to align loops, you need NOPs. Single-instruction, multi-byte NOPs are certainly preferred by modern compilers.
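
To make the padding idea concrete, here is a minimal Python sketch that fills a gap with as few NOP instructions as possible. The byte sequences are the recommended 1 to 9 byte forms from the NOP entry in Intel's Software Developer's Manual, all built on 0F 1F; the 16-byte default alignment and the function names are assumptions for illustration, and real assemblers pick sequences per target CPU.

```python
# Recommended multi-byte NOP encodings (Intel SDM, NOP instruction entry).
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("6690"),
    3: bytes.fromhex("0f1f00"),
    4: bytes.fromhex("0f1f4000"),
    5: bytes.fromhex("0f1f440000"),
    6: bytes.fromhex("660f1f440000"),
    7: bytes.fromhex("0f1f8000000000"),
    8: bytes.fromhex("0f1f840000000000"),
    9: bytes.fromhex("660f1f840000000000"),
}


def nop_padding(length):
    """Fill `length` bytes greedily, using the longest NOP that still fits."""
    out = bytearray()
    longest = max(NOPS)
    while length > 0:
        chunk = min(length, longest)
        out += NOPS[chunk]
        length -= chunk
    return bytes(out)


def align_padding(offset, alignment=16):
    """Padding that brings `offset` up to the next multiple of `alignment`."""
    return nop_padding(-offset % alignment)


pad = align_padding(0x1003)
print(len(pad), pad.hex())  # 13 bytes: one 9-byte NOP plus one 4-byte NOP
```

The same 13-byte gap filled with 0x90 would be thirteen instructions instead of two.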


Interesting, I didn't know loop alignment was a thing. I guess it makes sense since code is just data, too.


Absolutely. Furthermore, if the loop is small enough it can execute from uop cache and the instruction fetcher/decoder can be powered down for a turbo boost.


I was involved with a compiler pre-2006, and it's a fairly obvious observation that the Intel and AMD instruction decoding of the day was happier with multi-byte nops over multiple nops, for different reasons, but still faster for both.


http://stackoverflow.com/questions/4798356/amd64-nopw-assemb...

http://patents.com/us-9330011.html Microprocessor with integrated NOP slide detector, May 3, 2016 - VIA Technologies, Inc


Needs a 2006



