Multi-byte NOP opcode made official

yuhong · on Aug 26, 2016

I wrote about long nops in http://www.agner.org/optimize/blog/read.php?i=25#82

I had a private email thread with H. Peter Anvin (formerly of Transmeta) about this.

I think there was an Intel patent about them.

MarkSweep · on Aug 27, 2016

Raymond Chen has a nice article about NOPs that had to be used on the 386 in Windows 95:

https://blogs.msdn.microsoft.com/oldnewthing/20110112-00/?p=...

dingo_bat · on Aug 27, 2016

> For example, there was one bug that manifested itself in incorrect instruction decoding if a conditional branch instruction had just the right sequence of taken/not-taken history, and the branch instruction was followed immediately by a selector load, and one of the first two instructions at the destination of the branch was itself a jump, call, or return.

My brain hurts imagining how they figured out and verified that this bug exists.

jakub_h · on Aug 27, 2016

Judging from the Matt Dillon case, a lot of "nah, the bug must be in my code" must have been involved.

dingo_bat · on Aug 27, 2016

What is the Matt Dillon case?

jakub_h · on Aug 27, 2016

The guy who was looking half a year for a bug in his code and then half a year for a bug in GCC (or something like that) only to find out that it was a bug in an AMD CPU.

wumpus · on Aug 29, 2016

I used crashme to discover a bug in the Broadcom Sibyte CPU that depended on branch prediction history.

dingo_bat · on Aug 29, 2016

I'd love to read details about your experience. Have you had a chance to write it up?

wumpus · on Aug 29, 2016

I never wrote it up. The only weird thing about it was that unmodified crashme wouldn't run because of a prefetch bug in the silicon, plus there was a debugging trap that wasn't handled in the kernel. I added a trivial handler for the trap and pre-processed the randomly-generated crashme code to nop prefetches... 15 seconds later, I saw the hang.

gavinpc · on Aug 27, 2016

Man, that guy has some war stories—and an incredible memory.

robert_tweed · on Aug 26, 2016

Intriguing, but the original posts are from 2006 with a bump from 2008 and definitive answers are never given, just "this may require an NDA to discuss."

If you care about optimising NOPs, you're probably writing a compiler, so I am curious whether these instructions have found their way into any mainstream compiler such as GCC or Clang. Does this post explain some odd compiler behaviour?

Has newer information been published making this anything other than abuse of an undocumented quirk? I.e., liable to blow up on new processors, as it does on the Pentium MMX, according to the second post.

I mean, it's interesting but I'm not sure why it's here.

raverbashing · on Aug 26, 2016

Yes, major compilers use the multi byte nop. GCC, clang, MS compilers, etc

JoshTriplett · on Aug 27, 2016

> If you care about optimising NOPs, you're probably writing a compiler

Or low-level systems code. For instance, the Linux kernel makes extensive use of multi-byte NOPs.

cvs268 · on Aug 27, 2016

I remember gaining a fair bit of bounty recommending multi-byte NOPs for optimisation (use instead of multiple single-byte NOPs).

http://stackoverflow.com/a/18279617/319204

As part of researching this, I discovered an empirically verified list of NOPs (all the way from 1 byte to 10 bytes each).

https://android.googlesource.com/toolchain/binutils/+/f22651...

pwdisswordfish · on Aug 27, 2016

> pipeline always stalls on a conditional far jump.

That doesn't make sense. Segment:offset jumps are always unconditional.

adrianratnapala · on Aug 27, 2016

Maybe it's because there is a possibility the page table has done something wicked to the destination.

pwdisswordfish · on Aug 27, 2016

Have you read my comment?

WhitneyLand · on Aug 27, 2016

I used multiple NOPs all the time when learning to code 6502 on the VIC-20. With no assembler and typing in hex-valued opcodes by hand, it was useful to leave gaps during development to avoid retyping everything too often.

6502 actually had pretty good performance at the time on a per cycle basis.

drbw · on Aug 27, 2016

I'm somewhat surprised that I can remember that the NOP opcode was EA. Amazing what sticks in your mind for almost forty years...

phkahler · on Aug 27, 2016

>> I'm somewhat surprised that I can remember that the NOP opcode was EA. Amazing what sticks in your mind for almost forty years...

poke 19215,25. It was used in MS BASIC on the Interact computer (1978/9) to enable peek and poke. Oddly, the poke instruction would execute and THEN error, but poking this one value disabled the error. There was also a set of 3 pokes that disabled the address checking that was used to prevent people from reading the system ROM and the BASIC interpreter itself.

I believe I still have quite a collection of the "Interaction" newsletter that was published in the Detroit area about this machine, which is where the above info came from. Yeah, the stuff we remember...

13of40 · on Aug 27, 2016

I know I'm being stupid somehow, but why isn't 0x90 0x90 0x90 0x90 0x90 0x90 0x90 0x90 0x90 a good enough 9-byte NOP? It's not like a modern CPU is going to spend a cycle individually fetching and pondering over every 0x90...

phire · on Aug 27, 2016

Modern Intel CPUs can decode 4 instructions per cycle. 9 single-byte NOPs in a row would therefore take 3 cycles (though the last cycle will overlap with up to 3 more instructions).

A multi-byte NOP counts as one instruction, taking just one cycle to decode and leaving room for up to 3 more instructions to decode in that cycle.
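To put numbers on that, here is a rough sketch of the arithmetic in Python. It assumes a deliberately simplified model in which the only limit is a fixed number of instructions decoded per cycle (4, as above); fetch bandwidth, the uop cache and everything else are ignored. The function names are made up for illustration.

```python
import math

def decode_cycles(instruction_count: int, decode_width: int = 4) -> int:
    """Cycles needed to decode a run of instructions, assuming the decoder
    handles at most `decode_width` instructions per cycle (simplified model)."""
    return math.ceil(instruction_count / decode_width)

def spare_slots(instruction_count: int, decode_width: int = 4) -> int:
    """Decode slots left in the final cycle for whatever instructions follow."""
    used_cycles = decode_cycles(instruction_count, decode_width)
    return used_cycles * decode_width - instruction_count

# Nine 1-byte NOPs are nine instructions; one 9-byte NOP is one instruction.
print(decode_cycles(9), spare_slots(9))
print(decode_cycles(1), spare_slots(1))
```

Under this model both paddings leave three slots free in their last cycle, but the single-byte version spends two extra cycles getting there.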

13of40 · on Aug 27, 2016

Upvoted - thank you for the answer. However, I feel I can't get by without empirically testing it in the morning...

userbinator · on Aug 27, 2016

The 0F 18 has been known for a long time, and they are not "true NOPs" but PREFETCH instructions:

http://x86.renejeschke.de/html/file_module_x86_id_252.html

pbsd · on Aug 27, 2016

0F 1A and 0F 1B have recently become memory bound verification instructions, as part of MPX. So really, only 0F 1F is likely to be long-term safe to use as a multibyte NOP.

Likewise, with the Pentium 4, REP NOP (F3 90) became PAUSE, the spinlock waiting hint instruction, so one needs to be careful about prefixes with the old NOP as well.

xorblurb · on Aug 27, 2016

Although PAUSE only has a performance impact, that was part of its design (so you need only one binary for your spinlock, and it executes correctly even on processors that don't know anything about PAUSE).

yuhong · on Aug 27, 2016

I think they were introduced in the P6 as NOPs, and some of them (but not all of them!) later turned into PREFETCH instructions.

revelation · on Aug 26, 2016

The thread seems to say nothing about them being faster. I guess if they were faster they would be quite useful for trampolines.

Not sure you should ever use a multi-byte instruction for alignment, but then you shouldn't use NOP for alignment in general. That's what 0xCC is for.

gliptic · on Aug 26, 2016

Since 0xCC is an interrupt, it's only useful for padding that isn't executed. If you want to align loops, you need NOPs. Single-instruction, multi-byte NOPs are certainly preferred by modern compilers.
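For illustration, a minimal Python sketch of how an assembler might pick executable padding. The byte sequences are the recommended multi-byte NOP encodings from Intel's documentation for the NOP instruction (the 0F 1F forms); the function names and the greedy longest-first strategy are my own assumptions, not what any particular compiler actually does.

```python
# Recommended NOP encodings by length (1 to 9 bytes), per Intel's NOP documentation.
NOPS = {
    1: bytes.fromhex("90"),
    2: bytes.fromhex("66 90"),
    3: bytes.fromhex("0F 1F 00"),
    4: bytes.fromhex("0F 1F 40 00"),
    5: bytes.fromhex("0F 1F 44 00 00"),
    6: bytes.fromhex("66 0F 1F 44 00 00"),
    7: bytes.fromhex("0F 1F 80 00 00 00 00"),
    8: bytes.fromhex("0F 1F 84 00 00 00 00 00"),
    9: bytes.fromhex("66 0F 1F 84 00 00 00 00 00"),
}

def nop_padding(length: int) -> bytes:
    """Return `length` bytes of padding using as few NOP instructions as possible
    (hypothetical helper: greedily emits the longest available NOP first)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while length > 0:
        chunk = min(length, max(NOPS))
        out += NOPS[chunk]
        length -= chunk
    return bytes(out)

def padding_for_alignment(offset: int, alignment: int = 16) -> bytes:
    """Padding needed so that code placed after `offset` starts on an
    `alignment`-byte boundary (e.g. the top of a loop)."""
    return nop_padding(-offset % alignment)

# A loop that would otherwise start at 0x1007 needs 9 bytes of padding,
# which fits in a single 9-byte NOP.
print(padding_for_alignment(0x1007).hex(" "))
```

Padding that is never executed (between functions, say) can still be 0xCC; this only matters for bytes the CPU will actually run through.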

revelation · on Aug 26, 2016

Interesting, I didn't know loop alignment was a thing. I guess it makes sense since code is just data, too.

scaramanga · on Aug 27, 2016

Absolutely. Furthermore, if the loop is small enough it can execute from uop cache and the instruction fetcher/decoder can be powered down for a turbo boost.

wumpus · on Aug 26, 2016

I was involved with a compiler pre-2006, and it's a fairly obvious observation that the Intel and AMD instruction decoding of the day was happier with multi-byte nops over multiple nops, for different reasons, but still faster for both.

based2 · on Aug 27, 2016

http://stackoverflow.com/questions/4798356/amd64-nopw-assemb...

http://patents.com/us-9330011.html Microprocessor with integrated NOP slide detector, May 3, 2016 - VIA Technologies, Inc

foota · on Aug 27, 2016

Needs a 2006