RISC instruction sets I have known and disliked (jwhitham.org)
104 points by jsnell on May 1, 2016 | 74 comments



Historical nit: I always considered the CDC 6600 to be the first commercial RISC machine, although given its strange architecture, I can see that others might disagree. It had multiple floating point and integer processors. An assembly programmer had to be aware of them all. I would not write two FP divides in a row because the second would stall waiting for the first to finish. I could write two consecutive FP multiplies, because there were two FP multipliers. Instruction timings were always a consideration in selecting registers, because you wouldn't want to try using a register that was the target of another instruction until that instruction had completed. Fortunately there were interlocks, so that you would get the register contents expected rather than some undefined intermediate state. You always had two or three parallel instruction flows going to take advantage of as many of the 10 or so processors as possible.

Other aspects of the architecture were truly strange. There were no load or store instructions. They were a side effect of setting an address register. I was the lead developer for two of the PL/I compilers for the 6600. Much fun. For those interested in strange architectures, I recommend the Wikipedia article https://en.wikipedia.org/wiki/CDC_6600.


That was Seymour Cray, and his Cray machines were even more RISC-like. The Cray I was a very simple machine; it just had 64 of everything.


> RISC instruction sets like PowerPC are usually expected to be highly regular [..] whereas the CISC style of instruction set is expected to be highly irregular and full of oddities

No. Not true at all. Not even close. CISC instructions are expected to be complex, hence the name. MC68K was pretty regular, NS32032 highly regular, both CISC.

RISC means reduced, not regular, so for example you'd expect memory access to occur only through specific memory access instructions, whereas all arithmetic and other computation deals only with registers.

So CISC was usually more regular, not less. x86 is the exception, because it was just extensions heaped on top of extensions: 8080 8 bit -> 8086 segmented 16 bit + 20 bit addresses -> 80286 protected 16 bit segmented with 24 bit addresses -> 80386 semi-segmented/mostly flat 32/32 bit, etc.


That's a point I haven't seen before. Yes, many x86 haters who tend to like RISC architectures typically praise the M68K for its ISA. Clearly there's something else outside of RISC and CISC. You might have figured it out. Maybe not. Worth thinking on.


True, at least for 16-bit x86: just notice the limitations and special uses of ax/bx/cx/dx, etc.


CISC has always been much nicer to hand-write assembly code in than RISC is. The only reason RISC was able to take off was that most code started to be generated by compilers rather than written by hand in assembly.

Given the advances in modern computer architecture that the author talks about, many of the old advantages of RISC no longer apply. If you're going to be doing out-of-order execution, the extra effort of implementing some extra instructions really isn't important for application processors.

The big advantage that RISC has these days is that fixed-width instructions are easy on the decoder. You can also have variable-width instructions that use UTF-8-esque byte marking to make things easier on the decoder, but x86 doesn't have anything like that. But then again, separating them entirely in the ISA makes things easier on the designers.

Oh, and it's a bad idea to touch memory multiple times in a single instruction on a modern machine, but Intel's optimization manuals warn you not to do that and compilers abide by those warnings. If you want an ISA feature that's really hard to design into a high-performance uArch, there's memory-indirect addressing, but unlike most CISC ISAs x86 managed to avoid that one.

The advantages of RISC might be overblown in some sense, but a lot of new instruction set architectures have been developed in the 20 years since the RISC/CISC debates were raging. Many of those have been weird in various ways, but almost all look a lot more like RISC instruction sets than CISC instruction sets, and it's not just because people are following the herd.

And I've got the sense that my inside view of the issue is underestimating how advantageous RISCishness is. When ARM had the opportunity to redesign their ISA as they transitioned to 64 bits, they simplified it quite a bit and increased the number of registers from 16 to 32, basically making their ISA much more similar to a classical RISC design. I don't really understand why they thought that going that way was an advantage, but it seems like the people actually involved with designing these things, instead of just thinking about them in their armchairs, still think that RISC has a lot of advantages.


> CISC has always been much nicer to hand-write assembly code in than RISC is.

I don't see this. Do you have any examples? x86 was only "nicer" for the first 10 hours of assembly—and if you're writing your own (dis)assembler, it's about 10x harder than any RISC. It's also much harder to compute cycle time, which is the only reason I can think of to be staring at assembly for multiple hours (as opposed to e.g. C).

It's also much harder to get "good" at x86, and it mostly consists of learning the subset of x86 that is actually optimized in the way you might expect. The vast majority of instructions you shouldn't be using at all.


x86 was designed back when people wrote significant chunks of code in assembly not because they wanted to shave some time off a critical subroutine, but because they didn't even have a compiler. Under those conditions, I'd rather have x86 than PowerPC or ARM, it's just less typing.


While this is true, that form of coding is... dead. It doesn't produce better compilation backends—if you're working with strings, chances are the compiler will ignore all of the string instructions except LODS/LODSB/LODSW/LODSD.


> The big advantage that RISC has these days is that fixed width instructions are easy on the decoder.

One thing that I've wondered is how much more effort is needed to decode variable-width instructions. Decoding itself sounds fairly easy (but frankly I don't know any details), to the point that the amount of time needed for loading/storing/calculating overwhelms that of decoding. But decoding has to happen extremely fast to fill the pipeline, so the speed might still matter. Can decoding instructions be an actual bottleneck?


Oh, my, yes, it can become a bottleneck. Disclaimer: it has been a good many years since I was privy to the innards of an x86.

In the x86, it is possible for an instruction to be from 1 to 15 bytes long. (Maybe more today? It was 15 when I cared.) All you can tell from looking at the first byte is that the instruction is either one byte or longer than one byte. All you can tell from the 2nd byte is that it is either 2 bytes or longer than 2 bytes, and so on. When you walk all the way out to the 15th byte, you might find a MOD/RM field, which may contain invalid combinations. Only then do you have enough information to raise (or not) the illegal instruction exception. That is one very nasty equation.

Just one example of how variable-length instructions can become annoying to a logic designer. OTOH, some machines are very regular in how instruction length is specified -- in IBM 370 code, for instance, you can look at the first 2 bits and know the instruction length. x86 is an example of organic accumulation of features over time leading to a large collection of special cases.
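
For concreteness, here's a rough sketch in C of the 370-style rule (a toy function, assuming I'm remembering the encoding correctly): the top two bits of the first byte say 00 for 2 bytes, 01 or 10 for 4 bytes, and 11 for 6 bytes.

    #include <stdint.h>

    /* Toy sketch: System/370-style instruction length from the first byte.
       00 -> 2 bytes (RR), 01 or 10 -> 4 bytes, 11 -> 6 bytes (SS). */
    static unsigned s370_insn_length(uint8_t first_byte)
    {
        switch (first_byte >> 6) {
        case 0:  return 2;
        case 3:  return 6;
        default: return 4;
        }
    }

No walking byte by byte, no MOD/RM corner cases.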


The majority are below 4 bytes though, and ModRMs are either the 2nd or 3rd (in case of 0F escape or other prefix) byte. The 15-byte limit still applies, and is very rarely approached. As I understand it, modern x86 decoders can handle (multiple of) the smaller instructions in one cycle, while longer ones take a cycle or two more.


Intel and AMD approach this differently. Intel decodes a few instructions ahead of execution, and sometimes decodes speculatively. AMD at one time was expanding an entire cache line to fixed length instructions and executing the decoded form.

x86 allows you to store into code, even immediately ahead of execution. This made sense in the 1970s when Harry Pyle designed the instruction set and CPUs were slower than memory. Superscalar CPUs have to support this. But, since almost nobody does that any more, they don't do so efficiently. Storing into code near execution causes an exception event, flushing all the superscalar lookahead and backing up to just before the instruction doing the store into code. Then the code gets modified, and the pipeline reloads, having lost tens to hundreds of cycles.


> Can decoding instructions be an actual bottleneck?

Yes. We have processors which can execute 4 or more instructions in parallel (if I'm reading http://www.anandtech.com/show/6355/intels-haswell-architectu... right, the processor I'm using to type this message can start the execution of up to 8 microinstructions in parallel). You need to decode the instructions fast enough to keep up.

Since the clock is the same, you basically need several decoders in parallel. But with variable-length instructions, you have to know the length of the first instruction so the second decoder knows where to start; you have to know the length of both instructions so the third decoder knows where to start; and so on. The x86 architecture is a worst-case of a variable-length architecture: take a look at http://wiki.osdev.org/X86-64_Instruction_Encoding and think how you would determine the length of an arbitrary instruction.

High-performance x86 implementations have to do all kinds of crazy tricks. An extra pipeline stage solely to figure out the instruction lengths (see http://www.anandtech.com/show/6355/intels-haswell-architectu...), extra tags in the instruction cache to mark the instruction boundaries, decoding the instruction lengths while loading the instruction cache, caching already decoded instructions, and so on.

Contrast this with for instance RISC-V with the compressed instructions extension, where you have to examine just two bits on each instruction to figure out if it's a 32-bit or a 16-bit instruction. I'd have to look up the encoding for Thumb-2, but I'd expect it to be something equally simple. Make it simple enough, and you might be able to split the instructions and decode them in the same pipeline stage.
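
To make that concrete, here's a sketch in C of the RISC-V check (ignoring the reserved encodings for instructions longer than 32 bits):

    #include <stdint.h>

    /* RISC-V with the "C" extension: if the low two bits of the first
       16-bit parcel are 11, it's a 32-bit instruction; anything else
       (00, 01, 10) is a 16-bit compressed instruction.  Longer formats
       are reserved and ignored here. */
    static unsigned rv_insn_length(uint16_t first_parcel)
    {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }

That's the sort of thing you can afford to replicate once per decode slot.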


> The x86 architecture is a worst-case of a variable-length architecture

I'm guessing you haven't seen VAX. The first byte isn't even organised in any discernible pattern, so there are both rare and very common instructions there, and operands are specified using a very flexible system that makes length decoding far more difficult than x86.

In contrast, x86 has a mostly consistent 2-3-3 octal-based encoding, and having the first 2-3 bytes is usually enough to decode the instruction's length:

http://reocities.com/SiliconValley/heights/7052/opcode.txt


The VAX is an interesting case. While it's CISC, the instruction set is, oddly enough, very regular with operands following a common structure. Each operand has an initial byte describing the location, plus some additional bytes for displacements and indexing.

Aside from the CASE statement (yes, the VAX has a table jump instruction) that can be (if I calculated correctly) up to 65,558 bytes in size, the next longest instructions are the six-operand ones that can be (again, if I calculated correctly) 43 bytes in size. Two-operand register-to-register operations (and some indexed-register operations) take 3 bytes (one for the opcode, one for each register). Once you get used to it, it's pretty easy to read the actual binary code.


LuaJIT recently gained an x86 instruction length decoder. Check it out: https://github.com/LuaJIT/LuaJIT/commit/73680a5fc760cb39760e...


Wow, I'm surprised that that's faster than the old byte-wise scanning of the instruction stream, and doubly surprised if the benefit isn't negated by the extra cache pollution.

Actually, it's a lot simpler than I expected!


It's probably not; but it does fix a bug where bytes got misinterpreted. See http://www.freelists.org/post/luajit/Random-failures-in-comp...


I should have thought of that... of course it's easy to see the code was probably wrong in hindsight. Thanks for the explanation.


Yes, it's certainly a concern. There are ways to decode lots of instructions at once in a clock cycle, but they take lots of extra transistors and more power. "Lots" here is on the order of 5% or so of the power budget compared to an ISA with better encoding, so it's not a decisive advantage, but it's something you notice as a designer.

And when balancing a CPU core you really want to make the front end wider than your execution resources would require so that you recover from branch mispredicts quickly and refill your various OoO buffers fast. x86 processors tend to do this less than, e.g., POWER because x86 decode is expensive.


5%?! This isn't 2006, it's 2016. Decode is annoying, but it's a drop in the ocean compared to lighting up the memory stack.


It's unfortunate he didn't touch more on the DEC Alpha AXP. It was designed from the start to be a 64-bit chip, unlike most 64-bit ISAs in use today.

Sure, it took them a bit to be convinced that single-byte loads and stores were worthy of dedicated instructions.

It was designed so that most of your kernel code wasn't actually running in a privileged CPU mode, but instead made upcalls to PAL code, a sort of super lightweight hypervisor that emulated however many rings of protection the kernel needed. (Ultrix and Linux needed 2 rings. VMS ran on another set of firmware that emulated more rings.)

It was a nice clean design that was running at 500 MHz back when Intel could manage 200 MHz. Its memory model was more friendly to parallel execution (and less friendly to compiler and JIT writers) than the x86 memory model, forcing weaker consistency guarantees out of the JVM memory model as a result.

It seems a shame to me that the architecture was never revived. I'd like to hear more about its quirks and flaws.


The IA64 (Itanium) series on Old New Thing seems to highlight a very crazy instruction set and architecture:

https://blogs.msdn.microsoft.com/oldnewthing/20040119-00/?p=...

https://blogs.msdn.microsoft.com/oldnewthing/20150805-00/?p=...


Kinda seems like the compiler just shouldn't allocate r0 for inline assembly on PPC, since it's only valid in special circumstances. Hard to fault the ISA a lot since this is basically the compiler backend author(s) missing a corner case, which is quite easy to do considering the breadth of a compiler backend.


Yeah, that's not about the ISA, that's just Evidence That GCC's Inline ASM Functionality is a Mess #938292721

See also: http://free-electrons.com/blog/how-we-found-that-the-linux-n...

See also: http://robertoconcerto.blogspot.ca/2013/03/my-hardest-bug.ht...


Alternately it could be seen as evidence that PowerPC assembly syntax is a mess. For anyone who doesn't know, the way it works with typical PowerPC assemblers is that instructions take unadorned numbers for all arguments, and determine whether they refer to registers or immediates based on the instruction: "li 1, 2" sets R1 to the immediate 2 ("load immediate"), while "mr 1, 2" sets R1 to the value of R2 ("move register"). And then because people find bare numbers confusing, you have includes that do "#define r1 1" or equivalent for each register, so when writing assembly manually you can write "mr r1, r2". But because these are just dumb macros, nothing stops you from writing "li r1, r2" - the assembler will just macro expand r2 to 2 and treat it as an immediate!

Other architectures have the R prefix as an intrinsic part of the syntax, so if you write R2 in a slot where the instruction requires an immediate, you'll just get an error. If PowerPC did that, you'd still need to remember the right inline assembly constraint letter for GCC, but getting it wrong would 'just' result in an unpredictable compile error when the compiler decided to use r0, not silent misbehavior.


At least with GNU as you can use %r1, %r2 etc. as an "intrinsic part of the syntax". Which means you can't use a register name where an immediate is expected.

However, that doesn't fix the gotcha with r0 being special, which is specified in the ISA. In fact it's that way precisely so you can load an immediate without needing a separate opcode.
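
To spell out the special case (a sketch in C of my reading of the ISA): in D-form loads and stores an RA field of 0 means "a base of zero", not "the contents of r0", and the same convention is what makes addi with RA=0 act as li.

    #include <stdint.h>

    /* Sketch of the effective-address rule for a 32-bit PowerPC D-form
       load/store: EA = (RA|0) + sign-extended displacement.  RA == 0
       selects a zero base rather than reading r0. */
    static uint32_t dform_effective_address(const uint32_t gpr[32],
                                            unsigned ra, int16_t d)
    {
        uint32_t base = (ra == 0) ? 0 : gpr[ra];
        return base + (uint32_t)(int32_t)d;
    }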


Huh, never knew that... but I just tried it and GAS (the version Debian installed as powerpc-linux-gnu-as, at any rate) accepted "lwz %r0, %r5(%r0)". Snatching defeat from the jaws of victory...

It would still be a gotcha, but a pretty minor one if messing it up just resulted in an error. I suppose the approach taken by AArch64 and others is preferable, where one register is just completely reserved as constant 0 rather than only in some encodings.


I second that. We've had some issues with gcc's inline assembly support. To make matters worse the list of things that it does and doesn't support is very fuzzy.


A few comments:

Yes, MIPS assembly is terrible, though I hear that the newer ISA revisions are better (the one I worked with most heavily was the 5k, and systems programming on it was very unpleasant).

Power and MIPS also both now have Thumb-2-like instruction encodings, POWER VLE and MIPS16e, respectively. It turns out that code size matters.

Lastly, RISC no longer means what it used to. It is basically used today to just mean a load-store architecture, as you now have variable-length instructions, multi-cycle arithmetic instructions, and out-of-order superscalar chips labeled "RISC".

When IBM came out with the POWER ISA, I seem to recall that one of the authors of the abacus book claimed it wasn't simple enough to be RISC, which is a quaint thought these days.


It's sort of funny that ARM, which is really the CISCiest of the old RISC instruction sets, has ended up being the most successful one.

http://userpages.umbc.edu/~vijay/mashey.on.risc.html


> ARM, which is really the CISCiest of the old RISC instruction sets

You can replace 'is' by 'was' since ARMv8.


IMO the success of CPU lines in recent years has had almost nothing to do with the intrinsic properties of the instruction set.

I don't blame MIPS for doing delayed branching etc., because if most machine code is compiled from something else, and if most compilers use decent abstractions, one should be changing ISAs all the time to adapt to and bolster the latest and greatest implementation techniques. (E.g. for out-of-order superscalar, it's probably best to give the CPU some sort of dependency graph.)

The focus on hand-coding as a way to get to know the architecture, on the other hand, borderline insinuates that instruction sets should optimize for hand-coding, which is just plainly ridiculous.


On the other hand, no one wants to have to recompile everything all the time, which gives much force to the argument that CPUs should be more CISC, so that the same (complex) instructions will simply run faster due to hardware improvements. REP MOVS on x86 is a great example of this; it was originally the fastest way to do a block copy until around the Pentium when it lost (only slightly) to very large custom unrolled loops, but since ~P6 it has been internally optimised to copy whole cache lines at once and in the very latest microarchitectures it is once again the fastest.
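
For anyone who hasn't seen it, a REP MOVS copy is a one-liner even from C with GCC-style inline asm (a rough sketch, not tuned code; it relies on the ABI guarantee that the direction flag is clear):

    #include <stddef.h>

    /* rep movsb copies RCX bytes from [RSI] to [RDI].  The "+D"/"+S"/"+c"
       constraints mark those registers as read and written, and the
       "memory" clobber stops the compiler from caching values across
       the copy. */
    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }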


Well, as a NixOS user (which granted way post-dates RISC), I get all the benefits of constant recompilation without burning any of my own CPU cycles.

Very good point on the `REP MOVS` front (and cool story!). Indeed, if the Mill pans out it would instantly usher in a renaissance for branch delay.

So yeah, I'd rather recompile than try to predict future architecture trends, but either way, the grossness should be there for performance, not hand-coding ease.


> x86 is not particularly nasty

x86's terrible reputation is well deserved. x86-64 fixes a number of problems, but for most of x86's lifetime we've had to live with 8 registers and the lack of IP-relative addressing...

Let's not forget how awful x87 floating point was, either!


IMHO "amd64" (as it should really be called, since AMD came up with it) could've been a lot more orthogonal, more like the 16 to 32-bit extension that came with the 386. In practice, it's really 8 registers "and 256 bytes more" since the area around the stack pointer will be cached and accesses there can be just as fast.

On the other hand, one of the things I like about x87 is that it's extremely dense because it's stack-based and RPN-ish. Here's 256 bytes of x87 awesomeness: http://www.pouet.net/prod.php?which=53816


I was curious why for the Xeon Phi / Larrabee they didn't drop all the 32-bit instructions since you would seriously have to recompile all your code anyway.


I'm not familiar with Larrabee.

But it looks like it's x86 plus SIMD. If so, then Intel probably has literally thousands of man-years of validation suites for x86. If you remove some of the x86 instructions, how much of that test code do you break?

The Larrabee designers probably wanted to focus on the SIMD and on putting bunches of cores onto a single die. They didn't want to reimplement (or even fuss with) the x86 part. That's not what they were interested in.

Leaving x86 alone means the chip runs Windows, Linux, etc. w/o any further effort. Or does it? Like I said, I don't know the architecture.

What do you gain by breaking that? It's probably a very small part of the silicon area anyway.


I'm confused: why would IP-relative addressing be useful? Have you got some interesting examples?


It's the basis of efficient relocations in position independent code, which is now very common.

PIC can be emitted on older x86 machines without RIP-relative addressing, but the code is larger and slower. As an example, consider -m32 gcc output for the C program

    int x;

    int getx() {
        return x;
    }
With no PIC:

    getx:
	movl	x, %eax
	ret
With PIC:

    getx:
	call    __x86.get_pc_thunk.cx
	addl    $_GLOBAL_OFFSET_TABLE_, %ecx
	movl    x@GOT(%ecx), %eax
	movl    (%eax), %eax
	ret
And with -m64, which emits PIC and uses x86-64's RIP-relative addressing:

    getx:
 	movl	x(%rip), %eax
	ret
Hopefully that makes the motivation clear.


Thanks, I've seen it so many times it seems I developed (%rip) blindness :) Of course it's useful this way.


For any shared object loaded at an unknown address, it makes it trivially easy to load data that is also in that library.


> However, many of the later RISC architectures do share one annoying flaw. [...] the mechanism for storing 32-bit immediates can only encode a 32-bit value by splitting it across two instructions: a "load high" followed by an "add". [...] This design pattern turns up on almost all RISC architectures, though ARM does it differently (large immediates are accessed by PC-relative loads).

He neglects to mention the ARM's clever approach to this:

https://alisdair.mcdiarmid.org/arm-immediate-value-encoding/


The ARM approach is clever, but you only load 8 bits. Suppose you need a 32-bit constant - it's going to take 4 instructions. Compared to the halfword instructions, you do win with certain awkward constants such as 3<<15. That's why neither approach is as good as having variable-length instructions ;)

This made me wonder how common such constants are. So I grabbed some code I've been working on recently, for which I happened to have assembly language output, and searched for every immediate constant. (This must be the first time I've found a good use for gcc's nasty AT&T x64 syntax.)

My code targets x64, so take the "analysis" with a pinch of salt. Out of the 8982 instructions that had immediate operands, there were 819 unique 32-bit constants. 762 (93%) were high-halfword or low-halfword only, so they could have been loaded with one halfword instruction. By comparison, only 392 (48%) could have been loaded with one instruction on ARM.

Ten constants were better for ARM, in that they would take two halfword instructions to form, but only one MOV or MVN: ['0x00ffffff', '0x03ffffff', '0x0fffffff', '0x3fffffff', '0x7fffffe8', '0x7ffffffe', '0x7fffffff', '0x80000003', '0xfffffffe', '0xffffffff']. These ten constants were used by 84 instructions out of the 8982.
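
For reference, the basic "fits in one ARM MOV" test is just the rotated-8-bit-immediate check, something like this C sketch (the MVN/complement case is left out):

    #include <stdbool.h>
    #include <stdint.h>

    /* Classic ARM data-processing immediate: an 8-bit constant rotated
       right by an even amount.  Undo each even rotation and see whether
       the result fits in 8 bits. */
    static bool arm_imm_encodable(uint32_t value)
    {
        for (unsigned rot = 0; rot < 32; rot += 2) {
            uint32_t undone = rot ? (value << rot) | (value >> (32 - rot))
                                  : value;
            if (undone <= 0xffu)
                return true;
        }
        return false;
    }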


Modern ARM has load halfword insns too, so you can use those or the 8-bit imm encoding depending on the constant.

For a full 32-bit value prior to movw/movt you'd most likely load it from a constant pool rather than do a 4-insn sequence. (Some 32-bit values can be done with a clever choice of 8imm sequences -- there's an algorithm you can use as a compiler to say "given this value, can I create it in 3 or fewer insns?", which is worth the effort if you're targeting a pre-movw ARM CPU.)


"If you do have the misfortune to have to work with SPARC"

So true... I work with the LEON (GPL implementation of SPARC) and have the occasional very bad day of SPARC asm code.


> At this point, it's probably better to have an efficient instruction encoding, save on memory bandwidth and instruction cache space, and have a comprehensible instruction set. Hence x86.

x86 is full of legacy single-byte instructions and complicated prefixes, hurting space efficiency and making it a huge pain to have high-throughput decoding. You could do a lot better if you took the x86 instruction list and reassigned all the encodings.


> However, many of the later RISC architectures do share one annoying flaw. Immediate values are constants embedded within the instructions. Sometimes these are used for small values within expressions, but often they're used for addresses, which are 32-bit or 64-bit values. On PowerPC, as on SPARC and MIPS, the mechanism for storing 32-bit immediates can only encode a 32-bit value by splitting it across two instructions: a "load high" followed by an "add". This is a pain. Sometimes the two instructions containing the value are some distance apart. Often you have to decode the address by hand, because the disassembler can't automatically recognise that it is an address. This design pattern turns up on almost all RISC architectures, though ARM does it differently (large immediates are accessed by PC-relative loads). When I worked on an object code analyser for another sort of RISC machine, I gave up on the idea of statically resolving the target of a call instructions, because the target address was split across two instructions, one of which could appear anywhere in the surrounding code.

> The x86 system for storing 32-bit/64-bit immediates is much nicer. They just follow the instruction, which is possible because the instruction length is variable. Variable-length instructions are not usually seen in RISC, the Thumb-2 instruction set being the only exception that I know of.

A hybrid way that could be the best of both worlds: https://github.com/trillek-team/trillek-computer/blob/master...

In a few words, it uses a bit to indicate whether the literal is bigger than could normally be stored in a 4-byte instruction. If so, the next 4 bytes are the literal value.


Shouldn't gcc (or a similar helper tool) be able to understand the side effects of the asm instructions and automatically fill in all the clobber flags, instead of manually having to fill in all those crazy =r style markers? Is it not possible to code something that determines all affected registers for a given set of assembly opcodes?


It's certainly possible to do a lot better than GCC's inline assembly setup, which has proven time and again to be a usability disaster that positively encourages writing subtly buggy code.

Microsoft's C compilers do a much better job with inline asm by being more conservative in how they allocate registers around the asm (https://msdn.microsoft.com/en-us/library/k1a8ss06.aspx).

CodeWarrior's PPC compilers remain the ne plus ultra of inline assembly ergonomics, and it's a goddamn shame that LLVM has chosen (for pragmatic reasons) to follow GCC's mediocre lead rather than pursue something akin to it.


I think this stems from the same "lazy design" that made (G)AS' x86 syntax so very unpleasant to work with -- they just decided to make inline Asm literally dump strings out into the compiler's own Asm output, with some printf-like placeholders to be replaced with variable names/register assignments. There's a belief in "strict modularity" (i.e. the compiler doesn't know at all about the inline Asm other than what the programmer tells it explicitly) which probably had some influence too, whereas MSVC et al. decided to spend a bit more effort on making things work in a more integrated fashion.


Clang provides both GCC and MSVC style inline assembly.


Holy crap, I was not aware of that.

edit: this doesn't seem well-documented, if true. The Clang site mostly just talks about GCC compatibility, e.g. http://clang.llvm.org/compatibility.html#inline-asm


One major motivation for inline asm is wanting to use an insn which your toolchain doesn't know about (because it is too new, for instance). Inline asm also has to be able to handle cases like inline system call instructions -- in that case the clobbered registers are determined by the kernel syscall ABI, so it is impossible for a compiler to get them right by just looking at the asm insns.
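
The syscall case is a good illustration. A rough sketch for Linux on x86-64: nothing in the "syscall" instruction tells the compiler that the kernel ABI puts the call number and return value in rax, the arguments in rdi/rsi/rdx, and trashes rcx and r11 -- all of that has to be written down by the programmer.

    /* Sketch of a raw write(2) on Linux/x86-64 via GCC extended asm. */
    static long raw_write(long fd, const void *buf, unsigned long count)
    {
        long ret;
        asm volatile("syscall"
                     : "=a"(ret)                      /* rax: return value */
                     : "a"(1L /* __NR_write */),      /* rax: call number */
                       "D"(fd), "S"(buf), "d"(count)  /* rdi, rsi, rdx */
                     : "rcx", "r11", "memory");       /* clobbered by the kernel */
        return ret;
    }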


Can anyone clarify his claim on MIPS data hazards? I didn't follow that one. To my knowledge MIPS has no special hazards, like a VLIW ISA would have. Is that not correct?


AFAIK, since 1991 (with the MIPS R4000) they have included the necessary interlocks in the integer pipeline. Just check the reference manuals from that age (e.g. compare the R3000 and R4000 manual section on the pipeline).

Or else use a search engine to find sources like this: https://books.google.com.pe/books?id=LL52JBPU4CwC&pg=PA52

The article's author should review recent documentation on the MIPS architecture. I think that the paragraph that starts with the phrase "MIPS is the worst offender" seems ludicrous for people working with modern MIPS implementations (let's say post-1992!).


MIPS, the Microprocessor without Interlocked Pipeline Stages, doesn't make sure the result of one instruction is available before executing an instruction "later" in the stream that refers to that result. Something like (pseudo-asm with C-notation comments)

     ld r1, @r2   ; r1 = *r2
     add r3,r1,r4 ; r3 = r1+r4
wouldn't set r3=*r2+r4 because the memory access hasn't finished by the time the add runs.


This was only in the early versions of the MIPS architecture though (MIPS I, I think?). Later versions required the interlocks, so the add would stall rather than misbehaving, and you didn't need to actually put a nop in the load delay slot (though being able to schedule some useful insn into it was still performance-wise worthwhile). Since MIPS I implementations are a distant memory, in practice this in-retrospect misfeature is now ignorable these days. (In contrast, branch delay slots cannot be forgotten about because you can't backwards-compatibly change the branch insn behaviour; the best you can do is add new branch instructions which don't execute the delay slot insn, which MIPS has also done to some extent.)

Both load delay slots and branch delay slots are allowing the microarchitecture (a simple 3-stage pipeline) to dictate architecture, which is a classic way to store up pain for the future.


The lack of interlocks really surprised me, although the name said so. The CDC 6600 had them two decades earlier. We always carefully scheduled our instructions flows, but it was nice to know that the hardware would catch our goofs.


And this got really fun when superscalar MIPS processors with instruction prefetching came out. They had to introduce a different NOP called SSNOP that stalled all ALUs. Obviously they couldn't just declare "NOP stalls all ALUs", as that would have had serious performance effects in places where NOPs are necessary (e.g. branch delay slots).


I'm curious why they still had NOP as part of the name since it actually did something.


Well NOP does nothing on one ALU, SSNOP does nothing on all ALUs.

And to give you an example of how you had to calculate things:

There were, if my memory serves me correctly, 6 pipeline stages on the 5k, numbered 0-5, plus instruction prefetch, which was numbered -1. You subtracted the stage in which the instruction took effect from the stage in which a subsequent instruction needed to see that effect, and the result was the number of intermediate stages that all ALUs would need to go through.

Worst case scenario would be if you were modifying RAM that would be read as an instruction; it wouldn't take effect until stage 5, and instruction prefetch was stage -1 so you needed to make sure all ALUs were busy for 6 clock cycles. In theory you could do the math to figure out the scheduling for each ALU, but I just dropped 6 SSNOPs in there, since it was a code path that was only hit during loading of a new process, 6 wasted clock cycles was not a concern.

Note that this is unrelated to interlocks, as any Modified Harvard Architecture will require some sort of synchronization when changing the instruction stream. However, most ISAs have a single instruction that stalls the pipeline and discards any prefetched instructions (e.g. isync on Power). They added one in later revisions of the MIPS ISA as well.

Another fun thing was that there was no interrupt-safe way to disable interrupts, as the interrupt-enabled bit was in a word-sized register along with other values that could legitimately be changed by an ISR. This was also fixed by later revisions of the ISA.


Sad that there was no mention / evaluation of RISC-V in this post, which attempts to resolve exactly the problems he identifies...


I don't think RISC-V really addresses the encoding inefficiency problem, except for the "C" extension, sorta. Though I don't think that for OoO superscalar architectures, icache pressure is as much of a problem as it is on a fancy vliw.

But yeah, would be nice to get a take on RISC-V in context of this rant.


Huh?

RISC-V with the compressed extension is incredibly efficient in its encoding. Better than x86 or ARM in both static and dynamic bytes per program.

Also, Icache pressure is a huge problem in modern warehouse-scale computers.

Any processor that cares about performance will almost certainly be implementing the C extension to RISC-V. It also enables more efficient macro-op fusion, turning common two instruction 4-byte idioms into a single, more powerful instruction.


Thanks for going into more detail. I was basing my assumption that it wasn't a huge problem on the fact that the only people who complain about it seem to be the folks designing the Mill. They have a ridiculous/insane/cool solution to it.

Everyone else seems to first mention their cool branch predictor, or vector processor.


Given the lack of a CCR (condition code register) in RISC-V, I doubt that he would be very impressed by its ease of use...


I'm not convinced that's a big deal. You just end up using a register of your choice and sticking a flag in it.


OK, please show me the code to do a long addition or a long multiplication in RISC-V. (long as in 'multiple words')


Here. 64-bit addition on RV32I.

    ; input 1 (msb r1, lsb r2)
    ; input 2 (r3, r4)
    ; output (r5, r6)
    xori    r5, r4, -1
    sltu    r5, r5, r2
    add     r6, r4, r2
    add     r5, r5, r3
    add     r5, r5, r1
This is what I mean. Outside a few applications (mostly asymmetric crypto) nobody cares that it takes five instructions instead of two. Remember that this is the same processor that outright omits multiplication from the core spec.
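
If it helps, the xori/sltu pair is just computing the carry; in C the whole thing is roughly:

    #include <stdint.h>

    /* 64-bit add from 32-bit halves: a carry out of a_lo + b_lo exists
       exactly when ~b_lo < a_lo (unsigned), which is what xori/sltu test. */
    static void add64(uint32_t a_hi, uint32_t a_lo,
                      uint32_t b_hi, uint32_t b_lo,
                      uint32_t *sum_hi, uint32_t *sum_lo)
    {
        uint32_t carry = (uint32_t)~b_lo < a_lo;
        *sum_lo = a_lo + b_lo;
        *sum_hi = a_hi + b_hi + carry;
    }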


I wonder what the author would think of the Mill.


That's "codesign" as in "co-design" not "code-sign".



