The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat (eecs.berkeley.edu)
69 points by ingve on July 10, 2016 | 31 comments



It always starts that way. We don't need a special instruction for X because we're so much faster and more cost effective. And so far, that reasoning has always given way to the need for "just one or two" special instructions to make whatever the killer algorithm of the day is go a bit faster or smoother.

The thing is, since transistors are so cheap and plentiful these days, you either waste them on bigger and bigger on-chip caches or you use them for some special-purpose function (like a fancy instruction or maybe a simple co-processor).

RISC was compelling when it was hard to make transistors, and simple has always been compelling in the quest for correctness, but simple != RISC.


The title is cutting out an important part of my report's story: "Avoiding ISA Bloat with Macro-Op Fusion".

You can still specialize your processor to eke out more performance, but you (often) don't need to change the ISA to do it!

Why add load-pair instructions (like ARMv8 did) when you can simulate the exact same effects using macro-op fusion? And code size doesn't matter, since RISC-V's compressed ISA extension makes RISC-V the densest ISA out there without throwing away performance to get it.
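
To make that concrete, the load-pair idiom is just two loads from adjacent addresses. A rough C-level sketch (my own example; the instructions in the comments are only what you'd typically expect a compiler to emit):

  /* Two adjacent loads: ARMv8 added LDP for this, while on RISC-V a
     fusing decoder can spot the back-to-back loads and issue them as a
     single internal load-pair op -- no new instruction required. */
  struct point { long x, y; };

  long sum_fields(const struct point *p) {
      long a = p->x;   /* e.g. ld a4, 0(a0) */
      long b = p->y;   /* e.g. ld a5, 8(a0) -- adjacent and fusible */
      return a + b;
  }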


Let me start by saying that I really like the RISC-V architecture: it's compact, it's easy to understand, and the fact that it is "open" in the licensing sense is even more awesome. But I also think you've stepped into what I think of as the "known requirements" fallacy.

Fundamentally the fallacy is this: when you look backwards in time at all of the accommodations existing systems made to adapt to changing requirements, you can see a path that those systems missed. It is similar, perhaps, to standing on high ground overlooking a valley and seeing a path that people on the ground do not see. The fallacy is believing that somehow you are a better pathfinder than they are and you would not make their choices in their position.

Whether it was the saturating adds and multiplies that came with DSP requirements, or the SIMD extensions that came with graphics, these systems live in an evolving ecosystem of computation which challenges their architectural invariants again and again. And commercially, at least, this pressure to adapt has so far had a perfect record of overwhelming the core architectural tenets, producing a small wart which later becomes a larger wart, and then a system that, when you look back on it, could have been built differently had they known the requirement that was going to be thrown at it.

There will always be a case to throw out the currently dominant ISA and replace it with one without the warts of the existing system. And the cost of that will be high because it means a very large software migration burden. And it is that cost which allows the warts to exist in the first place.

I think you have done some good work here. I particularly like the analysis of the cost/benefit of adding instructions vs. macro-op fusion. I believe that will be a useful tool for doing architecture analysis going forward. I also found myself strongly rejecting the notion that that particular implementation of "macro op fusion" was sufficient to inoculate your ISA from any future changes (which is implied by the abstract and endorsed by the conclusion).


> The fallacy is believing that somehow you are a better pathfinder than they are and you would not make their choices in their position.

RISC-V has two huge advantages going for it: 1) the authors got to learn from 40 years of mistakes, and 2) they recognized how important it is for the ISA to be agnostic to the micro-architecture (I'm amazed at how, for some ISAs, I can tell you details such as the number and types of ports on the register file just from reading the ISA manual!). I don't think the RISC-V authors pretend they would not have made the same mistakes back then; rather, they are annoyed that the same mistakes keep being made today! ARMv8 is a newer ISA than RISC-V and yet I believe it violates #2 (why are its branches 8 bytes?), and the same SIMD design mistakes are still being made over and over again.

> I also found myself strongly rejecting the notion that that particular implementation of "macro op fusion" was sufficient to inoculate your ISA from any future changes (which is implied by the abstract and endorsed by the conclusion)

I don't mean to imply that there are no new instructions that will ever need to be added! I think popcount is an example that is very painful to handle in software, and the idiom in assembly is probably too big to handle via macro-op fusion. I'm also very excited to eventually have a vector ISA in RISC-V.
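
(For anyone curious, the usual software idiom is the SWAR bit count below -- plain C, nothing RISC-V specific. It's short in source but expands to a longish dependent chain of instructions, which is why fusing it isn't realistic:)

  #include <stdint.h>

  /* Classic SWAR popcount: a serial chain of shifts, masks, adds, and a
     multiply -- far more than the two or three adjacent instructions a
     fusion unit can realistically pattern-match. */
  static inline uint64_t popcount64(uint64_t x) {
      x = x - ((x >> 1) & 0x5555555555555555ULL);
      x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
      x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
      return (x * 0x0101010101010101ULL) >> 56;
  }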

Rather, my abstract is fighting against a whole slew of instructions that exist in current ISAs (or that people want to add to RISC-V) that don't need to be there. The load-pair/load-increment types that you see in ARMv8 come to mind.


Fair enough, but have you asked the ARMv8 architect [1] why he added them? He mentioned at ARM TechCon that there was a lot of thought that went into each change they made.

[1] Richard Grisenthwaite, Lead Architect and Fellow. ARM -- https://www.arm.com/files/downloads/ARMv8_Architecture.pdf


I would love to chat with the ARM guys and pick their brains. I suspect for a few things (like unfused branches) they wanted to match the existing micro-ops they were already generating since they would still have to support ARMv7!

But I'd also like to know why they didn't have things like AMOs (and instead had to add them in v8.1). And lastly, I would like to know why there is no Thumb for AArch64, and why it isn't the default ARMv8 behavior!


"And code size doesn't matter, since RISC-V's compressed ISA extension makes RISC-V the densest ISA out there without throwing away performance to get it."

That doesn't rule out that there is some optimal compressed ISA that is a lot better than what RISC-V provides.

As a truly weird idea: if a subroutine were tagged with the number of working registers it needs, instructions referencing registers inside that subroutine could be shortened (ideally, for memory usage, with arithmetic coding, but shaving off integral bits would probably be easier on the decoder). Tagging need not take much room, as it could be applied to whole address ranges (MMU blocks, or even larger ones; for example, one could sacrifice the most significant bits of a 64-bit address range to it) if the linker or the loader moved subroutines with equal register usage together (though performance would suffer because of increased cache misses).

Yes, that's a weird idea, and I would guess the gain in instruction density would not be enough to offset the extra cache misses and the added decoder complexity, and would not be particularly large to start with, but there may be real improvements. Maybe, one day, it will be efficient to gzip compress basic blocks or cache lines using a prefilled decoding table computed by the compiler?
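
If it helps, here is the back-of-the-envelope arithmetic for the tagging idea (the register counts are arbitrary examples):

  #include <math.h>
  #include <stdio.h>

  /* Bits needed per register field if a subroutine is tagged as using
     only n registers, versus the 5 bits needed to address all 32. */
  int main(void) {
      int counts[] = {5, 9, 12, 24};
      for (int i = 0; i < 4; i++) {
          int n = counts[i];
          printf("%2d regs: %.2f bits ideal (arithmetic coding), %d bits integral, 5 bits today\n",
                 n, log2(n), (int)ceil(log2(n)));
      }
      return 0;
  }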


> That doesn't rule out that there is some optimal compressed ISA that is a lot better than what RISC-V provides.

There's an entire PhD thesis on the RISC-V compressed ISA (RVC) and it shows that RVC's compression comes fairly close to just using bzip on the binary!

> instructions referencing registers inside such a subroutine could be shortened

The compressed instructions already use 3-bit register specifiers (covering the eight most-used registers)... there's not much more to do here. At this point, either your binary fits in the icache and you're happy, or it doesn't and no amount of 1-5% ideas is going to help you.


"There's an entire PhD thesis on the RISC-V compressed ISA (RVC) and it shows that RVC's compression comes fairly close to just using bzip on the binary!"

Thanks for the info, but I find it hard to believe that, so I would appreciate a link. Wouldn't that imply that we can create a compressor that is fairly close in quality to bzip and that is randomly addressable in O(1)?


Can you recognize and automatically perform macro-op fusion on AES computations? How about SHA256? Ok, let's make it easy -- CRC32?
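
(For reference, a plain-C bitwise CRC-32 looks like the sketch below. The inner loop is a serial chain of shifts and XORs per input bit -- exactly the kind of thing a pattern-matching fusion unit can't collapse the way a dedicated CRC32 instruction does:)

  #include <stddef.h>
  #include <stdint.h>

  /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320), unoptimized on
     purpose: eight dependent shift/xor steps per input byte. */
  uint32_t crc32_bitwise(const uint8_t *buf, size_t len) {
      uint32_t crc = 0xFFFFFFFFu;
      for (size_t i = 0; i < len; i++) {
          crc ^= buf[i];
          for (int b = 0; b < 8; b++)
              crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
      }
      return ~crc;
  }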


Oh man, that was fast. I didn't tell anybody I had released this yet, and here it is!

Uhhh, AMA?


First, I am happy to see that you are attacking the correct problem. Everything interesting about computer architecture can be answered by asking two questions: 1) How is memory bandwidth used? (Seymour Cray lived and breathed that one every day), and 2) How is die area used? Achieving good code density is a double win -- less bandwidth for moving code into the I-cache, and reduced die area expended on I-cache to achieve a given amount of functionality.

I'm curious how you would relate your work to WISC (Writable Instruction Store Computers) and how you view WISC machines?


> I'm curious how you would relate your work to WISC (Writable Instruction Store Computers) and how you view WISC machines?

You know, I've never heard of WISC machines before. I think the idea of RISC, though, is that you are basically exposing the pipeline's micro-ops to the world, and the icache becomes your writeable "micro-code ROM", if you will.

If you can't change how the pipeline's datapath is wired up, I'm not sure I'm clever enough to figure out how changing the instruction set would help (they needed writeable ucode in the old days, because they had so many bugs!). But with that said, one of my lab's goals has been to design and implement vector machine datapaths that are fairly configurable (say between program phases) in a way that can capture more types of compute kernels.


Thanks, I've skimmed through it and have a few questions and comments. I know very little about RISC-V, but I've recently been diving into issues of macro- and micro-fusion on recent Intel x64.

The first question would be: Why do we care about instruction counts? Historically it's been used as a proxy for computational cost, but as you point out it no longer correlates that well. Additionally, many modern processors have µop caches that greatly reduce the importance of the initial instruction decode. But instead of suggesting that we ignore instruction count, the paper seems to suggest that we need to consider both instruction and µop count. Why?

You mention that "Compiler tool chains are a continual work-in-progress" and that you "used GCC for all targets as it is widely used and the only compiler available for all systems". This makes me wonder how much you are testing the differences in ISAs versus how much you are just testing GCC's ability to optimize for each.

My strong impression is that GCC does a poor job of optimizing for macro- or micro-fusion on x64. It seems mostly oblivious to the issues, and the efficiency of the resulting code is equally likely to be anywhere from the same speed as microarchitecture-conscious assembly to 2x slower. Did you attempt to correct for this when choosing x64 as the baseline?

For example, I noticed the x64 code for 401.bzip2 in Code 2:

  n = ((Int32)block[ptr[unHi]+d]) - med;

  4039d0: mov (%r10), %edx
  4039d3: lea (%r15,%rdx,1), %eax
  4039d7: movzbl (%r14,%rax,1), %eax
  4039dc: sub %r9d, %eax
  4039df: cmp $0x0, %eax 
  4039e2: jne 403a8a

While some of this is due to the requirements of the specific integer widths (intentionally? arbitrarily? accidentally?) specified in the source, GCC seems to have come up with a very roundabout solution. Why a MOV/LEA rather than a micro-fused ADD load-op? For that matter, why use an LEA (which is restricted to fewer execution ports) rather than an ADD?

And while in this case it happened to keep the CMP/JNE fusible, why does it have a CMP there at all, instead of using the flags already set by the SUB? From Sandy Bridge onward, SUB/JNE is also a macro-fused single instruction, so instead of two instructions fusing to one we have three instructions fusing to two.

There are probably reasons for all this, but it does make it awkward as a baseline. Whether it makes it into the paper or not, verifying that the baseline x64 numbers don't change significantly with Clang or ICC and a variety of optimization flags might be useful. If you were heroic, comparing to a hand-optimized version would be interesting as well.
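
For instance, a hand-written version of the line above might look something like this (illustrative only, reusing the register assignments from the listing; I haven't benchmarked it):

  mov    %r15d, %eax             # eax = d
  add    (%r10), %eax            # eax += ptr[unHi]  (micro-fused load-op)
  movzbl (%r14,%rax,1), %eax     # eax = block[eax]
  sub    %r9d, %eax              # n = eax - med, sets the flags
  jne    403a8a                  # SUB/JNE macro-fuse on Sandy Bridge and later

That would be five instructions and four fused-domain µops instead of six and five.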

A last comment is that I recently realized there were some significant differences between the Haswell and Skylake generations with regard to the effects of fusion on performance. A loop like this:

  #define ASM_MICRO_MACRO(in1, sum1, in2, sum2, max)          \
    __asm volatile ("1:\n"                                  \
                    "add (%[IN1]), %[SUM1]\n"               \
                    "cmp %[MAX], %[SUM1]\n"                 \
                    "jae 2f\n"                              \
                    "add (%[IN2]), %[SUM2]\n"               \
                    "cmp %[MAX], %[SUM2]\n"                 \
                    "jb 1b\n"                               \
                    "2:" :                                  \
                    [SUM1] "+&r" (sum1),                    \
                    [SUM2] "+&r" (sum2) :                   \
                    [IN1] "r" (in1),                        \
                    [IN2] "r" (in2),                        \
                    [MAX] "r" (max))

Ignoring that this isn't a particularly useful loop, it executes at a single cycle per iteration on Skylake, but takes 1.8 cycles per iteration on Haswell. The amount of fusion is the same on each, but the performance differs greatly. I'm not yet sure of the specific reasons, but it makes me wonder if the choice of Ivy Bridge as the reference implementation for x64 might influence your conclusions (although I don't know in which direction).




> But instead of suggesting that we ignore instruction count, the paper seems to suggest that we need to consider both instruction and µop count. Why?

In part because I have nothing else to go on! Not every processor gives me a µop count. Instead I'm saying "let's look at instruction count, but with a glass full of salt." That may not be very comforting, but it's what we have.

But at the end of the day, I care about making fast RISC-V processors, and this level of analysis was enough to give me the insight I wanted: a) the gcc code generation isn't too bad for RISC-V, and b) hey, there are some neat tricks we can use to make fast RISC-V processors once we see what other ISAs thought was worth spending their opcode space on.

> why does it have a CMP there at all, instead of using the flags already set by the SUB?

I was quite surprised by how little use x86 and ARMv7 code made of condition codes. Perhaps it's telling that ARMv8 removed most conditional execution from its ISA...

> My strong impression is that GCC does a poor job of ...

That may be (and I was certainly horrified by how badly it does on libquantum which is perfectly vectorizable). However, I've been told (for at least many SPEC benchmarks) that gcc does about as well as icc, and well, it's the free and available compiler that virtually everybody uses, so we're stuck with it!

And the ISA has been around for 30+ years, so if it's still bad, is it the compiler's fault for not being smarter, or the ISA's fault for being too complex for the compiler to get it right?

> If you were heroic, comparing to a hand-optimized version would be interesting as well.

There are a million things I'd love to have done (more benchmarks, more ISAs, more tool-chains...), but this isn't my PhD, and I had to draw the line somewhere! But if somebody wants a low-hanging Master's Thesis...

> it makes me wonder if the choice of Ivy Bridge as the reference implementation for x64 might influence your conclusions

I'd love to read more about that. In my case, I use what I have on hand, which is a very expensive Xeon Ivy Bridge!


Great answers. You're doing excellent work!


> And the ISA has been around for 30+ years, so if it's still bad, is it the compiler's fault for not being smarter

Well, the base ISA's been around for 30+ years, but all the interesting extensions are a lot newer, and the µarch (and therefore what's optimal) has changed enormously in that time.


It sounds like Skylake is better at isolating sub-threads via result dependencies AND/OR it has more resources to utilize for your particular example.

I wonder if it has some form of adaptation based on how likely it feels the parallelism is to pay off versus the power cost of that failure.


I believe the difference is that Skylake removed a bottleneck on the number of µops that can be sustained at some stage other than the execution units, but I'm not sure where. In the example I gave, the ADD from memory is an example of micro-fusion. It's held as a single "complex" µop in the DSB, but is "un-laminated" into two µops for execution and retirement. The CMP/JCC pairs use macro-fusion, and are combined to form a single µop both in the queue and during execution.

It's documented that Skylake increased the number of µops that can be transferred from the DSB (Decoded Instruction Cache) to the IDQ (Instruction Decode Queue) from 4 per cycle to 6 per cycle. But by these metrics, we're only pulling 4 µops (representing 6 instructions) per cycle from the DSB, and these expand to 6 µops (not the same as the 6 instructions) at execution and retirement. I'm guessing that they improved the retirement bandwidth as well, but I haven't been able to find a clear statement of this, or of what the new limits are.


Just to rule out the obvious, you looked at the branch mispredict counter, right?


Pretty sure that's not the issue, but I checked now to confirm. I posted my test code to Agner's blog a few minutes ago, but it hasn't appeared yet. I repasted to a Gist here: https://gist.github.com/nkurz/5e389a29cd1eabaae67924b28b40e7...

The main thing to notice is that on Skylake the "two macro two micro" version is fastest and executes at 1 cycle per iteration, while on Haswell it is slower than a couple of options with less fusion. I don't think there is anything that surprising here, but it does point out some not very clearly documented differences between Haswell and Skylake.

BR_INST_RETIRED_NEAR_TAKEN shows the number of loop iterations. Run time in cycles is shown by CPU_CLK_UNHALTED_CORE. The difference between INSTR_RETIRED_ANY and UOPS_RETIRED_ALL shows the effect of macro-fusion of CMP/JCC. The difference between UOPS_ISSUED_ANY and UOPS_EXECUTED_CORE shows the effect of micro-fusion of LOAD/ADD. UOPS_EXECUTED_CORE and UOPS_RETIRED_CORE are the same on both machines, showing that no branch misprediction is occurring.


Agner Fog's ForwardCom architecture has some interesting ideas around extensibility: https://github.com/ForwardCom/manual/blob/master/forwardcom....


Is there even any point in referring to an ISA as distinctly 'RISC' or 'CISC' anymore? Modern AMD64 'CISC' machines (and even some ARM machines, despite being considered 'RISC') are held together behind the scenes by microcode with a much smaller arsenal of instructions, and then you have stuff like this where a RISC-V chip might fuse ops together in a very CISC-like abstraction.


> Is there even any point in referring to an ISA as distinctly 'RISC' or 'CISC' anymore?

First, you need to be careful not to confuse the ISA with the micro-architecture. RISC and CISC are descriptions of the interface that the SW sees, and the fact that Intel processor pipelines use RISC-like micro-ops doesn't change the fact that the x86 ISA is very CISC-y (just look at the picture of all the x86 registers). Of course, the fact that we can so readily decouple the ISA from the implementation does help us lessen the sins committed in bad ISA designs, which is where our RISC arguments come back into play - if we can make all ISAs execute at about the same performance, why not go with the simplest ISA to implement?

Of course, although I'm being provocative in the report, in reality, RISC and CISC are not binary terms - they're points on a continuum.

For example, RISC-V's RV64I (integer-only) is incredibly RISC-y (not even a multiply!), but once you add in double-precision FMAs ("D") and atomic memory operations ("A") and variable-length instructions ("C")... well, it's not CISC like VAX, but it's certainly not pure RISC!
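
(To illustrate just how bare RV64I is: without "M", even an integer multiply ends up as a libgcc-style software routine, roughly like this sketch:)

  #include <stdint.h>

  /* Shift-and-add multiply: approximately what a software helper has to
     do when the ISA has no multiply instruction.  Sketch only. */
  uint64_t mul64(uint64_t a, uint64_t b) {
      uint64_t result = 0;
      while (b) {
          if (b & 1)
              result += a;
          a <<= 1;
          b >>= 1;
      }
      return result;
  }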


The thing is, the base x86 architecture isn't very CISCy: there are only simple addressing modes (no double-indirect modes), no addressing side effects to undo on exceptions (auto-increments), and only one TLB dip per instruction

(the push instruction is really the only exception).

Compared to its compatriots (68k/VAX/NS-whatever, etc.) it was positively RISCy: a VAX instruction could take something like 29 TLB misses and make 29 memory accesses, and a 68020 threw a horrible mess of microstate spew onto the stack on a TLB miss so it could make progress (which made implementing Unix signals a nightmare: where do you safely put that stuff?).

I think that relative RISCiness is why it's still with us.


tl;dr

RISC-V with the C extension is a bit denser than x86-64 with vector operations disabled, and can execute common operations in slightly fewer cycles thanks to macro-op fusion (especially in compare-and-branch idioms).

So until they add vector extensions to RISC-V, x86-64, though bloated, is still king.


I compiled the x86 code with gcc 5.3 using -mtune/-march=native and -O3. It just turns out that gcc generates entirely scalar code for SPECint!


That seems totally wrong. I'm certain the autovectorizer should fire on SPECint, since all compiler optimizations are pretty much invented just to improve the SPECint score - gcc's autovectorizer was made to do this for IBM's POWER processors, which are of course not x86.

Maybe it didn't detect '-march=native' properly?


The problem is that macro-op fusion, at least as commonly implemented, requires instructions to be adjacent, or at least close, for fusion to happen. This means that your careful plan of "not designing the µarch into the ISA" effectively goes out the window. It's not literally in the ISA, but the compiler needs to order the instructions just so to benefit from the fusion opportunities of the µarch that it's targeting, which often penalizes low-power designs that don't have the fusion in question.

There are some sequences where it makes an enormous amount of sense (cmp+jcc, mulhi+mullo, etc.); most sufficiently complex µarchs already do those. RISC-V isn't magically going to squeeze out a lot more macro-fusion opportunities than other architectures are able to take advantage of.
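
To make the adjacency point concrete, here is a made-up x86 pair: both loops do the same work, but only the first keeps the CMP right next to its JB, so only the first gets macro-fused on cores that support it:

  # fusible: CMP immediately precedes the conditional branch
  1: lea 0x1(%rax), %rax
     lea 0x1(%rbx), %rbx
     cmp %rcx, %rax
     jb  1b

  # same semantics (LEA does not touch the flags), but the split pair
  # means CMP and JB issue as two separate µops
  1: lea 0x1(%rax), %rax
     cmp %rcx, %rax
     lea 0x1(%rbx), %rbx
     jb  1b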


In your paper, please fix this glaring typo:

// psuedo-code for a ‘repeat move’ instruction

-> pseudo


Yikes, my spell checker missed that. -.- (shakes fist at OSX).



