RISC-V Public review for standard extensions Zc

snvzz · on Oct 15, 2022

Comments here: https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/j2x...

brucehoult · on Oct 15, 2022

Note that these extensions are primarily aimed at tiny embedded systems with code in ROM and no caches, so instruction fetch competing with data load/store is an issue. That was the major justification for load/store multiple in the 1985 ARM ISA, but it's been dropped or deprecated in recent ARM -A ISAs.

Byte and half load/store are also a lot more common in embedded software than in applications processors (especially as they are often handled by SIMD/Vector in larger CPUs)

Note that the B extension already added instructions such as manipulating single bits in registers (Zbs); clz, cpop, byte reverse, rotates (Zbb); and shift-and-add (Zba)

The Zcmp and Zcmt extensions are NOT intended to be added to large applications processors running standard OSes such as Linux or Android.

This is about competing with 8051, AVR, PIC, MSP430, ARMv6-M, ARMv7-M at the lowest end.

Dylan16807 · on Oct 15, 2022

I feel like "NOT intended" is even understating it, so I'll quote your other post:

"Code using Zcmp or Zcmt is incompatible with the RV64GC ISA assumed by current Linux etc OSes as it redefines and reuses the 16 bit opcodes for double precision floating point load and store. Full-size instructions must be used instead."

I'm amused that the instructions that get the best compression by far, multi-register push/pop, are also the instructions that are too much of a pain to bother with on large cores.

wyldfire · on Oct 15, 2022

> 1985 ARM ISA, but it's been dropped or deprecated in recent ARM -A ISAs

I guess the idea is to recover the opcode space? Or recover the area used to implement it?

Seems like deprecating instructions for an ISA doesn't really have any effect other than giving you more notice for the eventual removal. Unless you sit down and read the new PRM you'd probably never know.

phire · on Oct 15, 2022

ARMv8-A (aka 64bit ARM) is a completely different instruction set, with basically zero relation to the instruction encoding of 32bit ARM. They redesigned it from scratch.

And it's an instruction set designed for high-performance application cores (hence the -A) like what you would find as the main processor of a mobile phone, or a laptop or even a server. So they don't want anything optimised for lower-end designs. Plenty of other useful things to do with the encoding space.

For those low-end designs, ARM have a second ISA called ARMv8-M. Which is basically just the old thumb mode of 32bit arm with a few improvements.

The original 32bit arm ISA is basically dead these days (at least for new designs). You can still run some ARMv8-A chips in 32bit ARM mode, for compatibility. But that's actually optional in regards to the spec. Some only support running usermode 32bit ARM code. In the near future, we will see some cores that don't support 32bit arm code at all.

brucehoult · on Oct 15, 2022

>In the near future, we will see some cores that don't support 32bit arm code at all.

We already do. Lots of them. Every iPhone since the iPhone 8. M1 and M2 Macs. ThunderX in server farms, Cortex-X2 and Cortex-A510 cores in your high end 2022 Android phones (Galaxy S22, Xiaomi 12 etc). The Cortex-A710 cores in those phones do still run 32 bit code if needed.

ARM has said all new applications processor cores from 2023 will drop 32 bit support.

hajile · on Oct 15, 2022

X3 and A715 already dropped support (though maybe they're talking about shipped CPUs).

throwaway81523 · on Oct 15, 2022

RISC-V gets more and more complicated all the time. N different extensions of which almost any subset is a valid combo, so 2**N variants. Meanwhile, nothing about fast integer overflow detection.

brucehoult · on Oct 15, 2022

It's not a problem because:

- if you're compiling your own software you just tell the compiler what extensions the hardware you're using has.

- if you're running shrink-wrapped software today then you currently know exactly that RV64GC is supported, and next year you'll know there are 2.5 possibilities: RVA20 (RV64GC renamed), and RVA22 with or without V extension. I expect the next one (RVA24 or RVA25 or something) will make V compulsory.

If you actually genuinely care about fast integer overflow detection on RISC-V then make a proposal, analysing the costs and benefits.

As far as most people are concerned, there isn't a problem there in the first place.

throwaway81523 · on Oct 15, 2022

> As far as most people are concerned, there isn't a problem there [fast overflow detection] in the first place.

See, that is the problem. They try to be an architecture for the future but they are only concerned about performance of legacy code that doesn't check overflow either. The result is a zillion CVE's from overflow bugs, to go with the buffer overflows. Idk whether Rust is supposed to check overflow. Ada is supposed to check unless you suppress the check with a pragma, but GNAT sets the pragma by default because doing the check costs too much performance even on x86.

The Ada designers were not being fools when they specified that these checks should be done all the time. So an architecture of the future should support doing that efficiently, just like the CHERI(?) extension supports checking pointers efficiently.

As for the many combinations of features: it will be like x86 I guess, where it is a hassle but ultimately not a crushing one. Usually people just give up some performance by using lowest common denominator (or should that be greatest common factor) executables.

brucehoult · on Oct 15, 2022

Not having dedicated overflow checking instructions doesn't mean that you can't check overflows. You can.

There is no problem because checking overflows doesn't in practice cause significant overhead, especially in the kinds of superscalar CPUs that are in mobile devices and laptops and PCs and servers these days. Most of the time those wide OoO CPUs are looking in vain for more parallelism. Adding a few overflow-checking instructions just gives them something to do with execution slots they otherwise often can't use.

If there is a actual problem -- which needs to be PROVEN not just hand-waved -- then analyse it in the real world, document it, propose an ISA extension that would actually make things better in real-world applications, not just some hand-picked tiny loop or function.

zozbot234 · on Oct 15, 2022

Indeed, the overhead of checking overflow (which generally amounts to a few percentage points) is all about missed optimization opportunities due to the need for precise reporting of overflow errors. It has nothing whatsoever to do with the actual ISA, at least assuming a reasonable compiler.

snvzz · on Oct 15, 2022

>It has nothing whatsoever to do with the actual ISA

RISC-V doesn't have a flags register. This is an actual advantage: Superscalar microarchitecture implementations do not need to keep track of who set what flag.

This was a very conscious, carefully weighted decision, which presents a net win. One decision of many RISC-V designers had to make.

There's better and worse ways for an ISA to do something. For every ISA decision, implementations (small and large) will either benefit or have to work around.

You'll hear a lot of people insist that ISA doesn't matter; they're provably clueless.

tux3 · on Oct 15, 2022

>You'll hear a lot of people insist that ISA doesn't matter; they're provably clueless.

One of them being Torvalds, the transmeta guy (also known for Git, and a couple other hobby projects)

Now I tend to fall on your side of this argument, but I have to say I'd like to see the formal proof of cluelessness :)

Having read previous iterations of the same arguments, I've yet to see anything quite that decisive.

znwu · on Oct 16, 2022

On ISA design, it is just so wrong to endorse the opinion of a single person. ISA design has three significant parties of interest: IC designers, compiler authors, and software developers.

No one can master these three fields all at once. No one. Not even Linus (who would be a master software developer + a decent compiler specialist).

snvzz · on Oct 15, 2022

>One of them being Torvalds, the transmeta guy (also known for Git, and a couple other hobby projects)

A very capable person no doubt, but also one that's well-known to be wrong[0] about topics he is not an expert at.

0. https://www.cosy.sbg.ac.at/~clausen/PVSE2006/linus-rebuttal....

jhgkjhlkhjkljk · on Oct 16, 2022

Jim Keller has said ISA doesn't really matter beyond something like 15 necessary extensions.

snvzz · on Oct 16, 2022

Same Jim Keller has great things to say about RISC-V.

Which, by the way, includes mention of much attention these few important instructions (not extensions) got.

moonchild · on Oct 15, 2022

It is low overhead. But the 'zero cost abstractions'/'zero overhead [...]' crowd continues to blanch even at such low overhead. When you design a CPU for the masses, it is your responsibility to consider what they will do with it. The social factors are significant; not just the technical ones.

howinteresting · on Oct 15, 2022

Rust checks buffer overflows by default. It doesn't check integer overflows in release mode but that can be turned on as a compiler flag for a 10-20% performance hit on most programs.

moonchild · on Oct 15, 2022

I am curious where you get 10-20%. Dan luu measures <1% at https://danluu.com/integer-overflow/

howinteresting · on Oct 15, 2022

Those are figures I've heard second hand on a real project from some people I trust. My understanding is that a lot of optimizations become more complicated that way. I could be wrong though.

throwaway81523 · on Oct 15, 2022

1) it looks a lot worse than 1%, 2) it is on x86 not risc-v, 3) code bloat.

Meanwhile risc-v and everything else checks overflow for floating point arithmetic because it is mandated by the IEEE standard. That doesn't seem to cause slowdowns.

There is a code sequence in the risc-v docs someplace for doing a checked integer addition and it takes 3 or 4 instructions in the general case. Blecch. Better to set a flag like the FPU does, and check it at the end of a basic block.

zozbot234 · on Oct 15, 2022

That's only the suggested overflow checking for a single operation. When you have multiple add/sub/mul/compare operations in a row, checking for actual overflow becomes quite non-trivial and is not something that can be done by the hardware. Floating point has no equivalent to this, which is why it makes sense to have overflow checking as part of the instruction.

throwaway81523 · on Oct 15, 2022

I don't understand what you are saying. An overflow in an intermediate result should signal an overflow unless you're relying on some special property of 2's complement arithmetic in which case you should be using specific datatypes to suppress the check. All I'm saying about floating point is that the presence of overflow checking in high performance FPU's falsifies (afaict) the claim that it's unfeasible to do the same thing for integers.

andrekandre · on Oct 15, 2022

  > FPU's falsifies (afaict) the claim that it's unfeasible to do the same thing for integers.

was the argument that its unfeasible or that just not doing it is a net win?

adwn · on Oct 16, 2022

> 1) it looks a lot worse than 1%

The 9% / 28% are for -fsanitize=signed-integer-overflow,unsigned-integer-overflow, which prints a warning on overflow.

The version with -fsanitize-undefined-trap-on-error, which aborts the process, only causes a slowdown of 0-1%.

renox · on Oct 15, 2022

Even better to have a 'trap on integer overflow' state (either a bit in each instruction or in a context register), this way there's no/less icache impact.

bjourne · on Oct 15, 2022

It's a problem because if you are a vendor creating shrink-wrapped software you will have a hard time ensuring that your software supports all plausible extension combinations. Thus you will not bother just like most gaming companies does not bother porting their software to GNU/Linux. Sun basically did the same thing in the early 2000's with all their Java editions and profiles. It didn't work out well.

jhgkjhlkhjkljk · on Oct 16, 2022

The software makers will settle on a given subset of extensions most likely. The extensions here are meant for microcontrollers that won't run Linux. I think this will all sort itself out.

dzaima · on Oct 15, 2022

Mentioning 2^N variants on this specific proposal is probably the worst place to do so - Zca, Zcf, Zcd are all subsets of the 'c' extension, and Zcmp is incompatible with the 'c' extension, which severely cuts down the combinations - having the 'c' extension, which Linux expects, makes all (but Zcb I think?) extensions here either implicitly supported, or incompatible.

And, given the scope & purpose of these extensions, most things (distributed binary programs, OSes, all but a few compilers) won't need to use nor care about these. The alternative to these extensions, to those that need them, would be either suffering decreased performance/increased ROM size, choosing a different architecture, or making proprietary extensions (at which point you could have multiple vendors with incompatible extensions doing the same thing) - all of which are bad alternatives.

wyldfire · on Oct 15, 2022

I think the RISC-V committees/foundation realized the problem and is endorsing specific combinations of extensions.

snvzz · on Oct 15, 2022

Rather than "realized the problem", it would be more accurate to say that they understood the ramifications of extensions from the very beginning, took adequate steps, and never allowed it to become a problem.

rwmj · on Oct 15, 2022

Have you tried this recently?

  grep ^flags /proc/cpuinfo

Jtsummers · on Oct 15, 2022

2^N (I'm guessing you typed two *s, you need to type 4 to get two to show up).

  ** =>

*

  **** =>

**

throwaway81523 · on Oct 15, 2022

Thanks, fixed.

sylware · on Oct 15, 2022

It seems risc-v is missing the "small and big", "hardware accelerated", equilavent of x86 "rep stosX" and "rep movsX". Basically hardware memcpys and memsets. Heard latest arm ISA has them.

risc-v implementors/experts out there: is this true?

childintime · on Oct 15, 2022

not an expert, but it seems to me the vector extension should be able to do this.

sylware · on Oct 15, 2022

Like on ARM with neon and x86 with avx, but even on those, they got direct hardware acceleration without neon/avx of memcpys and memsets (big and small).