ARM vs. RISC-V Vector Extensions

brigade · on May 6, 2021

SVE was designed mindful of how CPUs currently operate, whereas RISC-V vector extensions were designed with fondness for how CPUs operated decades ago.

Well that's somewhat of an exaggeration, but XT-910 speculatively executes vector instructions based on a prediction of how vsetvl(i) modifies register configurations in order to achieve good performance, so changing this register configuration causes speculation failures as though it were a mispredicted branch. Which you need to do if you're doing any mixed precision operations, or mixed integer/floating point, and discourages small SIMD functions. Quote from their white paper: "this is not friendly to deeply pipelined processor architecture."

Fundamentally, I dislike how completely the meaning of RISC-V vector instructions depends on what instructions were executed an arbitrarily long time beforehand. Also he's really complaining about register indexing in load/stores?

_chris_ · on May 6, 2021

> Quote from their white paper: "this is not friendly to deeply pipelined processor architecture."

There is zero reason you can't rename the VL register. "Speculating" that it doesn't change is only one design point.

brigade · on May 6, 2021

Well... when the operation of register renaming itself depends on the vtype register, that isn't a complete solution.

brucehoult · on May 7, 2021

vsetvl{i} should not cause speculation failures.

The VTYPE part of vsetvl (which sets the VTYPE CSR) is conceptually part of dynamically following instruction opcodes. There simply isn't room for it in the current fixed-size 32 bit opcodes, but in future 64 bit instruction encodings [1] the VTYPE will be explicit in eery instruction.

This is no different to every IEEE-compliant Floating Point implementation (which ARM and x86 of course are) including a rounding mode field in every instruction which often (in fact usually) contains the value "dynamic" meaning that the rounding mode is taken from a CSR.

In both cases the instruction fetch stage can simply jam the current value from the CSR on to the end of each instruction and have it pass through the pipeline with it as if it had been in the instruction opcode in the first place.

VTYPE has 8 active bits in the current v1.0 draft, and the XT-910 and XT-906 implement a different 6 bit VTYPE of which 2 bits must always be zero (so 4 bits really).

VSETVL with a register argument could cause a mis-speculation, but it will normally only be used for saving and restoring VTYPE on task switches. Normal code will usually be using the immediate form, which the instruction decoder can pick off and save locally.

As for the VL CSR part, there is a bit in VTYPE that specifies whether a) you don't care about elements past VL -- they can be computed the same as if they were within the vector length or not, or b) elements past VL remain undisturbed. Designers/users of OoO machines may be happier when the software specifies a) and designers/users of microcontrollers may be happier when the software specifies b).

So when "tail agnostic" is set, very wide hardware might calculate all elements anyway, while hardware with N ALUs might calculate whatever portion of the tail falls in the block of N elements containing the end of the vector, but not later blocks.

The XT-910 designers of course didn't have this option as their CPU implements a two year old draft specification that no one was ever supposed to implement in mass-production hardware -- or at least if they do they have to know it has no guarantee to be compatible with RVV v1.0

The XT-906 uses the same vector processor (or at least the same ISA). I've been using one via ssh recently and hope to have my own board with the Allwinner D1 chip that uses the XT-906 core later this month. I've got some simple vector test results at:

http://hoult.org/d1_memcpy.txt http://hoult.org/d1_strcpy.txt

[1] 64 bit RVV opcodes will also likely have other enhancements such as being able to use any V register for masking, other masking options (e.g. invert the mask), ability to address more V registers, four operand versions of multiply-accumulate, and I don't know what else. No real effort has been expended on actually designing that in detail yet.

SuchAnonMuchWow · on May 6, 2021

Also, the RISC-5 vector extension is at the opposite of the philosophy of RISC vs CISC

brucehoult · on May 7, 2021

In what way?

- load/store with arithmetic register-to-register only. Check.

- single instruction length. Check.

- fixed instruction execution time. Check.

- large set of registers. Check.

SuchAnonMuchWow · on May 7, 2021

The vector extension has many good properties, like the ability to adapt to various size/complexity of processors: it can be implemented slowly on a small processor, and quickly on fast processor. This is because the instructions are very abstract and most of the control is done by the core instead of the compiler, like splitting the vectors into chunks that can be computed per cycle by the core.

But then you loose the fixed instruction execution time: if you configure your registers to be larger than the compute units of the core, the instruction will take multiple cycles.

Citing wikipedia on RISC archs:

"The main distinguishing feature of RISC architecture is that the instruction set is optimized with a large number of registers and a highly regular instruction pipeline, allowing a low number of clock cycles per instruction (CPI)"

- a highly regular instruction pipeline: Nope

- a low number of clock cycles per instruction (CPI): Nope

In the 80'-90', it became apparent that RISC vs CISC was a moot point, and both types of ISA could be equally efficient. However, I think if you want a universal ISA, CISC is better because it is easier to decompose complex instructions into smaller ones on small processor, rather than to detect patterns of simple instructions and optimize them on a large processor.

This is the direction RISC-V took with the vector extension. But I find it a stretch to call it a RISC ISA then.

brucehoult · on May 7, 2021

Thanks for the reply.

>But then you loose the fixed instruction execution time: if you configure your registers to be larger than the compute units of the core, the instruction will take multiple cycles.

"multiple" doesn't imply "not fixed".

Floating Point on pretty much any modern CPU, CISC or RISC, takes 3 or 4 clock cycles for add/subtract/multiply/FMA. It's not 1 cycle, but it's always the same. This means the compiler can (if it's told what CPU the code will be running on with -mtune or whatever) accurately schedule the code for maximum efficiency. And usually the FPU will be pipelined, so you can start a new add or multiply every clock cycle, as long as it doesn't depend on the result of a very recent one.

Recently I've been using the Allwinner D1, which implements draft version 0.7.1 of the RISC-V Vector ISA with 128 bit registers. Like an FPU, most instructions take 3 clock cycles with LMUL=1. If you increase LMUL then, yes, the instructions take longer, but it is deterministic and you can do your instruction scheduling around it.

If you look at the result of a vectorised memcpy() I did on the D1, you can see that the execution time is absolutely identical at 30.9 ns (31 clock cycles at 1.008 GHz) from sz=0 to sz=64

http://hoult.org/d1_memcpy.txt

At sz=0 this is 60% of the time for the standard scalar memcpy() in glibc!! By sz=64 it is 3.6 times faster than the standard memcpy().

I'm using LMUL=4 there. LMUL=2 provided slightly lower latency for sz<=32 but couldn't achieve as much speed in cache. LMUL=8 didn't provide any more speed in cache but had latency closer to the standard scalar code (but still less than it) for sz=0.

SuchAnonMuchWow · on May 7, 2021

I never actually wrote programs with risc-v vectors instructions, only for other ISA, so you seems to have much more experience than me on on that. Thanks for sharing your experience, and for the discussion.

> "multiple" doesn't imply "not fixed".

But in the case of risc-v, it really is "not fixed". For example, can you tell me the number of cycle it takes to execute this instruction: `vadd.vv v3, v1, v2` ?

Another example, in your implementation of memcpy:

``` 0: 86aa mv a3,a0

0000000000000002 <.L1^B1>:

   2: 00267757           vsetvli a4,a2,e8,m4,d1

   6: 12058007           vlb.v v0,(a1)

   a: 95ba                 add a1,a1,a4

   c: 8e19                 sub a2,a2,a4

   e: 02068027           vsb.v v0,(a3)

  12: 96ba                 add a3,a3,a4

  14: f67d                 bnez a2,2 <.L1^B1>

  16: 8082                 ret

```

From what I understand, you can't tell how long each vlb.v or vsb.v will take on a given processor, as it depend on the number of elements that will be returned in a4. The last iteration of the loop for example will probably take less time than the others, because the instructions will only have to load and store the trailing part of the data.

> If you look at the result of a vectorised memcpy() I did on the D1, you can see that the execution time is absolutely identical at 30.9 ns (31 clock cycles at 1.008 GHz) from sz=0 to sz=64

Again, I'm not arguing against the efficiency of RISC-V vector extension. I'm sure any processor designer will be able to implement it as efficiently as if the instructions were closer to the raw compute logic.

I'm not even arguing against the extension. If anything, I think I would prefer if an open CISC ISA was gaining popularity as an open source platform instead of RISC-V, because it would have the same portability implications, but with smaller code size and memory footprint, and while leaving a lot of room for innovation and optimization on the actual design.

What I actually argue against is calling it a RISC ISA, because while the core of the ISA is RISC, this vector extension is not.

brucehoult · on May 7, 2021

Actually I can. I can tell you that on the chip I'm currently using "vadd.vv v3, v1, v2" takes 3 clock cycles for LMUL=1, 6 clock cycles for LMUL=2, 12 clock cycles for LMUL=4, and 24 clock cycles for LMUL=8.

You always have to know what current LMUL you have, otherwise you can try to use illegal register numbers. For example in LMUL=8 you can only use v0, v8, v16, and v24. Anything else causes an exception.

"From what I understand, you can't tell how long each vlb.v or vsb.v will take on a given processor, as it depend on the number of elements that will be returned in a4. The last iteration of the loop for example will probably take less time than the others, because the instructions will only have to load and store the trailing part of the data."

If you look at the link I provided, you will see that's not the case. The memcpy() function takes absolutely identical time for any memcpy() length from 0 bytes to 64 bytes. It does not depend on the number of active elements. At least on this chip, which is currently the only chip in the world you can get your hands on with a RISC-V Vector unit.

On some other chip it might take a different amount of time. But on this one, and probably many others, it takes the same amount of time.

"I think I would prefer if an open CISC ISA [...] with smaller code size and memory footprint"

CISC ISAs do not have smaller code size.

The two most compact code modern full-featured ISAs are ARMv7 and RISC-V. They are smaller than i686 by quite a margin.

In 64 bit there is absolutely no competition. RISC-V is the smallest code, with ARMv8 and AMD64 quite similar to each other but significantly bigger.

Just look at the same programs compiled for each one and you'll see. I suggest something like Ubuntu 21.04, which is available for all three. Take a look in /bin and /usr/bin and run "size" on the binaries. It's indisputable. RISC-V is the clear winner in 64 bit ISAs. ARMHF is similar or a little bit more compact in 32 bit. CISC x86 isn't close in either case.

jcranmer · on May 6, 2021

> RISC-V however does not work like this. The RISC-V vector registers are in a separate register file not shared with the scalar floating point registers.

Honestly... in hardware, they probably are actually in the same register file. It just now means you have two sets of architectural registers that rename to the same register file.

As for the rest of the article, it looks like it mostly boils down to "I'm intimidated by assembly programming" as opposed to any actual critique of the strengths and weaknesses of the vector ISAs. There's superficial complaints about the numbers of instructions, or different ways to write (the same? I only know scalar ARM assembly, not any vector extensions) instructions. On a quick reread, I see a complaint that's entirely due to how ARM represents indexed load operations, which has absolutely nothing to do with the vector ISA whatsoever.

If your goal is to understand how hardware SIMD works, you're probably better off sticking to C code with intrinsics, that way you're not distracted by the extra hoops you may have to go through that arise just by translating C into assembly.

_chris_ · on May 6, 2021

>> The RISC-V vector registers are in a separate register file not shared with the scalar floating point registers.

> Honestly... in hardware, they probably are actually in the same register file. It just now means you have two sets of architectural registers that rename to the same register file.

You could have a single unified pool of physical registers that can be handed out to any register, but there's only some advantage to do so and a lot of advantages in keeping them separate. Either way, that's a micro-architectural detail that the designers have the freedom to choose (or not choose) to do.

From the software's point of view, there's a lot of advantages in keeping different architectural registers separate.

jcranmer · on May 6, 2021

> You could have a single unified pool of physical registers that can be handed out to any register, but there's only some advantage to do so and a lot of advantages in keeping them separate. Either way, that's a micro-architectural detail that the designers have the freedom to choose (or not choose) to do.

What's the advantage to keeping them separate? If you're implementing vector instructions, then your scalar floating-point units are probably going to be the same as the vector floating-point units, with zero-extension for the unused vector slots. At that point, keeping them separate hardware register slots is detrimental: it's now costing you extra area as well, with concomitant power costs. You also need larger register files to accommodate all of the vector registers and the floating-point registers, when you're only likely to use half of them at any times. If you're pushing the vector units to their throttle, you'll have little scalar code to need all the renaming; if you're pushing the scalar units to their throttle, you'll similarly have little vector code.

From a software viewpoint, eh, there's not really any advantage to keeping them separate. You tend to use scalar xor vector floating point code, not both (this isn't true for integer, though), so there's little impact on actual register pressure. More architectural registers means more state to have to spill on context switches.

brucehoult · on May 7, 2021

> On a quick reread, I see a complaint that's entirely due to how > ARM represents indexed load operations, which has absolutely > nothing to do with the vector ISA whatsoever.

Not exactly true.

If you can use fancy addressing modes in your vector loads and stores and you have a fixed length 32 bit opcode (as both Aarch64 and RISC-V do[1]) then specifying an index register and how much to shift it by is taking up an extra 7 bits of your opcode (5 for register number, 2 for shift amount) vs an instruction that just specifies a base pointer register.

That means one instruction is taking up the opcode space that could otherwise be used by 128 different instructions instead.

That means either your vector ISA has fewer instructions and capabilities than it otherwise could have, or else it is taking up a lot more of the overall opcode space.

    loop:
        LD1D z1.d, p0/z, [x1, x3, LSL #3] // load x
        LD1D z0.d, p0/z, [x2, x3, LSL #3] // load y
        FMLA z0.d, p0/m, z1.d, z2.d
        ST1D z0.d, p0, [x2, x3, LSL #3]
        INCD x3                 // i
        WHILELT p0.d, x3, x0    // i, n
        B.ANY loop

Yeah, it's a little more code (4 bytes in RISC-V) to increment x1 and x2 by 8 using extra instructions but 1) that FMLA almost certainly takes 3 or 4 clock cycles, giving you plenty of spare cycles to execute the adds even on a single-issue machine, and 2) vectorized loops are likely to take up so little of your overall application that it makes no difference.

There's an argument to be made that it's worth the opcode space for scaled indexed addressing for integer loads and stores. Reasonable people may differ. But the case for FP and Vector loads and stores is pretty much non-existent.

It's not just a matter of "Ohhh .. that instruction looks so scary"

[1] RISC-V has 16 bit opcodes for very simple and common instructions, but the Vector ISA is entirely 32 bit opcodes.

janwas · on May 7, 2021

> If your goal is to understand how hardware SIMD works, you're probably better off sticking to C code with intrinsics

Agreed, and we're also using intrinsics in time-critical places. I am confident we will be able to hide both SVE and RVV behind the same C++ interface (https://github.com/google/highway) - works for RVV, just started SVE.

brucehoult · on May 7, 2021

If your vector registers are only 128 bits long then it's probably ok to have one big pool of registers. But if your vector registers are 512 bits, 4096 bits, 65536 bits then it's an awful waste not not be able to use one of those just because you need another int or FP scalar value in your loop.

swiley · on May 6, 2021

Undergrad computer architecture class freaked me out. Assembly doesn't even accurately describe what the machine does.

CalChris · on May 6, 2021

I think the RISC-V Vector Extensions are very elegant. However, I'm more interested in what hard core practitioners think and the ones I follow are nonplussed.

https://gist.github.com/erincandescent/8a10eeeea1918ee4f9d99...

https://twitter.com/geofflangdale/status/1155122593369710592

I'm reminded of a compliment for Vishy Anand's middlegame technique. He plays very well with the knight pair.

brucehoult · on May 7, 2021

Here’s the opinion of probably THE most important ARM engineer of the 1990s and 2000s, Dave Jaggar who developed the ARM7TDMI, Thumb, Thumb2.

https://www.youtube.com/watch?v=_6sh097Dk5k

Check at 51:30 where he says (bringing up the topic of RISC-V himself) “I would Google RISC-V and find out all about it. They’ve done a fine instruction set, a fine job […] it’s the state of the art now for 32-bit general purpose instruction sets. And it’s got the 16-bit compressed stuff. So, yeah, learning about that, you’re learning from the best.”

jpfr · on May 6, 2021

RISC-V really shines when the op-compression extension and macro-op fusion are taken into account.

https://news.ycombinator.com/item?id=25542963

Many of the early critiques of RISC-V have not considered that. And many points become entirely moot in light of these techniques.

sanxiyn · on May 6, 2021

Note that while the article is broadly correct, RISC-V Vector Extension is still in development and the article is based on the old version. SETVL is now three arguments (not two) and renamed to VSETVL, for example.

brucehoult · on May 7, 2021

To be more precise, VSETCFG is gone (rolled into VSETVL) and VSETVL's 3rd argument is an integer encoding the VTYPE with four subfields, written in assembly language for VSETVLI as up to four arguments: element width, register length multiplier, tail agnostic flag, mask agnostic flag (trailing subfields can be defaulted).

_chris_ · on May 6, 2021

It looks like RVV will be going up for ratification this summer, so (https://github.com/riscv/riscv-v-spec/blob/master/v-spec.ado...) is pretty close to the final version.

kingsuper20 · on May 6, 2021

I hit so many cases in SSE (and others) where I needed 'just one more instruction' and instead had a bit of a mess, my bet is that simple SIMD instruction sets will always grow over time.

For fun, maybe they could bolt on a VLIW set (or KLIW, 'Kinda Large') and push some of the ordering work onto the programmer/compiler.

Symmetry · on May 6, 2021

>If you are a hobbyist like me, who just wants to keep up to date with how technology is evolving and what things like vector processing is, then safe yourself a lot of trouble and just read a RISC-V book.

Yes, I'd absolutely agree. If you're a professional you'll probably find the hundreds of ARM instructions useful for optimizing your kernels if you're hand-crafting assembly but you'll be pretty far up the learning curve before you get to the crossover point. And not just for hobbyists but also for most academics working on exploring vectorization as well.

But if it's me just making use of the effort someone else has put into BLAS for my robot then I'll probably be better off with an SVE processor both from a code optimization scope standpoint and an ease of hardware implementation standpoint.

brucehoult · on May 7, 2021

Compared to NEON, SSE, AVX etc, yes.

Compared to RVV? I don't think that follows.

Also there won't be any generally available (for example in a Raspberry Pi or other SBC) implementations of SVE for probably several years yet, while Allwinner has their D1 chip using RVV [1] already in mass production and I've been running test code on it e.g.

http://hoult.org/d1_memcpy.txt http://hoult.org/d1_strcpy.txt

Sipeed and Pine64 have announced plans to sell Linux SBCs with this chip for as little as $10 or $12 within the next few months.

[1] it's a two year old version 0.7.1 of RVV, not very compatible with the current draft (though that memcpy() works on both), but newer than what is shown in the article.

Symmetry · on May 7, 2021

If we're talking about non-final versions of RVV then I think there likely won't be any optimized versions of matrix libraries coming out for those chips and both RVV and SVE are going to be effectively moot for a few years for people not writing their own assembly.

It's nice that you can get a RV board so cheaply as a hobbyist but as someone who uses vector libraries professionally I don't care and don't expect to ever use a chip costing less than $100. Right now getting high end chips in industrial PCs means going with x86. You can get an ARM chip with nice performance in an Apple laptop but that's not a form factor that's useful to me. ARM's own designs are a few years behind Apple but keeping pace so I think that it won't be too many years before I could plausibly be using ARM chips. I don't have any confidence I'll be able to get a RV chip that performs as well as a contemporary Intel or AMD chip in that timeframe.

brucehoult · on May 7, 2021

I expect glibc, musl, newlib to have at least some RVV optimised functions (memcpy, memmove, memset, bzero, strlen, strcpy, strncpy, memcmp, strcmp etc) very soon. Certainly for RVV 1.0, but I think also for RVV 0.7.1 since a lot of boards are going to be sold with that.

Simply having RVV versions of these in the standard library will instantly get most of the performance improvement available for general purpose code.

I will be personally pushing for that as 1) I'll have RVV 0.7.1 hardware probably this month, 2) I know both RVV 0.7.1 and 1.0 well and can write these routines in my sleep, and 3) I have FSF copyright assignments in place for gcc, binutils, glibc.

If you are only interested in computing devices with contemporary x86 or M1 performance then, sure, it's going to be quite some years before you have to care about either SVE or RVV. Apple might well implement SVE sometime soon -- obviously no one knows.

I'm typing this on a 16 GB M1 Mac Mini. It's replaced my 32 core ThreadRipper running Linux for day to day use as it's silent and has snappier UI, and is faster for anything using up to four cores, and not all that much slower even for things that sometimes use 32 cores.

See for example the last test in http://hoult.org/arm64_mini.html where it's 71% as fast as the ThreadRipper building the entire RISC-V GNU toolchain in a Linux for arm64 VM. Actually that was on a Mini with 8 GB RAM where I couldn't afford to allocate more than 4 GB to the VM. It's faster than that on my current 16 GB Mini.

Scene_Cast2 · on May 6, 2021

I think some of the complaints are about the number of instructions. But when you need to e.g. load a masked set of vectors, it's really nice to have such an instruction handy.

ChuckMcM · on May 6, 2021

One of the more interesting and fun things you can do these days is get a ULX3S[1], use the community RISC-V project to turn it into a fully functional RISC-V system, and then design your own vector instructions to play around with different ways of doing things. All for < $200 US which is always amazing to me.

[1] https://www.crowdsupply.com/radiona/ulx3s (85T version recommended)

klelatti · on May 6, 2021

TLDR Arm SVE is a bit of a pain to hand write assembly for because there is more to look up?

Can anyone recommend a good introduction to SVE2?

brandmeyer · on May 6, 2021

SVE2 isn't all that different from SVE. The early SVE introductory papers and tutorials are all great information.

https://arxiv.org/abs/1803.06185

https://developer.arm.com/architectures/instruction-sets/sim...

If the author had spent as much time reading the introductory manuals as he did bitching all over that webpage, he might have made some progress /rant.

klelatti · on May 6, 2021

Thanks - these look great. Entirely justified rant too!