I never actually wrote programs with RISC-V vector instructions, only for other ISAs, so you seem to have much more experience than me on that. Thanks for sharing your experience, and for the discussion.
> "multiple" doesn't imply "not fixed".
But in the case of RISC-V, it really is "not fixed".
For example, can you tell me the number of cycles it takes to execute this instruction: `vadd.vv v3, v1, v2`?
Another example, in your implementation of memcpy:
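(What follows is a sketch of that kind of stripmined loop in RVV 0.7.1-style assembly, based on the example in the vector spec; the exact register choices and vsetvli arguments are assumptions on my part, not the code actually posted.)

```
# void *memcpy(void *dest, const void *src, size_t n)
# a0 = dest, a1 = src, a2 = n
memcpy:
    mv      a3, a0          # keep dest for the return value
loop:
    vsetvli a4, a2, e8,m8   # a4 = number of byte elements handled this pass
    vlb.v   v0, (a1)        # load a4 bytes from src
    add     a1, a1, a4      # advance src pointer
    sub     a2, a2, a4      # decrement remaining count
    vsb.v   v0, (a3)        # store a4 bytes to dest
    add     a3, a3, a4      # advance dest pointer
    bnez    a2, loop        # any bytes left?
    ret
```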
From what I understand, you can't tell how long each vlb.v or vsb.v will take on a given processor, as it depends on the number of elements that will be returned in a4. The last iteration of the loop, for example, will probably take less time than the others, because the instructions will only have to load and store the trailing part of the data.
> If you look at the result of a vectorised memcpy() I did on the D1, you can see that the execution time is absolutely identical at 30.9 ns (31 clock cycles at 1.008 GHz) from sz=0 to sz=64
Again, I'm not arguing against the efficiency of the RISC-V vector extension. I'm sure any processor designer will be able to implement it as efficiently as if the instructions were closer to the raw compute logic.
I'm not even arguing against the extension. If anything, I think I would prefer it if an open CISC ISA were gaining popularity as an open-source platform instead of RISC-V, because it would have the same portability implications, but with smaller code size and memory footprint, while leaving a lot of room for innovation and optimization in the actual design.
What I actually argue against is calling it a RISC ISA, because while the core of the ISA is RISC, this vector extension is not.
Actually I can. I can tell you that on the chip I'm currently using "vadd.vv v3, v1, v2" takes 3 clock cycles for LMUL=1, 6 clock cycles for LMUL=2, 12 clock cycles for LMUL=4, and 24 clock cycles for LMUL=8.
You always have to know what the current LMUL is, otherwise you can end up using illegal register numbers. For example, at LMUL=8 you can only use v0, v8, v16, and v24. Anything else causes an exception.
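To make that concrete, here is a small sketch (again in 0.7.1-style assembly, so the exact vsetvli arguments are an assumption) of how the same vadd.vv opcode covers one or eight registers' worth of elements depending on the LMUL configured beforehand, and of the register-number restriction at LMUL=8:

```
    vsetvli t0, a0, e32,m1   # LMUL=1: each vector register forms its own group
    vadd.vv v3, v1, v2       # operates on a single register's worth of elements

    vsetvli t0, a0, e32,m8   # LMUL=8: vector registers are grouped in eights
    vadd.vv v16, v0, v8      # legal: v0, v8, v16, v24 are the only valid group numbers
    # vadd.vv v3, v1, v2     # would trap here: v1/v2/v3 are not multiples of 8
```

Same instruction bytes, different amount of work, which is exactly why the cycle count scales with LMUL.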
"From what I understand, you can't tell how long each vlb.v or vsb.v will take on a given processor, as it depend on the number of elements that will be returned in a4. The last iteration of the loop for example will probably take less time than the others, because the instructions will only have to load and store the trailing part of the data."
If you look at the link I provided, you will see that's not the case. The memcpy() function takes absolutely identical time for any memcpy() length from 0 bytes to 64 bytes. It does not depend on the number of active elements. At least on this chip, which is currently the only chip in the world you can get your hands on with a RISC-V Vector unit.
On some other chip it might take a different amount of time. But on this one, and probably many others, it takes the same amount of time.
"I think I would prefer if an open CISC ISA [...] with smaller code size and memory footprint"
CISC ISAs do not have smaller code size.
The two most compact modern full-featured ISAs, in terms of code size, are ARMv7 and RISC-V. They are smaller than i686 by quite a margin.
In 64 bit there is absolutely no competition: RISC-V has the smallest code, with ARMv8 and AMD64 quite similar to each other but significantly bigger.
Just look at the same programs compiled for each one and you'll see. I suggest something like Ubuntu 21.04, which is available for all three. Take a look in /bin and /usr/bin and run "size" on the binaries. It's indisputable. RISC-V is the clear winner in 64 bit ISAs. ARMHF is similar or a little bit more compact in 32 bit. CISC x86 isn't close in either case.
> "multiple" doesn't imply "not fixed".
But in the case of risc-v, it really is "not fixed". For example, can you tell me the number of cycle it takes to execute this instruction: `vadd.vv v3, v1, v2` ?
Another example, in your implementation of memcpy:
``` 0: 86aa mv a3,a0
0000000000000002 <.L1^B1>:
```From what I understand, you can't tell how long each vlb.v or vsb.v will take on a given processor, as it depend on the number of elements that will be returned in a4. The last iteration of the loop for example will probably take less time than the others, because the instructions will only have to load and store the trailing part of the data.
> If you look at the result of a vectorised memcpy() I did on the D1, you can see that the execution time is absolutely identical at 30.9 ns (31 clock cycles at 1.008 GHz) from sz=0 to sz=64
Again, I'm not arguing against the efficiency of RISC-V vector extension. I'm sure any processor designer will be able to implement it as efficiently as if the instructions were closer to the raw compute logic.
I'm not even arguing against the extension. If anything, I think I would prefer if an open CISC ISA was gaining popularity as an open source platform instead of RISC-V, because it would have the same portability implications, but with smaller code size and memory footprint, and while leaving a lot of room for innovation and optimization on the actual design.
What I actually argue against is calling it a RISC ISA, because while the core of the ISA is RISC, this vector extension is not.