ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V.

RISC-V has only one good feature for code density, the combined compare-and-branch instructions, but even this feature was designed poorly, because it does not provide all the kinds of compare-and-branch that are needed. For example, if you want safe code that checks for overflow, the number of required instructions and the code size explode. Only unsafe code, without run-time checks, can have an acceptable size in RISC-V.
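(To make the overflow point concrete, a minimal C sketch of an overflow-checked add, with a hypothetical function name. On an ISA with condition flags the check is typically one branch-on-overflow after the add; on RISC-V, which has no flags and no branch-on-overflow, compilers usually emit a few extra comparison instructions plus a branch. The exact sequences depend on the compiler.)

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper; __builtin_add_overflow is the GCC/Clang builtin. */
    bool checked_add(int32_t a, int32_t b, int32_t *sum)
    {
        /* ARM-style ISAs: roughly "adds" followed by "b.vs <overflow>".
           RISC-V (typical): add plus two compares plus a branch, since
           there is no overflow flag to test. */
        return __builtin_add_overflow(a, b, sum);
    }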

ARMv8 has adequate unused space in the branch opcode map, where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V, in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.

While the combined compare-and-branch instructions of RISC-V are good for code density, because branches are very frequent, the rest of the ISA is bad, and the worst part is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
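(A minimal C sketch of what is meant, with a hypothetical function name. For a[i] with 8-byte elements, AArch64 can use a single load with a shifted register index, while RISC-V compilers typically emit a shift, an add, and then the load.)

    #include <stdint.h>

    int64_t element(const int64_t *a, long i)
    {
        /* AArch64 (typical): ldr  x0, [x0, x1, lsl #3]
           RV64 (typical):    slli t0, a1, 3
                              add  t0, a0, t0
                              ld   a0, 0(t0)            */
        return a[i];
    }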




>in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.

Many things could be said about ARMv8, but that it has good code size is not one of them. It does, in fact, have abysmal code density. Both RISC-V and x86-64 produce significantly smaller binaries. For RISC-V, we're talking about a 20% reduction in size.

There's a wealth of papers on this, but you can verify it trivially yourself, either by compiling binaries for different architectures from the same sources or by comparing binaries in Linux distributions that support both RISC-V and ARM.

>where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V

If your argument is that ARMv8 could get better over time, I hate to be the bearer of bad news. ARMv9 code density isn't any better.

>and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.

These patterns are standardized, and they become one instruction after fusion.

RISC-V, unlike the previous generation of ISAs, was designed from the start with macro-op fusion in mind. The simplest microarchitectures can of course omit it altogether, but the cost of fusion in RISC-V is low; I have seen it quoted at around 400 gates.


Instruction fusion is a possibility for the future, which has been discussed academically, but no one implements it at present. I'm not sure anyone will -- it's too much complexity for simple cores, and not needed for big OoO cores.

The one fusion implementation I'm aware of is in the SiFive 7-series: a conditional branch that jumps forward over exactly one instruction is combined with that instruction, turning the pair into predicated execution.
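(A minimal C sketch of that pattern, with a hypothetical function name. A compiler will often emit a branch over a single instruction here, which a core doing this kind of fusion can execute as one predicated operation.)

    long bump_if_zero(long x, long flag)
    {
        /* Typical RV64 output (registers approximate):
               bnez a1, 1f
               addi a0, a0, 1
           1:  ret                                      */
        if (flag == 0)
            x += 1;
        return x;
    }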

I agree with everything else, in particular the code density. Anyone can download Ubuntu or Fedora images of the same release for amd64, arm64, and riscv64, mount them, and run "size" on any selection of binaries you want. The RISC-V ones are consistently and significantly smaller than the other two, with arm64 the biggest.


I'm not sure how you missed RISC-V's big feature for code density -- the "C" extension, giving it arbitrarily mixed 16 and 32 bit opcodes.

I've heard of that feature before somewhere else. It gave the company that invented it unparalleled code density in their 32 bit systems and propelled them to the heights of success in mobile devices. What was their name? Wait .. oh, yes ... ARM.

Why they forgot this in their 64 bit ISA is a mystery. The best theory I can come up with is that they thought the industry had shaken out and amd64 was the only competition they were going to have, ever. Aarch64 does indeed have very good code density for a fixed-length 32 bit opcode ISA, and comes very close to matching amd64. They may have thought that was going to be good enough.

Note: the RISC-V "C" extension is technically optional, but the only CPU cores I know of that don't implement it are academic toys, student projects, and tiny cores for use in FPGAs where they are running programs with only a few hundred instructions in them. Once you get over even maybe 1 KB of code it's cheaper in resources to implement "C" than to provide more program storage.


Unfortunately, variable length opcodes are a problem for wide superscalar machines, i.e. the fast ones.


Speaking about RISC-V, no, it is not. In RISC-V "C", every 16-bit instruction has a 32-bit counterpart. When the front end reads in an instruction word (32 bits), it can extract two 16-bit ops from it, expand each to its 32-bit counterpart, and feed them serially to the decoder. So there's only one decoder doing the work for both 16-bit and 32-bit ops (it basically does not distinguish them), and that's also what makes macro-op fusion possible and easy to implement, unlike ARM's Thumb, which has two separate decoders with all the consequences.
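(A minimal C sketch of that expansion idea, not taken from any real core: rewrite a 16-bit C.ADDI into its 32-bit ADDI counterpart so the ordinary 32-bit decoder can handle it. Field layouts follow the RISC-V spec; the reserved cases (rd = 0, imm = 0) and all error handling are omitted.)

    #include <stdint.h>

    /* C.ADDI: 000 | imm[5] | rd/rs1 | imm[4:0] | 01      (16 bits)
       ADDI:   imm[11:0] | rs1 | 000 | rd | 0010011       (32 bits) */
    uint32_t expand_c_addi(uint16_t c)
    {
        uint32_t rd  = (c >> 7) & 0x1f;                        /* rd == rs1 */
        int32_t  imm = ((c >> 2) & 0x1f) | (((c >> 12) & 1) << 5);
        if (imm & 0x20)
            imm |= ~0x3f;                                      /* sign-extend 6-bit imm */
        return ((uint32_t)(imm & 0xfff) << 20)                 /* imm[11:0]   */
             | (rd << 15)                                      /* rs1         */
             | (0u << 12)                                      /* funct3=ADDI */
             | (rd << 7)                                       /* rd          */
             | 0x13u;                                          /* OP-IMM      */
    }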


ARM literally documented Thumb as using the exact mechanism you just claimed they do not have and RISC-V does. I suggest reading the ARMv4T spec.


I'll surely read the ARMv4T spec when I have a bit more free time, thanks :). But ARM requires switching machine mode to select the instruction set (you cannot mix Thumb with regular 32-bit code), which kind of hints that a selection of decoder takes place. In RISC-V, although it's up to the micro-architecture designer to choose, only one decoder is needed, and you can have a mixture of 16-bit and 32-bit instructions in the program flow. What's more, with macro-op fusion, two consecutive "C" instructions can be viewed as one "unnamed" instruction that does a lot more work. A bit more detail from the RISC-V authors on the subject, along with benchmarks: https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fu...


But not that much of a problem. x86 is way, way worse about variable length opcodes than RISC-V and there are plenty of fast x86 processors...


The thing with the lack of shifted indexed addressing is that it just might not matter all that much beyond toy examples. Address calculations can generally be folded in with other code, particularly in loops, which are a common case. So it's only rarely that you actually need those extra instructions.


Shifted indexed addressing is needed less often, but indexed addressing, i.e. register + register, is needed in every loop that accesses memory.

There are 2 ways of programming a loop that addresses memory with a minimum of instructions.

One way, which is preferable e.g. on Intel/AMD, is to reuse the loop counter as the index into the data structure that is accessed, so each load/store needs base register + index register addressing, which is missing in RISC-V.

The second way, which is preferable e.g. on POWER and which is also available on ARM, is to use an addressing mode with auto-update, where the offset used in loads or stores is added into the base register. This is also missing in RISC-V.

Because neither of the two methods works in RISC-V with a minimum number of instructions, as it does in all other CPUs, all such loops, which are very frequent, need pairs of instructions in RISC-V corresponding to single instructions in the other CPUs.
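(A minimal C sketch of the kind of loop being discussed, with a hypothetical function name. On x86-64 or AArch64 the load inside the loop can use base + index (optionally shifted) addressing; on RISC-V the compiler either emits an extra shift/add per access or turns the index into a separately incremented pointer.)

    #include <stddef.h>

    long sum(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];   /* address = base + i*8 on every iteration */
        return s;
    }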


A big difference here is that the two RISC-V instructions are usually 16 bits each, while the Aarch64 and POWER instructions are all 32 bits in size. So the code size is the same.

Also, high performance Aarch64 and POWER implementations are likely to be splitting those instructions into two decoupled uops in the back end.

Performance-critical loops are unrolled on all ISAs, both to minimise loop control overhead and to allow scheduling instructions around the several-cycle latency of loads from even L1 cache. When you do that, indexed addressing and auto-update addressing are still doing both operations for every load or store, which, as well as being a lot of operations, introduces a sequential dependency between the instructions. The RISC-V way allows the use of simple load/store with offset -- all of which are independent of each other -- with one merged update of each pointer at the end of the loop. POWER and Aarch64 compilers for high-performance microarchitectures use the RISC-V structure for unrolled loops anyway.
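(A minimal C sketch of that unrolled structure, with a hypothetical function name and assuming n is a multiple of 4: four loads at fixed offsets from one pointer, independent of each other, then a single induction-variable update per iteration.)

    long sum4(const long *a, long n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (long i = 0; i < n; i += 4) {   /* n assumed to be a multiple of 4 */
            s0 += a[i];                     /* offset 0 from a + i */
            s1 += a[i + 1];                 /* offset 8            */
            s2 += a[i + 2];                 /* offset 16           */
            s3 += a[i + 3];                 /* offset 24           */
        }                                   /* one index update covers all four loads */
        return s0 + s1 + s2 + s3;
    }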

So indexed addressing and auto-update addressing give no advantage for code size, and don't help performance at the high end.


"ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V."

not true

You could compile and compare, say, gcc (cc1) of the same version on arm64 and RISC-V; arm64 has larger binaries.



