
> built with hindsight no other ISA has

Why do all the RISC-V fans conveniently ignore aarch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.




Aarch64 obviously isn't a completely clean sheet design. It was constrained by having to execute on the same CPU pipelines as 32 bit code, at least for the first decade or so. And the 32 bit mode has to perform well. There are tens of millions of Raspberry Pi 3s and 4s (and later model Pi 2s) which have 64 bit CPUs but have never seen a 64 bit instruction in their lives. Android phones have been supporting both 32 and 64 bit apps for a long time.

The "by people who know what they are doing" thing is just pure FUD. Sure, ARM employs some competent people, but no more so than IBM, Intel, AMD or the various members of RISC-V International.


I'm a fan of RISC-V but the freedom is a large part of it. Aarch64 is a very well designed ISA and clearly has a lot of benefit of hindsight. The load pair/store pair instructions, the addressing modes, fixed 32-bit instruction size, etc. It all really helps. I suspect that Apple was actively part of designing it.

I think, however, that RISC-V isn't that much worse, and because of the freedom we will almost certainly see more implementations of RISC-V. I'd be watching Tenstorrent, SiFive, Rivos, Esperanto, and maybe Alibaba/T-Head.


>Why do all the RISC-V fans conveniently ignore aarch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.

aarch64 seems poorly designed to me.

ARMv7 had Thumb, but for some reason ARMv8 did not incorporate any lessons from that. As a result, code density is bad; ARMv8 binaries are huge.

ARMv9, to be available in chips next year, is just a higher profile of required extensions, and does nothing to fix that.

Ever wonder why M1 needs such a huge L1 cache? Well, now you know.

Considering ARMv9 will be competing against RVA22, I don't have much hope for ARM.


ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V.

RISC-V has only one good feature for code density, the combined compare-and-branch instructions, but even this feature was designed poorly, because it does not have all the kinds of compare-and-branch that are needed, e.g. if you want safe code that checks for overflows, the number of required instructions and the code size explode. Only unsafe code, without run-time checks, can have an acceptable size in RISC-V.

ARMv8 has adequate unused space in the branch opcode map, where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V, in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.

While the combined compare-and-branch of RISC-V are good for code density, because branches are very frequent, the rest of the ISA is bad and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.


>in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.

Many things could be said about ARMv8, but that it has good code size is not one of them. It does, in fact, have abysmal code density. Both RISC-V and x86-64 produce significantly smaller binaries. For RISC-V, we're talking about a 20% reduction in size.

There's a wealth of papers on this, but you can verify this trivially yourself, by either compiling binaries for different architectures from the same sources, or comparing binaries in Linux distributions that support RISC-V and ARM.

>where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V

If your argument is that ARMv8 could get better over time, I hate to be the bearer of bad news. ARMv9 code density isn't any better.

>and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.

These patterns are standardized, and they become one instruction after fusion.

RISC-V, unlike the previous generation of ISAs, was thoroughly designed with hindsight on fusion. The simplest microarchitectures can of course omit it altogether, but the cost of fusion in RISC-V is low; I have seen it quoted at 400 gates.


Instruction fusion is a possibility for the future, which has been discussed academically, but no one implements it at present. I'm not sure anyone will -- it's too much complexity for simple cores, and not needed for big OoO cores.

The one fusion implementation I'm aware of is the SiFive 7-series combining a conditional branch that jumps forward over exactly one instruction with that instruction. It turns the instruction pair into predicated execution.

I agree with everything else. In particular the code density. Anyone can download Ubuntu or Fedora images for the same release for amd64, arm64, and riscv64. Mount them and run "size" on any selection of binaries you want. The RISC-V ones are consistently and significantly smaller than the other two, with arm64 the biggest.
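
For anyone who wants to script that comparison, here is a minimal sketch. It assumes GNU `size` is on the PATH and prints its default Berkeley-format table (text, data, bss, dec, hex, filename); the function names and the mount points in the usage note are hypothetical.

```python
import os
import subprocess


def parse_size_output(output: str) -> dict[str, int]:
    """Map each filename to its text-segment size, given Berkeley-format `size` output."""
    sizes = {}
    for line in output.splitlines():
        # Columns: text, data, bss, dec, hex, filename (filename may contain spaces).
        fields = line.split(None, 5)
        # Skip the header row and anything that is not a data row.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        sizes[fields[5].strip()] = int(fields[0])
    return sizes


def total_text(root: str, binaries: list[str]) -> int:
    """Run `size` on the given binaries under one mounted image and sum the text sizes."""
    paths = [os.path.join(root, b) for b in binaries]
    result = subprocess.run(
        ["size", *paths], capture_output=True, text=True, check=True
    )
    return sum(parse_size_output(result.stdout).values())
```

Usage would be something like `total_text("/mnt/riscv64", ["usr/bin/bash"]) / total_text("/mnt/arm64", ["usr/bin/bash"])`, with both images from the same distribution release so the same sources and compiler version are being compared.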


I'm not sure how you missed RISC-V's big feature for code density -- the "C" extension, giving it arbitrarily mixed 16 and 32 bit opcodes.

I've heard of that feature before somewhere else. It gave the company that invented it unparalleled code density in their 32 bit systems and propelled them to the heights of success in mobile devices. What was their name? Wait .. oh, yes ... ARM.

Why they forgot this in their 64 bit ISA is a mystery. The best theory I can come up with is that they thought the industry had shaken out and amd64 was the only competition they were going to have, ever. Aarch64 does indeed have very good code density for a fixed-length 32 bit opcode ISA, and comes very close to matching amd64. They may have thought that was going to be good enough.

Note: the RISC-V "C" extension is technically optional, but the only CPU cores I know of that don't implement it are academic toys, student projects, and tiny cores for use in FPGAs where they are running programs with only a few hundred instructions in them. Once you get over even maybe 1 KB of code it's cheaper in resources to implement "C" than to provide more program storage.


Unfortunately, variable length opcodes are a problem for wide superscalar machines, i.e. the fast ones.


Speaking about RISC-V, no it is not. In RISC-V "C" all 16 bit instructions have their 32 bit counterparts. When the front end reads in an instruction word (32 bits) holding two compressed instructions, it expands them into two 32 bit ops and then feeds them serially to the decoder. So there's only one decoder that does the work for both 16 and 32 bit ops (basically it does not distinguish them), and that's also what makes macro-op fusion possible and easy to implement, unlike ARM's Thumb, which has two separate decoders with all the consequences.
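
To make the "every 16 bit instruction has a 32 bit counterpart" point concrete, here is a rough sketch of the expansion for a single instruction, C.ADDI, following the bit layout in the RISC-V spec. The function name is made up, and a real front end does this in hardware for the whole "C" extension, not one opcode in software.

```python
def expand_c_addi(insn16: int) -> int:
    """Expand a 16-bit C.ADDI encoding into the equivalent 32-bit ADDI rd, rd, imm."""
    # C.ADDI lives in quadrant 01 (bits 1:0) with funct3 = 000 (bits 15:13).
    if (insn16 & 0b11) != 0b01 or ((insn16 >> 13) & 0b111) != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (insn16 >> 7) & 0x1F
    # The 6-bit immediate is split: imm[5] is bit 12, imm[4:0] is bits 6:2.
    imm = ((insn16 >> 2) & 0x1F) | (((insn16 >> 12) & 1) << 5)
    if imm & 0x20:
        imm -= 0x40  # sign-extend from 6 bits
    # ADDI: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011
    return ((imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0x13


# c.addi a0, 1 (0x0505) becomes addi a0, a0, 1 (0x00150513)
print(hex(expand_c_addi(0x0505)))
```

After this step the rest of the pipeline only ever sees the 32 bit form, which is the sense in which the decoder does not need to distinguish the two.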


ARM literally documented Thumb as using the exact mechanism you just claimed they do not have and RISC-V does. I suggest reading the ARMv4T spec.


I'll surely read the ARMv4T spec when I have a bit more free time, thanks :). But ARM requires switching machine mode to select the instruction set (you cannot mix Thumb with regular 32 bit), which kind of hints that a selection of decoder takes place. In RISC-V, although it's up to the micro-arch designer to choose, only one decoder is needed and you can have a mixture of 16 bit and 32 bit instructions in the program flow. What's more, with the macro-op fusion feature, two consecutive "C" instructions can be viewed as one "unnamed" instruction that does a lot more work. A few more details from the RISC-V authors on the subject, along with benchmarks: https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fu...


But not that much of a problem. x86 is way, way worse about variable length opcodes than RISC-V and there are plenty of fast x86 processors...


The thing with lack of shifted indexed addressing is that it just might not matter all that much beyond toy examples. Address calculations can generally be folded in with other code, particularly in loops which are a common case. So it's only rarely that you actually need those extra instructions.


Shifted indexed addressing is needed more seldom, but indexed addressing, i.e. register + register, is needed in every loop that accesses memory.

There are 2 ways of programming a loop that addresses memory with a minimum of instructions.

One way, which is preferable e.g. on Intel/AMD, is to reuse the loop counter as the index into the data structure that is accessed, so each load/store needs a base register + index register addressing, which is missing in RISC-V.

The second way, which is preferable e.g. on POWER and which is also available on ARM, is to use an addressing mode with auto-update, where the offset used in loads or stores is added into the base register. This is also missing in RISC-V.

Because neither of the 2 methods works in RISC-V with a minimum number of instructions, as it does in all other CPUs, all such loops, which are very frequent, need pairs of instructions in RISC-V, corresponding to single instructions in the other CPUs.


A big difference here is that the RISC-V instructions are usually all 16 bits in size while the Aarch64 and POWER instructions are all 32 bits in size. So the code size is the same.

Also, high performance Aarch64 and POWER implementations are likely to be splitting those instructions into two decoupled uops in the back end.

Performance-critical loops are unrolled on all ISAs to minimise loop control overhead and also to allow scheduling instructions to allow for the several cycle latency of loads from even L1 cache. When you do that, indexed addressing and auto-update addressing are still doing both operations for every load or store which, as well as being a lot of operations, introduces sequential dependency between the instructions. The RISC-V way allows the use of simple load/store with offset -- all of which are independent of each other -- with one merged update of each pointer at the end of the loop. POWER and Aarch64 compilers for high performance microarchitectures use the RISC-V structure for unrolled loops anyway.

So indexed addressing and auto-update addressing give no advantage for code size, and don't help performance at the high end.
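
A rough sketch of the three loop shapes being compared, with a Python list standing in for memory and an integer standing in for a pointer register. The function names are made up, and this only shows where the address arithmetic happens; it says nothing about how any particular compiler schedules it.

```python
def sum_indexed(mem, base, n):
    """Base + index addressing: the loop counter doubles as the index."""
    total = 0
    for i in range(n):
        total += mem[base + i]  # one address add per load
    return total


def sum_auto_update(mem, base, n):
    """Auto-update addressing: every load also bumps the pointer."""
    total, p = 0, base
    for _ in range(n):
        total += mem[p]
        p += 1  # each update depends on the previous one
    return total


def sum_unrolled_offsets(mem, base, n):
    """Unrolled by 4 with constant offsets and one merged pointer update."""
    total, p = 0, base
    for _ in range(n // 4):
        # Four independent loads at fixed offsets from the same pointer value.
        total += mem[p] + mem[p + 1] + mem[p + 2] + mem[p + 3]
        p += 4  # single pointer update per unrolled iteration
    for _ in range(n % 4):  # leftover elements
        total += mem[p]
        p += 1
    return total


mem = list(range(100))
print(sum_indexed(mem, 3, 10), sum_auto_update(mem, 3, 10), sum_unrolled_offsets(mem, 3, 10))
```

In the third version the four loads in the loop body all read the same value of `p`, so none of them has to wait for an earlier address update.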


"ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V."

not true

You could compile and compare, say, gcc (cc1) of the same version on arm64 and RISC-V. arm64 has larger binaries.


> for some reason ARMv8 did not incorporate any lessons from that.

I used to think so too, until I asked some more knowledgeable people about it. Turns out the lesson IS that not having it is better. Fixed-size instructions make decoding significantly simpler, making it much easier to build very wide front ends.


A little easier, not much easier. A number of organisations are making very wide RISC-V implementations, and one has already published how their decoder works. It's modular, with each block looking at 48 bits of code (the first 16 overlapping with the previous block) and decoding either two 16 bit instructions, or one aligned 32 bit instruction, or one misaligned 32 bit instruction with a following 16 bit instruction, or one misaligned 32 bit instruction followed by an ignored start of another misaligned 32 bit instruction.

You can put as many of these modules side by side as you want. There is a serial dependency between them in that each block has to tell the next block whether its last 16 bits are the start of a misaligned 32 bit instruction or not. That could become an issue with really, really wide decoders, but for something decoding e.g. 16 bytes at a time (4 to 8 instructions) it's not an issue.
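
Here is a small software model of that block structure, as a sketch only. It assumes just 16 and 32 bit encodings (a parcel whose low two bits are 11 starts a 32 bit instruction), runs the blocks one after another where hardware would work in parallel, and also handles one case not listed above: a 16 bit instruction followed by the start of a misaligned 32 bit one. All names are made up.

```python
def starts_32bit(parcel: int) -> bool:
    """True if this 16-bit parcel begins a 32-bit instruction (low two bits are 11)."""
    return (parcel & 0b11) == 0b11


def decode_block(prev: int, p0: int, p1: int, carry_in: bool):
    """Decode one block: 32 new bits (p0, p1) plus the overlapping parcel `prev`.

    carry_in says whether `prev` was the start of a misaligned 32-bit instruction.
    Returns (instructions, carry_out); each instruction is a (width, bits) pair.
    """
    out = []
    if carry_in:
        # Misaligned 32-bit instruction: low half in prev, high half in p0.
        out.append((32, prev | (p0 << 16)))
    elif starts_32bit(p0):
        # Aligned 32-bit instruction fills the whole block.
        return [(32, p0 | (p1 << 16))], False
    else:
        out.append((16, p0))
    if starts_32bit(p1):
        # p1 starts a misaligned 32-bit instruction; leave it to the next block.
        return out, True
    out.append((16, p1))
    return out, False


def decode_stream(parcels):
    """Chain blocks over a list of 16-bit parcels, passing the carry bit along."""
    if len(parcels) % 2:
        raise ValueError("expected a whole number of 32-bit blocks")
    decoded, carry, prev = [], False, 0
    for i in range(0, len(parcels), 2):
        insns, carry = decode_block(prev, parcels[i], parcels[i + 1], carry)
        decoded.extend(insns)
        prev = parcels[i + 1]
    return decoded
```

The only thing passed from one block to the next is the single `carry` bit, which is the serial dependency described above.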

There is a trade-off between a little bit of decoder complexity and a lot of improved code density -- but nowhere near to the same extent as say x86.


While I have no personal experience writing aarch64 assembler code, my experience with ARM v6m and v7m makes me doubt your implied insult that ARM just failed/didn't give a fuck about their instruction set. Thumb 1 and 2 are well designed instruction sets optimized for a certain kind of uarch. Almost all quirks exposed to the low level programmer are there for good reasons, and while some of the constraints pose a challenge for compiler writers, they are not beyond the capabilities of GCC or LLVM. There are several possible reasons for ARM to return to a fixed length 32 bit encoding, e.g. to allow very wide OoO designs like Apple's Firestorm cores, or because the gain is smaller for 64 bit code, which has larger constants and is better served by PC relative constant pools. And while the quirky LDMIA function epilogue is very flexible, appeals to me as an assembler programmer, and saves code space, having a single instruction potentially modify most integer registers as well as change the program counter and the active instruction set is hard to implement well, while the easier to implement register pair load/store instructions are enough for most common instruction sequences. The tradeoff was different for in-order ARM2/3 CPUs with single ported memory and a tiny unified cache (if that).


> Thumb 1 and 2 are well designed instruction sets

I had thought that Thumb 1 had serious shortcomings, which is why they ended up needing Thumb 2.


> Ever wonder why M1 needs such huge L1 cache? Well, now you know.

I'm not sure I follow this, but it reminds me to ask: does RISC-V allow for designs to have both efficiency & performance cores like the ARM big.LITTLE concept? Has anyone made one yet?


Of course you can do it. SiFive has been allowing customers to configure core complexes with a mixture of different core types for years -- for example mixing U84 cores with U74 or U54. If you want to do a big.LITTLE thing, transferring a running program from one core type to another, that's just a software thing -- it only needs cores with the same ISA but different microarchitecture.

To date the examples of this that have been shipped to the public have used cores with similar microarchitecture, but a different set of extensions.

For example the U54-MC in the HiFive Unleashed and in the Microsemi Polarfire SoC FPGAs use four U54 cores plus one E51 core for "real time" tasks. The E51 doesn't have an FPU or MMU or Supervisor mode. The U74-MC in the HiFive Unmatched is similar.

Alibaba's ICE SoC, which you may have seen videos of running Android, has two C910 Out-of-Order cores (similar to ARM A72/A73) implementing RV64GC, and a third C910 core that also has a vector processing unit with two pipes with 256 bit vector ALU each, plus 128 bit vector load and store pipes.



