Some context: RISC-V Summit is next week, and RISC-V international has just approved a batch of important extensions[0]. With these extensions, RISC-V is not missing anything relative to ARM and x86 ISAs in terms of functionality.
I expect a lot of tape-outs to happen this month, as core vendors were probably holding for the announced ratifications, in fear of last minute changes. Next year is going to be exciting.
I wouldn't say RISC-V isn't missing anything. The lack of add/subtract-with-carry is an issue for efficient runtimes of many JITed languages like JavaScript.
That being said, I don't think it's the worst thing in the world, as some do. The focus now should be on compiled code, since JITs by definition can make runtime decisions about whether some future extension that fixes this deficiency exists. The J extension has stalled for the moment, but with these other extensions ratified there should hopefully be more bandwidth available.
Maybe, but it's a leap, IMO. The equivalent patterns are 3x as long, and they modify a lot of architecturally visible state for their intermediate results, which leaves more work for the fused instructions to do.
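To make the complaint concrete, here's a hedged sketch in C (illustrative, not from any particular JIT): a multi-word add, the bignum/JS-number building block. On x86-64 the high half is one `adc`; RISC-V has no flags register, so the carry must be recomputed explicitly with the `sltu` idiom (`sum < addend` implies the add wrapped).

```c
#include <stdint.h>

/* Add two 128-bit values held as 64-bit limbs.
   On x86-64 this lowers to add + adc; on RISC-V the carry is
   rematerialized with an extra compare (the sltu idiom below). */
void add128(uint64_t a_lo, uint64_t a_hi,
            uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi) {
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;   /* 1 iff the low add wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}
```

The extra compare (and the register holding the carry) is exactly the architecturally visible intermediate state mentioned above.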
The complaint is valid, IMO, and the issue would have shown up in the filtering process they used to select ops if they had been working with JIT output too, rather than just what's in AOT code.
It can try... but you're basically trying to "decompile" or "compress" code to a higher level, and that's neither easy nor efficient. If something relatively simple like ADC is difficult, think of something like an entire encryption/hash round, which competing CISC processors already have dedicated instructions for. And even if you do manage to make that work, there's still the matter of those extra instructions taking up valuable space in caches and memory bandwidth.
Hence why I don't think "RISC is the future" unlike a lot of other proponents; I think a CISC with uop-based decoding will be more scalable and performant. Even ARMs have moved a little in that direction.
Classic CISC processors like the VAX had lots of memory-to-memory instructions, complex looping constructs, etc. Special ops that are register-to-register aren't anti-RISC.
> Can't vendor's making desktop/mobile class CPUs detect the equivalent pattern and optimize it in microcode or silicon?
The riscv stans keep saying that, but nobody has given a demo or shown benchmarks afaik, even under simulation. So it's just handwaving.
It's not only JavaScript, of course. Integer overflow in C is an error condition (undefined behaviour) that compilers usually don't try to trap. The -ftrapv option in gcc and clang enables trapping at some performance cost, so it's rarely used, and we get continuing bugs and vulnerabilities as a result. (Ada mandates trapping unless you enable an unsafe optimization, which is, um, enabled by default in GNAT.) RISC-V increases that performance cost considerably from what I can tell. That's the opposite of what we needed.
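For reference, the check a compiler must emit looks roughly like this sketch using the GCC/Clang overflow builtin. On a flag-based ISA it lowers to an add plus a branch-on-overflow; RISC-V needs additional compare instructions to reconstruct the overflow condition, which is the cost being discussed.

```c
#include <stdint.h>
#include <stdbool.h>

/* Checked 32-bit add, the kind of test -ftrapv-style codegen inserts.
   Returns true and stores the sum if no signed overflow occurred. */
bool checked_add(int32_t a, int32_t b, int32_t *out) {
    return !__builtin_add_overflow(a, b, out);
}
```

A real -ftrapv build would call abort() instead of returning false, but the per-operation check is the same.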
I'm no CPU architect, but I know CPUs are able to signal overflow in floating-point arithmetic, since IEEE 754 requires it. So I don't understand why they can't do it for integers.
Isn't the obvious solution to the overflow problem to define the behavior, like pretty much all newer languages did (presumably because they learned from the mistakes made in C)?
If you only want safety, then trapping or not, signalling or not, does not matter at all. It is UB that causes safety problems, not the overflow itself. And RISC-V mandates how overflow is handled. No UB.
Throwing on arithmetic overflow is a language choice. And at least Rust's position is that trapping on every arithmetic operation is not necessary for security.
The only related problem with the lack of overflow trapping is that dynamically typed languages need numeric type conversion on overflow. But TBH, if a numerical JavaScript program often generates 1.7E308, then it's a terrible program that no one should care about.
Interestingly, the MIPS CPU traps on overflow for the add and sub instructions. You have to use the addu or subu instructions to get the usual wrapping behavior on overflow.
Faster than ARM's Cortex-A77: https://www.phoronix.net/image.php?id=2021&image=sifive_p650... . Performance comparable to Apple's Icestorm architecture, the 'efficiency' cores in the M1. Considering the Cortex-A710 is the fastest ARM core currently available and its successor will only be available next year, SiFive is just a few years away from real competition in an arena currently dominated by ARM.
It will be interesting to see a comparison of power efficiency as well as performance. RISC-V implementations have shown a pretty sizeable advantage in power use in the past, and we don't quite know how that advantage carries over to these larger, performance-focused designs.
Sure, have you actually used one? There are some challenges with the software support of RPis, especially the model 4 and its GPU drivers. I would like to see a platform (potentially RISC-V) with great software support, so I could finally use one of these devices as a replacement for TV set-top boxes running Android.
Because a CPU architecture does not exist in a vacuum, and RISC-V's marketing is about being open, I expect they are going to have the rest of the SoC as open as well. I guess it will be easy enough to have all the drivers as part of Linux.
such as the $30 Sparkfun Red or the $20 Lofive boards. Those are for running an RTOS, not Linux, but they compete with Arduino, mbed, teensy, and other ARM Cortex M series microcontrollers.
A price target of $10 is something you'll only hit with massive scale-up.
That seems to imply a certain integer arithmetic performance, but I wonder what the floating point performance is. They could have just said "X flops".
Comparing to the other benchmarks at [1], I have no idea, because those all report raw totals rather than normalized per-GHz, per-core results. Nice reporting.
How fast is this thing? Pentium? First-gen i3? Current-gen Ryzen 5? The fact that they are being so obtuse about it leads me to believe performance isn't great.
There are several vendors besides SiFive offering RISC-V cores for licensing. There are even some OSHW cores that can be freely used.
Even if we choose to ignore the technical prowess of being a true 5th generation RISC ISA built with hindsight no other ISA has, what's IMHO a big deal in RISC-V is the mere availability of this market of cores.
It poses a threat to ARM's business model, where ARM licenses cores and the ISA, but nobody other than ARM can license cores to others.
Why do all the riscv fans conveniently ignore aarch64 when they make statements like this?
It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.
Aarch64 obviously isn't a completely clean sheet design. It was constrained by having to execute on the same CPU pipelines as 32 bit code, at least for the first decade or so. And the 32 bit mode has to perform well. There are tens of millions of Raspberry Pi 3s and 4s (and later model Pi 2s) which have 64 bit CPUs but have never seen a 64 bit instruction in their lives. Android phones have been supporting both 32 and 64 bit apps for a long time.
The "by people who know what they are doing" thing is just pure FUD. Sure, ARM employs some competent people, but no more so than IBM, Intel, AMD or the various members of RISC-V International.
I'm a fan of RISC-V but the freedom is a large part of it. Aarch64 is a very well designed ISA and clearly has a lot of benefit of hindsight. The load pair/store pair instructions, the addressing modes, fixed 32-bit instruction size, etc. It all really helps. I suspect that Apple was actively part of designing it.
I think however that RISC-V isn't that much worse and because of the freedom we will almost certainly see more implementation of RISC-V. I'd be watching Tenstorrent, SiFive, Rivos, Esperanto, and maybe Alibaba/T-Head.
> Why do all the riscv fans conveniently ignore aarch64 when they make statements like this? It was in fact a completely clean new design, based on hindsight, by people who know what they are doing, and with no legacy cruft.
aarch64 seems poorly designed to me.
ARMv7 had Thumb, but for some reason ARMv8 did not incorporate any lessons from it. As a result, code density is bad; ARMv8 binaries are huge.
ARMv9, to be available in chips next year, is just a higher profile of required extensions, and does nothing to fix that.
Ever wonder why M1 needs such huge L1 cache? Well, now you know.
Considering ARMv9 will be competing against RVA22, I don't have much hope for ARM.
ARMv8 code density is quite good for a fixed-length ISA and is of course much better than that of RISC-V.
RISC-V has only one good feature for code density, the combined compare-and-branch instructions. But even this feature was designed poorly, because it does not provide all the kinds of compare-and-branch that are needed: if you want safe code that checks for overflows, the number of required instructions and the code size explode. Only unsafe code, without run-time checks, can have an acceptable size in RISC-V.
ARMv8 has an adequate unused space in the branch opcode map, where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V, in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
While the combined compare-and-branch instructions of RISC-V are good for code density, because branches are very frequent, the rest of the ISA is bad, and the worst part is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
>in which case the code size advantage of ARMv8 vs. RISC-V would increase significantly.
Many things could be said about ARMv8, but that it has good code size is not one of them. It does, in fact, have abysmal code density. Both RISC-V and x86-64 produce significantly smaller binaries. For RISC-V, we're talking about a 20% reduction in size.
There's a wealth of papers on this, but you can verify it trivially yourself, either by compiling binaries for different architectures from the same sources, or by comparing binaries in Linux distributions that support both RISC-V and ARM.
>where combined compare-and-branch instructions could be added, and with a larger branch offset range than in RISC-V
If your argument is that ARMv8 could get better over time, I hate to be the bearer of bad news. ARMv9 code density isn't any better.
>and the worst is the lack of indexed addressing, which frequently requires 2 RISC-V instructions instead of 1 ARM instruction.
These patterns are standardized, and they become one instruction after fusion.
RISC-V, unlike the previous generation of ISAs, was thoroughly designed with macro-op fusion in mind. The simplest microarchitectures can of course omit it altogether, but the cost of fusion in RISC-V is low; I have seen it quoted at 400 gates.
Instruction fusion is a possibility for the future, which has been discussed academically, but no one implements it at present. I'm not sure anyone will -- it's too much complexity for simple cores, and not needed for big OoO cores.
The one fusion implementation I'm aware of is the SiFive 7-series combining a conditional branch that jumps forward over exactly one instruction with that instruction. It turns the pair into predicated execution.
I agree with everything else. In particular the code density. Anyone can download Ubuntu or Fedora images for the same release for amd64, arm64, and riscv64. Mount them and run "size" on any selection of binaries you want. The RISC-V ones are consistently and significantly smaller than the other two, with arm64 the biggest.
I'm not sure how you missed RISC-V's big feature for code density -- the "C" extension, giving it arbitrarily mixed 16 and 32 bit opcodes.
I've heard of that feature before somewhere else. It gave the company that invented it unparalleled code density in their 32 bit systems and propelled them to the heights of success in mobile devices. What was their name? Wait .. oh, yes ... ARM.
Why they forgot this in their 64 bit ISA is a mystery. The best theory I can come up with is that they thought the industry had shaken out and amd64 was the only competition they were going to have, ever. Aarch64 does indeed have very good code density for a fixed-length 32 bit opcode ISA, and comes very close to matching amd64. They may have thought that was going to be good enough.
Note: the RISC-V "C" extension is technically optional, but the only CPU cores I know of that don't implement it are academic toys, student projects, and tiny cores for use in FPGAs where they are running programs with only a few hundred instructions in them. Once you get over even maybe 1 KB of code it's cheaper in resources to implement "C" than to provide more program storage.
Speaking about RISC-V, no it is not. In RISC-V, all "C" 16-bit instructions have 32-bit counterparts. When the front-end reads in an instruction word, it can expand each 16-bit op into its 32-bit counterpart and feed them serially to the decoder. So there's only one decoder that does the work for both 16-bit and 32-bit ops (it basically does not distinguish them), and that's also what makes macro-op fusion possible and easy to implement, unlike ARM's Thumb, which has two separate decoders, with all the consequences.
I'll surely read the ARMv4T specs when I have a bit more free time, thanks :). But ARM requires switching machine mode to select the instruction set (you cannot mix Thumb with regular 32-bit code), which hints that a selection between decoders takes place. In RISC-V, although it's up to the micro-architecture designer, only one decoder is needed, and you can have a mixture of 16-bit and 32-bit instructions in the program flow. What's more, with macro-op fusion, two consecutive "C" instructions can be viewed as one "unnamed" instruction that does a lot more work. More details from the RISC-V authors on the subject, along with benchmarks:
https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fu...
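The length-determination rule underlying that single-decoder design is trivial. Per the RISC-V encoding scheme, an instruction whose two lowest bits are both 1 is a full 32-bit instruction; anything else is a 16-bit compressed op. A minimal sketch of the test a front end applies while walking a mixed stream:

```c
#include <stdint.h>
#include <stddef.h>

/* RISC-V encoding rule: low two bits == 0b11 means a 32-bit
   instruction; any other value means a 16-bit "C" instruction.
   A front end can walk a mixed 16/32-bit stream with only this test. */
size_t insn_length_bytes(uint16_t first_halfword) {
    return (first_halfword & 0x3) == 0x3 ? 4 : 2;
}
```

This is why the expansion into 32-bit counterparts can happen before a single shared decoder, rather than requiring a second decode path.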
The thing with lack of shifted indexed addressing is that it just might not matter all that much beyond toy examples. Address calculations can generally be folded in with other code, particularly in loops which are a common case. So it's only rarely that you actually need those extra instructions.
Shifted indexed addressing is needed less often, but indexed addressing, i.e. register + register, is needed in every loop that accesses memory.
There are 2 ways of programming a loop that addresses memory with a minimum of instructions.
One way, which is preferable e.g. on Intel/AMD, is to reuse the loop counter as the index into the data structure that is accessed, so each load/store needs a base register + index register addressing, which is missing in RISC-V.
The second way, which is preferable e.g. on POWER and which is also available on ARM, is to use an addressing mode with auto-update, where the offset used in loads or stores is added into the base register. This is also missing in RISC-V.
Because neither of the two methods works in RISC-V with a minimum number of instructions, as it does in all other CPUs, all such loops, which are very frequent, need pairs of instructions in RISC-V corresponding to single instructions in the other CPUs.
A big difference here is that the extra RISC-V instructions are usually 16 bits in size, while the Aarch64 and POWER instructions are all 32 bits. So the code size ends up the same.
Also, high performance Aarch64 and POWER implementations are likely to be splitting those instructions into two decoupled uops in the back end.
Performance-critical loops are unrolled on all ISAs to minimise loop control overhead and also to allow scheduling instructions to allow for the several cycle latency of loads from even L1 cache. When you do that, indexed addressing and auto-update addressing are still doing both operations for every load or store which, as well as being a lot of operations, introduces sequential dependency between the instructions. The RISC-V way allows the use of simple load/store with offset -- all of which are independent of each other -- with one merged update of each pointer at the end of the loop. POWER and Aarch64 compilers for high performance microarchitectures use the RISC-V structure for unrolled loops anyway.
So indexed addressing and auto-update addressing give no advantage for code size, and don't help performance at the high end.
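The two loop shapes being debated, sketched in C (illustrative only): the indexed form wants a base+scaled-index load, which is one instruction on x86/ARM but two on RISC-V; the pointer-bump form uses plain offset loads with one pointer update, which is what compilers for high-performance targets emit anyway in unrolled loops.

```c
#include <stdint.h>
#include <stddef.h>

/* Indexed form: each access is base + (i << 3), i.e. reg+reg
   addressing with scaling, which RISC-V lacks. */
int64_t sum_indexed(const int64_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer-bump form: plain loads at a fixed offset from a pointer,
   with one pointer update per iteration (or per unrolled block). */
int64_t sum_bumped(const int64_t *a, size_t n) {
    int64_t s = 0;
    for (const int64_t *end = a + n; a != end; a++)
        s += *a;
    return s;
}
```

After unrolling, the bumped form also makes the loads independent of each other, which is the scheduling point made above.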
> for some reason ARMv8 did not incorporate any lessons from that.
I used to think so too, until I asked some more knowledgeable people about it. Turns out the lesson IS that not having it is better: fixed-size instructions make decoding significantly simpler, which makes it much easier to build very wide front ends.
A little easier, not much easier. A number of organisations are making very wide RISC-V implementations, and one has already published how their decoder works. It's modular, with each block looking at 48 bits of code (the first 16 overlapping with the previous block) and decoding either two 16 bit instructions, or one aligned 32 bit instruction, or one misaligned 32 bit instruction with a following 16 bit instruction, or one misaligned 32 bit instruction followed by an ignored start of another misaligned 32 bit instruction.
You can put as many of these modules side by side as you want. There is a serial dependency between them, in that each block has to tell the next block whether its last 16 bits are the start of a misaligned 32-bit instruction or not. That could become an issue at really extreme widths, but for something decoding e.g. 16 bytes at a time (4 to 8 instructions) it's not a problem.
There is a trade-off between a little bit of decoder complexity and a lot of improved code density -- but nowhere near to the same extent as say x86.
While I have no personal experience writing aarch64 assembler, my experience with ARM v6-M and v7-M makes me doubt your implied insult that ARM just failed or didn't give a fuck about their instruction set. Thumb-1 and Thumb-2 are well-designed instruction sets optimized for a certain kind of uarch. Almost all quirks exposed to the low-level programmer are there for good reasons, and while some of the constraints pose a challenge for compiler writers, they are not beyond the capabilities of GCC or LLVM.

There are several possible reasons for ARM to return to a fixed-length 32-bit encoding, e.g. to allow very wide OoO designs like Apple's Firestorm cores, or because the gain is smaller for 64-bit code with larger constants, better served by PC-relative constant pools. And while the quirky LDMIA function prologue is very flexible, appeals to me as an assembler programmer, and saves code space, having a single instruction potentially modify most integer registers as well as change the program counter and the active instruction set is hard to implement well, while the easier-to-implement register-pair load/store instructions are enough for most common instruction sequences. The tradeoff was different for in-order ARM2/3 CPUs with single-ported memory and a tiny unified cache (if that).
> Ever wonder why M1 needs such huge L1 cache? Well, now you know.
I'm not sure I follow this, but it reminds me to ask: does RISC-V allow for designs to have both efficiency & performance cores like the ARM big.LITTLE concept? Has anyone made one yet?
Of course you can do it. SiFive has been allowing customers to configure core complexes with a mixture of different core types for years -- for example mixing U84 cores with U74 or U54. If you want to do a big.LITTLE thing, with transferring a running program from one core type to another, that's just a software matter -- using cores with the same ISA but different microarchitectures.
To date the examples of this that have been shipped to the public have used cores with similar microarchitecture, but a different set of extensions.
For example the U54-MC in the HiFive Unleashed and in the Microsemi Polarfire SoC FPGAs use four U54 cores plus one E51 core for "real time" tasks. The E51 doesn't have an FPU or MMU or Supervisor mode. The U74-MC in the HiFive Unmatched is similar.
Alibaba's ICE SoC, which you may have seen videos of running Android, has two C910 Out-of-Order cores (similar to ARM A72/A73) implementing RV64GC, and a third C910 core that also has a vector processing unit with two pipes with 256 bit vector ALU each, plus 128 bit vector load and store pipes.
I find it amusing that RISC-V allegedly creates "fragmentation risk" when platform fragmentation in the ARM ecosystem already exists and it's painful enough -- at least that's what I recall from some comparisons with the x86/PC platform with respect to Linux kernel development.
They'll be fine if they focus on their microarchitectures rather than the ISA (where IMHO they've already lost), and make the process for obtaining a license much more streamlined; I've heard it takes no less than 18 months of long negotiations to license anything from ARM. That's not sustainable now that there's competition.
High performance implementations are possible even with bad ISAs, given enough resources.
x86-64 is much worse than ARM. It's a literal clusterfuck. And yet.
A high performance implementation of ARM, which is a much better ISA than x86-64, was something expected to happen sooner or later. It did not surprise me.
As far as OSHW cores go, it's so very nice to be able to throw something together in verilog and be able to inherit a compiler and not be trampling on someone else's copyright...
The press release does not say anything about a physical chip; it's about a licensable core that can be used to build SoCs. Here SiFive acts the same way ARM does: it sells cores.
Mind you that raw core performance is not everything, memory bandwidth and caches are crucial to make sure the CPU isn't waiting for data all the time.
I trust that the Apple benchmarks include all such effects. I'm less convinced that the RISC-V "projections" include them. SPECint2006 is supposed to be measured with real memory and an OS. Per-GHz numbers can't accurately reflect main memory latency, since its speed doesn't scale with the CPU clock.
Right, and "per GHz" numbers are also not very useful because you can't just crank up the GHz when you need performance. Even with the same process technology, you can't assume different microarchitectures will max out at the same frequency.
You're right, and remarkably Apple has found a major roadblock to clock speeds while using ARMv8.
M1's L1 cache is huge as a workaround for ARMv8's poor code density. A larger cache means lower clocks; unfortunately, there's no way around the speed of light.
How impressive that number is rather depends on how many GHz they're managing. In general the slower you design your clock to clock, the faster you can make all your caches. Plus the slower you clock your core, designed in or not, the lower the number of clock cycles it takes to talk to main memory.
They claim it's slightly faster than A77. That would have the IPC getting pretty close to AMD's Zen 1 chips (though probably at a lower peak frequency).
If I recall correctly, the SiFive Unmatched is still pretty slow compared to ARM (https://www.phoronix.com/scan.php?page=article&item=hifive-u...). Now, this board is not the one in question (P650), but we'll have to watch upcoming benchmarks, for which I recommend Phoronix.
Obviously you can't even think about comparing it with Intel and AMD yet, but when you look at the history of something like ARM (which I believe is 30-40 years old), RISC-V came a long way pretty fast, and the good thing is that it's a solid choice for the future due to being open.
Sweet, are there any resources on transitioning/migrating between x86_64 and riscv, or on the differences? Or are the ISAs so drastically different that it's better to just dive in head-first?
I played with Linux (Debian) for RV64 in QEMU and the experience was pretty painless, albeit slow, though I think that was mostly emulation overhead. The programs I wrote myself compiled and ran without a single issue. Obviously, if you need closed-source software, or software with build tools targeting an OS that doesn't run on RISC-V, then you would have a hard time. Most open-source software should be a breeze to port. A lot of the weirdness of the platform will likely be in bootloaders and drivers.
Absolutely. If you can go with OSS only, then it's a breeze. But if you depend on proprietary stuff like I do (CADs, Wine, etc.), then it's a pain. I'm currently trying to switch an office to ARM (RPi 4 and Baikal-M) and that is not easy.
[0]: https://riscv.org/announcements/2021/12/riscv-ratifies-15-ne...