So I've had the itch to get a 64-bit RISC-V board of some sort. For reasons I won't get into, I'm not so interested in the 32-bit version.
The prices for the Genesys 2 Kintex-7 are $1000 USD at Digikey, which is way too much to spend on a hobby. I suppose I could try to port the FPGA code to a cheaper platform, but I'm no expert on that sort of development.
Arguably, I'm not an expert on operating systems development either...
On the other end of the spectrum, there is the Canaan Kendryte K210, a dual-core 64-bit RISC-V SoC, but it only has 8 MiB of on-chip SRAM and no provision for external SDRAM. An RTOS is of course the straightforward choice for that, though people have apparently managed to run a very slim Linux kernel on it as well. You can get various boards from Seeed Studio for $30 USD or less.
Is there something else that is a reasonable price? Less than $100 USD? Yes, I know, qemu is free.
Could someone break down the significance of this for those of us who are less hardware-acquainted? Does this represent the cutting edge of RISC-V CPU design? And how difficult would it be to implement this design properly instead of simply emulating it on an FPGA?
Emulation to me implies simulation, which isn't what an FPGA does.
A configured FPGA and an ASIC are both real electric circuits. In an FPGA the building blocks are larger groups of transistors (typically forming "Look Up Tables") that are connected together by a network of a bajillion electronic switches (read: other transistors). In an ASIC the building block is generally a transistor (although technically any structure you can "tape-out" with lithography) and everything is wired up directly.
Porting this to an ASIC is more gated by access to proprietary tooling than anything else. There's a number of open source efforts to replicate this tooling on older process nodes (search for SkyWater PDK and OpenROAD).
As for how significant this particular RISC-V chip is, there are a bunch of open source RISC-V chips cropping up. This one does seem to have more extensions implemented than the typical RISC-V chip, which is notable. Personally I'm more interested in RISC-V CPUs written in newer tooling (i.e. nMigen or Chisel) rather than Verilog, as I find Verilog trusty but a bit archaic (similar to how C is seen these days).
Emulation just means running a design on hardware it wasn't intended for. That runs the spectrum from ISA emulators like Dolphin, through software RTL simulators like Verilator, to FPGAs, which can certainly be used for emulation when the design isn't really intended for them as the end goal.
There's also an interesting point on that spectrum: the Cadence/Synopsys hardware emulators. These are rack-sized devices composed of a bunch of FPGAs and arrays of custom ASIC processor cores that can only do logic and branch ops, used to run HDL for SoC-sized designs at ~1-10 MHz.
This has been taped out many times in different ways already. ETH has developed these and handed them over for open source management and commercialization. They have a whole host of research projects around low-energy cores and different accelerators.
The idea is to have a very competitive, fully open 64-bit embedded core with lots of ways to combine it with other cores and other accelerators.
This core is cutting edge for a commercial open source embedded 64-bit-capable chip, but it is not the highest-performance RISC-V core.
Play around with this website to see all the ETH chips, many are RISC-V: http://asic.ethz.ch/
Emulation is when a CPU in instruction set A runs machine code in instruction set B.
A specification of a piece of hardware is called a design and is written in a hardware description language (HDL). The most used HDLs are VHDL (mostly in Europe, I believe) and Verilog (mostly in the USA, I believe).
A design is just a set of source files which admit a complete description. To actually implement the design on a physical device, additional steps are needed. Specifically, the code needs to be mapped to the FPGA (this can be divided into a couple of steps: 'translate', 'map', and 'place and route'). This is a messy optimization problem which is typically approached with a stochastic algorithm, and usually takes quite a long time (think a couple of minutes for trivial designs, and up to 8 hours for complicated designs on big FPGAs).
To implement a 'silicon implementation' of a CPU, you also need a layout. I'm not 100% sure how these are done, but I would guess it's some mixture of automatic design, re-use of existing blocks, and manual design by greybeard electrical engineers. The design needs to satisfy some design rules, which are determined by the process that is used.
The advantage of having a dedicated layout over having a synthesizable design that you can run on an FPGA, is that it often clocks much faster. I have no idea how much faster though (I think at least 10x?), and I'd guess it depends on the design, process, FPGA, and how good the manual layout is.
~~As far as I'm aware, this is one of the only open source designs for a RISC-V CPU.~~ While the instruction set of RISC-V itself is unpatented and open source, there's a lot of work that needs to happen between having an instruction set and having a fully designed chip.
Hardware development is roughly broken into 2 stages - front end and back end. Front end is where all of the architectural work happens, and is roughly analogous in software to writing the C code for a program. If you can run it on an FPGA, it means the front-end work is basically complete.
Back-end work is the equivalent of assembler. Except, instead of compiling to a computer architecture, you are compiling to a "process", such as TSMC 7nm. Every process uses different chemistry and physics to work, so backend work strictly cannot be re-used between processes. The wires and gates etc. have different physical properties and just won't work if you use a backend design on the wrong process.
Unlike in programming, the "compilation" process is highly non-trivial. Depending on how much performance optimization you are doing, the back-end work to put this architecture onto say a 7nm process could be tens of millions of dollars. "Compilers" (called place-and-route in hardware) are not very good and need an engineer (or many engineers) babysitting them. And if you want it to go fast, you are going to need to optimize much of it by-hand, which again is highly non trivial.
So even though this repo isn't going to get you a completed chip, it's a really great resource if you want to make your own chip, because it gives you an architecture and a full front end implementation. The front-end is where most people consider the "brains" of a project to be, generally a design shop has its best minds working in front end, and many front end engineers look down on backend work as "grunt work". That said, there are plenty of brilliant people on back end as well, and back end is particularly important and difficult if you are doing something like a wifi chip or bluetooth chip.
RISC-V could be a pretty good ISA, if only it came with a popcount instruction. Given popcount, it is easy to compose a whole range of other essential operations with just one or two more instructions. Without, you are stuck running two dozen instructions just to get to the first step.
Is the RISC-V evolutionary process capable of processing small-sized, incremental improvements? Or are only big "extension" instruction families even possible?
The B extension spec seems to be stagnating. It will anyway be optional and probably rarely implemented even if ever finished. If we have to wait on B, it probably means not getting popcount.
Conditional move would be there implicitly if they had defined boolean "true" to be ~0 in the ABI: just AND. Instead they made it 1, so you have to negate it before you can use it for anything. This has been well understood in SIMD circles for decades; it is a mystery how they could have missed it.
I am ready for a RISC-VI, RISC-W, or RISD-V "fixed version".
B extension was making good progress after it had slowed down for a while. Seems like standardization has slowed because of COVID.
The argument that it will never be implemented seems kind of wrong. Why not? Multiple vendors and open source projects have already implemented the essential functionality in non-standard ways. In the next standard profile for Linux, the V and B extensions will very likely be included.
ISA standardization never happens as fast as you hope it does.
Popcount is important even in minimal implementations, including "embedded". Waiting for B is not a way to get popcount where it is needed. We need another track.
Cool. No register renaming, but scoreboarded issue means it's still weakly out of order. It seems to be grabbing the instruction stream in 32-bit chunks, but I wonder whether it can decode two compressed instructions in a single clock cycle?
I doubt it, because they can cross page borders, which is a seriously annoying "feature" of the compressed extension. Also, 32-bit instructions can cross page borders too!
The 32-bit instruction sets I'm familiar with all tend to enforce alignment, which means page boundaries aren't a problem. Also, you get 2 extra bits of jump distance for free by not having to specify the exact byte you're jumping to, only the 32-bit word.
If you disable the compressed extension then you do get enforced alignment. So, for example RV32I, RV64I both 32-bit alignment. I just ran a small linux program I had for RV32GC and I found 2 places where an instruction crossed a page border. I don't understand why they thought that was a good idea, and for such a modern architecture too, but I guess on hardware it's fine?
Well, there are a few places where it makes sense and a few where it doesn't. And it being an optional extension means it pretty much gets used where its benefits outweigh its costs.
The really tiny cores with RV-C don't even have page tables, so there's no straddling to worry about. There it's very cheap and has a lot of benefit, so you see it in the same targets (and for the same reasons) as Cortex-M cores being Thumb-2 based.
One level of complexity up, a simple embedded core with a full MMU (taking the place of what'd be a classic 5 stage RISC like in a home router or something) is the least likely to want it, but in those cases you're normally doing buildroot like software anyway, so it's not a big deal that you have to compile everything with rv32g instead of rv32gc.
Larger more powerful cores have the gate count necessary to paper over the complexity of straddling page boundaries. And that's why you see the binary distros pretty much requiring RV-C. The increased gate count there is worth the better utilization of the I$.
It's cleaner than ARM or MIPS. AArch32 has something like 1200 instructions these days and a ton of cruft in there. AArch64 has good but not great instruction density, having nothing like ARM Thumb or RV-C.
MIPS (or at least what most people mean when they say MIPS) has all sorts of gross stuff like branch delay slots that are a pain in any non 5 stage in order design.
The European rocket is named after a Cretan princess from Greek mythology. This project is related to the rocket-chip generator. Nerds involved all around :)
I think it was the first launch? Since then, if memory serves, the program has been very reliable (if expensive). There was also a (very good) Ariane 4 before, with quite an OK record...
I can't help but think of Ariane V and its famous launch failure due to a programming error. I hope Ariane RISC-V fares better. ;) (Ok, the rocket platform has been reliable since 2003, but that ruins the joke.)
Was going to say the same. I learnt of the Ariane 5 explosion in a class on critical-systems software development. The out-of-range error that cost $0.5bn.
It wasn't even the out-of-range value that caused the failure--that would have been fine. The failure was caused by trapping the error, and dumping debug data onto the realtime control bus.