VexRiscv is a quadcore, Linux-capable RISC-V softcore for FPGA (antmicro.com)
174 points by homarp on May 21, 2020 | 77 comments



One of the most interesting aspects of the VexRiscv is the way it's implemented. The VexRiscv is written in SpinalHDL, a hardware description library in Scala. But that by itself isn't the main thing: in addition to Verilog and VHDL, there are plenty of ways to write RTL, from Python to Scala to Haskell.

What's really special is that the VexRiscv is constructed from a large number of plugins that split up the design 'horizontally' per feature, instead of the traditional 'vertical' way that is pipeline-stage oriented.

It makes it possible to implement all aspects of, say, a new instruction in one file, instead of spreading it over many different files.

I've written about that here: https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-CP....
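For a concrete taste, here's roughly what such a plugin looks like. This is a condensed sketch in the spirit of the SimdAddPlugin tutorial from the VexRiscv repo -- the stageable names follow that tutorial, the opcode encoding is illustrative, and hazard/exception details are elided -- so check the repo for the exact current API:

    import spinal.core._
    import vexriscv.plugin.Plugin
    import vexriscv.{Stageable, DecoderService, VexRiscv}

    // A new signal threaded through the pipeline: "this is our instruction".
    object SimdAddPlugin {
      object IS_SIMD_ADD extends Stageable(Bool)
    }

    class SimdAddPlugin extends Plugin[VexRiscv] {
      import SimdAddPlugin._

      // setup(): register the new encoding with the shared decoder service.
      override def setup(pipeline: VexRiscv): Unit = {
        import pipeline.config._
        val decoder = pipeline.service(classOf[DecoderService])
        decoder.addDefault(IS_SIMD_ADD, False)
        decoder.add(
          key = M"0000011----------000-----0110011", // illustrative encoding
          List(IS_SIMD_ADD -> True, REGFILE_WRITE_VALID -> True,
               BYPASSABLE_EXECUTE_STAGE -> True, BYPASSABLE_MEMORY_STAGE -> True,
               RS1_USE -> True, RS2_USE -> True))
      }

      // build(): the datapath itself, plugged into the execute stage.
      override def build(pipeline: VexRiscv): Unit = {
        import pipeline._
        import pipeline.config._
        execute plug new Area {
          import execute._
          val rs1 = input(RS1).asUInt
          val rs2 = input(RS2).asUInt
          val rd  = UInt(32 bits)
          // Four lane-wise 8-bit adds; carries don't cross byte lanes.
          for (i <- 0 until 4)
            rd(i*8+7 downto i*8) := rs1(i*8+7 downto i*8) + rs2(i*8+7 downto i*8)
          when(input(IS_SIMD_ADD)) { output(REGFILE_WRITE_DATA) := rd.asBits }
        }
      }
    }

Decode, execute behavior, and regfile writeback for the new instruction all live in this one file; nothing else in the CPU needs to be touched.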


This is somewhat orthogonal to what you are saying, but I've wondered for a while if it's possible to achieve vertical and horizontal abstraction at the same time. When you are working on the actual implementations, the horizontal style is clearly preferable, but if you want to change the abstraction, the vertical style is much easier. The limitation that only one is accessible at a time seems to be purely a consequence of the fact that we use the same representation for reading and writing code. Why can't I switch from vertical to horizontal mode when reading code? Maybe it's even possible to switch when writing?? With a plugin-like structure you certainly get some benefits, but at the same time you have to let go of a nice global view - it would be nice to have both.


In a way, the VexRiscv is already implemented in both directions:

While stages are specified individually, you can declare those stages to collapse together.

The VexRiscv can be configured with anywhere from 2 to 5 stages.

In terms of readability, you still have the 5 stages separated out within the same file.
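For what that looks like from the user's side, here's a hedged sketch of selecting the pipeline depth at generation time (parameter names are from memory and may not match the current VexRiscvConfig exactly):

    import spinal.core._
    import vexriscv.{VexRiscv, VexRiscvConfig}

    object GenThreeStage extends App {
      SpinalVerilog(new VexRiscv(
        VexRiscvConfig(
          withMemoryStage    = false, // fold Memory into Execute...
          withWriteBackStage = false, // ...and drop WriteBack: 3 stages total
          plugins = List(/* the usual fetch/decode/ALU/regfile plugins */))))
    }

The plugins themselves still describe their logic per stage; the generator just collapses the work of the missing stages into the remaining ones.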


Oh, that's really cool. It seems like a nice place in the middle which I didn't really think was possible before.


Worth mentioning SymbiFlow: https://symbiflow.github.io/. It's a fully open-source flow for FPGAs. Xilinx support (targeting the Arty A7 that the project in this story uses, for instance) is on the way, so hopefully it won't be long until you can build an open-source RISC-V SoC that runs Linux entirely on open-source tooling.


To be clear, creating a Linux-capable Artix-7 image using only open source tools can be done today, right now! https://github.com/SymbiFlow/symbiflow-examples


Awesome! Hadn't seen that. It includes the DDR controller too, which I thought might be one of the trickier parts to get going under SymbiFlow.


Eh, did they finally unlock all the transceivers, MMCMs, DSP48s, etc?

It's super unimpressive to do a basic processor with flip flops and BRAM with a primitive simulated-annealing P&R. You're getting like... 10% of what the chip from a decade ago was capable of.

BuT iTS OPeN SoUrcE!

Eh. I guess. If your time is worth zero and you're willing to overpay for silicon you can't use effectively with your open tools. WebPACK is gratis from Xilinx and it stomps the open source stuff; I have no idea why you'd bother.


You read my mind. This is awesome. Wonder how hard it'd be to DIY a smartphone and go from there.


I'm interested in developing an SoC with separate DRAM controllers for instructions and data. I did some reading about RISCV-BOOM core and LiteDRAM.

I also researched the process of turning an SoC design into physical chips. I estimated the cost to be around USD $150,000 for the first handful of chips, using TSMC's CyberShuttle.

The learning curve for this technology is extremely steep. I would probably need to spend years learning the various skills.

The PolarFire SoC block diagram [1] shows a DRAM controller and a DRAM PHY.

The Arty A7 Reference Manual [2] talks about using the Xilinx Vivado to add peripheral blocks into the SoC design. Is this a way to add Xilinx's proprietary DRAM controller block, which would then need to be licensed separately?

Does Antmicro's demo use LiteDRAM to interface with Arty A7's DRAM PHY?

What would be involved in modifying VexRiscv and its MMU to support normal data memory and a separate read-only instruction memory? For someone with the necessary skills, is it a 1-month project or a 1-year project?

I checked a bunch of FPGA development boards. The only boards I found with multiple DRAM chips have the Cyclone V FPGA and cost >$1,000. Why do those boards cost so much more than the $150 Artix 7 boards?

[1]: https://www.microsemi.com/product-directory/soc-fpgas/5498-p...

[2]: https://reference.digilentinc.com/reference/programmable-log...


I'm disappointed today's general-purpose CPUs and microcontrollers don't come with some integrated FPGA space, similar to how you have SRAM and other peripherals. Intel talked about it a few years back [1] but I'm not sure anything materialized.

The closest I've seen in popular chips is a few gates worth of programmable logic. Are there any hidden gems I've missed out on?

[1] https://www.nextplatform.com/2018/05/24/a-peek-inside-that-i...


With the transition of compute from being performance-focused to being performance-per-watt-focused (since cooling is usually the limiting factor), the niche for the FPGA has almost vanished.

There are very very very few compute tasks where an FPGA solves a problem with better performance per watt than both a CPU and a GPU.

I would bet that emulating a RISC-V program on x64 is far more power efficient than running a RISC-V core on an FPGA for example.


"the niche for the FPGA has almost vanished."

While Xilinx posts record quarterly profits and there are more FPGAs moved than ever before.

But go on. Enlighten us...

"There are very very very few compute tasks where an FPGA solves a problem with better performance per watt than both a CPU and a GPU."

Oh right. Let me just slap those into my satellite, radio system, aircraft control system, military system, embedded system... How could I have not seen the light?

"I would bet that emulating a RISC-V program on x64 is far more power efficient than running a RISC-V core on an FPGA for example." Hahahahahahaha. What size bet, chief? I need a new pair of shoes.


Low-volume networking hardware comes to mind, where you need very high-performance routing but aren't shipping enough units to make up the cost of using an ASIC.

There aren't many of them, so power usage isn't a big cost, but their performance affects the performance of many other machines, which is where most of the power draw is.


An ECP5 will sit on the order of ~100mW and you can clock those up to dozens of MHz. They can have multiple cores running in parallel (an ECP5 85k will fit dozens, probably well over a hundred RISC-V cores if you do your homework). Even a laptop sitting at 10W is going to be orders of magnitude less power-efficient than this in terms of raw instructions-per-cycle-per-watt if you're emulating. That is not the best metric, necessarily, but there you go.

And since you mentioned perf-per-dollar -- ignoring soft CPUs, any deeply pipelined algorithm is very likely going to destroy price-comparable CPUs in terms of throughput. E.g. you can do 16 to 32 bytes per cycle of AES on a dinky FPGA from 10 years ago for a few dollars, and at 50MHz you're doing 1.6GB/s; people have been achieving this, or multiples of it, for 15+ years. Things like TDP are not a measure of "overall system design efficiency"; they're a measure of thermal capacity, thermal budgets, and nothing more. (BTW, the only general purpose CPU that comes close to this number directly for AES is, like, Ice Lake, since VAESNI can turn out 16 bytes per cycle or whatever IIRC, but now you're well back into "multiple watts" territory on a multi-GHz CPU.)

The reason people still use CPUs for these tasks isn't because they don't want better performance: it's because software has better agility and is easier to acquire, modify, and distribute. You can have systems that are dozens of times more efficient than commodity ones for a wide variety of tasks; they will just be a pain in the ass to use, program, acquire, and build. You can figure out most of this with basic napkin math.

Stop thinking so much about individual components, and start thinking about global system design -- because the entire system has its own performance criteria that may vary drastically compared to an individual component within it.

> There are very very very few compute tasks where an FPGA solves a problem with better performance per watt than both a CPU and a GPU.

This is like stating "There are very few tasks where a car would do as well as a snowmobile." They aren't comparable in purpose. Hacker News is pop-culture-y, so everyone thinks "the only thing that matters is a cool CPU running in a rack on a 7nm TSMC process that can run my Go application on Kubernetes that will disrupt The Market of Smart Toilets" or whatever they do day to day, and extrapolates from there. But I'd guess the vast majority (like, 85% or more) of the FPGA field has literally nothing to do with this. A huge amount of the work basically revolves around "just" interfacing with analog devices at pico/nanosecond-level resolutions...

The quest for best perf-per-watt is one largely driven by datacenters and personal consumer electronics, which have both high volume and high yield, and where the largest challenges revolve around power, cooling, etc. Furthermore these systems run workloads that are largely general purpose "state machines" that use some memory and some CPU and some disk, etc, and need to try and hit a balance among all of these. There is a large amount of resource arbitrage going on. "A rising tide lifts all boats" in this case. But little of that applies in this field; people use older nodes and the same chips for 5-10+ years (or longer) straight because they need to deliver latency-sensitive solutions, customized hardware at low volume, "hardware glue" for various analog systems, highly specialized algorithmic solutions for the lowest total BOM cost, etc. They aren't aiming to replace the systems created by digital Silicon Valley software programmers.

There is a push to move FPGAs into the datacenter (see: Xilinx and their exploding revenue) but it's unclear if they will settle into specific niches or be used as supplementary devices or whatnot.


Your comments will be much better without the condescending tone. I point this out because I also talk down to people unintentionally. It's a difficult habit to break.


It’s hard to get across just how out of their depth someone is without putting it fairly explicitly.


There's a lot of use cases today where perf per watt doesn't matter.


Name some... Any task where perf per dollar matters also boils down to perf per watt, since watts cost dollars...


Say, robotics, where the difference between an FPGA and something else is way overshadowed by the motors.

Or really basically anywhere that has you interacting with the real world directly connected to your compute, and not just compute off in a datacenter.


> Or really basically anywhere that has you interacting with the real world directly connected to your compute, and not just compute off in a datacenter.

Which, at least in mass-market applications, mostly happens on phones and other battery-powered devices. :)


Or the dozens of other devices in your house that have the benefit of mains voltage.

I don't think perf per watt was a differentiator in the compute chosen for your TV, your monitor, your AV receiver, your fridge, etc.


Yes, but none of these things use significant compute.


Monitors and TVs absolutely do.

And it's not unheard of for them to have FPGAs for that reason.


I guess one could consider the video hardware decoder an FPGA, although I would bet that the smart models just use a GPU in some SoC instead.


No, there are literal FPGAs in some TVs and monitors. It happens more often on early runs of new tech. There's no need to relabel something else as an FPGA.


But many, many important applications are not mass-market.


Realtime systems: avionics, robots, radar processing, low-latency network appliances, trading platforms...


I thought FPGAs were more power efficient than GPUs for many ML applications for example (CNNs aside)?


For unusual ML architectures, like 1-bit precision, that might be true.

For anything using floating point maths, it isn't true.


What about this paper then, section 5.1.2, energy efficiency: https://dl.acm.org/doi/10.1145/2694344.2694347

Substantial gains in terms of performance per watt for FPGAs?


Not in the desktop/server space, but there are a few products like this. Xilinx has the Zynq line with 1-2 Cortex-A9 cores paired with an FPGA. Microsemi has their Cortex-M3-based SmartFusion line and is supposedly launching their PolarFire SoC with 4 RISC-V cores plus an FPGA later this year.


The 1-2 Arm A9 Zynqs were introduced in 2011. There has been much progress!

MPSoCs have 4x Arm A53s and 2x Arm R5s (and Mali-400 graphics, although they're moving away from that because they found most customers don't care about that).

RFSoCs have something similar - strapped directly to tiles with 4-6 GSPS analog-to-digital and digital-to-analog converters. If you're trying to make a badass missile front end, radio system, or radar system, they're amazing!

They even have FEC hard cores that run incredibly fast. They're amazing for all kinds of waveform work.


They have the patents to do even better than that and create hierarchies of miniature programmable fabric, similar to the concept of L1-L4 caches, except for FPGA designs [0]. Unfortunately, they don't have the organizational willpower to do true innovation. From what I have heard, significant portions of the designs for their processor lines are not understood by any current or recent employees. There is a tremendous amount of legacy "code" and fear of changing things that may break backwards compatibility. There is no vision at the top, their process lead is gone, and the architecture team is patching decades of bad security without simplifying the designs.

It seems like they are just riding out their market share for as long as they can, which could be a while. Intel has a really strong brand.

[0] https://patents.google.com/patent/US10310868B2/


Well Xilinx has the Zynq-7000 SoC[1], featuring an ARM Cortex-A9 CPU along with a potentially quite large FPGA.

Not exactly cheap though, at least in small quantities[2]

[1]: https://www.xilinx.com/products/silicon-devices/soc/zynq-700...

[2]: https://www.digikey.com/products/en/integrated-circuits-ics/...


Intel has the Cyclone V range.

Also, I've never done it myself, but I've read that Digi-Key prices are almost never the actual price for FPGAs, even in relatively small quantities (haggling with Avnet).


There is another interesting path for the hobbyist, based on the Terasic DE10-Nano. A group of retrocomputing enthusiasts has chosen this as their base to work upon: https://github.com/MiSTer-devel/Main_MiSTer/wiki and it even got picked up by a gaming hardware magazine earlier this year: https://hothardware.com/reviews/mister-diy-console-fpga

While I don't care so much about the systems they emulate and the cores they offer, this seems like a nice place to get started because of the many examples one can learn from.

Also Intel themselves offer many tutorials for it, on github and elsewhere.

And it creates an 'ecosystem' of reasonably available hardware expansions, without the usual markup for that stuff, should one need or want them.

Were I to start today, I'd use this. As it is, I've spent much more on similar stuff in the past.

I approve!


Can confirm. The only time I've ever been quoted list price for an FPGA from a distributor (other than digikey) was when our purchasing agent managed to personally enrage the sales guy.

We still didn't end up _paying_ list price (after I made the purchasing guy apologize - no details, but he was 100% in the wrong).


Well, that is true for all components at Digi-Key. You don't mass-produce stuff with components bought from DK (or Avnet either, for that matter, depending on where you do your production).


What’s a little frustrating with the Xilinx parts listed on Digi-Key, and to some extent other FPGA vendors, is the lack of any price breaks.

I looked up a random i.MX6 processor from NXP. It’s $32.08 for one of them, and $20.95 per part when you get a reel of 500. If you need a few thousand units made, it’s perfectly reasonable to order parts from Digi-Key. You might be overpaying compared to what a good purchasing person can get you, but it’s fine for a few thousand parts, and quick and easy.

Look up any Xilinx part on Digi-Key, and there’s just a single price break at 1 unit. Even Lattice parts only go up to price breaks of 100.


Worked with FPGAs at a past gig, and that is correct. There is a heavy markup due to the accessibility of the parts.


The FPSLIC comes to mind (https://www.digikey.com/catalog/en/partgroup/fpslic/14188?mp...) and the Cypress PSoC


Are there many applications that would benefit from having it on-die versus across a few PCIe lanes? AFAIK putting it on a card would still grant it the same full-speed DMA it would have being on-die with the CPU.


The benefit from putting it on-die is not speed but economy of scale.


One possible reason is that people designing with FPGAs already have 2 options: FPGAs with on-chip processors, and soft cores built in the FPGA fabric.

In other words, don't get some FPGA with your micro. Get a micro with your FPGA. The latter exists.


There was a company that had an FPGA implementation of HyperTransport so that you could link the FPGA to a network of AMD Opteron CPUs.


https://en.wikipedia.org/wiki/Torrenza ? That got any real world use, ever?


What about the Xilinx Zynq?


Don't general-purpose CPUs and μCs use replaceable microcode as part of the chip implementation? Shouldn't that boil down to the same thing?


The microcode is still controlling a microprocessor, not defining gates. It's more like the microcode is the assembly language and the x86 (or whatever) is the bytecode.


As cool as this is, enough with the FPGAs and the RISC-V Arduino clones. A real, relatively inexpensive (e.g. sub-$100) RISC-V SoM/board that can run Linux is desperately needed (with at least BeagleBone Black performance levels).

I really like the idea of RISC-V and I'm willing to make the investment in software (and in fact have done so with QEMU), I just can't get any real hardware (for a non-silly price).


I think the main barrier to a cheap RISC-V board like you describe is a real Android port. By "real" I mean it needs working ART and V8 compiler ports so apps and web pages don't run at 10% of the speed of low-end ARM chips.

Once that exists, I think we'll see companies develop RISC-V chips cheap enough for low-end smartphones and other IoT devices. Those are the chips that are cheap enough to put in a <$100 board.


The fact that it fits in the 35T version with room to spare is pretty huge, especially since certain packages of the 35T start as low as $35 for a single chip, no MOQ.

I could see myself dropping one of these on a homemade project if I ever spend the time making a reliable reflow oven...


If you're happy that it fits in a 35T, you'll be ecstatic to learn that a single VexRiscv fits comfortably in a Cyclone II EP2C5 FPGA. :-)

It's hard to find an FPGA that's too small to fit one.


It runs on the 5k-LUT iCE40 UP5K, e.g. as the default bootloader in Fomu: https://www.crowdsupply.com/sutajio-kosagi/fomu

(the whole thing fits inside a USB-A port)


FPGA noob here. I have 2 questions about FPGAs that I'm hoping someone here can help me out with:

1. For the FPGAs I've looked at, you seem to have to initially configure them before being able to run your programs on them, kind of like EEPROM. I feel it would be much more interesting from a reconfigurable computing perspective if the devices were able to programmatically re-configure on the fly, as easily as it is to read and write to DRAM or flash memory. So what are the barriers that prevent the hardware from being able to do this?

2. It's exciting to see projects like SymbiFlow making great progress, but after reading some expert opinions [1] it seems like an extremely difficult challenge to reverse engineer hardware from commercial FPGA vendors who wish to keep their designs closed in order to protect their IP and compete. So my question is: wouldn't it be a more feasible goal to construct a fully open FPGA platform from scratch, just like RISC-V is doing with CPUs? What would the obstacles be here?

Thanks!

[1] https://www.reddit.com/r/FPGA/comments/a5pzs5/prediction_ope...


1. Most higher-capacity FPGAs have a feature called "partial reconfiguration", where you can reload a part of the FPGA with a new bitstream. This new part can usually come from anywhere (PCIe, SPI, ...).

2. RISC-V is an ISA, not an implementation. You can implement it on an FPGA or an ASIC. When you implement a RISC-V on an FPGA, it will cost you anywhere between $2 and, at most, a few thousand dollars in silicon. FPGA technology itself is something that can only reasonably be implemented in an ASIC. The initial cost of an ASIC can go anywhere from $100K (on a very old process) to multiple millions.



1. I'm not sure what you mean, but remember that FPGAs don't run programs per se (HDL gets compiled to logic, not instructions). The bitstream can be modified; it just gets loaded from some flash. I'm not sure where it's done, but it's possible.

2. The obstacles are billions and billions in R&D (and you'd need similar amounts to get a fab to pick up the phone, too). Reverse engineering the bitstream is also difficult because of this. SymbiFlow (i.e. Trellis etc.) has got the bulk of the bitstream done (apart from specialized blocks like those for DSP), but you also need good algorithms to decide what to do with that bitstream; e.g. a fully open source flow requires intricate timing analysis.


Regarding 2: the question was about an open-source solution, so I think that "billions and billions of R&D" translates to just a lot of time spent and no literal cost, just like GCC is free as in beer.

It is true that getting a wafer fabricated will cost a lot of money (in the millions maybe?) but this may be money well spent because the resulting FPGA design can be used over and over. I think this would be in the reach of perhaps some universities or government technology centers, if someone could formulate the case for it.


Nearly all the cool stuff in GCC and LLVM is paid for by companies (paying the salaries of developers). The software could definitely be done this way (SymbiFlow is very, very nice), but keep in mind that developing an FPGA will require a lot of hardware and bums in seats.

The question is similar in scale to building an open-source Intel Core i7. It's not impossible, but keep in mind that an FPGA big enough to prototype even subsections of the CPU, let alone the whole thing, would cost hundreds of thousands.


Now the only problem left is to find an open-source-friendly FPGA manufacturer.


Is Dolu1990 the primary developer of everything SpinalHDL related?

I'm genuinely impressed with his effort.


This project looks amazing.

For folks writing in SpinalHDL, is anyone using Quartus? Or are you using a fully open-source toolchain? i.e. What is your workflow?

I'm interested in trying out SpinalHDL, but I'm not sure how to integrate it into what I'm doing.


I'm using SpinalHDL for all my hobby projects, and I use Intel Quartus, Xilinx ISE, or Yosys, depending on the FPGA family.

This project is an FPGA-based ray tracer written in SpinalHDL that uses Xilinx ISE: https://github.com/tomverbeure/rt. This project uses SpinalHDL to drive an LED cube and uses Quartus: https://github.com/tomverbeure/cube (it also uses a VexRiscv). And here is a small project that drives an LED matrix with WS2812B LEDs, runs on an Upduino2 with a Lattice UP5K FPGA, and uses the open source Yosys/NextPNR flow: https://github.com/tomverbeure/led_matrix.
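In case it helps anyone trying SpinalHDL for the first time: the workflow is just "run a Scala program, get Verilog out", and that Verilog goes into Quartus/ISE/Yosys like any hand-written RTL. A minimal sketch, assuming you have spinal.core on your classpath (the component here is a made-up example, not from any of the projects above):

    import spinal.core._

    // A trivial LED blinker: free-running counter whose MSB drives the LED.
    class Blinky extends Component {
      val io = new Bundle {
        val led = out Bool()
      }
      val counter = Reg(UInt(24 bits)) init 0
      counter := counter + 1
      io.led := counter.msb
    }

    // Running this emits Blinky.v, which the FPGA tool then synthesizes.
    object BlinkyVerilog extends App {
      SpinalVerilog(new Blinky)
    }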


Nice. I was just looking through your ray-tracer code :)

It looks like a really nice abstraction. I've been working on the MiSTer project, writing arcade cores for the Cyclone V in VHDL.

I've made a huge effort to keep things clean, but SpinalHDL could be a great way to tame some of the code.

Will start blinking some LEDs and see how it goes...


How far down the road of something like nand2tetris can I take an FPGA? Can I design my own system in Verilog, flash it to an FPGA like this, and effectively run my own computer?


Yep, that's the point of FPGAs. The easiest way to see is to synthesize your design in an FPGA tool (e.g. Vivado), which you can probably do for free. Different FPGAs have different hardware resources, but any small design will probably fit on a cheap FPGA.


Absolutely.


I didn't find what clock rate it runs at. It mentions booting Linux in 4 seconds, but that is hard to extrapolate into a core clock frequency.

100 MHz? 200 MHz? higher?


I’ve clocked a VexRiscv at 80 MHz on a very old Cyclone II.

On a modern FPGA, 300 to 400 MHz should be possible, depending on configuration: caches and branch predictors tend to reduce the clock speed.

You can find some results here: https://github.com/SpinalHDL/VexRiscv#area-usage-and-maximal...

None of those numbers are for the fastest FPGAs.


Doesn't that depend on the FPGA you use?


Of course, but the article author did build it on a specific system and gave the boot time, so stating the clock rate would also have been a useful data point, since the platform is known. Someone using a different FPGA could use a rough scaling factor to guesstimate what they might achieve on the platform they have on hand.


Added this to the article - it's 100 MHz in this design. For some more performance info on Vex in general, see https://github.com/SpinalHDL/VexRiscv

We've yet to make detailed analyses on the multicore version, but in general I'd say it's pretty decent.


Kind of off topic, but how much time is involved in building processors with FPGAs, especially for modern architectures like RISC-V? I only have a very basic overview knowledge of FPGAs and almost none at this point about HDLs (I plan on learning!), but with the complexity involved in modern processors, I can't imagine this being a few weeks or even months of work.


RISC-V as an ISA is designed to be easy to implement. You can get a simple implementation in 3k lines of Verilog [1].

That being said, there's a huge difference between a toy/simple multi-cycle machine-mode RISC-V core and one with a modern, performant microarchitecture (pipelined, superscalar, multi-issue, cache-coherent across multiple cores, with efficient branch prediction). There's also extra work to implement the RISC-V extensions that let you run any 'real' code like Linux (anything from simple ISA extensions to implementing the privileged spec, which dictates additional things like the MMU and interrupt controller).

[1] - https://github.com/cliffordwolf/picorv32/blob/master/picorv3...


It depends on what you want to do. Do you want to build a processor from scratch? A basic processor for core RISC-V is actually quite simple. The RISC-V I specification (all the core instructions, not including mul/div, and none of the privileged spec) is not complex.

Implementing something like the story talks about is obviously far more complex.

If you're mostly interested in putting together existing processors, that again can vary in complexity. There are some things with 'batteries included' where you can just spin up a working FPGA image and then go poking around; with others there will be significant work in integrating things into a working system.

I'll give a plug to Ibex (https://github.com/lowRISC/ibex), which is the core I work on. It doesn't have an MMU and is targeted at embedded applications. It's a 'real' core, in that it's suitable for taping out into a real system, but still quite simple to understand. OpenTitan (https://github.com/lowRISC/opentitan) is a notable project we're also working on that uses it. It's an open source root of trust and will give you a working RISC-V SoC you can put on an FPGA; you can easily carve out the security things, leaving you with a RISC-V core, some SRAM, and various useful peripherals.



