Nyuzi: An experimental FPGA multicore GPGPU processor (github.com/jbush001)
88 points by jdmoreira on March 30, 2016 | hide | past | favorite | 31 comments



Just thinking about writing VHDL again gives me the heebie-jeebies. But this is very cool, and I love the possibilities enabled by FPGAs and OSS. However, we're still a ways away from having an entire open-source FPGA development stack.


I've been really wanting to try Chisel https://chisel.eecs.berkeley.edu/

They have a RISC-V implementation in it, so it can't be too bad.


Chisel doesn't help. It just converts to Verilog. The Verilog is then converted by a closed-source program to a closed file format, which is then uploaded by a closed-source program to closed FPGA hardware.


This open source stack may interest you, then. http://www.clifford.at/icestorm/


Right now for a project I'm using KiCad, Yosys, arachne-pnr and IceStorm. As a n00b to this layer (most of my exposure was through matsci and cheme required classes), it's so much fun to be able to learn along the way (and I get to save $600 instead of buying someone else's FPGA and peripherals; I get exactly what I need out of it, no more, no less). Although I wish I could get on with the rest of my project, this is a fun detour.


Oh man, VHDL was the worst. My hope is that things like this will get people to build better, open-source tooling.


At least it's better than Verilog. But there's Python to the rescue! http://www.myhdl.org/

There's still a lot of tooling missing, but my FPGA tinkering became much more enjoyable when I stumbled across MyHDL.


Ugh.

Every time I see imperative language X adapted to RTL/VHDL/Verilog I want to slap someone.

They aren't the same; gates are fundamentally different primitives. You shouldn't be using a language built around serial actions for something that's inherently parallel. You bring along all sorts of baggage that you don't need.

[edit] They don't even specify whether the generated state machine is Mealy or Moore. This is the stuff you want control over, not abstracted away by a language where you don't know how it will synthesize.
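For anyone who hasn't run into the distinction: a Moore machine's output depends only on the current state, while a Mealy machine's output also depends on the current input, which typically shows up as a one-cycle difference in when outputs appear. A minimal, purely illustrative Python sketch of a rising-edge detector built both ways (this is not MyHDL; the names are made up):

```python
def moore_step(state, bit):
    # Moore: output is a function of the state alone.
    output = 1 if state == "pulse" else 0
    if bit:
        next_state = "pulse" if state == "low" else "high"
    else:
        next_state = "low"
    return next_state, output

def mealy_step(state, bit):
    # Mealy: output is a function of state AND current input,
    # so the rising edge is reported one cycle earlier.
    output = 1 if (state == "low" and bit) else 0
    next_state = "high" if bit else "low"
    return next_state, output

def run(step, bits):
    state, outs = "low", []
    for b in bits:
        state, o = step(state, b)
        outs.append(o)
    return outs

print(run(moore_step, [0, 1, 1, 0]))  # [0, 0, 1, 0]
print(run(mealy_step, [0, 1, 1, 0]))  # [0, 1, 0, 0]
```

Same input stream, different output timing, and which one you get matters for whatever consumes the signal downstream.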


You didn't get it. MyHDL is simply a macro processor used to generate RTL. And generating a sequence of similar, say, module instances is fine even with an imperative language; even Verilog has for loops for this.

Chisel and Clash are no different.
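To make the "macro processor" point concrete, here's a purely illustrative sketch of imperative generation of repetitive structural RTL: plain Python string formatting emitting N similar instances. The `vector_lane` module and its ports are invented for the example, and this is not MyHDL's actual API:

```python
# Illustrative only: Python as a macro processor emitting N
# similar Verilog module instances, the kind of repetitive
# structural code a generate loop would also produce.
def instantiate_lanes(n):
    lines = []
    for i in range(n):
        lines.append(
            f"vector_lane lane{i} (.clk(clk), "
            f".din(bus[{i}]), .dout(result[{i}]));"
        )
    return "\n".join(lines)

print(instantiate_lanes(4))
```

The imperative loop exists only at generation time; what reaches the synthesizer is flat, fully elaborated structural code.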


You might want to re-read my post; I'm specifically railing against imperative -> RTL/VHDL/Verilog.

Verilog may have for loops, but they're largely used for simulation; they're expensive in terms of synthesized hardware (mostly toolchain/process dependent) and can only be fixed-size.

I've got no issue with Chisel; it's a DSL, which I think is a good fit. What I'm arguing against is constructs that don't have a clear mapping to hardware representations, which means you have to guess at what the compiler generates (see my Mealy vs. Moore comment above).


My understanding is that MyHDL is in fact more like Chisel/Clash; its getting started guide specifically says "It does not turn arbitrary Python into silicon" (and as sklogic says).

See also: a MyHDL UART: https://github.com/andrecp/myhdl_simple_uart/blob/master/ser...


Yeah, but that feels square-peg-in-round-hole-ish. It's like saying we've got all this stuff over here, but don't use it; really, we mean don't!

You're going to spend a ton of documentation explaining to users which parts of the language they can't use, rather than having a DSL spec that's clear about what's supported.

With a proper DSL you also don't have to massage language features that don't quite map (say, enums for state machines) into a format they're not meant for.


You're wrong: MyHDL is a proper hardware DSL in Python. You've never used it, and you have no idea what you're talking about.


MyHDL is exactly the same kind of eDSL as Chisel. And you have full control over what it generates; see the examples.


If it's Python, it's not a DSL by definition.

Since you don't seem interested in addressing even one issue (I'm sure I could come up with more) around state machines, I don't see any reason to continue this discussion.


> If it's Python, it's not a DSL by definition.

Scala is not any better. It's exactly the same kind of eDSL: no macros, nothing, just generating objects at runtime and then serialising the tree into Verilog code. Nothing fancy.

> around state machines

What state machines?!? It does not have any more features for defining FSMs than the underlying Verilog.


Actually, Verilog developers are orders of magnitude more efficient than their VHDL counterparts.

http://www.bawankule.com/verilogcenter/contest.html


Wow, 1997! But thanks for the link; it was a fun read. It'd be interesting to hear from people in the industry what their experience has been on larger multi-person projects; e.g., how long would it take to create an ASIC like http://www.jandecaluwe.com/hdldesign/digmac.html ?


Clash, Chisel, and Lambda-CCC are all big improvements over Verilog/VHDL.


Looks like he got it running on an Altera DE2-115 board [1], which has these specs:

    114,480 logic elements (LEs)
    3,888 Kbits embedded memory
    266 embedded 18x18 multipliers
    4 general-purpose PLLs
    528 user I/Os
[1]: http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=E...


Does anyone know if there's an opposite analog of this? I would very much like to run a parallel language like VHDL or Verilog on the GPU since:

1) OpenCL/CUDA have an OpenGL-inspired syntax with a steep learning curve and limited generalizability

2) FPGAs don't seem to be gaining the economies of scale of GPUs

I simply want to be able to emulate thousands of CPUs (millions of gates) for physics, AI, big data etc, in a way that's accessible, affordable and won't catch fire. I'm thinking MATLAB or Octave but with near-ideal speedup for embarrassingly parallel problems.
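For what it's worth, the array-at-a-time style described above is roughly what NumPy (and CuPy on GPUs) already provide for embarrassingly parallel numeric work; a toy sketch with made-up numbers:

```python
import numpy as np

# MATLAB/Octave-style model: express an embarrassingly parallel
# computation as one whole-array operation and let the runtime
# (CPU SIMD, or a GPU via e.g. CuPy) map it onto parallel
# hardware. Toy example: n independent particle updates.
n = 1_000_000
pos = np.zeros(n)
vel = np.linspace(0.0, 1.0, n)
dt = 0.01

# One vectorized statement updates all n particles at once; each
# element is independent, so this parallelizes trivially.
pos += vel * dt

print(pos[-1])  # last particle moved vel[-1] * dt = 0.01
```

This doesn't emulate gates, but for "thousands of independent things per timestep" workloads it gets close to the near-ideal speedup the parent is after.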


Do you already know VHDL or Verilog? Most people would not consider them simpler or more productive than OpenCL IMO.

Julia fits your last sentence.


People use Theano/TensorFlow for this.


Is this GPGPU only, or does it also support GLES or OpenGL?


I guess you could implement GLES on top of it. They implemented a renderer for Quake maps: https://github.com/jbush001/NyuziProcessor/tree/master/softw...


Looks like it runs at about 1 FPS on a 50 MHz core (with screenshots): http://latchup.blogspot.com/2015/06/not-so-fast.html

Still crazy awesome, that's a ton of work.


Well, you can apparently do 3D rendering, but it's kind of slow based on the latest information I can find: http://latchup.blogspot.co.uk/2015/06/not-so-fast.html

They are (or were) spending a huge number of instructions on stuff that would have dedicated hardware/instruction-set support on a proper GPU. Normally rasterization and texture sampling run in dedicated hardware, colour packing/unpacking is integrated into the memory access instructions (at least on Radeon), etc. Stuff that'd be one or two instructions on a commercial GPU instead took dozens or hundreds.
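As a small illustration of the packing/unpacking point: unpacking a single RGBA8 pixel in software takes a handful of shift/mask operations, where a GPU's typed loads can fold it into the memory access itself. A Python sketch assuming R in the low byte (layouts vary):

```python
# Software RGBA8 unpack: several shift/mask steps that a GPU's
# typed memory loads can perform as part of one instruction.
def unpack_rgba8(pixel):
    r = pixel & 0xFF
    g = (pixel >> 8) & 0xFF
    b = (pixel >> 16) & 0xFF
    a = (pixel >> 24) & 0xFF
    return r, g, b, a

print(unpack_rgba8(0x80FF8040))  # (64, 128, 255, 128)
```

Do that per texel, per pixel, per frame on a general-purpose core and the instruction counts add up fast.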


(author here)

Yeah. One area I'd like to investigate is adding specialized instructions. The existing renderer is not highly optimized.


Man, it'd be super sweet if we could get an OpenCL frontend for this target.


Technically it already supports OpenCL, as it has an LLVM backend and Clang port. However, it will generate scalar code that doesn't take advantage of the vector unit. To support it properly, it would need extra passes for SPMD vectorization.
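To illustrate the scalar-vs-SPMD point: a kernel compiled as scalar code does one work-item's worth of work per invocation, while an SPMD vectorizer executes the same kernel body across a whole vector of work-items at once. A rough NumPy analogy (the kernel and all names here are invented for illustration):

```python
import numpy as np

def kernel_scalar(a, b, i):
    # One work-item: operates on a single element at a time.
    return a[i] * 2.0 + b[i]

def kernel_vectorized(a, b):
    # All work-items handled by one vector operation per
    # kernel statement: the transformation an SPMD
    # vectorization pass performs.
    return a * 2.0 + b

a = np.arange(8, dtype=float)
b = np.ones(8)

scalar_out = [kernel_scalar(a, b, i) for i in range(8)]
vector_out = kernel_vectorized(a, b)

print(np.allclose(scalar_out, vector_out))  # True: same results
```

The results match; the difference is that the scalar form leaves the vector unit idle, which is exactly what the missing passes would fix.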


I think there's a lot of follow-through needed beyond just LLVM and Clang support to make a full OpenCL platform: device enumeration, etc. Plus, I don't think Clang distributes a complete front end (headers/type defns etc.). There are some open-source projects that could fill those gaps, though.



