FPGA Acceleration by Dynamically-Loaded Hardware Libraries [pdf] (dtu.dk)
65 points by Katydid on Oct 23, 2016 | 20 comments



I think the single biggest hurdle to FPGA adoption is that FPGA development is horrible. The IDEs suck, everything is license-encumbered and full of DRM, and debugging is a nightmare.

The best thing I've found so far is Clash. It translates Haskell code to (System)Verilog or VHDL. It works remarkably well, and is way easier to write and test than writing VHDL or Verilog directly. You can compile it to a plain old program for testing and debugging, which means you can use things like QuickCheck for aggressive testing. If you're really serious, you can use something like LiquidHaskell for really aggressive formal verification. There's also Lambda-CCC, which looks a bit more theoretically rigorous, but I couldn't get that working when I tried it.
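
To give a flavor, here's roughly what a small Clash design looks like (a minimal multiply-accumulate sketch against the current clash-prelude API; details vary between Clash versions):

    module MAC where

    import Clash.Prelude

    -- Pure transition function: state is the accumulator, input is a
    -- pair of operands, output is the updated accumulator.
    macT :: Int -> (Int, Int) -> (Int, Int)
    macT acc (x, y) = (acc', acc')
      where acc' = acc + x * y

    -- 'mealy' wraps the pure function into a synchronous circuit with
    -- a registered state, so the same code simulates as plain Haskell.
    mac :: HiddenClockResetEnable dom
        => Signal dom (Int, Int) -> Signal dom Int
    mac = mealy macT 0

    topEntity
      :: Clock System -> Reset System -> Enable System
      -> Signal System (Int, Int) -> Signal System Int
    topEntity = exposeClockResetEnable mac
    {-# NOINLINE topEntity #-}

Because macT is an ordinary pure function, QuickCheck properties run against it directly, before any HDL is generated.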

The key here is that non-strict lambda calculus maps remarkably well to hardware, which means that it's actually relatively straightforward to translate a program to an exactly equivalent hardware representation. The only exception is unbounded recursion (at the function or data level), because obviously you can't represent infinite stuff in hardware.

I've written a few FPGA projects using Clash that I never would have had the patience to finish if I had used a traditional HDL. There's usually a bit of a performance hit over hand-optimized HDL, but (as with using e.g. Python over assembly) there are many cases where developer time is more expensive than a small efficiency loss.


What does Clash bring to FPGA design? I can easily imagine how functional languages make it easy to express the datapath of a circuit. But if you still specify registers explicitly and have to design your pipeline by hand, it's not a big improvement over traditional HDLs. Can it help you infer correct control logic? That would be useful. Explore different speed/area trade-offs?

When performance matters - and it does, or you wouldn't bother to design hardware - everything gets heavily pipelined, with very short logic between pipeline stages. Will Clash be any more readable than Verilog then?


I think you're focusing on too narrow a scope of improvements over traditional HDLs.

A few advantages include: sensible language design, strong types, ADTs, easier and much faster testing/debugging, easier formal verification, and easier, more powerful, safer parametrization.

And yes, even if your Clash degenerates to basically transliterated Verilog (which it usually doesn't, in my experience), it's still much easier to read thanks to the richer types and ADTs.
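
For example, a state machine whose states are an ADT reads directly, where Verilog would use hand-assigned encodings (a hypothetical sketch; the derived NFDataX instance is what recent Clash versions expect for state types):

    {-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
    module Rx where

    import Clash.Prelude

    -- The state space is an explicit ADT; illegal states are unrepresentable.
    data RxState
      = Idle
      | Receiving (Index 8)  -- counts the 8 data bits of a frame
      | Done
      deriving (Generic, NFDataX)

    -- Transition function: a start bit enters Receiving, count 8 bits,
    -- then emit a one-cycle "frame done" pulse.
    rxT :: RxState -> Bit -> (RxState, Bool)
    rxT Idle          1 = (Receiving 0, False)
    rxT Idle          _ = (Idle, False)
    rxT (Receiving n) _
      | n == maxBound = (Done, False)
      | otherwise     = (Receiving (n + 1), False)
    rxT Done          _ = (Idle, True)

    rx :: HiddenClockResetEnable dom => Signal dom Bit -> Signal dom Bool
    rx = mealy rxT Idle

The compiler checks that the match is total over RxState, which is exactly the kind of control-logic mistake that slips through in raw Verilog.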



There are open FPGA toolchains now, some links here:

https://wiki.debian.org/FPGA


I agree. The other big one is power consumption.


From the abstract: "Provided a library of application-specific processors, we load on-the-fly the specific processor in the FPGA, and we transfer the execution from the CPU to the FPGA-based accelerator."

Key takeaways: (2.2.2) neat trick to accelerate reconfiguration; (3) sample applications involving BCD arithmetic; (4) efficient scheduling to avoid thrashing the reconfiguration.
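
On (4): the scheduling question reduces to an amortization test - only reconfigure when the work queued for an accelerator outweighs the reconfiguration cost. A hypothetical back-of-the-envelope check (all names and numbers made up for illustration):

    -- Reconfiguration pays off only when the per-item savings on the
    -- FPGA, summed over the queued work, exceed the one-off cost of
    -- loading the bitstream (on the order of 10 ms on a small Zynq).
    worthReconfiguring
      :: Double  -- reconfiguration cost (ms)
      -> Double  -- CPU time per work item (ms)
      -> Double  -- FPGA time per work item (ms)
      -> Int     -- work items queued for this accelerator
      -> Bool
    worthReconfiguring reconfigMs cpuMs fpgaMs n =
      fromIntegral n * (cpuMs - fpgaMs) > reconfigMs

Alternating between two kernels that each fail this test is exactly the thrashing the paper's scheduler avoids.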

(Personally, I suspect that until we have a good, open or OS provided API to FPGA configuration we're going nowhere. 3D acceleration required this in the form of OpenGL and DirectX.)


APIs are just glue; what the FPGA world needs in order to be open is open compilers for high-level hardware description languages.

However, there is little incentive for FPGA manufacturers to publish enough details for open FPGA compilers to be developed.

Unlike CPUs, which can present a public ISA and handle fancy microarchitecture details behind the scenes, FPGA compilers require intimate knowledge of the particular FPGA's microarchitecture. Publishing that means spilling the secret sauce.

Also, when you target an FPGA, you target a single FPGA model. Change to a different model, even in the same product family, and you must recompile. BTW, for large FPGAs compilation can take all night.

And lastly, FPGA compilation is a graph embedding problem - placing a netlist onto a fixed fabric and routing it while meeting timing - which is significantly more difficult than the sequential optimization problems CPU compilers handle. This is why it takes so long to compile for an FPGA.


What do you think of Synflow's HLS, IDE, and so on for that?

https://www.synflow.com/

I've hardly seen any FPGA users or experts review it.


This looks like a high-level design entry language. It gets its portability by outputting Verilog/VHDL code that is then fed into the synthesis (compiler) tools from your FPGA vendor.

There are a fair number of these sorts of languages out there. Currently I'm using Altera's OpenCL kit, which certainly falls into this high-level design category.

The thing is that even if you're coding in a language that looks like C, you need to understand how hardware is inferred from it. That requires a basic understanding of hardware concepts like state machines built from combinational logic and flip-flops, as well as pipelining.

Bottom line on high-level design... Plus side: gloss over details, focus on the workload. Down side: weird idioms, extra overhead compared to low-level HDL (e.g. Verilog).


All the HLL-to-HDL converters I know of require you to structure your HLL code as if it were HDL code. In other words, you are not absolved of understanding how things work at the hardware level, and you can't just feed regular code into it and have it pop out hardware.

If you have to write stilted HLL to implement your design anyway, you might as well do it properly and write it in one of the HDLs instead.


>Personally, I suspect that until we have a good, open or OS provided API to FPGA configuration we're going nowhere. 3D acceleration required this in the form of OpenGL and DirectX.

There's no demand for it. There's nearly nothing that the consumer does that requires that level of hardware acceleration (with the possible exception of Photoshop), and server-type applications are perfectly fine with custom software interfaces.


There is a huge demand, but it is already met by GPUs. FPGAs work much better for streaming and low-latency applications, though. See for example the HoloLens processor; though that is an ASIC, the same could be done with an FPGA paired with a GPU.


You missed the whole gaming market in that. There's also offloading I/O (e.g. BitTorrent) and audio/video playback or conversion. In business desktops, I could see additional uses like a security coprocessor or accelerating analysis of large data sets coded in parallelized R or something. There are plenty of use cases in HPC, which has long been a consumer of FPGAs. Server farms, especially cloud, are competing on performance per watt, and FPGAs are champs at that in certain workloads.

So, I see plenty of opportunities on the consumer and business side. I think adoption in the consumer space will be quite low, though, as with other high-end equipment.


OpenCL runs on FPGAs.


Really neat! Having a fixed framework to plug accelerator blocks into lets accelerator designers cut to the chase.

I'd bet the 10ms hit to reconfigure their Zynq grows on larger FPGAs...

Edit (Additional Thought): This could let designers use significantly smaller/cheaper FPGAs. Rather than statically implementing all required functions, intermediate results can be saved in off-chip RAM while the next function is loaded. That said, it'd require significant additional engineering effort.


When I first heard about the co-processor extensions in ARM, I spent hours dreaming up a dynamic FPGA-accelerated co-processor that could have dynamic 'instructions' swapped in based on the typical workloads it was receiving. The compiler would generate the custom FPGA hardware blocks based on profiling data from previous iterations of the app and static inspection, with the dynamic complex blocks then attached to the ELF for loading. A statistics block in the CPU determines which ones are loaded via a kernel driver (I think you could do this with pure SW actually - no HW support needed). You'd need a "branch if instruction-X is available", and I guess some interlocks to stop instructions being swapped out until the SW has finished with them, etc...
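
In a Haskell sketch, the dispatch side of that dream is tiny (entirely hypothetical interfaces, not a real driver API):

    -- "branch if instruction-X is available": run on the FPGA block when
    -- it's currently loaded, otherwise take the software fallback path.
    data Kernel a b = Kernel
      { isLoaded :: IO Bool   -- poll the statistics block / kernel driver
      , runHw    :: a -> IO b -- execute on the loaded FPGA 'instruction'
      , runSw    :: a -> IO b -- plain software implementation
      }

    dispatch :: Kernel a b -> a -> IO b
    dispatch k x = do
      loaded <- isLoaded k
      if loaded then runHw k x else runSw k x

The interlocking (pinning an 'instruction' until in-flight work drains) is the part that actually needs care.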

I may dream, but this team did it! Super cool. Having spent a day with Zynq trying to load bitstreams from Linux running on the hard cores (and completely failing, such that I changed attack vector), this alone is impressive :)


There was a company, GigaOps, back in the 90s that did this.

http://arith.stanford.edu/courses/abstracts/gigaops.html

Their MVP was accelerating Photoshop effects plugins, but they developed a decent C-like language for general kernels.

Way ahead of its time.


I still remember sitting through a CS colloquium in the late 90s on the potential of GPU acceleration, and being utterly devastated that it was obviously going to sideline this type of FPGA work for a long time.

Even if progress never stopped in certain niche areas, I'm happy to see it clawing back into the light of day.


There actually was a fair bit of research on dynamically loading hardware modules into FPGAs on the fly, going all the way back to the '90s. IEEE's International Symposium on Field-Programmable Custom Computing Machines (http://fccm.org/2016/previous.html#past) had a "Run Time Reconfiguration" session in the '90s specifically for papers using that mechanism.

It's nice to see that research is still advancing in this area.



