
Any good examples where FPGAs outperform standard CPUs?



http://www.argondesign.com/case-studies/2013/sep/18/high-per...

Low-latency interaction with hardware. In this case, it was possible to start outputting a TCP reply before the end of the packet it was replying to.

In general, FPGAs win for integer and boolean functions which are amenable to deep pipelining, or are capable of very high parallelism. One of my colleagues produced a neat chart of what level of parallelism was best suited to which of CPU/GPU/FPGA.

FPGAs don't have an advantage if you're memory or IO bound, and they have terrible power consumption.


You are not quite correct about power consumption. FPGAs can be quite competitive when applied right.

First, of course, there are the very efficient Ethernet ports and many other efficient hard blocks.

Second, you can fuse many operations into one. For example, you cannot fuse 3 FP adds into a single operation on either a GPU or a CPU; fused widths are usually 2 or 4, rarely anything in between or beyond. GPU and CPU vector operations can also be wasteful when the vectors are underutilized, whereas on an FPGA you can create circuitry that fits the problem rather well.
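
A tiny illustration of the underutilization point (my own sketch with SSE intrinsics, not anything FPGA-specific): summing three float vectors takes two dependent vector adds and leaves a lane idle, whereas on an FPGA you would just instantiate a 3-input adder of exactly the width you need.

    /* Sketch: a 3-element-wide add on a 4-lane SSE unit costs two
       dependent instructions and wastes one lane. An FPGA can instead
       instantiate a single 3-input adder sized exactly to the problem. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ps(0.0f, 3.0f, 2.0f, 1.0f);  /* top lane unused */
        __m128 b = _mm_set_ps(0.0f, 6.0f, 5.0f, 4.0f);
        __m128 c = _mm_set_ps(0.0f, 9.0f, 8.0f, 7.0f);
        __m128 s = _mm_add_ps(_mm_add_ps(a, b), c);      /* add #1, then add #2 */
        float r[4];
        _mm_storeu_ps(r, s);
        printf("%.0f %.0f %.0f\n", r[0], r[1], r[2]);    /* 12 15 18 */
        return 0;
    }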

I also think that you are not quite right about I/O- or memory-boundness. In an FPGA you can add another I/O controller and use an extra device for I/O, putting spare FPGA resources to work. Same for memory: a DDR controller synthesized into logic cells won't be very optimal, but you can instantiate many of them nevertheless.


It depends on what you're doing. The static power consumption is rather high. If your solution fits fairly exactly into the FPGA and is running most of the time, then yes.

Having two DDR controllers helps the overall memory bandwidth number if your reads are not localised. If you're doing a lot of data-dependent loads this doesn't help at all (e.g. scrypt).

In all cases it's much harder for software developers to develop for FPGA, so this cost needs to be factored in. They're very good in their niche, not a general purpose silver bullet.


I think you are not quite right about non-local data access and bandwidth. Convex (I believe; I may not be right) has a memory controller that provides full bandwidth utilization for R-MAT random graph analyses.

Your scrypt example also does not access much RAM, especially with the standard Salsa20/8 function. As far as I can see, it also has a parallelization parameter, and the computation within the top-level loop can be done in parallel.

Yes, it is hard to program for an FPGA, but not that hard - I myself wrote a system that translated a (pretty high-level) imperative description of an algorithm into synthesizable Verilog/VHDL code, in a month and a half.

In my opinion, programming for FPGA is very entertaining, especially if you do not write V* code by hand.


This question comes up often, and it's the wrong question to ask.

Here's the thing. FPGA performance has nothing to do with it. They do what CPUs can't.

You don't choose to plug an FPGA in place of a CPU, you plug it where you can't.

Tying peripherals together, glue logic, bus connections. Some FPGAs have a built-in CPU (or you can include a soft one alongside your circuit).


While true in many cases, you can bet your bottom dollar that Intel did not spend $16.7bn to enter the embedded glue logic market. It is very much the compute aspect they are after, as a weapon against AMD's HSA and other GPGPU-style solutions.


But an FPGA (as it is today) cannot compete with a GPGPU

Maybe they will go for on-the-fly reconfiguration for specific computations (as in: load your specialized circuit into an FPGA and fire away)


I designed the first commercial Bitcoin mining FPGAs, and though for a while the FPGAs were not competitive with GPUs on raw hashrate (they did beat them on power and usability), they eventually were (with the advent of Kintex and similar). Of course, that only lasted briefly, as the rapid growth in the market led to an influx of VC money to fund ASICs.

And that's where FPGAs shine; that small to medium volume market where small companies are doing innovative things but don't have the millions required to risk building an ASIC.


Bitcoin mining was probably close to a best-case scenario for FPGA compute though - it was computing a fixed function designed for easy hardware implementation at full capacity 24/7. And even then, actually implementing it and making it competitive was a colossal pain.


It was completely compute bound. Communicating with the host using just an RS232C UART was sufficient to keep the FPGA busy while computing Bitcoin's 2xSHA256 hashes.
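
For a sense of scale, here is a rough host-side sketch (my own, using OpenSSL's SHA256, not any real mining code) of the double hash evaluated per nonce: each work unit is just an 80-byte header, so the per-result I/O is trivial next to the compute.

    /* Rough sketch of Bitcoin's double SHA-256: 80-byte header in,
       32-byte digest out. Billions of nonce attempts per header means
       even a slow UART keeps the hashing hardware saturated. */
    #include <openssl/sha.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static void double_sha256(const uint8_t header[80], uint8_t out[32]) {
        uint8_t first[32];
        SHA256(header, 80, first);  /* first pass over the block header */
        SHA256(first, 32, out);     /* second pass over the digest */
    }

    int main(void) {
        uint8_t header[80] = {0};   /* version, prev hash, merkle root, time, bits, nonce */
        uint8_t hash[32];
        for (uint32_t nonce = 0; nonce < 1000000; nonce++) {
            memcpy(header + 76, &nonce, 4);  /* nonce occupies the last 4 bytes */
            double_sha256(header, hash);
        }
        printf("last hash byte: %02x\n", hash[31]);
        return 0;
    }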


> And that's where FPGAs shine; that small to medium volume market where small companies are doing innovative things but don't have the millions required to risk building an ASIC

I don't disagree with you on that count. Since in most cases a processor does the job with a better cost/benefit, FPGAs shine on very specialized, computation-heavy tasks.


Not only that, but I'd say it could be the best place to learn about the very bottom of the computing stack and to experiment with chip design and wacky ideas.


> on-the-fly reconfiguration for specific computations

I always wondered, given that Intel's processors already have a pretty large gap between their instruction set and their real microcode, whether it would make sense to have a nominal "CPU" that, when fed an instruction stream, executes it normally on general-purpose cores, but also runs a tracing+profiling JIT over it to eventually generate a VHDL gate-equivalent program to jam into an on-core FPGA. "Hardware JIT", basically, with no explicit programming step needed.


Programming a CPU is becoming more and more a problem of fitting as much in the data caches as possible. Bandwidth is the problem, not the speed of the execution units. I don't see the huge benefit of an FPGA here.


> But an FPGA (as it is today) cannot compete with a GPGPU

I read an interesting quip somewhere on software/hardware development: 'civil engineering would look very different if the properties of concrete changed every 4 years.'

If at some point we stop scaling chip performance, and many-core integration on a single chip/die stops making sense, then glue logic starts to look like a key differentiator. And control over glue logic starts to look like control over profits.

Intel ate the chipset for performance reasons and so they could shape their own destiny.

If there aren't fundamental breakthroughs to preserve performance scaling as we know it, then I see this as more of the same.


But a GPGPU (as it is today) cannot compete with an FPGA.

Of course, it depends entirely on what you're doing with them. Keywords: horse, course, different.


That's the logical assumption. Of course, to make this work, the tooling would have to be wildly different; at the very least, an open bitstream format.


In addition to the benefits that others have already mentioned, when you use an FPGA, you can customize your hardware to provide task-specific features. An interesting example would be this demoscene project by LFT (Linus Åkesson):

http://www.linusakesson.net/scene/parallelogram/

"For this demo, I made my own CPU, ... cache, ... blitter with pixel shader support, a VGA generator, and an FM synthesizer."

In his explanation for why he wrote his own CPU in the FPGA, Linus explained "...I was able to take advantage of the added flexibility. For instance, at one point the demo was slightly larger than 16 KB, but I could fix this by adding some new instructions and a new addressing mode in order to make the code compress better."


I knew something like this had to exist. Shouldn't this approach extend to modern hardware? E.g. surely there must be cases where it is effective to use custom FPGA-based hardware fit for the job, rather than (or in addition to) CPU and GPGPU?

I heard counter-arguments to the tune of 'hardly anyone wants to program their FPGA', which sounds strange to me: after all, hardly anyone wants to program their pixel shaders, either.


The real problem with FPGAs is that for 99% of use cases, CPU/GPGPU is good enough. And in the cases where you really do need the extra speed, it's rare that you also need the flexibility, in which case you'd make an ASIC. There is a niche for FPGAs (especially in prototyping), but it's not as big of a market as you would imagine.


Awesome demo.

Also, dude has a Symbolics Space Cadet keyboard. Respect.


FPGAs are chips; they're mostly orthogonal to CPUs. Sure, there's some overlap at the margins, but the vast majority of FPGAs do things that CPUs can't do. There is some competition because an FPGA can do everything a CPU or GPU can, but the converse isn't true: there's lots of stuff an FPGA can do that a CPU or GPU can't. To oversimplify, an FPGA could replace almost any digital chip anywhere.


HFT optimizes for low latency rather than high performance, and FPGAs are currently the state of the art there, to my knowledge [1].

When it comes to performance, you should look at FPGAs in terms of performance per watt. Generally speaking they outperform GPUs by an order of magnitude in FLOPS/W [2], and GPUs in turn already have a ~3x-5x advantage over Xeons [3]. This measure is the most important one when it comes to the question of how many chips you can put in a given rack. FPGAs are still held back in terms of performance per dollar invested, since they have been quite pricey - with Intel this could change.

[1] http://stackoverflow.com/questions/17256040/how-fast-is-stat...

[2] http://synergy.cs.vt.edu/pubs/papers/adhinarayanan-channeliz...

[3] http://streamcomputing.eu/blog/2012-08-27/processors-that-ca...


> Generally speaking they outperform GPUs by an order of magnitude in FLOPS/W [2]

The paper you link to measures MSPS/W (mega-samples per second per watt), and the algorithm they study relies on fixed point. It uses built-in DSP blocks in the FPGA that are integer only. There is no floating point, so it is incorrect to say this shows FPGAs give better FLOPS/W. It isn't all that surprising that the FPGAs are doing better: the GPUs are all about floating point, which isn't being used here.

Their GPU implementations use floating point as well as int and short. The efficiency barely differs between them, showing that this particular GPU wasn't designed with integer power efficiency in mind (which an FPGA implementation relying on DSP48s very much is).


Altera is claiming they will have >10 TFLOPS next year. They designed floating point DSP blocks in the Arria 10 and Stratix 10 (due out 2016Q1).

https://www.altera.com/content/dam/altera-www/global/en_US/p...


It would be interesting to see the same experiment repeated using an Nvidia Tesla and an Intel Xeon Phi. They used AMD GPUs not targeted at HPC, so it's unsurprising the integer path is not power efficient (desktop/mobile graphics is all floating point).


Repeating the experiment with a Tesla or Xeon Phi will show you the same thing: that GPUs are less efficient than FPGAs on this workload. Their inferior efficiency has nothing to do with whether the polyphase channelization load is integer or floating point. A GPU consists of hundreds or thousands of microprocessors that have a traditional architecture: instruction decoding block, execution engines, registers, etc. Decoding and executing instructions is inherently less power-efficient than having that logic hard-wired, as it can be in an FPGA.


> A GPU consists of hundreds or thousands of microprocessors that have a traditional architecture: instruction decoding block, execution engines, registers, etc.

Any example of a GPU with "hundreds or thousands of microprocessors"? Nvidia Titan X has 12 [1] microprocessors by your definition.

[1]: SM, Streaming Multiprocessor in Nvidia's terminology. Smallest unit that can branch, decode instructions, etc.


I am well aware of the technical details and that I used a liberal definition of "microprocessor". My wording was vague on purpose (I didn't want to delve into the details). I didn't mean to imply that each "microprocessor" had its own instruction decoding block (they don't).

An AMD Radeon R9 290X has 2816 stream processors (44 compute units of 64 stream processors) per their terminology. There is only 1 instruction decoder per compute unit, so a stream processor cannot completely branch off independently, but it can still follow a unique code path via branch predication. This is kind of comparable to an Nvidia GPU having "44 streaming multiprocessors".

But whether you call this 44 or 2816 processors is irrelevant to my main point: a processor that has to decode/execute 44 or 2816 instructions in a single cycle while supporting complex features like caching, branching, etc., is going to be less efficient than an FPGA with hard-wired logic (edit: "hard-wired" from the viewpoint of "once the logic has been configured").

gchadwick also said integer workloads were "not power efficient" on GPUs, but that's also false. Most SP floating point and integer instructions on GPUs are optimized to execute in 1 clock cycle, so they are equally optimized. And of course integer logic needs fewer transistors than floating point logic, so an integer operation is going to consume less power than the corresponding floating point operation.


FPGAs don't actually have "hard-wired logic", though - they have a configurable routing fabric that takes up a substantial proportion of the die area and has much worse propagation delays than actual hard-wired logic, leading to lower clocks than chips like GPUs. Being able to connect logic together into arbitrary designs at runtime is pretty expensive.


Thanks for pointing it out. I'm so used to FLOPS being used for benchmarks that I don't even question it anymore - "mega samples" didn't tip me off that it was integer only.


I think AMD GPUs are much better at integer operations. NVIDIA ones are good at floating point.


> since they have been quite pricey - with Intel this could change

Why would it? The price of an FPGA is not really determined by its production cost – for medium-size Xilinx FPGAs, for example, the ratio of price to chip production cost is on the order of 50.


By that you mean the margin? Well, the margin largely depends on competition (or the lack thereof), doesn't it? That's why I think a big player could make a difference iff they put their weight into it.


Not necessarily margin, rather non-recurring engineering costs (fab masks, design, software development), marketing, sales, administration, other overhead. You need massively higher demand in order to lower those costs, which won't happen only because Intel is entering the market. Instead, you'd need to significantly change the principles and trade-offs of FPGAs, for which I don't see any indications.


Here's how I see FPGAs in a perfect world: you have an x86 compiler that finds, through static analysis, a subset of your program that fits nicely on the FPGA and, according to performance models, leads to a good speedup. That subset is then automatically passed on to the FPGA compiler. So for the novice programmer the whole thing is just a black box, enabled by some compiler flag. Furthermore, you can steer the compiler through directives such as OpenMP or OpenACC. We know that FPGA compilation takes a long time, so the problem here is fixing mistakes - every iteration may potentially cost you hours, which AFAIK is what makes FPGA programming so costly. Therefore the static analysis has to be of very high quality and the auto-offloading should be conservative. This sort of thing IMO could significantly change the popularity of FPGAs, thus offsetting the R&D costs.
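
From the programmer's side I imagine it looking much like today's accelerator directives - a hypothetical sketch using OpenMP's existing target construct, where the decision to synthesize the loop for the FPGA (or fall back to the CPU) would be entirely the compiler's:

    /* Hypothetical auto-offload: the programmer only marks the hot loop;
       an (imaginary) compiler pass decides whether it maps well onto the
       FPGA fabric and synthesizes it, or leaves it on the host CPU. */
    #include <stdio.h>
    #define N 1024

    int main(void) {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
        for (int i = 0; i < N; i++)
            c[i] = a[i] * b[i] + 1.0f;   /* candidate for a fused multiply-add pipeline */

        printf("%f\n", c[N - 1]);
        return 0;
    }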


FPGAs don't outperform CPUs. That doesn't really make sense. FPGAs outperform software. Also, if you're in the chip business, FPGAs let you prove designs in-situ before moving to ASICs.


Signal processing.

FPGAs have high speed links and can perform thousands of operations in parallel. When you have a huge amount of data to process in real time, an FPGA (or ASIC) is often your only choice.


Second this. Video processing is somewhere I've used FPGAs with great success. Embedded systems especially get a huge perf/watt boost when vision processing goes (at least partly) to an FPGA.

Video works really well because it's discrete (60 fps) and an FPGA just can't hit the clock speeds an ASIC will. If you already have a semantic clock requirement that is low, streaming works great.


Here is what they are really good at: getting your custom hardware to market early (vs. an ASIC or full-custom design). This is a huge benefit if you consider that the first to market usually gets to own the market and command the best price.


Wrote a Mandelbrot generator for uni that runs at several fps on a 25 MHz clock using 8 multipliers. Definitely not possible on a CPU at that clock rate! https://instagram.com/p/2Y4CMtP95Q/?taken-by=yawn_brendan
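
(Not the actual uni code - just a rough C sketch of the kind of 16.16 fixed-point inner loop involved, to show why a small budget of hard multipliers goes a long way once each multiply becomes its own pipeline stage running every cycle.)

    /* One Mandelbrot pixel in 16.16 fixed point: three multiplies per
       iteration, all of which an FPGA can run in parallel in a pipeline. */
    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t fix;                        /* 16.16 fixed point */
    #define FIX(x)    ((fix)((x) * 65536.0))
    #define MUL(a, b) ((fix)(((int64_t)(a) * (b)) >> 16))

    static int mandel_iters(fix cx, fix cy, int max_iter) {
        fix x = 0, y = 0;
        for (int i = 0; i < max_iter; i++) {
            fix xx = MUL(x, x), yy = MUL(y, y), xy = MUL(x, y);
            if (xx + yy > FIX(4.0)) return i;   /* escaped */
            x = xx - yy + cx;
            y = 2 * xy + cy;
        }
        return max_iter;
    }

    int main(void) {
        printf("%d\n", mandel_iters(FIX(-0.5), FIX(0.5), 256));
        return 0;
    }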

But yeah pretty much any repetitive computation. 3DES enc/decryption is another example (although I think people normally use ASICs for that when they want to do loads of it).


A friend did some research in this area, comparing FPGAs, CPUs and GPUs. He published a paper [1] on performance for several common linear algebra computations across a variety of input sizes. In particular, Figure 2 shows where each of the platforms works best.

FPGAs are essentially re-programmable hardware, so they tend to outperform CPUs/GPUs when you program them for a specific task. They don't have to deal with most of the overhead that the more generalized platforms deal with, which is why they dominate at small input sizes. However, with FPGAs you're trading space (silicon) for that re-programmability, so you can't have as much hardware in the same area as, say, a GPU. Thus, once the data sizes have saturated the FPGA's available compute hardware, the GPU begins to outperform it. Due to decreasing node sizes (28 nm, 22 nm, etc.), we can fit more programmable logic into the same area, which shifts the chart I mentioned above further in the FPGA's favor.

[1]: http://www.researchgate.net/profile/Sam_Skalicky/publication...


I work on developing FPGA-based prototyping platforms (basically chip verification solutions). That's one area where FPGAs perform better than standard CPUs.


Bitcoin mining?


You are right, but FPGAs are very old news in Bitcoin mining - about 3-4 years old; everyone has moved to custom ASICs.


Bitcoin mining. (Although ASICs have now taken over, for a good part of 2010-2013 FPGAs were king, esp. for power-limited mining rigs.)


IIRC many mining ASICs are those same FPGA layouts etched permanently.


Think of FPGAs as having the potential to be primitive GPGPUs. They outperform CPUs in all the same areas GPUs outperform CPUs.

FPGAs are like GPUs with no floating-point, no caches, and limited local memory. But if you can implement a kernel in FPGA with comparable memory bandwidth, you'll usually outperform GPGPU while using as little as 1/50 the power.


> Think of FPGAs as having the potential to be primitive GPGPUs. They outperform CPUs in all the same areas GPUs outperform CPUs.

FPGAs have several orders of magnitude lower latency than GPGPUs. GPUs have a memory access latency of about 1 microsecond, and getting something useful out of them takes >1 ms. FPGAs can have state machines running at 200 MHz, i.e. a 5 ns cycle time.

> FPGAs are like GPUs with no floating-point, no caches, and limited local memory. But if you can implement a kernel in FPGA with comparable memory bandwidth, you'll usually outperform GPGPU while using as little as 1/50 the power.

Some FPGAs do have floating point hard blocks. Integrated SRAMs (syncram) can be used as caches and usually are. FPGAs usually have DRAM controllers as hard blocks, so local memory is not that limited. Unless you consider up to 8 GB (newer models up to 32 GB) limited.


They're used in HFT, it seems mainly for sending and receiving orders.


The FPGAs run entire algorithmic strategies. It's not just the order management that they handle.


FPGAs run the execution of the strategy; everything else is done by high-level software/systems. FPGAs are connected directly to the exchange and also help receive a feed of market data (in addition to other market data feeds).


I guess that depends on how close to the exchange you want to be. If your strategies were extremely latency sensitive, you might want to have not just the execution logic, but the larger algo framework on there too. This is particularly true of strategies that make use of order book dynamics.


A big advantage that FPGAs have is that they are fast at "single threaded" tasks so they can parse data in formats that aren't easy to parse quickly, like the XML-based FIX protocol.


I'm not sure FPGAs are 'suited' to parsing formats with lots of variable lengths and data dependencies like XML, but obviously they can be faster than a typical Von Neumann machine. I know binary market data streams were one of the first uses in the financial industry.


FIX is usually sent and received as ASCII tags of the form NUM=value\x01, not as XML.

Source: http://en.m.wikipedia.org/wiki/Financial_Information_eXchang...
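
A toy illustration of that framing in plain C (hypothetical tags, nothing like a production FIX engine), since the format really is just SOH-separated tag=value pairs:

    /* Toy parser for SOH-delimited tag=value pairs. A real FIX engine
       also handles checksums, repeating groups, sessions, etc. */
    #include <stdio.h>
    #include <string.h>

    static void parse_fix(const char *msg) {
        char buf[256];
        strncpy(buf, msg, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *tok = strtok(buf, "\x01"); tok; tok = strtok(NULL, "\x01")) {
            char *eq = strchr(tok, '=');
            if (!eq) continue;
            *eq = '\0';
            printf("tag %s = %s\n", tok, eq + 1);   /* numeric tag, ASCII value */
        }
    }

    int main(void) {
        parse_fix("8=FIX.4.2\x01" "35=D\x01" "55=MSFT\x01" "54=1\x01");
        return 0;
    }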


They can't be compared directly, although that will never convince people to stop trying.

You tell a CPU what to do. You tell an FPGA what to be.


They usually require minimal amounts of power in comparison with a CPU of similar throughput.



