Intel Marrying FPGA, Beefy Broadwell for Open Compute Future (nextplatform.com)
146 points by walterbell on March 15, 2016 | 104 comments



Does an Open Compute Future include Intel's ME?

Intel's Management Engine (which is on all modern Intel chips) acts as an unverifiable second processor with memory and network access, running closed proprietary software. Its existence precludes any real security against state-level actors.


Woah, I didn't know this existed! This is really serious. Why aren't people talking more about this?


There was a CCC talk about it recently https://media.ccc.de/v/32c3-7352-towards_reasonably_trustwor... and there is a good write up about it on Libreboot site https://libreboot.org/faq/#intelme.

With regards to why people aren't talking more about this, maybe fatalism is involved, as there seems to be a certain inevitability that whoever controls the silicon also, to some extent, controls the computer. There have been some interesting developments in people trying to create open hardware like lowRISC http://www.lowrisc.org/ and also in using ARM, so there is some hope in this area.


Intel ME has very useful legitimate purposes for existing. It's a very powerful tool for some people, and most people would call you a conspiracy theorist for suggesting that Intel, or someone who manages to get security creds from Intel, would misuse them to break into your computer. I have also had allegedly advanced system engineers tell me that there is really no risk of an outside party being able to compromise your system with this technology, but I remain pretty skeptical of that claim.

In short, the overlap between people who know about it and are also security conscious is pretty small. There are also dozens of other things people should be more concerned about in terms of a corporate or state actor gaining unauthorized access to your computer.


I am much more afraid of non-state actors getting access to back doors than state actors. The people at the NSA might not be angels, but they have some sense of ethics and some controls; other people out there don't. What if someone like Snowden or the Walker brothers got access to it and made it open source or sold it to ISIS?


Discussions about it do exist. I first read about the dangers of it via the Libreboot website. As for why it hasn't had mainstream coverage, your guess is as good as mine.

In case anyone wants a CPU + FPGA combination without this type of security risk, check out Zynq.


I'm not convinced Zynq is much better. They just move the ME-like functions on-die and call it TrustZone.


TrustZone is not equivalent to Intel ME. Software must be designed to work with TrustZone extensions, and that software is under the control of the user (as long as you run open-source operating systems). If the software you're running doesn't use any TrustZone extensions, then the TrustZone hardware will not be used.

Intel ME, on the other hand, is baked into the hardware and AFAIK can't be properly switched off; it appears the best you can do is set up a fake Intel MPS server to point it at:

https://software.intel.com/en-us/forums/intel-business-clien...


Well, AMT has been around since about 2005, and vPro since 2007. It's a favorite topic for some people on Slashdot to bring up. There's a hackaday article from about 2 months ago [1]. My theory is that it's a combination of factors.

- Introduced a long time ago; it's old news.

- It is talked about, but only in certain circles

- Confusion over the relationship between vPro, IME, and AMT, or at least unfamiliarity with those terms, which makes articles/discussions related to them less likely to pop out as interesting topics.

- Some of the people who know about it consider it the normal, boring, status quo, and they react to anyone shocked by it as an alarmist.

[1]: http://hackaday.com/2016/01/22/the-trouble-with-intels-manag...


Is that anything like the "secure enclave" we've heard so much about re: Apple iPhone, at least the current versions? Not sure how much benefit or utility that would add to a server vs. phone CPU, but it's interesting anyway.


The TPM (where present) is analogous to the secure enclave, although the mechanism is a bit different. The ME is more of an "insecure exclave": it has a lot of access to the rest of the system, but the OS and userland cannot use it for anything useful. It's designed to allow access from outside over the network connection. Useful for ILM systems, but very frustrating for people who aren't in a datacenter and don't like it being there.


IIRC the secure enclave equivalent on x86 is the TPM module, and/or Intel SGX and AMD's Secure Processor (using a secondary CPU running ARM TrustZone); both are independent of Intel's ME.

Confused yet?


No, it's IPMI on Desktop processors


Thanks for the clarifications. It is indeed confusing for those of us who are not so hardware savvy.


Implement x86 on FPGA; what now NSA! :D


FPGAs have something akin to a bootloader which is used to download the bitstream. There has to be some intelligence in order to reconfigure the hardware on it. That something is firmware running on a tiny core. So basically it's already backdoored :-).


Covertly backdooring arbitrary hardware on an FPGA seems orders of magnitude more complex than sneaking a backdoor into a feature designed to have DMA and network access.


IMHO, it's not significantly harder, at least in principle, than on other platforms - in fact, if only intuitively, I suspect it's easier.

Most high-end FPGAs aren't "burned" the way you would with a logic array by burning fuses. Configuration data (i.e. the connections between the FPGA's tiny components) is stored in a non-volatile memory and is loaded when the device starts. Altering the design is simply a matter of altering this configuration data, which is easy to do dynamically, seeing how there's a closed piece of code running on a closed module which has access to every bit of it.

The problem boils down to problems that we're already aware of: authenticating configuration data (which is akin to the problem of authenticating the OS running on a general-purpose CPU), ensuring that the FPGA's configuration matches what was programmed into the non-volatile memory, and so on.

The configuration data loader is, to the best of my knowledge, a pretty trivial piece of code at the moment, with the exception of high-end devices for sensitive applications (which do include things like encryption, so that the bitstream cannot be retrieved in a useful form). But real-world requirements will soon provide a good excuse for inflating it to a level of complexity where backdoors can be hidden.

It's also important to realize that much of the hardware that ends up on an FPGA isn't really arbitrary data; it's in the form of vendor-supplied IP that is probably pretty easy to recognize. Implementing your own cryptography hardware is as bad an idea as writing your own cryptography code. I don't think it would be too hard to backdoor a loader so that it alters the bitstream to weaken its crypto modules under specific circumstances.


Some FPGAs might. All the Xilinx ones that I have used need to be explicitly programmed from outside, usually via SPI.


Not sure which Xilinx parts you've used (older ones?), but all of the modern ones can be strapped to load a bitstream from various types of flash on powerup. See Xilinx UG380 for details for the Spartan-6 family:

http://www.xilinx.com/support/documentation/user_guides/ug38...
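Either way - self-loading from flash or programmed by an external master - the host side of configuration is conceptually simple. Here is a rough sketch of streaming a bitstream over Linux spidev; the device node, clock rate, and the board-specific PROG_B/INIT_B/DONE handling (elided) are all assumptions for illustration, not anything taken from UG380:

    /* Minimal sketch: stream an FPGA bitstream out over SPI using Linux spidev.
     * Device path, SPI mode/speed, and PROG_B/INIT_B/DONE pin handling are
     * board-specific assumptions, not from the thread above. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/spi/spidev.h>

    int load_bitstream(const char *spidev_path, const char *bit_path)
    {
        int spi = open(spidev_path, O_RDWR);
        FILE *bit = fopen(bit_path, "rb");
        if (spi < 0 || !bit)
            return -1;

        uint8_t  mode = SPI_MODE_0;       /* CPOL=0, CPHA=0: typical for slave-serial/SPI config */
        uint32_t hz   = 10000000;         /* 10 MHz; check the part's maximum config clock */
        ioctl(spi, SPI_IOC_WR_MODE, &mode);
        ioctl(spi, SPI_IOC_WR_MAX_SPEED_HZ, &hz);

        /* ...assert PROG_B and wait for INIT_B here (GPIO, board-specific)... */

        uint8_t buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, bit)) > 0) {
            struct spi_ioc_transfer tr;
            memset(&tr, 0, sizeof tr);
            tr.tx_buf = (unsigned long)buf;
            tr.len    = n;
            if (ioctl(spi, SPI_IOC_MESSAGE(1), &tr) < 0)
                return -1;
        }

        /* ...poll the DONE pin here to confirm configuration... */
        fclose(bit);
        close(spi);
        return 0;
    }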


This is the master plan, but replace x86 with RISC-V to make it more efficient. It's impractical, for economic reasons, to backdoor the FPGA (where's the I/O? You don't know.) or the compiler (hey, all I see is a netlist; my AI isn't that clever).


Backdoor the FPGA...


When in doubt, backdoor the compiler.


[flagged]


More so with your contribution.


Do you have an actual criticism of the comment?


The critical issue here is whether this is really "open".

The x86 platform, like most processors, has a documented instruction set and software loading process. (There are undocumented corners, but the "front door" is open). Whereas historically almost all FPGAs have had fully closed bitstream formats and loading procedures. This necessitates the use of the manufacturer's software which is (a) often terrible and (b) usually restricted to Verilog and VHDL.

If Intel ship a genuinely open set of tools, then all manner of wonderful things could be built on the FPGA, dynamically. That requires being open down to the bitstream level, which also requires that the system is designed so that no bitstream can damage the FPGA.

To me this is most interesting not at the server level but at the ""IoT"" level; if they start making Edison or NUC boards that expose the FPGA to a useful extent.


(a) You don't have to use any vendor's IDE, only their placement and synthesis tools, which all support the command line. (b) Verilog and VHDL are the two dominant HDLs... what other language would you want to program in?


I'd quite like to see some HDL innovation. You can compile to Verilog just like people compile to Javascript to run in the browser, but it's not the most convenient way of doing it.


HDL is a very different paradigm than Javascript. You aren't writing a program. You are describing hardware which is why it is synthesized into gates and not compiled. They are apples and oranges. The closest analog I can think of is vector graphics vs raster graphics, i.e. Adobe Illustrator vs Photoshop. They're meant for different purposes.

Verilog/SystemVerilog are pretty great at what they do. People are pretty happy with them. The problem is that those who get into FPGAs (from a CS background) expect to write "code", which isn't what you are doing in the hardware world.

There is also SystemC for when you want to write test benches or bus-functional models. It is also used pretty extensively in the hardware flow.


You don't have to mansplain HDL to me.

I agree that writing code will lead you astray, but not with "people are pretty happy with Verilog". It has a whole load of limitations: no aggregate types, no language-level distinction between synthesisable and unsynthesisable, and quasi-sequentialism that confuses beginners.


SystemVerilog supports structs and unions. Synthesizable constructs are up to the synthesis tool to infer, and they are very clear as to how they work.

Any HDL will be confusing to someone who isn't used to it, because you are defining a set of parallel behaviours and responses to specific stimuli, not a set of instructions.

There can always be improvements, but as of today (System)Verilog is the most popular by far, followed by VHDL. Therefore the tool vendors support those languages, which is what I was responding to in the original post.


>mansplain

Really?


b) Bluespec


Bluespec is a set of SystemVerilog extensions, so it still falls under Verilog support. The Bluespec compiler spits out Verilog RTL to be fed into other tools, like Xilinx's synthesizer. So yes, you can use Bluespec with Xilinx/Altera tools today.

It is also new, and thus not known and used by thousands of RTL coders every day.


AMD HSA is more compelling. These look like they will be just as hard to program as adding in an accelerator card, only the bandwidth between the CPU and the FPGA is higher. Everything is merging into an amorphous blob, FPGAs have been adding hard blocks for years, GPUs have been adding scalar accelerators. The vector and the scalar, the hardwired and the adaptable are all becoming one. Hell, even the static languages are adding dynamism and the dynamic languages are adding types. Floating point is becoming succinct [0]. Computation is becoming a continuum.

[0] http://johngustafson.net/unums.html


That Gustafson link is intriguing - efficient floating point without the drawbacks. Previous HN discussion: https://news.ycombinator.com/item?id=9943589


Programmable logic on chips will be INCREDIBLE for the field of intrinsic hardware evolution, which is a slowly emerging science. This is huge for the field of AI and electronics.

I've been waiting to see this kind of thing for years, ever since I read Adrian Thompson's work on evolution with FPGAs, in which he:

"Evolved a tone discriminator using fewer than 40 programmable logic gates and no clock signal in a FPGA" (slides: https://static.aminer.org/pdf/PDF/000/308/779/an_evolved_cir...)

EDIT: Full paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50....

The field has crawled along pretty slowly since then as far as I can tell.

However, this could be a HUGE thing for computing; developers would finally have a way to create hardware that interacts directly with the physical world in ways we haven't thought of yet. As a small example, Thompson's work revealed circuits that were electromagnetically coupled in novel ways but not connected to the circuit path, yet were required for the circuit to work. Using evolution, in time we should be able to come up with unique solutions to hardware problems not envisaged by human designers.

This is really exciting.


There were many years when any hardware acceleration card failed pretty quickly, because CPUs were advancing so rapidly that dedicated hardware could not keep up. Apparently we have reached the end of that era. With GPUs, and things like this FPGA integration, hardware matters again.


+1

Before, the difference between highly optimized code and OK code was maybe a 2-3x speedup - roughly one Moore's law doubling. With heterogeneous computing it is more like 20-30x or more. And Moore's law is dead! This will change a lot in the IT world ("more servers are the solution", "developer time is more expensive than computer time", etc.). Learn C; down-on-the-metal programming is back, and the future is heterogeneous parallel computing.


No, native won't help you. Native is a red herring. C is not the correct abstraction level to take advantage of heterogeneous parallel hardware. The advantage of Rust isn't that it is native; the advantage is that it removes the GC, the pressure on the memory subsystem, and the latencies involved in compaction. The HotSpot JIT produces code as fast as or faster than a C compiler. One could design a language that is high level and removes the GC through an affine type system. I predict there will be a hybrid language that does gradual affine typing and marries a GC, escape analysis, and use-at-most-once semantics.


I wish that were true, but that is not what I'm seeing. (I'm doing HPC.) It's not about native C performance vs. some other language; it's about the low-level things you can do in C. You use AVX (and the compiler doesn't help - it's supposed to, but doesn't do it very well, so you have to use intrinsics or asm), then the memory stuff: cache blocking, alignment, non-temporal stores. Same for CUDA: the compiler doesn't get you that much performance. You have to think about all the low-level stuff, usually memory - alignment, whether to use shared memory, cache line size, etc. And then you are using multiple GPUs... no help from the compiler; you have to do it all yourself. It would have been nice if the compiler did it, and there are some compilers that help, but you don't get max performance; with some effort, the performance you get by hand-coding all this stuff is much greater than what compilers can give you. And that advantage is increasing.
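To make the "intrinsics or asm" point concrete, here is a minimal sketch of the kind of hand-vectorized loop being described - aligned loads, AVX intrinsics, a non-temporal store - with the function name, sizes, and tail handling invented or omitted for illustration:

    /* Illustrative sketch only: a[i] = b[i] * c[i] + d, hand-vectorized with AVX.
     * Assumes 32-byte-aligned buffers and n a multiple of 8; tail handling omitted. */
    #include <immintrin.h>
    #include <stddef.h>

    void scale_add(float *restrict a, const float *restrict b,
                   const float *restrict c, float d, size_t n)
    {
        __m256 vd = _mm256_set1_ps(d);
        for (size_t i = 0; i + 8 <= n; i += 8) {      /* 8 floats per 256-bit register */
            __m256 vb = _mm256_load_ps(&b[i]);        /* aligned load */
            __m256 vc = _mm256_load_ps(&c[i]);
            __m256 r  = _mm256_add_ps(_mm256_mul_ps(vb, vc), vd);
            _mm256_stream_ps(&a[i], r);               /* non-temporal store: bypass the cache */
        }
        _mm_sfence();                                  /* order the streaming stores */
    }

    /* Buffers would come from something like aligned_alloc(32, n * sizeof(float)). */

This is the sort of loop auto-vectorizers are supposed to produce on their own, but often don't, which is the parent's point.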


OK, maybe it isn't a question of C and native, but of access to low-level semantics, memory layout, and specialized instructions. The majority of programs and programmers are better served by going with higher-level, easier-to-parallelize semantics than by dropping down to architecture-specific features. I am thinking Grand Central Dispatch vs. assembler.

I would argue that the low level work you are doing should be done in a macro or compiler.

http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl1.pd...

http://www.graphics.stanford.edu/~hanrahan/talks/dsl/dsl2.pd...

Pat Hanrahan makes a compelling argument for using special purpose DSLs to construct efficient performant code that takes advantage of heterogeneous hardware.

See the Design of Terra, http://terralang.org/snapl-devito.pdf


Thanks! These are really useful. (I'm actually now making a small DSL for distributing work on accelerators.)

I personally really like the idea from the Halide language of having one language for the algorithm and another for how the computation is scheduled. If something like that could be made general purpose it would be very useful.

http://halide-lang.org/

>should be done in a macro.. Encouraging C programmers to use macros is like encouraging alcoholics to drink :) But I guess you weren't thinking of preprocessor macros.


I find Halide really interesting, like the split between control and data planes. It made me realize we conflate things and don't even notice that they can be separated.

>...macro

Yeah, I didn't have preprocessor macros in mind. ;*| But wonderful, AST-slinging hygienic macros!

Take a look at http://aparapi.github.io/ - it's one of the best examples of making OpenCL a first-class citizen in Java.


I feel like I missed your point, in what way are GPUs and FPGAs hardware that CPUs are not?


GPUs and FPGAs are accelerators. CPUs are not.


I think his point was that accelerators all had the same problem: the accelerator from 18 months ago was obsoleted by today's CPUs due to Moore's law. This is no longer true, so today's accelerator is as good as the next iteration. Of course, I guess what you mean is that the future is in developing different, new accelerators for specific niches. So your future compute system is modular, with an accelerator module bus, and plugged into it are cards for managing graphics, physics, fluid dynamics, I/O, crypto and so on. You could build a system tailored to your workload...


Programmable logic on the die sounds like a great thing in principle, but the place where it really comes into its own is I/O work: network/disk acceleration, offload, encryption. This is where hardware that is slow but wide, yet reconfigurable over a software lifecycle (e.g. protocols, file systems, etc., which change rapidly), would be a benefit. So the real question is: what is the I/O capability of one of these things? Will the high-speed transceivers be exposed in a way that lets I/O devices talk directly to the FPGA, or will everything need to go through a slow, high-latency PCIe interconnect? If the latter, then I would predict a chocolate teapot in the making.


One can program the NIC to DMA packets directly into the address space allotted to the FPGA. Once set up, the FPGA should be able to get hold of the packets and start processing without a single CPU cycle used on the data plane.


Sure. This is a possibility, although it is a bit roundabout, and there would be an interesting song and dance in the NIC driver. NICs are typically told where to DMA to using descriptor tables programmed into the NIC by the driver. To do this truly without CPU intervention, you would need to write a hardware driver in the FPGA to program the NIC's descriptor tables (I can't even imagine what a nightmare that would be). Otherwise, you would have to have the CPU involved in setting up and negotiating transfers between the NIC and FPGA, and a second driver between the FPGA and software. It's pretty messy either way. And given the proliferation of cheap FPGA-enabled NICs, it seems like a non-starter. If the FPGA transceivers are broken out directly, then a simple adapter board would allow the FPGA to talk directly to the network and/or memory devices.


> you would have to have the CPU involved in setting up and negotiating transfers between the NIC and FPGA and a second driver between the FPGA and software

Plenty of "kernel bypass" and RDMA type functions use shared/user-space memory for "zero-copy" (in reality one copy), operations between NIC and software. If a similar scheme can be used with the FPGA then it would not have too much overhead. I agree, not as direct/efficient as having FPGA serdes I/O go directly to some SPF+/network transceiver, but then you'd also be taking up valuable FPGA gate capacity to run NIC PHY/MAC and standard L2/L3 processing functions that you get from a NIC.


RDMA/kernel-bypass NICs work by mapping chunks of RAM and then automatically DMA'ing packets into those chunks. Again, copying data to RAM, then down to the FPGA, then back up to RAM would be a pretty roundabout way to give the FPGA access to packets. Much simpler/better to let the data stream through the FPGA from/to the wire. In addition, the PHY/MAC layers these days are pretty thin for Ethernet-style devices, and modern FPGAs are by comparison huge. I'm not saying it can't be done, I'm just saying it seems sub-optimal when FPGAs already have a ton of I/O resources and are already used as NICs. The question as to whether these resources are exposed to the outside world is the salient one.


These slides mention the possibility of using PCIe or QPI to attach to the CPU: http://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=c....


I think I may not have been clear enough. I think we can take it as a given that there will be some kind of high-speed interconnect between the CPU and the FPGA (as you say, PCIe or QPI or whatever). But this is not the I/O problem I'm worried about. What I'm wondering is how the FPGA gets its access to the outside world. Does everything have to go through the CPU (which would be slow), or would the FPGA be able to use its own multi-gigabit transceivers to talk directly to I/O devices (like NICs or HMC or flash)?

I always imagined that the best use of FPGAs in systems like this would be as an I/O coprocessor. If the only way to get to the FPGA is via the CPU, then most (all?) of the benefit is lost.


Although, having taken a look at those slides, it would appear that the FPGA is given its own PCIe connectivity. That could be an interesting way to interact with the outside world.


The Arria 10 GX can do up to 16x 30Gbps transceivers and 96x 17Gbps transceivers - I can't imagine they'd turn off all of these, since that is the selling point.


The question is where they are broken out, if at all. A bunch will be used for the QPI/KPI interface, but how many will get broken out to play with in the real world? If the slides are correct, 8x will be broken out as PCIe, so that's something, I guess. But not very much. Which is kind of the point I'm making: FPGAs are great I/O-processing devices, but if you can't access the I/O, then that use case is toast.


Infiniband as a connection technology would be interesting (to me), but Intel has been developing something called Omni-Path, which seems to be their version of a successor to it.

Kind of wondering whether these will work with Intel's Omni-Path, and if so, what shape that gives things...

In theory, could be very interesting. Reality though, we get to find out. :)


Check out Netronome. www.netronome.com

They do exactly this in a programmable PCIe card and custom ASIC.


I have used/programmed their cards. It was the worst 3 months of my life. I will never willingly touch their stuff again.

A buggy compiler implementing a superset of a subset of C89 with totally crazy macro extensions and bizarre locality properties (e.g. manually declaring whether a variable is in a register or in RAM); bad, impossible-to-decipher (machine-generated!) "documentation"; 3 (!!) different and incompatible "standard" libraries, each implementing different sets of features, 2 of which were written in assembler and not directly accessible from C; and almost no debugging tooling - e.g. I had to write my own locking library because there wasn't one. What a nightmare.


Finally, this technology is gaining acceptance. Leopard Logic and others tried this about 15 years ago but Moore's Law and Dennard Scaling were still going so CPU+FPGA didn't take hold. I'm not sure exactly how Intel is going to implement this but the predecessors had multiple configuration planes so that the FPGA could be switched from Config1 to Config2 in nanoseconds (e.g. TCP offload, then neural network calculation, then TCP offload, etc) and had some automatic compiler support.


My question is what market is going to be driving this? Who will want to buy this, and how deep are their pockets? Is this a niche product for a handful of applications, or something we'll see in every PC in 5 years?

The GPU was successful because it had a killer app: Gaming. What's the killer app for the FPGA going to be?


Server-side things: machine learning, on-die network switching, various forms of offloading (SSL, compression, possibly hypervisor stuff).

It will be a while before it shows up in consumer gear, as the use cases are not there yet. Consumers may still benefit: when someone figures out something amazing for it to do, they will get a hardened version of it.


> various forms of offloading (SSL, compression, possibly hypervisor stuff).

I wonder whether Intel will allow that. Better hardware offloading for various algorithms (SHA, RSA, AES, …) and hypervisor acceleration (VT-x, VT-d, EPT, APICv, GVT, VT-c, SRIOV, …) have been one of the main selling points for new CPU generations. An FPGA would render most of them moot by allowing operators to configure whatever offloading they need without requiring new, expensive Intel chips.


Today's FPGAs take maybe 10x-20x more chip area, and hence are much more expensive, than the same logic implemented in an ASIC; the same goes for power. So selling FPGAs would be more profitable in and of itself, plus Intel gets a full ecosystem doing R&D on FPGA algorithms which it can later build into chips and sell.


Intel can always release versions with more fabric, better access to the cores, main memory, caches and system devices.

It will also open up the ability for them to separately sell offloading features as IP cores. Also the risk of them having to disable a feature because of an error goes down as they can easily issue an update for it.


Custom circuitry will always be much more efficient than the same thing on an FPGA, so it should be easy to outcompete the FPGA for a task. The FPGA can help Intel with market research for what to put in silicon, just look at what is popular to put there.


The virtualization stuff isn't offloadable. They're a bunch of invasive changes to the memory and I/O paths of the processor core, not stuff that could be handed off to a coprocessor.


You could route a lot of the I/O paths through the FPGA if it were suitably wired - a slight latency bump in exchange for aspects of virtualization never requiring any CPU cycles to handle them.


They can always sign the FPGA bitstreams so that only approved bitstreams are allowed to run on matching approved silicon. Similar to CPU binning, certain CPUs would only be able to load certain bitstreams depending on fuse flags.
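A hedged sketch of what "only approved bitstreams load" could look like on the loader side, using libsodium's detached signatures purely as a stand-in (Intel's actual signing scheme, key handling, and fuse-flag checks are assumptions here, not anything published):

    /* Illustration only: refuse to load a bitstream unless it carries a valid
     * vendor signature. Ed25519 detached signatures via libsodium are used as a
     * stand-in for whatever scheme Intel would actually employ. */
    #include <sodium.h>
    #include <stdio.h>

    int load_if_approved(const unsigned char *bitstream, unsigned long long len,
                         const unsigned char sig[crypto_sign_BYTES],
                         const unsigned char vendor_pk[crypto_sign_PUBLICKEYBYTES])
    {
        if (sodium_init() < 0)
            return -1;

        if (crypto_sign_verify_detached(sig, bitstream, len, vendor_pk) != 0) {
            fprintf(stderr, "bitstream rejected: bad signature\n");
            return -1;           /* fuse flags could further restrict which keys are accepted */
        }

        /* ...hand the verified bitstream to the configuration engine... */
        return 0;
    }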


Maybe they will market their next generations as "bigger and better FPGAs" now.


Anything reasonably parallel that needs high throughput or low latency can benefit from an FPGA. I would probably find use for one in audio/video processing offload for media production. If the FPGAs entered consumer parts, games could use them for I/O offloading to give gamers a latency edge, or process thousands of parallel game simulation elements in ways maybe GPUs can't help.


> Anything reasonably parallel that needs high throughput or low latency can benefit from an FPGA.

That is not quite right. Rather: "Anything reasonably parallel that needs high throughput or low latency and which is not already provided by the CPU or GPU can benefit from an FPGA."

Modern server and desktop CPUs are incredibly fast and offer a lot of parallelism for certain operations. For example, when it comes to floating-point operations, no FPGA has the slightest chance against a desktop CPU (for prices in the same order of magnitude), and even less against a GPU.


CPUs and GPUs have pathetic memory bandwidth. For this class of problems FPGAs are unmatched.


Are you referring to internal bandwidth? Because CPUs have roughly the same external memory bandwidth, and GPUs have about an order of magnitude higher external bandwidth than FPGAs.


Both internal and external. The larger FPGA parts can host quite a lot of DDR controllers, but the most important thing here is that there are HUNDREDS of single-cycle two-port block RAMs. No CPU or GPU can match this with their pitiful fixed tiny caches.


Xilinx Kintex UltraScale KU040-2FFVA1156E (quoted price: about 2000 USD), a typical high-performance-computing FPGA, can fit at most 3 (maybe fewer) 64-bit DDR4 controllers @ 2400 Mbit/s, for a theoretical peak bandwidth of 57.6 GB/s.
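(Per controller that's 64 bits × 2400 Mbit/s per pin = 19.2 GB/s, times three controllers.)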

Intel Xeon E5-2670 v3: 68 GB/s [1].

NVidia K40: 288 GB/s.

So you're wrong about external memory bandwidth. Regarding internal bandwidth: True, FPGA block RAM has a huge aggregated bandwidth, but it comes with some limitations.

[1] http://ark.intel.com/products/81709


Three pre-cooked DDR controller macro blocks. Now think of how many soft controllers can be added with so many spare pins.

Also, think of all the high-speed transceivers available to feed data into an FPGA.

For the problems I am working with, FPGAs, even the mid-range ones, are far more suitable than the top GPUs.


There are no hard-core SDRAM controllers on the KU040; I'm talking about soft-core controllers. The bandwidth is limited by the number of I/O pins (what good is a memory controller without access to a memory chip?), and SDRAM needs quite a number of signal lines.

What FPGA device are you using? How many memory controllers are on it, what width do they have, and at what frequency are the memory modules operating? What external bandwidth do you actually achieve?

> For the problems I am working with, FPGAs, even the mid-range ones, are far more suitable than the top GPUs.

I believe you, but we're talking about maximum external memory bandwidth, not suitability in general.


Tbh I do not care that much about external memory bandwidth; I only need internal bandwidth plus some slow swapping into external DDR (I've used at most 4 channels so far) - my use cases are solely streaming, so transceivers are more than enough. In some cases even a lowly Spartan-6 is able to beat all the shit out of Teslas. Compare hundreds of memory fetches a cycle vs. whatever the pitiful NVidia cache is capable of (and remember that if your load is thrashing your cache, you're screwed - no way to fix it if there is no option to pre-scramble your data for linear access).


External bandwidth is limited, and this isn't going to change. The FPGA will have full access to the caches; this is where it will shine.


Some public information on what MSFT is doing today: http://research.microsoft.com/en-us/projects/catapult/


Microsoft and (apparently) Intel think the killer app is in accelerating datacenter apps. I think it's more likely to be something nobody is doing yet.


While I've messed around with FPGAs alone and found it fun, anything I imagine them doing as an accelerator/coprocessor/whatever sounds quite dull.

The prospect of reprogramming them every few seconds (genetic algorithms?) sounds more interesting.


If you don't need it as an accelerator then what is the point of hardware FPGA? Just use a simulator in that case.


It was for school. Simulator would have worked fine.


Could you design a language which does JIT-style compilation to dataflow calculations, by creating arithmetic and logical functions on an FPGA?

(I assume the FPGA would need a way to take over the computer bus and access memory? How multiple functional FPGA subunits do that, I leave as an exercise for Intel. :-) )

A dynamic software accelerator, not a specific function one.

(It is probably obvious from the above that the little I know of hardware and FPGAs is a bit old... we almost used keyboards made out of stone. :-) )


> Could you design a language which does JIT-style compilation to dataflow calculations, by creating arithmetic and logical functions on an FPGA?

The complete compilation process (source code --> netlist --> mapped design --> placed & routed design) can take several hours for large designs; maybe several minutes for small designs. Not suitable for JIT.


Custom full-disk-encryption implementations, like crypto acceleration for an algorithm such as norx.io. Or custom video codec acceleration for streaming or rendering. Or physics simulation acceleration.


> Or physics simulation acceleration.

No: Physics simulations are typically FLOPS-limited; modern CPUs deliver a lot more floating-point operations per second than FPGA devices at comparable prices.


Can't one trade precision to do fast fixed point operations on FPGAs?


It's not only the precision (i.e., the bit-width), but also the variable exponent (i.e., the "floating-point" part).

Unfortunately, simulations often cover a wide range of exponents during a single run (one matrix element might be 5.74293574325e8, while the one next to it might be 3.25356343e-9, and you still want to preserve their precision), and the exponents of the inputs might vary a lot between different runs. You can only use fixed-point if you have a good idea of the exponents of the input numbers, and how those change during the course of the computation. That works well for typical digital signal processing applications, and not so well for generic number crunching libraries.
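A tiny C illustration of this, using a hypothetical Q16.16 fixed-point format: the two example matrix elements above cannot both survive the conversion, which is exactly the dynamic-range problem being described.

    /* Illustration: a 32-bit Q16.16 fixed-point format (16 integer bits,
     * 16 fractional bits) cannot hold both example values at once. */
    #include <stdio.h>

    int main(void)
    {
        const double max_q16_16 = 32768.0;          /* ~2^15: largest representable magnitude */
        const double res_q16_16 = 1.0 / 65536.0;    /* 2^-16: smallest representable step */

        double big   = 5.74293574325e8;
        double small = 3.25356343e-9;

        printf("big   representable? %s (exceeds %g)\n",
               big <= max_q16_16 ? "yes" : "no", max_q16_16);
        printf("small representable? %s (below resolution %g, rounds to 0)\n",
               small >= res_q16_16 ? "yes" : "no", res_q16_16);

        /* You could widen the format or rescale, but only if you know the
         * dynamic range in advance - which is the point being made above. */
        return 0;
    }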


> That works well for typical digital signal processing applications, and not so well for generic number crunching libraries.

Even then it breaks down with more complex signal processing algorithms. FPGAs are great for simpler algorithms like FFTs, digital filtering, and motion compensation. They aren't quite as good at more complicated algorithms like edge detection. They really break down when you want to use ML-based algorithms.


AI? https://gigaom.com/2015/02/23/microsoft-is-building-fast-low...

More precisely, to put AI on smartphones. Neural networks are cheaper on FPGAs. You can switch between AIs (voice/image/text recognition, games, etc.) quickly and update them over the internet.


And why, then, is Tesla successful? A pure GPGPU part, with no use for gaming whatsoever.


Nvidia Tesla is still based around the same GPU architecture as NVidia's gaming GPUs, and in most cases even the exact same GPU chips. It benefits hugely from the R&D and economies of scale that come from it being based on a gaming product.


Storage - compression


I think that alongside the hardware problem of integrating both chips on the same die, the other issue is how to program them. We have pretty advanced abstractions when it comes to CPUs nowadays, but looking at some of the code used to program FPGAs, we can definitely see that it's not that simple for developers to enter this world.


There's a bit of a bootstrapping/catch-22 problem with FPGAs. Right now they are all mostly proprietary hardware with proprietary development tools, so only the most demanding niches will invest. Because they only see niche use, nobody invests in more general tools.

If Intel added a fully open, fully specified FPGA to a CPU, that anyone could write tools for, that could change.


Actually I'd argue x86 is huge and scary, and Windows/Unix on x86 huger and scarier, but plain sequential circuits are quite simple.

The actual problem (as stated in the other comments) is that the tooling is all proprietary (huge and scary) and has received NO love.


"Plain sequential circuits" are also quite slow and therefore useless as accelerators alongside modern x64 CPUs.



> the hardware problem of integrating both chips on the same die

Mostly a non-issue. Multi-chip packages have been around for decades. Intel has been putting DRAM chips onto their CPU packages (for their Iris Pro graphics) for a few generations now; Core i CPUs also have on-chip PCIe controllers alongside their QPI inter-CPU interface, if for some reason the latter can't be made to work with FPGAs.


I think "beefy" and "broadwell" is a contradiction.



