Intel Xeon processor with FPGA now shipping (fpgaer.wordpress.com)
172 points by chclau on May 25, 2018 | 79 comments



From The Next Platform (https://www.nextplatform.com/2018/05/24/a-peek-inside-that-i...):

The initial workload that Intel is targeting is putting Open Virtual Switch, the open source virtual switch, on the FPGA, offloading some switching functions in a network from the CPU where such virtual switch software might reside either inside a server virtualization hypervisor or outside of it but alongside virtual machines and containers. This obviates the need for certain Ethernet switches in the kind of Clos networks used by hyperscalers, cloud builders, telcos, and other service providers and also frees up compute capacity on the CPUs that might otherwise be managing virtual switching. Intel says that by implementing Open Virtual Switch on the FPGA, it can cut the latency on the virtual switch in half, boost the throughput by 3.2X, and crank up the number of VMs hosted on a machine by 2X compared to just running Open Virtual Switch on the Xeon portion of this hybrid chip. That makes a pretty good case for having the FPGA right close to the CPU – provided this chip doesn’t cost too much.


That seems like a weak benefit. I don't think the FPGA part will be cheap; in fact, it will probably be significantly more expensive than the CPU. So why not buy a chip with twice the cores, if running twice the VMs is what you want to do with it? Also, you can buy an AMD EPYC chip with twice the cores for about the same price you'd pay for just the CPU part of this chip (i.e. half the Intel cores).


>Also, you can buy an AMD EPYC chip with twice the cores for about the same price you'd pay for just the CPU part of this chip (i.e. half the Intel cores).

Indeed. What benefits most from being turned into an FPGA circuit are stateless or minimally stateful functions, like media decoders/encoders.

Complex stuff like fancy routers, classifiers, and protocol inspection/handling will not see a hundredfold speedup the way the workloads above do. This is why cheap x86-based routers are still a thing.

At the moment, I am involved with one cloud provider in China that is betting big on cheap FPGAs and RDMA. AWS and Azure can be defeated in detail.

The plan is as follows: provide hardware-accelerated "building blocks" for any modern dotcom business.

Need memcached? We have it running over RDMA, from a bare-metal ASIC, 10 times faster than any x86.

Need transcoding? We have it available over RDMA, from a bare-metal ASIC, 10 times cheaper than any AWS instance on a cost-per-megabyte basis.

Need an API proxy for TLS/gzip with gigabytes-per-second throughput? We have it running over RDMA, from a box with 4 PCIe accelerators, and AWS has nothing to offer for this use case other than "buy a hundred top-tier high-performance instances and put them behind load balancers".


Pretty sure AWS has had F1 instances for some time already.


Yeah, but for those you still have to do all the HDL programming yourself. Alibaba, on the other hand, aims to provide everything ready to use over a cute RDMA API.


>"At the moment, I am involved with one cloud provider in China that bids big on cheap FPGAs and RDMA. AWS and Azures can be defeated in detail."

Can you share how this FPGA and RDMA combination works architecturally? What's the software stack composed of? Might you have any links or other resources you could share?


Made almost entirely in house.

You provision access over an in-DC REST API, then RDMA-capable routers route the appliance to your VM, and then you access it through the provided libraries for Linux.

In other words, you don't deal with RDMA yourself at any point.


Folks that need this don't care about the expense; it's about the pipeline between the CPU and the FPGA. If it's 5-10x as fast as a plug-in FPGA, there will be a market.


There could be a win if the FPGA part is efficient. Power and heat are a limiting factor for rack density, and this could allow you to run more machines in a rack than you could otherwise support.

Pretty narrow still, but the market segment is possibly real.


Depends on the functional area. It will be a huge boost for virtual networking, especially for virtual network functions. I personally know of cases where we have to wrangle with SR-IOV and DPDK. I will be super interested in the OVS-on-FPGA offload.


Could this help out in the HFT userland-networking scene?


On a whole different level, if you are merely interested in FPGAs in your computer then the PicoEVB neatly goes in the same M.2 slot as wifi cards do and communicates over PCIe, even if just PCIe Gen 2 x1. As far as I am aware this is, by far, the cheapest way to get an FPGA on the PCIe bus.


There's no onboard RAM on the PicoEVB, other than whatever the FPGA itself provides. That makes it less interesting.


Agreed. M-key version coming soon with onboard RAM and 4x PCIe: https://github.com/RHSResearchLLC/uEVB


For low cost desktop PCIe with memory and I/O there are these two offerings: https://numato.com/product/galatea-pci-express-spartan-6-fpg...

The Galatea from Numato is pretty affordable: it has 2 Gb (256 MB) of onboard RAM and comes with a dual Ethernet and PMOD breakout board from Amazon for $300 USD.

http://www.latticestore.com/products/tabid/417/categoryid/59...

The Lattice ECP5 PCIe board comes with, I think, 128 MB of DDR3 and dual Gigabit Ethernet.

And for boards without memory but lots of I/O there is Mesa Electronics: http://www.mesanet.com/ CNC oriented but you can do whatever you want with the FPGA.


Don't new laptops only have one of those, for their hard drives?


The Lenovo ThinkPad T480 has an M.2 slot and a normal SATA bay. If you order it with an NVMe SSD, it goes in a caddy with an adapter in the SATA bay. The M.2 slot is optionally used for WWAN or a small secondary SSD. However, it is only 42 mm long and PCIe-only, so third-party SSDs are difficult to find.

I only know all this because I just bought one, and have been hunting for something useful to put in that slot...


Sigh. There is no such thing as an "M.2 slot". There is always a key and a size. The T480 has three M.2 slots:

1. Optional M.2 key M 2280 in the main bay via an adapter.

2. M.2 key B 2242; the Lite-On T11 works there as an SSD. It's the only consumer M.2 key B+M 2242 PCIe SSD. This generation doesn't support SATA disks in this slot; the T470 did.

3. M.2 key E 2230, for wifi, as practically every laptop has had since Broadwell.


Bugger. So the FPGA card won't work then. Thanks for explaining.

It's all very confusing and frustrating. So the chances of me finding anything useful to put in that slot are roughly nil, then? I can see no physical or technical reason why my slot should not support this card, except that the card only supports two out of the three possible PCIe-carrying configurations, and I was unlucky. Seems like a poorly designed "standard".


Of course it will work! That's the point: remove the wifi card and use a USB dongle instead. Practically every laptop featuring a Broadwell or later CPU will work.


I meant in my spare, useless 'key B' slot, which will still be spare and useless if I take the wifi card out.

It especially rankles because they devoted two whole PCIe lanes to that damn slot, which would be better put to use unbottlenecking my highly expensive Samsung PM981 NVMe drive.

The T480s does not suffer this problem.


The T480 is almost a T470 and unbottlenecking said drive would've required a motherboard redesign and a new adapter into the bay. Not this generation. The T470 was a completely new design, chassis and planar both.

The T480s is a new machine.


Most laptops have an M.2 key E slot for the wifi card, but it's rarely mentioned because no one ships a laptop without wifi and they don't want to confuse people. Yes, most laptops only have a single M.2 key M slot, but that's not what this card is after.


It's confusing me. Wifi cards don't need an M.2 slot; I thought it was a standard made for SSDs. Usually wifi cards use basic Mini PCIe sockets; even on old (2006 and later) 12" laptops you can find two, albeit sometimes only one socket is populated (the other is empty space with pins there; some people have even soldered one on manually).


Directly from Wikipedia:

Buses exposed through the M.2 connector are PCI Express 3.0, Serial ATA (SATA) 3.0 and USB 3.0, which is backward compatible with USB 2.0. As a result, M.2 modules can integrate multiple functions, including the following device classes: Wi-Fi, Bluetooth, satellite navigation, near field communication (NFC), digital radio, Wireless Gigabit Alliance (WiGig), wireless WAN (WWAN), and solid-state drives (SSDs).


M.2 slots come with different keys for different purposes. Key B is used for WWAN and mostly SATA SSDs (although some slower NVMe drives ship with key B+M; the plus sign means they work in both slots). Key M is used for x4 NVMe SSDs.

And key E is used for wifi. And this card here.


Some laptops have two.


Beware trying to use these on desktops with a PCIe adapter though, the adapters don't seem to break out all the correct signals, so you get USB, or PCIe, but not both.


Not so -- the for-wifi adapters (i.e. those with an M.2 key E) always present a single PCIe lane, otherwise no Intel wifi card would work.


Do those also expose the USB lines on the NGFF port? I've found the PCIe-to-NGFF adapters to be PCIe-only. And I believe USB lines are required for the onboard USB JTAG to work. Am I wrong?


Thank you! I was looking for something cool to put in my spare M.2 slot. It's only 2242 so useful SSDs are a bit thin on the ground.


Cool, will check it out. Thanks for sharing.


Note that this is a ~$2500 processor with a ~$5000 FPGA stuck to it.


Depending on the price, this will be interesting for I/do-something/O workloads like video encoding/decoding. Depending on the price, looking forward to it. Of course, depending on the price. :)


You won't have enough resources for video encoding, most probably. The Arria 10 FPGA has 67 Mbits of memory in total (about 8.5 MB), including memory simulated with registers. As far as I remember, just storing a 1920x1080 (HD) frame in 4:2:0 mode would take around 4 MB, half of that memory. 4K would require four times as much.
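
A back-of-envelope check of that figure (my own sketch, not from the comment above), in plain C:

    /* Bytes needed to hold one 8-bit 4:2:0 frame on chip;
       10-bit or padded formats need proportionally more. */
    #include <stdio.h>

    static long yuv420_bytes(long w, long h) {
        return w * h                    /* luma plane */
             + 2 * (w / 2) * (h / 2);   /* two half-resolution chroma planes */
    }

    int main(void) {
        printf("1080p: %ld bytes\n", yuv420_bytes(1920, 1080)); /* ~3.1 MB */
        printf("2160p: %ld bytes\n", yuv420_bytes(3840, 2160)); /* ~12.4 MB */
        return 0;
    }

So a single 8-bit 1080p frame is roughly 3 MB (a bit more with 10-bit samples or padding, in line with the figure above), already a large chunk of the ~8.5 MB of on-chip memory, and a 4K frame simply doesn't fit.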

You may do some on-line processing like running neural net on the video content, but that's about it. Don't expect anything super exciting from that chip.

Yet, it will bring joy to high-frequency traders. The systems there do most of the work in the FPGA, including UDP/TCP/IP packet processing, and offload some work to the CPU (broadcasts about network topology are handled on the CPU, for example). They also want to receive CPU computation results as fast as they can, and this chip is exactly that.


You have more than enough resources for video encoding, because all modern codecs have a macroblock structure. You don't need to keep the whole current frame in memory at once; you can slide a window around it. (Conversely, you do need to keep the matching area of the previous frame, and maybe even the next frame, in order to do proper inter-prediction.) That said, I would assume GPUs are more suited to this task.
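
To put rough numbers on the sliding-window idea (my own illustration, with an assumed window height):

    /* Line-buffer sizing: keep a band of rows around the current
       macroblock row instead of the whole frame (8-bit 4:2:0). */
    #include <stdio.h>

    int main(void) {
        long width = 1920;        /* frame width in pixels */
        long band_rows = 3 * 16;  /* current macroblock row plus one row
                                     above and below for the search area */
        long bytes = width * band_rows * 3 / 2;
        printf("~%ld KB per reference band\n", bytes / 1024);  /* ~135 KB */
        return 0;
    }

About 135 KB for 1080p, a far cry from the ~3 MB a whole frame would take.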

By FPGA standards this thing is enormous.


You are also limited by the number of access paths into the memory which holds, say, a macroblock. I forgot about this issue, sorry.

For block RAM in previous-generation FPGAs from Altera, there was one read path and one write path. To get two read paths you would need to duplicate the block RAM, one copy per read path. This means that if you search for block content inside a macroblock with N parallel accesses, you need N stored copies of the macroblock.
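
A rough illustration of that replication cost (assumed window size; Arria 10 block RAMs are 20 Kbit M20K blocks):

    /* With 1-read/1-write block RAMs, N parallel readers of the same
       reference window need N physical copies of it. */
    #include <stdio.h>

    int main(void) {
        long window_bits = 48L * 48 * 8;  /* a 48x48-pixel, 8-bit search window */
        long m20k_bits   = 20 * 1024;     /* Arria 10 M20K block size */
        for (int readers = 1; readers <= 8; readers *= 2) {
            long blocks = (readers * window_bits + m20k_bits - 1) / m20k_bits;
            printf("%d readers -> %d copies, >= %ld M20K blocks\n",
                   readers, readers, blocks);
        }
        return 0;
    }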

Tabula's time-shifting tech allowed for up to, I believe, twelve paths into the block RAM, six for read and six for write (they were time-scheduled onto a 1-read-1-write block RAM operating at six times the frequency). I thought this would be a road to superscalar FPGA CPUs, but Tabula closed down.

You can imagine not using RAM at all, but then you will spend other resources in FPGA.

These are the problems I see with video compression here. I think they are substantial but not insurmountable, and solving them requires striking a balance. It's just that the balance, as I saw it, is not in favour of the FPGA.


Such a shame that Tabula closed down. They had some of the most interesting tech -- hopefully someone is able to pick it up someday. As someone who works with FPGAs, I think a solution like theirs is the only way FPGAs will become mainstream.


Altera took many of the employees (although quite a few are now at AWS) and it's rumored they also bought the IP. A lot of the core ideas will live on in Stratix, hopefully.


Depends a lot on the codec, too. Lightweight codecs like TICO/JPEG XS and some JPEG implementations can use very little RAM (roughly 12 rows' worth). Others need a lot more for rate control, motion estimation, etc.


Could this be a hint at Intel's supercomputing and AI strategy? They cannot compete with GPUs on flops with Xeon, but an embedded FPGA might get them closer.

It is a risky strategy, however. Even if they can attain similar performance, which I doubt, programmability remains the big problem for FPGAs. I know Intel is pushing OpenCL, but it simply does not have a software ecosystem right now, and it remains to be seen whether they can even enable much of the OpenCL feature set on an FPGA.


> They cannot compete with GPUs on flops with Xeon, but an embedded FPGA might get them closer.

FPGAs aren't known for their FLOPS, though. Sure, the high-end ones pack a punch, but compared to GPUs they are still extremely expensive.


Altera has hard floating point blocks in their FPGAs, I don't think any other manufacturer does though.


Yeah - the best Stratix 10 can do 9.2 TFLOPs. A GTX 1080 Ti can do 11.3 TFLOPs. Now I can't find any price info on those Stratixes, but given how expensive these FPGAs generally get, I highly doubt it will be anywhere near the $700 for a 1080 (and that's for the entire card!).

I am not saying these things don't have applications, but the "usual" computational workloads are not one of them.


>programmability remains the big problem for FPGAs

I agree, because there are added levels of complexity with HDLs that software developers don't have to deal with. What would be nice is a tool where you could declare your problem (functions) and your constraints (throughput, LUTs available) and it would figure out the memory, ALUs, and pipelining needed to solve the problem.


That is called high-level synthesis and there are some tools for this available (and I know at least one company who produces a proprietary toolchain - including hardware - for this).

In my experience, high-level synthesis makes things a lot easier for the programmer, but you still have to be aware of very low-level details (and you still can get bitten by abstractions that you don't fully understand).
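
For readers who haven't seen it: HLS code typically looks like ordinary C annotated with constraints. A minimal sketch, assuming Xilinx Vivado HLS-style pragmas (Intel's OpenCL/HLS tools use different directives):

    /* The tool infers the memories, adders and pipeline registers
       from the loop body plus the directives. */
    #define N 1024

    void saxpy(float a, const float x[N], const float y[N], float out[N]) {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1   /* ask for one result per clock cycle */
            out[i] = a * x[i] + y[i];
        }
    }

Even in this form you still have to reason about how the arrays map onto block RAM and how many memory ports the loop really needs, which is exactly the kind of low-level detail mentioned above.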


That company might fruitfully make their system available within a datacenter / "VPS"-type context, eg the way Cloud9 does it. (https://c9.io/, you make a free account, you instantly get a Web based editor connected to a Debian VM)

If their system accepts "mostly ordinary" code in eg C, a fair amount of stuff would probably work with it, so it could scale beyond educational exploration/tinkering, too.


Here is a service I know of: https://reconfigure.io/


I think they did exactly that with Amazon AWS instances (although I have to admit that I don't know the specifics).



There are plenty of C-to-HDL languages (https://stackoverflow.com/questions/8988629/can-you-program-...), and there are some interesting tidbits about them in that SO link.


Xeon Phi coprocessors are flops competitive with GPUs and are used in many supercomputers.

Embedded FPGA accelerators are of interest primarily to large server farms and cloud operators, where the cost of developing FPGA acceleration is lower than the savings from the acceleration -- Microsoft in particular has done a lot of work in this area (for Bing), and probably worked with Intel to define this CPU.


Pretty sure that Phi as you know it is largely being abandoned. Argonne's Aurora exascale machine was supposed to be Xeon Phi, but it has been postponed until 2021 and will likely use some other tech. Also look at the Top500 list from late 2017: Trinity, the NNSA computer at #7, is Phi-based but only gets about 50% of theoretical peak.


Phi is dead. The next NERSC computer is GPUs. Not sure what Argonne is doing, but there's no reason to think it won't be GPUs. The only reason Phi/KNL made sense was that it seemed like a conservative choice when GPUs were more exotic, which might have been okay if Intel hadn't had the crazy yield problems they had. FPGAs don't make sense because the conservative choice at this point is a GPU. I don't think any of the leading compute facilities have the stomach for a new technology like this, especially from Intel.


C'mon, it wasn't just yield problems. The premise that you could just recompile CPU code and it would run quickly on Phi was false.


It's a pity. Intel does have a compiler that works great for Xeon Phi: ISPC. Alas, it seems Intel was never serious about ISPC.


> The next NERSC computer is GPUs.

Got a source on this? I've not seen an official mention of what sort of architecture NERSC-9 will be.

> Not sure what Argonne is doing but there’s no reason to think it won’t be GPUs.

It will be a future Intel processor [1].

[1] https://www.hpcwire.com/2017/09/27/us-coalesces-plans-first-...


Nothing I can link to, but that information has been shared with DOE labs and projects relying on large NERSC allocations.


My personal tinfoil-hat take is that Xeon Phi was made to get China to waste top dollar on their supercomputers by going with a less useful accelerator made by a US company.


By wasting millions of dollars and literally giving Phi to China for free as part of the Intel CPU deal?

The answer is rather simple: Xeon Phi wasn't as good as its GPU counterpart.


Once China figured that out, the way they'd avoid falling for the same trick twice is simply never using US tech again. Which would limit other opportunities.

It's also at least plausible that they have people trained to think like Americans who might have thought of the same thing in advance.



Yes, the US government has forbidden the export of Xeon and Xeon Phi to China's supercomputing centers. That prompted the construction of Sunway TaihuLight, which uses China's own CPUs.


This is great! I think that some time in the future, FPGAs will be a vital part of data processing.

In theory, an application with a computation-heavy task could program the FPGA to provide part of that task in hardware (think the hot innermost loop body).

What I am worried about is the infrastructure that is needed to make this happen: Is there even support for this in our compilers? What would support look like?


> In theory, an application with a computation-heavy task could program the FPGA to provide part of that task in hardware (think the hot innermost loop body).

This is absurdly far away; it reminds me of the decades of assuming that 4GLs would make programmers obsolete, or the decades of trying to cross-compile C to FPGAs, badly.

It doesn't help that there's a huge infrastructure barrier caused by closed tools. Imagine if Intel brought out a processor with a proprietary instruction set where you were only allowed to use their FORTRAN compiler (no C, let alone anything more modern or JIT) with a per-seat license. That's where FPGA tools are.

We won't see a Cambrian explosion in FPGA tooling until they are made properly open. Building using open-source tools needs to be actively supported by the manufacturer.

There are also conceptual obstacles; FPGAs are sufficiently different that programmers have to re-learn and re-write idioms in order to get usable results. It's as big a jump as going from Javascript to CUDA.


With the end of Moore's law, won't this kind of thing become increasingly necessary, though?

I agree with you about the problems of closed-source - but I feel like the end of Moore's law will encourage creative solutions to performance problems - which FPGAs would at least make technically possible. And, even if it's like the old days when people bought compilers and access to source, people will still do it - since it'll be the best way to get an edge on the competition.


Back in the mid-to-late nineties, Federico Faggin, designer of Intel's seminal 4004 (the ancestor of the whole x86 lineage), co-founded a company called Starbridge Systems that aimed to construct and sell what they termed hypercomputers, whose architecture was rapidly-reconfiguring FPGAs that enabled a compile-to-hardware model.

Here's one of the few references to that project that I can now find online, a WIRED era piece with all the gushing nerd optimism of the pre-dot-com-bust: https://www.wired.com/1999/02/the-super-duper-hypercomputer/

I suppose this is where Intel is headed now with these first few tentative steps.


Isn't this what Transmeta was also trying to do? https://en.wikipedia.org/wiki/Transmeta


> Is there even support for this in our compilers? What would support look like?

FPGA offload has been a very active area of compiler research for more than a decade. I like this paper for example https://dri.es/files/fpl05-paper.pdf


It is possible to imagine a future in which no one outside the processor manufacturer needs to do a damn thing. Nowadays, no one else knows or cares what microcode is used to execute x86 instructions. Maybe someday hot enough sections of code will get transparently JITted to logic gates.


> think the hot innermost loop body

In many applications data access is the bottleneck, and an FPGA coprocessor will not give any improvements in that area.

Using an FPGA to do arithmetic computations (e.g. as in audio/video coding) could provide a benefit, but I fail to see how this improves things much over having an extra CPU core.

The only thing an FPGA does is replace registers by wires, basically, and there's not much to be gained there, I guess.


I am not sure that's necessarily true as long as the FPGA has access to the CPU bus - as it seems to here - as opposed to hanging off a slower peripheral bus - as a lot of other FPGA experiments/cards have done in the past. If you don't penalize the FPGA with slower data access, it can make a big difference for many (albeit not all) applications.


There is software that can poorly map OpenCL calls onto HDL code. It is still light years behind a human designer.


I thought FPGAs were limited to either cases where you need some custom hardware _now_, or where demand is low, because if demand is high enough and you can wait, custom ASICs are cheaper, lower power, and faster.

What has changed here? Are FPGAs faster now or are many more people experimenting, creating sufficient demand for custom hardware at short notice?


CPU performance increases are a lot slower now so it now makes sense to spend time on custom RTL. In the past you would just wait 8 months and buy the new twice as fast CPU.

FPGAs are a waste of Silicon compared to ASICs, but for fixed applications they are less wasteful than CPUs.


This is cool! I have so many questions, like how does it work with context switching? How do they protect access to the cache?


They don't, it's an Intel CPU.


Does it mean that you can load your own designs on that FPGA? Where can we find what features are exposed to software?


Looks like a future AWS instance type.



