Intel Completes $16.7B Altera Deal (eweek.com)
162 points by spacelizard on Dec 29, 2015 | 73 comments



Hell yes! Intel chips are about to get exciting again. SGI put FPGAs on nodes connected to its NUMA interconnect with great results. Intel will likely put them on its network-on-chip with more bandwidth and integration while pushing latency down further. The '90s-era tools that automatically partitioned an app between a CPU and an FPGA can be revived once Intel knocks out the obstacles that held them back.

Combine that with the OSS synthesis work by Clifford Wolf and Synflow, which can be connected to OSS FPGA tools, and there's even more potential here. Exciting time in the HW field.


Even more exciting is the OmniPath[1] stuff that came out of the InfiniBand acquisition. RDMA + Xeon Phi + the insane number of PCIe lanes[2] available for those new M.2 SSDs, which post absolutely insane numbers[3], all of it supported by ICC[4], and you've got a really budget-friendly HPC setup. I'm really hoping IBM's OpenPOWER gains traction, because Intel is poised to capture the mid-market in dramatic fashion.

[1] See: IntelOmniPath-WhitePaper_2015-08-26-Intel-OPA-FINAL.pdf (my copy is paywalled, sorry)

[2] http://www.anandtech.com/show/9802/supercomputing-15-intels-...

[3] http://www.anandtech.com/show/9702/samsung-950-pro-ssd-revie...

[4] https://software.intel.com/en-us/articles/distributed-memory... This is for Fortran, but the same Remote Direct Memory Access concepts extend to the new Xeon architecture.


Not to nitpick, but M.2 is just a form factor. The big gains come from being NVMe PCIe, not the form factor. You get the same gains with NVMe PCIe in a 2.5" drive form factor.


M.2[1] defines both the form factor and (importantly) the interface. While it is true that NVMe PCIe is the interface that makes the difference here, the standardization of both the interface and the form factor seems pretty important here.

[1] https://en.wikipedia.org/wiki/M.2


Probably should've posted this response to you:

https://news.ycombinator.com/item?id=10805327

Now I mean it even more.


Whut? Mellanox has been purchased? Explain "InfiniBand acquisition"


Intel purchased QLogic's Infiniband business in 2012.


Oh crap, I didn't know that. Means they have a near-NUMA interconnect, FPGA tech, and recent RAS features. They could be a supercomputing and cloud force of nature if they play their cards right. They'll have the advantage of the best nodes, tools, and probably mask discounts. Getting more exciting...


I'm excited about this too, but the article suggests this is an industry-wide thing; IBM is doing this with Xilinx, Qualcomm is experimenting with ARM stuff (not sure how this is different from the Zynq), and AMD also with Xilinx. If there is something that works here, I'm sure Intel won't be the only game in town.


Can you give any examples of how FPGAs helped SGI? I'm aware that a certain Voldemort-like .gov liked them at one time, but I never saw any uptake in the real world.

Intel is a volume player; this makes FPGAs a bit of a head-scratcher, since in the Grand Scheme, products might get prototyped as FPGAs, but they jump to high-volume, higher-performance ASICs ASAP.


Bing is using them: http://www.enterprisetech.com/2014/09/03/microsoft-using-fpg...

Using FPGAs allows new features and optimizations to be tested and productionized (almost) as quickly as regular software. Upgrading to a newer design is essentially free and instantaneous, compared to the expensive and time-consuming process of producing new chips.


I forgot to answer the other part of your question because I was skimming. The FPGAs didn't help SGI, I don't think. They were just too hard to use, and the nodes were expensive. However, SGI did the right thing at the wrong time: connecting FPGAs to memory shared with CPUs over an ultra-high-speed, low-latency, cache-coherent interconnect. This allowed FPGA co-processing to do far more than it ever could on PCI cards, where the PCI overhead was significant for many applications. Just like how NUMA performance trumped clusters over Ethernet or even InfiniBand in many cases.

So, I was citing them as the opening chapter to the book Intel's about to write on how to do FPGA coprocessors. Hopefully. Meanwhile, FPGAs plugging into HyperTransport or an inexpensive NUMA would still be A Good Thing to have. :)


AMD was in the lead with semi-custom processors that combined their CPUs with custom logic for customers. That apparently was a huge business. Intel started doing it, too, IIRC. There's an overlap between those buying high-end CPUs and semi-custom parts who might appreciate reconfigurable logic on-chip. It's a nice alternative, given that you get a huge speed boost without the capital investment of ASICs and can keep tweaking it to optimize or fix mistakes.


I have basic knowledge of FPGA structure and how HDLs work. Does anyone have a link on the limitations of FPGA-implemented processing vs. traditional CPU/GPU architectures?

I get the sense there's a hole in my knowledge as to exactly what kinds of limits the structure of FPGAs places on the end result. And more importantly, why.


Since no experts replied, I'll make an attempt. An FPGA is an array of configurable logic blocks and routing hardware that simulate other logic. All digital hardware reduces down to primitive logic, so FPGAs can implement just about any digital design. The real limitations are performance, energy use, and cost.
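
To make "configurable logic blocks simulate other logic" concrete, here's a rough C model of a single 4-input LUT, the basic building block in most FPGAs. The function name and mask values are just illustrative, not from any vendor's docs:

    #include <stdint.h>
    #include <stdio.h>

    /* A 4-input LUT is just a 16-entry truth table: the configuration
       bitstream loads the 16-bit mask, and the four inputs select one bit.
       Real LUTs are SRAM cells feeding a mux tree; this only models the logic. */
    static int lut4(uint16_t mask, int a, int b, int c, int d) {
        int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
        return (mask >> index) & 1;
    }

    int main(void) {
        uint16_t and4 = 0x8000;  /* only index 15 (all inputs high) returns 1: a 4-input AND */
        printf("%d %d\n", lut4(and4, 1, 1, 1, 1), lut4(and4, 1, 0, 1, 1));  /* prints: 1 0 */
        return 0;
    }

Everything else in the fabric (routing, carry chains, block RAM) exists to wire thousands of these together.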

The flexibility of an FPGA, from the routing and configurable logic, takes up a lot of space that otherwise wouldn't be in the system. The blocks themselves are larger and slower than the primitives they simulate. The extra delays from this mean the FPGA can't be as fast as dedicated hardware. If an FPGA simulates a CPU or GPU, the real CPU or GPU will always be faster due to its optimized logic.

The other issue is power. The FPGA has all kinds of circuits that have to be ready to load up a new configuration and simulate something else. Due to that dynamic nature, the active parts also use more energy, with all the extra circuitry. The result is that FPGAs always use more power than a custom chip.

The last one, cost, comes from the business model. No chips outside of recent GPUs have challenged FPGAs without going bankrupt or becoming a tiny niche player. Additionally, the EDA tools needed to make use of them are ridiculously hard to build, with the Big Two (Altera and Xilinx) investing a ton to get as good as they are. They also give those tools out cheap to free. So, anyone selling an FPGA at cost is unlikely to compete with the Big Two on tools that make the most of the FPGA. That means anyone using FPGAs will pay high unit prices to line the Big Two's pockets for quite some time.

As far as the structure goes, you have to map the hardware onto it. Hardware is often designed in pieces that connect to each other in a grid, so the mapping isn't terrible. It's just hard to do efficiently.


So I have this vague recollection that Intel had an FPGA division in the early '90s that they spun off. Was that what became Lattice? Sad that the Interwebs get really murky pre-1995.


Good memory. It was Intel's Programmable Logic Devices unit. First FPGA in '92, and it was sold to Altera in 1994 for $50M[0].

The devices were the FLEXlogic line. They only released a few (looks like 4 total[1]). Here's an announcement for one: https://groups.google.com/forum/#!topic/comp.sys.intel/YBUtO...

0. http://www.embedded.com/electronics-blogs/max-unleashed-and-...

1. http://www.intel-vintage.info/timeline19901995.htm


I'm hoping this will lead to improvements in their FPGA development environments.


I'm not holding my breath. Even if they decide to do it, I imagine it would take a solid 5 years to flush all the crap out of the pipeline.


I was thinking the same thing. I have always had a much easier time working with the Xilinx/Mentor workflows, and I would love to see competition in that space. But then I remembered the last time I tried to download my copy of Intel C++: over an hour lost in a maze of broken links, ending with having to open three different support cases. And I stopped holding my breath.


OK now how quickly can FPGAs be adapted to search through Postgres indexes?


Netezza, a data warehouse appliance which was built on PostgreSQL, uses FPGAs as the first step to process data as it is read from disk.


Netezza (acquired by IBM) has been doing this for many years.


I can't see how it would help - this kind of search involves almost no computation and a lot of memory/disk bandwidth.

People need to remember that FPGAs are not a magic bullet, especially not for throughput; they're better used for low-latency hardware interaction and things where you need cycle-deterministic behaviour.

Crypto is a far more interesting potential case.


I beg to disagree.

One needs error correction to fight disk/OS failures and compression to conserve bandwidth. Both tasks are highly specializable and good candidates for FPGA implementation.

Having a tool like an FPGA at your disposal makes you look at problems differently.


And low-power computation.


Only low power consumption in the sense that you get to run what you want because the chip is fully dedicated to a task. A microcontroller could draw less power doing the same task.


Are you sure about that? FPGA MIPS/Watt tends to compare rather badly.


What precisely is an "instruction" in the MIPS/watt rate here, given the FPGA context?


He's right in the sense that an FPGA will use more power than a dedicated chip. The logic elements are fairly large compared to the ones in an Intel CPU, for example. FPGAs are good for when you know you will need to change a design regularly, like prototyping, or when the design will get many updates. If you need it to be faster, or you need more than a few hundred units, going with an MPGA (depending on the use) might be cheaper. These don't allow changes to the design, as it's baked into the chip, but they use the same type of logic as FPGAs and require less power because the logic elements are smaller.


The quickest way is to use what's called a high-level synthesis tool, which converts a high-level version of the algorithm to a hardware description language. Synthagate, Handel-C, Catapult-C, Synflow's C-flow, C-to-Silicon... many tools claim to do it. Best to have someone with a hardware background help, though.
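
For a flavor of what these tools consume, here's a minimal sketch of HLS-style C. The pragma spelling follows Vivado HLS conventions; the dot product itself is purely illustrative:

    /* HLS-style C: the tool turns this loop into a hardware pipeline.
       A fixed trip count and simple array indexing keep it synthesizable. */
    #define N 128

    int dot_product(const int a[N], const int b[N]) {
        int acc = 0;
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1  /* request one multiply-accumulate per clock */
            acc += a[i] * b[i];
        }
        return acc;
    }

A regular C compiler just ignores the pragma, which is part of the appeal: the same source can be tested in software first.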


For those interested in trying such a tool: Xilinx has started to offer a high-level language tool for free, as long as you use their lower-end FPGAs.


Is it Vivado they're giving away? That would be pretty awesome, given that even a Spartan-6 is quite powerful.


Yep, Vivado HLS.


Intel CEO Brian Krzanich: "We will apply Moore's Law to grow today's FPGA business, and we'll invent new products that make amazing experiences of the future possible"

PHB, how you've grown!


Not sure how they think FPGAs are going to reduce their "cloud workload". FPGAs are pretty power hungry (aside from Lattice) and only work well if you have some unique requirements.


Fast cores take exponentially more energy than slow ones, so the solution is to use more slow, simple cores instead. We get more performance per watt that way. On PCs we can use GPUs to do computations in parallel. I guess this is like that, but for servers.


GPUs are often considered "hard to program"; let's see how this goes for FPGAs.


Intel suggests we program their GPUs and FPGAs with OpenCL. So, in theory, programming FPGAs shouldn't be much harder than programming GPUs.
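
For what it's worth, the kernel side looks the same either way. Here's a trivial OpenCL C kernel (a restricted C dialect), purely as an illustration; on a GPU each work-item maps to a thread, while an FPGA OpenCL compiler builds a pipeline in logic from the same source:

    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c,
                       const unsigned int n)
    {
        size_t i = get_global_id(0);  /* which element this work-item handles */
        if (i < n)
            c[i] = a[i] + b[i];
    }

The hard part on FPGAs is still getting the memory interfaces and pipelining right, which the tooling only partially hides.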


At some point we should be able to code these things in (a subset of) C++. Microsoft has made C++ AMP for GPU programming. Herb Sutter has said it, or something similar, has a good chance of being standardised in C++ at some point.

Some chatter along those lines:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p006...


Good news, then. There is no shortage of speedup once we stop treating all memory as equal.


>Fast cores take exponentially more energy than slow ones

What's your source for this/why does this happen?


From my fuzzy memory:

To make a CPU fast you need to shrink it, to the point that the "wires" in the core are so close together that electrons jump from one to another. So there is lots of electrical interference. To overcome this, the voltage needs to be increased. That takes more power and makes the CPU hotter (which requires cooling).
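
The textbook version, for what it's worth: CMOS dynamic power scales with the square of the supply voltage and linearly with frequency, and since higher clocks generally need higher voltage, the combined effect is roughly cubic in frequency:

    P_{dynamic} \approx \alpha C V^2 f, \qquad V \propto f \implies P \propto f^3

(Here \alpha is the switching activity and C the switched capacitance.)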

But don't take my word for it. I did some quick Googling; maybe you can find a better source and explanation:

https://en.wikipedia.org/wiki/Multi-core_processor#Technical...

"For general-purpose processors, much of the motivation for multi-core processors comes from greatly diminished gains in processor performance from increasing the operating frequency. This is due to three primary factors:

- The memory wall; [...]

- The ILP wall; [...]

- The power wall; the trend of consuming exponentially increasing power with each factorial increase of operating frequency. This increase can be mitigated by "shrinking" the processor by using smaller traces for the same logic. The power wall poses manufacturing, system design and deployment problems that have not been justified in the face of the diminished gains in performance due to the memory wall and ILP wall."


It is very well known that the out-of-order execution engine of modern CPU cores has a very high power overhead.

https://en.wikipedia.org/wiki/Out-of-order_execution

https://en.wikipedia.org/wiki/Bonnell_%28microarchitecture%2...


Basic logic: as you cram more and more out-of-order and parallel execution, branch prediction, and other advanced techniques into a core, there are diminishing returns. If there weren't, you could create a single core of unlimited computing power.


I don't think it is exponential. Dynamic losses in a gate arise from a couple of factors (shoot-through, dissipation capacitance), but as far as I know they are a joules-per-switching-event effect, which scales linearly with clock.


FPGAs are excellent for parallelizing IO, so if the application involves lots of IO, an FPGA coprocessor will likely reduce power consumption if the CPU delegates IO intensive operations to the FPGA.


Not if you're sitting around waiting for data stored in some DB or distributed file system. FPGAs don't do much to address that. However, there are certainly cases where you could design very lean systems and win with an FPGA, but my guess is the average developer won't be any better at that than they are on the CPU side at writing tight, high-performance code.


Yup, that's my feeling. We can't even get developers to pay attention to CPU cache lines. I'm very doubtful that your average SW dev could understand how to use FPGAs to their advantage.


Just speed up the database and filesystem with FPGAs ;)




Because they can accelerate arbitrary algorithms anywhere from a percentage to 50x increase, often for a fraction of the energy and clock rate?

I'd guess even more so when the sequential part is on a top-tier CPU and the accelerator is on its NoC.


Where is this magic FPGA that takes an arbitrary algorithm and accelerates it 50x?

FPGAs have a nice niche for certain applications, and they deserve more prominence, better tooling and widespread use, but they are not magic.


FPGAs are usually very good at accelerating one computing algorithm at a time. Not two. Just one. They'll beat processors unless the operations are an obvious fit for processors (for instance, floating point). I/O is actually a bottleneck in FPGAs.

In practice, most applications require executing plenty of algorithms one after the other, depending on input data, with a lot of glue in between (I/O). FPGAs are terrible at that.


50x isn't unreasonable if you can parallelize and/or pipeline a rigidly defined process, cutting out as many memory and cache accesses as possible.

With the right programming model, 50x seems very, very reasonable, not even magical. A typical computer does a decently high amount of context-switching and cache-loading.
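
A purely illustrative back-of-the-envelope, with made-up numbers: a 3 GHz core that spends ~100 cycles per item handles about 30M items/s, while a pipelined FPGA design at 200 MHz producing one result per clock per pipeline, with 8 pipelines, handles 1.6G items/s:

    \frac{200\,\text{MHz} \times 8}{3\,\text{GHz} / 100\,\text{cycles}} = \frac{1.6 \times 10^9}{3 \times 10^7} \approx 53

Whether a given workload actually pipelines that cleanly is, of course, the whole question.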


Yes, of course. Hence why FPGAs are brilliant for stuff like networking equipment: massive parallelism coupled with massive low-level IO.

But it doesn't work that way for arbitrary algorithms.


On the contrary, given a capable compiler and/or language, translating algorithms into hardware sounds extremely feasible to me.


It's easy in theory but hard to do well. Most that work pretty well are commercial. Here's the only OSS one I know of that's still getting significant development:

http://legup.eecg.utoronto.ca/publications.php


I think we are in agreement that it is /feasible/, then?

I imagine an FPGA in every chipset would go a long way towards writing the software to put them to practical use, though that may also be wishful thinking on my part.


It was mainly used for algorithms and prototyping ASICs for most of its existence. The high-level tools advertised to software people likewise did that. These days, they have great I/O options too.


Where is it that you were taught to take part of a quote, turn it into a strawman, and knock that down? My quote said effects range from a percentage (low effect) to 50x. That means it has minimal effect over CPUs on some algorithms and 50x on others. Try to stick with what I said, eh?

Also, some algorithms can do over 100x. The good ones are just more likely to get less than that, so I left it off. I've seen 10-50x in all kinds of papers, though. And those were academics rather than FPGA pros.


I think revelation was specifically referring to the word "arbitrary" in your post. That makes it sound like FPGAs can speed up any given algorithm, which is obviously not the case.


Could be misleading to some. The intent of "arbitrary" was that you could implement any algorithm on them, whereas custom hardware is usually fixed and some programmable parts are more limited. Plus, I think you can speed up any algorithm at least a little by putting it on an FPGA on a more advanced node. The reason is that there's no other processing besides the algorithm, and it's harder to get GP CPUs that optimal at a given process node.

That said, the likes of Intel, AMD, and IBM put so much work into CPUs at the most advanced nodes that some sequential algorithms will likely do better on them. A synthesized, amateur FPGA bitstream just can't compete with custom, professional hard blocks designed for an exact purpose.


Yes, they pretty much are magic, for applications that benefit from them. For example, no general-purpose CPU can capture RF data directly from an ADC at multiple gigabytes per second, decimate it, and break it into 1024+ channels in real time. Any general-purpose CPU that could do that would be hopelessly uncompetitive in the marketplace.

How Intel benefits from owning an FPGA company, I'll confess I don't know. Altera must have some awfully valuable IP that Intel needs.


It's like I said: FPGAs are awesome accelerators for all kinds of things, datacenters are already using SoCs combining cores + hard blocks (see Cavium Octeon III), and Intel can possibly capitalize by adding FPGA tech to server CPUs. On top of that, they need years of investment in EDA to make it usable, which is what Altera already did for them.

With an FPGA on-chip, you can offload intensive computations to semi-custom blocks: I/O processing, compression, crypto, fast transactions, data mining, you name it.


Wild guess: perhaps their OpenCL to FPGA compiler. OpenCL is many times easier to program than VHDL.


>percentage to 50x increase

you can do that today by going from Ruby/JS to C


That's a particularly thoughtless comment.

The kinds of things being discussed here (compute-heavy algorithms) are typically already accelerated by using high-performance native libraries. Think BLAS, or OpenCL or CUDA and the Python bindings in NumPy, etc.

FPGAs aren't usually going to be much use for business-logic-heavy algorithms (which of course are often written in interpreted languages). The performance limits of these are usually set by IO, and in some circumstances FPGAs may help with parts of that (e.g., the aforementioned database indexing).

Realistically, though, increasing use of the new M.2 interface with fast SSDs will make more of a real-life difference.


OTA CPU upgrades? ;)


Two words:

- Neuromorphic.

- Bye bye Xilinx.


Pretty sure IBM and AMD are both partnering with Xilinx; they're not going anywhere any time soon. (Plus they also have more enterprise contracts than Altera does.)


FPGAs really only accelerate parallel workloads; sequential computation is done more easily, and just as well, on a CPU.

The problem with massive parallelism becomes communication costs and spatial routing. Nothing is free.

I'm more excited about commodity chips with hundreds of cores. I'd rather have something that's easier to program, with a faster dev cycle, if I'm going to tackle parallelism.




