Wow: Intel unveils 1 teraflop chip with 50-plus cores (nwsource.com)
101 points by jhack on Nov 16, 2011 | 39 comments



"Wow?" This is actually disappointingly low raw TFLOPS performance.

Intel's Knights Ferry GPGPU ASIC is not yet available, but it is already outperformed by 2-year-old chips from AMD and Nvidia, both of whom have been selling GPU ASICs breaking the 1 TFLOPS barrier (single precision) for over two years now. The AMD Radeon HD 5870 and HD 6970 both reach 2.7 TFLOPS, and AMD makes a dual-ASIC PCIe card, the HD 6990, reaching 5.1 TFLOPS. Nvidia's mid-level GTX 275 (1.01 TFLOPS) was released in April 2009.

In fact, Knights Ferry evolved from the Larrabee GPU project, which disappointed Intel so much in terms of performance that they decided to forgo the GPU market (where it was clearly not going to be competitive) and remain focused only on the GPGPU market instead.

The one strong advantage of Knights Ferry is not performance but x86 compatibility, which would theoretically make it easy to port programs to. One would still have to rewrite the app to use the LRBni instruction set (512-bit regs) to fully exploit the compute performance, though... or else be limited to a quarter of its potential with SSE (128-bit regs).
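To make the register-width point concrete, here's a minimal sketch using plain SSE intrinsics; the LRBni side is only described in comments since those intrinsics aren't shown here, and the function name and loop are illustrative, not Intel code:

    /* 128-bit SSE: each vector instruction covers 4 single-precision floats.
       A 512-bit LRBni register would cover 16, so the same loop would need
       roughly 4x fewer vector instructions. */
    #include <xmmintrin.h>

    void add_arrays_sse(float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {               /* n assumed multiple of 4 */
            __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(va, vb));  /* 4 adds per instruction */
        }
        /* A hypothetical 512-bit version would step i by 16 per iteration. */
    }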

Another relative advantage of Knights Ferry is that the 50+ cores will probably be able to execute 50+ unique instructions every clock cycle (one per core), making it very flexible. (Compare, say, the HD 6970, whose 384 "cores" or VLIW units are only able to execute 24 unique instructions: the ASIC is organized as 24 SIMD engines of 16 VLIW units each, and the 16 VLIW units in each SIMD engine execute the same instruction in 16 different thread contexts, for a total of 384 threads.)

Edit: my bad, it looks like Intel claims 1 TFLOPS in double precision, which would put it at the level of upcoming AMD chips (the HD 7970 is rumored to provide 4.1 SP TFLOPS or 1.0 DP TFLOPS in early 2012).


A few problems with your comment:

1) The article text is wrong (and doesn't match the pics of the slides). The chip demonstrated today is Knights Corner, which is a new part, not the older Knights Ferry SDV.

2) When counting flops we need to distinguish between single precision and double precision. Your comparison isn't valid -- Knights Corner was shown sustaining over 1TF on a double precision code. Nvidia's most recent flagship GPU has a theoretical peak of 515GF/s but sustains less than 225GF/s on the same DGEMM operation. Knights Corner is sustaining 4-5x that, and this implies that its theoretical peak is higher still. AMD's GPUs also cannot touch this with a single chip. Their dual-chip 6990 has what looks like the same theoretical peak but far lower practical performance due to being more of a graphics part than a compute part (e.g. look at the cache structures).

You are correct that these are real cores, each with a wide vector unit. If we wanted the equivalent of GPU "cores" we should multiply out by the vector width per core.


This Intel chip has a theoretical max performance of 1TF/s; actual benchmarks are clearly going to be lower than that. The only thing slightly interesting about this is x86, but considering the large vector unit and anemic cache you're not going to be able to port high-performance code to this without massive changes anyway. And while comparing new chips vs existing chips is always a tradeoff, looking at the Radeon HD 6970, released Dec 15, 2010, which had 2.7 TFLOPS single precision and 683 GFLOPS double precision, this is a relatively minor jump in single-chip double precision performance, and unless they're releasing it next week it would probably still be far slower than its competitors.

That's also raw performance; considering this is a brand new architecture it's likely to have some significant bottlenecks limiting its performance for the next 2-3 product cycles.

PS: Considering so few details were provided, it's hard to look at this as anything but Intel saying "Please don't port your code, we will have competitive x86 chips out at some point in time."


It's doing 1TF sustained and no one ever sustains 100% of their theoretical peak, so we know the peak is higher than 1TF. Consider also the sandbagging of the core count as "50+". How big is the "+"? In reality the design will have a larger number that is then binned by yield and so forth to give a range of SKUs, as usual.
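For a rough sense of what that implies, here's a back-of-envelope sketch; the core count, clock, and per-cycle throughput below are assumptions for illustration, not disclosed figures:

    #include <stdio.h>

    int main(void) {
        double cores     = 64;    /* "50+" -- assume a sandbagged 64 */
        double dp_lanes  = 8;     /* 512-bit vector unit / 64-bit doubles */
        double flops_cyc = 2;     /* assume a fused multiply-add per lane */
        double clock_ghz = 1.2;   /* assumed clock, not disclosed */

        double peak_gf      = cores * dp_lanes * flops_cyc * clock_ghz;
        double sustained_gf = 1000.0;   /* ~1 TF shown on DGEMM */
        printf("assumed peak: %.0f GF/s\n", peak_gf);
        printf("implied DGEMM efficiency: %.0f%%\n",
               100.0 * sustained_gf / peak_gf);
        return 0;
    }

Under those assumptions the peak lands around 1.2 TF/s and the demoed 1 TF is roughly 80% efficiency, which is the kind of number a tuned DGEMM usually hits; the point is simply that the peak has to sit comfortably above the sustained figure.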

Next, you're comparing a sustained number on DGEMM with theoretical peaks on other machines. Nvidia sustains <225GF on DGEMM with Fermi, so this is 4-5x. Last I looked, AMD were sustaining ~500GF with Cayman, so this is 2x, and it's a much easier machine to sustain perf on for other codes than Cayman. If you consider a potentially sandbagged 2x sustained perf to be "relatively minor" then so be it.

There are few public details provided but many of us have been programming with the Knights Ferry SDV kit in preparation for this part. So we have experience with the tools, with the use of lots of similar cache coherent x86 cores, etc. I can tell you this -- it's much easier to work with this than GPUs, and I've written a ton of code on all kinds of whacky machines, production compute code on GPUs included.


I don't see any mention of this chip doing DGEMM at 1TF, just that it's sustaining 1TF performance, and you can write assembler code that gets within 1% of theoretical peak flops if you're not trying to get anything done. If you have a source, feel free to give it. (Not that that even means much; AMD's getting 80% of theoretical max FLOPS on that benchmark, and I assume Intel would pick the optimum benchmark for its chip even if they had to design the chip around the benchmark.)

Also, I don't see anything that suggests it's anywhere close to a production chip. More importantly, Knights Ferry chips may help engineers build the next generation of supercomputing systems, which Intel and its partners hope to deliver by 2018. Not to mention you're comparing a preproduction chip with a year-old chip that's running on a 2-year-old process when AMD, Intel, and Nvidia are about to do a die shrink.


Sustaining 1TF on DGEMM was explicitly mentioned by Intel in the presentation/briefing.

It's also mentioned in the press release:

http://newsroom.intel.com/community/intel_newsroom/blog/2011...

"The first presentation of the first silicon of “Knights Corner” co-processor showed that Intel architecture is capable of delivering more than 1 TFLOPs of double precision floating point performance (as measured by the Double-precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was the first demonstration of a single processing chip capable of achieving such a performance level."

Does it mean much? It means something to me, and is a great first step for those of us running compute intensive codes. They really wouldn't get far if they designed the chip only around being able to do this.

As I mentioned elsewhere in the thread, the article text is incorrect. The chip we're discussing is Knights Corner, not Knights Ferry. The latter has been in early user hands for quite some time now and I've spent plenty of time hacking on it. Knights Corner is the new chip that is working its way to production via the usual process, with ship for revenue in 2012.

The 2018 target is for an exascale machine, not shipment of initial MIC devices. TACC have already announced they'll be building out a 10 petaflop MIC-based system next year, to go operational by 2013.

Yes, I'm comparing a chip that has not shipped, but given the perf advantage, given the tools and productivity advantage, given the multiyear process advantage Intel is sustaining, this is not a chip to be ignored. Knights Corner is shipping on 22nm. Other vendors have notoriously had difficulty on previous processes, depend on fabs like TSMC who are doing 28nm for them, and will be later to 14nm etc.


Thanks for clearing that up, my google-fu is weak when they use the wrong names.

Still, it looks like they really do design for benchmarks: "Xeon E5 delivers up to 2.1* times more performance in raw FLOPS (Floating Point Operations Per Second as measured by Linpack) and up to 70 percent more performance using real-HPC workloads compared to the previous generation of Intel Xeon 5600 series processors." 110% on the benchmark = 70% in real-world apps.

Granted, if this works out, great. I have seen Intel blow too many new 'high performance' chips to expect much, but they might just pull this one off. Unlike, say, the http://en.wikipedia.org/wiki/Itanium etc.

PS: I always look at what Intel gets x86 to do much like how Microsoft develops software: it's not that the capability is awesome so much as watching a mountain of hacks dance. They have a huge process advantage and can throw piles of money and talent at the problem, but they are stuck with optimizations made when computers were less than 1% as powerful.


We should distinguish between designing for a benchmark and designing for a set of workloads. Everyone chooses representative workloads they care about and evaluates design choices on a variety of metrics from simulating execution of parts of those workloads.

Linpack is a common go-to number because, for all its flaws, it's a widely quoted number, e.g. used in the top500 ranking. It tends to let the CPU crank away without stressing the interconnect, and is widely viewed as an upper bound on perf for the machine. In the E5 case it'll be particularly helped by the move to AVX-enabled cores, and it takes more advantage of that than general workloads do. Realistic HPC workloads stress a lot more of the machine beyond the CPU -- interconnect performance in particular.

People like to dump on x86 but it's not that bad. There are plenty of features no one really uses that we still carry around, but those features will often end up being microcoded and not gunking up the rest of the core. The big issue is decoder power and performance: x86 decode is complex. On the flip side, the code density is pretty good, and that is important. Secondly, Intel and others have added various improvements that help avoid the downsides, e.g. caching of decode, post-decode loop buffers, uop caches, etc. Plus the new ISA extensions are much kinder.


The problem with x86 is that when you scale the chips to N cores you have N copies of all that dead weight. You might not save many transistors by, say, dropping support for 16-bit floats, relative to how much people would hate you for doing so. However, there are plenty of things you can drop from a GPU or vector processor, and when you start having hundreds of them it's a real issue.

Still, with enough of a process advantage and enough manpower you can end up with something like the i7 2600, which has a near-useless GPU and a ridiculous pin count and still dominates all competition in its price range.


Is there a cost? Of course. But arguably it's in the noise on these chips. Knights Ferry and Corner are using a scalar x86 core derived from the P54C. How many transistors was that? About 3.3 million. By contrast, Nvidia's 16-core Fermi is a 3 billion transistor design. (No, Fermi doesn't have 512 cores; that's a marketing number based on declaring that a SIMD lane is a "cuda core". If we do the same trick with MIC we start doing 50+ cores * 16 wide and claiming 800 cores.)

How can we resolve this dissonance? Easy -- ignoring the fixed-function and graphics-only parts of Fermi, most of the transistors are going to be in the caches, the floating point units and the interconnect. These are places MIC will also spend billions of transistors, but they're not carrying legacy dead weight from x86 history -- the FPU is 16 wide and by definition must have a new ISA. The cost of the scalar cores will not be remotely dominant.
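A quick back-of-envelope for the "in the noise" claim; the 3 billion figure is Fermi's transistor count, used here only as a ballpark for a chip of this class:

    #include <stdio.h>

    int main(void) {
        double core_transistors = 3.3e6;  /* P54C-derived scalar core */
        double cores            = 50;     /* "50+" */
        double chip_budget      = 3.0e9;  /* ~Fermi-class total */

        double scalar_total = core_transistors * cores;
        printf("scalar cores: ~%.0fM transistors, ~%.1f%% of a 3B budget\n",
               scalar_total / 1e6, 100.0 * scalar_total / chip_budget);
        return 0;
    }

That works out to roughly 165M transistors, around 5% of the budget, before you even count the vector units and caches that any design of this class has to pay for.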

I'm not sure why you are concerned about the pin count on the processor, except perhaps if you are complaining about changing socket designs, which is a different argument. The i7 2600 fits in an LGA 1155 socket (i.e. 1155 pins) whereas Fermi was using a 1981-pin design on the compute SKUs. The Sandy Bridge CPU design is a fine one. The GPU is rapidly improving (e.g. Ivy Bridge should be significantly better, and will be a 1.4 billion transistor design in the same 22nm as Knights Corner).


Where can I learn more about how all this hardware works?


> Although one would still have to rewrite the app to use the LRBni instruction set (512-bit regs) to fully exploit the computing performance...

http://drdobbs.com/architecture-and-design/216402188 ("A First Look at the Larrabee New Instructions")


I guess this is partly about stealing AMD's thunder following their recent release of the 16-core Bulldozer chips. Not that the technologies directly compare, but most non-geeks are just going to see 50 cores >> 16 and buy Intel again next time.


Intel MIC is competing with GPUs not CPUs.


I feel you should reread the second sentence of my comment above.


Non-geeks will not care because it will not be marketed to them or available for them to purchase. This is a High Performance Computing product going against AMD's FireStream or nVidia's Tesla. These are high-dollar products for niche tasks that non-geeks won't be doing. Non-geeks will be listening to what the salesman tells them is good, not referencing some blurb their geek friend told them about a 50-core Intel product. This product is the successor to Knights Ferry, which was a 32-core part released way-way-way before Bulldozer.


I see questions about why this is better than a GPU for anything. Two main things:

1. The double-precision floating point performance is a lot better.

2. Unlike GPUs which have baroque memory access restrictions and many performance cliffs, this is a much more familiar SMP architecture with a unified coherent cache hierarchy.
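A minimal sketch of what point 2 buys you in practice: a plain shared-memory parallel loop works as-is on a cache-coherent SMP-style part, with no explicit staging of data into a separate device memory (the function here is illustrative, not vendor code):

    #include <omp.h>

    void scale(double *x, double alpha, long n) {
        #pragma omp parallel for   /* threads share one coherent address space */
        for (long i = 0; i < n; i++)
            x[i] *= alpha;
        /* no cudaMemcpy / clEnqueueWriteBuffer staging step needed here */
    }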


1. It's estimated to come out a year later than AMD's Radeon part, which will boast similar double-precision floating point performance. Both could be delayed, but if Southern Islands comes out on time and Knights Corner gets delayed, AMD or nVidia might have yet another part out by the time KC sees the light of day that offers even higher DPFP performance.

2. I am not sold that modeling the cores after x86/SMP means special care won't be needed to feed the Intel MIC architecture properly. I'd like to see some real world numbers on purchasable hardware.


Heh. Article is nearly incoherent:

> If you're building a new system and want to future-proof it, the Knights Ferry chip uses a double PCI Express slot. Chrysos said the systems are also likely to run alongside a few Xeon processors.


The memory bus must be saying "Great. Another 50 mouths to feed".

You have to design your program very carefully if you don't want the cores to starve.


So it doesn't run the general x64 system architecture? Then how is this different from GPGPU? I thought NVIDIA broke a teraflop on a dual slot a while back (dunno if it was single GPU.) Slot based coprocessors have always been a very niche kind of thing.

Basically, if I can't hook it up to my SSD array and also my GPU, then it's not a "real" computer -- like the reporter was talking about a laptop. And if I can't rent it by the hour from Amazon, then it's not really a good investment (Amazon already has GPU instances.)

Or, you know, maybe this time it will work, when every time before, a co-processor platform has failed...


The AMD Radeon HD 6990 already has over 1 TFLOPS double-precision performance, and there's no problem buying it; Intel is too late.


Yes, there is no problem buying it, but have you ever tried programming in OpenCL? Complexity aside, GPGPU hits a big bottleneck when dealing with large datasets. There just isn't enough memory available on the GPU, and transfers to and from the device are costly.
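A rough illustration of that transfer cost; the bandwidth is an assumed effective PCIe 2.0 x16 rate and the sizes are made up for the sake of the example:

    #include <stdio.h>

    int main(void) {
        double dataset_gb = 100.0;  /* working set larger than card memory */
        double gpu_mem_gb = 2.0;    /* typical card of the era */
        double pcie_gbs   = 6.0;    /* assumed effective bandwidth */

        printf("%.0f chunks to stage in, %.1f s just to stream the data once\n",
               dataset_gb / gpu_mem_gb, dataset_gb / pcie_gbs);
        return 0;
    }

Fifty chunks and roughly 17 seconds per full pass over the data, before any computation happens, and before results come back across the same bus.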


I promise you that if you don't pay an equivalent amount of attention to data availability on these Intel chips you won't see teraflop speed. If there's one thing GPGPU did, it was force idiots to have to think about presenting data to the processor instead of leaving it all over the fucking place and letting the cache "sort it out" (read: "run really slow").

That said. I'll take one =)


There are AMD video cards with 4GB of memory these days, and special cards have up to 16GB, I understand.


16GB isn't adequate for the work I've done on them. Genome assembly (an O(n^2 k^2) process) generally has 100GB of data, each segment of which needs to be compared against each other segment (O(n^2)), and when comparing two segments each datum needs to be compared to each other datum (O(k^2)).

So while you can just transfer 4GB of data to the card at a time, it really doesn't cut it.
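Back-of-envelope for the all-vs-all pattern above, using a naive schedule that pairs every chunk with every other chunk (smarter tiling reuses resident chunks, but the scale of the problem stands); the sizes are the rough figures from the comment:

    #include <stdio.h>

    int main(void) {
        double data_gb  = 100.0;
        double chunk_gb = 4.0;

        double chunks     = data_gb / chunk_gb;             /* 25 */
        double pairs      = chunks * (chunks + 1.0) / 2.0;  /* unordered pairs */
        double traffic_gb = pairs * 2.0 * chunk_gb;         /* two chunks per pair */
        printf("%.0f chunks, %.0f chunk pairs, ~%.0f GB over the bus\n",
               chunks, pairs, traffic_gb);
        return 0;
    }

That's 325 chunk pairs and on the order of 2.6 TB of PCIe traffic for a single all-vs-all pass over 100GB of input.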


I'm sure Intel won't cut it either; PCI Express speed is limited, and RAM wouldn't keep up with raw computation speed. Find a better algorithm.


Far more limited architecture.


By this logic all RISC architectures are very limited, but they can compute. Parallel architectures have to be seriously limited anyway; you just can't access 100GB of data with 1000 parallel processes at the same time.


I'll be very interested to see how this does with raytracing.


This was to be expected. In classic disruptive-innovation fashion, Intel is starting to move upmarket, where the profits are higher, and in a few years they'll be leaving the mobile and notebook/PC market to ARM.


These aren't x86 cores, are they?

I mean 50 atom cores would be downright silly.

50 i3 cores, well then you might have something.


Yes, these are x86 cores. Actually quite a bit like the Pentium (p55c) with a 512-bit vector unit bolted on.


What is the significance of this over standard GPUs that can already do over a teraflop?

See table: http://en.wikipedia.org/wiki/Northern_Islands_(GPU_family)


The AMD 6990 has ~1.37TF double precision, but uses 2 GPU chips to do it, whereas this is a single chip at that perf level.

It is difficult to get good performance out of GPUs for a very wide range of highly parallel programs. Effectively, you are programming a part that is mainly trying to be a graphics part, since that is where the volume is, with just enough compute compromises to try to grow that market. MIC is designed to be a compute processor from the get-go. How about this for a difference: it can boot Linux all on its own! You can ssh into it and run programs. You can even run 'reverse offload' programs that call out to code on the CPU! Try doing any of that with a GPU.

BTW, this MIC chip has a large number of cores (50+), these are real cores, and they're not doing the GPU marketing trick of counting SIMD lanes as "cores". You could multiply 50+ * 16 to get the equivalent number of GPU "cores". Each core is cache coherent, with a decent memory hierarchy designed for compute. There's no graphics tax on here.

I have much more expectation that Intel can leverage their massive process advantage to keep MIC ahead on compute performance each generation. It'll be a relief to have compute parts rather than repurposed GPUs.


x86 instruction set. GPUs are a pain to program and port applications to.

If I recall correctly, this is somewhat of a spinoff of the Larrabee chip - http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)

You can see in the wikipedia page the benefits of Larrabee over traditional GPUs. I believe the new chip was designed to be even more flexible and similar to modern processors.


It's x86, a normal Intel CPU. It can even run Linux. Drastically different from what a GPU is.


mobility.


I should probably get one of these for my laptop.



