Hacker News

"Wow?" This is actually disappointingly low raw TFLOPS performance.

Intel's Knights Ferry GPGPU ASIC is not yet available, but it is already outperformed by 2-year-old chips from AMD and Nvidia, both of which have been selling GPU ASICs breaking the 1 TFLOPS barrier (single precision) for over two years now. The AMD Radeon HD 5870 and HD 6970 both reach 2.7 TFLOPS, and AMD makes a dual-ASIC PCIe card, the HD 6990, reaching 5.1 TFLOPS. Nvidia's mid-level GTX 275 (1.01 TFLOPS) was released in April 2009.

In fact, Knights Ferry evolved from the Larrabee GPU project, whose performance disappointed Intel so much that they decided to forgo the GPU market (where it was clearly not going to be competitive) and focus only on the GPGPU market.

The one strong advantage of Knights Ferry is not performance but x86 compatibility, which should in theory make programs easy to port. Although one would still have to rewrite the app to use the LRBni instruction set (512-bit registers) to fully exploit the computing performance... or else be limited to a quarter of its potential with SSE (128-bit registers.)
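Back-of-envelope, the 512-bit vs 128-bit gap works out like this (a sketch: the core count, clock, and FMA assumption below are made up purely to show the ratio, not Intel's published specs):

```python
# Rough peak-FLOPS estimate for a hypothetical 50-core chip.
# All parameters are illustrative assumptions, not Intel figures.
def peak_gflops(cores, ghz, vector_bits, fma=True):
    lanes = vector_bits // 32                     # single-precision lanes per vector
    flops_per_cycle = lanes * (2 if fma else 1)   # an FMA counts as 2 flops
    return cores * ghz * flops_per_cycle

sse = peak_gflops(50, 1.2, 128)    # 128-bit SSE registers
lrbni = peak_gflops(50, 1.2, 512)  # 512-bit LRBni registers
print(lrbni / sse)                 # -> 4.0: SSE code leaves 3/4 of the peak unused
```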

Another relative advantage of Knights Ferry is that its 50+ cores will probably be able to execute 50+ unique instructions every clock cycle (one per core), making it very flexible. (Compare that to, say, the HD 6970, which has 384 "cores" or VLIW units but can only execute 24 unique instructions per clock: the ASIC is organized as 24 SIMD engines of 16 VLIW units each, and the 16 VLIW units in each SIMD engine execute the same instruction in 16 different thread contexts, for a total of 384 threads.)
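The HD 6970 accounting above is just two multiplications, sketched out:

```python
# Instruction-stream accounting for the HD 6970 layout described above.
simd_engines = 24            # independent SIMD engines on the ASIC
vliw_units_per_engine = 16   # VLIW units sharing one instruction stream

unique_instructions_per_clock = simd_engines              # one stream per engine
total_vliw_units = simd_engines * vliw_units_per_engine   # the marketed "cores"

print(unique_instructions_per_clock)  # 24
print(total_vliw_units)               # 384
```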

Edit: my bad, it looks like Intel claims 1 TFLOPS in double precision, which would put it on par with upcoming AMD chips (the HD 7970 is rumored to provide 4.1 SP TFLOPS or 1.0 DP TFLOPS in early 2012.)




A few problems in your comment:

1) The article text is wrong (and doesn't match the pics of the slides). The chip demonstrated today is Knights Corner, which is a new part, not the older Knights Ferry SDV.

2) When counting flops we need to distinguish between single precision flops and double precision flops. Your comparison isn't valid -- Knights Corner was shown sustaining over 1TF on a double precision code. Nvidia's most recent flagship GPU has a theoretical peak of 515GF/s but sustains less than 225GF/s on the same DGEMM operation. Knights Corner is sustaining 4-5x that, which implies its theoretical peak is higher still. AMD's GPUs also cannot touch this with a single chip. Their dual-chip 6990 has what looks like the same theoretical peak but far lower practical performance, due to being more of a graphics part than a compute part (e.g. look at the cache structures).
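Putting the numbers quoted above side by side (treat them as approximate; they come from the discussion, not a spec sheet):

```python
# Sustained vs. theoretical DP DGEMM, figures as quoted in the thread.
fermi_peak_gf = 515       # Nvidia flagship, theoretical DP peak (GF/s)
fermi_sustained_gf = 225  # upper bound on its sustained DGEMM rate
knc_sustained_gf = 1000   # Knights Corner sustained DGEMM demo

print(fermi_sustained_gf / fermi_peak_gf)     # ~0.44 efficiency at best
print(knc_sustained_gf / fermi_sustained_gf)  # ~4.4x sustained advantage
```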

You are correct that these are real cores, each with a wide vector unit. If we wanted the equivalent of GPU "cores" we should multiply out by the vector width per core.


This Intel chip has a theoretical max performance of 1TF/s; actual benchmarks are clearly going to be lower than that. The only thing slightly interesting about this is x86, but considering the large vector unit and anemic cache, you're not going to be able to port high-performance code to this without massive changes anyway. And while comparing new chips vs. existing chips is always a tradeoff, look at the Radeon HD 6970, released Dec 15, 2010, which had 2.7 TFLOPS single precision and 683 GFLOPS double precision: this is a relatively minor jump in single-chip double precision performance, and unless they're releasing it next week it would probably still be far slower than its competitors.

That's also raw performance; considering this is a brand new architecture, it's likely to have some significant bottlenecks limiting its performance for the next 2-3 product cycles.

PS: Considering so few details were provided, it's hard to look at this as anything but Intel saying "Please don't port your code, we will have competitive x86 chips out at some point in time."


It's doing 1TF sustained, and no one ever sustains 100% of their theoretical peak, so we know the peak is higher than 1TF. Consider also the sandbagging of the core count as "50+". How big is the "+"? In reality the design will have a larger number of cores that is then binned by yield and so forth to give a range of SKUs, as usual.

Next, you're comparing a sustained number on DGEMM with theoretical peaks on other machines. Nvidia sustains <225GF on DGEMM with Fermi so this is 4-5x. Last I looked, AMD were sustaining ~500GF with Cayman, so this is 2x, and a much easier machine to sustain perf on for other codes compared to Cayman. If you consider a potentially sandbagged 2x sustained perf to be "relatively minor" then so be it.

There are few public details provided but many of us have been programming with the Knights Ferry SDV kit in preparation for this part. So we have experience with the tools, with the use of lots of similar cache coherent x86 cores, etc. I can tell you this -- it's much easier to work with this than GPUs, and I've written a ton of code on all kinds of whacky machines, production compute code on GPUs included.


I don't see any mention of this chip doing DGEMM at 1TF, just that it's sustaining 1TF performance, and you can write assembler code that gets within 1% of theoretical peak flops if you're not trying to get anything done. If you have a source, feel free to give it. (Not that that even means much; AMD is getting 80% of theoretical max FLOPS on that benchmark, and I assume Intel would pick the optimum benchmark for its chip even if they had to design the chip around the benchmark.)
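For reference, the flop count credited by DGEMM is conventional: multiplying two n×n matrices is counted as 2n³ floating-point operations, so a sustained rate is just that divided by wall time (sketch; the matrix size and timing below are illustrative):

```python
# Conventional DGEMM flop accounting: C = A*B for n x n matrices
# is credited with 2*n^3 floating-point operations.
def dgemm_gflops(n, seconds):
    return 2 * n**3 / seconds / 1e9

# e.g. a hypothetical 10000x10000 DGEMM finishing in 2 seconds:
print(dgemm_gflops(10_000, 2.0))  # -> 1000.0 GF/s, i.e. 1 TF/s sustained
```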

Also, I don't see anything that suggests it's anywhere close to a production chip. More important, Knights Ferry chips may help engineers build the next generation of supercomputing systems, which Intel and its partners hope to deliver by 2018. Not to mention you're comparing a preproduction chip with a year-old chip running on a two-year-old process, when AMD, Intel, and Nvidia are about to do a die shrink.


Sustaining 1TF on DGEMM was explicitly mentioned by Intel in the presentation/briefing.

It's also mentioned in the press release:

http://newsroom.intel.com/community/intel_newsroom/blog/2011...

"The first presentation of the first silicon of “Knights Corner” co-processor showed that Intel architecture is capable of delivering more than 1 TFLOPs of double precision floating point performance (as measured by the Double-precision, General Matrix-Matrix multiplication benchmark -- DGEMM). This was the first demonstration of a single processing chip capable of achieving such a performance level."

Does it mean much? It means something to me, and is a great first step for those of us running compute intensive codes. They really wouldn't get far if they designed the chip only around being able to do this.

As I mentioned elsewhere in the thread, the article text is incorrect. The chip we're discussing is Knights Corner, not Knights Ferry. The latter has been in early user hands for quite some time now and I've spent plenty of time hacking on it. Knights Corner is the new chip that is working its way to production via the usual process, with ship-for-revenue in 2012.

The 2018 target is for an exascale machine, not shipment of initial MIC devices. TACC have already announced they'll be building out a 10 petaflop MIC based system next year to go operational by 2013.

Yes, I'm comparing a chip that has not shipped, but given the perf advantage, the tools and productivity advantage, and the multiyear process advantage Intel is sustaining, this is not a chip to be ignored. Knights Corner is shipping on 22nm. Other vendors have notoriously had difficulty on previous processes, depend on fabs like TSMC who are doing 28nm for them, and will be later to 14nm etc.


Thanks for clearing that up; my google-fu is weak when they use the wrong names.

Still, it looks like they really do design for benchmarks: "Xeon E5 delivers up to 2.1* times more performance in raw FLOPS (Floating Point Operations Per Second as measured by Linpack) and up to 70 percent more performance using real-HPC workloads compared to the previous generation of Intel Xeon 5600 series processors." 110% on the benchmark = 70% in real-world apps.

Granted, if this works out, great. But I have seen Intel blow too many new 'high performance' chips to expect much; still, they might just pull this one off. Unlike, say, the http://en.wikipedia.org/wiki/Itanium etc.

PS: I always look at what Intel gets x86 to do much like how Microsoft could develop software: it's not that the capability is awesome so much as watching a mountain of hacks dance. They have a huge process advantage and can throw piles of money and talent at the problem, but they are stuck with optimizations made when computers were less than 1% as powerful.


We should distinguish between designing for a benchmark and designing for a set of workloads. Everyone chooses representative workloads they care about and evaluates design choices on a variety of metrics from simulating execution of parts of those workloads.

Linpack is a common go-to number because, for all its flaws, it's a widely quoted number, e.g. used in the top500 ranking. It tends to let the CPU crank away without stressing the interconnect, and is widely viewed as an upper bound on perf for the machine. In the E5 case it'll be particularly helped by the move to AVX-enabled cores, and will take more advantage of that than general workloads do. Realistic HPC workloads stress a lot more of the machine beyond the CPU -- interconnect performance in particular.

People like to dump on x86 but it's not that bad. There are plenty of features no one really uses that we still have around, but those features will often end up being microcoded and not gunking up the rest of the core. The big issue is decoder power and performance: x86 decode is complex. On the flip side, the code density is pretty good, and that is important. Secondly, Intel and others have added various improvements that help avoid the downsides, e.g. caching of decode, post-decode loop buffers, uop caches, etc. Plus the new ISA extensions are much kinder.


The problem with x86 is that when you scale the chips to N cores you have N copies of all that dead weight. You might not save many transistors by, say, dropping support for 16-bit floats, relative to how much people would hate you for doing so. However, there are plenty of things you can drop from a GPU or vector processor, and when you start having hundreds of cores it's a real issue.

Still, with enough of a process advantage and enough manpower you can end up with something like the i7 2600, which has a near-useless GPU and a ridiculous pin count and still dominates all competition in its price range.


Is there a cost? Of course. But arguably it's in the noise on these chips. Knights Ferry and Corner use a scalar x86 core derived from the P54C. How many transistors was that? About 3.3 million. By contrast, Nvidia's 16-core Fermi is a 3 billion transistor design. (No, Fermi doesn't have 512 cores; that's a marketing number based on declaring that a SIMD lane is a "cuda core". If we do the same trick with MIC we start doing 50+ cores * 16 wide and claiming 800 cores.)

How can we resolve this dissonance? Easy -- ignoring the fixed-function and graphics-only parts of Fermi, most of the transistors are going into the caches, the floating point units and the interconnect. These are places MIC will also spend billions of transistors, but they're not carrying legacy dead weight from x86 history -- the FPU is 16 wide and by definition must have a new ISA. The cost of the scalar cores will not be remotely dominant.
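A back-of-envelope version of that argument, using the figures above (the 50-core count is the sandbagged lower bound, and Fermi's total is only there for scale):

```python
# Cost of 50+ P54C-derived scalar cores vs. a modern transistor budget.
p54c_transistors = 3.3e6   # one P54C-class scalar core
scalar_cores = 50          # "50+" cores, lower bound
fermi_total = 3.0e9        # Nvidia Fermi total, for scale

scalar_total = p54c_transistors * scalar_cores  # 165 million transistors
print(scalar_total / fermi_total)               # ~0.055: a few percent, i.e. noise
```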

I'm not sure why you are concerned about the pin count on the processor, except perhaps if you are complaining about changing socket designs, which is a different argument. The i7 2600 fits in an LGA 1155 socket (i.e. 1155 pins) whereas Fermi was using a 1981-pin design on the compute SKUs. The Sandy Bridge CPU design is a fine one. The GPU is rapidly improving (e.g. Ivy Bridge should be significantly better, and will be a 1.4 billion transistor design on the same 22nm as Knights Corner).


Where can I learn more about how all this hardware works?


> Although one would still have to rewrite the app to use the LRBni instruction set (512-bit regs) to fully exploit the computing performance...

http://drdobbs.com/architecture-and-design/216402188 ("A First Look at the Larrabee New Instructions")


I guess this is partly about stealing AMD's thunder following their recent release of the 16-core Bulldozer chips. Not that the technologies directly compare, but most non-geeks are just going to see 50 cores >> 16 and buy Intel again next time.


Intel MIC is competing with GPUs not CPUs.


I feel you should reread the second sentence of my comment above.


Non-geeks will not care because it will not be marketed to them or available for them to purchase. This is a High Performance Computing product going against AMD's Firestream or nVidia's Tesla. These are high dollar products for niche tasks that non-geeks won't be doing. Non-geeks will be listening to what the salesman tells them is good, not referencing some blurb their geek friend told them about a 50-core Intel product. This product is the successor to Knights Ferry which was a 32-core part released way-way-way before Bulldozer.



