I'm not really sure what real need they're trying to fill here. A single Atom core has a TDP of around 4W and performance at about 10% of a mid-range Core 2 Duo, which has a TDP of around 65W (the mobile versions are much more efficient, at 35W or so). Getting 10 Atoms (or 5 if they're N300 series dual-cores) running must take much more infrastructure than a single Core 2, and that infrastructure consumes power as well, so I doubt they're getting more FLOPS/watt or integer ops/watt than a Xeon or Opteron cluster.
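Back-of-envelope with those numbers (my own rough figures, not from the article):

    10 Atom cores x ~4 W  ≈ 40 W of CPU for roughly one C2D's worth of throughput
    1 mid-range C2D       ≈ 65 W (desktop) / ~35 W (mobile)

and that's before counting the extra chipsets, RAM and boards the ten Atom nodes drag along.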
So the intention must be maximising I/O. What sort of workloads are so shared-nothing that they can parallelise to this many non-shared-memory CPUs efficiently? Content Delivery Networks? Seems incredibly niche; niche enough that the CDNs probably have already built their own.
And what exactly is the I/O bottleneck on a Xeon system that a bunch of Atom systems can do better? FSB/Memory throughput maybe? The Nehalems already have a 192-bit, 1333MHz DDR3 memory interface per CPU and gigantic caches, along with I/O that doesn't share data paths with memory accesses.
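For scale, that interface works out to roughly (my arithmetic, assuming it means triple-channel DDR3-1333):

    3 channels x 64 bits x 1333 MT/s ≈ 24 bytes x 1.333 GT/s ≈ 32 GB/s per socket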
Frankly, with the Niagara out there, this seems partly redundant. Then again, turning I/O tasks (essentially chains of closures) into full-blown threads is a pretty bad waste by itself...
Yup, these Atoms are essentially I/O processors, running just enough buffer-cache management, filesystem, and driver code to keep the other components (network and disk) at high utilization.
The benefit of Atoms here is low latency: they have short pipelines (a good fit for the small amount of actual computation that I/O driving requires), and there are presumably more cores available than in the equivalent Xeon system (reducing queuing delays).
You have to count the power consumed by the whole box, not only the CPU, which narrows the gap between C2D and Atom. OTOH, the Atom is not that much slower than a C2D when you are not doing SSE operations. Their box can be a cluster of servers, not a shared-memory monster. 8 boxes with independent SATA, memory and network connections usually perform much better than a single 8-socket box if what you are doing does not involve a dataset shared by all processes.
"OTOH, the Atom is not that much slower than a C2D when you are not doing SSE operations."
It's exactly the opposite. Not counting multiplies, pshufb, and a few other instructions, SSE operations are mostly reasonably performant.
The real problems are that the Atom has:
1) no out-of-order execution, which results in an enormous performance penalty on all code.
2) very high penalties for many "transfers" between operations: for example, the result of an arithmetic op cannot be used for addressing for quite a few clocks. This penalty is humongous when dealing with pointers-to-pointers in managed code, lookup tables, and so forth (see the sketch after this list).
3) a completely un-pipelined multiplication unit that takes 7 cycles, compared to the pipelined 3-cycle Core 2 unit.
4) an enormous number of instructions that were microcoded internally (and thus extraordinarily slow) in order to reduce TDP and chip size.
5) only two integer ALUs, which can only dual-issue in some specific cases.
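To make item 2 concrete, here's a tiny C sketch of my own (not from the article; the constants and names are made up) that does nothing but that kind of dependent lookup-table chain:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical micro-benchmark (illustrative only): in each iteration
       the load address is computed arithmetically from the previous load's
       result, so a core with a multi-cycle arithmetic-to-address penalty
       pays that stall on top of the load latency at every step. */
    #define N 4096

    static uint32_t table[N];

    static uint32_t chase(uint32_t i, long iters)
    {
        while (iters--) {
            /* arithmetic result feeds the next address with no slack */
            i = table[(i * 2654435761u) & (N - 1)];
        }
        return i;
    }

    int main(void)
    {
        for (uint32_t k = 0; k < N; k++)
            table[k] = (k * 7u + 13u) & (N - 1);
        printf("%u\n", (unsigned)chase(1u, 10L * 1000 * 1000));
        return 0;
    }

The inner loop serializes on that arithmetic-to-address dependency every iteration, with no independent work to overlap, which is exactly where the Atom's penalty hurts most.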
Overall, a 1.6GHz Atom is significantly slower than a 1.6GHz Pentium 4. A Core 2 is at least twice as fast per clock per core as a Pentium 4. And a Core i7 is another 40-50% faster per clock per core...
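Compounding those rough factors (my own back-of-envelope, normalizing a Pentium 4 to 1.0 per clock per core):

    Pentium 4    1.0
    Atom        <1.0
    Core 2      ~2.0       (at least 2x a P4)
    Core i7     ~2.8-3.0   (another 40-50% on top of Core 2)

So at the same clock, a Core i7 core comes out at well over 3x an Atom core.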
If you want something that gets out significantly more performance per watt than a Core 2 or Core i7, look at the ARM Cortex A9.
I would prefer an ARM-based server, but it seems far too many people want x86s... Go figure.
As for SSE, I was under the impression (I don't remember where I read it) that the Atom achieved its massive power reduction through very aggressive simplification of the floating-point hardware. Maybe the HT capability somewhat mitigates the lack of out-of-order execution. In my experience, floating point sucks a bit, but most other uses seem fine.
I am writing this on an Atom netbook while a Django app imports a largish volume of data for a couple of tests (it takes two hours with a SQLite database) and Transmission downloads a torrent (I need an alternate Ubuntu install disk). The browser is still responsive as long as I don't try to view videos. My worst complaint with this machine is the really slow hard disk, but considering price, size, weight and battery life, I guess I can't be too picky.
I'm having trouble understanding what the market for this is. It's not going to win on instructions-per-die-area-per-second (NVIDIA and ATI are already ahead of this mark now, with hardware rather cheaper than $100k). And with 512 distinct CPU packages, there's no way the interconnect is going to be faster than the high speed serial stuff we're already using for stuff like SATA, 10G ethernet, Infiniband, etc...
So it's basically a physically smaller supercomputer running low-power CPUs. It probably wins on real estate and power metrics, and probably loses on cost vs. racks of consumer stuff. Is there a market for that? Note that the investment came not from a VC fund, but from the DoE...
I think SeaMicro is targeting commercial workloads, not HPC; that immediately excludes GPUs as competitors. It could be somewhat cheaper and lower power than a Nehalem cluster, but I also don't see anything revolutionary.
My guess would be that the petabyte of storage is the selling point, and the fairly high ratio of CPUs to storage means it can probably act as a high-performance, low-cost SAN or something like that. With 512 CPUs and a petabyte of storage, that's roughly one 2 TB drive per processor. Seems similar to Capricorn Tech's approach: http://www.capricorn-tech.com/
From my very superficial reading of the article it would seem like the win is that you're getting huge amounts of density.
I think their argument is that a typical web-style datacenter throws up hundreds and hundreds of commodity servers to do some sort of distributed processing task (think of something like a Hadoop cluster or a ton of app servers behind a load balancer).
With this setup, their argument may be that you no longer need a giant network/switching infrastructure to do these types of compute tasks, plus you get all the power/cost savings of needing only a couple of boxes to get those thousands of CPUs. I wonder what kind of RAM these things are going to have, as that's obviously pretty crucial.
But yeah, losing one of these things would be more than a minor headache of an outage. It seems like an interesting approach that's worth more thought, though.
"But both of these firms are going against what is currently the biggest trend in corporate data centers: commodity servers. Such boxes aren’t simply a collection of low-power chips — they have to be networked from inside in order to deliver optimal performance for the lowest power consumption"