It's really a shame how AMD's per-thread performance has been lagging behind Intel's. Maybe it's just their position as an also-ran in a basically monopolized market (desktop CPU manufacture), but I've always seen them as ahead of the curve. They were optimizing per-cycle performance with the Athlon while Intel was still pushing higher clocks. They moved to 64-bit while Intel was still pushing higher clocks. Now that Intel has turned its attention more fully to optimizing per-clock performance (and pulled ahead by a full generation in fab technology), AMD hasn't really had a leg to stand on. Price and low power are OK, but ARM is already strong in those areas, and ARM servers are becoming more likely by the year. AMD needs something more transformative than massively parallel chips and high frequencies.
I know it's a long shot, but could they try to go the way Intel did with Itanium, and invent a new architecture/instruction set specifically designed for server workloads? People have broadly expressed interest in ARM, but the support isn't there right now. If AMD had a new RISC architecture with awesome support and incentives for developers to target it, they might be able to steal share from x64 and ARM chips.
AMD just released an 8-core, 4.0 GHz processor that keeps up somewhat better on single-threaded tasks (while matching or beating a Core i7 on multithreading!), for only $200. http://www.anandtech.com/show/6396/the-vishera-review-amd-fx... "Through frequency and core level improvements, AMD was able to deliver a bit more than the 10 - 15% performance increase [it] promised... Thankfully, Vishera does close the gap by a decent amount and if AMD extends those gains it is on an intercept course with Intel. The bad news is, that intercept wouldn't be in 2013... Steamroller [will be] far more focused on increasing IPC, however without a new process node it'll be difficult to demonstrate another gain in frequency like we see today with Vishera. I suspect the real chance for AMD to approach parity in many of these workloads will be with its 20nm architecture, perhaps based on Excavator in 2014."
Sadly, though, the previous page in that same review shows that the power consumption gap remains wide.
With desktop PCs continuing to lose ground, and small form factors dominating among the desktops that do sell, that's going to be a much bigger problem for AMD going forward than single-threaded performance is.
The Opteron 6366 HE has a base frequency of 1.8 GHz, with 2.3/3.1 GHz turbo.
For the same money you can get a six-core Intel Xeon E5-2630 (2.3/2.8 GHz).
The benchmarks [1] say the Vishera (desktop Piledriver) has about 50% of the Intel performance per core at the same frequency, so unfortunately the 6-core Intel is roughly equivalent to the 16-core AMD.
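As a rough sanity check on that claim, here is a back-of-the-envelope sketch. It assumes aggregate throughput scales linearly with cores × base clock × relative per-core performance, which real workloads won't do exactly:

```python
# Back-of-the-envelope throughput comparison. Assumes aggregate throughput
# scales linearly with cores * base clock * relative per-core performance;
# real workloads won't scale this cleanly, so treat it as a rough check only.
def relative_throughput(cores, base_ghz, per_core_factor):
    return cores * base_ghz * per_core_factor

opteron_6366he = relative_throughput(cores=16, base_ghz=1.8, per_core_factor=0.5)
xeon_e5_2630 = relative_throughput(cores=6, base_ghz=2.3, per_core_factor=1.0)

print(f"16-core Opteron 6366 HE: {opteron_6366he:.1f}")  # ~14.4
print(f" 6-core Xeon E5-2630:    {xeon_e5_2630:.1f}")    # ~13.8
```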
The answer is, of course, "it depends." Heavily threaded floating-point workloads, for example, will see a ton of contention and performance degradation on the Bulldozer/Piledriver Opterons (the 6200 and 6300 series) because each pair of cores shares a single floating-point unit. The 6100 series doesn't have this limitation, but IIRC it still has some performance wonkiness on certain loads.
There are also power concerns with the AMD stuff, which matters a lot more in data centers/"cloud" environments than it necessarily does in a desktop.
A lot of server workloads (mail, web, database, most apps) are pretty light on memory bandwidth, so 16 cores sharing memory bandwidth isn't necessarily an issue.
Cache contention/invalidation is probably the bigger potential issue. I don't have a lot of experience with server VMs and I don't know how smart (or dumb) they are about pinning things to particular CPU cores in order to minimize cache issues.
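For what it's worth, on Linux you can pin a process (or a KVM guest's vCPU threads) to specific host cores yourself; whether that actually helps depends on the workload. A minimal sketch using only the standard library (Linux-only; in practice you'd pass the vCPU thread's TID instead of 0):

```python
import os

# Pin the calling process to a fixed set of host cores so its working set
# stays warm in one set of caches. Linux-only: os.sched_setaffinity is not
# available on every platform. Passing 0 means "the calling process"; for a
# KVM guest you would pass the vCPU thread's TID instead.
cores = {0, 1}  # host cores to restrict execution to

os.sched_setaffinity(0, cores)
print(os.sched_getaffinity(0))  # confirm the new affinity mask
```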
> I just hope AMD doesn't go away or Intel pricing is going to go through the roof (again).
I don't want AMD to go away either, but Intel's pricing won't necessarily go through the roof if AMD goes away.
The x86 CPU market is essentially saturated at this point. You can buy a "fast enough for most stuff" desktop computer for $50 or less at a thrift store, and most of us on Hacker News probably already own 4 or more x86 cores.
At this point, Intel's primary competitors are its own previous CPUs. I own several Core 2 Duos, a Core 2 Quad, and a Core i7 quad.
By all accounts, I am exactly the kind of customer they're going after with their newer CPUs. However, since I'm so happy with my current stuff, Intel is really going to have to outdo themselves (and price it right) before I'm moved to buy another Intel chip.
On the consumer side? I agree with you. But on the server side, when you are paying for power (and can consolidate a bunch of servers into one), the compute power available per watt makes a staggering difference. Yes, I can run my business off 100 servers from 5 years ago... or 25 modern servers, where the modern servers draw about as much power, one for one, as the 5-year-old servers.
Considering that power/cooling/etc. in a datacenter often goes for north of $250 per kW per month, you can bet that we watch performance/watt very closely.
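To put rough numbers on that (a sketch only; the 400 W per-server draw is an assumed round figure, not a measurement):

```python
# Rough consolidation math using the $250/kW/month figure above.
# The 400 W average draw per server is an assumption for illustration.
COST_PER_KW_MONTH = 250.0   # $ per kW per month (power/cooling/space)
WATTS_PER_SERVER = 400      # assumed average draw per box, old or new

def monthly_power_cost(num_servers):
    return num_servers * WATTS_PER_SERVER / 1000.0 * COST_PER_KW_MONTH

old_fleet = monthly_power_cost(100)  # 100 five-year-old servers
new_fleet = monthly_power_cost(25)   # 25 modern servers, similar per-box draw

print(f"old fleet:    ${old_fleet:,.0f}/month")  # $10,000/month
print(f"consolidated: ${new_fleet:,.0f}/month")  # $2,500/month
```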
As someone who has little idea about chip design: is there that much room for benefit purely from instruction set? i.e. if someone were to design the Optimum Instruction Set [and perhaps it'd have to be optimum in only some types of workloads at the expense of others?] and start building chips with it, how much advantage would they really have over, say, x64?
Maybe a good question to go alongside this: if two teams of equal capability with equal access to fabs, patents, etc were to both start completely fresh, one team making the best x64 chip they could and one team making the best ARM chip they could, how much difference would there be in speed and power-consumption between the end products?
There is an excellent talk titled "Things CPU architects need to think about" from Stanford's EE380 class. Go to http://www.stanford.edu/class/ee380/ay0304.html and pick the February 18th entry (click on the old-school icon on the right). It will play with VLC on Linux and Windows Media Player on Windows.
Although the talk is from 9 years ago, the material covered is still very relevant today. It is also quite funny. One of the things talked about is the processing of instructions and chaos theory, including non-intuitive stuff like inserting delays to make things run faster!
It should be noted that x86 processors haven't executed x86 instructions directly since the early nineties. They are translated into RISC-like micro-ops. The translation takes a very small fraction of the die area, and it gets smaller with each generation of chip/process.
The difference between x86 implementations and ARM is that x86 implementations try to get the greatest throughput possible. This is done via techniques like having multiple execution units and executing instructions in parallel where possible (known as instruction-level parallelism, or ILP; typical achieved values are around 2), executing instructions out of order where it doesn't change the results, working on multiple instructions in stages concurrently (pipelining), tracking branches to better predict whether they will be taken, speculatively executing past branches and throwing away the work when the prediction turns out to be wrong, complex memory machinery to keep code and data flowing, high clock speeds for the die as a whole and even higher ones for parts of it when the rest is idle, and the list goes on. None of this is a requirement of x86, but it is what most implementations do. Intel goes very far down this road, AMD not quite so far, and some implementations like Atom do barely any of it.
ARM processors generally do none of that. It keeps them smaller and simpler, which means lower performance and less power.
For your final paragraph, the instruction set is largely irrelevant. While x86 does have some warts, ARM does too (eg condition codes). The thing you left out is compilers as they generate the code to be executed. Roughly speaking the answer is the winner will be whoever has the better compilers. BTW Moore's law predicts transistor doubling per area every 18 months which most paraphrase as performance will double every 18 months. Someone did a study on compilers and found that compilers double performance about every 18 years!
Because ARM execution has been so simple for so long (e.g. no concurrent execution of instructions), the compilers haven't mattered that much. In a maximum-performance world they matter a lot more, especially for instruction scheduling. And of course most programs would have been compiled a while ago, and probably use conservative optimisations (more aggressive ones can introduce bugs). With Itanium, Intel had the idea of making the chip 6-way parallel and letting the compiler figure out how to use that (i.e. smart compiler, dumb/simple chip). It didn't really work.
Indeed, although not to quite the same degree as x86 implementations. They are also dual issue so you'll only get a maximum ILP of 2 although that likely won't hurt that much. I was going to include a bibliography but the post was long enough!
Cortex-A15 and Apple A6 are the high-end now. Cortex-A9's are so small and (relatively) cheap that they throw 4 of them on $20 tablet SoCs even for applications that don't really benefit from many cores.
RISC vs CISC doesn't matter any more. I wrote an article (once featured on HN, when ARM first began making waves) that is basically "a crash course on CPU architecture for dummies" on my blog: http://neosmart.net/blog/2010/the-arm-the-ppc-the-x86-and-th...
The tl;dr of it is that today, the RISC vs CISC or ARM vs x86 debate doesn't really matter - that's just a question of what "APIs" (if you will) are exposed from the CPU. Internally, they're all converging on working the same way.
The thing is, CISC vs. RISC seems to be an ideological debate more than anything else. The only reason I suggested designing a RISC CPU is because a) ARM has made RISC for servers sexy again, and b) Intel already tried a CISC chip with Itanium, and another would scare vendors pretty badly.
The big difficulty I see is designing an instruction set, architecture and compiler which can show serious performance gains for workloads that matter. Making it easy for developers to come over from x64 while juicing the performance per clock would be a big win.
P.S. I'm not a CPU expert by any means. My understanding is that x86_64 has a big legacy overhead from implementing backwards compatibility with x86, and a goal of Itanium was to reduce the amount of the die dedicated to backwards compatibility. If anyone wants to correct me, feel free.
I don't think Itanium was CISC, they call it "EPIC" (explicitly parallel instruction computing: http://en.wikipedia.org/wiki/Explicitly_parallel_instruction...). The basic idea of this was to move logic about what could be parallelized from the CPU to the compiler, but this turned out not to work so well. In Donald Knuth's words, "The Itanium approach...was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write."
> how much difference would there be in speed and power-consumption between the end products?
Probably not much right away, but over time, you'd see a real difference. One key issue is that x86 decode logic takes a substantial amount of engineering time to design in each generation. My spouse is a design engineer at a CPU maker, and one thing I've learned is that there's a lot less automation than you'd think. Each new generation requires a bunch of dedicated engineering resources to make the extra-painful instruction decode work on the new process with new performance constraints, etc. Decode is not something you can design once and then just reuse indefinitely; you take the NRE hit on every design cycle.
Actually I would think that instruction decoding is one of the few things you can reuse... that pipeline stage is typically not the performance limiter, and the instruction set only makes small changes each generation. When I was a chip designer that block usually got assigned to the least experienced team :)
Note that actual CPU designers spend a significant amount of time doing routing and other tasks that you might think would be completely automated. So while you might have a block of verilog code that specifies the RTL for your decode unit, and you can certainly reuse that, each new chip will require you to redo routing at different levels which will require lots of engineering time.
Spending time on that might be fine if x86 ISA was getting you a significant performance advantage, but since it is not, the extra NRE you blow on physical optimization of decode logic is just wasted effort that could be better spent elsewhere.
The biggest advantages ARM has over Intel are all probably memory-related. x86 will tend to have more memory operations for doing a given thing, and with x86's stricter memory model it is harder to design a system with lots of outstanding memory operations. ARM's simpler decoding is also an advantage from a power consumption standpoint, if nothing else. These aren't huge, though, and I'd give ARM maybe a 15% advantage or so? But remember that all things aren't equal: Intel has a big lead in fabrication techniques over everyone else, and lots of huge teams of chip architects.
I think a bigger issue, though, is the high clock speeds forced on the industry for marketing reasons. Current pipelines are way, way too deep, but they can never come down because that would decrease clock rates, and lots of people have been trained into believing that clock speeds indicate performance. If Intel and AMD could convince the world to ignore clock rates, we'd get chips with slightly higher throughput, much lower power consumption, and slightly faster design cycles.
This was the case around the turn of the millennium, back when you had Pentium 4s with over 30 stages in their pipelines. But those weren't really as good as AMD's less-pipelined Athlon chips, and eventually Intel shifted from its Pentium 4 lineage to the much more reasonable architecture derived from the Pentium 3, through the Pentium M, and then the Core architecture. These had around 15 pipeline stages, and despite their slower clocks the Core processors could outperform the Pentium 4s. So the changes that you're saying could never happen did happen, and nearly a decade ago.
I think you're being downvoted because your premise is flawed; we're at an OK place with respect to pipe depth versus clock speed. As early as the Athlon line people began to realize that clock doesn't directly map to speed, and recently Intel's move to the i3, i5, i7 models really pushed consumers away from a clock-based definition of speed.
Anecdotally, I have an elderly neighbour who bought an i7 laptop, not because it was the higher number, but because her friend had told her that it was higher quality. Inadvertently, this tech-illiterate person had inferred that an i3 was somehow going to fail sooner or produce inferior results, because of the marketing Intel had performed. This kind of branding is far more powerful than GHz nowadays, and it's a story that Intel more or less gets to make up. The only people technical enough to bother looking for a clock speed now probably understand all of the marketing jargon, more or less.
To return to the original point: Intel's marketing isn't dictated by what people want; Intel's marketing dictates what people want. They aren't trying to make higher-clocked chips to convince people they're better. They have a dominant market position.
Every CPU designer I've spoken with thinks that pipelines are way too deep and clocks are way too fast, especially given memory bandwidth and latencies. So I don't see how we could be at an "OK place". Perhaps I've only spoken with ignorant CPU designers at Intel and AMD?
Obviously, technically sophisticated people understand that clock speed is not the sole determinant of performance, but a lot of people making purchasing or marketing decisions aren't that sophisticated.
I don't mean to disparage your sources, but if everyone in engineering at the two biggest desktop CPU designers thinks pipelines are too deep in current designs, we should be seeing a big change. I understand bureaucracy well enough, but I think if everyone at AMD knew how to magically improve their current design, they would have done it by now. At this point, marketing wouldn't hold them back from assuming a dominant position technically, especially not in the server space.
I've probably gotten in past my depth at this point, but it seems like the bigger complaint is memory latency and bandwidth. My understanding was that this was part of the move to on-die memory controllers and increasing levels of L2 and L3 cache.
> if everyone in engineering at the two biggest desktop CPU designers thinks pipelines are too deep in current designs, we should be seeing a big change
No, not necessarily. If one company puts out a lower clocked product with equivalent performance but lower power, the other company will be able to crush them in marketing and sales. No system integrator wants to try and sell a 1GHz product to the public. No one wants to convince retailer marketers that a system clocked at half the speed of their competitors is actually just as fast.
Take a look at laptop ads and ask yourself why they mention clock speed at all. That number isn't really comparable across different product lines or generations within the same product line. But people use it as a proxy for performance, so the ads keep including it.
The graveyards are full of companies that put out better technology products than their competitors.
> No system integrator wants to try and sell a 1GHz product to the public.
AMD has pushed lots of low-power parts for mobile, for integrated systems, etc. If AMD had the resources to produce a low-power, slower-clocked chip, why was Turion such a dog? AMD hasn't actually done very well in the mobile space historically, when they could have produced low-power, Ultrabook-like designs. Hell, they could've made something like the new Chromebook and had no fans. Either there's a severe lack of vision, or this is actually much more difficult to implement with the x86 instruction set than you're letting on.
> But people use it as a proxy for performance, so the ads keep including it.
People generally don't care about clock speeds at this point, and I don't know if they ever really did. I worked retail about 6 years ago, and customers had no clue about clock speeds. Frankly, they were mostly worried about hard drives and screen size.
> No one wants to convince retailer marketers that a system clocked at half the speed of their competitors is actually just as fast.
Retail is the tip of the iceberg. HPC is a big market, commodity servers are a huge market. Halving your power consumption in those areas would be massive, and would give AMD a real cash injection. But there's no silver bullet there. You may be slightly right, but you're massively overstating the benefits compared to the costs of implementing it.
> If AMD had the resources to produce a low power, slower clocked chip, why was Turion such a dog?
Because not even Intel can maintain two different architectures at once and stay competitive. AMD would surely go bankrupt before they could complete a major architecture re-design.
"No one wants to convince retailer marketers that a system clocked at half the speed of their competitors is actually just as fast"
They don't have to. Reviewers would shout from the rooftops that your new laptop/tablet does not feel hot when holding it and lasts significantly longer on a battery charge. Then, you market your devices by quoting the reviewers.
Even on a desktop, a cooler CPU has advantages. Put it in a smaller, quieter box, and advertise that.
I didn't downvote you, but there is research from Intel, IBM, and others showing that very deep pipelines (they disagree on the exact numbers) are indeed optimal if power isn't a concern (which it wasn't until ca. 2003). http://web.archive.org/web/20021211202017/http://systems.cs....
>they can never come down because that would decrease clock rates and lots of people have been trained into believing that clock speeds indicate performance
is a bold statement that is not likely to be true. Casual metrics of performance change, and I doubt many people are confused about whether to choose a 4 GHz Pentium 4 over a lower-clocked Core 2, much less something like a Xeon E5.
Perhaps people are not confused simply because those older processors are no longer sold. AMD certainly felt the need to fudge their frequency back in the K7 days (although megahurtz marketing may be specific to the desktop market).
I have heard about this, but AMD really needs to throw their weight behind getting important software stacks onto ARM. If they can't convince big players to ship ARM builds of software, they're not going to get the kind of market share they need.
Linux already runs on ARM, and there are multiple distros that support it. Most of the web runs on Linux, Apache, MySQL, and PHP/Python/Ruby/Perl, all of which run fine on ARM. I'd say that's a sizable server market that would already be happy on an ARM server. I would be. We run nothing on our servers that doesn't already compile on ARM.
Then again, we just switched to an entirely virtualized infrastructure running under KVM a few months ago, and ARM doesn't really have much in the way of hypervisor support. So, I guess there is some part of the software stack that ARM won't work for.
There's KVM paravirtualization support for ARM, and the Cortex-A15 supports hardware virtualization (and probably all the 64-bit ARM processors will, too).
It's fine for people who administer their whole stack themselves, but companies like Red Hat don't seem to be on board yet (their website only mentions embedded applications as far as I can see). Compiling open source software against a platform is the easy part; most places don't want to deploy their own build of Apache. Intel has also spent considerable resources contributing patches and optimizing compilers for their platform. If AMD wants to move large clients off x64, they can't just say 'Well, look, Perl compiles'. They need consulting shops onboard, they need large, well-tested repositories of Linux packages.
Debian, Ubuntu, and several other distros have ARM builds. It's not a matter of saying, "Oh, you can compile it yourself." It's just a matter of saying, "Use one of these extremely popular distributions."
Fedora is available for ARM, which means RHEL and CentOS almost certainly aren't far behind. There's also a port of RHEL to ARM called Red Sleeve, which means most of the hard work of a port has already been done. Given that the potential cost savings are pretty big in a large data center, I suspect we'll start seeing deployments pretty soon, and the pressure on Red Hat to provide ARM will grow.
I'm not really arguing with you, per se. There are plenty of people who won't make the jump until Red Hat does. But, there are plenty of people who take their cues from other providers.
Gentoo has always been aggressively multiplatform. I know there are other distros which run on ARM as well, but I don't know how clean their solutions are.
Anyway, hardware availability is an important prerequisite. Sure everything mobile uses ARM, so the Linux kernel is in good shape. But once there's cheap server hardware available, more work will be done on the distros.
Can anyone with experience explain why the frequency of processors has basically stalled since hitting 3.x GHz? I assume there is some kind of physical barrier, but I'm not sure what it is.
Temperatures, mainly. CPUs were approaching 4 GHz a decade ago, but the power consumption and heat output made higher speeds (and even those speeds) impractical. Multicore at those speeds would (again) have been impractical power-wise, so it made more sense to drop the speeds and have 2-4 cores until process improvements allowed clock speeds and core counts to both be higher.
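The usual first-order model is that dynamic power scales with C·V²·f, and pushing the clock higher generally also means raising the voltage, so power grows much faster than linearly with frequency. A sketch with made-up but representative numbers:

```python
# First-order dynamic power model: P ~ C * V^2 * f.
# The capacitance and voltage figures are illustrative assumptions only.
def dynamic_power_watts(switched_cap_nf, voltage, freq_ghz):
    return switched_cap_nf * 1e-9 * voltage ** 2 * freq_ghz * 1e9

base = dynamic_power_watts(switched_cap_nf=30, voltage=1.10, freq_ghz=3.0)
pushed = dynamic_power_watts(switched_cap_nf=30, voltage=1.35, freq_ghz=4.0)

print(f"3.0 GHz @ 1.10 V: {base:.0f} W")    # ~109 W
print(f"4.0 GHz @ 1.35 V: {pushed:.0f} W")  # ~219 W, roughly 2x power for 1.33x clock
```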
In addition to power concerns of many fast cores on one chip, server chips (the multicore champions) tend to be larger dies, and place higher value on MTBF. I believe both of these factors work against super high frequency server offerings.
Right, and that makes sense. Larger dies are primarily a cost issue; they reduce yield, which increases cost per functional die. More development effort spent on MTBF is a resource issue; work spent improving MTBF is work not spent improving operating frequency, but with enough resources (money) it doesn't have to be a problem.
I could be wrong on this, but I believe IBM's AIX line is not cheap, if you see where I'm going with this.
You're definitely going in the right direction; Power5/6/7 hardware is very pricey. I'm amazed that in this day and age, AIX still has a place in many companies, and slightly perturbed that I have to maintain it. I'd much rather have the applications running under RHEL.
Besides the physical limits, for practical reasons clock speed shouldn't be the only thing one tries to optimize, and a faster clock speed doesn't imply a faster processor. http://en.wikipedia.org/wiki/Megahertz_myth More to the point, I wouldn't be surprised if there's already an Intel processor on the market with overall better performance than this new thing while having fewer cores and/or GHz.
They tend to overheat past that point in normal computers. Enthusiasts can get production chips running at up to 6 GHz, but at that point they're having to use things like liquid-nitrogen pumps. A passive radiator with an air fan can only conduct away so much heat. My newest PC (a Core i7 or something, I forget exactly what now) has a heatsink that's bigger than my fist.
I don't think the reason to use LN2 (liquid nitrogen) in extreme overclocking is the added heat generation. I bet water cooling (which is very "mainstream" compared to the more extreme solutions) could easily handle the extra heat. AFAIK you actually need to reach very low temperatures to get the silicon to work properly at those high speeds (I guess it has something to do with thermal noise and conductivity, but that's just guessing).
And that is, IMHO, a sign that it's not just a power consumption / clock speed tradeoff, but that there are actual limits on how high you can go at room temperature.
I suspect that even just looking at single-core performance it would be hard to make a 10GHz CPU that approaches what we're doing with 3GHz nowadays.
Take the Pentium 4, which used the architecture Intel was working on back when they were predicting they'd have 10 GHz CPUs by now. Passmark's score for a 3 GHz Pentium 4 is 384.
An i7-3940XM runs at 3 GHz and gets a score of 10,490. That's a quad-core chip, so I'm going to go ahead and be sloppy by dividing by 4 to get a score of 2,622.5 per core.
In other words, a single i7 core is doing almost seven times as much work per clock cycle. And assuming they only ramped up the clock rate, the P4 would have to be running at over 20GHz to match what the i7 is doing on a per-core basis.
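Spelling out that (admittedly sloppy) arithmetic:

```python
# Rough per-core comparison from the Passmark numbers above. Passmark isn't a
# precise per-clock metric, so this is only an order-of-magnitude argument.
p4_score = 384       # 3.0 GHz Pentium 4, single core
i7_score = 10_490    # i7-3940XM, 4 cores (ignoring Hyper-Threading)

i7_per_core = i7_score / 4                  # ~2622.5
per_core_speedup = i7_per_core / p4_score   # ~6.8x
p4_equivalent_ghz = 3.0 * per_core_speedup  # ~20.5 GHz

print(f"per-core speedup: {per_core_speedup:.1f}x")
print(f"P4 clock needed to match one i7 core: {p4_equivalent_ghz:.1f} GHz")
```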
Given the heat dissipation problems that come into play with high clock rates, that seems like a doubtful proposition. In hindsight, it looks like Intel definitely made the right choice by dumping the NetBurst plan and instead letting clock rates stagnate (they've even retracted a bit from the high water mark) while designing the processor core to do a lot more with each cycle.
> P4 would have to be running at over 20GHz to match what the i7 is doing on a per-core basis.
Seems to me that memory getting 10x more bandwidth and caches getting 8x larger probably account for the majority of the performance difference between these processors, not instructions per cycle, which I think has gone up by more like 2x.
So a Pentium 4 with current memory and cache would need to run at more like 8 GHz if it scaled linearly like that (the P4 handled ~3 instructions per cycle, the i7 ~7-8).
Unfortunately with clock speeds, we wind up bumping up against fundamental laws of physics.
At 10 GHz, something moving at the speed of light can only go around 3 centimeters per clock cycle. Take out gate delays, and you wind up with something that just isn't practical.
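For concreteness (using the vacuum speed of light; signals on real wires are slower still):

```python
# Distance a signal at the speed of light covers in one clock period.
# On-chip signals propagate well below c, so the real budget is even tighter.
C_M_PER_S = 299_792_458  # speed of light in vacuum, m/s
FREQ_HZ = 10e9           # 10 GHz

cm_per_cycle = C_M_PER_S / FREQ_HZ * 100
print(f"{cm_per_cycle:.1f} cm per clock cycle at 10 GHz")  # ~3.0 cm
```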
Speed-of-light delays don't have anything to do with it, since feature sizes are also shrinking. If we were limited by that, then shrinking transistors from 65nm to 22nm should have let us clock equivalent cores three times as fast, but that didn't happen.
The important things that changed were that smaller cores tend to leak more, which means that voltage had to be scaled down aggressively. And drive currents are now limited by velocity saturation.
IBM and the designers of the POWER6 cores went all the way up to 6 GHz and had them operational. The current POWER7 chips go up to 4.25 GHz and have something like 32 threads per chip.
The T2 processors from Sun had 64 hardware threads per chip (over 100 in a two-socket box). It's a shame that they didn't do better in the marketplace.
T2s were pretty slow though; we tried running Ruby on them and they couldn't do it. Even with Java the performance was pretty mediocre. A dual-core Intel could kill a 120-thread Sun box by a large margin.
Take eight chips. Stack them, insert spacers and bridges from top to bottom, and connect them via BGA or pins or what-have-you on the very top and very bottom chips.
Now turn them sideways, so they rest on the edges. Put the stack in a small ceramic container with copper bottom and top. Fill with a high efficiency thermal transfer fluid, and make sure the convection currents flow properly.
You're right about bumping up against the fundamental laws of physics, but 10 GHz is still on the low side. For single transistors, see HEMTs. They can run an order of magnitude faster than that. Whole chips are something different, but I wouldn't discount it just yet. 10 GHz on silicon, though... that's doubtful.
A decade ago I figured we'd be at 100 cores by now; I didn't count on quantum computing being used instead of adding more cores.
Of course you cannot compare 10 cores in 2013 to 10 cores in 2023; each core is way more efficient with the right code.
> didn't count on using quantum computing instead of adding more cores.
And you probably still shouldn't. Quantum computing only provides a speedup for a very specific set of problems. For the vast majority of everyday computing tasks, quantum computing offers no practical advantage over classical computing.
This is a nice bump on what you can do currently with a standard monolithic server, which I think was 4x12 => 48 cores in the last generation. 64 high-performance cores is pretty useful for a lot of modest-sized big-data computational tasks that would otherwise require you to step up to an entirely different (distributed) architecture.
I'm wondering: where is AMD's counterpart to Intel's Xeon E3-12XX? ECC RAM, AES-NI, a low price point, and hopefully not-too-high idle power usage should make for an attractive CPU.
On a side note: does anyone know what happened to the "Zurich" 32XX Opterons?