AMD EPYC “Rome” Server Processors to Feature 8 to 64 Cores (techquila.co.in)
100 points by areejs on June 20, 2019 | 110 comments



64 cores but 128 threads with simultaneous multithreading. And in a 2-socket configuration, you get 256 threads. That is a beautiful thing.
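
A quick C++ sketch of what the OS would report (illustrative; the 256 figure assumes a dual-socket, 64-core, SMT-enabled system):

    #include <iostream>
    #include <thread>

    int main() {
        // Logical CPUs visible to the OS; on a 2-socket, 64-core
        // Rome system with SMT enabled this would print 256.
        std::cout << std::thread::hardware_concurrency()
                  << " hardware threads\n";
    }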


It's a good time to be a Python developer :-D


It's a very good time to dive into microservices/containers/container orchestration.

Good times.


Heaviest microscope for that nail so far.


Memories of Niagara falls


These threads have a lot more heft than the ones in Sun's T series.


One would hope so, that was 15 years ago.

But Niagara was supremely cool back then. I still have a T1000 on a rack at home (not that I use it anymore).


Still, it was probably the first processor to allow for this kind of thread count (but I may be wrong)


And Xeon Phi...


I had a quad CPU Dell server. Imagine 512 threads :-0


Look at me, I'm the GPU now


The real question for me with all these AMD releases is: what's Intel gonna release? It's surprising to me that Apple didn't announce an AMD-based Mac Pro -- the chip Intel gave them must really be something. Too bad, though, because I'm guessing there could be a lot more of them (eventual Mac Pros) out there if they were using these Rome chips.


There's probably a bit of hardware lock-in with Intel chips. Since they control the hardware, Apple has probably invested quite a bit into hardware-specific optimizations that won't readily port to AMD chipsets.

Also, AMD has only recently become a performance front-runner -- they might not hold on to that for long. In the short term it doesn't make sense to jump ship.

For general computation/ecosystems or for generation/device-specific ventures (e.g. consoles), it makes sense to turn with the tides.


> Since they control the hardware, Apple has probably invested quite a bit into hardware-specific optimizations that won't readily port to AMD chipsets.

It's in Apple's history to be able to move on from hardware. It's also in their history to make the wrong choices, necessitating the move.


My money is on them making at least a partial move to their own ARM chips in the next 5 years; they already have to a degree.


People have been saying that for the last 5 years if not longer.


Intel hasn't manufactured bespoke chips for some years now, which is why all but one mainstream console has been AMD.


Funnily enough the only "bespoke" Intel chip in recent times was Intel's own i7-8705G & i7-8809G used in the Hades Canyon NUC that replaced Intel's HD graphics with an AMD Vega M.


Which was an amazing chip. The Vega M + Intel cores combo delivered the graphics performance that Intel has been promising for years now. Would have loved to see a chip like that power the next few generations of ultrabooks, MBP's, etc.


It's probably not just the chip; Intel provides financial incentives for manufacturers who build products around their chips.


>Intel provides financial incentives for manufacturers who build products around their chips.

Yes, but I think the bigger picture is that Apple needed Intel's iPhone modem for another year. It may not be worth damaging the relationship, not to mention the design was likely finalised some time ago.

I would not be surprised if the sale of the modem business to Apple includes an agreement to keep Intel CPUs in Macs (for the time being).

And I am sure AMD should be aiming EPYC at Apple's datacenter usage, which is huge in itself, rather than the Mac Pro market. Although Apple using AMD in a Mac would be a pretty big statement to the rest of the market.


That usually involves advertising, but Apple has turned down those offers. Not even an "Intel inside" sticker.

Apple's relationship with Intel is pretty rocky now. Intel fumbled the i9, their Xeon chips aren't keeping up with AMD's workstation offerings, and their efforts to build a cellular modem chip utterly failed, leaving Apple at the mercy of Qualcomm for that part.

I'm sure Apple will cut Intel loose as soon as they can. They're probably tired of the bullshit.


The financial incentives can cover other chip lines. Apple sells a ton of laptop CPUs and has some preferred partner status with Intel.


I think it'd really only be surprising if the Mac Pro ignored these chips when they were already available. At this point, though, the Mac Pro is shipping before Rome is available. The timing is just off. Current Xeons are faster than current Epycs on average, so the choice today makes sense, particularly if Apple wants to avoid NUMA in these machines. It wouldn't be unreasonable for a Mac Pro refresh in 2 years to switch to Epyc, though, if current trends continue to hold.


No, Rome is expected to officially launch in early Q3 (it's already shipping to hyperscalers) and the Mac Pro is expected to ship in September.

However, Zen 2 Threadripper would likely be a better fit for a workstation such as Mac Pro as it will (most likely) have higher clock speeds. The current Threadripper lineup doesn't support more than 256 GB RAM, though. I don't know whether the new Threadripper will support RDIMM/LRDIMM to compete with Xeon-W on memory capacity.


The Mac Pro uses the LGA3647 socket, which I assume will give it some headroom for incremental upgrades in the next few years before a major overhaul is due. Intel already offers a 112-thread part for that socket.


No, the 56C/112T Intel Xeon Platinum 9282 is a BGA part, it's not socketed. You can buy this niche product only as part of a server with liquid cooling.

And the successors to Cascade Lake seem to switch to the LGA4189 socket to support 8 memory channels (Cooper Lake and Ice Lake), so I wouldn't expect any upgrades on LGA3647.


You are right about the socket - ark.intel.com doesn't mention the socket for the two top-end parts. All others use the previous one.

The good news is that the next generation Mac Pro will have a ton of extra memory bandwidth.


The bad news is that the next generation Mac Pro will be released in 5+ years from now


At this point, I think it's a lot more likely for Apple to switch to using their own chips, at least at the low end.


Given the JavaScript performance of the shipping units intended for mobile use, which is eclipsing the best of Intel's offerings, it might be possible that the A14 or A15 iteration actually surpasses Intel's chips at x86-64 code as well when run through a Rosetta-like compatibility layer.

What if the high-end chip was ARM? It's not just about raw speed, it's about how much performance you can squeeze out of a particular thermal envelope, or compute per watt. If ARM offers 2x the performance per watt, doesn't matter what Intel's chips do with hypothetically unlimited power.


> it might be possible that the A14 or A15 iteration actually surpasses Intel's chips at x86-64 code as well when run through a Rosetta-like compatibility layer

Extremely unlikely, if only because x86-64's memory model is much stronger than ARM's. Emulating that on ARM would be a performance disaster.

Apple could have the internals of the A14 or A15 implement x86's memory model, but those are non-trivial changes and may have too much impact on their ARM performance to justify it. It seems far more likely we'd just see a MacBook that's straight ARM, with x86 code simply not supported at all. (A sketch of the memory-model difference is below.)

> What if the high-end chip was ARM? It's not just about raw speed, it's about how much performance you can squeeze out of a particular thermal envelope, or compute per watt. If ARM offers 2x the performance per watt, doesn't matter what Intel's chips do with hypothetically unlimited power.

For workstations it's almost entirely about raw speed. The power cost is a rounding error compared to the salary of the person using it that's now spending more time waiting on things and less time getting work done.
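
To make the memory-model point concrete, here's a minimal C++ sketch (illustrative only, not any emulator's actual code): on x86-64 (TSO), ordinary stores and loads already behave like the release/acquire pair below, while ARM needs explicit barrier instructions (stlr/ldar), which is roughly what an emulator must conservatively emit on nearly every memory access.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release); // stlr on ARM; plain mov on x86
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {} // ldar on ARM; plain mov on x86
        assert(data.load(std::memory_order_relaxed) == 42); // guaranteed by release/acquire
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
    }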


What would be awesome is if, instead of a Rosetta-like (or qemu-user-like) compatibility layer, the chip supported multiple ISAs natively.

The only company that has the rights to do both ARMv8 and amd64 is.. AMD :)


That would be nice but NVidia got in legal trouble with x86 patents when they tried to do that with their Project Denver.

https://en.wikipedia.org/wiki/Project_Denver


Most of the original x86_32 patents have now expired, that was sufficient to let Microsoft build the emulation layer for WoA.

It might be more feasible now, except in the last ~5-10 years there's been a push to make many PC apps x86_64 only (e.g. Ubuntu dropping i386 support), so the benefit isn't quite as wide any more.


Do software implementations have to care? qemu implements like every ISA out there..


Presumably so, or Microsoft would have extended the WoA compatibility to x86_64 as well.

From 2017 https://arstechnica.com/information-technology/2017/06/intel...


Apple is very invested in Thunderbolt 3 which AMD chips can't provide.


AMD motherboards can do TB3.

https://www.asrock.com/mb/AMD/X570%20Taichi/#Specification Supports an add-in card (connector). It's coming more generally since Intel dropped the licensing fees for TB. But yeah, only a few AMD boards do it (without hacks).


Can’t is too strong, individual hackers have mixed the two. I’m sure Apple could make it work.


Isn't the newest USB standard just Thunderbolt with a different name?


Yes, USB 4


Apple, like all big vendors, is trapped by the mobile processors. There is still no way around Intel there. Until they are willing to get rid of Intel Chips in their whole lineup, they can't offer AMD chips in their desktop machines. For the same reason, AMD didn't gain dominance in the times the Opteron ran circles around the Intel chips.


Apple had probably been designing that for a long time. It's difficult to change something that custom once you start going down that path.


People also use Macs to produce music, and for that task the most important thing is single-core performance. AMD is not great at this, and it seems like the new line of CPUs will only get on par with Intel, which isn't enough to convince professionals to move.


Ryzen 3000 is fast on single core, and the fixes for Intel's various bugs have slowed Intel down. I believe they are now near parity.


I have all the bug fixes turned off, so I don't suffer from that.


I believe that represents 1% or less of the population.


Intel chips are still significantly better than AMD for many common workloads. If you are running SpecCPU or Cinebench in production then AMD might be right for you; in all other cases I urge you to run your own realistic benchmarks before buying. Intel’s “response” to Rome came out in April. It’s a 28-core chip that costs $12k. The reason Intel doesn’t feel price pressure here is they are still way ahead in performance on real high-end workloads like DBMSs etc.


Rome is Zen 2. You have absolutely no clue whatsoever how it performs. It may still have the weaknesses of Zen 1, but it very well might not. AMD made a bunch of changes, including a complete overhaul of the memory system (no more NUMA).

We'll know for sure when the product is actually out and we have independent benchmarks, but at this point you're just making things up and stating them as facts.

That aside, Epyc was already ahead on real high-end workloads like povray or NAMD ( http://www.ks.uiuc.edu/Research/namd/ ). Epyc also puts up top numbers on compilation performance and OpenSSL. So it already isn't as black & white as you're pretending anyway. MySQL/DBMS is not the only server workload that exists, even though it may be the only workload you specifically care about.


[flagged]


Don’t assume everyone suffers from the same ignorance that you suffer. The buyers in a position to get early samples of Rome are not paying retail prices on Intel.

And those players have already announced AMD is being added to their offerings: https://aws.amazon.com/ec2/amd/


> on real high-end workloads like DBMSs

DBMSs are normally IO constrained or memory constrained, or lack-of-index constrained, or query-plan-gone-mental-from-cardinality-misestimation constrained, or others. It's very unusual IME to find one that is CPU constrained. Bringing up DBs in this context is peculiar, to say the least.

I've literally just lost 20 hours trying to debug a query that intermittently ran many times too slow. It was a memory misconfiguration. Extra CPU is at best a bandage over these kinds of problems, at worst, just wasted.


That used to be true for the original EPYC but might not be true for Rome (outside AVX-512 workloads). IMO it's more due to the continuous Apple + Intel collaboration and supply chain, AMD not being able to supply a large market (they share TSMC with Apple for their CPUs), and the overall unknown situation for AMD (will they stick with what they do now or abort?). If AMD just matches the single-thread performance of the 9900K, there's not much point in switching. And Apple already uses AMD GPUs anyway, even if NVidia is still much better.


Sources?


Does anyone have an idea how many different processor tapeouts are needed to create this line?

It seems reasonable to assume that the 48-core chip is just a 64-core chip with a few defective cores, and that a lower-clocked version is the same chip as a higher-clocked one that did not pass some test.


There are three tapeouts used for the entire desktop, HEDT, and server lineup combined. Everything uses the same 8-core CPU dies, and there are two different I/O dies: a smaller one used for desktop (which doubles as the X570 chipset) and a big one for HEDT/server. There's also a separate 4-core + GPU die used for laptop parts.


This is correct for desktop and server. However, AMD hasn't said anything about the HEDT I/O die yet, as far as I know.

Assuming Zen 2-based Threadripper will still have quad channel memory and 64 PCIe lanes, AMD might go for a medium sized I/O die instead of disabling half of the large server I/O die. Or maybe the Threadripper volume is too low and a separate tapeout is not worth it.


I haven't seen much Zen 2-based Threadripper news since the end of May, so maybe I missed how they are doing those chips -- via the same 8-core chiplets as the rest of Zen 2, or something larger. If their yields are really good, maybe they are stacking 8-core chiplets for Rome and, based on final yields, determining what range will be offered as Zen 2 TR HEDT to split 16-core Zen 2 from Rome. Maybe at 16 cores, Zen 2/Ryzen 9 is HEDT enough to compete?

If they are doing larger chiplets for Rome, like 16 or 32 cores paired up with Infinity Fabric for up to 64 cores, yields could also determine what becomes Zen 2 TR, with the weak chiplets becoming 8-32 core (2 chiplets), single-CPU HEDT material. Or will they have a whole other solution here, like limiting Zen 2 TR to the same socket as previous gens to push upgrades, versus a first-gen TR user jumping to a Ryzen 3850X or low-end Rome?


It's true that they haven't actually confirmed that Threadripper is using the EPYC I/O die, but I would be very surprised if it isn't. The entire idea behind Threadripper was to reuse as much EPYC tech as possible (in particular the physical socket) to keep NRE costs low, because the HEDT market is so small.


There are 9 different independent pieces of silicon on the 64-core chip. It looks like this: https://images.anandtech.com/doci/13561/amd_rome-678_678x452... ( source https://www.anandtech.com/show/13561/amd-previews-epyc-rome-... )

Each of those smaller dies is 8 cpu cores. So 8x8 = 64. The 48-core one could either be 6 full-yield chiplets, or 8 partially defective ones with just 6 cores active. I'd guess it's 8 chiplets with 6 cores each just because that seems like it'd be more balanced, but I don't think we'll know for sure until someone de-lids a shipping one.


It depends on yield. If it's very high, you save money using only 6 chiplets. If it's not as great then using 8 lets you downbin the 64 core parts that were defective into the 48 core slot.

You can't do both in the same SKU because they have different performance, e.g. different amounts of L3 cache & memory bandwidth.


Two tapeouts, one for the IO die and one for the 8-core chiplet. For lower core count versions they can simply leave chiplet slots blank: https://img.purch.com/o/aHR0cDovL21lZGlhLmJlc3RvZm1pY3JvLmNv...


Consumer dies use a smaller IO die, so three tapeouts.

Epyc IO die will probably be re-used on Threadripper with chiplet slots blank though.


2. Epyc is a combination of multiple tiny 7nm 8 core CPU chiplets and a single massive 14nm IO chip.


So at a 225W TDP I'd guess that part is going to be clocked in the 1.2-1.5 GHz range. Definitely a specialist part IMO, because with only 8 channels of memory that's 1 channel per eight cores, which is not a ton of bandwidth (rough math sketched below). So for workloads that largely stay in the (I assume) ample L2/L3 caches, that part will rock. But anything that needs a lot of bandwidth spread across cores (databases come to mind) will probably struggle, where the higher-clocking 24-core or 32-core parts will probably chug along fine. This oddly seems like a case where the 48-core may still be a better buy even for similar workloads, due to higher base clocks.

Just my two cents.
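
A rough sketch of that bandwidth math (DDR4-3200 is my assumption; Rome's officially supported memory speed hasn't been confirmed):

    #include <cstdio>

    int main() {
        const double bytes_per_channel = 3200e6 * 8;  // 3200 MT/s x 8 bytes
        const int channels = 8, cores = 64;
        const double total = bytes_per_channel * channels;
        std::printf("total:    %.1f GB/s per socket\n", total / 1e9); // ~204.8
        std::printf("per core: %.1f GB/s\n", total / 1e9 / cores);    // ~3.2
    }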


It seems like the 64 core part will be able to run at 2.35 GHz, see https://www.anandtech.com/show/13598/amd-64-core-rome-deploy...


Good to know; I was assuming they'd run a bit hotter. That will actually make the memory situation worse in many regards, though.


Giddy! At this rate I will be running a 64 Core processor on my desktop in 2 years!


Well, depending on whether the rumors about the 64-core Threadripper are true, you could do that later this year or maybe early next year.


and still have some I/O operation block your entire GUI/OS :D


More likely a sleep while holding a lock.


elaborate?


I think they're pointing out that many of the bottlenecks that make computers feel slow can't be fixed by throwing more cores at them—multithreading can be tricky
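
For example, a minimal C++ sketch (slow_io is a hypothetical stand-in for any call that can hang, like touching an offline share): extra cores don't fix this; moving the wait off the UI thread does.

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    int slow_io() {
        // Pretend this is a filesystem call stuck on an offline share.
        std::this_thread::sleep_for(std::chrono::seconds(5));
        return 0;
    }

    int main() {
        // Offload the blocking call and poll with a timeout; the "UI"
        // keeps running even though the worker is stuck on I/O.
        auto fut = std::async(std::launch::async, slow_io);
        while (fut.wait_for(std::chrono::milliseconds(500)) !=
               std::future_status::ready) {
            std::cout << "UI still responsive...\n";
        }
        // Note: the bottleneck was I/O, not CPU, so a 128-core chip
        // would not have made the hang any shorter.
    }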


You could have 128 cores, 4TB of memory, multiple top-end NVMe drives and Windows will still lock up when accessing an offline network share.

Good job Microsoft.


You can solve that by ticking a checkbox in `Folder Options` that launches folder windows in a separate process.


I had that issue with OS X 10.11 when disconnecting my Thunderbolt NIC without unmounting cleanly.

Argh


There's already a 64-core Threadripper rumored for the end of this year.

Top-tier Ryzen 9 should reach "only" 32 cores on the 5nm process in 2 years.


Is this conjecture or has AMD stated in a roadmap they plan to have 5nm in 2 years?


They're going to use whatever TSMC uses, and TSMC is committed to 5nm. The process is already being tested.

https://www.tomshardware.com/news/tsmc-5nm-euv-process-node,...


> "only" 32 cores

But it might be 4-way SMP, so 128 threads...


I've seen multiple 4-way SMP rumors in comments in the last few weeks. Was there something hinted at, or a "leak" implying this might happen? I did a brief search but didn't find anything.


Just rumors on /r/AMD, who knows?


You can buy a 64 core dual socket EPYC workstation today for ~$15000. I wonder how much cheaper it'll be 2 years from now.


Oh man, I can't wait to make -j 128...


You'd better have a couple of very fast NVMe SSDs in a striped configuration for I/O to keep up ;)


Unless you use a ramdisk! :)


I'm curious if anyone knows about that image of the AMD cpu - it says on it:

"DIFFUSED IN USA"

What exactly does that mean? Given the next line is "MADE IN CHINA", it would seem like "DIFFUSED" should be "DESIGNED" - or does that word have a new meaning?


IIUC, diffused is where the silicon wafer is created. For 14/12nm, that's the GloFo fab in New York.

Made is where it's attached to the substrate, packaged etc.


Diffusion is a step in manufacturing CPU dies. It generally refers to the following step: http://www.cpushack.com/EtchingWafers.html


Do these processors use NUMA, similar to the high end Intel Xeon chips?


NUMA isn't a feature, it's a design compromise. Ideally every line of memory takes the same amount of time to access. But in large complex designs, you can increase performance for some memory at the price of lower performance for other memory, aka Non Uniform Memory Access.

(These chips do exhibit NUMA)
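
A minimal sketch of what NUMA-aware code looks like, using Linux's libnuma (assumes the library is installed; compile with -lnuma; illustrative of NUMA in general, not Rome-specific):

    #include <numa.h>   // libnuma, Linux only
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::puts("NUMA not supported on this system");
            return 1;
        }
        std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

        // Force an allocation onto node 0. Threads running on node 0
        // get local (fast) access; threads on other nodes pay the
        // interconnect penalty -- the "non uniform" part.
        const size_t len = 1 << 20;
        void* buf = numa_alloc_onnode(len, 0);
        if (buf) numa_free(buf, len);
    }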


I believe these chips have Uniform Memory Access via a shared memory interface on the I/O die. Am I mistaken?

Edit, just confirmed:

>Thanks to this improved design, each chiplet can access the memory with equal latency. The multi-core beasts can support up to 4TB of DDR4 memory per socket.

https://www.tomshardware.com/news/amd-64-core-128-thread-7nm...


They exhibit NUMA because if chiplet0 wants a line of memory that is held by chiplet4, it has to go get it from chiplet4. So the degree of NUMA is improved from the previous generation, but it is still not UMA.


Generally caches aren't considered to be "memory" in this sense or otherwise every multi-core chip would be considered NUMA since they all have private caches. Instead you normally talk about an architecture being NUMA when cores have different access speeds to different parts of main memory, as when you need to get another socket to forward you information from a RAM bank. This is something that the OS generally has to be aware of in scheduling decisions, unlike caches which are automatically managed by hardware.


No memory is held by any chiplet, it's all held by the IO die and chiplets ask the IO die to access memory for it.

So there is no longer "near" and "far". In a sense, it's all "far" now (but hopefully not too far). But it is all uniform now.


The chiplets have caches, which hold copies of memory. If a core has a line in an exclusive state, e.g. locked, other chiplets cannot just get the line from memory, because it might be out of date. So they must go ask whoever holds the line to flush & release.

https://en.wikipedia.org/wiki/MESIF_protocol
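
The cost of that ownership dance is easy to demonstrate with false sharing; a minimal C++ sketch (timings are machine-dependent, try it under `time`):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct Shared { std::atomic<long> a{0}, b{0}; };  // likely one 64B line
    struct Padded {
        alignas(64) std::atomic<long> a{0};           // own cache line
        alignas(64) std::atomic<long> b{0};           // own cache line
    };

    template <typename T> void hammer(T& c) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b++; });
        t1.join(); t2.join();
    }

    int main() {
        Shared s; Padded p;
        hammer(s);  // slow: each increment needs exclusive ownership,
                    // so the line ping-pongs between the two cores
        hammer(p);  // fast: no sharing, each line stays put
        std::printf("%ld %ld %ld %ld\n", s.a.load(), s.b.load(),
                    p.a.load(), p.b.load());
    }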


When you're talking about cache it's NUCA, not NUMA.


What's stopping them from creating 1000 core cpus?


Yields are the main barrier to single pieces of silicon that big. You have a certain chance of getting a defect per square mm of your chip, and as chips get bigger the chance of a bad defect gets higher, so yields go down. Often a defect will occur in a place where you can just disable a core or a bank of cache and still sell the chip, but not always. So yield rates tend to go down as chips get bigger. Also, larger chips make less efficient use of the wafer. (See the toy yield model below.)

Economically, there aren't so many people looking for 1000 cores that it makes sense to put in the NRE to assemble a giant package to put all of that in versus just selling a system that can have multiple sockets. Cooling limits also make spreading out work across multiple sockets a better choice.
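
A toy model of the yield argument (made-up numbers, assuming Poisson-distributed defects; not AMD/TSMC figures):

    #include <cmath>
    #include <cstdio>

    int main() {
        // P(die is defect-free) = exp(-defect_density * die_area)
        const double d0 = 0.001;            // defects per mm^2 (hypothetical)
        const double chiplet_mm2 = 75.0;    // small 7nm CPU die
        const double monolith_mm2 = 600.0;  // hypothetical monolithic design

        std::printf("chiplet:  %.1f%% good\n", 100 * std::exp(-d0 * chiplet_mm2));   // ~92.8%
        std::printf("monolith: %.1f%% good\n", 100 * std::exp(-d0 * monolith_mm2));  // ~54.9%
        // A defect scraps (or down-bins) one 75 mm^2 chiplet instead of
        // risking a whole 600 mm^2 die, and good chiplets can be mixed
        // freely into full packages.
    }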


How do GPU cores differ in that they have thousands of cores?


As wmf said, what NVidia calls a "core" isn't something that can issue its own instructions, so it isn't really something you'd normally consider a core, though it does have a PC and can compare its PC against the PC of broadcast instructions to decide whether it should execute, so it's a bit more sophisticated than a simple SIMD vector lane. Maybe on par with an execution port?

What's more equivalent to a CPU core would be what NVidia calls an SM and AMD calls a compute unit. These decide which instructions to issue next and broadcast them to the various lanes. You'll have dozens of them in a typical GPU, about the same as the number of CPU cores in the same silicon area.


They don't; GPUs have <72 real cores and thousands of marketing cores. And they can disable defective cores so their massive dies are still usable.


Is there a market need for 1000 cores running simultaneously that doesn't already need all the other supporting infra you get from a couple of extra "full fat" servers each running 64 cores? Presumably if there is any need, it's currently being served by GPUs.


Heat, energy usage.


Also maybe complications involving latency for communication between cores, and memory coherence protocols.


Memory bandwidth. Top end Xeon Phis have 288 threads. They need high bandwidth memory and the Xeon Phis are limited to only a few gigabytes of HMC.


Looks like this is an account that only posts links to this obscure Indian tech blog.


All blogs presumably start out obscure. This one seems reasonably written and the content is interesting. (Genuinely) What's the problem?


Is there anything inherently wrong with that?


Having a Rome 64 processor doesn't sound very promising.

https://en.wikipedia.org/wiki/Great_Fire_of_Rome



