AMD EPYC “Rome” Server Processors to Feature 8 to 64 Cores (techquila.co.in)
100 points by areejs on June 20, 2019 | 110 comments



64 cores but 128 threads with simultaneous multithreading. And in a 2-socket configuration, you get 256 threads. That is a beautiful thing.
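
A quick C++ sketch of what the OS would report (illustrative; the 256 figure assumes a dual-socket, 64-core, SMT-enabled system):

    #include <iostream>
    #include <thread>

    int main() {
        // Logical CPUs visible to the OS; on a 2-socket, 64-core
        // Rome system with SMT enabled this would print 256.
        std::cout << std::thread::hardware_concurrency()
                  << " hardware threads\n";
    }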


It's a good time to be a Python developer :-D


It's a very good time to dive into microservices/containers/container orchestration.

Good times.


Heaviest microscope for that nail so far.


Memories of Niagara falls


These threads have a lot more heft than the ones in Sun's T series.


One would hope so, that was 15 years ago.

But Niagara was supremely cool back then. I still have a T1000 on a rack at home (not that I use it anymore).


Still, it was probably the first processor to allow for this kind of thread count (but I may be wrong)


And Xeon Phi...


I had a quad CPU Dell server. Imagine 512 threads :-0


Look at me, I'm the GPU now


The real question for me with all these AMD releases is: what's Intel gonna release? It's surprising to me that Apple didn't announce an AMD-based Mac Pro -- the chip Intel gave them must really be something. Too bad, though, because I'm guessing there could be a lot more of them (eventual Mac Pros) out there if they were using these Rome chips.


There's probably a bit of hardware lock-in with Intel chips. Since they control the hardware, Apple has probably invested quite a bit into hardware-specific optimizations that won't readily port to AMD chipsets.

Also, AMD has only recently become a performance front-runner -- they might not hold on to that for long. In the short term it doesn't make sense to jump ship.

For general computation/ecosystems or for generation/device-specific ventures (e.g. consoles), it makes sense to turn with the tides.


> Since they control the hardware, Apple has probably invested quite a bit into hardware-specific optimizations that won't readily port to AMD chipsets.

It's in Apple's history to be able to move on from hardware. It's also in their history to make the wrong choices, necessitating the move.


My money is on them making at least a partial move to their own ARM chips in the next 5 years; they already have to a degree.


People have been saying that for the last 5 years if not longer.


Intel hasn't manufactured bespoke chips for some years now, which is why all but one mainstream console has been AMD.


Funnily enough the only "bespoke" Intel chip in recent times was Intel's own i7-8705G & i7-8809G used in the Hades Canyon NUC that replaced Intel's HD graphics with an AMD Vega M.


Which was an amazing chip. The Vega M + Intel cores combo delivered the graphics performance that Intel has been promising for years now. Would have loved to see a chip like that power the next few generations of ultrabooks, MBP's, etc.


It's probably not just the chip; Intel provides financial incentives for manufacturers who build products around their chips.


>Intel provides financial incentives for manufacturers who build products around their chips.

Yes, but I think the bigger picture is that Apple needed Intel's iPhone modem for another year. It may not be worth damaging the relationship, not to mention the design was likely finalised some time ago.

I would not be surprised if the sale of the modem business to Apple includes an agreement to keep Intel CPUs in Macs (for the time being).

And I am sure AMD should be aiming EPYC at Apple's datacenter usage, which is huge in itself, rather than the Mac Pro market. Although Apple using AMD in a Mac would be a pretty big statement to the rest of the market.


That usually involves advertising, but Apple has turned down those offers. Not even an "Intel inside" sticker.

Apple's relationship with Intel is pretty rocky now. Intel fumbled the i9, their Xeon chips aren't keeping up with AMD's workstation offerings, and their efforts to build a cellular modem chip utterly failed, leaving Apple at the mercy of Qualcomm for that part.

I'm sure Apple will cut Intel loose as soon as they can. They're probably tired of the bullshit.


The financial incentives can cover other chip lines. Apple sells a ton of laptop CPUs and has some preferred partner status with Intel.


I think it'd really only be surprising if the Mac Pro ignored these chips when they were already available. At this point, though, the Mac Pro is shipping before Rome is available. The timing is just off. Current Xeons are faster than current Epycs on average, so the choice today makes sense, particularly if Apple wants to avoid NUMA in these machines. It wouldn't be unreasonable for a Mac Pro refresh in 2 years to switch to Epyc, though, if current trends continue to hold.


No, Rome is expected to officially launch in early Q3 (it's already shipping to hyperscalers) and the Mac Pro is expected to ship in September.

However, Zen 2 Threadripper would likely be a better fit for a workstation such as Mac Pro as it will (most likely) have higher clock speeds. The current Threadripper lineup doesn't support more than 256 GB RAM, though. I don't know whether the new Threadripper will support RDIMM/LRDIMM to compete with Xeon-W on memory capacity.


The Mac Pro uses the LGA3647 socket, which I assume will give it some headroom for incremental upgrades in the next few years before a major overhaul is due. Intel already offers a 112-thread part for that socket.


No, the 56C/112T Intel Xeon Platinum 9282 is a BGA part, it's not socketed. You can buy this niche product only as part of a server with liquid cooling.

And the successors to Cascade Lake seem to switch to the LGA4189 socket to support 8 memory channels (Cooper Lake and Ice Lake), so I wouldn't expect any upgrades on LGA3647.


You are right about the socket - ark.intel.com doesn't mention the socket for the two top-end parts. All others use the previous one.

The good news is that the next generation Mac Pro will have a ton of extra memory bandwidth.


The bad news is that the next generation Mac Pro will be released in 5+ years from now


At this point, I think it's a lot more likely for Apple to switch to using their own chips, at least at the low end.


Given the JavaScript performance of the shipping units intended for mobile use, which is eclipsing the best of Intel's offerings, it might be possible that the A14 or A15 iteration actually surpasses Intel's chips at x86-64 code as well when run through a Rosetta-like compatibility layer.

What if the high-end chip was ARM? It's not just about raw speed, it's about how much performance you can squeeze out of a particular thermal envelope, or compute per watt. If ARM offers 2x the performance per watt, doesn't matter what Intel's chips do with hypothetically unlimited power.


> it might be possible that the A14 or A15 iteration actually surpasses Intel's chips at x86-64 code as well when run through a Rosetta-like compatibility layer

Extremely unlikely, if only because x86-64's memory model is much stronger than ARM's. Emulating that on ARM would be a performance disaster.

Apple could have the internals of the A14 or A15 implement x86's memory model, but those are non-trivial changes and may have too much impact on their ARM performance to justify it. It seems far more likely we'd just see a MacBook that's straight ARM, with x86 code simply not supported at all. (A sketch of the memory-model difference is below.)

> What if the high-end chip was ARM? It's not just about raw speed, it's about how much performance you can squeeze out of a particular thermal envelope, or compute per watt. If ARM offers 2x the performance per watt, doesn't matter what Intel's chips do with hypothetically unlimited power.

For workstations it's almost entirely about raw speed. The power cost is a rounding error compared to the salary of the person using it that's now spending more time waiting on things and less time getting work done.
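
To make the memory-model point concrete, here's a minimal C++ sketch (illustrative only, not any emulator's actual code): on x86-64 (TSO), ordinary stores and loads already behave like the release/acquire pair below, while ARM needs explicit barrier instructions (stlr/ldar), which is roughly what an emulator must conservatively emit on nearly every memory access.

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<int> data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release); // stlr on ARM; plain mov on x86
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {} // ldar on ARM; plain mov on x86
        assert(data.load(std::memory_order_relaxed) == 42); // guaranteed by release/acquire
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
    }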


What would be awesome is if, instead of a Rosetta-like (or qemu-user-like) compatibility layer, the chip supported multiple ISAs natively.

The only company that has the rights to do both ARMv8 and amd64 is.. AMD :)


That would be nice but NVidia got in legal trouble with x86 patents when they tried to do that with their Project Denver.

https://en.wikipedia.org/wiki/Project_Denver


Most of the original x86_32 patents have now expired, that was sufficient to let Microsoft build the emulation layer for WoA.

It might be more feasible now, except in the last ~5-10 years there's been a push to make many PC apps x86_64 only (e.g. Ubuntu dropping i386 support), so the benefit isn't quite as wide any more.


Do software implementations have to care? qemu implements like every ISA out there..


Presumably so, or Microsoft would have extended the WoA compatibility to x86_64 as well.

From 2017 https://arstechnica.com/information-technology/2017/06/intel...


Apple is very invested in Thunderbolt 3 which AMD chips can't provide.


AMD motherboards can do TB3.

https://www.asrock.com/mb/AMD/X570%20Taichi/#Specification Supports an add-in card (connector). It's coming more generally since Intel dropped the licensing fees for TB. But yeah, only a few AMD boards do it (without hacks).


Can’t is too strong, individual hackers have mixed the two. I’m sure Apple could make it work.


Isn't the newest USB standard just Thunderbolt with a different name?


Yes, USB 4


Apple, like all big vendors, is trapped by the mobile processors. There is still no way around Intel there. Until they are willing to get rid of Intel Chips in their whole lineup, they can't offer AMD chips in their desktop machines. For the same reason, AMD didn't gain dominance in the times the Opteron ran circles around the Intel chips.


Apple had probably been designing that for a long time. It's difficult to change something that custom once you start going down that path.


People also use Macs to produce music, and for that task the most important thing is single-core performance. AMD is not great at this, and it seems like the new line of CPUs will only get on par with Intel, which isn't enough to convince professionals to move.


Ryzen 3000 is fast on single core, and the fixes for Intel's various bugs have slowed Intel down. I believe they are now near parity.


I have all the bug fixes turned off, so I don't suffer from that.


I believe that represents 1% or less of the population.


Intel chips are still significantly better than AMD for many common workloads. If you are running SpecCPU or Cinebench in production then AMD might be right for you; in all other cases I urge you to run your own realistic benchmarks before buying. Intel’s “response” to Rome came out in April. It’s a 28-core chip that costs $12k. The reason Intel doesn’t feel price pressure here is they are still way ahead in performance on real high-end workloads like DBMSs etc.


Rome is Zen 2. You have absolutely no clue whatsoever how it performs. It may still have the weaknesses of Zen 1, but it very well might not. AMD made a bunch of changes, including a complete overhaul of the memory system (no more NUMA).

We'll know for sure when the product is actually out and we have independent benchmarks, but at this point you're just making things up and stating them as facts.

That aside, Epyc was already ahead on real high-end workloads like povray or NAMD ( http://www.ks.uiuc.edu/Research/namd/ ). Epyc also puts up top numbers on compilation performance and OpenSSL. So it already isn't as black & white as you're pretending anyway. MySQL/DBMS is not the only server workload that exists, even though it may be the only workload you specifically care about.


[flagged]


Don’t assume everyone suffers from the same ignorance that you suffer. The buyers in a position to get early samples of Rome are not paying retail prices on Intel.

And those players have already announced AMD is being added to their offerings: https://aws.amazon.com/ec2/amd/


> on real high-end workloads like DBMSs

DBMSs are normally IO constrained or memory constrained, or lack-of-index constrained, or query-plan-gone-mental-from-cardinality-misestimation constrained, or others. It's very unusual IME to find one that is CPU constrained. Bringing up DBs in this context is peculiar, to say the least.

I've literally just lost 20 hours trying to debug a query that intermittently ran many times too slow. It was a memory misconfiguration. Extra CPU is at best a bandage over these kinds of problems, at worst, just wasted.


That used to be true for the original EPYC but might not be true for Rome (outside AVX-512 workloads). IMO it's more due to the continuous Apple + Intel collaboration and supply chain, AMD not being able to supply a large market (they share TSMC with Apple for their CPUs), and the overall unknown situation for AMD (will they stick with what they do now or abort?). If AMD just matches the single-thread performance of the 9900K, there's not much point in switching. And Apple already uses AMD GPUs anyway, even if NVidia is still much better.


Sources?


Does anyone have an idea how many different processor tapeouts are needed to create this line?

It seems reasonable to assume that the 48-core chip is just a 64-core chip with a few defective cores, and that a lower-clocked version is the same chip as a higher-clocked one that did not pass some test.


There are three tapeouts used for the entire desktop, HEDT, and server lineup combined. Everything uses the same 8-core CPU dies, and there are two different I/O dies: a smaller one used for desktop (which doubles as the X570 chipset) and a big one for HEDT/server. There's also a separate 4-core + GPU die used for laptop parts.


This is correct for desktop and server. However, AMD hasn't said anything about the HEDT I/O die yet, as far as I know.

Assuming Zen 2-based Threadripper will still have quad channel memory and 64 PCIe lanes, AMD might go for a medium sized I/O die instead of disabling half of the large server I/O die. Or maybe the Threadripper volume is too low and a separate tapeout is not worth it.


I haven't seen much Zen 2-based Threadripper news since the end of May, so maybe I missed how they are doing those chips -- via the same 8-core chiplets as the rest of Zen 2, or something larger. If their yields are really good, maybe they are stacking 8-core chiplets for Rome and, based on final yields, determining what range will be offered as Zen 2 TR HEDT to split 16-core Zen 2 from Rome. Maybe at 16 cores, Zen 2/Ryzen 9 is HEDT enough to compete?

If they are doing larger chiplets for Rome, like 16 or 32 cores paired up with Infinity Fabric for up to 64 cores, yields could also determine what becomes Zen 2 TR, with the weak chiplets becoming 8-32 core (2 chiplets), single-CPU HEDT material. Or will they have a whole other solution here, like limiting Zen 2 TR to the same socket as previous gens to push upgrades, versus a first-gen TR user jumping to a Ryzen 3850X or low-end Rome?


It's true that they haven't actually confirmed that Threadripper is using the EPYC I/O die, but I would be very surprised if it isn't. The entire idea behind Threadripper was to reuse as much EPYC tech as possible (in particular the physical socket) to keep NRE costs low, because the HEDT market is so small.


There are 9 different independent pieces of silicon on the 64-core chip. It looks like this: https://images.anandtech.com/doci/13561/amd_rome-678_678x452... ( source https://www.anandtech.com/show/13561/amd-previews-epyc-rome-... )

Each of those smaller dies is 8 cpu cores. So 8x8 = 64. The 48-core one could either be 6 full-yield chiplets, or 8 partially defective ones with just 6 cores active. I'd guess it's 8 chiplets with 6 cores each just because that seems like it'd be more balanced, but I don't think we'll know for sure until someone de-lids a shipping one.


It depends on yield. If it's very high, you save money using only 6 chiplets. If it's not as great then using 8 lets you downbin the 64 core parts that were defective into the 48 core slot.

You can't do both in the same SKU because they have different performance, e.g. different amounts of L3 cache & memory bandwidth.


Two tapeouts, one for the IO die and one for the 8-core chiplet. For lower core count versions they can simply leave chiplet slots blank: https://img.purch.com/o/aHR0cDovL21lZGlhLmJlc3RvZm1pY3JvLmNv...


Consumer dies use a smaller IO die, so three tapeouts.

Epyc IO die will probably be re-used on Threadripper with chiplet slots blank though.


2. Epyc is a combination of multiple tiny 7nm 8 core CPU chiplets and a single massive 14nm IO chip.


So at a 225W TDP I'd guess that part is going to be clocked in the 1.2-1.5 GHz range. Definitely a specialist part IMO, because with only 8 channels of memory that's 1 channel per eight cores, which is not a ton of bandwidth (rough math sketched below). So for workloads that largely stay in the (I assume) ample L2/L3 caches, that part will rock. But anything that needs a lot of bandwidth spread across cores (databases come to mind) will probably struggle, where the higher-clocking 24-core or 32-core parts will probably chug along fine. This oddly seems like a case where the 48-core may still be a better buy even for similar workloads, due to higher base clocks.

Just my two cents.
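
A rough sketch of that bandwidth math (DDR4-3200 is my assumption; Rome's officially supported memory speed hasn't been confirmed):

    #include <cstdio>

    int main() {
        const double bytes_per_channel = 3200e6 * 8;  // 3200 MT/s x 8 bytes
        const int channels = 8, cores = 64;
        const double total = bytes_per_channel * channels;
        std::printf("total:    %.1f GB/s per socket\n", total / 1e9); // ~204.8
        std::printf("per core: %.1f GB/s\n", total / 1e9 / cores);    // ~3.2
    }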


It seems like the 64 core part will be able to run at 2.35 GHz, see https://www.anandtech.com/show/13598/amd-64-core-rome-deploy...


Good to know; I was assuming they'd run a bit hotter. That will actually make the memory situation worse in many regards, though.


Giddy! At this rate I will be running a 64 Core processor on my desktop in 2 years!


Well, depending on whether the rumors about the 64-core Threadripper are true, you could do that later this year or maybe early next year.


and still have some I/O operation block your entire GUI/OS :D


More likely a sleep while holding a lock.


elaborate?


I think they're pointing out that many of the bottlenecks that make computers feel slow can't be fixed by throwing more cores at them—multithreading can be tricky
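
For example, a minimal C++ sketch (slow_io is a hypothetical stand-in for any call that can hang, like touching an offline share): extra cores don't fix this; moving the wait off the UI thread does.

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    int slow_io() {
        // Pretend this is a filesystem call stuck on an offline share.
        std::this_thread::sleep_for(std::chrono::seconds(5));
        return 0;
    }

    int main() {
        // Offload the blocking call and poll with a timeout; the "UI"
        // keeps running even though the worker is stuck on I/O.
        auto fut = std::async(std::launch::async, slow_io);
        while (fut.wait_for(std::chrono::milliseconds(500)) !=
               std::future_status::ready) {
            std::cout << "UI still responsive...\n";
        }
        // Note: the bottleneck was I/O, not CPU, so a 128-core chip
        // would not have made the hang any shorter.
    }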


You could have 128 cores, 4TB of memory, multiple top-end NVMe drives and Windows will still lock up when accessing an offline network share.

Good job Microsoft.


You can solve that by ticking a checkbox in `Folder Options` that launches folder windows in a separate process.


I had that issue with OS X 10.11 when disconnecting my Thunderbolt NIC without unmounting cleanly.

Argh


There's already a 64-core Threadripper rumored for the end of this year.

Top-tier Ryzen 9 should reach "only" 32 cores on the 5nm process in 2 years.


Is this conjecture or has AMD stated in a roadmap they plan to have 5nm in 2 years?


They're going to use whatever TSMC uses, and TSMC is committed to 5nm. The process is already being tested.

https://www.tomshardware.com/news/tsmc-5nm-euv-process-node,...


> "only" 32 cores

But it might be 4-way SMP, so 128 threads...


I've seen multiple 4-way SMP rumors in comments in the last few weeks. Was there something hinted at, or a "leak" implying this might happen? I did a brief search but didn't find anything.


Just rumors on /r/AMD, who knows?


You can buy a 64 core dual socket EPYC workstation today for ~$15000. I wonder how much cheaper it'll be 2 years from now.


Oh man, I can't wait to make -j 128...


You'd better have a couple of very fast NVMe SSDs in a striped configuration for I/O to keep up ;)


Unless you use a ramdisk! :)


I'm curious if anyone knows about that image of the AMD cpu - it says on it:

"DIFFUSED IN USA"

What exactly does that mean? Given the next line is "MADE IN CHINA", it would seem like "DIFFUSED" should be "DESIGNED" - or does that word have a new meaning?


IIUC, diffused is where the silicon wafer is created. For 14/12nm, that's the GloFo fab in New York.

Made is where it's attached to the substrate, packaged etc.


Diffusion is a step in manufacturing CPU dies. It generally refers to the following step: http://www.cpushack.com/EtchingWafers.html


Do these processors use NUMA, similar to the high end Intel Xeon chips?


NUMA isn't a feature, it's a design compromise. Ideally every line of memory takes the same amount of time to access. But in large complex designs, you can increase performance for some memory at the price of lower performance for other memory, aka Non Uniform Memory Access.

(These chips do exhibit NUMA)
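
A minimal sketch of what NUMA-aware code looks like, using Linux's libnuma (assumes the library is installed; compile with -lnuma; illustrative of NUMA in general, not Rome-specific):

    #include <numa.h>   // libnuma, Linux only
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::puts("NUMA not supported on this system");
            return 1;
        }
        std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

        // Force an allocation onto node 0. Threads running on node 0
        // get local (fast) access; threads on other nodes pay the
        // interconnect penalty -- the "non uniform" part.
        const size_t len = 1 << 20;
        void* buf = numa_alloc_onnode(len, 0);
        if (buf) numa_free(buf, len);
    }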


I believe these chips have Uniform Memory Access via a shared memory interface on the I/O die. Am I mistaken?

Edit, just confirmed:

>Thanks to this improved design, each chiplet can access the memory with equal latency. The multi-core beasts can support up to 4TB of DDR4 memory per socket.

https://www.tomshardware.com/news/amd-64-core-128-thread-7nm...


They exhibit NUMA because if chiplet0 wants a line of memory that is held by chiplet4, it has to go get it from chiplet4. So the degree of NUMA is improved from the previous generation, but it is still not UMA.


Generally caches aren't considered to be "memory" in this sense or otherwise every multi-core chip would be considered NUMA since they all have private caches. Instead you normally talk about an architecture being NUMA when cores have different access speeds to different parts of main memory, as when you need to get another socket to forward you information from a RAM bank. This is something that the OS generally has to be aware of in scheduling decisions, unlike caches which are automatically managed by hardware.


No memory is held by any chiplet, it's all held by the IO die and chiplets ask the IO die to access memory for it.

So there is no longer "near" and "far". In a sense, it's all "far" now (but hopefully not too far). But it is all uniform now.


The chiplets have caches, which hold copies of memory. If a core has a line in an exclusive state, e.g. locked, other chiplets cannot just get the line from memory, because it might be out of date. So they must go ask whoever holds the line to flush & release.

https://en.wikipedia.org/wiki/MESIF_protocol
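
The cost of that ownership dance is easy to demonstrate with false sharing; a minimal C++ sketch (timings are machine-dependent, try it under `time`):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    struct Shared { std::atomic<long> a{0}, b{0}; };  // likely one 64B line
    struct Padded {
        alignas(64) std::atomic<long> a{0};           // own cache line
        alignas(64) std::atomic<long> b{0};           // own cache line
    };

    template <typename T> void hammer(T& c) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b++; });
        t1.join(); t2.join();
    }

    int main() {
        Shared s; Padded p;
        hammer(s);  // slow: each increment needs exclusive ownership,
                    // so the line ping-pongs between the two cores
        hammer(p);  // fast: no sharing, each line stays put
        std::printf("%ld %ld %ld %ld\n", s.a.load(), s.b.load(),
                    p.a.load(), p.b.load());
    }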


When you're talking about cache it's NUCA, not NUMA.


What's stopping them from creating 1000 core cpus?


Yields are the main barrier to single pieces of silicon that big. You have a certain chance of getting a defect per square mm of your chip, and as chips get bigger the chance of a bad defect gets higher, so yields go down. Often a defect will occur in a place where you can just disable a core or a bank of cache and still sell the chip, but not always. So yield rates tend to go down as chips get bigger. Also, larger chips make less efficient use of the wafer. (See the toy yield model below.)

Economically, there aren't so many people looking for 1000 cores that it makes sense to put in the NRE to assemble a giant package to put all of that in versus just selling a system that can have multiple sockets. Cooling limits also make spreading out work across multiple sockets a better choice.
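
A toy model of the yield argument (made-up numbers, assuming Poisson-distributed defects; not AMD/TSMC figures):

    #include <cmath>
    #include <cstdio>

    int main() {
        // P(die is defect-free) = exp(-defect_density * die_area)
        const double d0 = 0.001;            // defects per mm^2 (hypothetical)
        const double chiplet_mm2 = 75.0;    // small 7nm CPU die
        const double monolith_mm2 = 600.0;  // hypothetical monolithic design

        std::printf("chiplet:  %.1f%% good\n", 100 * std::exp(-d0 * chiplet_mm2));   // ~92.8%
        std::printf("monolith: %.1f%% good\n", 100 * std::exp(-d0 * monolith_mm2));  // ~54.9%
        // A defect scraps (or down-bins) one 75 mm^2 chiplet instead of
        // risking a whole 600 mm^2 die, and good chiplets can be mixed
        // freely into full packages.
    }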


How do GPU cores differ in that they have thousands of cores?


As wmf said, what NVidia calls a "core" isn't something that can issue its own instructions, so it isn't really something you'd normally consider a core, though it does have a PC and can compare its PC against the PC of broadcast instructions to decide whether it should execute, so it's a bit more sophisticated than a simple SIMD vector lane. Maybe on par with an execution port?

What's more equivalent to a CPU core would be what NVidia calls an SM and AMD calls a compute unit. These decide which instructions to issue next and broadcast them to the various lanes. You'll have dozens of them in a typical GPU, about the same as the number of CPU cores in the same silicon area.


They don't; GPUs have <72 real cores and thousands of marketing cores. And they can disable defective cores so their massive dies are still usable.


Is there a market need for 1000 cores running simultaneously that doesn't already need all the other supporting infra you get from a couple of extra "full fat" servers each running 64 cores? Presumably if there is any need, it's currently being served by GPUs.


Heat, energy usage.


Also maybe complications involving latency for communication between cores, and memory coherence protocols.


Memory bandwidth. Top end Xeon Phis have 288 threads. They need high bandwidth memory and the Xeon Phis are limited to only a few gigabytes of HMC.


Looks like this is an account that only posts links to this obscure Indian tech blog.


All blogs presumably start out obscure. This one seems reasonably written and the content is interesting. (Genuinely) What's the problem?


Is there anything inherently wrong with that?


Having a Rome 64 processor doesn't sound very promising.

https://en.wikipedia.org/wiki/Great_Fire_of_Rome



