The article states the 7C13, priced in the $1900-2000 range, seems to be a low-cost version of the more common retail 7713P, priced officially around $5000 about 2-3 years ago. But the thing is the 7713P itself is such an old processor that it can be found for even cheaper: from $1300 to $2000 right now.
Of course these are sellers who sell low volumes and who are not well-known, but that's just because all these processors, 7C13 or 7713P, are at their end-of-life so none of the big retailers have them in stock.
> But the thing is the 7713P itself is such an old processor that it can be found for even cheaper
If you're pointing out used hardware, doesn't it make sense to factor in service life?
Google tells me that the service life of a processor can be as low as 5 years if it's subjected to poor cooling, and certainly those who are getting rid of these expensive processors are doing so because they need to rotate their hardware.
The advantage of this program is the graphical presentation, but it's giving a fraction of the information offered by "turbostat" while using > 10x more CPU time.
Seems like a good value per dollar for a monero mining rig
approx 4 x the performance of a 5950x on monerobenchmark site.
Given that the EPYC has about 4 times the cache (256MB vs 64MB), this makes sense in the Monero world. I'd assume that in a real-world side-by-side comparison the EPYC would get past 4x a real-world 5950X, which requires a lot of tweaking to get anywhere close to the monerobenchmark numbers. I'd expect the EPYC runs better out of the box.
Here is a quick blurb from someone on Reddit which sums up the general advice in a way that matches my experience:
"most mining algorithms targeted for CPUs require certain amount of L3 cache per thread (core), usually 1-4MB, so just divide your total amount of CPU L3 cache by this number and the result is how many threads can you run max on your cpu. For example if an algorithm requires 2MB of cache per thread and you have a 10-core 16MB L3 cache cpu, you can run at most 16/2=8 threads, 9 or 10 threads will result in worse performance as cores will be kicking out each other's data from the cache. "
https://www.reddit.com/r/MoneroMining/comments/jurv6j/proces...
Below are two real-world examples of Monero mining that I run myself.
An example is my i9-10850K: exact same performance using 8 cores or 10 cores (16 threads or 20 threads); in fact, performance goes down slightly at 20 threads. Given that it has 20MB of cache, it's an example of the bottom limit not being optimal. Using this example, I'd say the minimum is around 1.12MB per thread, or 2.25MB per core.
Another machine I have, the 5950X, crunches away day and night using all 32 threads on all 16 cores with 64MB of cache, no problem.
That works out to 4MB per core / 2MB per thread, and it seems to be more than it needs, because I can use the machine all day for daily tasks with no hiccups while mining flat out. If you have a desktop at work and need to hammer it all day long, the 5950X will absolutely take anything and everything you throw at it. Samsung B-die RAM helps Monero mining as well; I run only 32GB, but in four 8GB B-die sticks.
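The quoted rule of thumb is easy enough to sanity-check in a few lines. Here's a minimal sketch in Python, assuming RandomX's roughly 2MB-of-L3-per-thread requirement; the chip specs are the ones discussed in this thread, and as my i9 numbers above show, the real-world floor can sit a bit below 2MB:

    # Rule of thumb from the quote above: max useful mining threads =
    # total L3 cache / cache needed per thread. Assumes RandomX's ~2MB
    # scratchpad per thread; actual break-even can be lower (see the
    # i9-10850K example above).
    MB_PER_THREAD = 2

    cpus = {
        # name: (L3 cache in MB, hardware threads)
        "i9-10850K": (20, 20),
        "Ryzen 9 5950X": (64, 32),
        "EPYC 7713P": (256, 128),
    }

    for name, (l3_mb, hw_threads) in cpus.items():
        cache_limited = l3_mb // MB_PER_THREAD
        usable = min(cache_limited, hw_threads)
        print(f"{name}: cache allows {cache_limited} threads, "
              f"so run min({cache_limited}, {hw_threads}) = {usable}")

On the 5950X this gives exactly the 32 threads I run, and on the EPYC the cache budget and thread count line up at 128, which is part of why it benches so well.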
It’s good to recoup some energy. But in a cold climate, you could be using a heat pump to move 2-5x more heat into the home for the same amount of energy as electric resistive heat.
There are space heater sized heat pumps? I could definitely use some. We've got a few cold areas of the house that don't get enough heat in winter but a remodel isn't going to happen any time soon.
I bought 2 new retail boxed WOF 7402's in 2020 for $734.54 USD each. 96 threads and 256 MiB L3 total, decent single- and multi-core performance. So I can't currently justify upgrading to (used) 7742/7702/7H12, much less any form of Milan or later chips. I also looked into upgrading to 1, 2, or 4 TiB of RAM, but the costs still remain absurd.
I guess I'll take my type-1 virtualized, 2 Romes, 0.5 TiB DDR4-3200 RDIMM, 32 TiB NVMe SSDs, and ~150 TiB net RAID10 of Ultrastar HDDs until the wheels fall off, because it's more than I'll need for almost any purpose. Whenever I need more temporarily, I rent it rather than sink personal $ into excess metal.
PS: During the pandemicfest, I shoehorned the H11DSi-NT into a Core V71 TGE with a bit of Dremeling and tapping holes. The case is rammed with enormous Noctua fans, an AX1600i PSU, and USB distribution for a TRNG and an HSM.
Still rocking my EPYC 7282 in my home server, which really sits in a sweet spot: 16 cores, about $700, 120W TDP (because of the reduced memory bandwidth).
Looks like the 7303 fills that same niche in the Milan generation (and should be compatible with any Rome mainboard, possibly after a BIOS update), or if you're building a new system you can get the 32-core Siena 8324PN at about 130W TDP.
(While it may be silly to look at TDP for a server CPU, it does matter for home servers if you want a regular PC chassis and not a 1U/2U case with a 12W Delta fan that is audible three cities over. In fact, you can get the 8-core 80W 8024PN and still get all those nice PCIe lanes to connect NVMe SSDs to, and of course ECC RAM that doesn't require hunting down a specific motherboard and hoping for the best.)
Only works on the original motherboard (or maybe only motherboards made by the specific vendor the CPU was locked to) - so if you buy a used vendor-locked CPU, there's a risk it's basically just a nice looking paperweight. Serve The Home has a pretty good video: https://www.youtube.com/watch?v=kNVuTAVYxpM
With these constraints, what is the benefit of Epyc over Threadripper? I've been running a 3970x in my workstation for several years now. Sure it's about twice the TDP, but with water cooling it stays quite silent even on full load.
I wanted remote management (IPMI), which none of the Threadripper boards offered. I went with the ASRock Rack ROMED8-2T, which also has 2x 10G Ethernet on board, which was another nice thing I didn't have to sacrifice a PCIe slot for. It does require a tower case with space for fans on top though, because the CPU socket is rotated 90 degrees compared to Threadripper boards, so the airflow is different.
The EPYC CPU was also quite a bit cheaper than the then-equivalent Threadripper 2950X (though the mainboard being $600 made up for that). This is even more true today because AMD really jacked up the prices for Threadripper, to the point that EPYC is actually a good budget alternative. I guess the 16-core Ryzen made low-end Threadrippers less attractive, but it's the PCIe slots that were so great about those!
Also, I do believe it was much easier to find 64 GB RDIMMs, whereas 64 GB ECC UDIMMs were either unavailable or much more expensive, though my memory (ha!) is hazy on that; I just remember it being a PITA.
So that EPYC system was just much more compelling.
ROMED8-2T is one of the all-star boards of the modern era imo. Like that's literally "ATX-maxxed" in the Panamax sense - you can't go bigger than that in a traditional ATX layout, and there is no point to having a bigger CPU (even if you do not use all the pins) because it starts to eat up the space for the Other Stuff. It's a local optimum in board design and everything else past here gets weirder and has to start making more and more tradeoffs.
EEB/EE-ATX can push things a little farther (like GENOAD8X-2T) but you can't pull any more PCIe slots off, so it has to be MCIO/oculink instead. And imo this is the reasonable limit of what can be done with single-socket Epyc even in EEB, this is "EEB-max".
And you can't really get more than 8 memory slots without moving the CPU over to the other side of the board, like MZ32-AR0 or MZ33-AR0, which means it overhangs the PCIe slots etc. IIRC you can sorta do 16-dimm SP3 if you don't do OCP 2.0 (gigabyte or asus might have some of these iirc) and you drop to like 5 pcie slots or something. But it's really hard to get 2DPC on epyc at all, the layouts get very janky very quickly.
You can fit more RAM slots into EEB/EE-ATX with a smaller socket (dual 2011-3 with 3DPC goes up to 24 slots in EE-ATX) but 2DPC is as big as you can go with epyc in a commodity form-factor. In SP5 this gets fully silly, MZ33-AR0 is an example of 2DPC 12-channel SP5, and it's like, oops all memory slots, even with EEB and completely overlapping every single pcie slot.
And of course dual-socket epyc gets very cramped even on EEB/EE-ATX even with only 8 slots per socket (MZ72-HB0). You just are throwing away a tremendous amount of board space and you lose pcie, MCIO, everything. SP3 is already a honkin big socket let alone SP5, let alone two SP3, let alone two SP5, etc... they are big enough that you have to make Tough Choices about what parts of the platform you are going to exploit, or accept a non-"standard" form factor (it's not standard for anyone except home users/beige boxes). Servers don't use EEB/EE-ATX form factors anymore, because it just isn't the right shape for these platforms. And you need to be pulling a significant amount of the IO off in high-density formfactors (MCIO, Oculink, SlimSAS, ...) already, and your case ecosystem needs to support that riser-based paradigm, etc. ATX is dying and enthusiasts are not even close to being ready for the ground to shift underneath them like this.
There's still good AM4, AM5, and LGA1700 server boards (with ECC) btw - check out AM5D5ID-2T, X570D4I-2T, X470D4U, W680 ACE IPMI, W680D4U-2L2T/G5, X11SAE-M, X11SAE-F, IMB-X1231, IMB-X1314, X300TM-ITX, etc. And Asrock Rack and Supermicro do make threadripper boards too (WRX80D8 family, TRX40D8 family, etc), although I think they're not viable since threadripper is leaning farther and farther into the OEM market and it just doesn't make cost sense unless you really need the clocks. It's not like the X99 days where HEDT was just "better platform for enthusiasts", there is a big penalty to choosing HEDT right now if you don't need it, and it's generally too segmented to make sense.
Unregistered DDR4 tops out at 32GB per stick (UDIMM or SODIMM), registered can go larger. DDR5 unregistered will go larger, and actually a few 48GB sticks do exist already, but generally you can't use all four slots without a massive hit to clocks (current LGA1700/AM5 drop to 3600 MT/s) so consumers/prosumers have to consider that one carefully.
(this generally means that drop-in upgrades are not viable for DDR5 memory btw - 4-stick configs suck, you should plan on just buying 2 new sticks when you need more. And the slots on the mobo are worse than useless, since the empty slots worsen the signal integrity compared to 2-slot configurations without the extra parasitics...)
I agree, the ROMED8-2T has everything I want and compromises almost nothing. One of the PCI Express slots is shared with one of the on-board M.2 slots, SATA, and Oculink, but even then, you get to choose: run the slot at x16 and turn off M.2/SATA/Oculink? Run the slot at x8 and keep M.2/SATA but lose Oculink? Or disable the slot and get M.2/SATA/Oculink? I think that's a great compromise (I run the slot at x8 and use it for a Fibre Channel card to my backup tape drive). Lovely block diagram in the manual as well.
Plenty of fan headers as well, and using SFF-8643 connectors for the SATA ports makes a lot of sense (though it's an extra cost for the cables). They even put a power header on the board in case you run too many high-powered PCIe cards (since PCIe AFAIK allows pulling up to 75W from the slot).
They really put every feature that makes sense onto that board, and yeah, if you want Dual CPUs or 16 DIMM Slots, chances are that a proper vendor server is more what you want.
I can't think of anything that I don't like about the board. Well, I wish the built-in Ethernet ports weren't RJ45 but SFP+, but that's really the only thing I wish to change.
> I can't think of anything that I don't like about the board. Well, I wish the built-in Ethernet ports weren't RJ45 but SFP+, but that's really the only thing I wish to change.
YES. Jesus. The fact that the state of the art for SFP+ motherboards is basically still Denverton (C3758 etc) is embarrassing. You have all these "server" motherboards with consumer base-T standards... even if you have base-T to the workstations, surely your fancy SOHO/SMB setup has a server closet where all those individual base-T links get punched down, and where it would make sense to have SFP+ for the server... (and actually base-T has much more latency than SFP+ as well - measurable on NVMe drives etc)
In theory this is something that OCP 2.0 mezzanine cards (like the MZ31-AR0 uses) can do for you. Actually these are quite cheap, because there's very little demand for them on the surplus market, and you can get adapter cards to put them in PCIe slots if you want (the adapter cards are unusual enough that they're not cheap, but like most PCIe adapter cards they're not inherently expensive or electrically difficult). So you can put SFP+ on anything you want with an OCP 2.0 slot - but of course most of the ASRock Rack, Supermicro, etc boards are all base-T with no OCP 2.0. Infuriating.
(OCP 2.0 does make "traditional" IO area very difficult however, it eats a lot of space in that IO shield area, and this often has the side-effect that the CPU gets pushed over into the other side of the board where it starts overlapping pcie slots etc. There are reasons to not do it - and this is another problem brought on by the ATX layout. But then offer some SFP+ please - really SFP28 25gbit should be available at this point on things like the ROMED8 model imo.)
And before someone says Minisforum MS-01... no ECC (which could be forgiven), and the Chinese vendors are unfortunately quite poor in the support department in general. They are great at marketing via influencers etc, but I have read a number of people say they've run them long-term and were upset about the support story.
Which is a shame, because on paper it's quite attractive - 12th/13th gen are incredibly speedy (faster than AMD's 7000 series) and most server workloads don't benefit that much from V-Cache; a laptop CPU (again, ideally with ECC) with a couple of SFP28 cards is more or less an ideal 25gb switch, can serve NVMe flash drives pretty fast, etc. It is quite desirable to have high per-thread performance in a homelab, especially when you are the sole user (4 people running 1 GB/s is not the same as 1 person running 4 GB/s). The i3 7100 was actually an extremely interesting SKU for this reason - 3.9 GHz dual-core (the 7350K is 4.2 GHz base, and the 8350K/9350K were 4.0 GHz base 4C4T) is quite punchy, you never drop clocks due to AVX offset, and i3s supported ECC in this era. AM4 server boards weren't mature yet (X470D4U was one of the first good ones and it still took several years to stabilize fully), so the alternative was Sandy Bridge Xeons etc, and the 7100 (while not a popular gaming CPU) absolutely destroyed that shit for homelab NAS builds, in both perf and efficiency.
Again, it's kind of a tragedy that C3758 is the norm still - that's 8x 14nm e-cores with RDIMM support and onboard Intel QAT (of whatever gen). That is not fast at all, we are talking like sub-zen1 performance here probably, with no AVX, etc.
(Unfortunately minisforum's AMD boards are not any better in the support department, and AMD segments ECC to the Pro laptop chips too, etc.)
I think in practice the problem is the length of the SFP+ cage - it's noticeably longer than base-t. And that means either you're wasting space on your base-t boards, or you have to design a custom layout for SFP+, which is already a (truthfully) small/niche market etc. It is understandable, just unfortunate.
They don't have to be used; there's overstock and grey market possibilities.
Many times I've seen (usually smaller) runs of products with house-marked or otherwise oddly identified chips, and when asked, the producer said "the OEM didn't use them so they sold them to us cheap". And I've certainly bought a couple of brand-new big-box-labeled motherboards that were really (and obviously) minor variations of existing Asus, Gigabyte or Supermicro motherboards. Shoot, somewhere I've got a NiB Intel Phi card with a weird part number, only because it was made for (I think) Dell and now that Phi is dead they were being fire-saled.
I don't think they are barred from sale, but I do think that if you're selling secondhand CPUs on Newegg, the used nature of the hardware should be prominently stated. That is, for customers who are still willing to risk their money on Newegg.
With those caveats it's a great deal for something like a build box. You could put this into an existing ATX case with $1000 worth of RAM (that you may already own?) for less than the price of a new Threadripper CPU.
It’s become a marketplace ever since it was bought out, so lots of sellers of varying qualities.
As long as you always filter by “sold and shipped by Newegg” you should be fine.
Who are the recommended vendors for purchasing PC parts these days, then? That is, who (if anybody) fills Newegg's previous niche? I've actually just bought a bunch of stuff from Newegg, after not doing any PC building for 15+ years, and didn't initially realize how much they had switched to the "marketplace" model.
Some maybe-interesting observations about my experiences over the last few years, as someone who uses both cloud-provisioned (GCP N2D) and dedicated-server (e.g. OVH HGR-HCI-class) AMD EPYC-based machines at $work.
• GCP N2D instances always had a strict per-AZ allocation quota. This allocation quota has not increased over time. And when we asked to have it bumped up, it was the only time a quota-increase request of ours has ever been denied.
• When OVH was offering their HGR-HCI-6 machine type (2x EPYC 7532), we provisioned a few of them. The first few, leased ~2 years back, each took a few days to provision — presumably, OVH doesn't buy these expensive CPUs until a customer asks for a machine to be stood up with one in them. More recently, though (~6mo ago), for the same machine type, they gave us a provisioning lead time of more than a month, due to supply difficulties for the CPU.
• These chips were buggy! Again on OVH, when allocating these HGR-HCI-6 machines, we were allocated two separate machines that ended up having CPU faults. (Symptoms: random reboots after an hour or two of heavily utilizing the native AES-NI instructions; and spurious "PCI-e link training errors" in talking to the network card and/or NVMe drives.) They were replaced quickly, but I've never seen this kind of CPU fault on a hardware-managed system before or since.
• Just a month ago, the high-end dedicated-server hosters (OVH, but also Hetzner and so forth) seem to have removed all their SKUs that use 2nd- and 3rd-gen EPYC 7xxx CPUs. (Except for one baseline SKU on OVH, which is probably there because they have a big pile of them.) Everything suddenly switched over to 4th-gen 9xxx EPYCs just a month or two ago. It might just be that availability of these 9xxx EPYCs is finally reaching levels where these providers think they can meet demand with them — but everyone switching over simultaneously, and dropping their old SKUs at the same time?
• GCP recently launched the storage-optimized Z3 instance type. They chose to build this instance type on an Intel platform (Sapphire Rapids). That's even though AMD EPYCs have had enough PCIe lanes to deliver equivalent performance to this Z3 platform — ignoring the "Titanium offload" part, which isn't CPU-platform-specific — for years. (In fact, the need for a huge pool of fast NVMe is in part why we switched some of our base load from GCP over to those OVH HGR-HCI-6 instances — which satisfied our needs quite well.) GCP could in theory have launched something akin to this instance type, with the same 36TiB storage pool size (but PCIe 4.0 speeds rather than 5.0), three years ago, using EPYC 7xxxs. Customers have been asking for something like that for years now — wondering why GCP instances are all limited to 8.8TiB of local NVMe. (We actually asked them ourselves, back then, where "the instance type with more local NVMe" was. They gave a very handwave-y response, which in retrospect may have been a "we're trying, but it's not looking good for delivering this at scale right now" response.)
These points all lead me to believe that something weird happened with the EPYC 7xxx rollout. Supply didn't grow to meet demand over time.
And then, suddenly, after the seeming EOL of these chips — but long before cloud providers would normally cycle them out — we're seeing 7xxxs ending up on the open market, in enough bulk to make them affordable? Bizarre.
---
My own vague hypothesis at this point, is that at some point during the generation, AMD discovered a fatal flaw in the silicon of the entire EPYC 7xxx platform. Maybe it was the hardware crypto instructions, like I saw. Or maybe it was some capability more specific to cloud-computing customers (SEV-SNP?) that turned out to not work right (which would make more sense given that Threadrippers didn't see the same problems.) So the big cloud customers immediately halted their purchase orders (keeping only what they had already installed so as to not disturb existing customer workloads); and AMD responded by scaling down production.
This resulted in two things: a supply shock of AMD EPYC-based machines/VMs that lasted for a while; but also, negotiated settlements with the cloud vendors, where AMD was now obligated to fulfill existing POs for 7xxx parts with 9xxxs, as they ramped up production of those. Which is why 9xxxs have taken so long (2 years!) to make it onto the open market: the lines have been dedicated to fulfilling not just 9xxx bulk purchase-orders, but also 7xxx purchase-orders.
(And which is why the switchover to 9xxx among smaller players is so immediate: such a switchover has been on every hosting company's roadmap for a long time now, having been repeatedly delayed by supply issues due to the huge volume of 9xxx parts required to satisfy the clouds' backlogged demand. They've had a stock of 9xxx-compatible motherboards + memory + PSUs + chassis just sitting there for months/years now, waiting for 9xxx CPUs to slot into them.)
Perhaps we're seeing these cloud-customer 7Cxx parts on the open market now because the clouds have finally received enough 9xxxs to satisfy both their actual demand for 9xxxs and their backlogged demand for 7xxxs; the clouds are now finally at the point where they can replace the initial (faulty / feature-disabled) 7xxx parts they were sent with 9xxxs, selling off the 7xxx parts.
My guess is that, now that they have "fixed" AMD chips in place, we'll soon see the cloud providers heavily hyping up some particular AMD-silicon-enabled feature that they had been starting to market four years ago, but then went radio-silent on. ("Confidential computing", maybe.)
---
I'd love to hear what someone with more insider knowledge thinks is happening here.
CPU faults on individual machines aren't that rare. The machine has a dodgy power supply that almost works but has voltage drop under load etc. Sometimes this can be caused by environmental factors. The rack is positioned poorly and has thermal issues, the UPS is supplying bad power etc. Then you can see issues with multiple machines, or replace the machine without fixing the issue. Vendors often put machines for the same customer in the same rack for various reasons, e.g. because they might send a lot of traffic to each other and put less load on their network if connected to the same switch, but then if there is a problem in that rack it affects more of your machines.
The Epyc 7000 series was popular. There have been enough of them in private hands for long enough that if there were widespread issues they would be well-known.
It's possible that AMD didn't order enough capacity from TSMC to meet demand, and couldn't get more during the COVID supply chain issues. For the 9000 series they learned from their mistake, or there is otherwise more fab capacity available now, so customers can get them. Meanwhile cloud providers really like Zen4c because they can sell "cores" that cost less and use less power, so they're buying it and replacing their existing hardware as they tend to do regardless. That is typically how they expand their business: If you add more servers you need more real estate and power and cooling. If you replace older servers with faster ones, you don't.
To be clear, it was a CPU fault that doesn't occur at all when running e.g. stress-ng, but only (as far as I know) when running our particular production workload.
And only after several hours of running our production workload.
But then, once it's known to be provokeable for a given machine, it's extremely reliable to trigger it again — in that it seems to take the same number of executed instructions that utilize the faulty part of the die, since power on. (I.e. if I run a workload that's 50% AES-NI and 50% something else, then it takes exactly twice as long to fault as if the workload was 100% AES-NI.)
And it isn't provoked any more quickly, by having just provoked it and then running the same workload again — i.e. there's no temporal locality to it. Which would make both "environmental conditions" and "CPU is overheating / overvolting" much less likely as contributing factors.
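(If it helps to see the arithmetic, here's a toy model of what I mean by "fixed instruction count since power-on"; every constant in it is made up, purely illustrative:)

    # Toy model of the behavior described above: the machine faults once a
    # fixed number of instructions have gone through the faulty unit
    # (here: AES-NI) since power-on. All constants are hypothetical.
    FAULT_AFTER = 10**15   # AES-NI instructions until the fault (made up)
    INSTR_RATE = 10**10    # instructions retired per second (made up)

    def hours_until_fault(aes_fraction: float) -> float:
        """Wall-clock hours until FAULT_AFTER AES-NI instructions have run."""
        return FAULT_AFTER / (INSTR_RATE * aes_fraction) / 3600

    print(hours_until_fault(1.0))  # ~27.8 hours at 100% AES-NI
    print(hours_until_fault(0.5))  # ~55.6 hours at 50% AES-NI: exactly 2x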
> There have been enough of them in private hands for long enough that if there were widespread issues they would be well-known.
Our setup is likely a bit unusual. These machines that experienced the faults have every available PCIe lane (other than the few given to the NIC) dedicated to NVMe, where we've got the NVMe sticks bound together in extremely-wide software RAID0 (meaning that every disk read fans out into as many almost-precisely-parallel PCIe packets, contending for bus time to DMA their way back into the kernel BIO buffers). On top of this, we then have every core saturated with parallel CPU-bottlenecked activity, with a heavy focus on these AES-NI instructions, and a high rate of allocation/deallocation of multi-GB per-client working arenas, contending against a very large and very hot disk page cache, for a working set that's far, far larger than memory.
I'll put it like this: some of these machines are "real-time OLAP" DB (Postgres) servers. And under load, our PG transactions sit in WAIT_LWLOCK waiting to start up, because they're actually (according to our profiling) contending over acquiring the global in-memory pg_locks table in order to write their per-table READ_SHARED locks there (in turn because they're dealing with wide joins across N tables in M schemas where each table has hundreds of partitions and the query is an aggregate so no constraint-exclusion can be used. Our pg_locks Prometheus metrics look crazy.) Imagine the TLB havoc going on, as those forked-off heavy-workload query workers also all fight to memory-map the same huge set of backing table heap files.
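(For the curious, here's a minimal sketch of the kind of pg_locks query those Prometheus metrics are derived from; Python with psycopg2, and the DSN is a placeholder:)

    # Count lock requests by mode and granted-status; a big pile of
    # ungranted requests is the contention symptom described above.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute("""
            SELECT mode, granted, count(*)
            FROM pg_locks
            GROUP BY mode, granted
            ORDER BY count(*) DESC
        """)
        for mode, granted, n in cur.fetchall():
            print(f"{mode:30} granted={granted} count={n}")
    conn.close()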
It's to the point that if we don't either terminate our long-lived client connections (even when not idle), or restart our PG servers at least once a month, we actually see per-backend resource leaks that eventually cause PG to get OOMed!
The machines that aren't DB servers, meanwhile — but are still set up the same on an OS level — are blockchain nodes, running https://github.com/ledgerwatch/erigon, which likes to do its syncing work in big batches: download N blocks, then execute N blocks, then index N blocks. The part that reliably causes the faults is "hashing N blocks", for sufficiently large values of N that you only ever really hit during a backfill sync, not live sync.
In neither case would I expect many others to have hit on just the right combination of load to end up with the same problems.
(Which is why I don't really believe that whatever problem AMD might have seen, is related to this one. This seems more like a single-batch production error than anything, where OVH happened to acquire multiple CPUs from that single batch.)
---
> It's possible that AMD didn't order enough capacity from TSMC to meet demand, and couldn't get more during the COVID supply chain issues.
Yes, but that doesn't explain why they weren't able to ramp up production at any point in the last four years. Even now, there are still likely some smaller hosts that would like to buy EPYC 7xxxs at more-affordable prices, if AMD would make them.
You need an additional factor to explain this lack of ramp-up post-COVID; and to explain why the cloud providers never started receiving more 7xxxs (which they would normally do, to satisfy legacy clients who want to replicate their exact setup across more AZs/regions.) Server CPUs don't normally have 2-year purchase commitments! It's normally more like 6!
Sure, maybe Zen4c was super-marketable to the clouds' customers and saved them a bunch of OpEx — so they negotiated with AMD to drop all their existing spend commitments on 7xxx parts purchases in favor of committing to 9xxx parts purchases.
But why would AMD agree to that, without anything the clouds could hold over their head to force them into it? It would mean shutting down many of the 7xxx production lines early, translating to the CapEx for those production lines not getting paid off! Being able to pay off the production lines is why CPU vendors negotiate these long purchase commitments in the first place!
And if the clouds are replacing capacity, then where are all those used CPUs going?
This server was never in an IaaS datacenter. This is a motherboard straight from the motherboard vendor, with an EPYC 7C13 prepopulated into it.
This isn't the sort of thing you get when a cloud resells. This is the sort of thing you get when a cloud (or other hosting provider) stops buying unexpectedly — and upstream suppliers/manufacturers/integrators are left holding the bag, of preconfigured-to-spec hardware they no longer have a pre-committed buyer for.
> if I run a workload that's 50% AES-NI and 50% something else, then it takes exactly twice as long to fault as if the workload was 100% AES-NI.
If you mean it takes twice as long on average, that's consistent with marginal hardware, where the fault is more sensitive to occurring during AES instructions so the more you execute the higher the probability.
If you mean it always takes exactly twice as long, that sounds more like a software issue, where there is some counter and when it rolls over its behavior changes. In theory this could be microcode/firmware/library rather than your code, but then it's likely that someone else would have noticed by now.
> And it isn't provoked any more quickly, by having just provoked it and then running the same workload again — i.e. there's no temporal locality to it. Which would make both "environmental conditions" and "CPU is overheating / overvolting" much less likely as contributing factors.
That's only if the workload is inducing higher power consumption or higher temperatures. If instead the voltage or temperature is constantly out of spec, but only marginally, then there is at all times a probability of random errors, to which certain types of instructions are more susceptible.
> Imagine the TLB havoc going on, as those forked-off heavy-workload query workers also all fight to memory-map the same huge set of backing table heap files.
So now there are two possibilities. One, your workload is really unusual and is triggering a rare hardware bug nobody else hits. But then nobody else is worried about or even knows about it, so it wouldn't be affecting the market. Two, it's on the heavy side but not so much that other people don't hit it too, and then the issue would be public.
It's not very plausible that the problem could be known to all cloud providers but not the general public.
> Yes, but that doesn't explain why they weren't able to ramp up production at any point in the last four years. Even now, there are still likely some smaller hosts that would like to buy EPYC 7xxxs at more-affordable prices, if AMD would make them.
They don't always lower the prices of the old models very much; they just keep making them for anyone who wants to keep buying them because they want uniformity. But then anyone who didn't buy a lot of them before isn't going to buy a lot of them now instead of just buying the new ones.
The prices come down on the used market from supply and demand, as people buy the new ones and sell the old ones, and on new stock that retailers already own and want off the shelves to make room for the new models. But that doesn't mean you could find new stock of the old model in volume at the lower price. Which is why the people who want that contract for it ahead of time.
> But why would AMD agree to that, without anything the clouds could hold over their head to force them into it?
Because they're just as happy to make more of the newer models as the older ones, if the production capacity is available. Also, they could just be charging more for them. "Oh, you want 10,000 units for $2500 each instead of the contract which says you have to buy 10,000 units for $2000 each? Okay then."
> It would mean shutting down many of the 7xxx production lines early, translating to the CapEx for those production lines not getting paid off!
AMD doesn't have production lines, they use TSMC. TSMC, in turn, would just sell the capacity to someone else.
> And if the clouds are replacing capacity, then where are all those used CPUs going?
The original problem was that they couldn't make enough of them. Also, the cloud providers generally replace their oldest servers. Zen2/Zen3 isn't all that old. They'll be installing Zen4 and taking out Zen1 or five- to ten-year-old Xeons. Which are all over the place on eBay.
Nah, it's just variation in the price of older stock. Relatively few buyers and relatively low stock push the variance up. E.g. I see a 7763 listed at $3k in one store and $4k in another.
If you can find a motherboard to match it's a lot of computer for the price.
There is not an EPYC 7xx4. EPYC 8004 is Siena. We reviewed an ASRock Rack Siena platform this week actually.
We had this in the Genoa launch piece but AMD largely kept $/core constant between Milan and Genoa. Milan is still being sold since it uses cheaper PCIe Gen4 motherboards and DDR4.
On the server side it takes a few quarters for new products to start making up the majority of shipments. These days it is better to think of new servers as N, N-1, and still some N-2 generations being sold as new.
It's good, but I have a feeling anything you find on the aftermarket will be used and abused. These chips are designed to handle high thermal load, but if it's been in a DC or server room, that may impact its longevity.
Agreed - with CPUs, if it's working the day you buy it, it will most likely still be working in 10 years with typical desktop use cases, no matter the past life it had.
That's not true at all; XMP is very much within the wheelhouse of "typical desktop use-cases" and can absolutely damage a CPU through electromigration within a matter of years.
(or rather, the overclocked/out-of-spec memory controller usually requires the board to kick up the CPU memory controller (VCCSA/VSOC) voltages, and that's what does the damage.)
People have generally convinced themselves that it's safe, but the rate of CPU failures is incredibly high among enthusiasts compared to the general enterprise fleet, and the reason is XMP. This has been "out there" for a long time if you know to look for it. But enthusiasts fall into that classic "can't make a man understand something when his salary depends on not understanding it" thing - everyone has every reason to convince themselves it doesn't, because it would affect "their lifestyle".
But electromigration exists. Electromigration affects parts on consumer-relevant timescales, if you overclock. Electromigration particularly affects memory controllers/system fabric nowadays. And yes, you can absolutely burn out a memory controller with just XMP (and the aggressive CPU voltages it applies) and this is not new or a secret. And the problem of electromigration/lifespan is accelerating as the operating range becomes narrower on newer nodes etc.
Similarly: "24/7 safe" fabric overclocks are really not. Not on the order of years. Everyone is already incentivized to push the "official" limit as much as is safe/reliable - AMD/Intel know about the impact on benchmark scores too, they want their parts to look as good as they can. There is no "safe" increase above the official spec, not really.
The unique thing about Asus wasn't that they killed a chip from XMP - it's that they put so much voltage into it that it went into immediate runaway and popped instantly, explosively, and visibly. And it's not surprising it was Asus (those giant memory QVLs come from just throwing voltage at the problem) but low-key everyone has been applying at least some additional voltage for a long time. Eventually it kills chips. It's overclocking/out-of-spec and very deliberately and specifically excluded from the warranty (AMD GD-106/GD-112).
It's completely understandable why AMD wants to make some fuses/degradation canary cells to monitor whether the CPU has operated out-of-spec as far as warranty coverage. This is a serious failure mode that probably causes a large % of overall/total "CPU premature failure" warranty returns etc. And essentially it continues to get worse on every new node and with every new DDR standard, and with the increased thermals that currently are characteristic of stacked solutions etc.
But are there many cases of a CPU being overclocked (& overheated & overvolted), then later not being overclocked (and working fine), but then failing shortly afterwards?
Yes, I understand it is theoretically possible. But I think it is just super rare - I've never heard of a single case.
Was looking through semiengineering for some more sources and some of them address it. Aging (I should probably say "aging" instead of electromigration, I'm not referring to just one effect here) is such a problem below 10nm that literally even just idling the chip wears it noticeably... and of course that leads to uneven wear on the cores too, etc. It's not just electromigration etc anymore.
The reason you don't notice this is that the chip is engineered so you don't notice it. The boost clocks will slow down over time, the voltage applied will increase over time (dynamically controlled by measuring the degradation of the canary cells). Unless the chip catastrophically fails, you probably won't notice the slowdown etc, in scenarios that would have resulted in chip failure 20 years ago. The chips are simply designed to tolerate that - because they have to be, even during normal operation (!).
The lifespan of a 5nm chip is not "infinite if treated properly" anymore. It is actually finite in terms of even idle power-on hours etc let alone load hours. A large number of power-on hours, and deliberately engineered to be large and to tolerate the damage gracefully, but people's mental models of "power-on hours doesn't hurt the chip" is fundamentally not correct anymore. Miners running lots of hours on that 7nm GPU etc is not "just fine" etc.
Also, once you get it beyond the "damage point", especially in analog stuff you have simply changed the characteristics of the circuit. If the amplifier's bias circuit leads some other part of the circuit to be hit with a higher gain, that can continue damaging it even if you stop further damaging the bias circuit etc. Memory and PCIe are analog circuits here.
> Digital and analog will be affected differently, as will devices subject to frequent change — and in some cases, infrequent change. “Any place where there’s a lot of activity will be more sensitive to device aging,” says Art Schaldenbrand, senior product manager at Cadence. “For devices, you can look at the clock tree and look at what is happening. Digital designs are sensitive to delay changes. The other place where this becomes a challenge is within analog designs. An example would be in a bias tree. With the bias transistors moving and aging, it can potentially accelerate the aging of other devices in the bias network. There’s always going to be some different elements in the design, and you have to look at them a little bit differently to be able to analyze the reliability.”
[...]
> But you have to be careful to consider all of the important areas. “There is a phenomenon called non-conductive stress,” says Cadence’s Schaldenbrand. “Consider a device such as a watch dog or monitor. It will be sitting idle, potentially for years, and you want it to spring into action if there’s some sort of condition that occurs. Even those circuits, that you think are you’re just sitting there doing nothing, are being stressed. They can age and potentially fail due to the aging that occurs while they’re sitting idle.”
> This impacts the gate because of the natural behavior of the transistors, Elhak explained. “In the transistor you have a gate, which has an electric field that is supposed to control the current that is flowing between the drain and the source but there are random events. This electric field causes some of those carriers, instead of flowing between the gate and the source, to go and get injected into the gate. As more carriers get injected over time, the electrical properties of the gate start to differ because it’s not supposed to have those carriers in it. That changes the properties of the whole device because now the gate is supposed to control that electric field it is now made of a different material.”
> The second mechanism that causes aging is called the bias temperature instability (BTI), which happens when there is a constant bias on the device meaning there is current flowing. Here, instead of being driven by electric field, it is driven here by bias and temperature. Also, charges start to get trapped into the gate and as this happens, the properties of the gate change and again it impacts the threshold voltage and the carrier mobility in that channel. “If you change the threshold voltage and if you change the mobility, then you have a different transistor,” he asserted.
true, I am just pushing back on the idea that "abusing a chip is mostly a myth" and "if a CPU is working on the day you buy it, it's fine for desktop use-cases". For server parts that can't be OC'd - true, I guess. For regular CPUs? Absolutely not true, enthusiasts abuse the shit out of them and even if you do no further damage yourself, the degradation can continue over time etc as parts of the circuit just become critically unstable from small routine usage etc.
(people treat their CPUs like gamer piss-jugs, big deferred cost tomorrow for a small benefit today.)
But yes - ironically this means surplus server CPUs are actually way more reliable than used enthusiast CPUs. In some cases they are drop-in compatible in the consumer platform (although not so much in newer stuff), and the server stuff got the better bins in the first place, and it's cheaper (because they sold a lot more units), and also hasn't been abused by an enthusiast for 5 years etc. If you are on a platform like Z97 or X99 that supports the Xeon chips, the server chips are a complete no-brainer.
And some xeons are even multiplier unlocked etc - used to be a thing, back in the day.
("server bins are binned for leakage and don't compete with gaming cpus" is another myth that is not really true except for XOC binning - server CPUs are better binned than enthusiast ones for ambient use-cases.)
People forget there are decade-old servers out there working 24/7.
Anyway, there's a Microsoft research paper on silicon which essentially says that CPU failure rates increase with mostly two factors:
- cycles. The more calculations, the higher the rate of failure
- temperature/power. I will let you guess this one by yourself. Even minor overvoltage and overclock slips can increase failure rates by orders of magnitude.
Getting back to your comment: I'd take a GPU used for mining over years (if properly cleaned during its lifespan, far from a given) over one used by some kid benchmarking and overclocking, any day. Years of crunching calculations did very little damage compared to a kid trying to find the overclock limits for a few days. Most mining GPUs were run undervolted and underclocked (especially as Ethereum mining was memory- rather than core-intensive).
When I ran an ETH miner for a little bit, it put an insane amount of pressure on the video memory, making it overheat, since memory chips aren't always properly cooled. It would crash my machine until I changed some settings. I never had such issues with any other workloads.
I'd much rather have something that came from a server room. Lots of cool, dust-free air--far better than a machine that's been sitting under someone's desk, clogged with dust and exposed to who knows what temperatures.
https://www.ebay.com/sch/i.html?_from=R40&_nkw=amd+7713P&_sa...
https://www.amazon.com/dp/B0BRR2369B (listed by small third-party vendor)
https://starmicroinc.net/amd-epyc-7713p-2-0ghz-socket-sp3-64...