The article presents the situation as if the operating system's limit on the number of threads were not a conscious choice by Microsoft but a natural law that cannot be avoided.
Microsoft deliberately imposes these limitations in order to force people to pay more for its software. It's a shame, really.
Isn’t that the same with a lot of hardware? Like CPUs where you have to pay more to get a feature unlocked that’s there but unusable when you do not pay extra? And also the same for almost all software with multiple feature sets. I don’t understand why it’s bad when MS does it.
Hardware is _kind of_ different. Hardware is usually binned for tolerances/expected demand. e.g. the manufacturing process for an i5 and an i3 might be the same, but the acceptable tolerances for an i3 are lower. If the part fails the test for an i5, it will be sold as an i3, and the extra features disabled. This makes sense; otherwise you end up with a part lottery and bad press.
(of course they do also bin based on demand/sales, which is scummy.)
>of course they do also bin based on demand/sales, which is scummy.
Why is this scummy? It's a free market and it's not a life saving drug, they can ask for whatever they want, nobody is forcing you to buy it.
That's like saying "Software engineers are paid according to demand/market which is scummy."
The best business strategy for companies and professionals alike is to find a niche with less competition where you are a leader (cough, monopoly, cough) and charge whatever the market will bear. If Apple/Microsoft/AMD/Intel can get away with that, it means the market can bear it. The fact that you don't like it is another issue.
Because it's price gouging. They would be actively restricting supply of their highest-performing parts for no other reason than to charge a higher price for them. Software engineers don't do this, they're highly paid because demand is so huge that it outstrips even an ample supply.
Of course, in practice, it can be hard to tell whether intentional "crippling" is going on. Yield problems are common and might manifest in the exact same way, after all - for example, chips that don't "make timing" for their design frequency can be sold for lower specs.
No, it's deliberate discounting of economy products!
What's the difference? I don't know either. Price discrimination is a very real thing that exists virtually everywhere. It's why the market will bear Beats headphones in the same shops that sell generics at 20% the price.
At the end of the day, a manufacturer doesn't "owe" you something at cost. The idea of capitalism is that competition in the market will drive prices down instead, and it does. But one side effect is that customers and sub-markets that are able to bear higher markups on their versions of stuff are asked to pay it even when it doesn't make sense at the level of per-part costs.
So it's true that that Beats headset costs only $4 to make and "should" be available much cheaper than it is. But it isn't, and the reason is simply that people are willing to buy them at $200. If you're not, that seems unfair, I agree. But... what's the alternative?
That's a very nice and non-scummy way of putting it, yes. If that deep discount compared to the "full-featured" price were made clearer, I assume that many people would not even find it scummy that these discounts are conditioned on being sold a "crippled" version of the product. So, depending on the "framing" you put on this, you can nicely account for either of these perspectives.
I'm not really sure what you want them to do in their marketing materials. It's not like Intel keeps it a secret that the HEDT parts are just Xeons with higher boost clocks and ECC disabled. The general public doesn't even understand what ECC is, nor are they considering buying a $1000+ CPU. I doubt most people even understand hyperthreading (the main difference between i7 and i5).
The important point was the economics in the later stuff. Price discrimination isn't about "framing", it's just the way competition works in the case of multiple submarkets needing similar products but having different price tolerances. You can't "fix" this with framing or spin, it's just the way markets work.
Market segmentation is what allows you to have cheap things. Next time you walk through first class to your coach seat remember that those people are subsidizing you.
"price gouging" is a myopic view of what goes on in the CPU/GPU market.
Intel/AMD could not afford to sell hardware at consumer prices if buyers could turn around and put those consumer parts in servers at scale. In the alternate universe where certain features aren't disabled for consumer parts, your i7 costs as much as a Xeon; you don't get Xeon features for i7 money.
Yes, as someone has said elsewhere, they might essentially be selling you their chips at a deep "discount", while making that discount conditional on forgoing some desirable features. It's pretty much the same thing, just with a different framing placed on it.
One of the reasons that free market capitalism enjoys wide support in the western world is because of the "consumer surplus".
That is because for the vast majority of transactions, there is a very large gap between the highest price a consumer is willing to pay and the lowest price a producer is willing to accept. The canonical example is for essentials like food: people are willing to pay every single penny they have for food for their family. OTOH, farmers are willing to sell wheat in bulk for under 10 cents a pound.
For obvious reasons farmers cannot set different prices for each pound of wheat to get "all the money". It's seen as deeply unfair when the vast majority of industries are price takers while a select few can charge the maximum a customer can bear. If every industry could effectively price differentiate, we'd be spending all our money on food with no money left to buy CPUs, and our whole free market system would collapse.
Not sure I'm getting all the subtleties of your argument, but take food: if, say, they suddenly charged 100x the price for wheat, people would switch to something different.
Of course people need food, but there are many kinds of food and it's competitive between the different foods.
Lifesaving drugs are something where it's often difficult to substitute.
But even if it were a life-saving drug, would it exist at that point in time (rather than a few decades later) if billions of dollars hadn't been spent to bring it to market (instead of investing the money in any one of a number of other things, including non-pharmaceutical healthcare research), and would that money have been invested in the first place if there hadn't been an expectation of a return on investment that took into account the risk of failure?
> Why is this scummy? It's a free market and it's not a life saving drug, they can ask for whatever they want, nobody is forcing you to buy it.
That's pretty much the only argument _for_ doing it; it's the equivalent of defending your argument as free speech - just because you _can_ say it, doesn't mean you should.
I agree, you can also add VMware to this list. They ship a free (and great, IMO) hypervisor which has the full feature set built in, with only a license key unlocking the advanced (and expensive) features. This also applies to almost every piece of demoware software.
They also disable ECC, virtualization, VFIO, and other features for consumer products. AMD and Nvidia regularly produce a single GPU chip that gets used in consumer and PRO cards, costing much more on pro cards, with the only difference being packaging and the software. In effect, pro users subsidize development for everyone else in order to avoid some friction, but IMO that's fine, because pro users (CAD, design) have the most to gain from such development.
Kinda. The manufacturing process will always produce defects in these chips. A run can produce 100 chips (let's say): 10 of those are perfect, 100% functional, and are sold as the highest-quality chip. The next 20 are 99.99% functional. These are binned into their respective segments based on how well they test and are sold as lower-end parts. Maybe a core doesn't work or it can't clock as high.
Maybe the other 70 have some major issue with the special components, so they are only 99% functional and therefore can't be used in the high-end segment, but can be used in the consumer chips.
Not every chip is perfect, which is why there exists different product classes. For some chips, an i5 is just an i7 that didn't fully pass validation.
I understand how that works. My point was just that it's not always about binning, sometimes it's all market segmentation. Drawing both sides of a polygon in a single pass is a pro GPU driver feature that has nothing to do with defects or binning. I think if Intel has unexpectedly good yield with low defects, they may end up using silicon good enough to support hyperthreading in a CPU that lacks the feature. They disable silicon-viable features for consumer chips on the regular --it's just to guarantee users of those features pay for Xeons.
If you bin based on tolerances, you kind of have only two options: either also bin chips that meet tolerances in order to satisfy demand for the lower-binned chip or introduce a separate parallel chip alongside the binned ones that actually targets their specs (which you would still have to bin downward).
It's scummy and inefficient from an obvious human perspective to intentionally cripple a superior product in order to target a cheaper market, but that's the natural result of market actors being incentivized to maximize their own return; they don't really have any reason to care about the human side of that argument.
> or introduce a separate parallel chip alongside the binned ones that actually targets their specs (which you would still have to bin downward).
You know, I hadn't actually thought of that. If you did this, presumably it would involve the cost of setting up another manufacturing line (colossal), and you'd _still_ have binning issues. Presumably the tolerances would be looser, however; if you can build within 5% power usage of X, you should be able to _easily_ build within 5% power usage of (X/2).
I assert that it's usually bad when anyone does it, because it is evidence that they have excess market power. If there were enough competition, they couldn't produce a thing with capabilities x,y, and z and then charge an unlock fee for y and z.
Lots of complex software systems are sold this way, allowing customers to add-on features a la carte. I don't see that as a bad thing -- allows people who only need a subset of the functionality to only pay for what they need. Oftentimes, the functionality is sitting there waiting to be unlocked.
Your software is all the same quality. The copy one customer has doesn't contain a series of bits that are unremovable production mistakes. Everyone's binary is capable of the same actions. Software is malleable; hardware, by its very name, is less so.
Most mainframe manufacturers will ship you a mainframe that has been configured way above what the customer ordered. Because in most cases the customer will upgrade and changing a license key is cheaper than building and shipping a new mainframe.
Supporting only 64 processors in a group on a 64-bit operating system seems like a reasonable and sane technical solution. It means you can use a single 64-bit variable as a bitmask for various processor-related functions in a process.
I would bet that many other bits of software also have this limitation because they too thought that using a 64-bit value for a processor mask would be sufficient.
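As a rough illustration (plain Python, not actual kernel code): a 64-bit affinity mask is just an integer with one bit per logical processor, so any subset of up to 64 CPUs fits in a single word.

    # Illustrative only: one bit per logical processor, <=64 CPUs per word.
    MASK_BITS = 64

    def make_mask(cpus):
        mask = 0
        for cpu in cpus:
            assert 0 <= cpu < MASK_BITS, "doesn't fit in a single 64-bit word"
            mask |= 1 << cpu
        return mask

    def allowed(mask, cpu):
        return bool(mask & (1 << cpu))

    affinity = make_mask([0, 1, 2, 3, 16, 17])    # pin to 6 specific CPUs
    print(hex(affinity), allowed(affinity, 16), allowed(affinity, 40))
    # Linux exposes the same idea as os.sched_setaffinity(0, {0, 1, 2, 3, 16, 17}).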
Ahhh, I don't think so. We're already in the place where memory bandwidth is the big killer so I can't imagine any enthusiasm for making our pointers twice as large.
Unlikely and if that's the case, trivial to fix. We're talking about masking against N potential bitmasks instead of one, which is still stupid easy and people do it all the time.
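A sketch of what "masking against N bitmasks" amounts to, assuming one 64-bit word per group of processors:

    WORD_BITS = 64

    def allowed(mask_words, cpu):
        # mask_words[i] covers logical processors i*64 .. i*64+63
        return bool(mask_words[cpu // WORD_BITS] & (1 << (cpu % WORD_BITS)))

    groups = [0xFFFF, 0x0, 0xF]    # three 64-processor groups
    print(allowed(groups, 10), allowed(groups, 70), allowed(groups, 130))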
But you need 64 bits to represent affinity masks and scheduling, i.e. which of the 64 cores are available for a thread to be scheduled onto.
----------
Furthermore, your Threadripper 3990X should really only be pushing affinities across an 8-core / 16-thread die anyway, at least as far as the Windows scheduler is concerned.
Windows programmers should use multiple thread groups for the different 16-thread CCXs, because it's extremely costly to move all of your local state from one die's L3 cache into another.
Look at the chip, like physically look at it. AMD has 9x chips here (1x I/O die for memory, and 8x compute chips, each with 8x cores). You really only want to be moving threads within those 8x cores, because moving a thread across chips is less efficient.
So in an ideal world, people would understand Windows's scheduler and work around it, instead of complaining about how it is different from Linux's. Windows's thread groups being 64-sized is reasonable for the job they want to do... and there are other parts of the API that allow you to work across Windows's 64-sized processor groups, in particular to extend your affinities into a second processor group.
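For reference, a minimal ctypes sketch of that cross-group API (SetThreadGroupAffinity is the documented call; the group index and mask below are just example values):

    import ctypes
    from ctypes import wintypes

    # GROUP_AFFINITY as used by SetThreadGroupAffinity
    class GROUP_AFFINITY(ctypes.Structure):
        _fields_ = [("Mask", ctypes.c_size_t),     # KAFFINITY: one bit per processor
                    ("Group", wintypes.WORD),
                    ("Reserved", wintypes.WORD * 3)]

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = wintypes.HANDLE
    kernel32.SetThreadGroupAffinity.restype = wintypes.BOOL
    kernel32.SetThreadGroupAffinity.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(GROUP_AFFINITY), ctypes.POINTER(GROUP_AFFINITY)]

    def pin_current_thread(group, mask):
        """Run the calling thread only on the processors in `mask` within `group`."""
        new, old = GROUP_AFFINITY(Mask=mask, Group=group), GROUP_AFFINITY()
        if not kernel32.SetThreadGroupAffinity(kernel32.GetCurrentThread(),
                                               ctypes.byref(new), ctypes.byref(old)):
            raise ctypes.WinError(ctypes.get_last_error())

    # e.g. move this thread onto the first 16 logical processors of group 1:
    # pin_current_thread(1, (1 << 16) - 1)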
There are loads of places where it’s more convenient to test a bit vector than compare a scalar value. It’s just cheaper at runtime. Of course it may be possible to cheaply extend the bit vector scheme to 128 bits or even 256 fairly easily, using the native vector registers. But I bet Windows still supports some pretty old models of CPUs.
Another way to express this is that a 64-bit value can express every combination of processors, so if you want to say something like "this process should run on these 12 specific processors" you can do so in a single 64-bit value.
It isn't sane; it's an artificial limitation and poor design. At best it was a premature code optimization.
It should be fixed by a professional coder who doesn't impose limitations, especially in an OS, based on wanting to fit bit flags into a single variable.
Windows should fix this -- there should be nearly unlimited cores per group. They can still have limitations per Windows version so they can price segment their market.
That's an uncharitable take - that pricing model supports some seriously good engineers doing amazing work. And how many people could afford tens of thousands of dollars for hardware but can't afford...
Googles it...
$84 a year per user. Or $168 for E5 per year.
That just doesn't strike me as Microsoft being bad?
I'm curious, to anyone working on Windows kernel development, at what point does a feature/scheduler improvement become so good that people decide "nah, this is way too cool to put in Windows Home Edition, let's feature-gate it to the enterprise version instead!"?
This is actually something I find Microsoft does poorly for enthusiasts (both the crowd buying these types of chips and those who care about a scheduler). They have VERY low visibility into these types of improvements.
I can find a billion blog posts about notepad supporting Linux and Windows line returns. I struggle to find anything about improvements for new hardware.
I wish there was a blog (albeit unfashionable these days) or similar which told these stories. I often feel the absence of one is a downside of data-driven decisions: these customers are likely insignificant, but their influence on others is significant (even if their advice isn’t “buy a 64 core CPU”, it will be “Windows 10 is great, use that!”).
I’ve largely been a MSFT fanboy since a young age, and I kind of get being mainstream, but tossing a bone to the enthusiastic supporters would be nice.
The problem is that there is no easy way for somebody to pay more to access this feature. You have to jump through hoops to buy it and you probably won't even get official support. Meanwhile the alternative to Windows doesn't cost anything monetarily. But at the end of the day, both of these aren't important factors yet. It's mostly a question of what OS the application you bought this CPU for runs on (or runs better on - Blender). The hardware is still expensive enough that people won't buy hardware like this "at random", but this is likely to change quickly.
Sign up for a Microsoft small business account. Via that, you get a MSDN (or whatever they call it these days) subscription. Then you can download any version of the OS and get a license key from the same webpage to activate it.
While you can download it from MSDN, and it will activate, check your licenses. In most cases, these licenses are intended for development and it is forbidden to use them in production.
Have you ever actually had that happen to you with non-physical software bought on eBay?
I've bought numerous Windows 7 and a couple of Windows 10 keys, plus a few MS Office keys, never had any issues.
Sometimes the keys would require phone activation, but that was ages ago. Nowadays they just work.
The providers give you a key that you can use to activate, and a link for the ISO, but that part is optional. The keys work with untouched ISOs, too. FWIW, I once downloaded the seller's ISO and compared the hashes, and they matched up.
I may have to add, I live in Germany, which is part of the EU, and the courts here have decided key resale is completely legal and MS has to tolerate it.
True, I was thinking of software distributed on DVDs or ISOs.
A stolen or misappropriated product key is almost as bad as a virus, though. It will work great... but only at first, until Microsoft notices that 50,000 customers are using it.
I bet on some level they know and don't care. If you're buying a cheapo key off of ebay then chances are you were never going to pay the full price. Meanwhile they're still using the software, learning how it works, and growing the Windows ecosystem further. Vaguely like a loss-leader that gets people in the door.
> I'm curious, to anyone working on Windows kernel development, at what point does a feature/scheduler improvement become so good that people decide
A kernel developer is going to be 3-4 steps removed from those decisions. In many ways they are just a consumer of a product like this too and work on their subsystem. They are unlikely to be worried because it's the product they build and might even feel that it's inexpensive for the level of engineering they invest.
I wish I could see a statistic on the number of people that would have bought Win 10 Home to run on their 64 core machine if only it had support for 64 cores!
It's going to happen soon. AMD's high-end consumer model is already very close to that number. When these become commonplace, say in 8 years, someone might find something fun or productive to do with 64 cores, and it's gonna suck if your OS is the bottleneck.
By the time it happens MS will support it in Home edition. I remember that Win2000 for example supported something like 2 or 4 CPUs in standard edition and you had to buy server editions or even datacenter edition for more.
It's not necessarily "so good", especially for workstation loads vs server loads. You have the same issue in Linux with the different config options (a simple example that's not relevant anymore: an SMP kernel is slower than a single-core one)
Yes, it could be a config option as well instead of a fixed choice.
> (a simple example that's not relevant anymore: an SMP kernel is slower than a single-core one)
The main reason this example is not relevant anymore is the "alternatives" system Linux uses nowadays, which dynamically patches the kernel code according to the hardware features (in this case, it detects it's running on a single core and removes the SMP locking code).
I remember reading in Inside Windows 2000 by Mark Russinovich & David Solomon that the non-SMP kernel was the exact same binary built from the exact same SMP-compliant code with all the SMP stuff NOP-ed out.
I've played with the AMD Daytona Rome Server (two EPYC sockets, 2*64 = 128 cores, 256 threads), with RHEL, and it rocks. However it's quite hard to find workloads that keep all 256 threads busy at once. Most builds aren't nearly parallel enough, most programs can't find work for 256 threads. So as a personal machine 128 or 256 threads aren't really worth it unless money is no object. Likely the best current use for these is as servers for running large numbers of virtual machines or containers.
I am craving one of these for my superoptimizer. The level of task parallelism I have is north of 100M independent jobs; my last run took a single-core machine 20 days. It's pretty rare to have a workload like this but as more machines ship with >16 cores, I think more developers will look at the order-of-magnitude improvements of parallelizing their tasks where possible.
It uses an SMT solver (can use Z3, Yices or Boolector) all of which are very complex and branchy. So no GPGPU - maybe some specialist in SAT solving or SMT could build that one day, but that person would not be me.
It's an SMT-based solver for finding fast "single-lane" SIMD calculations for AVX2 (doing different things in multiple lanes, and targeting AVX-512, SSE, ARM, or general-purpose registers, are all on my todo list). Branch-free code only. It can handle sequences of up to 4 or 5 instructions, including things like generating constants such as 128-bit tables for PSHUFB (considering PSHUFB as a single-lane operation where every lane looks up a table).
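For anyone curious what the SMT step looks like in miniature, here's a toy equivalence check with the Python z3 bindings (pip install z3-solver); the spec and candidate expressions are illustrative, not taken from this project:

    from z3 import BitVec, Solver, sat

    x = BitVec("x", 32)
    spec      = x & -x          # reference: isolate the lowest set bit
    candidate = x & ~(x - 1)    # candidate sequence the search proposes

    s = Solver()
    s.add(spec != candidate)    # look for any input where they differ
    if s.check() == sat:
        print("not equivalent, counterexample:", s.model())
    else:
        print("candidate is equivalent to the spec for all 32-bit inputs")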
Sounds very cool! I recently wondered about the possibility of superoptimizing vectorized code, so glad to hear about it!
Would you like to chat about opportunities to do analogous work for GPU instruction sets?
I work at a startup making high-performing GPU software and compilers, incidentally including a regex engine! (We can't quite match HyperScan on the CPU, but support capture groups and very high throughput on GPUs.) We also have several other interesting projects, and would like to start a superoptimizer at some point.
P.S. After reading your blog, one of our engineers said: "We see you like PSHUFB. So do we."
Because this is a hobby project and I was doing it at home? Eventually I will be doing this in the cloud or with a home-based server or both, but you have to start somewhere. Much of the work is currently centered around "being less stupid" rather than "try to parallelize more", as wasting tons of time doing pointless solves is not a good idea - I've already got about 80-100x speed improvements from optimizations (not looking at sequences I can prove won't work, not looking at sequences that I can prove "do the same thing as some other sequence").
These CPUs are not great for lots of VMs, because memory bandwidth then becomes the limiting factor. So you don't get the full performance out of your VMs compared to having more boxes with smaller CPUs.
They already ran the benchmarks three times just for Windows. It would be nice to see Linux numbers too, but you can't criticize the level of effort they put in.
Exactly. I'd say there's a fair chance that anyone playing about with this type of hardware will want to be tweaking fairly low-level config etc. of the type that's hidden in Windows anyhow.
I agree. An even bigger problem is that most of the benchmark suite is completely irrelevant to what these CPUs are going to be used for. These are meant for servers. They should keep that in mind when selecting the benchmarks and which operating system to use.
The Threadripper CPUs are meant for workstations / high-end desktops (HEDT), not servers. For servers, you'd want the EPYC line. The benchmark selection is somewhat reasonable, though yes, you'd want some Linux tests in there.
While some platforms are certainly designed with long term always-on use in mind, "meant to" isn't quite right.
We mostly operate sTR4s as HEDTs, but we also have X399 server boards with IPMI and 10GbE in a board configuration designed to be installed racked. To us, there's only a minor functional difference in role.
You just answered the question. It would be interesting to see TR and EPYC head-to-head in server workloads, to confirm or deny the theory that EPYC is better for them.
What benchmarks do you think would be relevant? This seems like a part that's designed for AV production, 3D and SFX work. It could be used in a server-like role, but, for $500 more, the EPYC counterpart is a much better deal.
Servers can often use more than 256 GB RAM and 4 memory channels, like the EPYC line these are based on. I'd like to see it benchmarked as a workstation for local builds (both Windows and Linux) and for accessing and processing in parallel 100 GB+ data in RAM (for quicker prototyping than GPUs, or verification). Those would be my main uses.
Not sure who needs to hear this, but another _huge_ Windows 10 limitation: no support for nested virtualization on AMD processors. This means Ryzen users can't benefit from a bunch of security improvements, and also things like the new Windows 10X emulator can't work.
Kind of off topic, but it’s the kind of nasty surprise I wouldn’t want to get after deciding to buy, so hope this helps someone.
Is this a limitation of the processor or of Windows itself? Since virtualization is not part of x86 but rather vendor-specific, is the AMD implementation difficult?
It appears to be a Windows limitation - I'm not an expert, but my understanding is that AMD supports all the same extensions for it (under AMD rather than Intel branding) and it appears the Windows team is (finally) working on it. I understand that other hypervisors do support nested virt on AMD.
To be fair to the Windows team, AMD in the data center / pro desktops wasn't really viable for a very long time, so it's understandable that it wasn't prioritized.
> 1080p60 HEVC at 3500 bitrate with "Fast" preset - 319 fps
OK, why were these parameters chosen? What's the application? I recommend everyone look at 1080p60 video footage encoded with H.265 using the "fast" encoder preset at a 3500 bitrate. Calling it terrible would be a compliment. Unless you encode really slow and visually easy motion, which raises the question of why you would need 60 fps in the first place. Even at the "medium" preset with 1080p60 you should - regardless of application - be at least in the 5000+ range with your bitrate. And even that comes with a lot of trade-offs, because that's just where live streaming starts.
I believe most of these benchmarks were set at a time when you simply couldn't run x265 on “slow” on any reasonable CPU if you ever wanted it to complete. But yes, I'd really like CPU benchmarkers to move to higher-quality presets for video encoding, because they do tend to have different kinds of performance curves.
Fun fact: There's no point in running x265 on the fastest presets unless you absolutely need to have HEVC; x264 on slow is faster _and_ gives better quality per bit. See the second graph on https://blogs.gnome.org/rbultje/2015/09/28/vp9-encodingdecod... (a few years old, the situation is likely to look similar but not identical).
>(a few years old, the situation is likely to look similar but not identical).
x265 has made massive improvements over the years; 2015-era x265 wasn't even considered good, despite all of its hype. Another way to think about it is how well x264 managed to squeeze out every last bit of detail possible.
Unsure how to tag people like dang, but is it possible to change the link or the title? The link is for page 3 of the review, which is a broader discussion about multi-threading on Windows.
If I am investing USD 4000 in a CPU, I'd probably go for the EPYC part for $500 more and twice the memory bandwidth. It'd be interesting to run these benchmarks under perf to see how many L3 cache misses happen and how much they cost in cycles.
If you want a system that can be a workhorse 128-thread monster on working days, but reach 4.5GHz boost on the weekends for CS:GO and other video games... you'll want a Threadripper, not an EPYC.
The low clock speeds, and the RDIMMs / LRDIMMs of EPYC add latencies which slow down video game (mostly single-threaded) performance.
--------
For those who don't play video games: there are a variety of single-threaded tasks still in various workplaces. A surprising amount of 3d graphics work remains single-thread bound.
In particular, modeling is typically single-thread bound (the GUI thread, where the user is clicking menus and such). Most custom scripts are single-thread bound, and 3d modelers need a LOT of custom scripts. Those scripters aren't necessarily optimization masters who know how to take advantage of multi-core architectures.
3d rendering is of course multithreaded. But the 3d artist still needs to click on a lot of menus and scripts to get the mesh to look right.
The EPYC also allows you to use more RAM... it seems the Threadripper tops out at 256GB due to the memory type (https://www.youtube.com/watch?v=1LaKH5etJoE) but EPYC would allow 2TB...
There were rumors about TRX80 and WRX80 chipsets that raised the 256GB RAM limit to 2TB for Threadripper CPUs. Alas, they never went past the rumors stage.
And the EPYC uses higher latency RDIMMs and LRDIMM RAM. Traditionally, server RAM (RDIMM or LRDIMMs) is 50ns+ slower than consumer UDIMMs.
Video games and pointer-chasing care more about that latency figure than bandwidth. I'm sure there are bandwidth-bound tasks, but any latency-bound task would prefer highly-clocked, over-volted (1.35V) DDR4 UDIMMs at 3200 MT/s CAS16 or faster.
Server RDIMMs are naturally slower, due to the register (and LRDIMMs are probably even slower). RDIMMs and LRDIMMs are designed for capacity more so than latency: you can have 1TB of LRDIMMs RAM on one machine but all the RAM runs slightly slower (3200MT/s maybe CAS22 or slower) as a result.
Use multiprocessing [0] if you really want to stick with Python. You're leaving a lot of performance on the table due to inter-process communication, but Python isn't exactly known for being the fastest programming language. I assume your goal is making legacy applications run on multiple cores, and multiprocessing is fine for that.
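Roughly what that looks like, if it helps (a minimal sketch; `crunch` is a stand-in for whatever the real CPU-bound work is):

    from multiprocessing import Pool, cpu_count

    def crunch(n):
        # stand-in for real CPU-bound work; runs in a separate process,
        # so the parent interpreter's GIL doesn't serialize it
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool(processes=cpu_count()) as pool:
            results = pool.map(crunch, range(1_000, 1_032))
        print(len(results), "results")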
Don't know why the downvotes; this limitation of Python is really hard to fathom. IIRC they had a project to remove it and it went into the weeds somehow (I think it made performance worse?)
Python has GIL -> Python apps are mostly single-threaded -> Single-threaded performance is important -> Adding granular locking has impact on single-thread performance -> CPython "isn't supposed to have perf regressions" -> Python has GIL
Because e.g. nodejs has a GIL, too, and apparently no one thinks this is a problem.
For web applications one usually has a software chain like web server <-> wsgi server <-> dozens of python instances.
Standalone processes just implement threading, which also is fairly easy (as far as threading itself can be easy).
Scientific libraries like scipy can use parallel processes automatically in the background (using things like BLAS), as long as the data is modelled correctly.
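e.g. something like the following will typically keep every core busy without any explicit threading in the Python code, assuming NumPy is linked against a multithreaded BLAS (OpenBLAS, MKL, ...):

    import numpy as np

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)
    c = a @ b    # the matrix multiply is handed to BLAS, which parallelizes
                 # across cores and releases the GIL while it runs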
I'm just gonna repost what I've written about this before:
The current state of threading and parallel processing in Python is a joke. While they are still clinging to the GIL and single-core performance, the rest of the world is moving to 32 core (consumer) CPUs.
Python's performance, in general, is crappy[1] and is beaten even by PHP these days. All the people that suggest relying on multiprocessing probably haven't done anything that's CPU- and memory-intensive, because if you have code that operates on a "world state", each new process will have to copy that from the parent. If the state takes ~10GB, each process will multiply that.
Others keep suggesting Cython. Well, guess what? If I am required to use another programming language to use threads, I might as well go with Go/Rust/Java instead and save the trouble of dabbling with two languages.
So where does that leave (pure-)Python? It can only be used in I/O bound applications where the performance of the VM itself doesn't matter. So it's basically only used by web/desktop applications that CRUD the databases.
It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy. However, my suggestion would be to drop the GIL and copy whatever model has been working for Go/Java/C#. If you can't drop the GIL because some esoteric features depend on it, then drop them as well.
Every single project which has tried to drop the GIL has failed in some way. It's not some "esoteric features", it's fundamentally a hard problem that implicates the entirety of the python object model, python C api, scoping, imports, and GC.
I think multi-interpreting is the way to go, but that still would require a framework for ensuring safe memory access.
Speaking of Go, I always thought it would be neat to write a python implementation in Go, but leverage Go's GC, and implement the 'go' keyword/function for easy parallelism. But you still have the problem of scoping and memory safety. Or similar idea but with Rust. Something tells me that isn't a trivial undertaking, especially if you want all the libraries, which is 75% the point of python.
> All the people that suggest relying on multiprocessing probably haven't done anything that's CPU and Memory intensive because if you have a code that operates on a "world-state" each new process will have to copy that from a parent. If the state takes ~10GB each process will multiply that.
This is wrong; there are multiple ways for Python processes to work on shared data without each copying it.
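For example, with multiprocessing.shared_memory (Python 3.8+) workers attach to the parent's buffer by name instead of copying it (the array here is just a small stand-in for a large world state):

    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(name, shape, dtype):
        shm = shared_memory.SharedMemory(name=name)          # attach, no copy
        view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        print("worker sees sum:", view.sum())
        shm.close()

    if __name__ == "__main__":
        data = np.arange(1_000_000, dtype=np.float64)        # stand-in "world state"
        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data
        p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
        p.start(); p.join()
        shm.close(); shm.unlink()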
> It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy.
Well, the main implementation of the whole language is C-based. I can't see how that implies hackiness.
> If you can't drop GIL because some esoteric features depend on that, then drop them as well.
There have been multiple Python implementations without a GIL available for way over 10 years, for example PyPy, IronPython and Jython. Yet these never went mainstream, which strongly implies the GIL problem actually isn't that much of a problem in the real world.
There are (IMHO) two scenarios where this doesn't matter: if you're writing a web service, then the multi-threading (processing) is done by the front end, i.e. Nginx; and for 'scientific' or similarly computationally expensive stuff, Python is turning into a scripting layer over C libraries. Have a look at numpy to see what I mean.
Any details on that? We're spinning up a 48 core epyc soon on 12.1. The last "beefy" server we did was a 32 core intel on 11.x, and I'll be happy if these changes are in 12.1.
Honestly. Windows is just a waste on that architecture as it stands. Run Linux as base OS and give 8 cores to the Windows VM should you have use for it.
288MB of cache. I wonder if we'll ever be given some control over how and what gets cached. It seems like a lot of memory to leave to simple heuristics.
Such control has been available for many years. On x86 you have "non-temporal" load and store instructions [1] and the Cache Disable bit in Control Register 0 [2] which can be used to suggest which bytes are not suitable for caching. C also has posix_madvise() [3] which is somewhat relevant.
I feel the cache in these modern CPUs is being used extremely efficiently. If they have a branch predictor that is 99% accurate, I have faith in letting the same engineers manage my cache eviction strategy. The extreme scope of modern OoO execution strategies would probably make detailed CPU cache management more of a liability than an asset.
One simple policy would be to try to ensure your application is as small as reasonably possible. If you can fit your entire executable image in a fraction of your L3 you are probably sitting in a really good spot.
I’ve heard these “simple heuristics” are multi-megabyte neural networks these days. I believe I heard that claim in this video:
https://youtu.be/ymcOLL2qEg8
It's possible that Tesla does something different that makes sense in their specific situation, but general purpose CPUs - as far as I know - don't use NNs for this.
What is the latest on just turning off SMT/Hyperthreading? Then you don't run into the greater-than-64-threads issue with this CPU. I remember there being a reason to turn it off unrelated to performance, but I do not remember if there was more than one reason [0].
Yes, but if you care about 64 threads without the possible side-channel issues that SMT currently/theoretically has, then you are back to the 3990X.
For what it is worth, I have an AMD Ryzen 2700X Eight-Core Processor that I got in 2018, and I keep SMT off. I do some light gaming with it, and I am happy. I did not notice a big drop in performance, but I did not truly measure the difference.