The article presents the situation as if the operating system's limit on the number of threads were not a conscious choice by Microsoft but a natural law that cannot be avoided.
Microsoft deliberately imposes these limitations in order to force people to pay more for its software. It's a shame, really.
Isn’t that the same with a lot of hardware? Like CPUs where you have to pay more to get a feature unlocked that’s there but unusable when you do not pay extra? And also the same for almost all software with multiple feature sets. I don’t understand why it’s bad when MS does it.
Hardware is _kind of_ different. Hardware is usually binned for tolerances/expected demand. e.g. the manufacturing process for an i5 and an i3 might be the same, but the acceptable tolerances for an i3 are lower. If the part fails the test for an i5, it will be sold as an i3, and the extra features disabled. This makes sense; otherwise you end up with a part lottery and bad press.
(of course they do also bin based on demand/sales, which is scummy.)
>of course they do also bin based on demand/sales, which is scummy.
Why is this scummy? It's a free market and it's not a life saving drug, they can ask for whatever they want, nobody is forcing you to buy it.
That's like saying "Software engineers are paid according to demand/market which is scummy."
The best business strategy for companies and professionals alike is to find a niche with less competition where you are a leader (cough, monopoly, cough) and charge whatever the market will bear. If Apple/Microsoft/AMD/Intel can get away with that, it means the market can bear it. The fact that you don't like it is another issue.
Because it's price gouging. They would be actively restricting supply of their highest-performing parts for no other reason than to charge a higher price for them. Software engineers don't do this, they're highly paid because demand is so huge that it outstrips even an ample supply.
Of course, in practice, it can be hard to tell whether intentional "crippling" is going on. Yield problems are common and might manifest in the exact same way, after all - for example, chips that don't "make timing" for their design frequency can be sold for lower specs.
No, it's deliberate discounting of economy products!
What's the difference? I don't know either. Price discrimination is a very real thing that exists virtually everywhere. It's why the market will bear Beats headphones in the same shops that sell generics at 20% the price.
At the end of the day, a manufacturer doesn't "owe" you something at cost. The idea of capitalism is that competition in the market will drive prices down instead, and it does. But one side effect is that customers and sub-markets that are able to bear higher markups on their versions of stuff are asked to pay it even when it doesn't make sense at the level of per-part costs.
So it's true that that Beats headset costs only $4 to make and "should" be available much cheaper than it is. But it isn't, and the reason is simply that people are willing to buy them at $200. If you're not, that seems unfair, I agree. But... what's the alternative?
That's a very nice and non-scummy way of putting it, yes. If that deep discount compared to the "full-featured" price were made clearer, I assume that many people would not even find it scummy that these discounts are conditioned on being sold a "crippled" version of the product. So, depending on the "framing" you put on this, you can nicely account for either of these perspectives.
I'm not really sure what you want them to do in their marketing materials. It's not like Intel keeps it a secret that the HEDT parts are just Xeons with higher boost clocks and ECC disabled. The general public doesn't even understand what ECC is, nor are they considering buying a $1000+ CPU. I doubt most people even understand hyperthreading (the main difference between i7 and i5).
The important point was the economics in the later stuff. Price discrimination isn't about "framing", it's just the way competition works in the case of multiple submarkets needing similar products but having different price tolerances. You can't "fix" this with framing or spin, it's just the way markets work.
Market segmentation is what allows you to have cheap things. Next time you walk through first class to your coach seat remember that those people are subsidizing you.
"price gouging" is a myopic view of what goes on in the CPU/GPU market.
Intel/AMD could not afford to sell hardware at consumer prices if buyers could turn around and put those consumer parts in servers at scale. In the alternate universe where certain features aren't disabled for consumer parts, your i7 costs as much as a Xeon; you don't get Xeon features for i7 money.
Yes, as someone has said elsewhere, they might essentially be selling you their chips at a deep "discount", while making that discount conditional on forgoing some desirable features. It's pretty much the same thing, just with a different framing placed on it.
One of the reasons that free market capitalism enjoys wide support in the western world is because of the "consumer surplus".
That is because for the vast majority of transactions, there is a very large gap between the highest price a consumer is willing to pay and the lowest price a producer is willing to accept. The canonical example is for essentials like food: people are willing to pay every single penny they have for food for their family. OTOH, farmers are willing to sell wheat in bulk for under 10 cents a pound.
For obvious reasons farmers cannot set different prices for each pound of wheat to get "all the money". It's seen as deeply unfair when the vast majority of industries are price takers while a select few can charge the maximum a customer can bear. If every industry could effectively price differentiate, we'd be spending all our money on food with no money left to buy CPUs, and our whole free market system would collapse.
Not sure I'm getting all the subtleties of your argument, but take food: if, say, they suddenly charged 100x the price for wheat, people would switch to something different.
Of course people need food, but there are many kinds of food and it's competitive between the different foods.
Lifesaving drugs are something where it's often difficult to substitute.
But even if it were a life-saving drug, would it exist at that point in time (rather than a few decades later) if billions of dollars hadn't been spent to bring it to market (instead of investing the money in any one of a number of other things, including non-pharmaceutical healthcare research), and would that money have been invested in the first place if there hadn't been an expectation of a return on investment that took into account the risk of failure?
> Why is this scummy? It's a free market and it's not a life saving drug, they can ask for whatever they want, nobody is forcing you to buy it.
That's pretty much the only argument _for_ doing it; it's the equivalent of defending your argument as free speech - just because you _can_ say it, doesn't mean you should.
I agree, you can also add VMware to this list. They ship a free (and great, IMO) hypervisor which has the full feature set built in, with only a license key unlocking the advanced (and expensive) features. This also applies to almost every piece of demoware software.
They also disable ECC, virtualization, VFIO, and other features for consumer products. AMD and Nvidia regularly produce a single GPU chip that gets used in consumer and PRO cards, costing much more on pro cards, with the only difference being packaging and the software. In effect, pro users subsidize development for everyone else in order to avoid some friction, but IMO that's fine, because pro users (CAD, design) have the most to gain from such development.
Kinda. The manufacturing process will always produce defects in these chips. A run can produce 100 chips (let's say): 10 of those are perfect, 100% functional, and are sold as the highest-quality chip. The next 20 are 99.99% functional. These are binned into their respective segments based on how well they test and are sold as lower-end parts. Maybe a core doesn't work or it can't clock as high.
Maybe the other 70 have some major issue with the special components, so they are only 99% functional and therefore can't be used in the high-end segment, but can be used in the consumer chips.
Not every chip is perfect, which is why there exists different product classes. For some chips, an i5 is just an i7 that didn't fully pass validation.
I understand how that works. My point was just that it's not always about binning, sometimes it's all market segmentation. Drawing both sides of a polygon in a single pass is a pro GPU driver feature that has nothing to do with defects or binning. I think if Intel has unexpectedly good yield with low defects, they may end up using silicon good enough to support hyperthreading in a CPU that lacks the feature. They disable silicon-viable features for consumer chips on the regular --it's just to guarantee users of those features pay for Xeons.
If you bin based on tolerances, you kind of have only two options: either also bin chips that meet tolerances in order to satisfy demand for the lower-binned chip or introduce a separate parallel chip alongside the binned ones that actually targets their specs (which you would still have to bin downward).
It's scummy and inefficient from an obvious human perspective to intentionally cripple a superior product in order to target a cheaper market, but that's the natural result of market actors being incentivized to maximize their own return; they don't really have any reason to care about the human side of that argument.
> or introduce a separate parallel chip alongside the binned ones that actually targets their specs (which you would still have to bin downward).
You know, I hadn't actually thought of that. If you did this, presumably it would involve the cost of setting up another manufacturing line (colossal), and you'd _still_ have binning issues. Presumably the tolerances would be looser, however; if you can build within 5% power usage of X, you should be able to _easily_ build within 5% power usage of (X/2).
I assert that it's usually bad when anyone does it, because it is evidence that they have excess market power. If there were enough competition, they couldn't produce a thing with capabilities x,y, and z and then charge an unlock fee for y and z.
Lots of complex software systems are sold this way, allowing customers to add-on features a la carte. I don't see that as a bad thing -- allows people who only need a subset of the functionality to only pay for what they need. Oftentimes, the functionality is sitting there waiting to be unlocked.
Your software is all the same quality. The copy one customer has doesn't contain a series of bits that are unremovable production mistakes. Everyone's binary is capable of the same actions. Software is malleable; hardware, by its very name, is less so.
Most mainframe manufacturers will ship you a mainframe that has been configured way above what the customer ordered. Because in most cases the customer will upgrade and changing a license key is cheaper than building and shipping a new mainframe.
Supporting only 64 processors in a group on a 64-bit operating system seems like a reasonable and sane technical solution. It means you can use a single 64-bit variable as a bitmask for various processor-related functions in a process.
I would bet that many other bits of software also have this limitation because they too thought that using a 64-bit value for a processor mask would be sufficient.
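As a rough illustration (plain Python, not actual kernel code): a 64-bit affinity mask is just an integer with one bit per logical processor, so any subset of up to 64 CPUs fits in a single word.

    # Illustrative only: one bit per logical processor, <=64 CPUs per word.
    MASK_BITS = 64

    def make_mask(cpus):
        mask = 0
        for cpu in cpus:
            assert 0 <= cpu < MASK_BITS, "doesn't fit in a single 64-bit word"
            mask |= 1 << cpu
        return mask

    def allowed(mask, cpu):
        return bool(mask & (1 << cpu))

    affinity = make_mask([0, 1, 2, 3, 16, 17])    # pin to 6 specific CPUs
    print(hex(affinity), allowed(affinity, 16), allowed(affinity, 40))
    # Linux exposes the same idea as os.sched_setaffinity(0, {0, 1, 2, 3, 16, 17}).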
Ahhh, I don't think so. We're already in the place where memory bandwidth is the big killer so I can't imagine any enthusiasm for making our pointers twice as large.
Unlikely and if that's the case, trivial to fix. We're talking about masking against N potential bitmasks instead of one, which is still stupid easy and people do it all the time.
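A sketch of what "masking against N bitmasks" amounts to, assuming one 64-bit word per group of processors:

    WORD_BITS = 64

    def allowed(mask_words, cpu):
        # mask_words[i] covers logical processors i*64 .. i*64+63
        return bool(mask_words[cpu // WORD_BITS] & (1 << (cpu % WORD_BITS)))

    groups = [0xFFFF, 0x0, 0xF]    # three 64-processor groups
    print(allowed(groups, 10), allowed(groups, 70), allowed(groups, 130))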
But you need 64 bits to represent affinity masks and scheduling, i.e. which of the 64 cores are available for a thread to be scheduled onto.
----------
Furthermore, your Threadripper 3990X should really only be pushing affinities across an 8-core / 16-thread die anyway, at least as far as the Windows scheduler is concerned.
Windows programmers should use multiple thread groups for the different 16-thread CCXs, because it's extremely costly to move all of your local state from one die's L3 cache into another.
Look at the chip, like physically look at it. AMD has 9x chips here (1x I/O die for memory, and 8x compute chips, each with 8x cores). You really only want to be moving threads within those 8x cores, because moving a thread across chips is less efficient.
So in an ideal world, people would understand Windows's scheduler and work around it, instead of complaining about how it is different from Linux's. Windows's thread groups being 64-sized is reasonable for the job they want to do... and there are other parts of the API that allow you to work across Windows's 64-sized processor groups, in particular to extend your affinities into a second processor group.
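For reference, a minimal ctypes sketch of that cross-group API (SetThreadGroupAffinity is the documented call; the group index and mask below are just example values):

    import ctypes
    from ctypes import wintypes

    # GROUP_AFFINITY as used by SetThreadGroupAffinity
    class GROUP_AFFINITY(ctypes.Structure):
        _fields_ = [("Mask", ctypes.c_size_t),     # KAFFINITY: one bit per processor
                    ("Group", wintypes.WORD),
                    ("Reserved", wintypes.WORD * 3)]

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = wintypes.HANDLE
    kernel32.SetThreadGroupAffinity.restype = wintypes.BOOL
    kernel32.SetThreadGroupAffinity.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(GROUP_AFFINITY), ctypes.POINTER(GROUP_AFFINITY)]

    def pin_current_thread(group, mask):
        """Run the calling thread only on the processors in `mask` within `group`."""
        new, old = GROUP_AFFINITY(Mask=mask, Group=group), GROUP_AFFINITY()
        if not kernel32.SetThreadGroupAffinity(kernel32.GetCurrentThread(),
                                               ctypes.byref(new), ctypes.byref(old)):
            raise ctypes.WinError(ctypes.get_last_error())

    # e.g. move this thread onto the first 16 logical processors of group 1:
    # pin_current_thread(1, (1 << 16) - 1)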
There are loads of places where it’s more convenient to test a bit vector than compare a scalar value. It’s just cheaper at runtime. Of course it may be possible to cheaply extend the bit vector scheme to 128 bits or even 256 fairly easily, using the native vector registers. But I bet Windows still supports some pretty old models of CPUs.
Another way to express this is that a 64-bit value can express every combination of processors, so if you want to say something like "this process should run on these 12 specific processors" you can do so in a single 64-bit value.
It isn't sane; it's an artificial limitation and poor design. At best it was a premature code optimization.
It should be fixed by a professional coder who doesn't impose limitations, especially in an OS, based on wanting to fit bit flags into a single variable.
Windows should fix this -- there should be nearly unlimited cores per group. They can still have limitations per Windows version so they can price segment their market.
That's an uncharitable take - that pricing model supports some seriously good engineers doing amazing work. And how many people could afford tens of thousands of dollars for hardware but can't afford...
Googles it...
$84 a year per user. Or $168 for E5 per year.
That just doesn't strike me as Microsoft being bad?
I'm curious, to anyone working on Windows kernel development, at what point does a feature/scheduler improvement become so good that people decide "nah, this is way too cool to put in Windows Home Edition, let's feature-gate it to the enterprise version instead!"?
This is actually something I find Microsoft does poorly for enthusiasts (both the crowd buying these types of chips and those who care about a scheduler). They have VERY low visibility into these types of improvements.
I can find a billion blog posts about notepad supporting Linux and Windows line returns. I struggle to find anything about improvements for new hardware.
I wish there was a blog (albeit unfashionable these days) or similar which told these stories. I often feel the absence of one is a downside of data-driven decisions: these customers are likely insignificant, but their influence on others is significant (even if their advice isn’t “buy a 64 core CPU”, it will be “Windows 10 is great, use that!”).
I’ve largely been a MSFT fanboy since a young age, and I kind of get being mainstream, but tossing a bone to the enthusiastic supporters would be nice.
The problem is that there is no easy way for somebody to pay more to access this feature. You have to jump through hoops to buy it and you probably won't even get official support. Meanwhile the alternative to Windows doesn't cost anything monetarily. But at the end of the day, both of these aren't important factors yet. It's mostly a question of what OS the application you bought this CPU for runs on (or runs better on - Blender). The hardware is still expensive enough that people won't buy hardware like this "at random", but this is likely to change quickly.
Sign up for a Microsoft small business account. Via that, you get a MSDN (or whatever they call it these days) subscription. Then you can download any version of the OS and get a license key from the same webpage to activate it.
While you can download it from MSDN, and it will activate, check your licenses. In most cases, these licenses are intended for development and it is forbidden to use them in production.
Have you ever actually had that happen to you with non-physical software bought on eBay?
I've bought numerous Windows 7 and a couple of Windows 10 keys, plus a few MS Office keys, never had any issues.
Sometimes the keys would require phone activation, but that was ages ago. Nowadays they just work.
The providers give you a key that you can use to activate, and a link for the ISO, but that part is optional. The keys work with untouched ISOs, too. FWIW, I once downloaded the seller's ISO and compared the hashes, and they matched up.
I may have to add, I live in Germany, which is part of the EU, and the courts here have decided key resale is completely legal and MS has to tolerate it.
True, I was thinking of software distributed on DVDs or ISOs.
A stolen or misappropriated product key is almost as bad as a virus, though. It will work great... but only at first, until Microsoft notices that 50,000 customers are using it.
I bet on some level they know and don't care. If you're buying a cheapo key off of ebay then chances are you were never going to pay the full price. Meanwhile they're still using the software, learning how it works, and growing the Windows ecosystem further. Vaguely like a loss-leader that gets people in the door.
> I'm curious, to anyone working on Windows kernel development, at what point does a feature/scheduler improvement become so good that people decide
A kernel developer is going to be 3-4 steps removed from those decisions. In many ways they are just a consumer of a product like this too and work on their subsystem. They are unlikely to be worried because it's the product they build and might even feel that it's inexpensive for the level of engineering they invest.
I wish I could see a statistic on the number of people that would have bought Win 10 Home to run on their 64 core machine if only it had support for 64 cores!
It's going to happen soon. AMD's high-end consumer model is already very close to that number. When these become commonplace, say in 8 years, someone might find something fun or productive to do with 64 cores, and it's gonna suck if your OS is the bottleneck.
By the time it happens MS will support it in Home edition. I remember that Win2000 for example supported something like 2 or 4 CPUs in standard edition and you had to buy server editions or even datacenter edition for more.
It's not necessarily "so good", especially for workstation loads vs server loads. You have the same issue in Linux with the different config options (a simple example that's not relevant anymore: an SMP kernel is slower than a single-core one)
Yes, it could be a config option as well instead of a fixed choice.
> (a simple example that's not relevant anymore: an SMP kernel is slower than a single-core one)
The main reason this example is not relevant anymore is the "alternatives" system Linux uses nowadays, which dynamically patches the kernel code according to the hardware features (in this case, it detects it's running on a single core and removes the SMP locking code).
I remember reading in Inside Windows 2000 by Mark Russinovich & David Solomon that the non-SMP kernel was the exact same binary built from the exact same SMP-compliant code with all the SMP stuff NOP-ed out.
I've played with the AMD Daytona Rome Server (two EPYC sockets, 2*64 = 128 cores, 256 threads), with RHEL, and it rocks. However it's quite hard to find workloads that keep all 256 threads busy at once. Most builds aren't nearly parallel enough, most programs can't find work for 256 threads. So as a personal machine 128 or 256 threads aren't really worth it unless money is no object. Likely the best current use for these is as servers for running large numbers of virtual machines or containers.
I am craving one of these for my superoptimizer. The level of task parallelism I have is north of 100M independent jobs; my last run took a single-core machine 20 days. It's pretty rare to have a workload like this but as more machines ship with >16 cores, I think more developers will look at the order-of-magnitude improvements of parallelizing their tasks where possible.
It uses an SMT solver (can use Z3, Yices or Boolector) all of which are very complex and branchy. So no GPGPU - maybe some specialist in SAT solving or SMT could build that one day, but that person would not be me.
It's an SMT-based solver for finding fast "single-lane" SIMD calculations for AVX2 (doing different things in multiple lanes, and targeting AVX-512, SSE, ARM, or general-purpose registers, are all on my todo list). Branch-free code only. It can handle sequences of up to 4 or 5 instructions, including things like generating constants such as 128-bit tables for PSHUFB (considering PSHUFB as a single-lane operation where every lane looks up a table).
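For anyone curious what the SMT step looks like in miniature, here's a toy equivalence check with the Python z3 bindings (pip install z3-solver); the spec and candidate expressions are illustrative, not taken from this project:

    from z3 import BitVec, Solver, sat

    x = BitVec("x", 32)
    spec      = x & -x          # reference: isolate the lowest set bit
    candidate = x & ~(x - 1)    # candidate sequence the search proposes

    s = Solver()
    s.add(spec != candidate)    # look for any input where they differ
    if s.check() == sat:
        print("not equivalent, counterexample:", s.model())
    else:
        print("candidate is equivalent to the spec for all 32-bit inputs")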
Sounds very cool! I recently wondered about the possibility of superoptimizing vectorized code, so glad to hear about it!
Would you like to chat about opportunities to do analogous work for GPU instruction sets?
I work at a startup making high-performing GPU software and compilers, incidentally including a regex engine! (We can't quite match HyperScan on the CPU, but support capture groups and very high throughput on GPUs.) We also have several other interesting projects, and would like to start a superoptimizer at some point.
P.S. After reading your blog, one of our engineers said: "We see you like PSHUFB. So do we."
Because this is a hobby project and I was doing it at home? Eventually I will be doing this in the cloud or with a home-based server or both, but you have to start somewhere. Much of the work is currently centered around "being less stupid" rather than "try to parallelize more", as wasting tons of time doing pointless solves is not a good idea - I've already got about 80-100x speed improvements from optimizations (not looking at sequences I can prove won't work, not looking at sequences that I can prove "do the same thing as some other sequence").
These CPUs are not great for lots of VMs, because memory bandwidth then becomes the limiting factor. So you don't get the full performance out of your VMs compared to having more boxes with smaller CPUs.
They already ran the benchmarks three times just for Windows. It would be nice to see Linux numbers too, but you can't criticize the level of effort they put in.
Exactly. I'd say there's a fair chance that anyone playing about with this type of hardware will want to be tweaking fairly low-level config etc. of the type that's hidden in Windows anyhow.
I agree. An even bigger problem is that most of the benchmark suite is completely irrelevant to what these CPUs are going to be used for. These are meant for servers. They should keep that in mind when selecting the benchmarks and which operating system to use.
The Threadripper CPUs are meant for workstations / high-end desktops (HEDT), not servers. For servers, you'd want the EPYC line. The benchmark selection is somewhat reasonable, though yes, you'd want some Linux tests in there.
While some platforms are certainly designed with long term always-on use in mind, "meant to" isn't quite right.
We mostly operate sTR4s as HEDTs, but we also have X399 server boards with IPMI and 10GbE in a board configuration designed to be installed racked. To us, there's only a minor functional difference in role.
You just answered the question. It would be interesting to see TR and EPYC head-to-head in server workloads, to confirm or deny the theory that EPYC is better for them.
What benchmarks do you think would be relevant? This seems like a part that's designed for AV production, 3D and SFX work. It could be used in a server-like role, but, for $500 more, the EPYC counterpart is a much better deal.
Servers can often use more than 256 GB RAM and 4 memory channels, like the EPYC line these are based on. I'd like to see it benchmarked as a workstation for local builds (both Windows and Linux) and for accessing and processing in parallel 100 GB+ data in RAM (for quicker prototyping than GPUs, or verification). Those would be my main uses.
Not sure who needs to hear this, but another _huge_ Windows 10 limitation: no support for nested virtualization on AMD processors. This means Ryzen users can't benefit from a bunch of security improvements, and also things like the new Windows 10X emulator can't work.
Kind of off topic, but it’s the kind of nasty surprise I wouldn’t want to get after deciding to buy, so hope this helps someone.
Is this a limitation of the processor or of Windows itself? Since virtualization is not part of x86 but rather vendor-specific, is the AMD implementation difficult?
It appears to be a Windows limitation - I'm not an expert, but my understanding is that AMD supports all the same extensions for it (under AMD rather than Intel branding) and it appears the Windows team is (finally) working on it. I understand that other hypervisors do support nested virt on AMD.
To be fair to the Windows team, AMD in the data center / pro desktops wasn't really viable for a very long time, so it's understandable that it wasn't prioritized.
> 1080p60 HEVC at 3500 bitrate with "Fast" preset - 319 fps
OK, why were these parameters chosen? What's the application? I recommend everyone look at 1080p60 video footage encoded with H.265 using the "fast" encoder preset at a 3500 bitrate. Calling it terrible would be a compliment. Unless you encode really slow and visually easy motion, which raises the question of why you would need 60 fps in the first place. Even at the "medium" preset with 1080p60 you should - regardless of application - be at least in the 5000+ range with your bitrate. And even that comes with a lot of trade-offs, because that's just where live streaming starts.
I believe most of these benchmarks were set at a time when you simply couldn't run x265 on “slow” on any reasonable CPU if you ever wanted it to complete. But yes, I'd really like CPU benchmarkers to move to higher-quality presets for video encoding, because they do tend to have different kinds of performance curves.
Fun fact: There's no point in running x265 on the fastest presets unless you absolutely need to have HEVC; x264 on slow is faster _and_ gives better quality per bit. See the second graph on https://blogs.gnome.org/rbultje/2015/09/28/vp9-encodingdecod... (a few years old, the situation is likely to look similar but not identical).
>(a few years old, the situation is likely to look similar but not identical).
x265 has made massive improvements over the years; 2015-era x265 wasn't even considered good, despite all of its hype. Another way to think about it is how well x264 managed to squeeze out every last bit of detail possible.
Unsure how to tag people like dang, but is it possible to change the link or the title? The link is for page 3 of the review, which is a broader discussion about multi-threading on Windows.
If I am investing USD 4000 in a CPU, I'd probably go for the EPYC part for $500 more and twice the memory bandwidth. It'd be interesting to run these benchmarks under perf to see how many L3 cache misses happen and how much they cost in cycles.
If you want a system that can be a workhorse 128-thread monster on working days, but reach 4.5GHz boost on the weekends for CS:GO and other video games... you'll want a Threadripper, not an EPYC.
The low clock speeds, and the RDIMMs / LRDIMMs of EPYC add latencies which slow down video game (mostly single-threaded) performance.
--------
For those who don't play video games: there are a variety of single-threaded tasks still in various workplaces. A surprising amount of 3d graphics work remains single-thread bound.
In particular, modeling is typically single-thread bound (the GUI thread, where the user is clicking menus and such). Most custom scripts are single-thread bound, and 3d modelers need a LOT of custom scripts. Those scripters aren't necessarily optimization masters who know how to take advantage of multi-core architectures.
3d rendering is of course multithreaded. But the 3d artist still needs to click on a lot of menus and scripts to get the mesh to look right.
The EPYC also allows you to use more RAM... it seems the Threadripper tops out at 256GB due to the memory type (https://www.youtube.com/watch?v=1LaKH5etJoE) but EPYC would allow 2TB...
There were rumors about TRX80 and WRX80 chipsets that raised the 256GB RAM limit to 2TB for Threadripper CPUs. Alas, they never went past the rumors stage.
And the EPYC uses higher latency RDIMMs and LRDIMM RAM. Traditionally, server RAM (RDIMM or LRDIMMs) is 50ns+ slower than consumer UDIMMs.
Video games and pointer-chasing care more about that latency figure than bandwidth. I'm sure there are bandwidth-bound tasks, but any latency-bound task would prefer highly-clocked, over-volted (1.35V) DDR4 UDIMMs at 3200 MT/s CAS16 or faster.
Server RDIMMs are naturally slower, due to the register (and LRDIMMs are probably even slower). RDIMMs and LRDIMMs are designed for capacity more so than latency: you can have 1TB of LRDIMMs RAM on one machine but all the RAM runs slightly slower (3200MT/s maybe CAS22 or slower) as a result.
Use multiprocessing [0] if you really want to stick with Python. You're leaving a lot of performance on the table due to inter-process communication, but Python isn't exactly known for being the fastest programming language. I assume your goal is making legacy applications run on multiple cores, and multiprocessing is fine for that.
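Roughly what that looks like, if it helps (a minimal sketch; `crunch` is a stand-in for whatever the real CPU-bound work is):

    from multiprocessing import Pool, cpu_count

    def crunch(n):
        # stand-in for real CPU-bound work; runs in a separate process,
        # so the parent interpreter's GIL doesn't serialize it
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with Pool(processes=cpu_count()) as pool:
            results = pool.map(crunch, range(1_000, 1_032))
        print(len(results), "results")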
Don't know why the downvotes; this limitation of Python is really hard to fathom. IIRC they had a project to remove it and it went into the weeds somehow (I think it made performance worse?)
Python has GIL -> Python apps are mostly single-threaded -> Single-threaded performance is important -> Adding granular locking has impact on single-thread performance -> CPython "isn't supposed to have perf regressions" -> Python has GIL
Because e.g. nodejs has a GIL, too, and apparently no one thinks this is a problem.
For web applications one usually has a software chain like web server <-> wsgi server <-> dozens of python instances.
Standalone processes just implement threading, which also is fairly easy (as far as threading itself can be easy).
Scientific libraries like scipy can use parallel processes automatically in the background (using things like BLAS), as long as the data is modelled correctly.
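e.g. something like the following will typically keep every core busy without any explicit threading in the Python code, assuming NumPy is linked against a multithreaded BLAS (OpenBLAS, MKL, ...):

    import numpy as np

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)
    c = a @ b    # the matrix multiply is handed to BLAS, which parallelizes
                 # across cores and releases the GIL while it runs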
I'm just gonna repost what I've written about this before:
The current state of threading and parallel processing in Python is a joke. While they are still clinging to the GIL and single-core performance, the rest of the world is moving to 32 core (consumer) CPUs.
Python's performance, in general, is crappy[1] and is beaten even by PHP these days. All the people that suggest relying on multiprocessing probably haven't done anything that's CPU- and memory-intensive, because if you have code that operates on a "world state", each new process will have to copy that from the parent. If the state takes ~10GB, each process will multiply that.
Others keep suggesting Cython. Well, guess what? If I am required to use another programming language to use threads, I might as well go with Go/Rust/Java instead and save the trouble of dabbling with two languages.
So where does that leave (pure-)Python? It can only be used in I/O bound applications where the performance of the VM itself doesn't matter. So it's basically only used by web/desktop applications that CRUD the databases.
It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy. However, my suggestion would be to drop the GIL and copy whatever model has been working for Go/Java/C#. If you can't drop the GIL because some esoteric features depend on it, then drop them as well.
Every single project which has tried to drop the GIL has failed in some way. It's not some "esoteric features", it's fundamentally a hard problem that implicates the entirety of the python object model, python C api, scoping, imports, and GC.
I think multi-interpreting is the way to go, but that still would require a framework for ensuring safe memory access.
Speaking of Go, I always thought it would be neat to write a python implementation in Go, but leverage Go's GC, and implement the 'go' keyword/function for easy parallelism. But you still have the problem of scoping and memory safety. Or similar idea but with Rust. Something tells me that isn't a trivial undertaking, especially if you want all the libraries, which is 75% the point of python.
> All the people that suggest relying on multiprocessing probably haven't done anything that's CPU and Memory intensive because if you have a code that operates on a "world-state" each new process will have to copy that from a parent. If the state takes ~10GB each process will multiply that.
This is wrong; there are multiple ways for Python processes to work on shared data without each copying it.
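For example, with multiprocessing.shared_memory (Python 3.8+) workers attach to the parent's buffer by name instead of copying it (the array here is just a small stand-in for a large world state):

    import numpy as np
    from multiprocessing import Process, shared_memory

    def worker(name, shape, dtype):
        shm = shared_memory.SharedMemory(name=name)          # attach, no copy
        view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        print("worker sees sum:", view.sum())
        shm.close()

    if __name__ == "__main__":
        data = np.arange(1_000_000, dtype=np.float64)        # stand-in "world state"
        shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
        np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data
        p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
        p.start(); p.join()
        shm.close(); shm.unlink()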
> It's really amazing that the machine learning community has managed to hack around that with C-based libraries like SciPy and NumPy.
Well, the main implementation of the whole language is C-based. I can't see how that implies hackiness.
> If you can't drop GIL because some esoteric features depend on that, then drop them as well.
There have been multiple Python implementations without a GIL available for way over 10 years, for example PyPy, IronPython and Jython. Yet these never went mainstream, which strongly implies the GIL problem actually isn't that much of a problem in the real world.
There are (IMHO) two scenarios where this doesn't matter: if you're writing a web service, then the multi-threading (processing) is done by the front end, i.e. Nginx; and for 'scientific' or similarly computationally expensive stuff, Python is turning into a scripting layer over C libraries. Have a look at numpy to see what I mean.
Any details on that? We're spinning up a 48 core epyc soon on 12.1. The last "beefy" server we did was a 32 core intel on 11.x, and I'll be happy if these changes are in 12.1.
Honestly. Windows is just a waste on that architecture as it stands. Run Linux as base OS and give 8 cores to the Windows VM should you have use for it.
288MB of cache. I wonder if we'll ever be given some control over how and what gets cached. It seems like a lot of memory to leave to simple heuristics.
Such control has been available for many years. On x86 you have "non-temporal" load and store instructions [1] and the Cache Disable bit in Control Register 0 [2] which can be used to suggest which bytes are not suitable for caching. C also has posix_madvise() [3] which is somewhat relevant.
I feel the cache in these modern CPUs is being used extremely efficiently. If they have a branch predictor that is 99% accurate, I have faith in letting the same engineers manage my cache eviction strategy. The extreme scope of modern OoO execution strategies would probably make detailed CPU cache management more of a liability than an asset.
One simple policy would be to try to ensure your application is as small as reasonably possible. If you can fit your entire executable image in a fraction of your L3 you are probably sitting in a really good spot.
I’ve heard these “simple heuristics” are multi-megabyte neural networks these days. I believe I heard that claim in this video:
https://youtu.be/ymcOLL2qEg8
It's possible that Tesla does something different that makes sense in their specific situation, but general purpose CPUs - as far as I know - don't use NNs for this.
What is the latest on just turning off SMT/Hyperthreading? Then you don't run into the greater-than-64-threads issue with this CPU. I remember there being a reason to turn it off unrelated to performance, but I do not remember if there was more than one reason [0].
Yes, but if you care about 64 threads without the possible side-channel issues that SMT currently/theoretically has, then you are back to the 3990X.
For what it is worth, I have an AMD Ryzen 2700X Eight-Core Processor that I got in 2018, and I keep SMT off. I do some light gaming with it, and I am happy. I did not notice a big drop in performance, but I did not truly measure the difference.