Does simultaneous multithreading still make sense? (codeblueprint.co.uk)
87 points by wheresvic3 on Dec 22, 2019 | 65 comments



I worked on a project where the (large) customer had some legacy requirements about percentage of "CPU" our application was allowed to use. The requirement was written back in the days when a single computer really only had one core, and once things like that are written it's hard to get them unwritten.

For our application (heavily numeric, very well behaved cache access), turning on hyperthreading only increased real performance by about 10% (measured as work completed per unit of time). However, we settled on a metric where we defined CPU use to be load average divided by number of cores. Doubling the number of cores the system showed in top allowed us to meet the required margin.
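
A minimal sketch of that metric, assuming Unix-style load averages and Python's os.cpu_count() purely for illustration:

    import os

    def cpu_use_metric() -> float:
        # "CPU use" as defined above: 1-minute load average divided by
        # the number of CPUs the OS reports (logical CPUs, so SMT counts).
        load_1min, _, _ = os.getloadavg()
        return load_1min / os.cpu_count()

    # With hyperthreading off: load 8.0 over  8 logical CPUs -> 100% "CPU use"
    # With hyperthreading on:  load 8.0 over 16 logical CPUs ->  50% "CPU use"
    print(f"CPU use: {cpu_use_metric():.0%}")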

So from a bureaucratic point of view, hyperthreading was a 100% improvement.


Flagging this as it's an absurdly shallow article, apparently combining about 10 minutes of "research" after hearing something on Twitter with conflating typical end-user use cases and an entire technology. The "tuning" and "oh noes my VMs, this is surely a new problem nobody doing virtualization has ever thought of" section is too absurd to even bother with. But for the security aspect it's worth pointing out that in many, if not most [1], truly performance-critical environments all code being run is trusted. The system or cluster is dedicated to being given one specific job after another to crunch on, exclusively by authorized users in authenticated ways, and outputting exclusively to a controlled channel going off-system. Even if it ever should have a problem, it would merely result in possibly some corruption of data in flight and some downtime while the whole thing was re-imaged, but nothing remotely worse than a 15-50% drop in performance (!). For root's sake.

----

1: where "most" means "in the raw amount of hardware $$$ spent".


There aren't really many "performance critical" multithreaded environments in the world. For the most part you either have something that doesn't scale and needs a really fast thread, or you have a cost equation about how many servers you need to buy/maintain. The main exception that comes to mind is extremely large databases that heavily resist horizontal scaling due to poor design (of either the software or the database).

I'd argue most large compute by total $ is actually shared at the host level, i.e. public/private cloud or user devices. Basically the only things that aren't are dedicated clusters for specific applications and a few hundred supercomputers, while AWS alone has over 100x as many cores as the largest supercomputer.

Also, I don't think there is a high horse to be on about an article not targeting the audience of the largest exascale clusters; not everyone/everything on HN needs to be at the forefront of the field to avoid being flagged.


Er, there aren't many performance critical multithreaded environments? Latency sensitive systems disagree, and those are all over the place.


Can you give examples of something that scales via threading but requires a single thread to compute its function in 500 microseconds instead of 600 microseconds, and that actually contradicts the claim that "most" systems aren't this way?


Most professional audio software is like that - you can have e.g. a thread per track to make it simple, but you also have to ensure that each execution cycle takes no more than 1 millisecond, or you get audio glitches. And there is no limit to how much you have to improve - this is a central factor for people buying your software (see DAWbench) and artists really, really don't like limits - they will always try to add more effects, etc. on each track.
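
For a sense of the numbers, here is a minimal sketch of that per-cycle budget; the buffer size and sample rate are just illustrative:

    def cycle_budget_ms(buffer_frames: int, sample_rate_hz: int) -> float:
        # Time available to process one audio buffer before the next one
        # is due; exceeding it produces an audible glitch (xrun).
        return buffer_frames / sample_rate_hz * 1000.0

    print(cycle_budget_ms(48, 48_000))    # 1.0 ms per cycle
    print(cycle_budget_ms(256, 44_100))   # ~5.8 ms per cycle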


Agreed, but I'd hardly call pure-software professional real-time audio setups a disruptor of the vast majority of systems. Put all of these niche compute-heavy multithreaded real-time use cases together and you have <1% of the CPU market.

I.e. my claim was "There aren't really many 'performance critical' multithreaded environments in the world." not that there aren't any.


> Agreed, but I'd hardly call pure-software professional real-time audio setups a disruptor of the vast majority of systems.

I mean, there are still a few hundred thousand people registered on DAW-related forums, so certainly a fair bit more are using those. That is more than the population of a dozen European countries. Sure, it's not Angry Birds, but I do not think it is relevant to cater only to the lowest common denominator of software.


There are lots of performance-critical environments out there in the embedded world... why else would VxWorks be so popular?


Unless you're implying there are a lot of embedded machines running VxWorks on overclocked i9 9900Ks because they needed the single-core throughput, I think we are talking about completely different concepts.


Of course SMT makes sense. Why would it not? The article says that's because people only count the threads in their "cpuinfo" output and get the wrong impression? The Intel vulnerabilities are not SMT vulnerabilities per se; they are side-channel attacks on a specific SMT implementation.


Also want to add that saying "don't use SMT because it's insecure" is the same as saying "don't use a cache because it's insecure" or "don't use speculative execution". As a short-term fix, I would 100% agree that disabling SMT (e.g. OpenBSD's approach) is awesome and shows their security-consciousness. But to preach "disable SMT because it's too challenging" feels very lazy.

Additionally, as you've said, it's still uArch-dependent. For example, the Fallout vulnerability (one of the MDS attacks) only worked on Intel machines, not on AMD or ARM, most likely due to differences in how the designs handle store-to-load forwarding in the store queues/buffers.

The author seems to also value security over performance. I do as well. But the balance between performance and security is a fickle one, and I feel that "SMT is nonsensical" is a bit too much.
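
For what it's worth, on Linux the short-term fix mentioned above can be toggled at runtime through sysfs. A minimal sketch, assuming a kernel recent enough to expose /sys/devices/system/cpu/smt and root privileges for the write:

    from pathlib import Path

    SMT = Path("/sys/devices/system/cpu/smt")

    def smt_active() -> bool:
        # "1" while sibling hyperthreads are online.
        return (SMT / "active").read_text().strip() == "1"

    def disable_smt() -> None:
        # Take the sibling threads offline at runtime (requires root).
        (SMT / "control").write_text("off")

    print("SMT active:", smt_active())

The same control file also accepts "forceoff" to keep SMT disabled until the next reboot.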


>> Additionally, as you've said, it's still uArch dependent

Intel would love for everyone to disable SMT regardless of vendor. That would help them with relative performance.


Some of them happen to be shared across all CPU vendors.


Which, now that they are known, can be fixed in future iterations of the technology. Just because Intel won't (or can't) fix their damn products doesn't mean that others won't.


It also doesn't mean others will fix their damn products.

And even if they do, it doesn't mean anyone will replace existing hardware already deployed into production.


Sure, there's no guarantee, but that's not a particularly good reason to write off SMT wholesale.

People who already have Intel CPUs in production aren't just going to turn off hyper-threading, either, regardless of what we say about whether or not future products should support it.


When you pay for licenses per CPU and SMT doubles the license cost without doubling the performance, SMT does not make sense. For other cases, it does. There is no universal use case for it.


There's a persistent rumour that Oracle does this, but they don't. For example:

"Amazon EC2 and RDS - count two vCPUs as equivalent to one Oracle Processor license"

https://www.oracle.com/assets/cloud-licensing-070579.pdf

Is there a vendor that does count a hyperthread as a core for software licensing?


Please don't cherry pick quotes. The full quote with the bit you left out is as follows:

> Amazon EC2 and RDS - count two vCPUs as equivalent to one Oracle Processor license if hyper-threading is enabled, and one vCPU as equivalent to one Oracle Processor license if hyper-threading is not enabled.

As you see, your own quote confirms that yes, the rumors are true: Oracle does charge per CPU.


Yeah, that's what I said. Oracle charges per core. I didn't cherry pick anything.
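
For what it's worth, a minimal sketch of the arithmetic the quoted rule implies (the instance sizes are just illustrative):

    import math

    def oracle_licenses(vcpus: int, hyperthreading: bool) -> int:
        # Per the quoted EC2/RDS rule: two vCPUs count as one Processor
        # license with hyper-threading enabled, one vCPU per license without.
        return math.ceil(vcpus / 2) if hyperthreading else vcpus

    # 16 vCPUs with HT on  -> 8 physical cores -> 8 licenses
    #  8 vCPUs with HT off -> 8 physical cores -> 8 licenses
    print(oracle_licenses(16, hyperthreading=True))   # 8
    print(oracle_licenses(8, hyperthreading=False))   # 8

Either way, the count tracks physical cores rather than hardware threads.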


Given the mismatch between memory latency and how fast a CPU can actually run when it does have data, SMT still does make sense, sometimes, for some kinds of system. Bigger, better caches make it less useful, and security... well. "Ownership" of one's computational environment is a metaphysical debate now; this is just one more bullet point on the list.


In the linked article about gkh's talk, you find this tidbit: "If you're not using a supported distro, or a stable long-term kernel, you have an insecure system. It's that simple. All those embedded devices out there, that are not updated, totally easy to break."

Is he still talking about SMT, or just poor security of Linux in general?

I'm wondering about this since "all those embedded devices out there" that I can think of are not running CPUs with SMT.


Tons of stuff like NUCs used as digital displays, kiosks, etc., everywhere. I'd be surprised if even half of that was on a proper update path.

Embedded isn't just, like, microcontrollers. Think about all the times you've seen a BSOD on a billboard.


I don't really consider NUCs or other off-the-shelf commodity x86/amd64 mini PCs to be "embedded." Especially if they're running an off-the-shelf commodity OS that can throw a BSOD. That's not some custom distro that cannot be updated; whether they actually care enough to update it is a different matter altogether. You totally can pick a supported distro with LTS kernels, and keep it up to date.

I'm writing this as someone who's used Shuttle's fanless mini PCs (designed for PoS/kiosk use) as desktop & server hardware, all with proper updates. And I work for a company that does actual, custom embedded hardware. I've made a billboard too. None of the actual embedded hardware (almost exclusively ARM) I've used is SMT-capable. Even among off-the-shelf amd64 solutions, it's common for people to pinch pennies and buy a Celeron/Pentium without hyperthreading.

And, fwiw, I've never witnessed a BSOD on a billboard in person.


If it doesn't have a JTAG connector, it isn't embedded.


This comment is extremely ironic considering it was an Intel processor that spurred the widespread use of JTAG, and all the way up to Skylake, Intel products had traditional JTAG connectors. These days they do JTAG over a physical USB port, but I'm not sure how the shape of the port is supposed to matter.

JTAG on the NUCs actually led to a CVE as well IIRC.


JTAG has nothing to do with Intel per se, but everything to do with BGAs, which made it super hard to get at certain signals.


It was (relatively) uncommon until Intel released the 80486; then it became very popular and was found on basically every chip. Not that there weren't devices before and after that used JTAG, but none nearly as influential in its growth.


In my mind, SMT made more sense when core counts were low. These days, desktop use cases can more often run out of threads to run than places to run them. Server use cases can often run more threads, but it might not be useful to run 32 CPU threads if your NICs can only properly run 16 queues.


For computational tasks, I've seen SMT give a roughly 50% performance increase compared to not using SMT on the same machine.

Much of that depends on how 'regular' the executions are. A highly optimized FFT or BLAS routine will benefit less than a sparse matrix computation, where part of the time is spent in indexing, rather than floating point operations.


Some SMT on/off benchmark comparisons on a Ryzen 3900x. Confirms your "sometimes 50+% / sometimes nothing" experience.

https://www.techpowerup.com/review/amd-ryzen-9-3900x-smt-off...


For highly optimized routines I'd tend to worry about not gaining anything due to being limited by cache speed, or even about losing performance on net due to cache thrashing.


There's no direct correlation between NIC queues and CPU threads. The days of dedicating one thread to each incoming connection and/or HTTP request are long behind us, not to mention there are many tasks that require a lot of processing with little to no network activity.


That's not true. Matching NIC queues to CPUs is still very important for getting good performance on servers: you pin different queues to different cores.
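
A minimal sketch of what that pinning looks like on Linux. The IRQ numbers here are hypothetical; in practice you'd look them up for your interface in /proc/interrupts, and writing the affinity files requires root:

    from pathlib import Path

    # Hypothetical IRQ numbers for the NIC's RX/TX queues.
    QUEUE_IRQS = [41, 42, 43, 44]

    def pin_queue_irqs(irqs, first_cpu=0):
        # Give each queue's interrupt its own CPU so a given flow is
        # always handled on the same core (better cache locality).
        for offset, irq in enumerate(irqs):
            Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(str(first_cpu + offset))

    pin_queue_irqs(QUEUE_IRQS)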


Only if your workload involves passing around a lot of network traffic (e.g. a load balancer) or is highly latency-sensitive.

I've got backend servers that routinely max out 32c/64t CPUs but push so little traffic that replacing the NIC with a cell phone modem would make no discernible difference. There are many types of server workloads where the NIC is not the bottleneck at all, so the parent's argument that low NIC queue counts make high CPU core counts useless is false.


"Server use cases can often run more threads, but it might not be useful" != "low NIC queue counts make high CPU core counts useless "

You're debating an argument that was never made.


The parent said "but it might not be useful". "Might" being an important word here.


There are many >100 core POWER8 or POWER9 systems running SAP HANA, Epic or Oracle with SMT4 or SMT8 today.




This made me wonder how SMT is handled in the Linux kernel, especially around CPU idle and scheduling. I found the articles below; sharing for those who are also interested:

1- Rock and a hard place: How hard it is to be a CPU idle-time governor https://lwn.net/Articles/793372/

2- Many uses for Core scheduling https://lwn.net/Articles/799454/


I would say that one of the major performance boosts of Zen over Bulldozer is the introduction of real SMT due to the expiration of the patents. Bulldozer had CMT, which is not the same technique.

CMT vs SMT (very simplified view): https://i.imgur.com/AcZnipK.png

As you can see, with CMT you have the same number of ALUs as with SMT, but a single thread can only use its dedicated ALU, leaving the other one idle, whereas SMT allows a single thread to use all of the ALUs.


> due to the expiration of the patents

How do you know that's the reason?


It's certainly good for Amazon, where they pawn off a thread as a "vCPU".

If SMT dies off, it would be a pretty big margin hit for them.


How will SMT evolve with the frequency down-clocking required by AVX-512? Might a thread be penalized because it happens to be executed concurrently with an AVX-512 thread on the same core?


I thought down-clocking was on the first generation of low-end almost-not-Xeons with AVX-512? Will a 2018/19 Xeon Gold or Platinum really down-clock?


FYSA, SMT in this context is simultaneous multithreading a.k.a. hyperthreading, not surface mount technology.

Hardware folks can safely move on.


And not “Satisfiability modulo theories” either, it seems. I would never recommend people to “move on” from an interesting article, though.


Agreed.

At first glance, I genuinely thought this was going to be a pitch for yet another fragile additive-manufacturing toy with a narrow use case, or a new process that enables IPC-7092 designs on the cheap.


I read it as Shin Megami Tensei, but that's even less likely.


What makes you think ISA design is not in the wheelhouse of “hardware folks”?


It was a half-hearted remark in passing targeted towards the class of "hardware folks" who might care about the finer details of surface mount technology. Try not to get too offended.


Imagine my confusion, I clicked thinking I was going to read an article discussing 'Through-Hole vs. Surface' mounting of PCB components.


I wish that acronyms would be written out if they have multiple meanings in the computer context. My first thought was "how can satisfiability modulo theory ever not make sense?"


I thought it was Surface Mount Technology and was wondering what kind of replacement was being proposed.


The first paragraph of the article makes it very clear what they’re referring to:

> Whatever machine you’re reading this on, it’s highly likely that not all of the CPUs shown by the OS are actually physical processors. That’s because most modern processors use simultaneous multithreading (SMT) to improve performance by executing tasks in parallel.
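
For the curious, one quick way to see that mismatch on a Linux box is to compare the logical CPU count with the de-duplicated SMT sibling groups in sysfs. A minimal sketch; the sysfs paths are Linux-specific:

    import os
    from pathlib import Path

    def physical_core_count() -> int:
        # Each core's hyperthreads report the same thread_siblings_list,
        # so de-duplicating those strings counts physical cores.
        siblings = set()
        for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/topology/thread_siblings_list"):
            siblings.add(f.read_text().strip())
        return len(siblings)

    print("logical CPUs  :", os.cpu_count())
    print("physical cores:", physical_core_count())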


Cores sharing some caches make sense, but no, maybe SMT does not make sense.


Or does SMT make sense because looking at incoming instructions and branch-predicting to execute some speculatively can only go so far, and sometimes a hint from the application that "hey, this can be run independently of that" helps with overall throughput?


Yes, it does. Instruction-level vulnerabilities arise from the execution of insecure code.

If you have to do that, your security is already compromised. Shared hosting, virtualisation, etc. are all insecure by definition.


Intel i5 desktop chips don’t have hyper-threading (SMT) and haven’t for the 10 years they’ve been available. Typically the i7 variant of the same CPU has been about £100 more (roughly 50%). The point about only 5% extra die space makes no difference to the consumer, as there is/was quite a high cost premium on desktops for that feature. Now Intel has removed hyper-threading from most of its i7 desktop chips, and you get 2 extra cores over the i5 version instead.


"Intel i5 desktop chips don’t have hyper-threading (SMT) and haven’t for the 10 years they’ve been available."

That's mostly true, though there have been a few desktop i5 processors with hyperthreads.

Like: https://ark.intel.com/content/www/us/en/ark/products/43546/i...


I didn’t spot that one, though it was almost 10 years ago and I don’t see more recent examples. My point was that a large number of users don’t actually have hyper-threading on the desktop.


What about i3 processors? Or laptop processors? AFAIK those all support HT.


Desktop cores are tiny today in comparison to all the other useless stuff put on the die, like "AI" accelerators and such.

That is even worse in mobile chips. Everyone calls Intel's cores oversized, but they should look at up-to-date die shots. All the cores combined can be less than half of the die area.



