Does simultaneous multithreading still make sense? (codeblueprint.co.uk)
87 points by wheresvic3 on Dec 22, 2019 | 65 comments



I worked on a project where the (large) customer had some legacy requirements about percentage of "CPU" our application was allowed to use. The requirement was written back in the days when a single computer really only had one core, and once things like that are written it's hard to get them unwritten.

For our application (heavily numeric, very well behaved cache access), turning on hyperthreading only increased real performance by about 10% (measured as work completed per unit of time). However, we settled on a metric where we defined CPU use to be load average divided by number of cores. Doubling the number of cores the system showed in top allowed us to meet the required margin.
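
A minimal sketch of that metric, assuming Unix-style load averages and Python's os.cpu_count() purely for illustration:

    import os

    def cpu_use_metric() -> float:
        # "CPU use" as defined above: 1-minute load average divided by
        # the number of CPUs the OS reports (logical CPUs, so SMT counts).
        load_1min, _, _ = os.getloadavg()
        return load_1min / os.cpu_count()

    # With hyperthreading off: load 8.0 over  8 logical CPUs -> 100% "CPU use"
    # With hyperthreading on:  load 8.0 over 16 logical CPUs ->  50% "CPU use"
    print(f"CPU use: {cpu_use_metric():.0%}")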

So from a bureaucratic point of view, hyperthreading was a 100% improvement.


Flagging this as it's an absurdly shallow article, apparently combining about 10 minutes of "research" after hearing something on Twitter with conflating typical end-user use cases and an entire technology. The "tuning" and "oh noes my VMs, this is surely a new problem nobody doing virtualization has ever thought of" section is too absurd to even bother with. But for the security aspect it's worth pointing out that in many, if not most [1], truly performance-critical environments all code being run is trusted. The system or cluster is dedicated to being given one specific job after another to crunch on, exclusively by authorized users in authenticated ways, and outputting exclusively to a controlled channel going off-system. Even if it ever should have a problem, it would merely result in possibly some corruption of data in flight and some downtime while the whole thing was re-imaged, but nothing remotely worse than a 15-50% drop in performance (!). For root's sake.

----

1: where "most" means "in the raw amount of hardware $$$ spent".


There aren't really many "performance critical" multithreaded environments in the world. For the most part you either have something that doesn't scale and needs a really fast thread, or you have a cost equation about how many servers you need to buy/maintain. The main exception that comes to mind is extremely large databases that heavily resist horizontal scaling due to poor design (of either the software or the database).

I'd argue most large compute by total $ is actually shared at the host level, i.e. public/private cloud or user devices. Basically the only things that aren't are dedicated clusters for specific applications and a few hundred supercomputers, while AWS alone has over 100x as many cores as the largest supercomputer.

Also, I don't think there is a high horse to be on about an article not targeting the audience of the largest exascale clusters; not everyone/everything on HN needs to be at the forefront of the field to avoid being flagged.


Er, there aren't many performance critical multithreaded environments? Latency sensitive systems disagree, and those are all over the place.


Can you give examples of something that scales via threading but requires a single thread to compute its function in 500 microseconds instead of 600 microseconds, and that actually contradicts the claim that "most" systems aren't this way?


Most professional audio software is like that - you can have e.g. a thread per track to make it simple, but you also have to ensure that each execution cycle takes no more than 1 millisecond, or you get audio glitches. And there is no limit to how much you have to improve - this is a central factor for people buying your software (see DAWbench) and artists really, really don't like limits - they will always try to add more effects, etc. on each track.
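
For a sense of the numbers, here is a minimal sketch of that per-cycle budget; the buffer size and sample rate are just illustrative:

    def cycle_budget_ms(buffer_frames: int, sample_rate_hz: int) -> float:
        # Time available to process one audio buffer before the next one
        # is due; exceeding it produces an audible glitch (xrun).
        return buffer_frames / sample_rate_hz * 1000.0

    print(cycle_budget_ms(48, 48_000))    # 1.0 ms per cycle
    print(cycle_budget_ms(256, 44_100))   # ~5.8 ms per cycle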


Agreed, but I'd hardly call pure-software professional real-time audio setups a disruptor of the vast majority of systems. Put all of these niche compute-heavy multithreaded real-time use cases together and you have <1% of the CPU market.

I.e. my claim was "There aren't really many 'performance critical' multithreaded environments in the world." not that there aren't any.


> Agreed, but I'd hardly call pure-software professional real-time audio setups a disruptor of the vast majority of systems.

I mean, there are still a few hundred thousand people registered on DAW-related forums, so certainly a fair bit more are using those. That is more than the population of a dozen European countries. Sure, it's not Angry Birds, but I do not think it is relevant to cater only to the lowest common denominator of software.


There are lots of performance-critical environments out there in the embedded world... why else would VxWorks be so popular?


Unless you're implying there are a lot of embedded machines running VxWorks on overclocked i9 9900Ks because they needed the single-core throughput, I think we are talking about completely different concepts.


Of course SMT makes sense. Why would it not? The article says that's because people only count the threads in their "cpuinfo" output and get the wrong impression? The Intel vulnerabilities are not SMT vulnerabilities per se; they are side-channel attacks on a specific SMT implementation.


Also want to add that saying "don't use SMT because it's insecure" is the same as saying "don't use a cache because it's insecure" or "don't use speculative execution". As a short-term fix, I would 100% agree that disabling SMT (e.g. OpenBSD's approach) is awesome and shows their security-consciousness. But to preach "disable SMT because it's too challenging" feels very lazy.

Additionally, as you've said, it's still uArch-dependent. For example, the Fallout vulnerability (one of the MDS attacks) only worked on Intel machines, not on AMD or ARM, most likely due to differences in how the designs handle store-to-load forwarding in the store queues/buffers.

The author seems to also value security over performance. I do as well. But the balance between performance and security is a fickle one, and I feel that "SMT is nonsensical" is a bit too much.
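
For what it's worth, on Linux the short-term fix mentioned above can be toggled at runtime through sysfs. A minimal sketch, assuming a kernel recent enough to expose /sys/devices/system/cpu/smt and root privileges for the write:

    from pathlib import Path

    SMT = Path("/sys/devices/system/cpu/smt")

    def smt_active() -> bool:
        # "1" while sibling hyperthreads are online.
        return (SMT / "active").read_text().strip() == "1"

    def disable_smt() -> None:
        # Take the sibling threads offline at runtime (requires root).
        (SMT / "control").write_text("off")

    print("SMT active:", smt_active())

The same control file also accepts "forceoff" to keep SMT disabled until the next reboot.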


>> Additionally, as you've said, it's still uArch dependent

Intel would love for everyone to disable SMT regardless of vendor. That would help them with relative performance.


Some of them happen to be shared across all CPU vendors.


Which, now that they are known, can be fixed in future iterations of the technology. Just because Intel won't (or can't) fix their damn products doesn't mean that others won't.


It also doesn't mean others will fix their damn products.

And even if they do, it doesn't mean anyone will replace existing hardware already deployed into production.


Sure, there's no guarantee, but that's not a particularly good reason to write off SMT wholesale.

People who already have Intel CPUs in production aren't just going to turn off hyper-threading, either, regardless of what we say about whether or not future products should support it.


When you pay for licenses per CPU and SMT doubles the license cost without doubling the performance, SMT does not make sense. For other cases, it does. There is no universal use case for it.


There's a persistent rumour that Oracle does this, but they don't. For example:

"Amazon EC2 and RDS - count two vCPUs as equivalent to one Oracle Processor license"

https://www.oracle.com/assets/cloud-licensing-070579.pdf

Is there a vendor that does count a hyperthread as a core for software licensing?


Please don't cherry pick quotes. The full quote with the bit you left out is as follows:

> Amazon EC2 and RDS - count two vCPUs as equivalent to one Oracle Processor license if hyper-threading is enabled, and one vCPU as equivalent to one Oracle Processor license if hyper-threading is not enabled.

As you see, your own quote confirms that yes, the rumors are true: Oracle does charge per CPU.


Yeah, that's what I said. Oracle charges per core. I didn't cherry pick anything.
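
For what it's worth, a minimal sketch of the arithmetic the quoted rule implies (the instance sizes are just illustrative):

    import math

    def oracle_licenses(vcpus: int, hyperthreading: bool) -> int:
        # Per the quoted EC2/RDS rule: two vCPUs count as one Processor
        # license with hyper-threading enabled, one vCPU per license without.
        return math.ceil(vcpus / 2) if hyperthreading else vcpus

    # 16 vCPUs with HT on  -> 8 physical cores -> 8 licenses
    #  8 vCPUs with HT off -> 8 physical cores -> 8 licenses
    print(oracle_licenses(16, hyperthreading=True))   # 8
    print(oracle_licenses(8, hyperthreading=False))   # 8

Either way, the count tracks physical cores rather than hardware threads.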


Given the mismatch between memory latency and how fast a CPU can actually run when it does have data, SMT still does make sense, sometimes, for some kinds of system. Bigger, better caches make it less useful, and security... well. "Ownership" of one's computational environment is a metaphysical debate now; this is just one more bullet point on the list.


In the linked article about gkh's talk, you find this tidbit: "If you're not using a supported distro, or a stable long-term kernel, you have an insecure system. It's that simple. All those embedded devices out there, that are not updated, totally easy to break."

Is he still talking about SMT, or just poor security of Linux in general?

I'm wondering about this since "all those embedded devices out there" that I can think of are not running CPUs with SMT.


Tons of stuff like NUCs used as digital displays, kiosks, etc., everywhere. I'd be surprised if even half of that was on a proper update path.

Embedded isn't just, like, microcontrollers. Think about all the times you've seen a BSOD on a billboard.


I don't really consider NUCs or other off-the-shelf commodity x86/amd64 mini PCs to be "embedded." Especially if they're running an off-the-shelf commodity OS that can throw a BSOD. That's not some custom distro that cannot be updated; whether they actually care enough to update it is a different matter altogether. You totally can pick a supported distro with LTS kernels, and keep it up to date.

I'm writing this as someone who's used Shuttle's fanless mini PCs (designed for PoS/kiosk use) as desktop & server hardware, all with proper updates. And I work for a company that does actual, custom embedded hardware. I've made a billboard too. None of the actual embedded hardware (almost exclusively ARM) I've used is SMT-capable. Even among off-the-shelf amd64 solutions, it's common for people to pinch pennies and buy a Celeron/Pentium without hyperthreading.

And, fwiw, I've never witnessed a BSOD on a billboard in person.


If it doesn't have a JTAG connector, it isn't embedded.


This comment is extremely ironic considering it was an Intel processor that spurred the widespread use of JTAG, and all the way up to Skylake, Intel products had traditional JTAG connectors. These days they do JTAG over a physical USB port, but I'm not sure how the shape of the port is supposed to matter.

JTAG on the NUCs actually led to a CVE as well IIRC.


JTAG has nothing to do with Intel per se, but everything to do with BGAs, which made it super hard to get at certain signals.


It was (relatively) uncommon until Intel released the 80486; then it became very popular and was found on basically every chip. Not that there weren't devices before and after that used JTAG, but none nearly as influential in its growth.


In my mind, SMT made more sense when core counts were low. These days, desktop use cases can more often run out of threads to run than places to run them. Server use cases can often run more threads, but it might not be useful to run 32 CPU threads if your NICs can only properly run 16 queues.


For computational tasks, I've seen SMT give a roughly 50% performance increase compared to not using SMT on the same machine.

Much of that depends on how 'regular' the executions are. A highly optimized FFT or BLAS routine will benefit less than a sparse matrix computation, where part of the time is spent in indexing, rather than floating point operations.


Some SMT on/off benchmark comparisons on a Ryzen 3900x. Confirms your "sometimes 50+% / sometimes nothing" experience.

https://www.techpowerup.com/review/amd-ryzen-9-3900x-smt-off...


For highly optimized routines I'd tend to worry about not gaining anything due to being limited by cache speed, or even about losing performance on net due to cache thrashing.


There's no direct correlation between NIC queues and CPU threads. The days of dedicating one thread to each incoming connection and/or HTTP request are long behind us, not to mention there are many tasks that require a lot of processing with little to no network activity.


That's not true. Matching NIC queues to CPUs is still very important for getting good performance on servers: you pin different queues to different cores.
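
A minimal sketch of what that pinning looks like on Linux. The IRQ numbers here are hypothetical; in practice you'd look them up for your interface in /proc/interrupts, and writing the affinity files requires root:

    from pathlib import Path

    # Hypothetical IRQ numbers for the NIC's RX/TX queues.
    QUEUE_IRQS = [41, 42, 43, 44]

    def pin_queue_irqs(irqs, first_cpu=0):
        # Give each queue's interrupt its own CPU so a given flow is
        # always handled on the same core (better cache locality).
        for offset, irq in enumerate(irqs):
            Path(f"/proc/irq/{irq}/smp_affinity_list").write_text(str(first_cpu + offset))

    pin_queue_irqs(QUEUE_IRQS)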


Only if your workload involves passing around a lot of network traffic (e.g. a load balancer) or is highly latency-sensitive.

I've got backend servers that routinely max out 32c/64t CPUs but push so little traffic that replacing the NIC with a cell phone modem would make no discernible difference. There are many types of server workloads where the NIC is not the bottleneck at all, so the parent's argument that low NIC queue counts make high CPU core counts useless is false.


"Server use cases can often run more threads, but it might not be useful" != "low NIC queue counts make high CPU core counts useless "

You're debating an argument that was never made.


The parent said "but it might not be useful". "Might" being an important word here.


There are many >100 core POWER8 or POWER9 systems running SAP HANA, Epic or Oracle with SMT4 or SMT8 today.




This made me wonder how SMT is handled in the Linux kernel, especially around CPU idle and scheduling. I found the articles below; sharing for those who are also interested:

1- Rock and a hard place: How hard it is to be a CPU idle-time governor https://lwn.net/Articles/793372/

2- Many uses for Core scheduling https://lwn.net/Articles/799454/


I would say that one of the major performance boosts of Zen over Bulldozer is the introduction of real SMT due to the expiration of the patents. Bulldozer had CMT, which is not the same technique.

CMT vs SMT (very simplified view): https://i.imgur.com/AcZnipK.png

As you can see, with CMT you have the same number of ALUs as with SMT, but a single thread can only use its dedicated ALU, leaving the other one idle, whereas SMT allows a single thread to use all of the ALUs.


> due to the expiration of the patents

How do you know that's the reason?


It's certainly good for Amazon, where they pawn off a thread as a "vCPU".

If SMT dies off, it would be a pretty big margin hit for them.


How will SMT evolve with the frequency down-clocking required by AVX-512? Might a thread be penalized because it happens to be executed concurrently with an AVX-512 thread on the same core?


I thought down-clocking was on the first generation of low-end almost-not-Xeons with AVX-512? Will a 2018/19 Xeon Gold or Platinum really down-clock?


FYSA, SMT in this context is simultaneous multithreading a.k.a. hyperthreading, not surface mount technology.

Hardware folks can safely move on.


And not “Satisfiability modulo theories” either, it seems. I would never recommend people to “move on” from an interesting article, though.


Agreed.

At first glance, I genuinely thought this was going to be a pitch for yet another fragile additive-manufacturing toy with a narrow use case, or a new process that enables IPC-7092 designs on the cheap.


I read it as Shin Megami Tensei, but that's even less likely.


What makes you think ISA design is not in the wheelhouse of “hardware folks”?


It was a half-hearted remark in passing targeted towards the class of "hardware folks" who might care about the finer details of surface mount technology. Try not to get too offended.


Imagine my confusion, I clicked thinking I was going to read an article discussing 'Through-Hole vs. Surface' mounting of PCB components.


I wish that acronyms would be written out if they have multiple meanings in the computer context. My first thought was "how can satisfiability modulo theory ever not make sense?"


I thought it was Surface Mount Technology and was wondering what kind of replacement was being proposed.


The first paragraph of the article makes it very clear what they’re referring to:

> Whatever machine you’re reading this on, it’s highly likely that not all of the CPUs shown by the OS are actually physical processors. That’s because most modern processors use simultaneous multithreading (SMT) to improve performance by executing tasks in parallel.
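
For the curious, one quick way to see that mismatch on a Linux box is to compare the logical CPU count with the de-duplicated SMT sibling groups in sysfs. A minimal sketch; the sysfs paths are Linux-specific:

    import os
    from pathlib import Path

    def physical_core_count() -> int:
        # Each core's hyperthreads report the same thread_siblings_list,
        # so de-duplicating those strings counts physical cores.
        siblings = set()
        for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/topology/thread_siblings_list"):
            siblings.add(f.read_text().strip())
        return len(siblings)

    print("logical CPUs  :", os.cpu_count())
    print("physical cores:", physical_core_count())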


Cores sharing some caches make sense, but no, maybe SMT does not make sense.


Or does SMT make sense because looking at incoming instructions and branch-predicting to execute some speculatively can only go so far, and sometimes a hint from the application that "hey, this can be run independently of that" helps with overall throughput?


Yes, it does. Instruction-level vulnerabilities arise from the execution of insecure code.

If you have to do that, your security is already compromised. Shared hosting, virtualisation, etc. are all insecure by definition.


Intel i5 desktop chips don’t have hyper-threading (SMT) and haven’t for the 10 years they’ve been available. Typically the i7 variant of the same CPU has been about £100 more (roughly 50%). The point about only 5% extra die space makes no difference to the consumer, as there is/was quite a high cost premium on desktops for that feature. Now Intel has removed hyper-threading from most of its i7 desktop chips, and you get 2 extra cores over the i5 version instead.


"Intel i5 desktop chips don’t have hyper-threading (SMT) and haven’t for the 10 years they’ve been available."

That's mostly true, though there have been a few desktop i5 processors with hyperthreads.

Like: https://ark.intel.com/content/www/us/en/ark/products/43546/i...


I didn’t spot that one, though it was almost 10 years ago and I don’t see more recent examples. My point was that a large number of users don’t actually have hyper-threading on the desktop.


What about i3 processors? Or laptop processors? AFAIK those all support HT.


Desktop cores are tiny today in comparison to all the other useless stuff put on the die, like "AI" accelerators and such.

That is even worse in mobile chips. Everyone calls Intel's cores oversized, but they should look at up-to-date die shots. All the cores combined can be less than half of the die area.



