Hobbyist me thinks this is really cool. Thinking about how different runtimes might hint at resource allocation is fascinating.
Professional me knows that some bursty process which really needs P-core access all the time, despite usually being idle, is going to get shortchanged, and professional me is going to be super pissed off when he gets paged at 2am and stuck spending a lot of time figuring out how to pin that process to that core.
Worker drone me is going to be sad thinking that Slack and Chrome are snarfing the good cores while my compile times suffer.
Bit of a hot take, but it's a tragedy of the commons situation. Programmers are smart; they'll find tricks to grab the fast cores. There are maybe zero organizations that can get alignment on keeping only the important processes on the fast cores; it's way better to just be fast yourself and point the finger at other teams for being slow. What a time to be alive.
It's super cool, and it has the potential to be amazing. But even a policy that forces every task onto an E-core, with a heavy bias toward keeping the fast cores idle, will be gamed. I guess it's better to know than to be surprised.
Desktop GUIs are really lagging when it comes to giving the user tools to corral unruly apps. I know, they were born in a more trusting time, but mobile has understood for many years that the user-app relationship can get slightly adversarial. It's high time for the desktop to take a more solid swing at it.
Hey, maybe we could even have kept Flash around if it were easy to prevent Flash ads from fighting each other to take 100% CPU. It might have taken some iteration to figure out how to map a simple GUI-appropriate slider to a scheduling policy (10% -> 10ms every 100ms?), but I bet it could have been done; see the sketch below. It would have had the nice side effect of appropriately directing user outrage at the offending applications, too.
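On Linux today, that slider-to-policy mapping is basically what cgroup v2 CPU bandwidth control does. A minimal sketch, assuming a cgroup already exists at a hypothetical path for the offending app and that we're allowed to write to it:

    # Minimal sketch: map a 0-100% "CPU slider" onto a cgroup v2 cpu.max quota.
    # CGROUP is a hypothetical path; a real GUI would create and manage it itself.
    CGROUP = "/sys/fs/cgroup/flash-ads"
    PERIOD_US = 100_000                      # 100 ms accounting period

    def set_cpu_slider(percent: int) -> None:
        quota_us = PERIOD_US * percent // 100    # 10% -> 10_000 us per 100_000 us
        with open(f"{CGROUP}/cpu.max", "w") as f:
            f.write(f"{quota_us} {PERIOD_US}\n")

    set_cpu_slider(10)   # throttle the group to roughly 10% of one CPU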
I wouldn't mind the client-side scripts so much if they behaved. Even with uBlock and such, random pages will peg my CPU and drain my battery. IIRC, allrecipes.com, imdb.com, and of course youtube.com.
Forcing Reader View often helps, but it's hit & miss.
The Great Suspender worked OK for Chrome. But I'm now 90% Safari and 10% Firefox.
Adjacent idea: browser history should also include other metadata, like page size (weight), resources consumed, time spent on the page, when the page was closed, permissions requested & granted, etc.
> Professional me knows I'll be stuck figuring out that some bursty process that really needs p1 access all the time, in spite of usually being idle is going to be super pissed off when he gets paged at 2am.
In a way that can't wait tens of microseconds?
> and he'll be stuck spending a lot of time figuring out how to pin that process to that core.
Pinning to cores is really easy.
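For example, on Linux (a minimal sketch; os.sched_setaffinity is Linux-only, and `taskset -c 0,1 <cmd>` does the same thing from the shell):

    import os

    # Restrict the current process (pid 0 means "this process") to CPUs 0 and 1.
    os.sched_setaffinity(0, {0, 1})
    print("now allowed on CPUs:", os.sched_getaffinity(0))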
> Worker drone me is going to be sad thinking that slack and chrome are snarfing the good cores while my compile times suffer.
I'd say something about adjusting priorities but there's a much more important thing to realize here.
If your CPU didn't have this, you'd have 6 fewer cores. Or probably 12 in the next version. Even if your compiler can't use two of the fast cores, it's going to run much faster overall.
> tragedy of the commons situation
If you want to improve your own responsiveness at the expense of other programs you could already busy loop everything. And nobody does that.
For years there have been multiple Android SoCs with stupid numbers of cores, mixing 2 or 3 core types, and the end result is that pretty much every contemporary matchup between Apple SoCs, best or worst, and the ARM SoCs put in Android phones has favored the glorious fruit company.
I'm curious if the existing CPU performance of even mid-range modern phone SoCs is really even used by most users. I suspect gaming and winning benchmarks may be the primary use cases.
I was trying to reduce power consumption on my phone and disabled all but one low-perf core (the phone's SoC has 2 high-perf and 4 low-perf cores), and things were very noticeably laggy.
But then I tried with only two low-perf cores enabled. Everything that I used my phone for (calls, messaging, notes, web browsing with an ad blocker) worked fine. So the extra four disabled cores really didn't affect my (average?) use case.
I was concerned about consumption during active usage, not idle, but disabling the cores didn't affect power consumption for either (even with only one enabled).
It is a Qualcomm Snapdragon 765G. I assumed there would be aggressive power gating for a mobile processor, but I have no idea. Google didn't turn up anything. Maybe in some doc behind an NDA?
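For what it's worth, on a rooted device that exposes the standard Linux CPU hotplug interface, disabling a core is just a sysfs write. A hedged sketch; the core numbers here are placeholders, not the 765G's actual layout:

    def set_cpu_online(cpu: int, online: bool) -> None:
        with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
            f.write("1" if online else "0")

    # Example: take cpu1..cpu5 offline, leaving only cpu0 running.
    for cpu in range(1, 6):
        set_cpu_online(cpu, False)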
I'm still unconvinced about Intel's big-core/little-core architecture. Sounds great for a laptop, but I don't want that in my desktop. In games, waiting tens of milliseconds is a long time.
It can advise on shifting cores around as fast as the OS is ready to, so the delay doesn't have to be worse than a normal sleep.
If you have a normal "active game" timer frequency, then Windows is probably checking in every millisecond, and it will have extremely accurate data to use. And that's even without any special mechanisms to react faster.
My (flawed?) impression is that the OS implementation is heuristic. Putting a process on a core is not much different from putting state in swap: it doesn't matter that much with NVRAM disks, but it's also not that hard to keep your data out of swap. There's bound to be a way to keep a process on the fast core: just do some useless work that keeps it there (see the sketch below), or cooperate with the OS to keep the critical process pinned to the fast core.
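To illustrate the "useless work" concern (not a recommendation), a hedged sketch of a keep-warm thread that burns a little CPU on a timer so a heuristic scheduler keeps treating the process as active; the intervals are arbitrary assumptions:

    import threading, time

    def keep_warm(busy_s: float = 0.002, sleep_s: float = 0.010) -> None:
        while True:
            end = time.perf_counter() + busy_s
            while time.perf_counter() < end:   # ~2 ms of pointless spinning
                pass
            time.sleep(sleep_s)                # then yield for ~10 ms

    threading.Thread(target=keep_warm, daemon=True).start()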
My big question is how hypervisors will handle this heterogeneous architecture, and how much it will affect them. All cores are not equal, which means vCPUs are no longer just vCPUs. I believe most guest workloads are somewhat opaque to the hypervisor, and as far as I know there isn't yet a way, on any of the major hypervisors, for guests to be aware of or request which cores their workloads are scheduled on.
I'd assume it's just "asymmetric SMT" as a model. Each "hardware core" is seen as 2x big hardware threads + 1x small hardware thread.
The 2x big hardware threads are from the big-core (SMT / Hyperthreading), while the 1x small hardware thread is the small-core.
Since this chip is 8-big cores + 8 little cores, the math works out.
-----
A future chip is rumored to be 8-big cores + 16 little cores, which can be implemented instead as 2x big-threads + 2x little-thread cluster (1x big core + 2 little cores per cluster).
Though of course, that depends heavily upon the implementation details of this "Thread Director".
For the most part guests are freely scheduled across threads/cores. Though I do know VMware's ESXi will prefer to place a guest's workloads on threads that do not share the same physical core if free resources allow.
An important difference though is that guests do not have to be aware of most of this scheduling behavior, the exception being NUMA. My guess at this point is we might see something similar to virtual NUMA topologies, but for big-little.
Hyperthreads are not equal! If you thread a task over two hyperthreads that belong to the same core, you will see much less of a performance improvement.
Similarly, even if a hyperthread has low utilisation, if its twin is busy you will see lower performance.
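On Linux you can at least see which logical CPUs are SMT twins and schedule around them; a minimal Linux-only sketch reading the sysfs topology:

    import glob

    def smt_siblings() -> dict:
        """Map each logical CPU to its thread_siblings_list, e.g. 'cpu3' -> '3,11'."""
        out = {}
        for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
            with open(path) as f:
                out[path.split("/")[5]] = f.read().strip()
        return out

    print(smt_siblings())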
They're all equal in the sense that physically they're all the same. If you load every hardware thread equally, they run at equal speed. In a sense each thread is exactly 1/2 of a core.
It's not like "logical CPU 3" is slower than "logical CPU 7"!
This is like a highway with equal width lanes. Sure, there might be more traffic in some lanes, but the lanes themselves are equal.
The new Intel CPU is like 8 wide lanes that can be used by up to 16 motorcycles or 8 cars (or combinations thereof), alongside 8 medium-width lanes only usable by small cars. It's bizarre for a desktop CPU.
PS: Looking at the die shots, it boggles the mind that they didn't include 16 efficiency cores! They're so tiny that it would have been a negligible area increase, but given the relative performance it seems like it would have been worthwhile. I'm guessing memory bandwidth limits are holding them back somewhere...
Wouldn't hyper-threading be more like a highway in which certain pairs of lanes occasionally merge into a single lane temporarily? If one lane has loads of traffic, you're going to want to enter the highway on a non-adjacent lane.
Didn't Intel already try something like this with Lakefield? It didn't go well.
Let's say a thread starts on a high-performance core and enables code paths for features the efficiency core doesn't have. How will the software know not to move the thread to the efficiency core? If it did move it, wouldn't it throw errors? I frankly don't see how they can solve this problem in software without recompiling everything to be hybrid-arch aware. In the Android space everyone throws their phones away every few years, so this hasn't been an issue there. To me it seems like Intel is creating another Itanic situation where no one is going to compile their software to target the hybrid paradigm.
Alder Lake fixes this issue by including all features on the efficiency cores, so a thread should be able to move between performance and efficiency cores seamlessly.
It also drops AVX-512 support entirely, presumably because the efficiency cores don’t support it and the problem you mention isn’t easily solvable.
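If you want to check what your particular chip advertises, the per-CPU flags in /proc/cpuinfo are one way to see whether every logical CPU reports the same extensions. A small Linux-only sketch (on Alder Lake as described above, avx512f should be absent everywhere):

    def cpus_with_flag(flag: str) -> list:
        cpus, current = [], None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("processor"):
                    current = line.split(":")[1].strip()
                elif line.startswith("flags") and flag in line.split(":")[1].split():
                    cpus.append(current)
        return cpus

    print("logical CPUs advertising avx512f:", cpus_with_flag("avx512f"))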
I wonder if Zen 4 will have AVX-512 support. If so, given that it will use a 5nm process instead of this processor's 7nm process, it will absolutely blow it out of the water.
It would be somewhat ironic to see Intel being trounced by an instruction set they invented and then stopped using!
AVX-512 was a mistake, causing Intel lots of issues for the benefit of a new benchmark. It heats up and clocks down the chip for other work. And it takes up too much space.
Previous iterations heated up the processor and forced it to clock down because Intel was stuck on the 14nm process.
Intel developed the instruction set and expected to rapidly shrink the die to 10nm and then 7nm, which would have fixed the power draw issues. The shrink never happened, and this then made AVX-512 look bad.
It's not AVX-512 that's at fault, it's the manufacturing process. Fix the process, and the instruction set can shine.
Many people writing vector code by hand say that they much prefer AVX-512 over its predecessors because it is complete, flexible, and powerful.
The only reason Intel didn't lose every benchmark against AMD recently is because for some workloads AVX-512 doubles throughput despite being hamstrung by the power draw and overheating problems.
A 5nm chip using AVX-512 might only need to clock down 10-20%, or not at all. Or just use turbo for a shorter period.
The manufacturing process would alter the whole chip; the AVX-512 area and power ratio would still be the same and still lead to similar issues. It was a mistake. I would not be surprised to see them drop it in some chips to get a better big.LITTLE design out, since Atom does not have it.
According to the Intel patches to LLVM, the Gracemont cores and the Golden Cove cores as configured in Alder Lake support the complete Broadwell ISA (i.e. including things like BMI1, BMI2 and ADX, which were not mentioned in the presentations).
Besides the Broadwell instructions, all the instructions supported by Tremont (the previous core in the Atom series of cores), but not by Broadwell, are also supported, plus VNNI from Cascade Lake and some of the non-AVX-512 instructions introduced by Ice Lake.
Thanks! It’s interesting because there are still problems with this model. The peak-optimized application architecture for a chip where four cores share L2 is obviously going to look different from the one where they share a slower L3, so it’s still weird to migrate for some programs. Most people aren’t going to notice.
I’m excited for this part because Tremont was never a thing normal people got to buy. I think it’s going to be a real pleasure to program these E-cores.
> How will the software know to not move the thread to the efficiency core?
There could be a dirty state flag that indicates whether vector registers or other advanced instructions were used.
> If it did do so wouldn't it throw errors?
The kernel could catch those errors, move the thread back to a performance core, resume execution at the trapped instruction, and mark the thread as non-schedulable on the efficiency cores.
That could be interesting. I remember reading a report here years ago about a funny bug that was caused by the big and little cores having different endianness. Indeed the process would start on one core, process some data, get scheduled to the other core and boom, weird output. Does someone recall the story?
That seems like too stupid a bug to have existed in a production system.
However, there was a real problem a few years ago with some Samsung Exynos processors, where the cache line size was different between the big and the little cores.
A few programs made assumptions about the cache line size and had weird behavior after starting on one kind of core, then being migrated on the other kind of core.
Fun fact: Windows 10 21H1 and 11 share the exact same kernel binary (ntoskrnl.exe). Which, of course, contains the scheduler. So, yeah, that's a lie if I've ever seen one.
Yes, it's a blatant lie for sure. Windows has long handled the presence or absence of particular CPU features. It's a very obvious ploy of latching onto some new CPU feature with "perfect timing" to force people on "older" hardware to upgrade. My Skylake i7, for example, is "too old" for Windows 11. It's a complete joke.
> they couldn't change the thread scheduler without a new major release?
I'd argue that thread schedulers are a core component of any operating system.
That being said: Windows 8 had support for heterogeneous processors (aka ARM's big.LITTLE), so I'm not sure a fundamental change was really needed here. Microsoft already put a lot of hard work into those Windows 8 phones (which used the same kernel as the Windows 8 desktop).
It doesn’t sound like anyone outside of Intel and Microsoft actually knows how this works. How are Linux and the BSDs going to be able to support this?
The M1 threatens no one outside of wishful thinking: Apple isn't going to start selling the M1 to others, and their laptops/desktops account for about 10% worldwide market share.
Apple's history as a supplier of server hardware isn't so good. (Cough, gag, XServe.) But they could have a few crack chip designers working on bigger brothers to the M1, may have noticed how many $Billions are being spent annually on servers for data centers, and might even act as if they actually want to be successful in that market segment, this time around.
Or - probably more likely - Microsoft, Intel, etc. want to keep that ~90% worldwide market share of theirs from shrinking.
I can't really know for sure, but based on perceived performance and smoothness, M1 + macOS most likely already does something similar in terms of thread and process prioritisation. iOS has done this for a long time, making sure that rendering threads are not resource-starved.
AFAIK Apple runs background tasks on the slow cores, which is supposed to free the fast cores for foreground tasks.
I still don't understand why it is necessary in an OS that can juggle threads hundreds of times per second using a priority system. At least ignoring the power-efficiency aspect and concentrating only on performance or perceived-performance aspects.
Context switching is a major performance killer on modern CPUs. You can look at the "slow" cores as being optimized for offloading context switches and interrupts across myriad threads so that the "fast" cores can focus on the primary workload with few context switches.
This is the same motivation behind why high-performance software architecture uses a "thread per core" model. There are large throughput gains to be had by minimizing context switching and CPU cache sharing across threads.
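A minimal sketch of the thread-per-core idea, using worker processes pinned to distinct CPUs (Linux-only because of os.sched_setaffinity); the workload is just a stand-in:

    import multiprocessing as mp
    import os

    def worker(cpu: int, n: int) -> int:
        os.sched_setaffinity(0, {cpu})          # pin this worker to one core
        return sum(i * i for i in range(n))     # stand-in for the real work

    if __name__ == "__main__":
        cpus = sorted(os.sched_getaffinity(0))
        with mp.Pool(len(cpus)) as pool:
            results = pool.starmap(worker, [(c, 1_000_000) for c in cpus])
        print(len(results), "pinned workers finished")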
Do you know of any resources for learning CPU-bound parallel/concurrent programming? I wish it were easier to use all the cores my devices actually have. Asynchronous IO and network stuff is “easy” because you don't care too much when the work completes as long as it's not blocking the main thread, but speeding up CPU-bound tasks is much harder. For all its flaws, Unity with their DOTS (and I believe specifically their job system) is the only thing I'm aware of that facilitates multithreading CPU-bound work (you can multithread actual gameplay code rather than just the standard rendering thread, game logic thread, IO thread, etc. of many games) rather than everyone having to roll their own solution.
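Not a Unity answer, but as a generic starting point: for CPU-bound work the usual pattern is to split the input into chunks and farm them out to one worker per core. A toy sketch in Python (processes rather than threads because of the GIL):

    from concurrent.futures import ProcessPoolExecutor
    import os

    def count_primes(lo: int, hi: int) -> int:
        return sum(n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
                   for n in range(lo, hi))

    if __name__ == "__main__":
        chunks = [(i * 50_000, (i + 1) * 50_000) for i in range(os.cpu_count())]
        with ProcessPoolExecutor() as ex:
            total = sum(ex.map(count_primes, *zip(*chunks)))
        print("primes found:", total)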
There does seem to be some of that in the strategy and the article. The odd thing is that it doesn't seem to address the elephant in the room, which is single-threaded sustained memory bandwidth. From my usage it seems like that's the only huge upgrade.
I'm surprised they prioritized Windows 11 and not Linux. Is this meant to most benefit consumers? I'd think something so complex would have been first targeted at servers.
Alder Lake (the family of chips this Thread Director is paired with) is being released as a desktop/laptop chip. The largest configuration shown was 8 performance cores + 8 efficiency cores.
Just like with whatever Khronos does: AMD, Intel, and NVIDIA collaborate with Microsoft on native DirectX hardware support, and only afterwards do those features show up in the OpenGL and Vulkan extension soup.
To pick one from the list: the OpenGL and Vulkan mesh shader extensions were published in September 2018 [0], while D3D12 didn't support mesh shaders officially until November 2019 [1] (in an Insider Preview of the 20H1 build, which shipped in May 2020). Granted, there was an NVAPI D3D12 extension kludge for it, but the Khronos APIs certainly weren't treated worse. The talk introducing mesh shaders presented examples in GLSL, and they said they weren't covering NVAPI because it's ugly. [2]
Fair enough, I was wrong on that one, with the small detail that on the Khronos side it was, and still is, an NVIDIA extension, whereas all DirectX 12 Ultimate cards must support it.
I don't think Intel has this sort of hybrid server CPU in its near-future roadmap; the next Xeons (Sapphire Rapids) are expected to be a pure big-core (Golden Cove) configuration.
One thing that springs to mind: as far as I know, it's not too difficult to manually assign tasks to cores on Linux in a way that makes sense for the workload. But I'm not sure if you can do that at all on Windows.
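You can on Windows too: Task Manager has "Set affinity", `start /affinity` works from the shell, and programmatically there's SetProcessAffinityMask. A minimal ctypes sketch; the mask value is just an illustrative assumption:

    import ctypes

    kernel32 = ctypes.windll.kernel32          # Windows-only
    handle = kernel32.GetCurrentProcess()      # pseudo-handle for this process
    mask = 0b0011                              # one bit per logical CPU: allow 0 and 1
    if not kernel32.SetProcessAffinityMask(handle, mask):
        raise ctypes.WinError()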
From the article “In previous versions of Windows, the scheduler had to rely on analysing the programs on its own, inferring performance requirements of a thread but with no real underlying understanding of what was happening.”
Does anyone know what is meant by the scheduler analyses the program?