Hobbyist me thinks this is really cool. Thinking about how different runtimes might hint at resource allocation is fascinating.
Professional me knows that some bursty process which really needs P-core access all the time, despite usually being idle, is going to get shortchanged, and professional me is going to be super pissed off when he gets paged at 2am and stuck spending a lot of time figuring out how to pin that process to that core.
Worker drone me is going to be sad thinking that Slack and Chrome are snarfing the good cores while my compile times suffer.
Bit of a hot take, but it's a tragedy of the commons situation. Programmers are smart; they'll find tricks to grab the fast cores. There are maybe zero organizations that can get alignment on keeping only the important processes on the fast cores; it's way better to just be fast yourself and point the finger at other teams for being slow. What a time to be alive.
It's super cool, and it has the potential to be amazing. But even a policy that forces every task onto an E-core, with a heavy bias toward keeping the fast cores idle, will be gamed. I guess it's better to know than to be surprised.
Desktop GUIs are really lagging when it comes to giving the user tools to corral unruly apps. I know, they were born in a more trusting time, but mobile has understood for many years that the user-app relationship can get slightly adversarial. It's high time for the desktop to take a more solid swing at it.
Hey, maybe we could even have kept Flash around if it were easy to prevent Flash ads from fighting each other to take 100% CPU. It might have taken some iteration to figure out how to map a simple GUI-appropriate slider to a scheduling policy (10% -> 10ms every 100ms?), but I bet it could have been done; see the sketch below. It would have had the nice side effect of appropriately directing user outrage at the offending applications, too.
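On Linux today, that slider-to-policy mapping is basically what cgroup v2 CPU bandwidth control does. A minimal sketch, assuming a cgroup already exists at a hypothetical path for the offending app and that we're allowed to write to it:

    # Minimal sketch: map a 0-100% "CPU slider" onto a cgroup v2 cpu.max quota.
    # CGROUP is a hypothetical path; a real GUI would create and manage it itself.
    CGROUP = "/sys/fs/cgroup/flash-ads"
    PERIOD_US = 100_000                      # 100 ms accounting period

    def set_cpu_slider(percent: int) -> None:
        quota_us = PERIOD_US * percent // 100    # 10% -> 10_000 us per 100_000 us
        with open(f"{CGROUP}/cpu.max", "w") as f:
            f.write(f"{quota_us} {PERIOD_US}\n")

    set_cpu_slider(10)   # throttle the group to roughly 10% of one CPU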
I wouldn't mind the client-side scripts so much if they behaved. Even with uBlock and such, random pages will peg my CPU and drain my battery. IIRC, allrecipes.com, imdb.com, and of course youtube.com.
Forcing Reader View often helps, but it's hit & miss.
The Great Suspender worked OK for Chrome. But I'm now 90% Safari and 10% Firefox.
Adjacent idea: browser history should also include other metadata, like page size (weight), resources consumed, time spent on the page, when the page was closed, permissions requested & granted, etc.
> Professional me knows I'll be stuck figuring out that some bursty process that really needs p1 access all the time, in spite of usually being idle is going to be super pissed off when he gets paged at 2am.
In a way that can't wait tens of microseconds?
> and he'll be stuck spending a lot of time figuring out how to pin that process to that core.
Pinning to cores is really easy.
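For example, on Linux (a minimal sketch; os.sched_setaffinity is Linux-only, and `taskset -c 0,1 <cmd>` does the same thing from the shell):

    import os

    # Restrict the current process (pid 0 means "this process") to CPUs 0 and 1.
    os.sched_setaffinity(0, {0, 1})
    print("now allowed on CPUs:", os.sched_getaffinity(0))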
> Worker drone me is going to be sad thinking that slack and chrome are snarfing the good cores while my compile times suffer.
I'd say something about adjusting priorities but there's a much more important thing to realize here.
If your CPU didn't have this, you'd have 6 fewer cores. Or probably 12 in the next version. Even if your compiler can't use two of the fast cores, it's going to run much faster overall.
> tragedy of the commons situation
If you want to improve your own responsiveness at the expense of other programs you could already busy loop everything. And nobody does that.
For years there have been multiple Android SoCs with stupid numbers of cores, mixing 2 or 3 core types, and the end result is that pretty much every contemporary matchup between Apple SoCs, best or worst, and the ARM SoCs put in Android phones has favored the glorious fruit company.
I'm curious if the existing CPU performance of even mid-range modern phone SoCs is really even used by most users. I suspect gaming and winning benchmarks may be the primary use cases.
I was trying to reduce power consumption on my phone and disabled all but one low-perf core (the phone's SoC has 2 high-perf and 4 low-perf cores), and things were very noticeably laggy.
But then I tried with only two low-perf cores enabled. Everything that I used my phone for (calls, messaging, notes, web browsing with an ad blocker) worked fine. So the extra four disabled cores really didn't affect my (average?) use case.
I was concerned about consumption during active usage, not idle, but disabling the cores didn't affect power consumption for either (even with only one enabled).
It is a Qualcomm Snapdragon 765G. I assumed there would be aggressive power gating for a mobile processor, but I have no idea. Google didn't turn up anything. Maybe in some doc behind an NDA?
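For what it's worth, on a rooted device that exposes the standard Linux CPU hotplug interface, disabling a core is just a sysfs write. A hedged sketch; the core numbers here are placeholders, not the 765G's actual layout:

    def set_cpu_online(cpu: int, online: bool) -> None:
        with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
            f.write("1" if online else "0")

    # Example: take cpu1..cpu5 offline, leaving only cpu0 running.
    for cpu in range(1, 6):
        set_cpu_online(cpu, False)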
I'm still unconvinced about Intel's big-core/little-core architecture. Sounds great for a laptop, but I don't want that in my desktop. In games, waiting tens of milliseconds is a long time.
It can advise on shifting cores around as fast as the OS is ready to, so the delay doesn't have to be worse than a normal sleep.
If you have a normal "active game" timer frequency, then Windows is probably checking in every millisecond, and it will have extremely accurate data to use. And that's even without any special mechanisms to react faster.
My (flawed?) impression is that the OS implementation is heuristic. Putting a process on a core is not much different from putting state in swap: it doesn't matter that much with NVRAM disks, but it's also not that hard to keep your data out of swap. There's bound to be a way to keep a process on the fast core: just do some useless work that keeps it there (see the sketch below), or cooperate with the OS to keep the critical process pinned to the fast core.
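To illustrate the "useless work" concern (not a recommendation), a hedged sketch of a keep-warm thread that burns a little CPU on a timer so a heuristic scheduler keeps treating the process as active; the intervals are arbitrary assumptions:

    import threading, time

    def keep_warm(busy_s: float = 0.002, sleep_s: float = 0.010) -> None:
        while True:
            end = time.perf_counter() + busy_s
            while time.perf_counter() < end:   # ~2 ms of pointless spinning
                pass
            time.sleep(sleep_s)                # then yield for ~10 ms

    threading.Thread(target=keep_warm, daemon=True).start()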
My big question is how hypervisors will handle this heterogeneous architecture, and how much it will affect them. All cores are not equal, which means vCPUs are no longer just vCPUs. I believe most guest workloads are somewhat opaque to the hypervisor, and as far as I know there isn't yet a way, on any of the major hypervisors, for guests to be aware of or request which cores their workloads are scheduled on.
I'd assume it's just "asymmetric SMT" as a model. Each "hardware core" is seen as 2x big hardware threads + 1x small hardware thread.
The 2x big hardware threads are from the big-core (SMT / Hyperthreading), while the 1x small hardware thread is the small-core.
Since this chip is 8-big cores + 8 little cores, the math works out.
-----
A future chip is rumored to be 8-big cores + 16 little cores, which can be implemented instead as 2x big-threads + 2x little-thread cluster (1x big core + 2 little cores per cluster).
Though of course, that depends heavily upon the implementation details of this "Thread Director".
For the most part guests are freely scheduled across threads/cores. Though I do know VMware's ESXi will prefer to place a guest's workloads on threads that do not share the same physical core if free resources allow.
An important difference though is that guests do not have to be aware of most of this scheduling behavior, the exception being NUMA. My guess at this point is we might see something similar to virtual NUMA topologies, but for big-little.
Hyperthreads are not equal! If you thread a task over two hyperthreads that belong to the same core, you will see much less of a performance improvement.
Similarly, even if a hyperthread has low utilisation, if its twin is busy you will see lower performance.
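On Linux you can at least see which logical CPUs are SMT twins and schedule around them; a minimal Linux-only sketch reading the sysfs topology:

    import glob

    def smt_siblings() -> dict:
        """Map each logical CPU to its thread_siblings_list, e.g. 'cpu3' -> '3,11'."""
        out = {}
        for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
            with open(path) as f:
                out[path.split("/")[5]] = f.read().strip()
        return out

    print(smt_siblings())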
They're all equal in the sense that physically they're all the same. If you load every hardware thread equally, they run at equal speed. In a sense each thread is exactly 1/2 of a core.
It's not like "logical CPU 3" is slower than "logical CPU 7"!
This is like a highway with equal width lanes. Sure, there might be more traffic in some lanes, but the lanes themselves are equal.
The new Intel CPU is like 8 wide lanes that can be used by up to 16 motorcycles or 8 cars (or combinations thereof), alongside 8 medium-width lanes only usable by small cars. It's bizarre for a desktop CPU.
PS: Looking at the die shots, it boggles the mind that they didn't include 16 efficiency cores! They're so tiny that it would have been a negligible area increase, but given the relative performance it seems like it would have been worthwhile. I'm guessing memory bandwidth limits are holding them back somewhere...
Wouldn't hyper-threading be more like a highway in which certain pairs of lanes occasionally merge into a single lane temporarily? If one lane has loads of traffic, you're going to want to enter the highway on a non-adjacent lane.
Didn't Intel already try something like this with Lakefield? It didn't go well.
Let's say a thread starts on a high-performance core and enables code paths for features the efficiency core doesn't have. How will the software know not to move the thread to the efficiency core? If it did move it, wouldn't it throw errors? I frankly don't see how they can solve this problem in software without recompiling everything to be hybrid-arch aware. In the Android space everyone throws their phones away every few years, so this hasn't been an issue there. To me it seems like Intel is creating another Itanic situation where no one is going to compile their software to target the hybrid paradigm.
Alder Lake fixes this issue by including all features on the efficiency cores, so a thread should be able to move between performance and efficiency cores seamlessly.
It also drops AVX-512 support entirely, presumably because the efficiency cores don’t support it and the problem you mention isn’t easily solvable.
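If you want to check what your particular chip advertises, the per-CPU flags in /proc/cpuinfo are one way to see whether every logical CPU reports the same extensions. A small Linux-only sketch (on Alder Lake as described above, avx512f should be absent everywhere):

    def cpus_with_flag(flag: str) -> list:
        cpus, current = [], None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("processor"):
                    current = line.split(":")[1].strip()
                elif line.startswith("flags") and flag in line.split(":")[1].split():
                    cpus.append(current)
        return cpus

    print("logical CPUs advertising avx512f:", cpus_with_flag("avx512f"))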
I wonder if Zen 4 will have AVX-512 support. If so, given that it will use a 5nm process instead of this processor's 7nm process, it will absolutely blow it out of the water.
It would be somewhat ironic to see Intel being trounced by an instruction set they invented and then stopped using!
AVX-512 was a mistake, causing Intel lots of issues for the benefit of a new benchmark. It heats up and clocks down the chip for other work. And it takes up too much space.
Previous iterations heated up the processor and forced it to clock down because Intel was stuck on the 14nm process.
Intel developed the instruction set and expected to rapidly shrink the die to 10nm and then 7nm, which would have fixed the power draw issues. The shrink never happened, and this then made AVX-512 look bad.
It's not AVX-512 that's at fault, it's the manufacturing process. Fix the process, and the instruction set can shine.
Many people writing vector code by hand say that they much prefer AVX-512 over its predecessors because it is complete, flexible, and powerful.
The only reason Intel didn't lose every benchmark against AMD recently is because for some workloads AVX-512 doubles throughput despite being hamstrung by the power draw and overheating problems.
A 5nm chip using AVX-512 might only need to clock down 10-20%, or not at all. Or just use turbo for a shorter period.
The manufacturing process would alter the whole chip; the AVX-512 area and power ratio would still be the same and still lead to similar issues. It was a mistake. I would not be surprised to see them drop it in some chips to get a better big.LITTLE design out, since Atom does not have it.
According to the Intel patches to LLVM, the Gracemont cores and the Golden Cove cores as configured in Alder Lake support the complete Broadwell ISA (i.e. including things like BMI1, BMI2 and ADX, which were not mentioned in the presentations).
Besides the Broadwell instructions, all the instructions supported by Tremont (the previous core in the Atom series of cores), but not by Broadwell, are also supported, plus VNNI from Cascade Lake and some of the non-AVX-512 instructions introduced by Ice Lake.
Thanks! It’s interesting because there are still problems with this model. The peak-optimized application architecture for a chip where four cores share L2 is obviously going to look different from the one where they share a slower L3, so it’s still weird to migrate for some programs. Most people aren’t going to notice.
I’m excited for this part because Tremont was never a thing normal people got to buy. I think it’s going to be a real pleasure to program these E-cores.
> How will the software know to not move the thread to the efficiency core?
There could be a dirty state flag that indicates whether vector registers or other advanced instructions were used.
> If it did do so wouldn't it throw errors?
The kernel could catch those errors, move the thread back to a performance core, resume execution at the trapped instruction, and mark the thread as non-schedulable on the efficiency cores.
That could be interesting. I remember reading a report here years ago about a funny bug that was caused by the big and little cores having different endianness. Indeed the process would start on one core, process some data, get scheduled to the other core and boom, weird output. Does someone recall the story?
That seems like too stupid a bug to have existed in a production system.
However, there was a real problem a few years ago with some Samsung Exynos processors, where the cache line size was different between the big and the little cores.
A few programs made assumptions about the cache line size and had weird behavior after starting on one kind of core, then being migrated on the other kind of core.
Fun fact: Windows 10 21H1 and 11 share the exact same kernel binary (ntoskrnl.exe). Which, of course, contains the scheduler. So, yeah, that's a lie if I've ever seen one.
Yes, it's a blatant lie for sure. Windows has long handled the presence or absence of particular CPU features. It's a very obvious ploy of latching onto some new CPU feature with "perfect timing" to force people on "older" hardware to upgrade. My Skylake i7, for example, is "too old" for Windows 11. It's a complete joke.
> they couldn't change the thread scheduler without a new major release?
I'd argue that thread schedulers are a core component of any operating system.
That being said: Windows 8 had support for heterogeneous processors (aka ARM's big.LITTLE), so I'm not sure a fundamental change was really needed here. Microsoft already put a lot of hard work into those Windows 8 phones (which used the same kernel as the Windows 8 desktop).
It doesn’t sound like anyone outside of Intel and Microsoft actually knows how this works. How are Linux and the BSDs going to be able to support this?
The M1 threatens no one outside of wishful thinking: Apple isn't going to start selling the M1 to others, and their laptops/desktops account for about 10% worldwide market share.
Apple's history as a supplier of server hardware isn't so good. (Cough, gag, XServe.) But they could have a few crack chip designers working on bigger brothers to the M1, may have noticed how many $Billions are being spent annually on servers for data centers, and might even act as if they actually want to be successful in that market segment, this time around.
Or - probably more likely - Microsoft, Intel, etc. want to keep that ~90% worldwide market share of theirs from shrinking.
I can't really know for sure, but based on perceived performance and smoothness, M1 + macOS most likely already does something similar in terms of thread and process prioritisation. iOS has done this for a long time, making sure that rendering threads are not resource-starved.
AFAIK Apple runs background tasks on the slow cores, which is supposed to free the fast cores for foreground tasks.
I still don't understand why it is necessary in an OS that can juggle threads hundreds of times per second using a priority system. At least ignoring the power-efficiency aspect and concentrating only on performance or perceived-performance aspects.
Context switching is a major performance killer on modern CPUs. You can look at the "slow" cores as being optimized for offloading context switches and interrupts across myriad threads so that the "fast" cores can focus on the primary workload with few context switches.
This is the same motivation behind why high-performance software architecture uses a "thread per core" model. There are large throughput gains to be had by minimizing context switching and CPU cache sharing across threads.
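A minimal sketch of the thread-per-core idea, using worker processes pinned to distinct CPUs (Linux-only because of os.sched_setaffinity); the workload is just a stand-in:

    import multiprocessing as mp
    import os

    def worker(cpu: int, n: int) -> int:
        os.sched_setaffinity(0, {cpu})          # pin this worker to one core
        return sum(i * i for i in range(n))     # stand-in for the real work

    if __name__ == "__main__":
        cpus = sorted(os.sched_getaffinity(0))
        with mp.Pool(len(cpus)) as pool:
            results = pool.starmap(worker, [(c, 1_000_000) for c in cpus])
        print(len(results), "pinned workers finished")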
Do you know of any resources for learning CPU-bound parallel/concurrent programming? I wish it were easier to use all the cores my devices actually have. Asynchronous IO and network stuff is “easy” because you don't care too much when the work completes as long as it's not blocking the main thread, but speeding up CPU-bound tasks is much harder. For all its flaws, Unity with their DOTS (and I believe specifically their job system) is the only thing I'm aware of that facilitates multithreading CPU-bound work (you can multithread actual gameplay code rather than just the standard rendering thread, game logic thread, IO thread, etc. of many games) rather than everyone having to roll their own solution.
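Not a Unity answer, but as a generic starting point: for CPU-bound work the usual pattern is to split the input into chunks and farm them out to one worker per core. A toy sketch in Python (processes rather than threads because of the GIL):

    from concurrent.futures import ProcessPoolExecutor
    import os

    def count_primes(lo: int, hi: int) -> int:
        return sum(n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
                   for n in range(lo, hi))

    if __name__ == "__main__":
        chunks = [(i * 50_000, (i + 1) * 50_000) for i in range(os.cpu_count())]
        with ProcessPoolExecutor() as ex:
            total = sum(ex.map(count_primes, *zip(*chunks)))
        print("primes found:", total)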
There does seem to be some of that in the strategy and the article. The odd thing is that it doesn't seem to address the elephant in the room, which is single-threaded sustained memory bandwidth. From my usage it seems like that's the only huge upgrade.
I'm surprised they prioritized Windows 11 and not Linux. Is this meant to most benefit consumers? I'd think something so complex would have been first targeted at servers.
Alder Lake (the family of chips this Thread Director is paired with) is being released as a desktop/laptop chip. The largest configuration shown was 8 performance cores + 8 efficiency cores.
Just like with whatever Khronos does: AMD, Intel, and NVIDIA collaborate with Microsoft on native DirectX hardware support, and only afterwards do those features show up in the OpenGL and Vulkan extension soup.
To pick one from the list: the OpenGL and Vulkan mesh shader extensions were published in September 2018 [0], while D3D12 didn't support mesh shaders officially until November 2019 [1] (in an Insider Preview of the 20H1 build, which shipped in May 2020). Granted, there was an NVAPI D3D12 extension kludge for it, but the Khronos APIs certainly weren't treated worse. The talk introducing mesh shaders presented examples in GLSL, and they said they weren't covering NVAPI because it's ugly. [2]
Fair enough, I was wrong on that one, with the small detail that on the Khronos side it was, and still is, an NVIDIA extension, whereas all DirectX 12 Ultimate cards must support it.
I don't think Intel has this sort of hybrid server CPU in its near-future roadmap; the next Xeons (Sapphire Rapids) are expected to be a pure big-core (Golden Cove) configuration.
One thing that springs to mind: as far as I know, it's not too difficult to manually assign tasks to cores on Linux in a way that makes sense for the workload. But I'm not sure if you can do that at all on Windows.
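You can on Windows too: Task Manager has "Set affinity", `start /affinity` works from the shell, and programmatically there's SetProcessAffinityMask. A minimal ctypes sketch; the mask value is just an illustrative assumption:

    import ctypes

    kernel32 = ctypes.windll.kernel32          # Windows-only
    handle = kernel32.GetCurrentProcess()      # pseudo-handle for this process
    mask = 0b0011                              # one bit per logical CPU: allow 0 and 1
    if not kernel32.SetProcessAffinityMask(handle, mask):
        raise ctypes.WinError()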
From the article “In previous versions of Windows, the scheduler had to rely on analysing the programs on its own, inferring performance requirements of a thread but with no real underlying understanding of what was happening.”
Does anyone know what is meant by the scheduler analyses the program?