Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy (phoronix.com)
286 points by mfiguiere on Feb 15, 2023 | 112 comments



The ironic part is that the latest Intel CPUs no longer support AVX-512, while AMD now provides a better AVX-512 implementation in Zen 4.

https://www.anandtech.com/show/17047/the-intel-12th-gen-core... https://www.phoronix.com/review/amd-epyc-9004-genoa


Isn't it only on their consumer lines that Intel removed AVX-512?

Sapphire Rapids is indicated as having support for it.

I know that all Zen 4 parts support it, even though you pointed to an EPYC CPU; Intel, on the other hand, probably kept it for professional users rather than their non-pro ones.


Yes, it's only Alder Lake (i.e. the cheaper, consumer-oriented CPUs) from which it has been removed. Server chips still have it AFAIK.

Even on Alder Lake, the official explanation is that it has both P(erformance) and E(fficiency) cores, with the E cores being significantly more power efficient and the P cores being significantly faster. The P cores have AVX512, the E cores don't. Since most kernels have no idea that this is a thing and treat all CPUs in a system as equal, they will happily schedule code with AVX512 instructions on an E core. This obviously crashes, since the CPU doesn't know how to handle those instructions. Some motherboard manufacturers allowed you to work around this by simply turning off all the E-cores, so only the P-cores (with AVX512 support) remained. Intel was not a fan of this and eventually disabled AVX512 in hardware.

As ivegotnoaccount mentioned, the Sapphire Rapids range of CPUs will have AVX512. Those are not intended for the typical consumer or mobile platform though, but for servers and big workstations where power consumption is much less of a concern. You would probably not want such a chip in your laptop or phone.


It would have been possible to devise a system call to 'unlock' AVX512 for an application that wants to use it, which would pin it to only be scheduled on P cores.


You end up with the issue of what happens if a commonly used library (or even something like glibc) wants to use AVX512 for some common operation: you could end up with most or all processes pinned to the P cores.


If you explicitly have to request AVX512 that might discourage glibc from using it.


AVX-512 is wide enough to process 8 64-bit floats at once. Getting a 10x speedup with an 8-wide SIMD unit is a little difficult to explain. Some of this speedup is presumably coming from fewer branch instructions in addition to the vector width. It's extremely impressive. Also, it has taken Intel a surprisingly long time!


L1 cache on Intel machines reads/writes in 512-bit chunks. So you get a 2x faster L1 cache when working with AVX512 on Intel IIRC.

Or perhaps more accurately: L1 cache that can process twice the data in the same amount of time.


That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)


> cache line

Wrong side of the L1 cache. Cache lines are how the L1 cache talks to the L2/L3 caches.

I'm talking about the load/store units in the CPU core, or the Core <--> L1 cache communications. This side is less commonly discussed online, but I am pretty sure it's important in this AVX512 discussion. (To be fair, I probably should have said "Load/Store unit" instead of "L1 cache" in my previous post, which would have been clearer.)

-------------

Modern CPU cores only have a limited number of load/store units. It's superscalar of course, like 4 loads/stores per clock tick or something, but still limited. By "batching" your loads/stores into 512 bits at a time instead of 256 or 64 bits, your CPU doesn't have to do as much work to talk with the L1 cache.
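A tiny sketch (not from the library) of what that batching means at the instruction level: one 512-bit load moves the same 64 bytes that eight scalar loads would, so the load/store ports do roughly an eighth of the work. Compile with an AVX-512F target (e.g. -mavx512f):

  #include <immintrin.h>
  #include <stdint.h>

  // One load uop pulls 64 bytes from L1 in a single shot.
  __m512i load_one_vector(const int64_t* p) {
      return _mm512_loadu_si512(p);
  }

  // The same 64 bytes fetched as eight separate 64-bit loads.
  int64_t load_eight_scalars(const int64_t* p) {
      int64_t sum = 0;
      for (int i = 0; i < 8; ++i)   // eight load uops
          sum += p[i];
      return sum;
  }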


Ah, so it's not so much about adjacency (which would even benefit an implementation that insisted on loading bytes individually) but about the number of operations required to bucket-brigade those zeroes and ones into registerland when the cache issue is solved (which I'd have very much expected to be the case in the baseline of the comparison, just like other replies).

I was close to dismissing your reply as merely a nomenclature nitpick, but I think I have learned something interesting, thanks!


I definitely didn't use the correct language in my post above. But I think we've got the misunderstanding cleared up now.

I'll try to use the word "Load/store Unit" when talking about that part of the CPU from now on.


Almost any array-math implementation that's aware of cache sizes is going to outperform the ones that aren't. By a heavy margin.


People already do optimizations like this all the time when they're working on low-level code that can benefit from it. Sorting is actually a good example: all of the major sort implementations typically use quicksort when the array size is large enough, and then at some level of the recursion the arrays get small enough that insertion sort (or even a sorting network) is faster. So sorting a large array will use at least two different sorting methods depending on what level of recursion is happening.

You can get information about the cache line sizes and cache hierarchy at runtime from sysfs/sysconf, but I don't think many people actually do this. Instead they just size things so that on the common architectures they expect to run on things will be sized appropriately, since these cache sizes don't change frequently. If you really want to optimize things, when you compile with -march=native (or some specific target architecture) GCC/Clang will implicitly add a bunch of command line flags to the compiler invocation that set up different preprocessor defines that expose information about the cache sizes/hierarchy for the target architecture.
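For reference, a minimal sketch of the runtime query on Linux/glibc (the _SC_LEVEL* constants are a glibc extension, not POSIX, and may return 0 on some systems):

  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      // Cache geometry as reported by glibc; values are in bytes.
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // e.g. 64
      long l1   = sysconf(_SC_LEVEL1_DCACHE_SIZE);      // e.g. 32-48 KiB per core
      long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);
      printf("L1d line=%ld B, L1d=%ld B, L2=%ld B\n", line, l1, l2);
      return 0;
  }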


AVX-512 has masks and a lot of new instructions. It's not just wider.
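As an illustration of why that matters for sorting, here's a hedged sketch (not code from the Intel library) of a 16-lane partition step built from a masked compare plus compress-store, which is the kind of primitive vectorized quicksort leans on:

  #include <immintrin.h>
  #include <stdint.h>

  // Partition 16 int32 keys against a pivot: keys < pivot go to *lo, the rest
  // to *hi. Both destinations need room for up to 16 elements. Returns the
  // number of "low" keys. Requires AVX-512F.
  static int partition16(const int32_t* keys, int32_t pivot,
                         int32_t* lo, int32_t* hi) {
      __m512i v  = _mm512_loadu_si512(keys);
      __m512i p  = _mm512_set1_epi32(pivot);
      __mmask16 lt = _mm512_cmplt_epi32_mask(v, p);           // per-lane key < pivot
      _mm512_mask_compressstoreu_epi32(lo, lt, v);             // pack the "low" lanes
      _mm512_mask_compressstoreu_epi32(hi, (__mmask16)~lt, v); // pack the "high" lanes
      return __builtin_popcount((unsigned)lt);
  }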


My understanding is that AVX-512 also has a lot more instructions, so composing something less naturally parallel (e.g. simdJSON) is easier with it.


AVX-512 also gives you 2x the register space (32 vector registers instead of 16), which can be very useful.


It's not "8-wide", it's "512 bits wide". The basic "foundation" profile supports splitting those bits into 8 qwords, 16 dwords, etc., while other profiles support finer granularity, down to 64 single bytes. Plus you get more registers, new instructions, and so on.


AVX2 is like a portion of a pie without filling. AVX512 is like a full pie with extra filling. You're getting filling, not simply more pie.


On an Ice Lake GCE instance, Highway's vqsort was 40% faster when sorting uint64s. vqsort also doesn't require avx512, and supports a wider array of types (including 128-bit integers and 64-bit k/v pairs), so it's more useful IMO. It's a much heavier weight dependency though.

Code / scripts here: https://github.com/funrollloops/parallel-sort-bench

I had to use a cloud instance for testing since I don't have an avx512-capable personal machine.


Thanks for sharing the benchmark :D Is there anything we could do to make Highway/vqsort (feel like) a lighter dependency?


I made that comment because the Intel library is header-only. While header-only libraries can be convenient, for non-trivial projects I prefer a well-engineered CMake-based build for better compile times.


Got it. Highway is mostly header-only, and we can move towards fully header-only if anyone is interested.

We also have a CMake build, not sure about well-engineered but patches welcome from anyone with more CMake expertise :)


I think as long as FetchContent works it's super easy to try out your library and see if it's something that should be pulled in as a real dependency.

I have heard that FetchContent can make something unreliable, as in dependent on having a connection, but I think it opens up the door for a lot of people to be willing to try it :) (That, and you can turn off mandatory updates, making it work offline too).

I think the basic rule for making FetchContent work is to just put the CMakeLists.txt in the root directory and then make sure that you can use the project as a CMake sub-project.
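For anyone curious, a hedged CMake sketch of that pattern (the target names and tag below are assumptions for illustration, not documented guidance from the Highway project):

  include(FetchContent)
  FetchContent_Declare(
    highway
    GIT_REPOSITORY https://github.com/google/highway.git
    GIT_TAG        master   # pin a real release tag in practice
  )
  FetchContent_MakeAvailable(highway)
  # Link against whatever targets the project's root CMakeLists.txt defines.
  target_link_libraries(my_app PRIVATE hwy hwy_contrib)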


OK, I think we meet those criteria. JPEG XL can use Highway via add_subdirectory, assuming that is what you mean by sub-project.


Yes, exactly!


Interesting! The benchmark appears to be using only random data though. Any measurements for partially sorted or reverse sorted data?


Good question. I added a test for sorted runs in addition to random data, and added pdqsort for comparison.

avx512_qsort benefits from sorted runs while vqsort does not, but vqsort is still 20% faster in this case as well.

The full results are in the README, but a short version:

  --------------------------------------------------                                                                   
  Benchmark                            Time      CPU                                                                   
  --------------------------------------------------                                                                   
  TbbSort/random                     3.17 s  16.7  s                                                                   
  HwySort/random                     2.27 s   2.27 s                                                                   
  StdPartitionHwySort/random         2.06 s   4.01 s                                              
  PdqSort/random                     5.67 s   5.67 s                                              
  IntelX86SIMDSort/random            3.73 s   3.73 s                                              
  TbbSort/sorted_runs                1.58 s   8.02 s                                              
  HwySort/sorted_runs                2.38 s   2.38 s                                              
  StdPartitionHwySort/sorted_runs    1.11 s   3.21 s                                              
  PdqSort/sorted_runs                5.30 s   5.30 s                                                                   
  IntelX86SIMDSort/sorted_runs       2.90 s   2.90 s


Thanks for sharing. I've posted bench_sort results in another thread. vqsort is about 1.8 times as fast for uniform random, a bit more of an advantage than your results here.

Note that there appears to be some bug because our benchmark's verification fails.


(For completeness, the bug was only x86-simd-sort lacking support for reverse sort, which our benchmark tests for.)


Here's the vqsort discussion from the last time I saw it on this site: https://news.ycombinator.com/item?id=31622548


It would be interesting to see it benchmarked against the highway qsort[1] Google published last year.

[1] https://github.com/google/highway/tree/master/hwy/contrib/so...


sagarm has posted one result in another thread. I'll also look into adding their code to our benchmark :)

It's great to see more vector code, but a caveat for anyone using this: the pivot sampling is quite basic, just the median of 16 evenly spaced samples. This will perform poorly on skewed distributions, including all-equal and very-few-unique values. Yes, in the worst case it can resort to std::sort, but that's a >10x speed hit and until recently also potentially O(N^2)!

We have drawn larger samples (nine vectors, not one), and subsequently extended the vqsort algorithm beyond what is described in our paper, e.g. special handling for 1..3 unique keys, see https://github.com/google/highway/blob/master/hwy/contrib/so....


I've posted bench_sort results in another thread. vqsort is about 1.8 times as fast for uniform random 32/64-bit.


Now we only need a consumer CPU from Intel with AVX-512 enabled.


For consumer cpus, you can go back to 11th Gen Intel if you want avx-512 support.[1] Not ideal, I know.

[1] https://blog.reyem.dev/post/which-consumer-computers-support...


Not listed on that page is the Microsoft Surface Laptop Go, which has the same i5-1135G7 as the X1 Carbon listed.

It appears that MS is clearing out their remaining stock with discounts, and they are really nice little machines with outstanding build quality, very good keyboards, and a 3:2 touchscreen.

It was never a popular machine, I think it had very unfortunate naming which leads people to confuse it with other MS products. You have to think of it as something like a super-premium Chromebook to understand what it is for. But regardless, you can dump Windows and install Linux just fine.


RIP Icelake, we hardly knew you.



Dedicated hardware for common algorithms is actually not very far-fetched. In addition to GPUs, we already have examples of HSMs [1] and TPUs [2] that optimize for specific cryptographic and machine learning operations.

[1]: https://en.wikipedia.org/wiki/Hardware_security_module

[2]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit


There's video decoding and encoding as well.


Pretty sure the internet exists by virtue of algorithm specialized hardware.


Intel has dedicated gzip hardware in QAT among a few other hardware blocks.


well, really that's basically the only place left to go at the moment. I don't think we're likely to have 10GHz any time soon or 1,024 cores either. Specialized circuits are probably all that's left before we start hitting asymptotes.



This is incredible, I feel like I am manifesting things by posting HN comments. What'll they think of next, one billion euros in cash hidden in my wardrobe!?


Careful. The monkey’s paw curls …


This is probably the vectorized quicksort. I remember the paper detailing the algorithm here on HN.

Since then, I've known that if I ever really need to sort numbers very fast, I'll have to learn the vectorized way to quicksort.

I would probably write it directly in assembly though, with a C API (because of tinycc, cproc, scc...).


Thanks Intel! Now maybe you can release some processors that actually have AVX-512 for those of us at home!


Please.


What is the speedup compared against? Is it compared against non-avx code? Or is it compared against avx2 (256-bit)?

From my experience, vectorizing with AVX2 can give a 3x-10x speedup over non-AVX operations depending on the data. The operations involved are things like finding a common prefix or searching in a string.


I read that as 10^17 times faster sorts and I thought to myself, now that’s news!


"Intel has decided that CPUs no longer should sort, and instead will return original arrays in constant time. This has shown to have a 10^17 performance increase in some benchmarks."


"According to the many-worlds interpretation of quantum mechanics, there may exist a universe where every array is already sorted, resulting in a 10^17 performance increase in some timelines."



Is sorting one of those things that’s so common that it deserves its own CPU hardware/instructions to aid in performance?

Or does AVX-512 provide a lot of what that would theoretically be?


AVX-512 is not a dedicated sorting instruction set, but rather instructions dedicated to doing the same computations in parallel over 512 bit wide registers. So you can do the same operation in the same time for 8 doubles, 16 integers, or 64 bytes at once.

Coming up with good usage of those instructions can be tricky. It's not just the typical arithmetic things, but also instructions that shuffle values around in those registers based on values elsewhere; combining all that cleverly can then yield speed-ups for algorithms that deal with lots of data serially.

A while ago, while trying to understand all that (for the older instruction sets), I read this CodeProject article: https://www.codeproject.com/Articles/874396/Crunching-Number... – AVX-512 is basically similar, just wider. Although I've heard it has a few more useful instructions as well that have no counterpart in the older instruction sets.
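For illustration, a few one-line intrinsics showing the same 512-bit addition issued at different element widths (these assume an AVX-512 target; the byte version additionally needs AVX-512BW):

  #include <immintrin.h>

  __m512d add_doubles(__m512d a, __m512d b) { return _mm512_add_pd(a, b); }     // 8 lanes of double
  __m512i add_ints(__m512i a, __m512i b)    { return _mm512_add_epi32(a, b); }  // 16 lanes of int32
  __m512i add_bytes(__m512i a, __m512i b)   { return _mm512_add_epi8(a, b); }   // 64 lanes of int8 (AVX-512BW)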


To add to this:

A really good, digestible talk about using SIMD for sorting: https://www.youtube.com/watch?v=M6HaSvifxwQ

You can also read the author's blog post series, beginning at https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1


Heck yeah. Sorting is a pretty common operation in tons of algorithms, which is why you find some form of a sort function in pretty much every language's standard runtime. Sure, this won't help much for sorting strings, but numerical sorts still address a significant chunk of problems.


"Ordinateur", which is French for "computer", literally means "sorting machine". So wherever the sorting instructions would go, the computer would follow.


Just to pick a random example: creating an SQL database with an index for faster search requires sorting


Except in the special case of bulk-loading data into an empty table, a table that already has an index requires repeatedly inserting items into a sorted 'list' (more likely a B-tree), keeping it sorted.

That’s quite a different thing.


>btree

That's literally a collection of arrays that need to be sorted. And it's still traditional sorting.


I really should have clarified that I understand that sorting is important. Hence why wondering if it deserves its own instructions. ;)


It's important enough that there are whole books written about it.

https://www.amazon.com/Art-Computer-Programming-Sorting-Sear...


That’s a kind of interesting idea. Better implementations of sorting algorithms will go down to a small-ish base case and then do a linear sort. Maybe with AVX512 sized registers there’s room for a “sort the register” instruction to act as the base case, haha.


Lots of things are sorted: search results, recommendations, your news feed


Benchmark results using vqsort's bench_sort on a Skylake workstation (patch: https://github.com/google/highway/pull/1140)

vqsort is about 1.9x and 1.8x as fast on 1M keys and 100M uniform random keys, respectively.

Note that this code did not pass the benchmark's verification.

      AVX3:           vq:     i32: uniform32: 1.00E+06 1051 MB/s ( 1 threads)
      AVX3:        intel:     i32: uniform32: 1.00E+06  539 MB/s ( 1 threads)
      AVX3:           vq:     i64: uniform32: 1.00E+06 1000 MB/s ( 1 threads)
      AVX3:        intel:     i64: uniform32: 1.00E+06  500 MB/s ( 1 threads)

      AVX3:           vq:     i32: uniform32: 1.00E+08  614 MB/s ( 1 threads)
      AVX3:        intel:     i32: uniform32: 1.00E+08  345 MB/s ( 1 threads)
      AVX3:           vq:     i64: uniform32: 1.00E+08  584 MB/s ( 1 threads)
      AVX3:        intel:     i64: uniform32: 1.00E+08  314 MB/s ( 1 threads)


I expect this is only for data that fits in L1 cache. Sorts are memory-constrained once you exceed cache size.


Thanks Intel for publishing something that's useful on AMD consumer CPU's but not on Intel ones.


For those not in the know here, Intel's actually had some fairly ok AVX-512 implementations on consumer chips in the past, even if they do cause the whole chip to downclock significantly.

But for the new Alder Lake CPUs, which have P/E (Performance/Efficiency) cores, the efficiency cores don't have AVX-512, so code would have to find a way to switch the mode it runs in as it is shuffled between cores. So generally, AVX-512 is regarded as not-actually-available on Alder Lake.


> so code would have to find a way to switch the mode it runs in as it is shuffled between cores

It's even worse than that. Initially you could disable E-cores in BIOS to get the system to report AVX-512 being available, but Intel released a microcode update to remove this workaround[0]. Intel also stated that they started fusing off the AVX-512 in silicon on later production Alder Lake chips[1]. Also compare the Ark entries for the Rocket Lake[2], Alder Lake[3], and Raptor Lake[4] flagships. Only the 11900k lists AVX-512 as an available Instruction Set Extension. So it's reasonable to say that AVX-512 on consumer Intel lines is dead for now, whereas AMD has just introduced it in the Ryzen 7000 series.

[0] https://www.tomshardware.com/news/intel-reportedly-kills-avx...

[1] https://www.intel.com/content/www/us/en/support/articles/000...

[2] https://ark.intel.com/content/www/us/en/ark/products/212325/...

[3] https://ark.intel.com/content/www/us/en/ark/products/134599/...

[4] https://ark.intel.com/content/www/us/en/ark/products/230496/...


Does anyone know why they would do this? If AVX-512 works fine on P-cores, and if certain people disable E-cores because they want to use AVX-512, why would they stop those who want to from being able to use it? Why would they go to such extreme lengths to disable something?


"The glibc problem"

You can schedule among heterogeneous cores, that's not really a problem. You simply have another bit for "task used AVX512" and let the task run without AVX512 so it faults the first time it tries to use it. The same stuff is done (or used to be done) for AVX, because if you know a task doesn't use AVX, you don't need to preserve all those registers.

The issue is that eventually someone will find that memcpy* is 4.79 % faster on average with AVX-512 and will put that into glibc and approximately five minutes later all processes end up hitting AVX-512 instructions and zero processes can be scheduled on the E cores, making them completely pointless.

* It doesn't have to be memcpy or glibc, it's sufficient if some reasonably commonly used library ends up adopting AVX-512 when available.


> and zero processes can be scheduled on the E cores, making them completely pointless.

So because AVX-512 is fast, but E cores are slow, we should keep everything slow and prevent adoption of fast AVX-512 to prevent those E cores becoming pointless?


Well, Intel is in the business of selling e cores.


Nobody really knows for sure.

The immediate problem is that CPUID is not deterministic for naive software: if you don't set affinity masks, you don't know whether you will be scheduled onto p-cores or e-cores, and so the result you get will vary.

More generally, software doesn't know what configuration of threads to launch... you want to launch as many AVX-512 threads as you have logical cores that support it, but not more, because they won't run on e-cores.

Software could potentially run a cpuid instruction affine to each logical core though, and collate the results... all you need to know is "16 logical cores with AVX-512 and 4 without".
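A rough sketch of that per-core probing idea on Linux (not anything Intel or the kernel actually provides): pin the calling thread to each logical CPU in turn and execute CPUID there, checking leaf 7 / EBX bit 16 for AVX-512F. Error handling is omitted.

  #define _GNU_SOURCE
  #include <cpuid.h>
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
      for (long cpu = 0; cpu < ncpu; ++cpu) {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET((int)cpu, &set);
          if (sched_setaffinity(0, sizeof set, &set) != 0) continue;  // pin to this CPU
          unsigned eax, ebx, ecx, edx;
          int ok = __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);   // CPUID leaf 7, subleaf 0
          printf("cpu %ld: AVX-512F %s\n", cpu,
                 (ok && (ebx & (1u << 16))) ? "yes" : "no");
      }
      return 0;
  }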

And software that isn't AVX-512 aware doesn't need to worry about it at all, since it doesn't know AVX-512 instructions. I guess the long tail of support is the stuff written for Skylake-SP in the meantime, but how much adoption really is there? It's that narrow gap between "regular stuff that never adopted AVX-512 because it wasn't on consumer platform" and "stuff that isn't HPC enough to be really custom" but also "stuff that won't receive an update". How much software can that really describe, especially with the reaction against Skylake-SP's clockdowns in mixed AVX-512+non-AVX workloads?

And also, that software can just launch AVX-512 threads, and if they end up on the e-cores you trap the instruction and affine them to the p-cores. Linux already has support for this because Linux doesn't save AVX registers if AVX instructions have never been used, so it would just become another type of interrupt for that first AVX-512 instruction. Linus has commented that this is perfectly feasible and he's puzzled why they're not doing it too.

Nobody knows what the fuck is going on and there has been no plan expressed to anyone outside the company as to what the exact problem is and whether they're looking at anything to fix it going forward. It's a complete mystery, nobody even knows if it's something critical or everything is just too on-fire to care about that right now.

(and if it wasn't on fire before, it probably is now, nobody you want to retain is hanging around after a 20% pay cut off the top and truly insulting retention bonuses... ranging as high as $200 for a senior principal (no, that is not missing a "K"). Oh and we paid $4b in dividends, and you need to move to Ohio if you want to keep your job, yes the ohio with the cancer cloud. Intel is fucked.)


Perhaps market segmentation, perhaps they heard of a vulnerability in their implementation that they couldn't patch (hence the microcode update). Intel loves market segmentation (server specific avx extensions, bfloat, ecc, overclocking), and I wouldn't be shocked to see them sell avx512 support as a "dlc" microcode update down the road.


I wonder what Intel's plan is for the future here. Will a future efficiency core support AVX-512? Or will intel just abandon it on consumer in favor of a variable-length simd instruction set?


Crestmont is the next e-core after gracemont and appears to still not have AVX-512.

It would be highly desirable for e-cores to implement microcoded AVX-512 support to break the heterogeneous-ISA problem, if nothing else. You don't need to use the same implementation but if you can support the same ISA via microcode then software doesn't need to worry about heterogeneous ISA. Maybe crestmont does the microcode thing, possibly, but in the die shots there aren't many visible changes in the vector unit vs gracemont design.

The next p-core will continue to have AVX-512 but it will continue to be fused off.

This obviously is completely insane, like, even if you validated Raptor Cove already and you can't just take AVX-512 out, you're just going to keep including it in all your future designs too? It's not an insignificant amount of area, even on consumer it's probably at least 10% extra area just to even support a 256b vector/microcode, and it just looks crazy to introduce and then abandon it at the exact moment your competitor adopts and supports it.

Only thing I can think of is that maybe they have some other instructions which utilize microcode intrinsics implemented on the AVX-512 engines... like how Turing implements its Rapid Packed Math (dual-rate FP16) support using the tensor engines. Something else in the design that locks them into AVX-512 even if it is not externally exposed?

But again, they don't even support it even if you turn off the e-cores entirely... why the fuck would you do that? It's like the most confusing resolution to this problem and satisfies nobody, probably not even Intel. I guess they flatly do not want to touch it at all for some reason.

Note that this also includes mobile going forward since mobile will have big.LITTLE designs too... it's a serious amount of work and years of rollout they're pissing down the drain here.


It's going to be somewhat hilarious if Intel makes an AMD Bulldozer-like architecture for E-cores where AVX-512 units are shared among 2 or 4 e-cores.


To go back to a past tech-screed: CMT as implemented by Bulldozer is just SMT with inefficiently-allocated resources. If the frontend (cache, fetch, decode, scoreboarding) and the FPU and retirement are all shared, what exactly did Bulldozer have that was unique to each 'core'? It was an integer ALU dedicated to each thread, that's it. And that's functionally identical to SMT but with a dedicated ALU for each thread. And if one of the threads has enough ILP to occupy two units and the other one isn't being used... why not let the thread use them both and get more work done?

https://news.ycombinator.com/item?id=34494484

So sure, let's do Bulldozer, a bunch of weaker but space-efficient threads (you know, e-cores) but put four threads on a single module sharing an AVX-512 unit, but let's also make it SMT so they can steal unoccupied execution units from other threads in their module. We could call it... Xeon Phi. ;)

https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing

And sure maybe Bulldozer was "ahead of its time" but I think that probably undersells just how weak they are especially when threads in a module start contending for shared resources. Both Bulldozer and Xeon Phi get incredibly weak when multiple threads are on a module, the higher threadcount is offset by a reduction in IPC too. And while that is still probably a net perf-per-area gain, your application really has to like threads for it to be worth it.

I'm gonna say it: if you think Bulldozer was "ahead of its time" then so was Larrabee and Xeon Phi. Bulldozer was a first stab at this Xeon Phi idea for AMD. And in both cases I'm not 100% sure it was worth it. The market doesn't seem to have thought so.

Now again: the devil is kinda in the details. It all depends what is shared. If you can make it so the performance hit is really small except for the shared FPU, that's one thing - there's nothing inherently wrong with this idea, that's what the Sun Niagara series does too. But Sun Niagara is also noted for comparatively weak FP performance (it's an integer-monster for database work). It all depends on just how much duplication per-thread and how much shared resource and how much area benefit it gets you.

But like, take a Niagara core, and let's say we have a couple threads with an opportunity for a bunch of ILP and a bunch of unused execution units sitting there. Why should you not launch onto them? That's the CMT vs SMT question to me. And it's fine if the answer is "scheduling complexity" but you need to think about that question before just blindly pinning resources to specific threads.

And again none of this is to dump on e-cores specifically. Intel's P-cores are too big, they are like triple the transistor count of AMD for a smidge more performance. I think the long-term future lies in the e-cores for Intel, they will replace Coves with Monts eventually. Sierra Forest is the most interesting Intel product in a long time imo. AMD is in less need of e-cores, a Zen3 core only has about 2x the transistor count of a Gracemont core and to me that's fine, it's an SMT core with higher per-thread performance too, that's a reasonable sidegrade. AMD's strategy of pursuing "compact" cores makes sense to me, they don't need a whole separate e-core, their P-cores are already area-efficient. They just are going to squeeze the last 10-20% out of it for area-optimized applications and call it a day.

(AMD has done a really good job avoiding cruft - supposedly Zen3 was a from-scratch redesign (Zen2 was actually supposedly a tweak, according to AMD's engineering lead), etc. And they've built this modularity of design that lets them embrace semicustom and advanced packaging and make innovative products and not just architectures. It really feels like Intel has been coasting on the Sandy Bridge design for a long time now, not even the kinds of Zen2->3 shifts, just incremental tweaks. Their iGPU stuff is evidently just as tightly tied to the CPU side as the CPU stuff is to their nodes, everything at Intel is obviously just one giant ball of mud at this point and it's incremental changes and legacy cruft all the way down. I am very down on Intel lately because even completely ignoring the current set of products and their merits, AMD is executing well and Intel is simply not. They've had 6 years since the Ryzen launch to turn things around and they still can't do the job right. AMD is obviously the better company right now in the Warren Buffett "own stocks that you'd want to buy the product" sense.)

https://www.youtube.com/watch?v=3vyNzgOP5yw

But I'm just not sure the Xeon Phi/Bulldozer/Niagara concept has really worked all that well in practice.

Anyway it's also possible that instead of sharing one unit among four cores, they put a unit in each core but it executes over multiple cycles, like AMD did with Zen1/Zen+ and 256b vectors. Or you have two 256b units that fuse to become a 512b unit. That seems to have been the design trend recently, that's how AMD does their AVX-512 on Zen4.

But those kinds of changes are what I mean when I say "if there were changes in the vector units it would probably show up on the die shots". Crestmont die shots seem to show a pretty unchanged AVX unit from Gracemont - it seems unlikely they changed it too significantly.

https://www.semianalysis.com/p/meteor-lake-die-shot-and-arch...

https://twitter.com/Locuza_/status/1524441315441786881


Delightfully fun write up. I made a pretty superficial jab, but you've really painted a great picture of microprocessor design/tradeoffs as they've happened. Nice links. Just getting started on the Mike Clark of AMD interview & the background story alone has been delightful to hear, excited for the rest!

Happy 1 year-since-reveal to Sierra Forest, announced February 17th 2022 at the Intel Investors Meeting. Definitely have a deep love of the "communication processor" grade gear, many-thread cloud systems, & this really can set the tone on what to expect from new Intel & massively-many-core systems. Wikipedia says it's Gracemont and Intel 4 (Intel 4 being due really soon, Gracemont from 2021), and Intel's recently said SF ought to ship in 2024; I hope the plans for this chip have some room to evolve, or that we see more follow-up parts in good time. As you say, P-cores are just too damned big; Intel's been iterating on one big-core design for too long & it's gotten too big. Figuring out what we can do with smaller cores is much closer to the sweet spot for nearly all cloud loads: lots of processes of all sorts running; seems like turf for SMT (and you're 99.98% likely to be right about CMT but who knows, especially if we have a limited number of very big vector executors).

Two random mentions. I did really like Lakefield, which tried to be an ultraportable-capable 1P+4E-like system, extremely well integrated. Also, I was incredibly fascinated by the semianalysis post mentioning a rumor that the Intel Meteor Lake SOC die might have its own integrated E-core island? That seems insanely bright; just turn off the core complex, a lot. Extremely smart for consumer computers. I think that task of actually understanding when real P-cores really should be brought up is an interesting challenge facing consumer computing today; a place where Lakefield probably had good hardware but not enough software tuning to make the power-efficiency trade-offs that would have let it truly shine as exceptional.

Ok, so meanwhile AMD is going to be trying Zen 4c (starting with Bergamo, due about now-ish), with SMT2 (like Zen4) & a huge number of somewhat cut-down Zen cores (less cache, power, clocks). AVX512 is supposedly still included, but at an even more reduced rate than Zen4's reduced rate; awesome, sounds great. Not as relevant to the discussion so far, but just gonna mention: rumor-mill this week is that Genoa, the large Zen4 epyc chips, which were expected real-soon-now, are alleged to be facing some significant delays.

I'd love to see some SMT3/SMT4 show up again. Calling out Knights Landing as an SMT4 chip with a big vector unit is extremely on target, extremely interesting. That question of how bad the scheduling complexity really is is a compelling question, one that is much less visible than many of the knobs & dials that core design more visually alters (cache sizes & bus widths often being measurable via a random die shot at a trade show, for example). I suspect it probably is not really that huge a barrier. Still, we seem loath to explore much beyond SMT2, for the time being, but maybe that's fine.

Thanks again for the very fun posts paulmd. You have a ton of other delightful chip-design scuttlebutt in comments elsewhere too; this is a treasure to read.


> Delightfully fun write up. I made a pretty superficial jab, but you've really painted a great picture of microprocessor design/tradeoffs as they've happened.

PaulGPT aims to deliver. Not actually an AI, just frequently accused of being one because I read some shit and it triggers Opinions and Tangents. And I don’t stake any position lightly, I have More Opinions why I’m right lol. It leads to Controversy. But I’m perfectly willing to defend my opinions against counterarguments and ultimately I’d rather mald and then admit I’m wrong.

> Definitely have a deep love of the "communication processor" grade gear,

I really wish Denverton had been more available. I can't even bite on surplus enterprise gear because there's barely any out there. Same with xeon-D, sick on paper but way too expensive. Plz make the intel accelerator thingy just onboard everything and also useful for zfs checksumming, that would be a gamechanger for ZFS on NVMe :\

> I did really like Lakefield, which tried to be a ultraportable capable 1P 4E alike system

Yes as I have commented, I really have been a fan of Kabini (Athlon 5350), Airmont (N2808), and Goldmont Plus (J5005) and recently I landed a pair of Skylake NUC7i7 for $125 a pop as well. They still are compelling for certain "microserver" applications given their extreme low price - a $50 CPU+mobo or a $125 booksize changes the expectations. 10 years ago it was $50 for a 5350 and mobo, 8 years ago it was $125 for a 2GB/32GB ECS Liva X, 3 years ago it was $125 for a J5005 NUC, recently $125 for a barebones with thunderbolt support? Yes, I like cheap machines even if their power is limited, $150 for a barebones machine that offers a faster capability at low TDP or some other unique capability is fine with me.

I am looking to use a RPi4 for a local NTP stratum-1 server with GPS and maybe use some of the nucs or other minipcs for freeIPA or a wireguard bastion or similar. With sufficient RAM a J5005 NUC actually made a really nice thin client during COVID WFH - swapping completely tanks performance and 16GB+ makes it perfectly fine even with lots of tabs/etc.

I have my eye on the Atlas Canyon NUCs too, which are finally available in quantity. The only things I don't like are the reduction to single-channel (Which seriously impacts performance vs the expected scaling, especially in iGPU) and the continued lack of Thunderbolt/USB4 - I like that it finally has M.2 NVMe and some other niceties but it really needs USB4 so you can plug it into more powerful stuff if desired. External expansion is going to be very baseline once USB4 reaches saturation and it’s not going to be that many years.

> rumors, that the Intel Meteor Lake SOC die might have it's own integrated E-core island? That seems insanely bright; just turn off the core complex,

Yeah that would be a cool workaround to the data-movement power penalties of chiplets/tiles. I mentioned elsewhere but having to have IF links powered up just to have cores idling along is an obviously dumb thing and yeah that’s a good solution for it, turn off the IF links and just run shit on the IO die.

Data movement and general idle-power is obviously the penalty of MCM and the more data you move across the more (smaller) chiplets the higher it is. It’s all just some new asymptotic limit of scalability (which direct bonding like cu-cu will change again, at the cost of thermals).

Intel claims EMIB is less than an interposer… I’m not really clear how the fiberglass-style interposer vs silicon interposer vs EMIB all stack up in practice. EMIB is 900% crucial to Intel’s future though, remember that TSMC offers advanced packaging solutions and (just like sapphire rapids getting their shit together so Intel has a viable cell library+process to sell to custom foundry) getting their shit together on EMIB is going to be a mandatory requirement for custom foundry’s success. The idea that a couple major intel chiplet/tile based products are seeing a lot of delays is generally concerning.

> I think that task of actually understanding when real P-cores really should be brought up is an interesting challenge facing consumer computing today

I think phones have pretty good solutions for this but consumer and server both may have a different optimum for poweriness and boostiness. I think an optimum solution probably is not computable for the same reasons most “sufficiently advanced compiler-magic” doesn’t work - you’ll only really know at runtime. Still I am highly in favor of whatever hinting schemes we can come up with - good dev behavior can make a lot of difference.

Apple’s cores are really interesting in this area. Blizzard is extremely fast and small (like 1/3 of a Gracemont core transistors for similar-ish performance? I’ve never seen exact numbers but broadly that’s how it stacks up) and honestly Avalanche is sick, especially for any sort of JVM task. It’s just generally good at JIT, it’s not just x86 it really just crushes JIT compared to other architectures and JVM falls into that too. Cinebench is underselling the perf/w at load (because of less frontend load on x86) and I think the idle power stuff is undersold too. When it’s really on Linux I think the numbers are going to be impressive.

What’s a real easy answer to “when should I run p-cores”? Just have a real fucking fast e-core and if the e-core complex is getting overrun on a prolonged basis, pull out the really big guns.

> Not as relevant to the discussion so far, but just gonna mention: rumor-mill this week is that Genoa, the large Zen4 epyc chips, which were expected real-soon-now, are alleged to be facing some significant delays.

I did not know this, welp. Intel catches a bit of a lucky break. I think their mindshare is damaged though, even if Genoa were 6 months after SPR-SP nobody would really care tbh, intel’s been a lot later and AMD shows signs of better execution these days.

Sapphire Rapids is a good chip though. It’s not “lol throw your AMD shit in the dumpster” tier good, but, the game is back on, it’s good enough Intel can sell it, especially if they continue to be willing to cut deals. They need cash, they’re desperate, and this is obvious even now, if you’re paying more than half list price for intel you’re a complete fucking chump and quarter or less is more typical. Same for the 10-series during pandemic and the 12/13th gen prices, Intel is willing to deal to keep the fabs busy and keep the cash coming.

https://www.youtube.com/watch?v=_2yjjHzifL8

> Calling out Knights Landing as an SMT4 chip with a big vector unit is extremely on target, extremely interesting.

That’s never occurred to me either but the framing of “lol what if bulldozer but with AVX-512” rubbed both those nerves at the same time. No fuck you what if that already exists and everyone hates it!?!? ;)

(And then the transformer model takes over, beep boop paulGPT online, that's an interesting one for the following reasons ;)

> That question of how bad the scheduling complexity really is is a compelling question

Yes I agree, that is the money question. How much area did AMD save by having some fixed portions of the pipeline that don’t have to be scheduled between threads? How much does Sun save on Niagara? Or Intel on Core-SMT2 or Phi-SMT4? This would be an extremely interesting three-way from chips+cheese or similar, how well did all those sets of tradeoffs work and why were respective decisions made for those designs/use-cases? They all made different decisions around their frontend, I should take a peek at agner fog’s microarchitecture on those uarchs sometime.

> I suspect it probably is not really that huge a barrier.

That’s my guess as well. Alternating threads on the frontend may be the worst of it. If you have fixed decode/fetch units per-thread ala HyperThreading (or the option of a split of 5/0, 3/2, 2/3, etc) maybe that’s most of the squeeze. IDK though.

> Still, we seem lament to explore much beyond SMT2, for the time being, but maybe that's fine.

No… there is another ;)

https://en.wikipedia.org/wiki/POWER9

TBH I really really want a TALOS II setup, that is going to be one of those things that I snipe in 10-15 years when they’re cheap, it’s a neat piece of hardware. I have an AMD Quadfather and a KNL pcie coprocessor (we have a discord, folks!) and some other assorted nostalgia-tech too.

https://discord.gg/2qJXMTmE (expiring link to control spam, feel free to ping me on another tech thread later if anyone needs)


Perhaps there wasn't enough lead time to get it into Crestmont after the decision to go with P+E for Alder Lake.


AFAIK consumer Zen 4 supports 12 of the 15 AVX-512 extensions; do we know for certain this doesn't target one of the ones AMD is missing?


The newest extension it needs is -VBMI2, which is supported by Zen 4. -DQ and -BW are quite old and very common amongst all implementations by this point.
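A hedged sketch of how one might confirm at runtime which of those subsets the current CPU reports, using GCC/Clang builtins (the extension-name strings below are the ones recent compilers accept; on older compilers you would fall back to raw CPUID):

  #include <stdio.h>

  int main(void) {
      printf("AVX-512F:     %d\n", __builtin_cpu_supports("avx512f"));
      printf("AVX-512DQ:    %d\n", __builtin_cpu_supports("avx512dq"));
      printf("AVX-512BW:    %d\n", __builtin_cpu_supports("avx512bw"));
      printf("AVX-512VBMI2: %d\n", __builtin_cpu_supports("avx512vbmi2"));
      return 0;
  }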


zen4 supports basically everything except the xeon phi SMT4 intrinsics (4VMMW or whatever). As did alder lake before its removal.

The support story for AVX extensions is not as complex as people make it out to be anyway. Server is a monotonic sequence, consumer is a monotonic sequence, both of them are converging apart from 1 or 2 that are unique to one or the other. Xeon Phi has the SMT4 intrinsics that are completely its own thing due to the SMT4 there, but you'll know if you're targeting xeon phi.

https://i.imgur.com/idAjB1X.png

So as you can see, consumer supports everything except BFloat16 for neural net training. Consumer doesn't do that so it's not a problem. And it doesn't support the Xeon Phi stuff because Xeon Phi is its own crazy thing.

No uarch family in that chart has ever abandoned an extension once it was adopted. So unless you are taking a consumer application and running it in the server, it's literally not even a problem. And server gets bfloat. That's it, that's literally the only two things you have to know.

but letting AMD fanboys draw le funni venn diagram is obviously way catchier than a properly organized chart representing the actual family trees involved... SSE would look bizarre if you represented it that way too, like all AMD's weird one-off SSE4 extension sets released in the middle of more fully-featured implementations... but people working in good faith would never actually be confused by that because they understand it's a different product family and year of release is not the only factor here.

--

Really the thing that has been a problem is that server has been stalled out forever... first 10nm problems and now sapphire rapids has more than a dozen known steppings. They can't get the newer architectures out, so consumer has been moving ahead without them... up until alder lake nuked the whole thing. If server had been able to get newer uarchs out, there would be a lot more green bars in server too.

supposedly the fab teams are actually ready to go now, and the problem is the design teams aren't used to operating in an environment where they can't go down the hall and have the fab teams fix their shit. Intel put the foot down and aren't letting them do that anymore, since the fab teams need to sell the resulting process/cell libraries to external foundry customers, and the design teams need to be able to make their shit work on external foundries. You can't do this hyper-tuned shit where the process is tweaked to make your bullshit cell designs work. But some of the teams are not mature enough to work in a portable environment where design rules actually have to be obeyed because Intel historically never had to.

When you hear the infinite steppings of Sapphire Rapids and the network chip team's continued inability to put out a 2.5gbe chipset that works (I think we are on public release number 6 now?), it's pretty obvious who the worst culprits were. Meteor Lake may also be having packaging/integration problems (although this is supposition by me based on what products are delayed - coincidentally it is a lot of chiplet/tile stuff and intel obviously lacks experience in advanced packaging) but the products that have infinite steppings obviously can't get their own shit together even on their own tiles let alone talking to other people's tiles.

But Intel supposedly are not kidding that Intel 4 is ready to go and they've just got nothing to run on it yet. Hence looking for outside partners. Supposedly they've got at least one definite order signed for Intel 3 in 2024, and I think there will be a lot of people happy to diversify and derisk away from the TSMC monoculture that has emerged... if TSMC stumbles, right now there is no alternative.

https://www.tomshardware.com/news/intel-ifs-lands-3nm-to-mak...

Samsung has all the same conflict-of-interest problems as Intel and also a track record of really mediocre fab execution. Supposedly they are ahead on GAAFET but like... we'll see, it's Samsung, who knows. They've stumbled just as much as Intel, just not on 7nm tier - I remember the iphone "is it TSMC or Samsung" games too. Samsung has put out a lot of garbage nodes and a lot of poorly-yielding nodes of their own.


edit since I can't edit: "And server gets bfloat" meaning "if you were to bring a ML training server application over to consumer it might not work".

Basically what I'm saying is, the only 2 situations that would be a problem is going consumer->server (which I don't see happening often) or going server ML training -> consumer if it doesn't have a non-BFloat16 fallback. And everyone does ML training on GPUs anyway.

Otherwise everything supports everything. Going backwards within a family might be a problem, but, that's always a problem, it's not a support matrix problem where there's a mixture of capability, it's just backwards compatibility to older hardware with less features.

The real problem, as I said, is that "Cooper Lake" there is Ice Lake-SP which was stalled for years, and by the time it was adopted Milan was already in the market and Cooper Lake was dead on arrival. So nobody actually has Cooper Lake, if you have AVX-512 server it's 99.9% chance it's either Skylake-SP or Cascade Lake-SP.

Which is 100% drop-in compatible with any consumer platform that anyone has (since conveniently nobody has Cannon Lake either). The literal only problem is taking consumer applications and running them on server stuff, and there's a well-defined server compatibility set there too.

--

Going forward, Sapphire Rapids is Golden Cove cores, so it should have the same support bars as Alder Lake there, ie basically everything, including server bfloat as well.

https://www.phoronix.com/image-viewer.php?id=intel-sapphirer...

(and of course the other problem being Intel has no idea what the fuck they're doing with big.LITTLE on the consumer platform... the support matrix for everything consumer-family going forward is apparently "nothing" because they've dropped AVX-512 entirely.)

--

Let me drill this down to the generations you actually need to care about: (that poor PNG...)

https://i.imgur.com/2HLrIjr.png

Like literally the AVX-512 support matrix is a complete fucking non-issue, it's an absolute tempest in a teapot by people who have never touched or looked seriously at AVX-512. The AVX-512 rollout is a dumpster fire in many many ways but an overly-complex support matrix is not one of them.


Numpy is something you could expect to find running on a workstation and Intel's workstation CPU line has had AVX-512 continuously since 2017.


Alder Lake and Raptor Lake workstation (W680 Chipset) doesn't have AVX512 enabled.


Fair. The "entry workstation" thing from Intel is baffling. I was thinking Xeon W, but then of course there was the Xeon W-12xx that lacked AVX-512.

In short, I was wrong. It would have been more correct to say that Intel has offered a workstation part with AVX-512 continuously since Skylake.


Intel released new Xeon W CPUs with AVX-512 a couple of days ago.

The wx-24xx uses the same P-cores as Alder Lake, but with AVX-512 enabled. Same for the wx-34xx, but with Raptor Lake's P-core.

The entry-level Xeons are believed to be identical to Core iX, with the E-cores disabled and AVX-512 enabled. The extra I/O and ECC support is done by the chipset.

In summary, they went back to supporting AVX-512 only on Xeon W.

This makes sense, since it must be hard to schedule between cores with and without AVX-512 (the E-cores don't have it).


Sad. Big win for Ryzen or a future Zen 4 Threadripper.


There was a longstanding issue where AVX-512 would trigger frequency throttling on a number of Intel CPUs, resulting in a net performance penalty for mixed workloads.


No, there was a longstanding issue of people worrying that it would result in a net performance penalty for mixed workloads. Meanwhile people actually using AVX-512 dealt with that by making sure mixed workloads were batched appropriately and it is usually a big net win even if you don't worry about it.


Ok, but that hasn’t been a problem in a while.


Yeah, but the mystique still lingers on.


Hasn't been a problem because software corrected for that, or hasn't been a problem because Intel resolved the underlying throttling issues with AVX-512?


It was never really a problem.


It was only useful in a couple of corner cases, but not in general library functions.

Intel could never fix its thermal issues; AMD did.


Both


“Longstanding” here means “really only in a single CPU generation (SKX), and even then only if you were dumb and used like 50 AVX-512 instructions in isolation for no reason at all.”


Assuming Intel didn't add code that's:

  if (AMD_CPU) { go_slow() }
.... again


It's open-source, as the article says.


I still don't get why people think Intel is obligated to optimize AMD performance. From what I recall, it wasn't a case of slowing down AMD devices; they just didn't apply code optimizations to machines not using Intel.


The code literally checked for CPU = AMD, and ignored the CPU feature bits that show which accelerations are available.

So sure, Intel shouldn't tune for AMD, but if AVX2 is listed as available, a compiler should use it. This was proven when, via some shared-library trickery, you could lie about the CPU name and suddenly the AMD CPU was faster.

I stumbled across this on a "Why is matlab much slower than I'd expect" thread. Lying about the CPU greatly improved performance and still showed the correct answers.


With all the errata compilers need to correct for, I wouldn't blame the compiler for not optimising for foreign chips. Matlab chose to use a library that only works well on Intel (a simple benchmark would've shown that); I don't think Intel's compiler team should be forced to write code for AMD chips. I very much doubt AMD's driver team will optimise their OpenCL tools for Nvidia hardware either.

Blame Matlab and friends for slowing down their software on your computer.


If Matlab performance is important for competitive reasons, AMD should hire a few SW engineers to build tuned linear-algebra libraries like Intel does. It’s not rocket science. They could be competitive or faster with a low-millions-per-year investment (and only slightly behind with a much smaller investment in grad students and open source).

I worked on these sorts of libraries (not at Intel) for a decade; it's very, very common to dispatch on CPU family rather than feature flags, because performance details are tied to the microarchitecture, not availability of the feature. Even though Skylake and Haswell both support AVX2 and FMA, there are huge differences in implementation that affect what the best implementation will be, so you can't just say "oh, FMA is available, use version 37." Instead you do "if we're on Skylake, use the Skylake version, if we're on Haswell, use the Haswell version, otherwise fall back on the generic runs-anywhere version." Nothing underhanded about it.
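A minimal sketch of the two dispatch styles being debated here, using GCC/Clang's __builtin_cpu_is / __builtin_cpu_supports; the kernel names are placeholders for illustration, not any real library's API:

  // Empty stubs standing in for tuned kernels.
  static void dgemm_skylake_avx512(void) { /* microarchitecture-tuned kernel */ }
  static void dgemm_haswell(void)        { /* microarchitecture-tuned kernel */ }
  static void dgemm_avx2_generic(void)   { /* feature-flag fallback */ }
  static void dgemm_portable(void)       { /* plain C fallback */ }

  void dgemm_dispatch(void) {
      // Dispatch on microarchitecture first, as tuned BLAS-style libraries often do...
      if (__builtin_cpu_is("skylake-avx512"))      dgemm_skylake_avx512();
      else if (__builtin_cpu_is("haswell"))        dgemm_haswell();
      // ...then fall back on feature flags, so an unrecognized (e.g. AMD) CPU that
      // reports AVX2+FMA still gets a vectorized path instead of the scalar one.
      else if (__builtin_cpu_supports("avx2") &&
               __builtin_cpu_supports("fma"))      dgemm_avx2_generic();
      else                                         dgemm_portable();
  }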


> it wasn't a case of slowing down AMD devices, they just didn't apply code optimizations to machines not using Intel

What's the difference?

The thing that would be okay is "not having optimizations designed for AMD devices".

But when you already have the optimizations, and you refuse to use them because AMD, that is not okay.


> What's the difference

Let's be naïve: "not doing something" is indeed, on a literal level, different from "doing something to slow it down". It's a bit like the nuances of a lie by omission vs a proper lie. That being said... both are still considered lies in a more abstract sense, and so is this deliberate slowdown. I assume the author just took it all a little more literally than most people (probably?) deem necessary.


I would say "not doing something" is an incorrect description of what it did, though. They had to write extra code to make the behavior on Intel and AMD differ.



