AVX-512 has a number of instructions that make it useful for UTF-8 parsing, floating-point parsing, XML parsing, JSON parsing, things like that. It is tricky coding, though.
All things that are good for HFT, but also good for speeding up your web browser and maybe even saving power, because you can dial down the clock rate. It's a tragedy that Intel is fusing off AVX-512 in consumer parts so they can stuff the chips with thoroughly pizzled, phone-ish low-performance cores.
Vector instructions (à la Cray) are future-proof in ways that SIMD is not. Eventually every high-performance CPU will have HBM, and I am not sure SIMD is the mechanism we will use to extract the most efficiency out of the platform.
I feel that SIMD vs vector is not the most useful distinction to make.
Yes, at the ISA and binutils level there is a difference in terms of number of instructions and ugly encodings.
But the application source code can look the same for SIMD and vector - a vector-length-agnostic style is helpful for both.
Instead, it seems more interesting to focus on preparing more applications for either SIMD or vector: with data-parallel algorithms, avoiding branches etc.
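For what it's worth, here is a minimal sketch of what that vector-length-agnostic style can look like, assuming GCC's <experimental/simd> (std::experimental::simd); the loop body never names a width, so the same source can map onto 128/256/512-bit SIMD or a vector ISA:

    // Minimal sketch: vector-length agnostic elementwise multiply-add.
    // The loop never hardcodes a width; native_simd picks it per target.
    #include <experimental/simd>
    #include <cstddef>
    namespace stdx = std::experimental;

    void muladd(const float* a, const float* b, float* out, std::size_t n) {
        using V = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V va, vb;
            va.copy_from(a + i, stdx::element_aligned);
            vb.copy_from(b + i, stdx::element_aligned);
            (va * vb + va).copy_to(out + i, stdx::element_aligned);
        }
        for (; i < n; ++i)                   // scalar tail; masked loads/stores
            out[i] = a[i] * b[i] + a[i];     // would remove this on AVX-512/SVE
    }

Same source, different widths at compile time, which is the point: the data-parallel structure is the hard part, not the instruction set.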
Of course when you only have packed instructions (a la SIMD) you have no choice but to use them in place of vector instructions, but vector instructions in general do not fully subsume packed instructions. I think packed instructions are better at exploiting opportunistic parallelism from either a short loop (where vector instructions would have higher overhead) or a series of similar but slightly different scalar instructions.
Not sure if you're referring to high frequency trading here, but if that's the case then it's just not true.
AVX is infamously bad for low latency because these instructions generate an insane amount of heat, which forces the CPU to downclock (sometimes the cores on the neighboring tiles of the SoC too). P-state transitions are a serious source of execution jitter and a big no-no in our business.
Many HFT shops actually disable AVX with noxsave.
It's fantastically useful for HPC though.
I think I found there’s also a quite expensive ‘warmup’ which makes much of AVX only worth it for larger loops. But my memory or original understanding might be incorrect and my understanding of why that warmup might be the case (maybe microcode caches in the front end??) is very poor.
Didn't read the post? That is explicitly mentioned.
> Yes, yes, yes, you can use it in random places. It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
I couldn't agree more. I don't think compiler vectorization is that useful even for the columnar (!) database we're building. The specialized JIT doesn't even use AVX-512, because it's too much effort for little to no gain.
Vectorization (auto or manual) can really help optimize bottlenecks like evaluating simple comparisons once you're out of easy algorithmic wins. It takes so much attention it's only worth doing in the most critical cases IMO.
Auto-vectorization can be fragile.
Manual vectorization is a ton of work and difficult to maintain.
> And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
How much of that is due to language semantics? Can't FORTRAN compilers make better use of the AVX instructions than, say, a C or Rust compiler could due to memory reference semantics?
In theory, Rust should have much better memory semantics than C, but currently much fighting is required to get LLVM to take advantage of the Rust semantics.
I think that's what "if you aren't into parsing (but not using) JSON" is supposed to be referencing.
And, "It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it."
One thing I learned in grad school is you can do a stupendous number of floating point operations in the time it takes to serialize a matrix to ASCII numbers and deserialize it. Whatever it is you are doing with a JSON document might be slower than parsing it.
It's true that autovectorization accomplishes very little but specialized libraries could have a notable effect on perceived performance if they were widely developed and used.
Frankly, Intel has been less interested in getting you to buy a new computer by making it the best computer you ever bought than in taking as much revenue as it can from the rest of the BOM: for instance the junk integrated graphics that sabotaged the Windows Vista launch and has been making your computer crash faster ever since. Another example is that they don't ship MKL out of the box on Windows or Linux, although they do on macOS. And Intel wonders why their sales are slipping...
Matrix multiplication and the like are also among the few operations where algorithms and special-case instructions are interesting for floating point on a massive scale.
I.e., adding two arrays together or computing dot products: those operations are just memory-bound once the data grows, but matrix multiplication is dense enough in operations per element that it is limited by arithmetic as well.
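Rough back-of-envelope of why that is (illustrative numbers only): an n×n matmul does ~2n^3 flops over ~3n^2 elements touched, while a dot product does ~2n flops over ~2n elements:

    // Back-of-envelope arithmetic intensity (flops per element moved),
    // illustrating why matmul can be compute-bound while dot products are not.
    #include <cstdio>

    int main() {
        const double n = 4096;                   // example matrix dimension
        double matmul_flops = 2 * n * n * n;     // ~2n^3 multiply-adds
        double matmul_elems = 3 * n * n;         // A and B read, C written
        double dot_flops    = 2 * n;             // n multiplies + n adds
        double dot_elems    = 2 * n;             // two input vectors
        std::printf("matmul: %.0f flops/element\n", matmul_flops / matmul_elems);
        std::printf("dot:    %.0f flops/element\n", dot_flops / dot_elems);
    }

At n = 4096 that is roughly 2,700 flops per element moved for matmul versus 1 for the dot product, which is why the former keeps the ALUs busy and the latter just waits on memory.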
> Intel oneAPI Math Kernel Library (Intel oneMKL; formerly Intel Math Kernel Library or Intel MKL), is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math.
> One thing I learned in grad school is you can do a stupendous number of floating point operations in the time it takes to serialize a matrix to ASCII numbers and deserialize it. Whatever it is you are doing with a JSON document might be slower than parsing it.
> It's true that autovectorization accomplishes very little but specialized libraries could have a notable effect on perceived performance if they were widely developed and used.
I mean, sure, but even if we take JSON as an example, in the vast majority of cases it gets fed to a giant blob of JS driving an even bigger blob of browser code.
The cases where you do deserialization -> very little processing -> serialization are pretty rare.
Sure, if it is already on the chip you might as well use it, but realistically the savings will be in the single-digit percents.
> The cases where you do deserialization -> very little processing -> serialization are pretty rare.
Actually I've seen a lot of systems that do that - query a datastore, do some minimal processing of the results, feed them back to the caller. Although that tends to get addressed at a higher level, e.g. MongoDB drivers shifting towards using BSON.
I don't think Intel has even decided whether to fuse them off or not. In my newest Intel desktop, a Core i7-13700K, AVX-512 is actually available. It wasn't available on the i7-12700K that was in the same system on the same motherboard a few weeks ago.
I thought the reason they wanted to fuse them off was so that all cores would be completely compatible (just slower), without the need for the OS to force scheduling onto specific cores. (I'd guess what would have to happen is: if an instruction not available on an efficiency core were called, it would trap, the OS would see it and then mark the process as only executable on a performance core?)
The official specs for Raptor Lake don't include AVX-512 support, so most likely this is just something that slipped through on your system. It might be "corrected" in later steppings or updated firmware. Some early 12th-gen systems had unfused CPUs with AVX-512 available on the performance cores and firmware that didn't disable it. https://www.intel.com/content/www/us/en/products/sku/230500/...
Generally speaking, the idea that the average person uses more than 4 cores is insane. Even as a power user/dev I can count the times when I needed more than 4 cores on one hand.
All the benchmarks that these CPUs are evaluated by test either games, or the same crazy stuff like fluid simulations that OP's article complains about. Hardware reviewers need a reality check.
Many processes yes, but background tasks don't need much power and one core has a lot of power. So one core should be plenty for all background tasks combined.
I hate to break it to you, but outside of gaming, fluid sim, and battery life / power draw tests, nothing else matters for modern chips. Schmucks only need the kind of horsepower a phone is capable of, or want to know how long it will last. The only people who care about performance either really care because more perf = more dollars (fluid sim, hft, etc.) or they're gamers.
I use CAD for a hobby. My 5950x has 16 cores and I desperately wish I could afford a 64 core desktop because each model takes 60 - 240 seconds to compute. That’s a pain when it takes multiple iterations to get a good design, and it leads to more complicated programming to offer quick vs detailed model creation, and that complexity leads to bugs.
Mostly, I agree with your post. Related: I find browsing the web on an older phone is awful. A great example: Google Maps is one of the most intensive web apps that I use. After each phone upgrade (~3 years), it makes a huge leap forward in performance. I assume this can be explained by improved WiFi, CPU clock/cores/cache, and main memory. What is your guess about which part of the upgrade is most impactful? (Let's assume the version of Chrome is roughly equivalent between the new and old phone.) My guess: it is mostly due to the improved CPU (clock/cores/cache).
The thing about games is they're almost always GPU limited unless you are chasing after super high framerates. Games consoles were typically the target for games and with the exception of the latest gen stuff, all of them used to have weak CPUs.
>"Although AVX-512 was not fuse-disabled on certain early Alder Lake desktop products, Intel plans to fuse off AVX-512 on Alder Lake products going forward." -Intel Spokesperson to Tom's Hardware.
Whether or not AVX 512 is the future, Intel's handling of this left a sour taste for me, and I'd rather be future-proof in case it does gain traction, since I do use a CPU for many years before building a new system. Intel's big/little cores (with nothing else new of note compared to 5+ years ago) offer nothing that future-proofs my workflows. 16 cores of equally performant power with the latest instruction sets does.
I very recently upgraded and had considered both Raptor Lake and Zen 4 CPU options, ultimately going with the latter due to, among other considerations, the AVX-512 support.
Future proofing is no doubt a valid consideration, but some of these benefits are already here today. For example, in a recent Linux distro (using glibc?), attach a debugger to just about any process and break on something like memmove or memset in libc, and you can see some AVX-512 code paths will be taken if your CPU supports these instructions.
Have you been programming with the Zen 4? I bought one, and I've been using the avx512 intrinsics via C++ and Rust (LLVM Clang for both), and I've been a little underwhelmed by the performance. Like say using Agner Fog's vector library, I'm getting about a 20% speedup going from avx2 to avx512. I was hoping for 2x.
Not really. The "pure" computations are double pumped but some of the utility instructions you use with the computations are native AVX-512. And there has been a lot of analysis out there about this and AFIK the conclusion is that outside of very artificial benchmarks most (not all) applications of AVX-512 will never saturate the pipeline enough for it (being double pumped) to matter (i.e. due to how speculative execution, instruction pipeling etc. work in combination with a few relevant instructions being native AVX-512).
Even more so for common mixed workloads the double pumped implementation can even be the better choice as using it puts less constraints on clock speed and what the CPU can do in parallel with it internally.
Sure if you only look at benchmarks focused on benchmarks which only care about (for most people) unrealistic usage (like this article also pointed out many do) your conclusions might be very different.
I think the notion of double pumping is only in the VMOV operation. Looking at Agner[1], the rest of the instructions have similar Ops/Latency to their avx2 counterparts.
If I need to crunch small amounts of data in a hurry, existing instructions are fine for that.
If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
I honestly don't understand who/what AVX512 is really for, other than artificial benchmarks that are intentionally engineered to depend on AVX512 far more often than any real-world application would.
> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
It takes literally 1 to 10 microseconds (1,000 to 10,000 nanoseconds) to talk to a GPU over PCIe.
In those ~40,000 clock cycles (10 microseconds at ~4 GHz), you could have processed 2.5 MB of data with AVX-512 instructions *BEFORE* the GPU is even aware that you're talking to it. Then you've got to start passing the data to the GPU, the GPU has to process it, and then it has to send it all back.
All in all, SIMD instructions on CPU-side are worthwhile for anything less than 8MB for sure, maybe less than 16MB or 32MB, depending on various details.
----------
That's one core. That core can talk to the other 32 or 128 cores of your computer (see dual-socket 64-core EPYC machines) in just 50 nanoseconds or so (a ~200-clock penalty), and those other cores can be processing AVX-512 as well.
So if you're able to use parallel programming on a CPU, it's probably closer to 1 GB+ of data before it is truly an 'obvious' choice to talk to the GPU, rather than just keeping it on the CPU side.
---------
Example of practical use: AES-GCM can be processed using AVX-512 in parallel (each AES-GCM block is a parallel instance, so the entire AES-GCM stream is parallel), but no one will actually use GPUs to process this... because AES instructions are single-clock-tick (or faster!!) on modern CPUs like Intel / AMD Zen.
That's just going to happen whenever you connect to a TLS 1.2 or HTTPS endpoint, which is pretty much all the time? Like, every single byte coming out of YouTube is over HTTPS these days and needs to be decrypted before further processing.
> The reader is referred to the timings for Tiger Lake and Gracemont.
On Tiger Lake (pg 167):
> Warm-up period for ZMM vector instructions
> The processor puts the upper parts of the 512 bit vector execution units into a low power mode when they are not used.
> Instructions with 512-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 50,000 clock cycles.
I'm not saying you are wrong. I just haven't heard about that.
> Since 512-bit instructions are reusing the same 256-bit hardware, 512-bit does not come with additional thermal issues. There is no artificial throttling like on Intel chips.
At least for Zen 4, there are no worries about throttling or anything, really. It's the same AVX hardware, "double pumped" (two 256-bit micro-ops issued per single 512-bit instruction). But you still save significantly on the decoder (i.e. the "other" hyperthread can use the core's decoder to keep executing its scalar code at full speed, since your hyperthread is barely issuing any instructions).
Hopefully they're updated for the new post-quantum algorithms.
Which would you rather have: some fixed-function unit shared between all cores (load balancing? what if you're suddenly doing crypto stuff on many cores?), or the general-purpose tools for running any algorithm on any core?
AES isn't really threatened by quantum computing AFAIK.
And most encryption uses "something" to get an AES key and then uses that to decrypt data.
And that "something" (e.g. RSA/ECC-based approaches) is what is threatened by quantum computing. But it's also not overly problematic if that "something" becomes slower to compute, as it's done "only" once per bunch of data.
AFAIK the situation is similar for signing, as you normally don't sign the data itself but a hash of it, and I think the hashing algorithms mostly in use are not threatened by quantum computing either.
hm, which AES? AES-128 is getting a bit tight already for multi-target attacks.
As to quantum, it looks like practical serial or parallel application of Grover's algorithm might still be decades away.
But that is with current knowledge, and who knows what other breakthroughs will be made.
Wrt algorithms, it really is an implementation detail of how flexible they make the crypto engines.
The number of crypto units would be SKU specific depending on the workload. A server box doing service mesh would need one per concurrent flow presumably.
The thing that accelerator offload gets you is an additional thermal budget to spend on general purpose workloads. If you know that you will be doing SERDES, and enc/dec, offloading those to an accelerator frees up watts of TDP (thermal design power) to spend other places. This is also why we see big/little architectures, the OOO processors suck up a lot of power. In-order cores are just fine for latency insensitive workloads.
Most crypto beyond the initial key exchange is symmetric (AES, ChaCha) and those are still resistant to quantum attacks (well beyond the brute force search speed ups). Post-quantum key exchange is fast enough that it doesn't need dedicated units, unless you're in some super constrained environments. But in that case you'll run into other issues too.
> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
PCIe is a real bottleneck that affects bandwidth and latency. It only really works if your data is already resident on the GPU, your kernels are fixed, and the amount of data returned is small.
With HBM and VCache, we will see main memory bandwidth over 1TB/s for consumer high perf cpus in the near future, at those rates, GPUs won't make sense. GPUs are basically ASICs that can take advantage of hundreds of GB/s of memory bandwidth, when the CPU can do scans at that same rate, the necessity of a GPU is greatly reduced.
If you look at what AVX-512 is often used for (as the article mentions), it's less math and more just speeding up deserializers and various other things that do a bunch of operations on a few bytes at a time.
Which does look like a hilarious waste of silicon; it could cut all of the float/division transistors and still be plenty useful.
That's a bold claim which requires extraordinary evidence.
Sufficient counterexamples, most already mentioned in this discussion: databases, NN-512, TLS(AES), JPEG XL decoding (1.5x speedup), ...
> If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
Latency is a hell of a thing. Anything over a millisecond is an absolute eternity, and I know the GPU imposes more than that for most practical applications - especially gaming.
I can't think of a case like that, where the difference between AVX512 and other instruction sets would be humanly perceptible. Those situations usually end up constrained by memory bandwidth, not CPU or cache throughput.
I'm sure those cases exist, but they don't justify such a large chunk of silicon. And they definitely don't justify slowing down the rest of the CPU.
Inspector says that this web page we're talking on is 48 kB in size. That's small enough to fit inside L1 cache. Every single connection to the web server needs its own AES key for encryption's sake. The 48 kB of data (such as my posts above, or your posts) is hot in cache, but if 100 visitors to this page need to get it, HTTPS needs to happen 100 different times with 100 different keys.
So this 48 kB text message (consisting of all the comments on this page) is going to have to be encrypted with different, random AES keys to deliver the messages to you or me. AES operates on 16 bytes at a time, and AES-GCM is a newer mode that allows all 48,000+ bytes to be processed in parallel.
AVX-512 AES instructions are ideal for processing this data, are they not? And processing it 4x faster (since 4 instances of AES run in parallel: AVX-512 works on 64 bytes per tick, 16 bytes per AES instance) is a lot better than just doing it 16 bytes at a time with the legacy AES-NI instructions.
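For the curious, here is a sketch (not a complete AES-GCM, just the per-round part) of that 4-wide AES using the VAES extension; it assumes the round keys are already expanded and broadcast into 512-bit registers:

    // Sketch: one AES-128 encryption over four independent 16-byte blocks
    // at once with VAES. Assumes the 11 round keys are already expanded and
    // broadcast; compile with -mavx512f -mvaes.
    #include <immintrin.h>

    __m512i aes128_encrypt_4blocks(__m512i blocks, const __m512i round_keys[11]) {
        __m512i state = _mm512_xor_si512(blocks, round_keys[0]);   // whitening
        for (int r = 1; r < 10; ++r)
            state = _mm512_aesenc_epi128(state, round_keys[r]);    // 4 lanes/round
        return _mm512_aesenclast_epi128(state, round_keys[10]);    // final round
    }

Each vaesenc does one round on all four lanes, which is where the 4x over AES-NI comes from.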
----------
Despite being a parallel problem, this will never be worthwhile to send to the GPU. First, GPUs don't have AES-NI instructions. But even if they did, it would take longer to talk to the GPU than for the AVX-512 AES instructions to operate on the 48 kB of data (again: ~40,000 clock ticks just to start talking to the GPU in practice). In that amount of time, you would have finished encrypting the payload and sent it off.
I've seen a lot about AVX-512 and didn't know those instructions existed until just now. They're not exactly generic vector instructions. And that's a nice improvement, but is AES-NI ever slow enough to matter? The numbers I found were inconsistent but all very fast.
Probably more important, there's a 256 bit version of that instruction. You can get half of that extreme throughput without AVX-512.
And a surprising amount of it was in TLS optimizations, in particular, offloading TLS to the hardware (Apparently Mellanox ConnectX ethernet adapters can do AES offload now, so the CPU doesn't have to worry about it).
Since Mellanox ConnectX adapters are trying to solve the AES problem still, I have to imagine that its a significant portion of a lot of server's workloads. Intel / AMD are obviously interested in it enough to upgrade AES to 4x wide in the AVX512 instruction set.
I can't say its particularly useful in any of _my_ workloads. But it seems to come up enough in those hyper-optimized web servers / presentations.
Something that comes to mind is real-time controls, like for high speed manufacturing, rockets and jets, medical robots, etc. These computations are often highly vectorized and are extremely latency-sensitive, for obvious reasons.
Wasn't the issue that efficiency cores don't support AVX-512 and that operating system schedulers/software don't deal with this yet and end up running AVX-512 code on the efficiency cores?
That's a terrible CPU design. You might as well ship arm cores if you are going to have a mismatch of instruction set support on efficiency vs power cores.
Especially since apps using AVX-512 will likely sniff for it at runtime. So now you have a thread that, if rescheduled onto an efficiency core, will break on you. So now what, does the app dev need to start sending "capabilities requests" when it makes new threads to ensure the OS knows not to put its stuff on an efficiency core?
It could be done dynamically by the scheduler: whenever a thread tries to use AVX-512 on the efficiency core, move it to the power core and keep it there for a certain amount of time. If I am not mistaken, the CPU also exposes instruction counters, which would allow the OS to determine whether a thread has tried using AVX during its last time slice.
In our modern multithreading world, many applications already have separate idle and worker threads. I would not be surprised if such an approach could be implemented with negligible performance drawbacks.
It could also be done by the application. At least on Windows, you can provide the OS with a bitmask of which (virtual) processors you want it to be scheduled on.
So the application could detect which cores had AVX-512 and change its scheduling bitmask before doing the AVX-512 work.
The OS probably should do the dynamic stuff you mentioned, this would then be to avoid the initial hit for applications that care about that.
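Something along these lines on Windows, assuming you have already worked out which logical-processor bits belong to P-cores (the p_core_mask below is hypothetical and would come from the hybrid-topology APIs):

    // Sketch of the application-side approach on Windows: pin the current
    // thread to P-cores before running AVX-512 code, then restore the mask.
    #include <windows.h>

    void run_avx512_work(DWORD_PTR p_core_mask) {
        DWORD_PTR old_mask = SetThreadAffinityMask(GetCurrentThread(), p_core_mask);
        // ... do the AVX-512 heavy lifting here ...
        if (old_mask)
            SetThreadAffinityMask(GetCurrentThread(), old_mask);  // restore
    }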
I initially thought the same, but then realized the big issue with that. The x86 architecture, as it is (see below), requires that both core types appear homogeneous. Therefore, if the E cores claim support for AVX-512 with CPUID (which would be a lie), then every application using glibc will try to use AVX-512 for memcpy (or whatever) when it shouldn't. As a result, they'd end up pinned to a P core when they should remain on an E core.
This whole mess is because AVX-512 was initially released on hardware where this distinction didn't exist. If AVX-512 had been released during or after the whole P/E core thing last year, it would be possible to have applications using AVX-512 state their intentions to the scheduler (as @magicalhippo suggests). The application could say, "hey, OS, I need AVX-512 right now," and all would be well. As it is now, we're stuck with it.
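To illustrate, the usual runtime-dispatch pattern looks something like this sketch; the whole point is that the CPUID-based check assumes every core answers the same way:

    // Sketch of the runtime dispatch that a hybrid CPU breaks: the check runs
    // once and assumes the answer holds no matter which core the thread later
    // lands on. GCC/Clang builtin shown.
    #include <cstdio>

    int main() {
        if (__builtin_cpu_supports("avx512f"))
            std::puts("dispatching to AVX-512 kernels");
        else
            std::puts("dispatching to AVX2/SSE fallback");
    }

If the E-cores lied in CPUID, this check would pick the AVX-512 path even for threads that later migrate to a core that faults on those instructions.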
On the other hand, you can now virtually guarantee that a GPGPU is present on Intel consumer chips. So now you can write your code 3 times; no vector acceleration (for the really old stuff), AVX-512 for the servers, and GPU for the consumer chips!
We have had that for 15+ years even in C with fine grained looping like OpenMP. It’s terribly inefficient to communicate across threads. Sometimes you just need SIMD
I support that effort. In the meantime, we have to do what we have to do. I'm presently in the process of optimizing some stuff and comparing the improvements of SIMD vs GPU on both ARM and x86 (this isn't that impressive, it's some basic loop vectorization). But just as Linus writes in that post, getting the compiler to do well seems impossible.
I'm measuring performance improvement and energy consumption reduction. The results are incredible. We really have to do this stuff. But it's complicated and the documentation is generally awful. So yes, a new language that deals with all of this would be very, very welcome :)
How will these languages emit machine code for every variant of GPU/CPU/DSP/Vector/FPGA/whatever architecture they might run on? This isn't as simple as it sounds.
The host binary will include intermediate representation of the compute code to be compiled by the device driver.
This already exists, see SYCL + SPIR-V. Intel's oneAPI is one implementation of this approach.
no, these are really big and interesting back ends. still, socially, that has to be better than _everyone_ breaking out the datasheets...if they even have those anymore
AVX-512 in general purpose CPUs was designed to not really make sense but start seeding the market at 10nm, vaguely make sense at 7nm, and really make sense starting at 5nm (or Intel's 14nm (Intel10) -> 10nm (Intel7) -> 7nm (Intel4)).
So Intel's process woes have been hurting their ability to execute in a meaningful way. Additionally Alder Lake was hurt by needing to pivot to heterogeneous cores (E cores and P cores), which hurt their ability to keep pushing this even in a 'maybe it makes sense, maybe it doesn't' state it had been in.
AVX-512 is in Alder Lake P-cores but not E-cores. AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogeneous ISA extensions. AVX-512 could be enabled by disabling the E-cores.
It was supposedly maybe actually taken out of Raptor Lake, but the latest info I can find on that is from July: long before it was released. I have a Raptor Lake CPU but haven't found the time to experiment with disabling E-cores (far too busy overclocking memory and making sensible rainmeter skins).
Early Alder Lake could but they fused it off in the later ones and newer microcode also blocks it on the older processors.
Raptor Lake is basically "v-cache alder lake" (the major change is almost doubling cache size) so it's unsurprising they still don't have an answer there, and if they did it could be backported into Alder Lake, but they don't seem to have an immediate fix for AVX-512 in either generation.
Nobody really knows why, or what's up with Intel's AVX-512 strategy in general. I have not heard a fully conclusive/convincing answer in general.
The most convincing idea to me is Intel didn't want to deal with heterogeneous ISA between cores, maybe they are worried about the long-term tail of support that a mixed generation will entail.
Long term I think they will move to heterogeneous cores/architectures but with homogeneous ISA. The little cores will have it too, emulated if necessary, and that will solve all the CPUID-style "how many cores should I launch and what type" problems. That still permits big.little architecture mixtures, but fixes the wonky stuff with having different ISAs between cores.
there are probably some additional constraints, like Samsung or whoever bumped into with their heterogeneous-ISA architectures... like cache line size probably should not vary between big/little cores either. That will pin a few architectural features together, if they (or their impacts) are observable outside the "black box".
Well, Intel seems to sorta be proving a point I had a long time ago about ARM's big.LITTLE setup, which is that it's only needed because their big cores weren't sufficiently advanced to scale their power utilization efficiently. If you look at the Intel power/perf curves, the "efficient" cores are anything but. Lots of people have noticed this and pointed out it's probably not "power efficient" but rather "die-space efficient under constrained power", because they have fallen behind in the density race and their big cores are quite large.
But I'm not even so sure about that; AVX-512 is probably part of the size problem with the cores. We shall see. You're probably right that heterogeneous might be here to stay, but I suspect a better use of the space long term is even more capable GPU cores offloading the work that might be done on the little cores in the machine. AKA, you get a number of huge/fast/hot cores for all the general-purpose "CPU" workloads, and then offload everything that is trivially parallelized to a GPU that is more closely bound to the cores and shares cache/interconnect.
Like, serious/honest question, how do you see Gracemont as space-efficient here? It's half the size of a full Zen3 core yet probably at-best produces the same perf-per-area, and uses 1.5x the transistors of Blizzard for similar performance (almost 3x the size, bearing in mind 5nm vs 7nm). that's not really super small, it's just that Intel's P-cores are truly massive, like wow that is a big core even before the cache comes in.
For years I thought it would be cool to see an all-out "what if we forget area and just build a really fast wide core" and that's what Intel did. And actually, for as much as people say Apple is using a huge "spare no expenses" core, it's not really all that big even considering the area - you get around 1.5-1.6x area scaling between 5nm and 7nm as demonstrated by both Apple cores and NVIDIA GPUs, and probably close to AMD's numbers as well. So just looking at it at a transistor level, Apple is using 2.55 x 1.6 = 4.08mm2 equivalent of silicon and Intel is using 5.55mm2, so Apple is only using 75% of the transistors of Intel's golden cove p-core...
But in the e-core space, it's become a meme that Gracemont is "size efficient rather than power efficient" and I'm just not sure what that means in practical terms. Usually high-density libraries are low-power, so those two things normally go together... and it's certainly not like they're achieving unprecedented perf-per-area, they're probably no better than Zen3 in that respect. Where is the space efficiency in this situation if it's not libraries or topline perf-per-area?
And Intel is still deciding on the future of AVX-512; internally there is already a replacement that works with Atom cores (which are size- and power-bound).
> AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogenous ISA extensions
It was already opt-in (disabled unless you also disable efficiency cores), that is no justification to make it impossible to use for people who want to try it out.
But I suppose Intel just doesn't want people to write software using those new instructions.
If Intel allowed AVX-512 to be enabled, they would need to validate its functionality on every chip. Some chips might be dropped (or reused as i3s) because of this. There's not much reason to do that for AVX-512, which only enthusiasts enable.
Linus is correct that auto-vectorization is basically useless, but I have some minor disagreements. JSON is obviously not the only format or protocol that can be parsed using SIMD instructions. In general computers spend a lot of time parsing byte streams, and almost all of these could be parsed at gigabytes per second if someone familiar with the protocol, the intel intrinsics guide, and the tricks used in simdjson spent an afternoon on them. It's sort of unfortunate that our current software ecosystem makes it difficult for many developers to adopt these sorts of specialized parsers written in low-level languages and targeting specific vector extensions.
If someone wants to use a hash table or TLS, they may find AES-NI useful.
Part of his argument though was that for most cases where you're parsing JSON (or some other protocol), the parsing time is small compared to the amount of time you spend actually doing things with the data. And for most of the examples where the code is mostly parsing (say, an application doing RPC routing, where the service just decodes an RPC and then forwards it to another host) you can often use tricks to avoid parsing the whole document anyway. For example, if you have a protocol based on protobuf (like gRPC) you can put the fields needed for routing in the lowest number fields, and then use protobuf APIs to decode the protobuf lazily/incrementally; this way you will usually just need to decode the first (or perhaps the first few) fields, not the entire protobuf.
Making code faster and more efficient is always better, it's just that you might not see much improvement in end-to-end latency times if you make your protocol parsing code faster.
I think this is maybe true in practice because business logic is usually written really badly, but a lot of the time the work that must be done is ~linear in the size of the message.
One of the specific design properties of AVX-512 is to make auto-vectorization less useless. If Intel had committed to more predictable and wide support, we might be seeing the fruits of that now. Whether it would have been a "Game changer" or just a slight improvement, I don't know, but even slight improvements in cpu performance are worth it these days.
Made me smirk, as games are what likely would have profited from it.
Games are also some of the heaviest compute applications outside of servers.
Though it would likely be mostly just some code here and there deep within the game engine.
But then this is also an example where auto-vectorization can matter, as it means writing platform-independent code which happens to be fast on the platforms you care about most, but also happens to compile and run with some potential (but only potential) performance penalty when porting to other platforms. Not only does this make porting easier, it might also be fine for some games if the engine is a bit slower. The main drawback is that writing such "auto-vectorized" code isn't easy and can be brittle.
I'm freshly out of the weeds of implementing some fixed-point decimal parsers, and only found ~1 cycle that I could save (at best) per iteration using AVX-512.
I largely agree with Linus here. It's still a huge pain to do anything non-trivially-parallelizable with SIMD, even in an AVX-512 world.
1. I have NEVER been satisfied with the results of a compiler autovectorization, outside of the most trivial examples. Even when they can autovectorize, they fail to do basic transformations that greatly unlock ILP (to be fair these aren't allowed by FP rules since it changes operation ordering)
2. Gather/scatter still have awful performance, worse than manually doing the loads and piecemeal assembling the vector. The fancy 'results-partially-delivered-in-order-on-trap' semantics probably don't help. Speeding up these operations would instantly unlock SO much more performance (see my username)
3. The whole 'extract-upper-bit-to-bitmask' -> 'find-first-set-in-bitmask' sequence really deserves an instruction to merge the last two steps (return the index of the first byte with the uppermost bit set, etc). It's such a core series of instructions and in my experience often the bottleneck (each of those is 3 cycles); see the sketch after this list.
4. The performance of some of the nicer avx512 intrinsics leaves a lot to be desired. Compress for example is neat in theory but is too slow for anything I've tried in real life. Masked loads are another example, cool in theory but in practice MUCH faster to replicate manually.
5. It still feels like there's a ton of really random/arbitrary holes in the instruction set. AVX512 closed some of these but introduced many more with the limited operations you can do on masks and expensive mask <-> integer conversions.
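Here's roughly what that point-3 dance looks like in AVX2 terms (a sketch, assuming a 32-byte chunk and a -mavx2 -mbmi build); the complaint is that the movemask and tzcnt steps each burn cycles with no fused "index of first matching byte" instruction:

    // Find the index of the first byte equal to `needle` in a 32-byte chunk:
    // compare -> extract bitmask -> count trailing zeros.
    #include <immintrin.h>

    int find_first_equal(const unsigned char* p, unsigned char needle) {
        __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        __m256i eq    = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8((char)needle));
        unsigned mask = (unsigned)_mm256_movemask_epi8(eq);  // 1 bit per byte
        if (mask == 0) return -1;                            // no match in chunk
        return (int)_tzcnt_u32(mask);                        // first match index
    }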
Some of the AVX512 instructions are nice additions though, even if you only use it on YMM or XMM registers. All the masked instructions make it much simpler to vectorize loops over variable amounts of data without special treatment for the head/tail of the data.
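For example, something like this sketch (AVX-512F only, illustrative): the leftover 1-15 elements are handled with a mask instead of a scalar epilogue:

    // Add 1 to every element; the tail uses a load/store mask rather than a
    // separate scalar loop. Compile with -mavx512f.
    #include <immintrin.h>
    #include <cstddef>

    void add_one(int* data, std::size_t n) {
        const __m512i one = _mm512_set1_epi32(1);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(data + i);
            _mm512_storeu_si512(data + i, _mm512_add_epi32(v, one));
        }
        if (std::size_t rem = n - i) {                    // 1..15 leftover elements
            __mmask16 m = (__mmask16)((1u << rem) - 1);   // low `rem` lanes active
            __m512i v = _mm512_maskz_loadu_epi32(m, data + i);
            _mm512_mask_storeu_epi32(data + i, m, _mm512_add_epi32(v, one));
        }
    }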
This is exactly right - the 512b width is the least interesting thing about AVX512. All the new instructions on 128, 256, and 512b vectors, (finally) good support for unsigned integers, static rounding modes for floating point, and predication are much more exciting.
I don't mind head/tail stuff so much but the masked instructions and vcompress make a lot of "do some branchy different stuff to different data or filter a list" easier and faster compared to AVX2, so I'm a big fan
Some of the biggest investments into general purpose Linux performance, reliability, and sustainability were made by people with "specialty engines" that Linus is talking about. The specialty applications often turn out to be important because they push the platform to excel in unexpected ways.
With that said, the biggest practical problem we saw with AVX-512 was that it's so big, power hungry, and dissimilar to other components of the CPU. One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine. I imagine it's even worse on mobile/embedded, where turning it on can blow out your power and thermal budget. The alternatives (integrated GPUs/ML cores) seem to be more power efficient.
It's not terribly dissimilar from the SSE instructions before it. The biggest weirdness about AVX is its CPU throttling. Which, IMO, should not be a thing, for the reasons you mention.
I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
> I imagine it's even worse on mobile/embedded
Fortunately, x86 mobile/embedded isn't terribly common, certainly not with AVX-512 support.
> I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
exactly what it looks like to me, and it was already fixed in Ice Lake and probably doesn't exist at all in Zen4. Just nobody cares about Ice Lake-SP when Epyc is crushing the server market.
the whole thing looks very much to me like "oops we put not just AVX-512 but dual AVX-512 in our cores and on 14nm it can hammer the chip so hard we need more voltage... guess we're gonna have to stop and swing the voltage upwards".
not only is that less of a problem with higher-efficiency nodes, but the consumer core designs drop the idea of dual-AVX 512 entirely which reduces the worst-case voltage droop as well...
The turbo curves depend on the exact type of Xeon. Were they Bronze or Silver? Those are much worse in that regard.
It is also unclear to me that AVX-512 is necessarily problematic in terms of power and thermal effects. Scalar code on many threads actually uses more energy (dispatch/control of OoO instructions being the main driver) than vector code (running far fewer instructions). The much-maligned power-up phase does not matter if running vector code for at least a millisecond.
I agree dedicated HW can be more power-efficient still, assuming it does exactly what your algorithm requires. And yet they make the fragmentation and deployment issues (can I use it?) of SIMD look trivial by comparison :)
Thanks for the data point. Water cooled, impressive.
But what is the problem here? Power is work per time. More power can be a good thing, especially if the work is useful. This is more likely in the SIMD case, which has much lower OoO cost than scalar code in terms of total work accomplished. At some point, the chip hits a limit and throttles. This means less speedup than otherwise might have been the case, but it's still a useful speedup relative to scalar, right?
> One tenant turning that thing on can throttle the whole chip,
This wasn't related to AVX-512, it was related to its implementation on that specific, somewhat old, architecture. Current AMD and next generation Intel don't have this limitation, so it shouldn't be considered when thinking of "the future".
> One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine
This is ironically where the double pumping of Zen 4 shines; it's less of a problem there.
The problem with AVX isn't that compilers find it difficult to emit these instructions, or that people rarely work on problems that are easily transformed to SIMD.
It's enough for a small number of libraries to use these instructions, and a much larger number of programs can benefit.
The problem with AVX-512 in my opinion is the sparse hardware support, power/thermal problems and noisy neighbor phenomenon.
To be useful simd instructions need to be widely available and not have adverse power effects such as making other programs go much slower.
No, but that's mainly because Intel has been totally incompetent for the last half decade. On Zen4/Intel chips post 2020 that have it, AVX-512 is really good.
Firstly, there are in fact a lot of problems that benefit from parallelization, but are not quite wide enough to justify dealing with (say) a GPU. For example, even in graphics you often have medium-sized rasterization tasks that are much more convenient to run on CPUs. Audio processing is similar. Often you have chunks of vectorizable work inter-mixed with CPU-friendly pointer chasing, and the ability to dispatch SIMD work within nanoseconds of SISD work is very helpful.
Secondly, AVX-512 (and Arm's SVE) support predication and scatter/gather ops, both of which open the door to much more aggressive forms of automatic vectorization.
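As a small illustration (whether a given compiler actually vectorizes it depends on the version and flags), a loop like this needs masked stores and suppressed lanes, which is exactly what predication provides:

    // Data-dependent branch: only elements where the condition holds may be
    // read into sqrt and written back to y. With AVX-512/SVE the compare
    // becomes a mask, the sqrt and the store are masked; without predication
    // the compiler often has to give up or emit blend-heavy code.
    #include <cmath>

    void sqrt_positive(const float* x, float* y, int n) {
        for (int i = 0; i < n; ++i) {
            if (x[i] > 0.0f)
                y[i] = std::sqrt(x[i]);   // untouched y[i] must not be written
        }
    }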
If I understand correctly, this discussion is specifically about auto-vectorization, which is when the compiler automatically re-writes your code to use highly parallel (and performant) vectorized assembly instructions.
Linus acknowledges that vectorized instructions are valuable, so his rant seems to be specifically about auto-vectorization:
> Yes, yes, yes, you can use it in random places. It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
In other words.. the people writing foundational parsing libraries (json / strchr / strcmp) are writing their vectorized instructions by hand.
I use intrinsics by hand all the time. It's very easy to make a problem too complicated to autovectorize. And even if you do get it to autovectorize, it's not exactly future-proof against compiler changes.
I use Agner as well. I started up my own version for Rust specifically targeting avx512[1], but I've been hitting enough snags to where I think I'll abandon it. It's super green at the moment, and I haven't pushed it to Cargo. But if I'm going to dedicate time to it, then I need it to work for my purposes, and there's a thread-parallel problem that makes this unusable for me at the moment.
Note that simply using a SIMD vector class library does not make it "go faster". In fact it can make things worse (due to latency). What you usually need is a problem and then a solution (algorithm) that parallelizes well.
Explain? My experience has been that you don't need much. I've benchmarked Agner's exp, and if you have 4 calculations to do, then calling it with avx2 will be 4x faster than calling std::exp four times.
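For reference, the pattern I mean looks roughly like this, assuming Agner Fog's VCL headers (vectorclass.h / vectormath_exp.h); exact speedup of course depends on the workload:

    // Four exp() evaluations in one AVX register instead of four std::exp calls.
    // Compile with something like -mavx2 -mfma.
    #include "vectorclass.h"
    #include "vectormath_exp.h"

    void exp4(const double in[4], double out[4]) {
        Vec4d x;
        x.load(in);          // four doubles into one 256-bit register
        Vec4d y = exp(x);    // vectorized exp from vectormath_exp.h
        y.store(out);
    }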
AVX-512 is fabulous for cryptography, especially software-oriented algorithms like ChaCha and BLAKE3 that lean on general-purpose SIMD features rather than dedicated hardware acceleration. Most of the added value is just the larger registers, but the native bit-rotate instructions are also great. (But agreed with Linus that autovectorization isn't very helpful for this stuff.)
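For example, one ChaCha quarter-round over 16 independent 32-bit lanes looks roughly like this (a sketch; pre-AVX-512 each rotate would be shift + shift + or instead of a single vprold):

    // ChaCha quarter-round on 16 lanes at once. The native rotate
    // (_mm512_rol_epi32 / vprold) is the AVX-512 nicety being praised.
    // Compile with -mavx512f.
    #include <immintrin.h>

    void chacha_quarter_round(__m512i& a, __m512i& b, __m512i& c, __m512i& d) {
        a = _mm512_add_epi32(a, b); d = _mm512_xor_si512(d, a); d = _mm512_rol_epi32(d, 16);
        c = _mm512_add_epi32(c, d); b = _mm512_xor_si512(b, c); b = _mm512_rol_epi32(b, 12);
        a = _mm512_add_epi32(a, b); d = _mm512_xor_si512(d, a); d = _mm512_rol_epi32(d, 8);
        c = _mm512_add_epi32(c, d); b = _mm512_xor_si512(b, c); b = _mm512_rol_epi32(b, 7);
    }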
SIMD on x86 has become a ridiculous minefield of standards. I honestly pity the compiler writers who have to navigate this nonsense. It was stressful enough trying to just manage the compiler flagging in an application (music synthesizer) I wrote recently.
The core amd64 ISA is stable and that's good. I wouldn't call it elegant or nice, but it's there. The patchwork of which processor supports which SIMD extension on top of it, though, has gotten out of control and is unpleasant.
I'm sure AVX512 is great. Except as an application author I can't rely on it being there.
I am not a MacOS user or MacOS developer, but I envy the consistency of that platform now that they control their own ISA.
> SIMD on x86 has become a ridiculous minefield of standards.
people make the AVX-512 compatibility thing more complex than it has to be. there's consumer AVX-512 and server AVX-512 and within those series there has always been strictly increasing compatibility, with a common core between them. and AMD supports the complete thing with Zen4 so there is no differentiation for them.
intel didn't want to put bfloat16 ML instructions in consumer chips and people turned that into a whole thing because they like to meme about AVX-512, but that's literally the only difference. it's just not that complex a story though, if you don't turn it into a thing.
intel's consumer and server teams have been operating quasi-independently and some of them are on "newer generations" than the other team. currently that's the client team that's ahead of the server team. so client got a couple gens ahead of server, and server has this one extra instruction in their versions. that's it, that's the whole deal.
frankly it's actually a simpler story than SSE4 where AMD just kinda went and did its own thing for a while and Intel went and did AVX and there truly was total divergence between what the companies were doing. you can probably turn SSE into a real ugly venn diagram if you try to capture every nuance too! After all, some of them were introduced in K8, K9, Phenom... and there's lots of sub-versions of each.
xeon phi is also its own thing but you'll know if you're targeting that. it has a weird 4-wide AVX thing to go along with its 4-wide SMT, and it's its own specific thing.
so you can see within each server/consumer series it's monotonically increasing, and the server gets bfloat16 and client gets VPOPCNT, IFMA, and VBMI.
so it's really pretty straightforward: are you targeting server, client, or both, and how recent? that will tell you how much of an overlap you've got.
like is that really that bad a chart overall? it's just "newer chips have more stuff", that's not news to anyone, and barely anyone has cannon lake or even rocket lake. ice lake and tiger lake are what you want to target, mostly. Alder lake actually supports everything, but Intel pulled AVX-512 from that series...
and yes, it's annoying that server got stuck in 2017 for 5+ years, ice lake-SP (cooper lake) barely has any penetration either, but the client platform moved right along and supports basically everything except bfloat16. sapphire rapids is cooper lake's successor and the first installations (HPC) were promised in like 2018... they will come online early in 2023 apparently. So they are 5+ years behind schedule because of node problems, things are basically comically bad at Intel and the server division is still not executing/iterating properly.
> Except as an application author I can't rely on it being there.
I'm surprised to keep hearing this concern. We can write code once in a vector-length agnostic way, compile it for multiple targets, and use whatever is there at runtime.
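One possible shape of that, as a sketch using GCC/Clang's target_clones attribute (library-level dispatch is another route):

    // The compiler builds one clone per listed target and a resolver picks
    // the best one at load time, so the binary uses AVX-512 only where it
    // actually exists.
    #include <cstddef>

    __attribute__((target_clones("avx512f", "avx2", "default")))
    void scale(float* x, std::size_t n, float s) {
        for (std::size_t i = 0; i < n; ++i)   // plain loop; each clone is
            x[i] *= s;                        // auto-vectorized for its target
    }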
Agreed. Do you have any example of an optimization of generic/cross-platform vector code, such that it would run better on SSE2 or SSSE3?
One example might be reducing the number of live registers to avoid spilling, but you can already do that in your portable code, without necessarily requiring a separate codepath for the low-spec machines.
When writing "generic" code you still want to consider the target; for example, while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf pre-AVX-512 (especially for loads/stores, which just don't have masking pre-AVX2, and iirc AVX2's masks on AMD can still fault for masked out elements). A pretty big portion of AVX-512 is that it makes more things possible/doable a lot nicer, but that's useless if you have to also support things that don't. (another example may be using a vector-wide byte shuffle; SSE and AVX512 have instructions for that, but on AVX2 it needs to be emulated with two 16×u8 shuffles, which you'd want to avoid needing if possible; in general, if you're not considering exactly the capabilities of your target arch & bending data formats and wanted results around it, you're losing performance)
> while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf
Agreed. It's best if applications pad data structures.
> iirc AVX2's masks on AMD can still fault for masked out elements).
Unfortunately yes. I haven't seen a CPU use that latitude, though.
> it makes more things possible/doable a lot nicer, but that's useless if you have to also support things that don't.
hm, is it truly useless? I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
> if you're not considering exactly the capabilities of your target arch & bending data formats and wanted results around it, you're losing performance
That's fair. We're also losing performance if we don't port to every single arch we might be running on.
It seems to me that generic code (written as you say with an eye to platform capabilities) is a good and practical compromise, especially because we can still specialize per-arch where that is worthwhile.
> I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
inefficient SIMD can still be better than scalar loops, yes, but better than that is efficient SIMD; and you may be able to achieve that by rearranging things such that you don't need said emulation, which you wouldn't have to bother doing if you could target only AVX-512.
OK, it would indeed be nice if we only had to target AVX-512, but that's not the reality I'm in.
On minimizing required emulation - any thoughts as to how? Padding data structures seems to be the biggest win, only mask/gather/scatter where unavoidable, anything else?
Stay within 16-byte lanes for many things (≤16-bit element shuffles, truncation, etc); use saturating when narrowing types if possible; try to stay to a==b and signed a>b by e.g. moving negation elsewhere or avoiding unsigned types; switch to a wider element type if many operations in a sequence aren't supported on the narrower one (or, conversely, stay to the narrower type if only a few ops need a wider one). Some of these may be mitigated by sufficiently advanced compilers, but they're quite limited currently.
Great points! It seems useful to add a list based on yours to our readme.
Please let me know if you'd like us to acknowledge you in the commit message with anything other than the username dzaima.
"dzaima" is how I prefer to be referred to as; but that list is largely me going off of memory, definitely worth double-checking. (and of course, they're ≤AVX2-specific, i.e. x!=y does exist in avx-512 (and clang can do movemask(~(a==b)) → ~movemask(a==b), but gcc won't), and I can imagine truncated narrowing at some point in the future being faster than saturating narrowing on AVX-512; or maybe saturating narrow isn't even better? (for i32→i8, clang emits two xmmword reads whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, thus maybe (?) being better on the memory subsystem, but clang decides to rewrite the permd as vextracti128 & vpunpckldq, making it unnecessarily worse in throughput))
Yes, they theoretically could. The AMD manual contains this language:
Exception and trap behavior for elements not selected for loading or storing from/to memory is implementation dependent. For instance, a given implementation may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written.
To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking. Even more so with Haswell, which adds FMA and thus doubles FLOPS.
The additional permutes are indeed not free, but we did get an all to all int32 shuffle, which could actually be more efficient than having to load/generate the corresponding PSHUFB input.
Taking a step back, these are examples of AVX adding a bit of cost, but I'm not yet seeing an accounting of the benefits, nor showing that they are outweighed by the cost.
> To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
We were optimizing loop by loop, and some loops converted to AVX could well be slower, yes (at that point in time). AVX is probably more often a win nowadays.
> That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking.
In practice workloads are often bottlenecked by memory access, and there are diminishing returns with increased vector size.
I sell consumer software, and you can get bad reviews from people who don't have SSSE3 (2008). So not only do I hardly need larger vectors (vs 128-bit) given memory throughput, it would also take 10+ years before I could rely on them.
The larger the SIMD vector, the more useless it is in a way that Linus describes.
> Seriously, go look at random code on github - I dare you. Very little of it is clearly delineated loops over arrays with high repeat counts. It's not very helpful to be able to vectorize some "search for a value in an array", when people end up using hash tables or other more complex data structures instead of arrays for searching.
Sure, but many parts that matter for performance, the inner loops to optimize, may be like that.
There are several chicken & egg problems with this kind of thing.
First, there are only a handful of languages optimised for "structure of arrays" instead of "array of structures" programming (SoA vs AoS). To my knowledge, only some array-based languages like J/K/APL, Jonathan Blow's Jai, and Google's Rune qualify.
Why aren't array languages more popular? Because the performance delta has historically not been worth the switching costs.
Why has the performance delta not been big enough? Because AVX-512 and similar architectures haven't been available to the mass market.
Why isn't AVX-512 included in every processor? Because there's not enough software for it yet.
Etc...
Once AVX-512 becomes widespread, languages are developed for highly-parallel architectures with SIMD, then you'll start seeing new applications materialise driving demand.
Similar architectures have been available for plenty of time! 256 bits at once with multiple execution units is a lot of compute power and has been the standard for a decade. Let alone SSE.
SSE and AVX instructions are optimised primarily for 3D graphics, such as multiplying 4 floating point numbers with a 4x4 matrix. There are a handful of additional instructions optimised for doing things to pixels... and that's about it.
AVX-512 is designed to work more like what a GPU does internally, and provides a much richer set of instructions. It enables fine-grained masking and shuffles, without which many simple types of code are either impossible to vectorise, or much more complex... and slower. This is why auto-vectorisation with SSE and AVX is only enabled for some simple loops, and provides marginal benefits outside of those scenarios.
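To illustrate (my own minimal sketch, assuming AVX-512F and <immintrin.h>): with per-lane masks, the loop tail needs no separate scalar epilogue, which is exactly the kind of thing SSE/AVX code has to handle with extra branches or duplicated loops.

    #include <immintrin.h>
    #include <stddef.h>

    // dst[i] = a[i] + b[i] for any n; the final partial vector is handled by a mask.
    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            size_t left = n - i;
            __mmask16 m = (left >= 16) ? (__mmask16)0xFFFF
                                       : (__mmask16)((1u << left) - 1);
            __m512 va = _mm512_maskz_loadu_ps(m, a + i);  // masked-off lanes read as 0
            __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
            _mm512_mask_storeu_ps(dst + i, m, _mm512_add_ps(va, vb));
        }
    }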
Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.
> emulation,
Only current practical use of AVX-512.
> ML, matrix multiplications, ...
Much better to do on the GPU.
In general wide SIMD on CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% away from just running it on the GPU anyway, and you can gain much more performance if you do that. The best niche of AVX-512 would have been as the baseline common target for things that also get optimized for GPUs... except that Intel has eliminated this possibility by heavy product segmentation right from the start.
> Done on special-purpose hardware that is >10 times more efficient than any implementation on GP hardware.
That locks users in to only the codecs that have been implemented in this particular hardware, and increases hardware size for particular vendor implementations rather than providing common building blocks that many codecs use (DCT, ...)
> Much better to do on the GPU.
Yep, and that's what I run ML stuff on. Not all systems have a GPU available though, and for some applications it's faster to do something immediately on the CPU than to pay the overhead of going to/from the GPU.
> In general wide SIMD on CPU has the problem that to make effective use of it, you have to massage your code enough that you are ~95% away from just running it on the GPU anyway,
Better tooling and libraries could help with this imho. Note that the situation for GPUs is also not great at all here, since the good tools are locked into one vendor.
Oh? Which GPU? iGPU? Discrete? Intel? Nvidia? AMD? From which generation? Using what libraries? Assuming you are running on an x86-64? Or maybe arm? Something else? How are you going to handle underflow/overflow? Did you need IEEE FP support? How many $1000s were you going to spend on hardware to test/verify your code on different GPUs?
Also depends on how big the matrices are: if they are too small, the latency of the GPU isn't worth it; too large and it won't fit. The too-small/too-large decision depends on which GPU, how many PCIe lanes, and which library.
> you have to massage your code enough that you are ~95% away from just running it on the GPU anyway
Depends on your usecase, I guess. Having AVX-512 on the CPU allows you to just re-implement one critical function to be faster, and let the rest of your code be clean and simple. Communicating with a GPU comes with a large latency penalty that is not acceptable in some use-cases.
> you are ~95% away from just running it on the GPU anyway
Vector code with lots of branches absolutely exists. You can run it on a GPU, but because they don't dedicate transistors to OoO, branch prediction, and good prefetchers, the code won't run very well.
SSE, NEON and AVX are fundamental for anything reducible to matrix and vector arithmetic (e.g., signal processing and ML inference on the CPU).
> It is hard to find simple enough code to vectorize at all.
It's hard to find code simple enough for the compiler to efficiently auto-vectorize.
Anything that reduces to GEMM will parallelize in practice extremely well, and there are many excellent libraries with SIMD support (MKL, BLAS, ATLAS, Eigen, etc.). However, these libraries rely on kernels carefully written by experts and benchmarked extensively over decades. They're not the output of running naively written code through a super smart compiler.
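A small example of what that looks like in practice (assuming Eigen, one of the libraries listed above): the high-level expression dispatches to hand-tuned SIMD kernels, no auto-vectorization of a naive triple loop required.

    #include <Eigen/Dense>

    // C = A * B goes through Eigen's blocked GEMM kernels, which are written
    // with explicit SIMD packet code rather than left to the compiler's
    // auto-vectorizer.
    Eigen::MatrixXf multiply(const Eigen::MatrixXf &A, const Eigen::MatrixXf &B) {
        return A * B;
    }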
All of this is extremely relevant to what you bought your PC or phone for. It's also not in the kernel, and therefore Linus seems to be unaware of their pervasiveness and utility.
The main thing seems to be that we don't have general purpose languages in common use whose semantics fit SIMD autovectorization, without a careful "code to the compiler" portable-assembly mindset. Even after 25 years of x86 SIMD support in hardware.
It's just not been an aim even in more recent-ish languages like Rust, Swift, Go etc. Partly because of the unworkably hard task of targeting fragmented, undebuggable, proprietary driver/OS programming interfaces from the POV of portable languages.
Exactly. But also he inadvertently actually touches on this in the post. Talking about people using hash tables and so on. I think if people had the right guidance from language and libraries and from education, they would choose array-based patterns more often. They're often in fact simpler to write.
But many languages and common practices push people in a highly incremental piece by piece approach. I'd argue we're basically taught to think in terms of loops, from our earliest days of programming. We could just as easily be taught to think of things in terms of bulk arrays. The win in programming this way is extremely high.
I was very disappointed to find the SIMD story in Rust so underdeveloped.
Yes, people are not running weather forecasting on their home computers. But they're playing video, synthesizing audio, doing speech recognition, etc. all the time. And more of this every day.
And if there's any chance of rescuing the uses of ML from a hellish privacy-violating landscape worthy of a Butlerian Jihad... it would be by making sure most of this type of computation was done locally on-device.
> Exactly. But also he inadvertently actually touches on this in the post. Talking about people using hash tables and so on. I think if people had the right guidance from language and libraries and from education, they would choose array-based patterns more often. They're often in fact simpler to write.
A) The vast majority of programmers have zero idea about any of this. They want a key-value store and have no idea about the underlying implementation. And that's okay. You can write an awful lot of useful software without understanding Big-O notation.
B) Pointer chasing has been dogshit on modern architectures for quite a while, and most people have no idea. Look at all the shit Rust gets for being annoying to implement linked lists which are a garbage data structure on modern CPUs.
C) Array-of-struct to struct-of-array type transformations are just starting to get some traction in languages. Entity component systems are common in games but haven't really moved outside of that arena. These are the kind of things that vectorization can go to town on but haven't yet moved to mainstream programming.
D) Most of this stuff is contained inside libraries anyway. So, only a few people really need to care about this.
Thing is that hashmap usage goes beyond data manipulation / key-value store uses. People use it to construct semi-structured network data models, track counts by indices, etc. Because there's an assumption that the O(1) of a hashtable always beats the O(N) of a linear search. Except that that's really not always the case anymore on a modern arch, with things properly vectorized, etc.
Whereas what we should have is libraries that provide high-level relational/datalog-style manipulation of in-memory datasets, where you describe what you want to do and the system decides how. Personal beef.
Think of the average "leetcode" question. It almost always boils down to an iterative pass over some sequence of data, manipulating in place. C-style strings or arrays of numbers, etc. If you tried to answer the question with "I'd use std::sort and then std::blah and so on, on some vectors" they'd show you the door, because they want you to show how clever you are at managing for-loop indices and off-by-one problems and swapping data between two arrays in a nested loop, etc.
So we're actually gating people on this kind of thing. And imho it's doing us a disservice. The code is not as readable. And it doesn't necessarily perform well. It's often a 1988 C programmer's idea of what good code is that is used as the entry bar.
> Because there's an assumption that the O(1) of a hashtable always beats the O(N) of a linear search. Except that that's really not always the case anymore on a modern arch, with things properly vectorized, etc.
It was never universally true. I remember reading about compiler design in the 1970s; I think it was somewhere in Wirth where he pointed out that, for local symbolic lookups, it was more efficient to use a simple loop over an array because the average number of symbols was so low that the constant management overhead from any more advanced data structure was more expensive than a linear scan.
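As a concrete illustration of that linear-scan point (my own sketch, assuming AVX2, <immintrin.h> and GCC/Clang for __builtin_ctz; a real version would handle a length that isn't a multiple of 8):

    #include <immintrin.h>
    #include <stddef.h>

    // Returns the index of `key` in keys[0..n), or -1. Compares 8 keys per
    // iteration; for small tables this avoids the hashing and pointer chasing
    // of a hash map entirely.
    static inline ptrdiff_t find_i32(const int *keys, size_t n, int key) {
        __m256i needle = _mm256_set1_epi32(key);
        for (size_t i = 0; i < n; i += 8) {
            __m256i v  = _mm256_loadu_si256((const __m256i *)(keys + i));
            __m256i eq = _mm256_cmpeq_epi32(v, needle);
            int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
            if (mask) return (ptrdiff_t)(i + (size_t)__builtin_ctz(mask));
        }
        return -1;
    }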
> Yes, people are not running weather forecasting on their home computers. But they're playing video, synthesizing audio, doing speech recognition, etc. all the time. And more of this every day.
>
> And if there's any chance of rescuing the uses of ML from a hellish privacy-violating landscape worthy of a Butlerian Jihad... it would be by making sure most of this type of computation was done locally on-device.
I think SIMD support is useful too but most of the examples you gave are things which are handled on-device now using dedicated hardware. Video playback has been hardware accelerated for ages, ML acceleration is multiple hardware generations in, etc.
This is not the case for synthesizing audio. Vectorization with GPU is essentially still useless there due to memory latency considerations, at least when I was looking at this most recently.
EDIT: there was a period in the late 90s/early 00s when dedicated DSPs were a useful but costly add-on to digital audio PC systems. SIMD moving into the CPU put an end to that.
Thanks - I said “most” because there are definitely plenty of cases like that, but I think there’s an interesting phenomenon where SIMD is useful for its flexibility and low latency but we’re seeing faster cycles adding hardware support for the highest-value acceleration targets than we used to.
Video playback only has dedicated hardware for the formats that existed when your CPU was made. A CPU with AVX-512 will do a lot better on AV1 decode/encode than one without, until someone actually makes a CPU with hardware for it. Funnily enough, the best ML acceleration hardware on CPUs is AVX-512. For small neural nets, a well-tuned CPU implementation will do way better than GPUs.
I'm not saying that there's no value in it, but rather that while you're right that it can produce a fair improvement over a traditional CPU implementation, things like video are not the best examples because they're _so_ common that they get dedicated hardware, which is usually considerably faster / more power-efficient, so there's only a narrow window of a couple of years where that SIMD implementation is most valuable.
ML is similar – many people are running these apps now but, for example, many millions of them are running Apple ML models on Apple hardware with acceleration features so again, while SIMD is unquestionably useful, I'm not surprised that the average working programmer doesn't feel a huge need to dive in rather than using something like a library which will pick from multiple backends.
I use Zig every day and I consider Zig's vector type and SIMD support a good stepping stone to the destination of just writing the intrinsics.
Competing efforts including std::simd or Highway are either unusable because they don't expose pshufb or more difficult to use than the tools they are abstracting away.
Author of Highway here :) PSHUFB is TableLookupBytes, or TableLookupBytesOr0 if you want its 0x80 -> 0 semantics.
Can you help me understand why Highway might be more difficult to use? That would be very surprising if we are writing a largish application (few thousand lines of vector code) and have to rewrite our code for 3+ different sets of intrinsics.
If I have some time I'll try porting a small parser I wrote the other day to Highway. I expect to gain a lot of bitcasts to change the element type, a lot of angle brackets, and a lot of compile time. It looks like there isn't a way to ask for permute4x64, so I will maybe replace my permute4x64 with SwapAdjacentBlocks, shuffle, and blendv, which will probably cost a register or two and some cycles.
I don't think there are necessarily problems with the design of Highway given its constraints and goals. I would just personally find it easier for most of the SIMD code I've written (this tends to be mostly parsers and serializers) to write several implementations targeting AVX2, AVX-512, NEON, and maybe SSE3.
Thanks for your feedback! I'd be happy to advise on the port, feel free to open an issue if you'd like to discuss.
I suppose there is a tradeoff between zero type safety + less typing, vs more BitCast verbosity but catching bugs such as arithmetic vs logical shifts, or zero extension when we wanted signed.
Would be interesting if you see a difference in compile time. On x86 much of the cost is likely parsing immintrin.h (*mmintrin are about 500 KiB on MSVC). Re-#including your own source code doesn't seem like it would be worse than parsing actually distinct source files.
Many of the Highway swizzle ops are fixed pattern; several use permute4x64 internally. Can you share the indices you'd like? If there is a use case, we are happy to add an op for that.
It is impressive you are willing (and able) to write and maintain code for 4 ISAs. Still, wouldn't it be nice to have also SVE for Graviton3, SVE2 on new Arm9 mobile/server CPUs, and perhaps RISC-V V in future?
> Can you share the indices you'd like? If there is a use case, we are happy to add an op for that.
I'm using 2031 and 3120 (indices and the rest of this comment in memory order assuming an eventual store, so that's imm8 = (2 << 0) | (0 << 2) | (3 << 4) | (1 << 6) etc.). This is pretty niche stuff! It's also not the only way to accomplish this thing, since I am starting with these two vectors of u64:
v1 = a a b b
v2 = c c d d
then as an unimportant intermediate step
v3 = unpacklo(v1 and v2 in some order) = a c b d
and I ultimately want
v4 = permute4x64(v3, 3120) = c d b a
v5 = permute4x64(v3, 2031) = b a c d
> It is impressive you are willing (and able) to write and maintain code for 4 ISAs.
This is mostly hypothetical, right now I'm writing software targeting specific servers and only using AVX2 and AVX-512. I may get to port stuff to NEON soon, but I've only skimmed the docs and said to myself "looks like the same stuff without worrying about lanes," not written any code.
> Still, wouldn't it be nice to have also SVE for Graviton3, SVE2 on new Arm9 mobile/server CPUs, and perhaps RISC-V V in future?
I don't know much about variable-width vector extensions like SVE and SVE2 but my impression is that they are difficult to use for the sorts of things I usually write. These things are often a little bit like chess engines and a little bit less like linear algebra systems, so the implementation is designed around a specific width. This isn't really a rebuttal though, it just means I am signing up for even more work by writing code on a per-extension per-width basis instead of a per-extension basis if I ever move on to these targets, which I certainly should if they have large performance benefits compared to NEON.
Thanks for the example. This seems like a reasonable thing to want, and as you say it's not ideal to SwapAdjacentBlocks and then shuffle again.
It's not yet clear to me how to define an op for this that's useful in case vectors are only 128-bit.
Until then, it seems you could use TableLookupLanes (general permutation) at the same latency as permute4x64, plus loading an index vector constant?
> These things are often a little bit like chess engines and a little bit less like linear algebra systems, so the implementation is designed around a specific width.
hm. For something like a bitboard or AES state, typically wider vectors mean you can do multiple independent instances at once.
Likely that's already happening for your AVX-512 version? If you can define your problem in terms of a minimum block size of 128 bits, it should be feasible.
> I ever move on to these targets, which I certainly should if they have large performance benefits compared to NEON.
SVE is the only way to access wider vectors (currently 256 or 512 bits) on Arm.
Are there any libraries that allow me to write different versions of the same function (AVX-512, AVX2, SSE, etc) and then automatically choose the best one that the system supports at runtime? Or maybe even better, the compiler generates multiple versions for me.
In other words, one binary that takes advantage of new instructions but still runs on older hardware. It doesn't really have to be older either; plenty of brand new CPUs don't support AVX-512.
Generally speaking the vendor libraries have dynamic dispatch support, which can identify which functions are available on a CPU and then deploy to the best at runtime. Intel got into some hot water for having their dispatch hurt performance on AMD CPUs, but it seems that's been fixed. Their IPP-Crypto and AMD's AOCL-Cryptography libraries both support dynamic dispatch these days, for example.
It seems to me that a table of function pointers is all that's required. Highway is a little fancier in that the first entry is a trampoline that first detects CPU capabilities and then calls your pointer; subsequent calls go straight to the appropriate function.
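A minimal sketch of that trampoline scheme (my own, not Highway's actual code; assumes GCC/Clang's __builtin_cpu_supports, and the per-ISA kernels are stand-in stubs):

    #include <cstddef>
    #include <cstring>

    // Stand-ins for real per-ISA SIMD kernels.
    static void copy_avx512(void *d, const void *s, size_t n) { std::memcpy(d, s, n); }
    static void copy_avx2  (void *d, const void *s, size_t n) { std::memcpy(d, s, n); }
    static void copy_scalar(void *d, const void *s, size_t n) { std::memcpy(d, s, n); }

    using CopyFn = void (*)(void *, const void *, size_t);
    static void copy_trampoline(void *d, const void *s, size_t n);

    // The pointer initially targets the trampoline; after the first call it is
    // overwritten, so subsequent calls go straight to the chosen kernel.
    static CopyFn g_copy = copy_trampoline;

    static void copy_trampoline(void *d, const void *s, size_t n) {
        if (__builtin_cpu_supports("avx512f"))   g_copy = copy_avx512;
        else if (__builtin_cpu_supports("avx2")) g_copy = copy_avx2;
        else                                     g_copy = copy_scalar;
        g_copy(d, s, n);
    }

    // Callers just do: g_copy(dst, src, n);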
Do the (experimental/non-portable) compiler versions contribute any additional value?
I gather from the linked-to video that binary-load-time selection has better run-time performance than init-at-first-call run-time dispatch, and doesn't have the tradeoff between performance and security.
Thanks for the pointer. I read the video transcript and agree with their premise that indirect calls are slow.
There are several ways to proceed from there. One could simply inline FastMemcpy into a larger block of code, and basically hoist the dispatch up until its overhead is low enough.
Instead, what they end up doing is pessimizing memcpy so that it is not inlined, and even goes through another thunk call, and defers the cost of patching until your code is paged in (which could be in a performance or latency-sensitive area). Indeed their microbenchmark does not prove a real-world benefit, i.e. that the thunks and patching are actually less costly than the savings from dispatch. It falls into the usual trap of repeating something 100K times, which implies perfect prediction which would not be the case in normal runs.
Also, the detection logic is limited to rules known to the OS; certainly sufficient for detecting AVX-512, probably harder to do something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".
So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch.
Caveat emptor.
Cool, I wasn't familiar with libvolk. It seems to be a collection of prewritten kernels, so it only helps if the function you want to write is among them.
github.com/google/highway seems to be closer to what is requested here. It provides around 200 portable intrinsics using which you can write a wide variety of functions (only a single implementation required). Disclosure: I am the main author.
Oooh, that's exactly the problem I had these days. The Firefox extension "Firefox translations" uses a tool (bergamot-translator) which requires SSE4 (or is it SSE3.1?) to run at all, so on all slightly older machines it fails miserably with a weird error. (90% of the machines I see around are 5 to 10 years old, work perfectly and don't need to be replaced, thank you very much; save money, think of the planet, this sort of thing.)
Rust's native support for this is super verbose and repetitive. There's a macro library that can deal with all the unsafe and version picking for you though.
Autovectorization basically doesn't work. The effort to test it properly (that it got vectorized the way you expect or at all) is more maintenance than writing it yourself.
If you insist on abstractions, autoscalarization (the opposite approach) would be better, which is kind of how Fortran works… but I unironically recommend just writing assembly like ffmpeg does.
It's called GPUs and TPUs? Turned out it's more practical to design hardware for known problems comprising big parts of computing workload than "build it and they will come" Itanium/AVX 512 plans. Also, GPUs have been repurposed for science, machine learning and mining crypto and, as additional useful applications are discovered, new hardware can support them more effectively.
trigger warning: possible rust advocacy.
I hope the warning can help some folks relax, and stave off the usual repeating meta discussion.
I've looked at a lot of rust generated outputs from iterator usage in godbolt. If you have tried this (and if not, go look), you'll know that you almost certainly never want to look at debug output for iterator usage.
In release mode builds of rust programs using iterators you'll often see SIMD. The author doesn't need to do special things to get this; the combination of aggressive inlining and encouragement to use owned and intrusive structures goes a long way to making this viable more often in practical code - so much so that I see it working _a lot_. It's not a magic bullet, and I'm not advocating that it's some be-all end-all, but it works well there.
If you'd like to see an example, search for "godbolt rust iterator" and look at the optimized output. I'm sure one of the near top results will do.
A lot of people in C++ land probably just have the vector optimizations off. If you use clang, you will get vector instructions at -O2. Gcc seems to need -O3, or the specific flags enabled.
Rust or otherwise, the more semantics that can be expressed in a language, the more optimizations that can be done for you (at the cost of lacking full control in certain scenarios).
> End result: go look for AVX-512 benchmarks. You'll find them. And then ask yourself: how many of these are relevant to what I bought my PC/mac/phone for?
... do you guys not do JSON? cause we're doing json.
seriously, linus is wrong on this issue, and he just keeps digging. lose the ego and admit you were spewing FUD.
avx-512 had legit problems in Skylake-SP server environments where it was running alongside non-AVX code (although you could segment servers by AVX and non-AVX if you wanted).
None of that has been true for any subsequent generations where the downclocking is essentially nil, and the offset has always been configurable for enthusiasts who get more buttons to push.
It also was never a "one single instruction triggers latency/downclocking" like some people think, it always took a critical mass of AVX-512 instructions (pulling down the voltage rail) before it paused and triggered downclocking, and "lighter" operations that did less work would trigger this less.
AMD putting AVX-512 in Zen4 sealed the deal. AMD isn't making a billion-dollar bet on AVX-512 because of "benchmarksmanship", they're doing it because it's a massive usability win and a huge win for performance. Over time you will see more AVX-512 adoption than AVX2 - because it adds a bunch of usability stuff that makes it even more broadly applicable, like op-masking and scatter operations, and AVX is already very broadly applicable. Some games won't even launch without it anymore because they use it and they didn't bother to write a non-AVX fallback.
It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget, you get more performance using AVX-512 ops than scalar ones. Or you can run at a lower power budget for a fixed performance target. And you're seeing that on Zen4, when it's not tied to Intel's 14nm bullshit (and really it's probably also apparent on Ice Lake-SP if anybody bothered to benchmark that). That's also a huge win in laptops too, where watts are battery life. Would anybody like more efficiency when parsing JSON web service responses when they're on battery? I would.
Sorry Linus it's over, just admit you're wrong and move on.
JSON acceleration and a handful of other peripheral spot improvements aren't going to yield big whole-application speedups in most apps.
Using hand-coded accelerated SIMD kernels in specific places like a JSON codec hits Amdahl's law[1]. The achievable whole-app speedup is going to be low in most cases unless you get pervasive, performant compiler-generated SIMD code throughout your code, done by JITs of managed languages etc.
if there were big general-case speedups still possible from auto-vectorization, that juice would have been squeezed already; even if it were only possible at runtime, the core would watch for those patterns and pull them onto the vector units.
it's like the old joke about economists: an economist is walking down the street with his friend, the friend says "look, a 20 dollar bill laying on the ground!" and bends to pick it up. But the economist keeps walking, and says "it couldn't be, or someone would already have picked it up!". That joke but with computer architecture.
we are inherently talking about a world where that's not possible anymore, that juice has been squeezed. Throwing 10% more silicon at some specific problems and getting 2.5-3x speedup (real-world numbers from SIMD-JSON) is better than throwing 10% more silicon at the general case and getting 1% more performance. If 10-20% of real-world code gets that 3x speedup (read: probably 2x perf/w) that's great, that's much better than the general-case speedups!
Depends on what abstraction level we are discussing.
Existing applications without code changes in mainstream languages with parallelism-restrictive semantics, I agree we're probably close to the limits.
Beyond that we know that most apps are amenable to human perf work and reformulation to get very big speedups.
Besides human reformulation expressed with only low-level parallelism primitives like SIMD intrinsics, the field has been using parallelism-geared languages like Cuda, Futhark, ISPC etc. And there's a lot of untapped potential in data representation flexibility etc. that even those languages aren't tapping, like for example Halide can.
Human perf work also involves a lot of trial and error; it's a search-type process, the automation of which hasn't been explored that much. Some work toward automating this includes approaches like ATLAS BLAS.
> ... do you guys not do JSON? cause we're doing json.
Go ahead, benchmark how much of the runtime of the entire pipeline between browser and server consists of JSON parsing. I bet it is in the low single digits unless you're literally doing "deserialize, change a variable, serialize".
Yes, it is a use case that works, but unless you're keeping gigabytes in JSON and analyzing it, it isn't making most users' jobs any faster; low-single-digit increases at best.
> It's also a huge efficiency win when used in a constrained-power environment - for your 200W-per-1RU power budget,
Which is not the use case Linus was talking about ? He didn't argue it doesn't make sense in those rare cases where you can optimize for it and GPUs are not an option.
There's now another game in town, exemplified by Apple's switch to ARM chips for its Macs. I'll keep an eye on the AVX2 vs. AVX-512 performance gap on Zen4+, but my working hypothesis is that my SIMD-handcoding time will be better spent on improving ARM support than upgrading AVX2 code to AVX-512 for the foreseeable future.
I'd probably use Highway in a new project; thanks for your work on it! In my main existing project, though, Highway-like code already exists as a side effect of supporting 16-byte vectors and AVX2 simultaneously, and I'd also have to give up the buildable-as-C99 property which has occasionally simplified e.g. FFI development.
:)
C99 for FFI makes sense. It's pretty common to have a C-like function as the entry point for a SIMD kernel. That means it's feasible to build only the implementation as C++, right?
I'm a huge simp for M1 too (and there's SVE there too). Yeah for client stuff if you can get people to just buy a macbook that's the best answer right now, if that does their daily tasks. Places need to start thinking about building ARM images anyway, for Ampere and Graviton and other cost-effective server environments if nothing else. If you are that glued at the hip to x86 is time to look at solving this problem.
Apple's p-cores get the limelight but the e-cores are simply ridiculous for their size... they are 0.69mm^2 vs 1.7mm2 for gracemont, excluding cache. Gracemont is Intel 7, so it's a node behind, but, real-world scaling is about 1.5-1.6x between 5nm and 6nm so that works out to about 1.1mm2 for Avalanche if it were 7nm, for equal/better performance to gracemont, at much lower power.
Sierra Forest (bunch of nextmont on a server die, like Denverton) looks super interesting and I'd absolutely love to see an Apple equivalent, give me 256 blizzard cores on a chiplet and 512 or 1024 on a package. Or even just an M1 Ultra X-Serve would be fantastic (although the large GPU does go unutilized). But I don't think Apple wants to get into that market so far from what I've seen.
(tangent but everyone says "Gracemont is optimized for size not efficiency!" and I don't know what that means in a practical sense. High-density cell libraries are both smaller and more efficient. So if people meant that they were using high-performance libraries that would be both bigger and less efficient (but clock higher). If it's high density it'd be smaller and more efficient but clock lower. Those two things go together. And yes everyone uses a mix of different types of cells, with high-performance cells on the timing hot-path... but "gracemont is optimized for size not efficiency" has become this meme that everyone chants and I don't know what that actually is supposed to mean. If anyone knows what that's supposed to be, please do tell.)
(also, as you can see from the size comparison... despite the "it's optimized for size" meme, gracemont still isn't really small, not like Blizzard is small. they're using ~50% more transistors to get to the same place, and it's almost half the size of a full zen3 core with SMT and all the bells and whistles... I really think e-cores are where the music stops with the x86 party, I think i-cache and decoders are fine on the big cores but as you scale downwards they start taking up a larger and larger portion of the core area that remains... it is Amdahl's Law in action with area, if i-cache and decoding doesn't scale then reducing the core increases the fraction devoted to i-cache/decoding. And if you scale it down then you pay more penalty for x86-ness in other places, like having to run the decoder. And you have to run the i-cache at all times even when the chip is idling, otherwise you are decoding a lot more. It just is a lot of power overhead for the things you use an e-core for.)
And I claim that that is the real problem with AVX-512 (and pretty much any vectorization). I personally cannot find a single benchmark that does anything I would ever do - not even remotely close. So if you aren't into some chess engine, if you aren't into parsing (but not using) JSON, if you aren't into software raytracing (as opposed to raytracing in games, which is clearly starting to take off thanks to GPU support), what else is there?
If you need a little bit of inference (say, 20 ResNet50s per second per CPU core) as part of a larger system, there's nothing cheaper. If you're doing a small amount of inference, perhaps limited by other parts of the system, you can't keep a GPU fed and the GPU is a huge waste of money.
AVX-512, with its masked operations and dual-input permutations, is an expressive and powerful SIMD instruction set. It's a pleasure to write code for, but we need good hardware support (which is literally years overdue).
I'd say AES Encryption/Decryption (aka: every HTTPS connection out there) and SHA256 hashing are big. As is CRC32 (the PCLMULQDQ instruction), and others.
There's.... a ton of applications of AVX512. I know that Linus loves his hot-takes, but he's pretty ignorant on this particular subject.
I'd say that most modern computers are probably reading from TLS1.2 (aka: AES decryption), processing some JSON, and then writing back out to TLS1.2 (aka: AES Encryption), with probably some CRC32 checks in between.
--------
Aside from that, CPU signal filtering (aka: GIMP image processing, Photoshop, JPEGs, encoding/decoding, audio / musical stuff). There's also raytracing with more than the 8GB to 16GB found in typical GPUs (IE: Modern CPUs support 128GB easily, and 2TB if you go server-class), and Moana back in 2016 was using up 100+ GB per scene. So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.
> AES Encryption/Decryption (aka: every HTTPS connection out there),
that already have dedicated hardware on most of the x86 CPUs for good few years now. Fuck, I have some tiny ARM core with like 32kB of RAM somewhere that rocks AES acceleration...
> So even if GPUs are faster, they still can't hold modern movie raytraced scenes in memory, so you're kinda forced to use CPUs right now.
Can't GPUs just use system memory at a performance penalty?
> that already have dedicated hardware on most of the x86 CPUs for good few years now
Yeah, and that "dedicated hardware" is called AES-NI, which is implemented as AVX instructions.
In AVX512, they now apply to 4 blocks at a time (512-bit wide = 128-bit x 4 parallel instances). AES-NI being upgraded with AVX512 is... well... a big, important update to AES-NI.
AES-NI's next-generation implementation _IS_ AVX512. And it works because AES-GCM is embarrassingly parallel (apologies to all who are stuck on the sequential-only AES-CBC)
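For the curious, a minimal sketch of what that looks like (my own, assuming the VAES extension on top of AVX-512F and <immintrin.h>):

    #include <immintrin.h>

    // One AES round applied to four independent 128-bit blocks at once.
    // Each 128-bit lane of the zmm register is its own AES state, which is why
    // parallel modes like CTR/GCM map onto this so well.
    static inline __m512i aes_round_x4(__m512i four_blocks, __m512i round_key) {
        return _mm512_aesenc_epi128(four_blocks, round_key);  // VAESENC on zmm
    }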
> Can't GPUs just use system memory at a performance penalty?
CPUs can access DDR4/DDR5 RAM at 50 nanoseconds. GPUs will access DDR4/DDR5 RAM at 5000 nanoseconds, 100x slower than the CPU. There's no hope for the GPU to keep up, especially since raytracing is _very_ heavy on RAM latency. Each ray "bounce" is basically a bunch of RAM accesses (traversing a BVH tree).
It's just better to use a CPU if you end up using DDR4/DDR5 RAM to hold the data. There are algorithms that break up a scene into octrees that only hold, say, 8GB worth of data; then the GPU can calculate all the light bounces within a box (and then write out the "bounces" that leave the box), etc. etc. But this is very advanced and under heavy research.
For now, it's easier to just use a CPU that can access all 100GB+ and just render the scene without splitting it up. Maybe eventually this GPU octree splitting / process-within-the-GPU subproblem approach will become better researched and better implemented, and GPUs will traverse system RAM a bit better.
GPUs will be better eventually. But CPUs are still better at the task today.
> I am confused, CPUs have dedicated instructions for AES encryption and CRC32. Are they slower than AVX512?
Those instructions are literally AVX instructions, and have been _upgraded_ in AVX512 to be 512-bit wide now.
If you use the older 128-bit wide AES-NI instruction, rather than the AVX512-AES-NI instructions, you're 4x slower than me. AVX512 upgrades _ALL_ AVX instructions to 512-bits (and mind you, AES-NI was stuck on 128-bits, so the upgrade to 512-bit is a huge upgrade in practice).
-----
EDIT: CRC32 is implemented with the PCLMULQDQ instruction, which has also been upgraded to AVX512.
True, but the problem is that that is today better done on vector hardware like a GPU or other ML hardware. The world has sort of diverged into two camps: vectorizable problems that can be massively parallelized (graphics, simulation, ML), for which we use GPUs, and then everything else, which is CPU. What I think Linus is saying is that there are few reasons to use AVX-512 on a CPU when there is a GPU much better suited for those kinds of problems.
You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
GPUs are still an unworkable target for wide end user audiences because of all the fragmentation, mutually incompatible APIs on macOS/Windows/Linux, proprietary languages, poor dev experience, buggy driver stacks etc.
Not to mention a host of other smaller problems (e.g. no standard way to write tightly coupled CPU/GPU code, spotty virtualization support in GPUs, lack of integration in established high-level languages, and other chilling factors).
The ML niche that can require specific kinds of NVidia GPUs seems to be an island of its own that works for some things, but it's not great.
While true, it is still easier to write shader code than trying to understand the low level details of SIMD and similar instruction sets, that are only exposed in a few selected languages.
Even JavaScript has easier ways to call into GPU code than exposing vector instructions.
Yes, one is easier to write and the other is easier to ship, except for WebGL.
The JS/browser angle has another GPU-related parallel here. WebAssembly SIMD has been shipping for a couple of years and, like WebGL, makes the browser platform one of the few portable ways to access this parallel-programming functionality now.
(But functionality is limited to approximately same as the 1999 vintage x86 SSE1)
> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
People are forgetting the "Could run on a GPU but I don't know how" factor. There are tons of situations where GPU offloading would be faster or more energy efficient, but importing all the libraries, dealing with drivers etc. really is not worth the effort, whereas doing it on a CPU is really just a simple include away.
> You could say that the intersecting area in the Venn diagram of "Has to run on CPU" and "Can use vector instructions" is small.
I dunno, JSON parsing is stupid hot these days because of web stacks. Given the neat parsing tricks by simdjson mentioned upthread, it seems like AVX512 could accelerate many applications that boil down to linear searches through memory, which includes lots of parsing and network problems.
Memcpy and memset are massively parallel operations used on a CPU all the time.
But let's ignore the _easy_ problems. AES-GCM mode is massively parallelizable as well: each 128-bit block of AES-GCM can run in parallel, so AVX512-AES encryption can process 4 blocks in parallel per clock tick.
Icelake and later CPUs have a REP MOVS / REP STOS implementation that is generally optimal for memcpy and memset, so there’s no reason to use AVX512 for those except in very specific cases.
I know when I use GCC to compile with AVX512 flags, it seems to output memcpy as AVX registers / ZMMs and stuff...
Auto-vectorization usually sucks for most code. But very simple structure-initialization / memcpy / memset-like code is ideal for AVX512. It's a pretty common use case (think a C++ vector<SomeClass> where the default constructor sets the 128-byte structure to some defaults).
AVX512 doesn't itself imply Icelake+; the actual feature is FSRM (fast short rep movs), which is distinct from AVX512. In particular, Skylake Xeon and Cannon Lake, Cascade Lake, and Cooper Lake all have AVX512 but not FSRM, but my expectation is that all future architectures will have support, so I would expect memcpy and memset implementations tuned for Icelake and onwards to take advantage of it.
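For reference, the whole "tuned for Icelake onwards" memcpy core can be as small as this (a sketch assuming x86-64 and GCC/Clang inline asm; on FSRM parts the microcode expands it into wide copies):

    #include <stddef.h>

    static inline void repmovsb_copy(void *dst, const void *src, size_t n) {
        // REP MOVSB: rdi = dst, rsi = src, rcx = count. With FSRM this is
        // competitive with (or beats) hand-written AVX-512 copy loops for
        // most sizes.
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }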
GPUs have a high enough latency that for O(n) operations, the time it takes to move the data to the GPU will be higher than the time it takes to run the problem on a CPU. AVX-512 is great because it makes it easy to speed up code to the point that it's memory bottlenecked.
https://lemire.me/blog/2022/05/25/parsing-json-faster-with-i...