Removing characters from strings faster with AVX-512 (lemire.me)
146 points by mdb31 on May 1, 2022 | 85 comments



Intel is removing AVX-512 support from their newer CPUs (Alder Lake+). :/

https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...


Server and workstation chips still have AVX-512. It’s only unsupported on CPUs with smaller E(fficiency) cores.

AVX-512 was never really supported in newer consumer CPUs with heterogeneous architecture. These CPUs have a mix of powerful cores and efficiency cores. The AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the purpose of efficiency cores.

There was previously a hidden option to disable the efficiency cores and enable AVX-512 on the remaining power cores, but the number of workloads that would warrant turning off a lot of your cores to speed up AVX-512 calculations is virtually non-existent in the consumer world (where these cheap CPUs are targeted).

The whole journalism controversy around AVX-512 has been a bit of a joke because many of the same journalists tried to generate controversy when AVX-512 was first introduced and they realized that AVX-512 code would reduce the CPU clock speed. There were numerous articles about turning off AVX-512 on previous generation CPUs to avoid this downclocking and to make overclocks more stable.


Catching the bad instruction fault on the E-cores and only scheduling the thread on the P-cores would be something that could be added to Linux (there were already third-party patches towards that goal) if Intel had not disabled the feature entirely.


But it's not really compatible with the GCC IFUNC scheme ... PLT entries will be permanently remapped to the most appropriate code on the CPU where the function is first called, and never thereafter remapped. So you end up with a coin toss whether you get the optimized function or not.

Personally I don't find the e-cores on my alder lake CPU to be of any value. They're more of a hazard than a benefit.


Fair point about ifunc, but we're using our own table of function pointers, which can be invalidated/re-initialized. Someone also mentioned that the OS could catch SIGILL... indeed it seems doable to then reset thread affinity to the P cores?
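
Something like this, perhaps (untested sketch; P_CORE_IDS is a made-up placeholder, real code would read the hybrid topology from sysfs, and you'd want to double-check async-signal-safety):

  // Untested sketch: when an AVX-512 instruction faults on an E core with
  // SIGILL, restrict the thread's affinity to the P cores and return; the
  // kernel then migrates the thread and re-executes the faulting instruction.
  // P_CORE_IDS is a made-up placeholder (real code would read sysfs).
  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <string.h>

  static const int P_CORE_IDS[] = {0, 1, 2, 3, 4, 5, 6, 7};

  static void on_sigill(int sig) {
    (void)sig;
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned i = 0; i < sizeof P_CORE_IDS / sizeof P_CORE_IDS[0]; i++)
      CPU_SET(P_CORE_IDS[i], &set);
    sched_setaffinity(0, sizeof set, &set);   // pid 0 == calling thread
  }

  void install_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, NULL);
  }

Returning from the handler re-executes the faulting instruction, and since the affinity mask no longer includes the E cores, the kernel should migrate the thread first.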


Presumably the AVX-512 code is something on your hot path, so I’m not sure waiting for a signal to reschedule the work is something you would want.


You reschedule it only once so it doesn’t matter


I don’t see how? Once your quantum is up the thread gets put back into the scheduling pool and you have to do this all over again…


Couldn’t there be some flag on the thread that marks it as “P-core only”? Doesn’t seem hard. I don’t know Linux scheduler internals though.


Threads already have an affinity bit mask where you can select which cores they can be scheduled on.

https://man7.org/linux/man-pages/man2/sched_setaffinity.2.ht...


Right but then you’d just have all processes with that flag all the time.


Only if glibc decides to use AVX512 for memset and such. I’m not sure if that makes sense to begin with, but it could also not do that if it detects a heterogeneous CPU.


You don’t have to schedule your thread on another core at all


> The AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the purpose of efficiency cores.

And this is why scalable vector ISAs like the RISC-V vector extensions are superior to fixed-size SIMD. You can support both kinds of microarchitecture while running the exact same code.


>The AVX-512 instructions were never added to the efficiency cores because it would use way too much die space and defeat the purpose of efficiency cores.

Isn't the purpose of efficiency cores to be more power efficient? It's more power efficient to vectorize instructions and minimize pipeline re-ordering.


Power and area efficient. You can fit 4 E cores in the area of 1 P core. Adding AVX-512 to the E cores would significantly hamper that, though I don't know by how much.


I've been thinking about this. I think the idea of heterogeneous extensions is nuts. Intel is right to either have it on all cores or none. I think the die space would be worth it to actually have more high performance, efficient vector extensions available. Considering how many E cores are being crammed into raptor lake, I hope Intel will decide to add AVX-512 into E cores in time.


That's not a valid reason why I can't use them on the P cores. Some motherboards can enable them on the i9-12900k, it works fine, but you need to pin to a P core.


The reason is that it was never validated or tested with AVX-512 and Intel and motherboard vendors couldn’t commit to shipping everything with AVX-512 support in future steppings/revisions.

If you disable E cores you could enable AVX-512 on certain motherboards, but like I said that’s not really a net win 99.99% of the time when you’re giving up entire cores.

It was also at your own risk because presumably the power/clock speed profiles were never tuned for a feature that wasn’t actually supported. I can see exactly why they turned it off on newer CPUs only after an announcement.


Still smells like bullshit. Let the customer decide. Who cares if it was validated? Why was it even included? Just put it behind a yes-I-really-mean-it-switch so nobody uses it by accident.


There aren't many great places for such a switch, and even then you have to validate _that_ behavior. It's mostly just not worth it.


Several motherboard vendors added that very switch in UEFI (some even with the option to force an older microcode version, AFAIK) before it was disabled in hardware by Intel.


> it was never validated or tested with AVX-512 and Intel and motherboard vendors couldn’t commit

Only because they screwed it up on purpose! That's not an acceptable reason for removing the feature; in part because it would apply to any feature they decided to cut.


You're forgetting about server CPUs, and we don't know yet about Raptor Lake.


Ah, yep. You're totally right. I didn't even consider server CPUs. Also, I thought I read somewhere that it was for all consumer CPUs starting at Alder Lake, but I have no idea where, so I could be entirely wrong. :)


This is only on the client side; server still has and will have AVX512 for the foreseeable future.


And zen 4 is rumoured to add support for it ^^


A problem is that the CPU frequency drops significantly when AVX-512 is involved, e.g. https://en.wikichip.org/wiki/intel/xeon_gold/6262v, which usually cancels out the benefit in the Real World (tm).


This was massively exaggerated by journalists when AVX-512 was first announced.

It is true that randomly applied AVX-512 instructions can cause a slight clock speed reduction, but the proper way to use libraries like this is within specific hot code loops, where the mild clock speed reduction is more than offset by the huge parallelism increase.

This doesn’t make sense if you’re a consumer multitasking while a background process is invoking the AVX-512 penalty, but it usually would make sense in a server scenario.


The thing I never understood about this is why Intel didn't just add latency to the AVX-512 instructions instead? That seems much easier than downclocking the whole CPU.


I believe they do actually do something like this - until power and voltage delivery change, wide instructions are throttled independently of frequency changes (which on SKX involved a short halt).


Intel has been trying to reduce the penalty for AVX-512, and barring that, advertise that there is no penalty. Most things on Ice Lake run fine with 256 bit vectors, but Skylake and earlier really needed 128 bit or narrower if you weren't doing serious vector math.

Forget about 512 bit vectors or FMAs.


I think this is less of a problem on newer CPUs: https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...


Those are client CPUs, which have very different behavior around power management than server parts. However, AVX downclocking has mostly gone away with ice lake and hopefully sapphire rapids does away with it permanently (except on 512 bit vectors).


Unless someone has data for the latest Intel chips (i.e. sapphire rapids) showing the opposite I'm inclined to think this is a meme from 2016/7 that needs to go the way of the dodo.


It was largely wrong then, too. Cloudflare, who really kicked off a large amount of the fuss, had "Bronze" class Xeon chips that weren't designed or marketed for what they were attempting to use them for. They were only ever intended for small business stuff, not large scale high performance operations. Their performance downclock for AVX-512 is way, way higher on Bronze.


Weren’t those chips $10k each back then? Hardly anyone got gold Xeons.


Not even close. The blog post was 2017.

Actually, I stand corrected, after double checking, Cloudflare were using Silver. Entry level data centre chips, instead of small business chips. Still not the kind of chips you'd buy for high performance infrastructure, and not intended to be used for such.

Xeon Silver 4116s hit the market at $1,002.00. The Golds were $1,221.00. The performance differences are quite significant. For something that'll be in service for ~3-5 years, $200 is absolutely trivial by way of a per-chip increase. It's firmly in the "false economy" territory to be skimping on your chip costs. It's a bit more understandable in smaller businesses, but you just don't do it when you're operating at scale.

Also remember: at the scales that Cloudflare are purchasing at, they won't be paying RRP. They'll be getting tidy discounts.


I’m not familiar with the model numbers. What’s the gold equivalent to the Xeon Silver 4116?

Anyway I’m sure they compared the TCO of buying more low-end chips vs fewer high-end chips.


I would love to see an example of reasonable code not seeing any benefit. On first generation SKX, we saw 1.5x speedups vs AVX2, and that was IIRC even without taking much advantage of AVX3-only instructions.


Please stop spreading this fallacy. While downclocking can happen, the benefit is usually still strong and superior to 256-bit AVX. Even 256-bit code can induce downclocking. AVX-512, when properly utilized, simply demolishes non-AVX-512 CPUs.


On that one task. The challenge is if the AVX-512 pieces aren’t a bottleneck in every single concurrent workload you run. It’s fine if the most important thing you’re running on them is code optimized for AVX-512. Realistically, though, is that the case for the target market of CPUs capable of AVX-512, given that consumer use cases aren’t? The predominant workload would be cloud, right? Where you’re running heterogeneous workloads? You’d have to get real smart by coalescing AVX-512 and non-AVX-512 workloads onto separate machines and disabling it on the machines that don’t need it. Very complicated work, because you’d have to have each workload annotated by hand (memcpy is optimized to use AVX-512 when available, so the presence of AVX-512 in the code is insufficient).

The more generous interpretation is that Intel fixed that issue a while back although the CPUs with that problem are still in rotation and you have to think about that when compiling your code.


This is really cool.

I just got through doing some work with vectorization.

On the simplest workload I have (splitting a 3 MByte text file into lines, writing a pointer to each string to an array), GCC will not vectorize the naive loop, though ICC might, I guess.

With simple vectorization to AVX512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, so a little better than a 10x speedup, still just on the one core, which keeps things simple.

I was using Agner Fog's VCL 2, Apache licensed C++ Vector Class Library. It's super easy.
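
For reference, the same kind of scan with raw AVX-512 intrinsics instead of VCL looks roughly like this (my own untested sketch, not the code described above):

  // Rough intrinsics equivalent of that kind of scan (untested sketch, needs
  // -mavx512bw and -mbmi): compare 64 bytes at a time against '\n', then walk
  // the resulting bit mask to record where each following line starts.
  #include <immintrin.h>
  #include <stddef.h>

  size_t find_lines(const char *buf, size_t len, const char **lines) {
    const __m512i nl = _mm512_set1_epi8('\n');
    size_t count = 0, i = 0;
    for (; i + 64 <= len; i += 64) {
      __mmask64 m = _mm512_cmpeq_epi8_mask(_mm512_loadu_si512(buf + i), nl);
      while (m) {
        size_t off = (size_t)_tzcnt_u64(m);  // offset of the next '\n'
        lines[count++] = buf + i + off + 1;  // the following line starts here
        m &= m - 1;                          // clear the lowest set bit
      }
    }
    for (; i < len; i++)                     // scalar tail for the remainder
      if (buf[i] == '\n') lines[count++] = buf + i + 1;
    return count;
  }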


Cool performance enhancement, with an accompanying implementation in a real-world library (https://github.com/lemire/despacer).

Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making existing REPB prefixes a lot faster?


AVX-512 is an elegant, powerful, flexible set of masked vector instructions that is useful for many purposes. For example, low-cost neural net inference (https://NN-512.com). To suggest that Intel and AMD should instead make "existing REPB prefixes a lot faster" is missing the big picture. The masked compression instructions (one of which is used in Lemire's article) are endlessly useful, not just for stripping spaces out of a string!
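
For instance, the same compress pattern filters an integer array just as easily (this is only a toy sketch to illustrate the point, not code from the article or NN-512):

  // Toy sketch: keeping only the elements of an int array above a threshold
  // is the same compress pattern with 32-bit lanes, i.e. VPCOMPRESSD from
  // plain AVX-512F, no VBMI2 required.
  #include <immintrin.h>
  #include <stddef.h>

  size_t filter_gt(const int *in, size_t n, int threshold, int *out) {
    const __m512i t = _mm512_set1_epi32(threshold);
    size_t i = 0, pos = 0;
    for (; i + 16 <= n; i += 16) {
      __m512i v = _mm512_loadu_si512(in + i);
      __mmask16 keep = _mm512_cmpgt_epi32_mask(v, t);    // lanes > threshold
      _mm512_mask_compressstoreu_epi32(out + pos, keep, v);
      pos += (size_t)_mm_popcnt_u32(keep);               // advance by kept count
    }
    for (; i < n; i++)                                   // scalar tail
      if (in[i] > threshold) out[pos++] = in[i];
    return pos;
  }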


Many people seem to think AVX-512 is just wider AVX, which is a shame.

NN-512 is cool. I think the Go code is pretty ugly but I like the concept of the compiler a lot.


Why is a large speedup from vectors surprising? Considering that the energy required for scheduling/dispatching an instruction on OoO cores dwarfs that of the actual operation (add/mul etc), amortizing over multiple elements (=SIMD) is an obvious win.


Where do I say that the speedup is surprising?

My question is whether Intel investing in AVX-512 is wise, given that:

- Most existing code is not aware of AVX anyway;

- Developers are especially wary of AVX-512, since they expect it to be discontinued soon.

Consequently, wouldn't Intel be better off by using the silicon dedicated to AVX-512 to speed up instruction patterns that are actually used?


AVX-512 is not going to be discontinued. Intel's reluctance and struggles with having it on desktop are irritating, but it's here to stay on servers for a long time.

Writing code for a specific SIMD instruction set is non-trivial, but most code will get some benefit by being compiled for the right ISA. You don't get the really fancy instructions because the pattern matching in the compiler isn't very intelligent but quite a lot of stuff is going to benefit by magic.

Even without cutting off people who lack some level of AVX, you can have a fast/slow path fairly easily.
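
Something like this, for instance (just a sketch; the function names are placeholders, and the "avx512vbmi2" feature string assumes a reasonably recent GCC or Clang):

  // Sketch only: despace_avx512 / despace_scalar are placeholder names.
  #include <stddef.h>

  void despace_avx512(char *s, size_t n);   // built with -mavx512vbmi2
  void despace_scalar(char *s, size_t n);   // portable fallback

  void despace(char *s, size_t n) {
    if (__builtin_cpu_supports("avx512vbmi2"))
      despace_avx512(s, n);
    else
      despace_scalar(s, n);
  }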


My point is that vector instructions are fundamentally necessary and thus "what does it signal" evaluates to "nothing surprising".

Sure, REP STOSB/MOVSB make for a very compact memset/memcpy, but their performance varies depending on CPU feature flags, so you're going to want multiple codepaths anyway. And vector instructions are vastly more flexible than just those two.

Also, I have not met developers who expect AVX-512 to be discontinued (the regrettable ADL situation notwithstanding; that's not a server CPU). AMD is actually adding AVX-512.


> vector instructions are fundamentally necessary

For which percentage of users?

> AMD is actually adding AVX-512

Which is irrelevant to in-market support for that instruction set.


> For which percentage of users?

Anyone using software that benefits from vector instructions. That includes a variety of compression, search, and image processing algorithms. Your JPEG decompression library might be using SSE2 or Neon. All high-end processors have included some form of vector instruction for like 20+ years now. Even the processor in my old eBook reader has the ARM Neon instructions.


Any users who either wants performance or uses a language that can depend on a fast library.


Why would it be irrelevant? Even the paucity of availability isn't really a problem - the big winners here are server users in data centers, not desktops or laptops. How much string parsing and munging is happening ingesting big datasets right now? If running a specially optimized function set on part of your fleet reduces utilization, that's direct cost savings you realize. And if AMD then widens that support base, that deeply favors expanding usage as you scale up.


Given that Intel's AVX extension could cause silent failures on servers (very high workload for prolonged time, compared to end-user computers), I'm not sure it would be a big win for servers either: https://arxiv.org/pdf/2102.11245.pdf.


I'm downvoting you because the assertion you're implying--that use of AVX increases soft failure rates more than using non-AVX instructions would--is not sustained by the source you use as reference.


Indeed, I'd summarise that source as "At Facebook sometimes weird stuff happens. We postulate it's not because of all the buggy code written by Software Engineers like us, it must be hardware. As well as lots of speculation about hypothetical widespread problems that would show we're actually not writing buggy software, here's a single concrete example where it was hardware".

If anything I'd say that Core 59 is one of those exceptions that prove the rule. This is such a rare phenomenon that when it does happen you can do the work to pin it down and say yup, this CPU is busted - if it was really commonplace you'd constantly trip over these bugs and get nowhere. There probably isn't really, as that paper claims, a "systemic issue across generations" except that those generations are all running Facebook's buggy code.


One interesting anecdote is that HPC planning for exascale included significant concern about machine failures and (silent) data corruption. When running at large enough scale, even seemingly small failure rates translate into "oh, there goes another one".


Is it generally possible to convert rep str sequences to AVX? Could the hardware or compiler already be doing this?

AVX is just the SIMD unit. I would argue the transistors were spent on SIMD, and the hitch is simply the best way to send str commands to the SIMD hardware.


Why? IIRC something like 99% of string operations are on 20 chars or less. If you're hitting bottlenecks then optimize.


If you are arguing most string ops have just a few chars and therefore don’t use vectors… why do we need to spend silicon enhancing rep prefix in the first place?


What's the generated assembly look like? I suspect clang isn't smart enough to store things into registers. The latency of VPCOMPRESSB seems quite high (according to the table here at least https://uops.info/table.html), so you'll probably want to induce a bit more pipelining by manually unrolling into the register variant.

I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:

  __m512i spaces = _mm512_set1_epi8(' ');
  size_t i = 0;
  for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
    // 4 input regs, 4 output regs, you can actually do up to 8 because there are 8 mask registers
    __m512i in0 = _mm512_loadu_si512(bytes + i);
    __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
    __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
    __m512i in3 = _mm512_loadu_si512(bytes + i + 192);

    __mmask64 mask0 = _mm512_cmpgt_epi8_mask (in0, spaces);
    __mmask64 mask1 = _mm512_cmpgt_epi8_mask (in1, spaces);
    __mmask64 mask2 = _mm512_cmpgt_epi8_mask (in2, spaces);
    __mmask64 mask3 = _mm512_cmpgt_epi8_mask (in3, spaces);

    auto reg0 = _mm512_maskz_compress_epi8 (mask0, in0);
    auto reg1 = _mm512_maskz_compress_epi8 (mask1, in1);
    auto reg2 = _mm512_maskz_compress_epi8 (mask2, in2);
    auto reg3 = _mm512_maskz_compress_epi8 (mask3, in3);

    _mm512_storeu_si512(bytes + pos, reg0);
    pos += _popcnt64(mask0);
    _mm512_storeu_si512(bytes + pos, reg1);
    pos += _popcnt64(mask1);
    _mm512_storeu_si512(bytes + pos, reg2);
    pos += _popcnt64(mask2);
    _mm512_storeu_si512(bytes + pos, reg3);
    pos += _popcnt64(mask3);
  }
  // old code can go here, since it handles a smaller size well

You can probably do better by chunking up the input and using temporary memory (coalesced at the end).


I love Daniel’s vectorized string processing posts. There’s always some clever trickery that’s hard for a guy like me (who mostly uses vector extensions for ML kernels) to get quickly.

I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.



The complication involved with UTF-8 encoded space removal is immense and likely quite far out of scope.


Here's a list of processors supporting AVX-512:

https://ark.intel.com/content/www/us/en/ark/search/featurefi...

The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has a quite good catalog.


What would be a practical application of this? The linked post mentions a trim like operation, but in practice I only want to remove white space from the ends, not the interior of the string, and finding the ends is basically the whole problem. Or maybe I want to compress some json, but a simple approach won't work because there can be spaces inside string values which must be preserved.


I agree that the whitespace in text example seems a bit contrived but I've done similar types of byte elision operations on binary streams (e.g. for compression purposes), which this could be trivially adapted to.


I prefer AMD's approach that allows them to put more cores on the die instead of supporting a rarely used instruction set.


Zen 4 is rumored to have AVX512. AMD has in the past had support for wide SIMD instructions with half internal width implementation, so the die area requirements and instruction set support are somewhat orthogonal. There's many other interesting things in AVX512 besides the wide vectors.


AVX-512 finally gets a lot of things right about vector manipulation and plugged a lot of the holes in the instruction set. Part of me is upset that it came with the "512" name - they could have called it "AVX3" or "AVX Version 2" (since it's intel and they love confusing names).


Actually AVX-512 predates AVX and Sandy Bridge.

The original name of AVX-512 was "Larrabee New Instructions". Unlike with the other Intel instruction set extensions, the team which defined the "Larrabee New Instructions" included graphics experts hired from outside Intel, which is probably the reason why AVX-512 is a better SIMD instruction set than all the others designed by Intel.

Unfortunately, Sandy Bridge (2011), instead of implementing a scaled-down version of the "Larrabee New Instructions", implemented the significantly worse AVX instruction set.

A couple of years later, Intel Haswell (2013), added to AVX a few of the extra instructions of the "Larrabee New Instructions", e.g. fused multiply-add and memory gather instructions. The Haswell AVX2 was thus a great improvement over the Sandy Bridge AVX, but it remained far from having all the features that had already existed in LRBni (made public in 2009).

After the Intel Larrabee project flopped, LRBni passed through a few name changes, until 2016, when it was renamed to AVX-512 after a small change in the binary encoding of the instructions.

I also dislike the name "AVX-512", but my reason is different. "AVX-512" is made to sound like it is an evolution of AVX, while the truth is the other way around, AVX was an involution of LRBni, whose purpose was to maximize the profits of Intel by minimizing the CPU manufacturing costs, taking advantage of the fact that the competition was weak, so the buyers had to be content with the crippled Intel CPUs with AVX, because nobody offered anything better.

The existence of AVX has caused a lot of additional work for many programmers, who had to write programs much more complex than it would have been possible with LRBni, which had from the beginning features designed to allow simplified programming, e.g. the mask registers that allow much simpler prologues and epilogues for loops and both gather loads and scatter stores for accessing the memory.


Hmm. That's not how I recall it. The folks in Israel working on Sandybridge (Gesher), already had their AVX plans in place before LRBni was "finalized" (even by the time of "our" siggraph paper -- I was only tangentially involved, not listed -- new instructions were being added all the time).

So it's more like both groups knew what the other was doing, but LRBni was free to focus primarily on graphics and a clean slate, while the AVX folks shot for "SSE but wider, and a few more".

AVX-512 is sort of a franken-combo of what AVX3 would have been, plus many of the LRBni instructions that shipped in the poorly named MIC parts, plus some more (e.g., now including a VNNI dialect, bf16 ops, etc.).


Indeed, as you say, the development of both LRBni and of AVX by 2 separate Intel teams stretched over many years.

Most of the development of LRBni was between 2005 and 2009, when it became publicly known. The first product with LRBni was Knights Ferry, which was introduced in 2010, being made with the older 45-nm process. Knights Ferry was used only in development systems, due to insufficient performance.

Sandy Bridge, using the newer 32-nm process, was launched in 2011. I do not know when the development of Sandy Bridge had started, but in any case the first few years of development must have overlapped with the last few years of the development of LRBni.

I suppose that there was little, if any, communication between the 2 Intel teams.

AVX was developed as an instruction set extension in the same way as the majority of the instruction set extensions had been developed by Intel since the days of Intel 8008 (1972) and until the present x86 ISA.

Intel has only very seldom introduced new instructions that had been designed having a global view of the instruction set and making a thorough analysis of which instructions should exist in order to reach either the best performance or the least programming effort.

In most cases the new instructions have been chosen so that they would need only minimal hardware changes from the previous CPU generation for their implementation, while still providing a measurable improvement in some benchmark. The most innovative additions to the Intel ISA had usually been included in the instruction sets of other CPUs many years before, but Intel has delayed to also add them as much as possible.

This strategy of Intel is indeed the best for ensuring the largest profits from making CPUs, as long as there is no strong competition.

Moreover, now the quality of the ISA matters much less for performance than earlier, because the very complex CPUs from today can perform a lot of transformations on the instruction stream, like splitting / reordering / fusion, which can remove performance bottlenecks due to poor instruction encoding.

Most programmers use only high-level languages, so only the compiler writers and those that have to write extremely optimized programs have to deal with various ugly parts of the Intel-AMD ISA.

So AVX for Sandy Bridge has been designed in the typical Intel way, having as target to be a minimal improvement over SSE.

On the other hand, LRBni was designed from the ground up to be the best instruction set that they knew how to implement for performing its tasks.

So it was normal that the end results were different.

For the Intel customers, it would have been much better if Intel did not have 2 divergent developments for their future SIMD ISA, but they would have established a single, coherent, roadmap for SIMD ISA development during the next generations of Intel CPUs.

In an ideal company such a roadmap should have been established after discussions with a wide participation, from all the relevant Intel teams.

For cost reasons, it is obvious that it would not have been good for Sandy Bridge to implement the full LRBni ISA. Nevertheless, it would have been very easy to implement a LRBni subset better than AVX.

Sandy Bridge should still have implemented only 256-bit operations, and the implementation of some operations, e.g. gather and scatter, could have been delayed for a later CPU generation.

However, other LRBni features should have been present from the beginning, e.g. the mask registers, because they influence the instruction encoding formats.

The mask registers would have required very little additional hardware resources (the actual hardware registers can reuse the 8087 registers), but they would have simplified AVX programming a lot, by removing the complicated code needed to correctly handle different data sizes and alignments.
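
To illustrate (my own sketch, not LRBni code): with mask registers, the epilogue for an arbitrary byte count collapses into a single masked iteration instead of a scalar cleanup loop.

  // Sketch (needs AVX-512BW and BMI2): the tail of an arbitrary-length byte
  // loop becomes one masked iteration rather than a scalar cleanup loop.
  #include <immintrin.h>
  #include <stddef.h>

  void add_one(unsigned char *p, size_t n) {
    const __m512i one = _mm512_set1_epi8(1);
    size_t i = 0;
    for (; i + 64 <= n; i += 64) {                        // full 64-byte blocks
      __m512i v = _mm512_loadu_si512(p + i);
      _mm512_storeu_si512(p + i, _mm512_add_epi8(v, one));
    }
    if (i < n) {                                          // 1..63 leftover bytes
      __mmask64 m = _bzhi_u64(~0ULL, (unsigned)(n - i));  // low (n - i) bits set
      __m512i v = _mm512_maskz_loadu_epi8(m, p + i);
      _mm512_mask_storeu_epi8(p + i, m, _mm512_add_epi8(v, one));
    }
  }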

The current CPUs with AVX-512 support would have been simpler, by not having to decode 2 completely distinct binary instruction formats, for AVX and for AVX-512, which made implementing AVX-512 in the small cores of Alder Lake difficult.


TIL. Thank you for the history lesson on AVX. Comparing to SVE and the RISC-V vector instructions, AVX feels so clunky, but I guess that was part of the "Intel tax."


Agreed. Though I feel that for the most part, size-agnostic vector instructions a la SVE would be the way to go.


:) I have actually heard it referred to as AVX3, we also adopted that name in Highway.


Please correct me if I'm wrong, but wouldn't we normally scale these things instead on a GPU?


The short answer is no, but the long answer is that this is a very complex tradeoff space. Going forward, we may see more of these types of tasks moving to GPU, but for the moment it is generally not a good choice.

The GPU is incredible at raw throughput, and this particular problem can actually be implemented fairly straightforwardly (it's a stream compaction, which in turn can be expressed in terms of prefix sum). However, where the GPU absolutely falls down is when you want to interleave CPU and GPU computations. To give round numbers, the roundtrip latency is on the order of 100µs, and even aside from that, the memcpy back and forth between host and device memory might actually be slower than just solving the problem on the CPU. So you only win when the strings are very large, again using round numbers about a megabyte.
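
To make the "stream compaction via prefix sum" mapping concrete, here is a sequential sketch (names are placeholders; a real GPU implementation would compute the prefix sum in parallel):

  // Sequential sketch of the mapping: each kept element's output index is the
  // exclusive prefix sum of the keep flags before it.
  #include <stddef.h>

  size_t compact(const char *in, size_t n, char *out) {
    size_t prefix = 0;                   // exclusive prefix sum of keep flags
    for (size_t i = 0; i < n; i++) {
      int keep = (in[i] != ' ');         // 1 if this byte survives
      if (keep) out[prefix] = in[i];     // scatter to its computed index
      prefix += keep;
    }
    return prefix;                       // total bytes written
  }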

Things change if you are able to pipeline a lot of useful computation on the GPU. This is an area of active research (including my own). Aaron Hsu has been doing groundbreaking work implementing an entire compiler on the GPU, and there's more recent work[1], implemented in Futhark, that suggests that this approach is promising.

I have a paper in the pipeline that includes an extraordinarily high performance (~12G elements/s) GPU implementation of the parentheses matching problem, which is the heart of parsing. If anyone would like to review a draft and provide comments, please add a comment to the GitHub issue[2] I'm using to track this. It's due very soon and I'm on a tight timeline to get all the measurements done, so actionable suggestions on how to improve the text would be most welcome.

[1]: https://theses.liacs.nl/pdf/2020-2021-VoetterRobin.pdf

[2]: https://github.com/raphlinus/raphlinus.github.io/issues/66#i...


> To give round numbers, the roundtrip latency is on the order of 100µs

I can't help but notice that, at least in my experience on Windows, this is the same order of magnitude as for inter-process communication on the local machine. Tangent: That latency was my nemesis as a Windows screen reader developer; the platform accessibility APIs weren't well designed to take it into account. Windows 11 finally has a good solution for this problem (yes, I helped implement that while I was at Microsoft).


I wonder if this applies to the same extent for an on-package GPU which shares the same physical memory as the CPU. I'd expect round trip times in that case to be minimal and the available processing power would probably be competitive with AVX512. I have been wondering if this is the reason for deprecating AVX512 on consumer processors - these are likely to have a GPU available.


Good question! There are two separate issues with putting the GPU in the same package as the CPU. One is the memcpy bandwidth issue, which is indeed entirely mitigated (assuming the app is smart enough to exploit this). But the round trip times seem more related to context switches. I have an M1 Max here, and just found ~200µs for a very simple dispatch (just clearing 16k of memory).

I personally believe it may be possible to reduce latency using techniques similar to io_uring, but it may not be simple. Likely a major reason for the roundtrips is so that a trusted process (part of the GPU driver) can validate inputs from untrusted user code before it's presented to the GPU hardware.


Yes, I think you are right about driver overhead, and although there should be ways to amortize that, it probably doesn't work very well for latency-sensitive problems! I expect that in most cases, if you have enough work to do to make using AVX-512 worthwhile, you can afford the round trip.


It's been a while, but IIRC the integrated GPUs are only L3-cache coherent. So while that greatly improves the memcpy problem, anything that would have fit in L1 and does a bunch of math may still be a better fit for AVX2 or AVX-512.


Maybe because of IO costs?



