Gathering Intel on Intel AVX-512 Transitions (travisdowns.github.io)
126 points by matt_d on Jan 17, 2020 | 48 comments



Author here, happy for any feedback or to answer any questions.


No questions, but really enjoyed your blog, and your comments on this website over the past few years.


Just want to say Thank You. Do you know anything about AMD's side of things?

>Note: For the really short version, you can skip to the summary, but then what will you do for the rest of the day?

Spending the rest of the day on HN. /s


I don't know specifically, e.g. if there are any such pauses on Zen. Also, Zen doesn't yet support AVX-512 so a big possible source of variation is moot.

I don't know if any AMD chip has ever had different turbo speeds for any ISA. It should be noted that even without that, any chip can still run slower with heavier instructions because they hit some other limit: thermal, TDP, current, etc.

AMD has used an interesting "adaptive clocking" scheme since Steamroller, and apparently this is still in effect in Zen:

https://www.realworldtech.com/steamroller-clocking/

This handles the same type of voltage droop worst case that Intel apparently handles by dispatch throttling. It would be interesting to test it, since the clock elongation should be visible when you measure instruction timing relative to a clock not affected by the adaptation.
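
For anyone who wants to poke at that: a minimal sketch of the measurement idea, timing a fixed block of heavy FMA work against the TSC, which ticks at a constant rate regardless of any core-clock adaptation. If the core clock is being stretched, samples taken right after switching from light to heavy work should cost more TSC ticks than steady-state ones. The loop body and iteration counts below are arbitrary; this is just the shape of the experiment, not something I've validated on Zen.

    // Sketch only: compare heavy-instruction timing against the invariant TSC.
    // Build with something like: g++ -O2 -mavx2 -mfma droop.cpp
    #include <immintrin.h>
    #include <x86intrin.h>   // __rdtsc
    #include <cstdio>

    int main() {
        __m256d acc = _mm256_set1_pd(1.0);
        const __m256d mul = _mm256_set1_pd(1.0000001);
        for (int sample = 0; sample < 20; ++sample) {
            unsigned long long start = __rdtsc();
            for (int i = 0; i < 10000; ++i)
                acc = _mm256_fmadd_pd(acc, mul, acc);   // dependent heavy FP work
            unsigned long long ticks = __rdtsc() - start;
            printf("sample %2d: %llu TSC ticks\n", sample, ticks);
        }
        volatile double sink = _mm256_cvtsd_f64(acc);   // keep the loop from being optimized away
        (void)sink;
        return 0;
    }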


On the desktop chips (x299) it's easy to adjust all the clock speeds in the bios.

If the workloads I'm most interested in are all avx512-heavy (why I bought x299 instead of threadripper), do you think there'd be a reason to set the clock speeds to be equal, regardless of ISA? That is, if I currently have 4.6/4.3/4.1 GHz no-avx/avx(2)/avx512, when might it be worth setting all three of these to 4.1 GHz?

I suspect "never" is the answer?

I have the impression that Zen's clocking algorithm is much smarter than Intel's heuristic approach.


I don't think it makes sense, because the maximum penalty due to the transitions is fairly low (~30 us out of 650 us, and only under pretty much a malicious load that transitions at exactly the right points), and mostly you want the higher frequencies when you can get them: they quickly overwhelm the small transition periods.

Also, someone indicated to me in private correspondence that even when the frequencies are manually set so no transition takes place, the throttling periods may still take place (which makes sense since the required voltage may still be higher).


It’s a real shame avx-512 has so many eccentricities when it’s a much nicer ISA than anything before it (in x86 land). I would almost prefer a more predictable, high-latency decomposition into 4x128 wide uops over what we have now.


If I could choose, I would like everything to run at the max turbo frequency all the time, yeah.

Still, and despite writing this post which will make a lot of people express something similar to what you wrote, I consider myself an AVX-512 fan, not the other way around. It's the most important ISA extension since, well, I'm not sure: a long time (probably AVX and AVX2 combined would have a similar impact).

It introduces a whole ton of stuff that is very powerful: full-width shuffles down to byte granularity with awesome performance, masking of every operation, often free, compress and expand operations, and a longer list at [1]. That's only from an integer angle too (what I care about).
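
For the curious, here's roughly what a couple of those features look like with plain AVX-512F intrinsics (masking and compress); the data and the predicate are made up purely for illustration.

    // Illustrative only. Build with something like: g++ -O2 -mavx512f
    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(64) int in[16]  = {3, -1, 7, 0, 9, -5, 2, 8, -2, 4, 6, -7, 1, 5, -3, 10};
        alignas(64) int out[16] = {0};

        __m512i v         = _mm512_load_epi32(in);
        __m512i threshold = _mm512_set1_epi32(0);

        // The compare produces a mask register (one bit per lane), not a 0/-1 vector.
        __mmask16 positive = _mm512_cmpgt_epi32_mask(v, threshold);

        // Masked add: lanes where the mask bit is 0 are zeroed (the "maskz" form).
        __m512i doubled = _mm512_maskz_add_epi32(positive, v, v);

        // Compress: pack only the selected lanes contiguously at the front.
        __m512i packed = _mm512_maskz_compress_epi32(positive, v);

        _mm512_store_epi32(out, packed);
        for (int i = 0; i < 16; ++i) printf("%d ", out[i]);
        printf("\n");
        (void)doubled;
        return 0;
    }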

Yeah, it's taken AVX-512 a while to get traction (the fact that generation after generation of new chips have just been Skylake client derivatives with no AVX-512 hasn't helped), but I hope we are reaching a turning point.

These transitions are something you have to deal with if you want max performance, and I think we'll come up with better models for how to make the "global" decision of whether you should be using AVX-512.

---

[1] https://branchfree.org/2019/05/29/why-ice-lake-is-important-...


The never-ending Skylake is/was a real problem. Intel was slowly adding features in a manner where it made sense to target the last n generations, but then all that came to a perpetual stop, and suddenly we have this new extension that you can only really use on the very latest and most expensive chips, with virtually no backwards compatibility.

The instructions are sufficiently different from AVX2 that any appropriate use is not as simple as sticking it behind a gate and using a smaller block size, it basically requires a completely separate (re)write to properly take advantage of.


> The instructions are sufficiently different from AVX2 that any appropriate use is not as simple as sticking it behind a gate and using a smaller block size, it basically requires a completely separate (re)write to properly take advantage of.

I'd say yeah, you often need a rewrite of the core loop to take full advantage, but you can still more or less write AVX-style code in AVX-512 if you want, and take advantage of the width increase.

The main difference I think for most code is the way the comparison operators compare into a mask register. It would have been nice if they had just extended the existing compare into SIMD reg (0/-1 result) instructions too, to ease porting.
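
To make that concrete, here's roughly how a compare-and-blend idiom differs between the two (the surrounding code is made up for illustration): in AVX2 the compare result is a 0/-1 vector that feeds blendv directly, while in AVX-512 it's a k-register consumed by the masked forms.

    #include <immintrin.h>

    // AVX2: select b where a == b, else a.
    __m256i select_eq_avx2(__m256i a, __m256i b) {
        __m256i eq = _mm256_cmpeq_epi32(a, b);          // 0/-1 per lane
        return _mm256_blendv_epi8(a, b, eq);            // vector used as the mask
    }

    // AVX-512: the compare writes a __mmask16, and the blend takes the mask directly.
    __m512i select_eq_avx512(__m512i a, __m512i b) {
        __mmask16 eq = _mm512_cmpeq_epi32_mask(a, b);   // one bit per lane
        return _mm512_mask_blend_epi32(eq, a, b);       // picks b where the bit is set
    }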


> it basically requires a completely separate (re)write to properly take advantage of.

Why? At a higher level of abstraction, you can dispatch simd instructions at the max width available. At least, that's how I work with vectorized code. Still see gains on avx512.
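
For what it's worth, here's one way that can look at compile time (runtime dispatch would instead select among separately compiled variants; the function and constants below are purely illustrative):

    #include <immintrin.h>
    #include <cstddef>

    // Pick the widest vector the build target supports and write the loop against it.
    #if defined(__AVX512F__)
        constexpr std::size_t kLanes = 16;
    #elif defined(__AVX2__)
        constexpr std::size_t kLanes = 8;
    #else
        constexpr std::size_t kLanes = 4;   // assume an SSE2 baseline on x86-64
    #endif

    void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
        std::size_t i = 0;
    #if defined(__AVX512F__)
        for (; i + kLanes <= n; i += kLanes)
            _mm512_storeu_ps(out + i, _mm512_add_ps(_mm512_loadu_ps(a + i),
                                                    _mm512_loadu_ps(b + i)));
    #elif defined(__AVX2__)
        for (; i + kLanes <= n; i += kLanes)
            _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_loadu_ps(a + i),
                                                    _mm256_loadu_ps(b + i)));
    #else
        for (; i + kLanes <= n; i += kLanes)
            _mm_storeu_ps(out + i, _mm_add_ps(_mm_loadu_ps(a + i),
                                              _mm_loadu_ps(b + i)));
    #endif
        for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
    }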


> It's the most important ISA extension since, well, I'm not sure: a long time (probably AVX and AVX/2 combined would have a similar impact).

IMHO the most important ISA extension since AMD64 was AES-NI, which moved a major consumer of CPU time into the also-rans.


Intel really botched the launch of AVX-512. By not making it available on client, and even on server only having it available on select chips, they ensured that no one coded for / optimized for it. If you are proposing new instructions, make sure they are widely available.


Additionally, it looks like each subset of AVX-512 past the foundation is an optional feature and needs to be tested for. With only a few exceptions (usually as a result of bickering between Intel and AMD), a previous ISA extension implied you had everything that came before it too.

In practice this means you could pick a few entire subsets: Legacy+SSE2 is always there for 64 bit, maybe test up to SSE4.2 for another subset. Maybe switch everything from REX to VEX if it has AVX and AVX2. That's effectively three back ends, which is manageable. With AVX-512, everything beyond AVX-512F is a la carte, and that adds unwanted complexity for instruction selection in a compiler.

Just look at all the separate AVX feature flags from CPUID:

https://en.wikipedia.org/wiki/CPUID#EAX=7,_ECX=0:_Extended_F...

Between the performance problems and complexity, I think it'll be a while until AVX-512 is attractive.


Although there are many AVX-512 features, actual implementations still break down into only a few subsets.

If you ignore the EOL Xeon Phi stuff (with different and incompatible ISAs), it was proceeding in a superset approach, but the Cascade Lake and Cooper Lake AI extensions kind of messed that up.

Good way to visualize it:

https://github.com/InstLatx64/InstLatx64/raw/master/VennDiag...

Basically you have the SKX subset and the ICL subset as the big important ones in the near future, unless you care about AI, in which case Cascade Lake is like SKX + VNNI and Cooper Lake is additionally + BF16.

So in practice you'll target one of those subsets, nothing more fine-grained than that. Yes, you should still test for all the required extensions, but that part is easy.


I'm not sure if you were trying to make the case for simplicity or against it :)


Yeah, I know :).

In some ways the explanation of how to simplify your view of it just makes it sound even worse.

A big question is if/when AMD starts adopting AVX-512, will they choose the same subsets that Intel did, or introduce new ones?


That's a great picture (thank you), but it really looks complicated to me. :-)


It's complicated yes, but it doesn't approach the level of thinking about all 20 AVX-512 extensions individually and testing for them.

Basically, on the ground, it is about as complicated as, say, the few 128- and 256-bit extensions: there are only a few sets of functionality you have to care about (2 if you don't care about AI).

It's just that within those groups Intel decided to be very fine grained about the functionality, dividing the instructions among many flags (still, in a logical way).

So instead of the new generation just supporting SSE2, say, it supports 6 new flavors of AVX-512. My claim then is that this doesn't matter much: you can just think of all of those 6 as a unit, AVX-512-ICELAKE or whatever, because there are no CPUs that support a proper subset and there probably never will be (if there is, that's fine - you'll evaluate then if it makes sense for a new codepath).

Maybe I'm not making a good case that this is the same :).


Nah, you're making a good case. I think what you're trying to say is to test the CPUID bits for a consistent subset of AVX512 flags and treat it all as one or two clumps. There was always going to be a fallback path for older subsets (SSE2-SSE4.2, AVX1-AVX2) anyways, so punt if it doesn't have all the features in a clump.


Exactly.

I wouldn't start with the CPUID testing though.

It's more like "Why do you care about ISA features"? Usually because you are trying to choose how many code paths to support for runtime ISA-based dispatching, or how many binaries to build when you build multiple versions of a binary (which may include compile-time dispatching).

So for that planning process, you only care about a few clumps. Then your CPUID testing strategy should still test all the required extensions, for completeness, and fall back as usual. Or something like that.
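
In code, a purely illustrative version of that clump test using GCC/Clang's __builtin_cpu_supports (on a reasonably recent compiler) might look like the sketch below; the groupings just mirror the subsets discussed above, not any official spec.

    #include <cstdio>

    enum class SimdLevel { Avx2, Skx, Icl };

    SimdLevel pick_simd_level() {
        __builtin_cpu_init();   // only strictly needed before main(), harmless here
        // "Ice Lake clump": F/VL/BW/DQ plus the byte-granular shuffle extensions.
        if (__builtin_cpu_supports("avx512f")    && __builtin_cpu_supports("avx512vl") &&
            __builtin_cpu_supports("avx512bw")   && __builtin_cpu_supports("avx512dq") &&
            __builtin_cpu_supports("avx512vbmi") && __builtin_cpu_supports("avx512vbmi2"))
            return SimdLevel::Icl;
        // "Skylake-X clump": F/CD/VL/BW/DQ.
        if (__builtin_cpu_supports("avx512f")  && __builtin_cpu_supports("avx512cd") &&
            __builtin_cpu_supports("avx512vl") && __builtin_cpu_supports("avx512bw") &&
            __builtin_cpu_supports("avx512dq"))
            return SimdLevel::Skx;
        return SimdLevel::Avx2;   // assume an AVX2 fallback path exists
    }

    int main() {
        printf("selected code path: %d\n", static_cast<int>(pick_simd_level()));
        return 0;
    }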


This is one of the cases where JITs win over AOT compilers.

Intel has done the work in OpenJDK to take advantage of AVX when present.


As a developer who's micro-optimized some genetic software, I can confirm that I'd considered AVX-512 but decided against it after learning that the hardware being purchased by the company would not have the full AVX-512 feature set desired and it was simpler/easier to just write it in AVX2. Getting the software to also work on older/cheaper hardware made the business owner happy too.


If you really care about performance, you could always compile on the target machine directly via -xHost [0] or whatever the flag is on your compiler.

[0] https://software.intel.com/en-us/cpp-compiler-developer-guid...


In my case, it's GCC. The option is `-march=native -mtune=native`.

The trick though is _describing_ the scalar operations in the language and getting the compiler to understand how to efficiently vectorize them. I couldn't get GCC to do it at the time (GCC-5 if I recall, though we deployed with GCC-6); maybe it was just inexperience on my part. But I ended up writing the intrinsics by hand. To be quite honest it was my first dive into SIMD and I thought it was rather fun to do.
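
For contrast, this is the kind of trivial scalar description GCC will usually vectorize on its own at -O3 -march=native: no aliasing thanks to __restrict, a countable loop, no cross-iteration dependence. Real kernels with gathers, branches, or awkward reductions are where it tends to give up, and that's where the hand-written intrinsics started paying off for me. (The function below is just a stand-in, not the actual kernel.)

    // Simple enough that the auto-vectorizer generally handles it.
    void saxpy(float* __restrict y, const float* __restrict x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];   // maps straight onto fused multiply-adds
    }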


-march=native implies -mtune=native.

You can say -march=native -mtune=sandybridge, but there would be no point.

You can say -march=sandybridge -mtune=native, usefully. It might go slower on a real sandybridge than if tuned for it, but would still work, and would go as fast as the smaller instruction mix allows on your build machine.


I know this. I don't care. I use `-march=native -mtune=native` specifically to point other developers on the team to the two relevant compiler options. And if they don't look, nothing's lost.


Which ISA did it have?

Even the minimal AVX-512 ISA on any mainstream CPU (SKX) is pretty much a strict superset of AVX2.


> Which ISA did it have?

Business side was considering whether to buy Skylake or Broadwell.


But what about instructions like vpermi2b, which does 64 parallel lookups into a 128-byte table? The AVX shuffle instructions were hamstrung by the split into two 16-byte halves...
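
Roughly what that looks like with intrinsics (purely illustrative, assuming AVX512-VBMI is available):

    #include <immintrin.h>

    // vpermi2b: the two table registers together form a 128-byte lookup table,
    // and all 64 byte lanes of the index vector are looked up at once.
    // Bit 6 of each index byte selects table_lo vs table_hi; bits 5:0 pick the byte.
    __m512i lookup128(__m512i indices, __m512i table_lo, __m512i table_hi) {
        return _mm512_permutex2var_epi8(table_lo, indices, table_hi);
    }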


Yeah, the shuffles are awesome in AVX-512.


> it’s a much nicer ISA than anything before it (in x86 land).

It's coming from Intel Knights Landing. Previous massively multicore Intel offerings used a Pentium core and had a different 512-bit SIMD instruction set; KL used a Silvermont Atom core and introduced AVX-512, and parts of it were implemented in the Skylake Purley platform in a desperate attempt to give KL more software. With little to no actual adoption, and the desperate situation of being stuck on 14nm while overselling those capabilities because they expected most CPUs to move over to 10nm, it was no surprise the Knights... chips got the axe. But the insanity of AVX-512 having an entire menu of possible instruction subsets stayed.


Note that the 512-bit Larrabee instructions (on Knight's Ferry/Corner) had different encodings (and IIRC different encodings between KNF and KNC), but it was essentially the same instruction set. The few differences there were between LRBNI and AVX-512 had (AFAIK) almost nothing to do with the move from the old Pentium in-order core to Silvermont.

I think it's also safe to assume that, given the lead time to design an ISA and integrate it into an architecture (many years), the merging of AVX-512 into Skylake wasn't done "in a desperate attempt to give KL more software".

These slides from Tom Forsyth provide lots of interesting background on the evolution of the instructions: http://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%...


> I would almost prefer a more predictable, high-latency decomposition into 4x128 wide uops over what we have now.

AVX512-VL gives the programmer AVX-512 functionality at 128/256-bit widths, if that is believed to be more beneficial than taking the frequency hit.
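
For example (illustrative only; build with something like -mavx512f -mavx512vl), the compare-into-mask and masked forms are available on 256-bit vectors, so the new functionality can be used without ever issuing 512-bit instructions:

    #include <immintrin.h>

    // Zero out the negative lanes of a 256-bit vector using AVX512-VL mask forms.
    __m256i clamp_negatives_to_zero(__m256i v) {
        __mmask8 negative = _mm256_cmplt_epi32_mask(v, _mm256_setzero_si256());
        return _mm256_mask_mov_epi32(v, negative, _mm256_setzero_si256());
    }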


What is this intended to be used for? This [1] article mentions compression, ML, scientific computing. Wouldn't people rather use GPU for those workloads though?

[1] https://devblogs.microsoft.com/cppblog/microsoft-visual-stud...


Offloading to the GPU has a significant cost that needs to be amortized. DNNs work well because very little is moved over the PCIe bottleneck, and inputs can be buffered. Take modern compression, say H.265: it has complex control flow combined with highly vectorizable work. I'm unsure where the threshold lies today, but you need a significant amount of work before offloading to and reading back from the GPU becomes interesting on a beefy Xeon.


Nice work as usual BeeOnRope

Now if only I could actually use avx512 in a desktop, been waiting what feels like 5+ years..


The latest Microsoft Visual C++ has an option to generate AVX-512 code https://docs.microsoft.com/en-us/cpp/build/reference/arch-x6...


This site seems to lock up my phone, but desktop is fine. Anyone else have that issue?


There are several large SVGs, I wonder if that is the issue.

Can I ask what type of phone you have? Are you willing to help me diagnose the issue?


I have a really old Nexus 6p. I'm definitely willing to help, but I bet it's just part of having a really old phone.


Well, that happens to be a phone I have too, although the battery is pretty dead so it's hard to use. It's still a pretty powerful phone though; weird that it would have issues rendering the page.

I was able to reproduce freezing and hanging even on my Pixel 3, so I can probably look into it myself. Again, I have to guess the large SVGs are to blame.


Awesome. Thanks!


As a former owner of that model of device, whose phone bricked itself irreparably after updating to Oreo, I'm surprised it hasn't been destroyed by the bootloop of death. I'll never buy a Google hardware product again after the runaround they gave me.


I managed to avoid that problem, although I read about it plenty. It was my first Android phone, but I liked it except for the battery life issues, which were, frankly, brutal: it would regularly shut down at 50% battery. To replace the battery you are pretty much guaranteed to break parts of the camera assembly (the cover).

I'm not sure how blame should be apportioned between Google and the manufacturer though.


I bought the device from the Google store, had it replaced once (under warranty due to a different failure) by Google, and the box and device say "Google Nexus 6p." The update that bricked the phone (literally on the first startup after installing the update) came to me from Google.

So, I blame Google entirely. If they chose to contract out their hardware work to a poor manufacturer, that's their problem. My business is with Google, not that manufacturer.

Google didn't see it that way. They told me to contact the nearest Huawei service center.

That service center was across the Pacific Ocean, in China.

If my Macbook Pro fails to start up immediately after an Apple software update, I don't expect Apple to tell me "It's a problem with Samsung's memory, so we can't help you. Call Samsung in Korea." I expect them to take responsibility for their software update having rendered my device useless.

https://issuetracker.google.com/issues/37130791


Would disabling “Intel Turbo Boost Technology” be advised to avoid this?

If it never uses turbo, it should not suffer the transitions?


No, the transitions occur even without turbo. In fact, the chip I tested on has no turbo at all, just 3.2 GHz nominal speed.

Also, disabling turbo would probably be a massive over-reaction, unless you really care about 99.9th percentile latency or something: the impact of these transitions is small (at worst a few %), while the benefit of turbo is large: 10s of %.



