Auto-vectorization for the masses (2011) (leiradel.github.io)
51 points by lelf on Feb 15, 2020 | 19 comments



If you're in C++ land, constexpr can be a godsend for auto vectorization.

Traditionally, you have an SoA (structure of arrays) of all your data. So if you have an array of triangles (not meshes), you might have three structs p1, p2, p3, each with three arrays x, y, z. Then you write a Fortran-esque loop over your triangles. Usually the vectorizer will vectorize that, but Fortran-esque code is obnoxious to write IMHO.

Instead, you have the same layout, but with constexpr operator[] methods. The inner structs return a constexpr-constructed vec3, the outer struct returns a constexpr-constructed triangle, and then you plug that into a constexpr processing function/method. GCC and Clang will vectorize this. And you can reuse normal linear algebra functions like dot, cross, etc. (so long as they're constexpr) and it doesn't look like Fortran.
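A minimal sketch of the idea (illustrative names, fixed-size arrays for brevity; whether a given compiler vectorizes the loop depends on flags):

    #include <cstddef>

    struct vec3 { float x, y, z; };

    constexpr vec3 operator-(vec3 a, vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    constexpr vec3 cross(vec3 a, vec3 b) {
        return {a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
    }

    // SoA storage: one array per scalar component.
    struct vec3_soa {
        float x[1024], y[1024], z[1024];
        constexpr vec3 operator[](std::size_t i) const { return {x[i], y[i], z[i]}; }
    };

    struct triangle { vec3 p1, p2, p3; };

    struct triangle_soa {
        vec3_soa p1, p2, p3;
        constexpr triangle operator[](std::size_t i) const { return {p1[i], p2[i], p3[i]}; }
    };

    // constexpr processing function; ordinary linear algebra, no Fortran look.
    constexpr vec3 face_normal(triangle t) { return cross(t.p2 - t.p1, t.p3 - t.p1); }

    // GCC and Clang will vectorize this loop at -O2/-O3.
    void normals(const triangle_soa& tris, vec3_soa& out) {
        for (std::size_t i = 0; i < 1024; ++i) {
            vec3 n = face_normal(tris[i]);
            out.x[i] = n.x; out.y[i] = n.y; out.z[i] = n.z;
        }
    }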

Matt Godbolt missed this in his "path tracing three ways" presentation. His data-oriented-design path tracer stored an array of vec3s, which can't be auto-vectorized (and isn't data-oriented design). I keep meaning to submit a PR to his repo, but the project has some weird dependencies and I can't get it to compile.

MSVC won't auto-vectorize either of the above. I'm not sure if it lacks a vectorizer, if its vectorizer is just insufficiently powerful, or if I'm using the wrong compiler flags.


Do you have a code example of the constexpr operator[] you mention? I am having a hard time following it.

MSVC has an auto-vectorizer; with /Qvec-report:2 it will tell you why it doesn't auto-vectorize a specific loop.

It's well documented, though I'm not particularly fond of it. Here is an example where the auto-vectorizer works: https://godbolt.org/z/crTLUY
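The shape of the thing, roughly (a minimal sketch; compare with the godbolt link above):

    // Compile with: cl /O2 /Qvec-report:2 example.cpp
    // /Qvec-report:2 prints a message per loop, e.g. "info C5001: loop vectorized",
    // or a reason code when vectorization fails.
    void add(float* a, const float* b, const float* c, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];  // simple enough for the auto-vectorizer
    }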

Here is the documentation: https://docs.microsoft.com/en-us/cpp/parallel/auto-paralleli...


I only barely skimmed the post and the follow-on posts, so this is less about the article itself and more about auto-vectorizers in general.

Autovectorization is the wrong approach for data-parallelization. You don’t want to rely on a brittle unpredictable code transformation for performance in this case. You want to bake it into the programming model.

ispc uses this approach and it results in performance predictability to a large degree. You can imagine other approaches as well, like explicitly data-parallel loops, or a declarative approach.

Most of these (and the GPU data-parallel models) rely to a very large extent on the programmer to manage data dependencies to ensure correctness.
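For a concrete flavour, OpenMP's `omp simd` is one such explicitly data-parallel annotation (a minimal sketch; the programmer, not the compiler, asserts that iterations are independent):

    // OpenMP 4.0+, compile with -fopenmp (or -fopenmp-simd).
    // The pragma declares the loop SIMD; correctness is on the programmer:
    // there must be no loop-carried dependencies.
    void saxpy(float* y, const float* x, float a, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }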


> You don’t want to rely on a brittle unpredictable code transformation for performance in this case.

That's somewhat true, but much of the unpredictability could be removed if compilers provided annotations saying "I expect this loop to be vectorized" where the compiler would be forced to report an error if it didn't manage to do it.


Such annotations exist in all major C/C++/Fortran compilers, although not all will error if it goes poorly. Most of the ones with an HPC focus do have some output mode where they will tell you how they optimized your loop for you.


> although not all will error if it goes poorly

So... these annotations don't actually exist? ;)

The whole point of such a feature is that the compiler will fail if it can't do what you want.


Well, it will not error out, but if you say

   #pragma clang loop vectorize(enable)
and the loop is then not vectorized, you will get a warning like

   warning: loop not vectorized: failed explicitly specified loop vectorization
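In context (a hypothetical loop):

    void scale(float* a, float s, int n) {
        // Requests vectorization; Clang emits the warning above if it fails.
        #pragma clang loop vectorize(enable)
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }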


And then what do you do when you upgrade compilers and the build starts failing?

This isn’t a hypothetical; it happens in real life. Your only option at that point is to roll back compilers and hope someone cares about the regression enough to fix it.

The point is the better model is to build the semantics into the language rather than relying on the whims of implementation.

Of course the semantic guarantees will likely be somewhat weak because of differences in ISAs, memory hierarchies, etc.


As a fallback, I have unit tests that check both correctness and the instruction count (and the number of memory allocations in hot loops). Whereas the CPU cycle count varies between runs, the instruction count does not.

Linux has good support for performance counters (see the sketch below); Windows requires a bit of work.

So in this case, the compiler update would show a regression in the test suite, which needs to be addressed as part of the compiler upgrade.

Of course, having this in the compiler would remove the need for unit tests.
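The counting part is a thin wrapper over perf_event_open(2) (a rough sketch; error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>

    // Count retired instructions for one call of fn.
    template <typename Fn>
    uint64_t count_instructions(Fn&& fn) {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        fn();                                  // code under test
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));       // stable across runs, unlike cycles
        close(fd);
        return count;
    }

A test then asserts the returned count stays at (or below) a stored baseline.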


Just for the record: you rely on performance tests to guarantee performance, nothing else.


This has been done by Intel: https://ispc.github.io


More about ispc, from Matt Pharr: https://pharr.org/matt/blog/2018/04/30/ispc-all.html - includes some discussion of Intel's corporate culture. Interesting throughout.


Great quote here:

"The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit."

The thing about this kind of work is that it's a nightmare to do, but the people who can do it wind up seeming like wizards and alchemists, so they won't necessarily say "this is a nightmare, never do this".


So... skimming this post and its successors, I didn't see any actual examples of generated vector code, especially not examples that GCC can't do although they are supposedly "easy". And no benchmarks. Did I miss anything or did this project really die before it got to vectorization (or anything more interesting than constant folding)?


Very interesting and useful to see.

And in an entirely different approach to vectorization for the masses: I do wish it were easier to access vectorization through BLAS, a library that is well supported across nearly all languages and massively optimized, but hard to install correctly.
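For what it's worth, the call side is easy once a BLAS is installed; it's the install/link step that hurts. A hypothetical CBLAS snippet (e.g. against OpenBLAS):

    #include <cblas.h>  // CBLAS interface, e.g. from OpenBLAS

    // C = A * B for n x n row-major matrices; the BLAS supplies the vectorization.
    void matmul(int n, const float* a, const float* b, float* c) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
    }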


The good news is that the Gonum team has been working on an optimized pure Go version of BLAS. It's at parity with Netlib BLAS for some of the important functions (GEMV, etc.).

Why is this good news? Go is a very easy to use language, and it cross-compiles easily, which makes it available across different platforms. To install, one simply does `go get gonum.org/v1/gonum`


Netlib BLAS is a very low bar [1], and not at all how one should go about writing a performance-portable BLAS. BLIS (https://github.com/flame/blis/) is a much better approach, and underlies vendor implementations on AMD (https://developer.amd.com/amd-aocl/blas-library/) and many embedded systems.

[1] GEMV is entirely limited by memory bandwidth, thus quite uninteresting from a vectorization standpoint. Maybe you meant GEMM?


Does anyone know if JVM, golang or .NET have autovectorization?


JIT compilers are in general not great at auto-vectorization because it is an expensive optimization and they do not have as much time as, e.g., C++ ahead-of-time compilers, so the JVM and the CLR only handle very simple cases. .NET has an explicit API to guarantee vectorization; Java does not, but there is a JEP: https://openjdk.java.net/jeps/338

I don’t know about Golang, but given the compiler speed it probably can’t afford the more advanced techniques.



