Auto-vectorization for the masses (2011) (leiradel.github.io)
51 points by lelf on Feb 15, 2020 | 19 comments



If you're in C++ land, constexpr can be a godsend for auto vectorization.

Traditionally, you have an SoA (structure of arrays) of all your data. So if you have an array of triangles (not meshes), you might have three structs p1, p2, p3, each with three arrays x, y, z. Then you write a Fortran-esque loop over your triangles. Usually the vectorizer will vectorize that, but Fortran-esque code is obnoxious to write IMHO.

Instead, you have the same layout, but with constexpr operator[] methods. The inner structs return a constexpr-constructed vec3, the outer struct returns a constexpr-constructed triangle, and then you plug that into a constexpr processing function/method. GCC and Clang will vectorize this. And you can reuse normal linear algebra functions like dot, cross, etc. (so long as they're constexpr) and it doesn't look like Fortran.
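A minimal sketch of the idea (illustrative names, fixed-size arrays for brevity; whether a given compiler vectorizes the loop depends on flags):

    #include <cstddef>

    struct vec3 { float x, y, z; };

    constexpr vec3 operator-(vec3 a, vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    constexpr vec3 cross(vec3 a, vec3 b) {
        return {a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
    }

    // SoA storage: one array per scalar component.
    struct vec3_soa {
        float x[1024], y[1024], z[1024];
        constexpr vec3 operator[](std::size_t i) const { return {x[i], y[i], z[i]}; }
    };

    struct triangle { vec3 p1, p2, p3; };

    struct triangle_soa {
        vec3_soa p1, p2, p3;
        constexpr triangle operator[](std::size_t i) const { return {p1[i], p2[i], p3[i]}; }
    };

    // constexpr processing function; ordinary linear algebra, no Fortran look.
    constexpr vec3 face_normal(triangle t) { return cross(t.p2 - t.p1, t.p3 - t.p1); }

    // GCC and Clang will vectorize this loop at -O2/-O3.
    void normals(const triangle_soa& tris, vec3_soa& out) {
        for (std::size_t i = 0; i < 1024; ++i) {
            vec3 n = face_normal(tris[i]);
            out.x[i] = n.x; out.y[i] = n.y; out.z[i] = n.z;
        }
    }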

Matt Godbolt missed this in his "path tracing three ways" presentation. His data-oriented-design path tracer stored an array of vec3s, which can't be auto-vectorized (and isn't data-oriented design). I keep meaning to submit a PR to his repo, but the project has some weird dependencies and I can't get it to compile.

MSVC won't auto-vectorize either of the above. I'm not sure if it lacks a vectorizer, if its vectorizer is just insufficiently powerful, or if I'm using the wrong compiler flags.


Do you have a code example of the constexpr operator[] you mention? I am having a hard time following it.

MSVC has an auto-vectorizer; with /Qvec-report:2 it will tell you why it doesn't auto-vectorize a specific loop.

It's well documented, though I'm not particularly fond of it. Here is an example where the auto-vectorizer works: https://godbolt.org/z/crTLUY
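The shape of the thing, roughly (a minimal sketch; compare with the godbolt link above):

    // Compile with: cl /O2 /Qvec-report:2 example.cpp
    // /Qvec-report:2 prints a message per loop, e.g. "info C5001: loop vectorized",
    // or a reason code when vectorization fails.
    void add(float* a, const float* b, const float* c, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];  // simple enough for the auto-vectorizer
    }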

Here is the documentation: https://docs.microsoft.com/en-us/cpp/parallel/auto-paralleli...


I only barely skimmed the post and the follow-on posts, so this is less about the article itself and more about auto-vectorizers in general.

Autovectorization is the wrong approach for data-parallelization. You don’t want to rely on a brittle unpredictable code transformation for performance in this case. You want to bake it into the programming model.

ispc uses this approach and it results in performance predictability to a large degree. You can imagine other approaches as well, like explicitly data-parallel loops, or a declarative approach.

Most of these (and the GPU data-parallel models) rely to a very large extent on the programmer to manage data dependencies to ensure correctness.
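For a concrete flavour, OpenMP's `omp simd` is one such explicitly data-parallel annotation (a minimal sketch; the programmer, not the compiler, asserts that iterations are independent):

    // OpenMP 4.0+, compile with -fopenmp (or -fopenmp-simd).
    // The pragma declares the loop SIMD; correctness is on the programmer:
    // there must be no loop-carried dependencies.
    void saxpy(float* y, const float* x, float a, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }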


> You don’t want to rely on a brittle unpredictable code transformation for performance in this case.

That's somewhat true, but much of the unpredictability could be removed if compilers provided annotations saying "I expect this loop to be vectorized" where the compiler would be forced to report an error if it didn't manage to do it.


Such annotations exist in all major C/C++/Fortran compilers, although not all will error if it goes poorly. Most of the ones with an HPC focus do have some output mode where they will tell you how they optimized your loop for you.


> although not all will error if it goes poorly

So... these annotations don't actually exist? ;)

The whole point of such a feature is that the compiler will fail if it can't do what you want.


Well, it will not error out, but if you say

   #pragma clang loop vectorize(enable)
and the loop is then not vectorized, you will get a warning like

   warning: loop not vectorized: failed explicitly specified loop vectorization
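In context (a hypothetical loop):

    void scale(float* a, float s, int n) {
        // Requests vectorization; Clang emits the warning above if it fails.
        #pragma clang loop vectorize(enable)
        for (int i = 0; i < n; ++i)
            a[i] *= s;
    }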


And then what do you do when you upgrade compilers and the build starts failing?

This isn’t a hypothetical; it happens in real life. Your only option at that point is to roll back compilers and hope someone cares about the regression enough to fix it.

The point is the better model is to build the semantics into the language rather than relying on the whims of implementation.

Of course the semantic guarantees will likely be somewhat weak because of differences in ISAs, memory hierarchies, etc.


As a fallback, I have unit tests that check both correctness and the instruction count (and the number of memory allocations in hot loops). Whereas the CPU cycle count varies between runs, the instruction count does not.

Linux has good support for performance counters (see the sketch below); Windows requires a bit of work.

So in this case, the compiler update would show a regression in the test suite, which needs to be addressed as part of the compiler upgrade.

Of course, having this in the compiler would remove the need for unit tests.
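The counting part is a thin wrapper over perf_event_open(2) (a rough sketch; error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>

    // Count retired instructions for one call of fn.
    template <typename Fn>
    uint64_t count_instructions(Fn&& fn) {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        fn();                                  // code under test
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));       // stable across runs, unlike cycles
        close(fd);
        return count;
    }

A test then asserts the returned count stays at (or below) a stored baseline.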


Just for the record: you rely on performance tests to guarantee performance, nothing else.


This has been done by Intel: https://ispc.github.io


More about ispc, from Matt Pharr: https://pharr.org/matt/blog/2018/04/30/ispc-all.html - includes some discussion of Intel's corporate culture. Interesting throughout.


Great quote here:

"The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation—something you wouldn’t otherwise need to care about one bit."

The thing about this kind of work is that it's a nightmare to do, but the people who can do it wind up seeming like wizards and alchemists, so they won't necessarily say "this is a nightmare, never do this".


So... skimming this post and its successors, I didn't see any actual examples of generated vector code, especially not examples that GCC can't do although they are supposedly "easy". And no benchmarks. Did I miss anything or did this project really die before it got to vectorization (or anything more interesting than constant folding)?


Very interesting and useful to see.

And in an entirely different approach to vectorization for the masses: I do wish it were easier to access vectorization through BLAS, a library that is well supported across nearly all languages and massively optimized, but hard to install correctly.
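For what it's worth, the call side is easy once a BLAS is installed; it's the install/link step that hurts. A hypothetical CBLAS snippet (e.g. against OpenBLAS):

    #include <cblas.h>  // CBLAS interface, e.g. from OpenBLAS

    // C = A * B for n x n row-major matrices; the BLAS supplies the vectorization.
    void matmul(int n, const float* a, const float* b, float* c) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
    }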


The good news is that the Gonum team has been working on an optimized pure Go version of BLAS. It's at parity with Netlib BLAS for some of the important functions (GEMV, etc.).

Why is this good news? Go is a very easy to use language, and it cross-compiles easily, which makes it available across different platforms. To install, one simply does `go get gonum.org/v1/gonum`


Netlib BLAS is a very low bar [1], and not at all how one should go about writing a performance-portable BLAS. BLIS (https://github.com/flame/blis/) is a much better approach, and underlies vendor implementations on AMD (https://developer.amd.com/amd-aocl/blas-library/) and many embedded systems.

[1] GEMV is entirely limited by memory bandwidth, thus quite uninteresting from a vectorization standpoint. Maybe you meant GEMM?


Does anyone know if JVM, golang or .NET have autovectorization?


JIT compilers are in general not great at auto-vectorization because it is an expensive optimization and they do not have as much time as, e.g., C++ ahead-of-time compilers, so the JVM and the CLR only handle very simple cases. .NET has an explicit API to guarantee vectorization; Java does not, but there is a JEP: https://openjdk.java.net/jeps/338

I don’t know about Golang, but given the compiler speed it probably can’t afford the more advanced techniques.



