SIMD with Zig

Guzba · on May 2, 2023

Unless something has changed I really wish Zig was open to SIMD intrinsics. Imo, if you're manually writing SIMD, you are doing complex performance-oriented programming and you really do end up needing to know what the instruction set you're using gives you for tools. Eg arm64 has pretty cool interlacing/deinterlacing which would be goofy to re-create on amd64 and there is subtlety to multiplication and lots of other things. SIMD instructions also sidestep lots of compiler-ey stuff like strict aliasing and types don't matter, sizes and lane positions do. It is an interesting beast.

10000truths · on May 2, 2023

This has been mentioned before:

https://github.com/ziglang/zig/issues/7702

I don't think anyone disagrees about the need for intrinsics. In fact, I have actually taken a crack at implementing the AVX512 intrinsics into the Zig compiler as builtin functions on my personal fork of the repo. But it is a non-trivial task - there are over 450 distinct instructions across the entire AVX512 feature set, and over 100 for AVX2. And I'm only focusing on support for the LLVM backend, which does the heavy lifting in the codegen phase. Getting the register allocation and instruction scheduling correct for all the intrinsics in the self hosted backend would involve a lot more work.

p0nce · on May 2, 2023

What I do for D is implement the intrinsics following the semantics of the x86 instructions. Target x86, x86_64, arm32, arm64 with D compilers, that smoothes out the difference. It's a lot of work, and very similar to the simd-everywhere library that does it for C++. There is not so much impendence mismatch between x86 and arm. I wish more people would understand that you absolutely need such intrinsics for fast software, there is no way around that. You're not going to write your 4x-at-once pow function for each arch, also you won't find a better name for `_mm_madd_epi16`. (EDIT: I guess nowadays you could do that but with taking ARM semantics as source of truth).

https://github.com/AuburnSounds/intel-intrinsics

janwas · on May 2, 2023

Mostly agree, but there is actually a mismatch between madd_epi16 and Arm. Implementing Arm semantics or x86 on the other requires ~5 instructions, but if we generalize the definition to allow reordering (e.g. Highway's ReorderWidenMulAccumulate [1]), it's only 2 instructions.

1: https://github.com/google/highway/blob/master/g3doc/quick_re...

p0nce · on May 2, 2023

Indeed, and your comment led me to find additional issues with my port of _mm_madd_epi16.

I agree it would perhaps be possible to find better semantics for SIMD that kinda gloss over all the differences. That would be cleaner but require a lot of names. Well I suppose that's what Highway does, isn't it?

janwas · on May 4, 2023

:) Yes indeed! Always happy to discuss suggestions for new intrinsics via Github issues.

Guzba · on May 2, 2023

I have not been monitoring the SIMD situation in Zig so it is nice to hear that there is some general support for intrinsics even if they are not yet added.

Thanks for your effort working on an implementation too. I am aware how large these instruction sets have gotten so I can at certainly imagine at least some of the effort of the undertaking.

stephc_int13 · on May 2, 2023

Writting SIMD code with intrinsics is kind of ugly / non-portable and close to assembly language.

But it is useful and given the peculiarities of those SIMD instructions, I am not convinced that it will ever be sufficient to use "vectorized" types + a few hints and let the compiler do the work. That would be nice though.

I understand the hesitation of a language design team to replicate the full intrinsics mess, they are probably hoping to find something better.

In the mean time we call still fallback to C to write SIMD heavy code.

camel-cdr · on May 2, 2023

For anybody interested in this, here is an article discussing a very similar problem using arm neon intrinsics, also using the interleaved loads: https://branchfree.org/2019/04/01/fitting-my-head-through-th...

wyldfire · on May 2, 2023

Or even all target-specific intrinsics - not limited to SIMD ones.

kookamamie · on May 2, 2023

Nice. It would be even nicer, if Zig would support hot-dispatch for SIMD, i.e. the idea that the compiler can emit multiple versions of the same function/code for a number of vector widths simultaneously and the runtime selects the best (widest) option available for the hardware running the code - this is something ISPC does and is incredibly useful for targeting a range of architectures.

exDM69 · on May 2, 2023

The issue with runtime instruction set detection and dynamic dispatch is that it needs to be at a rather coarse level to be beneficial.

Take a simple 4-wide dot product for example, on x86_64 you'd have 3-4 different implementations (SSE2, SSE3 w/ hadd, SSE4.2 w/ dpps). But the function itself is just a few clock cycles, and calling it via function pointer will eliminate any gains and you might as well compute it with a scalar loop at that point.

This is further compounded by inhibiting compiler optimizations. You can't use the dot product function in higher level code expecting it to be inlined and further optimized (which is really the key to performance) if it's behind a dynamic dispatch.

A sufficiently smart compiler could maybe propagate the dynamic dispatch above, so that all the dot products get inlined but all code using dot product would get emitted multiple times with different dot product implementations, with the dynamic dispatch only at the top level. This has a slight risk of combinatorial explosion, but there really aren't that many combinations of supported ISAs in real hardware out there.

Another option you can use without any special compiler support is to take all your performance sensitive parts and pack them into a shared object/dll, compile multiple versions with different compiler options and choose the correct dll at runtime. Or even build the entire executable a few times and have some kind of launcher pick the correct one.

kps · on May 2, 2023

What you say about dynamic dispatch also applies to regular function calls, which is why I'm disappointed that Zig provides no visible distinction at the call site between direct and indirect calls (as K&R C did but ANSI C made optional).

I understand the desire for magic-indirection ergonomics; I just don't think the tradeoffs work out the same for code vs data.

hansvm · on May 2, 2023

If the function is comptime-known then you'll get a direct call, and you can mark it near the call site as being required to be comptime-known (e.g., by marking a function argument to another function with the "comptime" tag). To make it super explicit you can make a no-op function like `fn cknown(comptime f: anytype) @TypeOf(f) {return f;}` and then replace stuff like `foo(bar)` with `cknown(foo)(bar)`.

It's a tiny bit harder to force an indirect call if that's desired for some reason; I think you'd need to write a slightly longer never inlined helper function to strip the constness from the pointer. It's doable though, just not directly provided by the language.

kps · on May 2, 2023

What I had in mind is the programmers' view of the code: in K&R C, every function call was visibly either direct or indirect, `f()` or `(*f)()`. My main concern is not actually performance, but comprehensibility: an indirect call is a conditional branch, where the condition can be arbitrarily far away in space and time and is not statically determinable, so it should not be invisible.

I appreciate that this doesn't fit Zig's call syntax, which is unlikely to change, so the long-term best case is probably some LSP marking based on the callee type.

nyberg · on May 2, 2023

https://github.com/ziglang/zig/issues/6966 was discussed

AndyKelley · on May 2, 2023

This is called [function multi-versioning](https://github.com/ziglang/zig/issues/1018).

hansvm · on May 2, 2023

Zig the language doesn't do this right away, but:

1. IMO it'd be a bit more ziggish to branch at compile time. Zig has cross-compilation as a first-class feature, and you generally know the architecture you're targeting.

2. Whether you're branching at runtime or compile time, it'd be easy to build a vectored app with that behavior. The first thing that comes to mind is having code that's generic on the vector type (or bit width) and then just choosing which generic to instantiate in an inline for loop in your app's entrypoint.

3. A lot of the time you don't need that behavior. Select a vector width 8x too big, ensure your data is chunked into multiples of that, and rely on the compiler to break that down into a few instructions of the appropriate length. You can't effectively target a GPU that way, and it's not the same as hand-tuned assembly, but you get decent results on a vast array of problems.

burntsushi · on May 2, 2023

You can't branch at compile time. Well, you can, but that's less useful than branching at runtime. If you branch at compile time, then you need to produce binaries that have a minimum supported ISA extension. But if you branch at runtime, then you can produce maximally portable binaries that only use ISA extensions when they're available.

murkt · on May 2, 2023

Counterpoint for 1: packages on PyPI or NPM. When I’m publishing an extension for another language. I really have no idea about their architecture.

I’ve seen people just compile for the lowest common denominator. E.g. faiss-cpu on PyPI doesn’t even include AVX2 and is slow as heck.

p0nce · on May 2, 2023

Counterpoint: it's not all positive to do multiversioning.

- You pay for it in build times, more IR, more optimizations.

- a lot of software is for servers, in that case you mostly know the arch before-hand. Multi-versioning is best for consumer software.

- The lowest spec computers, in consumer software, need the optimizations the most. And they don't have latest instructions.

- SIMD has biggest gains from memory optimizations, and those do not necessarily require the latest and greatest instructions.

burntsushi · on May 2, 2023

Indeed. This is how ripgrep works. It's compiled for just plain `x86_64`, but it looks for whether things like AVX2 are enabled. And if so, uses vector algorithms for substring and multi-substring search. The nice thing about dealing with strings is that the "coarse" requirement is already somewhat natural to the domain.

But, this functionality is absolutely critical. It doesn't even have to be automatic. Just the ability to compile functions with certain ISA extensions enabled, and then only call them when the requisite CPU features are enabled is enough.

In a nutshell: https://github.com/BurntSushi/memchr/blob/8037d11b4357b0f07b...

barrkel · on May 2, 2023

Architecture yes, but specific feature set, including exact support for each incremental revision of the SIMD instruction and register set, not so much. If you're e.g. providing a docker image for a service, it's fine to build per architecture, but you don't want the hassle of producing a different one for SSE4, SSE4.1, SSE4.2, SSE4a and so on.

janwas · on May 2, 2023

Fair point, but we find it useful to have a few clusters: SSE4, AVX2 (Haswell), Skylake, newer AVX-512 Icelake/Zen4. That's still manageable, we're just compiling the SIMD parts 4 times and binary size impact is very modest (we're not replicating the entire binary).

anonymoushn · on May 2, 2023

regarding point 2, it's not that simple. A lot of 256-bit-wide goodies are in AVX-512 and a lot of different sets of extensions exist.

Laremere · on May 2, 2023

While Zig doesn't support this automatically, I think there's a path towards this thanks to comptime support. For example:

    const std = @import("std");

    fn f(comptime width: comptime_int, value: i32) i32 {
        const v = @splat(width, @as(i32, value));
        return @reduce(.Add, v);
    }

    pub fn main() !void {
        std.debug.print("1={d}\n", .{f(1, 5)});
        std.debug.print("2={d}\n", .{f(2, 5)});
        std.debug.print("4={d}\n", .{f(4, 5)});
    }

Here it creates 3 different versions of the function f at compile time, and then calls them each in succession. Running it prints:

  1=5
  2=10
  4=20

In practice you'd need to set up a dispatch that chooses the function based on the hardware, and ensure that zig/LLVM are actually using the full width of the vectors when compiling.

kristoff_it · on May 2, 2023

It's planned: https://github.com/ziglang/zig/issues/1018

ayende · on May 2, 2023

A better option here would be to use `_mm_movemask_epi8`, for the vector:

``` { false, false, false, false, true, false, false, true } ```

It will give an int with the bits: `0b0000_1001`

You can then do `@clz` to get the index of the first set bit.

That would save you the `@reduce` and `@splat`. I don't see how to access move mask from Zig, however.

Note that the standard library has `std.zig.firstIndexOfValue`, which does what you have in the post, basically. And there is `std.zig.firstTrue`, which does the the same thing as `@clz` in this case, but I don't know what kind of assembly it will generate.

e4m2 · on May 2, 2023

> I don't see how to access move mask from Zig, however.

Bools in a vector each only occupy one bit. You can @bitCast an N element bool vector directly to a uN integer and it will generate movemask on x86.

anonymoushn · on May 2, 2023

This must be new. Last year the compiler believed vectors of bools were both N bytes wide and N bits wide for different purposes and we ended up needing

  @ptrCast(*const u64, &my_bool_vec).*

vincnetas · on May 2, 2023

I'm missing benchmark. Of course it might be obvious to experienced developers that this is faster. But it would be nice to see how much faster it is.

larsnystrom · on May 2, 2023

The author mentions this at the end of the article:

> You need to benchmark, test and tweak in order to figure out what works best. Benchmarking is particularly important because, unless you're dealing with large data or very hot code, there's a good chance that effort won't yield measurable benefits.

How to benchmark something like this sounds like it could become a pretty good article in its own right.

lionkor · on May 2, 2023

> How to benchmark something like this

Usually, you can run something like perf or callgrind, with instruction level profiling, and you will get a good idea. Its not benchmarking in the traditional sense, but it has a similar result.

SIMD can make your code faster, or slower - profiling is a great way to tell exactly how much faster or slower.

unwind · on May 2, 2023

Very nice write-up, as someone feeling somewhat left behind by the Zig train it's always nice to get a refreshed feeling for the language.

I had to go look up "std.mem.indexOfSclar" in the source [1] since my, admittedly rather butt-hurt, feeling is that I can't guess Zig naming. It's supposed to be "indexOfScalar" which of course makes 100% sense.

[1]: https://github.com/ziglang/zig/blob/master/lib/std/mem.zig#L...

AndyKelley · on May 2, 2023

The Zig train will be making a stop at the 0.11.0 station on May 30th. That will be a great time to hop on board!

anonymoushn · on May 2, 2023

is async planned for 0.11 at the end of current_month?

inexcf · on May 2, 2023

A bit confused about what you are saying. Was "std.mem.indexOfSclar" a typo on the site that was fixed within 15 mins of you posting or did you just misread it? Is there something about Zig function naming that makes it weird?

latch · on May 2, 2023

Yes, a typo. I fixed the post.

kamikaz1k · on May 2, 2023

Thank you for the clear write up. This is very helpful for my understanding