I have some experience with this, i.e., ensuring LLVM optimizes and codegens the "best"! I have been working to generate target-independent "kernels" for the Rav1e AV1 encoder and have had to do a lot of unidiomatic things to get LLVM to generate machine code similar in quality to hand-written ASM. Granted, this is on integers and not floats, but the same principles should apply.
What I've found is that you need to ignore most of Rust: use/load raw pointers, don't use slices, unroll manually, vectorize manually, and check preconditions manually. You'll still get the amazing type system, but the code will have to be more C-like than Rust-like.
* raw pointers: LLVM is pretty good at optimizing C code; Rust-specific optimization needs some work. (edit: I assumed arrays here, so you'll need the pointer for offsets; references are still okay. You'd also use the pointers for iterating instead of the usual slice iteration patterns)
* no slices: index checking is expensive, not to the CPU (the CPU rarely mispredicts the check branches) but to the optimizer. I've found these checks are mostly left un-elided, even after inlining.
* no slices: slice indexing uses overflow checking. For Rav1e's case, the block/plane sizes mean that doing the index calculation using `u32` will never overflow, so calculating the offsets using u32 is fine (I'll have to switch to using a pseudo u24 integer for GPUs though, because u32 is still expensive on them).
* unroll manually: LLVM would probably do more of this with profiling info, but I've never found it (this is subjective!) to do any unrolling without it. Maybe if all the other items here are also done...
* vectorize manually: Similar to unrolling. I've observed only limited automatic vectorization.
* And to get safety back: check, check, and check before calling the fast kernel! I.e., wrap the kernel function with one that does all the checks elided in the kernel (a sketch of this pattern is below).
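Something like this, as a minimal sketch of that pattern (the kernel, block size, and element type here are hypothetical, not the actual rav1e code):

```
/// Unchecked inner kernel: walks a 4x4 block with raw pointers and row strides.
///
/// # Safety
/// Both pointers must be valid for 4 rows of 4 elements at the given strides.
unsafe fn sad_4x4_kernel(mut a: *const u8, a_stride: usize,
                         mut b: *const u8, b_stride: usize) -> u32 {
    let mut sum = 0u32;
    for _ in 0..4 {
        for x in 0..4 {
            let d = *a.add(x) as i32 - *b.add(x) as i32;
            sum += d.unsigned_abs();
        }
        a = a.add(a_stride);
        b = b.add(b_stride);
    }
    sum
}

/// Safe wrapper: do all the checking once up front, then call the fast kernel.
fn sad_4x4(a: &[u8], a_stride: usize, b: &[u8], b_stride: usize) -> u32 {
    // The last (fourth) row starts at 3 * stride and needs 4 elements.
    assert!(a.len() >= 3 * a_stride + 4);
    assert!(b.len() >= 3 * b_stride + 4);
    unsafe { sad_4x4_kernel(a.as_ptr(), a_stride, b.as_ptr(), b_stride) }
}
```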
Be that as it may, I wasn't talking about C++. Also, the integer operation optimizations require no such flags on either side, and yet the resulting machine code was still very poor for Rust.
They do. The issue may be that references don't support pointer arithmetic – that is, given `x: &u32` you cannot get `x[1]`. The normal approach is to use a reference to a slice, `x: &[u32]`, and the default bounds-checked indexing. Rust does let you do unchecked indexing on slices (with `unsafe`), but it may be more efficient to avoid indices entirely in favor of pointer arithmetic. LLVM can often optimize index calculations into pointer arithmetic, but not always.
Edit: Although, on rereading the above post, I see diamondlovesyou did mention indices, so... not really sure what's going on.
True. I should have explained that better. If you're accessing an array, you'll have to do the pointer offsetting yourself (otherwise you're using a slice and all the checking that entails). Thus, you can't use a reference type, because reference types don't have `<*const T>::add` like pointers do (also, `&T as usize` is invalid; you have to go through a pointer type first). I suppose I assumed you'd be accessing an array/slice; references otherwise are fine.
Rust references should in general optimize better because they give stronger aliasing guarantees.
Even for slices, using get_unchecked(1..) to get a smaller subslice without bounds checking might be better than pointer arithmetic as long as the slice lengths get optimized away (i.e. they are never used and never passed to non-inlined functions).
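For example, a minimal sketch (the function and names are made up) with one up-front length check and unchecked subslicing instead of pointer arithmetic:

```
fn sum_first_n(data: &[u32], n: usize) -> u32 {
    assert!(data.len() >= n); // single check up front
    let mut rest = data;
    let mut sum = 0u32;
    for _ in 0..n {
        // Safety: the assert guarantees at least `n` elements and we consume
        // exactly one per iteration, so both accesses stay in bounds.
        unsafe {
            sum = sum.wrapping_add(*rest.get_unchecked(0));
            rest = rest.get_unchecked(1..);
        }
    }
    sum
}
```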
> Rust references should in general optimize better because they give stronger aliasing guarantees.
AFAIK this does not work atm due to a codegen bug in LLVM (which can also affect code using restrict in C in some cases). This bug will get fixed one day, but most likely another bug will be revealed at that point… This part of LLVM was never really exercised as much by C as it is by Rust, so they keep finding bugs in it. Hopefully it will be fine in the long run, but I'm not holding my breath.
You can just pass a reference/mut ref to the first element of the slice. This is actually how the generated kernels in my Rav1e PR do it, just for the aliasing reason you mentioned.
You can use iterators to avoid unnecessary bounds checking on every element in a slice and you can still get an index to the value. Something like this:
```
for (i, val) in my_slice.iter().enumerate() {
    let x = *val + 9999;
}
```
Not really in this case; Rav1e runs over blocks of image planes (like, say, a 16x16 block of a specific 720p color channel), so there is no contiguous slice to iterate over.
Maybe you can use chunks_exact() and chunks_exact_mut(). The _exact versions allow lifting the bounds checks out of the loop and gave me some great performance boosts in image processing code.
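A minimal sketch of the idea (the buffer layout and the constant 4 are arbitrary); the fixed chunk length is what lets the compiler drop the per-element checks inside the loop:

```
fn brighten_rgba(pixels: &mut [u8]) {
    // chunks_exact_mut(4) yields only full 4-byte chunks, so the indices
    // below are statically known to be in range; the remainder is skipped.
    for px in pixels.chunks_exact_mut(4) {
        px[0] = px[0].saturating_add(10);
        px[1] = px[1].saturating_add(10);
        px[2] = px[2].saturating_add(10);
        // px[3] (alpha) left untouched.
    }
}
```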
These blocks are non-contiguous. It operates over a part of a row, where the start of the next row is the start of the current row plus some stride.
I mean, maybe? But I probably wouldn't anyway for this case, as a slice reference is actually a "fat" pointer (i.e., twice the size of a normal pointer) and the length of the slice won't be used (the block size is known per kernel); LLVM might delete the length part anyway.
These are automatically generated kernels, so readability isn't the primary concern here.
Once const-generics are released, I feel like you should be able to create your own "fixed-size slice" type, which uses the constant parameter for "bounds checking". I imagine that would be a lot more optimiser-friendly...
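Something like this, sketched with const generics as they eventually landed (the `FixedSlice` type and names are made up):

```
struct FixedSlice<'a, T, const N: usize>(&'a [T; N]);

impl<'a, T: Copy, const N: usize> FixedSlice<'a, T, N> {
    fn get(&self, i: usize) -> T {
        // The only check left is against the compile-time constant N;
        // for a constant `i` the optimizer can fold it away entirely.
        assert!(i < N);
        self.0[i]
    }
}

fn sum4(s: FixedSlice<'_, u32, 4>) -> u32 {
    s.get(0) + s.get(1) + s.get(2) + s.get(3)
}
```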
I'm not sure this is the optimizer's fault. If I'm doing a bounds-checked array access that might panic, that panic is an observable side effect that cannot be reordered with respect to other observable side effects. That puts constraints on what the optimizer can do, not because it's not smart enough, but because it has to respect the meaning of the program.
There are aspects of Rust that are easier for LLVM to optimize than C, and there are things Rust isn't yet doing that will let LLVM optimize it even better.
The primary reason THIS code didn't optimize as well is that you can't pass -ffast-math to LLVM from Rust in any useful way (yet).
Can anyone elaborate a bit on why this PR has such terrible compile times? As someone with only a passing familiarity of Rust (I do most of my work in C/C++ and generally expect sub-10s compile times), I would naively expect that writing more C-like code which uses fewer of Rust's language features would compile faster or at least not considerably slower. What's going on there?
The biggest Rust language feature is the borrow checker IMO, and unless you're running in unsafe, that's still doing its job. I always assumed the slowness was more based on the borrow checker analysis than on the error wrapping (option) stuff, slices, or functional bits.
I don't have anything solid backing this up, but I've often heard Rust proponents claim that the borrow checker is actually fairly cheap in terms of compile time overhead. Looking at the PR, it's obvious that it generates quite a bit of code (separate kernels for different block sizes, etc.) and does things like generics over tuples, but I don't know enough about Rust to pick out patterns that might be particularly slow to compile. It does have somewhat fewer unsafe blocks than I expected.
That was interesting indeed, thanks. For an outsider, the most trivial takeaway from that is something along the lines of "Rust is slow to compile and it's probably not getting 10x faster any time soon". Beyond the obvious culprits like heavy metaprogramming, it looks like death by a thousand cuts (not to mention that ggez spends 15 seconds on something that's not even accounted for in the profile, that alone is huge) and there obviously isn't a common root cause for huge compile times in general.
Seems a little gung-ho to disable guaranteed index checking in a video codec, no? I know you still do the checks, but it sounds like it's not done in a statically guaranteed way.
Depends how much you care about safety, and how much safety is needed. If you're doing FEA analysis or whatever, fine, disable all checking. But a video codec is guaranteed to be used in dangerous environments.
Also depends how much it affects performance. If it doubles it, OK, I'll accept the risk. 10% faster? No thanks, I'd rather play it safe.
It looks like what the author was looking for is [1]
f64::mul_add(self, a: f64, b: f64) -> f64
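Used on the article's update loop, the change looks roughly like this (a sketch; the slice names and scalars are assumed from the post rather than copied from it):

```
fn update(b: &mut [f64], a: &[f64], c: &[f64], r: f64, beta: f64) {
    let n = b.len().min(a.len()).min(c.len());
    for i in 0..n {
        let t = a[i].mul_add(-b[i], c[i]); // c[i] - a[i]*b[i], fused
        b[i] = t.mul_add(r / beta, b[i]);  // b[i] + (r/beta)*t, fused
    }
}
```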
Adding it to the code indeed allows LLVM to generate the "vfma" instruction. But it didn't significantly improve performance, on my machine at least.
$ ./iterators 1000
Normalized Average time = 0.0000000011943495282455513
sumb=89259.51980374461
$ ./mul_add 1000
Normalized Average time = 0.0000000011861410852805122
sumb=89259.52037960211
Maybe the performance gap is not due to what the author thought...
"The program" contains two hot loops. Judging from the assembly code you linked in a sibling comment, only the second of these loops was vectorized, the first one wasn't. This slower non-vectorized loop will still dominate execution time.
And for whatever it's worth, dropping the original article's
for i in 0..n {
    b[i] = b[i] + (r / beta) * (c[i] - a[i] * b[i])
}
in place of your loop using iterators and FMA, you still get nicely vectorized code (though without FMA) for this loop:
Hi I wrote the blog post linked - and I feel a little silly that I didn't check that _both_ loops vectorized. So I fixed the Rust implementation to keep a running vector of partial sums which I finish up at the end - this one did vectorize. The result was a 2X performance bump, which I'm about to include in the blog post as an update.
If it's OK I'll link to this comment as the inspiration.
On the iterators versus loop: for some reason when I use the raw loop _nothing_ vectorizes, not even the obvious loop. What I read online was that bounds checking happens inside the loop body because Rust doesn't know where those indices are coming from. Using iterators instead is supposed to fix this, and it did seem to in my experiments.
I liked your trick to iterate on chunks to force the compiler to vectorize the code!
Now that the code is properly vectorized, you can add the `mul_add` function, and this time you'll see a significant speedup. I tried it on my machine and it made the code 20% faster.
Thanks! The chunks trick was a fairly straightforward translation of what I would do in C++ if the compiler wouldn't vectorize the reduction for some reason. These days most compilers will do it if you pass enough flags, a fact I really took for granted when doing this because Rust is more conservative.
I've tried using mul_add, but at the moment performance isn't much better. But I also noticed someone else on my machine running a big parallel build, so I'll wait until later and run the full sweep over the problem sizes with mul_add.
So really the existence of FMA didn't have a performance implication it seems except to confirm that Rust wasn't passing "fast math" to LLVM where Clang was. It just so happens that "fast math" will also allow vectorization of reductions.
It isn't always that simple. FMA instructions are tricky to use in a way that actually improves performance; LLVM may be doing it right, while doing it manually that way may not.
Also, sometimes a SIMD instruction is used but only on one lane at a time. This is actually common with floating-point code.
Something I found surprising: Some AVX2 and AVX-512 instructions consume so much power that Intel chose to have their chips dynamically slow their clock frequency when the instructions are executed. So naively switching to SIMD instructions can not only fail to improve performance, but it can also hurt the performance of unaltered code executed after it -- even unrelated code running on other cores.
What do you mean by "manually"? `mul_add` is a Rust function that operates on a single f64; it's still up to LLVM to choose which instructions to use and to do the vectorization.
Well, technically FMA doesn't necessarily imply vectorization; it depends on whether the P{S,D}- or the S{S,D}-suffixed instructions were being used. But if you saw the (P)acked variants, then yes, it was vectorized.
Thanks. It seems that at least parts of the inner loop have been vectorized. Edit: If I'm reading the asm correctly, the second zip has been vectorized, but the first fold was not.
It doesn't exist yet and it's not clear it should be replicated as is. The fast-math flag does a bunch of related things that should probably be exposed separately so it's not a footgun in several situations. I'm also partial to exposing it per-function, so the control is actually in the hands of the people writing the code who know the context, and not subject to someone fiddling with compiler flags and getting incorrect code.
For this example you'd probably want -fassociative-math and not the other stuff that may result in incorrect code. -ffast-math was not actually used in the clang compilation and it's possible that the -fvectorize that was used picks a sensible mix of options.
Floating point code is really difficult to do correctly. LLVM doesn't actually model IEEE 754 correctly yet, although hopefully the remaining issues with the constrained intrinsics will be fixed by the end of the year (even then, sNaN support is likely to still be broken for some time).
What makes floating point more difficult than integers is two things. First, there is an implicit dependency on the status register that greatly inhibits optimization, yet very few users actually care about that dependency. The second issue is that there are many more properties that matter for floating point. For an integer, you essentially only care about three modes: wrapping (w/ optional carry flag), saturating, and no-overflow. Exposing these as separate types exhaustively is easy. For floating point, you have orthogonal concerns of rounding mode (including dynamic), no-NaN, no-infinity, denormals (including flush-to-zero support), contraction to form FMA, reciprocal approximation, associativity, acceptable precision loss (can I use an approximate inverse sqrt?), may-trap, and exception sticky bits. Since they're orthogonal, that means that instead of a dozen types, you need a few thousand types to combine them, although many combinations are probably not going to be too interesting.
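For the integer side of that comparison, Rust already exposes the three modes as separate methods (plus wrapper types like `std::num::Wrapping`); a trivial illustration:

```
fn int_modes(a: u8, b: u8) -> (u8, u8, Option<u8>) {
    (
        a.wrapping_add(b),   // wrapping
        a.saturating_add(b), // saturating
        a.checked_add(b),    // "no-overflow": None on overflow
    )
}
```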
Technically Rust only guarantees memory safety (and only outside of `unsafe {}`). It has many features that aid in other kinds of safety: strongly encouraging you to unwrap Option<> and Result<>, requiring all match cases to be covered, allowing for lots of strategic immutability, etc. But it doesn't guarantee that kind of correctness.
That's not correct. Safe Rust is advertised as sound, and Rust defines that as "safe Rust programs do not exhibit undefined behavior". Undefined behavior is a much larger term than just memory safety, and includes things like thread safety, const safety, unwind safety, etc.
Rust doesn't guarantee anything if you opt out of the guarantees. Two examples that come to mind: unsafe, and (maybe) bounds checks in release mode.
Fast-math is probably different anyway, as the "breaking" is at the floating-point logic level, as in: results become imprecise, but not exactly "wrong" as in undefined behaviour. But I don't know fast-math, so I might be wrong.
Yes, this is all purely down to Rust not doing -ffast-math (on purpose).
Writing a wrapper is not trivial, at all.
What Rust could do, though, is add a wrapped float type that the compiler will forward to LLVM, saying "you can fast-math these". That is the approach Rust tends to take for these kinds of things, though no plans are in the works to do this to make floats vectorizable yet. Maybe we should start such plans?
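A hypothetical sketch of that wrapper-type idea, using the unstable, nightly-only `fadd_fast`/`fmul_fast` intrinsics (the `Fast` newtype itself is made up):

```
#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};
use std::ops::{Add, Mul};

#[derive(Clone, Copy)]
struct Fast(f64);

impl Add for Fast {
    type Output = Fast;
    fn add(self, rhs: Fast) -> Fast {
        // Tells LLVM it may reassociate/contract this addition.
        // Unsafe because the result is UB if the inputs are not finite.
        Fast(unsafe { fadd_fast(self.0, rhs.0) })
    }
}

impl Mul for Fast {
    type Output = Fast;
    fn mul(self, rhs: Fast) -> Fast {
        Fast(unsafe { fmul_fast(self.0, rhs.0) })
    }
}
```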
It's worth explaining why. So far as I understand the situation, dependencies may assume and depend on the absence of ffast-math. Enabling it globally for a compilation is generally considered to be a footgun for this reason.
FMA isn't a safe optimization as it can give different results.
C++ compilers have flags to enable it globally. gcc and clang include the optimization in -Ofast.
Rust allows you to choose at a code level (but usually people don't know about it). Perhaps it should also have a global fast-math flag that would automatically optimize it. Pros and cons to that.
FMA is "safe" in that if it breaks your code, it was already broken. It can only make the results slightly more accurate, unlike for instance the rsqrt instruction, which is less accurate (and as such is not a safe optimization).
GCC emits FMA instructions at -O2 without -ffast-math.
Well, it could "break" your code in that it might make your code produce different results than a separate equivalent implementation that didn't use FMA.
I wasn't trying to imply it should be on by default. Often one does not care about the lower bits of the floats, but does want the speed. For some tasks it's very much the opposite. Being able to specify a global option with a local override is a great combo.
FMA is not always faster; it has a high latency (5-6 cycles depending on the CPU) while Add and Mul have very low latency.
This means that to fully utilize FMA you need to unroll a loop more. Sometimes you just can't, and other times you end up using more instructions and more cache.
In short, it's not always better.
Also, as others said, FMA has better accuracy than a separate Add + Mul.
If you’re doing fiddly numerical work, this must definitely be optional, as swapping separate multiplication and addition for FMA (or vice versa) can compromise correctness. In some cases you need two different algorithms if FMA is present or absent.
Some algorithms guarantee that some arithmetic operation(s) applied to 2+ floating point inputs will result in a list of floating point outputs which when summed have exactly the correct result. This gets all screwed up if you mess with the order of operations or the rounding of intermediate results.
Fair. I think it would be very helpful for Rust if some expert could point to a specific example for which this is the case. I think there is an RFC about enabling floating-point contraction by default, which would "silently" be able to do some of these transformations depending on the optimization level.
The one very important thing that often gets destroyed by compilers using associativity is the TwoSum error-free transform, which is a vital component of several algorithms that deal with numerical error (most notably Kahan summation).
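For reference, the standard TwoSum formulation (as described on the Wikipedia page mentioned below); if a compiler reassociates these lines, the error term simplifies to zero algebraically and the whole correction disappears:

```
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;                 // rounded sum
    let a2 = s - b;                // recovered a
    let b2 = s - a2;               // recovered b
    let err = (a - a2) + (b - b2); // exact rounding error of s
    (s, err)
}

fn main() {
    // 1.0 is lost in the rounded sum but recovered exactly in the error term.
    let (s, e) = two_sum(1.0e16, 1.0);
    println!("s = {s}, err = {e}"); // err = 1
}
```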
I think the key issue with the optimizations that ICC is performing for C++ but Rust is not doing in this case is just FP-contraction, which is related to, but not the same as, assuming associativity.
The RFC about that is https://github.com/rust-lang/rfcs/pull/2686 , where you see users kind of split into the "I want faster binaries" and "I want more deterministic execution" camps. Neither are wrong TBH.
Some people have tried to show there that enabling FP-contraction by default isn't always better / more precise, but I'm not sure if they succeeded.
> I think the key issue with the optimizations that ICC is performing for C++ but Rust is not doing in this case is just FP-contraction, which is related to, but not the same as, assuming associativity.
I think associativity is necessary to vectorize reduction operations like:
r += (c[i] - a[i]*b[i]) * a[i] * (c[i] - a[i]*b[i]);
I haven't looked at the code generated by ICC, but I would expect it to vectorize this by computing tuples of "partial sums", roughly as follows:
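A sketch of that kind of partial-sum rewrite (scalar Rust standing in for vector lanes; this is illustrative, not ICC's actual output):

```
fn reduce(a: &[f64], b: &[f64], c: &[f64]) -> f64 {
    let n = a.len().min(b.len()).min(c.len());
    let main_len = n - n % 4;
    // Four independent accumulators, one per "lane".
    let mut acc = [0.0f64; 4];
    for i in (0..main_len).step_by(4) {
        for lane in 0..4 {
            let t = c[i + lane] - a[i + lane] * b[i + lane];
            acc[lane] += t * a[i + lane] * t;
        }
    }
    // Combining the lanes only at the end is the reassociation step
    // that needs -fassociative-math (or similar) to be legal.
    let mut r = acc[0] + acc[1] + acc[2] + acc[3];
    for i in main_len..n {
        let t = c[i] - a[i] * b[i];
        r += t * a[i] * t;
    }
    r
}
```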
For the error-free transform, the important part is the associativity and not the fusion (which would not degrade the operation).
As illustrated on the Wikipedia page, if you move terms according to associativity rules you can show that one of the terms is always zero, conclude that it is useless, and drop it from the computation.
However, in practice, floating-point operations are not associative, and that term contains the numerical error of the previous operation.
This is interesting to see. But if I'm going to compare numerical C++ against numerical Rust, then I would be using a higher-level library for the comparison. What is Rust's Other Leading Brand (TM) for the Eigen C++ library?
That comparison (I’m not familiar enough with Eigen to truly say) is going to change over time too; once const generics lands (which is proceeding, finally) the APIs for numerics libraries are going to be significantly different in Rust.
Before you look at speed, did you verify you get exactly the same math results in Rust and C++ (and for each compiler and platform)? For C++ code, I have seen the results of calculations vary across compilers (and flags).
The author ends by noting that FMA would probably have improved the performance of the Rust code.
It is interesting to note that, whereas most ffast-math optimizations will trade precision for reduced computing time, adding an FMA can only improve the precision of the output (and thus it is a safe optimization).
pedantry:
ffast-math does not always trade precision. It simply trades away the guarantee that the results are the same as if they were not vectorized. A vectorized sum of floats, for instance, is more accurate, not less.
What's "almost" algebraic about enum? It can definitely be used to construct sum types, and you can make product types with struct or inline in an enum.
My best guess is that you can't do recursive enums without explicit boxing [edit: or other forms of indirection, like &T]¹. So you can't do this:
enum List<T> {
    Nil,
    Cons(T, List<T>)
}
Instead, you have to box/reference-ify the recursive occurrence:
enum List<T> {
    Nil,
    Cons(T, Box<List<T>>)
}
So in certain circumstances it doesn't let you "coproduct" two types together; you might need to modify one a bit, which makes it a technically-not-exactly-a-coproduct (I think). A bit of a stretch, but it sort of makes sense next to by-reference-only ML langs where you can (co)product anything as you please.
You don't have to box, but you do need some sort of type to make things sized. This is usually a pointer of some kind, but any kind of pointer works. Take references, for example:
enum List<'a, T> {
    Nil,
    Cons(T, &'a List<'a, T>)
}

fn main() {
    let list = List::Cons("hello", &List::Nil);
}
Box is usually chosen because it's a good default choice.
Source for the kernel work described at the top of the thread: I wrote https://github.com/xiph/rav1e/pull/1716, which speeds up the non-asm encodes by over 2x!