SLEEF Vectorized Math Library (sleef.org)
62 points by g0xA52A2A on Feb 2, 2020 | 14 comments



Interesting, but a bit weird that the benchmark doesn't include the standard math functions - the question most potential users would ask is, how much performance could I gain from this?

Personally I do a lot of approximations of math functions, which usually give me about a 10x speed increase relative to <math.h>. From the looks of it, this isn't quite that fast.


Recent versions of GCC, as well as the Intel compilers, can autovectorize calls to the <math.h> functions given the right flags (-ffast-math with GCC, -fast with Intel), and will yield better performance than you'll get from SLEEF on x86_64.
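
For illustration, a minimal sketch of the kind of loop the autovectorizer targets (the -march value below is just one AVX-512-capable example):

    #include <math.h>

    /* With gcc -O3 -ffast-math -march=skylake-avx512, GCC replaces the
       scalar log() call below with a call into glibc's vectorized math
       routines, processing several elements per loop iteration. */
    void log_all(double *restrict dst, const double *restrict src, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = log(src[i]);
    }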

LLVM doesn't have a vector math library, so SLEEF could help you there.

If you're using single precision and AVX512, a >10x speed increase is likely. Otherwise, you'll probably get less than that. These functions are very accurate, most to within 1 ULP (unit in the last place). That is, if the answer they provide isn't the correctly rounded floating point answer, then it'll be either the next or previous representable floating point number.
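
As a rough sketch of what the 1-ULP claim means (assuming, as a heuristic, that rounding a long-double computation back to double gives the correctly rounded answer):

    #include <math.h>

    /* Accept a value if it equals the reference result or one of the
       two adjacent representable doubles, i.e. it is within 1 ULP. */
    int within_one_ulp(double got, double x) {
        double ref = (double)logl((long double)x);  /* reference log(x) */
        return got == ref
            || got == nextafter(ref,  INFINITY)
            || got == nextafter(ref, -INFINITY);
    }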

There is a lot of room for giving up accuracy in the name of speed (e.g., using fewer terms in the polynomials).
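
As a hypothetical example of that trade-off, a short truncated series for sine on a reduced range, which gives up a few ULPs relative to a full minimax polynomial:

    /* Hypothetical fast sine for x in [-pi/4, pi/4]: a degree-7 Taylor
       truncation, evaluated with Horner's method. Dropping the x^7 term
       would be faster and less accurate still. */
    static inline double fast_sin(double x) {
        double x2 = x * x;
        return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))));
    }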


Do you have a benchmark showing that gcc is faster than sleef?

Sorry, I wasn't clear: by "10x" I meant 10x faster than the standard library with the fastest compiler options (-O3 -ffast-math -mavx512f), but I've only tested with clang.


This is more for people who need to do specialized DSP math, like DFTs, across a variety of CPU architectures. Most vendor vector libraries (like Intel IPP) don't support competitors' hardware, so benchmarks may matter less here than portability.


This library can leverage data parallelism to increase throughput vs. the scalar versions, i.e., one instruction performs 4 operations instead of one. If the problem you are solving is suited to this parallelism, you could get a significant speedup.
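
For instance, a minimal sketch using SLEEF's 4-lane double-precision sine (assuming an AVX-capable x86_64 target; the name follows SLEEF's Sleef_<op><lanes>_u<accuracy> convention):

    #include <immintrin.h>
    #include <sleef.h>   /* link with -lsleef */

    /* One call computes four sines at once on a 256-bit register. */
    void sin4(double out[4], const double in[4]) {
        __m256d v = _mm256_loadu_pd(in);   /* load 4 doubles */
        __m256d r = Sleef_sind4_u10(v);    /* 4 sines, 1.0-ULP accuracy */
        _mm256_storeu_pd(out, r);
    }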


Sure, so why is that speedup not measured in the benchmark?


I use this library as an optional feature in simdeez. It was a relief to find it, and a Rust wrapper for it; this stuff is so arcane that it's hard to port. Big thanks to SLEEF!


I've been relying on a Julia port of version 2 (version 3 is out now). Version 3 added a lot of new functions, and I believe it improved performance on many of the old ones.

It is much faster (when vectorized) than what you get in base Julia, but lags behind gcc (glibc) and the Intel compiler's vectorized math libraries in performance.


Curious whether you've compared it to https://github.com/chriselrod/LoopVectorization.jl, which, if I understand right, is a pure-Julia attempt to use many of the same tricks?


Compare my username with that of the author of that library ;).

For special functions, LoopVectorization relies on SLEEFPirates.jl, which is a fork of SLEEF.jl, a Julia port of version 2 of SLEEF. Most of the changes in SLEEFPirates are so that it works when you use llvm-vectors as arguments, but I also switched to using Estrin's rather than Horner's method of evaluating polynomials for a few functions (which more recent SLEEF versions did as well).
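
To sketch the difference (in C for brevity, with an illustrative degree-7 polynomial): Horner's method is a serial chain of multiply-adds, while Estrin's scheme pairs coefficients so that several multiply-adds at each level are independent and can execute in parallel.

    /* Horner: 7 fused multiply-adds, each depending on the previous. */
    static double horner7(const double c[8], double x) {
        double r = c[7];
        for (int i = 6; i >= 0; --i)
            r = r * x + c[i];
        return r;
    }

    /* Estrin: the dependency depth drops from 7 to about 3. */
    static double estrin7(const double c[8], double x) {
        double x2 = x * x, x4 = x2 * x2;
        double p01 = c[0] + c[1] * x;   /* these four are independent */
        double p23 = c[2] + c[3] * x;
        double p45 = c[4] + c[5] * x;
        double p67 = c[6] + c[7] * x;
        double p03 = p01 + p23 * x2;    /* these two are independent */
        double p47 = p45 + p67 * x2;
        return p03 + p47 * x4;
    }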

The code is pure Julia (or Julia + LLVM call; either way it does not need any external dependencies aside from Julia itself). It does need performance work, but I have many higher priorities at the moment.


Hah! Sorry, didn't cross my mind. And thanks for the details.


glibc has SIMD-vectorized math functions? I don't quite understand what is being compared here. Any specifics you can share?


Yes, here is the source for the 8-wide double-precision log (with AVX-512) in glibc, for example: https://github.molgen.mpg.de/git-mirror/glibc/blob/20003c498... The "sysdeps/x86_64/fpu/multiarch" directory contains many of these functions.

You'll need at least GCC 8 to use them automatically, as well as the -ffast-math flag: https://godbolt.org/z/PL26up
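
As a sketch of what the compiler is doing under the hood: the vector routines live in libmvec under mangled names, and you can even declare and call one directly (normally the autovectorizer emits this call for you; link with -lmvec -lm):

    #include <immintrin.h>

    /* _ZGVeN8v_log is glibc's AVX-512 8-wide double log, the routine
       from the source linked above. */
    __m512d _ZGVeN8v_log(__m512d);

    void log8(double out[8], const double in[8]) {
        __m512d v = _mm512_loadu_pd(in);
        _mm512_storeu_pd(out, _ZGVeN8v_log(v));
    }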


Correction: GCC 6 and 7 generate them too. They just have 7 unrolled calls to the scalar log_finite before (and then again after) a loop surrounding the vectorized call.



