I'm a co-author of one of the papers referenced in the blog post ("Fast Quicksort Implementation Using AVX Instructions"). We did write the AVX512 code back in 2015; we just had nowhere to run it, at least publicly. The paper also says very explicitly that the lookup tables can instead be replaced by the AVX512 compress instructions. The code for that paper is available at https://github.com/vkrasnov/avx_qsort
I wonder how fast it is compared to djbsort https://github.com/jart/cosmopolitan/blob/master/libc/nexgen... and longsort https://github.com/jart/cosmopolitan/blob/e011973593407f576d... djbsort is outrageously fast for 32-bit ints with avx2 (which, unlike avx512, is the AVX we can reliably use in open source). But there's never been a clear instruction set to use on Intel / AMD for sorting 64-bit ints that's reliably fast. So if this thing can actually sort 64-bit integers 10x faster on avx2 I'd be so thrilled.
Yes, we can sort 64-bit ints. The speedup on AVX2 is roughly 2/3 of the 10x we see on AVX-512.
Longsort appears to be an autovectorized sorting network. That's only going to be competitive or even viable for relatively small arrays (thousands). See comments above on djbsort.
Why not use whichever AVX the CPU has? Not a problem when using runtime dispatch :)
The chart above shows a 1000x (3 orders of magnitude base 10) increase in energy consumption relative to a register move (it really should be called copy).
I'm not sure what you mean by that. You can't assume the presence of AVX or AVX2 without explicitly checking for it, because Intel was still disabling those features on new low-end Pentium and Celeron parts at least as recently as Comet Lake (2020). Sure, AVX2 support is much more widespread than AVX512 support, but that has nothing to do with open source, and it's a bit strange to describe that in terms of reliability.
Some of us like to think of ourselves writing open source as serving the public interest. It's hard to do that if you're focusing on an ISA the public doesn't have. I haven't seen any consumer hardware that has AVX512.
Lots of consumer hardware has AVX512 (I have an 11th gen Intel laptop CPU that has it).
Regardless, Clang and GCC both support function multi-versioning, where you supply multiple versions of a function and specify which CPU features each implementation needs, and the best version of the function is selected at runtime based on the results of cpuid. For example, you can use this to write versions of a function that use no vector instructions, SSE, AVX2, or AVX512; all versions are compiled into the executable and the best one you can actually use is selected at runtime. This is how glibc selects the optimal version of functions like memset/memcpy/memcmp, as there are vector instructions that significantly speed these functions up.
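For illustration, here's a minimal sketch of the GCC/Clang target_clones mechanism (the function name and body are just an example; it needs an ELF target with ifunc support):

    #include <cstddef>

    // The compiler emits one clone per listed target plus a resolver that picks
    // the best clone at program start based on cpuid.
    __attribute__((target_clones("default", "sse4.2", "avx2", "avx512f")))
    void AddArrays(const float* a, const float* b, float* out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];  // autovectorized per clone
    }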
I agree AVX-512 is not exactly widespread on client CPUs but as akelly mentions, it does exist (e.g. Icelake).
What we do is dispatch to the best available instruction set at runtime - that costs only an indirect branch, plus somewhat larger binary and longer compile time.
Even if AVX512 was entirely constrained to server hardware (it's not), how would it be contrary to the public interest for open-source software to take advantage of those instructions?
Intel 10th gen mobile and 11th gen mobile and desktop, excluding Pentium and Celeron, have AVX-512. And all 12th gen have it on the P cores but not the E cores. If the E cores are enabled then AVX-512 is unavailable.
On 12th gen, a microcode update disabled it on the P cores too, even with the E cores disabled. A lot of newer systems don't have access to the older microcode, and microcode typically doesn't allow downgrades.
There are workarounds for downgrading microcode, because the CPU itself doesn't actually have non-volatile storage for microcode updates and relies on the motherboard firmware to upload updates on each boot (and motherboard firmware can often be downgraded, possibly after changing a setting to allow that).
Which is probably why Intel has changed to disabling AVX512 using fuses in more recently manufactured Alder Lake CPUs.
My point with "a lot of newer systems" was that there are motherboards now that completely lack a version of their firmware with the microcode that allows avx-512. There's nothing to downgrade to without an exploit to allow making your own firmware images with mixed and matched components.
The E3 v5 series were part of the generation codenamed Skylake, introduced in 2015. But the Skylake microarchitecture was reused in each subsequent new Intel desktop processor generation through 2019's Comet Lake (due to Intel's 10nm failure). They didn't introduce a new microarchitecture in that product segment until Rocket Lake and Alder Lake, both in 2021. So despite being almost 7 years old, the E3-1220v5 is still representative of most of the installed base for Intel desktops and entry-level workstations, and a large chunk of their mobile installed base.
(The original E3-1220 predates AVX2 by two years, so this code wouldn't even run on it.)
Oh, thanks for pointing that out. Golly, Sandy Bridge is a bit old, yes - but still the result is surprising.
djb reports 8000 cycles for int32 x 256 - this is much slower than we benchmark in bench_sort.cc, even for AVX2 (which he confirms is being reached). Not sure what's going on.
I am always amazed at the algorithms re-implemented using SIMD. One of my favorites is the Striped Smith-Waterman approach used for sequence alignment. Does anyone have any good resources on learning to use SIMD? I've found it hard to make the "plunge".
The lectures and assignments for Oregon State's CS475 (Parallel Programming) are all available online. [0] There are lectures [1] and a project [2] about SIMD. I really enjoyed the entire course as a survey of parallel and high-performance computing. The full course covers multi-processing, multi-threading, caching, SIMD, GPUs (CUDA and OpenCL) and MPI. The projects are in C/C++ (along with OpenMP, CUDA and OpenCL). FYI, I think the last two projects use some large research GPU bank that you have to have special access to use, so you'd be out of luck on implementing the projects for those.
The performance benefit of a SIMD implementation of Smith-Waterman-Gotoh is moot relative to the gains provided by a total reformulation of the pairwise sequence alignment problem.
The Wavefront Algorithm (WFA) flips the problem on its head by progressively exploring the best scoring alignment until a global alignment is attained. Then no more work needs to be done to fill the matrix. The total work is actually quadratic in sequence divergence rather than length, a huge improvement over SWG for almost all applications.
In WFA the data dependencies are trivial and compilers easily auto-vectorize the inner loop of the algorithm. It's also possible to implement this in linear memory relative to sequence divergence with a bidirectional approach (biWFA).
All this is to say that vectorization and SIMD hardware are cool, but new theory and a new approach can completely overwhelm their potential benefits.
Highway looks awesome from a quick glance. Definitely going to set some time aside to play with it and read it.
Sort of tangentially related, what sort of tools are you using for disassembly in 2022? Is there anything better than objdump today? What I really want is a version of godbolt that reads compile_commands.json and can show me the godbolt-style color-coded disassembly for an entire binary, any function. I find that when I’m asking these sort of questions I’m wasting a lot of time fitting my problem into godbolt so I can just get a rough idea of what’s coming out, because it’s so much faster (maybe 10x) to read the color tagged version from godbolt than it is to do the same with objdump.
Edit: along the same lines, what can people do (or rather, what do you do) about situations like this: https://github.com/google/highway/blob/master/hwy/examples/b... where things might get better in the future but for now we have to implement it another way? How do you avoid regressions and how do you track workarounds? I want some sort of database that tries competing implementations on every build and emits a compiler warning when things get better. How are you dealing with this?
Thanks! Feel free to raise Github issues if you'd like to ask/discuss anything.
I'm also a huge fan of Godbolt/Compiler Explorer. Highway is integrated there, so you can just copy in your functions. Here's the last throwaway test that I was using: https://gcc.godbolt.org/z/5azbK95j9
> things might get better in the future but for now we have to implement it another way
There are several possible answers.
1) For the issue you pointed to, that we cannot have arrays of N vectors, it's a reasonable workaround to instead allocate an array of N vectors' worth of elements, and trust the compiler to elide unnecessary Load/Store. This often works using clang, and would avoid having to manually unroll here. I do prefer to minimize the amount of compiler magic required for good performance, though, so typically we're unrolling manually as shown in that code.
2) If there are compiler bugs, typically the workarounds have to stay in because in the open-source world people are still using that compiler years later.
Automatically detecting when things get better is an interesting idea but I am not yet aware of such infrastructure.
// Compiler doesn't make independent sum* accumulators, so unroll manually.
// We cannot use an array because V might be a sizeless type. For reasonable
// code, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
That code needs 2 loads per FMA. So a CPU with 2 FMA ports would need at least 4 load ports to be able to feed the 2 FMA ports. Given that most CPUs with 2 FMA ports have just 2 load ports, unrolling by 4 should be more or less ideal.
But, ideally, the compiler could make the decision based on the target architecture.
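For anyone wondering what the "independent sum accumulators" in that comment buy: here's a rough scalar analogue (not the actual Highway code). Each partial sum has its own dependency chain, so the adds (or FMAs, in the vectorized version) from different iterations can be in flight at once:

    #include <cstddef>

    // Scalar analogue of the 4x unroll: four independent partial sums instead of
    // one serial chain of additions. Assumes n % 4 == 0 for brevity.
    float SumUnrolled4(const float* x, size_t n) {
      float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (size_t i = 0; i < n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
      }
      return (s0 + s1) + (s2 + s3);
    }

Note this reorders the additions, which is exactly the reassociation the compiler isn't allowed to do on its own - hence the discussion of flags below.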
Without enabling associative math, it isn't legal to duplicate floating-point accumulators and change the order of the accumulation. Perhaps compiling under `-funsafe-math-optimizations` would help.
If you're using GCC, you'll probably need `-fvariable-expansion-in-unroller`, too.
I think highway looks great. I'm sure I'll procrastinate on something important to play with it reasonably soon.
Thanks :) I'd be interested to hear how it goes for you.
Agree that 4x unrolling is getting most of the low-hanging fruit without excessive code size. I saw only very slightly better performance on SKX with 8x.
You're right that it's nicer when the compiler can decide about the unrolling - for example with knowledge whether we have 16 or 32 regs. The unsafe/fast-math flags are pretty dangerous, though :/ https://simonbyrne.github.io/notes/fastmath/
Especially when they enable flush-to-zero, which would be unacceptable for a library loaded into some other application.
Interestingly, from the portability angle, this was all written in C#, yet with the SIMD intrinsics available there he was still able to obtain ~SOTA performance.
The repo I linked is his C++ version of the C# original.
It was submitted to the C# standard library but was rejected (as an observer, I found the rejection disappointing) – had it been accepted, C# would have been in the interesting position of having a faster sort than any native language standard library, at least for integer keys.
I've read parts 1-5; there doesn't seem to be a part 6. Thanks again for the link; it is regrettable that there is duplication of effort. His approach is quite similar to ours (including HeapSort, sorting network, unrolled in-place partition), and re-invents what he calls "double-pumped partitioning" (introduced by Bramas).
Copying the blog post to arxiv and citing some related work so it comes up in alerts and cited-by searches would [have made/make] this good work visible to a wider circle, including the algorithms community.
It's nice to see a C++ version, this will be easier for me to benchmark, and it also seems to support other built-in types now, as well as AVX2+AVX-512 (though not other platforms). A quick list of differences based on reading the blog:
* Nifty idea for simplifying the branch in partition. (We also found that branchless code is worse)
* Major effort to make all reads vector-aligned. This could make a difference especially on AVX-512. I had tried two variants but it was difficult to make this safe for asan (no out of bounds reads allowed).
* The lookup table is 2KB, which could be halved using 32-bit broadcast + variable shift instead of PDEP.
* The sorting network seems to be half the size and bitonic. I haven't yet compared the number of permute instructions (not obvious from the source because it's generated).
* The pivot is only a median of three. We use a much larger sample.
Thanks for sharing results and for the write-up. It is another welcome addition to a series of recent reports attesting that SIMD instructions have entered mainstream computing, from high-performance JSON parsing to, now, quicksort.
The white paper calls out memory bandwidth as a major bottleneck, which is where I have a question.
The M1 Max sports an unusually wide (512-bit) memory bus coupled with DDR5-6400 unified memory which, according to the available empirical evidence, allows a single CPU core to achieve circa 100Gb/sec memory transfer speed. M1 cores also feature a very large L1 D-cache as well as a large L2 cache (48 MB has been reported). Results, however, are approximately 2.4x lower for the M1 Max. I do realise that NEON SIMD processing will always be slower than AVX-512-based processing due to the 4x difference in vector size, but isn't the M1 Max supposed to perform somewhat faster than the observed figures, given that the faster and much wider memory bus would partially compensate for NEON inefficiency?
Other than the vector size difference between NEON and AVX-512, would you attribute the difference in performance to the small test batch size (~4/8/16 MB for the 32/64/128-bit key sizes used in the test), thus being able to fit in the L2 cache nearly entirely; or to GCC being unaware of cache line sizes on M1 Pro/Max, resulting in cache underutilisation or inefficient instruction scheduling; or would you purely attribute it to NEON not having aged gracefully to meet current data processing demands?
:)
If I'm reading 100Gb/sec correctly as Gigabits, a Skylake core can also reach such bandwidth usage. The issue is that the system can only sustain that for roughly 8 cores. It would be very interesting to see bench_parallel results for an M1 Max, if anyone would like to give that a try? I suspect it will be comparable to Skylake-X, because 8 (performance) cores are not quite enough to utilize all the M1 Max bandwidth in this app. M1 may be more power efficient, though. It is also great to see more bandwidth per core. A hypothetical M2 with say 256-bit SVE vectors and 16 cores could be very interesting.
L2 caches are typically partitioned and private to a core. If that also applies to the M1 Max, then each core would only access 3 MB, thus the working set is larger than cache as intended.
I do believe NEON is the limiting factor here. I haven't looked into how many IPC we should expect, but even if it is 4 (the number of 128-bit NEON units), Skylake is often reaching 2 (with 512-bit vectors), so the measurement that an M1 core is about half as fast as SKX for this use case seems plausible.
Very cool work! What do you see as the typical process of getting code like this into use in well known codebases? I'm trying to get a sense of when a typical engineer (operating with higher level language or libraries) might get to take advantage of this.
Thanks! Highway is included in Chrome and Firefox. Not sure what you mean by process? It's pretty easy for example using git submodules - you tell Git the Highway version you want to depend on, notify your CMake or Bazel build system of the hwy library dependency, and that's it.
One requirement is C++11 or later - some people have asked about supporting C but I believe that's infeasible.
Regarding columnar databases, is there an optimal data structure in between storing data as rows and storing data as columns that would make it fast to sort and filter by either dimension? Some mix of B-trees or interval trees?
SIMDe implements the exact semantics of a given ISA using either the native intrinsics, or other platform's intrinsics when possible, otherwise autovectorization. This is great when you've already written code using e.g. NEON intrinsics and want it to run on x86 (similar to neon2sse), or the reverse. It's an impressive undertaking to implement thousands of instructions for each platform :)
Highway instead aims for a path that each platform can implement efficiently. For example, ReorderWidenMulAccumulate bridges differences between NEON and x86 which user code need not care about if they just want a dot product, and CountTrue is efficient on both platforms without requiring NEON to go to the extra trouble of matching the exact x86 movmskb semantics.
Also, Highway uses only intrinsics and does not rely on autovectorization (+), which makes for more predictable performance.
+ except in the EMU128 target, which is only used if SIMD/intrinsics are disabled
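To make the CountTrue point concrete, here's a rough static-dispatch sketch following the usage pattern in Highway's README (remainder handling is omitted, so it assumes the length is a multiple of the vector size):

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // Counts negative elements. CountTrue compiles to an efficient sequence on
    // each platform, without forcing NEON to emulate x86 movmskb semantics.
    size_t CountNegative(const float* HWY_RESTRICT x, size_t size) {
      const hn::ScalableTag<float> d;
      size_t count = 0;
      for (size_t i = 0; i < size; i += hn::Lanes(d)) {
        const auto v = hn::Load(d, x + i);
        count += hn::CountTrue(d, hn::Lt(v, hn::Zero(d)));
      }
      return count;
    }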
I started reading through Highway docs, and maybe I missed something, but I'm surprised it's a built library to be linked against, instead of a header-only library - at least when using static targeting. If using static targeting, wouldn't function call overhead be severe, or are a lot of implementations in the headers?
Good question. There are only a few functions implemented in .cc files, tagged with HWY_DLLEXPORT, notably memory allocation and detecting x86 CPU capabilities. If it were necessary, we could likely strip those out or do a header-only library. The ops/intrinsics called from user code are inlined in headers.
In short, it turns out not to help for single core with vectors, but a few initial passes of ips4o (with 64..256-way partitioning) is faster for parallel sorts.
Depends on your goals?
If you'd like to contribute code under Apache2, I'd be happy to collaborate.
Feel free to reach out to the email in the Highway README.
Some of the things on my short to medium-term todo list include:
- CompressStore for 8-bit lanes (before Icelake, that will likely require splitting up into runs of 8 bytes with one PSHUFB + store each)
- also ExpandLoad (analogous to CompressStore) and LeadingZeroCount
- the C++ STL isn't autovectorized much (http://0x80.pl/notesen/2021-01-18-autovectorization-gcc-clan...). Manual vectorization can help, we have some of the simpler algorithms already in hwy/contrib/algo, others such as unique, reverse, minmax_element, all_of are more interesting.
I have no idea what you just wrote. I don't speak C, know basically nothing about it, I'm not in the business and I'm not ever going to do either, because I don't like it.
So I looked that up. You're talking about intrinsics. I have no idea about that, beyond "Why would people rather remember this instead of just learning the proper assembly code, which is far easier to read and remember anyway?"
I don't write code for compilers to do the work I am supposed to be doing, so I have no idea of the world you code in.
Your response doesn't really answer my question, but that's totally my fault, so ...
Hand me a box with a defined problem, some compiled code I can run to benchmark my own solution against, then wait for me to beat your solution within defined parameters.
My goal is to have some challenges that need optimization, but there needs to be a point to it. I can take any problem from the web, rethink the approach and rewrite it in assembly, but in 99.999% of all cases there's no point to it and I have nothing to compare it against on my own machine anyway.
I do not know what's generally useful to optimize and I certainly do not know who actually cares about that anymore.
Like, for example, how the pattern matching/string search function in freepascal is damn slow. I've researched the topic and noticed that all the solutions I've found are basically crap, so I wrote my own, beating it by some wide margin in terms of codesize and performance. I don't even know why these devs accept subpar solutions, but responses I got to that question were, basically, "It's fast enough" ... which is dumb.
Still, I don't actually have a way of fairly comparing my work.
Regardless, apparently the fastest fixed size pattern/string search algorithms I've found are all crap, but I really have no way of creating a fair comparison. Oh, on that note ... you don't, by chance, have a pattern matching benchmark for fixed sized needles I could compare my own against?
Sorry for being a mess. :D I really just wanna work on something that's actually a challenge AND something people actually need to run more efficiently, just like the QuickSort example.
Thanks :) If I understand correctly, you have tuples of (number, T*) and want to sort by number? That can be done by storing these tuples interleaved, reinterpret_casting to hwy::uint128_t (making sure that the pointer bits are in the least-significant half), and sorting those.
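Roughly, the packing would look something like this (using std::sort as a stand-in for the vectorized 128-bit sorter, just to show the key layout):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // 128-bit key: the number in the upper 64 bits so it dominates the comparison,
    // the pointer bits in the lower (least-significant) 64 bits, as described above.
    struct Key128 {
      uint64_t lo;  // pointer bits
      uint64_t hi;  // sort key (number)
    };
    inline bool operator<(const Key128& a, const Key128& b) {
      return a.hi < b.hi || (a.hi == b.hi && a.lo < b.lo);
    }

    template <typename T>
    std::vector<Key128> PackAndSort(const std::vector<std::pair<uint64_t, T*>>& tuples) {
      std::vector<Key128> keys;
      keys.reserve(tuples.size());
      for (const auto& t : tuples) {
        keys.push_back(Key128{reinterpret_cast<uint64_t>(t.second), t.first});
      }
      std::sort(keys.begin(), keys.end());  // stand-in for the 128-bit vectorized sort
      return keys;
    }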
SIMD based sorting isn't new, e.g. djbsort: https://sorting.cr.yp.to/. I find it strange the paper doesn't cite it or compare to it at all.
EDIT: to be fair it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort, https://arxiv.org/abs/1704.08579. A comparison would still be nice though.
It's not new within Google either. I remember poking through Jeff Dean's original MapReduce code (written c. 2003) and finding a SIMD-based QuickSort that was the in-memory portion of the external sort algorithm.
It looks like this algorithm is interesting in the specifics of how it works, eg. the use of a compress-store instruction for partitioning. The algorithm I mentioned above mostly focused on speeding up comparisons through the use of SIMD instructions, rather than changing the partitioning part of the QuickSort algorithm itself. Not sure how that compares to djbsort, I didn't see technical details of the algorithm on the page.
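To make the compress-store idea concrete, here's a rough sketch of what a single AVX-512 partition step could look like (not the actual code from the post; in-place buffering of the first/last vectors is omitted):

    #include <cstdint>
    #include <immintrin.h>

    // One 16-lane partition step: lanes <= pivot are compressed to the 'left'
    // write cursor, the remaining lanes just below the 'right' cursor. Returns
    // how many lanes went left; the caller advances/retreats the cursors.
    inline size_t PartitionStep(const int32_t* in, int32_t pivot,
                                int32_t* left, int32_t* right) {
      const __m512i v = _mm512_loadu_si512(in);
      const __m512i p = _mm512_set1_epi32(pivot);
      const __mmask16 le = _mm512_cmple_epi32_mask(v, p);
      const size_t num_le = _mm_popcnt_u32(le);
      _mm512_mask_compressstoreu_epi32(left, le, v);
      _mm512_mask_compressstoreu_epi32(right - (16 - num_le),
                                       static_cast<__mmask16>(~le), v);
      return num_le;
    }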
Interesting, I did not know that, thanks for the pointer.
There is also an effort underway to update LLVM std::sort to BlockQuicksort, which can use some SIMD instructions similarly to what you describe.
What exactly should be cited? I had a quick look and djbsort seems to be a constant-time algorithm, presumably a sorting network. They report about 2 GB/s for 768 int32 on 3 GHz Skylake; IIRC our sorting network reaches 3-4 GB/s, but for a smaller input size (256 int32 or int64). Hard to compare directly, but djbsort is likely slower, and the sorting network is anyway a small fraction of the total time in a Quicksort.
djbsort uses a sorting network, not a quicksort.
It is also designed to be in constant time, which is not the case with the linked blog post.
So the restrictions are different.
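(For readers unfamiliar with the term: a sorting network is a fixed, data-independent sequence of compare-exchange steps, which is what allows constant-time and SIMD-friendly implementations. A tiny 4-element example for illustration - this is not djbsort's actual network:)

    #include <algorithm>

    // Optimal 5-comparator network for 4 elements. The same compare-exchanges run
    // regardless of the input values, expressed here with min/max as a branch-free
    // or SIMD version would do.
    inline void Sort4(int& a, int& b, int& c, int& d) {
      auto cswap = [](int& x, int& y) {
        const int lo = std::min(x, y), hi = std::max(x, y);
        x = lo; y = hi;
      };
      cswap(a, c); cswap(b, d);  // layer 1
      cswap(a, b); cswap(c, d);  // layer 2
      cswap(b, c);               // layer 3
    }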
This is not my field, so I can't judge the relevance of this, but the author cites the state of the art.
It just seems that djbsort is not the state of the art and therefore does not need to be cited.
I still don't understand the hype around Bernstein, he is quite relevant in cryptography/cryptography implementations but not everything he does is gold.
For 32-bit keys, djb has done light testing, indicating djbsort would be a better base case.
> So far I haven't been able to verify these vqsort speed claims. On the contrary, it seems that, for 32-bit data types on AVX2, vqsort would be faster if its base-case code were replaced by a call to the 2018 djbsort code. Similarly, vqsort should reuse vxsort-cpp for AVX-512.
The whole "stick SIMD into it and compare it with the stdlib" thing is a really common learning trope. Feels more like a portfolio filler, albeit probably not a useless one.
We also compare with state of the art platform-specific code. The interesting thing here is that they turn out to be slower than our approach using portable intrinsics (github.com/google/highway) :)
Another one, if you enjoy this sort of thing like I do, is Raph Levien's work on the stack monoid; this [1] blog post and this [2] arXiv paper are great places to look.
This looks awesome but could do with a bit of productionalization. For example ready to use binaries. Do you know if there are any plans to merge this work with core postgres?
This is great, and can definitely improve sort performance in database and big data projects. I can immediately imagine this is a perfect match to my project of columnar processing engine TiFlash (https://github.com/pingcap/tiflash).
Not having played with SIMD much myself, does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?
In other words, do you have to be more careful when integrating these wide operators to preserve some resources for other operations?
At least on Intel, AVX has its own functional units and register file, so I would say it's not a major concern. It's possible that running some AVX instructions could be almost or completely free if you weren't using execution port 5 anyway, to the extent that instruction-level parallelism can be considered "free".
If you're really taking a microscope to performance, the main hazards would be intermittently using AVX for only a few instructions, because that might lead to the CPU stopping for a few microseconds to turn the power on and off on the functional units. If you're using them heavily the overall thermal situation might cause a core or package-wide clock rate degradation, but if you have a use case for sustained AVX-512 usage this is likely to be a good tradeoff.
Pushing them out of the CPU, I don't know, but some SIMD instruction sets on some CPUs have side effects that can negatively affect the performance of other operations. For example, the use of AVX2 / AVX-512 can cause some Intel CPUs to lower their base frequency, thus reducing the performance of simultaneous operations that are not using SIMD.
Not for recent hardware - Ice Lake eliminated the throttling on 256b ops (they already didn't exist for certain 256b ops and all smaller ops), and reduced the throttling to almost nothing for 512b ops. Rocket Lake eliminated the throttling for 512b ops.
They do use a lot of power (and as a result, generate a lot of heat), so they can still cause thermal throttling, or throttling due to power limits - but there's no more "AVX offset"
They do emit a lot of heat, which might actually throttle the CPU overall.
But to my knowledge they use different registers, and when properly pipelined they don't hog the CPU cache like unoptimized algorithms that constantly round trip to RAM.
The CPU has a certain upper bound on power, TDP. That limit is independent of whether it is reached by executing scalar code on lots of cores, or SIMD on a few.
One little-known fact is that out of order CPUs burn lots of energy just on scheduling instructions, ~10x more than the operation itself. SIMD amortizes that per-instruction overhead over multiple data elements, and actually increases the amount of useful work done per joule. We're talking 5-10x increase in energy efficiency.
Right, though your use of the terms "different registers" and "properly pipelined" is quite nuanced for a non-specialist. It leaves a lot to the imagination or prior experience!
If it is a compute-heavy code that was spilling to a stack and now can fit in the SIMD registers, there could be a speed up with less memory traffic. This is analogous to a program's working set either fitting in RAM or spilling to swap while it runs, but at a different level of the memory hierarchy. In the extreme case, your working set could be a fixed set of variables in expensive loops for some sequential algorithm, and so the compiler register placement ability and number of registers available could be the difference between effectively O(1) or O(n) memory accesses with respect to the number of compute steps.
Of course, you can find such transitions between registers/L1 cache, L1/L2 cache, L2/L3 cache (if present), local/remote RAM (for modern NUMA machines), etc. This happens as you transition from a pure CPU-bound case to something with bigger data structures, where the program focus has to move to different parts of the data to make progress. Naively holding other things equal, an acceleration of a CPU-bound code will of course mean you churn through these different data faster, which means more cache spills as you pull in new data and have to displace something else. Going back to the earlier poster's question, this cache spilling can have a detrimental effect on other processes sharing the same cache or NUMA node, just like one extremely bloated program can cause a virtual memory swapping frenzy and hurt every other program on the machine.
One form of "pipelining" optimization is to recognize when your loop is repeatedly fetching a lot of the same input data in multiple iterations and changing that to use variables/buffers to retain state between iterations and avoid redundant access. This often happens in convolution algorithms on arrays of multi-dimensional data, e.g. image processing and neural nets and many cellular-automata style or grid-based simulations. Your algorithm slides a "sampling window" along an array to compute an output value as a linear combination of all the neighboring values within the window using some filtering kernel (a fixed pattern of multiplicative scalars/weights). Naive code keeps loading this whole window once for each iteration of the loop to address one output position. It is effectively stateless except for the loops to visit every output. Pipelined code would manage a window buffer and fetch and replace only the fringe of the buffer while holding and reusing inputs from the previous iterations. While this makes the loops more stateful, it can dramatically reduce the number of memory reads and shift a memory-bound code back into a CPU-bound regime.
Other major optimizations for these N-dimensional convolutions are done by the programmer/designer: some convolution kernels are "decomposable" and the expensive multi-dimensional operation can be strength-reduced to a sequence of 1-dimensional convolutions with appropriately chosen filter kernels to produce the same mathematical result. This optimization has nothing to do with SIMD but simplifies the work to embody the operation. Fewer input values to read (e.g. a 1D line of neighbors instead of 2D box) and the number of arithmetic operations to do all the multiply-and-add steps to produce the output. Imagine a 2D filter that operates on an 8x8 window. The 2D convolution has to sample and combine 64 input neighbors per output, while the decomposed filter does 8 neighbors in each axis and so one quarter as many steps total after one pass for the X axis and one pass for the Y axis.
Naively, decomposition is done as separate 1D passes over the data, and so performs more memory writes and allocates an additional intermediate array between the original input and final output. This is often a big win in spite of the extra memory demands. It's a lot more coding, but this could also be "pipelined" to some degree in N dimensions to avoid pushing the entire intermediate result arrays through main memory. Approaches differ, but you can make different tradeoffs for how many intermediate results to store or buffer versus redundant calculation in a less stateful manner.
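As a concrete (non-SIMD) illustration of the decomposition described above, a separable KxK filter applied as two 1D passes might look roughly like this (odd-length 1D kernel assumed, borders simply clamped for brevity):

    #include <cstddef>
    #include <vector>

    // Separable KxK convolution as two 1D passes over an intermediate buffer:
    // 2K reads per output instead of K*K, at the cost of an intermediate array.
    void SeparableConvolve(const std::vector<float>& in, std::vector<float>& out,
                           size_t width, size_t height,
                           const std::vector<float>& kernel1d) {
      const ptrdiff_t r = static_cast<ptrdiff_t>(kernel1d.size() / 2);
      const auto clampi = [](ptrdiff_t v, ptrdiff_t hi) {
        return v < 0 ? ptrdiff_t{0} : (v > hi ? hi : v);
      };
      std::vector<float> tmp(in.size());
      out.assign(in.size(), 0.0f);

      for (size_t y = 0; y < height; ++y) {      // horizontal pass: in -> tmp
        for (size_t x = 0; x < width; ++x) {
          float acc = 0.0f;
          for (ptrdiff_t k = -r; k <= r; ++k) {
            const ptrdiff_t xx = clampi(static_cast<ptrdiff_t>(x) + k,
                                        static_cast<ptrdiff_t>(width) - 1);
            acc += in[y * width + xx] * kernel1d[k + r];
          }
          tmp[y * width + x] = acc;
        }
      }
      for (size_t y = 0; y < height; ++y) {      // vertical pass: tmp -> out
        for (size_t x = 0; x < width; ++x) {
          float acc = 0.0f;
          for (ptrdiff_t k = -r; k <= r; ++k) {
            const ptrdiff_t yy = clampi(static_cast<ptrdiff_t>(y) + k,
                                        static_cast<ptrdiff_t>(height) - 1);
            acc += tmp[yy * width + x] * kernel1d[k + r];
          }
          out[y * width + x] = acc;
        }
      }
    }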
Generally, as you make the above kinds of optimizations your code becomes more "block-based", loading larger chunks of data with fewer changes to new "random" memory locations. This is very much like how databases and filesystems optimized their access to disk storage in prior decades to avoid random seeks for individual bytes or blocks and to instead use clever structures to support faster loading of sets of blocks or pages. When this is successful, your program does most of its memory access sequentially and achieves higher throughput that is bound by total memory bus bandwidth rather than random access latency limitations.
The final form of "pipelining" matters once you really hit your stride with these optimized, block-based programs. All this access again can cause cache spilling as you sustain high bandwidth memory access. But if you can provide the right hints to the CPU, or if the CPU is clever enough to detect the pattern, it can shift into a different cache management regime. Knowing that you will only access each block of data once and then move on (because you are holding the reusable state in registers), the CPU can use a different prefetching and spilling policy to both optimize for your next memory access and to avoid churning the whole cache with these values you will only read once and then ignore. This reduces the impact of the code on other concurrent tasks which can still enjoy more reasonable caching behavior in spite of the busy neighbor.
Generally yes to the first question, no to the second.
If you want your code to have a low impact on other concurrent work done by the system, implementing it in an inefficient way doesn't usually help with that goal, since then your code will just be executing all the time because it takes so long to finish. And you still won't have good control over execution resources compared to using normal tools like OS scheduling policies to address the goal.
Sorting is one of a few things that computers spend so much time doing that even a relatively small improvement over the state of the art can have a big impact.
On the one hand this means if you can come up with a better way to do it in some case that's awesome and could be very valuable. On the other hand it means a lot of people have thought pretty hard about it and the next improvement may be hard to come up with unless you pick a tight enough subset of the problem.
Thanks for your post! Let me explain my problem, which fits right to it. It works as an example, too.
So, in freepascal there's some subpar pattern matching/string search function. It annoyed me so greatly, I've started researching into the "state of the art" and apparently they're all crap. My fixed size pattern matching solution fits into less than 256 bytes, not counting setup code, which is minimal.
And here the problems start for me, because I have no way of actually comparing my solution to "the state of the art": it doesn't offer me downloadable, compiled benchmarks I can just run against my own, and learning a new language (like, I don't speak C and certainly never will), installing a compiler, getting through all the walls that will inevitably be in my way to getting it to run, etc. ... totally breaks my time budget, so I drop the problem.
Basically, what I'm looking for is:
Give me a problem that's worth tackling, with a defined set of parameters and a compiled solution I can compare my own against and I will beat it eventually. I've tried doing this on my own, but that's just a big no-go on many levels, so I'm dependent on someone handing something to me.
Like ... the QuickSort. I have no way of comparing my potential own solution against "the state of the art", which is - as far as I know - pretty much all that's holding me back.
I mean, I guess you could call a B-Tree over a subset of columns "columnar"... And bit-map indexes are by columns too — but that's really bending the definition of what most people mean by columnar storage in databases.
Re-implementation of stuff with SIMD is always amazing to me. I have done stuff with SIMD before: 4 element float vector operations; basic arithmetic on arrays of floats.
Those are things that are super simple, once you get past the scary mystique of SIMD.
It’s stuff like this that should be getting all the witch burning :D
I'd recommend building using Bazel (perhaps via the bazelisk launcher), then it would be: bazel run -c opt :bench_sort (or :bench_parallel). Will verify that on Tue and put it in a README in contrib/sort.
Actually we also support floating-point (32/64 bit, possibly even 16).
I've previously also looked into radix sort: https://arxiv.org/abs/1008.2849
Radix sort may actually be faster if you know that only a few bytes are guaranteed to distinguish keys, but that's difficult to guarantee/assume at the library level.
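For illustration, a minimal (scalar) LSD radix sort over 8-bit digits; the point is that each pass handles one byte, so if only a few bytes distinguish the keys, the remaining passes can be skipped:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LSD radix sort: one counting pass plus one scatter pass per byte. If only
    // the low `passes` bytes distinguish keys, the other passes can be skipped.
    void RadixSortU32(std::vector<uint32_t>& keys, int passes = 4) {
      std::vector<uint32_t> tmp(keys.size());
      for (int pass = 0; pass < passes; ++pass) {
        const int shift = pass * 8;
        size_t count[256] = {0};
        for (uint32_t k : keys) ++count[(k >> shift) & 0xFF];
        size_t offset[256];
        size_t sum = 0;
        for (int d = 0; d < 256; ++d) { offset[d] = sum; sum += count[d]; }
        for (uint32_t k : keys) tmp[offset[(k >> shift) & 0xFF]++] = k;
        keys.swap(tmp);  // sorted-by-this-digit data is now back in `keys`
      }
    }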
> By comparison, the standard library reaches 58/128/117 MB/s on the same CPU, so we have managed a 9-19x speedup depending on the type of numbers.
That's pretty disingenuous. The standard implementation is known to be terribly slow because of constraints. They should compare against current state of the art.
I don’t see any mention in the paper of thermal clock throttling concerns, which can really neuter performance of tools that sustain use of AVX operations over a period of time. For the quick benchmarks presented in the paper, of course it will be faster. What if I’m continuously hammering my CPU with AVX operations? I expect it to severely downclock.
On Ice Lake Xeon the penalty for using the AVX-512 features on a single core is -100MHz. If we pessimistically use the slowest part Intel sells, that is a 5% performance penalty (2% on their fastest parts). The speedup from this work is 40-60% compared to AVX2. So you'd be a fool to take the side of the folk myth. AVX-512 works.
By the way the performance penalty for using AVX-512 on multiple cores when the multiple cores were already active is zero. There is no penalty in most server scenarios.
> On Ice Lake Xeon the penalty for using the AVX-512 features on a single core is -100MHz.
That is a penalty due to licensing [0], not thermal throttling. As I wrote elsewhere, I’ve seen my clockspeed get cut in half across all cores on a physical die when running AVX-heavy operations for a sustained period of time, due to thermal throttling.
The default AVX offset for Ice Lake is indeed only 100MHz (and it doesn't exist starting with Rocket Lake), but 512b SIMD instructions use a lot of power, and as a result generate a lot of heat - so they certainly can cause thermal throttling or throttling due to power limits
In datacenter blade servers (i.e. on a cloud VM), I’ve noticed up to 50% downclocking due to thermal throttling when running sustained frequent AVX operations.
I’m sure an exotic watercooled setup will fare much better, but those aren’t generally what we run in production.
My laptop has been throttling itself for a while, I recently discovered. I had been trying to benchmark some code changes and have given up and am letting the CI machine run them, because my numbers are all over the place and go down with each run.
One option would be to go into the BIOS and see if there's some way of just locking your CPU to one of the lower clock speeds. This will give lower benchmarking numbers of course, but at least they should be fairly stable. (In Linux, it is also often possible to tinker with frequencies while the system is running.)
Even on a desktop this sort of thing is sometimes necessary, for example my CPU has different clock speeds depending on how many processors are running, so I have to lock it to the all-core clock if I want to see proper parallel speedups.
This might be annoying for day-to-day usage (although, CPUs really are insanely performant nowadays so maybe it will not be too bad).
Quote: "Today we're sharing open source code that can sort arrays of numbers about ten times as fast as the C++ std::sort...." and proceeds to explain about SIMD instructions.
Well, to me it sounds like C++ std::sort is simply lacking an optimization which could very easily be implemented in the next iteration of the compiler/linker. I had high hopes this would be a real breakthrough algorithm, not just the absence of a simple optimization using some instruction set that the compiler/linker didn't implement. If they wanted, CPU vendors could make an even more awesome and faster instruction set, let's say SIMD2, which would leave this one in the dust.
It depends on the goals of the library maintainers. Users who really care about speed may already be using another library such as PDQsort or ips4o. Given that, the stdlib maintainers may reason that adding code and 'complexity' (not all developers understand or are able to maintain SIMD) is not worthwhile.
Conversely, they may prefer to work towards the standard library being the fastest known way of doing things. This is now much more feasible given the single portable implementation, vs. having to rewrite thousands of lines for six instruction sets.