Algorithms for Modern Hardware (algorithmica.org)
308 points by throwaway2037 6 months ago | 76 comments



Previous postings (including one from 30 days ago; this submission seems to have come from 'tosh 8 hours ago, been given a "second chance" by the mods, and then been merged with a different posting by 'throwaway2037 from 2 days ago, and I'm happy for that):

March 7, 2022 - 62 comments - https://news.ycombinator.com/item?id=30583808

Feb 17, 2022 - 49 comments - https://news.ycombinator.com/item?id=30376140

Feb 12, 2022 - 44 comments - https://news.ycombinator.com/item?id=30311112

Nov 1, 2022 - 33 comments - https://news.ycombinator.com/item?id=33427349


Great. Unfortunately the English version is missing most of the algorithms (or rather, they haven't been written yet).

Of course nothing can be complete, but I had the following pieces of feedback (I plan on opening a GitHub issue later):

The compiler flags section is missing the following:

* -Os - optimize for size

* LTO (link-time optimization)

* Compiling with -ffunction-sections -fdata-sections and then linking with -Wl,--gc-sections (arguably this one just reduces the size of the binary through dead code elimination)
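
For concreteness, a hedged example of how these might be combined with GCC or Clang (exact flag behavior and interactions vary by toolchain, so measure rather than assume):

    # optimize for size, with link-time optimization (pass -flto to both steps)
    g++ -Os -flto -c foo.cpp bar.cpp
    g++ -Os -flto foo.o bar.o -o app

    # give each function/object its own section, then let the linker drop the
    # unreferenced ones (mostly a binary-size win via dead code elimination)
    g++ -O2 -ffunction-sections -fdata-sections -c foo.cpp bar.cpp
    g++ -O2 foo.o bar.o -Wl,--gc-sections -o app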

Other:

PGO should probably be mentioned in the flags and targets section as “we’ll describe this in situational optimizations“ (not sure why it’s written there but whatever).

AutoFDO isn't discussed, which is important since it makes running FDO actually feasible in production (a sketch of the basic PGO loop follows below).

Cold annotations and hot/cold code splitting (the latter isn't discussed at all).
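
For reference, a minimal sketch of the classic instrumented PGO loop with GCC (AutoFDO instead feeds in perf samples collected from production, which is what makes FDO feasible at scale):

    # 1. build with instrumentation
    g++ -O2 -fprofile-generate app.cpp -o app
    # 2. run a representative workload (writes .gcda profile files)
    ./app < training_input
    # 3. rebuild using the collected profile
    g++ -O2 -fprofile-use app.cpp -o app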


The Russian version is more about algorithm design 101 and 102 (similar to cp-algorithms.com). I used to do competitive programming, and I co-founded an educational nonprofit where I also taught it for a few years. I organized my lecture notes and put them on a website, which is now used as a textbook by most CS students in Russian-speaking countries.

I do intend to translate it someday, but it has nothing to do with performance engineering :)


The proposed ToC looks amazing, and I can't wait to get my hands on this book. One thing that seems omitted, though, and can have a significant impact on performance, is I/O. That would have been a nice add-on.


Does anyone know if there is any research around designing a compiler + language that produces optimal code? Besides superoptimization, that is.

Specifically, in what way should a language be designed to aid the compiler in finding the optimal code?

For example, for matrix multiplication, why can't the compiler find the optimal code? Why are BLAS libraries written in assembly for each CPU architecture? Isn't this a question of having a high-level CPU spec (how many ports, instruction latencies, throughputs) and then using a constraint solver to find the optimal code?


Compiler optimization is an extremely hard problem, unsolvable by our current standards, and the compiler usually has less knowledge than the author of the code and has to make assumptions about the hardware it is going to run on. Language design feeds heavily into the types of optimizations that are even possible to reason about.

So yes, there has been an enormous amount of research into the programming-language side of this equation, the compiler side, and the area in between. Probably hundreds of thousands of research papers since the 1950s or 60s.

Compilers and automated systems, in the general case at least, still can't out-compete a domain expert who is familiar with the specific hardware being targeted and armed with a good profiler and the time to fiddle with the raw assembly.


> Why are BLAS written in assembly for each CPU architecture?

On that specific subject of SIMD portability, an embedded DSL like Highway, developed by Google [0], can help. Hopefully it'll gain traction and usage.

[0] https://github.com/google/highway


With a typical workflow of “just compile this source code”, this is an impossible task, because performance of algorithm implementations depends on the actual data they process (think how SQL plans queries based on table stats). Profile-guided optimization already improves performance significantly despite merely informing some heuristics. A truly optimal compiler would need to design algorithms for specific data sets (basically what people do when they profile the code).

However, there are some takes beyond the "Algol 68 on a PDP-11" design of most languages. There's Intel's SPMD Program Compiler (ispc), which extends C to make auto-vectorization a first-class feature.

There is the Halide language, specifically for image processing, which separates the goal from the algorithms used to achieve it.


This strikes me as akin to asking why an algorithm can't find the lowest-cost path that visits a set of nodes, a.k.a. the traveling salesman problem.

A language that helps the compiler has the usual properties we already know: provide as much info as possible with types, and restrict the set of allowable expressions to make the search tractable (like CUDA, ownership and alias analysis, and so on).


For one, different hardware caches behave differently.

Level 3 BLAS functions like matrix-matrix products perform O(N^3) operations on O(N^2) data.

The main way BLAS speeds up Level 3 functions is cache optimization, which is hardware-dependent.

But optimization problems in general are not trivial for computers.
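
To make the cache-optimization point concrete, here is a minimal, untuned sketch of loop tiling for a row-major matrix product (my illustration, not actual BLAS code; real kernels add packing, register blocking, and SIMD, and the tile size below is a placeholder to be tuned per CPU):

    #include <algorithm>

    // C += A * B for n x n row-major matrices; T is a tile size chosen so that
    // a T x T block of each operand fits in a fast cache level
    void matmul_tiled(const float *A, const float *B, float *C, int n, int T = 64) {
        for (int ii = 0; ii < n; ii += T)
            for (int kk = 0; kk < n; kk += T)
                for (int jj = 0; jj < n; jj += T)
                    for (int i = ii; i < std::min(ii + T, n); i++)
                        for (int k = kk; k < std::min(kk + T, n); k++) {
                            float a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + T, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }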


Obligatory mention of cp-algorithms: https://cp-algorithms.com/index.html


Love cp-algos. Great quality, also love the sample code.

One of my favourite write-ups: https://cp-algorithms.com/string/z-function.html


Amazing book!


After noticing the Quake 3 rsqrt algorithm in here, I cannot help but question the quality of this whole site. As in: is it grounded in any actual practical reality of writing code for modern hardware?

For reference, Quake-style rsqrt hasn't been practical on x86 for more than a decade, as Intel processors internally separate integer and float SSE registers, and converting between the two is costly.

Instead, there is a built-in intrinsic, _mm_rsqrt_ps, which does a limited-precision rsqrt in hardware.
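
For example, a rough sketch (mine, not from the article) of using the hardware approximation plus one Newton-Raphson refinement step:

    #include <xmmintrin.h>

    // approximate 1/sqrt(x) for four floats at once; _mm_rsqrt_ps gives roughly
    // 12 bits of precision, and one Newton-Raphson step, r = r*(1.5 - 0.5*x*r*r),
    // gets it close to full single precision
    __m128 rsqrt4(__m128 x) {
        __m128 r = _mm_rsqrt_ps(x);
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f),
                                     _mm_mul_ps(half_x, _mm_mul_ps(r, r))));
        return r;
    }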


The rsqrt page mentions hardware implementations:

https://en.algorithmica.org/hpc/arithmetic/rsqrt/

> In fact, it is so good that it has been implemented in hardware, so the algorithm is no longer relevant by itself for software engineers, but we are nonetheless going to walk through it for its intrinsic beauty and great educational value.


This is a wild quote:

<< Among the cool things that we will speed up:

    2x faster GCD (compared to std::gcd)
    8-15x faster binary search (compared to std::lower_bound)
    5-10x faster segment trees (compared to Fenwick trees)
    5x faster hash tables (compared to std::unordered_map)
    2x faster popcount (compared to repeatedly calling popcnt)
    35x faster parsing series of integers (compared to scanf)
    ?x faster sorting (compared to std::sort)
    2x faster sum (compared to std::accumulate)
    2-3x faster prefix sum (compared to naive implementation)
    10x faster argmin (compared to naive implementation)
    10x faster array searching (compared to std::find)
    15x faster search tree (compared to std::set)
    100x faster matrix multiplication (compared to “for-for-for”)
    optimal word-size integer factorization (~0.4ms per 60-bit integer)
    optimal Karatsuba Algorithm
    optimal FFT
>>

If true, why haven't these appeared in the C++ STL?

EDIT: Formatting only


About a decade ago I wrote a fast rope library in C, with support for arbitrary inserts & deletes at arbitrary locations in large strings. I benchmarked it, and I was shocked to discover that my library was ~20x faster than the SGI C++ rope library that shipped with my compiler.

I assumed I must be doing something wrong. Eventually I took a look and found that the SGI rope library constructs a tree where each heap-allocated leaf node contains only a single character from the document. No wonder it's so inefficient. I'm somewhat horrified by the idea that anyone is using that implementation in their software.

Sometimes I think about how much money people would pay to make their computer run 20x faster. Our society spends a truckload of money on computing hardware - but most of that hardware is wasted running wildly inefficient programs. Modern computers should do basically everything instantly. It's weird to me that we'll pay thousands of dollars for faster CPUs then write new software in Python, Electron or a Docker container running in a VM.


> Sometimes I think about how much money people would pay to make their computer run 20x faster.

I think people who can reliably do this are paid $250-750k/year at top paying companies.


I think there are almost 0 positions in industry for this sort of thing.


There are plenty, actually. Source: did this professionally for a bit


I'm interested to hear more. What kind of companies pay for this? Hardware companies like Intel and Nvidia? What are the roles called?


There are several branches, but all of them typically run at a scale where performance costs them money. People with large server deployments (any "hyperscale" company) hire these people to save millions of dollars a year. People who do client performance attract and retain users. Those who do trading bank on being the fastest. And many other areas; these are just the big ones.


Interesting. Does it need a PhD or a specific master's? I feel that no one needs that kind of performance at home, so the only places to train are academia or industry, and industry usually does not train juniors on that topic.


This type of optimization isn't something you typically study as an academic. Everyone I know that does this kind of work (these jobs definitely exist) was self-taught by experimenting in their own time because they had a passion for performance optimization. It is a rewarding specialization in that there is a quasi-objective measure of incremental progress.

Much of this optimization work requires specific knowledge of the details of the operating environment (e.g. Linux) and silicon microarchitecture (e.g. AVX-512), which can be poorly documented. You have to be comfortable doing experiments and digging into system arcana to surface properties of the system that you can't trivially google. While algorithm selection is important, doing that well is table stakes and part of the role is knowing when and why an "optimal" algorithm is worse than e.g. selectively applied brute-force.

Computational efficiency and throughput is worth a lot of money at scale but most of it is designed and implemented at the scale of a single machine. Every software engineer owns the tools required to become proficient at this. It is genuinely a rare skill even among systems programmers and is a good way to separate yourself from the crowd.


I just became an AVX2 programmer for fun. You can too: if you go to the highload.fun group chat, I or someone else will suggest blazing-fast ideas for every problem that happen to be about half as fast as our own approaches.


No. Usually you learn this stuff by doing it at a smaller scale elsewhere.


The FAANGs certainly have people doing this kind of work. For example, Andrei Alexandrescu has a bunch of keynote talks at CppCon where he talks about 1-4% speed optimisations he makes to the std library at Facebook.

https://youtu.be/FJJTYQYB1JQ

I also like the talk by Nicholas Ormrod “The strange details of std::string at Facebook":

https://youtu.be/kPR8h4-qZdk


Various types of hyper-scale or high-performance computing, where incremental improvements in throughput, efficiency, latency, and resource utilization save millions of dollars. Hardware companies are not where you go for this, though they do offer some limited micro-optimization support for the companies that do care about this kind of thing. You want to focus on companies that spend enormous amounts of money on compute infrastructure. The appetite to invest in compute efficiency waxes and wanes with the economy. If companies can get away with throwing money at a problem, they will, but that can quickly become untenable.

A closely related area is slightly bending the scaling linearity curve on e.g. big multi-core servers or scale-out systems, so that it is possible to efficiently throw hardware at problems. However, this operates from a pretty different set of theoretical principles than classic performance optimization.

Two domains that have an almost unlimited appetite for improved performance and efficiency right now due to current bottlenecks are AI and sensor processing.


Algotrading companies.


I'm certainly intrigued and I'll shoot you an email.


It should also be noted that you can get a significant speedup by writing situation-specific code instead of handling the general case, if your problem allows it.

This all but guarantees that custom code, written competently, will always be able to outpace standard library code. The standard library is one of the few things that probably always needs to handle the general case.
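
A toy illustration of the idea (hypothetical, not from the thread): if your input format guarantees fixed-width, zero-padded decimal numbers, you can drop all the generality of strtol (signs, whitespace, variable length, overflow checks):

    #include <cstdint>

    // parse exactly eight ASCII digits, e.g. "20240131"; no validation and no
    // branches, so the compiler can unroll or vectorize it freely
    uint32_t parse_fixed8(const char *s) {
        uint32_t v = 0;
        for (int i = 0; i < 8; i++)
            v = v * 10 + uint32_t(s[i] - '0');
        return v;
    }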


> Sometimes I think about how much money people would pay to make their computer run 20x faster.

Bottlenecks whose removal would make a computer 20x faster are very rare. First, you'd pay $1M for a specialist just to find them. It is even rarer for such a bottleneck to affect the entire service; more likely it's an accidental performance regression. So, basically, performance monitoring is the best use of the money.

> It's weird to me that we'll pay thousands of dollars for faster CPUs then write new software in Python, Electron or a Docker container running in a VM.

The labor costs, and the commercial opportunities lost to delays in development, matter more.


>It's weird to me that we'll pay thousands of dollars for faster CPUs then write new software in Python, Electron or a Docker container running in a VM.

Python has a REPL, duck typing, and bindings for many libraries, including GUI libraries.

Electron is basically a browser you can program via JavaScript, with everything you'll ever need to make a GUI application.

They both make a lot of sense when you consider the alternatives. However, I do wonder why there isn't yet an option to AOT-compile Python and Electron projects into fast executables.


    > then write new software in ... Electron
Yeah, I agree. The OP's point is a bit misguided. The reason why people use Electron is it is cheaper to develop an app in Electron. By "cheaper", I mean the cost of developers plus the hardware needed to run it well. How much does an enterprise desktop PC with a 3+GHz CPU cost? Surely less than 2K USD. How much does a developer cost? It is hard to pay less than 75K USD per year.

And about Python: Almost no one is doing CPU intensive work in pure Python. They are using C-language extensions like Pandas, NumPy, etc.


> How much does an enterprise desktop PC with a 3+GHz CPU cost? Surely less than 2K USD. How much does a developer cost?

But the software runs on the computers of all of their users. How much money do their users' computers cost? If you have just 10k users, probably $10M or so? But software companies don't pay for my computer's resources. The money I need to spend on computer hardware isn't priced in to your calculus at all. They don't care.

I find it pretty frustrating and sad.


The semantics make AOT-compilation-based speedups basically impossible; Python is just way too dynamic to do anything with. If you need speed and nice language features, you're much better off using a high-level language with semantics that are better suited to speed. Julia is my preferred option, but Lua and a few others fit this niche well.


Because the STL makes a bunch of guarantees that can't be broken. Like that an object won't ever move within a hashmap, even on edits. Or that std::set has to use buckets and linked lists.

Basically they've overspecified their datastructures and we're now paying the price


std::set uses red-black trees; did you mean std::unordered_set?


Yes, the common complaint about separate chaining being effectively mandatory is about unordered_map and unordered_set. It looks like std::set suffers from essentially the same pathology though: you cannot use implementations with good data layouts (B-trees) because the standard imposes unreasonable requirements about iterator invalidation.

See https://stackoverflow.com/a/26552219
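
As a concrete illustration of the guarantee in question (my sketch): the standard requires references to unordered_map elements to survive rehashing, which is what effectively rules out flat, open-addressed storage in a conforming implementation:

    #include <cassert>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m;
        m[1] = "one";
        std::string &ref = m[1];          // reference into the container
        for (int i = 2; i < 100000; i++)  // forces several rehashes
            m[i] = "filler";
        assert(ref == "one");             // must still be valid: elements never move
    }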


> Basically they've overspecified their datastructures and we're now paying the price

I strongly disagree, and I'm perplexed that anyone can describe fundamental traits, such as the object lifetimes of standard data structures, as being overspecified.

Just imagine the shit show if upgrading your compiler broke your code because std::set started throwing exceptions after someone sneaked in a major breaking change, such as moving objects that should not be moved.

It's also perplexing how breaking backward compatibility is depicted as a perfectly acceptable thing to do to a major programming language, while completely ignoring the option of releasing the code as a third-party library. If a new std::set alternative is any good, people would be lining up around the block to adopt it. I mean, it's already standard practice in game development to use custom data structure implementations with custom allocators. Why is that not an option, but breaking half of the world's code suddenly is?


Because if you need something to remain in the same place, you box it yourself. Ruining the performance for everyone else because you don't want to handle some boxing on your own is hardly reasonable. This goes fully against the C++ mantra of "if you don't use it, you don't pay for it". It's why you have sort and stable_sort rather than unstable_sort and sort.

Just because Hyrum's Law applies to an implementation of the standard doesn't mean that you should pessimise your implementation. You should actively hurt those who rely on implementation quirks.
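
"Boxing it yourself" can be as simple as this sketch: keep the fast, relocating container, and give address stability only to the objects that actually need it:

    #include <memory>
    #include <string>
    #include <vector>

    struct Record { std::string name; };

    int main() {
        // the vector may reallocate and move its elements (the unique_ptrs)...
        std::vector<std::unique_ptr<Record>> records;
        records.push_back(std::make_unique<Record>(Record{"first"}));
        Record *stable = records[0].get();
        for (int i = 0; i < 1000; i++)
            records.push_back(std::make_unique<Record>(Record{"filler"}));
        // ...but the Record objects themselves never move, so 'stable' stays valid
        return stable->name == "first" ? 0 : 1;
    }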


Isn't the STL notoriously slow relative to most hand-rolled C++ implementations of equivalent functionality?

I've always understood the STL to prefer rock-solid safety + stability + compatibility (+ a bit of dogfooding) across the widest applicable domain (scale, etc.) over pure, unadulterated performance, but I haven't coded in C++ since college, so I know nothing.


The standard library has a bunch of requirements that a random third-party library likely does not have. Like ABI compatibility (most people recompile from source often, so it isn't needed). Like exception safety (most people turn off exceptions anyway, so they don't need such safety).

Some parts of the standard library, like std::vector, are already good enough. Other parts, like std::unordered_set, are rarely used in industry unless keeping dependencies low is a requirement.


I very much doubt most people coding in C++ would turn off exceptions. Virtually every new runtime construct being added is designed for exceptions.

There are some niches where exceptions are frowned upon, but those are small.


Define "most" here a little bit more. I would imagine most people using it are hobbyists/students, and in that case I think you're right.

But most people using it in industry fall into two camps:

1. using it because the project is old enough that C++ was a reasonable default when the project started. You may also be right here

2. using it because performance is absolutely critical to the application; here I would actually imagine you're wrong. noexcept removes a ton of stack-unwinding code, and you can absolutely get a significant performance boost out of it. It would be weird if they didn't pull this lever, given the perf boost and that errors-as-values is also a fairly reasonable way to handle problems.

In the second category, Google famously had a "no exceptions" rule at one point (it could still be in effect; I have no knowledge of Google).


They do have that rule, but not for performance reasons. The full decision is here https://google.github.io/styleguide/cppguide.html#Exceptions and is probably worth a read, but the main takeaway, I think, is:

> Because most existing C++ code at Google is not prepared to deal with exceptions, it is comparatively difficult to adopt new code that generates exceptions.


I see noexcept as a part of the language that works together with exceptions; it is a valid optimization for certain (typically small) pieces of a code base that otherwise uses exceptions.

I'd imagine when people say "not using exceptions", it is about using compiler flags to completely disable the exception mechanisms from the language. And this also invalidates many of the design patterns C++ is known and sometimes praised for, at least RAII and operator overloading.


The latest version of the Google C++ Style Guide says: "We do not use C++ exceptions."

Ref: https://google.github.io/styleguide/cppguide.html#Exceptions


Unfortunately a significant portion of feedback for the C++ standard library comes from Google LLC


ABI compatibility is needed in 99% of cases, since you basically never compile your own STL. You dynamically (or statically, if you will) link against the one you have on the system, since most software does not have control over the machines/environment where it's going to be deployed. Even in cases where you do have control over your deployment environment, compiling your own libc++/libstdc++ is a complication you usually want to avoid.


The C++ standard library is about 90% template code in headers. Yes, you compile it all the time, in your own environment. You have no choice.


The fact that you will use a class or function template from the STL, and compile it, doesn't change the fact that you still have to link against the libstdc++ or libc++ that you didn't compile yourself. I am not sure I understand the argument you're trying to make.


I am inclined to believe those numbers :)

That's an interesting question. One partial answer is that the hash table speedup likely involves an interface change, replacing "insert single item" with batching. This is a generally helpful principle, but unfortunately does not match some existing abstractions.


The API design of STL is often over-constrained.

That is, the style is very kitchen-sink and generalized, which means implementers have little flexibility.

Every “real” C++ codebase I’ve worked on has its own data structures for most things. And that’s a feature of C++, not a bug.

For example, don’t use std::unordered_map for anything performance sensitive. It’s a shockingly slow hash table because the interface constrains it to allocate a lot.

The STL isn’t designed for optimum performance. It’s easy to do better by satisfying your app’s requirements and nothing else.


There’s also other reasons. For example, take binary search:

* prefetch + cmov. These should be part of the STL, but languages and compilers struggle to emit the cmov properly (Rust's been broken for 6 years: https://github.com/rust-lang/rust/issues/53823). Prefetch is an interesting one: while you do optimize the binary search in a microbenchmark, you're potentially putting extra pressure on the cache with "garbage" data, which means it's a greedy optimization that might hurt surrounding code. There should probably be separate implementations, as binary search isn't necessarily always in the hot path. (A branchless-with-prefetch sketch follows at the end of this comment.)

* Eytzinger layout has additional limitations that are often not discussed when pointing out "hey, this is faster". Adding elements is non-trivial, since you first have to add + sort (as you would for binary search) and then rebuild a new parallel Eytzinger layout from scratch (i.e. you'd have it be an index of pointers rather than the values themselves, which adds memory overhead + indirection for the comparisons). You can't find the "insertion" position for non-existent elements, which means it can't be used for std::lower_bound (i.e. if the element doesn't exist, you just get None back instead of Err(position where it can be slotted in to maintain order)).

Basically, optimizations can sometimes rely on changing the problem domain so that you can trade off features of the algorithm against the runtime. These kinds of algorithms can be a bad fit for a standard library which aims to be a toolbox of “good enough” algorithms and data structures for problems that appear very very frequently. Or they could be part of the standard library toolkit just under a different name but you also have to balance that against maintenance concerns.
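
To make the prefetch + cmov point concrete, here is a rough sketch (my paraphrase of the technique, not the book's exact code) of a branchless lower bound; whether the multiply actually becomes a cmov is at the compiler's mercy, which is exactly the fragility mentioned above:

    // returns a pointer to the first element >= x in a sorted array of length n > 0
    // (__builtin_prefetch is a GCC/Clang builtin)
    const int *lower_bound_branchless(const int *a, int n, int x) {
        const int *base = a;
        int len = n;
        while (len > 1) {
            int half = len / 2;
            __builtin_prefetch(&base[half / 2]);        // roughly the middle of the left half
            __builtin_prefetch(&base[half + half / 2]); // roughly the middle of the right half
            base += (base[half - 1] < x) * half;        // ideally a cmov, not a branch
            len -= half;
        }
        return base + (*base < x); // one past the end if every element is < x
    }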


Is there a statically-compiled language where the subset of methods of an API or ADT that ever get referenced determines, at link time, which implementation of the API or ADT gets linked + WPOed against?

Because it seems “obvious” to enable programs that don’t depend on the over-constraining parts of things like std::unordered_map, to specialize the implementation they’re using so that it doesn’t employ the overhead-laden parts required to make those uncalled methods function. And “just have two copies of the code and pick between them based on control-flow info” seems like a very trivial way to do it for at least some low-hanging-fruit cases.


Author here. I published most of these in one batch two years ago, and this is a relatively short time for compilers and libraries to catch up (when Daniel Lemire publishes something, it also usually takes a few years before it makes its way to the standard libraries, and he is much more involved in the downstreaming process than I am).

In my opinion, the main challenges are:

1. Lack of suitably narrow abstractions. E.g., my "binary search" implementation requires building a small static structure (6-7% of the original array size), and although it is extremely hard to imagine a practical use case where you get a sorted array but can't spend linear time on its preprocessing, it's technically not a drop-in replacement for std::lower_bound.

2. Backwards compatibility. E.g., std::set has a requirement that a node deletion should not invalidate other iterators, which makes it harder to implement it as a B-tree, which has to move a lot of keys around after a node merge/split.

3. Performance regressions. Sometimes a change can make a program 2x faster in most use cases but 1.5x slower in some specific one. If the hardware handling that use case was already at 90% capacity, it will now start to fail after the upgrade, while a 2x improvement on other use cases is just "nice" and doesn't offset it.

4. Vagueness of "better". There are about 10 blog posts on the internet now claiming they have designed the fastest hash table in the world—and every one of them is right because they are using different benchmarks tailored to their specific data set and their specific hardware.

5. Desire to implement things more generically in the middle-end of a compiler instead of the standard library, which is much harder to do. You don't want to hand code the optimal SIMD procedure for calculating the sum of an array for each CPU microarchitecture; you want the compiler to do it automatically for everything that resembles a simple "for" loop. This also leads to a diffusion of responsibility, with compiler people and library maintainers arguing over the appropriate place for an optimization to be implemented.

6. Lack of incentives. Most people who can implement these optimizations work for big tech and would look better in their performance review by contributing to their employer's library (e.g., Abseil for Google, Folly for Meta), or at least to a library with a less Kafkaesque review process like Boost, rather than the standard library.

7. Things still being in the research stage. For example, I recently discovered (but haven't published yet) a new GCD algorithm that seems to yield another ~2x improvement over binary GCD (~4x over std::gcd), and so the guy who recently pushed it in libc++ has in a certain sense wasted work.

I haven't rerun benchmarks myself, but I believe some relatively decoupled parts of the STL have actually since been upgraded in some compilers (std::lower_bound is now branchless, std::gcd now uses binary GCD, std::accumulate and similar reductions now use instruction-level parallelism when they see it) although in all these cases I didn't discover but at most only popularized them.
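
For illustration (a sketch, not the actual library code), the instruction-level parallelism trick for reductions is just to keep several independent accumulators so the additions don't form one long dependency chain; note that for floating point this changes the summation order, which is why compilers only do it on their own under -ffast-math or similar:

    #include <cstddef>

    double sum_ilp(const double *a, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // four independent dependency chains
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)                       // leftover elements
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }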


For the binary search index speed up, is that documented somewhere? The English version talks about Eytzinger layout but I think you’re referring to something else since that’s not an ancillary array size.

Great job on this book. Lots of great content that can be used as reference.



One also needs to look at the context of these quotes. For example, for the binary search they change the array layout to improve cache locality. This is not really helpful if you have to work with sorted arrays (as many binary search algorithms do). If instead you have the freedom to optimize the data structure itself (and you do the search often enough to amortize the construction), you might as well build some variant of a B-tree which is going to be much faster than binary search anyway.


I'd like to see a benchmark showing a B-tree beating an array with the correct layout. I think you only want a tree if you also want to update the array.


You can implement your B-tree as an array if that's what you like. The advantage of B-tree-like layouts is that you need fewer indirections. With binary search you do one node jump per comparison. With a B-tree you do one jump per N comparisons.

Of course, the detail is in trying it out; intuition does not always capture hardware behavior. I remember looking at this many years ago and B-trees were considerably faster. The code was far from optimal though. Maybe if one pulls all the tricks in the book, the additional complexity of managing multiple comparisons will outweigh the savings from reduced loads.


Author here.

For a perfect drop-in replacement of std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can sometimes break from version to version because there are no semantics to reliably make compiler use predication instead of branching.

Your intuitions are very much correct: you can get rid of pointers in a B-tree and make it static and implicit and fast, especially if you use SIMD to search for the lower bound within a node [2], but it would technically not be a replacement to std::lower_bound as we need to build an additional structure (even though it's very hard to imagine a scenario where you obtain a sorted array but can’t afford to spend linear time on preprocessing). C++23 has since added std::flat_set, which seems to be an appropriate place to implement it (in the article I compared against std::lower_bound because neither I nor the vast majority of the readers knew what std::flat_set was).

You can also add pointers back to support insertion and deletion with a moderate penalty to performance [3], but this dynamic B-tree is also technically not a replacement to std::set because of the extra pointer invalidations when a node merges or splits (even though in most cases you don't need pointer stability). You can fix it by, e.g., storing separate pairs of pointers so that each iterator knows where its key is in the tree and vice versa. That would add some overhead (especially in terms of memory) but make it compliant with the standard and still quite a bit faster and lighter than std::set.

The three articles combined are like 50 pages long so for a tl;dr version you might be interested in a talk I did at CppCon [4]. You can also extend the trick for heaps, ropes, segment trees, and other tree-like structures. There is a lot of work to be done here.

[1] https://en.algorithmica.org/hpc/data-structures/binary-searc...

[2] https://en.algorithmica.org/hpc/data-structures/s-tree/

[3] https://en.algorithmica.org/hpc/data-structures/b-tree/

[4] https://www.youtube.com/watch?v=1RIPMQQRBWk


First, what a great resource you've put together! You're presenting a lot of useful material clearly and concisely, together with the reasoning behind it. I wish I had this when I started doing performance work.

> For a perfect drop-in replacement of std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can sometimes break from version to version because there are no semantics to reliably make compiler use predication instead of branching.

What's imperfect about a radix-4 (or higher) search instead of a binary search for replacing std::lower_bound? For a radix-k search, you'll have k-1 fetches in flight before you decide which subarray to recurse on, and your dep chain will be log_2(k) times shorter.


When you add prefetching (that is, compare against the middle element and fetch both the middle of the left half and the middle of the right half ahead of time) you are essentially doing radix-4 search, just with fewer actual comparisons.

(You can prefetch k layers ahead for radix-2^k search, but searching in small arrays will become more expensive this way.)

I didn't benchmark it, but I guess on mid-to-large arrays it would actually work somewhat slower than prefetching because it is more instruction-heavy, and, more importantly, prefetches are "cancelable": if the memory bus is too busy with actual requests, they will be skipped, while in explicit radix-k search you would have to wait for all (k-1) elements even if the middle element happened to be already cached and you already know which elements you need next.

That said, it could probably work with small arrays where caching is not a concern and especially if you optimize for latency and not throughput. You can also try to do the comparisons with SIMD (Wojciech Mula tried something similar and got a small boost: http://0x80.pl/articles/simd-search.html).


Thanks, this is great! I think this conversation is a great illustration for the idea that data structures can be seen as more fundamental than algorithms.


The C++ STL is generic and cross-platform, but I suspect they are doing platform-specific optimisations (x86 asm), and it's very easy to beat a compiler that way.


It's not that easy to beat compilers; they know how to generate processor-specific instructions too.


Step 1. Throw more compute at it.


And then you get hit by the Universal Scalability Law (a.k.a. the Universal Law of Computational Scalability) - see the section on https://en.wikipedia.org/wiki/Neil_J._Gunther

You can't throw more compute at a problem ad infinitum.


imho, probably preceded by

   step-0 sit quietly in a room with paper and pencil, and ... T.H.I.N.K


[flagged]


From the FAQ:

Pre-ordering / financially supporting the book. Due to my unfortunate citizenship and place of birth, you can’t — that is, until I find a way that at the same time complies with international sanctions, doesn’t sponsor the war, and won’t put me in prison for tax evasion. So, don’t bother. If you want to support this book, just share it and help fix typos — that would be enough.

I'm not sure there's much to boycott here, except the pursuit of knowledge?


People are now boycotting knowledge?


he is also loudly anti-war: https://twitter.com/sergey_slotin


NOTE: should we boycott this US site for the Iraq war then?

Another 'evil' Russian personal site from a good geek:

http://www.stargrave.org/index.html


Nice!



