Bulk Data Structures C++ (gamasutra.com)
179 points by kouh on Aug 20, 2019 | 105 comments



Game developers like me go through the stages of grief when reinventing memory management.

In this case, what will eventually be reinvented is an arena allocator.

Having just researched this, Cap'n Proto is a good implementation of one that suits game development needs: (1) flexibility, (2) no separate serialization step for networking and AI, (3) mutability of primitives, (4) garbage collection of stale objects in lists (i.e. removed items) is manual, (5) constraints to prevent non-performant design, and (6) support for these performance-sensitive idioms in multiple languages, not just C++.

Migrating to an arena allocator is a completely different can of worms...


Exactly. Also, using a vector as the underlying storage instead of sets of fixed-size memory chunks seems not ideal, to say the least.


Except that std::vector can be specialized with custom arena allocators, which isn't uncommon in performance sensitive applications.


std::vector needs to be contiguous, so you cannot use segmented storage.


Most applications need more than one vector. Management of this with custom allocators pays dividends fairly quickly.


Sorry, there is some confusion here. The OP is literally implementing a custom allocator. I'm saying that using std::vector for the backing store of your allocator is wrong because resizing will invalidate your already allocated objects (which is a no-can-do in a general purpose allocator) and the copying is wasteful. A custom fixed-size chunk-based allocator would be better.


I often read that when in doubt, use a vector. It has its disadvantages, but for performance it's usually okay. Simplicity can be a good choice.


Sound advice for general purpose programming, but not for (high-end) games.

Especially these days with multi-threaded game engines, a call to malloc() (e.g. from vector resizing) may attempt to grab a contended mutex and end up waiting until it misses its frame.

Modern game engines use a combination of memory management techniques which are tuned for different use cases. For example: large, persistent blocks of memory are allocated ahead of time. Small, transient objects are allocated from a thread-local, per-frame pool and they're never freed explicitly (at the end of the frame, memory will be reclaimed for reuse).
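A minimal sketch of such a per-frame linear ("bump") allocator, with made-up names (one instance per thread, no thread safety or growth handled):

  #include <cstddef>

  struct FrameArena {
      std::byte*  base = nullptr;          // big block grabbed once at startup
      std::size_t capacity = 0;
      std::size_t offset = 0;

      void* alloc(std::size_t size, std::size_t align) {
          std::size_t p = (offset + align - 1) & ~(align - 1);  // align must be a power of two
          if (p + size > capacity) return nullptr;              // frame budget blown
          offset = p + size;
          return base + p;
      }
      void reset() { offset = 0; }         // called once at the end of the frame
  };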

Most game engines don't use the C++ standard library containers at all in the first place. There are gamedev-flavored STL variants like EASTL, though.


For triple-A game engines attuned to work on console hardware, sure, but they're really sophisticated optimizations.

The STL is fine for a lot of things in gamedev.

Object pools are not so hard to implement; the only thing an object pool really does is reduce memory allocation and deallocation delays while keeping O(1) random access. It's just a strategy for allocating and accessing memory properly.
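For example, a minimal fixed-capacity pool for trivially constructible/destructible (POD-ish) types might look roughly like this (illustrative sketch, no growth or thread safety):

  #include <cstddef>
  #include <new>
  #include <type_traits>

  template <typename T, std::size_t N>
  struct Pool {
      static_assert(std::is_trivial<T>::value, "sketch assumes POD-like T");
      union Slot { T obj; Slot* next; };   // a free slot reuses its storage as a link
      Slot  slots[N];
      Slot* free_head = nullptr;

      Pool() {                             // thread every slot onto the free list
          for (std::size_t i = 0; i + 1 < N; ++i) slots[i].next = &slots[i + 1];
          slots[N - 1].next = nullptr;
          free_head = &slots[0];
      }
      T* alloc() {                         // O(1), no malloc
          if (!free_head) return nullptr;
          Slot* s = free_head;
          free_head = s->next;
          return new (&s->obj) T();
      }
      void free(T* p) {                    // O(1), no free()
          Slot* s = reinterpret_cast<Slot*>(p);
          s->next = free_head;
          free_head = s;
      }
  };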

It sounds like premature optimization, and it often is, because those strategies are not really trivial to use or implement.


> std::vector uses constructors and destructors to create and destroy objects which in some cases can be significantly slower than memcpy().

This is precisely what vector::emplace() solves, and std::move should be faster than swap and pop. Modern C++ has changed a lot; this article ignores the massive improvements added in C++11/14/17.


> This is precisely what vector::emplace() solves, and std::move should be faster than swap and pop.

The whole swap-and-pop section weirded me out. Maybe I just don't know enough about C++, but saying that assignment (a[i] = a[n-1]) will call the destructor seems false.

As far as I know, the compiler should generate an implicitly defined copy assignment operator for these fixed size PODs and it should be as performant as memcpy.

But again, I don't have years and years of in-depth C++ experience, so I would be grateful if an expert could shed more light on this.


Yeah, fairly certain that is wrong.

I think that would just call the copy assignment operator, would it not?

For correctness you would probably then follow up with a pop_back to keep the vector right-sized.

Actually you'd probably want to do:

a[i] = std::move(a[n - 1]);

Then follow up with pop_back.

Best would probably be:

a[i] = std::move(a.erase(n-1));


Ideally, the std lib implementation should handle that detail for you...


which detail?


In theory erase could return a move iterator, meaning that you could omit the call to std::move. This wouldn't be backwards compatible though so not going to happen.


wait, how is this supposed to work?

   a[i] = std::move(a.erase(n-1));
There is no erase that takes an index, so I assume that n = a.end(). Also it is missing a dereference:

   a[i] = std::move(*a.erase(a.end()-1));
but erasing the one-before-the-end returns the (new) end iterator, which obviously is not referenceable. In general, after calling erase, it is too late to access the erased element.

You want something like:

  template<class Container, class Iter>
  auto erase_and_return(Container&& c, Iter pos)
  {
     auto x = std::move(*pos);
     c.erase(pos);
     return x;
  }
Also in the general case it doesn't make sense for erase to return a move iterator.


Thanks for the corrections. I mis-read the documentation and thought erase returned an iterator to the elements erased.


You are correct. A trivial copy assignment operator makes a copy of the object representation as if by std::memmove. All data types compatible with the C language (POD types) are trivially copy-assignable.

https://en.cppreference.com/w/cpp/string/byte/memmove


Not memmove. A trivial object assignment can be _memcpy_aligned, which is much faster. And the size is compile-time constant.


I assume you mean aligned on boundaries? I picked that up from https://en.cppreference.com/w/cpp/language/copy_assignment and it does also say that memmove has a fallback to std::memcpy when there is no overlap between source and destination.


The article is just doing a generic cargo cult warning there. Not bad as a general C++ gotcha warning, but definitely incorrect in this specific case.

As per the author's constraints these are "POD types that are trivially memcpy-copyable", so by definition the copy constructors will never do anything. Much less "allocate memory" as the author claims.


[flagged]


From the Guidelines:

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."


In most C++ game engines the standard library is almost never used, for performance reasons.

See: https://github.com/electronicarts/EASTL


My understanding is that the primary reason for the developers using custom libraries is not so much performance but a) historically, console compilers and especially standard libraries have been extremely buggy, and b) it's good to have a single implementation across platforms instead of having to deal with quirks and implementation divergence.


From what I've heard, there are two more major reasons to not use STL for gamedev.

- Debug build performance. Release builds of C++ code using STL are generally pretty fast, but Debug builds suffer a lot (especially Visual Studio's std::vector implementation is notoriously horrible for debug builds). Debug executable speed matters when you are debugging a game; you don't want to test your first-person shooter at 1 FPS!

- Build speed. Because of heavy use of templates and historical cruft, STL slows down your build times a lot. The build-test cycle is very important when designing games; you don't want to wait for a few hours after you've changed a few lines of code to tweak a new feature. Gigantic distributed build servers alleviate this problem a bit, but they are pretty cumbersome to set up nonetheless.


Performance in debug builds is a particular issue, since getting acceptable-for-gamedev machine code from modern C++ often requires optimized builds.

http://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/


For MSVC, the debug checks are fairly customizable through judicious use of appropriate debug macros. One can also enable optimizations with debug symbols, but the debugging experience can be jarring.

I'm not a game developer, but I have spent a decade doing C++ on Windows, and at a former employer we had several different debugging profiles depending on the severity/difficulty of reproducing/debugging an issue. Our "normal" debug profile had all of the debug checks in the std lib disabled, and we could only effectively debug our own code. Not sure if games don't do this, or if it's still not performant enough.
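For example, such a profile might look something like this (a sketch; the exact macros and their behavior vary across MSVC versions):

  // "Fast debug" configuration, set in the project/build system:
  //   /Od /Zi /D_ITERATOR_DEBUG_LEVEL=0   -> unoptimized, full symbols, MS STL checked iterators off
  //   /O2 /Zi /DNDEBUG                    -> optimized build with debug symbols (jarring to step through)
  // If defined in code instead, it must come before any standard headers
  // and match in every translation unit and linked library:
  #define _ITERATOR_DEBUG_LEVEL 0
  #include <vector>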


One problem with using different debug macros in your debug build is that any libraries you link in must also be using the same flags. This is not necessarily possible for binary releases as they will assume certain standard library flags to exist in the debug builds (like iterator checking levels).

At work we don't use a debug build in the traditional sense, it's what you call a no-optimisations build where the code is compiled without most optimisations but otherwise the flags are the same as a release build. Some teams also go a step further and compile most of the code in release but some of their code with optimisations disabled.


> One problem with using different debug macros in your debug build is that any libraries you link in must also be using the same flags.

They don't have to be, but it certainly makes things a whole lot easier. If the flags are not the same, you certainly have to be very careful about passing objects across DLL boundaries.

At the companies I've done C++ work at, we've always had the source for all non-C libs and compiled any C++ libs ourselves (except for Windows libs, but they also provide checked debug libs), so we could control the flags.


Are all those "best practices" valid for modern C++? I mean one statement says "Pass and return containers by reference instead of value.". This is in contradiction to modern C++ where you return containers by value and rely on copy elision/RVO. https://stackoverflow.com/questions/15704565/efficient-way-t...


The best way to tell is to try it the modern way and then look at the assembly code generated on something like godbolt.org. If it ends up being less efficient then you change it to accept a non-const reference to store the result in as a parameter instead.

Though if you'll be calling the same function repeatedly to accumulate content into a single container it is far more efficient to have a function with an output reference rather than returning a new container. This will result in fewer memory allocations and you can also pre-allocate the size once before calling those functions.
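A sketch of the two styles (names invented for illustration):

  #include <vector>

  // Return by value: relies on RVO/move, but still builds a fresh buffer per call.
  std::vector<int> visible_ids() { return {1, 2, 3}; }

  // Output-reference style: the caller controls the allocation and can reuse it.
  void append_visible_ids(std::vector<int>& out) { out.insert(out.end(), {1, 2, 3}); }

  void gather(std::vector<int>& ids) {
      ids.clear();
      ids.reserve(1024);                 // one allocation up front
      for (int pass = 0; pass < 4; ++pass)
          append_visible_ids(ids);       // accumulates without reallocating
  }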

On the part of tooling it might be nice if there was a way to annotate a function so that it creates a warning if the compiler cannot use copy-elision for the return value. (To be honest I haven't checked the documentation for this specific thing)


Copy elision is now mandated in many (but not all) cases FWIW.


The warning that I would want would trigger when someone changes the function and prevents or suppresses copy elision from happening. Like for example adding a check at the start of the function and returning a default container.


Making the returned object non-copyable, non-movable is an option.

edit: also, the rule is simple: RVO is always mandated, NRVO remains an optimization.


I'm not sure if C++14 or C++17 has fixed this but if the object was not copy-constructible then the compiler would emit an error if it was returned by value even if RVO/NRVO was meant to be used. I figure because semantically you need to enable copy-construction of the object.


IIRC it was changed in C++17. Now in the RVO case, no copy/move constructor is required (and in fact the compiler is not allowed to call it if it exists).


vector::emplace() still needs to construct the object, it just happens in place and avoids a redundant copy of an already constructed object. Same with std::move(). As such the blog post is correct.

Using POD structs which can be zero-initialized and memcpy'ed may indeed be faster, especially when these are bulk-operations.


It's not clear from the article, but I suspect the author is talking about what happens when the vector is resized and has to move existing elements, which is a real problem.

There are plans to solve this - http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p114...


AFAIK std::vector does use memcpy (std::copy as well) when objects are trivially copyable.


That is correct, yet the compiler's definition of what is trivially copyable might be stricter than what you expect. For example, objects that are trivially relocatable can also be memcpy'd for reserve/realloc, but the compiler will not be able to figure that out on its own.

std::vector itself falls in this category: trivially relocatable, definitely not trivially copyable. So a vector of vectors will not necessarily be able to use memcpy but rather fall back to copy/move assignment. This is not very significant in performance for this type (vector move being cheap) but a language gotcha nonetheless (as the move constructor will be called n times in every capacity change)


Since C++11, you can use type traits to determine if a type is trivially copyable, and even add static_asserts to your code to ensure future changes don't break expectations.
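For example:

  #include <type_traits>

  struct Particle { float pos[3]; float vel[3]; };

  // Fails to compile if someone later adds a member that breaks memcpy-ability.
  static_assert(std::is_trivially_copyable<Particle>::value,
                "Particle must stay trivially copyable");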


Trivially copyable is a word of power (well, two words I guess); its meaning is well defined and you can statically assert for it.

What unfortunately is not defined is (trivially) relocatable, as that's not a property that can safely be inferred, so it is not (yet) part of the standard. Some libraries still have this concept and require some sort of opt-in.


It seems clear from the article the developer isn't that clued up.


It is better to use push_back over emplace to be explicit about which constructor will be called.


+1, Google suggests doing this as well:

https://abseil.io/tips/112

> So in general, if both push_back() and emplace_back() would work with the same arguments, you should prefer push_back(), and likewise for insert() vs. emplace().


That's an interesting point the tip makes. Is there guidance on how to use the emplace_back() added in C++17, which returns a reference to the constructed element?

The reference-returning emplace_back() is used frequently in code to construct a new element in place and then fill in its members, as opposed to creating a new struct and then using push_back() to copy the memory in.
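Something like this (illustrative names):

  #include <vector>

  struct Bullet { float pos[3]; float vel[3]; int owner; };

  void spawn(std::vector<Bullet>& bullets, int player_id) {
      // C++17: emplace_back returns a reference to the newly constructed element,
      // so you can build it in place and then fill in the members.
      Bullet& b = bullets.emplace_back();
      b.pos[0] = 1.0f;
      b.owner  = player_id;
  }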


But emplace() is already as explicit about it as it gets.


Nuh uh.

If you're not careful, it will call an implicit constructor.


No, the problem is that std::vector still calls the constructor and destructor of each and every object in the array at least once. This is a performance loss if they don't do anything - you have to rely on the compiler to inline the call, then remove the code. For POD datastructures it can be significant, because those are usually the largest arrays in your application. This is why e.g. Facebook's Folly library detects POD types in their vector and doesn't call ctors and dtors at all.

Similarly, std::vector has to allocate more memory every time it has to grow and copy all its contents, whereas for POD datatypes you can just use realloc which can save copies.
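A bare-bones sketch of the realloc approach for a trivially copyable element type (error handling omitted):

  #include <cstddef>
  #include <cstdlib>

  struct Particle { float pos[3]; float vel[3]; };   // trivially copyable

  struct ParticleArray {
      Particle*   data = nullptr;
      std::size_t size = 0, capacity = 0;

      void push(const Particle& p) {
          if (size == capacity) {
              capacity = capacity ? capacity * 2 : 16;
              // realloc may grow the block in place; std::vector always allocates
              // a new block and copies/moves element by element.
              data = static_cast<Particle*>(std::realloc(data, capacity * sizeof(Particle)));
          }
          data[size++] = p;
      }
  };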

These are all borderline microoptimisations, but they matter for realtime highly responsive software. Or just in general when you need to squeeze out every last bit of performance.


std::move'ing a vector does not call the ctor/dtor of every element within the vector, but that might not be what you're referring to.

If you want an `A` struct/class, you'll call the ctor/dtor, that's true. But for POD types, if the ctor/dtor does nothing, they are trivial to inline and will incur no runtime overhead by any compiler nowadays.


Not necessarily a fan of this sort of re-blogging so here's the original link: https://ourmachinery.com/post/data-structures-part-1-bulk-da...


Our Machinery has a lot of great resources, and I recommend reading them when you have time.


Reading all the discussion and visible confusion about best C++ practices, when and where a constructor will be called, etc. seems to be the perfect illustration of the author's point.


I wanted to like this article because I've been thinking about this a lot in the context of game development, but I noticed a few things. One thing I'll say from briefly playing with this - the code leaves a lot out and looks simpler than it really is. Would very much appreciate tips / pointers on this, or a more fleshed-out and working implementation of the code.

For the bulk data with holes code:

First, there's an initialization step that has to happen the first time you allocate your bulk_data_t. Namely, you need to iterate through every item in the list and set its next_free item to the item following it, looping the last item back around to zero. You also need to do this for all the items between the new size and old size every time you resize your item list.

Second, safe iteration over all of the bulk data doesn't seem possible without adding some sort of flag to indicate whether or not an item is free.

Am I missing something here?


No need to preinitialize the list. When allocating an element you have two cases: a) the free list is non-empty: you pop an element by making the head point to its successor; b) the free list is empty: you increase the size of the vector by one.

To deallocate an element you simply set the element's next to whatever is the current head, then make head point to this element.

You start with an empty vector.

An element is either allocated or in the free list, which is why the next pointer can be kept in a union.
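A sketch of that logic, loosely modeled on the article's bulk_data_t / item_t naming (not the author's actual code):

  #include <cstdint>
  #include <vector>

  struct item_t { float x, y, z; };

  struct bulk_data_t {
      union slot_t { item_t item; std::uint32_t next_free; };
      std::vector<slot_t> slots;               // starts empty
      std::uint32_t free_head = UINT32_MAX;    // UINT32_MAX == free list is empty

      std::uint32_t alloc() {
          if (free_head != UINT32_MAX) {       // a) pop the head of the free list
              std::uint32_t i = free_head;
              free_head = slots[i].next_free;
              return i;
          }
          slots.push_back({});                 // b) free list empty: grow by one
          return std::uint32_t(slots.size() - 1);
      }
      void dealloc(std::uint32_t i) {          // push the slot onto the free list
          slots[i].next_free = free_head;
          free_head = i;
      }
  };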


I would say it depends on how one might want to handle it; e.g. when you create an item_t you set next_free = -1 as a flag to indicate that it is not free, and have bulk_data_t's 0th position's next_free be 0.


Update: I was indeed missing something.

I think I've figured out roughly what the author intended, code below [0].

First, it looks like he's relying implicitly on data stored in std::vector. Namely, vectors have both a capacity and a size. The capacity is the total number of allocated elements. The size is the number of elements actually stored.

Second, vector::resize won't reallocate until it runs out of capacity, but it will give you access to extra elements if you need them. So this is used to lazily reallocate while bumping up the size of the vector.

Both of these effectively make it "do the right thing" by leaning on the vector storing both size and capacity.

If you hand manage those values yourself you can get a pretty compact C implementation without a lot of code.

One last thing: Using a union here for the item_t is pretty much guaranteed to get you a segfault. The whole thing should really be a struct. This also allows for setting next to sentinel value if necessary.

[0]: C code for bulk_data_t example: https://pastebin.com/Tfcdt39h


The author's pseudo-code implies that bd->items is full already, since it is probably initialized to the initial number of items that are added. This is probably for memory optimization for games. Also explains why resize increases the size by 1, just enough memory to add the new item.

This way you don't need to worry about keeping track of size and capacity either.

The reason I suggested -1 is because when we iterate through bd->items, we need a way to know if it's a valid value or just "holed".


> This way you don't need to worry about keeping track of size and capacity either.

The only reason you don't have to worry about this is because std::vector handles it for you, at least in the code examples provided by the author. If you choose to go with a pure C implementation (which is what I'm trying out) then you will have to keep track of these.

> The reason I suggested -1 is because when we iterate through bd->items, we need a way to know if it's a valid value or just "holed".

Yep, I was able to get an example using -1 as a sentinel working and passing fuzz testing.


A union is perfectly fine, the next element is only needed when the element is in the free list in which case item_t won't be accessed.


It should work fine in the examples, but it won't work if you also attempt to use the -1 in the next slot to tell you which items are non-free.


If you want to iterate through your objects without some external means to track the live objects (like an embedded next pointer in the object itself), then the swap-and-pop idiom is a better solution (iterating via indexing is going to be significantly faster than following the next pointers).


Although it's a fair amount of work, you can make it very simple to switch between SoA and AoS by writing a child class for a C++ vector<yourclass> that templates your original class, but returns values of a child class of yourclass that operates on the SoA data.

With a public-data-heavy class that might run you into a performance problem with allocating the extra unused memory, but you can always pull out the interface as a virtual parent of both to avoid that as well.

I would rarely be afraid of using SoA over AoS if it can lead to significant performance improvements. Done well it can hide all the complexity with some clever use of interfaces and classes.


very simple minimally intrusive SOA/AOS conversion using templates: https://godbolt.org/z/rBeWOA

I just typed it in, I'm sure there are errors. I borrowed to_tuple from somewhere, that's the real hack. Real reflection can't arrive too soon.

edit: added soa2aos example


Can you give an example?


This is mainly to illustrate the idea; I don't claim any correctness or good performance from this code. (if you do inserts after reading a [] you may invalidate some pointers!)

https://pastebin.com/aZWTAL2J

impl_X is your base class with most of your logic. interface is used to pull out just the parts of the data that you might work with while wanting to have it in SoA format. Then we specialize the vector template for the interface to give us a dummy class with the things we need, but that sends our writes back to the backing array.

If we need to get an individual struct out of it the conversion is automatic. If we just need to access some member vars it will (hopefully) optimize down to direct accesses. We do bear some complexity in implementation, but it's all confined here.

I'm now realizing I was a bit imprecise in my earlier comment. The specialized vector is not around <yourclass> but around an interface parent of your class. You could also just specialize the yourclass vector, but then you don't have the ability to switch.


After writing that up, I saw below that someone else has done it much like I had envisioned and ironed out the odd parts.

Better source: https://github.com/crosetto/SoAvsAoS


Thank you


I assume that we want to access these arrays as "array of structs" for most functions but as "structure of arrays" for some calculation-intensive functions. The article suggests storing it as an array of structs and making copies for those calculations, but this seems inefficient to me. Modern C++ should provide a way to efficiently decouple the access model from the memory layout.

Can we hide the actual memory layout without big overhead using C++ inline/template functions/classes? Would that be the visitor pattern?


> make copies for those calculations but this seems inefficient to me

I haven't read the whole article, but this "make copies of elements from an array into another array for the current frame only" is common in game development.

Remember that on modern CPUs, an L3 miss is about 200x slower than an L1 hit. RAM isn't random access: randomly jumping around is slow, but iterating over an array is fast, both because of the cache and because of pre-fetching.

Say you have a big array of A's, and another big array of B's. For the current frame, some of the A's need to interact with some of the B's. If you go through the entire list of B's, and copy the ones that will definitely need to interact into a new list, call it B2, then maybe (or not) do the same with the A's into A2, then it can often be approximately 30 times faster. Multiply that by 4 (or 8) if you can "zip" through your A2's and B2's with SIMD.

Not only that, but your A2 and B2 lists can be put on a stack allocator (nothing to do with allocating on the stack - it's a special type of O(1) heap allocator whose contents are discarded at the end of each video frame).


Copying is fast if the access pattern is a good fit for the CPU architecture.

If you need to copy only every N-th byte from an AoS it might be as inefficient as random access. So, copying could be expensive.

The article suggests striping your data in blocks but then you end up with the worst of both worlds in terms of program code complexity.


> Can we hide the actual memory layout without big overhead using C++ inline/template functions/classes?

This seems to claim to do it: https://github.com/crosetto/SoAvsAoS

Found that while looking for this, which I vaguely knew about and which also seems to do that: https://m-sp.org/downloads/cgo2018-src-poster.pdf


I don't think the article is suggesting ever making temporary copies of data into a structure-of-arrays (SOA) format. Rather the choice should be made at design time and you write your code accordingly for those parts that deal with SOA data.

The author's advice is that you should go with the standard array-of-structures (AOS) format by default, but if you know you'll be doing number crunching, use an "unrolled by eight" grouped SOA format that's both SIMD- and cache-friendly.
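Roughly this kind of layout (my own illustration, not the article's code):

  // Plain AoS: one struct per particle.
  struct particle_aos_t { float px, py, pz, vx, vy, vz; };

  // Grouped SoA, "unrolled by eight": each field becomes a small array, so one
  // group stays cache-friendly and maps directly onto 8-wide SIMD lanes.
  struct particle_group_t {
      float px[8], py[8], pz[8];
      float vx[8], vy[8], vz[8];
  };
  // The full set is then an array of particle_group_t; particle i lives in
  // group i / 8, lane i % 8.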


FTA: "Another thing I might consider is to keep the data stored at AoS, but generate temporary SoA data for processing by some algorithm."


Missed that, thanks!


> Modern C++ should provide a way to efficiently decouple the access model from the memory layout.

I've tried hard to support both "array of structs" and "structure of arrays" in my own little relational language in Rust.

It's just not possible (as far as I know). At best, you could store the data as packed arrays or arrays of arrays and then dispatch between them at runtime.

P.S.: Or generate code for both. Either way it's not easy to build... the OPTIMAL algorithms for the two cases diverge enough.


ISPC, Halide and the unreleased Jai all have ways of doing this.


Is it explained how they do it?


> Can we hide the actual memory layout without big overhead using C++ inline/template functions/classes?

Not super easily because the array type needs to know the fields of the class it's containing to do the re-write. This is where you need more substantial codegen to enter the picture. Something like the metaclasses proposal should handle it just fine. Or macros in the meantime.


I posted this elsethread: https://godbolt.org/z/rBeWOA

But yes, better reflection is needed to make it truly generic.


This sounds like you want to use ranges, which were introduced in C++20 and quite a few game developers found them too complicated.

The way I see it is to store your data in whatever is the most efficient form for your computations to use, and use a simple view for those functions. Then for functions which need to look at the data in another form, you use more complicated views which can abstract some of the data layout for you and make it simpler to manipulate.

Unfortunately I can see some people decrying this sort of code as too complex and complicated, but I think it can be made to work rather well.


Doesn't the memory layout actually matter for cache locality? So you would still need to be able to have both memory layouts for performance.


Cache locality is kind of the whole point of it. Some algorithms benefit hugely if you choose a specific layout.

Other algorithms do some kind of random access to a few fields only and they don't benefit at all. Those algorithms can make up 90% of your code but only account for 10% of the computation. Therefore it would be easier to have your data look like an AoS in 90% of your code but actually be stored as an SoA to gain the speed in 90% of the computation.


Most definitely.

If, for example, you've got a vector of structs (which is a basic tabular store, i.e. row-major), then depending on the operations you're performing, you may see huge performance benefits from instead using a column-oriented data structure, especially with very large datasets. A large part of this is because of cache locality and prefetch.

I see this in finance often. For querying large, slowly changing datasets, column-store RDBMSes destroy traditional row-oriented stores. Column stores can, loosely speaking, be an order of magnitude faster for some operations, such as computing aggregates grouped by a date (but they're significantly slower for inserts and even more so for updates).

As usual, when it comes down to optimizations, depends on the use case, and experiment and measure, measure, measure.

Also, another big caveat is that it can change arbitrarily with different hardware or even OS revisions.

Edit: spelling


Do you mean switching the memory layout depending on runtime conditions or just changing the software interface to a constant (shape) block of memory?


IIRC this was something JAI was trying to do.


I've heard in one of his videos that Jonathan Blow ditched the AOS -> SOA conversion feature, but he may come back to the idea sometime later.

(One problem with Jai is that unless you are viewing his Youtube videos regularly, you cannot catch up on what is going on with the language...)


That's unfortunate.


I think that was just refactoring tools.


I’m pretty sure Jai has (or at least did when I last looked) a type modifier keyword that changes the layout (the code working with it doesn’t change)


It should, based on livestreams, but details on Jai are so few and far between, it's hard to say.


Yeah, my comment was based on some old fan-made documentation and the latest livestreams I've seen where he talked about it (which was a good many months ago now, but it certainly seemed like it's supported).


There is already a good solution: https://www.plflib.org/colony.htm, which will [eventually] end up in the std [https://github.com/WG21-SG14/SG14/tree/master/SG14].


I really need a resource on how to make code cache friendly (or at least, more aware of computer architecture). Got an interview coming up at a HFT firm. Please HN, deliver!


Check out bisqwit's videos on cache locality


Who is bisqwit?

Couldn't really get anything on Google.


> Also, without some additional measures, neither plain arrays or vectors support referencing individual objects.

Uh, isn't this just subscripting?

> But, as stated above, we don’t care about the order.

Maybe std::unordered_set might be what you want?


> Maybe std::unordered_set might be what you want?

Modern video games are typically now both GPU-bound and CPU-bound. But also, latency is way more important than throughput.

Imagine you have 1000+ different std::unordered_set objects in your game, that are being used and accessed every frame. Most of the time, your game is using around 30% of the CPU. But on one frame, you get unlucky and 900 of your std::unordered_set objects run out of space and have to be re-allocated at the same time. Suddenly your frame rate drops from 30fps to 3fps and then back up again. This is totally unacceptable in a video game - gamers hate it, and they have a name for it, called "stuttering".
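Even staying within the standard containers, the "allocate up front" mindset helps; a sketch (MAX_ENTITIES is a made-up budget):

  #include <cstddef>
  #include <cstdint>
  #include <unordered_set>

  constexpr std::size_t MAX_ENTITIES = 65536;        // made-up worst-case budget
  std::unordered_set<std::uint32_t> live_entities;

  void init_world() {
      // Reserving for the worst case at load time means the table never has to
      // rehash mid-frame, as long as the entity count stays under the budget.
      live_entities.reserve(MAX_ENTITIES);
  }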

For that reason, most video games allocate large blocks of memory upfront and use their own custom allocators, usually very different from the doug lea malloc() that iirc new is still a wrapper for. (I'm aware of std::allocator, but that's a whole other topic...)

Basically if you think about how, in some first person shooter, the oldest bodies and bullet decals start disappearing when new enemies appear, the whole engine is based around that philosophy.


> For that reason, most video games allocate large blocks of memory upfront and use their own custom allocators, usually very different from the doug lea malloc() that iirc new is still a wrapper for. (I'm aware of std::allocator, but that's a whole other topic...)

That is only half of the story. The full story is using a preallocated object pool, and reusing entity objects without ever having to dynamically allocate new instances on demand.


Remember `std::unordered_set` is typically rather slow.


Well, it depends what you're doing with it.


Can you elaborate on why it is slow? Shouldn't it be faster than `std::set`, which uses a red-black tree as the underlying data structure, thus providing O(log n) time complexity? On the other hand, `std::unordered_set` uses hash functions to `index` into an array and retrieve, which essentially is O(1) time complexity.


This is where we get into O(1) != fast territory. Algorithmic complexity has a weak relationship to CPU performance, not a strong one.

If you want to find something in a set, storing it as an array and doing a linear scan will beat a std::unordered_set up to a shockingly large number of items, due to how CPUs work.

In particular it's the pointer-chasing aspect of std::unordered_set that becomes a problem (an issue shared with _most_ hash set implementations). Remember an unordered_set is not an array of items, it's an array of buckets of items (this is how hash collisions are handled). Worse still, those buckets are usually linked lists. They typically can't be speculated effectively and can't be prefetched effectively, so you become memory-latency bound during an uncached lookup. And memory latency is just shy of absolutely terrible. If you're expecting L1/L2/L3 cache hits on lookups then you're probably not dealing with very large sizes, and you're going to get much better cache density with the flat array than with the array-of-buckets.

There are alternative hashsets that are flat and avoid this, but they are less common and as far as I know no standard implementation on any language uses such a hash set. There's a good talk about such a dense, flat hash set here: https://www.youtube.com/watch?v=ncHmEUmJZf4


Some languages have hash tables (and hence sets, as they tend to be implemented with them) that use open addressing, in a way that doesn't end up being bad cachewise unless there are unreasonably many collisions for the same hash code.

It is also not uncommon for mature implementations to optimize the cases with few elements in the hash to use linear lookup. In some cases that optimization is also used while storing data in small hash tables.

Pointer-chasing hash tables were good when caches were nonexistent or small (i.e. the '90s), but nowadays open addressing is just superior.


I am unfamiliar with std-library details, but the solution is to use hash tables with linear probing, maybe with Robin Hood hashing.

Linear probing means that if h(key) has a collision, you store the value at h(key)+1. If that has a collision, you store it at h(key)+2, etc. (wrapping around if necessary). Linear probing means that most accesses will have one main-memory fetch, and after that it's all prefetchers linearly scanning for the data.

Robin Hood hashing means that the "poor" steal from the "rich". "Poor" values are the ones that have to be probed for significantly: maybe 10, 20, or 100 slots away from their preferred spot. A "rich" value is one which has been placed close to its ideal spot.

The idea is that you average out your linear probing distances, so that instead of having 100 probes for some keys and 0 probes for others, all of your hash values will tend towards... roughly 3 probes or so. (I forget the math exactly).

It's as simple as checking on insertion: if values[h(foo)+rehash_count].rehash_count < current_rehash_count, you insert foo into that location. Then you take THAT value and keep probing it forward.

---------

It seems like hash tables + linear probing with Robin Hood hashing are popular these days for L1 caches (everyone at offset 5 away from their ideal is way more cache-friendly than most at offset 0 but some at offset 100). But I'm not aware of any standard-lib implementation of the technique.
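A toy sketch of that insert loop (fixed power-of-two capacity, no deletion or resizing, just to show the "steal from the rich" step):

  #include <cstddef>
  #include <cstdint>
  #include <utility>
  #include <vector>

  struct RobinHoodSet {
      static constexpr std::uint32_t EMPTY = ~0u;
      struct Slot { std::uint64_t key = 0; std::uint32_t dist = EMPTY; };  // dist = probes from home slot
      std::vector<Slot> slots = std::vector<Slot>(1024);                   // capacity must stay a power of two

      void insert(std::uint64_t key) {
          std::uint32_t dist = 0;
          std::size_t i = std::size_t(key & (slots.size() - 1));           // stand-in for a real hash
          for (;;) {
              if (slots[i].dist == EMPTY) { slots[i] = {key, dist}; return; }  // empty slot: take it
              if (slots[i].dist < dist) {            // resident is "richer" than us: steal its slot
                  std::swap(key,  slots[i].key);     // ...and carry the displaced key forward
                  std::swap(dist, slots[i].dist);
              }
              i = (i + 1) & (slots.size() - 1);      // linear probe, wrapping around
              ++dist;
          }
      }
  };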



