- It looks up the same key on every iteration, so every level of the trie stays in the cache, no matter how large the nominal data set size is.
- The setup appears to be such that even though the key is 128 bytes, it's probably not going to share a prefix longer than 4-5 bytes.
- The code is constructing a temporary string of the 128 byte key for every lookup in the hash table (by going from a string to char-pointer to string), but not for the trie.
So it first eliminates the problems of a trie (especially the cache inefficiency) by an unrealistic test setup, and then has a bug introducing probably a factor of 2 unnecessary overhead to the hash table.
There's an incredibly simple algorithm for such an access pattern. Store all values in a vector, without regard for order. Then when an element is requested, find its index i with a linear scan and swap elements i and i/2.
In an access pattern where the same elements are requested over and over this very quickly brings those elements to the front, giving really fast access times. This super simple data structure would totally blow the rest out of the water on this benchmark, and should illustrate the issue with it quite well.
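A minimal sketch of that scheme, assuming string keys and int values (all names here are made up for illustration):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    struct self_organizing_vector {
        std::vector<std::pair<std::string, int>> data;  // stored in no particular order

        // Linear scan; on a hit, swap the element halfway towards the front so
        // frequently requested keys quickly migrate to the start of the vector.
        int* find(const std::string& key) {
            for (std::size_t i = 0; i < data.size(); ++i) {
                if (data[i].first == key) {
                    std::swap(data[i], data[i / 2]);
                    return &data[i / 2].second;
                }
            }
            return nullptr;  // not found
        }
    };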
I hate it when libraries try this sort of thing. I've already used all of the cores higher up in the program logic, where it's far more effective; if the library does that too, all it's going to achieve is a net slowdown in overall throughput.
Don't blindly throw work at multiple cores low down in algorithms. Amdahl's law tells you that the payoff is likely low, and if the program already makes proper use of the cores you're just going to slow things down.
I'm learning OpenMP right now, and I'm excited about it, but my sarcasm filter is off today--are you serious about this?
What about the inability to exit a parallel region--I was under the impression that once you start a parallel for, you can't exit it until it's run through the entire iteration space.
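For what it's worth, that's basically right for a plain parallel for (you can't break out of it; OpenMP 4.0 did add cancellation constructs, which need to be enabled explicitly). The usual workaround is a sketch like this, where a shared flag turns the remaining iterations into cheap no-ops:

    #include <atomic>
    #include <cstdio>

    int main() {
        const int n = 1000000;
        std::atomic<int> found{-1};  // shared flag / result

        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            if (found.load(std::memory_order_relaxed) >= 0)
                continue;            // later iterations still run, but do no real work
            if (i == 123456) {       // stand-in for the real per-iteration test
                int expected = -1;
                found.compare_exchange_strong(expected, i);
            }
        }
        std::printf("found = %d\n", found.load());
    }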
In particular, splay trees do this by rotating elements to the top whenever they are accessed (IIRC), and while they aren't formally balanced, they tend to be "balanced enough": if you only ever access a few elements, they will be clustered at the top and the structure of the rest of the tree doesn't matter.
We've done a few variations of those for representing parts of a trading book, where a few price points (the inside) are hit more often and their performance is more important. Self-balancing data structures are very cool.
So basically none of the results are remotely relevant to real use. Except maybe that unordered_map is faster on large data sets than map.
Another thing I noticed is that it starts with the example of a type column in a database, then proceeds to use graphs with a logarithmic scale on the x axis but not the y axis, so you can basically only see two data points, which are for millions of strings. He also only uses long strings, whereas I would expect a type column to be the opposite.
This means that even if the benchmarks were valid, I still wouldn't know what to use if I wanted to map 2-10 different short strings to numbers. Of course the real right answer for that case, which I'd guess is a perfect hash function, wasn't tested.
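For a sense of what's meant by that, here's a hedged, hand-rolled illustration for a handful of known strings (a tool like gperf would generate something along these lines; the enum and names are just placeholders):

    #include <string>

    enum class type_t { circle, square, triangle, num_types };

    type_t get_type(const std::string& s) {
        // The first character already discriminates the three keys;
        // one final comparison verifies the match.
        switch (s.empty() ? '\0' : s[0]) {
            case 'c': return s == "circle"   ? type_t::circle   : type_t::num_types;
            case 's': return s == "square"   ? type_t::square   : type_t::num_types;
            case 't': return s == "triangle" ? type_t::triangle : type_t::num_types;
            default:  return type_t::num_types;
        }
    }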
Exactly. I predicted every benchmark result except for the trie one, which surprised me a ton (wut??!! but pointer chasing costs!); apparently, though, that was just bad benchmarking putting the only path ever tested in cache.
"At around 1,000,000 elements our unsorted vector takes around 32 minutes to lookup a string."
I don't at all believe that C++ can only iterate through 2000 strings a second. Python takes an imperceptibly small time to iterate through an array of a million strings when I test it out in my repl, and I imagine C++ would either be as fast or somewhat faster. This number is so ridiculous that it makes me very skeptical of the rest of the article.
The x-axis is the number of iterations, i.e. the number of lookups. The collection is always size 1,000,000. Also, each of those lookups is for the exact same entry, so after the first time, it'll be all cache hits. Well, except for the std::vector, which is what you're talking about... :)
Gotcha. It's probably still quadratic, just judging from the values. You can't need 32 minutes for 10^6 100-byte comparisons (which typically abort after the first character compares unequal; not that that matters).
Why is the time for map lookup going up so fast for large numbers of strings? Is he running out of memory and thrashing? Is the implementation of map broken? Does the benchmark include growing the map? (If you include growth time, map insertion is O(N log N), because you need log N recopies of size N.)
Update: The author is always using a collection of 1,000,000 entries. The horizontal axis is the number of lookups into that collection of 1,000,000 entries, i.e. each lookup traverses ~20 levels of the tree. So the x-axis is logarithmic, the y axis is linear, and we expect the time to grow linearly with the number of lookups.
Original comment: Good question. The x-axis is logarithmic, the y axis is linear. Without cache effects, std::map and lower_bound should take logarithmic time, so should be a straight line on the graph. Also, unordered_map should be constant time, independent of size. So it's either cache effects (including swapping / thrashing), or there's some other overhead that's dominating the whole thing, and the author isn't measuring what they think.
Sort of. It's probably the level below that: these results look fairly typical of a data structure growing larger than the cache, which is usually 8 MiB now.
If you don't need sorted search (lower/upper bound) or prefix search I doubt even a good trie would beat a good hash table. Have you tried google::dense_hash_map? std::unordered_map is slower because it uses chaining to resolve collisions.
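Usage is almost a drop-in for unordered_map, with the extra requirement of designating an empty (and, if you erase, a deleted) key; a rough sketch, assuming the sparsehash headers are installed (header path and defaults vary between versions):

    #include <sparsehash/dense_hash_map>  // older releases: <google/dense_hash_map>
    #include <string>

    int main() {
        google::dense_hash_map<std::string, int> m;
        m.set_empty_key("");          // required before any insert
        // m.set_deleted_key("\x01"); // required only if you erase
        m["circle"] = 0;
        auto it = m.find("circle");
        return it == m.end() ? 1 : 0;
    }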
Tries have the incredibly useful feature that they fail fast, and match fast once you get to a unique prefix. With a hash you always need to process the full string, though.
Yeah, the most cache-friendly structure is going to win, but failing fast means no strcmp test if you happen to get a collision, which is a potential miss to fetch the characters. And if the hash uses bucket chaining, may the programming gods help you.
In an interview I was once asked to calculate the latency of a hash fetch in Java with the JDK String, hit and miss. It all basically boils down to how many cache misses you are going to have. I literally just counted up the memory accesses, counted up the hits and misses, then gave an answer for cold and hot cache. Then we worked on rewriting it.
Interesting write-up. There was a claim I didn't understand, though:
> Let's fix this. An easy way to get our std::vector implementation in-line with the rest of the containers is we can sort it and utilize an algorithm, std::lower_bound, in order to speed up our lookup times.
Unless I'm missing something, isn't this wrong? The point of the vector is that v[enum_value] gives you the name of enum_value as a string. Once you sort it this relationship no longer holds, unless it so happens that the vector was already sorted (which happens to be the case for "circle", "square", "triangle").
You are absolutely right - and the benchmark in the repository doesn't even try to get the value - it only checks for the existence of the key.
However, it can easily be fixed by using something like vector<tuple<string, types_t>> and supplying predicates for both std::sort and std::lower_bound to only consider the first element in the tuple. There will be some performance hit, but should be minimal.
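Something like this, as a rough sketch (using pair instead of tuple for brevity; type_t and the value names are placeholders based on the article's example):

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    enum class type_t { circle, square, triangle, num_types };
    using entry = std::pair<std::string, type_t>;

    // The vector must be sorted by key once, up front:
    //   std::sort(v.begin(), v.end(),
    //             [](const entry& a, const entry& b) { return a.first < b.first; });
    type_t get_type(const std::vector<entry>& v, const std::string& key) {
        auto it = std::lower_bound(v.begin(), v.end(), key,
            [](const entry& e, const std::string& k) { return e.first < k; });
        if (it == v.end() || it->first != key) return type_t::num_types;
        return it->second;
    }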
I see, that makes sense. I guess at that point you have something very much like std::map, except it doesn't keep itself sorted, you have to do it yourself.
I guess for cases like this where you initialize it once and never change it again it could even be a better choice (i.e., faster initialization).
I get that part, that's why you use std::find or std::lower_bound on the vector, but my point is that std::lower_bound will not give you the right answer unless your enum is defined with the names in alphabetical order.
The performance gap between unordered_map<> and the custom implementations seems to be less than 2X. This will quickly evaporate if one uses an efficient hash such as dense_hash_map<>.
For the trie, things would be a lot better if the words got longer. All the 'inefficiencies' of a trie lie in the pointer chasing. The great thing is that you can process a string 'on-line'.
If things start getting really big, there are some tricks with LCP-arrays and constant time RMQ (range minimum query) that have great theoretical performance. I haven't seen that stuff in practice though.
I love how it's always the native code developers who worry the most about performance implications, when they often have the least to gain from doing so.
It must be something about being close enough to the metal to realize and care about what happens.
Hopefully if you're using C++, it's because the performance implications matter. And hopefully, you're nitpicking on the implementation of the mapping algorithm for the same reason. Third, that's the only good reason to be close to the metal anymore: that there's a requirement for some particular speed target, and none of the "good enough" solutions are good enough.
Otherwise, why wouldn't I build my program in Python or something? If selecting the right data structure just based on Big-O notation will get me to a good enough solution, digging down is just a waste of time, like saving microseconds when network latency is 1000 times the problem.
Sure, but it's one of the biggest go/no-go reasons. Preexisting codebase, available libraries suitable to the problem domain, platform limitations, desire to obfuscate code, developer experience, desire for speed of development, and whatever else I'm forgetting would be other reasons to choose a particular set of technologies.
But the comment I was responding to was about performance implications, and whether or not development was "close to the metal", so my response was too (using C++ and Python as examples of lower-abstraction and higher-abstraction languages).
There are other reasons, but at least 9 times in 10 (probably more) they are completely dominated by a performance justification on occasions where people actually have the choice rather than having it made for them.
The last time I wrote production C++ was at Google. There, when requests are coming in at a torrential rate, every instruction counts. In cases like that, it's perfectly reasonable to sweat the details when selecting a data structure.
There's an element of that, but like @khedoros1 says, you have it backwards.
By comparison, Java programmers typically have very little mechanical sympathy. Just yesterday one suggested to me that he could add a Kafka instance on our machine (rather than a new topic queue on the existing Kafka) to speed up things. I pointed out it would probably make little difference as it would be hitting the same disk. And it would complicate their code because they'd have to switch connection details depending on the topic.
Tries can be memory hogs and the techniques for level and path compression can add significant overhead. It would have been interesting to see memory usage.
One of the more interesting data structures for things like this are Ternary Search Trees, since each subtree starts with a common prefix. That would have been an interesting comparison.
There are still some interesting applications of tries (and dictionary automata). E.g., if you want to find all words given a prefix (wildcard searches). If the dictionary is static, you could use an array with binary search, however for dynamic dictionaries this gives you O(n) insertion.
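For a static dictionary, the binary-search version of a prefix (wildcard) lookup can be as simple as this sketch (assumes a sorted vector of plain ASCII words and a non-empty prefix):

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Returns the [first, last) range of words that start with `prefix`.
    std::pair<std::vector<std::string>::const_iterator,
              std::vector<std::string>::const_iterator>
    prefix_range(const std::vector<std::string>& sorted_words, std::string prefix) {
        auto first = std::lower_bound(sorted_words.begin(), sorted_words.end(), prefix);
        prefix.back()++;  // "ca" -> "cb"; good enough for ASCII, breaks on 0xFF
        auto last = std::lower_bound(sorted_words.begin(), sorted_words.end(), prefix);
        return {first, last};
    }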
Ternary search trees are really great and relatively compact. A while ago, I compared memory use of storing a dictionary (the SOWPODS word list) in Rust using various tries (as measured with the heapsize crate):
- Trie, using a HashMap to store edges: 691MB
- Trie, using a sorted Vec to store edges: 44MB
- Ternary search tree: 18MB
A typical ternary search trie is even more compact, but I implemented randomized ternary search tries, which use two extra member variables in each node for bookkeeping (u16s in these measurements).
Suffix tries are an even better example. Donald Knuth once heralded them as the greatest algorithmic breakthrough of the '70s, but in practice they don't quite perform.
The point of suffix tries is finding substrings. I.e. build a suffix trie of all of Shakespeare's works in linear time and memory (linear w.r.t. the total length of the string). Then, given a potential quote of length K, see if it is in Shakespeare's work in O(K) time. The big draw there is that the complexity of string lookups does not depend on the length of the big text against which you are matching. This finds practical use in genome sequencing.
Suffix arrays, on the other hand, have worse complexity in the typical implementation (O(n) construction, O(k + log(n)) search [1]). However, they work extremely well in practice, because they use (contiguous) arrays.
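As a rough illustration of why the suffix array version is so array-friendly, here's the simple O(K log N) search (naive construction for brevity; the linear-time constructions and the LCP/RMQ tricks mentioned elsewhere in the thread are what improve the bounds further):

    #include <algorithm>
    #include <numeric>
    #include <string>
    #include <vector>

    // Naive O(n^2 log n) construction; good enough to show the idea.
    std::vector<int> build_suffix_array(const std::string& s) {
        std::vector<int> sa(s.size());
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Is `pattern` a substring of `text`? Binary search over the sorted suffixes.
    bool contains(const std::string& text, const std::vector<int>& sa,
                  const std::string& pattern) {
        auto it = std::lower_bound(sa.begin(), sa.end(), pattern,
            [&](int suffix, const std::string& p) {
                return text.compare(suffix, p.size(), p) < 0;
            });
        return it != sa.end() && text.compare(*it, pattern.size(), pattern) == 0;
    }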
If I recall correctly, the linear-time search approach is based on constant-time range minimum query on the suffix array. That feels rather likely to also be worse in real-life applications.
Heck, as far as I can tell, this approach is just another way of storing a tree.
Or Fibonacci trees. I have seen suffix tries used in the wild. I have yet to see a Fib tree. Shown to be optimal in a number of tree operations, horribly slow in practice.
That's true. Years ago I did some experiments with various trie representations, and despite my efforts, glibc malloc was reporting 40-50% internal fragmentation. Once I switched to memory arenas, I nearly got rid of the fragmentation.
You can fix some of that, and techniques like a binary trie can give better cache performance / smaller levels, but while a trie sounds simple, they are surprisingly difficult to get to perform better. The optimizations for hashes are far easier to implement IMHO.
Based on experience I'm going to form an educated guess (hypothesize) and say that both std::map and author's trie container are primarily bound by pointer chasing. Obviously one should test this.
If that's the case, use a C++ btree_map implementation.
I'm fond of sorted vectors as well, such as boost::flat_map. The worst case performance can be surprising! But eventually, if it gets too big, you'll need a btree.
Can you share an instance where you actually needed a btree? What made it a true need (a business need of some kind, maybe) and not just a measurable difference?
Generally the requirement for range queries and piercing queries. (Does range x,y contain object o; is object o overlapping range described by object p)
Additionally requirement to maximize immutability for reasons of thread safety.
Range queries do not really work well on sorted vectors, even if you have as many indices as needed.
Immutability and race freedom are even more complex.
With a tree, copy on write solves many problems. (And can be much cheaper than copying whole structure.) If not, you can atomically replace subtrees in a safe way.
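A tiny path-copying sketch of what's meant (hypothetical, simplified to an unbalanced binary tree of ints): readers holding the old root keep a consistent snapshot while a writer builds a new root that shares all untouched subtrees.

    #include <memory>

    struct Node {
        int key;
        std::shared_ptr<const Node> left, right;
    };

    // Insert copies only the nodes on the path to the insertion point;
    // everything else is shared with the previous version of the tree.
    std::shared_ptr<const Node> insert(const std::shared_ptr<const Node>& root, int key) {
        if (!root) return std::make_shared<const Node>(Node{key, nullptr, nullptr});
        if (key < root->key)
            return std::make_shared<const Node>(Node{root->key, insert(root->left, key), root->right});
        if (key > root->key)
            return std::make_shared<const Node>(Node{root->key, root->left, insert(root->right, key)});
        return root;  // key already present; reuse the whole tree
    }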
It would be interesting to see a zoomed in view of how things perform at small numbers of elements. Perhaps an unsorted vector is actually the best up to a certain point, for example.
No, the charts should use a log-log scale. Now only the x axis is logarithmic which makes the behavior confusingly exponential-looking and also hides the differences at small sizes.
The plots would have benefitted from a log scale on the y axis, or both axes. The behaviour at the low end can't be clearly visualised since it's squished to the baseline by the exponential behaviour for high x values.
This seems fairly involved for a problem that should be solved in the database...
I mean seriously, this is the kind of thinking that, at least in theory, is going on inside a relational database. If you have a table mapping these strings to values (or, in a more sane database, it's the other way around), you would just do the join and be done with it.
    auto get_type(const std::string& type) {
        auto e = m.find(type);
        if (e == std::end(m)) return type_t::num_types;
        return e->second;
    }
I don't understand the point of that code. Use "map.at(key)" and you get the value from the map. No need for function and branching and whatever.
If you really want to mess around, you compare "map.at(key)" against "map[key]". at works on a const map, [] is not const and will create the key if it's missing (if I remember correctly).
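Roughly, the difference looks like this (standard std::map semantics):

    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> m{{"circle", 0}};
        int a = m.at("circle");  // throws std::out_of_range if missing; usable on a const map
        int b = m["square"];     // not const: inserts "square" with a value-initialized int (0)
        return a + b;
    }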
To conclude, if you really really want to show off your optimization skill, you optimize return types: "auto &&" vs "auto &" vs "auto" vs a few other ones.
Can't help you with that last one. One decade of C++ and I'm still struggling with references, value copies, lvalue references...
That reminds me what a mess C++ is. 5 minutes of optimization and my brain already hurts. Good thing I moved to DevOps. More pay, less hassle.
I wrote a 3D engine in C++ many years ago and it still feels wrong. At least with C you get the power to do things the way you want to, and yes, it is mostly unsafe if you don't know what you are doing. IMHO C++ is a failed attempt to reinvent an already working language. C should not be extended. There are other languages like Go and Rust that provide good performance while not bastardizing C.
> At least with C you get the power to do things the way you want to
C++ gives you the same level of control over your program that C does, but also comes with nice low-cost (in many cases zero-cost) abstractions for when you need them.
Define "bastardizing C". Which improvements do you consider "failed attempts"? RAII, constructors/destructors, so you don't need to write spaghetti code to free resources reliably? Templates, so you can write a sort() function once, for all comparable types? Lambdas? Or maybe GLib is somehow better than stl/boost?
C++ saves TONS of time and effort in our projects, thank god that I don't need to write in plain C anymore.
Good luck chasing bugs then, though. qsort() will be totally happy comparing apples to oranges - aka there is zero type safety.
Templates give you compile time type checking; that's why one doesn't pass void pointers like this anymore but uses templates to implement generic functions.
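A small sketch of the contrast (hypothetical example): the qsort comparator takes void pointers and the casts are entirely on you, while the std::sort call is type-checked and the comparison can be inlined.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // C style: the compiler can't check that this really compares ints.
    int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    int main() {
        int arr[] = {3, 1, 2};
        std::qsort(arr, 3, sizeof(int), cmp_int);  // compiles just as happily with a wrong cast inside

        std::vector<double> v = {3.0, 1.0, 2.0};
        std::sort(v.begin(), v.end());             // works for any comparable type, checked at compile time
    }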
> Good luck chasing bugs then, though. qsort() will be totally happy comparing apples to oranges - aka there is zero type safety.
That's a fair point. However, I do not tend to make many of the mistakes that would be caught by the C++ type system. But different people tend to make different kinds of mistakes.
The type bugs that happen tend to be pretty obvious and easy to debug. Good luck to you debugging template code :) I've had a harder time debugging C++ code than C code. Again, YMMV.
Although sometimes "I do not tend to make many of the mistakes that would be caught by the C++ type system" might be related to "how to use the C++ type system so it would catch the mistakes people tend to make".
There are indeed trade-offs. But the comment I was replying to implied that you couldn't write a generic sort for all comparable types. That is wrong.
Unsafer yes.
But it's only slower because it's not being inlined. If the compiler can see both the implementation of qsort and the comparison function, there is no reason why it would be slower than a templated sort. You could accomplish this with link-time optimisation or by moving the implementation of qsort to a header file.
A trade-off you didn't mention: templates will cause multiple copies of the sort to be inlined in the executable. This can lead to bigger executables (with slower start-up) and more cache misses.
Another trade-off: templates lead to much slower compile times.
qsort might be general in the element type, but not in the container type. That is, it will only work for arrays.
Also without LTO it will be very expensive as it cannot inline the compare function call. C++ sort is just faster and more powerful.
> qsort might be general in the element type, but not in the container type.
That is a clear advantage of C++.
> Also without LTO it will be very expensive as it cannot inline the compare function call.
Templates force you to expose the implementation in a header file. In C you can choose. If you move the qsort implementation into a header file, the compiler should not have any trouble inlining it.
Inlining is not always a win, though (slower compiles, bigger code, more cache misses).
"I wrote a 3D engine in C++ many years ago and it still feels wrong"
Only half of the reasons why C++ is so popular have to do with the language itself. The other half is related to constraints that don't have as robust solutions in other languages.
Nobody is claiming C++ is a great language but for a lot of constraints there are no viable options.