Graph Engine vs. C++ unordered_map: memory consumption test, Part 2

jandrewrogers · on June 7, 2015

I would make the point that C++ unordered_map is not remotely optimized for this use case, so it is not surprising that it is less efficient. You normally would not use it for something like this. In fact, C++ unordered_map is not tunable enough to offer excellent performance for most use cases. It is a very generic implementation of an unordered_map.

In any code base where performance matters, most important use cases for unordered_map would be replaced with a purpose-built implementation (like with GraphEngine). Not only is it pretty trivial to design and implement custom maps for various use cases, you can usually ensure at least a 2X improvement across all relevant metrics relative to the standard library unordered_map.

Based on the relative metrics for GraphEngine and C++ unordered_map in this article, I would expect a purpose-built C++ implementation to be significantly faster than GraphEngine. I know that is not what is being measured but it in real-world C++ implementations, it is how most properly engineered architectures would do it.

nly · on June 7, 2015

Indeed. std::unordered_* implementations are hogtied by the interface and complexity requirements in the C++ standard (it basically mandates chained buckets), and more generally the need to work with almost any hash function (linear probing would suck with crappy hash functions for example).

Jo Muñoz wrote a cool blog series on the current implementations in circulation, the flaws with the interface, and how his implementation (Boost multi_index) made further optimisations despite the constraints.

Don't miss the other two parts of the series - look under October on the right. He did extensive bookmarks in November as well.

http://bannalia.blogspot.co.uk/2013/10/implementation-of-c-u...

StephanTLavavej · on June 7, 2015

I bookmarked those posts a while ago in the hopes that I'll get a chance to rewrite VC's unordered_map in the future. We've taken the first step towards doing so by deprecating the non-Standard hash_map (which shares its guts with the Standard unordered_map) in 2015, so I can remove it outright at the beginning of the next development cycle. Then the unordered family can be changed without fear of breaking the hash family.

nly · on June 7, 2015

Thanks for what you do STL (For those that don't know, Stephan maintains Microsofts C++ standard library). Even though I don't develop against MSVC at all, improving the state of C++ at Microsoft makes life easier for all of us in terms of portability and bringing nice new things in to common use.

Thank you also for maintaining your GCC distro for Windows[0] and the the many great presentations you have given at CppCon, Channel 9, Going Native etc.

[0] http://nuwen.net/mingw.html

StephanTLavavej · on June 7, 2015

You're welcome! And even if you don't use VC's STL, hopefully make_unique and the other things I've designed will be helpful.

nly · on June 7, 2015

And for making this possible

    vector<int> v = {1,2,3,4,5}; 
    cout << accumulate(begin(v), end(v), 1, multiplies<>());

jaytaylor · on June 8, 2015

I'm curious..

Is it possible to use VIsual C++'s STL inside of Linux?

StephanTLavavej · on June 8, 2015

Nope, our sources assume they're running on Windows.

joaquintides · on June 8, 2015

You might also consult the entries given below, which take into account some late improvements in Boost.MultiIndex not present at the beginning of the series:

http://bannalia.blogspot.com/2014/01/a-better-hash-table.htm... http://bannalia.blogspot.com/2014/01/a-better-hash-table-gcc... http://bannalia.blogspot.com/2014/01/a-better-hash-table-cla...

gsg · on June 8, 2015

That's an admirable sentiment, but aren't your hands tied by the promises made by the standard? In particular the addresses of elements are guaranteed to be stable, which would seem to prevent an implementor from relocating them to achieve more cache-friendly layout.

(Of course, you would want to do as well as possible within the given bounds.)

StephanTLavavej · on June 8, 2015

Yeah, unordered_meow is always going to be a pretty crummy hash table. But our implementation could be significantly improved while still obeying the Standard.

gsg · on June 8, 2015

Fair enough! Good luck with your rewrite.

aaggarwal · on June 7, 2015

While Graph Engine definitely looks useful, I agree that the comparison is unfair. GE is an optimized implementation while associative containers (unordered_*) in C++ are more generalized implementations. The above blog provided clearly explains optimized implementations of associative containers in C++. It would be interesting to see how these compare with GE.

v-yadli · on June 7, 2015

Please note that this test also only captures a single profile of Graph Engine. The cell entry lookup table and the MM module are designed to operate in a more complex environment and not optimized towards the direction of this micro benchmark(concurrency, logging, a strong type system, to name a few). Of course, it would be very interesting to test against other options, especially those concurrent dictionaries combined with custom allocators.

huhtenberg · on June 7, 2015

> with MSVC's built-in memory allocator

... which has been a trivial wrapper around HeapAlloc for a long time, and the HeapAlloc implementation varies by the Windows version. In particular, Vista+ versions are low-fragmentation heaps and W7+ versions have an outstanding multi-threaded performance.

In other words, deferring to a "MSVC's built-in memory allocator" doesn't make much sense.

StephanTLavavej · on June 7, 2015

And recently (in 2013, I forget about previous versions) the CRT began using the process heap, instead of having its own heap handle. So the malloc/new/std::allocator family is an even thinner wrapper around HeapAlloc in that sense.

(In 2015, we're introducing a little bit of extra trickery though. In std::allocator, we'll highly align large allocations to be friendly to autovectorization. This isn't done at the lower malloc/new levels because they don't get size information at deallocation time, but the std::allocator interface has always required that information even if it was previously ignored. IIRC this buys 15% or so, which makes the back-end devs drool. Our current magic number for "large" is 4 KB, so this basically never affects node-based containers, only stuff like big vectors.)

stinos · on June 8, 2015

> In std::allocator, we'll highly align large allocations to be friendly to autovectorization

Ok this and other things you posted here are mighty interesting - do you (or someone else working on this) have (has) a blog where news like this gets announced?

StephanTLavavej · on June 8, 2015

VCBlog is the VC team's official blog. See http://blogs.msdn.com/b/vcblog/archive/2014/06/06/c-14-stl-f... for where I announced the alignment thing.

chuckcode · on June 7, 2015

I've found measuring memory usage with STL containers to be difficult as they have their own allocator and unless you're using reserve() often end up allocating extra space for things. Dr Dobbs had an interesting article a long time ago around optimizing allocators [1] and also memory management was one of the main reasons that Electronic Arts rolled their own variant of the STL [2]. I use a lot of STL for convenience but have been burned enough times that when memory consumption his high or has heavy usage I manage it manually and make sure to use a modern allocator like jemalloc or tcmalloc[3].

[1] http://www.drdobbs.com/cpp/improving-performance-with-custom... [2] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n227... [3] http://locklessinc.com/benchmarks_allocator.shtml

StephanTLavavej · on June 7, 2015

By default, std::allocator just wraps new, so it doesn't make things any harder to measure. vector isn't "allocating extra space" for no reason - its capacity grows geometrically in order to provide amortized O(1) push_back. If you use reserve() incorrectly, you can actually trigger quadratic complexity (as I've done in the past when I didn't know any better). Even with unused capacity, vector is quite space-efficient. For example, with GCC's 2x growth, on average a vector will be using 33% extra space (that's 0.5/1.5). With VC's 1.5x growth, on average a vector will be using 20% extra space (that's 0.25/1.25). That's a pretty small cost to pay for vector's optimal locality. And in practice, many containers will actually be exactly-sized (range construction and copy construction will produce exact sizing), so the overhead is even lower.

halayli · on June 7, 2015

This test is flawed.

    start_time = std::chrono::steady_clock::now();

    for (auto entry : param_entries)
    {
        void* cell_buf      = new char[entry.cell_size];
        auto  upsert_result = LocalStorage.emplace(entry.cell_id, CellEntry{ cell_buf, entry.cell_size, 0 });

        if (!upsert_result.second)
        {
            std::swap(cell_buf, upsert_result.first->second.ptr);
            delete[] cell_buf;
        }

        if (rand() % 3 == 0)
        {
            delete[] upsert_result.first->second.ptr;
            LocalStorage.erase(entry.cell_id);
        }
    }

    end_time = std::chrono::steady_clock::now();

He's not testing unordered_map performance alone, but the performance of new/delete & unordered_map. Also entry is a copy of param_entries's item, it should be changed to auto& entry. So he's essentially copying the whole array while iterating.

v-yadli · on June 7, 2015

I'm pretty sure the values in entry will be loaded into registers.

halayli · on June 8, 2015

That doesn't change any fact.

peterazo · on June 7, 2015

Yet another example how C# developers can't write C++ code. :)