Assuming a 2 GHz CPU and, reasonably, that the data is in L1 cache, reading 14 kB from L1 takes: 14 kB / (32 B * 2 per cycle) / 2 GHz ≈ 110 nanoseconds.

It'll take a bit longer (maybe 3-5x as long?) to write it to newly allocated memory (likely cache misses), but the point is that copying memory is just not as expensive as is often assumed.
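For anyone who wants to sanity-check numbers like that, a quick-and-dirty loop along these lines will do. This is just a rough sketch, and with buffers this small you have to keep the compiler from optimizing the copies away:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define SIZE (14 * 1024)   /* the 14 kB from the estimate above */
    #define ITERS 100000

    int main(void) {
        char *src = malloc(SIZE);
        char *dst = malloc(SIZE);
        memset(src, 1, SIZE);   /* touch both buffers so they're cached */
        memset(dst, 2, SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* POSIX timer */
        for (int i = 0; i < ITERS; i++) {
            src[0] = (char)i;   /* keep the copies from being hoisted */
            memcpy(dst, src, SIZE);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per copy (dst[0]=%d)\n", ns / ITERS, dst[0]);
        return 0;
    }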

I'm pretty sure malloc is the bottleneck in this case; pointer chasing is slow.




I don't quite understand your argument - you give lots of numbers for how long it takes to copy, but ignore cache misses and just come to the conclusion that `malloc` is the bottleneck because "pointer chasing is slow". If you're going to give theoretical numbers for the copy, why not give numbers for the "pointer chasing" as well? The numbers you gave don't really mean much if there is nothing to compare them to. Yes, 110 nanoseconds is fast, but copying a 700 byte array 20 times was never intended to be an example that would grind your computer to a halt.

You also don't give enough weight to cache misses in this scenario. An L1 cache miss (so the memory is in L2) is going to be approximately a ~15x slowdown compared to an L1 hit, and a trip all the way to main memory will be about ~200x slower - not nearly the 3-5x you're guessing. Moreover, the memory `malloc` misses on will likely be the same memory you're about to allocate anyway, meaning those cache misses were probably going to happen regardless.

With that said, the `malloc` time should be somewhat constant regardless of the size of the allocation (unless your allocation requires requesting memory from the OS), but the copying won't be. Once you start getting into structures that are fairly large, copying them into newly `malloc`ed memory anytime you expand them will end up being a fairly expensive operation - one I would easily wager dwarfs any time spent in `malloc` itself.

But I mean, it's easy to argue basically any of these points without actual testing, so I've tried running a few basic tests on my own. Obviously your results would vary (I'm currently running on a fairly slow CPU):
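Roughly, each run was a variation of this shape (a simplified sketch rather than the exact code I used - block size, allocation count, and the copy/free switches were the knobs, and I timed whole runs externally):

    #include <stdlib.h>
    #include <string.h>

    /* Allocate 'count' blocks of 'size' bytes.  Optionally copy the previous
       block into each new one, and optionally free the previous block, to
       separate the cost of malloc itself from the cost of the copying. */
    static void run(size_t size, size_t count, int do_copy, int do_free)
    {
        char *prev = calloc(1, size);
        for (size_t i = 0; i < count; i++) {
            char *cur = malloc(size);
            if (do_copy)
                memcpy(cur, prev, size);
            if (do_free)
                free(prev);
            prev = cur;
        }
    }

    int main(void)
    {
        /* e.g. 40000 allocations of 1k blocks, with copy and with free */
        run(1024, 40000, 1, 1);
        return 0;
    }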

First, I ran tests on different block sizes and numbers of allocations without freeing any of the allocated memory. In these tests, the copying tended to overshadow the allocation cost once the block size was above around 8k. But because I wasn't calling `free`, allocating would quickly eat up most of the time if the number of allocations was increased (with most of the time being spent in the kernel). I did most of my tests with around 40000 allocations, and above that things would get quite slow on the allocations alone, even with block sizes larger than 8k.

With calling free, however, things tend to lean much more heavily in favor of the copying being more expensive. Right around a block size of 1k is where the allocation costs about the same time as the copy. The speed-up is of course due to `malloc` always having enough free memory available and not having to ask the OS for more (so this test represents the basic cost of the "pointer chasing" alone). Also, like I had assumed, the time for `malloc` without copying depends only on the number of allocations - changing the block size makes no difference to the time when there's no copy.

Now that said, these tests are by no means definitive, but at the very least they do show that `malloc` is not necessarily expensive compared to the copying, especially once you go above a few k in your `malloc` size, and/or when the total number of allocations starts to ramp up very high. If your allocation size is fairly small, though, then I would agree the `malloc` time is at least equivalent to (if not more than) the time spent copying.


I oversimplified, but I think my numbers are roughly in the ballpark when it comes to copying. It's fast because CPUs are optimized for linear memory access patterns. I also took into account the high likelihood that the memory was touched before the copy and will be touched again immediately after it. One should consider cache misses at the system level, not in isolation.

"malloc" will need for the very least to keep lists of free areas and try to avoid virtual memory fragmentation. Fragmented heaps (="real life code" situation) are significantly slower than fresh clean heaps, because the are more allocation list entries to check.

Because the amount of work malloc needs to do is variable, it's hard to give any numbers for how long that pointer chasing will take.

Pointer chasing is slow because non-sequential access patterns, such as following pointers, create data dependencies and cannot be prefetched. The CPU is likely to stall waiting for each dependency to be resolved. Typically each pointer dereference ("chase") will also touch a different cache line, so cache misses are common too.
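You can see the effect in isolation by comparing a plain sequential sum against chasing a randomly shuffled chain of indices - same number of loads, completely different access pattern (a rough sketch; the exact ratio depends entirely on the machine):

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 22)   /* big enough to fall out of cache */

    /* Sequential: the next address is known in advance, so the hardware
       prefetcher can stream the data in. */
    static long sum_array(const long *a)
    {
        long s = 0;
        for (long i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Pointer chasing: each index comes from the previous load, so every
       access is a data dependency the CPU has to wait on. */
    static long chase(const long *next)
    {
        long steps = 0, i = 0;
        do {
            i = next[i];
            steps++;
        } while (i != 0);
        return steps;
    }

    int main(void)
    {
        long *a = malloc(N * sizeof *a);
        long *next = malloc(N * sizeof *next);

        for (long i = 0; i < N; i++) {
            a[i] = i;
            next[i] = i;
        }
        /* Sattolo's algorithm: shuffle into one big random cycle so the
           chase hops all over memory instead of walking it linearly. */
        for (long i = N - 1; i > 0; i--) {
            long j = rand() % i;
            long tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        /* time each of these separately with your tool of choice */
        printf("%ld %ld\n", sum_array(a), chase(next));
        return 0;
    }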

Synthetic tests are far more likely to have everything (including the allocation lists) resident in L1 cache than anything in real life.

Many malloc implementations use mmap (or equivalent) for allocations that exceed a certain threshold. In that case, copying is slow: you'll pay a lot extra for page faults during the copy. Page faults might also occur on a fresh heap in a synthetic test.
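glibc, for instance, lets you move that threshold with mallopt(M_MMAP_THRESHOLD, ...) (it defaults to 128 kB, if I remember right), which makes the switchover easy to poke at. A small, glibc-specific sketch:

    #include <malloc.h>    /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* push the threshold down so a 256k allocation is served by mmap() */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);

        char *p = malloc(256 * 1024);

        /* the pages behind an mmap'd block are faulted in lazily, so the
           cost shows up here, on the first write (or the copy), not in
           malloc() itself */
        memset(p, 0, 256 * 1024);

        free(p);   /* mmap'd blocks go straight back to the OS */
        return 0;
    }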

I know from personal experience that lots of small mallocs can sometimes cause non-trivial performance issues.

Standard disclaimer concerning any optimization: always profile your code. Things work differently in real code than in benchmarks, and what was true last time in your real-life scenario is in no way guaranteed to be true next time.



