Fewer mallocs in curl (haxx.se)
392 points by dosshell on April 23, 2017 | 130 comments



There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.

Here's an old blog post describing it, by DHAT's author: https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...

Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly: https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...

And here is the user manual: http://valgrind.org/docs/manual/dh-manual.html


> I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. ... the resulting rustc was about 5–10% slower. So I’ve stopped using it now.

...why is this the case?


It breaks the code into smaller modules that are processed independently, thus limiting the scope of optimizations to those modules.


For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.

The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).
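Roughly, the difference between the two layouts looks like this (a sketch with hypothetical struct names, not curl's actual types):

    /* Non-intrusive list: a separate node points at the payload,
       so every element costs two mallocs (node + payload). */
    struct node {
      struct node *prev, *next;
      void *payload;
    };

    /* Intrusive list: the prev/next links live inside the payload struct,
       so every element costs one malloc (or none, if it sits on the stack
       or inside an array). */
    struct element {
      int payload_field;
      struct element *prev, *next;
    };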


> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload.

I wonder why this wasn't the original design; I've seen the "two-allocate" method a few times in other code and it's always seemed rather silly to allocate that separate little structure just to point to the things you're linking together anyway, so I'm curious how that way of doing it became somewhat commonplace.


Mayhaps because the structure was originally stack-allocated, or because the developer was unfamiliar or uncomfortable with intrusive data structures. Or because the structure is used outside of linked-list contexts and the memory overhead of the intrusive links was considered not worth it?


> Or because the structure is used outside of linked list contexts

That use case is possible with the particular implementation that is now used by curl:

    struct fileinfo {
      struct curl_fileinfo info;
      struct curl_llist_element list;
    };
You can use the payload (struct curl_fileinfo) outside of a list without incurring any overhead.

I think one of your other points is more likely, or simply "it was fast enough and we went for more interesting features than for such optimizations".


One of the Plan 9 C enhancements makes this even easier by permitting the fields of these structs to be accessed directly.


You can do this now in ISO C11 with the inclusion of anonymous structures and unions.

    struct A {
        int x, y;
    };

    struct B {
        union {
            struct A a;               /* named member */
            struct { int x, y; };     /* anonymous struct mirroring A */
        };
        int z;
    };

    ...

    struct B foo;
    foo.x = foo.a.y = foo.z = 0;
You can access x and y as members of B or A.


Well, there are situations where intrusive list pointers don't work well. If you don't control the class, then you simply can't put pointers in it. If most objects aren't ever in a list (e.g. a million objects but only ten in a list at any given time), then you're wasting space. If you want to put objects in several different lists, then you have to know up front how many lists and their names, and this can become unmanageable.


You can always allocate memory big enough for the payload plus the link structure without the links actually being inside the payload. Just make the first part of the memory contain the links, and immediately after it you have the payload.
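A rough sketch of that layout (hypothetical helper names; it ignores alignment subtleties): the links and the payload share one malloc'ed block, and the payload type never has to know it is on a list:

    #include <stdlib.h>

    struct links {
      struct links *prev, *next;
    };

    /* One malloc for links + payload; the caller only ever sees the payload. */
    static void *list_alloc(size_t payload_size)
    {
      struct links *l = malloc(sizeof(*l) + payload_size);
      if (!l)
        return NULL;
      l->prev = l->next = NULL;
      return l + 1;                  /* payload starts right after the links */
    }

    /* Recover the links from a payload pointer when linking/unlinking. */
    static struct links *list_links(void *payload)
    {
      return (struct links *)payload - 1;
    }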


It's the object-oriented approach, isn't it? So it will be natural to some (because it was taught to them that way).


AKA intrusive (sometimes "invasive") containers / lists.


> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload

And when you have a memory overwrite, your metadata is corrupted and you lose the complete list. I keep my metadata and payload in two separate lists for that very reason.


So you structure your code to get more subtle errors in case of memory overwrites? "Losing the complete list" sounds catastrophic and like something that would be easily noticed, and thus fixed.

Arranging the code to "keep trucking" in case of spurious memory overwrites seems like setting it up to be hard to debug, and having programs run in some ill-defined state where some portion of data has been overwritten but nobody notices is just super-scary.


I think this is fantastic engineering work towards performance, without falling back on the "RAM is cheap" line and doing nothing instead.

It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.


RAM is cheap.

What's not cheap is requesting RAM from the OS piecemeal.

In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.

For example a common optimization in high-level programming languages is to overallocate memory for strings. E.g. if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32 etc.

This wastes RAM (a 513-byte string will malloc 1024 bytes) but ends up being much faster on average, exactly because RAM is cheap and allocations are expensive.
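A minimal sketch of that growth policy (a hypothetical helper, not any particular runtime's code):

    #include <stddef.h>

    /* Round the requested capacity up to the next power of two, so a string
       that grows byte by byte triggers O(log n) reallocations, not O(n). */
    static size_t round_up_capacity(size_t need)
    {
      size_t cap = 16;        /* smallest bucket */
      while (cap < need)
        cap *= 2;
      return cap;             /* e.g. need = 513 -> cap = 1024 */
    }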


it doesn't go to the OS each time you call malloc()

the first time malloc() is invoked it will ask the OS for an oversized chunk of memory (probably using mmap() or sbrk())

allocations consist of incrementing a pointer, or pulling a chunk off a freelist, both of which are effectively free

mmap() is very fast to call too, so I'm not sure what you're on about

regardless, curl is a piece of software that spends 99.99% of its time waiting for network traffic


On Linux. Basic core libs like libCurl can find their way into MMU-less embedded systems with relatively crude malloc()s. Sane allocation behavior is needed to ward off fragmentation that hurts the entire system.


Wouldn't it be better to write a shim library that provides Linux-like MMU behavior at a userspace per-process level, and then target libs like libcurl to it? It'd be a sort of "memory policy driver" that you'd port to an embedded OS.


You really don't want library code providing or dictating malloc behavior, that's how things like OpenSSL and Heartbleed end up happening. It may be slower, but dealing with the politics of a particular malloc can be uglier than just making your code work well enough most of the time.


Maybe more than you think; if you're running a multithreaded app, malloc/free/new/delete need mutexes to protect the heap data structures, and that can also lead you into the kernel.


Pulling a chunk off a freelist and maintaining that free list is most definitely not free.


so what, to return from a freelist you alter the head pointer to point to the next item and return it

that's what, 2 or 3 instructions? even if you push that up to a few hundred it's still effectively free compared to anything else curl is going to do (e.g. writing to the disk)


A structure allocated in one chunk has cache locality and will be significantly faster to access than dereferencing multiple pointers.

The allocation itself is not the bottleneck, it's the poor choice of data structure.


Yup, malloc/free isn't usually algorithmically expensive but the cache misses tend to be the brutal part.

You'll spend 15 cycles doing work on 600 cycles of cache fetching. Plus you've just evicted some cache for your local scope.


you have to handle the case in which you have no remaining items as well, breaking up a chunk log(chunkToBreak/chunkSizeNeeded) times. And in the case that you have no viable chunk to use, you need to call back to the OS.


    > I'm not sure what you're on about.
You clearly interpreted OS to mean "the kernel" in my comment. To be clear I'm including the C library which implements malloc() in "the OS".

    > curl is a piece of software that spends 99.99%
    > of its time waiting for network traffic.
You're commenting on an article that shows that in some cases 30% of the time was spent on malloc() overhead.


well yes, they removed the network overhead and disk overhead by running it on localhost and writing to /dev/null, and once you remove two bottlenecks you'll find that something else becomes the new bottleneck

libc is not the OS, it's the standard library for C, in the same way the java class library isn't the OS, it's the standard library for java


You seem insistent on having some side-thread argument with me for some reason.

    > once you remove two bottlenecks...
The article we're both commenting on goes to great length to point this out. I don't know why you're trying to convince me or anyone else of this.

Yes, in a network library malloc overhead isn't usually a worthwhile thing to worry about, but libcurl is hardly just any random library.

    > libc is not the OS, it's the standard library for C.
I'm not trying to convince you that you should consider libc part of the OS.

I was merely clarifying that in the context of my comment, which you replied to assuming I just meant the kernel, that I was talking about the C library as well.

FWIW lots of people use "Operating System" to mean the set of software you get by default. OpenBSD, GNU, OSX, Android etc. definitely consider libc to be part of the OS, and Android probably considers its java class library part of the base OS. If everyone meant "the kernel" by "the OS" we'd just say "the kernel".

You don't have to agree with that, but correcting people on that point is about as futile as trying to convince people that your use of "hacker" is correct.


Surprisingly, when you remove the actual work, 100% of the time spent is overhead! What a waste!


Which C library from which C compiler?


That is compiler and OS specific behavior.


I would only add that the expensive part isn't the allocation per se, but copying the data from one chunk to another. If you allocate 1024 when you only need 700, then when you end up needing 701 you don't have to copy all 700 bytes into a new chunk. This can add up pretty quickly if you're copying the data every time you append one new element. If I appended 20 bytes to my array of 700 one at a time and copied each time, I'd end up copying around 14K worth of bytes when everything is said and done.

But that said, I would also keep in mind that `malloc` will generally be pretty fast - it's not going to hit up the OS every time you make a request. The advantage here isn't because `malloc` is slow, it's because copying is slow.


Assuming a 2 GHz CPU and the reasonable assumption that the data is in L1 cache, reading 14 kB from L1 cache takes 14 kB / (2 x 32 B per cycle) = ~220 cycles, which at 2 GHz is ~110 nanoseconds.

It'll take a bit longer (maybe 3-5x as long?) to write it to newly allocated memory (likely cache misses); the point is that copying memory is just not as expensive as is often assumed.

I'm pretty sure malloc is the bottleneck in this case, pointer chasing is slow.


I don't quite understand your argument - you give lots of numbers for how long it takes to copy, but ignore cache misses and just come to the conclusion that `malloc` is the bottleneck because "pointer chasing is slow". If you're going to give theoretical numbers for the copy, why not give numbers for the "pointer chasing" as well? The numbers you gave don't really mean much if there is nothing to compare them to. Yes, 110 nanoseconds is fast, but copying a 700 byte array 20 times was never intended to be an example that would be grinding your computer to a halt.

You also don't give enough weight to cache misses in this scenario. An L1 cache miss (so the memory is in L2) is going to be approximately a ~15x slowdown compared to L1. A full main-memory hit will be about ~200x slower. Not nearly the 3-5x you're guessing. Moreover, the cache misses in `malloc` will likely be on the same memory you're allocating anyway, meaning that those cache misses were likely going to happen regardless.

With that said, the `malloc` time should be somewhat constant regardless of the size of the allocation (unless your allocation requires requesting memory from the OS), but the copying won't be. Once you start getting into structures that are fairly large in size, allocating new memory and copying them over any time you expand them ends up being a fairly expensive operation, and I would easily wager the copying dwarfs any time spent in `malloc` itself.

But I mean, it's easy to argue basically any of these points without actual testing, so I've tried running a few basic tests on my own. Obviously your results would vary (I'm currently running on a fairly slow CPU):

First, I ran tests on different block sizes and numbers of allocations without freeing any of the allocated memory. In these tests, the copying tended to overshadow the allocations without copying when the block size was above around 8k. But because I wasn't calling `free`, allocating would quickly eat up most of the time if the number of allocations was increased (with most of the time being spent in the kernel). I did most of my tests with around 40000 allocations, and above that things would get quite slow with the allocations alone, even with block sizes larger than 8k.

With calling `free`, however, things tend to lean much more heavily in favor of the copying being more expensive. Right around a block size of 1k is where the allocation costs about the same amount of time as the copy. The speed-up is of course due to `malloc` always having enough free memory available and not having to ask the OS for more (so this test represents the basic cost of the "pointer chasing" alone). Also, like I had assumed, the time for `malloc` without copying is 100% dependent on the number of allocations - changing the size of the block doesn't make any change to the time without the copy.

Now that said, these tests are by no means definitive, but at the very least they do show that `malloc` is not necessarily expensive compared to the copying, especially once you go above a few k in your `malloc` size, and/or when the total number of allocations starts to ramp up very high. If your allocation size is fairly small, though, then I would agree the `malloc` time is at least equivalent to (or more than) the time spent copying.


I oversimplified, but I think my numbers are about in the ballpark when it comes to copying. It's fast, because CPUs are optimized for linear memory access patterns. I also took into account high likelihood the memory was touched before copying and will be touched again immediately after copying. One should consider cache misses on system level, not in isolation.

"malloc" will need for the very least to keep lists of free areas and try to avoid virtual memory fragmentation. Fragmented heaps (="real life code" situation) are significantly slower than fresh clean heaps, because the are more allocation list entries to check.

Because the amount of work malloc needs to do is variable, it's hard to give any numbers for how long that pointer chasing operation will take.

Pointer chasing is slow, because non-sequential access patterns, such as following pointers, create a data dependency and cannot be prefetched. The CPU is likely to stall waiting for the dependency to be resolved. Typically each pointer dereferencing ("chasing") operation will also touch a different cache line, so cache misses are also common.

Synthetic tests are way more likely to reside in L1 cache (including allocation lists) than anything in real life.

Many malloc implementations use mmap (or equivalent) for allocations that exceed a certain threshold. In that case copying is slow; you'll pay a lot extra for page faults during the copy. Page faults might also occur on a fresh heap in a synthetic test.

I know from personal experience that lots of small mallocs sometimes cause non-trivial performance issues.

Standard disclaimer concerning any optimizations: always profile your code. Things work differently in real code than in benchmarks, and what was true last time in your real-life scenario is in no way guaranteed to be true next time.


Also, when you download FLAC files with it, they'll sound warmer.


https://www.audioasylum.com/messages/pcaudio/119979/re-a-rev...

> also most players use malloc to get memory while new is the c++ method and sounds better.


This is a new rabbit hole of "audiophile".

Sidenote: "audio asylum", at least the name is appropriate, and it's https, I guess that's good too.


There is actually a corner on the web where people debate about FLACs and WAVs sounding different


This is probably true in non-blind listening comparison though. Similar to food tasting different when served in different coloured receptacles[0]

[0] http://onlinelibrary.wiley.com/doi/10.1111/j.1745-459X.2012....


If you had to summarize the audiophile community in a few words, it would be "does not understand the point of blind testing."


All audiophiles eventually reach the stage of "my hearing aid is better than yours" :)


[flagged]


We've banned this account and the main one for repeatedly violating the guidelines and creating new accounts to do so.



That's fun considering they are bit-to-bit identical :D


Yes, you got the pun, bravo, 2 geek-points for you.

However, you lose 20 geek-points because FLAC and wav are not bit-to-bit identical. They are both lossless encodings and correct implementations will preserve the input material bit-to-bit.


Perhaps he's referring to the bits sent out from the codec rather than the bits received by the codec?


FLAC is bit to bit identical to WAV after decoding.

In layman terms, it's like zipping and unzipping a .wav


Yes, I understand how the FLAC codec works. My comment was pointing out that FLAC can either represent the codec or the binary and that in neither case one can say that FLAC is bit-to-bit identical, since in the former case you're referring to the specification and in the latter case you are referring to the binary executable.

One can say that FLAC is a lossless compression codec and as such it is homomorphic to the original raw binary encoding and the WAV encoding.

It might be pedantic, but saying "flac is bit-to-bit identical to wav" is just wrong. To me it sounds like saying: "The speed limit is 80 kilometers".


Disagree, they sound the best with wget.


You've never experienced FLACs downloaded via scp. The encryption makes all the difference.


Is this reddit?


What is the difference?


Worth preserving.


You know, all that memory sloshing around does create a different sound stage, rippling contextual frequencies affecting DACs and such?


Especially when played over my diamond coated high fidelity wires. I'll probably pair that with my fully 24K gold HDMI cables when watching a movie.


>It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.

It's even more impressive when you think about the absolute ubiquity of cURL in practically every device imaginable. He's probably single-handedly saving many kilowatts of power consumption with fixes like this.


Indeed. Roadways are cheap to use, but those who behave as if it's all theirs are still called dicks for a reason.


This is a completely different scenario than an average app. Curl is used in cars and other embedded devices. These things matter at that scale.

>"RAM is cheap"

Memory is cheap on servers and desktops. SSD can be treated as slow RAM.


This only shows the naivete of the author. There are plenty of C programmers who aren't malloc abusers and like to avoid any potential syscall usage. Intrusive linked lists are a beginner C topic.


I'm a C beginner so I found it quite useful. ;)

Your comment makes it sound like everyone should do this. Are there not reasons to have a non-intrusive list as well? For example, if the structure is used elsewhere and you don't want to tie it to that list implementation. Or if you want to point to the same structure many times in a list?


The underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from the best in the world. If curl were written in C++, those optimizations would be applied automatically by using STL collections.


Somewhat true, but many C++ programs implement their own containers because they find the STL "slow" or "bloated" or whatever. (For concreteness, both GCC and LLVM do this.) So I guess the same might eventually happen to curl, in which case we'd get this same kind of article on Hacker News ;-)


GCC used to be written in C (not C++) until a couple of years back, and the codebase's origins predate both the STL and C++. So they are not the greatest example.


As you say, GCC switched to C++ just a few years ago, when the STL had already existed for decades. I don't understand how you conclude that they couldn't have just used the STL when they did the switch to C++. Implementing their own containers was more work without any gain in portability.


STL maps are pretty slow and can be beat without too much effort.


You mean with hash tables? Different underlying structure, so I wouldn't call that a reimplementation.

Anyway, the projects I mention above also define their own vectors.


Even std::unordered_map is easy to beat on most workloads if you relax the standard's requirements a little, because the standard more or less requires hashing with chaining ("The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket"). A good linear-probing hash table will be faster most of the time, e.g. Google's dense_hash_map.
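For illustration, a linear-probing lookup can be sketched like this (hypothetical table layout; it assumes the table is never completely full): the probe walks consecutive slots in one flat array instead of chasing per-bucket pointers:

    #include <stddef.h>

    struct table {
      size_t cap;           /* number of slots, a power of two */
      int *keys;            /* cap slots */
      unsigned char *used;  /* 1 if the slot holds a key */
    };

    /* Returns nonzero if key is present; assumes at least one empty slot. */
    static int table_contains(const struct table *t, int key, size_t hash)
    {
      size_t i = hash & (t->cap - 1);
      while (t->used[i]) {              /* probe consecutive slots */
        if (t->keys[i] == key)
          return 1;
        i = (i + 1) & (t->cap - 1);
      }
      return 0;
    }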


I don't believe C++ compilers can do optimizations around allocation. A quick test at https://godbolt.org/g/lFQcKf reveals that gcc does not appear to coalesce STL operations, or move the allocation to the stack. The compiler is much dumber than most people think it is.


LLVM can remove allocations that are unused, but that's it.

There's been talk of doing escape analysis for allocations on the mailing list, but so far nothing has been done as far as I know.


I'm not sure I follow you, but std::list implementation stores object data inside node, so it's only a single allocation per node, as it should be.


He turned his containers into intrusive ones. The standard C++ STL containers are not intrusive. Boost comes with intrusive containers, but if you need a third party library, you can do that in C too.


They have many of the same properties as being intrusive, e.g. one can write a list type that can store any type and store it directly inline next to the next/previous pointers, not needing an allocation like the common C strategy of using void*.


What, in your view, is meaningfully different between an intrusive container and std::list allocating memory right next to the prev/next pointers for your data?


For data in a single list, they're essentially equivalent, but they differ when one wants to put things in multiple lists.

For instance, a collection of Ts might want to have a list of the ones with property A and the ones with property B (independently, so any T could be on list A or B, or both, or neither). This could be done by having two std::list<std::shared_ptr<T>> (or something similar, like replacing the shared_ptrs with raw pointers, or indices into an external std::vector<T>), but this pays for the storage for each T, for the prev/next pointers, and also for the storage of the pointers to the Ts (plus the extra overhead of shared_ptr, etc.). With intrusive lists, one can have next_a/prev_a and next_b/prev_b pointers directly in the Ts and only pay that slight extra cost.
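In C the same idea looks roughly like this (hypothetical names); each T carries two independent sets of links, so being on list A and/or list B costs no extra allocations:

    struct T {
      int payload;
      struct T *prev_a, *next_a;   /* links for list A (unused when not on A) */
      struct T *prev_b, *next_b;   /* links for list B (unused when not on B) */
    };

    /* Push t onto the front of list A; *head_a is the head of list A. */
    static void push_a(struct T **head_a, struct T *t)
    {
      t->prev_a = NULL;
      t->next_a = *head_a;
      if (*head_a)
        (*head_a)->prev_a = t;
      *head_a = t;
    }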


That's a really good point and I hadn't considered it. Thanks!


No, that's not the problem. STL containers rely on mallocs internally. Also, the major saving came from curl_multi_wait:

> At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.


My problem with excessive allocations is usually what happens in interpreted languages. People think, hey, it's already slow-ass interpreted code, so let's not care about allocation at all.

An example which I see all the time is in the many Python libraries which in the end do I/O against a TCP socket. Sometimes the data, between what the user passes to the library and what goes out to the socket, can be retained as an array of buffers which are to be sent to the socket.

Instead of iterating over the array and sending each block (if big enough) on its own to the socket, the library author concats them into one buffer, and then sends it over the socket.

When dealing with big data, this adds lots of fragmentation and overhead (measurable), yet some library authors don't care...

Even the basic httplib and requests have this issue when sending a large file via POST (they concat it to the header, instead of sending the header, then the large file).
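For comparison, the C-level way to "send the pieces as they are" on POSIX systems is scatter-gather I/O with writev(); a rough sketch (hypothetical wrapper, no short-write handling):

    #include <sys/types.h>
    #include <sys/uio.h>   /* writev() and struct iovec, POSIX */

    /* Send header and body in one call without concatenating them first. */
    ssize_t send_request(int fd, const char *hdr, size_t hdr_len,
                         const char *body, size_t body_len)
    {
      struct iovec iov[2];
      iov[0].iov_base = (void *)hdr;
      iov[0].iov_len = hdr_len;
      iov[1].iov_base = (void *)body;
      iov[1].iov_len = body_len;
      /* may write less than requested; a real caller loops on short writes */
      return writev(fd, iov, 2);
    }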


Optimization backed by comparative statistics. These reads are so satisfying. Thank you for submitting.


Explicit dynamic memory handling in low-level languages hurts in a similar way to garbage collectors in high-level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally usually implement O(n log n) algorithms, or worse). So the point for performance, no matter if you work with low-level or high-level languages, is to use preallocated data structures when possible. That way you'll have low fragmentation and fast execution, because you're not calling the allocator/deallocator in the case of explicit dynamic memory handling, and lower garbage collector pressure for the same reasons in the case of garbage-collected languages.


An allocator doesn't have to be slow. You can implement an allocator yourself that asks for memory from the OS up front, and just hands pointers back to the calling application. If you know the order of allocations is the reverse of the deallocations (as an example), you can do allocations with a pointer bump!


Sure. Imagine you have a process with 2^20 active allocations (e.g. 2^20 calls to malloc()), i.e. you have a tree with the metainformation, and every time you de-allocate and allocate you have to search through one or more trees. So no matter how "smart" your library is about avoiding OS system calls, you already have a hell to maintain (search through a tree or a list, delete, split, etc.). Things get ugly when a process has lots of dynamic memory calls, in non-trivial cases.


> you have a tree with the metainformation, and every time you de-allocate and allocate you have to search through one or more trees

Don't use a tree in that case? I don't see why you using a tree to store allocation info means that allocations are slow.


Whatever other data structure you use is not going to be simple, and will have even worse drawbacks: e.g. using hash tables for handling allocation pools and free blocks could be even worse.


An alternative, if you have a well-defined portion of code by the end of which you know you're not going to touch any data which was allocated in it, is to use arenas - allocation is then a pointer bump, and you can deallocate all of it at once.
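A bare-bones arena sketch (hypothetical, no alignment handling or growth) to make that concrete:

    #include <stddef.h>

    struct arena {
      char *base;     /* start of the preallocated block */
      size_t used;    /* bytes handed out so far */
      size_t cap;     /* total size of the block */
    };

    static void *arena_alloc(struct arena *a, size_t n)
    {
      char *p;
      if (a->cap - a->used < n)
        return NULL;            /* out of space (a real arena might grow) */
      p = a->base + a->used;
      a->used += n;             /* allocation is just a pointer bump */
      return p;
    }

    static void arena_reset(struct arena *a)
    {
      a->used = 0;              /* "frees" everything at once */
    }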


Or even better: for local small allocations just use the stack.


Alloca is non-standard and apparently a C11 compliant compiler can choose not to implement variable-length arrays - there's no standard way of allocating on the stack.

Also - are you utterly certain your allocations will always be small? Could someone craft some input to your program to have you allocate more, and possibly use up the stack in some environments? There are environments which have much smaller stacks than they have RAM available.


You can do stack allocation in C without "alloca()", e.g.

    size_t example(const char *in, const size_t in_size,
                   char *out, const size_t max_out_size)
    {
      size_t out_size = 0;
      char stack_buf[32768];
      char *buf = stack_buf;
      char *heap_buf = NULL;
      size_t req_size = in_size * 2;

      if (req_size < sizeof(stack_buf)) {
         buf = heap_buf = (char *)malloc(req_size);
      }

      // Do stuff

      if (heap_buf)  {
         free(heap_buf);
      }

      return out_size;
    }


That's precisely what's being done in curl. Generally when people say "stack allocation" as an alternative to "malloc" they mean alloca/VLA, so you'll forgive my confusion.


ERRATA:

if (req_size < sizeof(stack_buf))

Should be:

if (req_size > sizeof(stack_buf))


You don't always know up front how much space you need.


You can guess a maximum for covering 99.99% of the cases, and use dynamic memory just in the 0.01%. E.g.

    size_t example(const char *in, const size_t in_size,
                   char *out, const size_t max_out_size)
    {
      size_t out_size = 0;
      char stack_buf[32768];
      char *buf = stack_buf;
      char *heap_buf = NULL;
      size_t req_size = in_size * 2;

      /* fall back to the heap only for the rare oversized case */
      if (req_size > sizeof(stack_buf)) {
         buf = heap_buf = (char *)malloc(req_size);
      }

      // Do stuff

      if (heap_buf)  {
         free(heap_buf);
      }

      return out_size;
    }


My rule of thumb is to look at the application's design and only ever use malloc where there is a 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.
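A sketch of what that rule of thumb means in practice (hypothetical structs): 1:1 sub-objects are embedded so they ride along in the parent's single allocation, while a 1:N relationship still gets its own allocations:

    struct config  { int timeout_ms; };                  /* exactly one per connection */
    struct request { struct request *next; /* ... */ };  /* zero or more per connection */

    struct connection {
      struct config cfg;         /* 1:1 -> embedded, allocated with the connection */
      struct request *requests;  /* 1:N -> list elements malloc'ed individually */
    };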


Your heuristic sounds interesting. Can you say a little more about which entities and how to determine their relationship? Thanks.


We've had data that was fixed-size but wouldn't fit on the stack.


> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.

That's not necessarily true. Modern allocators tend to use a bunch of fixed-size buckets.

But given that curl runs on lots of platforms it makes sense to just fix the code.


And often those fixed-size buckets' smallest size is 32 bytes. It still has to at least have a freelist.


There has to be a free list somewhere, but a single bucket only needs a bitmap. I think the GP's point is that such a structure amortizes the cost of the free-list metadata over more items, reducing total overhead.


The free list can be stored inside the empty cells, meaning you put the pointer to the next empty cell inside the previous empty cell, and you need a single additional pointer to store the location of the first empty cell. When you free a cell you just make that the first empty cell.
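A sketch of that trick for a fixed-size-cell pool (hypothetical, single-threaded; cells must be at least sizeof(void *) bytes):

    struct pool {
      void *free_head;     /* first free cell, or NULL when the pool is exhausted */
    };

    static void *pool_alloc(struct pool *p)
    {
      void *cell = p->free_head;
      if (cell)
        p->free_head = *(void **)cell;   /* next-free pointer lives in the cell */
      return cell;
    }

    static void pool_free(struct pool *p, void *cell)
    {
      *(void **)cell = p->free_head;     /* reuse the freed cell to store the link */
      p->free_head = cell;
    }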


This isn't free either because the free list is scattered around in memory and all that pointer chasing is bad for caches.


It isn't free, but all that pointer chasing is one step for each alloc and one for each free. There is no need to look for a 'best' block on alloc, nor is there a need to search for a best place to insert a block on free.

Moreover, the cache hit you take for an alloc likely would have happened anyways because, presumably, the program that made the allocation wants to write to the allocated memory (in theory, that doesn't imply the CPU has to _read_ the memory, but are there allocators that are that smart?)

For frees, the memory may, but need not be, already be in the cache when free is called.


I'm curious if any allocators with size classes/buckets actually do much pointer chasing. One can theoretically have a free-list for each class, so at most one pointer needs to be dereferenced: work out the size class, load the front of the appropriate list, read its tail pointer and store that as the new front of the list.


The free list can be stored almost entirely within the free buckets, giving it a constant cost in memory you can't allocate.


Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.

[0] https://github.com/curl/curl/commit/5f1163517e1597339d
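In C, that "borrowed or owned" idea can be sketched roughly like this (hypothetical names, not the actual type from the commit above):

    #include <stdlib.h>

    /* Either borrows from the original buffer or owns a heap copy
       (e.g. when the data had to be unescaped or rewritten). */
    struct strref {
      const char *ptr;   /* the data to read */
      size_t len;
      char *owned;       /* NULL when ptr borrows from the input buffer */
    };

    static void strref_release(struct strref *s)
    {
      free(s->owned);    /* free(NULL) is a no-op */
      s->owned = NULL;
    }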


But why not use alloca instead of always allocating 10 elements on the stack you might not need?

Edit: I would also be tempted to remove the ufds_on_stack variable and just check if the ufds pointer points to the stack or not.


Because always allocating them on the stack costs zero cycles in the typical case, while alloca costs more than zero cycles in the typical case. And assuming that "struct pollfd" isn't big, and the function isn't very recursive, there's no practical downside to wasting a little stack space for the life of the function.

(Of course, he could get rid of "bool ufds_malloc" and just see if his pointer is NULL before calling free(); or not even bother checking, since free(NULL) is defined to be a no-op.)


`alloca` is a simple addition to the stack pointer, so a single instruction, presuming it isn't folded into the normal bump of the stack pointer to allocate the fixed size local variables. There isn't really much cost to doing a dynamic stack allocation rather than a fixed one. Variable length arrays (VLAs) allow the same thing but can be slightly more portable.

Normal C caveats do apply here though: alloca is a common Unix extension, not standard C (but it is widely implemented). VLAs are an optional standard feature. Neither is required to actually use the stack for storage.

Not sure if there are any platforms supported by curl which would prevent its use of VLAs or alloca.


tl;dr - alloca costs, history, and why it is problematic

Alloca is somewhat more expensive on x86/x64 than a single instruction.

[0] shows the code generation for four functions that generate and sum an iota array. I used -O1 to make the differences more apparent.

iota_sum_alloca and iota_sum_vla generate similar code. They both require a frame pointer (RBP) and code to preserve the 16 byte alignment of the stack frame.

iota_sum_const_alloca and iota_sum_array generate identical code. Clang recognizes that alloca is invoked with a constant argument.

History of Alloca

Alloca was originally written for unix V7 [1]. Doug Gwyn wrote a public domain implementation [2] in the early 80s for porting existing programs. The FSF used Gwyn's alloca implementation in GDB, Emacs, and other programs. This helped to spread the idea.

Problems of Alloca

[3] is a comp.compilers thread that discusses some of the issues with alloca. Linus does not want either VLAs or alloca in the Linux kernel [4].

References:

[0] https://godbolt.org/g/1JyXhQ

[1] http://yarchive.net/comp/alloca.html

[2] https://github.com/darchons/android-gdb/blob/android-gdb_7_5...

[3] http://compilers.iecc.com/comparch/article/91-12-079

[4] https://groups.google.com/forum/#!msg/fa.linux.kernel/ROgkTB...

Edited for minor formatting changes.


https://godbolt.org/g/XKAZOb fixes the bugs in the sample code in the parent post.

If you compile with -O2 then iota_sum_const_alloca and iota_sum_array are both evaluated at compile time.


Also, curl is C89-portable, and alloca might fail from a portability standpoint.


Statically allocating stack memory is essentially free in environments with enough memory - you just subtract a little more from your stack pointer. Alloca is not standard C - it might not even be available in some compilers that curl compiles on.


> Alloca is not standard C

A similar option is a variable-length array, which is standard C. It does share the problem of not being available in some compilers, however.

That said, there are other problems with alloca/VLAs. For small enough sizes, like you said it's faster and simpler to statically allocate. For large enough sizes, you risk overflowing the stack, which might be more constrained than you'd expect (especially when using threads). There might be a sweet spot where it's better, but it's not worth the portability cost and the potential headaches.


Note that variable-length arrays ended up being a particularly troublesome feature, and were therefore made optional in C11. It's still "standard," but a compliant compiler is allowed not to support them, so it might as well not be.


> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.

Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.


30% less CPU use is roughly 30% less energy usage. The CPU will power down when it's done.

Downclocking has a negative impact because the CPU needs to be powered up for longer, and it is leaking current all the time it is not power-gated.


In any one single instance it's not that interesting, but considering probably half (if not more) of the stuff you interact with uses libcurl, it adds up to a lot.


Not sure in hard numbers, but mobile processors are designed to work this way - do a small amount of work at full power and then sleep.


Race to idle~


> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.

"I can't think of any" is not a very scientific way to measure optimizations. Actually, this simple fact casts a doubt on whether it's this malloc optimization that led to the speed up or any of the 200+ commits OP is working on top of.

Why not eliminate that doubt by applying the malloc optimizations to the previous official release? I'm a bit skeptical about the speedup myself, since I would expect curl to be primarily I/O-bound and not CPU-bound (much less malloc-bound, given how little memory it uses).


I'm not skeptical about it. Last time I did something similar to this optimization, I had a python function that was doing some work over strings to get a similarity metric.

For this function, a list of elements was kept inside each call, which began empty, and a few calculations were performed to populate it before the main loop of the function kicked in. In Python-land, this was obviously done by declaring an empty list `precalc_values = []` and then appending to it. When we cythonized it, the dev who took it went in with a `cy_malloc(sizeof(int)*elements)` "dynamic array of ints" and called it a day: 70x speedup over plain Python.

A few days later I came in and saw that code, and said "why not go with a simple array?", to which I was told "because we don't know the size of the strings beforehand". Did a test run with a small array plus a counter (to know up to where the array held real values, and not just zero-init fields) and got a 100x speedup.

In the end we went with both functions and a wrapper that would check the length of the strings involved and select one or the other, because the array version would massively crash from accessing an array out of bounds if a large string happened to come by.


> This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).

I don't see how a linked list can not use malloc().


Look for intrusive containers.


TIL, thanks.


Have each node of data contain the `prev` and `next` link.

You are "polluting" your payload with list information, but you gain malloc less lists.


And cache locality. That can have considerable impact if you frequently search that list for an item satisfying some condition.


You would think curl's perf is bound by network latency/bandwidth and that intrusive lists wouldn't make a significant difference.


> The point here is of course not that it easily can transfer HTTP over 20GB/sec using a single core on my machine

2GB


He said gigabit, or at least that's how I read it. 2900 MB/sec * 8 is over 20 Gbit/s.



