Fewer mallocs in curl (haxx.se)
392 points by dosshell on April 23, 2017 | 130 comments



There is a little-known Valgrind tool called "DHAT" (short for "Dynamic Heap Analysis Tool") that's designed to help find exactly these sorts of excessive allocations.

Here's an old blog post describing it, by DHAT's author: https://blog.mozilla.org/jseward/2010/12/05/fun-n-games-with...

Here's another blog post in which I describe how I used it to speed up the Rust compiler significantly: https://blog.mozilla.org/nnethercote/2016/10/14/how-to-speed...

And here is the user manual: http://valgrind.org/docs/manual/dh-manual.html


> I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. ... the resulting rustc was about 5–10% slower. So I’ve stopped using it now.

...why is this the case?


It breaks the code into smaller modules that are processed independently, thus limiting the scope of optimizations to those modules.


For the benefit of others who found the description in the blog post unclear and can't or don't want to dig through the code changes themselves: "fixing the hash code and the linked list code to not use mallocs" is a bit misleading. Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload. So it's one malloc instead of two per dynamically allocated list element. This explains the "down to 80 allocations from the 115" part.

The larger gain is explained better and comes simply from stack allocation of some structures (which live in a simple array, not a linked list or hash table).
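Roughly, the difference between the two layouts looks like this (a sketch with hypothetical struct names, not curl's actual types):

    /* Non-intrusive list: a separate node points at the payload,
       so every element costs two mallocs (node + payload). */
    struct node {
      struct node *prev, *next;
      void *payload;
    };

    /* Intrusive list: the prev/next links live inside the payload struct,
       so every element costs one malloc (or none, if it sits on the stack
       or inside an array). */
    struct element {
      int payload_field;
      struct element *prev, *next;
    };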


> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload.

I wonder why this wasn't the original design; I've seen the "two-allocate" method a few times in other code and it's always seemed rather silly to allocate that separate little structure just to point to the things you're linking together anyway, so I'm curious how that way of doing it became somewhat commonplace.


Mayhaps because the structure was originally stack-allocated, or because the developer was unfamiliar or uncomfortable with intrusive data structures. Or because the structure is used outside of linked-list contexts and the memory overhead of the intrusive links was considered not worth it?


> Or because the structure is used outside of linked list contexts

That use case is possible with the particular implementation that is now used by curl:

    struct fileinfo {
      struct curl_fileinfo info;
      struct curl_llist_element list;
    };
You can use the payload (struct curl_fileinfo) outside of a list without incurring any overhead.

I think one of your other points is more likely, or simply "it was fast enough and we went for more interesting features than for such optimizations".


One of the Plan 9 C enhancements makes this even easier by permitting the fields of these structs to be accessed directly.


You can do this now in ISO C11 with the inclusion of anonymous structures and unions.

    struct A {
        int x, y;
    };

    struct B {
        union {
            struct A a;               /* named member */
            struct { int x, y; };     /* anonymous struct mirroring A */
        };
        int z;
    };

    ...

    struct B foo;
    foo.x = foo.a.y = foo.z = 0;
You can access x and y as members of B or A.


Well, there are situations where intrusive list pointers don't work well. If you don't control the class, then you simply can't put pointers in it. If most objects aren't ever in a list (e.g. a million objects but only ten in a list at any given time), then you're wasting space. If you want to put objects in several different lists, then you have to know up front how many lists and their names, and this can become unmanageable.


You can always allocate memory big enough for the payload plus the link structure without the links actually being inside the payload. Just make the first part of the memory contain the links, and immediately after it you have the payload.
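A rough sketch of that layout (hypothetical helper names; it ignores alignment subtleties): the links and the payload share one malloc'ed block, and the payload type never has to know it is on a list:

    #include <stdlib.h>

    struct links {
      struct links *prev, *next;
    };

    /* One malloc for links + payload; the caller only ever sees the payload. */
    static void *list_alloc(size_t payload_size)
    {
      struct links *l = malloc(sizeof(*l) + payload_size);
      if (!l)
        return NULL;
      l->prev = l->next = NULL;
      return l + 1;                  /* payload starts right after the links */
    }

    /* Recover the links from a payload pointer when linking/unlinking. */
    static struct links *list_links(void *payload)
    {
      return (struct links *)payload - 1;
    }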


It's the object-oriented approach, isn't it? So it will be natural to some (because it was taught to them that way).


AKA intrusive (sometimes "invasive") containers / lists.


> Curl now uses the idiom where the linked list data (prev/next pointers) are inlined in the same struct that also holds the payload

And when you have a memory overwrite, your metadata is corrupted and you lose the complete list. I keep my metadata and payload in two separate lists for that very reason.


So you structure your code to get more subtle errors in case of memory overwrites? "Losing the complete list" sounds catastrophic and like something that would be easily noticed, and thus fixed.

Arranging the code to "keep trucking" in case of spurious memory overwrites seems like setting it up to be hard to debug, and having programs run in some ill-defined state where some portion of data has been overwritten but nobody notices is just super-scary.


I think this is fantastic engineering work towards performance, without falling back on the "RAM is cheap" line and doing nothing instead.

It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.


RAM is cheap.

What's not cheap is requesting RAM from the OS piecemeal.

In this particular case the curl maintainer ended up saving a bit of RAM as a side-effect, but it's actually much more common for programs that benefit from this class of optimization to waste RAM as a side-effect, and yet be much faster.

For example a common optimization in high-level programming languages is to overallocate memory for strings. E.g. if the user requests "x" you allocate 16 bytes, if you need 17 bytes you allocate 32 etc.

This wastes RAM (a 513-byte string will malloc 1024 bytes) but ends up being much faster on average, exactly because RAM is cheap and allocations are expensive.
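A minimal sketch of that growth policy (a hypothetical helper, not any particular runtime's code):

    #include <stddef.h>

    /* Round the requested capacity up to the next power of two, so a string
       that grows byte by byte triggers O(log n) reallocations, not O(n). */
    static size_t round_up_capacity(size_t need)
    {
      size_t cap = 16;        /* smallest bucket */
      while (cap < need)
        cap *= 2;
      return cap;             /* e.g. need = 513 -> cap = 1024 */
    }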


it doesn't go to the OS each time you call malloc()

the first time malloc() is invoked it will ask the OS for an oversized chunk of memory (probably using mmap() or sbrk())

allocations consist of incrementing a pointer, or pulling a chunk off a freelist, both of which are effectively free

mmap() is very fast to call too, so I'm not sure what you're on about

regardless, curl is a piece of software that spends 99.99% of its time waiting for network traffic


On Linux. Basic core libs like libCurl can find their way into MMU-less embedded systems with relatively crude malloc()s. Sane allocation behavior is needed to ward off fragmentation that hurts the entire system.


Wouldn't it be better to write a shim library that provides Linux-like MMU behavior at a userspace per-process level, and then target libs like libcurl to it? It'd be a sort of "memory policy driver" that you'd port to an embedded OS.


You really don't want library code providing or dictating malloc behavior, that's how things like OpenSSL and Heartbleed end up happening. It may be slower, but dealing with the politics of a particular malloc can be uglier than just making your code work well enough most of the time.


Maybe more than you think; if you're running a multithreaded app, malloc/free/new/delete need mutexes to protect the heap data structures, and that can also lead you into the kernel.


Pulling a chunk off a freelist and maintaining that free list is most definitely not free.


so what, to return from a freelist you alter the head pointer to point to the next item and return it

that's what, 2 or 3 instructions? even if you push that up to a few hundred it's still effectively free compared to anything else curl is going to do (e.g. writing to the disk)


A structure allocated in one chunk has cache locality and will be significantly faster to access than dereferencing multiple pointers.

The allocation itself is not the bottleneck, it's the poor choice of data structure.


Yup, malloc/free isn't usually algorithmically expensive but the cache misses tend to be the brutal part.

You'll spend 15 cycles doing work on 600 cycles of cache fetching. Plus you've just evicted some cache for your local scope.


you have to handle the case in which you have no remaining items as well, breaking up a chunk log(chunkToBreak/chunkSizeNeeded) times. And in the case that you have no viable chunk to use, you need to call back to the OS.


    > I'm not sure what you're on about.
You clearly interpreted OS to mean "the kernel" in my comment. To be clear I'm including the C library which implements malloc() in "the OS".

    > curl is a piece of software that spends 99.99%
    > of its time waiting for network traffic.
You're commenting on an article that shows that in some cases 30% of the time was spent on malloc() overhead.


well yes, they removed the network overhead and disk overhead by running it on localhost and writing to /dev/null, and once you remove two bottlenecks you'll find that something else becomes the new bottleneck

libc is not the OS, it's the standard library for C, in the same way the java class library isn't the OS, it's the standard library for java


You seem insistent on having some side-thread argument with me for some reason.

    > once you remove two bottlenecks...
The article we're both commenting on goes to great length to point this out. I don't know why you're trying to convince me or anyone else of this.

Yes, in a network library malloc overhead isn't usually a worthwhile thing to worry about, but libcurl is hardly just any random library.

    > libc is not the OS, it's the standard library for C.
I'm not trying to convince you that you should consider libc part of the OS.

I was merely clarifying that in the context of my comment, which you replied to assuming I just meant the kernel, that I was talking about the C library as well.

FWIW lots of people use "Operating System" to mean the set of software you get by default. OpenBSD, GNU, OSX, Android etc. definitely consider libc to be part of the OS, and Android probably considers its java class library part of the base OS. If everyone meant "the kernel" by "the OS" we'd just say "the kernel".

You don't have to agree with that, but correcting people on that point is about as futile as trying to convince people that your use of "hacker" is correct.


Surprisingly, when you remove the actual work, 100% of the time spent is overhead! What a waste!


Which C library from which C compiler?


That is compiler and OS specific behavior.


I would only add that the expensive part isn't the allocation per se, but copying the data from one chunk to another. If you allocate 1024 when you only need 700, then when you end up needing 701 you don't have to copy all 700 bytes into a new chunk. This can add up pretty quickly if you're copying the data every time you append one new element. If I appended 20 bytes to my array of 700 one at a time and copied each time, I'd end up copying around 14K worth of bytes when everything is said and done.

But that said, I would also keep in mind that `malloc` will generally be pretty fast - it's not going to hit up the OS every time you make a request. The advantage here isn't because `malloc` is slow, it's because copying is slow.


Assuming a 2 GHz CPU and the reasonable assumption that the data is in L1 cache, reading 14 kB from L1 cache takes 14 kB / (2 x 32 B per cycle) = ~220 cycles, which at 2 GHz is ~110 nanoseconds.

It'll take a bit longer (maybe 3-5x as long?) to write it to newly allocated memory (likely cache misses); the point is that copying memory is just not as expensive as is often assumed.

I'm pretty sure malloc is the bottleneck in this case, pointer chasing is slow.


I don't quite understand your argument - you give lots of numbers for how long it takes to copy, but ignore cache misses and just come to the conclusion that `malloc` is the bottleneck because "pointer chasing is slow". If you're going to give theoretical numbers for the copy, why not give numbers for the "pointer chasing" as well? The numbers you gave don't really mean much if there is nothing to compare them to. Yes, 110 nanoseconds is fast, but copying a 700 byte array 20 times was never intended to be an example that would be grinding your computer to a halt.

You also don't give enough weight to cache misses in this scenario. An L1 cache miss (so the memory is in L2) is going to be approximately a ~15x slowdown compared to L1. A full main-memory hit will be about ~200x slower. Not nearly the 3-5x you're guessing. Moreover, the cache misses in `malloc` will likely be on the same memory you're allocating anyway, meaning that those cache misses were likely going to happen regardless.

With that said, the `malloc` time should be somewhat constant regardless of the size of the allocation (unless your allocation requires requesting memory from the OS), but the copying won't be. Once you start getting into structures that are fairly large in size, allocating new memory and copying them over any time you expand them ends up being a fairly expensive operation, and I would easily wager the copying dwarfs any time spent in `malloc` itself.

But I mean, it's easy to argue basically any of these points without actual testing, so I've tried running a few basic tests on my own. Obviously your results would vary (I'm currently running on a fairly slow CPU):

First, I ran tests on different block sizes and numbers of allocations without freeing any of the allocated memory. In these tests, the copying tended to overshadow the allocations without copying when the block size was above around 8k. But because I wasn't calling `free`, allocating would quickly eat up most of the time if the number of allocations was increased (with most of the time being spent in the kernel). I did most of my tests with around 40000 allocations, and above that things would get quite slow with the allocations alone, even with block sizes larger than 8k.

With calling `free`, however, things tend to lean much more heavily in favor of the copying being more expensive. Right around a block size of 1k is where the allocation costs about the same amount of time as the copy. The speed-up is of course due to `malloc` always having enough free memory available and not having to ask the OS for more (so this test represents the basic cost of the "pointer chasing" alone). Also, like I had assumed, the time for `malloc` without copying is 100% dependent on the number of allocations - changing the size of the block doesn't make any change to the time without the copy.

Now that said, these tests are by no means definitive, but at the very least they do show that `malloc` is not necessarily expensive compared to the copying, especially once you go above a few k in your `malloc` size, and/or when the total number of allocations starts to ramp up very high. If your allocation size is fairly small, though, then I would agree the `malloc` time is at least equivalent to (or more than) the time spent copying.


I oversimplified, but I think my numbers are about in the ballpark when it comes to copying. It's fast, because CPUs are optimized for linear memory access patterns. I also took into account high likelihood the memory was touched before copying and will be touched again immediately after copying. One should consider cache misses on system level, not in isolation.

"malloc" will need for the very least to keep lists of free areas and try to avoid virtual memory fragmentation. Fragmented heaps (="real life code" situation) are significantly slower than fresh clean heaps, because the are more allocation list entries to check.

Because the amount of work malloc needs to do is variable, it's hard to give any numbers for how long that pointer chasing operation will take.

Pointer chasing is slow, because non-sequential access patterns, such as following pointers, create a data dependency and cannot be prefetched. The CPU is likely to stall waiting for the dependency to be resolved. Typically each pointer dereferencing ("chasing") operation will also touch a different cache line, so cache misses are also common.

Synthetic tests are way more likely to reside in L1 cache (including allocation lists) than anything in real life.

Many malloc implementations use mmap (or equivalent) for allocations that exceed a certain threshold. In that case copying is slow; you'll pay a lot extra for page faults during the copy. Page faults might also occur on a fresh heap in a synthetic test.

I know from personal experience that lots of small mallocs sometimes cause non-trivial performance issues.

Standard disclaimer concerning any optimizations: always profile your code. Things work differently in real code than in benchmarks, and what was true last time in your real-life scenario is in no way guaranteed to be true next time.


Also, when you download FLAC files with it, they'll sound warmer.


https://www.audioasylum.com/messages/pcaudio/119979/re-a-rev...

> also most players use malloc to get memory while new is the c++ method and sounds better.


This is a new rabbit hole of "audiophile".

Sidenote: "audio asylum", at least the name is appropriate, and it's https, I guess that's good too.


There is actually a corner on the web where people debate about FLACs and WAVs sounding different


This is probably true in non-blind listening comparison though. Similar to food tasting different when served in different coloured receptacles[0]

[0] http://onlinelibrary.wiley.com/doi/10.1111/j.1745-459X.2012....


If you had to summarize the audiophile community in a few words, it would be "does not understand the point of blind testing."


All audiophiles eventually reach the stage of "my hearing aid is better than yours" :)


[flagged]


We've banned this account and the main one for repeatedly violating the guidelines and creating new accounts to do so.



That's fun considering they are bit-to-bit identical :D


Yes, you got the pun, bravo, 2 geek-points for you.

However, you lose 20 geek-points because FLAC and wav are not bit-to-bit identical. They are both lossless encodings and correct implementations will preserve the input material bit-to-bit.


Perhaps he's referring to the bits sent out from the codec rather than the bits received by the codec?


FLAC is bit to bit identical to WAV after decoding.

In layman terms, it's like zipping and unzipping a .wav


Yes, I understand how the FLAC codec works. My comment was pointing out that FLAC can either represent the codec or the binary and that in neither case one can say that FLAC is bit-to-bit identical, since in the former case you're referring to the specification and in the latter case you are referring to the binary executable.

One can say that FLAC is a lossless compression codec and as such it is homomorphic to the original raw binary encoding and the WAV encoding.

It might be pedantic, but saying "flac is bit-to-bit identical to wav" is just wrong. To me it sounds like saying: "The speed limit is 80 kilometers".


Disagree, they sound the best with wget.


You've never experienced FLACs downloaded via scp. The encryption makes all the difference.


Is this reddit?


What is the difference?


Worth preserving.


You know, all that memory sloshing around does create a different sound stage, rippling contextual frequencies affecting DACs and such?


Especially when played over my diamond coated high fidelity wires. I'll probably pair that with my fully 24K gold HDMI cables when watching a movie.


>It's not every day that you see an example of someone examining and improving old code, that will result in a measurable benefit to direct and indirect users.

It's even more impressive when you think about the absolute ubiquity of cURL in practically every device imaginable. He's probably single-handedly saving many kilowatts of power consumption with fixes like this.


Indeed. Roadways are cheap to use, but those who behave as if it's all theirs are still called dicks for a reason.


This is a completely different scenario than an average app. Curl is used in cars and other embedded devices. These things matter at that scale.

>"RAM is cheap"

Memory is cheap on servers and desktops. SSD can be treated as slow RAM.


This only shows the naivete of the author. There are plenty of C programmers who aren't malloc abusers and like to avoid any potential syscall usage. Intrusive linked lists are a beginner C topic.


I'm a C beginner so I found it quite useful. ;)

Your comment makes it sound like everyone should do this. Are there not reasons to have a non-intrusive list as well? For example, if the structure is used elsewhere and you don't want to tie it to that list implementation. Or if you want to point to the same structure many times in a list?


The underlying problem is that C doesn't have comprehensive standard collections, so many developers reinvent the wheel over and over again, and usually that wheel is far from the best in the world. If curl were written in C++, those optimizations would be applied automatically by using STL collections.


Somewhat true, but many C++ programs implement their own containers because they find the STL "slow" or "bloated" or whatever. (For concreteness, both GCC and LLVM do this.) So I guess the same might eventually happen to curl, in which case we'd get this same kind of article on Hacker News ;-)


GCC used to be written in C (not C++) until a couple of years back, and the codebase's origins predate both the STL and C++. So they are not the greatest example.


As you say, GCC switched to C++ just a few years ago, when the STL had already existed for decades. I don't understand how you conclude that they couldn't have just used the STL when they did the switch to C++. Implementing their own containers was more work without any gain in portability.


STL maps are pretty slow and can be beat without too much effort.


You mean with hash tables? Different underlying structure, so I wouldn't call that a reimplementation.

Anyway, the projects I mention above also define their own vectors.


Even std::unordered_map is easy to beat on most workloads if you relax the standard's requirements a little, because the standard more or less requires hashing with chaining ("The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket"). A good linear-probing hash table will be faster most of the time, e.g. Google's dense_hash_map.
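For illustration, a linear-probing lookup can be sketched like this (hypothetical table layout; it assumes the table is never completely full): the probe walks consecutive slots in one flat array instead of chasing per-bucket pointers:

    #include <stddef.h>

    struct table {
      size_t cap;           /* number of slots, a power of two */
      int *keys;            /* cap slots */
      unsigned char *used;  /* 1 if the slot holds a key */
    };

    /* Returns nonzero if key is present; assumes at least one empty slot. */
    static int table_contains(const struct table *t, int key, size_t hash)
    {
      size_t i = hash & (t->cap - 1);
      while (t->used[i]) {              /* probe consecutive slots */
        if (t->keys[i] == key)
          return 1;
        i = (i + 1) & (t->cap - 1);
      }
      return 0;
    }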


I don't believe C++ compilers can do optimizations around allocation. A quick test at https://godbolt.org/g/lFQcKf reveals that gcc does not appear to coalesce STL operations, or move the allocation to the stack. The compiler is much dumber than most people think it is.


LLVM can remove allocations that are unused, but that's it.

There's been talk of doing escape analysis for allocations on the mailing list, but so far nothing has been done as far as I know.


I'm not sure I follow you, but std::list implementation stores object data inside node, so it's only a single allocation per node, as it should be.


He turned his containers into intrusive ones. The standard C++ STL containers are not intrusive. Boost comes with intrusive containers, but if you need a third party library, you can do that in C too.


They have many of the same properties as being intrusive, e.g. one can write a list type that can store any type and store it directly inline next to the next/previous pointers, not needing an allocation like the common C strategy of using void*.


What, in your view, is meaningfully different between an intrusive container and std::list allocating memory right next to the prev/next pointers for your data?


For data in a single list, they're essentially equivalent, but they differ when one wants to put things in multiple lists.

For instance, a collection of Ts might want to have a list of the ones with property A and the ones with property B (independently, so any T could be on list A or B, or both, or neither). This could be done by having two std::list<std::shared_ptr<T>> (or something similar, like replacing the shared_ptrs with raw pointers, or indices into an external std::vector<T>), but this pays for the storage for each T, for the prev/next pointers, and also for the storage of the pointers to the Ts (plus the extra overhead of shared_ptr, etc.). With intrusive lists, one can have next_a/prev_a and next_b/prev_b pointers directly in the Ts and only pay that slight extra cost.
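In C the same idea looks roughly like this (hypothetical names); each T carries two independent sets of links, so being on list A and/or list B costs no extra allocations:

    struct T {
      int payload;
      struct T *prev_a, *next_a;   /* links for list A (unused when not on A) */
      struct T *prev_b, *next_b;   /* links for list B (unused when not on B) */
    };

    /* Push t onto the front of list A; *head_a is the head of list A. */
    static void push_a(struct T **head_a, struct T *t)
    {
      t->prev_a = NULL;
      t->next_a = *head_a;
      if (*head_a)
        (*head_a)->prev_a = t;
      *head_a = t;
    }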


That's a really good point and I hadn't considered it. Thanks!


No, that's not the problem. STL containers rely on mallocs internally. Also, the major saving came from curl_multi_wait:

> At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.


My problem with excessive allocations is usually what happens in interpreted languages. People think, hey, it's already slow-ass interpreted code, so let's not care about allocation at all.

An example which I see all the time is in the many Python libraries which in the end do I/O against a TCP socket. Sometimes the data, between what the user passes to the library and what goes out to the socket, can be retained as an array of buffers which are to be sent to the socket.

Instead of iterating over the array and sending each block (if big enough) on its own to the socket, the library author concats them into one buffer, and then sends it over the socket.

When dealing with big data, this adds lots of fragmentation and overhead (measurable), yet some library authors don't care...

Even the basic httplib and requests have this issue when sending a large file via POST (they concat it to the header, instead of sending the header, then the large file).
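For comparison, the C-level way to "send the pieces as they are" on POSIX systems is scatter-gather I/O with writev(); a rough sketch (hypothetical wrapper, no short-write handling):

    #include <sys/types.h>
    #include <sys/uio.h>   /* writev() and struct iovec, POSIX */

    /* Send header and body in one call without concatenating them first. */
    ssize_t send_request(int fd, const char *hdr, size_t hdr_len,
                         const char *body, size_t body_len)
    {
      struct iovec iov[2];
      iov[0].iov_base = (void *)hdr;
      iov[0].iov_len = hdr_len;
      iov[1].iov_base = (void *)body;
      iov[1].iov_len = body_len;
      /* may write less than requested; a real caller loops on short writes */
      return writev(fd, iov, 2);
    }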


Optimization backed by comparative statistics. These reads are so satisfying. Thank you for submitting.


Explicit dynamic memory handling in low-level languages hurts in a similar way to garbage collectors in high-level languages: hidden and often unpredictable execution costs (malloc/realloc/free internally usually implement O(n log n) algorithms, or worse). So the point for performance, no matter if you work with low-level or high-level languages, is to use preallocated data structures when possible. That way you'll have low fragmentation and fast execution, because you're not calling the allocator/deallocator in the case of explicit dynamic memory handling, and lower garbage collector pressure for the same reasons in the case of garbage-collected languages.


An allocator doesn't have to be slow. You can implement an allocator yourself that asks for memory from the OS up front, and just hands pointers back to the calling application. If you know the order of allocations is the reverse of the deallocations (as an example), you can do allocations with a pointer bump!


Sure. Imagine you have a process with 2^20 active allocations (e.g. 2^20 calls to malloc()), i.e. you have a tree with the metainformation, and every time you de-allocate and allocate you have to search through one or more trees. So no matter how "smart" your library is about avoiding OS system calls, you already have a hell to maintain (search through a tree or a list, delete, split, etc.). Things get ugly when a process has lots of dynamic memory calls, in non-trivial cases.


> you have a tree with the metainformation, and every time you de-allocate and allocate you have to search through one or more trees

Don't use a tree in that case? I don't see why you using a tree to store allocation info means that allocations are slow.


Whatever other data structure you use is not going to be simple, and will have even worse drawbacks: e.g. using hash tables for handling allocation pools and free blocks could be even worse.


An alternative, if you have a well-defined portion of code by the end of which you know you're not going to touch any data which was allocated in it, is to use arenas - allocation is then a pointer bump, and you can deallocate all of it at once.
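A bare-bones arena sketch (hypothetical, no alignment handling or growth) to make that concrete:

    #include <stddef.h>

    struct arena {
      char *base;     /* start of the preallocated block */
      size_t used;    /* bytes handed out so far */
      size_t cap;     /* total size of the block */
    };

    static void *arena_alloc(struct arena *a, size_t n)
    {
      char *p;
      if (a->cap - a->used < n)
        return NULL;            /* out of space (a real arena might grow) */
      p = a->base + a->used;
      a->used += n;             /* allocation is just a pointer bump */
      return p;
    }

    static void arena_reset(struct arena *a)
    {
      a->used = 0;              /* "frees" everything at once */
    }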


Or even better: for local small allocations just use the stack.


Alloca is non-standard and apparently a C11 compliant compiler can choose not to implement variable-length arrays - there's no standard way of allocating on the stack.

Also - are you utterly certain your allocations will always be small? Could someone craft some input to your program to have you allocate more, and possibly use up the stack in some environments? There are environments which have much smaller stacks than they have RAM available.


You can do stack allocation in C without "alloca()", e.g.

    size_t example(const char *in, const size_t in_size,
                   char *out, const size_t max_out_size)
    {
      size_t out_size = 0;
      char stack_buf[32768];
      char *buf = stack_buf;
      char *heap_buf = NULL;
      size_t req_size = in_size * 2;

      if (req_size < sizeof(stack_buf)) {
         buf = heap_buf = (char *)malloc(req_size);
      }

      // Do stuff

      if (heap_buf)  {
         free(heap_buf);
      }

      return out_size;
    }


That's precisely what's being done in curl. Generally when people say "stack allocation" as an alternative to "malloc" they mean alloca/VLA, so you'll forgive my confusion.


ERRATA:

if (req_size < sizeof(stack_buf))

Should be:

if (req_size > sizeof(stack_buf))


You don't always know up front how much space you need.


You can guess a maximum for covering 99.99% of the cases, and use dynamic memory just in the 0.01%. E.g.

    size_t example(const char *in, const size_t in_size,
                   char *out, const size_t max_out_size)
    {
      size_t out_size = 0;
      char stack_buf[32768];
      char *buf = stack_buf;
      char *heap_buf = NULL;
      size_t req_size = in_size * 2;

      /* fall back to the heap only for the rare oversized case */
      if (req_size > sizeof(stack_buf)) {
         buf = heap_buf = (char *)malloc(req_size);
      }

      // Do stuff

      if (heap_buf)  {
         free(heap_buf);
      }

      return out_size;
    }


My rule of thumb is to look at the application's design and only ever use malloc where there is a 1:N (or N:M) relationship between entities. Everything that's 1:1 should be allocated in a single step.
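A sketch of what that rule of thumb means in practice (hypothetical structs): 1:1 sub-objects are embedded so they ride along in the parent's single allocation, while a 1:N relationship still gets its own allocations:

    struct config  { int timeout_ms; };                  /* exactly one per connection */
    struct request { struct request *next; /* ... */ };  /* zero or more per connection */

    struct connection {
      struct config cfg;         /* 1:1 -> embedded, allocated with the connection */
      struct request *requests;  /* 1:N -> list elements malloc'ed individually */
    };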


Your heuristic sounds interesting. Can you say a little more about which entities and how to determine their relationship? Thanks.


We've had data that was fixed-size but wouldn't fit on the stack.


> Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.

That's not necessarily true. Modern allocators tend to use a bunch of fixed-size buckets.

But given that curl runs on lots of platforms it makes sense to just fix the code.


And often those fixed-size buckets' smallest size is 32 bytes. It still has to at least have a freelist.


There has to be a free list somewhere, but a single bucket only needs a bitmap. I think the GP's point is that such a structure amortizes the cost of the free-list metadata over more items, reducing total overhead.


The free list can be stored inside the empty cells, meaning you put the pointer to the next empty cell inside the previous empty cell, and you need a single additional pointer to store the location of the first empty cell. When you free a cell you just make that the first empty cell.
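A sketch of that trick for a fixed-size-cell pool (hypothetical, single-threaded; cells must be at least sizeof(void *) bytes):

    struct pool {
      void *free_head;     /* first free cell, or NULL when the pool is exhausted */
    };

    static void *pool_alloc(struct pool *p)
    {
      void *cell = p->free_head;
      if (cell)
        p->free_head = *(void **)cell;   /* next-free pointer lives in the cell */
      return cell;
    }

    static void pool_free(struct pool *p, void *cell)
    {
      *(void **)cell = p->free_head;     /* reuse the freed cell to store the link */
      p->free_head = cell;
    }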


This isn't free either because the free list is scattered around in memory and all that pointer chasing is bad for caches.


It isn't free, but all that pointer chasing is one step for each alloc and one for each free. There is no need to look for a 'best' block on alloc, nor is there a need to search for a best place to insert a block on free.

Moreover, the cache hit you take for an alloc likely would have happened anyways because, presumably, the program that made the allocation wants to write to the allocated memory (in theory, that doesn't imply the CPU has to _read_ the memory, but are there allocators that are that smart?)

For frees, the memory may, but need not be, already be in the cache when free is called.


I'm curious if any allocators with size classes/buckets actually do much pointer chasing. One can theoretically have a free-list for each class, so at most one pointer needs to be dereferenced: work out the size class, load the front of the appropriate list, read its tail pointer and store that as the new front of the list.


The free list can be stored almost entirely within the free buckets, giving it a constant cost in memory you can't allocate.


Note that this pattern[0] is essentially "copy-on-write", which can be encapsulated safely as such in a reasonably simple type (in a language with generics) and used elsewhere. I use a similar mechanism pervasively in some low-level web server code to use references into the query string, body and JSON objects directly when possible, and allocated strings when not.

[0] https://github.com/curl/curl/commit/5f1163517e1597339d
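In C, that "borrowed or owned" idea can be sketched roughly like this (hypothetical names, not the actual type from the commit above):

    #include <stdlib.h>

    /* Either borrows from the original buffer or owns a heap copy
       (e.g. when the data had to be unescaped or rewritten). */
    struct strref {
      const char *ptr;   /* the data to read */
      size_t len;
      char *owned;       /* NULL when ptr borrows from the input buffer */
    };

    static void strref_release(struct strref *s)
    {
      free(s->owned);    /* free(NULL) is a no-op */
      s->owned = NULL;
    }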


But why not use alloca instead of always allocating 10 elements on the stack you might not need?

Edit: I would also be tempted to remove the ufds_on_stack variable and just check if the ufds pointer points to the stack or not.


Because always allocating them on the stack costs zero cycles in the typical case, while alloca costs more than zero cycles in the typical case. And assuming that "struct pollfd" isn't big, and the function isn't very recursive, there's no practical downside to wasting a little stack space for the life of the function.

(Of course, he could get rid of "bool ufds_malloc" and just see if his pointer is NULL before calling free(); or not even bother checking, since free(NULL) is defined to be a no-op.)


`alloca` is a simple addition to the stack pointer, so a single instruction, presuming it isn't folded into the normal bump of the stack pointer to allocate the fixed size local variables. There isn't really much cost to doing a dynamic stack allocation rather than a fixed one. Variable length arrays (VLAs) allow the same thing but can be slightly more portable.

Normal C caveats do apply here though: alloca is a common Unix extension, not standard C (but it is widely implemented). VLAs are an optional standard feature. Neither is required to actually use the stack for storage.

Not sure if there are any platforms supported by curl which would prevent its use of VLAs or alloca.


tl;dr - alloca costs, history, and why it is problematic

Alloca is somewhat more expensive on x86/x64 than a single instruction.

[0] shows the code generation for four functions that generate and sum an iota array. I used -O1 to make the differences more apparent.

iota_sum_alloca and iota_sum_vla generate similar code. They both require a frame pointer (RBP) and code to preserve the 16 byte alignment of the stack frame.

iota_sum_const_alloca and iota_sum_array generate identical code. Clang recognizes that alloca is invoked with a constant argument.

History of Alloca

Alloca was originally written for unix V7 [1]. Doug Gwyn wrote a public domain implementation [2] in the early 80s for porting existing programs. The FSF used Gwyn's alloca implementation in GDB, Emacs, and other programs. This helped to spread the idea.

Problems of Alloca

[3] is a comp.compilers thread that discusses some of the issues with alloca. Linus does not want either VLAs or alloca in the Linux kernel [4].

References:

[0] https://godbolt.org/g/1JyXhQ

[1] http://yarchive.net/comp/alloca.html

[2] https://github.com/darchons/android-gdb/blob/android-gdb_7_5...

[3] http://compilers.iecc.com/comparch/article/91-12-079

[4] https://groups.google.com/forum/#!msg/fa.linux.kernel/ROgkTB...

Edited for minor formatting changes.


https://godbolt.org/g/XKAZOb fixes the bugs in the sample code in the parent post.

If you compile with -O2 then iota_sum_const_alloca and iota_sum_array are both evaluated at compile time.


Also, curl is C89-portable, and alloca might fail from a portability standpoint.


Statically allocating stack memory is essentially free in environments with enough memory - you just subtract a little more from your stack pointer. Alloca is not standard C - it might not even be available in some compilers that curl compiles on.


> Alloca is not standard C

A similar option is a variable-length array, which is standard C. It does share the problem of not being available in some compilers, however.

That said, there are other problems with alloca/VLAs. For small enough sizes, like you said it's faster and simpler to statically allocate. For large enough sizes, you risk overflowing the stack, which might be more constrained than you'd expect (especially when using threads). There might be a sweet spot where it's better, but it's not worth the portability cost and the potential headaches.


Note that variable-length arrays ended up being a particularly troublesome feature, and were therefore made optional in C11. It's still "standard," but a compliant compiler is allowed not to support them, so it might as well not be.


> The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.

Does anyone have a general sense of how these kinds of efficiencies translate to real-world battery life? I understand that the mechanisms (downclocking/sleeping the CPU) are there; I'm just curious as to how much it actually moves the needle in a real system.


30% less CPU use is roughly 30% less energy usage. The CPU will power down when it's done.

Downclocking has a negative impact because the CPU needs to be powered up for longer, and it is leaking current all the time it is not power-gated.


In any one single instance it's not that interesting, but considering probably half (if not more) of the stuff you interact with uses libcurl, it adds up to a lot.


Not sure in hard numbers, but mobile processors are designed to work this way - do a small amount of work at full power and then sleep.


Race to idle~


> There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.

"I can't think of any" is not a very scientific way to measure optimizations. Actually, this simple fact casts a doubt on whether it's this malloc optimization that led to the speed up or any of the 200+ commits OP is working on top of.

Why not eliminate that doubt by applying the malloc optimizations to the previous official release? I'm a bit skeptical about the speedup myself, since I would expect curl to be primarily I/O-bound and not CPU-bound (much less malloc-bound, given how little memory it uses).


I'm not skeptical about it. Last time I did something similar to this optimization, I had a python function that was doing some work over strings to get a similarity metric.

For this function, a list of elements was kept inside each call, which began empty, and a few calculations were performed to populate it before the main loop of the function kicked in. In Python-land, this was obviously done by declaring an empty list `precalc_values = []` and then appending to it. When we cythonized it, the dev who took it went in with a `cy_malloc(sizeof(int)*elements)` "dynamic array of ints" and called it a day: 70x speedup over plain Python.

A few days later I came in and saw that code, and said "why not go with a simple array?", to which I was told "because we don't know the size of the strings beforehand". Did a test run with a small array plus a counter (to know up to where the array held real values, and not just zero-init fields) and got a 100x speedup.

In the end we went with both functions and a wrapper that would check the length of the strings involved and select one or the other, because the array version would massively crash from accessing an array out of bounds if a large string happened to come by.


> This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).

I don't see how a linked list can not use malloc().


Look for intrusive containers.


TIL, thanks.


Have each node of data contain the `prev` and `next` link.

You are "polluting" your payload with list information, but you gain malloc less lists.


And cache locality. That can have considerable impact if you frequently search that list for an item satisfying some condition.


You would think curl's perf is bound by network latency/bandwidth and that intrusive lists wouldn't make a significant difference.


> The point here is of course not that it easily can transfer HTTP over 20GB/sec using a single core on my machine

2GB


He said gigabit, or at least that's how I read it. 2900 MB/sec * 8 is over 20 Gbit/s.



