Why does calloc exist? (vorpus.org)
553 points by wyldfire on Dec 5, 2016 | 132 comments



    buf = calloc(huge, huge);
    if (errno) perror("calloc failed");
    printf("calloc(huge, huge) returned: %p\n", buf);
    free(buf);
This has a flaw. errno doesn't magically get reset to zero. You should check the return value of calloc, then use errno. Checking if(errno) is not the right way to determine if there was an error.


    Its [errno's] value is significant only when the return value of
    the call indicated an error (i.e., -1 from most calls; -1 or NULL
    from most library functions); a function that succeeds is allowed
    to change errno.


Note the difference between "is allowed" vs "must".

If you write a program that relies on this behaviour, you're going to have a hard-to-track-down bug at some point.


That's an even stronger case for not relying on errno to catch errors.

The code should be something like this:

    buf = calloc(huge, huge);
    if (!buf)
        perror("calloc failed");
    printf("calloc(huge, huge) returned: %p\n", buf);
    free(buf);


Yes, my post above was agreeing that you should check the error condition of the return value, rather than relying on errno to be cleared on success.


Ah, my bad.


Yeah... that caught my attention too and I looked up errno and found the same thing.


wow, yeah, that's totally and obviously buggy ...


That's a nice alternative history fiction.

Here's an early implementation: https://github.com/dspinellis/unix-history-repo/blob/Researc...


You haven't proven it wrong.

Here's the earliest implementation in that repo (in Research UNIX V6; your link in V7): https://github.com/dspinellis/unix-history-repo/blob/Researc...

    calloc(n, s)
    {
        return(alloc(n*s));
    }
There are several interesting things we learn from poking around V6 though:

- `calloc` originated not on UNIX, but as part of Mike Lesk's "iolib", which was written to make it easier to write C programs portable across PDP 11 UNIX, Honeywell 6000 GCOS, and IBM 370 OS[0]. Presumably the reason calloc is the-way-it-is is hidden in the history of the implementation for GCOS or IBM 370 OS, not UNIX. Unfortunately, I can't seem to track down a copy of Bell Labs "Computing Science Technical Report #31", which seems to be the appropriate reference.

- `calloc` predates `malloc`. As you can see, there was a `malloc`-like function called just `alloc` (though there were also several other functions named `alloc` that allocated things other than memory). (Ok, fine, since V5 the kernel's internal memory allocator happened to be named `malloc`, but it worked differently[1]).

[0]: https://github.com/dspinellis/unix-history-repo/blob/Researc... (format with `nroff -ms usr/doc/iolib/iolib`)

[1]: https://github.com/dspinellis/unix-history-repo/blob/Researc...


OpenBSD added calloc overflow checking on July 29th, 2002. glibc added calloc overflow checking on August 1, 2002. Probably not a coincidence. I'm going to say nobody checked for overflow prior to the August 2002 security advisory.

https://github.com/openbsd/src/commit/c7b2af4b3f7e78424f8943...

https://github.com/bminor/glibc/commit/0950889b810736fe7ad34...

http://cert.uni-stuttgart.de/ticker/advisories/calloc.html


It is embarrassing that glibc didn't check for overflow in its calloc implementation prior to 2002. It is not only a security flaw but also a violation of the C standard (even the first version, ratified in 1989 and usually referred to as C89).

The standard reads as follows:

  void *calloc(size_t nmemb, size_t size);

  The calloc function allocates space for an array of nmemb objects, each of whose size is size.[...]
and,

  The calloc function returns either a null pointer or a pointer to the allocated space.
So if it cannot allocate space for an array of nmemb objects, each of whose size is size, then it has to return a null pointer.
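
For reference, the check is tiny; a minimal sketch of a conforming calloc built on malloc (hypothetical, not glibc's actual code) is roughly:

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch only: refuse requests where nmemb * size would wrap around. */
    void *my_calloc(size_t nmemb, size_t size)
    {
        if (size != 0 && nmemb > SIZE_MAX / size) {
            errno = ENOMEM;              /* product overflows size_t */
            return NULL;
        }
        void *p = malloc(nmemb * size);
        if (p != NULL)
            memset(p, 0, nmemb * size);  /* standard requires all bits zero */
        return p;
    }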


So the (slightly modified) question still stands: why do both calloc and malloc exist? Indeed it looks like calloc was originally intended as a portable way to allocate memory. It used the function alloc, which apparently was not meant to be used directly; most iolib functions have a 'c' tacked on. So when iolib was reworked into the stdlib, why was calloc kept? saretired suspects backward compatibility, but I don't believe this, because no other c-prefixed iolib function was kept and I couldn't find any code that actually used calloc in the v6 distribution either. So maybe whoever was responsible for malloc/calloc in v7 (I think it was ken, not dmr) thought malloc should be a public function but saw a use for calloc and changed the semantics to be a bit more predictable.



Why are you so sure it was written by dmr? The coding style looks like ken's to me: a) no space after if/while/for/etc b) use of "botch".

Yes, calloc is used in lex, but that is not part of v6...at least not the official distribution; I don't know when he started development. But since he also uses fopen and friends, why shouldn't he be using malloc as well? Changing 'calloc(n, m)' to 'malloc(n*m)' doesn't sound like such a huge change.


It appears that only calloc was in Lesk's Portable C Library [0], while malloc was the name Thompson gave the kernel's memory allocator in V6 [1]. When Ritchie rewrote Lesk's library for V7, he may have simply retained calloc for backward compatibility with existing user space code.

[0] http://roguelife.org/~fujita/COOKIES/HISTORY/V6/iolib.html

[1] https://github.com/hephaex/unix-v6/blob/master/ken/malloc.c


GETMAIN, the malloc() equivalent in MVT-derived IBM OSes, does not always zero memory. IIRC, MVS didn't zero it at all, so you might get anything in there, thus the need for a call that guaranteed zeroed memory for it. (This is from my memory of assembly programming on MVT/MVS up to the 1990s; z/OS apparently[1] does it somewhat differently now, so that some allocations are definitely zeroed.)

[1] http://www-01.ibm.com/support/docview.wss?uid=isg1OA28314


It's a good explanation of why calloc still exists and is useful. Otherwise it would have been dropped from the standard like cfree was.


I think there's a big difference between "why does ... exist" and "why does ... still exist". calloc() may be useful today for reasons completely different from why it existed in the first place. And difference #2 is just an implementation-specific optimisation. There's nothing in the standard that forces calloc to use lazy allocation / virtual memory. Actually, it may be implemented on platforms which can't provide this.


Thank you for bringing up the implementation-specific nature of #2! If the author is running Linux, then perhaps they've never checked out overcommit vs. not overcommit.


The way malloc, calloc, and memset are implemented all are implementation specific. For example, memset, when asked to zero memory, may use cache control instructions such as https://en.wikipedia.org/wiki/Cache_control_instruction#Data....

That tells your cache "pretend that you read all zeroes into the cache line at this address, and mark it as dirty (that guarantees the zeroes will be written out, even if the caller doesn't write to the cache line)".

For small amounts of memory that will be written to soon, that's as good as free, since it doesn't have to read from memory (the naive loop will, as it has to bring in an entire cache line before it can zero out its first byte or word).


You're likely to still see a performance improvement without overcommit. The OS will try to zero free pages in the background, so there's a good chance that it'll have pre-zeroed pages to hand you when you ask for them, rather than making you wait for them to be zeroed on demand.

Of course, there are plenty of systems where this doesn't happen, or there are no pages in the first place, or there's no kernel zeroing stuff for you.


For what it's worth, the numpy 'trick' he used to demonstrate that feature also works on windows.


Good point, and it makes me wonder why the code is attributed to dmr...it looks like it was written by ken. I suppose recreating such a repo can only be so accurate.


> So basically, calloc exists because it lets the memory allocator and kernel engage in a sneaky conspiracy to make your code faster and use less memory. You should let it! Don't use malloc+memset!

On the flip side, if your critical metric is latency then these tricks of calloc's and the OS's are exactly what you try to avoid. memset() the buffer, and if you have the privileges you should mlock() it to prevent it from being paged out. Of course, this all presumes that it's not an ephemeral buffer to begin with. Best to change your design to leverage a long-lived resource if possible.
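
A rough sketch of that approach (POSIX; assumes the process is allowed to mlock that much memory):

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Pre-fault and pin a long-lived buffer so first use doesn't page-fault. */
    void *make_pinned_buffer(size_t len)
    {
        void *buf = malloc(len);
        if (buf == NULL)
            return NULL;
        memset(buf, 0, len);            /* touch every page up front */
        if (mlock(buf, len) != 0) {
            /* no privilege or over the limit: still pre-faulted, just not pinned */
        }
        return buf;
    }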


    Best to change your design to leverage a long-lived
    resource if possible.

    On the flip side, if your critical metric is latency
    then these tricks [...] are exactly what you try to avoid
If you keep the buffer alive as long as possible with a slab allocator, or just a smart memory management strategy, then how you acquire the buffer is ultimately trivial, likely dwarfed by your other startup tasks (reading config, opening sockets, etc.).


I think he was referring to the part where a calloc'd memory page will be zeroed the first time it's used, rather than all at once at the beginning.

In a real-time system, the start-up time matters less than having predictable response times.


In a realtime system you can't use virtual memory because the access times are unpredictable.


I specifically avoided that word because it triggers particular deadlines that people have in mind. If my application requires no more than X ms latency I don't care to handwring over realtime vs soft realtime vs whatever, but it's still critical to fit in the budget. But indeed you can get reliable low-latency products to work on linux, with virtual memory. But like I said pinning is a great way to keep those peaks down.


The division you are looking for is

Hard Realtime: Embedded system, no virtual memory/OS. Or special OS provisions to let them run.

Soft Realtime: Responsive.

In this case you are aiming for the second. There are several million things that'll net greater performance; we're talking about saving a matter of nanoseconds in C/C++. How you load your config will have more effect than this.

If you want to save $1,000,000, rolling pennies is a start. But there are likely way bigger savings elsewhere, worth way more time and less effort.


I didn't think how it's achieved matters to the classification, which is by the consequence of missing a deadline, according to good ol' Wikipedia:

Hard – missing a deadline is a total system failure.

Firm – infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline.

Soft – the usefulness of a result degrades after its deadline, thereby degrading the system's quality of service


Your difference between soft/firm is an arbitrary decision made by a manager, not really a hard-and-fast "if I can't meet this deadline my system is a total technical failure".

You've created a false dichotomy.


And if you really want to squeeze the last bit of performance out you could not memset / init the memory at all, and make sure you only read the parts your app has written.


Sorry, but this is just goofy and bad.

If you depend on copy-on-write functionality, then you need to use an API that is specced to guarantee copy-on-write functionality. If that means you use an #ifdef per platform and do OS-specific stuff, then that is what you do.

Anything else is amateur hour.

If copy-on-write is a desirable feature, then as the API creator, your job is to expose this functionality in the clearest and simplest way possible, not to hack it in obscurely via the implementation details of some random routine. (And then surprise people who didn't expect copy-on-write with the associated performance penalties.)

This is why we can't have nice things.


I think the author's point is opportunistic optimization. He didn't ask us to rely on this behavior.


If you don't know whether or not you are really getting an optimization, then how much do you really care?

If you really care, then you actually profile your system and see what takes how much time, under which circumstances. The results of such a profile are almost always surprising.

I guess this is a basic cultural difference -- almost nobody in the HN crowd really cares whether their software runs quickly; there is just a bunch of lip service and wanting-to-feel-warm-fuzzies, with very little actual work.

In video games (for example) we need to hit the frame deadline or else there is a very clear and drastic loss in quality. This makes this kind of issue a lot more real to us. If you look at the kinds of things we do to make sure we run quickly ... they are of a wholly different character than "guess that calloc is going to do copy-on-write maybe."


At the same time why would you ever opt to malloc & memset instead of calloc? calloc might have clever optimizations, whereas malloc + memset won't. Intentionally choosing something slower, buggier, and that requires more work on your part is moronic.


Predictability sometimes trumps optimizations. For a striking illustration of this, see timing attacks.


Exactly. I would avoid using calloc simply because I don't know what it actually does.


The implementation details of malloc aren't specified as part of its interface either...


Which is why high-end games do not use generic system malloc; in general we link custom allocators whose source code we control and that are going to behave similarly on all target platforms.

(In fact we go out of our way to not do malloc-like things in quantity unless we really have to, because the general idea of heap allocation is slow to begin with.)


You know what it does:

    The calloc() function contiguously allocates enough
    space for count objects that are size bytes of memory
    each and returns a pointer to the allocated memory.
    The allocated memory is filled with bytes of value zero.
You should not care how it does it.


Spoken like someone who does not ship fast software!


There's a saying that is often misused, but applies here: "premature optimization is the root of all evil".

You first write your code using standard system functions, using the right calls for what you're doing. If, after that, the code's performance is bad because of calloc(), only then do you roll out your own implementation (most likely in assembly), and accept that in the future your code might not work well, because something in your OS has changed since you wrote it.


It's not always the best way to write software. If you're writing a program that's supposed to work in real(ish) time then it's good to take performance into consideration early on, otherwise you'll end up rearchitecting your program later. It's not necessarily about a number of cycles each operation takes, but rather about memory layout of your data. I guess it's a matter of experience: if you expect something to be a bottle neck (because it was a bottleneck in a similar application you've written in the past) then maybe you should just write it properly the first time round?


That's what I meant when I said that the saying is abused. Some people think that choosing the right algorithm is premature optimization. It is not.

Choosing whether to use malloc vs calloc is not an architectural change though, and in fact it is very easy to replace one with the other; but if you use the right call for the right use case, then you will benefit from optimizations that the OS provides, and often you might not even be able to achieve them from user space.


This is the approach OpenSSL took (it had its own memory management routines), and it caused security vulnerabilities, not to mention performance issues.

Rule of thumb: if you need to allocate memory region that will be overwritten anyway (for example reading a file) use malloc(). If you need a zeroed memory, use calloc().

As long as you rely on the guarantees provided by the calls and use the right call for the right use case, you get predictability and very often optimization.
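
To make the rule of thumb concrete, here's a sketch of the "will be overwritten anyway" case (hypothetical helper, minimal error handling):

    #include <stdio.h>
    #include <stdlib.h>

    /* Every byte comes from fread, so zero-filling first would be wasted work. */
    char *read_exact(FILE *f, size_t len)
    {
        char *buf = malloc(len);             /* no need for calloc here */
        if (buf == NULL)
            return NULL;
        if (fread(buf, 1, len, f) != len) {  /* overwrites the whole buffer */
            free(buf);
            return NULL;
        }
        return buf;
    }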


yes, but malloc() isn't predictable either. If you care about predictability you aren't using malloc or calloc.


Neither copy on write nor size checking are specced as part of the calloc() definition.

Here's the specification of calloc from the ISO standard:

  7.22.3.2 The calloc function

  Synopsis

    #include <stdlib.h>
    void *calloc(size_t nmemb, size_t size);

  Description
  
  The calloc function allocates space for an array of nmemb objects, each of whose size is size. The space is initialized to all bits zero.
  
  Returns

  The calloc function returns either a null pointer or a pointer to the allocated space.


An implementation that lets the size overflow and returns a pointer to a block that isn't large enough for "an array of nmemb objects, each of whose size is size" is not conforming with that specification.

That specification gives the implementation exactly two options: return NULL, or return a pointer to a block of sufficient size.


Yes, and that is exactly my point.

The article says you should use calloc because it provides these optimizations. I am saying no, that's goofy, because it is not specced to provide these optimizations.


Ignoring the multiplication issue, I think once again it all comes down to communicating intent. If you want to allocate zeroed memory, use calloc. If you don't need your memory to necessarily be zeroed, use malloc.

I'm agreeing with you here - if your intent is to have copy-on-write functionality, that's not what you're communicating when you use calloc.

It's okay for an implementation to try to optimize given the constraints of intent, but I agree that if an implementation is doing something non-straightforward (copy-on-write in this case), that is a smell that perhaps the API needs to expose an additional layer.


>Anything else is amateur hour.

Such as writing the optimizing compilers that make it feasible for you to use C at all?


Well his argument is that if you need e.g. CoW you shouldn't rely on the OS doing that implicitly for you and instead you should explicitly use the CoW features of the OS.
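
Concretely, on POSIX-ish systems that explicit route would look something like this (sketch; MAP_ANONYMOUS is a widespread but non-POSIX extension):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Ask the OS directly for lazily zero-filled private pages instead of
       hoping calloc happens to behave that way. */
    void *zero_pages(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }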

There should be a way to explicitly demand the optimization instead of relying on the behavior of a specific compiler to implicitly optimize the code.

Shifting beyond the bit width is undefined behavior. Why can't shifts be checked by default? Let me explicitly demand the undefined behavior when I really need it.


I don't understand your reply. How is this not a non sequitur?


The output of a modern C compiler is unpredictable in terms of performance. Yet mannykannot suspects you still use such tools. Please explain your inconsistency.


Even if this weren't a fallacious argument, he's actually in the process of replacing Thekla's C/C++ workflow with a custom language called Jai.


Isn't Jai piggybacking on C?


He started with two backends. One generates bytecode for an internal interpreter, and this is still needed because all Jai code can be run at compile time. The other backend generates C code, but it's a temporary measure. He just added the LLVM backend: https://www.youtube.com/watch?v=HLk4eiGUic8


oh thanks, I didn't see the last one :)


Is C now so bloated that it needs optimizing compilers just to get executed at all? Is a C interpreter unfeasible? Apparently even an unoptimizing compiler is not enough.


I am not aware of any change to the C language that has introduced bloat - perhaps you could explain?

FYI optimizing compilers are also used to produce efficient instruction streams.


The real reason "calloc" exists was that it was really easy to hit 16-bit overflow back in the PDP-11 days.


Historically, not quite true.

No version of Research UNIX V1 through V7, nor any of BSD 1, 2, 3, 4, or 4.4 did overflow checking. They all just did `m * n` or `m *= n`.


If you look through the history of CVEs, you'll find that pretty much every implementation of calloc or a calloc-like function starts with m * n and ends up only changing after someone points out the security flaw.


Thank you for this answer. The answers from OP were not convincing me.


> Plus, if we wanted to, we could certainly write our own wrapper for malloc that took two arguments and multiplied them together with overflow checking. And in fact if we want an overflow-safe version of realloc, or if we don't want the memory to be zero-initialized, then... we still have to do that.

Like reallocarray(3) does?

    buf = malloc(x * y);
    // becomes
    buf = reallocarray(NULL, x, y);
    
    newbuf = realloc(buf, (x * y));
    // becomes
    newbuf = reallocarray(buf, x, y);


reallocarray(3) looks nifty, but until it's available on a wider, ideally more standard-driven basis than just OpenBSD and FreeBSD, it's likely to not see wide uptake.


It's already gaining adoption outside of the BSDs. OS X/iOS seem to have it as part of their libmalloc. Android Bionic libc has it as part of the code they sync from upstream OpenBSD.

Many open source projects include their own, or simply bundle the OpenBSD implementation:

  * mandoc
  * flex
  * unbound and nsd
  * tor
  * tmux
  * libbsd
  * libressl
  * xorg-xserver
  * ...
The list only continues to grow; several more examples can be found on GitHub.


Darwin (macOS / iOS) is often counted as "one of the BSDs", just a fairly weird one. Big chunks of the standard library are copied from FreeBSD with changes to work on top of Mach.


There's code for it in Darwin's libmalloc, but it's not exposed as API.

reallocarray() has some difficulties as an interface, mostly inherited from realloc(). I'm a bigger fan of reallocarr(), but that's NetBSD only. We (the operating systems community) need to find a consensus here, but I'm not convinced that reallocarray() is that consensus yet.


reallocarray is a very thin layer around realloc. No surprises there. Simple find and replace to bring overflow checking into your code.

reallocarr changes the semantics. Equally easy in new pieces of code and a little harder when converting existing code.


reallocarray(3) is part of the portable OpenBSD compat layer[0], and clocks in at under 10 lines on top of realloc(3):

    #define MUL_NO_OVERFLOW ((size_t)1 << (sizeof(size_t) * 4))
    void *reallocarray(void *optr, size_t nmemb, size_t size) {
        if ((nmemb >= MUL_NO_OVERFLOW || size >= MUL_NO_OVERFLOW) &&
             nmemb > 0 && SIZE_MAX / nmemb < size) {
            errno = ENOMEM;
            return NULL;
        }
        return realloc(optr, size * nmemb);
    }
[0] https://github.com/openssh/openssh-portable/blob/master/open...


And then there's of course when calloc returns non-zeroed memory once in a while, which causes... 'interesting' bugs.

https://bugzilla.redhat.com/show_bug.cgi?id=1293976 http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-5229


> And at least we aren't trashing the cache hierarchy up front – if we delay the zero'ing until we were going to write to the pages anyway, then that means both writes happen at the same time, so we only have to pay one set of TLB / L2 cache / etc. misses.

Ooh, nice one. My first impression was that calloc was just lazy-allocating, which is fine in most cases but when you want precise control over timing, maybe you want to be sure that memory is zero'd at allocating time rather than pay the cost unexpectedly at use time.

But the cache-awareness makes that a moot point. You'd be paying double cache-eviction costs if you were clearing that memory up front: once at clearing time, and once at actual-writing time. This implementation of calloc avoids that.


Not sure how I feel about, "oh everyone's looking this way! Let me get political"


Really felt like a bait and switch to click that link and get an angry rant.


I've always been surprised that memset is usually just a nonmagical for loop. I used to expect that the OS does things to magically make it faster (running lazily, etc).


Which implementations use a nonmagical for loop?

glibc's is all in assembly full of SIMD instructions, which seem very much magical...

https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86...

http://stackoverflow.com/questions/8858778/why-are-complicat...


CPUs are pretty great at running nonmagical for loops, so you'd have be zeroing a pretty giant block of memory before it made sense to get the OS involved at all.


Of course :) But the OS could get involved for larger blocks of memory.

Also, I wonder if zeroing large chunks of memory would be faster to do in kernel space using real addresses. You can avoid the multiple real memory lookups involved in a single virtual write.

(Of course, we already avoid those often, but it could be useful to avoid entirely. Not sure what the tradeoffs are here)


Paging is still enabled in kernel mode, the kernel uses virtual addresses.

(The kernel's linear mapping of physical memory can take advantage of huge pages though, which means that there might be one or two less levels of page tables involved with those addresses). TLB misses aren't significant if you're bulk-writing to a block of memory anyway, you'll max out the bandwidth of the memory without that being an issue.


For linear access patterns the TLB does its job perfectly. The overhead is negligible.


Let’s see what happens after the allocation.

With malloc + memset, the OS will likely allocate that memory in huge pages, on PC that would be 2-4MB / page depending on the architecture, https://en.wikipedia.org/wiki/Page_(computer_memory)#Huge_pa...

If I calloc then write, the OS can’t give me huge pages because of that copy on write thing. Instead, the OS will gradually give me the memory in tiny 4kb pages. For large buffers you should expect TLB cache misses, therefore slowing down all operations with that memory.


A quick test with strace on my machine shows that callocing and mallocing a 2GB buffer results in the exact same mmap system call for both. Neither uses huge pages.
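
Roughly what I ran (assuming Linux/glibc; watch it with `strace -e trace=mmap ./a.out`):

    #include <stdlib.h>

    int main(void)
    {
        size_t n = (size_t)2 * 1024 * 1024 * 1024;  /* 2 GB */
        void *a = malloc(n);    /* large request: glibc goes straight to mmap */
        void *b = calloc(n, 1); /* same: one anonymous mmap, no memset needed */
        free(a);
        free(b);
        return (a && b) ? 0 : 1;
    }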


I don't think your average kernel can tell the difference between malloc+memset and calloc.


There's no reason the OS can't use huge pages for a calloc. In fact, "give me some zeroed memory" tends to be the only interface exposed by the kernel, since security requires zeroing memory before handing it out to userspace anyway.


This article _creates_ a reason why the OS might not be able to use huge pages for a calloc.

If a substantial number of people read this article, believe what's written, and [re]design their software under the wrong assumption that calloc returns a sparse copy-on-write memory buffer at no performance cost, then OSes will no longer be able to use huge pages for calloc, because doing that would dramatically increase physical memory usage for such software.


Originally, calloc was the function Unix programmers were expected to use by default, since it avoids any sort of intermittent bugs due to your forgetting to initialize some field in the data structure you're allocating. But clearing the memory to all zeros took precious time, so if you were an advanced programmer, and knew for a fact that you were going to fill it all in yourself, you could optimize by calling malloc.


It's harder to forget to multiply by sizeof(T) when calloc-ing as well.


you can do sizeof(T[n]) instead of sizeof(T) * n
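
Side by side, for an illustrative (hypothetical) struct item:

    #include <stdlib.h>

    struct item { int id; double value; };

    int main(void)
    {
        size_t n = 100;
        struct item *a = calloc(n, sizeof *a);           /* zeroed, overflow-checked */
        struct item *b = malloc(sizeof(struct item[n])); /* same size via a VLA type (C99) */
        free(a);
        free(b);
        return 0;
    }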


I don't get it. The two behaviors are completely orthogonal. Why can't I have a malloc() that does lazy copy-on-write for large arrays and why can't I have an error checking malloc() and why can't I have a calloc() that allocates the memory up front and doesn't zero it out? I get the "it's historic" argument, but this seems like a silly distinction. Sounds like what you want to do practically is basically just make your malloc() wrap a calloc() with size 1, and stop explicitly memset()ing. Or just introduce your own functions:

    moarmem(n) // malloc(n)
    moarmemslower(n) // p = malloc(n); memset(p, 0, n);
    moarmemfaster(n) // calloc(n, 1)
    evenmoarmem(p, n); // realloc()
    fuggetaboutit(p) // free()


> Why can't I have a malloc() that does lazy copy-on-write for large array

malloc will generally do lazy copy-on-write above a certain limit; it was 128kb for glibc last time I checked.

> why can't I have an error checking malloc()

reallocarray is becoming the de-facto standard for that.

> why can't I have a calloc() that allocates the memory up front and doesn't zero it out?

Er, malloc?


> Why can't I have a malloc() that does lazy copy-on-write for large arrays

reallocarray(3) (though fundamentally that's what malloc does as well, it just doesn't do overflow checking)

> why can't I have an error checking malloc()

reallocarray(3) (because malloc doesn't get the information, it gets a single size rather than a count and a per-object size)

> why can't I have a calloc() that allocates the memory up front and doesn't zero it out?

reallocarray(3)

> Sounds like what you want to do practically is basically just make your malloc() wrap a calloc() with size 1, and stop explicitly memset()ing.

If you're going to memset(0) it there's no reason to use malloc() ever, but it's very common to malloc() then fill the allocation directly, in which case the memsetting is a redundant cost.


"(I mean, let's be honest: if we really cared about security we wouldn't be writing in C.) "

Why so ?


Should be the other way around, shouldn't it: "Only if you really care about security should you be allowed to write in C" :)

(i.e., don't use a professional-grade band-saw, if you're not a professional)


This is a great example of why _alloc is an abstraction over virtual memory.

What this doesn't express is that dealing with page allocation directly can be quite annoying to get correct cross platform. You generally don't want to do that unless a) you're optimizing past the "knuth level" and know you need to for performance (e.g. mapping files to memory), b) you're writing something where you run dynamic code (JIT or dynamic recompilation) or c) you're writing your own allocator and/or using page faults to get some functionality, ala Go's stop-the-world hack.

Basically, don't bypass _alloc unless you have a reason.


I always thought it was because of padding. An array of M structures each N bytes long could require more than M*N bytes (certainly has on some architectures I've worked with). But I guess that's not it after all.


C accounts for padding in the size of the individual type. By the time you do sizeof, it's already rounded up to where you can just do M*N. For example:

    struct S {
        long a;
        char b;
    };
On my computer (64-bit Mac), sizeof(struct S) is 16, due to 7 bytes of padding after b. Since the compiler handles the padding, that means calloc doesn't have to.


Today's compilers. How about the compilers when calloc was first defined? As I said, I've worked on compilers that behaved differently, either always or according to various options. The computing universe has actually become less diverse in some ways than it used to be, so we should be careful of drawing conclusions about old interfaces based on today's monoculture. Is it really impossible for people here to imagine that some of the dozens of platforms that had their own compilers and C libraries chose to do the rounding up in the latter? It would actually be a pretty reasonable choice, for different microarchitectures capable of running the same instruction set and binaries but with different cache subsystems. That way you could make the decision at run time instead of compile time. Many of the early RISC machines did even weirder tricks than that to wring out the last bit of performance on multiple generations without having to recompile.


Which can lead to sadness if you're not careful with #pragma pack and the like.


Padding is included in sizeof.


calloc doesn't know what it's allocating for, so it has to hand out m * n.

(Despite the syntax, the same goes for operator new in C++! Placement vector new in particular is a trap.)


This would be very, very bad for performance in a lot of cases due to non-aligned struct reads, were it true.


If you calloc some memory and then the first thing you do is write to it, can the compiler optimize away the initial write of zeros, since they will just be overwritten?
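
For concreteness, I mean a pattern like this (hypothetical helper):

    #include <stdlib.h>

    /* Every element the caller can see is overwritten right away, so in
       principle the zero-fill requested from calloc is dead work. */
    double *make_ramp(size_t n)
    {
        double *p = calloc(n, sizeof *p);
        if (p == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++)
            p[i] = (double)i;
        return p;
    }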


For smaller memory allocations that don't go directly to the OS, I suppose it's theoretically possible (though I'm not sure any compilers do it). For the larger allocations that the author mentions that go directly to the OS to fulfill, however, the compiler wouldn't be able to optimize this away. In that case, the zero'ing occurs in the kernel, which is something that the compiler has no control over.

The zero'ing is done significantly for security reasons anyway. If a program could somehow disable that feature and get leftover memory from the kernel it could very easily contain password, secret keys, and other important bits of data that you wouldn't want random programs on your computer to have access to.


To give you a definite answer: Yes.

If you have a calloc, even along conditional paths, it will understand the value along that path is zero until the store, and that the zero is killed by the store.

What it will do varies, because it does not want to screw up the sparse memory usage.

But, for example, it even understands how to take the zero values + stores and turn them into a memset and kill the stores, etc.

It generally does not do something like split the allocation into a calloc part and a malloc+memset part, or whatever.


I would imagine this might work in cases where the compiler can prove that all addresses are written, but probably this is a limited number of cases. Likely it would do so simply by noticing two successive writes to the same location, instead of doing anything special related to calloc.


Possibly, for small allocations: a possible way this can happen is a small-value `calloc()` inlined as a `malloc()` / store zeroes pair, and then the "store zeroes" part of that discarded by a later optimisation pass as a dead store.

On modern server/desktop/mobile CPUs, this won't make much difference anyway because the second write to the same location is essentially 'free' due to the store buffer.

And of course if you're calling calloc() in a tight loop, then the zeroing is the least of your performance concerns!


Good operating systems also provide `reallocarray()`.


It's in the standard library, not in the operating system.


The standard library is usually delivered as part of the operating system on *nix-like platforms.


In fact the standard library is usually part of the operating system in the sense that it's the interface to the OS, on both *nix-like and non-*nix-like systems. Linux is the exception (in that raw syscalls are officially supported), not the rule.


Yes, exactly. This is especially true on Darwin and Solaris.


Making statements as inaccurate as possible without being wrong is a fun game, I guess.

My turn: A file explorer is part of the OS because it is installed by default on Windows. A good OS provides a graphical file browser.


As the other respondent posted, the standard library is the only interface to the operating system on some platforms.

For example, on Solaris and various *BSDs, the syscall interface is private or unstable and is explicitly NOT an interface.

So without the standard library, none of the applications could run.

That sure sounds like part of the operating system to me, and fits various accepted definitions such as the one Wikipedia provides.

And yes, I would consider the file explorer included with Windows part of the operating system, as most people I think would also.

The confusion here seems to come from the Linux world, where components are mix and match; where you can pair the kernel with an entirely different base.

That isn't true of many other operating systems; the base distributed set is designed to work together and provide the environment and platform for applications.


There's no confusion at all on my part. The operating system isn't even aware that your application has a heap to begin with. All it knows, is that your program may ask for more memory.

But if you consider a file browser as being part of the OS, I can't help you. You want to adopt a view that laypeople have in a technical discussion. That's worse than ignorance.


You are very confused and quite frankly very wrong.

The original claim was the standard library is not part of the OS. However, the standard library is objectively part of the OS on many platforms and so the original claim is factually incorrect.

My so-called "view" reflects the industry accepted definition of an OS, so perhaps you should review your suppositions.


Ok so name a few of the "OS on many platforms" in which one absolutely can't replace the standard library.


Solaris is one of them, unless you made significant changes to the kernel itself.

Darwin, as shipped by Apple is another.

There are a variety of embedded OS' that are the same.

So yes, as shipped and delivered, you can't replace the standard library and the vendors both consider the standard library as part of the OS.

And furthermore, existing application binaries for those platforms would not work without them.


No, the 2 GB array should still take a quota of 2 GB. It just wouldn't take 2 GB's worth of time to initialize. The overcommit "feature" in Linux is a bug that crashes C programs in places that violate the language's guarantees (such as when a write occurs to a location in memory that was allocated correctly).


There are some unfortunate statements in there (if taken out of context) that require you to read the whole thing for them to make sense. Like "...but most of the array is still zeros, so it isn't actually taking up any memory...", which is a bit ambiguous if not read in the complete context; then it makes sense.


> But calloc lives inside the memory allocator, so it knows whether the memory it's returning is fresh from the operating system, and if it is then it skips calling memset. And this is why calloc has to be built into the standard library, and you can't fake it yourself.

Err...mmap(2)?


You can't fake it yourself from within the C stdlib is what the author means.


I suppose another alternative would be for memset() to check if the page is already mapped to the zero page, and to do nothing if it is. There are some bitset-related data structures that should make that pretty efficient.


Somebody needs to go update all the StackOverflow answers saying that malloc is faster. According to this, calloc seem to always be faster with several other benefits as well.


It is not as you say.

The article suggests that malloc + memset is slower than calloc.

malloc will be faster depending on your use case: if your plan is to eventually call memset, then just use calloc; otherwise malloc will be faster all the time.


n00b question: Why would you not memset? I would assume you'd want to start with all zeroed memory in almost all cases.


Not always, you could be planning on filling the data with something else. Very common to do that.


I.e.: reading from a file or copying memory?


> I mean, let's be honest: if we really cared about security we wouldn't be writing in C.

How so? C is low level, so to be secure you must be fully aware of the behaviours and side effects of what you're doing. In another, perhaps higher-level language, sure, there may be fewer of these gotchas, but to be properly secure you need a similar amount of knowledge about background behaviour.


Why do so many people disagree on something that should be nearly empirical?



