Zero and forget -- caveats of zeroing memory in C

qznc · on Oct 29, 2012

Making it hard for the compiler to optimize your memset-zero away is not a long-term solution. At some point in the future the compiler might be able to analyze this and optimize it away. As a cryptographer you should not rely on bad compilers.

Actually, using his memzero solution would work, but not because of his reasons. Putting memzero into another compilation unit (.c file) requires to compile it separately. memzero itself cannot be compiled to a NOP, since the compiler does not know how it is used and a call to memzero cannot be optimized since the compiler does not know what it does.

Nevertheless, link-time optimization in theory could still optimize across compilation units. The only solution which comes to my mind is to use 'volatile' for the memory access, but that will never be fast.

saurik · on Oct 29, 2012

> The only solution which comes to my mind is to use 'volatile' for the memory access, but that will never be fast.

As you are insisting that the memory is accessed when you demand that the memory is wiped for cryptographic purposes, you will not be burned by the usage of volatile. (To be clear, you would of course not use the memory with volatile: you would add that qualifier only when you went to wipe it.)

qznc · on Oct 29, 2012

If you have a type like "volatile int*" for your array elements to zero, the compiler cannot use SSE intructions to write the zeros, e.g. MOVNTDQ.

http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/...

saurik · on Oct 29, 2012

Interesting. Is there a reason for this? I was under the impression that volatile only required that the accesses actually happen, not that the accesses had to happen in a manner considered "boring". Is the issue that volatile is also demanding that the ordering remain consistent, and the SSE instruction is not capable of guaranteeing that?

(edit:) In fact, that instruction, and a small handful of others (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) do seem to cause re-orderings. On x86, at least, any other form of optimization should continue to be allowed (involving cache-lines, etc.), but you are definitely right: this instruction's usage would not be. :(

jbri · on Oct 29, 2012

The easy way of reasoning about what optimizations the compiler can do with a volatile location is to think "If this were actually a memory-mapped IO port, would this compiler optimization change the observed behaviour".

CJefferson · on Oct 29, 2012

One problem is that volatile is in practice often used for writing to memory mapped ports. I suspect in that situation using multi-memory address instructions might lead to pain. Of course in x64 such things might be less common / not make sense, but in general if you say volatile you are saying "do every read and write I tell you to, in the order I tell you to".

erydo · on Oct 29, 2012

Even putting it into a separate compilation unit isn't a long-term solution: compilers that do whole program optimization may still be able to optimize it out. I believe just declaring the operand to be volatile would prevent it from being optimized out, however. e.g.:

    void* memzero(volatile void*, size_t);

(However, I haven't tested this).

nkurz · on Oct 29, 2012

Volatile is definitely a better solution than the approach in the article. I think your solution works, but also am not certain. I haven't tried either, but I also wonder if this works:

   (volatile void *) memset(a, 0, 3)

I think the answer for both of those is in this link but I'm not savvy enough to figure out for sure: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22278#c43

But there's a safer and clearer approach I'd probably actually use (although also untested):

  volatile void *prevent_optimization = memset(a, 0, 3);

qznc · on Oct 29, 2012

I do not think this works, because volatile has no effect, if you do not access prevent_optimization.

I'm also not sure of erydo's approach. Given "volatile int x", is "(x+1)" a volatile access? I cannot clarify that with ANSI C99.

nkurz · on Oct 29, 2012

I think (hope?) that the write would be counted as an "access". If not, this would break the "what if this was an IO port" analogue. If for some reason it wasn't, you could use a static variable to hold the result. I fear the real problem with "volatile" is that just about everything about it is implementation dependent rather than clearly defined by standard.

primitur · on Oct 29, 2012

Zero-before use is standard practice, anyway, at least in safety-critical/crypto/life- systems development.

Why leave something like that to compiler semantics? If the block of memory you think is safe turns out in fact to be a back-door, well then: thats your fault, not the compiler, operating system, etc.

Well, there's also the rule "Never use malloc() in-process", too, which means: before main(), all your vars and memory are already initialized and allocated as you need them from the start. Oh, and other silly rules too, which can prevent a death or two along the way ..

ToastOpt · on Oct 29, 2012

> bad compilers

Good compilers. Compilers are, most commonly, judged on their ability to optimize code.

> since the compiler does not know how it is used

That's not what the programming language specification says. See the "as if" rule in C.

frede · on Oct 29, 2012

You might actually want to put your memzero function into a shared library to be on the safe side.

cygx · on Oct 29, 2012

Personally, I'd just write a wrapper function for memset().

The following one works in at least gcc 4.5.3 and clang 3.1, but is actually not guaranteed to work by C language semantics:

    static inline void memzero(void *volatile block, size_t size)
    {
        memset(block, 0, size);
    }

A safe alternative is

    static inline void memzero(void *block, size_t size)
    {
        static void *(*volatile const memset_)(void *, int, size_t) = memset;
        memset_(block, 0, size);
    }

but has the downside that the virtual call cannot be optimized away, whereas gcc and clang actually inline the call to memset() if the first version is used.

If this is a concern, there's probably no alternative to an explicit loop:

    static inline void memzero(void *block, size_t size)
    {
        volatile unsigned char *bp = block;
        while(bp < (unsigned char *)block + size)
            *bp++ = 0;
    }

fhars · on Oct 29, 2012

That implementation will die with a bus error for seven out of every eight possible values of mem on most non-x86 architectures, zero out the wrong memory regions for these values on some more exotic architectures, and be horribly slow everywhere else. http://stackoverflow.com/questions/1496848/does-unaligned-me...

dexen · on Oct 29, 2012

Whether memset(), or any other function, gets optimized away by GCC should depend on function attributes (1) -- more exactly on the `pure'; possibly some others. However, GCC (tested with 4.7.1) somehow considers memset() pure regardles of declaration. The default declaration is:

  $ echo '#include <string.h>' | gcc -E - | grep memset
  extern void *memset (void *__s, int __c, size_t __n) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1)));

When replaced by hand with declaration lacking any attributes, it still gets optimized away.

    	/* will be optimized away for unclear reasons */
    extern void *memset (void *__s, int __c, size_t __n);

Contrast that to behavior of any user-defined function:

    	/* may be optimized away */
    extern void * memxxx(void *__s, int __c, size_t __n) __attribute__ ((pure));
    	/* should not be optimized away */
    extern void * memyyy(void *__s, int __c, size_t __n);

IMHO GCC's special handling of memset() is broken...

(1) http://www.cs.auckland.ac.nz/references/c/gcc4.7/Function-At...

saurik · on Oct 29, 2012

(edit: I just noticed that qznc's earlier top-level comment makes this same overall point, even down to the specifics about future optimization and volatile. That comment is also shorter, so it is probably provides more "bang for the buck" to read it instead of mine.)

I am pretty certain that memset is not conceptually being handled especially different: it is simply being inlined. If you have a memset of a variable to 0, gcc's compiled output might involve just a single mov instruction and if it is a small array, it might be a few mov instructions: if it is a lot of memory being cleared, it might involve a call.

However, as it is inlined, the semantics are pretty clear for how it should work (and attributes like pure will be deduced, as they would for any function you include in your actual code; you don't need to mark it as such <- edit: to be clear, though, that I really don't think that memset is "pure"... it is pretty much the exact opposite of "pure"): it is equivalent to assigning a variable a value, and if that variable's value will not ever be used, then the assignment will not happen.

Instead, if the developer really really really insists that that assignment happen even if, from the perspective of the C language standard there is no legitimate side effect of that operation, the correct thing to do is to temporarily qualify the pointer with volatile and then do whatever it is you wanted (such as wiping it).

Sadly, this article does not even in a single place mention "volatile", so I'm not certain that the author understood how this works. Instead, he states that a "solution" is to instead put a zero in the first element and then copy it to subsequent elements... something that could easily be detected by abstract interpretation, and which a sufficiently smart compiler would be correct in also optimizing away.

dexen · on Oct 29, 2012

It's not about volatile; marking the array as `volatile' caused warning, and the call is still optimized away:

    a.c:14:2: warning: passing argument 1 of ‘memset’ discards ‘volatile’ qualifier from pointer target type [enabled by default]

To take `volatile' into account, the `memset()' would have to to have `volatile void * s' as first argument in its prototype.

Curiously enough, making a custom function with volatile pointer argument and `pure' attribute still causes GCC to optimize it away. I guess such combination makes no sense anyway.

    extern void * memaaa(volatile void *__s, int __c, size_t __n) __attribute__ ((__pure__));

saurik · on Oct 29, 2012

Right. I did not claim that once you added the volatile qualifier that you could still use the existing implementation of memset: if you simply cast the pointer to volatile and then pass it to the existing version of memset, you will certainly get a diagnostic indicating that you lost the qualifier, as memset itself does not have volatile as part of its type definition.

However, the author's attempts to redefine memset without using volatile rely on gcc limitations: in the first "trivial" case that it does not use abstract interpretation while looking for constant variables (in this case, it should be easy to fold that loop down to "oh, that rvalue is always a constant 0), and in the second that it does not do extensive optimization across compilation units.

As for the other thing you noticed: while I would imagine that the gcc-specific "I am the developer and want you to do something very special and important" __attribute__ would override any notion that it might have regarding "volatile", they are really orthogonal concepts: with just that prototype the compiler wouldn't even know if that function does anything at all with the memory you passed... the function might just do something with the pointer value and not indirect it at all; therefore, one would pretty much demand that "pure" have the semantics that gcc is giving it here.

rwg · on Oct 29, 2012

If you don't compile with "-fno-builtin" or "-fno-builtin-memset", gcc treats memset() as a builtin function when optimization is enabled. I'm guessing it overrides memset()'s prototype with that of __builtin_memset(), but I don't know enough about gcc's internals to be sure.

If you compile with "-fno-builtin-memset", gcc uses whatever prototype is in the preprocessed source code and generates calls to memset() as appropriate. However, if you provide your own memset() implementation in the same compilation unit (or in a different unit w/ LTO) that doesn't play volatile or asm("") games, then the compiler still might be able to deduce that the memset() call can be optimized away.

ToastOpt · on Oct 29, 2012

> IMHO GCC's special handling of memset() is broken

I appreciate how you've distinguished your opinion from objective facts. E.g. according to spec, it's not broken, but it's a perfectly valid opinion to hold that the spec is broken too. :-)

bvrlt · on Oct 29, 2012

The assumption that the trivial solution won't be optimized out is, I think, wrong. From your experiment GCC is indeed not smart enough to do so but I would bet that a compiler like ICC would. In this case the best is probably to use some pragmas to avoid optimizing out a statement.

rwg · on Oct 29, 2012

CERT's Secure Coding wiki has more to say on the subject, including portable code for a memset_s() function that can still potentially be optimized way:

https://www.securecoding.cert.org/confluence/display/seccode...

https://www.securecoding.cert.org/confluence/pages/worddav/p...

tedunangst · on Oct 29, 2012

OpenBSD has a function called secure_bzero. All it currently does is call regular bzero, but if external compilation isn't enough to do the job, we come up with something else.

Regarding the article, I don't at all understand why the three arguments are necessary. Why would the following patch not work?

    - memset(x, 0, n);
    + memzero(x, n);

ibotty · on Oct 29, 2012

i think it's to keep the patch(es) easier to read. it certainly feels awkward.

nitrogen · on Oct 29, 2012

Note that the type punning is only actually useful on systems where memory addresses are 64 bits wide, hence we include that code conditionally for environments with the LP64 data model, which incudes most Unix-like systems.

The first statement seems false. I was not previously aware of an association between the number of address lines and the size of the data bus on computing systems? I know I've had 32-bit processors with at least 64-bit memory buses, and the SheevaPlug has a 32-bit processor with a 16-bit memory bus.

Also, the code above this paragraph will only use wide accesses on 64-bit architectures ("#ifdef __LP64__"), even though there are benefits available on 32-bit systems.

tsahyt · on Oct 29, 2012

Why isn't there a keyword stating "don't optimize this!" in the C Standard? If there is is, please correct me. There are a bunch of similar problems introduced by optimizing compilers that could be solved with such a keyword.

mikeash · on Oct 29, 2012

The volatile keyword is effectively that keyword. The trouble is that people want to have their cake and eat it too. They want some optimizations (e.g. a memset that's more efficient than zeroing byte by byte) but not others (e.g. skipping the whole thing).

By specifying volatile and zeroing out the memory in a loop, you're guaranteed that the compiler won't optimize away the zeroing. But if you want something faster, you need to get cleverer, because there's no universal "optimize this way but not that way" command.

ygra · on Oct 29, 2012

C is only ever concerned with side-effects regarding the C abstract machine. Which is not at all the real machine it's running on. The thing is that optimisations only need to verify that they don't change what is observable from within the abstract machine. Which sometimes makes things strange.

smcl · on Oct 29, 2012

One way to inhibit optimisations we used was to declare one of the variables involved as "volatile". I'm not sure about MS, Intel, GCC etc but ours took this as a sign to not run any optimisation passes over code involving volatile variables.

coob · on Oct 29, 2012

In GCC 4.4 and up you can use the following:

    #pragma GCC push_options
    #pragma GCC optimize ("O0")

    memset(a, 0, 3);

    #pragma GCC pop_options

'volatile' should also work, I don't see why there is any need to rewrite memset. Here is a related question on SO:

http://stackoverflow.com/questions/2219829/how-to-prevent-gc...

Edit: buried at the bottom of that post is another method:

    __attribute__((optimize("O0")))

Someone · on Oct 29, 2012

Define "no optimization". On most CPUs, the compiler will have to run some register assignment algorithm. If you insist that every variable load and store must hit memory exactly once, performance will be dreadful, and the compiler probably still would have to run some code to decide which intermediates to spill to memory. Also, compilers typically have to decide on instructions to use. To clear a variable, do you use CLR, XOR, or SUB? In short: there is no clear definition of "don't optimize that"

mseebach · on Oct 29, 2012

I've only ever done an short compiler class in college, but wouldn't a sufficiently sophisticated optimizer unroll the loop and propagate the constant, thus eliminating the read, allowing the entire block to go away?

Someone · on Oct 29, 2012

Relevant for those thinking that "volatile" will save the day: http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf

It shows that a) volatile does not have to do _anything_ (if that is documented by the compiler) and b) you cannot trust your compiler to be bug-free. The latter is one reason to follow cert.org's advice to read the disassembly output of your compiler.

fserb · on Oct 29, 2012

Could please someone explain why this is an issue in the first place?

If the memory of the process is available to other processes after it finishes (or while it's running), isn't this already a lost game? I.e., how can you be sure that this particular chunk of memory wasn't cached on disk at some point? how can you be sure that someone didn't access it before your memset() call?

alexkus · on Oct 29, 2012

Remote exploits for one.

Imagine a web server that didn't wipe plaintext passwords or encryption keys from its memory after finishing with them. If the web server was remotely exploitable then it could be possible to obtain the contents of the memory of that process remotely, thus possibly leaking passwords or other sensitive information of other people that have connected to that web server at some point in the past.

> how can you be sure that someone didn't access it before your memset() call?

True, but there's a difference between having a very small window of opportunity where the data could be obtained via a remote exploit, and leaving the window wide open for possibly endless period of time.

Obviously you shouldn't have any remote exploits in the code in the first place, but it's good practice for secure programming to keep the sensitive information in memory for as short a period as possible just in case there is something that you aren't aware of.

alexchamberlain · on Oct 29, 2012

You can't be, but we have to generate keys and use passwords. By deleting the data - by writing zeros - you are cutting down the opportunity for attack.

Bear in mind that other normal user processes can only access the memory once it has been returned to the OS. Therefore, it means you have to have a malicious super user to access your keys whilst the program is running.

zokier · on Oct 29, 2012

>If the memory of the process is available to other processes after it finishes

Of course the memory will be available to other processes because we don't have infinite supply of memory. Stuff like mlock(2) can prevent paging to disk, and there are probably other countermeasures too which I'm not aware of.

icebraining · on Oct 29, 2012

I.e., how can you be sure that this particular chunk of memory wasn't cached on disk at some point?

Using something like mlock(), supposedly.

sn · on Oct 29, 2012

A good solution in c is to use calloc, which is malloc + zero'ing out the memory.

http://www.cplusplus.com/reference/clibrary/cstdlib/calloc/

Also, in a standards-compliant compiler, statically declared variables are automatically initialized to zero unless stated otherwise.

Someone · on Oct 29, 2012

That won't work here. Here, the goal is to have 'cfree': zero'ing out memory and then freeing it.

sn · on Oct 29, 2012

OK, shows me for skimming too much.

There apparently used to be a cfree though it appears to be equivalent to free.

I upvoted rwg's comment regarding CERT's Secure Coding wiki. memset_s is the correct solution given it is part of the C11 standard.

AndreyKarpov · on Oct 30, 2012

Relevant artcile: Overwriting memory - why? - http://www.viva64.com/en/k/0041/