How to dismantle a compiler bomb (codeexplainer.wordpress.com)
201 points by gregorymichael on March 9, 2018 | 46 comments



I had a famous machine learning professor send me a C program that crashed in init on a 32-bit machine because it allocated an 8GB array...

It was particularly strange to set a breakpoint at the beginning of main() with gdb and see that the program never got there. Oddly enough, he never actually used the 8GB array; he'd had no problem allocating it on the POWER workstation he was using.
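(For illustration, a hedged guess at the shape of such a program; the names and sizes here are mine, not his, and shrunk so a 32-bit compiler will even accept the declaration:)

    /* Hypothetical reconstruction. A huge static array lives in .bss,
     * so the loader must map it before main() ever runs; if that
     * mapping fails, a breakpoint on main() is never hit. */
    static double big[1UL << 28];   /* 2^28 doubles = 2 GiB of .bss */

    int main(void) {
        return 0;                   /* 'big' is never touched */
    }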


Not sure what OS he used, but some operating systems, Linux for example, overcommit memory. You can allocate an 8GB array just fine, and as long as you don't touch it, no actual memory gets used.

That’s why it worked on his machine. It didn’t work on yours because of the lack of address space.
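A minimal sketch of that behaviour (hypothetical demo, 64-bit Linux with the default overcommit heuristic assumed):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t size = 8ULL << 30;   /* 8 GiB: wouldn't even fit in a
                                       32-bit address space */
        char *p = malloc(size);
        if (!p) {
            puts("malloc failed");
            return 1;
        }
        puts("malloc succeeded; almost no physical memory used yet");
        /* memset(p, 1, size); <- touching every page would force the
         * kernel to back them with real memory (or invoke the OOM
         * killer); merely allocating is nearly free. */
        free(p);
        return 0;
    }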


> Linux for example, overcommit memory.

Not if you tell it not to. This is a common configuration on servers and in other environments where relying on overcommit to function is considered a bug.
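(On Linux the knob is vm.overcommit_memory: 0 is the default heuristic, 1 always overcommits, 2 disables overcommit. A tiny C sketch to check it, assuming a /proc filesystem:)

    #include <stdio.h>

    int main(void) {
        /* 0 = heuristic (default), 1 = always, 2 = never */
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode;
        if (f && fscanf(f, "%d", &mode) == 1)
            printf("vm.overcommit_memory = %d\n", mode);
        if (f)
            fclose(f);
        return 0;
    }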


Correct, but it is the default setting.


The typical cause of this problem is not that the program runs out of memory but that it overflows its stack. In this case, of course, 8GiB would wrap right back around to the start — but I would think that would result in failing to generate an executable, not generating a crashing executable, unless the compiler was super slapdash.


Compilers rarely have any idea how large your stack space is. Quite often it's determined at runtime in various ways, especially in multithreaded environments.

(There are some surprising exceptions, such as PIC hardware stacks where you might be allowed exactly 8 call frames and your whole program's call graph must be a DAG with no recursion.)


Okay, but in this particular case, we're talking about an 8GiB array on a 32-bit platform. Its size truncated to 32 bits is 0x00000000. If the compiler doesn't detect that, it's a compiler bug.
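A quick demonstration of the truncation (hypothetical snippet):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t eight_gib = 8ULL << 30;          /* 2^33 bytes */
        printf("0x%08x\n", (uint32_t)eight_gib);  /* prints 0x00000000 */
        return 0;
    }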


Same thing happened to me in C++ with gcc, making an array too big on the stack. Or maybe that was a dynamically sized array. Can't even remember the context now; Project Euler, maybe.


The code in the example asks for a top-level array of length (uint)-1, initialized with {1}. Overcommit would never help, because where this blows up is not in running the program but in compiling it.

Even if that weren't true, the only way overcommit ever helps is if you don't initialize or otherwise touch the memory you allocate.

I'm not sure what happens if you try to write zeroes into overcommitted memory, but I would bet it breaks: each write page-faults and Linux has to actually back the page.
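(For reference, if I'm remembering the codegolf entry the article dissects correctly, the entire source file is:)

    main[-1u]={1};

Implicit int makes main an array of 4294967295 ints, and because of the {1} initializer it can't live in .bss, so the compiler tries to emit all ~16GB of it into the output file.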


I was of course referring to the example in the post I was replying to. If someone allocates an 8GB array but never uses it, that doesn't do anything. The code you're referring to explicitly declares that array as main.


"it works on my machine"



The optimizer should remove that at compile time if it's not used.


If it's a global variable not declared with `static`, another compilation unit could access and use it.
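A small sketch of why (hypothetical file names):

    /* bomb.c */
    int huge[1000] = {1};           /* external linkage: another
                                       translation unit may use it,
                                       so it must be emitted */
    static int hidden[1000] = {1};  /* internal linkage and unused:
                                       the compiler may drop it */

    /* other.c */
    extern int huge[];              /* perfectly legal elsewhere */
    int first(void) { return huge[0]; }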


I'm reminded of the C++ exploding error competition: https://tgceec.tumblr.com/post/74534916370/results-of-the-gr...


Also interesting are the other compiler bomb submissions:

https://codegolf.stackexchange.com/questions/69189/build-a-c...

All are short bits of code in a variety of languages that expand to massive files.


This would be an interesting DoS attack for open source CI systems that check pull requests. Especially when combined with some sort of distributed network build cache like Bazel's, you could easily fill the cache by making a few pull requests with this.


The main practical takeaway is to prefer iterators over pre-generated arrays.

Also, TIL according to the C standard, 'main' is not a reserved identifier! (https://stackoverflow.com/questions/34764796/why-does-declar...)

If anyone can clarify: I assumed that gcc read literals directly into their 'target' type, but it seems like some literals (such as '-1u') are read as signed integers first and then converted to the target type?
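(A quick way to poke at this: there are no negative literals in C, so -1u parses as unary minus applied to the unsigned literal 1u, and no signed type is ever involved. A hypothetical snippet:)

    #include <stdio.h>

    int main(void) {
        /* -1u is -(1u): unary minus on an unsigned int, which wraps
         * modulo UINT_MAX + 1. */
        printf("%u\n", -1u);      /* 4294967295 with 32-bit int */
        printf("%d\n", -1u > 0);  /* 1: the expression is unsigned */
        return 0;
    }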


> Also, TIL according to the C standard, 'main' is not a reserved identifier! (https://stackoverflow.com/questions/34764796/why-does-declar...)

That's not quite true.

Main is not required to be defined in a freestanding environment, but in a hosted environment:

> 5.1.2.2.1 Program startup
>
> 1 The function called at program startup is named main. The implementation declares no prototype for this function. It shall be defined with a return type of int and with no parameters:

    int main(void) { /* ... */ }

> or with two parameters (referred to here as argc and argv, though any names may be used, as they are local to the function in which they are declared):

    int main(int argc, char *argv[]) { /* ... */ }

> or equivalent; or in some other implementation-defined manner.

main can take on differing types, but it then becomes undefined behaviour, which allows the compiler to do whatever it wants.

(5.1.2.2.1 Program startup, C11 Standard http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf)


> according to the C standard, 'main' is not a reserved identifier!

In fact, main() is just a convention of libc. You can have C without libc. (Such as when writing a kernel!)

Now, attempting to link a standalone executable without a '_start' symbol, on the other hand...


> In fact, main() is just a convention of libc.

No, it's not just a convention. 'main' is defined as the execution entrypoint in at least the C11 [0] and C99 [1] standards. Both define these two forms:

    int main(void) {}

    int main(int argc, char* argv[]) {}
You don't have to follow that convention, but then it becomes implementation-defined behaviour.

C without libc can still have a main; one doesn't imply the other. It's just that without libc, you also have to provide your own _start to call it.

[0] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
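For a concrete sketch of the no-libc, no-main case (x86-64 Linux assumed; details mine, not from the standard):

    /* tiny.c: no libc, no main; _start is the linker's default
     * entry symbol. Build: gcc -nostdlib -static -o tiny tiny.c */
    void _start(void) {
        /* raw exit(42): syscall number 60 is exit on x86-64 Linux */
        __asm__ volatile("mov $60, %rax\n\t"
                         "mov $42, %rdi\n\t"
                         "syscall");
        __builtin_unreachable();
    }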


> Now, attempting to link a standalone executable without a '_start' symbol, on the other hand...

GNU ld and gold both support --entry to change the entry point. Alternatively, you can specify it in your linker script.

https://gist.github.com/prattmic/1f3618025aab52b0e90e88445b6...

    $ gcc -nostdlib -ffreestanding -Wl,--entry=foobar -o main main.S
    $ ./main; echo $?
    42


I'm a little baffled to discover -1u is exactly 0xFFFFFFFF — the maximum value of unsigned int.

I would have expected a type error.

Likely there's some important code out there that relies on this strange behavior.


This is common knowledge for C programmers. In the embedded/firmware space, I've seen #define UINT_MAX (unsigned int)(-1) very often. It's convenient because it is always the maximum unsigned integer value regardless of whether int is 16, 32, or 64 bits.


According to the standard, (unsigned int)(-1) is undefined behavior (as is signed overflow), because the machine can use some representation of signed integers other than two's complement. On the other hand, you will probably never find a non-two's-complement architecture in any vaguely production use today.


Nope, casting signed to unsigned is well-defined. The C standard requires it to act like two’s complement regardless of what the machine actually uses:

> Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.

https://stackoverflow.com/questions/50605/signed-to-unsigned...


~0u is also a safe equivalent if there is unwarranted fear around wrapping.
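All of these spellings agree; a quick sanity check (hypothetical snippet, assuming unsigned int has no padding bits):

    #include <assert.h>
    #include <limits.h>

    int main(void) {
        /* Guaranteed by the unsigned wrapping and conversion rules,
         * regardless of the machine's signed representation. */
        assert(-1u == UINT_MAX);
        assert((unsigned)-1 == UINT_MAX);
        assert(~0u == UINT_MAX);
        return 0;
    }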


Apart from the fact that this is not integer overflow, what you see in system headers is not necessarily what you can do in a regular program's headers. System headers are part of the toolchain and can rely on platform specifics, actively using them to provide compliant values for you. When you're on a different platform, a different set of definitions is provided (though most cross-platform standard library implementations try to abstract the specifics deep into their internals and builtins, since that is useful for stdlib writers too).


Unsigned underflow/overflow is very clearly defined.

Signed integer underflow/overflow is UB, on the other hand.


You are not aware of two's complement? At the end of the day it's just bytes when you go down to the metal, the only difference is how they are interpreted.

https://en.wikipedia.org/wiki/Two's_complement


Why doesn't the {1} in this case initialize the array to all 1s?


The spec defines any remaining elements to be initialized to 0.

You can even do sparse initialization if we're talking C99 here, like so:

int array[] = {1, 2, 3, [99] = -1};

And you'll get an array of length 100: {1, 2, 3, 0, 0, 0, ..., 0, -1}

Any element without an explicit initializer is always 0, as long as the array has an initializer at all.


It's because you can initialize the array with particular values, like: `type a[3] = { 1, 2, 3 };`.

But I agree that it would make sense to fill the array (or have an easier method to do that); it may just be a matter of speed. I think OSes generally hand over memory zeroed out to prevent a program from reading memory left behind by previous processes, so it's a case of allocating the memory space and then continuing, as opposed to setting values for each position.


Until now, I always thought this:

    int array[5] = {X};
Could be used to set all values of the array to X, but I have never used it for anything other than 0. That's somewhat surprising behavior.


Because that's not how array initialisation works in C. The specified indices are initialised with the provided values, and the missing elements are always initialised to 0. See for instance section 6.7.9 of the C99 standard. Perhaps you were confused by the common idiom:

> int array[5] = { 0 };

which takes advantage of this missing-element initialisation behaviour.
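A self-contained illustration of both behaviours (hypothetical demo):

    #include <stdio.h>

    int main(void) {
        int a[5] = {1};                /* {1, 0, 0, 0, 0}: the rest is
                                          zero-filled, not set to 1 */
        int b[] = {1, 2, 3, [9] = -1}; /* C99 designated initializer:
                                          length 10, b[3..8] are 0 */

        printf("%d %d\n", a[4], b[5]);           /* prints: 0 0 */
        printf("%zu\n", sizeof b / sizeof b[0]); /* prints: 10 */
        return 0;
    }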


Or when you use range instead of xrange with big numbers (python)...


Not an issue anymore on Python 3.


Because xrange became range.


Mildly interesting at best. Other languages have range operators you can abuse, or other similar tricks with macros, constants, etc. I'm not sure this is a problem that needs attention.


If you're compiling/running random unverified source you have plenty of other problems already. :)


TL;DR: a one-liner source file declares main as a 4GB array; in the process the compiler might run out of memory.


s/4GB/4GBx4 = 16GB/


To add to this, a "tl;dr" is only useful if written by people who actually read the whole article. That is, written by people who did not apply "tl;dr" themselves.

In this particular case, the article clearly says:

> The array will contain 4294967295 integers, each with the size of four bytes, taking up 17179869180 bytes in total.

Later, the article stresses again that you have to multiply by four:

> having the size of 10000 integers, that is, 40000 bytes

It's really hard to imagine how somebody who actually did read the article could have missed that.


It's easy: it was too long, so they didn't read it; they skimmed.


Well, right. So we're back to the starting proposition: that you shouldn't write a TL;DR if you didn't read the article.

Anyhow, this was more of a short blog post than a full article. It was a 2-minute read (not exactly a novel). And the thought experiments at the end are interesting, and illuminate some important points about how the language operates. Definitely worth the read.


I think their interpretation of "tl; dr" is: "it was too long so I didn't read it, and here is what I think anyway".



