I've always thought that there would be more value in exactly the opposite - a pathological C compiler (or C++ compiler, POSIX operating system, etc.). The pathological compiler would always implement undefined behavior in the most surprising, unpredictable manner - e.g., uninitialized memory would have a properly randomized value that changes from one run to the next, overflows would result in random values, etc.
That way, any programs that are made to work on the pathological implementation are largely guaranteed to rely only on standardized behavior, and work on any other standards compliant implementation.
A long time ago, gcc, when encountering "undefined" behavior, would try to launch Nethack[1]. (That is valid behaviour, according to the spec.)
Seriously, ask around. There was a multi-month fat-chewing about gcc's pathological interpretation of "undefined" earlier this year on the Cryptography list, and it tends to come up anywhere C programmers with an interest in security drink.
That's inaccurate, according to the page you linked to (and to my memory of the incident).
gcc 1.17 would invoke nethack (or one of several other similar games, if available) if it saw an unrecognized #pragma directive.
According to the C standard, the behavior of a #pragma not followed by STDC is implementation-defined, not undefined -- which means that an implementation is required to document its behavior. (I presume that gcc did so.)
gcc did not launch Nethack in response to undefined behavior in general, or in response to anything other than an unrecognized #pragma directive.
But that is the thing: undefined behavior just says that the compiler takes no position on what should happen. So a boring compiler that refuses to compile code which would have undefined behavior is sufficient. It makes idiomatic C a bit more challenging, but if you've read through Dan's code you know that he's all about picking a style and sticking with it. And for that set of constraints, it's all defined (and boring, because there is very little, if any, optimization going on inside the compiler).
So sub-optimal but predictable and safe code given that you change the preference bit to safety over speed.
A lot of undefined behaviour cannot be identified at compile time. For example, adding two signed integers is undefined if it overflows, but perfectly well defined if it does not.
A safe compiler will have to add checks to guard against undefined behaviour happening at runtime.
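For example, a minimal sketch of the kind of guard such a compiler would have to emit around every signed addition (add_checked is a hypothetical name; __builtin_add_overflow is a GCC/Clang extension used here only for illustration):

#include <stdlib.h>

int add_checked(int a, int b) {
    int result;
    if (__builtin_add_overflow(a, b, &result))
        abort();            /* overflow would be undefined behaviour */
    return result;
}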
I had the same opinion at first and would agree if UB was mostly possible to avoid in meaningful programs, but it isn't. The main casualty would be the reason why this is being discussed at all: compilers need to make very specific assumptions on UB to enable some optimizations and if UB goes away, so do the optimizations. So we get the performance of the "boring" compiler anyway, just with more effort on behalf of the programmer.
DJB is (as always, I guess) right, C simply isn't really suited to optimizations based on UB. What'd be the big deal if compiler developers just stopped inflicting these on C programmers and focused on optimizing compilers for better-defined languages (with range types, specified overflow behavior etc.) instead?
You're acting like it's unambiguously bad, but that's not true. For example, I might create a macro DEREF_AND_FREE(x) which expands to if(x)free(*x). Often, I'll lazily use that in places where I know x isn't null. It's more readable and maintainable than splitting the macro up into two separate macros, DEREF_AND_FREE_IF_NONNULL and DEREF_AND_FREE_I_KNOW_ITS_NONNULL. These UB-based optimizations which are "inflicted" on me mean that I don't take the performance hit for my laziness.
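A sketch of how that plays out (the macro is from the comment above, wrapped in do/while for safety; the caller and its names are invented):

#include <stdlib.h>

#define DEREF_AND_FREE(x) do { if (x) free(*(x)); } while (0)

/* Because *slot is read first, a compiler may infer that slot cannot be
   NULL and delete the macro's if-check, so the lazy use of the "checked"
   macro costs nothing here. */
void release(int **slot) {
    int *old = *slot;     /* dereference implies slot != NULL */
    (void)old;
    DEREF_AND_FREE(slot);
}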
> What'd be the big deal if compiler developers just stopped inflicting these on C programmers and focused on optimizing compilers for better-defined languages (with range types, specified overflow behavior etc.) instead?
This is non-trivial. These definitions come with tradeoffs that fundamentally alter the language. However, this is basically what rust is.
I don't know if it's guaranteed to bark on all possible cases of UB, but there we go:
$ cat test.c
int main() {
return *(int*)0;
}
$ gcc test.c -fsanitize=undefined -o test
$ ./test
test.c:2:9: runtime error: load of null pointer of type 'int'
Segmentation fault
I'm pretty sure I've seen similar tool based on clang as well.
"Undefined behavior" means behavior that is not defined by the C standard. Things other than the C standard (secondary standards like POSIX, compiler documentation, hardware specifications, etc.) are free to define the behavior of any construct that's not defined by the language standard.
And an optimizing compiler is free to transform its generated code based on the assumption that the code's behavior is defined.
Arguably, undefined behaviour is more like 0/0 in (ordinary) mathematics than just something the C standard fails to specify (you want implementation defined for that one).
An optimizing compiler is free to transform its generated code based on the assumption that undefined behaviour never occurs.
There is of course the issue whether you want the compiler to do that. Usually optimizing one parameter in a design to the exclusion of everything else is a strong indicator that the engineers involved are functionally incompetent.
Pardon the bald appeal to authority, but your assertion reminds me of a Richard Hamming quote:
"""
As we say, the volume is almost all on the surface. Even in 3 dimensions the unit sphere has 7/8-ths of its volume within 1/2 of the surface. In n-dimensions there is 1–(1/2^n) within 1/2 of the radius from the surface.
This has importance in design; it means almost surely the optimal design will be on the surface and will not be inside as you might think from taking the calculus and doing optimizations in that course. The calculus methods are usually inappropriate for finding the optimum in high dimensional spaces. This is not strange at all; generally speaking the best design is pushing one or more of the parameters to their extreme—obviously you are on the surface of the feasible region of design!
"""
I'm reminded of a friend's bitching about his company hiring a couple of ex-Bell Labs video compression guys to develop video codecs. He said they spent a year and came up with a great codec. Decode speed was twice as fast as anything else. Great!! Encoding speed was about 1000 times slower, so encoding 5 minutes of video would take 3.5 days. And thus totally unusable.
The issue is that design axes are entangled. And also, figures of merit are nonlinear. Twice as whatever doesn't mean twice as good.
You're designing a standard for people to implement in their specific compilers, and you want to let them generate fast code... I hardly think the mere existence of "undefined behavior" indicates incompetence.
Dereferencing NULL, in a C program, is ALWAYS undefined behavior. The fact that you are using a Cortex-M processor does not change this fact. It is spelled out very clearly in the C standard.
I'm a bit curious how a pointer with value 0 ended up being banned by C. Specifically, on x86 real mode (which includes modern systems early in boot before switching to protected mode), there's a hardware defined data structure at address zero: the "Interrupt Vector Table".
Maybe everyone who wrote real-mode systems software either ignored that bit of the C standard, or they fudged around it (IIRC the IVT is an array and the first entry doesn't matter, so you can start a few bytes past zero), or they used assembly to access it.
In C, these two code fragments do not necessarily give the same result (assume pointers and longs are the same size):
long zero = 0;
struct IVT * ivt_ptr = (struct IVT *)zero;
and
struct IVT * ivt_ptr = 0;
/* or (struct IVT *)0 if you like that better stylistically */
The former gives you a pointer whose bits are all 0. The latter gives you a null pointer. On systems where 0 is a valid memory address the compiler should pick some other address that is not valid.
Similarly, these two are not necessarily the same:
if ( ivt_ptr == 0 ) ...
and
if ( ivt_ptr == (struct IVT *)zero ) ...
/* or if ((long)ivt_ptr == zero ) ... */
0 only represents the null pointer in C when it is the constant 0.
> On systems where 0 is a valid memory address the compiler should pick some other address that is not valid.
Not necessarily. Null dereference is undefined behavior so it may as well access some valid memory.
Even on platforms where 0 is a valid address, practical compilers tend to use 0 as null constant because it simplifies implementation (easy to check for null, matches the C syntax without WTFs).
This comes at the cost of making some small part of memory unusable to portable C code (nobody will believe you when you return 0 from malloc if 0 happens to be considered null), but that's fine because such memory is typically used for machine-specific, nonportable stuff anyway.
It seems that 0 wasn't a usable address on PDP-11 so K&R decided it's fine to use it as null pointer constant.
And of course dereferencing 0 isn't impossible in practice on arches which support this. Compilers for such machines usually can be coerced to generate code which accesses 0, you just have to live with the fact that your code isn't considered "valid, portable C" anymore.
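Something along these lines is the usual way to coerce a compiler into emitting an access to address 0 on such a platform (a nonportable sketch; read_address_zero is a hypothetical helper):

#include <stdint.h>

/* Going through a non-constant integer and a volatile-qualified pointer
   keeps the compiler from treating this as a "can't happen" null
   dereference, but ISO C still gives no guarantees here; only the
   platform does. */
static uint32_t read_address_zero(void) {
    volatile uintptr_t addr = 0;
    volatile uint32_t *p = (volatile uint32_t *)addr;
    return *p;
}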
Like many embedded cores, Cortex-M's don't have an MMU - everyone sees physical memory. In fact, I believe that's the chief distinguishing feature of the 3 core levels - Cortex-M's have no MMU, Cortex-A's have memory paging, and I think Cortex-R's are in between and have memory segments like MIPS.
So if the board designer wired everything up to map RAM at address 0, then yes, you'd be able to dereference a pointer to address 0.
Completely beside the point, though. Dereferencing NULL is always undefined behaviour in C.
If I do a read of a 'null' pointer on the Cortex ARM I'm using, I get the initial stack pointer. If you try to write to that address without some special magic involving the flash controller, you get a bus fault interrupt. Behavior then depends on what the ISR does.
AVR Processors don't have bus faults. Some machines I've worked on have RAM based at address zero.
Which x86 kernel maps something at 0? In Windows and Linux, the low part of virtual address space is reserved for userspace.
In fact, Linux had several privilege escalation bugs which involved putting something at 0 and executing a buggy syscall which loaded this thing due to NULL dereference and believed it's some legit internal kernel data.
OS X and iOS map __TEXT,__PAGEZERO at 0x0.* Of course, it is mapped without any protection levels, so it isn't of much use except for catching most nullptr dereferences (just 4 GB of protection on x86_64!).
* Well, this is implemented by the default system toolchain. Your particular toolchain can opt out of this behavior if it wants to.
"perfectly fine" does not depend on the processor alone, it depends on the compiler. A compiler for x86 can assume dereferencing a NULL pointer is undefined behavior, and optimize your code in ways you perhaps didn't expect.
Or it could document and provide a well-defined behavior for dereferencing a NULL pointer.
(e.g. gcc provides -fdelete-null-pointer-checks to control this)
clang warns about this particular usage of a null constant:
warning: indirection of non-volatile null pointer will be deleted, not trap [-Wnull-dereference]
return *(int*)0;
^~~~~~~~
note: consider using __builtin_trap() or qualifying pointer with 'volatile'
> warning: indirection of non-volatile null pointer will be deleted, not trap
That's a sweet little optimization :)
Probably the only reason they don't abort compilation at this point is that someone complained when it broke his tricky little macro which sometimes generates an unreachable null dereference. Or something like that.
I'm not sure this would be helpful, working with any unpredictable system is a nightmare. The compiler emitting warnings of undefined behavior being utilized would be the best solution, though it would not be able to identify all scenarios. Input fuzzing would likely get you the rest of the way there.
the problem is that the compiler assumes undefined behaviour can never happen ALL the time.
Signed integer overflow is the most common one, since it means you can hoist 32bit ints into 64bit ints so they fit in one register or similar and save hitting the stack.
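A rough illustration of the kind of loop where this matters (my sketch, not the poster's):

/* Because signed overflow is undefined, the compiler may assume the
   32-bit index i never wraps, so on a 64-bit target it can keep i in a
   64-bit register (or fold it into pointer arithmetic) instead of
   re-sign-extending it on every iteration. */
long sum(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}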
I'd personally be fine with being forced to specify what sort of undefined behaviors I want to allow to be "assumed I didn't mean"; it's not a problem for me if some code refuses to compile until I e.g. type-annotate my u32s with "they're not going to overflow, don't worry."
I was about to say that "this would break a ton of other projects"—but actually, it seems like it'd be fine as long as it was an opt-in -W switch.
Emitting warnings and errors mostly works fine for undefined behaviors, the real problem is with implementation-defined behaviors.
Let's say that, on the pathological compiler, it's documented that sizeof(int) is 4, and chars are signed if the number of seconds in the current minute is even, otherwise sizeof(int) is 5 and chars are unsigned. If your code compiles and works with either setup, it's probably correct and portable.
It's not really feasible for programmers to avoid implementation-defined things like sizeof(int) varying across implementations, nor is it feasible for compilers to detect assumptions made about the implementation's definition in the program. The existence of a pathological compiler would make it far easier to write programs that work across all implementations.
It's true that working with unpredictable systems is a nightmare, but that's precisely the nature of undefined and implementation-defined behavior in C. The existence of a pathological compiler would merely allow developers to work through these issues on a single machine, which would be an improvement over needing to obtain a variety of compilers and platforms to test with to find these issues.
Because if you have undefined behavior, you want your program to crash and burn as quickly as possible, so that you can remove the undefined behavior. What seems to be proposed by Dan Bernstein is not so much a "boring C compiler", but rather a superset of C which does not have any undefined or unspecified behavior.
Sometimes if you have undefined behavior, you want the documented extension to happen.
Or you want the de facto widely understood behavior to happen which has happened on every compiler you've used in 30 years: at least every compiler that was for two's complement machines. Or every compiler on which pointers to different data types were of the same size. And so on.
You also don't want to be burned by code that is relying on a common extension, when that is ported.
There is a word for the compiler approach of "we're going to optimize this based on the assumption that the program is not relying on a common extension that is technically UB".
That word is: malpractice.
For instance, it is not unusual for C code to be targeting machines in which all objects are in the same address space, such that pointers are internally like binary numbers, allowing pointers to different objects to be compared for order: ptr1 < ptr2 /* does ptr1 point to a lower address than ptr2 in the one big address space? */
Programmers expect this to work consistently with the address structure of the machine. They don't want the expression to be somehow wrongly optimized based on the assumption that ptr1 and ptr2 must point to the same object.
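As a concrete sketch of that idiom (my example):

#include <stdio.h>

int x, y;

int main(void) {
    int *p = &x, *q = &y;
    /* Relational comparison of pointers to different objects is undefined
       by ISO C, but on flat-address-space machines programmers expect it
       to compare the raw addresses. */
    printf("%d\n", p < q);
    return 0;
}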
Implementations must honor requirements beyond those of ISO C. Such as, for instance, behaviors that those same compilers used to define historically in their past revisions.
If GCC has behaved a certain way for 25 years, and some code has come to depend on that, and then that is suddenly taken away, such that entire GNU/Linux distros continue to compile, but break in random places because of that detail, the fault for the regression lies 105% with the doofus who made the GCC commit. Even if that behavior is not defined by ISO C.
I suspect you're talking about "pointer to member", which doesn't exist in C; it is a C++ concept. This is not a pointer, but a kind of indirection mechanism dressed up in pointer syntax.
I've never seen C++ code which depended on the internals of pointers-to-member, or assumed they had the same size as other kinds of pointers.
A pointer to member is actually a complicated offset into some structure associated with an object's class. This offset can be "dereferenced" with respect to objects of different types in the same hierarchy, taking into account multiple inheritance and virtual bases, etc. A pointer to a Base::Foo function can be applied against a Derived object, such that it resolves to the correct code, even though Base is the third base of Derived, and so Foo is in a totally different vtable position in Derived compared to its position in Base. (The pointer cannot be a blind integer offset.)
In C, we would make a struct containing function pointers, and a simple "pointer to member" would just be an integer "offsetof(our_struct_type, some_function)". Then worry about it later if someone wanted multiple inheritance. :)
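A sketch of that C-style approach (the struct and names here are invented for illustration):

#include <stddef.h>
#include <stdio.h>

struct our_struct_type {
    void (*open)(void);
    void (*close)(void);
};

static void do_open(void)  { puts("open");  }
static void do_close(void) { puts("close"); }

static struct our_struct_type obj = { do_open, do_close };

int main(void) {
    /* The "pointer to member" is just a byte offset into the struct. */
    size_t member = offsetof(struct our_struct_type, close);
    void (**fn)(void) = (void (**)(void))((char *)&obj + member);
    (*fn)();   /* calls do_close() */
    return 0;
}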
Nevertheless, it is a kind of pointer (it's called a "pointer") with a non-uniform size. That its internal structure is composite (just like a segmented 80286 far pointer) is irrelevant.
I think the idea is that any reliance on undefined behavior would blow up in your face and immediately cause bugs, rather than working correctly with some compilers and not with others.
The problem is that undefined behaviors are used for valid extensions. "Undefined behavior" doesn't mean "error".
For instance, including a platform header like #include <fortran.h> is undefined behavior. ISO C has some requirements there: if the header is found, then the directive is replaced with the preprocessing tokens from reading that header (and so if there are syntax errors there, they have to be diagnosed, and so on). But ISO C specifies no requirements as to what <fortran.h> contains, or whether it exists. On one implementation, it might cause the rest of the translation unit to be treated as a Fortran program. On another implementation, it might bring in some declarations related to Fortran interoperability. On yet another, translation might stop with "header not found: fortran.h".
UB is simply "behavior upon use of a nonportable or erroneous construct, of erroneous data, or of indeterminately-valued objects for which [ISO C] imposes no requirements".
There is also a difference between "implementation-defined" and "unspecified". Look it up!
The former is similar to the latter, only it must be documented. Unspecified behaviors imply some choice from among alternatives which need not be documented. (Like the order in which the argument expressions of a function are evaluated.)
Not really. There are a lot of platforms that already have a compiler that the vendor claims is standard, so being able to ensure programs rely only on the standard would make it easy to target all of these existing platforms and compiler implementations.
The "boring" compiler doesn't exist yet, and would need to be implemented on every platform that currently has a standard compiler to be as useful.
Um, sounds like the most appropriate response would be a low-performance C compiler/interpreter that enforces crashes on all run-time undefined behaviors, and refuses to compile for any compile-time checkable ones.
IMO both approaches would be useful. One nice thing about outlawing undefined behaviour rather than defining it, though, is that you'll be writing portable code - if you have a code base that's filled with all sorts of once-was-undefined-but-now-is-defined behaviour, you can never build that code on another compiler.
Because flushing out reliance on undefined behavior with this test compiler means you can build your software on any standards-compliant compiler and have it work the same. In theory.
Funny idea but best to leave that to code analysis tools. So, your compiler works as expected and works well. That's why you use it. Other peoples' work like shit due to bugs and undefined behavior. So, a coding style plus analysis tools to reinforce it make your code work well with others' compilers while you still use The Right Thing. ;)
I used to use with C a replacement for malloc() that would initialize the memory to 0xDEADBEEF. It was amazing how many bugs were forced into the open.
These days, valgrind is a much better option for doing this.
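Something along these lines (a minimal sketch; xmalloc is a hypothetical name, not the poster's actual replacement):

#include <stdlib.h>

/* Fill every fresh allocation with a recognizable garbage pattern so that
   code silently relying on malloc'd memory being zeroed (or on any
   particular stale contents) fails loudly and reproducibly. */
void *xmalloc(size_t n) {
    static const unsigned char pattern[4] = { 0xDE, 0xAD, 0xBE, 0xEF };
    unsigned char *p = malloc(n);
    if (p != NULL)
        for (size_t i = 0; i < n; i++)
            p[i] = pattern[i % 4];
    return p;
}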
Go does this in at least one way. The iteration order of maps is undefined in the language spec. The Go runtime automatically and purposefully randomizes map iteration order so programmers can't rely on whatever arbitrary ordering it may have at any time.
That's what this is, keeping in mind that the compiler's implementation of undefined behavior is an input that should be fuzzed.
Think of, say, integer overflow. The problem is that typical compilers have a consistent (but non-standard) implementation, so it can't be fuzzed without some compiler help.
Tests can show the presence of bugs but not their absence. The idea here is to create a compiler for C (or perhaps a C dialect) so that nasty undefined behavior is guaranteed to not occur.
This is true - in general, a warning means the code is non-standard or ambiguous, so the compiler is falling back to some default or non-standard interpretation. Then, people come to rely on that, and, like you say, not turn warnings on. So, I say that's not a useful class of behavior, and in some cases (trying to write secure / portable software) it would be better if that didn't exist.
> I'd like to see a free C compiler that clearly defines, and permanently commits to, carefully designed semantics for everything that's labelled "undefined" or "unspecified" or "implementation-defined" in the C "standard".
That wouldn't just be a new compiler. That would be a new language. I don't think there are any short cuts to fix C.
C++ is doing something somewhat similar with the "C++ Core Guidelines" rules, which are designed to be checkable by static analysis tools. The long-term goal is that eventually C++ will become a safe subset, when all the error-prone parts of the language have replacements.
> "Following the rules will lead to code that is statically type safe, has no resource leaks, and catches many more programming logic errors than is common in code today. And it will run fast - you can afford to do things right."
It would be a valid implementation of C. But if you want to depend on the new behavior, which was previously undefined, then you would have to reference this new specification. That effectively makes it a new language, because you would no longer reference the C specification, but this new one.
(And if you don't want to depend on the new behavior, which was previously undefined, then there's not much point in using this particular implementation.)
The C standard allows for many different compiler implementations. Say you set in stone and publish to your users, a certain implementation specification.
You have now created a new spec. Really a new language. Your users will make programs that rely on this new specification. Such code will no longer be portable to other standard C-compilers. Their programs might only be well defined in your new language, not in standard C.
For instance, he talks about a C that default-initializes all memory to zero. A program that relies on that will no longer be portable to other C compilers. If it's no longer portable, is it really the same language any more?
Because then you are relying on a specification that's not in the spec. A different interpretation means that your code no longer works. Therefore, you've effectively created a new language.
Sixth-grade students know all the words third-grade students know, but third-grade students do not know all the words that sixth-grade students know. Are the third- and sixth-grade students speaking different languages?
In a certain sense, yes. Sixth graders can say things that third graders cannot understand.
The trick is to try to bring together all of these dialects in a way that everyone can in fact understand. This is why writing language specifications is _hard_.
And even afterwards, you can still choose a dialect over the standard. You'll just lose a certain segment of your audience. That may or may not matter for your purposes.
It's just a figurative use of "literally". It's like saying "He's such a baby" of an immature (<- also figurative!) adult. You aren't saying that such an adult is actually a baby.
Yes, but that introduces some backwards incompatibility. The C standards body has a really, really high bar for doing so. English, not so much. That is a choice they're making.
At this point you pretty much want an alternative language more than a compiler: one that looks like C, but is completely defined. And this kinda makes sense, yet is IMO still quite dangerous (less so than the actual situation, but still too much).
1/ There are tons of people who will say "this project is written in C" when in reality the project is written in boringC (probably even most of the project's authors will), and who will, for example, compile it with a regular C compiler instead of a boringC one, maybe for a performance improvement...
2/ You just can't take C, even with a completely defined semantics, and call it a day for secure programming. It would still be filled with crazy features that are intrinsically impossible to render secure (everything about C "arrays" comes to mind)
So let's take this idea to its real destination and rewrite secure projects using a really secure programming language, instead of just extending the current mess for 30 more years.
Cyclone was already a good response. I'll add that you want one that's better than C and easy enough to port to. One that combines simplicity, fast compilation, fast execution, no undefined behavior, and plenty of safety.
Fortunately, Niklaus Wirth and his people already implemented a half dozen or so of them. Plus industrial variants like Delphi (or FreePascal) and Modula-3. Safer, faster to develop, all used for OS's, most with GC, and fast enough in production. This of course assumes you won't use Ada, which eliminated most of the safety issues as well.
I doubt people will build on them. The dangers of C have too much allure. Plus all the code already written. ;)
Kind of ignored by the masses and unsupported now, yet this is exactly the kind of project the developer community needs... a language that is minimalist, yet with enough features to allow some expressiveness and speed.
So what do you do when an index into an array is out of bounds at run-time? You can't perform a check every time, because that would go against C's principles, and at the same time it would degrade performance, making it less useful compared to newer languages.
How would you implement defined behavior without a significant (which is what an if check is at that level) overhead in this case?
> So what do you do when an index into an array is out of bounds at run-time? You can't perform a check every time, because that would go against C's principles, and at the same time it would degrade performance, making it less useful compared to newer languages.
Bounds checks are not really a problem if you get rid of for loops in favor of constructions that eliminate bounds checks by construction (e.g. iterators). Of course, they're still a problem in languages that continue the tradition of C-style for loops.
This is a classic example of how just defining away undefined behavior can make C unacceptably slow, but does not necessarily make other languages unacceptably slow.
Compared to the economic cost of bugs, bounds checks everywhere are (very very) cheap. Even more so with the CPUs and compilers we have today, which are more than sufficiently smart compared to what existed when C was created. Intel is even adding new instructions to its CPUs to make the checks even less costly than you could imagine in even some crazy C-based scenario you would not have thought possible in the first place, because of C. But the cost here is extra complexity, for something that should have been built in in the first place.
More technically, to address fears of slowdowns even on simpler architectures: the additional checks will likely be hoisted out most of the time (for example, before a loop with linear accesses). When they are not, to have a real impact 1/ the check has to be in a really hot code path (like 1%, or maybe even 0.1% or less, of a real system); 2/ the CPU must not be able to use empty OOO slots and execution units to execute it without any extra penalty; 3/ (this follows from the two preceding points) if we are talking about a big array being accessed in random order, it will be slow because of RAM access anyway, so an extra check won't induce any meaningful slowdown, maybe not even if it were mispredicted (which, performance-wise, it will not be).
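A sketch of what a hoisted check looks like in practice (the names are assumptions; len is the true length of a):

#include <assert.h>
#include <stddef.h>

/* One check before the loop instead of one per element: for linear
   accesses the compiler (or the programmer) can factor the bound test. */
long sum_range(const int *a, size_t len, size_t lo, size_t hi) {
    assert(lo <= hi && hi <= len);   /* single hoisted check */
    long s = 0;
    for (size_t i = lo; i < hi; i++)
        s += a[i];                   /* no per-access check needed */
    return s;
}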
Like always, if you have any performance issue, first profile. And actually, I can't remember ever having heard a single person complain that bounds checking in a program was the cause of their slowdowns. It's often far more macroscopic, and easy to fix anyway when it is that micro. Considering usual modern systems, they are sometimes so slow that the problem is certainly NOT native bounds checking, but very probably architectural madness.
If you were frequently catching out of bounds accesses your CPU's branch predictor would be making mistakes and you'd be frequently eating the branch mispredict penalty. Any extra instruction or check that alters the control flow is often much more expensive than just another addition. However, in this case if you're taking the branch something is seriously broken so that you should not be facing this. Pre-Haswell Intel CPUs only had one branch slot so you still might have a penalty but as you say this isn't the end of the world.
I'm someone currently in the process of drinking the Rust kool-aid, in part because the goals of the language are very closely aligned with what Bernstein proposes here. Further, because of the memory model, the compiler can produce quite competitive performance while still preventing the execution of any undefined behavior (related to his last strawman in the bullet list).
I get that even if Rust offers a panacea to the issues he mentions (which it may or may not, I'm no expert), there's still an awful lot of C code that would need to be deprecated and rewritten. But surely there'd be value in efforts to get away from a language that has such a loose standard?
There is, however, a lot of work being done on formalizing Rust. It's only just started. Hopefully we will end up with a lot of things being formalized someday in the future. Like all such efforts, it's gonna take a lot of time.
I wonder how hard it would be to create a C -> Rust transpiler. The output would probably be God-awful, but it could be a start to moving away from C towards Rust.
A guess: very, very difficult. Writing code to satisfy Rust's checking is not much harder than writing C[1], but converting arbitrary C to satisfy Rust is something completely different.
[1] An' I say it as shouldn't. C is my natural language.
Yes, the problem with doing this is that Rust really imposes certain kinds of design constraints that makes this hard. Like, you COULD, but you'd make a lot of use of unsafe, in a way that's not idiomatic.
Part of the problem here is that the program may rely on subtleties that make the translation hard. I've seen experienced C or C++ people come into the Rust IRC and ask why certain things aren't allowed; it's often due to their code relying on certain aspects of the looser model.
That said, it's very possible to go unsafe first and then refactor. It just might not be easy.
Another way of making the transition is to pick a component and only re-write it. This is the approach Firefox is taking, for example.
It's probably going to be a hard row to hoe to get any code that doesn't always visibly break out of circulation. That said, I think that's why there's huge value in moving to languages where you do break things more visibly.
Ultimately it comes down to time, and "It looks like it works but might do something wrong sometime" is a weaker argument to delay release than "It will not compile/run".
Required reading here is Chris Lattner's "What Every C/C++ Programmer Should Know About Undefined Behavior", which goes into good detail as to why undefined behavior is important for performance, sometimes very important: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...
Could you make a new language that doesn't depend on UB for performance? Sure, quite possibly. Is C that language? Probably not.
How far are we from having the will and ability for it to be common to use safer languages than C/C++/Java/etc.?
A long time ago, it didn't seem pragmatic to write in, say, ocaml, because no one was going to feel like building or even installing the compiler in order to try your dumb software, but now that everything is about packages from distributions...
Today it feels strange to apply security updates for necessary software where the announcements list "Multiple memory safety errors, integer overflows, use-after-frees..."
It's free for non-commercial use. I've used it for several months now to build things like Tor. I haven't noticed any disadvantage compared to GCC or Clang.
You beat me to it. Unlike the other commenters, I see that you're describing it as a robust, predictable start to what Bernstein describes. Undefined behavior and other problems can be caught by extensions to the compiler or via static analysis tools like an Astree Analyzer knockoff. The fact that they'll be written in Ocaml (or SML) will make them more reliable. A good choice and what I've promoted here so far.
Additionally, the clean passes are easier to bootstrap for those that fear subversion of their compiler. Can use whatever you want on whatever machine to implement it. So long as code matches spec, the output should be the same. That implementation compiles the other compilers, verifiers, whatever.
You might also enjoy that people have already extended CompCert work and others for developments that push in the Boring, Safe/Secure, C compiler direction. Here's some:
So, lots of good stuff being done by a subset that focus on right design methods with the right tools. And they're getting incredible results. Surprise! :)
It's a verified subset of the C standard rather than all behaviors of all compilers. There's some semantics and work covering a lot of that already that just needs to be integrated into CompCert or static analyzers. Likewise, there's static analyzers that already cover it too but not formally verified.
Here's an example of investigating that with Compcert:
It sounds as if the root of the problem is the C standard. Wouldn't it be better to fix that instead of creating a shadow standard defined by the implementation of this boring compiler?
I also object to the presupposition that there is secure software and non-secure software. In most practical cases, it is impossibly hard to tell in which class an application falls, so it would be better not to make that distinction at all.
>It sounds as if the root of the problem is the C standard. Wouldn't it be better to fix that
Isn't the reason for a lot of undefined behavior either performance (through optimization and assumptions), portability (the implementation can just do whatever the underlying platform natively does), or that the committee just couldn't agree on it?
All three of those scenarios appear inherently unfixable to me without a separate sub-standard that intentionally clamps down on performance/ease of portability and actually gets the committee to agree.
I don't know the reasons for why so many things are undefined in C. However, if a program does one thing when compiled with compiler A and another thing when compiled with compiler B, how does that help portability? To my perhaps naive mind this makes portability harder to achieve not easier.
A classic example of trying to define behaviour regardless of what all the different hardware does is Java choosing IEEE754 floating point. Which isn't natively supported on x86.
>Java choosing IEEE754 floating point. Which isn't natively supported on x86.
IEEE754 was based on the design of the 8087, and it's been native on x86 since the 80387.
The problem with Java choosing IEEE754 was, according to the paper you linked, not that it wasn't supported on all hardware, but that Java didn't follow the spec completely,
"Later we shall see why Java’s expanded market would be served better by actual conformity to the letter and spirit of IEEE Standard 754"
The 80-bit floats were still in use right through the 2000s; I know because I had to work out what option to use to get reproducible fp rounding on 64-bit doubles in 32-bit C. I've not had occasion to look at the details of amd64 fp.
The point is you are supposed to avoid undefined behavior, precisely because these things can differ between compilers. In theory, this lets compilers optimize better for a specific platform/CPU while avoiding changing the meaning of the programmer's code. In practice, most platforms and compilers have done the same exact thing with the same "undefined" behavior, such that a large body of code relies on that de facto standard.
"In practice, most platforms and compilers have done the same exact thing with the same "undefined" behavior, that a large body of code relies on that defacto standard."
All compilers on the same platform have done the same exact thing with the same "undefined" behavior.
At the moment, all platforms are x86. Previously, all platforms were SPARC, and before that, VAX.
Then you require that either the programmer checks whether the pointer is null, or the compiler can infer that it isn't null, or you add an annotation for that (either an annotation on a method parameter requiring it not to be null, or an annotation that marks it as nullable, or an annotation that disables these warnings).
The real problem is when compilers make assumptions based on such uses. For example, optimizing C compilers will see that, and assume that p must not be NULL, because otherwise it would be undefined. That can lead to dangerous results. John Regehr is a computer science professor at the University of Utah, and has written about this extensively: http://blog.regehr.org/archives/213
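The canonical shape of the problem Regehr describes, as a sketch (my example, not his):

/* Because *p is read before the test, an optimizing compiler may infer
   that p cannot be NULL and silently delete the check below. */
int read_flag(const int *p) {
    int value = *p;       /* dereference: compiler assumes p != NULL */
    if (p == NULL)        /* ...so this guard may be optimized away */
        return -1;
    return value;
}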
If you look at the comment history, you'll find I have recently argued why you don't want warnings for these kinds of things. But I agree with Regehr et al.'s proposal for a Friendly C, http://blog.regehr.org/archives/1180, which would disallow optimizing way those null-pointer checks. Any attempt for myself to justify why would just restate, poorly, what Regehr and company already state in that post and the ones linked to from it.
I think it's worth noting that some critical code, such as the Linux kernel, already opts out of such optimizations because they've been bitten by it in the past. My previous post talks about that.
> It sounds as if the root of the problem is the C standard.
Some of it is indeed a problem with the standard. There are other bits though where portability or optimization needs kind of require some level of "undefined" behaviour for the standard. There really are some bits of code that perhaps can't be as portable and/or as efficient if you code for safety.
> I also object to the presupposition that there is secure software and non-secure software.
I'm with you 100% here. Especially without a more precise definition of what we mean by "secure".
> In most practical cases, it is impossibly hard to tell in which class an application falls, so it would be better not to make that distinction at all.
Unfortunately, it seems that it's really quite easy to tell. Unless you have overwhelming evidence [1] that says the software is "secure", it's almost certainly going to turn out to be insecure.
Writing secure software in C is just next to impossible. Memory corruption, arbitrary pointers, deep magic. Even pure assembly is more secure, as everything is explicit and well-defined there. There are lots of safer languages than C, and performance is not everything.
What did you think about Modula-3? I like how it had much of what a C++ programmer might want in a cleaner, safer, faster-to-compile way. Most Modula and Oberon derived languages had little to no undefined behavior outside the Unsafe modules. All were used to write OS's & other low-level code. Optional GC where it was tolerable. IIRC, Modula-3 also had verification done on its standard modules to catch certain errors.
Some of the design attributes might be worth resurrecting in another project. Minus the capitalized keywords. :)
I wrote much of an OS in Modula. I wrote robotics code in Modula-2. I didn't do much with Modula-3, but I knew some of the people behind it at DEC SRC, before Compaq bought DEC and shut down research.
I wish computing had gone that way, instead of C/C++. Everything would be less buggy.
One of few I enjoyed reading on new language work. The design goals are sensible, too, given Wirth languages have already achieved them. I'm still for ditching the uppercase and letting editors take care of that part. Don't want carpal pinky syndrome hitting shift or caps lock all the time haha.
Anyway, it's good news seeing Modula-2 get revived, and with an M2-to-C compiler in the works, given it's my favorite C alternative. It has a nice compromise between their simple, low-level, high-performance needs and my preference for something safe, easier to integrate, maintainable, and fast to compile. Your thoughts?
I'm not sure where you got the impression that "Rust is picking up where C++ left off". Rust is a far, far, far smaller language than C++, and has benefited from hindsight in its design in a way that C++ cannot without massively breaking backwards compatibility.
It is not dead. You can still take FreePascal and write some security conscious code with it, to be linked with other parts of your application. The compiler works.
There is also Lua. And Haskell and Ocaml compilers for FP people. These languages are easily linkable with C/C++ codebases.
Then how do you explain the fact that some of the most secure pieces of software ever written were written in C? (I'm thinking of, e.g., Qmail — and pretty much everything else DJB has ever written — and OpenBSD, just to name a couple off the top of my head.)
I would also qualify your statement that "performance is not everything" to say that performance isn't always everything. Writing a kernel in an interpreted language probably isn't the best idea, for example.
> Then how do you explain the fact that some of the most secure pieces of software ever written were written in C? (I'm thinking of, e.g., Qmail — and pretty much everything else DJB has ever written — and OpenBSD, just to name a couple off the top of my head.)
Unusual amounts of manual effort by unusually skilled people, probably using unusually security-focused development processes?
Exactly. Freak occurrences never disprove the rule. Most were simple, too, to the point that it's easier to get right. Large ones like OpenBSD had plenty of bugs during the development process. Many problems in small and large are prevented with a language that doesn't introduce them.
Rest has to be caught with other methods. A good language can still help, though, if it's easier to analyze.
Because C has a lot of momentum, and interoperability, thus lots of security software was written.
This software isn't secure because of C but despite it, and has had a long list of security vulnerabilities that a safer language would have protected us from.
I'm very interested to know what point you believe you are making here. I ask because it is fairly clear you aren't addressing the topic of the thread and I wonder if you believe you are.
OpenBSD is mostly famous for fixing tons of security issues in a large C codebase, if anything OpenBSD demonstrates the problem and how hard it is to fix it. And DJB wrote the OP, so he's obviously not pleased with the current situation.
Qmail is less bad than most (and much less bad than Sendmail), but it's worth noting the author dismissed a known remote attack involving array lengths overflowing 32 bits. He's aware 64 bit machines exist yet assumes nobody would ever give his server enough RAM to allow exploiting this, which I find surprisingly reckless given his reputation.
Most of us know several languages in which this bug would have been impossible (using either bignums or bound-checked arrays), yet here we are.
> Writing secure software in C is just next to impossible.
This is true. And it was true when I learned C, and was told almost the same exact thing on the second day of class. In 1992.
Maybe eventually we'll collectively move to something safer, but at this point I've stopped holding my breath. DJB's proposal seems pretty reasonable in the interim.
C is only as secure as the effort developers put into making it so. A fantastic amount of dangerous stuff currently runs on C. C has several sub-standards, such as MISRA and JSF, to aid developers in writing safe C, eliminating arguments about what counts as "well-defined."
A good start is bounds checking and following MISRA. Languages like Rust make it cheaper, but security never comes for free.
OTOH a heavily restricted C, closer to assembly than to Rust, could still be useful: it would be close to the metal but still preserve portability. Making such a language a strict subset of C would additionally help early porting efforts.
Ada can be amazingly efficient. It just isn't cool. I've often wondered what would happen if someone pulled a reskinning of the language (a la VB.NET >> C#).
I considered making a variant of Go or Pascal that enforced Ada's safety properties for common operations by default without anything else in it. Turn it off on a per-module basis still with type or interface checks. Result might get a bit more uptake.
If you want to have a language be a "C killer" for secure software you need to put a lot of emphasis on simplicity and portability, which are not strengths of Rust.
I think that if you get rid of raw pointers and type casts, you are able to avoid most undefined behavior with dynamic checks: bounds checking, tagged unions, etc. The notable exception is concurrency bugs, which are hard to avoid without a restrictive static analyzer, but I think there is an argument to be made that perhaps the kind of code Mr. Bernstein wants to write is better off with "boring" sequential algorithms.
Just that Rust is limited to the architectures Clang supports (I don't think we will see people writing competing Rust implementations on their weekends like it's possible to do with C). Clang can generate code for lots of architectures, but it's still a limitation when compared to C's universality. :)
It's actually LLVM, not Clang (which also uses LLVM).
Yes, it would be a tremendous amount of work to write a Rust compiler from scratch, but you can also write a LLVM backend.
Not in a weekend, but it's possible. Bonus points for having both rustc and Clang work with it.
Alternatively, once the MIR (Mid-level rust IR) work is done, you will be able to write your own lowering code from that to, say, PIC assembly.
It won't be optimized and you'll probably output a lot of code (compared to a fully featured GCC backend at -Os), but you would be able to do it in a weekend.
Cool, that's fair. Hopefully someday we will get compilers for more esoteric architectures; we're trying to not tie behavior to LLVM specifically where it doesn't make sense to.
Nah, that still won't get you anywhere. There's been a ton of C extensions and variants that got no uptake or development despite easy porting. Cyclone comes to mind. C community has refused to adopt safer, predictable approaches to things going back decades. Also, as with most, can't break backward compatibility with any apps doing things in a stupid or dangerous way.
So, it isn't happening. Literally need to focus on the people already considering switching from C instead of its fans or heavy users. Plus, the next project should consider seamless compatibility with C for piecemeal rewrites.
Please define which assembly you are talking about. Older CPUs didn't trap on undefined opcodes (I'm talking things like the 8080, Z80, 6502, 6809, etc). And even on a more modern CPU (say, the x86 line) I see that the condition flags after a DIV are all undefined (Z, C, S, O, all undefined; also, MUL also leaves the flags undefined).
But yes, assembly languages tend to be more defined than C.
You can deal with timing attacks using Kemmerer's Shared Resource Matrix to find those and other covert channels. Then, you just look at the bandwidth of them along with design alternatives that don't leak timing. You can sometimes make the timing predictable or non-data-driven at interface level to nullify that. An example comes from military-grade stuff way back that transmitted in fixed-length, fixed-rate packets. Markus Ottela added that to Tinfoil Chat at my recommendation.
Far as correctness proofs, a combo of Ada 2012 with Design-by-Contract and SPARK is probably best at that. Far as app level, just program in a functional style with abstract state machines. Make it more concrete and detailed, replicating analysis or tests at each level. Eventually, you run the code checkers on the result and throw in some fuzz testing.
No easy route, though, that I'm aware. Highly assured software and system design is a painstaking process. Avoiding risky stuff, defining all success/fail traces, failing safe, and static analysis to preempt bugs are the strongest route. Plus source-to-object correspondence to be sure compiler didn't eliminate safety checks or introduce problems. Spark Pro supports that.
" Claim that all we need is for some particular "undefined"-catching tool to be widely used. In fact, these tools are full of false positives and false negatives; at best they catch a few limited types of "undefined" behavior, not changing the big picture."
Question: why can't existing C compilers clearly indicate when they encounter undefined behavior?
Modern compilers/analyzers can and will report whatever statically detectable UB they can, but most interesting cases are triggered dynamically. e.g.:
int data[64];

int foo(int bar) {
    return data[bar];
}
If foo() has external linkage, how does the compiler ensure it's never called with a value <0 or >63?
And that's an easy case.
clang/llvm can instrument code to detect some forms of UB and trigger a diagnostic at runtime, but these checks can be expensive (and you still have a runtime failure, rather than a compile time detection).
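Roughly what such instrumentation amounts to for foo() above (a hand-written sketch, not the actual generated code; __builtin_trap is a GCC/Clang extension):

extern int data[64];   /* the array from the example above */

int foo_instrumented(int bar) {
    if (bar < 0 || bar > 63)
        __builtin_trap();   /* a sanitizer would print a diagnostic here */
    return data[bar];
}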
Well, there is undefined behavior like this, and there is other undefined behavior. In this case, there is a semi-sensible thing the code could do: if you ask for foo(66), return the data at 2*sizeof(int) beyond the end of `data`. If that region of memory is protected, print a message and end the program (so, crash).
The other situation is when the compiler assumes there is no UB, and applies some optimizations that break code that otherwise did what you wanted it to do. GCC is notorious for implementing a lot of these optimizations, almost out of spite (because you deserve it if you write faulty code...). Aliasing variables/pointers with a different type, reinterpreting memory as a different type (type punning), certain null checks (e.g. if (this == null)), and so on... All of these can be detected at compile time, and there is a sensible "do-as-I-mean" thing that you can do, or a language-lawyer unexpected thing (that of course might allow correct programs to be faster).
[EDIT: more clearly distinguished between undefined and unspecified behavior.]
> Question: why can't existing C compilers clearly indicate when they encounter undefined behavior?
Why can't existing programs clearly indicate when they encounter unexpected input?
Not every instance of undefined behavior has explicit handling in a compiler; sometimes, undefined behavior occurs as part of the implicit functioning of the compiler implementation. The compiler does not need to consider undefined behavior at all. It doesn't necessarily have an explicit case for "undefined, choose a behavior"; it can simply act as though undefined behavior does not exist, and have the behavior in that case depend on whatever the code written to handle defined behavior happens to do in the undefined case.
[EDIT: This example is unspecified behavior, not undefined behavior. The example does still illustrate what I meant by behavior that arises out of the compiler implementation without explicit notice by the compiler.]
For instance, if you write f(g(), h()), C doesn't specify whether g or h gets called first. The compiler doesn't explicitly parse that whole expression and internally note "oh, you have unspecified behavior here, I'll silently choose an interpretation and not tell you about it". The compiler just parses the expression and breaks it down into sequential operations, implicitly choosing an interpretation based on its parsing, evaluation, and optimization. The compiler might evaluate g() first because it parses left-to-right, or h() first because it parses right-to-left, or h() first because h() was an inline function and it expands those first...
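To make that concrete, a small illustration (my example, not the poster's):

#include <stdio.h>

static int g(void) { puts("g"); return 1; }
static int h(void) { puts("h"); return 2; }
static void f(int a, int b) { printf("%d %d\n", a, b); }

int main(void) {
    /* A conforming compiler may print "g" then "h" or "h" then "g" before
       "1 2"; it need not document or even be consistent about the choice. */
    f(g(), h());
    return 0;
}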
A compiler explicitly designed to flag undefined behaviors would need to have additional logic to notice undefined behaviors; you couldn't just take an existing compiler and find all the places where it explicitly notices undefined behavior and chooses a way to handle it.
Existing compilers have started to add warnings for cases where they explicitly exploit undefined behavior for optimization, such as assuming a lack of signed overflow, or assuming that you never dereference a NULL pointer. But retrofitting a compiler to flag cases where it currently just doesn't even consider the undefined behavior would require far more work.
Everything not defined in the spec represents undefined behavior, not just the things explicitly disclaimed as undefined. Not every instance of undefined behavior has explicit handling in a compiler; sometimes, undefined behavior occurs as part of the implicit functioning of the compiler implementation.
The C and C++ standards make clear separations between "undefined behavior", "unspecified behavior", and "implementation defined behavior".
Order of argument evaluation is not undefined behavior, it's unspecified behavior. A standard-compliant implementation can choose any order it wants in any instance, it doesn't have to document this behavior, but it does have to produce output consistent with some ordering. It can't erase your hard disk or launch the nuclear missiles, which it could do if it encountered undefined behavior.
You're right, sorry. I should have looked up that specific example before using it, and used an example that fell under "undefined" behavior.
My point, though, was that the compiler doesn't have to explicitly notice the program using undefined behavior and choose a behavior. Rather, the compiler may simply not consider undefined behavior at all, act as though only defined behaviors can occur, and have undefined behavior be whatever happens due to the lack of explicit handling. The compiler can emit code that handles the defined behavior, and whatever that code happens to do in other cases will be the compiler's "handling" of undefined behavior.
Thus, making the compiler explicitly flag undefined behavior isn't just a matter of finding all the places where the compiler explicitly handles undefined behavior and emitting a warning there. It would require noticing everywhere the compiler simply doesn't consider cases the spec doesn't require it to consider.
There's a blog post[0] from Chris Lattner from a few years back that touches on this.
The summary is: "People often ask why the compiler doesn't produce warnings when it is taking advantage of undefined behavior to do an optimization, since any such case might actually be a bug in the user code. The challenges with this approach are that it is 1) likely to generate far too many warnings to be useful - because these optimizations kick in all the time when there is no bug, 2) it is really tricky to generate these warnings only when people want them, and 3) we have no good way to express (to the user) how a series of optimizations combined to expose the opportunity being optimized."
Undefined behavior is primarily something that has effects at run time, rather than something a compiler can reliably catch at compile time. For example, taking the value of an uninitialized local variable:
    #include <stdio.h>

    int some_big_complicated_function(void);   /* defined elsewhere */

    int main(void) {
        int i;
        if (some_big_complicated_function()) {
            i = 0;
        }
        printf("%d\n", i);   /* reads i uninitialized if the call returned 0 */
    }
Is this undefined behavior? The compiler can't be sure, because it can't know whether `some_big_complicated_function()` returns true or not.
Therefore, it would need to be a runtime check, rather than a compile time check. But runtime checks are expensive, and so it typically isn't done. It would be entirely allowed by the standard to give such a warning at runtime, but the resulting program would go much, much slower due to all the checks.
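To make the cost concrete, here is a rough, hand-written sketch of what such compiler-inserted instrumentation could amount to for the example above (real tools such as clang's MemorySanitizer track this with shadow memory rather than a per-variable flag):

    #include <stdio.h>
    #include <stdlib.h>

    int some_big_complicated_function(void);

    int main(void) {
        int i;
        int i_initialized = 0;               /* shadow flag for i */

        if (some_big_complicated_function()) {
            i = 0;
            i_initialized = 1;
        }

        if (!i_initialized) {                /* check before every read of i */
            fprintf(stderr, "use of uninitialized variable 'i'\n");
            abort();
        }
        printf("%d\n", i);
    }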
What you say is true, but this case could easily trigger a warning about possible use before assignment. The compiler doesn't have to (and can't) prove that either branch will always be taken at runtime. And there's no benefit to code like this anyway; just write it so the variable is obviously assigned. Your coworkers will thank you.
And really, the main benefit to not assigning a value in the declaration is to get the compiler to tell you if there is a code path that misses an assignment.
    #include <stdio.h>

    /* wrapper added so the fragment compiles; foo/bar/baz/quux are as in the original */
    void example(int foo, int bar, int baz, int quux) {
        int i;
        if (foo || bar)
            i = 1;
        else if (baz) {
            if (!quux)
                goto fail;
            i = 2;
        }
        printf("%d", i);   /* not all code paths assign a value */
    fail:
        return;
    }
That's shoddy programming - it obviously contains an error.
As someone who compiles with "-Wall -Wextra -Werror", I believe that this example program shouldn't even be allowed to compile, even though the standard allows it.
The variable should really be initialized explicitly, which shouldn't be expensive at runtime.
int i = 1; /* or whatever */
Alternately, if the programmer intended to only print when some_big_complicated_function() return true - which is a strong possibility because they only wrote a value to "i" for that case - then the bug is in the position of the printf and possibly the variable definition, if you're using >=C99.
    #include <stdlib.h>
    #include <stdio.h>

    int some_big_complicated_function(void);

    int main() {
        if (some_big_complicated_function()) {
            int i = 0;
            printf("%d\n", i);
        }
        return EXIT_SUCCESS;
    }
> the resulting program would go much, much slower due to all the checks.
That depends on a lot. It can be (very) significant if you're talking about the middle of a hot loop; in the given example that runs once per main(), the difference is negligible.
For better examples and a discussion of why it can be very hard to issue useful warnings when optimizing around undefined behavior, see LLVM's series of articles, which should be mandatory reading for anybody working with C/C++.
This would certainly be worth a warning (at some warning level), exactly because of what you say -- the compiler is not sure that the behavior is fully defined.
I wasn't sure whether or not it would print out a warning, so I tested it. With g++ 4.8.4, there is no warning printed, even with `-Wall -Wextra -pedantic`.
I believe in many cases in C, you would need to insert runtime checks (and I believe gcc/clang has a flag for inserting these when statically checking for it is infeasible).
One such behavior which is considered undefined is signed integer overflow/underflow. Without this being undefined, some important loop optimizations would not be possible [1]. The blog link below has lots of information about undefined behavior from the perspective of a compiler implementer.
While it is possible (in theory) to catch some types of undefined behavior, some of it can be very subtle, and it shows up in the C Standard far more often than most people think.
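A commonly cited sketch of the loop case (my own example, in the spirit of the compiler-writer posts referenced above): because signed overflow is undefined, the compiler may assume the counter never wraps, so it knows the trip count and is free to widen the induction variable or vectorize; with an unsigned counter, wrap-around is defined and that assumption is unavailable.

    long sum_first(int n, const int *a) {
        long sum = 0;
        /* Since i cannot legally overflow, the compiler may assume this loop
           terminates and runs exactly n + 1 times (for n >= 0). */
        for (int i = 0; i <= n; i++)
            sum += a[i];
        return sum;
    }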
Undefined behavior is the end result of conflicting standards; the usual human response to a pile of confusing standards is to make yet another new, soon-to-be-confusing standard (see XKCD for the graphical interpretation). And that new standard both contains new undefinitions and, through its impact on existing standards, creates whole new waves of new undefinitions.
OpenBSD (+LibreSSL) folks have been uncomfortable with unpredictable compiler optimizations for quite a while too. Here's a previous discussion on compilers in OpenBSD: https://news.ycombinator.com/item?id=9322259
It has always seemed to me that the C community (and to an even greater extent the C++ community) takes particular pride in making it as difficult as possible to write correct code. It provides bragging rights and job security for those select few who have taken the time to master all of the arcana of the language.
Personally, I tend to think that life is too short for that sort of thing. That's why I choose Lisp, at least when it's up to me.
Surprised that, being a LISPer, you haven't read Gabriel's Worse is Better essays on the topic. They convincingly argue that the reason behind the uptake of certain tech is that it does just enough of a good job to work for a lot of people. Then they get extended gradually in direction of what they should've been. C and UNIX were perfect examples of this on the crappy hardware of the 70s-80s. By the 90s, it was the legacy effect in play for various reasons. Still is.
Throw in organizations competing for money or ideology to get a bunch of implementations of a language that was shit to begin with. So, it was more the opposite of what you claim: too many groups trying to make things convenient for themselves in the short-run. LISPers did something similar with their individual messes getting merged into Common LISP. Those doing it clean and sensible on the right hardware were always the outliers.
I have read "Worse is better." The point I was trying to make was different:
> Then they get extended gradually in direction of what they should've been.
Yes, this is the part I'm disputing. It seems to me that C and C++ are not being extended in "the direction they should have been" but rather in a direction that is beneficial mainly to a small incumbent minority. Everyone agrees that security is important, and yet it is virtually impossible to write secure code in C or C++, not because it would violate the laws of physics or some mathematical theorem, but because both the standards and the implementations are hopelessly brain-damaged.
"but rather in a direction that is beneficial mainly to a small incumbent minority."
The majority of C programmers seem to fight changes to the permissive way of doing things in the name of efficiency and compatibility. I don't know what the interplay between them, compiler writers, and standards bodies is. Yet it seems C programmers react pretty negatively to anything changing their language, even here. Just imagine how much of UNIX or GCC might break if we did length-prefixed strings, a reversed stack, compiler-only pointer arithmetic, auto-bounds checks, and so on. It would all just... COLLAPSE.
"yet it is virtually impossible to write secure code in C or C++, not because it would violate the laws of physics or some mathematical theorem, but because both the standards and the implementations are hopelessly brain-damaged."
Totally agree there that this is what resulted. Recently, on a new platform, I needed to code up something that C programmers could read, and I can't remember C worth crap. So, I decided to re-learn FreeBASIC, because even it is easier to get right than C, and I straight up can't stand C. Coding style wasn't... modern... but it worked, was type-safe, iterated fast, ran in milliseconds, and was very readable. Stuff I can never say collectively about C without state-of-the-art tooling on a 4-8 core machine with RAM disks. ;)
Seeing that you're so cocksure, you wouldn't have any problems substantiating your claims, right?
C++ has made huge progress in writing safer, more reliable code with the C++11 standard, and these topics are an important focus point for the coming standards.
I'm not cocksure. That's why I hedged with "seems to me", i.e. this is my opinion.
However, to substantiate my claim, at least for C, I only have to point back to the original article.
It's possible that things have improved with C++11, I don't know. However, I would think that if C++11 were suitable for writing secure code, that DJB would be aware of this and just say so instead of advocating for a new C compiler.
He's not asking for a -Wundefined or -ansi -pedantic; he's asking for definitions for everything -ansi -pedantic rejects, and a new compiler. Hard work. Which of those cases are that desirable?
I think the biggest problem with undefined behavior in C is that it is taken to justify violation of C's customary semantics[1] on platforms where it isn't necessary.
Many types of undefined behavior in the C standard are essentially dispensations for Lisp machines or one's complement CPUs. It's very surprising to see them invoked by the optimizer on more conventional platforms. It might not satisfy djb, but a standard for these behaviors alone would be great progress.[2]
[2] Posix doesn't really address this, sadly. The committee seems to have avoided standardizing byte size until an unforeseen interaction with a later C standard pushed it into a corner.
I think the problem is not the language standard per se; it's the education and historically sloppy compilers. It's a bit like W3Schools for JavaScript teaching everybody all the worst practices; that's not a problem with the JavaScript language itself.
I mean really, what are the most common cases of undefined behavior that cause real problems in the world? (Reading past array bounds or trying to dereference uninitialized variables are such obvious errors that I don't categorize them as 'unexpected undefined behavior'; the latter is caught by most compilers and static analyzers today anyway. Though I agree having the length known in arrays would greatly help.)
Any time I see some raging blog post about how undefined behavior ruined someone's day, it's 99% of the time because they were, one way or another, trying to cast a pointer of one type to a pointer of another type -- like thinking they could read the first 16 bits of an int by casting an int* to a short* first. The root cause is that people think pointers are addresses into a sequential grid of NAND gates on your DDR4 stick, when they are in fact abstract references to a variable that can carry a value. This comes from school, where the way teachers show you how pointers work is exactly like that. Tutorials on the internet also say that you can do smart tricks with pointers just to teach you, and then people think that's how you should do it. When compiling in debug mode this is how the memory is usually aligned, so things still work anyway.
Historically, compilers didn't warn about this either, and things still worked, so many people got used to writing things this way; today those people are the "experienced C gurus" in companies, passing down this misbehavior to their younger peers.
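As an illustration of the kind of cast being described, and the well-defined alternative (variable names are mine):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t x = 0x11223344;

        /* Undefined behavior: reading a uint32_t object through a pointer to
           an incompatible type violates the aliasing rules.
           uint16_t low = *(uint16_t *)&x;                                     */

        /* Well-defined alternative: copy the bytes you actually want. Which
           half you get still depends on the machine's endianness.             */
        uint16_t low;
        memcpy(&low, &x, sizeof low);

        printf("0x%04x\n", (unsigned)low);
        return 0;
    }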
Overflow on signed arithmetic is another of the common pitfalls, but that's life if you code for embedded. The alternative that "just works" would be to do all arithmetic in a BigInt type, as Python does, but that clearly isn't viable on embedded; doing everything in floating point has its own pitfalls.
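One common pattern for avoiding the overflow UB without reaching for BigInt or floating point is to test the bounds before doing the arithmetic; a minimal sketch:

    #include <limits.h>

    /* Returns 1 and stores a + b in *out if the sum fits in an int;
       returns 0 (leaving *out untouched) if it would overflow. */
    int checked_add(int a, int b, int *out) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            return 0;
        *out = a + b;
        return 1;
    }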
- Remove all possibilities of memory aliasing. Only exception might be when reading from devices/network (some kind of boxing, unboxing)
- All array accesses are bounds checked.
- Checked conversion between int/unsigned int and different sized items. Converting a char into an unsigned int? Be explicit in what you want to do
- No pointers "flying around", unbound to types. Have reference-counted objects at least. Allow a way of passing a sub-array without copying/with read only permissions.
- Remove "gotchas", like ordering of statements, assignment on comparison (unless it's obviously correct), no "compiler dependent" behaviour, no "undefined behaviour" (unless it's machine specific)
You are confusing unsafe behavior with undefined behavior. He is not suggesting that a C compiler should prevent the above mistakes. He's suggesting that it should document what code the compiler will emit in such cases. For example,
- reading past the end of an array will always segfault the program immediately. or reading past the end of an array will return (arrayType)0
- signed integer types must overflow according to two's complement. A compiler may not assume that, for instance, (int)x < (int)x + 1 is always true (see the sketch after this list).
- uninitialized variables are zeroed out. That is, `int *x;` always produces a null pointer, `char y` always produces '\0'
Such a compiler would accept and compile any C program, and the programmer (and any static analysis tools) would have a better idea of the behavior of the output program.
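As a rough sketch of what the overflow rule above would buy (gcc and clang already offer something close to it today through -fwrapv, which makes signed overflow wrap):

    #include <limits.h>
    #include <stdio.h>

    /* Under the standard, signed overflow is undefined, so a compiler may
       fold this function to "return 1". Under wrapping two's-complement
       overflow it must return 0 when x == INT_MAX. */
    int greater_after_increment(int x) {
        return x + 1 > x;
    }

    int main(void) {
        printf("%d\n", greater_after_increment(INT_MAX));
        return 0;
    }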
> reading past the end of an array will always segfault the program immediately. or reading past the end of an array will return (arrayType)0
How would this work when you pass the array (i.e., the pointer) to a function? In order to do what you describe, every pointer would need to include bounds information. This would break the C ABI, so your code would no longer link with existing C libraries.
The OP answered this elsewhere in the thread. Pointer bounds are stored in a lookup table; pointers are passed to functions according to the C ABI, and bounds are retrieved by querying the lookup table with the pointer.
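A very rough sketch of that side-table idea (entirely my own illustration, not the OP's actual design; a real implementation would need compiler-emitted checks and a much faster lookup structure):

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct bounds { void *base; size_t size; };

    #define MAX_TRACKED 1024
    static struct bounds table[MAX_TRACKED];
    static size_t tracked;

    /* Allocation records its bounds; the pointer itself stays a plain pointer,
       so the C ABI is unchanged. */
    void *tracked_malloc(size_t size) {
        void *p = malloc(size);
        if (p && tracked < MAX_TRACKED)
            table[tracked++] = (struct bounds){ p, size };
        return p;
    }

    /* What compiler-inserted instrumentation would call before a dereference.
       Pointers this sketch doesn't know about are simply not checked. */
    void check_access(void *base, void *addr, size_t width) {
        for (size_t i = 0; i < tracked; i++) {
            if (table[i].base != base)
                continue;
            char *lo = table[i].base;
            char *p  = addr;
            if (p < lo || p + width > lo + table[i].size) {
                fprintf(stderr, "out-of-bounds access\n");
                abort();
            }
            return;
        }
    }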
C makes you take the needed metadata to other parts of the program manually.
It's not about being low-level, it's about having "pretend types" (like string or array) that are only actually a pointer with a tiny bit of syntactic sugar.
For instance, getting rid of unspecified order of evaluation. Or defining a dialect of ISO C in which evaluation is strictly left to right, and making support for that optional (so then your local GCC or whatever can have -fstrict-iso-eval to turn it on).
Another example of "small stuff": catching negative values being passed into the <ctype.h> functions. Also, all the instances in the standard library where a null pointer invokes undefined behavior could be required to abort the program.
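The <ctype.h> case in a nutshell (a small sketch; the byte value is arbitrary):

    #include <ctype.h>
    #include <stdio.h>

    int main(void) {
        char c = (char)0xE9;   /* 'é' in Latin-1; negative if plain char is signed */

        /* Undefined behavior: the ctype functions require an argument that is
           representable as unsigned char or equal to EOF -- a negative char
           is neither.
           isalpha(c);                                                          */

        /* The usual fix: convert through unsigned char first. */
        if (isalpha((unsigned char)c))
            printf("alphabetic\n");
        return 0;
    }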
Some undefined behaviors cannot be made defined without a lot of overhead (like catching all out of bounds pointers).
However, there is a lot of "low hanging fruit" undefined behavior where things can be improved.
One of the bases for security is the age of a program. All other things being equal, the implementation that has been around the longest is expected to have fewer issues than the shiny, new thing (because the old program is well-understood, has been patched to death, considers a thousand corner cases, ran on more hardware, more time passed without any issues, etc.).
As such, changing away from well-known compilers and language behaviors may not be the wisest thing for security.
On the other hand, extending existing compilers to restrict behavior seems reasonable. Also, by all means document the hell out of the things that are not well-defined.
I believe CompCert still translates programs with undefined behavior and doesn't really do anything about it; or rather, the behavior is defined, but only as whatever the CompCert authors define it to be.
> Claim that a boring C compiler can't possibly support the desired system _performance_. Even if this were true (which I very much doubt), why would it be more important than system _correctness_?
If correctness is more important than performance, C is the wrong language. Consider a language with, at the very least, garbage-collection built into the core, as memory allocation and usage errors are the base of a lot of what goes badly wrong with C programs.
In short, his notion of "Boring C" seems to be Haskell or Common Lisp.
I'm not sure that either of those languages are usually GC'd, but they'd be good choices, too, if they were.
(I just don't like Wirth languages. I don't like their type systems and I don't like the definition-before-use requirement. They're better than C, in some respects, but worse in others, and the net isn't always an improvement.)
All of Wirth's languages after Modula-2 were GC'd. C programmers would reject a GC'd language, though. So, I stayed with the originals that enhanced safety, predictability, readability, and so on while remaining capable of low-level, fast stuff. Wirth's stuff and industrial variants of it often let you do unsafe things in dedicated modules, so you could turn off GC or checks there.
Plus they wrote OS's in Modula-2 and Pascal on ancient hardware. Whole platforms. So, they're proven for the use case C programmers often say you need their language for.
For anyone interested in Modula-2 as a potential replacement for C as a systems language, there is a small but hopefully growing revival of the language occurring on the Gnu Modula-2 mailing list and (especially) on comp.lang.modula2 . Right now there is a call on the newsgroup for testers of the Modula-2 to C (M2C) compiler. This is part of a larger effort for the Modula-2 Revision 2010 (M2 R10) language definition and implementation by Benjamin Kowarsch and Rick Sutcliffe.
Appreciate the tip: that's exciting. Also an ideal time to extend it with improvements learned from experience since Modula-2 and avoid problems learned similarly. I actually downloaded the old Modula2-to-C compiler, Lilith technical paper, and A2 Oberon sources in case any of that ended up being useful. Maybe it will. :)
Whole OSes were written in Lisp, too, as well as other GC'd programming languages (Smalltalk, Mesa, perhaps a few others), and they ran on rather underpowered hardware by modern standards as well.
I get your point, but having to check for allocation failure is a substantial source of bugs in non-trivial programming without GC, and having to manually allocate RAM makes string handling, for example, orders of magnitude less safe than it is in a language with GC.
Let me be clear: could a C programmer understand and effectively use Genera OS without significant, paradigm-altering changes in mindset about programming? Probably not. So, I don't include LISP in a C-replacement discussion given they're in two, different ballparks. And LISP is universally hated by almost all programmers outside its own fans and those Clojure is attracting. That's despite the fact that I promote its advantages elsewhere where they might get take-up. ;)
Now, back to imperative languages with some familiarity to C programmers. For allocation failure, some languages provide exceptions to handle such things, but I usually recommend allocating what you need at the beginning and being sure you know your maximum for later. Do the checks and reserve early so you worry less from then on. Shouldn't be a problem. It's work to keep in mind but straight-forward.
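A trivial sketch of that "allocate up front, check once" style (the arena size is an assumed, application-specific upper bound):

    #include <stdio.h>
    #include <stdlib.h>

    #define ARENA_SIZE (1u << 20)   /* assumed maximum, decided up front */

    static unsigned char *arena;

    int main(void) {
        arena = malloc(ARENA_SIZE);      /* the one allocation, checked once */
        if (!arena) {
            fprintf(stderr, "out of memory at startup\n");
            return EXIT_FAILURE;
        }
        /* ...from here on, carve pieces out of 'arena' instead of calling
           malloc(), so allocation failure cannot occur mid-run... */
        free(arena);
        return EXIT_SUCCESS;
    }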
As far as string handling goes, there are safe libraries and approaches to it. We just use them, with whatever performance or syntax hit we get, if our language doesn't do it inherently safely. The issues with it are why I recommend length-prefixed strings over null-terminated ones for clean-slate stuff, if you can get away with it. Having separate types for the two can highlight interface errors.
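A minimal sketch of the kind of length-prefixed string type being advocated (the names are mine); keeping it a distinct type from char * is what lets interface mix-ups get flagged:

    #include <stdio.h>
    #include <string.h>

    struct pstr {
        size_t len;
        const char *data;   /* not required to be NUL-terminated */
    };

    static struct pstr pstr_from_cstr(const char *s) {
        struct pstr p = { strlen(s), s };
        return p;
    }

    int main(void) {
        struct pstr name = pstr_from_cstr("boringcc");
        printf("%.*s is %zu bytes\n", (int)name.len, name.data, name.len);
        return 0;
    }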
Ada, and recently Rust, for example. People trying to improve C are sad to watch. People use C because they have an irrational belief in their own faultlessness. Same reason people are afraid of flying but not driving. Selling a correctness-enhancing tool to these people is an impossible task.
It's like motorcyclists wanting to convert their ride to something safe that their new family with small children can use for their next vacation... they're thinking of solving the problem at the wrong level. The foundations are fundamentally flawed for what they're trying to do.
But I guess the real problem is that C is too established in resource- and predictability-constrained domains.
What is sad to watch is the binary sizes blowing up and long tail latency due to garbage collection. I think Rust is the only thing to come along in about 30 years that stands a chance.
Rust should work on `arm-none-eabi` to my knowledge. We don't provide packages out of the box, but people have done it. You need to set up the right target specification, and then it should work.
Possible, I don't remember the details. It might also be that setting up a cross-compiling toolchain by hand was not exactly how I intended my first contact with a new language to be (I had set up a cross-compiling GCC, then a cross-compiling Ada that ended up in a Linux simulator; I may have been lazy about the Rust test).
Agreed, another "not quite C": something where you can control what happens when you multiply two integers, something where you can control bit-banding, memory-mapped registers, float behavior, etc. All with sane defaults.
> * Claim that a boring C compiler can't possibly support the desired system _performance_. Even if this were true (which I very much doubt), why would it be more important than system _correctness_?
It can't be taken for granted that performance and correctness are separate concerns. Even leaving aside the obvious category of realtime applications, performance issues can create DoS vulnerabilities.
edit: That being said, a boringcc might end up being less susceptible to these vulnerabilities in practice, since the performance of the emitted code might be more predictable. I'm sure djb has considered this. But in general I think it's often a mistake to think of performance as an implementation detail rather than as a deliberate feature.
Programmers always want new and more clever constructs in their code, and there is always the next feature that will optimize inner loops 10% better. This is human nature apparently, so don't put too much faith in a boring anything gaining traction.
And for those of you courting Rust, remember there are two kinds of programming languages in the world: those that no one uses, and those that are known to be terrible.