I've been using C for over 20 years and I'm sure I would be caught out by these...
...but...
The fact that these curiosities are not an issue in day-to-day work, and that C is (one of) the most popular languages around today, means that they aren't too serious.
When you have knowledge of the hardware and are working at that level day in, day out, issues like this really don't bother you that much.
(I do realise that this is a slightly contrarian view these days, but there is an awful lot of unjustified C-bashing around currently).
The linked paper claims their tool issued at least one warning on 40% of the C or C++ packages, not that they were all valid warnings indicating undefined behavior.
If the compiler writers' union decides that the spec allows for 'format the hard drive on signed overflow', we can always change the spec to 'don't format the hard drive on signed overflow'.
It's not that simple. The compiler writers have caused security issues before in the Linux kernel. And the compiler writers are right: the undefined behavior that they exploit exists for important performance-related reasons.
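To make the kernel example concrete, here is a minimal sketch of the shape of that class of bug (a reconstruction for illustration, with hypothetical names, not the actual kernel code): dereferencing a pointer before checking it lets the compiler assume the pointer is non-null and delete the later check.

    #include <stddef.h>

    struct sock { int priority; };

    /* Illustrative only: the early dereference is UB if sk is NULL, so an
     * optimizing compiler may assume sk != NULL and remove the check below,
     * turning a crash into a potentially exploitable dereference. */
    int get_priority(struct sock *sk) {
        int prio = sk->priority;    /* dereference before the check */
        if (sk == NULL)             /* compiler may delete this branch */
            return -1;
        return prio;
    }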
To say they are right is an overstatement, I think. There is insufficient discussion of the empirically measured performance benefits of specific forms of UB.
Some kinds of UB could be turned into something stricter, like reading a bad pointer. This either traps or returns some value. It won't format your hard drive unless you install a trap that formats your hard drive, but that's none of the spec's business. Traps can happen due to timers, so if arbitrary traps mean UB then every instruction is UB. Even if the spec punts on defining what a trap is, that's a progression over saying it's UB.
Same thing goes for division and modulo. In corner cases, it will either return some value or it will trap. It won't format your hard drive.
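For concreteness, a minimal sketch of the division corner case being referred to (my own example, not from the paper): INT_MIN / -1 cannot be represented, the standard calls it undefined, and on mainstream hardware it either traps or yields some value.

    #include <limits.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        int a = INT_MIN;
        int b = (argc > 1) ? -1 : 1;   /* keep the divisor out of constant folding */
        /* INT_MIN / -1: the true result (INT_MAX + 1) does not fit in an int.
         * The standard says undefined behaviour; on x86 this typically traps
         * with SIGFPE, on some other targets it just returns some value. */
        printf("%d\n", a / b);
        return 0;
    }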
The most profitable "true" UB is stuff like TBAA, but smart people turn that off.
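For readers who don't know the acronym, TBAA is type-based alias analysis: the compiler assumes pointers to different (non-character) types don't refer to the same memory. A minimal sketch of the kind of code it breaks, assuming GCC or Clang, where -fno-strict-aliasing is the switch that turns it off:

    #include <inttypes.h>
    #include <stdio.h>

    /* Classic type punning through an incompatible pointer type. Under
     * strict aliasing the compiler may assume the float store cannot
     * affect the uint32_t read, so this is undefined behaviour; with
     * -fno-strict-aliasing it reads back the bit pattern of 1.0f. */
    static uint32_t bits_of(float *p) {
        *p = 1.0f;
        return *(uint32_t *)p;   /* aliasing violation */
    }

    int main(void) {
        float f;
        printf("0x%08" PRIx32 "\n", bits_of(&f));   /* typically 0x3f800000 */
        return 0;
    }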
Do you know what the performance benefits are of other kinds of UB? Do you know how many of those perf benefits (like maybe being able to take some shortcuts in SROA) can't be solved by changing the compiler (i.e. you'll get the same perf, but the compiler follows slightly different rules)? Maybe I'm not so well read, but I hardly ever hear of empirical evidence that proves the need for UB, only demos that show the existence of an optimisation in some compiler that would fail to kick in if the behaviour was defined.
Also, if there were perf benefits of the really gnarly kinds of UB, I would probably be happy to absorb the loss in most of the code I write. If I added up all of the time I've wasted fixing signed-unsigned comparison bugs and used that time to make WebKit faster, then I'd probably have made WebKit faster by a larger amount than the speed-up that WebKit gets from whatever corner-case optimisation the compiler can do by playing fast and loose with signed ints.
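As a minimal sketch of the signed-unsigned comparison trap mentioned above (not WebKit code, just the usual textbook shape): the signed operand is converted to unsigned, so -1 compares greater than 1.

    #include <stdio.h>

    int main(void) {
        int i = -1;
        unsigned int u = 1;
        /* The usual arithmetic conversions turn i into UINT_MAX, so the
         * comparison is 4294967295 > 1, which is true. Well-defined, but
         * almost never what the author meant. */
        if (i > u)
            printf("-1 > 1u is true\n");
        return 0;
    }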
I suspect that UB is the way that it is because of politics - you can't get everyone to agree what will happen, nobody wants to lose some optimisation that they spent time writing, and so we punt on good semantics.
"...the undefined behavior that they exploit exists for important performance-related reasons..."
That's a good point, but arguably incomplete. It was there to give the compiler writer leeway to implement the semantics in the way that is natural for the platform. It was not intended to play sophistic tricks on the programmer in order to gain a few per cent of performance in some benchmark.
> The fact these curiosities are not an issue in day-to-day work and C is (one of) the most popular languages around today mean that they aren't too serious.
C became one of the most used languages, not necessarily popular, thanks to the adoption of UNIX and the rise of FOSS/C culture (UNIX based) in the late '90s.
Back then it was just yet another systems programming language.
I only started to care about it when I moved from MS-DOS into Windows / UNIX, and even then I was into C++ after a short (1 year) encounter with C.
As for not being an issue, the CVE list shows daily the cost of any programming language that "enjoys" copy-paste compatibility with C semantics.
Or the business opportunity for those that sell tools that help both developers (static analyzers) and users (anti-virus/firewalls) to overcome those shortcomings.
Late '90s is about a decade off, IMO. C was very much the only systems language by then.
I think the last time a popular OS was built on something other than C was the original Mac OS, which had a Pascal API and used Pascal calling conventions.
On the desktop, C was chosen as the API language for Windows and OS/2 around 1986. That meant both Microsoft and IBM agreed that PC software was going to be written in C.
On MS-DOS, Acorn, Amiga and Atari it was just yet another one.
On Mac OS (after the C transition), Windows and OS/2, C might have been the main implementation language, but most of us who couldn't carry on using Turbo/Quick/HiSoft Pascal, Modula-2 or Basic compilers moved to C++ instead.
We could still make use of improved safety and stronger type checking features, while staying compatible with the C toolchains.
EDIT: Also IBM and Microsoft eventually had very good C++ support in the form of SOM, COM, C Set++ and MFC, with Borland providing the very good OWL and VCL.
Also, any Windows 3.x old-timer remembers the message and event handling macros, alongside #define STRICT, that Microsoft used to bring some sanity to Windows programming in straight C.
On Amiga it was the only one of significance, as the API language. Having Lattice helped of course. Pretty much everything else was a toy in the early years.
Intuition was all C; only DOS was BCPL, which made it a pain to do anything with. AMOS was one of the toys I mention - mainly popular with hobbyists and in some bad released games (it never interfaced with the OS, being ST derived, which led to its many compatibility issues). Most games from the houses I knew, or knew people at, were either a mix of C + 680x0 or pure assembly. In commercial (GUI based) software, C was the vast majority.
On MSDOS I never said it was the only choice - there were many choices, but most commercial development seemed focussed on C. DB work often ended up on Foxbase or Clipper. Turbo Pascal and C were hugely successful but didn't catch MS C. Somewhat surprising given how slow early MS compilers were.
Of course if you were doing DOS TSRs or games you'd be much more likely to use assembly in the mix.
The MS C compilers were not only worse than Borland's, even though Microsoft was the platform owner; their C compiler was also the last MS-DOS C compiler to get a C++ cousin.
Sadly, the way Borland managed the company led to us having to move to VC++ with MFC, instead of BC++ with OWL or C++ Builder with VCL.
Only now is VC++ catching up with C++ Builder for UWP apps.
On MS-DOS besides the DB stuff, everyone I knew was either using Assembly, or a mix of Turbo Pascal with inline Assembly.
C and C++ only came into play in my last year of high school, just before getting into university, but the majority of us already had almost a decade of coding experience by then.
The really sad part is that MS licensed Lattice for the first couple of versions of MS C, yet Lattice itself was markedly faster. If I remember right, MS C didn't even come with a debugger, though Lattice did (not sure about v1).
I took a real dislike to Windows and MFC and moved back to the Unix side of things, so my Win programming was pleasingly brief. :)
Your experience is almost the inverse of mine - we had a few juniors coming on with Pascal as they'd learnt that at uni, but they were easy to convert to C. Just about everyone I knew in those days was C/nix or C/DOS, with just a few hanging on still trying to make a living on the Amiga - mainly games devs.
Yeah, I guess in the old days before the Internet and with expensive BBS connections, the technology had more silos than nowadays, because it was harder to move masses for any given technology.
My understanding of the narrative surrounding this is that Microsoft started working for IBM on OS/2 before switching to Windows, which - struggling here - was intended to have a certain amount of binary compatibility (?). Basically IBM decided, Microsoft went along with them, and the rest was history. Would be interesting to understand who made the decision to use C and why ... was OS/2 intended to be "unix-like"?
UB and memory safety are a big issue exactly because they are not a problem in day to day work.
I literally can't remember the last time I spent any significant time investigating one of these issues. In my experience, when a crash happens because of these issues (usually in a unit test or the first time you start the app), the backtrace points you to the exact problem.
The pain starts when the program and tests work correctly for all reasonable inputs, the underlying issue never manifests during normal execution, and it can potentially be exploited by a malicious attacker with a carefully crafted input.
What I'm trying to say is that I don't want memory safety because it would improve my daily programming experience (in fact possibly the reverse would be true), but I want it because I want security.
> What I'm trying to say is that I don't want memory safety because it would improve my daily programming experience (in fact possibly the reverse would be true)
I think you underestimate how much time is saved by not having to deal with these issues. Programming in C or C++ is frequently an exercise in writing the code, seeing a crash due to a memory safety problem, debugging it, and then repeating until you see anything resembling a working program. Writing in a memory-safe language lets you skip all that startup friction and go straight to "something resembling a working program".
I don't know, I don't think I spend even a couple of hours per month inside a debugger. I doubt my colleagues do either. Usually I write code for 1-2 weeks, then spend one day trying to get it to compile, then one or two days debugging it, but most of my debugging sessions are tailing logs and trying to figure out what went wrong in complex state machines (i.e. the business logic, not the language/infrastructure 'overhead').
Also, many of the data structures I deal with are highly intrusive (as in an object belonging to multiple containers at the same time), and short of full GC I doubt it would be easy to guarantee memory safety.
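To illustrate what 'intrusive' means here, a hypothetical sketch (names invented for the example): the object embeds the link nodes for each container it lives in, so its lifetime isn't owned by any single container, which is what makes ownership-style memory safety awkward without GC.

    #include <stddef.h>

    /* Embedded doubly-linked list node. */
    struct list_node {
        struct list_node *prev, *next;
    };

    /* One object, linked into two lists at once. */
    struct connection {
        int fd;
        struct list_node by_idle_time;   /* lives on the idle list   */
        struct list_node by_peer;        /* lives on a per-peer list */
    };

    /* Recover the containing object from an embedded node. */
    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))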
Then again, possibly I'm not representative of the typical C++ programmer.
But that's not what this study is about: everyone agrees that stomping wildly off the end of an array in C is not going to end well.
This is about far more subtle issues than that - issues where there is some disagreement about whether it's OK to do or not. And the GP is right - these are often not such a problem in practice, if only because these are the kinds of issues where experienced C programmers know that they're sailing close to the wind, and there's almost always an alternative construct that's on more solid ground.
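One example of such an alternative construct, sketched here for the signed-overflow case (my own illustration, not from the study): checking after the addition is undefined, while checking against the limits before adding stays on defined ground.

    #include <limits.h>
    #include <stdbool.h>

    /* Shaky: if a + b overflows, the behaviour is undefined, and the
     * compiler is entitled to assume it cannot happen and drop the test. */
    bool add_overflows_shaky(int a, int b) {
        return a + b < a;    /* only "works" for positive b, and is UB anyway */
    }

    /* Solid: compare against INT_MAX / INT_MIN before doing the addition. */
    bool add_overflows_solid(int a, int b) {
        return (b > 0) ? (a > INT_MAX - b)
                       : (a < INT_MIN - b);
    }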
Signed versus unsigned comparison, signed wrap-around, undefined division behaviour, undefined behaviour of some casts, and strict aliasing optimisations have all caused hard bugs that I've had to spend a lot of time fixing.
I had to deal with some of these before I was a compiler writer, and I would end up just kind of kicking my code repeatedly until stuff worked again.
Now that I'm a compiler writer, I know how to recognize what is happening, but I'm still not smart enough to avoid the bugs in general and I still spend time fixing bugs that result from these issues.
So, I'm with pcwalton: it is a problem. Maybe I don't see it every day, but I see it probably at least once a month.
There are few issues that scare me more than undefined behavior that usually works. "Works" enough for programmers to add it, knowingly or unknowingly, even to defend it as "working in practice"... only to leave a needle in a haystack that they'll never find.
To write portable code, I wouldn't study de-facto definitions of de-jure undefined behavior, except to see if I could cover every possible one and only if all alternatives were inferior.
Indeed not, but enquiring into what the in-the-wild de-facto beliefs about behaviour are might help in deciding what the de-jure rules should be changed to, or what a compiler implementation ought to do if it cares about what it does on the vast mass of code out there that does commit undefined behaviour, wittingly or otherwise...
Undefined behavior has a purpose: not specifying implementation details makes it easier to write new implementations and for a wider variety of platforms. "De facto standards" take away this freedom, so ideally you'd want to reject reliance on UB, but I see your (second) point about that not always being practical. I guess "be conservative in what you do, be liberal in what you accept from others". Just make sure that your foundations are strong (pun) or the whole house will be an EcmaScript.
> If you zero all bytes of a struct and then write some of its members, do reads of the padding return zero? (e.g. for a bytewise CAS or hash of the struct, or to know that no security-relevant data has leaked into them.)
I knew C quite well. Haven't written any for years. The statements "zero all bytes of a struct" and "reads of the padding" contain enough ambiguity that it answers the question. Not to mention the ambiguity in the words "read" and "write" as they pertain to C, since they already have a "std" meaning that's not the same as lvalue or rvalue, so what exactly do they mean here?
And if you think you can answer the question without resolving the ambiguities, that answers some other questions.
I believe the questions in this study (I did not write it; I only know the authors) were deliberately open-ended, allowing for comments on the specifics. A previous, much longer version contained code examples to comment on, but it proved too detailed for people to complete.
Moreover, the study was explicitly not about ISO C: "We were not asking what the ISO C standard permits, which is often more restrictive, or about obsolete or obscure hardware or compilers. We focussed on the behaviour of memory and pointers. This is a step towards an unambiguous and mathematically precise definition of the de facto standards: the C dialects that are actually used by systems programmers and implemented by mainstream compilers."
Here is an actual example of a comment to this question:
I would expect this code to work:
    #include <string.h>
    #include <assert.h>

    struct foo
    {
        char a;
        double b;
    };

    int main( void )
    {
        struct foo p;
        struct foo q;
        memset( &p, 0, sizeof( p ) );
        memset( &q, 0, sizeof( q ) );
        p.a = 1;
        q.a = 1;
        assert( memcmp( &p, &q, sizeof( struct foo ) ) == 0 );
        return 0;
    }
Indeed. The long version is at [pdf] http://www.cl.cam.ac.uk/~pes20/cerberus/notes30-full.pdf; it has 85 questions supported by concrete code examples and experimental data, e.g. (one of several questions that refine the above):
Q64. After an explicit write of zero to a padding byte followed by a write to adjacent members of the structure, does the padding byte hold a well-defined zero value? (not an unspecified value)

    #include <stdio.h>
    #include <stddef.h>
    typedef struct { char c; float f; int i; } st;
    int main() {
      // check there is a padding byte between c and f
      size_t offset_padding = offsetof(st,c)+sizeof(char);
      if (offsetof(st,f)>offset_padding) {
        st s;
        unsigned char *p =
          ((unsigned char*)(&s)) + offset_padding;
        *p = 0;
        s.c = 'A';
        s.f = 1.0;
        s.i = 42;
        unsigned char c3 = *p;
        // does c3 hold 0, not an unspecified value?
        printf("c3=0x%x\n",c3);
      }
      return 0;
    }
Some of the questions have clear answers with respect to either the ISO or de facto standards, but many do not - that's the point.
Are memset and memcmp compatible with strict aliasing? Intuitively it seems like that would be a gap in the aliasing rules. Although void* is allowed to alias anything, so maybe it works through that. I've never seen memset, memcpy or memcmp used on anything but char* in production code.
The aliasing rules only talk about dereferencing pointers. Void* can't be dereferenced, so it has no interaction with the aliasing rules. You might be thinking of char, and yes, you are allowed to access an object's bytes through a char pointer, which is what the various mem* functions do under the hood.
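A small sketch of that distinction (my own example): accessing an object's representation through unsigned char is the sanctioned escape hatch, and memcpy/memcmp/memset behave as if they do exactly that, so they don't conflict with strict aliasing.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        double d = 1.0;

        /* Character-type lvalues may alias any object, so inspecting its
         * bytes like this is well-defined (padding bits aside). */
        const unsigned char *bytes = (const unsigned char *)&d;
        for (size_t i = 0; i < sizeof d; i++)
            printf("%02x", bytes[i]);
        printf("\n");

        /* The mem* functions take void*, which is never dereferenced as
         * such; they access the object as an array of unsigned char. */
        unsigned char copy[sizeof(double)];
        memcpy(copy, &d, sizeof d);
        printf("%d\n", memcmp(copy, &d, sizeof d));   /* prints 0 */
        return 0;
    }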