Hacker News new | past | comments | ask | show | jobs | submit login
A semantic model for a substantial fragment of C (cam.ac.uk)
97 points by edwintorok on March 31, 2017 | hide | past | favorite | 75 comments



I've heard many "real" computer scientists say the "C remains central to our computing infrastructure but still lacks a clear and complete semantics" line but I have no idea what they mean.

Can someone explain to me what this means to the academic-CS type? How can something with a parser & lexer not have "clear and complete semantics". If you have a computer language which has defined elements of the syntax how can you not have clearly defined meanings of combinations of the syntax?

Does someone have an example of a piece of C that is not "deterministic" (best word I have for this) to a lexer & parser?


The parser/lexer consider the syntax of the language. Syntax is easy. Semantics is about meaning. It's quite easy to have non-clearly-defined meanings of syntactic constructs! Just write some vague language in the specification. Don't confuse the semantics of the language with the behavior of a particular implementation (eg, GCC). It's important for the language to have clear, well-defined semantics so that implementations can be compatible and programmers, compilers, analysis tools, etc can all agree on the meaning of a program.

Follow some links from the page for concrete examples:

https://www.cl.cam.ac.uk/~pes20/cerberus/n2090.html

https://www.cl.cam.ac.uk/~pes20/cerberus/n2089.html

And, of course, their paper, which has examples and explanations: https://www.cl.cam.ac.uk/~km569/into_the_depths_of_C.pdf


My question was "If you have a computer language which has defined elements of the syntax how can you not have clearly defined meanings of combinations of the syntax?"

Also, these links don't have anything that is defined in C that isn't "deterministic". I can look at all of these pieces of code and see what is happening and it's all stemming from a misinterpretation of what the standards specify, not from what a "lack of aclear and complete semantics".

Some of these are also not issues at all. I have no idea what they mean to show with provenance_equality_global_yx.c. They take two pointers, change one of the values, and see if the two different pointers are equal. It's obvious in most cases this will be faults but its not an invariant.

First of all, there is no guarantee for layout/ordering of data in the binary. Secondly, a pointer is a memory address. A Memory Address + 1 is just 1 sizeof(int) (in this case). If you're on a system that needs specific aligning and special datatype sizes it might even be impossible to have two ints 1 sizeof(int) away.

There's no reason to think you'd be able to get the same output from that code from multiple architectures because 1) it may be impossible on some hardware and 2) it's not specified in the spec that that will work.


> If you're on a system that needs specific aligning and special datatype sizes it might even be impossible to have two ints 1 sizeof(int) away

In this case, you're right about that, since these aren't arrays being considered. But if p is pointing to an array object of length n, with i <= n - 2, you must have &p[i] + 1 == p[i + 1], from the definition of array indexing in 6.5.2.1.2 of N1570. So, it must always be possible to have two ints 1 sizeof(int) away, otherwise it's not C.

----

The entire Cerberus research project is about finding inconsistencies between the spec, compiler implementations, and human understanding, and creating an accurate formal specification of C. They have found numerous of these. N2090 is about a spec bug, inconsistent with another WG document, and inconsistent with practice.


> My question was "If you have a computer language which has defined elements of the syntax how can you not have clearly defined meanings of combinations of the syntax?"

Because the C standard has implementation-defined and undefined-behaviour, which means every syntactically valid C program has a broad range of possible meanings.


You more or less answered your own question. The very fact that they are able to misinterpret the standard is sufficient proof that the standard lacks a clear and complete semantics. If the standard were formalized, it would not allow misinterpretations.

Think of simple mathematical equations: you can only misinterpret them if you don't know enough mathematics, the equation itself allows only one meaning.


Also helps to see a counterexample. Oberon report is a good one where everything is laid out in clean, neat detail:

https://www.inf.ethz.ch/personal/wirth/Oberon/Oberon.Report....

They also make it memory-safe & type-safe by design so any undefined behavior should happen in SYSTEM/UNSAFE modules hidden behind interfaces that you know to check.


> all stemming from a misinterpretation of what the standards specify, not from what a "lack of aclear and complete semantics"

If people keep misinterpreting it, isn't that the definition of not being clear?


Example of undefined behavior: What arguments will f be called with?

   int i = 0;
   f(i++, i++);
More examples here: https://www.cl.cam.ac.uk/teaching/1415/CandC++/lecture10.pdf


I'd assume it would be

    f(1, 0); 
Because function arguments are pushed in reverse order in C and funnily enough, GCC actually does that even though that's crazy. But clang catches it as an "unsequenced modification".

I don't think that's a good example. The C standards make no guarantees about unsequenced modifications so relying on them is the programmer not using C, it's the programmer using a language derivative implemented by their compiler.

Why do people say "C remains central to our computing infrastructure but still lacks a clear and complete semantics" when the standards are very clear about what it allows? If you use something it doesn't allow you can't blame the spec for that.


You're missing the point.

Can we agree that people are learning a programming language that they call "C", and writing text files in that language and running programs that they call "C compilers" such as gcc on those text files to produce binary executables?

You are correct to note that the programming language I'm describing does not perfectly conform to any of the documents that are ever called a "C standard". But that programming language is central to our computing infrastructure, and whether you or C standard authors like it or not that programming language is widely called "C".

That language's semantics are unclear and incomplete---even though both gcc and clang are commonly called "C compilers", they have differing behavior on `f(i++, i++)` as you found; reportedly, there's code on which gcc behaves differently depending on the optimization level option.


There is a point that you're probably missing. UB is one of greatest things that could have been "invented" in engineering. Instead of covering idiotic cases that should never happen via complex algorithms that poison the code simplicity, you can just say that these are UB and rely on your client code to respect that. If you have a registry of objects with names, you don't have to check name uniqueness, because you can state that adding the same name is UB. Like it is for removing non-existent record. If a range of function produces wrong results (anything around limits.h), don't fix it — you're probably not even able to do it correctly, just call it UB for arguments above INT_MAX/4. I use that a lot in my projects, because it allows you to start without waiting for all sorts of these cases to be covered in callee while there are four-to-ten absolutely correct callsites that will never produce UB.

UB is not a flaw. It is an efficient engineering method. Kids will be alright, that's hard thing to learn, but once you get it, it is simple, clear and complete. If kids want to stay in a sandbox, well, we have a lot of them.

Those who blame UB probably never turned their attention to electronics, physics or chemistry. That is a mess. Every time you pass wrong arguments or do something out of spec, it can explode and really hurt your foot.

Edit: typo


If you're saying that sometimes UB is a good idea, as illustrated by your examples, then I agree with you.

But if you're saying all of C's UB is good, and none if it is a flaw, I think you'll find yourself in the minority there. Do you consider integer overflow to be an "idiotic case that should never happen and poisons code simplicity"? Relying on it to be two's complement is so common that successors like Java and Go and Rust all standardized that.

If you'll concede that some of C's UB is relied upon in real, important, non-idiotic code, then it follows that it is indeed important for there to be efforts like Cerberus that develop unambiguous semantics for more of C than is currently covered by C standards.


>integer overflow, two's complement

As UB is an engineering method, we need to look at it from practical point of view ONLY. In my limited knowledge, platforms with really random overflow-then-back (x+n-m) do not exist. Note it has nothing to do with n's-complement, it is just a sad consequence of (x+n)==undefined, so compiler is free to assume (x+n <= INT_MAX) and throw out some "dead" code, which is not dead, ymmw. This specific UB is about what you get in (INT_MAX + positive) use cases, and these are very strange, because irl 127+3 is 130, not -126. It doesn't even help in famous m=(l+r)/2.

Though I can see social rationale behind standardization you mentioned. Some behaviors too suddenly appear undefined to newcomers who do not bother to read tfm or faq at least, and even those who program for 5-7 years still cannot explain what restrict keyword means. Maybe I'm wrong or somewhat elitist in it; today's software development is different. We have lots of programmers who hardly go further than learning "a new syntax" and we are doomed to use them as a productive force. Then these UB-clearing moves have perfect sense, that's why I mentioned sandbox languages.


I have no idea what your point is, if any.

I will note that "from [a] practical point of view ONLY", it is incredibly impractical for a compiler to gratuitously break real code that relies on behavior that deviates from spec; if that reliance is common, the practical thing to do is to support that behavior, which they do: "The optimizer does go to some effort to "do the right thing" when it is obvious what the programmer meant (such as code that does "* (int *)P" when P is a pointer to float)." --blogpost by the second most widely used C compiler, clang/llvm: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

Which is why projects like Cerberus are valuable.


My point is that UB is almost never any bad, and when it is, there is a live case for that. There is no value in defining the undefined. Clang does what it does at internal levels for completely different purpose than supporting "real code that relies on behavior that deviates from spec". The article describes that in detail. That is completely disconnected from the value of Cerberus-thing you mentioned.


Chemistry might explode, but there's no undefined behaviour.


There is also the issue of detecting such undefined behaviour in code automatically using static analysis. This is essential when you want to change or upgrade the compiler you are using. If I understand correctly, they are trying to supply such tools for C.


> You are correct to note that the programming language I'm describing does not perfectly conform to any of the documents that are ever called a "C standard". But that programming language is central to our computing infrastructure

I don't agree with this. Yes, people often write code that unintentionally rely on UB/implementation-specific behavior; but are there important projects that do this intentionally? In any case, compiler authors are going to follow the C standards, not you hypothetical "C-in-practice" standards, so what good does it do to for formalize or clearly specify the latter?


> are there important projects that do this intentionally?

Expecting integer overflow to be two's complement is so common that successors like Java, Go, and Rust all standardized on it, used by hashing algorithms and more: https://github.com/rust-lang/rfcs/pull/560

> In any case, compiler authors are going to follow the C standards, not you hypothetical "C-in-practice" standards

"The optimizer does go to some effort to "do the right thing" when it is obvious what the programmer meant (such as code that does "* (int *)P" when P is a pointer to float)." --blogpost by the second most widely used C compiler, clang/llvm: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


Unsigned integers in C have defined wrap and that is what portable hash functions use.


> I don't agree with this. Yes, people often write code that unintentionally rely on UB/implementation-specific behavior; but are there important projects that do this intentionally?

The Linux kernel.


I thought staging was the only folder with UB code.


So then the Linux kernel is exactly a big and important project that intentionally depends upon UB, which is what you asked for right?


Where do they do it?


While people are arguing it's syntax, you all might enjoy this parody that's a good example of Poe's Law due to C's horrid (i.e. lack of a) design:

https://www.gnu.org/fun/jokes/unix-hoax.html


I don't understand why you say that it's not a good example. It's legal C that has undefined semantics, which is what you were asking for.


> It's legal C

No it's not. It's undefined behavior meaning, by definition, it's not C. It's not defined in C. If I had an list of all definitions, would an undefined item be in my list? No.

When you are using Undefined Behavior, you are not using C. You're using a C-extension that a compiler designer has implemented. This was done because there is no sane operation to do in these UB cases for one reason or another. Either it would not be portable or it wouldn't make sense.

If there was a standard that implemented standards for UB cases it would not be portable.


I think you're confusing syntax and semantics. The claim is that C is not really semantically well defined, but your toplevel posts talk about the fact that C has well defined syntax, which is true. However, that's besides the point. Syntax is roughly the surface form, whereas semantics refers to meaning. The problem is that a well formed program (ie a syntactically correct program; this is actually the precise definition of well-formed) in C yields undefined behavior on a semantic level, exactly what we don't want to happen. You need to maintain a clear separation between the syntax and meaning. When parent said that its legal C, he meant syntactically. The point is by definition you cannot tell from the surface/lexical form of C what is semantically valid and what is not semantically valid, ie what is well defined and what is not. So by saying its undefined behavior you are proving parent's point, that a syntactically valid piece of C yields a semantically not well defined chunk of code.


OK, so to paraphrase your position: C has clear and complete semantics because anything written in C that doesn't have clear and complete semantics is, by definition, not actually written in C. Yes?

To me, that's like saying the tax code doesn't contain any loopholes, because any loopholes that it contains are, by definition, not actually part of the tax code.

While amusing, I don't think this is a fruitful way to look at the problem. :)


You can write things that look like C, but that doesn't make them valid C. Invalid code is just a collection of ASCII characters, I guess.


Does the parser reject it? That's the test for syntactically valid c.


Syntactically valid is not a perfect match for valid.


> It's undefined behavior meaning, by definition, it's not C. It's not defined in C.

The C99 standard defines undefined behaviour as

"behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements"

Undefined constructs are not necessarily erroneous (and therefore, in a way, "not C"). Many of them are perfectly legal constructs for which the standard you're targeting simply imposes no specific requirements. In fact, C standard documents list several (as in, I think, hundreds of) examples, many of them perfectly correct in terms of syntax. The standard just doesn't dictate what the compiler should do with them.

If you really like playing the words game, you could, at best, claim that undefined behaviour isn't C99 or C89 or whatever your compiler is targeting at a specific time, but it's very much C.

> It's not defined in C. If I had an list of all definitions, would an undefined item be in my list? No.

And yet Annex J of the C99 standard has a list of undefined behaviours.

And that list itself is just a reference, because the C standard is otherwise full of cases where it does not dictate requirements -- effectively, you know, defining a bunch of undefined items. E.g. 6.2.4, 2 in C99, where object lifetimes are defined, states that "If an object is referred to outside of its lifetime, the behaviour is undefined".

Besides, come on, look at code like this:

int foo = INT_MAX;

foo++;

How exactly is this "not C"? Can you think of any perfectly standards-abiding compiler that will not compile this? Does it use any keywords, identifiers or grammar that a C compiler does not understand?


"When you are using Undefined Behavior, you are not using C."

Yes you are using C when you're using undefined behavior. It's one of the cornerstones of C's "design." The standard is poorly defined enough that different people building C compilers will come up with their own interpretation of what specific, legal, C statements mean. These are statements used in real projects that also compile with real compilers. The massive problems that resulted led to static analyzers and compiler checks for undefined behavior. Plus, people recommending you avoid it wherever possible for predictability, safety, and security. It is legal C, though.

Whereas, in languages like Pascal or Ada, you could express similar operations with their semantics making it totally clear what you could expect. They'll either not compile or include runtime checks to avoid similar problems wherever possible. C's designers just didn't give a shit: they were using what modifications of unsafe BCPL they could to squeeze max performance & memory utilization out of a PDP-11. Wirth's Modula-2 at least gave a clean language with many checks & sanity by default that you could selectively turn off if absolutely necessary. Not C, though.

So, it's legal, real C designed without safety, clear semantics, and so on. Different projects implement "real C" differently due to unclear semantics. Different, broken executions result.


I think this answers your earlier question. We don't have tools that implement "C". We only have tools that implement supersets of C. In particular, these tools take code like the above and do "something" with them; the exact thing they do varies with optimization level and phase of the moon. What you want is a tool that takes such code and says "Oops, undefined behavior on line 9, here's a stack trace."


> Because function arguments are pushed in reverse order in C and funnily enough, GCC actually does that even though that's crazy.

Actually, pushing function arguments right-to-left means that the first argument ends up on top of the stack - and that makes vararg-style functions possible.

Think about it - a called vararg method must be able find the fixed arguments to figure out how many arguments it actually gets.


That's what I mean by "Pushed in reverse order"


What if f is a macro. Then it has a lexically determined meaning. What if f is a function. Then it has a parsing determined meaning. What if the ++ operator has been redefined. Et cetera.


If F is a macro then it will just be expanded. Lets say F was

    #define f(a, b) printf("%d %d\n", a, b);
This would change absolutely nothing. A macro is just a piece of code that is expanded. It still follows all of the rules (and disallowances) of C.


Fairly certain you can have compound statements as the result of the macro. e.g., the macro could expand to if(a) {printf("%d\n", b);} for example


You can. Of course you don't even need the logical connector

    #define f(a,b)  \
        f_var1 = a; \
        f_var2 = b
Equally you could use macros to make it a whole lot worse

    #define f(a,b)  \
        f_arr[a][b] += f_arr[b][a]


What if the macro uses a and/or b more than once?


At least, the ++ operator cannot be redefined in C.


The ability to redefine the ++ operator would not alone make it difficult to add proper semantics to the C language. What is preventing that is that no one can really describe how function application truly works in C (in terms of operation ordering), except that we all have an idea similar enough that our systems seem to work together in practice.

For all the criticism Haskell gets over the purported difficulty on describing sequential computation, at least the order of monadic operations is clearly captured by the requirements to be a monad. There is no such concept for C, except that each compiler mostly seems to perform operations sequentially.


At least some of it is defined. Annex C of the C99 standard defines when sequence points occur (commonly the ';' operator, but also in a few other places).

Source: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf page 439 (!) Note that Annex C is merely informative, but it contains references back to the actual standard where these things are defined.

Edit: I think a better advantage of Haskell is that Haskell is much more free to not define sequence points. Haskell effectively defaults to not sequencing expressions unless you use a monad or there is some obvious dependency between those expressions like the output of one function used as the input to another.

All that implicit parallelism is something that C compilers have to work hard to derive, but in Haskell it's explicit in the source.


Monad's are not such basic building blocks. It's not monads that define sequencing points. It's that data flow in Haskell is the only way to define control flow. (And Monads make use of that.)


I'm dumbfounded by all the operators of C that enable all this confusion.

About the potential order of evaluation, it's not limited to C, but at least it's clearly and dilligently explained even in tiny standards like Scheme.


The scheme standard actually bit my sorry behind recently. I was porting a small library between scheme implementations, and my tests were failing.

I had always thought it was evaluated left->right (which it was in the first scheme implementation), but it is in fact standardised as "The order of evaluation is unspecified". The reason I hadn't noticed before was that most my code is rather functional. Now I was reading a char from a file handle, and peeking it as the other argument.

It failed because read-char is destructive (it changes the position of the file pointer), and since the order was the other way around in the scheme I was porting to, all the procedures relying on my "read-until" method started failing left and right.

It just took me ~20 years to figure it out...


ouch.


Yup. Did not feel great.

The worst part is that I admitted to writing code that rely on order of evaluation on Hackernews.

To my defense it is _very_ old code that I converted from an even older turbo Pascal program I wrote as a teenager


To clarify: the bug took about 3 minutes to figure out. Learning the evaluation order of procedure arguments is undefined took 20 years though :)


And I had no idea it was officially fixed ordered in standard. I've read mentions in papers but I never realized what they meant.


Well: The thing is that it is _not_ fixed. It is defined as unspecified.


Would this be an issue in Common Lisp? I believe CL function arguments are evaluated left-to-right.


It is undefined in Common Lisp whether or not the car of a form is resolved to a function before or after the other arguments, iirc


That's right. The standard says left to right.

It's not good style to exploit that, but you can.


Is this example non-deterministic or just undefined? Sure you should not rely on this code to give you consistent results between compilers and even optimisation levels, but what ever happens should be consistent between runs.


This is UB and it's caught by both GCC and Clang:

    $ gcc -Wall test.c 
    test.c: In function ‘main’:
    test.c:12:13: warning: operation on ‘a’ may be undefined [-Wsequence-point]
      test(a++, a++);
                ~^~
    $ clang test.c
    test.c:12:8: warning: multiple unsequenced modifications to 'a' [-Wunsequenced]
            test(a++, a++);
                  ^    ~~
    1 warning generated.


It's interesting that GCC requires -Wall whereas clang issues the warning by default. I wonder how many "bugs" could be caught early if GCC just issued the warning by default as well (for folks who don't religiously use -Wall)?


Exactly! Without sequence-point error, output is 1,0 on GCC6, 0,1 with CLang.


Exactly. The OP asked for an example of a non- deterministic function not UB.


I'm not sure about non-deterministic behavior without looking it up, but you can fix this example to be implementation defined, rather than undefined like this:

  #include <stdio.h>
  
  int postincrement(int *x) { int y = *x; *x = y + 1; return y; }
  
  int main(int argc, char **argv) {
    int i = 0;
    printf("%d, %d\n", postincrement(&i), postincrement(&i));
    return 0;
  }
The point is that function arguments introduce sequence points, but the specification for i++ is weaker than what you get for a function call. C is full of gotchas like this...


Their term was non-deterministic, but they asked for undefined behavior, because they were asking for examples of course being not fully defined semantically.


Unlike the other replies, I don't think they are talking about undefined behavior, I think they mean "formal semantics". The C standard documents are written in English, which might leave room for ambiguities (logical or linguistic), underspecification or inconsistencies. You can search for formal semantics (or "operational semantics", or "denotational semantics") of other languages for examples. And what the C standard calls "undefined behavior" can in theory be formalized in a very clear way (so I think it was a bad choice for them to call it "undefined behavior")


Indeed, semantics has nothing to do with undefined behavior. Claiming that undefined behavior in C is what prevents us from defining its semantics is quite similar to the assertion that the fact that the division by 0 is not defined is the reason we cannot obtain semantics for the arithmetic notation.


    int f(int x, int y) {
       return x << y;
    }
This is undefined for SOME values of y, namely those bigger than 31 (on platforms where int is 32 bit; adjust where different).

This is caused by two facts:

a) The arithmetic operation should be compiled to a hardware shift instruction where possible

b) Different architectures do different things when shifting by an amount bigger than the word size:

- on ARM, the "y" operand is evaluated mod 32

- on x86, if y is greater than 31, the result is 0


> - on ARM, the "y" operand is evaluated mod 32

> - on x86, if y is greater than 31, the result is 0

You mixed them up, didn't you ?

AFAIK, ARM does set all bits to 0 if the shift if greater than 31, but it only uses the lowest byte of the shift amount register, so it feels like a mod 256.

x86[^1] does mask the lower 5 bits of the shift amount, so it feels like a mod 32. And the amount can only be 8 bits, so a 32-bit C variable must be truncated somehow.

[1] except on 8086, there is no mask there, so no modulo.


Also see the work done in the under-utilized K Framework which captures & rejects undefined behavior as well:

http://fsl.cs.illinois.edu/index.php/Defining_the_Undefinedn...

Since it's K, they were able to turn it into a GCC-like compiler for you verifying things about your application. If it even compiles, you have no undefined behavior.

https://github.com/kframework/c-semantics

On concurrency side, it's built on top of Maude tool that an inexperienced student was able to use to find errors in Eiffel's SCOOP model for concurrency. So, it can probably handle that aspect as well.


Any write up about the errors they found? Are they a problem with SCOOP as such or an implementation issue?


My memory is hazy but the abstract suggests this is the one I read before:

http://se.ethz.ch/~meyer/publications/concurrency/scoop_maud...

The good thing about SCOOP particularly is there was a lot of CompSci work improving on it in many ways. Most of the safer, concurrency models didn't get such attention. Example:

http://cme.ethz.ch/publications/


"Its front-end is written from scratch to closely follow the C11 standard, including a parser that follows the C11 standard grammar, and a typechecker."

C front-end written from scratch sounds interesting. I cannot find the code for this front-end anywhere. Has anyone found it somewhere or maybe it's not released yet?


There is also the K framework, where C semantics were specified and an interpreter is generated.

http://www.kframework.org/index.php/Chucky_Ellison


I didn't even see you beat me to it. You should also link to KCC when you bring it up. Developers love practical stuff they can download and use to their benefit. :)

https://news.ycombinator.com/item?id=14012891


Does this handles concurrency properly?




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: