Everything I wish I knew when learning C (tmewett.com)
738 points by bubblehack3r on Nov 28, 2022 | 391 comments



I like it, but the array details are a little bit off. An actual array does have a known size; that's why, when given a real array, `sizeof` can give the size of the array itself rather than the size of a pointer. There's no particular reason why C doesn't allow you to assign one array to another of the same length; it's largely just an arbitrary restriction. As you noted, it already has to be able to do this when assigning `struct`s.

Additionally a declared array such as `int arr[5]` does actually have the type `int [5]`, that is the array type. In most situations that decays to a pointer to the first element, but not always, such as with `sizeof`. This becomes a bit more relevant if you take the address of an array as you get a pointer to an array, Ex. `int (*ptr)[5] = &arr;`. As you can see the size is still there in the type, and if you do `sizeof *ptr` you'll get the size of the array.
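
A small sketch of the distinction (illustrative only):

  #include <stdio.h>

  int main(void) {
      int arr[5];
      int *p = arr;             // decays to a pointer to the first element
      int (*pa)[5] = &arr;      // pointer to the whole array, type int (*)[5]

      printf("%zu %zu %zu\n",
             sizeof arr,        // 5 * sizeof(int): the array type keeps its size
             sizeof p,          // just the size of a pointer
             sizeof *pa);       // 5 * sizeof(int) again
      return 0;
  }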


I really wish that int arr[5] adopted the semantics of struct { int arr[5]; } -- that is, you can copy it, and you can pass it through a function without it decaying to a pointer. Right now in C:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t t1[5];
    typedef struct { uint32_t arr[5]; } t2;
    void test(t1 a, t2 b) {
        t1 c;
        t2 d;
        printf("%zu %zu %zu %zu\n", sizeof(a), sizeof(b), sizeof(c), sizeof(d));
    }
will print the pointer size (4 or 8), 20, 20, 20. I understand that array types having their sizes in their types was one of Kernighan's gripes with Pascal [0], which likely explains why arrays decay to pointers, but in those cases you could still explicitly decay to a pointer if you really wanted to, and pass an explicit length parameter.

[0] http://www.lysator.liu.se/c/bwk-on-pascal.html


> I really wish that int arr[5] adopted the semantics of struct { int arr[5]; }

You and me both. In fact, D does this. `int arr[5]` can be passed as a value argument to a function, and returned as a value argument, just as if it was wrapped in a struct.

It's sad that C (and C++) take every opportunity to instantly decay the array to a pointer, which I've dubbed "C's Biggest Mistake":

https://www.digitalmars.com/articles/C-biggest-mistake.html


That would be a nice little "gcc addition" to the C standard, honestly.

Too bad they spend most of their time doing whatever it is they do.


I have long been convinced that WG14 has no real interest in improving C's security beyond what a Macro Assembler already offers out of the box.

Even the few "security" attempts that they have made, still require separate pointer and length arguments, thus voiding any kind of "security" that the functions might try to achieve.

However, even a Macro Assembler is safer than modern C compilers, as it doesn't remove your code when you step on a UB mine.


One of the members of WG14 posted here a few days ago that they only use C89.


Mind blown.


Earlier versions of gcc actually used to support this in a very restricted context in C90 (or maybe gnu89) mode:

  struct foo { int a[10]; };
  struct foo f(void);
  int b[10];
  b = f().a;
In C90, you can't actually do anything with `f().a` because the conversion from array to pointer only happened to lvalues (`f().a` is not an lvalue), and assignment is not defined for array variables (though gcc allowed it). The meaning was changed in C99 so that non-lvalue arrays are also converted to pointers. gcc used to take this distinction into account, so the above program would compile in C90 mode but not in C99 mode. New versions of gcc seem to forbid array assignment in all cases.

I think this quirk also means that it's technically possible to pass actual arrays to variadic functions in C90, since there was nothing to forbid the passing (it worked in gcc at least, though in strict C90, you wouldn't be able to use the non-lvalue array). In C99 and above, a pointer will be passed instead.


Beware of struct padding.

sizeof(b.arr) != sizeof(b)

Consider:

  #include <stddef.h>

  #include <inttypes.h>

  typedef struct Array Array;

  struct Array {
      int32_t data[8];
  };

  void foo(Array const* arr) {
      size_t sz = sizeof(arr->data);  /* 32 here; sizeof(*arr) could be larger if the struct had padding */
      (void)sz;
  }


> There's no particular reason why C doesn't allow you to assign one array to another of the same length

Actually, there is a particular (though not necessarily good) reason: it would require the compiler to either generate a loop (with a conditional branch) for an (unconditional) assignment, or generate unboundedly many assembly instructions (essentially an unrolled loop) for a single source operation.

Of course, that stopped being relevant when they added proper (assign, return, etc) support for structs, which can embed arrays anyway, but that wasn't part of the language initially.


It was initially available in 1982, so plenty of time to add the other features.

https://www.bell-labs.com/usr/dmr/www/chist.html


Another weird property of C arrays is that &arr == arr. The address of an array is the pointer to the first element, which is what `arr` itself decays to. If arr were a pointer, &arr != arr.


Is today international speak like a pirate day? arr arr arr


I think it is clearer to say that arr == &arr[0] but your mileage may vary.


&arr is a pointer to the array. It will happen to point to the same place as the first element, but in fact they have different types, and e.g. (&arr)[0] == arr != arr[0].
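
A small example of the type difference (illustrative only):

  #include <assert.h>
  #include <stdio.h>

  int main(void) {
      int arr[5] = {0};
      int (*p)[5] = &arr;                   // same address as arr, but type int (*)[5]
      assert((void *)&arr == (void *)arr);  // the addresses compare equal...
      assert((&arr)[0] == arr);             // ...and (&arr)[0] decays to &arr[0]
      printf("%d\n", arr[0]);               // while arr[0] is an int, not a pointer
      return 0;
  }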


> There's no particular reason why C doesn't allow you to assign one array to another of the same length, it's largely just an arbitrary restriction.

IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars. If you need assignment that might do O(N) work, you need to call a stdlib function (memcpy/memmove) instead. If you need an allocation that might do O(N) work, you either need a function (malloc) or you need to do your allocation not-at-runtime, by structuring the data in the program's [writable] data segment, such that it gets "allocated" at exec(2) time.
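
For instance, copying one array into another has to be spelled out with a library call (a minimal sketch):

  #include <string.h>

  void copy_example(void) {
      int a[5] = {1, 2, 3, 4, 5};
      int b[5];
      memcpy(b, a, sizeof b);  // the O(N) copy is explicit; plain `b = a;` won't compile
  }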

This is really one of the biggest formal changes between C and C++ — C++ assignment, and keywords `new` and `delete`, can both do O(N) work.

(Before anyone asks: a declaration `int foo[5];` in your code doesn't do O(N) work — it just moves the stack pointer, which is O(1).)


> Assignment is always O(1)

This depends on what you consider to be O(1) - being that the size of the array is fixed it's by definition O(1) to copy it, but I might get your point. I think in general your point isn't true though, C often supports integer types that are too large to be copied in a single instruction on the target CPU, instead it becomes a multi-instruction affair. If you consider that to still be O(1) then I think it's splitting hairs to say a fixed-size array copy would be O(N) when it's still just a fixed number of instructions or loop iterations to achieve the copy.

Beyond that, struct assignments can already generate loops of as large a size as you want, Ex: https://godbolt.org/z/8Td7PT4af


I think the meaning here is that assignment is never O(N) for any variable N computed at runtime. Of course, you can create arbitrarily large assignments at compile time, but this always has an upper bound for a given program.


Then you are wrong, since we're already talking about arrays of sizes known at compile time. Indeed, otherwise we would also need to remember the size at runtime.


I don't think we're actually in disagreement here. It looks like I misread the parent comment to be claiming that fixed-size array assignment ought to be considered O(N), when no such claim is made.


Yeah to clarify I'm definitely in agreement with you that it's O(1), the size is fixed so it's constant time. It's not like the 'n' has to be "sufficiently small" or something for it to be O(1), it just has to be constant :)

People are being very loose about what O(n) means so I attempted to clarify that a bit. Considering what assignments can already do in C it's somewhat irrelevant whether they think it's O(n) anyway, it doesn't actually make their point correct XD


IIRC this is valid in C99:

    void foo(size_t n) {
        int arr[n];
        …
    }


VLAs can be declared in a single statement, but they cannot be initialized in C17 (6.7.9):

> The type of the entity to be initialized shall be an array of unknown size or a complete object type that is not a variable length array type.

Curiously, C23 actually seems to break the O(1) rule, by allowing VLAs to be initialized with an empty initializer:

  int arr[n] = {};
GCC generates a memset call (https://godbolt.org/z/5v31bKs5a) to fill the array with zeros.


How do you think it works? Does the compiler generate some kind of stack alloc?

Stupid question: Does that mean a huge value for 'n' can cause stack overflow at runtime? I recall that threads normally get a fixed size stack size, e.g., 1MB.


Yes, it causes stack overflow at runtime. Compilers can warn about it; in particular, clang has a warning that you can configure to pop up whenever the stack usage of a function goes beyond some limit you set. I think setting it to 32k or 64k is a safe and sane default, as e.g. macOS thread stack sizes are just 512 KB.


It just moves the stack pointer by n which is O(1). It doesn’t initialize it of course. But my point is that the array size isn’t known at compile time.


> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars.

This is absolutely and entirely wrong. You can assign a struct in C and the compiler will call memcpy when you do.

Enjoy: https://godbolt.org/z/98PnhYoev


And any structure is O(1), not O(n), because C structures are not parameterized.


memcpy is not O(1)


It's O(1) relative to any size computed at runtime: that is, running the same program (with the same array size) on different inputs will always take the same amount of work for a given assignment.


We're in the context of the assignment operation in the language here. Yes, in C you can only assign statically-known types but that does not mean you can just ignore that a = f(); may take a very different time depending on the types of a and f


This reasoning falls apart for structs with array members.


Well C does allow "copying" an array if it's wrapped inside a struct, which does not make it O(1). gcc generates calls to memcpy in assembly for copying the array.


> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime.

How about CPUs that have no, say, division instruction, so it has to be emulated with a loop?


Fixed-width divisions are O(1), just comparatively expensive (and potentially optimized to run in variable time). Consider that you can do long division on pairs of numerals of, say, up to 20 digits and be Pretty Confident of an upper bound on how long it's going to take you (you know it's not going to take more than 20 rounds), even though it's going to take you longer to do that than it would for you to add them.


Interesting, I didn't fully realise that. That it's arbitrary is annoying, I clearly had tried to rationalise it to myself! Thanks for the comments, will get around to amending


Hi, great article. Regarding char, I'd remark that getchar() etc. return int so they can return -1 for EOF or error.

I'm pretty sure this implies int as a declaration is always signed, but tbh I'm not completely sure!
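
For example, the usual read loop keeps the result in an int precisely so EOF stays distinguishable from every valid character value:

  #include <stdio.h>

  int main(void) {
      int c;  // int, not char, so the -1 returned for EOF can't collide with a real character
      while ((c = getchar()) != EOF)
          putchar(c);
      return 0;
  }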


int, like all other integer types except char, is indeed signed by default.

Aside: signedness semantics of char is implementation-defined. However, the type char itself is always distinct from both signed char and unsigned char.


Another corner of the language where arrays actually being arrays is important is multidimensional array access:

    int arr[5][7];
    arr[3][5] = 4; // equivalent to *(*(arr + 3) + 5) = 4;
This works because (arr + 3) has type "pointer to int[7]", not "pointer to int". The resulting address computation is

    (char*)arr + 3 * sizeof(int[7]) + 5 * sizeof(int) ==
    (char*)arr + 26 * sizeof(int)
That's also another reason why types like "int [][5][7]" are legal but "int [][5][]" are not.


Really, are multidimensional arrays an important part of the language?

The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.
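
For what it's worth, a minimal sketch of what such helpers might look like (the names and the COLS constant are made up for illustration):

  #define COLS 7

  // row-major index math done in one place instead of scattered through the code
  static inline int  get(const int *a, int i, int j)      { return a[i * COLS + j]; }
  static inline void set(int *a, int i, int j, int value) { a[i * COLS + j] = value; }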


Of course multidimensional arrays are an important part of the language, just as the ability to have structs inside structs.

> The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.

It is a "flat" array already, not an array of pointers: [0]. No need to write the code that compiler generates for you already.

[0] https://godbolt.org/z/x3cPf3TvT


I am aware that it's flat already.


The code above does not mention any pointer types, so why would you assume that it's indexing into an array of pointers?

I can't think of any reason why get(a, i, j) is more readable than a[i][j].


Return pointer to array of 4 integers:

  int32_t (* bar(void))[4] {
      static int32_t u[4] = {1, 0, 1, 0};
      return &u;
  }
Return a pointer to a function taking a char:

  void f(char a) {
      // ...
  }

  void (* baz(void))(char) {
      return f;
  }


This is where you really want to start using typedefs.
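
For instance, the declarations above could be written with typedefs along these lines (the typedef names are made up):

  #include <stdint.h>

  typedef int32_t four_ints[4];   // an array type
  typedef void (*char_fn)(char);  // a pointer-to-function type

  void f(char a) { /* ... */ }

  four_ints *bar(void) {          // same as: int32_t (*bar(void))[4]
      static four_ints u = {1, 0, 1, 0};
      return &u;
  }

  char_fn baz(void) {             // same as: void (*baz(void))(char)
      return f;
  }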


> it's largely just an arbitrary restriction

Kind of. But the restriction is in keeping with the C philosophy of no hidden implementation magic. C has the same restriction on structs. It's the same question: an array of bytes of known size is something the compiler could easily abstract away. But assignment is always a very cheap operation in C. If we allow assignment to stand for memcpy(), that property is no longer true.

Same reason why Rust requires you to .clone() so much. It could do many of the explicit copies transparently, but you might accidentally pass around a 4 terabyte array by value and not notice.


> But assignment is always a very cheap operation in C.

That's just not true though; you can assign structs of arbitrarily large size to each other, and compilers will emit the equivalent of `memcpy()` to do the assignment. They might actually call `memcpy()` automatically depending on the particular compiler.

The fact that if you wrap the array in a struct then you're free to copy it via assignment makes it arbitrary IMO.


Perhaps I am missing something in the spec - but trying this in various compilers, it seems that you *can* assign structs holding arrays to one another, but you *cannot* assign arrays themselves.

This compiles:

  struct BigStruct {
    int my_array[4];
  };
  int main() {
    struct BigStruct a;
    struct BigStruct b;
    b = a;
  }
But this does not:

  int main() {
    int a[4];
    int b[4];
    b = a;
  }
That seems like an arbitrary restriction to me.


In the first example a & b are variables, which can be assigned to each other. In the second a & b are pointers, but b is fixed, so you can not assign a value to it.


They’re not pointers. sizeof a == 4*sizeof(int), not sizeof(int*).


They're pointers, just weird ones. The compiler knows it's an array, so it gives the result of the actual amount of space it takes up. If you passed it into a function, and used the sizeof operator in the function, it'd give `sizeof(int *)`. Because sizeof is a compile-time operation, so the compiler still knows that info for your example.


That just means it decays into a pointer after being passed as a function argument. In the example given, however, it's not a pointer. Just like it wouldn't be inside a struct.


Essentially ‘b = a’ in the second example is equivalent to ‘b = &a[0]’ or assigning an array to a pointer.

This is because if you use an array in an expression, its value is (most of the time) a pointer to the array's first element. But the left element is not an expression, therefore it is referring to b the array.

Example one works because no arrays are referred to in the expression side, so this shorthand so to speak is avoided.

Arrays can be a painful edge case in C; for example, variable length arrays are hair-pulling.


The left side of an assignment in C is an expression. It's just not in a context where array-to-pointer decay is triggered.


Ironically, Rust does allow you to implicitly copy an array as long as it reduces to a memcpy


Specifically, arrays [T; N] are Copy precisely when T is Copy. So, an array of 32-bit unsigned integers [u32; N] can be copied, and so can an array of immutable string references like ["Hacker", "News", "Web", "Site"], but an array of mutable Strings cannot.

The array of mutable Strings can be memcpy'd and there are situations where that's actually what Rust will do, but because Strings aren't Copy, Rust won't let you keep both - if it did this would introduce mutable aliasing and so ruin the language's safety promise.


> Everything I wish I knew when learning C

By far my biggest regret is that the learning materials I was exposed to (web pages, textbooks, lectures, professors, etc.) did not mention or emphasize how insidious undefined behavior is.

Two of the worst C and C++ debugging experiences I had followed this template: Some coworker asked me why their function was crashing, I edit their function and it sometimes crashes or doesn't depending on how I rearrange lines of code, and later I figure out that some statement near the top of the function corrupted the stack and that the crashes had nothing to do with my edits.

Undefined behavior is deceptive because the point at which the program state is corrupted can be arbitrarily far away from the point at which you visibly notice a crash or wrong data. UB can also be non-deterministic depending on OS/compiler/code/moonphase. Moreover, "behaving correctly" is one legal behavior of UB, which can fool you into believing your program is correct when it has a hidden bug.

A related post on the HN front page: https://predr.ag/blog/falsehoods-programmers-believe-about-u... , https://news.ycombinator.com/item?id=33771922

My own write-up: https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...

The take-home lesson about UB is to only rely on following the language rules strictly (e.g. don't dereference null pointer, don't overflow signed integer, don't go past end of array). Don't just assume that your program is correct because there were no compiler warnings and the runtime behavior passed your tests.


> how insidious undefined behavior is.

Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans and thus did not give rise to computers or C, and thus never exist.

It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.


I'm not sure that's a productive way to think about UB.

The "weirdness" happens because the compiler is deducing things from false premises. For example,

1. Null pointers must never be dereferenced.

2. This pointer is dereferenced.

3. Therefore, it is not null.

4. If a pointer is provably non-null, the result of `if(p)` is true.

5. Therefore, the conditional can be removed.
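
A minimal C sketch of that chain of deductions (hypothetical function; whether a given compiler actually removes the branch depends on its optimizer):

  #include <stddef.h>

  int read_flag(int *p) {
      int v = *p;     // (2) p is dereferenced here...
      if (p == NULL)  // (3)+(4) ...so the compiler may treat p as provably non-null
          return -1;  // (5) and delete this branch entirely
      return v;
  }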

There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing

   if(find_undefined_behv(AST))
      emit_nasal_demons()
   else
      do_what_they_mean(AST)


The C and C++ (and D) compilers I wrote do not attempt to take advantage of UB. What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

I suppose I think in terms of "what would a reasonable person expect to happen with this use of UB" and do that. This probably derives, again, from my experience designing flight critical aircraft parts. You don't want to interpret the specification like a lawyer looking for loopholes.

It's the same thing I learned when I took a course in high-performance race driving. The best way to avoid collisions with other cars is to be predictable. It's doing unpredictable things that causes other cars to crash into you. For example, I drive at the same speed as other traffic, and avoid overtaking on the right.


I think this is a core part of the problem; if the default for everything was to not take advantage of UB things would be better - and we're fast enough that we shouldn't NEED all these optimizations except in the most critical code; perhaps.

You should need something like

    gcc --emit-nasal-daemons
to get the optimizations that can hide UB, or at least horrible warnings that "code that looks like it checks for null has been removed!!!!".


AFAIK GCC does have switches to control optimizations, the issues begin when you want to use something other than GCC, otherwise you're just locking yourself to a single compiler - and at that point might as well switch to a more comfortable language.


> What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

This is how it worked in the "old days" when I learned C. You accessed a null pointer, you got a SIGSEGV. You wrote a "+", then you got a machine add.


In the really old DOS days, when you wrote to a null pointer, you overwrote the DOS vector table. If you were lucky, fixing it was just a reboot. If you were unlucky, it scrambled your disk drive.

It was awful.

The 8086 should have been set up so the ROM was at address 0.


This is the right approach IMO, but sadly the issue is that not all C compilers work like that even when they could (e.g. they target the same CPU), so even if one compiler guarantees it won't introduce bugs from an overzealous interpretation of UB, unless you are planning to never use any other compiler you'll still be subject to said interpretations.

And if you do decide that sticking to a single compiler is best then might as well switch to a different and more comfortable language.


This is the problem; every compiler outcome is a series of small logic inferences that are each justifiable by language definition, the program's structure, and the target hardware. The nasal demons are emergent behavior.

It'd be one thing if programs hitting UB just vanished in a puff of smoke without a trace, but they don't. They can keep on spazzing out literally forever and do I/O, spewing garbage to the outside world. UB cannot be contained even to the process at that point. I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems. One mistake and you invite the wrath of God!


> I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems.

This is literally why newer languages like Java, JavaScript, Python, Go, Rust, etc. exist. With the hindsight of C and C++, they were designed to drastically reduce the types of UB. They guarantee that a compile-time or run-time diagnostic is produced when something bad happens (e.g. NullPointerException). They don't include silly rules like "not ending a file with newline is UB". They overflow numbers in a consistent way (even if it's not a way you like, at least you can reliably reproduce a problem). They guarantee the consistent execution of statements like "i = i++ + i++". And for all the flak that JavaScript gets about its confusing weak type coercions, at least they are coded in the spec and must be implemented in one way. But all of these languages are not C/C++ and not compatible with them.


Yes, and my personal progression from C to C++ to Java and other languages led me to design Virgil so that it has no UB, has well-defined semantics, and yet crashes reliably on program logic bugs, giving exact stack traces; but unlike Java and JavaScript, it compiles natively and has some systems features.

Having well-defined semantics means that the chain of logic steps taken by the compiler in optimizing the program never introduces new behaviors; optimization is not observable.


It can get truly bizarre with multiple threads. Some other thread hits some UB and suddenly your code has garbage register states. I've had someone UB the fp register stack in another thread so that when I tried to use it, I got their values for a bit, and then NaN when it ran out. Static analysis had caught their mistake, and then a group of my peers looked at it and said it was a false warning leaving me to find it long afterwards... I don't work with them anymore, and my new project is using rust, but it doesn't really matter if people sign off on code reviews that have unsafe{doHorribleStuff()}


On the contrary, the latter is a far more effective way to think about UB. If you try to imagine that the compiler's behaviour has some logic to it, sooner or later you will think that something that's UB is OK, and you will be wrong. (E.g. you'll assume that a program has reasonable, consistent behaviour on x86 even though it does an unaligned memory access). If you look at the way the GCC team responds to bug reports for programs that have undefined behaviour, they consider the emit_nasal_demons() version to be what GCC is designed to do.


> There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior

The problem is that, due to other optimisations (mainly inlining), the emergent misbehaviour can occur in a seemingly unrelated part of the program. This can make the inference chain very difficult to follow, as you have to trace paths through the entire execution of the program.

The same issue occurs for other types of data corruption (it's why NPEs are so disliked), but UB's blast radius is both larger and less reliable.


I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").

> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.

The first statement is factually true, but I can provide a justification for the second statement which is an opinion.

Consider this code:

    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division operation earlier so that the processor can do other useful work while the division is running in the background. For example, the compiler can move the expression x / y above the first printf(), which would totally be legal. But then, the behavior is that the program would appear to crash before the sum and first printf() were executed. UB time travel is real, and that's why it's important to follow the rules, not just make conclusions based on observed behavior.

https://blog.regehr.org/archives/232
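
Conceptually, the compiler is allowed to behave as if the source had been written like this (illustrative only; any particular compiler may or may not perform this exact transformation):

    #include <stdio.h>

    void foo(int x, int y) {
        int q = x / y;             // division hoisted above the first printf,
        printf("sum %d", x + y);   // so a divide-by-zero trap now fires "before"
        printf("quotient %d", q);  // the sum is ever printed
    }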


...Why is the compiler reordering so much?

Look. I get it, clever compilers (I guess) make everyone happy, but are absolute garbage for facilitating program understanding.

I wonder if we are shooting ourselves in the foot with all this invisible optimization.


People like fast code.


In 2022, is there any other reasons to use C besides "fast code" or "codebase already written in C"?


No, and, in fact, the first one isn't valid - you can use C++ (or a subset of it) for the same performance profile with fewer footguns.

So really the only time to use C is when the codebase already has it and there is a policy to stick to it even for new code, or when targeting a platform that simply doesn't have a C++ toolchain for it, which is unfortunately not uncommon in embedded.


"codebase already written in C" includes both "all the as yet unwrapped libraries" and "the OS interface".


There isn't. Fast code is pretty important though to a lot of people while security isn't (games, renderers, various solvers, simulations etc.).

It's great C is available for that. If you're OK with slow, use Java or whatever.


> Integer division is a slow operation, and under the rules of C, it has no side effects.

Then C isn't following this rule - crashing is a pretty major side effect.


The basic deal is that in the presence of undefined behavior, there are no rules about what the program should do.

So if you as a compiler writer see: we can do this optimization and cause no problems _except_ if there's division by zero, which is UB, then you can just do it anyway without checking.


Only non-zero integer division is specified as having no side effects.

Division by zero is in the C standard as "undefined behavior", meaning the compiler can decide what to do with it; crashing would be nice, but it doesn't have to. It could also give you a wrong answer if it wanted to.

Edit: And just to illustrate, I tried in clang++ and it gave me "5 / 0 = 0" so some compilers in some cases indeed make use of their freedom to give you a wrong answer.


To my downvoters, since I can no longer edit: I've been corrected that the rule is integer division has no side effects except for dividing by zero. This was not the rule my parent poster stated.


> I've been corrected

No you haven't. The incorrect statement was a verbatim quote from nayuki's post, which you were responding to. Please refrain from apologising for other people gaslighting you (edit: particularly, but not exclusively, since it sets a bad precedent for everyone else).


At the CPU level, division by zero can behave in a number of ways. It can trap and raise an exception. It can silently return 0 or leave a register unchanged. It might hang and crash the whole system. The C language standard acknowledges that different CPUs may behave differently, and chose to categorize division-by-zero under "undefined behavior", not "implementation-defined behavior" or "must trap".

I wrote:

> Integer division is a slow operation, and under the rules of C, it has no side effects.

This statement is correct because if the divisor is not zero, then division truly has no side effects and can be reordered anywhere, otherwise if the divisor is zero, the C standard says it's undefined behavior so this case is irrelevant and can be disregarded, so we can assume that division always has no side effects. It doesn't matter if the underlying CPU has a side effect for div-zero or not; the C standard permits the compiler to completely ignore this case.


> I wrote:

> > Integer division is a slow operation, and under the rules of C, it has no side effects.

Yes, you did, and while that's a reasonable approximation in some contexts, it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour. (Arguably that means it has every possible side effect, but that's more of a philosophical issue. In practice it has various specific side effects like crashing, which are specific realizations of its theoretical side effect of invoking undefined behaviour.)

vikingerik's statement was correct:

> [If "Integer division [...] has no side effects",] Then C isn't following this rule - crashing is a pretty major side effect.


> it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour.

They were careful to say “under the rules of C,” the rules define the behaviour of C. On the other hand, undefined behaviour is outside the rules, so I think they’re correct in what they’re saying.

The problem for me is that the compiler is not obliged to check that the code is following the rules. It puts so much extra weight on the shoulders of the programmer, though I appreciate that using only rules which can be checked by the compiler is hard too, especially back when C was standardised.


> They were careful to say "under the rules of C,"

Yes, and under the rules of C, division by zero has a side effect, namely invoking undefined behaviour.

> The problem for me is that the compiler is not obliged to check that the code is following the rules.

That part's actually fine (annoying, but ultimately a reasonable consequence of the "rules the compiler can check" issue); the real(ly bad and insidious) problem is that when the compiler does check that the code is following the rules, it's allowed to do it in a deliberately backward way that uses any case of not following the rules as an excuse to break unrelated code.


Undefined behavior is not a side effect to be "invoked" by the rules of C. If UB happens, it means your program isn't valid. UB is not a side effect or any effect at all, it is the void left behind when the system of rules disappears.


Side effects are a type of defined behavior. Crashing is not a "side effect" in C terms.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

This is the greatest sin modern compiler folks have committed against C. The C language never says the compiler can change the code arbitrarily due to a UB statement. It is undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while not defined as to what the final value should be, is understood to be an operation that updates a value. It is definitely not, e.g., an assertion about the operands on the grounds that UB can't happen.

Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.

As computing goes beyond arithmetic and programs grow in complexity, I personally believe some amount of fuzziness is the key. This current narrow view from the compiler folks (which somehow gets accepted at large) is really, IMO, a setback in the evolution of computing.


> It is definitely not, e.g., an assertion on the operand because UB can't happen.

C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen. After all, a program with UB is ill-formed and therefore shouldn't exist!

I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.


> C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

I disagree on the logic from "ill-formed" to "assume it doesn't happen".

> I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.

I admit I don't differentiate those two words. I think they are just word-play.


The C standard defines them very differently though:

  undefined behavior
    behavior, upon use of a nonportable or erroneous program
    construct or of erroneous data, for which this International
    Standard imposes no requirements

  unspecified behavior
    use of an unspecified value, or other behavior where this
    International Standard provides two or more possibilities
    and imposes no further requirements on which is chosen in
    any instance
Implementations need not, but obviously may, assume that undefined behavior does not happen. However the program behaves when undefined behavior is invoked is simply how the compiler chose to implement that case.


"Nonportable" is a significant element of this definition. A programmer who intends to compile their C program for one particular processor family might reasonably expect to write code which makes use of the very-much-defined behavior found on that architecture: integer overflow, for example. A C compiler which does the naively obvious thing in this situation would be a useful tool, and many C compilers in the past used to behave this way. Modern C compilers which assume that the programmer will never intentionally write non-portable code are.... less helpful.


> I disagree on the logic from "ill-formed" to "assume it doesn't happen".

Do you feel like elaborating on your reasoning at all? And if you're going to present an argument, it'd be good if you stuck to the spec's definitions of things here. It'll be a lot easier to have a discussion when we're on the same terminology page here (which is why specs exist with definitions!)

> I admit I don't differentiate those two words. I think they are just word-play.

Unfortunately for you, the spec says otherwise. There's a reason there's 2 different phrases here, and both are clearly defined by the spec.


That's the whole point of UB though: the programmer helping the compiler deduce things. It's too much to expect the compiler to understand your whole program well enough to know a+b doesn't overflow. The programmer might understand it doesn't, though. The compiler relies on that understanding.

If you don't want it to rely on that, insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Whining about UB is like reading Shakespeare to your dog and complaining it doesn't follow. It's not that smart. You are though. If you want it to check for an overflow or whatever there is a one liner to do it. Just insert it into your code.
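
For example, the kind of explicit check being suggested might look like this (a sketch; the names are illustrative):

  #include <limits.h>

  // report failure instead of letting the signed addition overflow (which would be UB)
  int checked_add(int a, int b, int *out) {
      if ((b > 0 && a > INT_MAX - b) ||
          (b < 0 && a < INT_MIN - b))
          return 0;
      *out = a + b;
      return 1;
  }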


> That's the whole point of UB though

No, the whole (entire, exclusive of that) point of undefined behaviour is to allow legitimate compilers to generate sensible and idiomatic code for whichever target architecture they're compiling for. E.g., a pointer dereference can just be `ld r1 [r0]` or `st [r0] r1`, without paying any attention to the possibility that the pointer (r0) might be null, or that there might be memory-mapped IO registers at address zero that a read or write could have catastrophic effects on.

It is not a licence to go actively searching for unrelated things that the compiler can go out of its way to break under the pretense that the standard technically doesn't explicitly prohibit a null pointer dereference from setting the pointer to a non-null (but magically still zero) value.


If you don't want the compiler to optimize that much then turn down the optimization level.


> If you don't want it to rely on it insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Given that even experts routinely fail to write C code that doesn't have UB, available evidence is that it's practically impossible.


> So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

They are allowed to do so, but in practice this choice is not helpful.


On the contrary, it is quite helpful–it is how C optimizers reason.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend to before the first input that makes UB inevitable.


They might be referring to eg. the `_Nonnull` annotation being added to memset. The result is that this:

   if (ptr == NULL) {
      set_some_flag = true;
   } else {
      set_some_flag = false;
   }
   memset(ptr, 0, size);
Will never see `set_some_flag == true`, as the memset call guarantees that ptr is not null, otherwise it's UB, and therefore the earlier `if` statement is always false and the optimizer will remove it.

Now the bug here is changing the definition of memset to match its documentation a solid, what, 20? 30? years after it was first defined, especially when that "null isn't allowed" isn't useful behavior. After all, every memset ever implemented already totally handles null w/ size = 0 without any issue. And it was indeed rather quickly reverted as a change. But that really broke people's minds around UB propagation with modern optimizing passes.


False. If a program triggers UB, then the behavior of the entire program run is invalid.

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

-- https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...


Executing the program with that input is the key term. The program can't "take back" observable effects that happen before the input is completely read, and it can't know before reading it whether the input will be one that results in an execution with UB. This is a consequence of basic causality. (If physical time travel were possible, then perhaps your point would be valid.)


The standard does permit time-travel, however. As unlikely as it might seem, I could imagine some rare scenarios in which something seemingly similar happens -- let's say the optimiser reaching into gets() and crashing the program prior to the gets() call that overflows the stack.


Time travel only applies to an execution that is already known to contain UB. How could it know that the gets() call will necessarily overflow the stack, before it actually starts reading the line (at which point all prior observable behavior must have already occurred)?


It doesn't matter how it knows. The standard permits it to do that. The compiler authors will not accept your bug report.


If you truly believe so, then can you give an example of input-conditional UB causing unexpected observable behavior, before the input is actually read? This should be impossible, since otherwise the program would have incorrect behavior if a non-UB-producing input is given.


If it's provably input-conditional then of course it's impossible. But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB, and it doesn't have to implement "possible" non-UB-containing invocations if you can't find them. E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, one that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.


> If it's provably input-conditional then of course it's impossible.

My entire point pertains to programs with input-conditional UB: that is, programs for which there exists an input that makes it result in UB, and there also exists an input that makes it not result in UB. Arguably, it would be more difficult for the implementation to prove that input-dependent UB is unconditional: that every possible input results in UB, or that no possible input results in UB.

> But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB

Indeed, the standard places no requirements on the observable effects of an execution that eventually results in UB at some point in the future. But if the UB is input-conditional, then a "good" execution and a "bad" execution are indistinguishable until the point that the input is entered. Therefore, the implementation is required to correctly perform all observable effects sequenced prior to the input being entered, since otherwise it would produce incorrect behavior on the "good" input.

> E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, one that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.

That only works because the loop has no observable effects, and the standard says it's UB if it doesn't halt, so the compiler can assume it does nothing but halts. As noted on https://blog.regehr.org/archives/140, if you try to print the resulting values, then the compiler is actually required to run the loop to determine the results, either at compile time or runtime. (If it correctly proves at compile time that the loop is infinite, only then can it replace the program with one that does whatever.)

It's also irrelevant, since my point is about programs with input-conditional UB, but the FLT program has unconditional UB.


How this might happen is that one branch of your program may have unconditional undefined behavior, which can be detected at the check itself. This would let a compiler elide the entire branch, even side effects that would typically run.


The compiler can elide the unconditional-UB branch and its side effects, and it can elide the check itself. But it cannot elide the input operation that produces the value which is checked, nor can it elide any side effects before that input operation, unless it can statically prove that no input values can possibly result in the non-UB branch.


That example doesn't contradict LegionMammal978's point though, if I understood correctly. He's saying that the 'time-travel' wouldn't extend to before checking the conditional.


Personally, I've found that some of the optimizations cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, then one pass of optimization and another together can CAUSE undefined behavior.

When I learned this, if it was and is correct, I felt that one could be betrayed by the compiler.


Optimizations themselves (except for perhaps -ffast-math) can't cause undefined behavior: the undefined behavior was already there. They can just change the program from behaving expectedly to behaving unexpectedly. The problem is that so many snippets, which have historically been obvious or even idiomatic, contain UB that has almost never resulted in unexpected behavior. Modern optimizing compilers have only been catching up to these in recent years.


There have been more than a few compiler bugs that have introduced UB and then that was subsequently optimized, leading to very incorrect program behavior.


A compiler bug cannot introduce UB by definition. UB is a contract between the coder and the C language standard. UB is solely determined by looking at your code, the standard, and the input data; it is independent of the compiler. If the compiler converts UB-free code into misbehavior, then that's a compiler bug / miscompilation, not an introduction of UB.


A compiler bug is a compiler bug, UB or not. You might as well just say "There have been more than a few compiler bugs, leading to very incorrect program behavior."


The whole thread is about how UB is not like other kinds of bugs. Having a compiler optimization erroneously introduce a UB operation means that downstream the program can be radically altered in ways (as discussed in thread) that don't happen in systems without the notion of a UB.

While it's technically true that any compiler bug (in any system) introduces bizarre, incorrect behavior into a program, UB just supercharges the things that can go wrong due to downstream optimizations. And incidentally, makes things much, much harder to diagnose.


I just don't think it makes much sense to say that an optimization can "introduce a UB operation". UB is a property of C programs: if a C program executes an operation that the standard says is UB, then no requirement is imposed on the compiler for what should happen.

In contrast, optimizations operate solely on the compiler's internal representation of the program. If an optimization erroneously makes another decide that a branch is unreachable, or that a condition can be replaced with a constant true or false, then that's not "a UB operation", that's just a miscompilation.

The latter set of optimizations is just commonly associated with UB, since C programs with UB often trigger those optimizations unexpectedly.


LLVM IR has operations that have UB for some inputs. It also has poison values that act...weird. They have all the same implications of source-level UB, so I see no need to make a distinction. The compiler doesn't.


Any optimization that causes undefined behavior is bugged – please report them to your compiler's developers.


By definition an optimisation can't cause UB, as UB is a language-level construct.

An optimisation can cause a miscompilation. Those happen and are very annoying.


Miscompilations are rarer and less annoying in compilers that do not have the design behaviour of compiling certain source code inputs into bizarre nonsense that bears no particular relation to those inputs.


You realize these two statements are equivalent, right?

> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible. As long as such compiler conforms to the C standard, you have every right to promote this alternative. Don't shame other people building or using optimizing compilers.


> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

Mainstream C compilers actually make special exceptions for the undefined behaviour that's seen in popular benchmarks so that they can continue to "win" at them. The whole exercise is a pox on the industry; maybe at some point in the past those benchmarks told us something useful, but they're doing more harm than good when people use them to pick a language for modern line-of-business software, which is written under approximately none of the same conditions or constraints.

> Don't shame other people building or using optimizing compilers.

The people who are contributing to security vulnerabilities that leak our personal information deserve shame.


It's true that I don't like security vulnerabilities either. I think the question boils down to, whose responsibility is it to avoid UB - the programmer, compiler, or the standard?

I view the language standard as a contract, an interface definition between two camps. If a programmer obeys the contract, he has access to all compliant compilers. If a compiler writer obeys the contract, she can compile all compliant programs. When a programmer deviates from the contract, the consequences are undefined. Some compilers might cater to these cases (e.g. -fwrapv, GNU language extensions) as a superset of all standard-compliant programs.

Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.


> Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.

That feels backwards in terms of how the C standard actually gets developed - my impression is that most things that eventually get standardised start life as a vendor-specific language extensions, and it's very rare to have the C standard to introduce something and the compiler vendors then follow.

And really in a lot of cases the concept of UB isn't the problem, it's the compiler culture that's grown up around it. For example, the original reason for null dereference being UB was to allow implementations to trap on null dereference, on architectures where that's cheap, without being obliged to maintain strict ordering in all code that dereferences pointers. It's hard to imagine how what the standard specifies about that case could be improved; the problem is compiler writers prioritising benchmark performance over useful diagnostic behaviour.


> If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible.

Most optimizing compilers can do this already, it's just the -O0 flag.


I tried compiling "int x = 1 / 0;" in both the latest GCC and Clang with -O0 on x86-64 on Godbolt. GCC intuitively preserves the calculation and emits an idiv instruction. Clang goes ahead and does constant folding anyway, and there is no division to be seen. So the oft-repeated advice of using -O0 to try to compile the code as literally as possible in hopes of diagnosing UB or making it behave sanely, is not great advice.


I recently dealt with a bit of undefined behavior (in unsafe Rust code, although the behavior here could similarly happen in C/C++) where attempting to print a value caused it to change. It's hard to overstate how jarring it is to see code that says "assert that this value isn't an error, print it, and then try to use it", and have the assertion pass but then have the value printed out as an error and then panic when trying to use it. There's absolutely no reason why this can't happen, since "flipping bits of the value you tried to print" doesn't count as potential UB any less than a segfault, but it can be hard to turn off the part of your brain that is used to assuming that values can't just arbitrarily change at any point in time. "Ignore the rest of the program and do whatever you want after a single mistake" is not a good failure mode, and it's kind of astonishing to me that people are mostly just fine with it because they think they'll be careful enough to never make a mistake, or that they'll be lucky enough that it doesn't completely screw them over.

The only reason we use unsafe code on my team's project is because we're interfacing with C code, so it was hard not to come away from that experience thinking that it would be incredibly valuable to shrink the amount of interfacing with C as small as possible, and ideally to the point where we don't need to at all.


It's not insidious at all. The C compiler offers you a deal: "Hey, my dear programmer, we are trying to make an efficient program here. Sadly, I am not sophisticated enough to deduce a lot of things, but you can help me! Here are some of the rules: don't overflow integers, don't dereference null pointers, don't go outside of array bounds. You follow those and I will fulfill my part of making your code execute quickly".

The deal is known and fair. Just be a responsible adult about it: accept it, live with the consequences and enjoy the efficiency gains. You can reject it, but then don't use arrays without a bounds check (a lot of libraries out there offer that), check your integer bounds or use a sanitizer, and check your pointers for nulls before dereferencing them; there are many tools out there to help you. Or... just use another language that does all that for you.


UB was insidious to me because I was not taught the rules (this was back in years 2005 to 2012; maybe it got more attention now), it seemed my coworkers didn't know the rules and they handed me codebases with lots of existing hidden UB, and UB blew up in my face in very nasty ways that cost me a lot of debugging time and anguish.

Also, the UB instances that blew up had already been tested to work correctly... on some other platform (e.g. Windows vs. Linux) or on some other compiler version. There are many things in life and computing where, when you make a mistake, you find out quickly. If you touch a hot pan, you get a burn and quickly pull away. But if you miswire an electrical connection, it could slowly come loose over a decade and start a fire behind the wall. Likewise, a wrong piece of code that seems to behave correctly at first lulls the author into a false sense of security. By the time a problem appears, the author could be gone, or might not recall which line, out of the thousands written years ago, caused the issue.

Three dictionary definitions for insidious, which I think are all appropriate: 1) intended to entrap or beguile 2) stealthily treacherous or deceitful 3) operating or proceeding in an inconspicuous or seemingly harmless way but actually with grave effect.

I'm neutral now with respect to UB and compilers; I understand the pros and cons of doing things this way. My current stance is to know the rules clearly and always stay within their bounds, to write code that never triggers UB to the best of my knowledge. I know that testing compiled binaries produces good evidence of correct behavior but cannot prove the nonexistence of UB.


I don't think this is the whole story. There are certain classes of undefined behavior that some compilers actually guarantee to treat as valid code. Type punning through unions in C++ comes to mind: GCC says go ahead, the standard says UB. In cases like these, it really just seems like the standard is lazy.
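For illustration, here's a small sketch of the kind of union punning being discussed (written in C, where the standard permits it; in C++ it is UB on paper, though GCC documents it as supported):

    #include <stdint.h>
    #include <stdio.h>

    union pun {
        float    f;
        uint32_t u;
    };

    int main(void) {
        union pun p;
        p.f = 1.0f;
        /* Reading a member other than the one last written reinterprets the
           bytes; assumes 32-bit IEEE-754 floats (prints 0x3f800000). */
        printf("0x%08x\n", (unsigned)p.u);
        return 0;
    }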


> The deal is known and fair.

It often isn't. C is often falsely advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect. Some writers may be used to pre-standardization compilers that are much less hostile than modern GCC/Clang.


> C is often [correctly, but misleadingly] advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect.

Because that's what it is. What they don't tell you is that the most heavily-developed two (or more) compilers for it (which you might otherwise assume meant the two best compilers), are malware[0] that actively seek out excuses to inject security vulnerabilities (and other bugs) into code that would work fine if compiled to the assembly that any reasonable author would expect.

0: http://web.archive.org/web/20070714062657/http://www.acm.org... Reflections on Trusting Trust (Ken Thompson):

> Figure 6 shows a simple modification to the compiler that will deliberately miscompile source whenever a particular pattern is matched. If this were not deliberate, it would be called a compiler "bug". Since it is deliberate, it should be called a "Trojan horse".


Nice way to put down the amazing work of compiler authors. It's not malware; you just don't understand how to use it. If you don't want the compilers to do crazy optimisations, turn down the optimisation level. If you want them to check for things like null pointers or integer overflow or array bounds at runtime, then just turn on the sanitizers those compiler writers kindly provided to you.

You just want all of it: fast optimizing compiler, one that checks for your mistakes but also one that knows when it's not a mistake and still generates fast code. It's not easy to write such a compiler. You can tell it how to behave though if you care.


> If you want them to check for things like null pointers or integer overflow or array bounds

I specifically don't want them to check for those things; that is the fucking problem in the first place! When I write:

  x = *p;
I want it compiled to a damn memory access. If I meant:

  x = *p; __builtin_assume_non_null(p);
I'd have damn well written that.


Socialism is when the government does something I don't like, and Reflections on Trusting Trust is when my compiler does something I don't like. The paper has nothing to do with how optimizing compilers work. Compiling TCC with GCC is not going to suddenly make it into a super-optimizing UB-exploiting behemoth.


This article on undefined behavior looks pretty good (2011?)

https://blog.regehr.org/archives/213

A main point in the article is function classification, i.e. 'Type 1 Functions' are outward-facing, and subject to bad or malicious input, so require lots of input checking and verification that preconditions are met:

> "These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1."

Internal utility functions that only use data already filtered through Type 1 functions are called "Type 3 Functions", i.e. they can result in UB if given bad inputs:

> "Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented."

Incidentally I found that article from the top link in this Chris Lattner post on the LLVM Project Blog, "What Every C Programmer Should Know About Undefined Behavior":

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

In particular this bit on why internal functions (Type 3, above) shouldn't have to implement extensive preconditions (pointer dereferencing in this case):

> "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic. This would have an extremely high cost for many numerical and other applications, as well as breaking binary compatibility with every existing C library."

Basically, the conclusion appears to be that any data input to a C program by a user, socket, file, etc. needs to go through a filtering and verification process of some kind, before being handed to over to internal functions (not accessible to users etc.) that don't bother with precondition testing, and which are designed to maximize performance.

In C++ I suppose, this is formalized with public/private/protected class members.
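To make that split concrete, here's a rough sketch with invented names; the outer function validates untrusted input, the inner one only documents its precondition:

    #include <stdbool.h>
    #include <stddef.h>

    /* "Type 3": internal helper. Documented (but unchecked) precondition:
       buf points to at least len readable bytes. */
    static unsigned sum_bytes(const unsigned char *buf, size_t len) {
        unsigned sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* "Type 1": API-level entry point. Behaves well for all inputs. */
    bool checked_sum(const unsigned char *buf, size_t len, unsigned *out) {
        if (buf == NULL || out == NULL)
            return false;  /* reject bad input instead of risking UB */
        *out = sum_bytes(buf, len);
        return true;
    }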


I haven’t used C or C++ for anything, but in writing a Game Boy emulator I ran into exactly that kind of memory corruption pain. An opcode I implemented wrong causes memory to corrupt, which goes unnoticed for millions of cycles or sometimes forever depending on the game. Good luck debugging that!

My lesson was: here’s a really really good case for careful unit testing.


Yeah for that kind of stuff you want tests on every single op checking they make exactly the change you expect.


I would go one step farther: the documentation will say it is undefined behavior, but the compiler doesn't have to warn you. Here's an example from the man page for sprintf:

  sprintf(buf, "%s some further text", buf);
If you miss that section of the manual, your code may work, leading you to think the behavior is defined.

Then you will have interesting arguments with other programmers about what exactly is undefined behavior, e.g. what happens for

  sprintf(buf, "%d %d", f(i), i++);


I remember reading a blog post a couple of years back on undefined behavior from the perspective of someone building a compiler. The way the standard defines undefined behavior (pun not intended), a compiler writer can basically assume undefined behavior never occurs and stay compliant with the standard.

This opens the door to some optimizations, but also allows compiler writers to reduce the complexity of the compiler itself in some places.

I'm being very vague here, because I have no actual experience with compiler internals, nor that level of language-lawyer pedantry. The blog's name was "Embedded in academia", I think, you can probably still find the blog and the particular post if it sounds interesting.


Yeah, a decent chunk of UB is about reducing the burden on the compiler. Null derefs are an obvious example. If it were defined behavior, the compiler would be endlessly adding, and then attempting to optimize away, null checks. Which isn't something anyone actually wants when reaching for C/C++.

Similarly with C/C++ it's not actually possible for the compiler to ensure you don't access a pointer past the end of the array - the array size often isn't "known" in a way the compiler can understand.
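A small illustration of why the check can't be automatic: once an array is passed to a function, only a pointer crosses the boundary, so the length has to travel separately (names invented):

    #include <stddef.h>
    #include <stdio.h>

    /* The callee sees only a pointer; it cannot recover the caller's array size. */
    static void print_all(const int *p, size_t n) {
        for (size_t i = 0; i < n; i++)
            printf("%d ", p[i]);
        printf("\n");
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        print_all(a, sizeof a / sizeof a[0]);  /* the size is only known here */
        return 0;
    }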


> Which isn't something anyone actually wants when reaching for C/C++.

Disagree. I think a lot of people want some kind of "cross-platform assembler" (i.e. they want e.g. null deref to trap on architectures where it traps, and silently succeed on architectures where it succeeds), and get told C is this, which it very much isn't.


Except every other sane systems programming language does indeed do null checks, even those older than C, but they didn't come with UNIX, so here we are.


I'll tell you what happens when someone writes:

      sprintf(buf, "%d %d", f(i), i++);
They get told to rewrite it.


Good point, actually. Many cases of undefined behavior are clearly visible to an experienced C programmer when they review someone else’s code.


By whom? Most places still don't do proper code reviews or unit testing.


Was rewriting the stack due to undefined behavior or was it due to a logic error, e.g. improper bounds calculation?


Isn’t all UB a result of logic errors?

Writing beyond the end of allocated memory (due to incorrect bounds calculation ) is an example of undefined behaviour


No, even type-punning properly allocated memory (e.g. using memory to reinterpret the bits of a floating point number as an integer) through pointers is UB because compilers want to use types for alias analysis[1]. In order to do that "properly" you are supposed to use a union. In C++ you are supposed to use the reinterpret_cast operator.

[1] Which IMO goes back to C's original confusion of mixing up machine-level concepts with language-level concepts from the get-go, leaving optimizers no choice but unsound reasoning and blaming programmers when they get it wrong. Something something numerical loops and supercomputers.


I believe using reinterpret_cast to reinterpret a float as an int is undefined behavior, because I don't believe that follows the type aliasing rules [1]. However, you could reinterpret a pointer to a float as a pointer to char, unsigned char, or std::byte and examine it that way.

As far as I'm aware, it's safe to use std::memcpy for this, and I believe compilers recognize the idiom (and will not actually emit code to perform a useless copy).

[1] https://en.cppreference.com/w/cpp/language/reinterpret_cast
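For what it's worth, a minimal sketch of that memcpy idiom, shown in C (the C++ std::memcpy version is analogous); assumes 32-bit IEEE-754 floats:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = 1.0f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      /* well-defined: no aliasing violation */
        printf("0x%08x\n", (unsigned)bits);  /* prints 0x3f800000 */
        return 0;
    }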


That's like saying all bugs are undefined behavior. C lets you write to your own stack, so if you corrupt the stack due to an application error (e.g. bounds check), then that's just a bug because you were executing fully-defined behavior. Examples of undefined behavior would be things like dividing by 0 where the result of that operation can differ across platforms because the specific behavior wasn't defined in the language spec.


Writing past the end of an array is defined as UB.

Not all bugs are UB, you can have logic errors of course. But stack corruption is I believe always triggered by UB.


There are some complicated UBs that arise when casting to different types that are not obviously logic errors (can't remember the specifics but remember dealing with this in the past).


As a curious FE developer with no C experience, this was very interesting. Thanks for writing the article!


This looks decent, but I'm (highly) opposed to recommending `strncpy()` as a fix for `strcpy()` lacking bounds-checking. That's not what it's for; it's weird and should be considered as obsolete as `gets()` in my opinion.

If available, it's much better to do it the `snprintf()` way as I mentioned in a comment last week, i.e. replace `strcpy(dst, src)` with `snprintf(dst, sizeof dst, "%s", src)` and always remember that "%s" part. Never put src there, of course.

There's also `strlcpy()` on some systems, but it's not standard.
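A small sketch of the pattern (identifiers invented); note that `sizeof dst` only gives the buffer size because `dst` is an actual array here, a caveat raised further down the thread:

    #include <stdio.h>

    void copy_name(const char *src) {
        char dst[32];
        /* Always NUL-terminates; silently truncates if src is too long. */
        snprintf(dst, sizeof dst, "%s", src);
        puts(dst);
    }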


strncpy does have its odd and rare use-case, but 100% agree that it is not at all a “fix” for strcpy; it's not designed for that purpose and is unsuited to it, being both unsafe (does not guarantee NUL-termination) and unnecessarily costly (fills the destination with NULs).

The strn* category was generally designed for fixed-size, NUL-padded content (though not all of them, because why be coherent?), so the entire item is incorrect, and it really makes the whole list suspicious.


Then there are strn*_s since C11 (and available before that on many platforms) which do exactly what you want.


Lol no. These are the Annex K stuff which Microsoft got into the standard, which got standardised with a different behaviour than Windows’ (so even on Windows following the spec doesn’t work) and which no one else wants to implement at all.

And they don’t actually “do exactly what you want”, see for instance N1967 (“field experience with annex k”) which is less than glowing.


> sizeof dst

Note that this only works if dst is a stack-allocated (in the same function) array and not a char *


> and always remember that "%s" part. Never put src there, of course

> Note that this only works if dst is a stack allocated array

Even this "ideal" solution is full of pitfalls. The state of memory safety is so sad in the world of C.


Yes it should be read as a placeholder for whatever you need to do.

Could be an array inside a struct too for instance, that is quite common.


Ah, that was one of my less considered additions - thank you for the feedback!


Would it be a sin to use memcpy() and leave things like input validation to a separate function? I'm nervous any time somebody takes a function with purpose X and uses it for purpose Y.


Uh, isn't using `memcpy()` to copy strings doing exactly that?

The problem is that `memcpy()` doesn't know about (of course) string terminators, so you have to do a separate call to `strlen()` to figure out the length, thus visiting every character twice which of course makes no sense at all (spoken like a C programmer I guess ... since I am one).

If you already know the length due to other means, then of course it's fine to use `memcpy()` as long as you remember to include the terminator. :)


The reason you would want to use memcpy would be if 1) you already know what the length is, 2) you need a custom validator for your input, 3) you don't want to validate your input (though snprintf() does do some of that), or 4) the string may include nulls or there is no null terminator.

But the fifth reason may be that depending on snprintf as your "custom-validator-and-encoder-plus-null-terminator" may introduce subtle bugs in your program if you don't know exactly what snprintf is doing under the hood and what its limitations are. By using memcpy and a custom validator, you can be more explicit about how data is handled in your program and avoid uncertainty.

(by "validate" I mean handle the data as your program expects. this could be differentiating between ASCII/UTF-8/UTF-16/UTF-32, adding/preserving/removing a byte-order mark, eliminating non-printable characters, length requirements, custom terminators, or some other requirement of whatever is going to be using the new copy of the data)


If you really need a fast strcpy then probably not, but in most situations snprintf will do the job just fine. And will prevent heartache.


snprintf is pretty slow, partly because it returns things people typically don't want.


When I first learned C - which also was my first contact with programming at all - I did not understand how pointers work, and the book I was using was not helpful at all in this department. I only "got" pointers like three or four years later, fortunately programming was still a hobby at that point.

Funnily, when I felt confident enough to tell other people about this, several immediately started laughing and told me what a relief it was to hear they weren't the only ones with that experience.

Ah, fun times.

EDIT: One book I found invaluable when getting serious about C was "The New C Standard: A Cultural and Economic Commentary" by Derek Jones (http://knosof.co.uk/cbook). You can read it for free because the book ended up being too long for the publisher's printing presses or something like that. It's basically a sentence-by-sentence annotated version of the C standard (C99 only, though) that tries to explain what the respective sentence means to C programmers and compiler writers and how other languages (mostly C++) deal with the issue at hand, but also how this impacts the work of someone developing coding guidelines for large teams of programmers (which was how the author made a living at the time, possibly still is). It's more than 1500 pages and a very dense read, but it is incredibly fine-grained and in-depth. Definitely not suitable for people who are just learning C, but if you have read "Expert C Programming: Deep C Secrets" and found it too shallow and whimsical, this book was written for you.


Having basic experience in any assembly language makes pointers far more clear.

"Addressing modes," where a register and some constant are used to calculate the source or target of a memory operation, make the equivalence of a[b]==*(a+b) much more obvious.

I also wonder about the author's claims that a char is almost always 8 bits. The first SMP machine that ran Research UNIX was a 36-bit UNIVAC. I think it was ASCII, but the OS2200/EXEC8 SMP matured in 1964, so this was an old architecture at the time of the port.

"Any configuration supplied by Sperry, including multiprocessor ones, can run the UNIX system."

https://www.bell-labs.com/usr/dmr/www/otherports/newp.pdf


> Having basic experience in any assembly language makes pointers far more clear.

That's a key point. I came to C after several years of programming in assembly and a pointer was an obvious thing. But I can see that for someone coming to C from higher level languages it might be an odd thing.


There was an "official" C compiler for NOS running on the CDC Cyber. As I recall, 18-bit address, 60-bit words, more than one definition of a 'char' (12-bit or 5-bit, I think). It was interesting. There were a lot of strange architectures with a C compiler.

I would also point out architectures like the 8051 and 8086 made (make...they are still around) pointer arithmetic interesting.


The C standard, as I recall, effectively defines a byte as at least 8 bits. I've read that some DSP platforms use a byte (and thus a char) that is 24 bits wide, because that's what audio samples use, but supposedly those platforms rarely, if ever, handle any actual text. The standard library has a macro, CHAR_BIT in <limits.h>, that tells you how many bits a char has.

I think I remember reading about a C compiler for the PDP-10 (or Lisp Machine?), also a 36-bit machine, that used a 9 bit byte. There even exists a semi-jocular RFC for UTF-9 and UTF-18.
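For anyone curious about their own platform, CHAR_BIT is easy to check:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        printf("bits in a char: %d\n", CHAR_BIT);  /* 8 on virtually all hosted platforms today */
        return 0;
    }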


Pointers are by far the most insidious thing about C. The problem is that nobody who groks pointers can understand why they had trouble understanding them in the first place.

Once you understand, it seems so obvious that you cannot imagine not understanding what a pointer is, but at the beginning, trying to figure out why the compiler won't let you assign a pointer to an array, like `char str[256]; str = "asdf"`, is maddening.

One thing I think would benefit many is if we considered "arrays" in C to be an advanced topic, and focused on pointers only; treating "malloc" as a magical function until the understanding of pointers and indexing is so firmly internalized that you can just add on arrays to that knowledge. Learning arrays first and pointers second is going to break your brain because they share so much syntax, but arrays are fundamentally a very limited type.


When I've had to explain it, I describe memory as a street with house numbers (which are memory addresses).

A house can store either people, or another house number (for some other address).

If you use a person as a house number, it will inflict grievous harm upon that person. If you use a house number as a person, it will blow up some random houses. Very little in the language stops you from doing this, so you have to be careful not to confuse them.

Then I describe what an MMU does with a TLB, at which point the eyes glaze over.


From my memory, the syntax of pointers really tripped me up. E.g., the difference between * and & in declaration vs dereferencing. I think this is especially confusing for beginners when you add array declarations to the mix.


Agreed. Complex declarations (e.g., array of pointers to functions) are non-intuitive:

http://www.ericgiguere.com/articles/reading-c-declarations.h...

https://cdecl.org/

I learned this from the book Expert C Programming by Peter van der Linden.
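A couple of examples of the kind of declarations those resources help decode (identifiers invented):

    int *(*fp)(void);         /* fp: pointer to a function returning pointer to int */
    int (*handlers[4])(int);  /* handlers: array of 4 pointers to functions
                                 taking an int and returning an int */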


I don't remember C using & in declarations. Is it a recent addition?


What is so difficult about the concept of a memory address? Is it the C syntax? Asking because I personally have never struggled with this.


That's the problem! I can't tell you what is difficult because it seems so incredibly obvious to me now.

When I was ~12, I had a lot of trouble with it, and the only thing I remember from those times is various attempts to figure out why the compiler wouldn't let me assign a string value to an array. What the hell is an "lvalue", Mr. Compiler?

Now I look at the assignment statement above and I recoil in horror, but at the time it seemed very confusing to me, especially since `char *str; str = "abcd";` works so well. The difference between the two (as far as intention goes) is vast in retrospect, but for some reason I had trouble with it back then.


The pointer/array confusion in C makes this way harder to understand than it has to be. The other thing is the syntax, which is too clever and too hard to parse in your head for complex expressions. Both of these things also tend to not be explained very well to beginners, probably partly due to the fact that explaining it in detail is complex and would perhaps go over the beginner's head. It's also stupid, so you'd probably have to explain how it turned out to be this complex.


> Is it the C syntax?

Pretty sure that this is a big factor, I’m not aware of any recent languages that put type information before and after the variable name. Nowadays there’s always a clear distinction between the variable name and the type annotation.


For this example:

   char str[256]; str = "asdf"
neither str nor "asdf" is a pointer-type expression; they're both arrays (which is exposed by sizeof). The reason why this doesn't work is because C refuses to treat arrays as first-class value types - which is not an obvious thing to do regardless of how well you understand pointers. Other languages with arrays and pointers generally haven't made this mistake.


One thing that helped me understand pointers was understanding that a pointer is just a memory address.

When I was still a noob programmer, my instructor merely stuck to words like "indirection" and "dereferencing" which are all fine and dandy, but learning that a pointer is just a memory address instantly made it click.

Pointers are a $1000 topic for a $5 concept.


Well there’s a little bit more to it. There is a type involved, and then there’s pointer arithmetic.


Well yes, but those aren't hard.

Pointer arithmetic is merely knowing that any addition/subtraction done to a pointer is multiplied by the size of the type being pointed to. So if you're pointing to a 64-byte struct, then "ptr++;" adds 64 to the pointer.
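A quick sketch of that scaling (struct size chosen to make the arithmetic obvious):

    #include <stdio.h>

    struct big { char bytes[64]; };

    int main(void) {
        struct big arr[2];
        struct big *p = arr;
        p++;  /* advances by sizeof(struct big), i.e. 64 bytes, not 1 */
        printf("%td\n", (char *)p - (char *)arr);  /* prints 64 */
        return 0;
    }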


Typed pointers interact with aliasing in "interesting" ways.


When I’m teaching (a very high-level language), I make a point of saying that a variable is a named memory location. Where is that location? We don’t know. Now, I am absolutely aware that the address isn’t the “real” location, but I have this idea that talking about variables in this way might help them grok the lower-level concept later on.


My experience with pointers was the inverse of yours. My first programming language was Java, and I spent many hours puzzling out reference types (and how they differed from primitive types). I only managed to understand references after somebody explained them as memory addresses (e.g. the underlying pointer implementation). When I later learned C, I found pointers to be delightfully straightforward. Unlike references in Java, pointers are totally upfront about what they really are!


When I got to Java, I experienced the same problem. Much later, I learned C# and found that it apparently had observed and disarmed some of Java's traps, but it also got a little baroque in some places, e.g. with pointers, references, in/out parameters, value types, nullable types, ... A lot of the time one doesn't need it, but it is a bit of a red flag if a language has two similar concepts expressed in two similar ways but with "subtle" differences.

I did like the const vs readonly solution they came up with. I wish Go (my current goto (pun not necessarily unintentional) language) had something similar


From over a decade ago, I really enjoyed this clay animation on C pointers: https://www.youtube.com/watch?v=5VnDaHBi8dM , http://cslibrary.stanford.edu/104/


"The C Puzzle Book" is the thing I recommend to anyone who knows they want to have a good, working understanding of how to use pointers programming in C.

Many years ago I did the exercises on the bus in my head, then checking the answers to see what I got wrong and why over the space of a week or so. It's a really, really good resource for anyone learning C. It seemed to work for several first year students who were struggling with C in my tutorials as well and they did great. Can't recommend it highly enough to students and the approach to anyone tempted to write an intro C programming text.


I would highly recommend the video game Human Resource Machine for getting a really good understanding of how pointers work.

It's more generally about introducing assembly language programming (sort of) in gradual steps, so you'll need to play through a fair chunk of the game before you get to pointers. But by the time you get to them, they will seem like the most obvious thing in the world. You might even have spent the preceding few levels wishing you had them.


> Declaring a variable or parameter of type T as const T means, roughly, that the variable cannot be modified.

I would add "... cannot be modified through that pointer". (Yes, in fairness, they did say "roughly".) For example consider the following:

    void foo(int* x, const int* y)
    {
        printf("y before: %d\n", *y);
        *x = 3;
        printf("y after: %d\n", *y);
    }
This will print two different values if you have `int i = 1` and you call `foo(&i, &i)`. This is the classic C aliasing rule. The C standard guarantees that this works even under aggressive optimisation (in fact certain optimisations are prevented by this rule), whereas the analogous Fortran wouldn't be guaranteed to work.


You already know this, but I would add that under strict aliasing rules, this is only valid because x and y point to the same type.

The most common example is when y is float* and someone tries to access its bitwise representation via an int*.

(Please correct me if I'm wrong)

https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...


A small detail: you probably meant

  printf("y before: %d\n", *y);


Oops you're right! Fixed now thanks.


*y in both printf, right?


I was born in '74, so I'm part of the last generation to start with C and go on to other, higher-level languages like Python or JavaScript. Going in this direction was natural. I was amazed by all the magic the higher-level languages offered.

Going the other direction is a bit more difficult apparently. "What do you mean it does not do that?". Interesting perspective indeed!


What was nice about C then was that, based on my study of CPUs at the time, you could pretty much get your head around what the CPU was doing. So you could learn the instructions (C) and the machine following them (the CPU).

When I got to modern CPUs it's so complex my eyes glazed over reading the explanation and I gave up trying to understand them.


I was born in the late 80s and C was my first language, in a community college intro to programming class.


I started coding with C and OCaml in 2019. Everything in between these two was so unnatural. With JavaScript as the worst of all


This was my experience with learning programming as well, however, I am 2x younger :)


Introductory programming courses at the University of Arizona were still taught in C when I was a freshman in 2008


I'm a decade younger and my university taught C for its intro to programming class.

Granted, it was a disaster of a programming class.


Some constructive feedback:

> Here are the absolute essential flags you may need.

I highly recommend including `-fsanitize=address,undefined` in there (docs: https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.h...).

(Edit: But probably not in release builds, as @rmind points out.)

> The closest thing to a convention I know of is that some people name types like my_type_t since many standard C types are like that

Beware that names beginning with "int"/"uint" and ending with "_t" are reserved in <stdint.h>.

[Edited; I originally missed the part about "beginning with int/uint", and wrote the following incorrectly comment: "That shouldn't be recommended, because names ending with "_t" are reserved. (As of C23 they are only "potentially reserved", which means they are only reserved if an implementation actually uses the name: https://en.cppreference.com/w/c/language/identifier. Previously, defining any typedef name ending with "_t" technically invokes undefined behaviour.)"]

The post never mentions undefined behaviour, which I think is a big omission (especially for programmers coming from languages with array index checking).

> void main() {

As @vmilner mentioned, this is non-standard (reference: https://en.cppreference.com/w/c/language/main_function). The correct declaration is either `int main(void)` or the argc+argv version.

(I must confess that I am guilty of using `int main()`, which is valid in C++ but technically not in C: https://stackoverflow.com/questions/29190986/is-int-main-wit...).

> You can cast T to const T, but not vice versa.

This is inaccurate. You can implicitly convert T* to const T*, but you need to use an explicit cast to convert from const T* to T*.


UPDATE regarding "_t" suffix:

POSIX reserves "_t" suffix everywhere (not just for identifiers beginning with "int"/"uint" from <stdint.h>); references: https://www.gnu.org/software/libc/manual/html_node/Reserved-..., https://pubs.opengroup.org/onlinepubs/9699919799/functions/V....

So I actually stand by my original comment that the convention of using "_t" suffix shouldn't be recommended. (It's just that the reasoning is for conformance with POSIX rather than with ISO C.)


Well, semantically, "size_t" makes sense to me ("the type of a size variable"), while "uint_t" does not ("the type of a uint variable"), because "uint" is already a type, obviously - just like "int".


> -fsanitize=address,undefined

In addition, I recommend -fsanitize=integer. This adds checks for unsigned integer overflow which is well-defined but almost never what you want. It also checks for truncation and sign changes in implicit conversions which can be helpful to identify bugs. This doesn't work if you pepper your code base with explicit integer casts, though, which many have considered good practice in the past.


Good one, thanks. Note that it requires Clang; GCC 12.2 doesn't have it.


Wow nice, I didn't know about this one. I can add some more which are less known. This is my current sanitize invocation (minus the addition of "integer" which I'll be adding, unless one of these other ones covers it):

  -fsanitize=address,leak,undefined,cfi,function
CFI has checks for unrelated casts and mismatched vtables which is very useful. It requires that you pass -flto or -flto=thin and -fvisibility=hidden.

You can read a comparison with -fsanitize=function here:

https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanit...

There's also TypeSanitizer, which isn't officially released, but is really interesting and should be able to be applied via a patch from the branch:

https://www.youtube.com/watch?v=vAXJeN7k32Y

https://reviews.llvm.org/D32199

  $ curl -L 'https://reviews.llvm.org/D32199?download=1' | patch -p1


I think "leak" is always enabled by "address". It's only useful if you want run LeakSanitizer in stand-alone mode. "integer" is only enabled on demand because it warns about well-defined (but still dangerous) code. You can also enable "unsigned-integer-overflow" and "implicit-conversion" separately. See https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#...


Ah I wasn't sure if LSAN was always enabled with ASAN, good to know -- ty!


Why the hell was "potentially reserved" introduced? How is it different from simply "reserved" in practice, except for the fact that such things can be missing? How do you even use a "potentially reserved" entity reliably? Write your own implementation for platforms where such an entity is not provided, and then conditionally not link it on the platforms where it actually is provided? Is the latter even possible?

Also, apparently, "function names [...] beginning with 'is' or 'to' followed by a lowercase letter" are reserved if <ctype.h> and/or <wctype.h> are included. So apparently I can't have a function named "touch_page()" or "issue_command()" in my code. Just lovely.


From https://www.open-std.org/JTC1/sc22/wg14/www/docs/n2625.pdf:

> The goal of the future language and library reservations is to alert C programmers of the potential for future standards to use a given identifier as a keyword, macro, or entity with external linkage so that WG14 can add features with less fear of conflict with identifiers in user’s code. However, the mechanism by which this is accomplished is overly restrictive – it introduces unbounded runtime undefined behavior into programs using a future language/library reserved identifier despite there not being any actual conflict between the identifier chosen and the current release of the standard. ...

> Instead of making the future language/library identifiers be reserved identifiers, causing their use to be runtime unbounded undefined behavior per 7.1.3p1, we propose introducing the notion of a potentially reserved identifier to describe the future language and library identifiers (but not the other kind of reservations like __name or _Name). These potentially reserved identifiers would be an informative (rather than normative) mechanism for alerting users to the potential for the committee to use the identifiers in a future release of the standard. Once an identifier is standardized, the identifier stops being potentially reserved and becomes fully reserved (and its use would then be undefined behavior per the existing wording in C17 7.1.3p2). These potentially reserved identifiers could either be listed in Annex A/B (as appropriate), Annex J, or within a new informative annex. Additionally, it may be reasonable to add a recommended practice for implementations to provide a way for users to discover use of a potentially reserved identifier. By using an informative rather than normative restriction, the committee can continue to caution users as to future identifier usage by the standard without adding undue burden for developers targeting a specific version of the standard.


So... instead of mandating that implementations warn about (re)defining a reserved identifier, they introduce another class of "not yet reserved identifiers" and advise implementations to warn about defining such identifiers in the user code — even though it's completely legal — until the moment the implementation itself actually uses/defines such an identifier, at which point warning about such redefinition in the user code — now illegal and UB — is no longer necessary or advised.

Am I completely misreading this or is this actually insane? Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?


The problem is that the traditional wording of C meant that any variable named 'top' was technically UB, because it begins with 'to'.

In practical terms, what compilers will do is, if C2y adds a 'togoodness' function, they will add a warning to C89-C2x modes saying "this is now a library function in C2y," or maybe even have an extension to use the new thing in earlier modes. This is what they already do in large part; it's semantic wording to make this behavior allowable without resorting to the full unlimited power of UB.


> Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?

The C23 change was mostly to downgrade some of the existing reserved identifiers from "reserved" to "potentially reserved". (It also added some new reserved and potentially reserved identifiers, but they seem reasonable to me.)


I still fail to see any practical difference between these two categories, except that the implementations are recommended to diagnose illegal-in-the-future uses of potentially reserved identifiers but are neither required nor recommended to diagnose actually illegal uses of reserved identifiers. There is also no way to distinguish p.r.i from r.i.

It also means that if an identifier becomes potentially reserved in C23 and reserved in C3X, then compiling a valid C11 program that uses it as C23 will give you a warning, which you can fix and then compile resulting valid C23 program as C3X without any problem; but compiling such a C11 program straight up as C3X will give you no warning and a program with UB.

Seriously, it boggles my mind. Just a) require diagnostics for invalid uses of reserved identifiers starting from C23, b) don't introduce new reserved identifiers, there is already a huge amount of them.


How can a (badly chosen) typedef name trigger _undefined behavior_, and not just, say, a compilation error...?

I find it difficult to imagine what that would even mean.


You can declare a type without (fully) defining it, like in

    typedef struct foo foo_t;
and then have code that (for example) works with pointers to it (foo_t *). If you include a standard header containing such a forward declaration, and also declare foo_t yourself, there might be no compilation error, but other translation units might use differing definitions of struct foo, leading to unpredictable behavior in the linked program.


One potential issue would be that the compiler is free to assume any type with the name `foobar_t` is _the_ `foobar_t` from the standard (if one is added), it doesn't matter where that definition comes from. It may then make incorrect assumptions or optimizations based on specific logic about that type which end up breaking your code.


The problem being that to trigger a compile error the compiler would have to know all its reserved type names ahead of time.

It is not required to do so, hence undefined behavior. You might get a wrong underlying type under that name.


But wouldn't one be required to include a particular header in such case (i.e. the correct header for defining a particular type)?

I mean, no typedef names are defined in the global scope without including any headers right? Like I find it really weird that a type ending in _t would be UB if there is no such typedef name declared at all.

Or is this UB stuff merely a way for the ISO C committee to enforce this without having to define <something more complicated>?


[Note: What I originally wrote in my top-level comment was inaccurate; I edited that comment, but later posted another update: https://news.ycombinator.com/item?id=33773043#33775630.]

The purpose of this particular naming rule is to allow adding new typedefs such as int128_t. The "undefined behaviour" part is for declaration of any reserved identifier (not specifically for this naming rule). I don't know why the standard uses "undefined behaviour" instead of the other classes (https://en.cppreference.com/w/cpp/language/ub); I suspect because it gives compilers the most flexibility.


[Edit: My link to the behaviour classes was wrong (it was for C++ instead of C), it should have been https://en.cppreference.com/w/c/language/behavior]


Doesn’t the compiler need to know all of the types to do the compilation anyway?


I'm not sure, but in general having incompatible definitions for the same name is problematic.


Thank you so much! I will definitely be amending a few things. WRT no section on undefined behaviour - you're so right, how could I forget?


Certainly yes, but for debug builds and tests. It can be heavyweight for production.


C spec:

>That shouldn't be recommended, because names ending with "_t" are reserved.

Also C spec naming new things:

>_Atomic _Bool

I'm glad to see the C folks have a sense of humor.


Not all reserved names are reserved for all purposes. _t is reserved only for type names (typedefs), whereas _Atomic and _Bool are keywords.


The standard reserves several classes of identifiers, "_t" suffix [edit: with also "int"/"uint" prefix] is just one of several rules. Another rule is "All identifiers that begin with an underscore followed by a capital letter or by another underscore" (and also "All external identifiers that begin with an underscore").


That's only because bool was usually an old alias for int. It's defined as an alias for _Bool in <stdbool.h>, which is highly recommended.


> C has no environment which smooths out platform or OS differences

Not true - C has little environment, not no environment. For example, fopen("/path/file.txt", "r") is the same on Linux and Windows. For example, uint32_t is guaranteed to be 32 bits wide, unlike plain int.

> Each source file is compiled to a .o object file

Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

> static

This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

> Integers are very cursed in C. Writing correct code takes some care

Yes they very much are. https://www.nayuki.io/page/summary-of-c-cpp-integer-rules


> Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

The standard only says that the implementation must preprocess, translate, and link the several "preprocessing translation units" to create the final program. It doesn't say anything about how the translation units are stored on the system.

> This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

Not quite: in a global scope, it gives the variable internal linkage, so that other translation units can use the same name to refer to their own variables. In a block scope, it gives the variable static storage duration, but it doesn't give it any linkage. In particular, it doesn't let the program refer to the variable outside its block.
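A short sketch showing both uses side by side:

    /* File scope: `static` gives internal linkage - the name is private to
       this translation unit. */
    static int id_base = 1000;

    int next_id(void) {
        /* Block scope: `static` gives static storage duration - the value
           persists across calls - but has no linkage. */
        static int counter = 0;
        return id_base + counter++;
    }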


On Windows you can directly access UNC paths (without mounting) with fopen. You can't do this on POSIX platforms. Also, not all API boundaries are fixed width so you're going to be exposed to the ugliness of variable width types.

I think the article is correct that one must be aware of the platform and the OS when writing C code.


fopen will/should fail on Windows with the Unix path syntax.

The reason it's indeterminate is because some standard C library vendors will do path translation on Windows and some won't. I believe Cygwin does (because it's by definition a Unix-on-Windows), but I'm pretty sure the normal C library vendors on Windows do not.

I'm almost positive that classic Mac OS (before Mac OS X) will fail with Unix path separators, since its path separator is ':', not '/'.


It will work on Windows, since it inherits the behavior from MS-DOS. It's the shell on Windows (or MS-DOS) where it fails, since the shell uses '/' to designate options; so when MS-DOS gained subdirectories (in 2.0) it used '\' as the path separator in the shell. The "kernel" will accept both. There even used to be an undocumented (or under-documented) function in MS-DOS to switch the option character.


Apparently it's true. I wonder when this was implemented?

Canonicalize separators

"All forward slashes (/) are converted into the standard Windows separator, the back slash (\). If they are present, a series of slashes that follow the first two slashes are collapsed into a single slash."

https://learn.microsoft.com/en-us/dotnet/standard/io/file-pa...


Since MS-DOS 2.0, released in October of 1983---39 years ago.


After learning C, one of the first projects I came into contact with was the id Tech 3 game engine [1].

On the one hand, it taught me how professional C programmers structure their code (extra functions to remove platform differences, specific code shared between server and client to allow smooth predictions) and how incredibly fast computers can be (thousands of operations within milliseconds); but it also showed me how the same code can result in different executions due to compiler differences (tests pass, production crashes) and how important good debugging tools are (e.g. backtraces).

To this day I am very grateful for the experience, and that id Software decided to release the code as open source.

[1] https://github.com/id-Software/Quake-III-Arena


Love the intro and overview — looking forward to more!

These weren't mentioned in the post but have been very helpful in my journey as a C beginner so far:

- Effective C by Robert C. Seacord. It covers a lot of the footguns and gotchas without assuming too much systems or comp-sci background knowledge. https://nostarch.com/Effective_C (Also, how can you not buy a book on C with Cthulhu on the cover written by a guy with _three_ “C”s in his name?)

- Tiny C Projects by Dan Gookin, for a “learn by doing” approach. https://www.manning.com/books/tiny-c-projects

- Exercism's C language track: https://exercism.org/tracks/c

- Computer Systems, A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron for a deeper dive into memory, caches, networking, concurrency and more using C, with plenty of practice problems: https://csapp.cs.cmu.edu/


More Good projects to learn from:

  - Busybox (https://github.com/mirror/busybox)
  - uClibc (https://git.uclibc.org/uClibc/tree/)
  - musl (https://git.musl-libc.org/cgit/musl/tree/)
  - misc GNU tools (https://git.savannah.gnu.org/cgit/grep.git/tree/, https://git.savannah.gnu.org/cgit/findutils.git/tree/, etc)
The first two are oriented towards embedded development, which I find leads to the simplest, most portable code. Those devs are absolute wizards.


This used to be my bible when doing full-time C programming around 2000 (together with the standard docs). I'm out of date with the latest standard updates (as is this), but it may still be of interest.

https://c-faq.com/


Thanks for submitting this. I'm teaching myself C so these high level overviews are super useful for improving my intuition. In the following example, shouldn't there be an asterisk * before the data argument in the getData function call? The way I understand it, the function is expecting a pointer, so you would need to pass it a pointer to the data object.

> "If you want to “return” memory from a function, you don’t have to use malloc/allocated storage; you can pass a pointer to a local data:

void getData(int *data) { data[0] = 1; data[1] = 4; data[2] = 9; }

void main() { int data[3]; getData(data); printf("%d\n", data[1]); } "


No, it's correct. The asterisk is a little inconsistent, in that it means two opposite things. In the declaration it means "this is a pointer." However, in an expression, it means "this is the underlying type" and serves to dereference the pointer.

    int a = 5;
    int *x; // this is a pointer
    x = &a;
    int c = *x; // both c and *x are ints
If it were *data, it would be equivalent to *(data + 0), which is equivalent to data[0], which is an int. You don't want to pass an int, you want to pass an int *.


The way I got this to stick in my head was to always think of * as dereferencing, and tell myself that

    int *x;
is declaring that the type of *x is int.


It's not just a memorization trick; that's exactly what the statement means. If you do

    int *x, y;
You're saying that both *x and y are integers.


... which is why I never understood why this is the convention rather than

    int* x, y. 
Does somebody know?


Because now you've got an int pointer and an int. The star associates with the right, not left.

I prefer to use the variant you described though, because it feels more natural to associate the pointer with the type itself. As far as I know, the only pitfall is in the multiple declaration thing so I just don't use it.

IMO, it's also more readable in this case:

    int *get_int(void);
    int* get_int(void);
The second one more clearly shows that it returns a pointer-to-int.


Multiple declaration is generally frowned upon, because you declare the variables without immediately setting them to something.

If you always set new variables in the same statement you declare them, then you don't use multiple declarations, which means there is no ambiguity putting the * by the type name.

So convention wins out for convention's sake. And that's the entire point of convention in the first place: to sidestep the ugly warts of a decades-old language design.


Spaces are ignored (except to separate things where other syntactic markers like * or , aren't present), and * binds to the variable on the right, not the type on the left. I actually got this wrong in an online test, but I screenshotted every question so I could go over them later (a dirty trick, I admit, but I learned things like this from it, and I still did well enough on the test to get the interview).

    int*x,y;  // x is pointer to int, y is int
    int x,*y; // x is int, y is pointer to int

And the reason I got it wrong on the test is it had been MANY years since I defined more than one variable in a statement (one variable defined per line is wordier but much cleaner), so if I ever knew this rule before, I had forgotten it over time.

I keep wanting to use slash-star comments, but I recall // is comment-to-end-of-line in C99 and later, something picked up from its earlier use in C++.

Oh yeah, C99 has become the de-facto "official" C language, regardless of more recent changes/improvements, as not all newer changes have made it into newer compilers, and most code written since 1999 seems to follow the C99 standard. gcc and many other compilers have an option to pick which standard to compile against (e.g. -std=c99).


I think the question is why it binds to the variable rather than the type. It's obviously a choice that the designers have made; e.g. C# has very similar syntax, but:

   int* x, y;
declares two pointers.

I think the syntax and the underpinning "declaration follows use" rule are what they got when they tried to generalize the traditional array declaration syntax with square brackets after the array name which they inherited directly from B, and ultimately all the way from Algol:

   int x, y[10], z[20];
In B, though, arrays were not a type; when you wrote this:

   auto x, y[10], z[20];
x, y, and z all have the same type (word); the [] is basically just alloca(). This all works because the type of element in any array is also the same (word), so you don't need to distinguish different arrays for the purposes of correctly implementing [].

But in C, the compiler has to know the type of the array element, since it can vary. Which means that it has to be reflected in the type of the array, somehow. Which means that arrays are now a type, and thus [] is part of the type declaration.

And if you want to keep the old syntax for array declarations, then you get this situation where the type is separated by the array name in the middle. If you then try to formalize this somehow, the "declaration follows use" rule feels like the simplest way to explain it, and applying it to pointers as well makes sense from a consistency perspective.


You must've misunderstood: your statement looks like both x and y are `int *`, but in fact only x is an `int *`, while y is an `int`.


I don't know for certain, but I suspect it simplified the language's grammar, since C's "declaration follows use" rule means you can basically repurpose the expression grammar for declarations instead of needing new rules for types. This is also why the function pointer syntax is so baroque (`int (*x)();` declares a variable `x` containing a pointer to a function returning an int, with unspecified parameters).


I like that a lot! However, it makes things like

    int *x = &a;
a bit more confusing/inconsistent.


Not at all!

a is an int; &a is a pointer to int; x is a pointer to int; *x is again an int.


Gotcha, so it's kind of like:

    int (*(x = &(a)));
    i    i p   p i   // i means int, p means pointer


I prefer to think of it as

    (int *) x = &(a);
     i   p  p   a i // a means address       
    
Which is why I prefer to write

    int* x = &a;
"integer pointer" named "x" set to address of integer "a".

---

As a sibling comment pointed out, this is ambiguous when using multiple declaration:

    int* foo, bar;
The above statement declares an "integer pointer" foo and an "integer" bar. It can be unambiguously rewritten as:

    int bar, *foo;
But multiple declaration sucks anyway! It's widely accepted good practice to set (instantiate) your variables in the same statement that you declare them. Otherwise your program might start reading whatever data was lying around on the stack (the current value of bar) or worse: whatever random memory address it refers to (the current value of foo).


Thanks :)


Thanks. Now I understand why I found pointers difficult. It's the declaration that confused me.


I think it would help if beginners learn a language other than C to learn about pointers. My first language was Pascal, and it didn't have a confusing declaration syntax, nor did it have a confusing array decay behavior so it was much much easier to learn. Nowadays of course I don't think about it but those details mattered to beginners.


Yeah. Since trying C, I've learnt a bit of Rust, so referencing and dereferencing seem straightforward without abstracting references using a pointer.


You can read a declaration like `int *x` as “`*x` is an int”, and hence x is an int pointer.


With the asterisks backslash escaped:

> ..read a declaration like `int *x` as "`*x` is an int"..


No, it's fine.

The name of an array decays to a pointer to the first element in various contexts. You could do `&data[0]` but it means exactly the same thing and would just read as over-complicated to C programmers.


Thanks for all the great answers. The inconsistency between pointer declaration and dereference syntax was what got me. :)


the local variable data effectively decays to int*.

*data would give you an int; &data would give you a pointer to the whole array, of type int (*)[3].

