Everything I wish I knew when learning C (tmewett.com)
738 points by bubblehack3r on Nov 28, 2022 | 391 comments



I like it, but the array details are a little bit off. An actual array does have a known size; that's why, when given a real array, `sizeof` can give the size of the array itself rather than the size of a pointer. There's no particular reason why C doesn't allow you to assign one array to another of the same length; it's largely just an arbitrary restriction. As you noted, it already has to be able to do this when assigning `struct`s.

Additionally a declared array such as `int arr[5]` does actually have the type `int [5]`, that is the array type. In most situations that decays to a pointer to the first element, but not always, such as with `sizeof`. This becomes a bit more relevant if you take the address of an array as you get a pointer to an array, Ex. `int (*ptr)[5] = &arr;`. As you can see the size is still there in the type, and if you do `sizeof *ptr` you'll get the size of the array.
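
A small sketch of the distinction (illustrative only):

  #include <stdio.h>

  int main(void) {
      int arr[5];
      int *p = arr;             // decays to a pointer to the first element
      int (*pa)[5] = &arr;      // pointer to the whole array, type int (*)[5]

      printf("%zu %zu %zu\n",
             sizeof arr,        // 5 * sizeof(int): the array type keeps its size
             sizeof p,          // just the size of a pointer
             sizeof *pa);       // 5 * sizeof(int) again
      return 0;
  }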


I really wish that int arr[5] adopted the semantics of struct { int arr[5]; } -- that is, you can copy it, and you can pass it through a function without it decaying to a pointer. Right now in C:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t t1[5];
    typedef struct { uint32_t arr[5]; } t2;
    void test(t1 a, t2 b) {
        t1 c;
        t2 d;
        printf("%zu %zu %zu %zu\n", sizeof(a), sizeof(b), sizeof(c), sizeof(d));
    }
will print the pointer size (4 or 8), 20, 20, 20. I understand that array types having their sizes in their types was one of Kernighan's gripes with Pascal [0], which likely explains why arrays decay to pointers, but in those cases you could still explicitly decay to a pointer if you really wanted to, and pass an explicit length parameter.

[0] http://www.lysator.liu.se/c/bwk-on-pascal.html


> I really wish that int arr[5] adopted the semantics of struct { int arr[5]; }

You and me both. In fact, D does this. `int arr[5]` can be passed as a value argument to a function, and returned as a value argument, just as if it was wrapped in a struct.

It's sad that C (and C++) take every opportunity to instantly decay the array to a pointer, which I've dubbed "C's Biggest Mistake":

https://www.digitalmars.com/articles/C-biggest-mistake.html


That would be a nice little "gcc addition" to the C standard, honestly.

Too bad they spend most of their time doing whatever it is they do.


I have long been convinced that WG14 has no real interest in improving C's security beyond what a Macro Assembler already offers out of the box.

Even the few "security" attempts that they have made, still require separate pointer and length arguments, thus voiding any kind of "security" that the functions might try to achieve.

However, even a Macro Assembler is safer than modern C compilers, as it doesn't remove your code when you step on a UB mine.


One of the members of WG14 posted here a few days ago that they only use C89.


Mind blown.


Earlier versions of gcc actually used to support this in a very restricted context in C90 (or maybe gnu89) mode:

  struct foo { int a[10]; };
  struct foo f(void);
  int b[10];
  b = f().a;
In C90, you can't actually do anything with `f().a` because the conversion from array to pointer only happened to lvalues (`f().a` is not an lvalue), and assignment is not defined for array variables (though gcc allowed it). The meaning was changed in C99 so that non-lvalue arrays are also converted to pointers. gcc used to take this distinction into account, so the above program would compile in C90 mode but not in C99 mode. New versions of gcc seem to forbid array assignment in all cases.

I think this quirk also means that it's technically possible to pass actual arrays to variadic functions in C90, since there was nothing to forbid the passing (it worked in gcc at least, though in strict C90, you wouldn't be able to use the non-lvalue array). In C99 and above, a pointer will be passed instead.


Beware of struct padding.

sizeof(b.arr) != sizeof(b)

Consider:

  #include <stddef.h>

  #include <inttypes.h>

  typedef struct Array Array;

  struct Array {
      int32_t data[8];
  };

  void foo(Array const* arr) {
      size_t sz = sizeof(arr->data);  /* 32 here; sizeof(*arr) could be larger if the struct had padding */
      (void)sz;
  }


> There's no particular reason why C doesn't allow you to assign one array to another of the same length

Actually, there is a particular (though not necessarily good) reason: it would require the compiler to either generate a loop (with a conditional branch) for an (unconditional) assignment, or generate unboundedly many assembly instructions (essentially an unrolled loop) for a single source operation.

Of course, that stopped being relevant when they added proper (assign, return, etc) support for structs, which can embed arrays anyway, but that wasn't part of the language initially.


It was initially available in 1982, so plenty of time to add the other features.

https://www.bell-labs.com/usr/dmr/www/chist.html


Another weird property of C arrays is that &arr == arr. The address of an array is the pointer to the first element, which is what `arr` itself decays to. If arr were a pointer, &arr != arr.


Is today international speak like a pirate day? arr arr arr


I think it is clearer to say that arr == &arr[0] but your mileage may vary.


&arr is a pointer to the array. It will happen to point to the same place as the first element, but in fact they have different types, and e.g. (&arr)[0] == arr != arr[0].
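
A small example of the type difference (illustrative only):

  #include <assert.h>
  #include <stdio.h>

  int main(void) {
      int arr[5] = {0};
      int (*p)[5] = &arr;                   // same address as arr, but type int (*)[5]
      assert((void *)&arr == (void *)arr);  // the addresses compare equal...
      assert((&arr)[0] == arr);             // ...and (&arr)[0] decays to &arr[0]
      printf("%d\n", arr[0]);               // while arr[0] is an int, not a pointer
      return 0;
  }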


> There's no particular reason why C doesn't allow you to assign one array to another of the same length, it's largely just an arbitrary restriction.

IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars. If you need assignment that might do O(N) work, you need to call a stdlib function (memcpy/memmove) instead. If you need an allocation that might do O(N) work, you either need a function (malloc) or you need to do your allocation not-at-runtime, by structuring the data in the program's [writable] data segment, such that it gets "allocated" at exec(2) time.
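
For instance, copying one array into another has to be spelled out with a library call (a minimal sketch):

  #include <string.h>

  void copy_example(void) {
      int a[5] = {1, 2, 3, 4, 5};
      int b[5];
      memcpy(b, a, sizeof b);  // the O(N) copy is explicit; plain `b = a;` won't compile
  }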

This is really one of the biggest formal changes between C and C++ — C++ assignment, and keywords `new` and `delete`, can both do O(N) work.

(Before anyone asks: a declaration `int foo[5];` in your code doesn't do O(N) work — it just moves the stack pointer, which is O(1).)


> Assignment is always O(1)

This depends on what you consider to be O(1) - being that the size of the array is fixed it's by definition O(1) to copy it, but I might get your point. I think in general your point isn't true though, C often supports integer types that are too large to be copied in a single instruction on the target CPU, instead it becomes a multi-instruction affair. If you consider that to still be O(1) then I think it's splitting hairs to say a fixed-size array copy would be O(N) when it's still just a fixed number of instructions or loop iterations to achieve the copy.

Beyond that, struct assignments can already generate loops of as large a size as you want, Ex: https://godbolt.org/z/8Td7PT4af


I think the meaning here is that assignment is never O(N) for any variable N computed at runtime. Of course, you can create arbitrarily large assignments at compile time, but this always has an upper bound for a given program.


Then you are wrong, since we're already talking about arrays of sizes known at compile time. Indeed, otherwise we would also need to remember the size at runtime.


I don't think we're actually in disagreement here. It looks like I misread the parent comment to be claiming that fixed-size array assignment ought to be considered O(N), when no such claim is made.


Yeah to clarify I'm definitely in agreement with you that it's O(1), the size is fixed so it's constant time. It's not like the 'n' has to be "sufficiently small" or something for it to be O(1), it just has to be constant :)

People are being very loose about what O(n) means so I attempted to clarify that a bit. Considering what assignments can already do in C it's somewhat irrelevant whether they think it's O(n) anyway, it doesn't actually make their point correct XD


IIRC this is valid in C99:

    void foo(size_t n) {
        int arr[n];
        …
    }


VLAs can be declared in a single statement, but they cannot be initialized in C17 (6.7.9):

> The type of the entity to be initialized shall be an array of unknown size or a complete object type that is not a variable length array type.

Curiously, C23 actually seems to break the O(1) rule, by allowing VLAs to be initialized with an empty initializer:

  int arr[n] = {};
GCC generates a memset call (https://godbolt.org/z/5v31bKs5a) to fill the array with zeros.


How do you think it works? Does the compiler generate some kind of stack alloc?

Stupid question: Does that mean a huge value for 'n' can cause stack overflow at runtime? I recall that threads normally get a fixed size stack size, e.g., 1MB.


Yes, it causes stack overflow at runtime. Compilers can warn about it; in particular, clang has a warning that you can configure to pop up whenever the stack usage of a function goes beyond some limit you set. I think setting it to 32k or 64k is a safe and sane default, as e.g. macOS thread stack sizes are just 512 KB.


It just moves the stack pointer by n which is O(1). It doesn’t initialize it of course. But my point is that the array size isn’t known at compile time.


> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars.

This is absolutely and entirely wrong. You can assign a struct in C and the compiler will call memcpy when you do.

Enjoy: https://godbolt.org/z/98PnhYoev


And any structure is O(1), not O(n), because C structures are not parameterized.


memcpy is not O(1)


It's O(1) relative to any size computed at runtime: that is, running the same program (with the same array size) on different inputs will always take the same amount of work for a given assignment.


We're in the context of the assignment operation in the language here. Yes, in C you can only assign statically-known types but that does not mean you can just ignore that a = f(); may take a very different time depending on the types of a and f


This reasoning falls apart for structs with array members.


Well C does allow "copying" an array if it's wrapped inside a struct, which does not make it O(1). gcc generates calls to memcpy in assembly for copying the array.


> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime.

How about CPUs that have no, say, division instruction, so it has to be emulated with a loop?


Fixed-width divisions are O(1), just comparatively expensive (and potentially optimized to run in variable time). Consider that you can do long division on pairs of numerals of, say, up to 20 digits and be Pretty Confident of an upper bound on how long it's going to take you (you know it's not going to take more than 20 rounds), even though it's going to take you longer to do that than it would for you to add them.


Interesting, I didn't fully realise that. That it's arbitrary is annoying, I clearly had tried to rationalise it to myself! Thanks for the comments, will get around to amending


Hi, great article. Regarding char, I'd remark that getchar() etc. return int so they can return -1 for EOF or error.

I'm pretty sure this implies int as a declaration is always signed, but tbh I'm not completely sure!
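
For example, the usual read loop keeps the result in an int precisely so EOF stays distinguishable from every valid character value:

  #include <stdio.h>

  int main(void) {
      int c;  // int, not char, so the -1 returned for EOF can't collide with a real character
      while ((c = getchar()) != EOF)
          putchar(c);
      return 0;
  }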


int, like all other integer types except char, is indeed signed by default.

Aside: signedness semantics of char is implementation-defined. However, the type char itself is always distinct from both signed char and unsigned char.


Another corner of the language where arrays actually being arrays is important is multidimensional array access:

    int arr[5][7];
    arr[3][5] = 4; // equivalent to *(*(arr + 3) + 5) = 4;
This works because (arr + 3) has type "pointer to int[7]", not "pointer to int". The resulting address computation is

    (char*)arr + 3 * sizeof(int[7]) + 5 * sizeof(int) ==
    (char*)arr + 26 * sizeof(int)
That's also another reason why types like "int [][5][7]" are legal but "int [][5][]" are not.


Really, are multidimensional arrays an important part of the language?

The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.
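
For what it's worth, a minimal sketch of what such helpers might look like (the names and the COLS constant are made up for illustration):

  #define COLS 7

  // row-major index math done in one place instead of scattered through the code
  static inline int  get(const int *a, int i, int j)      { return a[i * COLS + j]; }
  static inline void set(int *a, int i, int j, int value) { a[i * COLS + j] = value; }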


Of course multidimensional arrays are an important part of the language, just as the ability to have structs inside structs.

> The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.

It is a "flat" array already, not an array of pointers: [0]. No need to write the code that compiler generates for you already.

[0] https://godbolt.org/z/x3cPf3TvT


I am aware that it's flat already.


The code above does not mention any pointer types, so why would you assume that it's indexing into an array of pointers?

I can't think of any reason why get(a, i, j) is more readable than a[i][j].


Return pointer to array of 4 integers:

  int32_t (* bar(void))[4] {
      static int32_t u[4] = {1, 0, 1, 0};
      return &u;
  }
Return a pointer to a function taking a char:

  void f(char a) {
      // ...
  }

  void (* baz(void))(char) {
      return f;
  }


This is where you really want to start using typedefs.
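
For instance, the declarations above could be written with typedefs along these lines (the typedef names are made up):

  #include <stdint.h>

  typedef int32_t four_ints[4];   // an array type
  typedef void (*char_fn)(char);  // a pointer-to-function type

  void f(char a) { /* ... */ }

  four_ints *bar(void) {          // same as: int32_t (*bar(void))[4]
      static four_ints u = {1, 0, 1, 0};
      return &u;
  }

  char_fn baz(void) {             // same as: void (*baz(void))(char)
      return f;
  }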


> it's largely just an arbitrary restriction

Kind of. But the restriction is in keeping with the C philosophy of no hidden implementation magic. C has the same restriction on structs. It's the same question: an array of bytes of known size is something the compiler could easily abstract away. But assignment is always a very cheap operation in C. If we allow assignment to stand for memcpy(), that property is no longer true.

Same reason why Rust requires you to .clone() so much. It could do many of the explicit copies transparently, but you might accidentally pass around a 4 terabyte array by value and not notice.


> But assignment is always a very cheap operation in C.

That's just not true though; you can assign structs of arbitrarily large size to each other, and compilers will emit the equivalent of `memcpy()` to do the assignment. They might actually call `memcpy()` automatically depending on the particular compiler.

The fact that if you wrap the array in a struct then you're free to copy it via assignment makes it arbitrary IMO.


Perhaps I am missing something in the spec - but trying this in various compilers, it seems that you *can* assign structs holding arrays to one another, but you *cannot* assign arrays themselves.

This compiles:

  struct BigStruct {
    int my_array[4];
  };
  int main() {
    struct BigStruct a;
    struct BigStruct b;
    b = a;
  }
But this does not:

  int main() {
    int a[4];
    int b[4];
    b = a;
  }
That seems like an arbitrary restriction to me.


In the first example a & b are variables, which can be assigned to each other. In the second a & b are pointers, but b is fixed, so you can not assign a value to it.


They’re not pointers. sizeof a == 4*sizeof(int), not sizeof(int*).


They're pointers, just weird ones. The compiler knows it's an array, so it gives the result of the actual amount of space it takes up. If you passed it into a function, and used the sizeof operator in the function, it'd give `sizeof(int *)`. Because sizeof is a compile-time operation, so the compiler still knows that info for your example.


That just means it decays into a pointer after being passed as a function argument. In the example given, however, it's not a pointer. Just like it wouldn't be inside a struct.


Essentially ‘b = a’ in the second example is equivalent to ‘b = &a[0]’ or assigning an array to a pointer.

This is because if you use an array in an expression, its value is (most of the time) a pointer to the array's first element. But the left element is not an expression, therefore it is referring to b the array.

Example one works because no arrays are referred to in the expression side, so this shorthand so to speak is avoided.

Arrays can be a painful edge case in C; for example, variable length arrays are hair-pulling.


The left side of an assignment in C is an expression. It's just not in a context where array-to-pointer decay is triggered.


Ironically, Rust does allow you to implicitly copy an array as long as it reduces to a memcpy


Specifically, arrays [T; N] are Copy precisely when T is Copy. So, an array of 32-bit unsigned integers [u32; N] can be copied, and so can an array of immutable string references like ["Hacker", "News", "Web", "Site"], but an array of mutable Strings cannot.

The array of mutable Strings can be memcpy'd and there are situations where that's actually what Rust will do, but because Strings aren't Copy, Rust won't let you keep both - if it did this would introduce mutable aliasing and so ruin the language's safety promise.


> Everything I wish I knew when learning C

By far my biggest regret is that the learning materials I was exposed to (web pages, textbooks, lectures, professors, etc.) did not mention or emphasize how insidious undefined behavior is.

Two of the worst C and C++ debugging experiences I had followed this template: Some coworker asked me why their function was crashing, I edit their function and it sometimes crashes or doesn't depending on how I rearrange lines of code, and later I figure out that some statement near the top of the function corrupted the stack and that the crashes had nothing to do with my edits.

Undefined behavior is deceptive because the point at which the program state is corrupted can be arbitrarily far away from the point at which you visibly notice a crash or wrong data. UB can also be non-deterministic depending on OS/compiler/code/moonphase. Moreover, "behaving correctly" is one legal behavior of UB, which can fool you into believing your program is correct when it has a hidden bug.

A related post on the HN front page: https://predr.ag/blog/falsehoods-programmers-believe-about-u... , https://news.ycombinator.com/item?id=33771922

My own write-up: https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...

The take-home lesson about UB is to only rely on following the language rules strictly (e.g. don't dereference null pointer, don't overflow signed integer, don't go past end of array). Don't just assume that your program is correct because there were no compiler warnings and the runtime behavior passed your tests.


> how insidious undefined behavior is.

Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans and thus did not give rise to computers or C, and thus never exist.

It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.


I'm not sure that's a productive way to think about UB.

The "weirdness" happens because the compiler is deducing things from false premises. For example,

1. Null pointers must never be dereferenced.

2. This pointer is dereferenced.

3. Therefore, it is not null.

4. If a pointer is provably non-null, the result of `if(p)` is true.

5. Therefore, the conditional can be removed.
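
A minimal C sketch of that chain of deductions (hypothetical function; whether a given compiler actually removes the branch depends on its optimizer):

  #include <stddef.h>

  int read_flag(int *p) {
      int v = *p;     // (2) p is dereferenced here...
      if (p == NULL)  // (3)+(4) ...so the compiler may treat p as provably non-null
          return -1;  // (5) and delete this branch entirely
      return v;
  }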

There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing

   if(find_undefined_behv(AST))
      emit_nasal_demons()
   else
      do_what_they_mean(AST)


The C and C++ (and D) compilers I wrote do not attempt to take advantage of UB. What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

I suppose I think in terms of "what would a reasonable person expect to happen with this use of UB" and do that. This probably derives, again, from my experience designing flight critical aircraft parts. You don't want to interpret the specification like a lawyer looking for loopholes.

It's the same thing I learned when I took a course in high-performance race driving. The best way to avoid collisions with other cars is to be predictable. It's doing unpredictable things that causes other cars to crash into you. For example, I drive at the same speed as other traffic, and avoid overtaking on the right.


I think this is a core part of the problem; if the default for everything was to not take advantage of UB things would be better - and we're fast enough that we shouldn't NEED all these optimizations except in the most critical code; perhaps.

You should need something like

    gcc --emit-nasal-daemons
to get the optimizations that can hide UB, or at least horrible warnings that "code that looks like it checks for null has been removed!!!!".


AFAIK GCC does have switches to control optimizations, the issues begin when you want to use something other than GCC, otherwise you're just locking yourself to a single compiler - and at that point might as well switch to a more comfortable language.


> What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.

This is how it worked in the "old days" when I learned C. You accessed a null pointer, you got a SIGSEGV. You wrote a "+", then you got a machine add.


In the really old DOS days, when you wrote to a null pointer, you overwrote the DOS vector table. If you were lucky, fixing it was just a reboot. If you were unlucky, it scrambled your disk drive.

It was awful.

The 8086 should have been set up so the ROM was at address 0.


This is the right approach IMO, but sadly the issue is that not all C compilers work like that even when they could (e.g. they target the same CPU), so even if one compiler guarantees it won't introduce bugs from an overzealous interpretation of UB, unless you are planning to never use any other compiler you'll still be subject to said interpretations.

And if you do decide that sticking to a single compiler is best then might as well switch to a different and more comfortable language.


This is the problem; every compiler outcome is a series of small logic inferences that are each justifiable by language definition, the program's structure, and the target hardware. The nasal demons are emergent behavior.

It'd be one thing if programs hitting UB just vanished in a puff of smoke without a trace, but they don't. They can keep on spazzing out literally forever and do I/O, spewing garbage to the outside world. UB cannot be contained even to the process at that point. I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems. One mistake and you invite the wrath of God!


> I personally find it offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems.

This is literally why newer languages like Java, JavaScript, Python, Go, Rust, etc. exist. With the hindsight of C and C++, they were designed to drastically reduce the types of UB. They guarantee that a compile-time or run-time diagnostic is produced when something bad happens (e.g. NullPointerException). They don't include silly rules like "not ending a file with newline is UB". They overflow numbers in a consistent way (even if it's not a way you like, at least you can reliably reproduce a problem). They guarantee the consistent execution of statements like "i = i++ + i++". And for all the flak that JavaScript gets about its confusing weak type coercions, at least they are coded in the spec and must be implemented in one way. But all of these languages are not C/C++ and not compatible with them.


Yes, and my personal progression from C to C++ to Java and other languages led me to design Virgil so that it has no UB, has well-defined semantics, and yet crashes reliably on program logic bugs, giving exact stack traces; but unlike Java and JavaScript, it compiles natively and has some systems features.

Having well-defined semantics means that the chain of logic steps taken by the compiler in optimizing the program never introduces new behaviors; optimization is not observable.


It can get truly bizarre with multiple threads. Some other thread hits some UB and suddenly your code has garbage register states. I've had someone UB the fp register stack in another thread so that when I tried to use it, I got their values for a bit, and then NaN when it ran out. Static analysis had caught their mistake, and then a group of my peers looked at it and said it was a false warning leaving me to find it long afterwards... I don't work with them anymore, and my new project is using rust, but it doesn't really matter if people sign off on code reviews that have unsafe{doHorribleStuff()}


On the contrary, the latter is a far more effective way to think about UB. If you try to imagine that the compiler's behaviour has some logic to it, sooner or later you will think that something that's UB is OK, and you will be wrong. (E.g. you'll assume that a program has reasonable, consistent behaviour on x86 even though it does an unaligned memory access). If you look at the way the GCC team responds to bug reports for programs that have undefined behaviour, they consider the emit_nasal_demons() version to be what GCC is designed to do.


> There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior

The problem is that, due to other optimisations (mainly inlining), the emergent misbehaviour can occur in a seemingly unrelated part of the program. This can make the inference chain very difficult to follow, as you have to trace paths through the entire execution of the program.

The same issue occurs for other types of data corruption (it's why NPEs are so disliked), but UB's blast radius is both larger and less reliable.


I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").

> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.

The first statement is factually true, but I can provide a justification for the second statement which is an opinion.

Consider this code:

    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division operation earlier so that the processor can do other useful work while the division is running in the background. For example, the compiler can move the expression x / y above the first printf(), which would totally be legal. But then, the behavior is that the program would appear to crash before the sum and first printf() were executed. UB time travel is real, and that's why it's important to follow the rules, not just make conclusions based on observed behavior.

https://blog.regehr.org/archives/232
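
Conceptually, the compiler is allowed to behave as if the source had been written like this (illustrative only; any particular compiler may or may not perform this exact transformation):

    #include <stdio.h>

    void foo(int x, int y) {
        int q = x / y;             // division hoisted above the first printf,
        printf("sum %d", x + y);   // so a divide-by-zero trap now fires "before"
        printf("quotient %d", q);  // the sum is ever printed
    }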


...Why is the compiler reordering so much?

Look. I get it, clever compilers (I guess) make everyone happy, but are absolute garbage for facilitating program understanding.

I wonder if we are shooting ourselves in the foot with all this invisible optimization.


People like fast code.


In 2022, is there any other reasons to use C besides "fast code" or "codebase already written in C"?


No, and, in fact, the first one isn't valid - you can use C++ (or a subset of it) for the same performance profile with fewer footguns.

So really the only time to use C is when the codebase already has it and there is a policy to stick to it even for new code, or when targeting a platform that simply doesn't have a C++ toolchain for it, which is unfortunately not uncommon in embedded.


"codebase already written in C" includes both "all the as yet unwrapped libraries" and "the OS interface".


There isn't. Fast code is pretty important though to a lot of people while security isn't (games, renderers, various solvers, simulations etc.).

It's great C is available for that. If you're OK with slow, use Java or whatever.


> Integer division is a slow operation, and under the rules of C, it has no side effects.

Then C isn't following this rule - crashing is a pretty major side effect.


The basic deal is that in the presence of undefined behavior, there are no rules about what the program should do.

So if you as a compiler writer see: we can do this optimization and cause no problems _except_ if there's division by zero, which is UB, then you can just do it anyway without checking.


Only non-zero integer division is specified as having no side effects.

Division by zero is in the C standard as "undefined behavior", meaning the compiler can decide what to do with it; crashing would be nice, but it doesn't have to. It could also give you a wrong answer if it wanted to.

Edit: And just to illustrate, I tried in clang++ and it gave me "5 / 0 = 0" so some compilers in some cases indeed make use of their freedom to give you a wrong answer.


To my downvoters, since I can no longer edit: I've been corrected that the rule is integer division has no side effects except for dividing by zero. This was not the rule my parent poster stated.


> I've been corrected

No you haven't. The incorrect statement was a verbatim quote from nayuki's post, which you were responding to. Please refrain from apologising for other people gaslighting you (edit: particularly, but not exclusively, since it sets a bad precedent for everyone else).


At the CPU level, division by zero can behave in a number of ways. It can trap and raise an exception. It can silently return 0 or leave a register unchanged. It might hang and crash the whole system. The C language standard acknowledges that different CPUs may behave differently, and chose to categorize division-by-zero under "undefined behavior", not "implementation-defined behavior" or "must trap".

I wrote:

> Integer division is a slow operation, and under the rules of C, it has no side effects.

This statement is correct because if the divisor is not zero, then division truly has no side effects and can be reordered anywhere, otherwise if the divisor is zero, the C standard says it's undefined behavior so this case is irrelevant and can be disregarded, so we can assume that division always has no side effects. It doesn't matter if the underlying CPU has a side effect for div-zero or not; the C standard permits the compiler to completely ignore this case.


> I wrote:

> > Integer division is a slow operation, and under the rules of C, it has no side effects.

Yes, you did, and while that's a reasonable approximation in some contexts, it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour. (Arguably that means it has every possible side effect, but that's more of a philosophical issue. In practice it has various specific side effects like crashing, which are specific realizations of its theoretical side effect of invoking undefined behaviour.)

vikingerik's statement was correct:

> [If "Integer division [...] has no side effects",] Then C isn't following this rule - crashing is a pretty major side effect.


> it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour.

They were careful to say “under the rules of C,” the rules define the behaviour of C. On the other hand, undefined behaviour is outside the rules, so I think they’re correct in what they’re saying.

The problem for me is that the compiler is not obliged to check that the code is following the rules. It puts so much extra weight on the shoulders of the programmer, though I appreciate that using only rules which can be checked by the compiler is hard too, especially back when C was standardised.


> They were careful to say "under the rules of C,"

Yes, and under the rules of C, division by zero has a side effect, namely invoking undefined behaviour.

> The problem for me is that the compiler is not obliged to check that the code is following the rules.

That part's actually fine (annoying, but ultimately a reasonable consequence of the "rules the compiler can check" issue); the real(ly bad and insidious) problem is that when the compiler does check that the code is following the rules, it's allowed to do it in a deliberately backward way that uses any case of not following the rules as an excuse to break unrelated code.


Undefined behavior is not a side effect to be "invoked" by the rules of C. If UB happens, it means your program isn't valid. UB is not a side effect or any effect at all, it is the void left behind when the system of rules disappears.


Side effects are a type of defined behavior. Crashing is not a "side effect" in C terms.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

This is the greatest sin modern compiler folks have committed against C. The C language never says the compiler can change the code arbitrarily due to a UB statement. It is undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while not defined as to what the final value should be, is understood to be an operation that updates a value. It is definitely not, e.g., an assertion about the operands on the grounds that UB can't happen.

Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.

As computing goes beyond arithmetic and programs grow in complexity, I personally believe some amount of fuzziness is the key. This current narrow view from the compiler folks (which somehow gets accepted at large) is really, IMO, a setback in the evolution of computing.


> It is definitely not, e.g., an assertion on the operand because UB can't happen.

C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen. After all, a program with UB is ill-formed and therefore shouldn't exist!

I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.


> C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

I disagree on the logic from "ill-formed" to "assume it doesn't happen".

> I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.

I admit I don't differentiate those two words. I think they are just word-play.


The C standard defines them very differently though:

  undefined behavior
    behavior, upon use of a nonportable or erroneous program
    construct or of erroneous data, for which this International
    Standard imposes no requirements

  unspecified behavior
    use of an unspecified value, or other behavior where this
    International Standard provides two or more possibilities
    and imposes no further requirements on which is chosen in
    any instance
Implementations need not, but obviously may, assume that undefined behavior does not happen. However the program behaves when undefined behavior is invoked is simply how the compiler chose to implement that case.


"Nonportable" is a significant element of this definition. A programmer who intends to compile their C program for one particular processor family might reasonably expect to write code which makes use of the very-much-defined behavior found on that architecture: integer overflow, for example. A C compiler which does the naively obvious thing in this situation would be a useful tool, and many C compilers in the past used to behave this way. Modern C compilers which assume that the programmer will never intentionally write non-portable code are.... less helpful.


> I disagree on the logic from "ill-formed" to "assume it doesn't happen".

Do you feel like elaborating on your reasoning at all? And if you're going to present an argument, it'd be good if you stuck to the spec's definitions of things here. It'll be a lot easier to have a discussion when we're on the same terminology page here (which is why specs exist with definitions!)

> I admit I don't differentiate those two words. I think they are just word-play.

Unfortunately for you, the spec says otherwise. There's a reason there's 2 different phrases here, and both are clearly defined by the spec.


That's the whole point of UB though: the programmer helping the compiler deduce things. It's too much to expect the compiler to understand your whole program well enough to know a+b doesn't overflow. The programmer might understand it doesn't, though. The compiler relies on that understanding.

If you don't want it to rely on that, insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Whining about UB is like reading Shakespeare to your dog and complaining it doesn't follow. It's not that smart. You are though. If you want it to check for an overflow or whatever there is a one liner to do it. Just insert it into your code.
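
For example, the kind of explicit check being suggested might look like this (a sketch; the names are illustrative):

  #include <limits.h>

  // report failure instead of letting the signed addition overflow (which would be UB)
  int checked_add(int a, int b, int *out) {
      if ((b > 0 && a > INT_MAX - b) ||
          (b < 0 && a < INT_MIN - b))
          return 0;
      *out = a + b;
      return 1;
  }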


> That's the whole point of UB though

No, the whole (entire, exclusive of that) point of undefined behaviour is to allow legitimate compilers to generate sensible and idiomatic code for whichever target architecture they're compiling for. E.g., a pointer dereference can just be `ld r1 [r0]` or `st [r0] r1`, without paying any attention to the possibility that the pointer (r0) might be null, or that there might be memory-mapped IO registers at address zero that a read or write could have catastrophic effects on.

It is not a licence to go actively searching for unrelated things that the compiler can go out of its way to break under the pretense that the standard technically doesn't explicitly prohibit a null pointer dereference from setting the pointer to a non-null (but magically still zero) value.


If you don't want the compiler to optimize that much then turn down the optimization level.


> If you don't want it to rely on it insert a check into the program and tell it what to do if the addition overflows. It's not hard.

Given that even experts routinely fail to write C code that doesn't have UB, available evidence is that it's practically impossible.


> So yes, the spec does say that compilers are allowed to assume UB doesn't happen.

They are allowed to do so, but in practice this choice is not helpful.


On the contrary, it is quite helpful–it is how C optimizers reason.


> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend to before the first input that makes UB inevitable.


They might be referring to eg. the `_Nonnull` annotation being added to memset. The result is that this:

   if (ptr == NULL) {
      set_some_flag = true;
   } else {
      set_some_flag = false;
   }
   memset(ptr, 0, size);
Will never see `set_some_flag == true`, as the memset call guarantees that ptr is not null, otherwise it's UB, and therefore the earlier `if` statement is always false and the optimizer will remove it.

Now the bug here is changing the definition of memset to match its documentation a solid, what, 20? 30? years after it was first defined, especially when that "null isn't allowed" isn't useful behavior. After all, every memset ever implemented already totally handles null w/ size = 0 without any issue. And it was indeed rather quickly reverted as a change. But that really broke people's minds around UB propagation with modern optimizing passes.


False. If a program triggers UB, then the behavior of the entire program run is invalid.

> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

-- https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...


Executing the program with that input is the key term. The program can't "take back" observable effects that happen before the input is completely read, and it can't know before reading it whether the input will be one that results in an execution with UB. This is a consequence of basic causality. (If physical time travel were possible, then perhaps your point would be valid.)


The standard does permit time-travel, however. As unlikely as it might seem, I could imagine some rare scenarios in which something seemingly similar happens -- let's say the optimiser reaching into gets() and crashing the program prior to the gets() call that overflows the stack.


Time travel only applies to an execution that is already known to contain UB. How could it know that the gets() call will necessarily overflow the stack, before it actually starts reading the line (at which point all prior observable behavior must have already occurred)?


It doesn't matter how it knows. The standard permits it to do that. The compiler authors will not accept your bug report.


If you truly believe so, then can you give an example of input-conditional UB causing unexpected observable behavior, before the input is actually read? This should be impossible, since otherwise the program would have incorrect behavior if a non-UB-producing input is given.


If it's provably input-conditional then of course it's impossible. But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB, and it doesn't have to implement "possible" non-UB-containing invocations if you can't find them. E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, one that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.


> If it's provably input-conditional then of course it's impossible.

My entire point pertains to programs with input-conditional UB: that is, programs for which there exists an input that makes it result in UB, and there also exists an input that makes it not result in UB. Arguably, it would be more difficult for the implementation to prove that input-dependent UB is unconditional: that every possible input results in UB, or that no possible input results in UB.

> But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB

Indeed, the standard places no requirements on the observable effects of an execution that eventually results in UB at some point in the future. But if the UB is input-conditional, then a "good" execution and a "bad" execution are indistinguishable until the point that the input is entered. Therefore, the implementation is required to correctly perform all observable effects sequenced prior to the input being entered, since otherwise it would produce incorrect behavior on the "good" input.

> E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, one that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.

That only works because the loop has no observable effects, and the standard says it's UB if it doesn't halt, so the compiler can assume it does nothing but halts. As noted on https://blog.regehr.org/archives/140, if you try to print the resulting values, then the compiler is actually required to run the loop to determine the results, either at compile time or runtime. (If it correctly proves at compile time that the loop is infinite, only then can it replace the program with one that does whatever.)

It's also irrelevant, since my point is about programs with input-conditional UB, but the FLT program has unconditional UB.


How this might happen is that one branch of your program may have unconditional undefined behavior, which can be detected at the check itself. This would let a compiler elide the entire branch, even side effects that would typically run.


The compiler can elide the unconditional-UB branch and its side effects, and it can elide the check itself. But it cannot elide the input operation that produces the value which is checked, nor can it elide any side effects before that input operation, unless it can statically prove that no input values can possibly result in the non-UB branch.


That example doesn't contradict LegionMammal978's point though, if I understood correctly. He's saying that the 'time-travel' wouldn't extend to before checking the conditional.


Personally, I've found that some of the optimizations cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, then one pass of optimization and another together can CAUSE undefined behavior.

When I learned this, if it was and is correct, I felt that one could be betrayed by the compiler.


Optimizations themselves (except for perhaps -ffast-math) can't cause undefined behavior: the undefined behavior was already there. They can just change the program from behaving expectedly to behaving unexpectedly. The problem is that so many snippets, which have historically been obvious or even idiomatic, contain UB that has almost never resulted in unexpected behavior. Modern optimizing compilers have only been catching up to these in recent years.


There have been more than a few compiler bugs that have introduced UB and then that was subsequently optimized, leading to very incorrect program behavior.


A compiler bug cannot introduce UB by definition. UB is a contract between the coder and the C language standard. UB is solely determined by looking at your code, the standard, and the input data; it is independent of the compiler. If the compiler converts UB-free code into misbehavior, then that's a compiler bug / miscompilation, not an introduction of UB.


A compiler bug is a compiler bug, UB or not. You might as well just say "There have been more than a few compiler bugs, leading to very incorrect program behavior."


The whole thread is about how UB is not like other kinds of bugs. Having a compiler optimization erroneously introduce a UB operation means that downstream the program can be radically altered in ways (as discussed in thread) that don't happen in systems without the notion of a UB.

While it's technically true that any compiler bug (in any system) introduces bizarre, incorrect behavior into a program, UB just supercharges the things that can go wrong due to downstream optimizations. And incidentally, makes things much, much harder to diagnose.


I just don't think it makes much sense to say that an optimization can "introduce a UB operation". UB is a property of C programs: if a C program executes an operation that the standard says is UB, then no requirement is imposed on the compiler for what should happen.

In contrast, optimizations operate solely on the compiler's internal representation of the program. If an optimization erroneously makes another decide that a branch is unreachable, or that a condition can be replaced with a constant true or false, then that's not "a UB operation", that's just a miscompilation.

The latter set of optimizations is just commonly associated with UB, since C programs with UB often trigger those optimizations unexpectedly.


LLVM IR has operations that have UB for some inputs. It also has poison values that act...weird. They have all the same implications of source-level UB, so I see no need to make a distinction. The compiler doesn't.


Any optimization that causes undefined behavior is bugged – please report them to your compiler's developers.


By definition an optimisation can't cause UB, as UB is a language-level construct.

An optimisation can cause a miscompilation. Those happen and are very annoying.


Miscompilations are rarer and less annoying in compilers that do not have the design behaviour of compiling certain source code inputs into bizarre nonsense that bears no particular relation to those inputs.


You realize these two statements are equivalent, right?

> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible. As long as such compiler conforms to the C standard, you have every right to promote this alternative. Don't shame other people building or using optimizing compilers.


> compiling certain source code inputs into bizarre nonsense

> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities

Mainstream C compilers actually make special exceptions for the undefined behaviour that's seen in popular benchmarks so that they can continue to "win" at them. The whole exercise is a pox on the industry; maybe at some point in the past those benchmarks told us something useful, but they're doing more harm than good when people use them to pick a language for modern line-of-business software, which is written under approximately none of the same conditions or constraints.

> Don't shame other people building or using optimizing compilers.

The people who are contributing to security vulnerabilities that leak our personal information deserve shame.


It's true that I don't like security vulnerabilities either. I think the question boils down to, whose responsibility is it to avoid UB - the programmer, compiler, or the standard?

I view the language standard as a contract, an interface definition between two camps. If a programmer obeys the contract, he has access to all compliant compilers. If a compiler writer obeys the contract, she can compile all compliant programs. When a programmer deviates from the contract, the consequences are undefined. Some compilers might cater to these cases (e.g. -fwrapv, GNU language extensions) as a superset of all standard-compliant programs.

Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.


> Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.

That feels backwards in terms of how the C standard actually gets developed - my impression is that most things that eventually get standardised start life as a vendor-specific language extensions, and it's very rare to have the C standard to introduce something and the compiler vendors then follow.

And really in a lot of cases the concept of UB isn't the problem, it's the compiler culture that's grown up around it. For example, the original reason for null dereference being UB was to allow implementations to trap on null dereference, on architectures where that's cheap, without being obliged to maintain strict ordering in all code that dereferences pointers. It's hard to imagine how what the standard specifies about that case could be improved; the problem is compiler writers prioritising benchmark performance over useful diagnostic behaviour.


> If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible.

Most optimizing compilers can do this already, it's just the -O0 flag.


I tried compiling "int x = 1 / 0;" in both the latest GCC and Clang with -O0 on x86-64 on Godbolt. GCC intuitively preserves the calculation and emits an idiv instruction. Clang goes ahead and does constant folding anyway, and there is no division to be seen. So the oft-repeated advice of using -O0 to try to compile the code as literally as possible in hopes of diagnosing UB or making it behave sanely, is not great advice.


I recently dealt with a bit of undefined behavior (in unsafe Rust code, although the behavior here could similarly happen in C/C++) where attempting to print a value caused it to change. It's hard to overstate how jarring it is to see code that says "assert that this value isn't an error, print it, and then try to use it", and have the assertion pass but then have the value printed out as an error and then panic when trying to use it. There's absolutely no reason why this can't happen, since "flipping bits of the value you tried to print" doesn't count as potential UB any less than a segfault, but it can be hard to turn off the part of your brain that is used to assuming that values can't just arbitrarily change at any point in time. "Ignore the rest of the program and do whatever you want after a single mistake" is not a good failure mode, and it's kind of astonishing to me that people are mostly just fine with it because they think they'll be careful enough to never make a mistake, or that they'll be lucky enough that it doesn't completely screw them over.

The only reason we use unsafe code on my team's project is because we're interfacing with C code, so it was hard not to come away from that experience thinking that it would be incredibly valuable to shrink the amount of interfacing with C as small as possible, and ideally to the point where we don't need to at all.


It's not insidious at all. The C compiler offers you a deal: "Hey, my dear programmer, we are trying to make an efficient program here. Sadly, I am not sophisticated enough to deduce a lot of things, but you can help me! Here are some of the rules: don't overflow integers, don't dereference null pointers, don't go outside of array bounds. You follow those and I will fulfill my part of making your code execute quickly".

The deal is known and fair. Just be a responsible adult about it: accept it, live with the consequences and enjoy the efficiency gains. You can reject it, but then don't use arrays without a bounds check (a lot of libraries out there offer that), check your integer bounds or use a sanitizer, and check your pointers for nulls before dereferencing them; there are many tools out there to help you. Or... just use another language that does all that for you.


UB was insidious to me because I was not taught the rules (this was back in years 2005 to 2012; maybe it got more attention now), it seemed my coworkers didn't know the rules and they handed me codebases with lots of existing hidden UB, and UB blew up in my face in very nasty ways that cost me a lot of debugging time and anguish.

Also, the UB instances that blew up had already been tested to work correctly... on some other platform (e.g. Windows vs. Linux) or on some other compiler version. There are many things in life and computing where, when you make a mistake, you find out quickly. If you touch a hot pan, you get a burn and quickly pull away. But if you miswire an electrical connection, it could slowly come loose over a decade and start a fire behind the wall. Likewise, a wrong piece of code that seems to behave correctly at first lulls the author into a false sense of security. By the time a problem appears, the author could be gone, or might not recall which line, out of the thousands written years ago, caused the issue.

Three dictionary definitions for insidious, which I think are all appropriate: 1) intended to entrap or beguile 2) stealthily treacherous or deceitful 3) operating or proceeding in an inconspicuous or seemingly harmless way but actually with grave effect.

I'm neutral now with respect to UB and compilers; I understand the pros and cons of doing things this way. My current stance is to know the rules clearly and always stay within their bounds, to write code that never triggers UB to the best of my knowledge. I know that testing compiled binaries produces good evidence of correct behavior but cannot prove the nonexistence of UB.


I don't think this is the whole story. There are certain classes of undefined behavior that some compilers actually guarantee to treat as valid code. Type punning through unions in C++ comes to mind: GCC says go ahead, the standard says UB. In cases like these, it really just seems like the standard is lazy.
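For illustration, here's a small sketch of the kind of union punning being discussed (written in C, where the standard permits it; in C++ it is UB on paper, though GCC documents it as supported):

    #include <stdint.h>
    #include <stdio.h>

    union pun {
        float    f;
        uint32_t u;
    };

    int main(void) {
        union pun p;
        p.f = 1.0f;
        /* Reading a member other than the one last written reinterprets the
           bytes; assumes 32-bit IEEE-754 floats (prints 0x3f800000). */
        printf("0x%08x\n", (unsigned)p.u);
        return 0;
    }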


> The deal is known and fair.

It often isn't. C is often falsely advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect. Some writers may be used to pre-standardization compilers that are much less hostile than modern GCC/Clang.


> C is often [correctly, but misleadingly] advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect.

Because that's what it is. What they don't tell you is that the most heavily-developed two (or more) compilers for it (which you might otherwise assume meant the two best compilers), are malware[0] that actively seek out excuses to inject security vulnerabilities (and other bugs) into code that would work fine if compiled to the assembly that any reasonable author would expect.

0: http://web.archive.org/web/20070714062657/http://www.acm.org... Reflections on Trusting Trust (Ken Thompson):

> Figure 6 shows a simple modification to the compiler that will deliberately miscompile source whenever a particular pattern is matched. If this were not deliberate, it would be called a compiler "bug". Since it is deliberate, it should be called a "Trojan horse".


Nice way to put down the amazing work of compiler authors. It's not malware; you just don't understand how to use it. If you don't want the compilers to do crazy optimisations, turn down the optimisation level. If you want them to check for things like null pointers or integer overflow or array bounds at runtime, then just turn on the sanitizers those compiler writers kindly provided to you.

You just want all of it: fast optimizing compiler, one that checks for your mistakes but also one that knows when it's not a mistake and still generates fast code. It's not easy to write such a compiler. You can tell it how to behave though if you care.


> If you want them to check for things like null pointers or integer overflow or array bounds

I specifically don't want them to check for those things; that is the fucking problem in the first place! When I write:

  x = *p;
I want it compiled to a damn memory access. If I meant:

  x = *p; __builtin_assume_non_null(p);
I'd have damn well written that.


Socialism is when the government does something I don't like, and Reflections on Trusting Trust is when my compiler does something I don't like. The paper has nothing to do with how optimizing compilers work. Compiling TCC with GCC is not going to suddenly make it into a super-optimizing UB-exploiting behemoth.


This article on undefined behavior looks pretty good (2011?)

https://blog.regehr.org/archives/213

A main point in the article is function classification, i.e. 'Type 1 Functions' are outward-facing, and subject to bad or malicious input, so require lots of input checking and verification that preconditions are met:

> "These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1."

Internal utility functions that only use data already filtered through Type 1 functions are called "Type 3 Functions", i.e. they can result in UB if given bad inputs:

> "Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented."

Incidentally I found that article from the top link in this Chris Lattner post on the LLVM Project Blog, "What Every C Programmer Should Know About Undefined Behavior":

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

In particular this bit on why internal functions (Type 3, above) shouldn't have to implement extensive preconditions (pointer dereferencing in this case):

> "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic. This would have an extremely high cost for many numerical and other applications, as well as breaking binary compatibility with every existing C library."

Basically, the conclusion appears to be that any data input to a C program by a user, socket, file, etc. needs to go through a filtering and verification process of some kind, before being handed to over to internal functions (not accessible to users etc.) that don't bother with precondition testing, and which are designed to maximize performance.

In C++ I suppose, this is formalized with public/private/protected class members.
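To make that split concrete, here's a rough sketch with invented names; the outer function validates untrusted input, the inner one only documents its precondition:

    #include <stdbool.h>
    #include <stddef.h>

    /* "Type 3": internal helper. Documented (but unchecked) precondition:
       buf points to at least len readable bytes. */
    static unsigned sum_bytes(const unsigned char *buf, size_t len) {
        unsigned sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    /* "Type 1": API-level entry point. Behaves well for all inputs. */
    bool checked_sum(const unsigned char *buf, size_t len, unsigned *out) {
        if (buf == NULL || out == NULL)
            return false;  /* reject bad input instead of risking UB */
        *out = sum_bytes(buf, len);
        return true;
    }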


I haven’t used C or C++ for anything, but in writing a Game Boy emulator I ran into exactly that kind of memory corruption pain. An opcode I implemented wrong causes memory to corrupt, which goes unnoticed for millions of cycles or sometimes forever depending on the game. Good luck debugging that!

My lesson was: here’s a really really good case for careful unit testing.


Yeah for that kind of stuff you want tests on every single op checking they make exactly the change you expect.


I would go one step farther: the documentation will say it is undefined behavior, but the compiler doesn't have to warn you. Here's an example from the man page for sprintf:

  sprintf(buf, "%s some further text", buf);
If you miss that section of the manual, your code may work, leading you to think the behavior is defined.

Then you will have interesting arguments with other programmers about what exactly is undefined behavior, e.g. what happens for

  sprintf(buf, "%d %d", f(i), i++);


I remember reading a blog post a couple of years back on undefined behavior from the perspective of someone building a compiler. The way the standard defines undefined behavior (pun not intended), a compiler writer can basically assume undefined behavior never occurs and stay compliant with the standard.

This opens the door to some optimizations, but also allows compiler writers to reduce the complexity of the compiler itself in some places.

I'm being very vague here, because I have no actual experience with compiler internals, nor that level of language-lawyer pedantry. The blog's name was "Embedded in academia", I think, you can probably still find the blog and the particular post if it sounds interesting.


Yeah, a decent chunk of UB is about reducing the burden on the compiler. Null derefs are an obvious example. If it were defined behavior, the compiler would be endlessly adding, and then attempting to optimize away, null checks. Which isn't something anyone actually wants when reaching for C/C++.

Similarly with C/C++ it's not actually possible for the compiler to ensure you don't access a pointer past the end of the array - the array size often isn't "known" in a way the compiler can understand.
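A small illustration of why the check can't be automatic: once an array is passed to a function, only a pointer crosses the boundary, so the length has to travel separately (names invented):

    #include <stddef.h>
    #include <stdio.h>

    /* The callee sees only a pointer; it cannot recover the caller's array size. */
    static void print_all(const int *p, size_t n) {
        for (size_t i = 0; i < n; i++)
            printf("%d ", p[i]);
        printf("\n");
    }

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        print_all(a, sizeof a / sizeof a[0]);  /* the size is only known here */
        return 0;
    }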


> Which isn't something anyone actually wants when reaching for C/C++.

Disagree. I think a lot of people want some kind of "cross-platform assembler" (i.e. they want e.g. null deref to trap on architectures where it traps, and silently succeed on architectures where it succeeds), and get told C is this, which it very much isn't.


Except every other sane systems programming language does indeed do null checks, even those older than C, but they didn't come with UNIX, so here we are.


I'll tell you what happens when someone writes:

      sprintf(buf, "%d %d", f(i), i++);
They get told to rewrite it.


Good point, actually. Many cases of undefined behavior are clearly visible to an experienced C programmer when they review someone else’s code.


By whom? Most places still don't do proper code reviews or unit testing.


Was rewriting the stack due to undefined behavior or was it due to a logic error, e.g. improper bounds calculation?


Isn’t all UB a result of logic errors?

Writing beyond the end of allocated memory (due to incorrect bounds calculation ) is an example of undefined behaviour


No, even type-punning properly allocated memory (e.g. using memory to reinterpret the bits of a floating point number as an integer) through pointers is UB because compilers want to use types for alias analysis[1]. In order to do that "properly" you are supposed to use a union. In C++ you are supposed to use the reinterpret_cast operator.

[1] Which IMO goes back to C's original confusion of mixing up machine-level concepts with language-level concepts from the get-go, leaving optimizers no choice but unsound reasoning and blaming programmers when they get it wrong. Something something numerical loops and supercomputers.


I believe using reinterpret_cast to reinterpret a float as an int is undefined behavior, because I don't believe that follows the type aliasing rules [1]. However, you could reinterpret a pointer to a float as a pointer to char, unsigned char, or std::byte and examine it that way.

As far as I'm aware, it's safe to use std::memcpy for this, and I believe compilers recognize the idiom (and will not actually emit code to perform a useless copy).

[1] https://en.cppreference.com/w/cpp/language/reinterpret_cast
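For what it's worth, a minimal sketch of that memcpy idiom, shown in C (the C++ std::memcpy version is analogous); assumes 32-bit IEEE-754 floats:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = 1.0f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      /* well-defined: no aliasing violation */
        printf("0x%08x\n", (unsigned)bits);  /* prints 0x3f800000 */
        return 0;
    }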


That's like saying all bugs are undefined behavior. C lets you write to your own stack, so if you corrupt the stack due to an application error (e.g. bounds check), then that's just a bug because you were executing fully-defined behavior. Examples of undefined behavior would be things like dividing by 0 where the result of that operation can differ across platforms because the specific behavior wasn't defined in the language spec.


Writing past the end of an array is defined as UB.

Not all bugs are UB, you can have logic errors of course. But stack corruption is I believe always triggered by UB.


There are some complicated UBs that arise when casting to different types that are not obviously logic errors (can't remember the specifics but remember dealing with this in the past).


As a curious FE developer with no C experience, this was very interesting. Thanks for writing the article!


This looks decent, but I'm (highly) opposed to recommending `strncpy()` as a fix for `strcpy()` lacking bounds-checking. That's not what it's for; it's weird and should be considered as obsolete as `gets()` in my opinion.

If available, it's much better to do it the `snprintf()` way as I mentioned in a comment last week, i.e. replace `strcpy(dst, src)` with `snprintf(dst, sizeof dst, "%s", src)` and always remember that "%s" part. Never put src there, of course.

There's also `strlcpy()` on some systems, but it's not standard.
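A small sketch of the pattern (identifiers invented); note that `sizeof dst` only gives the buffer size because `dst` is an actual array here, a caveat raised further down the thread:

    #include <stdio.h>

    void copy_name(const char *src) {
        char dst[32];
        /* Always NUL-terminates; silently truncates if src is too long. */
        snprintf(dst, sizeof dst, "%s", src);
        puts(dst);
    }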


strncpy does have its odd and rare use-case, but 100% agree that it is not at all a “fix” for strcpy; it's not designed for that purpose and is unsuited to it, being both unsafe (does not guarantee NUL-termination) and unnecessarily costly (fills the destination with NULs).

The strn* category was generally designed for fixed-size, NUL-padded content (though not all of them, because why be coherent?), so the entire item is incorrect, and it really makes the whole list suspicious.


Then there are strn*_s since C11 (and available before that on many platforms) which do exactly what you want.


Lol no. These are the Annex K stuff which Microsoft got into the standard, which got standardised with a different behaviour than Windows’ (so even on Windows following the spec doesn’t work) and which no one else wants to implement at all.

And they don’t actually “do exactly what you want”, see for instance N1967 (“field experience with annex k”) which is less than glowing.


> sizeof dst

Note that this only works if dst is a stack-allocated (in the same function) array and not a char *


> and always remember that "%s" part. Never put src there, of course

> Note that this only works if dst is a stack allocated array

Even this "ideal" solution is full of pitfalls. The state of memory safety is so sad in the world of C.


Yes it should be read as a placeholder for whatever you need to do.

Could be an array inside a struct too for instance, that is quite common.


Ah, that was one of my less considered additions - thank you for the feedback!


Would it be a sin to use memcpy() and leave things like input validation to a separate function? I'm nervous any time somebody takes a function with purpose X and uses it for purpose Y.


Uh, isn't using `memcpy()` to copy strings doing exactly that?

The problem is that `memcpy()` doesn't know about (of course) string terminators, so you have to do a separate call to `strlen()` to figure out the length, thus visiting every character twice which of course makes no sense at all (spoken like a C programmer I guess ... since I am one).

If you already know the length due to other means, then of course it's fine to use `memcpy()` as long as you remember to include the terminator. :)


The reason you would want to use memcpy would be if 1) you already know what the length is, 2) you need a custom validator for your input, 3) you don't want to validate your input (though snprintf() does do some of that), or 4) the string may include nulls or there is no null terminator.

But the fifth reason may be that depending on snprintf as your "custom-validator-and-encoder-plus-null-terminator" may introduce subtle bugs in your program if you don't know exactly what snprintf is doing under the hood and what its limitations are. By using memcpy and a custom validator, you can be more explicit about how data is handled in your program and avoid uncertainty.

(by "validate" I mean handle the data as your program expects. this could be differentiating between ASCII/UTF-8/UTF-16/UTF-32, adding/preserving/removing a byte-order mark, eliminating non-printable characters, length requirements, custom terminators, or some other requirement of whatever is going to be using the new copy of the data)


If you really need a fast strcpy then probably not, but in most situations snprintf will do the job just fine. And will prevent heartache.


snprintf is pretty slow, partly because it returns things people typically don't want.


When I first learned C - which also was my first contact with programming at all - I did not understand how pointers work, and the book I was using was not helpful at all in this department. I only "got" pointers like three or four years later, fortunately programming was still a hobby at that point.

Funnily, when I felt confident enough to tell other people about this, several immediately started laughing and told me what a relief it was to hear they weren't the only ones with that experience.

Ah, fun times.

EDIT: One book I found invaluable when getting serious about C was "The New C Standard: A Cultural and Economic Commentary" by Derek Jones (http://knosof.co.uk/cbook). You can read it for free because the book ended up being too long for the publisher's printing presses or something like that. It's basically a sentence-by-sentence annotated version of the C standard (C99 only, though) that tries to explain what the respective sentence means to C programmers and compiler writers and how other languages (mostly C++) deal with the issue at hand, but also how this impacts the work of someone developing coding guidelines for large teams of programmers (which was how the author made a living at the time, possibly still is). It's more than 1500 pages and a very dense read, but it is incredibly fine-grained and in-depth. Definitely not suitable for people who are just learning C, but if you have read "Expert C Programming: Deep C Secrets" and found it too shallow and whimsical, this book was written for you.


Having basic experience in any assembly language makes pointers far more clear.

"Addressing modes," where a register and some constant are used to calculate the source or target of a memory operation, make the equivalence of a[b]==*(a+b) much more obvious.

I also wonder about the author's claims that a char is almost always 8 bits. The first SMP machine that ran Research UNIX was a 36-bit UNIVAC. I think it was ASCII, but the OS2200/EXEC8 SMP matured in 1964, so this was an old architecture at the time of the port.

"Any configuration supplied by Sperry, including multiprocessor ones, can run the UNIX system."

https://www.bell-labs.com/usr/dmr/www/otherports/newp.pdf


> Having basic experience in any assembly language makes pointers far more clear.

That's a key point. I came to C after several years of programming in assembly and a pointer was an obvious thing. But I can see that for someone coming to C from higher level languages it might be an odd thing.


There was an "official" C compiler for NOS running on the CDC Cyber. As I recall, 18-bit address, 60-bit words, more than one definition of a 'char' (12-bit or 5-bit, I think). It was interesting. There were a lot of strange architectures with a C compiler.

I would also point out architectures like the 8051 and 8086 made (make...they are still around) pointer arithmetic interesting.


The C standard, as I recall, effectively defines a byte as at least 8 bits. I've read that some DSP platforms use a byte (and thus a char) that is 24 bits wide, because that's what audio samples use, but supposedly those platforms rarely, if ever, handle any actual text. The standard library has a macro, CHAR_BIT in <limits.h>, that tells you how many bits a char has.

I think I remember reading about a C compiler for the PDP-10 (or Lisp Machine?), also a 36-bit machine, that used a 9 bit byte. There even exists a semi-jocular RFC for UTF-9 and UTF-18.
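For anyone curious about their own platform, CHAR_BIT is easy to check:

    #include <limits.h>
    #include <stdio.h>

    int main(void) {
        printf("bits in a char: %d\n", CHAR_BIT);  /* 8 on virtually all hosted platforms today */
        return 0;
    }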


Pointers are by far the most insidious thing about C. The problem is that nobody who groks pointers can understand why they had trouble understanding them in the first place.

Once you understand, it seems so obvious that you cannot imagine not understanding what a pointer is, but at the beginning, trying to figure out why the compiler won't let you assign a pointer to an array, like `char str[256]; str = "asdf"`, is maddening.

One thing I think would benefit many is if we considered "arrays" in C to be an advanced topic, and focused on pointers only; treating "malloc" as a magical function until the understanding of pointers and indexing is so firmly internalized that you can just add on arrays to that knowledge. Learning arrays first and pointers second is going to break your brain because they share so much syntax, but arrays are fundamentally a very limited type.


When I've had to explain it, I describe memory as a street with house numbers (which are memory addresses).

A house can store either people, or another house number (for some other address).

If you use a person as a house number, it will inflict grievous harm upon that person. If you use a house number as a person, it will blow up some random houses. Very little in the language stops you from doing this, so you have to be careful not to confuse them.

Then I describe what an MMU does with a TLB, at which point the eyes glaze over.


From my memory, the syntax of pointers really tripped me up. E.g., the difference between * and & in declaration vs dereferencing. I think this is especially confusing for beginners when you add array declarations to the mix.


Agreed. Complex declarations (e.g., array of pointers to functions) are non-intuitive:

http://www.ericgiguere.com/articles/reading-c-declarations.h...

https://cdecl.org/

I learned this from the book Expert C Programming by Peter van der Linden.
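A couple of examples of the kind of declarations those resources help decode (identifiers invented):

    int *(*fp)(void);         /* fp: pointer to a function returning pointer to int */
    int (*handlers[4])(int);  /* handlers: array of 4 pointers to functions
                                 taking an int and returning an int */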


I don't remember C using & in declarations. Is it a recent addition?


What is so difficult about the concept of a memory address? Is it the C syntax? Asking because I personally have never struggled with this.


That's the problem! I can't tell you what is difficult because it seems so incredibly obvious to me now.

When I was ~12, I had a lot of trouble with it, and the only thing I remember from those times is various attempts to figure out why the compiler wouldn't let me assign a string value to an array. What the hell is an "lvalue", Mr. Compiler?

Now I look at the assignment statement above and I recoil in horror, but at the time it seemed very confusing to me, especially since `char *str; str = "abcd";` works so well. The difference between the two (as far as intention goes) is vast in retrospect, but for some reason I had trouble with it back then.


The pointer/array confusion in C makes this way harder to understand than it has to be. The other thing is the syntax, which is too clever and too hard to parse in your head for complex expressions. Both of these things also tend to not be explained very well to beginners, probably partly due to the fact that explaining it in detail is complex and would perhaps go over the beginner's head. It's also stupid, so you'd probably have to explain how it turned out to be this complex.


> Is it the C syntax?

Pretty sure that this is a big factor, I’m not aware of any recent languages that put type information before and after the variable name. Nowadays there’s always a clear distinction between the variable name and the type annotation.


For this example:

   char str[256]; str = "asdf"
neither str nor "asdf" is a pointer-type expression; they're both arrays (which is exposed by sizeof). The reason why this doesn't work is because C refuses to treat arrays as first-class value types - which is not an obvious thing to do regardless of how well you understand pointers. Other languages with arrays and pointers generally haven't made this mistake.


One thing that helped me understand pointers was understanding that a pointer is just a memory address.

When I was still a noob programmer, my instructor merely stuck to words like "indirection" and "dereferencing" which are all fine and dandy, but learning that a pointer is just a memory address instantly made it click.

Pointers are a $1000 topic for a $5 concept.


Well there’s a little bit more to it. There is a type involved, and then there’s pointer arithmetic.


Well yes, but those aren't hard.

Pointer arithmetic is merely knowing that any addition/subtraction done to a pointer is multiplied by the size of the type being pointed to. So if you're pointing to a 64-byte struct, then "ptr++;" adds 64 to the pointer.
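A quick sketch of that scaling (struct size chosen to make the arithmetic obvious):

    #include <stdio.h>

    struct big { char bytes[64]; };

    int main(void) {
        struct big arr[2];
        struct big *p = arr;
        p++;  /* advances by sizeof(struct big), i.e. 64 bytes, not 1 */
        printf("%td\n", (char *)p - (char *)arr);  /* prints 64 */
        return 0;
    }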


Typed pointers interact with aliasing in "interesting" ways.


When I’m teaching (a very high-level language), I make a point of saying that a variable is a named memory location. Where is that location? We don’t know. Now, I am absolutely aware that the address isn’t the “real” location, but I have this idea that talking about variables in this way might help them grok the lower-level concept later on.


My experience with pointers was the inverse of yours. My first programming language was Java, and I spent many hours puzzling out reference types (and how they differed from primitive types). I only managed to understand references after somebody explained them as memory addresses (e.g. the underlying pointer implementation). When I later learned C, I found pointers to be delightfully straightforward. Unlike references in Java, pointers are totally upfront about what they really are!


When I got to Java, I experienced the same problem. Much later, I learned C# and found that it apparently had observed and disarmed some of Java's traps, but it also got a little baroque in some places, e.g. with pointers, references, in/out parameters, value types, nullable types, ... A lot of the time one doesn't need it, but it is a bit of a red flag if a language has two similar concepts expressed in two similar ways but with "subtle" differences.

I did like the const vs readonly solution they came up with. I wish Go (my current goto (pun not necessarily unintentional) language) had something similar


From over a decade ago, I really enjoyed this clay animation on C pointers: https://www.youtube.com/watch?v=5VnDaHBi8dM , http://cslibrary.stanford.edu/104/


"The C Puzzle Book" is the thing I recommend to anyone who knows they want to have a good, working understanding of how to use pointers programming in C.

Many years ago I did the exercises on the bus in my head, then checking the answers to see what I got wrong and why over the space of a week or so. It's a really, really good resource for anyone learning C. It seemed to work for several first year students who were struggling with C in my tutorials as well and they did great. Can't recommend it highly enough to students and the approach to anyone tempted to write an intro C programming text.


I would highly recommend the video game Human Resource Machine for getting a really good understanding of how pointers work.

It's more generally about introducing assembly language programming (sort of) in gradual steps, so you'll need to play through a fair chunk of the game before you get to pointers. But by the time you get to them, they will seem like the most obvious thing in the world. You might even have spent the preceding few levels wishing you had them.


> Declaring a variable or parameter of type T as const T means, roughly, that the variable cannot be modified.

I would add "... cannot be modified through that pointer". (Yes, in fairness, they did say "roughly".) For example consider the following:

    void foo(int* x, const int* y)
    {
        printf("y before: %d\n", *y);
        *x = 3;
        printf("y after: %d\n", *y);
    }
This will print two different values if you have `int i = 1` and you call `foo(&i, &i)`. This is the classic C aliasing rule. The C standard guarantees that this works even under aggressive optimisation (in fact certain optimisations are prevented by this rule), whereas the analogous Fortran wouldn't be guaranteed to work.


You already know this, but I would add that under strict aliasing rules, this is only valid because x and y point to the same type.

The most common example is when y is float* and someone tries to access its bitwise representation via an int*.

(Please correct me if I'm wrong)

https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...


A small detail: you probably meant

  printf("y before: %d\n", *y);


Oops you're right! Fixed now thanks.


*y in both printf, right?


I was born in '74, so I'm part of the last generation to start with C and go on to other, higher-level languages like Python or JavaScript. Going in this direction was natural. I was amazed by all the magic the higher-level languages offered.

Going the other direction is a bit more difficult apparently. "What do you mean it does not do that?". Interesting perspective indeed!


What was nice about C then was that, based on my study of CPUs at the time, you could pretty much get your head around what the CPU was doing. So you could learn the instructions (C) and the machine following them (the CPU).

When I got to modern CPUs it's so complex my eyes glazed over reading the explanation and I gave up trying to understand them.


I was born in the late 80s and C was my first language, in a community college intro to programming class.


I started coding with C and OCaml in 2019. Everything in between these two was so unnatural. With JavaScript as the worst of all


This was my experience with learning programming as well, however, I am 2x younger :)


Introductory programming courses at the University of Arizona were still taught in C when I was a freshman in 2008


I'm a decade younger and my university taught C for its intro to programming class.

Granted, it was a disaster of a programming class.


Some constructive feedback:

> Here are the absolute essential flags you may need.

I highly recommend including `-fsanitize=address,undefined` in there (docs: https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.h...).

(Edit: But probably not in release builds, as @rmind points out.)

> The closest thing to a convention I know of is that some people name types like my_type_t since many standard C types are like that

Beware that names beginning with "int"/"uint" and ending with "_t" are reserved in <stdint.h>.

[Edited; I originally missed the part about "beginning with int/uint", and wrote the following incorrectly comment: "That shouldn't be recommended, because names ending with "_t" are reserved. (As of C23 they are only "potentially reserved", which means they are only reserved if an implementation actually uses the name: https://en.cppreference.com/w/c/language/identifier. Previously, defining any typedef name ending with "_t" technically invokes undefined behaviour.)"]

The post never mentions undefined behaviour, which I think is a big omission (especially for programmers coming from languages with array index checking).

> void main() {

As @vmilner mentioned, this is non-standard (reference: https://en.cppreference.com/w/c/language/main_function). The correct declaration is either `int main(void)` or the argc+argv version.

(I must confess that I am guilty of using `int main()`, which is valid in C++ but technically not in C: https://stackoverflow.com/questions/29190986/is-int-main-wit...).

> You can cast T to const T, but not vice versa.

This is inaccurate. You can implicitly convert T* to const T*, but you need to use an explicit cast to convert from const T* to T*.


UPDATE regarding "_t" suffix:

POSIX reserves "_t" suffix everywhere (not just for identifiers beginning with "int"/"uint" from <stdint.h>); references: https://www.gnu.org/software/libc/manual/html_node/Reserved-..., https://pubs.opengroup.org/onlinepubs/9699919799/functions/V....

So I actually stand by my original comment that the convention of using "_t" suffix shouldn't be recommended. (It's just that the reasoning is for conformance with POSIX rather than with ISO C.)


Well, semantically, "size_t" makes sense to me ("the type of a size variable"), while "uint_t" does not ("the type of a uint variable"), because "uint" is already a type, obviously - just like "int".


> -fsanitize=address,undefined

In addition, I recommend -fsanitize=integer. This adds checks for unsigned integer overflow which is well-defined but almost never what you want. It also checks for truncation and sign changes in implicit conversions which can be helpful to identify bugs. This doesn't work if you pepper your code base with explicit integer casts, though, which many have considered good practice in the past.


Good one, thanks. Note that it requires Clang; GCC 12.2 doesn't have it.


Wow nice, I didn't know about this one. I can add some more which are less known. This is my current sanitize invocation (minus the addition of "integer" which I'll be adding, unless one of these other ones covers it):

  -fsanitize=address,leak,undefined,cfi,function
CFI has checks for unrelated casts and mismatched vtables which is very useful. It requires that you pass -flto or -flto=thin and -fvisibility=hidden.

You can read a comparison with -fsanitize=function here:

https://clang.llvm.org/docs/ControlFlowIntegrity.html#fsanit...

There's also TypeSanitizer, which isn't officially released, but is really interesting and should be able to be applied via a patch from the branch:

https://www.youtube.com/watch?v=vAXJeN7k32Y

https://reviews.llvm.org/D32199

  $ curl -L 'https://reviews.llvm.org/D32199?download=1' | patch -p1


I think "leak" is always enabled by "address". It's only useful if you want run LeakSanitizer in stand-alone mode. "integer" is only enabled on demand because it warns about well-defined (but still dangerous) code. You can also enable "unsigned-integer-overflow" and "implicit-conversion" separately. See https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#...


Ah I wasn't sure if LSAN was always enabled with ASAN, good to know -- ty!


Why the hell was "potentially reserved" introduced? How is it different from simply "reserved" in practice, except for the fact that such things can be missing? How do you even use a "potentially reserved" entity reliably? Write your own implementation for platforms where such an entity is not provided, and then conditionally not link it on the platforms where it actually is provided? Is the latter even possible?

Also, apparently, "function names [...] beginning with 'is' or 'to' followed by a lowercase letter" are reserved if <ctype.h> and/or <wctype.h> are included. So apparently I can't have a function named "touch_page()" or "issue_command()" in my code. Just lovely.


From https://www.open-std.org/JTC1/sc22/wg14/www/docs/n2625.pdf:

> The goal of the future language and library reservations is to alert C programmers of the potential for future standards to use a given identifier as a keyword, macro, or entity with external linkage so that WG14 can add features with less fear of conflict with identifiers in user’s code. However, the mechanism by which this is accomplished is overly restrictive – it introduces unbounded runtime undefined behavior into programs using a future language/library reserved identifier despite there not being any actual conflict between the identifier chosen and the current release of the standard. ...

> Instead of making the future language/library identifiers be reserved identifiers, causing their use to be runtime unbounded undefined behavior per 7.1.3p1, we propose introducing the notion of a potentially reserved identifier to describe the future language and library identifiers (but not the other kind of reservations like __name or _Name). These potentially reserved identifiers would be an informative (rather than normative) mechanism for alerting users to the potential for the committee to use the identifiers in a future release of the standard. Once an identifier is standardized, the identifier stops being potentially reserved and becomes fully reserved (and its use would then be undefined behavior per the existing wording in C17 7.1.3p2). These potentially reserved identifiers could either be listed in Annex A/B (as appropriate), Annex J, or within a new informative annex. Additionally, it may be reasonable to add a recommended practice for implementations to provide a way for users to discover use of a potentially reserved identifier. By using an informative rather than normative restriction, the committee can continue to caution users as to future identifier usage by the standard without adding undue burden for developers targeting a specific version of the standard.


So... instead of mandating that implementations warn about (re)defining a reserved identifier, they introduce another class of "not yet reserved identifiers" and advise implementations to warn about defining such identifiers in the user code — even though it's completely legal — until the moment the implementation itself actually uses/defines such an identifier, at which point warning about such redefinition in the user code — now illegal and UB — is no longer necessary or advised.

Am I completely misreading this or is this actually insane? Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?


The problem is that the traditional wording of C meant that any variable named 'top' was technically UB, because it begins with 'to'.

In practical terms, what compilers will do is, if C2y adds a 'togoodness' function, they will add a warning to C89-C2x modes saying "this is now a library function in C2y," or maybe even have an extension to use the new thing in earlier modes. This is what they already do in large part; it's semantic wording to make this behavior allowable without resorting to the full unlimited power of UB.


> Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?

The C23 change was mostly to downgrade some of the existing reserved identifiers from "reserved" to "potentially reserved". (It also added some new reserved and potentially reserved identifiers, but they seem reasonable to me.)


I still fail to see any practical difference between these two categories, except that the implementations are recommended to diagnose illegal-in-the-future uses of potentially reserved identifiers but are neither required nor recommended to diagnose actually illegal uses of reserved identifiers. There is also no way to distinguish p.r.i from r.i.

It also means that if an identifier becomes potentially reserved in C23 and reserved in C3X, then compiling a valid C11 program that uses it as C23 will give you a warning, which you can fix and then compile resulting valid C23 program as C3X without any problem; but compiling such a C11 program straight up as C3X will give you no warning and a program with UB.

Seriously, it boggles my mind. Just a) require diagnostics for invalid uses of reserved identifiers starting from C23, b) don't introduce new reserved identifiers, there is already a huge amount of them.


How can a (badly chosen) typedef name trigger _undefined behavior_, and not just, say, a compilation error...?

I find it difficult to imagine what that would even mean.


You can declare a type without (fully) defining it, like in

    typedef struct foo foo_t;
and then have code that (for example) works with pointers to it (foo_t *). If you include a standard header containing such a forward declaration, and also declare foo_t yourself, there might be no compilation error, but other translation units might use differing definitions of struct foo, leading to unpredictable behavior in the linked program.


One potential issue would be that the compiler is free to assume any type with the name `foobar_t` is _the_ `foobar_t` from the standard (if one is added), it doesn't matter where that definition comes from. It may then make incorrect assumptions or optimizations based on specific logic about that type which end up breaking your code.


The problem being that to trigger a compile error the compiler would have to know all its reserved type names ahead of time.

It is not required to do so, hence undefined behavior. You might get a wrong underlying type under that name.


But wouldn't one be required to include a particular header in such case (i.e. the correct header for defining a particular type)?

I mean, no typedef names are defined in the global scope without including any headers right? Like I find it really weird that a type ending in _t would be UB if there is no such typedef name declared at all.

Or is this UB stuff merely a way for the ISO C committee to enforce this without having to define <something more complicated>?


[Note: What I originally wrote in my top-level comment was inaccurate; I edited that comment, but later posted another update: https://news.ycombinator.com/item?id=33773043#33775630.]

The purpose of this particular naming rule is to allow adding new typedefs such as int128_t. The "undefined behaviour" part is for declaration of any reserved identifier (not specifically for this naming rule). I don't know why the standard uses "undefined behaviour" instead of the other classes (https://en.cppreference.com/w/cpp/language/ub); I suspect because it gives compilers the most flexibility.


[Edit: My link to the behaviour classes was wrong (it was for C++ instead of C), it should have been https://en.cppreference.com/w/c/language/behavior]


Doesn’t the compiler need to know all of the types to do the compilation anyway?


I'm not sure, but in general having incompatible definitions for the same name is problematic.


Thank you so much! I will definitely be amending a few things. WRT no section on undefined behaviour - you're so right, how could I forget?


Certainly yes, but for debug builds and tests. It can be heavyweight for production.


C spec:

>That shouldn't be recommended, because names ending with "_t" are reserved.

Also C spec naming new things:

>_Atomic _Bool

I'm glad to see the C folks have a sense of humor.


Not all reserved names are reserved for all purposes. _t is reserved only for type names (typedefs), whereas _Atomic and _Bool are keywords.


The standard reserves several classes of identifiers, "_t" suffix [edit: with also "int"/"uint" prefix] is just one of several rules. Another rule is "All identifiers that begin with an underscore followed by a capital letter or by another underscore" (and also "All external identifiers that begin with an underscore").


That's only because bool was usually an old alias for int. It's defined as an alias for _Bool in <stdbool.h>, which is highly recommended.


> C has no environment which smooths out platform or OS differences

Not true - C has little environment, not no environment. For example, fopen("/path/file.txt", "r") is the same on Linux and Windows. For example, uint32_t is guaranteed to be 32 bits wide, unlike plain int.

> Each source file is compiled to a .o object file

Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

> static

This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

> Integers are very cursed in C. Writing correct code takes some care

Yes they very much are. https://www.nayuki.io/page/summary-of-c-cpp-integer-rules


> Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

The standard only says that the implementation must preprocess, translate, and link the several "preprocessing translation units" to create the final program. It doesn't say anything about how the translation units are stored on the system.

> This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

Not quite: in a global scope, it gives the variable internal linkage, so that other translation units can use the same name to refer to their own variables. In a block scope, it gives the variable static storage duration, but it doesn't give it any linkage. In particular, it doesn't let the program refer to the variable outside its block.
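A short sketch showing both uses side by side:

    /* File scope: `static` gives internal linkage - the name is private to
       this translation unit. */
    static int id_base = 1000;

    int next_id(void) {
        /* Block scope: `static` gives static storage duration - the value
           persists across calls - but has no linkage. */
        static int counter = 0;
        return id_base + counter++;
    }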


On Windows you can directly access UNC paths (without mounting) with fopen. You can't do this on POSIX platforms. Also, not all API boundaries are fixed width so you're going to be exposed to the ugliness of variable width types.

I think the article is correct that one must be aware of the platform and the OS when writing C code.


fopen will/should fail on Windows with the Unix path syntax.

The reason it's indeterminate is because some standard C library vendors will do path translation on Windows and some won't. I believe Cygwin does (because it's by definition a Unix-on-Windows), but I'm pretty sure the normal C library vendors on Windows do not.

I'm almost positive that classic Mac OS (before Mac OS X) will fail with Unix path separators, since its path separator is ':', not '/'.


It will work on Windows, since it inherits the behavior from MS-DOS. It's the shell on Windows (or MS-DOS) where it fails, since the shell uses '/' to designate options; so when MS-DOS gained subdirectories (in 2.0) it used '\' as the path separator in the shell. The "kernel" will accept both. There even used to be an undocumented (or under-documented) function in MS-DOS to switch the option character.


Apparently it's true. I wonder when this was implemented?

Canonicalize separators

"All forward slashes (/) are converted into the standard Windows separator, the back slash (\). If they are present, a series of slashes that follow the first two slashes are collapsed into a single slash."

https://learn.microsoft.com/en-us/dotnet/standard/io/file-pa...


Since MS-DOS 2.0, released in October of 1983---39 years ago.


After learning C, one of the first projects I came into contact with was the id Tech 3 game engine [1].

On the one hand, it taught me how professional C programmers structure their code (extra functions to remove platform differences, specific code shared between server and client to allow smooth predictions) and how incredibly fast computers can be (thousands of operations within milliseconds); but it also showed me how the same code can result in different executions due to compiler differences (tests pass, production crashes) and how important good debugging tools are (e.g. backtraces).

To this day I am very grateful for the experience, and that id Software decided to release the code as open source.

[1] https://github.com/id-Software/Quake-III-Arena


Love the intro and overview — looking forward to more!

These weren't mentioned in the post but have been very helpful in my journey as a C beginner so far:

- Effective C by Robert C. Seacord. It covers a lot of the footguns and gotchas without assuming too much systems or comp-sci background knowledge. https://nostarch.com/Effective_C (Also, how can you not buy a book on C with Cthulhu on the cover written by a guy with _three_ “C”s in his name?)

- Tiny C Projects by Dan Gookin, for a “learn by doing” approach. https://www.manning.com/books/tiny-c-projects

- Exercism's C language track: https://exercism.org/tracks/c

- Computer Systems, A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron for a deeper dive into memory, caches, networking, concurrency and more using C, with plenty of practice problems: https://csapp.cs.cmu.edu/


More Good projects to learn from:

  - Busybox (https://github.com/mirror/busybox)
  - uClibc (https://git.uclibc.org/uClibc/tree/)
  - musl (https://git.musl-libc.org/cgit/musl/tree/)
  - misc GNU tools (https://git.savannah.gnu.org/cgit/grep.git/tree/, https://git.savannah.gnu.org/cgit/findutils.git/tree/, etc)
The first two are oriented towards embedded development, which I find leads to the simplest, most portable code. Those devs are absolute wizards.


This used to be my bible when doing full-time C programming around 2000 (together with the standard docs). I'm out of date with the latest standard updates (as is this), but it may still be of interest.

https://c-faq.com/


Thanks for submitting this. I'm teaching myself C so these high level overviews are super useful for improving my intuition. In the following example, shouldn't there be an asterisk * before the data argument in the getData function call? The way I understand it, the function is expecting a pointer, so you would need to pass it a pointer to the data object.

> "If you want to “return” memory from a function, you don’t have to use malloc/allocated storage; you can pass a pointer to a local data:

void getData(int *data) { data[0] = 1; data[1] = 4; data[2] = 9; }

void main() { int data[3]; getData(data); printf("%d\n", data[1]); } "


No, it's correct. The asterisk is a little inconsistent, in that it means two opposite things. In the declaration it means "this is a pointer." However, in an expression, it means "this is the underlying type" and serves to dereference the pointer.

    int a = 5;
    int *x; // this is a pointer
    x = &a;
    int c = *x; // both c and *x are ints
If it were *data, it would be equivalent to *(data + 0), which is equivalent to data[0], which is an int. You don't want to pass an int, you want to pass an int *.


The way I got this to stick in my head was to always think of * as dereferencing, and tell myself that

    int *x;
is declaring that the type of *x is int.


It's not just a memorization trick; that's exactly what the statement means. If you do

    int *x, y;
You're saying that both *x and y are integers.


... which is why I never understood why this is the convention rather than

    int* x, y. 
Does somebody know?


Because now you've got an int pointer and an int. The star associates with the right, not left.

I prefer to use the variant you described though, because it feels more natural to associate the pointer with the type itself. As far as I know, the only pitfall is in the multiple declaration thing so I just don't use it.

IMO, it's also more readable in this case:

    int *get_int(void);
    int* get_int(void);
The second one more clearly shows that it returns a pointer-to-int.


Multiple declaration is generally frowned upon, because you declare the variables without immediately setting them to something.

If you always set new variables in the same statement you declare them, then you don't use multiple declarations, which means there is no ambiguity putting the * by the type name.

So convention wins out for convention's sake. And that's the entire point of convention in the first place: to sidestep the ugly warts of a decades-old language design.


Spaces are ignored (except to separate things where other syntactic markers like * or , aren't present), and * binds to the variable on the right, not the type on the left. I actually got this wrong in an online test, but I screenshotted every question so I could go over them later (a dirty trick, I admit, but I learned things like this from it, and I still did well enough on the test to get the interview).

    int*x,y;  // x is pointer to int, y is int
    int x,*y; // x is int, y is pointer to int

And the reason I got it wrong on the test is it had been MANY years since I defined more than one variable in a statement (one variable defined per line is wordier but much cleaner), so if I ever knew this rule before, I had forgotten it over time.

I keep wanting to use slash-star comments, but I recall // is comment-to-end-of-line in C99 and later, something picked up from its earlier use in C++.

Oh yeah, C99 has become the de-facto "official" C language, regardless of more recent changes/improvements, as not all newer changes have made it into newer compilers, and most code written since 1999 seems to follow the C99 standard. gcc and many other compilers have an option to pick which standard to compile against (e.g. -std=c99).


I think the question is why it binds to the variable rather than the type. It's obviously a choice that the designers have made; e.g. C# has very similar syntax, but:

   int* x, y;
declares two pointers.

I think the syntax and the underpinning "declaration follows use" rule are what they got when they tried to generalize the traditional array declaration syntax with square brackets after the array name which they inherited directly from B, and ultimately all the way from Algol:

   int x, y[10], z[20];
In B, though, arrays were not a type; when you wrote this:

   auto x, y[10], z[20];
x, y, and z all have the same type (word); the [] is basically just alloca(). This all works because the type of element in any array is also the same (word), so you don't need to distinguish different arrays for the purposes of correctly implementing [].

But in C, the compiler has to know the type of the array element, since it can vary. Which means that it has to be reflected in the type of the array, somehow. Which means that arrays are now a type, and thus [] is part of the type declaration.

And if you want to keep the old syntax for array declarations, then you get this situation where the type is separated by the array name in the middle. If you then try to formalize this somehow, the "declaration follows use" rule feels like the simplest way to explain it, and applying it to pointers as well makes sense from a consistency perspective.


You must've misunderstood: your statement looks like both x and y are `int *`, but in fact only x is an `int *`, while y is an `int`.


I don't know for certain, but I suspect it simplified the language's grammar, since C's "declaration follows use" rule means you can basically repurpose the expression grammar for declarations instead of needing new rules for types. This is also why the function pointer syntax is so baroque (`int (*x)();` declares a variable `x` containing a pointer to a function returning an int, with unspecified parameters).


I like that a lot! However, it makes things like

    int *x = &a;
a bit more confusing/inconsistent.


Not at all!

a is an int; &a is a pointer to int; x is a pointer to int; *x is again an int.


Gotcha, so it's kind of like:

    int (*(x = &(a)));
    i    i p   p i   // i means int, p means pointer


I prefer to think of it as

    (int *) x = &(a);
     i   p  p   a i // a means address       
    
Which is why I prefer to write

    int* x = &a;
"integer pointer" named "x" set to address of integer "a".

---

As a sibling comment pointed out, this is ambiguous when using multiple declaration:

    int* foo, bar;
The above statement declares an "integer pointer" foo and an "integer" bar. It can be unambiguously rewritten as:

    int bar, *foo;
But multiple declaration sucks anyway! It's widely accepted good practice to set (instantiate) your variables in the same statement that you declare them. Otherwise your program might start reading whatever data was lying around on the stack (the current value of bar) or worse: whatever random memory address it refers to (the current value of foo).


Thanks :)


Thanks. Now I understand why I found pointers difficult. It's the declaration that confused me.


I think it would help if beginners learn a language other than C to learn about pointers. My first language was Pascal, and it didn't have a confusing declaration syntax, nor did it have a confusing array decay behavior so it was much much easier to learn. Nowadays of course I don't think about it but those details mattered to beginners.


Yeah. Since trying C, I've learnt a bit of Rust, so referencing and dereferencing seem straightforward without abstracting references using a pointer.


You can read a declaration like `int *x` as “`*x` is an int”, and hence x is an int pointer.


With the asterisks backslash escaped:

> ..read a declaration like `int *x` as "`*x` is an int"..


No, it's fine.

The name of an array decays to a pointer to the first element in various contexts. You could do `&data[0]` but it means exactly the same thing and would just read as over-complicated to C programmers.


Thanks for all the great answers. The inconsistency between pointer declaration and dereference syntax was what got me. :)


the local variable data effectively decays to int*.

*data would give you an int; &data would give you a pointer to the whole array, of type int (*)[3].

