Catch-23: The New C Standard Sets the World on Fire

Dylan16807 · on April 2, 2023

> C23 furthermore gives the compiler license to use an unreachable annotation on one code path to justify removing, without notice or warning, an entirely different code path that is not marked unreachable: see the discussion of puts() in Example 1 on page 316 of N3054.

I don't agree with that description at all. Here's the code:

  1 if (argc <= 2)
  2   unreachable();
  3 else
  4   return printf("%s: we see %s", argv[0], argv[1]);
  5 return puts("this should never be reached");

The only code path that's "entirely different" is lines 1,4,5 and in that case of course you remove a return that's after a return.

And the other valid code path is 1,2,5, which has `puts` after `unreachable`.

To need `puts` you have to imagine a code path that gets past the "if" without taking either branch?

Maybe the author means something by "code path" that's very different from how I interpret it?

I would be pretty surprised if the above code means something different from:

  if (argc <= 2) {
    unreachable();
    return puts("this should never be reached");
  } else {
    return printf("%s: we see %s", argv[0], argv[1]);
    return puts("this should never be reached");
  }

ternaryoperator · on April 2, 2023

This reminds me of a point made by the late Stan Kelly-Bootle, who for years wrote the Devil's Advocate column in UNIX Review magazine. In the early 1990s, he was discussing Microsoft's new C compiler and noted that the promo material for the new compiler showed a benchmark for a loop that counted from 1 to 10,000 then printed "Hello". MS claimed that without optimization it took a few milliseconds, after optimization: 0 ms. A small asterisk explained the optimizer simply removed the loop. Kelly-Bootle pointed out that the only reason a developer would write such a loop was to introduce a needed delay. Therefore, deleting the loop was not optimizing, but in fact pessimizing. And so, it was in fact Microsoft's Pessimizing C compiler.

codeflo · on April 2, 2023

Of course, that's technically incorrect. The way the standards are written, the compiler is free to replace the program with any other program that has the same (in a precisely defined sense) observable behavior (these are the famous "as if" formulations in language specs). Heating up the CPU is not considered observable behavior.

If someone really just wants a delay, it's easy to either (for programs running on normal OSs) call a sleep function, or (on tiny embedded systems) add an empty inline assembler statement that the compiler can't see through.
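A minimal sketch of the second option, assuming a GCC- or Clang-compatible compiler (the `__asm__ volatile` syntax is an extension, and `spin_opaque` is a name made up for illustration). The empty asm statement emits no instructions, but the compiler has to treat it as having side effects, so the loop should survive optimization:

```c
/* Hypothetical helper: busy-wait for `iterations` loop trips.
 * Assumes GCC/Clang extended asm; this is not standard C. */
static unsigned long spin_opaque(unsigned long iterations)
{
    unsigned long i;
    for (i = 0; i < iterations; ++i) {
        /* Empty asm the optimizer cannot see through. The "memory"
         * clobber also stops it from caching memory values across it. */
        __asm__ volatile("" ::: "memory");
    }
    return i; /* number of trips actually executed */
}
```

How long the delay lasts still depends on the clock speed and the generated code, so it would need calibrating per target.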

carlmr · on April 2, 2023

>Heating up the CPU is not considered observable behavior.

Neither is measuring delays of cached versus non-cached instructions. Yet it turns out to be very observable.

codeflo · on April 2, 2023

Of course these things are “observable” in the literal sense. And yet, they aren’t considered to be observable by the memory model of any language spec that I know of. Same as CPU power draw, which has been used as a side-channel to extract bits of crypto keys, and is very much influenced by common optimizations.

Practically, if you need to execute a specific sequence of machine instructions in order to prevent side-channel attacks, then you have to rely on assembler, compiler intrinsics and/or OS support. But that was true way before Spectre.

hyperhopper · on April 2, 2023

This is not true at all:

I've seen many loops that turn into no-ops because all the functionality has been refactored out but this fact is hidden in function calls.

Sure, this should ideally be surfaced as a lint error, not a compiler optimization, but you cannot say that intentional delays are the "only" reason.

Also, since processing time is variable, using a loop as a delay should be heavily discouraged, warned about, or require opt-in.

viraptor · on April 2, 2023

Those delay loops are common on microcontrollers and the usual solution is to either make the counter volatile or insert something opaque to the compiler in the loop body.
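A minimal sketch of the volatile-counter variant (`spin_volatile` is a hypothetical name). Every read and write of a volatile object counts as observable behavior, so the compiler is not allowed to drop the loop:

```c
/* Hypothetical helper: busy-wait using a volatile loop counter. */
static unsigned long spin_volatile(unsigned long iterations)
{
    volatile unsigned long i; /* each access must really happen */
    for (i = 0; i < iterations; ++i) {
        /* intentionally empty */
    }
    return i; /* equals `iterations` once the loop has finished */
}
```

As with any busy-wait, the duration is only as predictable as the clock and the code the compiler emits.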

It would be of course nice if a warning was produced for that specific case: This whole loop was removed - is it really what you wanted, or is it a broken delay loop?

kzrdude · on April 2, 2023

I think it's a practical example of how the C language has made a journey to being more high abstraction than it used to be, in practice. And how that unsettles those used to the old behaviour.

wahern · on April 2, 2023

I think the point is that if the `argc <= 2` path is unreachable, then that means argc is always greater than 2, permitting the compiler to optimize the entire block to just:

  return printf("%s: we see %s", argv[0], argv[1]);

IOW, the conditional has been elided. But you're right in that the wording of the complaint doesn't match the example. The author presumably had in mind some of the more infamous NULL pointer-related optimizations, without spending the time to put together a properly analogous example.
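To make the reduction concrete, here is a sketch that writes into a buffer instead of stdout. `ASSUME_UNREACHABLE`, `with_hint` and `reduced` are names invented here; the macro assumes either a C23 compiler or the GCC/Clang builtin. For every call that respects the annotation (argc > 2), the two functions are indistinguishable:

```c
#include <stdio.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#include <stddef.h> /* C23 provides unreachable() */
#define ASSUME_UNREACHABLE() unreachable()
#else
#define ASSUME_UNREACHABLE() __builtin_unreachable() /* assumes GCC/Clang */
#endif

/* Same shape as the example under discussion. */
static int with_hint(int argc, char **argv, char *out, size_t n)
{
    if (argc <= 2)
        ASSUME_UNREACHABLE();
    else
        return snprintf(out, n, "%s: we see %s", argv[0], argv[1]);
    return -1; /* stands in for the puts() call; no valid path gets here */
}

/* What an optimizer is permitted to turn it into. */
static int reduced(int argc, char **argv, char *out, size_t n)
{
    (void)argc; /* the test on argc is gone */
    return snprintf(out, n, "%s: we see %s", argv[0], argv[1]);
}
```

Calling `with_hint` with argc <= 2 is undefined behavior, so nothing can be said about how the two compare there.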

dtolnay · on April 2, 2023

I interpreted the author's characterization to be about something like:

  1  if (argc <= 2)
  2    puts("A");
  3  puts("B");
  4  if (argc <= 2)
  5    unreachable();
  6  else
  7    return puts("C");
  8  return puts("D");

in which not just lines 4-6,8 go away (as you said) but also lines 1-2.

It makes sense to me but I can see why the author would characterize this situation as "license to use an unreachable annotation on one code path to justify removing an entirely different code path that is not marked unreachable". In a different world one might expect A to be printed "before the UB happens".
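A sketch of that backward propagation, with stores into a log buffer standing in for the `puts` calls (an assumption made here so that no opaque library call sits between the first branch and the annotation). `ASSUME_UNREACHABLE` and `trace` are hypothetical names, and the macro assumes C23 or the GCC/Clang builtin:

```c
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#include <stddef.h> /* C23 provides unreachable() */
#define ASSUME_UNREACHABLE() unreachable()
#else
#define ASSUME_UNREACHABLE() __builtin_unreachable() /* assumes GCC/Clang */
#endif

/* Writes one letter per step into `log`; returns 1 on the 'C' path. */
static int trace(int argc, char *log)
{
    if (argc <= 2)
        *log++ = 'A'; /* only runs when argc <= 2, so it may be deleted too */
    *log++ = 'B';
    if (argc <= 2)
        ASSUME_UNREACHABLE();
    else {
        *log++ = 'C';
        *log = '\0';
        return 1;
    }
    *log++ = 'D'; /* dead: neither branch above falls through */
    *log = '\0';
    return 0;
}
```

For argc > 2 the log is always "BC". Whether a given compiler actually removes the 'A' store is up to the optimizer; the point is only that it is allowed to.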

masklinn · on April 2, 2023

On the other hand, that has been the behaviour of optimising compilers in the face of UBs for years at this point, decades maybe. The linux kernel was hit by a deref' constraint propagation back in 2009 or so.

This is a behaviour I would absolutely expect from the construct, I would even qualify it as "the point".

rrobukef · on April 3, 2023

I find this especially surprising because line 2 may be exactly the reason why line 5 is unreachable. E.g. if puts("A") contractually throws an exception you cannot just remove it.

What am I missing in this example?

zefix · on April 3, 2023

C does not have exceptions...

rrobukef · on April 5, 2023

But it has longjmp, right?

alwaysbeconsing · on April 2, 2023

One way to look at it (and I am not sure if this is correct, but it may be what the essay author meant) is to not treat the `unreachable` as affecting the presence of the decision, but only the result of the decision. If `unreachable` was replaced by a normal statement, we'd have:

    if (argc <= 2)
        do_something();
    else
        return printf("%s: we see %s", argv[0], argv[1]);

So the `return printf` is executed when `argc` is greater than 2. If we remove just the body of the first branch:

    if (argc <= 2)
        ;
    else
        return printf("%s: we see %s", argv[0], argv[1]);

the same thing holds. And additionally when `argc <= 2`, control will move past the `if`.

Under this view, if the `unreachable` won't cause the entire removal of the `if`, the compiler will produce the equivalent of:

    if (argc > 2)
        return printf("%s: we see %s", argv[0], argv[1]);

    return puts("this should never be reached");

Again, I don't say this is the correct interpretation, but it is one possibility, that would have to be ruled out by other parts of the standard.

Dylan16807 · on April 2, 2023

I understand that interpretation, but that's what the end of my comment is about. If we treat unreachable as affecting the block it's in, but pretend it's not there for control flow, then the two versions of the code do different things. That's confusing and hard to preserve.

Asooka · on April 2, 2023

This just shows that "unreachable" is almost impossible to use safely. The only safe use of unreachable is if it is immediately after an instruction that makes the program stop running. It is not for "this cannot happen", because things that "cannot happen" happen all the time. If you use "unreachable", you're just asking for trouble and it seems the compiler authors are happy to oblige.

josephcsible · on April 2, 2023

This couldn't be more wrong. What you say to never use unreachable for is one of the most important use cases of unreachable. The whole point is to give the optimizer an assumption that it can't figure out on its own.

lldb · on April 2, 2023

One example of it being useful is unchecked std::variant access in c++ - there isn’t any api to access it like a union (if you already know the type) but you can mark the wrong type path unreachable to the same effect.

cryptonector · on April 2, 2023

There's no problem with this feature. I don't understand TFA's problem with it. As a programmer I get to not use `unreachable()` if I don't want to, and if I do I'm happy that the compiler takes my word for it and does the right thing. This is not at all like code elision in UB cases.

The `realloc()` change though...

badrabbit · on April 2, 2023

Shouldn't the compiler warn or error on unreachable code?

codeflo · on April 2, 2023

This is not about code that's found to be unreachable through static analysis (where compilers might warn), but about a manual programmer annotation that claims the code is dynamically unreachable even though statically it might look otherwise.

benj111 · on April 2, 2023

Why would you want that?

Is it to aid building for multiple targets? For debug builds?

flohofwoe · on April 2, 2023

Unreachable is mainly used as an optimization hint. For instance if you put an unreachable into the default branch of a continuous and non-exhaustive (from the pov of the compiler) switch-case statement, the compiler will not emit a range check for the jump table lookup.
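A sketch of that switch pattern. The opcode values, `apply` and `ASSUME_UNREACHABLE` are made up for illustration, and the macro assumes C23 or the GCC/Clang builtin. Whether a jump table (and therefore a range check) is generated at all depends on the compiler and the number of cases:

```c
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#include <stddef.h> /* C23 provides unreachable() */
#define ASSUME_UNREACHABLE() unreachable()
#else
#define ASSUME_UNREACHABLE() __builtin_unreachable() /* assumes GCC/Clang */
#endif

/* Hypothetical opcodes: contiguous, but the compiler cannot know that an
 * int argument never holds anything else. */
#define OP_ADD 0
#define OP_SUB 1
#define OP_MUL 2
#define OP_NEG 3

static int apply(int op, int a, int b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_NEG: return -a;
    default:
        /* Promise: op is always 0..3, so no bounds check is needed. */
        ASSUME_UNREACHABLE();
    }
}
```

Passing any other opcode is undefined behavior, which is exactly the trade being made.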

ufo · on April 2, 2023

It helps optimization. One example is if you have code like this:

    if(condition) {
       error_stuff();
       abort();
    }
    normal_stuff();

If the compiler doesn't know that abort exits the program, it has to compile the normal_stuff path under the assumption that the error path might have run before it. This might result in suboptimal code.

Currently, many compilers support annotations such as __attribute__((noreturn)) and __builtin_unreachable() to manually indicate that a code path is unreachable. C23 is now standardizing these features (with a slight tweak to the syntax).
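A sketch of the noreturn half of that, using the C23 `[[noreturn]]` attribute where available and C11 `_Noreturn` otherwise. `NORETURN`, `fail` and `checked_div` are hypothetical names:

```c
#include <stdio.h>
#include <stdlib.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#define NORETURN [[noreturn]] /* C23 attribute syntax */
#else
#define NORETURN _Noreturn    /* C11 function specifier */
#endif

/* Hypothetical error handler that never returns to its caller. */
NORETURN static void fail(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    abort();
}

static int checked_div(int num, int den)
{
    if (den == 0)
        fail("division by zero");
    /* Because fail() cannot return, the compiler may assume den != 0 here. */
    return num / den;
}
```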

_0ffh · on April 2, 2023

You can for example use it to give hints to the compiler that allows for optimisations, that it couldn't do otherwise.

Described e.g. here https://web.archive.org/web/20160508051118/http://blog.regeh...

Github https://github.com/preames/llvm-assume-hack

masklinn · on April 2, 2023

> Why would you want that?

To aid with optimisation, it basically lets you ask the compiler to remove branches, and provide constraints to the same.

An implementation might trap in debug code, but given no context would be provided you'd likely avoid this and would instead use your own wrapper macro to output a message of some sort in that case.
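One possible shape for such a wrapper, as a sketch: in debug builds it reports the location and aborts, and with `NDEBUG` defined it becomes the optimizer hint. `UNREACHABLE` and `sign_name` are names chosen here, and the release branch assumes C23 or the GCC/Clang builtin:

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef NDEBUG
#  if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#    include <stddef.h> /* C23 provides unreachable() */
#    define UNREACHABLE(msg) unreachable()
#  else
#    define UNREACHABLE(msg) __builtin_unreachable() /* assumes GCC/Clang */
#  endif
#else
   /* Debug build: say where the broken promise was, then stop. */
#  define UNREACHABLE(msg) \
       (fprintf(stderr, "%s:%d: reached unreachable code: %s\n", \
                __FILE__, __LINE__, (msg)), abort())
#endif

/* Caller promises that sign is -1, 0 or 1. */
static const char *sign_name(int sign)
{
    switch (sign) {
    case -1: return "negative";
    case 0:  return "zero";
    case 1:  return "positive";
    }
    UNREACHABLE("sign out of range");
}
```

This mirrors how `assert` behaves with `NDEBUG`, except that the release build actively tells the optimizer the condition cannot occur.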

properparity · on April 2, 2023

But why put in unreachable? Doesn't make any sense to me.

If a branch is truly not supposed to ever happen, why have a branch at all? Just remove that code from the source entirely- that helps the optimizer even more, because the most optimal code is of course no code at all.

masklinn · on April 2, 2023

> But why put in unreachable? Doesn't make any sense to me.

Because sometimes you don't have a choice e.g. say you have a switch/case, if you don't do anything and none of the cases match, then it's equivalent to having an empty `default`. But you may want a `default: unreachable()` instead, to tell the compiler that it needs no fallback.

> If a branch is truly not supposed to ever happen, why have a branch at all? Just remove that code from the source entirely- that helps the optimizer even more, because the most optimal code is of course no code at all.

Except the compiler may compile code with the assumption that it needs to handle edge cases you "know" are not valid. By providing these branches-which-are-not, you're giving the compiler more data to work with. That extra data might turn out to be useless, but it might not.

benj111 · on April 2, 2023

But this example isn't adding a constraint. The if statement is getting optimised away???

masklinn · on April 2, 2023

It is adding a constraint. The constraint is that argc can’t be 2 or smaller. This is a literal “can’t”, as far as the compiler is concerned it’s a logical impossibility.

The branch containing the unreachable() obviously gets removed but the compiler then propagates the constraint (the condition for that illegal branch), and can prune any other path where `argc <= 2` upstream and downstream, as they are dead code per the constraint.

i-use-nixos-btw · on April 2, 2023

This is written with quite a lot of hyperbole.

The predominant focus is realloc(pre,0) becoming UB instead of what the author misleadingly describes as useful, consistent behaviour. It is far from that, and that’s the entire reason that it was declared UB in the first place: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2464.pdf. Note that this wasn’t a proposal to change something, it’s a defect report: the original wording was never suitable.

The second part is the misconception about the impact of UB. Making something UB does not dictate that its usage will initiate the rise of zombie velociraptors. It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Note that this deviates from implementation-defined behaviour, because an implementation-defined behaviour has to be consistent. Where implementations choose to let realloc(ptr,0) summon the zombie raptors, they are free to do so. Don’t like it? Don’t target their implementation. Again, this isn’t a change from the POV of implementers - it’s a defect in the existing wording.

In this case, the course of action that any implementation will choose is to stick with the status quo. It is clearly not a deciding factor in whether or not you embrace the new standard, and to suggest otherwise is dishonest, sensationalist nonsense. The feature was broken, and it’s just being named as such.

Arch-TK · on April 2, 2023

I agree that realloc was poorly defined for the 0 size case, I think UB or IDB both would have worked in this case to really drive that point home, the WG chose UB.

That being said, you're completely wrong about what UB means. Making use of UB may as well initiate the rise of zombie velociraptors. Except for the situation where your implementation explicitly specifies that it provides a predictable behaviour for a specific case of UB, there's literally no guarantee of what will happen. Assuming that the implementation will stick with some status quo and your code won't exhibit absolutely unusual behaviour is just naive.

Please don't mislead people into thinking that it's ever a good idea to assume that undefined behaviour will be handled sensibly; this kind of mistaken assumption is one of the major sources of bugs in C code.

coliveira · on April 2, 2023

> this kind of mistaken assumption is one of the major sources of bugs in C code.

This is not even close to being true. Most bugs in C code are from programmer mistakes, not from UB. The exaggeration that is spread by some people regarding UB is close to absurd. If something is UB, it may generate different results in different situations, even with the same compiler. The standard is just clarifying this problem. A good compiler will do something sensible, or at least issue a warning when this situation is detected. If you have a bad compiler that does strange things with your code, it's not a defect of UB but of the compiler instead.

wruza · on April 2, 2023

Optimizing compilers don’t work like that. They can either deviate from the standard and leave it as defined behavior, or mark it UB and go with it as usual.

To get some insight by analogy, consider this set of constraints (unrelated to C):

  x <= 7
  2x >= 5
  …(more with x, y, z but not more constraining x)…

When you feed this to a linear constraint solver, you may get anything from 2.5 to 7 as x. E.g. 3.1415926. Not because a solver wanted to draw some circles, but because it transformed your geometric problem into an abstract representation for its own algorithm, performed some fast operations over it and returned the result. Nobody knows how exactly a specific solving method will behave wrt (underconstrained) x given that the description above is all you have.

When you feed UB into an optimizer, you feed a bit of lava into a plastic pipe, figuratively. You’ll get anything from program #2500…0000 to program #6999…9999, where “…” is few more thousands/millions of digits. Run some numbers from there as an .exe to see if something absurd happens.

The nature of UB and optimizers is that you either relax UBs into DBs and get worse efficiency, or you specify more UBs and get worse programming safety. What happens in between can be perceived as completely random. And the better/faster the optimizer is, the more random the outcome will likely be.

> The exaggeration that is spread by some people regarding UB is close to absurd

UB-in-code is absurd by definition, no exaggeration here.

Arch-TK · on April 2, 2023

> Most bugs in C code are from programmer mistakes

These most often lead to the triggering of UB. The reason why programmer mistakes lead to confusing bugs instead of simple and straightforward bugs which are easy to catch in the development process is mainly because UB imposes no restrictions on what the compiler should do. In the vast majority of UB cases the compilers simply don't do anything, and assume it can't happen. This is why dereferencing a pointer and then checking if it's null ends up eliding the null check (because if you've dereferenced it, it can't be null, that would be UB). Accessing past the end of an array is UB so it can't happen, therefore your compiler won't check for it. Accessing past the end of an array and accidentally reading from/writing to another variable - likewise.
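The dereference-then-check pattern can be sketched as follows (`struct conn` and both function names are hypothetical). In the first version the compiler is allowed to treat the null test as always false, because a null `c` would already have caused undefined behavior on the line above:

```c
#include <stddef.h>

struct conn { int fd; }; /* hypothetical type for illustration */

/* Buggy order: dereference first, check afterwards. */
static int get_fd_buggy(struct conn *c)
{
    int fd = c->fd; /* UB if c is NULL */
    if (c == NULL)  /* may be removed: c "cannot" be NULL at this point */
        return -1;
    return fd;
}

/* Correct order: the check comes before any dereference. */
static int get_fd(struct conn *c)
{
    if (c == NULL)
        return -1;
    return c->fd;
}
```

Both behave the same for valid pointers; only `get_fd` is safe to call with `NULL`.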

UB encompasses ALL behavior for which the standard does not provide an explicit definition. The reason why the C standard provides explicit instances of UB usually boils down to clarifying situations where people were confused about whether something was UB or not. But if the behaviour is not defined in the standard, then it is by definition UB.

SCLeo · on April 2, 2023

If I am not wrong, one major security bug that C programs usually face is buffer overflow, which is an undefined behavior.

cryptonector · on April 2, 2023

Right, this should have been left to the implementor if they didn't want to standardize one behavior. Making it UB is the worst possible outcome. Yes, people who write portable code will still want to not rely on `realloc()`'s freeing behavior, but if you do and your realloc() implementation doesn't, then you suffer a leak, while if you do and realloc() decides to wipe your drive and make your power supply explode...
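For code that wants the freeing behavior without relying on what `realloc(ptr, 0)` does, one option is to handle the zero-size case before calling `realloc`. This is a sketch; `resize_or_free` is a hypothetical name, and "free and return NULL" is a policy chosen here, not what any particular libc promises:

```c
#include <stdlib.h>

/* Hypothetical wrapper: never passes a zero size to realloc(). */
static void *resize_or_free(void *ptr, size_t new_size)
{
    if (new_size == 0) {
        free(ptr); /* free(NULL) is a no-op, so a null ptr is fine */
        return NULL;
    }
    /* On failure realloc() returns NULL and leaves ptr allocated,
     * so callers should keep the old pointer until they have checked. */
    return realloc(ptr, new_size);
}
```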

astrange · on April 2, 2023

> Except for the situation where your implementation explicitly specifies that it provides a predictable behaviour for a specific case of UB, there's literally no guarantee of what will happen.

That situation is "when you have UBSan turned on".

G_z9 · on April 2, 2023

[flagged]

G_z9 · on April 2, 2023

[flagged]

Dylan16807 · on April 2, 2023

> How this gets downvoted is beyond me

Primarily because you're bringing in an argument from a different story entirely rather than figuring out a better option.

But also you're being very rude in that other thread, and calling twitter "high bandwidth" for a discussion is... weird.

> Yeah, I’m being downvoted by a bot or something.

Uh huh.

G_z9 · on April 2, 2023

Ok, that’s not unreasonable. But I think that making an unrelated comment like that is really only a bad thing if it’s in bad faith. He made that comment expecting a response, I’m not like hounding him. And yes, I said some rude things. Are you going to downvote every comment that you see from me because some other comments I made were rude? Doesn’t really add up. And why are you policing the threads? That’s more weird than asking for a twitter space.

I don’t think asking for a twitter space is all that weird. I am constantly frustrated talking with people on HN because what could take 2 seconds takes 20 minutes. I often find that a debate never even has a chance to be resolved because everyone just gets worn out trying to talk through a digital straw. Plus, asking for a twitter space doesn’t involve exchange of personal information or anything concerning. It’s definitely not done and off the wall but I don’t think it’s problematic.

Edit: the more I think about it, the more sense it makes. HN has a problem with being flooded with vitriol and lots of other negative behavior, long chains that are just useless. It would make a lot of sense to offload most of that to another platform since HN as a platform is not well suited to debating. Instead of initiating a huge chain of vitriol, a twitter space could be initiated when people want to debate something. Instead of tons of noise and garbage, HN would host a link to the space. And it would be better because the nature of a space lends itself to people coming to a conclusion, covering the issue more thoroughly and people letting loose less hate, all because of the high bandwidth, intimate nature of real-time audio. It also helps filter out people who are bots or aren’t serious or who don’t really care about the topic being debated. I am legitimately going to email HN admin about this.

Dylan16807 · on April 2, 2023

I didn't downvote you, by the way.

But sure posting in the wrong topic will get a downvote to turn your comment gray. What is wrong with that policing?

> Are you going to downvote every comment that you see from me because some other comments I made were rude? Doesn’t really add up.

...what? You continued the thread here. Nobody is downvoting random comments of yours. Your comments on that story and this story are part of the same conversation.

And if you can't figure out how to reply in a deep thread you can just wait a couple minutes for the link to be there.

> a twitter space

Oh, the chat thing. I thought you meant tweets. Sure, that's a reasonable idea for some conversations.

G_z9 · on April 2, 2023

Yes I’ve figured out the timer. As a 2010 account, I kind of have to just yield to you.

astrange · on April 2, 2023

Hey, I think getting into arguments for a day then randomly giving up and wandering off is what the site's all about. Actually, I think the guy who stops replying first kind of wins - it's similar to why you shouldn't double-text when dating.

> I don’t think asking for a twitter space is all that weird.

My issue is that I don't think I have anything to contribute as I'm not making original conclusions but kind of just quoting a typical labor economist. (Different from quoting the average person, they're usually worried about different things.)

Example being https://www.apricitas.io/p/chatgpt-please-take-my-job.

SAI_Peregrinus · on April 2, 2023

It's not directly related to the topic at hand. It's meta-commentary about the discussion, not about the actual topic. That, and your post I'm replying to, and this reply I'm making should all be downvoted as they're all off-topic.

c4mpute · on April 2, 2023

> The second part is the misconception about the impact of UB. [...] It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Wrong, Wrong, Wrong.

UB allows the implementation to take any arbitrary course of action, without informing anyone, without documentation, without any conscious decision, without weighing anything to be better/worse. Nondeterministically catching fire and launching nuclear rockets is a completely compliant reaction to UB.

What you are describing is "implementation defined" behavior. That has to be deterministic, documented, and conforming to some definition of sanity. Examples are the binary representation of NULL, sizes of integer types or stuff like the maximum filename length. Sadly, too many things in C have "undefined behavior", too few have "implementation defined" behavior.

And UB has always been an excuse for compilers to screw over programmers in hideous ways. Programmers are rightfully afraid of any kind of new UB being introduced, because it will mean that whole new classes of bugs will arise because the compiler optimized out that realloc(..., a) where a might be 0, because that's UB, so screw you and your code... And this change is especially dangerous because it makes a lot of existing code UB.

chongli · on April 2, 2023

> And UB has always been an excuse for compilers to screw over programmers in hideous ways

Your reply was great up until this. Compiler writers aren’t looking to screw over programmers, they’re looking to make code faster. UB gives them the ability to make assumptions about what is and is not true, at a particular moment in time, in order to skip doing unnecessary work at runtime.

By assuming that code is always on the happy path, you can cut a lot of corners and skip checks that would otherwise greatly slow down the code. Furthermore, these benefits can cascade into more and more optimizations. Sometimes you can have these large, complicated functions and call graphs get optimized down to a handful of inlined instructions. Sometimes the speedup can be so dramatic that the entire application is unusable without it!

Many of these optimizations would be impossible if compilers were forced to assume the opposite: that UB will occur whenever possible.

The tool programmers have available to them is compiler flags. You can use flags to turn off these assumptions, at the cost of losing out on optimizations, if your code needs it and you’re unable to fix it. But it’s better to turn on all possible warnings and treat warnings as errors, rather than ignoring them, to push yourself to fix the code.

adgjlsfhk1 · on April 2, 2023

the thing that makes UB almost malicious is that it propagates inter-procedurally. This makes reasoning about code with UB basically impossible which means that you should always assume that the compiler is going to screw you over if you use it because there is no way to know whether it will.

chongli · on April 2, 2023

You should consider a program with undefined behaviour to be the equivalent of a mathematical proof that contains an unstated contradiction. Ex falso quodlibet: from a falsehood anything follows. Also called the principle of explosion.

Undefined behaviour renders your entire program meaningless. It must be avoided at all costs. Using undefined behaviour on purpose is like sticking a fork in an electrical socket.

Joker_vD · on April 2, 2023

> Undefined behaviour renders your entire program meaningless

That's exactly the complaint. Consider that the implementations of the standard library sometimes have exposed UB: that renders behaviour of all of the running code on the system undefined.

Many programmers believe that the fallout of the UB could, and therefore should, be limited in scope.

coliveira · on April 2, 2023

To achieve your goal, compilers would have to disable any sufficiently powerful optimization. If you write bugs (UB), a powerful compiler will eventually catch them and generate code that you didn't intend at the beginning. However, this is not the fault of the compiler or the language.

chongli · on April 2, 2023

Compiler writers have already done this. With flags you can disable any optimization you like, with all of the performance loss that entails. But then people complain that their programs are slow.

What people really want is an AI that ignores the code they write and just “does what they really meant.” But of course that’s not foolproof either. Every day people ask each other to do things and miscommunications occur, with the wrong thing being done. I don’t really know what to say other than “people should be more careful and also more forgiving.”

coliveira · on April 2, 2023

Exactly. All the hoopla about UB is complaining about how compiler optimizations work and the fact that the standard committee makes clear (with each new meeting) what is considered undefined behavior or not. They should instead thank the committee for clarifying this.

adgjlsfhk1 · on April 2, 2023

It is the fault of the language to the extent that the purpose of the language is to make it easy to write correct programs, and UB makes it really hard (and in some cases impossible) to write correct programs.

coliveira · on April 2, 2023

It is just the opposite. UB is a clarification to tell programmers what the language considers to be undesired behavior. If they didn't say anything, it would be always a mystery if a certain construct was allowed or not, effectively making it compiler dependent. Compilers would also have less avenue for creating optimizations. In the next iterations of the C standard we may see more constructs classified as UB.

adgjlsfhk1 · on April 2, 2023

That sounds good in theory, but many things that are UB in C/C++ are UB because they are really hard to verify at compile time which makes them almost impossible to program around. Any signed addition in C is potential UB unless you have a proof that all numbers that will ever be input to the addition won't cause overflow (which is made harder because C doesn't define the size of the default integer types). Furthermore, no progress is UB which means that as a programmer, you have to solve the halting problem for your program before knowing whether it has a bug.

jcranmer · on April 2, 2023

> many things that are UB in C/C++ are UB because they are really hard to verify at compile time which makes them almost impossible to program around

The second half of the sentence doesn't follow from the first. Take everyone's favorite example, signed integer overflow: all you have to do to avoid UB on signed integer overflow is check for overflow before doing the operation (and C23 finally adds features to do that for you).
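A sketch of such a pre-check for `int` addition, written so that no step of the check can itself overflow. `safe_add` is a hypothetical name; the C23 facility alluded to is `ckd_add` from `<stdckdint.h>`, which is not used here so that the example also builds with older compilers:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores a + b in *out and returns true,
 * or returns false (leaving *out untouched) if the sum would overflow. */
static bool safe_add(int a, int b, int *out)
{
    /* Compare against the limits BEFORE adding. */
    if (b > 0 && a > INT_MAX - b)
        return false; /* would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false; /* would go below INT_MIN */
    *out = a + b;
    return true;
}
```

`INT_MAX - b` with positive `b` and `INT_MIN - b` with negative `b` both stay in range, which is what makes the test itself well defined.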

Taking a step back, the fundamental thing about UB is that it is very nearly always a bug in your code (and this includes especially integer overflow!). Even if you gave well-defined semantics to UB, the semantics you'd give would very rarely make the program not buggy. Complaining that we can't prove programs free of UB is tantamount to complaining that we can't prove programs free of bugs.

It turns out that UB is actually extremely helpful for tools that try to help programmers find bugs in their code. Since UB is automatically a bug, any tool that finds UB knows that it found a bug; if you give it well-defined semantics instead, it's a lot trickier to assert that it's a bug. In a real-world example, the infamous buffer overflow vulnerability Heartbleed stymied most (all?) static analyzers for the simple reason that, due to how OpenSSL did memory management, it wasn't actually undefined behavior by C's definition. Unsigned integer overflow also falls into this bucket--it's very hard to distinguish intentional cases of unsigned integer overflow (e.g., hashing algorithms) from unintentional cases (e.g., calculating buffer sizes).
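As an illustration of the intentional case, here is the 32-bit FNV-1a hash, which relies on unsigned multiplication wrapping modulo 2^32. That wraparound is fully defined in C, so a tool has no language-level reason to flag it. The function name `fnv1a32` is chosen here; the constants are the published FNV offset basis and prime:

```c
#include <stdint.h>

/* 32-bit FNV-1a over a NUL-terminated string. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u; /* FNV offset basis */
    for (; *s != '\0'; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;       /* FNV prime; wraps modulo 2^32 by design */
    }
    return h;
}
```

The same wrapping arithmetic in a buffer-size computation would almost certainly be a bug, and nothing in the language distinguishes the two.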

the_why_of_y · on April 2, 2023

My complaint here is that it took C more than 30 years between defining signed integer overflow as UB and providing programmers with standard library facilities to check if a signed integer operation would result in overflow.

I much prefer Rust's approach to arithmetic, where overflow with plain arithmetic operators is defined as a bug, and panics on debug-enabled builds, plus special operations in the standard library like wrapping_add and saturating_add for the special cases where overflow is expected.

chongli · on April 2, 2023

> My complaint here is that it took C more than 30 years ... I much prefer Rust's approach

That's an odd complaint. Rust didn't spring forth fully formed from the ether, it stands on the shoulders of C (and other giants of PL history). 30 years ago you couldn't use Rust at all because it didn't exist.

The reason the committee doesn't just radically change C in all these nice ways to catch up to Rust is because it would be incompatible. Then you wouldn't have fixed C, you'd just have two languages: "old C", which all of the existing C code in the world is written in, and "new C", which nothing is written in. At that point why not just start over from scratch, like they did with Rust?

the_why_of_y · on April 2, 2023

Interestingly, the first Ada standard in 1983 defined signed integer overflow to raise a CONSTRAINT_ERROR exception.

But apparently it lacked unsigned integers with modular arithmetic?

http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html... http://archive.adaic.com/standards/83lrm/html/lrm-03-05.html

The 2012 version is a bit more readable, and has unsigned integers:

For a signed integer type, the exception Constraint_Error is raised by the execution of an operation that cannot deliver the correct result because it is outside the base range of the type. For any integer type, Constraint_Error is raised by the operators "/", "rem", and "mod" if the right operand is zero.

For a modular type, if the result of the execution of a predefined operator (see 4.5) is outside the base range of the type, the result is reduced modulo the modulus of the type to a value that is within the base range of the type.

http://www.ada-auth.org/standards/rm12_w_tc1/html/RM-3-5-4.h...

c4mpute · on April 2, 2023

> all you have to do to avoid UB on signed integer overflow is check for overflow before doing the operation

All you have to do is add a check for overflow _that the compiler will not throw away because "UB won't happen"_. The very thing you want to avoid makes avoiding it very hard, and lots of bugs have resulted from compilers "optimizing" away such overflow checks.
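A sketch of the difference, for pre-C23 code: a check written as `a + b < a` performs the overflowing addition first, so the compiler may assume it never fires. A check that compares against the limits before adding has no UB to exploit. The function name `add_checked` is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores a + b in *out and returns true, or returns false if the sum
   would overflow. The comparisons run before the addition, so no
   overflowing operation is ever evaluated. */
bool add_checked(int a, int b, int *out) {
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}
```

`INT_MAX - b` with positive `b` and `INT_MIN - b` with negative `b` are always representable, so the guard itself is well defined for every input.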

chongli · on April 2, 2023

This is covered in the article and numerous replies in this thread. Use <stdckdint.h>.

c4mpute · on April 5, 2023

stdckdint.h is only available in C23. The problem has existed before that and led to tons of exploits and bugs.

xigoi · on April 2, 2023

> all you have to do to avoid UB on signed integer overflow is check for overflow before doing the operation (and C23 finally adds features to do that for you).

…making your code practically unreadable, since you have to write ckd_add(ckd_add(ckd_mul(a,a),ckd_mul(ckd_mul(2,a),b)),ckd_mul(b,b)) instead of a * a + 2 * a * b + b * b.

chongli · on April 2, 2023

That's not the correct syntax for the ckd_ operations. They take 3 operands, the first being a pointer to an integer where the result should be stored. And they return a bool, which you need to check in a conditional. If you're just going to throw out the bool and ignore the overflows, why bother with checked operations in the first place?
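As a sketch of what that looks like for the formula above: each `ckd_*` call returns true on overflow, so the calls can be chained with `||` and checked once. The function name `square_of_sum` is hypothetical, and the fallback to the GCC/Clang `__builtin_*_overflow` builtins is an assumption for toolchains that do not ship the C23 header yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>
#  endif
#endif
#ifndef ckd_add
/* Assumed fallback: GCC/Clang builtins, wrapped to use the ckd_* argument order. */
#  define ckd_add(r, a, b) __builtin_add_overflow((a), (b), (r))
#  define ckd_mul(r, a, b) __builtin_mul_overflow((a), (b), (r))
#endif

/* Computes a*a + 2*a*b + b*b into *out.
   Returns true on success, false if any step overflowed. */
bool square_of_sum(int a, int b, int *out) {
    assert(out != NULL);
    int aa, ab, two_ab, bb, partial;
    /* || stops at the first call that reports overflow. */
    if (ckd_mul(&aa, a, a) || ckd_mul(&ab, a, b) || ckd_mul(&two_ab, 2, ab) ||
        ckd_mul(&bb, b, b) || ckd_add(&partial, aa, two_ab) ||
        ckd_add(out, partial, bb)) {
        return false;
    }
    return true;
}
```

Note that this reports failure if any intermediate term overflows, even when the final mathematical result would fit.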

xigoi · on April 2, 2023

Yeah, I realize that now. That's even worse. So you'll have to write something like

    int aa,twoa,twoab,bb,aaplustwoab,aaplustwoabplusbb;
    if (ckd_mul(&aa,a,a)) { return error; }
    if (ckd_mul(&twoa,2,a)) { return error; }
    // …
    if (ckd_add(&aaplustwoabplusbb,aaplustwoab,bb)) { return error; }
    return aaplustwoabplusbb;

So ergonomic!

> If you're just going to throw out the bool and ignore the overflows, why bother with checked operations in the first place?

I'd expect the functions to return the result on success and crash on failure. Or better, raise an exception, but C doesn't have exceptions…

chongli · on April 2, 2023

Why not just write:

    // Returns true on success; ckd_* return true on overflow, hence the negations.
    bool aplusb_sqr(int* c, int a, int b) {
        return c && !ckd_add(c, a, b) && !ckd_mul(c, *c, *c);
    }

xigoi · on April 2, 2023

Obviously you could do that in this case, I just wanted to come up with a complicated formula.

chongli · on April 2, 2023

See my other comment [1] which addresses the exact things you brought up here. Safe checked arithmetic is a new standard feature in C23. If lack of forward progress were not UB, then tons of loop optimizations would be impossible and then we couldn’t have nice things, like numpy.

[1] https://news.ycombinator.com/item?id=35406554

coliveira · on April 2, 2023

> Any signed addition in C is potential UB unless you have a proof that all numbers that will ever be input to the addition won't cause overflow

This has always been the case. Standard C has always operated with the possibility that addition can overflow. The programmer or library writer is responsible to check if the used types are large enough. If you want to be perfectly sure you need to check for overflow. Making this UB has not changed the nature of the issue.

> is made harder because C doesn't define the size of the default integer types

They correctly made this implementation defined. But C now has fixed-width integer types (in <stdint.h>) if you want to be sure.

CJefferson · on April 2, 2023

Is the improved performance of C over say Java, or Rust (which both have much less undefined behaviour -- Java almost none) worth the pain and bugs which have been caused by UB?

Honestly, I don't think so, and as computers get more powerful and the amount of the world which relies on their correct functioning grows, I feel the arguments for UB become increasingly difficult to justify.

chongli · on April 2, 2023

I went to look up undefined behaviour in Rust and I got this scary warning:

Warning: The following list is not exhaustive. There is no formal model of Rust's semantics for what is and is not allowed in unsafe code, so there may be more behavior considered unsafe. The following list is just what we know for sure is undefined behavior. Please read the Rustonomicon before writing unsafe code.

After the warning was a list of many of the same types of things that are undefined behaviour in C. In addition, there’s a bunch more undefined behaviour related to improper usage of the unsafe keyword.

So I don’t think you get a free lunch with Rust here. What you get is a “safe” playground if you stay within the guard rails and avoid using the unsafe keyword. But then you are limited to writing programs which can be expressed in safe Rust, a proper subset of all programs you might want to write.

Furthermore, the lack of a formal specification for Rust is one area where it lags behind C, a standardized language. All of the undefined behaviour in C is decreed and documented by the standard, having been decided by the committee. Rust, on the other hand, may have weird and unpredictable behaviour that you just have to debug yourself, which may or may not be compiler bugs.

CJefferson · on April 2, 2023

I agree rust isn’t perfect, but I think you underestimate the value of “safe” code.

I often write programs that have unsafe code. However, the unsafe code is never more than 100 lines, which means I have a very small amount of code to reason about. Rust users expect (though of course you as the programmer have to enforce it) that it should not be possible to cause UB from safe code, so my “safe interface” to my unsafe code ensures my code can’t cause UB, no matter what I call.

One problem with Rust is that, generally, when you mess up it panics. I think that’s better than buffer overflows and the like, but it’s still not a good user experience.

This means there is a very small amount of code I have to really think about, while in C or C++ I have to think about basically any place x[i] appears (regardless of whether x is a pointer or a std::vector).

You can of course write safe C code, people do, but it’s hard, and it only takes one slip up anywhere in your program to blow it.

chongli · on April 2, 2023

In one sense, C is the unsafe code block for myriad other languages, like Python. Python users don’t want to deal with undefined behaviour either. They want to write their high level code in NumPy or PyTorch and just have everything work very fast.

Little do they know: they rely on C for those libraries and for things like ATLAS and LAPACK, which implement the underlying numerical linear algebra code. Well, it turns out that ATLAS relies pretty heavily on optimizing C compilers to generate optimal code on many different platforms. At the bottom of all this are the many loop optimizations included in compilers which, thanks to undefined behaviour in the C spec, are able to assume that code is always on the happy path.

It also turns out that Rust includes bindings to ATLAS and LAPACK. I would imagine at some point people might want to write a new linear algebra package in pure Rust. I think it’ll be quite difficult to match the performance of those two in safe Rust, but we’ll see.

jamincan · on April 2, 2023

Isn't LAPACK written in Fortran?

chongli · on April 2, 2023

You're right, and ATLAS is as well, but Fortran has undefined behaviour [1] for all the same reasons that C does.

[1] https://stackoverflow.com/a/57558908

Kranar · on April 2, 2023

C does not have a formal specification either. It has a standards document that is written using formal English, but it does not provide a formal spec of C's semantics. A formal spec of a programming language's semantics would entail using a formal semantic model such as operational or denotational semantics. Some programming languages do specify the formal semantics for the entire language or some subset of the language, but C is not one of them.

Your claim that the C Standard lists all undefined behavior is actually false. The C Standard only lists out the explicit list of undefined behavior, but it does not list out the implicit list of undefined behavior. There have been efforts to make just such a list but it's an incredibly difficult task.

saagarjha · on April 4, 2023

Java can optimize programs to make well-defined but very rarely occurring cases be on the slow path. C can't really do this.

Kranar · on April 2, 2023

It's funny that your original post was an objection to how undefined behavior gives license to screw developers over, but here you are talking about how undefined behavior is like sticking a fork in an electrical socket.

chongli · on April 2, 2023

My original post was an objection to the implied intent on the part of compiler writers. An electrical socket does not have intent, it's just a hazard that also happens to provide enormous benefits to our lifestyles.

I think it's a perfect analogy to undefined behaviour in C: enormous benefits but also a hazard to be wary of. A lot of people don't understand the benefits, they just see the hazard. Throughout this discussion I've been trying to clarify that, with perhaps limited success.

tsegratis · on April 2, 2023

But just to be clear @chongli is logical

Think of UB as a probabilistic error. I.e. it is always stupid to rely on it

1. Write code without errors -- sensible

2. Allow compilers to assume the absence of errors -- occasionally sensible, since it speeds up your program

In defence of UB, for the most part they are things that should break your program anyway: stack overflow is never correct. So your choice is mostly to fail badly quickly, or to fail slowly well

Thanks to Google making the UB sanitizers, you are free to make that choice even in C

Kranar · on April 2, 2023

I'd argue that it's stupid to think that it's stupid to rely on UB.

Almost any non-trivial software explicitly relies on undefined behavior, including safety-critical libraries such as cryptographic libraries. The Linux operating system has rampant undefined behavior that it makes a conscious decision to use. POSIX makes use of undefined behavior for shared libraries (it treats functions loaded from shared libraries as void*, which is undefined behavior).

Gibbon1 · on April 2, 2023

That's not an argument to keep live grenades laying around, it's an argument to remove them from the spec.

Like signed int overflow being UB. Define it to have two's complement semantics. Problem solved. I'm sure the nutters trying to extend C++ with templates will howl but this is C not C++. And seriously C++ is dead man walking at this point.

chongli · on April 2, 2023

C23 does make two’s complement standard. It also adds checked arithmetic so you can safely avoid signed overflow.

It does not make signed overflow defined behaviour. This would prevent integer operation reordering as an optimization, leading to slower code.
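One commonly cited simplification of this kind, as a sketch: because signed overflow is undefined, a compiler may treat `(x * 2) / 2` as plain `x`. Under wrapping semantics that rewrite would be wrong for large `x` (on a 32-bit int, `INT_MAX * 2` would wrap to -2, and -2 / 2 is -1). The function name is hypothetical, and whether a given compiler actually performs the fold depends on its version and flags.

```c
#include <assert.h>
#include <limits.h>

int double_then_halve(int x) {
    /* Precondition for defined behavior: x * 2 must not overflow. */
    assert(x >= INT_MIN / 2 && x <= INT_MAX / 2);
    return (x * 2) / 2;   /* an optimizer is allowed to reduce this to x */
}
```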

properparity · on April 2, 2023

>This would prevent integer operation reordering as an optimization, leading to slower code.

The sane way to address that is to add explicit opt-in annotations like 'restrict'.

  #push_optimize(assume_no_integer_overflow)
  int x = a + b;
  // more performance orientated code
  #pop_optimize
  // back to sane C

  #push_optimize(assume_no_alias(a, b), assume_stride(a, 16), assume_stride(b, 16))
  void compute(float *a, float *b, int index)
  {
   // here the compiler can assume a and b do not alias
   // and it can assume it can always load 16 bytes at a time
   // the programmer has made sure it's aligned and padded so that, with any index,
   // there's always 16 bytes to load
   // so go on, use any vectorized simd instruction you want
  }
  #pop_optimize
  // back to sane C

chongli · on April 2, 2023

That’s a lot uglier and clunkier than just using the ckd_add, ckd_mul etc. safe checked arithmetic. Plus if an overflow occurs you still get an incorrect result which you probably don’t want.

Or maybe I’m wrong? Do people actually want overflows to occur and incorrect results? If they’re willing to tolerate incorrect results, why would they also want optimizations disabled?

Gibbon1 · on April 2, 2023

The thing is it's ugly in the rare case that absolute performance is worth fighting for. And not ugly in the majority case where it isn't in the top three important things.

chongli · on April 2, 2023

No, GP's proposal is ugly in the majority case. If you're going to make signed overflow defined behaviour then every time you write:

    int c = a + b;

You have to assume it will overflow and give an incorrect result. So now you need to check everything, everywhere, and you don't get any optimizations unless you explicitly ask for them with those ugly #push_optimize annotations. I completely fail to see how this is an advantage.

The way C works right now, the assumption is that you want optimization by default and safety is opt-in. The GP's proposal takes away the optimization by default. It then makes incorrect results the default, but it does not make safety the default. To make safety the default you would have to force people to write conditionals all over the place to check for the overflows with ckd_add, ckd_mul etc. Merely writing:

    int c = a + b;

Does not give you any assurances that your answer will be correct.

Gibbon1 · on April 3, 2023

"So now you need to check everything, everywhere"

If you want to write robust code in C that's what you need to do. UB doesn't give you runtime checks nor compile time checks for overflow.

"Does not give you any assurances that your answer will be correct."

Your problem is you think C's int is a mathematical integer when it is not. It's an ordered set.

chongli · on April 3, 2023

You misunderstood what I was saying. Those statements are from the perspective of the hypothetical language proposal I was critiquing. That proposal turns off all the optimizations by default and forces you to add annotations to turn them back on. At the same time, it does not actually give you anything useful for your trouble because it still doesn't solve the problem of signed overflow giving incorrect results.

The way C is now, you get the performance by default and safety is opt-in. That's the tradeoff C makes and it's a good one. Other languages give safety by default and make performance opt-in. The proposal I was responding to gives neither.

Gibbon1 · on April 2, 2023

Yeah, but it's reversed: signed overflow shouldn't be UB by default. You should have to explicitly opt in for that.

The reason, of course, why they refuse to do that is because if that were the case most shops would up and ban unsafe signed.

pclmulqdq · on April 2, 2023

C++ 20 did that too.

pjmlp · on April 2, 2023

Until LLVM, GCC, key game engines and GPGPU SDK get rewritten into something else, it is going to be Resident Evil day for a looong time.

AlotOfReading · on April 2, 2023

I wish UB were only as nasty as "nondeterministic behavior". In fact, if there's UB in anything the compiler sees, nothing at all can be assumed, including whether you even get an output. What you've given the compiler isn't C, so it doesn't have any obligations to do anything with it. The codepath with UB doesn't have to run for the nuclear rockets to launch and the nasal demons to appear.

Since approximately every nontrivial program ever written has UB, in actual practice we're only saved by the fact that compilers aren't entirely maliciously compliant.

Dylan16807 · on April 2, 2023

That's not true. If the program's execution path from start to finish avoids UB then you're safe. (Also the source code itself has to avoid UB, but that part isn't hard.)

It's true that code with UB does not have to be reached, per se, but it does have to be something your program will reach before it can hurt you.

AlotOfReading · on April 2, 2023

You're correct in practical terms, but I'm making a very pedantic point about what the standard requires to happen, mainly because this pedantry has important implications for e.g. safety critical C. Note 1 to the definition in 3.4.3 provides some clarification about the extent of UB and states that UB can manifest at translation time. It also says that the translator should behave in a documented manner when encountering UB, but does not require that it do so.

LegionMammal978 · on April 2, 2023

C has both translation-time UB and runtime UB. (C++ explicitly separates the two concepts into "ill-formed, no diagnostic required" and "undefined behavior".) You can tell them apart from the condition for UB to occur: if it's a translation-time condition, then it's translation-time UB, and if it's a runtime condition, then it's runtime UB. (Same with implicit UB: is it a translation-time or a runtime assumption being violated?)

Usually when we talk about UB, we're implicitly talking about runtime UB, since translation-time UB is generally far less subtle. If a program contains only conditional runtime UB, the compiler is not permitted to break the entire program from the very beginning, since all possible executions that do not trigger runtime UB must execute correctly as per 5.1.2.3.
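A small sketch of what "conditional runtime UB" means in practice. In `quotient`, the division is undefined only for two specific input combinations; on this reading, calls with any other inputs must still behave as written. Both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Undefined only for b == 0 or (a == INT_MIN && b == -1);
   every other call is fully defined. */
int quotient(int a, int b) {
    return a / b;
}

/* Guarded variant: rejects exactly the inputs for which a / b is undefined. */
bool quotient_checked(int a, int b, int *out) {
    assert(out != NULL);
    if (b == 0 || (a == INT_MIN && b == -1)) {
        return false;
    }
    *out = a / b;
    return true;
}
```

The compiler may assume `b != 0` inside `quotient`, but that assumption only matters for executions that would have divided by zero anyway.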

AlotOfReading · on April 2, 2023

5.1.2.3 only binds conforming programs. Programs containing UB are by definition non-conforming.

I hadn't considered the C++ standard here, but 1.9 is much clearer than the corresponding C verbiage. 1.9.5 is exactly what's described upthread, where any "execution [that] contains an undefined operation" has no prescribed behavior. But the note to the requirement immediately before that (1.9.4) doesn't use that language and instead "imposes no requirements on programs that contain UB". If they had intended only to avoid specifying semantics for programs that hit UB during some possible execution, they would have used the same language as 1.9.5.

Kranar · on April 2, 2023

Your claim is actually false. C differentiates between a conforming program and a strictly conforming program. 5.1.2.3 binds conforming programs, which are permitted to produce output dependent on undefined behavior.

Only strictly conforming programs may not produce output dependent on undefined behavior.

AlotOfReading · on April 2, 2023

No? Conformance allows unspecified and implementation defined. Strict conformance is the absence of that (i.e. same output in every conforming environment). Neither includes UB, as UB is "outside the standard" in some sense and doesn't have defined semantics.

Kranar · on April 2, 2023

It's a common misconception that a conforming program may not engender undefined behavior. In fact this very article touches on how realloc has introduced new (and backwards incompatible) undefined behavior precisely to accommodate the POSIX standard (so that POSIX compliant implementations of C can redefine the otherwise undefined behavior however they please).

AlotOfReading · on April 2, 2023

Can you cite that? It runs against a plain reading of the standards (both C and C++) and would be insane for the standard to allow "correct" programs to include those with undefined behavior. There was even an unadopted proposal (n853 [1]) attempting to clarify this.

While I was making sure I wasn't missing something obvious, I took a look through the rest of the WG14 proposals to see if I was somehow off in my understanding regarding translators being allowed to barf over UB anywhere in the program. There was a proposal clarifying the situation to the possible-execution understanding from upthread submitted by Victor Yodaiken (n2278 [2]), but unfortunately it was also never adopted.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n853.htm

[2] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2278.pdf

Kranar · on April 3, 2023

You are now mixing up correct and conforming programs. Neither the C nor the C++ Standard mentions anything involving correct programs. C uses the term conforming program and C++ uses the term well-formed program.

In C a conforming program is any program that satisfies any single conforming implementation, even if said implementation includes extensions or non-portable constructs. A strictly conforming program is a program that satisfies every conforming implementation, which implies that said program does not produce an output that depends on undefined, unspecified, or implementation defined behavior.

In C++ a well-formed program is any program that satisfies the syntax rules, diagnosable semantic rules and the one-definition rule.

LegionMammal978 · on April 3, 2023

Going off of the C17 numbering,

4/7: "A conforming program is one that is acceptable to a conforming implementation."

This definition has no restrictions regarding runtime requirements, unlike for strictly conforming programs.

5/1: "An implementation translates C source files and executes C programs in two data-processing-system environments, which will be called the translation environment and the execution environment in this International Standard. Their characteristics define and constrain the results of executing conforming C programs constructed according to the syntactic and semantic rules for conforming implementations."

So clause 5 binds all "conforming C programs constructed according to the syntactic and semantic rules for conforming implementations", not just strictly conforming programs.

Now, 4/3: "A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3."

We can interpret this as saying that "a program that is correct in all other aspects... containing unspecified behavior" is "constructed according to the syntactic and semantic rules for conforming implementations" even if it only works when "operating on correct data".

From there, it does not seem very difficult to conclude that in general, a conforming program which contains fully specified behavior, assuming it operates on correct data, is also "constructed according to the syntactic and semantic rules for conforming implementations", and is therefore bound by clause 5. If we were to instead take the negation of this conclusion, that a program is not bound by clause 5 if any possible input data causes it to violate a runtime requirement, then the wording of 4/3 would not make any sense.

(In other words, every conforming program has a corresponding set of "correct input data", and it is correct and bound by clause 5 if it does not violate any runtime requirements when given any input data within that set. A program is only incorrect if that set is empty, i.e., the UB is unconditional.)

---

Meanwhile, I suppose you're looking at C++17. The note in [intro.execution]/4 is non-normative, and all of the normative language (e.g., on the very next paragraph) attaches runtime UB to the execution as a function of the input data, not the pure program.

[intro.compliance]/(2.1) and its (non-normative) footnote further clarify the distinction, stating, "If a program contains no violations of the rules in this International Standard, a conforming implementation shall, within its resource limits, accept and correctly execute that program.... 'Correct execution' can include undefined behavior, depending on the data being processed; see 1.3 and 1.9." This suggests that a program that executes undefined operations does not necessarily contain any rule violations.

AlotOfReading · on April 3, 2023

I think your confusion here is coming from the fact that "unspecified behavior" is a specific thing in the standard's terminology (looking specifically at n3088 right now), distinct from the concept of "undefined behavior". So when it says "constructed according to the semantic rules", that inherently excludes UB, which definitionally has no semantics prescribed for it (unlike unspecified behavior). For brevity's sake, I'm ignoring the allowance that an implementation can give any particular UB defined semantics.

To reduce my point to a list of options:

* No UB in the program -> Specified by 5.*

* No UB for certain inputs -> Specified for those inputs, not specified otherwise

* UB present, but not on any possible execution path -> Not specified (this is the argument)

* UB present on every possible execution path -> Not specified (definitionally)

I linked this in a sibling comment, but there was a proposal to amend C2x's wording here to specifically exclude this type of insanity (n2278), but it wasn't adopted because it could potentially prohibit optimizations and the working group doesn't really want to address the issue of undefined behavior with more definitions.

LegionMammal978 · on April 3, 2023

My claim is, "If a program violates the semantic rules at runtime given input A, but does not violate the semantic rules at runtime given input B, then the execution of the program given input B will be defined by clause 5." This is because the wording of 4/3 implies that "being constructed according to the semantic rules" is a function of both the program and the data it is given, such that the behavior can be defined on some inputs by clause 5, but undefined on other inputs.

As I understand it, your claim is, "If a program violates the semantic rules at runtime given input A, then the behavior of the program is undefined given input B, even if the program would not have violated the semantic rules at runtime given input B." Am I misunderstanding your claim?

> I linked this in a sibling comment, but there was a proposal to amend C2x's wording here to specifically exclude this type of insanity (n2278), but it wasn't adopted because it could potentially prohibit optimizations and the working group doesn't really want to address the issue of undefined behavior with more definitions.

N2278 seems entirely irrelevant to this question, of whether potential UB given one input can cause unexpected behavior given another input. Instead, it seems to say, "If the program violates the semantic rules at runtime given input A, causing UB, then the implementation is forbidden from making that behavior identical to the program's hypothetical behavior if it had been given another input B." That looks pretty unworkable in the general case. (E.g., must compilers operate as if an out-of-bounds write on one object can modify the value of a totally different object?)

AlotOfReading · on April 5, 2023

I realized on thinking a bit more that two of the cases I mentioned are actually identical, so you're correct. Good to know going forward!

AnimalMuppet · on April 2, 2023

Fine. HN is, after all, a place where you can be pedantic.

But those of us who are actually writing programs mostly care about "in practical terms", and in practical terms, this doesn't happen, so we don't care. We've got enough trouble worrying about what does happen; we don't have time and energy to worry about what doesn't and won't happen.

AlotOfReading · on April 2, 2023

To provide some more context/motivation for why you might care, I write safety-critical code. I'm often advising people what they need to do for certification, etc. If all you need to do is ensure that you never execute undefined operations and knock out the list of specified UB, that's totally, 100% manageable. Throw some sanitizers on, provide realistic input, and test the hell out of it. Normal stuff.

If the reality is that any UB can invalidate the entire program (as is the interpretation taken by other standards re: C), then that's not remotely sufficient. You have to ensure the complete absence of UB.

still_grokking · on April 2, 2023

That's like saying: "I don't care what the standard says!"

Sure, this is perfectly fine.

Only that you're not writing any C/C++ then, but something in the "gcc 12 language with some switches", or maybe the "LLVM 15 language with some switches", or something like that.

AnimalMuppet · on April 2, 2023

Well, if Visual Studio (or whatever Microsoft calls their compiler these days), and all known versions of gcc, and all known versions of LLVM all do something sane, then I'm not sure I care all that much about the theoretical possibility that some compiler someday might do something insane.

coliveira · on April 2, 2023

> approximately every nontrivial program ever written has UB

You can replace "UB" with "bugs" and the result is the same. UB is a bug on the part of the programmer, from the point of view of C, similar to dereferencing a null pointer. When the standard says that something is UB, it is just clarifying what these situations are.

AlotOfReading · on April 2, 2023

What the standard explicitly calls out as UB is only a small subset of actual UB.

While you can certainly classify all UB as "bugs", doing so misses the critical differences between UB and other categories of bugs. If you have a logic bug for example, your program will correctly and consistently do the wrong thing. It will continue doing that wrong thing with a different compiler, on a different platform today and 10 years from now. Implementation defined behavior is a bit looser, but will still be consistent with any particular implementation (which will document the behavior) and will only manifest in the code that depends on it. A PR inserting one of these "normal" bugs doesn't invalidate the entire rest of the program.

UB is different. You can't make assumptions about UB because from the point of view of the standard, UB is "not C". There are no assumptions to be made, it's just all the stuff that doesn't have assigned semantics. And since the input is meaningless, so is the entirety of whatever the compiler gives you back.

coliveira · on April 2, 2023

> If you have a logic bug for example, your program will correctly and consistently do the wrong thing.

Not correct. Bugs can occur differently on different architectures, even in high level languages. UB is just a kind of bug whose effect depends on how the compiler behaves, so you have to be careful to test your code on different compiler settings. This is nothing new in programming languages, it is only made explicit in the C standard. Suddenly people started to believe that pointing out the obvious source of bugs (UB) in the standard is equivalent to letting programs misbehave.

AlotOfReading · on April 2, 2023

I'm not sure if you're making a point about "unspecified behavior" (where the compiler can choose between multiple valid behaviors), but no, a strictly conforming program will have the same semantics on different architectures. Strictly conforming programs can still have bugs, but their nature is completely different than UB because that's the point of the standard.

adgjlsfhk1 · on April 2, 2023

> you have to be careful to test your code on different compiler settings.

The problem is you have to test your code on compilers that don't exist yet with compiler settings that do different things from any compiler that ever might exist.

coliveira · on April 2, 2023

This has always been the case. If you write code that has UB, new compilers can do something yet undefined, by definition.

cryptonector · on April 2, 2023

Bugs are UB-like in a sense (what's the code going to do? well, you'll have to think about it, or try it and see), but UB is strictly worse than bugs (different compilers, even different versions of the same compiler, can do radically different things way beyond the scope of the bug).

mjevans · on April 2, 2023

That's exactly why a compiler shouldn't be able to 'optimize' in the face of UB, it should be an ERROR and the section of undefined behavior highlighted in the error message.

chongli · on April 2, 2023

This would mean you’d have to insert a check every time you add two signed integers together, because signed overflow is UB. You’d also have to wrap every memory access with bounds checks, because OOB memory access is UB.

There are also tons and tons of loop optimizations compilers do for side-effect free loops which would have to be removed completely. This is because infinite loops without side effects are UB. So if you wanted these optimizations you’d have to prove to the compiler — at compile time — that your loop is guaranteed to terminate since it is not allowed to assume that it will. Without these loop optimizations, numerical C code (such as numpy) would be back in the stone ages of performance.

Edit: I just wanted to point out that one of the new features in C23 is a standard library header called <stdckdint.h> that includes functions for checked integer arithmetic. This allows you to safely write code for adding, subtracting, and multiplying two unknown signed integers and getting an error code which indicates success or failure. This will be the standard preferred way of doing overflow-safe math.
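A minimal sketch of what that looks like in practice. The helper name `safe_add` is made up for illustration, and the fallback to `__builtin_add_overflow` is an assumption for GCC/Clang toolchains that don't ship <stdckdint.h> yet. Note that `ckd_add` itself returns true when the result did not fit, so the helper inverts it.

```c
#include <limits.h>
#include <stdbool.h>

/* Use the C23 header when the toolchain provides it. */
#if defined(__has_include)
#  if __has_include(<stdckdint.h>)
#    include <stdckdint.h>
#  endif
#endif

/* Assumption: on older GCC/Clang, fall back to the builtin, which has the
 * same "returns true on overflow" convention. */
#ifndef ckd_add
#  define ckd_add(result, a, b) __builtin_add_overflow((a), (b), (result))
#endif

/* Hypothetical helper: returns true on success and stores a + b in *out. */
static bool safe_add(int a, int b, int *out)
{
    return !ckd_add(out, a, b);
}
```

Calling `safe_add(INT_MAX, 1, &r)` reports failure instead of running into signed overflow.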

DangitBobby · on April 2, 2023

Another option would be to define behaviors for integer overflow and out of bounds memory access. Presumably they happen fairly often and it might be a good idea to nail down what should happen in those cases.

chongli · on April 2, 2023

Those things aren’t up to the language, they’re up to hardware. C is a portable language that runs on many different platforms. Some platforms might have protected memory and trap on out of bounds memory access. Other platforms have a single, flat address space where out of bounds memory access is not an error, it just reads whatever is there since your program has full access to all memory.

The same goes for integer overflow. Some platforms use 1’s complement signed integers, some platforms use 2’s complement. Signed overflow would simply give different answers on these platforms. The standards committee long ago decided that there’s no sensible answer to give which covers all cases, so they declared it undefined behaviour which allows compilers to assume it’ll never happen in practice and make lots of optimizations.

Forcing signed overflow to have a defined behaviour means forcing every single signed arithmetic operation through this path, removing the ability for compilers to combine, reorder, or elide operations. This makes a lot of optimizations impossible.

adrian_b · on April 2, 2023

The problem is that here is a vicious circle.

Most old computer architectures had a much more complete set of hardware exceptions, including cases like integer overflow or out-of-bounds access.

In modern superscalar pipelined CPUs, implementing all the desirable hardware exceptions without reducing the performance remains possible (through speculative execution), but it is more expensive than in simple CPUs.

Because of that, the hardware designers have taken advantage of the popularity gained by languages like C and C++ and almost all modern programming languages, which no longer specify the behavior for various errors, and they omit the required hardware means, to reduce the CPU cost, justifying their decision by the existing programming language standards.

The correct way to solve this would have been to include in all programming language standards well-defined and uniform behaviors for all erroneous conditions, which would have forced the CPU designers to provide efficient means to detect such conditions, like they are forced to implement the IEEE standard for floating-point arithmetic, despite their desire to provide unreliable arithmetic, which is cheaper and which could win benchmarks by cheating.

chongli · on April 2, 2023

CPU designers don't like having their hand forced like that. If you create a new standard forcing them to add extra hardware to their designs, they'll skip your standard and target the older one (which has way more software marketshare anyway). They will absolutely bend over backwards to save a few cycles here and a few transistors there, just so they can cram in an extra feature or claim a better score on some microbenchmark. They absolutely do not care at all about making life easier for low-level programmers, hardware testers, or compiler writers.

rini17 · on April 2, 2023

I don't believe adding simple checks against data already present in L1 caches and marked as "unlikely to fail" should be so onerous.

saagarjha · on April 4, 2023

> In modern superscalar pipelined CPUs, implementing all the desirable hardware exceptions without reducing the performance remains possible (through speculative execution), but it is more expensive than in simple CPUs.

Yeah, and that's how you get security vulnerabilities!

johnny22 · on April 2, 2023

doesn't C force 2s complement now? If so, one less thing to worry about.

bluecalm · on April 2, 2023

UB is a better option though. When your signed integer overflows it's a bug nevertheless. Why force the compiler to generate code for a pointless case instead of letting it optimize the intended one?

If you value never having bugs over performance then just insert a check or run your program with a sanitizer that does that for you. It's a solved problem for a case where performance doesn't matter. The thing is that it does.

skitter · on April 2, 2023

That would be great if it was possible, but how do you specify & implement sensible behavior for this:

    void foo(int *a, int b) { a[b] = 1; }

At runtime there is no information about whether that write is in bounds and no way to prevent this from corrupting arbitrary data unless you compile for something like CHERI.

mjevans · on April 5, 2023

In checked languages this would probably be an 'unsafe' function, since it lacks those features.

If this were accessible at build time it could be checked for anything that references the function and bounds checked accordingly.

The promotion of a pointer to an array is really the source of the logical error. A language could place range checks on created arrays, and pointers / references to allocated arrays could be handled differently than anonymous slabs of memory. However an array without bounds (even stored elsewhere from just before the array's starting address) is as unsafe as 'null terminated strings' for length bounds. That's an idea that made much more sense when systems were much smaller and slower and the exposure to untrusted code and data were also far lower.

void foo(void *a, int b) { (int[])(a) = 1 } // Not quite C pseudocode, also see poke()
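A rough sketch of what carrying the bounds alongside the pointer could look like in plain C. The `int_span` type and `span_set` function are hypothetical names, not anything from the standard library, and the check happens at run time rather than at build time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat pointer": the length travels with the data. */
struct int_span {
    int   *data;
    size_t len;
};

/* Returns false instead of writing out of bounds. */
static bool span_set(struct int_span s, size_t index, int value)
{
    if (s.data == NULL || index >= s.len)
        return false;
    s.data[index] = value;
    return true;
}
```

This only helps when every caller builds the span from a real array and its true length; a span built from a bare pointer and a wrong length is as unsafe as before.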

gpderetta · on April 2, 2023

Good luck defining the behaviour of use after free or of accessing out-of-bounds stack memory without bounds checking and GC.

saagarjha · on April 4, 2023

They don't happen that often. That's why they're bugs!

mafuy · on April 2, 2023

> you’d have to insert a check every time you add two signed integers together,

This is exactly what is done in serious code. It is typically combined with contracts and static analysis (often human), e.g. "it is guaranteed that this input is in range 10-20, so adding it with this other 16 bit int can be assumed to be below sint32_max".
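As a sketch of that kind of reasoning written down in code (the function name and the use of `assert` to state the contract are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function; the contract is that input is in [10, 20]. */
static int32_t add_bounded(int32_t input, int16_t other)
{
    assert(input >= 10 && input <= 20);   /* checked in debug builds only */
    /* Worst case: 20 + 32767 = 32787 and 10 + (-32768) = -32758,
     * both far inside the int32_t range, so no overflow check is needed. */
    return input + other;
}
```

The range argument lives in the comment and the assertion; the addition itself stays unchecked.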

pclmulqdq · on April 2, 2023

Great, those checks can stay in "serious" code, and those of us who don't want them can take the UB. C++ 20 actually ended up specifying that all ints are twos complement, removing this from the category of "UB," but a lot more weird stuff is programmed in C.

gpderetta · on April 2, 2023

Note that signed overflow is still UB in c++ even with 2-complement being guaranteed for signed types.

heywhatupboys · on April 2, 2023

> because signed overflow is UB

no longer

circuit10 · on April 2, 2023

Doing that at compile time would require being able to perfectly predict everything the program can do, which is equivalent to solving the halting problem (make the program do something undefined after it finishes, then if you get an error at compile time then it halts) and is mathematically impossible. Doing it at runtime would have a massive performance impact

gpderetta · on April 2, 2023

We rehash this argument every few weeks. Please search the comment history why it is nonsensical.

pmarin · on April 2, 2023

If they are bugs they should be reported to the user and end the compilation with an error.

the_why_of_y · on April 2, 2023

Compilers actually have some options to enable that.

The problem is, it only works well in the simplest cases when the code will 100% exhibit UB within a single function.

In most cases, the UB would only manifest on particular input values - if you want your compiler to warn about that then it will report one "potential UB" for every 10 lines of C code, and nobody wants to use such a compiler.

jcranmer · on April 2, 2023

The case of realloc being declared UB (as opposed to impl-defined) was not driven by the compiler writers but by the people who write the C libraries.

This isn't a case of compilers screwing over the programmers, because the people who are responsible for those optimizations are the people who are scratching their heads as to why it's UB and not impl-defined behavior.

GuB-42 · on April 2, 2023

UB can initiate the rise of zombie velociraptors.

  int n;
  printf("type 0 to stop the rise of zombie velociraptors");
  scanf("%d", &n);
  realloc(pre, n);
  if (n != 0) rise_zombie_velociraptors();

May result in velociraptors rising even if the user enters "0".

The reason is that because realloc(pre, 0) is UB, for the compiler, it cannot happen, so n can't be 0, so the n != 0 test can be optimized out, so, velociraptors.

Asooka · on April 2, 2023

> The second part is the misconception about the impact of UB. Making something UB does not dictate that its usage will initiate the rise of zombie velociraptors. It grants the implementation the power to decide the best course of action. That is, after all, what they’ve been doing all this time anyway.

Wrong. UB never happens. That is the promise the program writer makes to the compiler. UB never happens. A correct C program never executes UB. This allows the compiler to assume that anything that is UB never happens. Does some branch of your program unconditionally execute realloc(..., 0) after constant propagation? That branch never happens and can just be deleted.

Reading the defect report, they state "Classifying a call to realloc with a size of 0 as undefined behavior would allow POSIX to define the otherwise undefined behavior however they please." which is wrong. UB cannot be defined, if you define it, you are no longer writing standard C. It should instead have been classified as "implementation-defined behaviour".

In any case it's not that hard to just write a sane wrapper. This one is placed in the Public Domain:

    void *sane_realloc(void *ptr, size_t sz)
    {
        if (sz == 0) {
            free(ptr); /*free(NULL) is no-op*/
            return NULL;
        }
        if (ptr == NULL) {
            return malloc(sz);
        }
        return realloc(ptr, sz);
    }

I am calling it sane and not safe, because it is not safe. You still have the confusion of what happens when the function returns NULL (was it allocation failure or did we free the object?) - check errno. However, it has the same fully defined semantics on most all implementations and acts like people would expect.

You may be tempted to make the function return the value of errno, mark it [[nodiscard]] and take a pointer-to-pointer-to-void, so that the value of the pointer will only be changed if the reallocation was successful. I am not sure if that is safer. You are trading one possible bug - null pointer on allocation failure, which then will cause a segmentation fault for another - stale pointer on allocation failure, but with updated size. The latter is more likely to be used in buffer overflow attacks than the former.
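For reference, a sketch of that pointer-to-pointer variant. The name `realloc_in_place` is hypothetical, `ENOMEM` comes from POSIX rather than ISO C, and the [[nodiscard]] attribute is left out so the sketch also compiles before C23.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical variant: returns 0 on success or ENOMEM on failure.
 * *pptr is only modified when the call succeeds. */
static int realloc_in_place(void **pptr, size_t sz)
{
    if (sz == 0) {
        free(*pptr);                /* free(NULL) is a no-op */
        *pptr = NULL;
        return 0;
    }
    void *p = realloc(*pptr, sz);   /* realloc(NULL, sz) acts like malloc(sz) */
    if (p == NULL)
        return ENOMEM;              /* old block is still valid and still owned */
    *pptr = p;
    return 0;
}
```

The caller has to update its stored size only after a 0 return, which is exactly the step that is easy to get wrong.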

omoikane · on April 2, 2023

> This is written with quite a lot of hyperbole

The first sight of "catch fire" might not have caught my attention, but by the time it got to "instrument of arson" and "Molotov cocktails", the style was sufficiently distracting that I was convinced I wasn't the intended audience.

benj111 · on April 2, 2023

My understanding was that they're changing realloc() because they previously allowed zero length arrays and because you can't tell if this is a zero length array you need to either get rid of zero length arrays or change realloc().

So the feature wasn't broken to begin with, it was broken by another feature.

GuB-42 · on April 2, 2023

I actually like unreachable() a lot. What it does is that it invokes undefined behavior, that's all.

It does nothing trickier than any other kind of UB. In fact, I could implement unreachable() like this: void unreachable() { *(char *)0 = 1; }.

Standardizing it however gives interesting options for compilers and tool writers. The best use I can find is to bound the values of the argument of a function. For example, if we have "void foo(int a) { if (a <= 0) unreachable(); }", it tells the compiler that a will always be >0 and it will optimize accordingly, but it can also be used in debug builds to trigger a crash, and static analyzers can use that to issue warnings if, for example, we call foo(0). The advantage of using unreachable() instead of any other UB is that the intention is clear.
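One possible arrangement, sketched under assumptions: the `ASSUME` macro and `split_evenly` are made-up names, the debug/release split keyed on NDEBUG is a project convention rather than anything the standard mandates, and the fallback to `__builtin_unreachable()` assumes a pre-C23 GCC or Clang.

```c
#include <stddef.h>   /* C23 puts unreachable() here */
#include <stdlib.h>   /* abort */

/* Assumption: pre-C23 GCC/Clang, where the builtin plays the same role. */
#ifndef unreachable
#  define unreachable() __builtin_unreachable()
#endif

/* Hypothetical macro: crash loudly in debug builds, hand the fact to the
 * optimizer in release builds (NDEBUG defined). */
#ifdef NDEBUG
#  define ASSUME(cond) do { if (!(cond)) unreachable(); } while (0)
#else
#  define ASSUME(cond) do { if (!(cond)) abort(); } while (0)
#endif

/* With a > 0 known, the compiler may drop any handling of a <= 0. */
static unsigned split_evenly(unsigned total, int a)
{
    ASSUME(a > 0);
    return total / (unsigned)a;
}
```

In the debug configuration a violated precondition aborts; in the release configuration it is plain UB, with everything that implies.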

lionkor · on April 2, 2023

Respectfully, you would already be doing this in any C codebase, with `assert()`, right? We are all checking our preconditions with assert... right?

GuB-42 · on April 2, 2023

AFAIK, assert() is not undefined behavior, so it can't be used for optimization. It is either implementation-defined in debug mode, or does nothing in release mode.

For example:

  assert(a >= 0);
  if (a < 0) printf("a is negative");

In release mode, assert() will be gone, so the if/printf() will stay. If we used "if (a < 0) unreachable();" instead of assert(), it would optimize away both lines.

pornel · on April 2, 2023

NDEBUG makes these checks disappear, so that's not an option for checks that are supposed to stay in the program.

lprib · on April 2, 2023

Using `unreachable()` instead of `assert()` for your preconditions without profiling first is just pre-loading the gun to shoot yourself in the foot in the future. When those preconditions are inevitably violated at some point, you will get random UB corruption rather than simply aborting as is the case for assert.

GuB-42 · on April 2, 2023

Yep, undefined behavior is unsafe, C in general is unsafe. There are plenty of languages that are safe, though a little bit rusty like Ada, use one of these if you want safe code.

If you still want to use C, for example for compatibility reasons and want to make it safer, assert isn't going away (unless you set NDEBUG). Preconditions are not "inevitably violated", there are ways of making sure they aren't, and I think an explicit "unreachable()" can help tools that are designed for that purpose.

Should you profile first before using unreachable() for optimization purposes? Maybe, but the important part is that now, you have a way of clearly and effectively telling the compiler what you know will never happen so that it can optimize accordingly, whether it is before or after profiling.

Compilers usually do a great job at optimization, but there are often some edge cases the compiler has to take into account in order to generate code that complies with the C standard, and it can have an impact on performance. unreachable() is one way to tell the compiler "please forget about the edge case, I know it won't happen anyways", the best part is that it is explicit, no obscure tricks here.

Side note about profilers: no matter what your strategy is with regards to optimization, I think profilers are essential tools that don't get enough attention. People talk a lot about linting, coverage and unit tests, but profilers are not to be left out. They are not just tools that tell you where not to optimize your code, they can also find bugs, mostly performance bugs, but not only.

ptx · on April 2, 2023

> What it does is that it invokes undefined behavior, that's all. [...] it can also be used in debug builds to trigger a crash

How can it be used to trigger a crash (a specific behavior) if the behavior it invokes is undefined? Are you saying it would be defined differently for debug builds so that it doesn't invoke undefined behavior?

firstlink · on April 2, 2023

> and that such changes may impose themselves on old code without recompilation when dynamically linked libraries are upgraded.

All I can do is laugh. This is what the dynamic linker fanatics wanted. This is what they explicitly advocate for to this day. Share and enjoy!!

AshamedCaptain · on April 2, 2023

I really don't think anyone could possibly want the _specified behavior_ of a function changing below their feet.

However, the author is unlikely to be correct here. E.g., to this day, glibc contains _multiple implementations of memcpy_ just to satisfy those executables that depend on the older, memmove-like behavior that was once part of the unspecified behavior of glibc. The only way to get the dynamic linker to choose one of the newer versions is to, well, rebuild the executable. It is inconceivable that glibc would not use symbol versioning with an actual specification change.

The behavior is practically the same as with static linking, and you still get the benefits of dynamic linking.

throwaway892238 · on April 2, 2023

People who don't understand dynamic linking are doomed to re-implement it, poorly.

coliveira · on April 2, 2023

Exactly! Shared libraries mean that new code with modified behavior can and will be called when made available, independent of how the original code was compiled. It is interesting that people come out to complain about this obvious behavior.

hermitdev · on April 2, 2023

The problem isn't changing implementation. This is expected with shared libs. The problem is changing the contract of the function and then expecting it to be drop in compatible. It's not. It _should_ be treated as a breaking ABI change, because the old behavior and new behavior are not compatible, yet it's being masqueraded as such. It's quite literally the same behavior/attitude behind the "w" vs "wt" change that led to aCropalypse.

tedunangst · on April 2, 2023

It's a really weird complaint. The standard specifies that it's now undefined behavior. That imposes zero requirements to change the library. Whatever it is the library was doing, it's one possible undefined behavior.

bayindirh · on April 2, 2023

I’d rather have small binaries and memory efficient systems instead of huge blobs having their own complete disconnected environments with non-coherent behavior on the same situation. Also, wasting tons of memory while at it.

If I have something that critical, I can always statically compile.

GuB-42 · on April 2, 2023

> C17 purports to be a bug-fix revision of C11. Does the word "toto" on page 1 indicate (a) the editor's musical tastes; (b) that nobody bothered to spell-check the document; (c) that we're not in Kansas anymore; or (d) none of the above?

As a french guy I'd go with (d).

I've often seen "toto" used as a placeholder name, sometimes followed by "titi", "tata", "tutu", I have even used it myself. It is similar to "foo", "bar", "baz". I don't know if it is specific to France, or to French-speaking countries, but it is definitely a thing here.

rahen · on April 2, 2023

Most likely toto as the French for foobar.

Jens Gustedt is part of the C committee and participated in C23. He also works for INRIA in France: https://en.wikipedia.org/wiki/French_Institute_for_Research_...

RustyRussell · on April 2, 2023

Frankly, the C standards ctte went off the deep end when they effectively banned NULL to memset etc (obv with zero length).

Not because these functions couldn't handle it, but because this assertion simplifies optimizations elsewhere.

This has required adding extra checks in my code, found mainly by trial and error, and has made it less readable and less optimal.
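The kind of guard this forces looks roughly like the following sketch (the wrapper name is made up here):

```c
#include <string.h>

/* Hypothetical wrapper: skips the call when there is nothing to set, so a
 * NULL pointer with a zero length never reaches memset. */
static void *memset_or_skip(void *dst, int value, size_t n)
{
    if (n == 0)
        return dst;
    return memset(dst, value, n);
}
```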

Finally, the checked arithmetic operations returning false on success is a horror show. Fortunately it will be found on the first time the code is run, but that's a damnably low bar :(

ericpauley · on April 2, 2023

> Finally, the checked arithmetic operations returning false on success is a horror show.

This seems in line with C conventions? Generally a 0 return code means success.

wruza · on April 2, 2023

With int statuses, not with bools. It’s just a twisted logic in return value you have to deal with in your head.

“If checked operation has a status, then it failed.” - ok

“If checked operation [is true], then it failed.” - wat

SAI_Peregrinus · on April 2, 2023

The checked operations ask "did an error occur?". If it's false, then the check passed and no error occurred. If it's true, then the check indicated an error.

masklinn · on April 2, 2023

> With int statuses, not with bools

Which C historically did not have, so int played that role. The function is the same, and the existing idioms remain.

wruza · on April 2, 2023

I find it strange to introduce real bools (which these macros return according to their official signatures) and then to assign them a meaning of a still-nonexistent but widely used C type. At least my C intuition stumbles upon that immediately, no matter how long I think about it.

Ah, anyway, standard C/libc is basically a lost cause. It can’t get any worse, since you have to refer to a manual at every call to not step on a landmine.

Kamq · on April 2, 2023

> Finally, the checked arithmetic operations returning false on success

That's what got you? C functions returning error flags (with zero meaning no error) isn't exactly new.

Dwedit · on April 2, 2023

Replace memset with a macro, that's the C way.

notfed · on April 2, 2023

Isn't the return value just a carry bit?

spc476 · on April 2, 2023

Not every CPU C runs on has a carry bit. MIPS, SPARC, RISC-V, all don't have the concept of a "carry bit."

JonChesterfield · on April 2, 2023

Author is angry but not wrong. Lifting the most damning quote from the article as I haven't seen it for a while.

C inventor Dennis Ritchie pointed to several flaws in [ANSI C] ... which he said is a licence for the compiler to undertake aggressive optimisations that are completely legal by the committee's rules, but make hash of apparently safe programs; the confused attempt to improve optimisation ... spoils the language.

—Dennis Ritchie on the first C standard

juunpp · on April 2, 2023

> The ckd_* macros steer a refreshingly sane path around arithmetic pitfalls including C's "usual arithmetic conversions."

A 7 letter function to add two numbers and that returns a boolean... not entirely sure I'd call that 'sane'.

ludocode · on April 2, 2023

I'd prefer if it were more letters. It bothers me when API designers omit random letters just to save a few keystrokes. These are particularly egregious because I keep forgetting which letters they kept. Is it "chk"? or "ckd"? or "chd"? or something else?

I wrote a portability library that wraps these with compiler intrinsic and standard C fallbacks. I chose to spell out the full word in addition to making the type explicit. It's a lot more verbose of course but a lot clearer to read:

https://github.com/ludocode/ghost/blob/develop/include/ghost...

goatlover · on April 2, 2023

A saner language would handle the conversion for you so it would work with just the normal math operators.

masklinn · on April 2, 2023

How would that work for the largest type supported by the platform?

pjmlp · on April 2, 2023

A panic would be thrown, like in memory safe system programming languages, those that were in use outside Bell Labs and unfortunately lost to UNIX.

ChancyChance · on April 2, 2023

Is the world finally realizing that "a + b" actually returns two values: pass/fail and the value if pass?

"a + b = c;" is a fundamentally flawed operation from a computer architecture perspective.

c4mpute · on April 2, 2023

First, you might have meant c = a+b;

The other way isn't really definable as an assignment mathematically.

And there is a lot more to it than just pass/fail. First, an addition doesn't fail: from a computer architecture perspective, the addition will always succeed; the only things that could fail (in all the usual architectures) are possible memory fetch and store operations when not strictly dealing in register or immediate operands. Second, there is no fail flag. There is an overflow flag, an underflow flag, a zero flag, a sign flag and a few more that are irrelevant here. Any of overflow, underflow, zero or sign might mean that the operation "failed" depending on the types of your operands. The processor doesn't know anything about the type, so there won't be a straightforward 'fail' flag in any case. Only the library or compiler can use type information such as (un)signedness, bignum-ness, nonzeroness, desired wraparound (for modular types) and other possible types together with aforementioned flags to decide if that addition might have failed.

So nothing is fundamentally flawed, what you are describing is just insufficiently complex (because there is no fail flag, just a ton of other flags) or overly complex (because uint32_t c = a + b is modular 2^32 arithmetic and cannot fail).
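To make the modular case concrete, here is a small sketch. The helper name `add_u32` is an assumption for illustration; the carry it reports is reconstructed in C from the wrapped result rather than read from any CPU flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: modular 2^32 addition that also reports the carry
 * a CPU flag would have held. Unsigned wraparound is defined behavior. */
static uint32_t add_u32(uint32_t a, uint32_t b, bool *carry)
{
    uint32_t sum = (uint32_t)(a + b);   /* wraps modulo 2^32 */
    *carry = sum < a;                   /* wrapped iff the result got smaller */
    return sum;
}
```

Whether a set carry counts as a "failure" is up to the caller, which is the point: the addition itself always produces a result.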

khazhoux · on April 2, 2023

> First, you might have meant c = a+b;

> The other way isn't really definable as an assignment mathematically.

This correction is condescending and unnecessary. Unless the person had never written a single line of code in their life, then they would obviously know "a+b" is not a modifiable lvalue.

And the point about pass/fail was also obviously not meant to capture the full complexity of the flags set by a CPU operation. It was very clearly a statement about how basic addition does not behave in computers the way it does on paper -- as simple as that.

From HN guidelines: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize."

c4mpute · on April 2, 2023

You might be right on the first point. Edit: actually, you might not be. There are languages with compound lvalues and CPU architectures with multiple result registers (x86 being the best-known example). E.g. you can do "(result, flags, err) = do_stuff(a, b, c)" in Go, and x86 DIV storing different parts of the division result in different registers: https://c9x.me/x86/html/file_module_x86_id_72.html And generally with common CPU architectures, flags are another such result register that is always written, such that any operation like c := a+b is actually something like (c, flags) := a+b. And for stuff like multiplication, there is actually the notion of two result registers being the higher and lower part of the resulting operation, like (a * 2^32 + b) = c * d (see x86 MUL). Therefore some precision in language is necessary for the discussion (and yes, the different meanings of ==, =, := in various languages and mathematics are also confusing, even to me ;).
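The two-register multiplication result can be written out in portable C. This is a sketch with a made-up helper name, computing (hi * 2^32 + lo) = c * d through a 64-bit intermediate rather than through any particular instruction.

```c
#include <stdint.h>

/* Hypothetical helper: (hi * 2^32 + lo) = c * d, mirroring the two result
 * registers of a widening multiply. */
static void mul_wide_u32(uint32_t c, uint32_t d, uint32_t *hi, uint32_t *lo)
{
    uint64_t product = (uint64_t)c * d;   /* cannot overflow 64 bits */
    *hi = (uint32_t)(product >> 32);
    *lo = (uint32_t)product;
}
```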

I do strongly disagree on the second one about pass/fail. This kind of nitpicking is necessary here, because the discussion is about a standard intended to precisely describe such operations, and how the underlying hardware might be utilized to execute them. Being imprecise in this context is dangerous, wrong, problematic and leads to the whole point of the discussion being lost in a sea of handwaving.

JonChesterfield · on April 2, 2023

> The other way isn't really definable as an assignment mathematically.

It's an equality sign. See also, := and unification.

Arch-TK · on April 2, 2023

There is actually another option.

A more sophisticated type system.

Let's say you had some pseudocode like this:

    let a = 5
    let b = 12
    let c = a + b

The type of a would be Integer[5..5], the type of b would be Integer[12..12], the type of c would therefore be Integer[17..17]. In a more complex example:

    def foo(a: Integer[0..10], b: Integer[0..10]):
        return a + b

The return type of this function would be Integer[0..20].

This kind of type system can solve a number of issues, all but division by zero (which would probably still have to be solved with some kind of optional type).

If type inference dictates that the upper range of an integer would be too large to physically store in a machine data type, then you either resort to bignums or you make it a compilation error. By adding modular and saturating integer types you can handle situations where you want special integer behaviours. By explicitly casting (with the operation returning an optional) you can handle situations where you want to bound the range. This drastically simplifies a lot of code by removing explicit bounds checks in all places except where they are absolutely necessary. If for some reason you care about the space or computational efficiency of the underlying machine type, you can have additional annotations (like C's u?int_(least|fast)[0-9]+_t). If you absolutely must map to a machine type (this is usually misguided, unless you are dealing with existing C interfaces, for which such a language can provide special types) you can have more annotations.

Ada has something resembling this. I believe there are some other languages that implement similar features. I believe this sort of thing has a name, but I am not great with remembering the names of things.

Hopefully this is some food for thought.

im3w1l · on April 2, 2023

I think the issue with this is that the worst-case bounds normally grow much faster than the actual values. And it can be easy to see for the programmer that the values can't actually grow that much because a is only big when b is small or some property like that, but then you have to convince the compiler of the same. I might be misremembering though.

codethief · on April 2, 2023

> because a is only big when b is small or some property like that

Exactly, the expressiveness of the type system then (typically) becomes the obstacle: How do you express that a and b could each reach INT_MAX but their sum never exceeds INT_MAX?

Arch-TK · on April 2, 2023

Those kinds of assumptions are where you explicitly cast to a smaller ranged type with the option of an error if the sum does exceed a limit. The point of this type system is not to be able to fully encode every possible interaction between numbers in a system, but rather to remove unnecessary bounds checking in a bunch of cases and make it explicit in the few cases where you ARE actually making an assumption.

codethief · on April 3, 2023

> Those kinds of assumptions are where you explicitly cast to a smaller ranged type

But how exactly do you do that? As mentioned, a and b individually can still reach INT_MAX.

I agree with your overall assessment, though. If a type system could represent (and recognize, and evaluate / automatically draw conclusions from) any possible restriction on the values of variables this would probably amount to the type system being able to carry out arbitrary mathematical proofs. The existence of such a type system seems rather unlikely.

Arch-TK · on April 3, 2023

In some pretend rust. Pretend Integer exists and has the properties I described:

    fn special_sum(a: Integer<0..(1<<32)>, b: Integer<0..(1<<32)>) -> Option<Integer<0..(1<<32)>>
    {
        Integer<0..(1<<32)>::try_from(a + b)
    }

The type of the expression a+b would be Integer<0..(1<<33)-1>.

Obviously you can design a language which makes this much more ergonomic. That language can also avoid needing to use a type larger than a 32-bit-wide unsigned integer to perform the operation through a special optimisation. Moreover, there's nothing stopping the type system from being able to maintain more sophisticated rules; for example, a product type of two Integer<0..(1<<32)> values could carry a rule which ensures that the sum of the two values cannot exceed 2^32-1.
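For comparison, here is roughly how the same idea looks in real, stable Rust today. It is a sketch under assumptions: `u32` stands in for Integer<0..(1<<32)>, `u64` stands in for the wider inferred type of a + b, and `BoundedPair` is a hypothetical name for the product type with the sum rule.

```rust
/// Stand-in for the pretend `special_sum`: the sum is computed in a wider
/// type and then explicitly narrowed, with `None` if it does not fit.
fn special_sum(a: u32, b: u32) -> Option<u32> {
    // Cannot overflow: the largest possible value is 2^33 - 2.
    let wide: u64 = u64::from(a) + u64::from(b);
    u32::try_from(wide).ok()
}

/// Hypothetical product type whose invariant is `a + b <= u32::MAX`.
/// The invariant is checked once, at construction.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BoundedPair {
    a: u32,
    b: u32,
}

impl BoundedPair {
    /// Returns `None` when the two values would break the sum invariant.
    fn new(a: u32, b: u32) -> Option<Self> {
        a.checked_add(b).map(|_| BoundedPair { a, b })
    }

    /// Relies on the invariant established in `new`, so this addition
    /// never overflows.
    fn sum(self) -> u32 {
        self.a + self.b
    }
}

fn main() {
    assert_eq!(special_sum(1, 2), Some(3));
    assert_eq!(special_sum(u32::MAX, 1), None);

    let p = BoundedPair::new(u32::MAX, 0).expect("sum fits in u32");
    assert_eq!(p.sum(), u32::MAX);
    assert!(BoundedPair::new(u32::MAX, 1).is_none());
}
```

The difference from the pretend version is that nothing here is inferred: the widening to `u64` and the invariant on `BoundedPair` are written by hand, which is exactly the bookkeeping a range-aware type system would take over.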

wizzwizz4 · on April 2, 2023

> but then you have to convince the compiler of the same.

In conventional parlance, this is known as "handling overflow".

still_grokking · on April 2, 2023

> I believe this sort of thing has a name, […]

https://en.wikipedia.org/wiki/Refinement_type

But the concept is just a little bit over 30 years old. So don't expect it shows up in most mainstream languages before the end of the next 20 years, and don't expect it to come to the C languages ever.

Meanwhile in mainstream ML-land:

https://github.com/Iltotore/iron

(Or for the older version of the language: https://github.com/fthomas/refined)

(Please also note that for this feature both versions don't need language support at all but are "just" libraries, as the language is powerful enough to express all kinds of type level / compile time computations in general.)

Arch-TK · on April 3, 2023

I know C is never getting anything like this. But C is really just stuck being a very crappy ABI design language.

I do wish things like Rust had native support for stuff like this though.

And it really doesn't have to get in the way of anyone who insists on the more primitive type systems, but it would be nice to have a high level language which didn't have assembly-level (of abstraction) integer types. Why should I care that my machine works with power-of-two multiples of 8 bits at a time, or that they have specific wrapping behaviour?

codethief · on April 3, 2023

> But the concept is just a little bit over 30 years old. So don't expect it shows up in most mainstream languages before the end of the next 20 years, and don't expect it to come to the C languages ever.

Any specific results/papers from (refinement) type theory you hope/expect to see implemented in the next 20 years?

JonChesterfield · on April 2, 2023

Compilers do this sort of range tracking anyway. At least within a function. It's useful for loop optimisations.

notfed · on April 2, 2023

It's a flaw that has a pretty good tradeoff: unparalleled readability.

ChancyChance · on April 2, 2023

It depends. If you want to study maths, yes. If you want to be a programmer:

[status, value] = add(a, b);

Is much more unparalleled-ly (?) readable from the perspective of how a computer actually operates. In reality, this:

uint c = (uint)a + (uint)b; // (to make that other guy happy)

is really:

c = (a + b) % (sizeof(uint));

in "C", which is less readable but far more accurate.
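As a hedged illustration of both spellings (in Rust rather than C, since its standard integer methods map onto them directly): `overflowing_add` returns the value together with a status flag, and `wrapping_add` is the modular form, where the modulus for an N-bit unsigned type is 2^N. The function names `add_with_status` and `add_modular` are invented for this sketch.

```rust
/// The "[status, value] = add(a, b)" shape: the status is `true`
/// when the mathematical sum did not fit in 32 bits.
fn add_with_status(a: u32, b: u32) -> (bool, u32) {
    let (value, overflowed) = a.overflowing_add(b);
    (overflowed, value)
}

/// The modular shape: the sum reduced modulo 2^32.
fn add_modular(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

fn main() {
    let (status, value) = add_with_status(u32::MAX, 2);
    assert!(status);
    assert_eq!(value, 1);

    // Same result as doing the sum in a wider type and reducing mod 2^32.
    let wide = (u64::from(u32::MAX) + 2) % (1u64 << u32::BITS);
    assert_eq!(u64::from(add_modular(u32::MAX, 2)), wide);

    // No overflow: the status is false and the value is the plain sum.
    assert_eq!(add_with_status(3, 4), (false, 7));
}
```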

ChancyChance · on April 2, 2023

That’s meant to be modulo 2^(number of bits in uint), i.e. 2^(CHAR_BIT * sizeof(uint)), not sizeof(uint).

solidsnack9000 · on April 2, 2023

"Looking forward, marijuana legalization will surely beget notions such as fractional-, imaginary-, and negative-length objects, each with as much potential for mayhem as zero-length objects."

It's a funny thing to say.

firstlink · on April 2, 2023

Rust seems to do fine with ZSTs somehow.
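A small sketch of what that looks like in practice; `Marker` is a made-up unit struct, and the only claims are the ones the assertions check:

```rust
use std::mem::size_of;

/// Hypothetical zero-sized type: a struct with no fields.
struct Marker;

/// Pushes `n` zero-sized values into a vector and reports its length.
fn count_markers(n: usize) -> usize {
    let mut v: Vec<Marker> = Vec::new();
    for _ in 0..n {
        v.push(Marker);
    }
    v.len()
}

fn main() {
    // The unit type, a field-less struct and a zero-length array
    // all occupy zero bytes.
    assert_eq!(size_of::<()>(), 0);
    assert_eq!(size_of::<Marker>(), 0);
    assert_eq!(size_of::<[u64; 0]>(), 0);

    // Collections of zero-sized values still track their length.
    assert_eq!(count_markers(1000), 1000);
}
```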