
Zig, a language which is explicitly aimed at the same domain as C, has improved semantics for all of these things.

If a pointer can be null, it must be an optional pointer, and you must in fact check before you dereference it. This is what you want. Is it ok to write a program which segfaults at random because you didn't check for a pointer which can be null? Of course not. If you don't null-check the return value of e.g. malloc, your program is invalid.

But the benefit is in the other direction. Careful C code checks for null before using a pointer, and keeping track of whether null has been checked is a manual process. This results in redundant null checks whenever you can't statically prove (by staring at the code and thinking very hard) that a pointer isn't null. So in practice you're likely to get a combination of not checking and getting burned, and checking a pointer which was already checked. To do otherwise you'd have to understand the complete call graph, which is infeasible.

Zig doesn't do any of this. If it's a pointer, you can safely dereference it. If it's an optional pointer, you must check, and then: it's a pointer. Safe to pass down the call stack and freely use. If you want C behavior you can always YOLO and just say `yoloptr.?.*`.
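
Roughly, in Zig (a minimal sketch; the names are just for illustration):

    const std = @import("std");

    // Takes an optional pointer; it must be unwrapped before use.
    fn firstByte(maybe: ?*const u8) u8 {
        if (maybe) |p| {
            // Inside this branch `p` is a plain `*const u8`; dereferencing
            // needs no further checks, here or further down the call stack.
            return p.*;
        }
        return 0;
    }

    pub fn main() void {
        const x: u8 = 42;
        std.debug.print("{} {}\n", .{ firstByte(&x), firstByte(null) });
        // The YOLO version, `maybe.?.*`, unwraps without checking; if the
        // optional is null that's safety-checked UB (panic in Debug/ReleaseSafe).
    }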

Overflowing addition and division by zero are safety-checked undefined behavior, a critical concept in the specification. They will panic with a stack trace in Debug and ReleaseSafe modes, and blow demons out of your nose in ReleaseFast and ReleaseSmall modes. There's also +% for guaranteed wraparound two's-complement overflow, and +| for saturating addition. Also `@addWithOverflow` if your jam is checking the overflow bit. Unwrapping an optional without checking it is also safety-checked UB: if you were wrong about the assumption that the payload carries a value, you'll get a panic and stack trace on the line where you did `yolo.?`.
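
A quick sketch of the flavors (assuming a recent Zig where `@addWithOverflow` returns a result/overflow-bit tuple):

    const std = @import("std");

    pub fn main() void {
        const a: u8 = 250;
        const b: u8 = 10;
        // A plain `a + b` that overflows is safety-checked UB: panic plus
        // stack trace in Debug/ReleaseSafe, nasal demons in ReleaseFast/Small.
        const wrapped = a +% b; // guaranteed two's-complement wrap: 4
        const clamped = a +| b; // saturating addition: 255
        const sum = @addWithOverflow(a, b); // .{ result, overflow bit }
        std.debug.print("{} {} {} {}\n", .{ wrapped, clamped, sum[0], sum[1] });
    }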

Shift operations require that the right-hand side of the shift be an integer type of log2(bit width of the left-hand side) bits. Zig allows integers of any width, so for a: u64, a << b requires that b be a u6 or smaller. Which is fine: if you know values will be within 0..63, you declare them u6, and if your shift amount arrives as a byte, you truncate it: you were going to mask it anyway, right? Zig simply refuses to let you forget this. Addition of two u6 is just as fast as addition of the underlying bytes because of, you guessed it, safety-checked undefined behavior. In release mode it will just do what the chip does.
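
For example (using the single-argument `@truncate` of recent Zig versions):

    const std = @import("std");

    // For a u64 left-hand side, the shift amount must fit in a u6 (0..63).
    fn shiftLeft(a: u64, b: u8) u64 {
        const amount: u6 = @truncate(b); // keep the low 6 bits, explicitly
        return a << amount;
    }

    test "shift by a truncated byte" {
        try std.testing.expectEqual(@as(u64, 4096), shiftLeft(1, 12));
    }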

There's a common theme here: some things require undefined behavior for performance. Zig does what it can to crash your program if that behavior is exhibited while you're developing it. Other things require that you take some well-defined actions or you'll get UB: Zig tracks those in the type system.

You'll note that undefined behavior is very much a part of the Zig specification, for the same reasons as in C. But that's not a great excuse to make staying within the boundaries of defined behavior as pointlessly difficult as it is in C.




Yes, you can surely improve things from C. C is not a benchmark for anything other than footguns per line of code.

The debug modes you mention are also available in various forms in C and C++ compilers. For example, ASan and UBSan in clang will do exactly what you have described. The question, then, is whether these belong in the language specification or are left to individual tools.


As proven multiple times throughout computing history, individual tools are optional, and as such are used less often than they should be.

Language specification is unavoidable when using said language.


Have you wondered why Rust or Python do not have a specification?

For a bunch of languages outside the C-centric world, specifications don't exist.



Documentation and specification are not the same things.

The intuitive distinction is that the latter is for compiler/library developers, and the former is for users.

A specification cannot leave any room for ambiguity or leave anything up to interpretation. If it does (and this happens), it is treated as a bug to be fixed.


mwahahaha. as if there is some divine "language specification" which all compilers adhere to on pain of eternal damnation.

no such thing ever existed.


Given that one can write Fortran in any language, maybe you're right.


It's not just in debug modes. It should be the standard in release mode as well (IMO the distinction shouldn't exist for most projects anyway). ASan and UBSan are explicitly not designed for that.


Worth noting that Zig has ReleaseSafe, which safety-checks undefined behavior while applying any optimizations it can given that restriction.

The more interesting part is that the mode can be individually modified on a per-block basis with the @setRuntimeSafety builtin, so it's practical to identify the performance-critical parts of the program and turn off safety checks only for them. Or the opposite: identify tricky code which is doing something complex, and turn on runtime safety there, regardless of the build mode.
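
Something like this (the functions are made up for illustration):

    const std = @import("std");

    // Hot path: opt out of runtime safety (overflow and bounds checks)
    // for this scope only, even in Debug/ReleaseSafe builds.
    fn sumFast(values: []const u32) u32 {
        @setRuntimeSafety(false);
        var total: u32 = 0;
        for (values) |v| total += v;
        return total;
    }

    // Tricky path: force the checks back on, even in ReleaseFast.
    fn nthFromEnd(values: []const u32, n: usize) u32 {
        @setRuntimeSafety(true);
        return values[values.len - n]; // underflow and bounds stay checked here
    }

    pub fn main() void {
        const data = [_]u32{ 10, 20, 30 };
        std.debug.print("{} {}\n", .{ sumFast(&data), nthFromEnd(&data, 1) });
    }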

That's why this sort of thing should be part of the specification. @setRuntimeSafety would be meaningless without the concept of safety-checked undefined behavior.

I would say that making optionals and fat pointers (slices) a part of the type system is possibly more important, but it all combines to give a fighting chance of getting user-controlled resource management correct.
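
On the slice side, the length travels with the pointer, so bounds checks never rely on a separately tracked size (a tiny sketch):

    const std = @import("std");

    pub fn main() void {
        const buf = [_]u32{ 1, 2, 3, 4, 5 };
        const window: []const u32 = buf[1..4]; // fat pointer: ptr + len together
        std.debug.print("len={} first={}\n", .{ window.len, window[0] });
        // Indexing past window.len is safety-checked UB: a panic with a
        // stack trace in Debug/ReleaseSafe rather than a silent overrun.
    }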

Given the topic of the Fine Article, it's worth briefly noting that `defer` and `errdefer` are keywords in Zig. Both the test allocator and the GeneralPurposeAllocator in safe mode will panic if you leak memory by forgetting to use these, or, more generally, by forgetting to free allocations. My impression is that the only major category of memory bugs these tools won't catch in development is double-free, and that's being worked on.
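
A minimal sketch of that workflow (the helper function is hypothetical; the leak detection comes from std.testing.allocator):

    const std = @import("std");

    // If a later step fails, errdefer releases the buffer;
    // on success the caller owns it.
    fn makeBuffer(allocator: std.mem.Allocator, n: usize) ![]u8 {
        const buf = try allocator.alloc(u8, n);
        errdefer allocator.free(buf); // runs only on error paths after this line
        // ... fill the buffer, possibly failing ...
        return buf;
    }

    test "no leaks" {
        const buf = try makeBuffer(std.testing.allocator, 64);
        defer std.testing.allocator.free(buf); // drop this and the test reports a leak
        try std.testing.expect(buf.len == 64);
    }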


Well, give it a try.

If you can make it work in a way that has acceptable performance characteristics, every systems language will adopt your technique overnight.


I use Rust, which already does this.


Signed overflow is officially a 'bug' in Rust: it traps in debug mode but silently follows LLVM/platform behavior in release mode.

Huh, doesn't that sound familiar?


> silently follows LLVM/platform behavior

This is not the case. It's two's complement overflow.

Also, since we're being pedantic here: it's not actually about "debug mode" or "release mode", it is tied to a flag, and compilers must have that flag on in debug mode. This leaves open the ability to turn the flag on in release mode as well in the future, if it's decided that the overhead is worth it. We'll see if it ever is.

> Huh, doesn't that sound familiar?

Nope, it is completely different from undefined behavior, which gives the compiler license to do anything it wants. These are well defined semantics, the polar opposite of UB.


> This is not the case. It's two's complement overflow.

Okay, here is an example showing that Rust follows LLVM behavior when the optimizer is turned on. LLVM addition produces poison when signed wrap happens. I'm a little bit puzzled about the vehement responses in the comments wow. I have worked on several compilers (including a few patches to Rust), and this is all common knowledge.

https://godbolt.org/z/r6WTxGjrb


The Rust output:

  define noundef i32 @square(i32 noundef %x, i32 noundef %y) unnamed_addr #0 !dbg !7 {
    %_0 = add i32 %y, %x, !dbg !12
    ret i32 %_0, !dbg !13
  }

Let's compare like to like; here's one with equivalent C++ code: https://godbolt.org/z/Y4MnGeof4

The C++ output:

  define dso_local noundef i32 @square(int, int)(i32 noundef %0, i32 noundef %1) local_unnamed_addr #0 !dbg !99 {
    tail call void @llvm.dbg.value(metadata i32 %0, metadata !104, metadata !DIExpression()), !dbg !106
    tail call void @llvm.dbg.value(metadata i32 %1, metadata !105, metadata !DIExpression()), !dbg !106
    %3 = add nsw i32 %1, %0, !dbg !107
    ret i32 %3, !dbg !108
  }

> LLVM addition produces poison when signed wrap happens.

https://llvm.org/docs/LangRef.html#add-instruction

> nuw and nsw stand for “No Unsigned Wrap” and “No Signed Wrap”, respectively. If the nuw and/or nsw keywords are present, the result value of the add is a poison value if unsigned and/or signed overflow, respectively, occurs.

Note that Rust produces `add`. The C++ produces `add nsw`. No poison in Rust, poison in C++.

Here is an example of these differences producing different results, due to the differences in behavior: https://godbolt.org/z/Gaonnc985

Rust:

  define noundef zeroext i1 @test() unnamed_addr #0 !dbg !14 {
    ret i1 true, !dbg !15
  }

C++:

  define dso_local noundef zeroext i1 @test()() local_unnamed_addr #0 !dbg !123 {
    tail call void @llvm.dbg.value(metadata i32 undef, metadata !128, metadata !DIExpression()), !dbg !129
    ret i1 false, !dbg !130
  }

This is because in Rust, the wrapping behavior means that this will always be true, but in C++, because it is UB, the compiler assumes it will always be false.

> I'm a little bit puzzled about the vehement responses in the comments wow.

You are claiming that Rust has semantics that it was very, very deliberately designed to not have.


Rust includes a great deal of undefined behavior, unlocked with the trustme keyword. Ahem, sorry, unsafe. If only...

So if we're going to be pedantic, it's safe Rust which has defined semantics for basically everything. A considerable accomplishment, to be sure.


While this is true, we’re talking about integer overflow. That’s part of safe Rust. So it’s not really germane to this conversation.


Even languages like Modula-2 and Ada, among others, had better semantics than C, but they didn't come for free alongside UNIX.


I know nothing about Zig, but this is pretty interesting and looks well designed. Linus was recently very mad when someone suggested a new semantics for overflow:

—— I'm still entirely unconvinced.

The thing is, wrap-around is not only well-defined, it's common, and EXPECTED.

Example:

   static inline u32 __hash_32_generic(u32 val)
   {
        return val * GOLDEN_RATIO_32;
   }
and dammit, I absolutely DO NOT THINK we should annotate this as some kind of "special multiply". ——

Full thread: https://lore.kernel.org/lkml/CAHk-=wi5YPwWA8f5RAf_Hi8iL0NhGJ...


> The thing is, wrap-around is not only well-defined, it's common, and EXPECTED.

No, it's really not. Do this experiment: for the next ten thousand lines of code you write, every time you do an integer arithmetic operation, ask yourself if the code would be correct if it wrapped around. I would be shocked if the answer was "yes" as much as 1% of the time.

(The most recent arithmetic expression I wrote was summing up statistics counters. Wraparound is most definitely not correct in that scenario! Actually, I suspect saturation behavior would be more often correct than wraparound behavior.)

This is a case where I think Linus is 100% wrong. Integer overflow is frequently a problem, and demanding the compiler only check for it in cases where it's wrong amounts to demanding the compiler read the programmer's mind (which goes about as well as you'd expect). Taint tracking is also not a viable solution, as anyone who has implemented taint tracking for overflow checks is well aware.


It depends heavily on context.

For the kernel, which deals with a lot of device drivers, ring buffers, and hashes, wraparound is often what you want. The same is likely to be true for things like microcontroller firmware and such.

In data analysis or monte carlo simulations, it's very rarely what you want, indeed.


Is it really?

For example, I opened up https://elixir.bootlin.com/linux/latest/source/drivers/firew... as a random source file in the Linux kernel, and I didn't see a single line where wraparound would be correct behavior.

There are definitely cases where wraparound behavior is correct. There are also cases where hard errors on overflow aren't desirable (say, statistics counters), but it's still hard to call wraparound the correct behavior (e.g., saturation would probably work better for statistics than wraparound). There are also cases where you could probably prove that overflow can't happen. But if you made the default behavior a squawk that wraparound occurred, and instead made developers annotate all the cases where wrapping was desirable to silence the squawk, I'd suspect that even in the entire Linux kernel you'd end up with fewer than 1000 places.

This is sort of the point of the exercise: wraparound behavior is often what you want when you think about overflow, but you spend so much of your time not thinking about it that you miss how frequently wraparound behavior isn't what you wanted.


I think wraparound generally is better for statistics counters like the ones in the linked code, since often you want to check the number of packets/errors per some time interval, which you can do with wraparound (as long as the counter doesn't advance by more than its full range within one interval) but not with saturation.
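
A sketch of that, in Zig syntax, with the wrapping operator spelled out:

    // A free-running wrapping counter still yields correct per-interval
    // deltas, as long as it advances by less than its full range per sample.
    fn packetsSince(previous: u32, current: u32) u32 {
        return current -% previous; // wrapping subtraction
    }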


I think it's critical that we do annotate it as a special multiply.

If wraparound is ok for that particular multiplication, tell the compiler that. As a sibling comment says, this is seldom the case, but it does happen: in particular, expecting byte addition or multiplication to wrap around can be useful.

The actual expectation of the vast majority of arithmetic in a computer program is that the result will be correct in the ordinary schoolyard sense. While developing that program, it should absolutely panic if that isn't the case. "Well defined" doesn't mean correct.

I don't understand your objection to spelling that as `val *% GOLDEN_RATIO_32`. When someone sees that (especially you, later, coming back to your own code) it clearly indicates that wrapping is expected, or at least allowed. That's good.
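
In Zig spelling, the hash from the quoted kernel snippet would look something like this (assuming the kernel's 0x61C88647 constant):

    const GOLDEN_RATIO_32: u32 = 0x61C88647;

    // The wrap is spelled out, so a reader knows it's intended rather than
    // an overflow bug waiting for a checked build to catch it.
    inline fn hash32Generic(val: u32) u32 {
        return val *% GOLDEN_RATIO_32;
    }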


Unsigned integer overflow is not undefined in C or C++. You can rely on how it works.

Signed integer overflow, on the other hand, is undefined. The compiler is allowed to assume it never happens and can re-arrange or eliminate code as it sees fit under that assumption.

How many lines will this code print?

    for (int i = INT_MAX-1; i > 0; ++i) printf("I'm in danger!\n");



