Re: C as used/implemented in practice

nathanb · on July 13, 2015

I have difficulty accepting "let's replace C with X", where X is a memory-managed language. As a systems programmer (I write SCSI driver code in C), I can't overemphasize how important it is to be able to address memory as a flat range of bytes, regardless of how that memory was originally handed to me. I need to have uint8_t* pointers into the middle of buffers which I can then typecast into byte-aligned structs. If your memory manager would not allow this or would move this memory around, that's a non-starter.

I don't stick with C because I love it. If I'm writing something for my own purposes, I use Ruby. I've written some server code in Golang (non-production), and it's pretty nifty, even if the way it twists normal C syntax breaks my brain. I even dabble in the dark side (C++) personally and professionally from time to time. And in a previous life, I was reasonably proficient in C# (that's the CLR 2.0 timeframe; I'm completely useless at it in these crazy days of LINQ and the really nifty CLR 4 features...and there's probably even more stuff I haven't even become aware of).

But none of those languages would let me do what I need to do: zero-copy writes from the network driver through to the RAID backend. And even if they did, the pain of rewriting our entire operating system in Go or Rust or whatever would be way more than the alleviated pain of using a "nicer" language.

(We never use 'int', by the way. We use the C99 well-defined types in stdint.h. Could this value go greater than a uint32_t can represent? Make it a uint64_t. Does it need to be signed? No? Make sure it's unsigned. A lot of what he's complaining about is sloppy code. I don't care if your compiler isn't efficient when compiling sloppy code.)

daeken · on July 13, 2015

Having written a lot of driver/kernel code and also dabbled in Rust, I have a hard time seeing anything that Rust can't do in this regard. All of the memory manipulation you will typically do is doable in Rust, though you will certainly lose some of the safety if you're bouncing things between types. The only real limitation you'll have is the inability (AFAIK) to do inline assembly.

There'd be no need to rewrite anything to work with Rust; the binaries emitted by the compiler should be just fine, assuming ABI compatibility. Maybe some changes to the way things are linked?

foogered · on July 14, 2015

You should be able to use inline assembly in rust-nightly: https://doc.rust-lang.org/nightly/book/inline-assembly.html

daeken · on July 14, 2015

Awesome, thanks. I knew it was missing last time I checked, but I should never assume with such active development going on!

spoiler · on July 14, 2015

cc: @foogered

This is what's been so off-putting about Rust, to be honest.

"I should never assume" is precisely what bothered me.

I loved Rust at first, but it required so much maintenance with its constant changes between versions, that I stated to dislike it. I didn't have much time (and I enjoy learning new languages), but when I had to virtually unlearn something just learned the previous day, it became frustrating after a few iterations.

A colleague at work once sneered at me when I suggested I could do my next [internal] project in Rust; I didn't end up using Rust, because I got self-conscious about the idea, and kinda got scared that I'll have to do much, much more work simply because completely valid code would not work in a few days. That feeling is kinda stuck with me to this very day, even though Rust wasn't a stable release back then, and now is.

Not to mention, there's people I've talked to about Rust who still feel like Rust isn't mature and never will be—simply because of it's "reputation."

[1]: They were referring to how dramatically and chaotically it changed before the first stable release.

pcwalton · on July 14, 2015

What would you preferred Rust have done to avoid this "reputation"?

I mean, as far as I can see, the options were "develop Rust in secret" or "make a language much less suited to this domain".

mcguire · on July 14, 2015

It's kinda bad, but the other options are worse. What're ya gonna do?

(Now, could we do something about the fair-sized chunks of the standard library that are marked "unstable"? I'd be happy with a compiler flag to turn that error into a warning....)

heinrich5991 · on July 14, 2015

There is an option, use the nightly compiler.

plesner · on July 14, 2015

I wonder how many of the features in Rust that have changed were designed with that possibility in mind? And if not, whether it would have made any difference if they had been?

steveklabnik · on July 14, 2015

As a very drastic example, the runtime was only actually removed six months ago. If we still had it, our 'embedding Rust in other languages' story would be significantly worse, and the kinds of things being discussed in this thread wouldn't have been possible.

pcwalton · on July 14, 2015

The goals of Rust have not really changed since its development.

The fact is that (in particular) a region and lifetime system are hard. We were doing research. Research usually requires many attempts to solve a problem.

plesner · on July 14, 2015

That's exactly why I ask.

It feels like languages often struggle with compatibility issues down the road because they assume that they're making all the right decisions from day one. Since you guys are aware that getting this stuff right takes many attempts maybe it would factor in the design process of a new Rust feature that it might eventually be replaced? I can understand if it isn't though, it's another constraint on problems that are already very hard.

Jweb_Guru · on July 16, 2015

Many times during Rust's design process, I saw quotes like this:

"Rust is not the first systems language and will not be the last."

and

"Rust is a language designed for the hardware of today, not that of ten years from now."

Rust is just a stepping stone on the path to better systems languages. It's not the greatest language ever and it made mistakes, but it gets enough right to make it a very compelling option in the systems programming space.

klibertp · on July 14, 2015

Why would you expect stability from software in early alphas is beyond me. If you want to be an early adopter you need to take the maintenance problems into account. And if you just wish to use a somewhat completed language you should have just waited for a final release.

PLs are complicated beasts and developing them takes time. It's normal for a language to change drastically during development: frequently it turns out that some feature doesn't fit well with some other, and you need to scratch the feature or redesign both features.

geofft · on July 14, 2015

For what it's worth, inline assembly has been available in Rust ~forever, it's just not in Rust 1.0, the stable release, so a lot of people missed the existence features like this.

On the flip side, Rust 1.0 doesn't have any of the problems you describe. Rust 1.0 code will work just fine in Rust 1.300 when it comes out in 2050, provided that the language is still around. And inline assembly will end up in a stable release as soon as that promise can be made about it.

mathetic · on July 14, 2015

>> They were referring to how dramatically and chaotically it changed before the first stable release.

Exactly.

steveklabnik · on July 14, 2015

Inline assembly landed in Rust 0.6, April 2013, so if you haven't checked it out since then, there's a _lot_ of new/different stuff :)

We've been discussing stabilizing these kinds of low-level features: no_std is first up. We're not 100% sure inline asm is worth it, as you can always link directly to something that's just written in assembly, but we'll see.

daeken · on July 14, 2015

I guess my mental estimate of last having looked "about 6 months ago" was off by a little bit. I think this weekend might be dedicated to writing a Rust kernel. Thanks for your hard work!

steveklabnik · on July 14, 2015

No problem. Like I said, we're very interested in making this use-case better, so please file any odd things that crop up. Oh, and there's #rust-osdev on mozilla's IRC where friendly people idle.

mattst88 · on July 14, 2015

I suspect that the overwhelming majority of cases that warrant inline assembly would not benefit if it was instead a function call out to something you've linked in. That is, most inline assembly is a small snippet of optimized code that necessarily exists in the function you're optimizing.

steveklabnik · on July 14, 2015

Yeah, as always, when we're considering stabilizing or removing something, we talk to who is using the feature and in what way, and these kinds of issues surface.

pjmlp · on July 14, 2015

That is why naked and inline exist for.

daeken · on July 14, 2015

Inline from C libraries won't -- as far as I'm aware -- inline into a Rust function. Inline is a compile-time hint, not a link-time behavior. In theory this could be changed in LLVM, but it seems really unlikely to me. The function exposed from C (with the inline asm) can be naked, though, assuming you are hygienic.

pjmlp · on July 15, 2015

Yeah, but that is a consequence of not having a Rust aware linker and relying on one that basically uses C semantics.

Languages that use their own linkers tend to have better opportunities for code optimization during linking.

vardump · on July 14, 2015

Hmm... Because 32-bit Windows IRQ handler doesn't save FPU/SIMD state, Rust also needs to have ability to suppress any FPU/SIMD output, when generating code to run under 32-bit Windows kernel at DIRQL (= interrupt handler). Otherwise the mayhem in userland will be rather interesting, usermode thread SIMD state gets corrupted.

It'd definitely be tricky to use Rust for kernel drivers, but still so tempting!

pcwalton · on July 14, 2015

You can use the -C switches in rustc to enable and disable hardware features such as FP and SIMD.

daeken · on July 14, 2015

Hm, I bet there's a flag to disable SIMD entirely (not really important here), and maybe even enable softfp if you use any floats in your code (or libraries, by extension). That'd be interesting to hack in, if it's not already there.

rkangel · on July 14, 2015

I've written C almost every day at work for ~7 years, and I have maybe written the keyword 'float' half a dozen times.

I do embedded and systems development - you never need floating point in that sort of environment. It's only when you get into application domains that you need it.

stephencanon · on July 14, 2015

There are a number of architectures where you leave half of the performance of little things like memset and memcpy on the table without SIMD.

as a footnote: everytime in my career that someone has said "you never need floating point in X [kernel/network stack/some embedded thing/boot/whatever]," they've come back within a year to ask for my help with floating point library routines for X. So now you've cursed yourself =)

com2kid · on July 14, 2015

As an embedded developer let me say that the Cortex M4s with FPUs are worth their weight in gold for saving me from debugging fixed point libraries.

GFK_of_xmaspast · on July 15, 2015

I never ever ever want to write fixed point numerics again.

monocasa · on July 14, 2015

Last time I checked, Rust wasn't great for intrusive data structures, as well as non-tree graphs. Is this better now?

aidenn0 · on July 14, 2015

I'm not sure what you mean by "intrusive data structures" but as far as non-tree graphs, it's only as bad as any other non-managed runtime.

monocasa · on July 14, 2015

By intrusive data structures, I mean data structures where the per node metadata is stored within the struct of the data rather than externally. Linux's list_head is a good example of this. Having one less pointer indirection can make it a bit more cache friendly.

And yes, Rust can handle non tree graphs as well as any other unmanaged languages, but the heavy use of 'unsafe' makes me feel icky. : )

pcwalton · on July 14, 2015

You might be interested in this library: https://github.com/dschatzberg/intrusive

to3m · on July 14, 2015

I switched over to using size_t for array indexes a few years ago, and found this a big improvement. And once I switched to size_t, most of my uses of unsigned went away - either you care about the size (and you want a uintN_t), or you don't (and you probably want size_t).

lsiebert · on July 14, 2015

I think this works best if you pass a pointer to an error structure in the parameters. I've found people often use signed integer types with arrays so they can return a -1 to signal that something went wrong.

pjmlp · on July 14, 2015

> But none of those languages would let me do what I need to do: zero-copy writes from the network driver through to the RAID backend.

Oberon, Modula-3 and D allow for it via their SYSTEM/Unsafe/@system modules, but the two former ones failed to get a dent into the OS market (for various reasons) and D still has some improvements to their memory model going on.

Also Ada and SPARK are usually the languages to reach for in life critical systems.

Also lets not forget before C became widespread outside UNIX, Modula-2 and Pascal dialects were saner alternatives.

pcwalton · on July 14, 2015

> But none of those languages would let me do what I need to do: zero-copy writes from the network driver through to the RAID backend.

Why not?

aidenn0 · on July 14, 2015

Are you on a 64-bit target? Then a uint32_t gets converted to a 64-bit signed value before any arithmetic operator is applied. There are some real pitfalls here, though they are often exaggerated.

Do you turn off the TBAA in your compiler? In my experience most systems programmers either turn it off, or don't know the rules.

[edit] I forgot int is typically 32-bits on 64-bit targets. The same argument would still apply for uint16_t and smaller though.

stephencanon · on July 14, 2015

> uint32_t gets converted to a 64-bit signed value before any arithmetic operator is applied.

Only if int is > 32 bits on your platform, which is quite rare these days.

dbaupp · on July 14, 2015

Aren't numeric promotions only applied in mixed expressions? (However, I'm not a C language-lawyer, so I could very easily be incorrect.)

aidenn0 · on July 14, 2015

All smaller than int types are promoted to int before calculations are done. I was wrong that uint32_t is smaller than int though.

For an actual example:

  uint32_t combineTwoUint16(uint16_t x, uint16_t y)
  {
    return x<<16 | y;
  }

If int is larger than 16 bits, then this is technically undefined behavior in the case that x is greater than 2^15 as it's signed integer overflow.

vtsrh · on July 14, 2015

Well the solution is simple:

  return (x+0U)<<16 | (y+0U);

Would it be insane to have a inline library for every arithmetic operation, that would handle such cases and offer addition optional functionality?

lmm · on July 14, 2015

The intersection of people who are willing to rewrite all their arithmetic to use such a library with people who are not willing to switch to a non-C language is rather small.

vtsrh · on July 14, 2015

Keep the old code, instead just start using it for new one.

lmm · on July 14, 2015

Most projects still written in C are those that make extensive use of C libraries. Making the application code immune doesn't actually reduce the vulnerability surface much - much of the vulnerability comes in the libraries the application calls.

vtsrh · on July 14, 2015

Use it for new code. Libraries are written too.

lmm · on July 14, 2015

Libraries are usually older than the applications that use them. New libraries can be, and often are, written in new languages.

vtsrh · on July 14, 2015

https://news.ycombinator.com/item?id=9885478

Looks like you hit an infinite loop. Better luck next time.

pcwalton · on July 14, 2015

No, sadly. In some cases (floating point) what happens is even implementation-dependent (though queryable with FLT_EVAL_METHOD).

cremno · on July 14, 2015

What pitfalls do you mean (assuming 32-bit wide int and uint32_t is unsigned int as both are very common)?

_kst_ · on July 14, 2015

uint32_t is promoted to a signed 64-bit value only if int is 64 bits. Even on 64-bit systems, int is typically 32 bits.

Does TBAA mean type-based alias analysis?

https://en.wikipedia.org/wiki/Alias_analysis#Type-based_alia...

(EDIT: I keep forgetting HN doesn't support markdown.)

tokenrove · on July 14, 2015

Most languages with garbage collectors also have mechanisms for working with memory that doesn't get touched by the collector, to support FFI.

Also, consider a language like Ada for driver work.

zamalek · on July 14, 2015

> move this memory around

A lot of people seem to assume that Chris (the author) was talking about managed memory, which he never mentioned once. Managed memory is runtime safety, a type system is compile-time safety. He's complaining about the type system. As an example:

> address memory as a flat range of bytes [...] I can then typecast into byte-aligned structs

You should never have to do that. You shouldn't be able to do that. Your job should be far simpler. Look at unique_ptr: a whole class of bugs are eliminated by this ZERO cost abstraction. Possibly what Chris is advocating is being able to describe what an I/O port is to the compiler and then using that abstraction to write your SCSI driver. This intent should be compiled down to as-good machine code (if not better) than what your C compiler would have given you - in the same way that unique_ptr is compiled.

I don't think any existing language gets this right.

albinofrenchy · on July 14, 2015

> You should never have to do that. You shouldn't be able to do that.

Someone, at somepoint, has to do this though. Custom memory allocators are more or less predicated on having a byte buffer you chop up and use like this.

I think the best we get -- especially in driver code -- is well thought out design that have low cost abstractions between the device details and the application logic. But that seems like a library detail more than a compiler or language one.

sharpneli · on July 14, 2015

> I don't think any existing language gets this right.

Which is precisely the reason C is still used.

Until we have a language that produces at least as good results as C and is safer we're not going to see any change in this area.

zamalek · on July 14, 2015

Agreed. I was pointing out that my interpretation of the email was a hypothetical tone, instead of a factual tone. As far as I'm concerned C isn't good enough, but I generally keep my mouth shut about it because I can't offer anything constructive to the discussion: I don't know what a "good C" would look like.

sklogic · on July 14, 2015

> I need to have uint8_t* pointers into the middle of buffers which I can then typecast into byte-aligned structs.

Quite a lot of CPUs would just trap here. Assuming that unaligned access is allowed is a sin.

bluetomcat · on July 14, 2015

> Quite a lot of CPUs would just trap here

Or even worse -- for example, ARM CPUs usually round-down the misaligned address to the closest boundary when alignment checking is disabled. This means that attempting to access a 4-byte int at location 11 will silently let the CPU access it at location 8. This can manifest in some very nasty bugs.

stephencanon · on July 14, 2015

IIRC this behavior went away in ARMv6.

Nursie · on July 14, 2015

Really?

Because this is very common practice in device or network code, in my experience.

sklogic · on July 14, 2015

Yes, it is a very common crap indeed. I had a lot of pain porting some Linux filesystems to Sparc, for example (had to give up back then, that code was beyond any hope).

Not to mention the endianess issues, an often sight in the crappy networking code.

spoiler · on July 14, 2015

I agree, and I'd like to add that its not just this particular author, but most people who criticise C about it's "insecurities" use sloppy code when they criticise C, which always bothers me. I'm far from being a C fan (I'm also a Ruby fanboy), but programming languages aren't safe, only code can be safe, and that depends entirely on the developer.

Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever), but that's because whoever wrote that code with the bug didn't know C well enough. Is that C's fault? Same can be said about any language, really.

If you don't know that String#match returns nil on unsuccessful matches and try to call MatchData#[], you'll get a NPE (something along the lines of "undefined method `[]' for NilClass"). This is very similar to dereferencing a NULL pointer in C[1].

[1]: I know dereferencing a NULL pointer in C is undefined behaviour, but your program will crash—if you're lucky enough—when you try to work with NULL pointers when you don't expect them.

dbaupp · on July 14, 2015

This is nonsense. C has a very weak type system and very weak runtime guarantees, making it much easier to introduce problems with no indication that something's up. Other languages with strong type systems and/or stronger runtime checks eliminate large classes of bugs that are very easy to trigger in C.

So, yes, it is C's "fault" that it doesn't protect against classes of bugs that many other languages do. Sure, those languages have some of the same bugs that C does, but they're missing most of the very worst ones and that's really powerful. For example, a garbage collector protects against accessing dangling pointers: it's just not something the programmer has to worry about at all.

Rejecting cricitisms of C's safety inadequacies with "just code better"/"just learn the language better" doesn't work in practice: there have been too many high-profile vulnerabilities in C software, many of which would've been much harder to trigger in other languages.

cremno · on July 14, 2015

>C has a very weak type system and very weak runtime guarantees, making it much easier to introduce problems with no indication that something's up.

Here's an interesting example I've stumbled upon a few weeks ago:

https://stackoverflow.com/questions/31037149/type-safety-for...

float _Complex -> float doesn't require any diagnostic even though the imaginary component is (silently) discarded. Clang has one (not enabled by default or -Wall/-Wextra) however current GCC versions haven't.

asveikau · on July 14, 2015

I'm sure both people using C99 support for complex numbers are really bothered by this.

stephencanon · on July 14, 2015

Nope, never been a problem. Can't speak for the other guy.

FreeFull · on July 14, 2015

Never been a problem for me either.

nathanb · on July 16, 2015

I think you're discounting the fact that the software with the bugs was written in C precisely because C was a better fit than the other languages which offered greater protection.

I'm not saying (and have never said) that C is better than other languages for every problem. I'm just saying that I read too many articles about the "problems" with C that Stupid C Programmers must not realize since we keep using it where the author of the article hasn't bothered to understand why programs are written in C in the first place.

freyr · on July 14, 2015

Since the author of the post is also an author of LLVM, clang, and Swift, and the director of developer tools at Apple, he certainly understands the difference between sloppy and non-sloppy code, but he also knows that sloppy code is a reality and undefined behavior is dangerous.

So we can make everyone a better programmer, or we can make better languages, or we can throw in the towel and say things are good enough. I think he's suggesting the correct path.

pcwalton · on July 14, 2015

The undefined behavior of C and C++ results in remote code execution to a degree completely unmatched by other languages.

Examples of dangerous UB in C always use deliberately sloppy code for pedagogical reasons. For real-world examples of problems, look at the CVE database.

mcguire · on July 14, 2015

How much of this is due to problems with C or C++, and how much is simply due to a lack of other languages for systems programming?

And yes, a major complaint of mine is using C or C++ when you don't absolutely have to.

pjmlp · on July 14, 2015

> How much of this is due to problems with C or C++, and how much is simply due to a lack of other languages for systems programming?

You mean lack of Algol, Mesa, Modula-2, Modula-3, Ada, Oberon, Oberon-2, Component Pascal, Object Pascal, Delphi, .... ?

C became ubiquous when UNIX got adopted as the enterprise OS, slowly replacing the mainframes.

Like JavaScript in browsers or Objective-C in iOS, targeting UNIX meant using C, back in the day companies would pay for UNIX SDKs.

No sane company would buy another compiler when they already had to pay for a C one, the official language.

The other OS vendors started to offer C compilers in their tools to ease the transition between the new UNIX workstations and their other OS.

So alternative saner systems programming languages wither way as C grew stronger and now we are with this problem.

mcguire · on July 14, 2015

"Lack of other languages" was poorly worded. I'm sorry.

What I meant was that, after the spread of Unix and C, C is the most common systems language in production. (Before the spread of C, I believe assembly was the most common systems language(s).)

I suspect that any systems language would have a bad reputation if it were used (and over-used) as much as C.

ufo · on July 14, 2015

Lots of languages around the time C came out had array bounds checks at runtime. C has a hard time doing that because of pointed and arrays being (mostly) interchangeable.

geofft · on July 14, 2015

> Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever), but that's because whoever wrote that code with the bug didn't know C well enough. Is that C's fault? Same can be said about any language, really.

I thought the promise of computers is that we didn't need to have smart people working on repetitive, boring, error-prone jobs.

And yes, there's a line between languages like C, Ruby, Python, Objective-C, etc. on one hand that don't actively try to make bugs hard, and Ada, Rust, Haskell, Ur, etc. on the other. That line is not particularly lined up with something like interpreted vs. compiled or old vs. new, and if you look for the line there, you won't see it.

vardump · on July 14, 2015

I don't know anyone who can write proper, secure, bug free C code in multithreaded environment. Some people, such as DJB [1] do get pretty close, though.

Much less those who can do same with C++.

But when you have larger teams, it gets even harder. People just think so differently and misunderstand intentions without realizing.

I did think I could do that in my twenties. 15+ years later I have a lot more respect for C.

[1]: http://cr.yp.to/qmail/guarantee.html

MaulingMonkey · on July 14, 2015

> most people who criticise C about it's "insecurities" use sloppy code when they criticise C, which always bothers me

If you're really really lucky, your coworkers will only write sloppy code by accident. But unless you're only working on toy projects, statistics will catch up to you and sloppy code will happen. To err is human.

By NASA standards, I suspect most of your code has been written "sloppy". As has most of mine.

> but programming languages aren't safe, only code can be safe, and that depends entirely on the developer.

Languages can be safe in the sense that they can force code to be safe in specific ways, or at least warn you better with unsafe opt-ins or better static analysis.

We agree that the developer is to blame for the thousands of overflow CVEs out there.

One developer recognizes they're not an infallible robot, nor are their coworkers, nor is the new intern they're about to hire, and uses the tools at their disposal - static analysis, "safe" languages, etc. - to catch and fix some large percentage of certain mistakes they, and those they work with, make.

Another developer scoffs at the first for "blaming their tools" and tries to avoid mistakes with sheer willpower. By not setting up static analysis, maybe they save enough time to do an additional 10-20 code reviews over the course of the project.

All else being equal, who will end up with safer code?

> Yes, it's "easier" to introduce some bugs in C than Ruby (or Go, or whatever), but that's because whoever wrote that code with the bug didn't know C well enough.

If this is true, nobody knows C well enough. Find me a programmer who's written a sufficiently large C project without a single bug, and I will worship them as a living god.

> Is that C's fault? Same can be said about any language, really.

I don't care about fault, per se. But sure, let's blame C. And every other language. Let's not blind ourselves against their faults, and the possible ways we might improve them, and the possible ways we might adapt ourselves to them.

Let's not saddle ourselves with stone axes for the rest of our lives.

> If you don't know that String#match returns nil on unsuccessful matches and try to call MatchData#[], you'll get a NPE (something along the lines of "undefined method `[]' for NilClass"). This is very similar to dereferencing a NULL pointer in C[1].

Hence the point of talks such as "Null References: The Billion Dollar Mistake", and why some languages are designed to avoid letting you access potentially null/nil/nothing variables without checking that they aren't first.

deathanatos · on July 14, 2015

> I have difficulty accepting "let's replace C with X", where X is a memory-managed language.

What about C++ (which adds RAII, which I believe is indispensable for writing correct code, especially over C) or Rust, which adds much better memory correctness? I'm in agreement where the article says,

> C, and derivatives like C++, is a very dangerous language the write safety/correctness critical software in, and my personal opinion is that it is almost impossible to write security critical software in it

Though I believe it can be done in C++, with some discipline (but much less than C would require).

> As a systems programmer (I write SCSI driver code in C),

I think SCSI driver code counts as a niche application

> I can't overemphasize how important it is to be able to address memory as a flat range of bytes, regardless of how that memory was originally handed to me. I need to have uint8_t* pointers into the middle of buffers which I can then typecast into byte-aligned structs.

My understanding is that C doesn't generally allow this; that's what the strict aliasing rule is, and what's "wrong"[1] with several of the examples in the article. IIRC, you can get a [unsigned] char * into a struct (but why?[2]), but attempting to cast a char * to a struct foo * is forbidden.

(Of course, with amends to the thread's original purpose, which is asking what the common layman understands / uses / depends on. Type aliasing is not well understood in my opinion. I'm not entirely confident I've got it right in this post.)

> If your memory manager would not allow this or would move this memory around, that's a non-starter.

(same comments about Rust/C++)

> We never use 'int', by the way. We use the C99 well-defined types in stdint.h. Could this value go greater than a uint32_t can represent? Make it a uint64_t. Does it need to be signed? No? Make sure it's unsigned. A lot of what he's complaining about is sloppy code.

In my experience, this is a rare thing; especially while interviewing, I find the majority of candidates — claiming to be most comfortable in C (we allow language of choice, in the hopes that you choose your strongest!) — don't know what `size_t` is.

[1]: The array copy code is correct, but the author is lamenting optimizations that cannot be taken unless we assume the pointers don't alias; the int-to-float code is UB (hence why he writes "miscompile" in quotes; it's UB, so by definition there's no wrong output (though an error might be nice); this is also why "obvious" is in quotes: humans know what the programmer meant, but what the programmer wrote is UB; I think this is telling about C: human expectation and the language don't align, from a language-design standpoint, this is not good).

[2]: most of the time I see people reaching for a char-pointer-into-a-struct, or cast-char-pointer-to-struct, they're short circuiting actually decoding some I/O byte-stream into an in memory data structure. This is not portable, unless — maybe — if you do "packed" structures (which is still not portable, I believe), but then you're sacrificing performance by potentially having unaligned members in the struct (which are harder for the processor to deal with, and might require multiple (e.g., MIPS) or unaligned (e.g., x86, amd64)) loads/stores.

near · on July 14, 2015

I understand that in some cases, these heroic compiler optimizations can offer significant performance increases. We should keep C around as it is for when said performance is critical.

But surely, we can design a language that has no undefined behavior, without substantial deviations from C's syntax, and without massive performance penalties. This language would be great for things that prize security over performance.

And the trick is, we don't need to rewrite all software in existence in a new language to get here! C can be this language, all we need is a special compilation flag that replaces undefined behavior with defined behavior. Functions called inside a function's arguments? Say they evaluate left-to-right. Shift right on signed types? Say it's arithmetic. Size of a byte? Say it's 8-bits. memset(0x00) on something going out of scope? If the developer said to do it, do it anyway. Underlying CPU doesn't support this? Emulate it. If it can't be emulated, then don't use code that requires the safe flag on said architecture. Yeah, screw the PDP-11. And yeah, it'll be slower in some cases. Yes, even twice as slow in some cases. But still far better than moving to a bytecode or VM language.

And when we have guaranteed behavior of C, we can write new DSLs that transcode to C, without carrying along all of C's undefined behavior with it.

You want to talk about writing in higher-level languages like Python and then having C for the underlying performance critical portions? Why not defined-behavior C for the security-critical and cold portions of code, and undefined-behavior C for the critical portions?

Maybe Google wouldn't accept the speed penalty; but I'd happily drop my personal VPS from ~8000 maximum simultaneous users to ~5000 if it greatly decreased the odds of being vulnerable to the next Heartbleed. But I'm not willing to completely abandon all C code, and drop down to ~200 simultaneous users, to write it in Ruby.

pcwalton · on July 14, 2015

> But surely, we can design a language that has no undefined behavior, without substantial deviations from C's syntax, and without massive performance penalties.

…Including undefined behavior around memory allocation, in particular use-after-free?

What to do about that is the big question, in my mind. Other forms of UB can mostly be patched up straightforwardly with a clean design (though there are some tough questions around bounds checks). But when it comes to UAF, there are basically three ways you can go about this and still remain a runtimeless systems language:

1. Compromise on "no UB" for use-after-free. UAF remains undefined behavior. Some variants of Ada with dynamic memory allocation have this, and I believe many Pascals did this. It's a popular approach in many new systems languages, like Jonathan Blow's Jai.

2. Disallow dynamic allocation. This is the approach taken by SPARK and other hardened variants of Ada.

3. Allow dynamic allocation, but statically check it with a region system. This is Rust's approach. Eliminating memory safety problems in this way while avoiding a GC is pretty much unique to that language, though it's obviously influenced by many other systems that came before it (C++, Cyclone).

All of the options have serious downsides. Option (1) opens you up to what has become, in 2015, a very common RCE vector. Option (2) is very limiting and pretty much restricts your language to embedded development. Option (3) has large complexity and expressiveness costs (though once you've paid the cost you can get data race freedom without any extra work, which is nice). Altogether it's a really difficult problem with tough tradeoffs all around.

near · on July 14, 2015

> Including undefined behavior around memory allocation, in particular use-after-free?

There are obviously going to be limits to what can be done. If you access beyond memory, you get "bad data" if the address is mapped by the OS, or a crash if it's not. That's a clear bug, and we can't make C a language that is incapable of producing programs with bugs. I don't really think of this as "undefined" ... we define very clearly that one of two things happens, based on the OS' memory layout. That's very different from GCC's understanding, where undefined == "if I want to have the program upload a cat picture to Reddit instead of shift a signed integer right, then that's what I'll do." (facetious, but you get the idea. Many of GCC's 'optimizations' cause outright security vulnerabilities, and defy all logic, like deleting chunks of code entirely.)

We want the most logical thing to happen when a user does something, not a completely unexpected thing just because it happens to make some compiler benchmark test look a little better.

> Other forms of UB can mostly be patched up straightforwardly with a clean design

I'm betting there aren't any C programmers out there that know 100% of the behaviors that are undefined. I've been programming for 18 years, and I got bit the other day because I had "print(sqlRow.integer(), ", ", sqlRow.integer());" ... where the .integer() call incremented the internal read position. MinGW decided to evaluate the second call first, and then the first one, so my output ended up backward. You may think that one's obvious, just like I might think that a shift by more bits than the integer type holds being undefined is obvious, but there are people that would be surprised by both.

Stating that function arguments evaluate left-to-right, just like "operator," does in expressions, would be an infinitesimal speed hit on strange systems, and no speed hit at all on modern systems that can just as easily use an indirect move to set up the stack frame.

And if you have a processor that can't do arithmetic shift right, which would be extremely rare, then generate that processor's equivalent of "((x & m) ^ b) - b" after the shift.

pcwalton · on July 14, 2015

> I don't really think of this as "undefined" ... we define very clearly that one of two things happens, based on the OS' memory layout. That's very different from GCC's understanding, where undefined == "if I want to have the program upload a cat picture to Reddit instead of shift a signed integer right, then that's what I'll do."

I don't think there's much of a difference in practice between the two. If you admit "bad data" into your language, you very quickly spiral into true undefined behavior. For example, call a "bad data" function pointer—what happens then? (This is basically how UAF tends to get weaponized in the wild, BTW.) Or use the "bad data" to index into a jump table—what happens then?

near · on July 15, 2015

> I don't think there's much of a difference in practice between the two. If you admit "bad data" into your language, you very quickly spiral into true undefined behavior.

Well, by your definition, it would indeed be basically impossible to turn C into a well-defined language. You'd have to make absolutely radical changes to memory management, pointers, etc.

So then, can we at least agree that it would be a good idea to minimize undefined behavior in C? Sure, we can't fix bad pointer accesses, and I get why this stuff was there in the '70s. But modern CPUs have largely homogenized on some basic attributes. How about we decide that "two's complement has won" and thus clearly define what happens on signed integer overflow? How about we state that "a byte is 8-bits"? And so on ... all of the things that are true of basically every major CPU in use, and that would be exceedingly unlikely to ever change again in the future.

And this can still be offered as an optional flag. But when enabled, it's just a little bit of added protection against an oversight turning into a major security vulnerability, and at virtually no cost.

ectoplasm · on July 15, 2015

Out of curiosity, are there expensive runtime checks needed to handle signed integer overflow or do processors actually know about two's complement? If you had to check every single time you touched an int that could be bad. In general, what UB can be defined with static fixes, and what UB can only be defined with dynamic fixes?

spc476 · on July 15, 2015

It depends on the architecture. The VAX could be set (on a function-by-function basis) to either ignore 2's complement overflow, or automatically trap. The Intel x86 line can trap, but you have to add the INTO instruction, possibly after each math operation that could overflow. I don't think the Motorola 68k could trap on overflow. The MIPS has two sets of math operations, one that will automatically trap on 2's complement overflow, and a set that won't (and at the time, the C compiler I used only used the non-trap instructions).

That's why the C standard is so weasly with overflow---it varies widely per CPU.

bandrami · on July 14, 2015

There's a fourth option: disallowing freeing. A resource, once allocated, cannot be repurposed. It sounds crazy for those of us who grew up with the resource limitations of the 1980s, but as time passes it becomes an increasingly interesting idea.

pjmlp · on July 14, 2015

Regarding Pascal, Borland dialects allowed for regions, but of course they are scope based.

jeffreyrogers · on July 14, 2015

For those who don't know Chris Lattner[1], who wrote this post, is the primary author of LLVM and more recently of Swift, so he knows a bit about what he's talking about :)

[1]: https://en.wikipedia.org/wiki/Chris_Lattner

carlosrg · on July 14, 2015

Until I see really big and open source projects like WebKit or Clang itself moving to Swift or whatever, anything I read about moving to "better systems languages" is like reading a letter to Santa Claus. I doubt C++ is going anywhere, especially when C++ itself is not standing still and evolving (C++11, 14, 17...) while maintaining backwards compatibility.

pjmlp · on July 14, 2015

What about having early versions Mac OS written in Object Pascal, only to rewrite it in C for pleasure of the UNIX hordes?

GFK_of_xmaspast · on July 15, 2015

I wasn't a mac user until later, are you saying that was done for reasons of ideology?

pjmlp · on July 15, 2015

Yes. Mac OS was initially written in a mix of Object Pascal and Assembly.

Even Photoshop 1.0 was, check the available source code.

Object Pascal was the inspiration for Turbo Pascal 5.5 OOP extensions.

After a few OS releases, UNIX was already getting user in the industry and pressure from users about C compiler availability grew.

Apple introduced a new SDK with C and C++ support, including a C++ framework. Afterwards the new OS apis were written in C.

Also by this time, Apple did their first attempt at the UNIX market with A/UX.

pjmlp · on July 14, 2015

"My hope is that the industry will eventually move to better systems programming languages, but that will take a very very long time..."

-- Chris Lattner

Yes, a very long time. Modula-2 was born in 1978, but we can go back to Algol and Lisp even.

mcguire · on July 14, 2015

"In the first example above, it is that 'int' is the default type people generally reach for, not 'long', and that array indexing is expressed with integers instead of iterators. This isn’t something that we’re going to 'fix' in the C language, the C community, or the body of existing C code."

The majority of that message is pretty well said, but this particular part leaves me cold. The problem isn't that 'int' is the default type, not 'long', nor is it that array indexing isn't done with iterators. (Ever merged two arrays? It's pretty clear using int indexes or pointers, but iterators can get verbose. C++ does a very good job, though, by making iterators look like pointers.) The problem is that, in C, the primitive types don't specifically describe their sizes. If you want a 32-bit variable, you should be able to ask for an unsigned or signed 32-bit variable. If you want whatever is best on this machine, you should be able to ask for whatever is word-sized. Unfortunately, C went with char <= short <= int <= long (, longlong, etc.); in an ideal world, 'int' would be the machine's word size, but when all the world's a VAX, 'int' means 32-bits.

That is one of the major victories with Rust: most primitive types are sized, with an additional word-sized type.

Gibbon1 · on July 14, 2015

Then again with C99 you do have stdint.h which gives you defined width types as well as minimum width types. And others.

mrpippy · on July 13, 2015

For the for loop example, is there some reason why clang doesn't output a warning like "Does 'i' really need to be signed? If so, explicitly make it a 'signed int'. Otherwise, change it to be unsigned"

porges · on July 14, 2015

The point here is not particularly about signedness, it's that UB allows better optimizations to be performed.

If overflow is defined to wrap around then it's potentially an infinite loop (take N == MAXVALUE). With overflow defined as UB you can say the loop executes exactly N times (because you're not allowed to write code that overflows).

So UB is both bad and a source of power :)

cowsandmilk · on July 14, 2015

> The point here is not particularly about signedness

But in the case of C, that is what it is about since unsigned integers have defined behavior, so you can only have UB and the optimizations when you use a signed integer.

jeffreyrogers · on July 14, 2015

Yes, but more generally UB allows optimizations that wouldn't be allowed otherwise. The whole reason C has so many undefined behaviors is for the benefit of compiler writers.

Joky · on July 14, 2015

I believe it is not the original reason: most UB are about funky HW. For instance for the non-wrap signed int it can be explained because C does not assume you have 2-complement hardware.

jcranmer · on July 14, 2015

More specifically, some early hardware would trap on signed overflow. A lot of undefined behavior in C actually comes from "some machine would cause a trap", and C predates the invention of precise trapping in out-of-order processors. The possibility of traps is generally the difference between undefined behavior or unspecified/implementation-defined behavior.

GFK_of_xmaspast · on July 14, 2015

If N is unsigned, clang++ (and I think g++) will warn about signed/unsigned comparison, but sometimes you do want signed loop indices.

tacos · on July 14, 2015

Microsoft's compiler has been ALL OVER THIS for 20 years.

Nothing says "buggy port of Linux app to Windows" faster than "#pragma warning (disable: 4018)".

Clang is catching up. GCC doesn't seem to care as much. Given all the bitching about undefined behavior you'd think they'd up the warnings.

Compilers are programs too, y'know. We can define (and gasp, re-define) behavior.

GFK_of_xmaspast · on July 15, 2015

Well, in fairness and generality nothing says buggy like "disabled warnings".

(and you know you're in trouble when you start looking at a new codebase and see a "-fpermissive" in there)

Totally agree with more warnings being needed, I understand that exploiting UB is needed for optimization, but I don't buy the 'well we can't warn about every case, so we won't warn about any' reasoning.

nikanj · on July 14, 2015

Once again, http://research.microsoft.com/en-us/people/mickens/thenightw...

ryanmarsh · on July 14, 2015

Have we lost sight of the fact that when we talk about a programming language we're really talking about how to put bits on CPU registers?

kd0amg · on July 15, 2015

We aren't. Most programming languages do not expose CPU registers to the programmer (because they are not semantically important), and most programmers do not think in terms of moving bits in and out of registers (and typically don't have much to gain from doing so). CPU registers are just an accident of implementation. Programming as a general activity is not inherently tied to a register-based CPU. In fact, programmers are often happy to have compilers/machines remove some put-bits-into-register activities.

JustSomeNobody · on July 14, 2015

What would be considered "security critical"? SSH? IPTables? Linux kernel?

AgentME · on July 14, 2015

Personally, as someone that doesn't want to be hacked, I would think anything I run that connects to the internet and isn't already very well sandboxed.