Cross-platform Rust rewrite of the GNU coreutils (github.com/uutils)
548 points by yincrash on March 22, 2016 | 487 comments



Rust seems like a great choice for systems programming, especially considering that a well-tested, battle-hardened code-base like SQLite faces problems [1] solely due to the nature of the language it's written in.

[1] https://news.ycombinator.com/item?id=11312918


Not sure why you're being downvoted. I don't see why we shouldn't be moving to languages like Rust given the chance, as C makes it far more difficult to write safe and correct code.


I expect it's because the first part is essentially preaching to the choir, and because Richard Hipp very much disagrees with the second part [0]

> Rewriting SQLite in Rust, or some other trendy “safe” language, would not help. In fact it might hurt.

(see link for expansion on that matter, which is a question of tooling and testing)

[0] http://blog.regehr.org/archives/1292#comment-18452


Disappointing to see Hipp make that argument. It's trivially refuted.

Yes, all programming languages allow the programmer to write bugs. But languages very much vary in how many, and what kinds of bugs programmers write in practice. Saying "well, Rust doesn't eliminate all bugs" is attacking a straw man. If you want to argue that Rust isn't worth it, you need to convince me that C plus gcov results in fewer bugs in the important areas in practice than Rust (plus kcov [1] if you like) does. I think that's going to be pretty hard. (Especially if memory safety issues are the most important bug class you're concerned about: I think it's completely impossible for any C-based solution to compete with Rust here, regardless of how much tooling you add.)

Drawing an equivalence between undefined behavior and compiler bugs also doesn't make sense. Compiler bugs are way way less commonly encountered than undefined behavior in C. Also, they're qualitatively different: compiler bugs get fixed in new compiler versions, while UB is by design and doesn't get fixed.

[1]: https://users.rust-lang.org/t/tutorial-how-to-collect-test-c...


> If you want to argue that Rust isn't worth it, you need to convince me that C plus gcov results in fewer bugs in the important areas in practice than Rust (plus kcov [1] if you like) does.

I don't have a dog in this fight, but I don't see how the burden of proof is on Hipp rather than the folks proposing the change. In other words, shouldn't the "rewrite it in Rust" folks have to prove that the cost of their proposed rewrite will be justified?


> In other words, shouldn't the "rewrite it in Rust" folks have to prove that the cost of their proposed rewrite will be justified?

Agreed, they do. But that it hasn't been proven to make things better doesn't mean it will make things worse. It just means that we don't know enough to say. The right way to answer the "would this software have fewer bugs if it were rewritten in Rust?" question requires a detailed look at what bugs the software has empirically encountered.


A rewrite automatically makes things worse because you start with no code.

I figure it's one of the signs of programmer maturity, that you start to look askance at rewrites. So tempting, yet so rarely even finished let alone better.


In the case of Rust I don't believe this is 100% true, given C ABI compatibility. You could start rewriting in such a way that it is integrated with the existing code and slowly but surely tease the C out of the system.


It would for the longest time be a C program with a metastasizing wart of Rust hung off the side, impossible to get into, impossible to work with, debugging hell, compilation hell. The distros would weep.


Programs written in multiple languages are not exactly a new thing. Every iOS and Mac app is one, just to name one example.

And Firefox has a good chance to become exactly what you describe with the weird cancer analogy--in fact, the nightly builds already are.


Sure, the primary burden of proof is on those proposing a change. However, any time you stand up and make an argument, the burden is on you to make sure it actually makes sense, and that goes for both sides.


This is getting a bit meta, but I disagree. It would be trivial to abuse in discussions.

    A: Bash would be way better for SQLite, really!
    B: But Bash is a terrible choice because X, Y, Z, ...
    A: If you make those arguments, you have to prove them.


"Bash would be terrible because using it would cause a fire"

Yes, arguments have a burden to make sense, and provide evidence...


I'm not sure I take your point. If X, Y, and Z are cogent enough to be worth a rational response, B has done their job under my principle above. A is precisely the one I would want to yell at for using specious arguments.


Anything that is stated without proof can be refuted without proof. Why do you think your opinion should be held to a higher standard than anybody else's?


...I don't. That's sort of my point. Both sides of a debate have the same responsibility to be reasonable, whatever that level of responsibility may be, whether it's a formal debate or just tossing ideas around.


I like to swing it the other way: you have to prove to me that the newly proposed solution is not worth it. You can check out Amazon's policy about driving change. The "will not work" camp has to do the work to prove that something is not going to work. I think this approach yields better results in practice.


'Saying "well, Rust doesn't eliminate all bugs" is attacking a straw man.' This is itself a straw man. Follow the link to Mr. Hipp's comments and read them. He did not say this.

That a programmer who has produced such high-quality and rigorously tested software as SQLite should be portrayed as either cavalier or naive about software quality is something I find profoundly misguided.


> This is itself a straw man. Follow the link to Mr. Hipp's comments and read them. He did not say this.

"Rust doesn't eliminate all bugs" is a rephrased version of "Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected."

> That a programmer who has produced such high-quality and rigorously tested software as sqlite should be portrayed as either cavalier or naive about software quality is something I find profoundly mis-guided.

I don't think he's cavalier or naive about software quality! SQLite's quality speaks for itself, and he is completely correct about what you need to do to achieve the level of correctness that SQLite has. He knows a lot more than I do about the rigor needed to ensure that amount of software quality. I do think, though:

1. The amount of testing that SQLite undergoes is not economically feasible for most software.

2. Rust's testing tools would allow for the same level of code quality in the absence of evidence otherwise. (The example he cited, code coverage, is wrong, as kcov is available for Rust.)

3. There is value in static analysis above and beyond the value of testing (also vice versa), because testing only reveals bugs that manifest themselves in inputs available in the test suite. This is true regardless of the code coverage of the tests.

In other words, I think Hipp's argument would make sense if it were something along the lines of: "I couldn't test a Rust SQLite as well as I can test the existing SQLite because kcov is missing features X, Y, and Z, and the value I get from the static analysis is less than the value I get from those code coverage tools, since we've found lots of bugs from those code coverage features and haven't found as many memory safety/data race bugs". That'd be totally valid. But as a blanket statement that Rust would make things worse, it doesn't make sense to me.


I was not aware of kcov or its basis bcov. Thanks for pointing those out.

However, a quick glance at the bcov source code leads me to believe that it only does source-line coverage, not branch coverage. So, unless my quick reading of bcov sources is mistaken, I couldn't test a Rust SQLite as well as I can test the existing SQLite because kcov/bcov is missing the ability to measure coverage of individual machine-code branch instructions, and the value I get from static analysis is much less than the value I get from branch-coverage testing tools.

That's not being pedantic, btw. The difference between source-line coverage and machine-code branch coverage is huge. The latter really is necessary.
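
To make that difference concrete, here is a minimal hypothetical sketch (Rust, since that's what kcov would be measuring; not a real SQLite test): a line-coverage tool can report this function as fully covered even though one of its branch outcomes is never exercised:

    // Hypothetical illustration: one source line, two branches.
    fn in_range(x: i32, lo: i32, hi: i32) -> bool {
        x >= lo && x <= hi // `&&` short-circuits: two branch instructions
    }

    #[test]
    fn test_in_range() {
        // `x >= lo` is false, so `x <= hi` is never evaluated. A
        // line-coverage tool marks every line covered; a branch-coverage
        // tool reports the second comparison as never taken.
        assert!(!in_range(-1, 0, 10));
    }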


Thanks, that's really great feedback. We should invest in making use of llvm-cov. It should dovetail nicely with the MIR work (which is now to the point where it can build the Rust compiler), since the language-specific IR it exposes is especially suited for quick development of instrumentation passes.


I don't think he's making the argument you think he's making. This is clearer in his later comments in the thread, in which he says that UB is much scarier in systems like Fossil.

His argument is that achieving the level of quality that SQLite has requires verification (in a broad sense) after the compiler, and that's what he does. If you consider the goal to be producing quality-assured binaries, then you can treat UB, compiler bugs, and many other things as falling in a similar category, which are almost certainly eliminated by an MC/DC test suite.

As you say, this isn't feasible for almost any software (as John said in the blog post, SQLite is the only program he knows of that has MC/DC testing when not required by law). But it does mean that rewriting SQLite in Rust wouldn't provide as much value as rewriting many other things where the binaries do not have such guarantees.


> His argument is that achieving the level of quality that SQLite has requires verification (in a broad sense) after the compiler, and that's what he does.

> If you consider the goal to be producing quality-assured binaries, then you can treat UB, compiler bugs, and many other things as falling in a similar category, which are almost certainly eliminated by an MC/DC test suite.

I don't think that they're eliminated because dynamic testing can't eliminate everything—it only finds bugs given its test inputs.

Moreover, though, I'm also skeptical of the claim that binaries are all that matter. Lots of software projects (for example, Firefox) import SQLite as source into the project instead of using the binaries. They upgrade their compilers without running the SQLite test suite to catch regressions. (Firefox might, but I'm sure lots of other projects using SQLite from source don't.)

> But it does mean that rewriting SQLite in Rust wouldn't provide as much value as rewriting many other things where the binaries do not have such guarantees.

Sure; static analysis is more useful in systems that aren't as well dynamically tested. But static analysis still has value.


> I don't think that they're eliminated because dynamic testing can't eliminate everything—it only finds bugs given its test inputs.

I can't think of an example of UB-exploit in the compiler that wouldn't result in different branch behavior, and thus, I think, in failure of MC/DC. But I may simply be insufficiently imaginative.

John makes the point about source code in a follow-up comment, and I agree, but I also see Hipp's perspective.


> I can't think of an example of UB-exploit in the compiler that wouldn't result in different branch behavior,

What about Implementation-defined Behavior? This is (I think!) technically a subset of Undefined Behavior and it permits such things as setting values[1] to arbitrary (but well-defined!) values on various operations, such as "excessive" left shifts. What I'm saying is that a compiler is permitted to substitute IB for UB and still be conforming. So it could start to do strange things to arithmetic, etc. Does that make sense as an example of what you're thinking of?

EDIT: [1] I obviously meant memory locations... as referred to by "variables" which aren't really variables, but are really binders/aliases. But here we are.


I think you're confused about implementation-defined and undefined behavior. The former is not a subset of the latter, but excessive left shifts are UB.


> "Rust doesn't eliminate all bugs" is a rephrased version of "Some well-formed rust programs will generate machine code that behaves differently from what the programmer expected."

I think that sounded more like "Many of the programs which are UB in C are also UB in Rust, even though not (yet) specified as such."


> I think that sounded more like "Many of the programs which are UB in C are also UB in Rust, even though not (yet) specified as such."

Well, this seems either (a) false or (b) uninteresting to me. It's false because C and Rust have different semantics, and Rust rules out lots of programs that C doesn't. It's uninteresting because if the point is that Rust has accidental undefined behavior due to compiler bugs (and it does), then that UB is so rarely hit that it doesn't matter nearly as much in practice as the UB in C.

CPUs have bugs, too. Do we consider Java an unsafe language because of rowhammer?


Help me out here, I don't know much about Rust. Which bad programs does Rust rule out?

I know about the borrow checker, but I don't think ownership bugs are a type of bug that Mr. Hipp frequently produces. It's a program design issue -- not something you think about at every single line you write. A well-designed program does not do many ownership transfers.

As someone else noted, out-of-bounds errors are sadly a pain to debug in C. For testing code, C has at least valgrind (and certainly many less well-known tools). For production code, you might not want dynamic index checking. It would invalidate performance arguments for Rust.

I think the real nasty cases of UB in C, which frequently occur as not statically refutable, are signed arithmetic overflow and shifts. As with out-of-bounds errors, I don't think Rust has a better story here -- only better built-in tooling.


> I know about the borrow checker, but I don't think ownership bugs are a type of bug that Mr. Hipp frequently produces.

The borrow checker eliminates use-after-free. And use-after-free is one of the most common types of vulnerability exploited in practice today, if not the single most common. (For evidence, look at reports about Pwn2Own.)

Index checking is quite cheap for most programs, and LLVM is good at eliminating redundant checks. As a useful comparison in other languages, Chromium compiles with bounds checks on std::vector by default (the [] operator, not just the at() method, which is already required to be checked). Actually, there are a lot of issues with Rust's current machine code output that I think slow it down, but index checking isn't one of them.
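
To make the use-after-free point concrete, a minimal hypothetical sketch: this is the kind of program the borrow checker rejects at compile time, where the equivalent C would compile and contain a use-after-free:

    fn main() {
        let r;
        {
            let s = String::from("temporary");
            r = &s; // compile error: `s` does not live long enough
        } // `s` is freed here, so `r` would be a dangling reference
        println!("{}", r); // the C analogue of this is a use-after-free
    }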


I don't think you can count use-after-free as exploitable. Sure, if you have them (and I think those cases can largely be ruled out by good design), they lead to crashes. But for memory-type exploits you need control over the value in it. I'm not a security specialist, but I'm not aware of common attacks besides overflowing buffers.

In hot paths (like codecs, compression algorithms...) I'm sure you never want bounds checking. It can be optimized away for sequential loops of course, but not so easily in the case of data lookups.


Use after free is a really common form of RCE. Have you heard the term "heap spray"? That's an exploitation technique for use after free.



"While it is technically feasible for the freed memory to be re-allocated and for an attacker to use this reallocation to launch a buffer overflow attack, we are unaware of any exploits based on this type of attack."


It's so common there's a tutorial on it: http://www.fuzzysecurity.com/tutorials/expDev/11.html


Thanks!


I have no idea what that article is talking about. Use-after-free exploits have been extremely common for years. Like I said, look at Pwn2Own (which requires submitting an actual working exploit): https://www.google.com/search?q=pwn2own+use-after-free


Gah, it seems this one isn't good. I trusted OWASP and skimmed; it seems that I agree with sibling commenters that this is more dangerous than is presented here.


Rust allows you to specify the behaviour you want:

Checked operations: https://doc.rust-lang.org/std/primitive.i8.html#method.check...

Saturating: https://doc.rust-lang.org/std/primitive.i8.html#method.satur...

Wrapping: https://doc.rust-lang.org/std/primitive.i8.html#method.wrapp...

Wrapping with notification: https://doc.rust-lang.org/std/primitive.i8.html#method.overf...

These apply to shifts too and you can tag types themselves to always behave one way. Overflows at compile time are caught as errors. More info at:

https://github.com/rust-lang/rust/issues/22020

If the requested behaviour is unspecified (just `a+b`), overflows cause runtime panic (edit: or wrapping behaviour in release build)

See https://doc.rust-lang.org/1.7.0/reference.html#behavior-not-...
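
A rough sketch of those methods in action (illustrative values only, using `i8`, whose maximum is 127):

    fn main() {
        let x: i8 = 120;
        assert_eq!(x.checked_add(10), None);             // overflow detected
        assert_eq!(x.saturating_add(10), 127);           // clamped at the max
        assert_eq!(x.wrapping_add(10), -126);            // two's-complement wrap
        assert_eq!(x.overflowing_add(10), (-126, true)); // wrapped value + flag
        // A bare `x + 10` would panic here in a debug build.
    }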


> If the requested behaviour is unspecified (just `a+b`), overflows cause runtime panic.

As far as I know, most machines can't trap on signed integer overflow, which means having a runtime panic is not practical. The page you linked says "no checking by default for optimized builds".


Ok, I simplified this too much. For full details of what happens you'll have to read the docs.

The non-release mode panics. Release mode with debug asserts turned on panics. Release mode with forced overflow checks panics. Release mode with no special options results in the same result as a wrapping operation.


1. The amount of testing that SQLite undergoes is not economically feasible for most software.

But the argument isn't about "most software" - it's about SQLite specifically. And that's the crux of the issue. Saying that he believes it would be counterproductive to rewrite SQLite in Rust at this time is not saying that that would be true of all or most or some other programs.


> Saying "well, Rust doesn't eliminate all bugs" is attacking a straw man.

Yes, it's very much akin to saying "well, a seat belt won't eliminate all deaths". I would understand an argument of "I'm not prepared to rewrite the software at this time" or "I'm not familiar enough with replacement X to assess whether it's a good choice", but at some point more people will have to acknowledge C's failings with more than lip-service. That doesn't mean Rust has to be the solution/replacement, but something does.


Nor does Rust let you trigger "dangerous" UB in non-unsafe code. This isn't going to change, ever, so the argument that "give it time, Rust will soon be chock-full of UB" is moot too.


I'm new to Rust. Is it possible for a program written in C to link to a library written in Rust?

Is it theoretically possible to rewrite Sqlite in Rust and still maintain compatibility with programs written in C?


Yeah, it's one of Rust's strongest features, as it allows progressive migration of native code, and also allows Rust to serve as a fast/low-overhead extension language for Ruby and Python (etc.).

More details:

- http://doc.rust-lang.org/book/ffi.html#calling-rust-code-fro... - http://doc.rust-lang.org/1.6.0/book/rust-inside-other-langua...
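
The core of it is small. A minimal sketch (build configuration assumed, e.g. crate-type = "staticlib" in Cargo.toml; the function name is just an example):

    // Rust side: export a function with the C ABI and an unmangled name.
    #[no_mangle]
    pub extern "C" fn add(a: i32, b: i32) -> i32 {
        a + b
    }

    /* C side, for illustration:
       #include <stdint.h>
       int32_t add(int32_t a, int32_t b);  // declare the Rust symbol
       ...
       int32_t r = add(2, 3);              // r == 5 */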


The quality of tooling, and the ability of experts to verify the machine code output, is a really important point. I think I'd agree: Rust at this point would hold back an elite developer like Richard Hipp.

The promise of rust, which may or may not be realized is pushing some very common problems down to the compiler. All code has bugs, so the compiler probably does things wrong in some cases. As the tools mature, these will get scrubbed out, just like every other project.

That said, array bounds violations are responsible for so many problems, it seems worth it to raise the minimum for a language. Not every developer is elite. In fact, they're pretty rare. C requires you to be really smart all the time, or at least be aware of when you're not smart enough to get a chunk of code right. Looping over some bytes from a file shouldn't be that risky. Rust lets me, a less than elite developer, save my few moments of brilliance for the hard part of a program, rather than having to worry about the evaluation order of foo(i++,i++);

Maybe, it'll turn out the only way to make good software in the future is to find the best 100 developers in the world, and get them to make stuff. I doubt it though, being able to leverage the other million of us to make stuff, and have some (a lot?) of confidence it'll be free from the most common C errors is valuable.

Nobody should be forced to use tools they don't like. Rust is trendy, but it has some very good ideas. Trendiness isn't reason alone to dismiss its approach. blah blah rust cheerleading blah blah.


I think the idea that "elite" developers can write bug-free C (or even just network-facing C free of security-sensitive memory safety problems) is pretty well refuted at this point. Now you can write bug-free C if you're willing to spend enormous time and money on testing: this is to a first approximation what SQLite did. But that only makes economic sense for a small minority of projects. Just putting "elite" developers on a C project, without the huge verification cost, is by itself not enough to eliminate bugs, as much as we as hackers would like to think it is.

(The one exception may be, like, DJB. But djbdns/qmail are very unusual C programs in many ways.)


It will be interesting to see if the formal proof/verification work that is being done by NICTA for the seL4 project will mature into something that can be used practically elsewhere in industry.

https://sel4.systems/


You might be interested in https://robigalia.org/


From my understanding, NICTA's proofs are basically algorithms to show that the Haskell models and the C code are equivalent. They can then go on to do proofs with their Haskell models.


> C requires you to be really smart all the time, or at least be aware of when you're not smart enough to get a chunk of code right.

It's not so much that it requires smarts. It's that it requires you to be ever-vigilant and to never make any mistakes. (That's why UB and bounds overflows are so devastating to software security. Almost any slip-up by the developers can be exploited.)

Incidentally, the ever-vigilant bit is also why we really want compilers to be doing the bounds-checking (or proving that it isn't necessary). Compilers are really good at being ever-vigilant. Humans (no matter how smart)... not so much.


> good at being ever-vigilant

There goes the price of freedom. Drats.


> That said, array bounds violations are responsible for so many problems, it seems worth it to raise the minimum for a language.

Amen to that! If I could just add one feature to C - at least as an option - it would be array bounds checks.

I have no trouble with manual memory management - a garbage collector is nice to have, but I have not had many problems with memory leaks or dangling pointers. And the ones I had were relatively easy to locate and fix.

But array bounds violations are so easy to commit and so nasty to track down... When I still wrote C code for a living, I would have gladly sacrificed quite a bit of performance to get bounds checks on all array accesses, at least for testing and debugging...


> I have not had many problems with memory leaks or dangling pointers. And the ones I had were relatively easy to locate and fix.

Note that every single browser vulnerability in this year's Pwn2Own was a use after free (dangling pointer).


Well. I'll gladly admit that the C code I have written was way less complex than a web browser. When I still wrote C code for a living, I worked on an application suite that was ~250k lines of code, maybe ~300k, in total, whereas e.g. Firefox is a couple of million lines. I assume other mainstream browsers are similarly big.

Plus, what a browser does is, by nature, a lot more complicated than what I worked on.

EDIT: What I am trying to say is: I do not mean to trivialize use-after-free bugs, but in my personal experience, I did not run into many, while I witnessed (and caused, I am afraid) a bunch of array bounds violations, and they sometimes made me want to cry.


Yeah, browsers are essentially interpreters (and nowadays compilers) for a variety of crazy languages (html, css, javascript; probably a few more?).

You don't want to be writing your interpreters in C.


> You don't want to be writing your interpreters in C.

Well, technically, I think most browsers these days are written in C++, but your point remains valid.

OTOH, the "default" Python interpreter is written in C, and so is Perl (Ruby MRI, too, I think, but I am not 100% certain). I cannot recall any major security problem with those languages that originated inside the interpreter (which, of course, does not mean those did not/do not exist). Then again, a web browser is probably far messier in terms of what input it has to deal with.


> Then again, a web browser is probably far messier in terms of what input it has to deal with.

That's the ticket: the browser does have a large attack surface, but more importantly it's supposed to safely execute completely arbitrary and untrusted payloads. In the same category are pretty much all of the usual suspects of security issues: flash, java (applets), …

Most interpreters are only fed trusted payloads, lest the developer start eval'ing stuff they got from god knows where, and in that case the fault is usually laid at the developer's feet rather than the interpreter's.


Yeah. The browser is more like a hypervisor that Amazon might use to run arbitrary people's VMs. But it has an unimaginably larger surface area than Xen.


Prof. Regehr did not find problems with SQLite. He found constructs in the SQLite source code which under a strict reading of the C standards have “undefined behaviour”, which means that the compiler can generate whatever machine code it wants without it being called a compiler bug. That’s an important finding. But as it happens, no modern compilers that we know of actually interpret any of the SQLite source code in an unexpected or harmful way.

At some point some popular compiler is going to make a subtle but important change to some undefined behavior that's not going to be immediately obvious as to its repercussions, and the fallout will be massive. It boggles my mind the mental contortions people will go through to justify what is essentially an argument of "it hasn't caused a problem yet" while ignoring that it's caused many problems already, just not that they've noticed or that have affected them.


It's still important to distinguish the two. The source code has bugs, but in the year 2016 it builds binaries that do not have bugs.


At some point some popular compiler is going to make a subtle but important change to some undefined behavior that's not going to be immediately obvious as to its repercussions, and the fallout will be massive.

Wait until you realize that the length of a byte in C is not clearly defined. Someday a processor will come along where a byte is 6 bits, and the fallout will be massive. (Really, it's happened before).

No, actually, that processor would not become popular because no one would use it.

The reality is unless you are using a formally-defined language like ML, you are relying on undefined behavior in your language.


The number of bits in a byte is specified by the CHAR_BIT macro, which is required to be at least 8. It's commonly larger than 8 on C compilers for some DSPs, but almost universally 8 for hosted implementations.

POSIX specifically requires it to be exactly 8.


Did you see the short-lived attempt to create "friendly C" a few months ago? [1]

There are languages that have a few undefined corners, and then there's C. A sufficiently large difference in quantity becomes a difference in quality. C is qualitatively worse than most modern languages with its undefined behavior. (Granted, some modern languages escape by having the one implementation, which is then the definition. But still, that's less undefined than C.)

[1]: http://blog.regehr.org/archives/1287


It was too ambitious. Some of those examples, like shifting by 32, are about trying to remove even unspecified behavior. If you could reduce most cases of undefined behavior down to "what some real machine would do", it would be an enormous help, and wouldn't be anywhere near as hard.

For example you could say that division by zero will either give a result or trap, but that it can't do anything else, and the code path cannot be ignored. Or that an uninitialized variable is equivalent to initializing it with a semi-random number.

Even if an out of bounds array access will cause untold chaos, you can at least specify that it will cause that chaos at X bytes past the base of the array.

Merely creating an invalid pointer would be, on architectures where it doesn't trap, 100% harmless.

And for crying out loud, is forgetting to terminate a string literal with a " still undefined? There are so many bits of undefined behavior that are easy to remove.


> Or that an uninitialized variable is equivalent to initializing it with a semi-random number.

This proposal destroys lots of dead code elimination optimizations that are very important in code post inlining (for example, in the STL).


Can you explain why? What if we relax it to say that copying from one uninitialized variable to another can be omitted?


Because uninitialized variables are allowed to change values arbitrarily over their (nonexistent) "live range". Your proposal would force them into having a real live range, with a stable value. See this section for examples of the kinds of optimizations this opens up:

http://llvm.org/docs/LangRef.html#undefined-values

There's also the issue that having a stable value forces the register allocator to keep a live range for the undefined variable, which can cause unnecessary spills or remat in other places. Especially on 32-bit x86, this can be a problem.


> Because uninitialized variables are allowed to change values arbitrarily over their (nonexistent) "live range".

Even specifying that they have a new arbitrary value on each access would be a big improvement over the status quo. It wouldn't allow nasal demons. Since LLVM seems to already have these semantics, it makes a good argument that it wouldn't hurt C's performance to tighten the spec at least that much.

But I'm not seeing how C or C++ code gets you in a situation where you're purposefully doing arithmetic or bitwise operations on uninitialized variables. If it almost never happens, it doesn't need to optimize particularly well. What parts of the STL can be faster by treating uninitialized variables as impossible?

> There's also the issue that having a stable value forces the register allocator to keep a live range for the undefined variable, which can cause unnecessary spills or remat in other places.

This would only happen when the variable is accessed multiple times. In which case having to keep the value safe is no worse than if it actually had been initialized. I don't see how this is a problem.


> What parts of the STL can be faster by treating uninitialized variables as impossible?

What I'm mostly thinking of is allowing unused branches to be pruned. The STL tends to get inlined really heavily, which results in a whole pile of IR being emitted for what look like very simple operations. Based on the actual parameters and state, the optimizer then wants to prune out as much dead code as possible to reduce i-cache footprint, and sometimes time as well.

Take small string optimization. That optimization requires a branch on length in almost every string operation. But if you have a std::string of constant size, you don't need the heap spilled code to be emitted at all. Usually the only way to work this out is inlining + constprop + DCE. That's where undefined value semantics are really helpful: a branch on an undefined value can be completely removed as undefined behavior, which can make its branch targets unreachable, and allow them to be removed, and so on recursively. Undefined values allow entire CFG subtrees to be eliminated in one fell swoop, which is an extremely powerful technique for reducing code size.

I don't have a precise example off the top of my head as to where this kicks in in the STL, but I strongly suspect it does.


I still don't understand. Why would length be undefined if that's how you tell whether a string is small or not?

Even if you can remove one of the branches because you know if the string is small, the logic of "this branch can't happen" -> undefined -> delete sounds more complex than "this branch can't happen" -> delete.


The order is undefined -> "I, the compiler, declare this branch can't happen" -> delete. The middle step is valid because "undefined behavior" permits the compiler to make that declaration, then act on it. If you don't want it to do that, use defined behaviors only, which is a great deal easier said than done. Partially because of how hard it is to avoid it in your own code, and partially because it is shot through all the other library code (which as pcwalton points out, courtesy of aggressive inlining, is also your code).

I was skeptical about all this about six months ago myself, but the continuous stream of articles on this topic, plus the spectacular and highly educational failure of Friendly C (and not just that it failed, but why it failed, which is why I posted that exact link) has satisfied me. It is also part of why I've stepped up my own anti-C rhetoric since I've been so convinced... as bad as I thought C was, it really is, no sarcasm, not merely "hating", even worse than I thought. I am honestly scared to use it professionally and pretty much refuse to touch it without good static analysis support.


I just don't understand how "std::string of constant size" leads to the compiler inferring that a variable is undefined along a particular code path. In particular, on the not-taken code path, what is the uninitialized variable being branched on?

Edit: Is it figuring out that the pointer in the string object is uninitialized? That doesn't seem any easier than reasoning about the length value, and I don't see how it would lead to "br undef".


LLVM has been doing this optimization since 2008, and it contains justification in the commit message. LLVM commit #60470:

    Teach jump threading some more simple tricks:
    
    1) have it fold "br undef", which does occur with
       surprising frequency as jump threading iterates.
    ...
Chris didn't cite any specific numbers in the log, but I believe him when he says it actually happens. You should be able to run "opt -O2 -debug-only=jump-threading" and see where it does :)

Related, with some actual numbers to prove it helps, LLVM commit #138618:

    SimplifyCFG: If we have a PHI node that can evaluate to NULL and do a load or
    store to the address returned by the PHI node then we can consider this
    incoming value as dead and remove the edge pointing there, unless there are
    instructions that can affect control flow executed in between.
        
    In theory this could be extended to other instructions, eg. division by zero,
    but it's likely that it will "miscompile" some code because people depend on
    div by zero not trapping. NULL pointer dereference usually leads to a crash so
    we should be on the safe side.
        
    This shrinks the size of a Release clang by 16k on x86_64.


For the price of removing some of the rusty nails sticking out of C, I'll happily pay 16KB on a binary as large as clang. Plus in most or all of those cases, the optimization would still be possible even with stable uninitialized values. It just needs to be done in a different way. You might be able to get 13 of those 16KB back with a very minor amount of work.


> For the price of removing some of the rusty nails sticking out of C, I'll happily pay 16KB on a binary as large as clang.

Lots of LLVM users won't. The competition between LLVM and GCC is (or was) pretty brutal.

> Plus in most or all of those cases, the optimization would still be possible even with stable uninitialized values. It just needs to be done in a different way.

I doubt that's possible without justification. How? Are you familiar with all of the passes of LLVM, how they interact, and with the code patterns generated by the STL?

I trust Chris in that he didn't add the jump threading optimization for no reason, which is the main one that matters here. If he says that this occurs with "surprising frequency", sorry, but I'm going to trust him. Compiler developers are usually right about the impact of their optimizations. Submit a patch to LLVM if you like to remove it, but I highly doubt it'll go through. If it did go through, Rust (and perhaps Swift) would probably revert it, as we ensure that we don't emit UB in the front end, so losing the optimization hurts us for no reason.


The goal is to refine the semantics of C. Removing a valid optimization doesn't help that.

It's not that I think Chris is wrong; it's that I think other optimizations related to dead code have gotten better over the years, and focused effort could improve them more.


I think "refining the semantics of C" is a doomed effort that we shouldn't undertake. We'll lose performance for very little benefit, if market game theory even made it possible (and it doesn't: benchmarks of C compiler performance are much more important to customers than friendliness of the C dialect). We should just be moving away from C instead, or if we must stick with C we should invest in dynamic checks for undefined behavior.

My position, by the way, is hardly uncommon among longtime C compiler developers.


Do you think C managed to get things exactly right? Or should we add more kinds of undefined behavior?

What do we do about the fact that undefined behavior actually makes it harder or impossible to write efficient code in some cases, like checking for integer overflow?

Even if your goal is performance over anything, there are a whole lot of undefined behaviors that have absolutely zero performance benefit.


> What do we do about the fact that undefined behavior actually makes it harder or impossible to write efficient code in some cases, like checking for integer overflow?

This is a good example of why we should be moving away from C. :) Signed overflow being undefined is basically necessary due to a self-inflicted wound from 1978: the fact that "int" is the easiest thing to type for the loop index when iterating over arrays. Nobody is going to go through and fix all the C code in existence to use unsigned, so we're stuck with that UB forever. The realistic solution is to start migrating away from C (and I have no illusions about how long that will take, but we'll never get there if we don't start now).

> Even if your goal is performance over anything, there are a whole lot of undefined behaviors that have absolutely zero performance benefit.

Sure. But compilers don't exploit those in practice (because compiler authors are not language lawyers for fun), so they're basically harmless in practice. They're essentially just spec bugs for the committee to fix in the next revision.


It just seems like such a waste to abandon C instead of smacking compiler/spec writers and getting them to specify behavior that was de-facto specified for decades.

C should be writable by humans.


> Even specifying that they have a new arbitrary value on each access would be a big improvement over the status quo.

This is approximately what LLVM describes undef as, and it indeed leads to nasal demons. This description enables a single value to pass a bounds check and then subsequently go out of bounds!


You're right, in certain cases it would still cause trouble, but it would be a lot fewer cases than 100%. Reading only once would be safe, and passing it to another function would produce a variable that can no longer change unexpectedly.


Unless the function is inlined/the compiler can deduce that the variable is undef.

(Of course, you could define it to work, but I would suspect that would make this sort of value much less useful.)


> The reality is unless you are using a formally-defined language like ML, you are relying on undefined behavior in your language.

The CompCert C verified compiler supports "almost all of the ISO C90 / ANSI C language" - http://compcert.inria.fr/ .

It isn't C. It is a formally-defined language that is very similar to C. It is otherwise nothing like ML.


It's far more likely for a major compiler to exploit more undefined behavior for optimizations than for a processor with a bizarre-sized byte to become popular. The major compilers already do this.

> The reality is unless you are using a formally-defined language like ML, you are relying on undefined behavior in your language.

This is not true, not using the definition of "undefined behavior" provided in the C standard.


"while ignoring that it's caused many problems already" Really? I believe Mr. Hipp is claiming that it hasn't. Do you have evidence to the contrary?

I don't think the argument is that these cases should be ignored -- they are now corrected, after all. It is that most of these cases should be treated as low priority compared to issues that are creating observable problems.


I take kbenson to be saying that what Mr. Hipp claims isn't a problem for SQLite is or has been a problem for other projects. That is, other projects have been bitten by relying on undefined behavior (such as what a variable's initial value might be) that may have been consistent across a variety of implementations, and then suddenly wasn't. Perhaps it's safe to say that today we don't need to address these issues on this project in this language, but that's taking on technical debt and setting ourselves up for work in the future (that may be difficult to identify at that time).


That was exactly my point, and I think the alternative interpretation makes no sense in light of the portion of my comment saying "just not that they've noticed or that have affected them."

> Perhaps it's safe to say that today we don't need to address these issues on this project in this language, but that's taking on technical debt and setting ourselves up for work in the future (that may be difficult to identify at that time).

Yes, that is succinctly saying something I was just implying. Even if a current C program is verified as having absolutely no problems with any current compiler due to the way it is using undefined behavior and the compilers interpret it, it's impossible to assume it will remain in that state, by the nature of the problem being examined. Undefined behavior is undefined, and thus may change. Now, any language may decide to change how something works, so every program has to deal with this at some level, but again, by the nature of this problem, there's much less assurance that the problem won't be a small, subtle change, possibly in one compiler, which is missed until it's widespread.

This would be much less of an issue if there were a specific subset of C which defined most of the undefined behavior and which could be turned on with a flag. It would probably prevent portability in some respects, but I hope we've finally reached a place where portability is accepted as secondary to security.


I was making a general point using his statement as the impetus, not a specific point in the case of SQLite. Undefined behavior (specifically, developers relying on it being consistent) has caused many problems in the past, and will undoubtedly do so in the future. Relying on behavior which by its definition is impossible to rely on is not a situation I think we should be defending.


> The disagreement is not over whether or not [undefined behavior] is a problem, but rather how serious of a problem. Is it like “Emergency patch – update immediately!” or more like “We fixed a compiler warning” or is it something in between. –Hipp

In this specific case with SQLite it certainly is more of a fixed-compiler-warning. However, you could imagine some undefined behavior in an SSL implementation requiring emergency-patch level as well. I agree with you that undefined behavior "in most cases should be treated as low priority".


Can you suggest what that change might be related to? I, too, think it's less likely than one might think.

This discussion seems to miss a couple things. SQLite is embedded quite often, by programs written in C; would embedding a Rust library and possibly a runtime fix things? The parent program would still potentially have defects and those defects could impact SQLite.

SQLite is also rather mature; I don't see how you can compare a rewrite to mature code. Take OpenSSL as an example: they aren't tossing it, they are fixing it; it's a much shallower lift to fix it.

I'm all for some big Rust programs to prove its case though: a mailer, a DNS daemon, some sort of database. Something useful cut from whole cloth, and ideally something we have historically not done well. I don't think a rewrite is it though.


> Can you suggest what that change might be related to? I, too, think it's less likely than one might think.

Aliasing rules[1]? Compilers apparently don't agree on that now, so even if it were to coalesce into the same undefined behavior, one would have to change.

> This discussion seems to miss a couple things. SQLite is embedded quite often, by programs written in C; would embedding a Rust library and possibly a runtime fix things?

I'm not advocating for Rust as much as I'm advocating against C, or at least against C as it currently exists and is implemented with so much undefined behavior. I've argued elsewhere in this discussion that a special subset of C with as much undefined behavior as possible specifically defined, which could be enabled through a flag, would do wonders (even if at the expense of some portability).

> Take OpenSSL as an example: they aren't tossing it, they are fixing it; it's a much shallower lift to fix it.

It's much more shallow to review and patch it. Let's not kid ourselves that it will be fixed when they are done with it. There will likely be bugs regardless of the language used to implement a crypto library. Does that mean we should ignore when one language allows an entire class of bugs that another does not, especially when it's for a Crypto library?

Let me make my case another way. What if C never caught on as the dominant language, or C was originally defined in a more strict manner, and most of the utilities and tools we take for granted were instead implemented in a language that had fewer undefined portions, allowed less undefined behavior, and thus had fewer bugs and security problems? Would the resultant small performance hit (due to being unable to optimize around undefined behavior) have outweighed the added stability and security, or would we have been better off? I think we would have been better off, and since C is still widely in use, I think it's not too late to make that case.

1: https://news.ycombinator.com/item?id=11288665


Really, if your project intends to be portable C code, you need to compile it with undefined behaviour generating an error.


Undefined behavior is not always identifiable through static analysis. Obviously it can be checked against at runtime, but that's actually quite expensive. It would, for example, include bounds checks for everything, and overflow checks on all signed arithmetic.


That's not the worst of it: the truly intractable part is preventing use-after-free UB. The only ways to do this are (a) remove malloc from your language; (b) add a lifetime system (incompatible with all existing C libraries); (c) add a garbage collector (which most projects written in C will not accept for performance reasons).


I think that's mostly solved by changing the whole of the idea from "remove/disallow all undefined behavior" to "remove as much of the common needed undefined behavior as possible such that most programs need not really use it". Perfect is the enemy of good.


That'd be fine if use-after-free was a corner case that doesn't show up often in practice, but every single exploit at Pwn2Own this year was UAF…


Are we arguing the same thing? If UAF is the cause of security problems, and UAF is undefined behavior now, redefining it under a special mode to mean "this is an error (whether or not the compiler enforces it)" at least clarifies the situation, and lets any included static analysis report an error where it is able to.

That is, I'm not arguing that as much undefined behavior as possible should be made defined and permitted, just that it's defined. That definition may very well be "you are not allowed to do this. Don't do this."


With portable software (as in, portability is more valuable than other factors), imperfect is the enemy. ;-)


We have over four decades history of portability being valued over correctness, stability and security. I'm not impressed with where it's gotten us.


See, I look at it as the opposite. You aren't really portable if your code isn't correct.


And once you've eliminated that, you have DoS bugs like forcing infinite loops, abandoned (referenced but unused) memory leaks, and worst-case hash table insertions. All of those are serious attacks for anything with a resource budget.


rlimit ftw


Trapping on an overflow register signal is very, very cheap, and there are standard libraries for it.

Some of this stuff can be caught at compile time. Like code that does signed addition of positive values and checks if they might be less.


He certainly has a point, but I think he goes too far in assurances about the current sqlite3.so.

Testing "every single instruction" does not guarantee that all C-level UB has been eliminated, because some bugs can be input-dependent. For example, even if your coverage tells you that this function has been tested, it could still trigger undefined behavior for other inputs that trigger overflow:

    int f(int x) { return x * 2; }
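    /* e.g. f(0x40000000) computes 2^31, which overflows a 32-bit int:
       undefined behaviour, even though the tests executed every
       instruction of f */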


For another example of this issue, consider the fact that data race detectors do not eliminate all data races in practice, because they only find races that show up on your test inputs.


I ask this from a position of seeking knowledge rather than being adversarial. The discussions of undefined behaviour from a C language perspective always seem muddled to me between two things: what's undefined at the C language level, such that a compiler might take advantage of it and produce results that would be unexpected from looking at the code, versus cases where the generated machine code faithfully represents the language statements but a bug may be present when processing some external input at runtime.

If the inputs here are coming from something external to the program, such that the compiler can't know the value of x, then the machine code should be faithful to the language statements. It's a bug if overflow occurs, which may have other runtime implications, but the compiler isn't going to remove some chunk of code, etc. because of it.

From their testing page, SQLite say they also test boundary conditions (I don't know to what granularity, though), which may catch something like this; that agrees that instruction-level coverage isn't enough. They also run their tests with all the various sanitisers enabled.

Isn't this the same in Rust? The non-release builds would need to see a suitable test case for the overflow check to cause a panic.

I guess I'm asking: is this example relevant to concerns about undefined behaviour and optimising compilers vs. Rust? Since the function is input-value dependent, in practice isn't this more like implementation-defined behaviour at runtime, in that a platform will always provide a consistent behaviour, e.g. overflow, trap, saturate, etc.?

I've followed regehr's blog for a few years, I've read a lot about Rust, but I'm mostly working in higher level dynamic languages and don't have a lot of hands on experience with C or Rust and just wondering if I'm missing some subtlety here.


Frankly, SQLite is the exception that proves the rule.

For all the complexity of the SQL language, the efficiency constraints SQLite needs to meet, and all the algorithms it has to implement, it still has a pretty well-defined task. It is not easy to transfer that experience to other systems. Say, your company's backend API. Completely different constraints.

It could still be the case that, had Rust existed when SQLite was being created, it would have taken much less engineering effort. Which is the metric that matters, as given infinite manpower, you can write anything in any language.

I do agree that rewriting SQLite now, which is a very battle-tested piece of software, would probably do more harm than good. I bet people will still try, for fun if nothing else.


Wow. I was surprised that comment came from a smart, accomplished guy. It's nonsense. What he's essentially saying is that the existence of compiler bugs in Rust refutes using it as a C alternative. Let me try that: the existence of hundreds of compiler bugs in C compilers over the past years means it can't be relied on either. CompCert had a few at the spec level, so we throw it out, too.

Realistically, you rewrite the components piece by piece in the new language. You make sure each compiles right with testing and review. You report any problems to the compiler team, who fixes them. Eventually, your whole app is in the safer language without a lot of work. You might even swap them, where the safer one becomes the reference code with the other one there for any platforms not supported yet or too buggy. A diversity benefit, as Hipp mentions.

With all of this, the program becomes immune to most memory & concurrency issues while being easier to maintain. Its undefined behavior will probably be a fraction of C's in number and severity. That's a net win.

Note: I'd have told him Ada/SPARK instead of Rust given it's been stomping C in embedded safety for a long time w/ lots of tooling for verification activities already there. Counters his compiler maturity argument, too.


> What he's essentially saying is that the existence of compiler bugs in Rust refutes using it as a C alternative.

That is not at all what Hipp's comment says, and the original author's response to it gives it a much more charitable reading than you seem to be giving it.


Let me re-read it. He makes several points. The first is that they're intentionally relying on undefined behavior known to cause problems sometimes out of nowhere. He says it's OK that they rely on it because it's not currently causing them problems. Relying on something impossible to rely on, as kbenson worded it, because that's working out so far. Reminds me of a George Carlin quip about people building villages on active volcanoes, then being surprised when lava turns up in the living room. Pretty foolish.

Next he points out Rust should reduce undefined behavior occurrences vs C while eventually having some of its own. That's correct, but he neglects the biggest benefit: its safety scheme preventing many flaws found in C projects by default. Leaving this out of his comment makes it Rust vs C on undefined behavior and compiler correctness only. Bad comparison, given efficient memory safety is basically Rust's main benefit.

Next, he conflates compiler bugs with undefined behavior. They're not equivalent. One is an implementation failure to be remedied. One is a design failure to probably stay indefinitely in the language and compilers. He's falsely reframing the situation to prop up an argument.

Next, he delivers the argument: that compiler bugs mean you can't rely on Rust unless you check the machine code itself. I like that he checks machine code as that's a high-assurance recommendation with proven value. He claims there's not enough tools to get the job done and a lack of compiler diversity. The first might be true and the second usually only matters if they're implementing the same spec. Otherwise, you're getting effectively different programs you can't compare directly. Plus, most GCC, LLVM, and Rust programs are performing just fine relying on one, actively-developed compiler without machine code testing.

So, he's made some bogus claims, dismissed Rust's whole benefit package, focused discussion on machine code from buggy compilers, made claims about its verification which I lack knowledge to evaluate, and ignored field evidence that his focus area is a small problem. He's trying really hard to dismiss Rust entirely at compiler level without much to show for it. Incidentally, it takes much less writing for most of us to dismiss C on grounds of language or compiler safety. Something you don't see him doing. ;)

That said, the Rust community should invest in tooling for assembly/machine-level verification if he was correct in saying they don't have it. That will be important for OS and embedded work, where developers trust compilers very little. Past that, his comment's misdirection and level of bias deserve no charitable interpretation.


> Next, he conflates compiler bugs with undefined behavior.

Maybe Rust is just not specified for every corner case? The compiler could just do anything in such cases (e.g. what a C compiler would do -- not checking for arithmetic overflow, for example). You can then go ahead and claim it wasn't UB in Rust, but effectively it is the same. And you can't expect Rust to specify, in the future, that a compiler must check for arithmetic overflows.

> Next, he delivers the argument: that compiler bugs mean you can't rely on Rust unless you check the machine code itself.

If there are more bugs in the Rust compiler than in GCC or LLVM (which I don't know, but it sounds reasonable to assume given Rust's age) then that's just a good engineer's pragmatic realism.


> Maybe Rust is just not specified for every corner case?
Even in the absence of a formal specification, if you demonstrate undefined behavior in safe code, it will be regarded as a high-severity bug and slated for correction. If it's a bug in the compiler, it will be patched. If it's a bug in the language itself, the language will be redefined to prevent that behavior in safe code and the implementation will be updated to reflect this. If these changes break existing code, then so be it: soundness fixes are an instance where the Rust developers reserve the right to break backwards compatibility. Rust takes UB seriously.


He and I both agreed UB could show up in Rust, although Rust turns on quite a bit of safety by default. As far as compiler maturity goes, he could use that argument, but he counters himself by demanding machine-code verification due to having no trust in compilers. As in, what was the point of bringing up Rust compiler quality if he doesn't trust C's either?

The only valid point he has is that a systems language needs tools to produce and/or test machine-code output for source equivalence. He claims Rust doesn't have that, but that's outside my knowledge. I know Ada, SPARK, CompCert C, and a Java subset have methods available.


> I don't see why we shouldn't be moving to languages like Rust given the chance, as C makes it far more difficult to write safe and correct code

I agree in principle; however, the timing is wrong.

Rust is a very new language (less than 6 years old), and is undeniably totally unproven. It is the latest "buzz" language, and may not be around long term... nobody knows.

What's more, a full-fledged re-write of CoreUtils in Rust is unlikely to be used by any production environment, because this new CoreUtils will also be undeniably, totally unproven. These are the same growing pains LibreSSL has been experiencing: lots of gung-ho fans, but very few actual users (outside of OpenBSD/FreeBSD) - and their undertaking is arguably a lot easier, since they're just cleaning a codebase, not starting from scratch.

Average users who just consume distros are not going to switch to a new unproven CoreUtils (even if they knew how), and distro maintainers are not going to switch until it's proven either. It will take a huge company with a huge install-base switching and testing it in production for many years before others start to feel comfortable... however this is also an enormous burden on said mega-corporation, for little-to-zero perceived benefits.

Yes, in principle, it's "safer" code, but to a mega-corp with thousands of installs, the risk is too great. New bugs, language pitfalls, behavior potentially changing, etc. Perhaps Rust dies, perhaps it's replaced with an even better alternative. It will take a LOT of time to work all this out.

Flatly, re-writing a several-decades-old, matured codebase in today's flavor-of-the-week language is not a good idea. It's a waste of time and effort.

Let the languages mature more, do more systems work that doesn't involve replacing the foundation we all stand on... and maybe, in 5-10 additional years, we'll see where Rust goes.


> Flatly, re-writing a several-decades-old, matured codebase in today's flavor-of-the-week language is not a good idea.

It's odd that so many in the computing industry are unwilling to move on from a language from 1978. We think of ourselves as one of the most fast-moving industries, but we have this odd reverence for early C and Unix that makes us stubborn and resistant to change. The fact is: we didn't know how to do some things properly in 1978. We know more about programming language design now. (Even Rob Pike would presumably agree—that's why he created Go!)

To be sure, we shouldn't just rewrite things in new languages for no reason. I think many segments of our industry are too fad driven. But to me the right thing is simple: Let's evaluate new technologies on their merits. Rust may well be worse than C! But if it is, let's figure out why, and say so explicitly.


> The fact is: we didn't know how to do some things properly in 1978.

Actually, we did know how to do it properly: the Burroughs B5000 was being programmed in Extended Algol as early as 1961, to cite just one example from the many others that were ignored by the UNIX authors because they didn't want to spend too much effort designing a proper compiler.


> We think of ourselves as one of the most fast-moving industries

We're fast-moving because our foundation is solid and not changing (i.e. CoreUtils and gang). There's an assumption that these things "just work" with zero fuss and weirdness between systems.

We build on-top of these systems, so changing them out from underneath us all is a dramatic shift.

Perhaps Rust is the key to making these things better. I never claimed Rust is bad. I've only claimed that Rust may or may not be the right choice here, and since it's so young and unproven, we should wait before trying to re-write "all the things" in Rust. Today, Rust is a pet language... tomorrow, maybe not.

Remember, it took C many years to "catch on", and even longer to become the de facto standard for systems work. We can't rush this sort of thing... especially given the sheer magnitude of things depending on this code.

We should also be careful who actually does the re-write when the time comes. New CS grads who cannot understand the old C code and therefore feel [insert-new-hip-language-here] is better... are not the best ones to tackle this sort of endeavor. This requires deep, deep understanding of the entire package, how all the components interact, legacy behavior and the reasons behind design decisions, etc...


> We're fast-moving because our foundation is solid and not changing (i.e. CoreUtils and gang).

That isn't a definition of "fast moving" that I would use. It sounds like slow moving.

Why is innovation in the core layers of the system less legitimate than innovation in social media apps? Boeing has no problem upgrading their engines every few years for better fuel efficiency. Why can they do that, while we can't do the same with things like GNU coreutils?

> Remember, it took C many years to "catch on", and even longer to become the de facto standard for systems work. We can't rush this sort of thing... especially given the sheer magnitude of things depending on this code.

Actually, Bell Labs had no problems with building their entire system front to back in their new unproven language C instead of using Fortran. I'm glad they went the way they did.


> That isn't a definition of "fast moving" that I would use. It sounds like slow moving.

Perhaps I didn't word it correctly.

When you want to write the next WhatsApp, you don't have to start from scratch. The OS has been taken care of for you, and you can expect it to "just work".

This solid foundation allows innovation to build on-top, at a rapid pace. If your foundation was constantly changing, you'd have to account for all sorts of weird intricacies, non-portable code, etc (like the old days).

A rapidly changing OS isn't really what you want for a production environment. In fact, you want it to be as constant as possible, so that you are free to do your work.

> Actually, Bell Labs had no problems with building their entire system front to back in their new unproven language C instead of using Fortran

You're right. But do remember it took a long time for it to propagate. Today, there are still systems not written in C (although they are in the minority). At the time, a lot of systems were written in pure assembler, and it took a long time to convince those guys that C was ready as a replacement for most tasks (and C changed dramatically in that time period).

I'm glad they went the way they did at the time -- but today with all these things built on top (financial markets, governments, big mega-corps, small ma n' pa shops, etc...), we need slower changes in order to keep the stability.


> At the time, a lot of systems were written in pure assembler, and it took a long time to convince those guys that C was ready as a replacement for most tasks (and C changed dramatically in that time period).

And they were wrong to resist C. They should have switched over sooner.

> today with all these things built on top (financial markets, governments, big mega-corps, small ma n' pa shops, etc...), we need slower changes in order to keep the stability.

Stuart Feldman, on why Makefiles insist on tabs:

"After getting myself snarled up with my first stab at Lex, I just did something simple with the pattern newline-tab. It worked, it stayed. And then a few weeks later I had a user population of about a dozen, most of them friends, and I didn't want to screw up my embedded base. The rest, sadly, is history…"

Stability is important, but eventually you do have to go back and fix things that are wrong for us to move forward.


> At the time, a lot of systems were written in pure assembler, and it took a long time to convince those guys that C was ready as a replacement for most tasks (and C changed dramatically in that time period).

Operating systems written in higher-level languages go back all the way to the '60s, like the Burroughs system, written in Extended Algol in 1961.

There were multiple other OSes written in variants of Algol and PL/I before C was even an idea in its designers' minds.

Only home micros were fully written in Assembly by the time C was getting used outside AT&T.

It is an urban myth spread by AT&T fanboys that C was the first systems programming language.

Anyone who bothers to research the history of mainframes and operating systems will easily find documentation of those systems, most of them written in higher-level languages with more memory-safety features than C has ever had.


Boeing's engines for passenger jets are not substantially different from what they were 20 years ago, are they? coreutils also got some feature additions, fixes, and optimizations over the years.

Aircraft technology may not have been the best comparison for your argument. Have you seen http://idlewords.com/talks/web_design_first_100_years.htm


I think you're missing the point; this project looks to me more like a learning exercise, which is just fine.


Alupis wrote: "Average users who just consume distros are not going to switch to a new unproven CoreUtils (even if they knew how), and distro maintainers are not going to switch until it's proven either. It will take a huge company with a huge install-base switching and testing it in production for many years before others start to feel comfortable... however this is also an enormous burden on said mega-corporation, for little-to-zero perceived benefits."

Seems like this is a good opportunity for Mozilla to demonstrate the resilience of Rust and its ecosystem by adopting Rust as the language of choice, where possible, and deploying in-house a custom Linux system in which Rust-written components are plugged in as and when they are written and ready, with perhaps a full-fledged switch to Redox sometime in the future. If Rust and software written in it are pushed to their limits within Mozilla, then it's not an unreasonable recommendation for it to be deployed on an even wider scale.


> deploying in-house a custom Linux system in which Rust-written components are plugged-in

I don't see Mozilla doing this, really. There's no direct benefit, and there's a nebulous future benefit for Rust.

Mozilla is using Rust components in Firefox though (as well as Servo being mostly Rust). "Adopting Rust as the language of choice" seems to be happening already -- I've heard a lot of folks enthusiastic about (re)writing in Rust. This stuff takes time, though.


>Flatly, re-writing a several-decades-old, matured codebase in today's flavor-of-the-week language is not a good idea. It's a waste of time and effort.

That's the sort of thinking we'd benefit from having less of. Legacy is a terrible burden.


Don't underestimate performance. I haven't seen the most recent benchmarks, but C/C++/Fortran are still unbeatable in raw speed. If you want maximum performance no matter what, these are the languages to choose, even if they sacrifice readability, maintainability or safety.


Well, rust is beating g++ in the benchmark game now. [1] (please don't take microbenchmarks seriously) Performance is complicated; rust has a couple of things going for it that will help out a lot. First, there's a pretty strong bias toward using the stack, as opposed to java or scheme, where stuff is heap-allocated by default. Second, rust avoids the pointer-aliasing problems of c and c++. When someone gets around to making a fast matrix library, it'll likely take much, much less time to catch fortran than it took c++.

But again, performance is complicated. People have put tens of thousands of hours into c/c++ optimization. rust is young, so not so much time there. On the upside, rust has room to grow.

https://benchmarksgame.alioth.debian.org/u64q/which-programs...


Take microbenchmarks seriously as true statements that such & such measurements were made of particular programs.

Please don't generalize that into language X is beating language Y.


If equivalent Rust code is significantly slower than C or C++, those are bugs. Please file them if you find them.


That is, in this general form, wrong. Most compiled languages can match or sometimes exceed C's speed, depending on the task at hand and the algorithm chosen. There are a lot of very inefficient C programs around - because they are badly written. Higher-level languages allow the programmer to focus on speed where it matters. And in the current day and age, there should be no reason to favor raw speed over program correctness and safety.


You see, code size and linearity are very important for anything non-trivial and hot. New languages tend to have abstractions built upon abstractions, making understanding and optimizing a pain.


Well, I have plenty of experience working with SBCL, which directly gives you access to the assembly produced, allowing extremely good control of code quality in hot spots. Just

  (disassemble 'my-function)

would give a nice printout. There are also lots of other languages which might be high-level but still give you excellent code quality - I used Modula-2 a long time ago, and back then it beat the resident C compiler in generated code quality. Yes, many C compilers are very good and some newer ones not so much, but claiming in general that "C" is fast is an oversimplification.


Difficulty is relative. If you don't study modern C idioms your C code will be crap. Same goes for Rust.

Honestly, using higher-level languages is the same mentality as taking a pill to magically lose weight. It's quick, but detrimental (to a programmer's ability) in the long term.

Programming becomes easy, but in the long term most people forget how algorithms and data structures work, along with cache mechanisms and other optimizations.

Until hardware changes drastically, there's no sense in rewriting everything (unless of course it's just for fun). It's a better use of time to study math and lower-level concepts instead.

That being said, most businesses will take the quick pill instead.


> It's quick, but detrimental (to a programmer's ability) in the long term.

Do you know how the lifetime/borrow check system works?

I've heard plenty of criticisms of it, but I've never heard "the lifetime system doesn't make you understand manual memory management".


> Difficulty is relative. If you don't study modern C idioms your C code will be crap. Same goes for Rust.

That's not a valid argument for why better tooling can't help alleviate some of the difficulty.

> Honestly using higher-level languages is the same mentality as taking a pill to magically lose weight. It's quick, but detrimental (to programmers ability) in long term.

If that were true, the most effective programmers would only code in assembly.

> Programming becomes easy but in the long term most people forget how algorithms and data structures work, in addition to cache mechanisms and other optimizations.

Why would higher-level languages obviate knowledge of any of these things? If anything, I think they would help by reducing "noise" from incidental complexity; e.g. ownership in C vs Rust.


>If that were true, the most effective programmers would only code in assembly.

No, because C is as fast as hand-coded assembly in most cases. The same can't be said of any high-level language in comparison to C (save for C++ and Fortran).


> The same can't be said of any high level language

Rust is basically a metal-level systems language, just like C. It has a nicer type system, but it compiles to more or less the same thing in the end - unless you make an error such as using freed memory, in which case it doesn't compile, while the equivalent C code would often compile to a very high-performance foot-gun. Rust's safety is paid for, for the most part, at compile time, unlike the usual high-level languages like java/C#/python/..., which pay for it at runtime.


> No, because C is as fast as hand-coded assembly in most of the cases.

Most hand-coded assembly is pretty slow. C isn't any pinnacle of performance either.

C programs are compiled to some non-existent abstract machine that ignores the real variations in memory architecture and CPU implementations. The compiled binary can't adapt at runtime.

I don't know of any C implementations that take advantage of runtime information for optimization purposes. So you end up with generated code containing a lot of redundant computation, tests, branches, and pointer dereferences (such as a function-pointer dereference that always refers to the same address), just because they might be necessary with some input -- input that didn't occur this time.

A single mispredicted branch is expensive. Say a branch mispredict costs 15 cycles. That's enough time to do up to 480 (32*15) 32-bit floating-point operations on a single core. Ignoring runtime information takes us pretty far from anything you could call optimal code.

The current crop of compilers is also pretty bad at vectorizing anything complicated. In those cases it can be pretty trivial to beat the compiler by 2-10x, and in some cases even 40x+ if your vectorization also eliminates a lot of unpredictable branches.


I disagree. When programming C you have to remember so much stuff and be so careful about even the simplest things; this mental effort drains your resources, and I don't care how good a C programmer you are.

The mental capacity saved can be invested in higher-level design issues that will get you a lot more in the long run.


I am not arguing that higher-level languages are worse and that everything should be coded in C. Each problem requires different tools for its solution. And to that end, high-level concepts should be done in a language-agnostic manner.

However, lower-level understanding is paramount. For instance, and most importantly today, taking advantage of multi-threading requires understanding of cache coherence, memory alignment, et cetera.

Therefore, while a proper serial algorithm today may be correct, its scalability is going to be limited without lower-level understanding. Although I'll admit that can be built into higher-level languages (like concurrency in Clojure, for instance), I personally prefer to understand what is going on rather than live in blissful ignorance :)


It will be slower, though. C is king in raw performance, and in critical areas where that does matter you don't really have a choice but to write it "by hand" (aka in C).


Can you name particular areas in which C allows you to get the performance that Rust does not? Be specific.


I'm not entirely clear on your point. Are you suggesting that writing in C as opposed to Rust is somehow morally cleansing? Or are you suggesting that having higher level abstractions is somehow ruinous to ability?

The evidence is pretty clear that programming in C does not confer enough skill to prevent disastrous mistakes despite its near-hardware level of abstraction, so I'd really like to know what you're saying here.


I don't think I worded my original comment in the best way. However, I find that the people who make statements like "C makes it far more difficult to write safe and correct code" haven't touched the language in years, if ever. Hence, it's no wonder it will be difficult for them: they have not practiced or been exposed to good C idioms. Correct code in C is actually easy since the language is very simple. Due to this simplicity, unit testing is very easy, yet lots of legacy codebases don't use these testing strategies to their potential. If they did, lots of issues would be corrected.

Although I cannot refute that there are additional memory-related issues to be aware of in C, it is precisely such awareness that makes someone a better programmer, because it's how the underlying hardware works. For instance, parallelizing algorithms must take into account cache-alignment boundaries and so on.

Abstractions are good. However, too much abstraction is bad. Just as too much of anything is a bad thing. When people rely solely on such abstractions, it is in fact ruinous to ability.

That being said, higher-level languages certainly have their place. But I firmly believe that C is a high-enough abstraction for systems programming and with proper idioms and testing strategies, it can be just as safe as the plethora of garbage-collected languages out there in the wild.


> Correct code in C is actually easy since the language is very simple.

1. No, it's not. Witness the various arguments that have happened on HN over the years concerning whether the imprecise language of the spec makes some idiom undefined behavior or not.

2. The evidence over the past 35 years has not shown that correct code is "simple" in C.

> For instance, parallelizing algorithms must take into account cache-alignment boundaries and so on.

…which Rust makes you just as aware of as C does.

> But I firmly believe that C is a high-enough abstraction for systems programming and with proper idioms and testing strategies, it can be just as safe as the plethora of garbage-collected languages out there in the wild.

1. Rust isn't garbage collected.

2. Your firm belief is contradicted by 35 years' worth of memory safety track records of large-scale software written in C.


"The evidence over the past 35 years has not shown that correct code is 'simple' in C."

"Your firm belief is contradicted by 35 years' worth of memory safety track records of large-scale software written in C."

Keep in mind, of those 35 years, it only makes sense to consider the last decade-onward (or so) in comparison. While the language hasn't changed a whole lot, programming methodologies certainly have. How many of those libraries in question were written within the last 5 or 10 years? What is the code coverage on them? Et cetera.

I am not trying to argue that C is perfect and everything should be written in C. That would be crazy! There are definitely issues with C, no doubt--but so many people say C is dangerous and difficult when in fact they don't even practice the language. This is no different than practicing violin on a daily basis and saying cello is difficult so people should stop playing cello (not a perfect analogy, but you know what I mean).

The other comments are just me ranting about how the majority of programmers don't understand important computer-science concepts because higher-level languages keep them removed from such concepts. Sometimes it's important to have a higher-level abstraction to help solve certain problems. But people have a tendency to become lazy, and therefore they lose the fundamentals over time.


> Keep in mind, of those 35 years, it only makes sense to consider the last decade-onward (or so) in comparison. While the language hasn't changed a whole lot, programming methodologies certainly have.

The last decade has seen an explosion of modern C++ code using best practices that routinely exhibits the same memory safety issues. C is worse.

> There are definitely issues with C no doubt--but so many people say C is dangerous and difficult when in fact they don't even practice using the language.

I practice C all the time and I think it's dangerous and difficult. I've seen so many brilliant programmers accidentally create game-over RCEs via use after free, for example.


> The last decade has seen an explosion of modern C++ code using best practices that routinely exhibits the same memory safety issues.

Examples? I'd be surprised if best practice C++ (C++11, say) had memory safety issues. (I realize that's only the last half-decade...)

> C is worse.

Certainly. Lack of destructors alone makes it hard to create safe abstractions.


> Examples? I'd be surprised if best practice C++ (C++11, say) had memory safety issues.

Pwn2Own last week. For more examples, search any browser engine's bug tracker.

Yes, this is all modern C++.


> Yes, this is all modern C++.

Is it? You're telling me that all code in all browsers has been rewritten in C++11 (or C++14) with best practices? I don't believe you. At a minimum, I'm going to need some documentation before I believe that.

[Edit: I'm not trying to pull a No True Scotsman here. I just doubt that browsers have been completely rewritten in modern C++, or with anything approaching best practices. I've seen how long old code lives, so I won't believe it without some supporting evidence.]


Most exploits tend to be in new code (contrary to popular belief), which in all modern browsers is written in modern C++. The WTF (Blink/WebKit) and the MFBT (Firefox) are state-of-the-art template libraries; you are free to search for those libraries and verify for yourself. New C++11 features such as rvalue references do nothing to avoid memory safety problems; in fact, they make them worse, since "use-after-move" is now a problem whereas it wasn't before.

I know it's hard to believe, but C++ is not memory safe, old C++ or modern C++, in theory or in practice. The new C++ features do effectively nothing to change this. As far as use-after-free goes, C++ basically adds safety over C in two places: (1) reference counting is easier to use and is easier to get right; (2) smart pointers are arguably somewhat less likely to get freed before accessed again due to the destructor rules (though I think (2) may not be true in practice). Browsers have been making use of these two features for a very long time.

Bringing up modern C++ here is "no true Scotsman" unless you can point to a specific C++11 feature that browsers are not using that is a comprehensive solution to the use-after-free vulnerabilities they suffer from. There is no such feature I am aware of.


No, I wasn't asserting that there is some magic C++ feature that the browsers aren't using. "Most exploits tend to be in new code" was the piece of your argument that I was missing.


Well, you are probably right. :)

I am certainly biased, as I enjoy procedural languages for their simplicity. So many constructs in the others. As the saying goes, C++ is my favorite 4 languages. I hope that doesn't become the fate of Rust. Go seems to get that part right, though that's probably debatable.


> "Correct code in C is actually easy since the language is very simple."

I broke back into this account just to share my marvel at the single most wrong sentence in human history. The Mona Lisa of being incorrect about programming.

I audit code for security vulnerabilities in several languages professionally. You are correct about the benefits of unit testing, which is language-agnostic, but C is not and cannot be Easy To Be Correct, ever, no matter what.


Rust is a terrible choice. Its standard library assumes malloc never fails.


This is downvoted, but I'd rather see it refuted. Is this false? Is SQLite as written tolerant of malloc() failure?


The Rust standard library does assume that allocation succeeds. The language itself knows nothing about the heap, and so you can write allocators that do whatever you wish. Side note: on most Linux distros, overcommit is on, and so malloc will basically always succeed; the OOM killer will kill your program before you'd get a failure. I am less knowledgeable about OSX and Windows.

I cannot speak to SQlite in this regard.


Holy crap, I'd never have imagined that... A design that assumes allocation always succeeds flies in the face of decades of safety/security-critical coding wisdom. I strongly recommend the team revisit and change that somehow to account for failures, NULLs, whatever. Actually, the same goes anywhere a failure-prone (esp. hardware) resource is acquired. C apps can handle this issue, so Rust should as well if it's to replace them.

EDIT: Thanks to the replies for clarifying that it's just the one allocator that aborts, and that others are available. Still feel weird about it, but that's better.


The standard allocator will abort on an oom error. For applications which need to be tolerant of oom errors, you need to use a different allocator.

Most programs can't tolerate oom errors, though, and it would be absolutely unreasonable for every function in the standard library that might perform an allocation to return an error that has to be handled by every programmer all the time.

EDIT: C's solution is to make it very easy to ignore oom errors, so most programmers just don't handle oom errors. Rust's solution is much better.


Every function in the C++ standard library reliably communicates allocation failure to the application without relying on aborting the whole program. If C++ can do it, Rust could have done it too. Rust got itself into this trap by eschewing exceptions.


Maybe avoid using judgmental language like calling a standard allocator that aborts on oom a "trap." The Rust team made conscious design choices in full awareness of the trade-offs. Moreover, Rust actually does have thread unwinding, and even the ability to catch an unwinding thread, so it is not true that the Rust standard library could not have unwound on oom.

Rust's standard library just isn't designed for writing applications that need to survive oom errors. That's fine; it's not designed for a number of other applications which Rust the language is well suited for either (operating systems, for example). It's designed for the majority use case, because life is full of trade-offs.


If Rust isn't suitable for C's niche, Rust's proponents should stop pitching Rust as a replacement for C.

> The Rust team made conscious design choices in full awareness of the trade-offs

I don't think that anyone who isn't already predisposed to avoid exceptions would consider the tradeoff the Rust people made to be the correct one.


This is absurd. Most C applications do not need to persist through OOM errors, and as people have repeatedly reiterated, it is totally possible to write a Rust program that persists through OOM errors.


To be fair, that's not a bug in the language itself. Regarding libc, it's also very easy to abort on error: just write an xmalloc() function.


Note that Rust, the language, is perfectly capable of handling memory allocation failures; it's just the standard library that makes the assumption. Embedded environments wouldn't use the standard library for numerous reasons anyway, and the "core" library uses no allocation at all. That said, IIRC, handling of allocation failure in the standard library will happen eventually; I believe there's just no consensus yet on the best way to do it.


At least for the nightly builds, you can specify a closure to run when OOM happens. The only restriction is that it cannot return, but you can "recover" from panics in the nightly standard library as well. The thinking was, I think, that in most cases the default behavior of abort is the correct behavior. Recovering from a failed malloc is really only relevant in large allocations, and there are multiple paths (including directly using __rust_allocate) to recovery in that case.

Furthermore, keep in mind that the _pure language_ Rust has no concept of dynamic allocation (apart from "language elements" like __rust_allocate, which are kind of like a pre-linker).


Another robustness and availability-oriented system that will eagerly abort your process on this kind of failure is Erlang.


Even with overcommit on, malloc can return NULL on linux.


liballoc does check for NULL IIRC.


> Side note: on most Linux distros, overcommit is on, and so malloc will basically always succeed; the OOM killer will kill your program before you'd get a failure

Clarifying note: this is configurable.


I'm not a Rust expert, but everywhere I could find heap::allocate, it checks for null.


Any time you do Box::new() you are allocating memory. There is no way to report allocation failure from that - the program will panic.


I suppose this coreutils impl doesn't do it, but one could perhaps use rust w/o its standard library and strictly use some libc instead. It would probably still be a net win.


Writing Rust code that interfaces with libc is a rather frustrating experience, though. Many libc interfaces use constructs that Rust really wants you to avoid (for very good reasons): things such as global mutable variables (errno, and callback functions without context pointers) or union types. I know of at least one cargo crate attempting to "Rustify" libc with some level of success.

Nonetheless, this is what I typically end up doing when writing Rust code, as much of the Rust standard library is simply not ready for "serious" usage.


Personally, I rather enjoy writing clean Rust wrappers for gross C APIs. Doing so does require a solid knowledge of C, and familiarity with a small bag of Rust tricks.

In my experience, the Rust standard library is extremely convenient and it handles corner cases very well. There are definitely still holes, but that's what "cargo add $CRATE_NAME" is for.

My biggest annoyance with Rust (and it's not a huge one) is that if I'm doing something off the beaten path, I'm probably going to need to wrap a C library or two that nobody has wrapped yet. There's a lot of great stuff on crates.io, but it's only a minuscule fraction of the total C ecosystem.


> Rust code, as much of the Rust standard library is simply not ready for "serious" usage.

Can you elaborate on this? I've been getting serious usage out of the Rust standard library for years...


There is no way to select over a range of TcpStream objects. There is no stable way to select over a range of mpsc channels. There is no way to access a TcpStream in a non-blocking manner. There is no support whatsoever for UNIX signals. I could go on.


I don't miss the things you mentioned very much, and I think it's an exaggeration to say that these things are what makes Rust's standard library not usable for anything serious. But anyway:

> There is no way to select over a range of TcpStream objects.

mio

> There is no stable way to select over a range of mpsc channels.

Yes, this is annoying. We should stabilize MPSC select. You can use BurntSushi's chan for now though.

> There is no way to access a TcpStream in a non-blocking manner.

mio basically supports this, no?

> There is no support whatsoever for UNIX signals.

Do many languages have good support for this? The C and C++ standard libraries don't; that's part of POSIX. Signals interact very badly with garbage collectors, so I can't imagine many languages have this.


Mio is not part of the standard library. Go, for example, has excellent support for UNIX signals.


So I think our disagreement is semantic. When you say the Rust standard library is "unusable", you mean that you usually need Cargo packages in addition to what the standard library provides in order to write programs in Rust. That is true, but that's just a design difference between Rust and some other languages, like Go. For a systems language like Rust, I think that focusing on having an excellent package manager instead of having a super-comprehensive standard library was the right call. The standard library is very well designed for what it does (IMHO), and I credit the community-based library stabilization process for that.

As for Go and signals, I think it's pretty debatable as to whether you can have "excellent support" for signals without the ability to write a true signal handler. Note that this is not a fault of Go and is pretty much inherent to any garbage collected language. I suspect the Rust community would not be particularly happy with an implementation of signals that had hardwired sending to MPSC channels. Even having MPSC channels in the library at all is somewhat controversial...


Here is a talk I recorded about Linux systems programming with Rust. Quality is only 720p, but maybe you'll still find it useful. The speaker compares solutions in C with equivalent ones in Rust, and what kinds of advantages / safety features you gain:

https://www.youtube.com/watch?v=sCOO6WdDZuk


...faces problems solely due to the nature of the language its written in.

Really, this is true of any programming language and any sufficiently involved program. It's just that with C and security-critical programs, there's some unfortunate concordance between the errors you want to avoid and the errors that are harder to avoid.


> Many GNU, Linux and other utils are pretty awesome, and obviously some effort has been spent in the past to port them to Windows. However, those projects are either old, abandoned, hosted on CVS, written in platform-specific C, etc.

I have seen such a paragraph in another project's README, IIRC a Go rewrite of standard utilities. I do not understand why a project would be obsolete because it's on CVS or is old. CVS is simpler than Git, albeit less capable. I, for one, prefer it over Git for this reason, and others may do so too. Why would the end user care?

And why would we care about the age of a programme if it works?

Now, that said, the authors need not justify anything; they are free to do whatever they want, and I guess it's fun to code this stuff. I tried this just to play with Golang when it was 1.1.

And lastly, that Makefile is really just a bunch of shell scripts plus some common environment variables. There is no real dependency tracking in it, and it is easier to maintain a bunch of shell scripts than a seriously ugly and complex Makefile like this. N.b. when I say dependency tracking I mean dependencies among the input and output files of processing commands, not tasks. I guess Cargo would know how to do that, and how to skip building if the build artefact is already there. It's a useless use of make. And that makefile is very GNU-specific, which doesn't suit a project whose purpose is to be cross-platform. Also, I guess, tho I'm really unfamiliar with Rust and Cargo, if that makefile were removed, maybe on Windows they'd be able to drop the development dependencies on Cygwin or Msys.


Open source is useless if the majority of people in the open community -- for whatever reason -- can't build it.

If a person can't build the thing, they effectively have no power to contribute to changing it. If they have no power to contribute to changing it, it's... spiritually, missing the most essential parts of open source.

This is the fundamental issue driving comments like the quote above.

You can tell me that not having a dependency management system, not having a sane version control system, using "some" C compiler (sans a matrix of tests specifying a range of expected good compilers), arbitrarily fine-grained platform-specificity meaning an average OSS contributor can never reasonably test their changes against all targets... all of these things can be "worked around". But at some point, the litany of issues -- some of which take a new contributor dozens of hours to work around -- becomes a simply overwhelming barrier to contribution.

It's time to admit that a foundation of workarounds in FOSS development processes at the very core of our systems is a problem.

Re-writing it in language $x may not be the solution, but it's certainly understandable that there's a widespread desire for simply getting better toolchains underneath our most basic essential systems.


This project does not really improve upon the situation, as it uses the very thing it aims to replace for building, plus a very complex Makefile where there's no need for one.

I've said nothing about dependency management as in fetching code that the project depends on. I'm talking about compilation dependencies, i.e. file a.o depends on a.c, a.h, b.c and b.h. Make is for this:

  a.o: a.c a.h b.c b.h
          cc -o ${.TARGET} ${.ALLSRC}
But nowhere in the project's Makefile are the rules written in this fashion. They are like shell aliases. What I wanted to say is that the Makefile could be replaced with a bunch of shell scripts that would be easier to use and maintain.

For the rest, it seems that we mostly agree, tho I do not think that CVS is not sane, it's perfectly usable.


On line 4, ifneq appears, which is a GNU make feature. .ALLSRC is a BSD make feature. It's a fine GNU makefile. Also, you should look at Cargo.toml to get a better picture of what is going on.


The issue is not whether projects are old, it's whether they're well-maintained. Please see the Core Infrastructure Initiative's Best Practices Badge project for one attempt at measuring the "health" of an open source project.

https://github.com/linuxfoundation/cii-best-practices-badge#...


I could see preferring svn to git because of the simpler model, but cvs? No thanks, a vcs without atomic commits is not much better than snapshot archives, maybe worse actually.


For my use case, it's way more preferrable:

a) SVN seems daunting and complex, tho I didn't ever dive into it. CVS is so simple and easy, a half-arsed programmer like me can actually understand it. Things like git and mercurial are way more complex.

b) RCS is real handy for single files, e.g. a free-standing text file or shell script. But when the thing grows up, it is very easy to integrate the fileset into a CVS repo while preserving its history: move the ,v files to $CVSROOT/$MODULE/.

c) The repository model of CVS is as transparent as it gets.

d) The keywords like $Id$ are really useful.

e.g. I keep my system configuration in "~/Checkouts/system-config", and I have a script that cp's the files to the appropriate locations using a map file. When I'm not sure whether the active config is up to date, I can verify it very easily. And I can be sure that dirty files won't be active as long as I don't expressly copy them. I know that SVN has this too, but I find CVS easier to use in general.

I guess for fast paced, very active development, yes CVS is sub-par, but for personal stuff, or for something that is patched say at most two-three times a month, it's O.K. It boils down to personal preference.


SVN is way simpler at the interface than CVS, you should really look into it. SVN is a spiritual successor to CVS, and is trivially easy for a CVS user to pick up. We switched from CVS to SVN at work several years ago and everyone was happy with the change.


We did the same and plenty were _unhappy_. In the end git took over.

Having used CVS, there were plenty of things I didn't like about SVN (separate folders per branch? yuk), and I didn't find it 'trivial' to pick up.


Really, it's hard to handle branching worse than CVS does, and if SVN doesn't do a stellar job either, I don't see it as a barrier to switching.


there are two things that are nice about svn if you use it just like cvs.

1. atomic commits. I edit ten files, that's one checkin, rather than the per-file checkins of cvs. On a low-volume project, not a big advantage. But if you've ever hit a conflict on a bigger project with cvs, it can be kind of a pain to resolve; seeing the other guy's whole commit is helpful. If you don't run into this more than, say, monthly, it's not worth it.

2. offline diffs. svn keeps a pristine copy of everything you checked out, so you can diff against the base revision even if the central repo is down, or you're working from the beach. This one is pretty nice regardless.

svn is a pretty nice upgrade, if you're working with a distributed team.


All appreciated, but nowadays my biggest repo is my emacs config, some tens of thousands of SLOCs, about 2000-2500 of which I authored (I commit the packages I use too, no elpa). And my repositories are local, i.e. they are in /var/cvsrepos. Other than that, I admit that SVN is indeed superior.

I believe that one should use the best tool for the case, not the overall best tool in every case.

That said, I'll give SVN a look. I can consider switching when I have the time, if it is easy to import from RCS, because I do use RCS a lot here and there, mostly for plain-text documents. I do not like maintaining unrelated things in a single repository.


If it's just you and everything is local, it's probably not worth the overhead. Sounds like you have a good, efficient setup.

If you're learning for the sake of learning, i'd dig into git. i'm getting better, but some cases still frighten me.


> If you're learning for the sake of learning, i'd dig into git.

Well, nearly everything moved to git. I've used it for a while, mostly commit/pull/clone. I actually moved to CVS from git :) Git is a bit like perl: you either need to be an expert at it, or else you can get going but end up creating a mess.

> [...] it's probably not worth the overhead. Sounds like you have a good, efficient setup.

Mostly, yes, but mostly because I don't really need much more than recording my history, occasionally looking at what I did in the past, and more rarely working on short-lived branches. But there are a couple of tools that, if I ever code them up, I will make open source. Tho if I ever do, and they take off, I'll use a 'just send a patch to the mailing list' approach. I believe it's easier to deal with. No flamewars on VCSes :)


Sometimes snapshot archives are exactly what's needed.


Agreed; I'd go even further and say that old is a feature. Now, as a side effect of its age, it may not conform to modern-day best practices and styles… which might make it harder to fix newly discovered bugs or add new features.

Heck, the increased difficulty in creating feature bloat could even be considered a feature in and of itself, too!


Git makes it easier to accept contributions and code review from a larger developer community, thereby allowing building a higher-quality product.

(Incidentally, while I think that part of the reason why Git is this way is due to design differences from CVS, this isn't essential to it being true. Git is also better than Monotone or Fossil or Bazaar, despite being much closer in design, because it has network effects that those other systems don't.)


> Git makes it easier to accept contributions and code review from a larger developer community, thereby allowing building a higher-quality product.

It is very easy to contribute if you know git. But you can mess it up if you don't. I made a two-line bugfix patch to flycheck, and heck, I was nearly pasting the patch into a comment in the issues because I didn't know anything about how to make a pull request on github or how to commit so that the puller would be happy. I didn't want to do something embarrassing, so I spent two hours reading and reading about how to submit my patch properly. And the whole fix was made and tested in about two minutes. If only I had been able to submit a dumb patch, I wouldn't have had to care whether they used git, cvs, or tarballs and quilt.


I think it's fair to say that git is by far the superior option to CVS, and that preferring CVS over git is objectively incorrect. I think you would be wise to invest the time in learning how to use git correctly.


Objective truth is second to actual productivity. I did use git. It's powerful, but more than I need. CVS takes ten pages of reading to grok, and then it's out of my way. I couldn't learn git in years, and there's something new every other day; it gets in my way. And I don't have time to chase a command line on Stack Overflow; I need that time for what I'm actually doing.


I can't speak for other projects, but for anything I maintain on GitHub, "Hi, I don't know how to use git, but here's a patch file" is a perfectly acceptable bug report! I'm not claiming that git makes it 100% easy, just that it makes it more easy.

That said, GitHub does let you use a web-based editor to create a git branch behind the scenes and open a pull request, with no VCS client required at all. So that's a point in GitHub's favor. (Again, it's not inherent to git, and a hypothetical CVSHub could do that, but GitHub exists.)


If you want to contribute to the Linux kernel, that's how it works as well. You just mail your patch and don't need to know anything about git.


Redox has been working on smaller, more BSD-like versions of the basic Unix utilities, also written in Rust:

https://github.com/redox-os/coreutils

I am excited to see this, as I was working on a similar project last year (rewriting the BSD userland in Rust), but mine is from pre-1.0 Rust, so it's not really as idiomatic as what is coming out of the Redox project.


From the README: "These are based on BSD coreutils rather than GNU coreutils as these tools should be as minimal as possible."

If all of these utilities are running in userspace, per Redox's microkernel architecture, then what is the advantage of intentionally not making them feature-rich?


It has the advantages that come from following the UNIX style. To learn more, I recommend reading this short paper by Rob Pike and Brian Kernighan, published in 1984: http://harmful.cat-v.org/cat-v/unix_prog_design.pdf


"Those days are dead and gone and the eulogy was delivered by Perl" - Rob Pike (2004)


Rewriting coreutils is neat, but a project I'd really look forward to would be a strict POSIX base expanded with warnings or errors on valid-but-risky constructs, e.g. echo -n. Even more so if it included a shell (with static analysis of useless uses and dangerous patterns). That would make writing cross-shell scripts much easier.


Ok, I'll bite. What's wrong with "echo -n"?


http://www.in-ulm.de/~mascheck/various/echo+printf/ is the raw data version.

POSIX defines echo as taking only string parameters and no options, but notes that behaviour in the face of `-n` is implementation-defined.

BSD and GNU echo implement `echo -n` as suppressing the trailing newline, but `echo` commonly resolves to a shell builtin which may or may not follow that behaviour (and may switch behaviour depending on whether the shell is in "sh mode" or not), so `echo -n` could print `-n<newline>` or nothing whatsoever (empty string and suppressed newline) depending on the utils set, the shell, and the shell's run mode.

GNU echo also supports the -e, -E, --version and --help options, and much like -n, shells and other utils sets may or may not support these.

For instance on my machine (OSX 10.11)

* zsh (builtin) interprets -e, -E and -n as options (but not version or help)

* bash (builtin) also does, except when invoked as sh in which case it does not and all parameters are literal (this may also apply to zsh)

* dash (builtin) interprets -n, but none of the others. bash note may also apply to it.

* BSD echo interprets -n but will print -e, -E, version and help literally

* GNU echo interprets all of the above

So if you use echo with any non-literal parameter, or with one of the parameters listed above, in a script you distribute to uncontrolled third parties as an sh script (rather than, e.g., a bash or zsh script specifically), you will suffer from portability issues.

And that's just for measly trivial echo (and incidentally why you should always use printf rather than echo in scripts you try to make portable).


The only operating systems that are certified to conform to POSIX are old Unix operating systems. GNU has always been non-POSIX. And honestly, a POSIX implementation would be more trouble than it's worth.

The point of coreutils is to have utilities that give you the ability to write scripts for and interact with your operating system, right? Well, what operating system?? A POSIX-compliant one? Or just a mostly-POSIX-compliant one? Or one with POSIX extensions? How would your utilities know the difference? How would the OS know how to deal with these utilities? Would the user know the difference?

Ultimately, each platform has quirks, and it is up to the developer to port and test their script or application to a platform and make any necessary changes. This extends to far more than just POSIX compliance.


That's not a question of certification or extensions; I'm talking about being able to write portable scripts. Due to the interaction between its definition and its extensions, echo is a prime example of being impossible to use portably (except in the very restricted case of only literal strings without escapes which don't start with a -).

> Ultimately, each platform has quirks, and it is up to the developer to port and test their script or application to a platform and make any necessary changes.

That is not humanly feasible, and that's why specifications exist. You can't "port and test" your script on a platform which doesn't even exist yet, but if you follow the specification and the platform implements it (assuming it does so correctly), your scripts will run.


Ok, I think I understand the confusion now. You seem to be of the impression that shell scripts are like bytecode executed in a virtual machine. That is the only way I know of that you could write an application for a platform that doesn't exist and expect it to work. But even for that to work, it would need to be the same VM, and bytecode generated by & for it, or there's still no guarantee it will work.

Of course, you can already write a shell script for a particular shell, distribute that shell to that system, and depend on the shell to properly execute your script [by using internal functions only]. But that defeats the whole purpose of following a standard like POSIX, or caring at all how any given platform's 'echo' program works.

Bottom line, though: two independent implementations of a standard provide no guarantee they will work together. Practice over a couple decades shows this to be the case.


> Ok, I think I understand the confusion now. You seem to be of the impression that shell scripts are like bytecode executed in a virtual machine.

What in bloody hell are you talking about?

> Of course, you can already write a shell script for a particular shell

Which is irrelevant to my comment as that's not what portability means.

> But that defeats the whole purpose of following a standard like POSIX

Exactly.

> Bottom line, though: two independent implementations of a standard provide no guarantee they will work together.

If following a standard can't ensure your program will work on two different implementations of the standard, you don't have a standard; you have decorated toilet paper.

Which is more or less what the "commands and utilities" part of POSIX is.


Whoever wrote the POSIX spec was too conservative. -n should mean no newline(s). Anything else is wrong.


  ECHO(1)                 FreeBSD General Commands Manual                ECHO(1)
  
  NAME
       echo — write arguments to the standard output
For this definition, -n should mean merely a sequence of two bytes to be written to stdout. Why not just use printf instead? It is way more flexible and powerful, and "printf x" always prints exactly "x", with no trailing newline.


The POSIX spec is basically the minimal intersection of all commercial UNIX distros of its day. It defines a minimal system that happened to be what everybody already had implemented. For the most part you are allowed to add features to a POSIX base to make a real OS, since that's what everybody had.

This is also why the old Windows POSIX subsystem was so useless. It implemented only the minimal amount needed to check off the box on a feature list, and none of the stuff you need to actually make a system usable.

That said, POSIX has a fair bit of braindamage baked in and people can be excused for ignoring the worst parts and instead doing the right thing.


The FreeBSD manpages have:

    Note that the -n option as well as the effect of
    `\c' are implementation-defined in IEEE Std
    1003.1-2001 (``POSIX.1'') as amended by Cor. 1-2002.
     -- http://www.freebsd.org/cgi/man.cgi?echo
So there's probably some system (or shell) out there where -n doesn't work?


On some Unix variants that will print "-n" followed by a newline. On others it will print the empty string. http://steve-parker.org/sh/echo.shtml


If you want to suppress non-compliant behavior, set the POSIXLY_CORRECT environment variable. You can also specify the version of POSIX to comply with, because the standards aren't always standard.

https://www.gnu.org/software/coreutils/manual/html_node/Stan... https://www.gnu.org/software/coreutils/faq/coreutils-faq.htm...
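
For example (with GNU coreutils; a quick illustration):

    $ export POSIXLY_CORRECT=1   # GNU utilities now stick to POSIX behavior
    $ df                         # e.g. df reports sizes in 512-byte blocks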

I would hope your cross-shell scripts also conform to a particular shell script language, and also reset all environment variables which change the behavior of various commands. Not that endianness is ever a worry with a shell script...

(Also note that POSIX supports printf, which you can use to insert any character string you like, basically)


What exactly is this thing with 3 people commenting "I'll bite"?

How is that the most obvious response to parent's observation about echo?


The OP is "luring" people with a small amount of information, i.e. "bait", and other people "bite", as a fish would, to learn the rest of the info.


Yeah, I know the phrase. What I find difficult to believe is that three random people gave the same response to his simple two-line comment.

I mean, we see tons of comments on HN with a "small amount of information" or cryptic references that might similarly puzzle people, but they don't attract nearly as many "I'll bite"s.


"I'll bite" is a pretty common response when someone says something without elaboration that you sense they really want to follow up on. Maybe its a regional thing, but I'm having a bit of a problem thinking of another phrase.


Alright, I'll bite...

Why is echo -n risky?


I'll bite


Though I like the ideas behind Rust, the code is awful. If you look at the code you'll see a lot of mysterious symbols (as in Perl) and strange constructs like wrap, unwrap, Arc, etc. They make the code less readable.

Also, Rust doesn't have exceptions, so you have to wrap almost every function call in let/match/Ok/Err. Ugly.

I looked at one random file which turned out to be a `du' command implementation: https://github.com/uutils/coreutils/blob/master/src/du/du.rs...

Are they really starting a new OS-level thread for every directory found? Looks like an easy way to exhaust system resources to me. Also, I don't see the code that would collect error information if the thread panics.


>Also, Rust doesn't have exceptions, so you have to wrap almost every function call in let/match/Ok/Err. Ugly.

Respectfully, this is pure nonsense.

First of all, Rust does not have a runtime, and AFAIK you need a runtime to manage stack unwinding in order to provide exceptions.

Second, not every language should be like the high-level languages; there is no rule that it must be like C#, Java, Python, etc. I use those a lot in my work when I need something simple done, but Rust is designed for low-level work, and I cannot understand how not having exceptions makes a language ugly (especially when you're writing low-level code).


I mean, if you have to write that match construct around every function call, the code quickly gets bloated, doesn't it? And that is not good.


> I mean, if you have to write that match construct around every function call

Not quite.

1. you're supposed to handle errors around function calls which can fail, which is a strict subset of "every function call"

2. Rust has a number of higher-order constructs to facilitate that handling[0][1][2], not just raw `match` statements or expressions (see the sketch after the footnotes).

That aside, for Rust explicit error handling is considered a feature both at the language level (it allows for fewer runtime requirements and much stronger guarantees — check out exception-safe C++ for what happens when low-level meets exceptions) and at the user level (it forces a conscious and explicit decision, whether that's crashing the system, handling the error or passing the ball upwards).

> the code quickly gets bloated, doesn't it?

Does C code quickly get bloated? Because you're also supposed to check for error codes after each function call which can fail, and C doesn't provide much abstractive power to mitigate that.

[0] http://doc.rust-lang.org/std/result/enum.Result.html

[1] http://doc.rust-lang.org/core/macro.try!.html

[2] https://github.com/rust-lang/rfcs/pull/243
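
To sketch 2. concretely (a minimal, illustrative example using the try! macro from [1]; the function and path are made up):

    use std::fs::File;
    use std::io::{self, Read};

    // Each fallible call either yields its Ok value or early-returns the Err;
    // no explicit match statements needed.
    fn read_config(path: &str) -> io::Result<String> {
        let mut file = try!(File::open(path));
        let mut contents = String::new();
        try!(file.read_to_string(&mut contents));
        Ok(contents)
    }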


Rust has macros, so boilerplate can be swept up pretty tidily. In the case of propagating errors up, there is the try! macro, which encapsulates a match that returns early with the error.
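
Roughly, a hand-written equivalent of what try! encapsulates (simplified; the real macro also passes the error through From::from for conversion):

    use std::fs::File;
    use std::io;

    fn open_log() -> io::Result<File> {
        // equivalent of `let f = try!(File::open("log.txt"));`
        let f = match File::open("log.txt") {
            Ok(f) => f,
            Err(e) => return Err(e),
        };
        Ok(f)
    }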


The try! macro results in code that looks pretty similar to exceptions: https://doc.rust-lang.org/std/macro.try!.html


Yeah, error handling bloat is a problem.

You can use convenience functions: expression.expect("panic with this message if expression evaluates to an Err").

You can also use macros: try!(expression) makes it so that if expression evaluates to an error, it returns the error immediately; otherwise it evaluates to the Ok value.

This approach is still more verbose than using exceptions, but on the flip side, dealing with errors up front can make it easier to write reliable code.
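
For instance (illustrative; the file name is made up):

    use std::fs::File;

    fn main() {
        // expect: unwrap the Ok value, or panic with this message on Err
        let _f = File::open("du.rs").expect("could not open du.rs");
    }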


Comment OP's stance is valid. "Magic" in this context means keywords or symbols that are not immediately clear to programmers who don't work in Rust. One of the reasons golang is so successful is that there is very little magic in the syntax, and even where there is, it's fairly easy to grok (an example would be the `go` keyword).

FWIW I also share their opinion that Rust is unapproachable.


> One of the reasons golang is so successful is that there is very little magic in the syntax, and even where there is, it's fairly easy to grok (an example would be the `go` keyword).

Do you have a specific symbol you would like to change in Rust, and what would you like to change it to?

The only example I've seen (in a child comment to yours) is effectively a complaint that Rust has lifetimes and Go doesn't, which amounts to saying "you should have a garbage collector like Go does", an argument against a fundamental design decision of Rust. If you want to argue that you should always use a garbage collector, argue that directly instead of making vague negative comparisons between Rust's and Go's syntax.


I'm definitely a Rust fanboy, but the single-quote syntax for lifetime annotations can be irritating. Several editors I've used automatically insert a second quote to match, and I am frequently unable to disable that behavior without losing all paired delimiter insertion (like for parentheses or braces).

It's a minor quibble to be sure, but it's the only language symbol that bothers me when writing Rust. Not sure what I'd suggest replacing it with...backtick, maybe? Pipe? @? ~?

There aren't many other special characters on a QWERTY keyboard that aren't already used in Rust. That, I think, gets at one of the stumbling blocks I see in the various Rust syntax bikesheds among those who haven't worked in the language: it's just alien until you've used it a bit, especially if you're writing a lot in pseudocode-y dynamic languages.
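
For reference, the syntax in question (a toy example):

    // 'a is a lifetime parameter: the returned reference is only valid
    // for as long as the slice it was borrowed from
    fn first<'a>(items: &'a [i32]) -> &'a i32 {
        &items[0]
    }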


Rust used to have @ and ~ sigils* but they were removed for legibility reasons and because of the difficulty of typing them on European keyboards.

* For managed and owned pointers, respectively.


We had a big debate about it back in the day, and ' won as it's about as visually lightweight as you can get. I think the other characters you suggested would invite even more Perl comparisons.


> keywords or symbols that are not immediately clear to programmers that don't work in rust

That's an argument against introducing new notation for anything. That can't be right.


Golang has some magic too - for example, the magic with capital letters in exported symbols. But generally golang is easier to read than a line like this:

> static NAME: &'static str = "du";


Because Rust has lifetimes and Go doesn't, Go doesn't need syntax for them. If you want to argue that Rust should use a garbage collector like Golang does (which entails arguing that everybody who is using Rust is wrong for not wanting an always-on GC), I'm happy to have that argument, but say so explicitly.
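
For what it's worth, unpacking that line:

    // `&'static str` is a string slice valid for the whole program;
    // string literals live in the binary itself, so they are always 'static.
    static NAME: &'static str = "du";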


There aren't any mysterious symbols left in Rust that aren't in C++. Can you name one? The Perl comparison doesn't make sense to me.

Also, idiomatic code doesn't match on errors; it uses try!. And nightly now supports a ? syntax that's even shorter.
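
i.e. (nightly-only behind a feature gate at the time of writing; illustrative):

    use std::fs::File;
    use std::io;

    // (with #![feature(question_mark)] at the crate root on nightly)
    fn open(path: &str) -> io::Result<File> {
        // `?` is sugar for try!: unwrap the Ok, or early-return the Err
        let f = File::open(path)?;
        Ok(f)
    }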


Just wrt the let/match/Ok/Err == ugly comment: I don't know Rust very well and I haven't looked very closely at the code in question, but why would Rust force you to wrap almost every function call in let/match/Ok/Err? I am assuming the issue at hand is calling a function that returns a Result (if that's the name of the parent of Ok/Err)? Couldn't one let the return values 'flow up'? For example: if a function starts returning Results and you don't handle them at the level above, you start returning Results too. Also, I suppose there are ways of flattening results to avoid nested Results (I would guess and_then does that, based on its signature?). In this scenario, you could argue that you now replace match with map at every function call where you don't actually handle the Err, and that this is ugly too; but the alternative is hoping that the caller happens to have read your source code (or has checked exceptions), which is, arguably (to be diplomatic), the situation with unchecked exceptions. Then what happens when you change the code, and so on and so on...

This way you achieve something similar to checked exceptions, and in addition gain composability of Err (and the like), without adding another language construct (which supporting exceptions would require). I have a great deal of experience with other languages that do this (i.e. functional coding), and it's worked out pretty well, at least for me, once you get the hang of it.

From where I am sitting, that would be a wise choice for Rust, at least if I am understanding it correctly, since it tries to be a safer alternative to other systems languages. However, arguably (:), it is OK for dynamic languages and the like, which trade correctness for conciseness and, arguably (again :), speed of development, to simply have unchecked exceptions.
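
If I've understood the docs right, that 'flowing up' looks something like this (hypothetical helpers; treat it as a sketch):

    // hypothetical helpers that each return a Result
    fn parse(s: &str) -> Result<i32, String> {
        s.parse().map_err(|e| format!("{}", e))
    }

    fn validate(n: i32) -> Result<i32, String> {
        if n > 0 { Ok(n) } else { Err("must be positive".to_string()) }
    }

    fn main() {
        // and_then flattens what would otherwise be a nested Result
        let doubled = parse("42").and_then(validate).map(|n| n * 2);
        println!("{:?}", doubled); // prints Ok(84)
    }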

Also, it is possible I misunderstood your comment and/or how that code was ugly, in which case I hope the downvotes/replies won't be too harsh :)

EDIT: typo/clarity


Yeah, codedokode is wrong. Rust lets you keep returning Results as you described. You don't even need to write a match statement yourself: there's a built-in try! macro which will do an early return from a function if a function call returns an Err value. And soon there will be a new ? operator that does this inline, so you don't even need the macro.

What you end up with are much more flexible "exceptions" that don't need a lot of extra compiler support.


Yes, Rust lets you flatten the results.

