
I am still skeptical that "excessive use of unsafe" is actually a thing happening in Rust. Almost all the unsafe I see is for FFI (either interfacing with a library or with OS primitives). There's a bunch of it for implementing data structures and the like, and extremely little unsafe being used "for performance". Off the top of my head nom and regex do this in a few places, and that's about it. Grepping through my cargo cache dir seems to support my assertion; most of the crates there are FFI (the vast majority) or abstractions like parking_lot/crossbeam/petgraph.

I agree that we should avoid unsafe as much as possible and be sure that unsafe blocks are justifiable (with stringent criteria on justification). I just don't think this is currently a problem in the community.

It's good to be wary though :)




You keep making that claim without backup. Two days ago I posted links to extensive use of "unsafe" in matrix libraries. (Some of that code was clearly transliterated from C. Raw pointers all over the place.) That's entirely for performance; all that code could be safe, at some performance penalty.

I'd suggest using only safe code for whatever matrix/math library gets some traction, and then beating on the optimizer people to optimize out more checks.
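For a lot of math code the optimizer can already elide the checks if the code is written in terms of iterators rather than indexing; a minimal safe sketch (the function name is illustrative, not taken from any real matrix crate):

```rust
// Safe dot product: the zip-based iterator chain gives LLVM enough
// information to elide per-element bounds checks, so no unsafe
// indexing is needed to get a tight inner loop.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    // zip stops at the shorter slice, so out-of-bounds access is
    // impossible by construction.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    assert_eq!(dot(&a, &b), 32.0);
}
```

Whether this actually produces check-free code depends on the optimizer seeing through the iterator adapters, which is exactly the kind of thing one would want to pressure the compiler team about.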


I just gave you backup; I grepped my whole .cargo cache dir (both the one used by servo and my global one). You have also made your claim without backup -- you have repeatedly claimed that this is an endemic problem in Rust, with only individual crates (most of them obscure) to back it up, and I usually only make my claim in response to claims like yours -- the burden of proof is on you. Anyway, I do provide some more concrete data below, so this isn't something we should argue about.

Matrices fall under the abstraction umbrella IMO. This is precisely what unsafe code is for. However, I totally agree that we should be fixing this in the optimizer, with some caveats. I'm surprised it doesn't get optimized already for stack-allocated matrices. I'm wary of adding overly specific optimizations, because an optimization is as unsafe as an unsafe block anyway; it just exists at a different point of the pipeline. If there's a general optimization that can make it work I'm all for it (for known-size matrices there should be, I think), but if you need an optimization specific to one use case, IMO it's just better to use unsafe code.
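The unchecked-indexing pattern being debated looks roughly like this (a hypothetical sketch, not code from any crate mentioned in the thread):

```rust
// Unchecked indexing: the caller's bounds proof replaces the
// per-element runtime check. The code is exactly as trustworthy
// as the justification for the unsafe block -- which is why
// stringent justification criteria matter.
fn sum_row(matrix: &[f64], cols: usize, row: usize) -> f64 {
    // One up-front check stands in for `cols` per-element checks.
    assert!(matrix.len() >= (row + 1) * cols);
    let mut total = 0.0;
    for c in 0..cols {
        // SAFETY: row * cols + c < (row + 1) * cols <= matrix.len(),
        // established by the assert above.
        total += unsafe { *matrix.get_unchecked(row * cols + c) };
    }
    total
}

fn main() {
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3, row-major
    assert_eq!(sum_row(&m, 3, 1), 15.0);
}
```

A sufficiently general optimization would hoist the bounds checks out of the loop itself, making the safe indexed version compile to the same thing.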

The raw pointers thing is a problem, but bad crates exist. They don't get used.

I recently did start auditing my cargo cache dir to look for bad usages of unsafe, especially unchecked indexing, prompted by your recent comments -- I wanted to be sure. This is what I have so far: https://gist.github.com/Manishearth/6a9367a7d8772e095629e821...

That's a list of only the crates containing unsafe code in my global cargo cache (this contains most, but not all, of the crates used by servo -- my servo builds use a separate cargo cache for obsolete reasons, but most of those deps make it into the global cache too whenever I work on a servo dep out of tree)

I've removed dupe crates from the list. I have around 600 total crates in my cache dir, these are just the ones containing unsafe code.

Around 70 of these crates use unsafe for FFI. Around 30 are abstractions like crossbeam and rayon and graphs.

I was surprised at the number of crates using unchecked indexing and unchecked utf8. I suspected it would be less than 10, but it's more like 20. Still, not too bad. It's usually one or two instances of this per crate. That's quite manageable IMO. Though you may want to be stricter about this and consider those numbers to be problematic, which I understand.
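The "unchecked utf8" pattern in question is typically `str::from_utf8_unchecked` applied to bytes the crate believes it has already validated; a minimal illustration of the shape (the function is made up for the example):

```rust
use std::str;

// The checked conversion walks every byte to validate it; the
// unchecked one is a free cast whose correctness rests entirely
// on the caller's reasoning.
fn ascii_digits_to_str(bytes: &[u8]) -> Option<&str> {
    if bytes.iter().all(|b| b.is_ascii_digit()) {
        // SAFETY: ASCII digits (0x30..=0x39) are always valid UTF-8.
        Some(unsafe { str::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}

fn main() {
    assert_eq!(ascii_digits_to_str(b"12345"), Some("12345"));
    assert_eq!(ascii_digits_to_str(b"12a45"), None);
}
```

An audit mostly comes down to checking whether the validation the SAFETY comment relies on actually happens on every path.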

I bet you're right that many of these crates can have the unchecked indexing or other unsafe code removed (or, the perf penalty is not important anyway). I probably should look into this at some point. Thanks for bringing this to my attention!


I looked at a few.

"itoa" is clearly premature optimization. That uses an old hack appropriate to machines where integer divide was really expensive, like an Arduino-class CPU. It's unlikely to help much on anything with a modern divide unit.

"httpparse", "idna", "serde-json", and "inflate" should be made 100% safe - they all take external input, are used in web-facing programs, and are classic attack vectors.

Not much use of number-crunching libraries; that reflects what you do.

I'll look at some more later. How to deal effectively with incoming UTF-8, especially bad UTF-8, may need some thinking.


I maintain two of the crates you called out so here is a bit more detail on the use cases:

"itoa" is code that is copied directly from the Rust core library. Every character of unsafe code is identical to what literally everybody who uses Rust is already running (including people using no_std). Anybody who has printed an integer in Rust has run the same unsafe code. It is some of the most widely used code in Rust. If I had rewritten any of it, even using entirely safe code, it would be astronomically more likely to be wrong than copying the existing code from Rust. The readme contains a link to the exact commit and block of code from which it is copied.

As for premature optimization, nope it was driven by a very standard (across many languages) set of benchmarks: https://github.com/serde-rs/json-benchmark

"serde_json" uses an unsafe assumption that a slice of bytes is valid UTF-8 in two places. This is either for performance or for maintainability, depending on how you look at it. Performance is the more obvious reason but in fact we could get all the same speed just by duplicating most of the code in the crate. We support deserializing JSON from bytes or from a UTF-8 string, and we support serializing JSON to bytes or to a UTF-8 string. Currently these both go through the same code path (dealing with bytes) with an unchecked conversion in two important spots to handle the UTF-8 string case. One of those cases takes advantage of the assumption that if the user gave us a &str, they are guaranteeing it is valid UTF-8. The other case is taking advantage of the knowledge that JSON output generated by us is valid UTF-8 (which is checked along the way as it is produced).

Here again, both of those uses are driven by the benchmarks in the repo above and account for a substantial performance improvement over a checked conversion.


"serde_json" uses an unsafe assumption that a slice of bytes is valid UTF-8 in two places. This is either for performance or for maintainability, depending on how you look at it. Performance is the more obvious reason but in fact we could get all the same speed just by duplicating most of the code in the crate.

Could that be done safely with a generic, instantiated for both types?


Yes, that is what we already do. The two unsafe UTF-8 casts are the two critical spots at opposite edges of the generic abstraction where the instantiation corresponding to UTF-8 string needs to take advantage of the knowledge that something is guaranteed to be valid UTF-8.

What we have is as close as possible to what you suggested.
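A schematic of that shape (hypothetical names, not serde_json's actual internals): the core works entirely on bytes, both outputs share it, and only the string instantiation performs an unchecked cast at its edge, justified by the invariant that the core only ever emits valid UTF-8.

```rust
// Shared core: works on bytes; both output types go through it.
// It copies the bytes of a valid &str verbatim and only inserts
// ASCII, so its output is valid UTF-8 by construction.
fn write_escaped(out: &mut Vec<u8>, s: &str) {
    for b in s.bytes() {
        if b == b'"' || b == b'\\' {
            out.push(b'\\');
        }
        out.push(b);
    }
}

// Bytes instantiation: no unsafe needed.
fn to_bytes(s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(b'"');
    write_escaped(&mut out, s);
    out.push(b'"');
    out
}

// String instantiation: the unchecked cast lives at the edge.
fn to_string(s: &str) -> String {
    let bytes = to_bytes(s);
    // SAFETY: write_escaped preserves the input's UTF-8 validity
    // and only adds ASCII quote/backslash bytes.
    unsafe { String::from_utf8_unchecked(bytes) }
}

fn main() {
    assert_eq!(to_string(r#"a"b"#), r#""a\"b""#);
}
```

The checked alternative, `String::from_utf8(bytes).unwrap()`, revalidates the entire output buffer, which is the cost being avoided.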

As I mentioned, we could get rid of the unsafe code in other ways without sacrificing performance. Ultimately it is up to me as a maintainer of serde_json to judge the likelihood and severity of certain types of bugs and make tradeoffs appropriately. There are security-critical bugs we could introduce using only safe code, for example if you give us JSON that says {"user": "Animats"} and we deserialize it as {"user": "admin"}. My judgment is that using 100% safe code would increase the likelihood of other types of bugs (not related to UTF-8ness) and the current tradeoff is what makes the most sense for the library.

From another point of view, performance and safety are synonyms in this case, not opposites. If we use 0.1% unsafe code and perform faster than the fastest 100% unsafe C/C++ library (which is what the benchmarks show for many use cases) then people will be inclined to use our 0.1% unsafe library. If we give up unsafe but sacrifice performance, people will be inclined to use the 100% unsafe C/C++ alternatives.


Yeah, this was basically my conclusion too.

I'm somewhat okay with the parsing ones using unsafe if we can be very sure that the unsafe code actually has a performance impact, and be very careful about it. Some of them already do this, but not all.



