According to this comment: https://old.reddit.com/r/rust/comments/p0ul6b/when_zero_cost... , `rustc`/`llvm` is indeed able to optimize the wrapped example into one big `memset`. Sure, in this specific case that's different from the first example, which delayed the zeroing, but it does not unintelligently `clone` many times as claimed in the article.
I don't really feel like this is an issue of "zero cost abstractions". The baseline expectation should be that it will do something reasonable, which in this case is probably "if Copy, do bit-setting with memset/bzero where possible; otherwise a clone() loop", which is what it appears to do according to a reddit comment linked elsewhere in the thread.
I'm not really sure it follows, though, that you should ever expect more than that. It's nice if you can get it but the presence of a very specific optimization shouldn't be taken as the new "zero cost" baseline, and I don't think it should be taken as given that a newtype will inherit all possible optimizations of its enclosed type.
I think a proper solution would be to expose the trait `IsZero` [0], as that is what is used to figure out when it is correct to use `calloc` instead of having to do `malloc` -> `memset` [1].
I don't know if there are any plans for exposing that trait, but it would be a nice thing to see.
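For reference, the internal trait is roughly this shape (a sketch modelled on the standard library's private `IsZero`; details may differ):

    // Unsafe because a wrong answer would let Vec hand out calloc'd
    // memory as if it were a properly initialized value.
    unsafe trait IsZero {
        /// Is this value's in-memory representation all zero bytes?
        fn is_zero(&self) -> bool;
    }

    unsafe impl IsZero for u8 {
        fn is_zero(&self) -> bool {
            *self == 0
        }
    }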
(As you likely know) Normally in Rust a trait is a gentleman's agreement. If my type does not have total equivalence then I don't implement the Eq trait, and so if a type has Eq then it's promising to deliver total equivalence. You can, in fact, just defy this and make types that do something "completely wrong" while implementing the syntax of the trait, and I provide a suite of them (the misfortunate crate) so that other people don't need to do that experiment.
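For example, a lawless implementation in the spirit of misfortunate's Always might look like this (a sketch, not the crate's actual source):

    // Implements the syntax of PartialEq/Eq while breaking the semantics:
    // every Always compares equal to everything, so Eq's promise of a
    // total equivalence relation is a lie the compiler cannot detect.
    #[derive(Clone, Copy, Debug)]
    struct Always;

    impl PartialEq for Always {
        fn eq(&self, _other: &Self) -> bool {
            true // "equal", no matter what
        }
    }

    impl Eq for Always {}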
However, Rust does have two other tricks up its sleeve:
1. Unsafe traits. You can declare a trait to be "unsafe". Implementations of this trait likewise need the "unsafe" keyword, and it alerts the programmer that, as with unsafe function calls, they are responsible for actually getting the implementation details correct. Not many standard library traits are unsafe because that's a considerable burden, but a few are, including Send and Sync (see the sketch after this list).
You wouldn't learn much by implementing unsafe traits wrongly in a library like misfortunate; it's the same lesson as a C++ library that scribbles on random memory addresses.
2. Rust knows whether you implemented a trait by hand or merely derived it. In a few cases it can be reasonable for the language to distinguish these cases for built-in traits, since in the latter case it knows the derived trait does what is needed, whereas in the hand-implemented case the compiler has no idea whether your implementation has the desired properties.
It is possible that future improvements to Const Generics will rely on this latter idea. So long as you derive Eq rather than implementing it, Rust can reason that your type really is suitable as a constant type parameter whereas misfortunate::Always is not suitable (it would cause mayhem if permitted because of its idea of what "equality" is) despite claiming to be Eq.
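A minimal sketch of what point 1 looks like in practice (the trait name is invented for illustration):

    // Declaring the trait `unsafe` moves the burden of proof to
    // implementors: the compiler cannot check the contract, only that
    // each implementor acknowledged it with the `unsafe` keyword.
    unsafe trait AllZeroesValid {
        // Contract (unchecked): the all-zeroes bit pattern is a valid Self.
    }

    // The implementor asserts the contract holds for u8.
    unsafe impl AllZeroesValid for u8 {}

    // Safe code may now rely on the contract without re-checking it.
    fn zeroed<T: AllZeroesValid>() -> T {
        // SAFETY: guaranteed valid by the AllZeroesValid contract.
        unsafe { std::mem::zeroed() }
    }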
There needs to be some thought about how it works with enums too. Rust currently has no built-in standard for enum defaults (you have to implement Default yourself). It would be nice to somehow mark one of the variants to be represented with the 0 discriminant (like how Option::None is implemented).
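What you can do today is pin explicit discriminants and write Default by hand; a minimal sketch (names assumed):

    // Field-less enums can already fix their discriminants, pinning Idle
    // to the all-zeroes bit pattern...
    #[repr(u8)]
    enum State {
        Idle = 0,
        Busy = 1,
    }

    // ...but Default must still be implemented manually, and nothing ties
    // the zero discriminant to "this is what zeroed memory decodes as".
    impl Default for State {
        fn default() -> Self {
            State::Idle
        }
    }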
Yes, I can see uses for this. Forty-two is obviously not something you'd actually want, but I can see needing numbers in the range 0-100 and being frustrated that, since I need 0, I can't carve out a niche, even though I don't mind losing 101 through 255 from a u8 instead of making my type wider.
Still, the existing tricks get a lot done for relatively little work. I have used Option<NonZeroUsize> to do roughly what I'd use a single integer for in C, but making explicit that zero isn't just zero, but "I dunno, invalid". Would I have done that even if it cost more space? Probably, but it was cool that I didn't even need to consider that. "Zero cost abstraction".
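The "didn't cost more space" part is easy to check (a small demonstration):

    use std::num::NonZeroUsize;

    // The niche optimization: Option<NonZeroUsize> reuses the forbidden
    // all-zeroes bit pattern to encode None, so the Option adds no bytes.
    fn main() {
        assert_eq!(
            std::mem::size_of::<Option<NonZeroUsize>>(),
            std::mem::size_of::<usize>()
        );
    }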
Yeah, something like this would be how I would like to see it; the trait is already there and is pretty much that (without the Copy bound, though, as that is not needed). Hopefully it will end up public and implementable with a derive.
Knowing that they usually are at the end doesn't help, since you still need to write a general runtime check that expresses "are all the bits for this type zero or uninit". AFAIK there is currently no way to do this without causing UB; it would need some new intrinsic.
> The baseline expectation should be that it will do something reasonable
Surely the baseline is that it does something correct. Reasonable (or just "not pathologically stupid", if you separate the two, with reasonable next) comes after. To my mind an optimising compiler is a compiler first, an optimiser second.
> the desire to optimize one specific case while neglecting the general case
This feels like a recurring pattern in Rust's design whether we're talking about performance, what's accepted by the compiler, etc. I'm not sure it's a bad thing exactly, but it leads to a lot of unintuitive footguns. Of course the consequence of these footguns is de-optimization or a compiler error, instead of a runtime exception or a memory error. But nevertheless it can make for a confusing landscape of behavior to navigate.
Should they refrain from spot-optimizing common cases where they can, just for the sake of predictability? No, I don't think so. And so I don't know what the answer is. I just often (and even more so when I was new to it) find myself surprised by Rust's general-rules-that-actually-have-notable-exceptions. Though who knows, maybe it's just not possible to make a language this powerful that doesn't run into this problem.
> Should they refrain from spot-optimizing common cases where they can, just for the sake of predictability? No, I don't think so. And so I don't know what the answer is. I just often (and even more so when I was new to it) find myself surprised by Rust's general-rules-that-actually-have-notable-exceptions. Though who knows, maybe it's just not possible to make a language this powerful that doesn't run into this problem.
I think the industry as a whole has already accepted "try to optimize common cases on a best-effort basis, at the cost of unpredictable performance cliffs in unusual cases" as the default paradigm.
From multi-tier JITs and auto-vectorization all the way down to branch prediction, source code is a leaky abstraction. If you don't care about performance, you still get the benefits most of the time. If you do care, then you need to peel back the abstraction and have a good understanding of what the compiler/JIT/CPU does behind the scenes.
As explained in the blog post, the first one can be optimized via specialization into a single calloc call. The other cannot use the same specialization, as the standard library does not seem to be able to specialize on the type of the iterator yet. This means it will be a malloc followed by a memset when compiled.
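There is a workaround, though: something like the following (a sketch, assuming the article's newtype is a `WrappedByte(u8)`),

    struct WrappedByte(u8);

    fn make(n: usize) -> Vec<WrappedByte> {
        // Build the Vec<u8> first (this path hits the calloc
        // specialization), then wrap each element on the way into
        // the new vector.
        vec![0u8; n].into_iter().map(WrappedByte).collect()
    }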
which will give you a list of WrappedBytes, but initialized with calloc. I assume this is because LLVM can see that the map is an identity function in this case and can then optimize it out.
If you write some of the "clever" pop count algorithms in C or C++, good compilers will go "Oh you're doing bit counting" and on a modern platform with a bit counting CPU instruction your whole algorithm turns into one CPU instruction.
[In Rust the standard library provides count_ones() on integer types, and that's actually safely wrapping an intrinsic pop count feature]
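A quick demonstration of that safe wrapper (on x86-64 with the popcnt feature this compiles down to a single instruction):

    fn main() {
        let x: u32 = 0b1011_0010;
        // count_ones() wraps the popcount intrinsic safely.
        assert_eq!(x.count_ones(), 4);
    }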
In this case it's a combination of compiler magic and specialization in the Rust standard library that makes it easy for the compiler to understand that this particular iterator shape is a no-op.
What does that post prove, except that "allocating memory that must be zeroed on use but is never used" is a million times faster than "allocate memory and zero it right away"?
Sure, I can see how the code doesn't immediately tell you the difference, but once you understand it, isn't it completely logical?
The title seems clickbaity and not accurate when you dig into the article.
As a Rust beginner, I would never assume code I write won't be executed, or will be cut out by the compiler. I know this sometimes happens, but I wouldn't count on it, for lack of understanding. I don't think that's the expectation of anyone who doesn't know the inner workings well, really.
The entire article could basically be boiled down to your first sentence + the two code examples.
And IMO the title is still inaccurate, because it attacks a general promised/desired property of Rust (and other languages), and I honestly don't see the connection. The observed behaviour looks like an unfortunate side effect of the `newtype` pattern and seems very amenable to fixing and optimization in future versions. It's not a failure of the "zero-cost abstraction" ideal, which is being chased after every day and is still in flux.
> nor should anyone expect that they do
I'm not a systems programmer, so I might be wildly off base in what I'm about to say. However, isn't it in fact expected of systems programmers that they do understand in detail what the compiler will do with their code?
Yeah, that was obviously not an optimization. You can't allocate 16 gigs in 5 micros. It might be problematic if you were relying on the lazy allocation of course.
The second example doesn't seem compelling to me either...is a newtype of a reference type a common application?
> The point of this post is not to bash the Rust team, but rather to raise awareness. Language design is a difficult process full of contradictory tradeoffs. Without a clear vision of which properties you value above all else, it is easy to accidentally violate the guarantees you think you are making. This is especially true for complex systems languages like C++ and Rust which try to be all things to all people and leave no stone of potential optimization unturned.
This paragraph is worth highlighting. Say yes to everything and vision becomes meaningless. Can't move in one direction when you're pulled by an infinite number of stakeholders on all sides.
This gets a strong second from me. I don’t particularly care for Rust, but that is personal preference. This point stands strong for a wide swath of languages (both compiled and interpreted). There is a seemingly endless pull to be all the things to all the people for a lot of languages. I would much prefer seeing a smaller set of highly focused languages with literally seamless interop between them.
The article is not about what you think it's about.
1. Turns out that zeroing u8 is a special case, so if you allocate 17 GB of your custom structure, Rust is gonna init that for you.
2. Okay, so you don't understand how RefCell works? The caller of this function should be holding r, not passing it into the function. Of course that will just correctly panic, so it's not wonderful. But that is the point of RefCell.
Generally more stuff runs fast in Rust than in other languages, but you can always find bad counterexamples in any language.
> 2. Okay, so you don't understand how RefCell works? The caller of this function should be holding r, not passing it into the function. Of course that will just correctly panic, so it's not wonderful. But that is the point of RefCell.
That’s not the interesting part. The point is that Rust won’t optimize a reference stored inside a struct the same way it will optimize a bare reference. It can’t assume that the reference will be valid for the whole function (as it can with a plain `&`/`&mut` parameter). This limits the amount of reordering the optimizer can do.
The example you were looking at was explaining why that is the case.
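A sketch of the shape being discussed (names are illustrative, not the article's code): rustc can emit noalias/dereferenceable attributes for top-level reference parameters, but not for a reference tucked behind a struct field.

    struct Wrapper<'a> {
        inner: &'a mut i32,
    }

    // `x` is a top-level &mut parameter: the compiler can tell LLVM it
    // is valid and unaliased for the whole call, enabling reordering.
    fn through_reference(x: &mut i32) {
        *x += 1;
    }

    // The same reference one level down: those guarantees attach to the
    // struct value, not the field, so the optimizer has less to work with.
    fn through_wrapper(w: Wrapper<'_>) {
        *w.inner += 1;
    }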
Ultimately I'm happy for Rust to only be so smart, of course there are going to be limitations. But you are right, the author was making a clear example and my comment doesn't make sense in that context.
> However, when v has an arbitrary type, it will instead just clone() it for every single element of the array, which is many orders of magnitude slower. So much for zero cost abstraction.
Arguably the WrapperType, i.e. any arbitrary type is the general case and thus the baseline performance. `u8` is the special case (hence specialization) that performs the alloc_zeroed optimization. So it's not really that the abstraction adds a cost. It's just that the special case removes a cost paid by everything else.
In the future the vec initialization might be fixable, but this requires turning potentially undefined values into some valid, if arbitrary, bit pattern (i.e. LLVM's freeze) so they can be compared against zero.
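Concretely, the two shapes under discussion look something like this (a sketch; `WrappedByte` stands in for the article's newtype):

    #[derive(Clone, Copy)]
    struct WrappedByte(u8);

    fn main() {
        // Specialized path: the all-zeroes value is recognized, so the
        // whole thing becomes a single alloc_zeroed (calloc) call.
        let a: Vec<u8> = vec![0u8; 1 << 30];

        // General path: allocate, then write every element (a memset at
        // best, a clone() loop at worst).
        let b: Vec<WrappedByte> = vec![WrappedByte(0); 1 << 30];

        assert_eq!(a.len(), b.len()); // keep both vectors observably alive
    }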
Most people consider "cost" in "zero-cost abstraction" to refer to runtime cost. That presentation is saying that if you let it refer to other kinds of cost too, then nothing is zero-cost, but to me that just means he picked a bad definition.
And even then, I've often heard it stated as "zero cost if you don't use it", e.g. exceptions have zero runtime cost unless an exception is actually thrown, so, given that they should only be thrown in exceptional cases, that really shouldn't impact your program. (Whether or not it's truly zero cost depends, I guess: e.g. does memory count as cost? What if it makes something not fit in cache?)
I love that exceptions are the classic case for this, because it's not even true that "zero-cost exceptions" are zero _runtime_ cost on the non-exceptional path. The most trivial example is that they block vectorization.
Well, exceptions shouldn't be the classic case here, because - as you say - they're typically not zero cost.
Zero-cost abstraction is what you get when your abstraction gets compiled away to nothing. Like properly written value wrappers in C++, or (presumably) newtype in Rust. These things disappear from final assembly.
The definition of zero-cost abstractions, as introduced in C++, was that "it can't be slower than the equivalent hand-written code" (i.e. code not using the abstraction).
In that regard, exceptions are interesting: if you're on the happy path (which should be 99.999% of the time; a normal program that uses exceptions as its error handling method should not encounter any exception if you do a `catch throw` on an average run of the software), they can cost less than return-value-based error handling (https://nibblestew.blogspot.com/2017/01/measuring-execution-...). If you're on a "sad path", though, they will cost more.
What is pretty clear is that since compilers learned to put the "sad" path in a .cold section, the code-size issue has become a 100% non-issue: the "sad" path won't bloat the hot, exception-less path. In my experience, exceptions are, in the cases that matter, a negative-cost abstraction.
That’s a disingenuous misinterpretation of what “zero-cost” means. Zero-cost refers to runtime performance of compiled code.
It’s also disingenuous to pretend there isn’t some downside to not using the abstraction. Obviously you need to evaluate the tradeoffs.
High level code itself is a sort of abstraction. We could write raw assembly all the time to remove that abstraction and associated costs, but clearly that’s not very productive.
Chandler is simply wrong here. Every single line of code has a cost. There's nevertheless massive benefit to that cost not occurring on millions of end-user machines and instead only once in the mind of the developer.
Personally, this is why I think specialization is a bad idea. It means that minor refactors can have very strange impacts on performance, and that every API now has an (often undocumented) set of types for which it is unexpectedly much more performant.
Wrapper types in C++ really are zero cost. Rust is simply broken here, sorry. That happens when you throw away 30 years of work to reinvent the wheel because you like variable names before types or something.
The 'zero cost' attitude is one of C++ problems. Zero cost abstractions are never really entirely zero cost. You still pay, often in something that you didn't bother to measure, which (in C++) is usually compile time or programmer productivity or programmer cognitive cost or compiler complexity.
There's a reason a lot of the complaints about C++ mention bloat or the difficulty of practically choosing a safe subset (you could use a subset, but you're very likely dependent on coworkers, existing codebases and library authors which may not share your subset). A lot of people added 'zero cost' stuff, and all of that has a cost.
For example, let's say Rust's newtype pattern worked perfectly everywhere, and all of the author's examples ran fine, just like u8, with no runtime cost. There would likely still be a small cognitive cost to learn this, a small cost to unwrap the types (for other programmers reading the code), and a tiny cost in compile time. Typedefs are worth it, and it's all very reasonable to pay this!
But there's still a cost, and when people never believe there's a cost a language ends up like C++.
'Zero cost' is generally meant to mean 'zero runtime cost'. For most workloads, an increase in compile times is okay for an optimized production build. In the rare cases where that's not enough, I find that using precompiled headers and splitting code into translation units helps significantly.
Yet many people never look at the other costs, so de facto they're assuming "insignificant costs for everything", which isn't true.
For example, C++'s template language as originally implemented was technically 'zero cost', but C++ programmers paid for it a lot for a long time, in inscrutable error messages and slow compile times (this was fixed to a large degree with modern implementations and standards).
People didn't understand that they were paying in programmer / compile time / bad error messages? I don't believe that for a second. Those costs are extremely visible. As visible as it is possible to be. They wouldn't get more visible if you dressed them in reflective jackets, slapped a police light bar on top, and turned on a siren to make sure everyone was looking in the right direction. Who undergoes that kind of suffering and doesn't even notice? Nobody.
In contrast, the benefits of Zero Cost Abstraction are quite subtle. "Why would I want that? Ever heard of Moore's law? Caring about perf is so 1990s!" goes the immediate thinking. If you never have to write high-perf code, that reasoning is even correct! Of course, there are still many places where performance does matter, and being able to use high level language features on the very innermost loops, the places that halve or quarter the throughput of your $6000 graphics card(s) if you carelessly toss in even a single call of overhead, is quite something to those in a position to take advantage.
Since the caveats of ZCA are obvious and the benefits are subtle, I think it's perfectly fair to use the term as a way to draw attention to the latter.
The problem with this viewpoint is that the baseline is pretty arbitrary. Using a newtype carries a conceptual cost of wrapping and unwrapping... but using u8 directly carries the conceptual cost of remembering what any given u8 means! What makes that the baseline and newtype wrapping a cost over that as opposed to using domain concepts in types as the baseline and using native types the cost?
Well, imagine that your code is serializing and deserializing someone else's type. You probably care more in that context about the underlying type and not the newtype, and the newtype doesn't help you.
Now, I did say this option was worth it. And there are cases of being too conservative (Golang?). But when designing and programming, one should look at the tradeoff. Abstraction is not always worth its costs [EDIT: and sometimes you can have less costs by a better abstraction, but being aware of the costs helps to think about the better abstraction].
That's exactly what should keep you awake at night. Because the conceptual cost is arbitrary and not meaningfully measurable in a quantitative sense, people tend to gloss over it; but though it's not measurable, that cost is no less real, and it might bubble up in the form of someone losing money or, in some cases, suffering physical harm.
Zero cost refers to runtime cost, as in the compiled code.
> You still pay, often in something that you didn't bother to measure, which (in C++) is usually compile time or programmer productivity or programmer cognitive cost or compiler complexity.
Obviously everything is about tradeoffs. If an abstraction is making everything worse for you, then don’t use it. I don’t think it’s fair to try to list every possible downside of an abstraction, as there are obviously also downsides to not using abstractions otherwise we wouldn’t have these options.
Zero cost refers to the compiled code and runtime performance.
'zero cost' is what makes some things possible. If your perspective is one where CPU and memory is effectively infinite then yes, who cares about zero cost. The last C++ project I worked on I had like 32k of storage and some kilobyte sized amount of RAM. If C++'s abstractions weren't zero cost it would have been impossible to use in this environment. Java, C#, Go -- none of these are even close to possible.
Why wouldn't I just use C? Because of improved programmer productivity and readability of the code.
The reason C++ exists as a language today (and why Rust is its direct competitor) is zero-cost abstractions. Because in many situations programmer productivity and compiler complexity are second to performance.
My takeaway wasn't 'never abstract'. That's obviously absurd. Even C itself has abstractions which it tries to make zero cost. Only that it's sometimes not worth it in the language, and C++ neglected to look at the other costs in the past.
No, that's the point of C++. If you want a language that doesn't do zero cost abstractions, there are plenty to choose from. If you want a language a step above C that gives you fine-grained control over exactly what code is generated and what memory is allocated, then you have very few choices.
You're basically just arguing that C++ shouldn't be C++ (and Rust shouldn't be Rust). We already have plenty of languages that provide high-level "costly" abstractions. The reason we need C/C++/Rust is for that zero-cost aspect.
Historically, there were much fewer choices for languages and arguably C++ has been used for building applications that didn't need zero-cost features. But now we have plenty of alternatives and C++ still exists for that valuable niche that it provides. This is the reason Rust was created -- to provide this level of control without the baggage of C++.
Bjarne and the C++ community are very clear on what the zero overhead principle is. If it isn’t holistic enough for you, that’s no fault of the principle. Your objection is either off-topic to what ‘zero cost’ means, or equivocation.
Why did a Rust defect become such a diatribe on C++? As far as I am aware, using a typedef/using type does not slow down zero-based initialisation of primitive types in C++ (last time I checked a few years ago; I will apologise if this has changed).
Sorry, got bit a bit too much by C++ templates and a few other things. Rust views itself as a successor language, and I wouldn't like it to make the same mistakes.
This is in fact mostly an example of a big problem with benchmarks. What's measured here is the difference between never using an object (so the actual work to prepare it can be elided, albeit by the OS rather than the programming language) and doing all the work up front in the expectation that you'll use it, but then not using it.
The u8 version turns into "Hey, Linux kernel, zero these pages if I ever read them" (and then the pages are never read), whereas the opaque type turns into a memset() zeroing all the pages.
Not doing any work is in fact a million times faster but your real program would have needed to do work, otherwise why bother having the vector? Whereupon the benefit disappears.
Where possible design your benchmarks to really do the thing you think you're measuring. If what you're measuring is nothing then be sceptical about supposed "performance" measured for that, since it's nothing, you're probably exploring the same space as the people who wanted to find out how much the human soul weighs (trick question, there is no such thing, but they put a lot of effort into trying to measure it anyway).
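A sketch of a benchmark that actually does the work it claims to measure (size reduced for practicality): touch every page, so the kernel's lazy zeroing has to happen too.

    fn main() {
        let n = 1 << 30; // 1 GiB instead of the article's 16 GB

        // With the specialization this is one calloc: pages are only
        // reserved, not yet zeroed.
        let v = vec![0u8; n];

        // Reading every byte forces each page to be realized; now both
        // the lazy and the eager strategies pay the full cost.
        let sum: u64 = v.iter().map(|&b| u64::from(b)).sum();
        assert_eq!(sum, 0);
    }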
It's just an example to show that zero cost abstractions are actually not zero cost at all.
The language would have you believe the abstraction is zero cost. If you really believed it, you would never think about whether you need to allocate a vector of u8 or of your custom byte type. They should be one and the same. That's the point of the promise of zero cost abstractions.
By the way, pre-allocating a lot of virtual memory upfront is not unheard of.
In this situation, to avoid paying the cost of the abstraction, you would have to stop and think: "I can't allocate _my_ byte type! I must allocate u8, then cast the result to a vector of my custom byte type. Maybe the compiler will not like the cast, so I have to create an `unsafe` block and do some pointer casting?" (I don't know if that's what you would need to do in Rust, or if it would have been something else.)
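For what it's worth, the unsafe version might look roughly like this (a sketch; it relies on the newtype being `#[repr(transparent)]` so the layouts are guaranteed identical):

    use std::mem::ManuallyDrop;

    #[repr(transparent)]
    struct WrappedByte(u8);

    fn wrap(v: Vec<u8>) -> Vec<WrappedByte> {
        // Take ownership of the buffer without running Vec's destructor.
        let mut v = ManuallyDrop::new(v);
        let (ptr, len, cap) = (v.as_mut_ptr(), v.len(), v.capacity());
        // SAFETY: WrappedByte is repr(transparent) over u8, so size,
        // alignment and layout match; ownership moves exactly once.
        unsafe { Vec::from_raw_parts(ptr.cast::<WrappedByte>(), len, cap) }
    }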
> you would have to stop and think "I can't allocate _my_ byte type! I must allocate u8
No, this is premature optimization. Rather, you would write the code in the most obvious way, and then profile it to figure out which optimizations are worth doing. At that point you can comment your code explaining why it looks all wonky :)
And applying this thinking to everything means that your entire codebase is pervaded with slowness with no obvious hot spots, and you jump with joy since that obviously means your code is as efficient as it possibly could be (because otherwise the profiler would show some peaks, right?).
I know this was meant as (light) sarcasm, but have you ever profiled a nontrivial program that turned out to have zero peaks? My gut tells me this would be really difficult to do by accident. The same way that writing crypto code that is resistant to timing attacks is hard.
This definitely brings it closer, but on my machine, touching the entire array still ends up being 2 seconds slower (5.4s for the calloc side, 7.8s for the clone side)
It's not using clone but memset. Of course memsetting memory to 0 when the operating system already set it to 0 is still silly and something that could be further optimized, but it's not cloning things.
It might be optimized to a memset, but it's the clone codepath, as opposed to the u8 specialization codepath discussed in the article.
But the original commenter is talking about how the benchmark isn't useful because it doesn't touch memory, but you get similar results even if you do touch memory.
I think it's more accurate to say that there's an optimization that applies only to a very specific situation that's millions of times faster than without the optimization.
It's a bit strange to say that before the u8 specialization this was not a problem and it's suddenly a problem now.
Very well said. I actually came here to write a comment criticizing the post, because the fact that this reads as “negative” has to do with the order in which the experiment took place. Flip it and you have a different sentiment. A similar point can be made, but an explanation of “it’s very slow” presumes that every statement in Rust will be maximally optimized via specialization, which I consider an unreasonable expectation.
I’m not arguing against you, but why do you feel like expecting maximal optimization is unreasonable? Is it just when limited to specialization or do you think it applies more broadly in terms of general optimization?
Rust has "typedefs" and those are called type aliases.
The intention of the newtype pattern is specifically to create a new type without the behaviour of the wrapped type, so if there's a special case just for u8, I would expect it not to apply to the newtype.
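The distinction in code (a sketch):

    // A type alias is the same type under a new name: Byte inherits every
    // behaviour and optimization of u8, including the zeroing
    // specialization.
    type Byte = u8;

    // A newtype is a genuinely distinct type and deliberately inherits
    // nothing automatically, which is why the u8 special case doesn't
    // apply.
    #[derive(Clone)]
    struct WrappedByte(u8);

    fn main() {
        let _fast: Vec<Byte> = vec![0; 1 << 20]; // u8 specialization
        let _slow: Vec<WrappedByte> = vec![WrappedByte(0); 1 << 20]; // not
    }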
It’s misleading. One version is reserving the memory and will zero it upon use. The other version allocates the memory and zeros every byte of the memory up front by actually writing zeroes to it.
It’s basically competing abstractions. Letting the kernel reserve pages and only zeroing them when used will hide the cost and amortize it across access.
It’s worth knowing about but it’s also an artificial problem revealed by benchmarks.
Technically it postpones initializing the memory, and then it exits before that is needed.
If you actually use the memory, both cases work out the same. There are too many languages where you can easily make the mistake of reading uninitialized memory, and that can't happen here: if you ask to sum() the vector you'll get a zero answer in both cases... and both programs will be similarly slow.
(Note that you can't actually "sum" the code sample's wrapper type because it lacks the appropriate operations. I spent a while trying to make a working example where I trusted I was measuring something "real" and not artefacts, and I eventually gave up. Definitely begin with an actual problem, so that you know what you're trying to do and can measure whether you're doing it; otherwise you may just be wanking.)
It's doing different things:
In the first case it's not doing anything and it will request memory dynamically. In the second case it's zeroing 16 GB of RAM.
The explanation linked at the end explains it better.