Zero Cost Abstractions (boats.gitlab.io)
350 points by steveklabnik on May 16, 2019 | 110 comments



An interesting post. I have two thoughts.

The first is in response to:

> I think to some extent Rust has tried to cheat this by making non-zero cost abstractions actively unpleasant to write, making it feel like the zero cost abstraction is better.

This idea is closely related to a comment by estebank just yesterday[1]. I disagree - I think Rust makes non-zero-cost abstractions obvious to the author. If you want to do something in a way that's costly, the language surfaces that information. It puts it right in your face and makes you deal with it. That's a good thing. It makes for better code. If you want to use dynamic dispatch, you can't pretend you're not.

The second is to add a place where I think Rust has gotten zero cost abstractions right: zero-sized generic types. I use them extensively in embedded code to build things like states. I can use trait methods and associated consts in really powerful ways to encode complicated relationships that constrain the operation of a peripheral at compile time, and it compiles down to a single register access. That's crazy powerful. Using C structs and type punning can do the same thing, but without the safety. Using C++'s classes can give me some type safety, but at the cost of real method call overhead. I get giddy every time I think about writing something like

    use core::marker::PhantomData;

    // Bits8 is a zero-sized marker type implementing AdcMode
    struct Adc<PIN: AdcPin, MODE: AdcMode>(PIN, PhantomData<MODE>);

    let adc = Adc::<_, Bits8>(adc_pin, PhantomData);
and the compiler will let me know if my `adc_pin` isn't the right type.

[1] https://news.ycombinator.com/item?id=19920945


Why would a C++ version involve method call overhead?

Wouldn't C++ actually be more straightforward with its value-typed template parameters?


Rust is getting value-typed template parameters Real Soon Now - in fact, they now work to some extent on nightly. In the parent’s example, my guess is that only the second parameter could be changed to a value, whereas the first is actually used somewhere as a type - but I have no context so it’s just a guess.


With C++, wouldn't I need to either use a virtual base class with a typed template parameter (to emulate interfaces), or need to use a duck-typed parameter and lose type safety?


Sprinkle in a bunch of std::enable_if to do SFINAE and that's it. It's a way to tell the compiler a certain function or method should be enabled only when some other compile-time condition is true.
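
For example, a minimal sketch along the lines of the ADC example upthread (names are made up, not from the article):

    #include <type_traits>

    enum class AdcMode { Bits8, Bits10 };

    template <AdcMode M>
    struct Adc {
        // read8() exists only when the ADC was instantiated in 8-bit mode;
        // calling it on a 10-bit Adc is a compile error, not a runtime check.
        template <AdcMode Mode = M,
                  typename std::enable_if<Mode == AdcMode::Bits8, int>::type = 0>
        unsigned char read8() { return 0; /* would be a register read */ }
    };

    int main() {
        Adc<AdcMode::Bits8> adc;
        return adc.read8();
        // Adc<AdcMode::Bits10>{}.read8();  // would fail to compile
    }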


You don’t have to use a virtual base class. You can implement interfaces statically by passing in the implementation as a template parameter.
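
Something like this, roughly (names invented for illustration):

    struct UartLogger {
        void write(const char* msg) { /* e.g. push bytes to a UART */ (void)msg; }
    };

    // Device is parameterized by its logger implementation. The "interface" is
    // just the set of methods the template expects; calls are resolved
    // statically, so there is no vtable and write() can be inlined.
    template <typename Logger>
    struct Device {
        Logger& log;
        void start() { log.write("starting"); }
    };

    int main() {
        UartLogger logger;
        Device<UartLogger> dev{logger};
        dev.start();
    }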


Which of the two template arguments are you talking about? The enum-typed template argument will lead to an optimized specialization of the template. C++ does, however, have an unfortunate problem: there is no good way to restrict type parameters to types with certain properties/guarantees without making the compiler explode in obtuse error messages.


static_assert, enable_if, or just plain not providing a default implementation and only specializations; everything in C++ error messages is something you can customize to a high extent...


which is exactly how you would do it in C++: https://www.embedded.com/design/programming-languages-and-to...


Only if you don't make proper use of template value types and inline calls with static dispatch.

There are quite a few embedded C++ talks about how to go at it.


There is one major problem with -- or, rather, cost to -- zero-cost abstractions: they almost always introduce accidental complexity into the programming language originating in the technical inner workings of the compiler and how it generates code (we can say they do a bad job abstracting the compiler), and more than that, they almost always make this accidental complexity viral.

There is a theoretical reason for that. The optimal performance offered by those constructs is almost always a form of specialization, AKA partial evaluation: something that is known statically by the programmer to be true is communicated to the compiler so that it can exploit that knowledge to generate optimal machine code. But that static knowledge percolates through the call stack, especially if the compiler wants to verify — usually through some form of type-checking — that the assertion about that knowledge is, indeed, true. If it is not verified, the compiler can generate incorrect code.

Here is an example from C++ (a contrived one):

Suppose we want to write a subroutine that computes a value based on two arguments:

    enum kind { left, right };
 
    int foo(kind k, int x) { return k == left ? do_left(x) : do_right(x); }
And here are some use-sites:

    int bar(kind k) { return foo(k, random_int()); }
    int baz() { return foo(left, random_int()); }
    int qux() { return foo(random_kind(), random_int()); }
    
The branch on the kind in foo will represent some runtime cost that we deem to be too expensive. To make that “zero cost”, we require the kind to be known statically (and we assume that, indeed, this will be known statically in many callsites). In this contrived example, the compiler will likely inline foo into the caller and eliminate the branch when the caller is baz, and maybe in bar, too, if it is inlined into its caller, but let’s assume the case is more complicated, or we don’t trust the compiler, or that foo is in a shared library, or that foo is a virtual method, so we specialize with a "zero-cost abstraction":

    template<kind k> int foo(int x) { return k == left ? do_left(x) : do_right(x); }
This would immediately require us to change all callsites. In the case of baz we will call foo<left>, in qux we will need to introduce the runtime branch, and in bar, we will need to propagate the zero-cost abstraction up the stack by changing the signature to template<kind k> bar(), which will employ the type system to enforce the zero-costness.
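
Concretely, the call sites end up looking something like this (continuing the contrived example above, so it reuses the earlier declarations):

    int baz() { return foo<left>(random_int()); }       // statically known: no branch

    int qux() {                                          // kind only known at runtime:
        kind k = random_kind();                          // the branch reappears at the
        return k == left ? foo<left>(random_int())       // call site
                         : foo<right>(random_int());
    }

    template<kind k>                                     // bar has to propagate the
    int bar() { return foo<k>(random_int()); }           // parameter up the stack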

You see this pattern appear everywhere with these zero cost abstractions (e.g. async/await, although in that case it’s not strictly necessary; after all, all subroutines are compiled to state machines, as that is essential to the operation of the callstack — otherwise returning to a caller would not be possible, but this requires the compiler to know exactly how the callstack is implemented on a particular platform, and that increases implementation costs).

So a technical decision related to machine-code optimization now becomes part of the code, and in a very intrusive manner, even though the abstraction — the algorithm in foo — has not changed. This is the very definition of accidental complexity. Doing that change at all use sites, all to support a local change, in a large codebase is especially painful; it's impossible when foo, or bar, is part of a public API, as it's a breaking change -- all due to some local optimization. Even APIs become infected with accidental complexity, all thanks to zero-cost abstractions!

What is the alternative? JIT compilation! But it has its own tradeoffs... A JIT can perform much more aggressive specialization for several reasons: 1. it can specialize speculatively and deoptimize if it was wrong; 2. it can specialize across shared-library calls, as shared libraries are compiled only to the intermediate representation, prior to JITting, and 3. it relies on a size-limited dynamic code-cache, which prevents the code-size explosion we'd get if we tried to specialize aggressively AOT; when the code cache fills, it can decide to deoptimize low-priority routines. The speculative optimizations performed by a JIT address the theoretical issue with specialization: a JIT can perform a specialization even if it cannot decisively prove that the information is, indeed, known statically (this is automatic partial evaluation).

A JIT will, therefore, automatically specialize on a per-use-site basis; where possible, it will elide the branch; if not, it will keep it. It will even speculate: if at one use site (after inlining) it has so far only encountered `left`, it will elide the branch, and will deoptimize if later proven wrong (it may need to introduce a guard, which, in this contrived example, would negate the savings from eliding the branch, but in more complex cases it would be a win; also there are ways to introduce cost-free guards -- e.g. by introducing reads from special addresses that will cause segmentation faults if the guard trips, a fault which is caught; OpenJDK's HotSpot does this for some kinds of guards).

For this reason, JITs also solve the trait problem on a per-use-site basis. A callsite that in practice only ever encounters a particular implementation — a monomorphic callsite — would become cost-free (by devirtualization and inlining), and those that don’t — megamorphic callsites — won’t.

So a JIT can give us the same “cost-freedom” without changing the abstraction and introducing accidental complexity. It, therefore, allows for more general abstractions that hide, rather than expose, accidental complexity. JITs have many other benefits, such as allowing runtime debugging/tracing "at full speed", but those are for a separate discussion.

Of course, a JIT comes with its own costs. For one, those automatic optimizations, while more effective than those possible with AOT compilation, are not deterministic — we cannot be sure that the JIT would actually perform them. It adds a warmup time, which can be significant for short-lived programs. It adds RAM overhead by making the runtime more complicated. Finally, it consumes more energy.

There's a similar tradeoff for tracing GC vs. precise monitoring of ownership and lifetime (Rust uses reference-counting GC, which is generally less effective than tracing, in cases where ownership and lifetime are not statically determined), but this comment is already too long.

All of these make JITs less suitable for domains that require absolute control and better determinism (you won't get perfect determinism with Rust on most platforms due to kernel/CPU effects, and not if you rely on its refcounting GC, which is, indeed, more deterministic than tracing GC, but not entirely), or are designed to run in RAM- and/or energy-constrained environments — the precise domains that Rust targets. But in all other domains, I think that cost-free abstractions are a pretty big disadvantage compared to a good JIT -- maybe not every cost-free abstraction (some tracking of ownership/lifetime can be helpful even when there is a good tracing GC), but certainly the philosophy of always striving for them. A JIT replaces the zero-cost abstraction philosophy with a zero-cost use philosophy -- where the use of a very general abstraction can be made cost-free, it (usually) will be (or close to it); when it isn't -- it won't, but the programmer doesn't need to select a different abstraction for some technical, "accidental", reason. That JITs so efficiently compile a much wider variety of languages than AOT compilers do also demonstrates how well they abstract the compiler (or, rather, how well the languages that employ them do).

So we have two radically different philosophies; both are very well suited for their respective problem domains, and neither is generally superior to the other. Moreover, it's usually easy to tell to which of these domains an application belongs. That's why I like both C++/Rust and Java.


Or just use something like ThinLTO. You don't need a JIT just to get cross-module inlining, which is all you're really talking about here. There's no actual "specialization" in play, just basic inlining + dead branch elimination. Trivial to do this statically at build time instead. So much so that this is even commonly done in JIT'd languages like Java (Proguard & R8 say hello).

Most JITs also have the downside that they need to optimize for compilation performance, and as such typically do not do as thorough a job optimizing as something like offline LLVM. Or you end up with fourth-tier JITs like webkit's FTL. And now, instead of just needing to wait for the JIT to think your function is interesting, you need to wait for a second and finally a third JIT pass.

JITs really primarily just shine in the rapid iteration cycle. Conceivably what you really want is a language where debug builds are JIT and release builds are full AOT with LTO. Not entirely unlike Dart's behavior in Flutter. Or even how Java behaves on Android to an extent. There doesn't seem to be any particularly significant reason why this pairing isn't more broadly applicable and used.

There's also no real reason you can't have zero-cost abstractions AND a JIT if you want. Hell, in theory this is what you get with C++ in a WebAssembly environment. These are not exactly opposing technologies.


> get cross-module inlining, which is all you're really talking about here.

Not at all.

> There's no actual "specialization" in play, just basic inlining + dead branch elimination.

Devirtualization + inlining + dead branch elimination is an instance of specialization, and an extremely effective one at that, but other forms of specialization apply.

> Trivial to do this statically at build time instead.

No. Some things are undecidable and/or would lead to code explosion. E.g., it's often hard to statically prove that a subroutine that takes an int would always receive the number 2, and specializing it for all ints would cause code explosion (again, contrived example but with real-world counterparts).

> There's also no real reason you can't have zero-cost abstractions AND a JIT if you want. Hell, in theory this is what you get with C++ in a WebAssembly environment. These are not exactly opposing technologies.

Of course. But I was talking about two very different philosophies that greatly affect language design.


> Not at all.

For your example? Yes, it was.

> Devirtualization + inlining + dead branch elimination is an instance of specialization

Your example did not involve any devirtualization.

> No. Some things are undecidable and/or would lead to code explosion. E.g., it's often hard to statically prove that a subroutine that takes an int would always receive the number 2, and specializing it for all ints would cause code explosion (again, contrived example but with real-world counterparts).

Again, your example did not have any such issues.

And if it was hard to statically prove a subroutine that takes an int always receives 2 then guess what? A JIT wouldn't do it, either. They don't actually specialize that much for runtime information because gathering that information would destroy performance. Devirtualization? Sure, limited in scope, and huge payoffs to be had for doing it. All arguments analyzed & binned? Good god no. That'd be ludicrous in resource usage required, and the payoffs would be incredibly hard to quantify in any meaningful way.

Take your example - all it would do is eliminate a single branch. If it was so common to take a single avenue that it could be eliminated, then pretty much every branch predictor would do it anyway. So... why specialize it?


> Take your example - all it would do is eliminate a single branch. If it was so common to take a single avenue that it could be eliminated, then pretty much every branch predictor would do it anyway. So... why specialize it?

I don't think you've actually read the comment. The particular example is just a demonstration of a principle. Obviously that particular case is too simple to be interesting.

> A JIT wouldn't do it, either.

Again, you've completely missed the point (read the original comment, perhaps?). In any event, a JIT could and would do it in some cases; e.g.: https://gist.github.com/chrisseaton/4464807d93b972813e49


> In any event, a JIT could and would do it in some cases; e.g.: https://gist.github.com/chrisseaton/4464807d93b972813e49

Uh, what? That's just a basic constant fold optimization that can be statically handled? It even has examples showing that anything that would require runtime information (like mutable strings) doesn't get optimized.


> That's just a basic constant fold optimization that can be statically handled

Incorrect. No static analysis can establish the constant in some of those cases if only because they are not provably constant. If the situation changes at runtime, those routines would need to be deoptimized and recompiled.

> It even has examples showing that anything that would require runtime information (like mutable strings) doesn't get optimized.

The example does not show that "anything that would require runtime information doesn't get optimized", just that that case doesn't. I don't know Ruby, but I can imagine why guarding the constness of a mutable array may not be advisable.


> Incorrect. No static analysis can establish the constant in some of those cases if only because they are not provably constant.

Which one of those is not provably constant? They are all literally hardcoded constants?


The methods are not constant:

    def foo
      Array.send(:define_method, :sort) do; [100, 200]; end
    end

    puts eval([1, 2, 3].inspect).sort[1] * 2
    foo
    puts eval([1, 2, 3].inspect).sort[1] * 2
But, anyway, my point isn't about partial-evaluation of values specifically (although JITs can and do do that); instead of int and 2 you can consider a type with 4 billion subtypes. JITs do specializations that an AOT simply cannot possibly do (not without code explosion).

Now, as I wrote in the original comment, I'm definitely not saying that JIT compilation is "better" than AOT. I am saying that the "zero-cost abstraction" philosophy of C++/Rust and the "zero-cost use" philosophy of JITs are two extremes that are each perfectly suited to two very different software domains.


V8 uses runtime type/value information to specialize methods in some cases[1]. I find it hard to believe that HotSpot doesn’t either.

Obviously complex objects are impossible to handle, but branching on an integer that’s always the same value isn’t. Same for constant folding, if an argument is determined to be constant at runtime then that’s going to affect the choices the JIT makes.

1. https://ponyfoo.com/articles/an-introduction-to-speculative-...


That's specializing on type, not on a given specific value. Being JS the value influences the type, but it's only looking at a coarse binning (int vs. double vs. string etc..) not specializing for a single specific value. A form of de-virtualization if you will.

Hotspot wouldn't even need to bother with that most of the time since it's primarily running statically typed languages for which that doesn't apply in the first place.


HotSpot will monomorphize polymorphic method calls if it detects that only one type is being used, FWIW.


Hotspot performs inlining + DCE/CSE/constant folding/... in a loop; they call it incremental inlining. I.e. inline, eliminate, now you have new inlining opportunities because the code dropped below your max inlining size, rinse & repeat.

I think rustc does not iterate llvm optimization passes and thus even with lto leaves opportunities on the table.


> I think rustc does not iterate llvm optimization passes and thus even with lto leaves opportunities on the table.

I think you'd need to support that with some evidence. If there's any code that hotspot manages to out-optimize LLVM on that'd be a fascinating investigation.

In practice I think you'll be shocked how well GCC & LLVM make code vanish vs. Hotspot. It's one of the major reasons benchmarking C/C++ code is so damn hard, the optimizer tends to eliminate the entire thing.


> In practice I think you'll be shocked how well GCC & LLVM make code vanish vs. Hotspot.

First of all, HotSpot is a VM, not a compiler. It has several compilers, among them C2 (the default optimizing compiler), Graal, but also LLVM (Zing is a HotSpot fork VM which uses an LLVM-based JIT called Falcon: https://www.azul.com/products/zing/falcon-jit-compiler/) + several GCs + other stuff.

Second, I think you may be shocked at the comparison of LLVM to C2: https://youtu.be/O87PaWkXlZ0 (they come up about the same, each winning/losing on different things). And that is when the optimization can be proven; whatever optimizations you have, a JIT can apply them more aggressively.


Note that it is up to the compiler frontend to choose which LLVM optimization passes to call, in which order and how often.

But neither clang[0] nor rustc[1] do something that resembles hotspot C2's incremental inlining[2].

It is fairly expensive, so hotspot's benefit here is that it only has to spend the optimization cycles on the hottest code.

> It's one of the major reasons benchmarking C/C++ code is so damn hard, the optimizer tends to eliminate the entire thing.

This is totally a thing in Java too. That's why the Java microbenchmark harness (JMH) provides helpers that can inhibit DCE, constant propagation and inlining.

[0] https://www.youtube.com/watch?v=s4wnuiCwTGU [1] https://github.com/rust-lang/rust/issues/44041 [2] https://bugs.openjdk.java.net/browse/JDK-8005071


Here are two examples: inlining method calls and flattening object hierarchies across dynamic libraries.

AOT LTO always loses against dynamic linking, whereas a JIT has free rein in that area.


ThinLTO does not work across shared libraries loaded at runtime. HotSpot does.


Something I've never seen JIT proponents talk about is the cost of maintaining all of the information necessary to make smart specialization decisions, both in terms of storage space and in the execution overhead of updating that info. I actually have no idea how this is done (as I've never researched JIT compilers).


The one I've heard most about in Java is single-implementation devirtualization. Since Java loads dynamically at the class level, it knows exactly which implementations of an interface or class hierarchy are in use. It has to maintain this information in the class loaders anyway, so this is almost free. However, it allows for some pretty aggressive devirtualization by only compiling for known loaded types and trapping the rest. In my experience, especially around things like interfaces defining module contracts, there tends to only be one implementation in play during normal runtime, meaning that the interface becomes zero-cost. And it doesn't even have to track new implementations loading. It just leaves the optimization there until it has proof via the trap that it must recompile.


It's certainly not free. The execution cost is often bounded by having a multi-tier setup where only the earlier Jit tiers gather type (and perhaps value) information. The final tier only checks it. (Or not, in which case the updates that can invalidate the compiled code are instrumented to discard the code so that it will be recompiled with updated info.)

As for space, you can use a scheme of "1, 2, many" to limit the set of types you record. That helps some. But space will also be consumed in the form of IC (inline cache) code that tests for the various possibilities. Again, those chains will probably be capped at a certain size.

You're right, it's tricky and can be expensive. But the total amount of entropy in actual observed code typically isn't that large, so there are many opportunities to be clever in the encoding and thereby save a ton of space over a naive solution.

(Source: I'm not really a JIT person but I've had a fair amount of exposure to the spidermonkey jit.)


For this particular left/right example a much cheaper alternative is to make the enum variants a 1st class citizen of the type system. Then the compiler will be able to use the variant type information from one callsite to monomorphize all the uses without inlining.
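
To make that concrete in the C++ terms of the example above (a hypothetical sketch; note that it still changes foo's signature, which is exactly the viral change being discussed):

    int do_left(int x)  { return x + 1; }   // stand-ins for the real work
    int do_right(int x) { return x - 1; }
    int random_int()    { return 4; }

    struct left_t {};
    struct right_t {};

    // One overload per variant: overload resolution, not a runtime branch,
    // picks the right body.
    int foo(left_t,  int x) { return do_left(x); }
    int foo(right_t, int x) { return do_right(x); }

    int baz() { return foo(left_t{}, random_int()); }    // resolved at compile time

    int main() { return baz(); }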


The example is merely used to motivate the principle. Let’s not miss the forest for the trees.


You're absolutely right, but the same general analysis applies. The example is contrived, but I want to show the principle (perhaps static variants would have made for a better example; oh, well).


For this particular example it'd be even easier to just put 'int foo(kind k, int x)' in the header so it's inlined & optimized at each call site anyway. Slap a constexpr on it so the compiler even yells at you if it can't resolve it at compile time for some reason.


> There's a similar tradeoff for tracing GC vs. precise monitoring of ownership and lifetime (Rust uses reference-counting GC, which is generally less effective than tracing, in cases where ownership and lifetime are not statically determined), but this comment is already too long.

Rust doesn't use a GC. It uses an ownership-and-borrowing mechanism that statically ensures a single owner, and at most one mutable reference, to any value at a time. In the case of shared values, you do have std::sync::Arc<T> (atomically reference counted) and std::rc::Rc<T> (reference counted), but I would hesitate to conflate those with the language as a whole, as they're not the default.


Reference counting is a type of garbage collection. Of course Rust isn't a garbage collected language, but you can kind of use it as one.


> Reference counting is a type of garbage collection.

I'm well aware, or I wouldn't have brought those types up, obviously.

But there's a vast difference between a language's memory management/type system being intrinsically garbage collected and a couple of special-use types providing that functionality.


If you want to go that way, C++ has shared_ptr and can also be a garbage collected language. It is kind of missing the point though.


What are your thoughts on whole-program optimizing compilers on this? Because, from what I've seen, with strong typing, the compiler can go to town and optimize everything, and makes JIT seem like a joke.

Because users almost always run whole programs. Why shouldn't our tools optimize for that?

For the record, I like all of C++, Rust, Java, and MLton. You just need to know what you're working with (hint: it's the compiler/runtime, not the language spec)


Whole program optimization is good if your inputs never change, and you have the infrastructure to benchmark your build and recompile with profiling on each release. JITs do this automatically. A team at work had this release process:

1. Cut the branch at around midnight

2. Run the build with a portion of production traffic for a few hours

3. Collect the profiling info and feed it back into the build.

4. Repush the binary to be tested again.

5. In the morning, the team would manually push the binary to prod.

The benefit was clear: about 20% reduction in CPU. However, to get to this level of automation is not easy, and you get it out of the box with JITs.

One other thing: It's easy to become dependent on such performance gains. The team that had the process got into a difficult situation where there was a bug in one of the releases, couldn't roll back, and had to cherry pick a change. A few lines of code change had to go through the whole push-profile-rebuild-test cycle before it could be rolled out. Pushing the non-profiled change would have caused several other latency SLOs to be violated and pagers to fire. Instead, they had to wait out the several hours with the bug in place, stressing over how soon the profiling would be done.


Better PGO tooling can use the profiles from a previous version of the code, which is almost but not quite the same, to compile a PGO-optimized build of the patched version.

If there is no tooling to do that, a subset of the training data can be used which can be processed in a short amount of time to gather enough profile data to get most of the benefits. So say, instead of 20% faster code after 6 hours, 10% faster code after 15 minutes.

It is also possible to use PGO to find the critical optimizations done in the PGO optimized build that lead to most of the gains and add annotations in the code (branch taken, branch not taken, force inline, never inline, etc) or split functions the way the PGO optimized build does (e.g. common case is inline the guarding if statement at the beginning of the function, not inline the rest of the function).


> What are your thoughts on whole-program optimizing compilers on this?

They're awesome, but they're not powerful enough to allow "cost-free use" as well as JITs do. Even whole program optimizing compilers can't be as aggressive as JITs in the general case because, like all AOT compilers, they are limited to what they can prove, and there are many things compilers simply cannot prove (at least not without extreme help from the programmer).

Now, don't get me wrong: AOT compilers are often good enough (for languages specially designed for them), but the automatic "zero-cost-use" of JITs is also often good enough compared to cost-free abstractions. We're talking about a general philosophy taken to certain extremes. I would say that a whole-program optimization on a language like ML is a middle ground between those two extremes.

> and makes JIT seem like a joke.

There's nothing stopping a JIT from doing whole-program optimization. It's just that they can do better for less work. They can do the same optimizations, only with local reasoning because they're not required to prove their correctness. True, they will need to add a guard for some of those speculative optimizations, but on the other hand, they can do so much more (speculate and optimize cases even a whole-program AOT can't prove). Of course, as I said in my original comment, JITs do have other costs.


P.S.

This talk, about a whole-program optimizing compiler for Java bytecode, compares the approach to a JIT at ~25-35m: https://youtu.be/CZVa4PBv2qg

This paper discusses the optimizations available to WPO-AOT and JIT: https://dl.acm.org/citation.cfm?doid=3136000.3136002


This per-call-site change in the async/await case is entirely unnecessary. Users of greenlet or similar get the same semantics without changing all their call sites. The implementations are different, but there are zero-cost implementation techniques available either way.


But either way there's an ABI that the compiler needs to adhere to, and any magic added to the stack management will affect everything else that must follow the same stack protocol. The stackful/fiber approach may have a lower cost in terms of performance and even overall complexity, but it still represents hidden complexity.

That hidden complexity comes to the foreground when you want to mix software that uses completely different stack management protocols, such as the typical C ABI vs Go.


> This would immediately require us to change all callsites. In the case of baz we will call foo<left>, in qux we will need to introduce the runtime branch, and in bar, we will need to propagate the zero-cost abstraction up the stack by changing the signature to template<kind k> bar(), which will employ the type system to enforce the zero-costness.

There's no inherent reason that this should require code changes though - that's a C++ limitation. If we had a language with stage polymorphism and type inference then call sites that are indifferent about whether extensive branching occurs do not need to change their code (but will acquire a type that could tell us, if we inspected it), while call sites where we consider specialisation obligatory only need a type annotation at the point where we decide that we do care. So it can be zero cost at higher level too: we only pay a complexity cost for the specialization tracking if we want to use it.

> So a technical decision related to machine-code optimization now becomes part of the code, and in a very intrusive manner, even though the abstraction — the algorithm in foo — has not changed. This is the very definition of accidental complexity.

We might not care about branching or not, but we certainly do care about the computation actually being evaluated in a reasonable time. That's something we don't really have a good model for at the moment, but I've come around to thinking that no language is immune from a need to reason about performance occasionally, so complexity in the performance characteristics of some construct is not accidental.

> So a JIT can give us the same “cost-freedom” without changing the abstraction and introducing accidental complexity. It, therefore, allows for more general abstractions that hide, rather than expose, accidental complexity.

I've come to think that hiding something without having a good model for it is one of the most dangerous things in software: the harder a program papers over the details, the harder it is to access them when one needs to - but inevitably one does eventually need to. Sooner or later you need to understand why two very similar pieces of code run at radically different speed, and finding that out with a JIT is far harder than putting extra information in your function signatures. Unless and until we can have a model for our specialization that makes it clear what the performance characteristics of any given piece of code will be, I'd sooner make it painfully explicit - or at least make sure we can transform code into a representation where it's explicit in the cases where we need to.

It's more important to have our optimizations be consistent and understandable than to have them give the best performance possible in specific cases. The worst case is often more important than the average case. I'd argue that much of the industry is moving in that direction - new graphics APIs making the fast path versus the slow path a much more explicit choice, Haskell favouring rewrite rules over fusion, even Java with Graal/Truffle making the various kinds of optimisation and staging more visible and developer-controlled.


> There's no inherent reason that this should require code changes though - that's a C++ limitation.

In some simple cases static "constant inference" could help, but there is an inherent reason why this is not the case in general. Specialization requires deciding some property, and in any case -- even in the left-right example -- that property can be undecidable (or not decidable by the compiler). It is simply impossible for a compiler to always know whether some value at some site is constant. For example, instead of random_kind() I could use a_complicated_computation_that_yields_a_kind(x). The cost-free abstraction philosophy says that the programmer must be able to tell if an optimization is made, and this is simply not possible (it could be done with dependent types, but that certainly wouldn't lower the intrusiveness). For a JIT, that's not a problem. It would still specialize.

But it's always the question of control: do you prefer to work hard to guarantee that some information is known statically, or use a tool that figures it out automatically, sometimes worse than you, but usually better.

> so complexity in the performance characteristics of some construct is not accidental.

I would say that this kind of performance is. It's unrelated to the algorithm chosen by the programmer but to some compiler knob.

> I've come to think that hiding something without having a good model for it is one of the most dangerous things in software: the harder a program papers over the details, the harder it is to access them when one needs to - but inevitably one does eventually need to. Sooner or later you need to understand why two very similar pieces of code run at radically different speed, and finding that out with a JIT is far harder than putting extra information in your function signatures.

I don't know if I agree with your premise, but I strongly disagree with your conclusion, because a JIT doesn't take away this freedom. In fact, JITs (like in OpenJDK) have excellent tools that show you the decisions they've made[1], and you can control those decisions with compiler directives[2] (those are sometimes recommendations, but there's no problem making them requirements and extending their scope). So a JIT doesn't take away any of that control, it just doesn't push this complexity into the language. Besides, what you put in the signature are only certain compiler knobs. Even in Rust/C++ we rely on compiler optimizations that we have no direct control over in the language.

> Java with Graal/Truffle making the various kinds of optimisation and staging more visible and developer-controlled.

I'm not sure what you're referring to. Graal and Truffle can be used effectively on a wide variety of languages without requiring them to change. But another example from Java demonstrates the different philosophy: value types. Choosing the "inline" memory layout for value types has some semantic implications, but it is also a technical detail; unfortunately it's one that the compiler isn't good at guessing, so there will be an explicit annotation on the type's declaration. But some of the reason adding value types is taking so long is precisely because they don't want that detail to affect use sites, and are aiming for an outcome that would leave the vast majority of use-sites unchanged (possibly not even requiring recompilation, but I don't follow that project too closely and it's changing all the time so I'm not sure what the current prototype does).

[1]: https://github.com/AdoptOpenJDK/jitwatch/wiki, https://www.graalvm.org/docs/reference-manual/tools/#ideal-g...

[2]: https://docs.oracle.com/en/java/javase/11/vm/compiler-contro...


> For example, instead of random_kind() I could use a_complicated_computation_that_yields_a_kind(x). The cost-free abstraction philosophy says that the programmer must be able to tell if an optimization is made, and this is simply not possible (it could be done with dependent types, but that certainly wouldn't lower the intrusiveness).

Why would that be intrusive? a_complicated_computation_that_yields_a_kind has a type (either inferred or declared) that indicates its staging behaviour. The language knows whether a_complicated_computation_that_yields_a_kind(x) is available at compile time or not, and the IDE can tell the programmer. As long as the type inference is good enough (which is not trivial, but I see no reason to assume it's impossible - stage polymorphism is still a research area but I'm not aware of this being a particular sticking point) there's no reason we can't have the best of all worlds.

> But it's always the question of control: do you prefer to work hard to guarantee that some information is known statically, or use a tool that figures it out automatically, sometimes worse than you, but usually better.

But there's no contradiction between figuring it out automatically and doing that statically. We've seen this already with types-of-return-values: people thought there was a tradeoff between Java-1.4 style "explicitly write the type on each line" and Python-style "type is unknowable until runtime", but actually we can infer at compile time and get the best of both worlds.

> In fact, JITs (like in OpenJDK) have excellent tools that show you the decisions they've made[1], and you can control those decisions with compiler directives[2] (those are sometimes recommendations, but there's no problem making them requirements and extending their scope).

But to the extent that we insist that the JIT must be free to change its behaviour based on information that isn't statically available, that limits how much confidence I can have. Maybe my function is inlined in my test cases, but some rarely-used code path will lead to it being uninlined when some particular pattern of network timeouts happens. Maybe the peculiar load pattern that happens when I failover will mean that a failed-over-to warm spare gets different JIT decisions from the instance that was originally master. And if the results aren't integrated into the language then it's hard to be confident that the aspects we care about will be maintained as the codebase evolves.

> Besides, what you put in the signature are only certain compiler knobs. Even in Rust/C++ we rely on compiler optimizations that we have no direct control over in the language.

Indeed - there's plenty of room for most languages to improve the visibility and control over these aspects. This philosophy ultimately ends up with something like the Coq-as-macro-assembler approach (and even then, it's not always as easy as it should be to understand the performance characteristics of assembly language on modern processors).


> The language knows whether a_complicated_computation_that_yields_a_kind(x) is available at compile time or not, and the IDE can tell the programmer.

But that routine could always return left for the values of x passed to qux. Static analysis cannot possibly know that, but a JIT will (although not enough to elide the call to that routine altogether, just the branch).

> but actually we can infer at compile time and get the best of both worlds.

Type inference does not require annotation, but it does require a guarantee that the type is statically inferrable. E.g. when the Scala compiler fails to infer a type it raises an error. But a JS JIT can devirtualize even when the type is not statically inferrable. The kind of compiler I think you're envisioning would still need to choose between the zero-cost abstraction philosophy (fail if it cannot infer) and the zero-cost use philosophy (don't fail, but rather hope to specialize based on runtime information). But maybe you're advocating something like optional typing for compiler optimization. That's certainly doable, and I would argue that it's already done, to some extent, in Java, but only available to JDK code (there are @DontInline and @ForceInline annotations). I think Kotlin wants to do something like that, too, with their "contracts", but I couldn't tell if they're sound or not (i.e. whether the compiler allows you to lie to it in a way that would force an incorrect optimization).

(Just to be clear, I am not talking about type inferences that are meant to aid the programmer, but those that are meant to aid the compiler.)

> And if the results aren't integrated into the language then it's hard to be confident that the aspects we care about will be maintained as the codebase evolves.

Well, now you're saying you prefer the zero-cost abstraction philosophy of having as much control as possible on technical compilation aspects in the programming language. That's fine, of course, but I think this is far from the universal preference. I mean, I like this power in C++, but it clearly has the major costs that I mentioned in my original post, and it would continue to have them even with better inference. Sometimes we like our accidental complexity, but sometimes we don't.

> and even then, it's not always as easy as it should be to understand the performance characteristics of assembly language on modern processors

Yeah... but are you implying that if CPUs were predictable writing formal deductive proofs for code generation would be easy???? They're somewhere between a satisfying puzzle and hair-pulling frustration in the best of circumstances.


> But that routine could always return left for the values of x passed to qux. Static analysis cannot possibly know that, but a JIT will

Sure, so there will be cases where a JIT could opportunistically devirtualize whereas a static devirtualization step can't. But with the static approach the programmer can immediately see why devirtualization didn't happen, and what they would need to do to make it happen. I don't see that as contrary to the zero-cost abstraction philosophy: you have asked the computer to calculate a_complicated_computation_that_yields_a_kind(x) and make a virtual function call based on the result; if what you wanted was the computer to calculate a_complicated_computation_that_yields_left(x) and make a nonvirtual call then you should have told it that. Presumably you would still have put an indirection in there if you had been hand-coding it, because if you actually know that the computation always returns left then you would have written as much.

> Well, now you're saying you prefer the zero-cost abstraction philosophy of having as much control as possible on technical compilation aspects in the programming language. That's fine, of course, but I think this is far from the universal preference. I mean, I like this power in C++, but it clearly has the major costs that I mentioned in my original post, and it would continue to have them even with better inference. Sometimes we like our accidental complexity, but sometimes we don't.

I want to have the control if I need it, but that doesn't imply it has to carry a cost if I'm not using it. By analogy, in a language with complete type inference (e.g. H-M), I don't pay any cost for my variables having types if I don't want to - I can write a script with zero type annotations, let the types be inferred, and it will still work just as well as it would in a unityped language. But then if I do want to put explicit types on some of the variables then I can pay the cost at that point and the language will let it interleave nicely with code that doesn't use type annotations.

Imagine inferring the worst-case performance characteristics of all program functions just as a part of the language. For functions that didn't put any effort into it, this information would be useless (probably infinite or at best known-finite-but-unbounded) - but valid. Even just a distinction between what's compile-time-constant or not would be useful (and many languages already draw that distinction in one way or another, but without the structure and visibility that would make it useful to programmers).


> if what you wanted was the computer to calculate a_complicated_computation_that_yields_left(x) and make a nonvirtual call then you should have told it that.

Again, you're showing a preference for the zero-cost abstraction philosophy; the zero-cost use philosophy is intentionally different. The problem is that the computer cannot tell ahead of time whether it will be able to elide a branch or not. Suppose you tell it it should: what should it do if it can't know for sure that it can? If you say it should fail -- that's the zero-cost abstraction philosophy; if you say it should try -- that's the zero-cost use philosophy.

> and it will still work just as well as it would in a unityped language

I don't understand the analogy. Whether you infer or explicitly state, you still want all pertinent information statically known. This is the zero-cost abstraction philosophy (which you seem to prefer). My view is the following: the vast majority of things we'd like to statically know cannot be known at an acceptable cost. The question is what should we do with the rest? Should we pay for the effort of helping the compiler with what we can know, or let the compiler do its thing, which includes taking into consideration things that cannot be statically known. For the domain C++ and Rust target I prefer the first (because there isn't really much of a choice); for other domains, I prefer the second.

> Imagine inferring the worst-case performance characteristics of all program functions just as a part of the language.

I am not sure what exactly you mean here. In general, inferring properties statically has some very clear benefits and very clear costs. The more precise the information, the higher the cost.


Is refcounting non-deterministic? In what way? Like, you don't know if thread A or B will finish last and free it?


Yes, but not just. You cannot always statically determine how many objects a particular operation on a data structure with internal dynamic-lifetime nodes would need to free -- it depends on the dynamic contents of the data structure at that point in time -- and as deallocation by refcounting GC takes time proportional to the number of freed objects, you cannot determine the cost of an operation.

A simple, small (in terms of runtime size) refcounting GC does better than an equally simple tracing GC, and simple refcounting has the advantage of a significantly lower heap footprint (a tracing GC uses heap overhead to drastically reduce the cost of memory management). There is also the assumption that Rust/C++ applications will not make heavy use of GC, all the more reason to use a simple, low-footprint one, in line with the "cost-free abstraction" philosophy.


Atomic refcounting consumes a global, system-wide resource: inter-core and inter-CPU (e.g. QPI) communication. 20M-50M of those events per second can render your 4-CPU, 128-core system frozen.

Also just freeing memory can create a surprisingly long pause. Depends on allocator, of course.


> Atomic refcounting consumes a global, system-wide resource: inter-core and inter-CPU (e.g. QPI) communication.

Forgive me if I'm wrong, but wouldn't reference counters consume these global resources only when there's contention for the cacheline? Once the cacheline is in an exclusive state, there's no need to go outside the core at all.


Yes, and refcounting creates false contention. For example, a global config object is constantly having its refcount bumped up and down every time someone dereferences it, even though the actual contents are read-only.

There are ways to mitigate this, but they’re complex.


This is actually an excellent case where Rust can help you completely remove the need to lock and help you keep your memory safe — if you can make it clear that your config object will live longer than all threads referencing it (which shouldn’t be hard if it is truly global), then Rust will let you share it among all threads with no further overhead at all, for programmer or processor.


One interesting thing about Rust here is that you can mitigate refcount traffic, since you can also take a regular reference to the data. You only need to increment/decrement on actual ownership changes.


You're correct. I simplified it a bit.

The point still stands, though. There aren't many of those contention resources available, so if high system performance is desired, one should look for alternatives to avoid creating chatter between cores and CPU sockets.


When you have some object O that has a reference to a set of hundreds of thousands of other objects, and drop the last reference on O, you don't know how many of those other objects will also be traversed and have their reference count hit zero. It could be none of them; it could be all of them.
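
In C++ terms, the same effect is easy to see with shared_ptr (a contrived sketch): the cost of dropping one reference depends entirely on how much of the structure happens to die with it.

    #include <memory>
    #include <vector>

    struct Node {
        std::vector<std::shared_ptr<Node>> children;
    };

    int main() {
        auto root = std::make_shared<Node>();
        for (int i = 0; i < 100000; ++i)
            root->children.push_back(std::make_shared<Node>());

        // If nothing else holds references to the children, this one reset()
        // frees 100,001 objects right here; if something else does, it frees
        // only one. The pause is proportional to whatever happens to die.
        root.reset();
    }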


There are some fundamental optimizations we don't even think about any more. E.g. the first class I wrote in C++ was a simple container for a byte, with a constructor, a getter and a conversion to int. Then in main() I declared one locally, constructed with 'true'. Then I returned that object as the return code.

Imagine my surprise when the generated code was simply

    mov AX, 0xFF
    ret

Three or four bytes of code! So some abstractions, encapsulation for instance, provide the user with guarantees but can cost absolutely nothing.
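
A rough reconstruction of what that class probably looked like (from the description above, so the names and the 0xFF-for-true representation are guesses):

    class Byte {
        unsigned char value;
    public:
        explicit Byte(bool b) : value(b ? 0xFF : 0x00) {}
        unsigned char get() const { return value; }
        operator int() const { return value; }
    };

    int main() {
        Byte b(true);
        return b;   // with optimizations on, the whole program folds down to
                    // a constant move and a ret: the encapsulation costs nothing
    }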


I really enjoyed the zero cost abstractions in Standard ML achieved by the whole-program optimizing MLton compiler (and I have a t-shirt to prove it). It's amazing how one can define clean, abstract, type-safe module hierarchies that still boil down to very, very efficient machine code that's just inches away from hand-written C code in terms of raw performance.

The zero runtime cost, of course, was paid in compilation time. With some tricks, it wasn't a big problem in practice, though.


Pardon my ignorance, but doesn't that family of languages suffer from the overhead of boxing?


They can unbox sometimes, especially when compiling with MLton or Stalin. There are research papers on specifying boxedness explicitly while staying within the ML-ish framework, but no real implementations AFAIK.


Along these lines is an intermediate language called Morte[0], developed by Gabriel Gonzalez, well known in the Haskell community. It seems to still be getting some love.

The basic idea is to implement a strongly normalizing calculus of constructions which languages can compile down to. That way, equivalent functions provably generate equivalent code. Here's [1] a blog from 2014 that goes into more detail on motivation, examples, and the like.

There's a proof-of-concept frontend called Annah[2]. I'm unaware of any backends though.

[0]:https://github.com/Gabriel439/Haskell-Morte-Library

[1]:http://www.haskellforall.com/2014/09/morte-intermediate-lang...

[2]:https://github.com/Gabriel439/Haskell-Annah-Library


"Zero cost" is a pet peeve of mine. While I concede that the concept is useful, there is always a cost somewhere.

A very concrete example of this was a C++ unit test suite where templates were used to generate several combinations of tasks/backends for verification. With all that abstraction, at one point our cloud builders ran out of memory. To avoid upgrading the instances at a €€€ cost, I did some manual type erasure.

So, zero cost typically only concerns run time characteristics. Even then there are elusive pitfalls, like generated code size.


In the Rust community, "zero cost" is pretty widely agreed upon to refer to "zero runtime cost". We're all unfortunately quite aware of the fact that our zero cost abstractions increase our compile times.


AFAIK, no one has ever meant "zero cost" to mean "zero compile-time cost".


Not with that very term, but it seems that an important criterion for the Oberon compiler was that any optimization should not make compilation of the compiler itself slower.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90....


And yet it is called zero cost. Not "zero cost along one particular dimension that we won't name and only sometimes matters."


We took it from C++ which has a long history with the term; take it up with Bjarne.


Sure he could. And he could take it up with anyone who copies it on a greenfields project too, right? It's a crap name. So what, computing is full of them, assembly language is full of them, C is full of them, C++ is full of them. Sometimes we make a break and don't use one and invent something better. Sometimes...

    mov [memory], %register
    memmove(dst, src, size);
    dst = std::move(src);

e.g. RAII, SFINAE: these are not great names to be copied. I'd be surprised if Rust did.


As I understand it, examples of "zero cost abstractions" would be std::string or std::map<X,Y> in C++. Reimplementing those in plain C is not a great idea, because the result will be very bug-prone, and will not be (significantly) faster.

The problem with these "zero cost abstractions" is that we're measuring the wrong cost.

The key to efficiency is optimizing what is done, not optimizing each little line of code. std::string is a terrible idea with regards to efficiency. If it works for you chances are you should use Python instead. For anything complex, std::string is slow and has a huge memory footprint (each string has its own buffer, separately allocated from the heap). It also leads to a heavy increase in compilation times. Anecdotally, it can even be painful to depend on it when the standard library is upgraded. If you have anything beyond a few megabytes worth of text data, std::string is about the worst possible technical solution.


Zero-cost abstractions are abstractions where the hand-written code wouldn't be any better than the abstraction.

An example is how Option<Box<T>> optimises down to a single pointer in Rust — the compiler uses its knowledge of how Box<T> is non-nullable, and coopts the null value to represent the None value for the Option.


Is this optimization required by the language semantics or some standard (vs. being only a description of mainline rustc's implemented optimizations)?


AFAIK there is no prescriptive standard for Rust; there are only descriptions of how the rustc compiler works. But this was a very intentional decision quite some time ago, and is also the reason why the `std::ptr::NonNull` type exists (the documentation explicitly states this allows enums to use the forbidden value as a discriminant).

FWIW the module documentation of `std::option` does describe this optimization for Option<Box<T>>.

> This usage of Option to create safe nullable pointers is so common that Rust does special optimizations to make the representation of Option<Box<T>> a single pointer. Optional pointers in Rust are stored as efficiently as any other pointer type.


It’s a defined part of the language semantics.


Is there actually a formal language semantics definition anywhere? I thought all we had was stuff like https://doc.rust-lang.org/reference/index.html which is explicitly listed as "not a formal spec".


Things have different degrees of guarantees. This is something we’ve explicitly said is guaranteed.


Where are things like this said/defined/collated?


We are generally at a point where if it's documented, it's guaranteed. This feature is really easy to describe, and so we have a high degree of confidence that when we say "this behavior is defined", we won't run into an edge case where we'd have to break it.

So, in this case, it's a documented part of Option.

I realize this answer isn't fantastic, but it's where we're at now. It's why we're working on the reference this year!


>we're working on the reference this year!

I didn't know this, it's good to hear. Thanks!


If your application's core functionality is text editing (or something similar) you might have to use something like ICU anyway, complete with its own string type and 30 megs of dlls. Or custom allocators on std::string if only memory allocation is critical. Or one of million other string classes (every freaking C++ library comes with one...) Or internal string implementation built specifically for your needs. Almost never would you drop back down to passing C-strings around though. That's just asking for buffer overflows and such.


Because of everything you've outlined in your last paragraph and more, I don't know many competent C++ programmers who would call std::string or std::map "zero cost abstractions" with a straight face. That term is more commonly applied to things like CRTP, template policy classes and other abstractions that you pay for in compile time (and, depending on linker optimizations, code bloat), but not in runtime [1]. As you point out, an abstraction that does a bunch of extra work is by definition not zero cost.

[1] Yes, code bloat can lead to runtime performance issues as well. Common sense coupled with linker optimizations can usually, but not always, mitigate those issues.


> std::string is a terrible idea with regards to efficiency. If it works for you chances are you should use Python instead

Concrete examples of what you mean would help here.

> For anything complex, std::string is slow and has a huge memory footprint (each string has its own buffer, separately allocated from the heap).

Some implementations (e.g., GNU's) will inline small strings, and so avoid a heap allocation. Python strings always heap allocate; if you're referring to the fact that Python strings don't have to chase a pointer to the data (the string data is inlined in the Python object itself), that might have some marginal effect on speed, but for most cases it really isn't going to matter. I'm not seeing the memory comparison: `std::string` is 32 bytes on my system; Python's str reports as 49. (+ character data for each.)

What would you prefer std::string did to make it more amenable to "a few megabytes of data"? (A case that, honestly, I think it handles just as well as most language's string type.)

(Now, if we want to harp on std::string's Unicode support, I won't stand in your way.)


> Concrete examples of what you mean would help here.

If it works for you, chances are that you are doing something that Python is equally suited for. And much more convenient. (Not necessarily, though).

> I'm not seeing the memory comparison

I didn't make one. I was saying something similar to "AVL trees vs RB trees largely doesn't matter. If you care about efficiency, you don't use either since they're both slow. If you don't care, the small speed difference doesn't matter".

> What would you prefer std::string did to make it more amenable to "a few megabytes of data"? (A case that, honestly, I think it handles just as well as most language's string type.)

Start by identifying strings that are immutable once they're constructed. These are usually the lion's share. Then start pooling these strings to avoid allocation overhead. This brings the overhead of a string down to a pointer or offset, and optionally a length, which costs anywhere from 4 to 16 bytes. Then, think about string interning to weed out duplicates.

This reduces memory consumption significantly. Depending on average string size, up to like 80% saved. And maybe more importantly, you can pass around strings freely without any overhead - string handles fit in a single register.
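A rough sketch of that approach (my own illustration; the `Interner` type and its methods are made up for the example): hand out small integer handles and pool the immutable string data, deduplicating as you go.

  use std::collections::HashMap;

  // A tiny string interner: immutable strings are stored once and referred
  // to by a 4-byte handle that fits in a register.
  // (Stores each string twice for simplicity; a real interner would avoid that.)
  #[derive(Default)]
  struct Interner {
      map: HashMap<String, u32>,
      strings: Vec<String>,
  }

  impl Interner {
      fn intern(&mut self, s: &str) -> u32 {
          if let Some(&id) = self.map.get(s) {
              return id; // duplicate: reuse the existing handle
          }
          let id = self.strings.len() as u32;
          self.strings.push(s.to_string());
          self.map.insert(s.to_string(), id);
          id
      }

      fn resolve(&self, id: u32) -> &str {
          &self.strings[id as usize]
      }
  }

  fn main() {
      let mut interner = Interner::default();
      let a = interner.intern("hello");
      let b = interner.intern("hello");
      assert_eq!(a, b); // duplicates share one handle
      assert_eq!(interner.resolve(a), "hello");
  }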

> (Now, if we want to harp on std::string's Unicode support, I won't stand in your way.)

Unicode is a mess. For all I've ever done, byte arrays (UTF-8) were the right choice.


std::string is a zero-cost abstraction joined at the hip to a non-zero-cost abstraction. At heart, it’s an abstraction over the pattern of having a uniquely-owned string pointer which is freed when the owner (object or stack frame) is freed. If you use it when you would otherwise use that pattern, the overhead should be effectively zero, at least if your STL implementation is high-quality.

However, as you note, the API encourages you to use that ownership pattern everywhere, in a way you likely wouldn’t if you were doing allocation by hand – and often by accident. As the worst offender, the copy constructor allocates a new buffer and copies the whole string, though it’s implicitly called whenever you simply assign one std::string variable to another, or pass a std::string to a std::string-typed function parameter. It’s hard to avoid copies even if you know what you’re doing, and easy to misunderstand the costs if you don’t. In contrast, if you were doing allocation by hand, most of the time you’d try to just pass around pointers to existing buffers, with various ownership semantics: temporary access (borrowed), transferring ownership (move), or even using reference counting.

- For the unowned case, you could always use `const std::string &`, but that only works when referencing a full existing heap allocation; it can’t represent substrings of heap allocations, or data that wasn’t on the heap in the first place. C++17 improved the situation by adding std::string_view, but you have to know to use it, especially since the API design is limited by backwards compatibility constraints: for example, std::string’s substr method still returns a std::string, not a string_view. And some people say that in ‘modern C++’ you should avoid raw unowned types to ensure safety, so if you follow their advice you would have to avoid string_view as well.

- For the transferring-ownership case, C++11 added essentially zero-cost move construction that just moves the pointer from one std::string to another. But again, you have to know to use it, and the default is an expensive copy.

- For reference counting, you could use std::shared_ptr, but it’s suboptimal (always using atomic reference counting even if you don’t need it, and doing a double heap indirection).

Rust’s standard library has a type with similar semantics, String. But with the benefit of hindsight, it improves upon std::string in each of those areas:

- For the unowned case, there’s &str, which is like C++ string_view, and it’s idiomatic to use &str everywhere instead of &String (the equivalent of C++ `const std::string &`). Slicing a String (i.e. the equivalent of substr) gives you &str; if you want to copy that into its own owned buffer, you have to do it explicitly via `.to_string()`. Also, Rust’s borrow checker verifies that your use of unowned pointers is safe, so there’s no safety/performance tradeoff.

- Regarding the transferring-ownership case, Rust makes move semantics the default. Assigning one String-typed variable to another, or passing a String to a String-typed function parameter, will do a move; if you try to use the source afterwards, or if you don’t uniquely own the source, it will complain at compile time (there are workarounds depending on what you want to do). If you want to copy the data to a new heap allocation, you have to explicitly write `.clone()`.

- For reference counting… well, it’s arguably less than optimal in Rust too, but better than C++. For example, the standard library provides both atomic and non-atomic reference-counting wrappers (Arc and Rc). You can use the faster non-atomic version if you don’t need to share the value across threads (and the compiler prevents you from doing so by accident). Also, you can avoid the double indirection by using Rc<str> instead of Rc<String>.

- Oh, and there is also Cow, the copy-on-write type, which represents a pointer that may or may not be owned. I don’t think the C++ STL has any equivalent; some STL implementations used to make all std::strings copy-on-write, before C++11 made that illegal, but that adds overhead for all users rather than just those who need it.

Thus, I’d say Rust’s String has a better claim to being a true zero-cost abstraction.
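To make the comparison concrete, here is a minimal Rust sketch (my own, using only standard-library types) touching the points above: borrowed `&str` views, moves by default, explicit clones, `Rc<str>`, and `Cow`.

  use std::borrow::Cow;
  use std::rc::Rc;

  fn main() {
      let owned: String = "hello world".to_string();

      // Unowned view: slicing a String yields &str, with no copy or allocation.
      let view: &str = &owned[0..5];
      assert_eq!(view, "hello");

      // Copying into a new heap allocation must be explicit.
      let copy: String = view.to_string();

      // Moves are the default: the compiler rejects any later use of `owned`.
      let moved: String = owned;
      assert_eq!(moved.len(), 11);

      // An explicit copy of the whole buffer, when you really want one.
      let cloned: String = moved.clone();
      assert_eq!(cloned, moved);

      // Non-atomic reference counting, without double indirection.
      let shared: Rc<str> = Rc::from(moved.as_str());
      let another = Rc::clone(&shared);
      assert_eq!(&*another, "hello world");

      // Copy-on-write: borrows until a mutation forces an owned copy.
      let mut cow: Cow<'_, str> = Cow::Borrowed(copy.as_str());
      cow.to_mut().push_str("!");
      assert_eq!(&*cow, "hello!");
  }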


The Lisp family of languages has macros, which can provide zero cost abstractions for you. For example, the loop facility is really just a macro, and so are Racket's for loops.

Not only that: in Scheme there is usually a source-to-source optimisation phase, which means you can easily inspect what that abstraction does. That way it is pretty simple to know that an abstraction really is zero cost.

My favourite examples are, as already mentioned, Racket's for loops and the various Scheme pattern matchers out there. The difference between Lisp and Rust is that these can be added to the language without having to change the compiler, and they can be built as a library by anyone.
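For readers unfamiliar with the idea, a tiny Rust analogue (my own sketch; Rust's `macro_rules!` is far less general than Lisp macros, and Rust's built-in `for` loop is compiler-provided, as the comment notes) of a looping abstraction shipped as a library and expanded away at compile time:

  // A library-level looping construct, expanded entirely at compile time,
  // so it costs nothing beyond the loop it expands to.
  macro_rules! repeat {
      ($n:expr, $body:block) => {
          for _ in 0..$n $body
      };
  }

  fn main() {
      let mut total = 0;
      repeat!(3, { total += 1; });
      assert_eq!(total, 3);
  }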


I would very much like to see this feature implemented at the (compiler + JIT) level in higher-level dynamic languages like Clojure, Erlang, ES8 JS, Python, etc.


There is a typo under the 'optimal performance' section. "A zero cost _abstractoin_ ought..."


Also at the end it says "not non-zero cost". Assuming the double-negative is a mistake, they should delete either the "not" or the "non-".


There's also a typo if you search for "they want ot write"


Nice post, but some code examples would have helped to clarify the concepts even further, especially for newbies.


Good luck with the trait object. I doubt you're going to significantly improve on what the Haskell community has done in 20 years. If you do manage it though, kudos in advance.


If Rust does manage it, I confidently predict someone will be submitting a GHC proposal within the week.


Noob here: why is the trait object hard to optimize in principle?


(disclosure: i know Haskell but not Rust, and haven't actually implemented anything like this in a compiler. please let me know if i'm wrong!)

best guesses:

- dynamic dispatch usually involves carrying around¹ some function pointers (to your type's implementations of the trait methods). this means that when calling `x.foo()` on a trait object `x` of some trait `Foo`, the actual function/method being called isn't known until runtime, so the compiler can't optimize the call through things like inlining (which require knowing what's being called). also AFAIK jumping to some unknown function pointer is relatively slow on the CPU level

- since the compiler doesn't know the actual type (and hence layout) of your object, it has to be passed via reference, so there's some indirection overhead and it's probably impossible to e.g. optimize away intermediate values or be smart about the way they're kept in registers.

the problem is that the nature of trait objects (i.e. "i know nothing about this value except that it implements the `Foo` trait) kind of seems to require these things :(

¹ via a vtable or "dictionary passing"
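A minimal sketch (mine, not the commenter's) contrasting the two dispatch strategies; `Foo`, `Bar`, `call_static`, and `call_dynamic` are made-up names for illustration:

  trait Foo {
      fn foo(&self) -> u32;
  }

  struct Bar;

  impl Foo for Bar {
      fn foo(&self) -> u32 {
          42
      }
  }

  // Static dispatch: monomorphized per concrete type, so the callee is known
  // at compile time and can be inlined.
  fn call_static<T: Foo>(x: &T) -> u32 {
      x.foo()
  }

  // Dynamic dispatch: `x` is a trait object (a fat pointer carrying data and
  // vtable pointers), so the callee is looked up at runtime and generally
  // cannot be inlined.
  fn call_dynamic(x: &dyn Foo) -> u32 {
      x.foo()
  }

  fn main() {
      let b = Bar;
      assert_eq!(call_static(&b), 42);
      assert_eq!(call_dynamic(&b), 42);
  }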


I don't think rust has polymorphic recursion. GHC is already pretty fast on non-polymorphic recursion.


This is the sort of comment on Hacker News I absolutely love: I know one of us is sitting in a place of ignorant hubris, and I doubt it's me.

The problems I refer to are, as I stated in the post, UX problems, and they arise ultimately from Rust's guarantees that generics will be monomorphized, that allocations will be explicit, that arguments are passed by value, and so on. None of this applies to Haskell. We don't need to "improve on" Haskell (though I think our coherence and instance resolution rules are superior); we have different design constraints resulting in different problems.


The perfect candidate for zero cost abstractions is a database API.

With databases as the bottleneck of web development, it is not optimal to use such a high-level API as SQL for everything. We resort to ANALYZE and strange SQL hacks to trick the query engine into performing the right algorithms, when the API should explicitly let us choose to use or tweak both zero cost abstractions and/or a query planner.


Yes. This is part of what I have to say about "make a GENERAL relational language and code the (full) app in it".

It is unnecessary complexity to have the "schema" split across several places (the DB catalog, the table, the view(s), the SQL fragment, the structs/classes, the mappings).

In my mind this is all you will need:

  rel City of
    id:Pk
    name:Str CHECK NotEmpty
just ONCE.

If we can also express the relational operators "raw":

  plan1 = Scan(City, WHERE id= 1)
  plan2 = FindKey(City, WHERE id= 1)

  plan3 = City ?where id= 1 |> toPlan (turn into one of the above)

then it is also possible to get zero cost abstractions.
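A rough Rust sketch of that idea (purely illustrative; this is not TablaM's actual design, and `City`, `Plan`, and `run` are invented for the example): expose the access path as an ordinary value the caller constructs directly, instead of leaving the choice to a planner.

  use std::collections::BTreeMap;

  // A toy "relation" with a primary-key index.
  struct City {
      rows: Vec<(u32, String)>,    // (id, name), in insertion order
      by_id: BTreeMap<u32, usize>, // primary-key index into `rows`
  }

  // The access path is an explicit, user-constructable value.
  enum Plan {
      Scan { id: u32 },    // full scan, like `plan1 = Scan(City, WHERE id = 1)`
      FindKey { id: u32 }, // index lookup, like `plan2 = FindKey(City, WHERE id = 1)`
  }

  impl City {
      fn new(rows: Vec<(u32, String)>) -> City {
          let by_id = rows.iter().enumerate().map(|(i, (id, _))| (*id, i)).collect();
          City { rows, by_id }
      }

      fn run(&self, plan: &Plan) -> Option<&str> {
          match plan {
              Plan::Scan { id } => self
                  .rows
                  .iter()
                  .find(|(row_id, _)| row_id == id)
                  .map(|(_, name)| name.as_str()),
              Plan::FindKey { id } => self
                  .by_id
                  .get(id)
                  .map(|&i| self.rows[i].1.as_str()),
          }
      }
  }

  fn main() {
      let city = City::new(vec![(1, "Oslo".into()), (2, "Lima".into())]);
      // The caller picks the algorithm explicitly; no query planner involved.
      assert_eq!(city.run(&Plan::FindKey { id: 2 }), Some("Lima"));
      assert_eq!(city.run(&Plan::Scan { id: 1 }), Some("Oslo"));
  }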

BTW: Slowly I'm building this:

http://tablam.org

Stuck on trait objects this week; probably will stick with enums!


I love SQL, but I strongly agree with this. It would be great to have a platform-specific low-level API for performance-critical queries. (Much like “unsafe” code in a managed programming language.)


I think the best analogy would be that this API would be to SQL what Vulkan is to OpenGL.

The ability to define the query procedurally, or at least semi-procedurally, rather than via SQL's declarative approach, would allow direct optimisation of very performance-critical queries.


I disagree. The amount of optimization that a query planner does is vastly more than a web application developer would care for.

We have people who can't be bothered to close database transactions (see that Uber report). I wouldn't trust them to choose the optimal join strategy in a complex query.


What Uber report?



So, it's zero cost from the program-performance standpoint. The user MUST pay something to have it.

It's not zero-cost from the user's standpoint.


How is the user paying for this? Generally the cost ends up being longer compiles and more development time.


The user is the programmer in this context, the user of the programming language, not the end user of the program.



