Hacker News
Is Rust stack-efficient yet? (arewestackefficientyet.com)
241 points by goranmoomin on Nov 17, 2022 | 106 comments



This affects the way you write code, too.

I was writing something in Rust and I wanted to create a new boxed object.

  Box::new(...)
Boom! Program crashes. The object I’m putting in the heap is too large for the stack. Rustc does this by instantiating the object on the stack, and then copying it to the box. I don’t really want to fuss with nightly or stuff like Box::new_uninit just to deal with this. C++ has both regular `new` and placement `new`, both of which put objects in memory which is already allocated. I had assumed that the Rust compiler could optimize out a move, since that’s such a prominent feature in C++ compilers.
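To make it concrete, something like this (illustrative; not my actual struct):

    // 8 MB of inline data, larger than a typical default stack.
    struct Big {
        data: [u8; 8 * 1024 * 1024],
    }

    // In a debug build, Big is first materialized in the current stack
    // frame and then memcpy'd into the heap allocation, so this can
    // overflow the stack before the heap ever sees the object.
    let b = Box::new(Big { data: [0u8; 8 * 1024 * 1024] });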


Box::new allocating on the stack and then moving is only true in debug builds; in release builds it allocates directly on the heap. Not apologizing, just explaining. It is being fixed; the interim solution is to use a Vec.


There’s a broader problem here, which also applies to the C++ ecosystem, which is that debug builds are far less usable than they should be. C++ compilers in common debug configurations will emit a function call for std::move(), which is not in any way useful for typical debugging tasks and can make the program significantly slower.

I don’t want to rely on compiler optimizations to make my code work. Or alternatively, find a way to deliver those optimizations in debug builds.

The idea of using a Vec would be nice—if only the boxed item were an array! It’s a struct, you see…


Having been bitten by that exact problem in C++, I think the original sin is to treat stuff like copy elision as a mere optimization, instead of a semantic guarantee.


The C++ committee recognized this problem. As of C++17, copy elision is mandatory. Several forms of it, at least.


Rust has also recognized the problem from a very early stage on.

For example, this is why there was a `box` operator in early Rust.

And e.g. placement-in-like APIs had been in the works for years; it's just that no satisfying and sound solution has been found (though multiple solutions initially seemed sound).

Which is why we are currently in a "hope the optimizer does the right thing" situation (though it is pretty much guaranteed to do the right thing for a lot of cases around Box).

But then it also isn't the highest priority, as it turns out a lot of "big" data structures (lists, maps, etc.) tend to grow on the heap anyway, so the situation where someone runs into debug builds crashing because of a big data structure is pretty rare, and it crashing in release builds is even rarer. One of the likeliest ways to end up with a too-big data structure on the stack is having some very deep type nesting. But such types are (in the Rust ecosystem) often seen as an abuse of the type system and an anti-pattern anyway. Though it can be a fun puzzle, and some people are obsessed with bending the type system to their will to create DSLs or encode everything possible in the type system. But I have yet to see a commercial project with mixed-skill-level team members where using such libraries didn't lead to a productivity reduction in the long run (independent of programming language).


It’s just a bit of a surprise, and Rust hasn’t ironed out some of these surprises. I’m sure it will get fixed eventually.

Yes, you can give examples of cases where unusual code (like deep type nesting) can create these large data structures, and you can call it an anti-pattern. But Rust is also pitched as a C++ replacement for greenfield projects, so you have all of these C++ programmers who are used to being able to “new” something into existence of any size, and then initialize it. A series of design decisions in Rust has broken that for objects which don’t fit on the stack.

I’m satisfied with the explanation that “no satisfying and sound solution has been found” and I’m also satisfied with “Rust developers haven’t gotten around to addressing this issue”. I’m not really interested in hearing why some people who run into the same issue are making bad decisions.


One piece of context I want to add: although there's no language construct for placement new, the unsafe `MaybeUninit` allows you to write partially to memory, and a macro[1] can be written to make it almost seamless to use.

[1]: https://crates.io/crates/place
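For illustration, a rough sketch of what such partial initialization looks like without the macro (the `Pair` type is made up; the linked crate wraps this pattern up for you):

    use std::mem::MaybeUninit;
    use std::ptr::addr_of_mut;

    struct Pair { a: u32, b: u64 }

    let mut slot = MaybeUninit::<Pair>::uninit();
    let p = slot.as_mut_ptr();
    // Write each field directly into the uninitialized memory,
    // never creating a reference to the not-yet-valid Pair.
    unsafe {
        addr_of_mut!((*p).a).write(1);
        addr_of_mut!((*p).b).write(2);
    }
    // Sound only because every field has now been initialized.
    let pair = unsafe { slot.assume_init() };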


> But I have yet to see commercial projects with mixed skill level team members where using such libraries didn't lead to productivity reduction on the long run (independent of programming language).

Mixed skill team or not, I really don’t see why Box<[u8; 1024 * 1024]> should be something the language struggles with.


EDIT: I realized the TryFrom is just implemented for Box<[T]>, not Vec<T>, but you can easily convert a Vec<T> to a Box<[T]>. I updated the code accordingly.

    vec![0u8; 1024*1024].into_boxed_slice().try_into().unwrap()

isn't that terrible to use

here as a function:

    fn calloc_buffer<const N: usize>() -> Box<[u8; N]> {
        vec![0u8; N].into_boxed_slice().try_into().unwrap()
    }

If you want to rely a bit less on the optimizer, using `reserve_exact()` + `resize()` can be a good choice. I guess it could be worthwhile to add a helper method to std.
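Something like this, if I understand the suggestion (a sketch; the methods are straight from std):

    let n = 1024 * 1024;
    let mut v: Vec<u8> = Vec::new();
    v.reserve_exact(n); // requests exactly n bytes (the allocator may still round up)
    v.resize(n, 0);     // zero-fill in place; no reallocation needed
    // capacity == length here, so this doesn't have to shrink or copy:
    let boxed: Box<[u8]> = v.into_boxed_slice();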


Agreed – but why would you want to box an array instead of simply using a Vec?


You can save memory by having fewer fields. This can matter when you have lots of small arrays.

Vec<u8> has {usize length, usize capacity, void* data}. Box<[u8]> has {usize length, void* data}. Box<[u8;N]> has {void* data}.
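You can check those sizes directly; these assertions hold on typical 64-bit targets:

    use std::mem::size_of;

    assert_eq!(size_of::<Vec<u8>>(),       3 * size_of::<usize>()); // ptr + len + cap
    assert_eq!(size_of::<Box<[u8]>>(),     2 * size_of::<usize>()); // ptr + len (fat pointer)
    assert_eq!(size_of::<Box<[u8; 16]>>(),     size_of::<usize>()); // ptr only (thin pointer)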


For a typical use case that seems like a rather extreme optimization, no? If you have a lot of objects with many small arrays and you're keeping them in a Vec, they'll be on the heap. If you're dealing with a bunch of small parts of a big blob of binary data, you'd use slices and not create new arrays. If you're on an embedded system you're not likely to have an allocator anyways.

(without trying to be too argumentative) right? Or?

Edit since I've been throttled:

  For example it can make a difference between passing values per register or per
  stack in some situations. … But then for some fields where C++ is currently very
  prevalent it might matter all the time.
That's an interesting one I hadn't thought about (and I didn't realize that the register keyword was removed in C++17). In a rather broad sense I hope Rust catches on in the kinda niche stuff where C++ is often popular. For example I've only done a little bit of dabbling with Rust in an embedded context, but overall I thought it brought a lot to the table.


In a system at $WORK I recently optimized a structure from String to Box<str> (a similar optimization, removing the 8-byte capacity field) and saved ~16 GB of memory. Granted, the program uses 100-200 GB of RAM at peak, but it still was a nice win for basically no work. It's also a semantic win, since it encodes "this string can't be mutated" into the type.
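The conversion itself is a one-liner; roughly (illustrative, not the actual code from $WORK):

    // String:   {ptr, len, cap} = 24 bytes on 64-bit targets.
    // Box<str>: {ptr, len}      = 16 bytes, and immutable by construction.
    let s = String::from("some value");
    let frozen: Box<str> = s.into_boxed_str(); // drops the capacity field, shrinking if needed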


yes but also no,

In some situations "optimizing smart pointers" to just be a single pointer size (Box<[T; N]>) instead of two pointer sizes (Box<[T]>) or instead of three pointer sizes (Vec<T>) can make a difference.

For example it can make a difference between passing values per register or per stack in some situations.

Or it could make the difference of how many instances of the boxed slice fit efficiently into the stack part of a smallvec. Which if it's the difference between the most common number fitting or not fitting can make a noticeable difference.

Though for a lot of fields of programming you likely won't opt to do such optimizations, as there are more important things to do / better things to spend time optimizing. But then for some fields where C++ is currently very prevalent, it might matter all the time.


I guess what they mean is that the Vec would allocate heap space, and you could steal the allocation for your object to make the Box? You'd need to create this MyType manually and then tell Box what you made, unsafely, with something like Box::from_raw().

It feels like a better way to do that directly with Box is Box::<MyType>::new_zeroed() which will make you a Box<MaybeUninit<MyType>> full of zero bytes. If MyType is definitely valid when made entirely of zero bytes and you're sure of that, you can unsafely assume_init() to have the MaybeUninit resolve to an actual MyType.

[[ If you lied, now everything is on fire, I did warn you that you need to be sure and it is an unsafe function ]]

If MyType is very much not valid if consisting entirely of zero bytes well, new_uninit() gives you memory in unspecified (must not be read) state, you can properly initialise it and then assume_init() as before - but all the extra work kinda sucks, and in either case clearly it would be nicer to just write what you meant and have it work.
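A sketch of the happy path (`MyType` is a stand-in; these APIs were nightly-only, under the `new_uninit` feature, when this was written):

    // requires nightly: #![feature(new_uninit)]
    use std::mem::MaybeUninit;

    let boxed: Box<MaybeUninit<MyType>> = Box::<MyType>::new_zeroed();
    // Sound only if MyType is valid when all its bytes are zero:
    let my_value: Box<MyType> = unsafe { boxed.assume_init() };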


I think the commenter made a guess that I was boxing an array, which is a good guess, it just happens to be wrong in this case.

Maybe that will work in the future—I don’t use nightly Rust, so for now, new_zeroed() won’t work. The basic problem is “I want to allocate something large on the heap” and it doesn’t seem like I should need to use nightly builds or unsafe{} to do it.


    let heap_value = vec![the_struct];
Based on another comment addressing this, I don't think the original commenter was making assumptions about the shape of your data.


This doesn’t actually work; it will still overflow the stack. The vec! macro will just copy its arguments into the heap; the arguments are still on the stack to begin with.


> I don’t use nightly Rust, so for now, new_zeroed() won’t work

That's a completely fair observation. The main thing I want stabilised is a single niche for custom types. I would take more if offered but experience says that every extra little thing doubles the discussion time, so, one niche is all I need, and Rust guarantees this exists in some form so even if a later mechanism does - say - fancy non-contiguous niches, I just want one value ASAP.

https://github.com/rust-lang/rfcs/pull/3334

> “I want to allocate something large on the heap” and it doesn’t seem like I should need to use nightly builds or unsafe{} to do it.

The former makes sense to me, the latter (a requirement to use unsafe) I can see there can be cases where the compiler has to do a lot of contortions to safely but optimally mint the type in place in the heap and just writing the unsafe case is reasonable. I don't know anything about your type so I can't judge.


> The idea of using a Vec would be nice—if only the boxed item were an array! It’s a struct, you see…

Vectors of length 1 are still vectors :)


If you hate relying on optimizations in principle, I have nothing for you, but if you pragmatically want your debug build to be more like your release build, then there are options.

All the major C++ compilers support some variation on the idea of "release with debug symbols". If you are using CMake or another meta-build system, there is usually a default set of options for this (e.g. RelWithDebInfo). If you are writing your own build scripts, you might just add -g and -O2 to your GCC or Clang flags.

The debug symbols will still consume space which will impact performance, but that is not likely to be a huge issue in all but the tightest performance regimes. And all the optimizations should be there.


There’s not an underlying principle here, just trying to avoid nasty surprises.

Many optimizations are in practice unreliable—they are buried in the depths of a compiler and not part of the docs, it may be difficult to find out what conditions are necessary for the optimization to work, you may find that an optimization stops working when you update your compiler, you may find that changing a seemingly-unrelated piece of code breaks the optimization (maybe some function is no longer inlined for various reasons), or you may use a different compiler.

So I prefer to write code that works correctly without optimizations. It’s not a hard rule, but in this scenario, I would prefer to rewrite the code—and this happens to be annoying here.


> common debug configurations will emit a function call for std::move()

This is being fixed in Clang, I think, at least. It will be treated as an intrinsic rather than a function, and I recall something similar for std::forward.


fwiw, GCC now has -ffold-simple-inlines exactly for this issue.


That’s not what -ffold-simple-inlines does. The -ffold-simple-inlines flag simply removes debugging information for certain inlined functions. It doesn’t affect whether the function is inlined in the first place. The result is that debug builds may have a smaller amount of debug information, but the code will be the same.


That's not my reading of the docs, which explicitly talk about folding. Also, simple tests show that it does indeed inline the call even in debug mode.

In addition to inlining, it also removes debug info.


You may be right—I was reading the release notes, and the actual docs go into more detail about what the flag does.


There's also the `artificial` attribute which instructs the debugger to skip through marked functions.


This is not guaranteed behavior and should not be relied on!


The problem here is that performing the optimization can change the semantics of the program, if you consider order of memory allocation to be part of the semantics. If the object you're creating itself has memory allocations within it, then allocating the box before constructing contents will change the order of the mallocs. C++ compilers will not do this optimization for this reason, though C++ has emplace so that the programmer can manually work around it. For this reason I think that it may be best to just introduce an emplace-like pattern for Rust as well.

If constructing the object doesn't have side effects, however, then we should be able to hoist the allocation at the optimizer level.
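A sketch of the hazard (`Inner` is made up; the point is just the two allocations):

    struct Inner { buf: Vec<u8> }

    // As written, the Vec's buffer is allocated first (argument
    // evaluation), then Box::new performs the second allocation.
    // Hoisting the box's allocation to construct Inner in place
    // would swap the order of the two mallocs.
    let b = Box::new(Inner { buf: vec![0u8; 1024] });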


> if you consider order of memory allocation to be part of the semantics

Don't do that, then? :)

LLVM can already do heap-to-stack, so people don't seem to feel too strongly that exactly preserving heap allocations is worth it. And what does that mean on multi-core, anyway?


Bringing that feature to Rust was under discussion for a while but it was ultimately withdrawn:

https://github.com/rust-lang/rust/issues/27779#issuecomment-...


There is an open RFC that takes a different approach: https://github.com/rust-lang/rfcs/pull/2884


There's some ugly hacks that may or may not help here:

https://www.reddit.com/r/rust/comments/xxhp3s/perhaps_not_a_...

    r#box!(make_my_elem())
This crate is marked as deprecated because apparently upstream rust optimises its use-case now, but you never know:

https://github.com/kvark/copyless

    Box::alloc().init(make_my_elem())


It seems crazy to me that something as trivial as allocating heap space needs rust nightly and unsafe. And people want to rewrite the world in Rust.

Maybe in another ten years the language will mature.


> It seems crazy to me that something as trivial as allocating heap space needs rust nightly and unsafe.

It seems crazy to you because your spectacular misdescription of the problem is simply incorrect.


I noticed in one of my crates that Rust often cannot optimize "moves" away. I was in the unusual situation where I had to move around a very large stack buffer (typically in Rust these live on the heap). Instead of passing it back and forth between functions by move, as I originally designed, I had to redesign it to use macros, which significantly improved my benchmarks. Further attention to optimizations here would be very welcome.


How did you implement those macros?


Tbh it was like a year ago and I forget the specifics. You are welcome to look at the code, however:

https://github.com/nu11ptr/flexstr/blob/master/flexstr/src/b...

UPDATE: As I'm thinking about it, it is starting to come back to me a little:

1. I create the buffer

2. I do some op against it to fill it

3. I consume it and transform it into a final immutable flexstr

For #1, the 'new' function moves the buffer back to the caller (memcpy). Using it in #2 I think was fine, as I think it is typically passed by mutable ref. For #3, the buffer was moved again (passed by `self`) so it could be consumed and reused without a language-level copy (but it was copying in the generated code). Replacing #1 and #3 with macros kept the stack buffer in the local stack frame and greatly sped up my code in benchmarks, and that is what those two macros I linked to do, if I'm recalling correctly.


Prior discussion with the author on Reddit: https://www.reddit.com/r/rust/comments/yw57mj/are_we_stack_e...


For easy to miss context:

This seems to be mainly about measuring the change in stack efficiency of Rust compared to a C++ baseline.

Given who the author is, I'm pretty sure they know that e.g. having more stack movement can be better than having more heap allocations.

But there is a list of ways you can reduce stack memory movement through compiler optimizations without allocating heap memory, and AFAIK Rust doesn't yet fully take advantage of many of them.

Additionally, sometimes trading a lot of stack movement for a single allocation can be quite a bit faster; that's e.g. why anyhow does a thin-pointer + inlined-vtable optimization.

So I think the site is mainly for tracking improvements in compiler code generation, and secondarily whether rustc uses some thin-pointer optimizations in places where they yield some benefit.


We need &out and &in, so we can safely write this stuff by hand. Then we can worry about the compiler automatically optimizing to use &out behind the scenes (using ABI flexibility).

Trying to skip that first step, so the compiler just does unverified shit behind the scenes, I think will just end up with inflexibility and disappointment. Be the tortoise not the hare.


Easier said than done. You can already have:

    unsafe fn new_in_place(out: &mut MaybeUninit<MyType>)
But you require unsafe code both to implement it (to project MaybeUninit) and to call it.

For it to be safe to call, the type system would have to encode the invariant that &out is initialised after the call. It becomes even more complicated if the operation can fail.
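A sketch of the caller side today, reusing the hypothetical `new_in_place` above:

    use std::mem::MaybeUninit;

    let mut slot = MaybeUninit::<MyType>::uninit();
    let value = unsafe {
        new_in_place(&mut slot);
        // Nothing in the types proves this; we just have to trust
        // that new_in_place fully initialised `slot`:
        slot.assume_init()
    };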


As a stepping stone, we could worry just about the panic=abort case, where this is all easier.

For panicking, I think we would want to switch to a model where the borrower, not the borrowee, runs the destructor for &mut; that way we can support "moving out of &mut temporarily" too.


Yes, you do. This is the cost of doing business. I'm OK with it.

If one doesn't want to deal with imperative programming in all its glory, go write Haskell or something. (I do that too.)


It would be more informative if the stack metrics were paired with heap metrics. You can trivially avoid stack to stack copies by allocating on the heap and passing pointers / references. But that is often actually slower, because heap allocation is more costly than copying data within the stack.


This is not about a heap-vs-stack tradeoff, this is about the compiler routinely generating very inefficient code that copies data around on stack for no good reason.

For an example I've seen myself: using a custom GC pointer library, calling `Gc::allocate(SomeBigStruct{...})` constructed the SomeBigStruct on stack and copied it around using memcpy 4 times before it actually landed in the allocated heap memory. The equivalent code compiled with a C++ compiler would have probably optimized the program enough to fill the struct in-place on heap without any issues.

(this example is from over a year ago; it's not as bad anymore, but it still generates suboptimal assembly with too much copying)


If you don't mind me asking - how do you witness these low level memory allocations? Specific program, plugin to vs code?


The simplest way to make quick experiments for me is with https://godbolt.org/ .

For my particular example: https://godbolt.org/z/8GvYzYj5h

You can see we're trying to put a 1kB object on heap, but the compiler generates two `memcpy` calls - first to copy it on stack to build the wrapper struct, second to actually copy the entire struct onto allocated memory.
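The shape of that example is roughly the following (a reconstruction from the description above, not the literal godbolt code):

    pub struct Wrapper {
        tag: usize,
        payload: [u8; 1024], // the 1 kB of inline data
    }

    pub fn to_heap(payload: [u8; 1024]) -> Box<Wrapper> {
        // Generated code: memcpy #1 builds Wrapper on the stack,
        // memcpy #2 copies it into the Box's allocation.
        Box::new(Wrapper { tag: 0, payload })
    }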

In "real programs", you just need to look at the program's assembly. Or even more generally, I originally noticed this when observing the unusually high amount of time spent in some functions when profiling.


Other than Godbolt Rust also has cargo-show-asm[1] that directly shows the actual assembly.

[1]: https://crates.io/crates/cargo-show-asm


You look at the generated assembly.


Most likely these kinds of things are the cause, but this data doesn't actually show that, because it lacks the other information. It's also possible Rust does more copies per operation, which would be useful to track anyway.


The optimizing backend for Rust and C++ is common. So if it didn't get optimized in Rust, it is very likely it wouldn't be in C++ as well. However, the code style of those two codebases might be different. IMHO C++ code is traditionally a lot more pointer and heap allocation heavy than Rust code. Rust makes moving stuff very convenient and using pointers/references quite inconvenient. Therefore showing heap allocation profiles would help us understand if those differences are due to actual compiler inefficiencies or different memory management patterns used in the source code.

Also, rustc's code was written at a different time than the majority of clang's code. That might well affect the copying patterns. Move semantics is actually a quite modern thing in C++ (and not the default like in Rust).


> The optimizing backend for Rust and C++ is common. So if it didn't get optimized in Rust, it is very likely it wouldn't be in C++ as well. However, the code style of those two codebases

eh? Rust doesn't have placement-new or specify copy elision. nothing at all to do with backends or "code style".


Sure, it doesn't do placement new yet, but I think you're exaggerating the effect of the lack of copy elision. Rust doesn't really need copy elision as heavily, because it defaults to moving stuff, and a move is just a very shallow copy, typically taking one or two cycles. I've never seen it become visible in a profile.

In C++, copies are much more heavy, because they need to preserve the original, so copy elision matters a lot more. Also a developer is free to put arbitrarily complex stuff into a copy constructor. In Rust, those heavy copies are explicit so the developer can fully control when they happen.

Nevertheless - it could be all those reasons together. It's good someone is looking into it.


> Move semantics is actually a quite modern thing in C++ (and not default like in Rust).

But it's very different. In Rust there are no move-constructors. A move is simply a memcpy. And the moved-from object doesn't have to be in "a valid, but unspecified state". You cannot use it, because the borrow checker prevents you from doing so. So it can actually be in any state, giving the compiler more room for optimization.
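Concretely (the commented-out line is what the borrow checker rejects):

    let s = String::from("hi");
    let t = s; // a move: a memcpy of the (ptr, len, cap) triple; no constructor runs
    // println!("{}", s); // error[E0382]: borrow of moved value: `s`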


C++'s constructors are basically custom built for this situation. It's really the only thing they're good for.


What? If I understand correctly, the LLVM IR -> binary stage is shared. The Rust -> LLVM IR or C++ -> LLVM IR stages are obviously not shared, and these optimizations can happen there. Specifically C++ guarantees copy elision in some cases.


I'd appreciate it if the graphs showed a larger span instead of every 4 hours going back at most a day (showing a flat line). It'd be very cool to see it spanning months, so that we could get a sense of the trend of the line.


OP mentioned on reddit that he’s planning on updating the graph with new data as he makes improvements


Note that the author mentions they will update the page manually, not through CI.


This looks like a good metric to track. Looking at some generated asm when I was optimising a no_std program for size, I was surprised how much stack shuffling was going on. Also iirc it had runs of load/store where it seemed like a loop might be better.

Shrinking the size of my library's Error struct seemed to help a lot - I wonder if it's because an error "E" smaller than the Result<T, E>'s "T" can be returned in-place, but a larger one needs a copy...?


I would love to see this for C#. Avoiding heap allocations was something I had to tackle as part of a simulation software. C# has a handful of GC generations, and GC gen1 is very fast, but it's still faster to pool things ahead of time.


This isn't related to heap allocations. It's about removing the cost of a stack variable being moved into another stack variable.


I became interested in stack performance specifically because I had to avoid heap allocations.


I mean, it's tangentially related, sure. It's also related to ML because computers run ML algorithms and computers run programs with stacks.


While I want to understand your point and welcome enlightenment, so far this conversation has just been kinda weirdly hostile.


Using structs, stack allocations (safe since C# 7.x) and native heap go a long way.

Many of the post-C#-7 features have been used to improve .NET performance in TechEmpower benchmarks, and to rewrite runtime code from C++ into C#.


You can also use generic functions that take struct arguments instead of delegates to get something like STL. I wish there was some syntactic sugar in the language for that, too - basically, struct lambdas.


There is a bit of that with static lambdas and function pointers (unsafe), but it still isn't quite what you want.


Well, until relatively recently, they wouldn't be able to desugar what I really want into underlying bits. But now that we have verifiable ref structs with ref fields in them in C# 11, it's a straightforward transformation. Might be worth a proposal, actually...


You can access C#'s GC metrics through `System.GC.GetTotalAllocatedBytes` and similar APIs. Maybe you can make a benchmark and tracker for it yourself!


I've run into frustrating stack overflows in seemingly trivial non-recursive code, so I appreciate this effort!

I wonder if it might be simpler to track stack sizes statically instead of using runtime instrumentation. What I mean is, for example on x86_64, the function prologue has a stack reservation in the form of `sub rsp, 0x168`. So we can easily tell that the function uses 0x168 bytes of stack space. Just add those numbers up for every function in a crate, and you have a score. Track that number over time for a set of common crates.


I like tracking this, or at least having a way to track it. It's incredibly common that you only discover a crash too late in production due to running out of stack size, and at least knowing a histogram over some test runs of how close something got to the (configured) limit would already be incredibly helpful for service stability.


It's not clear to me how this would track stack<->stack and memory<->stack copies. Can you explain?


My suggestion would not measure the copies themselves, but it would count how many bytes are the source/destination of copies (of course incl. other things like parameters/etc). It's not the exact same metric, but it does still help answer the question in the title. I would expect the two numbers to be highly correlated when building the same binary with different versions of the compiler.

It would also solve some practical problems mentioned in the page like the complexity of the setup and the speed of statistics gathering.


If you copy between stack regions, you need more stack memory, so the stack allocation should be larger. Of course, this would undercount cases that copy to the same region many times which seems likely.


This won't work, since you also need to track the maximum number of stack frames (e.g. recursive calls), which is undecidable.


That's why I like AVR8 assembly better than C for Arduino. You have this big register file and the one C wants to use the most is the stack pointer.


That's not the optimization in question here, though. C uses the stack pointer a lot if you use a lot of stack-local variables. Mark all your stuff "static" and it will use immediate addresses instead (which may or may not help you -- putting all the "local" stuff in a block referenced by one pointer is usually a good thing!).

What's happening here is that C++ is the inheritor of decades of ponderous analysis about how code works with temporary results such that it can usually (.../often/sometimes/under-the-right-astrological-sign) optimize them away or arrange to have them magically appear in the right place. The return value optimization and all the move semantics nonsense is aimed at this space.

As a result, C++ tends to put things on the stack "where they want to go", where I guess Rust is a little naive and needs to build them in one place just to copy them where they need to be.


> C uses the stack pointer a lot if you use a lot of stack-local variables. Mark all your stuff "static" and it will use immediate addresses instead

Yes, but optimizing compilers are decently good at using registers instead, which is unfortunately not true for less popular targets (which usually means anything other than amd64 and arm64). I think that's what GP was referring to.


Honest question - does this mean Rust isn't ready for production yet?


No. TFA points out there isn’t a gigantic performance sink or anything, just an infelicity in the code they’re generating.

>> Does this mean Rust is slower than C++?

> No. You can always write your Rust code carefully to avoid copies. Besides, all of this only comes out to a small percentage of the total instruction count. That being said, it's something we should fix, and which I'm working on.


As pointed out elsewhere in this comment tree, things living on the stack when they needn't can mean they don't fit and thus the program doesn't work, so the optimisation can matter for that reason, and this is a particular reason to worry about it for larger objects where the optimiser is more likely not to see what's going on.


And yet, real world production rust programs exist and measured performance is generally excellent.

People are acting like this graph they saw for the first time today means that Rust is running at sub-Ruby speeds, when even with these excess copies we already know, and have known for years, how Rust programs perform in real life.

That there is room for improvement here does not mean that the status quo was not already excellent.


> People are acting like this graph they saw for the first time today means that Rust is running at sub-Ruby speeds

Well, this is no different of how Rust people talk about C++ as if it was as unsafe as if you were writing inline assembly :)


It is totally unrelated to production readiness. This is just trying to address and understand a cost in Rust that ideally wouldn't exist. The cost exists in C++ too, but Rust uses move semantics much more aggressively vs reference forwarding.


No. Lots of people are happily running Rust in production and getting significant value from it.


I don’t understand the point of this. Is there a trade-off between being stack-efficient and speed?


The main goal is replacing pointless stack-to-stack copies with simply constructing or mutating in place on the stack in the first place.

Due to some mix of:

* Rust code relying more on copies than C++ (for example, it's harder to make something uninitialized and fill it in)

* LLVM missing optimizations that Rust relies on more heavily than C++

* No real guarantees around RVO / NRVO

Rust code often will put something on the stack, and then just instantly copy it somewhere else on the stack, even in optimized code. I've observed this happening sometimes pretty blatantly myself.


> No real guarantees around RVO / NRVO

Shouldn’t Rust in theory have a lot more freedom in defining its calling conventions than C++ has? I wonder if there’s anything that prevents doing RVO by default, or if just hadn’t been a priority yet.


I think in theory it could, but something was definitely getting clogged in the optimizer. I'd see code like

    let a = A::new(...);
    return a;

create `a` on the stack and then immediately copy `a` into the stack region the caller was expecting it in. This seemed to get worse as struct size got larger, so I'm guessing there was so much IR the optimizer had to churn through that it just gave up at some point.


Evaluation order is unspecified in C++, whereas it is well-specified in Rust. This makes things easier to reason about in Rust, but does give the optimizer less wiggle room.

https://en.cppreference.com/w/cpp/language/eval_order


How is evaluation order even relevant for RVO?


In code like `a(b(d),c(e))`, I think it could be relevant. You would want different code based on the size of `b(d)`, `c(e)`, `d`, and `e`. If you must evaluate b before c, that would eliminate some possible arrangements.

Specifically, if `e` and `b(d)` are huge, you probably would want to evaluate `c(e)` first and then `b(d)`.


There was an RFC for them, but it didn't get much traction.


"Memory moves to the stack frequently represent wasted computation. For the most part, they're CPU cycles that are spent shuffling data from one place to another instead of performing useful work. Stack-to-stack memory moves in particular are very likely to represent pure overhead; non-stack-to-stack memory moves are sometimes genuinely useful and necessary but frequently also represent waste."

It's essentially a critique that the optimizer is missing opportunities.


FTA:

> Why do we care about stack memory moves?

> Memory moves to the stack frequently represent wasted computation. For the most part, they're CPU cycles that are spent shuffling data from one place to another instead of performing useful work. Stack-to-stack memory moves in particular are very likely to represent pure overhead; non-stack-to-stack memory moves are sometimes genuinely useful and necessary but frequently also represent waste.


> Is there a trade off in being stack efficient and speed?

It's just rust being slightly less efficient: it spends instructions doing unnecessary stack-to-stack copies, and has larger stackframes (to hold the redundant copies, which can be an issue both with deep recursion and for inlining).


But why are these operations slow in Rust?


Safe Rust prevents access to uninitialised memory, so a pattern like:

    buf = malloc(size)
    init(buf)
is too risky, because init could read the uninitialized memory or fail to overwrite all of it causing a heartbleed-like leak elsewhere (most programmers may think it's just garbage bytes who cares — Rust cares.)

Rust's safe abstraction for initialization and heap allocation instead passes structs by value (you can't misuse a buffer pointer if it doesn't exist), and relies on the optimizer to remove all unnecessary copies. The optimizer doesn't always do that, which is what this site tries to measure and fix.
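For contrast, here's roughly what the malloc-then-init pattern looks like when you do drop down to unsafe Rust (a sketch; real code would likely use a crate or the nightly Box APIs instead):

    use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};

    fn boxed_zeroed() -> Box<[u8; 1 << 20]> {
        let layout = Layout::new::<[u8; 1 << 20]>();
        unsafe {
            // malloc(size) with zero-init; validity is now upheld by hand.
            let ptr = alloc_zeroed(layout);
            if ptr.is_null() {
                handle_alloc_error(layout);
            }
            // All-zero bytes are a valid [u8; N], so this Box is sound.
            Box::from_raw(ptr as *mut [u8; 1 << 20])
        }
    }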


They are precisely the same speed in every language, seeing as they're going to be the same instructions. The linked site points out that Rust does more of them.


Apparently not: ERR_CONNECTION_TIMED_OUT


It works for me, but you could try archive.org: https://web.archive.org/web/20221115213622/https://arewestac...


Seems to be up now. For me at least.



