Identifying Rust's collect:<Vec<_>>() memory leak footgun (polybdenum.com)
180 points by muglug 11 months ago | 122 comments



I don't think it's a memory leak. It's re-using the same space of the underlying array. It's allocated. Dropping the Vec will release the memory.

In v1 you put in 128 containers, each with 1024 boxes.

Then v2 takes the first box out of each container, tosses the container, and puts the box in the space where the container was, packing them together.

The fact that capacity halves when you remove the as u8 is... normal. You reuse the space, so you can fit 2 u8s in the space of 1 u16.

The problem here is more that the optimization causes a Vec<T> -> Vec<Y> where sizeof(T) > sizeof(Y) to have much more capacity than expected.

Which IS a valid bug report. But again, not a memory leak.


In the Haskell world there's a handy distinction for this kind of leak.

Memory leak: classical C style, memory fully lost, shows up in valgrind (runtime bug).

Space leak: some memory structure requires significantly more memory than it should, e.g. keeping a (lazy) ref to a large blob around, which keeps it in memory.

The effect is largely the same, but the latter is often harder to spot, since the tooling cannot be as general as valgrind.


Also the latter is the bane of lazy languages, which make space leaks both commonplace and extremely difficult to notice.


Generally, holding a reference to something for longer than you should is a different kind of problem from “the vector never resizes down”.


It seems like you have a better grasp on the problem than I do. I understand why capacity in terms of number of elements doubles. But why did he end up using 200x more memory? Is it that the transient memory used the 18GiB and then he reduced it to a vector of elements that were 200x smaller? Or is it that there’s some other problem where, when you collect, it’s reallocating 2x using the wrong type to do the computation?


Because it doubles in every node and OP has 300k of them[1]. It's not a leak, since memory is...well...accounted for and easily freed by either dropping the Vec or calling shrink_to_fit. IMO it's expected behavior.

Honestly, if I had 300k Vecs, I would call `shrink_to_fit` even without that optimization.

1: Since the edge list calculation logic is run for one node at a time in sequence, the fact that it temporarily uses 132kb of memory would normally not be a problem at all. However, the new collect() optimization meant that that internal storage was being preserved in the final vec which got inserted into the list of edge lists, thus keeping around that 132kb of obsolete memory forever. Multiply that by 300k nodes and suddenly you’ve leaked the computer’s entire RAM.


Ah right. Vec of vecs and the interior Vec has wasted capacity due to the reduction of the element size. Thanks!


The `map` step converted the data to a smaller type, and the whole chain was smart enough to reuse the original Vec without reallocating. Since the original item size was [much bigger], there was a lot of "free" capacity left over.

It is indeed a bit surprising, but seems entirely reasonable for the platform to do imo. It's efficient, in both memory (allocation) and CPU. If you want a specifically-sized container, you have to make sure that happens, otherwise you're letting things be optimized in language-general ways. That will frequently be wrong in edge cases, and may change at any time.


A similar thing happens in Go if you don't use a pre-known 'capacity' when allocating slices. Each internal realloc could use a larger backing array. Adding elements one by one will keep doing this, creating lots of garbage. It all gets collected of course, but performance is horrible, with memory fragmentation and all the copying.


C# doubles the size of a List<T> each time you hit the capacity (if you didn't set it correctly when you created the list or left it at the default). Does Go do something similar or does it only increase it by 1 each time?


It grows by 2x for small sizes, then it transitions to growing it by 1.25x


In many of these languages (and C++) you can inadvertently pessimise by trying to provide useful size hints; Rust has a smarter API here.

Here's how that goes: suppose I receive batches of 10 Doodads. My growable array type has an API to reserve more space for Doodads, so before pushing each batch onto my growable array I reserve enough space for 10 extra. At small sizes this helps. Instead of growing 1, 2, 4, 8, 16, 32 Doodads, for two batches we grow to 10 and then 20 Doodads: we're only allocating twice and copying 10 items, instead of allocating 6 times and copying 31 items. But at bigger sizes it hurts instead: we're doing far more allocations and exponentially more copying.

Rust provides two distinct functions, Vec::reserve and Vec::reserve_exact. With reserve we get to avoid the pessimisation, it will grow faster than the exponential amortization if asked but never slower - while reserve_exact still allows us to give a specific final size in cases where we actually know that before growing too far.
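
A minimal sketch of the difference (exact capacities depend on the allocator, but the reallocation pattern is the point):

    fn main() {
        let mut a: Vec<u32> = Vec::new();
        let mut b: Vec<u32> = Vec::new();
        for batch in 0..1000u32 {
            // `reserve` may over-allocate (amortized growth), so most of
            // these calls become no-ops once the vector is large.
            a.reserve(10);
            // `reserve_exact` asks for exactly enough, so repeated calls
            // force a reallocation (and copy) for almost every batch.
            b.reserve_exact(10);
            for i in 0..10 {
                a.push(batch * 10 + i);
                b.push(batch * 10 + i);
            }
        }
        println!("reserve:       len={} cap={}", a.len(), a.capacity());
        println!("reserve_exact: len={} cap={}", b.len(), b.capacity());
    }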


You can use slices.Grow for that purpose in Go. Exact sizing is only possible during slice creation.


It's unclear which policy slices.Grow exhibits, and so maybe it's not guaranteed whether you get the equivalent of Vec::reserve or Vec::reserve_exact? My guess, since it's not mentioned and no alternative is supplied, is that you get Vec::reserve_exact - the same pessimisation footgun.


You get reserve, not reserve_exact. You are right that it would be allowed to implement reserve_exact semantics instead, given the method comment. But that’s not how it works; if you want reserve_exact semantics, you need to take extra steps. That said, the semantics are not exactly the same as reserve. The Go compiler may optimize some allocations (I don’t know if it does though).



Interesting that it considers >= 256 entries to be 'not small'. That would be accurate for larger structs, not so much if the element is a bool, int, pointer, etc.


Using 1.25x creates more copies along the way than 2x.


Sure. It's a tradeoff to avoid wasting huge amounts of memory once the list gets huge.

No matter what the growth factor, you want to set the capacity up front if you know it. But 2x and 1.25x are both reasonable.


No, if that's all it was, the excess memory usage would be 2x.

But it's 200x.

Right?


Looking at this example:

    fn dbg_vec<T>(v: &Vec<T>) {
        println!(
            "vec data ptr={:?} len={} cap={}",
            v.as_ptr(),
            v.len(),
            v.capacity()
        );
    }
    
    fn main() {
        {
            // 128 elements, each a [u16; 1024] (2048 bytes per element)
            let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
            dbg_vec(&v1);
            // v2's capacity reflects the size of v1's allocation, not v2's length
            let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
            dbg_vec(&v2);
        }
        {
            let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
            dbg_vec(&v1);
            // same, but keeping the elements as u16
            let v2 = v1.into_iter().map(|x| x[0]).collect::<Vec<_>>();
            dbg_vec(&v2);
        }
    }


    # cargo +nightly clean
         Removed 11 files, 7.3MiB total
    
    # cargo +nightly run --release
       Compiling vec-debug v0.1.0 (/home/neo/vec-debug)
        Finished release [optimized] target(s) in 0.17s
         Running `target/release/vec-debug`
    vec data ptr=0x7f47ee162010 len=128 cap=128
    vec data ptr=0x7f47ee121010 len=128 cap=262144
    vec data ptr=0x55c6514f3ba0 len=128 cap=128
    vec data ptr=0x55c6514f3ba0 len=128 cap=131072
    
What happens is that the original vector's assigned space gets reused. That's it.

It LOOKS like there is more because the capacity inflates. But capacity is counted in units of sizeof(T), not bits. For

    (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>()
the capacity (and length) is 128, with each element an array of 1024 u16s. We then re-use the same underlying array, which has a CAPACITY of 128 * (1024 * 16) bits, and put in u8s (the cast). So each item went from 1024 * 16 bits to 8 bits.

So previously we had 128 * 1024 * 16 = 2,097,152 bits.

How many times can we put 8 bits in a capacity of 2,097,152 bits?

262,144

How many times can we put 16 bits in a capacity of 2,097,152 bits?

131,072


Thank you for explaining it this way. I’ve been seeing this bit of news off and on all day and knew there was something I wasn’t understanding. This was the key bit


Please read the article. His code produces 200x intermediary values. Clickbait title, since it wasn't a leak.


I agree that this is not a memory leak.

However, the semantic distinction between "this uses much more memory than expected" and "this is a memory leak" is a little subtle, and it seems pretty rude to call it clickbait.


No, memory leak has a very distinct definition: memory that is unused and still allocated, but inaccessible. A memory leak can be as small as a single word. In this case, it's just excess memory. There is another term for this scenario, which I don't remember.

This is a case of optimization gone wrong, but nothing is leaked, and every single byte is accounted for.

The title is clickbait, but the article is still interesting to read.


Clickbait (in the context of Rust). In languages with managed memory there are no true memory leaks, so such wastes are called leaks. In lower-level languages, we should stay more strict with what we call things.


… Box::leak¹ is a function that exists. That seems like a memory leak, no?

Less tongue-in-cheek, if a program allocates far more memory than expected of it, I'm going to colloquially call that a "memory leak". If I see a Java program whose RSS is doing nothing but "up and to the right" until the VM runs out of memory and dies a sweet sweet page thrashing death, I'm going to describe that as a "memory leak". Having someone tell me, "well, actually, it's not a leak per se, it's just that the JVM's GC didn't collect all the available garbage prior to running out of memory because …" … I don't care? You're just forcing me to wordsmith the problem description; the problem is still there. Program is still dead, and still exceeding the constraints of the environment it should have been operating in.

The author had some assumptions: that Vec doesn't overalloc by more than 2x, and that collect allocates — one of those did turn out to be false, but I think if I polled Rust programmers, a fair number of them would make the wrong assumption. I would, and TIL from this article that it was wrong, and that collect can reuse the original allocation, despite it not being readily apparent how it knows how to do that with a generic Iterator. (And, the article got me to understand that part, too!)

Unlike most clickbaits which lure you in only to let you down, I learned something here. Reading it was worthwhile.

¹https://doc.rust-lang.org/stable/std/boxed/struct.Box.html#m...


> "If I see a Java program whose RSS is doing nothing but "up and to the right" until the VM runs out of memory and dies a sweet sweet page thrashing death, I'm going to describe that as a "memory leak"."

By this definition, if a program reads in a file and you point it to a small file then the program does not have a memory leak, but if you point it to a large enough file, then the program does have a memory leak. Whether or not a program has a memory leak doesn't depend on the code of the program, but how you use it. But then on a bigger computer, the program doesn't have a memory leak anymore.

That seems a less useful definition than the parent poster's / the common definition.


… that's really not the idea I'm trying to convey with the comment.

Clearly, if you feed a program a larger file that it is going to read into memory to process, it is then expected that it will consume more resources on account of it doing more work. But that is memory being expended on visible, useful work. All of the examples in the comment are referring to memory being "allocated" (in the sense of being assigned to the program) but not fulfilling any visibly useful function insofar as the operator/programmer can see: Java's GC being unable to effectively reclaim unused memory prior to killing a machine, the OP's example of a Vec allocating without (seemingly) having a purpose (…as it is in excess of what is required to allow for amortized appends).


There is an implied steady state in what the program is doing. If it goes right from "loading" to "exit" then you need a more complicated analysis.

When you have that steady state, that definition looking at uncontrolled growth is more useful than trying to dissect whether the memory is truly unreachable or only practically unreachable.


>I don't care? You're just forcing me to wordsmith the problem description

Yes, because if you don't define the problem clearly, the problem won't be solved. Java being inefficient with memory use doesn't mean any memory was leaked.

Memory leaks can be tricky to track down, and if I spent 6 hours looking for a memory leak only to come back and find out you meant it uses more memory than what's efficient, I'd be pissed I wasted 6 hours because you wanted to save 5 minutes.


"uses more memory than what's efficient"

There is a hidden memory store using orders of magnitude more RAM than the live data. Why do we need to nitpick exactly how hidden it is? Are you going to be mad if I don't know whether it's literally inaccessible or not?


Because there are legitimate reasons why memory can be allocated. This is like calling your OS cache a memory leak when you open up Task Manager and see you only have 400MB free. A memory leak implies memory that is lost for good and is no longer being kept track of.

Consider it this way - if I had a program that connected to a database and used a connection pool to improve performance, would it be a "connection leak" that 5 connections were opened even though the database was idle?

The framing here is similar - Rust, in an attempt to improve performance reused large memory allocations. Some applications do this on purpose and call it buffer pools.


Except this memory is not being intentionally used as a cache/buffer.

Rust attempts to keep the vector capacity in a range for that purpose, and failed to do so here.

No matter what, it's a bug. So none of those possible justifications fit, because it's a bug.

For the database analogy, I would call it a connection leak if the number of idle connections greatly exceeded the amount that had ever been simultaneously busy and they weren't actually getting reused.


I disagree that it’s a small semantic difference.

I don’t think it’s clickbait though, I think the author was just misusing terminology.


I said it was subtle, not small. I agree it's a valuable distinction.


A memory leak means it leaks; it's no longer under control. Here the memory is under control, it can be reclaimed by the program.


I did read the article.

> the memory waste from excess capacity should always be at most 2x, but I was seeing over 200x.

So the 200x analysis is his problem?


200x is correct. What's happening is that he makes a vector with tons of capacity and only a few elements, so lots of wasted space. Then he turns that vector into another vector using an operation that used to allocate a new vector (thus releasing the wasted space) but now reuses the previous vector's allocation (retaining the wasted space).

It's definitely a sneaky bug. Not a "memory leak" in the normal sense since the memory will still be freed eventually. I'd call it an unexpected waste of memory.


Rust can re-use an allocation, but if the new item is smaller than the previous it doesn't automatically remove (free) the "wasted" memory left over from the previous allocation. I think this is categorically not a memory leak as the memory was absolutely accounted for and able to be freed (as evidenced by the `shrink_to_fit()`), but I can see how the author was initially confused by this optimization.

The 2x versus 200x confusion IMO is that the OP was conflating this with how Vec will double in size when it needs more space, so they were assuming the memory should only ever have been 2x the new size in the worst case. And because in the OP's case the new type was smaller than the previous one, it looked like a massive over-allocation.

Imagine you had a `Vec<Vec<u16>>` and, to keep it simple, say there were only 2 elements in both the inner and outer Vec's, which if we assume Rust doubled each Vec's allocation that'd be 4x4 "slots" of 2 bytes per slot (or 32 bytes total allocated...in reality it'd be a little different but to keep it simple let's just assume).

Now imagine you replace that allocation with a `Vec<Vec<u8>>` which even with the same doubling of the allocation size would be a maximum of 4x4 slots of 1 byte per slot (16 bytes total allocation required). Well we already have a 32 byte allocation and we only need 16, so Rust just re-uses it, and now it looks like we have 16 bytes of "waste."

Now the author was expecting at most 16 bytes (remember, 2x the new size) but was seeing 32 bytes because Rust just re-used the allocation and didn't free the "extra" 16 bytes. Further, when they ran `Vec::shrink_to_fit()` it shrunk down to only used space, which in our example would be a total of 4 bytes (2x2 of 1 byte slots actually used).

Meaning the author was comparing an observed 32 byte allocation, to an expectation of at most 16 bytes, and a properly sized allocation of 4 bytes. Factored out to their real world data I can see how they'd see numbers greater than "at most 2x."
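
A minimal sketch of the workaround, mirroring the dbg_vec example elsewhere in the thread (the "before" capacity depends on the compiler version with the reuse optimization):

    fn main() {
        let big: Vec<[u16; 1024]> = (0u16..128).map(|i| [i; 1024]).collect();
        let mut small: Vec<u8> = big.into_iter().map(|x| x[0] as u8).collect();
        // With the allocation-reuse optimization, cap here is ~262144, not 128.
        println!("before shrink: len={} cap={}", small.len(), small.capacity());
        small.shrink_to_fit(); // releases the reused, oversized allocation
        println!("after shrink:  len={} cap={}", small.len(), small.capacity());
    }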


Clickbait implies a specific intent to mislead for clicks, whereas I think there’s a completely good-faith disagreement here about the meaning of the word “memory leak.”


I appreciate the mention of Box<[T]>. I've never (or at least very rarely?) seen people using it, even in rustc, although I'm sure there's somewhere in the compiler that uses it. I've kind of gotten the feeling that Box<[T]> is frowned upon as a needless optimization because the one-word storage difference is so minimal, but I appreciate having the immutable size semantics as a way of ensuring no extra allocations occur by way of the methods on Vec.
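
For illustration, a minimal sketch of the pattern using only std APIs:

    fn main() {
        let mut v: Vec<u32> = Vec::with_capacity(100);
        v.extend(0..10);
        // into_boxed_slice drops the excess capacity and the ability to grow;
        // the result is just a pointer and a length.
        let frozen: Box<[u32]> = v.into_boxed_slice();
        println!("len={}", frozen.len());
    }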


Another Rust type you should look out for is Box<str>. In my Rust code I find a lot more uses of Box<str> than Box<[T]>. Strings are often fixed-length identifiers or usernames that get passed around frequently and stored in data structures. Box<str> can be a drop-in replacement in a surprising amount of code.
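
For example (a small sketch; `greet` is just an illustrative function):

    fn main() {
        // String -> Box<str> drops the capacity field and freezes the contents.
        let user: Box<str> = String::from("alice").into_boxed_str();
        // Box<str> derefs to &str, so read-only code accepts it unchanged.
        greet(&user);
    }

    fn greet(name: &str) {
        println!("hello, {name}");
    }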


Arc<str> is an atomically ref-counted string, also quite useful in async or multi-threaded code where (unlike w/ Box) you sometimes can't know in advance which task will drop the last reference to some data.


Box<str> is used pervasively in the rust compiler instead of String for exactly this reason. Basically every string of code the compiler looks at is, unsurprisingly, constant.


Yeah, the general concept here is https://en.wikipedia.org/wiki/String_interning, Box<str> is great for it.


Another downside of Box<[T]> is that Vec::into_boxed_slice copies all elements from the source vector. Vec::shrink_to_fit does not appear to copy. This is based on the stable docs.


It doesn't copy the elements if the source vector is already at max capacity, which you can often arrange for beforehand.


For example, by explicitly calling shrink_to_fit. So you either have fragmentation, or an extra copy if you're not careful, but none of the solutions let you forget about the detail entirely.


If you look at the source for into_boxed_slice, it calls shrink_to_fit at the beginning before doing anything else. Hence the documentation is slightly wrong, and no copies occur.

Edit: I submitted a PR to clear up the docs: https://github.com/rust-lang/rust/pull/120110


thank you!


Or alternatively, the vector was created with `Vec::with_capacity()` or by `collect()`ing an iterator with an exact size, and had the correct capacity all along.
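
Both routes, sketched (capacity is only guaranteed to be at least the requested size, though in practice it's exact here):

    fn main() {
        // Explicit capacity up front:
        let mut v: Vec<u64> = Vec::with_capacity(8);
        v.extend(0..8);
        assert!(v.capacity() >= 8);

        // collect() from an iterator with a known exact length uses that
        // length for the allocation, so there's no over-allocation either:
        let w: Vec<u64> = (0..8).collect();
        assert_eq!(w.len(), 8);
    }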


This is a good time to check how your code performs in beta vs stable. This particular case is new in beta and it would be interesting to catch regressions before they land.

Someone already filed this as a bug: https://github.com/rust-lang/rust/issues/120091


This re-use trick seems clever, but definitely needed more consideration before merging IMO. Re-use where we can determine the size & alignment are a perfect match ought to be a win, but the other cases seem very situational. And The8472's attitude here seems unlikely to lead to a better Rust.


Yeah I feel like it should reallocate if the size difference is great.


Over-allocating maps and arrays in standard libs is not really a memory "leak". Other langs do it.


After skimming the first few paragraphs, I almost commented something similar to yours. But after reading it until the end, the problem is more nuanced than that.

The issue is more about reusing memory that has been allocated when you "convert" (in Rust term: into_iter()) a Vec to another Vec. More surprisingly, the reuse still happens even when the new Vec has a different type than the original Vec's type. You should read the whole post—it's more interesting than I (and you) thought.


Agreed.

In the Haskell world, "space leak" is used to refer to a lazy computation that overallocates because of laziness (classic example being the difference between foldl and foldr or maybe that's foldl' and foldr'--I forget some details/cannot be bothered to think about folds right now).

I've personally stolen the terminology for a computation that temporarily holds on to a bunch of unneeded memory.


Conceptually, collect seems to create a new vector. In that case you would expect that the memory of the old vector is freed. The library tries to be smart and reuses the old vector as an (undocumented?) optimization. So the old memory that is no longer needed is not released.

Whether you call that a leak or not, Rust is known for its predictability. Allocating 200x the memory which a programmer would expect if he doesn't know the implementation details is bad.


The semantics are kind of beside the point, I think. This allocation reuse behavior was meant as an optimization, but the real-world programs this is meant to be "optimizing" run significantly slower in beta than they did in stable. So there is a bug here no matter how you look at it.


You’re right, but inappropriately over-allocating or not freeing resources has (I agree, incorrectly) been called a memory leak for a long time.

Coming from my java days, they called this a memory leak there too.

It’s definitely a more inappropriate term in Rust, where memory leak has a definite meaning already.


Yeah, came here to say this. Still an interesting post, but the title implies a much more serious problem


[flagged]


There is nothing unsafe about allocating more memory than necessary.



Being so quick to jump this sarcastic gun on something so clearly unrelated to memory safety (even if it was a leak, which it isn't!) is not a great look.


Memory safety has not been violated here.


This is a pretty surprising behavior. Reusing the allocation without shrinking when the resulting capacity will be within 1.0-2.0x the length: seems reasonable, not super surprising. Reusing the allocation without shrinking when the resulting capacity will be a large multiple of the (known) length: pretty surprising! My intuition is that this is too surprising (at capacity >2x length) to be worth the possible optimization, but probably still worth doing at capacities that are small multiples of the length.


Is reusing an allocation while changing its size a thing you expect to be able to do? I would believe that some languages/systems/etc can do that, but it certainly feels like an exception rather than a rule. Reuse generally means the whole block of memory is retained, from what I've seen, because you'd have to track that half-freed memory for reuse somehow and that has some associated cost. (A compacting-GC language would be a reasonable exception here, for example)


> Is reusing an allocation while changing its size a thing you expect to be able to do?

It's something you should expect to be able to try to do. The underlying allocator may reject the request depending on context (maybe it works for large sizes only, for example). This is provided by Rust's Allocator trait realloc_in_place() API, which returns CannotReallocInPlace if it isn't possible.

For Vec::collect, in the event that the storage cannot be reused smaller in place, I think it would be reasonable to free it and allocate an appropriately sized buffer instead.


> This is a pretty surprising behavior.

This optimisation crosses the line into "too clever by half". Once a Rust programmer groks how Vecs work, their mental model of how capacity is allocated is simple, so simple they hardly need to think about its implications as they write code. This optimisation breaks that simple model badly, and you would have to be really focused to realise it's about to bite you as the code streams off your fingertips. Consequently this "optimisation" is almost certainly going to cause a lot of code to use far more memory than expected.

You get the same results with filter():

    let mut v = vec![1,2,3];
    v.push(4);
    println!("len {} cap {}", v.len(), v.capacity());
    let v = v.into_iter().map(|x| x+1).filter(|_x| false).collect::<Vec<_>>();
    println!("len {} cap {}", v.len(), v.capacity());

    len 4 cap 6
    len 0 cap 6
Despite being in some sense "correct", I'd consider that result a bug. One fix would be to examine the result, and if it breaks the programmer's expectations badly (by say using more than twice the memory needed) it does a shrink_to_fit() for you. Since the programmer is probably expecting a new vector to be generated, it isn't a surprising outcome.
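
Sketching that fix at the user level (`collect_shrunk` is a hypothetical helper, not a std API):

    fn collect_shrunk<T>(iter: impl Iterator<Item = T>) -> Vec<T> {
        let mut v: Vec<T> = iter.collect();
        // Only shrink when the excess crosses the "at most 2x" expectation.
        if v.capacity() > v.len().saturating_mul(2) {
            v.shrink_to_fit();
        }
        v
    }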


I’ve read the article and I’m still lost as to how this optimization results in a 200x discrepancy between capacity and length. Was the mapped element size 200x smaller in this case? Or is there some bug where repeated mappings like this cause a 2x growth based on the previous element size instead of the target element size?


I think the behavior is that if you have a Vec of u128 (say 1000), filter that to fewer elements (say 10), and then collect it into a Vec of u32 you might expect the resulting value to be around 40 bytes but in beta Rust it is 16000 bytes. In current Rust the collect would cause a new 10 element Vec of u32 to be allocated, in beta it reuses the original larger allocation. The author's code is doing a bit more but essentially when moving to beta Rust the new optimization caused memory usage to increase by a big multiple.
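
A minimal reproduction of that scenario, assuming a toolchain with the allocation-reuse optimization (the exact capacity is version-dependent):

    fn main() {
        let v: Vec<u128> = (0..1000).collect(); // 1000 * 16 bytes
        let w: Vec<u32> = v
            .into_iter()
            .filter(|x| x % 100 == 0) // keeps 10 elements
            .map(|x| x as u32)
            .collect();
        // len=10, but with the reused allocation the capacity can be
        // 4000 u32s (the original 16000 bytes), not ~10.
        println!("len={} cap={}", w.len(), w.capacity());
    }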


Ok. I missed that there’s a filter step that compounds the problem. The more I read the less this sounds like a bug and more like application code is missing a shrink_to_fit and was relying on a pessimization.

That being said, it’s also not an unreasonable expectation on the user’s behalf that the size and capacity don’t get crazy different in code as innocuous and idiomatic as this.

I wonder how the open bug will end up being resolved.


Someone here linked an open ticket for this issue. In the comments at least one person made basically the same argument that holding on to a potentially large % of memory is a surprising sharp edge, meanwhile shrinking the Vec and perhaps allocating is unsurprising behavior. Requiring many additional defensive shrink_to_fit calls to avoid this problem seems like the wrong tradeoff but I don't write enough Rust to have a strong opinion.


The question is how much overhead adding a check (shrink if more than some percentage of the capacity is free) would impose on all the code that doesn’t need this optimization.

The reason it’s important to consider is that I can always add a shrink_to_fit even if it’s a sharp edge. I can’t remove the conditional within the std library even if I know it doesn’t apply. And adding explicit APIs to control this nuance is a bit much (whether through a dedicated collect_maybe_shrink function or as a new argument to collect to control shrinkage) and there are usability implications to complicating the UI. It may be that ultimately this should be fixed purely through documentation even though defensive shrink_to_fit sucks. Not all technical problems can be solved and it’s all trade offs.


> The question is how much overhead adding a check (shrink if more than some percentage of the capacity is free) would impose on all the code that doesn’t need this optimization.

Once per collect? Damn near nothing.


I'm guessing allocator time, memory usage, and cache residency are the major performance considerations. Vec knows what size it is so the comparison is cheap and in any case collect is already expected to allocate depending on what it is fed.


The comparison is "cheap" if you can speculate through it & even then isn't free. If your comparison is 50/50 on branching then you're going to pay a penalty every time you hit that code path. It's entirely possible that the map operation is going to dominate the check, I'm just highlighting it's not free and it's a good idea to validate the cost somehow first (but also recognize there are applications you're going to penalize without them realizing it).


This is a good example of the type of challenges you face as an author of widely used library. I can see a lot of scenarios where an optimization like this would bring benefits. But there are also many where it would hurt performance (not to mention memory usage), including most "collect once read many times" use-cases.

But I think the real thing for me is that this violates the principle of least surprise. If I wanted the type of memory reuse / lazy transformation behavior this optimization introduces, I would be looking at working with an iterator with a bunch of functional transforms. And if I'm calling .collect() it's because I want to convert the iterator into a data structure optimized for reads.

But I can also see how others would land on the other end, and hence the challenges for the library authors.


While I agree this is a bit surprising... I don't see literally anything in the docs for `collect()` that implies anything about memory behaviors. You get a collection, nothing more is guaranteed.

If you want specific memory behavior, you really do have to use a method that intends to do that. Like `shrink_to_fit()` (mentioned in the article).


I would say I don't (or now, didn't) expect collect to be able to know that there is an allocation under there, since that's not part of Iterator's interface.

I.e., the docs could call out that there aren't such implied behaviors, and that in some circumstances, it might reuse the allocation, but that that's not guaranteed. (And ideally, offer me info on what to do if I do want certain guarantees about the allocation of the Vec.)

Warding off what I think is a reasonable, but wrong, assumption, is well within the scope of the docs.


> It’s also an illustration of how an optimization in one place can lead to bugs downstream by violating programmers’ expectations.

This touched upon a pet peeve of mine for its resemblance to all the talk about undefined behaviour in C and C++. Programmers’ expectations are not codified and do not need to be respected: international standards do.


>> Programmers’ expectations are not codified and do not need to be respected

That's a pretty negative attitude - my immediate gut response was "and neither do yours". But Rust isn't an ISO standard and is still in development. Even if we do think in those terms, people developing a standard have IMHO an obligation to the people who will be using the standard.


To be a bit pedantic, the people using the C++ standard are mostly compiler engineers.


Everyone writing C++ is following the standard. The people that often reference it directly are the compiler people. But if we are going to try and be stupid about it, we can ignore the compiler engineers' opinions too - they're just implementing an arbitrary specification that won't impact anyone but them anyway, right?


Well no, users of C++ compilers are using an abstract, encapsulated, and portable interface to it, which makes their code automatically follow the standard (if it cares at all). That's the whole point, isn't it? I don't need to know what size the standard dictates const char* must be, I can just use it and it works.


I mean sure, so for those people WG21 does them a great service, in most cases they can say "Oh that's IFNDR" [Ill-formed; No Diagnostic Required] or "That's UB" and so it's not the compiler's fault your program doesn't work and they can close your ticket as NOTABUG if they want.

My guess is that all or almost-all non-trivial C++ projects trigger IFNDR and thus have no actual meaning per the ISO standard language. They do work, more or less, but that's just convention and can be taken away at any time or for any reason. Header files, use of concepts, unfortunate algorithmic goofs, careless naming, there are a billion† ways to trigger IFNDR in C++ and by definition that's fatal to your correctness.

† A billion is likely an overestimate, but WG21 has no formal inventory of IFNDR in the sprawling ISO document so, who knows.


It would be nice if everyone's compiler would tell their users what crazy thing they did wrong, it's just not required in order to implement C++ at all. That seems reasonable to me. Enhancing a failure-mode UX for an abstract machine interface is not really the same thing as implementing that interface. Maybe it would be nicer if the standard didn't have any possible UB in it.


Oh it's not that it doesn't tell you what you did wrong, it has no idea it's wrong, the C++ language is deliberately designed assuming you never make such mistakes.

It's not difficult, it's impossible, by Rice's Theorem.

It's slightly alarming that anybody thought "Yeah the programmers will always do this correctly" was good enough, but see the rest of our society.


UB is an example of IFNDR right?

Are you saying it’s impossible for the compiler to identify and warn on UB? That doesn’t sound right.


Oh! No, IFNDR is much worse. Categorically so.

Undefined Behaviour is a runtime thing. For example, suppose I have a function which takes an integer between 0 and 9 inclusive and indexes into an size 10 array, but I discover that due to a bug it's possible if a user hits the "Expedite Gyro" button then as well as expediting the gyro, it calls my function with a value which can be up to 26. Indexing 26 into a size 10 array is Undefined Behaviour. Absolutely anything (possible) might happen. But this is runtime problem, the program is still correct unless you actually hit the button.

Undefined Behaviour is a runtime thing. For example, suppose I have a function which takes an integer between 0 and 9 inclusive and indexes into a size 10 array, but I discover that due to a bug it's possible if a user hits the "Expedite Gyro" button then as well as expediting the gyro, it calls my function with a value which can be up to 26. Indexing 26 into a size 10 array is Undefined Behaviour. Absolutely anything (possible) might happen. But this is a runtime problem; the program is still correct unless you actually hit the button.
Undefined Behaviour is more common in a language like C++, but it exists in many safer languages, for example Go has UB for complex data races. It's hard to design this out of your language but some languages do (the safe Rust subset, the WUFFS language for example). It would be reasonable to argue that in such languages they "identify and warn" on UB, in the sense that programs with UB are rejected.

We can also detect at runtime that UB occurred, and say, report it or stop the software if that seems appropriate. To do this properly we need to know all possible causes of UB and detect them in the moment before they happen, while our behaviour is still well defined. Of course if we're going to all that effort maybe we should just reject programs with Undefined Behaviour up front ?

Ill-Formed; No Diagnostic Required isn't like that, IFNDR is about the process of translating the program, typically during compilation and thus long before runtime. It's a way to dodge Rice's Theorem. Henry Rice got a Maths PhD 70+ years ago for showing that the question of whether a program satisfies non-trivial semantic (as opposed to syntactic) requirements is Undecidable.

IFNDR says that's OK, we just won't ask, if our program doesn't satisfy the semantic requirements it had no meaning whatsoever and so we don't care what the transformation does, Garbage In: Garbage Out. This allows us to write all correct programs - we never checked they're correct but in principle they work. Unfortunately the price is that we have no idea whether all/ some/ any of our programs are correct.

The other way to dodge Rice's Theorem is practised by Rust. Check semantic requirements for each program and if you can't see why it satisfies the semantic requirements then reject the program. The price is that some correct programs won't compile at all. For example before "Non-lexical lifetimes" some really easy obviously correct Rust doesn't compile, because the Rust borrow checker used to assume borrows live until the end of a lexical scope.

I believe that incentive structure means IFNDR is toxic, because it encourages the language to add more and more unchecked semantic requirements, since they're "free". Their consequence is that your program is now nonsense and might do anything, but there's no errors, no sign anything is worse so you have no reason to complain. In contrast the incentives for Rust are to improve the semantic checking, because programmers are annoyed when they can see why their program is correct but the compiler is too dumb.


> Programmers’ expectations are not codified and do not need to be respected: international standards do.

What if...stick with me...the standards sought to codify and respect programmers' expectations?

https://en.wikipedia.org/wiki/Principle_of_least_astonishmen...


Then using a language without a published standard is a fool's errand.


No? How does that even follow? That a standard should codify common expectations insofar possible (without breaking rigor) is irrelevant to whether you "should" only use standardized languages or not.


It's a statement of opinion. I think attempting to use a language without a standard causes problems that can be avoided by using one with a standard. As evidenced in this exact post, the author spent a fair amount of time trying to figure out if the nightly, devel or production versions of the compiler were implicated.

He wasn't even able to fully test a theory because he relied on a feature that wasn't even available in older compilers in a completely unrelated system.

So, if you want to ignore these costs, that's fine, I'm not prescribing your actions for you, just sharing an opinion. Do you follow?


Respecting both is much better than respecting only the latter.


Great write-up. Thank you. I wouldn't have guessed. The upshot of this whole thing is that Vec is for mutable collections and when you're done you make a boxed slice.


Here's the code performing the allocation reuse:

https://github.com/rust-lang/rust/blob/25f8d01fd8bda339612d0...

It's not supposed to waste more space than a regular `collect()` or `push()` would when growing the capacity exponentially.


Cross posting my comment from reddit[1] because I think it's interesting.

-----

Nice post. I love calling attention to this. Just a few months ago, I ran into the ~same~ similar problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage:

https://github.com/BurntSushi/aho-corasick/commit/474393be8d...

The issue appeared when building large Aho-Corasick automatons. Otherwise, it usually doesn't matter too much because the absolute memory sizes are pretty small. But if your automaton uses 1GB of memory, then because of the excess capacity in a bunch of little `Vec` values, that usage could easily balloon to 2GB. And even at that size, _it was hard to notice that it was more than it should have been_! It is easy to just think that's normal memory usage. Especially when it's coming from the automaton representation that you know is less memory efficient. (What I'm talking about here is something I called a "non-contiguous NFA." The aho-corasick crate also has a "contiguous NFA" and a DFA. The "contiguous NFA" uses one contiguous `Vec` allocation with a bunch of bit-packing tricks.)

But then someone actually reported an issue on the `ahocorasick_rs`[2] Python project (bindings to the `aho-corasick` crate) that the `pyahocorasick`[3] Python package (with a bespoke C implementation of Aho-Corasick) was using _substantially_ less memory. The contiguous NFA was still doing better, but the _peak_ memory usage of `ahocorasick_rs` was much much bigger than `pyahocorasick`. That prompted me to investigate and figure out what the heck was going wrong. (And this is one reason why benchmarks comparing tools or libraries are actually useful beyond just advertisements or flexing. They alert you to what is possible, and thus possibly push you to find bugs in your current implementation that might be otherwise easy to miss. In other words, benchmark comparisons can turn unknown unknowns into known unknowns.)

So since this prompted me to look very carefully at memory usage, I noticed that the memory usage reported by `AhoCorasick::memory_usage`[4] was substantially smaller than peak RSS memory usage in a simple reproduction program of the original issue reported to `ahocorasick_rs`. I eventually figured out that was because of the excess capacity used by a zillion tiny `Vec`s in the non-contiguous representation. I fixed _most_ of that by calling `shrink_to_fit()`. But as the commit linked above notes, it wasn't really feasible to call `shrink_to_fit()` on all `Vec`s because that ended up with a non-trivial impact on construction time. And on top of all of that, memory usage was still worse than `pyahocorasick`.

But why was I using a bunch of little `Vec`s in the first place? It's because this is the "non-contiguous" NFA. This is the thing that gets built from a list of patterns by first building a trie then filling in the failure transitions. That trie construction really demands random access mutation, which puts a constraint on your data structure choices. Plus, I had designed the contiguous NFA as a release valve of sorts that lifts this constraint. So I reasoned, "sure, the non-contiguous NFA isn't designed to be fast. It just needs to get us started." But still, `pyahocorasick` only has one Aho-Corasick implementation (the `aho-corasick` crate has 3). So it must be doing something differently.

It turns out that `pyahocorasick` uses linked lists! THE HORROR! But actually, they are a really good solution to this problem. As a Rust programmer, I reach for `Vec` first. But a C programmer reaches for linked lists first. And it turns out that linked lists are a much better fit here.

And so, I switched to linked lists[5]. (But using indices instead of pointers. Because of course. This is Rust. :-)) And now `ahocorasick_rs` uses less memory than `pyahocorasick`!

[1]: https://old.reddit.com/r/rust/comments/199jycb/identifying_r...

[2]: https://pypi.org/project/ahocorasick-rs/

[3]: https://pypi.org/project/pyahocorasick/

[4]: https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoC...

[5]: https://github.com/BurntSushi/aho-corasick/commit/31bb1fc30d...
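
For readers unfamiliar with the pattern, a tiny illustrative sketch of a linked list backed by indices into a single Vec (not the actual aho-corasick code):

    const NIL: u32 = u32::MAX;

    struct Node {
        value: u8,
        next: u32, // index of the next node, or NIL
    }

    struct Arena {
        nodes: Vec<Node>,
    }

    impl Arena {
        /// Prepend a node to the list starting at `head`, returning the new
        /// head. One Vec holds every node: no per-node heap allocation.
        fn push_front(&mut self, head: u32, value: u8) -> u32 {
            let idx = self.nodes.len() as u32;
            self.nodes.push(Node { value, next: head });
            idx
        }

        fn iter(&self, mut head: u32) -> impl Iterator<Item = u8> + '_ {
            std::iter::from_fn(move || {
                if head == NIL {
                    return None;
                }
                let node = &self.nodes[head as usize];
                head = node.next;
                Some(node.value)
            })
        }
    }

    fn main() {
        let mut arena = Arena { nodes: Vec::new() };
        let mut head = NIL;
        for &b in b"abc" {
            head = arena.push_front(head, b);
        }
        assert_eq!(arena.iter(head).collect::<Vec<_>>(), b"cba".to_vec());
    }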


> Just a few months ago, I ran into the same problem, although it wasn't caused by `collect()`. It was caused by "normal" `Vec` usage

If it wasn't caused by `collect()`, then I suppose it's a related problem, but not the same problem. Your problem was caused by, as your commit message says, "ignorance [original: ignorant] of the broader effects of this strategy [double-when-full] in aggregate"

The OP's problem, otoh, is more about reusing memory that has been allocated when you "convert" (in Rust term: into_iter()) a Vec to another Vec. What's more, the reuse also happens even when the new Vec has a different type from the original Vec's type. This behavior is more surprising than the double-when-full strategy, which is more widely known by programmers (even if sometimes they "forget" about the implications).


I feel like the second sentence you quoted clarified that? "same problem" in this context means, "there is excess and unused capacity in the Vec." Your description is more precise, and at that level of precision, sure, they are not the "same problem." But yet, they share the same characteristic that is easy to forget: a `Vec` might have unused capacity leading to higher memory usage than you might expect otherwise if you aren't accounting for that unused capacity.

My phrasing is a reflection of what I think is a more universal lesson to draw from the blog, and one that is not provoked merely by how `collect()` works, but rather, the nature of growth amortization itself.

> This behavior is more surprising than the double-when-full strategy

I agree.

I updated my comment to use "similar" instead of "same."


What we need is a page-based Vec that mmaps (anon) for the storage but leaves the unused portions zero-bytes and therefore not part of RSS until actually required. (And when clearing/shrinking sections, madvise DONTNEED the pages).

That is, the vec could expand to areas much larger than the actual used size, but this would have no effect on process RSS until those pages get dirtied with actual data.
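
A rough sketch of the idea, assuming Linux, 4 KiB pages, and the `libc` crate (not a real Vec replacement, just the reserve/release mechanics):

    use std::ptr;

    struct PageBackedBuf {
        base: *mut u8,
        reserved: usize, // bytes mmap'd up front
    }

    impl PageBackedBuf {
        fn new(reserved: usize) -> Self {
            // Untouched anonymous pages are zero-filled and cost no RSS.
            let base = unsafe {
                libc::mmap(
                    ptr::null_mut(),
                    reserved,
                    libc::PROT_READ | libc::PROT_WRITE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                    -1,
                    0,
                )
            };
            assert_ne!(base, libc::MAP_FAILED);
            Self { base: base as *mut u8, reserved }
        }

        /// Tell the kernel the bytes past `keep` are no longer needed,
        /// returning those pages' physical memory without unmapping them.
        fn shrink_to(&mut self, keep: usize) {
            let page = 4096; // assume 4 KiB pages for the sketch
            let start = (keep + page - 1) / page * page;
            if start < self.reserved {
                unsafe {
                    libc::madvise(
                        self.base.add(start) as *mut libc::c_void,
                        self.reserved - start,
                        libc::MADV_DONTNEED,
                    );
                }
            }
        }
    }

    impl Drop for PageBackedBuf {
        fn drop(&mut self) {
            unsafe { libc::munmap(self.base as *mut libc::c_void, self.reserved) };
        }
    }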


That can be not a great idea for subtle reasons even though it does seem like a better design at first.

When you do the madvise, based on the contract of the API, the kernel has to do a TLB shootdown, which is really expensive. And not just “expensive for my process” but expensive in terms of a significant slowdown for the entire machine. Honestly you could probably DDOS a machine if you could reliably trigger that shootdown.

As such you want to be very very careful where you place that data structure.

For what it’s worth your memory allocator (or at least a good one like mimalloc or the non-gperftools newer tcmalloc) will do exactly as you mentioned where it will free memory back to the OS using advanced heuristics tested out at scale.

As for why a shootdown is needed, it’s so that a racing allocation call in another process doesn’t get that virtual address while running on a core whose TLB still points elsewhere (the core that did the allocation may not even be the one where it’s used).


Yes, well, I think you're right it's a lot about the contract of the API and the expectation of the user.

I'd certainly want intelligence (either by the user or by the framework) in how frequently you release pages back to the OS.

But the DONTNEED is really not the core of the value. It's being able to create vectors that don't fragment as badly.

And yes, you're right a decent allocator could help with this, and what I'm describing is really just a kind of specialized allocator.


I don’t follow. Fragmentation wasn’t the problem here - it’s that an optimization trying to reuse the underlying allocation resulted in a severe over-allocation in a specific path. That would happen even with your idea unless you use DONTNEED to release all the unused pages.

I think fragmentation is an over focused on problem when in practice it’s rarely the problem. It’s also a problem that can be solved by an application reset and making sure to periodically reset your running application in a way that doesn’t result in severe disruption is a better practice that works around many more issues than just fragmentation.


Is MADV_FREE on Linux any better in this regard? It's what the libc allocators tend to use, if I recall correctly.


It would have to be the same problem because there’s no way to free memory without a shootdown AFAIK. As I describe elsewhere, I don’t think there’s any way to free memory to the OS without one as otherwise you can run into race conditions where another process allocating memory gets access to the memory being freed in another process through a stale TLB entry.

Here’s a discussion [1] of a hypothetical lazy_munmap option that would initiate the unmap without an immediate shoot down (i.e. the CPUs would lazily evict from the TLB when they notice the request) but no such option exists. It’s also not immediately clear such a hypothetical API could exist on current CPU architectures or if it would require new HW support as I don’t have enough hands-on expertise at that level. Anyway, [1] is an interesting discussion of this idea and the note about huge pages making these even less valuable is interesting to keep in mind.

[1] https://news.ycombinator.com/item?id=23216590


Couldn't this obfuscate OOM type problems by not triggering actual memory issues until much later than the allocation?


Maybe? Depends on how you monitor and what your expectations are. But for workloads that use a lot of memory, and want to avoid wasteful empty allocations, or have sparse data, etc. it's a reasonable strategy.


Is this not going to be more or less what the memory allocator backing Vec does under the hood anyway?


That's already what it does (except when shrinking)


Just a somewhat tangential thought: do you know of any tricks to speed up the construction phase of Aho-Corasick? A recent problem of mine needed to use Aho-Corasick, but with several million strings in the dictionary it took a bit longer than I expected to construct. Any tricks you know of to speed up construction? Don't particularly care about memory.


Hmmm. The phrasing of your question is a little ambiguous. I could interpret it two ways:

1. "Do you know if there are any config knobs in the aho-corasick crate that will speed up construction?"

2. "I have my own implementation of Aho-Corasick and construction isn't as fast as I hoped. What kinds of tricks can I employ to make my own implementation faster?"

I'd be happy to try and answer either of these, but I think it would be better to ask with a concrete example in the repo's Discussions: https://github.com/BurntSushi/aho-corasick/discussions

All of the tricks I know about (and I judged to be worth doing) are in the aho-corasick crate. There's really just no getting around the fact that you need to first build a trie (which requires visiting every byte in every pattern) and then do a breadth-first search across the entire trie to setup failure transitions. Making construction substantially faster probably requires some kind of innovation in those two parts at a fundamental level.
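
For illustration only, a compressed sketch of those two phases with dense per-state transition arrays (the real crate's representations are far more compact, and this skips propagating match states along failure links):

    use std::collections::VecDeque;

    struct State {
        next: [u32; 256], // dense edges; 0 means "no edge" (state 0 is the root)
        fail: u32,
        is_match: bool,
    }

    fn build(patterns: &[&[u8]]) -> Vec<State> {
        let new_state = || State { next: [0u32; 256], fail: 0, is_match: false };
        let mut states = vec![new_state()];
        // Phase 1: build the trie, visiting every byte of every pattern.
        for pat in patterns {
            let mut s = 0usize;
            for &b in *pat {
                if states[s].next[b as usize] == 0 {
                    let id = states.len() as u32;
                    states.push(new_state());
                    states[s].next[b as usize] = id;
                }
                s = states[s].next[b as usize] as usize;
            }
            states[s].is_match = true;
        }
        // Phase 2: BFS over the trie to fill in failure transitions,
        // rewriting missing edges to follow the failure chain (DFA-style).
        let mut queue: VecDeque<usize> = (0..256usize)
            .map(|b| states[0].next[b] as usize)
            .filter(|&t| t != 0)
            .collect();
        while let Some(s) = queue.pop_front() {
            for b in 0..256usize {
                let via_fail = states[states[s].fail as usize].next[b];
                let t = states[s].next[b];
                if t != 0 {
                    states[t as usize].fail = via_fail;
                    queue.push_back(t as usize);
                } else {
                    states[s].next[b] = via_fail;
                }
            }
        }
        states
    }

    fn main() {
        let pats: &[&[u8]] = &[b"he", b"she", b"his"];
        let states = build(pats);
        println!("automaton has {} states", states.len());
    }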

You can also cheat at this. That is, by designing your Aho-Corasick automaton to be zero-copy deserializable. The DFAs in regex-automata support this. I waver on whether to do this for aho-corasick, because it acts like a straitjacket for evolving the internals of the library.


Thanks for the great write up. Your writing and deep dives never fail to enlighten and entertain at the same time.


What’s the reason for not always using the contiguous variant?


You can't build the contiguous variant directly from a sequence of patterns. You need some kind of intermediate data structure to incrementally build a trie in memory. The contiguous NFA needs to know the complete picture of each state in order to compress it into memory. It makes decisions like, "if the number of transitions of this state is less than N, then use this representation" or "use the most significant N bits of the state pointer to indicate its representation." It is difficult to do this in an online fashion, and likely impossible to do without some sort of compromise. For example, you don't know how many transitions each state has until you've completed construction of the trie. But how do you build the trie if the state representation needs to know the number of transitions? Classic chicken-and-egg.

Note that the conversion from a non-contiguous NFA to a contiguous NFA is, relatively speaking, pretty cheap. The only real reason to not use a contiguous NFA is that it can't represent as many patterns as a non-contiguous NFA. (Because of the compression tricks it uses.)

The interesting bits start here: https://github.com/BurntSushi/aho-corasick/blob/f227162f7c56...


It took me a few extra seconds to parse the headline. ("footgun" didn't help) Is

    <Vec<_>>()
perhaps to become another

    ¯\_(ツ)_/¯

?


It's interesting that HN zapped the second colon. The headline here (at time of my posting) is:

"Identifying Rust's collect:<Vec<_>>() memory leak footgun"

but the actual title of the blog post is:

"Identifying Rust's collect::<Vec<_>>() memory leak footgun"

That extra colon matters to Rust's syntax.


Not sure what you're asking. At risk of loosely explaining what you may already know, `<Vec<_>>` is a generic parameter `<T>` with the type `T` specified as `Vec<_>`. Vec also has a param, `<_>`, where the type is left out, as the compiler can figure it out.

Sidenote, `<Vec<_>>` is missing the best part - the fish head, to become `::<Vec<_>>`, aka `::<>` - the Turbo Fish :D

I admit to using this syntax more than necessary just so i can write a turbofish :D

edit: Wait no, `<>` is the head - isn't it. hah. So i guess it's missing the fins? /shrug


Yes? What about unamused Vec?





