
Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.
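To make that concrete, here's a minimal C sketch of what a naive refcounting compiler has to arrange for; the obj type and the rc_retain/rc_release helpers are hypothetical, not any particular runtime's API:

    #include <stdlib.h>

    /* Hypothetical refcounted object; `value` stands in for the payload. */
    typedef struct obj { long rc; int value; } obj;

    static void rc_retain(obj *o) { o->rc++; }

    static void rc_release(obj *o) {
        if (--o->rc == 0)   /* decrement and test... */
            free(o);        /* ...conditionally deallocate the referent */
    }

    void example(obj *a) {
        obj *b = a;         /* merely reading a into b means... */
        rc_retain(b);       /* ...the count has to be bumped */
        /* ... use b ... */
        rc_release(b);      /* scope end: decrement, test, maybe free */
    }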

Without reference counting, here's the end of a hairy function scope that deallocates 15 local variables and restores two callee-saved registers:

        11a6:       48 83 c4 78             add    $0x78,%rsp
        11aa:       5b                      pop    %rbx
        11ab:       41 5e                   pop    %r14
        11ad:       c3                      retq
Now imagine looping over 15 local variables to decrement each reference count, test it, and do a conditional jump based on the result; if that's open-coded, your epilogue is humongous, and if it's in a reference-count-decrementing subroutine, it's going to have mispredicted branches all over the place, costing you maybe 15 cycles each, a total of about 100 cycles for this function. We're talking about adding an order of magnitude of cost to a subroutine call, or more. (I think this function doesn't really need 15 local variables; that's the compiler's fault.)
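Here's what that reference-count-decrementing subroutine might look like, reusing the hypothetical obj from the sketch above; each --rc test is a data-dependent branch, which is exactly what branch predictors are bad at:

    /* Hypothetical scope-exit helper: one decrement/test/branch per local. */
    static void scope_exit(obj **locals, int n) {
        for (int i = 0; i < n; i++)
            if (--locals[i]->rc == 0)   /* decrement and test... */
                free(locals[i]);        /* ...conditional deallocation */
    }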

This gets worse with multithreading, because writing to the same reference count from different cores would, even in the best case, require one core to steal the cache line from another in order to modify it, which may stall the core; and often an atomic reference-count increment is even more expensive than that, because it involves a memory barrier.
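For scale, here's the thread-safe version of the count with C11 atomics, a minimal sketch; atomic_fetch_add defaults to sequentially-consistent ordering, which on x86 compiles to a lock-prefixed instruction that takes exclusive ownership of the cache line and acts as a full barrier:

    #include <stdatomic.h>

    atomic_long rc;   /* a shared count; placement is hypothetical */

    void retain(void) {
        atomic_fetch_add(&rc, 1);   /* lock add: cache-line steal + barrier */
    }

    int release(void) {
        /* nonzero if this call dropped the count to zero */
        return atomic_fetch_sub(&rc, 1) == 1;
    }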

Reference counting can become reasonably cheap if it's done at large granularity (filesystem files or COM objects, not conses); if you can elide almost all of the reference-count updates, as in Rust; or if your language runtime is just so dog-slow at everything that the extra cost of reference counting isn't that important, like CPython.

30 years ago or more, before generational GC had gone mainstream, reference counting was a more reasonable choice, because GC was going to be very slow in any case, and ref counting at least used less memory—especially important on machines without cache or virtual memory.

(Purely immutable (applicative) languages like Haskell and Miranda are usually lazy, since that's the payoff for completely abjuring side effects. But lazy evaluation is implemented at the machine level by mutation.)
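To make that last point concrete, here's a minimal thunk sketch in C, with hypothetical names; forcing it computes the value once and then overwrites the cell in place, which is the mutation hiding under the laziness:

    /* Hypothetical thunk cell. */
    typedef struct thunk {
        int forced;              /* 0 = still a suspended computation */
        int value;               /* valid only once forced */
        int (*compute)(void);    /* the suspended computation */
    } thunk;

    int force(thunk *t) {
        if (!t->forced) {
            t->value = t->compute();   /* evaluate at most once... */
            t->forced = 1;             /* ...then mutate in place */
        }
        return t->value;
    }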


WRT the multithreading issue, https://lwn.net/SubscriberLink/872869/0e62bba2db51ec7a/ mentions that using atomic memory operations in CPython for Py_INCREF and Py_DECREF slows down the entire CPython interpreter by 60%.


> Reference counting is usually very expensive, because even reading a variable updates the reference count

There are many papers out there on how to elide most ref count operations on locals, and runtimes that use ref counting for automatic memory management typically defer ref count updates in various clever ways (like Nim IIRC). You give up some latency for significant throughput gains.
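The simplest of those elisions is a borrowed-reference convention: a callee that doesn't store its argument never touches the count at all, because the caller's reference keeps the object alive for the whole call. A minimal sketch, with hypothetical names:

    typedef struct obj { long rc; int value; } obj;   /* hypothetical */

    int get_value(const obj *o) {   /* borrows o: no retain/release */
        return o->value;
    }

    void caller(obj *o) {           /* caller owns one reference */
        int v = get_value(o);       /* zero refcount traffic on this path */
        (void)v;
    }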


Aye.


> Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.

This is true, but if you can prove things about lifetimes, then you can use that to eliminate RC operations.

Also, GC is expensive for large processes because (1) peak memory is typically higher, (2) you have to scan all of memory for pointers, and (3) that memory may have been swapped out. Of course this can be optimized too.


> This is true, but if you can prove things about lifetimes, then you can use that to eliminate RC operations.

Yes, this is true, and in fact with Rust's lifetime analysis you can get excellent performance with reference counting in many places. There is more information about this in https://news.ycombinator.com/item?id=28879401 and https://news.ycombinator.com/item?id=28884775.

> Also, GC is expensive for large processes because (1) peak memory is typically higher, (2) you have to scan all of memory for pointers, and (3) that memory may have been swapped out. Of course this can be optimized too.

It's true that a tracing GC usually requires significantly more memory than manual allocation or reference counting. Occasionally it doesn't, though: to implement compaction you basically have to implement a tracing GC, so a GC can sometimes give you lower fragmentation.

It's not true that you have to scan all of memory for pointers, for two reasons:

1. You may have a region of memory that contains no pointers, like Erlang's binary storage.

2. Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier (see the card-marking sketch after this list). If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs have been generational since the late 01990s. Except...

3. Things like Erlang partition the heap into separate per-process heaps for a large number of lightweight processes. Each of these heaps can be GCed independently because you can't have pointers from one heap to another. "What about unreferenced processes?", you might ask. Well, Erlang doesn't GC processes; you have to explicitly create and destroy them, like a larger-granularity malloc/free, typically using a Rust-like ownership hierarchy to avoid process leaks. This works better than it sounds like it should.
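Here's the card-marking sketch promised in point 2, a hypothetical write barrier rather than any particular collector's: every pointer store marks the 512-byte "card" containing the updated slot, so a nursery collection scans only marked old-generation cards instead of the whole heap.

    #define CARD_SHIFT 9                       /* 512-byte cards */
    static unsigned char card_table[1 << 20];  /* covers a 512 MiB heap */
    static char *heap_base;                    /* set when the heap is created */

    static void write_barrier(void **slot, void *new_value) {
        *slot = new_value;
        card_table[((char *)slot - heap_base) >> CARD_SHIFT] = 1;
    }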


> Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier. If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs have been generational since the late 01990s.

I was going to come back and edit this to say "…when you violate an assumption of generational GC", but you somehow managed to instantly reply to my 3-day-late post!

Though I'm not too familiar with how GC runtimes are really implemented; I thought generations were more or less based on allocation age, but I don't know whether the GC actually does analysis to separate allocations that can't reference each other.


I was just lucky! I didn't mean to play gotcha!

(I should write a HN user-agent to watch old threads, though.)

I should preface this by saying I've never actually written a GC, or maintained one someone else wrote, so I might be off-base here; this is just my understanding from reading books and papers:

The awesome thing about generational GC is that (normally) the generations are physically separate in memory, so the GC can avoid even looking at objects in generations older than the one it's emptying out. (Unless, that is, they might contain pointers to objects in that generation, which is what the write barrier detects.)

This combines nicely with the potential killer advantage of copying collectors: they don't need to even look at garbage. So if your nursery (or Garden of Eden, if you prefer) is 8MiB and contains only 128 live objects averaging 64 bytes each, a non-concurrent nursery GC consists of walking the stack and any marked older objects, copying those 8192 bytes into the next generation while adjusting the relevant pointers, then resetting the allocation pointer to the beginning of the nursery. The other, say, 130,944 objects that were allocated in the nursery aren't even looked at by the GC, because nothing points to them. (Unless they have finalizers (destructors).) So the GC examines and copies on average 0.06 bytes per object allocated.
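The allocation side is what makes this pay off: in a copying nursery, allocating is a bounds check plus a pointer bump, and "freeing" all 130,944 dead objects is a single store. A minimal sketch, with hypothetical names:

    #include <stddef.h>

    static char nursery[8 << 20];   /* 8 MiB, as in the numbers above */
    static char *alloc_ptr = nursery;

    void *nursery_alloc(size_t n) {
        if (alloc_ptr + n > nursery + sizeof nursery)
            return NULL;            /* nursery full: time to collect */
        void *p = alloc_ptr;
        alloc_ptr += n;
        return p;
    }

    void reset_after_evacuation(void) {  /* survivors already copied out */
        alloc_ptr = nursery;             /* every dead object vanishes here */
    }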

Initializing the objects in the first place, as you must do whether they're allocated by decrementing a stack pointer or anything else, is far more expensive.

Of course, most programs retain a larger fraction of their nursery objects than that, stack allocation makes better use of memory caches than a multi-megabyte nursery, and if your program keeps retaining nursery objects long enough, it will eventually need to GC one of the other generations. So GC still isn't free even in the throughput sense. But it's competitive with manual dynamic allocation.
