Pictures of a working garbage collector (oilshell.org)
173 points by todsacerdoti on Jan 12, 2023 | 82 comments



Good to see GDB's watchpoints (https://sourceware.org/gdb/download/onlinedocs/gdb/Set-Watch...) get a mention. Often called Data Breakpoints in IDEs (which then, confusingly, often also use "Watch" for a different concept - argh).

Watchpoints, when they're suitable, are an incredibly powerful way to query your program's runtime behaviour (When does this happen? Why does this value end up here?) instead of stepping through and printing things / logging stuff.

They're also fast, provided the values you watch are ones the CPU's debug hardware can handle directly: they need to be small-ish and located in memory. Otherwise GDB falls back to single-stepping.

(the `-l` flag is also important here and even less well known - it tells GDB you really care about the memory address the expression is stored in, not the expression itself)


Did anyone else click on this expecting to see a photo documentary about the life of a bin/waste collector?


if you google image search "garbage collector" as I just did, you will in fact see photos of these actual, pivotal, working members of our society.


I expected to see a ship cleaning up a garbage island in the ocean.


I did, but the domain made me think of an oil spill.


@chubot (author) I'm curious what the debug experience is like. It looks like you're generating C++ code that maps pretty closely onto your statically typed Python code - is it straightforward to say "aha, it's that variable in C++ so it'll be the same name in Python"?

I'm wondering if it's possible to put some directives into the generated C++ code that would map its debug info directly back to source lines in the Python - I can't find any docs to confirm my feeling this should be possible.

Edit: Maybe this? https://gcc.gnu.org/onlinedocs/cpp/Line-Control.html


Oh yes I've been wondering about that too. I think what tipped me off about the feature is that I debugged some code using yacc, and it jumped right to the .y file. So yacc does that pretty effectively.

So I think that would be cool, but I don't usually use a debugger in Python (or even C++ -- I tend to write small programs using shell as the "REPL").

Oil is extremely test-driven, and shell's stdin/stdout-style interfaces mean it's pretty easy to test exhaustively. Examples: https://www.oilshell.org/release/0.13.1/test/spec.wwz/survey...

The GC is the one case we have where that kind of testing/debugging doesn't suffice! So I started watching the CppCon videos on debuggers a couple years ago and found some good talks :)


Get `udb` - the reversible/time-traveling debugger from Undo. I don't own any stock - I love their product. You can run your code within it, hit an error, set a watch-point, reverse-continue to where the watch-point memory was changed and then check the stack. udb will turn brain-melting, blood-freezing memory bugs into trivial problems that you can solve in a few minutes.


A similar project is rr[0], which is freely available. Like you said, I find that reversible debuggers are a huge improvement over regular debuggers because of the ability to record an execution and then effectively bisect the trace for issues.

[0]: https://rr-project.org/


In our latest release of UDB we added the `last` command: https://docs.undo.io/TrackingValueChanges.html

Which effectively combines a `watch -l`, plus a reverse-continue but also monitors when the underlying memory was allocated / freed.

It's quite nice, basically "git blame" for your variables.


(author here) Yes I tried reversible debugging a couple years ago, but at the time I didn't have any bugs that needed it :) I managed to get by here without it, but it's a technique I'd like to be more familiar with

I'm on the Undo mailing list as well -- nice to see that it's effective!

(Also should say that Clasp sounds very cool -- I'm a fan of anything that enables interop and reuse, rather than rebuilding the same thing in different languages, which may or may not be as good)


Thanks! I'm very interested in a new garbage collector that works with C++. I have a long list of requirements we need that is currently satisfied by the Boehm garbage collector and were satisfied by the Memory Pool System before we moved away from it. I've been looking at the Memory Management Toolkit (MMTk). Where would you put Oil in that small club of memory managers?


I've been following along with Oil's GC.

It succeeds because it only solves the GC problems that Oil has (and I mean this as a high compliment).

Oil is single threaded, code runs in loops that are well understood, and the code is generated from a strongly typed subset of Python.

The GC then is run only between loop iterations, so there is no need for stack scanning. You never have to worry about a root being in a temporary (from the point of view of the C++ compiler) since there are just a few locals in the function running the loop, and any variables local to the loop are logically dead between iterations.

Since a goal of Oil is portability, not scanning registers and stacks is very important. Getting this right when the GC could be invoked at any allocation is potentially intractable with mypy semantics at least.


Ah - thank you. We are looking for a GC that supports (in no particular order): (1) conservative stack scanning and object pinning, (2) optionally conservative and precise on the heap, (3) compacting when precise, (4) weak pointers, (5) good multithreaded performance, (6) finalizers, (7) ability to create pools of different kinds of objects - especially cons cells (pairs).


Probably not coincidentally, that sounds like SBCL's cgc...


You can look at this one: https://github.com/pebal/sgcl


So I'd say Oil's collector is highly unusual and not applicable to most problems! (I now think that "every GC is a snowflake" -- it's such a multi-dimensional design space)

It's unusual because it's a precise collector in C++, and what I slowly realized is that that problem is basically impossible for any non-trivial software, without changing the C++ language itself :)

It seems like that hasn't happened, despite efforts over decades. I added this link about C++ GC support to the appendix, which also explains our unique constraints.

Garbage collection in the next C++ standard (Boehm 2009)

https://dl.acm.org/doi/abs/10.1145/1542431.1542437

http://www.oilshell.org/blog/2023/01/garbage-collector.html#...

---

The reason that precise GC can work for Oil is because it's a shell that links with extremely little 3rd-party code, and has relatively low perf requirements. We depend on libc and GNU readline, just like bash. And those libraries are basically old-school C functions which are easy to wrap with a GC.

(Also as Aidenn mentioned, shells use process-based concurrency, which means we don't have threads. The fact that it's mostly generated C++ code is also important, as mentioned in the post)

---

The funny thing is that one reason I started this project is because I worked with "big data" frameworks on clusters, but I found that you can do a lot on a single machine. (in spirit similar to the recent "Twitter on one machine post" https://news.ycombinator.com/item?id=34291191 )

I would just use shell scripts to saturate ~64 cores / 128 GB of RAM, rather than dealing with slow schedulers and distributed file systems.

But garbage collectors and memory management are a main reason you can't use all of a machine from one process. There's just so much room for contention. Also the hardware was trending toward NUMA at the time, and probably is even more now, so processes make even more sense.

All of that is to say that I'm a little scared of multi-threaded GC ... especially when linking in lots of third party libraries.

And AFAIK heaps with tens or hundreds of gigabytes are still in the "not practical" range ... or they would take a huge amount of engineering effort

---

But of course there are many domains where you don't have embarrassingly parallel problems, and writing tight single- or multi-threaded code is the best solution.

Some more color here: https://old.reddit.com/r/oilshell/comments/109t7os/pictures_...

I wonder if Clasp has any support for multi-process programming? Beyond Unix pipes, you could also use shared memory and maybe some semaphores to synchronize access, and avoid copying. I think of that as sort of "inverting" the problem. Certain kinds of data like pointer-rich data is probably annoying to deal with in shared memory, but there are lots of representations for data and I imagine Lisps could take advantage of some of them, e.g. https://github.com/oilshell/oil/wiki/Compact-AST-Representat...


Thank you. We do precise GC in C++. I wrote a C++ static analyzer in Lisp that uses the Clang front end and analyzes all of our C++ code and generates maps of GC-managed pointers in all classes. We precisely update pointers in thousands of classes that way. We also use it to save the system's state to a file or relinked executable so we can start up quickly later. Startup times using that are under 2 seconds on a reasonable CPU.


Ah OK, very interesting ... So the tracing is precise, but what about rooting? From your other comment it sounded like it can be imprecise. But maybe that's for dynamic linking, where you don't have source access?

In any case, I would imagine embedding a C++ compiler at runtime does open up a lot more options!


gdb has had reversible debugging since release 7, in 2009. What does udb offer that it lacks?


(I work for Undo)

> gdb has had reversible debugging since release 7, in 2009. What does udb offer that it lacks?

GDB's built-in reversible debugging is cool (and it's helped raise awareness) but it doesn't scale well. We build on the same command set and serial protocol that GDB defined - UDB is GDB but with additional Python code hooking it up to our separate record/replay engine.

For UDB, I'd say we offer: 1. Performance and efficiency (orders of magnitude faster at runtime, with much lower memory requirements). 2. Recordings can be saved to portable files (share with colleagues, receive from customers, etc.). 3. A library API so applications can self-record, with control of when to capture and save. 4. Wider support for modern software (proactively tracking modern CPU features, shared memory, device maps, etc.). 5. Correctness (in the past we've found the reverse operations in GDB don't have as strong semantics as we'd hoped, though I'd also be happy to be wrong here)

FWIW, rr (https://rr-project.org/) also offers many similar benefits over GDB's built-in system (though not the library API in point 3) but with differences in what CPUs / systems are supported, ability to attach at runtime, etc.

If you're looking for an open source solution, I'd choose rr over GDB's built-in approach.


I'm not familiar with udb, only with rr. Compared to rr, gdb's recording for reversible debugging is tremendously slower. However rr has strict requirements for the CPU (didn't work on AMD, the last time I looked) and requires some permissive perf_event_paranoid setting to run.

In my experience rr's recording is around x2 slower for single threaded programs than running the program on its own.

I think feature-wise they are at parity.

I heard that udb is like rr, but better on some fronts (except price and software freedom).


gdb's built-in recording slows down the debuggee by about 1000x, rr more like 2x. That makes a huge difference in practice.

rr's recording works across system calls, thread context switches, etc, but gdb's doesn't.

rr's recording creates a persistent recording on disk that you can even move around between machines. This permits workflows that gdb's doesn't.

(Disclaimer: I work on rr)


I was expecting to see some sort of heat map gif of a GC filling and trashing slots and was kinda disappointed.


I was expecting to see a man in a dirty boiler suit and was pleasantly surprised.

A thousand years from now when they're digging up texts and artifacts from this period, there's likely going to be a lot of confused academics.


I was also expecting to see a physical machine in action and was disappointed


At some point it does look like ARC might be the easiest solution (sure, it has problems, but they're a subset of a GC's problems)

Edit: this link posted by another commenter gives a nice overview of the tradeoffs between the methods and it has very nice graphs to boot https://spin.atomicobject.com/2014/09/03/visualizing-garbage...


ARC's problems are orthogonal to tracing/copying GC's problems. They are fundamentally different paradigms, since ARC mostly pays a cost for memory which is no longer in use, while tracing/copying GC pay a cost for memory which is in use.

That is, with ARC, the more garbage you generate, the more work will be spent on the GC vs normal program flow. In contrast, with tracing/copying, you can generate as much garbage as you want, without affecting GC time; but, the more memory you actually use, the more time you will spend in GC.


ARC (if ‘A’ means atomic) is simply not a good fit for modern hardware. With traditional tracing GCs you get blazing fast allocation and get to amortize the cost of deallocation, while you pay a price for ref counting on every usage (or at best on almost every usage with a good compiler).


> ARC (if ‘A’ means atomic)

A is for Automatic, as in the compiler adds the calls needed to manage the reference counting.

If the language and compiler can guarantee a reference is strictly contained within a thread, it doesn't have to be atomic. Usually that is not the case, so atomic operations are used.


You need more than that. Reference counting amplifies reads into writes. To ameliorate that (somewhat) you need to keep the references separate from the data (a particular problem with Rust). Otherwise you end up with lots of writes scattered all over memory, causing false sharing and lots of unnecessary communication over buses between your cores. One way to do this is to keep the references together in separate cache lines or pages, away from the data. And this only partly alleviates the problem.


While this is true on a microscopic scale, the motivation for using RC is macroscopic: to release memory ASAP to get predictable, low memory usage. From the RC point of view the real enemy is swap and the memory management system's primary objective is to eliminate swap.


To summarise:

- your application is using close to 100% available RAM

- you don't care about the impact of all those extra writes clogging up CPU caches and bouncing cache lines back and forth across internal buses

- you can't afford to pay for a small amount of extra RAM

In that case reference counting sounds ideal. Good luck!


This was exactly Apple's reasoning for using refcounting in Swift. Although, as I understand it they worked around problem (2) by building their own silicon.

If you're on the server then just buy more RAM and use a GC, I agree.

But if you're making consumer software then buying more RAM isn't an option. Additionally, RAM costs battery life even when it isn't in use.


Prompt is even better than predictable. In particular, reliably knowing the refcount on immutable data (in the persistent structure sense) has reached one means you can mutate it undetectably.


FWIW there is a clash of terminology here.

- ARC (upper case): Automatic Reference Counting (in Swift/ObjC)

- Arc (title case): Atomic reference counting (in Rust)

I assume OP intended the former.


I don't think there is really a distinction here. Swift uses atomic reference counting. Granted, the details differ between approaches (Swift tries to elide reference counts when it can), but there are not two different "ARCs" being discussed here; they refer to the same concept.


Atomic reference counting is a necessary technique when you want to do reference counting for an object shared by multiple threads. This applies both when doing manual reference counting (Rust's Arc, C++'s std::shared_ptr) and when doing automatic reference counting (objects in Swift, Objective-C, Python).

Choosing to use automatic reference counting in a language is a design choice, with positives and negatives. Atomic reference counting is an implementation detail, a strict necessity whenever you can't be sure an object will only be referenced by one thread (in essence, non-atomic reference counting is just an optimization).


Ah good to know. Never used Rust so haven't come across Arc, only ARC in various other languages.


> you pay a price for ref counting on every usage

What do you mean by 'usage'? References don't change when reading or writing the data being reference counted, they change when being passed to a function or returned from one. In C++ it's rare that you would even use reference counting, since that essentially means you don't know the lifetime of your value. Most variables are going to be on the stack and the vast majority of dynamic allocation is going to be referenced from a single scope at one time.

The reality is that it takes gross incompetence to have a speed impact from reference counting.


Yeah I meant usage as “getting it into/out of scope”.

> In C++ it's rare that you would even use reference counting, since that essentially means you don't know the lifetime of your value

Sure, because it is a manual memory managed language with RC being an escape hatch only. But there are plenty of problems/programs where you simply can’t know the lifetime of your objects, e.g. Chrome uses a proper GC for C++ as well.


> with RC being an escape hatch only

I don't know what you mean by escape hatch, but it generally just isn't necessary and memory is managed automatically by scope. The bigger point here is that it just isn't a significant part of execution time.

> But there are plenty of problems/programs where you simply can’t know the lifetime of your objects

Like what? I can only think of one, which is passing memory allocated in one thread to another thread.

> Chrome uses a proper GC for C++ as well

This is an anecdote, it doesn't prove or disprove anything in the bigger picture.


> managed automatically by scope

That’s just an implementation detail. The point is that the object’s lifetime is only known at runtime and it will be reclaimed when a counter reaches zero; this is reference counting. Whether you have to manually inc/dec that counter, or the language does it for you through some abstraction, is beside the point: it is automatic memory management either way, as it manages memory automatically.

> Like what? I can only think of one, which is passing memory allocated in one thread to another thread

Implementing any programming language: parsing into an AST, AST manipulations, interpretation (and that is a very wide category, not only things you would think of as proper languages). But even some games may want to use GC for some in-game objects, as the lifetime of those is fundamentally dependent on user action.

Would the litany of managed languages and their widespread usage be less anecdotal?


I'm not sure what points you are trying to make now, but originally you were talking about reference counting being slow and I was saying in a language like C++ it has no impact on performance because it is only necessary when giving memory allocated in one thread to another thread.

> both parsing into an AST, AST manipulations, interpretation ... even some games may want to use GC for some in-game objects, as the lifetime of those is fundamentally dependent on user action

Here you are conflating the lifetime of resources inside the various scopes of a program with dynamic resources in a game. These are not the same thing. Language-level reference counting will not save you or help you know when to unload a level or a texture. Just because there is control over resources doesn't mean reference counting. Likewise, even in something like Java you need to set links to heap-allocated objects to null so that they can be garbage collected. The language doesn't magically know when you need to unload a level.


Because reference counting is slow. Sure, if you use it for two objects that are always in scope you won't see its effect, but I'd question whether that counts as "using RC" at all.

> only necessary when giving memory allocated in one thread to another thread

The RC count can be larger than one even when only a single thread is using it. But I’m not familiar with this usage, and it is not really RC for memory management anymore, more like a lock-less data structure.

I wasn’t talking about texture/level loading/unloading because it is more complex, but things like using scripting languages for part of the game logic.


> Because reference counting is slow.

I think you're just repeating yourself, but I'm not sure what question you're answering.

> RC count can be larger than one even when only a single thread using it. But I’m not familiar with this usage and it is not really RC for memory management anymore, more like a lock-less data structure.

I don't understand what you are saying here.

> things like using scripting languages for part of the game logic.

Scripting languages are slow for a lot of reasons, like pointer chasing and excessive memory allocations. Reference counting is a very small piece of that puzzle.


Reference counting has limitations. It does not release cyclic structures, is not suitable for lock-free algorithms, and may overflow the stack.


> It does not release cyclic structures

I don't think anyone is debating that.

> not suitable for lock-free algorithms and may overflow the stack.

This you will have to explain. I see people make vague assertions like this but I never see a good explanation.


You can't atomically assign a pointer and increment a counter; you need to use a lock. If you make a long list using shared_ptr, it will overflow the stack when the head's destructor executes.


> You can't atomically assign a pointer and increment a counter

You can do that in multiple ways.

First you can use the extra bits of a pointer for a counter to fit it all into 64 bits.

Second, you can use a 128 bit compare and swap which has been supported by CPUs for about 20 years now.

Third, you can not use pointers and use indices of whatever bit resolution you want, using the extra bits for a counter.

Finally, how does a garbage collector change this ?

> If you make a long list using shared_ptr, it will overflow the stack when the head's destructor executes.

If we set aside for a second the insanity in making a linked list where every pointer destruction calls the next pointer in the list's destructor, how is this unique to a shared_ptr ?


Did you know that the counter is stored in a different place than the pointer? All current atomic<shared_ptr> implementations use a lock. Stack overflow is not unique to shared_ptr, but GC pointers don't have this problem. Reference counting has advantages, but it cannot fully replace GC pointers.


I think you're confusing shared_ptr with reference counting as a technique in general. Can you answer the questions I have above without talking about shared_ptr?

> Stack overflow is not unique to a shared_ptr, but GC pointers don't have this problem

No one should ever have this problem. It is a ridiculous way to make a linked list in the first place.


I see you just don't want to see the problem. Look at this document: https://www.open-std.org/jtc1/sc22/WG21/docs/papers/2014/n41...


You're still trying to compare one basic smart pointer implementation to garbage collection in general.

You told me something was impossible to do without garbage collection and I explained three different ways that I've already done it, then you just keep trying to talk about something that was never up for discussion in the first place. You hallucinated shared_ptr into the conversation from nowhere.

Without talking about smart pointers, what am I missing from the list above? Why do some lock free algorithms need garbage collection?


I am writing about shared_ptr because C++ is one of the top languages. Read about the ABA problem.


I'm talking about C++ too, but I write lock free algorithms and data structures all the time and they have nothing to do with with smart pointers in any way.

Why don't you answer my questions above? They confront the ABA problem directly since I explained three ways to keep counts paired with pointers or indices. What can't be done without a garbage collector? Why do you keep giving vague recommendations to read about general topics? Give me a specific deeply technical answer if you can.


Show an implementation of a concurrent lock-free stack without delayed freeing.

> Like what? I can only think of one, which is passing memory allocated in one thread to another thread.

How would you implement a graph that requires heap allocation for each node and supports adding/removing nodes and links with purely scope-based memory management (no shared_ptr or equivalent, no tracing GC), while still reclaiming memory as soon as possible (so no arena-based solutions)?

There are also single-threaded concurrent scenarios where object ownership can be ambiguous, but those are similar to the multi-threaded scenarios you discuss.


In the case of nodes in a graph, you have to keep track of where the connections go anyway, so you can count the list of connections.


I presume that “arc” means “automatic rc”.

What’s manual rc like?


Normally I'd read "arc" as "atomic reference count", i.e., a refcount system where the increment and decrement operations are done with atomic increments/decrements. This permits the refcounted objects to be shared across threads that might mutate the refcount: they need to not corrupt the refcount, or UB ensues.

E.g., the Arc type in Rust. Rust also has a separate Rc type, which should outperform Arc but cannot be shared across threads.



You can manually increment and decrement reference counts with retain/release: https://stackoverflow.com/questions/6578/understanding-refer...


Like using std::shared_ptr<T> in C++ or Rc<T> (or Arc<T>, which in that case stands for atomic, not automatic) in Rust. That is, you as a programmer choose for which variables to use reference counting.


I was expecting to see something like what you see on Windows when running defrag.


There is another site with graphics linked at the bottom of the article https://spin.atomicobject.com/2014/09/03/visualizing-garbage...


Wow, that’s a very cool post!


I expected to see one of the big orange things you see on the street



Exactly my thoughts :D


This is very insightful! Good to see some low-level perf benchmarking as well.



