This seems like the right decision: simplify the language, flatten the learning ...

dmpk2k · on June 3, 2013

I'm not sure I agree with the point regarding concurrent data structures and garbage collectors. The idea is nice, but ignores a rather serious problem in implementation: such garbage collectors are serious engineering undertakings.

To see how hard the implementation of a parallel and concurrent garbage collector can be, take a look at the Garbage-First Garbage Collection paper. Try to really understand all the moving parts in there and how they interact. Once you're done wiping your brains off the wall, realize that this is the extent that GC engineers need to go to remove most bottlenecks on scaling, regardless of your opinion of G1 itself. This is a serious and hard-core engineering effort that absorbs several man years of expert effort (and I emphasize the expert part) to do properly. Rust would probably have to become the size and importance of the Java world to ever get this level of attention. Despite this, the JVM's collectors still routinely throw roadblocks in front of the mutator.

And that ignores the other tradeoffs that GCs present. E.g. most GCs need at least 3x working set to work reasonably efficiently. That's 3x less working set you could be keeping in a local memory bank, or using for something else (like a disk cache, because page faults sure are crazy expensive...).

The best way to avoid this problem is to not play the game at all. The reason I became interested in Rust in the first place is Graydon and co wisely did not follow that pied piper. Optional thread-local garbage collectors are a much simpler problem.

As an unrelated aside, I recall Andrei Alexandrescu making the argument that GC is important for type safety, unless the type system can prove that dangling pointers are impossible.

pron · on June 3, 2013

Go's designers seemed to take a somewhat defeatist approach in this regard. Instead of writing a really good GC, they opted to design the runtime in such a way that more memory sharing is possible, and the pressure on the GC is hopefully reduced. But they've done this at a cost: they've deliberately introduced aliasing, which all but completely precludes the possibility of ever implementing a really good GC for Go. That's why I think Go might be good enough for current multi-core hardware, but will not scale in a many-core world (there are other aspects of Go that are major road blocks for future scaling as well).

I remember seeing a talk by Cliff Click where he explains how a small change to the CPU can greatly assist with developing pauseless GCs (though his former employer, Azul Systems, has moved on to implement the whole thing in software, probably at some cost to performance).

Regardless of the undertaking, the question remains -- as we move into a many-core era -- whether we'll find a programming paradigm good enough to take advantage of many cores. I think that whatever the winning paradigm may be, it will require a "really good" GC.

rayiner · on June 3, 2013

Concurrent, scalable GC is hard, but it's made harder by the complete lack of cooperation from the hardware and OS. For example, you can dramatically simplify the design of a concurrent GC if you can scan the thread stacks and registers in a snapshot before doing the concurrent tracing. This should not be a high-latency operation on existing hardware, but on say OS X it can take 10's of milliseconds for just a couple of hundred threads. On Azul's machines, cooperation from its hypervisor allows thousands of threads to be paused in under a millisecond. Similarly, a read barrier like the kind implemented in Azul's hardware is not very complicated. The CPU already does, conceptually, a very elaborate virtual -> physical address translation + checking of access protections on every memory access, which is cached in the TLB. Adding a bit to the TLB to indicate that a page should be protected by a read barrier would cost very little.

Azul's GC is basically a "really good" GC. It can handle hundreds of CPUs with hundreds of GBs of heap with tens of GBs per second of allocation rate with pause times of 10-20 milliseconds. It's basically a solved problem with the proper overall system design.

hga · on June 3, 2013

Going by what I know of Azul's systems, their pauseless GC avoids a 1 second delay per GB of heap by focusing on solving the hardest case vs. deferring it. When running on their custom hardware (generic 64 bit RISC + their special sauce) it uses an instruction to cheaply implement a read barrier. According to a paper describing this system, which they've since improved, doing that on stock (x86_64) hardware would incur a ~20% performance penalty. If your application uses a lot of memory and can't afford pauses that's probably OK.

nickik · on June 3, 2013

There Collector C4 is a soft-realtime they can not provide realtime but if they can be quite sure that they have very samll pauses. I truly belive in that vision. The problem is that the all the parts of they system have to be updated.

At the moment they can run on x86_64 but its not as fast as it could be. The have a special build for every linux distribution.

I am not sure how good this all works if you have very few cores and not a lot of threads. All in all however the C4 defently shows the way of the future. Hardware developer should learn from Azul, the OS guys should work more on supporting managed runtimes and the programming language community should do so to. Managed Memory is the future exept in very, very special cases.

m0th87 · on June 3, 2013

I don't see how Go's pointers preclude a good GC. You can't have dangling pointers, so it's not as if they're forced to use a conservative GC. Care to elaborate?

pron · on June 3, 2013

The best-known issue is interior pointers[1]:

To give the programmer this flexibility, Go must support what we call interior pointers to objects allocated in the heap. The X.buf field in the example above lives within the struct but it is legal to capture the address of this inner field, for instance to pass it to an I/O routine. In Java, as in many garbage-collected languages, it is not possible to construct an interior pointer like this, but in Go it is idiomatic. This design point affects which collection algorithms can be used, and may make them more difficult, but after careful thought we decided that it was necessary to allow interior pointers because of the benefits to the programmer and the ability to reduce pressure on the (perhaps harder to implement) collector. So far, our experience comparing similar Go and Java programs shows that use of interior pointers can have a significant effect on total arena size, latency, and collection times.

Interior pointers may pose a problem for sophisticated GCs. Interestingly, this may not be a deal-breaker (see here: https://groups.google.com/forum/#!topic/golang-dev/GvA0DaCI2...), but because Go relies on interior pointers so much, and because the language does not distinguish between interior pointers and regular object pointers, this may prove to be a huge problem in practice.

[1]: http://talks.golang.org/2012/splash.article#TOC_14.

m0th87 · on June 3, 2013

Nice, thanks for the reference. It's also worth pointing out that Go's GC can't do soft real-time because of GC on the global heap. IIRC, Rust has a global heap as well, although it's less frequently used thanks to richer ownership semantics.

pcwalton · on June 3, 2013

> As an unrelated aside, I recall Andrei Alexandrescu making the argument that GC is important for type safety, unless the type system can prove that dangling pointers are impossible.

Rust's type system can.

cwzwarich · on June 3, 2013

But how do you go beyond the paradigm of linear / affine types and nested regions that Rust gives you when you remove GC? Lots of programs are difficult to write with only these features, e.g. a web browser, and when you do you just end up writing a garbage collector yourself anyways. Either you add optional unrestricted allocations or increase the complexity of the type system to allow for the expression of more safe deallocation using dependent types or another similarly strong logic.

I really like Rust, but I wonder how it is going to get over this hurdle that previous languages like Cyclone never fully solved.

kibwen · on June 3, 2013

The title of this post is somewhat misleading. Just because pcwalton wants to remove GC from the language doesn't mean that GC won't be available, just that it would be transformed from a compiler mechanism into a module in the standard library. The semantics would be exactly the same, though the syntax to use it would change a bit. The goal here isn't to hate on GC, it's to ensure that normal users are capable of implementing pointer types that are exactly as "blessed" as the built-in types--and the best way to do that is by eating your own dogfood.

cwzwarich · on June 3, 2013

pcwalton both wrote the article and the comment I was replying to. If GC isn't necessary because Rust's type system can prove that the deallocations are safe then the answer to the question of how to write programs with nontrivial memory allocation patterns shouldn't be "use an external unsafe GC".

kibwen · on June 3, 2013

The existence of Rust's current GC doesn't have anything to do with preventing dangling pointers. Off the top of my head, cyclical references are the only thing that you need GC for that you can't handle with unique pointers or borrowed pointers, but there may be others.

I would need to have read Andrei's original argument to be sure, but I'm guessing it was along the lines of "without a strong type system, garbage collection is necessary to prevent dangling pointers". And Rust's type system is strong enough to prevent dangling pointers; they cannot happen unless you are deliberately dropping into `unsafe` blocks. This is true regardless of if you have garbage collection built in. But that doesn't necessarily mean that the type system is powerful enough to encode all the patterns that GC allows.

jeltz · on June 3, 2013

Strongly agreed about the GC. One thing which got me interested in both Rust and Erlang is the fact that they do not have GC of shared memory and instead opt for a simple per process GC. Avoiding complicated problems where they are not necessary is part of good design.

mietek · on June 3, 2013

> As of now, the only really-good GCs are found in the JVM

This is quite a strong assertion. What other GCs have you evaluated and rejected as not-really-good? Does this include GHC's parallel generational GC, or BEAM VM's per-process GC?

carterschonwald · on June 3, 2013

The ghc gc, and the ghc rts generally, are really great pieces of engineering. Very readable code too!

And the ghc gc / rts are getting better every release. I'm hoping to spend some time late summer / early fall evaluating how to get better Numa support in. (And jvm notably lacks any Numa story as far as I can tell)

pjmlp · on June 3, 2013

At least Oracle and IBM VM do support NUMA. Surely there might be other vendors supporting as well.

http://docs.oracle.com/javase/7/docs/technotes/guides/vm/per...

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/index.jsp?top...

-

carterschonwald · on June 3, 2013

Gc support is the important first step.

But then the next step of work is scheduling work in a locality aware way. The libnuma lib in Linux and the portable hwloc lib that's part of the openmpi project are example substrates.

Admittedly, for most applications this second step won't make a large difference, but in my application domain, numerical computing, better memory locality at any given granularity can yield an order of magnitude improvement.

Locality aware scheduling in this context can actually yield a slightly super linear speed in the multicore / multi socket regime because of vagaries of a systems memory bus topology.

Ill hopefully be in a better position to quantify this last remark more precisely middle to late summer, assuming Wellposed's numerical product launch results in signing some early customers in the next 2 months.

pjmlp · on June 3, 2013

Thanks for the explanation. I never used NUMA systems.

pron · on June 3, 2013

I know nothing of the GHC GC (though the JVM has had a parallel, generational GC for many years, and is probably considered the least advanced of the JVM GCs), but BEAM's GC, as you mentioned, is per-process, and thus does not help with shared data structures (which do exist in Erlang, namely ETS).

mietek · on June 3, 2013

It's not valid to say BEAM VM's per-process GC doesn't help with shared data structures, as there are no shared data structures in Erlang.

ETS acts like shared memory, but is not, in fact, shared memory. ETS tables are kept in a separate memory region from process heaps, and aren't automatically garbage collected. They are destroyed when their parent process terminates, or manually.

Per-process GC is a natural consequence of the way systems are built in Erlang, which often involves using hundreds of thousands of short-lived processes. This decreases a system's GC load, as entire process heaps can simply be destroyed when their process terminates, without needing to be traversed. There's also no need for a stop-the-world GC pause.

pron · on June 3, 2013

Sure, but the fact that ETS tables exist shows that even Erlang recognizes the need in shared data. The fact that ETS tables aren't garbage collected is a limitation (and ETS tables have other limitations, too, like not supporting multi-object transactions).

The Erlang GC cannot be compared with Java's GC. It is not "better" or "worse". Erlang's GC wasn't designed for better throughput but to ensure complete isolation. Erlang values isolation over everything else because it is at the core of Erlang's fault tolerance philosophy. In Java, a task could trigger a collection that could potentially block all other threads. This can't (I think) happen in Erlang. But the JVM's GC has many other advantages.

mietek · on June 3, 2013

> multi-object concurrency

I'm not familiar with this term, and it doesn't seem googleable. Can you give a definition?

> The Erlang GC cannot be compared with Java's GC.

And yet, by saying "the only really-good GCs are found in the JVM", you compare all GCs.

pron · on June 3, 2013

multi-object concurrency

Sorry. I meant multi-object transactions. Slip of the finger.

And let me qualify: good GCs that are relevant to solving that problem GCs are essential for, namely concurrent data structures. Erlang doesn't have a GC that's relevant (I really like Erlang, BTW).

mietek · on June 3, 2013

Then surely you know support for transactions is built as a layer on top of ETS?

I actually don't like Erlang as a whole, because it lacks a static type system, but that's besides the point.

I reject characterising the BEAM VM's GC as not-really-good, especially in concurrency context. I believe the immutable, shared-nothing, message-passing model is a superior solution to the data problem than the threads-and-locks model. Actual memory sharing is an optimization which can be applied by the runtime system, as it's done for binaries in Erlang.

trailfox · on June 3, 2013

Perhaps he meant really good GCs in mainstream systems? It is a bit of an assertion on his part, but Java GCs are pretty advanced relative to most mainstream languages.

Aloisius · on June 3, 2013

Performance of modern software is now (and more so in the future) largely determined not by the single-thread speed of running an algorithm, but the scalability of an algorithm as it parallelizes across cores.

Is there any evidence that this is true for most modern software? I can think of far more instances of software that use threads for simple things like networking or UI because it is easier and/or their library makes them than software that is so CPU-bound for single algorithms that they need to parallelize to maximize performance across CPUs.

pron · on June 3, 2013

Most web servers run logic concurrently (different requests are handled in parallel). The request handlers then need to access a shared data store, and that's where you get a bottleneck (that people try to solve with all manners of eventually-consistent caches or distributed stores).

Aloisius · on June 3, 2013

Web servers run logic concurrently because it is the sanest architecture to allow people to write their handlers in whatever language they want. They certainly don't do it for performance reasons. The average web app programmer also doesn't use threads directly for running their own logic concurrently.

Now you could argue whatever shared data store they're accessing needs to run things concurrently in order to maximize performance (and certainly many do), but even then we're usually looking at relatively simple shared memory patterns often with fairly coarse locking that scales well to a few cores, but not much beyond that (which often doesn't matter since you can saturate your network and/or memory bus before that). Maybe once in a while you'll see a lock-less concurrent data structure, but they're still fairly exotic.

pron · on June 3, 2013

They certainly don't do it for performance reasons.

What? Care to elaborate? (AFAIK, the only popular single-threaded webapp server is node.js, which is one of the slower web servers around: http://www.techempower.com/benchmarks/#section=data-r4)

...that scales well to a few cores, but not much beyond that (which often doesn't matter since you can saturate your network and/or memory bus before that)

You're describing limitations to current implementations that seriously hinder scaling. Developers are far from happy with this state of affairs, which, thankfully is not a necessity. You most certainly will not saturate the network if the data is distributed wisely. Here's a post I once wrote for highscalability.com that discusses the networking issue: http://highscalability.com/blog/2012/8/20/the-performance-of...

The memory bus is a bigger problem in general: if you do enough work on enough cores you will saturate the bus. But this, too, is simply the current state of affairs and not some insurmountable barrier to scaling.

scott_s · on June 3, 2013

Ron, I wanted to tell you how much I've appreciated your contributions to this thread. (I would have emailed, but I can't find one for you anywhere!)

NkVczPkybiXICG · on June 3, 2013

ref-counting should be more than sufficient to deal with the standard multi-thread model of "read from concurrent queue, process, write to different concurrent queue". This covers a huge majority of cases where you'd want a GC. You only need a GC when you have unclear ownership semantics, and if that's the case you're usually fucked anyhow.

rbehrends · on June 3, 2013

Except that basic reference counting is inherently unsafe in a multi-threaded environment. And thread-safe reference counting is not any simpler than garbage collection [1].

[1] E.g.: http://dl.acm.org/citation.cfm?id=1111597

kibwen · on June 3, 2013

"According to conventional wisdom, the update of reference slots and reference-counts requires atomic or synchronized operations. In this work we demonstrate this is not the case [...]"

Interesting. Rust currently has an atomically-reference-counted type (ARC, briefly mentioned in the OP's post) for sharing data between tasks. I wonder if the algorithm they describe would be useful for Rust as well.

McP · on June 3, 2013

Neither of those statements are true, and the cited article doesn't back up your claim. Check out this thread safe reference counting pointer, it's not complicated: http://www.boost.org/doc/libs/1_53_0/libs/smart_ptr/shared_p...

rayiner · on June 3, 2013

A naive, thread-safe reference counter using atomic compare-swap is extremely expensive. On the best CPU's in the uncontended case it's 30-50 clock cycles. So modifying a single pointer field becomes a 100 clock cycle affair (decrement for the old value, increment for the new value).

Thread-safe deferred reference counting looks like GC.

Pxtl · on June 3, 2013

Would auto pointers and weak references work better? I've always liked that paradigm...single transferable ownership just feels conceptually clean.

pron · on June 3, 2013

Not complicated but prohibitively expensive. Here's the source code: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/libstdc++/api/a01034... It does a CAS on every copy!

McP · on June 3, 2013

Prohibitively? That really depends on your use case.

OpenSceneGraph [1] uses intrusive reference counting pointers yet has been found suitable for many real time applications.

[1] http://en.wikipedia.org/wiki/OpenSceneGraph

pron · on June 3, 2013

Prohibitive for the very data structures that make true many-core scaling possible and that necessitate a GC in the first place. A CAS on every pointer copy could obliterate your scaling. This kind of technique was fine in the good-old one- or two-core days. It won't cut it now - not if you want to scale.

Nothing about modern hardware is the same as it used to be. I strongly recommend watching this extremely enlightening talk by Doug Lea about designing multi-threaded algorithms (http://emergingtech.chariotsolutions.com/2013/04/phillyete-s...). For example, did you know that in order for a busy-wait (spin loop) to be effective on modern intel CPUs, it has to compute random numbers (otherwise, the CPU might think you're in the OS idle-loop and power-down the core)?

acqq · on June 3, 2013

Thanks for Doug Lea's presentation, I'll try to produce presentation from the video using ffmpeg. The sound is... unpleasant.

Regarding CAS on every pointer copy, it's true that the current hardware has problem with it. Still I wonder if it would be worth and enough to optimize some future CPU to speed up the "interlocked increments" and "interlocked decrements" by introducing delays only to the cores that "interockedly" access the same addresses? I guess it can be tricky as one core has a lot of "in-flight" memory operations at the same tick, but I don't know the real limitations or if there's something more I'm missing, anybody reading this who knows more, what are the limitations?

pron · on June 3, 2013

Concurrent queues are important, but weren't the data structures I was talking about. I was referring to persistent data structures (like Clojure's) that rely on versioning (and used in MVCC, for example). The old version is collected when there are no more references. These immutable data structures don't require ownership anyway.

Circular references are a big problem, because even if you have a single circular reference, a significant portion of your object graph may leak. (Say A -> B -> C -> D -> E -> ..., and also B -> A. A tiny problem like this has just caused your entire object graph to leak).

dmpk2k · on June 3, 2013

A problem with persistent data structures is that they're memory-hungry. Combining that with the overhead of a garbage collector...

I think such structures are useful (and nice!) in some circumstances, but as a general solution for efficient system software they're just too hungry to be useful, at least for now. Having useful memory cut to 1/6th (if I'm being generous) or lower is painful, and god forbid the operating system decides it's time to start swapping out pages. The effective memory bandwidth will take a kick in the pants too.

No silver bullet, &c.

pron · on June 3, 2013

My main issue with persistent ds is that they don't have good (CPU) cache behavior on updates. They don't necessarily consume that much memory -- a 32-way trie is quite memory efficient -- as long as the refs to the old versions die soon.

But even mutable concurrent data structures need to keep the old version around, albeit momentarily (hopefully), as they CAS in a new node (depending on the nature of the data structure) in the presence of concurrent readers. Achieving even that with simple ref-counting is not trivial.

steveklabnik · on June 3, 2013

There is a branch with a GC written in Rust: https://github.com/graydon/rust/tree/gc

sliverstorm · on June 3, 2013

One cool thing you could get from building the GC in a library- could it not then be much easier to bolt on a custom GC?

kmasters · on June 3, 2013

While the JVM has an excellent garbage collector (maybe the best on the planet), it's performance is linearly degraded by the amount of memory a process is using.

If you go into any Java shop (which is where I live) and try to run a highly available system with more than 8GB of RAM in a server instance, you will be lucky to do that.

Servers on the cheap exist with 256GB of RAM, if you can't get to 12GB without OOMs, and GC pauses of 30 seconds, you will never get to a process that scales to this hardware. Your GC pause on a 100GB system would be over a minute complete stop the world probably more.

And that's a shame, because the only antidote, is to over-distribute, and over cluster, and not only slow everything down by 1000X compared to a single server, but the engineering and operational costs to distribute everything into tiny 4GB buckets are also unacceptable.

I think the industry should rethink GC, if it can get past this boundary. 4GB walls and 8GB walls compared to the current availability of hardware is embarrassing.

I would rather not build a server that can't use all the memory I want it to without falling over.

My minimum required RAM metric is 256GB for a process. That could even be a fairly small (code wise) system that just needs a lot of RAM in which case manual memory management is a piece of cake.

twic · on June 3, 2013

We run JVMs - running complex webapps on Tomcat - in production with 28 GB heaps. We don't have OOMs or chronic GC pauses. We don't reboot in between fortnightly releases.

We aren't running machines with 100 GB heaps, but i can tell you from personal experience that it's possible to exceed 8 GB without difficulty.

I doubt you'd have to look far to find people who were running with substantially bigger heaps on stock Sunacle/OpenJDK JVMs.

kmasters · on June 4, 2013

I wont discredit your experience, however, I have been a consultant in several 64 bit shops, linux and solaris who had 64GB servers who could not scale above 8GB due to OOMs.

Also remember that if you are on linux doing that 28 GB heap you are living in the world of memory over-commit, which means if you actually tried to use that memory, you will probably segfault.

Again if you have done it great, no shop I have worked in, has been able to do it with app servers and keep their servers up reliably. Its a constant source of pain and I think JVM GC is not sufficient to scale at the level of current hardware.

carry_bit · on June 3, 2013

If you are working with large heaps you should check out Azul's C4 collector.

kmasters · on June 3, 2013

The trick is getting your company to pay for it and run the JVM on top of their VM. But yes, that's where I was heading.

hga · on June 3, 2013

Well, they did release it as open source in a "Managed Runtime Initiative", but no one did anything with it and the site is now offline.

It's a set of patches to Linux, to cheaply implement the memory management tricks, e.g. batched operations on big pages without redundant TLB clearing, and a set of patches to the Hotspot JVM, which is where I think most people had trouble digesting it. See e.g. http://en.wikipedia.org/wiki/User:AzulPM/Managed_Runtime_Ini... for more details.

It ideally would have gotten enough traction that the Linux patches would get accepted in the mainline kernel. Ah well, perhaps someday.