Removing garbage collection from the Rust language (pcwalton.github.io)
242 points by graue on June 3, 2013 | 130 comments



This seems like the right decision: simplify the language, flatten the learning curve, and delegate complex functionality to libraries.

But it's important to point out the general importance of good garbage collectors. GCs are important not only because they help avoid a common class of bugs, but because they are essential to many concurrent data structures. Performance of modern software is now (and more so in the future) largely determined not by the single-thread speed of running an algorithm, but by the scalability of an algorithm as it parallelizes across cores. Many of the most scalable concurrent data structures are almost impossible to implement robustly without a really good garbage collector.

As of now, the only really-good GCs are found in the JVM, and Java does indeed provide very low-level access to important memory primitives (direct control of memory fences is now possible through an undocumented API, and will probably be made public in the forthcoming Java 8). But as good and as powerful as the JVM is, such algorithms will be necessary in circumstances where the JVM is not the best choice, namely embedded systems. So a garbage collector as good as those available for the JVM (or almost as good) but for a low-level systems language would be a boon.

A good GC for C++ would probably be a headache because of the language's inherent unsafety in just about anything.

Go is not an option in those circumstances, either. For one, while Go compiles to native binaries, the language actually operates at a higher level than Java (for example, it offers no control over thread scheduling like Java does). More importantly, choices made by the language designers might preclude a GC implementation good enough for high-performance concurrent data structures.

Rust is a great language to try to build a good GC for. If it's possible to do so in a library and not in the core language, all the better.


I'm not sure I agree with the point regarding concurrent data structures and garbage collectors. The idea is nice, but ignores a rather serious problem in implementation: such garbage collectors are serious engineering undertakings.

To see how hard the implementation of a parallel and concurrent garbage collector can be, take a look at the Garbage-First Garbage Collection paper. Try to really understand all the moving parts in there and how they interact. Once you're done wiping your brains off the wall, realize that this is the extent to which GC engineers need to go to remove most scaling bottlenecks, regardless of your opinion of G1 itself. This is a serious and hard-core engineering effort that absorbs several man-years of expert effort (and I emphasize the expert part) to do properly. Rust would probably have to become the size and importance of the Java world to ever get this level of attention. Despite this, the JVM's collectors still routinely throw roadblocks in front of the mutator.

And that ignores the other tradeoffs that GCs present. E.g. most GCs need at least 3x the working set to run reasonably efficiently. That's memory you could otherwise be keeping in a local memory bank, or using for something else (like a disk cache, because page faults sure are crazy expensive...).

The best way to avoid this problem is to not play the game at all. The reason I became interested in Rust in the first place is that Graydon and co. wisely did not follow that pied piper. Optional thread-local garbage collectors are a much simpler problem.

As an unrelated aside, I recall Andrei Alexandrescu making the argument that GC is important for type safety, unless the type system can prove that dangling pointers are impossible.


Go's designers seemed to take a somewhat defeatist approach in this regard. Instead of writing a really good GC, they opted to design the runtime in such a way that more memory sharing is possible, and the pressure on the GC is hopefully reduced. But they've done this at a cost: they've deliberately introduced aliasing, which all but completely precludes the possibility of ever implementing a really good GC for Go. That's why I think Go might be good enough for current multi-core hardware, but will not scale in a many-core world (there are other aspects of Go that are major road blocks for future scaling as well).

I remember seeing a talk by Cliff Click where he explains how a small change to the CPU can greatly assist with developing pauseless GCs (though his former employer, Azul Systems, has moved on to implement the whole thing in software, probably at some cost to performance).

Regardless of the undertaking, the question remains -- as we move into a many-core era -- whether we'll find a programming paradigm good enough to take advantage of many cores. I think that whatever the winning paradigm may be, it will require a "really good" GC.


Concurrent, scalable GC is hard, but it's made harder by the complete lack of cooperation from the hardware and OS. For example, you can dramatically simplify the design of a concurrent GC if you can scan the thread stacks and registers in a snapshot before doing the concurrent tracing. This should not be a high-latency operation on existing hardware, but on, say, OS X it can take tens of milliseconds for just a couple of hundred threads. On Azul's machines, cooperation from its hypervisor allows thousands of threads to be paused in under a millisecond. Similarly, a read barrier like the kind implemented in Azul's hardware is not very complicated. The CPU already does, conceptually, a very elaborate virtual -> physical address translation + checking of access protections on every memory access, which is cached in the TLB. Adding a bit to the TLB to indicate that a page should be protected by a read barrier would cost very little.

Azul's GC is basically a "really good" GC. It can handle hundreds of CPUs with hundreds of GBs of heap with tens of GBs per second of allocation rate with pause times of 10-20 milliseconds. It's basically a solved problem with the proper overall system design.


Going by what I know of Azul's systems, their pauseless GC avoids a 1 second delay per GB of heap by focusing on solving the hardest case vs. deferring it. When running on their custom hardware (generic 64 bit RISC + their special sauce) it uses an instruction to cheaply implement a read barrier. According to a paper describing this system, which they've since improved, doing that on stock (x86_64) hardware would incur a ~20% performance penalty. If your application uses a lot of memory and can't afford pauses that's probably OK.


Their C4 collector is soft-realtime: they cannot provide hard realtime guarantees, but they can be quite sure that they have very small pauses. I truly believe in that vision. The problem is that all the parts of the system have to be updated.

At the moment they can run on x86_64, but it's not as fast as it could be. They have a special build for every Linux distribution.

I am not sure how well this all works if you have very few cores and not a lot of threads. All in all, however, C4 definitely shows the way of the future. Hardware developers should learn from Azul, the OS guys should work more on supporting managed runtimes, and the programming language community should do so too. Managed memory is the future except in very, very special cases.


I don't see how Go's pointers preclude a good GC. You can't have dangling pointers, so it's not as if they're forced to use a conservative GC. Care to elaborate?


The best-known issue is interior pointers[1]:

To give the programmer this flexibility, Go must support what we call interior pointers to objects allocated in the heap. The X.buf field in the example above lives within the struct but it is legal to capture the address of this inner field, for instance to pass it to an I/O routine. In Java, as in many garbage-collected languages, it is not possible to construct an interior pointer like this, but in Go it is idiomatic. This design point affects which collection algorithms can be used, and may make them more difficult, but after careful thought we decided that it was necessary to allow interior pointers because of the benefits to the programmer and the ability to reduce pressure on the (perhaps harder to implement) collector. So far, our experience comparing similar Go and Java programs shows that use of interior pointers can have a significant effect on total arena size, latency, and collection times.

Interior pointers may pose a problem for sophisticated GCs. Interestingly, this may not be a deal-breaker (see here: https://groups.google.com/forum/#!topic/golang-dev/GvA0DaCI2...), but because Go relies on interior pointers so much, and because the language does not distinguish between interior pointers and regular object pointers, this may prove to be a huge problem in practice.

[1]: http://talks.golang.org/2012/splash.article#TOC_14.


Nice, thanks for the reference. It's also worth pointing out that Go's GC can't do soft real-time because of GC on the global heap. IIRC, Rust has a global heap as well, although it's less frequently used thanks to richer ownership semantics.


> As an unrelated aside, I recall Andrei Alexandrescu making the argument that GC is important for type safety, unless the type system can prove that dangling pointers are impossible.

Rust's type system can.


But how do you go beyond the paradigm of linear / affine types and nested regions that Rust gives you when you remove GC? Lots of programs are difficult to write with only these features, e.g. a web browser, and when you do, you often just end up writing a garbage collector yourself anyway. Either you add optional unrestricted allocations or increase the complexity of the type system to allow for the expression of more safe deallocation using dependent types or another similarly strong logic.

I really like Rust, but I wonder how it is going to get over this hurdle that previous languages like Cyclone never fully solved.


The title of this post is somewhat misleading. Just because pcwalton wants to remove GC from the language doesn't mean that GC won't be available, just that it would be transformed from a compiler mechanism into a module in the standard library. The semantics would be exactly the same, though the syntax to use it would change a bit. The goal here isn't to hate on GC, it's to ensure that normal users are capable of implementing pointer types that are exactly as "blessed" as the built-in types--and the best way to do that is by eating your own dogfood.
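
To make the distinction concrete, here is a rough sketch of what that shift would look like in code. The Gc name and API below are hypothetical stand-ins, not the actual proposal; modern Rust's Rc<T> is used only as an example of a library smart pointer that is exactly as "blessed" as built-in types:

    // Hypothetical sketch only: `Gc` is a stand-in name, not a real API.
    //
    // Built-in, compiler-blessed syntax:
    //     let x = @5;           // managed box, special-cased by the compiler
    // Library-type equivalent, spelled like any other smart pointer:
    //     let x = Gc::new(5);   // just a generic type in a standard library module
    //
    // Rc<T> (modern Rust) already works the library way, which is the dogfood point:
    use std::rc::Rc;

    fn main() {
        let x = Rc::new(5);
        let y = x.clone(); // sharing through a plain library type, no special syntax
        println!("{} {}", x, y);
    }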


pcwalton both wrote the article and the comment I was replying to. If GC isn't necessary because Rust's type system can prove that the deallocations are safe then the answer to the question of how to write programs with nontrivial memory allocation patterns shouldn't be "use an external unsafe GC".


The existence of Rust's current GC doesn't have anything to do with preventing dangling pointers. Off the top of my head, cyclical references are the only thing that you need GC for that you can't handle with unique pointers or borrowed pointers, but there may be others.

I would need to have read Andrei's original argument to be sure, but I'm guessing it was along the lines of "without a strong type system, garbage collection is necessary to prevent dangling pointers". And Rust's type system is strong enough to prevent dangling pointers; they cannot happen unless you are deliberately dropping into `unsafe` blocks. This is true regardless of whether you have garbage collection built in. But that doesn't necessarily mean that the type system is powerful enough to encode all the patterns that GC allows.


Strongly agreed about the GC. One thing which got me interested in both Rust and Erlang is the fact that they do not have GC of shared memory and instead opt for a simple per process GC. Avoiding complicated problems where they are not necessary is part of good design.


> As of now, the only really-good GCs are found in the JVM

This is quite a strong assertion. What other GCs have you evaluated and rejected as not-really-good? Does this include GHC's parallel generational GC, or BEAM VM's per-process GC?


The ghc gc, and the ghc rts generally, are really great pieces of engineering. Very readable code too!

And the ghc gc / rts are getting better every release. I'm hoping to spend some time late summer / early fall evaluating how to get better NUMA support in. (And the JVM notably lacks any NUMA story as far as I can tell.)


At least the Oracle and IBM VMs do support NUMA. There might well be other vendors supporting it as well.

http://docs.oracle.com/javase/7/docs/technotes/guides/vm/per...

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/index.jsp?top...



GC support is the important first step.

But then the next step of work is scheduling work in a locality aware way. The libnuma lib in Linux and the portable hwloc lib that's part of the openmpi project are example substrates.

Admittedly, for most applications this second step won't make a large difference, but in my application domain, numerical computing, better memory locality at any given granularity can yield an order of magnitude improvement.

Locality-aware scheduling in this context can actually yield a slightly superlinear speedup in the multicore / multi-socket regime because of the vagaries of a system's memory bus topology.

I'll hopefully be in a better position to quantify this last remark more precisely middle to late summer, assuming Wellposed's numerical product launch results in signing some early customers in the next 2 months.


Thanks for the explanation. I never used NUMA systems.


I know nothing of the GHC GC (though the JVM has had a parallel, generational GC for many years, and is probably considered the least advanced of the JVM GCs), but BEAM's GC, as you mentioned, is per-process, and thus does not help with shared data structures (which do exist in Erlang, namely ETS).


It's not valid to say BEAM VM's per-process GC doesn't help with shared data structures, as there are no shared data structures in Erlang.

ETS acts like shared memory, but is not, in fact, shared memory. ETS tables are kept in a separate memory region from process heaps, and aren't automatically garbage collected. They are destroyed when their parent process terminates, or manually.

Per-process GC is a natural consequence of the way systems are built in Erlang, which often involves using hundreds of thousands of short-lived processes. This decreases a system's GC load, as entire process heaps can simply be destroyed when their process terminates, without needing to be traversed. There's also no need for a stop-the-world GC pause.


Sure, but the fact that ETS tables exist shows that even Erlang recognizes the need for shared data. The fact that ETS tables aren't garbage collected is a limitation (and ETS tables have other limitations, too, like not supporting multi-object transactions).

The Erlang GC cannot be compared with Java's GC. It is not "better" or "worse". Erlang's GC wasn't designed for better throughput but to ensure complete isolation. Erlang values isolation over everything else because it is at the core of Erlang's fault tolerance philosophy. In Java, a task could trigger a collection that could potentially block all other threads. This can't (I think) happen in Erlang. But the JVM's GC has many other advantages.


> multi-object concurrency

I'm not familiar with this term, and it doesn't seem googleable. Can you give a definition?

> The Erlang GC cannot be compared with Java's GC.

And yet, by saying "the only really-good GCs are found in the JVM", you compare all GCs.


multi-object concurrency

Sorry. I meant multi-object transactions. Slip of the finger.

And let me qualify: good GCs that are relevant to solving the problem GCs are essential for, namely concurrent data structures. Erlang doesn't have a GC that's relevant to that problem (I really like Erlang, BTW).


Then surely you know support for transactions is built as a layer on top of ETS?

I actually don't like Erlang as a whole, because it lacks a static type system, but that's beside the point.

I reject characterising the BEAM VM's GC as not-really-good, especially in a concurrency context. I believe the immutable, shared-nothing, message-passing model is a better solution to the data problem than the threads-and-locks model. Actual memory sharing is an optimization which can be applied by the runtime system, as is done for binaries in Erlang.


Perhaps he meant really good GCs in mainstream systems? It is a bit of an assertion on his part, but Java GCs are pretty advanced relative to most mainstream languages.


Performance of modern software is now (and more so in the future) largely determined not by the single-thread speed of running an algorithm, but by the scalability of an algorithm as it parallelizes across cores.

Is there any evidence that this is true for most modern software? I can think of far more instances of software that use threads for simple things like networking or UI because it is easier and/or their library makes them than software that is so CPU-bound for single algorithms that they need to parallelize to maximize performance across CPUs.


Most web servers run logic concurrently (different requests are handled in parallel). The request handlers then need to access a shared data store, and that's where you get a bottleneck (that people try to solve with all manners of eventually-consistent caches or distributed stores).


Web servers run logic concurrently because it is the sanest architecture to allow people to write their handlers in whatever language they want. They certainly don't do it for performance reasons. The average web app programmer also doesn't use threads directly for running their own logic concurrently.

Now you could argue whatever shared data store they're accessing needs to run things concurrently in order to maximize performance (and certainly many do), but even then we're usually looking at relatively simple shared memory patterns often with fairly coarse locking that scales well to a few cores, but not much beyond that (which often doesn't matter since you can saturate your network and/or memory bus before that). Maybe once in a while you'll see a lock-less concurrent data structure, but they're still fairly exotic.


They certainly don't do it for performance reasons.

What? Care to elaborate? (AFAIK, the only popular single-threaded webapp server is node.js, which is one of the slower web servers around: http://www.techempower.com/benchmarks/#section=data-r4)

...that scales well to a few cores, but not much beyond that (which often doesn't matter since you can saturate your network and/or memory bus before that)

You're describing limitations to current implementations that seriously hinder scaling. Developers are far from happy with this state of affairs, which, thankfully is not a necessity. You most certainly will not saturate the network if the data is distributed wisely. Here's a post I once wrote for highscalability.com that discusses the networking issue: http://highscalability.com/blog/2012/8/20/the-performance-of...

The memory bus is a bigger problem in general: if you do enough work on enough cores you will saturate the bus. But this, too, is simply the current state of affairs and not some insurmountable barrier to scaling.


Ron, I wanted to tell you how much I've appreciated your contributions to this thread. (I would have emailed, but I can't find one for you anywhere!)


ref-counting should be more than sufficient to deal with the standard multi-thread model of "read from concurrent queue, process, write to different concurrent queue". This covers a huge majority of cases where you'd want a GC. You only need a GC when you have unclear ownership semantics, and if that's the case you're usually fucked anyhow.
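
As a rough illustration (a sketch in modern Rust rather than the 2013 syntax discussed here), the queue pattern above needs nothing more than atomic reference counting: ownership is clear, and each value is freed as soon as the last stage drops it.

    use std::sync::Arc;
    use std::sync::mpsc::channel;
    use std::thread;

    fn main() {
        let (in_tx, in_rx) = channel::<Arc<Vec<u32>>>();
        let (out_tx, out_rx) = channel::<u32>();

        // Worker: read from one concurrent queue, process, write to another.
        let worker = thread::spawn(move || {
            for chunk in in_rx {
                out_tx.send(chunk.iter().sum()).unwrap();
            } // each Arc is freed once the last holder drops it
        });

        for i in 0..3 {
            in_tx.send(Arc::new(vec![i, i + 1, i + 2])).unwrap();
        }
        drop(in_tx); // close the input queue so the worker exits

        let results: Vec<u32> = out_rx.iter().collect();
        worker.join().unwrap();
        println!("{:?}", results);
    }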


Except that basic reference counting is inherently unsafe in a multi-threaded environment. And thread-safe reference counting is not any simpler than garbage collection [1].

[1] E.g.: http://dl.acm.org/citation.cfm?id=1111597


"According to conventional wisdom, the update of reference slots and reference-counts requires atomic or synchronized operations. In this work we demonstrate this is not the case [...]"

Interesting. Rust currently has an atomically-reference-counted type (ARC, briefly mentioned in the OP's post) for sharing data between tasks. I wonder if the algorithm they describe would be useful for Rust as well.


Neither of those statements are true, and the cited article doesn't back up your claim. Check out this thread safe reference counting pointer, it's not complicated: http://www.boost.org/doc/libs/1_53_0/libs/smart_ptr/shared_p...


A naive, thread-safe reference counter using atomic compare-swap is extremely expensive. On the best CPUs in the uncontended case it's 30-50 clock cycles. So modifying a single pointer field becomes a 100 clock cycle affair (decrement for the old value, increment for the new value).

Thread-safe deferred reference counting looks like GC.
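
To make that concrete, here's a small sketch using modern Rust's Arc as a stand-in for a thread-safe refcounted pointer: replacing one refcounted pointer field with another costs two atomic read-modify-write operations, even with no contention at all.

    use std::sync::Arc;

    struct Slot {
        current: Arc<Vec<u8>>,
    }

    fn update(slot: &mut Slot, new_value: &Arc<Vec<u8>>) {
        // clone(): atomic increment of the new value's count;
        // the assignment then drops the old Arc: atomic decrement (and maybe a free).
        slot.current = new_value.clone();
    }

    fn main() {
        let a = Arc::new(vec![1, 2, 3]);
        let b = Arc::new(vec![4, 5, 6]);
        let mut slot = Slot { current: a };
        update(&mut slot, &b);
        println!("{:?}", slot.current);
    }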


Would auto pointers and weak references work better? I've always liked that paradigm...single transferable ownership just feels conceptually clean.


Not complicated but prohibitively expensive. Here's the source code: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/libstdc++/api/a01034... It does a CAS on every copy!


Prohibitively? That really depends on your use case.

OpenSceneGraph [1] uses intrusive reference counting pointers yet has been found suitable for many real time applications.

[1] http://en.wikipedia.org/wiki/OpenSceneGraph


Prohibitive for the very data structures that make true many-core scaling possible and that necessitate a GC in the first place. A CAS on every pointer copy could obliterate your scaling. This kind of technique was fine in the good-old one- or two-core days. It won't cut it now - not if you want to scale.

Nothing about modern hardware is the same as it used to be. I strongly recommend watching this extremely enlightening talk by Doug Lea about designing multi-threaded algorithms (http://emergingtech.chariotsolutions.com/2013/04/phillyete-s...). For example, did you know that in order for a busy-wait (spin loop) to be effective on modern Intel CPUs, it has to compute random numbers (otherwise, the CPU might think you're in the OS idle-loop and power down the core)?


Thanks for Doug Lea's presentation, I'll try to produce a presentation from the video using ffmpeg. The sound is... unpleasant.

Regarding CAS on every pointer copy, it's true that current hardware has problems with it. Still, I wonder if it would be worthwhile and enough to optimize some future CPU to speed up the "interlocked increments" and "interlocked decrements" by introducing delays only for the cores that access the same addresses in an interlocked way? I guess it can be tricky as one core has a lot of "in-flight" memory operations at the same tick, but I don't know the real limitations or if there's something more I'm missing; anybody reading this who knows more, what are the limitations?


Concurrent queues are important, but weren't the data structures I was talking about. I was referring to persistent data structures (like Clojure's) that rely on versioning (and used in MVCC, for example). The old version is collected when there are no more references. These immutable data structures don't require ownership anyway.

Circular references are a big problem, because even if you have a single circular reference, a significant portion of your object graph may leak. (Say A -> B -> C -> D -> E -> ..., and also B -> A. A tiny problem like this has just caused your entire object graph to leak).
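
A minimal sketch of that failure mode (modern Rust, using the library's non-atomic refcounted pointer): the drop messages below are never printed, because the A <-> B cycle keeps both counts above zero forever.

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        name: &'static str,
        next: RefCell<Option<Rc<Node>>>,
    }

    impl Drop for Node {
        fn drop(&mut self) {
            println!("dropping {}", self.name); // never runs for A or B
        }
    }

    fn main() {
        let a = Rc::new(Node { name: "A", next: RefCell::new(None) });
        let b = Rc::new(Node { name: "B", next: RefCell::new(None) });
        *a.next.borrow_mut() = Some(b.clone()); // A -> B
        *b.next.borrow_mut() = Some(a.clone()); // B -> A closes the cycle
        // a and b go out of scope here, but neither count ever reaches zero,
        // so the nodes (and anything reachable from them) leak.
    }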


A problem with persistent data structures is that they're memory-hungry. Combining that with the overhead of a garbage collector...

I think such structures are useful (and nice!) in some circumstances, but as a general solution for efficient system software they're just too hungry to be useful, at least for now. Having useful memory cut to 1/6th (if I'm being generous) or lower is painful, and god forbid the operating system decides it's time to start swapping out pages. The effective memory bandwidth will take a kick in the pants too.

No silver bullet, &c.


My main issue with persistent ds is that they don't have good (CPU) cache behavior on updates. They don't necessarily consume that much memory -- a 32-way trie is quite memory efficient -- as long as the refs to the old versions die soon.

But even mutable concurrent data structures need to keep the old version around, albeit momentarily (hopefully), as they CAS in a new node (depending on the nature of the data structure) in the presence of concurrent readers. Achieving even that with simple ref-counting is not trivial.


There is a branch with a GC written in Rust: https://github.com/graydon/rust/tree/gc


One cool thing you could get from building the GC in a library: wouldn't it then be much easier to bolt on a custom GC?


While the JVM has an excellent garbage collector (maybe the best on the planet), its performance degrades linearly with the amount of memory a process is using.

If you go into any Java shop (which is where I live) and try to run a highly available system with more than 8GB of RAM in a server instance, you will be lucky to manage it.

Servers with 256GB of RAM exist on the cheap; if you can't get to 12GB without OOMs and 30-second GC pauses, you will never get to a process that scales to this hardware. Your GC pause on a 100GB system would be a complete stop-the-world of over a minute, probably more.

And that's a shame, because the only antidote is to over-distribute and over-cluster, which not only slows everything down by 1000X compared to a single server, but also carries unacceptable engineering and operational costs to distribute everything into tiny 4GB buckets.

I think the industry should rethink GC, if it can get past this boundary. 4GB walls and 8GB walls compared to the current availability of hardware is embarrassing.

I would rather not build a server that can't use all the memory I want it to without falling over.

My minimum required RAM metric is 256GB for a process. That could even be a fairly small (code wise) system that just needs a lot of RAM in which case manual memory management is a piece of cake.


We run JVMs - running complex webapps on Tomcat - in production with 28 GB heaps. We don't have OOMs or chronic GC pauses. We don't reboot in between fortnightly releases.

We aren't running machines with 100 GB heaps, but I can tell you from personal experience that it's possible to exceed 8 GB without difficulty.

I doubt you'd have to look far to find people who were running with substantially bigger heaps on stock Sunacle/OpenJDK JVMs.


I won't discredit your experience; however, I have been a consultant in several 64-bit shops, Linux and Solaris, that had 64GB servers and could not scale above 8GB due to OOMs.

Also remember that if you are on Linux doing that 28 GB heap, you are living in the world of memory overcommit, which means if you actually tried to use that memory, you would probably segfault.

Again, if you have done it, great; no shop I have worked in has been able to do it with app servers and keep their servers up reliably. It's a constant source of pain, and I think JVM GC is not sufficient to scale at the level of current hardware.


If you are working with large heaps you should check out Azul's C4 collector.


The trick is getting your company to pay for it and run the JVM on top of their VM. But yes, that's where I was heading.


Well, they did release it as open source in a "Managed Runtime Initiative", but no one did anything with it and the site is now offline.

It's a set of patches to Linux, to cheaply implement the memory management tricks, e.g. batched operations on big pages without redundant TLB clearing, and a set of patches to the Hotspot JVM, which is where I think most people had trouble digesting it. See e.g. http://en.wikipedia.org/wiki/User:AzulPM/Managed_Runtime_Ini... for more details.

It ideally would have gotten enough traction that the Linux patches would get accepted in the mainline kernel. Ah well, perhaps someday.


Rust continues to shape up as a real C++ competitor. And C++ certainly needs some competition. It is way too lonely in its domain.

I hope the next version of Rust will have a proper Windows package, though. I don't understand why they don't bundle the version of MinGW they depend on. I was pretty disappointed to be greeted by missing DLL errors after running Rust's Windows installer.

There are countless MinGW builds, with different thread libraries, exception models, etc. I guess when they say "MinGW" they mean the binaries from the original MinGW project. However, those get updated all the time. Thus the packages the MinGW installer would download today might not be compatible with the Rust 0.6 binaries, which were built some time ago... probably with a different version of said packages. Thus they should really release a complete package.

Even if there is no compatibility issue, it is just bad policy. I actually program C/C++ on Windows AND have a MinGW build on my machine.. but not the one Rust needs. Which is common because other builds are much more up to date and full-featured than those from the original MinGW project. And I really didn't feel like doing an additional MinGW installation just to play around a little with Rust.


I asked a similar question on the IRC and got the answer that they are aiming for the Microsoft C compiler so it would work as natively with the system as possible. MinGW has some slight issues when launching applications, IIRC.


By the way, why does Rust need a C compiler anyway?

I admit I am ignorant here because I never cared. I know that many academic/niche programming languages bundle MinGW/depend on a MinGW or even Cygwin installation.. I always assumed that was to cut some corners. I mean, I have used multiple programming languages on Windows which compile to native code and have no dependency on a C compiler.


> By the way, why does Rust need a C compiler anyway?

I'm not 100% sure, but I do know that zero.rs requires 'the following libc functions: malloc, free, abort, memcpy, and memcmp'. So I'd imagine it's that.


Those functions could be easily done via syscalls to the underlying OS or simple Assembly routines, as they are quite simple to implement.

I have not yet looked at Rust's code, but I guess it is required for some bootstrapping code.


Those functions could be easily done via syscalls to the underlying OS or simple Assembly routines, as they are quite simple to implement.

None of those four functions use syscalls, nor are any of them simple to implement efficiently in assembly.

Consider that memcpy, under GCC,

* generates different code based on any known alignment of the source or destination

* generates simpler code if the size is known at compile time

* generates only register accesses if one or both of the arguments live in registers

* is elided entirely if GCC determines that it may do so safely

The other three have similar complex behavior. The Rust developers didn't use them just because they were too dumb to know how to write a for loop in assembly.


Simple is not the same as easy.

Anyone with compiler development experience can implement them without much trouble, hence simple.

Sadly that is a skill many seem to lack nowadays.


Yeah, back in the old days everyone had that kind of compiler development experience, didn't they?


Compared to what kids seem to know nowadays, yes.

They can grok an HTML5 page full of JavaScript stuff, but then they mix language with implementation, think that strong typing is only possible with VM-based languages, and faint at the sight of assembly code.


I think malloc/free at least are certainly far from simple to implement..


Simple does not mean easy.


What's the definition of "hard to implement" if not complexity?


If you are targeting an existing OS, malloc/free can be simple wrappers to OS syscalls for memory allocation. Not very efficient, but pretty straightforward to implement.

If targeting direct hardware, any CS student should have learned how to implement a memory allocator in their compiler design, data structures, or OS classes.

Back when I graduated, this would be a homework exercise.

If you want to implement a high performance memory allocator that works under pressure in a multi-core context, then yes it is complex to implement.

Now something that allocates/releases memory with a general purpose algorithm? That is quite simple and many compiler design books even provide such examples.
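
For what it's worth, here is roughly what the "simple wrapper around what the OS provides" version looks like: a sketch in Rust assuming the libc crate (Linux-only, page-granular, wasteful, and with essentially no error handling), not anything a production runtime would ship.

    use std::ptr;

    const HEADER: usize = std::mem::size_of::<usize>();

    // A toy malloc: one anonymous mapping per allocation, with the mapping
    // size stashed in a header so free() knows how much to unmap.
    unsafe fn my_malloc(size: usize) -> *mut u8 {
        let total = size + HEADER;
        let p = libc::mmap(
            ptr::null_mut(),
            total,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        if p == libc::MAP_FAILED {
            return ptr::null_mut();
        }
        *(p as *mut usize) = total;
        (p as *mut u8).add(HEADER)
    }

    unsafe fn my_free(p: *mut u8) {
        if p.is_null() {
            return;
        }
        let base = p.sub(HEADER) as *mut libc::c_void;
        libc::munmap(base, *(base as *const usize));
    }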


> If you are targeting an existing OS, malloc/free can be simple wrappers to OS syscalls for memory allocation.

Not really, there's no such thing as malloc/free syscalls. On Linux, you have mmap/munmap (at page granularity) and brk/sbrk (which only extend the current heap). You could implement malloc as either a mmap() of one or more pages, or by using sbrk() to extend the heap by exactly the memory you need, but then freeing that memory becomes much harder.


To take the example of Windows, wouldn't this qualify as a malloc/free syscall equivalent?

http://msdn.microsoft.com/en-us/library/windows/desktop/aa36...


In fact, I believe modern versions of msvcrt.dll just call HeapAlloc(GetProcessHeap(), ...) in the usual case.


That is what I meant by OS syscalls.

Should I write a full tutorial on an HN post how to write memory allocators when anyone can easily search for one?


> Should I write a full tutorial on an HN post how to write memory allocators when anyone can easily search for one?

No, but a memory allocator is more than "simple wrappers". My point was that OS kernels (Linux/BSDs at least) don't actually provide malloc/free as syscalls, and that you need to implement an allocator in userspace if you want these functions (even if it's a trivial allocator that wastes memory).


Of course they don't, why should they?!

malloc/free are part of the C language runtime, not OS syscalls.

A compiler writer can build the memory allocator on top of what the OS provides as syscalls or take a shortcut and use the C runtime library instead while adding a dependency to it.

I never mentioned malloc/free were syscalls.


Those functions should be available in %systemroot%\system32\msvcrt.dll.


msvcrt.dll is not an OS API library, Kernel32.dll is.


http://msdn.microsoft.com/en-us/library/abx4dbyh(v=vs.80).as... The msvcrt.dll is now a "known DLL," meaning that it is a system component owned and built by Windows. It is intended for future use only by system-level components.

Msvcrt.dll began shipping as part of the OS around the Win2K/XP timeframe, IIRC.


It still does not make it an OS API, it is the C runtime library which is quite different even if bundled with the OS.


I know some folks who think of the NT syscall interface as the 'true' OS API and see kernel32.dll as an add-on. I'll ask, but I don't think these folks make a distinction between user32.dll and the msvcrt.dll that ships with the OS.


> I know some folks who think of the NT syscall interface as the 'true' OS API and see kernel32.dll as an add-on.

This is 100% correct, many of the real syscalls are located in Ntoskrnl, while kernel32.dll is actually part of the Win32 subsystem, given the personalities feature from Windows.

So you could be using another subsystem.

However, given that the POSIX and OS/2 subsystems are no longer supported, the only available subsystem is Win32.

Metro is another matter as it uses another architecture not based on the subsystems mechanism.

But in the end these are all OS APIs, not language runtimes.


OK, asking around, the word I get is that the OS team owns the msvcrt.dll in %systemroot%\system32. Although unofficially it might be the case that x86 apps built with VC6 depend on this dll to implement a stable interface, it's not officially recommended to take dependencies on it.


So it (Firefox with Servo) would work on Windows as close to native as possible? It would be funny to start Firefox and have a console window popping in/out of existence or some other visual quirk.


I believe the reason in the past was due to C++-implemented NPAPI plugins like Flash. Only MSVC builds of Mozilla/Firefox were officially supported.


> By the way, why does Rust need a C compiler anyway?

Bootstrapping.


Rust bootstraps by downloading an existing snapshot of the rust compiler, which it then uses to build the stage0 rust compiler, which then builds the stage1 rust compiler, which then builds the stage2 rust compiler.


I like it!

It sounds like ~ pointers are basically like unique_ptr in C++11 or scoped_ptr at Google (http://google-styleguide.googlecode.com/svn/trunk/cppguide.x...). In practice, I find that these are the most commonly useful semantics, and as Patrick mentions it is simple and predictable.

I fully agree that the landscape of GC/refcounting solutions is diverse. If indeed there is a way to allow for different approaches without favoring one, I think that would definitely make Rust more widely useful and future-proof.

One question: one distinguishing factor of GC (vs refcounting) is the need to "stop the world" during the mark phase. Would there be a way of doing this from the standard library, or would the GC scheme itself be implemented outside the language? Likewise with any barriers that might be required for mutations, scanning the stack, and other tricky parts of implementing GC?


> It sounds like ~ pointers are basically like unique_ptr in C++11 or scoped_ptr at Google

I am not mega familiar with all of the details of {unique,scoped}_ptr, but my current understanding of ~ is this: basically, the compiler inserts a malloc before and a free after something declared with ~ goes into and out of scope, and it's the only pointer allowed to that memory.
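
A tiny sketch of those semantics, written here with Box<T> (modern Rust's owned-pointer type, used as a stand-in for ~): one owner, the memory is freed when that owner goes out of scope, and ownership can be moved but not duplicated.

    fn main() {
        let a = Box::new(42); // heap allocation with a single owner
        let b = a;            // ownership moves to `b`; `a` can no longer be used
        println!("{}", b);
    }                         // `b` goes out of scope here and the allocation is freed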


> it's the only pointer allowed to that memory

Not quite true with borrowing (i.e. & and &mut), which allows a temporary pointer to the memory to be created. Unlike C++ however, the compiler makes sure all borrows are safe via a fairly intricate borrow checker, that guarantees (among other things) that the borrowed pointers don't outlive the original object. (i.e. no dangling references.)
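
A small sketch of what the borrow checker enforces (modern Rust syntax): the safe borrow below compiles, while the commented-out function is rejected because the borrow would outlive the value it points into.

    fn main() {
        let owner = String::from("hello");
        let borrowed: &str = &owner; // fine: `owner` outlives `borrowed`
        println!("{}", borrowed);

        // Rejected at compile time (a would-be dangling reference):
        // fn dangle() -> &'static str {
        //     let s = String::from("oops");
        //     &s // error: `s` does not live long enough
        // }
    }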


Ah! Yes. That's what I get for posting at 3am, thank you. :)


> one distinguishing factor of GC (vs refcounting) is the need to "stop the world" during the mark phase

I can't answer the rest of your question, but note that Rust doesn't allow shared memory between tasks. Therefore, in the worst case, any garbage collector would only need to "stop the task" to mark, while the remainder of the tasks proceed uninhibited.


That's only really true for the current GC design. I don't know if there is anything stopping someone from sharing memory via unsafe pointers and trying to manage that with a custom GC.


True, I don't know if anything's going to stop you, but in my uninformed opinion you'd probably be violating so many built-in assumptions that you'd have to perform invasive surgery on the runtime to get it to work. But I'm hardly an expert in this area.


> one distinguishing factor of GC (vs refcounting) is the need to "stop the world" during the mark phase.

That's a distinguishing factor only in practice. In theory there are real time garbage collectors.


In practice there are also hybrids that try to delay long stop-the-world phases as much as possible, but still need to stop-the-world when compacting the heap, like JVM's CMS.

And I was under the impression that there already are GC implementations that never go into stop-the-world phases, like Azul's Pauseless GC [1]. There's also the new G1 from JDK 7 that promises to make pauses fewer and more predictable than CMS, while avoiding many memory fragmentation issues that CMS has (how well that works out in practice, remains to be seen, as it's not fully matured at this point).

Btw, I love how Rust is trying to solve memory management issues by reinventing the problem. That's what new languages should do, instead of reimplementing other languages poorly.

[1] http://www.artima.com/lejava/articles/azul_pauseless_gc.html


On top of that, malloc (and friends) don't make any useful worst-case runtime guarantees either.


That's very important to remember if your heap has a messy enough history.

The right language + a good GC can give you much better bounded worst-case guarantees, with Azul's as the best I know of for an imperative language (the more functional the language, the easier it is to do a variety of things).


Yes, there are ones like Azul's, and don't forget the copying GCs that have different characteristics than mark/sweep(/compact).


No mention of escape analysis. That's where I'm placing my bets.

http://en.wikipedia.org/wiki/Escape_analysis

Most short-lived objects can be stack allocated. Then the heap gets much less of a work out.

The runtime can figure out heap vs stack. Progressively better over time. Definitely better than I can do.

C# has (had?) the option of explicit stack allocated data (pseudo objects). Terrible idea. Premature optimization that prevents the runtime from doing a better job.

If I was doing embedded, realtime, or kernel dev work, I'd want to fiddle the bits myself. But I don't so I don't.


Java does escape analysis, but if I remember correctly, it is no longer used for stack allocation (it is used for other advanced stuff, like lock elision, and replacing objects with scalars) because the JVM GCs have gotten so good that stack allocation provided no significant benefit. The reason is that stack allocation is relevant only for short-lived objects anyway, and for generational GCs, short-lived objects are (almost) a non-issue. GCs struggle much more with long-lived objects, which eventually require compaction, whose cost rises linearly with the size of the live data set. This is the cause of the problems with large heaps mentioned in the comments below.


Very interesting. Thank you for the update. I'm more than a few years out of date. I may have to place new bets. :)

FWIW, Azul has presented to our local user group (seajug.org) a few times, most recently Nov 2012. There's video http://www.nimret.org/seajug/index.jsp?p=2012%2Fnov%2F By all accounts, their allocation and garbage collection implementations are the best available.


> The runtime can figure out heap vs stack. Progressively better over time. Definitely better than I can do.

Not really true: when you get into higher order control flow it's a huge mess and usually escape analysis falls down. The Rust lifetime system allows stack allocation to a greater extent than any escape analysis system I know of.

That said, we've thought about doing escape analysis, but usually you don't need it in Rust: people reach for the garbage collector only when it's absolutely needed anyway, so having the compiler promote GC allocations to stack allocations automatically would only help in a tiny fraction of cases.


I would place my bets on that. The analysis is hard and the gain is quite small.

Don't take my word for it; the guys from Azul threw it out of the code completely and replaced it with something they call 'escape detection'.

“What is Azul’s support for Stack Allocation?”

Azul’s support for stack allocation is really support for Escape Detection – fast detection on when an object is escaping a stack lifetime. Sun has been trying to add a classic Escape Analysis (EA) to HotSpot for quite some time. Its not on by default now. Research from IBM some years ago showed that its really really hard to make EA effective on large program, although it works really well on lots of microbenchmarks.

I can't give perfect references right now. Search the internet for Cliff Click and Gil Tene, especially the talks.

I remember that on their own hardware they do 'Escape Detection' with custom hardware, but they think it would be a win in software as well. I don't know if they have it in software with their new x86_64 VM.


C#'s stackalloc might be useful for some micro-optimizations, but I think its main purpose is for interop with unmanaged code.


That hadn't occurred to me.

In the late 90s, I cowrote OpenGL bindings for Java, which became the model for jogl and so forth. Whereas other bindings using JNI duplicated the native/C structs, I just carried around the native pointer as a long and used it as an "opaque" pointer. Much less copying (thunking) back and forth.


Rust already does region inference for managing borrowed pointers, so I guess that all the information needed for this is already there.


If you use a Cheney GC, how much does escape analysis help?

Everything that would be stack allocated will die in the first generation, and Cheney's algorithm is indifferent to the amount of garbage.


If nothing else, more heap allocations will cause the collector to run more often.


Absolutely; I was just wondering if anyone knew how big of an impact this would be; all the VMs I've used have tiny stacks compared to the size of the heap, so it is academic for many of the cases where I would want this.


In every C++ project I've worked on, the vast majority of allocations go immediately into a shared_ptr. This article seems to assert that most C and C++ programs stick to malloc/free or auto_ptr-style semantics. This seems to be a contradiction, so I'm confused. I can see it being true for C, but definitely not for C++.

Am I misunderstanding the thrust of this article?

Edit: deleted comment about cycles, since they are discussed a bit at the end.


shared_ptr used to be commonly accepted as a reasonable default choice, but that hasn't been the case for years. unique_ptr/scoped_ptr is nontrivially faster (thread-safe reference counting is fairly expensive), and much less error prone. These days the usual advice is to only use shared_ptr if you absolutely need it.


I'm sure there are many places where other pointer types would be better, but, like I said, I only tend to see shared pointers, usually typecast to something like FooPtr and used indiscriminately. It's an uphill battle to even use something like a pointer to const.

I've never seen refcounting overhead show up in callgrind, so I think the choice to uniformly use the most general version is OK.


But the insidious thing is they'll be inlined all over the place and may not show up in callgrind. The atomic operations will contribute to cache lock contention inside the processor. Of course it all depends on how often you perform operations on the shared_ptrs. Use them only as handles to large components and keep them out of inner loops and you'll be fine.


I suspect he would say that most of the C++ programs you mention don't really need to use shared_ptr, and would be better off with something more like unique_ptr.


Single transferred-ownership pointers are not going away.

Reference counted pointers are not going away.

I don't care if Java programmers don't know the difference between the stack and the heap, please keep the @ syntax to save me from having to type

    void f(std::shared_ptr<my_namespace::my_object_type> const & p)
another 50,000 times in my career.


Reference counted pointers may very well be going away. See comments below. TL;DR they totally suck in multi-threaded settings.


How many times have you seen a large application grind to a halt or become unusably slow due to reference counting overhead?

Perhaps it happens sometimes, but compare that to leak-it-all-as-garbage and scan-all-the-address-space style collection.


I'm really excited that Rust has decided to position itself to be usable at a C-level. Not only will I be able to write a kernel module with strong memory safety guarantees (unless I need otherwise), I'll have access to basic data structures like strings and hashmaps.


Here's a minimal Linux kernel module in Rust, by the way: https://github.com/tsgates/rust.ko


"This could be fixed by switching to keywords, which I prefer for this reason"

~ could be named unique_ptr and @ something like shared_ptr. Genius!


Or maybe we could allow types or type annotations to be written using only non-alphabetic characters and then ~ could be ~ and @ something like @ and my fingers won't break from typing. I do prefer having sigils for types that enrich other types to make them obviously different from types that are parametrized by another. Compare ~string to list<string> - the first is a uniquely owned string, the second is a list of string. ~ is an adjective and list is a noun. ~string is a string with such-and-such memory semantics; list<string> is a list that happens to contain strings.


One of the best bits of Rust is the low syntactic noise of ~T instead of unique_ptr<T>, since it's so common. (And similarly with @, which is less common, except in the compiler itself, where it is used all the time.)

Maybe I'm exaggerating a bit, but it really seems to work well after using it for a little.


I agree with you on that. It's a tad intimidating at first, but if you go with the flow you find it really cuts down on the noise. I'm surprised so many folks are arguing the opposite, to be honest.


I felt opposed to ~ and @ first, before I got really exposed to using them myself again and again.


Me too. I used to refer to them as Perlisms in Rust, but nowadays I don't care any longer.


..or given that they like 1970s style abbreviated identifiers so much: "upt" and "spt"

Looking forward to writing: "let mut foo = upt .."


I don't really care for excessive use of abbreviation myself. I would prefer "heap" and "Gc".


If you like really short names, Android's RefBase.h declares `class sp` for "strong pointers" (like std::shared_ptr) and `class wp` for "weak pointers" (like std::weak_ptr).


I look forward to the proposed solution. Because while I understand the arguments about simplifying the language and allowing for external garbage collectors, I am not sure I am convinced that the GC type will be as simple to work with as the current built-in GC'd type. Sometimes you want to have GC when coding, and as it is currently implemented in Rust it is easily accessible and simple to understand, with readable code.


The article mentions that Rust is at least partially designed for low-level applications such as writing kernels. If garbage collection is shifted into the standard library, would this allow Rust to be used on real-time embedded systems such as the newer ARM microcontrollers?


At least for ARM Cortex-M3 and NXP LPC2000 microcontrollers you do have GC-enabled languages available (Oberon)

http://www.astrobe.com/default.htm

But having Rust as option is also great.


I'm completely in favor of this. Garbage collection has no place in the core definition of any language that targets OS kernels and other high-performance applications. It's sad how we have so many languages that fix many of C++'s problems (e.g. D and Go), but not without the addition of a garbage collector.


With regards to OS kernels you may be right (though not for performance reasons), but some high-performance applications would be very hard to write without a GC, especially if they're to take advantage of multicore. See my explanation above about scalable, concurrent data structures. Even hard real-time systems can benefit from a GC, and Java has a few GC implementations for hard real-time apps.

FYI, object allocation and de-allocation for short-lived objects is much faster with a good GC than with malloc/free, and even for long-lived objects GC gives higher throughput than malloc/free. The problem with GC is latency issues (pauses) when maintaining very large heaps with many long-lived objects.



