Linus Torvalds on Garbage Collection (2002) (gcc.gnu.org)
279 points by AndrewDucker on April 22, 2011 | 204 comments



Shortly before Linus wrote this article in 2002, I wrote an XML-RPC library in C that used reference counting. By the time I was done, I'd written 7,000+ lines of extremely paranoid C code, and probably eliminated all the memory leaks. The project cost my client ~$5K.

The standard Python xmlrpc library was less than 800 lines of code, and it was probably written in a day or two.

Was my library about 50 times faster? Sure, I could parse 1,500+ XML-RPC requests/second. Did anybody actually benefit from this speed? Probably not.

But the real problem is even bigger: Virtually every reference-counting codebase I've ever seen was full of bugs and memory leaks, especially in the error-handling code. I don't think more than 5% of programmers are disciplined enough to get it right.

If I'm paying for the code, I'll prefer GC almost every time. I value correctness and low costs, and only worry about performance when there's a clear business need.


I guess Linus also made another suggestion, which he didn't pursue, in this paragraph:

Generational garbage collectors tend to never re-use hot objects, and often do the copying between generations, making things even worse on the cache

The suggestion? Just reuse hot objects.

Note, this post is from 2002. Garbage collectors have improved a lot over the last 9 years. Don't evaluate new technology with such old arguments.


Linus was quite specific about his problems with GCs. Just saying that "garbage collectors have improved" without specifics, and calling his arguments "old" doesn't add much information.

Can you tell us how they have improved in ways that address Linus's specific concerns? And explain how a generational garbage collector can effectively re-use freshly freed objects before they leave the cache? Which 2011 generational GCs take advantage of that idea?


Reusing hot memory is a pretty ancient GC technique, though the details have varied over the decades.

You can Google "nursery generation" for some examples:

http://www.google.com/search?q=gc+nursery+generation

The original theory was that most objects in a functional language are extremely short-lived, and you want to reclaim them quickly. But if you GC your nursery generation after every few kB of allocations, you'll also keep the memory in L1 and L2 cache.
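
To make that concrete, here's a cartoon sketch of nursery bump allocation (the size and names here are mine, not taken from any particular collector): allocation is just a pointer increment, and because the nursery is small, the same few hundred kilobytes get reused - and stay cache-warm - after every minor collection.

    NURSERY_SIZE = 256 * 1024  # small enough to live in L2 cache

    class Nursery(object):
        def __init__(self):
            self.top = 0  # bump pointer

        def alloc(self, nbytes, collect_minor):
            if self.top + nbytes > NURSERY_SIZE:
                collect_minor()   # evacuate survivors to the older generation
                self.top = 0      # then start over in the same (hot) memory
            addr = self.top
            self.top += nbytes
            return addr

    nursery = Nursery()
    print(nursery.alloc(64, collect_minor=lambda: None))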

The ideal size of a nursery generation (and the value of reusing hot memory) should be measured empirically.

Of course, most popular scripting language GCs are non-copying, reference-counting collectors, which means they ignore techniques that were well-understood back in the 80s. However, this has not been a barrier to the adoption of Python, Ruby, etc., because the business value of high-performance code is often minimal.


Yes I can:

- [1, 2] The pauseless collector also has the highly desirable side-benefit of producing improved memory locality. This happens because the algorithm works by attempting to relocate groups of tightly-coupled objects into regions of adjacent memory, providing excellent paging and cache locality properties for those objects. This can also benefit highly multithreaded applications that operate on distinct sets of object data.

If you read the whitepaper on Azul[3] you will get a glimpse of how things got a lot more involved. And all of this gets a lot more complicated when programming for multiple cores[4].

But not everybody is using the Erlang VM or Sun's JVM (HotSpot). The point is generic: don't assume how things work on your platform based on how other platforms behave. Don't use old data to guide your search, no matter how brilliant/famous the author is.

[1]http://java.sun.com/products/hotspot/docs/whitepaper/Java_Ho...

[2]http://java.sun.com/docs/books/performance/1st_edition/html/...

[3]http://www.infoq.com/articles/azul_gc_in_detail

[4]http://www.javacodegeeks.com/2011/04/erlang-vs-java-memory-a...


You can bypass the GC entirely, and keep the objects that really should be in cache live, and just reuse them. Not a pretty solution, but it's doable.


Linus suggested in the OP that automated reference counting (where the language implementation handles reference counting for the programmer) is preferable to GC.

Python uses such reference counting (although it also has a GC fallback to ensure cyclic structures are collected).
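
A tiny demonstration of the two mechanisms (illustrative only): refcounting frees acyclic objects as soon as the last reference disappears, while the gc module's cycle collector picks up the loops that refcounting alone can never free.

    import gc

    class Box(object):
        pass

    a = Box()
    b = Box()
    a.other, b.other = b, a   # a reference cycle
    del a, b                  # the refcounts never reach zero on their own...
    print(gc.collect() >= 2)  # ...but the cycle collector reclaims them: True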


And it is one of the reasons why Python is slow and difficult to scale on multiple cores (the main difficulty by far of removing the GIL is reference counting).


Scaling Python on multiple cores is easy: you just run multiple Python processes, one per core, each with its own GIL.
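
For example, something like this (a minimal sketch; the worker function is a stand-in of my own, not anything from the thread) spreads CPU-bound work across cores with no GIL contention at all:

    import multiprocessing

    def parse_request(payload):
        # stand-in for a CPU-bound job such as parsing an XML-RPC request
        return sum(ord(c) for c in payload)

    if __name__ == "__main__":
        payloads = ["<methodCall>...</methodCall>"] * 10000
        pool = multiprocessing.Pool()   # one process (and one GIL) per core
        results = pool.map(parse_request, payloads)
        pool.close()
        pool.join()
        print(len(results))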

You are correct, though, that ref-counting overhead is one of the main reasons Python is slow.


Is the ref counting slow because it has to be a protected operation, or is it just the sheer number of operations? If it is due to being protected by locks, there are lock-free ref counting systems (as described here: http://www.google.com/url?sa=t&source=web&cd=1&v...) that essentially allow batched refcounting operations to accumulate in a per-thread counter and only get reconciled occasionally. Assuming you design your data structures correctly (no false sharing), you can get very high write performance because all your counters are cached and there is no cache invalidation due to competing writes on other processors. Of course it takes more memory, so it is only suitable for highly shared objects, but it can make a significant difference in a reference counting system. (The article points to scalability issues in the Linux ref counting system and how to resolve them.)
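
If I understand the idea, it looks roughly like this sketch (the class and names are mine, not the paper's): each thread accumulates refcount deltas locally and only occasionally folds them into the shared count, so the contended operation becomes rare.

    import threading

    class BatchedRefcount(object):
        def __init__(self, threshold=64):
            self.global_count = 0
            self.lock = threading.Lock()
            self.local = threading.local()   # per-thread delta
            self.threshold = threshold

        def _adjust(self, delta):
            pending = getattr(self.local, "pending", 0) + delta
            self.local.pending = pending
            if abs(pending) >= self.threshold:
                with self.lock:              # the only contended operation
                    self.global_count += pending
                self.local.pending = 0

        def incref(self):
            self._adjust(1)

        def decref(self):
            self._adjust(-1)

    # (deciding when the reconciled count really hits zero, and freeing the
    # object, is the part this sketch leaves out)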


That's awesome! I knew there were faster approaches to ref-counting, but I didn't know about that one, "sloppy counters".

The most common approach to speeding up ref-counting is to ref-count bigger objects — modules rather than individual variables, say.

The simplest way to speed up ref-counting transparently is to statically analyze the code and remove redundant increment and decrement operations. This can be tricky in practice, and I haven't heard of anyone actually doing it.


Just for the record: Pramod Joisha did static analysis of redundant reference counting operations in 2006 in the Bartok C# research compiler (http://www.hpl.hp.com/personal/Pramod_Joisha/Publications/is...). IIRC, doing it improves performance significantly.


It seems to me that the best way to do this would be to put the analysis into a tracing JIT. Since the compiler knows exactly what will happen in a long code sequence, removing redundant incs/decs would be fairly trivial.


The problem with most methods to improve ref counting speed is that it generally breaks existing C extensions. For that reason alone, I don't expect cpython to significantly change its way of doing things in that area for a long time.


The drawback with multiple processes is that they each have all the compiled bytecode on their heap. When I "import nltk" my heap goes from ~12 to ~36 MB. Add a few more dependencies and you end up wasting a non-trivial amount of RAM on python heaps.


You could do the import and then fork the child processes. That should share all the memory used by the nltk module among all the children (for as long as the memory is not modified by anyone, which triggers a copy-on-write allocation of the affected pages).
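
Something along these lines (a Unix-only sketch; it assumes nltk is installed):

    import os
    import nltk   # the heavy import happens once, in the parent

    def worker(worker_id):
        # children inherit the already-imported module via copy-on-write pages
        print("worker %d sees %d names" % (worker_id, len(dir(nltk))))

    children = []
    for i in range(4):
        pid = os.fork()
        if pid == 0:          # in the child
            worker(i)
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)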


Most forks do copy-on-write, and every access to a Python object meddles with the reference count, which is - you guessed it - a write.


The context of the copy-on-write win was the bytecode for modules. I don't know that you'd have anybody meddling with the reference counts for that...but I haven't really looked.


Bytecode is stored in function objects which IIRC are reference-counted.


Aside from ad-hoc classes (defined within the scope of a function, say), does it ever make sense to garbage-collect a module or a class?

Say you do "import smtplib" in your main file. Now that module is imported -- forever. I don't know the internals of Python well enough, but I bet that the module reference has strong references to its contents, so that even if nobody is actually calling anything in smtplib, it will be there in case someone does. The same should be true about modules importing other modules; they stay visible at the scope-level, so they are permanently loaded.

So for those cases it would make sense to keep them separate from global garbage collection. I'm pretty sure that the method tables of all the classes in the system take up considerable space, probably in the order of megabytes for many apps.


That's a valid point, but you can do things to mitigate this, like splitting up your program into processes that have different dependencies. For instance, if you have a UI process and a core logic process, you won't need to import your giant UI library twice.


Or you could just use a language whose implementation doesn't blow.


And suffer from 5x slower development time. Yeah your deadline now has to be extended from 2012 to 2015 but at least the language is FAST and is technically excellent, right?


False dichotomy...


If you split the code horizontally instead of vertically you have to deal with IPC, which could be problematic (and slow). I would do that only if there was already a clearly defined interface between the modules, not just for the sake of concurrency. Moreover, taking into account growing number of cores in modern machines such strategy doesn't seem very future-proof.


Thank you for making this point. To my great dismay I see more and more people think that you need concurrent threads in one process in order to scale. It seems as if many programmers believe this as scripture and subsequently have difficulty thinking about distributed systems (not just multi-core, but multi-host).


This is a bit like saying, "Of course C has full type safety. Just use a Haskell implementation written in C." Both your statement and his are technically accurate, but yours is on a subtly different topic.


How is it a bit like that? He complained that Python didn't scale to lots of cores because of the GIL; I said his Python didn't scale because of threading, which interacts badly with the GIL. If you use a shared-nothing approach to parallelizing your Python code, you don't run into that problem. It's not as if shared-state threading happens magically — you have to rewrite your code to use it, too.


You are playing with words here: of course you can run multiple python instances to scale your application on multiple-cores, that's a trivial statement. But I was talking about python the interpreter (more exactly cpython). There are legitimate cases where multi-threading is the natural, elegant solution, and cpython, mostly because of reference counting, prevents that.


CPython prevents you from using shared-state multithreading to scale your application on multiple cores, but it doesn't prevent you from scaling your application on multiple cores. That's not "playing with words".

It is indeed unfortunate when the limitations of our platforms force us to contort our code to improve performance, but that is just as true of multithreading as of multi-process programming. The difference between the complexity of the two is small.


Python doesn't do code analysis to determine when it's effectively doing obj.refs++; obj.refs-- repeatedly.

This sort of analysis is useful, and if the interpreter had been designed to do any optimization along with jitting, would probably come nearly for free. Reference counting could be far far cheaper than it is in python. (How much cheaper? I don't know - it'd need work to figure it out)
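
For a feel of where those operations come from in CPython's classic eval loop (each LOAD of a local increments the object's refcount and the consuming instruction decrements it again), you can look at the bytecode; this is just an illustration of the pattern, not of the proposed optimization:

    import dis

    def f(x):
        # in the classic eval loop, each LOAD_FAST increfs x and the
        # add decrefs both operands again
        return x + x

    dis.dis(f)   # shows the two loads of x, the add, and the return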


Recently, there has been some work on removing redundant reference count operations in the Python interpreter. The following paper describes how it can be done: http://portal.acm.org/citation.cfm?id=1869631.1869633.

Regarding the performance impact of reference counting, the following facts are important:

- Switching from immediate reference counting to deferred reference counting (L.P. Deutsch and D.G. Bobrow, 1976 [1]) eliminates about 90% of all reference count operations in Smalltalk (Berkeley Smalltalk '82, that is) [2]

- A very good account of reference counting can be found in Dave Ungar's excellent PhD thesis [3] and in Dave Ungar and Dave Patterson's in-depth analysis of Smalltalk performance [4].

[1] An efficient, incremental, automatic garbage collector (http://www.cs.umass.edu/~emery/classes/cmpsci691s-fall2004/p...)

[2] High performance storage reclamation in an object-based memory system (http://techreports.lib.berkeley.edu/accessPages/CSD-84-167.h...)

[3] The Design and Evaluation of A High Performance Smalltalk System (http://www.eecs.berkeley.edu/Pubs/TechRpts/1986/5376.html)

[4] Berkeley Smalltalk: Who knows where the time goes? (Chapter 11 of http://www.iam.unibe.ch/~ducasse/FreeBooks/BitsOfHistory/)


This sort of analysis does not come easily. Say you have:

    def bar():
        return some_constructor()

    def foo():
        b = bar()
Here the decrement is in bar() and the increment is in foo(). You have no way to elide the operation without doing inter-procedural analysis, which is hard.


You don't need to catch all the cases to see a big improvement. Just the most common ones.


The problem is that in a language like Python, where everything is broken up into little functions, you're not going to catch the most common cases without doing inter-procedural analysis.


C and Java apps with high contention don't scale well onto multiple cores, either.

The key to speed is to not share state, which Python can do fine. It's called fork.


Sometimes you cannot easily not share state. The fact that you can use multiple processes to scale is not an interesting statement when comparing python to other languages because it is true for every language out there.

I think the GIL has made people in the Python community too defensive: the GIL does not prevent you from building scalable architectures in many cases, but it still sucks, and it would be better without it. That's a limitation (and a tradeoff, because it made development and integration with C easier). And there are scalable architectures based on threads (example: http://www.mailinator.com/tymaPaulMultithreaded.pdf) - "threads suck" has become a meme, which slightly bothers me in general. Not a panacea, but a good solution when applicable.


I thought the key was not managing mutable state. Sharing lots of immutable data can probably save a lot of performance compared to IPC.


~$5k for 7 KLOC of bug free C code is a steal, that's impossible to do in less than a couple of months


It's nearly impossible to do in any timeframe. I can think of perhaps two examples in human history where I think it's been done: qmail and seL4. And there may still be bugs in qmail. There may be a few other non-public projects that have achieved less than one bug per 7000 lines of C, but probably not more than one or two.


http://www.dt.e-technik.uni-dortmund.de/~ma/qmail-bugs

There are also some DNS-related bugs that are not on this list.


Thanks! I hadn't seen those, although I knew of 1.4. I think 1.1, 1.3, and 1.4 are actual bugs, if the reports are accurate; I'm pretty sure Dan disagreed with Wietse about 1.1. Three bugs in about 15000 lines of code doesn't quite rise to the level of less than one bug per 7000 lines of code, so maybe that's only been done once, in seL4.

What are the DNS-related bugs?


Beg to differ on both points.

Ignoring the 7 KLOC metric (which is fairly useless in estimating the project's value for the customer), $5K at a $100/h rate for a senior C programmer on contract works out to about 6 working days, which is a reasonable timeframe for writing a well-optimized XML-RPC protocol parser. In other words it's neither a steal nor does it take two months to write.

(edit) YMMV by a programmer, but that's what it would've taken me to write it (though I have written a good deal of protocol parsers).


1+KLOCs a day of C? WTF?


Compared to other languages, C is pretty straightforward. You also have to write more of it to do less. I switched back to C after 12 years of using web scripting languages, and am surprised at how much more code I churn out per day.


Do you now use C for web apps? Like parsing GET, POST, DB, Cookies?


It does not need to be 1 KLOC/day :) The complexity of an XML-RPC parser is about a week of work for polished code. Go from there, not from the given 7 KLOC of some implementation.


Please don't take this personally, but the fact that your code was N thousand lines doesn't mean the task is inherently an N-thousand line task.

Also, undertaking anything with XML in C is brave, kudos.


If you know of any way to make the code shorter and more concise, please let me know. I'm always looking for ways to improve.

https://gist.github.com/937200 (two sample files, totaling slightly less than half the project)

Posting 10-year-old code to HN is always a little embarrassing, but please feel free to be brutal. I'll live. :-) I was definitely way too in love with the preprocessor back then.

Here's the corresponding Python code, from shortly after it hit 1,000 lines in late 2001:

http://hg.python.org/cpython-fullhistory/file/7b2d701ec404/L...

I like the Python code rather better, if only because there's so much less explicit memory management. The actual ideas stand out clearly, and there's just so much less code to read. In most cases, I'd happily sacrifice the performance.


Was my library about 50 times faster? Sure, I could parse 1,500+ XML-RPC requests/second. Did anybody actually benefit from this speed? Probably not.

Then your (client's) problem wasn't reference counting but premature optimization.

Are there situations where you'd like to have code run fifty times faster than native Python? You bet there are, lots and lots of them - for example, in a Unix-clone kernel. Sorry if somehow you didn't find one of them.

Hopefully you used Valgrind and a formal language specification to reduce the work required.

And to avoid premature optimization, use gprof to find the bottleneck(s) rather than just diving into what seems to need optimization.


You're mixing up 'premature optimisation' and 'unnecessary optimisation'. The first is making code faster that isn't the bottleneck, or dominating performance factor. The second is making something faster than it needs to be. Profiling helps avoid the first, benchmarking helps avoid the second. Both require working (toy) systems, which is hard when you are evaluating which language to begin work in.

Writing a library in C (for reasons of performance) where performance would have been 'good enough' in Python is unnecessary optimisation. Writing your own GC implementation layer in Python because you think that's the bottleneck would be premature optimisation.

EDIT: typo


Seems like a rather small distinction. Unnecessary optimization is a natural result of premature optimization so I tended to assume that was what was happening. Unnecessary optimization can result from other things as well of course.

Regardless of this distinction, I think the same conclusions follow; the original problem had little to do with reference counting as such.


Actually, getting reference counting right and mostly bug-free is much easier if you use C++ and do the counting via objects on the stack (RAII).


Absolutely not, http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1450.ht... explains why in section D. Implementation difficulty


Actually, yes. The boost::smart_ptr<...> is highly effective. Prior to that existing, I wrote a similar ref-counting library that has been deployed to hundreds of remote sites, running high-reliability industrial control code.

While complex, the ref-counting implementation gave me something that a GC system just can't deliver:

Determinism.

This system is running on a 300MHz embedded x86 hosts, in a solid-state industrial computer, and has to provide sub-millisecond timing with only occasional hiccups >1ms acceptable. I am unaware of a GC system that could deliver this.

These boxes run for years, with no memory leaks. The cost was, of course, ensuring that no "loops" of objects get created. This was a small price to pay for the benefits. Having written large systems using both GC and ref-counting, I still lean toward ref-counting, even if determinism isn't a strong requirement.

Of course, I'm one of those guys who loves C++, so take my opinions with a grain of salt...


IBM's Metronome[1] advertises maximum pause times in the hundreds of microseconds range. Cliff Click[2] cites a 10MB max heap for it so definitely not a cure-all but definitely something I'd like to play around with.

[1]http://domino.watson.ibm.com/comm/research_projects.nsf/page... [2]http://www.youtube.com/watch?v=uL2D3qzHtqY


Reference counting is not deterministic. Whenever a refcount is lowered to 0, you don't know how many subsequent references exist further up the chain that will also be lowered to 0. Lowering a refcount to 0 can result in an arbitrary amount of code to be executed depending on the references.

If you mean that reference counting, on average, tends to result in fewer random/unpredictable pauses than most GCs, then I agree.


I guess on that scale and probability, no code is deterministic.


Reference counting is slower* than other forms of GC. My guess why it's fairly fast in C++ has a lot to do with how much you can avoid it by allocating things on the stack (of course you can do escape analysis, but most GC'd languages don't have explicit stack allocation of non-basic types).

*with a lot of caveats


I, too, love C++ and the boost smart pointer library.

In fact, it's been adopted into modern versions of the C++ standard itself. std::tr1::shared_ptr and std::shared_ptr.


You are probably right, but your experience does not disprove the central point that GC always slows things down. Now this slow down may be worth it in most situations, in fact it may be worth it in 99% of situations, but it is a fact.

In some cases that 50 times increase in speed may be important. In some cases, if something is 50 times slower it would be simply unusable. Linus happens to be working on one of those special cases, so he is quite correct in pointing out that garbage collection will always slow things down.


Context is always important. And there's no such thing as blanket advice that applies to every circumstance. Software development is a hugely varied endeavor, ranging over more orders of magnitude and more variety of application than almost any other industry. It stands to reason that most practices will vary rather considerably depending on context.

Linus writes from a systems perspective. He cares about every nanosecond and every byte. And that's appropriate and even laudable in the context of the linux kernel and core applications. But it's not always appropriate for every situation.


I've used your brilliant XML-RPC in one of my consulting assignments. Thanks.


Agreed.


Reference counting is GC; a poor form if it's the only thing you rely on, but it is automatic memory management all the same.

Generational GC will frequently use the (L2/L3) cache size itself as its smallest generation, meaning it shouldn't suffer from the pathologies talked about by Linus here.

What GC really gives you, though, is the freedom to write code in a functional and referentially transparent way. Writing functions that return potentially shared, or potentially newly allocated, blobs of memory is painful in a manual memory management environment, because every function call becomes a resource management problem. You can't even freely chain multiple invocations (y = f(g(h(x)))) because, what if there's a problem with g? How do you then free the return value of h? How do you cheaply and easily memoize a function without GC, where the function returns a value that must be allocated on the heap, but might be shared?
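
For what it's worth, in a GC'd language this really is a one-liner; nobody has to decide who frees the cached, shared results. A small illustration using Python 3.2's functools.lru_cache (not from the thread):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def parse(expr):
        # returns a freshly built, heap-allocated structure -- or a cached,
        # shared one; the collector reclaims it when the last user lets go
        return tuple(expr.split())

    a = parse("x + y")
    b = parse("x + y")
    assert a is b   # the same shared object, with nobody tracking ownership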

Writing code that leans towards expressions rather than statements, functions rather than procedures, immutability rather than mutability, referentially transparent rather than side-effecting and stateful, gives you big advantages. You can compose your code more easily and freely. You can express the intent of the code more directly, letting you optimize at the algorithm level, while the ease of memoization lets you trade space for speed without significantly impacting the rest of your program. Doing this without GC is very awkward.

GC, used wisely, is the key to maintainable programs that run quickly. You can write maintainable yet less efficient programs, or highly efficient yet less maintainable programs, easily enough in its absence; but its presence frees up a third way.


This is the essential point. GC may not be appropriate in all contexts, but, when it is, it tremendously improves the expressiveness of the language.


GC code runs differently depending on the code running around it. It causes the code to be non deterministic, and introduces a side effect.

I've written code in a reference-counted language (Python) which processes about a gigabyte of data per second from the network, with hard real-time requirements - all on one machine with multiple CPUs/cores. The code is fully unit tested, doc tested, and functionally tested. It's also short, runs on multiple platforms and has been maintained by other people than myself. My personal experience is that you can write highly efficient, maintainable code with reference counting.

Reference counting manages memory automatically for you, but it also lets you manage memory manually when needed. For many situations, it's the best of both worlds.


Did you compare the performance of Python with that of other languages that do use GC (e.g: Haskell)?


> A GC system with explicitly visible reference counts (and immediate freeing) with language support to make it easier to get the refcounts right [...]

To be a little pedantic on the subject, such a system (reference counting and immediate freeing) is a form of automatic memory management, but it is not GC in any way. Garbage collection implies that the system leaves garbage around, which needs to be collected in some way or another. The usual approach to refcounting releases resources as soon as they are no longer required (either by free()ing immediately or by sending them to a pool of unused resources), so it doesn't leave garbage around and doesn't need a collector thread or mechanism.

There are partial-GC implementations of refcounting, either because items are not free()d when they reach zero references, or to automatically detect reference loops which are not handled directly.

I agree with Torvalds on this matter. GC as it is promoted today is a giant step that gives programmers one benefit, solving one problem, while introducing an immeasurable pile of complexity to the system, creating another pile of problems that are still not fixed today. And to fix some of these problems (like speed) you have to introduce more complexity.

This is my problem with GC. I like simplicity. Simplicity tends to perform well, and being simple also means it has little space for problems. Refcounting is simple and elegant; you just have to take care of reference loops, which also have a simple solution: weak references. I can teach a class of CS students everything they need to know to design a refcounting resource management system in one lesson.
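
A minimal sketch of that escape hatch using Python's weakref module (the names are mine): the back-pointer is weak, so the parent/child pair never forms a loop that refcounting can't free.

    import weakref

    class Node(object):
        def __init__(self, parent=None):
            # the back-pointer is weak, so parent <-> child is not a ref loop
            self.parent = weakref.ref(parent) if parent is not None else None
            self.children = []

    root = Node()
    child = Node(parent=root)
    root.children.append(child)

    del root                # the refcount hits zero and the root is freed now
    print(child.parent())   # the dangling weak reference answers None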

GC is the opposite: it is big, complex, and a problem that the more you try to fix it, the more complex it becomes. The original idea is simple, but nobody uses the original idea because it performs so badly. To teach the same class how to design a GC system that performs as well as we expect today, an entire semester may not be enough.


I agree with Torvalds on this matter.

In a way, I do as well.

GC as it is promoted today is a giant step that gives programmers one benefit, solving one problem, while introducing an immeasurable pile of complexity to the system, creating another pile of problems that are still not fixed today. And to fix some of these problems (like speed) you have to introduce more complexity.

There are plenty of contexts where speed is a non-issue. In those cases, GC has been a huge win. The conceptual simplicity is the important part. The cost of the resources that would be saved with explicit and optimized memory management would be far outweighed by the resources required to implement such things.

The original idea is simple, but nobody uses the original idea because it performs so badly.

This is simply not true.

In the context of IO-bound enterprise systems, I've seen generational GC perform admirably, almost magically. As a lark, I've put infinite loops into such apps that do nothing but allocate new objects, and unless you are doing an exceptionally intense operation, you couldn't tell the difference. Properly tuned generational GC can be a truly fantastic seeming thing!

However, I will agree that the concerns Linus highlights are real, and that refcounting systems, like the one in iOS are by far better choices in many contexts.

EDIT: The above system I victimized, I only victimized in the TEST environment, but it was populated with something like 2-week old production data. The application in question is a traditional client/server desktop app used by a major energy company and had 800 active users at the time, handling millions in transactions every minute.

IDEA: If someone had an augmented ref-counting system with a runtime containing an optional cycle-detector and something like LINT but for the runtime reference graph, one would get most of the benefits of GC with the efficiency of the ref-counting system. I half expect someone to tell me that this already exists for Python.


In 2004 Bacon et al. showed that tracing GC and ref-counting GC are special cases of a more general framework, and that it is possible to combine them to get the benefits of both. See "A Unified Theory of Garbage Collection" here:

http://www.research.ibm.com/people/d/dfb/publications.html

Scholar cluster:

http://scholar.google.com/scholar?q=bacon+unified+theory+gar...

Incidentally, it is not even remotely true that reference counting is more efficient than tracing GC in all cases.


> I've seen generational GC perform admirably, almost magically. As a lark, I've put infinite loops into such apps that do nothing but allocate new objects

While I agree that generational GC can perform spectacularly well, what you're describing is close to the case it's optimized for, not close to its worst case. The worst case is that you allocate lots and lots of small objects and then write a pointer to all of them into a tenured garbage object.

> I half expect someone to tell me that this already exists for Python.

Yes, that's how Python works, except that I don't know what you mean by "something like LINT but for the runtime reference graph."


While I agree that generational GC can perform spectacularly well, what you're describing is close to the case it's optimized for, not close to its worst case

Here's the thing: Most of the rest of the app was rather close to the case it's optimized for.

I don't know what you mean by "something like LINT but for the runtime reference graph."

Something that tells you that you've created a reference graph with a cycle, you have a memory leak, or that you're using references in some other stupid or suboptimal way. I'm not even sure if there's a way to automatically detect anything like the last category, though the first two are certainly detectable. Basically, you take most of the infrastructure of GC, and you just turn it into a runtime advisor to warn devs and testers of mistakes.


Great Circle commercialized the Boehm collector back in the 1990s, and I seem to recall that most of their customers were using it to tell them when they had a memory leak or reused freed memory, not to remove the need for reference counting altogether. But it didn't tell you about cyclic references, unless they resulted in a memory leak.


> IDEA: If someone had an augmented ref-counting system with a runtime containing an optional cycle-detector and something like LINT but for the runtime reference graph, one would get most of the benefits of GC with the efficiency of the ref-counting system. I half expect someone to tell me that this already exists for Python.

Or you could do what Rust <https://github.com/graydon/rust/wiki/Language-FAQ> does, and have different types of objects: stateful and stateless. Stateless objects cannot have cycles (since you can't construct them in stateless objects, unless you're in a lazy language), and so reference counting can be used for stateless objects. Stateful objects, which can have cycles, are managed by the garbage collector.

Rust sounds like a really interesting idea, and I'd love to give it a try, but sadly it's still in the "some assembly required" stage as they work on bootstrapping it so the only real projects that are worthwhile to do with it yet are helping out with the bootstrapping process.


> This is my problem with GC. I like simplicity. Simplicity tends to perform well, and being simple also means it has little space for problems. Refcounting is simple and elegant,

Reference counting isn't conceptually any more simple than garbage collection (you still have to indicate how the underlying alloc() and free() operations are implemented). Further, a proper high performance reference counting system isn't any simpler than a high performance GC. In a multithreaded system you have to keep the reference counts in sync without doing a slow atomic operation every time a pointer is passed around, and you need to back the ref counting system with a high performance malloc()/free() that can handle multithreaded allocation and is resistant to memory fragmentation.


For higher-level systems programming (ie not bare metal) I think the approach taken by the Rust language is interesting. Mutable data structures are thread-local and are GCed (GC happens independently per thread). Immutable data structures can be passed between threads, are reference counted and support RAII.

https://github.com/graydon/rust/wiki/Language-FAQ


This is my problem with GC. I like simplicity. Simplicity tends to perform well, and being simple also means it has little space for problems

If your use case is teaching how to implement a garbage collector vs. a refcounting system, it is certainly much simpler to implement refcounting. If you are a systems programmer/writing a VM or compiler and do your work at a bare-metal level, then manual memory management/refcounting is probably your best choice.

For everyone else (let's guess over 90% of the programming population), using high-level GC'd languages is significantly simpler and more productive.


Implementing a stop-the-world, 2-space copying GC, provided you have complete stack, register and type layout data, is trivial. Implementing reference counting is much less so, especially in the presence of threads. Many simple scenarios turn into problems of lock-free programming proportions - and that's just for verifying that memory safety is present, not that the user hasn't introduced bugs with their own threading cock-ups.
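
For the record, here is roughly how small a toy stop-the-world, two-space (Cheney-style) copying collector can be, with Python lists standing in for the heap and for the type/layout information (a conceptual sketch, not a real memory manager):

    def collect(roots, from_space):
        """Copy everything reachable from roots into a fresh to-space."""
        to_space = []
        forward = {}                      # from-space index -> to-space index

        def evacuate(idx):
            if idx in forward:            # already copied
                return forward[idx]
            new_idx = len(to_space)
            to_space.append({'fields': list(from_space[idx]['fields'])})
            forward[idx] = new_idx
            return new_idx

        new_roots = [evacuate(r) for r in roots]
        scan = 0                          # Cheney's queue is to-space itself
        while scan < len(to_space):
            obj = to_space[scan]
            obj['fields'] = [evacuate(f) for f in obj['fields']]
            scan += 1
        return new_roots, to_space

    # object 0 points to object 1; object 2 is garbage and is never copied
    heap = [{'fields': [1]}, {'fields': []}, {'fields': []}]
    roots, heap = collect([0], heap)
    print(len(heap))   # 2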

(Delphi uses reference counting for interfaces, strings and dynamic arrays, and I am aware of race bugs in strings in particular (which are copy on write); these bugs are hard to fix without murdering performance, yet in practice they are very rare on x86 memory model hardware. So they stay.)


I've implemented both in C++.

See http://svn.boost.org/svn/boost/trunk/boost/smart_ptr/shared_... as an example of reference counting. Add in some atomic increment/decrement primitives and that's literally all there is to implement in one header file.

But even the lightest weight GC collector I could make had a significant C++ implementation file with lots of nontrivial pointer operations and loops in it. Even without scanning the stack for root objects, passing them in manually, I still think there's some amount of non-portable code in there.

That said, I'm using my GC for new stuff when the objects aren't too temporary. We'll see how it works out.


Unless you modified an existing or wrote a new C++ compiler, no, you haven't really (don't forget what you're piggy-backing on when the compiler is invoking ctors and dtors for you too). Boost shared pointers are toys; they don't deal with copy-on-write semantics. C++ exception safety is also hilariously difficult to get right (this adds to the refcounting problem); the language is broken by design.


Unless you modified an existing or wrote a new C++ compiler, no, you haven't really

Not sure which thing you're saying "I haven't really". But yeah, at times I have experimented with code that replaced built-in C++-runtime-library functionality.

(don't forget what you're piggy-backing on when the compiler is invoking ctors and dtors for you too).

Ctors/dtors don't, by themselves, generally allocate heap objects.

Boost shared pointers are toys; they don't deal with copy-on-write semantics.

I don't think that CoW is an essential feature of every allocation tracking scheme. Still, if you declare shared_ptr<const T> you can copy when you need to.

The std::unique_ptr and "move" semantics in C++11 are filling in some of those gaps in the core language.

C++ exception safety is also hilariously difficult to get right (this adds to the refcounting problem)

Be honest - exceptional conditions in all forms of programming are "hilariously difficult". C programs typically handle them sporadically or dedicate 50% of the code bulk (evenly throughout the program) to handling such conditions. C++ exceptions are tools to help you put all that danger into a single facility which you can then point to and be horrified by. This is a great improvement.

the language is broken by design.

As someone else once replied to me, Yeah, if your definition of broken is awesome. :-)

It just depends on what you want out of your language. I like other languages too.


Of course you're leaving out the malloc()/free() implementation that the ref-counting scheme depends on that the GC doesn't.


Perhaps I was thinking that that's usually provided by the operating system "for free". Hard to imagine an actual nontrivial GC system not needing basic malloc/free at some point anyway. Come to think of it, I think I actually tried implementing something like that once.

Alternatively, you could implement a simple first-fit free-block-list scheme in a small amount of code. It might not look terribly different from something a GC might do anyway.

Still, I think it's a fair point.


And if Delphi had complete GC, I would probably still be using it for new projects.


Nothing in the definition of Garbage Collection says it must be a separate post processing step. Therefore reference counting is in fact a garbage collection mechanism. Just because it is an eager algorithm rather than the usual lazy one doesn't change what it does, collect garbage in the system.


Nothing in the definition of Garbage Collection says it must be a separate post processing step.

My impression (based on reading up on the state-of-the-art some years ago) was that an actual concurrent-with-normal-execution GC needed to resort to putting a write barrier in front of the useful thread of execution much of the time. This would cause a noticeable perf hit.

There were architectures designed with hardware support for this kind of thing, but chips without it left them in the dust. Whether or not that is an accident of history or for a reason is an interesting discussion.


Right, and garbage collection is just lazy memory management -- you can see it either way, so this whole point is rather pedantic.


I like simplicity. Simplicity tends to perform well, and being simple also means it has little space for problems.

Amen.

I saw an old interview with Chuck Moore a while ago in which he said: I like simplicity and efficiency. That struck me. How often do people put those two things together? We're conditioned to think of them as a tradeoff. But if you can have both, shouldn't we be trying hard for that? Which raises another question: what do you have to give up in exchange for both simplicity and efficiency?


It's because "simple" can be a measure of different things. C is a pretty simple and straightforward language, but it takes 10 times as much code to do anything, which could be said to be quite a bit more complex.


In my experience, it takes an incredible amount of effort and experience to solve a complex problem with simplicity and efficiency. It's well worth it, but there's definitely a cost, and not everyone is capable of it.


That's been my experience too. Part of this difficulty, though, is that it requires going against our training and culture. This raises the question of how much easier it might get with different training and culture.

That's one thing that's so intriguing about Moore. He's a living specimen of an alternate computing history. I sometimes wonder what would have happened if he had been in, say, Backus's position at the dawn of high-level languages.


I tend to feel that the two often go together. After all, "the key to making a program fast is to make it do as little as possible".


> I like simplicity. Simplicity tends to perform well, and being simple also means it has little space for problems.

I like simplicity in my code, I am far less concerned about simplicity in the underlying libraries that I use.

GC makes my code simpler and more understandable. I use manual memory allocation when I absolutely need to, and GC languages when I can. In most cases, that means I use a GC language.


We should really encourage each other to put the date in the title when submitting old articles to HN. It's a total brainf*k to read through the entire article and not realize the context it was in, or to just glance at the title and assume the topic is a current one. Just saying.

[Edit] Not that I have a problem with older posts, btw.. I actually really like them most of the time. But the date would give everyone a better opportunity to evaluate whether they want to read the article, and would be reading it with reasonable context.


Is it just me, or does the title say (2002) to give you context?


That was added later (presumably in response to this comment).


  Date: Fri, 9 Aug 2002 20:28:16 -0700


Sorry if I wasn't clear. Of course I can click through and check the date. I'm just suggesting that since HN is generally a news site that the general expectation is that the articles posted will be current, and I think it'd be handy if we tried to call out dates in the title when posting old stuff. My original comment's relatively well upvoted now also, so I don't think I'm alone.


It's 2011, FFS. This kind of mindset is really self-defeating in the long term. Sure, hand optimizing is better. Having a gazillion lines of shit legacy code and technical debt to fix because you hand-optimized for the 90's, it's not so great. I'll keep my GC and sip a Mojito on the beach, while Linus keeps on fixing Linux's "optimizations" ten years from now.


The alternative to a GC is not "hand optimizing". There are several patterns such as reference counted memory and RAII that are far from complex to use.

If you've written any data-intensive application you know that a GC doesn't solve memory issues. It's just a different strategy.


It's a strategy _I_don't_need_to_concern_myself_with_ :-)


It's already been ten years since he wrote that.


Yeah - mods can we edit a date into the title please?


The email was written a decade ago.


So. Programs that use Garbage Collection tend to be slow.

Cause: Hardware doesn't like it.

Solution: fix the hardware?

Seriously, I'm afraid we're stuck in a local optimum here. It is as if machines are optimized for the two dominant C/C++ compilers out there, and we then have to optimize our programs against that, closing the loop. Shouldn't compilers and hardware be designed hand in hand?


I think you're referring to the following paragraph:

One fundamental fact on modern hardware is that data cache locality is good, and not being in the cache sucks. This is not likely to change.

However, this did not happen to please the "two dominant C/C++ compilers". The reason is much more fundamental:

One of the most expensive parts of the hardware is memory, and fast memory is a lot more expensive to produce than slower memory. So we have the choice between using the same (and thus slow) memory throughout the system, or combining different kinds of memory so that software has at least the _chance_ to run faster. This is a fundamental issue, and the only thing you can do is try to find the optimal share for each kind of memory.

But no matter how well you choose: software will only be able to exploit this if it is designed for locality.

If you can fix that (i.e. if you can find a cheap way to produce gigabytes of fast memory that makes caches obsolete) the current compilers won't stop you from exploiting it: the code that is optimized for locality will still run as fast, and the code that can't be optimized for locality will run orders of magnitude faster.

So we aren't in a local optimum at all. You can still optimize further "just" by producing faster and cheaper hardware.


There is not a fundamental conflict between GC and cache generally. The conflict is with how caches are implemented in most CPUs.

Consider a simple generational collector with a nursery of N cache lines. The allocator will bump allocate into LINE_0, LINE_1... until LINE_N, and then copy all survivors into the mature generation. When it starts allocating again, it will start at LINE_0. Ie: it will touch at least N cache lines before ever coming back to any given cache line. If N > the size of the cache, each allocation will involve a cache miss. Each time the allocator allocates into a cache line, it will have been evicted already, and will need to be brought into the cache before it is written to.

The key observation here is that you're dealing with an artificial W-W dependency, not a real R-W dependency. When the allocator allocates into a cache line, it doesn't care what the data already in that cache line is! The cache line doesn't need to be brought in from memory because it is garbage and will be overwritten completely.

There are lots of solutions to the problem. At every allocation the allocator could simply issue a prefetch for the next line, because it knows it will need it soon. Or, you could have an instruction that simply blows away a cache line, marking it cached+dirty in the coherence protocol without generating a main memory access.

In fact, I'm not sure that modern CPUs don't detect this condition in their write buffers already. Something like Bulldozer's write coalescing cache could certainly do this.

Aside from this issue, generational GC's are very cache friendly. Collection inherently compacts memory, and the order of collecting tends to bring objects onto the same pages and sometimes the same cache lines as the objects that refer to them.

As an aside, the other big cache-thrashing process, marking, can also be made cache friendly on modern architectures. If your CPU has SMT, you can run the marker in a concurrent thread. It can sit there and take advantage of the available memory parallelism that most programs leave on the table.


One of the most expensive parts of the hardware is memory, and fast memory is a lot more expensive to produce than slower memory. So we have the choice between using the same (and thus slow) memory throughout the system, or combining different kinds of memory so that software has at least the chance to run faster. This is a fundamental issue, and the only thing you can do is try to find the optimal share for each kind of memory.

Although all modern high-performance (edit: I mean non-embedded-microcontroller) computers work this way, it's not the only possible way. The Tera MTA takes a different, cacheless approach.

First, the problem with modern RAM in desktop machines is not that it sucks at bandwidth. You can get your bandwidth arbitrarily high by multibanking. Multibanking requires more buses or point-to-point links, but that's a tolerable cost.

The problem with modern RAM is that it sucks at latency, compared to what the CPU would like. Well, what do you do about latency? You make your requests earlier, and make sure you have other things to do in the meantime, when they get back. The Tera did this by having 128 sets of registers – 128 hardware threads — and switching to the next thread on every cycle. That means that, if all the thread slots were full, every thread only executed an instruction every 128 cycles, which is plenty of time to hide the latency of a slow memory fetch, as long as the memory bandwidth was adequate.

So basically every thread gets to pretend that it's running on a machine with zero-latency RAM — memory that's as fast as the registers. And pointer-chasing becomes as fast as looping over an array.

There are some other advantages to this design. Pipelining logic is very simple, because unless your pipeline gets insanely deep, you never have two instructions in the pipeline from the same thread, so you don't have register hazards.

(Cache is still beneficial in such a design, since it reduces the bandwidth that the links to main memory need to support. But the Tera didn't use it.)

I don't really understand why the Tera MTA failed in the market, and I suspect the problems were commercial rather than technical — customers had to take a big risk by porting their software to an unproven HPC platform, a platform whose performance characteristics were completely unlike anything else in the market (and unlike anything you can buy today). So customer uptake was insufficient to provide the cash flow needed to keep updating the design to keep up with Intel and AMD.

The technical reason such a design might fail would be if the silicon resources needed to support an entirely independent core were comparable to the silicon resources needed to support a hardware thread. Consider the GreenArrays GA144 chip: 144 independent cores, each with a tiny amount of independent RAM, on the same chip. Such a chip will be at least as fast as a chip with 144 independent register sets, but a single execution pipeline — in the worst case, it's bottlenecked on getting data out of RAM, and only one of its cores is usable, making it just as fast, while in the best case, it runs 144 times as fast. So a chip with 144 register sets needs to be cheaper — i.e. smaller — than the GA144. (Well, or easier to program, but presumably you can use more mainstream multicore chips to prove the example instead.)


Isn't there a much simpler reason - that programs are generally single-threaded? I imagine it would take a lot of work to port basic tools like a web browser to such an architecture without it running much slower. A lot of applications do little parallelizable number-crunching, but a lot of branching and sequential operations. How could you make parsing XML or HTML fast on this? What about a text processor or a compiler?


It's true, if Intel could make a single processor core that went eight times as fast on single-threaded code, they would do that instead of making eight-core chips, unless the cost difference was horrific. But they can't, so if you want your code to go faster, you have to find a way to parallelize it.

This started happening in earnest about 20 years ago in the supercomputer market, which is where the Tera was sold. About 10 years ago, it started happening in desktop CPUs (check out Herb Sutter's article about "the end of the free lunch") and now it's starting to happen in embedded microcontrollers, with the Parallax Propeller and the GreenArrays chips.

As it happens, the Tera didn't lose to faster single-threaded supercomputer CPUs. By the time the MTA came out, even the most stalwart defenders of the fast-single-threaded-performance approach, like the Cray SV1 and the NEC SX-5, had succumbed to the necessity of CPU parallelism. But the approach that was taking over the supercomputer market at the time was actually far more parallel, and far more difficult to program efficiently --- NUMA machines and then Beowulfs.

So that's why I don't think it was single-threaded programs that made the Tera fail in the supercomputer market.

The question of how to meaningfully parallelize XML and HTML parsing, compilation, text processing in general, and web browsers are very interesting indeed. It's not obvious how to do it, but it might turn out to be tractable. A group at Berkeley was doing some research on it in 2007 and 2008: http://www.eecs.berkeley.edu/~lmeyerov/projects/pbrowser/


My point was not aimed at the Tera specifically, but rather at this architecture as a solution to the memory problem in general. As you say, supercomputers have been using multi-threaded code for a long time now, so it's not a technical problem for this case, but it can be for the kind of applications 'regular users' may need.

Thanks for the references. I'm in embedded systems but I hadn't heard of the Parallax Propeller before, it's an interesting architecture.


I suspect that the killer feature of Tera-like architectures for embedded systems could be their determinism. Embedded microcontrollers like the LPC2100 series are starting to get caches, branch prediction, and the like, and that makes me really nervous, because it's going to make worst-case timing very hard to reproduce in testing. (In a way, this has been the case since the introduction of interrupts in the 1960s, but I think it makes the problem much worse.)

If multithreaded or massively multicore processors were a viable alternative, you could get deterministic timings without sacrificing throughput. You could even do away with interrupts. The GreenArrays chip doesn't have interrupts at all; instead, its cores go into a low-power shutdown state whenever they're waiting on I/O.

But that's all pretty speculative.


Not really directly connected, but in embedded systems you can actually see a lot of hardware software codesign approaches.

There is an approach called NISC [1] (No Instruction Set Computer), where the instruction set is completely removed and the compiler generates control words directly for the architecture (which can also be designed from scratch and optimized).

[1] http://en.wikipedia.org/wiki/NISC


The design of the memory hierarchy is dictated by the fact that in general the faster your memory is, the more it costs. As a way to balance this you have a hierarchy of memory, with the fastest closest to the cpu and also in the smallest quantity. So the compilers are optimized for the hardware, not the other way around.


There were a couple of machines optimized for LISP some years ago, but guess what? They were beaten by ordinary machines.


Sure -- but they were beaten by ordinary machines because the combination of better ordinary machines and better techniques for running lisp on ordinary machines improved to the point where lisp machines weren't better at running lisp.

A lot of this is that a few small low-budget research groups were building lisp machines at a time when the big players suddenly started pouring all their efforts into developing commodity single-user graphical workstations. But it's also true that developments in GC technology -- developments which Torvalds seems unaware of in this 2002 post, even though they were 15-20 years old by then -- made GC really fast even on hardware not designed with GC in mind.


Intel makes both processors and compilers.


Yes, but their compiler doesn't have much market share, certainly not enough that it can drive the chip design. They do, of course, make a lot of popular chips, but even there they can't dictate the architecture single-handedly.

The vast majority of software seems to be built with something other than Intel compilers and compiled for generic 32-bit x86/i386/ia32 and/or x64/x86_64/amd64 platforms. (whatever you want to call 'em)


You might enjoy reading this: http://www.thocp.net/biographies/papers/backus_turingaward_l... (I've been casually trying to implement an FFP Machine on an FPGA.)


What he's advocating sounds a lot like how things work in the iOS world, in my experience.


And thus the insanely smooth user experience on iOS. As an Android developer, this is _the one thing_ I feel makes it difficult to have a polished user experience on Android compared to iOS.

If you think long and hard about each place you call 'new' in Java Android apps, it is possible to get a smooth interface. However, the language doesn't encourage it by default the way iOS does, and you have to put in time you usually don't have.

For users, milliseconds of jitter and blockiness everywhere are a huge turn-off. It feels like the device is struggling to handle simple tasks, and it destroys the belief in the UI metaphor of physical objects that have inertia and flow, stretch and bounce when you touch and flick them.

IMO, because it makes such a huge difference to users, UI programming should always happen in a high-performance language and environment. One way to attain this performance is to eschew garbage collection. This is also a problem in web browsers, where the UI is often written in JavaScript.

At the very least, those who make programming environments, languages, or GCs used for building user interfaces should optimize how their environments promote good use of local hardware caches and acceleration. There is a reason UI rendering so often relies on video hardware not totally unlike that of a high-performance desktop computer.


> And thus the insanely smooth user experience on iOS. As an Android developer, this is _the one thing_ I feel makes it difficult to have a polished user experience on Android compared to iOS.

Yet WP7 is also silky smooth. The main issue on WP7 for user apps isn't the GC at all, but rather the network (creating long lists of images that you're fetching from a web service). Once people learned some techniques for dealing with that on the device, the experience for 3rd-party apps was just as silky as iOS apps -- yet with a full generational GC.

The GC is an excuse (unless it's not well written), not the reason.


And that Android's GUI did not use GPU hardware acceleration until Honeycomb. The iOS GUI uses hardware acceleration extensively (which is admittedly probably easier when you control the iOS devices' hardware design).

https://code.google.com/p/android/issues/detail?id=6914


A GC system with explicitly visible reference counts (and immediate freeing) with language support to make it easier to get the refcounts right (things like automatically incrementing the refcounts when passing the object off to others) wouldn't necessarily be painful to use, and would clearly offer all the advantages of just doing it all by hand.

I was wondering how that was different from Cocoa on Objective-C.

See: http://developer.apple.com/library/mac/#documentation/Cocoa/...
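For concreteness, here's a minimal C sketch of the scheme described above -- explicit counts, immediate freeing, with the increment on hand-off being what the proposed language support would insert for you (the names are made up; Cocoa's retain/release is the same shape, done by convention rather than by the compiler):

    #include <stdlib.h>

    struct node {
        int count;              /* explicitly visible refcount */
        struct node *next;
        int payload;
    };

    static struct node *node_new(int payload)
    {
        struct node *n = malloc(sizeof *n);
        if (!n)
            return NULL;
        n->count = 1;           /* creator holds the first reference */
        n->next = NULL;
        n->payload = payload;
        return n;
    }

    /* What the language support would emit when handing the object off. */
    static struct node *node_retain(struct node *n)
    {
        n->count++;
        return n;
    }

    /* Immediate freeing: memory comes back while it may still be in cache. */
    static void node_release(struct node *n)
    {
        if (--n->count == 0)
            free(n);
    }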


The runtime system might not free the memory immediately.


Could you expand on this thought? I assumed that free() was invoked in the dealloc method, which was explicitly called if the release method reached a zero ref count.


That's correct, unless the object is autoreleased, in which case the system will get around to deallocating it after the current run loop iterates (provided you're using the default autorelease pool).
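Mechanically, an autorelease pool is roughly a deferred-release list that gets drained once per run-loop pass. A crude C analogy (not Cocoa's actual implementation -- a real drain sends -release, i.e. decrements the refcount, rather than freeing unconditionally):

    #include <stdlib.h>

    struct pool {
        void *deferred[1024];       /* fixed size only for the sketch */
        int n;
    };

    static void pool_autorelease(struct pool *p, void *obj)
    {
        p->deferred[p->n++] = obj;  /* nothing is released yet */
    }

    static void pool_drain(struct pool *p)
    {
        for (int i = 0; i < p->n; i++)
            free(p->deferred[i]);   /* Cocoa: a release message, not free */
        p->n = 0;
    }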


Worth noting that an autoreleased object which you haven't retained will have a ref count of 1, so they are not an exception to the "freed when the ref count is zero" idea.


Yes, but if the object falls into the default autorelease pool you have no way to control when the memory is going to be released (at least not without risking a segfault).


Thank you, that makes sense. So, if I understand this correctly: if one wanted to take advantage of an opportunity to do some cache-friendly reallocation, then avoiding autorelease pools would be necessary.


That's rarely a realistic option if you use any Cocoa library code. autorelease is a pretty fundamental aspect of the Cocoa design.


Wouldn't it be more accurate to say that it's not a realistic option for objects that come from or depend on Cocoa library code?

Or would the mere act of using cocoa in one part of my application somehow prevent me from exploiting such an opportunity in another part of my application?

My instinct says that, for example, a plain C library that used such techniques would not cease to function just by being linked into a cocoa application.


Yes, you're right: If you create objects yourself and don't pass them through certain Cocoa library code, they will not be autoreleased.

What I meant, however, is that most nontrivial Cocoa apps are likely to end up with a checkerboard allocation pattern, where client-allocated, controlled-lifetime objects are interspersed with Cocoa-library-allocated autoreleased objects with somewhat longer lifetimes, which may interfere to some extent with any strategy that relies on controlling your object lifetimes.

Still, people can and do avoid autoreleasing objects if their lifetimes are well known in advance.


Your instinct is correct.


In ObjC the retain/release memory management system uses reference counters. When the count hits zero the object is deallocated so that does sound like what Linus is talking about.

Objective-C also has the newer, optional garbage collection system, and there is autorelease, which uses a memory pool and releases objects at some later point.


Objective-C also has the autorelease pool, which allows for a bit more flexibility in that you don't need to pay as much attention to the refcount.

You can also optimize allocations using zones, which are like mini-heaps that collect a bunch of objects and allow you to release all of them with a single deallocation -- very handy if you have a ton of small objects (like a DAG) where you know for a fact that when the root is deallocated, all the nodes should be freed as well.
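A bare-bones sketch of the zone idea in C (a bump allocator with made-up names, ignoring alignment; the real NSZone machinery is more elaborate):

    #include <stdlib.h>

    struct zone {
        char  *base;
        size_t used, size;
    };

    static struct zone zone_create(size_t size)
    {
        struct zone z = { malloc(size), 0, size };
        return z;
    }

    /* Individual allocations are just pointer bumps... */
    static void *zone_alloc(struct zone *z, size_t n)
    {
        if (!z->base || z->used + n > z->size)
            return NULL;            /* a real zone would grow here */
        void *p = z->base + z->used;
        z->used += n;
        return p;
    }

    /* ...and the whole DAG dies with one call when the root goes away. */
    static void zone_destroy(struct zone *z)
    {
        free(z->base);
        z->base = NULL;
    }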


I think this is another case of "everybody thinks about garbage collection the wrong way": http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047...

From the article: "Garbage collection is simulating a computer with an infinite amount of memory. The rest is mechanism."

Whether or not it's reference counting or generational, the goal is still to simulate infinite memory. That way, you can focus on the high-level problems instead of the technical memory-related details. So it's not necessarily a bad mindset to have.


Might be worth mentioning Tcl in this context, as it uses reference counting for the GC [1].

It also doesn't allow circular data structures, which are quite hard to implement if all you have are strings anyway.

[1]: http://wiki.tcl.tk/3096


Perl also uses reference counting, and cyclical data structures cause memory leaks unless you explicitly decrease the reference count with the weaken function from Scalar::Util.


And Python. AFAIK, there are no plans to fix it at this point. Also Python is notorious for allocating lots of small objects.


Python has a cycle-detecting garbage collector (since 2.0) to catch what reference counting misses. You can disable it if you know you don't create any cyclic objects.


Though it does throw away unused cyclic dependencies, so you don't have to do the same trickery as in Perl.


I've done minor amounts of text analysis in Python. It is a memory pig. I was shocked.

Anything remotely like a non-trivial dataset should not be run in pure Python, based on my experience.


[2002]

Though his argument about cache does still hold.


I think it holds for work that's happening close to the metal. If you're working at the level where you're trying to fit all your data in the cache, then GC will get in the way of that.

For applications operating at a higher level, or where any calculation that's going on is taking up a trivial amount of CPU time compared to (say) waiting for the user, then the overhead of reference counting (and the mistakes that are associated with that) is probably not worth it.


Sure. He was talking about the OS level.

In UI scripts or something else running on multiple levels of abstraction, preventing memory leaks and allowing developers to quickly prototype is much more important. Cache performance does not even come into it. It would be crazy to advocate manually managing memory there.


I think his opinion might not have changed. In his latest "C++ sucks" rant in that "why is git written in C" thread, he points to not having GC as one of the detriments of C++.


For all the complexity it adds, not having GC is a detriment of C++. The language makes reasoning about allocation, object ownership, etc., much harder. It does this in the course of providing features intended to help you structure and manage larger pieces of code, and thus accidentally works to defeat that purpose.

I don't think Linus ever intended to declare GC as evil or wrong. He was calling it out as a very poor choice that people persist in making for certain domains.


I could see it holding for, say, Java or C#, but for single-assignment languages like SML or Haskell I don't think it would. Turning single assignment into mutation seems like one of the first memory-related optimizations you would make. While it is easiest to think of A1 = A0 + B as a mutation of A, there is nothing that says it should be so limited.


> for a single assignment languages like SML or Haskell I don't think it would hold.

It would not have to hold, but I have not heard of any such scheme being implemented.


What about for something like SISAL?


And sure enough, graph colouring register allocators work just as well for allocating chunks of memory as they do for allocating registers.


Are you sure?


I like that he mentions the programmer's mindset associated with GC as a big danger. Some people consider GC a magic bullet and refuse to think about what's happening under the hood. I don't consider that a good habit.


Some people consider it free, too, which is very painful. Like pretending networks don't have latency.


It seems to be far easier (i.e., possible) to go from a manual-memory-management style of development to an automatic one than the other way around. I've known plenty of Java-CS-degree programmers who just never could get the hang of writing C/C++ code without leaking stuff (and not just memory).


This is a key point that I think lots of GC proponents gloss over. Memory is not the only resource that needs to be managed. Open file handles, sockets, user sessions, whatever... they all need to be managed in much the same way as memory.
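In C, the usual discipline treats every resource the same way -- e.g. the goto-cleanup idiom, sketched below -- while a GC only ever takes care of the malloc, never the fclose:

    #include <stdio.h>
    #include <stdlib.h>

    static int process(const char *path)
    {
        int ret = -1;
        char *buf = NULL;
        FILE *f = fopen(path, "r");     /* a handle, not just memory */

        if (!f)
            goto out;
        buf = malloc(4096);
        if (!buf)
            goto out_close;
        if (fread(buf, 1, 4096, f) == 0)
            goto out_free;

        ret = 0;                        /* success */

    out_free:
        free(buf);
    out_close:
        fclose(f);
    out:
        return ret;
    }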


When I read this, I immediately thought about std/boost::shared_ptr. This is a bit ironic since Linus hates C++ so much.

shared_ptr is a really nice thing in C++. (For those who don't know: It is a ref-counting pointer with automatic freeing.) And its behavior is very deterministic. In many cases in complex C++ applications, you want to use that.


Or as we like to call it, std::shared_ptr


I like the way the D language approached this. It's garbage collected but it also has a "delete" function/operator. That way you can use garbage collection if you'd like, or you can manually free memory when you think it's worth it.

That seems like a reasonable compromise and I'm surprised that more languages don't do it.


I'm very suspicious of anyone (even Linus) claiming that gcc is slow because of its memory management. The codebase is crufty and convoluted -- it's probably slow for a thousand different reasons. If you refactored it into a clean design and rewrote the beast in OCaml (or any other language with a snappy generational collector), you'd probably get a large performance boost.


I think your argument consists of "gcc is slow because I don't understand the code layout".

One of the problems affecting the C frontend and backend is poor cache locality due to pointer chasing in their data structures, and they currently do switch between GC memory and manually allocated zones (obstacks) to improve this.


This is like arguing assembly is better than high-level languages because it's faster and gives explicit control. The thing is, 99% of the time it doesn't matter.

In most cases, GC-based programs have good enough performance to get the job done. For the 1% case, sure, use C/C++/assembly to get the explicit control and performance. Avoiding GC everywhere because of a potential caching problem sounds like a case of premature optimization.


I really like it when people reply "usually this doesn't matter" to a discussion in a project where it _does_ matter.


I got the impression that Linus was discussing GC in general usage in the long post, not specific to the GCC compiler. He even brought up the copy_on_write example, which was a kernel or file system usage.

If the goal is to speed up GCC, there are a long list of things to do before you have to worry about L1/L2 cache miss. Header file processing (or re-processing) is one of the biggest time sinks during compilation.


cpp avoids reprocessing:

http://gcc.gnu.org/onlinedocs/cpp/Once_002dOnly-Headers.html
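i.e. the ordinary include-guard idiom (or GCC's #pragma once), which makes cpp skip the body on repeated inclusion -- though only within a single translation unit:

    /* foo.h */
    #ifndef FOO_H
    #define FOO_H

    struct foo { int x; };

    #endif /* FOO_H */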

Processing header files in the first place is certainly a problem affecting C, but I believe optimizations take more time. Parsing is a much larger problem for C++.


I meant reprocessing the same set of header files for every single cpp file that includes them. Precompiled headers are supposed to speed that up, but they still need to be read in and re-created in memory for each cpp file. Why not just compile all the cpp files in one process rather than spawning off a compiler process for every file? That would ensure the header files already processed and still in memory get reused.

My point is: run the profiler, see what the bottlenecks are, and pick those areas for optimization, rather than speculating that L1/L2 cache misses are causing the major delay. If they ran the profiler and L1/L2 cache misses really are the problem, then I have nothing else to say.


> But they still need to be read in and re-created in memory for each cpp file.

This is just mmap for Clang. For GCC it... isn't. The other one you mentioned is called a compile server and nobody seems to have cared enough to implement it.

> If they ran the profiler and L1/L2 cache misses are really the problem, then I have nothing else to say.

http://gcc.gnu.org/ml/gcc/2011-04/msg00315.html


A compile server -- interesting concept. If no one cares to implement it, that suggests performance is not a high-value item for people to improve.

That's an interesting problem: doing a lot of lookups in main memory because the data structures are too big to fit in L1/L2. But that usage pattern is not about allocating a lot of short-lived objects, freeing them, and reusing the freed memory right away for the next allocations -- which is what Linus is arguing for: reusing L1/L2 for the short-lived objects, which is where GC is inappropriate. I would imagine compilers typically allocate objects with long lifetimes, such as declarations that stay in scope through the whole compile cycle. Also, if memory allocation/deallocation really is a performance problem, then don't deallocate -- just reuse the buffers. Reusing the same set of buffers keeps them hot in L1/L2.
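E.g. a trivial LIFO free list (just a sketch, names made up) hands back the most recently freed object first, which is exactly the one most likely to still be sitting in L1/L2:

    #include <stdlib.h>

    struct obj {
        struct obj *next_free;
        char data[56];
    };

    static struct obj *free_list;

    static struct obj *obj_alloc(void)
    {
        struct obj *o = free_list;
        if (o) {
            free_list = o->next_free;   /* most recently freed == hottest */
            return o;
        }
        return malloc(sizeof *o);
    }

    static void obj_free(struct obj *o)
    {
        o->next_free = free_list;       /* push on the front */
        free_list = o;
    }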

Anyway it has been an interesting discussion.


Didn't Linus forget

    newnode->count = 1;


Maybe it's done by the copy_alloc function.
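Something along these lines would make the snippet work -- purely a guess at what copy_alloc might look like, not anything from the actual post:

    #include <stdlib.h>
    #include <string.h>

    struct node {
        int count;
        /* ... payload ... */
    };

    static struct node *copy_alloc(const struct node *old)
    {
        struct node *newnode = malloc(sizeof *newnode);
        if (!newnode)
            return NULL;
        memcpy(newnode, old, sizeof *newnode);
        newnode->count = 1;             /* the line in question */
        return newnode;
    }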


Here are a couple of reasons why I think it's not so clear-cut:

1. If garbage collection were that damaging to the cache, Haskell wouldn't be nearly as fast as C.

2. Copy-on-write data structures are nice because the immutability allows for concurrent access without locking.

Granted, this was from 2002 and Linus may no longer feel so strongly about the topic.


Show me a memory-intensive kernel in which Haskell runs close to the performance model. Sparse and dense matrix kernels would be a good place to start. Our C code for sparse matrix-vector products and sparse triangular solves gets better than 90% of STREAM bandwidth (assuming optimal cache reuse; STREAM is about 85% of hardware peak).
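For reference, the sparse kernel in question is essentially just this loop over CSR-format arrays (the names are illustrative), which is why it's limited by bandwidth rather than FLOPs:

    /* y = A*x, with A stored in compressed sparse row (CSR) format */
    static void spmv_csr(int nrows, const int *rowptr, const int *colind,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[colind[j]];   /* streams val and colind once */
            y[i] = sum;
        }
    }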

Dense matrix kernels should get better than 90% of FPU peak. Unlike sparse kernels, dense kernels are no longer bandwidth-limited, but cache reuse in both L1 and L2, as well as friendly TLB behavior, is important for good performance.

It would be interesting to see any Haskell implementations that are competitive. I suspect that the very first thing you will do when trying to get performance is to ditch the functional paradigm and start writing code in an assembly-level monad.


Let's be fair here. Numerical processing is not representative of most applications, but I concede the point.

It's an interesting question though. I'm a Haskell beginner, but I'm pretty experienced with Scala. Let me try to see how using sliding compares with a tight Java loop on statically allocated data for convolutions/filters.


> It would be interesting to see any Haskell implementations that are competitive. I suspect that the very first thing you will do when trying to get performance is to ditch the functional paradigm and start writing code in an assembly-level monad.

Or teach the compiler about the algebra of arrays and matrices, so it can do to the code the things we'd write by hand.

E.g.

* http://www.cse.unsw.edu.au/~benl/papers/stencil/stencil-icfp...

* http://www.cse.unsw.edu.au/~benl/papers/repa/repa-icfp2010.p...

* http://www.cse.unsw.edu.au/~dons/papers/stream-fusion.pdf

* http://www.cse.unsw.edu.au/~chak/papers/acc-cuda.pdf

In all these cases array code is written in a functional style, accompanied by special-purpose optimizations and/or code generators (in the case of GPU code), layered over an imperative array primitive layer that uses a memory-effects monad.


This is a worthwhile research topic, but it doesn't really answer my question. From the first paper you cite:

> The single threaded Handwritten C version is about 45% faster than our best Haskell result, which is achieved with 3 threads.

Meanwhile, there is no performance model, so we don't know how good the C version is. The paper doesn't even report a simple fraction of FPU or bandwidth peak. It is not using SSE instructions, so it cannot possibly be better than 50% of FPU peak (the limit is actually lower because this kernel is/should be bandwidth-limited). As for parallelism, I'll quote Bill Gropp [1]:

> The easiest way to make software scalable is to make it sequentially inefficient.

[1] http://books.google.com/books?id=2Da5OcnjPSgC&lpg=PA21&#...


Oh, I'm certainly not arguing that you're going to beat hand tuned straightline code. I'm just pointing out that dropping into assembly isn't the only possibly path.


If you're not within 10% for these kernels, chances are that memory is being used differently. This gets to a further matter which I think is perhaps the greatest failure of current multi/many-core programming paradigms: assuming a flat memory model. Efficient parallel computation has much less to do with computation than with data movement. Recent and future architectures have deeply hierarchical memory systems so any paradigm that does not expose the location of physical pages (within some appropriate abstraction) will have a hard time delivering consistent, understandable performance. Performance should not vary an order of magnitude based on whether memory was faulted (allocation is irrelevant) using a batch of threads with different affinity than those that access it later. But this is the current state of affairs.

I would very much like to see a paradigm where a memory distribution (roughly a high-level representation of the mapping to physical memory) was a first-class concept. Suppose that new memory could be allocated or remapped to have certain compatibility relative to the mapping of another block. Then you could associate tasks with certain coupling between two distributions.


> If you're not within 10% for these kernels, chances are that memory is being used differently

More likely imperfect strictness analysis, etc. Haskell is a pure functional lazy language, after all. Getting within 65% of C's performance on a tight numeric kernel is heroic.


Different strictness analysis results in using memory differently. Note that some of Don's references are embedding a DSL that gives a high-level interface to very low-level code (e.g. CUDA) specific to this problem. They should have control over strictness.


You just described an ideal use-case for GC. You allocate a bunch of arrays at the outset, then sit there pounding on them with little to no allocation. A good GC won't cost you anything in this case, since it'll allocate large pointer-free arrays in a separate pool that doesn't get traced.

Haskell isn't a good example, since it's a lazy functional language, but if something like Java isn't just as fast for your case it isn't the GC, it's the code generation. Much harder to do all those fancy loop optimizations in 100 ms in a JIT than in 10s in an offline compiler.


I concede that my arguments were directed primarily at Haskell and are really more about data structure reuse than about GC. The threading model and task affinity are also relevant: if a compacting GC moves a hot data structure to a different memory bank, performance can get much worse. In principle, a runtime (with kernel support; libnuma may be sufficient on Linux) could set thread affinity, trace the page table for each (core, thread) pair, and then have the compacting GC migrate threads and memory so that threads tend to use local memory. In principle, this could give better performance than most static distribution schemes, but the tracing costs something, and to my knowledge this sort of NUMA-aware GC is still a research topic.


> I suspect that the very first thing you will do when trying to get performance is to ditch the functional paradigm and start writing code in an assembly-level monad.

Of course, not every problem is going to fit well into an imperative model either. E.g. a problem with a large variety of pointer-y data structures and inherit-y objects -- like, say, a compiler.


Read your own words: "immutability allows for concurrent access". Most GCs mutate (mark bits, semi-space copying, etc.). The immutability of Haskell data structures is an illusion, and from the standpoint of real concurrent access, an expensive one.


Mutating GCs stop the world when making changes, so there's no erosion of the immutability contract under concurrent access. There are also parallel GC systems that work quite well on multi-core machines.


You write "stop the world" and "concurrent access" in the same sentence. What is up with that? When we manage to go for 10 years without getting 5 or 6 papers claiming to "solve the concurrent/pauseless GC problem" then, perhaps, it might be reasonable to present the problem as "solved."


I don't understand what you are trying to get at. I merely stated that the immutability contract for concurrent access at the language level is not violated when the underlying GC mutates the memory. "Stop the world" is the simple technique that ensures that consistency. Things at lower levels change/mutate all the time. Virtual memory makes memory appear to be there when it really isn't; your object might get swapped out and brought back in underneath you. But all those changes at lower levels don't change the contract at the higher level.

If your problem is with pauses during GC, well, guess what: computers pause all the time. Whenever you move your mouse or type on the keyboard, interrupts are generated that pause the whole CPU so it can handle them. Those pauses don't seem to be a problem. If the small pauses from GC are acceptable in general usage, why not accept them?

If you want hard guarantees, that's what realtime systems are for.


It's an interesting problem that will probably never be solved, only optimized and amortized. That makes it good for papers!

The most efficient scheme probably is a "stop the world" system. The regular program might mutate concurrently and the GC could indeed be concurrent, but making them run at the same time is a real challenge.

Of course, people don't like it when their app pauses.

I think it's noteworthy that there never was a big office suite or even a web browser written primarily in a GC system, despite many years of promises.


Re: office suites, the code bases of the major office suites pre-date the rise of Java, which brought GC into the mainstream. Lots of apps bigger and more complex than an office suite have been written in Java.

As for web browsers... every web browser has a GC...


> office suites, the code bases of the major office suites pre-date the rise of Java, which brought GC into the mainstream.

Yes, but Java and GC proponents talked up the possibility at the time. There were various projects to build pure-Java office suites and web browsers. The time or two I tried them, they were memory pigs and slow.

> Lots of apps bigger and more complex than an office suite have been written in Java.

Many that involve heavy user interactivity? Eclipse, perhaps, but IBM had to come up with a custom native-code UI toolkit to implement it. I still considered it too slow to use until the last year or so.

> As for web browsers... every web browser has a GC...

Last I looked (a long time ago), my impression was that DOM and JavaScript interpreter objects were usually refcounted internally. Has this changed?


Slightly related: he mentions that when the containing structure of a substructure goes away, you can free all of its resources. The guys behind Samba 4 developed talloc [1], which is built around that idea.

[1] http://talloc.samba.org/talloc/doc/html/index.html
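The hierarchical idea in a nutshell (usage roughly as in the talloc docs; treat the details as a sketch):

    #include <talloc.h>

    struct conn { int fd; };

    void example(void)
    {
        /* every allocation is parented on another allocation */
        struct conn *c = talloc(NULL, struct conn);   /* top-level context */
        char *name = talloc_strdup(c, "client-1");    /* child of c */
        char *note = talloc_strdup(name, "hello");    /* grandchild */
        (void)note;

        talloc_free(c);   /* frees c, name and note in one call */
    }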


Arena or pool memory allocation is a very old idea, and has been implemented by many different systems (e.g., Apache/APR, PostgreSQL, lcc).


> In contrast, in a GC system where you do _not_ have access to the explicit refcounting, you tend to always copy the node, just because you don't know if the original node might be shared through another tree or not. Even if sharing ends up not being the most common case. So you do a lot of extra work, and you end up with even more cache pressure.

It's possible that things were different in 2002, but I don't really think this is the case now. In general, I make the node immutable and never copy it (copying an immutable object makes no sense). In a well-designed code base, mutations happen within the function where the data is created (read: on the stack, where cache locality is a given). Immutability also addresses Linus' concerns with thread-safety. And that's not accounting for concerns which Linus DOESN'T mention, such as increased development speed and correct program behavior.

I'm not the only one saying this. Josh Bloch, for example, recommends immutability and cites cache reasons (http://www.ibm.com/developerworks/java/library/j-jtp02183/in...). And many languages (Haskell, Clojure) are designed heavily around avoiding mutation and sharing nodes within data structures.

This talk of copying nodes to avoid your objects changing out from under you sounds a lot like what I call "writing C in Java". Linus is looking at this from the perspective of, "If they took away explicit memory management from C, this is how I would do it." But OF COURSE if you just bolt a feature like GC onto a language that didn't have it before, it won't work well. Effective cache usage in a GCed system requires other language constructs (like immutability).

Now, after all that, I won't make the claim that immutability in a GCed language like Java or C# is faster or even as fast as C with explicit memory management: it would take a lot of profiling code and comparing its functionality to make that claim with any kind of certainty. But it doesn't seem like Linus has done that profiling and comparison either.


That was 2002. Here is the state of the art in 2008: http://cs.anu.edu.au/techreports/2007/TR-CS-07-04.pdf

Unsurprisingly, things have changed. Many of Linus's complaints were valid, and we've learned how to address them.


I find it sad that, to this day, one has to spend so much time worrying about memory management to get decent performance. I've yet to work on a performance oriented project where I didn't need to write at least a couple custom allocators to reduce memory management overhead.

GC systems are no better in this regard. I was told of an interesting hack in a Java program that implemented a large cache of objects by serializing them into a large memory block, so that the GC saw it as one big object and didn't traverse it. This resulted in dramatically reduced GC pause times (10x+). When needed, objects were deserialized from the block. Disgusting, but effective.


So... essentially, man the F up and live without GC in the parts that are too slow, instead of saying "oh it's cool, just wait ten years and hardware will be fast enough to run this anyway." Use GC for stuff that needs to be simple and is fast enough anyway; don't bog down code that's too slow with it.


Guy who writes kernel code cares about performance, film at 11.


I'm quite sure the machine code generated by my compiler isn't nearly as good as it could be if I hand coded it but the efficiency of not writing in machine code far outweighs any potential performance gains.


You have a good point that applies in a lot of places, but here you're aiming it at a straw man. Linus advocates reference counting and even suggests that building it into the language would be a good idea. That's hardly hand-coding anything -- it's just a different strategy for GC (and one that is actually used).

It is getting harder and harder to beat compilers with hand-coded assembly without an enormous amount of effort, though.


"All the papers I've seen on it are total jokes."

Couldn't agree more. We were actually laughing in the office when an office mate brought up such a paper many years ago.

"I really think it's the mindset that is the biggest problem."

Linus is a superhero: 20+ years working on the supertask of changing people's mindsets.


Hacker News, another place where 10-year-old emails are submitted as news.


Maybe the OP considered it news because it is tangentially related to this patent issue involving automatically expiring data in hash tables.


I assumed that's what the comment section is for.



