The solution (which, unsurprisingly, works!) is how you did things in Fortran-77 and APL (circa 1958). There were no iterators, so all you did was pass around one big array that contained all the data, and indices into it.
In a couple of years, they will realize that even structs are inefficient, and will go to parallel arrays - at which point the transition to Fortran/APL/J/K would be complete.
I always like reading about how people did things back when dinosaurs ruled the data center. Do you know of any web pages that detail this sort of thing?
I'm close to 40, dinosaurish by HN/Reddit standards, but I learned Fortran mostly because back in the late 80s and early 90s essentially all numeric processing code was written either in Assembly or Fortran; and at a later stage, because I was the only one proficient in Fortran, I inherited lots of legacy code at work.
Most people my age, even if they've been programming since age 9, haven't seen any Fortran unless they're doing physics or using LAPACK.
As a 20something who has been unfortunately intimate with Fortran, I can see why people in the past would have been fanatical about C, and heck even talked about the virtues of diving into assembly rather than deal with Fortran.
GEN-2 is the final lurking place for all your long-lived data... Checking GEN-2 is a "stop the world" event – it (usually briefly) pauses your code and does what it needs to. Now imagine you have a huge set of objects which will never be available to collect.
For many years now, some Smalltalk VMs have had "permspace." You just send a message to a long lived object, and it's moved to a different part of memory considered permanent, and the GC never looks at it. No structs. Just call a function on the permanent objects. Since it's in a different space, this is also very efficient for the VM, since identifying a permanent object is just a comparison of addresses.
Or just track the (presumably few) mutations of permspace objects for special handling. In that case, permspace effectively becomes an infinitely long-lived generation that is never collected, only used as a source of roots.
GC is a tricky problem. Even when you have a system that performs GC in the background (as Java has been doing for almost 10 years), you are still susceptible to your heap becoming fragmented and still run the risk of requiring a full GC to compact the heap.
No, because you can allocate from different arenas if you need to. Using plain malloc may lead to fragmentation, but there's no law requiring you to use a single lowest common denominator allocator for your entire program.
That doesn't contradict his observation. Changing your program to use custom memory allocators can solve fragmentation issues, but that's a much bigger step than the changes required to avoid too many (or too expensive) full GCs (i.e. tuning the garbage collector directly, getting "permanent" data out of the garbage-collected heap, minimizing mutations so there's less to scan...).
The only times I've had difficulty solving a fragmentation problem was because a GC was involved, and rewriting the code in an effort to trick some enormous black box into doing what I wanted was a lot more difficult than just writing code that did what I wanted in the first place. ymmv.
I guess in theory fpgeek is right, and his great-grandparent post is in the vein of what I thought of answering. But I guess in practice tedunangst is more right at the moment.
Good article! As his conclusion mentions, it reminded me a lot of what XNA developers were going through when it was first released in 2007.
In general, garbage collectors are sensitive to the volume of objects in the heap and the number of live references, since the collector works by scanning all live objects and following every reference they contain. If you can't reduce the volume of memory you use outright, then you can take a stab at reducing the overhead added by various language features. Avoiding boxed value-types (such as integers stored in a non-generic list), avoiding delegates/events/generators, using specified capacities with System.Collections.Generic.List<T>, etc., can all go a long way toward lowering overhead. Reducing the references (by moving to storing indices) can be fairly easy and can be a significant gain as well.
Disposing of IDisposable objects (via using-blocks) when you're done with them rather than waiting for the collector to harvest and/or finalize them also helps considerably.
I think in general more explicit memory usage patterns can benefit even when using a managed language -- you know more about your program and where performance matters than the garbage collector ever will.
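A quick sketch of the kind of changes meant above (illustrative only, not code from the article): the non-generic collection boxes every value it stores, the presized generic list does not, and an array of indices gives the collector nothing to trace.

    using System.Collections;
    using System.Collections.Generic;

    // Non-generic collection: every int added is boxed into its own heap object,
    // and each of those objects is something the collector has to track.
    ArrayList boxed = new ArrayList();
    boxed.Add(42);

    // Generic list of a value type: no boxing, and specifying the capacity up front
    // avoids repeated internal re-allocations (and the garbage they leave behind).
    List<int> unboxed = new List<int>(1000000);
    unboxed.Add(42);

    // Storing indices into one big array instead of object references means there is
    // nothing inside the collection for the GC to follow.
    int[] questionIds = new int[1000000];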
This solution seems extreme, given the GC is just a reality of working with managed languages, which is most of the web these days (.NET, Java, Ruby, etc.). Many, many heavy traffic websites solve these problems without resorting to language magic tricks. Usually with effective use of caching, or separating out the heavily accessed data in a way that is optimized for those read operations.
This is the second StackExchange article where it feels like square peg, round hole. The first was [their ORM](http://samsaffron.com/archive/2011/03/30/How+I+learned+to+st...), which isn't really an ORM at all (perhaps micro ORM is fair) in the traditional sense.
More often they do it just by using non-garbage-collected heap strategies. PHP, Perl, Python and (I believe) Ruby all use reference-counted allocators and don't exhibit the kind of latency explosions you see with GC algorithms.
Pure reference counting doesn't handle cycles. I believe those languages (and I'm pretty certain Python does) implement a periodic mark-and-sweep GC for those cycles.
So you still get periods of high-latency collections (in .NET only Gen2 collections are actually noticeable, not sure about other platforms). You also have the overhead of managing reference counts, which can be non-trivial; especially if you're paging as a result.
Of course, comparing GC performance between C# and Python or Ruby is sort of silly. If we were trying to hit our ~40ms Question/Show render targets on them we'd probably have lower hanging fruit than GC.
Perl doesn't have mark-and-sweep - instead you use weak references to prevent cycles causing leaks. The combination of Devel::Gladiator and Devel::FindRef allow for manual mark-and-complain for debugging, which tends to work quite well.
It's also possible in the case of object structures to have a sort of "ring" where the weak reference gets shuffled around said ring in order to allow a safe cycle without expecting the user of said objects to care (the trick is that objects' DESTROY methods are allowed to make new references to the object, which lets you do the shuffling just before it would normally have been freed).
I don't entirely see how managing reference counts would cause paging - in perl they're stored as part of the C-level struct for the value so the data's already nicely local to the memory you're already accessing - and when a ref count hits zero the value gets put into a "clean this up when this VM op has finished executing" array so it's pretty efficient.
I would suspect this sort of potential problem is why ObjC seems to be looking towards reference counting - the "smoothness" of not having any chance of a GC pause is probably, to Apple's eyes, a worthwhile trade-off for the disadvantages of reference counting.
Strikes me it might be interesting to be able to mark objects as reference counted as a halfway house between fully GC'ed and fully manual, but I'm not sure whether the amount of use that would get would justify the effort involved.
Haven't done much Perl hacking, good to know it's more-or-less pure reference counting.
Paging can occur as a consequence of copying a reference.
var p1 = A // let p1 be a reference to A
// something in the interim
var p2 = p1 // create a new reference to A via p1
If A is no longer paged in, incrementing its reference count (as a consequence of copying p1 into p2) can cause a page fault. There are also gotchas with destroying references, such as a collection having to effectively iterate over its contents to decrement each reference count, again possibly leading to additional paging.
This assumes an object carries its reference count with it, which every reference counting system I'm aware of does. I suppose you could technically keep reference counts in a common space, though growing that would be a bit rough.
A GC isn't a magic bullet or anything, but they don't need to cause any paging as a result of reference changes until a sweep; and even then, they only need to page in the live set which is more likely to be resident in memory.
I'm... skeptical Apple's introduction of ARC is really performance oriented. Seems more like they tried and failed to build a workable GC for Objective-C with Tiger. Haven't worked with much Objective-C since 10.4 though, so I'm not completely settled either way.
Ah, the copying part, yes - but then again if I'm doing that, I'd expect to be doing so in order to access A at which point I already needed that bit of memory.
The collection thing I can definitely see though.
One thing I've never quite worked out is - without reference counting, how do you handle timely destruction?
I know that if I do something like:
{
my $x = foo();
...
}
then $x will be destroyed at the exact point the } is hit, or that an exception unwinds the stack past this point. That lets me do all sorts of nice things like rolling back a transaction automatically, throwing a warning if some task was expected to be completed but hasn't been, and other bits and pieces.
In a mark-and-sweep GC language, presumably there's some other additional feature that needs to be implemented in order to provide this?
Well, there are certainly cases where you grab a reference speculatively; anything where you need > N in a collection to act on, for instance, but don't know how many items there are yet (such as when you're reading from a stream). There's also the "early return" case, where you'd grab reference say A, B, and C but only dereference C in code that runs if A and B failed some condition.
The equivalent to timely destruction in C# is a using statement combined with implementing IDisposable.
using (var stream = File.Open(somePath, FileMode.Open))
{
    // use stream to do whatever
}
This is syntactic sugar for a try{ } finally { } (that also exists in Java) that calls Dispose() in finally. You do any cleanup you need to in Dispose(), things like releasing file handles. The only way to dodge that Dispose() is to tear the execution thread down, which generally means the process is gone with it.
C# classes (but not structs) can also declare destructors, but their execution schedule isn't guaranteed; they're useful for making absolutely sure resources are released eventually, but Dispose() is preferred for timing and performance.
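For reference, a minimal sketch of how the two usually combine (illustrative only; NativeResourceHolder is a made-up example class): Dispose() releases the resource deterministically and tells the GC the finalizer is no longer needed, while the finalizer stays around purely as a safety net.

    using System;

    class NativeResourceHolder : IDisposable
    {
        IntPtr handle;      // some native resource acquired elsewhere
        bool disposed;

        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);   // the safety net is no longer needed
        }

        protected virtual void Dispose(bool disposing)
        {
            if (disposed) return;
            // release `handle` here (and managed state too, when disposing is true)
            disposed = true;
        }

        ~NativeResourceHolder()          // runs whenever the GC gets around to it
        {
            Dispose(false);
        }
    }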
No, they just fiat "cycles" in as a possible memory leak against which the app developer has to guard. That's not so bad, really: it's far, far easier than manually freeing memory. And in any case there are all sorts of resource leaks that GC can't find anyway; this is minor in comparison.
Your last point seems flat wrong, btw. If you're in a GC environment with worst-case latency requirements (real requirements, not just nice-to-haves or 99.9th percentiles, or whatnot) of 40ms, then you're in a whole world of hurt. GC won't do that -- brush up on your C.
A memory leak is pretty terrible, and having to either a) disallow cycles or b) specially handle them in user code is also pretty terrible IMO. Loads of handy graph data-structures, all sorts of fun caching techniques just got either impossible or much more difficult. You definitely can work with those systems but it's like stepping into a time machine (one spitting you out around '94 at an MS COM presentation), why make do unless you have to?
I never claimed GC's solve all leaks either (they do a pretty good job with finalizer semantics honestly), but they're awfully awfully handy for solving very common (and sometimes tricky) ones.
I also did not mean to imply that our 40ms cut is a hard limit, it's a target average with a desire to minimize variability (Marc's work on which led to this post); one that we're hitting pretty well. If you're working on a real-time system, you're right that you shouldn't be using these non-deterministic GCs.
They're not mainstream (that I'm aware of), but there are realtime GCs that can guarantee a pause of less than 1ms. Metronome is the one I'm familiar with.
True enough, there are research systems that have been built with hard realtime properties. I don't know much about them. So I should have qualified it with "all popular GC-based interpreters" or the like.
I am really not following your criticism of Dapper? Without Dapper we would both need more servers AND have significantly higher latency.
There is no magic here, it is merely understanding the limitations of the .NET GC and adapting to it. XNA and trading platforms also are aware of these issues and have to follow similar patterns.
While this is possible using the profiling interfaces, it's not really much simpler IMO.
What happens if more than 1 server needs to GC at the same time (or overlapping times)? You definitely don't want to pull everything out of rotation, so you'll need some coordination, and how do you determine how many servers are allowed to be pulled out? We know we can limp along on about 1/3 of our servers, but it's not a stellar end user experience; at some point you're not solving the problem, you're spreading the pain.
There are some minor annoyances this would introduce too.
We use "server in/out of rotation" as an at a glance "things are getting ill" check, if a web server has been pulled (due to slow responses to requests) it's almost invariably something that needs fixing right now. We'd lose that if pulling servers out of rotation was normal operating procedure.
This would also be a royal pain in terms of local and dev tier testing, since you've basically made the load balancer and frequent GCs a pre-requisite for thorough testing.
Of course, this remains one of the options when dealing with GC stalls; I just think what Marc went with here made more sense.
Actually the load balancer approach can be further simplified. Just have a shorter health check on the server, like 3 seconds. When the server is in long GC pause, its health check request from the load balancer will fail. The load balancer automatically takes it off the pool and no traffic reaches it. After GC completed, the server's health check succeeds. Load balancer adds it back to pool automatically.
It's just a simpler solution to have a group of servers with enough capacity to account for the random GC pauses. Having a short health check enables failing fast, which is generally better for the users overall - their requests don't get stuck in overloaded servers for a long time (GC or otherwise caused).
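A bare-bones sketch of such a health endpoint (hypothetical, built on HttpListener): because the handler runs on a managed thread, a long stop-the-world pause means the short-timeout check simply fails and the load balancer pulls the box on its own.

    using System;
    using System.Net;

    var listener = new HttpListener();
    listener.Prefixes.Add("http://+:8080/health/");
    listener.Start();

    while (true)
    {
        // During a full GC pause this thread makes no progress, so a health check
        // with a ~3 second timeout fails by itself - no extra coordination required.
        HttpListenerContext ctx = listener.GetContext();
        ctx.Response.StatusCode = 200;
        ctx.Response.Close();
    }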
I'm not familiar with .NET at all, but isn't there some way that you can move a group of objects outside of the GC system and manually manage their allocation and deallocation?
Seems like doing that would be far cleaner than what they actually did and it directly addresses the problem.
Then you have to manage their lifetimes 'manually'. Iterate, ultimately you remove GC altogether.
If GC is the future (and it seems it is) of language runtime, then some kind of control will be needed for situations like this. Different pools, explicit tagging of 'long-lived' or 'cache value' or some such during allocation?
I think that's reductio ad absurdum. It's typical in long-running applications to have a class of memory unlike the rest, that will live the entire lifetime of the application. I see no problem with being able to tell the GC "this memory will always be used," or just allocating it outside of the GC entirely, while still wanting all of the benefits of GC elsewhere in the application.
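One concrete form of the "allocate it outside of the GC entirely" option in .NET is plain unmanaged memory. A rough sketch (compile with /unsafe; Tag and UnmanagedCache are made-up names): the block below is never scanned, compacted, or freed by the collector, so its lifetime is entirely on you.

    using System;
    using System.Runtime.InteropServices;

    struct Tag { public int Id; public int Score; }   // simple blittable value type

    static class UnmanagedCache
    {
        public static unsafe void Demo(int count)
        {
            // Room for `count` Tags outside the managed heap.
            IntPtr block = Marshal.AllocHGlobal(count * sizeof(Tag));
            try
            {
                Tag* tags = (Tag*)block;
                tags[0].Id = 42;              // plain pointer access, no GC involvement
            }
            finally
            {
                Marshal.FreeHGlobal(block);   // forget this and you have a leak the GC won't save you from
            }
        }
    }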
"Different pools, explicit tagging of 'long-lived' or 'cache value' or some such during allocation?"
The idea that immediately came to my mind is that given that the site is probably load-balanced, the best thing to do would be to take the site out of the load balancer while the big GC runs, then put it back in. I wonder if there's some way to hook that GC event. So, "pools", but at a higher level.
We register for notifications from the GC using a magic threshold number that could mean anything.
Then we quickly notify the rest of the webs a GC is pending on our "message bus". They let us know they are safe from GC at the moment. If they are not you are in a pickle.
Then we notify HAProxy that we are about to run a GC and ask it to tell us when it is done draining the current requests and taking our web offline.
Once notified, we perform the GC manually.
Then we notify HAProxy that we are done with the GC so it can add us back to the pool.
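A rough sketch of the GC-notification half of that dance (DrainFromLoadBalancer and ReturnToLoadBalancer are hypothetical stand-ins for the HAProxy and message-bus steps; note that full-GC notifications also required concurrent GC to be disabled in that era's runtime):

    using System;

    static class GcDrainLoop
    {
        static void DrainFromLoadBalancer() { /* hypothetical: ask HAProxy to drain us */ }
        static void ReturnToLoadBalancer()  { /* hypothetical: add us back to the pool */ }

        public static void Run()
        {
            // 10/10 are the "magic threshold numbers" (each must be between 1 and 99).
            GC.RegisterForFullGCNotification(10, 10);

            while (true)
            {
                if (GC.WaitForFullGCApproach() == GCNotificationStatus.Succeeded)
                {
                    DrainFromLoadBalancer();      // wait until in-flight requests are done
                    GC.Collect();                 // take the gen-2 pause while out of rotation
                    GC.WaitForFullGCComplete();
                    ReturnToLoadBalancer();
                }
            }
        }
    }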
Or you could just tell HAProxy you're going to be low priority for a little bit and not worry about whether every single last request on the GCing server gets processed.
I think you've got it exactly backwards. By the time you've dealt with some of the various complexities of manual memory management, for example:
- handling memory with a dynamic lifetime
(with cycle detection as an important corner case)
- improving allocation performance
- preventing heap fragmentation
- thread-local allocators (for multi-threaded scalability)
...
I think you're approaching a garbage collector. In fact, I'd even go so far as to say there is a memory-management analog for Greenspun's tenth rule:
Any sufficiently complicated, long-lived program that manually manages memory contains an ad hoc, informally-specified, bug-ridden, slow implementation of a garbage collector.
I agree there's a place for some amount of "hinting" to the garbage collector about the lifetime of some special objects, but, to me, the important thing is that you only need it for the extreme cases.
It's possible and practical. The MPS allows you to have manually-managed pools alongside GC'ed pools in the same heap: http://www.ravenbrook.com/project/mps/
Useful for long-lived data that doesn't change frequently.
As a developer working primarily in Java, I'm jealous that this is an option for the .NET (C#) people. Not jealous enough to want to work on Windows, or have to trust Mono not to get sued, but jealous, nonetheless.
Garbage collection is a wonderful option. It gets old when it's the only hammer you have.
What a joy to be able to maintain a large cache in your program, rather than having to rely on an external service. What a joy for virtual memory to actually work, rather than constantly (every 2 or 3 seconds) swapping back in something that wouldn't otherwise be used for a while, and even more frequently thrashing on L1/L2 caches.
The most interesting aspect of this from my perspective was the importance of structs in C#. Java doesn't have this construct, making it harder to solve these problems in-memory.
You can simulate it with a bunch of parallel arrays, one array for each field of the "struct". Wrapping up such a construct in a flyweight object (I described this in a comment to the blog post) can then get much of the ease of use back (though you can't override operator== in Java).
This is a good point, but I'd wager parallel arrays are slower because the components of a "struct" don't have spatial locality in memory - thus more cache misses. Of course, this isn't an issue if you're primarily passing the flyweight objects around and not accessing all the fields.
> I'd wager parallel arrays are slower because the components of a "struct" don't have spatial locality in memory
How much would you wager? I'm willing to take your money :)
> Of course, this isn't an issue if you're primarily passing the flyweight objects around and not accessing all the fields.
If you examine codebases, you see that this is actually the prevalent use case for objects in general.
It turns out that in practice, parallel arrays are often significantly faster, because more often than not you _do_ access a lot of (memory consecutive) records, and you _don't_ access all the fields.
Assume e.g. your structure has student names and numeric grades. For display purposes, you are likely to read both name and grade.
However, for just about every other purpose, you are likely to scan/process grades without ever looking at the name (e.g., to compute grade statistics; or to sort by grade). And in this case, you actually have much _better_ cache locality than you would have otherwise -- and also much better memory usage compared to individual allocation of objects.
Furthermore, if you take this approach, you find that your functions become type-concrete but use-generic, reducing your code footprint (both LOC and I-cache), because e.g. instead of "calcAverageGrade(student[] x)", you can just use a routine called "calcAverage(float[] x)", and call it on the grade vector.
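A small C# illustration of that layout (C# since that's what the article uses; the names are made up, and the same idea carries over to Java):

    // Column-major ("parallel arrays"): each field lives in its own contiguous array.
    string[] names  = new string[100000];
    float[]  grades = new float[100000];

    // Type-concrete but use-generic: it averages any float column, not just grades.
    float CalcAverage(float[] xs)
    {
        float sum = 0f;
        foreach (float x in xs) sum += x;
        return xs.Length == 0 ? 0f : sum / xs.Length;
    }

    // Scanning grades never touches names, so every cache line fetched is useful data.
    float averageGrade = CalcAverage(grades);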
APL, J and K take this approach to its logical conclusion. The result is programs that are 10-1000 times shorter than the idiomatically comparable Python, Java, C#, C++ etc., and that have runtimes that compare favorably with C and C++.
> It turns out that in practice, parallel arrays are often significantly faster, because more often than not you _do_ access a lot of (memory consecutive) records, and you _don't_ access all the fields.
This doesn't hold if you're doing random access (which it sounds like they are, given that they're passing around lists of indices). In the random-access scenario, parallel arrays will require roughly O(M) times as many trips to main memory as the struct array, where M is the number of fields accessed.
O(L/M) times as many trips, where L is the <average cache line size> divided by <average field size>. With 64 byte cache lines and 8 byte fields, this is at most 8, even if you have 200 fields in your object. In my experience, when I was still doubting this myself, I was unable to ever measure more than 20% slowdown on the least favorable (for columnar storage) workload, and I was able to measure about 200% speedup on a favorable workload.
Note that even if they do pass around a list of indices, accesses are often (in practice) batched - because records that arrived together tend to be needed together. And the indices are often sorted (because of how they were generated, or because they are easy to sort) which then produces a very predictable access pattern for memory and disk.
We can theorize about probable and improbable access patterns until tomorrow. My experience, and everyone that used column-major order that I'm aware of, is that in practice it almost always wins against row-major ("common") ordering, whether on disk or in memory.
The fastest databases around (kdb+, vertica) use column major order. That should tell you something (about disk, but also about memory)
Sorry, I'm really confused. Is this supposed to be "O(M/L)"?
The note about clumped records is valid. That is often the case. Without knowing more about their workload, it's difficult to say whether they have that behavior, but it's quite likely.
As for the fastest DBs, they use column order for a number of reasons. The big thing is that their access patterns are rarely random. They're optimized for scanning on keys. DBs like Fastbit can be extremely fast for whole-table scans. They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.
> Sorry, I'm really confused. Is this supposed to be "O(M/L)"?
No. As soon as you need to read more than <cache line bytes>, it no longer matters that the fields are near each other. So, e.g., if your record is 128 bytes but your cache line is 32 bytes, you have 4 cache misses per record. If you use only 4 fields from the record, you have 4 cache misses per record -- so we're on parity.
As M increases beyond the cache line size, the advantage (compared to column major) no longer increases. So it is at most a constant in practice.
> The big thing is that their access patterns are rarely random.
That's true for almost all programs, almost all of the time, actually -- which is why I was willing to take the other side of your original wager.
> They aren't so fast if you need to find a specific entry. For maximum speed there, you need a dedicated index, and then it doesn't really matter whether the data is stored row-major or column-major.
And again, that's true for many workloads as well.
The main advantage of row-major is simpler appending; if you write a lot and read little, row-major is faster; this is usually the case for e.g. audit logs.
In most typical cases, column major is equal or better.
> No. As soon as you need to read more than <cache line bytes>, it no longer matters that the fields are near each other. So, e.g., if your record is 128 bytes but your cache line is 32 bytes, you have 4 cache misses per record. If you use only 4 fields from the record, you have 4 cache misses per record -- so we're on parity.
> As M increases beyond the cache line size, the advantage (compared to column major) no longer increases. So it is at most a constant in practice.
Huh? Why would you have 4 cache misses for a 128-byte record? If you need to access 4 fields and they all fall onto the same cache line, you get only one cache miss. If your fields are spread far apart, you might have 4 cache misses, but that isn't a given. With parallel arrays, the fields will be in separate arrays and necessitate 4 cache misses.
> That's true for almost all programs, almost all of the time, actually -- which is why I was willing to take the other side of your original wager.
I didn't make the original wager. :)
> Why would you have 4 cache misses for a 128-byte record?
I wrote earlier "if your cache line is 32 bytes".
If you need all the fields, you will have 4 cache misses. If you need only some of the fields, and they all fall into one cache line, then -- as you said -- you will only have one cache miss.
My point (poorly worded, upon rereading, but I didn't have much time, dayjob and all...) was that the potential advantage of record-major does NOT increase beyond what is packed into one cache line -- so it is a small constant, rather than deserving of O() notation.
> I didn't make the original wager. :)
Ah, apologies. (But maybe you'd still let me take your money? :) )
> My point (poorly worded, upon rereading, but I didn't have much time, dayjob and all...) was that the potential advantage of record-major does NOT increase beyond what is packed into one cache line -- so it is a small constant, rather than deserving of O() notation.
I see what you mean, now. You're right that it's a constant multiplier.
As with C++, structs in C# can have private members. The big difference between structs and classes, though, is that structs are value types. This means that, for example, function local usage of struct instances can forego allocating them on the heap and instead allocate them in the program's activation records (stack frames).
Because of the semantic differences you can't just go switching all your classes to structs and realize a significant performance difference. Code like this:
class Foo
{
    public int bar;
}

public static void Func(Foo[] arrayOfFoos)
{
    Foo f = arrayOfFoos[0];
    // With a class this writes through the reference to the shared heap instance;
    // change Foo to a struct and it only modifies a local copy.
    f.bar = 10;
}
will break in subtle ways because you are no longer manipulating the heap allocated instance via a reference but rather a copy that was allocated local to the function.
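If Foo really were a struct, the usual fix is to write through the array element directly (or pass the element by ref) instead of copying it into a local first - a quick sketch:

    struct Foo
    {
        public int bar;
    }

    public static void Func(Foo[] arrayOfFoos)
    {
        // Indexing the array writes to the element in place; an array of structs
        // stores the values themselves, so there is no separate heap object to alias.
        arrayOfFoos[0].bar = 10;
    }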
Arrays of structs also have better cache locality which can improve performance on number crunching workloads and can also make it easier to exchange data between managed code and native code through a P/Invoke interface.
XNA doesn't even have a generational GC, which makes writing games feel like a constant exercise in tricking the GC.
The GC's marking phase is very slow because it has to traverse the whole object graph each time, meaning you want to make the object graph as simple as possible.
Arrays of structs are very efficient for the GC to scan (it can either collect the whole array or not), but working with structs is a nightmare because there's no polymorphism.
For Fluffy[1] we just used classes and tried to optimize the hotspots as an afterthought. I'm not quite satisfied with how it turned out; there's still a periodic 20-40ms lag in the game.
On Xbox 360, XNA comes with the .NET Compact Framework which indeed includes a pared-down, non-generational GC (http://msdn.microsoft.com/en-us/library/bb203912.aspx). (On Windows though you get a full .NET framework.)
As a developer for Xbox 360, I found your best bet is to:
- monitor the amount of memory allocated (print out GC.GetTotalMemory(false))
- allocate everything during loading and force a GC when you're done loading stuff (a rough sketch of this follows the list).
- ensure you're not creating any garbage at all during your main loop, which will prevent the GC from kicking in at all.
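A rough sketch of that pattern (LoadAllAssetsAndPools is a hypothetical stand-in for whatever preallocates your content, pools, and buffers):

    using System;

    static void LoadAllAssetsAndPools() { /* hypothetical: preallocate everything up front */ }

    // Take the collection hit once, right after loading and before the main loop starts.
    LoadAllAssetsAndPools();
    GC.Collect();

    // If this number keeps growing while the game runs, something in the main loop
    // is still allocating garbage every frame.
    Console.WriteLine(GC.GetTotalMemory(false));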
Structs seem like a very pragmatic solution to the truly hard problem of writing a really good GC algorithm.
Java has a whole selection of very advanced GC algorithms, and from what I hear, it's possible to mitigate most long-pause problems by careful tuning (e.g. incremental+parallel might help here). But JVM tuning is such a black art that really nobody knows how to do it ...