FTA: *Twitter had a service that would have a terrible GC pause every three days...*

squarecog · on Jan 20, 2012

The context isn't clear from these notes.

Full context as explained in the talk: the service used to have a stop-the-world GC pause of 2 minutes every hour. After implementing ByteBuffer-based slab allocation, the pause is only several seconds and happens once every three days. The service runs on 200 nodes, with redundancy. Kicking the process on one node at a time, in a slow roll that finishes in under 3 days, works around the unresponsiveness window (a planned shutdown is easier to manage than an unpredictable pause).
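For anyone unfamiliar with the slab idea, here is a minimal sketch of what ByteBuffer-based slab allocation can look like. This is not Twitter's code; the class name `SlabPool`, the fixed slot size, and the free-list layout are all my own assumptions. The point is only that the payload lives in one big off-heap buffer allocated once, so the collector has almost nothing to trace or compact.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-size slab pool; an illustration, not the service's real allocator. */
final class SlabPool {
    private final ByteBuffer slab;   // one large direct (off-heap) buffer, allocated once
    private final int slotSize;      // every slot has the same size in bytes
    private final int[] freeStack;   // primitive stack of free slot indices (no boxing)
    private int top;                 // number of free slots currently on the stack

    SlabPool(int slotSize, int slotCount) {
        this.slotSize = slotSize;
        this.slab = ByteBuffer.allocateDirect(slotSize * slotCount);
        this.freeStack = new int[slotCount];
        // Fill the stack so that slot 0 is handed out first.
        for (int i = 0; i < slotCount; i++) {
            freeStack[i] = slotCount - 1 - i;
        }
        this.top = slotCount;
    }

    /** Returns a free slot index, or -1 if the slab is full. */
    int allocate() {
        return top == 0 ? -1 : freeStack[--top];
    }

    /** Returns a slot to the pool; the caller must not use it afterwards. */
    void free(int slot) {
        freeStack[top++] = slot;
    }

    /** Copies data into the slot using absolute puts, so no temporary objects are created. */
    void write(int slot, byte[] data) {
        if (data.length > slotSize) {
            throw new IllegalArgumentException("data larger than slot");
        }
        int base = slot * slotSize;
        for (int i = 0; i < data.length; i++) {
            slab.put(base + i, data[i]);
        }
    }

    /** Copies the first `length` bytes of the slot into a new array. */
    byte[] read(int slot, int length) {
        if (length > slotSize) {
            throw new IllegalArgumentException("length larger than slot");
        }
        byte[] out = new byte[length];
        int base = slot * slotSize;
        for (int i = 0; i < length; i++) {
            out[i] = slab.get(base + i);
        }
        return out;
    }
}
```

Usage would be along the lines of `int s = pool.allocate(); pool.write(s, bytes); ... pool.free(s);`. The trade-off is that you are doing manual memory management again: nothing stops a caller from reading a slot after freeing it.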

It's totally a workaround and not a solution. Attila follows up this example with an anecdote about asking the Oracle folks when they are going to have true pauseless GC; they responded with "not that big an issue, really, everyone finds a workaround..." So, this is an example of a workaround. A pretty good one, once you realize it's not "the one" machine, it's "some one" machine out of 200.
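To put rough numbers on the slow roll: if you assume the full 3 days (72 hours) as the upper bound for the window, 200 nodes works out to one restart roughly every 21.6 minutes. A quick sketch of that arithmetic, with a hypothetical `RollingRestart` helper (the names and the evenly spaced schedule are my assumptions, not something from the talk):

```java
import java.time.Duration;

/** Hypothetical helper for spacing restarts evenly across a fleet. */
final class RollingRestart {
    /** Gap between consecutive restarts so every node is cycled once per window. */
    static Duration interval(int nodes, Duration window) {
        return window.dividedBy(nodes);
    }

    /** Offset from the start of the roll at which node `nodeIndex` (0-based) is restarted. */
    static Duration offsetFor(int nodeIndex, int nodes, Duration window) {
        return interval(nodes, window).multipliedBy(nodeIndex);
    }
}
```

With `interval(200, Duration.ofHours(72))` the gap is 1296 seconds, so at any given moment at most one node out of 200 is down on purpose, and the last node is restarted before the 72 hours are up.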

bdunbar · on Jan 20, 2012

Ah - in context, for that problem, I agree it's 'okay'.

Thanks for clarifying.