Dismissing Python Garbage Collection at Instagram (engineering.instagram.com)
258 points by shivawu on Jan 18, 2017 | 85 comments



This is a great post. I like how they walked through all the steps and especially the "perf" tool.

Ruby has a patch to do the same thing -- increase sharing by moving reference counts out of the object itself:

Index here:

http://www.rubyenterpriseedition.com/faq.html#what_is_this

First post in a long series:

http://izumi.plan99.net/blog/index.php/2007/07/25/making-rub...

I think these patches or something similar may have made it into Ruby 2.0:

http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...

https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-...

The Dalvik VM (now replaced by ART) also did this to run on phones with 64 MiB of memory:

https://www.youtube.com/watch?v=ptjedOZEXPM

I think PHP might do it too. It feels like Python should be doing this as well.
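To make the copy-on-write problem concrete, here's a minimal sketch of my own (Linux-specific, since it reads /proc/self/smaps_rollup; exact numbers will vary) showing how a forked CPython child dirties pages just by reading shared objects, because the refcount lives inside each object:

    import os

    def private_dirty_kb():
        # Linux-specific: kB of pages this process has written (CoW-copied).
        with open("/proc/self/smaps_rollup") as f:
            for line in f:
                if line.startswith("Private_Dirty:"):
                    return int(line.split()[1])

    # Parent builds a large heap of small objects before forking.
    data = [str(i) for i in range(1000000)]

    if os.fork() == 0:
        before = private_dirty_kb()
        # The child only *reads* the list, but each access bumps ob_refcnt
        # inside the object, dirtying the page and forcing a CoW copy.
        for s in data:
            pass
        after = private_dirty_kb()
        print("dirtied by a 'read-only' pass: %d kB" % (after - before))
        os._exit(0)

    os.wait()

Moving the counts (or mark bits) out of the objects, as REE did, is what keeps those pages clean.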


It's basically "cheating" at GC by exploiting a very narrow use case. I saw a trick like this at Smalltalk Solutions in 2000 with a 3D game debugging tool. The "GC" actually simply threw everything away for each frame tick.

Someone needs to come up with something like a functional language based on a trick like this. Or maybe a meta-language akin to RPython, so people can write domain-specific little languages for doing things like serving web requests, combined with a domain-specific "cheating" GC that can get away with doing much less work than a full general-purpose GC.

Couldn't a pure functional programming environment be structured to allow for such GC "cheating?"


ErlangVM's GC does exactly this. Memory is scoped in a closure around a process, and any time the process goes down, all its memory is thrown away.

Also, any time a function's scope terminates, all its memory immediately goes away.

This can be done because it's a functional language with immutable data structures.


I work in VBScript inside of classic ASP, and the statement "Also, any time a function's scope terminates, all its memory immediately goes away." is true there. I doubt anyone would suggest that VBScript is a functional language with immutable data structures.

Additionally, all memory used by a page is cleaned up when the page is finished processing. This has to do with "Memory is scoped" more than immutable data structures.


I think you mean memory usage is lexically scoped, because no pointers to the blocks are saved or returned? In the same sense that "functional" brings some useful baggage with it, so does lexical scope.


> Also, any time a function's scope terminates, all its memory immediately goes away.

I don't think this is accurate; those values remain until GC is triggered for the process or the process terminates.


Erlang's GC is the perfect example of ‘work smarter, not harder’. It's not a particularly advanced GC, it just does so little work that it's soft-real-time capable.

As usual, excellent prog21 article: http://prog21.dadgum.com/16.html (I miss prog21 already…)


The article didn't read that way to me.

The GC is a fallback mechanism to catch any refcounting loops, but the reference counting is still in force, so non-looped objects still do get collected.

When the GC runs, everything gets touched. Without it, the refcounting still hits some shared objects, but this is reduced enough to be worth it, and the heap doesn't grow without bounds.
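To illustrate that split (just a sketch of stock CPython behavior, nothing Instagram-specific):

    import gc

    gc.disable()  # switch off the cycle collector; refcounting keeps working

    class Node(object):
        pass

    # Acyclic object: freed immediately by refcounting when the last
    # reference goes away, even with the collector off.
    a = Node()
    del a

    # Cyclic objects: their refcounts never reach zero, so they sit on
    # the heap until a collection runs and breaks the cycle.
    b, c = Node(), Node()
    b.peer, c.peer = c, b
    del b, c

    print(gc.collect())  # non-zero: the cyclic garbage refcounting missed

The article's change only removes the second mechanism; the first keeps collecting everything that isn't in a cycle.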


It sounds a bit like you're describing region based memory management. https://en.wikipedia.org/wiki/Region-based_memory_management


I'm not sure why you would call it cheating -- it seems like a very principled engineering decision to me.

Even if you never deploy, it's not like Python processes will run forever if you have GC on. Big deployments usually recycle their processes regularly. Why wouldn't you? There's no downside once you have a big enough cluster.

I recall hearing that YouTube also runs with GC off, but I don't have a source for that.


I don't think anything pejorative is intended, but it is a bit like taking out a loan with no intention of repaying it.

Another way of looking at it is that you are optimizing away a feature that Python needs in order to be a general-purpose language, but that is not needed for your purpose.


As long as you have the build infrastructure to support patched versions of Python, and can run multiple ones, I don't see the problem.

When you're running at scale like Instagram, it's inevitable that you need to do stuff like this. You're just pushing the limits of what open source software has been tested for.

I think this solution is a hack in the good sense... in particular because you are taking something away rather than adding crap on top, which is the typical solution. I wouldn't be as impressed if they came up with a fancy new GC that took 18 months to write and only worked for their workload. That would be the wrong way of thinking about the problem.


> I'm not sure why you would call it cheating -- it seems like a very principled engineering decision to me.

I like sounding "bad."



I immediately thought of BEAM while reading this.


I think C++ and Rust have reference-counted objects that can't collect cycles. Combine that with an arena allocator and it sounds close to what you are describing. It's more explicit, though, whereas Python does the reference counting for you here.


You could have a general-purpose "heap" interface that controls allocation. This would mean that you could control allocation space (for games this is really useful), and tearing down the heap would tear down its owned objects. You could either hand out weak pointers or statically tie object lifetimes to the lifetime of the heap, like Rust does.


I'm stupid so I don't know better but I think Go's new proposed GC is supposed to do something similar. Throw away everything that was created in a go-routine once the go-routine exits.


PHP works very similarly to what they describe in the post. It has a garbage collector that's effectively reference counting, but the whole heap just gets trashed at the end of every request.


It's not functional, and not automatic, but Rust has an arena type which supports very fast allocation and de-allocation, with the restriction that de-allocation has to happen all at once.


I find Instagram's engineering blog to be really awesome (I especially like their content on PostgreSQL). It also seems like they managed to implement a solid solution to a problem they were facing.

That being said, I wonder if their team considered implementing a different language that was meant to work without GC overhead. I'm all for working with something you're familiar with, but this seems like they've hit the point where they know enough of the problem surface area that they should be able to start optimizing for more than just 10% efficiency by turning off a selling point of safer languages.


> That being said, I wonder if their team considered implementing a different language that was meant to work without GC overhead.

Er, hold on there. As discussed in the article, Python only uses GC to cope with cases that refcounting can't handle, which turns out in some situations to be a relatively small amount. The performance problem here didn't actually come from any "fundamental overhead" of GC, just an incidental implementation detail.

Deciding to jump languages just because a 10% performance gain pointed them in that direction (which, as I've discussed above, it didn't) would be ill-advised.

> by turning off a selling point of safer languages

I don't know if I'd say GC is a selling point of a safer language; it doesn't really have a lot to do with safety in itself, and the authors certainly didn't remove any language safety by disabling the GC.


Changing your code from one language to another is sufficiently time consuming and challenging that it's almost never done successfully.


Nice. I worked on something like this at an internship. I wrote a Unicorn-like preload-fork multiprocess server in Ruby (for other reasons).

I realized that the workload (which involved a large amount of long-lived static data on the heap) would have seen enormous memory savings, if only we weren't running with Ruby 1.9's mark-and-sweep GC, whose mark phase wrote to every live object and so defeated copy-on-write sharing.

I briefly experimented with turning off GC and periodically killing workers. Thankfully, in that situation, all we actually had to do was upgrade to Ruby 2.2, which does have a proper CoW-friendly incremental GC algorithm.

`fork` is awesome.


One of their issues was that Python runs a final GC call before process exit. Why does Python run that final GC call if the process is exiting anyway?


Perhaps it's to make sure all the objects' finalizers are run as well, releasing everything, not just memory and file descriptors? The finalizers are user-defined, so they could include side effects that the GC would not know about.


Does Python guarantee that finalizers are called at all? Most languages don't, I thought.


PEP 442 describes the situation with good detail, including how it was in Python 2 and how Python 3 improves on it:

https://www.python.org/dev/peps/pep-0442/


I know Haskell does: https://hackage.haskell.org/package/base-4.9.1.0/docs/Foreig... I'm not sure finalizers make sense without a guarantee they will eventually be called. But Haskell, for example, makes no guarantees about when they will be called, between the moment the last reference is lost and the moment the program exits.

Python seems to, too: https://docs.python.org/3/library/weakref.html, "When the program exits, each remaining live finalizer is called unless its atexit attribute has been set to false. They are called in reverse order of creation."
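For example, a small sketch of the documented behavior:

    import weakref

    class Resource(object):
        pass

    r = Resource()
    f = weakref.finalize(r, print, "releasing resource")

    # If `r` is collected first, the callback runs at that point; otherwise
    # it still runs at interpreter exit, unless you opt out with:
    # f.atexit = False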


Do they need to be run in a specific order? Could you maintain a linked list of all objects, and at exit go through it and run the finalizers? Might be faster than actually reclaiming the memory as well.


Maintaining that list through the whole process duration just to use it in the end sounds expensive, both in CPU and memory terms.


In fact, this is now fixed in Python 3.6. When gc is disabled by the user, the final collection is not done.

The effect is faster shutdown with less "copy on read", but some __del__ methods and finally: blocks in some generators might never get called.
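Here's a sketch (my own example, not from the article) of the kind of code that's affected; with the collector disabled and no final sweep at exit, __del__ on cyclic garbage may simply never run:

    import gc

    gc.disable()

    class Conn(object):
        def __del__(self):
            print("closing connection")  # cleanup you were counting on

    a, b = Conn(), Conn()
    a.peer, b.peer = b, a   # reference cycle
    del a, b
    # Refcounting can't free the cycle; with gc off and (on 3.6+, when the
    # user disabled gc) no final collection at shutdown, "closing
    # connection" may never be printed.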


Is it just me, or does this look like the typical example of a short-term hack that will blow up in your face pretty quickly and turn your life into a constant stream of low-level tinkering?

I suppose people at Instagram didn't just stop there, but are also planning a more long-term solution to optimizing their stack (aka migration to a more performant language).


> turn your life in a constant stream of low-level tinkering

It appears to me this is the fate of anyone that has to make garbage collection scale. [1] I guess it's all worth it; build fast, win big and then struggle with the GC as an exercise in technical debt once you can afford enough staff to focus on the problem. Dreary.

The insight that reference counting turns reads into writes is interesting; I think I've encountered that before but can't remember where. As the gap between CPU throughput and RAM latency grows, this becomes an ever greater point of pain.

> aka migration to a more performant language

That happens occasionally with small systems, and it usually works. But big operations with complex code bases don't do that very often. Facebook has re-implemented PHP at least twice now to improve run time performance without changing their language. Just one anecdote in a long list of heroic anecdotes about those who will do any number of backflips and somersaults to avoid replacing their chosen language.

[1] https://news.ycombinator.com/item?id=12043271


Based on the article, I would say it didn't start out as a goal to optimize. The engineer noticed something strange and was very interested in learning why, and tried to optimize at the same time. This sounds like a fun place to work.


> Instagram can run 10% more efficiently

Seems quite risky/costly for a mere 10% computational efficiency gain. If you're going to change the memory model of a programming language, you might as well shoot for a 10x improvement instead of 10%.


At scales like Instagram/FB, 10% is a big improvement.


Then 10x should be even better for their users! It would be a noticeable difference at that point.


At their scale, it's not about making it faster for the users. It's already fast enough. It's about making the clusters more efficient to save money. At this scale, they have already made things a lot faster for their users by throwing money at it.


That's not how this works: just because they improve internal infrastructure does not mean it will translate directly to something an end user will see.


Fun fact: Lisp originally had no GC. It just allocated and allocated memory till there was none left, and then it died, after which the user dumped their working heap to tape, restarted Lisp, and loaded the heap back from tape. Since only the "live" objects were actually written, the heap took up less memory than before and the user could keep going.


> Instagram’s web server runs on Django in a multi-process mode with a master process that forks itself to create dozens of worker processes that take incoming user requests.

So this is all a workaround for Python's inability to use threads effectively. Instead of one process with lots of threads, they have many processes with shared memory.


Just a small nitpick: it's not shared memory, it's copy-on-write memory that comes from forking. It would be interesting to see an explanation of the tradeoffs between threading and multiprocessing in their app servers too!


Right, they're not using shared memory for interprocess communication. The article used the term "shared memory", but they just mean read-only sharing.


Noting that some other library might call `gc.enable()` is correct. But then ignoring the fact that another library can just as easily call `gc.set_threshold()` with a non-zero value seems like an obvious bug waiting to happen, and the same issue as something calling `gc.enable()`.
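For reference, the two knobs being discussed (a sketch; behavior as documented in the gc module):

    import gc

    gc.disable()         # clears the 'enabled' flag; any library that later
                         # calls gc.enable() silently turns collection back on

    gc.set_threshold(0)  # a zero generation-0 threshold means automatic
                         # collection never triggers, even while 'enabled' --
                         # but a library calling gc.set_threshold(700) undoes
                         # this just as easily, which is the point above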


This is called out-of-band GC. We've been doing it for years in Ruby with Unicorn: https://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-ou...

However, when the Ruby community moved to Puma, which is based on both processes and threads, it was needed less. Not that this is rocket science (it's still far behind the JVM and .NET); I assume a hybrid process/thread model is something that hasn't reached critical mass in the Python/Django/Flask/Bottle community?


uWSGI can definitely run in a hybrid process/thread model.


They mentioned msgpack was calling gc.enable(), but it looks like that issue was fixed quite a while ago in version 0.2.2:

https://github.com/msgpack/msgpack-python/blob/2481c64cf162d...


This writing feels a little sloppy.

> At Instagram, we do the simple thing first. [...] Lesson learned: prove your theory before going for it.

So do they no longer do the simple thing first?

More on topic: this seems like they optimized something in a way that might really constrain them down the road. Now, if anyone creates objects that aren't covered by ref-counting (i.e. reference cycles), they will get OOMs.


Carl Meyer, a Django core dev, presented at Django Under the Hood on using Django at Instagram. It was a really good talk that goes through how they scaled and what metrics they use for measuring performance. https://youtu.be/lx5WQjXLlq8


I actually didn't know that CPython had a way of breaking reference cycles. I seem to remember reading that reference counting was the only form of garbage collection that CPython did. Maybe this was the case in the past?


It was introduced in Python 2.0. You're not alone ... I still think of 2.0 as kinda new myself. :)

Look in this doc for "Optional Collection of Cyclical Garbage": https://www.python.org/download/releases/2.0/


Released in October 2000... yeah, that's probably around the last time I paid any attention to the implementation details of GC in Python.


> Each CoW triggers a page fault in the process.

Maybe I misunderstood how page faults work, but I thought this process was reversed, i.e. each page fault triggers a CoW, not the other way around?


It could help to colocate all the refcounts in a contiguous block of memory, column-store style. You would only get a page fault per 1024 objects or so (e.g. with 4 KiB pages and 4-byte counts, 4096 / 4 = 1024 counts fit in one page).


I'm confused - doesn't a worker run out of memory if GC is disabled?


Python is also reference counted, and this does the bulk of the work - the GC is just for cycles that refcounting misses. Instagram has the master process that spins up the Python workers kill and replace any that eventually exceed the allowed memory threshold.


It looks like Instagram's change disabled refcounting too, otherwise they'd still be doing copy-on-read for the refcount variable?


They abandoned that change because it didn't solve their problem.


Isn't that costly? You're wiping everything, including memory that could be reused - even the executable loaded into memory and other things that go into a process - and then running all the process's initialization code again.


Question from someone who doesn't do this sort of work: is this typical? That memory balloons enough that you have to periodically kill things sounds like fuzzy logic somewhere to me.

I get that software is complex and people have deadlines...


This is pretty common for a forked server model. You already need to handle the case where the worker process dies, so it's simple to also occasionally kill it on purpose. Memory thresholding is nice, but you also have things like MaxRequestsPerChild from Apache. Restarting the worker after say 100,000 requests is cheaper than profiling / tracking down slow memory leaks. OTOH, when you get down to MaxRequestsPerChild 10, you have clear problems that should be easy to track down; you can also do CPU usage thresholding to limit damage from infinite loops that are hard to reproduce.


Yes, this is typical. I have never worked somewhere that wrote web applications in Python or Ruby and didn't have their process supervisor kill processes after they exceeded a memory threshold.


Killing a process is a lot more efficient than GC ever is. If your servers are stateless, it makes far, far more sense to enable killing them safely than to engineer the world's most efficient GC.


The system we use has a "soft" limit; the worker isn't killed outright when it reaches the limit, it just checks after finishing each request and kills itself if the limit has been exceeded, leading the manager to spawn a fresh worker.
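Conceptually something like this (a hypothetical sketch; the names and the limit are made up, and ru_maxrss is reported in kilobytes on Linux):

    import os
    import resource

    MEMORY_LIMIT_KB = 512 * 1024   # hypothetical per-worker budget

    def handle(request):
        ...                        # normal application work

    def worker_loop(requests):
        for request in requests:
            handle(request)
            # Soft limit: check only between requests, then exit cleanly
            # and let the master process fork a replacement worker.
            rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            if rss_kb > MEMORY_LIMIT_KB:
                os._exit(0)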


It will eventually, but the master process monitors and restarts it when the RSS reaches a certain threshold. So it's indeed a trade-off.


This is actually very clever, and really solves their problem pretty neatly!


Using threading to handle user requests with Python seems very wrong to me. They might see solid improvement by ditching WSGI and employing a non-blocking solution (like Tornado, aiohttp or Sanic), running on PyPy as multiple instances behind a load balancer.


It's not multi-threaded; it's multi-process. Non-blocking IO is also not a magical fix for any of their relevant problems.


I didn't know about atop or perf profiling. Cool write up.


Instead of a bunch of hacks that are obviously going to blow up in someone's face one day, why not just use a more suitable platform?

Forking threads for web pages is so old school... And Python is a terrible choice for something at their scale.

Just redo the hosting bit in Java or golang and call it a day. If their UI code is sufficiently isolated from the back end it's not a huge deal.

Instagram is a pretty small application feature-wise; a few devs could probably do it in a couple of months.


Instagram is a living example that when you have big scaling problems, you just scale. You are probably right with your arguments... the question is: is it worth choosing a good scaling architecture when it slows your initial development down? Instagram is Python, many YouTube parts are Python, and FB is even worse: PHP in many parts.

So what? Sure, they would scale better using Erlang, Java or Go... but sometimes it is wiser to finish building something than to make the most ultra-scalable system. If you are really successful, you will find ways to scale.


I've heard this so many times, but I always wonder how the unicorn effect skews this. Do a majority of companies with such scaling issues fail? I've seen products go down in flames firsthand because performance was so bad. Never at a start-up, but I assume it happens there too.

Facebook is a good example because of all the money they've thrown at a problem they shouldn't have. First was a custom PHP interpreter, then a compiler, and now Hack. If they didn't have nearly unlimited money to throw at it, would things have ended differently?

Language choice is one of the easiest choices to make. Pick a fast one out of the box if you plan to get big. It's not like the faster languages take orders of magnitude longer to write code in; the effect is minimal at best.


> Forking threads for web pages is so old school

I don't think you understand how uWSGI works.

> And Python is a terrible choice for something at their scale.

Maybe they could use whatever youtube uses for their frontend instead?


> I don't think you understand how uWSGI works.

...or gunicorn. Or twisted / hendrix. Or any WSGI container with process consciousness in the past 5+ years.


So you suggest a couple of months of work instead of a two-line hack that yielded a huge improvement for them, because it is more "correct"?


Why should they rewrite everything if it works for them?


Because it doesn't. What they're doing is borderline crazy. Similar to turning off the GC in the JVM... once you reach that point, why not just use a more suitable language?

Java/Golang/Rust/C++/C# are all about 100X faster than Python. Wouldn't it be more reasonable to use one of those than mess with the runtime to squeeze out 20% more performance?


> Similar to turning off the GC in the JVM

It's absolutely not the same. The JVM uses GC for all object collection. Python uses reference counting and only relies on GC to clean up cycles.


It does. They're profitable even with a fairly small engineering team considering the size of their user base.


> turning off the GC in the JVM

very common in the HFT world


If you think about it, this approach is actually very similar to the FaaS/Lambda/Serverless model. Each request lives in its own little container which gets thrown away after every execution. This approach means you reduce the amount of shared state and lots of problems like garbage collection either get easier or go away.


Not really. Where they recovered the 10% memory was by keeping shared state shared, and reducing the amount of memory that was copied into each process.

With something like Django, there's quite a big startup tax you have to pay, so at the scale of Instagram they would need quite a lot more servers to handle the same load if every request was served lambda style.


I think you're both right and wrong. You're right in that if you did run Django inside Lambda it would be inefficient. But you're wrong because that's not how you should adopt Lambda. This is a similar situation to everybody re-architecting all their single-point-of-failure on-premise apps to the cloud in the first place.

The point of Lambda is that you don't start up Django with each request. You can think of the fixed/shared part of Django being replaced by API Gateway. Lambda then replaces the non-shared threads that get started for each request.


I see where you're coming from, but not all state that could have been shared can be pushed into the API Gateway, like code objects and settings. Adopting Lambda as you suggest would require a re-architecture (not Django), and at that point you're probably better off going with something more efficient. They even mention that the memory cache hit ratio improved, giving more CPU throughput, which totally disappears if you're starting a new process for every request.



