For those looking for a technical explanation, the PHP garbage collector in this case is probably wasting a ton of CPU cycles trying to collect thousands of objects during the solving process (a LOT of objects are created to represent all the inter-package rules when solving dependencies). It keeps trying as objects are allocated; it cannot collect anything but still has to check them all every time it triggers.
Disabling GC just kills the advanced GC but leaves the basic reference counting approach to freeing memory, so Composer can keep trucking without using much more memory, as the GC wasn't really collecting anything. The memory reduction many people report is instead due to some other improvements we made yesterday.
As to why the problem went unnoticed for so long, it seems that the GC is not able to be observed by profilers, so whenever we looked at profiles to improve things we obviously did not spot the issue. In most cases though this isn't an issue and I would NOT recommend everyone disable GC on their project :) GC is very useful in many cases, especially long-running workers, but the Composer solver falls outside the use cases it's made for.
> As to why the problem went unnoticed for so long, it seems that the GC is not able to be observed by profilers, so whenever we looked at profiles to improve things we obviously did not spot the issue.
That sounds like a bug in the profiler, not with Composer. Observing internal time is pretty important for any profiler.
That would help speed up the network part of the install/update yes. The linked patch on the other hand hopefully gets the CPU part to a decent level for most people.
Out of curiosity what tools do you use for profiling and finding these sorts of things? Plain old xdebug, xhprof, or other things? I'm going to have to jump into debugging a fairly large Symfony application within the next couple months and am on the look out for good tools to help me along.
I had the same issue in Python recently. The project runs as a server that loads a huge number of objects from the database, and could use as much as 10GB of memory! Python's reference counting works great, but every so often, the full-heap-scanning cycle collector would run, and it took quite a lot of time to scan a multi-GB heap.
We noticed the issue happened most often when deserializing objects (loading them from Redis to memory). As it turns out, Python would schedule a collection every time the object_created counter was sufficiently higher than the object_destroyed counter. In general, this makes sense, because that way you can be sure that objects are being created and not being freed, which most likely means a resource leak or a reference cycle. However, the same thing happens during deserialization: many new objects are created, and none are freed. Coupled with Python's low threshold (700), GC was triggered many, many times in every deserialization loop (usually in vain, as no new objects became recyclable). Disabling GC and running full collections manually solved the problem.
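A minimal sketch of that workaround, assuming CPython's standard `gc` module. The function `load_objects` and the record shape are hypothetical stand-ins for the Redis deserialization loop, not code from the project described above.

```python
import gc


def load_objects(raw_records):
    """Build many long-lived objects with the cycle collector paused.

    `raw_records` is a hypothetical stand-in for data fetched from Redis.
    """
    was_enabled = gc.isenabled()
    gc.disable()  # reference counting still frees non-cyclic garbage
    try:
        # Allocation-heavy phase: nothing created here becomes garbage,
        # so automatic collections would only rescan live objects.
        return [{"id": i, "payload": rec} for i, rec in enumerate(raw_records)]
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # one manual full collection at a point we choose


objects = load_objects(["a", "b", "c"])
```

The `try`/`finally` restores the previous collector state even if deserialization raises, so the pause stays scoped to the loading phase.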
The truth is, I don't understand the point on having to download MBs of stupid animated images I will not even look at when I expect to see a commit diff.
That user is using PEAR, which is the old shitty PHP "CPAN"
That's the reason for the huge memory usage. We're slowly moving away from PEAR, but since it works for now not everyone has/will transition.
Edit: I should also point out that there are a few packages that almost everyone uses (PHPMD, PHPCS, PHPUnit) that are still mostly pulled from PEAR, though I think PHPUnit has a Composer option.
PHPUnit stopped updating PEAR in April, and actively destroyed PHPUnit on their PEAR repo at the start of this month (i.e. a patch version update that consisted just of a bash script printing a migration message...).
Recursively gathering all dependencies a project might have. A huge downside of the modern scripting languages landscape is that dependency graphs can get quite convoluted
> Also, memory nowadays is cheap, CPU power isn't.
Unless you're running your deployment on a 512MB or 1GB VM. I've had Composer max out swap on those too. Even with 2GB of RAM it's not been happy sometimes, so it'll be interesting to see what difference this patch makes.
You shouldn't be running composer update on your deployment, just composer install which doesn't take as much memory since it doesn't have to resolve dependencies.
IMO you should commit your composer.lock file up to your repository and then use composer.phar install --no-dev --optimize-autoloader on any production instance. Install is much faster and uses hardly any memory compared to the update command.
To add/update any dependencies for your project run the composer.phar update on your development environment or somewhere it can use a ton of memory and cpu without issue. Then just commit and push up your composer.lock changes. Been doing it this way for over a year and had no issues deploying changes in ec2.
Interesting. I was looking at the comments hoping for some more technical background, but unfortunately they seem to have been run over by the animated gif crowd.
It seems when you start to hit the memory limit PHP's automatic garbage collection will loop through the constructed objects to see if any can be cleaned up.
If none can (and in the case of Composer all the objects exist for a reason) then it's wasting time analysing the objects.
So in this case there's only a large waste of cpu doing nothing with gc enabled.
Could someone more versed with PHP, and this project explain why turning off garbage collection helped so much? and why they didn't turn it back on at the end of the function?
PHP is reference counted, so memory is typically freed as soon as an object is no longer needed. Cycles are the exception, and they can cause memory leaks, so in version 5.3 PHP added a cycle collector, which reads every object in memory and very occasionally deletes objects that are disconnected but still have reference counts greater than zero (cycles).
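The same split between reference counting and cycle collection can be observed in CPython, which uses a similar design. This is an illustrative Python analogue rather than PHP code; the `Node` class is hypothetical and the behaviour shown is specific to CPython.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


gc.disable()  # leave only reference counting active

# Non-cyclic object: freed the moment its last reference goes away.
plain = Node()
plain_alive = weakref.ref(plain)
del plain
freed_by_refcount = plain_alive() is None

# Two objects pointing at each other: their counts never reach zero.
a, b = Node(), Node()
a.other, b.other = b, a
cycle_alive = weakref.ref(a)
del a, b
leaked_without_collector = cycle_alive() is not None

gc.collect()  # run the cycle collector by hand
freed_by_collector = cycle_alive() is None
gc.enable()
```

The weak references only observe whether each object is still alive; they do not keep it alive themselves.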
In my opinion, the PHP cycle collector is a pointless waste of time. In Objective-C, Apple just lets the memory leak by default, and they give you tools to find the leaks, and then you modify the code to break the cycles.
There is no need to turn cycle collection back on at the end of the program, because the OS frees the memory at program termination.
I agree that the cycle collector is a pointless waste of time. Most scripts run briefly enough that the memory leak doesn't really matter.
But for long-running scripts, it's either a cycle collector or adding support for weak references. IMO, given how references are stored in PHP, and with my limited knowledge of PHP core, I am quite sure a cycle collector is more beneficial in both developer time and usefulness. (Not every programmer knows how to manage reference cycles.)
I'll note that in common usage, PHP does not exit entirely at the end of every HTTP request. (By default, PHP-FPM never exits between requests.) You would, at the least, have to keep track of all live objects and delete them at the end of each request... which sounds like a garbage collector to me.
> Could someone more versed with PHP, and this project explain why turning off garbage collection helped so much?
The cycle collector is relatively recent, I expect it's not very performant (since most PHP applications don't need it) and composer's dependency resolution may be hitting a pathological case (create lots of objects without cycles, triggering lots of collections but no actually useful work)
> and why they didn't turn it back on at the end of the function?
Since it's a package manager, I'd guess the expectation is the process will die soon-ish afterwards (once it's installed whatever it's resolved). There's a discussion of re-enabling it after dependency resolution (so postinstall hooks run with GC enabled) though.
Garbage collection is slow, but reduces memory usage. So disabling it costs memory. Also, Composer does not keep running: once the job is done, the script terminates, so you don't have to enable GC again (it's only disabled in the context of the current execution).
I remember a story about a friend of mine in an algorithmic contest for high school students in Poland (which are quite hard). He solved the problems correctly, but his implementation had to check on every loop iteration whether a collection still had any elements. He used col.size()==0 instead of col.isEmpty(). The former was O(n), and it fucked up all the performance.
Not really, some containers have a linear-time size by design. The canonical example is a linked list in which you wish to keep the splice-another-list-at-middle time constant.
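The trade-off can be sketched with a toy singly linked list that stores no length. All names here are hypothetical. Moving a range of nodes between lists touches only three links, which works only because neither list has a count to update; the price is that `size()` has to walk the list while `is_empty()` does not.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


class LinkedList:
    """Toy singly linked list with a sentinel head and no stored length."""

    def __init__(self, values=()):
        self.sentinel = Node(None)
        node = self.sentinel
        for value in values:
            node.next = Node(value)
            node = node.next

    def is_empty(self):
        # O(1): only looks at the first link.
        return self.sentinel.next is None

    def size(self):
        # O(n): walks every node because no count is kept.
        count, node = 0, self.sentinel.next
        while node is not None:
            count += 1
            node = node.next
        return count

    def splice_after(self, pos, before_first, last):
        """Move the nodes after `before_first` up to and including `last`
        (taken from any list) so that they follow `pos`.

        O(1): no node inside the range is visited. `pos` must not lie
        inside the range being moved.
        """
        first = before_first.next
        before_first.next = last.next  # unlink the range from its old list
        last.next = pos.next           # link the range in after pos
        pos.next = first

    def to_list(self):
        out, node = [], self.sentinel.next
        while node is not None:
            out.append(node.value)
            node = node.next
        return out


target = LinkedList([1, 2, 6])
donor = LinkedList([3, 4, 5, 99])
pos = target.sentinel.next.next        # node holding 2
last = donor.sentinel.next.next.next   # node holding 5
target.splice_after(pos, donor.sentinel, last)
```

After the call, `target` holds 1 through 6 and `donor` holds only 99. A list that cached its length would have to count the moved range to stay consistent, which makes the range splice linear.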
> Behold, found something in the docs about garbage collection:
>> Therefore, it is probably wise to call gc_collect_cycles() just before you call gc_disable() to free up the memory that could be lost through possible roots that are already recorded in the root buffer. [...]
Ugh. I remember that, I had posted on that to explain to the developer why there was so much attention and kept receiving mails and notifications from github for ages. At the time there was no way to "stop watching" when you had commented on something, if I remember correctly.
I still have all those notification emails; I am still thinking about graphing them one day to see how the commenting rate on this thread evolved over time.
Am I the only one that considers this disgusting? If the GC is so bad that it causes 2-10x slower operation in this use case, then it's a bad GC. I mean really, really bad. Short-lived objects in any modern GC should be swept away trivially without a lot of overhead. Of course we're talking about PHP here, so perhaps it's redundant to say something about it sucks, but jesus...runtimes that require hacks like this should be taken out back and shot.
PHP uses ref-counting for most garbage collection. That means non-cyclic data structures are collected eagerly, as soon as the last reference to an object is removed.
Naïve ref-counting can't collect cyclic data structures, though. Normally, cycles are "collected" in PHP by just waiting until the request is done and ditching everything. That works great for web sites, but makes less sense for a command line app like Composer.
To better reclaim memory, PHP now has a cycle collector. Whenever a ref-count is decremented but not zero, that means a new island of detached cyclic objects could have been created. When this happens, it adds that object to an array of possible cyclic roots.
When that array gets full (10,000 elements), the cycle collector is triggered. This walks the array and tries to collect any cyclic objects. They reference this paper[1] for their algorithm for doing this, but what they describe just sounds like a regular simple synchronous cycle collector to me.
The basic process is pretty simple. Starting at an object that could be the beginning of some cyclic graph, speculatively decrement the ref-count of everything it refers to. If any of them go to zero, recursively do that to everything they refer to and so on. When that's done, if you end up with any objects that are at zero references, they can be collected. For everything left, undo the speculative decrements.
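A toy Python sketch of trial deletion along those lines. It is an illustration of the general technique, not PHP's actual implementation; `Obj`, `reachable`, and `find_cyclic_garbage` are hypothetical names, and references from outside the graph are modelled by bumping `refcount` by hand.

```python
class Obj:
    """Toy heap object with an explicit reference count."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0
        self.children = []

    def point_to(self, other):
        self.children.append(other)
        other.refcount += 1


def reachable(root):
    """Return every object reachable from root (root included)."""
    seen, order, stack = set(), [], [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        order.append(obj)
        stack.extend(obj.children)
    return order


def find_cyclic_garbage(root):
    """Trial deletion from one candidate root; returns the dead objects."""
    graph = reachable(root)

    # 1. Speculatively remove every reference held inside the subgraph.
    for obj in graph:
        for child in obj.children:
            child.refcount -= 1

    # 2. A count still above zero means something outside the subgraph
    #    holds the object, so it and everything it reaches are live.
    live = set()
    for obj in graph:
        if obj.refcount > 0:
            live.update(id(o) for o in reachable(obj))

    # 3. Undo the speculative decrements made by live objects.
    for obj in graph:
        if id(obj) in live:
            for child in obj.children:
                child.refcount += 1

    return [obj for obj in graph if id(obj) not in live]


# Dead island: a <-> b, with no outside references.
a, b = Obj("a"), Obj("b")
a.point_to(b)
b.point_to(a)
dead = find_cyclic_garbage(a)

# Live cycle: c <-> d, but d is also held from outside (e.g. a variable).
c, d = Obj("c"), Obj("d")
c.point_to(d)
d.point_to(c)
d.refcount += 1  # external reference
survivors = find_cyclic_garbage(c)
```

The second call walks the whole subgraph, restores every count, and returns nothing to free, which is the wasted work a collector does when the candidate root belongs to a live graph.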
If you have a large live object graph, this process can be super slow: you have to traverse the entire object graph. If there are few dead objects, you burn a bunch of time doing this and don't get anything back.
Meanwhile, you're busy adding and removing references to live objects, so that potential root array is constantly filling up, re-triggering the same ineffective collection over and over again. Note that this happens even when you aren't allocating: just assigning references is enough to fill the array.
To me, this is the real problem compared to other languages. You shouldn't thrash your GC if you aren't allocating anything!
Disabling the GC (which only disables the cycle collector, not the regular delete-on-zero-refs) avoids that. However, it has a side effect. Once the potential root array is full, any new potential roots get discarded. That means even if you re-enable the cycle collector later, those cyclic objects may never be collected. Probably not a problem for Composer since it's a command-line app that exits when done, but not a good idea for a long-running app.
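The triggering behaviour described above can be modelled with a toy root buffer in Python. This is a sketch built only from the description in this comment, not from PHP's source: `ToyRootBuffer` and `simulate` are hypothetical names, skipping objects that are already buffered is an assumption, and the collection itself is reduced to a counter.

```python
ROOT_BUFFER_CAPACITY = 10_000  # figure quoted in this thread


class ToyRootBuffer:
    """Toy model of a buffer of possible cycle roots (not PHP's real code)."""

    def __init__(self, capacity=ROOT_BUFFER_CAPACITY, enabled=True):
        self.capacity = capacity
        self.enabled = enabled
        self.roots = set()
        self.collections = 0  # how many times the collector ran
        self.discarded = 0    # candidates dropped because the buffer was full

    def record_possible_root(self, obj_id):
        """Call when a ref-count is decremented but stays above zero."""
        if obj_id in self.roots:
            return  # assumption: an object is buffered at most once
        if len(self.roots) >= self.capacity:
            if not self.enabled:
                self.discarded += 1
                return
            self.collect()
        self.roots.add(obj_id)

    def collect(self):
        # A real collector would trial-delete from every root here; with
        # an all-live object graph that work frees nothing.
        self.collections += 1
        self.roots.clear()


def simulate(mutations, enabled):
    """Feed `mutations` distinct possible roots into a fresh buffer."""
    buf = ToyRootBuffer(enabled=enabled)
    for obj_id in range(mutations):
        buf.record_possible_root(obj_id)
    return buf


with_gc = simulate(50_000, enabled=True)
without_gc = simulate(50_000, enabled=False)
```

In this model, 50,000 reference mutations trigger four full collections with the collector enabled. With it disabled there are none, but the buffer stays full and the remaining 40,000 candidates are discarded.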
There are other things PHP could do here:
1. Don't use ref-counting. Use a normal tracing GC. Then you only kick off GC based on allocation pressure, not just by mutating memory. Obviously, this would be a big change!
2. Consider prioritizing and incrementally processing the root array. If it kept track of how often the same object reappeared in the root array each GC, it can get a sense of "hey, we're probably not going to collect this". Sort the array by priority so that potentially cyclic objects that have been live in the past are at one end. Then don't process the whole array: just process for a while and stop.
The commit is great. I love that the comments have spiraled completely out of control. At this point, 30 minutes after the link was posted, the comment thread is now a competition to see who can post the best gif.
I know we're serious here, but stuff like this reminds me why I love the internet so much. It's fun to cut loose once in a while.
Technically, it will let you post them as links, it just won't upload and embed them. But a plugin or a userscript would fix that, provided they're posted somewhere with an easy to deal with API like imgur.
I think most of those gifs are hosted on github. I tried to post a linked image, and it only showed the hyperlink. When I uploaded the image to github, it showed up inline.
Strongly disagree. My team puts images in comments to document visual design changes. Occasionally, we'll even drop in an animated gif to illustrate a workflow/process. These can be extremely helpful in code review.
No. Concern trolling and tone policing needs to fuck right the fuck off.
Do you find you have much success motivating people when you use this tone of voice?
I reported this issue in February. Others have reported it before I did. What is Composer's response? Silence and negligence while its maintainers go to cons and drink beer.
Their priorities are totally fucking backwards.
I would strongly advise using less inflammatory titles than Horrendously Stupid and Ill-Advised Install Instructions if you wish people to not ignore you.
You might not want to do this, and that's fine and your prerogative.
If you do want to see a significant increase in cooperation, dropping the attitude will be a quick and effective way of doing so.
I attempted to be civil months ago. It got ignored.
You pointed out a valid concern, a few people kicked the tyres of it, and when it turned out that the fix entails a good deal of work - something you've skipped over - you decided to become "rude."
My goal is to fix the problem
As I said, your current method of trying to get it fixed plainly isn't working.
If you start and end at the same place (i.e. nothing gets done), the only difference being that you've made people want to actively avoid you, is that a good result for you?
One of the advantages of open source is that you have the freedom to take the code and fix issues that concern you.
You can Google "open source" for more information - that might be a more productive use of your time than waiting for people who owe you nothing to make changes that you want.
Yes I'm well aware of what open source is and how it works (if you notice I'm pretty active on github); I've already sent a PR to fix part of the problem before.
However, it makes very little sense for them to, for example, merge a PR if it uses my public key for verification and not theirs. The problem I'm running into is that I can't just fix the issue without their cooperation.
Regardless, this is all moot because the Composer maintainer is now communicating with me privately to discuss how to fix the problem so it will be solved some day soon.