I'm the author of the original 'refork' feature introduced to Puma a couple of years ago [1], which served as an inspiration for this project.
As I understand it, Pitchfork is an implementation of a reforking feature on top of Unicorn, a single-threaded multi-process Rack HTTP server.
The general idea of 'reforking' is that in addition to pre-forking worker processes at initialization time, you re-fork them on an ongoing basis, in order to optimize copy-on-write efficiency across processes, especially for memory allocations that might happen after initialization. The Pitchfork author mentions sharing YJIT machine code as a particularly important use-case [2], one that has become relevant in recent Ruby releases.
I never saw much interest in the concept when I originally introduced the reforking feature to Puma, and it remained a kind of interesting experiment that never saw much production uptake. I'm thrilled that Shopify is iterating on the concept through this project and I hope the technique will continue to be refined and developed, and see broader adoption as a result.
Hey Will, Pitchfork author here. As mentioned in the README, thanks for your Puma PR; it was indeed quite instrumental in Pitchfork's inception.
As for why it wasn't used much as a Puma feature, I don't know, but there are a few challenges with it that prevented me from putting it in production. The most important one is that new workers end up being grandchildren of the original process, so if the middle process dies, you may end up with zombies, etc.
It’s also very scary to fork a process that may have live threads currently processing a request. I believe Pitchfork solves most of that.
I noticed the double fork with PR_SET_CHILD_SUBREAPER to reparent the new workers, which is a nice reliability improvement for the edge case where the middle parent worker crashes. It adds a Linux dependency (as noted), but that probably still covers most production use cases where the extra reliability that comes with reparenting is most needed. This enhancement could probably be incorporated into Puma.
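For readers unfamiliar with the trick, the reparenting behavior is easy to demonstrate in plain Ruby via Fiddle. This is a minimal sketch and not Pitchfork's actual code; it assumes Linux, and the constant value is taken from `<sys/prctl.h>`:

```ruby
require "fiddle"

# Linux-only: value of PR_SET_CHILD_SUBREAPER from <sys/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36

libc = Fiddle.dlopen(nil)
prctl = Fiddle::Function.new(
  libc["prctl"],
  [Fiddle::TYPE_INT, Fiddle::TYPE_LONG, Fiddle::TYPE_LONG,
   Fiddle::TYPE_LONG, Fiddle::TYPE_LONG],
  Fiddle::TYPE_INT
)
raise "prctl failed" unless prctl.call(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0).zero?

# Double fork: the middle process exits immediately, and because we are
# a subreaper, the grandchild "worker" is reparented to us instead of
# to PID 1, so we can still wait(2) on it.
middle = fork do
  fork { sleep 0.1 } # the long-lived worker
end
Process.wait(middle)      # reap the short-lived middle process
grandchild = Process.wait # reaps the reparented worker
```

Without the `PR_SET_CHILD_SUBREAPER` call, the final `Process.wait` would raise `Errno::ECHILD`, because the orphaned grandchild would be reparented to PID 1 instead of to us.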
As for the concern about forking a process that may have live threads currently processing a request, this should already be solved in the Puma implementation. The worker shuts down and finishes serving all pending requests before reforking. There is also an `on_refork` hook to trigger extra garbage collection to maximize copy-on-write efficiency, or to close any connections to remote servers (database, Redis, ...) that were opened while the server was running.
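For reference, the hook looks roughly like this in a Puma config file. This is a sketch assuming Puma 5+'s `fork_worker` mode; the Active Record call is illustrative and only applies if your app uses it:

```ruby
# config/puma.rb
fork_worker # enable Puma's worker-0-based refork mode

on_refork do
  # Run a few major GCs so the promoted worker's heap is compact
  # before it becomes the copy-on-write source for future children.
  3.times { GC.start }

  # Close connections opened while serving traffic; forked children
  # will reconnect lazily on first use.
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord::Base)
end
```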
Neither Puma nor Pitchfork is generally used as a static file server, since they're not particularly well-suited to that; they're used as Ruby application servers. The memory consumption and savings being discussed are oriented around the memory required by the Ruby VM to process a dynamic request (e.g., a request to a Ruby on Rails application).
That's fair. I was trying to guess where the misunderstanding arose from and I guessed wrong. I'm sorry about that.
Rounding out the previous thought, the idea with many forking servers is to boot up to the point before a request is served and then fork off for each request. You do gain CoW benefits, but if you have any lazy data structures that are reified in the call, now each child is faulting. Pitchfork will take a child that has processed a request and promote it as the parent, replacing the original process. Now, this new parent is the new CoW base with the expectation that forks from that will result in even greater memory sharing. For a framework like Rails, there's a lot that happens after a request is received.
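The generational idea can be sketched in a few lines of plain Ruby. This is a toy illustration of the mechanism, not Pitchfork's API: a warmed-up worker forks the next generation itself, so memory faulted in while serving requests is inherited copy-on-write instead of being rebuilt in every child.

```ruby
reader, writer = IO.pipe

gen0 = fork do
  reader.close
  # Memory faulted in only after boot: lazy caches, JIT code, etc.
  warm = "x" * 1_000_000

  # The warmed worker forks the next generation itself, so `warm`
  # is shared copy-on-write rather than rebuilt per child.
  gen1 = fork do
    writer.puts(warm.bytesize) # inherited, already paged in
  end
  Process.wait(gen1)
end

writer.close
inherited_bytes = reader.gets.to_i
Process.wait(gen0)
```

In Pitchfork the promoted process (the "mold") keeps serving as the fork source, so each generation shares more of the warmed-up heap than the last.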
I have now worked at several significantly-sized companies that ripped Ruby out of everything as they scaled, usually replaced with Go, though none as large as Shopify.
I wonder what that financial calculus looks like between these options for a large organization:
1. trying to basically reinvent the workings of an entire programming language and migrate your Ruby apps to these new tools/runtimes/typecheckers/whatever
2. incrementally rewriting critical paths to something more performant (Go/Rust) & just throwing more servers at the Ruby stuff you haven't managed to replace yet.
> significantly-sized companies that ripped Ruby out of everything as they scaled
I am assuming 'Ruby' here means Ruby on Rails. Or specifically, that they ripped Rails out of everything as they scaled?
The calculus, in many situations, would indeed have been in favour of another framework or language in ~2013. But the cost of a CPU core has dropped significantly since then; we are not far from having 128 x86 CPU cores in a single socket. Even if you could only do 10 requests per second per core at a target latency, that is 1280 requests per second on a single server. (In the context of a server-rendered app, not an API-serving one.)
Edit: Miscalculation here. It should be 128 cores, 256 vCPUs, so 2560 RPS... but the main point still stands.
But at the scale of Shopify, with a ~$33B market cap (ouch... I remember they passed a $100B market cap in 2020 and $200B in 2021...), they can finally afford to pour some resources into Ruby, the only top-10 programming language without large corporate backing. I am pretty sure the performance potential of YJIT and Ruby tooling has an ROI of less than 2 years at their scale.
My personal experience is that the bulk of the time these rewrites don't end up delivering the performance increases that people expect, because the people involved don't really understand what was making the prior system slow and don't develop good enough requirements.
This is not to say you shouldn't; it doesn't really matter to me. But I think a lot of the time this is driven by engineers who feel like they need to do it, rather than by a financial decision to save money.
My experience is along the following lines (same for Django as well actually):
1. Company started as Rails monolith
2. Some components of the Rails monolith stretch Ruby capabilities beyond where "throw more servers at it" is still reasonable or effective
3. Factor out some backend functionality into dedicated services which can be scaled separately, Rails still serves as the API gateway calling these services. Anything without scaling issues stays in the monolith for now.
4. Eventually Rails is just an API gateway, no one in the org knows Ruby/Rails and its dependency management madness any more, and it gets replaced with a more-performant, purpose-built API gateway, usually something off-the-shelf.
A lot of the time, I ask myself, "why don't we seem to have a tool for robust source-to-source transformation?" You could call it "cross-language refactoring". I wouldn't be surprised if someone with a code base the size of Google's has some internal tool. Maybe this is just wishful thinking. The closest thing I've seen is in academic research, e.g. [1].
The cost of this challenge is likely bigger than the market. How many large code bases need to be translated from X to Y? Probably not very many. There are only so many legacy COBOL systems.
Big issues, off the top of my head:
Legacy support. Legacy code is already difficult to understand; if it was originally written in another language, that would only make it harder to grok.
Transpiled output is often ugly. Developers in general seem to dislike modifying generated code for a number of reasons. I imagine transpiled code would end up implicitly frozen, and new code would only be added to new files.
Language idioms are hard to translate. Some structures are unique to languages and their equivalents may be considered bad in other languages.
You would either need equivalent libraries in new languages or also translate dependencies. You could probably find equivalent libraries, but their scope may vary.
Large scale code organization varies between languages. Things like Dependency Injection may not translate at all. Some languages may use config and some use code for the same thing.
Except without actually solving the problem of making the translation nice and readable. But this kind of thing at least gives you options for changing language.
> trying to basically reinvent the workings of an entire programming language and migrate your Ruby apps to these new tools/runtimes/typecheckers/whatever
Facebook is probably the best example of this with PHP and Hack [1]
I’m still surprised that we haven’t seen much larger improvements in Ruby performance over the last decade given the large number of major tech companies using Rails.
Yes, I'm aware of 3x3, but for the most common usage of Ruby, which is for Rails apps, there hasn't been anything like the gains PHP saw from 5.x to 7.x.
Especially from Microsoft, who has deep compiler/language expertise, and who owns GitHub (large Rails app).
>> Yes, I'm aware of 3x3, but for the most common usage of Ruby, which is for Rails apps, there hasn't been anything like the gains PHP saw from 5.x to 7.x.
This has simply not been my experience. The jump in performance in the 3.x series has been awesome, to the point where I even downgraded / removed servers.
Ironically, the greatest speed (and memory) improvements have come from swapping out the supporting JS ecosystem components that Rails uses, for example swapping Webpack for the Go-based esbuild.
Not following what you mean. This server is from a Shopify employee (major enough?), and it achieves quite impressive improvements in terms of memory usage. Moreover, Shopify is contributing a production-grade JIT to the CRuby runtime.
Howdy, I'm on the Shopify team that is working on both pitchfork and a few different performance-improvement projects for Ruby. There's a ton of activity around Ruby performance right now!
I think we're entering a period of increased experimentation and rapid evolution as demonstrated by projects like YJIT[1][2], improved inline caching[3][4] and Object Shapes[5] (also used by V8), and variable-width allocation[6][7], and smaller improvements like better constant invalidation[8]. Significant investments in TruffleRuby[9] are still going on by Oracle, Shopify, and other companies.
And recently, Takashi Kokubun gave a talk at Ruby Kaigi about the future of JIT compilers in Ruby that gives a peek at a whole new set of optimizations Ruby can work on (as well as some performance comparisons against other interpreted languages)[10]. You may be surprised to see how well Ruby (with the JIT enabled) performs compared to Python 3.
All of which is to say, I think there's quite a bit of performance improvement being made in recent Rubies, and that trend will likely continue for quite some time.
Update: I forgot to mention that some very notable computer science researchers and their teams are working in the Ruby community now! [11]
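If you want to see where your own setup stands, checking whether YJIT is compiled in and active takes a few lines (Ruby 3.1+; on older Rubies the constant simply isn't defined):

```ruby
# RubyVM::YJIT is defined when the interpreter was built with YJIT
# support; it is enabled at runtime with the --yjit flag (or the
# RUBY_YJIT_ENABLE=1 environment variable in newer releases).
yjit_status =
  if defined?(RubyVM::YJIT)
    RubyVM::YJIT.enabled? ? "enabled" : "built, but not enabled (run with --yjit)"
  else
    "not built into this Ruby"
  end
puts yjit_status
```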
Could you comment on any projects within Shopify that are helping Ruby's concurrency story? I'm aware of Ractors (https://docs.ruby-lang.org/en/master/ractor_md.html) and Fibers, but it's unclear how feasible these primitives currently are for building the abstractions on top of them that would make Rails more concurrent.
https://github.com/socketry/falcon is an interesting project, but again, it's not clear how difficult it would be to deploy a Rails app on top of it. My experience with these concurrency models in Rails apps is that one single gem could make a blocking IO call (for example) and negate all of the concurrency/performance gains. It would be cool if Ruby could be configured to throw errors for these kinds of calls, to make finding and fixing offending gems easier.
There are a lot of really great projects happening and plenty to be hopeful about, but when that stuff will land, and what changes the rest of the community and ecosystem should think about making, still isn't clear.
To be honest, Shopify isn’t particularly invested in Ruby concurrency.
That’s not really a use case we have, and the community is already investing a lot in that direction (ioquatix with fiber scheduler and ko1 with N:M threads and Ractors)
Which part of raw web performance? These don't seem like web performance benchmarks. IOW I'm not sure how calculating digits of PI impacts serving HTML.
Seems like we need benchmarks for actual web workloads. :)
There have been plenty of major performance improvements for Ruby, but most of them require changes that just aren't feasible for a large codebase. For example, there are alternative Ruby interpreters that are far faster than MRI, but they're not usually a drop-in replacement because they don't have full gem compatibility, or really general ecosystem compatibility.
Also, some performance improvements like JIT just don't matter that much for Rails, because the primary problem isn't single-thread performance; it's the lack of cheap parallelism.
>Especially from Microsoft, who has deep compiler/language expertise, and who owns GitHub (large Rails app).
I thought it was a missed opportunity that those working inside GitHub could not persuade Microsoft to at least budget out additional resources for Ruby. Although they might have their own battle trying to keep everything on Rails and not moving to .NET.
It does worry me a bit that all the resources are coming from Shopify; hopefully they can keep growing so that resources aren't a problem in the near future.
You are making multiple comments about Microsoft and Shopify not putting resources into Ruby/Rails, but both companies have core contributors on staff.
Where did I say Shopify isn't putting resources in, when the last part of my comment was exactly about Shopify putting resources into Ruby and Rails?
Does Microsoft contribute to Ruby? Or do you mean GitHub has core contributors on Rails? Name me a single Microsoft employee actively helping in Ruby Core.
> I’m still surprised that we haven’t seen much larger improvements in Ruby performance over the last decade given the large number of major tech companies using Rails.
Every place I've worked in the last ~5 years has treated Ruby as "legacy" code for better or worse.
I wonder how this will handle Websockets. With Rails and Hotwire Turbo, the story to support sprinkling in Websockets for broadcasting events is there. Puma can handle long running connections which makes it suitable to use for that (reverse proxied by nginx of course). Would Unicorn and Pitchfork work the same here in that as long as it's reverse proxied by nginx it'll be fine for Websockets, even with thousands of open connections? What about in development where you wouldn't likely have nginx in front of your app?
That's a great project, very interested to see how it plays out. In my previous job we used Unicorn but didn't hit any GVL contention issues (much smaller scale than Shopify's).
As a suggestion to the authors: A section titled "Why not Unicorn?" explaining the rationale behind this and why one would want to choose it over Unicorn, would be very helpful. If not in the current experimental stage, then in the "stable" one.
For those of you scaling Ruby on Rails: are you more constrained by memory or CPU consumption?
If you could choose a 50% reduction in process memory footprint, versus a 50% reduction in CPU cycles to serve your average request, which would you pick?
Second this. After spending several years full-time just troubleshooting other people's web apps, I can say not only that database locks are, in 99.99% of cases, the real bottleneck, but also that developers in general are quite bad at managing databases and will try to micro-optimize just about anything before looking at their DB queries.
> Database locks is usual issue. Avoidable if planned better.
I've often seen problems because of N+1 querying in particular, but maybe that's just the majority of codebases that I've seen.
> Memory is second. In 15 years CPU has never been a real concern.
In most stacks that I've worked in, memory seems to be the main cause of problems: be it a Ruby application, a PHP one, Python one or Java.
Though perhaps that's because larger servers are expensive and development environments in particular tend to be on the more conservative side of things. Either that, or the apps that I've worked with are good examples of Wirth's law and refuse to even run with less than 2 GB of RAM (though maybe that's more along the lines of commentary about enterprise Java apps in particular). When your development/test server has something like 8 GB of RAM or your computer has 16 GB of RAM and you need to run 7 or 8 of those apps (as well as IDE instances), you run into problems.
CPU is generally not an issue for most applications, though it can be for various databases and data stores. Perhaps it's batch migrations or just lots of in-database processing, or honestly just most cases where you use Elasticsearch: not only does it seem to eat as much RAM as you'll give it, but the same seems to apply to the CPU (especially when used as the data store for something like Skywalking APM).
On the bright side: most of the modern tech stacks are generally reasonable and there are often micro frameworks (or just more lightweight ones) to be found, which can be suitable for some use cases. Don't want Rails, use Sinatra. Don't want Laravel, use Lumen. Don't want Django, use Flask. Don't want Spring Boot, use Quarkus. That said, Node with Express.js and .NET with ASP.NET are pretty lightweight out of the box, which is nice. Of course, if you go down this route, you might end up with something that is better suited to a bare API, instead of fancy features like server side rendering/templating etc.
Database locks is our issue. We're doing anywhere from 5-10k req/s to our RDBMS. This could be minimized if not completely avoided, and I'm keen to make it so, but I got here a bit late and the underlying foundation of our application isn't something I can safely describe.
Most of our work happens in the background. We're running a Sidekiq process per CPU core with a concurrency of 20 — which, while the default, seemed to play the best during our tests. We set a max RSS of 2000 and use some code to automatically scale Heroku based on queue size. We do a lot of network IO.
It was a headache to get here, and a hole in our wallet, but I'm mostly happy with the application. Previously, we were running Resque in production on a super over-provisioned AWS node that was costing us $xx,xxx a month: Resque forks the app process for concurrency (we weigh about 450 MB); Sidekiq is a threaded model. Heroku, at least in this case, has been exponentially cheaper. We keep our RDS on AWS for at least some savings versus having it on Heroku. I cannot imagine what a xx-TB database would cost us at Heroku.
Why choose one over the other? Pitchfork's selling point is that it allows you to reduce CPU usage with YJIT while still reducing memory usage compared to other servers.
From the Reddit thread linked above (lots of context there, worth reading), the author answers:
> what's the main benefit?
Drastically reduced memory usage by improving Copy-on-Write performance. For mid-sized apps it can even use less memory than puma depending on how you set it up. See this synthetic benchmark.
> Why did you decide to build reforking into a fork of Unicorn instead of contributing it to Puma?
Mostly for simplicity. You can build this in Puma, but there are a few extra challenges.
For instance you don't want to refork when another thread is currently processing a request, you'd risk leaving some global resource in a corrupted state. So you'd first need to stop accepting traffic and wait for all requests to complete.
The problem is, Puma doesn't have a request timeout, so if a request never completes, what do you do?
Lots of small challenges like that. But I think it would be awesome if Puma were to take this idea back. Puma's fork_worker feature was a big part of the inspiration, after all, and I'd even be happy to help Nate or someone else do it.
Puma is a multi-thread, multi-process Rack HTTP server that implements a request buffer.
By comparison, Unicorn (the project upon which Pitchfork is based) is a single-thread, multi-process Rack HTTP server that does not buffer requests, so it's only designed for fast clients (or it needs to be paired with a proxy like Nginx to handle slow clients).
The narrower design of Unicorn results in a simpler, less flexible, but potentially more efficient architecture.
The reforking feature that Pitchfork introduces to Unicorn was originally implemented in Puma [1], though they're controlled differently and the underlying implementations are entirely different.
That was only forking the Apache process. This is forking the whole Ruby app, which also happens to speak HTTP. The gain is on the app side, not in the best way to serve HTTP, where this is probably a loss compared to threaded or event-based solutions.
Many serious applications back then built by developers who cared about performance and latency made use of mod_perl. It put the entire Perl interpreter (which then loaded the Perl modules that defined the application) into the preforked Apache httpd processes.
Oftentimes you'd put a proxy in front, serving static assets with a lighter-weight HTTP server, so that slow loads wouldn't tax the limited server capacity of big httpd processes that included full Perl interpreters and all application code.
It's very similar to that setup. I don't think that was particularly common, though, and it's probably the reason prefork went away, particularly as Apache and later nginx were used as the user-facing web servers proxying to something else inside (first with FastCGI and later just HTTP). The preforking advantage then moved to that inner server, which is what this is. Nginx or something similar will still be running in front of this, with a threading or event model instead of forking. So this is not going back to an old idea; it never really left. It just left Apache, because no one ran Apache as an application server anymore and there are better ways to run an HTTP frontend. Preforking is also not what's new here; this is an optimization of the preforking model in Unicorn, which is over a decade old.
It went away because it was resource-expensive. Even with copy-on-write semantics, those backend httpds, with an entire Perl runtime in each, were pretty huge and inefficient.
The models that replaced it were single-process multi-threaded and asynchronous (never block) models that didn't create or require one of many application processes per request.
I think the choice of the preforking model was a stability and portability thing. If one of the children crashed, it was no big deal, whereas a crash in a single-process application or webserver will cause everything to go down. Also, back in those days threading semantics varied widely between operating systems, the APIs themselves differed, and LinuxThreads was really just too new to run in production.
But it didn't go away. Event-based servers were developed to handle slow clients with good performance as frontends for pre-fork or threaded application servers. This innovation is an optimization of a pre-fork backend server model that is itself almost 15 years old.
[1] https://github.com/puma/puma/pull/2099 ; https://github.com/puma/puma/blob/master/5.0-Upgrade.md#fork...
[2] https://github.com/Shopify/pitchfork/pull/1#issuecomment-125...