> When crashes do occur an engineer needs to spend time to diagnose how it happened and what caused it. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.
> In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We've also discovered hardware issues on a few machines. In the past, ruling out rare memory bugs caused by our software, even after significant debugging, was nearly impossible.
That's quite the endorsement of Rust. A lot of people focus on the fact that Rust can't absolutely guarantee freedom from crashes and memory safety issues, which I think misses the point: this kind of experience of running high-traffic Rust services in production for months with hardly a single issue is common in practice.
I had a very similar experience. Much smaller scale, but the service kept internal state and clients connected over WebSockets. It could handle up to a million clients on one server and practically never crashed. While I was writing it I had only hobby-level experience with Rust, and I was also mentoring a colleague, so he wrote a big chunk of code as a total Rust noob.
I have a theory that people who are noobs in a certain technology, but are overall experienced, tend to produce very solid code, because they actively avoid doing clever things.
This is my experience (albeit at smaller scale) exactly. If this weren't the case, Rust would be snake oil and I'd be the first to admit it. Until my most recent endeavor with a Rust backend, among other things, I simply didn't know it was possible to not have to debug crashes. Seriously. The pleasure of maintaining Rust software in production is so wildly different from anything I've ever experienced in the past that I gladly submit to the compiler over and over and over again. It's worth the investment a hundredfold.
I had the same experience when I wrote a camera capture / motion detection / video logging service for a commercial smart refrigerator in Rust. The Swift component crashed at least twice a week, the Rust component ran for months and months without issue.
Similar experience over here too, I've been running a Rust server in production for >1yr now and had absolutely no crashes. Resource usage has been nice and constant too. It's fantastic :)
Which aspect(s) of Rust do you think are most responsible for this? (e.g. borrow checker, memory safety, culture that attracts devs who care about reliability, etc)
Not the person you're asking, but the culture around data representation strikes me as the biggest factor:
1. Making only valid states representable, to the greatest extent reasonably achievable.
2. Treating errors as regular data, not an afterthought, with language features to make that not too painful.
3. Not returning placeholder values (i.e. if a parse function gives you back a value, the input actually parsed correctly; you never get a placeholder plus an error flag stashed somewhere else).
Language features, in particular "enums" (aka algebraic data types, aka tagged unions), make this approach possible. You couldn't do it in Go, for instance, even if there were a cultural decision to.
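A minimal sketch of the pattern (all names here are invented for illustration):

```rust
// An enum makes only valid states representable: a connection is exactly
// one of these variants, never a half-initialized mix of flags.
#[allow(dead_code)]
enum Connection {
    Disconnected,
    Connecting { attempt: u32 },
    Connected { session_id: u64 },
}

// Errors are ordinary data in the return type. If you get an Ok value
// back, the input parsed; there is no placeholder plus an error flag
// stashed somewhere else.
fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse()
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port: {e}"),
    }
}
```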
It is of course a combination of all these aspects.
A type system that can express thread safety (Send/Sync traits) is incredibly valuable when building multi-threaded systems.
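As a small illustration (my sketch, not from the parent comment): the compiler simply refuses to move non-thread-safe types across a thread boundary, so a whole class of sharing bugs never compiles.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Arc<T> is Send + Sync, so it may move into another thread.
    let shared = Arc::new(vec![1, 2, 3]);
    let handle = thread::spawn({
        let shared = Arc::clone(&shared);
        move || println!("{:?}", shared)
    });
    handle.join().unwrap();

    // By contrast, this would not compile, because Rc<T> is not Send:
    // let local = std::rc::Rc::new(vec![1, 2, 3]);
    // thread::spawn(move || println!("{:?}", local));
}
```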
A universal definition of what is safe, plus standard traits and borrowing rules, makes APIs more predictable. Just from a function's signature you know a lot about its behavior, without having to look for gotchas in the manual.
Mandatory error handling prevents cutting corners. Unit testing is built-in.
Generics, good inlining, and Cargo help split code into libraries without a performance or usability hit, which helps make focused, well-tested components.
Most of these things aren't groundbreaking, but Rust, being new, had the luxury of picking current best practices and sensible defaults.
I think ultimately the big thing Rust brings to the table is that a lot of its features are geared towards detecting problems as early as possible.
This means you get complexity, especially in the "initial phases".
* The initial learning curve is steeper than usual. You have x types of string instead of one or two, you have lifetimes, etc, etc. Fortunately you pay it once and then can use it many times (maybe with some revisits to the docs :P)
* The initial building of a feature might take a bit longer than in other less complex languages (the borrow checker gives you a particularly tricky error, etc). But what is happening here is that the language is picking up problems that other languages ... don't. They "defer to production".
Others are mentioning that cargo is another feature that helps. I think it does, but indirectly. What cargo does is improve the developer experience. Having all the packages at the tips of your fingers, without needing to jump through hoops like in other languages, is just ... nice. Given the initial ramp-up in complexity, and the initial harder-than-usual first write, anything that improves the UX goes toward alleviating that. I personally would put cargo in the same bucket as "helpful compiler messages" and "good docs".
> You have x types of string instead of one or two
Funny how you said X, when there are really only 3 types of strings in the standard library. You generally only encounter 2 of them: String and OsString. OsString, while annoying, is one of the things that makes Rust safe. You only ever deal with the 3rd type, CString, when you work with FFI, at which point you'd better already know how strings work in C.
I said X because people include some and not others, and the conversation rapidly deviates (as shown on this thread). The point remains that there’s more ways to skin that particular cat in Rust than in other languages.
There is also PathBuf, which would use a string in most languages. And sometimes one may need to work with Vec<u8> or Vec<char>. And then the slice/view variants of all the above.
Yeah, when Rust beginners (like me!) think "types of string" I think it's more likely to be &str (and its lifetime variants), String and Vec<u8>. I've gotten pretty far as a beginner by sticking to over-copying with just String, but eventually I think I'll have to figure out what works best for perf.
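For what it's worth, the conversions between those types are explicit but mechanical; a rough sketch:

```rust
fn main() {
    let owned: String = String::from("hello");       // owned, growable, heap-allocated
    let view: &str = &owned;                         // borrowed view into the same bytes
    let bytes: Vec<u8> = owned.clone().into_bytes(); // the raw UTF-8 bytes

    // Going back from bytes requires proving they are valid UTF-8:
    let roundtrip = String::from_utf8(bytes).unwrap();
    assert_eq!(roundtrip, view);
}
```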
> There is also PathBuf, which would use a string in most languages.
I feel bad for them. I wanted to fix something in Neovim's built-in LSP recently and lost all interest because of all the things you have to do to:
- Get parent directory for a path
- Create that directory if it doesn't exist
Why was I annoyed by that? Well, different OSes use different path separators; in Rust this is trivial because `Path` and `PathBuf` are target-specific.
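Something like this, for instance (a sketch; the helper name is made up):

```rust
use std::fs;
use std::path::Path;

// Create the parent directory of `file` if it doesn't exist yet.
fn ensure_parent_dir(file: &Path) -> std::io::Result<()> {
    // parent() understands the platform's separator; no string splitting.
    if let Some(dir) = file.parent() {
        // Creates missing ancestors too, and is a no-op if it already exists.
        fs::create_dir_all(dir)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    ensure_parent_dir(Path::new("logs/2022/09/output.log"))
}
```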
That's not a type of string though, it's a path, the fact that it uses `OsString` underneath is irrelevant.
> Vec<u8> or Vec<char>
There isn't really anything special to learn about them, though? I doubt newbies encounter this while learning. Slices are the same thing, you just can't extend them; what is there to learn?
- I think memory safety is a baseline. You'll note that memory safe languages already tend to be much more reliable than non-memory-safe languages in general.
- Then you have the error handling. A lot of unreliability in my code in other languages comes from unhandled exceptions that only occur rarely. Rust generally puts all possible error conditions in the type signature of the function, meaning it's actually feasible to handle every failure case.
- Speaking of unhandled exceptions, a lot of those in typed languages tend to be caused by null. Rust does not have null. Instead it has Option, and it is impossible to access the contents of an option without doing the equivalent of a null check. So that entire class of errors is gone.
- Both Result (used for error handling) and Option (used instead of null) are what Rust calls enums, and what are more generally called Sum Types. I think these are a huge deal. They allow you to safely represent data that may be one thing or another with very strict type checking. These are broadly very useful in API design, and in my experience lead to much more robust code than the class hierarchies you need in OOP languages or unions which lack the safety checks. (Aside: sum types would be quite simple to add to other languages. I have no idea why they haven't been added yet).
- Speaking of classes, inheritance is not supported. So that's a bunch of confusing code that just isn't possible to write. This can add a bit of boilerplate to Rust code, but it makes it more straightforward and less bug prone.
- You mentioned the borrow checker. That definitely helps. It's yet another tool that allows you to write APIs that cannot be misused. A great example is Rust's Mutex type: it can prove at compile time that code does not hold on to references to the protected data beyond the duration that the lock is held (see the sketch after this list).
- Speaking of Mutex: Rust's Send and Sync traits provide very good thread safety. You almost don't need to worry about thread safety at all in Rust. Most concurrency bugs, data races included, are prevented by the compiler (you can still write deadlocks and higher-level race conditions).
- Newtypes allow you to check invariants once and then have the type system enforce that they remain satisfied (sketched after this list).
- All type casts are explicit.
- Lots of other little things
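To make the Mutex point concrete, a minimal sketch: the protected data is only reachable through a guard, and the guard's lifetime is the lock.

```rust
use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0u64);

    {
        // lock() returns a guard; the data is only reachable through it.
        let mut n = counter.lock().unwrap();
        *n += 1;
    } // Guard dropped here: the lock is released. Keeping a reference to
      // the data past this point would be a compile error.

    println!("{}", counter.lock().unwrap());
}
```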
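And the newtype point, sketched with a made-up example: the invariant is checked once in the constructor and then carried by the type.

```rust
// The inner String is private, so a NonEmptyName can only be built via
// `new`, which validates the invariant exactly once.
pub struct NonEmptyName(String);

impl NonEmptyName {
    pub fn new(s: String) -> Result<Self, &'static str> {
        if s.trim().is_empty() {
            Err("name must not be empty")
        } else {
            Ok(NonEmptyName(s))
        }
    }
}

// Any function taking a NonEmptyName can rely on the invariant holding,
// without re-checking it.
fn greet(name: &NonEmptyName) {
    println!("hello, {}", name.0);
}

fn main() {
    let name = NonEmptyName::new("Ada".to_string()).unwrap();
    greet(&name);
}
```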
One final thing that I think is often overlooked. Rust is strict, and all of these checks apply not only to the code you write, but to all of your dependencies. That means that Rust libraries tend to be much more reliable than libraries from other ecosystems. That probably is partly because of a culture of reliability. But it's also because the language itself makes it hard to write sloppy code. And that the code you are building on is likely to be reliable makes it both less effort and more worthwhile to make your own code reliable (including for library authors), leading to a virtuous circle of reliable code.
For any of the Cloudflare team that frequents HN, curious if you have an eventual plan to open-source Pingora? I recognize it may stay proprietary if you consider it to be a differentiator and competitive advantage, but this blog post almost has a tone of "introducing this new technology!" as if it's in the cards for the future.
Do you think that it would be beneficial during analyst conference calls to highlight that Cloudflare is using Rust to build its next-gen critical systems? It shows a strong commitment to building best-in-class technology.
I'm mildly blown away to read, 'And the NGINX community is not very active, and development tends to be "behind closed doors".' Is this a reflection of the company, nginx (now owned by F5), going the way of an Oracle-style takeover of WebLogic from another era?
Dropbox wrote about their migration from NGINX to Envoy in July 2020 and highlighted a lot of the same concerns [0]. As per this thread [1], NGINX has posted very similar blog posts for the last two years saying they are 'returning to our open source roots', but without much tangible change. And the Cloudflare CEO forecast this move away from NGINX back in 2018 [2].
IMHO nginx has never been a particularly "open" or friendly open source project. I don't mean to sound rude. I don't think open source contributors "owe" anyone anything in this regard. If you want to throw code over a wall and run away, that's your prerogative. However I do think Cloudflare's assessment is accurate and a real liability for them.
- Tests were a later addition and live in a distinct repo with a bespoke harness. I'm sure it has advantages, but it also takes extra work for contributors to figure out.
- They use Trac?! I loved Trac circa 2008 but had no idea it was still a thing. I can't even log in to it without it timing out.
I don't want to nitpick an excellent project like nginx, but I think it's clear that easing third party contributions has never been a high priority.
> - They use Trac?! I loved Trac circa 2008 but had no idea it was still a thing. I can't even log in to it without it timing out.
Oh wow! I totally forgot about Trac. That was the first ticket management software I used and I completely forgot about its existence. Thanks for the impromptu trip down memory lane.
We did the same. We've replaced nginx/lua with a cache server (for video) written in Golang - now serving up to 100 Gbit/s per node. It's more CPU and memory efficient and completely tailored to our needs. We are happy that we moved away from nginx.
A German company building an app for watching linear TV. Netflix is actually serving 400 Gbit/s per node and already has 800 Gbit/s ready.
I think we can scale our setup up to 200 Gbit/s but we are too small. Total traffic is ~2 Tbit/s.
Most challenging is the missing support for QUIC/HTTP/3 and kTLS in Go. The 100G NIC supply chain is also difficult. We use NVIDIA ConnectX-6, but it's impossible to get a version with TLS offloading.
I think it starts with a Wa ... you don't have to say. I vaguely remember having stumbled on a Twitter engineering IPv6 tweet. Maybe I'm wrong.
For me it's impressive to get so much data through a computer. But I have one question: what counts as a node? Is a node one machine with dual sockets, a lot of RAM, and a lot of NICs, or is it multiple machines combined that act as one node, like a whole 19-inch rack?
There is quic-go[1] but I don't think it's sufficiently optimized[2-4] to be used for this kind of workload. caddy will use it to provide HTTP/3 by default[5] in the upcoming 2.6.0 release.
We can't use sendfile since all the traffic is HTTPS; sendfile can only be used if we can use kTLS. Internally we have some kind of cache hierarchy for video objects, and those files are transferred out via sendfile, but the performance gain there is negligible.
And yes: the proxy does have some intelligence beyond "just" forwarding requests. Mostly caching-related, but also some upstream routing decisions and a ton of observability.
I share a lot of the feelings towards NGINX that Cloudflare mentions in this blog post. New features like 103 Early Hints and HTTP/3 exist in HAProxy and Caddy, but there is nothing coming in NGINX.
Nginx is an absolute beast when it comes to accepting HTTP/1.1 from clients and talking HTTP/1.0 (or HTTP/1.1) to a fixed list of backends or serving files from disk.
Everything else is...hard. Writing plugins requires arcane knowledge with next to no documentation. Third-party plugins: multiple times they "acquired" (read: paid the author to work on a new plugin) an open-source plugin and made a much better version available in their Plus offering.
gRPC support was added only in 2018, HTTP/2 support was late, proper WebSocket support was late.
Many features that were available "for free" with OpenResty were only available in the paid version - good luck writing your own plugin.
They were doomed before F5, probably around the time they raised their series B1 in 2014. All that money and they still had nothing "new" to show. For every Plus plugin, there was most likely a better solution available, and often enough that solution was free or cheaper.
F5 is not to blame; they didn't change anything for the worse. The Plus license is the problem, where essential things like monitoring are behind a paywall. Back then this wasn't so important because you basically only had Apache and nginx.
I think I read two weeks ago that F5 was going to focus more on improving the open source version. Probably because the competition is getting harder and they're noticing it in a declining market share, but whatever the reason is, this was good to hear.
It was also good to see it no longer being part of a Russian company, even though the devs and owners are good people. You never know how a government can enforce some problematic behavior, especially one known for liking to throw people out of high-rise windows.
Does anyone know why nginx used separate processes for workers, instead of threads? This post makes it sound like threads are the way to go, but presumably nginx had a reason for using processes back in the day.
Share-nothing architectures were deemed more scalable since you don't need synchronization. But then you can't share stuff, like a connection pool. The architecture was also simpler this way. Nginx is also an application server, and it was "easy" to develop applications on top of it because of this architecture.
Nginx was written in C. Multithreaded code in a language that doesn't provide any safety rails is hard to get right, and so is async code. They probably figured that the complexity of doing both async and multithreading outweighed the benefits that were predicted to be small. Rust's type system checks for and prohibits many kinds of mistakes that are possible in multithreaded code and in async code, so it's much easier to combine them safely.
Wikipedia says Nginx started in 2004. If you look at the state of things for threads and other "lots of sockets to deal with" things in that timeframe, you can see multiprocess was probably a safer choice, especially if you intended on running well on a variety of Unix or Unix like OSes. This page has some of that state captured pretty well: http://www.kegel.com/c10k.html
It was easier to develop, easy to do zero-downtime graceful restarts and reloads, and the tooling available today wasn't available back then. SMP in FreeBSD 5.x was uhm...very bad...
What are HTTP status codes greater than 599 used for in practice?
It'd be interesting to see another Cloudflare blog post that just goes into detail on the weird protocol behaviour they've had to work around over the years. I imagine they have more insight into this than pretty much any other organisation on the planet.
If you are in the HTTP proxying business, you might not even know. It's just customer workflows using those status codes. And you can either support them, or lose the ability to handle traffic for those customers.
I concur with custom signals: using a code >= 600 mitigates the risk that it would ever be allocated, or that you'd conflict with an existing unofficial use.
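For illustration (my own aside, not from the thread): the Rust `http` crate, which much of the hyper/tokio ecosystem builds on, accepts any three-digit status code, which is roughly the latitude a proxy needs here.

```rust
use http::StatusCode;

fn main() {
    // Non-standard but wire-legal: anything in 100..=999 round-trips.
    let custom = StatusCode::from_u16(600).unwrap();
    assert_eq!(custom.as_u16(), 600);

    // Outside three digits is rejected outright.
    assert!(StatusCode::from_u16(1000).is_err());
}
```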
Did you guys consider HAProxy? I've only ever heard good things about it - particularly stability (though it probably can't beat Rust), performance, and configurability.
I agree with #1... Workers before the cache is crazy powerful for the original purpose of Workers (modifying incoming requests). But now that people are starting to use Workers as their origin (for Remix, etc.), it would be nice to be able to have the cache before Workers. As it is right now, having the CDN do full content caching of rendered Remix pages is difficult.
Most—maybe damn near all—sites see significant dips in traffic for at least a few hours a day. Which part of the day, depends on the site. More often than not, it's while the team is asleep and staffing, if any, is at its lowest point, since teams tend to live roughly in the same ~half of the world that their products are most-used in. Plus there's practically no-one in the Pacific until you reach Japan, and not a ton of "Western" sites see much use in Asia, and vice versa, with a few notable exceptions.
It's not unusual for e.g. ecommerce sites to crank up automated fraud prevention "at night" because staffing is so much lower.
TL;DR Most sites' usage patterns exhibit a pronounced day/night cycle that's not too far off from natural day/night cycles where the bulk of the team lives.
It was not uncommon, before the introduction of edge services, to block login on sites during off-hours or at night, or at least to do some rate-limiting. We see many brute-force attempts during the night. Most sites are not global.
Wow, this is just what I was looking for: a proxy written in a memory-safe language like Rust with no GC, as an alternative to nginx. Looking forward to the open source version!
Anyone else immediately Ctrl-F for "open source"? That's all I wanted to read, but I bookmarked the article and put it on my list of things to peruse later.
While I agree that, in selecting a proxy to use, whether it is open source is one of the most important considerations, if that's all you look for in this article, you may be missing something.
I thought the article did a good job of describing how they went about making the choice of whether to continue contributing to an existing (nominally open source) system or to build a new one. And of course, it did a good job of showcasing the strengths of Rust (reliability guarantees strong enough that they could identify when problems were due to hardware.)
Rust's ecosystem is usually fantastic for CLI tools and specialized network servers. Large REST API servers aren't quite as solid (but perfectly doable for basic cases), and GUIs are nowhere near mature.
In several major areas, I actually like the available Rust crates for a given task more than I like the available npm modules. Rust has many fewer third-party libraries available, but the quality is often good.
Right now the idiomatic Go approach to networking is two goroutines per socket (one for reading, the other for writing). Goroutines are very lightweight userspace threads, but they're not free: each costs a small amount of memory. At Cloudflare's scale, this overhead quickly adds up. So resource-wise, Go isn't ideal for very large scale use cases.
There is a more-than-6-year-old proposal to introduce a non-blocking I/O API (https://github.com/golang/go/issues/15735) but so far it's not gaining much traction. Maybe in Go 2.
The post mentions tokio, but I would be curious to see if it uses tower or something similar in-house. For our product (caido.io) we also built a custom HTTP parser, so if you open source the tool it would be nice to split the parsing into its own crate, giving us an alternative to hyper that can understand malformed requests.
Besides comparing this to Nginx plus Lua (OpenResty), has Cloudflare compared it to HAProxy plus Lua or any other similar proxies?
The main issue for me with Rust is that it takes significantly more resources (time, space, memory, CPU) to build projects from source. Building Haproxy is comparatively quick and easy.
The haproxy plus lua static binary (musl, no pcre) I use is already growing rather large. I will bet that Pingora binaries using shared libraries will be at least twice, maybe three times the size.
Curious: when building software, especially mission-critical software, why are things like time, space, memory, and CPU at build time a concern? The concerns that usually surface for me are reliability and performance (time, space, memory, CPU) of the resulting binary at run time. What you mention is kinda the core tradeoff: with Rust, you pay for high reliability and performance by spending cycles at compile time to ensure as much. Cloudflare certainly isn't looking to ship a new sexy live-reloaded and written-in-30-minutes nodejs app to production 12 times a day. Plus I'm sure CF has some beastly build infrastructure.
I really love haproxy, but I have seen segfaults and other memory errors that wouldn't have happened if it had been written in Rust (not that that was an option when haproxy was originally written). If you are going to write new HTTP software from scratch, I definitely think Rust is a good language to use.
I would have liked to have seen more explanation of why they decided to build their own rather than use haproxy though.
I don't know of any current ones. Because when I have encountered them, I've reported them and the haproxy maintainers fixed them quickly, and backported the fixes to still maintained older versions.
I don't remember the exact specifics off the top of my head, but in one case a certain kind of invalid config resulted in a crash rather than a normal error message. In another, there was an error in a bounds check, where if you chained multiple converters in the right way, you could end up capturing some additional data from other http headers.
Build time can be annoying, absolutely, but that's the tradeoff of Rust -- compiler does a lot of work so the program can be beastly fast and very reliable at run time.
I don't like big binaries either, mind you, but I view this as a very minor concern for a mission-critical piece of software.
Choosing speed and correctness at run time is what I would have done as well.
I wonder how this is deployed to presumably a large number of hosts? Do you build a distribution package out of your Rust build and ship that? If so, what about the Rust standard library? Though I believe some distributions do provide a package for the Rust standard library, but that means one also has to use the packaged rustc/cargo, which tends to lag behind quite a bit.
Yes, the distribution package is built with `cargo deb`, which automatically makes a suitable binary package. It doesn't need Rust in production. Rust's standard library is compiled into the executable; its size is negligible, especially with link-time optimization.
Rust doesn't have a stable ABI, so distributions don't provide a "standard library"; they provide the toolchain you use to build your Rust packages. Emphasis on packages, because that's the use case for the toolchain provided by the distro, not local development.
In the end, you end up with a binary that is statically linked, except for `glibc` (unless you use libc like musl) and whatever C library you choose to dynamically link against.
Since this is an in-house project, they can build their packages with whatever toolchain they choose and distribute packages for deployment. So whatever toolchain the distro provides is irrelevant unless this package is part of the distro's official package repository (which it isn't in this case).
You may be interested in knowing that about two years ago, a team of engineers at Dropbox wrote gory details about their use of Rust and it was inspiring. The passion about their work really came through. The team also held an AMA on /r/rust that went well. See here: https://www.reddit.com/r/rust/comments/fjt4q3/rewriting_the_...
Why? It's closed-door development in nginx from their perspective. This doesn't apply to their in-house project. They're not telling you why you would want to switch off Nginx; they're telling you why they switched.
To me it looked like they didn't have anything technical against Envoy, they just didn't want to be dependent upon someone else for what is a core part of their product anymore. Which is entirely reasonable and a good idea.
Should have waited to post this until it was actually ready to be open sourced. Otherwise this is just kinda like "huh, neat" without anything else to do with it.
In some cases it can be enough to know that it could be worth waiting for the release instead of putting more resources into a stack you're currently using. You might replace it entirely in a few months if the release turns out to be a product which you can and want to switch to, so it's ok to get a heads-up.
True. But you're not committing to it, you're just waiting for the evaluation instead of directing resources into a change which might become obsolete soon, should the new option be a viable one.
In this case, where the code is coming from Cloudflare, you can be sure that they know what they're doing, that there is a high chance that it will be very high quality code.
Apart from this, it's an interesting article from a technical perspective due to the various topics it touches.
Also, in my case, where I'm looking for a new language to program in and Rust and Go are my main options, articles like these help me with the decision process on which one to choose.
There's much more value to this article than it just being an announcement that sometime in the near future they will be releasing an open source project.
It almost sounded like a "give me the code or shut up"-message.