> When crashes do occur an engineer needs to spend time to diagnose how it happened and what caused it. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.
> In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We've also discovered hardware issues on a few machines. In the past, ruling out rare memory bugs caused by our software, even after significant debugging, was nearly impossible.
That's quite the endorsement of Rust. A lot of people focus on the fact that Rust can't absolutely guarantee freedom from crashes and memory safety issues, which I think misses the point: this kind of experience of running high-traffic Rust services in production for months with hardly a single issue is common in practice.
I had a very similar experience. Much smaller scale, but the service kept internal state and clients connected over WebSockets. It could handle up to a million clients on one server and practically never crashed. While I was writing it I had only hobby-level experience with Rust, and I was also mentoring a colleague, so he wrote a big chunk of code as a total Rust noob.
I have a theory that people who are noobs in a certain technology, but are overall experienced, tend to produce very solid code, because they actively avoid doing clever things.
This is my experience (albeit at smaller scale) exactly. If this weren't the case, Rust would be snake oil and I'd be the first to admit it. Until my most recent endeavor with a Rust backend, among other things, I simply didn't know it was possible to not have to debug crashes. Seriously. The pleasure of maintaining Rust software in production is so wildly different from anything I've ever experienced in the past that I gladly submit to the compiler over and over and over again. It's worth the investment a hundredfold.
I had the same experience when I wrote a camera capture / motion detection / video logging service for a commercial smart refrigerator in Rust. The Swift component crashed at least twice a week, the Rust component ran for months and months without issue.
Similar experience over here too, I've been running a Rust server in production for >1yr now and had absolutely no crashes. Resource usage has been nice and constant too. It's fantastic :)
Which aspect(s) of Rust do you think are most responsible for this? (e.g. borrow checker, memory safety, culture that attracts devs who care about reliability, etc)
Not the person you're asking, but the culture around data representation strikes me as the biggest factor:
1. Making only valid states representable, to the greatest extent reasonably achievable.
2. Treating errors as regular data, not an afterthought, with language features to make that not too painful.
3. Not returning placeholder values (i.e. if a parse function gives you back a value, the input actually parsed correctly; you never get a placeholder plus an error flag stashed somewhere else).
Language features, in particular "enums" (aka algebraic data types, aka tagged unions), make this approach possible. You couldn't do it in Go, for instance, even if there were a cultural decision to.
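A minimal sketch of the pattern (all names here are invented for illustration):

```rust
// An enum makes only valid states representable: a connection is exactly
// one of these variants, never a half-initialized mix of flags.
#[allow(dead_code)]
enum Connection {
    Disconnected,
    Connecting { attempt: u32 },
    Connected { session_id: u64 },
}

// Errors are ordinary data in the return type. If you get an Ok value
// back, the input parsed; there is no placeholder plus an error flag
// stashed somewhere else.
fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.parse()
}

fn main() {
    match parse_port("8080") {
        Ok(port) => println!("listening on {port}"),
        Err(e) => eprintln!("bad port: {e}"),
    }
}
```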
It is of course a combination of all these aspects.
A type system that can express thread safety (Send/Sync traits) is incredibly valuable when building multi-threaded systems.
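As a small illustration (my sketch, not from the parent comment): the compiler simply refuses to move non-thread-safe types across a thread boundary, so a whole class of sharing bugs never compiles.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Arc<T> is Send + Sync, so it may move into another thread.
    let shared = Arc::new(vec![1, 2, 3]);
    let handle = thread::spawn({
        let shared = Arc::clone(&shared);
        move || println!("{:?}", shared)
    });
    handle.join().unwrap();

    // By contrast, this would not compile, because Rc<T> is not Send:
    // let local = std::rc::Rc::new(vec![1, 2, 3]);
    // thread::spawn(move || println!("{:?}", local));
}
```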
A universal definition of what is safe, plus standard traits and borrowing rules, makes APIs more predictable. Just from a function's signature you know a lot about its behavior, without having to look for gotchas in the manual.
Mandatory error handling prevents cutting corners. Unit testing is built-in.
Generics, good inlining, and Cargo help split code into libraries without a performance or usability hit, which helps make focused, well-tested components.
Most of these things aren't groundbreaking, but Rust, being new, had the luxury of picking current best practices and sensible defaults.
I think ultimately the big thing Rust brings to the table is that a lot of its features are geared towards detecting problems as early as possible.
This means you get complexity, especially in the "initial phases".
* The initial learning curve is steeper than usual. You have x types of string instead of one or two, you have lifetimes, etc, etc. Fortunately you pay it once and then can use it many times (maybe with some revisits to the docs :P)
* The initial building of a feature might take a bit longer than in other less complex languages (the borrow checker gives you a particularly tricky error, etc). But what is happening here is that the language is picking up problems that other languages ... don't. They "defer to production".
Others are mentioning that cargo is another feature that helps. I think it does, but indirectly. What cargo does is improve the developer experience. Having all the packages at the tips of your fingers, without needing to jump through hoops like in other languages, is just ... nice. Given the initial ramp-up in complexity, and the initial harder-than-usual first write, anything that improves the UX goes toward alleviating that. I personally would put cargo in the same bucket as "helpful compiler messages" and "good docs".
> You have x types of string instead of one or two
Funny how you said X, when there are really only 3 types of strings in the standard library. You generally only encounter 2 of them: String and OsString. OsString, while annoying, is one of the things that makes Rust safe. You only ever deal with the 3rd type, CString, when you work with FFI, at which point you'd better already know how strings work in C.
I said X because people include some and not others, and the conversation rapidly deviates (as shown on this thread). The point remains that there’s more ways to skin that particular cat in Rust than in other languages.
There is also PathBuf, which would use a string in most languages. And sometimes one may need to work with Vec<u8> or Vec<char>. And then the slice/view variants of all the above.
Yeah, when Rust beginners (like me!) think "types of string" I think it's more likely to be &str (and its lifetime variants), String and Vec<u8>. I've gotten pretty far as a beginner by sticking to over-copying with just String, but eventually I think I'll have to figure out what works best for perf.
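For what it's worth, the conversions between those types are explicit but mechanical; a rough sketch:

```rust
fn main() {
    let owned: String = String::from("hello");       // owned, growable, heap-allocated
    let view: &str = &owned;                         // borrowed view into the same bytes
    let bytes: Vec<u8> = owned.clone().into_bytes(); // the raw UTF-8 bytes

    // Going back from bytes requires proving they are valid UTF-8:
    let roundtrip = String::from_utf8(bytes).unwrap();
    assert_eq!(roundtrip, view);
}
```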
> There is also PathBuf, which would use a string in most languages.
I feel bad for them. I wanted to fix something in Neovim's built-in LSP recently and lost all interest because of all the things you have to do to:
- Get parent directory for a path
- Create that directory if it doesn't exist
Why was I annoyed by that? Well, different OSes use different path separators; in Rust this is trivial because `Path` and `PathBuf` are target-specific.
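Something like this, for instance (a sketch; the helper name is made up):

```rust
use std::fs;
use std::path::Path;

// Create the parent directory of `file` if it doesn't exist yet.
fn ensure_parent_dir(file: &Path) -> std::io::Result<()> {
    // parent() understands the platform's separator; no string splitting.
    if let Some(dir) = file.parent() {
        // Creates missing ancestors too, and is a no-op if it already exists.
        fs::create_dir_all(dir)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    ensure_parent_dir(Path::new("logs/2022/09/output.log"))
}
```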
That's not a type of string though, it's a path, the fact that it uses `OsString` underneath is irrelevant.
> Vec<u8> or Vec<char>
There isn't really anything special to learn about them, though? I doubt newbies encounter this while learning. Slices are the same thing, you just can't extend them; what is there to learn?
- I think memory safety is a baseline. You'll note that memory safe languages already tend to be much more reliable than non-memory-safe languages in general.
- Then you have the error handling. A lot of unreliability in my code in other languages comes from unhandled exceptions that only occur rarely. Rust generally puts all possible error conditions in the type signature of the function, meaning it's actually feasible to handle every failure case.
- Speaking of unhandled exceptions, a lot of those in typed languages tend to be caused by null. Rust does not have null. Instead it has Option, and it is impossible to access the contents of an option without doing the equivalent of a null check. So that entire class of errors is gone.
- Both Result (used for error handling) and Option (used instead of null) are what Rust calls enums, and what are more generally called Sum Types. I think these are a huge deal. They allow you to safely represent data that may be one thing or another with very strict type checking. These are broadly very useful in API design, and in my experience lead to much more robust code than the class hierarchies you need in OOP languages or unions which lack the safety checks. (Aside: sum types would be quite simple to add to other languages. I have no idea why they haven't been added yet).
- Speaking of classes, inheritance is not supported. So that's a bunch of confusing code that just isn't possible to write. This can add a bit of boilerplate to Rust code, but it makes it more straightforward and less bug prone.
- You mentioned the borrow checker. That definitely helps. It's yet another tool that allows you to write APIs that cannot be misused. A great example is Rust's Mutex type: it can prove at compile time that code does not hold on to references to the protected data beyond the duration that the lock is held (see the sketch after this list).
- Speaking of Mutex: Rust's Send and Sync traits provide very good thread safety. You almost don't need to worry about thread safety at all in Rust. Most concurrency bugs, data races included, are prevented by the compiler (you can still write deadlocks and higher-level race conditions).
- Newtypes allow you to check invariants once and then have the type system enforce that they remain satisfied (sketched after this list).
- All type casts are explicit.
- Lots of other little things
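To make the Mutex point concrete, a minimal sketch: the protected data is only reachable through a guard, and the guard's lifetime is the lock.

```rust
use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0u64);

    {
        // lock() returns a guard; the data is only reachable through it.
        let mut n = counter.lock().unwrap();
        *n += 1;
    } // Guard dropped here: the lock is released. Keeping a reference to
      // the data past this point would be a compile error.

    println!("{}", counter.lock().unwrap());
}
```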
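And the newtype point, sketched with a made-up example: the invariant is checked once in the constructor and then carried by the type.

```rust
// The inner String is private, so a NonEmptyName can only be built via
// `new`, which validates the invariant exactly once.
pub struct NonEmptyName(String);

impl NonEmptyName {
    pub fn new(s: String) -> Result<Self, &'static str> {
        if s.trim().is_empty() {
            Err("name must not be empty")
        } else {
            Ok(NonEmptyName(s))
        }
    }
}

// Any function taking a NonEmptyName can rely on the invariant holding,
// without re-checking it.
fn greet(name: &NonEmptyName) {
    println!("hello, {}", name.0);
}

fn main() {
    let name = NonEmptyName::new("Ada".to_string()).unwrap();
    greet(&name);
}
```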
One final thing that I think is often overlooked. Rust is strict, and all of these checks apply not only to the code you write, but to all of your dependencies. That means that Rust libraries tend to be much more reliable than libraries from other ecosystems. That probably is partly because of a culture of reliability. But it's also because the language itself makes it hard to write sloppy code. And that the code you are building on is likely to be reliable makes it both less effort and more worthwhile to make your own code reliable (including for library authors), leading to a virtuous circle of reliable code.
For any of the Cloudflare team that frequents HN, curious if you have an eventual plan to open-source Pingora? I recognize it may stay proprietary if you consider it to be a differentiator and competitive advantage, but this blog post almost has a tone of "introducing this new technology!" as if it's in the cards for the future.
Do you think that it would be beneficial during analyst conference calls to highlight that Cloudflare is using Rust to build its next-gen critical systems? It shows a strong commitment to building best-in-class technology.
I'm mildly blown away to read, 'And the NGINX community is not very active, and development tends to be "behind closed doors".' Is this a reflection of the company, nginx (now owned by F5), going the way of an Oracle-style takeover of WebLogic from another era?
Dropbox wrote about their migration from NGINX to Envoy in July 2020 and highlighted a lot of the same concerns [0]. As per this thread [1], NGINX has posted very similar blog posts for the last two years saying they are 'returning to our open source roots', but without much tangible change. And the Cloudflare CEO forecast this move away from NGINX back in 2018 [2].
IMHO nginx has never been a particularly "open" or friendly open source project. I don't mean to sound rude. I don't think open source contributors "owe" anyone anything in this regard. If you want to throw code over a wall and run away, that's your prerogative. However I do think Cloudflare's assessment is accurate and a real liability for them.
- Tests were a later addition and live in a distinct repo with a bespoke harness. I'm sure it has advantages, but it also takes extra work for contributors to figure out.
- They use Trac?! I loved Trac circa 2008 but had no idea it was still a thing. I can't even log in to it without it timing out.
I don't want to nitpick an excellent project like nginx, but I think it's clear that easing third party contributions has never been a high priority.
> - They use Trac?! I loved Trac circa 2008 but had no idea it was still a thing. I can't even log in to it without it timing out.
Oh wow! I totally forgot about Trac. That was the first ticket management software I used and I completely forgot about its existence. Thanks for the impromptu trip down memory lane.
We did the same. We've replaced nginx/lua with a cache server (for video) written in Golang - now serving up to 100 Gbit/s per node. It's more CPU and memory efficient and completely tailored to our needs. We are happy that we moved away from nginx.
A German company building an app for watching linear TV. Netflix is actually serving 400 Gbit/s per node and already has 800 Gbit/s ready.
I think we can scale our setup up to 200 Gbit/s but we are too small. Total traffic is ~2 Tbit/s.
Most challenging is the missing support for QUIC/HTTP/3 and kTLS in Go. The 100G NIC supply chain is also difficult. We use NVIDIA ConnectX-6, but it's impossible to get a version with TLS offloading.
I think it starts with a Wa ... you don't have to say. I vaguely remember having stumbled on a Twitter engineering IPv6 tweet. Maybe I'm wrong.
For me it's impressive to get so much data through a computer. But I have one question: what counts as a node? Is a node one machine with dual sockets, a lot of RAM, and a lot of NICs, or is it multiple machines combined that act as one node, like a whole 19-inch rack?
There is quic-go[1] but I don't think it's sufficiently optimized[2-4] to be used for this kind of workload. caddy will use it to provide HTTP/3 by default[5] in the upcoming 2.6.0 release.
We can't use sendfile since all the traffic is HTTPS; sendfile can only be used if we can use kTLS. Internally we have some kind of cache hierarchy for video objects, and those files are transferred out via sendfile, but the performance gain there is negligible.
And yes: the proxy does have some intelligence beyond "just" forwarding requests. Mostly caching-related, but also some upstream routing decisions and a ton of observability.
I share a lot of the feelings towards NGINX that Cloudflare mentions in this blog post. New features like 103 Early Hints and HTTP/3 exist in HAProxy and Caddy, but there is nothing coming in NGINX.
Nginx is an absolute beast when it comes to accepting HTTP/1.1 from clients and talking HTTP/1.0 (or HTTP/1.1) to a fixed list of backends or serving files from disk.
Everything else is...hard. Writing plugins requires arcane knowledge with next to no documentation. Third-party plugins: multiple times they "acquired" (read: paid the author to work on a new plugin) an open-source plugin and made a much better version available in their Plus offering.
gRPC support was added only in 2018, HTTP/2 support was late, proper WebSocket support was late.
Many features that were available "for free" with OpenResty were only available in the paid version - good luck writing your own plugin.
They were doomed before F5, probably around the time they raised their series B1 in 2014. All that money and they still had nothing "new" to show. For every Plus plugin, there was most likely a better solution available, and often enough that solution was free or cheaper.
F5 is not to blame; they didn't change anything for the worse. The Plus license is the problem, where essential things like monitoring are behind a paywall. Back then this wasn't so important because you basically only had Apache and nginx.
I think I read two weeks ago that F5 was going to focus more on improving the open source version. Probably because the competition is getting harder and they're noticing it in a declining market share, but whatever the reason is, this was good to hear.
It was also good to see it no longer being part of a Russian company, even though the devs and owners are good people. You never know how a government can enforce some problematic behavior, especially one known for liking to throw people out of high-rise windows.
Does anyone know why nginx used separate processes for workers, instead of threads? This post makes it sound like threads are the way to go, but presumably nginx had a reason for using processes back in the day.
Share-nothing architectures were deemed more scalable since you don't need synchronization. But then you can't share stuff, like a connection pool. The architecture was also simpler this way. Nginx is also an application server, and it was "easy" to develop applications on top of it because of this architecture.
Nginx was written in C. Multithreaded code in a language that doesn't provide any safety rails is hard to get right, and so is async code. They probably figured that the complexity of doing both async and multithreading outweighed the benefits that were predicted to be small. Rust's type system checks for and prohibits many kinds of mistakes that are possible in multithreaded code and in async code, so it's much easier to combine them safely.
Wikipedia says Nginx started in 2004. If you look at the state of things for threads and other "lots of sockets to deal with" things in that timeframe, you can see multiprocess was probably a safer choice, especially if you intended on running well on a variety of Unix or Unix like OSes. This page has some of that state captured pretty well: http://www.kegel.com/c10k.html
It was easier to develop, easy to do zero-downtime graceful restarts and reloads, and the tooling available today wasn't available back then. SMP in FreeBSD 5.x was uhm...very bad...
What are HTTP status codes greater than 599 used for in practice?
It'd be interesting to see another Cloudflare blog post that just goes into detail on the weird protocol behaviour they've had to work around over the years. I imagine they have more insight into this than pretty much any other organisation on the planet.
If you are in the HTTP proxying business, you might not even know. It's just customer workflows using those status codes. And you can either support them, or lose the ability to handle traffic for those customers.
I concur with custom signals: using a code >= 600 mitigates the risk that it would ever be allocated, or that you'd conflict with an existing unofficial use.
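For illustration (my own aside, not from the thread): the Rust `http` crate, which much of the hyper/tokio ecosystem builds on, accepts any three-digit status code, which is roughly the latitude a proxy needs here.

```rust
use http::StatusCode;

fn main() {
    // Non-standard but wire-legal: anything in 100..=999 round-trips.
    let custom = StatusCode::from_u16(600).unwrap();
    assert_eq!(custom.as_u16(), 600);

    // Outside three digits is rejected outright.
    assert!(StatusCode::from_u16(1000).is_err());
}
```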
Did you guys consider HAProxy? I've only ever heard good things about it - particularly stability (though it probably can't beat Rust), performance, and configurability.
I agree with #1... Workers before the cache is crazy powerful for the original purpose of Workers (modifying incoming requests). But now that people are starting to use Workers as their origin (for Remix, etc.), it would be nice to be able to have the cache before Workers. As it is right now, having the CDN do full content caching of rendered Remix pages is difficult.
Most—maybe damn near all—sites see significant dips in traffic for at least a few hours a day. Which part of the day, depends on the site. More often than not, it's while the team is asleep and staffing, if any, is at its lowest point, since teams tend to live roughly in the same ~half of the world that their products are most-used in. Plus there's practically no-one in the Pacific until you reach Japan, and not a ton of "Western" sites see much use in Asia, and vice versa, with a few notable exceptions.
It's not unusual for e.g. ecommerce sites to crank up automated fraud prevention "at night" because staffing is so much lower.
TL;DR Most sites' usage patterns exhibit a pronounced day/night cycle that's not too far off from natural day/night cycles where the bulk of the team lives.
It was not uncommon, before the introduction of edge services, to block login on sites during off-hours or at night, or at least to do some rate-limiting. We see many brute-force attempts during the night. Most sites are not global.
Wow, this is just what I was looking for: a proxy written in a memory-safe language like Rust with no GC, as an alternative to nginx. Looking forward to the open source version!
Anyone else immediately Ctrl-F for "open source"? That's all I wanted to read, but I bookmarked the article and put it on my list of things to peruse later.
While I agree that, in selecting a proxy to use, whether it is open source is one of the most important considerations, if that's all you look for in this article, you may be missing something.
I thought the article did a good job of describing how they went about making the choice of whether to continue contributing to an existing (nominally open source) system or to build a new one. And of course, it did a good job of showcasing the strengths of Rust (reliability guarantees strong enough that they could identify when problems were due to hardware.)
Rust's ecosystem is usually fantastic for CLI tools and specialized network servers. Large REST API servers aren't quite as solid (but perfectly doable for basic cases), and GUIs are nowhere near mature.
In several major areas, I actually like the available Rust crates for a given task more than I like the available npm modules. Rust has many fewer third-party libraries available, but the quality is often good.
Right now the idiomatic Go approach to networking is two goroutines per socket (one for reading, the other for writing). Goroutines are very lightweight userspace threads, but they're not free: each costs a small amount of memory. At Cloudflare's scale, this overhead quickly adds up. So resource-wise, Go isn't ideal for very large scale use cases.
There is a more-than-6-year-old proposal to introduce a non-blocking I/O API (https://github.com/golang/go/issues/15735) but so far it's not gaining much traction. Maybe in Go 2.
The post mentions tokio, but I would be curious to see if it uses tower or something similar in-house. For our product (caido.io) we also built a custom HTTP parser, so if you open source the tool it would be nice to split the parsing into its own crate, giving us an alternative to hyper that can understand malformed requests.
Besides comparing this to Nginx plus Lua (OpenResty), has Cloudflare compared it to HAProxy plus Lua or any other similar proxies?
The main issue for me with Rust is that it takes significantly more resources (time, space, memory, CPU) to build projects from source. Building Haproxy is comparatively quick and easy.
The haproxy plus lua static binary (musl, no pcre) I use is already growing rather large. I will bet that Pingora binaries using shared libraries will be at least twice, maybe three times the size.
Curious: when building software, especially mission-critical software, why are things like time, space, memory, and CPU at build time a concern? The concerns that usually surface for me are reliability and performance (time, space, memory, CPU) of the resulting binary at run time. What you mention is kinda the core tradeoff: with Rust, you pay for high reliability and performance by spending cycles at compile time to ensure as much. Cloudflare certainly isn't looking to ship a new sexy live-reloaded and written-in-30-minutes nodejs app to production 12 times a day. Plus I'm sure CF has some beastly build infrastructure.
I really love haproxy, but I have seen segfaults and other memory errors that wouldn't have happened if it had been written in Rust (not that that was an option when haproxy was originally written). If you are going to write new HTTP software from scratch, I definitely think Rust is a good language to use.
I would have liked to have seen more explanation of why they decided to build their own rather than use haproxy though.
I don't know of any current ones. Because when I have encountered them, I've reported them and the haproxy maintainers fixed them quickly, and backported the fixes to still maintained older versions.
I don't remember the exact specifics off the top of my head, but in one case a certain kind of invalid config resulted in a crash rather than a normal error message. In another, there was an error in a bounds check, where if you chained multiple converters in the right way, you could end up capturing some additional data from other http headers.
Build time can be annoying, absolutely, but that's the tradeoff of Rust -- compiler does a lot of work so the program can be beastly fast and very reliable at run time.
I don't like big binaries either, mind you, but I view this as a very minor concern for a mission-critical piece of software.
Choosing speed and correctness at run time is what I would have done as well.
I wonder how this is deployed to presumably a large number of hosts? Do you build a distribution package out of your Rust build and ship that? If so, what about the Rust standard library? Though I believe some distributions do provide a package for the Rust standard library, but that means one also has to use the packaged rustc/cargo, which tends to lag behind quite a bit.
Yes, the distribution package is built with `cargo deb`, which automatically makes a suitable binary package. It doesn't need Rust in production. Rust's standard library is compiled into the executable; its size is negligible, especially with link-time optimization.
Rust doesn't have a stable ABI, so distributions don't provide a "standard library"; they provide the toolchain you use to build your Rust packages. Emphasis on packages, because that's the use case for the toolchain provided by the distro, not local development.
In the end, you end up with a binary that is statically linked, except for `glibc` (unless you use libc like musl) and whatever C library you choose to dynamically link against.
Since this is an in-house project, they can build their packages with whatever toolchain they choose and distribute packages for deployment. So whatever toolchain the distro provides is irrelevant unless this package is part of the distro's official package repository (which it isn't in this case).
You may be interested in knowing that about two years ago, a team of engineers at Dropbox wrote gory details about their use of Rust and it was inspiring. The passion about their work really came through. The team also held an AMA on /r/rust that went well. See here: https://www.reddit.com/r/rust/comments/fjt4q3/rewriting_the_...
Why? It's closed-door development in nginx from their perspective. This doesn't apply to their in-house project. They're not telling you why you would want to switch off Nginx; they're telling you why they switched.
To me it looked like they didn't have anything technical against Envoy, they just didn't want to be dependent upon someone else for what is a core part of their product anymore. Which is entirely reasonable and a good idea.
Should have waited to post this until it was actually ready to be open sourced. Otherwise this is just kinda like "huh, neat" without anything else to do with it.
In some cases it can be enough to know that it could be worth waiting for the release instead of putting more resources into a stack you're currently using. You might replace it entirely in a few months if the release turns out to be a product which you can and want to switch to, so it's ok to get a heads-up.
True. But you're not committing to it, you're just waiting for the evaluation instead of directing resources into a change which might become obsolete soon, should the new option be a viable one.
In this case, where the code is coming from Cloudflare, you can be sure that they know what they're doing, that there is a high chance that it will be very high quality code.
Apart from this, it's an interesting article from a technical perspective due to the various topics it touches.
Also, in my case, where I'm looking for a new language to program in and Rust and Go are my main options, articles like these help me with the decision process on which one to choose.
There's much more value to this article than it just being an announcement that sometime in the near future they will be releasing an open source project.
It almost sounded like a "give me the code or shut up"-message.