For what it's worth at work (Discord) we serve our marketing page/app/etc... entirely from the edge using Cloudflare workers. We also have a non-trivial amount of logic that runs on the edge, such as locale detection to serve the correct pre-rendered sites, and more powerful developer only features, like "build overrides" that let us override the build artifacts served to the browser on a per-client basis.
This is really useful to test new features before they land in master and are deployed out - we actually have tooling that lets you specify a branch you want to try out, and you can quickly have your client pull built assets from that branch - and even share a link that someone can click on to test out their client on a certain build. Just the other week, we shipped some non-obvious bug, and I was able to bisect the troubled commit using our build override stuff.
The worker stuff is completely integrated into our CI pipeline, and on a usual day, we're pushing a bunch of worker updates to our different release channels. The backing-store for all assets and pre-rendered content sits in a multi-regional GCS bucket that the worker signs + proxies + caches requests to.
We build the worker's JS using webpack, and our CI toolchain snapshots each worker deploy, and allows us to roll back the worker to any point in time with ease.
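For a rough sense of what the locale-detection piece can look like, here's a minimal sketch of a worker that picks a pre-rendered variant from Accept-Language - the paths and locale list are placeholders, not our actual code:

    const SUPPORTED = ['en', 'fr', 'de', 'ja'];

    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      // Pick a supported locale from Accept-Language, falling back to English.
      const header = (request.headers.get('Accept-Language') || 'en').toLowerCase();
      const locale = SUPPORTED.find(l => header.startsWith(l)) || 'en';

      // Proxy the matching pre-rendered variant from the backing store.
      const url = new URL(request.url);
      url.pathname = '/prerendered/' + locale + url.pathname;
      return fetch(new Request(url.toString(), request));
    }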
I wrote probably the stupidest worker script too, that streams a clock to your browser using chunked requests: https://jake.lol/clock (no javascript required).
Thanks! And thanks for building Cloudflare workers. You have no idea how much nicer it is to write and express a lot of this logic in modern javascript, rather than nginx configs and VCLs.
I haven't dug too deep into this, but when loading view-source:https://jake.lol/clock I'm seeing on average 14 second load times. Does this happen for anyone else?
That seems right. Should probably be longer. Each second, a new chunk is emitted with the appropriate html to hide the old clock element and show the new.
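In spirit, the worker looks something like this (a simplified sketch, not the exact script):

    addEventListener('fetch', event => {
      const { readable, writable } = new TransformStream();
      // Keep writing chunks in the background while the response streams out.
      event.waitUntil(streamClock(writable));
      event.respondWith(new Response(readable, {
        headers: { 'Content-Type': 'text/html; charset=utf-8' },
      }));
    });

    async function streamClock(writable) {
      const writer = writable.getWriter();
      const encode = s => new TextEncoder().encode(s);
      await writer.write(encode('<!doctype html><title>clock</title><body>'));
      for (let i = 0; i < 60; i++) {
        // Each chunk hides every clock rendered so far, then shows a fresh one.
        await writer.write(encode(
          '<style>.clock{display:none}</style>' +
          '<div class="clock" style="display:block">' + new Date().toUTCString() + '</div>'
        ));
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
      await writer.close();
    }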
The value of any function-as-a-service is the ecosystem within which it sits. Pretty much all of them are the same: upload your code, we will run it.
The value comes from 1) What can trigger that code to run and 2) What services that code can interact with.
And on those two points, AWS still wins hands down. They have by far the most possible triggers for Lambda, and they have by far the most services that Lambda can interact with.
It's cool that Cloudflare built something faster, but unless you're running in a vacuum, speed is the least of your concerns.
Yes, that's how Amazon creates lock-in. But it depends what you're doing with it right? If you are looking to run code based on a SQS event, yes you have to use a Lambda. If you are looking to execute code when something visits a URL you have more options.
The lock-in argument alone is a red herring. Every technology implementation creates lock-in. The valid question is how hard something is to change. A good architecture balances how easy it is to change something with how optimized it is, while also balancing how much it costs to build and maintain.
Realistically, you can get as locked into Amazon as you want; Lambda alone does not create inescapable lock-in by any measure. So I would argue Jeremy still has a point in that tools become more useful when you can use them to do more work (ecosystem)...
Amazon performance is generally bad when using other services.
We are rolling out a CDN, with a goal of 20 ms latency in most countries. We want more granularity than AWS offers - and some regions are just not well served (no AWS in Africa, an incomplete offering in Brazil, etc.)
Still, we figured we would use Route 53 as you can do Latency Based Routing even with non-AWS servers. Computing latency or using EDNS0 as a proxy is not rocket science, so we thought the DNS would not be a limiting point.
Oh boy, how wrong we were! After wrongly blaming the bad performance on Cloudflare caching, further tests revealed Route 53 takes as much as 0.7s to reply to some DNS queries - and it's even worse when fronted by Cloudflare, as for some reason the DNS TTL seems to be ignored by Cloudflare. The latency only drops after about 4 queries, which makes me think they have some kind of Round-Robin that does not share the DNS queries (I could be wrong)
In the article, the author says: "Most of that delay is DNS however (Route53?). Just showing the time spent waiting for a response (ignoring DNS and connection time)". No, you should not ignore the DNS delays! Route53 performance is very bad - 2 full seconds for you!!
We are fortunate it did not take 2s for us. Still, having servers all over the world that reply in 20 ms is useless when the first DNS query takes 700ms.
We ended up leaving for Azure: Traffic Manager outperforms Route 53 by a factor of 2.
Eventually, we will roll our own GeoIP with DNS resolvers on an anycast subnet.
I do not understand how this level of "performance" can be tolerated. At 2 seconds for a DNS query, you are better off using the registrar's free DNS service!!
Saying Route53 takes “2 seconds” to resolve is pretty meaningless without a distribution or at least percentiles. Route53 obviously doesn’t take 2 seconds for all or most queries.
I am quoting the author, and their analysis of the initial query. This observation from whoever wrote the article matches my own experimental results: initial queries are very, very slow on Route 53 LBR. A distribution of queries is useless and misleading, as later queries are cached if you have a sufficient TTL - so only the first few really matter in the performance results.
Later queries are fast of course, as the results are cached (TTL).
Even if the DNS is very poorly configured, all queries after the first one will benefit from the cache!! So the first few queries matter much more, and this is what we should be talking about instead of distributions and percentiles.
Said differently: if each of your visitors has to wait a second or two until the site comes up the first time, even though the site works normally afterwards, it may still give them a bad impression.
I measured the DNS delay on the first Route53 reply to be over 700 ms personally. For the author it is 2000 ms. These results are in the same order of magnitude, and make Route53 unsuitable for many applications. Of course, you could start hacking, like keeping the Amazon cache warm by issuing queries through cron, or setting extremely long TTLs and hoping your visitors' DSL modems will keep your A records in cache as long as you asked for - but these are just hacks trying to compensate for the fact that the first DNS query takes SECONDS to process.
Route53 LBR DNS is not sold as "slow and requiring hacks". It's supposed to be fast, simple to run, and to integrate with different ecosystems. To me, it seems to be none of that.
After assessing Route53 as fubar, I switched from AWS to Azure: TrafficManager offers the same features, and the first request takes less than 350ms. There must still be some cruft in there, but at least it is manageable.
As the architect of Workers I was obviously pretty happy with Zack's results in general. But, I'm not happy with the tail latency (99th percentile), even if it beats the competition. I suspect this has to do with GC pauses. The solution may be to proactively run GC in a background thread between requests. For high-traffic workers that are always processing requests, we could load multiple instances of the worker and alternate between them.
BTW, if you're into modern C++ and this kind of work interests you, please e-mail me at kenton at cloudflare.com. We're hiring!
You are from Cloudflare? Could you tell me why the replies to geoip/latency based routing CNAMEs do not seem to be cached by Cloudflare?
The setup is: domain.com -> geoiplbr.domain.com with cloudflare caching enabled. Nothing else that is fancy and could cause delays.
If I measure the TTFB for domain.com, I see a large DNS delay until about the 4th consecutive query - and then the DNS is no longer the limiting factor.
The same measures on geoiplbr.domain.com normalize after the 2nd query.
It seems to me you have some kind of Round Robin going on that does not share the DNS results.
Or maybe the caching is not done at the POP level?
This seems to disregard some of the other factors that make Lambda > Cloudflare Workers. We run binaries on our Lambda instances with a Go-based function; since Lambda allows up to 250 MB of binaries, 3 GB of RAM, and a 30 s maximum runtime, we can run computationally and RAM-heavy workloads without worrying about our instance being killed off.
Also, I looked into using Cloudflare Workers to write my own custom edge CDN, but they currently don't allow you to change where in the call requests are processed or to tell Cloudflare what to cache vs. not cache. If they added functionality that let you easily write your own multi-layered CDN, that would be interesting.
It’s worth pointing out Amazon charges more than 10x a Worker on a per execution basis to use it as you describe for just 100ms of compute. If you’re actually using 30s it’s probably very expensive indeed.
The statements in the second paragraph are fortunately incorrect. With the exception of some security features Workers totally takes over the incoming request. It can use flags in its subrequests to configure the cache as you need, and will soon have access to the raw Cache API.
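Roughly, a subrequest can configure its own caching like this (a minimal sketch, not a complete list of the available flags):

    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      // The cf object on a subrequest overrides how Cloudflare caches the response.
      return fetch(request, {
        cf: {
          cacheEverything: true, // cache even normally-uncacheable responses
          cacheTtl: 300,         // override the edge TTL (seconds)
        },
      });
    }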
Interesting, I'm glad to know that raw access to the Cache API is being added - when I contacted Cloudflare about this a number of months ago, they didn't support it at the time. For my edge CDN needs I will re-evaluate Cloudflare Workers soon.
On the first paragraph: we have shifted some computationally heavy and horizontally restricted functions from our own servers to Lambda, which allows us to instantly scale to meet our inconsistent demand. With the Lambda functions we are using, we average 5 to 11 s of execution time with approximately 800 MB of memory and heavy CPU utilization. If Cloudflare Workers ever expanded to allow a similar scope, I would definitely take a second look at it.
Have you thought about including a Golang-based Lambda function in your benchmark?
As you're guessing that Cloudflare's superior JS runtime plays a big role, it could be interesting to see if it can compete against a Golang Lambda as well.
A lot of our performance benefit comes from lighter-weight sandboxing using V8 instead of containers, which makes it feasible for us to run Workers across more machines and more locations. It wouldn't surprise me too much if a Worker written in JS can out-perform a Lambda written in Go, as long as the logic is fairly simple. But I agree we should actually run some tests and put up numbers... :)
On another note, currently we only support JavaScript, but we're putting the finishing touches on WebAssembly support, which would let you run Go on Workers... stay tuned.
Just curious: The article mentions V8 isolates. Do you actually also run all IO of the worker in the same isolate? Or in a different one, and the API calls are bridged (via some webworker-like API)?
I guess one of the main challenges is that all resources are properly released when a worker is shut down. Releasing memory sounds pretty easy, if V8 does it for you. But releasing all IO resources might be a bit harder, especially if they are shared between isolates.
The Workers runtime itself is implemented in (modern) C++, not JavaScript. So, there's no need for a separate isolate -- API objects are implemented directly in C++.
In C++, memory and I/O resources are both managed through RAII. Of course, when binding to JavaScript, we often end up at the mercy of the JavaScript GC to let us know when an object is no longer reachable from JS, and the GC makes no promises as to how promptly it will notice this (maybe never). That's fine for memory (it amortizes out) but not for I/O resources. So we're back at the original problem.
Luckily, in the CF Workers environment, it turns out that all I/O objects are request-scoped. So, once a request/response completes, we can proactively release all I/O object handles bound into JS during that request/response. If JS is still holding on to those handles and calls them later, it gets an exception.
Yes, I guess a part of my question was whether the destructors/finalizers that the JS object bindings in C++ might impose are called fast enough to guarantee isolation and prevent resource leakage. Looks like in your case that happens through the request scoping.
I have a test which dives into the crypto performance, which seems to be largely driven by the amount of CPU allocated to the process (both Workers and Lambda are ultimately just calling a C crypto implementation). I'll have a longer post about it shortly, but the summary is a 128MB Lambda is around 8x slower than a Worker in pure CPU.
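For the curious, the kind of CPU-bound endpoint such a test might use looks roughly like this - the parameters and iteration count here are illustrative, not the actual benchmark:

    addEventListener('fetch', event => {
      event.respondWith(handle());
    });

    async function handle() {
      // PBKDF2 through Web Crypto: pure CPU work, timed externally by the load tester.
      const password = new TextEncoder().encode('benchmark-password');
      const key = await crypto.subtle.importKey('raw', password, 'PBKDF2', false, ['deriveBits']);
      const bits = await crypto.subtle.deriveBits(
        { name: 'PBKDF2', hash: 'SHA-256', salt: new Uint8Array(16), iterations: 10000 },
        key,
        256
      );
      // Return the derived bits as hex so the work can't be optimized away.
      const hex = [...new Uint8Array(bits)]
        .map(b => b.toString(16).padStart(2, '0')).join('');
      return new Response(hex + '\n');
    }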
This really isn't a very good benchmark. It's basically only validating the Cloudflare edge network, but the test itself is far from real-world. A service that returns the current time is not doing anything practical and borders on meaningless.
We have a post which compares CPU intensive workloads which should be ready after the American holiday. The summary is a 128MB lambda provides you with roughly 1/8 of a CPU core, which is therefore 8x slower than a worker.
Sure! I think it's fine infrastructure-wise, but the dev experience is awful. I like tools you can build, test and run locally, for example.
The fundamental problem I run into with Lambda@Edge is just that their request stages aren't a great abstraction (OpenResty/nginx has a similar problem). It really limits what kinds of problems you can solve.
> The fundamental problem I run into with Lambda@Edge is just that their request stages aren't a great abstraction (OpenResty/nginx has a similar problem). It really limits what kinds of problems you can solve.
Yes! I completely agree. Interesting that we both ended up with the Service Workers API instead. I'm really hoping that Service Workers becomes the standard for JavaScript HTTP handling in the future.
Building out storage is my current focus. The challenge is that we want to build something that actually utilizes our network of 151 locations today, and 1000's of locations in the future. If your application has users on Mars (or, New Zealand), you should be able to store their data at the Cloudflare location on Mars (or, New Zealand) so that they can get to it with minimal latency.
PS. If you're a storage expert and building a hyper-distributed storage system interests you, e-mail me at kenton at cloudflare. We're hiring.
Let me know if you need help with the Mars location in the future. I can't wait for AWS to open their utopia-planitia-1 region with SpaceX or BlueOrigin.
I'm using Cloudflare Workers to "polyfill" client hints if they're missing with cookie logic. With their addition of being able to mutate cache keys via edge workers, I find it to be an extremely powerful way of doing everything from per-device image optimization (or Google Data Saver or HiDPI support) to serving different pages for the same URI based on your requirements (and storing this in cache).
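A stripped-down sketch of the idea (the cookie name and cache-key format are made up, and cacheKey relies on the cache-key mutation mentioned above):

    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      // Use the DPR client hint if the browser sent it; otherwise fall back to a
      // cookie set by earlier client-side JS (cookie name is illustrative).
      const cookieMatch = (request.headers.get('Cookie') || '').match(/dpr=([\d.]+)/);
      const dpr = request.headers.get('DPR') || (cookieMatch && cookieMatch[1]) || '1';

      const upstream = new Request(request);
      upstream.headers.set('DPR', dpr);

      // Fold the resolved value into the cache key so each variant is cached separately.
      const url = new URL(request.url);
      return fetch(upstream, { cf: { cacheKey: url.toString() + '|dpr=' + dpr } });
    }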
What Infrastructure as Code (IaC) options exist for Cloudflare Workers? AFAICT neither Serverless nor Terraform support it. IaC is table stakes for any new part of my tech stack, and I would prefer not to code it from scratch - unless deployment/configuration is extremely easy to automate via CLI or something...
Why would you run something in a single location if you can run it everywhere for the same price though? It's not like Lambda is cheaper for being centralized.
Isolating failure domains and complying with data residency requirements are a couple reasons. Also, global reach usually means global blast radius if you screw something up.
For the specific use case you tested workers on the edge absolutely make more sense than lambda, but I think the headline is a bit click-baity.
If your function is IO heavy and your datastore isn’t globalized (as most aren’t, save Spanner and its ilk), you may prefer your function running in the same region as your data to minimize DB latency and transfer cost.
I wish Cloudflare would offer some kind of key-value store with Workers, something like Google Cloud Memorystore but globally distributed in all of their PoPs - even if it's really limited like 32 MB RAM.
I have a post which should come out later in the week which dives into the performance with CPU-intensive workloads. tl;dr is that a 128MB Lambda is about 8x slower than a Worker.
Can you also throw in some adversarial workloads? Simple proof of work using node’s builtin crypto module would be a nice benchmark for V8 isolates vs Lambda’s processes, and would go a long way convincing people that using isolates is reliable in a shared setting relative to processes/containers.
Different isolates can run concurrently on different threads, so pegging the CPU in one isolate doesn't block any others. Also, if a worker spends more than 50ms of CPU time on a request, we cancel it, terminating execution of the worker even if it's in an infinite loop. (Almost all non-buggy Workers we've seen in practice use more like 0ms-2ms per request. Note this is CPU time; time spent waiting for network replies doesn't count.)
That’s fair, the truth is some of the people here have been around since the beginning of the internet, so four decades might be more fair. Unless you add up everyone’s experience, then we’re in the millennia...
I was working on an app which serves roughly 400*10^6 API requests/day.
One cool property of our app is that it's rarely updated and mostly read.
And the goal is to achieve the lowest possible latency at the edge.
It scales beautifully; I am not sure which other architecture could help us keep this afloat with only 4 developers working on it.
So, we have a DynamoDB table which is replicated to multiple regions using DynamoDB streams and Lambda.
For us, Lambda means achieving a lot without many developers and system administrators but I understand that not all problems yield gracefully to this pattern.
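The replication piece is roughly this shape - a Lambda on the source table's stream replaying changes into a replica table in another region (table and region names here are placeholders):

    const AWS = require('aws-sdk');
    const replica = new AWS.DynamoDB({ region: 'eu-west-1' });

    exports.handler = async (event) => {
      for (const record of event.Records) {
        if (record.eventName === 'REMOVE') {
          await replica.deleteItem({
            TableName: 'api-data-replica',
            Key: record.dynamodb.Keys,
          }).promise();
        } else {
          // INSERT and MODIFY both carry the full new image when the stream is
          // configured with NEW_AND_OLD_IMAGES (or NEW_IMAGE).
          await replica.putItem({
            TableName: 'api-data-replica',
            Item: record.dynamodb.NewImage,
          }).promise();
        }
      }
    };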
It seems using Cloudflare Workers to trigger our Lambda function instead of API gateway could prove to be cheaper.
Would it be possible to read the data through the Cloudflare cache? If so, your data and API would be replicated not just around the world but actually within the majority of the world's ISPs. Based on our experiences with 1.1.1.1, Cloudflare is within a 20ms roundtrip of most people on earth.
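Once the raw Cache API lands, reading through the cache could look something like this sketch (the origin URL is a placeholder):

    addEventListener('fetch', event => {
      event.respondWith(handle(event.request));
    });

    async function handle(request) {
      const cache = caches.default;

      // Serve from the Cloudflare POP's cache when possible.
      let response = await cache.match(request);
      if (!response) {
        // Cache miss: hit the API origin, then store a copy at the edge.
        const url = new URL(request.url);
        response = await fetch('https://api.example.com' + url.pathname + url.search);
        response = new Response(response.body, response);
        response.headers.set('Cache-Control', 'max-age=60');
        await cache.put(request, response.clone());
      }
      return response;
    }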