How we processed €5M in donations in 2 hours using Cloudflare and Stripe (medium.com/we-are-serverless)
59 points by pvhee on Dec 7, 2020 | 18 comments



This post seems to go into more detail on the technical architecture.

I've read a lot of posts in this style, and oftentimes the results don't end up being that interesting - but I have to say this post is different. Nice job and good overview of all the different layers in the stack. I didn't realize there was a meaningful difference between Cloudflare Workers and Lambda, but now I'll have to check it out.


Indeed, Cloudflare Workers have much lower latency, and much, much quicker cold starts (to the point that you probably won't notice them). IMO this makes them much more useful for typical web APIs than Lambda.
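
For anyone curious, deploying one of these is tiny. A minimal Worker looks something like this (service-worker syntax, which is what Workers used as of 2020; an illustrative sketch, not code from the article - the FetchEvent type comes from @cloudflare/workers-types):

    // Minimal Cloudflare Worker sketch. Workers run as V8 isolates rather
    // than containers, which is why there is no noticeable cold start.
    addEventListener("fetch", (event: FetchEvent) => {
      event.respondWith(handle(event.request));
    });

    async function handle(request: Request): Promise<Response> {
      return new Response(JSON.stringify({ ok: true }), {
        headers: { "content-type": "application/json" },
      });
    }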


Workers and cold starts? If you count 0ms as a cold start, then sure.

Also, Workers support HTTP/3 (QUIC) and ESNI already.

https://twitter.com/eastdakota/status/1288855462931177472


My understanding is that 'when hot' AWS Lambdas are pretty quick, and so long as there is basically any traffic, they remain hot? Do you have any details?


There are also cold starts every time the number of concurrent requests increases. Also: lambdas are only really useful for the low-to-no traffic situation. If you have constant traffic you're better off using a small VM.


"Also: lambdas are only really useful for the low-to-no traffic situation. If you have constant traffic you're better off using a small VM."

I don't understand this.

The entire point of Lambdas is to offload the overhead and complexity of managing servers, and to allow for large occasional spikes in traffic.

Even with the limitation of '5 second warmup per concurrent Lambda', that's not a big deal. It means maybe 100-200 users have to wait a few seconds while 100-200 Lambdas warm up - and with a 1.5M user spike ... you really don't know what you're going to get, but with Lambdas at least you're pretty much guaranteed it will work.

For cost efficiency, with stable tech, stable/predictable traffic, and enough scale that you have the right team to manage your EC2s properly - yes, that makes sense. But you need some scale to get to the point where that cost efficiency is worth it, given that Lambdas 'just work' fairly well, fairly easily.

But I definitely could be missing something.


I believe what is being implied is that because Cloudflare Workers operate inside a much lighter construct, they are able to burst to higher concurrency in shorter time windows. Lambda can burst to 3,000 concurrent executions, and then an additional 500 per minute after that.

https://docs.aws.amazon.com/lambda/latest/dg/invocation-scal...
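
For a feel of what that ramp means in practice, here's a back-of-envelope sketch (my own numbers plugged into the documented limits, nothing from the article):

    // Rough time for Lambda to reach a target concurrency, assuming a
    // 3,000 initial burst plus 500 additional concurrent executions/minute.
    function minutesToScale(target: number, burst = 3000, perMinute = 500): number {
      return target <= burst ? 0 : Math.ceil((target - burst) / perMinute);
    }

    console.log(minutesToScale(3000));  // 0  - within the initial burst
    console.log(minutesToScale(10000)); // 14 - a TV spike could outrun this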

Based on the scale mentioned in the article (hundreds of RPS) it's likely Lambda would also have been able to handle it just fine.

On another note, using non-provisioned infrastructure (aka "serverless") for an expected bursty load (TV campaign) is bordering on negligence. It sounds like lots of potential donations were missed because the Stripe account was not set up to cope with the load. It turned out to be a wash because Stripe donated 100k but if you change the business context of this system from "receiving donations" to "taking credit card payments" ... this outcome would not be considered acceptable.


Thanks - I now see in more detail how Cloudflare Workers may be more performant than Lambda.

But why is running a bursty tv load on serverless negligent?

"the Stripe account was not set up to cope with the load." - I've used Stripe and I don't know what this means. I don't see any 'product' limitations towards accepting many payments, that said, they could have some internal financial controls which should have been accommodated by calling ahead and letting them know about the burst.

And how would there be a technical limit with Stripe? If payment processing was directed to Stripe.com - surely they can handle the traffic.


"Why is running a bursty load on serverless negligent?"

Because Lambda and any other FaaS platform are always going to have some limitations on how fast they can scale out. When you put stuff on TV, this is as bursty a load as it gets. You are literally telling everyone viewing that TV channel at that time to pick up their phone and browse to some link.

If you are expected to handle all this load, the only way to ensure you can is to over-provision. The FaaS may promise some burst rate, but ultimately it's a shared hosting platform, and if they have to choose between giving you your 3,000 containers or keeping the other X customers running on those same boxes healthy, they will always opt for the latter.

Worst case for them, they refund you the few bucks you paid for Lambda during that time. Worst case for you is you miss a ton of payments/donations/whatever. Welcome to the cloud.

The Stripe API has rate limits of 100/sec: https://stripe.com/docs/rate-limits

They put that in place to protect themselves from getting overloaded, and I'm sure these limits are more than enough for typical use of Stripe, which would be well spread out throughout the day. But when you put something on TV, 100 RPS is nowhere near enough.
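
If you do have to live within that limit, the usual move is to back off and retry on 429s rather than drop the payment. A sketch using the stripe-node library (the donation payload and retry policy here are made up for illustration, not anything from the article):

    import Stripe from "stripe";

    const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, {
      apiVersion: "2020-08-27",
      maxNetworkRetries: 2, // stripe-node can also retry transient failures itself
    });

    // Retry rate-limited donation attempts with exponential backoff.
    async function createDonation(amountCents: number, attempt = 0): Promise<Stripe.PaymentIntent> {
      try {
        return await stripe.paymentIntents.create({ amount: amountCents, currency: "eur" });
      } catch (err) {
        if (err instanceof Stripe.errors.StripeRateLimitError && attempt < 5) {
          await new Promise((r) => setTimeout(r, 2 ** attempt * 250)); // 250ms, 500ms, 1s, ...
          return createDonation(amountCents, attempt + 1);
        }
        throw err;
      }
    }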


Ok thanks for that.

Stripe: 100/second seems like perfectly ample headroom to support their payments. At $10/payment, that's $1K/s, which is $3.6 million in one hour.

If that's 'The Stripe Limit' that every Stripe customer has, wouldn't it kind of imply that's plenty enough for even much bigger customers?

As for 'FaaS' scaling - as far as I knew, that was the entire point of Lambdas: that they could scale quite quickly.

The alternative, EC2s, would also need to 'scale very quickly' and could very well run into the same 'prioritization' problems, no?

In reality, I don't think there's a problem here with Lambda - while possibly not the most ideal option - I think that AWS has ample overhead to supply this little company with their little burst.

Amazon accommodates some pretty big workloads. A call to AWS support may very well have supplied them with the answer as to the real limits on scale.



Small point: what really is the advantage in pushing those services 'to the edge'?

AWS lambdas, when they are warm, are fast.

Supposing there is 100ms of extra response time due to non-edge AWS Lambda for whatever reason - is 100ms really the issue when you're facing a blast of 1.5M users? Isn't it the sheer onslaught of scale that is the concern?

It'd seem that hosting static content on S3 + Lambda + Stripe would basically handle the task no problem.

Or am I missing something?


If you can push services to the edge without adding complexity to your setup, why wouldn't you? The simplicity (and pricing model) of running code on Cloudflare Workers should be reason enough to do this.

I agree that AWS Lambda would have been more than sufficient for this project, together with S3 for static hosting. However, the advantage of CF Workers over Lambda is that you combine the CDN, static hosting, the API layer and the back-end logic in one and the same worker, therefore significantly simplifying your setup.

With AWS (and without using Lambda@Edge), we'd probably have to use a combination of CloudFront, S3, API Gateway & Lambda. All of these could easily be defined in code with something like the Serverless Framework or AWS SAM, but there are a lot more moving parts here when compared to CF Workers.
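
To make the "one Worker for everything" shape concrete, it's roughly this (service-worker syntax; the /api/ route and the handleDonation placeholder are my own illustration, not the authors' actual code):

    // Single Worker handling both static content and the API at the edge.
    addEventListener("fetch", (event: FetchEvent) => {
      const { pathname } = new URL(event.request.url);
      event.respondWith(
        pathname.startsWith("/api/")
          ? handleDonation(event.request) // back-end logic
          : fetch(event.request, { cf: { cacheEverything: true } }) // static, cached at the edge
      );
    });

    async function handleDonation(request: Request): Promise<Response> {
      // Placeholder: validate the payload, then call Stripe's REST API with fetch()
      return new Response(JSON.stringify({ received: true }), {
        headers: { "content-type": "application/json" },
      });
    }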


Are lambdas limited by region? They noted in the post that they would be expecting the Irish diaspora from 130+ countries to be participating as well.


The real feat is not getting your Stripe account shut down and funds held while receiving millions in donations in a single hour.

Not a dig at Stripe in particular, but I always run into trouble with banks and payment processors before the servers start getting overloaded.


Any chance we could know the total infrastructure spend of this setup?


Total infrastructure cost is less than 100 USD, but this is an estimate.

As this was deployed directly onto RTE's Cloudflare account, we don't have the breakdown for this. However, if we count around 4 million requests at 0.5 USD / million requests, you can see that the costs from the serverless infrastructure itself are negligible, at a few dollars only.

The major "infrastructure" cost stems from Sentry, which we used for application monitoring & crash reporting. We're on the team plan at 29 USD / month with an added on-demand spend of 100 USD, which we've nearly exhausted considering the large number of events sent to Sentry. We had also enabled spike protection, which means we pay less as certain events are rate limited in any given minute, to avoid massive overspend (which we would certainly have run into).


The write-up seems to be: don't - pay some giant corps to do it.





