Meta's serverless platform processing trillions of function calls a day (2023) (engineerscodex.com)
50 points by jbredeche 7 months ago | 35 comments



Sorry, this is going to be slightly tangential to the primary topic of this post. But does anyone work on backends that heavily utilize a serverless platform? To be more specific, backend (micro)services that embrace serverless fully (for example AWS Lambda, DynamoDB, Step Functions, etc.) and minimal usage of any EC2/container-based deployments in the architecture? And to be clear, I'm not talking about some small bot or single internal app running on a serverless platform, but rather an entire complex core business product or service that embraces serverless architecture -- think Amazon Prime previously using AWS Lambda type stuff.

I'd love to know people's (objective if possible) thoughts, experiences and opinions on such an approach.


Our go-forward architecture runs everything on top of Azure Functions. We are using HttpTrigger with basic SSR web forms to deliver the solution. There is no "back-end". The functions directly serve the customer traffic (text/html GET & POST) and talk to whatever SQL database / external systems are required.
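
A minimal sketch of that shape, using the Azure Functions Python v2 programming model (illustrative only; the commenter doesn't say which language they use, and the route and field names here are made up):

    import azure.functions as func
    from urllib.parse import parse_qs

    app = func.FunctionApp()

    @app.route(route="signup", methods=["GET", "POST"])
    def signup(req: func.HttpRequest) -> func.HttpResponse:
        if req.method == "POST":
            # Parse the url-encoded form body; this is where you'd talk
            # to the customer's SQL database / external systems.
            fields = parse_qs(req.get_body().decode())
            name = fields.get("name", ["there"])[0]
            body = f"<html><body>Thanks, {name}.</body></html>"
        else:
            body = ("<html><body><form method='post'>"
                    "<input name='name'><button>Submit</button>"
                    "</form></body></html>")
        return func.HttpResponse(body, mimetype="text/html")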

We are finding that this is a much easier operational model in a B2B/SaaS consulting setting wherein our customers (banks & credit unions) must retain legal ownership over their data. Setting up a VM/container circus per customer and expecting them to keep it healthy is not something we are prepared to continue beyond client #5 or so. There is zero reality wherein we could maintain some kind of multi-tenant system.

After targeting this stack seriously for about a year, I am having a hard time coming up with a reason I wouldn't use it for everything. Not having to think about VMs and weird crusty infra management code has freed up an incredible amount of bandwidth for our development team.


I've worked on low traffic apps where the entire backend API is running in Lambda. Would I do it again? Probably not. Lambda has a bunch of weird limitations (like request and response sizes) which means it's not suitable for all use cases. Additionally, you should avoid coding directly to the Lambda API. You want something to abstract that and make it look like a "normal" app. Otherwise, you wind up with a bunch of stuff you can't easily run locally.
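
One common way to get that abstraction in Python (an illustration, not necessarily this poster's stack) is an ASGI adapter such as Mangum, so the Lambda entry point is a one-line shim over a normal web app you can run locally:

    from fastapi import FastAPI
    from mangum import Mangum

    app = FastAPI()  # a "normal" app: run locally with `uvicorn main:app`

    @app.get("/items/{item_id}")
    def read_item(item_id: int):
        return {"item_id": item_id}

    handler = Mangum(app)  # the only Lambda-specific line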


I could be wrong, but from my experience, serverless has only one crucial advantage over serverful: the ability to instantly scale up on a moment's notice. You deploy some code to an established cloud provider, the cloud infrastructure replicates the code across hundreds of compute instances in a hundred geographical locations (possibly utilizing a cloud data store, synchronized internally across all the compute nodes used) and suddenly there is no longer a single performance-sensitive "failure point". Your product can get on the front page of HN and survive.

When it comes to serverful, you have to provision (or perhaps "over-provision") resources ahead of time. If you can estimate your CPU/memory/disk/network needs beforehand, you can just as well rent an appropriately powerful, "serverful" server to achieve your goal. If you figure out the particular number of API requests per second that you need to handle, benchmark a set of hardware resources that can likely handle that throughput, and then calculate the per-request price of the same throughput with a serverless provider, you will probably find that serverless is at least twice as expensive as serverful.
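
As a back-of-envelope illustration of that comparison (every number below is an assumption, not a quote):

    TARGET_RPS = 500           # desired peak requests/second
    BENCH_RPS = 800            # measured throughput of one rented server
    SERVER_MONTHLY = 80.0      # $/month for that server
    FAAS_PER_MILLION = 5.0     # assumed all-in $/1M requests (gateway + compute)

    servers_needed = -(-TARGET_RPS // BENCH_RPS)       # ceiling division -> 1
    serverful_cost = servers_needed * SERVER_MONTHLY   # $80/month
    monthly_requests = TARGET_RPS * 0.2 * 86_400 * 30  # assume 20% average load
    serverless_cost = monthly_requests / 1e6 * FAAS_PER_MILLION  # ~$1,296/month

With these particular numbers the fixed server wins by a wide margin at sustained load; the picture flips when traffic is spiky or near zero.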


Yup. Entire core business product for a successful startup, though it's a small team of contributors (<10) and a much smaller platform team. Serverless backend started in 2018. Been a blessing in many regards, but it has its warts (often related to how new this architecture is, and of course we've made our own mistakes along the way).

I really like the model of functions decoupled through events. Big fan of that. It's very flexible and iterative; keep that as your focus and it's great. Be careful of duplicating config and look for ways to compose/reuse (duh, but definitely a lesson learnt). Same with CI: structure your project so it can use something off-the-shelf like serverless-compose. Definitely monorepo/monolith it; I'd be losing my mind with 100-150 repos/"microservices" with a team this size. If starting now I'd maybe look at the SST framework[0], because redeploying every change during development gets old fast.
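
As a rough Python/AWS sketch of that decoupled-through-events shape (handler and event names made up, not this poster's code):

    import json
    import boto3

    events = boto3.client("events")

    def place_order(event, context):
        # Function A: do one job, then emit an event instead of calling
        # function B directly (assumes an API Gateway proxy event).
        order = json.loads(event["body"])
        events.put_events(Entries=[{
            "Source": "shop.orders",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps(order),
        }])
        return {"statusCode": 202}

    def send_receipt(event, context):
        # Function B: subscribed to OrderPlaced via an EventBridge rule;
        # it knows nothing about function A.
        order = event["detail"]
        print("emailing receipt for order", order.get("id"))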

I couldn't go back to any other way, to be honest, for cloud-heavy backends at least. By far the most productive I've ever been.

Definitely has its warts though, it's not all roses.

[0] http://sst.dev


Are you using abstraction frameworks/libraries (eg https://www.serverless.com/framework) or using AWS/Azure/GCP directly?


We use the Serverless Framework and are quite happy with it. TBH I mentioned SST in my first comment, but I'm not super experienced with it, and I'm reasonably happy with serverless.

I'd definitely be cautious in my own decision to try out SST, so I'd recommend the same caution to anyone taking my suggestion seriously.


Oh, and we only deploy functions and their IAM permissions with serverless. All other AWS resources are managed by Terraform. I think this was wise.


I’ve worked on a product where half the backend (basically, the control plane but not the data plane) was completely serverless: Lambda, DynamoDB, Step Functions, API Gateway.

I think it was generally convenient, being very low maintenance. The only time availability canary alarms fire is when the underlying services go down, which is almost never.

Debugging and running locally are the weak points. Eliminating an entire class of things to think about (container health, scaling in and out, and so on) is very nice. Since it’s a control plane, TPS and costs were low.


> But does anyone work on backends that heavily utilize a serverless platform?

As far as I know, that scenario is not in the decision tree for when to go with function-as-a-service offerings. FaaS is for things like glue code for event handlers, low-traffic services like API Gateway/NLB calling Lambdas, and expensive tasks that can be offloaded from a worker thread to a Lambda.

When traffic going into a FaaS-based service starts to go up, the cost of running Lambdas stops being competitive and you're better off just having a dedicated service handle the work.

Nevertheless, the main selling point of FaaS is not performance, throughput, or even developer convenience. The main selling point is hardware utilization. If your dev teams deploy FaaS stuff instead of dedicated services, the company doesn't need to provision anything to get them to run. The functions are just deployed in the sea of FaaS things they have been running, and they can live with high hardware utilization just fine.


I ask because it's relevant to me, but not in my control. And we're still in the early stages and haven't hit that "when traffic going into a FaaS-based service starts to go up, the cost of running Lambdas stops being competitive" stage yet -- if what you're describing will be applicable and true in my situation. Sorry I can't go into specifics, as it wouldn't be wise for me to discuss work stuff in more detail.

So that's good to know. Do you have any references, in terms of further reading material, for the crossover point where it stops being competitive? I highly doubt $cloud_provider is going to give you such an analysis, for obvious financial reasons.


> Do you have any references, in terms of further reading material, for the crossover point where it stops being competitive?

I'm afraid I don't. The most objective cutoff point I can come up with is based purely on cost. Meaning, cloud providers like AWS might offer a free tier for API Gateway and Lambda which makes them really competitive vs running a container or launching a VM when traffic is low. However, once your traffic starts to go beyond the free tier (IIRC per month it's 1 million requests for API Gateway and 400,000 GB-seconds for Lambda), operational costs start to overtake the cost of running a traditional service. How many requests your service handles across all endpoints and how much computational budget each Lambda spends are traits of your own service, and reflect low-level decisions such as how you designed your API and which runtime you chose for your Lambdas. Basically, this boils down to the FaaS solution eventually leading the cloud provider to charge you for each request, while your traditional service, running either in a container or on a VM, has a fixed cost. After some math you'll arrive at a traffic volume that represents the tipping point between FaaS and traditional services.
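
Sketching that math with illustrative list prices (assumptions, not current quotes, and ignoring the free tier for simplicity):

    LAMBDA_PER_REQ = 0.20 / 1e6    # $/invocation
    LAMBDA_PER_GBS = 0.0000166667  # $/GB-second
    APIGW_PER_REQ = 1.00 / 1e6     # $/request (HTTP API)
    SERVER_MONTHLY = 60.0          # assumed fixed cost of a small VM

    gbs_per_req = 0.25             # e.g. 1 GB of memory for 250 ms
    per_request = LAMBDA_PER_REQ + APIGW_PER_REQ + gbs_per_req * LAMBDA_PER_GBS
    tipping_point = SERVER_MONTHLY / per_request
    print(f"{tipping_point:,.0f} requests/month")  # ~11 million here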

Personally, my subjective cutoff point is when FaaS costs about 30% of traditional services. Once your FaaS cost starts to grow, you're tempted to micro-optimize your API and your service to try to keep its running cost justifiably low. That's when you start being tempted to put together a sucky service as a cost tradeoff. When that time comes, it's better to bite the bullet and just migrate away from FaaS.


The company I work at is using Step Functions heavily and I hate it. Instead of an if statement, they make it a new step with the conditional in JSON, so following the code requires you to jump around between files.

It has also been built by contractors who have no incentive to make it run locally or be easy to manage in production, but that isn't specific to Step Functions; it's more due to poor leadership.


I never thought about it that way. Sounds like step functions could end up as "spaghetti code on steroids" if you do it poorly.


Imagine using JSON as a programming language, and all the implications of that. AWS does provide a VS Code extension and a web renderer for the JSON so you can visually see what the flow of the step function looks like, but this is just a small improvement overall. If you split your step functions up into multiple ones, where one step function can call another and also invoke Lambdas and other things, then you can imagine you lose all the benefits of a modern code editor/IDE and you're manually looking up which file to jump to in order to find the next step in the execution path.

To be fair, AWS does provide a "CDK" which lets you write Python code that gets converted into their JSON DSL -- or something along those lines. But I haven't used that, only the direct JSON DSL "code" to write step functions.
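
For reference, that CDK route looks roughly like this in Python; the if statement becomes a Choice construct that CDK compiles down to the JSON "Choice" state (task and field names are made up):

    from aws_cdk import aws_stepfunctions as sfn

    def build_definition(scope, check_task, premium_task, basic_task):
        # Equivalent of: if tier == "premium": premium_task else: basic_task
        return check_task.next(
            sfn.Choice(scope, "IsPremium")
            .when(sfn.Condition.string_equals("$.tier", "premium"), premium_task)
            .otherwise(basic_task)
        )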


Interesting article, I wondered about this:

> It fetches function calls from DurableQs, consolidates them into FuncBuffers (ordered by importance, then execution deadline),

Since that is a classic source of priority inversions (where "important" messages overwhelm the system, resulting in less important messages missing their execution deadline): if someone reading this knows this system, can you tell me whether the queue ranking is actually rank = f(priority, deadline)? That is, where things close to their deadline can outrank high-priority things?
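
In other words, the question is whether the ranking looks something like this toy sketch (purely illustrative, not Meta's code), where shrinking slack can lift a low-priority call above a high-priority one:

    import time

    def rank(priority, deadline, now=None, urgency_weight=10.0):
        # Higher rank runs sooner. Urgency grows as the deadline nears,
        # so near-deadline work can outrank nominally more important work.
        slack = max(deadline - (now if now is not None else time.time()), 0.001)
        return priority + urgency_weight / slack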

The other question that I had but didn't get from skimming was how they do data management. Do the functions carry around all their data? Is any of that data part of a state machine? (Which is to say, mutating that data in a function results in a new "state" that would affect other functions using that data.)


The article didn't exactly talk about their setup, idle, and teardown costs, or how exactly their I/O is optimized. I am skeptical until I see those numbers.

Edit: Read the OG version here.

https://soteris.github.io/publication/sahraei-2023-xfaas/sah...


It might be cool to see a cloud platform that incentivized off-peak usage. There's presently no incentive structure for most folks to defer work to off-peak hours on any major hosted platform I know of.

I'm so excited for the webassembly future, where a huge number of shared modules can be left in memory, sandboxed, and when my function calls such a module, it's already resident & loaded.


The big thing that makes it more efficient is that the only real guarantee is that your function will execute _sometime_ in the future.

Now, those members of our fair technological coven of a certain age (who are sadly dwindling in number) will certainly recognise sending off a program to be run by a big magic massive machine and waiting for the results later on. That's batch computing.

Now, the key advantage of batch computing here, as also laid out by the paper, is that you can group stuff together to cut out VM warmup time.

BUT it also means that, because you have the luxury of waiting, you can use spare CPU cycles when they are free. As with everything, big sites have bursts of traffic, so to handle that you have lots of spare CPU cycles waiting around.

If you clearly telegraph that your function is liable to be delayed, killed, or re-run, then there is still lots of useful work that can be done without costing you more in capacity.
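
That contract can be as simple as a couple of flags at enqueue time; a toy sketch of the idea (hypothetical names, not any real platform's API):

    import heapq
    import itertools
    import time

    _seq = itertools.count()
    queue = []  # the scheduler drains this when spare CPU cycles show up

    def submit(fn_name, args, deadline_s, idempotent=True):
        # The caller telegraphs "run this anytime before the deadline,
        # possibly more than once", which lets the platform batch it.
        deadline = time.time() + deadline_s
        heapq.heappush(queue, (deadline, next(_seq), fn_name, args, idempotent))

    submit("recompute_recommendations", {"user_id": 42}, deadline_s=6 * 3600)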

It's a strongly underutilised pattern in modern computing.


Surely a lot of workload is delayable, but there is an equal, if not greater, amount of workload that does not have the luxury of waiting. Is it fair to say the serverless platform has no advantage for the latter kind of workload?


Yeah, you can't do synchronous stuff on it like you would on, say, Lambda or whatever Google calls theirs. I think calling it serverless is probably confusing to a lot of the outside world, as to them serverless is cgi-bin for the Docker generation.

This form of serverless is very much batch jobs with some modern bits sprinkled in.

Perhaps they should have called it serverless batch, or Batch-as-a-Service, async-serverless, or serverless-futures.


I never understood why these platforms, which clearly run on a server, were marketed as "serverless". They seem more akin to a distributed operating system to me.


Ultimately it's a marketing term, but it's a good one.

Of course the code ultimately runs on a server, but it's not a server that the customer has to buy, rent, maintain, or otherwise think about.


> or otherwise think about.

For us this is by far the biggest factor. Innovation tokens are the most expensive thing there is.


I think the nomenclature comes from the fact that "traditionally", before you can upload a piece of code somewhere and have it executed, you have to first set up a server, as in an instance of a particular operating system (a physical server or a VPS), hence the name "server-ful". In the "server-less" paradigm, you upload directly to an abstract cloud and have cloud infrastructure execute your code. Moreover, uploading to "a server" implies using a fixed, pre-allocated set of CPU/memory/disk resources that cannot grow dynamically, while "server-less" implies using dynamic infrastructure that can scale CPU/memory/disk resources up and down as needed and/or clone executed code across multiple internal compute instances controlled by a load balancer, and so on -- the notion of "a server" is no longer relevant.


Firstly, because you don't provision servers to run the code; the runtime will scale the infrastructure horizontally in response to traffic. Secondly, because it's not a client/server architecture from the perspective of the code being run.

Basically it's "serverless" because you the developer are not doing anything with servers. You're not writing one, maintaining one, or managing the infrastructure to run one. You drop off a blob of logic to the cloud provider and they handle everything else for you, and you pay on demand.

Functions-as-a-Service (FaaS) is another name for the same idea, but less catchy.


As a customer of serverless, you do not have to manage any servers; that is provided by the platform. Even with a platform like EKS, the customer is still responsible for selecting the instance type, OS/AMI, and any OS updates. With serverless, the customer doesn't worry about managing their servers; they just enqueue tasks (function invocations) and let the provider worry about right-sizing instance types, OS updates, etc.


Well of course they have to run on something. But I guess since you don't have to configure the server yourself, the marketing guys figured "out of sight, out of mind".


because the only other viable analogies were:

CGI-Bin

Mainframes

which were deeply out of fashion at the time.


Seems like the biggest edge they have over AWS/GCP/Azure is the ability to delay function calls to smooth demand. There's not much other providers can do to match that.


What are functions in this context? How, for example, are they different from other kinds of computation, like database aggregation queries?



They are arbitrary Python functions (or whatever other language you want to use). Serverless platforms let developers deploy simple functions without having to provision a VPS and all the stuff that comes with it (operating system, networking, etc). You get charged by the millisecond as the function runs.
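
Concretely, for AWS Lambda a "function" can be as small as this (names illustrative):

    def handler(event, context):
        # `event` is the JSON payload from whatever triggered the call
        # (an HTTP request, a queue message, a cron tick, ...).
        name = event.get("name", "world")
        return {"statusCode": 200, "body": f"hello, {name}"}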


Meta has definitely evolved from a LAMP stack!


Here's a link to the actual paper: https://tangchq74.github.io/XFaaS-SOSP23-Final.pdf



