I'm pretty unclear on the "how" here - but from what I can understand in the article, the search resilience team injected properly tagged synthetic traffic into their system to do testing? That does seem like the kind of practice that could be part of a healthy, holistic approach - but the article elides a ton of details. I suppose the idea is that it promotes AWS services (with the implication that this kind of resiliency comes easier on their platform) - but this is a great example of how good writing strips things down to the barest details. I would love to take lessons from it, but I think the details actually aren't here.
Oh... "Chaos systems"=="event-driven development". For those confused about an analogy to chaos theory... as far as i can tell, there isn't one. In physics a chaos system has small perturbations that lead to an instability. I would argue this is just a large perturbation that leads to certain instabilities. I would also classify this as "network systems fault tolerance" engineering.
As a stand-alone article it is fine, but it is likely to trip the "more fluff than stuff" alarm on many people's BS detectors.
Who's drawing parallels with chaos theory? The origin is Netflix injecting "chaos" like taking servers down randomly. It's not tied to any scientific theory: https://en.m.wikipedia.org/wiki/Chaos_engineering
"Stress testing" might be more intuitive but that's already established as simply testing under high traffic
To add to that, I've read enough of my own company's blog posts to know that Y was an incorrect solution to Z, was half-implemented at best, barely improved METRIC, and has already made a lot of people's lives harder for simply existing. So now I take every single one of these posts with a massive grain of salt.
I would love to know what software stack, hardware, and uplink connections in total they utilize to accomplish a real-world 80k request per second throughput. How many instances do you guys think Amazon runs for its primary e-commerce front-end stack? In total, and per region? Assuming they have a multi-region rollout.
If it's the real deal, and not like people saying "Bun.js can serve 65k req/s+ (cough cough, to localhost)," that's impressive.
But I never see anyone talk about real-world numbers. Just synthetic poopoo.
I think I read the article correctly, but it only talks about how they introduced "chaos engineering"; I don't recall them talking about how they actually handle a volume of traffic like 80k req/s.
> If it's the real deal, and not like people saying "Bun.js can serve 65k req/s+ (cough cough, to localhost)," that's impressive.
Not all req/s are made the same.
Amazon search is made of 100s of services, and Amazon's search page loads 20 products per page, so 80k search req/s translates to 1.6 MM product API req/s, for example.
FWIW, a search request at Amazon hits roughly 100 unique search clusters (think of these as ElasticSearch clusters - but it's not ES), with different product groupings in each cluster. Each cluster is made up of 1000s of nodes running Lucene (think something similar to ES shards). This is just for the "match set", i.e. which products to return.
Then there are services to re-sort those matched products based on popularity, likelihood of purchase, etc. (think giant ML models). Then there are product lookups. Before all of this, there is query analysis to simplify/improve the query (again, giant ML models), e.g. classifying "Apple" into electronics vs. groceries based on the other keywords and your current context.
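A minimal sketch of that match-set scatter-gather shape, purely to illustrate the fan-out (the ~100-cluster count comes from the comment above; every name, the timeout, and the Go framing are my own illustrative assumptions, not Amazon's actual code):

    // Hypothetical scatter-gather over N search clusters.
    package main

    import (
        "context"
        "fmt"
        "sync"
        "time"
    )

    type MatchSet struct {
        Cluster string
        ASINs   []string
    }

    // queryCluster stands in for an RPC to one search cluster; stubbed here.
    func queryCluster(ctx context.Context, cluster, query string) (MatchSet, error) {
        return MatchSet{Cluster: cluster, ASINs: []string{query + "-hit-from-" + cluster}}, nil
    }

    // scatterGather fans one query out to every cluster and merges the partial match sets.
    func scatterGather(ctx context.Context, clusters []string, query string) []MatchSet {
        ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
        defer cancel()

        var (
            mu     sync.Mutex
            merged []MatchSet
            wg     sync.WaitGroup
        )
        for _, c := range clusters {
            wg.Add(1)
            go func(c string) {
                defer wg.Done()
                ms, err := queryCluster(ctx, c, query)
                if err != nil {
                    return // tolerate partial failure; rank whatever arrived in time
                }
                mu.Lock()
                merged = append(merged, ms)
                mu.Unlock()
            }(c)
        }
        wg.Wait()
        return merged // handed to the re-ranking stage next
    }

    func main() {
        clusters := make([]string, 100) // roughly 100 clusters, per the comment above
        for i := range clusters {
            clusters[i] = fmt.Sprintf("cluster-%02d", i)
        }
        fmt.Println("partial match sets:", len(scatterGather(context.Background(), clusters, "apple")))
    }

The partial-failure handling in the collection step is exactly what chaos-style experiments exercise: what the merged result looks like when some of those clusters are slow or missing.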
Meanwhile, Bun.js is talking about 65k "hello world"-type req/s. The compute per req is orders of magnitude different.
> Not all req/s are made the same.
> Amazon search is made of 100s of services, and Amazon's search page loads 20 products per page, so 80k search req/s translates to 1.6 MM product API req/s, for example.
That's their problem, no? Nobody's forcing them to have an architecture where a request propagates to hundreds of services.
Why is it anyone's "problem"? Nobody said they're being forced to do it this way - just that they are. And I guarantee that there are thousands of other companies out there that have an API-fanout model as well, and might be interested in how Amazon does it.
I don't get the hostility around this article. Nobody is forcing you to read it or to do it this way. If your system is architected in a different way where you can run your whole system on a single instance, then good for you! But Amazon presumably doesn't have that luxury, and others may not either.
> That's their problem, no? Nobody's forcing them to have an architecture where a request propagates to hundreds of services.
I mean, physics kinda is.
Just the volume of data that needs to be hosted requires multiple nodes. ElasticSearch has some good general documentation on search engines if you want to learn more. In Amazon's case, a general query hits a fan-out of about 10,000 nodes - I wasn't even counting that fan-out above because it is all technically one service.
Meanwhile, Bun.js (no offense to Bun) is built for I/O-bound workloads, and 65k req/s is great for that. However, executing the required natural language processing (i.e. multiple ML models) would exhaust the CPU (even if Bun could distribute the compute across cores or run the ML inference on a GPU). I'd be willing to bet that even on a great CPU it gets throttled to 10 req/s (at most 100 req/s).
This really isn't a "they should have just done it all in Postgres; Bun.js can just use a nice ORM and stay I/O-bound" type of situation - which is a philosophy I very much agree with for 99.9999% of use cases.
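To put a rough number on the CPU-bound point, here is a hedged back-of-envelope sketch (the core count and the 50 ms of CPU per request are invented assumptions for illustration, not measured Amazon or Bun numbers): once every request needs real model inference, per-node throughput is capped by cores divided by CPU time per request, regardless of how fast the I/O layer is.

    // Hypothetical capacity estimate for a CPU-bound inference service.
    package main

    import "fmt"

    func main() {
        cores := 16.0         // assumed cores on the node
        cpuSecPerReq := 0.050 // assumed 50 ms of CPU for query analysis + ranking models
        fmt.Printf("upper bound: ~%.0f req/s per node\n", cores/cpuSecPerReq) // ~320 req/s, nowhere near 65k
    }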
My favorite kind of HN answer. 100% snark, 100% certain, 100% wrong.
So, instead of that, what would you have? A single binary that makes multiple DB calls? Hmm, let's see, some people have tried that and written extensively about the problems they faced doing it. Wait, one of them is actually a small e-commerce firm named after a large river. Wonder what the issues were...
Amazon doesn't really have a "primary e-commerce front-end stack" in any concrete sense. They have hundreds/thousands of teams that deploy bits and pieces to a massive pipeline that ultimately makes up what you see on Amazon.com, but each team can have their own infrastructure backing things. Some teams might run everything off a dozen low-end EC2 instances while another sibling team has 3k+ instances; it's really all over the place, and that's ignoring specific events like Black Friday or Prime Day, etc. where teams need to prescale things in advance.
> I would love to know what software stack, hardware, and uplink connections in total they utilize to accomplish a real-world 80k request per second throughput. How many instances do you guys think Amazon runs for its primary e-commerce front-end stack? In total, and per region? Assuming they have a multi-region rollout.
> But I never see anyone talk about real-world numbers. Just synthetic poopoo.
The number probably changes all the time based on load. They'll never release these details because it's a competitive advantage to have the "how popular are they in $place at $time-of-day" data private.
When they do share numbers, it'll always be the most flattering and devoid of any context beyond the "wow" factor.
While I can understand the cynicism, the real answer is much more boring: most people just don't care about the actual numbers. If they were released, they'd be interesting to a small few, but generally no one would actually care.
There's also a common misconception that Amazon.com is somehow just this one giant app running on a set of servers, which isn't remotely how it's actually deployed, and that's before we spend time arguing over whether a given team's instances even count as "primary e-commerce front-end stack" or not. :P
Also their "stack" includes various degrees of AWS (maws, naws, and I'm sure a bunch of snowflake situations innumerable here, I mean do you count corpinfra? Controls?)
I'm not sure it's the biggest deal in the world, plus real-world market presence data is regularly detailed in shareholder reports for various companies.
You could take low-end instance specifications and standard industry stacks and extrapolate forward how many instances they might need to maintain at a maximum, but those numbers are going to be off.
Are they running 400 low-end front-end instances across the globe? Probably (plus the 40 or so other services they claim to need, multiplied by region count at a minimum), and that would actually be well below what's realistic and reasonable for a company like Amazon. You can take a bunch of regional instances that each handle roughly 200 req/s and make that work.
I used to work on a system that did about 55k req/s at peak. The service was internal, only handling gRPC calls coming from inside our VPC, and it was written in Go. Its main job was reading and writing to a SQL DB that was sharded across 3 or 4 of the biggest instances AWS offered at the time (2017-ish).
Everything was Dockerized, and I think we were using Docker Swarm for container orchestration. I don't remember the specs for each box, but we had autoscaling set up, so at peak we'd hit a little over 200 containers.
Looking back now, I'm sure we could have gotten much better performance out of that service, but the team was young and inexperienced and throwing money at the problem was an easier solution.
Having led API teams at a big tech company where we handled similar (slightly lower) request numbers at the edge, I can tell you that the entire stack was engineered to be defensive and to handle a certain amount of load. As other commenters have said, fan-out means that 80k qps at the edge probably becomes 10-100x that on the most heavily hit backend systems. A lot of the work we did was very aggressive caching and sharding, plus autoscaling to handle load spikes.
Observability was our secret sauce. We monitored everything: our caches, NICs, our load balancers, etc. Cache hotspotting and DB problems were what kept us up at night, though my teams didn't deal with much stateful data.
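For anyone curious what "very aggressive caching" tends to mean in practice, here is a minimal cache-aside sketch (the names, the TTL, and the in-process map are all illustrative assumptions; a real deployment would usually sit in front of something like Redis or Memcached):

    // Hypothetical cache-aside lookup: serve from cache, fall back to the backend, repopulate.
    package cacheaside

    import (
        "sync"
        "time"
    )

    type entry struct {
        val     string
        expires time.Time
    }

    type Cache struct {
        mu   sync.RWMutex
        data map[string]entry
        ttl  time.Duration
    }

    func New(ttl time.Duration) *Cache {
        return &Cache{data: make(map[string]entry), ttl: ttl}
    }

    // Get returns a cached value, or loads it via loadFn and caches the result.
    func (c *Cache) Get(key string, loadFn func(string) (string, error)) (string, error) {
        c.mu.RLock()
        e, ok := c.data[key]
        c.mu.RUnlock()
        if ok && time.Now().Before(e.expires) {
            return e.val, nil // cache hit: the backend fan-out never sees this request
        }
        val, err := loadFn(key) // cache miss: one backend call absorbs the fan-out cost
        if err != nil {
            return "", err
        }
        c.mu.Lock()
        c.data[key] = entry{val: val, expires: time.Now().Add(c.ttl)}
        c.mu.Unlock()
        return val, nil
    }

Every hit is a request the heavily fanned-out backends never see, which is how an edge tier survives the 10-100x amplification; the flip side is the cache hotspotting mentioned above, where a few hot keys concentrate load on a handful of nodes.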
Well, to get actual impact you'd need infrastructure-wide tracing, and that's hard.
Like, one request could hit a cache and serve 95% of the page from it, while another hits some long path that burns half a second across 20 servers in the backend to serve some big query.
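The per-hop mechanics are at least well-trodden; here is a hedged sketch using OpenTelemetry as a stand-in (my assumption, not necessarily what any given shop runs) of how one outbound call forwards the trace context so that half a second burned across 20 backend servers can be attributed to a single request:

    // Hypothetical outbound call that forwards the trace context (OpenTelemetry assumed).
    package tracedemo

    import (
        "context"
        "net/http"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
    )

    // callBackend starts a child span and injects the trace context into the
    // outgoing request headers, so the downstream service can attach its own
    // spans to the same trace.
    func callBackend(ctx context.Context, url string) (*http.Response, error) {
        ctx, span := otel.Tracer("search-frontend").Start(ctx, "callBackend")
        defer span.End()

        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        // Propagate the trace ID and parent span ID as W3C trace-context headers.
        otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
        return http.DefaultClient.Do(req)
    }

The hard part is less this code and more getting hundreds of services to propagate and report the context consistently, and then storing and querying the resulting volume of spans.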
I mean, coralmetrics+pmet has been doing that infrastructure-wide tracing for decades (albeit being slowly replaced in spots now).
Back in 2014 they were still sharing a detail-level service call graph (already a few years old at that time) that had so many nodes and lines it looked like string art.
And a hundred different identical products at vastly different prices from brand-new sellers like `z-qq-yadonk-8771` that somehow have 4.7k reviews at 4.5 stars.
I feel like my eyes were just assaulted trying to read that white-on-gold mobile page. I was unable to read the article because the accessibility-hostile theme disables reader mode on mobile Safari.
It dawned on me that in web software, people talk about req/s from two entirely different perspectives and it's borderline fraud:
req/s from localhost to localhost, and req/s from the Internet to any user.
The latter is actually interesting. People saying you can get 10k req/s from Node.js is stupid. You're not actually getting that on say, a single low-end instance over the Internet, which is what most developers are actually going to do.
Instead, you'll get two orders of magnitude fewer requests per second.
What Amazon is talking about here is most likely non-synthetic, real-world 80k requests per second. Which is actually a decent job.
> People saying you can get 10k req/s from Node.js is stupid.
No, it's not, for exactly the reason you state:
> You're not actually getting that on say, a single low-end instance over the Internet
Some languages are, of course, more efficient, but it doesn't matter - you can get very good performance out of any language/runtime - it's all about your architecture and infrastructure.
Where are you saying the difference would exist? I haven't seen local network tests perform worse than localhost (usually it's better, since the client itself uses a lot of CPU). Why would Internet latency matter? TCP ACKs should be handled by the load-balancing appliance, so they'll be low-latency for the application. TLS handshakes should also be offloaded to the appliance.
From what I've measured, code I've written performs around the same in production as it ran locally given similar hardware. If you're deploying to a VM with 3000 IOPS and 1/2 a CPU core, obviously it's going to run like garbage. If you wouldn't run your business on a raspberry pi 3, you probably shouldn't be running it on an AWS xlarge instance either.
These aren't requests for TCP ACKs to establish a session, nor even requests for a simple static resource. They're requests for the live status of an inventory of physical goods spread across thousands of distribution centers on six continents that are themselves gaining and losing thousands of products per second. A system that can return a reasonably accurate view of that state 80k times per second is not the same thing as a system that can send 80k http responses with "Hello ${NAME}" per second.
I'm not talking about just establishing a session. My question there was just why Internet vs. local would be different. On a local network, I've gotten 70k JSON CRUD requests per second out of a Netty-based service + PostgreSQL with 4 cores and a single SSD.
I imagine search is more complex and expensive than CRUD, but 80k isn't something you can only do with a "hello world" tier application.
It's not the best metric. Just responding to 80k req/s with static in-memory content is easy nowadays. If there are complex database queries you have to finagle into serving 80k/s, then that's the interesting part.
About a decade ago, Opera Mini did 150k transcoded full pageloads/s (times about 30 inlines per pageload, which was the average back then, so about 4.5 million requested/loaded/processed/compressed HTTP resources/s).
(All of the public Google Search numbers I've seen have seemed one or two orders of magnitude too small. Or maybe most people don't use their search engine/browser as much as I do, so my perspective is skewed...)
From my experience with that scale of traffic (with Opera Mini at the time about 250M MAUs and 150k full pageloads/s):
There is surprisingly little seasonal variance. You have your weekly/daily traffic rhythms based on when your users are awake/active based on their geographical distribution and that's mostly it.
"World events" also have very little impact - they tend to barely make a dent in that massive background noise.
Before we had large volumes of traffic I thought we'd be seeing all sorts of unusual peaks; after a few years I realized growth at scale tends to become boring (but in a good way).
For example, it would be just as true for this title to have said "How Amazon uses ... to handle 1.6 MM requests per second, from just the search page."
Each search page load is 1 request to the search backend, but a 20x request fan-out to the products' key-value store to render the images, titles, etc.
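Spelled out with the numbers from this thread (the multiplication is the only claim here; the rest is just scaffolding):

    // Back-of-envelope fan-out math from the comment above.
    package main

    import "fmt"

    func main() {
        searchReqPerSec := 80_000.0 // edge search requests, per the article
        productsPerPage := 20.0     // items rendered per search results page
        fmt.Printf("product key-value lookups: %.1f MM req/s\n",
            searchReqPerSec*productsPerPage/1e6) // 1.6 MM req/s
    }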
I feel like Amazon search is one of the worst products I've ever used. It is a clusterf/ck of paid advertisements and obviously gamed results. I don't care how many requests/sec you get. If the results are horrible, what does it matter?
Funny story: internally, Amazon Search doesn't consider the ads products to be part of the "search results". They are tracked and accounted to Ads.
The way ads are handled on Amazon is really poorly done. The Ads teams claim to make a lot of money (based on the internal accounting tricks they do), and as such have been pushing Amazon's leadership to go deeper into ads, even though every person I've met who worked at Amazon also hated the prevalence of ads.
Literally, directors and VPs at Amazon are afraid to step on the toes of Ads' leadership team because of how well they have told the story that "Ads is excessively profitable".
Meanwhile, all of us in this thread can easily say that even if it is profitable in the short term, it most certainly is not profitable for Amazon in the long term.
Both internally and externally, it has been very disappointing to watch.
> But that’s almost barely a problem compared to the gamed reviews
Which pales in comparison to the problem of counterfeit goods, IMO.
I can at least somewhat comb through the reviews to look for the well-written outliers. Getting something that's obviously a fake (which has happened to me multiple times) is completely unacceptable.
Newegg has this issue too, I got a knock-off Intel CPU there once, I was furious.
I'm sick of it being impossible to distinguish cheaply made products from high-quality, durable products on Amazon. The rating system is flat-out broken, and there's an entire industry built around gaming those ratings.
I'm at the point that I rarely ever buy products on Amazon anymore. It's a total disgrace. On an ethical level, I wish I had the ability to say "I only want to be presented with results that weren't made in China or other slave societies".
Contrary to popular belief, Amazon actually does put energy into making sure products are responsibly sourced. Products are de-listed if they're found to come from unethical sources.
To take that even further, take a look at Climate Pledge Friendly. Those are products with (at least one) third-party certification. These certifications don't just further climate goals; social responsibility is also considered, including worker conditions and product durability. You can filter search results by this attribute, though admittedly it can be hard to filter for specific certifications.
A giant portion of Amazon's products are made in China by a populace that's enslaved by their totalitarian dictator. Please allow me to identify which products are made in China, and other communist/totalitarian states, so that I can exclude them from my searches.
Is that Amazon's fault? So many once-reputable brands have been MBAed to death and are now indistinguishable from the bottom-tier garbage. It is near impossible to do real product research on anything, anywhere.
Surely a company with the resources of Amazon can distinguish fake reviews from real ones, especially the bait-and-switch listings where the product has changed and half the reviews aren't even for the current item.
These days it should be assumed that the quality of anything bought on Amazon is dogshit. Which, even though I canceled Prime years ago for ethical reasons, is honestly the biggest reason I won't even click an Amazon link.
technically they said 'chaos engineering'... which obviously means they use monkeys slapping their hands on keyboards to write the code that returns the results.
This would be awesome... When I'm down a rabbit hole trying to find a product and I see three or four of roughly the same design, I know not to bother any further unless I can find a reliable manufacturer website.
Your article shows you put in a lot of work, and it was written well. But... a few thoughts... Service Owners usually tend to design and build their systems in a way which allows them to understand and know how the system will behave under the worst conditions. It is not difficult to test these conditions either. A bash script utilizing curl will suffice.
If you have to use "Chaos Engineering" to experiment your way into innovation, that is a sign you built your service wrong. What will Amazon Re-Invent next!? My guess is the wheel. Well-written article, though.
I might be misunderstanding what you consider one, but in my experience a bash script driving curl is great for some types of API or front-end load testing.
However, it won't necessarily help you know how your system will behave if S3 kicks the bucket in us-east-1 (again), your image host for that super-cool Kubernetes cluster suddenly throttles you during a critical restart, or your other service of choice went down due to an expired certificate.
If you however mean to use it to perform a denial of service on an endpoint you don't own, you're more hard-core than I thought.
> Service Owners usually tend to design and build their systems in a way which allows them to understand and know how the system will behave under the worst conditions.
With you so far
> A bash script utilizing curl will suffice.
Lol, hell no. Yes, AWS/Amazon does require a "GameDay" before launching a service, which executes an MCM (managed change management) that's basically a runbook (a way more in-depth and comprehensive way to test your service than a single bash script with curl), but chaos engineering is a great additive, an additional verification mechanism that really helps with service outages.
How are you going to test a thundering herd with a bash script executing curl?
How many machines are you running this simple bash+curl script on anyway? Using a single node to generate requests isn't going to do much in testing a service's reliability.
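For contrast with a curl loop, here is roughly the shape of what chaos tooling adds: a hedged sketch (all names and the Go/HTTP framing are my own invention, not Amazon's GameDay tooling) that injects latency and errors into a dependency call path, the kind of failure a load test against a healthy system never produces.

    // Hypothetical fault injection: the failure mode a curl loop can't produce.
    package chaosdemo

    import (
        "errors"
        "math/rand"
        "net/http"
        "time"
    )

    // FaultConfig describes the injected failure, e.g. "add 2s of latency to 5%
    // of calls to a dependency" or "fail 1% of them outright".
    type FaultConfig struct {
        LatencyProb float64
        Latency     time.Duration
        ErrorProb   float64
    }

    // FaultyTransport wraps an http.RoundTripper and injects faults per the config.
    type FaultyTransport struct {
        Base http.RoundTripper
        Cfg  FaultConfig
    }

    func (t *FaultyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        if rand.Float64() < t.Cfg.LatencyProb {
            time.Sleep(t.Cfg.Latency) // simulate a browning-out dependency
        }
        if rand.Float64() < t.Cfg.ErrorProb {
            return nil, errors.New("injected fault: dependency unavailable")
        }
        return t.Base.RoundTrip(req)
    }

Wrapping a dependency client with something like this exercises the caller's timeouts, retries, and fallbacks under partial failure, which is the part a thundering herd of curl requests against a healthy endpoint can't tell you about.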
I am sorry, do you think building a reverse index for a billion products fits in one machine? Is that seriously the amount of thought you put into this comment?
Can anyone calculate the dollars-per-request revenue and profit? Would be interesting to see how much it costs Amazon to make money and make some connection to the request rate.
Back in The Day, rumor had it the detail page hosting/rendering would easily max out a single machine after only a handful of queries per second. I regrettably can't verify whether that was true, or whether it still holds, but there is a LOT going on with a given /dp page. "How many of those requests translate to actual purchases?" is the next question.
I do recall the simplestack version of dpx had some obscene issues with garbage generation, followed by some obscene issues with throughput due to aggressive locking (to avoid the obscene amount of garbage generated).
I am not surprised. That platform was a neat concept, but wow, the nesting and resource consumption were atrocious. There was potential there, yet the implementation in Java made, IMHO, some fundamentally flawed assertions about small objects. I unfortunately know of a great director who had to find new opportunities outside the company over it.
IIRC there was some memory-leak issue in the custom rendering engine/language being used, and for a time the solution was to reboot any VM that had taken more than $x requests. $x was a very small number.