Kinda wish I still worked there and could see the post-outage writeup.
My bet's on some unforeseen bottleneck that affects search and static pages. Almost everything within Amazon is crazy scalable, but there are some bits where you scale them up and their behaviour changes radically. For instance, a service's cache misses might skyrocket as customers get distributed over a wider set of servers, causing service response times to increase just a little bit on average, tipping a dependent service over into more frequent timeouts, causing its downstream service to blow a timeout-percentage 'software fuse' and stop using that service... etc etc.
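That kind of timeout-percentage 'software fuse' is essentially a circuit breaker. A minimal sketch of the idea (the names, window size, and threshold here are all made up for illustration, not Amazon's actual implementation):

```python
class SoftwareFuse:
    """Toy 'software fuse': stop calling a dependency once its recent
    timeout rate crosses a threshold (a simple circuit breaker)."""

    def __init__(self, window=100, trip_ratio=0.10):
        self.window = window          # how many recent calls to track
        self.trip_ratio = trip_ratio  # e.g. trip at 10% timeouts
        self.results = []             # True = timeout, False = success
        self.tripped = False

    def record(self, timed_out):
        self.results.append(timed_out)
        if len(self.results) > self.window:
            self.results.pop(0)
        if len(self.results) == self.window:
            timeout_rate = sum(self.results) / len(self.results)
            if timeout_rate >= self.trip_ratio:
                self.tripped = True  # stop calling; serve a fallback instead

    def allow_call(self):
        return not self.tripped
```

The nasty part described above is that the fuse is a *local* safety measure: each service trips sensibly on its own, but the combination across a dependency chain can amplify a small average slowdown into a visible outage.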
Given that each of those services (and many more possibly-related ones) will have an on-call engineer paged into a conference call when the manure hit the rotating ventilation apparatus, there are going to be a lot of unhappy people cancelling their weekend plans right now. I definitely don't miss that aspect of the job!
Hehe. Reminds me of my Murphy's law moments. Whenever I am on holiday, a deployment will fail, and I will have to log in and check while my family plays volleyball on the beach.
I'd tell your boss to hire someone to replace you. What if instead of on holiday, you were dead? Kinda hard to login and fix their shit when you're dead.
I agree. I quit every job where they don't respect my personal time, especially vacation. Life is just too short. It's bad enough what I've been through with companies and their stingy PTO policies (US).
I was smoked up in Colorado for the old "Let's take down Dyn real quick" BS a couple of years ago. I know this very well: things like to run for months without issue, and then I'm off for an afternoon with the phone ringing.
I'm on holiday in Japan at the moment and have no idea what day it is. Only checking HN as I'm currently in Osaka waiting for the Shinkansen to Hiroshima. ;)
My Memorial Day weekend on a few-acre Olympic rain forest camping trip: regularly hunting for signal for HN. Oddly, I found a 6x6 area which, after a reboot, would provide LTE. After going idle, I'd have to reboot again to open some more links.
OR there wasn't backpressure on a cascading failover so as services failed they increasingly failed to more and more overloaded systems
OR there WAS backpressure and it was the luck of the draw whether you were queued into an error page or got good data
OR the autoscaling couldn't keep up with the onsale window. This used to happen in ticketing a lot. Ticketmaster has a talk somewhere where they talk about warming the scaling load and server cache in anticipation of big ticketing onsales. The time it took to autoscale was just too long.
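The cache/capacity pre-warming trick amounts to computing a capacity plan ahead of a known event instead of reacting to it. A sketch of that idea (`capacity_plan` and every number here are hypothetical, not Ticketmaster's actual system):

```python
import datetime

# Reactive autoscaling is too slow for a known onsale window, so scale up
# *before* it opens, with headroom, and use the lead time to warm caches.
LEAD_TIME = datetime.timedelta(minutes=30)

def capacity_plan(onsale_start, expected_peak_rps, rps_per_server, headroom=2.0):
    """Return (when_to_start_scaling, server_count) for a known traffic spike."""
    servers = int(expected_peak_rps * headroom / rps_per_server) + 1
    return onsale_start - LEAD_TIME, servers
```

The interesting design choice is the headroom factor: it trades money for not having to predict the peak precisely, which is exactly the prediction that tends to be wrong on days like Prime Day.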
The systems at Amazon are too large to leverage autoscaling. When I was there (in Marketplace) it was generally estimated that Amazon used about 80X the capacity of their entire AWS public cloud.
I find that implausible. I know they run a huge infrastructure but why would they be using 80X of their AWS public cloud? Thousands of companies if not millions at this stage use AWS and some of them are not insignificant in size.
> Amazon used about 80X the capacity of their entire AWS public cloud
which is probably closer to "the capacity that is available via AWS is a tiny, tiny fraction of their overall computing power.. therefore adding it back in when things are falling over doesn't actually solve any problems."
I think they are using the term 'capacity' to mean 'spare capacity'. I.e. that Amazon's entire compute usage is 80x the spare capacity, so scaling even a small amount would consume any spare capacity in AWS. Still, it seems hard to believe.
1. he misspoke and meant "80%" of the AWS capacity, which I agree seems implausible.
2. Amazon does not run on AWS because Amazon is 80x more than all of AWS infrastructure. This also seems implausible because of Netflix. In fact, there's an article out there that said AWS exceeded Amazon's capacity within 1 quarter!
I still don't understand what that has to do with autoscaling exactly
I didn’t understand either. I suspect it’s the semantic difference between ‘public cloud’ and ‘infrastructure’. I don’t know what that difference is, really.
When I checked it out, the site seemed responsive but search was broken. I searched "usb flash drive" and was presented with 2 results. I pondered this for a moment and then realized that there were 400 pages with 2 results/page (as far as I checked.) Perhaps there is some load shedding algorithm that reduces the result count per page to reduce load. It did discourage me from further searching so I guess it worked. ;)
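If such a load shedder exists, it might look something like a page-size function that degrades with load (pure speculation about Amazon's behaviour; the numbers are invented):

```python
def results_per_page(current_load, capacity, normal=48, minimum=2):
    """Speculative load shedder: shrink the search page size as load
    approaches capacity, so each query costs less to serve."""
    utilization = current_load / capacity
    if utilization <= 0.5:
        return normal
    # Linearly degrade from `normal` at 50% load down to `minimum` at 100%+.
    scale = max(0.0, (1.0 - utilization) / 0.5)
    return max(minimum, int(normal * scale))
```

Which would explain the observed behaviour nicely: the site stays up and every query still "works", but at 2 results per page it also quietly discourages you from searching at all.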
It wasn't though. I saw an item being listed as a Prime Day deal on the normal Amazon site, then I searched for it through smile.amazon.com and it wasn't there. Went back to non-smile and it was there (but throwing errors so I couldn't click it anyway...)
I have no idea really (I stopped working on the retail websites in 2013) but my gut feel is that that's about a couple of orders of magnitude too low at least.
But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size on an in-memory table and commits it thinking "Well, I added a few orders of magnitude safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...
Yup, many an outage has something like this at its core. At my company, to address that, we've built an internal library for enforcing rate limits and size limits that is (a) configurable on the fly and (b) generates logs, so that we can trigger an alert whenever any limit reaches 70% of capacity. And thus, hopefully, head these things off before they reach the tipping point.
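A minimal sketch of such a limit-plus-alert wrapper (the 70% warning threshold comes from the comment above; the class and its API are invented for illustration):

```python
import logging

class MonitoredLimit:
    """A capacity limit that logs a warning when usage crosses a fraction
    of the cap, so an alert can fire well before the hard limit is hit."""

    def __init__(self, name, cap, warn_fraction=0.70):
        self.name, self.cap, self.warn_fraction = name, cap, warn_fraction
        self.used = 0

    def try_acquire(self, amount=1):
        if self.used + amount > self.cap:
            logging.error("limit %s exceeded (cap=%d)", self.name, self.cap)
            return False  # caller must shed load / fail gracefully
        self.used += amount
        if self.used >= self.cap * self.warn_fraction:
            logging.warning("limit %s at %.0f%% of capacity",
                            self.name, 100 * self.used / self.cap)
        return True
```

Making the cap configurable on the fly matters as much as the alert: when the pager does go off at 70%, the on-call can buy time by raising the limit instead of shipping an emergency code change.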
Don't forget to alert when the second derivative of capacity growth is above a (very low!) threshold; that can catch an explosive problem far earlier than the 70% mark, by which time it might be growing at 5% per second.
It might also be a momentary blip that you really shouldn't need to wake a human up for. The hard part of monitoring, imo, is striking that perfect balance: asymptotically approaching nirvana but never quite reaching it.
All monitoring has that problem :-) But if your rate of rate of change has been positive for some suitable amount of time for the context, it's worth waking a human up over it, because the amount of resource remaining is dwindling exponentially.
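The rate-of-rate-of-change check being described is just a discrete second difference over the usage series, held positive for some sustained window to filter out blips (a toy sketch; the threshold and window are made up):

```python
def accelerating(samples, threshold=0.0, sustain=3):
    """Alert when the second difference of a usage series stays above
    `threshold` for `sustain` consecutive points, i.e. growth is itself
    growing (exponential-ish), even while absolute usage is still low."""
    first = [b - a for a, b in zip(samples, samples[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    if len(second) < sustain:
        return False
    return all(s > threshold for s in second[-sustain:])
```

Linear growth has a second difference of zero, so it never alerts; doubling growth alerts almost immediately, long before any percent-of-capacity threshold would.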
Not offhand, sorry - I first encountered the notion at a conference some years back, but I've long since gotten out of operations so it's all a bit dusty for me now.
I once owned an EAV DB for storing small config values. Someone wrote a great wrapper around it that made it seem like a proper DB with many tables. Since it was for storing configs this library cached the whole thing in memory on startup. Zoom to 5 years later, we have 10+ settings for every customer in this store, and one day all our hosts keep failing over. As it turns out that small table for configs was north of 5 gigs and destroying our heap.
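One cheap guard against that failure mode is to bound the cache at load time, rather than letting it silently grow for five years until it destroys the heap. A hypothetical sketch (the cap, function, and size estimate are all invented):

```python
import json

def load_config_cache(rows, max_bytes=64 * 1024 * 1024):
    """Refuse to build the in-memory config cache if the data has quietly
    outgrown its assumed size, instead of eating the heap at startup."""
    cache = {}
    approx_bytes = 0
    for key, value in rows:
        # Rough size estimate; good enough to catch 5 GB hiding in a
        # "small" config table.
        approx_bytes += len(key) + len(json.dumps(value))
        if approx_bytes > max_bytes:
            raise MemoryError(
                "config cache exceeds %d bytes; fall back to on-demand reads"
                % max_bytes)
        cache[key] = value
    return cache
```

Failing loudly at startup is ugly, but it turns a fleet-wide slow-motion failover into a single deploy that refuses to come up with a clear error.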
This is their official rate limit for the FIFO type SQS. Although I bet the Amazon folks know a couple people around the office who can up that limit :)
Not oncall for Amazon but I don't particularly mind it. It sucks to have to be near a computer for the time, but I actually find digging in and finding that needle in a haystack pretty fun. On a good team, the life interruption is minimal.
When I was on-call for a big project, I jumped every time my phone rang. It wasn't a good experience at all.
"We are at the fair, I have 18 customers lined up but I can't take orders. Can you fix this now or should I start taking them on paper?"
You know the joke that atheists don't believe in god until the plane nosedives? That was the moment for me :)
"Oh, I... I need to see the error message before I can say anything. We don't have any errors logged. Hmm... Would you allow me to remote into your device?"
"Can't do that. We lost the internet connection here, sorry."
Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?
I can personally point to two friends who I consider top-notch engineers and designers that have left Amazon because of its toxic culture. I'm sure I'm not the only one with these anecdotal examples; we've all heard the stories. At the end of the day, years of poor work/life balance, overly aggressive management, and a frugal approach to everything make for a weak argument for A players to stick around.
Could this be an example of crumbling engineering standard at Amazon?
>Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?
In many regards, yes. The bar had to be lowered to meet the demands of growth. We've also taken in a lot of hires from companies that have brought their culture and friends with them. The culture at Amazon is not what it was even 2 years ago. It is in many places day 2.
No one also seems to notice that Amazon retail often suffers widespread issues like this. We can count on SEV1's happening during peak as things blow up badly. This has happened several years in a row, and sadly the themes are pretty much the same across all: forgot to scale (yes...really) or some stupid system bottleneck. It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.
> The bar had to be lowered to meet the demands of growth.
Not only that, but I've noticed that Amazon has started using way more contractors (not actual firms, but more Mechanical Turk/UpWork-like contractors, if not just straight up misclassification) and agencies within the past year or so. I can't say that I'm surprised that Amazon is having technical issues now that they've grown in numbers but not in technical chops.
On a side note, it is a great time to become a recruiter in Seattle with all these agencies popping up. /s
Honestly don't think that is an issue. AWS is THE most certified company I have ever worked with. Seriously, go look at their compliance page - it is ridiculous.
https://aws.amazon.com/compliance/programs/
Certification is theater for managers and so-called decision makers. It might have a positive impact on code quality as long as people are motivated. But if insiders are telling me that the workplace culture is deteriorating, that is a serious problem. Certification will not prevent bad code hitting production.
Not certain on this, but I think Azure is known to have more compliance certifications. This is probably largely due to the long relationships Microsoft has with government organizations etc.
I got the impression from some former Amazon employees that they were burnt out and put their time in and got Amazon on their resume and were just done with life at Amazon so they moved on.
An Amazon manager, full of shit, once tried to tell a joke at a social gathering. He claimed that he and friends removed the engine from someone's car over lunch as a prank. This is the education level that some of them have. He literally didn't think people would know he was full of shit.
It would be technically possible to remove a car engine during lunch. However, that's not so much a prank as malicious damage. You'd also have to be a very skilled mechanic, familiar with that specific car, to do it that fast.
Honest question: education level or mental developmental level?
Mental health is enough of an epidemic that $company_with_exponential_scale is most definitely going to pick up a fair chunk of people with various issues, cognitive included.
Chances are this is even part of why the bar is a bit lower.
(Said as someone with mild high-functioning autism)
My point is Amazon has hired some managers that have really lowered the bar due to the environment they are from and people they are used to dealing with. Imagine someone from a place where you could tell a preposterous lie and 50% of people would believe you.
I can see what you mean. Take for example the email I received to promote Prime: a mix of Spanish and English with some very bad translation problems in the Spanish. It was a bit embarrassing to read.
At Uber we invested in synthetic load tools that create fake riders and drivers, match them, etc and test our entire dispatch system end to end to arbitrary amounts of online driver and riders. I don’t see why they couldn’t do the same with carts, adding products to carts, etc
They do. But I'm guessing you don't scale up 1M+ servers for your Uber canary traffic tests - these are some of the scales Amazon undergoes in these events. The scale is unlike almost any other web property around.
Do they really need 1 million servers? Many of my friends who work at other tech companies need so few servers in comparison, even with significantly higher traffic, that it just screams massive inefficiency... which seems wrong.
But I've never worked at Amazon so I wouldn't know.
I’m not sure how it’s relevant. If you have the infrastructure to send 1000 concurrent users you can probably send 1M concurrent users. We only test small integer multiples of our peak traffic, and if your absolute number of servers to service that is in the millions then it would make absolute sense to be routinely running that capacity test. If that means “scale up 1M+ servers” then that is what you have to do, otherwise how can you be sure?
Could you do something like that from another cloud service? I suppose the difference would be whether 10K requests from one IP address would be the same as 10K requests coming from 10K IP addresses. The further you move from the actual production load, the higher risk that the test doesn't test everything. For example if the network could only handle connections from 5k hosts then the former could pass while the latter failed miserably.
Simulating millions of users should be well within the capabilities of a company as large as Amazon. Off the shelf load testing tools like Locust can create thousands of fake users with one worker.
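A stripped-down synthetic-load harness along those lines, with a placeholder request function standing in for real search/add-to-cart calls (not Uber's or Amazon's actual tooling; just the shape of the idea):

```python
import concurrent.futures
import time

def fake_user(session_id, make_request, actions=5):
    """One synthetic user: perform a few actions, return per-action latencies."""
    latencies = []
    for i in range(actions):
        start = time.perf_counter()
        make_request(session_id, i)  # stand-in for search / add-to-cart calls
        latencies.append(time.perf_counter() - start)
    return latencies

def run_load(users, make_request):
    """Drive `users` concurrent synthetic users against the system under test."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=min(users, 64)) as pool:
        results = pool.map(lambda u: fake_user(u, make_request), range(users))
        return [lat for user_lats in results for lat in user_lats]
```

As the sibling comments note, generating the requests is the easy part; making the synthetic traffic mix match real peak behaviour (carts, payments, cache hit ratios, source-IP spread) is where the hard work is.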
It's actually pretty trivial. Getting the right parameters for the load will be hard and making sure you are loading it properly. Think about DDOS attacks. Generating the load is rarely the issue.
and when has amazon ever been an engineering force? i have always felt the website and service experience is a relic of the 2000s. more often than not, i get the answer “our system can’t do that” from customer service.
I think Amazon has taken on an outsized image to many people that just isn't true. We have good engineers in many organizations, but we don't pay enough, have the right strategy, or take care of individuals well enough to lure the kind of great folks you find at other big tech companies. In many ways, Amazon is a retailer that does technology because it found a way to make money from it. The DNA is still MBAs/finance and retail.
A bar raiser is typically someone in the interview loop for a candidate that is not from the same team and therefore more objective about whether the candidate is truly 'better' than most people on the team (raises the bar). Although good in theory- its not actually practical to find every single new person better than the current employees but it helps keep things in perspective and is a deterrent to bringing in weak buddies.
It's true Amazon has some great engineers, but it is not very engineering-centric. I remember a senior engineer in Retail once comparing it to a plumbing system held together with bandages.
How much do you think the on-call contributes to engineers leaving? You would think support tools and support personnel could help to retain engineers.
No, bad design and refusal to manage technical debt is the issue. Oncall only matters in some orgs and even then only matters where the tech debt is totally out of control.
Bottom line is Amazon is a product culture not an engineering culture and that makes it really easy to leave for Google or unicorns that really appreciate tech debt tradeoffs.
In simple terms, bar raisers are current Amazon employees that come in during the interview process to analyse candidates. They do this alongside their own full-time job, assessing as many as 10 candidates a week and spending 2-3 hours on each one.
In other words 20-30 hours per week on top of the full-time job? That doesn't sound quite right.
Another bar raiser here. That site is wrong. The expectation is 2 candidates a week. And it would be more proper to say that it is part of your full-time job, not on top of it.
10 candidates in a week may happen at some kind of event, but then it isn’t 2-3 hours per candidate, and in that case you’d effectively be taking a couple of days off from your normal job.
The thing that baffles me is the industry perception is that Amazon has subpar engineering but Amazon right now is 2nd after Apple for Marketcap, so they must be doing something right.
I love the Amazon customer service. They’ve managed to crack a difficult problem and execute enough that other Giants haven’t come close to it yet.
GCP and Azure tail AWS by quite a bit. Amazon online retail is a Google search engine level monopoly now.
So Amazon can do a lot of things wrong, but I’d have to say they get the important parts right.
> Amazon online retail is a Google search engine level monopoly now.
Not even particularly close. Amazon doesn't even have a majority of online sales, although it's getting close. They seem like a much bigger force than they are because of growth.
They are big, because they invested every penny they made into getting big. And those pennies are converted through human misery in "fulfillment centers". (And AWS on-call is not that fancy either for engineers, though the two are almost incomparable.)
They got some parts right, some parts awfully wrong, and some are just irrelevant now. They make money, they are cheap + convenient, and that's usually what people focus on. They are not sophisticated, they are not great designers, etc.
This doesn't completely answer the question, but I would distinguish UI and UX from the ability to build systems that run successfully at Amazon scale.
Corporate meme thing like Six Sigma. Each new employee is supposed to raise the bar, or the whole organization gets worse on average. The joke is that Jeff Bezos, as hire #1, is the least qualified employee in the company, since each new hire raised the bar.
A PM who has never opened an IDE in his or her life and who is only familiar with “coding” concepts through Wikipedia. They read books by Malcolm Gladwell, Daniel Kahneman, and Nassim Taleb and majored in one of the humanities. When they’re shown the webpage that the geeks created which loads in 2 seconds, they tell the lead developer that they want the loading time cut down to 1 second and the font to be changed.
Considering they cleared $2.6 billion on last year's Prime Day, and membership is up YOY, I'd say this comment is 100% knee-jerk/over analysing. Stuff breaks, things go wrong, planning for the worst sometimes isn't enough.
I don't see how "they sold a lot a year ago" and "membership is up YOY" is meant to dispute the idea that engineering talent is leaving. I agree we don't have enough evidence to say for sure, but those figures don't seem relevant.
My comment has nothing to do with Amazon's current financial success or growth. It's regarding their engineering team and their ability to recruit and retain top talent.
It is not a binary state of absolute destitution versus top-notch brilliance; it's a trend that can move one way or another and will show itself in more frequent outages, poorly rolled out products, lazy design, etc.
The plumbing can keep on working for a few years even after it begins to erode. Historically there are many examples of this.
Maybe, but your evidence is anecdotal. Good engineers leave huge companies all the time, but it is usually balanced by inflow. I guess we have no evidence of inflow as compared to outflow of good engineers.
I didn't reach a conclusion, I presented a possible reason knowing very well that I don't have the evidence to back it. Hence why I said "could it be that the engineering force at Amazon is no longer what it used to be?" and not "The engineering force at Amazon is no longer what it used to be."
I saw the page reload multiple times when doing a search before finally dying...I can only imagine that means additional load generated by the initial triggering failure.
Which in turn tells me they didn't test the failure case. Now, Amazon is a huge and complicated beast so I don't want to imply this was a "dumb" mistake, but (assuming I'm correct) it is a failure that risked making MORE failure, so it's not demand alone to blame.
In addition to the shit that is not politically correct to say, Amazon relies heavily on interns for production code and ops. Bezos has this view that the unskilled can get him 80% of what he wants and then a few top people can smooth out the rest. In reality the few top people never even come in contact with the kid that is trying to keep amazon.com running overnight. Successive waves of interns and SDE 1's fuck up the same shit over and over. Bezos is stupidly applying this same strategy at his space company.
This is complete nonsense, at least in my org. Interns are given projects that are as far from production code as possible. I'm sure there are managers out there who have done this but it does not extrapolate to the entire company and it's ridiculous to think that it's a company wide practice to make interns responsible for anything that would affect customers.
>Could this be an example of crumbling engineering standard at Amazon?
As someone that's been going through quite a bit of depression because Amazon was the best offer I got (as opposed to Google or Facebook because I'm still bad at coding interviews) I'm afraid this is the straw that's going to break the camel's back for me. This is exactly what I was afraid of and exactly what I have waiting for me when I join next week.
100% serious here: you should probably see a therapist. Life is so much more than your job, and your job is so much more than the brand name of your company. Having worked at a failing business and one of the companies you mentioned, I’m no happier at the latter than I was at the former. If you define yourself by the names on your resumé one day you’ll wake up and realize how much of life you’ve missed out on.
Hey, I've seen a few of your posts around here and just wanted to encourage you to keep your head up. Amazon is an amazing place and even though I don't know anything about your specific situation, there are a lot of incredibly smart people there and I promise that you'll learn a ton.
I was in the exact same boat - ex Amazon intern, felt I was a pretty solid engineer, but didn't quite nail the FB interview. Ended up joining Amazon because I was sick of interviewing.
Let me tell you: it will be OK, and the other posts about making the best of your situation are true. Amazon is an incredible learning opportunity if you stay open to it. I'm still working there a year later and I'm building far higher impact projects than my friends at Goog and FB because like another poster said, Amazon lets SDE1s work on just about everything. The growth potential is tremendous if you show you're competent and willing to learn.
People say the same thing about every big tech company - Google and Facebook included. Trust me, employers will still be impressed with Amazon on your resume.
they have a lot of really amazing and smart people. but like anything in life, its what you make of it. I'd say put your best into the position and try to learn as much as you can from others. no good reason to do less
It's not just what you make of it: a lot of your experience there is going to depend on your manager. I had three: the first was awesome, so he left for another company; the second was good, but he cared too much and went back to his old (non-management) position to feel fulfilled; the third was a sociopath and did really well, promoted to upper management and a subsidiary and out to greener pastures. Working for the sociopath was awful and got me to leave (where I'd planned on a long career). And no, there was no escape from him except leaving Amazon: he sunk my performance reviews when I asked to leave to go to another team 'cause I told him I didn't think I was a good fit where I was.
There are a lot of amazing and smart people, like you said, but there is also a lot of stress, heartache, and trouble if you don't keep your ear to the ground and build a strong network of people to give you an early warning. Don't keep your head down and concentrate on tech and building cool stuff: Amazon can be way too political for pure techies to thrive without strong protection from management.
Chin up. You're probably in the top 5% of the country in income. If you're in Washington, feel free to tack on another 9% or so to your income in state taxes you don't have to pay.
While it's nice to not have state income taxes to worry about, total tax burden isn't all that much different from any other state. The only time you really come out ahead is if you can live in WA and do all your shopping in OR.
For high-income people like Amazon engineers, WA is a crazy-low tax burden. CA income tax could only match WA sales tax if you spent 100% (or probably more) of your income on taxable items.
You can't let an outage or some alleged talent gap be a reason for wanting to tap out.
You need to realize that you're doing great: folks would kill to get a job at Amazon, the scale and the challenges are mind-bending compared to what most other companies deal with, and the interview is grueling and you made it.
Technology is a word that describes something that doesn't work yet and Amazon thinks that you are a person that can help tip the balance.
If only they had access to some kind of scalable cloud hosting service, they could've completely avoided this sort of outage. :)
Jokes aside, I admire the work of the team(s) responsible for Amazon's web site. I use it so often and encounter glitches so rarely that it really stands out when something does go wrong.
Serious question: I've heard that Bezos's approach with building out commercial units is to break down each part of the vertical into separate commercially-viable components. Idea being if AWS doesn't make sense for 3rd parties to use then it may not be economical to use it internally.
Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?
As someone that worked at Amazon a long time ago—back when AWS was just getting started—can confirm (historically). Publicly there were a myriad of AWS services; internally all we could use was S3 (for many years), if we were lucky. AWS being born of Amazon's "spare capacity" is an urban legend.
Nowadays, I hear it's quite different, and much of AWS is more rapidly dogfooded.
Some services (before I left) just did not have the sheer capacity to have Amazon.com as a customer at peak. The service teams just said "nope, sorry, you're going to kill us".
The requirements for prime day/black friday/cyber monday were mind-boggling.
That was quickly adjusted. I think like 3 years ago, Amazon publicly was saying that every day to AWS they were adding the same amount of computing power that was used to run Amazon.com when it was a $10 billion business. AWS is massively greater in scale than Amazon.com at this point.
I cannot confirm or deny that (since my memories are sorta fuzzy), but the real problem was not EC2 capacity. Many other AWS services were involved. :)
At the same time you are also giving the other team more flexibility to solve the problem, as they can optimize it for the Amazon.com use case and are not constrained to a generic solution that is optimized for the masses.
I'm a big fan of dogfooding. But given Amazon's huge e-commerce computing needs, maybe it would make sense not to burden the compute availability of your fledgling cloud offering with your ginormous system?
If built well your compute offering can scale with the resources you give it. If you have Amazon’s commerce platform running on a pool of computers, the team running commerce can give those computers to EC2 in exchange for an equivalent amount of EC2 quota. In theory it’s a bookkeeping operation.
There are plenty of good reasons why you wouldn’t run Amazon’s commerce on EC2, but I don’t think cloud availability is one of them.
FWIW 100% of the Amazon retail website is run on EC2 (and iirc it's 100% on spot instances too). (There are of course lots of internal services that still use non-ec2 capacity but fewer and fewer every day.)
One of Amazon Retail's big internal goals this year is "finally get everyone off of Oracle".
Bezos opposed the creation of AWS. Almost everything that was early AWS (EC2, S3) was done over the objection of Seattle leadership. Look where the teams were based.
After they shipped AWS, it took eighty zillion years for Retail to use anything AWS offered beyond S3, and, as mentioned above, it's not like they're 100% on DDB or RDS now: they have dependencies on freakin' _Oracle_ all over the place.
I mean, this is not a criticism of Bezos or Amazon at all, but at the end of the day, even if Bezos is a supergenius (and I see no reason to doubt that he is. Although I often wonder how he keeps himself motivated to continue to work so hard on building a Walmart competitor with the precious few years he has left on Earth given that he could do literally anything with his time), Amazon is still a company made of tens of thousands of people. It's not a 4 dimensional chess-ballet. It's got probably 95% of the same chaos and disorder that every big organization made of humans has. It just turns out that cutting it 5% in the right markets has incredible returns.
I can't see Amazon ever using external cloud hosting for anything except the most trivial of tasks. They're absolutely, utterly paranoid about any sort of confidential information, and I think even with encryption the perceived risk would be too high.
So... you save $9k a year in recurring and it will be more than 5 years before you break even due to your $68k up front equipment costs.
And that's assuming you don't have any needs to quickly scale up or down and you are limited to 1 colo instead of the ability to expand to multiple regions like with AWS.
And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.
Doesn't sound like rolling your own stuff in a colo is a very good idea in this case. But that's job security if you are the sys admin I guess.
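For what it's worth, the payback arithmetic from the figures quoted above works out like this (a simple payback period, ignoring hardware refresh cycles and cost of capital):

```python
def breakeven_years(upfront, yearly_savings):
    """Years until cumulative savings cover the upfront spend."""
    return upfront / yearly_savings

# Using the figures quoted: $68k up front vs. $9k/year saved.
years = breakeven_years(68_000, 9_000)  # ~7.6 years, i.e. "more than 5"
```

Which is close to, or past, the typical useful life of the hardware itself — the core of the argument against the colo in this case.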
> And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.
Although, as I said upthread, I agree that AWS is very likely ideal for this particular deployment size, let me try to dispel this oft-repeated myth.
Modern server hardware takes almost no "brain power" (or effort of any kind) to keep up and running.
We aren't living in the days of the early dot-com boom where Linux-on-Intel in the datacenter could mean flimsy cases, barely rack-mountable, with nary a redundant part to be seen.
Applying some up front "brain power", one can even choose and configure hardware in such a way as to provide things like server-level redundancy, if that's important and/or preferable to intra-server redundancy (think Hadoop), or the ability to abandon mechanical disks in place instead of ever having to replace one.
This is the main "sweet spot" for AWS (or "cloud" infrastructure in general): small scale.
I am generally a strong proponent of using ones own hardware in a colo or on-premises, instead of or in addition to the cloud (primarily for "base" workload).
However, if the entirety of your needs can fit into a single rack, even I will advocate for AWS, since "convenience" is, perhaps, not strong enough a word.
I do think your server and storage prices are around $25k too high, but that's easy to do buying brand name and/or not negotiating with multiple vendors on price (which is particularly tough at low volume unless you're a startup with a credible growth story). That's assuming such an expensive CPU (in comparison to so little RAM) isn't foolishly profligate, along with the other hardware choices. Of course, this underscores the point (on which we agree) that, as a rule, it's just not worth that much time and effort for so little.
I'll take your word on the AWS pricing, as it's fairly predictable, if very tedious to perform the prediction. The main "gotchas" I've found people run into are forgetting to add in EBS costs for EC2 instance types without (or without comparable) local storage and underestimating data transfer costs.
You'll have to trust me that this example's hardware spec and requirements are for a basic/base site. You can thin the profile and increase the # of chassis, compromise on redundancy, etc., but experience has shown that this arrangement is most cost effective. Kinetic event impact modeling system with RT data delivery -- that should answer your conjectures.

No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.
> You can thin the profile and increase the # of chassis, compromise on redundancy, etc
No, I wouldn't suggest more chassis, as that's almost always more expensive (it's tough to break even on that $1k minimum buy-in per server).
I believe your workload needs the resources you say. It just happens to be a remarkably rare ratio, hence my remark.
> No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.
The vendor doesn't have to be large to jack up the price. Any re-brand is super suspicious. To me, a large part of the point of a commodity server product is that the reliability is predictable (and therefore easy enough to engineer for/around). Paying extra for "diligence", warranty, or hardware support is just flushing money down the toilet.
A fee for custom assembly and/or a basic smoke test is fine, but it had better be a flat rate per server and on the order of $100. Technician labor isn't that expensive.
Larger or "enterprise" vendors are merely the extreme version of this, with upwards of a 10x premium on something like storage arrays, especially if one includes the support contracts.
You seem to be an absolute type of planner. I used to approach IT mgmt and provisioning that way some years ago before being confronted with the realities of small and large business. One size obviously does not fit all, and sometimes you take shortcuts... usually you pay for them later.

I agree with your cautions around supermicro resale, but the warranty support and build diligence are absolutely necessary for a small business. Having a good business relationship with a trusted provider of hardware that always performs the first time is priceless.
I don't know what an "absolute type of planner" is, but I consider myself an engineer and a pragmatist. I'm well versed with realities. In reality, with business, there's no such thing as "priceless", only risk, and risk is, generally, quantifiable. With enough data, it's easily quantifiable.
I admit that, having an affinity for startups, rather than more traditional small businesses, I have a greater affinity for risk. Ironically, perhaps, I'm usually the voice of risk-aversion with respect to IT infrastructure, so I don't believe it affects my overall understanding.
I recently pointed out to an interviewer who was trying to convince me that it was worth spending half a megabuck on a petabyte from Netapp because it was "business critical" instead of 1/10th that amount for DIY, that, just like the DIY solution, Netapp does not indemnify the business against loss. One isn't buying insurance, only a bunch of technology.
Sure, "works the first time" is worth something. Is it worth the cost of a whole, complete, extra server on an order of qty 6? If the infant-mortality rate on servers is anywhere approaching 1-in-6 and they're being shipped somewhere where the replacement time and/or cost would be prohibitive, I'd still probably rather just order 7 servers instead.
That's my main problem with paying a vendor for "reliability": it's a very fuzzy, hand-wavy assurance. Paying for reliability with more hardware has data and statistics behind it, which is an engineering solution.
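The "order 7 instead of 6" argument above can be made concrete with a binomial model. Here's a quick sketch, using the pessimistic 1-in-6 infant-mortality rate from the comment purely as an illustrative assumption:

```python
from math import comb

# Probability of ending up with at least `working` functional servers
# out of `ordered`, given an independent per-server failure chance p_fail.
def p_at_least(working, ordered, p_fail):
    p_ok = 1 - p_fail
    return sum(comb(ordered, k) * p_ok**k * p_fail**(ordered - k)
               for k in range(working, ordered + 1))

p_fail = 1 / 6  # hypothetical infant-mortality rate, for illustration only
print(f"order 6: {p_at_least(6, 6, p_fail):.1%} chance all 6 work")
print(f"order 7: {p_at_least(6, 7, p_fail):.1%} chance at least 6 work")
```

The point being: adding one spare roughly doubles the odds of a fully working deployment at that failure rate, and unlike a vendor's "diligence" fee, the improvement is quantifiable.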
That is some serious setup. I can’t see how you get close to this only spending $25K a year on AWS - maybe the price I was quoted for my needs was some sort of sucker’s price.
I don't see their comment as paranoia that cloud providers are generally incompetent or will steal your data, but rather that other cloud providers are direct competitors.
Even if they're not sifting through your server data, they can possibly try to get a competitive advantage by analyzing things like usage data as someone above pointed out.
When AT&T was forced by law to allow MCI switching equipment in their facilities, they tended to leave that room's window open, so pigeons could nest in the equipment.
(I might have a detail wrong, this and a hundred other great telecom anecdotes are in The Master Switch.)
Always nice to make sure no parties you depend on have conflicting interests.
> Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?
I'm not sure how secretive it would be. AWS 'bids' on Amazon.com's business and Amazon.com is under no obligation to use AWS as its cloud service provider.
It is beyond absurd that people sell $AMZN on this news. Amazon users will just keep refreshing until it comes back online, and then continue buying as normal.
Some rabid fans may, but it's ridiculous to assume that Amazon sees zero missed revenue when their website goes down on their (arguably) biggest sale of the year.
That's similar to all of the locks on all Walmart stores inexplicably getting stuck in the locked state for 2 hours at 6AM on Black Friday.
sure, they probably don't get that money back which might slightly influence the revenue for the quarter, but it's not as if the value of the business has changed drastically in the last 3 hours.
This happens every time AWS has an outage as well. Reddit is down, better sell AMZN.
It's not the same at all, because trying Walmart again later in the day means literally driving back to the physical store. With Amazon, it means getting your phone back out.
I'm sure there is some impact, but it's nowhere near the inconvenience of being locked out of a physical store.
If you already waited overnight, then 2 hours isn't going to cause you to go home and come back. You're going to continue to wait in the line until the store opens back up.
My guess is that they will extend prime day. But they fixed it quickly.
For me currently on amazon.de there is no problem and amazon.com shows a captcha form, so it looks like they were DoSed.
We had about $300-$400 worth of purchases lined up this morning that were going to be made just for the day. The site was down, we couldn’t get through - they aren’t going to happen. A lot of people only shop on the 30-40%-off days - so I’m guessing we’ll wait for Cyber Monday in November?
Precisely. I wouldn’t doubt there are algorithms that take into account general tone of tweets and news posts (using some sort of NLP) into a trading decision.
Yes, this is done in practice and is a little scary because of its potential as a means of manipulating automated trading. On the other hand the markets can be manipulated already via news read by humans...
It's absurd for long-term investors to sell on this news, but if you're a short-term trader or bot then it might be a nice short-term profit to trade on this news.
Funny how this comes on the heels of aggressively expanding their workforce and trying to leverage themselves in a hundred different directions...
Maybe it's just me and my confirmation bias at work, but it seems that the core value proposition that Amazon provided -- high value, low margins on products -- has been eroding before our eyes.
Seems so much like the transition Microsoft made... too much focus on "synergies" and leveraging... not enough on keeping the bilges dry and the engine running.
It's funny... Fred Brooks wrote about this in 1975... and we're still making the same mistakes forty years later. There are real limitations to how quickly any organization can grow. Even awesome companies who are excellent at building organizations -- places like Amazon and Microsoft -- can't organize this law of software development away.
> Seems so much like the transition Microsoft made... too much focus on "synergies" and leveraging... not enough on keeping the bilges dry and the engine running.
The companies that just keep the bilges dry and the engine running are the ones that we love, but they’re gone because they got made irrelevant. Or they got absorbed into something larger. Microsoft has a bunch of failed initiatives (Windows Phone, Zune) plus a bunch of successful ones (Azure, Xbox, Office 365).
If you’re up for classic books like Brooks check out The Innovator's Dilemma. You have to try to expand in a hundred different directions because you don’t know which one of those hundred directions will be relevant next decade, and you have to be unafraid of cannibalizing your core business because if you don’t eat yourself then someone else will eat you instead.
I think the hard part is to walk the line between stagnation & over-expansion. This is a dilemma we all face in organizations large and small as well as individually.
Building systems and processes (a.k.a Habits) that allow us to assume and integrate some new set of "stuff" without having to think about them (so we can move on to the next new "stuff") are what sets these companies apart; Amazon has been brilliant at this; however, from my perspective, looking at all the high-rise office buildings going up, it seems like maybe Icarus has flown too high...
But again, I could be building a narrative to fit my preconceived ideas... I'm definitely no authority here. Maybe this is just another blip... I think more troubling to me is the overall degradation in quality of the things I have formerly taken for granted in Amazon -- the quality of the products and ratings.
The Innovator's Dilemma is certainly not about synergies between products. By its conclusion, it's worthless to try to diversify within the same structure, you'd better create a new company and turn the old one into a holding.
Why would you think it’s about synergies, or that it would be “worthless to try to diversify within the same structure”?
Summarizing the book here would be a bit of a disservice—but one of the points of the book is that there are economic reasons why companies focus on their most profitable core products, and there are economic reasons why that kind of focus can result in the company collapsing when the market moves forward. This isn’t some kind of imperative—the book isn’t saying, ”therefore, you should create a new company.” It’s more descriptive, “this is how big, successful tech companies can suddenly fail.”
This seems like a bit of an overreaction. Amazon has had big outages before (particularly at AWS), and they invariably solve the problem and move on. As someone else observed, events like this are mainly notable because of how infrequently they happen.
I wonder how Alibaba Cloud handles similar events [1], where there are bursts of 256k/s transactions and ~1bil packages being shipped out.
Do they just do brute-force massive scale out?
Amazon's US market is big, but my understanding is that number of online users in China (> 400mil) exceeds the population of the US (~325mil), which makes me wonder if the folks there think about data architecture a little differently than we do.
Oh I included that number to indicate the scale of orders that have to be handled by Alibaba's data systems -- I understand that they don't ship the packages themselves.
Also, I just read that as of 2017, there are 700mil internet users in China, 90% on mobile. The scale there is just staggering.
They are pretty good about it, and the "Correction of Error" for this one will be pretty epic.
Generally, the rule at Amazon is that any particular f*-up is forgivable... once. (Especially if you can show that you had preventative measures, documented procedures and redundancy in place.)
That said, there will be finger pointing and blame because you're dealing with human beings.
I’m blown away by the sudden and swift down votes to my original comment.
These engineers work at a world class company and are paid vast sums of money to not fuck things up. They live way better off than the majority of the country and their mere presence makes life more expensive and stressful for communities around them.
To suggest they cannot go a mere 48 hours or less without sleep on one of their company’s most hyped days is out of touch.
That someone can earnestly suggest that going 48 hours without sleep is somehow an effective way to address an outage is an indictment of a messed up work culture.
People need sleep to think straight, and no amount of money or responsibility is going to change that.
Amazon is not world class. The pay is not comparatively great. There is no retail job (including Amazon) where undue stress and sleepless nights are warranted.
You're not saving lives, you're selling books and cat litter on the internet.
Is their pay not comparatively great? Usually when I see this statement, folks are comparing seattle to silicon valley 1:1 which isn't a fair comparison. Seattle is expensive but not that expensive. My friends who work at amzn seem to be compensated in line with everyone else I know, but maybe I'm wrong?
That...just isn't true. Missed sleep costs millions of dollars a day and worse, numerous human lives in sleep related accidents (car, medical and more).
I've always been pretty sore about this, that my retail and AWS accounts are linked, with no clear way to disassociate them. If I had known that the accounts were going to be joined at the hip, I would have created my AWS account with a different email.
yeah, I'm pretty sure this could be causing some serious issues with plenty of folks. not sure if it's just the web console or if APIs are affected too, but I sure can't get in.
I work in a non-retail part of Amazon and I'm on vacation. Hasn't stopped friends and family from texting me about this though. As if I personally can go in and reboot a server or something. Hope we get it sorted soon!
If you're affected by this, please accept my unofficial thanks for your patience and understanding. (If you're a coworker in retail, good luck getting things up and running!) :-)
Wouldn't it help them to purchase from Amazon today? If there are lots of orders but no one to fulfill it seems like that would be greater pressure on Amazon.
I have completely stopped using Amazon, with the final nail in the coffin being the stories about how they treat their workers. I think it's just a matter of dependence and most of the stuff I used to get, I can find it elsewhere.
Slaves "had jobs" as well. Their masters just paid them in food, water, and shelter.
This whole thing about businesses "giving jobs" is ludicrous. American brainwashing. Businesses NEED workers. Not the other way around. Humans have always existed and survived for hundreds of thousands of years without "jobs". Businesses cannot survive without workers. Businesses would simply cease to exist without workers. People can still grow their own veggies and meat. Businesses can't generate profits without workers.
I was only showing the other side of equation... I'm as Libertarian as they come and also believe Karl Marx's theories made tons of sense...(I know that sounds a bit contradictory!)
Amazon needs to share its wealth much more, at least among its workers and also the independent authors using it, and so on, and realize that it needs to support the ecosystem that allowed it, including things like more freedom and liberty for people, not less. This vast accumulation of wealth is bad news, even for capitalism itself, when you get right down to it. Bezos has a huge chance to make a stellar example to the world here, but so far... ummm... I mean, why not?
I know TFA is about technical failures, but the deals themselves are also incredibly lacklustre. I was expecting at least the Warehouse Deals part of Prime Day to come through, typically 15 to 20% off all used offerings. This year however, Amazon restricted it to only select listings which translated to a few hundred items total. Very sad.
I didn't see anything that appealed to me either. One other person told me the same.
Maybe Amazon's overloaded system was caused by shoppers checking back more frequently than in past years because they can't find really good deals this year either but keep looking harder and harder anyway?
I've come to the conclusion that 'Prime Day' is just a clearance sale. Unless you find something that is exactly what you've been planning on buying, it's all just 'tat' and not worth buying even at Prime Day prices.
I read somewhere (sorry, forgot where) that Amazon had been pushing sellers to spend like mad on ads within Amazon.com for Prime Day, apparently it gave you a big edge over whatever the algorithms suggest.
Those sellers will have missed their sales targets, and will consider the ad spend to have been wasted. Will they get it back?
And vendors who boosted supply based on anticipated sales (both discounts and purchases driven by discounts). Are vendors going to find themselves with thousands of extra widgets on-hand but without the anticipated purchasing frenzy they were counting on to sell them?
Yup, and if you have too much inventory, you can either remove it or sell it at a discount, both are expensive options.
Otherwise you'll get Amazon telling you, "Hey buddy! You sure are using up a bunch of space and not selling much. Why don't you not do that? We're limiting the amount of storage space you can use for Q3."
... Is there anything even good for Prime Day this year? Or the year before? Two years ago I remember seeing at least some Dell Workstations that could be repurposed into cheapo home servers. Most of the stuff seems to be odds, ends, and stuff that the various Chinese product-clone companies couldn't get rid of.
Last year they sold $2.4B, and this year I've seen estimates around $3.4B - $4B.
Prime day is 36 hours long, but I bet sales are weighted heavily in the first few hours.
So $3.7B / 36 / 60 / 60 = ~$28.5K per second, and then maybe double or triple that for the first hour or two after 3PM, and that'll give you an idea of the scale.
There's also knock on effects, like reduced trust in Amazon/less orders for the rest of prime week, but also positive effects, like people who will just defer their shopping.
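The back-of-envelope rate above is just the estimated total divided by the event length (the $3.7B figure is an estimate from the comment, not an official number):

```python
# Average Prime Day sales rate, assuming the ~$3.7B estimate quoted above
# spread over a 36-hour event.
total_sales = 3.7e9           # estimated revenue ($)
event_seconds = 36 * 60 * 60  # 36-hour event in seconds

avg_per_second = total_sales / event_seconds
print(f"~${avg_per_second:,.0f} per second on average")
```

Since sales are front-loaded, the peak rate the systems actually have to absorb is likely several times that average.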
For what it's worth -- my sales are below my 30 day average. Glad I didn't go all out this year in terms of advertising.
I wouldn't blame yourself too hard. Amazon has strategically made it very easy to add items to your lists (bookmarklets, Chrome extensions, etc.) while offering _no_ supported way to get the list back out.
If you aren't being a little sarcastic, you are lucky. The bug saved you money for shit that you obviously are not wishing very hard for. Kids at Christmas time certainly don't need a backup....
I'm seeing the page automatically reload every second or so. Maybe some bad JavaScript, though it isn't adding entries to the history. Looks like they have a script that is DDoSing themselves.
What happens to people responsible for the crash today (infra, culprit services)? Does Amazon take some kind of "action" since Prime Day is a huge, once-a-year event for Amazon?
They will have to write a detailed post-mortem (reviewed weekly by many people with titles of director and above). Based on what comes out of the post-mortem, automation/testing will be implemented to remove/mitigate the failure from the equation in the future.
Unless this is something egregious (ex: a manager not allowing the team to "scale up" in preparation for the event) no one will get fired. Tempers may flare a bit if it is something stupid (it's usually not).
Nothing different from what happened with the (very public) DynamoDB and S3 failures of yesteryear.
I've participated before only to be disappointed. I didn't even bother looking at their 'deals' this year. I consider this their garage sale to get rid of junk... unless you want to buy Amazon products like Ring.
How about some love for their marketing guy who made it all happen?
Jeff probably said "make it rain, let's see if your hordes can take down Amazon.com," and this guy basically accepted, and succeeded at, the challenge.
Since you won't seem to post according to the guidelines, we've banned the account. We're happy to unban accounts if you email us at hn@ycombinator.com and we believe this will change.
I’m curious what the repercussions are, if any, once a post-mortem is completed and the teams or individuals that contributed to the outage are identified. Is “causing” this a fireable offense?
Not unless there was malicious intent or willful negligence. Amazon is a data driven company. The data shows that a “blame” culture results in more incidents. (Airline industry taught us this: https://www.faa.gov/about/initiatives/maintenance_hf/library...)
No. Amazon's culture is super big on blaming the system. I know of incidents where operators made mistakes doing stuff that directly caused really big outages - the postmortems were entirely about the (lack of) safety of the tooling involved that allowed a single person to make such a mistake, and involved examining all the tooling of entire business units to try to identify other tools that could have such an impact. No individual blaming is involved - and in fact, postmortems always refer to "the engineer" or "the operator" and never include names.
It probably doesn't help that the mobile app appears to load a random picture of a cute dog every time I press the "retry" button. So you can guess what I'm doing, trying to get it to load a new "Dogs of Amazon" pic.
Probably should have gone with goatse, reduce the load.
EDIT: do NOT search for "goatse" on your work connection. That alone, even if you've never heard the word, should tell you why I suggested it as an alternative.
Maybe they figure that people who like dogs enough to want to see more will also click the link [1] to find out more about the "Dogs of Amazon", which seems to be unaffected by their problems. There they will find a story about Amazon's dog-friendly offices, 30+ pictures of Amazon dogs, and a video showing many of them. The time to consume all that is time the person is not hitting refresh on the shopping site, so maybe it's a net win for Amazon.
Interestingly, despite not being a dog person, I found the first instance of a dog to be a cute, human-relatable inclusion. On later pages I found it frustrating and annoying.
I'm sure there's a fundamental UX principle or two at work there, but I won't pretend to truly know what they are.
Any sources for me to read up on? Googling just gave me the BattleChess duck story, some unrelated UX examples using ducks, and info about how the duck hunt "gun" worked.
I've never heard the term before, but I do recall plenty of people playing duck hunt shooting at the dog when the game said they failed. Although, he kinda earned a laser zapping by laughing at your failures, right?
It's interesting that you can have hired the best talent in the world and still have major outages. I wonder if there is a way to ensure more people = more stability. Sounds a bit stupid, but maybe if each datacenter had its own software team working on the same issue, it would be very redundant work, but maybe it would be more organic in its failure modes?
Consumerism at its best... Probably for the better that Amazon Prime is down (at least for me - I'll speak for myself).
Honestly, my life got better when I stopped getting packages to my door. Only buy the most essential things you need - sell the stuff you don't. Better to live an uncluttered life without "things"...
I mean, I see it fail from the fact that there aren't any deals I want. Maybe if their fire tablets ran full android, but there isn't really anything on sale.
I just realized the influence of Hacker News. I read techcrunch every day and it is rare for an article, any article, to get more than 2 or 3 comments (although, to be fair, they just added commenting in the last few months)...this one got 35 already!
Anyways, I think that this is NOT a fail for Amazon at all, but a major win. They obviously created all kinds of attention! I'm sure they will get it right next time, and maybe, if they think fast, can reverse this situation by offering those who were disappointed a 2nd chance at an even-greater discount...at least that's what I would do.
It's hush-hush in the industry that CAT5 and its predecessor the CAT5+1 are both shielded with peanut butter. It's also a little-known fact that most datacenters, or PCs as we in the NOC tend to call them, are built around tropical rainforests, which provide both security from the common man, as well as naturally cooling down the millions of heatsinks needed for the cloud. But an even lesser known fact is that monkeys, yes monkeys, love peanut butter. --I'll leave the rest to your imagination. But one of those fat guys, drinking beer, with a hoodie, and in need of a bath left one of the windows open. The rest... well, the rest was prime.
After reading Ryan Holiday's "Trust Me I'm Lying" and knowing how well AWS usually handles surges of traffic, I am not so sure that this isn't a marketing stunt that gets them an amazing amount of press. Cynical?
There are cheaper ways to get marketing press that don't involve kneecapping one of your two primary revenue generation engines. They've gotten at least two major news cycles just out of vaporware promises to deliver packages by drone, for instance. (It may not be vaporware forever but it certainly isn't coming soon enough to justify breathless press from a few months ago.) It's not like they don't have any idea how to drum up press.
This made me smile. :) This is very close to Poe's law as well, because after all, who can tell for sure whether something that simply went wrong was intentional or not?
Nah, not at all. The vast majority of people don't care about the inner workings of a website, they just want to buy cheap shit and have it all work. Prime Day pretty much markets itself, I just recently got a package from amazon and it was plastered with Prime Day advertising, my co-workers are all talking about it, it's the centerpiece of their website. You'd have to work pretty hard NOT to know about Prime Day today.
Could also be an A/B test of some new infrastructure design to see if it's ready to deploy in November. I have no idea if they still hold everything together with Perl (via this templating system: http://www.masonhq.com/) but it wouldn't surprise me, and it also wouldn't surprise me that there'd be occasional pushes to replace it with something "modern" (or at least friendlier to the revolving door of new grads).
This would make a lot of sense, they essentially get a "game day" with actual sales, make a lot of money and a few months to prepare for the next wave of seasonal mob-shopping.
My guess is this is an intentional marketing ploy. Think of how much press this is generating. Frontpage of HN. No way we are talking about this otherwise.