Prime Down: Amazon’s sale day turns into fail day (techcrunch.com)
646 points by koolba on July 16, 2018 | hide | past | favorite | 360 comments



Kinda wish I still worked there and could see the post-outage writeup.

My bet's on some unforeseen bottleneck that affects search and static pages. Almost everything within Amazon is crazy scaleable, but there are some bits where you scale them up and their behaviour changes radically. For instance, a service's cache misses might skyrocket as customers get distributed over a wider set of servers, causing service response times to increase just a little bit on average, tipping a dependent service over into more frequent timeouts, causing its downstream service to blow a timeout-percentage 'software fuse' and stop using that service... etc etc.
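A rough sketch of the timeout-percentage 'software fuse' idea from the paragraph above, assuming a sliding window over recent calls. The names `TimeoutFuse` and `call_with_fuse`, the window size, and the 50% threshold are all illustrative assumptions, not anything Amazon is known to run.

```python
from collections import deque


class TimeoutFuse:
    """Stops calls to a dependency when too many recent calls timed out.

    Hypothetical illustration: window and threshold values are assumptions.
    """

    def __init__(self, window=20, max_timeout_ratio=0.5):
        self.results = deque(maxlen=window)  # True means the call timed out
        self.max_timeout_ratio = max_timeout_ratio

    def record(self, timed_out):
        self.results.append(bool(timed_out))

    def is_blown(self):
        # Judge only a full window, so one early timeout cannot trip the fuse.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) > self.max_timeout_ratio


def call_with_fuse(fuse, dependency, fallback):
    """Call `dependency` unless the fuse is blown; record whether it timed out."""
    if fuse.is_blown():
        return fallback()
    try:
        result = dependency()
    except TimeoutError:
        fuse.record(True)
        return fallback()
    fuse.record(False)
    return result
```

Once blown, this sketch never resets; a production fuse would normally let a trickle of probe calls through after a cool-down so the dependency can be readmitted.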

Given that each of those services (and many more possibly-related ones) will have an on-call engineer paged into a conference call when the manure hit the rotating ventilation apparatus, there are going to be a lot of unhappy people cancelling their weekend plans right now. I definitely don't miss that aspect of the job!


Thanks for the comment and I love all the insight but serious question about:

> there are going to be a lot of unhappy people cancelling their weekend plans right now.

The week just started... are you saying that you can already anticipate the war room for an event like this to extend through this coming weekend?


Haha, no, oops. My last few days have been a bit crazy and I'm on holiday, so I just thought it was Sunday ;)


Hehe. Reminds me of my Murphy's law moments. Whenever I am on a holiday, a deployment will fail, and I will have to log in and check while my family plays volleyball on the beach.


I'd tell your boss to hire someone to replace you. What if instead of on holiday, you were dead? Kinda hard to log in and fix their shit when you're dead.


I agree. I quit every job where they don't respect my personal time, especially vacation. Life is just too short. It's bad enough what I've been through with companies and their stingy PTO policies (US).


That sounds like a very low bus factor to me.


I was smoked up in Colorado for the old "Let's take down Dyn real quick" BS a couple years ago. I know this very well: things like to run for months without issue, and then I'm off for an afternoon with the phone ringing.


It's a good holiday if you can no longer remember what day it is.


But not good enough if you’re checking HN ;)


I'm on holiday in Japan at the moment and have no idea what day it is. Only checking HN as I'm currently in Osaka waiting for the Shinkansen to Hiroshima. ;)


Do you seriously believe reading something like HN is work for its viewers??


I think they meant that if you're on vacation you maybe shouldn't be looking at the computer.


My Memorial Day weekend on a few-acre Olympic rainforest camping trip: regular signal hunting for HN. Oddly, I found a 6x6 area which, after a reboot, would provide LTE. After idling I would have to reboot again to open some more links.


My guess was either that

OR there wasn't backpressure on a cascading failover so as services failed they increasingly failed to more and more overloaded systems

OR there WAS backpressure and it was the luck of the draw whether you were queued into an error page or got good data

OR the autoscaling couldn't keep up with the onsale window. This used to happen in ticketing a lot. Ticketmaster has a talk somewhere where they talk about warming the scaling load and server cache in anticipation of big ticketing onsales. The time it took to autoscale was just too long.
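To make the backpressure case concrete, here is a small sketch of a bounded queue that rejects new work once it is full instead of letting it pile up on an overloaded system. The class name `BoundedWorkQueue` and its capacity are made up for illustration; nothing here reflects how Amazon or Ticketmaster actually queue requests.

```python
import queue


class BoundedWorkQueue:
    """Accepts work until full, then rejects immediately (backpressure)."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, request):
        """Return True if queued, False if the caller should back off."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest queued request, freeing one slot."""
        return self._queue.get_nowait()
```

With a queue like this, whether a given request gets good data or an error page depends only on whether a slot happened to be free when it arrived, which would match the "luck of the draw" behaviour.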


The systems at Amazon are too large to leverage autoscaling. When I was there (in Marketplace) it was generally estimated that Amazon used about 80X the capacity of their entire AWS public cloud.


I find that implausible. I know they run a huge infrastructure but why would they be using 80X of their AWS public cloud? Thousands of companies if not millions at this stage use AWS and some of them are not insignificant in size.


WTF? 80 times the capacity of their entire AWS infrastructure?? That seems insane.


No, the exact quote was:

> Amazon used about 80X the capacity of their entire AWS public cloud

which is probably closer to "the capacity that is available via AWS is a tiny, tiny fraction of their overall computing power.. therefore adding it back in when things are falling over doesn't actually solve any problems."


I don't understand the difference between the wording in your post and the post you're replying to.


I think they are using the term 'capacity' to mean 'spare capacity'. I.e. that Amazon's entire compute usage is 80x the spare capacity, so scaling even a small amount would consume any spare capacity in AWS. Still, it seems hard to believe.


I interpreted it to mean:

1. he misspoke and meant "80%" of the AWS capacity, which I agree seems implausible.

2. Amazon does not run on AWS because Amazon is 80x more than all of AWS infrastructure. This also seems implausible because of Netflix. In fact, there's an article out there that said AWS exceeded Amazon's capacity within 1 quarter!

I still don't understand what that has to do with autoscaling exactly


I didn’t understand either. I suspect it’s the semantic difference between ‘public cloud’ and ‘infrastructure’. I don’t know what that difference is really.


When I checked it out, the site seemed responsive but search was broken. I searched "usb flash drive" and was presented with 2 results. I pondered this for a moment and then realized that there were 400 pages with 2 results/page (as far as I checked.) Perhaps there is some load shedding algorithm that reduces the result count per page to reduce load. It did discourage me from further searching so I guess it worked. ;)
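Purely as a guess at what such a load shedding rule could look like, here is a sketch that shrinks the page size as load rises. The function name, the 0.7 shedding threshold, and the page sizes of 24 and 2 are assumptions for illustration; there is no evidence this is what Amazon's search does.

```python
def results_per_page(load, normal=24, minimum=2, shed_above=0.7):
    """Shrink the page size linearly as load rises from `shed_above` to 1.0.

    `load` is a utilisation fraction, where 1.0 means fully saturated.
    """
    if load <= shed_above:
        return normal
    if load >= 1.0:
        return minimum
    fraction_left = (1.0 - load) / (1.0 - shed_above)
    return max(minimum, round(minimum + (normal - minimum) * fraction_left))
```

At or beyond saturation this returns the minimum of 2 results per page, which is the behaviour observed above.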


smile.amazon.com was working fine during the outage, if that helps narrow it down...


It wasn't though. I saw an item being listed as a Prime Day deal on the normal Amazon site, then I searched for it through smile.amazon.com and it wasn't there. Went back to non-smile and it was there (but throwing errors so I couldn't click it anyway...)


I'm seeing this also. Same item is $169 on smile.amazon.com, versus $79 on www. Neither is working properly though.


Not for me: a page that gave a fast 503 on amazon.com would spin indefinitely on smile.amazon.com


Weekend plans? Prime Day is going to be over in the early part of the week.

Does Amazon expect a fix to be deployed ASAP after the immediate crisis is averted?


My bet is they maxed out on the SQS FIFO 300 messages per second limit.


I have no idea really (I stopped working on the retail websites in 2013) but my gut feel is that that's about a couple of orders of magnitude too low at least.

But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size on an in-memory table and commits it thinking "Well, I added a few orders of magnitude safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...


Yup, many an outage has something like this at its core. At my company, to address that, we've built an internal library for enforcing rate limits and size limits that is (a) configurable on the fly and (b) generates logs, so that we can trigger an alert whenever any limit reaches 70% of capacity. And thus, hopefully, head these things off before they reach the tipping point.

https://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/
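A minimal sketch of what a limit like that might look like, assuming a plain logger stands in for the alerting pipeline. `TrackedLimit` and its method names are hypothetical and are not taken from the linked post or any real library.

```python
import logging

logger = logging.getLogger("limits")


class TrackedLimit:
    """A size or rate limit that can be changed at runtime and warns near capacity."""

    def __init__(self, name, limit, warn_ratio=0.7):
        self.name = name
        self.limit = limit            # may be reassigned on the fly
        self.warn_ratio = warn_ratio  # fraction of the limit that triggers a warning

    def check(self, current):
        """Return 'ok', 'warn' (near the limit), or 'exceeded', logging the last two."""
        if current > self.limit:
            logger.error("%s exceeded: %d > %d", self.name, current, self.limit)
            return "exceeded"
        if current >= self.warn_ratio * self.limit:
            logger.warning("%s at %.0f%% of limit %d", self.name,
                           100.0 * current / self.limit, self.limit)
            return "warn"
        return "ok"
```

An alert rule can then key off the warning log line, and raising `limit` at runtime buys time without a deploy.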


Don't forget to alert when the double derivative of capacity growth is above a (very low!) threshold, that can catch an explosive problem far earlier than the 70% mark, by which time it might be growing at 5% per second.


It might also be a momentary blip that you really shouldn't need to wake a human up for. The hard part of monitoring imo is that perfect balance. Asymptotically approaching nirvana but never quite reaching it.


All monitoring has that problem :-) But if your rate of rate of change has been positive for some suitable amount of time for the context, it's worth waking a human up over it, because the amount of resource remaining is dwindling exponentially.
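One way to sketch the "rate of rate of change" check discussed above, assuming usage is sampled at even intervals. The function names, the zero threshold, and the three-sample window are assumptions for illustration, not an established recipe.

```python
def second_differences(samples):
    """Discrete second derivative of evenly spaced usage samples."""
    return [samples[i + 1] - 2 * samples[i] + samples[i - 1]
            for i in range(1, len(samples) - 1)]


def growth_is_accelerating(samples, threshold=0.0, sustained=3):
    """True if the last `sustained` second differences all exceed `threshold`."""
    recent = second_differences(samples)[-sustained:]
    return len(recent) == sustained and all(d > threshold for d in recent)
```

Requiring several consecutive positive values is what keeps a single spike from paging anyone: steady linear growth and a one-sample blip both return False, while doubling usage returns True.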


That sounds like a very interesting metric to track and report on. Do you have any further references that discuss that approach?


Not offhand, sorry - I first encountered the notion at a conference some years back, but I've long since gotten out of operations so it's all a bit dusty for me now.


That's cool, but it's a bit more complicated than rate limits.


I once owned an EAV DB for storing small config values. Someone wrote a great wrapper around it that made it seem like a proper DB with many tables. Since it was for storing configs this library cached the whole thing in memory on startup. Zoom to 5 years later, we have 10+ settings for every customer in this store, and one day all our hosts keep failing over. As it turns out that small table for configs was north of 5 gigs and destroying our heap.
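For contrast with caching the whole table at startup, here is a sketch of a cache that loads values on demand and caps how many it keeps. `BoundedConfigCache` and its `loader` callable are hypothetical; the original wrapper's API is not described in enough detail to reproduce.

```python
from collections import OrderedDict


class BoundedConfigCache:
    """Loads config values on demand and evicts the least recently used entry."""

    def __init__(self, loader, max_entries):
        self._loader = loader              # hypothetical callable: key -> value
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        value = self._loader(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the least recently used
        return value

    def __len__(self):
        return len(self._entries)
```

The trade-off is extra reads against the store on a miss, in exchange for heap usage that stays flat no matter how large the table grows.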


This is their official rate limit for the FIFO type SQS. Although I bet the Amazon folks know a couple people around the office who can up that limit :)


oncall is the worst


Not oncall for Amazon but I don't particularly mind it. It sucks to have to be near a computer for the time, but I actually find digging in and finding that needle in a haystack pretty fun. On a good team, the life interruption is minimal.


As much as I like the digging, debugging and learning that comes with it, fixing things under tremendous time pressure wears me out :(


When I was on-call for a big project, I jumped every time my phone rang. It wasn't a good experience at all.

"We are at the fair, I have 18 customers lined up but I can't take orders. Can you fix this now or should I start taking them on paper?"

You know the joke that atheists don't believe in god until the plane nosedives? That was the moment for me :)

"Oh, I... I need to see the error message before I can say anything. We don't have any errors logged. Hmm... Would you allow me to remote into your device?"

"Can't do that. We lost the internet connection here, sorry."

Ugh.


How long have you been on call for and how many 3am wakeup calls have you had? I used to think the same. Protip: you won't always be on a good team.


I'm in Asia and always thought I'd make a great on call person for those of you who haven't jumped ship. I should apply somewhere.


Hah. I'm genuinely surprised this isn't more of a thing.


We reduced the stress of on-call by introducing processes that supported creativity and reduced panic:

https://zwischenzugs.com/2017/04/04/things-i-learned-managin...


I prefer to be oncall as long as I'm empowered to prioritize software improvements that reduce ops burden and improve customer experience.


Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?

I can personally point to two friends who I consider top notch engineers and designers that have left Amazon because of its toxic culture. I'm sure I'm not the only one with these anecdotal examples, we've all heard the stories. At the end of the day, years of poor work/life balance, overly aggressive management and a frugal approach to everything make for a weak argument for A players to stick around.

Could this be an example of crumbling engineering standard at Amazon?


AWS employee and bar raiser here.

>Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?

In many regards, yes. The bar had to be lowered to meet the demands of growth. We've also taken in a lot of hires from companies that have brought their culture and friends with them. The culture at Amazon is not what it was even 2 years ago. It is in many places day 2.

No one also seems to notice that Amazon retail often suffers widespread issues like this. We can count on SEV1's happening during peak as things blow up badly. This has happened several years in a row, and sadly the themes are pretty much the same across all: forgot to scale (yes...really) or some stupid system bottleneck. It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.


> It is in many places day 2.

For those unfamiliar with internal Amazon lingo, a big deal is made of it always being "day 1"

https://www.fool.com/investing/2017/04/13/jeff-bezos-says-it...


> The bar had to be lowered to meet the demands of growth.

Not only that, but I've noticed that Amazon has started using way more contractors (not actual firms, but more Mechanical Turk/UpWork like contractors, if not just straight up misclassification) and agencies within the past year or so. I can't say that I'm surprised that Amazon is having technical issues now that they've grown in numbers but not in technical chops.

On a side note, it is a great time to become a recruiter in Seattle with all these agencies popping up. /s


This makes me terrified that my users' security relies on AWS...


Honestly don't think that is an issue. AWS is THE most certified company I have ever worked with. Seriously, go look at their compliance page - it is ridiculous. https://aws.amazon.com/compliance/programs/


Certification is a theater for managers and so called decision makers. It might have a positive impact on code quality as long as people are motivated. But if insiders are telling me that the workplace culture is deteriorating that is a serious problem. Certification will not prevent bad code hitting production.


Reminds me of several CMM Level 5 certified IT services firms consistently grinding projects into dust!


I was mainly referring to the security of their systems, but I take your point.


Not certain on this, but I think Azure is known to have more compliance certifications. This is probably largely due to the long relationships Microsoft has with government organizations etc.


I got the impression from some former Amazon employees that they were burnt out and put their time in and got Amazon on their resume and were just done with life at Amazon so they moved on.

Do you sense any attrition like that?


> It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.

Could you expand on this a bit, please?


An Amazon manager, full of shit, once tried to tell a joke at a social gathering. He claimed that he and friends removed the engine from someone's car over lunch as a prank. This is the education level that some of them have. He literally didn't think people would know he was full of shit.


It would be technically possible to remove a car engine during lunch. However, that's not so much a prank as malicious damage. You'd also have to be a very skilled mechanic, familiar with that specific car, to do it that fast.


Not to mention you'd drop a lot of polluting fluids on the ground if you did it in a parking lot.


Honest question: education level or mental developmental level?

Mental health is enough of an epidemic that $company_with_exponential_scale is most definitely going to pick up a fair chunk of people with various issues, cognitive included.

Chances are this is even part of why the bar is a bit lower.

(Said as someone with mild high-functioning autism)


My point is Amazon has hired some managers that have really lowered the bar due to the environment they are from and people they are used to dealing with. Imagine someone from a place where you could tell a preposterous lie and 50% of people would believe you.


Gullibility has little to do with 'education level'.


You misread it - the manager wasn’t gullible, he was the one telling the fib.



I can see what you mean. Take for example the email I received to promote Prime. A mix of Spanish and English with some very bad translation problems in Spanish. It was a bit embarrassing to read it.


No load testing ever?


Good luck generating Prime Day load artificially.


At Uber we invested in synthetic load tools that create fake riders and drivers, match them, etc., and test our entire dispatch system end to end with arbitrary numbers of online drivers and riders. I don’t see why they couldn’t do the same with carts, adding products to carts, etc.
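As a toy version of the same idea applied to carts, the sketch below drives fake shoppers against an in-process stand-in service. `FakeCartService`, `synthetic_shopper`, and `run_load` are invented for illustration and have nothing to do with Uber's or Amazon's real tooling; a real test would point at a staging endpoint instead.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCartService:
    """In-process stand-in for a real cart API (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.carts = {}

    def add_to_cart(self, user_id, product_id):
        with self._lock:
            self.carts.setdefault(user_id, []).append(product_id)


def synthetic_shopper(service, user_id, items, rng):
    """One fake customer adding `items` random products to a cart."""
    for _ in range(items):
        service.add_to_cart(user_id, rng.randrange(1000))


def run_load(service, shoppers, items_each, workers=8, seed=0):
    """Run `shoppers` fake customers concurrently and wait for all of them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(synthetic_shopper, service, uid, items_each,
                               random.Random(seed + uid))
                   for uid in range(shoppers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
```

Seeding each shopper separately keeps runs repeatable, which helps when comparing results before and after a change.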


They do. But I'm guessing you don't scale up 1M+ servers for your Uber canary traffic tests - these are some of the scales Amazon undergoes in these events. The scale is unlike almost any other web property around.


> But I'm guessing you don't scale up 1M+ servers

Do they really need 1 million servers? Many of my friends who work at other tech companies need so few servers in comparison, even with significantly high traffic, that it just screams massive inefficiencies...which seems wrong.

But I've never worked at Amazon so I wouldn't know.


At one of my previous companies we managed about 100 servers for peaks of about 30 million users.

But we didn’t handle free-form text search like Amazon. I can imagine that would necessitate a huge scaling up of compute and data.


Judging by all the error pages after launch, they needed more!


I’m not sure how it’s relevant. If you have the infrastructure to send 1000 concurrent users you can probably send 1M concurrent users. We only test small integer multiples of our peak traffic, and if your absolute number of servers to service that is in the millions then it would make absolute sense to be routinely running that capacity test. If that means “scale up 1M+ servers” then that is what you have to do, otherwise how can you be sure?


Could you do something like that from another cloud service? I suppose the difference would be whether 10K requests from one IP address would be the same as 10K requests coming from 10K IP addresses. The further you move from the actual production load, the higher risk that the test doesn't test everything. For example if the network could only handle connections from 5k hosts then the former could pass while the latter failed miserably.


Simulating millions of users should be well within the capabilities of a company as large as Amazon. Off the shelf load testing tools like Locust can create thousands of fake users with one worker.


Please, generating load is always easier than servicing it.


Probably can find a hotspot and load test on the hotspot.


(◔_◔)


It's actually pretty trivial. The hard part is getting the right parameters for the load and making sure you are loading it properly. Think about DDOS attacks. Generating the load is rarely the issue.


what is a bar raiser?

and when has amazon ever been an engineering force? i have always felt the website and service experience is a relic of the 2000s. more often than not, i get the answer “our system can’t do that” from customer service.


>https://www.socialtalent.com/blog/recruitment/raising-the-ba...

I think Amazon has taken on an outsized image to many people that just isn't true. We have good engineers in many organizations, but we don't pay enough, have the right strategy, or take care of individuals well enough to lure the kind of great folks you find at other big tech companies. In many ways, Amazon is a retailer that does technology because it found a way to make money from it. The DNA is still MBAs/finance and retail.


A bar raiser is typically someone in the interview loop for a candidate that is not from the same team and therefore more objective about whether the candidate is truly 'better' than most people on the team (raises the bar). Although good in theory, it's not actually practical to find every single new person better than the current employees, but it helps keep things in perspective and is a deterrent to bringing in weak buddies.

It's true Amazon has some great engineers, but it is not very engineering-centric. I remember a senior engineer in Retail once comparing it to a plumbing system kept together with bandages.


How much do you think the on-call contributes to engineers leaving? You would think support tools and support personnel could help to retain engineers.


No, bad design and refusal to manage technical debt is the issue. Oncall only matters in some orgs and even then only matters where the tech debt is totally out of control.

Bottom line is Amazon is a product culture not an engineering culture and that makes it really easy to leave for Google or unicorns that really appreciate tech debt tradeoffs.


Who cares about retaining those support/ops engineers you shit on because you don't want to own the crap you write, right? :)


From the site:

In simple terms, bar raisers are current Amazon employees that come in during the interview process to analyse candidates. They do this alongside their own full-time job, assessing as many as 10 candidates a week and spending 2-3 hours on each one.

In other words 20-30 hours per week on top of the full-time job? That doesn't sound quite right.


Another bar raiser here. That site is wrong. The expectation is 2 candidates a week. And it would be more proper to say that it is part of your full-time job, not on top of it.

10 candidates in a week may happen at some kind of event, but then it isn’t 2-3 hours per candidate, and in that case you’d effectively be taking a couple of days off from your normal job.


This bar raiser thing sounds like normal hiring practice in my work experience.


The thing that baffles me is the industry perception is that Amazon has subpar engineering but Amazon right now is 2nd after Apple for Marketcap, so they must be doing something right.

I love the Amazon customer service. They’ve managed to crack a difficult problem and execute enough that other Giants haven’t come close to it yet.

GCP and Azure tail AWS by quite a bit. Amazon online retail is a Google search engine level monopoly now.

So Amazon can do a lot of things wrong, but I’d have to say they get the important parts right.


> Amazon online retail is a Google search engine level monopoly now.

Not even particularly close. Amazon doesn't even have a majority of online sales, although it's getting close. They seem like a much bigger force than they are because of growth.

https://techcrunch.com/2018/07/13/amazons-share-of-the-us-e-...


They are big, because they invested every penny they made into getting big. And those pennies are converted through human misery in "fulfillment centers". (And AWS on-call is not that fancy either for engineers, though the two are almost incomparable.)

They got some parts right, some parts awfully wrong, and some are just irrelevant now. They make money, they are cheap + convenient, and that's usually what people focus on. They are not sophisticated, they are not great designers, etc.


This doesn't completely answer the question, but I would distinguish UI and UX from the ability to build systems that run successfully at Amazon scale.


Corporate meme thing like six sigma. Each new employee is supposed to raise the bar, or the whole organization gets worse on average. Jeff Bezos as #1 is supposedly the least qualified employee in the company as each new hire raised the bar.


> what is a bar raiser?

A PM who has never opened an IDE in his or her life and who is only familiar with “coding” concepts through Wikipedia. They read books by Malcolm Gladwell, Daniel Kahneman, and Nassim Taleb and majored in one of the humanities. When they’re shown the webpage that the geeks created which loads in 2 seconds, they tell the lead developer that they want the loading time cut down to 1 second and the font to be changed.


Hey hey hey. What's wrong with Kahneman?


Absolutely. The fact that this is ongoing hours later is pretty much a clear sign that there’s problems.

I understand that it’s not always easily fixable. But honestly this looks pretty bad. Could be anything from bad code to a flakey cable.

I got out of high volume websites - too much marketing, unpaid overtime, and horrible work-life balance.


Considering they cleared $2.6 billion on last year's Prime Day, and membership is up YOY, I'd say this comment is 100% knee-jerk/over analysing. Stuff breaks, things go wrong, planning for the worst sometimes isn't enough.


I don't see how "they sold a lot a year ago" and "membership is up YOY" is meant to dispute the idea that engineering talent is leaving. I agree we don't have enough evidence to say for sure, but those figures don't seem relevant.


My comment has nothing to do with Amazon's current financial success or growth. It's regarding their engineering team and their ability to recruit and retain top talent.

It is not a binary state of absolute destitution and top notch brilliance, it's a trend that can move one way or another and will show itself in more frequent outages, poorly rolled out products, lazy design, etc.

The plumbing can keep on working for a few years even after it begins to erode. Historically there are many examples of this.


Maybe, but your evidence is anecdotal. Good engineers leave huge companies all the time but it is usually balanced by inflow. I guess we have no evidence of inflow as compared to outflow of good engineers.


I didn't reach a conclusion, I presented a possible reason knowing very well that I don't have the evidence to back it. Hence why I said "could it be that the engineering force at Amazon is no longer what it used to be?" and not "The engineering force at Amazon is no longer what it used to be."


have we reached peak Amazon?


No. Amazon has had outages on holiday shopping days throughout its history.

Amazon has always been toxic and frugal, so obviously it didn't interfere with whatever mythical software quality it had in the past.

And today's failures might be because past engineers built unstable unmaintainable systems and then ran away.

And of course amazon is 10, 100, 1000 times larger tech system than it used to be.


My gut feel is that they did everything right to prepare based on forecasts from last year but still got swamped by demand.


I have Amazon Prime, and I had no idea that today was Prime day. Haven't seen any marketing for it, and nothing in email until this morning.


It has been all over the place. Unless you don't visit Amazon on a regular basis, or deals websites. I don't even watch TV often and saw an ad on it.


The only marketing I've seen for it was the tape on my packages the last few weeks.


I saw the page reload multiple times when doing a search before finally dying...I can only imagine that means additional load generated by the initial triggering failure.

Which in turn tells me they didn't test the failure case. Now, Amazon is a huge and complicated beast so I don't want to imply this was a "dumb" mistake, but (assuming I'm correct) it is a failure that risked making MORE failure, so it's not demand alone to blame.
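A back-of-the-envelope way to see how retries can amplify load, assuming the page reloads were automatic retries and that failures are independent. Both are assumptions, and the function name is made up for illustration.

```python
def expected_attempts(failure_rate, max_retries):
    """Expected requests sent per user action when each failure is retried.

    Assumes independent failures with probability `failure_rate` and at most
    `max_retries` automatic retries after the first attempt.
    """
    # Attempt k (counting from 0) is sent only if all k earlier attempts failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))
```

Under these assumptions, a healthy service sees one request per action, but as the failure rate approaches 1 each action costs `max_retries + 1` requests, so the retries add the most load exactly when the service can least afford it.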


[flagged]


Please don't rant like this on Hacker News.

https://news.ycombinator.com/newsguidelines.html


In addition to the shit that is not politically correct to say, Amazon relies heavily on interns for production code and ops. Bezos has this view that the unskilled can get him 80% of what he wants and then a few top people can smooth out the rest. In reality the few top people never even come in contact with the kid that is trying to keep amazon.com running overnight. Successive waves of interns and SDE 1's fuck up the same shit over and over. Bezos is stupidly applying this same strategy at his space company.


This is complete nonsense, at least in my org. Interns are given projects that are as far from production code as possible. I'm sure there are managers out there who have done this but it does not extrapolate to the entire company and it's ridiculous to think that it's a company wide practice to make interns responsible for anything that would affect customers.


This is complete BS. Interns are basically on a really well paid holiday and given random features that may/may not ever make it to production.


I always wonder how stupid people can accumulate $100Billion being stupid.


>Could this be an example of crumbling engineering standard at Amazon?

As someone that's been going through quite a bit of depression because Amazon was the best offer I got (as opposed to Google or Facebook because I'm still bad at coding interviews) I'm afraid this is the straw that's going to break the camel's back for me. This is exactly what I was afraid of and exactly what I have waiting for me when I join next week.

There's no point anymore.


100% serious here: you should probably see a therapist. Life is so much more than your job, and your job is so much more than the brand name of your company. Having worked at a failing business and one of the companies you mentioned, I’m no happier at the latter than I was at the former. If you define yourself by the names on your resumé one day you’ll wake up and realize how much of life you’ve missed out on.


Hanging out at a semi-related Meetup, which offers a safe cocoon to vent, may also help. It worked for me, YMMV


Hey, I've seen a few of your posts around here and just wanted to encourage you to keep your head up. Amazon is an amazing place and even though I don't know anything about your specific situation, there are a lot of incredibly smart people there and I promise that you'll learn a ton.


It's literally comment after comment of playing the victim.


I'm not "playing the victim", I'm bitching about how terrible my own life is. Which it is.

Why do I see literally everyone else have their dreams come true while mine don't?


I was in the exact same boat - ex Amazon intern, felt I was a pretty solid engineer, but didn't quite nail the FB interview. Ended up joining Amazon because I was sick of interviewing.

Let me tell you: it will be OK, and the other posts about making the best of your situation are true. Amazon is an incredible learning opportunity if you stay open to it. I'm still working there a year later and I'm building far higher impact projects than my friends at Goog and FB because like another poster said, Amazon lets SDE1s work on just about everything. The growth potential is tremendous if you show you're competent and willing to learn.

Chin up, you're in a good position :)


People say the same thing about every big tech company - Google and Facebook included. Trust me, employers will still be impressed with Amazon on your resume.


I've personally never seen it.

Nowadays they have to be impressed by the project too...and mine's front-end development on an internal service nobody uses.


It's true. Go check their Blind's, or talk to some employees that have been there a few years.


new person to amazon.

they have a lot of really amazing and smart people. but like anything in life, it's what you make of it. I'd say put your best into the position and try to learn as much as you can from others. no good reason to do less


It's not just what you make of it: a lot of your experience there is going to depend on your manager. I had three: the first was awesome, so he left for another company; the second was good, but he cared too much and went back to his old (non-management) position to feel fulfilled; the third was a sociopath and did really well, promoted to upper management and a subsidiary and out to greener pastures. Working for the sociopath was awful and got me to leave (where I'd planned on a long career). And no, there was no escape from him except leaving Amazon: he sunk my performance reviews when I asked to leave to go to another team 'cause I told him I didn't think I was a good fit where I was.

There are a lot of amazing and smart people, like you said, but there is also a lot of stress, heartache, and trouble if you don't keep your ear to the ground and build a strong network of people to give you an early warning. Don't keep your head down and concentrate on tech and building cool stuff: Amazon can be way too political for pure techies to thrive without strong protection from management.


Chin up. You're probably in the top 5% of the country in income. If you're in Washington, feel free to tack on another 9% or so to your income in state taxes you don't have to pay.


While it's nice to not have state income taxes to worry about, total tax burden isn't all that much different from any other state. The only time you really come out ahead is if you can live in WA and do all your shopping in OR.


For high-income people like Amazon engineers, WA is a crazy low tax burden. CA income tax could only match WA sales tax if you spend 100% or probably more of your income on taxable items.


I would fucking love to work at Amazon. You’re doing better than almost everyone I know.


You can't let an outage or some alleged talent gap be a reason for wanting to tap out.

You need to realize that you're doing great. Folks would kill to get a job at Amazon; the scale and the challenges are mind-bending compared to what most other companies deal with, and the interview is grueling and you made it.

Technology is a word that describes something that doesn't work yet and Amazon thinks that you are a person that can help tip the balance.


If only they had access to some kind of scalable cloud hosting service, they could've completely avoided this sort of outage. :)

Jokes aside, I admire the work of the team(s) responsible for Amazon's web site. I use it so often and encounter glitches so rarely that it really stands out when something does go wrong.


Serious question: I've heard that Bezos's approach with building out commercial units is to break down each part of the vertical into separate commercially-viable components. Idea being if AWS doesn't make sense for 3rd parties to use then it may not be economical to use it internally.

Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?


As someone that worked at Amazon a long time ago—back when AWS was just getting started—can confirm (historically). Publicly there were a myriad of AWS services; internally all we could use was S3 (for many years), if we were lucky. AWS being born of Amazon's "spare capacity" is an urban legend.

Nowadays, I hear it's quite different, and much of AWS is more rapidly dogfooded.


Some services (before I left) just did not have the sheer capacity to have Amazon.com as a customer at peak. The service teams just said "nope, sorry, you're going to kill us".

The requirements for prime day/black friday/cyber monday were mind-boggling.


That was quickly adjusted. I think about 3 years ago, Amazon was publicly saying that every day they were adding to AWS the same amount of computing power that was used to run Amazon.com when it was a $10 billion business. AWS is massively greater in scale than Amazon.com at this point.


Scaling services for many independent businesses is somewhat of a different challenge than “vertically” scaling for one large one like amazon.com.


I cannot confirm or deny that (since my memories are sorta fuzzy), but the real problem was not EC2 capacity. Many other AWS services were involved. :)


This is not right from an infra POV. You are just moving the problem (by team) but it doesn’t go away. Then you end up with duplication.


At the same time you are also giving the other team more flexibility to solve the problem, as they can optimize it for the Amazon.com use case and are not constrained to a generic solution that is optimized for the masses.


I'm a big fan of dogfooding. But given Amazon ecommerce's huge computing needs, maybe it would make sense to not burden the compute availability of your fledgling cloud offering with your ginormous system?


If built well your compute offering can scale with the resources you give it. If you have Amazon’s commerce platform running on a pool of computers, the team running commerce can give those computers to EC2 in exchange for an equivalent amount of EC2 quota. In theory it’s a bookkeeping operation.

There are plenty of good reasons why you wouldn’t run Amazon’s commerce on EC2, but I don’t think cloud availability is one of them.


FWIW 100% of the Amazon retail website is run on EC2 (and iirc it's 100% on spot instances too). (There are of course lots of internal services that still use non-ec2 capacity but fewer and fewer every day.)


note though it is not 100% AWS, e.g. they use Akamai instead of CloudFront


I don't think the spot thing is accurate. Amazon retail EC2 capacity is more like reserved instances.


That's just marketing BS.

One of Amazon Retail's big internal goals this year is "finally get everyone off of Oracle".

Bezos opposed the creation of AWS. Almost everything that was early AWS (EC2, S3) was done over the objection of Seattle leadership. Look where the teams were based.

After they shipped AWS, it took eighty zillion years for Retail to use anything AWS offered beyond S3, and, as mentioned above, it's not like they're 100% on DDB or RDS now: they have dependencies on freakin' _Oracle_ all over the place.

I mean, this is not a criticism of Bezos or Amazon at all, but at the end of the day, even if Bezos is a supergenius (and I see no reason to doubt that he is. Although I often wonder how he keeps himself motivated to continue to work so hard on building a Walmart competitor with the precious few years he has left on Earth given that he could do literally anything with his time), Amazon is still a company made of tens of thousands of people. It's not a 4 dimensional chess-ballet. It's got probably 95% of the same chaos and disorder that every big organization made of humans has. It just turns out that cutting it 5% in the right markets has incredible returns.


AWS got built over the objection of the Chief Executive and majority owner of the company?


Worked for Amazon a few years back.

I can't see Amazon ever using external cloud hosting for anything except the most trivial of tasks. They're absolutely, utterly paranoid about any sort of confidential information, and I think even with encryption the perceived risk would be too high.


lol, meanwhile everyone else is trusting aws.


That's the sad truth. But it's convenience like anything else.

In my business I can put a hardware site online with

---6X--- 2 x intel gold 5115 10 core + 64 GB RAM 1 nvme @512G + soft raid1 @4TB magnetic 1 10G, 2 1G ether ~= 42K.

---2X--- storage or NAS with 60TB @RAID5 + 2x quad core low end xeon + 32 GB RAM = 16k

---2X--- 1G edge/core mngd switches + 10G SAN/LAN mngd switches = 5K

---2X--- Endian firewalls + threat appliances = 5K

---1X--- Colo with 2 year lease and 25 amps @208v 1g port speed and committed throughput > 100/mbps = 16K yearly

68K one time cost for depreciating assets we maintain, provision and secure + 16K yearly recurring cost.

Or I can go AWS and modify my processing model, security expectations and service infra and spend 25K a year + 15K 1x migration cost.


So... you save $9k a year in recurring and it will be more than 5 years before you break even due to your $68k up front equipment costs.

And that's assuming you don't have any needs to quickly scale up or down and you are limited to 1 colo instead of the ability to expand to multiple regions like with AWS.

And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.

Doesn't sound like rolling your own stuff in a colo is a very good idea in this case. But that's job security if you are the sys admin I guess.
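
For anyone who wants to check the arithmetic in this subthread, here is a rough sketch using only the figures quoted above (in thousands of USD). It assumes flat yearly costs and ignores hardware refresh, depreciation, staff time, and scaling needs, so treat it as a back-of-the-envelope comparison rather than a real TCO model. The function name is made up for illustration.

```python
# Figures quoted in the parent comments, in thousands of USD.
hardware_one_time = 42 + 16 + 5 + 5  # servers, storage, switches, firewalls
colo_yearly = 16                     # colo lease, recurring
aws_yearly = 25                      # AWS estimate, recurring
aws_migration_one_time = 15          # one-time AWS migration cost


def break_even_years(one_time_a, yearly_a, one_time_b, yearly_b):
    """Years until option A (higher up-front cost, lower recurring cost)
    has cost the same in total as option B. Returns None if A never
    catches up because it does not save anything per year."""
    yearly_saving = yearly_b - yearly_a
    if yearly_saving <= 0:
        return None
    return max(0.0, (one_time_a - one_time_b) / yearly_saving)


# Counting the AWS migration cost against the hardware purchase.
with_migration = break_even_years(
    hardware_one_time, colo_yearly, aws_migration_one_time, aws_yearly
)
# Ignoring the migration cost entirely.
without_migration = break_even_years(
    hardware_one_time, colo_yearly, 0, aws_yearly
)

print(f"hardware up front: {hardware_one_time}K")
print(f"break-even, migration counted: {with_migration:.1f} years")
print(f"break-even, migration ignored: {without_migration:.1f} years")
```

Either way the colo option takes more than 5 years to pay for itself on these numbers, which is longer than many shops keep servers in production.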


> And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.

Although, as I said upthread, I agree that AWS is very likely ideal for this particular deployment size, let me try to dispel this oft-repeated myth.

Modern server hardware takes almost no "brain power" (or effort of any kind) to keep up and running.

We aren't living in the days of the early dot-com boom where Linux-on-Intel in the datacenter could mean flimsy cases, barely rack-mountable, with nary a redundant part to be seen.

Applying some up front "brain power", one can even choose and configure hardware in such a way as to provide things like server-level redundancy, if that's important and/or preferable to intra-server redundancy (think Hadoop), or the ability to abandon mechanical disks in place instead of ever having to replace one.


This is the main "sweet spot" for AWS (or "cloud" infrastructure in general): small scale.

I am generally a strong proponent of using one's own hardware in a colo or on-premises, instead of or in addition to the cloud (primarily for "base" workload).

However, if the entirety of your needs can fit into a single rack, even I will advocate for AWS, since "convenience" is, perhaps, not strong enough a word.

I do think your server and storage prices are around $25k too high, but that's easy to do buying brand name and/or not negotiating with multiple vendors on price (which is particularly tough at low volume unless you're a startup with a credible growth story). That's assuming such an expensive CPU (in comparison to so little RAM) isn't foolishly profligate, along with the other hardware choices. Of course, this underscores the point (on which we agree) that, as a rule, it's just not worth that much time and effort for so little.

I'll take your word on the AWS pricing, as it's fairly predictable, if very tedious to perform the prediction. The main "gotchas" I've found people run into are forgetting to add in EBS costs for EC2 instance types without (or without comparable) local storage and underestimating data transfer costs.


You'll have to trust me that this example's hardware spec and requirements are for a basic/base site. You can thin the profile and increase the # of chassis, compromise on redundancy, etc...but experience has shown that this arrangement is most cost effective. Kinetic event impact modeling system -w- RT data delivery -- that should answer your conjectures.

No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.


> You can thin the profile and increase the # of chassis, compromise on redundancy, etc

No, I wouldn't suggest more chassis, as that's almost always more expensive (it's tough to break even on that $1k minimum buy-in on a server).

I believe your workload needs the resources you say. It just happens to be a remarkably rare ratio, hence my remark.

> No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.

The vendor doesn't have to be large to jack up the price. Any re-brand is super suspicious. To me, a large part of the point of a commodity server product is the reliability is predictable (and therefore easy enough to engineer for/around). Paying extra for "diligence", warranty, or hardware support is just flushing money down the toilet.

A fee for custom assembly and/or a basic smoke test is fine, but it had better be a flat rate per server and on the order of $100. Technician labor isn't that expensive.

Larger or "enterprise" vendors are merely the extreme version of this, with upwards of a 10x premium on something like storage arrays, especially if one includes


You seem to be an absolute type of planner. I used to approach IT mgmt and provisioning that way some years ago before being confronted with the realities of small and large business. One size obviously does not fit all and sometimes you take shortcuts..usually you pay for them later.

I agree with your cautions around supermicro resale but the warranty support and build diligence are absolutely necessary for a small business. Having a good business relationship with a trusted provider of hardware that always performs the first time is priceless.


I don't know what an "absolute type of planner" is, but I consider myself an engineer and a pragmatist. I'm well versed with realities. In reality, with business, there's no such thing as "priceless", only risk, and risk is, generally, quantifiable. With enough data, it's easily quantifiable.

I admit that, having an affinity for startups, rather than more traditional small businesses, I have a greater affinity for risk. Ironically, perhaps, I'm usually the voice of risk-aversion with respect to IT infrastructure, so I don't believe it affects my overall understanding.

I recently pointed out to an interviewer who was trying to convince me that it was worth spending half a megabuck on a petabyte from Netapp because it was "business critical" instead of 1/10th that amount for DIY, that, just like the DIY solution, Netapp does not indemnify the business against loss. One isn't buying insurance, only a bunch of technology.

Sure, "works the first time" is worth something. Is it worth the cost of a whole, complete, extra server on an order of qty 6? If the infant-mortality rate on servers is anywhere approaching 1-in-6 and they're being shipped somewhere that the replacement time and/or cost would be prohibitive, I'd still probably rather just order 7 servers instead.

That's my main problem with paying a vendor for "reliability": it's a very fuzzy, hand-wavy assurance. Paying for reliability with more hardware has data and statistics behind it, which is an engineering solution.
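
To make the "just order 7" argument concrete, here is a minimal sketch of the statistics involved. It assumes each server fails on arrival independently with the same probability, which is a simplification (a bad batch would correlate failures), and the 1-in-6 rate is only the pessimistic hypothetical from above, not measured data. The function name is my own.

```python
from math import comb


def prob_at_least(needed, ordered, p_fail):
    """Probability that at least `needed` of `ordered` servers arrive working,
    assuming independent failures with probability `p_fail` each (binomial)."""
    p_ok = 1.0 - p_fail
    return sum(
        comb(ordered, good) * p_ok**good * p_fail ** (ordered - good)
        for good in range(needed, ordered + 1)
    )


needed = 6
for p_fail in (1 / 6, 0.05, 0.01):
    for ordered in (6, 7, 8):
        p = prob_at_least(needed, ordered, p_fail)
        print(f"p_fail={p_fail:.3f} ordered={ordered} P(>= {needed} working)={p:.3f}")
```

With a measured failure rate plugged in, the price of each spare can be compared directly against whatever premium a vendor charges for "works the first time".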


Risk is not quantifiable based on your insight into most businesses. You provision and despair.


I can't understand either of these two sentences, not even who the "you" is supposed to be.

I'd hope for a more substantive reply, if anything.


That is some serious set up. I can’t see how you get close to this only spending $25K a year on AWS - maybe the price I was quoted for my needs was some sort of suckers price.


This is a ballpark average and only for one VPC based site. You'd need multiple sites...just as you do with hardware.


I don't see their comment as paranoid because cloud providers generally are incompetent/steal your data, but because other cloud providers are direct competitors.

Even if they're not sifting through your server data, they can possibly try to get a competitive advantage by analyzing things like usage data as someone above pointed out.


When AT&T was forced by law to allow MCI switching equipment in their facilities, they tended to leave that room's window open, so pigeons could nest in the equipment.

(I might have a detail wrong, this and a hundred other great telecom anecdotes are in The Master Switch.)

Always nice to make sure no parties you depend on have conflicting interests.


I know of a big company that competes with Amazon outside the cloud that pressures its vendors not to use AWS for any of its data.


Ebay? Walmart? Alibaba?



Walmart, Carrefour


Most companies can't afford to implement all the security and compliance measures that AWS, Google, or Azure does.

The asymmetry in trust is partly because those cloud providers are huge, and partly because those things are part of their core competency.


I know, right? Always seemed a tiny bit hypocritical to me.


> Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?

I'm not sure how secretive it would be. AWS 'bids' on Amazon.com's business and Amazon.com is under no obligation to use AWS as its cloud service provider.


Are these bids currently public knowledge then?!


It would be pretty hard to hide if Amazon.com's images were served from Azure IP addresses...


Surely the public-facing stuff is just the tip of the infrastructure iceberg.


They're more than big enough to be bringing their own public IP blocks to a provider and having them announce with BGP.

Hell, even us regular joes can do that with services like Packet.Net or Vultr.


They use Akamai for that


I doubt it. These companies compete brutally, they would never allow Azure or Google to know their use patterns.


The (hypothetical) day when AWS & Amazon.com goes down on Prime Day and Amazon stock tanks 50%.


It is beyond absurd that people sell $AMZN on this news. Amazon users will just keep refreshing until it comes back online, and then continue buying as normal.


Some rabid fans may, but it's ridiculous to assume that Amazon sees zero missed revenue when their website goes down on their (arguably) biggest sale of the year.

That's similar to all of the locks on all Walmart stores inexplicably getting stuck in the locked state for 2 hours at 6AM on Black Friday.


sure, they probably don't get that money back which might slightly influence the revenue for the quarter, but it's not as if the value of the business has changed drastically in the last 3 hours.

This happens every time AWS has an outage as well. Reddit is down, better sell AMZN.


It's not the same at all, because trying Walmart again later in the day means literally driving back to the physical store. With Amazon, it means getting your phone back out.

I'm sure there is some impact, but it's nowhere near the inconvenience of being locked out of a physical store.


If you already waited overnight, then 2 hours isn't going to cause you to go home and come back. You're going to continue to wait in the line until the store opens back up.


Sorry, I missed the Black Friday comparison. Prime Day and Black Friday simply aren’t comparable, because no one camps out for Prime Day.


My guess is that they will extend prime day. But they fixed it quickly. For me currently on amazon.de there is no problem and amazon.com shows a captcha form, so it looks like they were DoSed.


We had about $300-$400 worth of purchases lined up this morning that were going to be made just for the day. Site was down, we couldn’t get through - they aren’t going to happen. A lot of people only shop on the 30-40% off days - so, I’m guessing we’ll wait for digital Monday out in November?


This.

Tons of people that shop sales will just spend their money somewhere else if they miss a sale.


You’re not going to do it once the site’s up if you get the same price?

Presumably the price sensitive customer doesn’t just go elsewhere, right?


Not necessarily people.


Precisely. I wouldn’t doubt there are algorithms that take into account general tone of tweets and news posts (using some sort of NLP) into a trading decision.


Yes, this is done in practice and is a little scary because of its potential as a means of manipulating automated trading. On the other hand the markets can be manipulated already via news read by humans...


Have worked on one such project, can confirm that these exist.


It's absurd for long term investors to sell on this news but if you're a short term trader or bot then it might be a nice short term profit to trade on this news.


You never completely recover an order dip. Believe me :)


Their stocks started declining around 2:30 eastern time


As of 3:45PM they’re up 0.75% for the day.


Dammit, AMZN, keep it together, I haven't even vested yet...


You knew the risks when you bought the ticket.


Theoretically, everything has a limit. Including scalable cloud computing.


Funny how this comes on the heels of aggressively expanding their workforce and trying to leverage themselves in a hundred different directions...

Maybe it's just me and my confirmation bias at work, but it seems that the core value proposition that Amazon provided -- high value, low margins on products -- has been eroding before our eyes.

Seems so much like the transition Microsoft made... too much focus on "synergies" and leveraging... not enough on keeping the bilges dry and the engine running.

It's funny... Fred Brooks wrote about this in 1975... and we're still making the same mistakes forty years later. There are real limitations to how quickly any organization can grow. Even awesome companies who are excellent at building organizations -- places like Amazon and Microsoft -- can't organize this law of software development away.


> Seems so much like the transition Microsoft made... too much focus on "synergies" and leveraging... not enough on keeping the bilges dry and the engine running.

The companies that just keep the bilges dry and the engine running are the ones that we love, but they’re gone because they got made irrelevant. Or they got absorbed into something larger. Microsoft has a bunch of failed initiatives (Windows Phone, Zune) plus a bunch of successful ones (Azure, Xbox, Office 365).

If you’re up for classic books like Brooks check out The Innovator's Dilemma. You have to try to expand in a hundred different directions because you don’t know which one of those hundred directions will be relevant next decade, and you have to be unafraid of cannibalizing your core business because if you don’t eat yourself then someone else will eat you instead.


Thanks for the recommendation!

I think the hard part is to walk the line between stagnation & over-expansion. This is a dilemma we all face in organizations large and small as well as individually.

Building systems and processes (a.k.a Habits) that allow us to assume and integrate some new set of "stuff" without having to think about them (so we can move on to the next new "stuff") are what sets these companies apart; Amazon has been brilliant at this; however, from my perspective, looking at all the high-rise office buildings going up, it seems like maybe Icarus has flown too high...

But again, I could be building a narrative to fit my preconceived ideas... I'm definitely no authority here. Maybe this is just another blip... I think more troubling to me is the overall degradation in quality of the things I have formerly taken for granted in Amazon -- the quality of the products and ratings.


The Innovator's Dilemma is certainly not about synergies between products. By its conclusion, it's worthless to try to diversify within the same structure, you'd better create a new company and turn the old one into a holding.


Why would you think it’s about synergies? I’m not sure why you would think it’s about synergies. Or why it would be “worthless to try to diversify within the same structure.”

Summarizing the book here would be a bit of a disservice—but one of the points of the book is that there are economic reasons why companies focus on their most profitable core products, and there are economic reasons why that kind of focus can result in the company collapsing when the market moves forward. This isn’t some kind of imperative—the book isn’t saying, ”therefore, you should create a new company.” It’s more descriptive, “this is how big, successful tech companies can suddenly fail.”


> Or why it would be “worthless to try to diversify within the same structure.”

?

The book has 2 chapters on that single point. And repeats it everywhere else.

On synergies, the OP post was about it, not about disruption.


This seems like a bit of an overreaction. Amazon has had big outages before (particularly at AWS), and they invariably solve the problem and move on. As someone else observed, events like this are mainly notable because of how infrequently they happen.


I think that would seem relevant to me if this was the first time amazon.com had been unavailable. Or the first time any big website has gone down.


I wonder how Alibaba Cloud handles similar events [1], where there are bursts of 256k/s transactions and ~1bil packages being shipped out.

Do they just do brute-force massive scale out?

Amazon's US market is big, but my understanding is that number of online users in China (> 400mil) exceeds the population of the US (~325mil), which makes me wonder if the folks there think about data architecture a little differently than we do.

[1] https://qz.com/1127087/singles-day-crazy-stats-from-alibabas...


Probably every day at Alibaba is a prime day in terms of transactions...


True but they still have Singles Day, so it's not like they have constant load.


remember that much of Alibaba is essentially a front for lots of small companies - they don't do the shipping themselves


Oh I included that number to indicate the scale of orders that have to be handled by Alibaba's data systems -- I understand that they don't ship the packages themselves.

Also, I just read that as of 2017, there are 700mil internet users in China, 90% on mobile. The scale there is just staggering.


Could be. But also, Amazon is doing Prime Day in Europe. Albeit they've got a time zone offset.


Doesn't seem like it was a scaling problem at all in Amazon's case.


What? How do you jump to this conclusion!?


Would Alibaba actually ship anything close to 1 billion packages? Most listings I see on Alibaba are 3rd party.


Alibaba builds on Apache Flink. Today is a good day for Apache Flink.


I actually managed to add an item on sale to my cart, but my cart is now empty after refreshing. Oh well, maybe next year.


This made me laugh more than it should have.


I can't help but feel terrible for the team of people there that ultimately gets blamed for this, I hope they can get some sleep tonight.


I thought Amazon (or at least the AWS side of things) was supposedly fairly good about blameless post-mortems?

I remember people were praising them as such when this outage happened: https://aws.amazon.com/message/41926/


They are pretty good about it, and the "Correction of Error" for this one will be pretty epic.

Generally, the rule at Amazon is that any particular f*-up is forgivable... once. (Especially if you can show that you had preventative measures, documented procedures and redundancy in place.)

That said, there will be finger pointing and blame because you're dealing with human beings.


The bathroom schedule will be tightened up by 15 minutes as punishment.


Well, with this punishment, there won't be any bathroom break left!


People likely won’t be blamed. It’ll most probably be a review of process and a correction of errors to prevent future occurrence.


A night of missed sleep didn’t hurt anybody.


It's not so much the missed sleep, as it is the 48 hours of intense heart-rending stress.


I’m blown away by the sudden and swift down votes to my original comment.

These engineers work at a world class company and are paid vast sums of money to not fuck things up. They live way better off than the majority of the country and their mere presence makes life more expensive and stressful for communities around them.

To suggest they cannot go a mere 48 hours or less without sleep on one of their company’s most hyped days is out of touch.


That someone can earnestly suggest that going 48 hours without sleep is somehow an effective way to address an outage is an indictment of a messed up work culture.

People need sleep to think straight, and no amount of money or responsibility is going to change that.


Amazon is not world class. The pay is not comparatively great. There is no retail job (including Amazon) where undue stress and sleepless nights are warranted.

You're not saving lives, you're selling books and cat litter on the internet.


Is their pay not comparatively great? Usually when I see this statement, folks are comparing seattle to silicon valley 1:1 which isn't a fair comparison. Seattle is expensive but not that expensive. My friends who work at amzn seem to be compensated in line with everyone else I know, but maybe I'm wrong?


The vesting schedule and the sweatshop-esque environment arguably don't make the pay worth it (compared to what you can make elsewhere).

Disclaimer: My knowledge is based solely off of public reporting and first hand experiences of SWEs and TAMs no longer at Amazon/AWS.


That...just isn't true. Missed sleep costs millions of dollars a day and worse, numerous human lives in sleep related accidents (car, medical and more).


I can't login to my root aws account right now. It's pretty annoying that the root account login for aws is tied to amazon.com retail accounts.


I've always been pretty sore about this, that my retail and AWS accounts are linked, with no clear way to disassociate them. If I had known that the accounts were going to be joined at the hip, I would have created my AWS account with a different email.


Using the root account is a poor practice anyway...


Is it really so difficult to imagine that I needed to perform an action that can only be done from the root account?

For example, the pen test authorization request form can only be filled out from the root account.


yeah, I'm pretty sure this could be causing some serious issues with plenty of folks. not sure if it's just the web console or if APIs are affected too, but I sure can't get in.


I work in a non-retail part of Amazon and I'm on vacation. Hasn't stopped friends and family from texting me about this though. As if I personally can go in and reboot a server or something. Hope we get it sorted soon!

If you're affected by this, please accept my unofficial thanks for your patience and understanding. (If you're a coworker in retail, good luck getting things up and running!) :-)


Help strikers in Europe by NOT purchasing from Amazon today. Thank you.


Wouldn't it help them to purchase from Amazon today? If there are lots of orders but no one to fulfill them, it seems like that would be greater pressure on Amazon.


But I want things.


I have completely stopped using Amazon, with the final nail in the coffin being the stories about how they treat their workers. I think it's just a matter of dependence and most of the stuff I used to get, I can find it elsewhere.


[flagged]


Slaves "had jobs" as well. Their masters just paid them in food, water, and shelter.

This whole thing about businesses "giving jobs" is ludicrous. American brainwashing. Businesses NEED workers. Not the other way around. Humans have always existed and survived for hundreds of thousands of years without "jobs". Businesses cannot survive without workers. Businesses would simply cease to exist without workers. People can still grow their own veggies and meat. Businesses can't generate profits without workers.


It actually cuts both ways. We workers need the business to exist for the work. The business needs our skills to continue being a business.


> People can still grow their own veggies and meat.

Not as an alternative means of even survival-level support if they don't have access to sufficient suitable real estate to do so.


People can still claim their own real-estate ;)


where?


That was a tongue-in-cheek comment about revolutions etc


"They do have jobs" is a very sad way to justify poor working conditions.

Meanwhile Jeff Bezos' fortune is approximately $1,000 per American household.


I would agree that Bezos is one greedy motherfucker. No doubt about it.


Ah, the story of the great job creators, who float majestically through the firmament shedding jobs like a fish sheds its scales.


I was only showing the other side of equation... I'm as Libertarian as they come and also believe Karl Marx's theories made tons of sense...(I know that sounds a bit contradictory!)

Amazon needs to share its wealth much more, at least among its workers and also independent authors using it, and so on, and realize that it needs to support the ecosystem that allowed it, including things like more freedom and liberty for people, not less. This vast accumulation of wealth is bad news, even for capitalism itself, when you get right down to it. Bezos has a huge chance to make a stellar example to the world here, but so far...ummm...i mean why not?


Explain?



Yeah, if everybody stopped shopping there, the strikers would be muuuuch better off...



"Europe" strikes as often as the wind changes.


Which only raises the question, why doesn't the US? Working conditions must be so fantastic here that no-one feels the need to strike.

It can't be that companies have crippled unions so that they can treat wage and working conditions as a one-sided negotiation. Surely not...


How come it doesn't raise the question of "why doesn't Canada?"


1) They do, 2) they have more worker-friendly laws


Like the f*cking rail service in France, where they've been striking 2 days out of every three, to protect outdated privileges and political relevance.


Damn right we do. And it's why we don't live in a capitalist dystopia.


It's also impossible to log into the AWS console. Surely something of such importance should be separate to the ecommerce site.


Down for me also and it's being reported on https://status.aws.amazon.com/


I can log in without issue. If you're having problems, it's probably unrelated to the Prime Day outage.


AWS console works just fine for me.


I know TFA is about technical failures, but the deals themselves are also incredibly lacklustre. I was expecting at least the Warehouse Deals part of Prime Day to come through, typically 15 to 20% off all used offerings. This year however, Amazon restricted it to only select listings which translated to a few hundred items total. Very sad.


I didn't see anything that appealed to me either. One other person told me the same.

Maybe Amazon's overloaded system was caused by shoppers checking back more frequently than in past years because they can't find really good deals this year either but keep looking harder and harder anyway?


I've come to the conclusion that 'Prime Day' is just a clearance sale. Unless you find something that is exactly what you've been planning on buying, it's all just 'tat' and not worth buying even at Prime Day prices.


What about the advertising money already spent?

I read somewhere (sorry, forgot where) that Amazon had been pushing sellers to spend like mad on ads within Amazon.com for Prime Day, apparently it gave you a big edge over whatever the algorithms suggest.

Those sellers will have missed their sales targets, and will consider the ad spend to have been wasted. Will they get it back?


And vendors who boosted supply based on anticipated sales (both discounts and purchases driven by discounts). Are vendors going to find themselves with thousands of extra widgets on-hand but without the anticipated purchasing frenzy they were counting on to sell them?


Yup, and if you have too much inventory, you can either remove it or sell it at a discount, both are expensive options.

Otherwise you'll get Amazon telling you, "Hey buddy! You sure are using up a bunch of space and not selling much. Why don't you not do that? We're limiting the amount of storage space you can use for Q3."


The ad dollars are only spent if the ad is shown to users. If the site isn't up, ads can't be displayed.


A lot of the ads were shown prior to Prime Day. They were meant to drive traffic to Amazon on Prime Day, not to promote products on the day itself.


A lot of vendors also advertise off Amazon


I’d like to read that comms doc... “how to explain to your client what happened on Prime Day.”


... Is there anything even good for Prime Day this year? Or the year before? Two years ago I remember seeing at least some Dell Workstations that could be repurposed into cheapo home servers. Most of the stuff seems to be odds, ends, and stuff that the various Chinese product-clone companies couldn't get rid of.


I got a baby car seat for half off, which saved a lot of money, actually.


Is there enough historical, public data available to estimate the amount of money Amazon is losing per second?


Last year they sold $2.4B, and this year I've seen estimates around $3.4B - $4B.

Prime day is 36 hours long, but I bet sales are weighted heavily in the first few hours.

So 3.7B / 36 / 60 / 60 = $28.5K per second, and then maybe double or triple that for the first hour or two after 3PM and that'll give you an idea of the scale.

There are also knock-on effects, like reduced trust in Amazon and fewer orders for the rest of Prime week, but also positive effects, like people who will just defer their shopping.
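
For anyone who wants to rerun that back-of-envelope figure with other inputs, here is a minimal Python sketch. It assumes sales are spread evenly over the 36 hours, which is probably not true given the front-loading guessed at above. The function name and the 2x-3x peak multiplier are illustrative only, and the dollar totals are the estimates quoted in this thread, not official numbers.

```python
def revenue_per_second(total_usd: float, hours: float = 36.0) -> float:
    """Average revenue per second, assuming sales are spread evenly."""
    return total_usd / (hours * 60 * 60)


# Estimates quoted in the thread (not official figures).
estimates = {"low": 3.4e9, "mid": 3.7e9, "high": 4.0e9}

for label, total in estimates.items():
    avg = revenue_per_second(total)
    # Hypothetical 2x-3x multiplier for the first hour or two of the sale.
    print(f"{label}: avg ${avg:,.0f}/s, peak ${2 * avg:,.0f}-${3 * avg:,.0f}/s")
```

At the $3.7B midpoint the even-spread average comes to roughly $28.5K per second; the peak range is only as good as the guessed multiplier.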

For what it's worth -- my sales are below my 30 day average. Glad I didn't go all out this year in terms of advertising.


The web site is a bit of a mess. And it seems that, just since this morning, my entire Wish List has vanished. Foolish me for not having a backup...


Check again, mine was missing but it's back now. It's doubtful they'll actually lose data, issues are probably just due to services that are offline.


I wouldn't blame yourself too hard. Amazon has strategically made it very easy to add items to your lists (bookmarklets, Chrome extensions, etc.) while offering _no_ supported way to get the list back out.


If you aren't being a little sarcastic, you are lucky. The bug saved you money for shit that you obviously are not wishing very hard for. Kids at Christmas time certainly don't need a backup....


I'm seeing the page automatically reload every second or so. Maybe some bad JavaScript, though it isn't adding entries to the history. Looks like they have a script that is DDoSing their own site.
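
If the page really is retrying on a fixed one-second timer, the usual alternative is capped exponential backoff with jitter, so clients spread out instead of hitting a struggling backend in lockstep. A minimal Python sketch follows. Nothing here is Amazon's actual client code; the function name, base delay, cap, and attempt count are all assumptions for illustration.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Yield wait times in seconds: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles each attempt (1s, 2s, 4s, ...) but never exceeds the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a random point below the ceiling so clients desynchronize.
        yield rng() * ceiling


# Illustrative run: print the waits a client would use between retries.
for i, delay in enumerate(backoff_delays(), start=1):
    print(f"retry {i}: wait {delay:.2f}s")
```

Compared with reloading every second, the expected wait grows with each failure, which lowers the request rate exactly when the servers are least able to cope.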


Same, the search results page keeps refreshing, void of any items.


What happens to people responsible for the crash today (infra, culprit services)? Does Amazon take some kind of "action" since Prime Day is a huge, once-a-year event for Amazon?


They will have to write a detailed post-mortem (with many people with titles from director upward reviewing it every week). Based on what comes out of the post-mortem, automation and testing will be implemented to remove or mitigate that failure in the future.

Unless this is something egregious (ex: a manager not allowing the team to "scale up" in preparation for the event) no one will get fired. Tempers may flare a bit if it is something stupid (it's usually not).

Nothing different from what happened with the (very public) DynamoDB and S3 failures of yesteryear.


Well, everything I wanted to buy is simply not on sale. If Prime Day isn't supported by those things that I want or need then why would I participate?


I've participated before only to be disappointed. I didn't even bother looking at their 'deals' this year. I consider this their garage sale to get rid of junk... unless you want to buy Amazon products like Ring.


changing the url to smile.amazon.com works


People should be using Amazon Smile all the time, outage or no outage :)

I use this Chrome extension that rewrites all amazon pages to use it: http://www.smilealways.io/

FF equivalent: https://addons.mozilla.org/en-US/firefox/addon/auto-smile/


It doesn't appear to be available for Canada, unfortunately. I didn't even know it existed.


For browsing -- search still seems completely hosed for me


How about some love for their marketing guy who made it all happen?

Jeff probably said "make it rain, let's see if your hordes can take down Amazon.com," and this guy basically accepted, and succeeded at, the challenge.


[flagged]


Since you won't seem to post according to the guidelines, we've banned the account. We're happy to unban accounts if you email us at hn@ycombinator.com and we believe this will change.

https://news.ycombinator.com/newsguidelines.html


Huh? Amazon seems OK now.

Edit: But hmmm, Quora just went down.

> 504. Gateway Timeout.

> Quora is temporarily unavailable.

> Please wait a few minutes and try again.

And they use AWS, right?

Edit: I just got as far as search, order, cart. But no account as Mirimir, so ...

Edit: Re Quora - http://downdetector.com/status/quora

I wonder what other AWS stuff is down. If that's it, anyway.



I’m curious what the repercussions are, if any, once a post mortem is completed and teams or individuals that contributed to the outage are identified. Is “causing” this a fireable offense?


Not unless there was malicious intent or willful negligence. Amazon is a data driven company. The data shows that a “blame” culture results in more incidents. (Airline industry taught us this: https://www.faa.gov/about/initiatives/maintenance_hf/library...)


Do you know this from experience? Or are you just guessing?


No. Amazon's culture is super big on blaming the system. I know of incidents where operators made mistakes doing stuff that directly caused really big outages - the postmortems were entirely about the (lack of) safety of the tooling involved that allowed a single person to make such a mistake, and involved examining all the tooling of entire business units to try to identify other tools that could have such an impact. No individual blaming is involved - and in fact, postmortems always refer to "the engineer" or "the operator" and never include names.


My friend was fired years ago for an AWS outage that took down Netflix. I don't know any other details, but should ask him.


lol, like employers need a real reason to fire someone in this day and age


It shouldn't be unless there was malicious intent.


FWIW, it worked fine in Western Europe since 1 GMT, 8 hours ago.


This is the page everyone seems to be getting

https://i.imgur.com/vpIHDpA.jpg?1


Same on mobile. Perhaps Amazon retail is transitioning to a dog-photos-as-a-service business model?


did Dapps stand for Dog apps the whole time?

alternatively, Amazon rebrands their entire cloud lineup to PAWS


It's up, aaaaannnnddddd now it's back down...


It probably doesn't help that the mobile app appears to load a random picture of a cute dog every time I press the "retry" button. So you can guess what I'm doing, trying to get it to load a new "Dogs of Amazon" pic.

Probably should have gone with goatse, reduce the load.

EDIT: do NOT search for "goatse" on your work connection. That alone, even if you've never heard the word, should tell you why I suggested it as an alternative.


Maybe they figure that people who like dogs enough to want to see more will also click the link [1] to find out more about the "Dogs of Amazon", which seems to be unaffected by their problems. There they will find a story about Amazon's dog-friendly offices, 30+ pictures of Amazon dogs, and a video showing many of them. The time to consume all that is time the person is not hitting refresh on the shopping site, so maybe is a net win for Amazon.

[1] https://blog.aboutamazon.com/working-at-amazon/how-much-does...


Interestingly, despite not being a dog person, I found the first instance of a dog to be a cute, human-relatable inclusion. On later pages I found it frustrating and annoying.

I'm sure there's a fundamental UX principle or two at work there, but I won't pretend to truly know what they are.


It's called the Duck Hunt principle.


Any sources for me to read up on? Googling just gave me the BattleChess duck story, some unrelated UX examples using ducks, and info about how the duck hunt "gun" worked.


I've never heard the term before, but I do recall plenty of people playing duck hunt shooting at the dog when the game said they failed. Although, he kinda earned a laser zapping by laughing at your failures, right?


It's like https://knowyourmeme.com/memes/oopsie-woopsie

Warning: turn on your adblocker


Dammit I made the mistake of searching "Goatse" at work.


Oh, shoot, I forgot there were yungin's in the room. I'll edit that.

(Another "makes you feel old" moment when someone says they had to look up "goatse".)


Goatse, much like its counterparts, is to me a litmus test of your internet culture / age.

Not even 30 yet - damn, it hurts to put it in writing; but clearly old enough to know the old interwebz :)


The dog pictures are cached locally in the app - can confirm because they also show up when traveling through a cellphone-signal-free zone everyday.


The pictures might be local, but I sure hope that retry button actually generates a request to the server.


I appreciate the edit for anyone who has not experienced goatse, as it also put my age in check.


It's interesting that you can hire the best talent in the world and still have major outages. I wonder if there is a way to ensure more people = more stability. Sounds a bit stupid, but maybe if each datacenter had its own software team working on the same issue, it would be very redundant work, but maybe it would be more organic in its failure?


Or a team of 9 pregnant women making a baby in only one month?


Should work because 9 months / 9 = 1 month.

You probably want a distributed team though, with maybe the fetus being thrown around by pneumatic mail.


You're joking, but I do wonder if you can speed up fetal development in some way.


I've heard this analogy before but I believe it could be improved. Thinking in terms of months leads to an awkward divisor.

It would be better to go with a divisor of 10 so as to assign precisely one finger and one toe to each of the distributed work groups.

(But, I have to admit, you may want yet another group to coordinate and assemble it all.)


Consumerism at its best... Probably for the better that Amazon Prime is down (at least for me - I'll speak for myself).

Honestly, my life got better when I stopped getting packages to my door. Only buy the most essential things you need - sell the stuff you don't. Better to live an uncluttered life without "things"...


I am unable to buy/bid on my Amazon AMS ads still as of 2:12PM.


Yikes AWS console down


I mean, I see it as a fail from the fact that there aren't any deals I want. Maybe if their Fire tablets ran full Android, but there isn't really anything on sale.


Just tried to do a search, got no results. Google search found me the product page, (which worked).

If you can't even search for products, there's a problem!


TFA is a super buggy webpage.


Hasn't this happened previously? It may become part of the celebration each year.


Amazon is worse than Alibaba even, what a failure today.


I just realized the influence of Hacker News. I read techcrunch every day and it is rare for an article, any article, to get more than 2 or 3 comments (although, to be fair, they just added commenting in the last few months)...this one got 35 already!

Anyways, I think that this is NOT a fail for Amazon at all, but a major win. They obviously created all kinds of attention! I'm sure they will get it right next time, and maybe, if they think fast, can reverse this situation by offering those who were disappointed a 2nd chance at an even-greater discount...at least that's what I would do.


It's hush-hush in the industry that CAT5 and its predecessor, the CAT5+1, are both shielded with peanut butter. It's also a little-known fact that most datacenters, or PCs as we in the NOC tend to call them, are built around tropical rainforests, which provide security from the common man and naturally cool the millions of heatsinks needed for the cloud. But an even lesser-known fact is that monkeys, yes, monkeys, love peanut butter. I'll leave the rest to your imagination. But one of those fat guys, drinking beer, with a hoodie, and in need of a bath, left one of the windows open. The rest... well, the rest was prime.


After reading Ryan Holiday's "Trust Me I'm Lying" and knowing how well AWS usually handles surges of traffic, I am not so sure that this isn't a marketing stunt that gets them an amazing amount of press. Cynical?


There are cheaper ways to get marketing press that don't involve kneecapping one of your two primary revenue generation engines. They've gotten at least two major news cycles just out of vaporware promises to deliver packages by drone, for instance. (It may not be vaporware forever but it certainly isn't coming soon enough to justify breathless press from a few months ago.) It's not like they don't have any idea how to drum up press.


It's happened again [1], and in less than 15 minutes!

That's it, I'm terming this the Klein effect.

[1] https://news.ycombinator.com/item?id=17499447


This made me smile. :) This is very close to Poe's law as well, because after all, who can tell for sure whether something that simply went wrong was intentional or not?


Nah, not at all. The vast majority of people don't care about the inner workings of a website, they just want to buy cheap shit and have it all work. Prime Day pretty much markets itself, I just recently got a package from amazon and it was plastered with Prime Day advertising, my co-workers are all talking about it, it's the centerpiece of their website. You'd have to work pretty hard NOT to know about Prime Day today.


Any publicity is good publicity!


lol. Seriously?


Could also be an A/B test of some new infrastructure design to see if it's ready to deploy in November. I have no idea if they still hold everything together with Perl (via this templating system: http://www.masonhq.com/) but it wouldn't surprise me, and it also wouldn't surprise me that there'd be occasional pushes to replace it with something "modern" (or at least friendlier to the revolving door of new grads).


This would go down as one of the worst A/B tests in history if that is the case, which I highly doubt.


This would make a lot of sense, they essentially get a "game day" with actual sales, make a lot of money and a few months to prepare for the next wave of seasonal mob-shopping.

I don't think they're using Perl anymore however.


My guess is this is an intentional marketing ploy. Think of how much press this is generating. Frontpage of HN. No way we are talking about this otherwise.


Not sure I follow.

"Hey did you know Amazon is unreliable and unusable?" "Really? (checks) Yup, you're right!"

(somewhere in the distance, a marketing guru cackles with delight)


No. I work in one of the biggest fulfillment centers in the network. Shit broke.


They are losing millions in revenue every minute their site is down. Way too expensive for marketing.


Amazon is on the front page nearly every day for some reason or other (:


If you are big enough it's hard to miss you.


yes, this is the first I'm hearing about this whole amazon thing!


#tinfoilhat



