Prime Down: Amazon’s sale day turns into fail day

camtarn · on July 16, 2018

Kinda wish I still worked there and could see the post-outage writeup.

My bet's on some unforeseen bottleneck that affects search and static pages. Almost everything within Amazon is crazy scalable, but there are some bits where you scale them up and their behaviour changes radically. For instance, a service's cache misses might skyrocket as customers get distributed over a wider set of servers, causing service response times to increase just a little bit on average, tipping a dependent service over into more frequent timeouts, causing its downstream service to blow a timeout-percentage 'software fuse' and stop using that service... etc etc.
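For readers who haven't met the 'software fuse' idea, here is a minimal sketch of a timeout-percentage fuse in Python. It is not Amazon's implementation: the `TimeoutFuse` class, the window size, and the threshold are all hypothetical, chosen only to show how a small rise in timeouts can flip a caller into "stop using that service".

```python
from collections import deque


class TimeoutFuse:
    """Hypothetical 'software fuse': trips when too many recent calls time out."""

    def __init__(self, window=20, max_timeout_ratio=0.5):
        # Sliding window of recent outcomes: True = timed out, False = succeeded.
        self.results = deque(maxlen=window)
        self.max_timeout_ratio = max_timeout_ratio
        self.blown = False

    def record(self, timed_out):
        """Record one call outcome and blow the fuse if the ratio is exceeded."""
        self.results.append(timed_out)
        window_full = len(self.results) == self.results.maxlen
        ratio = sum(self.results) / len(self.results)
        if window_full and ratio > self.max_timeout_ratio:
            self.blown = True

    def allow_call(self):
        """Callers check this before using the dependency."""
        return not self.blown


fuse = TimeoutFuse(window=10, max_timeout_ratio=0.3)
# Slightly slower responses push 4 of the last 10 calls past the timeout.
for timed_out in [False] * 6 + [True] * 4:
    fuse.record(timed_out)
print(fuse.allow_call())  # prints False: the fuse has blown
```

A production breaker would normally also reset itself after a cool-down and probe the dependency again; that part is left out to keep the sketch short.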

Given that each of those services (and many more possibly-related ones) will have an on-call engineer paged into a conference call when the manure hit the rotating ventilation apparatus, there are going to be a lot of unhappy people cancelling their weekend plans right now. I definitely don't miss that aspect of the job!

coryfklein · on July 16, 2018

Thanks for the comment and I love all the insight but serious question about:

> there are going to be a lot of unhappy people cancelling their weekend plans right now.

The week just started... are you saying that you can already anticipate the war room for an event like this to extend through this coming weekend?

camtarn · on July 16, 2018

Haha, no, oops. My last few days have been a bit crazy and I'm on holiday, so I just thought it was Sunday ;)

sidcool · on July 17, 2018

Hehe. Reminds me of my Murphy's law moments. Whenever I am on a holiday, a deployment will fail, and I will have to log in and check while my family plays volleyball on the beach.

Cthulhu_ · on July 17, 2018

I'd tell your boss to hire someone to replace you. What if instead of on holiday, you were dead? Kinda hard to log in and fix their shit when you're dead.

ataturk · on July 17, 2018

I agree. I quit every job where they don't respect my personal time, especially vacation. Life is just too short. It's bad enough what I've been through with companies and their stingy PTO policies (US).

btschaegg · on July 17, 2018

That sounds like a very low bus factor to me.

sysadmin420 · on July 18, 2018

I was smoked up in Colorado for the old "Let's take down Dyn real quick" BS a couple years ago. I know this very well: things like to run for months without issue, and then I'm off for an afternoon with the phone ringing.

sigfubar · on July 16, 2018

It's a good holiday if you can no longer remember what day it is.

devmunchies · on July 16, 2018

But not good enough if you’re checking HN ;)

SmellyGeekBoy · on July 17, 2018

I'm on holiday in Japan at the moment and have no idea what day it is. Only checking HN as I'm currently in Osaka waiting for the Shinkansen to Hiroshima. ;)

slap_shot · on July 16, 2018

Do you seriously believe reading something like HN is work for its viewers??

kajecounterhack · on July 16, 2018

I think they meant that if you're on vacation you maybe shouldn't be looking at the computer.

flattone · on July 16, 2018

My Memorial weekend on a few-acre Olympic rain forest camping trip: regular signal hunting for HN. Oddly, I found a 6x6 area which, after a reboot, would provide LTE. After idling, I would have to reboot again to open some more links.

kaspm · on July 17, 2018

My guess was either that

OR there wasn't backpressure on a cascading failover so as services failed they increasingly failed to more and more overloaded systems

OR there WAS backpressure and it was the luck of the draw whether you were queued into an error page or got good data

OR the autoscaling couldn't keep up with the onsale window. This used to happen in ticketing a lot. Ticketmaster has a talk somewhere where they talk about warming the scaling load and server cache in anticipation of big ticketing onsales. The time it took to autoscale was just too long.
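To illustrate the backpressure guesses above, here is a minimal sketch using a bounded queue from the Python standard library. It only shows the general technique, not how Amazon's services actually behave; the `submit` helper, the queue size, and the "503" return value are assumptions made up for the example.

```python
import queue


def submit(request_queue, request):
    """Try to enqueue a request; shed it with an error if the queue is full."""
    try:
        request_queue.put_nowait(request)
        return "queued"
    except queue.Full:
        # Backpressure: refuse new work instead of piling it onto an overloaded system.
        return "503"


# Hypothetical service that can only hold three requests in flight.
q = queue.Queue(maxsize=3)
outcomes = [submit(q, f"req-{i}") for i in range(5)]
print(outcomes)  # ['queued', 'queued', 'queued', '503', '503']
```

With a bound in place, which requests get an error page depends purely on arrival order, which matches the "luck of the draw" behaviour described above. Without the bound, every request is accepted and the overload is passed along to whatever the work fails over to.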

zerotolerance · on July 17, 2018

The systems at Amazon are too large to leverage autoscaling. When I was there (in Marketplace) it was generally estimated that Amazon used about 80X the capacity of their entire AWS public cloud.

rorykoehler · on July 17, 2018

I find that implausible. I know they run a huge infrastructure but why would they be using 80X of their AWS public cloud? Thousands of companies if not millions at this stage use AWS and some of them are not insignificant in size.

abrookewood · on July 17, 2018

WTF? 80 times the capacity of their entire AWS infrastructure?? That seems insane.

caseysoftware · on July 17, 2018

No, the exact quote was:

> Amazon used about 80X the capacity of their entire AWS public cloud

which is probably closer to "the capacity that is available via AWS is a tiny, tiny fraction of their overall computing power.. therefore adding it back in when things are falling over doesn't actually solve any problems."

Dylan16807 · on July 17, 2018

I don't understand the difference between the wording in your post and the post you're replying to.

beejiu · on July 17, 2018

I think they are using the term 'capacity' to mean 'spare capacity'. I.e. that Amazon's entire compute usage is 80x the spare capacity, so scaling even a small amount would consume any spare capacity in AWS. Still, it seems hard to believe.

kaspm · on July 17, 2018

I interpreted it to mean:

1. he misspoke and meant "80%" of the AWS capacity, which I agree seems implausible.

2. Amazon does not run on AWS because Amazon is 80x more than all of AWS infrastructure. This also seems implausible because of Netflix. In fact, there's an article out there that said AWS exceeded Amazon's capacity within 1 quarter!

I still don't understand what that has to do with autoscaling exactly

martin-adams · on July 17, 2018

I didn’t understand either. I suspect it’s the semantic difference between ‘public cloud’ and ‘infrastructure’. I don’t know what that difference is really.

HankB99 · on July 17, 2018

When I checked it out, the site seemed responsive but search was broken. I searched "usb flash drive" and was presented with 2 results. I pondered this for a moment and then realized that there were 400 pages with 2 results/page (as far as I checked.) Perhaps there is some load shedding algorithm that reduces the result count per page to reduce load. It did discourage me from further searching so I guess it worked. ;)
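Purely as an illustration of the load-shedding guess above, a policy like that could look like the following sketch. Nothing here is known about Amazon's search service: the `results_per_page` function, the page sizes, and the 80% threshold are all hypothetical.

```python
def results_per_page(load, normal_size=16, min_size=2, shed_above=0.8):
    """Return how many results to render for a load in [0, 1] (hypothetical policy)."""
    if load <= shed_above:
        return normal_size
    # Scale linearly from normal_size down to min_size as load goes
    # from shed_above to 1.0; anything above 1.0 is clamped.
    overload = min((load - shed_above) / (1.0 - shed_above), 1.0)
    return max(min_size, round(normal_size - overload * (normal_size - min_size)))


print(results_per_page(0.5))  # normal load: full page of 16
print(results_per_page(1.0))  # saturated: shrinks to 2 results per page
```

The total number of matches stays the same under such a policy, so the page count balloons as the page size shrinks, which would be consistent with seeing hundreds of pages of 2 results each.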

throwaway427 · on July 16, 2018

smile.amazon.com was working fine during the outage, if that helps narrow it down...

azhenley · on July 16, 2018

It wasn't though. I saw an item being listed as a Prime Day deal on the normal Amazon site, then I searched for it through smile.amazon.com and it wasn't there. Went back to non-smile and it was there (but throwing errors so I couldn't click it anyway...)

Alex3917 · on July 16, 2018

I'm seeing this also. Same item is $169 on smile.amazon.com, versus $79 on www. Neither is working properly though.

gnat · on July 16, 2018

Not for me: a page that gave a fast 503 on amazon.com would spin indefinitely on smile.amazon.com

adrianmonk · on July 16, 2018

Weekend plans? Prime Day is going to be over in the early part of the week.

Does Amazon expect a fix to be deployed ASAP after the immediate crisis is averted?

aviv · on July 16, 2018

My bet is they maxed out on the SQS FIFO 300 messages per second limit.

camtarn · on July 16, 2018

I have no idea really (I stopped working on the retail websites in 2013) but my gut feel is that that's about a couple of orders of magnitude too low at least.

But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size on an in-memory table and commits it thinking "Well, I added a few orders of magnitude safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...

snewman · on July 16, 2018

Yup, many an outage has something like this at its core. At my company, to address that, we've built an internal library for enforcing rate limits and size limits that is (a) configurable on the fly and (b) generates logs, so that we can trigger an alert whenever any limit reaches 70% of capacity. And thus, hopefully, head these things off before they reach the tipping point.

https://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/
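A rough sketch of what such a limit wrapper might look like in Python follows. This is not the library described above; the `TrackedLimit` class and its fields are invented for illustration, and the "alert" is just a log line that a real alerting system would have to pick up.

```python
import logging

logger = logging.getLogger("limits")


class TrackedLimit:
    """Hypothetical limit that logs once usage crosses a percentage of capacity."""

    def __init__(self, name, capacity, alert_percent=70):
        self.name = name
        self.capacity = capacity            # may be changed at runtime
        self.alert_percent = alert_percent  # integer percent, avoids float rounding

    def check(self, current):
        """Return 'ok', 'alert', or 'exceeded' for the current usage, logging as it goes."""
        if current > self.capacity:
            logger.error("%s exceeded: %d/%d", self.name, current, self.capacity)
            return "exceeded"
        if current * 100 >= self.alert_percent * self.capacity:
            logger.warning("%s at %d/%d", self.name, current, self.capacity)
            return "alert"
        return "ok"


table_rows = TrackedLimit("in_memory_table_rows", capacity=1000)
print(table_rows.check(500))   # ok: below the 70% mark
print(table_rows.check(700))   # alert: at the 70% mark
table_rows.capacity = 2000     # raise the limit on the fly
print(table_rows.check(700))   # ok: under 70% of the new capacity
```

The point of the design is that the warning fires well before the hard limit does, so someone can raise the limit or fix the growth while the service is still healthy.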

codebje · on July 17, 2018

Don't forget to alert when the double derivative of capacity growth is above a (very low!) threshold, that can catch an explosive problem far earlier than the 70% mark, by which time it might be growing at 5% per second.

cheeze · on July 17, 2018

It might also be a momentary blip that you really shouldn't need to wake a human up for. The hard part of monitoring imo is that perfect balance. Asymptotically approaching nirvana but never quite reaching.

codebje · on July 17, 2018

All monitoring has that problem :-) But if your rate of rate of change has been positive for some suitable amount of time for the context, it's worth waking a human up over it, because the amount of resource remaining is dwindling exponentially.
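As a minimal sketch of that "rate of rate of change" check, assuming evenly spaced usage samples: the function names, the zero threshold, and the three-sample "sustained" window below are arbitrary choices for illustration, not a recommendation from the thread.

```python
def second_differences(samples):
    """Discrete second derivative of evenly spaced usage samples."""
    first = [b - a for a, b in zip(samples, samples[1:])]
    return [b - a for a, b in zip(first, first[1:])]


def growth_is_accelerating(samples, threshold=0, sustained=3):
    """True if the last `sustained` second differences all exceed `threshold`."""
    accel = second_differences(samples)
    if len(accel) < sustained:
        return False
    return all(a > threshold for a in accel[-sustained:])


steady = [10, 12, 14, 16, 18, 20]     # linear growth, no acceleration
explosive = [10, 11, 13, 17, 25, 41]  # growth rate doubles each sample
print(growth_is_accelerating(steady))     # False
print(growth_is_accelerating(explosive))  # True
```

Note that the explosive series is still at a small absolute value when the check fires; against a capacity of, say, 1000 it would be nowhere near a 70% alarm yet. Requiring several consecutive positive readings is one simple way to ignore a momentary blip.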

bostonvaulter2 · on July 17, 2018

That sounds like a very interesting metric to track and report on. Do you have any further references that discuss that approach?

codebje · on July 17, 2018

Not offhand, sorry - I first encountered the notion at a conference some years back, but I've long since gotten out of operations so it's all a bit dusty for me now.

adreamingsoul · on July 16, 2018

That's cool, but it's a bit more complicated than rate limits.

zwkrt · on July 17, 2018

I once owned an EAV DB for storing small config values. Someone wrote a great wrapper around it that made it seem like a proper DB with many tables. Since it was for storing configs this library cached the whole thing in memory on startup. Zoom to 5 years later, we have 10+ settings for every customer in this store, and one day all our hosts keep failing over. As it turns out that small table for configs was north of 5 gigs and destroying our heap.

aviv · on July 16, 2018

This is their official rate limit for the FIFO type SQS. Although I bet the Amazon folks know a couple people around the office who can up that limit :)

adreamingsoul · on July 16, 2018

oncall is the worst

cheeze · on July 17, 2018

Not oncall for Amazon but I don't particularly mind it. It sucks to have to be near a computer for the time, but I actually find digging in and finding that needle in a haystack pretty fun. On a good team, the life interruption is minimal.

sateesh · on July 17, 2018

As much as I like digging, debugging and the learning that comes with it, fixing things under tremendous time pressure wears me out :(

iotscale · on July 17, 2018

When I was on-call for a big project, I jumped every time my phone rang. It wasn't a good experience at all.

"We are at the fair, I have 18 customers lined up but I can't take orders. Can you fix this now or should I start taking them on paper?"

You know the joke that atheists don't believe in god until the plane nosedives? That was the moment for me :)

"Oh, I... I need to see the error message before I can say anything. We don't have any errors logged. Hmm... Would you allow me to remote into your device?"

"Can't do that. We lost the internet connection here, sorry."

Ugh.

bpchaps · on July 17, 2018

How long have you been on call for and how many 3am wakeup calls have you had? I used to think the same. Protip: you won't always be on a good team.

unityByFreedom · on July 17, 2018

I'm in Asia and always thought I'd make a great on call person for those of you who haven't jumped ship. I should apply somewhere.

bpchaps · on July 17, 2018

Hah. I'm genuinely surprised this isn't more of a thing.

zwischenzug · on July 17, 2018

We reduced the stress of on-call by introducing processes that supported creativity and reduced panic:

https://zwischenzugs.com/2017/04/04/things-i-learned-managin...

zerotolerance · on July 17, 2018

I prefer to be oncall as long as I'm empowered to prioritize software improvements that reduce ops burden and improve customer experience.

rezashirazian · on July 16, 2018

Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?

I can personally point to two friends who I consider top notch engineers and designers that have left Amazon because of its toxic culture. I'm sure I'm not the only one with these anecdotal examples; we've all heard the stories. At the end of the day, years of unbalanced work/life balance, overly aggressive management and a frugal approach to everything make for a weak argument for A players to stick around.

Could this be an example of crumbling engineering standard at Amazon?

throwaway289355 · on July 16, 2018

AWS employee and bar raiser here.

>Something no one has mentioned yet, could it be that the engineering force at Amazon is no longer what it used to be?

In many regards, yes. The bar had to be lowered to meet the demands of growth. We've also taken in a lot of hires from companies that have brought their culture and friends with them. The culture at Amazon is not what it was even 2 years ago. It is in many places day 2.

No one also seems to notice that Amazon retail often suffers widespread issues like this. We can count on SEV1's happening during peak as things blow up badly. This has happened several years in a row, and sadly the themes are pretty much the same across all: forgot to scale (yes...really) or some stupid system bottleneck. It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.

ergothus · on July 16, 2018

> It is in many places day 2.

For those unfamiliar with internal Amazon lingo, a big deal is made of it always being "day 1"

https://www.fool.com/investing/2017/04/13/jeff-bezos-says-it...

dabockster · on July 16, 2018

> The bar had to be lowered to meet the demands of growth.

Not only that, but I've noticed that Amazon has started using way more contractors (not actual firms, but more Mechanical Turk/UpWork like contractors, if not just straight up misclassification) and agencies within the past year or so. I can't say that I'm surprised that Amazon is having technical issues now that they've grown in numbers but not in technical chops.

On a side note, it is a great time to become a recruiter in Seattle with all these agencies popping up. /s

smt88 · on July 17, 2018

This makes me terrified that my users' security relies on AWS...

abrookewood · on July 17, 2018

Honestly don't think that is an issue. AWS is THE most certified company I have ever worked with. Seriously, go look at their compliance page - it is ridiculous. https://aws.amazon.com/compliance/programs/

usr1106 · on July 17, 2018

Certification is a theater for managers and so called decision makers. It might have a positive impact on code quality as long as people are motivated. But if insiders are telling me that the workplace culture is deteriorating that is a serious problem. Certification will not prevent bad code hitting production.

salesynerd · on July 17, 2018

Reminds me of several CMM Level 5 certified IT services firms consistently grinding projects into dust!

abrookewood · on July 18, 2018

I was mainly referring to the security of their systems, but I take your point.

yayadarsh · on July 18, 2018

Not certain on this, but I think Azure is known to have more compliance certifications. This is probably largely due to the long relationships Microsoft has with government organizations etc.

duxup · on July 17, 2018

I got the impression from some former Amazon employees that they were burnt out and put their time in and got Amazon on their resume and were just done with life at Amazon so they moved on.

Do you sense any attrition like that?

weka · on July 17, 2018

> It doesn't help that Amazon retail has a good amount of its workforce based in India and seemingly disconnected from the Seattle based leadership.

Could you expand on this a bit, please?

codeonfire · on July 17, 2018

An Amazon manager, full of shit, once tried to tell a joke at a social gathering. He claimed that he and friends removed the engine from someone's car over lunch as a prank. This is the education level that some of them have. He literally didn't think people would know he was full of shit.

toomanybeersies · on July 17, 2018

It would be technically possible to remove a car engine during lunch. However, that's not so much a prank as malicious damage. You'd also have to be a very skilled mechanic, familiar with that specific car, to do it that fast.

joncrane · on July 17, 2018

Not to mention you'd drop a lot of polluting fluids on the ground if you did it in a parking lot.

exikyut · on July 17, 2018

Honest question: education level or mental developmental level?

Mental health is enough of an epidemic that $company_with_exponential_scale is most definitely going to pick up a fair chunk of people with various issues, cognitive included.

Chances are this is even part of why the bar is a bit lower.

(Said as someone with mild high-functioning autism)

codeonfire · on July 17, 2018

My point is Amazon has hired some managers that have really lowered the bar due to the environment they are from and people they are used to dealing with. Imagine someone from a place where you could tell a preposterous lie and 50% of people would believe you.

hueving · on July 17, 2018

Gullibility has little to do with 'education level'.

phendrenad2 · on July 17, 2018

You misread it - the manager wasn’t gullible, he was the one telling the fib.

rrjanbiah · on July 17, 2018

FWIW, if you can look into https://github.com/rrjanbiah/amazon-india-issue-tracker

ainiriand · on July 17, 2018

I can see what you mean. Take for example the email I received to promote Prime. A mix of Spanish and English with some very bad translation problems in Spanish. It was a bit embarrassing to read it.

yeukhon · on July 17, 2018

No load testing ever?

hossbeast · on July 17, 2018

Good luck generating Prime Day load artificially.

greenleafjacob · on July 17, 2018

At Uber we invested in synthetic load tools that create fake riders and drivers, match them, etc and test our entire dispatch system end to end to arbitrary amounts of online driver and riders. I don’t see why they couldn’t do the same with carts, adding products to carts, etc

wiredone · on July 17, 2018

They do. But I'm guessing you don't scale up 1M+ servers for your Uber canary traffic tests - these are some of the scales Amazon undergoes in these events. The scale is unlike almost any other web property around.

BinaryIdiot · on July 17, 2018

> But I'm guessing you don't scale up 1M+ servers

Do they really need 1 million servers? Many of my friends who work at other tech companies need so few servers in comparison, even with significantly high traffic, that it just screams massive inefficiencies...which seems wrong.

But I've never worked at Amazon so I wouldn't know.

convivialdingo · on July 18, 2018

At one of my previous companies we managed about 100 servers for peaks of about 30 million users.

But we didn’t handle free-form text search like Amazon. I can imagine that would necessitate a huge scaling up of compute and data.

deegles · on July 17, 2018

Judging by all the error pages after launch, they needed more!

greenleafjacob · on July 17, 2018

I’m not sure how it’s relevant. If you have the infrastructure to send 1000 concurrent users you can probably send 1M concurrent users. We only test small integer multiples of our peak traffic, and if your absolute number of servers to service that is in the millions then it would make absolute sense to be routinely running that capacity test. If that means “scale up 1M+ servers” then that is what you have to do, otherwise how can you be sure?

HankB99 · on July 17, 2018

Could you do something like that from another cloud service? I suppose the difference would be whether 10K requests from one IP address would be the same as 10K requests coming from 10K IP addresses. The further you move from the actual production load, the higher the risk that the test doesn't test everything. For example, if the network could only handle connections from 5K hosts then the former could pass while the latter failed miserably.

pageald · on July 17, 2018

Simulating millions of users should be well within the capabilities of a company as large as Amazon. Off the shelf load testing tools like Locust can create thousands of fake users with one worker.

sitkack · on July 17, 2018

Please, generating load is always easier than servicing it.

yeukhon · on July 17, 2018

Probably can find a hotspot and load test on the hotspot.

jjirsa · on July 17, 2018

(◔_◔)

jtchang · on July 17, 2018

It's actually pretty trivial. Getting the right parameters for the load will be hard and making sure you are loading it properly. Think about DDOS attacks. Generating the load is rarely the issue.

nikofeyn · on July 16, 2018

what is a bar raiser?

and when has amazon ever been an engineering force? i have always felt the website and service experience is a relic of the 2000s. more often than not, i get the answer “our system can’t do that” from customer service.

throwaway289355 · on July 16, 2018

>https://www.socialtalent.com/blog/recruitment/raising-the-ba...

I think Amazon has taken on an outsized image to many people that just isn't true. We have good engineers in many organizations, but we don't pay enough, have the right strategy, or take care of individuals well enough to lure the kind of great folks you find at other big tech companies. In many ways, Amazon is a retailer that does technology because it found a way to make money from it. The DNA is still MBAs/finance and retail.

vadym909 · on July 17, 2018

A bar raiser is typically someone in the interview loop for a candidate that is not from the same team and therefore more objective about whether the candidate is truly 'better' than most people on the team (raises the bar). Although good in theory, it's not actually practical to find every single new person better than the current employees, but it helps keep things in perspective and is a deterrent to bringing in weak buddies.

It's true Amazon has some great engineers but is not very engineering-centric. I remember a senior engineer in Retail once comparing it to a plumbing system kept together with bandages.

deskamess · on July 16, 2018

How much do you think the on-call contributes to engineers leaving? You would think support tools and support personnel could help to retain engineers.

jhall1468 · on July 17, 2018

No, bad design and refusal to manage technical debt is the issue. Oncall only matters in some orgs and even then only matters where the tech debt is totally out of control.

Bottom line is Amazon is a product culture not an engineering culture and that makes it really easy to leave for Google or unicorns that really appreciate tech debt tradeoffs.

chronid · on July 17, 2018

Who cares about retaining those support/ops engineers you shit on because you don't want to own the crap you write, right? :)

kk_cz · on July 17, 2018

From the site:

In simple terms, bar raisers are current Amazon employees that come in during the interview process to analyse candidates. They do this alongside their own full-time job, assessing as many as 10 candidates a week and spending 2-3 hours on each one.

In other words 20-30 hours per week on top of the full-time job? That doesn't sound quite right.

patch_cable · on July 17, 2018

Another bar raiser here. That site is wrong. The expectation is 2 candidates a week. And it would be more proper to say that it is part of your full-time job, not on top of it.

10 candidates in a week may happen at some kind of event, but then it isn’t 2-3 hours per candidate, and in that case you’d effectively be taking a couple of days off from your normal job.

xigency · on July 17, 2018

This bar raiser thing sounds like normal hiring practice in my work experience.

nojvek · on July 17, 2018

The thing that baffles me is the industry perception is that Amazon has subpar engineering but Amazon right now is 2nd after Apple for Marketcap, so they must be doing something right.

I love the Amazon customer service. They’ve managed to crack a difficult problem and execute enough that other Giants haven’t come close to it yet.

GCP and Azure tail AWS by quite a bit. Amazon online retail is a Google search engine level monopoly now.

So Amazon can do a lot of things wrong, but I’d have to say they get the important parts right.

Godel_unicode · on July 17, 2018

> Amazon online retail is a Google search engine level monopoly now.

Not even particularly close. Amazon doesn't even have a majority of online sales, although it's getting close. They seem like a much bigger force than they are because of growth.

https://techcrunch.com/2018/07/13/amazons-share-of-the-us-e-...

pas · on July 17, 2018

They are big, because they invested every penny they made into getting big. And those pennies are converted through human misery in "fulfillment centers". (And AWS on-call is not that fancy either for engineers, though the two are almost incomparable.)

They got some parts right, some parts awfully wrong, and some are just irrelevant now. They make money, they are cheap + convenient, and that's usually what people focus on. They are not sophisticated, they are not great designers, etc.

BryantD · on July 17, 2018

This doesn't completely answer the question, but I would distinguish UI and UX from the ability to build systems that run successfully at Amazon scale.

cma · on July 17, 2018

Corporate meme thing like six sigma. Each new employee is supposed to raise the bar, or the whole organization gets worse on average. Jeff Bezos as #1 is supposedly the least qualified employee in the company as each new hire raised the bar.

booleandilemma · on July 17, 2018

> what is a bar raiser?

A PM who has never opened an IDE in his or her life and who is only familiar with “coding” concepts through Wikipedia. They read books by Malcolm Gladwell, Daniel Kahneman, and Nassim Taleb and majored in one of the humanities. When they’re shown the webpage that the geeks created which loads in 2 seconds, they tell the lead developer that they want the loading time cut down to 1 second and the font to be changed.

overkalix · on July 17, 2018

Hey hey hey. What's wrong with Kahneman?

convivialdingo · on July 16, 2018

Absolutely. The fact that this is ongoing hours later is pretty much a clear sign that there’s problems.

I understand that it’s not always easily fixable. But honestly this looks pretty bad. Could be anything from bad code to a flakey cable.

I got out of high volume websites - too much marketing, unpaid overtime, and horrible work-life balance.

tills13 · on July 16, 2018

Considering they cleared $2.6 billion on last year's Prime Day, and membership is up YOY, I'd say this comment is 100% knee-jerk/over analysing. Stuff breaks, things go wrong, planning for the worst sometimes isn't enough.

chc · on July 16, 2018

I don't see how "they sold a lot a year ago" and "membership is up YOY" is meant to dispute the idea that engineering talent is leaving. I agree we don't have enough evidence to say for sure, but those figures don't seem relevant.

rezashirazian · on July 16, 2018

My comment has nothing to do with Amazon's current financial success or growth. It's regarding their engineering team and their ability to recruit and retain top talent.

It is not a binary state of absolute destitution and top notch brilliance; it's a trend that can move one way or another and will show itself in more frequent outages, poorly rolled out products, lazy design, etc.

The plumbing can keep on working for a few years even after it begins to erode. Historically there are many examples of this.

bhouston · on July 16, 2018

Maybe, but your evidence is anecdotal. Good engineers leave huge companies all the time but it is usually balanced by inflow. I guess we have no evidence of inflow as compared to outflow of good engineers.

rezashirazian · on July 16, 2018

I didn't reach a conclusion, I presented a possible reason knowing very well that I don't have the evidence to back it. Hence why I said "could it be that the engineering force at Amazon is no longer what it used to be?" and not "The engineering force at Amazon is no longer what it used to be."

HillaryBriss · on July 17, 2018

have we reached peak Amazon?

fjsolwmv · on July 17, 2018

No. Amazon has had outages on holiday shopping days throughout its history.

Amazon has always been toxic and frugal, so obviously it didn't interfere with whatever mythical software quality it had in the past.

And today's failures might be because past engineers built unstable unmaintainable systems and then ran away.

And of course amazon is 10, 100, 1000 times larger tech system than it used to be.

deegles · on July 16, 2018

My gut feel is that they did everything right to prepare based on forecasts from last year but still got swamped by demand.

ams6110 · on July 17, 2018

I have Amazon Prime, and I had no idea that today was Prime day. Haven't seen any marketing for it, and nothing in email until this morning.

ricardonunez · on July 17, 2018

It has been all over the place, unless you don't visit Amazon or deals websites on a regular basis. I don't even watch TV often and saw an ad on it.

nbar1 · on July 17, 2018

The only marketing I've seen for it was the tape on my packages the last few weeks.

ergothus · on July 16, 2018

I saw the page reload multiple times when doing a search before finally dying...I can only imagine that means additional load generated by the initial triggering failure.

Which in turn tells me they didn't test the failure case. Now, Amazon is a huge and complicated beast so I don't want to imply this was a "dumb" mistake, but (assuming I'm correct) it is a failure that risked making MORE failure, so it's not demand alone to blame.

codeonfire · on July 17, 2018

[flagged]

sctb · on July 17, 2018

Please don't rant like this on Hacker News.

https://news.ycombinator.com/newsguidelines.html

codeonfire · on July 17, 2018

In addition to the shit that is not politically correct to say, Amazon relies heavily on interns for production code and ops. Bezos has this view that the unskilled can get him 80% of what he wants and then a few top people can smooth out the rest. In reality the few top people never even come in contact with the kid that is trying to keep amazon.com running overnight. Successive waves of interns and SDE 1's fuck up the same shit over and over. Bezos is stupidly applying this same strategy at his space company.

Analog24 · on July 17, 2018

This is complete nonsense, at least in my org. Interns are given projects that are as far from production code as possible. I'm sure there are managers out there who have done this but it does not extrapolate to the entire company and it's ridiculous to think that it's a company wide practice to make interns responsible for anything that would affect customers.

wiredone · on July 17, 2018

This is complete BS. Interns are basically on a really well paid holiday and given random features that may/may not ever make it to production.

fjsolwmv · on July 17, 2018

I always wonder how stupid people can accumulate $100Billion being stupid.

sadamznintern · on July 17, 2018

>Could this be an example of crumbling engineering standard at Amazon?

As someone that's been going through quite a bit of depression because Amazon was the best offer I got (as opposed to Google or Facebook because I'm still bad at coding interviews) I'm afraid this is the straw that's going to break the camel's back for me. This is exactly what I was afraid of and exactly what I have waiting for me when I join next week.

There's no point anymore.

in_cahoots · on July 17, 2018

100% serious here: you should probably see a therapist. Life is so much more than your job, and your job is so much more than the brand name of your company. Having worked at a failing business and one of the companies you mentioned, I’m no happier at the latter than I was at the former. If you define yourself by the names on your resumé one day you’ll wake up and realize how much of life you’ve missed out on.

edoceo · on July 17, 2018

Hanging out at a semi-related Meetup, which offers a safe cocoon to vent, may also help. It worked for me, YMMV

kbyatnal · on July 17, 2018

Hey, I've seen a few of your posts around here and just wanted to encourage you to keep your head up. Amazon is an amazing place and even though I don't know anything about your specific situation, there are a lot of incredibly smart people there and I promise that you'll learn a ton.

atomical · on July 18, 2018

It's literally comment after comment of playing the victim.

sadamznintern · on July 28, 2018

I'm not "playing the victim", I'm bitching about how terrible my own life is. Which it is.

Why do I see literally everyone else have their dreams come true while mine don't?

throawayamzn · on July 17, 2018

I was in the exact same boat - ex Amazon intern, felt I was a pretty solid engineer, but didn't quite nail the FB interview. Ended up joining Amazon because I was sick of interviewing.

Let me tell you: it will be OK, and the other posts about making the best of your situation are true. Amazon is an incredible learning opportunity if you stay open to it. I'm still working there a year later and I'm building far higher impact projects than my friends at Goog and FB because like another poster said, Amazon lets SDE1s work on just about everything. The growth potential is tremendous if you show you're competent and willing to learn.

Chin up, you're in a good position :)

manfredo · on July 17, 2018

People say the same thing about every big tech company - Google and Facebook included. Trust me, employers will still be impressed with Amazon on your resume.

sadamznintern · on July 17, 2018

I've personally never seen it.

Nowadays they have to be impressed by the project too...and mine's front-end development on an internal service nobody uses.

throawayamzn · on July 17, 2018

It's true. Go check their Blind's, or talk to some employees that have been there a few years.

autokad · on July 17, 2018

new person to amazon.

they have a lot of really amazing and smart people. but like anything in life, it's what you make of it. I'd say put your best into the position and try to learn as much as you can from others. no good reason to do less

pythonistic · on July 17, 2018

It's not just what you make of it: a lot of your experience there is going to depend on your manager. I had three: the first was awesome, so he left for another company; the second was good, but he cared too much and went back to his old (non-management) position to feel fulfilled; the third was a sociopath and did really well, promoted to upper management and a subsidiary and out to greener pastures. Working for the sociopath was awful and got me to leave (where I'd planned on a long career). And no, there was no escape from him except leaving Amazon: he sunk my performance reviews when I asked to leave to go to another team 'cause I told him I didn't think I was a good fit where I was.

There are a lot of amazing and smart people, like you said, but there is also a lot of stress, heartache, and trouble if you don't keep your ear to the ground and build a strong network of people to give you an early warning. Don't keep your head down and concentrate on tech and building cool stuff: Amazon can be way too political for pure techies to thrive without strong protection from management.

throwaway080383 · on July 17, 2018

Chin up. You're probably in the top 5% of the country in income. If you're in Washington, feel free to tack on another 9% or so to your income in state taxes you don't have to pay.

01100011 · on July 17, 2018

While it's nice to not have state income taxes to worry about, total tax burden isn't all that much different from any other state. The only time you really come out ahead is if you can live in WA and do all your shopping in OR.

fjsolwmv · on July 17, 2018

For high-income people like Amazon engineers, WA is a crazy low tax burden. WA sales tax could only match CA income tax if you spend 100% or probably more of your income on taxable items.

symlinkk · on July 17, 2018

I would fucking love to work at Amazon. You’re doing better than almost everyone I know.

barnstorm · on July 17, 2018

You can't let an outage or some alleged talent gap be a reason for wanting to tap out.

You need to realize that you're doing great. Folks would kill to get a job at Amazon; the scale and the challenges are mind-bending compared to what most other companies deal with, and the interview is grueling and you made it.

Technology is a word that describes something that doesn't work yet and Amazon thinks that you are a person that can help tip the balance.

rpeden · on July 16, 2018

If only they had access to some kind of scalable cloud hosting service, they could've completely avoided this sort of outage. :)

Jokes aside, I admire the work of the team(s) responsible for Amazon's web site. I use it so often and encounter glitches so rarely that it really stands out when something does go wrong.

coryfklein · on July 16, 2018

Serious question: I've heard that Bezos's approach with building out commercial units is to break down each part of the vertical into separate commercially-viable components. Idea being if AWS doesn't make sense for 3rd parties to use then it may not be economical to use it internally.

Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?

nevir · on July 16, 2018

As someone that worked at Amazon a long time ago—back when AWS was just getting started—can confirm (historically). Publicly there were a myriad of AWS services; internally all we could use was S3 (for many years), if we were lucky. AWS being born of Amazon's "spare capacity" is an urban legend.

Nowadays, I hear it's quite different, and much of AWS is more rapidly dogfooded.

chronid · on July 16, 2018

Some services (before I left) just did not have the sheer capacity to have Amazon.com as a customer at peak. The service teams just said "nope, sorry, you're going to kill us".

The requirements for prime day/black friday/cyber monday were mind-boggling.

xkjkls · on July 16, 2018

That was quickly adjusted. I think like 3 years ago, Amazon publicly was saying that every day to AWS they were adding the same amount of computing power that was used to run Amazon.com when it was a $10 billion business. AWS is massively greater in scale than Amazon.com at this point.

ec109685 · on July 17, 2018

Scaling services for many independent businesses is somewhat of a different challenge than “vertically” scaling for one large one like amazon.com.

chronid · on July 17, 2018

I cannot confirm or deny that (since my memories are sorta fuzzy), but the real problem was not EC2 capacity. Many other AWS services were involved. :)

greenleafjacob · on July 16, 2018

This is not right from an infra POV. You are just moving the problem (by team) but it doesn’t go away. Then you end up with duplication.

hobofan · on July 17, 2018

At the same time you are also giving the other team more flexibility to solve the problem, as they can optimize it for the Amazon.com use case and are not constrained to a generic solution that is optimized for the masses.

cabaalis · on July 16, 2018

I'm a big fan of dogfooding. But given the huge computing needs of Amazon's ecommerce side, maybe it would make sense to not burden the compute availability of your fledgling cloud offering with your ginormous system?

klodolph · on July 16, 2018

If built well your compute offering can scale with the resources you give it. If you have Amazon’s commerce platform running on a pool of computers, the team running commerce can give those computers to EC2 in exchange for an equivalent amount of EC2 quota. In theory it’s a bookkeeping operation.

There are plenty of good reasons why you wouldn’t run Amazon’s commerce on EC2, but I don’t think cloud availability is one of them.

ryanianian · on July 16, 2018

FWIW 100% of the Amazon retail website is run on EC2 (and iirc it's 100% on spot instances too). (There are of course lots of internal services that still use non-ec2 capacity but fewer and fewer every day.)

15charlimitdumb · on July 16, 2018

Note, though, it is not 100% AWS; e.g., they use Akamai instead of CloudFront.

discodave · on July 16, 2018

I don't think the spot thing is accurate. Amazon retail EC2 capacity is more like reserved instances.

IDontWorkAtAmzn · on July 17, 2018

That's just marketing BS.

One of Amazon Retail's big internal goals this year is "finally get everyone off of Oracle".

Bezos opposed the creation of AWS. Almost everything that was early AWS (EC2, S3) was done over the objection of Seattle leadership. Look where the teams were based.

After they shipped AWS, it took eighty zillion years for Retail to use anything AWS offered beyond S3, and, as mentioned above, it's not like they're 100% on DDB or RDS now: they have dependencies on freakin' _Oracle_ all over the place.

I mean, this is not a criticism of Bezos or Amazon at all, but at the end of the day, even if Bezos is a supergenius (and I see no reason to doubt that he is. Although I often wonder how he keeps himself motivated to continue to work so hard on building a Walmart competitor with the precious few years he has left on Earth given that he could do literally anything with his time), Amazon is still a company made of tens of thousands of people. It's not a 4 dimensional chess-ballet. It's got probably 95% of the same chaos and disorder that every big organization made of humans has. It just turns out that cutting it 5% in the right markets has incredible returns.

fjsolwmv · on July 17, 2018

AWS got built over the objection of the Chief Executive and majority owner of the company?

camtarn · on July 16, 2018

Worked for Amazon a few years back.

I can't see Amazon ever using external cloud hosting for anything except the most trivial of tasks. They're absolutely, utterly paranoid about any sort of confidential information, and I think even with encryption the perceived risk would be too high.

rb808 · on July 16, 2018

lol, meanwhile everyone else is trusting aws.

amorousf00p · on July 16, 2018

That's sad truth. But it's convenience like anything else.

In my business I can put a hardware site online with

---6X--- 2 x intel gold 5115 10 core + 64 GB RAM 1 nvme @512G + soft raid1 @4TB magnetic 1 10G, 2 1G ether ~= 42K.

---2X--- storage or NAS with 60TB @RAID5 + 2x quad core low end xeon + 32 GB RAM = 16k

---2X--- 1G edge/core mngd switches + 10G SAN/LAN mngd switches = 5K

---2X--- Endian firewalls + threat appliances = 5K

---1X--- Colo with 2 year lease and 25 amps @208v 1g port speed and committed throughput > 100/mbps = 16K yearly

68K one time cost for depreciating assets we maintain, provision and secure + 16K yearly recurring cost.

Or I can go AWS and modify my processing model, security expectations and service infra and spend 25K a year + 15K 1x migration cost.

jjeaff · on July 16, 2018

So... you save $9k a year in recurring and it will be more than 5 years before you break even due to your $68k up front equipment costs.

And that's assuming you don't have any needs to quickly scale up or down and you are limited to 1 colo instead of the ability to expand to multiple regions like with AWS.

And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.

Doesn't sound like rolling your own stuff in a colo is a very good idea in this case. But that's job security if you are the sys admin I guess.
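The break-even arithmetic in this exchange can be sketched as follows, using only the ballpark figures quoted upthread ($68K up front plus $16K/year for the colo, $25K/year plus a $15K one-time migration for AWS). The function name is hypothetical, and the sketch ignores depreciation, staffing, and hardware refresh.

```python
def break_even_years(colo_upfront, colo_yearly, cloud_upfront, cloud_yearly):
    """Years until the colo's lower recurring cost repays its extra upfront cost.

    Returns None if the colo is not cheaper per year, since it then never
    catches up. A result <= 0 would mean the colo is cheaper from day one.
    """
    yearly_saving = cloud_yearly - colo_yearly
    if yearly_saving <= 0:
        return None
    extra_upfront = colo_upfront - cloud_upfront
    return extra_upfront / yearly_saving


if __name__ == "__main__":
    # Figures quoted in the thread (USD).
    with_migration = break_even_years(68_000, 16_000, 15_000, 25_000)
    without_migration = break_even_years(68_000, 16_000, 0, 25_000)
    print(f"crediting the AWS migration cost: {with_migration:.1f} years")
    print(f"ignoring the AWS migration cost:  {without_migration:.1f} years")
```

Whether or not the one-time migration cost is counted against AWS, the result stays above five years, which is consistent with the estimate in the comment.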

mmt · on July 16, 2018

> And that's not even taking into account the cost of the brain power to make sure your hardware stays up and running.

Although, as I said upthread, I agree that AWS is very likely ideal for this particular deployment size, let me try to dispel this oft-repeated myth.

Modern server hardware takes almost no "brain power" (or effort of any kind) to keep up and running.

We aren't living in the days of the early dot-com boom where Linux-on-Intel in the datacenter could mean flimsy cases, barely rack-mountable, with nary a redundant part to be seen.

Applying some up front "brain power", one can even choose and configure hardware in such a way as to provide things like server-level redundancy, if that's important and/or preferable to intra-server redundancy (think Hadoop), or the ability to abandon mechanical disks in place instead of ever having to replace one.

mmt · on July 16, 2018

This is the main "sweet spot" for AWS (or "cloud" infrastructure in general): small scale.

I am generally a strong proponent of using one's own hardware in a colo or on-premises, instead of or in addition to the cloud (primarily for "base" workload).

However, if the entirety of your needs can fit into a single rack, even I will advocate for AWS, since "convenience" is, perhaps, not strong enough a word.

I do think your server and storage prices are around $25k too high, but that's easy to do buying brand name and/or not negotiating with multiple vendors on price (which is particularly tough at low volume unless you're a startup with a credible growth story). That's assuming such an expensive CPU (in comparison to so little RAM) isn't foolishly profligate, along with the other hardware choices. Of course, this underscores the point (on which we agree) that, as a rule, it's just not worth that much time and effort for so little.

I'll take your word on the AWS pricing, as it's fairly predictable, if very tedious to perform the prediction. The main "gotchas" I've found people run into are forgetting to add in EBS costs for EC2 instance types without (or without comparable) local storage and underestimating data transfer costs.

amorousf00p · on July 17, 2018

You'll have to trust me that this example's hardware spec and requirements are for a basic/base site. You can thin the profile and increase the # of chassis, compromise on redundancy, etc...but experience has shown that this arrangement is most cost effective. Kinetic event impact modeling system -w- RT data delivery -- that should answer your conjectures.

No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.

mmt · on July 17, 2018

> You can thin the profile and increase the # of chassis, compromise on redundancy, etc

No, I wouldn't suggest more chassis, as that's almost always more expensive (it's tough to break even on that $1k minimum buy-in on a server).

I believe your workload needs the resources you say. It just happens to be a remarkably rare ratio, hence my remark.

> No large vendors used in this example - thinkmate or aberdeen supermicro re-brands for due diligence and warranty.

The vendor doesn't have to be large to jack up the price. Any re-brand is super suspicious. To me, a large part of the point of a commodity server product is that the reliability is predictable (and therefore easy enough to engineer for/around). Paying extra for "diligence", warranty, or hardware support is just flushing money down the toilet.

A fee for custom assembly and/or a basic smoke test is fine, but it had better be a flat rate per server and on the order of $100. Technician labor isn't that expensive.

Larger or "enterprise" vendors are merely the extreme version of this, with upwards of a 10x premium on something like storage arrays, especially if one includes

amorousf00p · on July 17, 2018

You seem to be an absolute type of planner. I used to approach IT mgmt and provisioning that way some years ago before being confronted with the realities of small and large business. One size obviously does not fit all and sometimes you take shortcuts..usually you pay for them later.

I agree with your cautions around supermicro resale but the warranty support and build diligence are absolutely necessary for a small business. Having a good business relationship with a trusted provider of hardware that always performs the first time is priceless.

mmt · on July 17, 2018

I don't know what an "absolute type of planner" is, but I consider myself an engineer and a pragmatist. I'm well versed with realities. In reality, with business, there's no such thing as "priceless", only risk, and risk is, generally, quantifiable. With enough data, it's easily quantifiable.

I admit that, having an affinity for startups, rather than more traditional small businesses, I have a greater affinity for risk. Ironically, perhaps, I'm usually the voice of risk-aversion with respect to IT infrastructure, so I don't believe it affects my overall understanding.

I recently pointed out to an interviewer who was trying to convince me that it was worth spending half a megabuck on a petabyte from Netapp because it was "business critical" instead of 1/10th that amount for DIY, that, just like the DIY solution, Netapp does not indemnify the business against loss. One isn't buying insurance, only a bunch of technology.

Sure, "works the first time" is worth something. Is it worth the cost of a whole, complete, extra server on an order of qty 6? If the infant-mortality rate on servers is anywhere approaching 1-in-6 and they're being shipped somewhere that the replacement time and/or cost would be prohibitive, I'd still probably rather just order 7 servers instead.

That's my main problem with paying a vendor for "reliability": it's a very fuzzy, hand-wavy assurance. Paying for reliability with more hardware has data and statistics behind it, which is an engineering solution.
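As a rough illustration of "paying for reliability with more hardware", here is a minimal sketch of the order-7-instead-of-6 reasoning. It assumes each server arrives dead independently with the same probability (a simplification that is not from the thread), and it treats the 1-in-6 rate as the hypothetical worst case mentioned above rather than a measured figure. The function name is made up for this example.

```python
from math import comb


def prob_enough_working(needed: int, ordered: int, p_dead: float) -> float:
    """Probability that at least `needed` of `ordered` servers work on arrival.

    Assumes each server fails independently with probability p_dead
    (a simple binomial model).
    """
    p_ok = 1.0 - p_dead
    return sum(
        comb(ordered, k) * p_ok ** k * p_dead ** (ordered - k)
        for k in range(needed, ordered + 1)
    )


if __name__ == "__main__":
    p_dead = 1 / 6  # hypothetical worst-case infant-mortality rate
    for ordered in (6, 7, 8):
        p = prob_enough_working(6, ordered, p_dead)
        print(f"order {ordered}: P(at least 6 work) = {p:.3f}")
```

With a realistic failure rate plugged in, the same function gives a number that can be weighed directly against the price of the extra server or of a vendor's warranty.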

amorousf00p · on July 18, 2018

Risk is not quantifiable based on your insight into most businesses. You provision and despair.

mmt · on July 18, 2018

I can't understand either of these two sentences, not even who the "you" is supposed to be.

I'd hope for a more substantive reply, if anything.

danieltillett · on July 17, 2018

That is some serious set up. I can’t see how you get close to this only spending $25K a year on AWS - maybe the price I was quoted for my needs was some sort of suckers price.

amorousf00p · on July 17, 2018

This is a ballpark average and only for one VPC based site. You'd need multiple sites...just as you do with hardware.

stone-monkey · on July 16, 2018

I don't see their comment as paranoid because cloud providers generally are incompetent/steal your data, but because other cloud providers are direct competitors.

Even if they're not sifting through your server data, they can possibly try to get a competitive advantage by analyzing things like usage data as someone above pointed out.

brownbat · on July 16, 2018

When AT&T was forced by law to allow MCI switching equipment in their facilities, they tended to leave that room's window open, so pigeons could nest in the equipment.

(I might have a detail wrong, this and a hundred other great telecom anecdotes are in The Master Switch.)

Always nice to make sure no parties you depend on have conflicting interests.

insensible · on July 16, 2018

I know of a big company that competes with Amazon outside the cloud that pressures its vendors not to use AWS for any of its data.

SpaceRaccoon · on July 16, 2018

Ebay? Walmart? Alibaba?

rb808 · on July 16, 2018

Walmart was news last year. https://www.techrepublic.com/article/walmart-forces-tech-par...

scarejunba · on July 17, 2018

Walmart, Carrefour

antonvs · on July 16, 2018

Most companies can't afford to implement all the security and compliance measures that AWS, Google, or Azure does.

The asymmetry in trust is partly because those cloud providers are huge, and partly because those things are part of their core competency.

camtarn · on July 16, 2018

I know, right? Always seemed a tiny bit hypocritical to me.

lev99 · on July 16, 2018

> Now the question part: would Amazon ever secretly run Amazon.com in a multi-cloud setup, balancing between AWS, GCE, Azure, etc?

I'm not sure how secretive it would be. AWS 'bids' on Amazon.com's business and Amazon.com is under no obligation to use AWS as its cloud service provider.

coryfklein · on July 16, 2018

Are these bids currently public knowledge then?!

lev99 · on July 16, 2018

It would be pretty hard to hide if Amazon.com's images were served from Azure IP addresses...

tomjakubowski · on July 16, 2018

Surely the public-facing stuff is just the tip of the infrastructure iceberg.

core-questions · on July 16, 2018

They're more than big enough to be bringing their own public IP blocks to a provider and having them announce with BGP.

Hell, even us regular joes can do that with services like Packet.Net or Vultr.

chx · on July 16, 2018

They use Akamai for that

Spooky23 · on July 16, 2018

I doubt it. These companies compete brutally, they would never allow Azure or Google to know their use patterns.

lev99 · on July 16, 2018

The (hypothetical) day when AWS & Amazon.com goes down on Prime Day and Amazon stock tanks 50%.

nodesocket · on July 16, 2018

It is beyond absurd people who sell $AMZN on this news. Amazon users will just keep refreshing until it comes back online, and then continue buying as normal.

coryfklein · on July 16, 2018

Some rabid fans may, but it's ridiculous to assume that Amazon sees zero missed revenue when their website goes down on their (arguably) biggest sale of the year.

That's similar to all of the locks on all Walmart stores inexplicably getting stuck in the locked state for 2 hours at 6AM on Black Friday.

jethro_tell · on July 16, 2018

sure, they probably don't get that money back which might slightly influence the revenue for the quarter, but it's not as if the value of the business has changed drastically in the last 3 hours.

This happens every time AWS has an outage as well. Reddit is down, better sell AMZN.

dpark · on July 16, 2018

It's not the same at all, because trying Walmart again later in the day means literally driving back to the physical store. With Amazon, it means getting your phone back out.

I'm sure there is some impact, but it's nowhere near the inconvenience of being locked out of a physical store.

mynameisvlad · on July 16, 2018

If you already waited overnight, then 2 hours isn't going to cause you to go home and come back. You're going to continue to wait in the line until the store opens back up.

dpark · on July 16, 2018

Sorry, I missed the Black Friday comparison. Prime Day and Black Friday simply aren’t comparable, because no one camps out for Prime Day.

solarkraft · on July 16, 2018

My guess is that they will extend prime day. But they fixed it quickly. For me currently on amazon.de there is no problem and amazon.com shows a captcha form, so it looks like they were DoSed.

ghshephard · on July 16, 2018

We had about $300-$400 worth of purchases lined up this morning that were going to be made just for the day. Site was down, we couldn’t get through - they aren’t going to happen. A lot of people only shop on the 30-40% off days - so, I’m guessing we’ll wait for digital Monday out in November?

lev99 · on July 17, 2018

This.

Tons of people that shop sales will just spend their money somewhere else if they miss a sale.

scarejunba · on July 17, 2018

You’re not going to do it once the site’s up if you get the same price?

Presumably the price sensitive customer doesn’t just go elsewhere, right?

avip · on July 16, 2018

Not necessarily people.

Zombieball · on July 16, 2018

Precisely. I wouldn’t doubt there are algorithms that factor the general tone of tweets and news posts (using some sort of NLP) into a trading decision.

smolder · on July 16, 2018

Yes, this is done in practice and is a little scary because of its potential as a means of manipulating automated trading. On the other hand the markets can be manipulated already via news read by humans...

econochoice · on July 17, 2018

Have worked on one such project, can confirm that these exist.

2trill2spill · on July 16, 2018

It's absurd for long term investors to sell on this news but if you're a short term trader or bot then it might be a nice short term profit to trade on this news.

chronid · on July 16, 2018

You never completely recover an order dip. Believe me :)

deevolution · on July 16, 2018

Their stocks started declining around 2:30 eastern time

mikeash · on July 16, 2018

As of 3:45PM they’re up 0.75% for the day.