
My bet is they maxed out on the SQS FIFO 300 messages per second limit.



I have no idea really (I stopped working on the retail websites in 2013) but my gut feeling is that that's at least a couple of orders of magnitude too low.

But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size limit on an in-memory table and commits it thinking "Well, I added a few orders of magnitude of safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by someone looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...


Yup, many an outage has something like this at its core. At my company, to address that, we've built an internal library for enforcing rate limits and size limits that is (a) configurable on the fly and (b) generates logs, so that we can trigger an alert whenever any limit reaches 70% of capacity. And thus, hopefully, head these things off before they reach the tipping point.
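(Not our actual library, obviously, but a rough sketch of the idea in Python; all names and numbers here are made up for illustration. The point is that every limit check also emits a utilization log line that the alerting pipeline can match on.)

    import logging
    import threading

    log = logging.getLogger("limits")

    class ObservedLimit:
        """A counter with a reconfigurable ceiling that logs its own utilization."""

        def __init__(self, name, capacity, alert_fraction=0.70):
            self.name = name
            self.capacity = capacity            # can be bumped on the fly
            self.alert_fraction = alert_fraction
            self.used = 0
            self._lock = threading.Lock()

        def try_acquire(self, amount=1):
            with self._lock:
                if self.used + amount > self.capacity:
                    log.error("limit %s exhausted (%d/%d)", self.name, self.used, self.capacity)
                    return False
                self.used += amount
                utilization = self.used / self.capacity
                if utilization >= self.alert_fraction:
                    # The alerting system triggers on warning lines like this one.
                    log.warning("limit %s at %.0f%% of capacity", self.name, utilization * 100)
                return True

        def release(self, amount=1):
            with self._lock:
                self.used = max(0, self.used - amount)

Callers then just ask the limit object for permission before doing the work, shed load when it says no, and let the 70% warnings feed the alerting system.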

https://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/


Don't forget to alert when the second derivative of capacity growth is above a (very low!) threshold; that can catch an explosive problem far earlier than the 70% mark, by which point usage might be growing at 5% per second.
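(Rough sketch in Python of what that check might look like; the window size and threshold are placeholders you'd tune per resource.)

    from collections import deque

    class AccelerationAlarm:
        """Fires when usage growth is itself accelerating (positive second derivative)."""

        def __init__(self, threshold, window=3):
            self.threshold = threshold          # units: usage per second^2, kept very low
            self.samples = deque(maxlen=window)

        def observe(self, t, usage):
            self.samples.append((t, usage))
            if len(self.samples) < 3:
                return False
            (t0, u0), (t1, u1), (t2, u2) = list(self.samples)[-3:]
            rate_old = (u1 - u0) / (t1 - t0)    # growth rate over the older interval
            rate_new = (u2 - u1) / (t2 - t1)    # growth rate over the newer interval
            accel = (rate_new - rate_old) / (t2 - t1)
            return accel > self.threshold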


It might also be a momentary blip that you really shouldn't need to wake a human up for. The hard part of monitoring, imo, is striking that perfect balance: asymptotically approaching nirvana but never quite reaching it.


All monitoring has that problem :-) But if your rate of rate of change has been positive for some suitable amount of time for the context, it's worth waking a human up over it, because the amount of resource remaining is dwindling exponentially.
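(Back-of-the-envelope: if usage is compounding at 5% per second, a resource sitting at 50% utilization hits 100% in log(2)/log(1.05) ≈ 14 seconds, and even from 10% it only takes about 47 seconds, so a purely level-based 70% alert gives you almost no lead time.)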


That sounds like a very interesting metric to track and report on. Do you have any further references that discuss that approach?


Not offhand, sorry - I first encountered the notion at a conference some years back, but I've long since gotten out of operations so it's all a bit dusty for me now.


That's cool, but it's a bit more complicated than rate limits.


I once owned an EAV DB for storing small config values. Someone wrote a great wrapper around it that made it seem like a proper DB with many tables. Since it was for storing configs, this library cached the whole thing in memory on startup. Zoom ahead 5 years: we have 10+ settings for every customer in this store, and one day all our hosts keep failing over. As it turns out, that small config table was north of 5 gigs and was destroying our heap.


This is the official rate limit for FIFO SQS queues. Although I bet the Amazon folks know a couple of people around the office who can raise that limit :)





