
My bet is they maxed out on the SQS FIFO 300 messages per second limit.



I have no idea really (I stopped working on the retail websites in 2013) but my gut feeling is that that's at least a couple of orders of magnitude too low.

But, yes, unforeseen rate limits and size limits can cause many hilarious things to happen. I've seen a few good ones in my time. In particular, when somebody sets an upper size limit on an in-memory table and commits it thinking "Well, I added a few orders of magnitude of safety margin - that should be enough for anybody", that's probably going to become an incident at some point in the distant future ;) With luck, the throttling or failure behaviour will only affect a few people, and it'll be spotted by someone looking at traffic graphs and noticing a very slightly elevated rate of service errors. If you're unlucky, though, when you hit the limit the whole service slows down, locks up, or just plain crashes, and something like this happens...


Yup, many an outage has something like this at its core. At my company, to address that, we've built an internal library for enforcing rate limits and size limits that is (a) configurable on the fly and (b) generates logs, so that we can trigger an alert whenever any limit reaches 70% of capacity. And thus, hopefully, head these things off before they reach the tipping point.
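(Not our actual library, obviously, but a rough sketch of the idea in Python; all names and numbers here are made up for illustration. The point is that every limit check also emits a utilization log line that the alerting pipeline can match on.)

    import logging
    import threading

    log = logging.getLogger("limits")

    class ObservedLimit:
        """A counter with a reconfigurable ceiling that logs its own utilization."""

        def __init__(self, name, capacity, alert_fraction=0.70):
            self.name = name
            self.capacity = capacity            # can be bumped on the fly
            self.alert_fraction = alert_fraction
            self.used = 0
            self._lock = threading.Lock()

        def try_acquire(self, amount=1):
            with self._lock:
                if self.used + amount > self.capacity:
                    log.error("limit %s exhausted (%d/%d)", self.name, self.used, self.capacity)
                    return False
                self.used += amount
                utilization = self.used / self.capacity
                if utilization >= self.alert_fraction:
                    # The alerting system triggers on warning lines like this one.
                    log.warning("limit %s at %.0f%% of capacity", self.name, utilization * 100)
                return True

        def release(self, amount=1):
            with self._lock:
                self.used = max(0, self.used - amount)

Callers then just ask the limit object for permission before doing the work, shed load when it says no, and let the 70% warnings feed the alerting system.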

https://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/


Don't forget to alert when the second derivative of capacity growth is above a (very low!) threshold; that can catch an explosive problem far earlier than the 70% mark, by which point usage might be growing at 5% per second.
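(Rough sketch in Python of what that check might look like; the window size and threshold are placeholders you'd tune per resource.)

    from collections import deque

    class AccelerationAlarm:
        """Fires when usage growth is itself accelerating (positive second derivative)."""

        def __init__(self, threshold, window=3):
            self.threshold = threshold          # units: usage per second^2, kept very low
            self.samples = deque(maxlen=window)

        def observe(self, t, usage):
            self.samples.append((t, usage))
            if len(self.samples) < 3:
                return False
            (t0, u0), (t1, u1), (t2, u2) = list(self.samples)[-3:]
            rate_old = (u1 - u0) / (t1 - t0)    # growth rate over the older interval
            rate_new = (u2 - u1) / (t2 - t1)    # growth rate over the newer interval
            accel = (rate_new - rate_old) / (t2 - t1)
            return accel > self.threshold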


It might also be a momentary blip that you really shouldn't need to wake a human up for. The hard part of monitoring, imo, is striking that perfect balance: asymptotically approaching nirvana but never quite reaching it.


All monitoring has that problem :-) But if your rate of rate of change has been positive for some suitable amount of time for the context, it's worth waking a human up over it, because the amount of resource remaining is dwindling exponentially.
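(Back-of-the-envelope: if usage is compounding at 5% per second, a resource sitting at 50% utilization hits 100% in log(2)/log(1.05) ≈ 14 seconds, and even from 10% it only takes about 47 seconds, so a purely level-based 70% alert gives you almost no lead time.)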


That sounds like a very interesting metric to track and report on. Do you have any further references that discuss that approach?


Not offhand, sorry - I first encountered the notion at a conference some years back, but I've long since gotten out of operations so it's all a bit dusty for me now.


That's cool, but it's a bit more complicated than rate limits.


I once owned an EAV DB for storing small config values. Someone wrote a great wrapper around it that made it seem like a proper DB with many tables. Since it was for storing configs, this library cached the whole thing in memory on startup. Zoom ahead 5 years: we have 10+ settings for every customer in this store, and one day all our hosts keep failing over. As it turns out, that small config table was north of 5 gigs and was destroying our heap.


This is the official rate limit for FIFO SQS queues. Although I bet the Amazon folks know a couple of people around the office who can raise that limit :)





