This really shows the value of a few good engineers. I can easily see this taking a larger team much longer, but in this case a few engineers figured out the least painful solution and implemented it with minimal downtime.
RDS is just fine for production environments. Any tuning you need can be done with parameters; your only limit might be the connection cap that RDS imposes (based on instance size).
In my experience (a few years ago though, to be honest), RDS was absolutely terrible for high-performance and/or latency-sensitive write workloads. This was due to how (again, at the time) replication was handled -- Amazon apparently did (at the time? still does?) synchronous writes to each AZ, and only completed the transaction when both returned. When one AZ/RDS instance was slow or dropping packets (which seemed oddly frequent at the time for cross-AZ traffic -- again, about 3 years ago though), our production stack would catch fire and come to a crashing halt. Never again!
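To make the complaint above concrete, here is a toy model of why a synchronous write to two AZs hurts tail latency. It assumes only what the comment describes (a commit finishes when both copies acknowledge); the function names and the latency numbers are made up for illustration and are not measurements of RDS.

```python
from typing import List, Sequence


def sync_commit_latency_ms(primary_ms: float, standby_ms: float) -> float:
    """A commit that waits for both copies is as slow as the slower one."""
    return max(primary_ms, standby_ms)


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def simulate(primary: Sequence[float], standby: Sequence[float]) -> List[float]:
    """Pair up per-commit latencies and return the effective commit latency."""
    return [sync_commit_latency_ms(p, s) for p, s in zip(primary, standby)]


# Hypothetical workload: the primary is always fast, while the standby in the
# other AZ is usually fast but stalls on 3 commits out of 100.
primary = [2.0] * 100
standby = [2.5] * 97 + [200.0] * 3
commits = simulate(primary, standby)

print("p50:", percentile(commits, 50), "ms")
print("p99:", percentile(commits, 99), "ms")
```

The median barely moves, but the 99th percentile is set entirely by the slow AZ, which matches the "fine most of the time, then everything falls over" pattern described.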
Hi, one of the Airbnb engineers involved in this op here. Yeah... that does sound a lot like 3 years ago. The situation has gotten a lot better, especially with 5.6 and PIOPS. These days, things work pretty smoothly (even as the volume of traffic and data has scaled massively).
Is there any visibility into how this is actually implemented? I was considering switching from self-hosted Postgres on EC2 to RDS, but that delay would certainly be an issue.
No, there was zero visibility. Again, this was just over 3 years ago, so my memory isn't super fresh, but we managed to wring bits and pieces of info out of our paid AWS support after we kept getting such horrible intermittent write latency.
We later tried just running MySQL ourselves on big instances, RAIDing across a large number of EBS volumes... we ended up running into other weird issues with that too. We would sometimes get terrible write latency spikes, which we were told were a result of "stuck blocks on the SAN". Apparently the backing SAN would sometimes have some blocks that performed very poorly (maybe a disk under high contention in one SAN cabinet?), and this would cause our overall RDS performance to plummet, but only irregularly. We would usually get on the horn, and after talking with someone it would either magically stop being slow ("we don't see any problems here!") or we would be told about some "stuck blocks" and they would do some type of remapping or migration of those blocks. It was not very transparent to us what was really going on. Sometimes we would just spin up new instances and EBS volumes and do perf testing on them until we got a set that performed consistently, until something goofy happened again a week or so later. Pretty awful. We tried local instance storage, but it just wasn't fast enough (primary reason), and it felt a bit dangerous -- even though we backed up to an EBS volume pretty regularly.
We ended up bailing from AWS and saw huge performance improvements (reduction in latency, etc.) by using SSDs and real hardware. We actually even ended up with some cost savings! Not long after we left, Amazon came out with local SSD storage (high I/O instances, I think they were called?), which may have been workable, but by then we had migrated away (we still used S3 though, and still used on-demand instances for developers).
Got it, thanks for clarifying. The "transition to bare metal" story doesn't seem to be nearly as frequent nowadays, although it does still happen. We've been mostly lucky with AWS performance so far, but this certainly isn't super encouraging, especially given the apparent performance unpredictability here.
From what I see, that failover uses DNS. The endpoint stays the same, but it gets pointed to a new IP address or so... and the app may continue to use the cached IP from its earlier DNS query. I have to write a daemon to listen for RDS events and restart our app if it detects a failover event :(
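For what it's worth, a rough sketch of one way such a daemon could look. Note that this swaps in a simpler technique than listening for RDS events: it just polls DNS for the endpoint and fires a callback when the resolved address set changes. The names `watch_endpoint` and `on_change`, the hostname in the usage example, and the 5-second interval are all hypothetical; the restart logic itself is left to the callback.

```python
import socket
import time
from typing import Callable, Optional, Set


def resolve_ips(hostname: str, port: int = 3306) -> Set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def watch_endpoint(
    hostname: str,
    on_change: Callable[[Set[str], Set[str]], None],
    resolver: Callable[[str], Set[str]] = resolve_ips,
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll DNS and call on_change(old, new) each time the address set changes.

    Runs forever when max_polls is None. Returns the number of changes seen.
    """
    known = resolver(hostname)
    changes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_seconds)
        polls += 1
        try:
            current = resolver(hostname)
        except OSError:
            # Transient lookup failure (socket.gaierror is an OSError):
            # keep the last known addresses and try again next round.
            continue
        if current and current != known:
            on_change(known, current)
            known = current
            changes += 1
    return changes
```

Usage would be something like `watch_endpoint("mydb.example.internal", restart_app)`, where `restart_app` is whatever restarts the app or resets its connection pool. One caveat: if the OS or a local resolver caches answers longer than the record's TTL, the poller will see the change late too, so this only helps when the stale cache lives inside the app itself.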