Details of yesterday's Bunny CDN outage (bunny.net)
175 points by aSig on June 23, 2021 | 85 comments



> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database. Unfortunately, this managed to upload a corrupted file to the Edge Storage.

I wonder if simple checksum verification of the file would have helped avoid this outage altogether.
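A minimal sketch of that check in Go, assuming the build step publishes a SHA-256 digest alongside the file (the file name and digest value below are placeholders, not anything from the post):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "os"
    )

    // verifyChecksum refuses to use a downloaded file unless its SHA-256
    // digest matches the value published alongside it.
    func verifyChecksum(path, expectedHex string) error {
        data, err := os.ReadFile(path)
        if err != nil {
            return err
        }
        sum := sha256.Sum256(data)
        if hex.EncodeToString(sum[:]) != expectedHex {
            return fmt.Errorf("checksum mismatch for %s", path)
        }
        return nil
    }

    func main() {
        // "optimization.db" and the digest are hypothetical placeholders.
        if err := verifyChecksum("optimization.db", "expected-digest-from-build"); err != nil {
            fmt.Println("refusing to load:", err)
            os.Exit(1)
        }
        fmt.Println("checksum OK, safe to deserialize")
    }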

> Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead

This is exactly why one needs canary-based deployments. I have seen umpteen issues caught in canary, which has saved my team tons of firefighting time.


> I wonder if simple checksum verification of the file would have helped avoid this outage altogether.

Oh man, you stirred up a really old Cloudflare memory. Back when I was working on our DNS infrastructure I wrote up a task that says: "RRDNS has no way of knowing how many lines to expect or whether what it is read is valid. This could create an issue where the LB map data is not available inside RRDNS."

At the time this "LB map" thing was critical to the mapping between a domain name and its associated IP address(es). Without it Cloudflare wouldn't work. Re-reading the years old Jira I see myself and Lee Holloway discussing the checksumming of the data. He implemented the writing of the checksum and I implemented the read and check.

I miss Lee.


For those who, like myself, don't know the story, here it is: https://www.wired.com/story/lee-holloway-devastating-decline...

I'm deeply moved after reading it. Can't imagine how tragic it must be for people who know Lee.


Sounds similar to what happened to Nietzsche:

https://en.wikipedia.org/wiki/Friedrich_Nietzsche#Mental_ill...


That was an incredible story, and I went down a rabbit hole of reading more about that disease. Thank you very much for sharing.


Wow, that is absolutely tragic. Neurodegenerative diseases are something I fear the most, having seen what Huntington's can do to somebody.


In the post or comments, they claimed to be using canaries; perhaps their canary simply didn't die in the coal mine?


That doesn't protect against a file already being generated in a broken fashion, or against its content not being compatible with the newest schema you are using for deserialization.

For serialization in a distributed system you always want to have a parser which can detect invalid data and has means to support forward and backward compatibility.
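A rough Go illustration of that idea (the envelope layout and field names are made up for the example, not anything Bunny actually uses): wrap the payload in a small envelope carrying a schema version and a digest, and refuse to hand anything to the real deserializer unless both checks pass.

    package safeload

    import (
        "crypto/sha256"
        "encoding/json"
        "errors"
        "fmt"
    )

    // Envelope is a hypothetical wrapper written by the producer.
    type Envelope struct {
        SchemaVersion int    `json:"schema_version"`
        Checksum      string `json:"checksum"` // hex SHA-256 of Payload
        Payload       []byte `json:"payload"`
    }

    const (
        minSupported = 2 // oldest schema this reader still understands
        maxSupported = 3 // newest schema this reader knows about
    )

    // Decode validates version and integrity before the payload is trusted.
    func Decode(raw []byte) ([]byte, error) {
        var env Envelope
        if err := json.Unmarshal(raw, &env); err != nil {
            return nil, fmt.Errorf("invalid envelope: %w", err)
        }
        if env.SchemaVersion < minSupported || env.SchemaVersion > maxSupported {
            return nil, errors.New("unsupported schema version")
        }
        if fmt.Sprintf("%x", sha256.Sum256(env.Payload)) != env.Checksum {
            return nil, errors.New("payload checksum mismatch")
        }
        return env.Payload, nil // only now hand it to the real deserializer
    }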


> forward and backward compatibility

Also for HTTP requests I suppose


They are making it sound like they did everything right and it was an issue with a third-party library. If we listed all the libraries our code depends on, it would be in the 1000s. I can't comprehend how a CDN does not have any canary or staging setup, and how in one update everything could go haywire in seconds. I think it is standard practice in any decent-sized company to have staging/canary and rollbacks.


That's not the impression I got. Yeah, their takeaway was to stop using BinaryPack, which I disagree with. However, it sounded to me like they very much understood that they made the biggest error in putting all of their eggs in one basket.

Your system WILL go down eventually. The question is how will you recover from it?


Right, this was our biggest failure (not the only one of course, but we are here to improve). Relying on our own systems to maintain our own systems.

We are dropping BinaryPack mainly because we're a small team, and it wasn't really a big benefit anyway, so spending more time than necessary to try and salvage that makes no sense. This was more of a hot-fix since we don't want the same thing repeating in a week.


That makes sense then with the additional context.

I don't know the details of your operation, but keeping your ability to update your systems separate from your systems is something I'd strongly encourage.


I came to post that, yeah. I work on a sensitive system on which people can lose millions for a few minutes of downtime, and we are a bit anal about week-long pilots where half of prod is in a permanent canary stage.

But it also feels like they used their own infra to set up their stuff, and if their infra was dead they couldn't roll back, which sounds like a case of people being a bit too optimistic.

We had catastrophes too, notably with poison pills in a record stream we can't alter, but this update cascade crash sounds avoidable.

Always easy to judge anyway, always happens to you eventually :D


This. While failure, human or not, is unavoidable in the long term, from their writeup they do not seem to have procedures to avoid this particular mode of failure.


Good and clear explanation. This is a risk you take when you use a CDN, I still think the benefits outweigh the occasional downtime. I'm a big fan of BunnyCDN, they've saved me a lot of money over the past few years.

I'm sure I'd be fuming if I worked at some multi-million dollar company but as someone that mainly works for smaller businesses it's not the end of the world, I suspect most of my clients haven't even noticed yet.


TIL about BunnyCDN. I had been paying $0.08 per GB on AWS Cloudfront whereas BunnyCDN is only $0.01 per GB. Can you comment on your experience with them? Are the APIs comprehensive, e.g. cache invalidation? Do they support cookie-based authorization? Any support for geo-fencing?


I think the answer is yes to all three questions, depending on the specifics. They've got a nice setup, with ~40+ edge locations compared to Cloudfront's ~200+, but the advantage is that they're massively cheaper for a very small increase in latency. They also have the ~5-region high-volume tier, which is something like another order of magnitude cheaper.

The feature set is pretty full, no edge functions, but there is a rule engine you can run on the edge. Fast config updates, nice console and works well enough for most of my projects.

They also have a nice integrated storage solution that's way easier to configure than S3 + Cloudfront, and lots of origin shielding options.


I noticed another user has already commented; it sounds like they've had more experience with the things you're interested in than I have. FWIW, the APIs have been sufficient for my use cases, and you can definitely purge a pull zone's cache with them.

My primary use has been serving image assets. I switched over from Cloudfront and have seen probably a >80% cost reduction with no noticeable performance reduction, but as I mentioned, I'm operating at a scale where milliseconds of difference don't mean much.


Using a CDN for the first time improved our site's [1] performance by a huge amount, thanks to BunnyCDN. Really easy to set up, great dashboard. The flat-rate image optimizer works really well. The only missing option is rotating images, which I opened a feature request for with them.

You can see our CDN usage by inspecting the URLs of the product images. Size attributes are added to the URL, and Bunny automatically resizes and compresses the images on the fly.

[1] https://www.airsoftbazaar.nl


Can this allow me to route x.mydomain.com (more than one wildcard and top level) to x.a.run.app (Google Cloud Run)? Cloud Run (and the Django app behind it) won't approve domain mapping for Mumbai yet, so I am looking for transparent domain rewriting. Cloudflare allows it, but it's kinda expensive.

https://cloud.google.com/run/docs/locations#domains


As the docs say, you can use an LB with this. It'll be 18 dollars a month, though.


> AWS Cloudfront

AWS Cloudfront is neither the fastest nor the cheapest, in both bulk and small file transfer. I can't think of a single technical reason it is better. Fastly, Akamai, Limelight, Cloudflare, or even good old EdgeCast: they all have their strong points in some of their niche services or domains.

Is the reason for using AWS Cloudfront an enterprise purchasing one? Or is there some technical superiority I am not seeing?


I'm using them quite extensively (except the Stream video feature). APIs are good, and traffic can be restricted or rerouted based on geo. Not sure what cookie-based auth would do in a CDN, but if it's on the origin it passes through. For authenticating URLs there is a signing scheme you can use.
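For anyone wondering what URL signing looks like in general, here is a generic HMAC-based sketch in Go; this is only an illustration of the concept, not Bunny's actual token scheme:

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/base64"
        "fmt"
        "time"
    )

    // signURL appends an expiry and an HMAC token so the edge can verify the
    // link without calling back to the origin. Purely illustrative.
    func signURL(secret, path string, ttl time.Duration) string {
        expires := time.Now().Add(ttl).Unix()
        mac := hmac.New(sha256.New, []byte(secret))
        fmt.Fprintf(mac, "%s%d", path, expires)
        token := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
        return fmt.Sprintf("%s?token=%s&expires=%d", path, token, expires)
    }

    func main() {
        fmt.Println(signURL("my-zone-secret", "/images/photo.jpg", 10*time.Minute))
    }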


I like how something is “auto-healing” when it just… has `Restart=on-failure` in systemd.

Anyway, it’s always DNS. Always.

“Unfortunately, that allowed something as simple as a corrupted file to crash down multiple layers of redundancy with no real way of bringing things back up.”

You can spend many, many millions of $ on multi-AZ Kubernetes microservices blah blah blah and it’ll still be taken down by a SPOF, which, 99% of the time, is DNS.

Actual redundancy, as opposed to “redundancy”, is extremely difficult to achieve because the incremental costs of one more 9 are almost exponential.

And then a customer updates their configuration and your entire global service goes down for hours ala Fastly.

Or a single corrupt file crashes your entire service.


>Anyway, it’s always DNS. Always.

Which is disappointing. An infrastructure where the backend is VERY easy to make highly redundant. Thwarted by decisions not to do that easy work, or thwarted by client libraries that don't take advantage of it.


This brings up one of my pet peeves: recursion. Of course there should have been other mitigations in place, but recursion is such a dangerous tool. So far as reasonably possible, I consider its only real purpose to confuse students in 101 courses.

I assume that they are using .Net, as SOEs (stack overflow exceptions) bring down .Net processes. While that sounds like a strange implementation detail, the philosophy of the .Net team has always been "how do you reasonably recover from a stack overflow?" Even in C++, what happens if, for example, the allocator experiences a stack overflow while deallocating some RAII resource, or a finally block calls a function and allocates stack space, or... you get the idea.

The obvious thing to do here would be to limit recursion in the library (which amounts to safe recursion usage). BinaryPack does not have a recursion limit option, which makes it unsafe for any untrusted data (and that can include data that you produce, as Bunny experienced). Time to open a PR, I guess.

This applies to JSON, too. I would suggest that OP configure their serializer with a limit:

[1]: https://www.newtonsoft.com/json/help/html/MaxDepth.htm
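The linked MaxDepth setting does this for Newtonsoft.Json on .NET. The same idea can be sketched in Go by scanning the token stream and rejecting overly deep nesting before the real decode (a rough illustration, not tied to any particular library):

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "strings"
    )

    // rejectDeepJSON fails fast if nesting exceeds maxDepth, so untrusted
    // input can't drive unbounded recursion in a later full decode.
    func rejectDeepJSON(input string, maxDepth int) error {
        dec := json.NewDecoder(strings.NewReader(input))
        depth := 0
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                return nil
            }
            if err != nil {
                return err
            }
            if d, ok := tok.(json.Delim); ok {
                switch d {
                case '{', '[':
                    depth++
                    if depth > maxDepth {
                        return fmt.Errorf("nesting depth exceeds %d", maxDepth)
                    }
                case '}', ']':
                    depth--
                }
            }
        }
    }

    func main() {
        deep := strings.Repeat("[", 100) + strings.Repeat("]", 100)
        fmt.Println(rejectDeepJSON(deep, 64)) // nesting depth exceeds 64
    }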


> recursion is such a dangerous tool.

The most effective tools for the job are usually the more dangerous ones. Certainly, you can do anything without recursion, but forcing this makes a lot of problems much harder than they need to be.


> While that sounds like a strange implementation detail, the philosophy of the .Net team has always been "how do you reasonably recover from a stack overflow?"

Can you expand on this or link to any further reading? I just realized that this affects my platform (Go) as well, but I don't understand the reasoning. Why can't stack overflow be treated just like any other exception, unwinding the stack up to the nearest frame that has catch/recover in place (if any)?


> Why can't stack overflow be treated just like any other exception[...]?

Consider the following code:

    func overflows() {
        defer a()
        
        fmt.Println("hello") // <-- stack overflow occurs within
    }

    func a() {
        fmt.Println("hello")
    }

The answer lies in trying to figure out how Go would successfully unwind that stack: it can't. When it calls `a`, it will simply overflow again. Something that has been discussed is "StackAboutToOverflowException", but that only kicks the can down the road (unwinding could still cause an overflow).

In truth, the problem exists because of implicit calls at the end of methods interacting with stack overflows, whether that's because of defer-like functionality, structured exception handling, or deconstructors.


But doesn’t this apply to “normal” panics as well? When unwinding the stack of a panicking goroutine, any deferred call might panic again, in which case Go keeps walking up the stack with the new panic. In a typical server situation, it will eventually reach some generic “log and don’t crash” function, which is unlikely to panic or overflow.

Perhaps one difference is that, while panics are always avoidable in a recovery function, stack overflows are not (if it happens to be deep enough already). Does the argument go “even a seemingly safe recovery function can’t be guaranteed to succeed, so prevent the illusion of safety”?

(To be clear: I’m not arguing, just trying to understand.)


I'm not actually sure what Go would do in a double-fault scenario (that's when a panic causes a panic), but assuming it can recover from that:

In the absolute worst case scenario: stack unwinding is itself a piece of code[1]. In order to initiate the stack unwind, and deal with SEH/defer/dealloc, the Go runtime would need stack space to call that method. Someone might say, "freeze the stack and do the unwind on a different thread." The problem is the bit in the quotes is, again, at least one stack frame and needs stack space to execute.

I just checked the Go source, and it basically uses a linked list of stack frames in the heap[2]. If a stack is about to overflow, it allocates a new stack and continues in that stack. This does have a very minor performance penalty. So you're safe from this edge case :).

[1]: https://www.nongnu.org/libunwind/ [2]: https://golang.org/src/runtime/stack.go


> So far as reasonably possible, I consider its only real purpose to confuse students in 101 courses.

I had a "high school" level programming class with Python before studying CS. I ran into CPython's recursion limit often and wondered why one would use recursion when for loops were a more reliable solution.

Nowadays, my "recursion" is for-looping over an object's children and calling some function on each child object.
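That pattern is essentially the textbook replacement of the call stack with an explicit one; a minimal Go sketch (the Node type is invented for the example):

    package main

    import "fmt"

    // Node is a stand-in for whatever object tree is being walked.
    type Node struct {
        Name     string
        Children []*Node
    }

    // visitAll walks the tree iteratively: an explicit slice replaces the call
    // stack, so depth is bounded by heap memory rather than stack size.
    func visitAll(root *Node, visit func(*Node)) {
        stack := []*Node{root}
        for len(stack) > 0 {
            n := stack[len(stack)-1]
            stack = stack[:len(stack)-1]
            visit(n)
            stack = append(stack, n.Children...)
        }
    }

    func main() {
        root := &Node{Name: "root", Children: []*Node{{Name: "a"}, {Name: "b"}}}
        visitAll(root, func(n *Node) { fmt.Println(n.Name) })
    }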


Great write-up. I've just switched from Cloudinary to Backblaze B2 + Bunny CDN and I am saving a pretty ridiculous amount of money for hosting thousands of customer images.

Bunny has a great interface and service; I'm really surprised how few people know about it. I think I discovered it on some 'top 10 CDNs' list that I usually ignore, but the pricing was too good to pass up.

The team is really on the ball from what I've seen. Appreciate the descriptive post, folks!


I'm impressed by the transparency and clarity of their explanation! Definitely makes me want to use their solution even though they messed up big time!


on a different note, this outage news will give them more publicity than the product itself, I believe...


TIL .. of bunny.net :-)


Dejan here from bunny.net. I was reading some of the comments but wasn't sure where to reply, so I guess I'll post some additional details here. I tried to keep the blog post somewhat technical without overwhelming non-technical readers.

So to add some details, we already use multiple deployment groups (one for each DNS cluster). We always deploy each cluster separately to make sure we're not doing something destructive. Unfortunately this deployment went to a system that we believed was not a critical part of infrastructure (oh look how wrong we were) and was not made redundant, since the rest of the code was supposed to handle it gracefully in case this whole system was offline or broken.

It was not my intention to blame the library, obviously this was our own fault, but I must admit we did not expect a stack overflow out of it, which completely obliterated all of the servers immediately when the "non-critical" component got corrupted.

This piece of data is highly dynamic and processes every 30 seconds or so based on hundreds of thousands of metrics. Running a checksum would have done no good here, because the distributed file was perfectly fine. The issue happened when the file was being generated, not distributed.

Now for the DNS itself, which is a critical part of our infrastructure.

We of course operate a staging environment with both automated testing and manual testing before things go live.

We also operate multiple deployment groups so separate clusters are deployed first, before others go live, so we can catch issues.

We do the same for the CDN and always use canary testing if possible. We unfortunately never anticipated that this piece of software could cause all the DNS servers to stack overflow.

Obviously, as I mentioned, we are not perfect, but we are trying to improve on what happened. The biggest flaw we discovered was the reliance on our own infrastructure to handle our own infrastructure deployments.

We have code versioning and CI in place, as well as the option to do rollbacks as needed. If the issue had happened under normal circumstances, we would have had the ability to roll back all the software instantly and maybe experience 2-5 minutes of downtime. Instead, we brought down the whole system like dominoes because it all relied on each other.

Migrating deployment services to third-party solutions is therefore our biggest fix at this point.

The reason we are moving away from BinaryPack is that it simply wasn't providing that much benefit. It was helpful, but it wasn't having a significant impact on the overall behavior, so we would rather stick with something that worked fine for years without issues. As a small team, we don't have the time or resources to spend improving it at this point.

I'm somewhat exhausted after yesterday, so I hope this is not super unstructured, but I hope that answers some questions and doesn't create more of them :)

If I missed any suggestions or something that was unclear, please let me know. We're actively trying to improve all the processes to avoid similar situations in the future.


> This piece of data is highly dynamic and processes every 30 seconds or so based on hundreds of thousands of metrics.

Perhaps you guys need a ... database? Relevant HN discussion from a few months ago, when Tailscale migrated from JSON to etcd: https://news.ycombinator.com/item?id=25767128


From the article "Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead." and from your comment "We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow."

This reads like the DNS software was being changed. As some people already mentioned, is this a corruption where a checksum would have prevented the stack overflow, or would a canary have detected this? Why was the change to the DNS server software not canaried?


I read it as "DNS software changed, that worked fine, but it turns out we sometimes generate a broken database - not often enough to see it hit during canary, but devastating when it finally happened"

GP also notes that this database changed perhaps every 30 seconds

Just a few guesses: if you have a process that corrupts a random byte every 100,000 runs, and you run it every 30 seconds, it might take days before you're at 50% odds of having seen it happen. And if that used to be a text or JSON database, flipping a random bit might not even corrupt anything important. Or if the code swallows the exception at some level, it might even self-heal after 30 seconds when new data comes in, causing an unnoticed blip in the monitoring, if anything at all.
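Rough numbers for that guess, taking 1-in-100,000 per run and one run every 30 seconds as the (made-up) assumptions:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        p := 1.0 / 100000.0    // assumed chance of corruption per run
        n := math.Log(2) / p   // runs until 50% odds: 1-(1-p)^n ≈ 1-e^(-p*n)
        days := n * 30 / 86400 // one run every 30 seconds
        fmt.Printf("~%.0f runs, ~%.0f days\n", n, days) // ~69315 runs, ~24 days
    }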

Now, I don't know what BinaryPack does exactly, but if you were to replace the above process with something that compresses data, a flipped bit will corrupt a lot more data, often everything from that point forward (whereas text or JSON is pretty self-synchronizing). And if your new code falls over completely when that happens, there's no more self-healing.

I can totally imagine missing an event like that during canary testing


> Unfortunately this deployment went to a system that we believed was not a critical part of infrastructure

Deploying to a different, lower priority system is not a canary. Do you phase deployments to each system, per host or zone?


For critical systems (or let's call them services) such as DNS, CDN, optimizer, and storage, we usually deploy on a server-by-server, regional, or cluster basis before going live. What I meant here was that this was not really a critical service, and nobody thought it could actually cause any harm, so we didn't do canary testing there, as it would add a very high level of complexity.


Hi Dejan, we are evaluating Bunny for a long-term multi-tenant project. Today your support mentioned that the CDN optimizer strips all origin headers. Is there any way to permit some headers on a per-zone basis?


Thanks for the write up. I enjoyed reading it.


hey dejan, we have been using BunnyCDN for quite some time. Thanks for the detailed writeup.

Looks like storage zones are still not fully stable? After experiencing several issues with storage zones earlier, we migrated to a pull zone. We haven't had any major issues since the migration.

What plans do you have to improve your storage zones?


Hey, glad to hear that and sorry again about any issues. If you're experiencing any ongoing problems, please message our support team. I'm not aware of anything actively broken, but if there's a problem I'm sure we'll be able to help.


I have also had problems with storage zones. We experienced multiple periods of timeouts, super long TTFB, and 5xx responses. A ticket was opened (#136096) about the TTFB issue with full headers/curl output and an offer to supply any further useful information, but the response of "can you confirm this is no longer happening?" the following day discouraged me from spending further time there.

To this day US PoPs are still pulling from EU storage servers (our storage zone is in NY, replicated in DE).

  < Server: BunnyCDN-IL1-718
  < CDN-RequestCountryCode: US
  < CDN-EdgeStorageId: 617
  < CDN-StorageServer: DE-51
We've since moved away from Bunny, but if there's anything I can do to help improve this situation I'd be happy to do it because it is otherwise a fantastic product for the price.


We had the same: super long TTFB and lots of 5xx errors. It seems to be mostly fixed now, but there are definitely things that could be done differently; however, given the pricing and feature set, I'm happy with the service.

Would love additional capabilities within the image optimizer, such as crop methods.


Oh, this is a great writeup. I co-host a podcast on outages, and over and over we see cases where circular dependencies end up making recovery much harder. Also, not using a staged deployment is a recipe for disaster!

We just wrapped up the first season, but I'm going to put this on the list of episodes for the second season: https://downtimeproject.com.


This is great! I love these types of podcasts. Adding this one to my subscriptions list right now. A bunny CDN episode would be fun. Thanks for putting this podcast on my radar.


I didn't notice the outage, but I do appreciate the automatic SLA honouring, plus letting me know.

Nice work Bunny CDN.


All this focus on redundancy should be replaced with a focus on recovery. Perfect availability is already impossible. For all practical purposes, something that recovers within minutes is better than trying to always be online and failing horribly.


A backup vendor once pointed out that backup was the most misnamed product/function in all of computerdom. He argued it should really be referred to as restore, since when the chips are down that's what you really, really care about. That really resonated with the young sysadmin I was at the time.

Very similar to the story about the planes with holes coming back in WWII and the initial analysis that suggested adding more armor where the holes were, until someone flipped it and pointed out that armor was needed where the holes weren't, since planes with holes in those spots weren't the ones coming back.


And thanks to the writeup making it to the top of HN, I and probably many more people here have now learned about the existence of Bunny CDN.


It’s not DNS

There's no way it's DNS

It was DNS

One of the most bittersweet haikus for any sysadmin :p


These follow-ups aren't super compelling IMO.

> To do this, the first and smallest step will be to phase out the BinaryPack library and make sure we run a more extensive testing on any third-party libraries we work with in the future.

Sure. Not exactly a structural fix. But maybe worth doing. Another view would be that you've just "paid" a ton to find issues in the BinaryPack library, and maybe should continue to invest in it.

Also, "do more tests" isn't a follow up. What's your process for testing these external libs, if you're making this a core part of your reliability effort?

> We are currently planning a complete migration of our internal APIs to a third-party independent service. This means if their system goes down, we lose the ability to do updates, but if our system goes down, we will have the ability to react quickly and reliably without being caught in a loop of collapsing infrastructure.

Ok, now tell me how you're going to test it. Changing architectures is fine, but until you're running drills of core services going down, you don't actually know you've mitigated the "loop of collapsing infrastructure" issue.

> Finally, we are making the DNS system itself run a local copy of all backup data with automatic failure detection. This way we can add yet another layer of redundancy and make sure that no matter what happens, systems within bunny.net remain as independent from each other as possible and prevent a ripple effect when something goes wrong.

Additional redundancy isn't a great way of mitigating issues caused by a change being deployed. Being 10x redundant usually adds quite a lot of complexity, provides less safety than it seems (again, do you have a plan to regularly test that this failover mode is working?) and can be less effective than preventing issues getting to prod.

What would be nice to see is a full review of the detection, escalation, remediation, and prevention for this incident.

More specifically, the triggering event here, the release of a new version of software, isn't super novel. More discussion of follow-ups that are systematic improvements to the release process would be useful. Some options:

- Replay tests to detect issues before landing changes

- Canaries to detect issues before pushing to prod

- Gradual deployments to detect issues before they hit 100%

- Even better, isolated gradual deployments (i.e. deploy region by region, zone by zone) to mitigate the risk of issues spreading between regions.

Beyond that, start thinking about all the changing components of your product, and their lifecycle. It sounds like here some data file got screwed up as it was changed. Do you stage those changes to your data files? Can you isolate regional deployments entirely, and control the rollout of new versions of this data file on a regional basis? Can you do the same for all other changes in your system?


This. I am not at all reassured that it won't happen again. Next week, perhaps.

Also, their DNS broke last month as well, but I guess we won't mention that as it would invalidate 2 years of stellar reliability


One of the comments on the post is:

> One thing you could do in future is to url redirect any BunnyCDN url back to the clients original url, in essence disabling the CDN and getting your clients own hosts do what they were doing before they connected to BunnyCDN, yes it means our sites won't be as fast but its better than not loading up the files at all. I wonder if that is possible in technical terms?

Isn't this a horrible idea? If you use Bunny, this would cause a major spike in traffic, and thus costs, from your origin server.


Doing this when you are intentionally trying to protect or hide your origin would effectively guarantee killing the origin. For example, if Cloudflare unproxied one of my subdomains, I'd leave them immediately, and would likely have to change all my infrastructure and providers due to attacks.

This is also a terrible idea because of ACLs/firewalls that only allow traffic from the CDN (this is extremely common for things like Cloudflare and Akamai) and because of relying on the CDN for access control.


Yeah please don't do this without me having checked a box to do that haha

There's a reason I use a CDN. Let me decide whether my site is up or down when the CDN is down. If I want failover, I'll do that bit myself.


Also, how would this work if their whole infrastructure is down? The same problem that prevented them from fixing the network would also have prevented them from adding such a redirect.


Sounds like they got really lucky they could get it back up so quickly. They must have some very talented engineers working there.

My takeaways, though, were that they should have tested the update better, they should have their production environment more segmented with staggered updates so that disasters are much more contained, and they should have had much better catastrophic-failure plans in place.


It was not that quick tbh. We were seeing intermittent issues for several hours after the initial problem arose.

It taught me a valuable lesson: make sure it is easy to switch to another CDN and to update cached/stored URLs.


They are a very small team. Their CEO codes!


I do! In fact, I work like 90 hours a week. I decided to go the bootstrap way (looking back, I'm not sure that was the best idea, but we are where we are), but we're growing 3X year over year, so things are picking up. :)


...and provides support too. Even to a low-traffic, low-spend user whose site generates just a few pennies of revenue for Bunny. Yet every time I made a support request, I got a response in minutes, often from Dejan himself. Highly recommend.


I can imagine how stressful the situation was, but it was a pleasure to read. It again goes to show that no matter how prepared, how optimized or over-optimized you want to be, there will always be a situation you have never accounted for, and sh*t always hits the fan; that is the reality of IT ops.


Slightly off-topic, has anyone else noticed higher latency with internet traffic going in or out of Germany? Just in general?

Frankfurt was mentioned in the post and I immediately thought it would be a bad idea because I’ve always seen USA to Germany traffic have higher latency. Maybe within Europe it’s fine.


I would think critical systems and updates should also have some form of out-of-band access channel?


Slightly off-topic, but what about the big outage from a few days/weeks ago where half the Internet was down (exaggerating only a little bit)? Has there been a postmortem I missed?



Fastly's RCA is underwhelming: no info on what the component was, what happened, or how the situation was tackled.


I think a public company will never dare go into as much detail as bunny did. Or maybe it is just the size of the organisation that discourages that.


Cloudflare's RFO blog posts are incredibly detailed. Each time I read one, I feel reasonably confident that they have learnt from the mistakes that led to that outage and that it shouldn't happen again.

https://blog.cloudflare.com/tag/outage/


Yeah, but Cloudflare is accountable to the US Congress; they'd better be transparent.


Not to say that additional mitigations are inappropriate, but a stack overflow when parsing a corrupt file sounds like something that could have easily been found by a fuzzer.
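For reference, Go's built-in fuzzing (Go 1.18+) makes this kind of test fairly cheap to write; a minimal sketch, with encoding/gob standing in for the real serialization library:

    package parser

    import (
        "bytes"
        "encoding/gob"
        "testing"
    )

    // FuzzDecode feeds arbitrary bytes to a decoder; a panic or stack
    // overflow on corrupt input fails the fuzz run immediately.
    func FuzzDecode(f *testing.F) {
        f.Add([]byte{})                      // seed corpus: empty input
        f.Add([]byte("not a valid payload")) // seed corpus: garbage
        f.Fuzz(func(t *testing.T, data []byte) {
            var out map[string][]string
            // Corrupt data must come back as an error, never a crash.
            _ = gob.NewDecoder(bytes.NewReader(data)).Decode(&out)
        })
    }

Run with `go test -fuzz=FuzzDecode`; the fuzzer keeps mutating inputs until it finds one that crashes the decoder.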


Happy BunnyCDN user here. Thanks for the writeup.

> Both SmartEdge and the deployment systems we use rely on Edge Storage and Bunny CDN to distribute data to the actual DNS servers. On the other hand, we just wiped out most of our global CDN capacity.

That’s the TLDR. What a stressful couple of hours that must have been for their team.


> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database.

That's around 4:25 a.m. EST. Are updates usually done around this time at other companies? It seems like that's cutting pretty close to the 8 a.m. mark when a lot of employees start working.

The details of the whole incident sound pretty terrifying, and I am inspired by how much pressure their admins were under while still getting it working again. Good work.


A CDN has a global target market; there will be someone starting work somewhere in the world at any given time.


They're based in Slovenia, so that was 10:25 AM local time for them.


I am going based off the map here; Europe and then North America are their biggest markets:

https://bunny.net/network

Seems like they were updating production during work hours for most people, which is pretty odd IMO. Usually I would expect them to get this done between midnight and 2-3 AM.


If you have global infrastructure with a worldwide customer base, you'd want to do critical upgrades when everyone's at the office, ready to jump on issues.


Is there a reason to assume EST?


I was mostly going based off of their majority market being Europe and North America.



