MCC (Master Chief Collection) Server Incident Summary

mrguyorama · on April 26, 2022

The most interesting part of this to me is that they have a path, both in code and in process, to fall back to peer-to-peer if stuff breaks. That's pretty impressive.

thatguy0900 · on April 26, 2022

Now that big game companies are really starting to shut down old servers en masse it should be the default, really. https://www.gamespot.com/articles/ubisoft-shuts-down-online-...

jasomill · on April 26, 2022

Also interesting and impressive: Halo multiplayer fans have modded the original Xbox version of Halo to act as a dedicated server for Xbox LAN multiplayer[1] that serves the same architectural role as 343's UHS does for MCC online play.

[1] http://halo1nhe.com

zymhan · on April 26, 2022

Now I'm very curious how the P2P matchmaking is bootstrapped.

AgentME · on April 27, 2022

The clients still talk to the official servers which run matchmaking and group up players together. The difference is whether the matchmaking servers tell all the players to connect to a dedicated gameserver or to connect to one of the players.
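To make that concrete, here is a minimal sketch of the decision as described above. It is not 343's actual code: `Player`, `assign_host`, the address format, and the "first player hosts" rule are all hypothetical, chosen only to illustrate the two paths.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Player:
    name: str
    address: str  # hypothetical "host:port" the other players can reach


def assign_host(players: List[Player], dedicated_server: Optional[str]) -> Dict:
    """Decide where a matched group should connect.

    If a dedicated server is available, everyone connects to it.
    Otherwise fall back to peer-to-peer: one player hosts (here, simply
    the first in the list, which is an assumption) and the rest connect
    to that player.
    """
    if not players:
        raise ValueError("need at least one player")

    if dedicated_server is not None:
        return {
            "mode": "dedicated",
            "host": dedicated_server,
            "connect": {p.name: dedicated_server for p in players},
        }

    host = players[0]
    return {
        "mode": "p2p",
        "host": host.address,
        # The hosting player does not need to connect to itself.
        "connect": {p.name: host.address for p in players[1:]},
    }
```

Either way the clients only ever talk to the matchmaking service first; the difference is just which address comes back in the reply.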

xmodem · on April 26, 2022

I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little bit more detail.

> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.

What updates? How were they tested (or not)?

Darkphibre · on April 26, 2022

The thing with using cloud services is that sometimes the hosting provider can make changes to configurations that have impacts (such as monitoring software that steals precious CPU at seemingly random moments, or network topology hardware that impacts connectivity)... without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.

Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.

xmodem · on April 26, 2022

Of course, that's totally understandable - I recently root caused an incident to "AWS likely changed how this component behaves at some point in a particular 2 month timespan."

But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.

For a counter-example of the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. This is straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.

1: https://www.datadoghq.com/blog/engineering/grpc-dns-and-load...

ozarker · on April 26, 2022

Really cool to see this level of transparency on an issue from a multiplayer game dev. Really cool write up

stryan · on April 26, 2022

Both Bungie and 343 have done an admirable job (well, compared to other devs) about explaining their network infrastructure etc. Back in the day they did a big talk about how their matchmaking in Halo2/3 worked that I think to this day is still one of the best methods of learning when you're not in the industry yet. I can't recall what it was called though: might be the "Chris Butcher - Recreating the LAN Party Online: The Networking and Social Infrastructure of Halo 2" GS talk but I can't listen right now to check

Darkphibre · on April 26, 2022

Along these lines is the venerable TRIBES Engine Networking Model whitepaper. It was so good, it was shipped as part of the XDK for a decade. I believe Bungie even leaned on it quite a bit when creating their networking stack. https://www.gamedevs.org/uploads/tribes-networking-model.pdf

Disclaimer: I work at Microsoft Game Studios, but this comment reflects my own opinions.

coldpie · on April 26, 2022

Only semi-related, but there was a recent excellent, short podcast interview series with one of Halo 2's multiplayer designers: https://smarturl.it/H2Pod

xeromal · on April 26, 2022

Halo 2 was my favorite experience of FPS multiplayer in my life so thank you for sharing!

maicro · on April 26, 2022

And similar to my sibling comment, mostly for anyone else browsing through - the comments on this recent HN post have a lot of resources about game dev stuff in general, which might be of interest (though I also cannot promise directly related X) ): https://news.ycombinator.com/item?id=31084779

Jasper_ · on April 27, 2022

This talk from the Halo: Reach team might be one of my favorite game networking talks I've ever seen: https://www.youtube.com/watch?v=h47zZrqjgLc

auto · on April 26, 2022

Given that the error resulted from the STUN and ICE servers, which from my understanding exist solely to play a part in the NAT punching process, would this entire situation have been mitigated if things were end-to-end IPv6?

pilif · on April 26, 2022

Only in theory. In practice even when IPv6 is in use, people have stateful firewalls that will drop unsolicited connection attempts.

On IPv4, UPnP and NAT-PMP have widespread support in routers. For IPv6 there are protocols that let clients reconfigure the router with ephemeral firewall rules, but they are not widespread and support is very spotty.

So in practice, users with just IPv6 would have the exact same problems and would be even more likely to depend on STUN and ICE because their firewalls likely won’t support client-side hole-punching correctly.
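A toy model of the stateful-firewall point, as a rough sketch only. `StatefulFirewall` is a made-up class, the addresses are from the IPv6 documentation range, and real firewalls also track protocol, timeouts, and more. It shows why an unsolicited inbound packet is dropped even without NAT, and why both peers sending outbound first (hole punching) lets traffic through.

```python
from typing import Set, Tuple

Endpoint = Tuple[str, int]  # (address, port)


class StatefulFirewall:
    """Simplified model: inbound is allowed only for flows opened from inside."""

    def __init__(self) -> None:
        self._flows: Set[Tuple[Endpoint, Endpoint]] = set()

    def outbound(self, local: Endpoint, remote: Endpoint) -> bool:
        # Outbound traffic is always allowed and creates state for replies.
        self._flows.add((local, remote))
        return True

    def inbound(self, remote: Endpoint, local: Endpoint) -> bool:
        # Inbound traffic passes only if it matches an existing outbound flow.
        return (local, remote) in self._flows


peer_a: Endpoint = ("2001:db8::a", 3074)
peer_b: Endpoint = ("2001:db8::b", 3074)
fw_a, fw_b = StatefulFirewall(), StatefulFirewall()

# Unsolicited connection attempt from A to B: dropped by B's firewall.
unsolicited_ok = fw_b.inbound(peer_a, peer_b)

# Hole punch: both sides send outbound first, so each firewall has state.
fw_a.outbound(peer_a, peer_b)
fw_b.outbound(peer_b, peer_a)
punched_ok = fw_a.inbound(peer_b, peer_a) and fw_b.inbound(peer_a, peer_b)
```

The catch is that each peer still has to learn the other's address and agree on when to send, which is the part that needs some rendezvous service in the middle.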

xeromal · on April 26, 2022

Wow, that was a fun read! I don't envy the people who had to stare at wireshark logs for 3 days though. Oof.

gundmc · on April 26, 2022

I don't recall seeing this sort of post-mortem from a gaming provider before. Really cool to see! Kudos, Halo team and Microsoft!

srmn · on April 26, 2022

Roblox did a very interesting and similar write up at the beginning of this year regarding a large outage they experienced at the end of 2021: https://blog.roblox.com/2022/01/roblox-return-to-service-10-...

BaconPackets · on April 27, 2022

The root cause is REALLY surprising. If it's really an unrelated change to the NAT/STUN relay server, it means that there was a pretty broad lack of change management framework.

darknavi · on April 26, 2022

Interesting that they call their service UDS. I was under the impression that they used PlayFab.

tehbeard · on April 26, 2022

MCC probably has an "interesting" architecture as it's a combination of Halo games that spans a few generations that makes something like playfab a bit too restrictive to work in.

sgtfrankieboy · on April 26, 2022

Halo Master Chief Collection had already been out for 4 years (2014) when MSFT bought PlayFab (2018).

wyldfire · on April 26, 2022

So if I wanted to refer to a group of them, would they be called 'Masters Chief'?

seizethegdgap · on April 26, 2022

I don't believe so. The full E-9 "title" in the US Navy is Master Chief Petty Officer, so plural would be Master Chief Petty Officers.

verall · on April 26, 2022

Anyone else having issues viewing this on Firefox? I can see the whole webpage for an instant, then everything disappears, then it is "slowing down my browser".

abbeyj · on April 26, 2022

I am but only with uBlock Origin turned on. uBlock Origin blocks something and then the page goes into an infinite loop requesting https://wpcontent.svc.halowaypoint.com/purchase-content/game... and https://wpcontent.svc.halowaypoint.com/content-ratings/esrb/... over and over again. It is also rapidly using up memory so eventually you run low and the whole browser starts performing poorly. Whitelisting the site in uBlock Origin is a workaround.

zymhan · on April 26, 2022

Odd, maybe another extension in combination is causing that issue? I'm running FF 100.0 (wow, I remember FF 3.5) with UBO and Privacy Badger.

spartanatreyu · on April 26, 2022

I am running FF 99 with only uBlock Origin and I had no slowdowns

weberer · on April 26, 2022

https://archive.ph/tJcbd

tjpnz · on April 26, 2022

Loaded fine in Firefox for Android. Shame the text size makes the post unreadable.