MCC (Master Chief Collection) Server Incident Summary (halowaypoint.com)
120 points by DivisionSol on April 26, 2022 | 32 comments



The most interesting part of this to me is that they have a path, both in code and in process, to fall back to peer-to-peer if stuff breaks. That's pretty impressive.


Now that big game companies are really starting to shut down old servers en masse it should be the default, really. https://www.gamespot.com/articles/ubisoft-shuts-down-online-...


Also interesting and impressive: Halo multiplayer fans have modded the original Xbox version of Halo to act as a dedicated server for Xbox LAN multiplayer[1] that serves the same architectural role as 343's UHS does for MCC online play.

[1] http://halo1nhe.com


Now I'm very curious how the P2P matchmaking is bootstrapped.


The clients still talk to the official servers which run matchmaking and group up players together. The difference is whether the matchmaking servers tell all the players to connect to a dedicated gameserver or to connect to one of the players.
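To make that concrete, here is a minimal sketch of the decision described above. Everything in it is an assumption for illustration: the names `Player`, `SessionAssignment`, and `assign_session` are hypothetical, and picking the first player as host stands in for whatever host selection the real service does.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Player:
    name: str
    address: str  # hypothetical "host:port" string for this player


@dataclass(frozen=True)
class SessionAssignment:
    mode: str          # "dedicated" or "p2p"
    host_address: str  # where every client in the group should connect


def assign_session(players: Sequence[Player],
                   dedicated_server: Optional[str]) -> SessionAssignment:
    """Tell a matched group where to connect.

    dedicated_server is the address of an available game server,
    or None when dedicated hosting is unavailable or disabled.
    """
    if not players:
        raise ValueError("cannot assign a session to an empty group")
    if dedicated_server is not None:
        return SessionAssignment("dedicated", dedicated_server)
    # Fallback: one of the players hosts. Choosing the first player is a
    # placeholder (assumption); a real service would likely weigh NAT type,
    # bandwidth, and latency.
    return SessionAssignment("p2p", players[0].address)
```

In both cases matchmaking itself still runs on the official servers; only the address handed back to the clients changes.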


I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little bit more detail.

> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.

What updates? How were they tested (or not)?


The thing with using cloud services is that sometimes the hosting provider can make changes to configurations that have impacts (such as monitoring software that steals precious CPU at seemingly random moments, or network topology hardware that impacts connectivity)... without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.

Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.


Of course, that's totally understandable - I recently root caused an incident to "AWS likely changed how this component behaves at some point in a particular 2 month timespan."

But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.

For a counter-example of the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. This is straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.

1: https://www.datadoghq.com/blog/engineering/grpc-dns-and-load...


Really cool to see this level of transparency on an issue from a multiplayer game dev. Really cool write up


Both Bungie and 343 have done an admirable job (well, compared to other devs) about explaining their network infrastructure etc. Back in the day they did a big talk about how their matchmaking in Halo2/3 worked that I think to this day is still one of the best methods of learning when you're not in the industry yet. I can't recall what it was called though: might be the "Chris Butcher - Recreating the LAN Party Online: The Networking and Social Infrastructure of Halo 2" GS talk but I can't listen right now to check


Along these lines is the venerable TRIBES Engine Networking Model whitepaper. It was so good, it was shipped as part of the XDK for a decade. I believe Bungie even leaned on it quite a bit when creating their networking stack. https://www.gamedevs.org/uploads/tribes-networking-model.pdf

Disclaimer: I work at Microsoft Game Studios, but this comment reflects my own opinions.


Only semi-related, but there was a recent excellent, short podcast interview series with one of Halo 2's multiplayer designers: https://smarturl.it/H2Pod


Halo 2 was my favorite experience of FPS multiplayer in my life so thank you for sharing!


And similar to my sibling comment, mostly for anyone else browsing through - the comments on this recent HN post have a lot of resources about game dev stuff in general, which might be of interest (though I also cannot promise directly related X) ): https://news.ycombinator.com/item?id=31084779


This talk from the Halo: Reach team might be one of my favorite game networking talks I've ever seen: https://www.youtube.com/watch?v=h47zZrqjgLc


Given that the error resulted from the STUN and ICE servers, which from my understanding exist solely to play a part in the NAT punching process, would this entire situation have been mitigated if things were end-to-end IPV6?


Only in theory. In practice even when IPv6 is in use, people have stateful firewalls that will drop unsolicited connection attempts.

IPv4 has UPnP and NAT-PMP, which are widely supported in routers. For IPv6 there are protocols that let clients reconfigure the router with ephemeral firewall rules, but they are not widespread and support is very spotty.

So in practice, users with just IPv6 would have the exact same problems and would be even more likely to depend on STUN and ICE, because their firewalls likely won’t support client-side hole-punching correctly.
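For anyone unfamiliar with what the STUN piece actually does: the client sends a small Binding request to a server, and the response reports the public address and port the request arrived from, which ICE then offers to the other peer as a candidate. Below is a rough sketch of the wire format as I understand it from RFC 5389 (20-byte header, magic cookie 0x2112A442, XOR-MAPPED-ADDRESS attribute). The function names are my own, it only handles IPv4, and it does no networking or attribute walking.

```python
import ipaddress
import os
import struct
from typing import Optional, Tuple

MAGIC_COOKIE = 0x2112A442   # fixed value in every STUN header (RFC 5389)
BINDING_REQUEST = 0x0001    # message type: Binding request


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """Return a 20-byte STUN Binding request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    # Header layout: type (2 bytes), body length (2), magic cookie (4).
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id


def decode_xor_mapped_ipv4(value: bytes) -> Tuple[str, int]:
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    if len(value) < 8:
        raise ValueError("attribute value too short")
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only the IPv4 family (0x01) is handled here")
    # The port is XORed with the top 16 bits of the cookie,
    # the address with the whole cookie.
    port = xport ^ (MAGIC_COOKIE >> 16)
    addr = xaddr ^ MAGIC_COOKIE
    return str(ipaddress.IPv4Address(addr)), port
```

The relevant point for the IPv6 question is that nothing here is specific to NAT: a client behind a stateful IPv6 firewall still needs some way to coordinate outbound packets with the other peer, which is the job ICE does on top of this.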


Wow, that was a fun read! I don't envy the people who had to stare at wireshark logs for 3 days though. Oof.


I don't recall seeing this sort of post-mortem from a gaming provider before. Really cool to see! Kudos, Halo team and Microsoft!


Roblox did a very interesting and similar write up at the beginning of this year regarding a large outage they experienced at the end of 2021: https://blog.roblox.com/2022/01/roblox-return-to-service-10-...


The root cause is REALLY surprising. If it's really an unrelated change to the NAT/STUN relay server, it means that there was a pretty broad lack of change management framework.


Interesting that they call their service UDS. I was under the impression that they used PlayFab.


MCC probably has an "interesting" architecture, as it's a combination of Halo games that span a few generations, which makes something like PlayFab a bit too restrictive to work in.


Halo: The Master Chief Collection had already been out for 4 years (released 2014) when MSFT bought PlayFab (2018).


So if I wanted to refer to a group of them, would they be called 'Masters Chief'?


I don't believe so. The full E-9 "title" in the US Navy is Master Chief Petty Officer, so the plural would be Master Chief Petty Officers.


Anyone else having issues viewing this on Firefox? I can see the whole webpage for an instant, then everything disappears, then it is "slowing down my browser".


I am but only with uBlock Origin turned on. uBlock Origin blocks something and then the page goes into an infinite loop requesting https://wpcontent.svc.halowaypoint.com/purchase-content/game... and https://wpcontent.svc.halowaypoint.com/content-ratings/esrb/... over and over again. It is also rapidly using up memory so eventually you run low and the whole browser starts performing poorly. Whitelisting the site in uBlock Origin is a workaround.


Odd, maybe another extension in combination is causing that issue? I'm running FF 100.0 (wow, I remember FF 3.5) with UBO and Privacy Badger.


I am running FF 99 with only uBlock Origin and I had no slowdowns



Loaded fine in Firefox for Android. Shame the text size makes the post unreadable.



