The most interesting part of this to me is that they have a path, both in code and in process, to fall back to peer-to-peer if stuff breaks. That's pretty impressive.
Also interesting and impressive: Halo multiplayer fans have modded the original Xbox version of Halo to act as a dedicated server for Xbox LAN multiplayer[1] that serves the same architectural role as 343's UHS does for MCC online play.
The clients still talk to the official servers which run matchmaking and group up players together. The difference is whether the matchmaking servers tell all the players to connect to a dedicated gameserver or to connect to one of the players.
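To make that difference concrete, here is a minimal sketch of the decision in Python. Everything in it is hypothetical (the `Player` type, `pick_host`, the addresses, and the "first player hosts" rule); it is not 343's actual matchmaking code, just an illustration of "dedicated server if available, otherwise one of the players".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Player:
    name: str
    address: str  # hypothetical "ip:port" that other players would connect to


def pick_host(players: List[Player], dedicated_server: Optional[str]) -> str:
    """Return the address every matched player is told to connect to."""
    if not players:
        raise ValueError("cannot host a match with no players")
    if dedicated_server is not None:
        # Normal path: everyone connects to the dedicated game server.
        return dedicated_server
    # Fallback path: no dedicated server, so one of the players hosts.
    # Picking the first player is a placeholder assumption; a real service
    # would presumably weigh connection quality, NAT type, and so on.
    return players[0].address
```

Either way the clients talk to matchmaking first; only the address they get back changes.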
I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little bit more detail.
> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.
The thing with using cloud services is that sometimes the hosting provider can make changes to configurations that have impacts (such as monitoring software that steals precious CPU at seemingly random moments, or network topology hardware that impacts connectivity)... without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.
Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.
Of course, that's totally understandable - I recently root caused an incident to "AWS likely changed how this component behaves at some point in a particular 2 month timespan."
But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.
For a counter-example of the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. This is straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.
Both Bungie and 343 have done an admirable job (well, compared to other devs) of explaining their network infrastructure etc. Back in the day they did a big talk about how their matchmaking in Halo 2/3 worked that I think to this day is still one of the best ways of learning when you're not in the industry yet. I can't recall what it was called though: might be the "Chris Butcher - Recreating the LAN Party Online: The Networking and Social Infrastructure of Halo 2" GS talk, but I can't listen right now to check.
Along these lines is the venerable TRIBES Engine Networking Model whitepaper. It was so good, it was shipped as part of the XDK for a decade. I believe Bungie even leaned on it quite a bit when creating their networking stack.
https://www.gamedevs.org/uploads/tribes-networking-model.pdf
Disclaimer: I work at Microsoft Game Studios, but this comment reflects my own opinions.
Only semi-related, but there was a recent excellent, short podcast interview series with one of Halo 2's multiplayer designers: https://smarturl.it/H2Pod
And similar to my sibling comment, mostly for anyone else browsing through - the comments on this recent HN post have a lot of resources about game dev stuff in general, which might be of interest (though I also cannot promise directly related X) ): https://news.ycombinator.com/item?id=31084779
Given that the error resulted from the STUN and ICE servers, which from my understanding exist solely to play a part in the NAT punching process, would this entire situation have been mitigated if things were end-to-end IPv6?
Only in theory. In practice even when IPv6 is in use, people have stateful firewalls that will drop unsolicited connection attempts.
On IPv4, UPnP and NAT-PMP have widespread support in routers. For IPv6 there are protocols that let clients reconfigure the router with ephemeral firewall rules, but they are not widespread and support is very spotty.
So in practice, users with just IPv6 would have the exact same problems and would be even more likely to depend on STUN and ICE, because their firewalls likely won’t support client-side hole-punching correctly.
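For anyone wondering what a client actually sends to a STUN server during this process, here is a rough Python sketch of the 20-byte Binding Request header as I understand it from the STUN RFC: a 2-byte message type, a 2-byte length, the fixed magic cookie, and a 12-byte transaction ID. The function names are my own, and it only builds and parses the header; it does not send anything or handle attributes.

```python
import os
import struct

STUN_BINDING_REQUEST = 0x0001   # message type for a Binding Request
STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined by the STUN spec
STUN_HEADER_LEN = 20


def build_binding_request(transaction_id: bytes = None) -> bytes:
    """Build a STUN Binding Request with no attributes (header only)."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    # Network byte order: type (2 bytes), attribute length (2), cookie (4).
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id


def parse_header(packet: bytes):
    """Return (message_type, attribute_length, transaction_id)."""
    if len(packet) < STUN_HEADER_LEN:
        raise ValueError("packet shorter than a STUN header")
    msg_type, length, cookie = struct.unpack("!HHI", packet[:8])
    if cookie != STUN_MAGIC_COOKIE:
        raise ValueError("magic cookie mismatch, not a STUN message")
    return msg_type, length, packet[8:20]
```

The server's reply carries the address it saw the request come from, which is what the client then offers to peers as a candidate. If a stateful firewall drops the peer's unsolicited packets anyway, knowing that address is not enough on its own, which is the point being made above.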
The root cause is REALLY surprising. If it's really an unrelated change to the NAT/STUN relay server, it means that there was a pretty broad lack of change management framework.
MCC probably has an "interesting" architecture, since it combines Halo games spanning a few generations, which makes something like PlayFab a bit too restrictive to work in.
Anyone else having issues viewing this on Firefox? I can see the whole webpage for an instant, then everything disappears, then it is "slowing down my browser".