I know we normally never hear about this stuff from game companies at all, but it would be nice to have a little more detail.
> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.
The thing with using cloud services is that sometimes the hosting provider makes configuration changes that have real impact (monitoring software that steals precious CPU at seemingly random moments, or changes to network hardware that break connectivity), without it necessarily being under your own control. I've had a few high-visibility incidents where, after investigation, the buck stopped with us even though technically the studio could have shifted blame to an unanticipated change from the hosting provider.
Disclaimer: I currently work at Microsoft Game Studios, though this comment reflects my own opinions based on experiences unrelated to this incident.
Of course, that's totally understandable - I recently root-caused an incident to "AWS likely changed how this component behaves at some point in a particular two-month timespan."
But I think there's a lot more to be learned by sharing how STUN changed, what the new behaviour is, what the intent of the change was, how it was tested, etc.
For a counter-example showing the level of detail I'd like to see, I saw this [1] DataDog incident report go by on Twitter this morning. It's straight up awesome and more detailed than most of our internal incident reports. I definitely learned a lot from reading it.
> Finally, it was identified that there had been some updates behind the scenes to the servers we use to relay STUN traffic as part of the ICE process which had resulted in misconfigurations.
What updates? How were they tested (or not)?
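For anyone wondering what "servers we use to relay STUN traffic as part of the ICE process" even refers to: here's a minimal sketch of how a WebRTC-style client is typically pointed at STUN/TURN servers for ICE. The URLs and credentials are placeholders and this is generic browser WebRTC, not the game's actual stack - it's just to show why a relay-side misconfiguration tends to surface as players silently failing to connect rather than as an obvious error.

```typescript
// Minimal sketch: ICE server configuration in a WebRTC client.
// Server URLs and credentials below are placeholders, not anyone's real infrastructure.
const pc = new RTCPeerConnection({
  iceServers: [
    // STUN: only used to discover the client's public address (server-reflexive candidate).
    { urls: "stun:stun.example.com:3478" },
    // TURN: relays traffic when a direct peer-to-peer path can't be established.
    // A misconfiguration here (wrong realm, expired credentials, blocked ports)
    // typically shows up as connection failures for players behind strict NATs.
    {
      urls: "turn:turn.example.com:3478?transport=udp",
      username: "placeholder-user",
      credential: "placeholder-secret",
    },
  ],
});

// Log which candidate types get gathered; a missing "relay" type is a
// common symptom of a broken TURN/relay configuration.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    console.log(event.candidate.type, event.candidate.candidate);
  }
};
```

Which is exactly why the testing question matters: a bad relay config often passes casual testing on friendly networks (where direct or STUN-derived candidates succeed) and only bites the subset of players who actually need the relay.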