Oh, didn't realise they had renamed themselves until I saw this post. Reading the article on why, the reason makes sense, though I never made that connection until reading the article. They could have picked a less generic new name though, feels a bit like the naming equivalent of the flat logos which currently have an article on the front page too.
I don't know if Paul was the only guy working on the network code, but I believe he was responsible for most or all of it.
He wrote that code when he was 18 or 19 years old I believe. But unlike some of the other devs at Ensemble, he never had a big ego, and he was just generally nice and humble.
He (much) later created the early mobile hit, Words With Friends, which had many copycats later.
As a fairly high rated AoE2 player (~1800), I can attest that the network architecture has not stood the test of time. Lockstep causes tonnes of problems. Desynchronisation is still possible, when that happens the simulation needs to backtrack (I think?), which can cause units to jump multiple tiles. Because your machine is resolving the movements of other players, you run into problems with more players. The wire protocol is widely known and easily hacked, in a way that's hard to detect in game.
AoE2 DE is still lockstep, but through a server. All players are connected to the same server. This has slightly improved on the previous P2P, but not really addressed anything.
> Desynchronisation is still possible, when that happens the simulation needs to backtrack (I think?
Sounds more like client-side prediction to smooth things than actual simulation desync. I have a hard time believing a deterministic game with such a large state was able to backtrack and resync the sim back to determinism. I did not think that lockstep RTS games would need client-side prediction (the indirect and long-term commands in RTS help hide latency), but I guess if your gameplay lends itself to high APM then it becomes necessary.
Well you could just have state snapshots taken at regular intervals and then verify that both sides' hashes agree. It's only a couple of thousand entities so it's really not so bad. You can probably ignore those that haven't deviated from the previous snapshot (and that would account for a state reconstruction taking time).
RTS games have had replay mechanisms at least as far back as StarCraft: Brood War, so a journal of player inputs is likely going to be recorded anyway.
Yes the journal of player inputs sure, but the intermediate states are a different matter. AoE2's state size is peanuts for a modern machine, but I would say that at the time, it was quite significant and it would be too costly to store it on the fly. I certainly did not dare try that in the RTS-like deterministic games I worked on (Commandos and Praetorians).
We're still talking about a few dozen kilobytes of data here. A dozen or so bytes worth of global state per player, up to 200 units per player with a few bytes worth of state (position, order, action, action target, hitpoints), maybe a hundred projectiles, order of 1000 static entities with just hitpoints.
Gotta keep in mind that these games were written in languages that did not have modern garbage collection, so almost certainly stored entity information in arrays to avoid heap fragmentation and malloc costs.
A few dozen KB is far beyond what you can push over a modem in real time for sure, but memcpy'ing a couple of kilobytes' worth of arrays was still plenty fast in the late '90s/early 2000s.
It's actually a lot more: an uncompressed world state from a recorded game is 1.6 MB. Compressed (AoE uses deflate) it's only 153 KB, but that's still a lot.
I think they have checksums every once in a while over their world state, fog of war state etc, and if these checksums don't match it desyncs. Then it creates an out of sync save, probably for just before the desync occurred.
Nah, the desync is when two floating point operations do not produce the same outcome; the checksum is detecting when that butterfly has caused a measurable thunderstorm of diverging game states. That can happen fairly late, depending on what is hashed.
The strategy of occasional save-game storage and backtracking only works if the cause is rare and not deterministic.
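To make the "two floating point operations do not produce the same outcome" point concrete, here is a tiny sketch. It only shows that summing the same values in a different order can change the last bit of the result; how a real engine ends up with a different order (compiler, CPU flags, code path) is outside what it demonstrates, and the helper name is made up.

```python
def total(values):
    """Sum floats strictly left to right."""
    s = 0.0
    for v in values:
        s += v
    return s

# Same three numbers, different order of addition.
forward = total([0.1, 0.2, 0.3])
backward = total([0.3, 0.2, 0.1])

# The results differ in the last bit, which is all a lockstep sim needs to start drifting.
print(forward == backward)
print(abs(forward - backward))
```

A difference this small is invisible at first, which fits the remark that the checksum only notices once the states have diverged measurably.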
I'd need to look into it again but I think pretty much everything object state wise gets hashed.
Edit: The checksum for the player includes the content of each attribute of the player, the object state for each object owned by the player, the master object id of that object, the amount of attributes they carry (which is I think resources that villagers carry for example) and the world x/y/z position.
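A minimal sketch of that kind of per-player checksum, for illustration only. The field names, the packing layout and the choice of CRC32 are my assumptions, not the game's actual format or algorithm.

```python
import struct
import zlib
from dataclasses import dataclass

@dataclass
class Unit:
    master_id: int   # hypothetical: id of the object's type
    state: int       # hypothetical: current object state
    x: float
    y: float
    z: float

def player_checksum(attributes, units):
    """CRC32 over a player's attributes and owned objects, visited in a fixed order."""
    crc = 0
    for value in attributes:
        crc = zlib.crc32(struct.pack("<f", value), crc)
    for u in units:
        packed = struct.pack("<iifff", u.master_id, u.state, u.x, u.y, u.z)
        crc = zlib.crc32(packed, crc)
    return crc

def first_desync(local_sums, remote_sums):
    """Return the first turn whose checksums differ, or None if all agree."""
    for turn, (a, b) in enumerate(zip(local_sums, remote_sums)):
        if a != b:
            return turn
    return None
```

Each client would compute this every so often and exchange only the integer; comparing the two lists tells you roughly when the states parted ways, not why.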
Total Annihilation was asynchronous, Supreme Commander was synchronous lockstep (sending user interactions only), Planetary Annihilation said they were client-server, with the server only sending updates for units a player could see.
While lockstep is useful for some games, it was never really a good choice for fighting games due to the fast and timing sensitive nature. Rollback is an improvement over it that adds speculative execution for remote inputs, this is what should be the standard, although Japanese developers have only recently gotten on board with it.
I guess it depends on how high a level you're looking at it from. Both lockstep and rollback are variants of the general approach of "shared deterministic simulation" where you're mostly just sending inputs. ("Mostly" 'cuz on reconnect or desync or whatever you'll probably want to send full state.)
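To show how rollback sits on top of the same "deterministic sim plus inputs" idea, here is a toy sketch. It assumes remote inputs are confirmed in frame order and predicts that the remote player repeats their last known input; the class and method names are invented, and real implementations are far more involved.

```python
def step(state, local_input, remote_input):
    """Deterministic toy simulation: state is a pair of positions."""
    return (state[0] + local_input, state[1] + remote_input)

class RollbackSession:
    def __init__(self):
        self.snapshots = [(0, 0)]   # snapshots[f] is the state before frame f
        self.local_inputs = []
        self.remote_inputs = []     # guessed or confirmed, one entry per frame

    def advance(self, local_input):
        """Run one frame immediately, guessing the remote input."""
        guess = self.remote_inputs[-1] if self.remote_inputs else 0
        self.local_inputs.append(local_input)
        self.remote_inputs.append(guess)
        self.snapshots.append(step(self.snapshots[-1], local_input, guess))

    def confirm(self, frame, remote_input):
        """Real remote input for `frame` arrived; rewind and resimulate if needed."""
        old = self.remote_inputs[frame:]
        # The confirmed value also becomes the guess for all later frames.
        self.remote_inputs[frame:] = [remote_input] * len(old)
        if old == self.remote_inputs[frame:]:
            return False            # prediction was right, nothing to redo
        for f in range(frame, len(self.local_inputs)):
            self.snapshots[f + 1] = step(
                self.snapshots[f], self.local_inputs[f], self.remote_inputs[f]
            )
        return True

    @property
    def state(self):
        return self.snapshots[-1]
```

Pure lockstep would instead wait in advance() until the remote input for that frame is known, which is where the added input delay comes from.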
I really enjoyed this GDC talk on rollback networking:
One improvement is to train a model on the player inputs and thus predict behavior. It takes some time, but if done right, the game after a while feels really "instantaneous" even when it comes to fast paced high-apm action.
In the section "Things That Went Wrong Or We Could Have Done Better", it mentions as the final point an insight I often think about:
"8. We didn’t take enough advantage of automated testing. In the final weeks of development, we set up the game to automatically play up to eight computers against each other. Additionally, a second computer containing the development platform and debugger could monitor each computer that took part. These games, while randomly generated, were logged so that if anything happened, we could reproduce the exact game over and over until we isolated the problem. The games themselves were allowed to run at an accelerated speed and were left running overnight. This was a great success and helped us in isolating very hard to reproduce problems. Our failure was in not doing this earlier in development; it could have saved us a great deal of time and effort. All of our future production plans now include automated testing from Day One."
It is true that actual interaction is not tested in this way. Still, AI actions are not all it tests: For example, just by keeping such games running for a long time, it implicitly also tests resource management (such as garbage collection), robustness of the networking code, synchronization etc.
What I found appealing in the description is the idea of keeping your program running for a long time and have it perform all kinds of actions automatically. Nowadays, I often hear about "test-driven development", "unit tests" etc., and it often turns out that they test very specific things out of a vast universe of all possible things. That does not mean that they are useless. As I see it, it means - as they phrased it in this article - that we often do not take enough advantage of automated testing.
Personally, when I test software, I always try to follow the general idea stated in the post mortem: Keep it running, and perform all kinds of actions automatically. I found several crashes and memory leaks in this way, which were not noticed during manual interaction because they only became significant when the actions were repeated thousands of times.
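The "randomly generated but logged, so we could reproduce the exact game" part is easy to sketch. Everything here is a toy stand-in: the game, the deliberate bug and the function names are made up to show the shape of a seed-logged soak test, not how Ensemble did it.

```python
import random

def play_game(seed, turns=1000):
    """Toy AI-vs-AI game; returns the turn where an invariant broke, or None."""
    rng = random.Random(seed)   # all randomness comes from the logged seed
    gold = 100
    for turn in range(turns):
        action = rng.choice(["gather", "build", "idle"])
        if action == "gather":
            gold += 10
        elif action == "build":
            gold -= 50          # deliberate bug: never checks affordability
        if gold < 0:
            return turn
    return None

def soak(seeds):
    """Run many games unattended and log which seeds failed and when."""
    failures = {}
    for seed in seeds:
        failed_at = play_game(seed)
        if failed_at is not None:
            failures[seed] = failed_at
    return failures

failures = soak(range(20))
```

Any seed in `failures` can be fed back into play_game under a debugger and will break at the same turn every time, which is the property the post mortem describes.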
Based on my personal experience of automated testing in AAA games, even if you "only" count the overlap with the actions humans take, it's a huge overlap in practice. And while there are edge cases only humans find, automated testing finds bugs humans don't.
The point of that section is that the earlier you find bugs, the quicker you fix them, and you end up with less overall pain for the length of the project.
This article paved the way. FWIW I had written a fully deterministic (non networked though) game engine around... 1991.
HN user "dfan" here wrote a fully deterministic game engine for the DOS game Terra Nova in 1996.
There were probably others, but resources about deterministic game engines were very rare because hardly anybody had one, so this article about AoE was a gold mine for many.
Some terms may have changed, but the techniques are basically identical.
Blizzard in 2002 with Warcraft III had basically solved that issue for RTS once and forever too (and maybe for Starcraft before that?) and not much has changed since then.
What specifically is outdated? If you look under the hood of modern RTS games they're all going to look pretty similar, and this article has a bunch of info Gaffer's doesn't. I would read both.
For example you really don't need to enforce a 200ms turn rate, and their notion of what makes a 'communications turn' is unclear given the terms gamedevs use nowadays.
Yeah, I used 30hz for my turn rate, but the principles are the same.
The terminology is dated, but there's some real gold in there. The bits about deer facing snowballing into out-of-sync errors, out-of-simulation code corrupting simulation state, comparing state dumps... this is exactly what your life will be like if you build your games this way. I know this because I have (although complicated by the addition of rollback).
Subspace was another impressive one from around that era. 50-100 concurrent players on 28.8/56k.
They used a totally different technique (dead reckoning + game design that was about prediction) but it was pretty smooth given what average pings were back then.
The techniques used in the game ended up being replicated in many other games and apps. Its networking was also key to the low lag: it is entirely based on a custom UDP network stack that can send both reliable and unreliable packets.
It’s more sad than amazing. The amount of computing resources we waste for no real reason. We have crazy fast technology that runs software just as slow as they did in the 2000s.
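For anyone who hasn't seen dead reckoning, a generic sketch of the idea follows. This is the textbook version, not Subspace's actual code; the packet fields and the threshold value are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ShipUpdate:
    t: float    # time stamp of the update, in seconds
    x: float
    y: float
    vx: float   # velocity, in units per second
    vy: float

def extrapolate(last, now):
    """Receiver side: guess the position at `now` from the last update."""
    dt = now - last.t
    return (last.x + last.vx * dt, last.y + last.vy * dt)

def needs_update(last_sent, true_x, true_y, now, threshold=5.0):
    """Sender side: send a new packet only once the receivers' guess has drifted too far."""
    gx, gy = extrapolate(last_sent, now)
    drift = ((true_x - gx) ** 2 + (true_y - gy) ** 2) ** 0.5
    return drift > threshold
```

A ship flying in a straight line costs almost no bandwidth this way, which is part of why game design that rewards predictable movement helps.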
Not having to care about performance (because hardware is fast) allows us to build more software than ever before. Although optimized software saves on computing resources, it is wasteful on the most important resource of all: Software Engineers' Time. Not to say it's not important, but saying we are wasting computing resources for "no real reason" is not fair. There are a lot of delicate trade-offs involved.
That’s also the sad part. It’s no longer about taking pride in one’s craftsmanship and caring about the end user’s experience. It’s about cranking out janky products to make money.
No, it's about delivering products faster or prettier.
There are fields where squeezing every tiny bit of performance matters — media compression, 3D software (including game engines), data processing, hardware.
Engineering is not art, it's about building things that can be useful.
Unfortunately as much as I love it, the new AoE2 DE has fallen to the same trap. Old AoE2 uses a fraction of the power (and disk space) compared to the new one.
Eh I disagree. On the same computer original aoe2 would lag like crazy even on a LAN with 4-5 players and pop. limit to 200, with AoE2 DE it's muuuch smoother
I remember that BBSes added SLIRP and PPP in order to dial in and access their Internet connection. It was mostly pay BBSes with MajorBBS or some commercial software. We played Doom that way as well.
My claim to fame was a 4-way deathmatch with Thresh [1], myself, and two other randos (like me) on Doom 2, I think the map E2M7? The rules were first to a score of 100, rocket launchers only. Final score: Thresh, 100; me, 7; other opponents -3 and -6.
The mention of Doom reminds me - games with this form of perfect synchronization allowed for VERY small replays - basically all it needed to record was keypresses and time stamps.
Unfortunately it also struggles with patching, because due to balance changes the commands don't work anymore as they are supposed to. For example back in the day Starcraft balanced Spawning Pool to cost 200 minerals instead of 150, to delay early zergling rushes by several seconds (which makes a massive difference). But if watching an old replay, it will still try to build the Spawning Pool at 150 minerals and from there everything breaks.
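A toy illustration of why input-only replays break across balance patches. The economy, tick rate and command name are invented; only the 150 and 200 mineral costs come from the example above.

```python
def simulate(commands, pool_cost):
    """Replay recorded (tick, command) pairs against a toy economy."""
    minerals = 50
    log = []
    queue = dict(commands)
    for tick in range(max(queue) + 1):
        minerals += 10                  # fixed income per tick
        if queue.get(tick) == "build_pool":
            if minerals >= pool_cost:
                minerals -= pool_cost
                log.append((tick, "pool started"))
            else:
                log.append((tick, "not enough minerals"))
    return log

# The replay file holds only the player's inputs, not the resulting state.
replay = [(9, "build_pool")]
old_patch = simulate(replay, pool_cost=150)
new_patch = simulate(replay, pool_cost=200)
```

The same recorded input succeeds under the old rules and fails under the new ones, and every later command in the replay then runs against a state the player never saw. That is why replays usually need to be tied to the game version that recorded them.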
I recall using SLIRP back in the day with a dialup ISP that provided shell accounts in the base package, but charged extra for PPP (IIUC this was a fairly common business model). With SLIRP I was able to access internet directly from my PC without paying for PPP access.
"Part of the difficulty was conceptual -- programmers were not used to having to write code that used the same number of calls to random within the simulation (yes, the random numbers were seeded and synchronized as well)."
What do they mean by "same number of calls to random within the simulation"? I didn't get it.
The random number generator needs to give the same outcomes on both ends so that you don’t need to transfer results. To this end, you have to use the same deterministic generator code, seed and ensure you call it the same number of times on each system. This last point is because queries to a random number generator also change its state. If system A and B start with some seed, but system A calls it 30 times and system B 29 times, the next call will yield a different result on each system.
I suspect it pertains to the fact that most RNGs tend to be pseudo-random number generators that maintain an internal state. Each call of the RNG mutates the internal state. If you have two processes, one calls the RNG 2 times and the other calls it 5 times, the process states will have diverged and the simulation is no longer consistent between them. This is true even if the simulation ends up using 1 value out of those RNG calls, because the RNG state itself is implicitly part of the simulation.
If you call a pseudo-random number generator, you get the next random number in a strictly defined sequence of numbers derived from the seed. So the generator has state, and needs to be called exactly the same number of times in the same order on all clients to make sure that the random events happen the same on all clients.
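A quick way to see it, using Python's standard generator as a stand-in for whatever the game used. The `extra_calls` parameter is a made-up way to model one client making an additional call the other doesn't, say for a cosmetic effect.

```python
import random

def run(seed, extra_calls):
    """Return the next five 'simulation' rolls after some extra RNG calls."""
    rng = random.Random(seed)
    for _ in range(extra_calls):
        rng.random()    # a call the other client never makes
    return [rng.random() for _ in range(5)]

client_a = run(42, extra_calls=0)
client_b = run(42, extra_calls=1)

print(client_a == client_b)          # same seed, yet the rolls no longer line up
print(client_b[:4] == client_a[1:])  # client B is simply one step ahead in the sequence
```

Both clients are still reading the same sequence, just from different positions, so every random event from then on can come out differently.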
I found RTS3 had a lot more network issues than AoE2. It would more consistently fail to hole punch (fail to join a lobby game with certain people for no discernible rhyme or reason) and/or drop connection during play.
I could play RTS3 with most people around the world, it seemed, except a few of my best friends. They could individually play with random people all over the world, but not each other or myself, depending on the day of the week or something like that.
Perhaps their decision to move away from DirectPlay was a mistake or premature optimization!
Age of Mythology always worked for me on LAN and usually WAN without too much port forwarding. AoE (both 1 and 2) never seemed to work well over the internet for me without loads of manually forwarded ports.
The rewrites/updates have definitely helped in this situation but I think the AoM code was built for and tested on networking hardware that was a lot more error prone than modern routers are, which in turn affects how it behaves even today. Clever NAT bypasses definitely work better now than they did fifteen or twenty years ago, but they're definitely not guaranteed to work still. The rise in CGNAT usage also paints a grim picture for anyone playing video games together.
I've also had my fair share of peering issues between ISPs. At one point I discovered that I could send UDP packets to my friend, but my friend could not send UDP packets to me, no matter the port forwards and even after forcing the modem into bridge mode. TCP worked fine in both directions, though. Something upstream seemed to filter the packets out, I'm guessing to prevent DDoS attacks, so whenever we played P2P games one of us needed to run some kind of VPN.
IPv6 can solve many of the problems I remember having to deal with setting up games back in the day, but most games have become cloud-only anyway. The age of custom servers and P2P gaming is over, killed by lootboxes and shitty NAT implementations.
I love it. I wonder how much of this they applied to AOE2 DE, which is currently thriving, and uses the same basic game mechanics, artwork, and balance as "AOK".
It's the same engine with some updates, except game messages are routed through a centralised server. All of the stuff about lockstep simulation still applies.
If you're referring to building destruction, they are just baked animations; the physics simulation was done in an external 3D modelling program. Which is why the enhanced graphics pack for the game is about 30 gigabytes.
If each command would only be executed two communication turns in the future, and a communication turn is ~200ms, isn't there ~600ms of lag for each command?
> For RTS games, 250 milliseconds of command latency was not even noticed -- between 250 and 500 msec was very playable, and beyond 500 it started to be noticeable.
The article mentions this as well, so was ~600ms latency just expected?
Why doesn't it work to execute commands _one_ communication turn in the future?
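A rough way to put numbers on this, under an assumption that is my reading rather than something stated in these terms: a command issued during turn N is sent to the other players while turn N+1 plays out, and is executed at the start of turn N+2. The function and parameter names are made up.

```python
TURN_MS = 200   # communication turn length from the question above

def execution_delay_ms(issued_at_ms, delay_turns=2, turn_ms=TURN_MS):
    """Delay from issuing a command until the start of the turn that executes it."""
    issue_turn = issued_at_ms // turn_ms
    executes_at = (issue_turn + delay_turns) * turn_ms
    return executes_at - issued_at_ms

worst = execution_delay_ms(0)                  # issued right at the start of a turn
best = execution_delay_ms(199)                 # issued right at the end of a turn
one_turn = execution_delay_ms(199, delay_turns=1)
```

Under that assumption the delay lands between one and two turns, not three. With a one-turn delay, a command issued at the very end of a turn would have almost no time to reach every other player before it is due, which is one plausible reason for the extra turn.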
I've built a couple of demos using this article as a guide to understanding how the net architecture works. It's pretty fun to run through. Making your sim deterministic is a neat challenge.
Here's an older snapshot of the article from back when the site was still called Gamasutra, that includes the full text of the article:
https://web.archive.org/web/20180719170411/https://www.gamas...