Hey all. Thanks for noticing :P Obviously this is embarrassing and I'm sorry about it. As a non-developer I can't really explain how or why this happened, but I can say that we do our best and are sorry when that falls short.
Computers set limits internally on how big numbers can be when they're keeping track of stuff.
Your developers had given each game a number to identify it. So your first game was #1, the 40th game was #40, and so on.
The limit for how big the number could be was a bit over 2 billion, and your players have just now played a bit over 2 billion games, and so that id number suddenly exceeded the computer's internal limit. Specifically, the limit was 2147483647, so basically it crashed on game #2147483648, which is the next id after the last acceptable one (notice the last digit is 1 higher).
I'm simplifying slightly but that's the idea. It'll be fixable by essentially using a different format for the id number so that the limit is higher, much like telling the computer "use a higher limit for this particular number, it's special."
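For anyone curious what "exceeding the limit" looks like, here is a minimal Python sketch. It assumes the id was held in a signed 32-bit slot, which is what the 2-billion figure suggests; the helper name `to_int32` is made up for illustration, and none of this is the actual app code.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def to_int32(n: int) -> int:
    """Reinterpret n as a signed 32-bit integer (hypothetical helper)."""
    # ctypes does no overflow checking, so the value is silently truncated.
    return ctypes.c_int32(n).value


print(to_int32(INT32_MAX))      # still fits, comes back unchanged
print(to_int32(INT32_MAX + 1))  # one game later: wraps to a negative number
```

Whether a real client crashes, shows garbage, or quietly misbehaves at that point depends on how the wrapped value is used afterwards.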
Yes - I understand HOW it happened, just not sure WHY. Meaning, I'm not sure what the developer was thinking, and at this point, I'm not going to track down exactly who it was and point fingers. I think everyone has learned enough through this highly interesting bug. It certainly was interesting to see the slack room exploding with theories and debugging. A new iOS client has been submitted to Apple (hurry plz!!!), and a server fix is also in QA now. Fun problems to have......
It's most likely for efficiency and performance reasons. 64-bit IDs double the storage requirement of 32-bit ones, which would impact the database's memory utilization, query window size, cache, and storage.
Edit: 32 bits worth of games played means about 4 billion games. 4 billion X 4 bytes for 32-bit = 16GB just for the 32-bit ID's. 64-bit ID's would need 32GB for the 4 billion games. I guess memory and storage weren't that cheap back then.
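The same back-of-the-envelope numbers as a small Python sketch. It only counts the raw bytes of the ID values themselves (no indexes, row overhead, or replication), and `id_storage_gib` is a made-up helper name.

```python
GAMES = 2**32  # "32 bits worth" of IDs, roughly 4.29 billion


def id_storage_gib(num_ids: int, bytes_per_id: int) -> float:
    """Raw space taken by the IDs alone, in GiB."""
    return num_ids * bytes_per_id / 2**30


print(id_storage_gib(GAMES, 4))  # 32-bit IDs: 16.0
print(id_storage_gib(GAMES, 8))  # 64-bit IDs: 32.0
```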
It sounds like it was client side, not server side. Most likely the iOS client was using Objective-C's NSInteger or Swift's Int, just because that's the default choice for most programmers working in that language, and they didn't think it through.
On a 32-bit system, a "long" is usually also 32 bits. On a non-Microsoft 64-bit system, a "long" is usually 64 bits. On both 32-bit and 64-bit systems (Microsoft or not), an "int" is usually 32 bits.
If the issue happened only on 32-bit iPads, but not on 64-bit iPads, the programmer probably picked a "long", not an "int". Had the programmer picked an "int", the problem would also happen on 64-bit iPads.
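The reasoning above can be sketched as a lookup over the usual C data models (32-bit iOS is ILP32, 64-bit iOS is LP64, 64-bit Windows is LLP64). This is only an illustration of the int-versus-long argument; the table and the `overflows` helper are assumptions made for this example, not anything taken from the app.

```python
# Widths in bits of common C integer types under each data model.
DATA_MODELS = {
    "ILP32": {"int": 32, "long": 32, "long long": 64},  # 32-bit systems
    "LP64": {"int": 32, "long": 64, "long long": 64},   # 64-bit Unix-likes
    "LLP64": {"int": 32, "long": 32, "long long": 64},  # 64-bit Windows
}


def max_signed(bits: int) -> int:
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


def overflows(model: str, ctype: str, value: int) -> bool:
    """True if value does not fit in the signed C type under that model."""
    return value > max_signed(DATA_MODELS[model][ctype])


game_id = 2**31  # first id past the signed 32-bit range
for model in DATA_MODELS:
    for ctype in ("int", "long"):
        print(model, ctype, overflows(model, ctype, game_id))
```

An `int` overflows under every model, while a `long` overflows only where it is 32 bits wide, which is why "broken on 32-bit devices only" points at `long` (or a type like NSInteger that follows the same widths).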
Our iOS app with Java backend was using long for database IDs on both ends (a Java long is always 64 bits, while a C long on 32-bit iOS is only 32). I was going through the ILP32->LP64 conversion process when I realized we had a pretty serious discrepancy.
I think it's a really easy mistake for the first developer to make (especially because they weren't a C/Obj-C programmer), and then the sort of thing that no one audits after that.
> Meaning, I'm not sure what the developer was thinking
A 32bit integer is pretty much the default numeric type for the majority of programming tasks over the last 20 years. Even with 64bit CPUs, 32bit is still a common practice. Probably 99% of all programmers would make the same choice unless given specific requirements to support more than 2 billion values.
It's often not even an explicit choice, it's just default behavior.
Up until recently, Rails defaulted to 32 bit IDs, so there are a ton of apps out there that could have these issues, especially since Rails has always prided itself on providing sane defaults: https://github.com/rails/rails/pull/26266
Others, like JS and Lua, just use doubles, meaning they'll never overflow - instead, once you pass 2^53, every 2 numbers start to be considered equal. Then a while after that every 4, etc. Not exactly optimal behavior when using incrementing IDs.
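A quick way to see the doubles behavior, sketched in Python, whose `float` is a 64-bit double on all common platforms. The helper name `survives_double` is invented for this example.

```python
def survives_double(n: int) -> bool:
    """True if the integer id n is unchanged after a round trip through a double."""
    return int(float(n)) == n


LIMIT = 2**53  # 9007199254740992, last point where every integer is exact

print(survives_double(LIMIT))            # True
print(survives_double(LIMIT + 1))        # False: collapses onto a neighbouring id
print(float(LIMIT) == float(LIMIT + 1))  # True: two different ids compare equal
print(survives_double(2**54 + 2))        # False: past 2^54 only every 4th id is exact
```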
I don't think you do understand, you sound like you're upset that a developer "set" this limit. When in reality it's tied to fundamental programming principles. It wasn't really a conscious decision to say, "I'm only going to account for 2bil games"...
Probably when this was initially developed, nobody thought you'd ever go over 2 billion games. This error is brought to light by your success and popularity.
Computer history is riddled with assumptions like that. The Y2K problem, 32-bit Unix dates running out in 2038, and 32 bit computers unable to address more than 4 GB of memory are just the big ones. It's everywhere. Smaller software projects are generally built for what you need right now, and less for what might happen in the distant future.
Ideally you want to retain some awareness that this is an issue so you can start working on it once you go over a billion games, but in a small company, there are probably always more urgent things to worry about, and nobody ever gets around to fixing this technical debt.
2 billion is a very large number that was probably not envisioned as reachable in the near future - as a programmer I'd argue this is a pretty easy mistake to make, and that while (slightly) embarrassing, it's a good learning moment.
It's also really awesome that you're here, and that you guys were so honest about the nature of the bug - this is really something that should be encouraged.
Maybe we should start a blog about all of the interesting bugs and challenges we encounter. It certainly is white-knuckle pretty often when running at scale. The number of devices, connections, features... I'm aging prematurely :P
Agree with Aloha. I wouldn't be too hard on the programmer (also, if I understand correctly it's not a database issue, but only with the 32-bit iOS client). I'd pat him on his back and say “you didn't think we'd get this big, eh?” ;-)
> 2 billion is a very large number that was probably not envisioned as reachable in the near future
I disagree. Simple napkin calculation: 100 million players playing 40 games each per year (about 1 per week) over 5 years = 20 billion unique games.
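Taking those napkin inputs at face value (they are the commenter's hypothetical, not real Chess.com figures), a short Python sketch shows how quickly a signed 32-bit id would run out. The function name is made up for this example.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit id

PLAYERS = 100_000_000           # hypothetical player count
GAMES_PER_PLAYER_PER_YEAR = 40  # roughly one game a week


def years_until_overflow(players: int, games_each: int) -> float:
    """Years of play before the total game count passes INT32_MAX."""
    games_per_year = players * games_each
    return INT32_MAX / games_per_year


games_per_year = PLAYERS * GAMES_PER_PLAYER_PER_YEAR
print(games_per_year)  # 4000000000 games in a single year
print(round(years_until_overflow(PLAYERS, GAMES_PER_PLAYER_PER_YEAR), 2))
```

At that rate the 32-bit range is used up in a bit over half a year, so the yearly total alone is already past the limit.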
As others pointed out it was likely not a miscalculation, just a lack of calculation. The bug occurred only in the client and the decision to use a smaller data type was likely not a conscious one.
In any case, I wouldn't hold it against an individual programmer. But arguably this sort of bug indicates your development process has flaws (not enough testing, code reviews, etc).
Thanks. I'm a pretty understanding "boss", especially on the heels of reaching the 2 billion games milestone :D Our team is awesome and we love what we do. Unfortunately we're still a bunch of humans sitting at kitchen counters and on couches around the world, so things do sometimes fall in between the cushions...
Indeed. I'm not sure that anyone here at Chess.com at the beginning thought we would hit a billion games played in our lifetime. But I guess after 10 years....
To put things in perspective, 2 billion games in 10 years is half a million games per day on average over the 10 years. Considering you didn't start at that rate and that it's an average, it means you have way more than half a million games per day now. (that's also more than 6 per second!)
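The averages are easy to check with a few lines of Python. This assumes exactly 2 billion games over 10 years of 365 days, which is a rounding of the real numbers.

```python
TOTAL_GAMES = 2_000_000_000
DAYS = 10 * 365
SECONDS_PER_DAY = 24 * 60 * 60

games_per_day = TOTAL_GAMES / DAYS
games_per_second = games_per_day / SECONDS_PER_DAY

print(round(games_per_day))        # a bit over half a million
print(round(games_per_second, 2))  # a bit over 6
```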
Think of a mechanical odometer, and how it only has a certain number of digits. Eventually you'll hit 999,999 miles, and on the next mile, everything will roll over to 000,000.
Same deal here. 32-bit numbers are stored as 32 switches, starting from
0000 0000 0000 0000 0000 0000 0000 0000
which is 0, to
1111 1111 1111 1111 1111 1111 1111 1111
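which is 4,294,967,295. Since the 32-bit iOS version of Chess.com apparently uses 32-bit numbers to store each game's unique ID, that means you can have at most 4,294,967,295 games (or only 2,147,483,647 if the number is signed, because one of the switches is then used for the sign, which matches the roughly 2 billion games where this broke).
So what happens on the very next game? Just like the odometer, everything rolls over (back to 0 for an unsigned number, or to a large negative number for a signed one), and things start breaking because the program gets confused.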
Pretty common problem, really. The fix would be to use a 64-bit number, which doubles the number of binary digits.
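Here is the odometer idea as a minimal Python sketch. It models an unsigned counter with a fixed number of binary digits; `odometer_next` is a made-up name, and real code would get this behavior from the hardware integer type rather than from a modulo.

```python
def odometer_next(value: int, bits: int = 32) -> int:
    """Advance a counter that only has `bits` binary digits, wrapping at the top."""
    return (value + 1) % (2**bits)


top = 2**32 - 1  # all 32 switches on

print(format(top, "032b"))          # 32 ones
print(odometer_next(top))           # rolls over to 0
print(odometer_next(top, bits=64))  # with 64 digits it just keeps counting
print(2**64 - 1)                    # where a 64-digit counter finally tops out
```

Doubling the digits does far more than double the range: the maximum goes from about 4.3 billion to about 18 quintillion.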
- Erik, CEO, Chess.com