Chess.com stopped working on 32bit iPads because 2^31 games have been played (chess.com)
757 points by NewGier on June 12, 2017 | 326 comments



It's fascinating... the Y2K problem never came to fruition because - arguably - of the immense effort put in behind the scenes by people who understood what might have happened if they hadn't. The end result has been that the entire class of problems is overlooked, because people see it as having been a fuss over nothing.

I sometimes think it would've been better if a few things had visibly failed in January 2000.


If you were watching closely and knew what to look for in the first couple of months of 2000, the failures were there. But they were generally minor and easy to overlook as Y2K problems.

I spotted something like half a dozen failures in various systems I interact with which I strongly suspected, based upon the timing, were likely Y2K problems that slipped through testing. For example, I received duplicate bills for one of my credit cards for the January 2000 billing period, and then a subsequent apology for the duplicate bills. They never said Y2K, but the timing was very suggestive.

It's pretty much exactly what I expected from most companies...the big stuff had largely been dealt with, but a few things slipped through which they could dismiss with some hand-waving. The thing that surprised me was that there didn't really seem to be any high profile disasters (like a company that couldn't ship products, an airline that couldn't issue tickets, or whatever) at all...I figured there'd be at least a couple.


Having spent a year hardening financial systems against Y2K, I was very unimpressed to discover my credit card not working on 1st of January.

The call centre staff told me that this wasn't a Y2K bug, but a year-end bug. As if that was meant to make me feel better about an obvious, grim, failure.


> in the first couple of months of 2000, the failures were there.

They were there before then too. Things that could go wrong at midnight on NYE were only one of a few classes of problem associated with roll-overs. There were a lot of bugs in things like scheduling applications (and similar system tools) in the run-up to 2000 that the man on the street didn't associate with the Y2K issue because they didn't happen at that exact moment.


in JavaScript, new Date().getYear() returns 117 :)


And in early 2000 the web was full of "Copyright 19100".


getYear() returns the year offset from 1900; if you use getFullYear() you'll get 2017. Though, given all the additions to ES201*, it would be nice if all the timezone data were in the browsers, so one could get moment-timezone without the relatively large data files.


.getYear() and .setYear() are deprecated in all recent standards [IIRC they were before 2000, I certainly remember them being recommended against as far back as that].


Makes sense.


Similar to how much effort goes into dealing with things like dangerous strains of bird flu, only to have people complain about how much money was spent on "nothing" when an outbreak doesn't occur.


This is a whole class of problem - I wonder if there's a name for it. More examples include: talking about welfare being unnecessary because no one is starving. Or people on medications stopping because they feel better (while still on them).


This makes me think of a hypothetical scenario proposed in one of Nassim Nicholas Taleb's books wherein some senator gets a bill passed in the weeks before 9/11 requiring reinforced doors on all commercial airplane cockpits.

No parade would be thrown for this senator for having prevented 9/11 and likely he'd be castigated for having given airlines an excuse to raise prices due to restrictive government regulations.


That example (while I was reading the book) kind of stuck with me - I think of it almost as preventative maintenance or good software dev practices - you'll get crap from others for 'wasting' time on it, but they don't realise it might just save your ass when you need it most.


That's a very pervasive bias we all hold: praising a player who falls behind and then pulls something off, while not paying much attention to one who just does the job right the first time. Chaos gets attention!


The guy at work who causes a complex IT problem and fixes it by working late one night: team player doing what he can to support the company.

The guy at work whose work doesn't cause a problem and things keep working fine: why aren't you as much a team player as that other guy?


The silent wheel gets rode?


Greased wheels don't squeak.


How about: "prodigal brother syndrome"?

I suspect the parable it alludes to passes the cultural literacy threshold for comprehension, at least in the US.



This reminds me of an essential but overlooked truth examined in this paper: "Nobody Ever Gets Credit for Fixing Problems that Never Happened". It builds a simple system dynamics model and shows the long-term effects of working smarter versus working harder: http://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.p...


I think the "availability heuristic" is what you're looking for.

https://en.wikipedia.org/wiki/Availability_heuristic

The availability heuristic is a mental shortcut that relies on immediate examples that come to a given person's mind when evaluating a specific topic, concept, method or decision. The availability heuristic operates on the notion that if something can be recalled, it must be important, or at least more important than alternative solutions which are not as readily recalled. Subsequently, under the availability heuristic, people tend to heavily weigh their judgments toward more recent information, making new opinions biased toward that latest news.


For a pop-culture name, I'd call it the Head & Shoulders Problem, after those old shampoo commercials:

"You use Head & Shoulders? But you don't have dandruff!" "Exactly!"


I wish. I do have dandruff (depending on where in the world I happen to be). Head & Shoulders does absolutely nothing.


Dandruff is generally caused by fungi living on the scalp. The fungi eat up your skin and make it dry. The dry skin flakes off.

Dandruff shampoos work by killing the fungi. There are three main types of anti-fungals: zinc pyrithione is used in Head & Shoulders, selenium disulfide in Selsun Blue. There is a third one, ketoconazole, not as commonly used, that you may want to try.

There is also coal tar shampoo, which is not an anti-fungal, instead it slows skin cell growth and sloughs off the dead skin.

Selsun Blue works well for me; I use it daily. If I go off of it for a few days the dandruff kicks into high gear.

The thing about dandruff shampoos is you have to use them with a certain regularity, because even if you kill the fungi, the condition takes a few days to clear up; the damage to your scalp is already done. You need to create a hostile environment for the microorganisms, and that takes time.

Your dandruff is probably caused by only one type of organism that thrives in particular climates. Next time you're there, you might get ahold of two or three shampoos and try them for a week each and see if it clears up. Then you know which is 'your' anti-fungal.


Have you seen your doctor about this problem? There's quite a lot they can do.


Dandruff has a barely noticeable impact on your life. Following medical advice can do much more.

Some problems aren't worth trying to fix.


Not merely an insightful repository of analysis and explanations around the more profound aspects of our world, John Ralston Saul's exquisite Doubter's Companion includes some answers - acerbic accusations bundled free - to the more pedestrian issues:

"DANDRUFF: The ANSWER is usually vinegar. To some problems there are solutions.

"What we call dandruff is often the result of a PH imbalance on the skin, which shampoo exacerbates. Wash your hair with a simple non-detergent shampoo, soap, olive oil, beer, almost anything. Rinse. Then close your eyes and pour on some vinegar. The extremely cheap but natural sort—apple cider, for example—is probably best. The smell will stimulate interesting conversations in changing-room showers and your explanation will win you friends. Wait thirty to sixty seconds. Rinse it off. The smell will go away. So will your dandruff.

"All dermatologists, pharmacists and pharmaceutical companies know this simple secret. They don’t tell you because they make money by converting dandruff into a complex medical and social problem. By most professional standards this would amount to legally defined incompetence or misrepresentation.

"Dandruff shampoos that promise to keep your shoulders and even your head clean are harsh detergents and may promote baldness, which ought to constitute malpractice."


Stop spreading psuedo-scientific alternative-medicine falsehoods. Dandruff is in no way caused by a "PH imbalance on the skin": http://www.mayoclinic.org/diseases-conditions/dandruff/diagn...


Actually, I think I'm mostly annoyed by your use of three pejoratives to try to make one point, and the fact you spelled pseudo wrong just irritates some other part of my brain. :)

Sure, dandruff and 'dry skin flakes' are separate things entirely.

Dandruff is generally regarded to be caused by the fungus Malassezia. This fungus does not enjoy an acidic environment.

(Healthy) human skin tends towards pH 4-5 (there seems to be some debate). Malassezia likes 5-8 (again, some debate). Reducing the pH / increasing the acidity would appear to have some non-psuedo, non-alternative-medicine, non-falsehood foundation.


Are you claiming vinegar is an ineffective treatment, or merely that the explanation for why it's effective doesn't meet your requirements?

If the latter, I can probably agree. If the former, please do some more research.


Wikipedia says one cause of dandruff is yeast-like fungus. Perhaps PH plays a role in that. Also vinegar has cleansing properties and may be less drying than detergent, and dry skin is another cause of dandruff.


> _may_ be less drying than detergent

(Emphasis mine.) Did you miss the part where GP asked people to stop spreading pseudo-scientific falsehoods?


"harsh soaps, detergents, and perfumes, [...] tend to dry the skin"

https://www.urmc.rochester.edu/encyclopedia/content.aspx?con...


The word "vinegar" is notably absent from that link. That solely shows that harsh soaps are harsh (which as you may have noticed is axiomatic), not that vinegar isn't.


When I typed the word "may" I meant it. If it's medically established that properly diluted vinegar dries the skin I'd be interested in knowing that. Otherwise, it seems like reasonable speculation to me rather than pseudo-scientific BS.


>notably absent

Maybe that's because vinegar isn't a staple in the hair care aisle at Walgreen's.


I suffer from dandruff and have tried a bunch of shampoos, but they didn't work well. I saw this tip about vinegar, tried it for a single day, then gave up, thinking this could be just a popular saying. I'll try again and report back.


When you say 'this could be just a popular saying', do you mean it may be wrong because it's popular?

I'd be curious to know your results, in any case. I've had mixed outcomes from conventional commercial shampoos and have also been trying to identify causal factors (worse in winter, especially after a few days of wearing a beanie, worse when I'm staying near a high-pollution area, etc). It's all anecdata, but OTOH not beyond our abilities to thoughtfully analyse.


This unconscious bias went unnoticed by me (as expected, by definition). Thanks for pointing it out :)

So far, my experiments with other shampoos were conducted a bit ad hoc. They're all anecdata. When I try with vinegar, I should try to rule out other causal factors first. I probably haven't done that in my first try. Otherwise, we can never be sure.


Have you tried going to a dermatologist?


To be honest, I only went to see a single dermatologist, but my experience was bad. The dermatologist kind of glossed over it and shrugged. I received a shampoo recommendation, but the dermatologist didn't say we should monitor the treatment or anything. From what I saw, she would keep recommending shampoos and I would try them until I found one that worked. So I decided to try by myself.

Not my wisest moment. I should probably have seen another dermatologist.


Great book and great advice. Must be about 20 years since I tried this approach. Count me as a data point strongly supporting the use of vinegar to deal with dandruff.


Like when people say to IT: "Everything's working, what am I paying you for?!"


Followed by the inevitable "Nothing's working, what am I paying you for?!"


You have to spend money to make money, but I spent all mine trying to make more.


I think this is actually a version of confirmation bias since

  > Confirmation bias, also known as Observational selection or The enumeration of
  > favorable circumstances is the tendency for people to (consciously or
  > unconsciously) seek out information that conforms to their pre-existing view
  > points, and subsequently ignore information that goes against them, both
  > positive and negative [0]
[0] http://rationalwiki.org/wiki/Confirmation_bias


You'll find confirmation bias everywhere if you look hard enough for it.


I'm looking hard enough to see what you did there.


I think it's selection bias, because you sample from the set of perceived problems - the ones that actually occurred - rather than from all the ones that could have happened.


I mean, yes, but from the outside some examples can look an awful lot like "my tiger-repelling rock works because we haven't been attacked by tigers".


I find your lack of heuristics disturbing.


The opposite of that is frustrating too.. I see it a lot in SF. Something like "We spend $100M/year on homelessness and we have 3,500 homeless. Why don't we just give them the money and do away with the social services!" As if the $100M isn't keeping many thousands more from living on the streets..


A more relevant way of putting it for me: Why do we have to have so many different corporate rules and processes when everything is working anyways?


The flip side is the confirmation bias of "it's working because of our rules."


Yes, Daniel Kahneman describes this kind of cognitive bias as "what you see is all there is". A simple example given in his book* is to ask the members of any couple (or roommates) what percentage of the home duties he/she performs. The sum is always above 100%.

*"Thinking fast and slow", best good recommendation I got from the HN crowd


The "It's Working" problem.


Maybe the expression "absence of evidence is not evidence of absence," discussed in https://en.wikipedia.org/wiki/Argument_from_ignorance and https://en.wikipedia.org/wiki/Evidence_of_absence.


With the important caveat (described on the Wiki page) that absence of evidence sometimes is (weak) evidence of absence.


It's a form of (inverse) Survivorship Bias. "None survived, therefore none started out" is an (inverse) extension of "The ones that survived were the ones that started out".


I was going to say ... it's a corollary of survivor bias ...


I wonder if there's a general law here, something like "prevention is socially unsustainable".


As a former Y2K project manager, I had one thing go BOOM. But it didn't go boom on 1/1/2000, rather on 2/29/2000. Our Y2K program had, IIRC, some 9 likely fail dates including the 1/1/2000 date. The component in question was provided globally and untestable locally, so the best I could do was acquire a copy of the complete remote test script and call it good.

I was thrilled that we had something to point to as a "see, this is why we put in so much work". Prior to that we had received lots of criticism about the amounts spent, people hired, blah blah blah.


"Wait, we spent two million dollars servicing your fleet of 100 cars in the last year, and you're complaining that none of them broke down? Why spend all that money on maintenance when they just keep running?"


Are you sure those numbers are right? That's 20k per car per year. At such a cost, running the cars with no maintenance and then buying new cars after every breakdown would be cheaper.


They were intentionally inflated - however I probably had Australian pricing in mind, which is about 50% more expensive than in the US. Considering the audience I guess I should have said 150 cars instead. :)

A lot of money was spent fixing the Y2K issue. Can't exactly recall how much time I spent myself, but it was a dominant factor as far as IT projects went back then.


> running the cars with no maintenance

Your car runs into a bus full of children.

Other lawyer: Jury, they saved millions of dollars per year not checking the brakes on their cars, you should award those millions of dollars, and some more millions of dollars to the families whose children died that day.


Maybe it's because of how young I was at the time, but I remember media outlets reporting on the Y2K problem, and showing video of hundreds of older computing devices that were being retired (thrown in landfills) because they wouldn't work properly after the date change happened.

Maybe this isn't a "real" failure, but rather a symptom of some IT departments working diligently to solve the problem before it happened. In any case though, I'm curious how inaccurate the televised reality from my youth actually was.


Y2K was a software issue.

In short and very simple terms, some software stored dates as DD/MM/YY.

It would assume that the year was always prefixed with 19. The problem came when you reached /00. Any calculations or software decisions based on that date would be off. Way off.

Some solutions were expanding the date to YYYY or adding a prefix to the new dates. Dates without the prefix would be 19xx, dates with the prefix X were 19xx, and dates with the prefix Y were 20xx.


Another solution was to set the prefix for 2-digit years on a sliding scale, so they are interpreted as a date within a specific 100-year period (date windowing). For example, see the 2029 rule: https://support.microsoft.com/en-us/help/214391/how-excel-wo...

This turned out to be one of the most cost-effective methods of fixing the problem, and was probably one of the most likely to be implemented. This was especially the case in situations involving software which ran closer to the hardware (for example, BIOS or firmware) or on systems where RAM or storage couldn't be increased and/or the change might increase the software's requirements beyond the system's capabilities.
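
To make the windowing idea concrete, here's a minimal C sketch (the pivot of 30 mirrors the 2029 rule described above; real systems chose their own pivots, and the function name is just illustrative):

    #include <stdio.h>

    /* Sketch only: interpret a 2-digit year inside a fixed 100-year window.
       With a pivot of 30, 00-29 map to 2000-2029 and 30-99 to 1930-1999. */
    static int window_year(int two_digit_year) {
        const int pivot = 30;
        return two_digit_year < pivot ? 2000 + two_digit_year
                                      : 1900 + two_digit_year;
    }

    int main(void) {
        printf("99 -> %d, 05 -> %d\n", window_year(99), window_year(5)); /* 1999, 2005 */
        return 0;
    }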

32-bit Unix timestamps have the 2038 problem.

Despite how common these problems have been in the history of computers, we keep making them: https://en.wikipedia.org/wiki/Time_formatting_and_storage_bu...


There were a couple of computers in my shop that had a Y2K BIOS issue. Although the BIOS is really a sort of software in the form of firmware, many people considered this kind of Y2K failure a hardware problem.

So, while a Y2K compliance program dealt mostly with software, a complete program went through and tested all hardware in inventory.


The issue was really a closed-source problem: with access to the source you could easily add an option to switch between 19 and 20 for calculations, or keep a pre-2000 instance of the app and a post-2000 one.


The problem was that media reports at the time seemed to assume that anything with a computer would completely blow up, instead of rationally thinking through much more benign consequences of a date or calendar being wrong.

The image of a bunch of perfectly capable computers being discarded for no reason describes this media frenzy wonderfully.


> the Y2K problem never came to fruition because - arguably - of the immense effort put in behind the scenes by people who understood what might have happened if they hadn't.

Except many countries that spent no money on upgrading their systems had very few problems too.


[citation-needed]


There actually were a lot of inconsequential failures with date display.

A fair number of websites went from showing "Dec. 31, 1999" to showing "Jan. 1, 19100".


The National Express (a UK based coach company) website showed the current date as 01/01/19100 on new years day 2000.


I had never thought of it this way. If this is true, then there should have been a highly visible media campaign to point this out.


My long-term active retirement plan involves building a year-2038 consultancy.


> I sometimes think it would've been better if a few things had visibly failed in January 2000.

Having to spend billions of dollars on programmers' goofy means of saving a few bytes of memory is a pretty visible failure.


Nope. Definitely not goofy.

My dad was a programmer in the early days. The machines he started on in the 1960s had 8 KB of RAM. Saving a byte then is the equivalent today of saving 1 MB on an 8 GB machine.

Multiply that times, say, the thousands of customer orders you're trying to process and the goofy thing would be burning a lot of additional RAM because it might help somebody 35 years later. Who among us is writing code today worried about how it will be used in 2052?


>Who among us is writing code today worried about how it will be used in 2052?

This decade I knowingly wrote code that will break in 2036 [1]. My supervisor was against investing the time to make it future-proof (he will be retired by 2036), and I have good reason to believe the code will still be around by that time. I don't think I'm the only programmer in that position.

[1] Library specific variant of the y2038 problem https://en.wikipedia.org/wiki/Year_2038_problem


- This decade I knowingly wrote code that will break in 2036

Sure, but how bad was it really? Something you could fix relatively quickly with a little time and money, or an instance of Lovecraftian horror unleashed upon the world like so much COBOL code?


Even then, mix in people switching jobs, losing the knowledge of where or what all these landmines are, and add similar but unrelated issues across your entire codebase. This stuff adds up. I like to at least add stern log messages when we are at 10%-50% of the limitation. It's saved my ass before, especially when your base assumption can be faulty.

In one of those scenarios, where we expected an integer to take at least 100 years of growth to fill, a user burned through 20% of it in a single day due to certain unaccounted-for pathological behaviors. But we had heavy warnings around this, so we were able to address the problem before it escalated.


I'm assuming this is C/C++ so I'm wondering if there isn't some compiler pragma to have it warn people still building the code in 2025 or so?


Ha! Rebuilding the code. The source will be lost and the binary will be irreplaceable in many circumstances.


Embed the source code in the binary.


Wow, these libraries are way too large

strip a.out


Last time this came up, I ran the numbers and the cost of the RAM saved per date stored was hundreds of dollars. Not per computer, or per program, but per date. Comparing total memory sizes doesn't tell the whole story, because RAM for a whole machine is so much cheaper now.

Spending that much money on storing "19" just so your code keeps working in the unlikely event that it's still in use 3+ decades into the future isn't a good tradeoff. Obviously things are different now.


Excellent point. Yeah, the machines my dad started on had magnetic core memory [1]. Each bit was a little metal donut with hand-wound wiring.

And in some ways, even "hundreds of dollars per date" doesn't quite convey it. These machines were rare and fiendishly expensive. In 2017 dollars, they started at $1M and went up rapidly from there. Getting more memory wasn't a quick trip to Fry's; even if you could afford it and it was technically possible, it was a long negotiation and industrial installation.

Another constraint that we forget about is the physicality of storage. Every 80 columns was a whole new punch card. That's a really big incentive to keep your records under 80 characters. Each one of those characters took time to punch. Each new card required physical motion to write and read, and space to store.

There were just so many incentives to make do.

https://en.wikipedia.org/wiki/Magnetic-core_memory


It really was a different world. I think a lot of programmers don't understand just how different it was (I barely do), and don't realize that modern principles like programmer time being more expensive than computer time are not universal truths about computing, but are just observations of how things are in recent decades.


The interesting thing about this from an engineering point of view is, you quietly pass a threshold where the clever hack which was worthwhile becomes literally more trouble than it is worth. When that happens is a multivariate problem that we couldn't truly predict at the time of the code's creation. (and when it happens, there might not even be anyone on the payroll thinking about it)


You're calculating what it would cost to store a string representation of a date. Which is silly. You should always convert to a timestamp for storage. You can cram way more info into a single integer than you can with a base 10 string. And the bonus is you verify the date's correctness before storing.

Even a 32-bit int could hold 11 million years worth of dates. And if your software is used for longer than that, you can just change it to a 64-bit long and have software that will outlast the sun.
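
As a sketch of what "convert to a single integer" can look like for calendar dates, here's a day count computed with Howard Hinnant's civil-date algorithm (the function name is mine; days are counted from 1970-01-01):

    #include <stdio.h>

    /* Sketch only: pack a calendar date into one day count. A 32-bit count
       of days covers millions of years; a 64-bit one is effectively forever. */
    static long days_from_civil(int y, int m, int d) {
        y -= m <= 2;
        long era = (y >= 0 ? y : y - 399) / 400;
        unsigned yoe = (unsigned)(y - era * 400);                    /* [0, 399] */
        unsigned doy = (unsigned)((153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1);
        unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;        /* [0, 146096] */
        return era * 146097 + (long)doe - 719468;
    }

    int main(void) {
        printf("2017-06-12 is day %ld\n", days_from_civil(2017, 6, 12)); /* 17329 */
        return 0;
    }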


Silly or not, that's the reality of punch-card-based technology (BCDIC, later extended to EBCDIC). Punch cards pre-date electronic computers, and making a relay tabulator set-up work with binary formats is impractical.

As computer hardware grew out of that, it maintained much of the legacy, down to hardware data paths and specialized processor instructions. It was more than a programming convention.


They were stored in BCD at the time:

https://en.wikipedia.org/wiki/Binary-coded_decimal

That was the right choice for the era. As mikeash points out, your approach takes more bits and more CPU cycles. But it also takes a computer to decode. Any programmer can look at a punch card, a hex dump, or even blinkenlights and read BCD. Decoding a 32-bit int for the date takes special code. Which you have to make sure to manually include in your program, the size of which you are already struggling to keep under machine limits.

We've come a long way.


Systems from this era were probably using BCD rather than base-10 strings. A BCD date would take up 24 bits.

Running a complicated date routine to convert to/from 32-bit timestamps would also have cost a huge amount. These machines had speeds measured in thousands of operations per second, and the division operations needed to do that sort of date work would take ages, relatively speaking. All on a machine that cost dozens of times the average yearly wage at the time, and accordingly needed to get as much work done as possible in order to earn its keep.
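
To make the 24-bit figure above concrete, here's a small C sketch of BCD packing (one 4-bit nibble per digit of YYMMDD; the function name is illustrative):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: pack a YYMMDD date as binary-coded decimal, one nibble
       per digit, so six digits fit in the low 24 bits of the result. */
    static uint32_t to_bcd_yymmdd(int yy, int mm, int dd) {
        int digits[6] = { yy / 10, yy % 10, mm / 10, mm % 10, dd / 10, dd % 10 };
        uint32_t bcd = 0;
        for (int i = 0; i < 6; i++)
            bcd = (bcd << 4) | (uint32_t)digits[i];
        return bcd;
    }

    int main(void) {
        /* 1999-12-31 with a two-digit year prints 0x991231: readable in a hex dump */
        printf("0x%06" PRIX32 "\n", to_bcd_yymmdd(99, 12, 31));
        return 0;
    }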


Sometimes this worry is thrust upon you by the problem domain. I do remember tackling the Y2K38 problem in 2008 - the business logic dictated that the expiration date should be tracked, and some of them were set to 30 years.


But a 2-digit year should take less than 7 bits. Were they using systems that didn't use 8-bit bytes? Why wouldn't the dates work from, say, 1900 to 2155?


Back in the day it was probably 7 bits, but the word size is not that important. The problem still exists today with a modern 64 bit computer:

Even if a system internally can store a timestamp with nanosecond precision since the beginning of the universe, all that precision is lost when communicating with another system if it must send the timestamp as a six-character string formatted as "yymmdd" in ASCII.
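
A tiny C illustration of that interchange problem (strftime's "%y" is exactly the two-digit-year habit in question):

    #include <stdio.h>
    #include <time.h>

    /* Sketch only: even with a roomy time_t internally, formatting the
       timestamp as a six-character "yymmdd" string throws the century away. */
    int main(void) {
        time_t now = time(NULL);
        char buf[7];
        strftime(buf, sizeof buf, "%y%m%d", localtime(&now));
        printf("%s\n", buf);   /* e.g. "170612" -- but which century? */
        return 0;
    }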


My understanding is that the actual number of bits used would generally have been 4 bits per digit, as they were using Binary Coded Decimal [1]. So dropping the 19 would save you a byte per date.

Sure, they could have used a custom encoding. But that increases maintenance cost and extra development work. All to solve a problem that nobody cared about at the time.

[1] https://en.wikipedia.org/wiki/Binary-coded_decimal


Just seems weird. A 2-byte day count would give a huge range and use 2 bytes for the full date vs 3 bytes for a BCD yymmdd representation.


There are a lot of weird things...

You are assuming 8 bits per byte, but a byte can be any number of bits.

With two bytes of 7 bits each, the range is only about 45 years. It is also impractical when the storage medium is punch cards, and the system's adder unit only counts in binary coded decimal.


But then you need special code to decode that. Code that you have to write yourself, or borrow and include in your program. Remember, no shared libraries. And it means extra CPU time every time you have to display a date. Whereas BCD has special hardware support.

It means that data interchange is now much more complicated too. How do you get everybody to agree on the same 2-byte representation for dates? This is the 1960s, so you can't just email them. You have to have somebody type up a letter and mail it. Or if you want to get on the phone, a 3-minute international call will cost $12, which is about $100 in 2017 dollars.

Plus then you can't look at a hex dump or a punch card or front panel lights and see the date, so now you've made debugging much harder.


A 2-digit date typically used two bytes, one for each digit. COBOL and all.


Good question! Nobody was writing 1900 + (year % 100), right?


For example, some systems stored the year in a byte, and when printing out a report it printed "19" and that byte - so year 1999 would be followed by year 19100.

Some systems, where storing numbers as columns of characters was common practice (COBOL idiomatic style?), stored the date as two digits (possibly BCD), so the possible range is 00-99 no matter how many bits are used.
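
A minimal C sketch of that first pattern, the one that produces "19100" (purely illustrative, not any particular system's code):

    #include <stdio.h>

    /* Sketch only: the year is stored as "years since 1900" and printed by
       gluing a literal "19" in front of it. Fine until 1999, then: 19100. */
    int main(void) {
        int years_since_1900 = 100;   /* i.e. the year 2000 */
        printf("Copyright 19%d\n", years_since_1900);   /* -> Copyright 19100 */
        return 0;
    }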


Some people were.

But it's worse than that. In the 90s a lot of code used 16-bit values - character strings. That is, it stored a char(2), parsed it as a 2-digit number and then converted it to a full year by adding 1900.

So it was only really "saving space" when compared with storing a char(4).


But if they wanted to save space, why not store an 8-bit number? I imagine it must have something to do with punch card compatibility or some binary-coded-decimal nonsense. Still seems inefficient.


If a system gives you two options for storing a date (using 2-digit or 4-digit years), how many dates do you need to store and use in calculations before you end up saving space by creating a new data type and all of the supporting operations to make the storage of the date itself more efficient? In recent years, it's more common to make this type of decision because something else is causing an issue, otherwise we rarely consider the space required for a date (and many languages no longer have a separate type for dates).


Maybe because they didn't know better? I was going to say "maybe they were bad programmers", but likely just "average" programmers.

No punch cards or BCD, I'm talking about DOS/Windows systems.


- Who among us is writing code today worried about how it will be used in 2052?

The good programmers who are fortunate enough to be working on software that matters.


I doubt that's true, unless you mean it tautologically.

There are plenty of good programmers working on software that matters that should absolutely not be trading off hazy possible benefits in 2052 for significant costs now.

It's occasionally necessary. When I wrote the code for Long Bets [1], I took a number of prudent steps to make sure things would have a good shot at surviving for decades. But I only took the cheap ones; the important thing was to ship on time.

And I think that's the right choice for most people. Technological change has slowed down some, but 35 years is still an incredible amount of volatility. Betting a lot of money on your theories of what will be beneficial then is very risky.


> There are plenty of good programmers working on software that matters that should absolutely not be trading off hazy possible benefits in 2052 for significant costs now.

I guess it's not obvious, but I think there's really a continuum here. You don't necessarily need to write software that will run perfectly in 2052, but it'd be good if you wrote software that can be comprehended, adapted and altered later on. Maintainability is never a "hazy benefit." (If the problem isn't a total throwaway.)


Sure. Maintainability pays off relatively soon, and often makes systems simpler and cheaper to operate. But the topic in this sub-thread is the Y2K bug, where the proposed solution would have been expensive and provided no benefit for 35 years. And at the time, those benefits would have been very hazy.


I don't know, I think the attitudes that make you a good programmer mean you won't be satisfied leaving broken code in your product, no matter how far out the consequences are.


> "Technological change has slowed down some"

wat


It's definitely true. Technical change in the 60s was enormous. During that period there was a lot of fundamental architectural change and experimentation; that's when they settled on 8-bit bytes as the standard, as well as many other things. Moore's Law became a thing. The 70s is when we started seeing operating systems that look familiar to us, and even into the 80s it was plausible to introduce a new OS from scratch (see, e.g., NeXT, or Be).

But Moore's Law is now dead:

https://www.technologyreview.com/s/601441/moores-law-is-dead...

Per-thread performance has been basically flat after decades of exponential gain:

http://preshing.com/20120208/a-look-back-at-single-threaded-...

The iPhone is a decade old; every phone now looks like it, and it's highly plausible that they'll look basically the same a decade from now, possibly much longer. Laptops are 30 years old; they've gotten cheaper, faster, and better, but are recognizably the same. HTML is coming up on 30, and it will be in use until long after I'm dead. TCP is nearly 40; Ethernet is over 40; even Wifi is 20.

So it's just easier now to guess what programmers will be doing in 35 years compared to 1965.


In my experience it's not the good systems that stick around - they're cleanly replaceable.

It's the quick hacks and bodges that stick around forever.


As someone who just wrote a quick hack for a temporary problem, I agree.

It's not just the shitty programmers who do this. Sometimes we have shitty product managers who won't push back against this kind of thing. And you're forced into creating something evil because most of the job is very good, but this one time, you have to suck it up.

My response to that, though I agree with you, is that when a supervisor or PM or whoever gets on you about something you know is bad, you negotiate.

"Yes, I'll do this for now because the company needs it now. But only if you guarantee me the time (and possibly people) to do it right later."

You get agreement in an email, create the ticket and assign it to yourself as mustfix two months from now. And you shove it down their throats.

That's not an ideal place to work if you have to do that, but I have worked at those places and this is how you deal with that situation.

"Yeah, I'll give you a shit solution in 1 day right now. But only if you give me a a couple of weeks for a good solution later."

In reality, I've mostly only had to deal with this situation in startups. Mid-level and mature companies are usually open to pushing back and getting things right. But there are exceptions. Today was an exception. But that's also one of the reasons I don't really want to work at startups anymore.


Shitty solutions are usually the right answer. At least in the areas I work in (mostly startups). I would estimate 99% of the code I write gets thrown away. Most of it is trying something out. Even for code that was intended to hit production, the company/project often gets cancelled before it ever hits production.

I'm not saying this is true in your case. But there are so many different classes and types of programmers and projects that it's hard to generalize.


Nothing about this comment is okay.

99% of your shit code isn't getting thrown away. It's sticking around making life hell for people like me.

Stop writing shit code because it's going to get thrown away. If you work for startups, you are always operating in protoduction mode. Everything you write ends up in prod.

Write code that doesn't suck. It doesn't have to be perfect or optimal, but make it not suck before you push.


Hmm no. That's what's happening in your world, but you're imposing that world view on me.

Probably about 80% of the code I write doesn't even get looked at or used by another developer. If the technique/analysis proves useful, it gets rewritten/refactored. That has the added advantage that I then understand the model better.


Yeah, same here.

For me there's a giant difference between code that lasts, which needs to be sustainable, and disposable code, which doesn't. I'm also very big on YAGNI; my code gets so much cleaner and more maintainable when I'm only solving problems that are at hand or reasonably close. Speculative building for the future can get insanely expensive: there are many possible futures, but we only end up living in one.

Indeed, I think a "do it right" tendency can prevent people from really doing it right. If we invest in the wrong sorts of rightness up front, we can create code bases that are too heavy or rigid to meet the inevitable changes. So then people are forced into different sorts of wrongness, working around the old architecture rather than cleaning it up.


Good for you. That's my approach, too. And to rig the system such that technical debt gets cleaned up continuously and gradually without the product managers knowing the details.

When there are real business reasons to rush something, I'm glad to support that by splitting the work like you suggest. But the flipside is them recognizing that not every thing is an emergency, and that most of the time we have to do it right if they don't want to get bogged down.


Well, yeah, I absolutely agree. Replaceability and maintainability go hand in hand in a system. It's a cruel irony that the code that sticks around, often sticks around because it's crap.

(that doesn't stop me from sometimes having a weird admiration for incomprehensible software kept going forever with weird hacks. It's like with movies, sometimes they're so uniquely awful that you have to admire the art of them)


In the 80s I questioned the practice of using two bytes for the date. I was laughed at by the experienced programmers. They said the software would be rewritten by then. It should have been, but it wasn't... But there is a trade-off between how much time you spend today vs future compatibility.


It really depends on how much data and how old.

My dad's first programming job was initially to mechanically change how variables were stored, thus saving 1 and only 1 byte of disk space. Someone ran the numbers, and having someone do that for a few months was a net savings.

A few years after that he was talking about some relatively minor optimizations that saved a full million dollars worth of hardware costs by delaying a single new computer purchase.


Regarding the point I was speaking to: it's undoubtedly true that some of those hacks were initially worthwhile. Keeping them going until the year 2000 was, by that same standard of cost effectiveness, pretty definitely a visible failure.


You keep them going because of status quo bias / not touching a working system.


How many bytes of memory were saved? 32 bits multiplied by potentially trillions of times, considering storage was not dirt cheap as it is now...


Self-confidence as a programmer is when starting a new project, storing the transaction ID as a long rather than an int...


> Self-confidence as a programmer is when starting a new project, storing the transaction ID as a long rather than an int...

uint64_t even

Or a UUID as others have suggested.

Technically the C spec doesn't say exactly how many bits int, long and long long must be. If you want specific sizes and your code to be somewhat portable, use the specific bit-size types to make that clear. There are also types for size-like things (size_t) and pointer- and offset-like things.
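
A short sketch of what that looks like in practice (the game_id name is just an illustration, not anything from the real app):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: fixed-width and purpose-built types instead of bare int/long. */
    typedef uint64_t game_id;   /* explicitly 64 bits, even on a 32-bit build */

    int main(void) {
        game_id id = 2147483649u;   /* already past the signed 32-bit limit, still fine */
        size_t count = 1;           /* size-like things get size_t */
        printf("game %" PRIu64 " (%zu stored)\n", id, count);
        return 0;
    }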


> If you want specific sizes

I would go further and say you should _always_ use specific sizes, unless forced otherwise. There's no reason not to.


There's a usecase for lower-bounded types such as int_least32_t, where the compiler may choose a larger type if it offers better performance. However, if you're using that, the test suite should run all relevant tests for multiple actual sizes of that particular type (through strategic use of #define, for example).


> There's a usecase for lower-bounded types such as int_least32_t, where the compiler may choose a larger type if it offers better performance.

If you're looking for the best performance you shouldn't use leastX types; you should use fastX types (e.g. int_fast32_t for the "fastest integer type available in the implementation, that has at least 32 bits").

The difference between "leastX" and "fastX" is that "leastX" is the smallest type in the implementation which has at least X bits. So if the implementation has 16, 32 and 64b ints and is a 32b architecture, least8 would give you a 16b int but fast8 might give you a 32b one.


The reason is that anything else is using default types and you lose your safety battle in each:

  int32_t x = call_returning_int();
line. Otherwise, you assert/recover on each me-they borderline. C is a language where you have absolutely no guarantee that int or constants defined as int will fit into anything beyond int, long or long long, and then there is UB patiently waiting for your mistake. The method of handling that is to never change or fix types unless you have to, and then be careful with that.


This is a perfect use for the C++ auto feature:

    auto x = call_returning_int();


C had this particular feature for decades.


No, and it still doesn’t.



What he means by that is the old definition of auto, which C++11 deprecated in favour of making it do type deduction instead.

auto foo = func_returning_int(); to my knowledge worked in C because 'auto' was the lifetime keyword - like 'register' - and the default type in C is 'int'.

That's why when you miss a definition in C++ the compiler warns you that there's no default int.


There are reasons not to when creating RDBMS schemas.


There are a gazillion reasons not to. You should _never_ use specific sizes, unless forced otherwise.


Your code is actually less portable if you use types like uint64_t -- if your system doesn't have exactly that type implemented the typedef won't exist. If all you need is a really big number 'unsigned long long' is required to exist and be able to store 0..2^64-1


The C spec does specify minimum sizes: at least 16 bits for int, 32 for long, and 64 for long long.

Best to use the stdint types, just in case.


The naive me of 9 years ago was like "int will last us forever! and it'll save us some space!", only to have to change it to bigint a few months later


Also see: IPv4 vs IPv6.

Remember, these fancy computing devices were built for the rich and the government, not for the average joe. No one thought computing would be this easily accessible.


Except that isn't a great example, because with NAT, IPv4 can still get us oodles of devices.

Can't really use NAT on a primary key...


NAT is a colossal pain in the ass. I shudder to think of the number of man-years that have been spent on NAT traversal. NAT breaks one of the fundamental functions of an IP address - a unique identifier for a network device. It turns a simple identifier into a weird, amorphous abstraction.

IPv6 isn't perfect, but we could have avoided a lot of hassle if we'd started off with it.


Sure. Now ping my computer at 10.3.1.4


Sharding, or a second key, or detecting the iPad and using a lower number with an offset internally? Still requires special handling at one end or another.


"int will last us forever! and it'll save us some space!",

There are still plenty of people with that mindset. Some will even quote "YAGNI" when you question them.


Makes me wonder how many long ints haven't needed a char to store them


It gets better when you realize that on Windows a long is still only 32 bit....


You mean 64bit Windows?

It gets better yet when you realize that on 32bit systems (like in TFA) long usually is 32bit too ;)


It's the smart thing to do - 4.3 billion is not that much.

I had some students who asked me if even a long would be enough to handle exponential growth; after all, it's only twice as big. As a thought experiment I asked them to come up with a time to fill a 32-bit int completely. They came up with roughly a year. Then, as a margin of error, I said: let's assume you have 4.3 billion transactions every second instead. This volume can be sustained for over 100 years, and we're still not in the danger zone yet.
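
A quick back-of-the-envelope check of that classroom estimate in C (numbers only, nothing subtle):

    #include <stdio.h>

    /* Sketch only: years to exhaust a 64-bit counter at 4.3 billion increments
       per second. Works out to roughly 136 years, comfortably over 100. */
    int main(void) {
        double total = 18446744073709551615.0;   /* ~2^64 - 1 */
        double per_second = 4.3e9;
        double years = total / per_second / (365.0 * 24 * 3600);
        printf("%.0f years\n", years);
        return 0;
    }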


Uh it's not twice as big.

    2^32  ~              4,294,967,000
    2^64  ~ 18,446,700,000,000,000,000


One is 32 bits, the other is 64 bits. It occupies exactly twice as much space. It can contain far more information, but it is only twice as big on the harddrive.


2^64-1 is a lot more than 2 times 2^32-1


2 times 32 bits equals 64 bits...


That is, unless you're on Windows.


Or UUID...


I pick UUIDs because experience has taught me that even for the smallest workloads, I'll inevitably have to shard my DB (to partition a shared public cloud from N isolated "enterprise" deployments), and then will inevitably want to do statistical-analysis things that involve ingesting rows from the shards (or log entries referencing those rows), and deduplicating them by ID, without generating false-positive collisions in the process.

The simplest way to do that is to just throw UUIDs at all problems from the start. (ULIDs - https://github.com/alizain/ulid - are better, but there aren't libraries to generate them in literally every language + RDBMS.)


I don't agree with all their reasoning---but ulid still seems like a good idea. (Though the main difference you care about in programming is how they are generated---via timestamps plus randomness, not that they have a different serialization format.)

For some applications you don't want to leak the time. Choose wisely.


So when is it self-confidence and when is it overengineering and invoking the wrath of YAGNI? :P


You know what, you're right. I'm going to change some SERIALs to BIGSERIAL in the database of my side project. Someone has to start believing in it - it may as well be me.


Hey all. Thanks for noticing :P Obviously this is embarrassing and I'm sorry about it. As a non-developer I can't really explain how or why this happened, but I can say that we do our best and are sorry when that falls short.

- Erik, CEO, Chess.com


We can help you understand how and why!

Computers set limits internally on how big numbers can be when they're keeping track of stuff.

Your developers had given each game a number to identify it. So your first game was #1, the 40th game was #40, and so on.

The limit for how big the number could be was a bit over 2 billion, and your players have just now played a bit over 2 billion games, and so that id number suddenly exceeded the computer's internal limit. Specifically, the largest acceptable id was 2147483647, so basically it crashed on game #2147483648, the first id past that limit (notice it's exactly 2^31).

I'm simplifying slightly but that's the idea. It'll be fixable by essentially using a different format for the id number so that the limit is higher, much like telling the computer "use a higher limit for this particular number, it's special."
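
For anyone who wants to see the edge itself, a minimal, self-contained C sketch of the failure mode (the variable names are illustrative, not Chess.com's actual code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: a signed 32-bit counter tops out at 2,147,483,647. */
    int main(void) {
        int32_t game_id = INT32_MAX;             /* the last ID that fits */
        printf("last id that fits: %" PRId32 "\n", game_id);

        /* Widen before adding: incrementing the int32_t itself would be
           signed overflow, which is undefined behaviour in C. */
        int64_t next_id = (int64_t)game_id + 1;
        printf("the next id needs 64 bits: %" PRId64 "\n", next_id);
        return 0;
    }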


Yes - I understand HOW it happened, just not sure WHY. Meaning, I'm not sure what the developer was thinking, and at this point, I'm not going to track down exactly who it was and point fingers. I think everyone has learned enough through this highly interesting bug. It certainly was interesting to see the slack room exploding with theories and debugging. A new iOS client has been submitted to Apple (hurry plz!!!), and a server fix is also in QA now. Fun problems to have......


This is a fairly easy mistake for a novice developer to make. It even has a specific name - integer overflow - https://en.wikipedia.org/wiki/Integer_overflow

The original Pacman crashes at level 256 for the same reason. - http://pacman.wikia.com/wiki/Map_256_Glitch


It's most likely for efficiency and performance reasons. 64-bit doubles the storage requirement of 32-bit and would have an impact on the database's utilization of memory, querying window size, cache, and storage.

Edit: 32 bits worth of games played means about 4 billion games. 4 billion X 4 bytes for 32-bit = 16GB just for the 32-bit ID's. 64-bit ID's would need 32GB for the 4 billion games. I guess memory and storage weren't that cheap back then.


It sounds like it was client side, not server side. Most likely the iOS client was using Objective-C's NSInteger or Swift's Int, just because that's the default choice for most programmers working in that language, and they didn't think it through.


It also could've been just the person picking an int over a long. Is ints vs. longs the first place to look for optimizing efficiency/performance?


On a 32-bit system, a "long" is usually also 32 bits. On a non-Microsoft 64-bit system, a "long" is usually 64 bits. On both 32-bit and 64-bit systems (Microsoft or not), an "int" is usually 32 bits.

If the issue happened only on 32-bit iPads, but not on 64-bit iPads, the programmer probably picked a "long", not an "int". Had the programmer picked an "int", the problem would also happen on 64-bit iPads.


Our iOS app with a Java backend was using long for database IDs on both ends. I was going through the ILP32->LP64 conversion process when I realized we had a pretty serious discrepancy.

I think it's a really easy mistake for the first developer to make (especially because they weren't a C/Obj-C programmer), and then the sort of thing that no one audits after that.


Yeah, Java always uses 64 bits for "long", even on 32-bit systems. Which only adds to the confusion, since it's different from C and C++.

(Another place where Java is confusingly different: "volatile" implies a memory barrier in Java, but not in C and C++.)


> Meaning, I'm not sure what the developer was thinking

A 32bit integer is pretty much the default numeric type for the majority of programming tasks over the last 20 years. Even with 64bit CPUs, 32bit is still a common practice. Probably 99% of all programmers would make the same choice unless given specific requirements to support more than 2 billion values.


It's often not even an explicit choice, it's just default behavior.

Up until recently, Rails defaulted to 32 bit IDs, so there are a ton of apps out there that could have these issues, especially since Rails has always prided itself on providing sane defaults: https://github.com/rails/rails/pull/26266


> 99% of all programmers

Many dynamically typed languages have an automatic change from int to bigint rather than allowing overflow. For example, Python.


Others, like JS and Lua, just use doubles, meaning they'll never overflow - instead every 2 numbers start to be considered equal. Then a while after that every 4, etc. Not exactly optimal behavior when using incrementing IDs.


However, doubles represent integers exactly up to 53 bits, so that's still quite a lot before you run into that problem.
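
A quick C illustration of where that precision runs out:

    #include <stdio.h>

    /* Sketch only: doubles carry 53 bits of integer precision, so every whole
       number up to 2^53 is exact, and the very next one collides with it. */
    int main(void) {
        double limit = 9007199254740992.0;   /* 2^53 */
        printf("2^53 == 2^53 + 1 ? %d\n", limit == limit + 1.0);     /* 1 (true)  */
        printf("2^53 - 1 == 2^53 ? %d\n", (limit - 1.0) == limit);   /* 0 (false) */
        return 0;
    }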


I don't think you do understand; you sound like you're upset that a developer "set" this limit, when in reality it's tied to fundamental programming principles. It wasn't really a conscious decision to say, "I'm only going to account for 2bil games"...


As a consolation: Over 2 billion games played - congratulations!


Probably when this was initially developed, nobody thought you'd ever go over 2 billion games. This error is brought to light by your success and popularity.

Computer history is riddled with assumptions like that. The Y2K problem, Unix dates running out in 2038, and 32 bit computers unable to address more than 4 GB of memory are just the big ones. It's everywhere. Smaller software projects are generally built for what you need right now, and less for what might happen in the distant future.

Ideally you want to retain some awareness that this is an issue so you can start working on it once you go over a billion games, but in a small company there are probably always more urgent things to worry about, and nobody ever gets around to fixing this technical debt.


> Meaning, I'm not sure what the developer was thinking, and at this point, I'm not going to track down exactly who it was and point fingers.

As a developer, this sentence made my skin crawl.


2 billion is a very large number that was probably not envisioned as reachable in the near future - as a programmer I'd argue this is a pretty easy mistake to make, and that while (slightly) embarrassing, it's a good learning moment.

It's also really awesome that you're here, and that you guys were so honest about the nature of the bug - this is really something that should be encouraged.


Maybe we should start a blog about all of the interesting bugs and challenges we encounter. It certainly is white-knuckle pretty often when running at scale. The number of devices, connections, features... I'm aging prematurely :P


A few articles would definitely be appreciated. Might even help with recruiting fresh blood.


Agree with Aloha. I wouldn't be too hard on the programmer (also, if I understand correctly it's not a database issue, but only with the 32-bit iOS client). I'd pat him on his back and say “you didn't think we'd get this big, eh?” ;-)


> 2 billion is a very large number that was probably not envisioned as reachable in the near future

I disagree. Simple napkin calculation: 100 million players playing 40 games each per year (about 1 per week) over 5 years = 10 billion unique games.

As others pointed out it was likely not a miscalculation, just a lack of calculation. The bug occurred only in the client and the decision to use a smaller data type was likely not a conscious one.

In any case, I wouldn't hold it against an individual programmer. But arguably this sort of bug indicates your development process has flaws (not enough testing, code reviews, etc).


It's an honest mistake, things like this happen. Not the end of the world, especially for an otherwise great product like chess.com.

Go easy on your eng team ;)


Thanks. I'm a pretty understanding "boss", especially on the heals of reaching the 2 billion games milestone :D Our team is awesome and we love what we do. Unfortunately we're still a bunch of humans sitting at kitchen counters and on couches around the world, so things do sometimes fall in between the cushions...


> on the heals of reaching the 2 billion games milestone

That's a very poetic typo


True. Only time (and grammatical errors) can heel :P


It's a beautifully simple problem and one born of success.


Indeed. I'm not sure that anyone here at Chess.com at the beginning thought we would hit a billion games played in our lifetime. But I guess after 10 years....


To put things in perspective, 2 billion games in 10 years is half a million games per day on average over the 10 years. Considering you didn't start at that rate and that it's an average, it means you have way more than half a million games per day now. (that's also more than 6 per second!)

Congrats on such a success.


Think of a mechanical odometer, and how it only has a certain number of digits. Eventually you'll hit 999,999 miles, and on the next mile, everything will roll over to 000,000.

Same deal here. 32-bit numbers are stored as 32 switches, starting from

    0000 0000 0000 0000 0000 0000 0000 0000
which is 0, to

    1111 1111 1111 1111 1111 1111 1111 1111
which is 4,294,967,295. Since the 32-bit iOS version of Chess.com apparently uses 32-bit numbers to store each game's unique ID, that means you can have 4,294,967,295 games.

So what happens on game 4,294,967,296? Just like the odometer, everything rolls back to 0, and things start breaking because the program gets confused.

Pretty common problem, really. The fix would be to use a 64-bit number, which doubles the number of binary digits.


"This was obviously an unforeseen bug that was nearly impossible to anticipate..."

Snarky... Except that there were probably years of games to notice that you were approaching a "magic number" like 2^31.


I actually read that quote as fully sarcastic.


As expected, sarcasm always translates correctly into textual form.


It's weird that you say that, because I always felt sarcasm didn't translate well to text.


I think it was sarcasm


Sarcasm is recursive.


Yours is the first comment in this chain that I can say pretty confidently isn't sarcasm. So it kind of breaks the chain, making the sarcasm in this chain non-recursive. Which means maybe you were being sarcastic after all? Actually I don't even know if my own comment is sarcasm or not.


Your comment comes across as sincere. Mine was sincere, but an overstatement. I should have said that sarcasm tends to be recursive, until broken by sincerity. Anyway, here's my take on the chain:

SomeHacker44 -- sincere

CGamesPlay -- sincere

blktiger -- sarcastic

i_cant_speel -- mildly sarcastic

jazoom -- sincere


Yes because what we have always known about sarcasm and what this thread is a perfect example of is how you can define something as sarcastic/not sarcastic just by how it subjectively "comes across".


Yes, you can guess. But assessment depends strongly on context. And Pow's Law still applies. People can write messages that seem sincere, and then later claim sarcasm. As in "I was only joking". Or people can write sincerely, but come across as sarcastic, or vice versa, and yet be ambiguous enough that readers can't tell. That's where the /s flag helps. Done intentionally, such ambiguous messages can probe the reader's state of mind. Or set traps.

But maybe you're just being sarcastic ;)


It's the base case


Poe's law.


That's not snark. That's just self-doubt, expressed. "We never thought we'd host 2^31 chess games."

EDIT: Apparently they already said that: https://news.ycombinator.com/item?id=14540509


On the contrary, this is pretty much the perfect example of the completely anticipatable bug.


I recently experienced a nasty bug with BLOB in MySQL. The software vendor was storing a giant JSON document containing the entire config in a single cell. It ran fine for months, and then when it was restarted it totally broke. The reason: the JSON had been truncated in the database the entire time, so it was gone forever. Everything had only been working because the config was still held in memory on the local system. Nasty!


MySQL's silent data truncation is such a nuisance. It's off by default in 5.7, and can be disabled in earlier versions by adding STRICT_ALL_TABLES/STRICT_TRANS_TABLES to sql_mode [1].

I inherited a system where, among other things, the entire response body from a payment gateway callback is saved into a text field using utf8 character set, despite the fact that most of the supported payment gateways send data in iso-8859-x (and indicate the used charset inside the body itself, how's that for a chicken-and-egg problem). Of course when the data gets truncated due to not actually being utf8, nobody notices. Fun times.

[1]: https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sql-mo...


> MySQL's silent data truncation is such a nuisance.

Yes, yes it is - it burned me so badly (catastrophic, unrecoverable production data loss) in the early days of my career (~15 years ago as a junior level dev in a senior level role) that it has forever colored my opinion of MySQL - I will really never trust it again.

Long live PostgreSQL!


Long live reading documentation. Otherwise you risk data loss no matter the DB.


At the time silent truncation was an undocumented "feature", IIRC. I think I found a description of the issue buried on some forum somewhere.


Early web had these issues as well: server sends response with Content-Type: text/html; charset=win-1251 (or no charset), but body contains meta charset=utf-8. MSIE worked around that in IE4 by comparing the letter frequencies to a hardcoded table and guessing the charset from there. It sort of worked, 80/20.


For those, like me, who didn't recall off the top of their heads: a BLOB in MySQL can only hold ~64KB.

EDIT: Though I am curious why MySQL doesn't throw an error when you try to store more than 64KB in BLOB?


The BLOB size limit is a nuisance, the silent truncation is the bug...


It'll give a warning. Many/most MySQL drivers won't interpret it as an error, though.


Huh, I thought BLOB was a backronym for Binary Large OBject, not Binary Medium Object.


There are LONGBLOB, TINYBLOB, etc just like the equivalent TEXT fields


MySQL's UNIX_TIMESTAMP() breaks on dates past January 2038 (the 32-bit Unix time limit)... so plan for that one too.


OpenBSD[1]:

> time_t is now 64 bits on all platforms.

Linux[2]:

> The vast majority of 64-bit hosts use 64-bit time_t. This includes GNU/Linux, the BSDs, HP-UX, Microsoft Windows, Solaris, and probably others. There are one or two holdouts (Tru64 comes to mind) but they're a clear minority.

This is little help for older, already deployed systems, of course.

[1]: https://www.openbsd.org/55.html

[2]: https://mm.icann.org/pipermail/tz/2004-June/012483.html


Good to know. I was, though, specifically talking about a function inside MySQL.


This problem is more a programming underestimation than an actual limitation of a 32-bit CPU (which can happily process numbers or IDs that are arbitrarily big if you have the memory for it and program it correctly).

That said, this is definitely indicative of what's going to happen just over 20 years from now, in January 2038. I mean, we're still cranking out 32-bit CPUs in the billions, running more and more devices, and devs still aren't thinking beyond a few years out. I know of code that I wrote 12 years ago still happily cranking away in production, and there may be some I wrote even longer ago than that out there... and I guarantee I hadn't given two thoughts to the year 2038 problem back then, and I doubt many devs are giving it much thought today.

It's truly going to be chaos.
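
To get a feel for when, here's a minimal C sketch of exactly where a 32-bit signed time_t runs out:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    int main(void) {
        /* A 32-bit signed time_t can count at most 2^31 - 1 seconds past
           the Unix epoch (1970-01-01 00:00:00 UTC). */
        time_t last = (time_t)INT32_MAX;
        char buf[64];
        strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", gmtime(&last));
        printf("32-bit time_t ends at %s\n", buf);  /* 2038-01-19 03:14:07 UTC */
        return 0;
    }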


The sad part is that people are going to look at the lack of a year 2000 event and assume 2038 is going to be a "dud", failing to see all the damn work that went into making sure Y2K was a dud, including a significant share of IT hours and probably a lot of extra support laid in.

I expect 2038 to be a real hell because of the nature of the devices. Y2K was an IT problem, but 2038 will be an embedded-systems problem, and that's going to be a much more painful thing to audit. Moving from the server room to inside equipment and walls is going to be fun.


Why 2038?


Long long time ago, I created a poll on a small website I was maintaining. I didn't expect much traffic, so, not thinking too much about it, I made the ID column a TINYINT (i.e. max value = 255)...

That was a valuable lesson.

(I actually generated most entries myself while testing stuff - live in prod of course - and while there were probably fewer than 255 votes, the AUTO_INCREMENT did its job and produced an overflow).


> "Long long time ago"

Seems you have learned your lesson :-)


Reminds me of the havoc that was caused when Twitter tweet IDs rolled over, forcing every third-party developer to update their apps (and at the time there were a lot of those).

Twitter saw it coming and forced the issue by announcing that at a certain date and time they would manually jump the ID numbers, rather than wait for it to happen at some unpredictable time.


They didn't roll over, they exceeded 2^53-1 which is the max Number which doesn't truncate when treated as an integer in js. The solution was to treat it as a string.

(Or we're thinking of different events, I apologize if so)


Twitter must have been misleading when they communicated the reasons for this change since they did not exceed 2^53-1, nor do they expect to exceed this in the near future.

From a (former) Twitter dev:

> Given the current allocation rate, they'll probably never overflow Javascript's precision nor get anywhere near the 64-bit integer space.

https://twittercommunity.com/t/discussion-for-moving-to-64-b...


Your link discusses 2^64, which applies to languages that have native integer types.

The 2^53 problem was for Javascript, which has no native integer type, and is thus limited by the mantissa size of Number (which is defined as an IEEE double-precision float).

Twitter ids are unsigned 64-bit, since they're generated using Snowflake. That link must pre-date the move to snowflake ids, and is speaking to the count of tweets instead.


I'd be incredibly surprised if they overflowed 9 quadrillion tweets. That's like a million tweets per person on earth.


Haha, sorry, that thought was in reference to the link saying they hadn't overflowed 2^31 yet. Two billion seems believable for tweets as of 2013.


In the dystopian future where everyone's IoT devices communicate by tweeting at each other this might become a real issue.


It's the retweets, dude. /s

On a more serious note, that number won't be reached anytime soon.


Rollover was a confusing word for me to use. I did not mean it in terms of integer overflow. I meant it in accounting terms. As in to roll from one namespace to another.

I could have definitely chosen my words better.


Yep, it would just point to the wrong tweet. Fun stuff. So many of these time bombs.


This reminds me that YouTube changed its view counter from a 32-bit integer to a 64-bit integer due to the popularity of 'Gangnam Style': https://www.wired.com/2014/12/gangnam-style-youtube-math/


That was a joke; it was always a 64-bit integer.


I visited that video specifically because the view counter was jammed at UINT_MAX. There were comments confirming that everyone was now visitor number 4,294,967,295. In fact, it might have been an HN post that brought it to my attention; I totally didn't get sidetracked on YT and end up watching K-pop all afternoon.


Do you have a source for that?


WHAT?


It was a joke, it was always a 64-bit integer.


I really want this not be a joke :(


Do we know when chess.com launched? If so, we can calculate the average number of games being played per second.


Wikipedia says "June 2007", which I'll approximate to 10 years. That gives us 6.8 games per second.


Were they ever expecting a negative number of games? Why a signed integer?


You need to first establish that the type was chosen intentionally before asking why it was intentionally chosen. Otherwise the question is ill-formed.

It looks like they are using PHP/MySQL/Javascript/Flash, with only MySQL having any explicit types.

Even so, an error is often preferable to overflow, which is usually undefined behavior and could lead to a duplicate primary key anyways if it wraps to the first game.

A better question is "why 32-bit over 64-bit", but the site dates back to 2005, when that was the norm, and the question has the same issues.


It's very reasonable. This way they overflow into invalid values instead of zero.


Assuming you're working in a language that defines signed integer overflow. Depending on the language, you can result in undefined behavior, instead. For that reason, I'd go with an unsigned counter, with the first million IDs being invalid or reserved for future use. That way, you get well-defined overflow into an invalid region.
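
Roughly like this, as a C sketch (the names and the one-million cutoff are made up for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #define RESERVED_IDS 1000000u        /* IDs 0..999,999 are never handed out */
    static uint32_t next_id = RESERVED_IDS;
    /* Hands out the next ID, or reports failure once the counter has wrapped
       (well-defined for unsigned) back into the reserved, invalid range. */
    bool allocate_id(uint32_t *out) {
        if (next_id < RESERVED_IDS)
            return false;                /* wrapped past UINT32_MAX: refuse, don't reuse */
        *out = next_id++;
        return true;
    }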


When's the last time you worked with an architecture that didn't use two's complement and roll into negatives on overflow?

Your reserved bottom range is a perfectly good solution. But rolling into negatives seems fine, too.


C/C++ is notoriously headache-inducing on this point. Yes, all the CPU archs you'd reasonably expect to encounter today behave this way. However, because the language standard says signed overflow is undefined, compilers are free to assume that it will never happen and make optimizations that you'd think would be unsafe, but are technically permissible. [1]

[1] https://stackoverflow.com/questions/18195715/why-is-unsigned...


Well that's interesting. I was not aware of any compilers doing this. I wonder if there's a switch in gcc/llvm/msvc/etc to turn this specific behavior off.


https://linux.die.net/man/1/gcc -- Search for `-fstrict-overflow`. And note how it says

  The -fstrict-overflow option is enabled at levels -O2, -O3, -Os. 
In other words, basically every program that you're using is compiled with that option enabled. (Release builds typically use -O2, sometimes even -O3.)


-fstrict-overflow is the opposite of what the parent comment was asking about. You want -fwrapv or -fno-strict-overflow.


I was answering the "not aware that any compilers were doing this" part, hoping that they would be able to answer their second question using the source I linked.


Basically all C and C++ compilers do this.

They do it so they can simplify things like "x < x+1" to "true".


Signed overflow is UB in C.


I'm aware of that. The question was whether any practical architectures will do the "wrong" thing here.


Compilers will do the "wrong" thing. For example, gcc optimizes out the "x >= 0" check in

    for (int x = 0; x >= 0; x++)
making it an infinite loop, because it assumes "x++" can't overflow.
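
A compilable sketch of the same idea, for anyone who wants to try it (exact behavior depends on the compiler and flags):

    #include <stdio.h>
    int main(void) {
        /* Because signed overflow is UB, an optimizing compiler may assume
           "i + 1 > i" always holds, drop the "i >= 0" check, and turn this
           into an infinite loop. With -fwrapv (or, in practice, without
           optimization) i eventually wraps to INT_MIN and the loop ends. */
        unsigned checksum = 0;
        for (int i = 0; i >= 0; i++)
            checksum ^= (unsigned)i;     /* side effect so the loop isn't removed */
        printf("reached only with wrapping semantics: %u\n", checksum);
        return 0;
    }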


Signed overflow is IB, not UB.


Alas, no. Signed overflow is UB and compilers can and will assume that it cannot happen when performing optimizations.


Which is a good thing, assuming you did reasonable testing with UBSan etc. Having to account for things that never happen is a big problem for optimization!


That's a good point. Seems like we could do a bit better than the current state of the art, though. If non-optimized builds trapped on overflow, that would at least give you a better chance of detecting these problems before the optimizer starts throwing out code you meant to keep.


That's not going to work, because you're expecting every client to reject IDs that are too small.


Also note that some languages like Java do not even support unsigned integers.


Compilers can optimize signed integers better. Overflow/underflow on signed integers is undefined behavior, which gives compilers room to optimize. Unsigned ints, on the other hand, are defined for all cases, so you get less optimal code.

Also, you get problems whenever you compare them against signed ints.


Because signed is the default for some reason in most languages, and most developers aren't taught to think critically about how decisions about simple things like data types might affect scalability.


The problem is momentum. I could use unsigned int everywhere, but then I'd have to constantly typecast to int and back anywhere I use a library expecting signed ints. If we all switched to unsigned int by default, everything would make more sense, but we'd all live in typecasting hell during the migration.


Unsigned by default doesn't make more sense than signed by default. The behavior near 0 is surprising; if you underflow you either get a huge value (anything not Swift) or you crash (Swift).

It was a mistake to use them for sizes in C++. Google code style requires using int64 to count sizes instead of uint32 for good reasons.


Not just the default, it doesn't even exist in some languages.


I read somewhere in the swift documentation that unless you have a specific need for a UInt, that Int is preferred even if you know that the value will always be nonnegative. I think compatibility is one reason they give.


How many other examples like this have occurred throughout computing history?


World of Warcraft for quite some time, I think about 4 years, had a "mysterious" limit on the maximum amount of gold that you could accumulate on a single character. It was 214,748 gold, 36 silver and 47 copper... (2^31 - 1 copper in total).

At least they put an actual check in there - you didn't suddenly overrun and wake up with an enormous debt, so I'll give them some credit for that ;-)


IPv4 address space comes to mind.

And I think it isn't uncommon to have to add digits to the front of telephone numbers as regions grow (or telephones become more common) and the number space isn't large enough.


IIRC at least Slashdot and Twitter had to do some non-trivial database migrations because they hit the maximum value of some ID field.


Twitter's was called the twitpocalypse and it was a pretty big deal at the time. I don't know how bad it was internally, but it hit a lot of third-party apps that stored tweet IDs in 32-bit integers.

One interesting aspect of it was that Twitter realized what was going to happen in advance, and artificially pushed their IDs over the edge at a preplanned time so they could have as many people available as possible to work on any problems that appeared.


I think Instagram also recently (year?) passed the 31 or 32 bit mark, as some client lib I was using started failing.



Microsoft Lync has an as-yet unfixed bug where, after about 49.71 days (2^32 milliseconds) of system uptime, it locks the Lync status indicator to "away". The status indicator is unusable thereafter until a full system reboot.

See: https://social.technet.microsoft.com/Forums/en-US/1e76e106-4...


I mentioned it in another comment, but Pac-Man crashes at level 256:

http://pacman.wikia.com/wiki/Map_256_Glitch


A counter overflow factored into the Therac-25's race condition (one of the software's interlocks was a counter that could overflow rather than a flag; if the operator started treatment right as the counter overflowed, the system would proceed rather than refuse).


IIRC the whole "Gandhi is gonna nuke the rest of the world" meme came from such a bug occurring in the world-domination strategy game Civilization 2.


They had meant to give him the lowest aggression rating possible, but accidentally inputted -1, which then looped back the other way to the highest rating possible. Nukes soon followed.


Was it exactly that? I remember it being that you had to create a senate or something (that decreased character aggression by 2 points). Gandhi started with 1, so if you did it with Gandhi, then it'd underflow to the highest possible aggression.


Plenty, Twitter for example.


Well, famously, YouTube's view counter overflowed from Gangnam Style...


That was a joke though and they were prepared.


Fun to read some of other stories where this bit them too (PacMan, WoW, and eBay)! Anyway, new app has been approved by Apple and should be rolling out soooooooooon....

Thanks for all the comments! Always lots to learn from.


So they probably just need to use longs instead of ints. But I'm curious, if you were really stuck with a 32-bit limit on data types, what's your preferred workaround? I'm thinking I'd add another field that represents a partition. Are there other "tricks"?


If you could only use 32-bit data types, you can get 64 bits by using 2 integers together as one long number. The right integer would hit its max, start over at 0, and increment the left integer. Using this idea, you can create a class of numbers with however many bits you want by chaining more ints, as sketched below.
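
For instance, a rough C sketch (the struct and function names are made up):

    #include <stdint.h>
    /* Two 32-bit words treated as one wider counter: "hi" counts how many
       times the fast-moving "lo" word has wrapped around. */
    struct wide_id {
        uint32_t hi;
        uint32_t lo;
    };
    void wide_id_increment(struct wide_id *id) {
        id->lo++;
        if (id->lo == 0)                 /* lo just wrapped from 0xFFFFFFFF to 0 */
            id->hi++;
    }
    /* The pair is equivalent to the single 64-bit value (hi << 32) | lo. */
    uint64_t wide_id_value(const struct wide_id *id) {
        return ((uint64_t)id->hi << 32) | id->lo;
    }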


Cool, yeah, that's what I meant by partitioning, which I guess is more of a database term.


eBay (almost) had this problem and I cannot find any articles about it online. They were rapidly approaching 2^31-1 auctions. So they switched to a larger integer, the switchover went badly, and they were mostly down for 4 days, if my memory serves. This would be like 10+ years ago I think.


A lot of comments, but no one has said what a great time we are living in for chess. So many games online, ready to be analysed and learned from. After Deep Blue, people thought it was the end of chess, but it's only getting better, with computers helping players to improve.

Chess.com is a great site; so are lichess.org and chessable.com. If you like chess, you should check them out.


That's the most successful reason for failure.


The title is probably wrong, off by one.

You probably mean 2^31 -1.


Help me, Off-By-One Kenobi! You're one of my two last hopes!


These are always the best problems to have


The other one to watch out for is the 53-bit javascript integer limit. That caused the twitpocalypse when Twitter tweet IDs hit it. They had to switch to strings in the JSON representation.


The 2009 Twitpocalypse concerned overflow of signed 32-bit IDs (2^31 - 1), not the 53-bit limit. Twitter has not yet hit 53 bits for the raw number of tweets; in fact, they passed 32 bits in 2014 and might not have reached 33 bits yet.

Moving to strings for Javascript was really just safety planning for the future since:

> Given the current allocation rate, they'll probably never overflow Javascript's precision nor get anywhere near the 64-bit integer space.

https://twittercommunity.com/t/discussion-for-moving-to-64-b...


That discussion is about user IDs, not tweet IDs.

Tweet ID from today: 875423039323688960

Number of bits of precision necessary to represent it exactly: 60

Overflowed 53-bit precision long long ago. You can read about it here: https://dev.twitter.com/overview/api/twitter-ids-json-and-sn...
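
The precision loss is easy to demonstrate. The same 53-bit mantissa applies to any IEEE double, so even a C sketch shows it (using the example ID above):

    #include <stdint.h>
    #include <stdio.h>
    int main(void) {
        int64_t a = 875423039323688960LL;   /* the tweet ID quoted above */
        int64_t b = a + 1;                  /* a distinct, adjacent ID */
        /* As 64-bit integers they differ; squeezed through a double (which is
           what a naive JSON parse into a JavaScript Number does) they become
           indistinguishable, because a double carries only 53 bits of integer
           precision. */
        printf("distinct as int64_t: %d\n", a != b);                  /* 1 */
        printf("distinct as double:  %d\n", (double)a != (double)b);  /* 0 */
        return 0;
    }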


And I was just reading Heroku/Django discussing the same issue this morning!

https://groups.google.com/forum/m/#!topic/django-developers/...


Issues like this are not uncommon on Chess.com. I've been playing there since 2008 or 2009. If you read recent comments about issues with the recent "v3" release ... this is more or less to be expected.


> For f sake how are we supposed to Anderstand that. I suppose your French fry maker is broken ?

I didn't expect Chess.com and YouTube to have such a crossover of users. I'm surprised there isn't active moderation on a site this size.


In my experience the chat on chess.com harbors a similar demographic to that of most video games. You'd think that chess would attract a more mature player base, but nope.


The only time I've observed people in real life acting like people in video games is at a chess tournament: constant trash-talking until they lose, then accusations of cheating. You certainly don't see this type at all (or even most) chess events; I think the lack of an entry fee drew them out into the open.


At my local library the loudest people aren't those on their phones or laptops, but the chess players. It really surprised me, considering chess can be played completely non-verbally other than calling "check". Every time I'm there, they constantly argue (most of the time it's because someone wants to take back a move), trash talk to get on each other's nerves, yell across the table to other players in games, and talk loudly as if they were in a park. On one hand I think it's great that the library provides a community space and lets people use their chess sets, but on the other hand, as someone who goes there for quiet, it's very irritating. (I wish they had a game room or something where they could go wild.) Once upon a time libraries had mythical status as a place of silence, to the point where people would shush each other for the smallest noises... I actually stopped going to that library because of noise issues and in general because of its size and limited seating.


Chess played on a computer fits the literal definition of a video game, and one that has near infinite replay value.


What would be the best way to test for this kind of issue in advance? Testing at theoretical limits at all endpoints?


I'm not sure this falls under testing. If you start with an empty database each time you start the test you may never hit this issue.

I think this is more of a capacity planning issue.


My concern is that even if one plans for sufficient capacity, there still needs to be testing to verify that the code actually works as values near the theoretical limit. In this example the database ID was transformed into a 32-bit integer somewhere in the application code.

Usually when I hit some sort of unexpected bug in production I try to think about what type of testing will prevent similar problems in the future.
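
For example, a rough sketch of a boundary-value test; parse_game_id here is just a hypothetical stand-in for whatever code path handles IDs:

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    /* Hypothetical stand-in for the code under test. */
    static int64_t parse_game_id(int64_t raw) { return raw; }
    int main(void) {
        /* Exercise the values straddling each suspicious power of two,
           instead of only ever testing against a near-empty database. */
        const int64_t edges[] = {
            INT32_MAX - 1, INT32_MAX, (int64_t)INT32_MAX + 1,             /* 2^31 */
            (int64_t)UINT32_MAX - 1, UINT32_MAX, (int64_t)UINT32_MAX + 1  /* 2^32 */
        };
        for (size_t i = 0; i < sizeof edges / sizeof edges[0]; i++)
            assert(parse_game_id(edges[i]) == edges[i]);
        return 0;
    }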


Will the Lichess app and platform have this issue? And if not, why not?


Looks like lichess is using strings for IDs so they will not have this issue

https://github.com/ornicar/lila/blob/master/app/controllers/...


> an unforeseen bug that was nearly impossible to anticipate

Hmmm... :)


Real world example of why Apple is killing 32 bit apps on iOS.


This has nothing to do with CPU architecture.


It's a little related. The languages typically used for iOS programming encourage the use of data types whose size matches the CPU architecture's bitness. Thus, careless programmers will end up using 32-bit integer types on 32-bit devices, and 64-bit types on 64-bit devices.

I really doubt this is in any way linked to Apple's reasons for dropping 32-bit, though.
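
A toy C example of how the same source behaves differently on a 32-bit (ILP32) build, where long is 32 bits, versus a 64-bit (LP64) build:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
        const char *game_id_json = "2147483648";   /* first ID past 2^31 - 1 */
        /* On an ILP32 build, long tops out at 2,147,483,647, so strtol clamps
           to LONG_MAX and sets ERANGE; on an LP64 build the same line parses
           the ID exactly. Same source, different behavior. */
        errno = 0;
        long id = strtol(game_id_json, NULL, 10);
        printf("sizeof(long) = %zu, parsed id = %ld, overflow = %s\n",
               sizeof(long), id, errno == ERANGE ? "yes" : "no");
        return 0;
    }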


"The reason that some iOS devices are unable to connect to live chess games is because of a limit in 32bit devices which cannot handle gameIDs above 2,147,483,647."


That is a combination of code and architecture, not architecture alone.


This is simply because they're taking an auto-incrementing integer ID from their database and converting it directly into a native integer on the iOS device, thus breaking it. They could work around this any number of ways.

It's a pretty lame bug, to be honest, and certainly something easily foreseeable, as this wasn't an overnight occurrence.


How so? A 32-bit platform doesn't mean you're limited to 32-bit ints...


"Obviously unforeseen.. impossible to predict." Really? You don't know how to properly store ID numbers?

IMPOSSIBLE to predict.


That's clearly sarcasm.



