It's fascinating... the Y2K problem never materialized, arguably because of the immense effort put in behind the scenes by people who understood what might have happened if they hadn't. The end result is that the entire class of problems is now overlooked, because people see it as having been a fuss over nothing.
I sometimes think it would've been better if a few things had visibly failed in January 2000.
If you were watching closely and knew what to look for in the first couple of months of 2000, the failures were there. But they were generally minor and easy to overlook as Y2K problems.
I spotted something like half a dozen failures in various systems I interact with which I strongly suspected, based on the timing, were Y2K problems that slipped through testing. For example, I received duplicate bills for one of my credit cards for the January 2000 billing period, and then a subsequent apology for the duplicates. They never said Y2K, but the timing was very suggestive.
It's pretty much exactly what I expected from most companies... the big stuff had largely been dealt with, but a few things slipped through which they could dismiss with some hand-waving. The thing that surprised me was that there didn't really seem to be any high-profile disasters at all (like a company that couldn't ship products, an airline that couldn't issue tickets, or whatever)... I figured there'd be at least a couple.
Having spent a year hardening financial systems against Y2K, I was very unimpressed to discover my credit card not working on 1st of January.
The call centre staff told me that this wasn't a Y2K bug, but a year-end bug. As if that was meant to make me feel better about an obvious, grim failure.
> in the first couple of months of 2000, the failures were there.
They were there before then too. Things that could go wrong at midnight on NYE were only one of several classes of problem associated with roll-overs. There were a lot of bugs in things like scheduling applications (and similar system tools) in the run-up to 2000 that the man on the street didn't associate with the Y2K issue, because they didn't happen at that exact moment.
getYear() returns the year offset from 1900; if you use getFullYear() you'll get 2017. Though, given all the additions to ES201*, it would be nice if all the timezone data were in the browsers, so one could get moment-timezone without the relatively large data files.
.getYear() and .setYear() are deprecated in all recent standards [IIRC they were before 2000, I certainly remember them being recommended against as far back as that].
Similar to how much effort goes into dealing with things like dangerous strains of bird flu, only to have people complain about how much money was spent on "nothing" when an outbreak doesn't occur.
This is a whole class of problem - I wonder if there's a name for it. More examples include: talking about welfare being unnecessary because no one is starving. Or people on medications stopping because they feel better (while still on them).
This makes me think of a hypothetical scenario proposed in one of Nassim Nicholas Taleb's books, wherein some senator gets a bill passed in the weeks before 9/11 requiring reinforced doors on all commercial airplane cockpits.
No parade would be thrown for this senator for having prevented 9/11 and likely he'd be castigated for having given airlines an excuse to raise prices due to restrictive government regulations.
That example (while I was reading the book) kind of stuck with me. I think of it almost as preventative maintenance or good software dev practices: you'll get crap from others for 'wasting' time on it, but they don't realise it might just save your ass when you need it most.
That's a very pervasive bias we always hold: praising a player who falls behind and then achieves something, while not paying much attention to one who does it normally. Chaos gets attention!
This reminds me of an essential but overlooked truth examined in this paper: "Nobody Ever Gets Credit for Fixing Problems that Never Happened". It builds a simple system dynamics model and shows the long-term effects of working smarter versus working harder: http://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.p...
The availability heuristic is a mental shortcut that relies on immediate examples that come to a given person's mind when evaluating a specific topic, concept, method or decision. The availability heuristic operates on the notion that if something can be recalled, it must be important, or at least more important than alternative solutions which are not as readily recalled. Subsequently, under the availability heuristic, people tend to heavily weigh their judgments toward more recent information, making new opinions biased toward that latest news.
Dandruff is generally caused by fungi living on the scalp. The fungi eat up your skin and make it dry. The dry skin flakes off.
Dandruff shampoos work by killing the fungi. There are three main types of anti-fungals: zinc pyrithione is used in Head and Shoulders, selenium disulfide in Selsun Blue, and there is a third one, not commonly used, called ketoconazole that you may want to try.
There is also coal tar shampoo, which is not an anti-fungal, instead it slows skin cell growth and sloughs off the dead skin.
Selsun Blue works well for me; I use it daily. If I go off it for a few days the dandruff kicks into high gear.
The thing about dandruff shampoos is that you have to use them with a certain regularity, because even if you kill the fungi, the condition takes a few days to clear up; the damage to your scalp is already done. You need to create a hostile environment for the microorganisms, and that takes time.
Your dandruff is probably caused by only one type of organism that thrives in particular climates. Next time you're there, you might get ahold of two or three shampoos and try them for a week each and see if it clears up. Then you know which is 'your' anti-fungal.
Not merely an insightful repository of analysis and explanations around the more profound aspects of our world, John Ralston Saul's exquisite Doubter's Companion includes some answers - acerbic accusations bundled free - to the more pedestrian issues:
"DANDRUFF: The ANSWER is usually vinegar. To some problems there are solutions.
"What we call dandruff is often the result of a PH imbalance on the skin, which shampoo exacerbates. Wash your hair with a simple non-detergent shampoo, soap, olive oil, beer, almost anything. Rinse. Then close your eyes and pour on some vinegar. The extremely cheap but natural sort—apple cider, for example—is probably best. The smell will stimulate interesting conversations in changing-room showers and your explanation will win you friends. Wait thirty to sixty seconds. Rinse it off. The smell will go away. So will your dandruff.
"All dermatologists, pharmacists and pharmaceutical companies know this simple secret. They don’t tell you because they make money by converting dandruff into a complex medical and social problem. By most professional standards this would amount to legally defined incompetence or misrepresentation.
"Dandruff shampoos that promise to keep your shoulders and even your head clean are harsh detergents and may promote baldness, which ought to constitute malpractice."
Actually, I think I'm mostly annoyed by your use of three pejoratives to try to make one point, and the fact you spelled pseudo wrong just irritates some other part of my brain. :)
Sure, dandruff and 'dry skin flakes' are separate things entirely.
Dandruff is generally regarded to be caused by the fungus Malassezia. This fungus does not enjoy an acidic environment.
(Healthy) human skin tends towards pH 4-5 (there seems to be some debate). Malassezia likes 5-8 (again, some debate). Reducing the pH / increasing the acidity would appear to have some non-pseudo, non-alternative-medicine, non-falsehood foundation.
Wikipedia says one cause of dandruff is a yeast-like fungus. Perhaps pH plays a role in that. Also, vinegar has cleansing properties and may be less drying than detergent, and dry skin is another cause of dandruff.
The word "vinegar" is notably absent from that link. That solely shows that harsh soaps are harsh (which as you may have noticed is axiomatic), not that vinegar isn't.
When I typed the word "may" I meant it. If it's medically established that properly diluted vinegar dries the skin I'd be interested in knowing that. Otherwise, it seems like reasonable speculation to me rather than pseudo-scientific BS.
I suffer from dandruff and tried a bunch of shampoos, but they didn't work well. I saw this tip about vinegar, which I tried for a single day before giving up, thinking it might just be a popular saying. I'll try again and report back.
When you say 'this could be popular saying', do you mean that it may be wrong because it's merely a popular saying?
I'd be curious to know your results, in any case. I've had mixed outcomes from conventional (commercial shampoos) and have also been trying to identify causal factors (worse in winter, especially after a few days of wearing a beanie, worse when I'm staying near a high-pollution area, etc). It's all anecdata, but OTOH not beyond our abilities to thoughtfully analyse.
This unconscious bias went unnoticed by me (as expected, by definition). Thanks for pointing it out :)
So far, my experiments with other shampoos were conducted a bit ad hoc. They're all anecdata. When I try with vinegar, I should try to rule out other causal factors first; I probably didn't do that in my first attempt. Otherwise, we can never be sure.
To be honest, I only went to see a single dermatologist, and my experience was bad. The dermatologist kind of glossed over it and shrugged. I received a shampoo recommendation, but the dermatologist didn't say we should monitor the treatment or anything. From what I saw, she would keep recommending shampoos and I would keep trying until I found one that worked. So I decided to try by myself.
Not my wisest moment. I should probably have seen another dermatologist.
Great book and great advice. Must be about 20 years since I tried this approach. Count me as a data point strongly supporting the use of vinegar to deal with dandruff.
I think this is actually a version of confirmation bias since
> Confirmation bias, also known as Observational selection or The enumeration of
> favorable circumstances is the tendency for people to (consciously or
> unconsciously) seek out information that conforms to their pre-existing view
> points, and subsequently ignore information that goes against them, both
> positive and negative [0]
I think it's selection bias because you sample from the set of perceived problems which are the ones that actually occurred relative to the ones that could have happened.
The opposite of that is frustrating too.. I see it a lot in SF. Something like "We spend $100M/year on homelessness and we have 3,500 homeless. Why don't we just give them the money and do away with the social services!" As if the $100M isn't keeping many thousands more from living on the streets..
Yes, Daniel Kahneman describes this kind of cognitive bias as "what you see is all there is". A simple example given in his book* is to ask the members of any couple (or roommates) what percentage of the household duties each performs. The sum is always above 100%.
*"Thinking, Fast and Slow", the best recommendation I got from the HN crowd
It's a form of (inverse) Survivorship Bias. "None survived, therefore none started out" is an (inverse) extension of "The ones that survived were the ones that started out".
As a former Y2K project manager, I had one thing go BOOM. But it didn't go boom on 1/1/2000; rather on 2/29/2000. Our Y2K program had, IIRC, some 9 likely fail dates including 1/1/2000. The component in question was provided globally and untestable locally, so the best I could do was acquire a copy of the remote site's complete test script and call it good.
I was thrilled that we had something to point to as a "see, this is why we put in so much work". Prior to that we had received lots of criticism about the amounts spent, people hired, blah blah blah.
"Wait, we spent two million dollars servicing your fleet of 100 cars in the last year, and you're complaining that none of them broke down? Why spend all that money on maintenance when they just keep running?"
Are you sure those numbers are right? That's 20k per car per year. At such a cost, running the cars with no maintenance and buying a new car after every breakdown would be cheaper.
They were intentionally inflated - however I probably had Australian pricing in mind which is probably about 50% more expensive than the US ones. Considering the audience I guess I should have said 150 cars instead. :)
A lot of money was spent fixing the Y2K issue. Can't exactly recall how much time I spent myself, but it was a dominant factor as far as IT projects went back then.
Other lawyer: Jury, they saved millions of dollars per year not checking the brakes on their cars, you should award those millions of dollars, and some more millions of dollars to the families whose children died that day.
Maybe it's because of how young I was at the time, but I remember media outlets reporting on the Y2K problem, and showing video of hundreds of older computing devices that were being retired (thrown in landfills) because they wouldn't work properly after the date change happened.
Maybe this isn't a "real" failure, but rather a symptom of some IT departments working diligently to solve the problem before it happened. In any case, I'm curious how inaccurate the televised reality from my youth actually was.
In short and very simple terms, some software stored the date as DD/MM/YY.
It would assume that the year was always prefixed with 19. The problem came when you reached /00: any calculations or software decisions based on that date would be off. Way off.
Some solutions were expanding the date to YYYY, or adding a prefix to new dates: dates without a prefix would be 19xx, dates with prefix X were 19xx, and dates with prefix Y were 20xx.
Another solution was to set the prefix for 2-digit years on a sliding scale, so they are interpreted as a date within a specific 100-year period (date windowing). For example, see the 2029 rule: https://support.microsoft.com/en-us/help/214391/how-excel-wo...
This turned out to be one of the most cost-effective methods of fixing the problem, and was probably one of the most likely to be implemented. This was especially the case in situations involving software which ran closer to the hardware (for example, BIOS or firmware) or on systems where RAM or storage couldn't be increased and/or the change might increase the software's requirements beyond the system's capabilities.
There were a couple of computers in my shop that had a Y2K BIOS issue. Although the BIOS is a sort of software in the form of firmware, many people considered this presentation of a Y2K error a hardware problem.
So, while a Y2K compliance program dealt mostly with software, a complete program went through and tested all hardware in inventory.
The issue was that this was a closed-source problem; otherwise you could easily add an option to switch between 19 and 20 for calculations, or run a pre-2000 instance of the app alongside a post-2000 one.
The problem was that media reports at the time seemed to assume that anything with a computer would completely blow up, instead of rationally thinking through much more benign consequences of a date or calendar being wrong.
The image of a bunch of perfectly capable computers being discarded for no reason describes this media frenzy wonderfully.
> the Y2K problem never came to fruition because - arguably - of the immense effort put in behind the scenes by people who understood what might have happened if they hadn't.
Except many countries that spent no money on upgrading their systems had very few problems too.
My dad was a programmer in the early days. The machines he started on in the 1960s had 8 KB of RAM. Saving a byte then is the equivalent today of saving 1 MB on an 8 GB machine.
Multiply that times, say, the thousands of customer orders you're trying to process and the goofy thing would be burning a lot of additional RAM because it might help somebody 35 years later. Who among us is writing code today worried about how it will be used in 2052?
>Who among us is writing code today worried about how it will be used in 2052?
This decade I knowingly wrote code that will break in 2036 [1]. My supervisor was against investing the time to do it future-proof (he will be retired by 2036), and I have good reason to believe the code will still be around by that time. I don't think I'm the only programmer in that position.
- This decade I knowingly wrote code that will break in 2038
Sure, but how bad was it really? Something you could fix relatively quickly with a little time and money, or an instance of Lovecraftian horror unleashed upon the world like so much COBOL code?
Even then, mix in people switching jobs, losing the knowledge of where or what all these landmines are, and add similar but unrelated issues across your entire codebase. This stuff adds up. I like to at least add stern log messages when we are at 10%-50% of the limitation. It's saved my ass before, especially when your base assumption can be faulty.
In one of those scenarios, where we expected the growth of an integer to last at least 100 years, due to certain unaccounted for pathological behaviors, a user burned through 20% of that in a single day. But we had heavy warnings around this, so we were able to address the problem before it escalated.
Last time this came up, I ran the numbers and the cost of the RAM saved per date stored was hundreds of dollars. Not per computer, or per program, but per date. Comparing total memory sizes doesn't tell the whole story, because RAM for a whole machine is so much cheaper now.
Spending that much money on storing "19" just so your code keeps working in the unlikely event that it's still in use 3+ decades into the future isn't a good tradeoff. Obviously things are different now.
Excellent point. Yeah, the machines my dad started on had magnetic core memory [1]. Each bit was a little metal donut with hand-wound wiring.
And in some ways, even "hundreds of dollars per date" doesn't quite convey it. These machines were rare and fiendishly expensive. In 2017 dollars, they started at $1M and went up rapidly from there. Getting more memory wasn't a quick trip to Fry's; even if you could afford it and it was technically possible, it was a long negotiation and industrial installation.
Another constraint that we forget about is the physicality of storage. Every 80 columns was a whole new punch card. That's a really big incentive to keep your records under 80 characters. Each one of those characters took time to punch. Each new card required physical motion to write and read, and space to store.
It really was a different world. I think a lot of programmers don't understand just how different it was (I barely do), and don't realize that modern principles like programmer time being more expensive than computer time are not universal truths about computing, but are just observations of how things are in recent decades.
The interesting thing about this from an engineering point of view is, you quietly pass a threshold where the clever hack which was worthwhile becomes literally more trouble than it is worth. When that happens is a multivariate problem that we couldn't truly predict at the time of the code's creation. (and when it happens, there might not even be anyone on the payroll thinking about it)
You're calculating what it would cost to store a string representation of a date. Which is silly. You should always convert to a timestamp for storage. You can cram way more info into a single integer than you can with a base 10 string. And the bonus is you verify the date's correctness before storing.
Even a 32-bit int could hold 11 million years worth of dates. And if your software is used for longer than that, you can just change it to a 64-bit long and have software that will outlast the sun.
Silly or not, that's the reality of punch-card based technology (BCDIC, later extended to EBCDIC). Punch cards pre-date electronic computers, and making a relay tabulator set-up work with binary formats is impractical.
As computer hardware grew out of that, it maintained much of the legacy, down to hardware data paths and specialized processor instructions. It was more than a programming convention.
That was the right choice for the era. As mikeash points out, your approach takes more bits and more CPU cycles. But it also takes a computer to decode. Any programmer can look at a punch card, a hex dump, or even blinkenlights and read BCD. Decoding a 32-bit int for the date takes special code. Which you have to make sure to manually include in your program, the size of which you are already struggling to keep under machine limits.
Systems from this era were probably using BCD rather than base-10 strings. A BCD date would take up 24 bits.
Running a complicated date routine to convert to/from 32-bit timestamps would also have cost a huge amount. These machines had speeds measured in thousands of operations per second, and the division operations needed to do that sort of date work would take ages, relatively speaking. All on a machine that cost dozens of times the average yearly wage at the time, and accordingly needed to get as much work done as possible in order to earn its keep.
Sometimes this worry is thrust upon you by the problem domain. I do remember tackling the Y2K38 problem in 2008 - the business logic dictated that the expiration date should be tracked, and some of them were set to 30 years.
But a 2 digit date should take less than 7 bits. Were they using systems that didn't use 8 bit bytes? Why wouldn't the dates work from, say 1900 to 2155?
Back in the day it was probably 7 bits, but the word size is not that important. The problem still exists today with a modern 64-bit computer:
Even if a system internally can store a timestamp with nanosecond precision since the beginning of the universe, all that precision is lost when communicating with another system if it must send the timestamp as a six-character string formatted as "yymmdd" in ASCII.
My understanding is that the actual number of bits used would generally have been 4 bits per digit, as they were using Binary Coded Decimal [1]. So dropping the 19 would save you a byte per date.
Sure, they could have used a custom encoding. But that increases maintenance cost and extra development work. All to solve a problem that nobody cared about at the time.
You are assuming 8 bits per byte, but a byte can be any number of bits.
With two bytes of 7 bits each, the range is only about 40 years.
It is also impractical when the storage medium is punch cards and the system's adder unit only counts in binary coded decimal.
But then you need special code to decode that. Code that you have to write yourself or borrow and include in your program. Remember, no shared libraries. And it means extra CPU time you have to display a date. Whereas BCD has special hardware support.
It means that data interchange is now much more complicated too. How do you get everybody to agree on the same 2-byte representation for dates? This is the 1960s, so you can't just email them. You have to have somebody type up a letter and mail it. Or if you want to get on the phone, a 3-minute international call will cost $12, which is about $100 in 2017 dollars.
Plus then you can't look at a hex dump or a punch card or front panel lights and see the date, so now you've made debugging much harder.
For example, some systems stored the year in a byte, and when printing out a report it printed "19" and that byte - so year 1999 would be followed by year 19100.
Some systems, where storing numbers in columns of characters was common practice (COBOL idiomatic style?), stored the date as two digits (possibly BCD), so the possible range is 00-99 no matter how many bits are used.
But it's worse than that. In the 90s a lot of code used 16-bit values: character strings. That is, it stored a char(2), parsed it as a 2-digit number and then converted it to a date by adding 1900.
So it was only really "saving space" when compared with storing a char(4).
But if they wanted to save space, why not store an 8-bit number? I imagine it must have something to do with punch-card compat or some binary-coded-decimal nonsense. Still seems inefficient.
If a system gives you two options for storing a date (using 2-digit or 4-digit years), how many dates do you need to store and use in calculations before you end up saving space by creating a new data type and all of the supporting operations to make the storage of the date itself more efficient? In recent years, it's more common to make this type of decision because something else is causing an issue, otherwise we rarely consider the space required for a date (and many languages no longer have a separate type for dates).
I doubt that's true, unless you mean it tautologically.
There are plenty of good programmers working on software that matters that should absolutely not be trading off hazy possible benefits in 2052 for significant costs now.
It's occasionally necessary. When I wrote the code for Long Bets [1], I took a number of prudent steps to make sure things would have a good shot at surviving for decades. But I only took the cheap ones; the important thing was to ship on time.
And I think that's the right choice for most people. Technological change has slowed down some, but 35 years is still an incredible amount of volatility. Betting a lot of money on your theories of what will be beneficial then is very risky.
> There are plenty of good programmers working on software that matters that should absolutely not be trading off hazy possible benefits in 2052 for significant costs now.
I guess it's not obvious, but I think there's really a continuum here. You don't necessarily need to write software that will run perfectly in 2052, but it'd be good if you wrote software that can be comprehended, adapted and altered later on. Maintainability is never a "hazy benefit." (If the problem isn't a total throwaway.)
Sure. Maintainability pays off relatively soon, and often makes systems simpler and cheaper to operate. But the topic in this sub-thread is the Y2K bug, where the proposed solution would have been expensive and provided no benefit for 35 years. And at the time, those benefits would have been very hazy.
I don't know, I think the attitudes that make you a good programmer mean you won't be satisfied leaving broken code in your product, no matter how far out the consequences are.
It's definitely true. Technological change in the 60s was enormous. During that period there was a lot of fundamental architectural change and experimentation; that's when they settled on 8-bit bytes as the standard, along with many other things. Moore's Law became a thing. The 70s is when we started seeing operating systems that look familiar to us, and even into the 80s it was plausible to introduce a new OS from scratch (see, e.g., NeXT, or Be).
The iPhone is a decade old; every phone now looks like it, and it's highly plausible that they'll look basically the same a decade from now, possibly much longer. Laptops are 30 years old; they've gotten cheaper, faster, and better, but are recognizably the same. HTML is coming up on 30, and it will be in use until long after I'm dead. TCP is nearly 40; Ethernet is over 40; even Wifi is 20.
So it's just easier now to guess what programmers will be doing in 35 years compared to 1965.
As someone who just wrote a quick hack for a temporary problem, I agree.
It's not just the shitty programmers who do this. Sometimes we have shitty product managers who won't push back against this kind of thing. And you're forced into creating something evil, because most of the job is very good but this one time you have to suck it up.
My response to that, though I agree with you, is that when a supervisor or PM or whoever gets on you about something you know is bad, you negotiate.
"Yes, I'll do this for now because the company needs it now. But only if you guarantee me the time (and possibly people) to do it right later."
You get agreement in an email, create the ticket and assign it to yourself as mustfix two months from now. And you shove it down their throats.
That's not an ideal place to work if you have to do that, but I have worked at those places and this is how you deal with that situation.
"Yeah, I'll give you a shit solution in 1 day right now. But only if you give me a couple of weeks for a good solution later."
In reality, I've mostly only had to deal with this situation in startups. Mid-level and mature companies are usually open to pushing back and getting things right. But there are exceptions. Today was an exception. But that's also one of the reasons I don't really want to work at startups anymore.
Shitty solutions are usually the right answer. At least in the areas I work in (mostly startups). I would estimate 99% of the code I write gets thrown away. Most of it is trying something out. Even for code that was intended to hit production, the company/project often gets cancelled before it ever hits production.
I'm not saying this is true in your case. But there are so many different classes and types of programmers and projects that it's hard to generalize.
99% of your shit code isn't getting thrown away. It's sticking around making life hell for people like me.
Stop writing shit code because it's going to get thrown away. If you work for startups, you are always operating in protoduction mode. Everything you write ends up in prod.
Write code that doesn't suck. It doesn't have to be perfect or optimal, but make it not suck before you push.
Hmm no. That's what's happening in your world, but you're imposing that world view on me.
Probably about 80% of the code I write doesn't even get looked at or used by another developer. If the technique/analysis proves useful, it gets rewritten/refactored. That has the added advantage that I then understand the model better.
For me there's a giant difference between code that lasts, which needs to be sustainable, and disposable code, which doesn't. I'm also very big on YAGNI; my code gets so much cleaner and more maintainable when I'm only solving problems that are at hand or reasonably close. Speculative building for the future can get insanely expensive: there are many possible futures, but we only end up living in one.
Indeed, I think a "do it right" tendency can prevent people from really doing it right. If we invest in the wrong sorts of rightness up front, we can create code bases that are too heavy or rigid to meet the inevitable changes. So then people are forced into different sorts of wrongness, working around the old architecture rather than cleaning it up.
Good for you. That's my approach, too. And to rig the system such that technical debt gets cleaned up continuously and gradually without the product managers knowing the details.
When there are real business reasons to rush something, I'm glad to support that by splitting the work like you suggest. But the flipside is them recognizing that not every thing is an emergency, and that most of the time we have to do it right if they don't want to get bogged down.
Well, yeah, I absolutely agree. Replaceability and maintainability go hand in hand in a system. It's a cruel irony that the code that sticks around, often sticks around because it's crap.
(that doesn't stop me from sometimes having a weird admiration for incomprehensible software kept going forever with weird hacks. It's like with movies, sometimes they're so uniquely awful that you have to admire the art of them)
In the 80s I questioned the practice of using two bytes for the date. I was laughed at by the experienced programmers. They said the software would be rewritten by then. It should have been, but it wasn't...
But there is a trade off between how much time you spend today vs future compatibility.
My dad's first programming job was initially to mechanically change how variables were stored, saving exactly one byte of disk space each time. Someone ran the numbers, and having someone do that for a few months was a net savings.
A few years after that he was talking about some relatively minor optimizations that saved a full million dollars worth of hardware costs by delaying a single new computer purchase.
Regarding the point I was speaking to: it's undoubtedly true that some of those hacks were initially worthwhile. Keeping them going until the year 2000 was, by that same standard of cost effectiveness, pretty definitely a visible failure.
> Self-confidence as a programmer is when starting a new project, storing the transaction ID as a long rather than an int...
uint64_t even
Or a UUID as others have suggested.
Technically, the C spec doesn't say exactly how many bits int, long, and long long should be. If you want specific sizes and somewhat portable code, use the explicit-width types (int32_t, uint64_t, and friends) to make that clear. There are also types for size-like things (size_t) and for pointer- and offset-like things (intptr_t, ptrdiff_t).
There's a usecase for lower-bounded types such as int_least32_t, where the compiler may choose a larger type if it offers better performance. However, if you're using that, the test suite should run all relevant tests for multiple actual sizes of that particular type (through strategic use of #define, for example).
> There's a usecase for lower-bounded types such as int_least32_t, where the compiler may choose a larger type if it offers better performance.
If you're looking for the best performances you shouldn't use leastX types, you should use fastX types (e.g. int_fast32_t for the "fastest integer type available in the implementation, that has at least 32 bits").
The difference between "leastX" and "fastX" is that "leastX" is the smallest type in the implementation which has at least X bits. So if the implementation has 16, 32 and 64b ints and is a 32b architecture, least8 would give you a 16b int but fast8 might give you a 32b one.
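The width guarantees under discussion can be poked at from Python, whose ctypes module mirrors the platform's C types (a quick illustration, not a substitute for reading the C standard; there is no ctypes equivalent of the least/fast families):

```python
import ctypes

# Exact-width types have guaranteed sizes on every platform.
assert ctypes.sizeof(ctypes.c_int32) == 4    # always 32 bits
assert ctypes.sizeof(ctypes.c_uint64) == 8   # always 64 bits

# Plain C types only carry minimum guarantees; their real width varies
# with the platform's data model (ILP32, LP64, LLP64, ...).
long_bits = ctypes.sizeof(ctypes.c_long) * 8
int_bits = ctypes.sizeof(ctypes.c_int) * 8
```

On an LP64 Linux box `long_bits` is 64; on Windows (LLP64) it is 32, which is exactly why code that assumes "long is 64 bits" breaks when ported.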
The reason is that anything else mixes with the default types, and you lose the safety battle on every line like:
int32_t x = call_returning_int();
Otherwise you have to assert/recover at each boundary between your types and theirs. C is a language where you have absolutely no guarantee that int, or constants defined as int, will fit into anything beyond int, long, or long long, and then there is UB patiently waiting for your mistake. The way to handle that is to never change or fix types unless you have to, and then be careful when you do.
What he means by that is the old definition of auto, which C++11 deprecated in favour of making it do type deduction instead.
auto foo = func_returning_int(); worked in C, to my knowledge, because 'auto' was a storage-class specifier - like 'register' - and the default type in C was 'int'.
That's why, when you omit a type in a C++ declaration, the compiler warns you that there's no default int.
Your code is actually less portable if you use types like uint64_t - if your system doesn't have exactly that width implemented, the typedef won't exist. If all you need is a really big number, 'unsigned long long' is required to exist and to be able to store 0..2^64-1.
The naive, 9 years in the past me once was like "int will last us forever! and it'll save us some space!", only to have to change it to bigint a few months later
Remember, these fancy computing devices were built for the rich and the government, not for the average Joe. No one thought computing would become this easily accessible.
NAT is a colossal pain in the ass. I shudder to think of the number of man-years that have been spent on NAT traversal. NAT breaks one of the fundamental functions of an IP address - a unique identifier for a network device. It turns a simple identifier into a weird, amorphous abstraction.
IPv6 isn't perfect, but we could have avoided a lot of hassle if we'd started off with it.
Sharding, or a second key, or detecting the iPad and using a lower number with an offset internally? Still requires special handling at one end or another.
It's the smart thing to do - 4.3 billion is not that much.
I had some students who asked me if even a long would be enough to handle exponential growth; after all, it's only twice as big. As a thought experiment I asked them to come up with a time to fill a 32-bit int completely. They came up with roughly a year. Then, as a margin of error, I said: let's assume you have 4.3 billion transactions every second instead. A 64-bit counter can sustain that volume for over 100 years, and we're still not in the danger zone yet.
One is 32 bits, the other is 64 bits, so it occupies exactly twice as much space. But it can represent vastly more values - 2^32 times as many - while being only twice as big on the hard drive.
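A quick sanity check of the arithmetic in both comments (the per-second rate is the hypothetical from the thought experiment, not a real workload):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Doubling the width from 32 to 64 bits squares the number of values:
assert 2**64 == (2**32) ** 2   # i.e. 2**32 (~4.3 billion) times as many ids

# Even at 2**32 (~4.3 billion) transactions per second, a 64-bit
# counter takes 2**32 seconds to exhaust, well over a century:
years_to_fill = 2**64 // 2**32 // SECONDS_PER_YEAR
assert years_to_fill == 136
```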
I pick UUIDs because experience has taught me that even for the smallest workloads, I'll inevitably have to shard my DB (to partition a shared public cloud from N isolated "enterprise" deployments), and then will inevitably want to do statistical-analysis things that involve ingesting rows from the shards (or log entries referencing those rows), and deduplicating them by ID, without generating false-positive collisions in the process.
The simplest way to do that is to just throw UUIDs at every problem from the start. (ULIDs - https://github.com/alizain/ulid - are better, but there aren't libraries to generate them in literally every language + RDBMS.)
I don't agree with all their reasoning, but ulid still seems like a good idea. (Though the main difference you care about in programming is how they are generated - via timestamp plus randomness - not that they have a different serialization format.)
For some applications you don't want to leak the time. Choose wisely.
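A sketch of the shard-merging point above using Python's stdlib uuid module; random (version 4) UUIDs carry 122 bits of entropy, so independently minted IDs can be merged without realistic collision risk:

```python
import uuid

# Two "shards" mint ids independently, with no coordination.
shard_a = [uuid.uuid4() for _ in range(1000)]
shard_b = [uuid.uuid4() for _ in range(1000)]

# Ingest both and deduplicate by id: no false-positive collisions,
# unlike two auto-increment sequences that both start at 1.
merged = set(shard_a) | set(shard_b)
assert len(merged) == 2000
```

Two auto-increment columns would instead produce 1000 colliding pairs here, which is exactly the deduplication hazard described above.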
You know what, you're right. I'm going to change some SERIALs to BIGSERIAL in the database of my side project. Someone has to start believing in it - it may as well be me.
Hey all. Thanks for noticing :P Obviously this is embarrassing and I'm sorry about it. As a non-developer I can't really explain how or why this happened, but I can say that we do our best and are sorry when that falls short.
Computers set limits internally on how big numbers can be when they're keeping track of stuff.
Your developers had given each game a number to identify it. So your first game was #1, the 40th game was #40, and so on.
The limit for how big the number could be was a bit over 2 billion, and your players have just now played a bit over 2 billion games, so the id number suddenly exceeded the computer's internal limit. Specifically, the limit was 2,147,483,647, so it crashed on game #2,147,483,648 - the first id past the last acceptable one.
I'm simplifying slightly but that's the idea. It'll be fixable by essentially using a different format for the id number so that the limit is higher, much like telling the computer "use a higher limit for this particular number, it's special."
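The failure mode can be simulated by reinterpreting the counter as a C-style signed 32-bit value (a sketch of the arithmetic only; the client's actual code is unknown):

```python
def to_int32(n):
    """Interpret an arbitrary integer as a signed 32-bit value."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

INT32_MAX = 2**31 - 1                             # 2,147,483,647

assert to_int32(INT32_MAX) == 2_147_483_647       # last acceptable game id
assert to_int32(INT32_MAX + 1) == -2_147_483_648  # wraps negative: breakage
```

The "different format with a higher limit" fix amounts to widening the counter to 64 bits, which pushes the wrap point out to 9,223,372,036,854,775,807.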
Yes - I understand HOW it happened, just not sure WHY. Meaning, I'm not sure what the developer was thinking, and at this point, I'm not going to track down exactly who it was and point fingers. I think everyone has learned enough through this highly interesting bug. It certainly was interesting to see the slack room exploding with theories and debugging. A new iOS client has been submitted to Apple (hurry plz!!!), and a server fix is also in QA now. Fun problems to have......
It's most likely for efficiency and performance reason. 64-bit doubles the storage requirement of 32-bit and would have impact on database's utilization of memory, querying window size, cache, and storage.
Edit: 32 bits worth of games played means about 4 billion games. 4 billion X 4 bytes for 32-bit = 16GB just for the 32-bit ID's. 64-bit ID's would need 32GB for the 4 billion games. I guess memory and storage weren't that cheap back then.
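The back-of-the-envelope numbers check out:

```python
ids = 4 * 10**9                  # ~4 billion game ids

bytes_32 = ids * 4               # 4 bytes per 32-bit id
bytes_64 = ids * 8               # 8 bytes per 64-bit id

assert bytes_32 == 16 * 10**9    # 16 GB just for the 32-bit ids
assert bytes_64 == 32 * 10**9    # 32 GB for 64-bit ids
```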
It sounds like it was client side, not server side. Most likely the iOS client was using Objective-C's NSInteger or Swift's Int, just because that's the default choice for most programmers working in that language, and they didn't think it through.
On a 32-bit system, a "long" is usually also 32 bits. On a non-Microsoft 64-bit system, a "long" is usually 64 bits. On both 32-bit and 64-bit systems (Microsoft or not), an "int" is usually 32 bits.
If the issue happened only on 32-bit iPads, but not on 64-bit iPads, the programmer probably picked a "long", not an "int". Had the programmer picked an "int", the problem would also happen on 64-bit iPads.
Our iOS app with a Java backend was using long for database IDs on both ends. I was going through the ILP32->LP64 conversion process when I realized we had a pretty serious discrepancy.
I think it's a really easy mistake for the first developer to make (especially because they weren't a C/Obj-C programmer), and then the sort of thing that no one audits after that.
> Meaning, I'm not sure what the developer was thinking
A 32bit integer is pretty much the default numeric type for the majority of programming tasks over the last 20 years. Even with 64bit CPUs, 32bit is still a common practice. Probably 99% of all programmers would make the same choice unless given specific requirements to support more than 2 billion values.
It's often not even an explicit choice, it's just default behavior.
Up until recently, Rails defaulted to 32 bit IDs, so there are a ton of apps out there that could have these issues, especially since Rails has always prided itself on providing sane defaults: https://github.com/rails/rails/pull/26266
Others, like JS and Lua, just use doubles, meaning they'll never overflow - instead, past 2^53 every 2 consecutive integers start to be considered equal, then a while after that every 4, etc. Not exactly optimal behavior for incrementing IDs.
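Python floats are IEEE 754 doubles too, so the effect is easy to demonstrate:

```python
# Below 2**53 every integer is exactly representable as a double.
assert float(2**53 - 1) == 2**53 - 1

# At 2**53 the spacing between doubles becomes 2: adjacent integer
# ids start mapping to the same value instead of overflowing.
assert float(2**53) == float(2**53 + 1)

# Further out the spacing doubles again, to 4, and so on.
assert float(2**54) == float(2**54 + 1)
```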
I don't think you do understand, you sound like you're upset that a developer "set" this limit. When in reality it's tied to fundamental programming principles. It wasn't really a conscious decision to say, "I'm only going to account for 2bil games"...
Probably when this was initially developed, nobody thought you'd ever go over 2 billion games. This error is brought to light by your success and popularity.
Computer history is riddled with assumptions like that. The Y2K problem, 32-bit Unix time running out in 2038, and 32-bit computers unable to address more than 4 GB of memory are just the big ones. It's everywhere. Smaller software projects are generally built for what you need right now, and less for what might happen in the distant future.
Ideally you want to retain some awareness that this is an issue so you can start working on it once you go over a billion games, but in a small company there are probably always more urgent things to worry about, and nobody ever gets around to fixing this kind of technical debt.
2 billion is a very large number that was probably not envisioned as reachable in the near future - as a programmer I'd argue this is a pretty easy mistake to make, and that while (slightly) embarrassing, it's a good learning moment.
It's also really awesome that you're here, and that you guys were so honest about the nature of the bug - this is really something that should be encouraged.
Maybe we should start a blog about all of the interesting bugs and challenges we encounter. It certainly is white-knuckle pretty often when running at scale. The number of devices, connections, features... I'm aging prematurely :P
Agree with Aloha. I wouldn't be too hard on the programmer (also, if I understand correctly it's not a database issue, but only with the 32-bit iOS client). I'd pat him on his back and say “you didn't think we'd get this big, eh?” ;-)
> 2 billion is a very large number that was probably not envisioned as reachable in the near future
I disagree. Simple napkin calculation: 100 million players playing 40 games each per year (about 1 per week) over 5 years = 20 billion player-games, i.e. 10 billion unique games (two players per game).
As others pointed out it was likely not a miscalculation, just a lack of calculation. The bug occurred only in the client and the decision to use a smaller data type was likely not a conscious one.
In any case, I wouldn't hold it against an individual programmer. But arguably this sort of bug indicates your development process has flaws (not enough testing, code reviews, etc).
Thanks. I'm a pretty understanding "boss", especially on the heels of reaching the 2 billion games milestone :D Our team is awesome and we love what we do. Unfortunately we're still a bunch of humans sitting at kitchen counters and on couches around the world, so things do sometimes fall between the cushions...
Indeed. I'm not sure that anyone here at Chess.com at the beginning thought we would hit a billion games played in our lifetime. But I guess after 10 years....
To put things in perspective, 2 billion games in 10 years is half a million games per day on average over the 10 years. Considering you didn't start at that rate and that it's an average, it means you have way more than half a million games per day now. (that's also more than 6 per second!)
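Spelling out that average:

```python
games = 2 * 10**9
days = 10 * 365

per_day = games / days           # ~548,000 games per day on average
per_second = per_day / 86_400    # ~6.3 games per second

assert per_day > 500_000
assert per_second > 6
```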
Think of a mechanical odometer, and how it only has a certain number of digits. Eventually you'll hit 999,999 miles, and on the next mile, everything will roll over to 000,000.
Same deal here. 32-bit numbers are stored as 32 switches, starting from
0000 0000 0000 0000 0000 0000 0000 0000
which is 0, to
1111 1111 1111 1111 1111 1111 1111 1111
which is 4,294,967,295. Since the 32-bit iOS version of Chess.com apparently uses 32-bit numbers to store each game's unique ID, that means you can have at most 4,294,967,295 games - or half that, 2,147,483,647, if the number is signed and one bit is spent on the sign, which matches where things actually broke.
So what happens on game 4,294,967,296? Just like the odometer, everything rolls back to 0, and things start breaking because the program gets confused.
Pretty common problem, really. The fix would be to use a 64-bit number, which doubles the number of binary digits.
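The odometer analogy in code, masking the counter to 32 unsigned bits:

```python
UINT32_MAX = 2**32 - 1            # 4,294,967,295: the "999,999" reading

def next_id(n):
    """Advance a counter stored in 32 unsigned bits."""
    return (n + 1) & 0xFFFFFFFF

assert next_id(41) == 42          # normal ticks
assert next_id(UINT32_MAX) == 0   # the whole odometer rolls to zero
```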
Yours is the first comment in this chain that I can say pretty confidently isn't sarcasm. So it kind of breaks the chain, making the sarcasm in this chain non-recursive. Which means maybe you were being sarcastic after all? Actually I don't even know if my own comment is sarcasm or not.
Your comment comes across as sincere. Mine was sincere, but an overstatement. I should have said that sarcasm tends to be recursive, until broken by sincerity. Anyway, here's my take on the chain:
Yes because what we have always known about sarcasm and what this thread is a perfect example of is how you can define something as sarcastic/not sarcastic just by how it subjectively "comes across".
Yes, you can guess. But assessment depends strongly on context. And Poe's Law still applies. People can write messages that seem sincere, and then later claim sarcasm, as in "I was only joking". Or people can write sincerely but come across as sarcastic, or vice versa, and yet be ambiguous enough that readers can't tell. That's where the /s flag helps. Done intentionally, such ambiguous messages can probe the reader's state of mind. Or set traps.
I recently experienced a nasty bug with BLOB in MySQL. The software vendor was storing a giant json which contained the entire config in a single cell. It ran fine for months, and then when it was restarted it totally broke. Reason was: the json had been truncated the entire time in the database, so it was gone forever. It was only working because it used the config stored in memory on the local system. Nasty!
MySQL's silent data truncation is such a nuisance. It's off by default in 5.7, and can be disabled in earlier versions by adding STRICT_ALL_TABLES/STRICT_TRANS_TABLES to sql_mode [1].
I inherited a system where, among other things, the entire response body from a payment gateway callback is saved into a text field using utf8 character set, despite the fact that most of the supported payment gateways send data in iso-8859-x (and indicate the used charset inside the body itself, how's that for a chicken-and-egg problem). Of course when the data gets truncated due to not actually being utf8, nobody notices. Fun times.
> MySQL's silent data truncation is such a nuisance.
Yes, yes it is - it burned me so badly (catastrophic, unrecoverable production data loss) in the early days of my career (~15 years ago as a junior level dev in a senior level role) that it has forever colored my opinion of MySQL - I will really never trust it again.
Early web had these issues as well: server sends response with Content-Type: text/html; charset=win-1251 (or no charset), but body contains meta charset=utf-8. MSIE worked around that in IE4 by comparing the letter frequencies to a hardcoded table and guessing the charset from there. It sort of worked, 80/20.
> The vast majority of 64-bit hosts use 64-bit time_t. This includes GNU/Linux, the BSDs, HP-UX, Microsoft Windows, Solaris, and probably others. There are one or two holdouts (Tru64 comes to mind) but they're a clear minority.
This is little help for older, already deployed systems, of course.
This problem is more related to a programming underestimation than to any actual limitation of a 32-bit CPU (which can happily process numbers or IDs that are arbitrarily big, if you have the memory for it and program it correctly).
That said, this is definitely indicative of what's going to happen in just 20 years, 6 months and 20 days from now. I mean, we're still cranking out 32bit CPUs in the billions, running more and more devices, and devs still aren't thinking beyond a few years out. I know of code that I wrote 12 years ago still happily cranking away in production, and there may be some I wrote even longer than that out there... and I guarantee I hadn't given two thoughts about the year 2038 problem back then, and I doubt many devs are giving it much thought today.
The sad part is people are going to look at the lack of a year 2000 event and assume 2038 is going to be a "dud", when they fail to see all the damn work that went into making sure Y2K was a dud including a significant portion of IT hours and probably a lot of extra support laid in.
I expect 2038 to be a rare hell because of the nature of the devices. Y2K was an IT problem, but 2038 will be an embedded system problem and that's going to be a much more painful thing to audit. Moving from the server room to inside equipment and walls is going to be fun.
Long long time ago, I created a poll on a small website I was maintaining. I didn't expect much traffic and, so, not thinking too much about it, I put the ID column to be a TINYINT (i.e. max value = 255)...
That was a valuable lesson.
(I actually generated most entries myself while testing stuff - live in prod of course - and while there were probably fewer than 255 votes, the AUTO_INCREMENT did its job and produced an overflow).
Reminds me of the havoc that was caused when Twitter tweet IDs rolled over, forcing every third-party developer to update their apps (and at the time there were a lot of those).
Twitter saw it coming and forced the issue, announcing that at a certain date and time they would manually jump the ID numbers rather than wait for the rollover to happen at some unpredictable moment.
They didn't roll over; they exceeded 2^53-1, the largest integer a JS Number can represent exactly. The solution was to treat the ID as a string.
(Or we're thinking of different events, I apologize if so)
Twitter must have been misleading when they communicated the reasons for this change since they did not exceed 2^53-1, nor do they expect to exceed this in the near future.
From a (former) Twitter dev:
> Given the current allocation rate, they'll probably never overflow Javascript's precision nor get anywhere near the 64-bit integer space.
Your link discusses 2^64, which applies to languages that have native integer types.
The 2^53 problem was for Javascript, which has no native integer type, and is thus limited by the mantissa size of Number (which is defined as an IEEE double-precision float).
Twitter ids are unsigned 64-bit, since they're generated using Snowflake. That link must pre-date the move to snowflake ids, and is speaking to the count of tweets instead.
Rollover was a confusing word for me to use. I did not mean it in terms of integer overflow. I meant it in accounting terms. As in to roll from one namespace to another.
I visited that video specifically because the view counter was jammed at UINT_MAX. There were comments confirming that everyone was now visitor number 4,294,967,295. In fact, it might have been an HN post that brought it to my attention; I totally didn't get sidetracked on YT and end up watching K-pop all afternoon.
You need to first establish that the type was chosen intentionally before asking why it was intentionally chosen. Otherwise the question is ill-formed.
It looks like they are using PHP/MySQL/Javascript/Flash, with only MySQL having any explicit types.
Even so, an error is often preferable to overflow, which is usually undefined behavior and could lead to a duplicate primary key anyways if it wraps to the first game.
A better question is "why 32-bit over 64-bit", but the site dates back to 2005 where that was the norm and the question has the same issues.
Assuming you're working in a language that defines signed integer overflow. Depending on the language, you can result in undefined behavior, instead. For that reason, I'd go with an unsigned counter, with the first million IDs being invalid or reserved for future use. That way, you get well-defined overflow into an invalid region.
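A sketch of that scheme (the reserved-region size is the commenter's example figure, and the masking stands in for well-defined unsigned arithmetic):

```python
RESERVED = 1_000_000              # ids 0..999,999 are never handed out

def next_id(n):
    # Unsigned wraparound is well-defined: mask to 32 bits.
    return (n + 1) & 0xFFFFFFFF

def is_valid(n):
    return n >= RESERVED

counter = 2**32 - 1               # counter is about to wrap
counter = next_id(counter)
assert counter == 0               # wrapped into the reserved region
assert not is_valid(counter)      # ...where the overflow is detected
```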
C/C++ is notoriously headache-inducing on this point. Yes, all the CPU archs you'd reasonably expect to encounter today behave this way. However, because the language standard says signed overflow is undefined, compilers are free to assume that it will never happen, and to make optimizations that you'd think would be unsafe but are technically permissible. [1]
Well that's interesting. I was not aware of any compilers doing this. I wonder if there's a switch in gcc/llvm/msvc/etc to turn this specific behavior off.
The -fstrict-overflow option is enabled at levels -O2, -O3, -Os.
In other words, basically every program that you're using is compiled with that option enabled. (Release builds typically use -O2, sometimes even -O3.)
I was answering the "not aware that any compilers were doing this" part, hoping that they would be able to answer their second question using the source I linked.
Which is a good thing, assuming you did reasonable testing with ubsan etc. Having to assume things that never happen is a big problem for optimization!
That's a good point. Seems like we could do a bit better than the current state of the art, though. If non-optimized builds trapped on overflow, that would at least give you a better chance of detecting these problems before the optimizer starts throwing out code you meant to keep.
Compilers can optimize better for signed integers. Overflow/underflow on signed integers is undefined behavior, which gives compilers room to optimize. Unsigned ints are defined for all cases, so you can get less optimal code.
Also, you have problems whenever you compare against signed ints.
Because signed is default for some reason in most languages, and most developers aren't taught to think critically about how decisions like simple datatypes might affect scalability.
The problem is momentum. I could use unsigned int everywhere, but then I have to constantly typecast to int and back anywhere I use a library expecting signed ints. If we all switched to unsigned int by default, then everything would make more sense but we'll all live in typecasting hell during the migration.
Unsigned by default doesn't make more sense than signed by default. The behavior near 0 is surprising; if you underflow you either get a huge value (anything not Swift) or you crash (Swift).
It was a mistake to use them for sizes in C++. Google code style requires using int64 to count sizes instead of uint32 for good reasons.
I read somewhere in the swift documentation that unless you have a specific need for a UInt, that Int is preferred even if you know that the value will always be nonnegative. I think compatibility is one reason they give.
World of Warcraft for quite some time, I think about 4 years, had a "mysterious" limit on the maximum amount of gold you could accumulate on a single character. It was 214,748 gold, 36 silver and 47 copper - i.e. 2,147,483,647 copper, the maximum signed 32-bit value...
At least they put an actual check in there - you didn't suddenly overrun and wake up with an enormous debt, so I'll give them some credit for that ;-)
And I think it isn't uncommon to have to add digits to the front of telephone numbers as regions grow (or telephones become more common) and the number space isn't large enough.
Twitter's was called the twitpocalypse and it was a pretty big deal at the time. I don't know how bad it was internally, but it hit a lot of third-party apps that stored tweet IDs in 32-bit integers.
One interesting aspect of it was that Twitter realized what was going to happen in advance, and artificially pushed their IDs over the edge at a preplanned time so they could have as many people available as possible to work on any problems that appeared.
Microsoft Lync has an as-yet-unfixed bug where, after 49.7103 days (2^32 = 4,294,967,296 milliseconds) of system uptime, it locks the Lync status indicator to "away". The status indicator is unusable thereafter until a full system reboot.
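That uptime is exactly a 32-bit millisecond counter wrapping:

```python
ms_until_wrap = 2**32                        # milliseconds held in 32 bits
days = ms_until_wrap / (1000 * 60 * 60 * 24)

assert round(days, 4) == 49.7103             # matches the observed ~49.71 days
```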
A counter overflow factored into the THERAC-25's race condition (one of the software's interlocks was an overflowable counter rather than a flag; if the operator started treatment right as the counter overflowed, the system would proceed rather than refuse).
They had meant to give him the lowest aggression rating possible, but accidentally input -1, which then wrapped around to the highest rating possible. Nukes soon followed.
Was it exactly that? I remember it being that you had to create a senate or something (that decreased character aggression by 2 points). Gandhi started with 1, so if you did it with Gandhi, then it'd underflow to the highest possible aggression.
Fun to read some of the other stories where this bit them too (Pac-Man, WoW, and eBay)! Anyway, the new app has been approved by Apple and should be rolling out soooooooooon....
Thanks for all the comments! Always lots to learn from.
So they probably just need to use longs instead of ints. But I'm curious, if you were really stuck with a 32-bit limit on data types, what's your preferred workaround? I'm thinking I'd add another field that represents a partition. Are there other "tricks"?
If you could only use 32-bit data types, you can get 64 bits by using 2 integers together, like the digits of one long number. The low integer hits its max, starts over at 0, and increments the high integer. Extending this idea, you can create a class of numbers with however many bits you want by using more ints.
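A minimal sketch of that carry scheme:

```python
MASK32 = 0xFFFFFFFF

def increment(high, low):
    """Treat two 32-bit words as one 64-bit counter (high, low)."""
    low = (low + 1) & MASK32
    if low == 0:                    # low word wrapped around...
        high = (high + 1) & MASK32  # ...so carry into the high word
    return high, low

assert increment(0, 5) == (0, 6)
assert increment(0, MASK32) == (1, 0)   # carry across the word boundary

h, l = increment(0, MASK32)
assert h * 2**32 + l == 2**32           # the combined value keeps counting
```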
eBay (almost) had this problem and I cannot find any articles about it online. They were rapidly approaching 2^31-1 auctions. So they switched to a larger integer, the switchover went badly, and they were mostly down for 4 days, if my memory serves. This would be like 10+ years ago I think.
A lot of comments, but no one has said what a great time we are living in for chess. So many games online, ready to be analysed and learned from. After Deep Blue people thought it was the end of chess, but it's only getting better: computers are helping players to improve.
Chess.com is a great site; lichess.org and chessable.com are also worth checking out if you like chess.
The other one to watch out for is the 53-bit javascript integer limit. That caused the twitpocalypse when Twitter tweet IDs hit it. They had to switch to strings in the JSON representation.
The 2009 Twitpocalypse concerned overflow of 31-bit precision. Twitter has not yet hit 53 bits for the raw number of tweets; in fact, they passed 32 bits in 2014 and might not have reached 33 bits yet.
Moving to strings for Javascript was really just safety planning for the future since:
> Given the current allocation rate, they'll probably never overflow Javascript's precision nor get anywhere near the 64-bit integer space.
Issues like this are not uncommon on Chess.com. I've been playing there since 2008 or 2009. If you read recent comments about issues as they pertain to the recent "v3" release ... as much is to be expected.
In my experience the chat on chess.com harbors a similar demographic to that of most video games. You'd think that chess would attract a more mature player base, but nope.
The only time I've observed people in real life acting like people in video games is at a chess tournament: constant trash-talking until they lose, then accusations of cheating. You certainly don't see this type at all (or even most) chess events; I think the lack of an entry fee drew them out into the open.
At my local library the loudest people aren't those on their phones or laptops, but the chess players. It really surprised me considering chess can be played completely non-verbally other than calling "check". Every time I'm there, they constantly argue (most of the time it's because someone wants to take back a move), trash talk to get on each others nerves, yell across the table to other players in games, and talk loudly as if they were in a park. On one hand I think it's great that the library provides a community space and lets people use their chess sets, but on the other hand as someone who goes there for quiet, it's very irritating. (I wish they had a game room or something where they could go wild) Once upon a time libraries had mythical status as a place of silence, to the point where people would shush each other for the smallest noises... I actually stopped going to that library because of noise issues and in general because of its size and limited seating.
My concern is that even if one plans for a sufficient capacity, there still needs to be testing done to verify that the code actually works if the capacity is nearing the theoretical limit. In this example the database id was transformed into a 32 bit integer somewhere in the application code.
Usually when I hit some sort of unexpected bug in production I try to think about what type of testing will prevent similar problems in the future.
It's a little related. The languages typically used for iOS programming encourage the use of data types whose size matches the CPU architecture's bitness. Thus, careless programmers will end up using 32-bit integer types on 32-bit devices, and 64-bit types on 64-bit devices.
I really doubt this is in any way linked to Apple's reasons for dropping 32-bit, though.
"The reason that some iOS devices are unable to connect to live chess games is because of a limit in 32bit devices which cannot handle gameIDs above 2,147,483,647."
This is simply because they're taking an integer from their database as an auto incrementing ID and converting it directly into a native integer on the iOS device thus breaking it. They could work around this any number of ways.
It's a pretty lame bug, to be honest, and certainly something easily foreseeable, as this wasn't an overnight occurrence.