Confirmed: Windows Azure downtime caused by leap-day bug (msdn.com)
103 points by panarky on March 1, 2012 | 49 comments



The last line is a little unfortunate. It made me think of: http://37signals.com/svn/posts/1528-the-bullshit-of-outage-l...


Contrast Microsoft's wishy-washy "any inconvenience this may have caused" with Amazon's sincere apology.

http://aws.amazon.com/message/65648/

(Scroll to last paragraph.)


I was hoping to hear a little more technical apology from Microsoft. For a suit, reading what the VP posted might make sense, but for a techie like myself I really need to know that the platform is mature and stable. A detailed description of the problem helps me judge.


To be fair, the post does say they'll provide the technical explanation when they have more info:

    We will post an update on this situation, including details on the root cause analysis at the end of this incident.

I think that's fair enough (assuming they don't just quietly forget about this last part).


Sincere apology? You just told us to scroll to the last paragraph, presumably because most people who clicked your link would not have seen it. That's not a sincere apology, that's a huge wall of text that could have been about five times shorter. It's PR, it's always PR.


It isn't PR, it is a thorough technical rundown of the issue and what they are doing to improve it.


It's verbose, not thorough, and it's light on the technical side. It only superficially looks technical to people who don't work in the field.

Everything is PR. Even if this was indeed a very thorough and technical rundown of the issue, it would still exist only because of PR.


At least they didn't say "... may have caused".


Another strange time-related issue that just burned us: if your server is up for 497 days, it will stop closing sockets: http://support.microsoft.com/kb/2553549


On a hunch I converted 497 days to seconds, and it works out to be 42.9 million. A suspiciously familiar number, as 2^32 hundredths of a second is about 42.9 million seconds. Since 10 ms is a common clock resolution, that points to an obvious cause: a 32-bit time counter rolling over and horfing the relative age calculations, so all of the sockets that were open prior to the rollover stay open forever.
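
The arithmetic is easy to check. A minimal sketch, assuming an unsigned 32-bit tick counter; `days_until_rollover` is a made-up helper name, and the tick rates are just the ones mentioned in this thread, not anything confirmed by the KB article:

```python
# Sketch: how long an unsigned tick counter lasts before it wraps around.
# The tick rates below are assumptions taken from the comments in this thread.
SECONDS_PER_DAY = 86_400


def days_until_rollover(ticks_per_second, bits=32):
    """Days until an unsigned counter of `bits` bits wraps to zero."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


ten_ms = days_until_rollover(100)    # 10 ms ticks (hundredths of a second)
one_ms = days_until_rollover(1000)   # 1 ms ticks

print(round(ten_ms, 1), round(one_ms, 1))   # 497.1 49.7
```

So 10 ms ticks give roughly 497 days and 1 ms ticks roughly 49.7 days, which matches both numbers quoted here.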


Windows 95 crashed after 49.7 days (2^32 ms) for similar reasons: http://news.cnet.com/2100-1040-222391.html


There are two things which are a bit off-putting about this.

First, the fact that the same exact type of bug had been known in 1999 and yet they either failed to fix it in the newer code base or they reimplemented the exact same bug in new code.

Second, almost certainly the reason that these bugs weren't caught earlier is because it's unusual for Windows to have such long uptime (50 days for Win 9x is impressive, and over a year for Windows Server equally so). More so, almost certainly the average user has such low expectations of Windows reliability that if they see the system become unstable or slow after a long period of uptime they will as a rule merely reboot the system rather than investigate.

Edit: a thought occurs to me. Perhaps the "fix" for the older problem was to simply change from using milliseconds since last boot for tcp/ip socket age to using hundredths of a second. I really, really hope that wasn't the case.


This is very much a Windows issue however, as other operating systems have higher resolution TCP timestamps, e.g. 1ms on Linux which rolls over every 49.7 days, and yet they do not have issues with closing sockets.


They noticed it at 5:45 PM PST?

  $ date -d "Feb 28 2012 17:45 PST" -u 
  Wed Feb 29 01:45:00 UTC 2012

Does that mean it took them nearly two hours to spot the problem? Or are they not running on UTC?
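
The same conversion can be sketched in Python for anyone without GNU date handy. This assumes PST is a fixed UTC-8 offset (no timezone database involved), and the variable names are only illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST treated as a fixed UTC-8 offset, enough for this one-off check.
PST = timezone(timedelta(hours=-8), "PST")

noticed_local = datetime(2012, 2, 28, 17, 45, tzinfo=PST)
noticed_utc = noticed_local.astimezone(timezone.utc)
leap_day_start_utc = datetime(2012, 2, 29, tzinfo=timezone.utc)

print(noticed_utc.isoformat())            # 2012-02-29T01:45:00+00:00
print(noticed_utc - leap_day_start_utc)   # 1:45:00
```

That is 1 hour 45 minutes after the leap day began in UTC.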


How are these things happening? In Australia our healthcare system had a bug too: http://www.itnews.com.au/News/292081,hicaps-bug-hits-health-...

I don't recall any of this occurring in 2008 or 2004.


This happens all the time because of how confusing our calendar is. Lots of dates are figured out by counting the number of seconds since 1970. It's really easy to miss a little detail like a leap year, since they happen so infrequently and they didn't always exist.

It's not just Microsoft that does stuff like this either. Apple regularly messes up iPhone alarms around daylight saving time changes.


Divisible by 4 => leap year

Also divisible by 100 => not leap year unless also divisible by 400.

It's really not that complicated.

[edit: Oops. I messed it up. Irony. Fixed now.]
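
For reference, the rule above as a small Python sketch. `is_leap_year` is a hand-rolled name for illustration only; the standard library already ships `calendar.isleap`, and the loop just checks the two agree on a few years:

```python
import calendar


def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Compare against the standard library on the usual trouble spots.
for year in (1900, 2000, 2011, 2012, 2100):
    assert is_leap_year(year) == calendar.isleap(year)
    print(year, is_leap_year(year))
```

In real code, calling `calendar.isleap` directly is probably the safer choice.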


It isn't ironic that you messed it up. You've provided the perfect example of why these bugs happen: not only did you not know the exact rules (divisibility by 400), but you also thought it would be fine to implement them yourself, when you should most likely use an existing library to handle date & time calculations.


... and most people don't realize that knowing that correct calculation is only half the battle. Once you know it, you can successfully navigate the Gregorian calendar. But what happens when you need to work with dates prior to the start of the Gregorian calendar? Does your Gregorian start happen in 1582 when the first countries adopted it, or in 1752 when the British adopted it? Most people simply apply Gregorian rules indefinitely into the past, which is not always correct for every situation:

http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar
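
Python's `datetime` is one example of a library that applies Gregorian rules indefinitely into the past. A small sketch, where `exists_in_datetime` is a hypothetical helper; 1500 was a leap year under the Julian calendar in use at the time, but not under proleptic Gregorian rules:

```python
import calendar
from datetime import date


def exists_in_datetime(year, month, day):
    """True if datetime.date accepts the given calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


print(calendar.isleap(1500))             # False: divisible by 100, not by 400
print(exists_in_datetime(1500, 2, 29))   # False: Feb 29, 1500 is rejected
print(exists_in_datetime(1600, 2, 29))   # True: 1600 is divisible by 400
```

Whether that is the right answer depends on whose calendar the historical date was recorded in.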


Check out Wikipedia's page; I just noticed they have the algorithm spelled out in pseudocode ;p

http://en.wikipedia.org/wiki/Leap_year

What baffles me is: did they really have to rewrite that again somewhere else? Don't they have libraries in whatever language they are using with something to do the date calculations they need?


why is anybody even doing these calculations? import an existing datetime library and use it. these problems have been solved, and tested in previous leap years. what's the point in re-solving it?


> Also divisible by 100 => not leap year unless also divisible by 400.

Complicated enough to mess up.


I don't understand. Your software works for months with 28 or 30 or 31 days. Why does it break for months with 29 days?

I think unless you're messing with non-Gregorian calendars, this is a solved problem. Am I missing something?


Although time is an important kind of data in most applications, library support for it is poor. Most languages don't even have a data type for time!

For example, the system call gettimeofday(2) only returns the seconds (plus microseconds) since Jan 1, 1970 (the 'epoch'), the time zone and the daylight saving time correction. No day, no month, no year. For those, you have to call some other function, e.g. libc's localtime(3).

libc's time(3) function says it returns the number of seconds since the epoch, but it ignores leap seconds, so it actually does not return the number of seconds since the epoch (since there have been several leap seconds since the epoch).

Even if you use localtime(3) to get the actual wall clock hour and minutes, you are left on your own from there. Want to have a time point two hours from now? Do your own math, but watch out for the end of the day, which might also be the end of the month and/or the end of the year. One month from now? Do your own math, but watch out for months that have fewer days than your current month.

You may want to resort to do calculations only in seconds since the epoch, but how many seconds are in a month? Depends on the month (and the year as we've just learned!). In a year? Depends on the year (Is it a leap year? Was / will be there a leap second? Do you have to care about leap seconds?).

I just picked the C language because it is so prevalent. Other languages have their own issues or inherit them from C. In Python, for example, there is a timedelta object, but it can only handle days, seconds, and microseconds, so you still cannot calculate the date one month from now or one year ago.

I find it unbelievably funny that it's 2012 and we still have to deal with this 'solved problem'. Turns out, it is not solved at all.
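
To make the timedelta limitation concrete, here is a rough sketch of "one month from now" built on the standard library only. `add_months` is a hypothetical helper, and clamping to the last day of the target month is just one possible policy, not the only reasonable one:

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift `d` by whole months, clamping the day to the target month's length."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    return d.replace(year=year, month=month, day=min(d.day, last_day))


print(add_months(date(2012, 1, 31), 1))    # 2012-02-29
print(add_months(date(2012, 2, 29), 12))   # 2013-02-28
print(add_months(date(2012, 3, 31), -1))   # 2012-02-29
```

Note that with clamping, adding a month and then subtracting one does not always get you back to where you started.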


> libc's time(3) function says it returns the number of seconds since the epoch, but it ignores leap seconds, so it actually does not return the number of seconds since the epoch (since there have been several leap seconds since the epoch).

It returns the number of actual seconds that have elapsed since the epoch, and the only way to do this is to ignore leap seconds. When leap seconds occur, they don't actually exist; they just change the offset to keep things in sync. Every second the time(3) function returns exists uniquely and none are skipped, so there are no ambiguous values, which you couldn't say if leap seconds were not ignored.


Nope, Unix time is aligned on UTC days, which means it jumps back by one when a leap second is inserted (and would jump forward when one is deleted): https://en.wikipedia.org/wiki/Unix_time
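
A quick way to see the "aligned on UTC days" part, sketched with `calendar.timegm` from the Python standard library. A leap second (23:59:60 UTC) was inserted at the end of December 31, 2008, yet that day is still exactly 86400 Unix seconds long:

```python
import calendar

# calendar.timegm converts a UTC time tuple to a Unix timestamp.
before = calendar.timegm((2008, 12, 31, 0, 0, 0))
after = calendar.timegm((2009, 1, 1, 0, 0, 0))

# Prints 86400, although 86401 SI seconds elapsed between those two instants.
print(after - before)
```

So the count is really "86400 times the number of UTC days, plus seconds into the day", not a count of elapsed seconds.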


Ah, well I stand corrected. That being said, it seems that UNIX time doesn't take leap seconds into account physically, but they are logically visible when viewing the change over time.


The C (or at least POSIX) functions that accept broken-down (i.e. seconds, minutes, hours) datetimes allow numbers outside of the range which makes it possible to advance the calendar one month even if you're already in December. For example, give me the date for 8am yesterday:

    >>> import time
    >>> t = list(time.localtime()); t[2] -= 1; t[3:6] = [8,0,0]; print time.ctime(time.mktime(t))
    Wed Feb 29 08:00:00 2012

That correctly handled the leap day and the change in month number -- the 0th day of March became the last day of February.

Give the 1st of the month, 11 months from now:

    >>> t = list(time.localtime()); t[1] += 11; t[2:6] = [1,8,0,0]; print time.ctime(time.mktime(t))
    Fri Feb  1 08:00:00 2013

That also converted the 14th month of 2012 into the 2nd month of 2013.
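
The one-liners above depend on the day they were run. A sketch of the same normalization with fixed inputs, written for Python 3; `normalize` is a hypothetical wrapper, and the result relies on the platform's mktime() folding out-of-range fields, which is the behaviour described above:

```python
import time


def normalize(year, month, day, hour=8):
    """Let mktime() fold out-of-range fields back into a valid local date."""
    # Field order: year, month, day, hour, minute, second, weekday, yearday, isdst.
    # weekday and yearday are ignored on input; isdst=-1 lets the C library decide.
    stamp = time.mktime((year, month, day, hour, 0, 0, 0, 0, -1))
    normalized = time.localtime(stamp)
    return normalized.tm_year, normalized.tm_mon, normalized.tm_mday


print(normalize(2012, 3, 0))    # (2012, 2, 29): 0th of March is the leap day
print(normalize(2012, 14, 1))   # (2013, 2, 1): 14th month rolls into 2013
```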


I figured you talk Python, that's good to know (x)! However, if you continue:

    >>> t = list(time.localtime()); t[2] -= 1; t[3:6] = [8,0,0]; print time.ctime(time.mktime(t))
    Wed Feb 29 08:00:00 2012
    >>> t[1] -= 1;
    >>> time.ctime(time.mktime(t))
    'Tue Jan 31 08:00:00 2012'

So if you have a `t` that is Feb 29, one month ago is Jan 31 or Jan 29, depending on how `t` was constructed.

(x) Edit: I meant, it's good to know about the time module in Python, not that you know Python...


I don't follow -- the first time which translates to Feb 29th is entered as 0th March. So that underflows to Feb 29th -- the day before March 1st.

The second time you have, is 0th February. So that underflows back to the previous day, Jan 31st -- the day before February 1st.


I mean there are two t (0th March and Feb 29th) that print as Feb 29 (so any user seeing the value of those t may rightly assume they are the same), but for one t one month earlier is Jan 31, while for the other t one month earlier is Jan 29.

Of course, internally, all is correct, but for the user, things may behave strangely.


I don't think we're actually disagreeing, but I don't think you have addressed my point. I'm assuming your software can handle 28 or 30 or 31-day months, after all it has worked for the last few years. So you must already have a library that provides you with the adequate abstractions. Then I'm asking you why your software breaks when there are 29 days.

So I don't see how libc or other low-level libraries are relevant here. My assumption implies that you're using some library with support for dates. I'm not even assuming that you're Microsoft, though that helps.

(Note that this has little to do with leap seconds. There are astronomers that might care about the details, but for most of us it's enough to think of a leap second as a "long second".)


I don't think we disagree, either. Maybe I can clarify how I addressed your point.

    So you must already have a library that provides you with the adequate abstractions.

Given that it failed on Feb 29, either the library is weak, it was used wrongly, or no such library was in use. Instead, the code may have assumed that every February has 28 days, or it calculated the leap year wrongly. In either case, I can imagine that code that assumes that February has 28 days may wreak havoc on the 29th. For example, before midnight you might want to schedule some important task two hours from now, do calculations based on the number of seconds in two hours, and calculate the task to happen on March 1st instead of February 29th.

I think the issue arises because even though time is such an important (and difficult to handle) data point, there is hardly any language support or the libraries are weak.


All of these are handled by offloading the problem to a good library. It would be dumb to handle the 30/31 months oneself, failing to take everything else into account.


Do you know any good library?

Besides, my point was that I think time is such an essential data point, it should be handled directly by the language without the need to look for a library. Like sin() is directly accessible in every non-toy language.


My current project has modest needs; there are some times in a database but I just make sure everything is handled as UTC internally, and converted on rendering (showing the timezone explicitly). There are some timedeltas (Python stdlib) involved, but they are expressed in days. As you notice, “one month from now” is ambiguous, and I think it is good that timedelta does not use that unit. So I haven't needed to go outside Python's stdlib (which mirrors a POSIX libc), even though I don't hesitate to do so.

In other projects I've used clock_monotonic (POSIX, but needs a Python module or ctypes until the next Python 3 release), and either dateutil (if I need to do more advanced calendar math) or pytz (if I just need a timezone database).
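
For what it's worth, a monotonic clock is exposed in the standard library as `time.monotonic()` from Python 3.3 on. A minimal sketch of using it for elapsed-time measurement, so that wall-clock adjustments don't affect the result; `timed` is a made-up helper name:

```python
import time


def timed(func, *args):
    """Run func and return (result, elapsed seconds) using the monotonic clock."""
    start = time.monotonic()
    result = func(*args)
    return result, time.monotonic() - start


result, elapsed = timed(sum, range(1_000_000))
print(result, elapsed >= 0)   # 499999500000 True
```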


> Do you know any good library?

I've never worked on a product where timekeeping was essential, but if I did, I'd probably use libtai[1], by DJB.

[1] http://cr.yp.to/libtai.html


It seems as though you're talking about core libraries or built in functionality of a language. But for what it's worth, if you're using Python there is this:

http://labix.org/python-dateutil


Time in distributed cloud computing gets very complicated and very critical.

Google has an interesting post on how they handled the Leap Second in 2008: http://googleblog.blogspot.com/2011/09/time-technology-and-l...


One of the many issues I saw today was around converting times to and from different formats. Let's say you have a timestamp in epoch, and you want to calculate a year-long offset. You add seconds_in_year to your epoch time, and call your date conversion library to get some kind of ISO date (yyyy-mm-dd). Yes, it's wrong because seconds_in_year is not a constant, but this stuff sneaks its way into a large codebase and nobody realizes because all the tests pass until it's leap day.

This is only the beginning. There are so many possible issues related to an edge case like this, and while it might be a "solved problem", that doesn't mean it's something developers are thinking about every day. And don't even get me started on daylight saving time...
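
A small sketch of the seconds_in_year trap in Python. `SECONDS_IN_YEAR` and the start date are illustrative assumptions, not taken from any real codebase:

```python
from datetime import datetime, timezone

SECONDS_IN_YEAR = 365 * 24 * 60 * 60   # the "constant" that is not constant

start = datetime(2011, 3, 1, tzinfo=timezone.utc)

# Naive approach: add a fixed number of seconds to the epoch timestamp.
naive_result = datetime.fromtimestamp(start.timestamp() + SECONDS_IN_YEAR, tz=timezone.utc)

# Calendar approach: bump the year field instead.
calendar_result = start.replace(year=start.year + 1)

print(naive_result.date())      # 2012-02-29
print(calendar_result.date())   # 2012-03-01
```

The naive version lands a day early because the interval contains Feb 29, 2012. The calendar version has its own edge case: `replace(year=...)` raises ValueError when the start date is itself Feb 29 and the target year is not a leap year.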


yeah throw in daylight savings time and time zones and it becomes almost unmanageable.


Is there any study comparing the downtime of the different cloud platforms over, say, the past few years? EC2, Azure, Google Apps, etc. That would be the ultimate tool to shame substandard cloud vendors...


CloudHarmony collects these, here are stats for the past year: https://cloudharmony.com/status


Correction: it doesn't seem to show the recent Azure outage, which would be on Azure Compute regions.


They should really be running some test servers with clocks set ahead of time in order to get advance notice of problems like this. I seem to recall that Amazon does this with its servers.


This sort of thing happens every leap year, and every time I remember that the very first exercise in the very first CS class that I took in college was to write an algorithm to decide whether or not a given year was a leap year.



The Zune leap-year bug bricked the players on December 31, 2008.

Money quote: "Microsoft says it will issue a bug fix for the device so that this problem won't occur again in 2012, the next leap year."


Maybe that's why they cancelled the Zune in 2011: to avoid having to fix the leap-year bug.



