
Microsoft’s Azure service toppled by garden-variety leap-year bug

What do the always-on cloud service and the protagonist in the Gilbert and …

The tagline for Microsoft Azure is "I laugh in the face of unpredictability."

Microsoft has confirmed that Wednesday's Windows Azure outage that left some customers in the dark for more than 12 hours was the result of a software bug triggered by the February 29 leap-year date that prevented systems from calculating the correct time.

In a post, Azure lead engineer Bill Laing said his team was able to put a fix in place that restored service to most customers around 3am Pacific time on Wednesday, a little more than nine hours after they became aware of the issue. In a follow-up bulletin, he promised to provide a fuller post-mortem on the root cause soon. Point-of-sale terminals in New Zealand supermarkets were also reportedly bitten by leap-year bugs.

The dearth of specifics right now makes it impossible to know exactly how Azure's inability to calculate the correct date brought down a site whose tagline is "I laugh in the face of unpredictability." But when combined with additional information attributed to Microsoft that the leap-year bug involved a "cert issue," it's possible to read the tea leaves. The most likely explanation is that the bug hampered functions that inspect the digital certificates internal systems use to authenticate each other. As a result, critical systems were likely unable to communicate.

All SSL, or secure sockets layer, certificates include the date the credential was issued and the date it expires. Before an application accepts a certificate as valid, it checks that the current time falls inside that range.

"You would think that all the code has to do is look at today's date and compare it," Marsh Ray, a software developer who writes code for two-factor authentication company PhoneFactor, told Ars. "Is today's date greater than or less than the two dates on the certificate? It ought to be pretty simple, but nothing is ever that simple when you actually go to deploy it."
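The comparison Ray describes can be sketched in a few lines of Python. This is only an illustration of the basic check, not Azure's code: the function name `is_within_validity` and its parameters are hypothetical, and real certificate validation also covers signatures, revocation, and trust chains.

```python
from datetime import datetime, timezone

def is_within_validity(not_before, not_after, now=None):
    """Return True if `now` falls inside the certificate's validity window.

    Hypothetical helper for illustration; all three values are expected
    to be timezone-aware datetimes.
    """
    if now is None:
        # Default to the current time in UTC.
        now = datetime.now(timezone.utc)
    return not_before <= now <= not_after

# Example window: valid from 1 March 2011 through the end of February 2012.
issued = datetime(2011, 3, 1, tzinfo=timezone.utc)
expires = datetime(2012, 2, 29, 23, 59, 59, tzinfo=timezone.utc)

leap_day = datetime(2012, 2, 29, 12, 0, tzinfo=timezone.utc)
day_after = datetime(2012, 3, 1, 12, 0, tzinfo=timezone.utc)

print(is_within_validity(issued, expires, now=leap_day))   # True
print(is_within_validity(issued, expires, now=day_after))  # False
```

The check itself is trivial. The trouble usually lies in how the two boundary dates, or the current date, were computed in the first place.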

Many administrators prefer that certificates remain valid for relatively short periods of time, sometimes for a span of only one or two years. One possibility is that the certificates Azure relied on allotted years consisting of only 365 days, rather than the 366 days that are needed once every four years to account for leap years. If that error affected Azure certificates, the cloud platform may have shut down as systems were unable to confirm they were connected to other trusted nodes.
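To make that hypothesis concrete, here is a minimal sketch of how a fixed 365-day "year" drifts from a calendar year when the span crosses a leap day. The function names are hypothetical and the dates are made up for illustration; nothing here reflects how Azure actually computed its certificate lifetimes.

```python
from datetime import date, timedelta

def expiry_fixed_365(issued):
    """Treat one year as exactly 365 days (the assumption under suspicion)."""
    return issued + timedelta(days=365)

def expiry_calendar_year(issued):
    """Advance the calendar year by one, handling a February 29 issue date."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Issued on February 29 and the following year has no such day:
        # fall back to February 28 (one possible policy among several).
        return issued.replace(year=issued.year + 1, day=28)

issued = date(2011, 3, 1)
print(expiry_fixed_365(issued))      # 2012-02-29
print(expiry_calendar_year(issued))  # 2012-03-01
```

In this made-up case, a certificate meant to last until March 1, 2012 would instead reach its expiry date on the leap day, one day sooner than its owner expected.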

Of bugs and pirates

The technical glitch isn't unlike the predicament that befalls the protagonist in the Gilbert and Sullivan musical The Pirates of Penzance. Bound by an apprenticeship to a band of pirates until his 21st birthday, he is chagrined in his 22nd year to learn he still isn't free of the obligation because his birthday falls on February 29. That means he has technically celebrated only five birthdays so far and must wait another six decades until he's free.

Developers have long experienced similar travails navigating the leap-year phenomenon. A post published Thursday on The Daily WTF blog details two real-world examples of date-calculation gone wrong and includes the observation: "There are only three hard things in Computer Science: cache invalidation, naming things, and handling of the 29th of February."

On Wednesday, photo-sharing website Flickr also succumbed to a problem that affected digital certificates. According to a Flickr staff member identified as yflickerboy, the glitch involved Wednesday's leap date. A spokesman for the site later told Ars that wasn't the case, but didn't elaborate.

Listing image by Matt Preston
