Different ways to react to outages in the cloud: Google vs Microsoft vs Amazon


Launched less than 48 hours ago, the Google App Engine cron support has already seen a long outage spanning around 9 hours:

The scheduled tasks service saw some down time last night from 11:30pm and 2:30am PST (GMT-8), and again from 6am to 12:00pm PST.

I must confess that I don't have a (big) problem with the fact that the service was down. Microsoft Azure was down for 20+ hours sometime in March, and I've heard of quite a few Amazon AWS outages in the past. But this is not the subject of this post.

What I am actually interested in is how these teams react when facing problems, so let's take a look at a couple of examples.

Google

Here is the relevant fragment of Google's latest report on the issue I mentioned at the beginning of the post:

This issue has been identified and fixed.

And this is not the first time they have been very short on explanation (I have quite a few more examples, but I'll keep it short):

Investigation Complete - Issue Resolved
March 10, 2009 at 1:46 am 2009-03-10T01:45:51Z
Online
We have determined that this spike did not affect the performance or uptime of applications. If you feel we have incorrectly diagnosed this issue please inform us by posting in our developer forum.

Investigation Complete - Issue Resolved
March 9, 2009 at 12:47 am 2009-03-09T00:46:51Z
Online
We have determined that this spike did not affect the performance or uptime of applications. If you feel we have incorrectly diagnosed this issue please inform us by posting in our developer forum.

Microsoft

Let's compare the above reports with Microsoft's report about their Azure outage. The team posted an explanation on their blog covering three aspects:

  • What Happened?
  • What Was Affected?
  • How Will We Prevent This in the Future?

Leaving aside the fact that, as a way to prevent further problems, the Azure team has decided to provide by default a second instance that is not counted against the service quota:

For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We'll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role.

the report makes me feel confident in the team's capability to deal with problems.

Amazon

Unfortunately I don't have any specific references here, but I clearly remember Amazon being very explicit in their explanations and in their plans to address similar issues in the future.


My question now is: who would you trust?

I do feel that failing to provide any details about the problems and their resolution is not the best way to convince your users of the system's stability and of your ability to operate it.

What we've got here is a failure to communicate.