#Azure #SLA: Stop promising all those 9s…

…without understanding what each 9 means.

Warning: long post. For the TL;DR, scroll to the bottom of the post. Also, I will be using uptime and SLA interchangeably, but the topic and its concepts are more nuanced than that (downloadable link).

(My extreme imaginary) Scenario 1 (based loosely on real-world conversations I’ve witnessed)
Client: What type of uptime do you guarantee?
You: Oh it’s on Azure cloud, it’s pretty much up all the time
*Buzzer goes off, indicating an incorrect answer, and the screen pops up this message scraped from the Azure Status Dashboard*

4/15
App Service \ Web Apps – East Asia – Advisory
SUMMARY OF IMPACT: Between 18:35 and 18:53 UTC on 15 Apr 2016 a subset of customers using App Service \ Web Apps in East Asia experienced Connection timeouts or connection resets when accessing sites. PRELIMINARY ROOT CAUSE: At this stage we do not have a definitive root cause. MITIGATION: Our systems have self-healed and have returned to a healthy state. NEXT STEPS: Continue to investigate the underlying root cause of this issue and develop a solution to prevent recurrences.

(My extreme imaginary) Scenario 2 (based loosely on real-world conversations I’ve witnessed)
You: Oh, we use Azure web apps, so I can tell you it’ll be up at least 99% of the time
Client: We got a quote from another consultant that guaranteed 99.999% uptime, do you do that?
You: Oh yes, I was going to talk to you about the maintenance contract. For a couple of hundred dollars a month we can guarantee 99.999% uptime.
*Conference room wall goes down revealing a camera crew. Chris Hansen walks in and asks you to take a seat. He pulls out the Azure apps SLA page showing this blurb*

We guarantee that Apps running in a customer subscription will be available 99.95% of the time. No SLA is provided for Logic Apps while such services are still in Preview or for Apps under either the Free or Shared tiers.


This is not to say it’s impossible to provide a 99.999% SLA, but doing so requires planning and additional infrastructure that increase the overall cost exponentially. In those examples, the SLA was almost certainly an afterthought and would not be possible (without great planning) at the quoted price points. In both cases, the client was asking for a (form of) service-level agreement, and in both cases the client was not given accurate information. In fact, in both instances the client was over-promised uptime. So what do these numbers mean anyway?

Meaning of 9s

| n | Service Availability (%) | System Type | Annualized Down Minutes | Monthly Down Minutes | Practical Meaning | FAA Rating |
|---|---|---|---|---|---|---|
| 0 | 90 | Unmanaged | 52596 | 4383 | Down 5 weeks/year | |
| 1 | 99 | Managed | 5259.6 | 438.3 | Down 4 days/year | Routine |
| 2 | 99.9 | Well Managed | 525.96 | 43.83 | Down 9 hours/year | Essential |
| 3 | 99.99 | Fault Tolerant | 52.6 | 4.38 | Down 1 hour/year | |
| 4 | 99.999 | High Availability | 5.3 | 0.44 | Down 5 minutes/year | Critical |
| 5 | 99.9999 | Very High Availability | 0.53 | 0.04 | Down 30 seconds/year | |
| 6 | 99.99999 | Ultra High Availability | 0.05 | – | Down 3 seconds/year | Safety Critical |

Note: FAA is exactly who you think they are.
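
If you want to check the downtime columns yourself, here is a minimal Python sketch. It assumes a 365.25-day year (525,960 minutes), which is the basis that reproduces the 52596 figure in the first row; the function names are mine and purely illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes (assumed basis for the table)


def availability_for_nines(n):
    """Availability (%) for n nines after the first 9: n=0 -> 90, n=1 -> 99, ..."""
    return 100 * (1 - 10 ** -(n + 1))


def downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime (in minutes) that a given availability percentage allows per period."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for n in range(7):
        pct = availability_for_nines(n)
        yearly = downtime_minutes(pct)
        monthly = yearly / 12  # simple one-twelfth of a year
        print(f"n={n}  {pct:.5f}%  {yearly:10.2f} min/year  {monthly:8.2f} min/month")
```

Pass a different `period_minutes` if your contract measures availability per calendar month or per billing cycle instead of per year.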

The cost of each step is exponential: think x * m^n, where x is the cost of building and maintaining an unmanaged system, n (first column) is the number of 9s after the first 9, and the multiplier m is somewhere between 5 and 10 depending on the complexity of the system. So if a simple website costs x dollars to build and maintain as a (relatively) unmanaged system (90% SLA, n=0), it’s not unreasonable to expect that a similar system that is fault tolerant (99.99% SLA, n=3) could cost somewhere between roughly 100x and 1000x the unmanaged system over the long term.
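
As a rough sketch of that rule of thumb in Python (the function name and the default multipliers are just my illustration of the 5 to 10 range above, not a pricing model):

```python
def cost_range(base_cost, n, low_multiplier=5, high_multiplier=10):
    """Return a (low, high) long-term cost estimate for n nines after the first 9.

    base_cost is the cost of the unmanaged (90%, n=0) system.
    The multipliers are the assumed 5x to 10x cost jump per additional 9.
    """
    return base_cost * low_multiplier ** n, base_cost * high_multiplier ** n


if __name__ == "__main__":
    for n in range(5):
        low, high = cost_range(1, n)
        print(f"n={n}: between {low}x and {high}x the unmanaged cost")
```

For n=3 this gives 125x to 1000x, which is where the "roughly 100x to 1000x" range for a fault-tolerant system comes from.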

Generally speaking, we (people in software and IT) habitually underestimate effort and cost and overestimate our capabilities. So this calculation may very well be an underestimate, but it at least warns us to be wary of throwing out wild guesses and indiscriminate 9s for SLAs. There are some development tools, methodologies and tricks we can use to reduce the cost and effort, so maybe we can move n=0 up to 99% instead of 90%, but each succeeding 9 will still incur an exponential increase in cost.

Another interesting discussion on SLAs and Azure we had at my talk at last week’s Global Azure Bootcamp was how to calculate what you could promise to the client.

We started with an audience member asking whether the SLA you offer could be based on the strongest link. For instance, if you use SQL Azure (99.99% SLA) and Azure Web Apps (99.95% SLA), can you honestly say you’re capable of providing a 99.99% SLA?

The general consensus at the time seemed to be that your SLA should be that of your weakest link. For instance, if you have an Azure Virtual Machine (99.95% SLA) connecting to a SQL Azure Database server (99.99% SLA), the weakest link would be the VM, ergo a 99.95% SLA.

But some thought and research shows that this calculation has a basic and fatal flaw: it assumes that the downtime of the weakest link and the downtime of the strongest link always overlap in time. Check out this blogpost by Troy Hunt about how that assumption is patently wrong.
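
To put numbers on it, here is a small Python sketch of two common ways to combine component SLAs when every component has to be up for the system to be up. Both are my own illustration under stated assumptions: the first assumes outages are statistically independent (so availabilities multiply), the second assumes the worst case where outages never overlap (so downtimes add). Real outages can be correlated, so treat the results as estimates.

```python
from math import prod  # available in Python 3.8+


def composite_availability(availabilities_pct):
    """Availability (%) of a chain of components, assuming independent failures."""
    return 100 * prod(a / 100 for a in availabilities_pct)


def worst_case_availability(availabilities_pct):
    """Availability (%) if component outages never overlap in time (downtimes add up)."""
    return 100 - sum(100 - a for a in availabilities_pct)


if __name__ == "__main__":
    components = [99.95, 99.99]  # Web Apps / VM and SQL Azure figures quoted above
    print(f"independent failures: {composite_availability(components):.4f}%")
    print(f"no overlap at all:    {worst_case_availability(components):.4f}%")
```

Either way, the combined figure for 99.95% and 99.99% comes out at about 99.94%, which is lower than the weakest link on its own.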

My naïve worldview:

Shooting for zero downtime is counter-productive and frankly, impossible. Your system is going down at one point or another, period. Just find out what is an acceptable and affordable level of uptime for your client and plan accordingly (source for image).


Embrace downtime unless you’re running critical medical equipment, operating flights/rockets/missiles, trading stocks, running a cloud data center or losing money because your system is down for a couple of minutes (think Google losing ad revenue because their ad network went down for 2 minutes). When it comes to SLAs, under-promise and over-deliver. For instance, aim for 99.99% (4-5 minutes/month downtime) but promise only a 99.9% SLA (45 minutes/month downtime). Perform any upgrades/updates in non-peak conditions (late Saturday night, for instance) with enough warning to users/clients. Use built-in and 3rd party monitoring tools to keep you honest and alert you when there are issues. And despite your best prep, create offline-triage procedures for the occasional extended downtime during peak hours.
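
Here is a hedged Python sketch of the "aim high, promise lower" idea: it takes the outage durations your monitoring recorded for the month and compares the measured availability with both the promised SLA and the internal target. The 30-day month, the function names and the status strings are my assumptions for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumed 30-day month: 43,200 minutes


def measured_availability(outage_minutes, period_minutes=MINUTES_PER_MONTH):
    """Availability (%) over the period, given a list of outage durations in minutes."""
    return 100 * (1 - sum(outage_minutes) / period_minutes)


def sla_status(outage_minutes, promised_pct=99.9, target_pct=99.99):
    """Classify the month against the promised SLA and the stricter internal target."""
    actual = measured_availability(outage_minutes)
    if actual >= target_pct:
        return "on target"
    if actual >= promised_pct:
        return "within SLA, below internal target"
    return "SLA breached"


if __name__ == "__main__":
    # A single 18-minute outage, like the 18:35 to 18:53 UTC advisory quoted earlier.
    print(f"{measured_availability([18]):.4f}%", sla_status([18]))
```

A single 18-minute outage misses the 99.99% target but still sits inside a 99.9% promise, which is exactly the cushion that under-promising buys you.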

In my personal opinion, if you’re an individual consultant or running a small development shop, be wary of undertaking (without additional help) anything beyond a well managed system (n=2, 99.9% SLA), which allows downtime of approximately 45 minutes/month.

To sum up:

  1. Each additional 9 added to an SLA probably increases long-term costs by 5x to 10x.
  2. An SLA should account for the combined, non-overlapping downtime of all system components, not just the weakest one.