Counting the Cost and Setting the Goal

Part of Hints for Operating High Quality Services by Mark Verber

Very, Very Early Draft 0.1-- January 30, 2004

I am just starting to collect ten years of notes on this topic.  This is not even a skeleton yet.

Understanding Magnitude

Many people seem to forget that each "9" they add increases the problem by an order of magnitude.  Going from 3 "9"s to 4 "9"s is not merely doubling the reliability of a service... it is increasing reliability tenfold.  So to put the "9"s in perspective:

99% - 5259 minutes/year (7 hours/month)

There is no system which should be worse than this.  If you are less than 99% then either:

99.9% - 525 minutes/year (43 minutes / month)

A well-run operations team with reasonable quality components can mostly achieve this in a single location.  If the service is well engineered, this can be done without a 24x7 org.  A serious disaster such as a fire, earthquake, or flood can take a 3 9s service off the air for months.  You should expect to miss 3 9s every year or two because of a catastrophic failure of air conditioning or power, a fiber cut, or some other physical issue outside your control.

99.99% - 52 minutes/year (4 minutes / month)

Needs full redundancy (and geo-diversity) with automatic failover.  Needs a 24x7 org.  Needs a mature change management process.

99.999% - 5 minutes/year (26 seconds / month)

Very few orgs find it cost effective to build 5 9s.  Requires Six Sigma or another quality improvement process.  Requires a level of discipline and control which slows down innovation beyond what is acceptable for most companies.
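
The downtime figures above follow from simple arithmetic, sketched here in Python.  The function name is made up for illustration, and the year is assumed to be 365.25 days (525,960 minutes), which is what the 99% figure above implies.

```python
# Assumption: an average year of 365.25 days, i.e. 525,960 minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget(availability_percent):
    """Return allowed downtime as (minutes per year, minutes per month)."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    per_year = MINUTES_PER_YEAR * unavailable_fraction
    return per_year, per_year / 12


for target in (99.0, 99.9, 99.99, 99.999):
    per_year, per_month = downtime_budget(target)
    print(f"{target}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Each added "9" divides the budget by ten, which is why the operational requirements change so sharply from one level to the next.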

Effective Measurements

Availability is often a very weak promise

Serviceability is typically what you want as a customer of a service

Need to make sure the right thing is measured.  People often measure performance, which is only part of what customers care about, and therefore their metrics give no insight into whether a customer's needs are being served.  For example "call pickup" -vs- "successful call".
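
A minimal sketch of the "call pickup" -vs- "successful call" distinction.  The record fields, function names, and sample numbers are all invented for illustration; a real service would pull these from its own logs.

```python
from dataclasses import dataclass


@dataclass
class CallAttempt:
    picked_up: bool  # the service answered the call
    completed: bool  # the caller actually got what they came for


def pickup_rate(calls):
    """Fraction of attempts that the service answered."""
    return sum(c.picked_up for c in calls) / len(calls)


def success_rate(calls):
    """Fraction of attempts that were answered AND completed."""
    return sum(c.picked_up and c.completed for c in calls) / len(calls)


# Made-up sample: 90 good calls, 9 answered but failed, 1 never answered.
calls = (
    [CallAttempt(True, True)] * 90
    + [CallAttempt(True, False)] * 9
    + [CallAttempt(False, False)]
)

print(f"call pickup:     {pickup_rate(calls):.0%}")
print(f"successful call: {success_rate(calls):.0%}")
```

On this sample the pickup metric looks like a healthy two "9"s service while one customer in ten is left unserved, which is the gap the second metric exposes.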

Picking Targets

I have often run into people who talk about needing 5 9s... but very few services need this level of reliability. Most service providers don't hit 5 9s today, including the larger telcos.

Furthermore, very few organizations are willing to pay the cost (capital, personnel, and the loss of agility) required to hit 5 9s. I know of a company that was willing (free of charge) to put an engineer on site full time and give access to senior engineers for any company that was targeting 4 or 5 9s and would actually purchase the equipment and train their staff to do this.  After a number of years they discovered that while many customers talked about wanting 5 9s... only a few were willing to make the required investments.  My memory is that they had three takers.  So they initiated a second program where companies had to pay for their assistance to reach the lower goal of 3.5 9s (99.95%).  They have more than a hundred companies willing to pay to get to this lower goal.

Targets I would suggest:

Time Investment

Improving reliability in large & complex systems is an incremental process.  It takes longer than most people expect.  Hitting 5 "9"s requires time to let Six Sigma or another improvement system tune and improve the service.  What's more, you need increasingly longer periods of time to be sure that you have made the fundamental investments needed to sustain the reliability you think you have achieved.  For example, one month of 100% service could be a fluke.  You would need years before you were sure you had surpassed 5 "9"s. I have repeatedly seen people expect to take a new service and get it to 4 "9"s in just one year.  This never happens unless great discipline (such as what used to be found in the real-time systems community) is applied to gathering requirements, creating an effective design, developing effective metrics, producing a high quality implementation, doing a careful deployment, and refining the deployment based on good metrics.

A much more common progression is:

Year 1: 99.5%
Year 2: 99.8%
Year 3: 99.95%
Year 4: 99.99%
Year 5: 99.995%
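
One way to see why a single good month proves little: a rough sketch of how long a service must run with no other downtime before one outage averages out to a given target.  The function name and the 30 minute outage are hypothetical, and the year is assumed to be 365.25 days.

```python
# Assumption: an average year of 365.25 days, i.e. 525,960 minutes.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def years_to_absorb(outage_minutes, availability_percent):
    """Years of otherwise perfect service needed for one outage of the
    given length to fit within the availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    budget_per_year = MINUTES_PER_YEAR * unavailable_fraction
    return outage_minutes / budget_per_year


# Hypothetical example: a single 30 minute outage against each target.
for target in (99.9, 99.99, 99.999):
    years = years_to_absorb(30, target)
    print(f"{target}%: {years:.2f} years of clean running")
```

At 5 "9"s a single half-hour incident takes more than five clean years to absorb, so a short observation window says very little about whether the target has really been met.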

Cost Models

Conventional wisdom is that for each "9" you should expect to need 2x the staff and 6x the capital cost.  This is typical if you use conventional solutions which involve throwing money at companies like EMC, Oracle, IBM, etc.  There are creative approaches which can cost less than this.
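
A back-of-the-envelope sketch of that rule of thumb.  It assumes the multipliers compound with each added "9", which is one reading of "for each 9" and not something the rule itself spells out; the function name is made up for illustration.

```python
STAFF_FACTOR_PER_NINE = 2    # from the rule of thumb above
CAPITAL_FACTOR_PER_NINE = 6  # from the rule of thumb above


def relative_cost(extra_nines):
    """Return (staff multiplier, capital multiplier) for adding the given
    number of "9"s, assuming the factors compound per "9"."""
    return (
        STAFF_FACTOR_PER_NINE ** extra_nines,
        CAPITAL_FACTOR_PER_NINE ** extra_nines,
    )


# Going from 3 "9"s to 5 "9"s adds two "9"s.
staff, capital = relative_cost(2)
print(f"staff x{staff}, capital x{capital}")
```

Under that compounding assumption, moving from 3 "9"s to 5 "9"s means roughly 4x the staff and 36x the capital, which fits the earlier observation that few organizations are willing to pay for 5 "9"s.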

Design Patterns


Further Information