Mean Time to Repair: A Value of YOUR IT Architecture

When this was built, was a target MTTR designed?

Mean Time to Repair (MTTR from here on) frequently plays second fiddle to Mean Time Between Failures (MTBF) in the thoughts of infrastructure designers. However, when it comes time to calculate your Availability [which I'll write as MTBF/(MTBF+MTTR)], or when you consider what your affected users experience, it's at least as important.

And, perhaps more interestingly, you've got much more control over it than over the MTBF, which is baked in at design time. Let's take a look at how it plays out.

First, just how much influence does it have on your Availability? Let's suppose you've got a unit with an MTBF of 10,000 hours. If it takes a week to get up and running again, your Availability would be

10,000 / (10,000 + 7 × 24) = 10,000 / 10,168 ≈ 98.3%

If it takes 1 day, then your Availability would be about 99.8%. If you were aiming for the golden Five Nines (99.999%), you would need your MTTR to be about 6 minutes. 6 minutes!
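The arithmetic above is easy to check with a short script. This is only a sketch of the Availability formula as written in this article; the function names `availability` and `max_mttr_hours` are made up for illustration, and the second one is just the same formula rearranged to solve for MTTR.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as defined in this article: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Largest MTTR that still meets the target (the formula solved for MTTR)."""
    return mtbf_hours * (1.0 - target_availability) / target_availability


MTBF = 10_000.0  # hours, the example unit from the text

week = availability(MTBF, 7 * 24)   # one week to get running again
day = availability(MTBF, 24)        # one day to get running again
five_nines_minutes = max_mttr_hours(MTBF, 0.99999) * 60

print(f"one week to repair:  {week:.2%}")
print(f"one day to repair:   {day:.2%}")
print(f"MTTR for five nines: {five_nines_minutes:.1f} minutes")
```

Changing `MTBF` or the target shows how quickly the MTTR budget shrinks as the target Availability rises.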

Clearly, there's a radical difference between being able to get something back up and running in 7 days versus 6 minutes. That difference is embodied in your IT architecture - and by architecture I mean all of its views, including hardware, software, data, processes, people, and so forth. Let's dig a little deeper.

To clarify the definition of MTTR, we're talking about the whole outage of the service, not just the hands-on-tools maintenance period. The stages to account for are:

  • Time to identify
  • Queuing and Prioritization (workload redundancy, pre-assigned criticalities)
  • Triage
  • Staff Allocation (training, absence coverage, supervision)
  • Confirmation
  • Fault Isolation (Repair-By-Replacement? CMDB?)
  • Spares Logistics
  • Known-Good Configurations and the CMDB
  • Rectification Procedure (Repair-By-Replacement? Documentation? Quality process?)
  • Closing the Loop (not MTTR but indispensable)
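One way to make the stages above concrete is to treat MTTR for a single outage as the sum of the time spent in each stage, and then look for the biggest contributor. The sketch below does that. Every duration in it is a hypothetical placeholder rather than measured data, the stage names are simply shortened from the list, and "Closing the Loop" is left out because it does not count toward MTTR.

```python
# Hypothetical hours spent in each stage of one outage (illustrative only).
phases = {
    "identify": 0.5,
    "queue_and_prioritize": 2.0,
    "triage": 0.5,
    "staff_allocation": 1.0,
    "confirmation": 0.25,
    "fault_isolation": 1.5,
    "spares_logistics": 12.0,
    "known_good_configuration": 0.75,
    "rectification": 1.0,
}


def total_mttr(phase_hours: dict[str, float]) -> float:
    """Whole-outage repair time: the sum of every stage, not just hands-on work."""
    return sum(phase_hours.values())


def ranked(phase_hours: dict[str, float]) -> list[tuple[str, float]]:
    """Stages sorted from largest to smallest contribution."""
    return sorted(phase_hours.items(), key=lambda item: item[1], reverse=True)


mttr = total_mttr(phases)
print(f"total MTTR: {mttr} hours")
for name, hours in ranked(phases):
    print(f"{name:26s} {hours:6.2f} h  {hours / mttr:6.1%}")
```

With these made-up numbers the hands-on rectification is a small slice of the outage, and waiting on spares dominates. Your own figures will differ, but the breakdown shows where to spend effort first.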

We've touched on many different elements of your IT architecture – from staff training to maintenance procedures to triage processes to sparing policies. To be able to meaningfully estimate an MTTR you need to have a grip on all of it. But the good news is that these elements tend to be within your control, even if you've got an existing physical infrastructure with fixed MTBFs. You can choose which views of your architecture to apply resources to, and how best to reduce the MTTR of your most critical services.

Reducing downtime might not seem glamorous, but prolonged outages will quickly devalue your organization in the eyes of your customers. MTTR is a solid measure of the value of your IT architecture.