In the world of big data IT, performance is everything. User satisfaction with IT infrastructure is determined by application availability and response times. But in that same world, failure is inevitable, even within the most robust IT infrastructure. Each instance of downtime, and each failure to meet availability or performance objectives, can have a significant effect on customer satisfaction. So when technology fails, your first thought is how to apply incident management knowledge to resolve the situation and minimize downtime.

MTTR is an acronym that has traditionally stood for Mean Time to Repair, a measure of how long it takes to get a product or subsystem up and running after a failure. It's used in the context of a traditional data center and relates to an organization's physical infrastructure, such as servers and the network. Mean Time to Repair is calculated by taking the total maintenance time over a given period and dividing it by the number of incidents that occurred in that period.
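As a rough illustration of that calculation, here is a minimal Python sketch. The function name `mean_time_to_repair` and its parameters are hypothetical, chosen only for this example, and the figures in the usage line are made up to show the arithmetic.

```python
def mean_time_to_repair(total_maintenance_hours: float, incident_count: int) -> float:
    """Divide total maintenance time in a period by the number of incidents in it."""
    if incident_count <= 0:
        # With no incidents in the period, the average is undefined.
        raise ValueError("incident_count must be a positive integer")
    return total_maintenance_hours / incident_count


# Illustrative numbers only: 12 hours of maintenance spread over 4 incidents.
print(mean_time_to_repair(12.0, 4))  # 3.0
```

The result is an average, so a single long outage can hide behind several quick fixes; it is worth looking at individual incidents as well.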

However, in a digitized world that revolves around big data applications and distributed computing architectures, it's more accurate to think in terms of another MTTR definition: Mean Time to Recovery. When the speed of IT support is of the essence, that definition of MTTR becomes a key focus. Mean Time to Recovery is a service-level metric that measures the average elapsed time from when an incident is reported until it is resolved and the affected system or service has recovered. It includes the time it takes to identify the failure, diagnose the problem, and repair it, and it is measured in business hours, not clock hours.

A ticket that is opened at 4:00 pm on a Friday and closed at 4:00 pm the following Monday, for example, will have a resolution time of eight business hours (assuming a 9:00 am to 5:00 pm business day), not 72 clock hours. MTTR comes into play when entering into contracts that include Service Level Agreement (SLA) targets or maintenance agreements. In SLA targets and maintenance contracts, you would generally agree to a Mean Time to Recovery target that sets a minimum service level you can hold the vendor accountable for. In a digitized environment where infrastructure and hardware repair have become more automated, Mean Time to Recovery can refer to application issues as well as infrastructure issues.
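The business-hours arithmetic in the ticket example above can be sketched in Python as follows. This is only an illustration under stated assumptions: a Monday to Friday, 9:00 am to 5:00 pm schedule with no holidays, and hypothetical function names (`business_hours_between`, `mean_time_to_recovery`) that do not come from any particular ticketing tool.

```python
from datetime import datetime, time, timedelta

# Assumed schedule: Monday-Friday, 09:00-17:00, holidays ignored.
BUSINESS_OPEN = time(9, 0)
BUSINESS_CLOSE = time(17, 0)


def business_hours_between(opened: datetime, resolved: datetime) -> float:
    """Return the hours between two timestamps that fall inside business hours."""
    if resolved < opened:
        raise ValueError("resolved must not be earlier than opened")
    total = timedelta()
    day = opened.date()
    while day <= resolved.date():
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            # Clip the ticket's lifetime to this day's business window.
            window_start = max(opened, datetime.combine(day, BUSINESS_OPEN))
            window_end = min(resolved, datetime.combine(day, BUSINESS_CLOSE))
            if window_end > window_start:
                total += window_end - window_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


def mean_time_to_recovery(tickets: list[tuple[datetime, datetime]]) -> float:
    """Average the business-hour resolution time over (opened, resolved) pairs."""
    if not tickets:
        raise ValueError("at least one ticket is required")
    return sum(business_hours_between(o, r) for o, r in tickets) / len(tickets)


# Friday 1 March 2024, 4:00 pm to Monday 4 March 2024, 4:00 pm.
opened = datetime(2024, 3, 1, 16, 0)
resolved = datetime(2024, 3, 4, 16, 0)
print(business_hours_between(opened, resolved))      # 8.0 business hours
print((resolved - opened).total_seconds() / 3600)    # 72.0 clock hours
```

A real SLA would define its own business calendar, including holidays and time zones, so the schedule constants here would need to match whatever the contract specifies.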

Digital transformation encompasses cloud adoption, rapid change, and the implementation of new technologies. It also requires a shift in focus to applications and developers, an increased pace of innovation and deployment, and the involvement of new digital components like machine agents, Internet of Things (IoT) devices, and Application Programming Interfaces (APIs).

When your network or applications unexpectedly fail or crash, IT downtime can have a direct impact on your bottom line and ongoing business operations.