In the world of big data IT, performance is everything. User satisfaction with IT infrastructure is determined by application availability and response times. But in that same world, failure is inevitable, even within the most robust IT infrastructure. Each instance of downtime, or each failure to meet availability or performance objectives, can have a significant effect on customer satisfaction. So when technology fails, your first thought is how to use incident management knowledge to resolve the situation and minimize downtime.

MTTR has traditionally stood for Mean Time to Repair, a measure of how long it takes to get a product or subsystem up and running after a failure. It’s used in the context of a traditional data center and relates to an organization’s physical infrastructure, such as servers and the network. Mean Time to Repair is calculated by taking the total maintenance time over a given period and dividing it by the number of incidents that occurred in that period.
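
As a rough illustration of that calculation, here is a minimal Python sketch. The function name and the three sample repair durations are hypothetical and chosen only to show the arithmetic: total maintenance time divided by the number of incidents.

```python
from datetime import timedelta


def mean_time_to_repair(repair_durations):
    """Return total maintenance time divided by the number of incidents."""
    if not repair_durations:
        raise ValueError("at least one incident is required")
    total = sum(repair_durations, timedelta())
    return total / len(repair_durations)


# Hypothetical repair times for three incidents in one reporting period.
durations = [
    timedelta(hours=2),
    timedelta(hours=1, minutes=30),
    timedelta(minutes=30),
]

mttr = mean_time_to_repair(durations)
print(mttr)  # 1:20:00 (four hours of repair time across three incidents)
```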

However, in a digitized world that revolves around big data applications and distributed computing architectures, it’s more accurate to think in terms of another MTTR definition: Mean Time to Recovery. When the speed of IT support is of the essence, that definition of MTTR becomes a key focus. Mean Time to Recovery is a service-level metric that measures the average elapsed time from when an incident is reported until the incident is resolved and the affected system or service has recovered from the failure. It includes the time it takes to identify the failure, diagnose the problem, and repair it, and it is measured in business hours, not clock hours.

For example, a ticket that is opened at 4:00 pm on a Friday and closed at 4:00 pm the following Monday will have a resolution time of eight business hours (assuming a 9:00 am to 5:00 pm business day), not 72 clock hours. MTTR comes into play when entering into contracts that include Service Level Agreement (SLA) targets or maintenance agreements. In SLA targets and maintenance contracts, you would generally agree to a Mean Time to Recovery metric that sets a minimum service level for which you can hold the vendor accountable. In a digitized environment where infrastructure and hardware repair have become more automated, Mean Time to Recovery can refer to application issues as well as infrastructure issues.
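
The Friday-to-Monday ticket can be checked with a short Python sketch. It assumes a business calendar of Monday through Friday, 9:00 am to 5:00 pm, with no holidays. That calendar, the function name, and the specific dates are illustrative assumptions rather than part of any SLA standard.

```python
from datetime import datetime, time, timedelta

# Assumed business calendar: Monday-Friday, 9:00 am to 5:00 pm, no holidays.
OPEN, CLOSE = time(9, 0), time(17, 0)


def business_hours_between(start, end):
    """Return the business hours elapsed between two naive datetimes."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            # Clip the ticket's lifetime to this day's business window.
            window_start = max(start, datetime.combine(day, OPEN))
            window_end = min(end, datetime.combine(day, CLOSE))
            if window_end > window_start:
                total += window_end - window_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600


opened = datetime(2024, 1, 5, 16, 0)  # a Friday, 4:00 pm
closed = datetime(2024, 1, 8, 16, 0)  # the following Monday, 4:00 pm

business_hours = business_hours_between(opened, closed)
clock_hours = (closed - opened).total_seconds() / 3600
print(business_hours, clock_hours)  # 8.0 72.0
```

One business hour falls on Friday afternoon and seven on Monday, which gives the eight business hours in the example. A different agreed business day would change the result, so the window should match whatever the contract defines.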

Digital transformation encompasses cloud adoption, rapid change, and the implementation of new technologies. It also requires a shift in focus to applications and developers, an increased pace of innovation and deployment, and the involvement of new digital components like machine agents, Internet of Things (IoT) devices, and Application Programming Interfaces (APIs).

When your network or applications unexpectedly fail or crash, IT downtime can have a direct impact on your bottom line and ongoing business operations. According to Gartner, the average cost of IT downtime is $5,600 per minute, which works out to $336,000 per hour. However, this is just an average, and there is a large degree of variance based on the characteristics of your business and IT environment. The cost to online businesses can soar into the millions of dollars per hour. Amazon’s one hour of downtime on Prime Day in 2018 may have cost it up to $100 million in lost sales.
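
To make the per-minute figure concrete, the sketch below multiplies it out for an outage of a given length. The $5,600 rate is the average cited above. The function name is hypothetical, and a real estimate would substitute a rate specific to your own business.

```python
COST_PER_MINUTE_USD = 5_600  # average cited above; actual costs vary widely


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage as duration times a per-minute rate."""
    return minutes * cost_per_minute


hourly_cost = downtime_cost(60)
print(f"${hourly_cost:,} per hour")  # $336,000 per hour
```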

Reducing MTTR enables you to save time and IT resources, as well as to mitigate the severity and frequency of incidents and the likelihood of application or service downtime. Resolving an issue usually involves three basic steps:

  • Detecting the problem, ideally before it impacts users or when its significance is low
  • Diagnosing the problem rapidly using detailed information to consistently narrow the search
  • Resolving and testing to confirm that the problem has been fixed

Reducing MTTR is a key objective of IT Operations groups, with the desired outcome of improved stakeholder satisfaction. Most of the total problem resolution time is spent identifying the root cause of a problem, and only a small part is spent actually fixing it. Problems that are left to escalate will have a much higher cost to the organization. So, being able to quickly identify the root cause of a problem can drastically reduce the MTTR for enterprise applications and analytics workloads.

However, application environments vary in scale and complexity, and there is no “one size fits all” solution. Big data environments, for example, are exceptional and require a specialized approach to reducing MTTR for applications and services. Data is constantly generated any time we open an app, search Google, or simply travel from place to place with our mobile devices. The result is big data: massive, complex structured and unstructured data sets that are generated and transmitted from a wide variety of sources, stored on Hadoop and Spark platforms, and ultimately visualized and analyzed.

There is no official definition of big data, but a common one is “data sets that are too large for traditional tools to store, process, or analyze”. Traditional application performance management (APM) solutions simply aren’t equipped to handle this kind of complexity and volume. Resolving big data performance issues requires an APM solution specifically designed for big data environments.

Big data workloads and applications are often plagued by multiple performance problems that result in system failures, and these problems are only magnified in distributed computing architectures such as Hadoop and Spark. Intermittent performance problems, in particular, tend to be the most challenging to diagnose for several reasons:

  • The conditions of the failure are often elusive
  • Re-occurrence is unpredictable
  • There are few opportunities to observe the problem
  • The environment itself is changing through the course of these long-running problems

A big data APM approach addresses all of these challenges and enables ITOps and developers to quickly diagnose performance problems. That’s because a big data APM approach, using Pepperdata Application Spotlight and Platform Spotlight, continuously collects application and infrastructure performance metrics from more than 300 data points, from each node in a big data cluster, every five seconds. This rich set of metrics enables Pepperdata customers to rapidly detect the root cause of problems. Over the past year, Pepperdata has captured more than 900 trillion data points from more than 275 big data production clusters, a figure that continues to grow.

Proactive big data application performance management with Pepperdata Application Spotlight and Platform Spotlight can reduce MTTR by up to 95 percent, and in many cases, pre-empt service downtime in large-scale, multi-tenant Hadoop and Spark environments. With Pepperdata big data APM solutions, determining the root cause of bottlenecks and other performance-related problems takes minutes instead of hours or days. Pepperdata big data APM solutions also help raise the flag on symptoms before they become problems, from finding sluggish queries to identifying high volume requests that should be optimized.