How to Quickly Pinpoint Cluster Issues

While administering a Hadoop cluster, you will often need to determine which application is causing a performance degradation. It could be an application that misbehaves as soon as it starts, or a long-running ETL or Spark ML job that suddenly starts using all of the resources on the nodes it runs on.

What now? You know there is an issue on the cluster, but you need to be able to do a few things in order to fix it:

  1. Identify the application: If this is an ad-hoc query, or an application without an SLA attached, identification alone might be enough to resolve the issue, because you can kill the application quickly, before it does more damage.
  2. Quantify the problem: If you are dealing with a high-SLA application, or an application in a dedicated high-priority queue, you will need the application owner’s permission to kill the rogue application. Before agreeing, the owner is likely to ask questions such as:
    • “What is my application doing wrong?”
    • “What other applications/system processes am I affecting?”
    • “If I am using too many resources, how many am I using, so that the application can be limited?”
  3. Remediate: If this is one-time bad behavior caused by a massive shift in the data set, work with the team involved to ensure you are told about such events in advance. If the application has consistently performed poorly over time, and has only now been identified as the culprit, it might be necessary to ask for a refactor. Quantified resource usage will be very important in making that request.
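
For comparison, a manual first pass at step 1 can be sketched against the YARN ResourceManager REST API, which lists running applications at `/ws/v1/cluster/apps`. This is only an illustrative sketch, not how Pepperdata works: the ResourceManager address is a placeholder, and `allocatedVCores` reports what YARN has allocated to an application, which is only a proxy for the CPU it actually consumes.

```python
import json
from urllib.request import urlopen


def fetch_running_apps(rm_url):
    """Return the RUNNING applications reported by the ResourceManager.

    rm_url is a placeholder such as "http://resourcemanager.example.com:8088".
    """
    with urlopen(f"{rm_url}/ws/v1/cluster/apps?states=RUNNING") as resp:
        payload = json.load(resp)
    # The "apps" field is null when no application matches the filter.
    return (payload.get("apps") or {}).get("app", [])


def top_consumers(apps, n=5):
    """Rank applications by allocated vcores, largest first."""
    ranked = sorted(apps, key=lambda a: a.get("allocatedVCores", 0), reverse=True)
    return ranked[:n]


def summarize(app):
    """One line per application: id, type, queue, user, and allocation."""
    return "{id} {applicationType} queue={queue} user={user} vcores={allocatedVCores}".format(**app)
```

Calling `top_consumers(fetch_running_apps(rm_url))` and printing `summarize(app)` for each result gives a quick candidate list. If the owner agrees, the application can then be stopped with `yarn application -kill <application_id>`.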

Pepperdata makes all three of these very easy to do.

  1. Identify the application: As the Pepperdata dashboard below shows, CPU utilization went from very low to very high almost immediately.

The following view shows that four worker nodes in the cluster were pushed to 98-99% User CPU. Left unchecked, this will cause many problems, including:

  • NodeManagers timing out into an unhealthy state
  • Any high-priority applications running in the same timeframe missing their SLAs

  2. Quantify the problem:

As you can see below, this Spark GraphX application was using over 98 percent of the User CPU on the four worker nodes in this cluster. Hive queries were running at the same time, and the long-running ones in this chart saw their share of CPU plummet.
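
The arithmetic behind a statement like "over 98 percent of the User CPU" can be sketched as follows. The input shape, a list of `(node, app_id, cpu_percent)` samples, is a hypothetical format chosen for illustration and is not a Pepperdata export format; any per-node, per-application CPU measurements would do.

```python
from collections import defaultdict


def cpu_share_by_app(samples):
    """Compute each application's share of measured CPU on every node.

    samples: iterable of (node, app_id, cpu_percent) tuples (hypothetical format).
    Returns {node: {app_id: share}}, where the shares on a node sum to 1.0.
    """
    per_node = defaultdict(lambda: defaultdict(float))
    for node, app_id, cpu in samples:
        per_node[node][app_id] += cpu

    shares = {}
    for node, apps in per_node.items():
        total = sum(apps.values())
        # Guard against an idle node, where every sample is zero.
        shares[node] = {
            app: (cpu / total if total else 0.0) for app, cpu in apps.items()
        }
    return shares
```

A per-node breakdown like this answers two of the owner's questions at once: how much the application is using, and which other applications it is squeezing.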

  3. Remediate:

Now that you have identified the issue, found the culprit, and quantified both what the application was doing and its effect on the system, you can move on to remediation. Pepperdata helps here not only by providing the identification and quantification, but also by providing recommendations. For Spark, it also offers a view of the application's internals that identifies which stage was at fault, so the developer knows where in the code to start looking.

This granularity and ease of discovery exist everywhere in Pepperdata: I/O, physical memory, queue memory, CPU, file opens, sockets, and so on. When an application is causing an issue in your environment, whether it is causing NameNode RPC timeouts, completely clogging a capacity-scheduled queue, or writing multiple petabytes into HDFS, we can help you identify the issue immediately and give you the information you need to take action. You can also set alerts on specific parameters to catch repeat offenders, or to catch applications that cause bottlenecks.
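
To make the idea of alerting on a parameter concrete, here is a minimal sketch of a sustained-threshold check. It is a generic illustration, not Pepperdata's alert configuration; the function name, the threshold, and the sample counts are all assumptions.

```python
def sustained_breach(values, threshold, min_consecutive):
    """Return True if `values` stays at or above `threshold` for at least
    `min_consecutive` samples in a row.

    Requiring consecutive samples avoids alerting on a single short spike.
    """
    run = 0
    for value in values:
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

For example, `sustained_breach(user_cpu_samples, threshold=95.0, min_consecutive=5)` would flag a node held at the 98-99% level described above while ignoring a brief one-sample peak.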
