How to Quickly Pinpoint Cluster Issues
When administering a Hadoop cluster, you will often need to determine which application is causing a performance degradation. It could be an application that exhibits bad behavior as soon as it starts, or a long-running ETL job or Spark ML job that suddenly starts using all of the resources of the nodes it is running on.
What now? You know there is an issue on the cluster, but you need to be able to do a few things in order to fix it:
- Identify the application: If this is an ad-hoc query, or an application without an SLA attached, this alone might be enough to resolve the issue, allowing you to kill the application quickly, before it does more damage.
- Quantify the problem: If you are dealing with a high-SLA application, or an application in a dedicated high-priority queue, you will need to ask the application owner's permission to kill the rogue application. As part of that, they are likely to ask you some questions such as:
- “What is my application doing wrong?”
- “What other applications/system processes am I affecting?”
- “If I am using too many resources, how many am I using, so the application can be limited?”
- Remediate: If this is one-time bad behavior caused by a massive shift in the data set, work with the team involved to ensure you are made aware of such events in the future. If the application has consistently performed poorly over time and has only now been identified as the culprit, it might be necessary to ask for a refactor. Quantifying the resource usage will be very important as part of this request.
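As a rough illustration of the identify and quantify steps, here is a minimal Python sketch that ranks running applications by allocated vcores. It assumes the JSON shape returned by the YARN ResourceManager REST API (`/ws/v1/cluster/apps?states=RUNNING`). The hostname and the helper names `fetch_running_apps` and `rank_apps` are hypothetical, and this is not how Pepperdata collects its metrics.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with your own.
RM_URL = "http://resourcemanager.example.com:8088"


def fetch_running_apps(rm_url=RM_URL):
    """Fetch the RUNNING applications from the ResourceManager REST API."""
    url = f"{rm_url}/ws/v1/cluster/apps?states=RUNNING"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def rank_apps(payload, top=5):
    """Return the top apps by allocated vcores, with each app's share of the total."""
    # "apps" is null in the response when nothing is running.
    apps = (payload.get("apps") or {}).get("app") or []
    total = sum(max(a.get("allocatedVCores", 0), 0) for a in apps)
    ranked = sorted(apps, key=lambda a: a.get("allocatedVCores", 0), reverse=True)
    return [
        {
            "id": a["id"],
            "user": a.get("user"),
            "queue": a.get("queue"),
            "vcores": a.get("allocatedVCores", 0),
            "memory_mb": a.get("allocatedMB", 0),
            "vcore_share": round(a.get("allocatedVCores", 0) / total, 3) if total else 0.0,
        }
        for a in ranked[:top]
    ]
```

Calling `rank_apps(fetch_running_apps())` would list the largest consumers first. Note that allocated vcores reflect what the scheduler granted, not measured CPU usage, so an application can saturate User CPU on a node while holding a modest allocation. Per-node CPU metrics are still needed to confirm the culprit.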
Pepperdata makes all three of these very easy to do.
- Identify the application: As we can see below via the Pepperdata dashboard, the CPU utilization went from very low to high very quickly.
The following view shows that four worker nodes in the cluster were pushed to 98-99% User CPU. This will cause many problems if left unchecked:
- NodeManagers timing out into an unhealthy state
- Any high-priority applications running in the same timeframe missing their SLAs
- Quantify the problem:
As you can see below, this Spark GraphX application was using over 98 percent of the User CPU on the fou