My team was one of the first to use Hadoop in production – even before it was called Hadoop. It was a powerful platform that we used to support Yahoo’s search engine, and we certainly had our share of technical challenges. The cluster inexplicably melted down on more than one occasion. We were sometimes able to diagnose the problem after the fact, but we still faced downtime for key search or advertising functionality. At Pepperdata, we’re working with customers who run some of the largest Hadoop clusters out there – several of them have more than 1,000 nodes.

We’ve been able to observe and help unravel some intriguing meltdown mysteries. It turns out that while Hadoop holds great promise as a platform, complex jobs that previously worked can fail unexpectedly. Clusters can also mysteriously grind to a halt (a recent article in Forbes points out some more of the potential failure modes). If you’ve been running Hadoop in production for any length of time, these situations may seem familiar.

I’ll share two examples of meltdowns we saw on large-scale customer clusters (hundreds to thousands of nodes) and how we found the root causes. Although we use Pepperdata’s dashboard in the examples, the approach we took can help you think through diagnosing similar meltdowns in your own Hadoop cluster.

Disks are thrashing!

At one customer, we saw a case where the cluster’s disks started thrashing and slowed everything to a crawl. We could clearly see frequent spikes when the disks became very busy (below). The question at hand was “what was causing those spikes?”

Per-node disk I/O metrics. The spikes to 200K+ IOPS are clearly visible.
To diagnose this further, we looked at job-level metrics showing how jobs were using the cluster’s I/O subsystem. We quickly discovered that the spikes (see the bottom graph below) coincided with periods when many individual files were being opened in rapid succession. The biggest spike, the one on the right, showed that the cluster was opening over 200 thousand individual files per second.

Files opened per second (top) compared to number of disk operations (bottom).
To find the culprit, we broke the data down by user and job. It became immediately clear that the cause was not widespread usage of the cluster but a single, repeated series of ETL jobs. The middle graph below shows that all the spikes came from one user, and the right-hand graph shows that several jobs from that user were responsible. There are other users on the middle graph (for example, you can see the tiny little green job), but they are all small compared to this ETL user.

Per-user and per-job files opened, compared to overall files opened, show the clear culprit.
When we zoomed in on one of these jobs and looked at its tasks (in the chart on the right below), we could see that each individual task was opening between 400 and 500 new files per second.

File opens per job and per task. One job is clearly overwhelming the cluster with file requests.
Once we knew this root cause, the solution was to let the author know. He was rather surprised, to say the least.
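
The drill-down described above (overall file opens, then per user, then per job) can be sketched in a few lines of Python. This is only an illustration of the approach, not how the Pepperdata dashboard computes anything: the sample records, the user and job names, and the helper functions total_by and top_contributor are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-task samples: (user, job_id, task_id, files_opened_per_sec).
# The values are invented for illustration; they are not the customer's data.
SAMPLES = [
    ("etl_user", "job_001", "task_a", 450),
    ("etl_user", "job_001", "task_b", 480),
    ("etl_user", "job_002", "task_c", 420),
    ("analyst", "job_101", "task_d", 3),
    ("reporting", "job_201", "task_e", 5),
]


def total_by(samples, key_index):
    """Sum files opened per second, grouped by the field at key_index."""
    totals = defaultdict(int)
    for sample in samples:
        totals[sample[key_index]] += sample[3]
    return dict(totals)


def top_contributor(totals):
    """Return (name, share_of_total) for the largest contributor."""
    grand_total = sum(totals.values())
    name = max(totals, key=totals.get)
    return name, totals[name] / grand_total


# Break the same metric down by user (field 0) and by job (field 1).
by_user = total_by(SAMPLES, 0)
by_job = total_by(SAMPLES, 1)

top_user, user_share = top_contributor(by_user)
top_job, job_share = top_contributor(by_job)

print(f"top user: {top_user} ({user_share:.0%} of file opens)")
print(f"top job:  {top_job} ({job_share:.0%} of file opens)")
```

If one user or job accounts for nearly all of the metric, as the ETL user did here, you have a single culprit rather than a cluster-wide load problem.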

Nodes are dying!

In another case, a customer faced a situation where nodes were dying across a business-critical 1,200-node cluster. Late one night, nodes abruptly started swapping and becoming non-responsive. The datacenter operations team had to be paged to come in off-hours and physically reset the hosts. When the team investigated the problem the next day, the job submitters all reported that they hadn’t changed anything and that no new software had been deployed. So the question was “what the heck changed?” Because of the alarm bells raised by the swapping, we started by looking at memory usage on some of the nodes and found these spikes beginning at around 10:00 pm.

Memory used by host lets us identify unusual spikes like this one.
We found that the memory used by tasks running on this node suddenly went way above normal, consuming 6GB out of the 8GB on the host (in the graph above). When you add in the OS overhead, this caused the node to start swapping. This particular node recovered, but most were not so lucky. From there, we decided to break the usage spike down by job (graph below), and we found that one particular job caused the entire problem.

Drilling down on the per-job memory usage lets us immediately spot rogue jobs.
By looking at that job, we were able to identify the user and the job’s purpose. When we looked closely at the individual tasks (graph below), we found that this one job’s tasks were much greedier than everything else. Most tasks were well under 1GB, but this one very rapidly (within a few seconds) spiked up to 1.5GB or even 2GB.

By digging further into individual task usage, we can see that only a few tasks are using more memory than the expected 1GB.
When we investigated the specific job, we found that while it didn’t change, the input data did. That user’s jobs were stopped immediately so the cluster would stop melting down. We then made two configuration changes: we made better use of the capacity scheduler’s virtual memory controls, and we used the Pepperdata protection features to limit the physical memory of tasks.
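
The same reasoning can be written down as a rough check: flag the jobs whose tasks exceed the expected per-task footprint, and test whether the tasks' total memory plus system overhead fits on the host. The sketch below is illustrative only. The 1GB expected task size, the 8GB host, and the 6GB spike come from the example above; the task records, the 2.5GB overhead figure, and the function names are assumptions made up for this sketch.

```python
# Hypothetical peak physical memory per task, in GB: (job_id, task_id, peak_gb).
# Invented values that only mirror the shape of the incident described above.
TASK_PEAKS = [
    ("job_a", "t1", 0.6),
    ("job_a", "t2", 0.8),
    ("job_b", "t1", 1.5),
    ("job_b", "t2", 2.0),
    ("job_c", "t1", 0.7),
]

EXPECTED_TASK_GB = 1.0  # expected per-task footprint from the example
HOST_GB = 8.0           # physical memory on the host from the example
OS_OVERHEAD_GB = 2.5    # ASSUMPTION: OS and daemon overhead, not a measured value


def tasks_over_limit(peaks, limit_gb):
    """Return the (job, task, peak_gb) records whose peak exceeds limit_gb."""
    return [(job, task, gb) for job, task, gb in peaks if gb > limit_gb]


def jobs_over_limit(peaks, limit_gb):
    """Return the sorted job ids that own at least one over-limit task."""
    return sorted({job for job, _task, _gb in tasks_over_limit(peaks, limit_gb)})


def node_will_swap(task_total_gb, host_gb, os_overhead_gb):
    """True if task memory plus overhead exceeds the host's physical memory."""
    return task_total_gb + os_overhead_gb > host_gb


greedy_tasks = tasks_over_limit(TASK_PEAKS, EXPECTED_TASK_GB)
rogue_jobs = jobs_over_limit(TASK_PEAKS, EXPECTED_TASK_GB)

print("tasks over the expected size:", greedy_tasks)
print("jobs to investigate:", rogue_jobs)
print("6GB of tasks on this host swaps:", node_will_swap(6.0, HOST_GB, OS_OVERHEAD_GB))
```

A check like this only points at where to look. In our case the job itself had not changed, so the per-task numbers were what led us to the input data.
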
What we can learn from these examples is that while you can spot the symptoms at the node level, you find the root causes only when you dig into the job and task details. As more applications get deployed on Hadoop, the mysteries are certainly going to get more complex and intriguing, though the IT operations teams who have to deal with them would only describe them as “intriguing” long after they’ve resolved them!
That’s one of the key reasons Chad Carson and I founded Pepperdata. We believe that if we can give IT clarity into job, task and user-level operations and better control over the cluster resources, the “mysterious” meltdowns will be a thing of the past for Hadoop.