My team was one of the first to use Hadoop in production – even before it was called Hadoop. It was a powerful platform that we used to support Yahoo’s search engine, and we certainly had our share of technical challenges. The cluster inexplicably melted down on more than one occasion. We were sometimes able to diagnose the problem after the fact, but we still faced downtime for key search or advertising functionality. At Pepperdata, we’re working with customers who run some of the largest Hadoop clusters out there – several of them have more than 1,000 nodes.
We’ve been able to observe and help unravel some intriguing meltdown mysteries. It turns out that while Hadoop holds great promise as a platform, complex jobs can fail unexpectedly even when they previously worked. Clusters can also mysteriously grind to a halt (a recent article in Forbes points out some more of the potential failure modes). If you’ve been running Hadoop in production for any period of time, these situations may seem familiar.
I’ll share two examples of meltdowns we saw on large-scale (hundred to thousand nodes) customer clusters and how we found the root causes. Although we use Pepperdata’s dashboard in the examples, the approach we took can help you think through diagnosing similar meltdowns in your own Hadoop cluster.
Disks are thrashing!
At one customer, we saw a case where the cluster’s disks started thrashing and slowed everything to a crawl. We could clearly see frequent spikes when the disks became very busy (below). The question at hand was “what was causing those spikes?”
Per node disk I/O metrics. The spikes to 200K+ IOPs are clearly visible.
To further diagnose this, we looked at some job-level metrics about how they were using the cluster’s IO subsystem. We quickly discovered that the spikes (see the bottom graph below) happened at the same time as heavy periods of many individual files being opened, quite rapidly. The biggest spike, that one on the right, showed that this cluster was opening over 200 thousand individual files per second.
Files opened per second (top) compared to number of disk operations (bottom).
In order to find the culprit, we broke the data down by user and job. It became immediately clear that it was not widespread usage of the cluster, but rather one individual repeated series of ETL jobs that was causing this behavior. The middle graph below shows that all the spikes were from one user and the right-hand graph shows that it was several jobs from that user. There are other users on the middle graph (for example you can see the tiny little green job), but they are all small compared to this ETL user.
Per-user and per-job files opened, compared to overall files opened, show the clear culprit.
When we zoomed in on one of these jobs and looked at its tasks (in the chart on the right below), we could see that each individual task was opening between 400 and 500 new files per second.
File opens per job and per task. One job is clearly overwhelming the cluster with file requests.
Once we knew this root cause, the solution was to let the author know. He was rather surprised, to say the least.
Nodes are dying!
In another case, a customer faced a situation where nodes were dying across a business-critical 1,200-node cluster. Late one night, nodes abruptly started swapping and becoming non-responsive. The datacenter operations had to get paged to come in off-hours and physically reset the hosts. When the team investigated the problem the next day, the job submitters all reported that they didn’t change anything and no new software got deployed. So the question was “what the heck changed?” Because of the alarm bells raised by the swapping, we started by looking at memory usage on some of the nodes and found these spikes beginning a