One question we often get is “How is the visibility functionality of Pepperdata different from tools like Ganglia, Cloudera Manager, and Ambari?” We wanted to take some time to address this: while we’re fans of those tools, visibility in Pepperdata differs from them in both technology and use cases, and it’s worth using a few pixels to highlight those differences.
A quick note: in this post, we speak in terms of jobs and tasks, but the same also applies in a YARN framework. Pepperdata fully supports both Hadoop 1 and Hadoop 2 (YARN).
Installed on a Hadoop cluster, Pepperdata serves three major functions:
VISIBILITY–Capture an unprecedented level of detail on cluster resource usage
Pepperdata collects 200+ metrics in real time for the four resources (CPU, RAM, disk I/O, and network) for any given job or task, by user, group, and queue. This allows operators to quickly identify which job is causing issues, and it allows users to see what their jobs are doing on the cluster, and how well, while they are running. Because users and operators can see how jobs are behaving, they can improve them.
CONTROL–Enable implementation of service-level policies that guarantee on-time completion of high-priority jobs
Pepperdata senses contention among the four resources in real time and will slow down low-priority jobs just enough to ensure that high-priority SLAs are maintained. This SLA enforcement is ideal for multi-tenant environments, such as a cluster that is a central service used by various business units or a cluster that has a lot of diverse workloads running.
CAPACITY–Increase cluster throughput by 30-50%
Pepperdata knows the actual hardware resource capacity of your cluster and allows more tasks to run on nodes that have free resources at any given moment. In many instances jobs will run much faster because Pepperdata will dynamically allow them to use more of the true resource on the cluster when it is available.
The control and capacity aspects of Pepperdata are unique to our solution, so for the rest of this post, we will only talk about visibility and how it differs from Ganglia, Ambari, and Cloudera Manager.
PEPPERDATA VISIBILITY IN COMPARISON WITH CLOUDERA MANAGER, AMBARI, AND GANGLIA
To help explain the difference, we’ve prepared the following diagram. The Y-axis represents the granularity of what is measured, which increases as you move up the axis: node is the coarsest level, followed by job, and then task. The X-axis represents how often something is measured, with frequency increasing as you move right along the axis.
Cloudera Manager, Ambari, and Ganglia (the “monitoring tools”) all provide similar amounts of visibility, with Ambari actually relying on Ganglia for some of its collection of metrics. Rather than go into each one specifically, we’re going to highlight the major differences of monitoring and reporting.[1] If you’re interested in getting into the weeds, contact us for a whitepaper that does just that.
The instructive question here is: how much detail can I see? The monitoring tools are generally concerned with overall cluster health, individual node health (e.g. is the ResourceManager functioning?), and job completion. Hadoop itself provides a decent set of high-level metrics, so the tools are designed to aggregate and present these statistics.
Pepperdata takes this much further, allowing operators to drill down to see how a task is behaving on an individual node over time. This level of information is important because operators can use it to understand how a job’s behavior changes as it runs and interacts with other applications running on the cluster (the dreaded contention). Having that level of insight means that operators can identify the jobs that are actually causing problems and fix them.
The other half of fine-grained monitoring is update frequency. We believe that operators should see the state of the cluster in as close to real time as possible in order to make correct decisions.[2] The remit of the monitoring tools is to understand things at a node level, so their update interval of ten seconds to a minute is appropriate: the focus is on data for the entire job lifetime.
So why does Pepperdata increase the frequency of data measurement by an order of magnitude? It comes back to supporting a task-level view of a cluster, so administrators can see exactly what tasks are doing over time, and when/why contention is occurring. Real-time views aid understanding and debugging of cluster performance, answering the why, not just the what.
What does all this mean in practice? This image is an example showing CPU and memory by host, by job, and even by the tasks that make up each job, in real time. This allows operators to see what jobs are actually doing over time in the context of other jobs and tasks, which makes identifying outliers very easy. The monitoring tools only provide job/task-level information once the job/task has completed, and even then only the total for the entire run time of the job/task.
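To make the difference between end-of-run totals and a task-level time series concrete, here is a minimal sketch in Python. It is not Pepperdata code and does not reflect how any of the tools mentioned store their data; the sample values, job names, function names, and the 100% node capacity are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-task CPU samples taken every 5 seconds on one node:
# (seconds_since_start, job, task, cpu_percent). All values are made up.
SAMPLES = [
    (0, "job_a", "task_1", 30), (0, "job_b", "task_1", 10),
    (5, "job_a", "task_1", 30), (5, "job_b", "task_1", 10),
    (10, "job_a", "task_1", 30), (10, "job_b", "task_1", 90),
    (15, "job_a", "task_1", 30), (15, "job_b", "task_1", 10),
]


def totals_by_job(samples):
    """End-of-run view: a single summed figure per job."""
    totals = defaultdict(int)
    for _, job, _, cpu in samples:
        totals[job] += cpu
    return dict(totals)


def peak_by_task(samples):
    """Time-series view: the highest sample per (job, task) and when it occurred."""
    peaks = {}
    for ts, job, task, cpu in samples:
        key = (job, task)
        if key not in peaks or cpu > peaks[key][1]:
            peaks[key] = (ts, cpu)
    return peaks


def contended_intervals(samples, capacity=100):
    """Timestamps at which combined task demand exceeds the assumed node capacity."""
    demand = defaultdict(int)
    for ts, _, _, cpu in samples:
        demand[ts] += cpu
    return sorted(ts for ts, total in demand.items() if total > capacity)


if __name__ == "__main__":
    print(totals_by_job(SAMPLES))        # both jobs sum to the same total
    print(peak_by_task(SAMPLES))         # job_b spikes at t=10
    print(contended_intervals(SAMPLES))  # the one interval where demand > capacity
```

In this made-up data set the two jobs have identical totals, so a per-job summary cannot tell them apart. Only the per-interval samples show that job_b spiked at the 10-second mark and that the node was oversubscribed at that moment.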
In conclusion, the visibility features of Pepperdata should be seen as complementary to the functionality offered by the monitoring tools. Operators who have SLAs to meet, and users depending on them, need to go deeper with more granularity to understand every aspect of their cluster’s behavior, and our visibility features offer that. Of course, we use those same metrics to deliver capacity and control, but that’s the subject of another post…
[1] We’ll be ignoring other functionality, like Cloudera Manager’s assistance in deploying and configuring clusters.
[2] Or, the difference between a navigation system that says “turn here” and one that says “the reason you’re late is that you missed a turn.”