Doesn’t YARN Already Do This? The Limitations of Manually Tuning Hadoop and How Pepperdata Improves YARN and the ResourceManager

Pepperdata makes Hadoop+YARN-based systems better by providing total performance management (TPM) for big data. TPM combines application performance management (APM) and operations performance management (OPM) in a single package, so developers and operators can rely on the same underlying information to build and operate highly performant big data applications in multi-tenant clusters. For developers, the Application Spotlight self-service APM portal surfaces applications that require attention from a performance perspective. Application Spotlight provides precise recommendations to improve performance, automatically identifies bottlenecks, and makes it easy to analyze the root cause of errors and failures.

Application Spotlight provides a personalized experience via a dashboard that shows all of the developer’s applications, key performance indicators, and custom views in one place.

For operators, the OPM solution (Cluster Analyzer and the Capacity Optimizer add-on module) makes it easy to identify applications and users causing issues on the platform, proactively alert on those issues, and improve cluster performance. It also includes roll-up reports for tasks such as chargeback and capacity planning.

The cluster summary view gives the operator an ‘at-a-glance’ picture of cluster health, key performance indicators, and access to custom views.

The Pepperdata Capacity Optimizer add-on module can automatically add up to 50% more containers without any additional hardware by addressing some of the inefficiencies of how YARN does resource management today.

OPM Control (Capacity Optimizer) Adds up to 50% More Containers without any Additional Hardware

Doesn’t YARN Already Do This?

We are sometimes asked, “Doesn’t YARN already do this?” or “Does Pepperdata replace YARN?” The quick answers: YARN does not already do this, and Pepperdata does not replace YARN or the ResourceManager, but it can significantly augment their capabilities.

YARN (“Yet Another Resource Negotiator”) was introduced as part of Hadoop 2.0 in 2012. YARN takes the resource management capabilities of MapReduce and packages them for use by new engines. YARN enables batch, interactive, and streaming jobs to run simultaneously on the same Hadoop cluster. This allows enterprises to deploy Hadoop for new and different applications and use cases. YARN coordinates consumption and usage reservations in an attempt to ensure resources are allocated fairly.

However, YARN’s scheduler does not track what containers actually use once they start running. This means that YARN must be conservative in its assumptions about memory usage and assume the worst case instead of monitoring and adjusting based on actual usage. Pepperdata addresses this problem by monitoring per-task hardware usage as jobs run and maximizing resource utilization.

How the Pepperdata Capacity Optimizer Add-on Module Complements YARN

The Capacity Optimizer is an optional add-on module for operators that uses active resource management to dynamically eliminate inefficiencies and bottlenecks without manual job or cluster tuning. 

Working with the Pepperdata Cluster Analyzer OPM solution, Capacity Optimizer improves the capacity utilization of existing production clusters without manual tuning or intervention.

At its core, YARN enables many different types of workloads to be run on Hadoop. However, YARN provides little to no resource management after jobs start running. Sometimes, YARN assumes that node memory utilization is high based on the static container reservations specified by developers’ run-time parameters rather than actual physical memory usage, thus leaving resources unused.

Capacity Optimizer picks up where YARN leaves off. Capacity Optimizer uses sophisticated, patented algorithms to track and predict the actual memory usage per container, allowing YARN to schedule more workload immediately. In effect, Pepperdata increases the amount of usable memory on the node made available to YARN. This proprietary advantage allows operators to achieve much higher hardware utilization. Typical enterprise deployments experience a 30-50% increase in throughput when Capacity Optimizer is enabled.
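To make the gap between reservations and actual usage concrete, here is a minimal sketch in Python. It is not Pepperdata's algorithm, which is proprietary; the node size, container sizes, usage figures, and the safety_margin_mb parameter are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Container:
    reserved_mb: int  # static reservation requested at launch
    used_mb: int      # physical memory actually in use right now (assumed measurable)


def extra_containers(node_mb, containers, new_container_mb, safety_margin_mb):
    """Return (fit_by_reservation, fit_by_usage): how many more containers of
    size new_container_mb fit on the node under each accounting method."""
    reserved = sum(c.reserved_mb for c in containers)
    used = sum(c.used_mb for c in containers)

    # Reservation-based view: every container is assumed to use its full reservation.
    free_by_reservation = node_mb - reserved
    # Usage-based view: count only measured usage, minus a hypothetical safety margin.
    free_by_usage = node_mb - used - safety_margin_mb

    fit_by_reservation = max(free_by_reservation, 0) // new_container_mb
    fit_by_usage = max(free_by_usage, 0) // new_container_mb
    return fit_by_reservation, fit_by_usage


# Hypothetical 64 GB node with sixteen 4 GB containers, each using about 1.5 GB.
running = [Container(reserved_mb=4096, used_mb=1500) for _ in range(16)]
print(extra_containers(65536, running, new_container_mb=4096, safety_margin_mb=8192))
```

In this made-up scenario the node looks full by reservation, yet measured usage leaves room for more work. Each new container is still counted at its full reservation, which is a deliberately conservative choice in this sketch.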

The Limitations of Manually Tuning Hadoop and How Pepperdata Improves YARN and the ResourceManager

Operators who spend significant time tuning their Hadoop deployments may be skeptical of Capacity Optimizer’s ability to improve performance on a cluster that has already been tuned using industry-standard best practices. Capacity Optimizer identifies “holes” where a node can temporarily do more work and fills those holes with additional tasks, all while ensuring cluster reliability and safety. Capacity Optimizer automatically monitors and adjusts hardware resource usage at the process level in real time.  
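The idea of "holes" can be illustrated with a short Python sketch. This is only an assumption about what such a check might look like, not a description of how Capacity Optimizer works; the function name find_holes, the sample data, and the safety margin are hypothetical.

```python
def find_holes(usage_gb, node_gb, task_gb, safety_margin_gb=0):
    """Return (start, end) index pairs, end exclusive, for runs of consecutive
    samples where free memory is at least task_gb plus the safety margin."""
    holes = []
    start = None
    for i, used in enumerate(usage_gb):
        has_room = node_gb - used - safety_margin_gb >= task_gb
        if has_room and start is None:
            start = i                   # a hole opens at this sample
        elif not has_room and start is not None:
            holes.append((start, i))    # the hole closed before this sample
            start = None
    if start is not None:
        holes.append((start, len(usage_gb)))  # hole still open at the end
    return holes


# Hypothetical memory-usage samples (GB) on a 64 GB node.
samples = [60, 30, 28, 55, 62, 20, 25]
print(find_holes(samples, node_gb=64, task_gb=8, safety_margin_gb=4))
```

With these sample values, an 8 GB task fits during two separate windows. A real system would also need to predict how long a hole will last, which this sketch does not attempt.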

On a typical cluster, Capacity Optimizer makes hundreds or thousands of decisions per second. Even if Hadoop provided a mechanism to do so, the most talented dedicated operator or outside consultant could not make manual configuration changes with the precision and speed of Capacity Optimizer. Standard Hadoop configurations only affect up-front static resource reservations, so Hadoop must assume peak resource usage by every task, which typically wastes a significant amount of the cluster’s hardware resources. Additionally, YARN cannot engage in active resource management after container launch, except to kill containers under certain conditions (for example, when a container exceeds its memory limit).
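A rough way to see why peak-based static reservations waste resources is to compare a task's reservation with its usage over time. The Python sketch below uses a hypothetical memory profile and a hypothetical helper, reservation_waste; it assumes evenly spaced samples and a reservation held for the task's whole lifetime.

```python
def reservation_waste(usage_profile_mb, reserved_mb=None):
    """Fraction of reserved memory-time left unused when a task holds
    reserved_mb (default: its own peak usage) for its entire lifetime."""
    if reserved_mb is None:
        reserved_mb = max(usage_profile_mb)  # reserve for the worst case
    reserved_total = reserved_mb * len(usage_profile_mb)
    used_total = sum(usage_profile_mb)
    return 1 - used_total / reserved_total


# Hypothetical task: a brief 4 GB spike, otherwise well under 1 GB.
profile = [512, 1024, 4096, 1024, 512, 512, 256, 256]
print(reservation_waste(profile))
```

For this invented profile, most of the reserved memory-time sits idle because the reservation must cover a short spike. Real workloads vary, so the number is illustrative only.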

More on This Topic

To hear me speak about this in a little more detail, please watch the replay of a recent webinar on the same topic.

Schedule a Demo

To see firsthand how Pepperdata can help you run more jobs on an existing cluster, run jobs faster on an existing cluster, and reduce or delay new hardware acquisition, sign up for the demo below.