This blog post is part two of a two-part series authored by Pepperdata Field Engineer Jimmy Bates. Bates is a true veteran of the big data world. He speaks from a place of expertise and in-the-trenches experience. This second part describes what he discovered during his time at Pepperdata, and his thoughts on how it would have made life easier for so many of his previous MapR customers.

For full context, read part one.

Pepperdata to the Rescue

I wish I had known of Pepperdata from day one, during my first deployment of Hortonworks. I wish I had used Pepperdata on every Cloudera project I worked on. I wish I had used it on every MapR success. 

I could have saved thousands of hours, had I possessed the visibility provided by Pepperdata. It would not have solved all my problems. But it would have allowed me to identify and solve them much faster, and it would have solved some of my tougher production resource problems before I even knew I had them. Pepperdata would have given me 20% of my big data life back.

The GPS of Big Data

I think of Pepperdata as the GPS of Big Data. It continually helps me triangulate my current position on my flight to production success. It helps me in operations, in development, and in planning, as I carry my legacy loads from destination to destination. It gives me feedback to stay on course.

Triangulating Your Current Position

When you have real-world production workloads in big data systems, you always have waste—unless you fully understand every job and have tuned it to perfection. The most obvious benefit I get from Pepperdata? Capacity optimization. This has an immediate impact and takes no effort to implement. 

Pepperdata looks at every node in your big data system and compares the resources requested and allocated with the resources on each instance to determine what is actually used. Every 30 seconds, it adjusts what is reported as available so that the queues and scheduling rules you have in place have a better understanding of what is still free. This lets new projects, where you don’t really know what you need to ask for, access more resources without fear of turning your big data freeway into a big data parking lot. This works in all major Hadoop offerings and in all cloud-managed Hadoop offerings.
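To make the idea concrete, here is a minimal sketch of the gap between what a scheduler thinks is free and what is actually free on a node. It is not Pepperdata's implementation. The `NodeSnapshot` class, the function names, and the 10% safety headroom are all hypothetical choices for illustration. Only the 30-second interval comes from the description above.

```python
from dataclasses import dataclass

# Interval mentioned in the post; everything else here is an assumption.
REPORT_INTERVAL_SECONDS = 30


@dataclass
class NodeSnapshot:
    """Hypothetical per-node memory reading, in megabytes."""
    name: str
    total_mb: int      # physical memory on the node
    allocated_mb: int  # memory the scheduler has handed out to containers
    used_mb: int       # memory those containers are actually using


def scheduler_view_available(node: NodeSnapshot) -> int:
    """What a scheduler sees as free when it only counts allocations."""
    return max(node.total_mb - node.allocated_mb, 0)


def usage_adjusted_available(node: NodeSnapshot, headroom_percent: int = 10) -> int:
    """Free memory based on real usage, minus an assumed safety headroom."""
    headroom_mb = node.total_mb * headroom_percent // 100
    return max(node.total_mb - node.used_mb - headroom_mb, 0)


if __name__ == "__main__":
    # A node that looks nearly full by allocation but is mostly idle in practice.
    node = NodeSnapshot("worker-1", total_mb=64000, allocated_mb=60000, used_mb=24000)
    print("allocation-based view:", scheduler_view_available(node), "MB free")
    print("usage-adjusted view:  ", usage_adjusted_available(node), "MB free")
    print("re-evaluated every", REPORT_INTERVAL_SECONDS, "seconds")
```

In this made-up example the node looks almost full by allocation, yet most of its memory is idle. Reporting the usage-adjusted figure is what would let queued work start sooner.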

The advent of auto-scale cloud services may lead you to think that this problem is already solved. But it’s not.

Why? Because an auto-scale cloud service scales resources based on a cluster-wide view. When a cluster resource hits specific conditions, the cluster scales in or out accordingly. Unfortunately, this also scales your waste in direct proportion to your consumption. That is a great cost model for the cloud provider, but not so much for the consumer.
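A toy model shows why waste grows with the cluster. This is a sketch under stated assumptions, not any cloud provider's actual auto-scaling policy. The `ClusterMetrics` class, the 80% allocation threshold, and the assumption that new work arrives with the same request-to-use ratio are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    """Hypothetical cluster-wide memory totals, in megabytes."""
    total_mb: int
    allocated_mb: int  # what jobs asked for and were granted
    used_mb: int       # what jobs actually consume


def allocated_percent(m: ClusterMetrics) -> int:
    return 100 * m.allocated_mb // m.total_mb


def wasted_mb(m: ClusterMetrics) -> int:
    """Memory that is reserved but sitting idle."""
    return m.allocated_mb - m.used_mb


def should_scale_out(m: ClusterMetrics, threshold_percent: int = 80) -> bool:
    """Assumed rule: scale out when cluster-wide allocation crosses a threshold."""
    return allocated_percent(m) >= threshold_percent


def scale_out(m: ClusterMetrics, factor: int) -> ClusterMetrics:
    """Assume the workload grows with the cluster at the same request-to-use ratio."""
    return ClusterMetrics(m.total_mb * factor, m.allocated_mb * factor, m.used_mb * factor)


if __name__ == "__main__":
    before = ClusterMetrics(total_mb=1_000_000, allocated_mb=900_000, used_mb=400_000)
    print("scale out?", should_scale_out(before), "| wasted MB:", wasted_mb(before))
    after = scale_out(before, factor=2)
    print("after doubling | wasted MB:", wasted_mb(after))
```

The trigger only looks at allocation, so the cluster grows even though most of the reserved memory is idle, and the idle portion grows along with it.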

What I found with Pepperdata is that the focus on per-node, per-instance resource consumption helps to increase utilization of the current resources, decrease the time to job completion, and decrease auto-scaling events, with gains in the double-digit percentage range. Even when I am flying blind, it gives me a limited form of autopilot.

Job Optimization

Job optimization is a must. We all know you need to know your data and know your jobs. As you move a project from concept to first deployments to production, the job needs to be optimized for maximum efficiency. Pepperdata lessens the impact of non-optimized jobs while simultaneously giving you recommendations that help you optimize them. A perfect combo.

Operations Optimization

Operations, operations, operations! This is the last data point you need to fix your location and plot your course to success. With Pepperdata, I get continual optimization for my cluster on a per-node basis. I get a constant stream of recommendations to help optimize jobs. But that only takes you so far. You also need to empower your data operations with a holistic view of your big data environment. The per-job view is great, but you also need to see that job in relation to all cluster operations. With Pepperdata, I have per-node views, per-job views, and cluster-wide views to help bring perspective into focus.

Sometimes a job has issues because of someone else’s innovation. Pepperdata allows for quick isolation of cluster issues. With this, I don’t waste my time trying to figure out that my job took a sudden step back because someone else’s took one step forward. With Pepperdata, I can compare the details of a job between runs and before and after instance changes. As I was exploring Amazon EMR, Google Dataproc, Qubole, and Microsoft HDInsight, I was even able to run a series of jobs through each cloud environment and compare cost and performance across all of them.

Plotting Your Flight Path to Success

Pepperdata optimization in operations, jobs, and capacity gives you the insight you need to plot your path to production success within the complex world of multi-tenant big data. It reduces the challenge of handling skills shortages by giving targeted recommendations to your developers and operations folks as they come up to speed. It alleviates some of your resource constraints by continually optimizing your capacity on a per-node basis so all of your jobs run better. It does this across all major Hadoop offerings, whether on-prem or cloud-managed.

In my years of flying blind, Pepperdata was the autopilot I needed on my production journey.