My journey with Big Data began at a time when performance monitoring wasn’t a major focus, and yet it landed me at a company where that’s all we do. In 2012, I joined Groupon as part of a team of systems engineers who were focused on best practices. With the exception of Nagios telling us that a host was down or otherwise unreachable, there wasn’t much in terms of performance monitoring. The handful of Hadoop clusters at the time were all based on Cloudera CDH3.x (MapReduce v1), and within my first year most of this hardware was replaced with infrastructure that could handle more data at higher throughput. The notion of running Hadoop on commodity servers was fading as organizations adopted high-end infrastructure that could handle the increasing demands of big data workloads.
Eventually, the demand for Hadoop infrastructure grew to cover almost every group in the company. More than 2,000 nodes were in production, spanning two shared clusters that handled ETL and various other batch processing workloads plus some additional large clusters handling various other production workloads. The sudden need for “data-driven everything” was catching on at data-dependent organizations around the world. So while companies were investing millions in upgraded hardware, the question remained, “Who is going to manage all of this infrastructure?” At some point, the answer to this question pointed to me, when in 2014 I became the engineering manager responsible for Hadoop operations. Now I had to build a team.
Managing Big Data systems is tricky. But staffing Big Data operations teams is even harder. Finding experienced engineers with solid Linux administration experience, a decent mastery of an automation system (Puppet, Chef, CFEngine), and experience working at scale with Hadoop was extremely challenging. Going through the process of screening calls and interviews, it was clear that the majority of candidates were not qualified, even though their freshly obtained certified administrator credential claimed otherwise. The biggest obstacle was finding candidates who had worked on these systems at scale and had experienced the interesting ways in which Hadoop could “break”.
Fortunately, I was able to recruit two rock-star engineers internally. Unfortunately, my new team spent a substantial part of their time answering questions like “Why is my job running so slow?”, “Why did my job fail?”, and “Is there something wrong with the Hadoop cluster?” By the time I was managing the team, we had inherited workloads from other teams that we migrated to an Ambari-based shared cluster.
Inheriting clusters with existing workloads and data made it impossible to deploy best practices and policies with a clean slate. Implementing them on existing data pipelines, while not impossible, is quite challenging because of all the moving parts. What type of compression is used? What is the minimum file size? How do you best manage data partitioning, table layout, data-at-rest encryption, security, and so on? These are not things you want to discuss after the fact. When teams are busy spending the majority of their time trying to answer “Why is my job running slowly today?”, there isn’t much time left to work on infrastructure improvements.
At least now we had some monitoring, but each trouble ticket had to be investigated thoroughly. Triage took ages. It seemed like every support request required digging through application logs, the YARN application history server, the Spark history server, NodeManager logs, and ResourceManager logs. Missed SLAs started a blame game among user groups. There were no detailed metrics to either substantiate claims or resolve them.
Eventually, our team received a support ticket from an engineer indicating that there was excessive garbage collection causing delays in workloads. There wasn’t. HDFS was hosed. Yes, small files. I spent the majority of my time in meetings explaining what small files were, why they were breaking HDFS, and why we needed to address the issue immediately. The sky was falling and HDFS performance on the main shared cluster was dismal. Fortunately, at that time, I discovered a company named Pepperdata after making a desperate request to Google in search of Hadoop monitoring solutions.
Within a week of connecting with Pepperdata, my Hadoop operations team was working with their field engineers on a POC on that very same failing cluster. Pepperdata instantly confirmed the small file issue by clearly identifying the number of files opened under 1 MB and then 10 MB per application in a nifty dashboard with data organized by username, app ID, group, or pretty much any category I wanted. With hundreds of other metrics and the ability to understand what was going on with the cluster as well as individual applications in near real time, my prayers were answered. Now we knew where to start looking when problems arose.
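To make that kind of breakdown concrete, here is a minimal Python sketch of counting files under 1 MB and 10 MB per user. It is only an illustration of the idea, not how Pepperdata computes its metrics. The function name `small_file_counts`, the `(user, size_in_bytes)` input format, and the sample data are all hypothetical; in practice the records would have to come from something like a filesystem listing or audit data.

```python
from collections import defaultdict

MB = 1024 * 1024


def small_file_counts(records, thresholds_mb=(1, 10)):
    """Count files below each size threshold, grouped by user.

    records: iterable of (user, size_in_bytes) pairs. This input format
    is an assumption made for the sketch, not a real tool's output.
    """
    counts = defaultdict(lambda: {t: 0 for t in thresholds_mb})
    for user, size in records:
        per_user = counts[user]  # ensures users with no small files still appear
        for t in thresholds_mb:
            if size < t * MB:
                per_user[t] += 1
    return {user: dict(c) for user, c in counts.items()}


# Made-up sample data for illustration only.
sample = [
    ("etl", 200 * 1024),
    ("etl", 5 * MB),
    ("etl", 256 * MB),
    ("adhoc", 12 * 1024),
    ("adhoc", 900 * 1024),
]

report = small_file_counts(sample)
for user, by_threshold in sorted(report.items()):
    print(user, by_threshold)
```

Grouping by user is just one choice here; the same loop works for app ID or group by swapping the first element of each record.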
Eventually, we were able to get HDFS performing optimally. Running the HDFS balancer no longer brought the filesystem and the applications accessing it to a screeching halt. I did, however, wind up leaving the company about six months later. After five years as a Hadoop operations manager, I was burnt out. After taking a break for a few months, I was ready to rejoin the workforce. The first company I reached out to was Pepperdata, and I ended up joining the company in 2017. This was no accident.
I was really impressed by the technology and the people at Pepperdata while working with them on a POC deployment at my previous employer. They had a solution that could bridge the communication gap between engineering and operations and correlate application performance with infrastructure performance, providing a single pane of glass to work from…something that could have saved me a lot of lost sleep (and hair) back in 2014. Fast forward to 2019, and I’m still very excited about my role at Pepperdata. I get to work with a lot of awesome customers, several in the Fortune 100, who experience many of the big data performance challenges that I’m familiar with.
At Pepperdata I’m able to help customers with best practices by optimizing big data performance monitoring and management, and dramatically reducing their triage time. I also teach developers to monitor and performance-tune their own applications using our big data APM solutions. If you are managing Hadoop, Spark, or YARN-based infrastructure and are experiencing any of the challenges that I described in this blog post, please reach out to me at firstname.lastname@example.org.