My journey with Big Data began at a time when performance monitoring wasn’t a major focus, and yet it landed me at a company where that’s all we do. In 2012, I joined Groupon as part of a team of systems engineers who were focused on best practices. With the exception of Nagios telling us that a host was down or otherwise unreachable, there wasn’t much in terms of performance monitoring. The handful of Hadoop clusters at the time were all based on Cloudera CDH3.x (MapReduce v1), and within my first year most of this hardware was replaced with infrastructure that could handle more data with faster throughput. The notion of running Hadoop on commodity servers was fading as organizations adopted high-end infrastructure that could handle the increasing demands of big data workloads.
Eventually, the demand for Hadoop infrastructure grew to cover almost every group in the company. More than 2,000 nodes were in production, spanning two shared clusters that handled ETL and various other batch processing workloads plus some additional large clusters handling various other production workloads. The sudden need for “data-driven everything” was catching on at data-dependent organizations around the world. So while companies were investing millions in upgraded hardware, the question remained, “Who is going to manage all of this infrastructure?” At some point, the answer to this question pointed to me, when in 2014 I became the engineering manager responsible for Hadoop operations. Now I had to build a team.
Managing Big Data systems is tricky. But staffing Big Data operations teams is even harder. Finding experienced engineers with solid Linux administration experience, a decent mastery of an automation system (puppet, chef, cf engine) and experience working at scale with Hadoop was extremely challenging. Going through the process of screening calls and interviews, it was clear that the majority of candidates were not qualified even though their freshly-obtained certified administrator credential claimed otherwise. The biggest obstacle was finding candidates who had worked on these systems at scale, and had experienced the interesting ways in which Hadoop could “break”.
Fortunately, I was able to recruit two rock-star engineers internally. Unfortunately, my new team spent a substantial part of their time answering questions like, “Why is my job running so slow? Why did my job fail?, and “Is there something wrong with the Hadoop cluster?” By the time I was managing the team we inherited workloads from other teams that we migrated to an Ambari-based shared cluster.
Inheriting clusters with existing workloads and data made it impossible to deploy best practices and policies with a clean slate. Implementing them on existing data pipelines, while not impossible, is quite challenging because of all the moving parts. What type of compression is used? What is the minimum file size? How to best manage data partitioning, table layout, data at rest encryption, s