The Pepperdata team has been working with Hadoop since its earliest days, and we’ve already seen it grow through two significant phases of maturity, defined by how companies are using it and what tools the ecosystem has provided. We’ve recently started to see another set of changes that make it clear we’re entering a third phase of Hadoop maturity.
Interestingly, the phases of Hadoop adoption by a given company tend to parallel the historical evolution of the Hadoop ecosystem itself.
Phase One: Hadoop as a shiny new tool
In the first phase of deployment, Hadoop is a new tool that a few scattered groups start to explore for research projects. (What is this thing? What can I do with it?) Developers build small clusters and write experimental jobs to see how they can use the power, scalability, and flexibility of Hadoop to do more with data than they could before.
This level of maturity is also like the historical early days of Hadoop itself – users could run MapReduce and HBase, and early tools like Pig and Hive made Hadoop easier to use, but people still thought in terms of “writing jobs” and “will this job complete at all?” rather than in terms of applications, workflows, predictable run times, and operability.
Phase Two: Hadoop for real, when it matters
For a given company, that first phase typically lasts for a year or so, and then the results become useful enough to the business that a department ends up building a “real” cluster and starts managing it with operations people instead of pure developers. Sometimes this transition is planned; other times it sneaks up on people. Companies start to care about predictable run times, running different kinds of workloads on the same cluster (for example, running MapReduce and HBase together), efficiency and ROI, disaster recovery, and similar concerns that are typical of “real” IT projects.
As with the first phase, this second phase of Hadoop maturity within a company was mirrored by increasing maturity of the Hadoop ecosystem as a whole: at the core, Hadoop 2 let people deploy more kinds of applications via YARN, and companies and open-source projects have added new kinds of functionality such as Spark and fast interactive databases on Hadoop. Similarly, tools for cluster management (such as Cloudera Manager and Ambari) became increasingly robust and easy to use, and third parties began to sell products to provide features like security that IT departments require.
(This second phase is also where Pepperdata came in. As businesses started to run more workloads on the same cluster, they realized that out-of-the-box Hadoop didn’t give them the kind of predictability and performance they needed. Pepperdata’s real-time cluster optimization software fills that gap.)
Phase Three: Hadoop provided to every business unit, by centralized departments
Now it’s clear that we’re entering a third phase of Hadoop maturity within enterprises. Over the past year we’ve talked to hundreds of companies using Hadoop, and a new theme has emerged: multi-departmental Hadoop clusters run by central IT organizations to serve all of their business units. This is similar to how central IT groups already provide networking and data center space as a service. Interestingly, we’ve seen this trend more among large enterprises (especially financial institutions) than among the technology companies that were the early adopters of Hadoop.
The enterprises we’re working with generally call this “internal Hadoop as a service” or “Big Data as a service.” (Try making an acronym out of that second one.)
With this third phase comes a new set of requirements for Hadoop:
- When multiple departments are using shared infrastructure, they demand SLAs – it just doesn’t work if one group’s use of Hadoop slows down everyone else’s use beyond an acceptable limit.
- As Hadoop becomes an increasingly big part of a company’s IT spend, it’s more important than ever that it be efficient. Enterprises don’t want to buy another hundred servers just because that’s the most obvious way to support more work – and even if they did, they often can’t find enough data center space to put them in.
- With many departments and hundreds or thousands of users running jobs on the same cluster, the operations group needs to understand which jobs might be causing problems for all the others, and to help users understand and reduce the impact of poorly written jobs.
- Now that each business unit isn’t buying its own cluster, it’s critical for IT to be able to accurately allocate costs back to each department.
- … Plus myriad other requirements that enterprises have once business units start sharing data and compute: granular access control, business continuity, regulatory compliance, and so on.
Chargeback reports: Providing another missing piece of the puzzle
Pepperdata’s real-time cluster optimization software has already helped enterprises meet some of these requirements. Today we’re launching a new feature that provides accurate chargeback reports of real hardware usage, giving enterprise IT organizations the detailed visibility they need when running internal Hadoop as a service.
Until today, IT’s ability to measure and charge for Hadoop usage was limited to measuring data storage at rest – how many terabytes each department was storing on disk. The problem is that such a simple metric only captures a fraction of the total cost of the cluster; it doesn’t reflect who’s actually using that data and how much computational power they’re consuming as they do their work. It also tends to penalize groups that provide data that is useful for other departments as well, since they get charged for storing “their” data even if others are using it too.
Our new chargeback reports provide the missing pieces needed to get a complete picture: we report on total CPU, memory, disk I/O (data access, not just data at rest), and network used by each group during any desired window of time. Operators can get a summary view broken down by application or queue, and if needed they can also drill down into usage by individual users and even jobs, in order to find the most expensive workloads running on the cluster.
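To make the idea concrete, here is a minimal sketch (not Pepperdata’s actual implementation) of how job-level usage across several metrics might be rolled up into a per-department chargeback total. The usage records, resource names, and unit costs are all hypothetical, standing in for metrics a real deployment would collect from the cluster:

```python
from collections import defaultdict

# Hypothetical per-job usage records: (department, CPU core-hours,
# memory GB-hours, disk I/O GB, network GB). In practice these would
# come from per-task metrics collected on the cluster.
USAGE = [
    ("marketing", 120.0, 480.0, 900.0, 50.0),
    ("finance",    40.0, 160.0, 300.0, 10.0),
    ("marketing",  60.0, 240.0, 450.0, 25.0),
]

# Assumed unit costs in dollars per unit of each resource; a real
# chargeback model would derive these from hardware amortization.
RATES = {"cpu": 0.05, "mem": 0.01, "disk": 0.002, "net": 0.01}

def chargeback(usage, rates):
    """Roll job-level usage up into a per-department dollar total."""
    totals = defaultdict(float)
    for dept, cpu, mem, disk, net in usage:
        totals[dept] += (cpu * rates["cpu"] + mem * rates["mem"]
                         + disk * rates["disk"] + net * rates["net"])
    return dict(totals)

for dept, cost in sorted(chargeback(USAGE, RATES).items()):
    print(f"{dept}: ${cost:.2f}")
```

Because the rollup keys on whatever grouping field the records carry, the same aggregation works at the queue, user, or job level; swapping the first tuple element for a user name would yield a per-user breakdown instead.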
As we’ve talked to customers about the challenges of running a centralized Hadoop service, we’ve heard over and over that SLAs, efficiency, and chargebacks are critical requirements that they’re not getting from out-of-the-box Hadoop. With today’s news, we’re excited to announce that Pepperdata is providing another one of these missing pieces.