What is Scalability in Cloud Computing

The Pepperdata team has been working with Hadoop since its earliest days, and we’ve already seen it grow through two significant phases of maturity, defined by how companies use it and what tools the ecosystem provides. We’ve recently started to see another set of changes that make it clear we’re entering a third phase of Hadoop maturity. Interestingly, the phases of Hadoop adoption by a given company tend to parallel the historical evolution of the Hadoop ecosystem itself.

Phase One: Hadoop as a shiny new tool

In the first phase of deployment, Hadoop is a new tool that a few scattered groups start to explore for research projects. (What is this thing? What can I do with it?) Developers build small clusters and write experimental jobs to see how they can use the power, scalability, and flexibility of Hadoop to do more with data than they could before.

This level of maturity parallels the early days of Hadoop itself: users could run MapReduce and HBase, and early tools like Pig and Hive made Hadoop easier to use, but people still thought in terms of “writing jobs” and “will this job complete at all?” rather than in terms of applications, workflows, predictable run times, and operability.
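To make the “writing jobs” mindset concrete, here is a minimal sketch of the kind of self-contained MapReduce job typical of that era: the canonical word count, written against the Hadoop Java API. It is illustrative rather than production code; the class names are placeholders, and the input and output paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything lives in a single class, and the only real question is whether the job completes; there is no notion here of a long-lived application, a workflow, or a predictable run time.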

Phase Two: Hadoop for real, when it matters

For a given company, that first phase typically lasts for a year or so, and then the results become useful enough to the business that a department ends up building a “real” cluster and starts managing it with operations people instead of pure developers. Sometimes this transition is planned; other times it sneaks up on people. Companies start to care about predictable run times, running different kinds of workloads on the same cluster (for example, running MapReduce and HBase together), efficiency and ROI, disaster recovery, and similar concerns that are typical of “real” IT projects.

As with the first phase, this second phase of Hadoop maturity within a company has been mirrored by the increasing maturity of the Hadoop ecosystem as a whole: at the core, Hadoop 2 let people deploy more kinds of applications via YARN, and companies and open-source projects added new kinds of functionality such as Spark and fast interactive query engines on Hadoop. Similarly, tools for cluster management (such as Cloudera Manager and Ambari) became increasingly robust and easy to use, and third parties began to sell products that provide features, such as security, that IT departments require.
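To give a feel for that shift, here is a sketch of the same word count written against Spark’s Java API (assuming the Spark 2.x API; the paths are placeholders again). Submitted with spark-submit --master yarn, it runs as just one more YARN application, sharing cluster resources with MapReduce, HBase, and whatever else is deployed.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWordCount {
  public static void main(String[] args) {
    // No master is hard-coded: spark-submit decides whether this runs
    // locally or as a YARN application.
    SparkConf conf = new SparkConf().setAppName("spark word count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    counts.saveAsTextFile(args[1]);
    sc.stop();
  }
}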

(This second phase is also where Pepperdata came in. As businesses started to run more workloads on the same cluster, they realized that out-of-the-box Hadoop didn’t give them the kind of predictability and performance they needed. Pepperdata’s real-time cluster optimization software fills that gap.)

Phase Three: Hadoop provided to every business unit by centralized departments

Now it’s clear that we’re entering a third phase of Hadoop maturity within enterprises. Over the past year we’ve talked to hundreds of companies using Hadoop, and a new theme has emerged: multi-departmental Hadoop clusters.