There is little doubt that Hadoop adoption is growing, not just among enterprise-sized organizations but among small and medium-sized businesses as well. In fact, the size of the organization (in terms of revenue and/or number of employees, the two criteria used to distinguish Enterprise from SMB) does not always correlate with cluster size. Some of the largest Hadoop deployments belong to “small shops”, such as ad tech companies, digital marketing firms, and analytics departments, that don’t always have the highest numbers of employees.

At Pepperdata, we know that the universe of Hadoop users is growing; every day we encounter a company using Hadoop that we didn’t previously know existed. In an effort to understand this growing market more deeply (key use cases, the size of Hadoop environments, key challenges, and so on), we partnered with O’Reilly Media to survey their readership about how and why they are using Hadoop for their business operations. A total of 134 readers completed the entire survey, and the results were quite interesting.

The survey respondents came from a range of backgrounds, but all work at companies that are currently running Hadoop in production. The majority of respondents held software engineering/development, data scientist, or data architect job titles (25%, 17%, and 12% respectively). Forty percent were from the Information Technology industry, followed by Education and Financial Services (11% and 10%). Over 45% have been in production for two years or more, with 15% of those being “advanced users” (4 years or more in production).

Here are some interesting charts to show the size of the Hadoop environments that our survey respondents are directly involved with:

Pepperdata finds that most organizations just starting out have one production cluster and one test/dev cluster. Organizations that have not figured out how to reliably run multi-tenant environments commonly isolate workloads on separate clusters. To learn how Pepperdata helps with multitenancy, read our whitepaper.

Those with more nodes tend to have been running Hadoop in production longer. Of the companies that have been running Hadoop for a year or less, 84% have at most 50 nodes; by comparison, 75% of those that have been running it for more than one year have more than 50 nodes.

There is an interesting correlation between the types of workloads and the number of Hadoop clusters. Respondents who cited “streaming / real-time” as one of their workloads tended to have more clusters in production (46% had 4 or more clusters). Among respondents who did not have streaming or real-time workloads, only 20% had 4 or more clusters. The move to real time is adding cost and complexity to Hadoop deployments, because cluster isolation is used as a “best practice” to guarantee performance. Pepperdata’s Adaptive Hadoop Performance enables QoS and allows organizations to run real-time/streaming applications (e.g., Spark) alongside batch workloads (e.g., MapReduce) on a single cluster.

In terms of the workloads organizations are running, MapReduce leads the pack, with 70% of respondents currently running it in production. Spark and Hive are close on its heels with 65% and 57% respectively. You can see a breakdown of top workload types in the chart below.

Given the breakdown, it is clear that many organizations are running mixed workloads in production, which increases the risk of cluster chaos.

Perhaps the most interesting part of the survey for the Pepperdata team was seeing what the top challenges are in the eyes of Hadoop users. It most certainly reinforced the common threads we see amongst our customers and prospects: too much time spent troubleshooting, resource contention, and lack of visibility all came up as common challenges.

This list confirms that, with all the progress we have made, Hadoop is still hard. At Pepperdata we have been in the Hadoop game longer than most. Our founders were the first production users of Hadoop ever and are well versed in these customer challenges. This is why we started Pepperdata, and we help customers solve these challenges every day. Our Adaptive Performance Core helps with resource utilization by dynamically reshaping how applications use cluster hardware. This helps jobs complete on time, facilitates higher utilization of existing hardware resources, and guarantees QoS for Hadoop. In addition, because we monitor how hardware is used at a very granular level (CPU, RAM, disk, and network), our customers have seen troubleshooting times decrease by 90%.

The survey was a great validation that the market is not only growing but maturing at a very rapid pace. New processing engines running on Hadoop are driving new, real-time production use cases, and these bring their own performance challenges that must be managed to realize true operational value from Hadoop.

To learn more about how Pepperdata helps organizations address performance issues and guarantee QoS, register for our upcoming webinar with O’Reilly Media.