Modern Hadoop and Spark environments are busy places. Multiple applications, run by multiple users with wildly different workloads (Hive queries, for instance, cheek-by-jowl with long MapReduce jobs), contend for the same resources. And users are noticing the problems that result from contention: companies spend big bucks on hardware or on virtual machines (VMs) in the cloud, and still don’t get results in the time they need.
Luckily, you can solve this without throwing more and more money at the problem and overprovisioning hardware resources. Instead, you can aim for Quality of Service (QoS) in mixed-workload, multitenant Hadoop and Spark environments. Throughout this report, I will use the term distributed processing to refer to modern Big Data analysis tools such as Hadoop, Spark, and Hive. It’s a very general term that covers long-running jobs such as MapReduce, fast-running in-memory Spark jobs that are often called “real-time,” and other tools in the Hadoop universe.