Download our 2021 Kubernetes & Big Data Report for more information and insights.

The past few years have seen a dramatic increase in companies using Spark on Kubernetes (K8s). This isn’t surprising, considering the benefits that K8s brings to the table. According to our recent survey, 77% of enterprises are adopting Kubernetes technology with the goal of improving their resource utilization and cutting down cloud expenses.

Deploying Spark this way is a growing trend, especially with the ongoing widespread migration to the cloud. However, companies should know that the approach has its downsides. Enterprises that run Spark applications on Kubernetes must be ready to address the challenges that come with it. Above all, they need observability into their infrastructure and the ability to easily optimize multiple aspects of its performance.

What is Spark on Kubernetes?

Spark is an open-source analytics engine designed to process large volumes of data. It gives users a unified interface for programming whole clusters with built-in data parallelism and fault tolerance. Kubernetes is an open-source container-orchestration platform that automates the deployment, scaling, and management of containerized applications.

Think of Spark on K8s this way: Spark provides the computing framework, while Kubernetes manages the cluster. Kubernetes provides users with a sort of operating system for managing multiple clusters. As a result, this technology delivers superior flexibility in cluster use and allocation, which then translates to massive cost savings.


Should You Run Spark on K8s?

We recommend running Spark on K8s because it’s the more logical and practical approach when compared to running Spark on YARN. For one, Kubernetes doesn’t have YARN’s limitations. YARN’s clusters are complex and consume more compute resources than needed for a job. Plus, users need to create and tear down clusters for every job in YARN. Not only does this setup waste a lot of resources, which translates to more costs, but it also results in inefficient task management.

On the other hand, Kubernetes has exploded in the big data scene and touched almost every enterprise technology, including Spark. With its growing prevalence and ubiquity, as well as the rapid expansion of its user community, Kubernetes is set to replace YARN as the world’s main resource manager for big data processing.

Spark on Kubernetes vs. YARN

Is Spark on Kubernetes replacing YARN? Recent trends suggest we are heading in that direction.

Various Spark users are pointing out the advantages of running Spark jobs on Kubernetes over YARN. For one, it’s not difficult to deploy Spark applications into an organization’s existing Kubernetes infrastructure. This results in the fast and seamless alignment of the efforts and goals across multiple software delivery teams.

Second, the most recent Spark version (3.2) has resolved previous performance and reliability issues with Kubernetes. Using Kubernetes to manage Spark jobs leads to better performance and cost savings, surpassing what YARN delivers. Multiple test runs conducted by Amazon revealed a 5% time saving when using Kubernetes instead of YARN.

Other benefits of running Spark on K8s are now emerging. But the biggest reason why enterprises should adopt Kubernetes? Enterprises and cloud vendors have expressed support for the framework via the CNCF (Cloud Native Computing Foundation). Spark on Kubernetes is simply the future of big data analytics.

Does Spark Run on Kubernetes?

Yes. When you run Spark on Kubernetes, Spark creates a Spark driver that runs inside a Kubernetes pod. The driver then creates executors, which also run inside Kubernetes pods, connects to them, and executes the application code.

Once the application completes, the executor pods terminate and are cleaned up. The driver pod, however, keeps its logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.
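To make the driver-and-executor flow above more concrete, here is a minimal Python sketch that assembles a spark-submit command for Kubernetes cluster mode. The flags and configuration keys follow the Spark documentation, but the API server address, container image, and jar path are hypothetical placeholders, and the helper name build_spark_submit is ours rather than part of Spark.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, main_class,
                       executors=2, app_name="spark-pi"):
    """Return the argument list for a Kubernetes cluster-mode submission.

    Hypothetical helper for illustration; it only builds the command
    and does not contact a cluster.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to talk to a Kubernetes API server.
        "--master", f"k8s://{api_server}",
        # Cluster mode: the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--class", main_class,
        # Number of executor pods the driver will request.
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# All values below are placeholders, not real endpoints or images.
cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",
    image="registry.example.com/spark:3.2.0",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executors=3,
)
print(shlex.join(cmd))
```

After a run like this, the driver pod would be expected to stay visible in a completed state until it is deleted, which is the behavior described above.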

Efficient Resource Utilization

Businesses are embracing both Spark and Kubernetes to improve the utilization of their cloud resources, streamline cloud processes, and shorten deployment time. According to our survey, nearly 30% of enterprises moved to Kubernetes to achieve efficient resource utilization. On top of that, over 17% of respondents said they adopted Kubernetes to accelerate their application deployment cycles.

Software engineers, developers, and IT specialists alike love Spark because it can run computational tasks up to 100 times faster than the MapReduce framework.

Now, if they run Spark on K8s, iteration cycles speed up by up to 10 times due to containerization, with reports of five-minute dev workflows reduced to 30 seconds. Beyond this dramatic increase in speed, containerization with Kubernetes is less resource intensive than hardware-level virtualization.

Big Cost Savings

In the world of big data, cloud computing, and pay-as-you-go pricing, every enterprise wants to cut down their costs. This is why enterprises are looking for ways to achieve efficient resource sharing and utilization.

Almost 30% of the IT leaders we surveyed are looking at Kubernetes to help reduce their cloud costs. Running Apache Spark on K8s has proven to help enterprises reduce their cloud costs substantially via complete isolation and powerful resource sharing.

How does this cost reduction happen? Users can deploy all their apps in one Kubernetes cluster. As an application finishes, Kubernetes can quickly tear down its containers and reallocate the resources to another application. Moving resources elsewhere this way takes only about 10 seconds.

So how much in savings are we talking about? In one case, a company cut its cloud costs by 65% after switching from YARN to a Spark on Kubernetes model.

The Problems with Spark on Kubernetes

Kubernetes is increasingly important for a unified IT infrastructure, and Spark is the number one big data application moving to the cloud. However, Spark applications tend to be quite inefficient.

These inefficiencies manifest in a variety of ways. We asked respondents what their main challenges were while using Kubernetes to run Spark apps. Below are their three biggest stumbling blocks:

Initial Deployment

There are a number of Spark challenges that hinder successful implementation. Our survey ranks initial deployment as the biggest challenge faced by our respondents when running Spark applications on Kubernetes. The technology is complex. For those who are unfamiliar with the platform, the framework, language, tools, etc., can be daunting.

Running Spark apps on a Kubernetes infrastructure at scale requires substantial expertise in the technology. Even those with considerable Kubernetes knowledge recognize that there are parts to build prior to deployment, such as clusters, node pools, the spark-operator, the K8s autoscaler, Docker registries, and more.
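As a rough illustration of one of those parts, the Python sketch below builds a SparkApplication manifest of the kind the spark-operator consumes. The apiVersion, kind, and field names follow the operator's v1beta2 custom resource as we understand it; the application name, image, jar path, service account, and resource sizes are hypothetical placeholders, and the helper spark_application is ours.

```python
import json


def spark_application(name, image, main_class, app_file, executors=2):
    """Build a SparkApplication manifest as a plain dict.

    Hypothetical helper for illustration; values such as the
    namespace, service account, and memory sizes are assumptions.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": "3.2.0",
            "restartPolicy": {"type": "Never"},
            # The driver pod needs a service account allowed to create pods.
            "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
            # Executor pods requested by the driver.
            "executor": {"cores": 1, "instances": executors, "memory": "512m"},
        },
    }


# Placeholder values, not a real image or registry.
manifest = spark_application(
    name="spark-pi",
    image="registry.example.com/spark:3.2.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_file="local:///opt/spark/examples/jars/spark-examples.jar",
    executors=3,
)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, output like this could in principle be applied to a cluster where the operator is already installed. Everything else in the list above (cluster, node pools, autoscaler, registry) still has to exist first.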


Migration

Moving to Kubernetes can put your enterprise in an advantageous position, much like moving to the cloud or adopting big data. But that is only possible if you have a sound strategy in place before migration. One of the many reasons migration to Kubernetes can be difficult, or fail outright, is that leaders decide to adopt the technology without a well-defined reason for such a big move.

Plus, many enterprises don’t have the prerequisite skills within their organization. Switching to a Spark on K8s architecture requires substantial talent to make the transition smooth and successful in the end. Ignoring or failing to recognize underlying issues with the applications or infrastructure, such as scaling or reliability, can cause migration challenges.

Monitoring and Alerting

Kubernetes is a complex technology, and monitoring can be difficult. Choosing the right tool to monitor and assess your Kubernetes implementation adds to this complexity. According to our survey, 28% of respondents use either a manual approach or a homegrown solution, while 27% leverage application performance monitoring (APM) software.

Performance monitoring and optimization have grown beyond human capabilities. Generic APMs are often not designed or configured to handle big Spark on K8s workloads and other big data infrastructures. Effective monitoring now requires comprehensive, robust tools that are purpose-built for big data workloads on Kubernetes.

Pepperdata: Powerful Observability and Optimization for Spark on Kubernetes

Running Spark on Kubernetes is an effective approach to achieving accelerated product cycles and continuous operations. However, its implementation and management can be complicated.

Luckily, Pepperdata can help.

Pepperdata gives enterprises full-stack observability for Spark apps and workloads running on Kubernetes. For developers, this enables the manual tuning of their applications as well as simultaneous and autonomous optimization at run time. By combining manual and autonomous tuning, developers are able to find and deliver the best price/performance for these applications.

Pepperdata autonomous optimization and full-stack observability, along with machine learning, give users a clear and comprehensive picture of their cloud environment. This includes containers, clusters, pods, nodes, and workflows. Developers enjoy superior speed and scalability that neither traditional APMs nor manual tuning can provide.

Pepperdata dashboard

With Pepperdata, all resources within the Kubernetes framework are automatically optimized. At the same time, the platform gives users a correlated understanding of applications and infrastructure, down to the granular level. Unhindered observability gives users access to the actionable data needed for debugging and understanding complex processes, while autonomous optimization guarantees efficient resource utilization.

Our Spark on K8s optimization solution includes powerful features, such as:

  • The autonomous optimization of resources and workloads on Amazon EKS, HPE Ezmeral, and Red Hat OpenShift
  • Application and infrastructure observability for Spark on EKS, Ezmeral, OpenShift, and YARN
  • A self-service dashboard so developers can manually tune using recommendations for speed or resource utilization
  • Detailed usage attribution for chargeback

Take a free 30-day trial to see what Big Data success looks like

Pepperdata products provide complete visibility and automation for your big data environment. Get the observability, automated tuning, recommendations, and alerting you need to efficiently and autonomously optimize big data environments at scale.