Read our Autonomous FinOps for Kubernetes datasheet for more information and insights.

Apache Spark vs Kubernetes? Or both? The past few years have seen a dramatic increase in companies using Spark on Kubernetes (K8s). This isn’t surprising, considering the benefits that K8s brings to the table. According to our recent survey, 77% of enterprises are adopting Kubernetes technology to improve resource utilization and cut cloud expenses.

The number of companies that deploy Spark on Kubernetes is growing, fueled by widespread migration to the cloud. However, companies should know that running Spark on Kubernetes has its downsides. Enterprises that run Spark on Kubernetes must be ready to address the challenges that come with it. Above all, they need observability into their infrastructure and the ability to easily optimize multiple aspects of its performance.

What is Spark on Kubernetes?

Spark is an open-source analytics engine designed to process large volumes of data. It gives users a unified interface for programming whole clusters using data parallelism and fault tolerance. Kubernetes is an open-source container-orchestration platform that automates computer application deployment, scaling, and management.

Think of Apache Spark on Kubernetes this way: Spark provides the computing framework, while Kubernetes manages the cluster. Kubernetes provides users with a sort of operating system for managing multiple clusters. As a result, this technology delivers superior cluster use and allocation flexibility, which translates to massive cost savings.

Should You Run Spark on K8s?

There’s no Spark vs Kubernetes debate. We recommend that enterprises deploy Spark on Kubernetes because it’s a more logical and practical approach than running Spark on YARN. For one, Spark K8s environments don’t have YARN’s limitations. YARN’s clusters are complex and consume more compute resources than a job needs. Plus, users need to create and tear down clusters for every job in YARN. Not only does this setup waste a lot of resources, which translates to higher costs, but it also results in inefficient task management.

On the other hand, Kubernetes has exploded in the big data scene and touched almost every enterprise technology, including Spark. With its growing prevalence and ubiquity, as well as the rapid expansion of its user community, Kubernetes is set to replace YARN as the world’s main resource manager for big data processing.

Spark on Kubernetes vs. YARN

Is there merit to the Spark on Kubernetes vs. YARN argument? Recent trends suggest the industry is heading toward Kubernetes.

Various Spark users are pointing out the advantages of running Spark jobs on Kubernetes over YARN. For one, it’s not difficult to deploy Spark applications into an organization’s existing Kubernetes infrastructure. This results in the fast and seamless alignment of the efforts and goals across multiple software delivery teams.

Second, the most recent Spark version (3.2) has resolved previous performance and reliability issues with Kubernetes. Using Kubernetes to manage Spark jobs leads to better performance and cost savings, surpassing what YARN delivers. Multiple test runs conducted by Amazon revealed a 5% time saving when using Kubernetes instead of YARN.

Other benefits of running Spark on K8s are now emerging. But the biggest reason why enterprises should adopt Kubernetes? Enterprises and cloud vendors have expressed support for the framework via the CNCF (Cloud Native Computing Foundation). Spark on Kubernetes is simply the future of big data analytics.

Does Spark Run on Kubernetes?

Yes. When you run Spark on Kubernetes, Spark creates a Spark driver that runs within a Kubernetes pod. The driver then creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.

Once the application is completed, the executor pods terminate and are cleaned up. The driver pod, however, persists its logs and remains in a “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
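To make the driver and executor flow more concrete, here is a minimal sketch in Python that assembles a cluster-mode spark-submit command for Kubernetes. The API server address, container image, and jar path are placeholder assumptions, not values from this article, and the helper name build_spark_submit is hypothetical. The flags themselves are standard spark-submit options for the Kubernetes scheduler.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, main_class,
                       executors=2, name="spark-pi"):
    """Return the argv list for a cluster-mode spark-submit on Kubernetes.

    All argument values are supplied by the caller; nothing here is
    specific to any one cluster.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Number of executor pods the driver will request.
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Placeholder values (assumptions for illustration only).
cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",
    image="registry.example.com/spark:latest",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
    executors=3,
)
print(shlex.join(cmd))
```

After a job submitted this way finishes, running kubectl get pods in the job's namespace should show the driver pod in a completed state, and kubectl logs with the driver pod name retrieves its logs until the pod is cleaned up.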

Efficient Resource Utilization

Businesses are embracing a Spark on Kubernetes set-up to improve the utilization of their cloud resources. Dynamic allocation of resources in Spark on Kubernetes helps streamline cloud processes and shorten deployment time. According to our survey, nearly 30% of enterprises moved to Kubernetes to achieve efficient resource utilization. On top of that, over 17% of respondents said they adopted Spark on Kubernetes to accelerate their application deployment cycles.

Software engineers, developers, and IT specialists alike love Spark because it can perform computational tasks up to 100 times faster than the MapReduce framework.

Now, with a Spark on Kubernetes approach, iteration cycles can speed up by as much as 10 times thanks to containerization, with reports of five-minute dev workflows reduced to 30 seconds. Dynamic allocation in Spark on Kubernetes results in a dramatic increase in processing speed, and containerization using Kubernetes is also less resource intensive than hardware-level virtualization.
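As a rough illustration of the dynamic allocation mentioned above, the following Python sketch builds the Spark properties typically involved. The helper names and the executor bounds are assumptions for illustration. Shuffle tracking is included on the assumption that no external shuffle service is available in the cluster, which is the usual situation on Kubernetes; check the Spark documentation for your version before relying on these settings.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that enable dynamic allocation.

    The bounds are chosen by the caller; the values used below are
    illustrative only.
    """
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Tracks shuffle files on executors so they can be released
        # safely without an external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_conf_flags(conf):
    """Turn a property dict into repeated --conf key=value arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = dynamic_allocation_conf(min_executors=1, max_executors=10)
flags = to_conf_flags(conf)
print(" ".join(flags))
```

The resulting flags can be appended to a spark-submit command so that the driver requests and releases executor pods as the workload changes.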

Easy, Centralized Administration

It’s quite common for Spark on Kubernetes projects to mix the different elements available for data handling and Spark job orchestration. These can include various back ends and Spark SQL data storage, to mention a few. Running these Spark components as workloads in the same Kubernetes cluster results in better performance, because Kubernetes ensures each workload has adequate resources and connectivity to all its dependencies, whether in the same cluster or outside of it.

Big Cost Savings

In the world of big data, cloud computing, and pay-as-you-go pricing, every enterprise wants to cut down their costs. That is why many enterprises are looking to K8s Spark for efficient resource sharing and utilization.

Almost 30% of the IT leaders we surveyed are looking at Kubernetes to help reduce their cloud costs. Running Apache Spark on K8s has proven to help enterprises reduce their cloud costs substantially via complete isolation and powerful resource sharing.

How does this cost reduction happen? Users can deploy all their apps in one Kubernetes cluster. As an application finishes, Kubernetes can quickly tear down its containers and reallocate the resources to another application. Users can also optimize their Spark on Kubernetes configuration based on Spark metrics and other performance benchmarks. The whole process of moving resources elsewhere takes only 10 seconds.

So how much savings are we talking about? In one case, a company slashed 65% of its cloud costs after switching to a Spark on Kubernetes model from YARN.

The Problems with Spark on Kubernetes

Kubernetes is increasingly important for a unified IT infrastructure, and Spark is the number one big data application moving to the cloud. However, Spark applications tend to be quite inefficient.

These inefficiencies manifest in a variety of ways. We asked respondents what their main challenges were while using a Kubernetes Spark setup for their processes. Below are their three biggest stumbling blocks:

Initial Deployment

Several Spark challenges hinder successful implementation. Our survey ranks initial deployment as the biggest challenge faced by our respondents when running Spark on Kubernetes. The technology is complex. For those unfamiliar with the Kubernetes Spark platform, the framework, language, tools, etc., can be daunting.

Running Spark apps on a Kubernetes infrastructure at scale needs substantial expertise around the technology. Even those with considerable Kubernetes knowledge recognize that there are parts to build prior to deployment, such as clusters, node pools, the spark-operator, the K8s autoscaler, docker registries, and more.
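For a sense of what one of these pieces looks like in practice, here is a hedged Python sketch that renders a SparkApplication manifest of the kind consumed by the open-source spark-operator. The field names follow the operator's v1beta2 schema as commonly documented. The image, namespace, service account, jar path, and Spark version are placeholder assumptions, so verify everything against the operator version you actually install.

```python
import json


def spark_application_manifest(name, image, main_class, app_file,
                               executors=2, namespace="default"):
    """Build a SparkApplication resource as a plain dict.

    Every concrete value passed in below is a placeholder for
    illustration, not a recommendation.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": "3.2.0",  # assumed version
            "restartPolicy": {"type": "Never"},
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": "spark",  # assumed service account
            },
            "executor": {
                "cores": 1,
                "instances": executors,
                "memory": "512m",
            },
        },
    }


manifest = spark_application_manifest(
    name="spark-pi",
    image="registry.example.com/spark:latest",
    main_class="org.apache.spark.examples.SparkPi",
    app_file="local:///opt/spark/examples/jars/spark-examples.jar",
)
# JSON is valid YAML, so this output can be saved to a file and
# applied with kubectl once the operator is installed.
rendered = json.dumps(manifest, indent=2)
print(rendered)
```

Even this small manifest presupposes a container registry, a namespace, a service account with the right permissions, and an installed operator, which is why initial deployment tends to demand real expertise.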

Migration

Moving to Kubernetes can put your enterprise in an advantageous position, much like moving to the cloud or adopting big data. But that is only possible if you have a sound strategy prior to migration. One of the many reasons why migration to Kubernetes can be difficult, or fail outright, is that leaders decide to adopt the technology without a well-defined reason for such a big move.

Plus, many enterprises don’t have the prerequisite skills within their organization. Switching to a Spark on K8s architecture requires considerable talent to make the transition smooth and successful. Ignoring or failing to recognize underlying issues with the applications or infrastructure, such as scaling or reliability, can cause migration challenges.

Monitoring and Alerting

Kubernetes is a complex technology, and monitoring can be difficult. More so when you combine Spark with Kubernetes. Choosing the right tool to monitor and assess your Kubernetes implementation adds to this complexity. According to our survey, 28% of respondents either use a manual approach or a homegrown solution, while 27% leverage application performance monitoring (APM) software.

Performance monitoring and optimization have grown beyond human capabilities. Generic APMs are often not designed or configured to handle big Spark on K8s workloads and other big data infrastructures. Effective and powerful Kubernetes and Spark monitoring now requires comprehensive and robust tools that are purposely designed for big data workloads on Kubernetes.

Pepperdata: Powerful Observability and Optimization for Spark on Kubernetes

The whole Spark vs Kubernetes argument has taken a back seat as the advantages of a Spark Kubernetes set-up become more obvious. Running Spark on Kubernetes effectively achieves accelerated product cycles and continuous operations.

The expansion of data science and machine learning technologies has accelerated the adoption of containerization. This development effectively drives the Spark on Kubernetes approach as a preferred set-up for data clustering and modeling ecosystems. Spark on Kubernetes gives users the ability to abstract elastic GPUs and CPUs, along with on-demand scalability.

Pepperdata gives enterprises full-stack observability for Spark apps and workloads running on Kubernetes. For developers, this enables the manual tuning of their applications as well as simultaneous and autonomous optimization at run time. By combining manual and autonomous tuning, developers can find and deliver the best price/performance for these applications.

Pepperdata’s autonomous Kubernetes performance optimization is built on full-stack observability, comprehensive Spark monitoring, and machine learning. This gives users a clear and comprehensive picture of their cloud environment, including Spark metrics, containers, clusters, pods, nodes, and workflows. Developers enjoy superior speed and scalability that neither traditional APMs nor manual tuning can provide.

Figure: Pepperdata dashboard for Spark on Kubernetes

With Pepperdata, all resources within the Kubernetes framework are automatically optimized. At the same time, the platform gives users a correlated understanding of applications and infrastructure down to the granular level. Unhindered observability offers users access to the actionable data needed for debugging and understanding complex processes, while autonomous optimization guarantees efficient resource utilization. All of this results in effective Kubernetes performance tuning and output.

Our Spark on Kubernetes performance tuning and optimization solution includes powerful features, such as:

  • The autonomous optimization of resources and workloads on Amazon EKS, HPE Ezmeral, and Red Hat OpenShift
  • Application and infrastructure observability for Spark on EKS, Ezmeral, OpenShift, and YARN
  • A self-service dashboard so developers can manually tune using recommendations for speed or resource utilization
  • Detailed usage attribution for chargeback

Interested in how you and your team can recover resource waste and control costs without constant provisioning? Learn how Pepperdata’s Autonomous FinOps platform optimizes your data stack so you can focus on innovation rather than fine-tuning.

Explore More

Looking for a safe, proven method to reduce waste and cost by up to 47% and maximize value for your cloud environment? Sign up now for a free waste assessment to see how Pepperdata Capacity Optimizer Next Gen can help you start saving immediately.