What is Scalability in Cloud Computing

Download our 2021 Kubernetes & Big Data Report for more information and insights.

The past few years have seen a dramatic increase in companies using Spark on Kubernetes (K8s). This isn’t surprising, considering the benefits that K8s brings to the table. According to our recent survey, 77% of enterprises are adopting Kubernetes with the goal of improving their resource utilization and cutting down cloud expenses.

Deploying Spark jobs on Kubernetes is a growing trend, especially with ongoing widespread migration to the cloud. However, companies should know that using Spark this way has its downsides. Enterprises that run Spark on Kubernetes must be ready to address the challenges that come with this solution. Above all, they need to have observability into their infrastructure and be able to easily optimize multiple aspects of its performance.

Here at Pepperdata, with our most recent product update, we have just made observability and optimization for Spark workloads running on Kubernetes a reality.

Efficient Resource Utilization

Businesses are embracing both Spark and Kubernetes to improve the utilization of their cloud resources, streamline cloud processes, and shorten deployment time. According to our survey, nearly 30% of enterprises moved to Kubernetes so they can achieve efficient resource utilization. On top of that, over 17% of respondents said they adopted Kubernetes with the goal of accelerating their application deployment cycles.

Software engineers, developers, and IT specialists alike love Spark because of its ability to implement and perform computational tasks at a rate 100 times quicker than the MapReduce framework.

Now, if they run Spark on Kubernetes, containerization can speed up iteration cycles by as much as 10x, with reports of 5-minute dev workflows reduced to 30 seconds. Aside from this dramatic increase in speed, containerization using Kubernetes is not as resource intensive as hardware-level virtualization.

Big Cost Savings

In the world of big data, cloud computing, and pay-as-you-go pricing, every enterprise wants to cut down its costs. This is why enterprises are looking for ways to achieve efficient resource sharing and utilization.

Almost 30% of the IT leaders we surveyed are looking at Kubernetes to help reduce their cloud costs. Running Apache Spark on Kubernetes has proven to help enterprises reduce their cloud costs substantially via complete isolation and powerful resource sharing.

How does this cost reduction happen? Users can deploy all their apps in one Kubernetes cluster. As an application finishes, Kubernetes can quickly tear down its containers and reallocate the resources to another application. The whole reallocation process takes only about 10 seconds.

So how much savings are we talking about? In one case, a company saved 65% of their cloud costs after switching to a Spark on Kubernetes model from YARN.

The Problems with Spark on Kubernetes

Kubernetes is increasingly important for a unified IT infrastructure, and Spark is the number one big data application moving to the cloud. However, Spark applications tend to be quite inefficient.

These inefficiencies manifest in a variety of ways. We asked respondents what their main challenges were while running Spark on Kubernetes. Below are their three biggest stumbling blocks:

Initial Deployment

There are a number of Spark challenges that hinder successful implementation. Our survey ranks initial deployment as the biggest challenge faced by our respondents when adopting Apache Spark on Kubernetes. The technology is complex. For those who are unfamiliar with the platform, the framework, language, tools, etc., can be daunting.

Building a reliable Spark on Kubernetes infrastructure at scale requires substantial expertise with the technology. Even those with considerable Kubernetes knowledge recognize that there are components to build prior to deployment, such as clusters, node pools, the spark-operator, the K8s autoscaler, Docker registries, and more.
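To make the deployment hurdle concrete, here is a minimal sketch of a SparkApplication manifest for the spark-operator. The image name, registry, and jar path are hypothetical placeholders, and the manifest assumes the operator is already installed in the cluster:

```yaml
# Minimal spark-operator SparkApplication (sketch; names are placeholders)
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "my-registry.example.com/spark:3.1.1"  # hypothetical registry/image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark  # needs RBAC permissions to launch executor pods
  executor:
    instances: 2
    cores: 1
    memory: "1g"
```

Even a simple job like this presupposes a running operator, a service account with the right RBAC permissions, and a Spark image accessible from a registry, which illustrates why initial deployment tops the list of challenges.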


Moving to Kubernetes can put your enterprise in an advantageous position, much like moving to the cloud or adopting big data. But that is only possible if you have a sound strategy prior to migration. One of the many reasons a Kubernetes migration can be difficult, or fail outright, is that leaders adopt the technology without a well-defined reason for such a big move.

Plus, many enterprises don’t have the prerequisite skills within their organization. Switching to a Spark on Kubernetes architecture requires substantial talent to make the transition smooth and successful. Ignoring or failing to recognize underlying issues with the applications or infrastructure, such as scaling or reliability, can compound migration challenges.

Monitoring and Alerting

Kubernetes is a complex technology, and monitoring it can be difficult. Choosing the right tool to monitor and assess your Kubernetes implementation adds to this complexity. According to our survey, 28% of respondents either use a manual approach or a homegrown solution, while 27% leverage application performance monitoring (APM) software.

Performance monitoring and optimization have grown beyond human capabilities. Generic APMs are often not designed or configured to handle large Spark on Kubernetes workloads and other big data infrastructures. Effective monitoring now requires comprehensive, robust tools purpose-built for big data workloads on Kubernetes.

Pepperdata: Powerful Observability and Optimization for Spark on Kubernetes

Running Spark on Kubernetes is an effective approach to achieving accelerated product cycles and continuous operations. However, its implementation and management can be complicated.

Luckily, our recent product update is here to help.

Pepperdata gives enterprises full-stack observability for Spark on Kubernetes. For developers, this enables manual tuning of their applications as well as autonomous optimization at runtime. By combining manual and autonomous tuning, developers can find and deliver the best price/performance for these applications.

Pepperdata autonomous optimization and full-stack observability, along with machine learning, give users a clear and comprehensive picture of their cloud environment, including containers, clusters, pods, nodes, and workflows. Developers enjoy speed and scalability that neither traditional APMs nor manual tuning can provide.

With Pepperdata, all resources within the Kubernetes framework are automatically optimized. At the same time, the platform gives users a correlated understanding of applications and infrastructure, down to the granular level. Unhindered observability gives users access to the actionable data needed for debugging and understanding complex processes, while autonomous optimization guarantees efficient resource utilization.

Our Spark on Kubernetes optimization solution includes powerful features, such as:

  • Autonomous optimization of resources and workloads on Amazon EKS, HPE Ezmeral, and Red Hat OpenShift
  • Application and infrastructure observability for Spark on EKS, Ezmeral, OpenShift, and YARN
  • A self-service dashboard so developers can manually tune using recommendations for speed or resource utilization
  • Detailed usage attribution for chargeback

Download our 2021 Kubernetes & Big Data Report for more information and insights.