As our recent survey showed, Apache Spark is poised to remain the dominant large-scale big data processing platform. It is therefore imperative that Spark users learn and master Spark tuning if they want to get the most out of their Spark environments.
But what is tuning in Spark? How is it done? Read on to learn more about Spark tuning.
Spark performance tuning is the process of adjusting the configuration of the Spark environment so that all processes and resources are optimized and function smoothly. This Spark optimization process improves Spark performance while mitigating resource bottlenecks.
We are already seeing many big data workloads running on Spark, and it’s safe to assume that more applications and processes will be migrated to the Spark framework in the foreseeable future.
A majority of enterprises are running Spark, primarily on the Kubernetes framework. This is mainly because they want to improve the utilization of their Spark resources while bringing down their cloud costs. We wanted to take some time to dive a little deeper into one topic: Spark optimization through Spark tuning.
The Challenge of Apache Spark Performance Tuning
Spark developers have a lot of things to worry about when processing huge amounts of data: how to efficiently source the data, perform ETL (extract, transform, load) operations, and validate datasets at a very large scale. But while they’re making sure that the programs are free of bugs and maintained in all the necessary environments, they often overlook tasks such as tuning Spark application parameters for optimal performance.
When done properly, tuning Spark applications lowers resource costs while maintaining SLAs for critical processes, which is a concern for both on-premises and cloud environments. For on-premises Hadoop environments, clusters are typically shared by multiple apps (and their developers). If one person’s apps are resource hogs, they slow down everyone’s applications and risk a higher rate of task failures.
In recent Pepperdata research, we see this as a pain point in adopting new technology, including cloud computing and big data. Most respondents are very concerned about optimizing their compute resources, because more than 33% of companies (1 in 3) are spending 20% to 40% beyond their initial cloud budget. Simply put, organizations are failing to optimize their Spark resources, resulting in overspending.
In this blog post, we’ll discuss two Apache Spark optimization techniques:
- Sizing Spark executors and partitions. We’ll look at how sizing for executors and partitions is interrelated and the implications of incorrect (or nonoptimal) choices. We’ll also provide a heuristic that we’ve found to be effective for our own Spark workloads.
- Using Pepperdata Capacity Optimizer. Capacity Optimizer is the easiest and most practical Spark optimization solution for organizations with a large number of Spark applications. It ensures that resources are utilized to the maximum extent possible.
Before getting into the details, let’s review a few Spark terms and definitions:
A Spark application runs as one or more jobs, and each job is divided into stages. A stage is a step in the physical execution plan. It ends when a shuffle is required (a ShuffleMapStage) or when the stage writes its result and terminates as expected (a ResultStage).
Each stage is divided into tasks that are executed in parallel—one task per partition. Tasks are executed by the executors.
Executors are the workers that execute tasks. Resources (memory and CPU cores) are allocated to executors by the developer before runtime.
Partitions are logical chunks of data—specifically, chunks of a resilient distributed dataset (RDD)—which can be configured by the developer before runtime. The number of partitions in an RDD determines the number of tasks that will be executed in a stage. For each partition, a task (chunk of application code) is given to an executor to execute.
Because a Spark application can consist of many different types of stages, the configuration that’s optimal for one stage might be inappropriate for another stage. Therefore, Spark memory optimization techniques for Spark applications have to be performed stage by stage.
In addition to configuring stages, developers have control over the number of tasks in an application (parallelism), as well as the executor sizing for the application. What isn’t straightforward is how to pick the number of partitions and the size of the executors. We’ll cover that next.
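As a rough illustration of where these knobs live, the sketch below builds the `--conf` flags for `spark-submit` from a chosen executor size and partition count. The configuration keys are standard Spark properties; the function name `spark_submit_conf` and the example values are hypothetical and are not taken from this post.

```python
def spark_submit_conf(executor_memory_gb, executor_cores, num_executors, partitions):
    """Return spark-submit --conf flags for the sizing knobs discussed above.

    Hypothetical helper for illustration only.
    """
    settings = {
        # Heap size of each executor.
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Number of tasks one executor can run concurrently.
        "spark.executor.cores": str(executor_cores),
        # Number of executors (used when dynamic allocation is off).
        "spark.executor.instances": str(num_executors),
        # Default partition count for RDD shuffles and parallelize().
        "spark.default.parallelism": str(partitions),
        # Partition count for DataFrame/SQL shuffles.
        "spark.sql.shuffle.partitions": str(partitions),
    }
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Example values are placeholders, not recommendations.
print(" ".join(spark_submit_conf(8, 4, 10, 120)))
```

These settings apply to the whole application, so in practice they are a compromise across its stages.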
Executor and Partition Sizing
Executor and partition sizing are two of the most important factors that a developer has control over with Spark tuning. To understand how they are related to each other, we first need to understand how Spark executors use memory. Figure 2 shows the different regions of Spark executor memory.
We can see that there is a single parameter that controls the portion of executor memory reserved for both execution and storage: spark.memory.fraction. So if we want to store our RDDs in memory, we need our executors to be large enough to handle both storage and execution. Otherwise, we risk errors (in data or calculations, as well as task failures due to lack of resources) or long application runtimes.
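To make the memory regions concrete, here is a minimal sketch of the arithmetic, assuming Spark's unified memory manager with its documented defaults: roughly 300 MB of reserved heap, `spark.memory.fraction` of 0.6, and `spark.memory.storageFraction` of 0.5. The function name is hypothetical, and the exact defaults can differ between Spark versions.

```python
RESERVED_MB = 300  # heap that Spark sets aside before applying the fractions


def executor_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the size of each executor heap region in MB.

    Hypothetical helper; the defaults mirror spark.memory.fraction and
    spark.memory.storageFraction as documented for recent Spark versions.
    """
    usable = heap_mb - RESERVED_MB
    # Region shared by execution (shuffles, joins, sorts) and storage (cache).
    unified = usable * memory_fraction
    # Part of the unified region where cached blocks are immune to eviction
    # by execution; execution may borrow the rest when it is free.
    storage = unified * storage_fraction
    return {
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": unified - storage,
        # What remains is for user data structures and Spark internal metadata.
        "user_mb": usable - unified,
    }


for name, size in executor_memory_regions(4096).items():
    print(f"{name}: {size:.1f}")
```

With a 4 GB heap, only a little over half of the memory ends up in the shared execution and storage region, which is why caching RDDs pushes executor sizes up quickly.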
On the other hand, the larger the executor size, the fewer executors we can simultaneously run in the cluster. That is, large executor sizes frequently cause suboptimal execution speed due to a lack of task parallelism.
There’s also the problem of choosing the number of CPU cores for each executor, but the choices are limited. Typically, a value of 1 to 4 cores per executor will provide a good balance between achieving full write throughput and not overtaxing the ability of the HDFS client to manage concurrent threads.
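The trade-off between executor size and parallelism can be estimated with a few lines of arithmetic. The sketch below assumes the documented default for executor memory overhead (10% of executor memory, with a 384 MB minimum) and ignores memory reserved for the operating system and node services. The function names and the cluster shape in the example are hypothetical.

```python
def executor_container_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Executor heap plus off-heap overhead, as requested from the cluster manager."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


def task_slots(hosts, host_memory_mb, host_cores, executor_memory_mb, executor_cores):
    """Return (executors, concurrent task slots) that fit in the cluster."""
    per_host_by_memory = host_memory_mb // executor_container_mb(executor_memory_mb)
    per_host_by_cpu = host_cores // executor_cores
    # Whichever resource runs out first limits the executors per host.
    executors = hosts * min(per_host_by_memory, per_host_by_cpu)
    return executors, executors * executor_cores


# Hypothetical cluster: 10 hosts, each with 64 GB of memory and 16 cores.
print(task_slots(10, 65536, 16, 8192, 4))   # 8 GB executors
print(task_slots(10, 65536, 16, 32768, 4))  # 32 GB executors
```

In this hypothetical cluster, 8 GB executors leave room for 160 concurrent tasks, while 32 GB executors leave room for only 40, because memory rather than CPU becomes the limiting resource.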
How Do We Choose the Partition and Executor Sizes?
One of the best Spark memory optimization techniques when dealing with partitions and executors is to first choose the number of partitions, then pick an executor size to meet the memory requirements.
Choosing the Number of Partitions
Partitions control how many tasks will execute on the dataset for a particular stage. Under optimal conditions with little to no friction (network latency, host issues, and the overhead associated with task scheduling and distribution), setting the number of partitions equal to the number of available cores in the cluster would be ideal. In this case, all the tasks would start at the same time, and they would all finish at the same time, in a single step.
However, real environments are not optimal. When Spark tuning, we must consider that:
- Executors don’t finish the tasks at the same speed. Straggler tasks are tasks that take significantly longer than the rest of an app’s tasks to execute. To combat this, we should configure the number of partitions to be more than the number of available cores because we want the fast hosts to work on more tasks than the slow hosts work on.
- There is overhead associated with sending and scheduling each task. If we run too many tasks, the increased overhead takes a larger percentage of overall resources, and the result is a significant increase in app runtimes.
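Both effects can be seen in a toy simulation. The sketch below is a simplified model, not Spark's actual scheduler: it assumes equal-sized partitions, a fixed per-task overhead, and that each new task goes to whichever core becomes free first. All names and numbers are hypothetical.

```python
import heapq


def makespan(total_work, num_partitions, core_speeds, per_task_overhead):
    """Time until the last task finishes under greedy first-free-core scheduling."""
    work_per_task = total_work / num_partitions
    # Min-heap of (time at which the core becomes free, core index).
    free_at = [(0.0, i) for i in range(len(core_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_partitions):
        start, core = heapq.heappop(free_at)
        done = start + per_task_overhead + work_per_task / core_speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish


# Four cores, one of which runs at half speed (a straggler host).
speeds = [1.0, 1.0, 1.0, 0.5]
for partitions in (4, 12, 400):
    print(partitions, makespan(120, partitions, speeds, 0.5))
```

In this model, one partition per core finishes at 60.5 time units because everything waits on the slow core. Three partitions per core finishes at 42.0 because the fast cores pick up extra tasks. With 400 partitions the per-task overhead dominates and the run takes longer than either.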
When using Apache Spark optimization techniques, remember this rule of thumb: For large datasets—larger than the available memory on a single host in the cluster—always set the number of partitions to be 2 or 3 times the number of available cores in the cluster.
However, if the number of cores in the cluster is small and you have a huge dataset, choosing a number of partitions that makes each partition equal to the Hadoop block size (by default, 128 MB) has some advantages with regard to I/O speed.
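The two heuristics above can be sketched as a small helper. The switch between them is an assumption made for this example, since the text does not define how small a cluster or how huge a dataset must be: here the block-size rule is used whenever the core-based rule would produce partitions larger than one block. Function names are hypothetical.

```python
import math

DEFAULT_BLOCK_BYTES = 128 * 1024**2  # default Hadoop block size, 128 MB


def partitions_by_cores(total_cores, factor=3):
    """Rule of thumb: 2 or 3 partitions per available core."""
    return total_cores * factor


def partitions_by_block_size(dataset_bytes, block_bytes=DEFAULT_BLOCK_BYTES):
    """One partition per Hadoop block."""
    return max(1, math.ceil(dataset_bytes / block_bytes))


def suggest_partitions(dataset_bytes, total_cores, factor=3,
                       block_bytes=DEFAULT_BLOCK_BYTES):
    by_cores = partitions_by_cores(total_cores, factor)
    # Assumed threshold: if core-based partitions would exceed one block,
    # treat the cluster as small relative to the data and size by block.
    if dataset_bytes / by_cores > block_bytes:
        return partitions_by_block_size(dataset_bytes, block_bytes)
    return by_cores


print(suggest_partitions(1024**4, 40))        # 1 TiB on 40 cores
print(suggest_partitions(10 * 1024**3, 400))  # 10 GiB on 400 cores
```

Treat the result as a starting point for experimentation rather than a final setting.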
Choosing an Executor Size
As we’ve discussed, Spark tuning also involves giving your executors enough memory to handle both storage and execution. So when you choose your executor size, you should consider the partition size, the entire dataset size, and whether you will be caching the data in memory.
To ensure that tasks execute quickly, we need to avoid disk spills. Disk spills occur when we don’t give the executors enough memory. This forces Spark to “spill” some of the tasks’ data to disk during runtime.
In our experiments, we’ve found that a good choice for executor size is the smallest size that does not cause disk spills. We don’t want to pick too large a value because we would be using too few executors. Finding the right size that avoids disk spills requires some experimentation.
Figure 3 shows results from one of our experiments for a machine learning application:
We ran the same application multiple times, altering only the executor memory size. We kept the partition size at 256 MB and the number of executor cores at 4. We see that the tasks ran significantly faster when there were no disk spills. Doubling the memory size from 4 GB to 8 GB eliminated the disk spilling, and the tasks ran more than twice as fast. But we can also see that going from 8 GB to 10 GB didn’t affect the task duration. It’s not always as clear-cut as this, but based on our experience, choosing the minimum memory size that results in no disk spills is usually a good Spark tuning practice.
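The selection rule itself is simple to express. The sketch below assumes you have recorded, for each trial run, the executor memory and the total bytes spilled to disk (for example, from the stage metrics in the Spark UI). The spill figure used for the 4 GB run is a placeholder, not a measurement from the experiment described above.

```python
def smallest_size_without_spill(runs):
    """Pick the smallest executor memory (GB) whose run spilled nothing.

    runs: iterable of (executor_memory_gb, spilled_bytes) pairs.
    Returns None if every run spilled, meaning larger sizes should be tried.
    """
    no_spill = [memory_gb for memory_gb, spilled in runs if spilled == 0]
    return min(no_spill) if no_spill else None


# Placeholder observations: only the spill / no-spill pattern follows the text.
trial_runs = [(4, 2_500_000_000), (8, 0), (10, 0)]
print(smallest_size_without_spill(trial_runs))
```

If no trial avoids spilling, the helper returns `None`, which signals that the next experiment should use a larger executor.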
We’ve answered “What is tuning in Spark?” The next big question: “Is it really practical for all applications to be optimized?”
This is a crucial question. Check out part two of this blog post series to find out the answer.
You can also download our 2021 Kubernetes and Big Data Report for more information and rich insights into how enterprises are using Spark and Kubernetes to manage their big data.
Also, check out this video on Spark optimization for a more visual, in-depth demonstration.