In this blog series we’ve been examining the Five Myths of Apache Spark Optimization. The fourth myth we’re considering relates to a common misunderstanding held by many Spark practitioners: Spark application tuning can eliminate all of the waste in my applications. Let’s dive into it.
Manually Tuning Spark Applications
Manual tuning refers to a developer’s ability to turn the knobs that control the CPU, memory, and other resources allocated to an application. The resource requirements of an application, and especially of a Spark application, typically vary over time, sometimes by a great amount. There is a peak period, when resource requirements are at their greatest, and an off-peak period.
In practice, developers almost always size their applications to this peak, or even above it. This ensures that the application has enough resources and will not fail. However, the peak period often represents a small fraction of the overall time that an application runs. Most applications run well below this peak allocation.
Figure 1: Developers are required to allocate memory and CPU for each of their Spark applications. To prevent their applications from being killed due to insufficient resources, developers typically set the resource request level to accommodate peak usage requirements.
Adjusting these resource request knobs for CPU and memory can help reduce the resources and costs required to run Spark applications by moving the provisioning line as close to peak as possible. This allows developers who tune manually to reclaim some of the waste due to applications running well below peak.
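To make this concrete, those knobs correspond to ordinary Spark configuration properties. The sketch below shows a hypothetical peak-sized PySpark profile; the application name and the values are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

# Hypothetical peak-sized profile: every value here is an illustrative assumption.
spark = (
    SparkSession.builder
    .appName("nightly_etl")                       # hypothetical job name
    .config("spark.executor.instances", "20")     # sized for the busiest part of the run
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate()
)
```

Manual tuning amounts to lowering numbers like these, rerunning the job, and confirming that it still completes reliably.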
When a Static Provisioning Level Meets a Dynamic Application: Waste is Not Eliminated
It quickly becomes obvious, however, that application tuning does not eliminate waste whenever the application’s resource utilization is below its peak. During off-peak periods, the application does not need all of the resources provisioned for it, yet nothing can be done to reclaim them, because the provisioning level is a static setting in Spark. That static level simply cannot track the changes that inevitably occur as the application or its data characteristics evolve. The off-peak period is often very long, on the order of hours, and it commonly represents a large fraction of the application’s total run time. As a result, organizations typically waste 30 percent or more of their resources when running data-intensive workloads on Spark.
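A rough back-of-the-envelope calculation shows why the off-peak period dominates the waste. The numbers below are illustrative assumptions, not measurements:

```python
# Illustrative estimate of off-peak waste for a statically provisioned Spark job.
# All inputs are assumptions chosen to show the shape of the problem.

provisioned_cores = 80      # e.g. 20 executors x 4 cores, sized for peak
run_hours = 6.0             # total wall-clock time of the job
peak_hours = 0.5            # portion of the run that actually needs the peak
offpeak_avg_cores = 30      # average cores actually used during off-peak

used_core_hours = peak_hours * provisioned_cores + (run_hours - peak_hours) * offpeak_avg_cores
provisioned_core_hours = run_hours * provisioned_cores
waste_pct = 100 * (1 - used_core_hours / provisioned_core_hours)

print(f"Provisioned: {provisioned_core_hours:.0f} core-hours")
print(f"Used:        {used_core_hours:.0f} core-hours")
print(f"Waste:       {waste_pct:.0f}%")   # roughly 57% with these assumed numbers
```

Even with the provisioning line tuned close to the true peak, the gap between peak and off-peak usage remains, and with it the waste.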
As we saw in the previous blog, Myth 3: Instance Rightsizing, a modern application’s CPU and memory requirements may change dramatically while the application is running. The instance type that was chosen for an application prior to its run may not be the optimal instance type by the end of the run.
A Special Challenge: Tuning Infrequent Applications
Applications that run infrequently present a special challenge. Some applications may run only once a week, or maybe only once a month. A developer might not be inclined to invest the effort to develop a custom provisioning profile for such an application.
Instead, the developer might simply provision that application using the same configurations selected for other, more frequently run applications. As a result, the CPU and memory provisioned for an infrequent application may be a poor fit for it, which can lead to arbitrary amounts of overprovisioning and waste.
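In practice this often looks like a shared helper that applies one organization-wide profile to every job, regardless of what each job actually needs. The sketch below is hypothetical; the profile values and job names are assumptions:

```python
from pyspark.sql import SparkSession

# Hypothetical one-size-fits-all profile reused across jobs of very different sizes.
SHARED_SPARK_CONF = {
    "spark.executor.instances": "20",
    "spark.executor.cores": "4",
    "spark.executor.memory": "16g",
}

def build_session(app_name: str) -> SparkSession:
    """Build a SparkSession from the shared profile, ignoring the job's actual needs."""
    builder = SparkSession.builder.appName(app_name)
    for key, value in SHARED_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# A small monthly report inherits the same peak-sized profile as the heavy daily pipelines.
spark = build_session("monthly_summary_report")
```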
Another Challenge: Eliminating Waste in Applications with Varying Resource Requirements
Another scenario involves applications whose resource requirements vary by day. Consider an application that is extremely efficient two days out of the week but relatively inefficient the other five. Some cost-conscious developers might choose to write two different applications to accommodate this behavior: one with configurations tuned for the efficient days and a second with configurations tuned for the other days, as sketched below. In this way, each application is optimized for its days’ requirements.
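Expressed in code, that two-application approach amounts to picking a provisioning profile by day of week. Everything in this sketch, including the day split and the profile values, is a hypothetical illustration:

```python
import datetime
from pyspark.sql import SparkSession

# Hypothetical split: the workload is assumed light on weekends, heavy on weekdays.
EFFICIENT_DAYS = {"Saturday", "Sunday"}

LIGHT_PROFILE = {"spark.executor.instances": "5",  "spark.executor.memory": "8g"}
HEAVY_PROFILE = {"spark.executor.instances": "20", "spark.executor.memory": "16g"}

# Select the profile for today's run.
today = datetime.date.today().strftime("%A")
profile = LIGHT_PROFILE if today in EFFICIENT_DAYS else HEAVY_PROFILE

builder = SparkSession.builder.appName("daily_pipeline")  # hypothetical job name
for key, value in profile.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```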
Although this practice would help optimize resources, it’s labor intensive, and few developers would be excited to take on the project. Writing and managing multiple applications in this way essentially doubles (or more) the developer’s work. As a result, most developers simply write one application and provision it with enough resources to cover the worst-case scenario.
These simple examples illustrate how cumbersome manual solutions to application tuning can be—they simply do not scale.
The Opportunity Cost of Manual Application Tuning
Manually tuning any application also comes with an additional drawback: the significant opportunity cost it represents. As discussed in Myth 1, Observability & Monitoring, when a developer is handed a list of recommendations to improve application performance, they usually have little incentive to follow them. While some applications might generate only a handful of tuning recommendations, others might produce a checklist of several dozen or more parameters to tweak. The developer is likely already working through a backlog of new, impactful projects and may resist spending time revisiting an application they wrote months ago. And most companies want their developers to be developing (hence their title!) rather than tuning.
Summing It Up: Manual Application Tuning Doesn’t Address In-Application Waste and Can’t Scale
Given the waste inside the Spark application itself due to common overprovisioning, manual tuning leaves money on the table when used as a cost optimization strategy. Specifically, static allocation levels provisioned to meet dynamic resource requirements produce waste that is all but impossible to eliminate through manual tuning, and even harder to address at scale. Without automation, there is no practical way to match provisioned resources to the application’s actual usage in real time, and most organizations are reluctant to spend their development resources chasing this near-futile effort.
In our next blog entry in this series, we’ll examine the fifth and final myth about Spark, which involves Spark Dynamic Allocation. Stay tuned!