“Why is Spark So Slow?”
When Apache Spark works well, it works really well. Sometimes, though, users find themselves asking this frustrating question.
Spark is such a popular large-scale data processing framework because it is capable of performing more computations and carrying out more stream processing than traditional data processing solutions. Compared to conventional systems like MapReduce, Spark can be 10-100x faster. But while capable of handling an impressively wide range of workloads and big data sets, Spark can sometimes struggle. Here’s why, and here’s what you can do about it.
What Slows Spark Down?
So: You’ve already tried a few Apache Spark performance tuning techniques—but your applications are still slow. At this point, it’s time to dive deeper into your Spark architecture and determine what is making your applications sluggish.
In a Spark architecture, the driver functions as an orchestrator. As a result, it is typically provisioned with less memory than the executors. When the driver suffers an OutOfMemory (OOM) error, it could be the result of:
- Driver memory configured too low for the application’s actual requirement
- Misconfiguration of spark.sql.autoBroadcastJoinThreshold
Simply put, an OOM error occurs when the driver is asked to perform an operation that requires more memory than it has been allocated. Two effective Spark tuning tips to address this situation are:
- increase the driver memory
- decrease the spark.sql.autoBroadcastJoinThreshold value
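As a sketch, both knobs can be set at submit time (the values below are placeholders to size for your own application, not recommendations):

```shell
# --driver-memory raises the driver heap above the 1g default.
# spark.sql.autoBroadcastJoinThreshold (bytes) caps the size of tables the
# driver will collect and broadcast; lower it below the 10 MB default,
# or set -1 to disable broadcast joins entirely.
spark-submit \
  --driver-memory 4g \
  --conf spark.sql.autoBroadcastJoinThreshold=5242880 \
  your-app.jar
```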
Sometimes, Spark runs slowly because there are too many concurrent tasks running.
The capacity for high concurrency is a beneficial feature, as it provides Spark-native fine-grained sharing, leading to maximum resource utilization while cutting down query latencies. Spark divides jobs and queries into multiple stages and breaks each stage down into multiple tasks. Depending on several factors, Spark executes these tasks concurrently.
However, the number of tasks executed in parallel per executor is determined by the spark.executor.cores property. High concurrency means many tasks run at once, but if the value is set too high without accounting for the memory available to each executor, the executors will fail with OOM errors.
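The arithmetic is worth spelling out: every task running on an executor shares that executor’s single heap, so adding cores without adding memory shrinks each task’s share. A sketch with placeholder values:

```shell
# 5 concurrent tasks per executor sharing a 10g heap: roughly 2g per task.
# Raising cores to 10 with the same 10g halves that to ~1g per task,
# which is where under-provisioned executors start failing with OOM.
spark-submit \
  --conf spark.executor.cores=5 \
  --conf spark.executor.memory=10g \
  your-app.jar
```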
Why is Spark so slow? Maybe you have a poorly written query lurking somewhere.
By design, Spark’s Catalyst engine automatically attempts to optimize a query to the fullest extent. However, any optimization effort is bound to fail if the query itself is badly written. Consider, for example, a query that selects all the columns of a Parquet/ORC table. Every column requires some degree of in-memory column batch state, so a query that selects all columns incurs higher overhead.
A good query reads as few columns as possible. A good Spark performance tuning practice is to utilize filters wherever you can. This helps limit the data fetched to executors.
Another good tip is to use partition pruning. Converting queries to use partition columns is one way to optimize queries, as it can drastically limit data movement.
Getting memory configurations right is critical to the overall performance of a Spark application.
Each Spark app has a different set of memory and caching requirements. When incorrectly configured, Spark apps either slow down or crash. A deep look into the spark.executor.memory or spark.driver.memory values will help determine if the workload requires more or less memory.
YARN container memory overhead can also cause Spark applications to slow down, because the overhead (spark.executor.memoryOverhead) is allocated within each YARN container on top of the executor heap. If it is set too low, YARN kills containers for exceeding their memory limits, and the failed tasks must be retried.