“Why is Spark So Slow?”
When Apache Spark works well, it works really well. Sometimes, though, users find themselves asking this frustrating question.
Spark is such a popular large-scale data processing framework because it can handle more computation and stream processing than traditional data processing solutions. Compared to conventional systems like MapReduce, Spark can be 10 to 100 times faster, largely because it keeps intermediate data in memory. But while it handles an impressively wide range of workloads and big data sets, Spark can sometimes struggle. Here’s why, and here’s what you can do about it.
What Slows Spark Down?
So: You’ve already tried a few Apache Spark performance tuning techniques, but your applications are still slow. At this point, it’s time to dive deeper into your Spark architecture and determine what is making your instance sluggish.
In a Spark architecture, the driver functions as an orchestrator. As a result, it is provisioned with less memory than executors. When a driver suffers an OutOfMemory (OOM) error, it could be the result of:
- Low driver memory configured vs. memory requirement per the application
- Misconfiguration of spark.sql.autoBroadcastJoinThreshold
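Simply put, an OOM error occurs when the driver is asked to do work that needs more memory than it has been allocated. Broadcast joins are a common trigger: Spark materializes any table smaller than spark.sql.autoBroadcastJoinThreshold on the driver before sending it to the executors, so a threshold that is too high can exhaust driver memory. Two effective Spark tuning tips to address this situation are: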
- increase the driver memory
- decrease the spark.sql.autoBroadcastJoinThreshold value
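The two tips above can be sketched as a small helper that builds spark-submit arguments. This is a minimal illustration, not a recommended configuration: the function name, the example values, and the app.py script are hypothetical. It assumes the threshold is given in bytes, with -1 disabling automatic broadcast joins.

```python
from typing import List


def driver_tuning_args(driver_memory_gb: int, broadcast_threshold_mb: int) -> List[str]:
    """Build spark-submit arguments for the two driver-side settings."""
    if driver_memory_gb <= 0:
        raise ValueError("driver_memory_gb must be positive")
    # The threshold is expressed in bytes; a negative value maps to -1,
    # which turns automatic broadcast joins off entirely.
    if broadcast_threshold_mb < 0:
        threshold_bytes = -1
    else:
        threshold_bytes = broadcast_threshold_mb * 1024 * 1024
    return [
        "--driver-memory", f"{driver_memory_gb}g",
        "--conf", f"spark.sql.autoBroadcastJoinThreshold={threshold_bytes}",
    ]


# Example (hypothetical script name):
#   " ".join(["spark-submit"] + driver_tuning_args(8, 5) + ["app.py"])
```

Driver memory is passed on the command line here because, in client mode, the driver JVM has already started by the time application code could set it.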
Sometimes, Spark runs slowly because there are too many concurrent tasks running.
The capacity for high concurrency is a beneficial feature, as it provides Spark-native fine-grained sharing. This leads to maximum resource utilization while cutting down query latencies. Spark divides jobs and queries into multiple phases and breaks down each phase into multiple tasks. Depending on several factors, Spark executes these tasks concurrently.
However, the number of tasks each executor runs in parallel is set by the spark.executor.cores property. High concurrency means more tasks run at once, but all of them share the same executor memory. If the value is set too high without due consideration to memory, the executors will fail.
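To see why cores and memory have to be sized together, here is a rough estimate of the memory available to each concurrent task. It assumes Spark's documented defaults of 300 MiB reserved memory and spark.memory.fraction = 0.6; the function name is hypothetical, and the result is an upper bound because cached data shares the same region.

```python
def memory_per_task_mb(
    executor_memory_mb: float,
    executor_cores: int,
    memory_fraction: float = 0.6,   # assumed default of spark.memory.fraction
    reserved_mb: float = 300.0,     # assumed default reserved memory
) -> float:
    """Estimate the unified memory each concurrent task can use, in MiB."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    usable = max(executor_memory_mb - reserved_mb, 0.0) * memory_fraction
    # Every core runs one task at a time, so the region is split across cores.
    return usable / executor_cores


# With an 8 GiB executor, raising cores from 4 to 16 cuts the per-task share
# to a quarter: memory_per_task_mb(8192, 4) vs memory_per_task_mb(8192, 16).
```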
Why is Spark so slow? Maybe you have a poorly written query lurking somewhere.
By design, Spark’s Catalyst optimizer automatically attempts to optimize a query to the fullest extent. However, it can only do so much if the query itself is badly written. Take, for example, a query that selects all the columns of a Parquet/ORC table. Every column requires some degree of in-memory column batch state, so selecting all columns results in higher overhead.
A good query reads as few columns as possible. A good Spark performance tuning practice is to utilize filters wherever you can. This helps limit the data fetched to executors.
Another good tip is to use partition pruning. Converting queries to use partition columns is one way to optimize queries, as it can drastically limit data movement.
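The advice on columns, filters, and partition pruning can be combined in one read. The sketch below assumes `spark` is a SparkSession and that the data is Parquet partitioned by the column used in the filter; the function name, path, and column names are hypothetical.

```python
from typing import Sequence


def read_pruned(spark, path: str, columns: Sequence[str], partition_filter: str):
    """Read a Parquet dataset, keeping only the needed partitions and columns.

    `partition_filter` is a SQL expression on a partition column,
    for example "event_date = '2024-01-01'".
    """
    if not columns:
        raise ValueError("list the columns you need instead of selecting everything")
    return (
        spark.read.parquet(path)
        .where(partition_filter)  # filter first, while the partition column is still present
        .select(*columns)         # then keep only the columns the job actually uses
    )


# Example (hypothetical path and columns):
#   df = read_pruned(spark, "/data/events", ["user_id", "amount"],
#                    "event_date = '2024-01-01'")
```

Calling explain() on the result is a quick way to check whether the partition filter was pushed down to the scan.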
Getting memory configurations right is critical to the overall performance of a Spark application.
Each Spark app has a different set of memory and caching requirements. When incorrectly configured, Spark apps either slow down or crash. A deep look into the spark.executor.memory or spark.driver.memory values will help determine if the workload requires more or less memory.
YARN container memory overhead can also cause Spark applications to slow down because it takes YARN longer to allocate larger pools of memory. YARN runs every Spark component, such as drivers and executors, within containers. The overhead memory is the off-heap memory used for JVM overheads, interned strings, and other JVM metadata.
When Spark performance slows down due to YARN memory overhead, you need to set spark.yarn.executor.memoryOverhead to the right value (since Spark 2.3 the property is named spark.executor.memoryOverhead). A typical starting point is 10% of the executor memory, which is also Spark’s default, subject to a 384 MiB minimum.
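The 10% rule is easy to work through by hand. The sketch below assumes Spark's default sizing (10% of executor memory, with a 384 MiB floor) and uses hypothetical function names; workloads that use a lot of off-heap memory may need a larger value.

```python
def executor_memory_overhead_mb(
    executor_memory_mb: int,
    factor: float = 0.10,    # share of executor memory reserved as overhead
    minimum_mb: int = 384,   # assumed default floor
) -> int:
    """Overhead YARN adds on top of the executor heap, in MiB."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def yarn_container_size_mb(executor_memory_mb: int) -> int:
    """Total memory YARN must allocate for one executor container, in MiB."""
    return executor_memory_mb + executor_memory_overhead_mb(executor_memory_mb)


# An 8 GiB executor needs a container larger than 8 GiB:
#   yarn_container_size_mb(8192)
# A 2 GiB executor hits the 384 MiB floor rather than 10%:
#   executor_memory_overhead_mb(2048)
```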
Speed Spark up with Optimization Practices
There are certain steps you need to take to ensure Spark isn’t running slowly. Here are some effective ways to keep your Spark architecture, nodes, and apps running at optimal levels.
Data serialization converts an in-memory data structure into a format that can be stored in a file or delivered over a network. Choosing an efficient serializer can dramatically enhance the performance of distributed applications. The two popular methods of data serialization are:
Java serialization – Spark’s default. Objects are serialized with Java’s ObjectOutputStream framework, and implementing java.io.Externalizable gives you closer control over serialization performance. Java serialization provides lightweight persistence, but it is often slower and produces larger output than Kryo.
Kryo serialization – Spark can use the Kryo serialization library (v4) to serialize objects faster and more compactly than Java serialization. To get the most out of Kryo serialization, register your classes via the registerKryoClasses method.
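From Python, where registerKryoClasses is not available on SparkConf, the same setup can be expressed as configuration properties. The sketch below builds those properties as a plain dictionary; the function name and the class names in the example are hypothetical. Note that Kryo applies to JVM-side objects, while Python objects in PySpark are pickled separately.

```python
from typing import Dict, Iterable


def kryo_conf(class_names: Iterable[str], registration_required: bool = False) -> Dict[str, str]:
    """Spark properties that switch the serializer to Kryo and register classes."""
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Comma-separated fully qualified names of the JVM classes to register.
        "spark.kryo.classesToRegister": ",".join(class_names),
        # When true, serializing an unregistered class raises an error
        # instead of silently writing the full class name with every object.
        "spark.kryo.registrationRequired": str(registration_required).lower(),
    }


# Example (hypothetical classes), applied to a session builder:
#   for key, value in kryo_conf(["com.example.Event"]).items():
#       builder = builder.config(key, value)
```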
Caching is a highly efficient optimization technique for data that is repeatedly required and queried. cache() and persist() both store the computed contents of a Dataset, RDD, or DataFrame for reuse.
The thing to remember is that cache() uses the default storage level (memory only for RDDs, memory and disk for Datasets and DataFrames), whereas persist() stores the data at the storage level specified by the user. Caching helps bring down costs and saves time when dealing with repeated computations, as reading data from memory is much faster than reading from disk.
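A common pattern is to cache a DataFrame only for the span in which it is reused, then release it. The helper below is a minimal sketch with a hypothetical name; it relies only on the cache() and unpersist() methods of the object passed in, and the column name in the example is made up.

```python
from typing import Any, Callable, Iterable, List


def run_with_cache(df, actions: Iterable[Callable[[Any], Any]]) -> List[Any]:
    """Run several actions against a cached DataFrame, then free the cache."""
    cached = df.cache()  # lazy: the first action below materializes the cache
    try:
        return [action(cached) for action in actions]
    finally:
        cached.unpersist()  # release executor memory even if an action fails


# Example: two actions reuse the same cached data instead of recomputing it.
#   total, distinct_users = run_with_cache(
#       df, [lambda d: d.count(), lambda d: d.select("user_id").distinct().count()]
#   )
```

To choose a storage level explicitly, replace df.cache() with df.persist(level), where level is one of the StorageLevel constants.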
Data Structure Tuning
Data structure tuning reduces Spark memory consumption. Data structure tuning usually involves:
- Using enumerated objects or numeric IDs instead of strings for keys
- Refraining from using many objects and complex nested structures
- Setting the JVM flag -XX:+UseCompressedOops if the memory size is less than 32 GB
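The first item in the list above can be illustrated with a simple dictionary encoding, which maps each distinct string key to a small integer so that the bulk of the data carries numbers instead of repeated strings. This is a plain-Python sketch with a hypothetical function name and made-up keys, not a Spark API.

```python
from typing import Dict, Iterable, List, Tuple


def encode_keys(keys: Iterable[str]) -> Tuple[List[int], Dict[str, int]]:
    """Replace string keys with numeric IDs, returning the IDs and the lookup table."""
    ids: Dict[str, int] = {}
    encoded: List[int] = []
    for key in keys:
        # A new key gets the next unused ID; a repeated key reuses its ID.
        encoded.append(ids.setdefault(key, len(ids)))
    return encoded, ids


# Example:
#   encoded, lookup = encode_keys(["us-east", "eu-west", "us-east"])
#   The lookup table is kept once, while the records hold only integers.
```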
Garbage Collection Optimization
Garbage collection is a memory management tool. Each application stores data in memory, and that in-memory data has a life cycle. Garbage collection identifies which data is no longer needed, marks it for removal, and removes it. The removal takes place during a pause of the application, and these pauses should be kept short and infrequent. When garbage collection becomes a bottleneck, switching to the G1GC garbage collector with -XX:+UseG1GC often helps.
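On executors, the G1GC flag is passed through the spark.executor.extraJavaOptions property. The sketch below, with hypothetical function names, appends the flag to any options already set without duplicating it. It does not detect conflicting collector flags such as -XX:+UseParallelGC, so those would need to be removed by hand.

```python
from typing import Dict


def with_jvm_flags(existing_options: str, *flags: str) -> str:
    """Append JVM flags to an options string, skipping flags already present."""
    options = existing_options.split()
    for flag in flags:
        if flag not in options:
            options.append(flag)
    return " ".join(options)


def g1gc_conf(existing_executor_options: str = "") -> Dict[str, str]:
    """Spark property that enables G1GC on executors, preserving existing options."""
    return {
        "spark.executor.extraJavaOptions": with_jvm_flags(
            existing_executor_options, "-XX:+UseG1GC"
        )
    }


# Example: g1gc_conf("-Dlog.level=INFO") keeps the existing option
# and adds -XX:+UseG1GC after it.
```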
Monitor, Tune, and Optimize Spark with Pepperdata
Spark doesn’t always run perfectly. It’s a great data-processing platform, but it can’t quite be left to run on full autopilot. Consistent Spark performance tuning will help your Spark infrastructure perform at optimal levels.
The next time you find yourself asking “why is Spark so slow?”, dive into the Spark architecture and take a closer look. One of the causes described above might be the culprit, and the tips mentioned for improving performance might be what you need to improve things.
Pepperdata can help with optimizing your Spark usage. With Pepperdata, you have the tools and the technology to help you monitor your Spark infrastructure, processes, workloads, and apps in real-time. Pepperdata monitoring solutions provide you with deep and comprehensive visibility and observability. You see everything. Plus, Pepperdata generates smart recommended actions to ensure your Spark infrastructure performs at optimal levels while keeping costs low. Apache Spark performance tuning on a whole new level.