Everyone in the Big Data world knows Spark. It’s a powerful piece of software that, according to the Spark project, can run workloads up to 100x faster than Hadoop MapReduce.
However, fantastic as Spark is, like all software, it has its challenges. In a recent webinar, we sat down with Alexander Pierce, a Pepperdata Field Engineer, to discuss them. Alex drew on his experiences across dozens of production deployments, and pointed out the best ways to overcome the five most common Spark challenges.
Serialization is Key
Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or those that consume a large number of bytes, will greatly slow down the computation.
“Because you’ll have to distribute your code for running and your data for execution, you need to make sure that your programs can serialize, deserialize, and send objects across the wire quickly,” Alex explains. Serialization is often the first thing you should tune when optimizing a Spark application. Alex recommends the Kryo serializer, because Java’s default serializer is mediocre in both runtime and the size of its serialized output.
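
As a minimal sketch of the Kryo recommendation, the snippet below collects the relevant settings in a plain dictionary and renders them as spark-submit flags. The property names (spark.serializer, spark.kryoserializer.buffer.max, spark.kryo.registrationRequired) are standard Spark properties. The helper names and the 128m buffer value are illustrative assumptions, not figures from the webinar. Note that these settings govern JVM-side serialization, so the right values depend on your workload.

```python
from typing import Dict, List


def kryo_settings(buffer_max: str = "128m",
                  registration_required: bool = False) -> Dict[str, str]:
    """Return Spark properties that switch JVM-side serialization to Kryo.

    buffer_max is an illustrative value (an assumption), not a recommendation.
    """
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": buffer_max,
        # When "true", Kryo fails fast on classes that were never registered.
        "spark.kryo.registrationRequired": str(registration_required).lower(),
    }


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args: List[str] = []
    for key in sorted(settings):
        args += ["--conf", f"{key}={settings[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(kryo_settings())))
```

The same dictionary could be applied key by key when building a session in application code instead of on the command line.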
Getting Partition Recommendations and Sizing to Work for You
Generally speaking, any performance management software that sees data skew will recommend more partitions, but not too many more. “The more partitions you have, the better your serializations could be,” says Alex. But that’s not always the case.
The best way to decide on the number of partitions in an RDD is to make it a multiple of the number of cores in the cluster, so that all the partitions can be processed in parallel and resources are used efficiently. Alex further suggests that you’ll want to avoid a situation where you have four executors and five partitions: if each executor handles one partition at a time, the fifth partition has to wait for a free executor, and the other three executors sit idle while it runs.
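
The arithmetic behind this advice can be sketched in a few lines. The function names and the default multiplier of 2 are assumptions made for illustration, and the utilization figure assumes every task takes the same amount of time and each slot runs one task at a time.

```python
import math


def suggest_partitions(total_cores: int, multiplier: int = 2) -> int:
    """Pick a partition count that is a whole multiple of the core count."""
    if total_cores < 1 or multiplier < 1:
        raise ValueError("total_cores and multiplier must be positive")
    return total_cores * multiplier


def waves(num_partitions: int, slots: int) -> int:
    """Number of scheduling rounds needed to run every partition once."""
    return math.ceil(num_partitions / slots)


def slot_utilization(num_partitions: int, slots: int) -> float:
    """Fraction of slot-rounds doing work, assuming equal task durations."""
    return num_partitions / (waves(num_partitions, slots) * slots)


if __name__ == "__main__":
    # Four single-task executors, five partitions: a second round for one task.
    print(waves(5, 4), slot_utilization(5, 4))
    # Eight partitions fill both rounds completely.
    print(waves(8, 4), slot_utilization(8, 4))
```

With four slots, five partitions need two rounds and keep the slots busy only 62.5% of the time, while eight partitions need the same two rounds and keep every slot busy.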
Monitoring Both Executor Size and YARN Memory Overhead
Often, what you’re trying to do is subdivide your data set into the smallest pieces that can be easily consumed by your Spark executors, but you don’t want them to be too small. There are a few ways to find that happy middle ground, but you’ll have to find a way around data skew by ensuring a well-distributed key space.
“Make a guess at the size of your executor based on the amount of data you expect to be processed at any one time,” Alex says. “Know your data set, know your partition count.” However, that’s not everything there is to it. There are two values in Spark on YARN to keep an eye on: the size of your executor, and the YARN memory overhead. Setting the overhead high enough prevents YARN from killing an application that uses a large amount of NIO memory or other off-heap memory areas.
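
To make the relationship between the two values concrete, here is a rough sketch of how the container request grows with executor size. It assumes Spark's documented default for spark.executor.memoryOverhead (10% of executor memory, with a 384 MiB minimum) and ignores the rounding YARN applies to container sizes. The function names are hypothetical.

```python
from typing import Optional

MIN_OVERHEAD_MIB = 384          # Spark's documented minimum overhead
DEFAULT_OVERHEAD_FACTOR = 0.10  # Spark's documented default factor for JVM jobs


def default_overhead_mib(executor_memory_mib: int,
                         factor: float = DEFAULT_OVERHEAD_FACTOR) -> int:
    """Overhead Spark adds by default when no explicit value is configured."""
    return max(MIN_OVERHEAD_MIB, int(executor_memory_mib * factor))


def container_request_mib(executor_memory_mib: int,
                          overhead_mib: Optional[int] = None) -> int:
    """Approximate memory requested per executor: heap plus overhead.

    YARN may round this up to its allocation increment; that is not modeled.
    """
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    print(container_request_mib(2048))        # small executor, minimum applies
    print(container_request_mib(8192))        # default 10% overhead
    print(container_request_mib(8192, 2048))  # explicit overhead for NIO-heavy jobs
```

An application that leans heavily on NIO or other off-heap memory would typically need the explicit, larger overhead shown in the last call. The 2048 MiB figure is only an example.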
Getting the Most out of DAG Management
It’s always a good idea to keep an eye on the complexity of the execution plan. The DAG (directed acyclic graph) visualization that comes with the Spark UI gives you one possible visual map. If something that you think should be straightforward (a basic join, for example) is taking 10 stages, you can look at your query or code and perhaps reduce it to two or three stages.
Alex offers one solid tip: “Look at each of the stages in the parallelization,” he says. “Keep an eye on your DAG, not just on the overall complexity. Make sure each stage in your code is actually running in parallel.” If you have a poorly parallelized stage that uses less than 60% of the available executors, the questions to keep in mind are: Should that compute be rolled into another stage? Is there a separate partitioning issue?
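
The 60% rule of thumb is easy to turn into a quick check once you have read per-stage executor counts off the Spark UI. The sketch below is an illustration only: the function name, the input shape, and the sample numbers are all made up for the example.

```python
from typing import Dict, List


def underused_stages(executors_used_by_stage: Dict[str, int],
                     available_executors: int,
                     threshold: float = 0.60) -> List[str]:
    """Return the stages that ran on less than `threshold` of the executors."""
    if available_executors < 1:
        raise ValueError("available_executors must be positive")
    flagged: List[str] = []
    for stage, used in executors_used_by_stage.items():
        if used / available_executors < threshold:
            flagged.append(stage)
    return flagged


if __name__ == "__main__":
    # Hypothetical numbers, as if copied by hand from the Spark UI.
    observed = {"stage 0 (scan)": 10, "stage 1 (join)": 4, "stage 2 (write)": 9}
    for stage in underused_stages(observed, available_executors=10):
        print(f"{stage}: check partitioning or fold into another stage")
```

Any stage this flags is a candidate for the two questions above.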
Managing Library Conflicts
When it comes to shading, one quick tip from Alex is to make sure that any external dependencies and classes you bring in are available in the environment you are using, and that they do not conflict with internal libraries used by your version of Spark. A specific example is Google Protobuf, a popular binary format for storing and transporting data that is more compact than JSON; a conflict can arise when your application needs a different Protobuf version from the one already on the cluster’s classpath.
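
Before deciding what to shade, it can help to compare the versions your application expects with the versions already present in the environment. The sketch below shows that comparison under stated assumptions: the function name is hypothetical, and the version strings are placeholders, not the versions any particular Spark release ships. In practice the two dictionaries would come from your build tool's dependency report and from the jars on the cluster.

```python
from typing import Dict, Tuple


def find_conflicts(app_deps: Dict[str, str],
                   cluster_deps: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each library present on both sides with differing versions
    to a pair (version the app wants, version the cluster provides)."""
    conflicts: Dict[str, Tuple[str, str]] = {}
    for name, app_version in app_deps.items():
        cluster_version = cluster_deps.get(name)
        if cluster_version is not None and cluster_version != app_version:
            conflicts[name] = (app_version, cluster_version)
    return conflicts


if __name__ == "__main__":
    # Placeholder versions for illustration only.
    app = {"com.google.protobuf:protobuf-java": "3.x", "org.example:util": "1.0"}
    cluster = {"com.google.protobuf:protobuf-java": "2.x"}
    for name, (wanted, provided) in find_conflicts(app, cluster).items():
        print(f"{name}: app wants {wanted}, cluster provides {provided}")
```

Libraries that show up in the result are the ones to consider shading or aligning with the cluster's version.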