Apache Spark™ has quickly become a leading open-source unified analytics engine, combining large-scale data processing with state-of-the-art machine learning and AI algorithms. However, Apache Spark™ deployments can also come with an array of problems and operational issues. Pepperdata Field Engineer Alexander Pierce draws on his experience across dozens of production deployments to discuss the issues commonly observed in Apache Spark™ cluster environments.

The Problem with Executor Resource Sizing and Heap Utilization

You’re often trying to subdivide your data set into the smallest pieces your Spark executors can efficiently consume, but you don’t want the partitions to be too small. While there are a few ways to find that happy middle ground, you’ll also have to minimize problems around data skew by ensuring a well-distributed key space.
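One common way to get a well-distributed key space is key salting: appending a random suffix to a hot key so its records spread across many partitions instead of landing on a single executor. Here is a minimal, plain-Python sketch of the idea (the key names and bucket count are hypothetical; in a real job you would salt inside a Spark transformation and aggregate in two stages):

```python
import random

random.seed(42)  # seeded only so the sketch is reproducible

# A skewed key distribution: one "hot" key dominates the data set.
records = [("hot_key", i) for i in range(900)] + \
          [(f"key_{i}", i) for i in range(100)]

SALT_BUCKETS = 8  # hypothetical bucket count; tune to your parallelism

def salt(key):
    # Append a random suffix so a hot key hashes to several
    # partitions instead of one.
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

salted = [(salt(k), v) for k, v in records]

# The hot key is now spread across up to SALT_BUCKETS distinct salted keys.
hot_variants = {k for k, _ in salted if k.startswith("hot_key#")}
print(sorted(hot_variants))
```

The trade-off is that any aggregation must be done twice: first per salted key, then again after stripping the suffix to recombine the partial results.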

“Make a guess at the size of your executor based on the amount of data you expect to be processed at any one time,” Alex advises. “Know your data set, know your partition count.” However, that’s not all there is to it. There are two values in Spark on YARN to keep an eye on if you want to avoid problems and deployment issues: the size of your executor heap, and the YARN memory overhead. The overhead gives the YARN scheduler headroom so it doesn’t kill an application that allocates a large amount of NIO memory or other off-heap memory.
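In practice, those two values map to the executor memory setting and `spark.executor.memoryOverhead` on a `spark-submit` command line. The sizes below are purely illustrative (a made-up job processing roughly 100 GB at ~128 MB per partition), not a recommendation:

```shell
# Hypothetical sizing; tune every value to your own data set.
# --executor-memory: the on-heap size of each executor's JVM.
# spark.executor.memoryOverhead: off-heap headroom YARN grants on top of
#   the heap, so NIO buffers and other off-heap allocations don't push
#   the container past its limit and get the executor killed.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.sql.shuffle.partitions=800 \
  my_job.py
```

If the overhead is left unset, Spark defaults it to a fraction of the executor memory, which is often too small for jobs that lean heavily on off-heap buffers.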

Overcoming Apache Spark Problems

Spark can be challenging. Whether you’re faced with problems with executor resource sizing, heap utilization, or other common issues with Apache Spark™, watch our webinar with Alex Pierce to learn how to improve the performance, usability, and supportability of Spark.

You can also learn more about Pepperdata by beginning your free trial now.

Up Next: What is Apache Spark™ Used For?: Partition Recommendations, Sizing, and Configuration Best Practices