Apache Spark™ has quickly become one of the most widely adopted open-source frameworks, and the only unified analytics engine that combines large-scale data processing with state-of-the-art machine learning and AI algorithms. However, it also comes with an array of problems. Pepperdata Field Engineer Alexander Pierce draws on his experience across dozens of production deployments to discuss the issues commonly observed in Apache Spark™ cluster environments.
The Problem with Executor Resource Sizing and Heap Utilization
You’re often trying to subdivide your data set into the smallest pieces your Spark executors can consume efficiently, but you don’t want the partitions to be so small that task overhead dominates. While there are a few ways to find that happy middle ground, you’ll also have to guard against data skew by ensuring a well-distributed key space.
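As a rough illustration of that middle ground, the arithmetic might be sketched as follows. The function name and the 128 MB target are assumptions for the example, not a recommendation from the webinar; the right target depends on your data and executors:

```python
# Rough partition-count estimate: partitions small enough for an executor
# core to process comfortably, but not so small that per-task scheduling
# overhead dominates. The 128 MB default is a common rule of thumb only.
def suggested_partition_count(dataset_bytes, target_partition_bytes=128 * 1024 * 1024):
    # Round up so the final partial partition still gets its own task.
    return max(1, -(-dataset_bytes // target_partition_bytes))

# Example: a 10 GB dataset at a 128 MB target -> 80 partitions.
print(suggested_partition_count(10 * 1024**3))
```

Note that this only sizes partitions by byte count; if the key space is skewed, some partitions will still be much larger than the average, which is why a well-distributed key space matters.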
“Make a guess at the size of your executor based on the amount of data you expect to process at any one time,” Alex advises. “Know your data set, know your partition count.” However, that’s not the whole story. There are two values to keep an eye on when running Spark on YARN: the size of your executor heap, and the YARN memory overhead. The overhead setting exists so that the YARN scheduler does not kill an application that uses a large amount of NIO buffer memory or other off-heap memory.
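A minimal sketch of where those two values are set on a Spark-on-YARN submission (the sizes and the script name are placeholders, not tuning advice):

```shell
# Hypothetical spark-submit invocation.
# --executor-memory sets the JVM heap per executor.
# spark.executor.memoryOverhead reserves off-heap headroom (NIO buffers,
# thread stacks, etc.) so YARN does not kill the container when total
# memory use exceeds the heap alone.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=2g \
  my_job.py
```

YARN allocates the container based on heap plus overhead, so both values together determine whether an executor survives under off-heap pressure.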
Overcoming Spark Challenges
Spark can be challenging. Whether you’re faced with executor resource sizing, heap utilization, or another common challenge with Spark, watch our webinar with Alex Pierce to learn how to improve the performance, usability, and supportability of Spark.
You can also learn more about Pepperdata by beginning your free trial now!