Apache Spark™ is a full-fledged data engineering toolkit that enables you to operate on large data sets without worrying about the underlying infrastructure. It achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. While all that may be promising, Spark also has many challenges, and one of them is dealing with partition sizes.

Pepperdata Field Engineer Alexander Pierce has extensive experience across dozens of production deployments. In our webinar, he explores this particular issue based on what he has observed in Spark cluster environments.

Getting Partition Recommendations and Sizing to Work for You

Generally speaking, any performance management software that sees data skew will recommend more partitions — but not too many more!

“More partitions can be better, but not always,” said Alex. “The more partitions you have, the better your serialization could be.” A common rule of thumb for deciding on the number of partitions in an RDD is to make it a multiple of the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are used as fully as possible. To reduce it to a simple case, Alex suggests avoiding a situation where you have 4 executors and 5 partitions: if each executor runs one task at a time, four partitions are processed together and the fifth runs afterward while the other executors sit idle.
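The arithmetic behind that rule of thumb can be sketched in a few lines of plain Python. This is only an illustration, not something shown in the webinar: the function names are hypothetical, and it assumes every partition takes about the same time to process and that each core runs one task at a time.

```python
import math


def task_waves(num_partitions: int, total_cores: int) -> int:
    """Number of sequential rounds ("waves") needed to process every partition.

    Assumes one task per core at a time and roughly equal task durations.
    """
    if num_partitions <= 0 or total_cores <= 0:
        raise ValueError("num_partitions and total_cores must be positive")
    return math.ceil(num_partitions / total_cores)


def last_wave_utilization(num_partitions: int, total_cores: int) -> float:
    """Fraction of cores that are busy during the final wave."""
    if num_partitions <= 0 or total_cores <= 0:
        raise ValueError("num_partitions and total_cores must be positive")
    remainder = num_partitions % total_cores
    return 1.0 if remainder == 0 else remainder / total_cores


def round_up_to_multiple(num_partitions: int, total_cores: int) -> int:
    """Smallest multiple of total_cores that is >= num_partitions."""
    return task_waves(num_partitions, total_cores) * total_cores


if __name__ == "__main__":
    # The case from the text: 4 executors (assumed one core each), 5 partitions.
    cores, partitions = 4, 5
    print("waves:", task_waves(partitions, cores))
    print("last-wave utilization:", last_wave_utilization(partitions, cores))
    print("next multiple of cores:", round_up_to_multiple(partitions, cores))
```

With 4 cores and 5 partitions, the sketch reports two waves, with only a quarter of the cores busy in the second one; rounding up to 8 partitions also takes two waves but keeps every core busy in both, with each task handling less data. In a real PySpark job, a count chosen this way could be passed to `repartition()` on an RDD or DataFrame, although the right number also depends on data size and skew.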

Learn more about Spark challenges by watching our webinar with Alex Pierce. Get guidelines on how to overcome the most common Spark problems you are likely to encounter. Understand how to improve the usability and supportability of Spark in your projects and successfully overcome common challenges.

Or you can start your Pepperdata free trial now and see all of this in action!