Apache Spark is playing a critical role in the adoption and evolution of Big Data technologies because it gives enterprises more sophisticated ways to leverage Big Data than Hadoop does. The amount of data being analyzed and processed through the framework is massive and continues to push the boundaries of the engine. Whether you’re programming in Java or working in Python, these five items can impact your Spark applications.
- Serialization – This plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or that consume a large number of bytes, will greatly slow down the computation. Often, this is the first thing you should tune to optimize a Spark application. Java’s default serializer performs poorly in terms of both runtime and the size of its output, so the Spark team recommends using the Kryo serializer instead. Tips: Avoid anonymous classes and use static nested classes instead, because an anonymous class forces its outer class to be serialized as well. Avoid using static variables as a workaround for serialization issues, because multiple tasks can run inside the same JVM and the static instance might not be thread safe.
- Partition Sizes – Generally speaking, any performance management software that sees data skew will recommend more partitions, but not too many more! A good way to decide on the number of partitions in an RDD is to make it a multiple of the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are used optimally. You want to avoid a situation where you have 4 executors and 5 partitions (to reduce it to a simple case, with one task slot per executor), because the fifth partition then runs in a second wave while the other executors sit idle. Tip: As an upper bound on the number of partitions, each task should take at least 100ms to run; otherwise scheduling the tasks will take more time than executing them. As a lower bound, in order for there to be parallelization, make the number of partitions at least 2x the number of cores being requested, or reasonably expected to be available. It may take a couple of tries to find the right balance while avoiding skew.
- Executor Resource Sizing – What you’re trying to do is sub-divide your data set into the smallest pieces that can be easily consumed by your Spark executors, but you don’t want them too small. There are a few ways to find that happy middle ground, but first of all you’re trying to avoid data skew by making sure your key space is well distributed. Make a guess at the size of your executor based on the amount of data you expect to be processed at any one time. There are two values in Spark on YARN to keep an eye on: the size of your executor, and what is called the YARN memory overhead. The overhead is there so that YARN does not kill an application when it uses a large amount of NIO memory or other off-heap memory areas. Make sure your driver is large enough to hold the execution plan and the expected results to be delivered to the client. One way to simplify this is to use Dynamic Allocation, which is commonly used when Spark runs in a YARN (Hadoop) environment. This allows you to manage executor size and core count but lets the environment itself determine the number of executors that can or should be launched by the application. Please talk to your administrators about this, as they may request that you place a maxExecutors cap (spark.dynamicAllocation.maxExecutors) on the size of the ask, or Spark may ask for an executor for every task!
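
The sizing rules from the tips above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a definitive tuning tool: the helper names (`suggest_partitions`, `yarn_container_mb`, `build_conf`) are hypothetical, the overhead default of 10% of executor memory with a 384 MB floor reflects the documented default in recent Spark versions and should be checked against your own version, and the example cluster numbers are made up. The configuration keys themselves are standard Spark properties. Dynamic allocation typically also needs the external shuffle service or shuffle tracking enabled, which this sketch leaves out.

```python
def suggest_partitions(total_cores: int, factor: int = 2) -> int:
    """Return a partition count that is a whole multiple of the core count.

    The factor is clamped to at least 2, following the "at least 2x the
    number of cores" lower bound from the tip above.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be positive")
    return total_cores * max(2, factor)


def yarn_container_mb(executor_memory_mb: int,
                      overhead_fraction: float = 0.10,
                      min_overhead_mb: int = 384) -> int:
    """Executor memory plus memory overhead, in MB.

    Assumption: overhead = max(floor, fraction * executor memory).
    YARN may round the request up further to its allocation increment,
    which is not modeled here.
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def build_conf(executors_cap: int, executor_cores: int,
               executor_memory_mb: int) -> dict:
    """Assemble Spark properties for Kryo, executor sizing, and a capped
    dynamic allocation. Parallelism is derived from the capped core count."""
    total_cores = executors_cap * executor_cores
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_mb}m",
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(executors_cap),
        "spark.default.parallelism": str(suggest_partitions(total_cores)),
    }


if __name__ == "__main__":
    # Hypothetical cluster: at most 10 executors, 4 cores and 8 GB each.
    conf = build_conf(executors_cap=10, executor_cores=4, executor_memory_mb=8192)
    for key, value in sorted(conf.items()):
        print(f"--conf {key}={value}")
    print("Memory YARN must grant per executor (MB):", yarn_container_mb(8192))
```

The printed `--conf` pairs can be passed to `spark-submit`. Treat the resulting numbers as a starting point and adjust them after observing task durations and skew in your own jobs.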