Apache Spark is playing a critical role in the adoption and evolution of Big Data technologies because it gives enterprises more sophisticated ways to leverage Big Data than Hadoop does. The amount of data being analyzed and processed through the framework is massive and still growing, and it continues to push the boundaries of the engine. Whether you’re programming in Java or working with Python, these five items can impact your Spark applications.

  1. Serialization – This plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or that consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. The default Java serializer performs poorly in both runtime and the size of its output. Therefore, the Spark team recommends using the Kryo serializer instead. Tips: Avoid anonymous classes and use static classes instead, because an anonymous class forces its outer class to be serialized as well. Avoid using static variables as a workaround for serialization issues, as multiple tasks can run inside the same JVM and the static instance might not be thread safe.
  2. Partition Sizes – Generally speaking, any performance management software that sees data skew will recommend more partitions, but not too many more! The best way to decide on the number of partitions in an RDD is to make it a multiple of the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are used optimally. You want to avoid a situation where you have 4 executors and 5 partitions (to reduce it to a simple case): the fifth partition has to wait for a free executor and then runs while the other three sit idle. Tip: As an upper bound on the partition count, each task should still take longer than 100ms, or scheduling tasks will take more time than executing them. As a lower bound, in order for there to be parallelization, make the number of partitions at least 2x the number of cores being requested, or reasonably expected to be available. It may take a couple of tries to find the right balance while avoiding skew.
  3. Executor Resource Sizing – The goal is to sub-divide your data set into the smallest pieces that can be easily consumed by your Spark executors, without making them too small. There are a few ways to find that happy middle ground, but first of all you’re trying to avoid data skew by making sure your key space is well distributed. Make a guess at the size of your executor based on the amount of data you expect to be processed at any one time. There are two values in Spark on YARN to keep an eye on: the size of your executor, and what is called the YARN memory overhead. The overhead keeps the YARN scheduler from killing an application when it uses a large amount of NIO memory or other off-heap memory areas. Make sure your driver is large enough to hold the execution plan and the expected results to be delivered to the client. One way to simplify this is to use Dynamic Allocation, which requires Spark to run in a YARN (Hadoop) environment. This allows you to manage executor size and core count but lets the environment itself determine the number of executors that can or should be launched by the application. Please talk to your administrators about this, as they may ask you to place a maxExecutors cap (spark.dynamicAllocation.maxExecutors) on the size of the ask, or Spark will ask for an executor for every task! Tip: Understand your resource usage! Being able to understand the true heap utilization of your executors, and total RAM used vs. RAM requested, can assist you in right-sizing your requests to work better in multi-tenant environments. (This is one small piece of what Pepperdata provides and can assist with.)
  4. DAG Management – It’s always a good idea to keep an eye on the complexity of the execution plan. Use the DAG (directed acyclic graph) visualization tool that comes with the Spark UI for one possible visual map. If something that you think should be straightforward (a basic join, for example) is taking 10 stages, look at your query or code to reduce it to 2 to 3 stages if possible. Tip: Look at each individual stage to understand the parallelization that is actually taking place. If you have a (non-parallel) stage that is using < 60% of the available executors, should that compute be rolled into another stage? Is there a separate partitioning issue?
  5. Shading – Make sure that any external dependencies and classes you are bringing in do not conflict with the internal libraries used by your version of Spark, and that they are available in the environment you are using. For example, are you using Google Protobuf? It’s a popular format for storing and transporting data that is more compact than JSON. But let’s say you want to use the getUnmodifiableView() function. This is only available in Protobuf 2.6.0, and most Hadoop implementations are delivered with Protobuf 2.5.0. You would need to shade the jar while building your project to avoid conflicts over which Protobuf version is used by your application. Tip: Keep your eyes on the prize: Always be aware of potential conflicts.
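
As a minimal sketch of the Kryo recommendation in item 1 and the Dynamic Allocation cap in item 3, the settings can be collected as Spark properties and rendered as spark-submit flags. The property names are standard Spark configuration keys; the helper names `build_spark_conf` and `to_submit_flags` and the default cap of 50 executors are assumptions made for illustration only.

```python
def build_spark_conf(max_executors=50):
    """Return Spark properties for Kryo serialization and capped dynamic allocation.

    max_executors is a hypothetical cap; agree on a real value with your
    cluster administrators.
    """
    return {
        # Replace the default Java serializer with Kryo.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Let the environment decide how many executors to launch...
        "spark.dynamicAllocation.enabled": "true",
        # ...but never more than the agreed cap.
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_flags(conf):
    """Render a property dict as a flat list of spark-submit arguments."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    print(" ".join(to_submit_flags(build_spark_conf())))
```

Depending on the Spark version and cluster manager, dynamic allocation may need additional settings (such as an external shuffle service), so treat this as a starting point rather than a complete configuration.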
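
The partition arithmetic in item 2 can be sketched with a small calculation. This is only a rough model: it assumes every task takes the same time and that each core runs one task at a time, and the function names are hypothetical helpers rather than part of any Spark API.

```python
import math


def suggest_partitions(total_cores, multiple=2):
    """Partition count that is a whole multiple of the available cores."""
    if total_cores < 1 or multiple < 1:
        raise ValueError("total_cores and multiple must be positive")
    return total_cores * multiple


def waves_and_utilization(partitions, total_cores):
    """Return (number of task waves, fraction of core slots doing work).

    Assumes equal-length tasks, so a wave ends when all its tasks end.
    """
    waves = math.ceil(partitions / total_cores)
    utilization = partitions / (waves * total_cores)
    return waves, utilization


if __name__ == "__main__":
    # The "4 executors and 5 partitions" case, assuming one core per executor.
    print(waves_and_utilization(5, 4))
    # Twice as many partitions as cores keeps every slot busy in both waves.
    print(waves_and_utilization(suggest_partitions(4), 4))
```

Under these assumptions, 5 partitions on 4 cores need two waves and use only 62.5% of the available slots, while 8 partitions also need two waves but keep every slot busy.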
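
For the two YARN values mentioned in item 3, the following sketch estimates how much memory each executor container asks for. It assumes the commonly documented default overhead of 10% of executor memory with a 384 MiB floor; your Spark version may differ, so check the documentation for `spark.executor.memoryOverhead` (named `spark.yarn.executor.memoryOverhead` in older releases). The function names are illustrative.

```python
MIN_OVERHEAD_MIB = 384   # assumed floor for the overhead
OVERHEAD_FACTOR = 0.10   # assumed fraction of executor memory


def memory_overhead_mib(executor_memory_mib, factor=OVERHEAD_FACTOR):
    """Estimate the default off-heap overhead for one executor, in MiB."""
    return max(int(executor_memory_mib * factor), MIN_OVERHEAD_MIB)


def container_request_mib(executor_memory_mib, overhead_mib=None):
    """Executor heap plus overhead: roughly what YARN is asked to reserve.

    Pass overhead_mib to model an explicitly configured overhead, for
    example when the job uses a lot of NIO or other off-heap memory.
    """
    if overhead_mib is None:
        overhead_mib = memory_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    for heap in (2048, 8192):
        print(heap, memory_overhead_mib(heap), container_request_mib(heap))
```

Comparing this request against the heap your executors actually use is one way to spot requests that are larger than they need to be.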
