Apache Spark™ is a unified analytics engine for large-scale data processing. It is known for running some workloads up to 100x faster than disk-based MapReduce, largely because it keeps data in memory instead of persisting intermediate results to disk.
However, despite its many benefits, Spark also comes with its own challenges, one of which is serialization. What is the best way to deal with it? In our webinar, Pepperdata Field Engineer Alexander Pierce took on this question.
The Problem with Serialization
Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or that consume a large number of bytes, will greatly slow down the computation.
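To make the size cost concrete, here is a minimal sketch that serializes the same small object two ways: with Java's built-in serialization, and with a hand-written encoding that writes only the raw field values. The `Reading` class and both helper methods are hypothetical names invented for this illustration, and the compact encoding is only a stand-in for what a leaner format can achieve, not how any particular serializer works internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSizeDemo {

    // Hypothetical record type, used only for this illustration.
    static class Reading implements Serializable {
        private static final long serialVersionUID = 1L;
        final int sensorId;
        final double value;

        Reading(int sensorId, double value) {
            this.sensorId = sensorId;
            this.value = value;
        }
    }

    // Built-in Java serialization: writes a stream header and a class
    // descriptor (class name, field names, types) along with the field values.
    static byte[] javaSerialize(Reading r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hand-written compact encoding: only the raw field values,
    // a 4-byte int followed by an 8-byte double.
    static byte[] compactSerialize(Reading r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(r.sensorId);
            out.writeDouble(r.value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Reading r = new Reading(42, 3.14);
        System.out.println("Java serialization: " + javaSerialize(r).length + " bytes");
        System.out.println("Compact encoding:   " + compactSerialize(r).length + " bytes");
    }
}
```

The compact form is always 12 bytes for this class, while the built-in form is several times larger because it carries class metadata. Multiplied across millions of records shuffled between nodes, that overhead is the kind of cost described above.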
“Serialization is fairly important when you’re dealing with distributed applications,” Alex explains. “Because you’ll have to distribute your code for running and your data for execution, you need to make sure that your programs can both serialize, deserialize, and send objects across the wire quickly.” Serialization is often the first thing you should tune to optimize a Spark application. Java's default serializer is mediocre in both runtime and the size of its output. Alex recommends using the Kryo serializer instead.
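As a rough sketch of what switching to Kryo can look like, the fragment below shows entries you might add to `spark-defaults.conf`. The property names are standard Spark settings; the class name `com.example.MyRecord` is a hypothetical placeholder for your own types, and the values shown are illustrative rather than tuned recommendations.

```
# Use Kryo instead of Java's default serialization
spark.serializer                 org.apache.spark.serializer.KryoSerializer

# Optional: register your own classes so Kryo can write a short ID
# instead of the full class name (com.example.MyRecord is hypothetical)
spark.kryo.classesToRegister     com.example.MyRecord

# Optional: raise the buffer ceiling if you serialize large objects
spark.kryoserializer.buffer.max  64m
```

The same settings can also be passed per job with `--conf` on `spark-submit` or set programmatically on a `SparkConf`, which may be preferable when only some applications should use Kryo.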
Watch our webinar to learn more about tackling the many challenges with Spark. Understand how to improve the usability and supportability of Spark in your projects and successfully overcome common challenges.
Or… if you want to skip ahead to the ‘good stuff’ and see how Pepperdata takes care of these challenges for you, start your free trial now!