Apache Spark™ is a unified analytics engine for large-scale data processing. It can run some workloads up to 100x faster than Hadoop MapReduce, largely because its execution model keeps intermediate data in memory instead of persisting it to disk between stages.
However, despite its many benefits, Apache Spark’s big data processing capabilities also come with unique challenges, one of which is serialization. What is the best way to deal with it? In our webinar, Pepperdata Field Engineer Alexander Pierce walks through the challenges associated with serialization in Apache Spark.
The Problem with Serialization in Apache Spark
Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or those that consume a large number of bytes, will greatly slow down the computation.
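The cost of a serialization format is easy to see even outside Spark. The short Python sketch below is a generic illustration (plain `pickle`, not Spark code): serializing the same object with Python’s oldest, ASCII-based pickle protocol and with its most recent binary protocol produces very different byte counts, which is exactly the kind of overhead that adds up across a distributed job.

```python
import pickle

# A simple payload to serialize: ten thousand integers.
data = list(range(10_000))

# Protocol 0 is the original ASCII-based pickle format;
# HIGHEST_PROTOCOL is a much more compact binary format.
old = pickle.dumps(data, protocol=0)
new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(f"protocol 0 size: {len(old)} bytes")
print(f"protocol {pickle.HIGHEST_PROTOCOL} size: {len(new)} bytes")
```

On a typical run the binary protocol output is several times smaller for the same data; in a cluster, that difference is paid on every shuffle and every task dispatch.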
“Serialization is fairly important when you’re dealing with distributed applications,” Alex explains. “Because you’ll have to distribute your code for running and your data for execution, you need to make sure that your large-scale data processing programs can serialize, deserialize, and send objects across the wire quickly.” This is often the first thing you should tune when optimizing a Spark application. Java’s default serializer has mediocre performance, both in runtime and in the size of the serialized output it produces. Alex recommends using the Kryo serializer instead.
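Switching Spark to Kryo is a configuration change. A minimal sketch of the relevant spark-defaults.conf entries is shown below; the property names come from Spark’s standard configuration, while the buffer size is an illustrative value, not a tuned recommendation:

```
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   128m
```

Kryo serializes faster and more compactly than Java serialization, but it does not support every Serializable type out of the box; for best results, register the classes your job uses in advance (for example, via the spark.kryo.classesToRegister setting).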
Watch our webinar to learn more about tackling Apache Spark challenges.
Understand how to improve the usability and supportability of Spark in your projects, and successfully overcome common challenges and performance tuning issues in large-scale data processing.
Or… if you want to skip ahead to the ‘good stuff’ and see how Pepperdata takes care of these challenges for you, start your free trial now!