As discussed in our last blog, Apache Spark has become the big thing in Big Data. With the largest open source community in Big Data, Spark enables flexible in-memory data processing and advanced analytics on the Hadoop platform. The roster of companies that have adopted Spark and also are project committers includes some of tech’s biggest names, such as Facebook, Netflix, Yahoo, IBM and Intel.
In the first installment of “The Big Data Insiders” — interviews with Big Data experts and industry influencers — Ian O’Connell, an accomplished Silicon Valley software engineer whose resume includes Twitter and online payments company Stripe, shares his insights about Spark.
Q: Why is Spark so popular and what key features do you like?
A: Interestingly, the reasons for Spark’s growing popularity have varied over time. Originally, having a scalable API that mapped onto MapReduce-style tasks was, by itself, a decent selling point. But the really big original driver was the ability to cache data in memory. This was a big shift from the likes of Pig, Hive and Cascading, where data was aggressively kept on disk and compressed at all times, and it allowed huge improvements in iterative algorithms. The first machine learning pipeline I ported over saw a 50-fold increase in performance!
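To make the caching point concrete, here is a minimal sketch (the data and names are made up for illustration) of an iterative loop over a cached RDD. Without the `cache()` call, every pass of the loop would recompute the whole lineage from the source:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cache-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical training data; cache() keeps the computed partitions
// in memory so each iteration below can reuse them.
val points = spark.sparkContext
  .parallelize(1 to 100000)
  .map(i => i.toDouble)
  .cache()

var total = 0.0
for (_ <- 1 to 10) {
  // Each pass reads the cached partitions instead of re-running the map.
  total += points.sum() / points.count()
}

spark.stop()
```

The same pattern applies to DataFrames and Datasets via `.cache()` or `.persist()`, with storage levels controlling the memory/disk trade-off.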
Modern Spark is a great ETL, SQL, ML and distributed execution platform. The key feature for my current workloads is the integration between Spark’s Dataset API and its SQL engine. It lets me swap easily between the high-performance SQL engine, which offers less type safety (especially at compile time), and type-safe Dataset containers. By exposing interfaces in terms of Datasets, I avoid many of the drawbacks of plain SQL code, where the lack of types makes sharing difficult and error-prone.
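As a rough sketch of that interop (the `Purchase` record and the figures are invented for illustration), a query can run through the untyped SQL engine and then be pulled back into a typed Dataset with `as[T]`:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names are assumptions.
case class Purchase(userId: Long, amount: Double)

val spark = SparkSession.builder()
  .appName("dataset-sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val purchases = Seq(
  Purchase(1L, 10.0),
  Purchase(1L, 20.0),
  Purchase(2L, 5.0)
).toDS()
purchases.createOrReplaceTempView("purchases")

// Untyped SQL for the heavy lifting...
val totals = spark.sql(
  "SELECT userId, SUM(amount) AS amount FROM purchases GROUP BY userId")

// ...then back into a typed Dataset, so downstream code gets
// compile-time checks on field names and types.
val result = totals.as[Purchase].collect().sortBy(_.userId)

spark.stop()
```

If the SQL output columns drift away from the case class, the `as[Purchase]` call fails immediately rather than silently propagating a malformed schema downstream.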
Q: What is the most interesting application of Spark that you have seen?
A: The most interesting application I’ve seen was actually more an illustration of Spark as a general distributed compute platform. In this application, at Yahoo, engineers perform ETL and use MLlib and custom ML libraries, then switch to using Spark as a generic compute platform, running deep learning algorithms with the Caffe library. This mixture of ETL and specialized algorithms is one of the more versatile and interesting applications.
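A stripped-down sketch of that kind of mixed pipeline (the data, column names, and model choice here are all invented, and the hand-off to Caffe is beyond a short example) might chain a DataFrame ETL step straight into an MLlib estimator:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("etl-ml-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// ETL step: made-up raw events, with bot traffic filtered out and the
// columns reduced to the (label, features) layout MLlib expects.
val raw = Seq(
  ("click", 1.0, Vectors.dense(0.0, 1.1)),
  ("bot",   0.0, Vectors.dense(9.9, 9.9)),
  ("click", 0.0, Vectors.dense(2.0, 1.0)),
  ("view",  1.0, Vectors.dense(0.1, 1.3))
).toDF("source", "label", "features")
val training = raw.filter($"source" =!= "bot").select("label", "features")

// ML step: fit an MLlib model directly on the ETL output,
// all within the same Spark job.
val model = new LogisticRegression().setMaxIter(10).fit(training)
val numFeatures = model.coefficients.size

spark.stop()
```

The point is less the particular model than that one job can move between ETL, library ML, and arbitrary distributed computation without leaving the platform.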
Q: What are some lessons learned from using Spark in production?
A: Integration tests are very important when using Spark – more so than with some other scalable Big Data tooling like Scalding. Spark’s Dataset and SQL code paths rely heavily on runtime reflection, which has a downside for stable production jobs: failures only crop up at runtime on your cluster. As a result, a robust integration test environment that exercises serialization between tasks is important for keeping stability high.
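One way to set that up (the job and data below are hypothetical) is to run the real Dataset code end-to-end against a `local[*]` session in a test, which exercises the same encoder, reflection, and task-serialization paths the cluster would:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type and job under test.
case class Event(user: String, count: Long)

def totalsByUser(spark: SparkSession, events: Seq[Event]): Array[Event] = {
  import spark.implicits._
  events.toDS()
    .groupByKey(_.user)
    .mapGroups((user, rows) => Event(user, rows.map(_.count).sum))
    .collect()
    .sortBy(_.user)
}

// Running against local[*] forces the encoders and closures through the
// same serialization machinery a cluster uses, so reflection failures
// surface in the test run instead of in production.
val spark = SparkSession.builder()
  .appName("integration-test-sketch")
  .master("local[*]")
  .getOrCreate()
val out = totalsByUser(spark, Seq(Event("a", 2), Event("a", 3), Event("b", 1)))
spark.stop()
```

Wrapping this in a shared test harness that creates and tears down the local session once per suite keeps the feedback loop fast enough to run on every change.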
Q: If someone has an existing cluster they have been operating for several years with MapReduce, what do they need to consider and prepare for as they migrate to Spark?
A: Spark tends to be more memory intensive, so the configuration of hardware in a years-old MapReduce cluster often will not be id