As discussed in our last blog, Apache Spark has become the big thing in Big Data. As the largest open source community in Big Data, Spark enables flexible in-memory data processing and advanced analytics on the Hadoop platform. The roster of companies that have adopted Spark, and that contribute committers to the project, includes some of tech’s biggest names, such as Facebook, Netflix, Yahoo, IBM and Intel.
In the first installment of “The Big Data Insiders” — interviews with Big Data experts and industry influencers — Ian O’Connell, an accomplished Silicon Valley software engineer whose resume includes Twitter and online payments company Stripe, shares his insights about Spark.
Q: Why is Spark so popular and what key features do you like?
A: Interestingly, the reasons for Spark’s growing popularity have varied over time. Originally, having a scalable API that mapped onto MapReduce-style tasks was, by itself, a decent selling point. But the really big original driver was the ability to cache data in memory. This was a big shift from the likes of Pig, Hive and Cascading, where data was aggressively kept on disk and compressed at all times, and it allowed huge improvements in iterative algorithms. The first machine learning algorithmic pipeline I ported over saw a 50-fold increase in performance!
More modern-day Spark is a great ETL, SQL, ML and distributed execution platform. The key feature for me in my current workloads is the integration between Spark’s Dataset API and its SQL engine. It lets me easily swap between Spark’s high-performance SQL engine, where there is less type safety, especially at compile time, and type-safe Dataset containers. By exposing interfaces in terms of Datasets, I avoid a lot of the drawbacks of pure SQL code, where the lack of types makes sharing difficult and error-prone.
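A minimal sketch of the kind of interop described here, assuming Spark 2.x+; the `Click` and `UserTotal` case classes and all names are illustrative, not from the interview:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record types for illustration; any case class
// with an implicit Encoder works the same way.
case class Click(userId: Long, url: String, durationMs: Long)
case class UserTotal(userId: Long, totalMs: Long)

object DatasetSqlInterop extends App {
  val spark = SparkSession.builder()
    .appName("interop-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Start type-safe: Dataset[Click] is checked at compile time.
  val clicks = Seq(Click(1L, "/home", 120L), Click(1L, "/docs", 300L)).toDS()
  clicks.createOrReplaceTempView("clicks")

  // Drop into SQL for the aggregation (an untyped DataFrame)...
  val totals = spark.sql(
    "SELECT userId, SUM(durationMs) AS totalMs FROM clicks GROUP BY userId")

  // ...then expose a typed interface to downstream consumers with .as[T],
  // which fails fast if the SQL schema drifts from the case class.
  val typed = totals.as[UserTotal]
  typed.show()

  spark.stop()
}
```

The `.as[UserTotal]` call is where the two worlds meet: the SQL stays fast and concise, while consumers of the result see an ordinary typed Dataset.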
Q: What is the most interesting application of Spark that you have seen?
A: The most interesting application I’ve seen was actually more an illustration of Spark as a general distributed compute platform. In this application, at Yahoo, engineers perform ETL and use MLlib and custom ML libraries. Then they swap to using Spark as a generic compute platform, running deep learning algorithms with the Caffe library. This mixture of ETL and specialized algorithms is one of the more versatile and interesting applications.
Q: What are some lessons learned from using Spark in production?
A: Integration tests are very important when using Spark, more so than with some other scalable Big Data tooling such as Scalding. Spark’s Dataset and SQL-based code paths rely heavily on reflection, which has a downside for stable production jobs: failures only crop up at runtime on your cluster. As a result, a robust integration test environment that exercises serialization between tasks is important for keeping stability high.
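One way to surface these runtime failures early is a small local-mode check that runs the real pipeline end to end; a sketch, with a hypothetical `Event` type and pipeline standing in for the job under test:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical input record for the job under test.
case class Event(id: Long, payload: String)

object PipelineIntegrationCheck extends App {
  // Local mode still runs the same reflection-driven encoder machinery
  // and closure-serializability checks used on a real cluster, so many
  // encoder and serialization failures surface here instead of in production.
  val spark = SparkSession.builder()
    .appName("integration-check")
    .master("local[2]")
    .getOrCreate()
  import spark.implicits._

  val input = Seq(Event(1L, "a"), Event(2L, "b")).toDS()

  // The pipeline under test: any chain of Dataset transformations
  // whose encoders and closures we want exercised end to end.
  val output = input.filter(_.id > 1L).map(_.payload.toUpperCase)

  assert(output.collect().toSeq == Seq("B"), "pipeline output mismatch")
  spark.stop()
}
```

In a real project this body would live in a test framework such as ScalaTest, with a shared local SparkSession across cases to keep the suite fast.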
Q: If someone has an existing cluster they have been operating for several years with MapReduce, what do they need to consider and prepare for as they migrate to Spark?
A: Spark tends to be more memory intensive, so the configuration of hardware in a years-old MapReduce cluster often will not be ideal. Migrations are much easier on cloud-based platforms like AWS. However, regardless of the hardware platform, a stable and well-configured YARN setup is important.
Q: What are some of the more common mistakes you have seen Spark developers make?
A: Not being careful with serialization is a common mistake when working with JVM/Scala-based wrappers on Big Data technology. While premature optimization is often viewed in computer science circles as a terrible thing, serialization is unfortunately easier to optimize as you go than to retrofit later. Being mindful of this in systems like Spark can have a large impact on performance and stability.
Q: What are the most popular Spark ecosystem tools being used (Spark SQL, Spark Streaming, GraphX etc.)?
A: Spark SQL is the most popular ecosystem tool by a wide margin.
Q: Are people using object stores rather than HDFS?
A: S3 is often used by AWS users for non-transient storage. Ideally, machines in the cloud are replaced often, and using spot instances to reduce cost is also very powerful. Separating your storage from your compute platform by using S3 makes this, for the most part, very easy to do.
There are two big issues with S3, however, that do not come up with HDFS. S3 does not support a move (rename) operation, which is fundamental to how Hadoop output formats, also used in Spark, write data to the file system. The effect is that the committer winds up performing a copy operation in order to commit, rather than the essentially free move on HDFS. This can slow jobs down massively.
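One common mitigation, sketched below as a `spark-defaults.conf` fragment, is to switch to the version-2 file output committer, which commits task output directly rather than through a final rename pass; newer Hadoop/Spark builds also ship dedicated S3A committers that avoid the rename entirely:

```
# Commit task output directly instead of rename-on-job-commit
# (reduces, but does not eliminate, the copy penalty on S3).
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
```

Which option is appropriate depends on the Spark and Hadoop versions in use, so treat this as a starting point rather than a universal fix.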
The other, more insidious issue has to do with the consistency of directory listings on S3. If not controlled, it can mean downstream jobs read partial data from S3, resulting in corrupted outputs.
Q: What do you see for the future of Spark?
A: Spark is currently top dog in this space, though it is always competing with other generic Big Data platforms like Flink, or against specialist computing platforms along the lines of Timely Dataflow. Spark, however, has reached a critical mass of installed base, so it is hard to imagine it not being the new Hadoop for at least the next five years.