High-Performance Spark (PDF)

Best Practices for Scaling and Optimizing Apache Spark

This exclusive chapter from the O’Reilly book “High-Performance Spark” introduces Spark’s place in the big data ecosystem, explains how Spark programs are executed and how Spark models parallel computing, and builds your general understanding of this open source technology.

Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop. As we will discuss in this chapter, Spark’s design principles are quite different from MapReduce’s, and unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ. However, Spark’s internals, especially how it handles failures, differ from many traditional systems. Spark’s ability to leverage lazy evaluation with in-memory computations makes it particularly unique. Spark’s creators believe it to be the first high-level programming language for fast, distributed data processing. Understanding the general design principles behind Spark will be useful for understanding the performance of Spark jobs.
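To give a feel for what lazy evaluation combined with in-memory computation looks like in practice, here is a minimal sketch in Spark’s native Scala API. The input path and the filter predicate are hypothetical, chosen only for illustration; transformations such as filter build up a plan without doing any work, and an action such as count triggers execution, after which cached data is reused from memory.

    import org.apache.spark.sql.SparkSession

    object LazyEvalSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lazy-eval-sketch")
          .master("local[*]") // run locally for illustration
          .getOrCreate()
        val sc = spark.sparkContext

        // Transformations are lazy: nothing is read or computed yet.
        val lines  = sc.textFile("input.txt")          // hypothetical path
        val errors = lines.filter(_.contains("ERROR")) // still no work done

        // cache() marks the RDD to be kept in memory once it is computed.
        errors.cache()

        // Actions trigger execution. The first count() reads the file and
        // materializes `errors` in memory; the second reuses the cached data.
        println(errors.count())
        println(errors.count())

        spark.stop()
      }
    }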

Download Now