Data Analytics with Hadoop: A Practical Guide to Understanding Data Science and Analytics

The term big data has come into vogue for an exciting new set of tools and techniques for modern, data-powered applications that are changing the way the world computes. Much to the statistician’s chagrin, this ubiquitous term seems to be liberally applied to include the application of well-known statistical techniques on large datasets for predictive purposes. Although big data is now officially a buzzword, the fact is that modern, distributed computation techniques are enabling analyses of datasets far larger than those typically examined in the past, with stunning results.

Distributed computing alone, however, does not directly lead to data science. Through the combination of rapidly growing datasets generated from the Internet and the observation that these datasets are able to power predictive models (“more data is better than better algorithms”1), data products have become a new economic paradigm. Stunning successes of data modeling across large heterogeneous datasets (for example, Nate Silver’s seemingly magical ability to predict the 2008 election using big data techniques) have led to a general acknowledgment of the value of data science, and have brought a wide variety of practitioners to the field.

Hadoop has evolved from a cluster-computing abstraction to an operating system for big data by providing a framework for distributed data storage and parallel computation. Spark has built upon those ideas and made cluster computing more accessible to data scientists. However, data scientists and analysts new to distributed computing may feel that these tools are programmer oriented rather than analytically oriented.

This is because a fundamental shift needs to occur in thinking about how we manage and compute upon data in a parallel fashion instead of a sequential one. This book is intended to prepare data scientists for that shift in thinking by providing an overview of cluster computing and analytics in a readable, straightforward fashion. We will introduce most of the concepts, tools, and techniques involved with distributed computing for data analysis and provide a path for deeper dives into specific topic areas.