Hadoop, the distributed processing and storage platform whose creation was spearheaded a decade ago by my former Yahoo colleagues, Doug Cutting and Eric Baldeschwieler, is a watershed innovation. It credibly holds the promise to change everything: how we collect, store, and analyze data on a technical level, as well as how we engage with each other, understand the past, and predict the future.
By now, everyone knows the story of how Doug and Eric crafted Hadoop out of the Nutch project while at Yahoo. We’ve all heard how big data analysis will allow us to better engage with customers, improve ROI, pioneer new research and discovery, and make smarter decisions faster — in short, to innovate. And in many ways, Hadoop and its sprawling ecosystem have delivered on that promise by tirelessly inventing and evolving new tools and techniques to help us ingest data faster, analyze it better, and store it more quickly and reliably than ever before. Before Hadoop, it was impossible to do any meaningful analysis at scale without being an expert programmer in distributed systems (or an engineer), because there was no platform that could gracefully coordinate tens, hundreds, or thousands of computers’ resources towards answering constantly changing questions using data sets of such volume, velocity, and variety. Things have definitely changed with the birth of Hadoop.
But there’s also a story of how things have remained the same, how we risk not realizing that bright future of widespread, cheap, data-fueled innovation. The ghost in the room: performance.
Back when I was at Yahoo, before “data science” was even a phrase, my Search Technology team had the privilege of becoming the first alpha user of the earliest production Hadoop environment ever deployed. That inaugural year (2006), the Hadoop cluster started life with a mere 10 nodes — and we were thrilled if Hadoop ran continuously for a full day. My team had huge volumes of web crawler data to crunch: page content and links and enrichment data (useful for ranking search results). We threw it all at Hadoop, looking to leverage that data to improve our search results, which accounted for half of Yahoo’s revenue at the time. The cluster gradually grew to over a thousand nodes by 2008, and it acquired more users and thus, more and varied workloads. Everyone from data scientists looking to run tests and analyses to techies looking for somewhere to store laptop backups wanted to use the cluster — and we thought that was a mighty fine thing.
But one day, something unexpected happened: without any warning, the entire search functionality of Yahoo’s website suddenly went offline. Pagers went off. Phones rang. There was a cacophony of people rapidly logging into consoles and jabbing at their keyboards. Finally, after agonizing minutes of troubleshooting, we realized that, somehow, Hadoop was implicated. We literally ran down two flights of stairs to the 6th floor of Yahoo’s Mission College campus, where the Hadoop team sat, barged into their offices, and frantically asked them to stop, stop, stop whatever it was they were doing. We didn’t know what had happened; all we knew was they had to turn Hadoop off now.
The post-mortem will look familiar to many Hadoop operators today: a Yahoo Hadoop engineer had decided to run an innocuous job on the cluster — and that job had innocently sucked up all of some resource — in this case, the available network bandwidth — literally crippling Yahoo search. In short, resource contention on a busy cluster running many different types of applications and workloads caused a performance wall to be hit. Recognize that issue?
Therein lies an Achilles’ heel that has persisted for ten years: with distributed systems like Hadoop and YARN, a performance wall is inevitably hit. This is because these platforms are designed to gracefully schedule and start applications on their respective clusters, but not to gracefully manage those same applications while running. In short, Hadoop is powerless to control active jobs’ performance.
The ramifications of this are profound: stalled innovation. Unexpected resource contention that organically arises and bottlenecks on cluster performance that cannot be overcome through manual tuning, better planning, or passive tools. By definition, tuning is always a response to problems that have already happened (i.e., you’re fixing yesterday’s problems today instead of this moment’s problems right now). It’s humanly impossible to make the thousands of decisions necessary each second, across all nodes and processes, to truly optimize cluster performance.
Sure, you could dedicate a single cluster to a single job — but is that really scalable or efficient?
What’s needed is active, intelligent software making real-time decisions tailored to ever-changing cluster conditions. Without this, organizations looking to operate distributed systems “at the speed of business” cannot escape velocity and will inevitably hit a performance wall that will stop innovation in its tracks.