As co-founder of a Hadoop software company in Silicon Valley, I have the privilege of spending time on a daily basis with companies that are on the cutting edge of big data analytics. It’s exciting to be part of an industry that is advancing so quickly, and fascinating to learn about new and amazing use cases.
One of our customers that’s doing truly transformational things with big data is Opower. By combining data management, insightful analytics, and behavioral science, Opower’s customer engagement platform positions utilities as trusted energy advisors to the customers they serve. To date, the Opower platform has created enough energy savings through behavior change to power all the homes in a city of 1 million people for a year. As a Bay Area resident who is served by PG&E, I’m actually an Opower user myself.
Opower uses Hadoop as the infrastructure foundation for its data warehouse and its operational data store. The Software as a Service (SaaS) company relies on Hadoop to deliver big data analytics to its more than 95 utility partners, including 28 of the 50 largest U.S. electric utilities, and over 50 million household and business customers in nine countries. The multi-tenant production environment runs mixed workloads that include Hive queries, MapReduce jobs, and HBase serving data in real time to end users.
As with many initial Hadoop deployments, stability was a big challenge for Opower in the early days. Jobs would compete with each other for physical resources. When cluster capacity was exceeded, the Hadoop cluster would experience cascading failures and critical applications would become unavailable.
When we first started working with Eric Chang, Data Infrastructure Technology lead at Opower, he and his team were taking a number of steps to mitigate these issues, including extensive capacity planning, cluster tuning, prioritized job scheduling, and the careful curation and testing of jobs before release to production. Even with these best practices in place, Eric saw a need to do more.
The volume of energy meter data, third-party data feeds, and event data was growing rapidly. At the same time, the number of users and the diversity of workloads continued to increase. In the face of these challenges, meeting SLAs and keeping hardware costs under control were critically important.
More visibility, control, and capacity
Opower began using Pepperdata Supervisor in the spring of 2014. Installation took less than an hour, with no modifications to Opower’s schedulers, workflow, or jobs.
Opower’s Hadoop administrators can now monitor every facet of cluster performance in real time. Visibility into CPU, memory, disk I/O, and network usage by job, task, user, and group makes it easier to identify potential performance problems early and take preventive action.
When performance bottlenecks do occur, Opower’s Hadoop administrators are able to quickly diagnose and fix the problem. Troubleshooting activities that used to take days are now typically completed in a matter of minutes.
Opower also uses Pepperdata Supervisor to dynamically adjust the resources allocated to each job according to its service level priority, ensuring that the clusters devote sufficient resources to the right jobs. As a result, critical jobs now run faster, more reliably, and more efficiently on Opower’s existing servers, allowing the infrastructure team to scale services with fewer hardware resources.
Opower’s story is relevant to any company using Hadoop in production. The operational challenges Opower has experienced are typical, and the steps the team has taken to overcome them have been very effective. The story illustrates how Pepperdata makes Hadoop more reliable, even when jobs and queries are efficiently written and the cluster is highly tuned.