How does a CTO determine which data should go to hot cloud storage, and which to cold data storage? How can companies control their growing amount of data, and the escalating costs that come with it? And how can data storage methods benefit from observability solutions and tools?

These questions, and more, were addressed during our webinar, “What Does a CTO Do When a 60PB Hadoop Cluster Devours the IT Budget?” It featured Chuck Yarbrough, Senior Director of Product Marketing at Hitachi Vantara, and our very own Pepperdata Field Engineer Alex Pierce. Here is a look at some of what was discussed.

The Challenges of Data Storage Methods

For the better part of his career, Yarbrough has run into challenges with data storage methods. He recalls that during his earlier stint as a data warehouse manager in Silicon Valley, deciding how much data to make available to apps and developers was a constant back-and-forth.

“Business users always wanted more. Inevitably we’d say, ‘You could only have three years’ worth of data, and that’s it.’ And they’d be like, ‘Well, no, I need seven.’ And you argue about it, but the reality is there were limitations,” Yarbrough says.

Big data and the advent of Hadoop enabled the industry to move well beyond the limitations of prior architectures. However, this advancement brought another challenge: cost optimization.

“It [Hadoop and big data] enabled a mass change in the industry so that we could scale to areas that we hadn’t been able to before,” says Yarbrough. “But that leaves us here now, where we’re talking about lots of data, 60 petabytes of data, literally eating the budget. The cost began to get pretty big,” he adds.

Determining data temperature has also become a challenge. In the past, when data became “cold” (infrequently accessed), companies with limited storage capacity would simply move that data offline, often deciding based on the age of the data alone.

“Years ago, when I was managing data warehouses,” Yarbrough shares, “a lot of times we would just pick a year [and say] ‘that’s old data.’ Well, now, that’s not necessarily cold data. It might be older, but it’s not necessarily cold,” he adds.

Moreover, given the sheer volume of data being generated and the amount of storage available in these large-scale Hadoop environments, offloading (optimizing a data warehouse or data lake) can become a headache.

Solving Hot Cloud Storage Issues (and Other Big Data Dilemmas)

So how can companies solve this trifecta of data availability, escalating costs, and identifying which data belongs in hot cloud storage?

“What we really need,” says Alex Pierce, one of our resident field engineers here at Pepperdata, “is to move past monitoring into what we call observability.”

By proactively using Hadoop’s capabilities, such as its file system (HDFS), big data teams can achieve observability. “One of the things that Hadoop’s filing system allows us to do is an analysis of the metadata,” Pierce explains. “It helps us determine whether this data has been accessed recently and we know it’s going to be used quite frequently.”
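
To make the idea of metadata-based temperature analysis concrete, here is a minimal Python sketch. It is not Pepperdata's implementation. It assumes you have already exported each file's size, modification time, and last access time from the file system's metadata (in HDFS, access-time tracking is governed by the `dfs.namenode.accesstime.precision` setting). The `FileRecord` type, the function names, and the 30-day and 180-day thresholds are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; tune them to your own access patterns.
HOT_WINDOW = timedelta(days=30)    # accessed within 30 days: hot
COLD_WINDOW = timedelta(days=180)  # untouched for 180 days or more: cold


@dataclass
class FileRecord:
    """One row of exported file metadata (hypothetical schema)."""
    path: str
    size_bytes: int
    modified: datetime
    accessed: datetime


def temperature(record: FileRecord, now: datetime) -> str:
    """Classify by time since last access, not by the age of the data."""
    idle = now - record.accessed
    if idle <= HOT_WINDOW:
        return "hot"
    if idle >= COLD_WINDOW:
        return "cold"
    return "warm"


def bytes_by_temperature(records, now: datetime) -> dict:
    """Sum the stored bytes in each temperature tier."""
    totals = {"hot": 0, "warm": 0, "cold": 0}
    for record in records:
        totals[temperature(record, now)] += record.size_bytes
    return totals


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    records = [
        # Written years ago but read last week: old, yet still hot.
        FileRecord("/data/old_but_busy", 100, datetime(2017, 1, 1), datetime(2023, 12, 25)),
        # Written last year and not read since: cold.
        FileRecord("/data/untouched", 50, datetime(2023, 1, 1), datetime(2023, 1, 2)),
    ]
    print(bytes_by_temperature(records, now))
```

Note that the classification keys on the last access time rather than the modification time, which is what lets an old data set still count as hot if it is read often.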

The Pepperdata Analytics Stack Performance Suite uses observability to provide clarity into big data performance, which helped the major bank discussed in the webinar.

“One of the things I love about Pepperdata is this idea of observability,” Yarbrough says. “For this very large bank, they are literally out of control from a cost perspective, with 60 petabytes [of data]. But Pepperdata applied these reporting capabilities to observe and understand exactly what was going on with their data. That way, we could determine what the right data to identify by temperature was,” he explains.

The Pepperdata temperature analysis done on the entire cluster also revealed over 10 petabytes of data that should be in cold data storage. “A difference of 10 petabytes worth of data is not a small chunk of change at all,” Yarbrough stresses. “Especially if you have everything in your environment stored on high-performance storage—SSD or fast drives—internal systems,” he explains further.
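
As a rough illustration of why 10 petabytes matters, the arithmetic can be sketched in a few lines of Python. The per-terabyte prices below are made-up placeholders, not figures from the webinar or from any vendor, and the function name is hypothetical; substitute your own rates.

```python
TB_PER_PB = 1000  # decimal convention; an assumption for this sketch


def monthly_savings(petabytes_moved, hot_price_per_tb, cold_price_per_tb):
    """Estimate the monthly saving from moving data from hot to cold storage."""
    terabytes = petabytes_moved * TB_PER_PB
    return terabytes * (hot_price_per_tb - cold_price_per_tb)


if __name__ == "__main__":
    # Placeholder prices per TB per month (hypothetical, for illustration only).
    estimate = monthly_savings(10, hot_price_per_tb=20.0, cold_price_per_tb=4.0)
    print(f"Estimated monthly saving: {estimate:,.0f} (in your currency)")
```

Even a modest price gap per terabyte adds up quickly when multiplied across 10,000 terabytes.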

In terms of escalating costs, the Pepperdata Product Suite can also perform the following optimizations:

Reaping the Benefits

“Probably the most important element here is the business benefit,” says Yarbrough. Companies wish to gain control of their costs without impacting their business.

Data sets continue to grow exponentially, so companies need to make sure they keep delivering value to their customers. In this case, the bank wanted to use its hot cloud storage and cold data storage responsibly, on top of the applications and machine learning tools that help it do more intelligent work on that data.

“That’s the benefit that they got, ultimately increasing their agility and getting a greater return on their data,” emphasizes Yarbrough. “It’s not just return on investment, which definitely improved, but a return on their data. So they’re getting more. They’re doing more.”

With tools like the Pepperdata Analytics Stack Performance Suite, companies can optimize costs while improving their app and infrastructure performance. Pepperdata solutions allow you to close visibility gaps, maximize current infrastructure, and recapture wasted resources.

“Every company has its own practical challenges, technologies, and processes. However, at the end of the day, what matters is what we can actually bring to the customer. Ultimately, companies that can get their budget under control allow themselves the opportunity to focus on the primary mission—delivering business value.”

If you want to learn more about how the 60-petabyte Hadoop cluster issue was solved, and how Pepperdata helped the bank separate data that belongs in hot cloud storage from cold data, check out the full webinar: “What Does a CTO Do When a 60PB Hadoop Cluster Devours the IT Budget?”

