Webinar: Proven Approaches to Hive Query Tuning

Our “Pepperdata Profiles” series shines a light on our talented individuals and explores employee experiences. This week, we chatted with Justin Ng, our resident data scientist. Having worked at Pepperdata for more than two years, Justin shared his thoughts on the future of data analytics, the cloud data management challenges that Pepperdata solves, and what people often get wrong about the data analytics industry.

Hey, Justin! How did you start with your career in data science?

Before I joined Pepperdata, I had around five years of experience in the industry. I used to be more of a software developer or software engineer type, but I was always interested in looking at data a lot. So I decided to have a career change of sorts. I basically went back to school to do my graduate studies in statistics, in the hope of getting into any sort of field in the data analytics industry, particularly in data science. After I finished my master’s degree in statistics, I went off and I started working with a focus on the data analytics industry, more than software engineering.

Interesting. And how did you come across Pepperdata?

A recruiter called me. I had worked for large corporations in Canada before, and what drew me in with Pepperdata was its appeal as a startup. It was something I was interested in experiencing. The other thing was the industry: I was particularly keen on looking at different sorts of data that I hadn’t seen before, the type of data that Pepperdata collects.

Really? What were the “new” types of data that you were hoping to come across?

Well, before Pepperdata, I had been working with large corporations, right? A lot of those were banks and, also, a telecom company. So the data I’d often see were transactions: how many people withdrew money, how much people were spending, stuff like that. Pepperdata, by comparison, tackles a different animal. Stuff like memory usage, resource usage, and job types: higher-frequency data that we can potentially use to help customers run their data-intensive workloads more effectively. I just thought that was a nice change.

Yeah, the data science world is very cool, very modern. But tell us, Justin: What do people often get wrong about it?

People always hear about the more “sexy” types of projects in the analytics industry. Things like artificial intelligence, or whatever new thing there is. But for most problems, I’ve found that a lot more work goes into retrieving the data, preparing it, and trying to productionize it than into actually building some really intelligent models to do stuff with it.

So there’s a lot of, I guess you can say, mundane work, which takes up most of the time. It’s that type of work that most people wouldn’t consider very “fun” or interesting. But it’s definitely something that you have to deal with before you can do any sort of analysis for most problems.

In reality, data science is crunching enormous data sets and doing quite complex statistical stuff. So if you actually want to do anything, you have to do the boring work first.

Another thing about the analytics industry and sciences is that you don’t actually control the outcome. You’re not sure what you’re going to get before you actually do it. There’s always a little bit of uncertainty as to what’s going to happen, in terms of what your results are going to be. It’s a lot different than building a piece of software: you know what your goal is and what results you want to get. But for data science, it really depends on the data. And often, you’re not sure if there’s some randomness in what you’re going to get out of it.

That’s really interesting. And what would you say are the unique data science challenges that Pepperdata helps with?

A lot of people and organizations have been in this business of constantly collecting data, without a lot of focus on what they’re going to do with it. That’s where data scientists like me come in and find the real opportunities.

In terms of where Pepperdata fits into that, it becomes the foundation for making sure that customers are able to get access to the data in ways that they can analyze. That’s pretty important. For example, in banking: are they able to quickly get loans to people who need them? Can they anticipate what the customer’s spend is going to be? All that stuff is table stakes. So Pepperdata makes sure that the right information is collected and initially processed so guys like me in the data analytics industry can do our work.

How has your work changed ever since you started at Pepperdata? And where do you see the future of data analytics going?

The big difference from when I started to now is that we’re doing more with the cloud. Before, it was all on premises, and everyone was more worried about errors and malfunctioning hosts or computers, or about whether one of their apps was using up too much memory.

Today, with cloud data management, those are not as important. In the cloud, if one of the nodes or machines fails, you start up another one. Moreover, customers are often only running a single job at a time, so ephemeral clusters are started up to run one particular job.

I’ve found that I needed to adjust some of the metrics that we track in some of our big data industry reports to develop more cloud-related insights. Things like, “What are the best instance types to use for certain jobs?” “How many resources can we predict a job is going to use?” “How can we use that to do autoscaling more effectively for customers?”

We’ve also started tracking costs more precisely because each instance has a defined cost. Getting those details exactly right wasn’t as important before, when on-premises infrastructure dominated the big data industry. Runtime is also more important now, in the cloud, because the infrastructure is more prone to runaway costs than on-prem.

There’s a transition away from on-prem stuff, like managing hardware, CPU time, and memory. Now it’s a little bit more focused on virtual instances: People are paying per resource used.
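
To make the cost point concrete, here is a minimal Python sketch of the kind of per-run cost arithmetic that pay-per-use pricing implies. It is only an illustration under simple assumptions: the `ClusterRun` class, the `run_cost_usd` function, the instance name, and the hourly price are all hypothetical, and real cloud billing (per-second rounding, discounts, storage, data transfer) is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterRun:
    """One run of a job on an ephemeral cluster (all fields are illustrative)."""
    instance_type: str       # hypothetical instance name
    hourly_price_usd: float  # assumed flat price per node per hour
    node_count: int
    runtime_hours: float


def run_cost_usd(run: ClusterRun) -> float:
    """Estimate cost as price per node-hour x number of nodes x runtime."""
    return run.hourly_price_usd * run.node_count * run.runtime_hours


baseline = ClusterRun("example.large", 0.50, 10, 2.0)
slower = ClusterRun("example.large", 0.50, 10, 3.0)

print(run_cost_usd(baseline))  # 10.0
print(run_cost_usd(slower))    # 15.0
```

Under these assumptions, the same job on the same cluster costs 50% more when it runs for three hours instead of two, which is one way to see why runtime carries more weight in the cloud.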

Anything else that adds to the complexity of the cloud, aside from cloud data management?

Well, I guess I would say that cost metrics are definitely more at the forefront now. On premises, companies often have a good view of what their costs are. But now, they have less visibility in the cloud, so those cost metrics have to be more precise.

The other thing is autoscaling, which adds new nodes to meet the demands of a particular job or a particular cluster. So there’s that factor of “how many nodes does a job really need?” That in particular is more complex in the cloud than it was on premises, where you had more of a fixed cluster, a fixed number of nodes that you didn’t really have to think about as much.

So I would say autoscaling and the desire to see costs in detail are just a few of those complex things.
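
The question “how many nodes does a job really need?” can be sketched, very roughly, as a capacity calculation. The Python below is a simplified illustration, not a description of how Pepperdata or any autoscaler actually works: the `nodes_needed` function and its parameters are hypothetical, and it sizes on memory alone, whereas a real decision would also weigh CPU, I/O, and scheduling overhead.

```python
import math


def nodes_needed(job_memory_gb: float, usable_memory_per_node_gb: float) -> int:
    """Return the smallest node count whose combined memory covers the job.

    Both arguments are assumed estimates; at least one node is always returned.
    """
    if job_memory_gb < 0 or usable_memory_per_node_gb <= 0:
        raise ValueError("memory figures must be non-negative and node memory positive")
    return max(1, math.ceil(job_memory_gb / usable_memory_per_node_gb))


print(nodes_needed(500, 64))  # 8
print(nodes_needed(64, 64))   # 1
```

On a fixed on-premises cluster this number is effectively decided in advance; with autoscaling it has to be predicted or adjusted for each job.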

See how Pepperdata makes autoscaling more efficient with Pepperdata Capacity Optimizer.

What do you find most enjoyable about the Pepperdata company culture, even during the pandemic? And what would it take to succeed at Pepperdata, for anyone who’s interested?

The reason why I like Pepperdata, first and foremost, is because of the people I work with. I’ve read somewhere that people don’t quit companies, they quit teams. And I really like all the people at Pepperdata. Everyone’s so nice, supportive, and easygoing.

That was a big factor for me in choosing and staying with Pepperdata. In fact, this is actually the longest I’ve stayed at one job, even though it’s only been two years and a handful of months. Before, I hopped around a little bit, and that was also because of the people.

As for what it takes to succeed, I guess it is a little bit different than working at a larger company. So it’s more about the work, the value you bring to the company, rather than office politics, or anything like that. You just have to get your work done and get along with others, which isn’t hard to do. And ask questions, because everyone is really helpful. That’s basically what you need in order to succeed here.

For more profiles of our Pepperdata stars, read part one of our Pepperdata Profiles series.

The views expressed on this blog are those of the author and do not necessarily reflect the views of Pepperdata. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by Pepperdata.