In today’s big data world, Apache Spark technology is a core tool. However, Spark is very complex, and it can present a range of problems if unoptimized. Without the right approach to Spark performance tuning, you put yourself at risk of overspending and suboptimal performance.
In a recent webinar, Alex Pierce, a Pepperdata field engineer, dispensed some valuable knowledge regarding how to optimize Spark jobs and successfully perform Spark performance tuning. Since some of the topics covered gathered quite a bit of interest, we wanted to delve a bit deeper into them. Alex expands upon salting and what it means to be a good tenant in a multi-tenant environment in this ten-minute interview. Give it a listen, or read the full transcript below.
Kiana: Hi everyone. I’m your host Kiana with Pepperdata, and I’ll be interviewing Alex Pierce, the Pepperdata field engineer who led our recent webinar, Best Practices for Spark Performance Management. If you haven’t had a chance to watch that webinar, it’ll be linked on the page this interview is on. So, feel free to go check it out. Now, let’s get right into the questions.
Kiana: During the webinar, we got quite a bit of interest in using a salt to fix partition sizes and data skew. Could you expand upon how salting works and how someone could use it to better manage their Spark performance?
Alex: For sure. When you’re looking at what you’re trying to do, let’s specifically look at joins in this case, just because that’s a very common use case in Spark SQL. But this is anytime you’re dealing with data sets where you have a particular dimension. Let’s say you’re dealing with months of the year, days of the week, or something similar as a dimension. That’s a pretty small keyspace. There are only seven days in a week, only 12 months in the year. And let’s say you’re a type of business or something where the vast majority of records happen on a Saturday.
So when we go through to process the data, and let’s say we’re doing a month’s worth of data and we’re doing a join on this data, whatever task is stuck doing the join between the data set and the dimension table on Saturday is going to run much longer than the other tasks. So what salting does is kind of like repartitioning without actually needing to repartition your data. Basically, what we do is we take the key we’re going to join on in, let’s say, our left table, and we make it more uniformly distributed.
And the way we do that, the easiest way I should say, is to append a random number somewhere between 0 and N. You can determine how large N should be based on the size of your environment, the size of your data set, and the scale you need to look at. And then we need to do the same thing on the other side of our join. So now we need to take the dimension table, I mean, sorry, we need to take the data set table and, where those IDs did exist before, we need to run through the same thing on that ID set to append the same 0-N values randomly to those keys.
Now, that does mean N needs to match on both sides. If one side had salt values that did not match the other side, there would definitely be problems. But at this point, we can now do a join using these salted keys, and let’s say in our weekday case, instead of having seven keys, we now have 47 keys. So we’ve now distributed that amongst a significantly larger space.
What this means is, come time to actually do the join, instead of having one particular executor doing 80% to 90% of the work because of the data set skew, the work is going to be better distributed. Now you’re going to need to test with your data set exactly what size of a salt works best for you, and you do need to remember, if you happen to be using broadcast tables, that your salt is going to increase the size of that dimension table.
So if you’re using a broadcast table you need to keep an eye on your memory to make sure you don’t blow up the executors, and you just need to adjust. It can take some experimentation; you know your data set best so you know how heavy your skew is, and you can oftentimes visualize that within tools like Pepperdata to understand exactly how large of a salt space to add. But typically, you will see a noticeable increase in performance and definitely in terms of parallelization.
So if you’re in a distributed environment, whereas before maybe there were a thousand hosts in your environment but you were only using seven hosts because of your limited keyspace, you can now run it on 47 or 50. All of a sudden you’re using the environment’s resources better, and you’re not causing a bottleneck, perhaps a CPU bottleneck, on one of the nodes for extended periods. It’s just a nice way to deal with data that has a limited keyspace.
Now, as for the actual code for it, there are tons of examples out there, even just looking at things like DataZone or Stack Overflow. You should be able to find examples of how to do a salt on a table in Spark very simply. All right. So there were more questions?
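The salting idea discussed above can be sketched without a cluster. The following is a minimal pure-Python illustration, not Spark code and not necessarily the exact procedure Alex has in mind. It follows one common formulation: each row on the skewed side gets one random salt, and each key on the small dimension side is replicated once per salt value so that every salted key still finds a match. The names `NUM_SALTS` and `NUM_PARTITIONS`, the CRC32-based partitioner, and the 80% Saturday share are assumptions chosen for illustration.

```python
import random
import zlib
from collections import Counter

NUM_SALTS = 8        # hypothetical N; tune it against your own data
NUM_PARTITIONS = 16  # hypothetical number of shuffle partitions


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (not Spark's actual one)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


def salt_fact_rows(rows, rng):
    """Skewed side: append one random salt in [0, NUM_SALTS) to each join key."""
    return [(f"{day}_{rng.randrange(NUM_SALTS)}", value) for day, value in rows]


def explode_dimension(dim):
    """Small side: replicate every key once per salt so each salted key matches."""
    return {f"{day}_{s}": label for day, label in dim.items() for s in range(NUM_SALTS)}


def largest_share(keys):
    """Fraction of all rows that land in the busiest partition."""
    counts = Counter(partition_of(k) for k in keys)
    return max(counts.values()) / len(keys)


rng = random.Random(42)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
dim = {d: d in ("Sat", "Sun") for d in days}  # toy dimension: is_weekend

# Skewed fact data: roughly 80% of rows fall on Saturday.
facts = [("Sat" if rng.random() < 0.8 else rng.choice(days), i) for i in range(10_000)]

# Unsalted join: every Saturday row is handled by the same partition.
plain_join = [(day, value, dim[day]) for day, value in facts]

# Salted join: same result, but the Saturday rows are spread over many keys.
salted_facts = salt_fact_rows(facts, rng)
salted_dim = explode_dimension(dim)
salted_join = [(key.rsplit("_", 1)[0], value, salted_dim[key]) for key, value in salted_facts]

before = largest_share([day for day, _ in facts])
after = largest_share([key for key, _ in salted_facts])
print(f"busiest partition share: {before:.1%} unsalted, {after:.1%} salted")
print(f"dimension rows: {len(dim)} unsalted, {len(salted_dim)} salted")
```

The dimension table grows by a factor of `NUM_SALTS`, which is the memory cost to watch when that table is broadcast.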
Kiana: Yes, and thank you for that answer. That was great. So, you also mentioned that one best practice for Spark performance management was, in a multi-tenant environment, to be a good tenant. What does that mean exactly? And do you have any tips people might not have thought of yet?
Alex: Sure. So this one’s interesting. Part of it is understanding the scale of the environment you’re working in, and part of it is understanding the queue limitations for where you’re launching, but the idea is: Spark is greedy. Let’s say that you’re doing something, even super simple, like the SparkPi example that comes with Spark, and you ask for a hundred thousand slices. Now, Spark is going to ask for a hundred thousand executors. It’ll run just fine if it gets 40, but it’ll keep asking until it gets everything it can.
So, one thing you can do to be a good tenant is also set a max value on your ask. Let’s say, I want to run one hundred thousand slices, I want to use Spark dynamic allocation, but don’t ask for more than 100 executors—which we know will give us the performance we need, but will leave resources available for other users while allowing us to meet any sort of SLA. So that’s one very simple example of how to be a good tenant with your Spark.
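One way to express that cap is through Spark’s dynamic allocation settings. The sketch below only assembles a `spark-submit` command line in Python; it does not launch anything. The configuration keys are standard Spark properties, while the jar name `spark-examples.jar`, the minimum of one executor, and the choice of the external shuffle service are assumptions for illustration. Depending on the cluster manager and Spark version, shuffle tracking may be used instead of the shuffle service.

```python
import shlex

# Cap dynamic allocation so the job never requests more than 100 executors.
good_tenant_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",      # assumed floor
    "spark.dynamicAllocation.maxExecutors": "100",    # the cap from the interview
    "spark.shuffle.service.enabled": "true",          # assumed cluster setup
}


def spark_submit_args(conf):
    """Turn a dict of Spark properties into repeated --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


command = [
    "spark-submit",
    *spark_submit_args(good_tenant_conf),
    "--class", "org.apache.spark.examples.SparkPi",
    "spark-examples.jar",  # hypothetical path to the examples jar
    "100000",              # number of slices, as in the interview
]

print(shlex.join(command))
```

Without `spark.dynamicAllocation.maxExecutors`, the upper bound defaults to infinity, which is the greedy behavior described above.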
Another way is to think about how you’re sizing things. So if your data set can be broken down further, and once again this depends on your knowledge of your own data set, it might be more beneficial to the environment to, instead of asking for a handful of 90 gig or 100 gig executors—that does sound ridiculous but we do see this out there—ask for 10 to 20 gig executors, and break your data set down further.
It’s probably going to be beneficial for you, because you’re more likely to get those executors on the system, and it’s definitely gonna be beneficial for everybody else who’s trying to use that same system. Because if you manage to launch a hundred gig executor on a node, that’s usually over 50% of the space in a node, and sometimes that could even be 70% of the space in a node. So one, you’re going to have to wait for that space to free up and two, once you’re on there nobody else is getting workload on there. So it’s always better if you can break your data set down to try to size what’s going to fit the environment and allow other people to work at the same time.
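The sizing argument can be checked with simple arithmetic. The sketch below assumes a hypothetical node with 160 GiB available for containers and assumes Spark’s documented default executor memory overhead (10% of executor memory, with a 384 MiB minimum). Real clusters will differ, so treat the numbers as illustrative.

```python
GIB = 1024  # work in MiB to keep the arithmetic in integers

NODE_MEMORY_MIB = 160 * GIB  # hypothetical memory a node offers to containers


def container_size_mib(executor_memory_mib: int) -> int:
    """Executor heap plus the assumed default overhead (10%, at least 384 MiB)."""
    overhead = max(384, executor_memory_mib // 10)
    return executor_memory_mib + overhead


def executors_per_node(executor_memory_mib: int) -> int:
    """How many executors of this size fit on one node."""
    return NODE_MEMORY_MIB // container_size_mib(executor_memory_mib)


def stranded_mib(executor_memory_mib: int) -> int:
    """Memory left over that is too small to hold another executor of this size."""
    used = executors_per_node(executor_memory_mib) * container_size_mib(executor_memory_mib)
    return NODE_MEMORY_MIB - used


for size_gib in (100, 16):
    size = size_gib * GIB
    print(
        f"{size_gib} GiB executors: {executors_per_node(size)} per node, "
        f"{stranded_mib(size) / GIB:.1f} GiB left over for this job's executors"
    )
```

Under these assumptions, a single 100 GiB executor takes most of the node and has to wait for that much contiguous space, while smaller executors can be scheduled into whatever gaps other tenants leave.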
That’s another one that’s maybe a little bit more difficult, but still not too hard to do. I mean, if you’re working on binary blob data sets and they only come to you in a certain size, there’s not a lot you can do. Almost everything else can be improved. Sometimes even by, like our last question, salting, because maybe you have one executor that’s blowing out all this memory because that’s where all the data is.
Instead of fixing the skew problem, you’ve been just increasing the memory until it ran. So that’s one good way to fix that. Same thing on the core side. There’s only so much CPU power to go around, and if your code is multi-threaded, sometimes you’re going to use more than those cores that you asked for. So, just keep in mind what resources are available and that other people are using it, and make sure you make smart decisions that are both going to help you fit into those resource-constrained environments and allow everybody else to still use them while you are.
Those would be the two just off the top of my head. I’d say just doing that is going to give you a huge amount of performance improvement in general and in the environment. And many times it might even be a performance improvement to your own application, if you discover that huge resource asks mean you spend a lot of time waiting in the queue.
Kiana: Ok, well, thank you, Alex, for your time. It’s great to delve a bit deeper into some of the topics that you touched on in the webinar.
And again, to our listeners, if you’d like to watch the full webinar, Best Practices for Spark Performance Management, it’s linked on the page this interview is hosted on. Thanks for listening.