Many of us in the big data world are already familiar with Spark. But newcomers may be wondering: What is Spark? Even if you’re a user, there are a lot of Spark performance tuning tips around the internet. How do you sort the wheat from the chaff?
Spark is an open-source, distributed processing framework designed to run big data workloads much faster than Hadoop MapReduce. Spark relies on in-memory caching and optimized query execution to run fast queries against data of any size.
In today’s big data world, Spark technology is a core tool. However, it is very complex, and it can present a range of problems if not properly optimized. Without the right approach to Spark performance tuning, you risk overspending on resources and running jobs that perform poorly.
What is Spark Performance Tuning?
Spark performance tuning is the process of adjusting Spark configurations so that jobs use memory, CPU, and other resources efficiently. This Spark optimization process enables users to achieve SLA-level Spark performance while mitigating resource bottlenecks and preventing performance issues.
Below are the common approaches to Spark performance tuning:
Data Serialization. Decrease memory usage by storing Spark RDDs (Resilient Distributed Datasets) in a serialized format. Efficient serialization also reduces the amount of data that has to move across the network. Spark’s tuning guide recommends Kryo over the default Java serialization for most workloads because it is faster and more compact.
Memory Tuning. By default, Java objects are incredibly quick to access. However, they can easily use 2-5x more space than the “raw” data inside their fields. Through memory tuning, users can determine and optimize the memory usage of objects, resulting in improved performance.
Data Structure Tuning. Helps reduce memory consumption by avoiding Java features that add overhead, such as pointer-based data structures and wrapper objects.
Garbage Collection Tuning. Garbage collection is costly when a program has large “churn” in the RDDs it stores. Using data structures with fewer objects greatly reduces garbage collection costs.
Memory Management. Spark uses memory for data storage and execution. Effective memory management ensures Storage Memory and Execution Memory exist in harmony and share each other’s free space.
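To make the split between Storage Memory and Execution Memory concrete, here is a rough sketch of the arithmetic behind Spark's unified memory manager. It assumes the documented defaults (`spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and about 300 MB reserved). The function name and the 4 GB heap are illustrative only, and the heap the JVM actually reports is usually a little smaller than the configured value.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying spark.memory.fraction


def unified_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate the on-heap regions of the unified memory manager, in MB."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    unified = usable * memory_fraction      # shared by storage and execution
    storage = unified * storage_fraction    # cached blocks up to this size are immune to eviction
    return {
        "unified": unified,
        "storage": storage,
        "execution": unified - storage,     # can borrow unused storage space, and vice versa
        "user": usable - unified,           # user data structures and internal metadata
    }


regions = unified_memory_regions(4096)  # hypothetical 4 GB executor heap
for name, mb in regions.items():
    print(f"{name:>9}: {mb:7.1f} MB")
```

The storage and execution figures are starting points rather than hard walls: either side can use the other's free space, which is the sharing described above.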
We reached out to our own Spark optimization expert, Field Engineer Alex Pierce, to delve deeper into the Spark technology and understand how to fully maximize the Spark framework. Alex recently held a webinar on how to optimize Spark jobs and successfully execute Spark performance tuning.
3 Spark Performance Tuning Best Practices
Alex lists three Spark optimization techniques he considers best practices that every Spark user should know and implement. These are:
- Salting
- Being a Good Tenant
Read on as we explore each of these Spark performance tuning tips straight from our very own Spark veteran.
Kiana: Hi everyone. I’m your host Kiana with Pepperdata, and I’ll be interviewing Alex Pierce, the Pepperdata field engineer who led our recent webinar, Best Practices for Spark Performance Management. If you haven’t had a chance to watch that webinar on Spark performance tuning and optimization, it’ll be linked on the page this interview is on. So, feel free to go check it out. Now, let’s get right into the questions.
Kiana: During the webinar, we got quite a bit of interest in the topic of how to optimize Spark jobs through salting. You mentioned that salting fixes problems like uneven partition sizes and data skew. Could you expand upon how salting works and how someone could use it to better manage their Spark performance?
Alex: For sure. When you’re looking at what you’re trying to do, let’s specifically look at joins in this case, just because that’s a very common use case in Spark SQL. But this is anytime you’re dealing with data sets where you have a particular dimension. Let’s say you’re dealing with months of the year, days of the week, or something similar as a dimension. That’s a pretty small keyspace. There are only seven days in a week, only 12 months in the year. And let’s say you’re a type of business or something where the vast majority of records happen on a Saturday.
So when we go through to process the data, and let’s say we’re doing a month’s worth of data and we’re doing a join on this data, whatever task is stuck doing the join between the data set and the dimension table on Saturday is going to run much longer than the other tasks. This is one of the most common Spark performance issues. So what salting does is, it’s kind of like repartitioning without actually needing to repartition your data. Basically, we take the key we’re going to join on in, let’s say, our left table, and we make it more uniformly distributed.
And the way we do that, the easiest way I should say, is to append a random number somewhere between 0 and N. You could determine based on the size of your environment, the size of your data set, and the scale you need to look at, how large N should be. And then we need to do the same thing on the other side of our join. So now we need to take the dimension table, I mean, sorry, we need to take the data set table and where those IDs did exist before, we need to run through the same thing on that ID set to append the same 0-N values to those keys.
Now, that does mean N needs to match on both sides. If one side had salt values that did not match the other side, there would definitely be problems. But at this point, we can now do a join using these salted keys, and let’s say in our weekday case, instead of having seven keys, we now have 47 keys. So we’ve now distributed that amongst a significantly larger space.
What this means is, come time to actually do the join, instead of having one particular executor do 80%-90% of the work because of the data set skew, the work is going to be better distributed. Now you’re going to need to test with your data set exactly what size of a salt works best for you, and you do need to remember, if you happen to be using broadcast tables, that your salt is going to increase the size of that dimension table.
So if you’re using a broadcast table you need to keep an eye on your memory to make sure you don’t blow up the executors, and you just need to adjust. It can take some experimentation; you know your data set best so you know how heavy your skew is, and you can oftentimes visualize that within tools like Pepperdata to understand exactly how large of a salt space to add. But typically, you will see a noticeable increase in performance and definitely in terms of parallelization.
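One way to get a feel for how large a salt space to try is to simulate the skew on plain key lists before touching a cluster. The sketch below is ordinary Python, not Spark code, and everything in it is an assumption made for illustration: the 85% Saturday share, the row count, the candidate salt sizes, and the function names. It reports the share of rows that end up in the single largest join-key group, which is a lower bound on what the busiest task has to process.

```python
import random
from collections import Counter

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def make_skewed_keys(n_rows, hot_key="Sat", hot_share=0.85, seed=0):
    """Build a key column where one day dominates, as in the Saturday example."""
    rng = random.Random(seed)
    others = [d for d in DAYS if d != hot_key]
    return [hot_key if rng.random() < hot_share else rng.choice(others)
            for _ in range(n_rows)]


def largest_group_share(keys, salt_buckets, seed=1):
    """Share of all rows that land in the single biggest (key, salt) group."""
    rng = random.Random(seed)
    counts = Counter((k, rng.randrange(salt_buckets)) for k in keys)
    return max(counts.values()) / len(keys)


keys = make_skewed_keys(100_000)
shares = {n: largest_group_share(keys, n) for n in (1, 2, 4, 8, 16)}
for n, share in shares.items():
    print(f"salt buckets={n:2d}  max distinct keys={len(DAYS) * n:3d}  largest group={share:.1%}")
```

With one bucket (no salting) the largest group is roughly the whole Saturday share, and each doubling of the salt space roughly halves it. The trade-off is that the other side of the join grows by the same factor, which matters for broadcast tables.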
So if you’re in a distributed environment, whereas before maybe there were a thousand hosts in your environment but you were only using seven hosts because of your limited keyspace, you can now run it on 47 or 50. All of a sudden, with this Spark tuning technique, you’re using the environment’s resources better, and you’re not causing a bottleneck, perhaps a CPU bottleneck, on one of the nodes for extended periods. It’s just a nice way to deal with limited-keyspace data.
Now, as for the actual code for it. There are tons of examples out there, even just looking at things like DataZone or Stack Overflow. You should be able to find examples of how to do a salt on a table in Spark very simply.
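As a starting point, here is a minimal sketch of the salting idea in plain Python rather than Spark, so the mechanics are easy to follow. It assumes the common form of the technique: the skewed side gets one random salt per row, and the small dimension side is replicated once for every salt value so that each salted key still finds its match. That replication is why the salt increases the size of the dimension table. All function names, column names, and sample rows are hypothetical.

```python
import random
from collections import defaultdict


def salt_skewed_side(rows, key, salt_buckets, seed=0):
    """Skewed side: tag each row with one random salt in [0, salt_buckets)."""
    rng = random.Random(seed)
    return [{**row, "salted_key": (row[key], rng.randrange(salt_buckets))}
            for row in rows]


def replicate_small_side(rows, key, salt_buckets):
    """Small side: copy each row once per salt value so every salted key can match."""
    return [{**row, "salted_key": (row[key], salt)}
            for row in rows for salt in range(salt_buckets)]


def inner_join(left, right, key):
    """Plain in-memory inner join on `key`, standing in for the distributed join."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**match, **row} for row in left for match in index.get(row[key], [])]


def without_salt(rows):
    """Drop the helper column so salted and unsalted results can be compared."""
    return sorted((r["order_id"], r["day"], r["is_weekend"]) for r in rows)


# Hypothetical data: most orders fall on Saturday
facts = [{"order_id": i, "day": "Mon" if i % 10 == 0 else "Sat"} for i in range(20)]
dims = [{"day": "Sat", "is_weekend": True}, {"day": "Mon", "is_weekend": False}]
SALT_BUCKETS = 4

plain = inner_join(facts, dims, "day")
salted = inner_join(
    salt_skewed_side(facts, "day", SALT_BUCKETS),
    replicate_small_side(dims, "day", SALT_BUCKETS),
    "salted_key",
)
print(len(plain), len(salted))                      # 20 20
print(without_salt(plain) == without_salt(salted))  # True
```

The salted join returns the same rows as the plain join, but the Saturday rows are now spread over up to four distinct keys instead of one. In Spark the same two steps are usually expressed as a random-number column on the skewed side and an exploded list of salt values on the small side.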
Being a Good Tenant
Kiana: Yes, and thank you for that answer. That was great. So, you also mentioned that one of the best Spark optimization techniques was, in a multi-tenant environment, to be a good tenant. What does that mean exactly? And do you have any tips people might not have thought of yet?
Alex: Sure. So this one’s interesting. Part of it is understanding the scale of the environment you’re working in, and part of it is understanding the queue limitations for where you’re launching, but the idea is: Spark is greedy. Let’s say that you’re doing something, even super simple, like the SparkPi example that comes with Spark, and you ask for a hundred thousand slices. Now, Spark is going to ask for a hundred thousand executors. It’ll run just fine if it gets 40, but it’ll keep asking until it gets everything it can.
So, one thing you can do to be a good tenant is also set a max value on your ask. Let’s say, I want to run one hundred thousand slices. I want to use Spark dynamic allocation, but don’t ask for more than 100 executors—which we know will give us the performance we need, but will leave resources available for other users while allowing us to meet any sort of SLA. So that’s one very simple example of how tenancy becomes an effective Spark performance tuning practice.
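The cap Alex describes maps onto Spark's dynamic allocation settings. The sketch below lists the relevant properties (the property names are real, the values are only examples) and adds a small, hypothetical helper that estimates how many rounds of tasks a job needs for a given executor count. It assumes one task per core, which is Spark's default.

```python
import math

# Real Spark property names; the values are illustrative, not recommendations
tenant_friendly_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.maxExecutors": "100",  # never ask for more than this
    "spark.executor.cores": "4",
}


def task_waves(num_tasks, executors, cores_per_executor):
    """Rounds of tasks needed when every core runs one task at a time."""
    slots = executors * cores_per_executor
    return math.ceil(num_tasks / slots)


def to_submit_flags(conf):
    """Render the settings as spark-submit --conf flags."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


print("\n".join(to_submit_flags(tenant_friendly_conf)))
print(task_waves(100_000, executors=100, cores_per_executor=4))  # 250
print(task_waves(100_000, executors=40, cores_per_executor=4))   # 625
```

The job finishes with either executor count. The cap only changes how many waves it takes, so the maximum can be chosen to meet the SLA and no higher. Depending on the cluster manager and Spark version, dynamic allocation may also need an external shuffle service or shuffle tracking to be enabled.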
Alex: Another Spark tuning tip to think about is how you’re sizing things. So if your data set can be broken down further, and once again this depends on your knowledge of your own data set, it might be more beneficial to the environment to, instead of asking for a handful of 90 gig or 100 gig executors—that does sound ridiculous but we do see this out there—ask for 10 to 20 gig executors, and break your data set down further.
It’s probably going to be beneficial for you, because you’re more likely to get those executors on the system, and it’s definitely gonna be beneficial for everybody else who’s trying to use that same system. Because if you manage to launch a hundred gig executor on a node, that’s usually over 50% of the space in a node, and sometimes that could even be 70% of the space in a node. So one, you’re going to have to wait for that space to free up and two, once you’re on there nobody else is getting workload on there. So it’s always better if you can break your data set down to try to size what’s going to fit the environment and allow other people to work at the same time.
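A quick back-of-the-envelope calculation shows why smaller executors fit a shared node better. The sketch assumes a hypothetical node with 160 GB available to containers and uses Spark's default JVM memory overhead of 10% of executor memory with a 384 MB minimum. The function names are made up for this example.

```python
def container_gb(executor_memory_gb, overhead_factor=0.10, min_overhead_gb=0.375):
    """Executor heap plus memory overhead, as requested from the resource manager."""
    overhead = max(min_overhead_gb, executor_memory_gb * overhead_factor)
    return executor_memory_gb + overhead


def fit_on_node(node_memory_gb, executor_memory_gb):
    """How many executors of one size fit on a node, and what is left over."""
    size = container_gb(executor_memory_gb)
    count = int(node_memory_gb // size)
    return {
        "container_gb": size,
        "executors_per_node": count,
        "share_of_node": size / node_memory_gb,
        "leftover_gb": node_memory_gb - count * size,
    }


NODE_GB = 160  # assumed node capacity, not a figure from the interview
for executor_gb in (100, 16):
    fit = fit_on_node(NODE_GB, executor_gb)
    print(f"{executor_gb:3d} GB executor: {fit['executors_per_node']} per node, "
          f"{fit['share_of_node']:.0%} of the node each, "
          f"{fit['leftover_gb']:.1f} GB left over")
```

Under these assumptions a single 100 GB executor takes well over half of the node and has to wait until that much memory is free in one place, while 16 GB executors can be scheduled into much smaller gaps alongside other tenants' workloads.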
That’s another one that’s maybe a little bit more difficult, but still not too hard to do. I mean, if you’re working on binary blob data sets and they only come to you in a certain size, there’s not a lot you can do. Almost everything else can be improved. Sometimes even by, like our last question, salting, because maybe you have one executor that’s blowing out all this memory because that’s where all the data is.
Instead of fixing the skew problem, you’ve been just increasing the memory until it ran. So that’s one good way to fix that. Same thing on the core side. There’s only so much CPU power to go around, and if your code is multi-threaded, sometimes you’re going to use more than those cores that you asked for. So, just keep in mind what resources are available and that other people are using it, and make sure you make smart decisions that are both going to help you fit into those resource-constrained environments and allow everybody else to still use them while you are.
Kiana: Ok, well, thank you, Alex, for your time. It’s great to delve a bit deeper into your go-to Spark performance tuning tips as well as some of the topics that you touched on in the webinar.
And again, to our readers and listeners, if you’d like to watch the full webinar, Best Practices for Spark Performance Management, it’s linked on the page this interview is hosted on. Also, check out this video on Spark optimization for a more visual, in-depth demonstration.
Contact Pepperdata today for more information on how to fully leverage the Spark framework and how our Spark optimization solution can help you get more from your Spark workloads.