Determining exactly what causes a performance problem in a big data environment is a complex challenge. The first step is identifying the root cause, which is not exactly straightforward. Here are a few potential issues and ideas developers can use to fix them.

When you can affect change

Code is not optimal

As a developer, you can improve the performance of an application by modifying your code. Whether this is optimizing your SQL for Hive, or using the correct IO calls for the type of data you are retrieving, this is one of the areas you can make a direct impact on your application’s performance. There are tools that can help with both, including Pepperdata.

Dataset skew (distribution/keyspace)

Sometimes you can fix this, sometimes you can’t — it depends on how much control you have on your incoming data. But if the dataset is very skewed, parts of your application can easily take 2-3 times longer than necessary processing your data as other parts.

You’re asking for incorrect resources

This one is slightly more subtle, and can change based on the amount of data you are processing, but one of the easiest ways to improve application performance in Hadoop is to ask for the correct amount of resources — not too much and not too little. If you are asking for your application to use 50TB of memory and 2000 cores (1000 tasks each asking for 5GB and 2 cores), but you only use 40 percent of that at peak use (largest task is 2GB and 1 core), then you are impacting two things:

  • your performance, as you will need to wait for more space in the queue to launch your application, and
  • the performance of other tenants on the system, as you are occupying queue space they could be using when you aren’t using the physical resources.

When you cannot affect change because it’s not your fault

Contention for queue resources

If you are assigned to a particular set of queues, you should submit your application to one of those. However, if those queues are full, your application could:

  • Spend more time in queue than executing, or
  • Be limited to running only one or two tasks at a time as opposed to taking advantage of the parallel compute capability of the Hadoop platform.

The only way to fix this issue is to find time to run your applications when the queue is less busy, or talk to your system administrator about your own queue if you have SLA’s around your applications.

Contention for physical resources

Sometimes, there are actual physical resource bottlenecks you will run into, even if your application is doing everything correctly and is able to fully utilize the queue you are assigned to. This is difficult to diagnose, as Hadoop will not expose this by default. Your system operators should have tooling to let them look at the system resource utilization during the time window that your application is running, and see what the resource contention is.

  • Is your application running on cluster nodes where the CPU utilization is above 95 percent? This is not uncommon in busy environments, and depending on the version of Hadoop and the scheduler your system is using, there may not be any controls preventing this from happening!
  • Is someone else using all of the IO bandwidth? This isn’t a resource you can request and use for scheduling (unless your company is using MapR), so it is possible for one application to use all of the IO bandwidth that should be being shared between 10-100 applications are any given moment.

Learn more: