Apache Hive is the most prevalent query engine used in many of the largest enterprise environments today. To get the most out of the engine, it’s important to perform Hive performance tuning. But before we dive into that, let’s cover the basics.
What is Hive performance tuning? Hive performance tuning refers to the collective processes and steps designed to improve and accelerate the performance of your Hive environments. When queries are not optimized, simple statements take longer to execute, resulting in performance lags and downtime.
While we now understand its importance, tuning your Hive environments for optimal performance can be tricky. And knowing how to analyze Hive query performance is a must for success. But just how do you optimize a Hive query? What are Hive performance tuning best practices? And what can developers and Ops teams do to ensure optimal Hive query performance?
If you have these questions, this post is for you. Keep reading to learn effective performance tuning best practices across three key categories. Whether you’re tuning for time or efficient use of resources, these tips apply.
Category 1: Data: Manipulate as Little as Possible
How can I improve Hive performance? Most users and developers start with tweaking their data. The use of partitioning, bucketing, compression, avoiding small files, and more are all great Hive query optimization techniques.
Here at Pepperdata, we deal with all sorts of questions about Hive queries, with improving Hive performance chief among them. In this section, we’ll dive into how to manipulate data as little as possible to gain success.
Partitioning is a common Hive query tuning tactic which places table data in separate subdirectories of a table location based on keys. Partition keys present an opportunity to target a subset of the table data rather than scanning data you don’t need for your operations.
No matter how much data exists, when you have partitions, Hive only reads a specific amount of data to generate results. This drastically improves performance, even when you execute complex analytics queries. This is because Hive only has to read data from a few partitions specified in the clause. It already filters out the data needed before query execution is initiated.
Bucketing, similar to partitioning, is a Hive query tuning tactic that allows you to target a subset of data. In this case, to improve join performance specifically by scanning less data. This improves the query across the vectors of time and efficiency as less data has to be input, output, or stored in memory.
Bucketing in Hive entails the decomposition of a table data set into smaller parts. Thus, data is easier to handle. With bucketing, you join similar data types and write them to a single file. This step here greatly enhances performance while joining tables or reading data. This is why bucketing with partitioning is so popular among Hive users.
Compression ranks as one of the best Hive query optimization techniques. Big data compression cuts down the amount of bandwidth and storage required to handle large data sets. In addition, compression eliminates redundant and unimportant pieces from your systems.
Each bit of data that is manipulated by a query has I/O associated with getting the data from disk, into memory, out of memory, and back to disk or another end target. Compression minimizes the amount of data that traverses each of those steps and decreases the time spent moving through the query states.
Avoid Small Files
Eliminating small file operations from your query is an effective Hive performance tuning tactic. Doing so promotes a healthy Hive ecosystem. Each file is tracked by the Hive metastore and stored in HDFS, which are each performance-optimized to handle larger files over many smaller files. Query performance is limited to the health of the overall system and platform.
Hive experts recommend denormalizing data as a go-to Hive performance tuning approach if you want to eliminate the need to join data from multiple tables during runtime. Denormalization is performed by adding redundant data to one or more tables. This can help us avoid costly joins in a relational database.
While normalization is useful, avoiding joins is one of the most impactful changes you can make to a given query short of eliminating unwanted data from the operation.
Hive tables are different from the traditional database tables most data professionals are accustomed to. They are essentially sub-directories. Increasing the number of partitions to promote efficient reads and parallelism is one of the most effective Hive optimization techniques. However, this solution is not to be overdone. Too many partitions can degrade the performance of the metastore and Hive server. Tracking and baselining performance are the best ways to understand when the number of partitions has gone from helpful to hurtful.
Simple Joins are Usually Better
There are a lot of tactics that aim to make joins more efficient. SMB join, map joins, stream tables–each is designed to eliminate the complexity or phases of a join. Nested joins are also costly to perform. Because joins are costly, a lot of work is being done to improve join performance.
Input File Format Selection
Input formats selection is important in Hive query tuning. JSON, for instance, isn’t an ideal format choice when dealing with massive production systems that generate high volumes of data. That is because JSON, and similar format types, actually take up lots of space as well as some overhead of parsing.
Hive utilizes columnar input formats like RCFile and ORC to address such a problem. Columnar formats enable you to access each column individually, thereby decreasing the read operations in analytics queries. This results in faster query performance.
Category 2: Execution
Writing Hive queries right initially is crucial. The execution of Hive queries hinges primarily on the code written by its users. But not all code is written perfectly. Hive query tuning isn’t just about data; improving execution is also critical to Hive’s success.
Use Tez (or Something Better)
Apache Tez is a framework built on top of Apache Hadoop 2.0 (Yarn) and is designed to accelerate query execution for Hive. Tez helps users start and hold one or multiple containers which can be reused to perform multiple queries. It also helps users avoid multiple disk I/Os and decreases overhead for launching JVMs.
Execution engines are a clear focus for developers as we see frameworks like Tez, LLAP, and Hive on Spark look to add to core Hive in ways that improve performance without the need for low-level tuning. Understanding and leveraging the best execution engine for the task at hand should be a mandatory consideration for Hive performance tuning.
Hive queries deploy a distributed set of tasks. The overall query is only as fast as the slowest task. Ensuring an even distribution of work across the tasks is an effective Hive performance tuning approach. That’s because it keeps the query from slowing itself down by handling more data than necessary in some tasks.
Increase Parallel Execution
By default, Hive will only execute one stage at a given time. However, a specific job may contain multiple stages, which may not be completely interdependent. Instead of running a single stage at one instance, executing these non-interdependent stages in parallel can drastically reduce the run time of the entire job.
Parallel execution is one of the best Hive optimization techniques, but it should only be leveraged when sequential operations are not required. The amount of parallelism is dictated by the availability of resources and the structure of the data. This is another area where the “right” number can be tricky to derive without good performance solutions.
Category 3: Testing
Sampling/Unit Testing Is a Big Help
Sampling and unit testing are all about taking a subset of your data and running a thousand rows before manipulating a million rows. This particular Hive query tuning best practice helps you understand how your code works to get you the desired result before you throw a big data set at it. It’s not foolproof, but working through failures or odd results at a small scale is quicker and more efficient than doing so at scale.
Keep Bad Queries Out
Scrutinizing query performance and keeping inefficient queries from going to production sounds simple, but this Hive performance tuning step is often skipped until problems arise and it’s too late. Each query should be automatically measured for performance and efficiency to meet minimum accepted levels before being promoted to higher-level environments.
According to our 2021 Big Data Survey Report, 29% of enterprises say that Hive applications and workloads consume most of their resources. Hive is a crucial component in the operations of today’s enterprises. This is why it is crucial to fine-tune Hive queries to achieve optimal performance while keeping resource consumption and related costs manageable.
Pepperdata offers enterprises big data optimization solutions designed to give comprehensive visibility and observability into their big data infrastructure, stacks, applications, processes, and more. Users enjoy real-time performance reports and data-fueled recommendations to optimize the performance of their big data infrastructure and all its components while significantly cutting down operational costs.
For more tips on boosting Hive performance, download our e-book: Improve Performance with Real Insight Into How Queries are Executing.