Hive query tuning is critical to big data success overall. Hive is the most prevalent query engine used in many of the largest enterprise environments today. What are Hive query tuning best practices? In other words, what can Developers and Ops teams do to ensure optimal Hive query performance?
Here are the most effective Hive tuning best practices across three key categories, whether you are tuning for time or efficient use of resources.
Category 1 – Data – Manipulate as little as possible
Partitioning is a common Hive performance tuning tactic which places table data in separate subdirectories of a table location based on keys. Partition keys present an opportunity to target a subset of the table data rather than scanning data you don’t need for your operations.
Bucketing, similar to Partitioning, is a Hive query tuning tactic that allows you to target a subset of data. In this case, to improve join performance specifically by scanning less data. This improves the query across the vectors of time and efficiency as less data has to be input, output, or stored in memory.
Each bit of data that is manipulated by a query has I/O associated with getting the data from disk, into memory, out of memory, and back to disk or another end target. Compression minimizes the amount of data that traverses each of those steps and decreases the time spent moving through the query states.
Avoid small files
Eliminating small file operations from your query promotes a healthy Hive ecosystem. Each file is tracked by the Hive metastore and stored in HDFS, which are each performance-optimized to handle larger files over many smaller files. Query performance is limited to the health of the overall system and platform.
Denormalizing data is a process that eliminates the need to join data from multiple tables during runtime. While normalization is useful, avoiding joins is one of the most impactful changes you can make to a given query short of eliminating unwanted data from the operation altogether.
Hive tables are different from the traditional database tables most data professionals are accustomed to. They are essentially sub-directories. Increasing the number of partitions to promote efficient reads and parallelism are useful but not to be overdone. Too many partitions can degrade the performance of the metastore and Hive server. Tracking and baselining performance are the best ways to understand when the number of partitions has gone from helpful to hurtful.
Simple joins are usually better
There are a lot of tactics which aim to make joins more efficient. SMB join, map joins, stream tables–each is designed to eliminate complexity or phases of a join. Nested joins are also costly to perform. So much work is being done to improve join performance because joins are costly.
Input file format selection
Input formats selection is important in Hive query tuning. JSON, for instance, isn’t an ideal format choice when dealing with massive production systems that generate high volumes of data. That is because JSON, and similar format types, actually take up lots of space as well as some overhead of parsing.
Apache Hive utilizes columnar input formats like RCFile and ORC to address such a problem. Columnar formats enable you to access each column individually, thereby decreasing the read operations in analytics queries. This results in faster query performance.
Category 2 – Execution
Use Tez or something better
Hive is a continual work in progress. Execution engines are a clear focus for the developers as we see things like Tez, LLAP, and Hive on Spark look to add to core Hive in ways that improve performance without the need for low-level tuning. Understanding and leveraging the best execution engine for the task at hand should be a mandatory consideration for tuning Hive query performance.
Hive queries deploy a distributed set of tasks. The overall query is only as fast as the slowest task. Ensuring an even distribution of work across the tasks keeps the query from slowing itself down by handling more data than necessary in some tasks.
Increase parallel execution
Parallel execution should be leveraged whenever sequential operations are not required. The amount of parallelism is dictated by the availability of resources and the structure of the data. This is another area where the “right” number can be tricky to derive without good performance solutions.
Category 3 – Testing
Sampling/unit testing is a big help
Sampling and unit testing is all about taking a subset of your data and running a thousand rows before you go and manipulate a million rows. This particular Hive query tuning best practice helps you understand how your code works to get you the desired result before you throw a big data set at it. It’s not foolproof, but working through failures or odd results at a small scale are quicker and more efficient than doing so at scale.
Keep bad queries out
Scrutinizing query performance and keeping inefficient queries from going to production sounds simple, but this step is often skipped until problems arise and it’s too late. Each query should be automatically measured for performance and efficiency to meet minimum accepted levels before being promoted to higher-level environments.
Pepperdata offers enterprises big data optimization solutions designed to give comprehensive visibility and observability into their big data infrastructure, stacks, applications, processes, and more. Users enjoy real-time performance reports and data-fueled recommendations to optimize the performance of their big data infrastructure and all its components while significantly cutting down operational costs.
For more big data performance management tips, check out this recent webinar, Best Practices for Spark Performance Management.