Hive Performance Tuning Techniques

Prasad Adsul
2 min readOct 15, 2022

--

Hive performance tuning techniques are very useful when you are working on big data. Also, it is a very frequently asked question in big data engineer interviews. I have tried to cover most of the techniques in this blog.

1) Table Design & File Format:

a) Partitioning:
Works by dividing the data into smaller logical segments. We finally scan only one partition.
Partitioning can be done on columns with low cardinality.

b) Bucketing:
Works by dividing the data into smaller logical segments. Bucketing can be done on columns with high cardinality.

c) Right File Format with Hive:
ORC can reduce data storage by 75% of the original data size. It uses techniques like predicate push-down and compression to improve the performance of the query.

2) Optimization of Query:

a) Join Optimization:
Map Side Join, Bucket Map Join, and Sort Merge Bucket Join (SMB) all do minimum shuffling. For this set following parameters are as follows :

set hive.enforce.bucketing=true
set hive.enforce.sorting=true
set hive.auto.convert.join=true
set hive.auto.convert.sortmerger.join=true
set hive.optimize.bucektmapjoin=true
set hive.optimize.bucketmapjoin.sortedmerge=true

b) Window Functions:
Use of windowing functions for performing complicated queries.

3) Query Execution Engine:
It is a new application framework built on Hadoop Yarn. That executes complex-directed acyclic graphs of general data processing tasks. However, we can consider it to be a much more flexible and powerful successor to the map-reduce framework. So try using TEZ or Spark to enable Hive Engine for better query execution.

4) Vectorization:
This operation refers to scans, aggregations, filters, and joins.
It happens by performing them in batches of 1024 rows at once instead of a single row each time
Limitation: This can be used only with ORC File format. You can easily enable vectorization with the following
two parameters settings :

set hive.vectorized.execution=True
set hive.vectorized.execution.enabled=True

5) If possible try avoiding UDFs as they are not much optimized.

6) Hive Indexing: Basically, we use it to speed up the access of a column or set of columns in a Hive database.
Since, the database system does not need to read all rows in the table to find the data with the use of the index,
especially that one has selected.

7) Cost-Based Optimization:
CBO performs optimizations based on query cost in a recent addition to Hive. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query :

set hive.cbo.enable=True
set hive.compute.query.using.stats=True
set hive.stats.fetch.column.stats=True
set hive.stats.fetch.partition.stats=True

--

--

Prasad Adsul

Hey, Welcome Data guys. I love to share my knowledge and experience with you. If you find it helpful then I will be very Happy !!