Hive Tutorial 15 : Hive Bucketing

Partitioning in Hive offers a way of segregating Hive table data
into multiple files/directories.
But partitioning gives effective results only when:

  * There are a limited number of partitions
  * The partitions are of comparatively equal size

This may not be possible in all scenarios. For example, when we partition our tables based on geographic location such as country, a few bigger countries will create large partitions (4-5 countries may themselves contribute 70-80% of the total data), whereas the data for small countries will create small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In these cases, partitioning is not ideal.

*To overcome the problem of over-partitioning*,
Hive provides the Bucketing concept, another technique for decomposing table data sets into more manageable parts.

Features
  * Bucketing is based on *(hash function on the bucketed
    column)* *mod* *(total number of buckets)*. The hash function
    depends on the type of the bucketing column.
  * Records with the same value in the bucketed column will always be
    stored in the same bucket.
  * We use *CLUSTERED BY* clause to divide the table into buckets.
  * Physically, each bucket is just a file in the table directory,
    and Bucket numbering is 1-based.
  * Bucketing can be done along with Partitioning on Hive tables and
    even without partitioning.
  * Bucketed tables will create almost equally distributed data file parts.
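As a quick sketch of the hash-mod rule above, we can compute the bucket that each row of the `bucketed_users` table (defined below with 4 buckets) would land in. `hash()` and `pmod()` are built-in Hive UDFs; `pmod` is used instead of `%` so the result is never negative:

```sql
-- For an INT column, hash(id) is just id, so bucket = id mod 4.
SELECT id, pmod(hash(id), 4) AS bucket
FROM bucketed_users;
```

A row with id = 6, for example, hashes to bucket 2 (6 mod 4), and every other row with id = 6 will land in that same bucket.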

There are two benefits of bucketing:

  * First, certain types of queries become more efficient, particularly a join of two tables that are bucketed on the same column.
  * Second, sampling queries are more efficient when they are performed on bucketed columns.
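To illustrate the first benefit, a join between two tables bucketed on the join key can use a bucket map join, where Hive joins bucket-by-bucket instead of shuffling the whole tables. A minimal sketch, assuming a hypothetical `bucketed_orders` table that is clustered by `id` into the same number of buckets as `bucketed_users`:

```sql
-- Let Hive use the bucketing metadata to join matching buckets directly.
SET hive.optimize.bucketmapjoin = true;

SELECT u.name, o.amount
FROM bucketed_users u
JOIN bucketed_orders o ON u.id = o.id;
```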
What is sampling?
You might have a situation where accuracy is not very important and you just need to run your queries on a smaller set of data instead of the entire table. One use case is testing: you have a huge table which takes a lot of time to read, and you want to test your queries on just a small sample of your data instead of the entire set. Another use case is when you need to find patterns or compute an aggregation such as an average, and accuracy is not your top priority; a general answer with a predictable degree of error is fine, so you can work on a smaller sample instead of the full data set. Sampling can also be used to obtain a random subset of the actual rows in a table for auditing purposes.
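On a bucketed table, Hive's TABLESAMPLE clause can read just a subset of the buckets instead of scanning the whole table. A sketch using the `bucketed_users` table defined below:

```sql
-- Read only the 1st of every 4 buckets (roughly 25% of the data).
SELECT *
FROM bucketed_users
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```

Because the table is already bucketed on `id`, Hive can satisfy this sample by reading a single bucket file rather than the full table.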

Limitations
  * Specifying bucketing doesn't ensure that the table is properly
    populated. Loading data into the buckets needs to be handled by
    ourselves.
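To populate the buckets correctly, load the data with an INSERT ... SELECT from a staging table, with bucketing enforcement turned on. A sketch, assuming a plain (non-bucketed) `users` table as the source:

```sql
-- Make Hive launch one reducer per bucket and cluster rows by hash(id).
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_users
SELECT id, name
FROM users;
```

A plain `LOAD DATA` statement would copy files into the table directory without redistributing the rows, which is why the INSERT ... SELECT route is needed.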

How to create a table with bucketing?
We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets.

hive> CREATE TABLE bucketed_users (id INT, name STRING)
    > CLUSTERED BY (id) INTO 4 BUCKETS;



Bucketing Example
CREATE TABLE orders (
    username     STRING,
    orderdate    STRING,
    amount       DOUBLE,
    tax          DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Here we divided the table into 25 buckets. Set the maximum number of reducers to the same number of buckets specified in the table metadata (i.e. 25):

set mapred.reduce.tasks = 25;

Use the following command to enforce bucketing:

set hive.enforce.bucketing = true;

[Screenshot: Hive buckets table]
Load Data Into Table

[Screenshot: Hive buckets table data load]
Check the screen below and you will see that three files named 000000_0, 000001_0 and 000002_0 were created; these are our bucket data files.


[Screenshot: Hive buckets table data load output]
