Hive Tutorial 15 : Hive Bucketing

Partitioning in Hive offers a way of segregating Hive table data
into multiple files/directories.
But partitioning gives effective results only when:

  * There are a limited number of partitions
  * The partitions are of comparatively equal size

This may not be possible in all scenarios. For example, when we partition our tables based on geographic location such as country, a few bigger countries will create large partitions (4-5 countries may themselves contribute 70-80% of the total data), whereas the data for small countries will create small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In these cases, partitioning is not ideal.

*To overcome the problem of over-partitioning*,
Hive provides the Bucketing concept, another technique for decomposing table data sets into more manageable parts.

Features
  * Bucketing is based on *(hash function on the bucketed
    column)* *mod* *(total number of buckets)*. The hash function
    depends on the type of the bucketing column.
  * Records with the same value in the bucketed column will always be
    stored in the same bucket.
  * We use *CLUSTERED BY* clause to divide the table into buckets.
  * Physically, each bucket is just a file in the table directory,
    and Bucket numbering is 1-based.
  * Bucketing can be done along with Partitioning on Hive tables and
    even without partitioning.
  * Bucketed tables will create almost equally distributed data file parts.
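As a quick sketch of the hash-mod rule above, we can compute the bucket that each row of the `bucketed_users` table (defined below with 4 buckets) would land in. `hash()` and `pmod()` are built-in Hive UDFs; `pmod` is used instead of `%` so the result is never negative:

```sql
-- For an INT column, hash(id) is just id, so bucket = id mod 4.
SELECT id, pmod(hash(id), 4) AS bucket
FROM bucketed_users;
```

A row with id = 6, for example, hashes to bucket 2 (6 mod 4), and every other row with id = 6 will land in that same bucket.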

There are two benefits of bucketing:

  * First, certain types of queries become more efficient, particularly a join of two tables that are bucketed on the same column.
  * Second, sampling queries are more efficient when they are performed on bucketed columns.
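To illustrate the first benefit, a join between two tables bucketed on the join key can use a bucket map join, where Hive joins bucket-by-bucket instead of shuffling the whole tables. A minimal sketch, assuming a hypothetical `bucketed_orders` table that is clustered by `id` into the same number of buckets as `bucketed_users`:

```sql
-- Let Hive use the bucketing metadata to join matching buckets directly.
SET hive.optimize.bucketmapjoin = true;

SELECT u.name, o.amount
FROM bucketed_users u
JOIN bucketed_orders o ON u.id = o.id;
```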
What is sampling?
You might have a situation where accuracy is not very important and you just need to run your queries on a smaller set of data instead of the entire table. One use case is testing: you have a huge table which takes a lot of time to read, and you want to test your queries on just a small sample of your data instead of the entire set. Another use case is when you need to find patterns or compute an aggregation such as an average, and accuracy is not your top priority; a general answer with a predictable degree of error is fine, so you can work on a smaller sample instead of the full data set. Sampling can also be used to obtain a random subset of the actual rows in a table for auditing purposes.
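On a bucketed table, Hive's TABLESAMPLE clause can read just a subset of the buckets instead of scanning the whole table. A sketch using the `bucketed_users` table defined below:

```sql
-- Read only the 1st of every 4 buckets (roughly 25% of the data).
SELECT *
FROM bucketed_users
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```

Because the table is already bucketed on `id`, Hive can satisfy this sample by reading a single bucket file rather than the full table.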

Limitations
  * Specifying bucketing doesn't ensure that the table is properly
    populated. Loading data into the buckets needs to be handled by
    ourselves.
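To populate the buckets correctly, load the data with an INSERT ... SELECT from a staging table, with bucketing enforcement turned on. A sketch, assuming a plain (non-bucketed) `users` table as the source:

```sql
-- Make Hive launch one reducer per bucket and cluster rows by hash(id).
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_users
SELECT id, name
FROM users;
```

A plain `LOAD DATA` statement would copy files into the table directory without redistributing the rows, which is why the INSERT ... SELECT route is needed.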

How to create a table with bucketing?
We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets.

hive> CREATE TABLE bucketed_users (id INT, name STRING)
    > CLUSTERED BY (id) INTO 4 BUCKETS;



Bucketing Example
CREATE TABLE orders (
    username     STRING,
    orderdate    STRING,
    amount       DOUBLE,
    tax          DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Here we divided the table into 25 buckets. Set the maximum number of reducers to the same number of buckets specified in the table metadata (i.e. 25):

set mapred.reduce.tasks = 25;

Use the following command to enforce bucketing:

set hive.enforce.bucketing = true;

[Screenshot: Hive buckets table]
Load Data Into Table

[Screenshot: Hive buckets table data load]
Check the screen below and you will see that three files named 000000_0, 000001_0 and 000002_0 were created; these are our bucket data files.


[Screenshot: Hive buckets table data load output]
