Hive Tutorial 15 : Hive Bucketing
Usually, partitioning in Hive offers a way of segregating Hive table data into multiple files/directories. But partitioning gives effective results only when:
* There are a limited number of partitions
* Partitions are of comparatively equal size
This may not be possible in all scenarios. For example, when we partition our tables by a geographic location such as country, a few big countries will create large partitions (4-5 countries alone may contribute 70-80% of the total data), whereas small countries' data will create small partitions (the remaining countries in the world may contribute just 20-30% of the total data). So, in these cases, partitioning is not ideal.
*To overcome the problem of over partitioning*, Hive provides the bucketing concept, another technique for decomposing table data sets into more manageable parts.
Features
* Bucketing is based on *(hashing function on the bucketed column)* *mod* *(total number of buckets)*. The hash function depends on the type of the bucketing column.
* Records with the same value in the bucketed column will always be stored in the same bucket.
* We use the *CLUSTERED BY* clause to divide the table into buckets.
* Physically, each bucket is just a file in the table directory, and bucket numbering is 1-based.
* Bucketing can be done along with partitioning on Hive tables, or even without partitioning.
* Bucketed tables create almost equally distributed data file parts.
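The hash-mod rule above can be previewed directly with Hive's built-in hash() and pmod() functions. This is only a sketch; the users table and its id column are hypothetical:
hive> -- 0-based file index each row would land in, assuming 4 buckets
hive> SELECT id, pmod(hash(id), 4) AS bucket FROM users;
For an INT column the hash is the value itself, so an id of 7 with 4 buckets lands in bucket pmod(7, 4) = 3, i.e. the file 000003_0.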
There are two benefits of bucketing:
* First, it makes certain types of queries in Hive more efficient, particularly a join between two tables that are bucketed on the same column.
* Second, sampling queries are more efficient when they are performed on bucketed columns.
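For example, when two tables are bucketed on the join key, Hive can perform a bucket map join, matching bucket files pairwise instead of shuffling whole tables. A sketch, assuming two hypothetical tables bucketed on id:
hive> SET hive.optimize.bucketmapjoin = true;
hive> SELECT /*+ MAPJOIN(o) */ u.name, o.amount
    > FROM bucketed_users u JOIN bucketed_orders o ON u.id = o.id;
The number of buckets in one table should be a multiple of the number in the other for this optimization to apply.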
What is sampling?
You might have a situation where accuracy is not very important and you just need to run your queries on a smaller set of data instead of the entire table. One use case is testing: you have a huge table that takes a lot of time to read and answer your queries, and you want to test them on just a small sample of your data instead of the entire set. Another use case is when you need to detect patterns or compute an aggregation such as an average, and accuracy is not your top priority; a general estimate with a predictable degree of error is fine, so you can work on a smaller sample instead of the full data set. Sampling can also be used to obtain a random subset of the actual rows in a table for auditing purposes.
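On a bucketed table, Hive's TABLESAMPLE clause can serve such queries by reading just one bucket's file instead of scanning the whole table. A sketch against a table bucketed on id into 4 buckets:
hive> -- read only the 1st of 4 buckets (bucket numbering is 1-based)
hive> SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);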
Limitations
* Specifying bucketing doesn't ensure that the table is properly populated. Loading data into the buckets needs to be handled by ourselves.
How to create a table with bucketing?
We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets.
hive> CREATE TABLE bucketed_users (id INT, name STRING)
    > CLUSTERED BY (id) INTO 4 BUCKETS;
Bucketing Example
CREATE TABLE orders
(
  username  STRING,
  orderdate STRING,
  amount    DOUBLE,
  tax       DOUBLE
)
PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Here we divided the table into 25 buckets. Set the number of reducers to the same number of buckets specified in the table metadata (i.e. 25):
set mapred.reduce.tasks = 25;
Alternatively, use the following command to enforce bucketing, which makes Hive set the reducer count to match the bucket count automatically:
set hive.enforce.bucketing = true;
Load Data Into Table
Check the table directory after loading and you will see files named 000000_0, 000001_0, 000002_0, and so on (one per bucket); these are our data files.
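Because a plain LOAD DATA statement does not redistribute rows into buckets, the usual way to populate a bucketed table is INSERT ... SELECT from a staging table. A sketch, assuming a hypothetical unbucketed staging table users_staging with the same columns as bucketed_users:
hive> SET hive.enforce.bucketing = true;
hive> INSERT OVERWRITE TABLE bucketed_users
    > SELECT id, name FROM users_staging;
With hive.enforce.bucketing enabled, Hive launches one reducer per bucket so each output file receives exactly the rows that hash to it.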