Hive Tutorial 13 : Hive Partitions

Table partitioning means dividing table data into some parts based on the values of particular columns like date or country, segregate the input records into different files/directories based on date or country.

Partitioning can be done based on more than column which will impose multi-dimensional structure on directory storage. For Example, In addition to partitioning log records by date column, we can also sup divide the single day records into country wise separate files by including country column into partitioning. We will see more about this in the examples.

Partitions are defined at the time of table creation using the PARTITIONED BY clause, with a list of column definitions for partitioning

Syntax:

Create table tablename(column1 datatype1,column2 datatype2)PARTITIONED BY(col3 datatype3,col4 datatype4);

** Partition column names should be added after PARTITIONED BY key word**
** Partitions can be done on multiple columns**

Limitations:

Having too many partitions in table creates large number of files and directories in HDFS, which is an overhead to NameNode since it must keep all metadata for the file system in memory only.
Partitions may optimize some queries based on Where clauses, but may be less responsive for other important queries on grouping clauses.
In Mapreduce processing, Huge number of partitions will lead to huge no of tasks (which will run in separate JVM) in each mapreduce job, thus creates lot of overhead in maintaining JVM start up and tear down. For small files, a separate task will be used for each file. In worst scenarios, the overhead of JVM start up and tear down can exceed the actual processing time.


Pointers:


A few things to keep in mind when using partitioning:

  • It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data and put strain on the name node to manage all the underlying structures in HDFS.
  • Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
  • Partitioning columns should be selected such that it results in roughly similar size partitions in order to prevent a single long running thread from holding up things.
  • If hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic.
  • If your partitioned table is very large, you could block any full table scan queries by putting Hive into strict mode using the set hive.mapred.mode=strict command. In this mode, when users submit a query that would result in a full table scan (i.e. queries without any partitioned columns) an error is issued.


Comments

Popular posts from this blog

Hive Tutorial 31 : Analytic Functions

Hive Tutorial 37 : Performance Tuning