Hive Tutorial 11 : Hive Managed Table vs External Table

Managed Tables            
Managed tables are created by default with CREATE TABLE statements. When hive create managed tables, it follows schema on read and load complete file as it is, without any parsing or modification Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default. And its schema information would be saved in hive metastore for later operational use. During drop table it drop data file from warehouse directory as well as schema from metastore.
  
hive>CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;

It will move the file /user/tom/data.txtinto hive ware house directory for manage_ table which is hdfs: //user/hive/warehouse/managed_table



External Table

The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generate. If we drop table Hive does not removed or drop anything related to source file. It only drops schema information from hive metastore at time of drop tables. in gist for external table user is responsible to manage them.

hive>CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;


Managed Tables Vs External Table

The key difference between external and managed table in Hive is that data in the external table is not managed by Hive. When you create external table you define HDFS directory for that table and Hive is simply "looking" in it and can get data from it but Hive can't delete or change data in that folder. When you drop external table Hive only deletes metadata from its metastore and data in HDFS remains unchanged.

    How to choose which table to be used?

Use EXTERNAL tables when:
  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • You want to use a custom location such as ASV.
  • Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
  • You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
  • The data is temporary.
  • You want Hive to completely manage the lifecycle of the table and data.

Comments

Popular posts from this blog

Hive Tutorial 31 : Analytic Functions

Hive Tutorial 37 : Performance Tuning