Hive Tutorial 11 : Hive Managed Table vs External Table
Managed Tables
Managed tables are created by
default with CREATE TABLE statements. When hive create managed tables, it
follows schema on read and load complete file as it is, without any parsing or
modification Hive stores the data for these tables in a
subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default. And its schema
information would be saved in hive metastore for later operational use. During
drop table it drop data file from warehouse directory as well as schema from
metastore.
hive>CREATE TABLE managed_table (dummy STRING);
LOAD
DATA INPATH '/user/tom/data.txt' INTO table managed_table;
External
Table
The EXTERNAL keyword lets you
create a table and provide a LOCATION so that Hive does not use a default
location for this table. This comes in handy if you already have data generate.
If we drop table Hive does not removed or drop anything related to source file.
It only drops schema information from hive metastore at time of drop tables. in
gist for external table user is responsible to manage them.
hive>CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD
DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
Managed Tables Vs External
Table
The key difference between
external and managed table in Hive is that data in the external table is not
managed by Hive. When you create external table you define HDFS directory for
that table and Hive is simply "looking" in it and can get data from
it but Hive can't delete or change data in that folder. When you drop external
table Hive only deletes metadata from its metastore and data in HDFS remains
unchanged.
How to choose
which table to be used?
Use EXTERNAL tables when:
- The data is also used outside of Hive.
For example, the data files are read and processed by an existing program
that doesn't lock the files.
- Data needs to remain in the underlying
location even after a DROP TABLE. This can apply if you are pointing
multiple schemas (tables or views) at a single data set or if you are iterating
through various possible schemas.
- You want to use a custom location such as
ASV.
- Hive should not own data and control
settings, dirs, etc., you have another program or process that will do
those things.
- You are not creating table based on
existing table (AS SELECT).
Use INTERNAL
tables when:
- The data is temporary.
- You want Hive to completely manage the
lifecycle of the table and data.
Comments
Post a Comment