Hive Tutorial 1 : Apache Hive

A little history about Apache Hive makes a nice story and helps explain why it came into existence. When Facebook started gathering data and ingesting it into Hadoop around 2006, data was arriving at a rate of tens of GBs per day. By 2007 it had grown to 1 TB/day, and within a few years it reached around 15 TB/day. Initially, Facebook had written Python scripts to ingest data into Oracle databases, but with the growth in data volume, and in the sources and types of incoming data, this became difficult. The Oracle instances were filling up quickly, and it was time for a new kind of system that could handle large amounts of data. So they built Hive, so that most people with SQL skills could use the new system with minimal changes compared to other RDBMSs.
Hive is a data warehouse implementation on top of HDFS. It provides a SQL-like dialect to interact with data; the syntax is closest to MySQL's. However, Hive is not a complete database: it does not conform to the ANSI SQL standard, and it is not OLTP-ready yet, though it can be used as an OLAP system. One of its disadvantages is that there is no provision for row-level Insert, Update, or Delete. Hive works on files stored in HDFS, and the data is exposed via tables in Hive. It is best suited for large data-warehouse-style operations such as analytics, report generation, and mining.
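Because Hive tables are simply a view over files already sitting in HDFS, a common first step is to declare an external table over an existing directory and then query it with the SQL-like dialect. A minimal sketch, assuming a hypothetical directory `/data/weblogs` containing tab-delimited log files:

```sql
-- Hypothetical example: expose tab-delimited files already in HDFS as a table.
CREATE EXTERNAL TABLE weblogs (
  ip      STRING,
  ts      STRING,
  url     STRING,
  status  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';

-- Familiar SQL-style analytics over the raw files.
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status;
```

Note that dropping an external table removes only the table metadata; the underlying files in HDFS are left intact.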
Hive can be used by database administrators, architects, and analysts without them having to learn much that is new. Architects who have been designing databases and tables can do all of that, and now on even more data. Hive supports the following as part of its data model: tables, partitions, and buckets. Analysts will be productive without having to learn a lot of new material, writing queries similar to SQL. Many organizations have started using Hive as part of their ETL pipelines and have discontinued older systems.
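The data-model concepts above (tables, partitions, and buckets) map directly onto HiveQL DDL. A minimal sketch, using a hypothetical `page_views` table partitioned by date and bucketed by user:

```sql
-- Hypothetical example: a partitioned and bucketed table.
CREATE TABLE page_views (
  user_id  BIGINT,
  url      STRING
)
PARTITIONED BY (view_date STRING)        -- each date becomes an HDFS subdirectory
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- rows hashed by user_id into 32 files per partition

-- A query that filters on the partition column reads only that subdirectory.
SELECT COUNT(*)
FROM page_views
WHERE view_date = '2014-01-15';
```

Partition pruning and bucketing both reduce the amount of HDFS data a query has to scan, which is what makes Hive practical for the large analytical workloads described above.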
