Hive Tutorial 28 : Hive vs Pig

Pig was developed at Yahoo in 2006 to give its engineers an ad-hoc way to create and run MapReduce jobs on very large data sets. The main motive behind Pig was to cut development time through its multi-query approach. Pig is a high-level data flow system built around a simple language, Pig Latin, which is used to express data transformations and queries.

When to use Hive, when to use Pig?
If you know SQL, then Hive will be very familiar to you. Hive's query language, HiveQL, closely follows SQL, so you will feel at home with the familiar select, where, group by, and order by clauses from relational databases. You do, however, give up some ability to tune the query yourself, because you rely on the Hive optimizer. This seems to be the case for any implementation of SQL on any platform, Hadoop or traditional RDBMS: ironically, hints are sometimes needed to teach the automatic optimizer how to optimize properly.

Compared to Hive, however, Pig takes some mental adjustment for SQL users. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!). Pig code is more verbose, although it’s still a fraction of what straight Java MapReduce programs require. In return, because a Pig script spells out each step of the data flow explicitly, Pig gives you more control over that flow, and more room to optimize it, than Hive does.
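
To make the syntax difference concrete, here is a minimal sketch of the same aggregation written in Pig Latin, with a roughly equivalent HiveQL query shown in the comments. The input file page_views.tsv, the table page_views, and the columns user_id, url, and duration are hypothetical names chosen only for illustration.

```pig
-- Assumption: a tab-separated file 'page_views.tsv' with columns
-- (user_id, url, duration). All names here are hypothetical.
--
-- Roughly equivalent HiveQL, assuming a table page_views with the same columns:
--   SELECT user_id, COUNT(*) AS view_count, SUM(duration) AS total_secs
--   FROM page_views
--   WHERE duration > 0
--   GROUP BY user_id
--   ORDER BY total_secs DESC;

raw     = LOAD 'page_views.tsv' USING PigStorage('\t')
          AS (user_id:chararray, url:chararray, duration:int);

-- WHERE becomes an explicit FILTER step.
valid   = FILTER raw BY duration > 0;

-- GROUP does not aggregate by itself: each output row is
-- (group, {bag of the matching rows from valid}).
by_user = GROUP valid BY user_id;

-- Aggregates are applied to the bag in a separate FOREACH step.
totals  = FOREACH by_user GENERATE
              group               AS user_id,
              COUNT(valid)        AS view_count,
              SUM(valid.duration) AS total_secs;

ordered = ORDER totals BY total_secs DESC;
DUMP ordered;

-- FLATTEN un-nests the bag again: one output row per (user_id, url) pair.
user_urls = FOREACH by_user GENERATE group AS user_id, FLATTEN(valid.url) AS url;
DUMP user_urls;
```

Each named relation (raw, valid, by_user, and so on) is one step in the data flow. That is why the script is longer than the single HiveQL statement, and also why you can inspect or rearrange any intermediate step.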
