Posts

Hive Tutorial 33 : Incremental Updates

Image
INCREMENTAL UPDATES Hadoop and Hive are quickly evolving to outgrow previous limitations for integration and data access. On the near-term development roadmap, we expect to see Hive supporting full CRUD operations (Insert, Select, Update, Delete). As we wait for these advancements, there is still a need to work with the current options—OVERWRITE or APPEND— for Hive table integration. The OVERWRITE option requires moving the complete record set from source to Hadoop. While this approach may work for smaller data sets, it may be prohibitive at scale. The APPEND option can limit data movement only to new or updated records. As true Inserts and Updates are not yet available in Hive, we need to consider a process of preventing duplicate records as Updates are appended to the cumulative record set. In this blog, we will look at a four-step strategy for appending Updates and Inserts from delimited and RDBMS sources to existing Hive table definitions. While there are several opt...

Hive Tutorial 32 : Appending new data to existing data

It can be achieved in couple of ways (Purely depends on your requirement) If you don't bother about overwriting the existing records in the partition, (I mean you don't have a big history data, say 10 yrs data), then Insert Overwrite might fit. INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; If you don't bother about duplicates in the partition, then Insert Into might fit (Honestly I wudn't prefer to have duplicate records). INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; If you have history data plus Incremental data, then History data can be inserted once and the incremental data(based on the frequency that you choose daily/weekly/fortnightly basis) can be inserted using a Insert Overwrite

Hive Tutorial 31 : Analytic Functions

Image
Hadoop Hive analytic functions compute an aggregate value that is based on a group of rows. A Hadoop Hive HQL analytic function works on the group of rows and ignores the NULL in the data if you specify Hadoop Hive COUNT Analytic Function Returns number of rows in query or group of rows. Syntax: COUNT(column reference | value expression | *) over(window_spec) For Example; select pat_id, dept_id, count(*) over (partition by dept_id order by dept_id asc) as pat_cnt from patient; at_id dept_id pat_cnt 6 111 4 2 111 4 5 111 4 1 111 4 4 222 3 5 222 3 3 222 3 7 333 1 8 444 1 Hadoop Hive SUM Analytic Function Just like count function, sum Hive analytic function is used to compute the sum of columns or expression. Sum analytic function is used to compute the sum of all rows of table or rows within the groups. Syntax: SUM(column | expression) OVER( window_spec) For example: Calculate sum insured amount of all patients within each department...