Hive Tutorial 31 : Analytic Functions

Hadoop Hive analytic functions compute an aggregate value based on a group of rows. Unlike a plain aggregate used with GROUP BY, an analytic function returns a value for every row in the group. A Hadoop Hive HQL analytic function works on the group of rows and ignores NULLs in the data if you specify
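The examples in this tutorial query a patient table that is not defined in this section. A minimal sketch of a matching table is given here so the queries can be tried out; the rows are reconstructed from the query outputs shown in this post, while the column types are assumptions.

```sql
-- Hypothetical definition of the patient table used by the examples.
-- Column types are assumptions; the rows come from the outputs in this post.
create table if not exists patient (
  pat_id   int,
  dept_id  int,
  ins_amt  int
);

-- INSERT ... VALUES requires Hive 0.14 or later.
insert into table patient values
  (1, 111, 100000),
  (2, 111, 150000),
  (5, 111, 50000),
  (6, 111, 90000),
  (3, 222, 150000),
  (4, 222, 250000),
  (5, 222, 890000),
  (7, 333, 110000),
  (8, 444, 10000);
```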


Hadoop Hive COUNT Analytic Function

Returns the number of rows in the query result or in each group of rows.
Syntax:
COUNT(column reference | value expression | *) OVER (window_spec)
For example:
select pat_id,
dept_id,
count(*) over (partition by dept_id order by dept_id asc) as pat_cnt
from patient;

pat_id  dept_id  pat_cnt
6       111      4
2       111      4
5       111      4
1       111      4
4       222      3
5       222      3
3       222      3
7       333      1
8       444      1
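The query above orders each window by the same column it is partitioned by, so all rows in a partition are peers and the count covers the whole partition. As a sketch, assuming the same patient table, dropping the ORDER BY should give the same pat_cnt values:

```sql
-- Without ORDER BY the default window frame is the entire partition,
-- so each row carries the total row count of its department.
select pat_id,
       dept_id,
       count(*) over (partition by dept_id) as pat_cnt
from patient;
```

The order of rows in the result is not guaranteed in either form; add an ORDER BY at the end of the query if a fixed order matters.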

Hadoop Hive SUM Analytic Function

Like the COUNT function, the SUM Hive analytic function computes the sum of a column or expression, either over all rows of the table or over the rows within each group.
Syntax:
SUM(column | expression) OVER( window_spec)
For example:
Calculate the total insured amount of all patients within each department. The query and output are as follows:
select pat_id,
dept_id,
ins_amt,
sum(ins_amt) over (partition by dept_id order by dept_id asc) as total_ins_amt
from patient;
pat_id  dept_id  ins_amt  total_ins_amt
6       111      90000    390000
2       111      150000   390000
5       111      50000    390000
1       111      100000   390000
4       222      250000   1290000
5       222      890000   1290000
3       222      150000   1290000
7       333      110000   110000
8       444      10000    10000
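The same function can also produce a running total instead of a per-department total. The following is a sketch under the assumption that the window is ordered by ins_amt and given an explicit frame; the alias running_ins_amt is a hypothetical name, not part of the original example.

```sql
-- Running total of ins_amt within each department, smallest amount first.
-- The explicit ROWS frame limits the sum to the rows up to and
-- including the current one.
select pat_id,
       dept_id,
       ins_amt,
       sum(ins_amt) over (partition by dept_id
                          order by ins_amt
                          rows between unbounded preceding and current row) as running_ins_amt
from patient;
```

With the data shown above, department 111 would accumulate as 50000, 140000, 240000 and 390000, ending at the same total as total_ins_amt.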

Hadoop Hive MIN and MAX Analytic Function

Like the Hive HQL aggregate MIN and MAX functions, the Hadoop Hive analytic MIN and MAX functions compute the minimum and maximum of a column or expression, either over all rows or over the rows within each group.
Syntax:
MIN(column | expression) OVER( window_spec)
MAX(column | expression) OVER( window_spec)
For example:
Calculate the minimum and maximum insured amount of all patients within each department. The query and output are as follows:
select pat_id,
dept_id,
ins_amt,
min(ins_amt) over (partition by dept_id order by dept_id asc) as min_ins_amt,
max(ins_amt) over (partition by dept_id order by dept_id asc) as max_ins_amt
from patient;
pat_id  dept_id  ins_amt  min_ins_amt  max_ins_amt
6       111      90000    50000        150000
2       111      150000   50000        150000
5       111      50000    50000        150000
1       111      100000   50000        150000
4       222      250000   150000       890000
5       222      890000   150000       890000
3       222      150000   150000       890000
7       333      110000   110000       110000
8       444      10000    10000        10000
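When several functions share one window specification, Hive lets you name it once in a WINDOW clause. The sketch below assumes the same patient table; the window name w is a hypothetical choice.

```sql
-- Both functions reuse the named window w instead of repeating
-- the PARTITION BY clause.
select pat_id,
       dept_id,
       ins_amt,
       min(ins_amt) over w as min_ins_amt,
       max(ins_amt) over w as max_ins_amt
from patient
window w as (partition by dept_id);
```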

Hadoop Hive LEAD and LAG Analytic Function

The LEAD and LAG Hadoop Hive analytic functions are used to compare different rows of a table by specifying an offset from the current row. You can use these functions to analyze change and variation in the data.
Syntax:
LEAD(column, offset, default) OVER (window_spec)
LAG(column, offset, default) OVER (window_spec)
The offset is the relative position of the row to be accessed; its default value is 1. LEAD reads the row that many positions after the current row, and LAG reads the row that many positions before it. If there is no such row, the function returns NULL. You can replace this NULL by specifying a “default” value.
For example:
Get the insured amount of the next and previous patients relative to the current row in each department. The query and output are as follows:
select pat_id,
dept_id,
ins_amt,
lead(ins_amt,1,0) over (partition by dept_id order by dept_id asc) as lead_ins_amt,
lag(ins_amt,1,0) over (partition by dept_id order by dept_id asc) as lag_ins_amt
from patient;
pat_id  dept_id  ins_amt  lead_ins_amt  lag_ins_amt
6       111      90000    150000        0
2       111      150000   50000         90000
5       111      50000    100000        150000
1       111      100000   0             50000
4       222      250000   890000        0
5       222      890000   150000        250000
3       222      150000   0             890000
7       333      110000   0             0
8       444      10000    0             0
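In the query above the window is ordered by dept_id, which is constant within each partition, so which row counts as "next" or "previous" is not fixed by the query itself. A sketch that orders by ins_amt instead, and uses LAG to compute the change from the previous row, might look like this; the aliases prev_ins_amt, next_ins_amt and diff_from_prev are hypothetical names.

```sql
-- Order each department by ins_amt so the previous and next rows
-- are well defined.
select pat_id,
       dept_id,
       ins_amt,
       lag(ins_amt, 1, 0)  over (partition by dept_id order by ins_amt) as prev_ins_amt,
       lead(ins_amt, 1, 0) over (partition by dept_id order by ins_amt) as next_ins_amt,
       -- No default here: the first row of each department yields NULL.
       ins_amt - lag(ins_amt) over (partition by dept_id order by ins_amt) as diff_from_prev
from patient;
```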

Hadoop Hive FIRST_VALUE and LAST_VALUE Analytic Function

You can use the Hadoop Hive first_value and last_value analytic functions to find the first and last values of a column or expression within a group of rows. You must specify the sort criteria to determine the first and last values.
Syntax:
FIRST_VALUE(column | expression) OVER (window_spec)
LAST_VALUE(column | expression) OVER (window_spec)
For example:
Compute the lowest insured amount, and the highest insured amount seen so far in ascending order of ins_amt, in each department. The query and output are as follows:
select pat_id,
dept_id,
ins_amt,
first_value(ins_amt) over (partition by dept_id order by ins_amt) as low_ins_amt,
last_value(ins_amt) over (partition by dept_id order by ins_amt) as high_ins_amt
from patient;
pat_id  dept_id  ins_amt  low_ins_amt  high_ins_amt
5       111      50000    50000        50000
6       111      90000    50000        90000
1       111      100000   50000        100000
2       111      150000   50000        150000
3       222      150000   150000       150000
4       222      250000   150000       250000
5       222      890000   150000       890000
7       333      110000   110000       110000
8       444      10000    10000        10000
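In the output above, high_ins_amt equals each row's own ins_amt. This is because a window with ORDER BY and no explicit frame ends at the current row by default, so last_value never looks past it. To get the true highest amount per department on every row, one option is to state the frame explicitly, as in this sketch against the same patient table:

```sql
-- An explicit frame spanning the whole partition lets last_value
-- see the final (largest) ins_amt from every row.
select pat_id,
       dept_id,
       ins_amt,
       first_value(ins_amt) over (partition by dept_id order by ins_amt
                                  rows between unbounded preceding and unbounded following) as low_ins_amt,
       last_value(ins_amt)  over (partition by dept_id order by ins_amt
                                  rows between unbounded preceding and unbounded following) as high_ins_amt
from patient;
```

With this frame, every row of department 111 would show a high_ins_amt of 150000 rather than a value that grows row by row.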

Hadoop Hive ROW_NUMBER, RANK and DENSE_RANK Analytic Functions

The row_number Hive analytic function assigns a unique sequential number to each row, or to each row within a group, based on the ordering specified in the OVER clause.
The rank Hive analytic function returns the rank of each row within the result or within a group. Rows with equal values receive the same rank, and the rank values that follow a tie are skipped. The rank analytic function is used in top-n analysis.
The dense_rank Hive analytic function returns the rank of a value in a group. Rows with equal values for the ranking criteria receive the same rank, and ranks are assigned in sequential order, i.e. no rank values are skipped. The dense_rank analytic function is also used in top-n analysis.
Syntax:
ROW_NUMBER() OVER (window_spec)
RANK() OVER (window_spec)
DENSE_RANK() OVER (window_spec)
For example:
Assign a row number, rank and dense rank based on insured amount using Hadoop Hive analytic functions. The query and output are as follows:
select pat_id,
dept_id,
ins_amt,
row_number() over (order by ins_amt) as rn,
rank() over (order by ins_amt) as rk,
dense_rank() over (order by ins_amt) as dense_rk
from patient;
pat_id  dept_id  ins_amt  rn  rk  dense_rk
8       444      10000    1   1   1
5       111      50000    2   2   2
6       111      90000    3   3   3
1       111      100000   4   4   4
7       333      110000   5   5   5
2       111      150000   6   6   6
3       222      150000   7   6   6
4       222      250000   8   8   7
5       222      890000   9   9   8
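The top-n analysis mentioned above is usually done by ranking inside a subquery and filtering on the rank outside it, since an analytic function cannot be used directly in a WHERE clause. The following sketch assumes the same patient table and picks the two highest insured amounts per department; the subquery alias ranked and the cutoff of 2 are illustrative choices.

```sql
-- Rank patients within each department, highest ins_amt first,
-- then keep only the top two ranks.
select pat_id, dept_id, ins_amt, rk
from (
  select pat_id,
         dept_id,
         ins_amt,
         rank() over (partition by dept_id order by ins_amt desc) as rk
  from patient
) ranked
where rk <= 2;
```

Because rank gives tied rows the same value, a department can return more than two rows here; row_number could be used instead if exactly two rows per department are wanted.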
