Hadoop Hive Analytic Functions

Hadoop Hive analytic functions compute an aggregate value based on a group of rows. A Hive HQL analytic function works on a group of rows and ignores NULL values in the data.

Hadoop Hive COUNT Analytic Function

Returns the number of rows in the query or group of rows.

Syntax:

    COUNT(column reference | value expression | *) OVER (window_spec)

For example:

    select pat_id,
           dept_id,
           count(*) over (partition by dept_id order by dept_id asc) as pat_cnt
    from patient;

    pat_id  dept_id  pat_cnt
    6       111      4
    2       111      4
    5       111      4
    1       111      4
    4       222      3
    5       222      3
    3       222      3
    7       333      1
    8       444      1

Hadoop Hive SUM Analytic Function

Just like the COUNT function, the SUM Hive analytic function is used to compute the sum of a column or expression, either across all rows of a table or within groups.

Syntax:

    SUM(column | expression) OVER (window_spec)

For example: calculate the sum insured amount of all patients within each department...
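A minimal sketch of that SUM example, assuming the patient table carries an insured_amount column (the column name is an assumption; the original example is truncated above):

    -- Assumed schema: patient(pat_id INT, dept_id INT, insured_amount DECIMAL(10,2))
    select pat_id,
           dept_id,
           insured_amount,
           -- total insured amount across all rows in the same department
           sum(insured_amount) over (partition by dept_id) as dept_insured_total
    from patient;

Because the OVER clause has no ORDER BY, every row in a department gets the same department-wide total rather than a running sum.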
Hive Performance Tuning

Performance plays a key role in big data projects because they deal with huge amounts of data. If you keep a few things in mind when using Hive, you can see a dramatic change in performance.

Performance tuning in Hive:

Partitions
Bucketing
File formats
Compression
Sampling
Tez
Vectorization
Parallel execution
CBO

Partitions: The concept of partitioning in Hive is very similar to what we have in an RDBMS. A table can be partitioned by one or more keys, and this determines how the data is stored in the table. For example, if a table has three columns, id, name and age, and is partitioned by age, all the rows having the same age are stored together. When we query on an age range, Hive retrieves the data by reading only the matching folders instead of parsing the whole data set (see the sketch after this section):

    /hdfs/user/tablename/age/10
    /hdfs/user/tablename/age/11

Bucketing: Bucketing is more efficient for sampling; data will be segre...
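A minimal sketch of partitioned and bucketed table definitions, assuming a simple person data set; the table and column names are illustrative:

    -- Partitioned table: Hive creates one HDFS subdirectory per age value
    CREATE TABLE person (
      id   INT,
      name STRING
    )
    PARTITIONED BY (age INT);

    -- Bucketed variant: rows are hashed on id into 4 files per partition,
    -- which also makes efficient sampling possible
    CREATE TABLE person_bucketed (
      id   INT,
      name STRING
    )
    PARTITIONED BY (age INT)
    CLUSTERED BY (id) INTO 4 BUCKETS;

    -- A query filtering on the partition key reads only the matching directory
    SELECT name FROM person WHERE age = 10;

Note that the partition column is declared in PARTITIONED BY, not in the main column list, because Hive stores it in the directory path rather than in the data files.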
Resetting incremental.last.value in a Sqoop Job

At the initial stage I thought it was impossible to reset the incremental.last.value in a Sqoop job. After creating a Sqoop job for incremental import of data from SQL Server to Hadoop, a request came in to re-dump the data. Although flushing the previously ingested data in Hadoop is effortless using the command

    hdfs dfs -rm /location/to/dir/part*

the tough part comes when trying to execute the Sqoop job again. The job executes without pulling any records because it sees no increment in the source. So, instead of removing the Sqoop job and creating a new one afresh, the way to go is to reset incremental.last.value to 0. This can be done by changing the value of the last record in the Sqoop metastore. The steps involved (a consolidated sketch follows below):

Navigate to the home directory using cd ~
Locate the Sqoop metastore using ls -a
You'll see a dir named .sqoop; cd into it with cd .sqoop
vi metastore.db.script or nano metastore.db.script
Scroll all the way down to s...
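A sketch of the same walk-through as shell commands, assuming a local metastore in ~/.sqoop and a job named myjob (the job name and the exact stored line are illustrative; the metastore file layout can differ between Sqoop versions):

    # remove the previously ingested data from HDFS
    hdfs dfs -rm /location/to/dir/part*

    # open the local Sqoop metastore script
    cd ~/.sqoop
    vi metastore.db.script

    # inside metastore.db.script, near the bottom, look for the row that
    # stores the job's incremental.last.value, e.g. (illustrative):
    #   INSERT INTO SQOOP_SESSIONS VALUES('myjob','incremental.last.value','12345',...)
    # and change the stored value to 0:
    #   INSERT INTO SQOOP_SESSIONS VALUES('myjob','incremental.last.value','0',...)

    # re-run the job; it now re-imports everything from the beginning
    sqoop job --exec myjob

You can confirm the reset before re-running by inspecting the job with sqoop job --show myjob, which prints the stored incremental.last.value.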