Performance plays a key role in big data projects because they deal with huge amounts of data. If you keep a few things in mind when using Hive, you can see a dramatic change in performance.

Performance tuning in Hive:
Partitions
Bucketing
File formats
Compression
Sampling
Tez
Vectorization
Parallel execution
CBO

Partitions:
The concept of partitioning in Hive is very similar to what we have in an RDBMS. A table can be partitioned by one or more keys, which determines how the data is stored in the table. For example, if a table has three columns, id, name and age, and is partitioned by age, all the rows having the same age will be stored together. So when we query based on an age range, Hive retrieves the data by going into the particular folders instead of parsing through the whole data.

/hdfs/user/tablename/age=10
/hdfs/user/tablename/age=11

Bucketing:
Bucketing is more efficient for sampling, data will be segre...
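
Going back to the partitioning example, the folder lookup can be sketched in a few lines of Python. This is a toy model of partition pruning, not Hive's implementation: the PARTITIONS mapping and the prune_partitions helper are hypothetical names, and the directory names assume Hive's key=value convention for partition folders.

```python
from typing import Dict, List

# Hypothetical layout: one directory per distinct age value (ages 8 to 13).
# Assumption: partition folders are named key=value, as Hive does.
PARTITIONS: Dict[int, str] = {
    age: f"/hdfs/user/tablename/age={age}" for age in range(8, 14)
}


def prune_partitions(partitions: Dict[int, str], low: int, high: int) -> List[str]:
    """Return only the directories whose partition value lies in [low, high]."""
    return [path for age, path in sorted(partitions.items()) if low <= age <= high]


if __name__ == "__main__":
    # A query on ages 10 to 11 only needs to read two of the six folders.
    for path in prune_partitions(PARTITIONS, 10, 11):
        print(path)
```

The point of the sketch is that the filter is applied to the folder names before any row is read, so the cost of the query depends on the number of matching partitions and not on the size of the whole table.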