Posts

How to change sqoop saved job parameters

At first I thought it was impossible to reset incremental.last.value in a Sqoop job. After creating a Sqoop job for incremental import of data from SQL Server to Hadoop, a request came in to re-dump the data. Flushing the previously ingested data in Hadoop is effortless with the command hdfs dfs -rm /location/to/dir/part*, but the tough part comes when you try to execute the Sqoop job again: it runs without pulling any records because it sees no new increment. So, instead of removing the Sqoop job and creating another one afresh, the way to go is to reset incremental.last.value to 0. This can be done by changing the stored value in the Sqoop metastore. The steps involve:
- Navigate to the home directory using cd ~
- Locate the Sqoop metastore using ls -a; you'll see a directory named .sqoop
- cd into it with cd .sqoop
- Open the file with vi metastore.db.script or nano metastore.db.script
- Scroll all the way down to s...
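The hand edit described above can also be scripted. This is a minimal sketch: the exact line layout inside metastore.db.script can differ between Sqoop versions, so the sample line created below is an assumption for illustration only.

```shell
# Create a sample line resembling what the Sqoop metastore script stores
# (the real file lives at ~/.sqoop/metastore.db.script; format may vary).
cat > metastore.db.script <<'EOF'
INSERT INTO SQOOP_SESSIONS VALUES('visits','incremental.last.value','1500','SqoopOptions')
EOF

# Replace whatever last value is stored with 0, keeping a backup copy.
sed -i.bak "s/\('incremental.last.value',\)'[^']*'/\1'0'/" metastore.db.script

grep incremental.last.value metastore.db.script
```

On the next `sqoop job --exec`, the job would then pull everything above 0 again.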

Hive Tutorial 37 : Performance Tuning

Performance plays a key role in big data projects because they deal with huge amounts of data. If you keep a few things in mind when using Hive, you can see a dramatic change in performance. Performance tuning in Hive covers:
- Partitions
- Bucketing
- File formats
- Compression
- Sampling
- Tez
- Vectorization
- Parallel execution
- CBO
Partitions: The concept of partitioning in Hive is very similar to what we have in an RDBMS. A table can be partitioned by one or more keys, and this determines how the data is stored in the table. For example, if a table has three columns, id, name, and age, and is partitioned by age, all the rows having the same age are stored together. So when we query based on an age range, Hive retrieves the data by going into the relevant partition directories instead of parsing through the whole dataset: /hdfs/user/tablename/age=10 /hdfs/user/tablename/age=11 Bucketing: Bucketing is more efficient for sampling, data will be segre...
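The partition layout above can be mimicked with plain directories (Hive names partition directories key=value on HDFS; here local directories and an illustrative warehouse path stand in for HDFS):

```shell
# One directory per partition value, each holding that partition's files.
mkdir -p warehouse/tablename/age=10 warehouse/tablename/age=11
echo "1,alice,10" > warehouse/tablename/age=10/part-00000
echo "2,bob,11" > warehouse/tablename/age=11/part-00000

# A query filtered on age = 10 only needs to read this one directory,
# which is the essence of partition pruning:
cat warehouse/tablename/age=10/part-00000
```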

Top 10 best free books for learning R

1. The Art of R Programming by Norm Matloff. Link: https://bit.ly/1ydgTrJ
2. Data Mining with Rattle and R by Graham Williams. Link: https://bit.ly/2h7YgGm
3. ggplot2 by Hadley Wickham. Link: https://bit.ly/2Dt8AmP
4. R for Data Science by Garrett Grolemund, Hadley Wickham. Link: https://bit.ly/2tfmalX
5. R in Action by Robert Kabacoff. Link: https://bit.ly/2tLgOQp
6. Machine Learning with R by Brett Lantz. Link: https://bit.ly/2KtNPaT
7. R and Data Mining: Examples and Case Studies by Yanchang Zhao. Link: https://bit.ly/2sD4QtW
8. The R Book by Michael J. Crawley. Link: https://bit.ly/2BPobcO
9. An Introduction to Statistical Learning in R by Gareth James, Daniela Witten. Link: https://bit.ly/1iUJso0
10. R through Excel. Link: https://bit.ly/2lF0sVN
NOTE: These PDF links are not a copyright infringement; they are f...

Spark : Catalyst Optimizer

Most of the power of Spark SQL comes from the Catalyst optimizer, so let's have a look into it. Catalyst has two primary goals:
- Make adding new optimization techniques easy
- Enable external developers to extend the optimizer
Spark SQL uses Catalyst's transformation framework in four phases:
- Analyzing a logical plan to resolve references
- Logical plan optimization
- Physical planning
- Code generation to compile parts of the query to Java bytecode
Analysis: The analysis phase involves looking at a SQL query or a DataFrame and creating a logical plan out of it, which is still unresolved (the columns referred to may not exist or may be of the wrong datatyp...

Sqoop tutorial 15 : Free-Form Query

Importing Data from Two Tables: Instead of using table import, use free-form query import. In this mode, Sqoop allows you to specify any query for importing data. Instead of the parameter --table, use the parameter --query with the entire query for obtaining the data you would like to transfer.

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --query 'SELECT normcities.id, \
    countries.country, \
    normcities.city \
    FROM normcities \
    JOIN countries USING(country_id) \
    WHERE $CONDITIONS' \
  --split-by id \
  --target-dir cities

Custom Boundary Queries: You can specify any valid query to fetch the minimum and maximum values of the --split-by column using the --boundary-query parameter:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --query 'SELECT normcities.id, \
    countries.country, \
    normcities.city \
    FROM normcities \
    JOIN countries USING(coun...
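To see why the boundary values matter, here is a rough sketch of how a MIN/MAX pair is divided into per-mapper ranges. The mapper count and the values are illustrative, and Sqoop's real split logic differs in details; this only shows the idea.

```shell
# Pretend the boundary query returned MIN(id)=1 and MAX(id)=100,
# to be shared among 4 map tasks (both numbers are made up here).
min=1; max=100; mappers=4
step=$(( (max - min + 1) / mappers ))

# Each mapper gets a contiguous slice of the id range.
for i in 0 1 2 3; do
  lo=$(( min + i * step ))
  hi=$(( lo + step - 1 ))
  echo "mapper $i: WHERE id >= $lo AND id <= $hi"
done
```

A skewed or stale MIN/MAX makes these slices uneven, which is exactly what a custom --boundary-query lets you avoid.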

Sqoop tutorial 14 : Incrementally Importing Mutable Data

Use the lastmodified mode instead of the append mode. For example, use the following command to transfer rows whose value in column last_update_date is greater than 2013-05-22 01:01:01:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-05-22 01:01:01"

You can take advantage of the built-in Sqoop metastore, which allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:

sqoop job \
  --create visits \
  -- \
  import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 0

And start it with the --exec parameter:

sqoop job --exec visits
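The filtering that lastmodified mode performs can be sketched with a flat file standing in for the visits table (the file name, column layout, and timestamps below are illustrative, not what Sqoop actually reads):

```shell
# Each row: id,last_update_date
cat > visits.csv <<'EOF'
1,2013-05-21 09:00:00
2,2013-05-22 01:01:01
3,2013-05-23 14:30:00
EOF

last="2013-05-22 01:01:01"
# ISO-8601 timestamps sort correctly as plain strings, so a strictly-greater
# string comparison selects only rows updated after the stored last value.
awk -F, -v lv="$last" '$2 > lv' visits.csv
```

Note that the row exactly equal to the last value is not re-transferred; only strictly newer rows are.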

Sqoop tutorial 13 : Importing Only New Data

Activate Sqoop’s incremental feature by specifying the --incremental parameter. The parameter’s value is the type of incremental import. When your table is only getting new rows and the existing ones are not changed, use the append mode. Incremental import also requires two additional parameters: --check-column indicates a column name that should be checked for newly appended data, and --last-value contains the last value that was successfully imported into Hadoop. The following example will transfer only those rows whose value in column id is greater than 1:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 1
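What append mode does with --check-column and --last-value can be sketched against a flat file standing in for the table (file name and columns are illustrative): import only rows above the stored value, then remember the new maximum for the next run.

```shell
# Each row: id,name
cat > visits.csv <<'EOF'
1,alice
2,bob
3,carol
EOF

last_value=1
# Import only rows whose id exceeds the stored last value:
awk -F, -v lv="$last_value" '$1 > lv' visits.csv > imported.csv

# The new last value is the highest id just imported, which Sqoop would
# store for the next incremental run:
new_last=$(awk -F, '{print $1}' imported.csv | sort -n | tail -1)
echo "$new_last"
```

Here rows 2 and 3 are imported and the saved state advances to 3, so a repeat run imports nothing until new rows appear.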