Posts

How to change sqoop saved job parameters

At first I thought it was impossible to reset incremental.last.value in a Sqoop job. After creating a Sqoop job for incremental import of data from SQL Server to Hadoop, a request came in to re-dump the data. Flushing the previously ingested data in Hadoop is effortless with the command hdfs dfs -rm /location/to/dir/part*, but the tough part comes when you try to execute the Sqoop job again: it runs without pulling any records because it sees no new increment. So, instead of removing the Sqoop job and creating another one afresh, the way to go is to reset incremental.last.value to 0. This can be done by changing the stored value in the Sqoop metastore. The steps involve:
- Navigate to the home directory using cd ~
- Locate the Sqoop metastore using ls -a; you'll see a directory named .sqoop
- cd into it with cd .sqoop
- Open the file with vi metastore.db.script or nano metastore.db.script
- Scroll all the way down to s...
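The hand edit described above can also be scripted. This is a minimal sketch: the exact line layout inside metastore.db.script can differ between Sqoop versions, so the sample line created below is an assumption for illustration only.

```shell
# Create a sample line resembling what the Sqoop metastore script stores
# (the real file lives at ~/.sqoop/metastore.db.script; format may vary).
cat > metastore.db.script <<'EOF'
INSERT INTO SQOOP_SESSIONS VALUES('visits','incremental.last.value','1500','SqoopOptions')
EOF

# Replace whatever last value is stored with 0, keeping a backup copy.
sed -i.bak "s/\('incremental.last.value',\)'[^']*'/\1'0'/" metastore.db.script

grep incremental.last.value metastore.db.script
```

On the next `sqoop job --exec`, the job would then pull everything above 0 again.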

Hive Tutorial 37 : Performance Tuning

Performance plays a key role in big data projects because they deal with huge amounts of data. If you keep a few things in mind when using Hive, you can see a dramatic change in performance. Performance tuning in Hive covers:
- Partitions
- Bucketing
- File formats
- Compression
- Sampling
- Tez
- Vectorization
- Parallel execution
- CBO
Partitions: The concept of partitioning in Hive is very similar to what we have in an RDBMS. A table can be partitioned by one or more keys, and this determines how the data is stored in the table. For example, if a table has three columns, id, name, and age, and is partitioned by age, all the rows having the same age are stored together. So when we query based on an age range, Hive retrieves the data by going into the relevant partition directories instead of parsing through the whole dataset: /hdfs/user/tablename/age=10 /hdfs/user/tablename/age=11 Bucketing: Bucketing is more efficient for sampling, data will be segre...
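The partition layout above can be mimicked with plain directories (Hive names partition directories key=value on HDFS; here local directories and an illustrative warehouse path stand in for HDFS):

```shell
# One directory per partition value, each holding that partition's files.
mkdir -p warehouse/tablename/age=10 warehouse/tablename/age=11
echo "1,alice,10" > warehouse/tablename/age=10/part-00000
echo "2,bob,11" > warehouse/tablename/age=11/part-00000

# A query filtered on age = 10 only needs to read this one directory,
# which is the essence of partition pruning:
cat warehouse/tablename/age=10/part-00000
```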

Top 10 best free books for learning R

1. The Art of R Programming by Norm Matloff. Link: https://bit.ly/1ydgTrJ
2. Data Mining with Rattle and R by Graham Williams. Link: https://bit.ly/2h7YgGm
3. ggplot2 by Hadley Wickham. Link: https://bit.ly/2Dt8AmP
4. R for Data Science by Garrett Grolemund, Hadley Wickham. Link: https://bit.ly/2tfmalX
5. R in Action by Robert Kabacoff. Link: https://bit.ly/2tLgOQp
6. Machine Learning with R by Brett Lantz. Link: https://bit.ly/2KtNPaT
7. R and Data Mining: Examples and Case Studies by Yanchang Zhao. Link: https://bit.ly/2sD4QtW
8. The R Book by Michael J. Crawley. Link: https://bit.ly/2BPobcO
9. An Introduction to Statistical Learning in R by Gareth James, Daniela Witten. Link: https://bit.ly/1iUJso0
10. R through Excel. Link: https://bit.ly/2lF0sVN
NOTE: These PDF links are not a copyright infringement; they are f...

Spark : Catalyst Optimizer

Most of the power of Spark SQL comes from the Catalyst optimizer, so let's have a look into it. Catalyst has two primary goals:
- Make adding new optimization techniques easy
- Enable external developers to extend the optimizer
Spark SQL uses Catalyst's transformation framework in four phases:
- Analyzing a logical plan to resolve references
- Logical plan optimization
- Physical planning
- Code generation to compile parts of the query to Java bytecode
Analysis: The analysis phase involves looking at a SQL query or a DataFrame and creating a logical plan out of it, which is still unresolved (the columns referred to may not exist or may be of the wrong datatyp...

Sqoop tutorial 15 : Free-Form Query

Importing Data from Two Tables: Instead of using table import, use free-form query import. In this mode, Sqoop allows you to specify any query for importing data. Instead of the parameter --table, use the parameter --query with the entire query for obtaining the data you would like to transfer.

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --query 'SELECT normcities.id, \
    countries.country, \
    normcities.city \
    FROM normcities \
    JOIN countries USING(country_id) \
    WHERE $CONDITIONS' \
  --split-by id \
  --target-dir cities

Custom Boundary Queries: You can specify any valid query to fetch the minimum and maximum values of the --split-by column using the --boundary-query parameter:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --query 'SELECT normcities.id, \
    countries.country, \
    normcities.city \
    FROM normcities \
    JOIN countries USING(coun...
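To see why the boundary values matter, here is a rough sketch of how a MIN/MAX pair is divided into per-mapper ranges. The mapper count and the values are illustrative, and Sqoop's real split logic differs in details; this only shows the idea.

```shell
# Pretend the boundary query returned MIN(id)=1 and MAX(id)=100,
# to be shared among 4 map tasks (both numbers are made up here).
min=1; max=100; mappers=4
step=$(( (max - min + 1) / mappers ))

# Each mapper gets a contiguous slice of the id range.
for i in 0 1 2 3; do
  lo=$(( min + i * step ))
  hi=$(( lo + step - 1 ))
  echo "mapper $i: WHERE id >= $lo AND id <= $hi"
done
```

A skewed or stale MIN/MAX makes these slices uneven, which is exactly what a custom --boundary-query lets you avoid.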

Sqoop tutorial 14 : Incrementally Importing Mutable Data

Use the lastmodified mode instead of the append mode. For example, use the following command to transfer rows whose value in column last_update_date is greater than 2013-05-22 01:01:01:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-05-22 01:01:01"

You can take advantage of the built-in Sqoop metastore, which allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:

sqoop job \
  --create visits \
  -- \
  import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 0

And start it with the --exec parameter:

sqoop job --exec visits
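The filtering that lastmodified mode performs can be sketched with a flat file standing in for the visits table (the file name, column layout, and timestamps below are illustrative, not what Sqoop actually reads):

```shell
# Each row: id,last_update_date
cat > visits.csv <<'EOF'
1,2013-05-21 09:00:00
2,2013-05-22 01:01:01
3,2013-05-23 14:30:00
EOF

last="2013-05-22 01:01:01"
# ISO-8601 timestamps sort correctly as plain strings, so a strictly-greater
# string comparison selects only rows updated after the stored last value.
awk -F, -v lv="$last" '$2 > lv' visits.csv
```

Note that the row exactly equal to the last value is not re-transferred; only strictly newer rows are.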

Sqoop tutorial 13 : Importing Only New Data

Activate Sqoop’s incremental feature by specifying the --incremental parameter. The parameter’s value is the type of incremental import. When your table is only getting new rows and the existing ones are not changed, use the append mode. Incremental import also requires two additional parameters: --check-column indicates a column name that should be checked for newly appended data, and --last-value contains the last value that was successfully imported into Hadoop. The following example will transfer only those rows whose value in column id is greater than 1:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 1
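What append mode does with --check-column and --last-value can be sketched against a flat file standing in for the table (file name and columns are illustrative): import only rows above the stored value, then remember the new maximum for the next run.

```shell
# Each row: id,name
cat > visits.csv <<'EOF'
1,alice
2,bob
3,carol
EOF

last_value=1
# Import only rows whose id exceeds the stored last value:
awk -F, -v lv="$last_value" '$1 > lv' visits.csv > imported.csv

# The new last value is the highest id just imported, which Sqoop would
# store for the next incremental run:
new_last=$(awk -F, '{print $1}' imported.csv | sort -n | tail -1)
echo "$new_last"
```

Here rows 2 and 3 are imported and the saved state advances to 3, so a repeat run imports nothing until new rows appear.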