Posts

Showing posts from June, 2018

Top 10 best free books for learning R

1. The Art of R Programming by Norm Matloff. Link: https://bit.ly/1ydgTrJ
2. Data Mining with Rattle and R by Graham Williams. Link: https://bit.ly/2h7YgGm
3. ggplot2 by Hadley Wickham. Link: https://bit.ly/2Dt8AmP
4. R for Data Science by Garrett Grolemund, Hadley Wickham. Link: https://bit.ly/2tfmalX
5. R in Action by Robert Kabacoff. Link: https://bit.ly/2tLgOQp
6. Machine Learning with R by Brett Lantz. Link: https://bit.ly/2KtNPaT
7. R and Data Mining: Examples and Case Studies by Yanchang Zhao. Link: https://bit.ly/2sD4QtW
8. The R Book by Michael J. Crawley. Link: https://bit.ly/2BPobcO
9. An Introduction to Statistical Learning in R by Gareth James, Daniela Witten. Link: https://bit.ly/1iUJso0
10. R through Excel. Link: https://bit.ly/2lF0sVN

NOTE: The PDF links are not a copyright infringement; they are f...

Spark : Catalyst Optimizer

Most of the power of Spark SQL comes from the Catalyst optimizer, so let's have a look into it.

The Catalyst optimizer has two primary goals:
· Make adding new optimization techniques easy
· Enable external developers to extend the optimizer

Spark SQL uses Catalyst's transformation framework in four phases:
· Analyzing a logical plan to resolve references
· Logical plan optimization
· Physical planning
· Code generation to compile parts of the query to Java bytecode

Analysis
The analysis phase looks at a SQL query or a DataFrame and creates a logical plan out of it, which is still unresolved (the columns referred to may not exist or may be of the wrong datatyp...
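To see what each of these phases produces for a concrete query, you can ask Spark to print the plans. A minimal sketch in spark-shell (the DataFrame and expressions here are purely illustrative, not from the original post):

// spark-shell: a SparkSession named `spark` is already in scope
val df = spark.range(100)              // small hypothetical dataset
  .filter("id > 50")                   // becomes a Filter node in the logical plan
  .selectExpr("id * 2 AS doubled")     // becomes a Project node

// Prints the parsed and analyzed logical plans, the Catalyst-optimized
// logical plan, and the selected physical plan
df.explain(true)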

Sqoop tutorial 15 : Free-Form Query

Importing Data from Two Tables: Instead of using table import, use free-form query import. In this mode, Sqoop will allow you to specify any query for importing data. Instead of the parameter --table, use the parameter --query with the entire query for obtaining the data you would like to transfer.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities

Custom Boundary Queries: You can specify any valid query to fetch the minimum and maximum values of the --split-by column using the --boundary-query parameter:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(coun...
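A sketch of a complete --boundary-query command, assuming the boundary query simply returns the minimum and maximum id of the driving table (the rest of the command mirrors the free-form query import above):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities \
--boundary-query "SELECT MIN(id), MAX(id) FROM normcities"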

Sqoop tutorial 14 : Incrementally Importing Mutable Data

Use the lastmodified mode instead of the append mode. For example, use the following command to transfer rows whose value in column last_update_date is greater than 2013-05-22 01:01:01:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2013-05-22 01:01:01"

You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:

sqoop job \
--create visits \
-- \
import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 0

And start it with the --exec parameter:

sqoop job --exec visits
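To check which jobs are stored in the metastore, or to inspect the parameters a saved job currently holds (including the automatically updated last value), the same sqoop job tool can be used; a short sketch with the job name from the example above:

sqoop job --list
sqoop job --show visits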

Sqoop tutorial 13 : Importing Only New Data

Activate Sqoop's incremental feature by specifying the --incremental parameter. The parameter's value is the type of incremental import. When your table is only getting new rows and the existing ones are not changed, use the append mode. Incremental import also requires two additional parameters: --check-column indicates the column that should be checked for newly appended data, and --last-value contains the last value that was successfully imported into Hadoop.

The following example will transfer only those rows whose value in column id is greater than 1:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 1

MongoDB Tutorial 21 : Increasing Shell batch Size

By default, the mongo shell prints only the first 20 documents of a cursor. If you want more, you can set a new batch size in the shell, e.g.:

DBQuery.shellBatchSize = 100
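To make the larger batch size the default for every session, the same line can go into the shell's startup file ~/.mongorc.js, which the mongo shell evaluates at startup; for example, from a regular shell:

echo 'DBQuery.shellBatchSize = 100;' >> ~/.mongorc.js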

Sqoop tutorial 12 : Importing All Your Tables

Rather than using the import tool for one table, you can use the import-all-tables tool. For example, to import all tables from our example database, you would use the following Sqoop command:

sqoop import-all-tables \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop

When using the import-all-tables tool, Sqoop will use catalog queries to retrieve a list of all tables in your database. For each table, Sqoop will then internally call the import tool to import your data to Hadoop. Tables are imported in sequential order to avoid the unnecessary burden on the database server that importing several tables at the same time would create. With this one command, Sqoop will import the entire content of a given database to Hadoop. If you need to import all but a few tables, you can use the parameter --exclude-tables, which accepts a comma-separated list of table names that should be excluded from the bulk import. For example, if y...
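A sketch of how --exclude-tables might be used, with cities and countries as stand-in table names:

sqoop import-all-tables \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--exclude-tables cities,countries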

Sqoop tutorial 11: Handling NULL Values

You can override the NULL substitution string with the --null-string and --null-non-string parameters to any arbitrary value. For example, use the following command to override it to \N:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--null-string '\\N' \
--null-non-string '\\N'

1. The --null-string argument represents what should be written to HDFS whenever a NULL is identified in a column defined as string in the relational database.
2. The --null-non-string argument works like --null-string, except that it applies to columns defined as non-string in the relational database.

Sqoop tutorial 10 : Controlling Parallelism

Sqoop by default uses four concurrent map tasks to transfer data to Hadoop. Transferring bigger tables with more concurrent tasks should decrease the time required to transfer all data. You want the flexibility to change the number of map tasks used on a per-job basis. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to suggest 10 concurrent tasks, you would use the following Sqoop command:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10

** If you want to increase the number of mappers, the source table should have a primary key; the rows are divided between the mappers according to the PK. If you don't have any PK in the source, you can explicitly nominate a column with the --split-by parameter (see the sketch below).
** When specifying the split column with --split-by, make sure the column is of integer type so that the data distribution between mapp...
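A sketch of combining --num-mappers with --split-by for a table without a primary key (choosing id as the split column is an assumption for illustration):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--split-by id \
--num-mappers 10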

Sqoop tutorial 9 : Overriding Type Mapping

Use Sqoop's ability to override the default type mapping with the parameter --map-column-java. For example, to override the type of column id to the Java type Long:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--table cities \
--map-column-java id=Long

Sqoop tutorial 8 : Compressing Imported Data

Use the parameter --compress to enable compression:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--table cities \
--compress

By default, when using the --compress parameter, output files will be compressed using the GZip codec, and all files will end up with a .gz extension. You can choose any other codec using the --compression-codec parameter. The following example uses the BZip2 codec instead of GZip:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--table cities \
--compress \
--compression-codec org.apache.hadoop.io.compress.BZip2Codec
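A quick way to confirm which codec was applied is to list the import directory on HDFS and check the file extensions (the cities path here assumes the default layout, where Sqoop writes a directory named after the table under your HDFS home directory):

hadoop fs -ls cities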

Sqoop tutorial 7 : File Formats

Sqoop imports data as delimited text (CSV) files by default. If you want to change the file format, you can use the following syntax:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--as-sequencefile

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--as-avrodatafile

Sqoop tutorial 6 : Importing Only a Subset of Data

Use the command-line parameter --where to specify a SQL condition that the imported data should meet. For example, to import only USA cities from the table cities, you can issue the following Sqoop command:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--where "country = 'USA'"

Sqoop tutorial 5 : Specifying a Target Directory

Sqoop offers two parameters for specifying custom output directories: --target-dir and --warehouse-dir. Sqoop will reject importing data when the final output directory already exists.

Use the --target-dir parameter to specify the directory on HDFS where Sqoop should import your data. For example, use the following command to import the table cities into the directory /etl/input/cities:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--target-dir /etl/input/cities

To specify the parent directory for all your Sqoop jobs, instead use the --warehouse-dir parameter:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--warehouse-dir /etl/input/

By default, Sqoop will create a directory with the same name as the imported table inside your home directory on HDFS and import all data there. For example, whe...

Sqoop tutorial 4 : Transferring an Entire Table

Importing one table with Sqoop is very simple: you issue the Sqoop import command and specify the database credentials and the name of the table to transfer.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities

Sqoop tutorial 3 : Import arguments

Argument                         Description
--append                         Append data to an existing dataset in HDFS
--as-avrodatafile                Imports data to Avro Data Files
--as-sequencefile                Imports data to SequenceFiles
--as-textfile                    Imports data as plain text (default)
--boundary-query <statement>     Boundary query to use for creating splits
--columns <col,col,col...>       Columns to import from table
--direct                         Use direct import fast path
--direct-split-size <n>          Split the input stream every n bytes when importing in direct mode
--inline-lob-limit <n>           Set the maximum size for an inline LOB
-m,--num-mappers <n>             Use n map tasks to import in parallel
-e,--query <statement>           Import the results of statement
--split-by <column-name>         Column of the table used to split work units
--table <table-name>             Table to read
--target-dir <dir>               HDFS destination dir
--warehouse-dir <dir>            HDFS parent for table destination
--where <where clause>           WHERE cl...
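Several of these arguments can be combined in a single import; a sketch that reuses the cities table and column names from the other posts (purely illustrative):

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--columns id,city \
--where "country = 'USA'" \
--num-mappers 4 \
--as-avrodatafile \
--target-dir /etl/input/cities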

Sqoop tutorial 2 : Sqoop Import

The Sqoop import tool imports an individual table from an RDBMS into HDFS. Each row from the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

Syntax:

sqoop import \
--connect jdbc:mysql://database.example.com/<databasename> \
--username <username> \
--password <password> \
--table <tablename>

Sqoop tutorial 1 : Introduction

What is Sqoop? Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
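The other posts in this archive cover the import direction in detail; for the export direction mentioned above, here is a minimal sketch (connection details, table name, and HDFS directory are placeholders in the style of the import examples, not from the original post):

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir /etl/input/cities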