Sqoop tutorial 10 : Controlling Parallelism

June 03, 2018

By default, Sqoop uses four concurrent map tasks to transfer data to Hadoop. For bigger tables, running more concurrent tasks should reduce the total transfer time, so it is useful to be able to change the number of map tasks on a per-job basis.

Use the parameter --num-mappers if you want Sqoop to use a different number of mappers.
For example, to suggest 10 concurrent tasks, you would use the following Sqoop command:

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--num-mappers 10

** To use more than one mapper, the source table should have a primary key, because Sqoop divides the rows between mappers based on it. If the source table has no primary key, you can explicitly specify a column to split on with the --split-by parameter.
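
As a sketch, the command above could name a split column as follows. This assumes the cities table has an integer column called id, which is a hypothetical name used only for illustration.

```shell
# Assumption: cities has an integer column named id (hypothetical).
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--split-by id \
--num-mappers 10
```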

** When you specify the split column with --split-by, choose an integer column so that the data is distributed properly between mappers. If you choose a string column, the distribution between mappers may be uneven.
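
To see why an evenly spread integer column works well, here is a rough sketch of the idea: take the smallest and largest value of the split column and cut that range into one slice per mapper. The function split_ranges is hypothetical and only approximates the idea; it is not Sqoop's own code, and Sqoop's exact boundary arithmetic may differ.

```shell
#!/usr/bin/env bash
# Sketch only: divide an integer range evenly across mappers.
# split_ranges is a hypothetical helper, not part of Sqoop.

# Usage: split_ranges MIN MAX MAPPERS
# Prints one "lo hi" pair per mapper; together the pairs cover MIN..MAX inclusive.
split_ranges() {
  local min=$1 max=$2 mappers=$3
  local total=$(( max - min + 1 ))
  local size=$(( total / mappers ))
  local extra=$(( total % mappers ))
  local lo=$min hi i
  for (( i = 0; i < mappers; i++ )); do
    hi=$(( lo + size - 1 ))
    # The first "extra" mappers take one additional value each.
    if (( i < extra )); then
      hi=$(( hi + 1 ))
    fi
    # Skip empty slices when there are fewer values than mappers.
    if (( hi >= lo )); then
      echo "$lo $hi"
    fi
    lo=$(( hi + 1 ))
  done
}

# Example: ids from 1 to 1000 shared between 4 mappers.
split_ranges 1 1000 4
```

Each slice covers the same span of values, so the mappers only receive a similar number of rows when the values are spread evenly across the range. If most rows cluster in one part of the range, one mapper ends up doing most of the work.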
