Spark: How to know how Spark is partitioning data, and based on which key?
You have to distinguish between two different things:
partitioning as distributing data between partitions depending on the value of the key. This is limited to PairwiseRDDs (RDD[(T, U)]) and creates a relationship between a partition and the set of keys that can be found on it (see the sketch below).
partitioning as splitting the input into multiple partitions, where data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
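As a minimal sketch of the first kind (assuming a local SparkContext and an arbitrary two-partition HashPartitioner, chosen only for illustration), you can list which keys end up on which partition of a key-partitioned RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object PartitionInspection {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-inspection").setMaster("local[*]"))

    // A PairwiseRDD partitioned by key: every record with the same key
    // lands on the same partition.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
      .partitionBy(new HashPartitioner(2))

    // List the set of keys found on each partition.
    val keysPerPartition = pairs
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.map(_._1).toSet)) }
      .collect()

    keysPerPartition.foreach { case (idx, keys) => println(s"partition $idx -> $keys") }

    sc.stop()
  }
}
```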
In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile), it depends on properties such as mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
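A rough sketch of the second kind (the input path and split size below are made up for illustration): you can count records per partition for a parallelized collection, and control split size for file-based sources through the Hadoop configuration:

```scala
// parallelize slices the input evenly by index; verify by counting
// records per partition.
val rdd = sc.parallelize(1 to 10, numSlices = 4)
println(rdd.glom().map(_.length).collect().mkString(", "))

// For file-based sources the split size is governed by the input format,
// e.g. mapreduce.input.fileinputformat.split.maxsize (128 MB here).
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")
val lines = sc.textFile("/path/to/input.txt")   // hypothetical path
println(lines.getNumPartitions)
```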
So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations which require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default method is hash partitioning.
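A minimal sketch showing the difference, again assuming a local SparkContext:

```scala
// A freshly created RDD has no partitioner ...
val raw = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(raw.partitioner)        // None

// ... while a *ByKey operation installs a HashPartitioner by default.
val reduced = raw.reduceByKey(_ + _)
println(reduced.partitioner)    // Some(org.apache.spark.HashPartitioner@...)
```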