Spark: Catalyst Optimizer
Most of the power of Spark SQL comes from the Catalyst optimizer, so let's take a closer look at it. The Catalyst optimizer has two primary goals:
· Make adding new optimization techniques easy
· Enable external developers to extend the optimizer
Spark SQL uses Catalyst's transformation framework in four phases (a short sketch showing how to inspect each phase follows the list):
· Analyzing a logical plan to resolve references
· Logical plan optimization
· Physical planning
· Code generation to compile parts of the query to Java bytecode
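Before going through each phase, here is a minimal sketch (my own, not from the article) of how to watch all four phases at work from a spark-shell session, where the `spark` session and its implicits are already available:

import spark.implicits._

// A tiny DataFrame plus a filter, so every phase has something to do.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

// Prints the parsed and analyzed logical plans, the optimized logical plan,
// and the selected physical plan for this query in one shot.
df.explain(extended = true)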
Analysis
The analysis phase involves looking at a SQL query or a DataFrame and building an unresolved logical plan from it (the columns referred to may not exist or may be of the wrong datatype), and then resolving that plan against the Catalog object (which connects to the physical data sources) to produce a resolved logical plan.
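As an illustration (a hypothetical example of my own, not from the article), a reference that the Catalog cannot resolve fails at analysis time with an AnalysisException, before any data is touched; in spark-shell:

import org.apache.spark.sql.AnalysisException
import spark.implicits._

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
people.createOrReplaceTempView("people")

try {
  // "agee" is a deliberate typo, so the plan stays unresolved.
  spark.sql("SELECT agee FROM people").show()
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}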
Logical plan optimization
The logical plan optimization phase applies standard rule-based optimizations to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules.
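To see one of these rules in action before turning to predicate pushdown, here is a small spark-shell sketch of constant folding; the condition 1 + 1 is collapsed into the literal 2 in the optimized logical plan:

import org.apache.spark.sql.functions.{col, lit}

val folded = spark.range(100).filter(col("id") > lit(1) + lit(1))

// The optimized logical plan shows the condition as (id > 2): the constants
// were folded at optimization time, before any rows are processed.
println(folded.queryExecution.optimizedPlan)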
I would like to draw special attention to the predicate pushdown rule here. The concept is simple: if you issue a query in one place to run against massive data that lives in another place, it can lead to a lot of unnecessary data moving across the network. If we can push down part of the query to where the data is stored, and thus filter out the unnecessary data at the source, network traffic is reduced significantly.
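Here is a hedged sketch of what that looks like in practice from spark-shell; the Parquet path is a made-up example location:

import spark.implicits._

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
people.write.mode("overwrite").parquet("/tmp/people.parquet")

val adults = spark.read.parquet("/tmp/people.parquet").filter($"age" >= 18)

// The FileScan node in the physical plan should list the condition under
// PushedFilters, meaning it is evaluated at the Parquet source instead of
// after every row has been shipped across the network.
adults.explain()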
Physical planning
In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans. It then estimates the cost of each physical plan and selects one based on that cost model.
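As a rough illustration (my own sketch, using the real spark.sql.autoBroadcastJoinThreshold setting), the same logical join can be planned as a broadcast hash join or a sort-merge join depending on size estimates; in spark-shell:

import spark.implicits._

val big   = spark.range(1000000).toDF("id")
val small = Seq((1L, "a"), (2L, "b")).toDF("id", "tag")

// With a 10 MB broadcast threshold the small side is typically broadcast,
// so the chosen physical plan is usually a BroadcastHashJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
big.join(small, "id").explain()

// Disabling broadcast joins pushes the planner toward a SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
big.join(small, "id").explain()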
Code generation
The final phase of query optimization involves generating Java bytecode to run on each machine. Catalyst uses a special Scala feature called quasiquotes to accomplish that.
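A quick, hedged way to peek at the generated code from spark-shell (Spark 3.x) is the "codegen" explain mode:

import spark.implicits._

val doubled = spark.range(10).filter($"id" % 2 === 0).selectExpr("id * 2 AS doubled")

// Prints the Java source produced by whole-stage code generation for this
// query, which is then compiled to bytecode and executed on each machine.
doubled.explain("codegen")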