Spark : Catalyst Optimizer

Much of the power of Spark SQL comes from the Catalyst optimizer, so let's take a look at it.


The Catalyst optimizer has two primary goals:
·         Make adding new optimization techniques easy
·         Enable external developers to extend the optimizer

Spark SQL uses Catalyst's transformation framework in four phases:
·         Analyzing a logical plan to resolve references
·         Logical plan optimization
·         Physical planning
·         Code generation to compile parts of the query to Java bytecode



Analysis
The analysis phase takes a SQL query or a DataFrame and builds a logical plan from it. This plan is initially unresolved: the columns it refers to may not exist or may be of the wrong data type. Spark SQL then resolves the plan using the Catalog object (which connects to the physical data source), producing a resolved logical plan.
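To make the idea of resolving references concrete, here is a minimal sketch in plain Python. It is not Catalyst's code: the `CATALOG` dictionary, the class names, and the `resolve` function are all hypothetical stand-ins, assuming a catalog that simply maps table names to column types.

```python
from dataclasses import dataclass

# Hypothetical toy catalog: table name -> {column name: data type}.
CATALOG = {"people": {"name": "string", "age": "int"}}


@dataclass(frozen=True)
class UnresolvedAttribute:
    name: str


@dataclass(frozen=True)
class ResolvedAttribute:
    name: str
    data_type: str


class AnalysisError(Exception):
    """Raised when a reference cannot be resolved against the catalog."""


def resolve(attr, table):
    """Look the column up in the toy catalog and attach its data type."""
    columns = CATALOG.get(table)
    if columns is None:
        raise AnalysisError(f"table not found: {table}")
    if attr.name not in columns:
        raise AnalysisError(f"cannot resolve column '{attr.name}' in {table}")
    return ResolvedAttribute(attr.name, columns[attr.name])


resolved = resolve(UnresolvedAttribute("age"), "people")
print(resolved)
```

A column that is missing from the catalog, such as a hypothetical `salary`, makes `resolve` raise `AnalysisError`, which is roughly the kind of failure you see when a query refers to a column that does not exist.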



Logical plan optimization

The logical plan optimization phase applies standard rule-based optimization to the logical plan. These include constant folding, predicate pushdown, projection pruning, null propagation, Boolean expression simplification, and other rules.
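As an illustration of what a rule-based rewrite looks like, here is a toy constant folding rule in plain Python. It is only a sketch: the `Literal`, `Attribute`, and `Add` classes and the `fold_constants` function are made up for this example and are far simpler than Catalyst's real expression trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rewrite the tree bottom-up, replacing Add(Literal, Literal) with one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and attributes have nothing to fold.
    return expr


# x + (1 + 2) is rewritten to x + 3
expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
folded = fold_constants(expr)
print(folded)
```

The rule only touches the part of the tree it recognizes and leaves everything else as it is, which is what makes it easy to add new rules one at a time.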
I would like to draw special attention to the predicate pushdown rule here. The concept is simple: if you issue a query in one place against massive data that is stored in another place, a lot of unnecessary data can end up moving across the network.

If we can push part of the query down to where the data is stored, and thus filter out unnecessary data at the source, network traffic is reduced significantly.
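The effect can be sketched with a small simulation in plain Python. Everything here is hypothetical: `REMOTE_ROWS` stands in for a table stored elsewhere, and `scan` returns the rows that would cross the network.

```python
# Hypothetical stand-in for a table stored on a remote system.
REMOTE_ROWS = [
    {"id": i, "country": "IN" if i % 10 == 0 else "US"} for i in range(1000)
]


def scan(predicate=None):
    """Simulate a remote scan and return the rows that cross the 'network'."""
    if predicate is None:
        return list(REMOTE_ROWS)
    # With a pushed-down predicate, filtering happens at the source.
    return [row for row in REMOTE_ROWS if predicate(row)]


def is_india(row):
    return row["country"] == "IN"


# Without pushdown: fetch every row, then filter locally.
fetched_all = scan()
result_local = [row for row in fetched_all if is_india(row)]

# With pushdown: the source applies the filter before sending anything.
fetched_filtered = scan(predicate=is_india)

print(len(fetched_all), len(fetched_filtered))
```

Both approaches produce the same result, but the second one transfers only the matching rows instead of the whole table.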




Physical planning
In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans. It then estimates the cost of each physical plan using a cost model and selects one of them for execution.
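The selection step can be pictured as picking the cheapest candidate. The sketch below is plain Python with invented cost numbers; `PhysicalPlan` and `choose_plan` are hypothetical names, and the two plan labels are used only as examples of join strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalPlan:
    name: str
    estimated_cost: float  # made-up units, for illustration only


def choose_plan(candidates):
    """Return the candidate with the lowest estimated cost."""
    return min(candidates, key=lambda plan: plan.estimated_cost)


candidates = [
    PhysicalPlan("SortMergeJoin", 120.0),
    PhysicalPlan("BroadcastHashJoin", 35.0),
]
best = choose_plan(candidates)
print(best.name)
```

In a real Spark session you can inspect the plans yourself: calling `explain(True)` on a DataFrame prints the parsed, analyzed, and optimized logical plans along with the physical plan.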


Code generation
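The final phase of query optimization involves generating Java bytecode to run on each machine. Catalyst was designed to use a special Scala feature called quasiquotes to accomplish that: quasiquotes make it possible to build code as abstract syntax trees programmatically and compile them at runtime. (Later Spark releases generate Java source code and compile it with the Janino compiler instead.)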
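The general idea of turning an expression tree into compiled code, instead of interpreting the tree for every row, can be sketched in plain Python. This is only an analogy: it generates Python source rather than Java bytecode, and `gen_source` and `compile_expr` are hypothetical helpers written for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def gen_source(expr):
    """Translate an expression tree into a Python source fragment."""
    if isinstance(expr, Literal):
        return repr(expr.value)
    if isinstance(expr, Attribute):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Add):
        return f"({gen_source(expr.left)} + {gen_source(expr.right)})"
    raise TypeError(f"unsupported expression: {expr!r}")


def compile_expr(expr):
    """Compile the generated source once and return a callable for rows."""
    source = f"lambda row: {gen_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


evaluate = compile_expr(Add(Attribute("x"), Literal(3)))
print(evaluate({"x": 4}))  # prints 7
```

The tree is walked once at compile time, and after that every row is handled by the compiled function directly.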


