Hive Tutorial 26: Hive Sort By vs Order By vs Cluster By
Sort By vs Order By vs Group By vs Cluster By in Hive
SORT BY
Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: numeric columns are sorted in numeric order, and string columns are sorted in lexicographical order.
Ordering: Data is ordered within each of the N reducers, but different reducers can hold overlapping ranges of data.
Outcome: N or more sorted files with overlapping ranges.
SELECT key, value FROM src SORT BY key ASC, value DESC
The query had 2 reducers, and the output of each is:
Reducer 1:
0 5
0 3
3 6
9 1
Reducer 2:
0 4
0 3
1 1
2 5
As we can see, each reducer's output is ordered, but there is no total ordering, since we end up with one separately sorted output per reducer.
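The per-reducer behaviour can be mimicked outside Hive. The following is a minimal Python sketch, not Hive code: the function name `sort_by` and the way the rows are pre-assigned to the two reducers are assumptions made purely for illustration (Hive decides that assignment itself). It sorts each reducer's rows by key ascending and value descending, then checks whether the concatenated result is totally ordered.

```python
def sort_by(partitions):
    """Sort each reducer's rows independently: key ASC, value DESC."""
    return [sorted(rows, key=lambda r: (r[0], -r[1])) for rows in partitions]

# Hypothetical assignment of the eight (key, value) rows to 2 reducers.
partitions = [
    [(9, 1), (0, 3), (3, 6), (0, 5)],   # rows that reached reducer 1
    [(2, 5), (0, 4), (1, 1), (0, 3)],   # rows that reached reducer 2
]

outputs = sort_by(partitions)
for i, rows in enumerate(outputs, start=1):
    print(f"Reducer {i}:", rows)

# Concatenate the per-reducer outputs and compare with a global sort.
combined = [row for rows in outputs for row in rows]
globally_sorted = sorted(combined, key=lambda r: (r[0], -r[1]))
print("Totally ordered?", combined == globally_sorted)
```

Because both reducers hold rows with key 0, the concatenated output is not globally sorted, which is the overlap described for SORT BY.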
ORDER BY
This is similar to ORDER BY in standard SQL.
In Hive, ORDER BY guarantees total ordering of the data, but to do so all rows have to pass through a single reducer. That is normally unacceptable for large data sets, so in strict mode (hive.mapred.mode=strict) Hive makes it compulsory to use LIMIT with ORDER BY so that the single reducer doesn't get overburdened.
Ordering: Totally ordered data.
Outcome: A single, fully ordered output.
SELECT key, value FROM src ORDER BY key ASC, value DESC
Reducer:
0 5
0 4
0 3
0 3
1 1
2 5
3 6
9 1
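For comparison, total ordering through one reducer can be sketched in Python as follows. This is an illustration under stated assumptions rather than how Hive is implemented: `order_by` and its `limit` and `strict` parameters are hypothetical names, with `strict` loosely modelling the strict-mode rule that ORDER BY must be accompanied by LIMIT.

```python
def order_by(rows, limit=None, strict=False):
    """Totally order rows (key ASC, value DESC) as one reducer would."""
    if strict and limit is None:
        # Models the strict-mode rule: ORDER BY must come with LIMIT.
        raise ValueError("strict mode: ORDER BY requires LIMIT")
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    return ordered if limit is None else ordered[:limit]

# The same eight (key, value) rows, now handled as a single input.
rows = [(9, 1), (0, 3), (3, 6), (0, 5), (2, 5), (0, 4), (1, 1), (0, 3)]

print(order_by(rows))                         # full, totally ordered result
print(order_by(rows, limit=3, strict=True))   # strict mode with a LIMIT
```

Every row goes through the single `sorted` call, which is why the result is fully ordered and also why one reducer carries the whole load.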
DISTRIBUTE BY
Hive uses the columns in DISTRIBUTE BY to distribute the rows among reducers. All rows with the same DISTRIBUTE BY column values will go to the same reducer.
It ensures that each of the N reducers gets a non-overlapping set of column values, but it doesn't sort the output of each reducer. You end up with N or more unsorted files whose values do not overlap.
Example (taken directly from the Hive wiki):
We are distributing by x on the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got:
x1
x2
x1
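The distribution step can be sketched in Python as well. This is an assumption-laden illustration: `distribute_by` is a hypothetical helper, and `zlib.crc32` is only a stand-in for whatever hash Hive applies internally, so the exact split will likely differ from the one in the example. What it does show is the guarantee itself: rows with the same value always reach the same reducer, and nothing is sorted.

```python
import zlib
from collections import defaultdict

def distribute_by(rows, num_reducers):
    """Send every row to the reducer chosen by hashing its value."""
    reducers = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash (an assumption); Hive uses its own.
        reducer_id = zlib.crc32(row.encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)   # arrival order kept, no sorting
    return dict(reducers)

rows = ["x1", "x2", "x4", "x3", "x1"]
result = distribute_by(rows, num_reducers=2)
for reducer_id, got in sorted(result.items()):
    print(f"Reducer {reducer_id + 1} got:", got)
```

Both `x1` rows end up in the same list regardless of which hash is used, because the reducer is a function of the value alone.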