Hive Tutorial 19 : Hive UDF
Beyond the multitude of string functions, the usual mathematical and date functions are available, as well as conditionals and collections in the form of arrays and maps. The most powerful feature, though, is the ability for users to write their own functions to extend Hive.
Traditionally, many map-reduce programs had little original logic. They emerged from the need to execute a small piece of logic on a huge amount of data, and the resulting Java programs consisted largely of scaffolding to access and move data around. Today, in many cases Hive can replace such trivial map-reduce programs for ETL and analytics with simple queries, as we have seen.
There are a few cases where additional functionality beyond what Hive offers is needed. Reading and writing data in a custom format is one case. Hive can be extended to support any format by writing a SerDe, which is short for Serializer/Deserializer.
Other cases are the transformation of one row value into another, which can be added with UDFs (User Defined Functions); the transformation of multiple row values into one, which can be added with UDAFs (User Defined Aggregate Functions); and the transformation of one row value into many, which can be added with UDTFs (User Defined Table Functions).
These are added using the same programming model Hive uses to implement its built-in functions. The built-ins are in fact themselves just UDFs/UDAFs/UDTFs; the only difference from user functions is that they ship with Hive. A user can add new ones by writing the functions in Java, compiling them into a JAR, and loading the JAR in Hive. From that point on the user's functions are accessible in Hive queries just like any other function, e.g. LENGTH (UDF), COUNT (UDAF), or EXPLODE (UDTF).
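As a minimal sketch of this workflow, the class below implements a UDF that reverses a string. The package name com.example.hive.udf and the class name UDFReverseString are hypothetical placeholders, not part of Hive:

```java
package com.example.hive.udf; // hypothetical package for illustration

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal custom UDF: reverses the characters of a string.
public final class UDFReverseString extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text s) {
    if (s == null) {
      return null; // propagate SQL NULL
    }
    result.set(new StringBuilder(s.toString()).reverse().toString());
    return result;
  }
}
```

After compiling this into a JAR, it could be registered and used in a Hive session roughly like this (paths and function names are again placeholders): ADD JAR /path/to/my-udfs.jar; CREATE TEMPORARY FUNCTION reverse_str AS 'com.example.hive.udf.UDFReverseString'; SELECT reverse_str(col1) FROM table;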
Let’s explore the LENGTH example, a UDF that ships with Hive. The query
SELECT LENGTH(col1) FROM table;
applies the function LENGTH to strings like ‘abcdef’ and returns ‘6’, for example. Its implementation in the Hive source looks like this:

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedExpressions;
import org.apache.hadoop.hive.ql.exec.vector.expressions.StringLength;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/**
 * UDFLength.
 */
@Description(name = "length",
    value = "_FUNC_(str | binary) - Returns the length of str or number of bytes in binary data",
    extended = "Example:\n"
        + "  > SELECT _FUNC_('Facebook') FROM src LIMIT 1;\n"
        + "  8")
@VectorizedExpressions({StringLength.class})
public class UDFLength extends UDF {
  private final IntWritable result = new IntWritable();

  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;
    }
    byte[] data = s.getBytes();
    int len = 0;
    for (int i = 0; i < s.getLength(); i++) {
      if (GenericUDFUtils.isUtfStartByte(data[i])) {
        len++;
      }
    }
    result.set(len);
    return result;
  }

  public IntWritable evaluate(BytesWritable bw) {
    if (bw == null) {
      return null;
    }
    result.set(bw.getLength());
    return result;
  }
}
The implementation of a UDF extends the UDF class and overrides the evaluate method with its own functionality. UDAFs are a little more complex and require a basic understanding of the map-reduce framework.
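To give a feel for that added complexity, here is a sketch of a UDAF computing the maximum of integer values using Hive's simple UDAF interface (extending UDAF with an inner UDAFEvaluator). The package and class names are hypothetical; the map-reduce split shows up in the method contract: iterate runs on the map side, terminatePartial/merge handle partial aggregates, and terminate produces the final value.

```java
package com.example.hive.udf; // hypothetical package for illustration

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

// Sketch of a UDAF: maximum of int values.
public final class UDAFMaxInt extends UDAF {

  public static class MaxIntEvaluator implements UDAFEvaluator {
    private int max;
    private boolean empty;

    public MaxIntEvaluator() {
      init();
    }

    // Reset the aggregation state.
    public void init() {
      max = 0;
      empty = true;
    }

    // Map side: called once per row value.
    public boolean iterate(Integer value) {
      if (value != null) {
        max = empty ? value : Math.max(max, value);
        empty = false;
      }
      return true;
    }

    // Partial result shipped from mappers to reducers.
    public Integer terminatePartial() {
      return empty ? null : max;
    }

    // Reduce side: fold in a partial result.
    public boolean merge(Integer other) {
      return iterate(other);
    }

    // Final aggregated result; null if no rows were seen.
    public Integer terminate() {
      return empty ? null : max;
    }
  }
}
```

Registered the same way as a UDF (ADD JAR, then CREATE TEMPORARY FUNCTION), such a function could then be used like any built-in aggregate in a GROUP BY query.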