Apache Pig Tutorial 5: LOAD Data
The LOAD keyword loads data from the file system into Pig.
LOAD 'data' [USING function] [AS schema];
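data: The name of the input file or directory, in single quotes (for example, '/home/user/inputfile').
USING: Keyword that introduces the load function. If the USING clause is omitted, the default load function, PigStorage, is used.
function: The load function. It can be a built-in function or a user-defined function (UDF).
AS: Keyword that introduces the schema.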
schema: Schemas enable you to assign names to fields and declare types for fields. Schemas are optional but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution.
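As a minimal sketch of the full syntax, the statement below uses every clause at once: a file name, an explicit load function, and a schema. The file name 'student.csv' and the field names are hypothetical, chosen only for illustration.

```pig
-- 'student.csv' and the field names are assumptions for illustration.
-- PigStorage(',') splits each line on commas;
-- the AS clause names the three fields and declares their types.
students = LOAD 'student.csv' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);

-- DESCRIBE prints the schema of a relation.
DESCRIBE students;
```

Here DESCRIBE should list name, age, and gpa with their declared types, which is a quick way to confirm that the schema was applied as intended.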
Known Schema Handling
Note the following:
- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
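The three cases above can be sketched as follows, assuming a hypothetical tab-separated file 'student.tsv' with three fields per line. Running DESCRIBE on each relation is one way to compare the resulting schemas.

```pig
-- 'student.tsv' is a hypothetical tab-separated file with three fields per line.

-- 1. Field names and field types.
A = LOAD 'student.tsv' AS (name:chararray, age:int, gpa:float);

-- 2. Field names only; every field defaults to bytearray.
B = LOAD 'student.tsv' AS (name, age, gpa);

-- 3. No schema; fields are un-named and default to bytearray.
C = LOAD 'student.tsv';

DESCRIBE A;
DESCRIBE B;
DESCRIBE C;
```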
If you assign a name to a field, you can refer to that field using the name or by positional notation. If you don't assign a name to a field (the field is un-named) you can only refer to the field using positional notation.
If you assign a type to a field, you can subsequently change the type using the cast operators. If you don't assign a type to a field, the field defaults to bytearray; you can change the default type using the cast operators.
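A short sketch of both points, again assuming a hypothetical tab-separated file 'student.tsv': fields are referenced by name and by position, and cast operators are used to change a declared type or the default bytearray type.

```pig
-- 'student.tsv' and the field names are assumptions for illustration.
A = LOAD 'student.tsv' AS (name:chararray, age:int, gpa:float);

-- Named fields can be referenced by name or by position ($0 is the first field).
B = FOREACH A GENERATE name, $1;

-- A declared type can be changed later with a cast.
C = FOREACH A GENERATE name, (double)gpa AS gpa;

-- Without a schema, only positional notation is available,
-- and each field is a bytearray until it is cast.
D = LOAD 'student.tsv';
E = FOREACH D GENERATE (chararray)$0 AS name, (int)$1 AS age;
```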
Unknown Schema Handling
Note the following:
- When you JOIN/COGROUP/CROSS multiple relations, if any relation has an unknown schema (or no defined schema, also referred to as a null schema), the schema for the resulting relation is null.
- If you FLATTEN a bag with empty inner schema, the schema for the resulting relation is null.
- If you UNION two relations with incompatible schemas, the schema for the resulting relation is null.
- If the schema is null, Pig treats all fields as bytearray (in the backend, Pig will determine the real type for the fields dynamically).
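The JOIN case can be sketched as below. The file names and field names are hypothetical. Relation B is loaded without a schema, so, following the rule above, the joined relation C is expected to have a null schema as well.

```pig
-- 'left.tsv' and 'right.tsv' are hypothetical input files.
A = LOAD 'left.tsv' AS (id:int, value:chararray);

-- No AS clause, so B has a null (unknown) schema.
B = LOAD 'right.tsv';

-- B has no field names, so its join key is given by position.
C = JOIN A BY id, B BY $0;

-- DESCRIBE is expected to report that the schema for C is unknown.
DESCRIBE C;
```

Because C has no schema, its fields can only be reached by position ($0, $1, and so on) in later statements.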
Example:
inputfile.txt
1	2	3
4	2	1
8	3	4
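-- Default load function: PigStorage with a tab delimiter.
A = LOAD 'inputfile.txt';
-- Equivalent statement with the load function and delimiter written out.
A = LOAD 'inputfile.txt' USING PigStorage('\t');
DUMP A;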
(1,2,3)
(4,2,1)
(8,3,4)