
Showing posts from October, 2017

Apache Pig Tutorial 5: LOAD data

The LOAD keyword is used to load data into Pig. Syntax:

LOAD 'data' [USING function] [AS schema];

- data: the input file path (for example, /home/user/inputfile).
- USING: optional keyword. If the USING clause is omitted, the default load function PigStorage is used.
- function: the load function; it can be a built-in function or a UDF.
- AS: keyword that introduces the schema.
- schema: schemas enable you to assign names to fields and declare types for fields. Schemas are optional, but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution.

Schema handling, note the following:
- You can define a schema that includes both the field name and field type.
- You can define a schema that includes the field name only; in this case, the field type defaults to bytearray.
- You can choose not to define a schema; in this case, the field is un-named and the field type defaults to bytearray.
- If you assign a name to a field, you can refer to that field using the name ...
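As a sketch of the three schema cases above (the file path, delimiter, and field names here are hypothetical, chosen only for illustration):

```pig
-- full schema: field names and types declared
A = LOAD '/home/user/inputfile' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float);

-- names only: each field type defaults to bytearray
B = LOAD '/home/user/inputfile' USING PigStorage(',')
    AS (name, age, gpa);

-- no schema: fields are un-named and referenced positionally ($0, $1, ...)
C = LOAD '/home/user/inputfile' USING PigStorage(',');
```

With a schema, later statements can refer to fields by name (for example, FILTER A BY age > 18); without one, you must fall back to positional references such as $0.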

Apache Pig Tutorial 4: Pig Latin Statements

Pig Latin statements are the basic constructs you use to process data with Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.) Pig Latin statements may include expressions and schemas. Pig Latin statements can span multiple lines and must end with a semicolon ( ; ). By default, Pig Latin statements are processed using multi-query execution.

Pig Latin statements are generally organized as follows:
- A LOAD statement to read data from the file system.
- A series of "transformation" statements to process the data.
- A DUMP statement to view results, or a STORE statement to save the results.

Note that a DUMP or STORE statement is required to generate output. In this example, Pig will validate, but not execute, the LOAD and FOREACH statements. A = LOAD '...
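A minimal complete script following that LOAD, transform, DUMP organization might look like this (the file name and fields are hypothetical):

```pig
-- read data from the file system
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, gpa:float);

-- a "transformation" statement producing a new relation
B = FILTER A BY gpa > 3.5;

-- without this DUMP (or a STORE), Pig validates the statements above but generates no output
DUMP B;
```

Because of multi-query execution, Pig parses the whole script first and only runs the work needed to satisfy the DUMP or STORE statements it finds.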

Apache Pig Tutorial 3: Batch Mode

You can run Pig in batch mode using Pig scripts and the pig command (in local or hadoop mode).

Example: the Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).

/* id.pig */
A = load 'passwd' using PigStorage(':');  -- load the passwd file
B = foreach A generate $0 as id;          -- extract the user IDs
store B into 'id.out';                    -- write the results to a file named id.out

Local Mode
$ pig -x local id.pig
Tez Local Mode
$ pig -x tez_local id.pig
Spark Local Mode
$ pig -x spark_local id.pig
Mapreduce Mode
$ pig id.pig
or
$ pig -x mapreduce id.pig
Tez Mode
$ pig -x tez id.pig
Spark Mode
$ pig -x spark id.pig

Apache Pig Tutorial 2: Execution Modes

We can run Apache Pig in various modes. Pig has six execution modes, or exectypes:

- Local Mode: to run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
- Tez Local Mode: similar to local mode, except internally Pig invokes the Tez runtime engine. Specify Tez local mode using the -x flag (pig -x tez_local). Note: Tez local mode is experimental; some queries simply error out on bigger data in local mode.
- Spark Local Mode: similar to local mode, except internally Pig invokes the Spark runtime engine. Specify Spark local mode using the -x flag (pig -x spark_local). Note: Spark local mode is experimental; some queries simply error out on bigger data in local mode.
- Mapreduce Mode: to run Pig in mapreduce mode, y...