
Pig Latin – Datatypes:

  1. Relation – Pig Latin statements operate on relations. A relation is an outer bag, that is, a bag of tuples, and is comparable to a table in an RDBMS.
  2. Bag – A bag is a collection of tuples. Unlike the rows of an RDBMS table, the tuples in a bag need not all have the same number of fields.
  3. Tuple – A tuple is an ordered set of fields. It can be thought of as a row with any number of fields, and each field can be of any datatype.
  4. Map – A map is a set of key-value pairs.
  5. Field – A field is a single piece of data.
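As a rough illustration of the complex types listed above, the following sketch declares one field of each kind and then reads a value out of each. The file name 'complex_data', the field names, and the map key 'city' are all assumptions made for this example.

```pig
-- Hypothetical input file and field names, chosen only for illustration.
-- t is a tuple, b is a bag of single-field tuples, m is a map.
A = LOAD 'complex_data' AS (t:tuple(x:int, y:int), b:bag{r:tuple(v:chararray)}, m:map[]);

-- Print the schema of relation A.
DESCRIBE A;

-- Read from each type: a tuple field by name, a bag projection, a map value by key.
X = FOREACH A GENERATE t.x, b.v, m#'city';
DUMP X;
```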

Pig also has simple datatypes such as int, long, float, double, chararray, and bytearray. Bag, tuple, and map are called complex datatypes. Data can be loaded with or without declared datatypes.

If we do not supply any datatypes, each field defaults to bytearray. Pig then converts the values implicitly according to how they are used, which can produce unexpected output, so it is recommended to declare the datatypes when loading data into Pig.
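A minimal sketch of loading without declared types, assuming the same 'student' file with three fields per line. The explicit casts are one way to state the intended types instead of relying on implicit conversion.

```pig
-- No AS clause: every field is a bytearray and can only be referenced by position.
A = LOAD 'student' USING PigStorage();

-- Explicit casts state the intended types and give the fields names.
B = FOREACH A GENERATE (chararray)$0 AS name, (int)$1 AS age, (float)$2 AS gpa;

-- Print the resulting schema of B.
DESCRIBE B;
```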

Let us take an example:

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

X = FOREACH A GENERATE name,$2;

Dump X;

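In the above example, the relation is named A and has three fields: name, age, and gpa. A field can be referenced by its name or by positional notation, which starts at $0. Here $2 refers to gpa, and name could equally be written as $0.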
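To make the equivalence concrete, the sketch below projects the same two fields once by name and once by position. It assumes the same 'student' file and schema as above; the aliases X and Y are only for illustration.

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);

-- Both statements project the same two fields: name ($0) and gpa ($2).
X = FOREACH A GENERATE name, gpa;
Y = FOREACH A GENERATE $0, $2;

DUMP X;
DUMP Y;
```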

A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;

X = FOREACH A GENERATE t1.t1a,t2.$0;

Here each tuple in A contains two nested tuples, t1 and t2. The dereference operator (the dot in t1.t1a) accesses a field inside a tuple, either by name (t1.t1a) or by position (t2.$0).
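As a worked example, suppose (purely as an assumption) that 'data' holds the three tab-separated lines shown in the comments below. The dereferenced fields would then be the first element of each nested tuple.

```pig
-- Assumed contents of 'data' (two tuples per line, separated by a tab):
--   (3,8,9)    (4,5,6)
--   (1,4,7)    (3,7,5)
--   (2,5,8)    (9,5,8)
A = LOAD 'data' AS (t1:tuple(t1a:int, t1b:int, t1c:int), t2:tuple(t2a:int, t2b:int, t2c:int));

-- t1.t1a selects by name, t2.$0 selects by position.
X = FOREACH A GENERATE t1.t1a, t2.$0;

-- With the assumed input, X would be expected to contain (3,4), (1,3) and (2,9).
DUMP X;
```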

Some other Pig Latin statements you can try, assuming A has numeric fields f1, f2, and f3:

X = GROUP A BY f2*f3;

X = FILTER A BY (f1==8) OR (NOT (f2+f3 > f1));

X = FOREACH A GENERATE f1, f2, f1%f2;

X = FILTER A BY (f1 == 8);
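The statements above need a relation with fields f1, f2, and f3. A minimal sketch that puts them together, assuming a hypothetical file 'data' with three int columns, could look like this; the aliases Y and Z and the field alias remainder are introduced only for the example.

```pig
-- Assumed schema: three int fields; the file name 'data' is illustrative.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Keep rows where f1 is 8, or where f2 + f3 is not greater than f1.
X = FILTER A BY (f1 == 8) OR (NOT (f2 + f3 > f1));

-- Arithmetic inside a projection: f1 modulo f2, given a readable alias.
Y = FOREACH X GENERATE f1, f2, f1 % f2 AS remainder;

-- Group the original rows on a computed key.
Z = GROUP A BY f2 * f3;

DUMP Y;
DUMP Z;
```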

Now let us look at some of the Pig Latin statements,

Note:

In general, Pig processes Pig Latin statements as follows:

  1. First, Pig validates the syntax and semantics of all statements.
  2. Next, if Pig encounters a DUMP or STORE, Pig will execute the statements.
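The two steps above can be seen in a short script. In this sketch the output directory 'honor_roll' and the 3.0 threshold are assumptions; the point is that the first three statements are only checked, and work starts when DUMP or STORE is reached.

```pig
-- These statements are validated, but no data is read or processed yet.
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa > 3.0F;
C = FOREACH B GENERATE name, gpa;

-- DUMP triggers execution and prints the result to the console.
DUMP C;

-- STORE also triggers execution and writes the result to a directory (hypothetical name).
STORE C INTO 'honor_roll' USING PigStorage(',');
```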