Pig Latin – Grouping and Joining:
JOIN:
The JOIN concept in Pig is similar to SQL joins. Pig supports several types of join: inner joins, outer joins, and some specialized joins.
INNER JOIN:
When no LEFT, RIGHT, or FULL keyword is given, the JOIN operator performs an inner equijoin. Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note: COGROUP and JOIN work in a similar way. The difference is that JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
X = JOIN A BY a1, B BY b1;
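As a minimal sketch of filtering null keys before an inner join, assuming two hypothetical comma-separated input files with the schemas shown (the file names, the a2 and b2 fields, and the A_keyed and B_keyed aliases are illustrative only):

```pig
-- Hypothetical input files and field names, for illustration only.
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:int, a2:chararray);
B = LOAD 'b.txt' USING PigStorage(',') AS (b1:int, b2:chararray);

-- Drop records whose join key is null; the inner join would discard them anyway.
A_keyed = FILTER A BY a1 IS NOT NULL;
B_keyed = FILTER B BY b1 IS NOT NULL;

-- Inner join on the remaining keys.
X = JOIN A_keyed BY a1, B_keyed BY b1;
DUMP X;
```

Filtering first does not change the join result; it only reduces the number of records that have to be moved to the join.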
OUTER JOIN:
Use the JOIN operator with the LEFT, RIGHT, or FULL keyword to perform left, right, or full outer joins; the OUTER keyword is optional. Outer joins only work if the relations that need to produce nulls (in the case of non-matching keys) have schemas.
C = JOIN A BY $0 LEFT OUTER, B BY $0;
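The schema requirement can be illustrated with a short sketch. The file names and the id, name, and city fields below are assumptions made up for this example:

```pig
-- Hypothetical inputs; both relations declare a schema in the LOAD statement.
A = LOAD 'a.txt' AS (id:int, name:chararray);
B = LOAD 'b.txt' AS (id:int, city:chararray);

-- Left outer join: every record of A is kept.
C = JOIN A BY id LEFT OUTER, B BY id;

-- Records of A with no match in B get null for the fields that come from B.
-- A::id disambiguates the two id fields in the joined relation.
D = FOREACH C GENERATE A::id, A::name, B::city;
DUMP D;
```

Because B declares a schema, Pig knows how many null fields to produce for the unmatched records of A.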
REPLICATED:
Use this join when one or more of the relations is small enough to fit into memory. Pig then performs the join very fast, because all of the Hadoop work is done on the map side.
In the JOIN statement, the large relation comes first, followed by one or more small relations. If the small relations do not fit into memory, the process fails with an error. A replicated join can be an inner join or a left outer join.
C = JOIN A BY $0 LEFT, B BY $0 USING 'replicated';
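A sketch of the relation ordering, assuming a hypothetical large click log and a small user lookup table (the file names, aliases, and fields are illustrative, not from a real data set):

```pig
-- Hypothetical inputs: 'clicks.txt' is assumed to be large, 'users.txt' small.
big   = LOAD 'clicks.txt' AS (user_id:int, url:chararray);
small = LOAD 'users.txt'  AS (user_id:int, country:chararray);

-- The large relation is listed first; the small relation listed after it
-- is the one that is loaded into memory.
C = JOIN big BY user_id, small BY user_id USING 'replicated';
DUMP C;
```

If the two relations were listed in the opposite order, Pig would try to hold the large one in memory, which is the failure case described above.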
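Pig also provides skewed and merge joins, each with its own limitations and conditions: a skewed join targets data whose join keys are unevenly distributed, and a merge join expects inputs that are already sorted on the join key. Here is an example of each: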
C = JOIN A BY name FULL, B BY name USING 'skewed';
C = JOIN A BY a1, B BY b1 USING 'merge';
GROUP:
GROUP is similar to GROUP BY in SQL. By convention, GROUP is used with a single relation and COGROUP with two or more relations; the two operators otherwise behave the same way.
B = GROUP A BY age;
X = GROUP A BY f2*f3;
X = COGROUP A BY owner INNER, B BY friend2 INNER;
B = GROUP A BY (tcid, tpid);
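To show what a grouped relation looks like, here is a small sketch that counts records per group. The input file, its fields, and the counts alias are hypothetical:

```pig
-- Hypothetical input, for illustration only.
A = LOAD 'students.txt' AS (name:chararray, age:int);

-- Each output record has the key in a field named "group"
-- and a bag named A holding the records that share that key.
B = GROUP A BY age;
DESCRIBE B;

-- Aggregate over the bag: number of records for each age.
counts = FOREACH B GENERATE group AS age, COUNT(A) AS n;
DUMP counts;
```

The nested bag is what distinguishes GROUP and COGROUP output from the flat records that JOIN produces.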
CROSS:
Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.
X = CROSS A, B;