Apache Pig is a tool used to analyze huge amounts of data.
Pig was originally built at Yahoo!. It is an abstraction over MapReduce: every Pig script is internally converted into Map and Reduce tasks. Pig supports ad-hoc data analysis and iterative processing, and it is mainly used for research purposes.
We can analyze huge amounts of data in Pig by representing them as data flows.
So what do we mean by a data flow?
In a Pig Latin program we write a series of transformations that are applied to the input to produce output. Taken as a whole, these operations describe a data flow. We can also say Pig is a scripting language for exploring large datasets.
One advantage of Pig is that we can write the equivalent of a MapReduce program in an easy and simple way. We can analyze terabytes of data with far fewer lines of code, and we can also see what each line of code is doing. This is why Pig is used by researchers and engineers for analyzing huge datasets.
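To illustrate how compact Pig Latin is compared with the equivalent MapReduce program, here is a sketch of the classic word-count task (the input file name `input.txt` is hypothetical):

```pig
-- load each line of a plain-text file as a single chararray field
lines  = LOAD 'input.txt' AS (line:chararray);
-- split each line into words, one word per tuple
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```

The same logic in Java MapReduce would typically require separate mapper, reducer, and driver classes running to dozens of lines.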
Pig is mainly divided into 2 pieces:
- Pig Latin – The language used to express data flows; it is the language in which we write our code.
- Execution Environment – The environment which is used to run Pig programs.
We will discuss Pig Latin shortly; first, let us discuss the execution environment.
There are 2 environments in which to run Pig programs:
- Local mode
- Distributed mode
- Local Mode: Used for running scripts on the local machine, so no Hadoop installation is required — we just untar the Pig release and work with files in the local filesystem. Here Pig runs in a single JVM process, so it is suitable only for small amounts of data. We can use this mode for testing purposes.
To execute in local mode, we use the -x (or -exectype) option:
% pig -x local
grunt>
- Distributed Mode: Used for running scripts on Hadoop, so we need to install a Hadoop cluster before running Pig. In this mode all Pig code is internally converted into MapReduce jobs and executed on the cluster, which can be pseudo-distributed or fully distributed. Generally, we use a fully distributed cluster for large datasets.
Here there is no need for the -x option; we directly type:
% pig
grunt>
So, let's have a quick view of the Pig tutorial…
Pig installation on a Hadoop cluster is very easy: we just need to download a stable release from the official Pig site and unpack the tarball in a suitable place on our workstation. After untarring, if the HADOOP_HOME and PATH variables are set, no extra work is needed in Pig.
Writing programs in MapReduce is very difficult, as it requires Java or another general-purpose language, but Pig is very easy. The length of the code is much less compared to MapReduce. Pig is a high-level data flow language, whereas MapReduce is a low-level one.
The main components of Pig's architecture are:
- Parser – It checks the syntax of the script.
- Optimizer – It performs activities such as merge, split, join, order by, and group by; it basically tries to reduce the amount of data being sent to the next stage.
- Compiler – It converts the code into MapReduce jobs.
As we have seen in the architecture, when a Pig Latin program is executed, each statement is parsed in turn. If there are any syntax errors or other problems, the interpreter will halt and display an error message.
Pig has simple datatypes such as int, long, float, double, chararray, and bytearray. Bag, tuple, and map are called complex datatypes. We can load data with or without declaring datatypes.
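As a sketch, loading data with and without an explicit schema might look like this (the file name and fields are hypothetical):

```pig
-- load with declared datatypes: fields can be referenced by name
students = LOAD 'students.txt'
           AS (id:int, name:chararray, gpa:double);
-- load without a schema: fields are referenced positionally as $0, $1, ...
raw = LOAD 'students.txt';
```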
A LOAD statement reads data from the file system.
STORE stores or saves results to the file system. Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch-mode processing.
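A minimal LOAD/STORE pipeline might look like this (the file and directory names are hypothetical; PigStorage(',') writes comma-separated output):

```pig
students = LOAD 'students.txt' AS (id:int, name:chararray, gpa:double);
-- persist the relation to the file system as CSV
STORE students INTO 'output/students_out' USING PigStorage(',');
```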
Pig will validate, but not execute, the LOAD and FOREACH statements. DISTINCT removes duplicate tuples in a relation. FILTER selects tuples from a relation based on some condition.
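For example, FILTER and DISTINCT can be combined as follows (relation and field names are illustrative):

```pig
students = LOAD 'students.txt' AS (id:int, name:chararray, gpa:double);
-- keep only tuples that satisfy the condition
good = FILTER students BY gpa > 3.0;
-- remove duplicate tuples from the result
unique_good = DISTINCT good;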
The join concept is similar to SQL joins; here we have many types of joins, such as inner join, outer join, and some specialized joins. GROUP is similar to GROUP BY in SQL; we use GROUP for one relation and COGROUP for more than one relation. GROUP and COGROUP are otherwise similar to each other.
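A sketch of JOIN, GROUP, and COGROUP over two hypothetical relations:

```pig
students = LOAD 'students.txt' AS (id:int, name:chararray);
grades   = LOAD 'grades.txt'   AS (sid:int, score:int);
-- inner join on the student id
joined   = JOIN students BY id, grades BY sid;
-- GROUP works on a single relation
by_score = GROUP grades BY score;
-- COGROUP works on two (or more) relations sharing a key
both     = COGROUP students BY id, grades BY sid;
```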
LIMIT limits the number of tuples in a relation. ORDER sorts a relation based on one or more fields.
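Combining the two, a "top N" query might be sketched as:

```pig
grades = LOAD 'grades.txt' AS (sid:int, score:int);
-- sort by score, highest first
sorted = ORDER grades BY score DESC;
-- keep only the first three tuples
top3   = LIMIT sorted 3;
```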
SPLIT partitions a relation into two or more relations. UNION computes the union of two or more relations; it does not preserve the order of tuples, and it does not ensure (as databases do) that all tuples adhere to the same schema or have the same number of fields.
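For instance, SPLIT and UNION might be used like this (the pass mark of 50 is an arbitrary example):

```pig
grades = LOAD 'grades.txt' AS (sid:int, score:int);
-- partition one relation into two based on conditions
SPLIT grades INTO pass IF score >= 50, fail IF score < 50;
-- recombine them; the order of tuples is not preserved
all_again = UNION pass, fail;
```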
DESCRIBE returns the schema of an alias. EXPLAIN displays execution plans; use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans used to compute the specified relation.
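Both are typically run interactively from the Grunt shell, for example:

```pig
grunt> students = LOAD 'students.txt' AS (id:int, name:chararray, gpa:double);
grunt> DESCRIBE students;   -- prints the schema of the alias
grunt> EXPLAIN students;    -- prints the logical, physical, and MapReduce plans
```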
DIFF – Compares two fields in a tuple. IsEmpty – Checks whether a bag or map is empty. MAX – Returns the highest value.
MIN – Returns the lowest value.
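MAX and MIN are typically applied to a bag produced by GROUP, along these lines (relation names are illustrative):

```pig
grades   = LOAD 'grades.txt' AS (sid:int, score:int);
by_sid   = GROUP grades BY sid;
-- for each student, compute the highest and lowest score
extremes = FOREACH by_sid GENERATE group AS sid,
                                   MAX(grades.score) AS best,
                                   MIN(grades.score) AS worst;
```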
A schema assigns a name to each field and declares the field's datatype. Schemas are optional in Pig, but it is recommended to use them for good results. We have seen in the LOAD function that we can define datatypes for every field; using the DESCRIBE command we can see the schema.
First we need to start all the HDFS and YARN daemons by running 'start-dfs.sh' and 'start-yarn.sh' on the Hadoop cluster. We can check whether all the daemons are running using the 'jps' command.