Pig Tutorial:

Apache Pig is a tool used to analyze large amounts of data.

Pig was developed at Yahoo. It is an abstraction over MapReduce: all Pig scripts are internally converted into Map and Reduce tasks. Pig is well suited to ad hoc data analysis and iterative processing, and it is mainly used for research purposes.

In Pig, we analyze huge amounts of data by representing them as data flows.

So what do we mean by data flows?

In a Pig Latin program, we apply a series of transformations to the input to produce the output. Taken as a whole, these operations describe a data flow. We can also say that Pig is a scripting language for exploring large datasets.

One advantage of Pig is that we can express the same MapReduce program in a much easier and simpler way. We can analyze terabytes of data with far fewer lines of code, and we can clearly see what each line of the script is doing. This is why researchers and engineers use Pig for analyzing huge datasets.
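As a small sketch of what such a data flow looks like, the Pig Latin script below chains a handful of transformations together; the input file, field names, and schema are assumptions made purely for illustration:

  -- Load a tab-separated file of (name, age) records; path and schema are assumed.
  users  = LOAD 'users.txt' USING PigStorage('\t') AS (name:chararray, age:int);

  -- Transformation 1: keep only adult users.
  adults = FILTER users BY age >= 18;

  -- Transformation 2: group the remaining records by age.
  by_age = GROUP adults BY age;

  -- Transformation 3: count how many users fall in each age group.
  counts = FOREACH by_age GENERATE group AS age, COUNT(adults) AS num_users;

  -- Write the result; each statement feeds the next, forming a data flow.
  STORE counts INTO 'age_counts';

Each statement only names a new relation; nothing runs until an output is requested (here, the STORE), at which point Pig turns the whole flow into MapReduce jobs.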

 

Pig consists of two main pieces:

  • Pig Latin – the language used to express data flows; it is the language in which we write our scripts.
  • Execution Environment – the environment in which Pig programs are run.

We will discuss Pig Latin later; first, let us look at the execution environment.

There are two modes in which Pig programs can run:

  • Local mode
  • Distributed (MapReduce) mode
  1. Local Mode: Used for running scripts on the local machine, so no Hadoop installation is required; we simply untar the Pig distribution and work with files in the local filesystem. Here Pig runs in a single JVM process, so it is suitable only for small amounts of data. We typically use this mode for testing.

To execute in local mode, we use the -x (or -exectype) option, as shown below.
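For example, to start the interactive Grunt shell or run a script in local mode (myscript.pig is just a placeholder name):

  pig -x local                    # start the Grunt shell in local mode
  pig -x local myscript.pig       # run a script in local mode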

 

  2. Distributed Mode: Used for running scripts on a Hadoop cluster, so Hadoop must be installed and running before we run Pig. In this mode, all the Pig code is internally converted into MapReduce jobs and executed on the cluster. The cluster can be pseudo-distributed or fully distributed; generally, we use a fully distributed cluster for large datasets.

Here there is no need to use the -x option, because MapReduce mode is the default; we can simply type pig, as shown below.
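For example, assuming Hadoop is installed and configured (myscript.pig is again just a placeholder name):

  pig                             # start the Grunt shell in MapReduce mode (the default)
  pig myscript.pig                # run a script on the Hadoop cluster
  pig -x mapreduce myscript.pig   # equivalent, with the execution type spelled out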