Apache Spark Introduction:

Apache Spark is an open-source framework for processing big data on a distributed cluster. It handles datasets of diverse kinds, such as text data, graph data, and real-time streaming data.

We can write Spark applications in Java, Scala, Python, or R, using a built-in set of high-level operators. Spark also supports SQL queries, streaming data, machine learning, and graph data processing.

Spark has several advantages over other big data technologies. It is faster than MapReduce and offers low latency, because it can keep data in memory rather than writing every intermediate output to disk. This shortens execution time considerably.

When the data exceeds the available memory, it is automatically spilled to disk. Spark lets programmers develop complex, multi-step data pipelines as a directed acyclic graph (DAG). It does not execute tasks immediately; instead it records them as a chain of operations, and this chain is the DAG.

Spark has two kinds of operations: transformations and actions. Evaluation is lazy, which means that the transformations recorded in the DAG are executed only when an action is applied to them. Some of the important features of Spark are:

  1. Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster even when running on disk.
  2. Spark can run on clusters managed by YARN or Mesos, or in standalone mode.
  3. Spark can be integrated with various data sources such as SQL, NoSQL, S3, HDFS, and the local file system.
  4. In addition to Map and Reduce operations, it supports SQL-like queries, streaming data, machine learning, and graph data processing.

 

Let’s have a quick review of the Apache Spark tutorial.

 

How to install Apache Spark?

To install Spark, download a stable release from the official Spark downloads page and unpack the tarball in a suitable location.

 

Understanding Spark RDD:

The RDD is the central abstraction in Spark: a read-only collection of objects partitioned across multiple machines in the cluster so that they can be operated on in parallel. RDD stands for Resilient Distributed Dataset, where "resilient" means that Spark can reconstruct a lost partition by recomputing it.
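
The idea of rebuilding a lost partition by recomputing it can be sketched in plain Python. This is a toy model and not Spark's API: the partition lists, the `square_all` step, and the assumption that the source data is still available are all made up for illustration.

```python
def square_all(partition):
    # The recorded step that derives one partition from its source partition.
    return [x * x for x in partition]

# Hypothetical source data, already split into three partitions.
source_partitions = [[1, 2], [3, 4], [5, 6]]

# Derive a new dataset, one partition at a time.
derived = [square_all(p) for p in source_partitions]

# Simulate losing one derived partition (for example, a machine fails).
lost_index = 1
derived[lost_index] = None

# Only the lost partition is recomputed, from its source and the recorded step.
derived[lost_index] = square_all(source_partitions[lost_index])
```

The other two partitions are never touched during recovery, which is the point of keeping the recipe for each partition rather than a replica of its data.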

 

Understanding RDD Creation in Spark:

The first method is to use parallelized collections. Here an RDD is created with the SparkContext parallelize method, which we call on an existing collection in our program. The elements of the collection are then copied to form a distributed dataset that can be operated on in parallel.
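
As a rough illustration of what copying a collection into a distributed dataset involves, the sketch below splits a list into partitions in plain Python. `toy_parallelize` is a hypothetical helper, not Spark's parallelize method, and the slicing rule is an assumption chosen for simplicity.

```python
def toy_parallelize(collection, num_slices):
    """Copy the elements into num_slices roughly equal partitions."""
    items = list(collection)  # copy, so the original collection is untouched
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

data = [1, 2, 3, 4, 5]
partitions = toy_parallelize(data, 2)

# Each partition can now be processed independently of the others;
# in a real cluster this work would happen on different machines.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
```

Here `partitions` is `[[1, 2], [3, 4, 5]]`, and combining the per-partition sums gives the same total as summing the original list.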

 

What are Transformations and Actions in Spark?

An RDD supports two kinds of operations: transformations and actions. A transformation creates a new dataset from an existing one. An action, when triggered, returns the result to the user or stores it in external storage.
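
The split between transformations and actions can be sketched with a small plain-Python class. `LazyDataset` is a hypothetical stand-in and not Spark's RDD class; it only shows the pattern in which transformations record steps and an action runs them.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # the original collection
        self._steps = tuple(steps)   # recorded transformations, in order

    # Transformations: record the step and return a new dataset. Nothing runs yet.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._steps + (("filter", func),))

    # Action: replay the recorded chain and return the result to the caller.
    def collect(self):
        data = iter(self._source)
        for kind, func in self._steps:
            # Inside a method, map and filter refer to the Python builtins.
            data = map(func, data) if kind == "map" else filter(func, data)
        return list(data)


calls = []  # records every time the mapped function actually runs

def double(x):
    calls.append(x)
    return x * 2

pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only the chain was recorded
result = pipeline.collect()       # the action triggers execution
```

Building `pipeline` runs no user code at all; `double` is first called inside `collect()`, which returns `[6, 8]`.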

 

What are the Types of Transformations in Spark?

Two common transformations are map and filter. map(func) returns a new distributed dataset formed by passing each element of the source through a function func. filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.

 

What are the Types of Actions in Spark?

A common action is reduce(func), which aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
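
Why the function needs to be commutative and associative can be seen in a small plain-Python experiment. The helper `reduce_partitions` is hypothetical and only imitates the general pattern of reducing each partition separately and then combining the partial results; it is not how Spark is implemented.

```python
from functools import reduce

def reduce_partitions(partitions, func):
    """Reduce each partition on its own, then combine the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)

def add(a, b):
    return a + b   # commutative and associative

def sub(a, b):
    return a - b   # neither commutative nor associative

# The same four elements, split in two different ways.
two_parts = [[1, 2], [3, 4]]
three_parts = [[1], [2, 3], [4]]

add_results = (reduce_partitions(two_parts, add), reduce_partitions(three_parts, add))
sub_results = (reduce_partitions(two_parts, sub), reduce_partitions(three_parts, sub))
```

Addition gives 10 for both splits, while subtraction gives 0 for one split and -2 for the other, so its result would depend on how the data happens to be partitioned.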

 

Understanding Persistence in Spark:

When we persist an RDD, each node stores in memory the partitions of it that it computes and reuses them in other actions on that dataset. We can persist an RDD using the persist() or cache() methods.
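
The effect of persisting a dataset can be sketched by counting how often its contents are computed. `ToyDataset` below is a hypothetical plain-Python class, not Spark's RDD; its `persist()` method merely mimics the idea of keeping computed results in memory for later actions.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD with optional in-memory caching."""

    def __init__(self, source, func):
        self._source = source
        self._func = func
        self._persist = False
        self._cached = None
        self.compute_count = 0   # how many times the data was computed

    def persist(self):
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached          # reuse the stored result
        self.compute_count += 1
        result = [self._func(x) for x in self._source]
        if self._persist:
            self._cached = result        # keep it for later actions
        return result

    # Two simple actions that both need the computed data.
    def count(self):
        return len(self._compute())

    def total(self):
        return sum(self._compute())


plain = ToyDataset([1, 2, 3], lambda x: x * 10)
plain_results = (plain.count(), plain.total())

cached = ToyDataset([1, 2, 3], lambda x: x * 10).persist()
cached_results = (cached.count(), cached.total())
```

Both datasets return the same results, but `plain.compute_count` ends at 2 (once per action) while `cached.compute_count` ends at 1.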

 

What are Shared Variables in Spark?

When Spark runs a function on the cluster, the function works on separate copies of the variables it uses. These copies are shipped to the worker nodes, and updates made to them there are never propagated back to the driver program. Copying variables this way is inefficient, because a large amount of data may have to be transferred.
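
The behaviour of ordinary variables described here can be imitated in plain Python by handing each task its own copy. The `task` function and the `counter` dictionary are hypothetical; in this sketch `copy.deepcopy` stands in for shipping a variable to a worker node.

```python
import copy

counter = {"seen": 0}   # a variable defined in the "driver"

def task(partition, local_vars):
    # Runs on a "worker": it only ever sees its own copy of the variables.
    for _ in partition:
        local_vars["seen"] += 1
    return local_vars["seen"]

partitions = [[1, 2], [3, 4, 5]]

# Each task receives a separate copy, as if it were sent over the network.
worker_counts = [task(p, copy.deepcopy(counter)) for p in partitions]
```

Each task counts its own elements, giving `[2, 3]`, yet `counter["seen"]` in the driver is still 0 because the updates were made to the copies.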