Spark RDD:
RDD is the central abstraction in Spark which is a read only collection of objects partitioned across multiple machines in the cluster so that they can operated in parallel. RDD is called as Resilient Distributed dataset, where resilient means that Spark can easily reconstruct a lost partition by re-computing it.
Here we use transformation and actions. We apply multiple transformations on RDD and at last when the action is applied the process is started, until then only its logical plan is created. We can also persist an RDD into memory if we want to reuse it afterwards for better execution. Let us know look at the creation of RDD.
We have 3 ways for creating RDD.
- Parallelizing a collection
- Using a dataset from external dataset like HDFS, Hbase, etc.
- By transforming an existing RDD.
First we need to create a Spark
Context object, which tells the spark about how to access a cluster. Spark Context can be used to create RDD, broadcast variables and accumulators in cluster.
We will look into that in detail in coming slides. For creating Spark Context object first we need to build a Spark Conf object which contains information about our application. Perm JVM we must have only 1 Spark Context active, if there is another then we need to stop that first.
val conf = new SparkConf().setAppName(appName).setMaster(master) new SparkContext(conf)
Here ‘appName’ will be name of our application which will be shown on the cluster UI. ‘master’ can be mesos, yarn, local if it is run on local mode. Here in our coming slides we will be using Spark shell. In spark shell, Spark Context is already created for us by default called as ‘sc’.
We cannot create our own spark Context object. In Spark shell we can set our own master by ‘—master’ argument, can use ‘—jars’ argument, ‘—packages’, ‘—repositories’ arguments.
For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its class path, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
For a complete list of options, run spark-shell –help. Behind the scenes, spark-shell invokes the more general spark-submit script.