Apache Spark Small Talk

February 22, 2017

Apache Spark is sub project of Apache Hadoop. It is developed in University of California, Berkeley. Apache Spark was build for "Lighting faster cluster computing" as it's official web site says. By using Spark some issues in Hadoop was addressed well therefore Spark was popular in no time. It was open sourced in 2010 under BSD license.

Main Components in Apache Spark

Apache Spark has a few tightly integrated components as you can see here. As you may understand Spark Core has core functionalities like memory management, task scheduling, fault recovery etc. The main data abstraction Resilient Distributed Data set (RDD) is also defined in the Spark Core.

Spark SQL, Spark Streaming Real Time, MLib(Machine Lerning Library), Graph X each component has unique and different functionalities. Since we focus more on Spark Core functionalities and concepts at this stage, we will dive into those in later episodes of this serise.

Resilient Distributed Dataset (RDD)

This is a main abstraction used in Spark. RDD is a immutable collection of objects. These objects can be python, java, scala or user defined objects. This RDDs can be partitioned so then and be manipulated in different nodes on the cluster. A RDD can be created using following ways.

Distribution of collection of objects
Loading an external data set
Using a existing RDD (Transformations)

When Created a RDD we can perform two types of operations.

Transformations
Actions

I will explain what are these and what impact they have on RDD later in this blog. One thing to keep in mind is that when we perform a action on a RDD that RDD will be recomputed. If you want to use same RDD again with out recomputing then you can instruct to persist it by using RDD.persist(). Hare again we have a choice of persisting the RDD in memory or in disk. It is obvious that if your RDD is large it is better to keep in disk rather than in memory. It is best to load the RDD to memory using persist() when we repeatedly querying the same RDD.

To create a RDD as I metion we can use a existing collection in the program like a list or set. We can use parallelize() method. Here are few examples in python to do so,

fruits = ["apple","Orange","Pear"]

rdd= sc.parallelize(fruits) # this will create a RDD from fruits list

To create a RDD loading data from a external data set there are many ways. But we are already familer with the simple Spark job that we ran in our previous Spark episode. You might remember the following line we used. This is an example where we are creating a RDD using external data set.

lines = sc.textFile("README.md")

Transformations

Transformations returns a new RDD and they use Lazy Execution. I will explain Lazy Execution in latter part of this episode because it is more general concept. By using a transformation we can narrow down or broad the existing RDD by adding more data or filtering out data. It is not much complicated we will try to understand transformations using following example.

inputRDD = sc.textFile("README.md")
pythonRDD = inputRDD.filter(lambda x: "python" in x)
javaRDD = inputRDD.filter(lambda x: "java" in x)

programmingRDD = javaRDD.union(pythonRDD)

Here we have created four RDDs. inputRDD is created using a text file. As you can see we have used transformations to create other three RDDs. It is obvious that "filter" command is used to filter out the data in the inputRDD. As we unterstood before it is a function which narrow downs the existing RDD. The "union" function is another function which combines two RDD into one.

One important consent used in transformations is the "Linkage Graph". Spark uses a linkage graph to

Compute each RDDs.
Recover lost data if a part of RDD is missing.

Actions

These are the oparations that return a final value to driver program or write data to external storage system.

Below are two simple actions that we can perform on RDDs.

take(x) - This will collect top x number of elements from RDD.
collect() - This will retrieve an entire RDD. But one thing keep in mind using this is that in most cases the RDD will not be able to collect to driver program because it is too large. In such cases it is better to write data to HDFS or Amazon S3 using some other actions like saveAsTextFile() or saveAsSequneceFile().

One Last Concept- Lazy Evaluation

As we get know before transactions are lazily evaluated. Which means, transaction is not done immediately after it is requested. Even the loading data into RDD is lazily evaluated Instead of that Spark will record meta data to indicate that such transaction is requested. Therefore it is fine to think that RDD is not just contains data.

RDDs also contain instructions how to compute them.This is smart move by done by Spark to overtake Hadoop MapReduce because by using such evaluation method it allows Spark to reduce number of passes it have to perform by grouping operations.

Final Toast

I hope this helped you gain some knowledge about Apache Spark and to scratch it's surface. Please point out the error if I have made any. Your comments are welcome in order to improve this Apache Spark series.

Small Talks From Zero