Showing posts from February, 2017

Apache Spark Small Talk

Apache Spark is a top-level Apache project that is often used alongside Apache Hadoop, not a subproject of it. It was originally developed at the University of California, Berkeley. Apache Spark was built for "lightning-fast cluster computing", as its official website says. Spark addressed several of Hadoop's shortcomings well, so it became popular in no time. It was open sourced in 2010 under a BSD license.

Main Components in Apache Spark

Apache Spark has a few tightly integrated components. Spark Core provides the core functionality, such as memory management, task scheduling and fault recovery. The main data abstraction, the Resilient Distributed Dataset (RDD), is also defined in Spark Core. The other components, Spark SQL, Spark Streaming (real-time processing), MLlib (Machine Learning Library) and GraphX, each provide their own distinct functionality. Since we focus on Spark Core functionality and concepts at this stage, we will dive into the other components in later episodes of this series.

Resilient Distributed Dataset
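To give a rough feel for the idea behind a distributed dataset, here is a minimal sketch in plain Python. It is not Spark's API and it does not run on a cluster: the helper names (`split_into_partitions`, `map_partitions`, `reduce_partitions`) are made up for this illustration, and threads stand in for worker machines. It only shows the general pattern of splitting data into partitions, transforming each partition independently, and combining the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into chunks, one per pretend worker (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, func):
    """Apply func to every element, handling each partition in its own thread."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


def reduce_partitions(partitions, combine):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(combine, part) for part in partitions if part]
    return reduce(combine, partials)


numbers = list(range(1, 11))
partitions = split_into_partitions(numbers, 4)           # 4 pretend workers
squared = map_partitions(partitions, lambda x: x * x)    # transform each partition
total = reduce_partitions(squared, lambda a, b: a + b)   # combine partial sums
print(total)  # sum of the squares of 1..10
```

In real Spark the partitions live on different machines and the framework takes care of scheduling and recovering lost partitions; this sketch leaves all of that out.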

Getting Started With Apache Spark in Ubuntu

This is the first episode of the Apache Spark series, which I plan to continue up to advanced Apache Spark programming.

What is Apache Spark

Apache Spark is a cluster computing platform. In simpler terms, it executes your instructions on a cluster of computers instead of a single one, which lets it produce results faster. Apache Spark is mostly used for Big Data analysis, but it also has features for other areas such as Machine Learning and Graph Processing. The most satisfying thing about Apache Spark is that it supports Java, Scala and Python, which are all well-known languages.

Downloading Apache Spark

Please note that this installation walkthrough is for installing Spark on a single node.

Download Apache Spark from here. Select the Spark version and the package type. The package type is important if you are already using Apache Hadoop. Make sure you use the correct package type to avoid any conflicts. But please note that Apache Hadoop is NOT a nec