An RDD (Resilient Distributed Dataset) in Spark is an immutable, distributed collection of objects. Each RDD is split into multiple partitions (smaller units), which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, … Continue reading Spark: Programming with RDDs
Month: May 2018
Apache Spark Architecture
To understand how Spark runs, it is important to know its architecture. The following diagram and discussion give a clearer view of it. There are three ways Apache Spark can run: Standalone – The Hadoop cluster can be equipped with all the resources statically and Spark can … Continue reading Apache Spark Architecture
Hadoop Multi Node Clusters
Installing Java
Syntax of the java version command:
$ java -version
The following output is presented:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
Creating User Account
A system user account should be created on both the master and slave systems to use the Hadoop installation.
# useradd hadoop
# … Continue reading Hadoop Multi Node Clusters