Apache Spark Tutorial - BigDatio

Last Updated: April 8, 2022
Fran H

Throughout this Apache Spark tutorial you will learn how to use Spark for analytical data processing through simple examples in Scala whose code you can copy, paste and run for your own tests.

We will mainly use Scala and Python to show the examples, and this tutorial is intended to be a source of knowledge and code for Apache Spark query examples.

Apache Spark

What is Apache Spark?

Apache Spark is an open source analytical processing engine for large-scale distributed data processing and machine learning applications.

It was originally developed in 2009 at UC Berkeley, California by Matei Zaharia and subsequently donated to the Apache Software Foundation.

In February 2014, Spark became a top-level Apache project. It has since been developed by thousands of engineers, making it one of Apache's most active open source projects and the largest open source project in data processing.

It uses in-memory caching and optimized query execution to perform fast queries against big data. In short, Spark is a fast, distributed, general engine for large-scale data processing.

Apache Spark Architecture

[Architecture diagram: the Spark libraries (Spark SQL + DataFrames, Streaming, MLlib Machine Learning, GraphX Graph Computation) sit on top of the Spark Core API (Scala, Python, SQL, R, Java), which runs on a cluster manager: YARN, Mesos, the Standalone Scheduler or Kubernetes.]

  • Spark SQL + DataFrames: Spark SQL is the easiest way for business intelligence users, data scientists and analysts to explore data. Built on DataFrames and distributed processing, it can run queries up to 100x faster than Hadoop Hive.
  • Streaming: Some applications have to process data in real time, not just in batch. Spark Streaming handles these workloads while keeping Spark's advantages, such as fault tolerance and integration with a growing number of data sources such as HDFS, Flume, Kafka and Twitter.
  • MLlib - Machine Learning: Built on top of Spark, MLlib is a scalable machine learning library that delivers high-quality algorithms and offers up to 100 times the speed of MapReduce. It can be used from Scala, Java and Python.
  • GraphX - Graph Computation: GraphX is a graph computation engine that allows users to interactively build, transform and reason about graph-structured data at scale. It comes complete with a library of common algorithms.
  • Spark Core API: the underlying general execution engine that the other modules are built on, with APIs available in Scala, Python, SQL, R and Java.

Apache Spark works in a master-slave architecture where the master is called the "Driver" and the slaves are called "Workers". When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all transformations and actions are executed on worker nodes, and the resources are managed by the Cluster Manager.

Cluster Manager Types

As of writing this Apache Spark tutorial, Spark supports the following cluster managers:

  • Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.

  • Apache Mesos – a cluster manager that can also run Hadoop MapReduce and Spark applications.

  • Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.

  • Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.

  • local – not really a cluster manager, but worth mentioning because we use “local” as the master() in order to run Spark on your laptop/computer, as in the sketch below.
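
A minimal Scala sketch (the app name and host addresses are placeholders, not values from this tutorial) of how the master URL you pass to SparkSession.builder() selects the cluster manager:

// "local[*]" runs Spark on your own machine using all available cores.
// Swap the master URL to target a real cluster manager instead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerExample")   // hypothetical application name
  .master("local[*]")                 // or "yarn", "spark://host:7077",
                                      //    "mesos://host:5050", "k8s://https://host:443"
  .getOrCreate()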

Apache Spark Features

  • In-memory computation
  • Distributed processing using parallelize
  • Can be used with many cluster managers (Spark Standalone, YARN, Mesos, Kubernetes, ...)
  • Fault-tolerant
  • Immutable
  • Lazy evaluation (see the sketch after this list)
  • Cache & persistence
  • In-built optimization when using DataFrames
  • Supports ANSI SQL
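
As a small illustration of lazy evaluation and caching (it assumes the spark SparkSession object that spark-shell provides, introduced later in this tutorial; the data is made up), the sketch below does no work until an action is called, and cache() keeps the computed RDD in memory for reuse.

// Transformations are lazy: nothing is computed when filter() is called.
val numbers = spark.sparkContext.parallelize(1 to 1000000)
val evens = numbers.filter(_ % 2 == 0)

// cache() only marks the RDD to be kept in memory once it is computed.
evens.cache()

// The first action triggers the actual computation; later actions reuse the cached data.
println(evens.count())
println(evens.first())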

Apache Spark Advantages over Hadoop MapReduce

  • Spark runs in-memory while MapReduce writes intermediate results to disk.

  • Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently in a distributed fashion.

  • Applications running on Spark can be up to 100x faster than traditional systems like Hadoop MapReduce.

  • You will get great benefits using Spark for data ingestion pipelines.

  • Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many file systems.

  • Spark is also used to process real-time data using Spark Streaming and Kafka.

  • Using Spark Streaming you can also stream files from the file system and stream data from sockets.

  • Spark natively has machine learning (MLlib) and Graph libraries.

Spark Installation

In order to run the Apache Spark examples mentioned in this tutorial, you need to have Spark and its required tools installed on your computer. Since most developers use Windows for development, I will explain how to install Spark on Windows in this tutorial. You can also install Spark on a Linux server if needed.

Download Apache Spark from the Spark download page and select the link from “Download Spark (point 3)”. If you want to use a different version of Spark & Hadoop, select it from the drop-downs; the link at point 3 then changes to the selected version and gives you an updated download link.

After the download, extract the binary using 7zip and copy the extracted folder spark-3.0.0-bin-hadoop2.7 to c:\apps.

Now set the following environment variables.

SPARK_HOME  = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin
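
On Windows, one way to set these permanently is with setx from a Command Prompt (the path below assumes you copied Spark to c:\apps as above; open a new terminal afterwards for the variables to take effect, and note that setx truncates very long PATH values):

setx SPARK_HOME "C:\apps\spark-3.0.0-bin-hadoop2.7"
setx HADOOP_HOME "C:\apps\spark-3.0.0-bin-hadoop2.7"
setx PATH "%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin"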

Setup winutils.exe

Download the winutils.exe file and copy it to the %SPARK_HOME%\bin folder. winutils is different for each Hadoop version, so download the right version from https://github.com/steveloughran/winutils

spark-shell

The Spark binary comes with an interactive spark-shell. In order to start a shell, go to your SPARK_HOME/bin directory and type “spark-shell”. This command starts the shell and displays which version of Spark you are using.

By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects ready to use. Let’s see an example.
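
For example, you can type the following at the scala> prompt (the values are arbitrary sample data):

// spark (SparkSession) and sc (SparkContext) are already created by spark-shell.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.count())        // 5

val df = spark.range(5).toDF("id")
df.show()                   // prints a small DataFrame with ids 0 to 4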

spark-shell also starts the Spark context web UI, which by default can be accessed at http://localhost:4040 (if that port is taken, Spark uses the next free one, such as 4041).

Spark-submit

The spark-submit command is a utility for running or submitting a Spark or PySpark application (or job) to the cluster by specifying options and configurations. The application you are submitting can be written in Scala, Java, or Python (PySpark). You can use this utility to do the following:

  • Submit a Spark application to different cluster managers such as YARN, Kubernetes, Mesos and Standalone.
  • Submit a Spark application in client or cluster deploy mode.

./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number of cores> \
  --jars <comma separated dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]
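
For instance, a hypothetical submission of a Scala application to YARN in cluster mode might look like the following (the class name, jar, memory sizes and arguments are placeholders, not from this tutorial):

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.SparkApp \
  my-spark-app.jar input.csv output/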

Spark Web UI

Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, resource consumption of Spark cluster, and Spark configurations. On Spark Web UI, you can see how the operations are executed.

Spark History Server

The Spark History Server keeps a log of all completed Spark applications you submit by spark-submit or spark-shell. Before you start it, you first need to set the configuration below in spark-defaults.conf:

spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path

Now start the Spark History Server on Linux or Mac by running:

$SPARK_HOME/sbin/start-history-server.sh

If you are running Spark on Windows, you can start the History Server with the command below:

$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer

By default, the History Server listens on port 18080, and you can access it from a browser at http://localhost:18080/

By clicking on each App ID, you will get the details of the application in Spark web UI.

The History Server is very helpful when you are doing Spark performance tuning, since you can cross-check a previous application run against the current one.

Spark Modules

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Spark GraphX

In this section of the Apache Spark Tutorial, you will learn different concepts of the Spark Core library with examples in Scala code. Spark Core is the base library of Spark; it provides the abstractions for distributed task dispatching, scheduling and basic I/O functionality.

Before getting your hands dirty with Spark programming, have your development environment set up to run the Spark examples using IntelliJ IDEA.

SparkSession

SparkSession, introduced in version 2.0, is the entry point to the underlying Spark functionality for programmatically working with Spark RDDs, DataFrames and Datasets. Its object spark is available by default in spark-shell.

Creating a SparkSession instance is the first statement you write in a program that uses RDDs, DataFrames or Datasets (in Scala, Python or Java). A SparkSession is created using SparkSession.builder(), as you can see below.

import org.apache.spark.sql.SparkSession

val spark:SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()
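
Once created, the spark object gives you access to the rest of the API. As a quick, hypothetical sanity check (the sample data is made up):

// Print the Spark version of the session created above.
println("Spark version: " + spark.version)

// Create a small DataFrame from a local collection and display it.
import spark.implicits._
val df = Seq(("Scala", 2001), ("Python", 1991)).toDF("language", "year")
df.show()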

Spark Context

SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and used to be the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext was the first step in a program working with RDDs and connecting to a Spark cluster. Its object sc is available by default in spark-shell.

Since Spark 2.x, when you create a SparkSession, a SparkContext object is created by default, and it can be accessed using spark.sparkContext.

Note that you can create just one SparkContext per JVM but can create many SparkSession objects.
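
A short sketch of both points, accessing the SparkContext of an existing SparkSession and creating an additional session that shares it (the app name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkContextExample")   // hypothetical application name
  .getOrCreate()

// The SparkContext is created for you and exposed on the session.
val sc = spark.sparkContext
println(sc.appName)

// Additional SparkSessions can be created, but they share the single SparkContext of the JVM.
val spark2 = spark.newSession()
println(spark2.sparkContext eq sc)  // true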

Apache Spark RDD Tutorial

RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark; it is the primary data abstraction in Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.

This Apache Spark RDD Tutorial will help you start understanding and using Apache Spark RDD (Resilient Distributed Dataset) with Scala code examples. All RDD examples provided in this tutorial were also tested in our development environment and are available at GitHub spark scala examples project for quick reference.

In this section of the Apache Spark tutorial, I will introduce RDDs and explain how to create them and use their transformation and action operations. See the full article on Spark RDDs if you want to learn more and strengthen your fundamentals.

RDD creation

RDDs are created primarily in two different ways: first, by parallelizing an existing collection, and second, by referencing a dataset in an external storage system (HDFS, S3 and many more).

sparkContext.parallelize()

sparkContext.parallelize is used to parallelize an existing collection in your driver program. This is a basic method to create an RDD.

// Create RDD from parallelize
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd = spark.sparkContext.parallelize(dataSeq)

sparkContext.textFile()

Using the textFile() method we can read a text (.txt) file from many sources, such as HDFS, S3, Azure, the local file system, etc., into an RDD.

// Create RDD from external Data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

RDD Operations

On Spark RDD, you can perform two kinds of operations:

RDD Transformations

Spark RDD transformations are lazy operations, meaning they don’t execute until you call an action on the RDD. Since RDDs are immutable, when you run a transformation (for example map()), it returns a new RDD instead of updating the current one.

Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter() and sortByKey(); all of these return a new RDD instead of updating the current one.
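
As a small word-count style sketch of these transformations (the input lines are made up; it assumes the spark object from earlier), note that the chain below only describes the computation and nothing runs until an action is called:

val lines = spark.sparkContext.parallelize(
  Seq("spark makes big data simple", "spark is fast"))

val wordCounts = lines
  .flatMap(_.split(" "))                    // split each line into words
  .map(word => (word, 1))                   // pair each word with a count of 1
  .reduceByKey(_ + _)                       // sum the counts per word
  .filter { case (_, count) => count > 1 }  // keep words that appear more than once
  .sortByKey()                              // sort the remaining pairs by word

// wordCounts is just a new RDD; no job has executed yet.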

RDD Actions

An RDD action operation returns values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action. Actions trigger the computation of the lazy transformations and return the result to the driver program.

Some actions on RDDs are count(), collect(), first(), max(), reduce() and more.
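
A few of these actions on a tiny RDD (the numbers are arbitrary; it assumes the spark object from earlier), each one triggering a job and returning a plain Scala value to the driver:

val nums = spark.sparkContext.parallelize(Seq(5, 1, 4, 2, 3))

println(nums.count())             // 5  - number of elements
println(nums.first())             // 5  - first element
println(nums.max())               // 5  - largest element
println(nums.reduce(_ + _))       // 15 - sum of all elements
nums.collect().foreach(println)   // bring every element back to the driver and print it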