Effective Big Data Analytics: Spark

Paper http://nil.csail.mit.edu/6.824/2020/papers/zaharia-spark.pdf

Lecture https://www.youtube.com/watch?v=mzIoSW-cInA

Why doe ?

Existing cluster computing frameworks Mapreduce lack abstractions for leveraging distributed memory which makes it inefficient for applications that reuse intermediate results across multiple computations. Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Unfortunately in most current frameworks the only way to reuse data between computations (e.g: between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization.

Enter RDD

Spark paper proposes an abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

The main challenge in RDD is providing fault tolerance which existing abstractions based on fine-grained updates to mutable state provided with replicating data across machines which is quite expensive especially in the big data realm

RDD provides an interface for coarse-grained transformations which means they are applied over an entire dataset rather than a subset of data and used functional operators like map, filter and join.

Spark’s model takes advantage of this because once it saves your small DAG of operations which is rather small compared to the data you are processing and it can use that to recompute as long as the original data is still there.

Checkpointing the data in some RDDs may be useful when a lineage chain grows large

Fun RDD facts

RDD is a read-only, partitioned collection of records
RDDs can only be created through deterministic operations
We call these operations transformations (eg: map, filter, and join)
Users can control persistence (reuse across ops) and partitioning (parallel ops)

RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. Spark also computes RDDs lazily so that it can pipeline transformations.

RDDs provide an efficient programming model for batch analytics and would be less suitable for applications that make asynchronous finegrained updates to shared state.

Spark be built different

In Spark Driver is the process where the main method runs. It creates SparkSession or SparkContext. Driver converts the user program to tasks and schedules them on the executors.

Executors are worker nodes processes in charge of running individual tasks in a given Spark job and then sending the results back to the driver. They can cache/ persist data on the worker node and provide in-memory storage for RDDs that are cached by user programs through Block Manager. Executors are dynamically launched and removed as required by the driver. Spark is dependent on the Cluster Manager to launch the Executors and also the Driver (Yarn, Mesos or standalone).

RDD Operations in Spark

RDD support Transformations and Actions

Transformations are lazy operations that take an RDD as the input and produce one or many RDDs as the output.

Narrow transformation — Elements required to compute the records in single partition live in the single partition of parent RDD. eg: map(), filter().
Wide transformation — Elements that are required to compute the records in the single partition may live in many partitions of parent RDD. eg: groupbyKey and reducebyKey.

Actions launch a computation to return a value to the program or write data to external storage. It is one of the ways of sending data from spark executor to the driver. eg: reduce, collect, foreach etc

Operations like join are only available on RDDs of key-value pairs. Also most function names are chosen to match Scala/functional language APIs. eg: map is a one-to-one mapping, while flatMap maps each input value to one or more outputs.

When working with spark it is important to remember that it is a distributed system and passing around something stateful might land you in bizzaro land

var counter = 0
var rdd = sc.parallelize(data)
// Wrong 💀 DONOTDOTHIS !!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)

To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is variables and methods which must be visible for the executor to perform its computations on the RDD like foreach(). This closure is serialized and sent to each executor so when counter is referenced within the foreach function it’s no longer the counter on the driver node since executors only see the copy from the serialized closure thus the final value of counter will still be zero

Spark provides you with two limited types of shared variables for two common usage patterns namely broadcast variables and accumulators

Accumulator can be used mainly when global aggregation is needed (Counter), They are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel

val accum = sc.accumulator(0) 
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

In actions spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations tho each task’s update may be applied more than once if tasks or job stages are re-executed.

Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action.

There are also broadcast variables which allow allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

In cluster mode when running rdd.foreach(println) the output to stdout being called by the executors is writing to the executor’s stdout instead of the driver so we use collect() to first bring the RDD to the driver node and then print rdd.collect().foreach(println) if the RDD is big this can cause the driver to run out of memory.

Useful guide on RDD operations: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

Certain operations within Spark trigger an event known as the Shuffle which is a way of re-distributing data so that it’s grouped differently across partitions. Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O, They can also consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically reduceByKey and aggregateByKey create these structures on the map side and ByKey operations generate these on the reduce side.

When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Job Scheduling

As a core component of data processing platform, scheduler is responsible for schedule tasks on compute units. Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD’s lineage graph to build a DAG of stages to execute. Each stage contains as many pipelined transformations with narrow dependencies as possible. The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can shortcircuit the computation of a parent RDD. The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

Our scheduler assigns tasks to machines based on data locality (placing tasks on nodes that contain their input data) using Delay scheduling which is an algo that addresses the conflict between locality and fairness: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.

Final Touches

RDDs themselves in Spark are now somewhat deprecated and spark has recently moved towards “DataFrames” which implements a more column-oriented representation while maintaining the good ideas from RDDs.

Spark doesn’t need complicated protocols like Raft because RDDs are immutable and can always be recomputed using the lineage graph.