Shared Variables in Spark

Explore how shared variables in Spark enable efficient data sharing and aggregation across worker nodes. Understand broadcast variables for distributing static data and accumulators for aggregating updates. This lesson covers setup overhead reduction and practical use cases in distributed computing with Spark.

We'll cover the following:

- Shared variables
  - Broadcast variables
    - Implementation of broadcast variables
  - Accumulators
    - Implementation of accumulators

Besides RDDs, Spark offers a second abstraction: distributed shared variables. We may want to send static data to all the workers (driver-to-worker information flow) or collect some state from all the workers (workers-to-driver information flow). Spark's shared variables support both scenarios: broadcast variables cover the first, and accumulators cover the second.
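The two information flows can be pictured with a small plain-Python simulation. This is only a sketch and does not use the Spark API: `COUNTRY_NAMES`, `process_partition`, and `run_job` are hypothetical names invented for illustration. In PySpark, the corresponding roles are played by `SparkContext.broadcast` (read-only data shipped to workers) and `SparkContext.accumulator` (values that workers add to and the driver reads).

```python
# Hypothetical static data that the driver wants every worker to read.
COUNTRY_NAMES = {"us": "United States", "de": "Germany", "pk": "Pakistan"}


def process_partition(records, lookup):
    """Simulated worker task: reads the shared lookup, counts misses locally."""
    resolved = []
    unknown = 0  # this task's contribution to the driver-side counter
    for code in records:
        name = lookup.get(code)
        if name is None:
            unknown += 1
        else:
            resolved.append(name)
    return resolved, unknown


def run_job(partitions, lookup):
    """Simulated driver: hands the lookup to each task and merges the counts."""
    results = []
    unknown_total = 0  # plays the role of an accumulator
    for part in partitions:
        # Driver-to-worker flow: the same read-only lookup goes to every task.
        resolved, unknown = process_partition(part, lookup)
        results.extend(resolved)
        # Workers-to-driver flow: per-task updates are summed on the driver.
        unknown_total += unknown
    return results, unknown_total


partitions = [["us", "de", "xx"], ["pk", "??", "us"]]
names, unknown_total = run_job(partitions, COUNTRY_NAMES)
print(names)
print(unknown_total)
```

In this sketch the workers never modify the lookup and never read the running total, which mirrors the intent described above: static data flows outward only, and aggregated state flows back to the driver only.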

Shared variables

Some operations require setup work for each partition, such as creating a random number generator for a specific distribution. The user will have to create and send it to the worker ...
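As a rough illustration of per-partition setup, the following plain-Python sketch runs a setup step once per partition and reuses the result for every record in that partition. It is not Spark code: `make_generator`, `sample_partition`, and the `setup_calls` counter are hypothetical names, and the seeded `random.Random` object merely stands in for whatever costly object the setup would produce. In PySpark, a similar pattern is usually written with `RDD.mapPartitions` or `RDD.mapPartitionsWithIndex`.

```python
import random

setup_calls = 0  # counts how often the (assumed costly) setup runs


def make_generator(seed):
    """Hypothetical setup step: build a seeded random number generator."""
    global setup_calls
    setup_calls += 1
    return random.Random(seed)


def sample_partition(partition_index, records):
    """Run the setup once, then reuse the generator for every record."""
    rng = make_generator(partition_index)
    # Pair each record with a draw from a standard normal distribution.
    return [(record, rng.gauss(0.0, 1.0)) for record in records]


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
samples = [sample_partition(i, part) for i, part in enumerate(partitions)]

print(setup_calls)                    # one setup per partition
print(sum(len(s) for s in samples))   # every record still gets a sample
```

With three partitions and nine records, the setup runs three times instead of nine, which is the overhead reduction that per-partition setup is meant to achieve.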
