Apache Spark

1. Spark Basics

What is Apache Spark?
A unified analytics engine designed for large-scale data processing. It provides an interface for programming clusters with data parallelism and fault tolerance.
Core Concepts:
- Resilient Distributed Dataset (RDD): An immutable collection of objects, partitioned across the nodes in a cluster so that it can be processed in parallel.
- Directed Acyclic Graph (DAG): Spark's execution plan. It records the lineage of RDD transformations as a graph, which the scheduler splits into stages.
- Lazy Evaluation: Spark runs computations only when an action (such as collect or saveAsTextFile) is triggered.
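
Lazy evaluation can be illustrated with a minimal pure-Python sketch. ToyRDD, traced_double, and the other names below are hypothetical and are not part of Spark's API; the class only mimics the idea that transformations record lineage and an action replays it.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records transformations instead of running them."""

    def __init__(self, data, lineage=()):
        self._data = tuple(data)   # source data, never mutated
        self._lineage = lineage    # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD with one more lineage step.
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, nothing is evaluated here.
        return ToyRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded lineage replayed over the data.
        rows = list(self._data)
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(x) for x in rows]
            else:
                rows = [x for x in rows if fn(x)]
        return rows


calls = []

def traced_double(x):
    calls.append(x)   # side effect lets us observe when work really happens
    return x * 2

base = ToyRDD([1, 2, 3, 4])
doubled = base.map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has executed yet
result = doubled.collect()         # the action triggers the whole chain
```

In this sketch traced_double is not called until collect runs, and base is left unchanged because every transformation returns a new object.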

Components:
- Driver Program: Hosts the SparkContext, builds the DAG, and schedules tasks on the executors.
- Cluster Manager: Allocates resources across applications (can be Spark's standalone manager, YARN, Mesos, or Kubernetes).
- Executors: Run on worker nodes to execute tasks and store RDD partitions.
- Tasks: The smallest units of work; each task processes one partition and is assigned to an executor.
Execution Workflow:
1. Submit Job: The driver program starts a Spark job when an action is called.
2. Task Scheduling: The DAG scheduler breaks the job into stages at shuffle boundaries and creates one task per RDD partition in each stage.
3. Execution: Executors perform the tasks and return results to the driver.
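
The scheduling step above can be sketched roughly as follows. This is a simplified model, not Spark's DAGScheduler: Op, split_into_stages, and plan_tasks are hypothetical names, a new stage is assumed to start at every shuffle (wide) operation, and every stage is assumed to keep the same number of partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    wide: bool  # True if the operation needs a shuffle (e.g. groupByKey)


def split_into_stages(ops):
    """Group consecutive operations into stages, cutting at each shuffle."""
    stages = [[]]
    for op in ops:
        if op.wide and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op.name)
    return stages


def plan_tasks(stages, num_partitions):
    """Create one (stage_id, partition_id) task per partition in each stage."""
    return [
        (stage_id, partition_id)
        for stage_id in range(len(stages))
        for partition_id in range(num_partitions)
    ]


ops = [
    Op("map", wide=False),
    Op("filter", wide=False),
    Op("groupByKey", wide=True),
    Op("map", wide=False),
]
stages = split_into_stages(ops)
tasks = plan_tasks(stages, num_partitions=3)
```

Here the four operations collapse into two stages, and with three partitions the sketch plans six tasks in total.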

RDDs (Resilient Distributed Datasets):
Basic data structure with operations like map, filter, and reduce. Provides fault tolerance through lineage: a lost partition is recomputed from the transformations that produced it.
DataFrames:
High-level data structure that organizes data into named columns and is optimized for querying, backed by the Catalyst optimizer. Supports SQL queries, making it the preferred choice for structured data.
Datasets:
Type-safe, object-oriented API available in Scala and Java. Offers the benefits of both RDDs and DataFrames but is not available in Python, which exposes only DataFrames.
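
The lineage-based fault tolerance noted for RDDs above can be illustrated with a toy model. Everything here is hypothetical (build_partition, the in-memory source dictionary standing in for stable storage), and it assumes each output partition depends on exactly one input partition, which is not true after a shuffle.

```python
def build_partition(source_rows, lineage):
    """Recompute one partition by replaying its lineage over the source rows."""
    rows = list(source_rows)
    for step in lineage:
        rows = step(rows)
    return rows


# Source data split into two partitions (stand-in for stable storage).
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# Lineage: the ordered transformations that produce the final dataset.
lineage = [
    lambda rows: [x * 10 for x in rows],       # map
    lambda rows: [x for x in rows if x > 10],  # filter
]

# Materialise every partition, as executors would.
cache = {pid: build_partition(rows, lineage) for pid, rows in source.items()}
expected = dict(cache)

del cache[1]  # simulate losing the executor that held partition 1

# Recovery: only the missing partition is rebuilt from source and lineage.
rebuilt = []
for pid in source:
    if pid not in cache:
        cache[pid] = build_partition(source[pid], lineage)
        rebuilt.append(pid)
```

No replica of the lost data is needed in this sketch; knowing the source and the lineage is enough to rebuild it.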

Transformations:
Examples: map, filter, flatMap, union, groupByKey. These are lazy operations that produce new RDDs but don’t execute until an action is called.
Actions:
Examples: collect, count, reduce, saveAsTextFile. Actions trigger the execution of transformations and either return results to the driver or write them to external storage.
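
To make the semantics of these operations concrete, the sketch below reproduces them on an ordinary Python list. It is not PySpark code: it runs eagerly on one machine, and the variable names and sample data are made up for illustration.

```python
from functools import reduce
from itertools import chain

lines = ["spark is fast", "spark is lazy"]

# Transformations (plain-Python equivalents of the listed RDD operations)
words = list(chain.from_iterable(line.split() for line in lines))  # flatMap
pairs = [(word, 1) for word in words]                               # map
no_is = [(word, n) for word, n in pairs if word != "is"]            # filter
both = no_is + [("spark", 1)]                                       # union (keeps duplicates)

grouped = {}                                                        # groupByKey
for key, value in both:
    grouped.setdefault(key, []).append(value)

# Actions (these are the calls that would trigger execution in Spark)
collected = sorted(grouped.items())                       # collect
n_keys = len(grouped)                                     # count
total = reduce(lambda a, b: a + b, (n for _, n in both))  # reduce
```

flatMap differs from map in that each input element may yield several output elements, and union simply concatenates the two datasets without removing duplicates.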

Catalyst Optimizer (for SQL and DataFrames): Uses rule-based and cost-based optimization to produce efficient query plans.
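
As a rough illustration of the rule-based side only, the sketch below rewrites a tiny expression tree until no rule changes it. It is a toy, not Catalyst's implementation: the tuple encoding and the names fold_constants, drop_multiply_by_one, and optimize are hypothetical, and cost-based optimization is not modelled.

```python
def fold_constants(expr):
    """Rule: replace an operation on two literals with its literal result."""
    if expr[0] in ("add", "mul"):
        op, left, right = expr
        left, right = fold_constants(left), fold_constants(right)
        if left[0] == "lit" and right[0] == "lit":
            value = left[1] + right[1] if op == "add" else left[1] * right[1]
            return ("lit", value)
        return (op, left, right)
    return expr


def drop_multiply_by_one(expr):
    """Rule: simplify x * 1 and 1 * x to x."""
    if expr[0] in ("add", "mul"):
        op, left, right = expr
        left, right = drop_multiply_by_one(left), drop_multiply_by_one(right)
        if op == "mul" and right == ("lit", 1):
            return left
        if op == "mul" and left == ("lit", 1):
            return right
        return (op, left, right)
    return expr


def optimize(expr, rules):
    """Apply every rule repeatedly until the expression stops changing."""
    while True:
        new_expr = expr
        for rule in rules:
            new_expr = rule(new_expr)
        if new_expr == expr:
            return expr
        expr = new_expr


rules = [fold_constants, drop_multiply_by_one]

# price * (1 + 0) simplifies to just the column reference
plan_a = ("mul", ("col", "price"), ("add", ("lit", 1), ("lit", 0)))
optimized_a = optimize(plan_a, rules)

# x + (2 * 3) keeps the addition but folds the constant part
plan_b = ("add", ("col", "x"), ("mul", ("lit", 2), ("lit", 3)))
optimized_b = optimize(plan_b, rules)
```

Each rule is a small tree-to-tree function, so new rewrites can be added to the list without touching the driver loop.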

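Memory Management: Spark's unified memory manager shares executor memory between storage (cached data) and execution (shuffles, joins, sorts, and aggregations), letting each borrow unused memory from the other.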
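
As a rough worked example of how an executor's heap is divided between caching and execution, the arithmetic below uses the documented defaults of spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5) together with the 300 MiB that Spark reserves. The function name and the 4096 MiB heap size are assumptions for illustration, and off-heap memory and container overhead are ignored.

```python
RESERVED_MIB = 300  # fixed amount Spark sets aside from the executor heap


def unified_memory_regions(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate memory regions (in MiB) for a given executor heap size."""
    usable = heap_mib - RESERVED_MIB
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # cached data protected from eviction
    return {
        "unified_mib": unified,
        "storage_mib": storage,
        "execution_mib": unified - storage,
        "user_mib": usable - unified,     # user data structures, Spark metadata
    }


regions = unified_memory_regions(heap_mib=4096)
```

The boundary between storage and execution is soft: either side can use the other's share while it is idle, so these figures describe a starting split rather than hard limits.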