A unified analytics engine designed for large-scale data processing. It provides an interface for programming clusters with data parallelism and fault tolerance.
Lazy evaluation: execution is deferred until an action (such as collect or save) is triggered.
RDD (Resilient Distributed Dataset): Basic data structure with operations like map, filter, and reduce. Provides fault tolerance through lineage.
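Lineage-based fault tolerance can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's API: the ToyRDD class and its lose_partition helper are hypothetical names chosen for illustration. Each dataset remembers its parent and the function that derived it, so a lost partition is recomputed from the parent rather than restored from a replica.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus the lineage that produced it."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent        # upstream ToyRDD, or None for source data
        self.fn = fn                # per-partition function applied to the parent
        self._source = partitions   # only set for source datasets
        self._cache = {}            # partition index -> computed list

    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions()

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i not in self._cache:
            if self.parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self.fn(self.parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a failed executor by dropping one computed partition."""
        self._cache.pop(i, None)

    def reduce(self, op):
        items = [x for i in range(self.num_partitions()) for x in self.compute(i)]
        return reduce(op, items)


source = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.reduce(lambda a, b: a + b)
evens_squared.lose_partition(1)                    # partition 1 is "lost"
after = evens_squared.reduce(lambda a, b: a + b)   # rebuilt from lineage
```

Both reductions give the same result because the lost partition is rebuilt by replaying map and filter on the parent's data. A real cluster applies the same idea across executors.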
DataFrame: High-level data structure optimized for querying, backed by the Catalyst optimizer. Supports SQL queries, making it a preferred choice for structured data.
Dataset: Type-safe, object-oriented API available in Scala and Java. Offers benefits of both RDDs and DataFrames, but the typed API is not available in Python.
Examples: map, filter, flatMap, union, groupByKey. These are lazy operations that produce new RDDs but don’t execute until an action is called.
Examples: collect, count, reduce, saveAsTextFile. Actions trigger the execution of transformations and return results to the driver.
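The split between lazy transformations and eager actions can be sketched in plain Python. This is a toy model, not Spark's implementation: LazyDataset and flat_map are hypothetical names (Spark's own method is flatMap), and everything runs in one process. Transformations only record a step; collect and count are the actions that run the recorded steps.

```python
from itertools import chain


class LazyDataset:
    """Toy model of lazy evaluation: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    # Transformations: return a new dataset and do no work.
    def map(self, f):
        return LazyDataset(self._data, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def flat_map(self, f):
        return LazyDataset(self._data, self._steps + (("flat_map", f),))

    def _run(self):
        # Chain the recorded steps into one iterator pipeline.
        items = iter(self._data)
        for kind, f in self._steps:
            if kind == "map":
                items = map(f, items)
            elif kind == "filter":
                items = filter(f, items)
            else:  # flat_map
                items = chain.from_iterable(map(f, items))
        return items

    # Actions: execute the pipeline and return a result to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def tokenize(line):
    calls.append(line)   # side effect so we can observe when work happens
    return line.split()


lines = LazyDataset(["a b", "c", "d e f"])
words = lines.flat_map(tokenize).filter(lambda w: w != "c").map(str.upper)

calls_before_action = len(calls)   # no transformation has run yet
result = words.collect()           # action: the recorded steps run now
```

In this sketch, each action replays the whole pipeline from the source, so a later words.count() calls tokenize again. Spark behaves similarly unless the data is cached.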
Catalyst Optimizer (for SQL and DataFrames): Uses rule-based and cost-based optimization to produce efficient query plans.
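Rule-based optimization can be pictured as repeatedly rewriting a query plan tree until no rule applies. The sketch below is a toy illustration in plain Python and does not reflect Catalyst's actual classes or rule set: Scan, Filter, Project, and the two rules are hypothetical, and Project is assumed to only select columns (no renames or expressions), which is what makes pushing a filter beneath it safe. Cost-based optimization is not shown.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    conditions: Tuple[str, ...]
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Scan, Filter, Project]


def combine_filters(plan):
    """Filter over Filter becomes a single Filter with both condition lists."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        inner = plan.child
        return Filter(inner.conditions + plan.conditions, inner.child)
    return plan


def push_filter_below_project(plan):
    """Filter over Project becomes Project over Filter, so rows are dropped earlier."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.columns, Filter(plan.conditions, proj.child))
    return plan


RULES = (combine_filters, push_filter_below_project)


def rewrite_once(plan):
    # Rewrite the subtree first, then try every rule on the current node.
    if not isinstance(plan, Scan):
        plan = replace(plan, child=rewrite_once(plan.child))
    for rule in RULES:
        plan = rule(plan)
    return plan


def optimize(plan, max_iterations=10):
    # Apply the rule batch until the plan stops changing (a fixed point).
    for _ in range(max_iterations):
        new_plan = rewrite_once(plan)
        if new_plan == plan:
            return plan
        plan = new_plan
    return plan


query = Filter(
    ("age > 30",),
    Project(("name", "age"), Filter(("country = 'DE'",), Scan("users"))),
)
optimized = optimize(query)
```

Here the outer filter is first pushed beneath the projection and then merged with the inner filter, leaving one Project over one Filter over the Scan.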
Memory Management: Manages caching, shuffle files, and resource allocation for performance.