Spark Architecture Essentials
Spark's execution model: a Driver process coordinates the computation, breaks it into Tasks, and distributes tasks to Executors running on worker nodes. Data is stored in partitions — subsets of the dataset distributed across executors. The DataFrame API (Spark's primary API since Spark 2.0) provides a high-level abstraction over the distributed RDD (Resilient Distributed Dataset) layer.
Lazy evaluation: Transformations on DataFrames are lazy — they build a logical plan but do not execute. Execution is triggered by an action (count(), show(), write(), toPandas()). Spark's Catalyst optimiser analyses the full logical plan before execution, reordering filters and joins for efficiency. This is why you should chain transformations freely — Catalyst will optimise the execution plan.
Key DataFrame operations for ML feature engineering:
- `df.filter(F.col("date") > start_date)` — Partition pruning on Delta Lake tables uses predicate pushdown to read only matching files.
- `df.groupBy("user_id").agg(F.sum("amount").alias("total_spend"), F.count("*").alias("n_events"))` — Distributed aggregation. Triggers a shuffle to group data by user_id.
- `df.join(other_df, on="user_id", how="left")` — Distributed join. Use `broadcast(other_df)` if other_df is small (<~200MB): it eliminates the shuffle by broadcasting the small table to all executors.
- `Window.partitionBy("user_id").orderBy("event_timestamp")` — Spark window functions, equivalent to SQL window functions. Use `F.lag("amount", 1).over(window_spec)` for lag features.
Spark MLlib Pipeline API
Spark MLlib's Pipeline API follows the same Estimator/Transformer/Pipeline pattern as scikit-learn, but operates on distributed DataFrames:
- StringIndexer — Encodes a string column into numeric indices (sorted by frequency). `StringIndexer(inputCol="category", outputCol="category_idx")`. Required before OneHotEncoder and before tree-based models, which need numeric inputs.
- OneHotEncoder — Encodes indexed categories as sparse binary vectors. Applied after StringIndexer.
- VectorAssembler — Combines multiple numeric columns into a single feature vector column. Required by all MLlib models. `VectorAssembler(inputCols=["feature1", "feature2", ...], outputCol="features")`.
- StandardScaler — Normalises features to zero mean and unit standard deviation. Applied after VectorAssembler.
- Models — LogisticRegression, RandomForestClassifier, GBTClassifier (gradient-boosted trees), LinearRegression, KMeans. All support `fit(training_df)` and `transform(test_df)`. GBTClassifier is typically the highest-performing classical ML model in MLlib for tabular data.
For deep learning at Spark scale, spark-pytorch-distributor (part of Databricks Runtime) runs distributed PyTorch DDP or FSDP training on a Spark cluster, with each Spark worker acting as a DDP process.
Delta Lake for ML Pipelines
Delta Lake solves four problems that arise in production ML pipelines built on object storage:
- Atomicity — Writes to a Delta table are atomic. Either all files from a write succeed, or none do. No partial writes visible to concurrent readers. Critical for feature store tables updated during training.
- Time travel — `spark.read.format("delta").option("versionAsOf", 42).load(path)` or `timestampAsOf`. Read a table as it existed at any point in history. Essential for reproducing training datasets and debugging model degradation: "what did the feature table look like when we trained the model that started failing?"
- MERGE (upsert) — Incrementally update a feature table: `DeltaTable.forPath(spark, path).alias("target").merge(source_df.alias("source"), "target.user_id = source.user_id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()`. Atomically updates matching rows and inserts new ones.
- Schema enforcement and evolution — Delta rejects writes with schema mismatches by default; `option("mergeSchema", "true")` allows adding new columns. Prevents accidental schema corruption from upstream data pipeline changes.
Frequently Asked Questions
When should you use Spark instead of Pandas?
Pandas: dataset fits in RAM (<10–20GB), simpler API, faster for prototyping. Spark: dataset larger than a single machine's memory (100GB+), data already in distributed storage (Delta Lake, Parquet on S3/HDFS), or need distributed table joins. In practice, Spark for upstream data preparation; Pandas for the final aggregated feature dataset used in training.
What is the difference between narrow and wide transformations?
Narrow: each partition processed independently, no data movement (map, filter, union). Wide: data combined across partitions, triggers a shuffle — disk I/O, network transfer, serialisation (groupBy, join, repartition). Shuffles are the most expensive Spark operations. Minimise by filtering before joins, using broadcast joins for small tables, and repartitioning only when necessary.
What is Delta Lake and why does it matter for ML pipelines?
Delta Lake adds ACID transactions, versioning, and schema enforcement to Parquet on object storage. Key for ML: time travel (query data as of past timestamp for point-in-time correct features), MERGE for incremental feature table updates, Change Data Feed for efficient incremental computation. The foundation of the Databricks Lakehouse architecture.
What is the Spark MLlib Pipeline API?
Spark MLlib's Pipeline API mirrors scikit-learn: Transformers (transform() method, e.g., StringIndexer, VectorAssembler, StandardScaler), Estimators (fit() returns a Transformer, e.g., LogisticRegression, RandomForestClassifier), and Pipeline chains them. CrossValidator and TrainValidationSplit for evaluation. Best for large-scale classical ML; for deep learning, use PyTorch/TF on Spark with spark-pytorch-distributor.
What is Databricks and how does it relate to Spark?
Databricks is the managed platform created by Apache Spark's original authors. It provides managed Spark clusters, an optimised runtime (the Photon engine, typically 2–4× faster than open-source Spark), Delta Lake integration, MLflow (also created by Databricks), Unity Catalog for data governance, and notebooks. It is the dominant Spark platform in UK enterprise. Databricks SQL Warehouse provides serverless SQL on Delta Lake without managing Spark clusters.