    Apache Spark for Machine Learning
    The 2026 Skills Guide

    Spark is essential for ML engineers working with large-scale datasets at UK companies. This guide covers PySpark DataFrames, the MLlib Pipeline API, Delta Lake for feature engineering, and how the Databricks platform builds on Spark.

    Spark Architecture Essentials

    Spark's execution model: a Driver process coordinates the computation, breaks it into Tasks, and distributes tasks to Executors running on worker nodes. Data is stored in partitions — subsets of the dataset distributed across executors. The DataFrame API (Spark's primary API since Spark 2.0) provides a high-level abstraction over the distributed RDD (Resilient Distributed Dataset) layer.

    Lazy evaluation: Transformations on DataFrames are lazy — they build a logical plan but do not execute. Execution is triggered by an action (count(), show(), write(), toPandas()). Spark's Catalyst optimiser analyses the full logical plan before execution, reordering filters and joins for efficiency. This is why you should chain transformations freely — Catalyst will optimise the execution plan.
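
    To see lazy evaluation in action, here is a minimal sketch (assuming a SparkSession named spark and a hypothetical Parquet dataset at /data/events):

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

        # Transformations only build a logical plan; no data is processed yet.
        events = spark.read.parquet("/data/events")            # hypothetical path
        recent = events.filter(F.col("date") > "2025-01-01")   # lazy
        by_user = recent.groupBy("user_id").agg(F.sum("amount").alias("total_spend"))  # lazy

        # Inspect the plan Catalyst produced, without executing anything.
        by_user.explain()

        # An action triggers execution of the whole optimised plan.
        print(by_user.count())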

    Key DataFrame operations for ML feature engineering (combined in the sketch after this list):

    • df.filter(F.col("date") > start_date) — Distributed row filtering. On Delta Lake tables, predicate pushdown and partition pruning let Spark skip files that cannot match the filter.
    • df.groupBy("user_id").agg(F.sum("amount").alias("total_spend"), F.count("*").alias("n_events")) — Distributed aggregation. Triggers a shuffle to group data by user_id.
    • df.join(other_df, on="user_id", how="left") — Distributed join. Use broadcast(other_df) if other_df is small (<~200MB): eliminates the shuffle by broadcasting the small table to all executors.
    • Window.partitionBy("user_id").orderBy("event_timestamp") — Spark Window functions equivalent to SQL window functions. F.lag("amount", 1).over(window_spec) for lag features.
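
    A combined sketch of these operations (df, users_df, start_date and the column names are hypothetical placeholders):

        from pyspark.sql import Window, functions as F
        from pyspark.sql.functions import broadcast

        # Filter early so later joins and aggregations shuffle less data.
        recent = df.filter(F.col("date") > start_date)

        # Broadcast join: users_df is assumed small enough to ship to every executor.
        joined = recent.join(broadcast(users_df), on="user_id", how="left")

        # Distributed aggregation (shuffles rows so each user_id lands on one executor).
        totals = joined.groupBy("user_id").agg(
            F.sum("amount").alias("total_spend"),
            F.count("*").alias("n_events"),
        )

        # Lag feature over a per-user window ordered by event time.
        w = Window.partitionBy("user_id").orderBy("event_timestamp")
        with_lag = joined.withColumn("prev_amount", F.lag("amount", 1).over(w))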

    Spark MLlib Pipeline API

    Spark MLlib's Pipeline API follows the same Estimator/Transformer/Pipeline pattern as scikit-learn, but operates on distributed DataFrames. The building blocks below are combined in the sketch that follows the list:

    • StringIndexer — Encodes a string column into numeric indices (sorted by frequency). StringIndexer(inputCol="category", outputCol="category_idx"). Required before OneHotEncoder and tree-based models that need numeric inputs.
    • OneHotEncoder — Encodes indexed categories as sparse binary vectors. Applied after StringIndexer.
    • VectorAssembler — Combines multiple numeric columns into a single feature vector column. Required by all MLlib models. VectorAssembler(inputCols=["feature1", "feature2", ...], outputCol="features").
    • StandardScaler — Scales features to unit standard deviation (and optionally centres them to zero mean with withMean=True; centring is off by default because it densifies sparse vectors). Applied after VectorAssembler.
    • Models — LogisticRegression, RandomForestClassifier, GBTClassifier (gradient boosted trees), LinearRegression, KMeans. All support fit(training_df) and transform(test_df). GBTClassifier is typically the highest-performing classical ML model in MLlib for tabular data.
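
    A minimal pipeline sketch, assuming a training DataFrame train_df with a string column category, numeric columns f1 and f2, and a binary label column (all hypothetical):

        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
        from pyspark.ml.classification import LogisticRegression

        indexer = StringIndexer(inputCol="category", outputCol="category_idx")
        encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
        assembler = VectorAssembler(inputCols=["category_vec", "f1", "f2"], outputCol="raw_features")
        scaler = StandardScaler(inputCol="raw_features", outputCol="features")
        lr = LogisticRegression(featuresCol="features", labelCol="label")

        pipeline = Pipeline(stages=[indexer, encoder, assembler, scaler, lr])
        model = pipeline.fit(train_df)           # fits each Estimator in order
        predictions = model.transform(test_df)   # the fitted PipelineModel is a Transformer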

    For deep learning at Spark scale, the TorchDistributor API (pyspark.ml.torch.distributor, shipped in Databricks Runtime ML and upstream Spark 3.4+) runs distributed PyTorch DDP or FSDP training on a Spark cluster, with each launched process running as one DDP rank on the workers.
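
    A hedged sketch of what that looks like, assuming Spark 3.4+ or Databricks Runtime ML and a user-written PyTorch training function (train_fn and its argument are placeholders):

        from pyspark.ml.torch.distributor import TorchDistributor

        def train_fn(learning_rate):
            # A standard PyTorch DDP training loop goes here. The distributor sets the
            # torch.distributed environment (MASTER_ADDR, RANK, WORLD_SIZE, ...) for
            # each process it launches on the Spark workers.
            ...

        distributor = TorchDistributor(num_processes=4, local_mode=False, use_gpu=True)
        distributor.run(train_fn, 1e-3)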

    Delta Lake for ML Pipelines

    Delta Lake solves four problems that arise in production ML pipelines built on object storage (time travel and MERGE are sketched after the list):

    • Atomicity — Writes to a Delta table are atomic. Either all files from a write succeed, or none do. No partial writes visible to concurrent readers. Critical for feature store tables updated during training.
    • Time travel — spark.read.format("delta").option("versionAsOf", 42).load(path) or timestampAsOf. Read a table as it existed at any point in history. Essential for reproducing training datasets and debugging model degradation: "what did the feature table look like when we trained the model that started failing?"
    • MERGE (upsert) — Incrementally update a feature table: DeltaTable.forPath(spark, path).alias("target").merge(source_df.alias("source"), "target.user_id = source.user_id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute(). Atomically updates matching rows and inserts new ones.
    • Schema enforcement and evolution — Delta rejects writes with schema mismatches by default. option("mergeSchema", "true") allows adding new columns. Prevents accidental schema corruption from upstream data pipeline changes.
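
    A short sketch of time travel and MERGE together (spark, feature_table_path and new_features_df are hypothetical):

        from delta.tables import DeltaTable

        # Time travel: read the feature table exactly as it existed at version 42.
        features_v42 = (
            spark.read.format("delta")
            .option("versionAsOf", 42)
            .load(feature_table_path)
        )

        # MERGE (upsert): atomically update existing users and insert new ones.
        target = DeltaTable.forPath(spark, feature_table_path)
        (
            target.alias("target")
            .merge(new_features_df.alias("source"), "target.user_id = source.user_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )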

    Frequently Asked Questions

    When should you use Spark instead of Pandas?

    Pandas: dataset fits in RAM (<10–20GB), simpler API, faster for prototyping. Spark: dataset larger than a single machine's memory (100GB+), data already in distributed storage (Delta Lake, Parquet on S3/HDFS), or need distributed table joins. In practice, use Spark for upstream data preparation and Pandas for the final aggregated feature dataset used in training.
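
    A sketch of that split, assuming a large Delta table of events and a single-machine training library such as scikit-learn (the path and column names are placeholders):

        from pyspark.sql import functions as F

        # Heavy lifting in Spark: scan the large table and aggregate per user.
        features = (
            spark.read.format("delta").load("/data/events")
            .groupBy("user_id")
            .agg(F.sum("amount").alias("total_spend"), F.count("*").alias("n_events"))
        )

        # The aggregated result is small enough for one machine: hand off to pandas
        # and train with scikit-learn / XGBoost as usual.
        features_pdf = features.toPandas()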

    What is the difference between narrow and wide transformations?

    Narrow: each partition processed independently, no data movement (map, filter, union). Wide: data combined across partitions, triggers a shuffle — disk I/O, network transfer, serialisation (groupBy, join, repartition). Shuffles are the most expensive Spark operations. Minimise by filtering before joins, using broadcast joins for small tables, and repartitioning only when necessary.
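
    A sketch contrasting the two (df and lookup_df are hypothetical); explain() shows whether the physical plan contains an Exchange node, i.e. a shuffle:

        from pyspark.sql import functions as F
        from pyspark.sql.functions import broadcast

        # Narrow: each partition is filtered independently, no data movement.
        filtered = df.filter(F.col("amount") > 0)

        # Wide: grouping combines rows across partitions, so it triggers a shuffle
        # (an Exchange node in the physical plan).
        grouped = filtered.groupBy("user_id").agg(F.sum("amount").alias("total"))
        grouped.explain()

        # Broadcast join: ships the small lookup table to every executor and avoids
        # shuffling the large side.
        joined = filtered.join(broadcast(lookup_df), on="user_id", how="left")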

    What is Delta Lake and why does it matter for ML pipelines?

    Delta Lake adds ACID transactions, versioning, and schema enforcement to Parquet on object storage. Key for ML: time travel (query data as of past timestamp for point-in-time correct features), MERGE for incremental feature table updates, Change Data Feed for efficient incremental computation. The foundation of the Databricks Lakehouse architecture.

    What is the Spark MLlib Pipeline API?

    Spark MLlib's Pipeline API mirrors scikit-learn: Transformers (transform() method, e.g., StringIndexer, VectorAssembler, StandardScaler), Estimators (fit() returns a Transformer, e.g., LogisticRegression, RandomForestClassifier), and Pipeline chains them. CrossValidator and TrainValidationSplit handle hyperparameter tuning and model selection. Best for large-scale classical ML; for deep learning, use PyTorch/TF on Spark via TorchDistributor.

    What is Databricks and how does it relate to Spark?

    Databricks is the managed platform created by Apache Spark's original authors. It provides managed Spark clusters, an optimised runtime (the Photon engine, typically 2–4× faster than open-source Spark on SQL and DataFrame workloads), Delta Lake integration, MLflow (also created by Databricks), Unity Catalog for data governance, and collaborative notebooks. It is the dominant Spark platform in UK enterprise. Databricks SQL Warehouse provides serverless SQL on Delta Lake without managing Spark clusters.

    Browse Data Engineering and ML Jobs

    Find ML and data engineering roles using Spark at UK companies.

    Quick Facts

    Demand level
    High
    Difficulty
    Advanced
    Time to proficiency
    3–6 months

    Key Technologies

    PySpark
    Spark MLlib
    Delta Lake
    Databricks
    Structured Streaming
    Photon
    Unity Catalog