Spark Architecture Essentials
Spark's execution model: a Driver process coordinates the computation, breaks it into Tasks, and distributes tasks to Executors running on worker nodes. Data is stored in partitions — subsets of the dataset distributed across executors. The DataFrame API (Spark's primary API since Spark 2.0) provides a high-level abstraction over the distributed RDD (Resilient Distributed Dataset) layer.
Lazy evaluation: Transformations on DataFrames are lazy — they build a logical plan but do not execute. Execution is triggered by an action (count(), show(), write(), toPandas()). Spark's Catalyst optimiser analyses the full logical plan before execution, reordering filters and joins for efficiency. This is why you should chain transformations freely — Catalyst will optimise the execution plan.
Key DataFrame operations for ML feature engineering:
- `df.filter(F.col("date") > start_date)` — Partition pruning on Delta Lake tables uses predicate pushdown to read only matching files.
- `df.groupBy("user_id").agg(F.sum("amount").alias("total_spend"), F.count("*").alias("n_events"))` — Distributed aggregation. Triggers a shuffle to group data by user_id.
- `df.join(other_df, on="user_id", how="left")` — Distributed join. Use `broadcast(other_df)` if other_df is small (<~200MB): it eliminates the shuffle by broadcasting the small table to all executors.
- `Window.partitionBy("user_id").orderBy("event_timestamp")` — Spark window functions, equivalent to SQL window functions. Use `F.lag("amount", 1).over(window_spec)` for lag features.
Spark MLlib Pipeline API
Spark MLlib's Pipeline API follows the same Estimator/Transformer/Pipeline pattern as scikit-learn, but operates on distributed DataFrames:
- StringIndexer — Encodes a string column into numeric indices (sorted by frequency). `StringIndexer(inputCol="category", outputCol="category_idx")`. Required before OneHotEncoder and before tree-based models, which need numeric inputs.
- OneHotEncoder — Encodes indexed categories as sparse binary vectors. Applied after StringIndexer.
- VectorAssembler — Combines multiple numeric columns into a single feature vector column. Required by all MLlib models. `VectorAssembler(inputCols=["feature1", "feature2", ...], outputCol="features")`.
- StandardScaler — Normalises features to zero mean and unit standard deviation. Applied after VectorAssembler.
- Models — LogisticRegression, RandomForestClassifier, GBTClassifier (gradient-boosted trees), LinearRegression, KMeans. All support `fit(training_df)` and `transform(test_df)`. GBTClassifier is typically the highest-performing classical ML model in MLlib for tabular data.
For deep learning at Spark scale, spark-pytorch-distributor (part of Databricks Runtime) runs distributed PyTorch DDP or FSDP training on a Spark cluster, with each Spark worker acting as a DDP process.
Delta Lake for ML Pipelines
Delta Lake solves four problems that arise in production ML pipelines built on object storage:
- Atomicity — Writes to a Delta table are atomic. Either all files from a write succeed, or none do. No partial writes visible to concurrent readers. Critical for feature store tables updated during training.
- Time travel — `spark.read.format("delta").option("versionAsOf", 42).load(path)` or `timestampAsOf`. Read a table as it existed at any point in history. Essential for reproducing training datasets and debugging model degradation: "what did the feature table look like when we trained the model that started failing?"
- MERGE (upsert) — Incrementally update a feature table: `DeltaTable.forPath(spark, path).alias("target").merge(source_df.alias("source"), "target.user_id = source.user_id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()`. Atomically updates matching rows and inserts new ones.
- Schema enforcement and evolution — Delta rejects writes with schema mismatches by default; `option("mergeSchema", "true")` allows adding new columns. Prevents accidental schema corruption from upstream data pipeline changes.
Frequently Asked Questions
When should you use Spark instead of Pandas?
Pandas: dataset fits in RAM (<10–20GB), simpler API, faster for prototyping. Spark: dataset larger than a single machine's memory (100GB+), data already in distributed storage (Delta Lake, Parquet on S3/HDFS), or need distributed table joins. In practice, Spark for upstream data preparation; Pandas for the final aggregated feature dataset used in training.
What is the difference between narrow and wide transformations?
Narrow: each partition processed independently, no data movement (map, filter, union). Wide: data combined across partitions, triggers a shuffle — disk I/O, network transfer, serialisation (groupBy, join, repartition). Shuffles are the most expensive Spark operations. Minimise by filtering before joins, using broadcast joins for small tables, and repartitioning only when necessary.
What is Delta Lake and why does it matter for ML pipelines?
Delta Lake adds ACID transactions, versioning, and schema enforcement to Parquet on object storage. Key for ML: time travel (query data as of past timestamp for point-in-time correct features), MERGE for incremental feature table updates, Change Data Feed for efficient incremental computation. The foundation of the Databricks Lakehouse architecture.
What is the Spark MLlib Pipeline API?
Spark MLlib's Pipeline API mirrors scikit-learn: Transformers (transform() method, e.g., StringIndexer, VectorAssembler, StandardScaler), Estimators (fit() returns a Transformer, e.g., LogisticRegression, RandomForestClassifier), and Pipeline chains them. CrossValidator and TrainValidationSplit for evaluation. Best for large-scale classical ML; for deep learning, use PyTorch/TF on Spark with spark-pytorch-distributor.
What is Databricks and how does it relate to Spark?
Databricks is the managed platform created by Apache Spark's original authors. It provides managed Spark clusters, an optimised runtime (the Photon engine, typically 2–4× faster than open-source Spark), Delta Lake integration, MLflow (also created by Databricks), Unity Catalog for data governance, and notebooks. It is the dominant Spark platform in UK enterprise. Databricks SQL Warehouse provides serverless SQL on Delta Lake without managing Spark clusters.