    SQL for Data Science and ML
    The 2026 Skills Guide

    SQL remains one of the most tested skills in data science and ML engineering interviews — and one of the most directly useful in day-to-day work. This guide goes beyond basic queries to cover the SQL patterns that UK AI teams actually use: window functions, feature engineering, point-in-time correctness, and analytical SQL at scale.

    Window Functions for Feature Engineering

    Window functions are the single most impactful SQL skill for ML feature engineering. They allow you to compute aggregates and rankings over a sliding window of rows related to the current row, without collapsing rows into a single output as GROUP BY does.

    Syntax: function() OVER (PARTITION BY partition_cols ORDER BY order_col frame_clause)

    Aggregate window functions for temporal features:

    • SUM(amount) OVER (PARTITION BY user_id ORDER BY event_timestamp ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) — Rolling spend over the last 30 rows per user, computed for every row without a self-join. Note that ROWS counts rows, not days: this is a true 30-day window only if there is exactly one row per user per day; otherwise use a RANGE frame with an interval.
    • COUNT(*) OVER (PARTITION BY user_id ORDER BY event_timestamp ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) — Event count over the past 7 events (or days with RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW).
    • AVG(session_duration) OVER (PARTITION BY user_id ORDER BY session_date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) — 5-session rolling average session duration.
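    The rolling-aggregate pattern above can be run end to end. This is a minimal sketch using Python's built-in sqlite3 module (SQLite ≥ 3.25 supports window functions); the events table and its values are illustrative, and the frame is shrunk to ROWS BETWEEN 1 PRECEDING AND CURRENT ROW so the output is easy to verify — the 30-row version only changes the frame size.

    ```python
    import sqlite3

    # In-memory demo table; schema and values are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_timestamp TEXT, amount REAL);
    INSERT INTO events VALUES
      ('u1', '2026-01-01', 10.0),
      ('u1', '2026-01-02', 20.0),
      ('u1', '2026-01-03', 30.0),
      ('u2', '2026-01-01', 5.0);
    """)

    # Rolling sum over the current row and the one before it,
    # computed per user without a self-join.
    rows = conn.execute("""
    SELECT user_id,
           event_timestamp,
           SUM(amount) OVER (
               PARTITION BY user_id
               ORDER BY event_timestamp
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS rolling_spend
    FROM events
    ORDER BY user_id, event_timestamp
    """).fetchall()

    for r in rows:
        print(r)
    # ('u1', '2026-01-01', 10.0) — frame contains only the first row
    # ('u1', '2026-01-02', 30.0) — 10 + 20
    # ('u1', '2026-01-03', 50.0) — 20 + 30
    # ('u2', '2026-01-01', 5.0)  — partitions keep users separate
    ```

    Note that each user's frame starts fresh: PARTITION BY prevents one user's history from leaking into another's feature values.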

    Lag and lead functions:

    • LAG(purchase_amount, 1) OVER (PARTITION BY user_id ORDER BY purchase_date) — Previous purchase amount. The most common way to create autoregressive features.
    • LEAD(event_type, 1) OVER (...) — Next event type. Used in conversion funnel analysis and sequence prediction.
    • LAG(event_timestamp, 1) OVER (...) combined with timestamp arithmetic to compute time since last event — a highly predictive feature for churn models.
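    The time-since-last-event feature from the last bullet can be sketched with sqlite3, using julianday() for the timestamp arithmetic (in PostgreSQL or BigQuery you would subtract timestamps directly). The table is an illustrative placeholder.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_timestamp TEXT);
    INSERT INTO events VALUES
      ('u1', '2026-01-01'),
      ('u1', '2026-01-04'),
      ('u1', '2026-01-09');
    """)

    # julianday() converts each timestamp to a day number, so the
    # difference with the LAG'd value is days since the previous event.
    rows = conn.execute("""
    SELECT user_id,
           event_timestamp,
           julianday(event_timestamp)
             - julianday(LAG(event_timestamp, 1) OVER (
                   PARTITION BY user_id ORDER BY event_timestamp
               )) AS days_since_last_event
    FROM events
    ORDER BY event_timestamp
    """).fetchall()

    for r in rows:
        print(r)
    # ('u1', '2026-01-01', None) — no previous event, LAG is NULL
    # ('u1', '2026-01-04', 3.0)
    # ('u1', '2026-01-09', 5.0)
    ```

    The NULL on each user's first event is expected; downstream feature pipelines typically impute it or add a first_event indicator column.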

    Ranking functions:

    • RANK() / DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) — Rank products within their category by revenue. Useful for relative position features.
    • NTILE(10) OVER (ORDER BY feature_value) — Assign each row to a decile based on a continuous feature. Used for discretising features and creating percentile bins.
    • ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp) — Sequential event number per user. Essential for session reconstruction and sequence feature extraction.
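    The difference between the three ranking functions is easiest to see on a tie. A small sqlite3 sketch (illustrative schema) ranks products within their category by revenue:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE products (category TEXT, product TEXT, revenue REAL);
    INSERT INTO products VALUES
      ('toys', 'a', 100), ('toys', 'b', 100), ('toys', 'c', 50),
      ('books', 'd', 80);
    """)

    # RANK leaves a gap after a tie (1, 1, 3); DENSE_RANK does not
    # (1, 1, 2); ROW_NUMBER breaks the tie arbitrarily (1, 2, 3).
    rows = conn.execute("""
    SELECT category, product,
           RANK()       OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
    FROM products
    ORDER BY category, rn
    """).fetchall()

    toys = [r for r in rows if r[0] == 'toys']
    print([r[2] for r in toys])  # RANK:       [1, 1, 3]
    print([r[3] for r in toys])  # DENSE_RANK: [1, 1, 2]
    print([r[4] for r in toys])  # ROW_NUMBER: [1, 2, 3]
    ```

    For ML features, the tie behaviour matters: ROW_NUMBER is deterministic only if the ORDER BY is a total order, so add a tiebreaker column when reproducibility matters.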

    CTEs and Feature Pipeline Architecture

    Complex ML feature pipelines in SQL benefit from structuring as a chain of CTEs, where each CTE represents one logical transformation step. This improves readability, makes it easy to debug intermediate results, and allows reuse of intermediate datasets within a single query.

    A typical training feature table query structure:

    1. Labels CTE — Define the training examples with their label timestamps (the point in time at which the label is measured).
    2. Events CTE — Join the raw event stream to the labels, filtering to events BEFORE each label timestamp.
    3. User aggregates CTE — Compute per-user aggregate features (30-day spend, event counts, etc.) over the filtered events.
    4. Item features CTE — Join item/product metadata.
    5. Final SELECT — Join all feature CTEs on the training example identifier, selecting the final feature columns.
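    The five steps above can be sketched as a single CTE chain. This is a minimal, self-contained demo via sqlite3; the labels/events tables, column names, and the churn framing are illustrative placeholders, and the item-features step is omitted for brevity.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE labels (user_id TEXT, label_ts TEXT, churned INTEGER);
    CREATE TABLE events (user_id TEXT, event_ts TEXT, amount REAL);
    INSERT INTO labels VALUES ('u1', '2026-01-10', 0), ('u2', '2026-01-10', 1);
    INSERT INTO events VALUES
      ('u1', '2026-01-01', 10), ('u1', '2026-01-05', 20),
      ('u1', '2026-01-12', 99),   -- after the label timestamp: must be excluded
      ('u2', '2026-01-02', 5);
    """)

    rows = conn.execute("""
    WITH
    -- 1. Training examples with their label timestamps.
    label_rows AS (
        SELECT user_id, label_ts, churned FROM labels
    ),
    -- 2. Events strictly BEFORE each example's label timestamp.
    past_events AS (
        SELECT l.user_id, l.label_ts, e.amount
        FROM label_rows l
        JOIN events e
          ON e.user_id = l.user_id AND e.event_ts < l.label_ts
    ),
    -- 3. Per-example aggregate features over the filtered events.
    user_aggs AS (
        SELECT user_id, label_ts,
               SUM(amount) AS total_spend,
               COUNT(*)    AS event_count
        FROM past_events
        GROUP BY user_id, label_ts
    )
    -- Final SELECT: one row per training example.
    SELECT l.user_id, l.churned,
           COALESCE(a.total_spend, 0) AS total_spend,
           COALESCE(a.event_count, 0) AS event_count
    FROM label_rows l
    LEFT JOIN user_aggs a
      ON a.user_id = l.user_id AND a.label_ts = l.label_ts
    ORDER BY l.user_id
    """).fetchall()

    print(rows)
    # [('u1', 0, 30.0, 2), ('u2', 1, 5.0, 1)]
    # u1's 2026-01-12 event is excluded — no leakage past the label.
    ```

    Debugging such a pipeline is straightforward: temporarily change the final SELECT to read from an intermediate CTE (e.g. SELECT * FROM past_events) to inspect that step's output.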

    Analytical SQL Platforms and dbt

    UK ML teams typically run feature engineering SQL on cloud data warehouses rather than transactional databases. The major platforms have SQL dialect differences worth knowing:

    • BigQuery (GCP) — Columnar, serverless, billed by bytes scanned. Use PARTITION BY clauses on tables and WHERE filters that exploit partitions to reduce scan costs. BigQuery ML allows training basic ML models in SQL via CREATE MODEL syntax. Supports nested and repeated fields (STRUCT, ARRAY) for denormalised schema patterns.
    • Snowflake — Popular in UK fintech and retail. Strong support for semi-structured data (VARIANT type for JSON). Time travel (query data as of a past timestamp) is useful for point-in-time feature computation. Snowpark allows Python execution within Snowflake's compute environment.
    • Redshift — AWS's columnar data warehouse. Distribution keys and sort keys significantly affect query performance — understanding EXPLAIN output and choosing appropriate distribution strategies matters at scale.
    • dbt (data build tool) — Transforms SQL queries into a modular, version-controlled, tested pipeline. Write SELECT statements; dbt handles the CREATE TABLE AS SELECT materialisation. ref() macro creates dependencies between models. Schema tests (not_null, unique, accepted_values, relationships) run automatically. dbt is widely used in UK data teams and increasingly appears in ML engineer job requirements for feature pipeline ownership.

    Frequently Asked Questions

    Why do ML engineers need strong SQL skills?

    Training data lives in databases/warehouses; features are computed from event tables; model evaluation requires joining predictions with ground truth. SQL enables: exploring training data directly, building scalable feature engineering pipelines, evaluating model predictions in the database, and debugging data quality at the source. SQL is listed as required in the majority of UK ML engineer job descriptions.

    What are window functions and how are they used in feature engineering?

    Window functions compute aggregates over a set of rows related to the current row without collapsing them like GROUP BY. LAG(value) OVER (PARTITION BY user_id ORDER BY event_time): previous event value. SUM(revenue) OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW): rolling sum over the last 7 rows (a 7-day window only with one row per day; use a RANGE frame for true time-based windows). RANK() for ranking features within groups. High-signal skill in ML engineering interviews.

    What is a CTE and why use it?

    CTEs (WITH clause) define named temporary result sets that can be referenced multiple times within a query. Break complex feature pipelines into readable named steps instead of nested subqueries. Each CTE represents one transformation: user aggregates → join metadata → apply filters → final feature table. Recursive CTEs handle graph/hierarchical data.
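    The recursive-CTE case can be sketched with sqlite3 (SQLite supports WITH RECURSIVE); the category hierarchy here is an illustrative placeholder.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE categories (id INTEGER, parent_id INTEGER, name TEXT);
    INSERT INTO categories VALUES
      (1, NULL, 'root'),
      (2, 1, 'electronics'),
      (3, 2, 'phones');
    """)

    # Walk the category tree from the root, tracking depth — the
    # anchor term seeds the recursion, the recursive term extends it.
    rows = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM categories WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, c.name, t.depth + 1
        FROM categories c JOIN tree t ON c.parent_id = t.id
    )
    SELECT name, depth FROM tree ORDER BY depth
    """).fetchall()

    print(rows)
    # [('root', 0), ('electronics', 1), ('phones', 2)]
    ```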

    How do you avoid data leakage in SQL feature engineering?

    Filter events to those BEFORE the label timestamp for each training row. Point-in-time correct joins: join to feature values where feature_date <= label_date. Window functions: use ROWS BETWEEN N PRECEDING AND 1 PRECEDING (not CURRENT ROW) to exclude the current event from past-looking aggregate features. A common interview topic and frequent production bug source.
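    The leakage-safe frame described above (ending at 1 PRECEDING rather than CURRENT ROW) can be verified with a small sqlite3 demo; the table is illustrative.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_ts TEXT, amount REAL);
    INSERT INTO events VALUES
      ('u1', '2026-01-01', 10),
      ('u1', '2026-01-02', 20),
      ('u1', '2026-01-03', 30);
    """)

    rows = conn.execute("""
    SELECT event_ts, amount,
           -- Frame ends at 1 PRECEDING: the current event is excluded,
           -- so the feature only "sees" the past.
           SUM(amount) OVER (
               PARTITION BY user_id ORDER BY event_ts
               ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
           ) AS past_spend
    FROM events
    ORDER BY event_ts
    """).fetchall()

    for r in rows:
        print(r)
    # ('2026-01-01', 10.0, None) — empty frame: no past events yet
    # ('2026-01-02', 20.0, 10.0) — only the first event is visible
    # ('2026-01-03', 30.0, 30.0) — 10 + 20; the current 30 is excluded
    ```

    Had the frame ended at CURRENT ROW, each row's own amount would leak into its feature value — exactly the bug described above.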

    What SQL skills distinguish senior data scientists and ML engineers?

    Query optimisation (reading EXPLAIN/EXPLAIN ANALYZE output, choosing appropriate join types, using covering indexes), efficiently partitioning large analytical tables, writing incremental queries for large-scale feature computation, dbt for modularised and tested SQL pipelines, understanding columnar storage (BigQuery/Snowflake/Redshift) vs row-based (PostgreSQL/MySQL) for analytical vs OLTP workloads.

    Browse Data Science and ML Engineering Jobs

    Find data science and ML engineering roles at UK AI companies.

    Quick Facts

    Demand level
    Essential
    Difficulty
    Foundational
    Time to proficiency
    1–3 months

    Key Technologies

    PostgreSQL
    BigQuery
    Snowflake
    Redshift
    dbt
    Window Functions
    CTEs