    Docker for Machine Learning
    The 2026 Skills Guide

    Docker is the baseline containerisation skill for ML engineering. This guide covers writing efficient ML Dockerfiles, GPU containers with CUDA, multi-stage builds for production images, and deploying models with FastAPI.

    Core Docker Concepts for ML Engineers

    ML engineers need to understand Docker at the level of writing and optimising Dockerfiles, not just running pre-built images. Three core concepts:

    • Images and layers — A Docker image is a stack of read-only layers, each layer being the result of a Dockerfile instruction (RUN, COPY, ADD). Layers are cached: if a layer and everything before it is unchanged, Docker reuses the cached layer rather than re-executing the instruction. Layer caching is the primary tool for making Docker builds fast.
    • The Union File System — Docker uses a union file system (OverlayFS on Linux) to present the stack of layers as a single coherent filesystem. When a container runs, a writable layer is added on top; writes go to this layer and do not affect the image. This is why containers are ephemeral — the writable layer is discarded when the container stops, unless you use volumes.
    • Build context — docker build . -t my-image sends the entire current directory to the Docker daemon as the build context. Large build contexts (model weights, datasets, node_modules) significantly slow builds. Use .dockerignore to exclude files and directories from the context.
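
    As a sketch, a .dockerignore for a typical ML repo might look like the following (the entries are illustrative — adjust to whatever large artefacts your project actually contains):

    ```text
    # .dockerignore — keep large artefacts out of the build context
    .git
    __pycache__/
    *.pyc
    .venv/
    data/
    checkpoints/
    *.ckpt
    *.safetensors
    node_modules/
    ```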

    Writing an Efficient ML Dockerfile

    Layer caching strategy — Order Dockerfile instructions from least to most frequently changing:

    1. Base image (FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime)
    2. System packages (RUN apt-get update && apt-get install -y libgl1 && rm -rf /var/lib/apt/lists/*) — clean apt cache in the same RUN command to avoid bloating the layer
    3. Python requirements (COPY requirements.txt . then RUN pip install --no-cache-dir -r requirements.txt) — separate from source code so it's only re-run when requirements change
    4. Source code (COPY . .) — changes on every code edit; must be last
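
    Putting the four steps together, a minimal Dockerfile sketch (package names and the CMD are illustrative, not prescriptive):

    ```dockerfile
    # 1. Base image — pinned tag, never :latest
    FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

    # 2. System packages — clean the apt cache in the same RUN to keep the layer small
    RUN apt-get update && apt-get install -y --no-install-recommends libgl1 \
        && rm -rf /var/lib/apt/lists/*

    WORKDIR /app

    # 3. Python requirements — cached until requirements.txt itself changes
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # 4. Source code last — invalidated on every code edit
    COPY . .

    # Non-root user and ML-friendly env vars (see below)
    RUN useradd -m appuser && chown -R appuser /app
    USER appuser
    ENV PYTHONUNBUFFERED=1 PYTHONDONTWRITEBYTECODE=1

    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
    ```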

    Non-root user — Run the container as a non-root user: RUN useradd -m appuser && chown -R appuser /app, USER appuser. Running as root inside a container is a security risk; most production environments require non-root containers.

    Environment variables — Set ML-specific env vars in the Dockerfile: ENV PYTHONUNBUFFERED=1 (stream Python output immediately, not buffered — important for logging), ENV PYTHONDONTWRITEBYTECODE=1 (don't write .pyc files — saves disk space in containers), ENV TRANSFORMERS_CACHE=/model-cache (control HuggingFace model cache location for volume mounts).

    GPU Containers: CUDA and NVIDIA Docker

    Running PyTorch training or inference with GPU acceleration in Docker requires the NVIDIA Container Toolkit. Key concepts:

    • CUDA compatibility — The CUDA toolkit version inside the container must be ≤ the maximum CUDA version supported by the host's installed NVIDIA driver. Run nvidia-smi on the host to see the maximum CUDA version. The container CUDA version does not need to match exactly — CUDA is backward compatible within a major version.
    • nvidia-smi in container — After docker run --gpus all, nvidia-smi should be available and show the GPUs. If it fails, the NVIDIA Container Toolkit is not properly configured on the host.
    • cuDNN — The CUDA Deep Neural Network library. Use -cudnn8- or -cudnn9- tagged images for training (cuDNN provides optimised convolution kernels). The -runtime tag includes cuDNN; -base does not.
    • Docker Compose GPU config — Use deploy.resources.reservations.devices with driver: nvidia, count: all (or a specific number), capabilities: [gpu] in the service definition.
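
    The Compose reservation described above looks like this as a docker-compose.yml fragment (the service and image names are placeholders):

    ```yaml
    services:
      inference:
        image: my-ml-image:1.0     # placeholder image tag
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all        # or an integer, e.g. 1
                  capabilities: [gpu]
    ```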

    Model Serving Pattern: FastAPI + Docker

    The standard pattern for serving an ML model in production is a FastAPI application containerised with Docker. Key considerations:

    • Model loading at startup — Load the model once at startup using a FastAPI lifespan context manager (the older @app.on_event("startup") handler is deprecated in recent FastAPI versions). Store the model in the app's state (app.state.model). Never load the model inside a request handler — this re-loads on every request, making inference 10–100× slower.
    • Concurrency — FastAPI is async; model inference is typically synchronous and CPU/GPU-bound. Run inference in a thread pool: await asyncio.get_running_loop().run_in_executor(None, model.predict, data) to avoid blocking the event loop and degrading throughput for other requests.
    • Health checks — Implement /health and /ready endpoints. /health returns 200 immediately (liveness). /ready returns 200 only after the model is loaded and ready to serve (readiness). Kubernetes uses these to manage pod lifecycle.
    • Uvicorn/Gunicorn — Start FastAPI with uvicorn app:app --host 0.0.0.0 --port 8080 in the container. For CPU-bound applications, use gunicorn -w 4 -k uvicorn.workers.UvicornWorker to utilise multiple CPU cores. For GPU inference, typically one worker per GPU.
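
    The thread-pool offload in the concurrency bullet can be sketched without FastAPI, using only the standard library; predict here is a stand-in for a synchronous model call:

    ```python
    import asyncio
    import time


    def predict(x: float) -> float:
        """Stand-in for a synchronous, CPU-bound model call."""
        time.sleep(0.1)  # simulate inference latency
        return x * 2


    async def handle_request(x: float) -> float:
        # Offload the blocking call to a worker thread so the event loop
        # keeps serving other requests while inference runs.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, predict, x)


    async def main() -> list[float]:
        # Two concurrent "requests": total wall time is ~0.1 s, not ~0.2 s,
        # because the blocking calls run in parallel threads.
        return await asyncio.gather(handle_request(1.0), handle_request(2.0))


    if __name__ == "__main__":
        print(asyncio.run(main()))  # [2.0, 4.0]
    ```

    In a real FastAPI handler the same await ... run_in_executor(...) line appears inside the endpoint function, with model.predict in place of predict.
    
    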

    Frequently Asked Questions

    Why does ML specifically benefit from Docker?

    ML environments are notoriously hard to reproduce — Python version, library versions, CUDA version all matter. Docker packages model, dependencies, and runtime environment into a single image that runs identically on any host. Same container runs on a dev laptop (no GPU), CI runner, training cluster, and production server.

    What is the best base image for a PyTorch container?

    pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime is a common starting point. For production, use multi-stage: build from -devel, copy to -runtime for a smaller final image. Always pin specific tags — never use :latest. CUDA version in the image must be compatible with the host driver (backward compatible, not forward compatible).

    How do you pass a GPU to a Docker container?

    Install NVIDIA Container Toolkit on the host, then: docker run --gpus all my-ml-image. Specific GPUs: --gpus '"device=0,1"'. In Docker Compose: deploy.resources.reservations.devices with driver: nvidia. Container CUDA must be ≤ maximum CUDA version the host driver supports.

    What is a multi-stage build and why does it matter?

    Multiple FROM statements in one Dockerfile — stage 1 (builder) installs build tools and compiles; stage 2 (runtime) starts from a smaller base and copies only installed packages. Production images 2–5× smaller, excluding gcc/make/CUDA dev headers. Faster deployment, lower registry costs, smaller attack surface.
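
    A sketch of the two-stage pattern for a PyTorch image (the tags are from the example above; the wheel-build approach is one reasonable option, not the only one):

    ```dockerfile
    # Stage 1: build wheels with the full -devel toolchain
    FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel AS builder
    COPY requirements.txt .
    RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

    # Stage 2: install the pre-built wheels into the slimmer -runtime base
    FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
    COPY --from=builder /wheels /wheels
    RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
    WORKDIR /app
    COPY . .
    ```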

    How do you optimise Docker layer caching for ML projects?

    Copy requirements.txt and run pip install BEFORE copying your source code. pip install is slow and its result is cached as a layer — only invalidated if requirements.txt changes, not when you edit Python files. For large model weights, don't COPY them into the image — load from a mounted volume or object storage (S3/GCS) at container startup.

    Browse MLOps and ML Engineering Jobs

    Find roles building containerised ML systems at UK companies.

    Quick Facts

    Demand level
    Essential
    Difficulty
    Foundational
    Time to proficiency
    2–6 weeks

    Key Tools

    Docker
    Docker Compose
    NVIDIA Docker
    CUDA
    FastAPI
    Uvicorn
    ECR
    GCR