Core Docker Concepts for ML Engineers
ML engineers need to understand Docker at the level of writing and optimising Dockerfiles, not just running pre-built images. Three core concepts:
- Images and layers — A Docker image is a stack of read-only layers, each layer being the result of a Dockerfile instruction (RUN, COPY, ADD). Layers are cached: if a layer and everything before it is unchanged, Docker reuses the cached layer rather than re-executing the instruction. Layer caching is the primary tool for making Docker builds fast.
- The Union File System — Docker uses a union file system (OverlayFS on Linux) to present the stack of layers as a single coherent filesystem. When a container runs, a writable layer is added on top; writes go to this layer and do not affect the image. This is why containers are ephemeral — the writable layer is discarded when the container stops, unless you use volumes.
- Build context — `docker build . -t my-image` sends the entire current directory to the Docker daemon as the build context. Large build contexts (model weights, datasets, node_modules) significantly slow builds. Use `.dockerignore` to exclude files and directories from the context.
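A minimal `.dockerignore` for a typical ML repository might look like the following; the specific paths are illustrative, not prescriptive:

```text
# .dockerignore: keep large or irrelevant files out of the build context
.git
__pycache__/
*.pyc
.venv/
data/            # raw datasets stay outside the image
checkpoints/     # model weights are mounted or downloaded at runtime
notebooks/
```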
Writing an Efficient ML Dockerfile
Layer caching strategy — Order Dockerfile instructions from least to most frequently changing (a complete example Dockerfile follows below):
- Base image (`FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime`)
- System packages (`RUN apt-get update && apt-get install -y libgl1 && rm -rf /var/lib/apt/lists/*`) — clean the apt cache in the same RUN command to avoid bloating the layer
- Python requirements (`COPY requirements.txt .` then `RUN pip install --no-cache-dir -r requirements.txt`) — separate from source code so it's only re-run when requirements change
- Source code (`COPY . .`) — changes on every code edit; must be last
Non-root user — Run the container as a non-root user: `RUN useradd -m appuser && chown -R appuser /app`, then `USER appuser`. Running as root inside a container is a security risk; most production environments require non-root containers.
Environment variables — Set ML-specific env vars in the Dockerfile: `ENV PYTHONUNBUFFERED=1` (stream Python output immediately, not buffered — important for logging), `ENV PYTHONDONTWRITEBYTECODE=1` (don't write `.pyc` files — saves disk space in containers), `ENV TRANSFORMERS_CACHE=/model-cache` (control the HuggingFace model cache location for volume mounts).
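Tying the layer ordering, non-root user, and environment variables together, a sketch of such a Dockerfile might look like this (the `libgl1` package, the `/app` layout, and the `serve.py` entry point are illustrative placeholders):

```dockerfile
# 1. Base image: changes rarely
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

# 2. System packages: clean the apt cache in the same layer
RUN apt-get update && apt-get install -y --no-install-recommends libgl1 \
    && rm -rf /var/lib/apt/lists/*

# 3. Python requirements: cached until requirements.txt changes
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 4. Source code: changes most often, so it comes last
COPY . .

# Non-root user and ML-specific environment variables
RUN useradd -m appuser && chown -R appuser /app
USER appuser
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    TRANSFORMERS_CACHE=/model-cache

# Placeholder entry point: replace with your serving or training command
CMD ["python", "serve.py"]
```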
GPU Containers: CUDA and NVIDIA Docker
Running PyTorch training or inference with GPU acceleration in Docker requires the NVIDIA Container Toolkit. Key concepts:
- CUDA compatibility — The CUDA toolkit version inside the container must be ≤ the maximum CUDA version supported by the host's installed NVIDIA driver. Run `nvidia-smi` on the host to see the maximum supported CUDA version. The container CUDA version does not need to match exactly — CUDA is backward compatible within a major version.
- nvidia-smi in container — After `docker run --gpus all`, `nvidia-smi` should be available inside the container and show the GPUs. If it fails, the NVIDIA Container Toolkit is not properly configured on the host.
- cuDNN — The CUDA Deep Neural Network library. Use `-cudnn8-` or `-cudnn9-` tagged images for training (cuDNN provides optimised convolution kernels). The `-runtime` tag includes cuDNN; `-base` does not.
- Docker Compose GPU config — Use `deploy.resources.reservations.devices` with `driver: nvidia`, `count: all` (or a specific number), and `capabilities: [gpu]` in the service definition.
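For reference, a minimal Compose service along those lines might look like this (the service and image names are placeholders):

```yaml
services:
  inference:
    image: my-ml-image:1.0        # placeholder: use your own pinned tag
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or a specific number, e.g. 1
              capabilities: [gpu]
```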
Model Serving Pattern: FastAPI + Docker
The standard pattern for serving an ML model in production is a FastAPI application containerised with Docker. Key considerations:
- Model loading at startup — Load the model once at startup using a FastAPI `lifespan` context manager or the `@app.on_event("startup")` handler. Store the model in the app's state (`app.state.model`). Never load the model inside a request handler — this re-loads on every request, making inference 10–100× slower.
- Concurrency — FastAPI is async; model inference is typically synchronous and CPU/GPU-bound. Run inference in a thread pool: `await asyncio.get_event_loop().run_in_executor(None, model.predict, data)` to avoid blocking the event loop and degrading throughput for other requests.
- Health checks — Implement `/health` and `/ready` endpoints. `/health` returns 200 immediately (liveness). `/ready` returns 200 only after the model is loaded and ready to serve (readiness). Kubernetes uses these to manage pod lifecycle.
- Uvicorn/Gunicorn — Start FastAPI with `uvicorn app:app --host 0.0.0.0 --port 8080` in the container. For CPU-bound applications, use `gunicorn -w 4 -k uvicorn.workers.UvicornWorker` to utilise multiple CPU cores. For GPU inference, typically one worker per GPU.
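A minimal sketch of this pattern; the `load_model` import, the model path, and the plain-dict request body are placeholders for your own loading code and request schema:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

from my_project.model import load_model  # placeholder: your own loading code


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup and keep it in the app state
    app.state.model = load_model("/model-cache/model.pt")
    app.state.ready = True
    yield
    # optional cleanup on shutdown goes here


app = FastAPI(lifespan=lifespan)


@app.get("/health")
async def health():
    # Liveness: the process is up
    return {"status": "ok"}


@app.get("/ready")
async def ready(response: Response):
    # Readiness: only return 200 once the model has loaded
    if not getattr(app.state, "ready", False):
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}


@app.post("/predict")
async def predict(payload: dict):
    # Run synchronous inference in a thread pool so the event loop stays free
    loop = asyncio.get_event_loop()
    result = await loop.run_in_executor(None, app.state.model.predict, payload)
    return {"prediction": result}
```

In the container this would be launched with the `uvicorn` command above.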
Frequently Asked Questions
Why does ML specifically benefit from Docker?
ML environments are notoriously hard to reproduce — Python version, library versions, CUDA version all matter. Docker packages model, dependencies, and runtime environment into a single image that runs identically on any host. Same container runs on a dev laptop (no GPU), CI runner, training cluster, and production server.
What is the best base image for a PyTorch container?
`pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime` is a common starting point. For production, use a multi-stage build: build from `-devel`, copy to `-runtime` for a smaller final image. Always pin specific tags — never use `:latest`. The CUDA version in the image must be compatible with the host driver (backward compatible, not forward compatible).
How do you pass a GPU to a Docker container?
Install the NVIDIA Container Toolkit on the host, then: `docker run --gpus all my-ml-image`. For specific GPUs: `--gpus '"device=0,1"'`. In Docker Compose: `deploy.resources.reservations.devices` with `driver: nvidia`. The container's CUDA version must be ≤ the maximum CUDA version the host driver supports.
What is a multi-stage build and why does it matter?
Multiple FROM statements in one Dockerfile — stage 1 (builder) installs build tools and compiles; stage 2 (runtime) starts from a smaller base and copies only installed packages. Production images 2–5× smaller, excluding gcc/make/CUDA dev headers. Faster deployment, lower registry costs, smaller attack surface.
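A sketch of the structure, shown here with plain Python images and a copied virtual environment to keep it self-contained; the same idea applies when building from the `-devel` tag and copying into the `-runtime` CUDA tag:

```dockerfile
# Stage 1: builder image with the compiler toolchain
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime image with no compilers, only the installed environment
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
# Placeholder entry point
CMD ["python", "serve.py"]
```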
How do you optimise Docker layer caching for ML projects?
Copy requirements.txt and run pip install BEFORE copying your source code. pip install is slow and its result is cached as a layer — only invalidated if requirements.txt changes, not when you edit Python files. For large model weights, don't COPY them into the image — load from a mounted volume or object storage (S3/GCS) at container startup.
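For example, weights could be supplied at run time with a read-only bind mount, pointing the cache environment variable from earlier at it (all paths and names here are illustrative):

```bash
# Mount a host directory containing weights instead of baking them into the image
docker run --gpus all \
  -v /srv/models/my-model:/model-cache:ro \
  -e TRANSFORMERS_CACHE=/model-cache \
  -p 8080:8080 \
  my-ml-image:1.0
```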