Skills Guide

    Kubernetes for MLOps
    The 2026 Skills Guide

    Kubernetes is the infrastructure layer for large-scale ML in production. This guide covers the K8s concepts MLOps engineers actually use — GPU scheduling, Kubeflow for ML workflows, KServe for model serving, autoscaling, and resource management for training jobs.

    Core K8s Concepts for ML Engineers

    ML engineers working on MLOps need to understand Kubernetes at the operator level — enough to configure, debug, and optimise ML workloads, not necessarily to administer the cluster.

    • Pods — The smallest deployable unit. One or more containers sharing network namespace and storage. ML training Pods typically have one container (the training script), a shared volume for checkpoints, and resource requests for CPU, memory, and GPU.
    • Jobs and CronJobs — Jobs run Pods to completion; ideal for training. completions: 1, parallelism: 1 for a single-Pod job. backoffLimit: 3 for up to 3 retries on failure. CronJobs for scheduled retraining (nightly feature updates, weekly model refreshes).
    • Deployments and Services — Deployments manage always-on replicas (inference servers). Services expose Pods via a stable DNS name and optional load balancer. ClusterIP for internal-only services; LoadBalancer for external access; NodePort for development.
    • ConfigMaps and Secrets — ConfigMaps store non-sensitive configuration (model hyperparameters, environment config) as key-value pairs mounted as environment variables or files. Secrets store sensitive values (API keys, database passwords) base64-encoded — encoding, not encryption — so pair them with tighter RBAC and encryption at rest. Reference both in Pod specs — never hardcode configuration in Docker images.
    • PersistentVolumeClaims (PVCs) — Claim persistent storage for model checkpoints, training data, and logs. For ML training, use ReadWriteOnce (one Pod at a time) for training checkpoints and ReadOnlyMany for shared training data across multiple Pods.
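
    The concepts above come together in a single training Job manifest. A minimal sketch — the job name, image, and PVC name are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                # hypothetical job name
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 3                  # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/train:latest   # placeholder image
          command: ["python", "train.py"]
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc   # placeholder ReadWriteOnce PVC
```

    Note requests and limits are set equal for the training container — the Pod gets the Guaranteed QoS class and will not be evicted under node memory pressure mid-run.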

    GPU Scheduling and Resource Management

    GPU scheduling in Kubernetes works through the device plugin framework. The NVIDIA GPU Operator installs and manages the device plugin, which advertises GPU capacity as nvidia.com/gpu on each node. The scheduler uses this as a resource constraint when placing Pods.
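
    Once the device plugin is running, a Pod requests GPUs like any other resource. A minimal smoke-test sketch — the Pod name is illustrative, and the CUDA image tag may differ in your registry:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test             # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]      # prints the GPU the scheduler assigned
      resources:
        limits:
          nvidia.com/gpu: 1        # requests default to match limits
```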

    • Node labels and selectors — Label GPU nodes with their GPU type: kubectl label node gpu-node-01 gpu-type=a100-80gb. Use nodeSelector in Pod specs to target specific GPU types for training jobs that require high-memory GPUs.
    • Taints and tolerations — Taint GPU nodes to prevent non-GPU workloads from being scheduled on expensive GPU instances: kubectl taint nodes gpu-node-01 gpu=true:NoSchedule. GPU Pods must add the corresponding toleration. This prevents CPU workloads from occupying nodes where GPU Pods need to run.
    • Resource quotas — Per-namespace quotas prevent individual teams from monopolising GPU resources. ResourceQuota objects cap total GPU requests per namespace, enforcing fair sharing across teams.
    • Priority classes — PriorityClass objects assign numeric priorities to Pods. High-priority inference Pods preempt lower-priority training Jobs during resource contention — inference SLAs are maintained even when the cluster is running batch training workloads.
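
    Taken together, a training Pod targeting the labelled, tainted A100 nodes would carry scheduling fields like the following, alongside a namespace quota capping a team's GPU usage. All names here are illustrative:

```yaml
# Fragment of a training Pod spec combining the scheduling controls above
spec:
  nodeSelector:
    gpu-type: a100-80gb            # matches the node label set earlier
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule           # tolerate the GPU-node taint
  priorityClassName: training-low  # hypothetical low-priority class
---
# Per-namespace quota enforcing fair GPU sharing across teams
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-gpu-quota
  namespace: team-ml               # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```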

    Kubeflow and KServe

    Kubeflow Pipelines define ML workflows as directed acyclic graphs of containerised steps. Each step is a Python function compiled to a container image. The SDK's @dsl.component decorator converts Python functions to pipeline components; @dsl.pipeline composes them. Pipelines are versioned, scheduled, and observable in the Kubeflow UI. The generated pipeline YAML (IR format) can be stored in Git and deployed via CI/CD.

    PyTorch Operator defines a PyTorchJob custom resource. Specify the number of Master and Worker replicas; the Operator creates the corresponding Pods, injects the right MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables, and handles Pod failure and recovery. Your training script uses standard torch.distributed.init_process_group(backend="nccl").
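
    A PyTorchJob for one master and three workers might be sketched as follows — the job name and image are placeholders (the Operator requires the container to be named pytorch):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-ddp                 # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch        # required container name
              image: registry.example.com/train:latest   # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest   # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1
```

    The Operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each Pod, so the training script itself needs no cluster-specific configuration.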

    KServe (formerly KFServing) provides Kubernetes-native model serving via InferenceService custom resources. Supports: multiple serving runtimes (Triton, TorchServe, ONNX Runtime, HuggingFace TGI), canary deployments (route X% of traffic to a new model version), and serverless scaling to zero via Knative. Define an InferenceService with a storageUri pointing to model weights in S3/GCS, and KServe handles container lifecycle, load balancing, and autoscaling.
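
    An InferenceService combining a storage URI, scale-to-zero, and a canary rollout might look like this — the service name and bucket path are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-model                # hypothetical service name
spec:
  predictor:
    minReplicas: 0                 # scale to zero via Knative
    canaryTrafficPercent: 10       # route 10% of traffic to the new revision
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud/v2/   # placeholder bucket path
```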

    Frequently Asked Questions

    How do you run GPU workloads on Kubernetes?

    Install NVIDIA GPU Operator or device plugin. GPUs appear as schedulable resources: nvidia.com/gpu. Request in Pod spec via resources.limits. NVIDIA MIG (Multi-Instance GPU) for A100 enables hardware partitioning of a single GPU. For multi-node distributed training, use PyTorch Operator (Kubeflow) to manage distributed Pod lifecycles.

    What is Kubeflow and what does it provide?

    Kubeflow is a collection of Kubernetes-native ML tools: Pipelines (DAG workflow orchestration), PyTorch/TF Operator (distributed training job management), Katib (hyperparameter tuning), and KServe (model serving with autoscaling). The K8s-native alternative to managed platforms like SageMaker.

    What is the difference between a Deployment and a Job?

    Deployment: long-running Pods, always-on, rolling updates, replicas — for model serving and APIs. Job: Pods that run to completion, retried on failure up to backoffLimit — for training jobs and batch inference. CronJob creates Jobs on a schedule. Use Deployments for inference servers; Jobs for training.

    How does Horizontal Pod Autoscaling work for ML inference?

    HPA scales the number of inference Pod replicas based on metrics. Configure with: minReplicas, maxReplicas, and target metrics (CPU utilisation, memory, or custom metrics like requests-per-second via KEDA or Prometheus Adapter). For GPU inference, custom metrics (inference latency p95, queue depth) are more meaningful than CPU. Scale-to-zero requires KEDA (Kubernetes Event-Driven Autoscaling) since standard HPA doesn't scale below 1.
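
    An autoscaling/v2 HPA targeting a custom requests-per-second metric might be sketched as follows — the Deployment and metric names are illustrative and assume a Prometheus Adapter exposing the metric:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server             # placeholder inference Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second   # assumed custom metric via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "100"      # target 100 req/s per replica
```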

    What are resource requests and limits in Kubernetes and why do they matter for ML?

    Requests: minimum resources guaranteed to a Pod (used by the scheduler to find a suitable node). Limits: maximum resources a Pod can consume (exceeding the CPU limit throttles the container; exceeding the memory limit kills it with OOM). For ML training: set requests = limits for GPU Pods (GPUs are not shared by default). For inference: set memory requests conservatively and limits generously. For GPUs specifically, requests and limits must be equal — Kubernetes does not overcommit extended resources — so specifying the GPU count under resources.limits is sufficient; requests default to match.
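
    For a CPU-only inference container, that guidance might look like the following (values are illustrative):

```yaml
# Inference container: conservative requests, generous memory limit
resources:
  requests:
    cpu: "1"                       # guaranteed baseline for scheduling
    memory: 2Gi
  limits:
    cpu: "2"                       # throttled above this
    memory: 8Gi                    # OOM-killed above this
```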

    Browse MLOps Jobs in the UK

    Find MLOps engineer roles working with Kubernetes and ML infrastructure.

    Quick Facts

    Demand level
    High
    Difficulty
    Advanced
    Time to proficiency
    4–8 months
    Salary premium
    +£10,000–£25,000

    Key Technologies

    Kubernetes
    Kubeflow
    KServe
    PyTorch Operator
    Helm
    NVIDIA GPU Operator
    KEDA