Kubernetes for MLOps
The 2026 Skills Guide
Kubernetes is the infrastructure layer for large-scale ML in production. This guide covers the K8s concepts MLOps engineers actually use — GPU scheduling, Kubeflow for ML workflows, KServe for model serving, autoscaling, and resource management for training jobs.
Core K8s Concepts for ML Engineers
ML engineers working on MLOps need to understand Kubernetes at the operator level — enough to configure, debug, and optimise ML workloads, not necessarily to administer the cluster.
- Pods — The smallest deployable unit. One or more containers sharing network namespace and storage. ML training Pods typically have one container (the training script), a shared volume for checkpoints, and resource requests for CPU, memory, and GPU.
- Jobs and CronJobs — Jobs run Pods to completion; ideal for training. Set completions: 1 and parallelism: 1 for a single-Pod job, and backoffLimit: 3 for up to 3 retries on failure. Use CronJobs for scheduled retraining (nightly feature updates, weekly model refreshes). A combined training-Job example follows this list.
- Deployments and Services — Deployments manage always-on replicas (inference servers). Services expose Pods via a stable DNS name and optional load balancer. Use ClusterIP for internal-only services, LoadBalancer for external access, and NodePort for development.
- ConfigMaps and Secrets — ConfigMaps store non-sensitive configuration (model hyperparameters, environment config) as key-value pairs mounted as environment variables or files. Secrets store sensitive values (API keys, database passwords) with base64 encoding and tighter RBAC. Reference both in Pod specs — never hardcode configuration in Docker images.
- PersistentVolumeClaims (PVCs) — Claim persistent storage for model checkpoints, training data, and logs. For ML training, use ReadWriteOnce (one Pod at a time) for training checkpoints and ReadOnlyMany for shared training data across multiple Pods.
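To make these pieces concrete, here is a minimal sketch of a single-Pod training Job that reads hyperparameters from a ConfigMap, requests one GPU, and writes checkpoints to a PVC. The image, ConfigMap, and PVC names are placeholders, not fixed conventions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                    # hypothetical job name
spec:
  completions: 1                       # run exactly one Pod to completion
  parallelism: 1
  backoffLimit: 3                      # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/train:latest   # placeholder image
          envFrom:
            - configMapRef:
                name: train-config     # hyperparameters exposed as env vars
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1        # GPU request must equal the GPU limit
            limits:
              memory: 16Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints  # checkpoint directory backed by the PVC
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoints-pvc # ReadWriteOnce PVC for checkpoints
```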
GPU Scheduling and Resource Management
GPU scheduling in Kubernetes works through the device plugin framework. The NVIDIA GPU Operator installs and manages the device plugin, which advertises GPU capacity as nvidia.com/gpu on each node. The scheduler uses this as a resource constraint when placing Pods.
- Node labels and selectors — Label GPU nodes with their GPU type: kubectl label node gpu-node-01 gpu-type=a100-80gb. Use nodeSelector in Pod specs to target specific GPU types for training jobs that require high-memory GPUs.
- Taints and tolerations — Taint GPU nodes to prevent non-GPU workloads from being scheduled on expensive GPU instances: kubectl taint nodes gpu-node-01 gpu=true:NoSchedule. GPU Pods must add the corresponding toleration. This prevents CPU workloads from occupying nodes where GPU Pods need to run.
- Resource quotas — Per-namespace quotas prevent individual teams from monopolising GPU resources. ResourceQuota objects cap total GPU requests per namespace, enforcing fair sharing across teams (see the sketch after this list).
- Priority classes — PriorityClass objects assign numeric priorities to Pods. High-priority inference Pods preempt lower-priority training Jobs during resource contention — inference SLAs are maintained even when the cluster is running batch training workloads.
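A sketch of how these controls fit together: a ResourceQuota capping a team's GPU requests, plus a fragment of a training Pod spec that targets the labelled, tainted GPU nodes from the commands above. The namespace, image, and PriorityClass names are assumptions for illustration.

```yaml
# Cap total GPU requests in a team namespace (namespace name is illustrative)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"
---
# Fragment of a training Pod spec targeting labelled, tainted GPU nodes
spec:
  priorityClassName: training-low      # assumes this PriorityClass already exists
  nodeSelector:
    gpu-type: a100-80gb                # matches the node label applied above
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule               # tolerates the gpu=true:NoSchedule taint
  containers:
    - name: trainer
      image: registry.example.com/train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 4
```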
Kubeflow and KServe
Kubeflow Pipelines define ML workflows as directed acyclic graphs of containerised steps. Each step runs in its own container: the SDK's @dsl.component decorator converts a Python function into a pipeline component, and @dsl.pipeline composes components into a workflow. Pipelines are versioned, scheduled, and observable in the Kubeflow UI, and the compiled pipeline YAML (IR format) can be stored in Git and deployed via CI/CD.
PyTorch Operator defines a PyTorchJob custom resource. Specify the number of Master and Worker replicas; the Operator creates the corresponding Pods, injects the right MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables, and handles Pod failure and recovery. Your training script uses standard torch.distributed.init_process_group(backend="nccl").
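As a rough sketch, a PyTorchJob for one master and three workers typically looks like the manifest below; the job name and image are placeholders, and the container is conventionally named pytorch so the operator can inject the distributed-training environment variables.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-ddp                     # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # conventional container name for the operator
              image: registry.example.com/train:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3                      # three worker Pods, one GPU each
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```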
KServe (formerly KFServing) provides Kubernetes-native model serving via InferenceService custom resources. Supports: multiple serving runtimes (Triton, TorchServe, ONNX Runtime, HuggingFace TGI), canary deployments (route X% of traffic to a new model version), and serverless scaling to zero via Knative. Define an InferenceService with a storageUri pointing to model weights in S3/GCS, and KServe handles container lifecycle, load balancing, and autoscaling.
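A minimal InferenceService sketch, assuming a PyTorch model stored in S3 and a 10% canary split; the service name, storageUri, and exact canary field placement are illustrative and should be checked against the KServe version in use.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model                # hypothetical service name
spec:
  predictor:
    canaryTrafficPercent: 10           # route 10% of traffic to this new revision
    model:
      modelFormat:
        name: pytorch                  # KServe picks a matching serving runtime
      storageUri: s3://models/sentiment/v2   # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: 1
```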
Frequently Asked Questions
How do you run GPU workloads on Kubernetes?
Install NVIDIA GPU Operator or device plugin. GPUs appear as schedulable resources: nvidia.com/gpu. Request in Pod spec via resources.limits. NVIDIA MIG (Multi-Instance GPU) for A100 enables hardware partitioning of a single GPU. For multi-node distributed training, use PyTorch Operator (Kubeflow) to manage distributed Pod lifecycles.
What is Kubeflow and what does it provide?
Kubeflow is a collection of Kubernetes-native ML tools: Pipelines (DAG workflow orchestration), PyTorch/TF Operator (distributed training job management), Katib (hyperparameter tuning), and KServe (model serving with autoscaling). The K8s-native alternative to managed platforms like SageMaker.
What is the difference between a Deployment and a Job?
Deployment: long-running Pods, always-on, rolling updates, replicas — for model serving and APIs. Job: Pods that run to completion, restarts on failure — for training jobs and batch inference. CronJob creates Jobs on a schedule. Use Deployments for inference servers; Jobs for training.
How does Horizontal Pod Autoscaling work for ML inference?
HPA scales the number of inference Pod replicas based on metrics. Configure with: minReplicas, maxReplicas, and target metrics (CPU utilisation, memory, or custom metrics like requests-per-second via KEDA or Prometheus Adapter). For GPU inference, custom metrics (inference latency p95, queue depth) are more meaningful than CPU. Scale-to-zero requires KEDA (Kubernetes Event-Driven Autoscaling) since standard HPA doesn't scale below 1.
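A minimal HPA sketch scaling a model-serving Deployment on CPU utilisation; the Deployment name and thresholds are illustrative, and swapping in custom metrics (latency, queue depth) assumes the Prometheus Adapter or KEDA is installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server                 # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # scale out when average CPU exceeds 60%
```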
What are resource requests and limits in Kubernetes and why do they matter for ML?
Requests: the minimum resources guaranteed to a Pod (used by the scheduler to find a suitable node). Limits: the maximum resources a Pod can consume (exceeding the CPU limit throttles the container; exceeding the memory limit gets it OOM-killed). For GPU Pods, nvidia.com/gpu requests and limits must be equal (Kubernetes does not overcommit extended resources, and GPUs are not shared by default), so request exactly the number of GPUs the job uses; over-requesting reserves GPUs that sit idle. For inference, set memory requests conservatively and limits generously.
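A sketch of an inference container's resources block following these guidelines; the sizes are illustrative, not recommendations.

```yaml
# Fragment of an inference container spec
resources:
  requests:
    cpu: "2"
    memory: 8Gi            # conservative request so the Pod schedules easily
    nvidia.com/gpu: 1      # GPU request must equal the GPU limit
  limits:
    memory: 16Gi           # generous limit to absorb load spikes without OOM kills
    nvidia.com/gpu: 1
```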