Kubernetes for MLOps
The 2026 Skills Guide
Kubernetes is the infrastructure layer for large-scale ML in production. This guide covers the K8s concepts MLOps engineers actually use — GPU scheduling, Kubeflow for ML workflows, KServe for model serving, autoscaling, and resource management for training jobs.
Core K8s Concepts for ML Engineers
ML engineers working on MLOps need to understand Kubernetes at the operator level — enough to configure, debug, and optimise ML workloads, not necessarily to administer the cluster.
- Pods — The smallest deployable unit. One or more containers sharing network namespace and storage. ML training Pods typically have one container (the training script), a shared volume for checkpoints, and resource requests for CPU, memory, and GPU.
- Jobs and CronJobs — Jobs run Pods to completion; ideal for training. Set completions: 1 and parallelism: 1 for a single-Pod job, and backoffLimit: 3 for up to 3 retries on failure. Use CronJobs for scheduled retraining (nightly feature updates, weekly model refreshes). A combined training-Job example follows this list.
- Deployments and Services — Deployments manage always-on replicas (inference servers). Services expose Pods via a stable DNS name and optional load balancer. Use ClusterIP for internal-only services, LoadBalancer for external access, and NodePort for development.
- ConfigMaps and Secrets — ConfigMaps store non-sensitive configuration (model hyperparameters, environment config) as key-value pairs mounted as environment variables or files. Secrets store sensitive values (API keys, database passwords) with base64 encoding and tighter RBAC. Reference both in Pod specs — never hardcode configuration in Docker images.
- PersistentVolumeClaims (PVCs) — Claim persistent storage for model checkpoints, training data, and logs. For ML training, use ReadWriteOnce (one Pod at a time) for training checkpoints and ReadOnlyMany for shared training data across multiple Pods.
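To make these pieces concrete, here is a minimal sketch of a single-Pod training Job that reads hyperparameters from a ConfigMap, requests one GPU, and writes checkpoints to a PVC. The image, ConfigMap, and PVC names are placeholders, not fixed conventions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                    # hypothetical job name
spec:
  completions: 1                       # run exactly one Pod to completion
  parallelism: 1
  backoffLimit: 3                      # retry up to 3 times on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/train:latest   # placeholder image
          envFrom:
            - configMapRef:
                name: train-config     # hyperparameters exposed as env vars
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: 1        # GPU request must equal the GPU limit
            limits:
              memory: 16Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - name: checkpoints
              mountPath: /checkpoints  # checkpoint directory backed by the PVC
      volumes:
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoints-pvc # ReadWriteOnce PVC for checkpoints
```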
GPU Scheduling and Resource Management
GPU scheduling in Kubernetes works through the device plugin framework. The NVIDIA GPU Operator installs and manages the device plugin, which advertises GPU capacity as nvidia.com/gpu on each node. The scheduler uses this as a resource constraint when placing Pods.
- Node labels and selectors — Label GPU nodes with their GPU type: kubectl label node gpu-node-01 gpu-type=a100-80gb. Use nodeSelector in Pod specs to target specific GPU types for training jobs that require high-memory GPUs.
- Taints and tolerations — Taint GPU nodes to prevent non-GPU workloads from being scheduled on expensive GPU instances: kubectl taint nodes gpu-node-01 gpu=true:NoSchedule. GPU Pods must add the corresponding toleration. This prevents CPU workloads from occupying nodes where GPU Pods need to run.
- Resource quotas — Per-namespace quotas prevent individual teams from monopolising GPU resources. ResourceQuota objects cap total GPU requests per namespace, enforcing fair sharing across teams (see the sketch after this list).
- Priority classes — PriorityClass objects assign numeric priorities to Pods. High-priority inference Pods preempt lower-priority training Jobs during resource contention — inference SLAs are maintained even when the cluster is running batch training workloads.
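A sketch of how these controls fit together: a ResourceQuota capping a team's GPU requests, plus a fragment of a training Pod spec that targets the labelled, tainted GPU nodes from the commands above. The namespace, image, and PriorityClass names are assumptions for illustration.

```yaml
# Cap total GPU requests in a team namespace (namespace name is illustrative)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "8"
---
# Fragment of a training Pod spec targeting labelled, tainted GPU nodes
spec:
  priorityClassName: training-low      # assumes this PriorityClass already exists
  nodeSelector:
    gpu-type: a100-80gb                # matches the node label applied above
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule               # tolerates the gpu=true:NoSchedule taint
  containers:
    - name: trainer
      image: registry.example.com/train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 4
```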
Kubeflow and KServe
Kubeflow Pipelines define ML workflows as directed acyclic graphs of containerised steps. Each step runs in its own container: the SDK's @dsl.component decorator converts a Python function into a pipeline component, and @dsl.pipeline composes components into a workflow. Pipelines are versioned, scheduled, and observable in the Kubeflow UI, and the compiled pipeline YAML (IR format) can be stored in Git and deployed via CI/CD.
PyTorch Operator defines a PyTorchJob custom resource. Specify the number of Master and Worker replicas; the Operator creates the corresponding Pods, injects the right MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables, and handles Pod failure and recovery. Your training script uses standard torch.distributed.init_process_group(backend="nccl").
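As a rough sketch, a PyTorchJob for one master and three workers typically looks like the manifest below; the job name and image are placeholders, and the container is conventionally named pytorch so the operator can inject the distributed-training environment variables.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-ddp                     # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # conventional container name for the operator
              image: registry.example.com/train:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3                      # three worker Pods, one GPU each
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```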
KServe (formerly KFServing) provides Kubernetes-native model serving via InferenceService custom resources. Supports: multiple serving runtimes (Triton, TorchServe, ONNX Runtime, HuggingFace TGI), canary deployments (route X% of traffic to a new model version), and serverless scaling to zero via Knative. Define an InferenceService with a storageUri pointing to model weights in S3/GCS, and KServe handles container lifecycle, load balancing, and autoscaling.
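A minimal InferenceService sketch, assuming a PyTorch model stored in S3 and a 10% canary split; the service name, storageUri, and exact canary field placement are illustrative and should be checked against the KServe version in use.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model                # hypothetical service name
spec:
  predictor:
    canaryTrafficPercent: 10           # route 10% of traffic to this new revision
    model:
      modelFormat:
        name: pytorch                  # KServe picks a matching serving runtime
      storageUri: s3://models/sentiment/v2   # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: 1
```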
Frequently Asked Questions
How do you run GPU workloads on Kubernetes?
Install NVIDIA GPU Operator or device plugin. GPUs appear as schedulable resources: nvidia.com/gpu. Request in Pod spec via resources.limits. NVIDIA MIG (Multi-Instance GPU) for A100 enables hardware partitioning of a single GPU. For multi-node distributed training, use PyTorch Operator (Kubeflow) to manage distributed Pod lifecycles.
What is Kubeflow and what does it provide?
Kubeflow is a collection of Kubernetes-native ML tools: Pipelines (DAG workflow orchestration), PyTorch/TF Operator (distributed training job management), Katib (hyperparameter tuning), and KServe (model serving with autoscaling). The K8s-native alternative to managed platforms like SageMaker.
What is the difference between a Deployment and a Job?
Deployment: long-running Pods, always-on, rolling updates, replicas — for model serving and APIs. Job: Pods that run to completion, restarts on failure — for training jobs and batch inference. CronJob creates Jobs on a schedule. Use Deployments for inference servers; Jobs for training.
How does Horizontal Pod Autoscaling work for ML inference?
HPA scales the number of inference Pod replicas based on metrics. Configure with: minReplicas, maxReplicas, and target metrics (CPU utilisation, memory, or custom metrics like requests-per-second via KEDA or Prometheus Adapter). For GPU inference, custom metrics (inference latency p95, queue depth) are more meaningful than CPU. Scale-to-zero requires KEDA (Kubernetes Event-Driven Autoscaling) since standard HPA doesn't scale below 1.
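A minimal HPA sketch scaling a model-serving Deployment on CPU utilisation; the Deployment name and thresholds are illustrative, and swapping in custom metrics (latency, queue depth) assumes the Prometheus Adapter or KEDA is installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server                 # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60       # scale out when average CPU exceeds 60%
```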
What are resource requests and limits in Kubernetes and why do they matter for ML?
Requests: the minimum resources guaranteed to a Pod (used by the scheduler to find a suitable node). Limits: the maximum resources a Pod can consume (exceeding the CPU limit throttles the container; exceeding the memory limit gets it OOM-killed). For GPU Pods, nvidia.com/gpu requests and limits must be equal (Kubernetes does not overcommit extended resources, and GPUs are not shared by default), so request exactly the number of GPUs the job uses; over-requesting reserves GPUs that sit idle. For inference, set memory requests conservatively and limits generously.
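A sketch of an inference container's resources block following these guidelines; the sizes are illustrative, not recommendations.

```yaml
# Fragment of an inference container spec
resources:
  requests:
    cpu: "2"
    memory: 8Gi            # conservative request so the Pod schedules easily
    nvidia.com/gpu: 1      # GPU request must equal the GPU limit
  limits:
    memory: 16Gi           # generous limit to absorb load spikes without OOM kills
    nvidia.com/gpu: 1
```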