AWS SageMaker
ML Platform Skills Guide 2026
AWS SageMaker is the most widely used managed ML platform in the UK. This guide covers the components ML engineers actually use — Training Jobs, Endpoints, Pipelines, and the Feature Store — with practical guidance on cost optimisation and when to use each service.
SageMaker Components
- SageMaker Studio — Web-based IDE: notebooks, experiment tracking, model registry, pipeline visualisation, and debugging in one interface.
- Training Jobs — Managed compute for running training scripts. Supports PyTorch, TensorFlow, HuggingFace, and custom containers. Managed Spot Training for up to 70% cost reduction.
- Real-time Endpoints — Deploy models for synchronous inference with autoscaling, blue/green deployments, and A/B testing across model variants.
- Batch Transform — Offline, async inference over large S3 datasets. Pay only while processing — no idle costs.
- Pipelines — DAG-based ML workflow orchestration. Define training, evaluation, and registration steps in Python; SageMaker handles scheduling and execution.
- Feature Store — Centralised feature storage with online (low-latency serving) and offline (training) stores. Feature groups with automatic time-travel for point-in-time correct features.
- Automatic Model Tuning — Bayesian optimisation over hyperparameter search spaces. Runs parallel training jobs and focuses search on promising regions (see the tuning sketch after this list).
- Model Registry — Version models, track lineage, manage approval workflows (Pending → Approved → Rejected), and trigger downstream deployment pipelines.
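A minimal tuning sketch using the SageMaker Python SDK. It assumes `estimator` is an already-configured SageMaker estimator (such as the PyTorch estimator shown later in this guide); the metric name, regex, ranges, and S3 URI are illustrative placeholders:

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Bayesian search over two hyperparameters. `estimator` is assumed to be
# an already-configured SageMaker estimator.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    # Regex that extracts the metric from the script's CloudWatch logs
    metric_definitions=[
        {"Name": "validation:loss", "Regex": "val_loss=([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "batch-size": IntegerParameter(16, 128),
    },
    strategy="Bayesian",     # focus the search on promising regions
    max_jobs=20,             # total training jobs across the search
    max_parallel_jobs=4,     # jobs to run concurrently
)
tuner.fit({"train": "s3://my-bucket/data/train"})  # placeholder URI
```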
Training Jobs in Depth
SageMaker Training Jobs decouple your training code from the compute infrastructure. You write a standard training script, and SageMaker handles provisioning, scaling, logging, and cleanup.
Script mode — The most common pattern. Provide your training script and specify a framework container (e.g. the PyTorch or HuggingFace estimator). SageMaker downloads each input channel from S3 into the container and exposes the local paths through environment variables (SM_CHANNEL_TRAIN, SM_CHANNEL_VALIDATION). Anything your script writes to SM_MODEL_DIR is automatically uploaded to S3 on completion.
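A minimal script-mode launch sketch; the role ARN, S3 URIs, and framework versions are placeholders, and `train.py` is assumed to read the SM_* paths via `os.environ`:

```python
from sagemaker.pytorch import PyTorch

# Script mode: your train.py plus a framework container. Role ARN,
# versions, and S3 URIs below are placeholders.
estimator = PyTorch(
    entry_point="train.py",   # reads SM_CHANNEL_* / SM_MODEL_DIR internally
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 10},  # passed to train.py as CLI arguments
)

# Each channel is downloaded into the container and surfaced as
# SM_CHANNEL_TRAIN / SM_CHANNEL_VALIDATION.
estimator.fit({
    "train": "s3://my-bucket/data/train",
    "validation": "s3://my-bucket/data/validation",
})
```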
Managed Spot Training — Use EC2 Spot Instances for training jobs where interruptions are acceptable. SageMaker syncs your checkpoint directory to S3 and restarts the job from the last checkpoint after an interruption; your script must implement checkpoint saving and resuming. Savings of 60–70% over On-Demand for GPU instances.
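A Spot configuration sketch (same placeholder role and bucket as above); the script itself must write checkpoints to /opt/ml/checkpoints and resume from any it finds there:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    use_spot_instances=True,   # run on Spot capacity
    max_run=3600,              # cap on billable training seconds
    max_wait=7200,             # total wait incl. interruptions (>= max_run)
    # Local /opt/ml/checkpoints is synced here; on restart SageMaker
    # restores it so the script can resume from the latest checkpoint.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
```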
Distributed training — SageMaker provides its own distributed training libraries (SageMaker Distributed Data Parallel / Model Parallel). For most workloads, using standard PyTorch DDP or FSDP inside a SageMaker Training Job is simpler and equally effective. The SageMaker SDK sets up the multi-node environment when you specify instance_count > 1 together with a distribution strategy.
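A sketch of the native-DDP route, assuming a recent PyTorch container that supports the `pytorchddp` launcher; your script keeps standard torch.distributed / DistributedDataParallel code:

```python
from sagemaker.pytorch import PyTorch

# Two-node data-parallel job. train_ddp.py uses ordinary
# torch.distributed + DistributedDataParallel; SageMaker launches the
# process group across both instances.
estimator = PyTorch(
    entry_point="train_ddp.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.g5.12xlarge",
    instance_count=2,          # >1 makes the job multi-node
    framework_version="2.1",
    py_version="py310",
    distribution={"pytorchddp": {"enabled": True}},  # native DDP launcher
)
```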
Debugging and profiling — SageMaker Debugger can capture tensors, gradients, and activations during training without modifying your script. SageMaker Profiler identifies compute bottlenecks (CPU vs GPU utilisation, data loading wait times) during training.
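A sketch enabling the profiler and one built-in Debugger rule on a training job (placeholder role as before):

```python
from sagemaker.debugger import ProfilerConfig, Rule, rule_configs
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.g5.xlarge",
    instance_count=1,
    framework_version="2.1",
    py_version="py310",
    # Sample CPU/GPU utilisation and I/O every 500 ms, no script changes
    profiler_config=ProfilerConfig(system_monitor_interval_millis=500),
    # Built-in rule that inspects captured gradients during training
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
)
```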
Deployment: Endpoints and Inference Options
SageMaker offers three inference deployment options for different latency and cost requirements; a configuration sketch for all three follows the list:
- Real-time endpoints — Persistent endpoint with a managed load balancer in front of one or more instances. Application Auto Scaling adjusts instance count based on the SageMakerVariantInvocationsPerInstance CloudWatch metric. Blue/green deployment with traffic shifting enables zero-downtime model updates. Multi-model endpoints can serve hundreds of models on one instance by loading/unloading models from S3 on demand.
- Async inference — Requests are placed on an internal queue; SageMaker processes them and writes outputs to S3. The caller receives the output's S3 URL immediately; the result is written when processing completes. Instance count can scale to zero when there are no queued requests (unlike real-time endpoints). Best for: large model inference (>60-second processing, beyond the real-time invocation timeout), batch-like workloads that come in bursts.
- Serverless inference — No instance management; SageMaker allocates resources per request. Scales to zero; cold start latency of 1–10 seconds. Best for infrequent, unpredictable traffic where you cannot justify a persistent instance. Memory sizes: 1024 MB to 6144 MB.
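A deployment sketch for all three options, assuming `model` is a `sagemaker.model.Model` produced by a completed training job; endpoint names and buckets are placeholders:

```python
from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.serverless import ServerlessInferenceConfig

# 1. Real-time: persistent instances behind a managed endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-realtime-endpoint",  # placeholder name
)

# 2. Async: requests are queued, results land in S3, scales to zero.
async_predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-results/",  # placeholder bucket
    ),
)

# 3. Serverless: no instances to manage; cold starts in exchange for
# scale-to-zero. Memory is 1024-6144 MB.
serverless_predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    ),
)
```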
Frequently Asked Questions
What are the main SageMaker components ML engineers use?
Training Jobs (managed compute for training scripts), Endpoints (real-time inference), Batch Transform (offline inference over S3 data), Pipelines (ML workflow orchestration), Model Registry (version lifecycle management), and Feature Store (centralised feature storage). SageMaker Studio is the IDE for managing all of these.
How do SageMaker Training Jobs work?
SageMaker provisions a managed EC2 instance, pulls a container from ECR, downloads training data from S3, runs your script, and uploads model artifacts back to S3. You pay only for training job duration. Use built-in framework containers (PyTorch, HuggingFace) or a custom container.
When should you use real-time endpoint vs batch transform?
Real-time: synchronous, user-facing applications (chatbots, recommendation APIs). You pay for the instance continuously — use autoscaling. Batch Transform: offline inference over large S3 datasets; provisions instances only for the duration of the job (see the sketch below). Async inference: queue-based for long-running requests (beyond the 60-second real-time invocation timeout), with results written to S3.
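A Batch Transform sketch, again assuming `model` from a completed training job; the S3 prefixes are placeholders:

```python
# Offline inference over an S3 prefix; instances exist only for the job.
transformer = model.transformer(   # `model` from a completed training job
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-output/",  # placeholder bucket
)
transformer.transform(
    data="s3://my-bucket/batch-input/",  # placeholder input prefix
    content_type="text/csv",
    split_type="Line",   # treat each line as one record
)
transformer.wait()       # block until done; instances are then released
```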
How does SageMaker compare to GCP Vertex AI and Azure ML?
SageMaker: most mature, broadest AWS integration, most widely used in UK (especially finance/retail). Vertex AI: cleaner unified API, better integrated MLOps tooling. Azure ML: strong Microsoft ecosystem integration, dominant in enterprises with Microsoft agreements. SageMaker is most broadly applicable for UK jobs.
How do you reduce SageMaker training costs?
Use Spot Instances (Managed Spot Training) for up to 70% savings — SageMaker restarts interrupted jobs from the last checkpoint, provided your script saves checkpoints. Right-size instance types — start with a smaller instance to verify your script works, then scale up. Stream large datasets from S3 with FastFile or Pipe input mode instead of downloading everything before training starts. Set max_run time limits on training jobs to prevent runaway costs.