PyTorch for Deep Learning
The 2026 Skills Guide
PyTorch is the dominant deep learning framework at UK AI companies. This guide covers everything from tensors and autograd to production training patterns, mixed precision, distributed training, and model export — the skills interviewers actually test.
Tensors and Autograd
The torch.Tensor is the fundamental data structure in PyTorch — an n-dimensional array that can live on CPU or GPU and optionally participates in automatic differentiation. Three concepts you must understand deeply:
- Device management — .to(device) moves a tensor to a given device ('cpu', 'cuda', 'cuda:1', 'mps' for Apple Silicon). Writing device-agnostic code — device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') — is standard practice. Forgetting to move a tensor to the right device is one of the most common runtime errors.
- Dtypes — float32 is the default. float16 and bfloat16 (brain float) are used for mixed-precision training. bfloat16 has the same exponent range as float32 (avoiding overflow/underflow) but far fewer mantissa bits, so lower precision — it is the preferred mixed-precision dtype on recent NVIDIA and Google TPU hardware. int8 is used for quantisation.
- Autograd — Setting requires_grad=True on a tensor tells PyTorch to track all operations involving it in a computation graph. Calling .backward() on a scalar loss traverses this graph via reverse-mode automatic differentiation, accumulating ∂L/∂θ in each leaf tensor's .grad — exactly the gradients that gradient descent needs. Wrap inference in torch.no_grad() to skip graph construction and save memory. All three concepts appear in the sketch after this list.
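A minimal sketch of all three concepts, assuming only a stock PyTorch install; the tensor shapes and values are illustrative:

```python
import torch

# Device-agnostic setup: prefer the GPU when one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# A leaf tensor that autograd will track.
w = torch.randn(3, device=device, requires_grad=True)
x = torch.ones(3, device=device)      # inputs must live on the same device

loss = (w * x).sum()                  # forward pass builds the graph
loss.backward()                       # reverse-mode autodiff
print(w.grad)                         # dL/dw == x == tensor([1., 1., 1.])

with torch.no_grad():                 # inference: no graph, less memory
    y = (w * x).sum()
```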
Building Models with nn.Module
torch.nn.Module is the base class for all neural network components in PyTorch. Every model — from a single linear layer to a billion-parameter transformer — is a subclass of nn.Module. Understanding how it works internally is essential for debugging and implementing custom architectures.
- __init__ and forward() — Define layers (as nn.Parameter or sub-modules) in __init__; define the forward computation in forward(). Parameters registered as attributes or via nn.ModuleList/nn.ModuleDict are automatically tracked by .parameters() and included in state_dict().
- .parameters() and .named_parameters() — Return iterators over all learnable parameters. Used to construct the optimiser: optim.Adam(model.parameters(), lr=1e-4). .named_parameters() gives (name, param) pairs, useful for weight decay exclusion (biases and layer norms are typically excluded).
- .train() and .eval() — Switch the model between training and evaluation modes, affecting Dropout (disabled in eval) and BatchNorm (uses running statistics in eval rather than batch statistics). Forgetting to call model.eval() during inference is a common bug that causes inconsistent results.
- Key layers — nn.Linear, nn.Conv2d (with kernel_size, stride, padding parameters), nn.MultiheadAttention (the building block of transformers), nn.Embedding (for token embeddings), nn.LayerNorm (now preferred over BatchNorm in transformer architectures), nn.Dropout. A minimal module using these pieces is sketched after this list.
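A minimal sketch of the nn.Module pattern; the class name, layer sizes, and dropout rate are hypothetical:

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """A hypothetical two-layer MLP illustrating the nn.Module pattern."""
    def __init__(self, in_dim: int, hidden: int, n_classes: int):
        super().__init__()                 # required before registering modules
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p=0.1),             # disabled by model.eval()
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)                 # call model(x), not model.forward(x)

model = TinyClassifier(in_dim=20, hidden=64, n_classes=3)
print(sum(p.numel() for p in model.parameters()))  # params tracked automatically
model.eval()                               # switch Dropout off for inference
```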
The PyTorch Training Loop
The canonical PyTorch training loop follows a specific pattern. Understanding why each step happens in this order — and what goes wrong if you change it — is a common interview topic:
- optimizer.zero_grad() — Clear gradients from the previous step. PyTorch accumulates gradients by default (useful for gradient accumulation); call this at the start of each step unless you are deliberately accumulating.
- Forward pass — outputs = model(inputs). This builds the computation graph.
- Compute loss — loss = criterion(outputs, targets). The loss must be a scalar (or you must pass a gradient argument to .backward() for non-scalar losses).
- loss.backward() — Reverse traversal of the computation graph, computing and accumulating gradients.
- Gradient clipping (optional) — torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before the optimiser step. Essential for transformer training; prevents gradient explosions that destabilise training.
- optimizer.step() — Update parameters using the computed gradients according to the optimisation algorithm (Adam, AdamW, SGD).
- LR scheduler step — scheduler.step() after optimizer.step(). Common schedulers: CosineAnnealingLR, OneCycleLR, and linear warmup followed by cosine decay (standard for transformers, typically composed via SequentialLR or LambdaLR). The full loop is sketched after this list.
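Putting the steps together, a minimal sketch with a hypothetical linear model and random data, so only the ordering is meaningful:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                         # stand-in for a real model
criterion = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

    optimizer.zero_grad()                        # 1. clear old gradients
    outputs = model(inputs)                      # 2. forward pass (builds graph)
    loss = criterion(outputs, targets)           # 3. scalar loss
    loss.backward()                              # 4. backprop
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 5. clip
    optimizer.step()                             # 6. parameter update
    scheduler.step()                             # 7. advance the LR schedule
```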
Gradient accumulation: When GPU memory prevents using the desired batch size, accumulate gradients over N mini-batches before calling optimizer.step(). Effective batch size = mini-batch size × N. Divide the loss by N to keep gradient scale consistent.
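A sketch of the accumulation pattern under the same hypothetical setup, where N = 4 mini-batches of 8 give an effective batch of 32:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(100)]
accum_steps = 4                                  # effective batch = 8 * 4 = 32

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(batches):
    loss = criterion(model(inputs), targets) / accum_steps  # keep gradient scale
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:               # step once every N mini-batches
        optimizer.step()
        optimizer.zero_grad()
```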
DataLoader and Dataset
PyTorch's data loading system is designed for performance at scale. Understanding Dataset and DataLoader is essential for building efficient training pipelines.
- Map-style Dataset — Implement __len__() and __getitem__(idx). DataLoader calls these to fetch samples. Suitable when all data fits in a structure accessible by index.
- Iterable-style IterableDataset — Implement __iter__(). Used for streaming datasets that cannot be indexed — e.g., data read from a database or a stream. Requires careful handling of worker processes to avoid duplicate data.
- DataLoader key parameters — num_workers: number of subprocesses for data loading (typically 4–8; 0 runs loading in the main process). pin_memory=True: allocates batches in pinned (page-locked) host memory for faster CPU→GPU transfers. prefetch_factor: number of batches loaded in advance per worker. persistent_workers=True: keeps worker processes alive between epochs.
- Custom collate_fn — Controls how a list of samples is assembled into a batch. Essential for variable-length sequences (e.g., padding text to the longest sequence in the batch), multi-modal data, or nested data structures. See the sketch after this list.
- Samplers — WeightedRandomSampler for imbalanced datasets. DistributedSampler for DDP training (ensures each process gets a non-overlapping subset). The sampler is passed to DataLoader; using DDP without DistributedSampler causes each process to train on the full dataset.
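A sketch of a map-style dataset with a padding collate_fn; the dataset contents and the pad id of 0 are hypothetical:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class TokenDataset(Dataset):
    """Hypothetical map-style dataset of variable-length token sequences."""
    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx], dtype=torch.long)

def pad_collate(batch):
    # Pad every sequence in the batch to the longest one (pad id 0 assumed).
    return pad_sequence(batch, batch_first=True, padding_value=0)

data = [[5, 2, 9], [7, 1], [3, 3, 3, 3]]
loader = DataLoader(
    TokenDataset(data),
    batch_size=2,
    shuffle=True,
    num_workers=0,            # >0 spawns loader subprocesses
    collate_fn=pad_collate,
)
for batch in loader:
    print(batch.shape)        # e.g. torch.Size([2, 4]) after padding
```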
Mixed Precision, Serialisation, and Deployment
Mixed precision training with torch.cuda.amp runs most operations in float16 or bfloat16 (speeding up compute and roughly halving activation memory) while keeping the master weights in float32. Use torch.autocast(device_type='cuda') as a context manager around the forward pass, and torch.cuda.amp.GradScaler to scale the loss before .backward(), which prevents float16 gradients from underflowing to zero (loss scaling is unnecessary with bfloat16).
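A minimal AMP training step, assuming a CUDA device and the same hypothetical linear model as above:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1).cuda()                  # this sketch requires a GPU
criterion = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(100):
    inputs = torch.randn(32, 10, device='cuda')
    targets = torch.randn(32, 1, device='cuda')

    optimizer.zero_grad()
    with torch.autocast(device_type='cuda'):     # forward in float16/bfloat16
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()                # scaled loss avoids underflow
    scaler.step(optimizer)                       # unscales grads, then steps
    scaler.update()                              # adapt the scale factor
```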
Model serialisation options in order of preference:
- state_dict — Save with torch.save(model.state_dict(), path); load with model.load_state_dict(torch.load(path)). Decoupled from the class definition. Always preferred for checkpointing (sketched after this list).
- TorchScript — torch.jit.script(model) or torch.jit.trace(model, example_input) compiles the model to a serialisable, Python-independent representation. Used for deployment in C++ environments via LibTorch. torch.jit.trace works for most models; torch.jit.script handles data-dependent control flow (if/else, loops).
- ONNX — torch.onnx.export(model, example_input, 'model.onnx') exports to the Open Neural Network Exchange format for cross-framework deployment (ONNX Runtime, TensorRT, CoreML). Standard for deploying PyTorch models in non-PyTorch serving environments.
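A checkpointing sketch with state_dict; the file name is hypothetical, and saving the optimiser state alongside the model allows exact resumption:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Checkpoint: save model and optimiser state together for exact resumption.
torch.save(
    {'model': model.state_dict(), 'optimizer': optimizer.state_dict()},
    'checkpoint.pt',                             # hypothetical path
)

# Resume: rebuild the objects first, then load the state dicts into them.
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
model.eval()                                     # if loading for inference
```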
Learning Path for PyTorch Skills
Foundations (0–2 months)
- Tensors: creation, indexing, device management, dtypes
- Autograd: requires_grad, .backward(), .grad, torch.no_grad()
- nn.Module: defining __init__ and forward(), .parameters(), .state_dict()
- Basic training loop with a simple MLP on tabular data
Core Skills (2–5 months)
- CNNs with nn.Conv2d for image classification
- RNNs / LSTMs for sequential data (nn.LSTM, nn.GRU)
- Custom Dataset and DataLoader with collate_fn
- AdamW optimiser with cosine LR schedule, gradient clipping
- Mixed precision with torch.autocast and GradScaler
Production Skills (5–10 months)
- PyTorch Lightning: LightningModule, Trainer, callbacks, LightningDataModule
- Transformer implementation from scratch with nn.MultiheadAttention
- DistributedDataParallel (DDP) for multi-GPU training
- Model export: state_dict, TorchScript, ONNX
- Profiling with torch.profiler and identifying bottlenecks
Expert Level (10+ months)
- Custom CUDA kernels via torch.utils.cpp_extension
- FlashAttention integration for memory-efficient attention
- torch.compile (Dynamo + Inductor) for kernel fusion and acceleration
- Quantisation: post-training quantisation and quantisation-aware training with torch.quantization
- Large-scale distributed training with FSDP (Fully Sharded Data Parallel)
Frequently Asked Questions
Is PyTorch better than TensorFlow for UK jobs?
For most UK AI companies — startups, scale-ups, research labs — PyTorch is the dominant and expected skill. TensorFlow retains a presence in larger enterprises. Know PyTorch deeply; have enough TensorFlow familiarity to read existing code. The PyTorch ecosystem (HuggingFace, Lightning, TorchServe) is the de facto standard for new projects.
What is autograd and why does it matter?
Autograd is PyTorch's automatic differentiation engine. Operations on tensors with requires_grad=True build a computation graph; calling .backward() traverses it in reverse, accumulating gradients. This explains why you call optimizer.zero_grad() before each backward pass, use torch.no_grad() during inference, and why in-place operations on leaf tensors raise errors.
When should you save state_dict vs the full model?
Always prefer state_dict (torch.save(model.state_dict(), ...)). Saving the full model with torch.save(model, ...) pickles the class definition alongside weights — loading requires the class to be importable from the exact same path. This breaks during refactoring. state_dict is decoupled from the class definition and is the standard approach.
What is the difference between DataParallel and DistributedDataParallel?
DataParallel replicates the model to each GPU on a single machine and gathers outputs on the primary GPU — which becomes a memory and compute bottleneck. DistributedDataParallel (DDP) spawns one process per GPU, synchronises gradients via all-reduce (NCCL), and avoids that single-GPU bottleneck. DDP is always preferred for multi-GPU training and is the only one of the two that supports multi-node training.
Should ML engineers learn PyTorch Lightning?
Yes. Lightning handles training loop boilerplate, distributed training, mixed precision, logging, and checkpointing through a structured LightningModule. Widely used at UK AI companies. You should understand both raw PyTorch (for debugging and custom implementations) and Lightning (for productive engineering work).