    Transformer Architecture
    Technical Guide for AI Engineers

    Every LLM, embedding model, and vision transformer is built on the same core architecture. Understanding transformers deeply — self-attention, positional encoding, pre-training objectives, and scaling — separates engineers who use LLMs from engineers who can build and adapt them.

    Scaled Dot-Product Attention

    The transformer's key innovation, introduced in "Attention Is All You Need" (Vaswani et al., 2017), is scaled dot-product attention. For an input sequence of tokens, each token's representation is projected into three vectors using learned weight matrices: Q (Query), K (Key), and V (Value).

    The attention output is: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

    • QKᵀ — The matrix multiplication of queries and transposed keys produces an n×n score matrix representing pairwise compatibility between every token pair.
    • √d_k scaling — Without scaling, large d_k pushes the dot products to large magnitudes, where the softmax saturates and its gradient becomes extremely small (vanishing gradient). Dividing by √d_k keeps the dot products at roughly unit variance before the softmax.
    • Softmax normalisation — Converts raw scores into a probability distribution over positions. Each token's output is a convex combination of the Value vectors, weighted by attention probabilities.
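
    The computation maps directly onto a few tensor operations. Below is a minimal, illustrative PyTorch sketch (shapes and names are assumptions for illustration, not taken from the paper):

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            # q, k, v: (batch, heads, seq_len, d_k)
            d_k = q.size(-1)
            scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k): (batch, heads, n, n)
            weights = F.softmax(scores, dim=-1)             # probability distribution over positions
            return weights @ v                              # convex combination of Value vectors

        q = k = v = torch.randn(1, 8, 16, 64)               # batch=1, 8 heads, 16 tokens, d_k=64
        out = scaled_dot_product_attention(q, k, v)         # shape: (1, 8, 16, 64)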

    Multi-head attention runs h parallel attention heads, each with its own Q, K, V projections mapping into a subspace of dimension d_k = d_model/h. The outputs are concatenated and projected: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W_O. Multiple heads allow the model to simultaneously attend to different aspects of the input — one head might capture syntactic dependencies, another coreference, another semantic similarity.
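
    A minimal multi-head attention sketch, assuming d_model is divisible by h and omitting dropout and masking for brevity:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultiHeadAttention(nn.Module):
            def __init__(self, d_model, h):
                super().__init__()
                assert d_model % h == 0
                self.h, self.d_k = h, d_model // h
                # One projection per role; each head sees a d_model/h-dimensional slice.
                self.w_q = nn.Linear(d_model, d_model, bias=False)
                self.w_k = nn.Linear(d_model, d_model, bias=False)
                self.w_v = nn.Linear(d_model, d_model, bias=False)
                self.w_o = nn.Linear(d_model, d_model, bias=False)   # output projection W_O

            def forward(self, x):                                    # x: (batch, seq_len, d_model)
                b, n, _ = x.shape
                q, k, v = (w(x).view(b, n, self.h, self.d_k).transpose(1, 2)
                           for w in (self.w_q, self.w_k, self.w_v))
                scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
                out = F.softmax(scores, dim=-1) @ v                  # (b, h, n, d_k)
                out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)  # Concat(head_1..head_h)
                return self.w_o(out)

        mha = MultiHeadAttention(d_model=512, h=8)
        y = mha(torch.randn(2, 16, 512))                             # (2, 16, 512)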

    Causal (masked) attention in decoder-only models applies a mask to the attention score matrix that sets all future positions to −∞ before softmax, ensuring token i can only attend to positions ≤ i. This is what makes autoregressive generation possible.
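
    The mask itself is just an upper-triangular pattern of −∞ applied to the score matrix before the softmax. A small illustrative sketch:

        import torch

        n = 5
        scores = torch.randn(n, n)                                    # raw attention scores for 5 tokens
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))              # future positions -> -inf
        probs = torch.softmax(scores, dim=-1)                         # row i has zero weight on j > i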

    Transformer Block Components

    A transformer block (layer) consists of two sub-layers, each with a residual connection and layer normalisation:

    • Multi-head self-attention sub-layer — Computes the attended representation of the input sequence. Output: x + Dropout(MHA(LayerNorm(x))) (pre-norm formulation, used in modern LLMs; original paper used post-norm).
    • Position-wise feed-forward network (FFN) — Applied independently to each position: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 for ReLU activation. In modern LLMs (Llama, Mistral), SwiGLU (a gated linear unit variant) replaces the standard FFN: FFN_SwiGLU(x) = (SiLU(xW_gate) ⊙ xW_up) W_down. SwiGLU generally improves perplexity for the same parameter count. The standard FFN typically has a hidden dimension of 4× the model dimension; gated variants often use a smaller ratio (around 8/3×) so the parameter count stays comparable.
    • Layer normalisation — LayerNorm normalises each token's representation independently across the feature dimension (not across the batch or sequence). RMSNorm (Root Mean Square Layer Norm), used in Llama and Mistral, removes the mean-centering step for efficiency while maintaining stable training.
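
    Putting the pieces together, a pre-norm block with RMSNorm and a SwiGLU FFN might look like the sketch below. This is illustrative only: it reuses PyTorch's built-in nn.MultiheadAttention rather than a production attention kernel, omits dropout, and the hidden size is an assumption.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class RMSNorm(nn.Module):
            # Scale by the root mean square of the features; no mean-centering step.
            def __init__(self, dim, eps=1e-6):
                super().__init__()
                self.weight = nn.Parameter(torch.ones(dim))
                self.eps = eps

            def forward(self, x):
                rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
                return x * rms * self.weight

        class SwiGLUFFN(nn.Module):
            # FFN_SwiGLU(x) = (SiLU(x W_gate) * x W_up) W_down
            def __init__(self, d_model, hidden):
                super().__init__()
                self.w_gate = nn.Linear(d_model, hidden, bias=False)
                self.w_up = nn.Linear(d_model, hidden, bias=False)
                self.w_down = nn.Linear(hidden, d_model, bias=False)

            def forward(self, x):
                return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

        class PreNormBlock(nn.Module):
            # Pre-norm residual structure: x + Attn(Norm(x)), then x + FFN(Norm(x)).
            def __init__(self, d_model, n_heads, hidden):
                super().__init__()
                self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ffn = SwiGLUFFN(d_model, hidden)

            def forward(self, x):                        # x: (batch, seq_len, d_model)
                h = self.norm1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
                return x + self.ffn(self.norm2(x))

        block = PreNormBlock(d_model=512, n_heads=8, hidden=1408)
        y = block(torch.randn(2, 16, 512))               # (2, 16, 512)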

    GQA (Grouped Query Attention), used in Llama 3, Mistral, and Gemma, reduces memory bandwidth during inference by sharing Key and Value heads across groups of Query heads. Full multi-head attention has h KV heads for h query heads; MQA has 1 shared KV head; GQA groups query heads into g groups with one KV head per group. GQA provides near-MHA quality with near-MQA inference speed.
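
    A minimal sketch of the GQA idea (head counts below are illustrative): only h_kv Key/Value heads are stored, and each is expanded to serve a group of query heads before standard attention.

        import torch
        import torch.nn.functional as F

        b, n, d_k = 2, 16, 64
        h_q, h_kv = 8, 2                            # 8 query heads share 2 KV heads (4 per group)
        q = torch.randn(b, h_q, n, d_k)
        k = torch.randn(b, h_kv, n, d_k)            # the KV cache holds only h_kv heads
        v = torch.randn(b, h_kv, n, d_k)

        # Expand each KV head across its group of query heads, then attend as usual.
        k = k.repeat_interleave(h_q // h_kv, dim=1)
        v = v.repeat_interleave(h_q // h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)   # (b, h_q, n, d_k)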

    Pre-training Objectives and Model Families

    Encoder-Only
    • Pre-training: Masked Language Modelling (MLM) — randomly mask 15% of tokens; predict the masked tokens from bidirectional context.
    • Models: BERT, RoBERTa, DeBERTa, ModernBERT
    • Best for: Text classification, NER, semantic similarity, dense retrieval embeddings.

    Decoder-Only
    • Pre-training: Causal Language Modelling (CLM) — predict the next token from all previous tokens. The simplest and most scalable pre-training objective.
    • Models: GPT family, Llama, Mistral, Gemma, Phi, Falcon, Qwen
    • Best for: Text generation, instruction following, coding, reasoning. The dominant LLM architecture.

    Encoder-Decoder
    • Pre-training: Span Corruption (T5) — corrupt the input by replacing random spans with sentinel tokens; predict the original spans in the decoder.
    • Models: T5, FLAN-T5, BART, mT5
    • Best for: Translation, summarisation, QA over a source document, structured generation tasks.
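
    To make the dominant objective concrete, causal language modelling reduces to a shift-by-one cross-entropy between each position's prediction and the next token. A minimal sketch (shapes and values are illustrative):

        import torch
        import torch.nn.functional as F

        batch, seq_len, vocab = 2, 8, 100
        tokens = torch.randint(0, vocab, (batch, seq_len))   # input token ids
        logits = torch.randn(batch, seq_len, vocab)          # model output at each position

        # Position t predicts token t+1: drop the last logit, drop the first target.
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, vocab),
            tokens[:, 1:].reshape(-1),
        )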

    Scaling Laws and Context Length

    The Chinchilla scaling laws (Hoffmann et al., 2022) established that compute-optimal training requires roughly equal scaling of model size and training data: for a model with N parameters, the compute-optimal training set is approximately D ≈ 20 × N tokens (about 20 tokens per parameter). This led to a shift from training very large models on relatively small data (GPT-3: 175B params on 300B tokens) to training smaller models on much more data (Llama 3 8B on 15T tokens — well beyond Chinchilla-optimal for the model size, deliberately "overtrained" to create a more efficient inference model).
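
    As a back-of-the-envelope example, using the common C ≈ 6·N·D approximation for training FLOPs (an assumption here, not something stated above):

        # Chinchilla rule of thumb: roughly 20 training tokens per parameter.
        n_params = 8e9                         # e.g. an 8B-parameter model
        d_optimal = 20 * n_params              # ≈ 1.6e11 tokens (~160B)
        c_flops = 6 * n_params * d_optimal     # ≈ 7.7e21 training FLOPs
        print(f"{d_optimal:.1e} tokens, {c_flops:.1e} FLOPs")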

    Context length is an increasingly important engineering concern. Attention is O(n²) in sequence length; a 128K-token context involves roughly 128K² ≈ 16 billion pairwise attention scores per head. Techniques to address this include: FlashAttention (IO-aware exact attention), ring attention (distributing the sequence across GPUs), and sliding window attention (attending only to a local window, as used in Mistral). RoPE-based context extension techniques (YaRN, LongRoPE) allow models trained with shorter contexts to generalise to longer ones.
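
    The quadratic cost is easy to see numerically (rough figures, per attention head):

        n = 128_000                                      # context length in tokens
        pairwise_scores = n * n                          # ≈ 1.6e10 entries in the n×n score matrix
        fp16_bytes = 2 * pairwise_scores                 # ≈ 33 GB if materialised in fp16
        print(f"{pairwise_scores:.1e} scores, {fp16_bytes / 1e9:.0f} GB per head")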

    Frequently Asked Questions

    What is self-attention and why is it the core of transformers?

    Self-attention lets every token attend to every other token. For each token, Query (Q), Key (K), and Value (V) vectors are computed. Attention scores = softmax(QKᵀ/√d_k); the output is the weighted sum of V vectors. This gives the model global context with a single operation — unlike RNNs which process sequentially. The mechanism is O(n²) in sequence length.

    What is the difference between encoder-only, decoder-only, and encoder-decoder?

    Encoder-only (BERT, RoBERTa): bidirectional attention, pre-trained with MLM, best for understanding tasks (classification, NER, embeddings). Decoder-only (GPT, Llama, Mistral): causal attention, pre-trained with next-token prediction, the dominant LLM architecture. Encoder-decoder (T5, BART): encoder processes input, decoder generates output with cross-attention. Best for sequence-to-sequence tasks (translation, summarisation).

    What is positional encoding and why is it needed?

    Transformers process all tokens simultaneously with no inherent order. Positional encoding adds order information. Original: fixed sinusoidal (PE = sin/cos of position). Modern: RoPE (Rotary Position Embedding — rotates Q and K vectors in the complex plane) is used in Llama and Mistral and generalises to longer sequences. ALiBi adds a linear position bias to attention scores.
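
    A simplified RoPE sketch is shown below. The half-split pairing convention matches Llama-style implementations, but exact conventions vary between codebases, so treat the details as assumptions:

        import torch

        def rope(x, base=10000.0):
            # x: (batch, heads, seq_len, d_k). Rotate dimension pairs (i, i + d_k/2)
            # by an angle that grows linearly with the token position.
            *_, n, d = x.shape
            half = d // 2
            freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
            angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, half)
            cos, sin = angles.cos(), angles.sin()
            x1, x2 = x[..., :half], x[..., half:]
            return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

        q = torch.randn(1, 8, 16, 64)
        q_rot = rope(q)    # applied to Q and K before the attention scores are computed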

    What are scaling laws and why do they matter for ML engineers?

    Scaling laws (Kaplan 2020, Chinchilla 2022) describe power-law relationships between model size, training tokens, compute, and loss. Chinchilla showed compute-optimal training needs ~20 tokens per parameter — a 7B model optimally trains on ~140B tokens. This guides how to spend compute budgets: smaller model on more data often beats larger model on less.

    What is FlashAttention and why is it important?

    Standard attention is O(n²) in memory because the full n×n attention matrix must be materialised. FlashAttention (Dao et al., 2022) restructures the computation using tiling to stay within GPU SRAM, avoiding the HBM (high-bandwidth memory) bottleneck. FlashAttention-2 achieves near-peak GPU throughput for attention, significantly reducing training time and memory. It is now the standard implementation in most serious LLM training and fine-tuning frameworks.
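
    In practice you rarely implement this yourself. For example, PyTorch's F.scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when a supported GPU and dtype are available, and otherwise falls back to a standard implementation:

        import torch
        import torch.nn.functional as F

        device = "cuda" if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if device == "cuda" else torch.float32
        q = k = v = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)

        # The fused kernel never materialises the full n×n attention matrix in HBM.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)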

    Quick Facts

    Demand level
    Essential
    Difficulty
    Advanced
    Time to proficiency
    3–6 months

    Key Concepts

    Self-attention
    Multi-head attention
    RoPE
    GQA
    FlashAttention
    Scaling Laws
    LayerNorm
    SwiGLU