    Transformer Architecture
    Technical Guide for AI Engineers

    Every LLM, embedding model, and vision transformer is built on the same core architecture. Understanding transformers deeply — self-attention, positional encoding, pre-training objectives, and scaling — separates engineers who use LLMs from engineers who can build and adapt them.

    Scaled Dot-Product Attention

    The transformer's key innovation, introduced in "Attention Is All You Need" (Vaswani et al., 2017), is scaled dot-product attention. For an input sequence of tokens, each token's representation is projected into three vectors using learned weight matrices: Q (Query), K (Key), and V (Value).

    The attention output is: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

    • QKᵀ — The matrix multiplication of queries and transposed keys produces an n×n score matrix representing pairwise compatibility between every token pair.
    • √d_k scaling — Without scaling, large d_k pushes the dot products to large magnitudes, where the softmax saturates and its gradient becomes extremely small (vanishing gradient). Dividing by √d_k keeps the dot products at roughly unit variance before the softmax.
    • Softmax normalisation — Converts raw scores into a probability distribution over positions. Each token's output is a convex combination of the Value vectors, weighted by attention probabilities.
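
    The computation maps directly onto a few tensor operations. Below is a minimal, illustrative PyTorch sketch (shapes and names are assumptions for illustration, not taken from the paper):

        import torch
        import torch.nn.functional as F

        def scaled_dot_product_attention(q, k, v):
            # q, k, v: (batch, heads, seq_len, d_k)
            d_k = q.size(-1)
            scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k): (batch, heads, n, n)
            weights = F.softmax(scores, dim=-1)             # probability distribution over positions
            return weights @ v                              # convex combination of Value vectors

        q = k = v = torch.randn(1, 8, 16, 64)               # batch=1, 8 heads, 16 tokens, d_k=64
        out = scaled_dot_product_attention(q, k, v)         # shape: (1, 8, 16, 64)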

    Multi-head attention runs h parallel attention heads, each with its own Q, K, V projections mapping into a subspace of dimension d_k = d_model/h. The outputs are concatenated and projected: MultiHead(Q,K,V) = Concat(head_1,...,head_h)W_O. Multiple heads allow the model to simultaneously attend to different aspects of the input — one head might capture syntactic dependencies, another coreference, another semantic similarity.
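
    A minimal multi-head attention sketch, assuming d_model is divisible by h and omitting dropout and masking for brevity:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultiHeadAttention(nn.Module):
            def __init__(self, d_model, h):
                super().__init__()
                assert d_model % h == 0
                self.h, self.d_k = h, d_model // h
                # One projection per role; each head sees a d_model/h-dimensional slice.
                self.w_q = nn.Linear(d_model, d_model, bias=False)
                self.w_k = nn.Linear(d_model, d_model, bias=False)
                self.w_v = nn.Linear(d_model, d_model, bias=False)
                self.w_o = nn.Linear(d_model, d_model, bias=False)   # output projection W_O

            def forward(self, x):                                    # x: (batch, seq_len, d_model)
                b, n, _ = x.shape
                q, k, v = (w(x).view(b, n, self.h, self.d_k).transpose(1, 2)
                           for w in (self.w_q, self.w_k, self.w_v))
                scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
                out = F.softmax(scores, dim=-1) @ v                  # (b, h, n, d_k)
                out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)  # Concat(head_1..head_h)
                return self.w_o(out)

        mha = MultiHeadAttention(d_model=512, h=8)
        y = mha(torch.randn(2, 16, 512))                             # (2, 16, 512)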

    Causal (masked) attention in decoder-only models applies a mask to the attention score matrix that sets all future positions to −∞ before softmax, ensuring token i can only attend to positions ≤ i. This is what makes autoregressive generation possible.
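
    The mask itself is just an upper-triangular pattern of −∞ applied to the score matrix before the softmax. A small illustrative sketch:

        import torch

        n = 5
        scores = torch.randn(n, n)                                    # raw attention scores for 5 tokens
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))              # future positions -> -inf
        probs = torch.softmax(scores, dim=-1)                         # row i has zero weight on j > i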

    Transformer Block Components

    A transformer block (layer) consists of two sub-layers, each with a residual connection and layer normalisation:

    • Multi-head self-attention sub-layer — Computes the attended representation of the input sequence. Output: x + Dropout(MHA(LayerNorm(x))) (pre-norm formulation, used in modern LLMs; original paper used post-norm).
    • Position-wise feed-forward network (FFN) — Applied independently to each position: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 for ReLU activation. In modern LLMs (Llama, Mistral), SwiGLU (a gated linear unit variant) replaces the standard FFN: FFN_SwiGLU(x) = (SiLU(xW_gate) ⊙ xW_up) W_down. SwiGLU generally improves perplexity for the same parameter count. The standard FFN typically has a hidden dimension of 4× the model dimension; gated variants often use a smaller ratio (around 8/3×) so the parameter count stays comparable.
    • Layer normalisation — LayerNorm normalises each token's representation independently across the feature dimension (not across the batch or sequence). RMSNorm (Root Mean Square Layer Norm), used in Llama and Mistral, removes the mean-centering step for efficiency while maintaining stable training.
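
    Putting the pieces together, a pre-norm block with RMSNorm and a SwiGLU FFN might look like the sketch below. This is illustrative only: it reuses PyTorch's built-in nn.MultiheadAttention rather than a production attention kernel, omits dropout, and the hidden size is an assumption.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class RMSNorm(nn.Module):
            # Scale by the root mean square of the features; no mean-centering step.
            def __init__(self, dim, eps=1e-6):
                super().__init__()
                self.weight = nn.Parameter(torch.ones(dim))
                self.eps = eps

            def forward(self, x):
                rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
                return x * rms * self.weight

        class SwiGLUFFN(nn.Module):
            # FFN_SwiGLU(x) = (SiLU(x W_gate) * x W_up) W_down
            def __init__(self, d_model, hidden):
                super().__init__()
                self.w_gate = nn.Linear(d_model, hidden, bias=False)
                self.w_up = nn.Linear(d_model, hidden, bias=False)
                self.w_down = nn.Linear(hidden, d_model, bias=False)

            def forward(self, x):
                return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

        class PreNormBlock(nn.Module):
            # Pre-norm residual structure: x + Attn(Norm(x)), then x + FFN(Norm(x)).
            def __init__(self, d_model, n_heads, hidden):
                super().__init__()
                self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
                self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                self.ffn = SwiGLUFFN(d_model, hidden)

            def forward(self, x):                        # x: (batch, seq_len, d_model)
                h = self.norm1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
                return x + self.ffn(self.norm2(x))

        block = PreNormBlock(d_model=512, n_heads=8, hidden=1408)
        y = block(torch.randn(2, 16, 512))               # (2, 16, 512)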

    GQA (Grouped Query Attention), used in Llama 3, Mistral, and Gemma, reduces memory bandwidth during inference by sharing Key and Value heads across groups of Query heads. Full multi-head attention has h KV heads for h query heads; MQA has 1 shared KV head; GQA groups query heads into g groups with one KV head per group. GQA provides near-MHA quality with near-MQA inference speed.
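
    A minimal sketch of the GQA idea (head counts below are illustrative): only h_kv Key/Value heads are stored, and each is expanded to serve a group of query heads before standard attention.

        import torch
        import torch.nn.functional as F

        b, n, d_k = 2, 16, 64
        h_q, h_kv = 8, 2                            # 8 query heads share 2 KV heads (4 per group)
        q = torch.randn(b, h_q, n, d_k)
        k = torch.randn(b, h_kv, n, d_k)            # the KV cache holds only h_kv heads
        v = torch.randn(b, h_kv, n, d_k)

        # Expand each KV head across its group of query heads, then attend as usual.
        k = k.repeat_interleave(h_q // h_kv, dim=1)
        v = v.repeat_interleave(h_q // h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)   # (b, h_q, n, d_k)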

    Pre-training Objectives and Model Families

    Encoder-Only
    • Pre-training: Masked Language Modelling (MLM) — randomly mask 15% of tokens; predict the masked tokens from bidirectional context.
    • Models: BERT, RoBERTa, DeBERTa, ModernBERT
    • Best for: Text classification, NER, semantic similarity, dense retrieval embeddings.

    Decoder-Only
    • Pre-training: Causal Language Modelling (CLM) — predict the next token from all previous tokens. The simplest and most scalable pre-training objective.
    • Models: GPT family, Llama, Mistral, Gemma, Phi, Falcon, Qwen
    • Best for: Text generation, instruction following, coding, reasoning. The dominant LLM architecture.

    Encoder-Decoder
    • Pre-training: Span Corruption (T5) — corrupt the input by replacing random spans with sentinel tokens; predict the original spans in the decoder.
    • Models: T5, FLAN-T5, BART, mT5
    • Best for: Translation, summarisation, QA over a source document, structured generation tasks.
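
    To make the dominant objective concrete, causal language modelling reduces to a shift-by-one cross-entropy between each position's prediction and the next token. A minimal sketch (shapes and values are illustrative):

        import torch
        import torch.nn.functional as F

        batch, seq_len, vocab = 2, 8, 100
        tokens = torch.randint(0, vocab, (batch, seq_len))   # input token ids
        logits = torch.randn(batch, seq_len, vocab)          # model output at each position

        # Position t predicts token t+1: drop the last logit, drop the first target.
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, vocab),
            tokens[:, 1:].reshape(-1),
        )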

    Scaling Laws and Context Length

    The Chinchilla scaling laws (Hoffmann et al., 2022) established that compute-optimal training requires roughly equal scaling of model size and training data: for a model with N parameters, the compute-optimal training set is approximately D ≈ 20 × N tokens (about 20 tokens per parameter). This led to a shift from training very large models on relatively small data (GPT-3: 175B params on 300B tokens) to training smaller models on much more data (Llama 3 8B on 15T tokens — well beyond Chinchilla-optimal for the model size, deliberately "overtrained" to create a more efficient inference model).
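
    As a back-of-the-envelope example, using the common C ≈ 6·N·D approximation for training FLOPs (an assumption here, not something stated above):

        # Chinchilla rule of thumb: roughly 20 training tokens per parameter.
        n_params = 8e9                         # e.g. an 8B-parameter model
        d_optimal = 20 * n_params              # ≈ 1.6e11 tokens (~160B)
        c_flops = 6 * n_params * d_optimal     # ≈ 7.7e21 training FLOPs
        print(f"{d_optimal:.1e} tokens, {c_flops:.1e} FLOPs")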

    Context length is an increasingly important engineering concern. Attention is O(n²) in sequence length; a 128K-token context involves roughly 128K² ≈ 16 billion pairwise attention scores per head. Techniques to address this include: FlashAttention (IO-aware exact attention), ring attention (distributing the sequence across GPUs), and sliding window attention (attending only to a local window, as used in Mistral). RoPE-based context extension techniques (YaRN, LongRoPE) allow models trained with shorter contexts to generalise to longer ones.
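
    The quadratic cost is easy to see numerically (rough figures, per attention head):

        n = 128_000                                      # context length in tokens
        pairwise_scores = n * n                          # ≈ 1.6e10 entries in the n×n score matrix
        fp16_bytes = 2 * pairwise_scores                 # ≈ 33 GB if materialised in fp16
        print(f"{pairwise_scores:.1e} scores, {fp16_bytes / 1e9:.0f} GB per head")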

    Frequently Asked Questions

    What is self-attention and why is it the core of transformers?

    Self-attention lets every token attend to every other token. For each token, Query (Q), Key (K), and Value (V) vectors are computed. Attention scores = softmax(QKᵀ/√d_k); the output is the weighted sum of V vectors. This gives the model global context with a single operation — unlike RNNs which process sequentially. The mechanism is O(n²) in sequence length.

    What is the difference between encoder-only, decoder-only, and encoder-decoder?

    Encoder-only (BERT, RoBERTa): bidirectional attention, pre-trained with MLM, best for understanding tasks (classification, NER, embeddings). Decoder-only (GPT, Llama, Mistral): causal attention, pre-trained with next-token prediction, the dominant LLM architecture. Encoder-decoder (T5, BART): encoder processes input, decoder generates output with cross-attention. Best for sequence-to-sequence tasks (translation, summarisation).

    What is positional encoding and why is it needed?

    Transformers process all tokens simultaneously with no inherent order. Positional encoding adds order information. Original: fixed sinusoidal (PE = sin/cos of position). Modern: RoPE (Rotary Position Embedding — rotates Q and K vectors in the complex plane) is used in Llama and Mistral and generalises to longer sequences. ALiBi adds a linear position bias to attention scores.
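
    A simplified RoPE sketch is shown below. The half-split pairing convention matches Llama-style implementations, but exact conventions vary between codebases, so treat the details as assumptions:

        import torch

        def rope(x, base=10000.0):
            # x: (batch, heads, seq_len, d_k). Rotate dimension pairs (i, i + d_k/2)
            # by an angle that grows linearly with the token position.
            *_, n, d = x.shape
            half = d // 2
            freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
            angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, half)
            cos, sin = angles.cos(), angles.sin()
            x1, x2 = x[..., :half], x[..., half:]
            return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

        q = torch.randn(1, 8, 16, 64)
        q_rot = rope(q)    # applied to Q and K before the attention scores are computed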

    What are scaling laws and why do they matter for ML engineers?

    Scaling laws (Kaplan 2020, Chinchilla 2022) describe power-law relationships between model size, training tokens, compute, and loss. Chinchilla showed compute-optimal training needs ~20 tokens per parameter — a 7B model optimally trains on ~140B tokens. This guides how to spend compute budgets: smaller model on more data often beats larger model on less.

    What is FlashAttention and why is it important?

    Standard attention is O(n²) in memory because the full n×n attention matrix must be materialised. FlashAttention (Dao et al., 2022) restructures the computation using tiling to stay within GPU SRAM, avoiding the HBM (high-bandwidth memory) bottleneck. FlashAttention-2 achieves near-peak GPU throughput for attention, significantly reducing training time and memory. It is now the standard implementation in most serious LLM training and fine-tuning frameworks.
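
    In practice you rarely implement this yourself. For example, PyTorch's F.scaled_dot_product_attention dispatches to a fused FlashAttention-style kernel when a supported GPU and dtype are available, and otherwise falls back to a standard implementation:

        import torch
        import torch.nn.functional as F

        device = "cuda" if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if device == "cuda" else torch.float32
        q = k = v = torch.randn(1, 8, 4096, 64, device=device, dtype=dtype)

        # The fused kernel never materialises the full n×n attention matrix in HBM.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)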

    Quick Facts

    Demand level
    Essential
    Difficulty
    Advanced
    Time to proficiency
    3–6 months

    Key Concepts

    Self-attention
    Multi-head attention
    RoPE
    GQA
    FlashAttention
    Scaling Laws
    LayerNorm
    SwiGLU