    HuggingFace Transformers
    The 2026 Skills Guide

    The HuggingFace ecosystem — Transformers, Datasets, PEFT, Accelerate, TRL — has become the standard toolkit for applied LLM work. This guide covers the libraries UK AI engineers use every day, from loading models to production fine-tuning.

    The HuggingFace Ecosystem

    transformers

    Core library: AutoModel, AutoTokenizer, pipeline(), Trainer. The main interface to 500,000+ models on the Hub.

    datasets

    Load, process, and stream datasets. Backed by Apache Arrow for zero-copy memory mapping. load_dataset() from the Hub or local files.

    peft

    LoRA, QLoRA, Prefix Tuning, IA3, AdaLoRA. Efficient fine-tuning without updating all parameters.

    accelerate

    Hardware-agnostic training: single GPU, multi-GPU DDP, TPU, mixed precision — no code changes.

    evaluate

    Metrics library: BLEU, ROUGE, BERTScore, F1, Accuracy, and 100+ others. Consistent interface for model evaluation.

    trl

    Training library for RL-based LLM alignment: SFTTrainer, DPOTrainer, PPOTrainer, ORPOTrainer.

    tokenizers

    Fast tokeniser library (Rust backend). BPE, WordPiece, Unigram, BPE-dropout. Used internally by transformers.

    diffusers

    Diffusion model library: Stable Diffusion, SDXL, Flux, ControlNet, and video generation models.

    The Transformers Library: Core Patterns

    Loading models and tokenisers

    • AutoTokenizer.from_pretrained(model_name) — Loads the tokeniser matching the model. Handles BPE, WordPiece, SentencePiece automatically based on the model config.
• AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map='auto') — device_map='auto' uses Accelerate to distribute layers across available GPUs and CPU when the model doesn't fit on a single GPU. torch_dtype=torch.bfloat16 halves memory compared to float32 with minimal quality loss for most models.
    • AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_4bit=True, ...)) — QLoRA-style 4-bit loading via bitsandbytes for fine-tuning on consumer hardware. (The standalone load_in_4bit=True argument is deprecated; pass it inside BitsAndBytesConfig.)
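
    A minimal loading sketch covering all three bullets. The model name matches the one used elsewhere in this guide (it is gated, so you need an HF_TOKEN); exact quantisation flags depend on your transformers and bitsandbytes versions:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "meta-llama/Llama-3.1-8B-Instruct"

    # Tokeniser class (BPE, SentencePiece, ...) is picked from the model config.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # bf16 halves memory vs float32; device_map='auto' shards across GPUs/CPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # QLoRA-style 4-bit loading for constrained hardware (needs bitsandbytes):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model_4bit = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    ```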

    Tokenisation essentials

    • Tokeniser output is a dict with input_ids (integer token IDs), attention_mask (1 for real tokens, 0 for padding), and, for some models such as BERT, token_type_ids (segment IDs for sentence pairs). Always pass return_tensors='pt' for PyTorch tensors.
    • Chat template: tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) — formats a list of message dicts into the model's expected format (including special tokens like <|im_start|>, [INST], <|begin_of_text|>). Critical for instruction-tuned models.
    • Padding: decoder-only models are trained with right padding, but batched generation needs left padding so every sequence's last real token sits at the end of the batch and generation starts from the same position. Set tokenizer.padding_side = 'left' before batched generation.
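
    Putting the three points together, a short sketch (the model name is illustrative; any chat-tuned checkpoint works the same way):

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    tokenizer.padding_side = "left"   # required for batched generation

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain attention in one sentence."},
    ]

    # Render with the model's own special tokens, ending at the assistant turn.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    batch = tokenizer([prompt], return_tensors="pt", padding=True)
    # batch["input_ids"] and batch["attention_mask"] are ready for model.generate()
    ```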

    Generation

    • model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9) — generates a completion from a prompt. Key parameters: max_new_tokens (not max_length — controls only the new token budget), temperature (higher = more random; greedy decoding via do_sample=False), top_p (nucleus sampling; keep only the top tokens whose probabilities sum to p), repetition_penalty (down-weights already-generated tokens; use no_repeat_ngram_size to block repeated n-grams).
    • TextStreamer or TextIteratorStreamer — stream token-by-token output to the user without waiting for the full completion. TextIteratorStreamer allows running generation in a separate thread and yielding tokens to a FastAPI streaming response.
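
    A streaming-generation sketch combining both bullets (model name illustrative; the same pattern feeds a FastAPI StreamingResponse):

    ```python
    from threading import Thread

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Name three UK cities."}],
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # skip_prompt=True yields only newly generated text, not the echoed prompt.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    gen_kwargs = dict(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,
    )

    # generate() blocks, so run it in a thread and consume tokens as they arrive.
    Thread(target=model.generate, kwargs=gen_kwargs).start()
    for chunk in streamer:
        print(chunk, end="", flush=True)
    ```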

    HuggingFace Trainer and Custom Training Loops

    The Trainer class handles the training loop boilerplate: device management, mixed precision, gradient accumulation, logging, checkpointing, and evaluation. Configure with TrainingArguments:

    • Key TrainingArguments: output_dir, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps (effective batch = per_device × num_gpus × accumulation_steps), learning_rate, lr_scheduler_type ('cosine', 'linear', 'constant_with_warmup'), warmup_ratio (fraction of steps for LR warmup), fp16 / bf16 (mixed precision), evaluation_strategy and save_strategy ('epoch' or 'steps').
    • Custom data collators: pass a DataCollatorWithPadding or a custom callable to Trainer's data_collator argument. For SFT, use DataCollatorForSeq2Seq with label padding to -100 (ignored tokens in the CE loss) to mask prompt tokens.
    • Callbacks: EarlyStoppingCallback, WandbCallback, MLflowCallback. Custom callbacks inherit from TrainerCallback and override event methods (on_epoch_end, on_evaluate, etc.).
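
    A sketch of a typical SFT setup tying these pieces together. model, tokenizer, train_ds, and eval_ds are placeholders for a loaded causal LM, its tokeniser, and tokenised datasets; hyperparameter values are illustrative:

    ```python
    from transformers import (
        DataCollatorForSeq2Seq,
        EarlyStoppingCallback,
        Trainer,
        TrainingArguments,
    )

    args = TrainingArguments(
        output_dir="sft-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,   # effective batch = 4 x n_gpus x 8
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        evaluation_strategy="steps",
        eval_steps=500,
        save_strategy="steps",
        save_steps=500,
        load_best_model_at_end=True,     # required for early stopping
        logging_steps=50,
    )

    trainer = Trainer(
        model=model,                     # placeholder: a loaded causal LM
        args=args,
        train_dataset=train_ds,          # placeholder: tokenised datasets
        eval_dataset=eval_ds,
        # Pads inputs and labels; label pad value -100 is ignored by the CE loss.
        data_collator=DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()
    ```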

    When Trainer is not sufficient (custom loss functions, complex evaluation, non-standard training dynamics), use Accelerate directly for a clean, hardware-agnostic training loop with full control.
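
    The core of such a loop, sketched under the assumption that model, optimizer, and train_loader have been built the usual PyTorch way:

    ```python
    from accelerate import Accelerator

    accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=8)

    # prepare() moves everything to the right device(s) and wraps for DDP/TPU.
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

    model.train()
    for batch in train_loader:
        with accelerator.accumulate(model):   # handles gradient accumulation
            outputs = model(**batch)
            loss = outputs.loss               # swap in any custom loss here
            accelerator.backward(loss)        # replaces loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    ```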

    Frequently Asked Questions

    What is the difference between AutoModel and a specific model class?

    Auto classes (AutoModelForCausalLM, AutoTokenizer) read config.json to determine the correct architecture and instantiate it automatically. This lets you swap models by changing only the name string. Specific classes (BertForSequenceClassification) are for architecture-specific access. Always use Auto classes in production code.

    How does the pipeline API work?

    pipeline('text-generation', model='...') bundles tokenisation, inference, and post-processing into one callable. Ideal for prototyping and evaluation. For production, use the tokeniser and model directly for control over batching, device placement, and output handling.
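
    For instance (model name illustrative; any text-generation checkpoint works):

    ```python
    from transformers import pipeline

    # Bundles tokenisation, inference, and post-processing into one callable.
    generator = pipeline(
        "text-generation",
        model="Qwen/Qwen2.5-0.5B-Instruct",
        device_map="auto",
    )

    out = generator(
        "Write a haiku about gradient descent.",
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
    )
    print(out[0]["generated_text"])
    ```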

    What is HuggingFace Accelerate?

    Accelerate abstracts hardware differences (1 GPU, multi-GPU DDP, TPU, mixed precision) from training code. Wrap model/optimiser/dataloader with accelerator.prepare(). Your script runs unchanged across hardware configurations. It is the backbone of HuggingFace Trainer.

    What is PEFT and what methods does it support?

    PEFT trains a small number of additional parameters instead of the full model. Supports: LoRA, QLoRA (needs bitsandbytes), Prefix Tuning, Prompt Tuning, IA3, AdaLoRA, LoftQ. Define LoraConfig (target_modules, r, lora_alpha), call get_peft_model(), train, then save the adapter weights only (typically tens to hundreds of MB vs tens of GB for the full model).
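
    The LoRA workflow in sketch form (gated model, needs HF_TOKEN; rank, alpha, and target_modules are illustrative choices, and the right module names vary by architecture):

    ```python
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    config = LoraConfig(
        r=16,                    # adapter rank
        lora_alpha=32,           # scaling; effective scale = alpha / r
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()   # typically well under 1% of the total

    # ...train as usual, then save only the adapter weights:
    model.save_pretrained("llama31-lora-adapter")
    ```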

    How do you load a model from the HuggingFace Hub?

    model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct', torch_dtype=torch.bfloat16, device_map='auto'). device_map='auto' uses accelerate to split the model across available GPUs/CPU automatically. For gated models (Llama, Gemma), set the HF_TOKEN environment variable or call huggingface_hub.login() first.

    Quick Facts

    Demand level
    Essential
    Difficulty
    Intermediate
    Time to proficiency
    2–4 months
    Salary premium
    +£10,000–£20,000

    Key Libraries

    transformers
    datasets
    peft
    accelerate
    trl
    evaluate
    tokenizers
    diffusers