The HuggingFace Ecosystem
- transformers: core library (AutoModel, AutoTokenizer, pipeline(), Trainer); the main interface to 500,000+ models on the Hub.
- datasets: load, process, and stream datasets. Backed by Apache Arrow for zero-copy memory mapping; load_dataset() reads from the Hub or local files.
- peft: LoRA, QLoRA, Prefix Tuning, IA3, AdaLoRA. Efficient fine-tuning without updating all parameters.
- accelerate: hardware-agnostic training (single GPU, multi-GPU DDP, TPU, mixed precision) with no code changes.
- evaluate: metrics library (BLEU, ROUGE, BERTScore, F1, Accuracy, and 100+ others) with a consistent interface for model evaluation.
- trl: post-training and alignment library: SFTTrainer (supervised fine-tuning), DPOTrainer, PPOTrainer, ORPOTrainer.
- tokenizers: fast tokeniser library with a Rust backend (BPE, WordPiece, Unigram, BPE-dropout); used internally by transformers.
- diffusers: diffusion model library (Stable Diffusion, SDXL, Flux, ControlNet, and video generation models).
The Transformers Library: Core Patterns
Loading models and tokenisers
- AutoTokenizer.from_pretrained(model_name) — loads the tokeniser matching the model. Handles BPE, WordPiece, and SentencePiece automatically based on the model config.
- AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map='auto') — device_map='auto' uses Accelerate to distribute layers across available GPUs and CPU if the model doesn't fit on a single GPU; torch_dtype=torch.bfloat16 halves memory compared to float32 with minimal quality loss for most models.
- AutoModel.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_4bit=True, ...)) — QLoRA-style 4-bit loading via bitsandbytes for fine-tuning on consumer hardware (see the loading sketch below).
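A minimal loading sketch along these lines; the model name and the BitsAndBytesConfig values are illustrative choices, not requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any Hub model ID

tokenizer = AutoTokenizer.from_pretrained(model_name)

# bf16 weights, layers spread across available GPUs/CPU by Accelerate.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# QLoRA-style 4-bit load (requires bitsandbytes); config values are typical, not canonical.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```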
Tokenisation essentials
- Tokeniser output is a dict with input_ids (integer token IDs), attention_mask (1 for real tokens, 0 for padding), and, for BERT-style sentence-pair models, token_type_ids. Always pass return_tensors='pt' for PyTorch tensors.
- Chat template: tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) — formats a list of message dicts into the model's expected format (including special tokens like <|im_start|>, [INST], <|begin_of_text|>). Critical for instruction-tuned models.
- Padding: pad on the right for decoder-only models during training; left padding is needed for batched generation, to align the generation start. Set tokenizer.padding_side = 'left' for batched generation (see the sketch below).
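A short sketch of chat templating and batched encoding, assuming an instruction-tuned checkpoint whose tokeniser ships a chat template (model name is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.padding_side = "left"                # left-pad so batched generation starts aligned
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many decoder-only models ship without a pad token

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain attention in one sentence."},
]

# Render the conversation with the model's own special tokens, ending with the assistant prefix.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Batched encoding returns input_ids and attention_mask as PyTorch tensors.
batch = tokenizer([prompt, prompt], return_tensors="pt", padding=True)
print(batch["input_ids"].shape, batch["attention_mask"].shape)
```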
Generation
- model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9) — generates a completion from a prompt. Key parameters: max_new_tokens (not max_length; controls only the new-token budget), temperature (higher = more random; greedy decoding when do_sample=False), top_p (nucleus sampling; keep only the top tokens whose probabilities sum to p), repetition_penalty (penalises tokens that have already been generated).
- TextStreamer or TextIteratorStreamer — stream token-by-token output to the user without waiting for the full completion. TextIteratorStreamer allows running generation in a separate thread and yielding tokens to a FastAPI streaming response (sketched below).
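A streaming-generation sketch along these lines; the model name is illustrative, and the same iterator could feed a FastAPI StreamingResponse instead of print():

```python
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about GPUs."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)

# Run generate() in a worker thread; the streamer yields decoded text chunks as they are produced.
Thread(target=model.generate, kwargs=generation_kwargs).start()
for chunk in streamer:
    print(chunk, end="", flush=True)
```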
HuggingFace Trainer and Custom Training Loops
The Trainer class handles the training loop boilerplate: device management, mixed precision, gradient accumulation, logging, checkpointing, and evaluation. Configure with TrainingArguments:
- Key TrainingArguments: output_dir, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps (effective batch = per_device × num_gpus × accumulation_steps), learning_rate, lr_scheduler_type ('cosine', 'linear', 'constant_with_warmup'), warmup_ratio (fraction of steps for LR warmup), fp16/bf16 (mixed precision), evaluation_strategy and save_strategy ('epoch' or 'steps').
- Custom data collators: pass a DataCollatorWithPadding or a custom callable to Trainer's data_collator argument. For SFT, use DataCollatorForSeq2Seq, which pads labels with -100 (ignored by the cross-entropy loss); set prompt tokens to -100 as well to compute the loss only on the completion.
- Callbacks: EarlyStoppingCallback, WandbCallback, MLflowCallback. Custom callbacks inherit from TrainerCallback and override event methods (on_epoch_end, on_evaluate, etc.). A Trainer setup sketch follows.
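A compact Trainer setup reflecting the arguments above; the checkpoint, dataset, and hyperparameters are placeholders for illustration:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")  # placeholder dataset
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch = 16 x num_gpus x 2
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    bf16=True,                       # needs bf16-capable hardware; use fp16 otherwise
    eval_strategy="epoch",           # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```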
When Trainer is not sufficient (custom loss functions, complex evaluation, non-standard training dynamics), use Accelerate directly for a clean, hardware-agnostic training loop with full control.
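A minimal Accelerate loop sketch, assuming a standard PyTorch model whose forward pass returns an object with a .loss attribute:

```python
from accelerate import Accelerator

def train(model, optimizer, train_dataloader, num_epochs=3):
    accelerator = Accelerator(mixed_precision="bf16")  # handles device placement and DDP wrapping

    # prepare() moves everything to the right device(s) and wraps for distributed training.
    model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)        # assumes the model computes and returns .loss
            loss = outputs.loss
            accelerator.backward(loss)      # replaces loss.backward(); handles scaling/communication
            optimizer.step()
            optimizer.zero_grad()
        accelerator.print(f"epoch {epoch} done, last loss {loss.item():.4f}")
```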
Frequently Asked Questions
What is the difference between AutoModel and a specific model class?
Auto classes (AutoModelForCausalLM, AutoTokenizer) read config.json to determine the correct architecture and instantiate it automatically. This lets you swap models by changing only the name string. Specific classes (BertForSequenceClassification) are for architecture-specific access. Always use Auto classes in production code.
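For example (checkpoint name illustrative), both lines below load the same architecture, but only the Auto version generalises to other checkpoints:

```python
from transformers import AutoModelForSequenceClassification, BertForSequenceClassification

# Auto class: architecture is resolved from config.json, so the checkpoint name can change freely.
model_a = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Specific class: ties the code to BERT; it fails if pointed at, say, a RoBERTa checkpoint.
model_b = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
```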
How does the pipeline API work?
pipeline('text-generation', model='...') bundles tokenisation, inference, and post-processing into one callable. Ideal for prototyping and evaluation. For production, use the tokeniser and model directly for control over batching, device placement, and output handling.
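A quick prototyping sketch with pipeline (model choice illustrative):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

# The pipeline tokenises, runs generate(), and decodes in a single call.
out = generator(
    "Summarise the transformer architecture in one sentence.",
    max_new_tokens=64,
    do_sample=False,
)
print(out[0]["generated_text"])
```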
What is HuggingFace Accelerate?
Accelerate abstracts hardware differences (1 GPU, multi-GPU DDP, TPU, mixed precision) from training code. Wrap model/optimiser/dataloader with accelerator.prepare(). Your script runs unchanged across hardware configurations. It is the backbone of HuggingFace Trainer.
What is PEFT and what methods does it support?
PEFT trains a small number of additional parameters instead of the full model. Supports: LoRA, QLoRA (needs bitsandbytes), Prefix Tuning, Prompt Tuning, IA3, AdaLoRA, LoftQ. Define LoraConfig (target_modules, r, lora_alpha), call get_peft_model(), train, save adapter weights only (hundreds of MB vs tens of GB for full model).
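A LoRA setup sketch following that recipe; the target modules and hyperparameters are typical starting points, not prescriptions:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                      # rank of the low-rank update matrices
    lora_alpha=32,             # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections, typical for Llama-style models
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically a fraction of a percent of the base model

# ... train with Trainer/SFTTrainer as usual, then:
model.save_pretrained("llama-lora-adapter")  # saves only the adapter weights
```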
How do you load a model from the HuggingFace Hub?
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct', torch_dtype=torch.bfloat16, device_map='auto'). device_map='auto' uses accelerate to split the model across available GPUs/CPU automatically. For gated models (Llama, Gemma), set the HF_TOKEN environment variable or call huggingface_hub.login() first.
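Authentication for a gated checkpoint might look like this (sketch; assumes you have already accepted the model licence on the Hub):

```python
import os

import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM

# Either export HF_TOKEN before running, or log in explicitly with a token
# from your Hub account settings (interactive prompt / cached token).
if "HF_TOKEN" not in os.environ:
    login()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # gated: requires an accepted licence on the Hub
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```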