NLP Engineering Skills
The 2026 Skills Guide
NLP engineering underpins virtually every AI product — from chatbots and search to contract analysis and content moderation. This guide covers tokenisation, core NLP tasks, spaCy and HuggingFace, evaluation metrics, and the text preprocessing practices that matter in production.
NLP Tasks and Tooling
| Task | Key Tools | Primary Metric |
|---|---|---|
| Text Classification | sklearn, HuggingFace AutoModelForSequenceClassification | F1, Accuracy, AUC-ROC |
| Named Entity Recognition | spaCy, HuggingFace token-classification pipeline | Entity-level F1 (strict) |
| Semantic Similarity | Sentence Transformers, text-embedding-3-small | Spearman correlation (STS-B benchmark) |
| Text Summarisation | BART, PEGASUS, T5 via HuggingFace | ROUGE-1, ROUGE-2, ROUGE-L |
| Machine Translation | MarianMT, Helsinki-NLP models | BLEU, COMET |
| Question Answering | HuggingFace question-answering pipeline | Exact Match, F1 |
| Relation Extraction | GLiREL, REBEL (end-to-end) | Micro F1 on relation types |
| Text Generation | GPT-family, Llama, Mistral via HuggingFace | Perplexity, task-specific, human eval |
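Most of these tasks are exposed through HuggingFace's pipeline API. A minimal NER sketch using the pipeline's default English checkpoint (the example sentence is arbitrary):

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces into whole-entity spans.
ner = pipeline("token-classification", aggregation_strategy="simple")

print(ner("Apple is opening a new office in London."))
# e.g. [{'entity_group': 'ORG', 'word': 'Apple', ...},
#       {'entity_group': 'LOC', 'word': 'London', ...}]
```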
Tokenisation in Depth
Tokenisation converts raw text into a sequence of tokens (integers) that a model processes. The tokeniser is as important as the model — using the wrong tokeniser, mishandling special tokens, or ignoring maximum sequence lengths are among the most common bugs in NLP systems.
Subword tokenisation algorithms (a comparison sketch follows the list):
- BPE (Byte Pair Encoding) — Used by the GPT family and RoBERTa. Bottom-up: start with characters, merge the most frequent pairs. tiktoken's cl100k_base encoding (GPT-4) operates at byte level, handling any Unicode or binary content without unknown tokens.
- WordPiece — Used by BERT. Similar to BPE but merges the pair that maximises the language model likelihood of the training data rather than the raw frequency. Produces ## prefix for continuation subwords (e.g., ['token', '##isation']).
- SentencePiece / Unigram — SentencePiece with the Unigram algorithm is used by T5; SentencePiece with a BPE variant by Llama 2, Mistral, and Gemma. Language-agnostic: treats the input as a raw sequence of Unicode characters without pre-tokenisation, and handles spaces explicitly as part of tokens (▁ prefix for word-initial subwords). Better suited to multilingual models and languages without clear word boundaries.
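To see these differences concretely, run the same string through one tokeniser from each family via HuggingFace's AutoTokenizer. A minimal sketch, assuming the public checkpoints roberta-base, bert-base-uncased, and t5-small are available from the Hub:

```python
from transformers import AutoTokenizer

text = "Tokenisation handles unseen words."

# One representative checkpoint per tokeniser family.
for name in ["roberta-base", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:20} {tok.tokenize(text)}")

# roberta-base (byte-level BPE) marks word-initial tokens with 'Ġ',
# bert-base-uncased (WordPiece) marks continuations with '##',
# t5-small (SentencePiece/Unigram) marks word-initial pieces with '▁'.
```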
Critical tokenisation pitfalls:
- Never truncate raw text before applying the chat template — apply the template first, tokenise, then check the token count against the model's maximum sequence length (see the sketch after this list).
- BOS (beginning-of-sequence) and EOS tokens must match what the model was trained with — omitting them degrades quality.
- Whitespace handling differs between models: GPT tokenisers treat leading spaces as part of tokens; BERT does not. This affects how you format multi-document inputs.
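A defensive pattern for the first two pitfalls, sketched with HuggingFace's apply_chat_template. The model name is an illustrative choice, and note that model_max_length is a large sentinel value on some checkpoints, in which case you should substitute the real context limit:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [{"role": "user", "content": "Summarise this contract clause: ..."}]

# Apply the chat template first (it inserts BOS/EOS and role markers),
# then count tokens -- never truncate the raw string by characters.
ids = tok.apply_chat_template(messages, add_generation_prompt=True)
if len(ids) > tok.model_max_length:
    raise ValueError(f"{len(ids)} tokens exceeds limit {tok.model_max_length}")
```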
Production NLP with spaCy
spaCy remains the production standard for rule-based and hybrid NLP pipelines that need to process large volumes of text quickly on CPU. Key features (a combined sketch follows the list):
- Pipeline architecture — `nlp = spacy.load("en_core_web_trf")` (transformer-based) or `en_core_web_sm` (statistical, fast). The pipeline chains tokeniser → tagger → parser → NER. `nlp.pipe(texts, batch_size=64)` processes multiple documents efficiently with batching.
- EntityRuler — Adds rule-based NER patterns that fire before or after the statistical NER component. Patterns can be token-based (matching on token attributes: text, POS, shape) or phrase-based via `PhraseMatcher`. Used to guarantee high-precision recognition of domain-specific entities (product codes, company names) that the statistical model misses.
- Custom components — The `@Language.component` and `@Language.factory` decorators add custom logic to the pipeline (custom attribute assignment, business rules, integration with external APIs). Components receive and return a `Doc` object, enabling clean composition.
- Span categorisation — spaCy 3's `spancat` component handles overlapping spans (unlike NER, which requires non-overlapping entities), important for biomedical text and complex legal document analysis.
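A combined sketch, assuming en_core_web_sm is installed. The flag_money_mentions component, the SKU product-code format, and Acme Corp are hypothetical examples:

```python
import spacy
from spacy.language import Language

@Language.component("flag_money_mentions")
def flag_money_mentions(doc):
    # Hypothetical business rule: flag documents mentioning MONEY entities.
    doc.user_data["has_money"] = any(ent.label_ == "MONEY" for ent in doc.ents)
    return doc

nlp = spacy.load("en_core_web_sm")

# Rule-based patterns fire before the statistical NER component,
# which preserves entities that are already set.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    # Token-based pattern for a hypothetical product-code format (SKU + digits)
    {"label": "PRODUCT", "pattern": [{"TEXT": {"REGEX": r"^SKU\d+$"}}]},
    # Phrase pattern matched verbatim
    {"label": "ORG", "pattern": "Acme Corp"},
])
nlp.add_pipe("flag_money_mentions", last=True)

for doc in nlp.pipe(["Acme Corp shipped SKU12345 for $5,000."], batch_size=64):
    print([(ent.text, ent.label_) for ent in doc.ents], doc.user_data["has_money"])
```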
Frequently Asked Questions
What is BPE tokenisation and why do LLMs use subword tokenisation?
BPE iteratively merges the most frequent adjacent symbol pairs, building a subword vocabulary. Subword tokenisation solves two problems: (1) the unknown-word problem — any word can be represented as known subwords, down to individual characters or bytes; (2) the trade-off between vocabulary size and sequence length. GPT-4 uses cl100k_base BPE with 100,277 tokens; Llama 3 uses a 128,256-token vocabulary. WordPiece (BERT) and SentencePiece (T5, Llama 2) are related subword algorithms.
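A round-trip sketch with OpenAI's tiktoken library (the long word is an arbitrary out-of-vocabulary example):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding
print(enc.n_vocab)  # 100277

ids = enc.encode("Rare words like floccinaucinihilipilification still tokenise.")
print(ids)
# Byte-level BPE decodes losslessly -- no unknown tokens.
assert enc.decode(ids) == "Rare words like floccinaucinihilipilification still tokenise."
```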
What is the difference between BLEU and ROUGE?
BLEU: precision-based — modified n-gram precision (the fraction of output n-grams found in the reference) with a brevity penalty to punish overly short outputs. Standard for translation. ROUGE: recall-based — the fraction of reference n-grams found in the output; ROUGE-1/2/L are standard for summarisation. Both measure lexical overlap only. BERTScore uses contextual embeddings for semantic similarity and is preferred for more nuanced evaluation.
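Both metrics are available through HuggingFace's evaluate library; a sketch with an illustrative prediction/reference pair:

```python
import evaluate

preds = ["the cat sat on the mat"]
refs = ["a cat was sitting on the mat"]

rouge = evaluate.load("rouge")  # recall-oriented; reports rouge1/rouge2/rougeL
print(rouge.compute(predictions=preds, references=refs))

bleu = evaluate.load("sacrebleu")  # precision-oriented; expects a list of reference lists
print(bleu.compute(predictions=preds, references=[[r] for r in refs]))
```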
When should you use spaCy vs HuggingFace Transformers?
spaCy: high-speed production NLP on CPU (tokenisation, NER, POS tagging, dependency parsing), rule-based components, thousands of docs/second. HuggingFace: state-of-the-art accuracy on challenging NER, semantic similarity, embeddings, text generation. Common pattern: spaCy for fast preprocessing, HuggingFace for contextual understanding.
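The hybrid pattern in a minimal sketch (the sentiment checkpoint and input text are illustrative choices):

```python
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")  # fast CPU preprocessing and sentence splitting
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

doc = nlp("Onboarding was painless. Support never answered my ticket.")
sentences = [sent.text for sent in doc.sents]
print(clf(sentences))  # one POSITIVE/NEGATIVE label per sentence
```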
What is Named Entity Recognition (NER) and how is it evaluated?
NER identifies and classifies named entities in text (PERSON, ORGANISATION, LOCATION, DATE, MONEY, etc.). It is evaluated with entity-level F1: the harmonic mean of precision (the fraction of predicted entities that are correct) and recall (the fraction of gold entities found). Strict evaluation requires both span and label to match; partial-credit evaluation uses IoU-based span matching. CoNLL-2003 (newswire) and OntoNotes are standard benchmarks.
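Entity-level F1 is commonly computed with seqeval over BIO-tagged sequences, here via HuggingFace's evaluate wrapper (the tag sequences are illustrative):

```python
import evaluate

seqeval = evaluate.load("seqeval")

references  = [["B-PER", "I-PER", "O", "B-ORG", "O"]]
predictions = [["B-PER", "I-PER", "O", "B-LOC", "O"]]  # wrong label -> strict miss

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
```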
What is semantic similarity and how is it implemented?
Semantic similarity measures how close two texts are in meaning. Implementation: embed both texts with an embedding model (a Sentence Transformer such as all-mpnet-base-v2 or paraphrase-MiniLM-L6-v2, or an API model such as text-embedding-3-small), then compute the cosine similarity between the embedding vectors. Sentence Transformer models (SBERT) are trained specifically for this task using siamese networks on paraphrase and NLI datasets; they produce embeddings where semantic similarity maps to vector similarity, unlike general-purpose LLM embeddings.
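A minimal sketch with Sentence Transformers (the model and sentence pair are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

emb = model.encode(
    ["The contract terminates in June.", "The agreement ends mid-year."],
    normalize_embeddings=True,
)
print(util.cos_sim(emb[0], emb[1]).item())  # cosine similarity in [-1, 1]
```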