    NLP Engineering Skills
    The 2026 Skills Guide

    NLP engineering underpins virtually every AI product — from chatbots and search to contract analysis and content moderation. This guide covers tokenisation, core NLP tasks, spaCy and HuggingFace, evaluation metrics, and the text preprocessing practices that matter in production.

    NLP Tasks and Tooling

    Task | Key Tools | Primary Metric
    Text Classification | sklearn, HuggingFace AutoModelForSequenceClassification | F1, Accuracy, AUC-ROC
    Named Entity Recognition | spaCy, HuggingFace token-classification pipeline | Entity-level F1 (strict)
    Semantic Similarity | Sentence Transformers, text-embedding-3-small | Spearman correlation (STS-B benchmark)
    Text Summarisation | BART, PEGASUS, T5 via HuggingFace | ROUGE-1, ROUGE-2, ROUGE-L
    Machine Translation | MarianMT, Helsinki-NLP models | BLEU, COMET
    Question Answering | HuggingFace question-answering pipeline | Exact Match, F1
    Relation Extraction | GLiREL, REBEL (end-to-end) | Micro F1 on relation types
    Text Generation | GPT-family, Llama, Mistral via HuggingFace | Perplexity, task-specific metrics, human eval

    Tokenisation in Depth

    Tokenisation converts raw text into a sequence of tokens (integers) that a model processes. The tokeniser is as important as the model — using the wrong tokeniser, mishandling special tokens, or ignoring maximum sequence lengths are among the most common bugs in NLP systems.

    Subword tokenisation algorithms:

    • BPE (Byte Pair Encoding) — Used by GPT family, RoBERTa. Bottom-up: start with characters, merge most frequent pairs. Tiktoken (GPT-4) operates at byte level, handling any Unicode or binary content without unknown tokens.
    • WordPiece — Used by BERT. Similar to BPE but merges the pair that maximises the language model likelihood of the training data rather than the raw frequency. Produces ## prefix for continuation subwords (e.g., ['token', '##isation']).
    • SentencePiece / Unigram — Used by T5, Llama, Mistral, Gemma. Language-agnostic: treats the input as a sequence of Unicode characters without pre-tokenisation. Handles spaces explicitly as part of tokens (▁ prefix for word-initial subwords). Better for multilingual models and languages without clear word boundaries.
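    The BPE merge loop described above can be sketched in a few lines of pure Python. This is a toy trainer over a tiny corpus, not a production tokeniser (real implementations add byte-level fallback, special tokens, and fast merge application):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learn merge rules from a list of words."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # the first merges capture the frequent 'l'+'o' and 'lo'+'w' pairs
```

    Running more merges on a real corpus grows the vocabulary from characters up to whole frequent words, which is exactly the vocabulary-size vs sequence-length trade-off subword tokenisation is designed to balance.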

    Critical tokenisation pitfalls:

    • Never truncate raw text before applying the chat template: apply the template, tokenise the result, then check the token count and truncate at the token level if needed.
    • BOS (beginning-of-sequence) and EOS tokens must match what the model was trained with — omitting them degrades quality.
    • Whitespace handling differs between models: GPT tokenisers treat leading spaces as part of tokens; BERT does not. This affects how you format multi-document inputs.

    Production NLP with spaCy

    spaCy remains the production standard for rule-based and hybrid NLP pipelines that need to process large volumes of text quickly on CPU. Key features:

    • Pipeline architecture — nlp = spacy.load("en_core_web_trf") (transformer-based) or en_core_web_sm (statistical, fast). The pipeline chains: tokeniser → tagger → parser → NER. nlp.pipe(texts, batch_size=64) processes multiple documents efficiently with batching.
    • EntityRuler — Add rule-based NER patterns that fire before or after the statistical NER component. Patterns can be token-based (match on token attributes: text, POS, shape) or phrase-based (exact string matches, tokenised with the pipeline's own tokeniser, as with PhraseMatcher). Used to ensure high-precision recognition of domain-specific entities (product codes, company names) that the statistical model misses.
    • Custom components — @Language.component and @Language.factory decorators add custom logic to the pipeline (custom attribute assignment, business rules, integration with external APIs). Components receive and return a Doc object, enabling clean composition.
    • Span categorisation — spaCy 3's spancat component handles overlapping spans (unlike NER which requires non-overlapping entities), important for biomedical text and complex legal document analysis.
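    A minimal sketch of the EntityRuler flow, assuming spaCy 3 is installed. The blank English pipeline needs no model download, and "Acme Corp" and the product code "XJ-900" are invented examples:

```python
import spacy

# Blank English pipeline: tokeniser only, no statistical components.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # Token-based pattern: matches on lowercased token text.
    {"label": "ORG", "pattern": [{"LOWER": "acme"}, {"LOWER": "corp"}]},
    # Phrase-based pattern: the string is tokenised with the same tokeniser.
    {"label": "PRODUCT", "pattern": "XJ-900"},
])

doc = nlp("Acme Corp shipped the XJ-900 last week.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

    In a full pipeline the same ruler is typically added with nlp.add_pipe("entity_ruler", before="ner") so its high-precision matches take priority over the statistical model's predictions.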

    Frequently Asked Questions

    What is BPE tokenisation and why do LLMs use subword tokenisation?

    BPE iteratively merges the most frequent adjacent character pairs, building a subword vocabulary. Solves: (1) Unknown word problem — any word can be represented as known subwords down to characters. (2) Vocabulary size vs sequence length balance. GPT-4 uses cl100k_base BPE with 100,277 tokens; Llama 3 uses 128,256 tokens. WordPiece (BERT) and SentencePiece (T5, Llama) are related subword algorithms.

    What is the difference between BLEU and ROUGE?

    BLEU: precision-based — fraction of output n-grams found in the reference, with a brevity penalty to punish overly short outputs. Standard for translation. ROUGE: recall-based — fraction of reference n-grams found in the output. ROUGE-1/2/L standard for summarisation. Both measure lexical overlap only. BERTScore uses contextual embeddings for semantic similarity — preferred for nuanced evaluation.
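    The precision/recall distinction can be made concrete with a toy unigram scorer. This is a sketch of BLEU-1 and ROUGE-1 only (no higher-order n-grams, smoothing, or brevity penalty), not a replacement for sacrebleu or rouge-score:

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Clipped count of unigrams shared between candidate and reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    return overlap, sum(cand.values()), sum(ref.values())

def bleu1(candidate, reference):
    overlap, cand_len, _ = unigram_overlap(candidate, reference)
    return overlap / cand_len   # precision: overlap / candidate length

def rouge1(candidate, reference):
    overlap, _, ref_len = unigram_overlap(candidate, reference)
    return overlap / ref_len    # recall: overlap / reference length

ref = "the cat sat on the mat"
cand = "the cat sat"
print(bleu1(cand, ref))   # 1.0 — every output unigram appears in the reference
print(rouge1(cand, ref))  # 0.5 — only half of the reference unigrams are recovered
```

    The short candidate scores perfect unigram precision but poor recall, which is why translation (where outputs should be complete) leans on BLEU and summarisation (where coverage of the reference matters) leans on ROUGE.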

    When should you use spaCy vs HuggingFace Transformers?

    spaCy: high-speed production NLP on CPU (tokenisation, NER, POS tagging, dependency parsing), rule-based components, thousands of docs/second. HuggingFace: state-of-the-art accuracy on challenging NER, semantic similarity, embeddings, text generation. Common pattern: spaCy for fast preprocessing, HuggingFace for contextual understanding.

    What is Named Entity Recognition (NER) and how is it evaluated?

    NER identifies and classifies named entities in text (PERSON, ORGANISATION, LOCATION, DATE, MONEY, etc.). Evaluated with entity-level F1 score: the harmonic mean of precision (fraction of predicted entities that are correct) and recall (fraction of gold entities found). Strict evaluation: both span and label must match. Partial-credit evaluation: IoU-based span matching. CoNLL-2003 (news wire) and OntoNotes are standard benchmarks.
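    Strict entity-level F1 can be computed directly over (start, end, label) tuples. A minimal sketch (in practice seqeval is the standard library for this, working from BIO tag sequences):

```python
def entity_f1(gold, pred):
    """Strict entity-level P/R/F1: a prediction counts only if span AND label match.

    gold, pred: iterables of (start, end, label) tuples.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact (span, label) matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 2, "PERSON"), (5, 7, "ORG"), (9, 10, "DATE")}
pred = {(0, 2, "PERSON"), (5, 7, "LOC")}  # second span is right, label is wrong
p, r, f = entity_f1(gold, pred)
print(p, r, f)  # 0.5, 0.333..., 0.4
```

    Note how the correct span with the wrong label earns no credit under strict matching; IoU-based partial-credit schemes relax exactly this.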

    What is semantic similarity and how is it implemented?

    Semantic similarity measures how similar two texts are in meaning. Implementation: embed both texts with a sentence transformer model (all-mpnet-base-v2, paraphrase-MiniLM-L6-v2, text-embedding-3-small), compute cosine similarity between the embedding vectors. Sentence Transformers (SBERT) are trained specifically for this task using siamese networks on paraphrase and NLI datasets — they produce embeddings where semantic similarity maps to vector similarity, unlike general-purpose LLM embeddings.
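    The final step is plain cosine similarity over the embedding vectors. A sketch with toy 3-dimensional vectors standing in for real sentence-transformer embeddings (which are typically 384–768 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings of three texts.
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_invoice = [0.0, 0.1, 0.9]

# Semantically close texts should score higher than unrelated ones.
print(cosine_similarity(v_cat, v_kitten) > cosine_similarity(v_cat, v_invoice))  # True
```

    In production the `sentence_transformers` library wraps both steps: `model.encode()` produces the vectors and `util.cos_sim()` computes the similarity matrix in one call.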


    Quick Facts

    Demand level
    High
    Difficulty
    Intermediate
    Time to proficiency
    3–6 months

    Key Tools

    spaCy
    HuggingFace
    NLTK
    Sentence Transformers
    BLEU
    ROUGE
    BERTScore
    tiktoken