The LLM engineering stack has matured fast. Certain tools have become de facto standards across UK AI companies — knowing them is a baseline expectation in most job postings. This guide covers what you actually need to know, layer by layer.
LLM APIs: Your Foundation Layer
OpenAI API remains the dominant choice for production LLM applications in the UK. GPT-4o is the current flagship model, offering strong performance across reasoning, coding, and instruction following. GPT-3.5-turbo is the cost-efficient option for less complex tasks. The Assistants API and function calling capabilities are used for agentic applications. If you learn one LLM API, make it OpenAI.
Anthropic's Claude API is the primary alternative. Claude models are particularly strong for long-context tasks (Claude's context window runs up to 200k tokens), careful instruction following, and tasks requiring nuanced reasoning. Many UK companies use both OpenAI and Anthropic to enable model routing — using the cheaper or faster model for simple tasks and the more capable one for complex ones.
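Model routing can be as simple as a heuristic gate in front of your API client. A minimal sketch — the function name, thresholds, and model choices here are illustrative, not a recommendation:

```python
def route_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Pick a model for a request. Thresholds are illustrative: long or
    explicitly complex requests go to the capable model, the rest to the
    cheap one."""
    if needs_reasoning or len(prompt) > 2000:
        return "gpt-4o"       # more capable, more expensive
    return "gpt-4o-mini"      # cheaper and faster for simple tasks
```

In practice teams route on signals like task type, user tier, or a lightweight classifier rather than raw prompt length, but the shape is the same: one function that returns a model name before the call is made.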
Cohere offers embeddings, generation, and re-ranking APIs. Their Embed API is widely used for RAG pipelines because it produces embeddings specifically optimised for retrieval tasks.
For open-source models, Hugging Face is the central repository. Models like Llama 3 (Meta), Mistral, and Phi (Microsoft) can be run locally via Ollama for development, and deployed in production via vLLM or TGI.
Orchestration: LangChain vs LlamaIndex vs Raw
LangChain is the most widely adopted orchestration framework. It provides abstractions for chains (sequences of LLM calls), agents (LLM-driven decision loops), memory (conversation history management), and RAG components. Its breadth is both its strength and weakness — it covers almost everything but the abstractions can obscure what's happening at the API level and make debugging harder. Use LangChain for rapid development; be prepared to replace components with custom code as you scale.
LlamaIndex is more focused on data indexing and retrieval — it's the better choice if RAG is the core of your application. It provides more sophisticated indexing primitives, better handling of complex document structures, and a cleaner abstraction for retrieval pipelines. Many teams use LlamaIndex for the retrieval layer and LangChain for agent and chain orchestration.
Raw API calls (no framework) are the right choice when you have simple requirements, when framework abstractions would complicate rather than simplify your code, or when you need maximum control and debuggability. For straightforward single-turn completions or simple RAG pipelines, raw API calls with custom prompts are often cleaner than framework code.
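"Raw" here means the whole orchestration layer is a request body you assemble yourself. This sketch builds an OpenAI-style chat completion payload as a plain dict so the structure is visible; with the official client you would pass these fields to `client.chat.completions.create(...)`:

```python
def build_chat_request(system: str, user: str, model: str = "gpt-4o") -> dict:
    """Assemble an OpenAI-style chat completion request body by hand."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0,  # low temperature for repeatable pipeline output
    }
```

Everything a framework "chain" does ultimately reduces to constructing and sequencing payloads like this, which is why raw calls are often easier to debug.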
Vector Databases
Pinecone: Fully managed, easy to scale, low operational overhead. Good default for production deployments where you want to minimise infrastructure work. Cost is meaningful at scale.
Weaviate: Open-source with managed option. Built-in hybrid search (vector + BM25 keyword) makes it useful when combining semantic and keyword retrieval. Good if hybrid search is important to your use case.
Qdrant: Open-source, self-hostable, excellent performance. Best choice for teams with data residency requirements, compliance constraints, or cost concerns at scale.
pgvector: PostgreSQL extension. Free, no new service to manage, sufficient for most use cases up to millions of vectors. Best choice if you're already running PostgreSQL.
Fine-Tuning Tools Worth Knowing
- LoRA / QLoRA: Parameter-efficient fine-tuning methods that reduce GPU memory requirements dramatically. Most production fine-tuning uses one of these.
- Axolotl: A framework that simplifies LoRA and QLoRA fine-tuning. Widely used for open-source model fine-tuning.
- Unsloth: Optimises fine-tuning speed and memory usage. Popular for fine-tuning on limited GPU resources.
- OpenAI fine-tuning API: The easiest option for fine-tuning GPT models if you don't want to manage your own infrastructure.
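The memory saving in LoRA comes from training two small low-rank matrices instead of the full weight matrix: a full d × d update has d² trainable parameters, while a rank-r decomposition B·A has only 2·d·r. A quick arithmetic sketch (dimensions chosen for illustration):

```python
def lora_trainable_params(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters for a full d x d weight update vs a rank-r
    LoRA update (A is r x d, B is d x r)."""
    full = d * d
    lora = 2 * d * r
    return full, lora

full, lora = lora_trainable_params(d=4096, r=8)
# For a 4096-dim layer at rank 8, LoRA trains ~0.4% of the full update's
# parameters — which is why it fits on far smaller GPUs.
```

QLoRA pushes this further by quantising the frozen base weights to 4-bit, so only the small adapter matrices stay in higher precision.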
Evaluation
RAGAS: The standard framework for evaluating RAG pipelines. Provides metrics for context precision, context recall, answer faithfulness, and answer relevance. Essential if you're building RAG systems and want to measure quality rigorously.
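To make "faithfulness" concrete: it asks how much of the answer is actually grounded in the retrieved context. RAGAS computes its metrics with LLM judgments, not simple string matching — the sketch below is only a toy token-overlap proxy to illustrate what the metric is measuring, not how RAGAS implements it:

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Toy proxy for faithfulness: fraction of answer tokens that also
    appear in the retrieved context. Illustrative only — RAGAS uses an
    LLM to check whether each claim in the answer is supported."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A fully grounded answer scores near 1.0; an answer that introduces claims absent from the context scores lower.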
DeepEval: A broader LLM evaluation framework covering RAG and general LLM output quality. Supports custom metrics and integrates with CI/CD pipelines.
LLM-as-judge: Using a capable LLM (GPT-4, Claude) to evaluate the output of another LLM. Scalable but requires careful prompt design to avoid systematic biases in the judge model.
Serving
vLLM: The leading open-source serving framework for LLMs. Uses PagedAttention for high throughput and efficient memory use. The default choice for self-hosted LLM serving at scale.
Text Generation Inference (TGI): Hugging Face's production serving framework. Good integration with the Hugging Face model hub.
Ollama: For local development and testing. Easy to run open-source models locally, not designed for production serving at scale.
Monitoring and Observability
LangSmith (from LangChain): Tracing, debugging, and monitoring for LangChain applications. Shows full chain execution traces, prompt inputs/outputs, and latency breakdowns.
Helicone: LLM observability proxy. Logs all API calls, tracks costs, and provides latency analytics without framework lock-in.
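The core of an observability layer like these is wrapping every LLM call to record latency, token usage, and cost. A hand-rolled sketch of the idea — the price constant is illustrative, and the wrapped function's `(text, total_tokens)` return shape is an assumption for the example:

```python
import time
from functools import wraps

CALL_LOG: list[dict] = []

def observe(model: str, usd_per_1k_tokens: float = 0.005):  # illustrative price
    """Decorator that logs latency, tokens, and estimated cost per call.
    Assumes the wrapped function returns (text, total_tokens)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            text, tokens = fn(*args, **kwargs)
            CALL_LOG.append({
                "model": model,
                "latency_s": time.perf_counter() - start,
                "tokens": tokens,
                "cost_usd": tokens / 1000 * usd_per_1k_tokens,
            })
            return text
        return wrapper
    return decorator

@observe(model="gpt-4o")
def fake_llm_call(prompt: str):
    return f"echo: {prompt}", 42  # stand-in for a real API call
```

Hosted tools add the parts worth paying for: persistence, dashboards, per-user cost attribution, and (in LangSmith's case) full chain traces rather than single-call logs.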
Frequently Asked Questions
Do I need to know all of these tools?
No. Core minimum: one LLM API (OpenAI), conceptual understanding of LangChain or LlamaIndex, one vector database (pgvector or Pinecone), and some evaluation exposure. Serving and fine-tuning tools are optional unless your role requires them.
What's the minimum stack to get hired?
Python proficiency, OpenAI or Anthropic API experience, RAG architecture understanding, familiarity with at least one vector database, and some prompt engineering and evaluation exposure.
Is LangChain worth learning?
Yes — it reduces boilerplate significantly and appears in most LLM job postings. Just make sure you understand what it's doing under the hood, not just how to use the abstractions.
Which vector database should I start with?
If you already run PostgreSQL, start with pgvector. If starting fresh and expecting scale, Pinecone is lowest friction. For self-hosting control, Qdrant.
What tools do top UK companies actually use?
Based on public job postings and engineering blogs: OpenAI and Anthropic APIs are most common. LangChain or LlamaIndex appear in most LLM roles. Pinecone, Weaviate, and pgvector all have significant adoption. LangSmith is widely used for observability.