FastAPI Architecture for ML Serving
FastAPI is built on Starlette (ASGI framework) and Pydantic (data validation). It runs under Uvicorn (ASGI server), which handles the async event loop and connection management.
A complete ML serving application has these layers:
- Application startup — Model loading, GPU warm-up, connection pool initialisation. Done once, not per request.
- Request validation — Pydantic models validate and parse incoming JSON. Invalid requests return a 422 Unprocessable Entity with detailed field-level error messages automatically.
- Preprocessing — Transform validated request data into model inputs (tokenisation, feature extraction, normalisation).
- Inference — Run the model. CPU/GPU-bound; must not block the event loop for async endpoints.
- Postprocessing — Transform model output (logits → class labels, token IDs → text, raw scores → structured response).
- Response serialisation — Pydantic's `response_model` parameter validates and serialises the response, excluding unexpected fields and enforcing the response schema.
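A minimal sketch of these layers in a single endpoint; the schema fields and the inline scoring logic are illustrative stand-ins for a real model:

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=10_000)

class PredictResponse(BaseModel):
    label: str
    score: float

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest) -> PredictResponse:
    # Invalid JSON never reaches this point: FastAPI has already
    # returned a 422 with field-level error messages.
    score = min(len(req.text) / 100, 1.0)  # stand-in for real inference
    label = "positive" if score > 0.5 else "negative"
    return PredictResponse(label=label, score=score)
```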
Model Loading, Lifespan, and Dependency Injection
Lifespan context manager (FastAPI 0.93+, replacing the deprecated `startup`/`shutdown` event handlers):
- Define an `async` function decorated with `@contextlib.asynccontextmanager`: load the model before `yield`, clean up after. This ensures the model is loaded exactly once at startup and properly released on shutdown.
- Store the model in `app.state` (`app.state.model = model`) before `yield`. Access it in endpoints via the `Request` object: `request.app.state.model`.
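A minimal lifespan sketch, with `load_model` as a hypothetical stand-in for e.g. `torch.load` or `AutoModel.from_pretrained`:

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

def load_model(name: str):
    # Stand-in for the real loader (torch.load, from_pretrained, ...)
    return {"name": name}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: runs once, before the first request is accepted.
    app.state.model = load_model("my-model")
    yield
    # Shutdown: release GPU memory, close connection pools, etc.
    del app.state.model

app = FastAPI(lifespan=lifespan)
```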
Dependency injection with `Depends()` is the idiomatic way to access shared resources in endpoints:
- Define a dependency function that returns the model: `def get_model(request: Request): return request.app.state.model`.
- Use it in endpoints: `async def predict(data: InputData, model = Depends(get_model))`. FastAPI resolves the dependency and injects it for every request.
- Dependencies can also be used for authentication (validating Bearer tokens), database sessions (yield a SQLAlchemy session and close it after the request), rate limiting, and feature flags.
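A sketch of the dependency pattern; the `predict()` method on the stored model object is an assumption:

```python
from fastapi import Depends, FastAPI, Request
from pydantic import BaseModel

app = FastAPI()

class InputData(BaseModel):
    text: str

def get_model(request: Request):
    # Returns the instance loaded once in the lifespan handler.
    return request.app.state.model

@app.post("/predict")
async def predict(data: InputData, model = Depends(get_model)):
    return {"prediction": model.predict(data.text)}  # hypothetical predict()
```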
Background tasks: FastAPI's `BackgroundTasks` runs a function after the response is sent. Useful for logging inference requests to a database, sending analytics events, or triggering model retraining checks after accumulating enough new data — all without adding to response latency.
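A sketch of the pattern, with a `print` standing in for a real database write:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def log_inference(payload: dict) -> None:
    # Runs only after the response has been sent to the client.
    print("logged:", payload)

@app.post("/predict")
async def predict(data: dict, background_tasks: BackgroundTasks):
    result = {"label": "positive"}  # stand-in for real inference
    background_tasks.add_task(log_inference, {"input": data, "output": result})
    return result  # the client receives this before log_inference runs
```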
Streaming LLM Responses
Streaming is essential for LLM-backed APIs — users see tokens appearing in real time rather than waiting for the full response (which may take 10–60 seconds for long generations). The core pattern:
HuggingFace TextIteratorStreamer
- Create a `TextIteratorStreamer`: it acts as both the output destination for `model.generate()` and an iterator that yields decoded text chunks.
- Run `model.generate()` in a separate thread (since it blocks): `thread = Thread(target=model.generate, kwargs={..., 'streamer': streamer}); thread.start()`.
- Yield tokens from the streamer in a generator function: `for text in streamer: yield text`.
- Return `StreamingResponse(token_generator(), media_type='text/event-stream')` for Server-Sent Events (SSE), which is supported by the browser `EventSource` API and most HTTP clients.
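Putting the steps together: a sketch that loads a small causal LM at module level for brevity (in production it belongs in the lifespan handler); `gpt2` is a placeholder model name:

```python
from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
async def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # model.generate() blocks, so it runs in its own thread
    # while the generator below consumes the streamer.
    Thread(
        target=model.generate,
        kwargs={**inputs, "max_new_tokens": req.max_new_tokens, "streamer": streamer},
        daemon=True,
    ).start()

    def token_generator():
        for text in streamer:
            yield f"data: {text}\n\n"  # one SSE event per decoded chunk
        yield "data: [DONE]\n\n"

    return StreamingResponse(token_generator(), media_type="text/event-stream")
```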
Server-Sent Events format: each event is formatted as `data: {json_payload}\n\n` (two newlines as the event separator). The final event is `data: [DONE]\n\n`, following the OpenAI streaming convention. This allows client-side parsing regardless of how token boundaries fall.
Timeout handling: set Uvicorn's `--timeout-keep-alive` option and consider an `asyncio.timeout()` context (Python 3.11+) around generation so that runaway long generations cannot hold connections indefinitely.
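A sketch of a per-chunk variant using `asyncio.wait_for()` (the same idea applied per token rather than to the whole generation), assuming tokens are relayed through an `asyncio.Queue` with `None` as the end-of-generation sentinel:

```python
import asyncio

TOKEN_TIMEOUT_S = 30  # assumption: abort if no new token arrives within 30 s

async def bounded_stream(queue: asyncio.Queue):
    while True:
        try:
            token = await asyncio.wait_for(queue.get(), timeout=TOKEN_TIMEOUT_S)
        except asyncio.TimeoutError:
            break  # generation stalled; stop holding the connection open
        if token is None:  # end-of-generation sentinel
            break
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # always close the SSE stream cleanly
```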
Production Considerations
- Batching — GPU inference is much more efficient when batched. For high-throughput serving, use dynamic batching: accumulate requests for a few milliseconds before processing them as a batch. Libraries like BentoML and TensorRT-LLM handle this natively; in plain FastAPI, implement it with an asyncio queue and a background batch processor (see the sketch after this list).
- Health and readiness — `/health` returns 200 immediately (liveness probe); `/ready` returns 200 only after model loading is complete (readiness probe). Kubernetes uses these to manage rolling deployments safely.
- Middleware — Add `CORSMiddleware` for browser clients, `GZipMiddleware` for large response payloads, and a custom logging middleware that captures request/response size, latency, and model ID for every inference call.
- Structured error handling — Add exception handlers for common ML errors (OOM, timeout, invalid model state) that return structured JSON error responses with appropriate HTTP status codes rather than unhandled 500 errors with Python tracebacks (see the handler sketch below).
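A minimal dynamic-batching sketch, assuming Python 3.11+ for `asyncio.timeout()`; `run_model_batch` is a stand-in for a real batched forward pass, which you would dispatch to an executor so it doesn't block the loop:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

BATCH_WINDOW_S = 0.01  # accumulate requests for ~10 ms
MAX_BATCH_SIZE = 32

queue: asyncio.Queue = asyncio.Queue()

def run_model_batch(inputs: list) -> list:
    # Stand-in for one batched forward pass on the GPU.
    return [{"ok": True} for _ in inputs]

async def batch_worker():
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        try:
            async with asyncio.timeout(BATCH_WINDOW_S):
                while len(batch) < MAX_BATCH_SIZE:
                    batch.append(await queue.get())
        except TimeoutError:
            pass  # window closed; run whatever accumulated
        results = run_model_batch([data for data, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)

@asynccontextmanager
async def lifespan(app: FastAPI):
    worker = asyncio.create_task(batch_worker())
    yield
    worker.cancel()

app = FastAPI(lifespan=lifespan)

@app.post("/predict")
async def predict(data: dict):
    future = asyncio.get_running_loop().create_future()
    await queue.put((data, future))
    return await future  # resolves when the batch containing this request runs
```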
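And a sketch of one structured error handler, using `torch.cuda.OutOfMemoryError` as a concrete example:

```python
import torch
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(torch.cuda.OutOfMemoryError)
async def oom_handler(request: Request, exc: torch.cuda.OutOfMemoryError):
    torch.cuda.empty_cache()  # best-effort recovery
    return JSONResponse(
        status_code=503,
        content={"error": "model_oom",
                 "detail": "GPU out of memory; retry with a smaller input"},
    )
```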
Frequently Asked Questions
Why is FastAPI the dominant ML model serving framework?
Async support (concurrent request handling), Pydantic validation (type-safe ML API contracts), automatic OpenAPI docs from type hints, and high performance. Define the input schema once as a Pydantic model and you get validation, error messages, serialisation, and Swagger UI documentation for free.
How do you load an ML model at startup?
Use FastAPI's lifespan context manager — an async function decorated with `@asynccontextmanager` that loads the model before `yield` (startup) and cleans up after it (shutdown). Pass it via `FastAPI(lifespan=lifespan)` and store the model in `app.state.model`. Never load the model inside request handlers — that reloads it on every request.
How do you stream LLM tokens from a FastAPI endpoint?
`StreamingResponse` with a generator: use HuggingFace's `TextIteratorStreamer`, run `model.generate()` in a background `Thread`, and yield tokens as they arrive. Return `StreamingResponse(generator(), media_type='text/event-stream')` for SSE. The client receives tokens immediately, not after the full completion.
What is the difference between async and sync endpoints for ML?
Sync endpoints run in a thread pool automatically — they don't block the event loop, but concurrency is capped by the thread pool size. Async endpoints run on the event loop, so CPU-bound inference must be pushed to a thread pool with `loop.run_in_executor()`. Pattern: `async def endpoint(...)` ... `result = await loop.run_in_executor(executor, model.predict, data)`.
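A sketch of the executor pattern, with `heavy_predict` standing in for a blocking `model.predict`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
executor = ThreadPoolExecutor(max_workers=4)  # assumption: tune to your hardware

class Item(BaseModel):
    values: list[float]

def heavy_predict(values: list[float]) -> float:
    # Stand-in for a CPU/GPU-bound model.predict() call.
    return sum(values) / max(len(values), 1)

@app.post("/predict")
async def predict(item: Item):
    loop = asyncio.get_running_loop()
    # Off-load the blocking call so the event loop stays responsive.
    result = await loop.run_in_executor(executor, heavy_predict, item.values)
    return {"result": result}
```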
How do you version ML APIs in FastAPI?
Use API routers with version prefixes: `APIRouter(prefix='/v1')` and `APIRouter(prefix='/v2')`. Include both in the app with `app.include_router()`. Maintain backward compatibility by keeping v1 endpoints functional when deploying v2, and document breaking changes in the OpenAPI schema description. Use `response_model` to enforce output schema stability between deployments.
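A minimal versioning sketch (the response shapes are illustrative):

```python
from fastapi import APIRouter, FastAPI

app = FastAPI()
v1 = APIRouter(prefix="/v1", tags=["v1"])
v2 = APIRouter(prefix="/v2", tags=["v2"])

@v1.get("/predict")
async def predict_v1(text: str):
    return {"label": "positive"}  # legacy response shape stays stable

@v2.get("/predict")
async def predict_v2(text: str):
    return {"label": "positive", "score": 0.93}  # v2 adds a confidence score

app.include_router(v1)
app.include_router(v2)
```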