FastAPI + LLMs: Building a Scalable AI Backend

Abstract

This guide shows how to build a production-ready AI backend using FastAPI and Large Language Models (LLMs). You’ll learn how to design APIs that handle AI prompts, integrate with providers like OpenAI or Mistral, manage performance with caching and streaming, and deploy on scalable infrastructure.

The article is aimed at intermediate-level developers who are familiar with Python and REST APIs and want to move beyond playground experimentation to deploying AI features in real-world applications.

Conceptual Background

FastAPI is a high-performance Python web framework built on Starlette and Pydantic. It supports asynchronous request handlers and validates request data from Python type hints, which suits AI applications that need low latency under many concurrent requests.

LLMs (Large Language Models) are transformer-based neural networks trained to generate human-like text. Developers access them via APIs—examples include OpenAI GPT-4, Anthropic Claude, Mistral, and Cohere.

Combining both technologies enables:

  • Dynamic AI endpoints (chat, summarization, code generation)

  • Model-agnostic backends (switching between providers)

  • Integration with RAG (Retrieval-Augmented Generation) pipelines

Architecture Overview

[Figure: FastAPI LLM backend architecture]

Step-by-Step Walkthrough

1. Setup Your Environment

pip install fastapi uvicorn httpx pydantic openai redis langchain langchain-openai langchain-chroma python-dotenv

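Add environment variables in .env. Note that os.getenv does not read this file by itself: export the variables in your shell or load the file with a tool such as python-dotenv.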

OPENAI_API_KEY=YOUR_API_KEY
REDIS_URL=redis://localhost:6379
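Because os.getenv returns None for a missing variable, a misconfigured deployment would otherwise fail only at the first LLM call. A minimal sketch of a fail-fast settings loader follows. The Settings class and load_settings function are hypothetical names introduced here, and the Redis fallback simply reuses the local URL shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    redis_url: str


def load_settings(env=None) -> Settings:
    """Read required settings, raising early if the API key is missing."""
    # Accept a plain dict for testing; default to the real process environment.
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumption: fall back to the local Redis URL used in this guide.
    redis_url = env.get("REDIS_URL", "redis://localhost:6379")
    return Settings(openai_api_key=api_key, redis_url=redis_url)
```

Calling load_settings() once at startup surfaces a missing key immediately instead of on the first request.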

2. Create a FastAPI App

# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx, os

app = FastAPI(title="FastAPI + LLM Backend")

class PromptRequest(BaseModel):
    prompt: str
    model: str = "gpt-4o-mini"

@app.post("/generate")
async def generate_text(request: PromptRequest):
    # LLM calls often take longer than httpx's default 5-second timeout.
    async with httpx.AsyncClient(timeout=60.0) as client:
        try:
            response = await client.post(
                "https://api.openai.com/v1/chat/completions",
                headers={
                    "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"
                },
                json={
                    "model": request.model,
                    "messages": [{"role": "user", "content": request.prompt}],
                    "temperature": 0.7
                }
            )
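            # Raise on 4xx/5xx so provider errors are not mistaken for a missing "choices" key.
            response.raise_for_status()
            data = response.json()
            return {"output": data["choices"][0]["message"]["content"]}
        except httpx.HTTPStatusError as e:
            # Pass the provider's status code (for example 401 or 429) through to the client.
            raise HTTPException(status_code=e.response.status_code, detail=e.response.text)
        except httpx.HTTPError as e:
            # Network failure or timeout while reaching the provider.
            raise HTTPException(status_code=502, detail=str(e))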

Run the app:

uvicorn main:app --reload

Test with:

curl -X POST http://127.0.0.1:8000/generate -H "Content-Type: application/json" \
  -d '{"prompt":"Write a haiku about FastAPI"}'

3. Add Redis Caching

LLM calls are slow and billed per token, so cache the responses to prompts that repeat.

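import hashlib
import json
import os

# The standalone aioredis package is deprecated; redis-py 4.2+ ships the asyncio client.
import redis.asyncio as aioredis

redis_client = aioredis.from_url(os.getenv("REDIS_URL"), decode_responses=True)

def cache_key(prompt, model):
    # Include the model so the same prompt sent to different models is cached separately.
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

async def get_cached_response(prompt, model):
    if (cached := await redis_client.get(cache_key(prompt, model))):
        return json.loads(cached)
    return None

async def set_cache(prompt, model, data):
    # Entries expire after one hour.
    await redis_client.set(cache_key(prompt, model), json.dumps(data), ex=3600)

Integrate in /generate:

cached = await get_cached_response(request.prompt, request.model)
if cached is not None:
    return {"output": cached}

# Cache miss: call the API as in step 2, then store the generated text.
output = data["choices"][0]["message"]["content"]
await set_cache(request.prompt, request.model, output)
return {"output": output}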
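The cache-aside flow above can be tried without a Redis server. The sketch below is one possible arrangement, not the only one: InMemoryCache is a hypothetical stand-in that mimics the get and set calls used on the Redis client, fake_llm replaces the real API call, and keying on the model as well as the prompt is a choice made in this sketch.

```python
import asyncio
import hashlib
import json


class InMemoryCache:
    """Hypothetical stand-in exposing the get/set calls used on the Redis client."""

    def __init__(self):
        self._store = {}

    async def get(self, key):
        return self._store.get(key)

    async def set(self, key, value, ex=None):
        # `ex` (expiry in seconds) is accepted for interface parity but ignored here.
        self._store[key] = value


def cache_key(prompt: str, model: str) -> str:
    # Hash model and prompt together so each model gets its own entry.
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()


async def cached_generate(cache, call_llm, prompt: str, model: str) -> str:
    key = cache_key(prompt, model)
    cached = await cache.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: skip the LLM call
    output = await call_llm(prompt, model)  # cache miss: make one call
    await cache.set(key, json.dumps(output), ex=3600)
    return output


async def demo():
    calls = 0

    async def fake_llm(prompt, model):
        # Counts invocations so the effect of the cache is visible.
        nonlocal calls
        calls += 1
        return f"echo: {prompt}"

    cache = InMemoryCache()
    first = await cached_generate(cache, fake_llm, "hi", "gpt-4o-mini")
    second = await cached_generate(cache, fake_llm, "hi", "gpt-4o-mini")
    return first, second, calls
```

Running asyncio.run(demo()) issues two identical requests; the second one is served from the cache, so fake_llm runs only once.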

4. Integrate a Vector Database for Context (RAG)

Use a vector store such as Chroma or Pinecone to retrieve relevant context and prepend it to the prompt before calling the LLM.

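import asyncio

# Current LangChain releases ship these classes in separate packages:
# pip install langchain-openai langchain-chroma
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory="./db", embedding_function=embeddings)

async def get_context(query):
    # similarity_search is synchronous; run it in a worker thread so it does not block the event loop.
    docs = await asyncio.to_thread(db.similarity_search, query, k=3)
    return "\n".join(d.page_content for d in docs)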

Modify prompt flow:

context = await get_context(request.prompt)
final_prompt = f"Context:\n{context}\n\nUser query:\n{request.prompt}"
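Retrieved documents can be long, so it may help to cap how much context goes into the prompt. The following is a minimal sketch of that idea. The build_prompt function and its max_context_chars parameter are hypothetical, and a character budget is only a rough proxy for the model's token limit.

```python
def build_prompt(docs: list[str], query: str, max_context_chars: int = 4000) -> str:
    """Join retrieved documents into a context block, stopping at a character budget."""
    selected = []
    used = 0
    for doc in docs:
        # Documents are assumed to arrive most-relevant first, so stop at the
        # first one that would push the context past the budget.
        if used + len(doc) > max_context_chars:
            break
        selected.append(doc)
        used += len(doc)
    context = "\n".join(selected)
    # Same template as the prompt flow in this step.
    return f"Context:\n{context}\n\nUser query:\n{query}"
```

Passing the page_content of each retrieved document as docs keeps the prompt template unchanged while bounding its size.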

5. Add Streaming Responses

For chat interfaces, stream LLM tokens using Server-Sent Events (SSE).

from fastapi.responses import StreamingResponse

@app.post("/stream")
async def stream_response(request: PromptRequest):
    async def stream():
        # The first token can take longer than httpx's default 5-second timeout.
        async with httpx.AsyncClient(timeout=60.0) as client:
            async with client.stream(
                "POST",
                "https://api.openai.com/v1/chat/completions",
                headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
                json={
                    "model": request.model,
                    "messages": [{"role": "user", "content": request.prompt}],
                    "stream": True,
                },
            ) as response:
                async for chunk in response.aiter_text():
                    yield chunk
    return StreamingResponse(stream(), media_type="text/event-stream")
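The endpoint above relays the provider's raw SSE stream, so the client still has to pull the text out of each event. A minimal sketch of that decoding step is shown here, assuming OpenAI-style events of the form "data: {json}" terminated by "data: [DONE]". The extract_deltas name is hypothetical, and the function expects complete lines (for example from httpx's response.aiter_lines()) rather than arbitrary text chunks.

```python
import json


def extract_deltas(lines):
    """Yield text fragments from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                     # end-of-stream sentinel
        event = json.loads(payload)
        choices = event.get("choices") or []
        if not choices:
            continue
        # The first event usually carries only the role, with no content.
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

Joining the yielded fragments reconstructs the full reply, while a chat UI would append each fragment as it arrives.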

Use Cases

  • AI Customer Support Backend: Serve chatbots with real-time LLMs.

  • Document QA Systems: Combine FastAPI, RAG, and embeddings.

  • Coding Assistants: Deploy private GPT endpoints for dev tools.

  • Enterprise Knowledge Portals: Secure internal AI APIs.

Limitations & Considerations

  • Latency: Cold starts and token-by-token generation both add delay; streaming reduces the perceived wait.

  • Rate Limits: Apply exponential backoff or retry queues.

  • Cost Control: Cache aggressively and truncate long histories.

  • Data Privacy: Avoid sending sensitive text to third-party APIs.

  • Model Switching: Put the provider API behind an abstraction layer so the backend stays portable across LLMs.
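The rate-limit advice in the list above can be sketched as a small retry helper with exponential backoff and jitter. This is an illustration under stated assumptions: RateLimitError is a hypothetical exception standing in for an HTTP 429 response, and the retry counts and delays are arbitrary defaults rather than provider recommendations.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical exception standing in for an HTTP 429 from the provider."""


async def with_backoff(call, max_retries=5, base_delay=0.5, max_delay=8.0,
                       sleep=asyncio.sleep):
    """Retry `call` on RateLimitError, doubling the maximum wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries:
                raise                 # out of retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries from concurrent clients apart.
            await sleep(random.uniform(0, delay))
```

The sleep parameter exists so tests can substitute a no-op coroutine; in production the default asyncio.sleep is used.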

Common Fixes

Problem | Likely Cause | Fix
429 Too Many Requests | OpenAI rate limit | Add async sleep/retry
JSON Decode Error | API streaming | Use incremental decoding
Context cutoff | Token overflow | Implement truncation logic
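For the token-overflow case, one possible truncation strategy is to keep the system message and drop the oldest turns first. The sketch below assumes chat messages shaped like the "messages" list sent to the API. The function names are hypothetical, and the four-characters-per-token estimate is a rough assumption, not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about four characters per token for English text.
    return max(1, len(text) // 4)


def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep a leading system message plus the newest messages that fit the budget."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):    # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                     # everything older is dropped as well
        kept.append(message)
        budget -= cost
    return system + kept[::-1]        # restore chronological order
```

For exact limits, the estimate would need to be replaced with the tokenizer that matches the target model.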

Example Workflow JSON

{
  "workflow": {
    "name": "Generate Context-Aware Answer",
    "steps": [
      {"load_context": "from vector db"},
      {"format_prompt": "inject context + user query"},
      {"query_llm": "POST /generate"},
      {"cache_response": "Redis"},
      {"return_output": "to client"}
    ]
  }
}
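One way to read the workflow JSON is as an ordered list of handler names. The sketch below runs those steps in sequence against a shared state dictionary. Every handler body is a hypothetical stand-in (no vector database, LLM, or Redis is contacted), so it only illustrates the control flow.

```python
import asyncio

WORKFLOW = {
    "workflow": {
        "name": "Generate Context-Aware Answer",
        "steps": [
            {"load_context": "from vector db"},
            {"format_prompt": "inject context + user query"},
            {"query_llm": "POST /generate"},
            {"cache_response": "Redis"},
            {"return_output": "to client"},
        ],
    }
}


async def load_context(state):
    state["context"] = "FastAPI is built on Starlette."  # stand-in for a vector search

async def format_prompt(state):
    state["prompt"] = f"Context:\n{state['context']}\n\nUser query:\n{state['query']}"

async def query_llm(state):
    state["output"] = f"(answer based on {len(state['prompt'])} prompt characters)"  # stand-in

async def cache_response(state):
    state["cache"][state["query"]] = state["output"]     # a dict stands in for Redis

async def return_output(state):
    state["response"] = {"output": state["output"]}

HANDLERS = {f.__name__: f for f in
            (load_context, format_prompt, query_llm, cache_response, return_output)}


async def run_workflow(workflow, query):
    state = {"query": query, "cache": {}}
    for step in workflow["workflow"]["steps"]:
        for name in step:             # each step object has one key naming its handler
            await HANDLERS[name](state)
    return state
```

Swapping a stand-in for a real implementation only requires replacing the matching handler; the step order stays in the JSON.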

FAQs

Q1: Can I host this backend on AWS Lambda or Cloud Run?
Yes. On Cloud Run or ECS, run uvicorn inside a container. On AWS Lambda, wrap the app with an ASGI adapter such as Mangum. Lambda cold starts may add latency; prefer Cloud Run or ECS for persistent connections.

Q2: Can I use open-source LLMs locally?
Yes. Swap out the HTTP call for libraries like transformers or ollama running Mistral or Llama 3.

Q3: How do I add authentication?
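Add Depends(get_current_user) to protected routes, where get_current_user is a function you write that validates the bearer token extracted by fastapi.security.OAuth2PasswordBearer.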

Q4: How do I handle multiple models dynamically?
Define a model router that reads from a config mapping model_name → API_URL.
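A minimal sketch of such a router is shown below. The Provider dataclass, MODEL_ROUTES table, and resolve_provider function are hypothetical names, and the Mistral entry (model name, URL, and environment variable) is illustrative only; check each provider's documentation before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    api_url: str
    api_key_env: str  # name of the environment variable holding the key


# Hypothetical routing table mapping model_name -> provider settings.
MODEL_ROUTES = {
    "gpt-4o-mini": Provider(
        "https://api.openai.com/v1/chat/completions", "OPENAI_API_KEY"
    ),
    "mistral-small-latest": Provider(
        "https://api.mistral.ai/v1/chat/completions", "MISTRAL_API_KEY"
    ),
}


def resolve_provider(model_name: str) -> Provider:
    """Return the provider for a model, rejecting names missing from the table."""
    try:
        return MODEL_ROUTES[model_name]
    except KeyError:
        raise ValueError(f"Unsupported model: {model_name}") from None
```

Inside /generate, resolve_provider(request.model) would supply the URL and key name in place of the hard-coded OpenAI values, and an unknown model can be turned into a 400 response.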

Conclusion

FastAPI provides the speed, async support, and modular design needed for AI backends. When combined with LLMs, caching, and RAG, it becomes a full-stack foundation for deploying generative AI features in production.

Building an AI backend takes more than calling an API: it means engineering scalable, reliable, and privacy-conscious AI systems that integrate seamlessly with applications.