Set Up an AI Gateway for LLMs

Varun Setia
Mar 21

What is MLflow?

MLflow is an open-source platform, created by Databricks and now a widely adopted community project, designed to simplify and manage the entire lifecycle of machine learning (ML) and generative AI applications. It is often described as the largest open-source AI engineering platform for building, debugging, evaluating, monitoring, and deploying:

Traditional ML models
Large Language Models (LLMs)
AI agents and LLM-powered applications

MLflow is vendor-neutral: it works with any cloud or on-prem setup. It helps teams handle the complexity of experimentation, reproducibility, collaboration, production deployment, and cost/governance for modern AI workloads.

Core Components of MLflow

Tracking: Central logging system for experiments. Records parameters, metrics, artifacts, code versions, and datasets in runs grouped under experiments.
MLflow Projects: Standard way to package ML/GenAI code and dependencies (via an MLproject file with Conda, Docker, or virtualenv) into reproducible, shareable units that can run locally or remotely with one command.
MLflow Model Registry: Centralized repository for versioning, staging, lineage tracking, annotations, and lifecycle management of models, with governance and permissions.
Tracing & Observability: Detailed, OpenTelemetry-based logging of LLM chains, agents, tool calls, inputs/outputs, latency, token usage, and errors.
Evaluation: Built-in framework for assessing model/LLM quality using metrics that quantify performance, safety, cost, and alignment.
AI Gateway: Proxy layer for calling multiple LLM providers (OpenAI, Anthropic, Azure, local models, etc.). Adds rate limits, per-user budgets, caching, access control, secret management, and unified logging to protect and control LLM invocations (for example, through the /gateway/.../invocations endpoint used later in this article).
Prompt Management & Optimization: Versioning, storage, tagging, and automated tuning of prompts. Treats prompts as first-class, trackable artifacts, similar to models or code.
Authentication: Supports several authentication schemes (basic HTTP authentication, SSO, and custom plugins) that help secure and manage MLflow.

What is an AI Gateway?

The AI Gateway acts as a centralized, secure proxy for interacting with Large Language Models (LLMs) and other generative AI providers. It solves common enterprise pain points that arise when teams use multiple LLM services: scattered API keys, inconsistent integrations, lack of visibility into usage and costs, security risks from exposed keys, and difficulty switching between or comparing models (for example, in A/B tests).

Unified OpenAI-Compatible API: A single API lets us switch between providers without changing endpoints in application code. Supported providers include OpenAI, Anthropic, Google Gemini, Amazon Bedrock, Azure OpenAI, Mistral, Ollama, and others.
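
To make the provider-switching point concrete, here is a minimal sketch using only the Python standard library. It builds (but does not send) a chat request against the gateway's OpenAI-compatible route. The base URL is the one used in the walkthrough later in this article, and `llm-prod` is a hypothetical second endpoint name used purely for illustration.

```python
import json
import urllib.request

# Base URL taken from the walkthrough in this article; adjust for your server.
GATEWAY_BASE_URL = "http://localhost:5000/gateway/mlflow/v1"


def build_chat_request(endpoint_name: str, prompt: str) -> urllib.request.Request:
    """Build (without sending) an OpenAI-style chat completion request.

    The gateway endpoint name goes in the "model" field, so the URL stays
    the same no matter which provider backs the endpoint.
    """
    payload = {
        "model": endpoint_name,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{GATEWAY_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "llm-dev" is the endpoint created in the walkthrough;
# "llm-prod" is a hypothetical endpoint backed by a different provider.
dev_request = build_chat_request("llm-dev", "Hello!")
prod_request = build_chat_request("llm-prod", "Hello!")

# Both requests target the same URL; only the "model" field differs.
print(dev_request.full_url == prod_request.full_url)
print(json.loads(dev_request.data)["model"], json.loads(prod_request.data)["model"])
```

Switching providers then becomes a matter of changing one string in the request, while the application code and URL stay untouched.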

Credential Management: Provider API keys are stored encrypted in the AI Gateway instead of being scattered across applications.

Usage Tracking & Cost Control: Tracks token usage and spend estimates per endpoint, model, or team, and enforces rate limits and per-user/team budgets and quotas.
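
As a rough illustration of what per-user quota enforcement means, the sketch below tracks token usage per user and rejects requests that would exceed a fixed limit. The `TokenBudget` class and its method are hypothetical names invented for this example; this is not how MLflow implements budgets internally.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Hypothetical per-user token budget; not MLflow's implementation."""

    limit: int
    used: dict = field(default_factory=dict)

    def try_consume(self, user: str, tokens: int) -> bool:
        """Record usage and return True if the user stays within the limit."""
        current = self.used.get(user, 0)
        if current + tokens > self.limit:
            return False  # the request would exceed the budget, so reject it
        self.used[user] = current + tokens
        return True


budget = TokenBudget(limit=1000)
print(budget.try_consume("alice", 600))  # True: 600 of 1000 used
print(budget.try_consume("alice", 600))  # False: would reach 1200
print(budget.try_consume("bob", 600))    # True: budgets are tracked per user
```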

Advanced Traffic Routing

A/B testing: Split traffic (e.g., 70% to GPT-4o, 30% to Claude-3.5).
Fallback chains: If primary fails (rate limit, outage), auto-try backup.
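
Both routing ideas can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the gateway's routing code: the function names, the `gpt-4o` / `claude-3.5` labels, and the `flaky_call` stand-in for a provider call are all assumptions made for the example.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def pick_weighted(routes: Sequence[Tuple[str, float]], rng: random.Random) -> str:
    """Pick one endpoint name according to its traffic weight."""
    names = [name for name, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(names, weights=weights, k=1)[0]


def call_with_fallback(chain: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for name in chain:
        try:
            return call(name)
        except Exception as error:  # e.g. a rate limit or an outage
            last_error = error
    raise RuntimeError("all endpoints in the chain failed") from last_error


# A/B split: about 70% of requests go to one model, 30% to the other.
rng = random.Random(0)
split = [("gpt-4o", 0.7), ("claude-3.5", 0.3)]
picks = [pick_weighted(split, rng) for _ in range(1000)]
print(picks.count("gpt-4o") / len(picks))  # close to 0.7


def flaky_call(name: str) -> str:
    """Stand-in for a real provider call; the primary always fails here."""
    if name == "primary":
        raise TimeoutError("primary is down")
    return f"response from {name}"


print(call_with_fallback(["primary", "backup"], flaky_call))
```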

MLflow provides an AI Gateway that integrates with the rest of its ecosystem, so the same platform can streamline LLM workflows and support core ML workflows.

Walkthrough

Download and start the server

uvx mlflow server

The server is now accessible at http://localhost:5000/

Now click the AI Gateway option. The page that opens lists the setup steps.

Follow the steps shown on that page, or follow along with me.

Stop the existing server.
Run the command pip install "mlflow[genai]"
Start the MLflow server with the command mlflow server --backend-store-uri sqlite:///mlflow.db --host 0.0.0.0 --port 5000

Note: For local development you can skip the encryption passphrase, but for production it is mandatory.

Once these steps are done, the AI Gateway page is ready for use.

Now click the '+ Create endpoint' button.

Here we are setting up a proxy in front of a local Ollama instance running the ministral-3b model. Instead of calling Ollama directly (http://localhost:11434), applications can now call a single, well-governed MLflow endpoint.

Click the 'Create' button.

Once the endpoint is created, click the 'Use' button at the top right to test it. Send a request using the request body on the left side; a successful response confirms that the integration works.

We can also make a curl request:

curl -X POST http://localhost:5000/gateway/llm-dev/mlflow/invocations \
  -H "Content-Type: application/json" \
  -d '{
  "messages": [
    {"role": "user", "content": "Hello I am Varun, how are you?"}
  ]
}'

The same request in Python using the REST API:

import requests

response = requests.post(
    "http://localhost:5000/gateway/llm-dev/mlflow/invocations",
    json={
        "messages": [
            {"role": "user", "content": "Hello I am Varun, how are you?"}
        ]
    }
)
print(response.json())
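
Printing the whole JSON body is fine for a smoke test, but applications usually want just the reply text. The sketch below assumes the response follows the OpenAI chat-completions shape (a `choices` list and an optional `usage` object); the helper names are hypothetical and the sample payload is hand-written for illustration, not real gateway output.

```python
from typing import Any, Dict


def extract_reply(body: Dict[str, Any]) -> str:
    """Return the assistant text from a chat-completions-style response."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def total_tokens(body: Dict[str, Any]) -> int:
    """Return the reported total token count, or 0 if usage is missing."""
    usage = body.get("usage") or {}
    return int(usage.get("total_tokens", 0))


# Hand-written sample payload for illustration; not real gateway output.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello Varun!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

print(extract_reply(sample))
print(total_tokens(sample))
```

With a live server, the dictionary returned by `response.json()` could be passed to these helpers in place of `sample`, provided the endpoint returns this shape.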

The gateway also accepts OpenAI-compatible calls:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/gateway/mlflow/v1",
    api_key="",  # API key not needed, configured server-side
)

messages = [{"role": "user", "content": "Hi my name is Varun Setia, How are you?"}]

response = client.chat.completions.create(
    model="llm-dev",  # Endpoint name as model
    messages=messages,
)
print(response.choices[0].message)

Monitoring: The gateway provides a dashboard and granular request logging for monitoring.

Beyond request logs, we can also track cost and usage.

Security

Network Protection: MLflow 3.5.0+ includes security middleware to protect against DNS rebinding, CORS attacks, and clickjacking. These features are available with the default FastAPI-based tracking server (uvicorn).

Username and password: MLflow supports basic HTTP authentication to enable access control over experiments, registered models, and scorers. Once enabled, any visitor is required to log in before viewing any resource on the Tracking Server.

SSO: You can use SSO (Single Sign-On) to authenticate users to your MLflow instance, by installing a custom plugin or using a reverse proxy.

MLflow also supports custom authentication for cases that the options above do not cover.

For this demo we will set up username and password authentication. First, install the package below:

pip install "mlflow[auth]"

Then stop the existing server and restart it with the configuration below:

set MLFLOW_FLASK_SERVER_SECRET_KEY=my-secret-key

mlflow server --backend-store-uri sqlite:///mlflow.db --host 0.0.0.0 --port 5000 --app-name basic-auth

Requests sent without credentials will now fail:

C:\Users\Varun>curl -X POST "http://localhost:5000/gateway/llm-dev/mlflow/invocations" -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}]}"

Response:

You are not authenticated.
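
To get past this check, a client has to send credentials with each request. The sketch below assumes the basic-auth app accepts standard HTTP Basic credentials on the gateway route. It builds (but does not send) the request with the Python standard library; `my-user` and `my-password` are placeholder credentials, not real defaults.

```python
import base64
import json
import urllib.request


def build_authenticated_request(
    url: str, username: str, password: str, payload: dict
) -> urllib.request.Request:
    """Build (without sending) a POST request with an HTTP Basic auth header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder credentials; replace with a user that exists on your server.
request = build_authenticated_request(
    "http://localhost:5000/gateway/llm-dev/mlflow/invocations",
    "my-user",
    "my-password",
    {"messages": [{"role": "user", "content": "Hello, how are you?"}]},
)
print(request.get_header("Authorization").startswith("Basic "))
```

Passing the request to `urllib.request.urlopen` would send it. In real code, read the credentials from environment variables or a secret store rather than hard-coding them.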

Summary

We covered the following:

What MLflow is
Its core components
The AI Gateway in depth
Local setup walkthrough
Code examples using curl, requests, and the OpenAI client
Basic authentication setup

MLflow is a fully open-source platform, widely adopted by thousands of users and organizations to build and manage their own customized setups. Its open nature, along with support from major cloud providers, makes it a reliable choice for teams that want flexibility and want to avoid being tied to a single vendor. Let me know what you think of its capabilities and usage.