
Ollama API: A Complete Guide to Local AI with Generate, Embeddings & Model Management

AI development is no longer tied to the cloud. With Ollama, you can run powerful large language models (LLMs) directly on your local machine and interact with them through a simple API.

The Ollama API makes it easy to:

  • Generate text with any open-source model

  • Create embeddings for search and retrieval

  • Manage models (list, pull, create, delete) with simple HTTP calls

In this guide, we'll explore each major API endpoint of Ollama.

What is Ollama?

Ollama is a lightweight platform to run, manage, and interact with open-source LLMs locally on macOS, Linux, and Windows (via WSL). It supports models like LLaMA, Mistral, Gemma, Phi, and more.

Unlike cloud APIs, Ollama runs models locally, giving you:

  • Privacy – your data stays on your machine

  • Zero API costs – no per-token billing

  • Flexibility – swap models easily

Overview of the Ollama API

The Ollama API is a REST interface exposed at http://localhost:11434.

Endpoint                 Purpose
POST /api/generate       Generate text
POST /api/embeddings     Generate embeddings (vector representations)
GET /api/tags            List installed models
POST /api/pull           Download (pull) a new model
POST /api/create         Create a custom model
DELETE /api/delete       Remove a model
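
Before calling any of these endpoints, it is worth confirming that the local server is actually reachable. A minimal sanity check, assuming the default port 11434, simply requests the base URL and prints whatever status the server returns:

import requests

# Quick sanity check: a running Ollama server answers on its base URL
# (default port 11434) with a short plain-text status message.
response = requests.get("http://localhost:11434")
print(response.status_code, response.text)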

1. Generating Text (/api/generate)

import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "gemma:2b",
    "prompt": "Explain quantum computing in simple terms.",
    "stream": False  # return a single JSON object instead of a token stream
}

response = requests.post(url, json=payload)  # requests serializes the payload to JSON
print(response.json()["response"])

Output

Quantum computing uses quantum bits (qubits) which can be both 0 and 1 at the same time...
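
By default, /api/generate streams its answer as newline-delimited JSON, one small chunk per line, which is what the "stream": False flag above switches off. If you would rather print tokens as they are produced, a sketch like the following reads the stream directly:

import requests, json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "gemma:2b",
    "prompt": "Explain quantum computing in simple terms."
}

# Each streamed line is a JSON object carrying a "response" fragment,
# with a "done" flag set on the final message.
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break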

2. Generating Embeddings (/api/embeddings)

Embeddings are numeric vectors representing text, essential for search, clustering, and RAG systems.

url = "http://localhost:11434/api/embeddings"
payload = {
    "model": "gemma:2b",
    "prompt": "Artificial intelligence is transforming industries."
}

response = requests.post(url, json=payload)
data = response.json()
print("Dimensions:", len(data["embedding"]))
print("First 10 numbers:", data["embedding"][:10])

Output

Dimensions: 2048
First 10 numbers: [0.0123, -0.0456, 0.0897, -0.2314, 0.1456, -0.0678, 0.1234, 0.5678, -0.0987, 0.4567]
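
To see why these vectors are useful for search, you can compare two of them with cosine similarity. The sketch below is not an Ollama endpoint; embed and cosine are small helper functions written here for illustration:

import math
import requests

def embed(text):
    # Illustrative helper: wraps the /api/embeddings call shown above.
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "gemma:2b", "prompt": text},
    )
    return r.json()["embedding"]

def cosine(a, b):
    # Cosine similarity: values near 1.0 mean the texts are closely related.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed("Artificial intelligence is transforming industries.")
v2 = embed("AI is changing how businesses operate.")
print("Similarity:", cosine(v1, v2))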

3. Listing Installed Models (/api/tags)

url = "http://localhost:11434/api/tags"
response = requests.get(url)
print(response.json())

Output

{'models': [{'name': 'gemma:2b'}, {'name': 'mistral:7b'}]}
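
The real response carries more metadata per model than the simplified output above; depending on your Ollama version it includes fields such as size and modified_at. A short loop makes the listing easier to read:

response = requests.get("http://localhost:11434/api/tags")
for model in response.json()["models"]:
    # "size" (in bytes) is present in recent Ollama versions;
    # use .get() in case a field is missing in yours.
    print(model["name"], "-", model.get("size", "?"), "bytes")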

4. Pulling (Downloading) a Model (/api/pull)

url = "http://localhost:11434/api/pull"
payload = {"name": "mistral:7b"}

response = requests.post(url, json=payload)
print(response.text)  # newline-delimited JSON progress updates

Output

{"status": "pulling manifest"}
...per-layer download progress...
{"status": "success"}
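
Because the pull endpoint streams newline-delimited JSON, you can parse each line and turn the completed/total byte counts into a simple progress display:

import requests, json

url = "http://localhost:11434/api/pull"
payload = {"name": "mistral:7b"}

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        update = json.loads(line)
        status = update.get("status", "")
        if "total" in update and "completed" in update:
            pct = update["completed"] / update["total"] * 100
            print(f"{status}: {pct:.1f}%")
        else:
            print(status)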

5. Creating a Custom Model (/api/create)

Example Modelfile

FROM gemma:2b
SYSTEM "You are a Shakespearean poet."

Python Code

url = "http://localhost:11434/api/create"
payload = {
    "name": "shakespeare-gemma",
    "modelfile": 'FROM gemma:2b\nSYSTEM "You are a Shakespearean poet."'
}

response = requests.post(url, json=payload)
print(response.text)  # streamed status updates

Output

...intermediate status updates...
{"status": "success"}
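
Once created, the custom model can be called through /api/generate like any other installed model. (Note that the exact fields /api/create accepts, such as "modelfile" versus separate "from" and "system" fields, have changed between Ollama releases, so check the API reference for the version you run.) A quick usage sketch:

url = "http://localhost:11434/api/generate"
payload = {
    "model": "shakespeare-gemma",  # the custom model created above
    "prompt": "Describe a sunrise.",
    "stream": False
}

response = requests.post(url, json=payload)
print(response.json()["response"])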

Real-World Use Cases

  • Local Chatbots

  • Knowledge-based Assistants with private data (RAG) – a minimal sketch follows this list

  • Offline AI-enhanced Apps

  • Custom Fine-Tuned Models in desktop/web apps
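
As a taste of the RAG pattern, the sketch below chains two endpoints: it embeds a few documents, retrieves the one closest to the question, and passes it to the model as context. It reuses the illustrative embed and cosine helpers from the embeddings section:

docs = [
    "Ollama runs large language models locally on your machine.",
    "Photosynthesis converts sunlight into chemical energy.",
]
question = "How can I run an LLM on my own computer?"

# Retrieve: pick the document whose embedding is closest to the question.
doc_vectors = [embed(d) for d in docs]
q_vector = embed(question)
best_doc = max(zip(docs, doc_vectors), key=lambda pair: cosine(q_vector, pair[1]))[0]

# Generate: answer the question with the retrieved document as context.
payload = {
    "model": "gemma:2b",
    "prompt": f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:",
    "stream": False,
}
response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])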

Conclusion

The Ollama API gives you everything you need to build private, local AI workflows:

  • POST /api/generate – text generation

  • POST /api/embeddings – vector embeddings

  • GET /api/tags – list installed models

  • POST /api/pull – download new models

  • POST /api/create – create custom models

  • DELETE /api/delete – remove models

With these endpoints, you can run chatbots, create embeddings, manage models, and deploy full AI applications — all locally on your machine.