AI development is no longer tied to the cloud. With Ollama, you can run powerful large language models (LLMs) directly on your local machine and interact with them through a simple API.
The Ollama API makes it easy to:
Generate text with any open-source model
Create embeddings for search and retrieval
Manage models (list, pull, create, delete) with simple HTTP calls
In this guide, we'll explore each major API endpoint of Ollama.
What is Ollama?
Ollama is a lightweight platform for running, managing, and interacting with open-source LLMs locally on macOS, Linux, and Windows. It supports models such as Llama, Mistral, Gemma, Phi, and more.
Unlike cloud APIs, Ollama runs models locally, giving you:
Privacy – your data stays on your machine
Zero API costs – no per-token billing
Flexibility – swap models easily
Overview of the Ollama API
The Ollama API is a REST interface exposed by default at http://localhost:11434. The main endpoints are:
| Endpoint | Purpose |
|---|---|
| POST /api/generate | Generate text |
| POST /api/embeddings | Generate embeddings (vector representations) |
| GET /api/tags | List installed models |
| POST /api/pull | Download (pull) a new model |
| POST /api/create | Create a custom model |
| DELETE /api/delete | Remove a model |
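All of these sit under the same base URL. Before calling any endpoint, it helps to confirm the server is actually up; a minimal sketch, assuming the default port:

```python
import requests

# A running Ollama server answers GET / with the plain text "Ollama is running".
try:
    response = requests.get("http://localhost:11434", timeout=2)
    print(response.text)  # "Ollama is running"
except requests.ConnectionError:
    print("Ollama is not reachable. Start the server with `ollama serve`.")
```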
1. Generating Text (/api/generate)
```python
import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "gemma:2b",
    "prompt": "Explain quantum computing in simple terms.",
    "stream": False,  # return a single JSON object instead of a stream
}

response = requests.post(url, json=payload)  # json= serializes the payload for us
print(response.json()["response"])
```

Output

```
Quantum computing uses quantum bits (qubits) which can be both 0 and 1 at the same time...
```
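Without "stream": False, /api/generate streams its reply as newline-delimited JSON objects, each carrying a "response" fragment and a "done" flag. A sketch of consuming the stream as it arrives:

```python
import json
import requests

url = "http://localhost:11434/api/generate"
payload = {"model": "gemma:2b", "prompt": "Explain quantum computing in simple terms."}

# Each line of the response body is one JSON object; "response" holds the
# next text fragment and "done" is true on the final object.
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk["response"], end="", flush=True)
        if chunk.get("done"):
            print()
            break
```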
2. Generating Embeddings (/api/embeddings)
Embeddings are numeric vectors representing text, essential for search, clustering, and RAG systems.
url = "http://localhost:11434/api/embeddings"
payload = {
"model": "gemma:2b",
"prompt": "Artificial intelligence is transforming industries."
}
response = requests.post(url, data=json.dumps(payload))
data = response.json()
print("Dimensions:", len(data["embedding"]))
print("First 10 numbers:", data["embedding"][:10])
Output
Dimensions: 3072
First 10 numbers: [0.0123, -0.0456, 0.0897, -0.2314, 0.1456, -0.0678, 0.1234, 0.5678, -0.0987, 0.4567]
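To see why these vectors are useful, compare two sentences with cosine similarity. A small sketch; the embed and cosine helpers are just for illustration:

```python
import math
import requests

def embed(text: str) -> list[float]:
    # Illustrative helper: fetch one embedding from the local server.
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "gemma:2b", "prompt": text},
    )
    return response.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = embed("Artificial intelligence is transforming industries.")
v2 = embed("AI is changing how businesses operate.")
print(f"Similarity: {cosine(v1, v2):.3f}")  # closer to 1.0 means more similar
```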
3. Listing Installed Models (/api/tags)
url = "http://localhost:11434/api/tags"
response = requests.get(url)
print(response.json())
Output
{'models': [{'name': 'gemma:2b'}, {'name': 'mistral:7b'}]}
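```

The output above is simplified; the real response includes metadata such as size, digest, and modification time for each model. A sketch that prints each model with its size, assuming the standard size field in bytes:

```python
import requests

response = requests.get("http://localhost:11434/api/tags")
for model in response.json()["models"]:
    size_gb = model.get("size", 0) / 1e9  # "size" is reported in bytes
    print(f"{model['name']}: {size_gb:.1f} GB")
```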
4. Pulling (Downloading) a Model (/api/pull)
url = "http://localhost:11434/api/pull"
payload = {"name": "mistral:7b"}
response = requests.post(url, data=json.dumps(payload))
print(response.text) # shows download progress
Output
{'status': 'success', 'name': 'mistral:7b', 'downloaded': True}
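For multi-gigabyte models you usually want live progress rather than a single final status. Leaving streaming on (the default) yields one JSON object per progress update; a sketch, assuming the standard completed and total byte counters:

```python
import json
import requests

url = "http://localhost:11434/api/pull"
payload = {"name": "mistral:7b"}  # streaming is on by default

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        status = chunk.get("status", "")
        if "completed" in chunk and "total" in chunk:
            pct = 100 * chunk["completed"] / chunk["total"]
            print(f"\r{status}: {pct:.1f}%", end="", flush=True)
        else:
            print(f"\n{status}", end="")
```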
5. Creating a Custom Model (/api/create)
Example Modelfile

```
FROM gemma:2b
SYSTEM "You are a Shakespearean poet."
```
Python Code

```python
url = "http://localhost:11434/api/create"
payload = {
    "name": "shakespeare-gemma",
    "modelfile": 'FROM gemma:2b\nSYSTEM "You are a Shakespearean poet."',
    "stream": False,  # wait for creation to complete
}

response = requests.post(url, json=payload)
print(response.json())
```

Output

```
{'status': 'success'}
```
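Once created, the model is addressed by name like any other. A quick check that the new persona took effect:

```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "shakespeare-gemma",  # the custom model created above
        "prompt": "Describe a sunrise.",
        "stream": False,
    },
)
print(response.json()["response"])  # should answer in Shakespearean style
```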
Real-World Use Cases
- Private chatbots and assistants whose conversations never leave your machine
- Semantic search and RAG pipelines built on /api/embeddings
- Custom personas (like the Shakespearean poet above) packaged as reusable models with /api/create
- Scripted model management: pulling, listing, and pruning models with /api/pull, /api/tags, and /api/delete
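As a taste of how these endpoints compose, here is a minimal retrieval sketch: embed a handful of documents, pick the one closest to a question, and pass it to the generator as context. The document list and prompt wording are illustrative:

```python
import math
import requests

BASE = "http://localhost:11434"

def embed(text: str) -> list[float]:
    response = requests.post(
        f"{BASE}/api/embeddings", json={"model": "gemma:2b", "prompt": text}
    )
    return response.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = [
    "Ollama runs large language models locally.",
    "Paris is the capital of France.",
    "Embeddings map text to numeric vectors.",
]
question = "How can I run a language model on my own machine?"

# Retrieve: find the document most similar to the question.
q_vec = embed(question)
best_doc = max(docs, key=lambda d: cosine(q_vec, embed(d)))

# Generate: answer the question using the retrieved document as context.
prompt = f"Answer using this context:\n{best_doc}\n\nQuestion: {question}"
response = requests.post(
    f"{BASE}/api/generate",
    json={"model": "gemma:2b", "prompt": prompt, "stream": False},
)
print(response.json()["response"])
```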
Conclusion
The Ollama API gives you everything you need to build private, local AI workflows:
POST /api/generate – text generation
POST /api/embeddings – vector embeddings
GET /api/tags – list installed models
POST /api/pull – download new models
POST /api/create – create custom models
DELETE /api/delete – remove models
With these endpoints, you can run chatbots, create embeddings, manage models, and deploy full AI applications — all locally on your machine.