
Docker Model Runner Adds vLLM for Fast AI Inference

You know that moment when you're running an AI model locally and everything seems perfect—until you realize your laptop is practically on fire and the inference speed is glacially slow? Yeah, I've been there. More times than I'd like to admit. But here's the thing: Docker just made running large language models way more practical, and honestly, I'm pretty excited about it.


Last month, Docker announced that Model Runner now supports vLLM, and this isn't just another "we added a feature" announcement. This is genuinely game-changing stuff for anyone working with AI models, especially if you're tired of choosing between "runs on my laptop" and "actually performs well."

What's Docker Model Runner, Anyway?

Let me back up for a second. If you haven't played with Docker Model Runner yet, here's the quick version: it's Docker's way of making it ridiculously easy to run large language models. Instead of spending hours setting up Python environments, wrestling with CUDA drivers, and debugging why your model refuses to load, you just pull a model like it's a regular Docker image and run it.
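
If you want a quick taste before we go further, the whole flow looks like this (assuming you're on a recent Docker Desktop or Docker CE build with Model Runner enabled; the model tag is just one example from the ai/ namespace on Docker Hub):

docker model pull ai/smollm2
docker model run ai/smollm2 "Write me a haiku about containers"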

When Docker first launched Model Runner, it used llama.cpp as its inference engine. And don't get me wrong—llama.cpp is fantastic for what it does. It's portable, runs on pretty much any hardware, and works with GGUF format models. Perfect for prototyping on your laptop or running models on modest hardware.

But there was always this gap. You could prototype locally with llama.cpp, but when you wanted to scale up for production—when you needed real throughput for serving hundreds or thousands of users—you had to jump to a completely different stack. Usually vLLM. Different setup, different commands, different workflow. It was like learning to code in two different languages just to take your project from dev to prod.

Enter vLLM: The Speed Demon

If you haven't heard of vLLM, think of it as the high-performance sports car of LLM inference engines. While llama.cpp is your reliable daily driver, vLLM is what you bring out when you need serious speed.

The secret sauce behind vLLM is something called PagedAttention, and honestly, it's one of those ideas that makes you go "why didn't anyone think of this before?" You know how operating systems manage memory with virtual memory and paging? The vLLM team looked at how LLMs use memory and thought, "Hey, what if we applied the same concept here?"

Here's why that matters: when an LLM generates text, it stores something called the KV cache—basically, all the key and value tensors from the attention mechanism. For a model like LLaMA-13B, this can eat up 1.7GB just for a single sequence. And traditional systems waste 60-80% of that memory due to fragmentation and over-allocation. It's like reserving an entire parking lot when you only need a few spaces.

PagedAttention fixes this by breaking the KV cache into blocks (like pages in virtual memory) that don't need to be stored contiguously. The result? Near-zero memory waste—we're talking less than 4% waste compared to the 60-80% waste in traditional systems. That efficiency means you can batch more requests together, max out your GPU utilization, and get 2-4x better throughput compared to other systems.
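
To make that 1.7GB figure a little more concrete, here's a rough back-of-the-envelope calculation (assuming an fp16 cache, LLaMA-13B's 40 layers and 5120 hidden dimension, and a 2048-token sequence):

KV cache per token ≈ 2 (keys + values) × 40 layers × 5120 dims × 2 bytes ≈ 0.8 MB
One 2048-token sequence ≈ 0.8 MB × 2048 tokens ≈ 1.6–1.7 GB

Now picture 60-80% of every one of those allocations going to waste, and it's obvious why squeezing more concurrent requests onto the same GPU used to be so hard.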

The Magic: One Workflow, Two Engines

Here's what makes the Docker Model Runner integration brilliant: you don't have to think about which engine you're using. At all.

Let's say you want to run a model. First, install the vLLM backend:

docker model install-runner --backend vllm --gpu cuda

Once the installation finishes, you’re ready to start using it right away:

docker model run ai/smollm2-vllm "Can you read me?" 
Sure, I am ready to read you.

That's it. No configuration files. No specifying which inference engine to use. Docker Model Runner is smart enough to figure it out automatically based on the model format. Pull a GGUF model? It routes to llama.cpp. Pull a safetensors model? It fires up vLLM.
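
For example (the first tag below assumes a GGUF build of the same family is published on Docker Hub; the second is the one we just ran):

docker model pull ai/smollm2        # GGUF packaging, served by llama.cpp
docker model pull ai/smollm2-vllm   # safetensors packaging, served by vLLM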

You can also hit it via API, and again—no mention of which engine you're using:

curl --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "ai/smollm2-vllm",
  "messages": [
    {
      "role": "user",
      "content": "Can you read me?"
    }
  ]
}'
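
Because it follows the OpenAI chat-completions shape, the usual options travel with it. Here's a sketch with streaming turned on (assuming the backend streams server-sent events the way the OpenAI API does):

curl --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "ai/smollm2-vllm",
  "stream": true,
  "messages": [
    {
      "role": "user",
      "content": "Can you read me?"
    }
  ]
}'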

Same Docker commands. Same API format. Same CI/CD pipelines. Whether you're on your laptop with llama.cpp or scaling up production with vLLM, your workflow stays consistent.

Why This Actually Matters

Look, I've built enough AI applications to know that consistency is everything. When you're prototyping, you want something lightweight that runs anywhere. When you're deploying, you need performance. Historically, those two requirements meant completely different stacks.

I remember working on a project where we built the entire prototype using one setup, then had to basically rewrite everything when we moved to production. Different APIs, different deployment process, different debugging tools. It was a mess.

Docker Model Runner with vLLM support eliminates that friction. You can:

  • Prototype locally on your laptop with llama.cpp and GGUF models

  • Test with the same Docker commands on a staging server

  • Deploy to production with vLLM and safetensors models

  • Use the exact same API calls throughout

No rewrites. No surprises. Just scale up when you need to.

Getting Your Hands Dirty

Installing the vLLM backend is straightforward:

docker model install-runner --backend vllm --gpu cuda
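
A quick sanity check never hurts. These two commands (as I understand the current CLI; docker model --help will tell you if yours differs) confirm the runner is up and show what you've pulled:

docker model status
docker model list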

Once that's done, you've got access to a growing collection of vLLM-compatible models on Docker Hub. We're talking about serious models here:

  • ai/qwen3-reranker-vllm: Multilingual reranking model for text retrieval, scoring document relevance across 119 languages (2.5K pulls)

  • ai/snowflake-arctic-embed-l-v2-vllm: Snowflake's Arctic-Embed v2.0 boosts multilingual retrieval and efficiency (2.2K pulls)

  • ai/qwen3-embedding-vllm: Qwen3 Embedding for multilingual text embedding and ranking tasks like retrieval & clustering (5.2K pulls)

  • ai/gemma3-vllm: Google's latest Gemma, small yet strong for chat and generation (9.2K pulls)

  • ai/gpt-oss-vllm: OpenAI's open-weight models designed for powerful reasoning and agentic tasks (3.4K pulls)

  • ai/embeddinggemma-vllm: State-of-the-art text embedding model from Google DeepMind (773 pulls)

  • ai/all-minilm-l6-v2-vllm: Sentence-transformers model that maps sentences & paragraphs to a 384-dimensional vector

  • ai/qwen3-vllm: The latest Qwen LLM, built for top-tier coding, math, reasoning, and language tasks (3.5K pulls)

  • ai/smollm2-vllm: SmolLM2, a lightweight language model optimized for on-device use (2.5K pulls)

Each of these is packaged as an OCI artifact, which means you can push and pull them just like any Docker image. Store them in Docker Hub, your company's private registry, wherever. They're just containers.
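
In practice, moving a model into your own registry looks just like moving an image. A sketch (the registry hostname here is made up, and I'm assuming the tag/push subcommands behave like their image counterparts):

docker model tag ai/gemma3-vllm registry.example.com/ml/gemma3-vllm:prod
docker model push registry.example.com/ml/gemma3-vllm:prod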

Real-World Use Cases

Let me paint a picture of where this really shines.

Scenario 1: The Solo Developer

You're building a side project—maybe a smart documentation tool. You start on your M2 MacBook with llama.cpp, testing ideas and iterating quickly. Your model runs fine for testing. Then your project takes off. You move to a cloud instance with a beefy NVIDIA GPU. With Docker Model Runner, you literally just switch to a safetensors model, and boom—vLLM kicks in. Same code, better performance.

Scenario 2: The Startup

Your company is building an AI-powered customer service chatbot. Development happens on developer laptops (various specs, various OSes). Staging is on a small cluster. Production needs to handle thousands of concurrent users. Docker Model Runner means everyone—from intern to DevOps—uses the same tools. Your CI/CD pipeline? One Dockerfile. One set of commands.

Scenario 3: The Enterprise

You need to run different models for different teams. Some teams need quick responses on modest hardware. Others need maximum throughput for batch processing. Instead of maintaining two separate infrastructure stacks, you use Docker Model Runner. Same monitoring, same orchestration, same security policies. Just different models.

Understanding the Formats: GGUF vs Safetensors

Here's a quick breakdown because format choice matters:

GGUF (GPT-Generated Unified Format) is llama.cpp's native format. It's designed for portability and supports quantization brilliantly. Think of it as the "runs everywhere" format. Great for commodity hardware, fantastic for development, perfect when memory bandwidth is your bottleneck. Everything's in one file, easy to distribute.

Safetensors is Hugging Face's tensor serialization format and the one vLLM consumes natively. It's what you want when you're pushing for maximum performance on serious GPU hardware: built for fast, safe loading, designed for production, and increasingly the default way new models are published.

The beauty of Docker Model Runner? You don't have to choose. Use GGUF models for dev work. Use safetensors models for production. The same commands work for both.
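
And if you're ever unsure which format (and therefore which engine) a given model will use, inspecting it should tell you (assuming the inspect subcommand in current Model Runner builds; docker model --help is the source of truth):

docker model inspect ai/smollm2-vllm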

What's Coming Next

Docker's roadmap for Model Runner is pretty ambitious, and I'm here for it:

WSL2 and Docker Desktop Support: Right now, vLLM works on x86_64 Linux with NVIDIA GPUs. They're actively working on bringing it to Windows via WSL2. As someone who's constantly bouncing between operating systems, I think this is huge. Being able to run the same high-performance inference on Windows dev machines? Yes, please.

DGX Support: For the folks working with serious hardware (NVIDIA DGX systems), Docker is optimizing Model Runner specifically for those environments.

Performance Tuning: The team is honest about areas for improvement. vLLM's startup time is currently slower than llama.cpp's. They're working on that "time-to-first-token" metric to make the development loop even faster.

The Bigger Picture

Here's what really gets me excited about this: Docker is making AI infrastructure portable. Not just "oh, it runs in a container" portable, but truly portable across your entire workflow.

We're at this inflection point with AI where everyone's trying to figure out the right way to deploy and serve models. There are like seventeen different tools for serving LLMs, each with their own quirks and requirements. Docker Model Runner isn't trying to replace all of them—it's trying to give you a consistent interface that works whether you're on your laptop or in a Kubernetes cluster.

And that matters because the easier we make it to deploy AI, the more people can actually use it. Not just ML engineers, but regular developers who just want to add AI features to their apps without becoming infrastructure experts.

Why You Should Care

If you're building anything with AI, here's my take: start with Docker Model Runner. Seriously.

Not because it's perfect—nothing is. But because it gives you flexibility. You can prototype fast, deploy confidently, and scale when needed. You're not locked into one inference engine or one deployment pattern.

Plus, it's getting better fast. Docker's actively developing this, the community is growing, and having vLLM integration means you've got a clear path from "works on my machine" to "handles production load."

Getting Involved

The Docker Model Runner project is open source, and they genuinely want community input. If you're interested in contributing or have ideas:

  • Star the GitHub repo to show support

  • Try it out and open issues if you hit problems

  • Fork it and submit PRs if you want to contribute code

  • Spread the word to other developers

The more people using it and providing feedback, the better it gets for everyone.

Final Thoughts

Look, I've been working with Docker for years, and I've seen them tackle a lot of complex problems. Making containers work. Making distributed systems manageable. Making deployment repeatable.

This vLLM integration in Model Runner feels like the same kind of breakthrough. They're taking something that was fragmented and complicated—running LLMs efficiently across different environments—and making it straightforward.

You can prototype on your laptop in the morning and deploy to production in the afternoon. Same tools, same workflow, just better hardware. That's the promise of Docker Model Runner with vLLM support.

And honestly? That's pretty cool.

So if you're working with LLMs or thinking about adding AI to your projects, give it a shot. Install the vLLM backend, pull a model from Docker Hub, and see what you can build. The barrier to entry just got a whole lot lower, and that means more people can start building interesting stuff.

And isn't that the point of good tooling? Making the complex simple so we can focus on building great things.
