Docker Swarm Architecture

Introduction

When I first heard someone say, “In Swarm, the managers handle the decisions, and the workers carry them out,” it sounded a bit like a scene from a corporate office. And, funny enough, that analogy stuck with me.

I remember my first attempt at setting up a Swarm cluster. I had three nodes and no idea what roles they were supposed to play. I ran a few commands, containers were spinning, and everything seemed fine—until one of the nodes went down and took half the app with it. That’s when I realized: understanding the architecture of Docker Swarm isn't optional, it's essential.

In this chapter, we’ll peel back the curtain and look at what really makes Docker Swarm tick. We’ll cover the manager-worker model, the genius of the Raft consensus algorithm, and how tasks and services keep your containers running like clockwork.

By the end, you’ll not only understand how Swarm works—you’ll start to appreciate why it works that way.

Managers, Workers, and Raft Consensus

Let’s break down the most important part of Docker Swarm’s architecture: the roles of nodes in a cluster.

Managers and Workers: The Two Roles in Every Swarm

In a Docker Swarm cluster, every machine (or VM, or cloud instance) is called a node. These nodes take on one of two roles:

  • Manager nodes: These are the brains of the operation. They make decisions, assign tasks, and maintain the desired state of the cluster.
  • Worker nodes: These are the muscles. They carry out the instructions given by managers and run the actual containers.

Let’s say you're deploying a web app with three services: frontend, backend, and database.

  • The manager decides: “We need two replicas of the frontend, one backend, and one database.”
  • Then it tells the workers: “You, Worker-1, run a frontend container. Worker-2, you run the other one.”

Managers also monitor the health of services. If a container crashes or a worker node goes down, the manager takes action to restore the desired state.
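
To make this concrete, here’s a minimal sketch of how such a cluster comes together (the IP address below is a placeholder for your first machine):

docker swarm init --advertise-addr 192.168.1.10

The machine that runs init becomes the first manager and prints a join command containing a token. Running that command on each additional machine adds it as a worker:

docker swarm join --token <worker-token> 192.168.1.10:2377

From then on, the manager hands out tasks exactly as described above.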

Personal Tip: Always Have an Odd Number of Managers

One lesson I learned the hard way: I once ran only two manager nodes. When one went offline, the cluster stalled, because the remaining manager couldn’t form a majority to make decisions. That brings us to the secret sauce behind Swarm’s decision-making process: the Raft consensus algorithm.

The Raft Consensus Algorithm (Explained for Humans)

Consensus sounds like a fancy computer science term—and it is. But here’s an easy way to understand it:

Imagine five friends trying to decide on a movie. They vote, and at least three need to agree before the choice counts. That’s consensus: the majority rules, even if the other two disagree or can’t be reached.

In Docker Swarm, the manager nodes use a similar system, powered by Raft, to agree on:

  • Cluster state
  • Leadership elections
  • Task assignments

Only one manager at a time is the leader. It’s the one actively assigning tasks. The others are followers, ready to take over if the leader fails.

This ensures:

  • High availability: No single point of failure
  • Consistency: All managers are in sync
  • Fault tolerance: As long as a majority of managers are up, the system continues smoothly
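
You can see this hierarchy for yourself. From any manager, list the nodes in the cluster:

docker node ls

The MANAGER STATUS column shows Leader next to the current leader and Reachable next to the other managers; for worker nodes it stays empty.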

Best Practice: Use 3 or 5 Manager Nodes

Always run an odd number of managers so a majority can still be formed when one fails: with three managers you can lose one, and with five you can lose two. In production, three managers plus several workers is a solid setup.
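
If you start small and need more managers later, you don’t have to rebuild the cluster. From an existing manager, you can promote a worker (the node name here is just an example):

docker node promote worker-2

The opposite command, docker node demote, turns a manager back into a worker, which helps keep the manager count odd as the cluster grows or shrinks.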

How Tasks and Services Work

Now that we understand the roles, let’s look at what these nodes are actually doing, specifically around services and tasks.

What is a “Service” in Docker Swarm?

A service in Swarm is an instruction to run one or more replicas of a containerized application.

For example, you might define a service like this:

docker service create --name webapp --replicas 3 my-web-app:latest

This tells the Swarm: “Run 3 identical containers of my web app. Keep them alive. If any fail, replace them.”

Swarm takes it from there:

  • Schedules the containers across nodes
  • Ensures the number of replicas stays consistent
  • Performs rolling updates if needed

Services abstract away the low-level container management and give you a higher-level, declarative way to manage apps.
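
After running the command above, you can check that Swarm is holding up its end of the deal:

docker service ls

This lists every service with its replica count, so a healthy webapp shows up as 3/3.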

Tasks: The Atomic Units of Work

Every service is made up of multiple tasks.

A task = a single container + its instructions (like ports, volumes, environment variables).

So in our example with --replicas 3, Swarm creates 3 tasks, each of which runs a container somewhere in the cluster.

If a task fails (e.g., the container crashes), the manager notices and creates a new task to replace it. This is the foundation of Swarm’s self-healing capability.
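
You can watch this play out with:

docker service ps webapp

It lists every task for the service, the node it was scheduled on, and its current state, including tasks that were shut down and the replacements Swarm created for them.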

Rolling Updates and Version Control

One of my favorite features of Swarm is how elegantly it handles rolling updates.

Let’s say you have version 1.0 of your app running and you want to update to 1.1. Instead of stopping all containers and redeploying, you simply run:

docker service update --image my-web-app:1.1 webapp

Swarm then:

  • Takes down a few old tasks at a time
  • Replaces them with new ones using the new image
  • Waits for them to be healthy before proceeding

This keeps the service available throughout the update instead of taking everything down at once.
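
The pace of a rollout is configurable, too. As a rough example, the flags below tell Swarm to replace one task at a time and pause ten seconds between replacements (the values are just an illustration):

docker service update --update-parallelism 1 --update-delay 10s --image my-web-app:1.1 webapp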

It’s like changing the tires of a moving car—Swarm just makes it look effortless.

Health Checks and Auto-Recovery

You can also define health checks as part of your Dockerfile or service configuration. If a task fails a health check, the manager spins up a new one.
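
As a quick sketch, a health check can also be attached when the service is created; the endpoint and image name here are made up for illustration:

docker service create --name api --replicas 2 --health-cmd "curl -f http://localhost:8080/health || exit 1" --health-interval 30s --health-retries 3 my-api:latest

If the command keeps failing, the task is marked unhealthy and the manager replaces it.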

I once misconfigured an API service that crashed intermittently. With Swarm, I didn’t even know it was happening at first—the manager detected the crash, removed the unhealthy task, and launched a healthy one.

That’s the kind of automation that saves hours of debugging in production.

Routing Mesh and Load Balancing

One more piece of the puzzle: when you create a service and publish a port, Swarm automatically sets up a routing mesh.

This means:

  • You can send traffic to the published port on any node in the cluster.
  • Swarm routes it to a container for that service, even if the container is running on a different node.

No need to manually configure load balancers or worry about which node is doing what. Swarm handles it all behind the scenes.
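
For example, publishing a port when you create the service is all it takes to join the routing mesh (the names and image mirror the earlier example):

docker service create --name webapp --replicas 3 --publish published=8080,target=80 my-web-app:latest

A request to port 8080 on any node in the swarm now reaches one of the webapp containers, wherever they happen to be running.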

Conclusion

Understanding Docker Swarm's architecture is like learning the inner workings of a clock. At first, it looks like magic—services just run, containers scale, and everything heals itself. But once you understand managers, workers, tasks, and the Raft consensus, it all starts to make sense.

Let’s recap:

  • Managers make decisions; workers carry them out
  • Raft ensures consensus and leadership among managers
  • Services define what should run; tasks are the actual running units
  • Swarm provides self-healing, rolling updates, and a routing mesh with minimal setup

Docker Swarm's architecture is what makes it so resilient, simple, and suitable for real-world projects, especially when you need orchestration without the overhead of Kubernetes.