Distributed systems often run the same scheduled job on many nodes (k8s pods, app instances, server farms). Without coordination you get duplicate work, duplicate external side-effects, contention on downstream resources, billing surprises and race conditions.
A Distributed Job Locking System ensures at-most-one executor runs a specific job (or job partition) at a time. This article gives a complete, practical design with patterns, diagrams, safe implementations (Redis / SQL / cloud primitives), .NET code you can use, monitoring, edge cases and operational best practices.
Assumptions: you control application servers and the job runner. The system must be resilient to node crashes, network partitions and clock drift.
Goals and safety properties
Mutual exclusion: only one owner at a time for a given lock key.
Liveness: if owner dies, lock eventually becomes available.
Correctness on network partitions: no false notion of ownership that allows two owners to run simultaneously (use fencing tokens or strong consensus when necessary).
Low latency: lock acquire/release is fast.
Extensible: support re-entrant locks, leases, lock renewal, forced release (admin), lock status UI.
Idempotency-friendly: design jobs so retries are safe.
High-level architecture
+-----------+       +-----------+       +-------------+
| Scheduler | <---> | LockStore | <---> | Job Runners |
|  (many)   |       |  (Redis)  |       |   (many)    |
+-----------+       +-----------+       +-------------+
      |                   |
      v                   v
   Admin UI       Monitoring / Metrics
LockStore can be Redis, DB (sp_getapplock), Zookeeper, etcd, DynamoDB conditional writes, Azure Blob leases, or cloud-native primitives.
Basic patterns
1) Single global lock
Single key “job:export:daily” — at-most-one runner executes the job.
2) Partitioned locks (sharded jobs)
Split work by key (e.g., tenantId ranges). Acquire lock per shard job:export:shard:42.
3) Leader election (for continuous background work)
Use lock with renewal to elect a leader that schedules tasks to others.
4) Lease-based locks (recommended)
A lock has TTL (lease). Owner must renew before TTL expiry. If owner dies and fails to renew, lease expires and others can acquire.
5) Fencing tokens for safety (strong correctness)
Lock store returns a monotonically increasing token (fence). When a runner performs external side-effects, it includes the fence token. Downstream systems must reject operations with stale tokens to avoid split-brain issues.
Choosing a lock store
| Store | Pros | Cons | Best use |
|---|---|---|---|
| Redis (SET NX + EX) | Fast, distributed, TTL support, widely used | Single-node Redis is a single point of failure; multi-node needs care (RedLock or Redis Cluster) | Most common |
| RedLock algorithm (Redis) | Safer across multiple Redis masters | Complex to implement correctly | High-safety Redis setup |
| SQL Server sp_getapplock | Leverages existing DB; transactional | Tied to DB; locks held while transaction open | Simpler environments |
| Zookeeper / etcd | Strong consensus-based locks | Infra overhead | Mission-critical coordination |
| DynamoDB conditional writes | Durable, serverless | Cost, per-request fees | AWS serverless |
| Azure Blob Lease | Built-in lease / renew | Blob storage dependency | Azure environments |
| Service Bus Sessions | Useful for message processing | Only works for queue-based jobs | Message-driven patterns |
Core algorithms (short)
Redis simple lease
Acquire: SET lockKey ownerId NX PX <ttl-ms> → success only if key not exists.
Renew: GET lockKey == ownerId then PEXPIRE lockKey <ttl-ms> or SET lockKey ownerId XX PX <ttl-ms>.
Release: delete if owner matches (atomic Lua script).
Use unique ownerId (GUID + hostname + pid + timestamp).
Redis fencing token
Acquire: INCR lock:token → token, then SET lockKey "<ownerId>:<token>" NX PX <ttl-ms>. Include the token in every external side-effect so downstream systems can reject stale tokens (details in section B below).
SQL sp_getapplock
EXEC sp_getapplock @Resource='job:export', @LockMode='Exclusive', @LockOwner='Session', @LockTimeout=0;
Use transaction scope and sp_releaseapplock or close connection.
Why fencing tokens matter
If Node A acquires a lock and is then isolated by a network partition, its lease expires on Redis and Node B acquires the lock. Node A (still partitioned) may continue performing side-effects because it believes it still owns the lock, causing a double effect. If side-effects require a fencing token (a monotonic integer), downstream services reject any operation carrying a token lower than or equal to the last accepted token; only the holder of the freshest token succeeds. This prevents split-brain damage.
.NET implementations
A. Redis-based lock (safe with Lua check)
Below is a lightweight, production-grade pattern using StackExchange.Redis and Lua for atomic release and renew. This is a lease-based lock (ownerId string + TTL) without a fencing token. After this example we’ll present an optional fencing extension.
NuGet: StackExchange.Redis
// Helper class
public sealed class RedisDistributedLock : IAsyncDisposable
{
private readonly IDatabase _db;
private readonly string _key;
private readonly string _ownerId;
private readonly TimeSpan _ttl;
private Timer? _renewTimer;
private bool _hasLock;
private const string ReleaseScript = @"
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('DEL', KEYS[1])
else
return 0
end";
public RedisDistributedLock(IDatabase db, string key, string ownerId, TimeSpan ttl)
{
_db = db;
_key = key;
_ownerId = ownerId;
_ttl = ttl;
}
public async Task<bool> AcquireAsync(int retryCount = 3, TimeSpan? retryDelay = null)
{
retryDelay ??= TimeSpan.FromMilliseconds(200);
for (int i = 0; i < retryCount; i++)
{
var ok = await _db.StringSetAsync(_key, _ownerId, _ttl, when: When.NotExists).ConfigureAwait(false);
if (ok)
{
_hasLock = true;
// schedule renewal at 60% of TTL
_renewTimer = new Timer(async _ => await RenewAsync().ConfigureAwait(false),
null,
TimeSpan.FromMilliseconds(_ttl.TotalMilliseconds * 0.6),
TimeSpan.FromMilliseconds(_ttl.TotalMilliseconds * 0.6));
return true;
}
await Task.Delay(retryDelay.Value).ConfigureAwait(false);
}
return false;
}
private const string RenewScript = @"
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
return 0
end";
public async Task RenewAsync()
{
if (!_hasLock) return;
// Renew only if we still own the lock: the Lua script makes the owner check and PEXPIRE atomic
var renewed = (long)await _db.ScriptEvaluateAsync(RenewScript,
new RedisKey[] { _key },
new RedisValue[] { _ownerId, (long)_ttl.TotalMilliseconds }).ConfigureAwait(false);
if (renewed == 0)
{
// Lease lost: stop renewing; the running job should abort via cooperative cancellation
_hasLock = false;
_renewTimer?.Dispose();
}
}
public async Task ReleaseAsync()
{
if (!_hasLock) return;
await _db.ScriptEvaluateAsync(ReleaseScript, new RedisKey[] { _key }, new RedisValue[] { _ownerId }).ConfigureAwait(false);
_hasLock = false;
_renewTimer?.Dispose();
}
public async ValueTask DisposeAsync()
{
await ReleaseAsync().ConfigureAwait(false);
}
}
Usage
var ownerId = $"{Environment.MachineName}:{Process.GetCurrentProcess().Id}:{Guid.NewGuid()}";
using var muxer = ConnectionMultiplexer.Connect(redisConn);
var db = muxer.GetDatabase();
var jobLock = new RedisDistributedLock(db, "job:daily-export", ownerId, TimeSpan.FromSeconds(30));
if (await jobLock.AcquireAsync())
{
try
{
// run job
}
finally
{
await jobLock.ReleaseAsync();
}
}
else
{
// skip; another node has lock
}
Notes
Use StringSet with NX and PX to atomically acquire with TTL.
Renew with a Lua script (GET == owner, then PEXPIRE), as in RenewAsync above, so the owner check and the expiry extension are atomic. A plain SET key owner PX TTL XX is simpler but can extend a lock that another owner has acquired in the meantime.
B. Redis with fencing token (recommended for side-effects)
This uses an incremental token and lock value including token:
INCR lock:token → token
SET lock:key "<ownerId>:<token>" NX PX <ttl>
On success, owner gets token.
When performing external side-effect, include token (e.g., X-Job-Token: 123456).
Downstream systems maintain lastAcceptedToken per job and reject tokens <= lastAcceptedToken.
Acquire/Release pattern with Node A
// Acquire
var token = db.StringIncrement("job:daily:token");
var lockValue = $"{ownerId}:{token}";
var ok = db.StringSet(lockKey, lockValue, expiry: ttl, when: When.NotExists);
if (!ok) { /* fail */ }
// Later when doing external call
CallExternalService(payload, token);
// Release: only delete when the stored value matches "<ownerId>:<token>"
// (use a Lua script to compare the full value, as in the ReleaseScript above)
Fencing token gives strict ordering. Use this if job does irreversible side-effects (billing, sending money).
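To make the downstream rejection concrete, here is a hedged sketch of the guard a receiving service might apply. It assumes a hypothetical JobFence(JobName, LastAcceptedToken) table and Microsoft.Data.SqlClient; neither is part of the lock itself.
using Microsoft.Data.SqlClient;

// Accept an operation only if its fence token is newer than anything seen before for this job.
static async Task<bool> TryAcceptAsync(SqlConnection conn, string jobName, long fenceToken)
{
    const string sql = @"
        UPDATE JobFence
        SET LastAcceptedToken = @token
        WHERE JobName = @job AND LastAcceptedToken < @token;";

    using var cmd = new SqlCommand(sql, conn);
    cmd.Parameters.AddWithValue("@job", jobName);
    cmd.Parameters.AddWithValue("@token", fenceToken);

    // 1 row updated => token is the freshest seen; 0 rows => stale token, reject the call.
    return await cmd.ExecuteNonQueryAsync() == 1;
}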
C. RedLock (multiple Redis masters)
The RedLock algorithm (by antirez) requires multiple independent Redis masters and performs timed majority SET NX PX calls against them. Use a library such as RedLock.net, which implements the recommended practices. RedLock offers higher safety when the Redis nodes are truly independent and properly configured; use it when you need stronger guarantees.
D. SQL Server sp_getapplock example
If you prefer DB locks:
-- Acquire a lock
DECLARE @rc INT;
EXEC @rc = sp_getapplock @Resource = 'job:daily-export', @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = 0;
IF @rc >= 0
BEGIN
    -- got lock; run job inside same session/transaction
END
ELSE
BEGIN
    -- lock not acquired
END

-- Release
EXEC sp_releaseapplock @Resource = 'job:daily-export', @LockOwner = 'Session';
Pattern: app opens a DB connection and keeps it open for the duration of job (or uses transaction-scoped lock owner). This ties lock to DB session lifetime.
Job lifecycle & renewal strategy
Acquire lock with TTL (e.g., 30s)
Start job and track progress checkpoints in DB/Redis so job can resume if restarted
Before TTL expires, renew lock (in background) — renew must be safe (owner check)
If renew fails (lock lost), stop executing (graceful abort) to avoid split-brain — design job to be interruptible
On completion, perform single Release call (atomic delete-if-owner)
Use heartbeat metric (periodic key job:heartbeat:<owner>), but heartbeat alone is not sufficient for safety
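For the heartbeat in the last step, a minimal sketch (assuming the StackExchange.Redis IDatabase db and the ownerId from section A):
// Publish a heartbeat key periodically; this is for observability only, not safety.
// The key expires on its own if the node dies and stops refreshing it.
await db.StringSetAsync($"job:heartbeat:{ownerId}",
    DateTimeOffset.UtcNow.ToString("O"),
    expiry: TimeSpan.FromMinutes(2));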
Handling long-running jobs and renew failures
Two strategies:
Short tasks + idempotency — better architecture: break big job into many small idempotent tasks processed under separate locks.
Lease renewal + cooperative cancellation — run a background renew loop; if renewal fails, trigger a cancellation token and abort the job work as soon as possible. On abort, persist an incomplete marker so another node can later pick up and resume (see the sketch below).
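A minimal sketch of the second strategy, assuming a renewLease delegate that returns false once the owner check fails (for example, a wrapper around the Lua renew from section A) and a job delegate that checks its CancellationToken between idempotent steps:
using System;
using System.Threading;
using System.Threading.Tasks;

// Background lease renewal that cancels the job as soon as the lease is lost.
static async Task RunWithLeaseAsync(
    Func<Task<bool>> renewLease,            // returns false when we no longer own the lock
    Func<CancellationToken, Task> job,      // must observe the token between idempotent steps
    TimeSpan renewInterval)
{
    using var cts = new CancellationTokenSource();

    var renewer = Task.Run(async () =>
    {
        try
        {
            while (true)
            {
                await Task.Delay(renewInterval, cts.Token);
                if (!await renewLease())
                {
                    cts.Cancel();           // lease lost: stop work before another node takes over
                    return;
                }
            }
        }
        catch (OperationCanceledException) { } // job finished; renewer exits quietly
    });

    try
    {
        await job(cts.Token);
    }
    finally
    {
        cts.Cancel();                       // stop the renewer when the job completes or aborts
        await renewer;
    }
}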
Forced unlock / admin operations
Provide safe admin operations:
Read lock status: GET lockKey → parse ownerId/token, TTL via PTTL lockKey. Show owner and uptime in UI.
Force release: admin command runs Lua to delete key regardless of owner, but log operation and require authorization. For better safety, avoid forcing unless an operator decides; prefer waiting for TTL expiry.
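As a sketch (again assuming StackExchange.Redis and the example key from earlier), the status read and the force release might look like this:
// Read lock status for the admin UI.
var value = await db.StringGetAsync("job:daily-export");      // "ownerId" or "ownerId:token"
var ttl   = await db.KeyTimeToLiveAsync("job:daily-export");  // remaining lease, null if the key has no TTL
Console.WriteLine($"owner={value}, ttl={(ttl?.TotalSeconds ?? 0):F0}s");

// Force release (admin only): unconditional delete; always audit who did it and why.
await db.KeyDeleteAsync("job:daily-export");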
Monitoring & Metrics
Track:
Lock acquire failures (contention)
Lock hold time distribution
Renew failures and forced releases
Job run counts per node
Heartbeat last seen per lock owner
Fence token highest seen per job
Expose metrics via Prometheus or Application Insights.
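A minimal sketch using prometheus-net (an assumption; Application Insights custom metrics work just as well):
using Prometheus;

// Metrics for the lock signals listed above (names are illustrative).
static class LockMetrics
{
    public static readonly Counter AcquireFailures =
        Metrics.CreateCounter("job_lock_acquire_failures_total", "Lock acquisitions lost to contention.");
    public static readonly Counter RenewFailures =
        Metrics.CreateCounter("job_lock_renew_failures_total", "Lease renewals that failed (lock lost).");
    public static readonly Histogram HoldSeconds =
        Metrics.CreateHistogram("job_lock_hold_seconds", "How long locks are held before release.");
}

// Usage: LockMetrics.AcquireFailures.Inc() on a failed acquire;
//        LockMetrics.HoldSeconds.Observe(elapsedSeconds) on release.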
Idempotency and job design
Design jobs with idempotency keys and checkpoints:
Persist job progress (a JobRuns table) with status: Started, Completed, Failed, Cancelled. Include RunId, LockOwner, FenceToken.
External side-effects include runId and fenceToken to dedupe.
Use optimistic concurrency when updating the shared progress record (a conditional-update sketch follows the table below).
Example JobRuns table
CREATE TABLE JobRuns (
RunId UNIQUEIDENTIFIER PRIMARY KEY,
JobName NVARCHAR(200),
LockOwner NVARCHAR(200),
FenceToken BIGINT NULL,
StartedAt DATETIME2,
CompletedAt DATETIME2 NULL,
Status NVARCHAR(50),
LastHeartbeat DATETIME2
);
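One hedged way to apply the optimistic-concurrency advice is a conditional status transition against this table, assuming Microsoft.Data.SqlClient, an open SqlConnection conn and the RunId of the current run:
// Mark a run completed only if it is still Started; 0 rows means another writer
// already changed the record and we should not overwrite it.
const string sql = @"
    UPDATE JobRuns
    SET Status = 'Completed', CompletedAt = SYSUTCDATETIME()
    WHERE RunId = @runId AND Status = 'Started';";

using var cmd = new SqlCommand(sql, conn);
cmd.Parameters.AddWithValue("@runId", runId);
bool won = await cmd.ExecuteNonQueryAsync() == 1;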
Partitioned job locks (sharding)
For large-scale workloads split by key:
Partition keys: tenantId ranges, hash(key) → N shards
Acquire lock per shard: job:process:shard:{shardId}
Execute worker for that shard only — parallelism controlled by number of shards and cluster size
Advantages: avoids big global locks and allows scaling.
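A sketch of the per-shard loop, reusing the RedisDistributedLock class from section A (db and ownerId as in the earlier usage; shardCount and ProcessShardAsync are assumptions for illustration):
// Each node tries every shard and processes only the shards whose locks it wins.
int shardCount = 16; // assumed fixed shard count known to all nodes
for (int shardId = 0; shardId < shardCount; shardId++)
{
    var shardLock = new RedisDistributedLock(db, $"job:process:shard:{shardId}", ownerId, TimeSpan.FromSeconds(30));
    if (!await shardLock.AcquireAsync(retryCount: 1))
        continue; // another node owns this shard

    try
    {
        await ProcessShardAsync(shardId); // hypothetical per-shard worker
    }
    finally
    {
        await shardLock.ReleaseAsync();
    }
}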
Failure modes & mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Node crash holding lock | Lock expires after TTL — new owner acquires | Set TTL reasonably; use fencing tokens for side-effects |
| Clock skew | TTL-based expiry risk | Use server-side TTL (Redis) — clients do not rely on wall clock |
| Network partition | Two nodes think they hold lock | Use fencing tokens or consensus (Zookeeper/etcd) |
| Slow renew | Job may be preempted | Tune TTL > renew interval; renew early (60% TTL) |
| Lock leak (no release) | Lock stays until TTL | Provide admin UI; shorten TTL if safe |
Testing & verification
Unit tests for acquire/renew/release logic (mock Redis).
Integration tests with real Redis cluster; inject network partitions, simulate process crash, ensure at-most-one execution.
Chaos tests: kill job runner mid-job; verify new runner picks up work safely after TTL.
Load tests: simulate many concurrent nodes attempting same lock.
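For example, an integration test for at-most-one acquisition might look like this (a sketch assuming xUnit and a local Redis on localhost:6379, placed inside a test class):
[Fact]
public async Task SecondOwnerCannotAcquireHeldLock()
{
    var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
    var db = muxer.GetDatabase();
    var key = $"test:lock:{Guid.NewGuid()}";

    var first  = new RedisDistributedLock(db, key, "owner-a", TimeSpan.FromSeconds(10));
    var second = new RedisDistributedLock(db, key, "owner-b", TimeSpan.FromSeconds(10));

    Assert.True(await first.AcquireAsync());                  // owner-a wins the lock
    Assert.False(await second.AcquireAsync(retryCount: 1));   // owner-b must be rejected while it is held

    await first.ReleaseAsync();
    Assert.True(await second.AcquireAsync());                 // free again after release
    await second.ReleaseAsync();
}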
Admin UI (Angular) features
List active locks (key, owner, acquiredAt, TTL left, token)
Force release lock (admin only) with confirmation and audit log
View job runs and status (JobRuns table)
Trigger run / manual run with options (dry-run, resume)
Metrics dashboard (contention rates, retries)
Angular snippet for status API
getLocks() {
return this.http.get('/api/locks/active');
}
forceRelease(key: string) {
return this.http.post('/api/locks/release', { key });
}
Best practices / checklist
Use GUID-based strong ownerId (include hostname, pid).
Use short TTL plus aggressive renewal; renew at 50–70% of TTL.
Always use atomic release (Lua script checks value before delete).
Use fencing tokens for destructive external effects.
Prefer small tasks with checkpoints rather than single huge jobs.
Track job runs and persist status to allow resume and auditing.
Implement cooperative cancellation when lease lost.
Monitor metrics and alert on high contention or renew failures.
Document admin release policy and require authorization.
For critical correctness use consensus-based locks (Zookeeper/etcd) or cloud-native strong primitives.
When to use consensus (Zookeeper/etcd) vs Redis
Use Redis for speed and ease of use when fencing tokens + care are sufficient.
Use Zookeeper/etcd if correctness must be guaranteed even under partitions and you need leader election with strong consistency. They follow the CP (consensus) model. Use them when jobs coordinate critical financial workflows or when you need cluster-wide leader election for the control plane.
Example: full acquire + run loop (pseudo .NET)
async Task RunScheduledJobAsync()
{
var ownerId = $"{Environment.MachineName}-{Guid.NewGuid()}";
var lockKey = "job:daily-report";
await using var redisLock = new RedisDistributedLock(db, lockKey, ownerId, TimeSpan.FromSeconds(30));
if (!await redisLock.AcquireAsync()) return; // another node is running
var runId = Guid.NewGuid();
await SaveRunStarted(runId, lockKey, ownerId);
var cts = new CancellationTokenSource();
try
{
// start background renew (RedisDistributedLock handles it)
await DoWorkAsync(cts.Token); // cooperative cancellation: check token frequently
await SaveRunCompleted(runId, success: true);
}
catch (OperationCanceledException)
{
await SaveRunCompleted(runId, success: false, reason: "Lease lost");
}
catch (Exception ex)
{
await SaveRunCompleted(runId, success: false, reason: ex.Message);
throw;
}
finally
{
await redisLock.ReleaseAsync();
}
}
Conclusion
A robust Distributed Job Locking System prevents duplicate execution, preserves correctness and reduces resource contention. For most systems, Redis lease + fencing token + idempotent tasks gives a good balance of performance and safety. For absolute correctness under network partitions choose consensus-based locks (Zookeeper/etcd) or cloud primitives. Always design jobs to be checkpointable and idempotent, monitor lock metrics, and provide safe admin controls.