Distributed systems often run the same scheduled job on many nodes (k8s pods, app instances, server farms). Without coordination you get duplicate work, duplicate external side-effects, contention on downstream resources, billing surprises and race conditions.
A Distributed Job Locking System ensures at-most-one executor runs a specific job (or job partition) at a time. This article gives a complete, practical design with patterns, diagrams, safe implementations (Redis / SQL / cloud primitives), .NET code you can use, monitoring, edge cases and operational best practices.
Assumptions: you control application servers and the job runner. The system must be resilient to node crashes, network partitions and clock drift.
Goals and safety properties
Mutual exclusion: only one owner at a time for a given lock key.
Liveness: if owner dies, lock eventually becomes available.
Correctness on network partitions: no false notion of ownership that allows two owners to run simultaneously (use fencing tokens or strong consensus when necessary).
Low latency: lock acquire/release is fast.
Extensible: support re-entrant locks, leases, lock renewal, forced release (admin), lock status UI.
Idempotency-friendly: design jobs so retries are safe.
High-level architecture
+-----------+       +-----------+       +-------------+
| Scheduler | <---> | LockStore | <---> | Job Runners |
|  (many)   |       |  (Redis)  |       |   (many)    |
+-----------+       +-----------+       +-------------+
      |                   |
      v                   v
   Admin UI       Monitoring / Metrics
LockStore can be Redis, DB (sp_getapplock), Zookeeper, etcd, DynamoDB conditional writes, Azure Blob leases, or cloud-native primitives.
Basic patterns
1) Single global lock
Single key “job:export:daily” — at-most-one runner executes the job.
2) Partitioned locks (sharded jobs)
Split work by key (e.g., tenantId ranges). Acquire lock per shard job:export:shard:42.
3) Leader election (for continuous background work)
Use lock with renewal to elect a leader that schedules tasks to others.
4) Lease-based locks (recommended)
A lock has TTL (lease). Owner must renew before TTL expiry. If owner dies and fails to renew, lease expires and others can acquire.
5) Fencing tokens for safety (strong correctness)
Lock store returns a monotonically increasing token (fence). When a runner performs external side-effects, it includes the fence token. Downstream systems must reject operations with stale tokens to avoid split-brain issues.
Choosing a lock store
| Store | Pros | Cons | Best use |
|---|---|---|---|
| Redis (SET NX + EX) | Fast, distributed, TTL support, widely used | Single-node Redis is a single point of failure; multi-node needs care (RedLock or Redis Cluster) | Most common |
| RedLock algorithm (Redis) | Safer across multiple Redis masters | Complex to implement correctly | High-safety Redis setup |
| SQL Server sp_getapplock | Leverages existing DB; transactional | Tied to DB; locks held while transaction open | Simpler environments |
| Zookeeper / etcd | Strong consensus-based locks | Infra overhead | Mission-critical coordination |
| DynamoDB conditional writes | Durable, serverless | Cost, per-request fees | AWS serverless |
| Azure Blob Lease | Built-in lease / renew | Blob storage dependency | Azure environments |
| Service Bus Sessions | Useful for message processing | Only works for queue-based jobs | Message-driven patterns |
Core algorithms (short)
Redis simple lease
Acquire: SET lockKey ownerId NX PX <ttl-ms> → success only if key not exists.
Renew: GET lockKey == ownerId then PEXPIRE lockKey <ttl-ms> or SET lockKey ownerId XX PX <ttl-ms>.
Release: delete if owner matches (atomic Lua script).
Use unique ownerId (GUID + hostname + pid + timestamp).
Redis fencing token
Acquire: INCR lock:token → token, then SET lockKey "<ownerId>:<token>" NX PX <ttl-ms>. Include the token in every external side-effect so downstream systems can reject stale tokens (details in section B below).
SQL sp_getapplock
EXEC sp_getapplock @Resource='job:export', @LockMode='Exclusive', @LockOwner='Session', @LockTimeout=0;
Use transaction scope and sp_releaseapplock or close connection.
Why fencing tokens matter
If Node A acquires a lock and is then isolated by a network partition, its lease expires on Redis and Node B acquires the lock. Node A (still partitioned) may continue performing side-effects because it believes it still owns the lock, causing a double effect. If side-effects require a fencing token (a monotonic integer), downstream services reject any operation carrying a token lower than or equal to the last accepted token; only the holder of the freshest token succeeds. This prevents split-brain damage.
.NET implementations
A. Redis-based lock (safe with Lua check)
Below is a lightweight, production-grade pattern using StackExchange.Redis and Lua for atomic release and renew. This is a lease-based lock (ownerId string + TTL) without a fencing token. After this example we’ll present an optional fencing extension.
NuGet: StackExchange.Redis
// Helper class
public sealed class RedisDistributedLock : IAsyncDisposable
{
private readonly IDatabase _db;
private readonly string _key;
private readonly string _ownerId;
private readonly TimeSpan _ttl;
private Timer? _renewTimer;
private bool _hasLock;
private const string ReleaseScript = @"
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('DEL', KEYS[1])
else
return 0
end";
public RedisDistributedLock(IDatabase db, string key, string ownerId, TimeSpan ttl)
{
_db = db;
_key = key;
_ownerId = ownerId;
_ttl = ttl;
}
public async Task<bool> AcquireAsync(int retryCount = 3, TimeSpan? retryDelay = null)
{
retryDelay ??= TimeSpan.FromMilliseconds(200);
for (int i = 0; i < retryCount; i++)
{
var ok = await _db.StringSetAsync(_key, _ownerId, _ttl, when: When.NotExists).ConfigureAwait(false);
if (ok)
{
_hasLock = true;
// schedule renewal at 60% of TTL
_renewTimer = new Timer(async _ => await RenewAsync().ConfigureAwait(false),
null,
TimeSpan.FromMilliseconds(_ttl.TotalMilliseconds * 0.6),
TimeSpan.FromMilliseconds(_ttl.TotalMilliseconds * 0.6));
return true;
}
await Task.Delay(retryDelay.Value).ConfigureAwait(false);
}
return false;
}
private const string RenewScript = @"
if redis.call('GET', KEYS[1]) == ARGV[1] then
return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
return 0
end";
public async Task RenewAsync()
{
if (!_hasLock) return;
// Renew only if we still own the lock: the Lua script makes the owner check and PEXPIRE atomic
var renewed = (long)await _db.ScriptEvaluateAsync(RenewScript,
new RedisKey[] { _key },
new RedisValue[] { _ownerId, (long)_ttl.TotalMilliseconds }).ConfigureAwait(false);
if (renewed == 0)
{
// Lease lost: stop renewing; the running job should abort via cooperative cancellation
_hasLock = false;
_renewTimer?.Dispose();
}
}
public async Task ReleaseAsync()
{
if (!_hasLock) return;
await _db.ScriptEvaluateAsync(ReleaseScript, new RedisKey[] { _key }, new RedisValue[] { _ownerId }).ConfigureAwait(false);
_hasLock = false;
_renewTimer?.Dispose();
}
public async ValueTask DisposeAsync()
{
await ReleaseAsync().ConfigureAwait(false);
}
}
Usage
var ownerId = $"{Environment.MachineName}:{Process.GetCurrentProcess().Id}:{Guid.NewGuid()}";
using var muxer = ConnectionMultiplexer.Connect(redisConn);
var db = muxer.GetDatabase();
var jobLock = new RedisDistributedLock(db, "job:daily-export", ownerId, TimeSpan.FromSeconds(30));
if (await jobLock.AcquireAsync())
{
try
{
// run job
}
finally
{
await jobLock.ReleaseAsync();
}
}
else
{
// skip; another node has lock
}
Notes
Use StringSet with NX and PX to atomically acquire with TTL.
Renew with a Lua script (GET == owner, then PEXPIRE), as in RenewAsync above, so the owner check and the expiry extension are atomic. A plain SET key owner PX TTL XX is simpler but can extend a lock that another owner has acquired in the meantime.
B. Redis with fencing token (recommended for side-effects)
This uses an incremental token and lock value including token:
INCR lock:token → token
SET lock:key "<ownerId>:<token>" NX PX <ttl>
On success, owner gets token.
When performing external side-effect, include token (e.g., X-Job-Token: 123456).
Downstream systems maintain lastAcceptedToken per job and reject tokens <= lastAcceptedToken.
Acquire/Release pattern with Node A
// Acquire
var token = db.StringIncrement("job:daily:token");
var lockValue = $"{ownerId}:{token}";
var ok = db.StringSet(lockKey, lockValue, expiry: ttl, when: When.NotExists);
if (!ok) { /* fail */ }
// Later when doing external call
CallExternalService(payload, token);
// Release: only delete when the stored value matches "<ownerId>:<token>"
// (use a Lua script to compare the full value, as in the ReleaseScript above)
Fencing token gives strict ordering. Use this if job does irreversible side-effects (billing, sending money).
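To make the downstream rejection concrete, here is a hedged sketch of the guard a receiving service might apply. It assumes a hypothetical JobFence(JobName, LastAcceptedToken) table and Microsoft.Data.SqlClient; neither is part of the lock itself.
using Microsoft.Data.SqlClient;

// Accept an operation only if its fence token is newer than anything seen before for this job.
static async Task<bool> TryAcceptAsync(SqlConnection conn, string jobName, long fenceToken)
{
    const string sql = @"
        UPDATE JobFence
        SET LastAcceptedToken = @token
        WHERE JobName = @job AND LastAcceptedToken < @token;";

    using var cmd = new SqlCommand(sql, conn);
    cmd.Parameters.AddWithValue("@job", jobName);
    cmd.Parameters.AddWithValue("@token", fenceToken);

    // 1 row updated => token is the freshest seen; 0 rows => stale token, reject the call.
    return await cmd.ExecuteNonQueryAsync() == 1;
}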
C. RedLock (multiple Redis masters)
The RedLock algorithm (by antirez) requires multiple independent Redis masters and performs timed majority SET NX PX calls against them. Use a library such as RedLock.net, which implements the recommended practices. RedLock offers higher safety when the Redis nodes are truly independent and properly configured; use it when you need stronger guarantees.
D. SQL Server sp_getapplock example
If you prefer DB locks:
-- Acquire a lock
DECLARE @rc INT;
EXEC @rc = sp_getapplock @Resource = 'job:daily-export', @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = 0;
IF @rc >= 0
BEGIN
    -- got lock; run job inside same session/transaction
END
ELSE
BEGIN
    -- lock not acquired
END

-- Release
EXEC sp_releaseapplock @Resource = 'job:daily-export', @LockOwner = 'Session';
Pattern: app opens a DB connection and keeps it open for the duration of job (or uses transaction-scoped lock owner). This ties lock to DB session lifetime.
Job lifecycle & renewal strategy
Acquire lock with TTL (e.g., 30s)
Start job and track progress checkpoints in DB/Redis so job can resume if restarted
Before TTL expires, renew lock (in background) — renew must be safe (owner check)
If renew fails (lock lost), stop executing (graceful abort) to avoid split-brain — design job to be interruptible
On completion, perform single Release call (atomic delete-if-owner)
Use heartbeat metric (periodic key job:heartbeat:<owner>), but heartbeat alone is not sufficient for safety
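For the heartbeat in the last step, a minimal sketch (assuming the StackExchange.Redis IDatabase db and the ownerId from section A):
// Publish a heartbeat key periodically; this is for observability only, not safety.
// The key expires on its own if the node dies and stops refreshing it.
await db.StringSetAsync($"job:heartbeat:{ownerId}",
    DateTimeOffset.UtcNow.ToString("O"),
    expiry: TimeSpan.FromMinutes(2));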
Handling long-running jobs and renew failures
Two strategies:
Short tasks + idempotency — better architecture: break big job into many small idempotent tasks processed under separate locks.
Lease renewal + cooperative cancellation — run a background renew loop; if renewal fails, trigger a cancellation token and abort the job work as soon as possible. On abort, persist an incomplete marker so another node can later pick up and resume (see the sketch below).
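A minimal sketch of the second strategy, assuming a renewLease delegate that returns false once the owner check fails (for example, a wrapper around the Lua renew from section A) and a job delegate that checks its CancellationToken between idempotent steps:
using System;
using System.Threading;
using System.Threading.Tasks;

// Background lease renewal that cancels the job as soon as the lease is lost.
static async Task RunWithLeaseAsync(
    Func<Task<bool>> renewLease,            // returns false when we no longer own the lock
    Func<CancellationToken, Task> job,      // must observe the token between idempotent steps
    TimeSpan renewInterval)
{
    using var cts = new CancellationTokenSource();

    var renewer = Task.Run(async () =>
    {
        try
        {
            while (true)
            {
                await Task.Delay(renewInterval, cts.Token);
                if (!await renewLease())
                {
                    cts.Cancel();           // lease lost: stop work before another node takes over
                    return;
                }
            }
        }
        catch (OperationCanceledException) { } // job finished; renewer exits quietly
    });

    try
    {
        await job(cts.Token);
    }
    finally
    {
        cts.Cancel();                       // stop the renewer when the job completes or aborts
        await renewer;
    }
}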
Forced unlock / admin operations
Provide safe admin operations:
Read lock status: GET lockKey → parse ownerId/token, TTL via PTTL lockKey. Show owner and uptime in UI.
Force release: admin command runs Lua to delete key regardless of owner, but log operation and require authorization. For better safety, avoid forcing unless an operator decides; prefer waiting for TTL expiry.
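As a sketch (again assuming StackExchange.Redis and the example key from earlier), the status read and the force release might look like this:
// Read lock status for the admin UI.
var value = await db.StringGetAsync("job:daily-export");      // "ownerId" or "ownerId:token"
var ttl   = await db.KeyTimeToLiveAsync("job:daily-export");  // remaining lease, null if the key has no TTL
Console.WriteLine($"owner={value}, ttl={(ttl?.TotalSeconds ?? 0):F0}s");

// Force release (admin only): unconditional delete; always audit who did it and why.
await db.KeyDeleteAsync("job:daily-export");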
Monitoring & Metrics
Track:
Lock acquire failures (contention)
Lock hold time distribution
Renew failures and forced releases
Job run counts per node
Heartbeat last seen per lock owner
Fence token highest seen per job
Expose metrics via Prometheus or Application Insights.
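A minimal sketch using prometheus-net (an assumption; Application Insights custom metrics work just as well):
using Prometheus;

// Metrics for the lock signals listed above (names are illustrative).
static class LockMetrics
{
    public static readonly Counter AcquireFailures =
        Metrics.CreateCounter("job_lock_acquire_failures_total", "Lock acquisitions lost to contention.");
    public static readonly Counter RenewFailures =
        Metrics.CreateCounter("job_lock_renew_failures_total", "Lease renewals that failed (lock lost).");
    public static readonly Histogram HoldSeconds =
        Metrics.CreateHistogram("job_lock_hold_seconds", "How long locks are held before release.");
}

// Usage: LockMetrics.AcquireFailures.Inc() on a failed acquire;
//        LockMetrics.HoldSeconds.Observe(elapsedSeconds) on release.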
Idempotency and job design
Design jobs with idempotency keys and checkpoints:
Persist job progress (a JobRuns table) with status: Started, Completed, Failed, Cancelled. Include RunId, LockOwner, FenceToken.
External side-effects include runId and fenceToken to dedupe.
Use optimistic concurrency when updating the shared progress record (a conditional-update sketch follows the table below).
Example JobRuns table
CREATE TABLE JobRuns (
RunId UNIQUEIDENTIFIER PRIMARY KEY,
JobName NVARCHAR(200),
LockOwner NVARCHAR(200),
FenceToken BIGINT NULL,
StartedAt DATETIME2,
CompletedAt DATETIME2 NULL,
Status NVARCHAR(50),
LastHeartbeat DATETIME2
);
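One hedged way to apply the optimistic-concurrency advice is a conditional status transition against this table, assuming Microsoft.Data.SqlClient, an open SqlConnection conn and the RunId of the current run:
// Mark a run completed only if it is still Started; 0 rows means another writer
// already changed the record and we should not overwrite it.
const string sql = @"
    UPDATE JobRuns
    SET Status = 'Completed', CompletedAt = SYSUTCDATETIME()
    WHERE RunId = @runId AND Status = 'Started';";

using var cmd = new SqlCommand(sql, conn);
cmd.Parameters.AddWithValue("@runId", runId);
bool won = await cmd.ExecuteNonQueryAsync() == 1;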
Partitioned job locks (sharding)
For large-scale workloads split by key:
Partition keys: tenantId ranges, hash(key) → N shards
Acquire lock per shard: job:process:shard:{shardId}
Execute worker for that shard only — parallelism controlled by number of shards and cluster size
Advantages: avoids big global locks and allows scaling.
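A sketch of the per-shard loop, reusing the RedisDistributedLock class from section A (db and ownerId as in the earlier usage; shardCount and ProcessShardAsync are assumptions for illustration):
// Each node tries every shard and processes only the shards whose locks it wins.
int shardCount = 16; // assumed fixed shard count known to all nodes
for (int shardId = 0; shardId < shardCount; shardId++)
{
    var shardLock = new RedisDistributedLock(db, $"job:process:shard:{shardId}", ownerId, TimeSpan.FromSeconds(30));
    if (!await shardLock.AcquireAsync(retryCount: 1))
        continue; // another node owns this shard

    try
    {
        await ProcessShardAsync(shardId); // hypothetical per-shard worker
    }
    finally
    {
        await shardLock.ReleaseAsync();
    }
}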
Failure modes & mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Node crash holding lock | Lock expires after TTL — new owner acquires | Set TTL reasonably; use fencing tokens for side-effects |
| Clock skew | TTL-based expiry risk | Use server-side TTL (Redis) — clients do not rely on wall clock |
| Network partition | Two nodes think they hold lock | Use fencing tokens or consensus (Zookeeper/etcd) |
| Slow renew | Job may be preempted | Tune TTL > renew interval; renew early (60% TTL) |
| Lock leak (no release) | Lock stays until TTL | Provide admin UI; shorten TTL if safe |
Testing & verification
Unit tests for acquire/renew/release logic (mock Redis).
Integration tests with real Redis cluster; inject network partitions, simulate process crash, ensure at-most-one execution.
Chaos tests: kill job runner mid-job; verify new runner picks up work safely after TTL.
Load tests: simulate many concurrent nodes attempting same lock.
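For example, an integration test for at-most-one acquisition might look like this (a sketch assuming xUnit and a local Redis on localhost:6379, placed inside a test class):
[Fact]
public async Task SecondOwnerCannotAcquireHeldLock()
{
    var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
    var db = muxer.GetDatabase();
    var key = $"test:lock:{Guid.NewGuid()}";

    var first  = new RedisDistributedLock(db, key, "owner-a", TimeSpan.FromSeconds(10));
    var second = new RedisDistributedLock(db, key, "owner-b", TimeSpan.FromSeconds(10));

    Assert.True(await first.AcquireAsync());                  // owner-a wins the lock
    Assert.False(await second.AcquireAsync(retryCount: 1));   // owner-b must be rejected while it is held

    await first.ReleaseAsync();
    Assert.True(await second.AcquireAsync());                 // free again after release
    await second.ReleaseAsync();
}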
Admin UI (Angular) features
List active locks (key, owner, acquiredAt, TTL left, token)
Force release lock (admin only) with confirmation and audit log
View job runs and status (JobRuns table)
Trigger run / manual run with options (dry-run, resume)
Metrics dashboard (contention rates, retries)
Angular snippet for status API
getLocks() {
return this.http.get('/api/locks/active');
}
forceRelease(key: string) {
return this.http.post('/api/locks/release', { key });
}
Best practices / checklist
Use GUID-based strong ownerId (include hostname, pid).
Use short TTL plus aggressive renewal; renew at 50–70% of TTL.
Always use atomic release (Lua script checks value before delete).
Use fencing tokens for destructive external effects.
Prefer small tasks with checkpoints rather than single huge jobs.
Track job runs and persist status to allow resume and auditing.
Implement cooperative cancellation when lease lost.
Monitor metrics and alert on high contention or renew failures.
Document admin release policy and require authorization.
For critical correctness use consensus-based locks (Zookeeper/etcd) or cloud-native strong primitives.
When to use consensus (Zookeeper/etcd) vs Redis
Use Redis for speed and ease of use when fencing tokens + care are sufficient.
Use Zookeeper/etcd if correctness must be guaranteed even under partitions and you need leader election with strong consistency. They follow the CP (consensus) model. Use them when jobs coordinate critical financial workflows or when you need cluster-wide leader election for the control plane.
Example: full acquire + run loop (pseudo .NET)
async Task RunScheduledJobAsync()
{
var ownerId = $"{Environment.MachineName}-{Guid.NewGuid()}";
var lockKey = "job:daily-report";
await using var redisLock = new RedisDistributedLock(db, lockKey, ownerId, TimeSpan.FromSeconds(30));
if (!await redisLock.AcquireAsync()) return; // another node is running
var runId = Guid.NewGuid();
await SaveRunStarted(runId, lockKey, ownerId);
var cts = new CancellationTokenSource();
try
{
// start background renew (RedisDistributedLock handles it)
await DoWorkAsync(cts.Token); // cooperative cancellation: check token frequently
await SaveRunCompleted(runId, success: true);
}
catch (OperationCanceledException)
{
await SaveRunCompleted(runId, success: false, reason: "Lease lost");
}
catch (Exception ex)
{
await SaveRunCompleted(runId, success: false, reason: ex.Message);
throw;
}
finally
{
await redisLock.ReleaseAsync();
}
}
Conclusion
A robust Distributed Job Locking System prevents duplicate execution, preserves correctness and reduces resource contention. For most systems, Redis lease + fencing token + idempotent tasks gives a good balance of performance and safety. For absolute correctness under network partitions choose consensus-based locks (Zookeeper/etcd) or cloud primitives. Always design jobs to be checkpointable and idempotent, monitor lock metrics, and provide safe admin controls.