
High-Scale File Sync Service (Detect Changes, Push Deltas, Track Versions)

Introduction

Modern enterprise systems — collaboration platforms, document management, CAD repositories and content delivery networks — need robust file synchronization that works at scale. Users expect near-real-time sync across many devices, efficient use of bandwidth, reliable version history, and safe conflict handling. A naive upload-everything approach soon breaks: it consumes bandwidth, increases storage cost and forces long sync delays.

This article describes a production-ready architecture and implementation patterns for a High-Scale File Sync Service that:

  • Detects local and server-side changes efficiently

  • Transfers only deltas (changed chunks) instead of whole files

  • Tracks versions, parents, and history reliably

  • Resolves conflicts safely and provides clear UX for users

  • Works offline and supports resumable syncs

  • Scales horizontally and secures data at rest and in transit

Examples and code snippets use Angular for the client and .NET (ASP.NET Core) for the backend. Diagrams use a simple block-style format.

Goals and Non-Goals

Goals

  • Minimise uploaded bytes using chunking and delta/diff transfer

  • Provide exact version history with straightforward replay and rollback

  • Support concurrent edits with deterministic conflict strategies

  • Keep sync latency low and resource usage predictable

  • Offer robust retry, resume and partial-sync behaviour

Non-Goals

  • This article does not implement an OT/CRDT merge engine for arbitrary binary formats (that is complex and format-specific). For text documents, CRDTs or operational transform approaches are recommended separately.

High-Level Architecture

 ┌──────────────────────────────┐
 │       Angular Client(s)      │
 │ (Detect change, chunking,    │
 │  delta upload, conflict UI)  │
 └───────────────┬──────────────┘
                 │ HTTPS / WebSocket / SignalR
                 ▼
 ┌───────────────┴──────────────┐
 │      Sync Gateway API        │
 │  (auth, request validation,  │
 │   routing, tenant resolution)│
 └───────┬────────────┬─────────┘
         │            │
         │            ▼
         │     ┌───────────────┐
         │     │  Sync Engine  │
         │     │ (delta, queue,│
         │     │  versioning)  │
         │     └────┬──────────┘
         │          │
 ┌───────▼──────┐ ┌─▼────────────┐
 │ Chunk Storage│ │ Metadata DB  │
 │ (object store│ │ (versions,   │
 │  S3/Blob)    │ │  locks, refs)│
 └──────────────┘ └──────────────┘
         │                │
         ▼                ▼
    CDN / Edge /         Monitoring,
    Backup               Metrics, Audit

Core Components

1. Change Detector (Client Side)

Detect file changes using file system watchers or application-level events. For robust delta transfer, compute chunk-level hashes:

  • Split file into fixed-size chunks (e.g., 4 MiB) or use variable-size rolling chunks (Rsync-style; better for insertions).

  • Compute SHA-256 (recommended) per chunk and keep the resulting list chunkHashes[].

Client sends minimal metadata: file id (or path), file size, last-modified, chunkHashes, and root hash.
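As a sketch of the client-side chunking step, the following Node-flavoured TypeScript splits a buffer into fixed 4 MiB chunks and hashes each with SHA-256. An Angular client would use the Web Crypto API instead of node:crypto; chunkAndHash and CHUNK_SIZE are illustrative names, not a fixed API.

```typescript
import { createHash } from "node:crypto";

const CHUNK_SIZE = 4 * 1024 * 1024; // 4 MiB fixed-size chunks

// Split a buffer into fixed-size chunks and return one SHA-256 hex digest
// per chunk. The last chunk may be shorter than CHUNK_SIZE.
function chunkAndHash(data: Buffer, chunkSize = CHUNK_SIZE): string[] {
  const hashes: string[] = [];
  for (let offset = 0; offset < data.length; offset += chunkSize) {
    const chunk = data.subarray(offset, offset + chunkSize);
    hashes.push(createHash("sha256").update(chunk).digest("hex"));
  }
  return hashes;
}
```

The client would then send { fileId, size, lastModified, chunkHashes } as the minimal metadata described above.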

Why chunk hashing on client?

  • Server compares hashes to its latest version and requests only missing chunks.

  • It reduces unnecessary transfers on large files with small edits.

2. Delta Engine (Server Side)

The Delta Engine determines which chunks are new and responds with an upload plan:

  • missingChunks = clientChunkHashes - serverKnownHashes

  • Server returns upload URLs (pre-signed) for only missing chunks or an instruction to send a binary patch (bsdiff) depending on algorithm chosen.
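The plan computation above is essentially a set difference. A minimal sketch, assuming serverKnownHashes is an in-memory index of hashes the chunk store already holds:

```typescript
// Return the chunk hashes the client must upload. Hashes already known to
// the server, or repeated within the file, are uploaded at most once.
function planUpload(
  clientChunkHashes: string[],
  serverKnownHashes: Set<string>,
): string[] {
  const missing = new Set<string>();
  for (const h of clientChunkHashes) {
    if (!serverKnownHashes.has(h)) missing.add(h);
  }
  return [...missing];
}
```

In a real service the response would pair each missing hash with a pre-signed upload URL.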

3. Chunk Storage / Blob Store

Store chunks in an object store (S3, Azure Blob, GCS) using content-addressable keys: sha256(chunk) or fileId/version/chunkIndex. Content-addressable storage avoids duplication across tenants and versions.

Keep two stores conceptually:

  • Chunk store: immutable chunk blobs

  • File manifests: metadata that lists chunk sequence for a particular file version (Merkle root, chunk list)

4. Version Manager / Manifest

A version manifest records:

  • fileId, versionId, parentVersionId, chunkHashes[], createdBy, createdAt, changeType

  • rootHash (Merkle root computed over chunk hashes)

The manifest is a small JSON document; write it to the metadata DB atomically once all missing chunks have been uploaded.

5. Conflict Resolver

If two clients modify the same base version concurrently:

  • Detect the conflict by comparing each edit's base parentVersionId against the server's current version.

  • Strategies:

    • Forking: create parallel versions and present both to user for manual merge.

    • Auto-merge for text: use three-way merge if file is text and merges cleanly.

    • Manual resolve: preserve both versions, mark conflict and show UI to merge or choose.

Prefer preserving both versions rather than deleting or silently overwriting.
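The parent-comparison check can be sketched as follows; VersionRef and isConflict are illustrative names, not a fixed API:

```typescript
interface VersionRef {
  versionId: string;
  parentVersionId: string | null;
}

// An incoming version conflicts when it was not based on the server's
// current head, i.e. another client committed in between.
function isConflict(incoming: VersionRef, serverHead: VersionRef | null): boolean {
  // No head yet: only a version claiming no parent is a clean first write.
  if (serverHead === null) return incoming.parentVersionId !== null;
  // Fast-forward: the client based its edit on the current head.
  return incoming.parentVersionId !== serverHead.versionId;
}
```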

Chunking Strategies and Delta Algorithms

Fixed-size chunking

  • Simple, fast, predictable offsets.

  • Poor for insertions (shifts all subsequent chunks).

Rolling hash / Content-defined chunking

  • Uses Rabin fingerprinting to find boundaries that remain stable with insert/delete.

  • Used by rsync, Dropbox. Better delta detection for many real-world edits.
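A minimal content-defined chunker can be sketched with a Gear-style rolling hash. Real implementations use Rabin fingerprints or a randomized gear table; the mask and size constants here are illustrative:

```typescript
const MASK = (1 << 12) - 1;      // boundary when low 12 bits are zero (~4 KiB avg)
const MIN_CHUNK = 1024;          // avoid degenerate tiny chunks
const MAX_CHUNK = 64 * 1024;     // force a cut so chunks stay bounded

// Return the exclusive end offset of each chunk. Because boundaries depend
// only on recent content, an insertion shifts at most nearby boundaries,
// leaving later chunks (and their hashes) unchanged.
function cdcBoundaries(data: Uint8Array): number[] {
  const ends: number[] = [];
  let hash = 0;
  let start = 0;
  for (let i = 0; i < data.length; i++) {
    hash = ((hash << 1) + data[i]) >>> 0; // Gear-style: old bytes shift out
    const len = i - start + 1;
    if ((len >= MIN_CHUNK && (hash & MASK) === 0) || len >= MAX_CHUNK) {
      ends.push(i + 1);
      start = i + 1;
      hash = 0;
    }
  }
  if (start < data.length) ends.push(data.length); // final partial chunk
  return ends;
}
```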

Binary diffs

  • If both sides have similar versions, bsdiff/xdelta produce compact binary patches.

  • Require both versions locally or server-assisted chunk comparison. Good for very specific binary formats.

Merkle Tree for Integrity

  • Arrange chunk hashes into a Merkle tree; store root in manifest.

  • Helps detect tampering and quickly verify integrity on client or server.
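The Merkle root stored in the manifest can be computed by pairwise hashing of the chunk hashes. This sketch promotes an odd node unchanged to the next level, a detail a production scheme would pin down explicitly:

```typescript
import { createHash } from "node:crypto";

function sha256Hex(s: string): string {
  return createHash("sha256").update(s).digest("hex");
}

// Fold a list of chunk hashes into a single Merkle root by repeatedly
// hashing adjacent pairs until one node remains.
function merkleRoot(chunkHashes: string[]): string {
  if (chunkHashes.length === 0) return sha256Hex("");
  let level = chunkHashes;
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      next.push(
        i + 1 < level.length ? sha256Hex(level[i] + level[i + 1]) : level[i],
      );
    }
    level = next;
  }
  return level[0];
}
```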

Protocol: Sync Session and APIs

Typical Sync Flow

  1. Client computes chunkHashes[] and requests POST /sync/plan with { fileId, size, chunkHashes, versionHint }.

  2. Server responds with { missingChunks[], uploadUrls[], targetManifestId }.

  3. Client uploads missing chunks directly to blob store using PUT to pre-signed URL.

  4. Client calls POST /sync/complete to signal all chunks uploaded; server validates chunks, writes manifest, and increments version.

  5. Server notifies other devices via WebSocket/SignalR push or by queued notification.

Example API Signatures (REST)

  • POST /api/sync/plan → returns upload plan

  • PUT /api/sync/chunk/{fileId}/{chunkHash} → upload chunk (or via pre-signed URL)

  • POST /api/sync/complete → finalize version (manifest commit)

  • GET /api/files/{fileId}/versions → list versions

  • GET /api/files/{fileId}/manifest/{versionId} → get manifest

Design the APIs to be idempotent. PUT chunk can safely be retried.

.NET Backend Patterns

Metadata Model (SQL)

CREATE TABLE files (
  file_id UUID PRIMARY KEY,
  current_version UUID NULL,
  created_at timestamptz,
  tenant_id UUID
);

CREATE TABLE file_versions (
  version_id UUID PRIMARY KEY,
  file_id UUID REFERENCES files(file_id),
  parent_version UUID NULL,
  root_hash text NOT NULL,
  created_by UUID,
  created_at timestamptz DEFAULT now()
);

CREATE TABLE version_chunks (
  version_id UUID REFERENCES file_versions(version_id),
  chunk_index int,
  chunk_hash text,
  PRIMARY KEY(version_id, chunk_index)
);

Finalize Version (pseudo .NET)

public async Task FinalizeVersion(Guid fileId, Guid versionId, List<string> chunkHashes, Guid parentVersion)
{
    // Commit the manifest in one transaction so a crash cannot leave a
    // partially written version behind.
    using var tx = await _db.BeginTransactionAsync();
    await _db.InsertFileVersion(versionId, fileId, parentVersion, ComputeRootHash(chunkHashes));
    for (int i = 0; i < chunkHashes.Count; i++)
        await _db.InsertVersionChunk(versionId, i, chunkHashes[i]);
    await _db.UpdateFileCurrentVersion(fileId, versionId);
    await tx.CommitAsync();
}

Commit the manifest atomically to prevent partial versions.

Chunk Uploading

  • Provide pre-signed URLs so large chunk upload bypasses the API server.

  • Validate uploaded chunk by server-side checksum before manifest commit.

  • Keep a garbage collection policy to delete orphaned chunks after a retention window.

Client (Angular) Implementation

Client Responsibilities

  • Compute chunk hashes (Web Crypto API)

  • Request upload plan

  • Upload missing chunks with retry and concurrency

  • Notify finalize, show progress

  • Subscribe to push notifications for remote changes

Example chunk hashing in Angular:

async function hashChunk(buffer: ArrayBuffer): Promise<string> {
  // SHA-256 via the Web Crypto API, hex-encoded.
  const hash = await crypto.subtle.digest('SHA-256', buffer);
  return Array.from(new Uint8Array(hash))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

Upload Manager

  • Use a concurrent upload queue (e.g., 4-8 parallel uploads).

  • Persist sync session locally to resume after crash.

  • Expose progress per-chunk and aggregated.
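A bounded-concurrency pool with retry can be sketched as below. The uploadChunk callback (e.g., a PUT to a pre-signed URL) is injected so the queue stays transport-agnostic; all names and the backoff constants are illustrative:

```typescript
// Upload all chunk hashes with at most `concurrency` in-flight uploads,
// retrying each chunk up to maxAttempts times with exponential backoff.
async function uploadAll(
  hashes: string[],
  uploadChunk: (hash: string) => Promise<void>,
  concurrency = 4,
  maxAttempts = 3,
): Promise<void> {
  let next = 0; // shared cursor; safe because JS runs single-threaded
  async function worker(): Promise<void> {
    while (next < hashes.length) {
      const hash = hashes[next++];
      for (let attempt = 1; ; attempt++) {
        try {
          await uploadChunk(hash);
          break;
        } catch (err) {
          if (attempt >= maxAttempts) throw err;
          // Back off before retrying this chunk.
          await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
        }
      }
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
}
```

Persisting the remaining hashes after each success gives the resumable session described above.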

Offline Support and Resumability

  • Keep local journal: pending uploads, manifest draft, last synced version.

  • When client reconnects:

    • Compare server manifest for file

    • Recompute missing chunks using chunk hashes saved locally

    • Resume pending chunk uploads

Avoid recomputing full-file hashes frequently — cache chunk hashes for unchanged files.

Conflict Detection and UI

  • Show a clear “conflict” state with both versions listed and timestamps.

  • Provide merge helpers:

    • For text, show 3-way diff and allow merge in UI.

    • For binary, allow preview of both and choose one or upload merged version.

  • Provide automatic conflict policies per tenant: last-writer-wins, manual-merge, version-branching.

Security and Compliance

  • Encrypt in transit (TLS 1.2+).

  • Encrypt chunks at rest (server-side encryption or client-side encryption with tenant key).

  • Use signed URLs with short expiry for direct uploads.

  • Authenticate and authorize per-tenant operations.

  • Audit writes, uploads, deletes and version changes for compliance.

For BYOK-sensitive tenants, support client-side envelope encryption: client encrypts chunk payload with tenant key before upload; server stores ciphertext (server never holds raw key).
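A sketch of that envelope scheme using AES-256-GCM from node:crypto (a browser client would use the Web Crypto equivalents; key management and KMS integration are out of scope, and the field names are illustrative):

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface EncryptedChunk {
  iv: Buffer; ciphertext: Buffer; tag: Buffer;        // chunk encrypted with data key
  keyIv: Buffer; wrappedKey: Buffer; keyTag: Buffer;  // data key wrapped with tenant key
}

// Envelope encryption: a fresh data key encrypts the chunk, the tenant key
// wraps the data key. The server stores only this ciphertext bundle.
function encryptChunk(chunk: Buffer, tenantKey: Buffer): EncryptedChunk {
  const dataKey = randomBytes(32);
  const iv = randomBytes(12);
  const c = createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([c.update(chunk), c.final()]);
  const keyIv = randomBytes(12);
  const w = createCipheriv("aes-256-gcm", tenantKey, keyIv);
  const wrappedKey = Buffer.concat([w.update(dataKey), w.final()]);
  return { iv, ciphertext, tag: c.getAuthTag(), keyIv, wrappedKey, keyTag: w.getAuthTag() };
}

function decryptChunk(e: EncryptedChunk, tenantKey: Buffer): Buffer {
  const uw = createDecipheriv("aes-256-gcm", tenantKey, e.keyIv);
  uw.setAuthTag(e.keyTag);
  const dataKey = Buffer.concat([uw.update(e.wrappedKey), uw.final()]);
  const d = createDecipheriv("aes-256-gcm", dataKey, e.iv);
  d.setAuthTag(e.tag);
  return Buffer.concat([d.update(e.ciphertext), d.final()]);
}
```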

Scalability, Performance and Cost Optimizations

  • Use a distributed object store (S3 / Blob) that scales horizontally.

  • Use CDN for serving frequently read versions.

  • Deduplicate chunks globally by content-addressable storage.

  • Compress chunks when beneficial; choose compression tradeoffs (CPU vs bandwidth).

  • Store small files inline in the metadata DB for speed (e.g., <16 KB).

  • Use batched manifest commits and apply backpressure to the notification publisher.

Garbage Collection & Retention

  • Implement GC to remove orphaned chunks:

    • Mark chunks referenced by any manifest as live.

    • Periodically scan chunk store and delete objects not referenced and older than retention threshold.

  • Provide retention policy per-tenant to retain versions for compliance.
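The mark-and-sweep above, sketched over an in-memory model: manifests maps versionId to chunk hashes, chunkAges maps chunk hash to age in days. The names and the retention rule are illustrative.

```typescript
// Mark every chunk referenced by any manifest as live, then sweep
// unreferenced chunks older than the retention window. Returns the hashes
// that would be deleted from the chunk store.
function collectGarbage(
  manifests: Map<string, string[]>,
  chunkAges: Map<string, number>,
  retentionDays: number,
): string[] {
  const live = new Set<string>();
  for (const hashes of manifests.values()) {
    for (const h of hashes) live.add(h);
  }
  const deleted: string[] = [];
  for (const [hash, age] of chunkAges) {
    if (!live.has(hash) && age > retentionDays) deleted.push(hash);
  }
  return deleted;
}
```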

Monitoring, Observability and SLOs

Track:

  • Sync success/failure rates

  • Average upload latency per chunk

  • Percentage of bytes saved by delta vs full upload

  • Number of conflicts per tenant

  • Storage growth and orphaned chunk rate

  • Throughput: chunks/sec, manifests/sec

Expose metrics to Prometheus and tracing via OpenTelemetry. Use traceId correlation for debugging complex sync flows.

Testing Strategy

  • Unit tests for chunk hashing, manifest computation and conflict detection.

  • Integration tests with real object store and database.

  • Performance tests with synthetic large files and concurrent clients.

  • Network-chaos tests: packet loss, slow links, reconnects.

  • End-to-end tests for resume and offline scenarios.

Operational Playbook

  • Onboard: provision storage and default retention per tenant.

  • Throttling: protect system by per-tenant rate limits on chunk uploads.

  • Backfill: support background re-sync jobs for clients that missed pushes.

  • Recovery: if manifest commit fails due to missing chunks, provide clear remediation steps (re-upload missing chunks or roll back).

  • Backups: snapshot metadata DB and ensure object store versioning is enabled or use cross-region replication.

Example End-to-End Scenario

A user edits a 1 GB CAD file, changing a 5 MB subsection. The client detects the changed chunks (say 2 chunks of 4 MiB each). Instead of uploading 1 GB, it uploads only those 2 chunks (~8 MiB) plus a small manifest commit. The server composes the new version by referencing the existing chunks plus the new ones; other devices are notified and download only the two changed chunks.

Summary

A high-scale file sync service must balance correctness, efficiency and usability. The architecture described provides:

  • Efficient delta detection via chunking and content hashing

  • Safe, atomic version commits with manifests and Merkle roots

  • Robust conflict handling with clear UX patterns

  • Offline support and resumable uploads

  • Scalable storage and deduplication patterns

Implement this with Angular for client hashing/upload orchestration and .NET for the Sync Gateway, metadata DB and background processes. Focus on idempotency, observability and security. Start with fixed-size chunking and add rolling chunk logic as you observe real workloads — that is often the fastest path to production.