Implementing a Test Data Fabric | Auto-generate Relational Test Data With Constraints — Practical Design

Rajesh Gami

Introduction

A reliable test environment needs realistic, consistent, and compliant data. Manual seeding or ad-hoc scripts produce brittle test sets: missing referential integrity, unrealistic distributions, PII leakage, and hard-to-reproduce failures. A Test Data Fabric is an automated, repeatable platform that generates relational test data across schemas while respecting constraints, business rules, and privacy.

This article explains how to design and implement a production-ready Test Data Fabric using .NET for backend orchestration, SQL Server / Postgres for target databases, and an Angular UI for configuration and preview. You’ll get architecture, data models, algorithms (graph-walk, dependency resolution, domain distribution), implementation patterns, sample code snippets, testing and governance, and operational tips.

Goals and Non-goals

Goals

Auto-generate referentially consistent relational data across multiple tables.
Respect foreign keys, unique constraints, check constraints, sequences and triggers.
Support domain-aware distributions (customer geography, order patterns, seasonality).
Support PII-safe synthetic data and tenant-aware masking.
Provide reproducible runs (seeded RNG) and selective refresh (table-level, row-level).
Offer UI for configuring datasets, sampling rates, and previewing samples.

Non-goals

Not a replacement for full production anonymization pipelines (but it can integrate with them).
Not a deep synthesis engine for complex unstructured data (NLG can be plugged in).

High-Level Architecture

┌──────────────────────────┐    ┌──────────────────────┐    ┌───────────────────────┐
│ Angular Admin Console    │ -> │ Orchestration API    │ -> │ Target Databases      │
│ (configure fabric, jobs) │    │ (.NET Core)          │    │ (SQL Server/Postgres) │
└──────────┬───────────────┘    └──┬───────────────────┘    └──────┬────────────────┘
           │                       │                               │
           ▼                       │                               ▼
┌──────────────────────────┐       │                      ┌─────────────────────┐
│ Template / Rules Store   │ <-----┘                      │ Data Generators     │
│ (JSON/YAML/DB)           │                              │ (Generators, Faker) │
└──────────────────────────┘                              └─────────────────────┘

Components

Admin UI (Angular): design templates, set volumes, preview, manage jobs, audit.
Orchestration API (.NET): job scheduling, dependency resolution, execution engine.
Generators: pluggable modules that create values per column (domains, distributions, references).
Target DB connector: applies data through efficient bulk APIs (BULK INSERT, COPY).
Audit + Replay store: store seeds, config, and generated manifests for reproducibility.

Key Concepts

Schema Graph

Model tables as nodes and foreign keys as directed edges. When the graph is acyclic (a DAG), a topological order is a valid generation order: parent tables first. Circular references break that ordering and need a multi-pass strategy.

Templates and Column Profiles

Each column has a profile: type, generator, domain constraints, uniqueness, nullable, distribution (zipf, normal, uniform), masking rules for PII.

Example column profile

{
  "table":"Customer",
  "column":"Email",
  "generator":"EmailGenerator",
  "uniqueness":"global",
  "nullable":false,
  "distribution": { "type":"categorical", "values":["gmail.com","yahoo.com","corp.com"], "weights":[0.5,0.3,0.2] }
}
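One way a generator might consume the distribution block of this profile is cumulative-weight sampling. The sketch below is illustrative only: the Categorical helper and its Pick method are hypothetical names, not part of any library, and the "user{i}" local part is just a stand-in for whatever the real EmailGenerator produces.

```csharp
using System;
using System.Linq;

var domains = new[] { "gmail.com", "yahoo.com", "corp.com" };
var weights = new[] { 0.5, 0.3, 0.2 };
var rng = new Random(42);

for (int i = 0; i < 3; i++)
{
    // The counter keeps the local part unique; only the domain is sampled.
    string email = $"user{i}@{Categorical.Pick(domains, weights, rng)}";
    Console.WriteLine(email);
}

// Hypothetical helper: picks one value according to the profile's weights.
public static class Categorical
{
    public static string Pick(string[] values, double[] weights, Random rng)
    {
        if (values.Length == 0 || values.Length != weights.Length)
            throw new ArgumentException("values and weights must be non-empty and the same length");

        // Scale by the total so the weights need not sum to exactly 1.
        double target = rng.NextDouble() * weights.Sum();
        double cumulative = 0;
        for (int i = 0; i < values.Length; i++)
        {
            cumulative += weights[i];
            if (target < cumulative) return values[i];
        }
        return values[^1]; // guards against floating-point rounding
    }
}
```

Combining a sampled domain with a counter-based local part satisfies "uniqueness":"global" without a collision check.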

Referential Integrity Strategies

Parent-first: generate parent rows, keep mapping of natural keys to generated surrogate keys, then emit child rows referencing parents.
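Deferred constraints: in databases that support deferred FK checking (Postgres, for constraints declared DEFERRABLE), insert in any order and let validation run at the end of the transaction. Useful for circular FKs. SQL Server has no deferred constraint checking, so use one of the other strategies there.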
Two-pass generation: generate placeholders for children, then fill FK columns once parent keys are known.

Domain Awareness

Domain rules encode business invariants: e.g., Order.Total == SUM(OrderLine.Amount), Stock >= 0. The fabric must either generate consistent aggregates or run reconciliation jobs post-generation.

Reproducibility

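A seeded PRNG, with the seed recorded per job, reproduces the same dataset for the same configuration, provided rows are generated in a deterministic order and the PRNG algorithm itself does not change between runtime versions. Persist seeds and job manifests for replay.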
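If replay must survive runtime upgrades, one option is to own the PRNG rather than rely on System.Random, whose documentation does not promise the same sequence across major .NET versions. The sketch below uses SplitMix64, a small public-domain generator; this is a substitute technique chosen for illustration, not something the design above prescribes, and the class name is an assumption.

```csharp
using System;

var a = new SplitMix64(12345);
var b = new SplitMix64(12345);

// Two generators built from the same seed walk the same sequence.
for (int i = 0; i < 5; i++)
    Console.WriteLine(a.NextUInt64() == b.NextUInt64()); // True

public sealed class SplitMix64
{
    private ulong _state;

    public SplitMix64(ulong seed) => _state = seed;

    public ulong NextUInt64()
    {
        unchecked
        {
            _state += 0x9E3779B97F4A7C15UL;
            ulong z = _state;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
            return z ^ (z >> 31);
        }
    }

    // Uniform double in [0, 1) built from the top 53 bits.
    public double NextDouble() => (NextUInt64() >> 11) * (1.0 / (1UL << 53));
}
```

Because the algorithm lives in the codebase, a recorded seed keeps producing the same values until someone changes this class on purpose.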

Workflow

1. User configures template in Angular UI and requests N customers, M orders.
2. UI sends template + job request to .NET Orchestration API.
3. Orchestration resolves schema graph, selects generators, computes volumes per table.
4. Generator modules produce rows in-memory or buffered to disk, mapping PKs and FKs.
5. Orchestration writes to temporary staging tables or directly bulk-loads target DB.
6. Post-checks run: constraints, business rules, uniqueness audits.
7. Job summary and sample data returned to UI for review.

Detailed Implementation Steps

1 — Schema Introspection

Read database schema (INFORMATION_SCHEMA or system catalog) to build table metadata:

columns, types, default values
foreign keys, unique constraints, check constraints
identity/sequence generation method

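.NET snippet using Npgsql (a SqlClient version follows the same pattern with SqlConnection and SqlCommand):

using Npgsql;

// connStr is the connection string of the database being introspected.
await using var conn = new NpgsqlConnection(connStr);
await conn.OpenAsync();
const string sql = "SELECT table_name, column_name, data_type FROM information_schema.columns " +
                   "WHERE table_schema = 'public' ORDER BY table_name, ordinal_position";
await using var cmd = new NpgsqlCommand(sql, conn);
await using var rdr = await cmd.ExecuteReaderAsync();
while (await rdr.ReadAsync()) {
  // build metadata: columns 0..2 are table_name, column_name, data_type
}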
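Foreign keys can be read the same way. The sketch below assumes Postgres and the 'public' schema, and joins three information_schema views; ForeignKeyEdge and FkIntrospection are hypothetical names. For composite (multi-column) foreign keys this join pairs every child column with every parent column, so a production version would more likely read pg_constraint instead.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;

public record ForeignKeyEdge(string ChildTable, string ChildColumn, string ParentTable, string ParentColumn);

public static class FkIntrospection
{
    private const string Sql = @"
        SELECT tc.table_name, kcu.column_name, ccu.table_name, ccu.column_name
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
          ON kcu.constraint_name = tc.constraint_name AND kcu.table_schema = tc.table_schema
        JOIN information_schema.constraint_column_usage ccu
          ON ccu.constraint_name = tc.constraint_name AND ccu.table_schema = tc.table_schema
        WHERE tc.constraint_type = 'FOREIGN KEY' AND tc.table_schema = 'public'";

    public static async Task<List<ForeignKeyEdge>> LoadAsync(string connStr)
    {
        var edges = new List<ForeignKeyEdge>();
        await using var conn = new NpgsqlConnection(connStr);
        await conn.OpenAsync();
        await using var cmd = new NpgsqlCommand(Sql, conn);
        await using var rdr = await cmd.ExecuteReaderAsync();
        while (await rdr.ReadAsync())
        {
            // Column order matches the SELECT list: child table/column, then parent table/column.
            edges.Add(new ForeignKeyEdge(
                rdr.GetString(0), rdr.GetString(1), rdr.GetString(2), rdr.GetString(3)));
        }
        return edges;
    }
}
```

Each returned edge names the referencing (child) side first, which is the shape the dependency graph needs.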

2 — Build Dependency Graph

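Construct a graph with tables as nodes and an edge from each parent to each child, that is, from the referenced table to the table that holds the FK, so a topological sort yields parents first. Detect cycles; for tables on a cycle, use deferred constraints or placeholder PKs.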
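A minimal sketch of this step using Kahn's algorithm, which produces a generation order and reports any tables stuck on a cycle. SchemaGraph and its Sort method are illustrative names, and the table list is sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tables = new[] { "OrderLine", "Order", "Product", "Customer" };
var edges = new[] { ("Customer", "Order"), ("Order", "OrderLine"), ("Product", "OrderLine") };

var (order, cyclic) = SchemaGraph.Sort(tables, edges);
Console.WriteLine(string.Join(" -> ", order)); // Customer -> Product -> Order -> OrderLine
Console.WriteLine($"tables on cycles: {cyclic.Count}"); // tables on cycles: 0

public static class SchemaGraph
{
    // Returns tables in parent-first order, plus the tables that could not be ordered.
    public static (List<string> Order, List<string> Cyclic) Sort(
        IReadOnlyCollection<string> tables, IEnumerable<(string Parent, string Child)> edges)
    {
        var children = tables.ToDictionary(t => t, _ => new List<string>());
        var inDegree = tables.ToDictionary(t => t, _ => 0);
        foreach (var (parent, child) in edges)
        {
            if (parent == child) continue; // self-reference: needs a nullable FK or a second pass
            children[parent].Add(child);
            inDegree[child]++;
        }

        // Sorting the initial roots keeps the order stable between runs.
        var ready = new Queue<string>(
            inDegree.Where(kv => kv.Value == 0).Select(kv => kv.Key).OrderBy(t => t, StringComparer.Ordinal));
        var order = new List<string>();
        while (ready.Count > 0)
        {
            var table = ready.Dequeue();
            order.Add(table);
            foreach (var child in children[table])
            {
                if (--inDegree[child] == 0) ready.Enqueue(child);
            }
        }

        // Anything with remaining in-degree sits on a cycle or depends on one.
        var cyclic = inDegree.Where(kv => kv.Value > 0).Select(kv => kv.Key).ToList();
        return (order, cyclic);
    }
}
```

Tables returned in the cyclic list are the ones that need deferred constraints or a placeholder pass.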

3 — Plan Volumes

The user provides high-level volumes (e.g., 10k customers, an average of 5 orders per customer). Orchestration expands these into per-table row counts by multiplying down each parent-child relationship:

Orders = Customers * avgOrders
OrderLines = Orders * avgLines

Allow variability (Poisson sampling) for realistic skew.
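To illustrate the Poisson option, the sketch below draws an order count per customer with Knuth's multiplication method. The Poisson class name is an assumption, and this method is only suitable for small means (it loops about mean + 1 times per sample and underflows for very large means).

```csharp
using System;

var rng = new Random(42);
int customers = 10_000;
double avgOrders = 5.0;

long orders = 0;
for (int c = 0; c < customers; c++)
    orders += Poisson.Sample(avgOrders, rng);

// The total lands near customers * avgOrders, but individual customers vary.
Console.WriteLine($"customers={customers}, orders={orders}");

public static class Poisson
{
    public static int Sample(double mean, Random rng)
    {
        if (mean <= 0) return 0;

        double limit = Math.Exp(-mean);
        double product = 1.0;
        int k = 0;
        do
        {
            k++;
            product *= rng.NextDouble();
        } while (product > limit);
        return k - 1;
    }
}
```

The planner can then size the Orders table from the sampled total rather than from the exact product.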

4 — Choose Generators

Generators are modular. Examples:

Primitive: random int, sequential, GUID, timestamp.
Semantic: PersonName, Email, PhoneNumber, Address (using Faker or custom datasets).
Domain: Product SKUs from product catalog distribution.
Currency/Financial: decimal with rounding and constraints.

Generators accept config: locale, distribution, nullRate, uniqueness key.
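A possible shape for the generator contract is sketched below. IColumnGenerator, SequentialCodeGenerator and NullableGenerator are hypothetical names introduced for illustration; the point is that options such as nullRate can be layered as wrappers instead of being re-implemented in every generator.

```csharp
#nullable enable
using System;

var rng = new Random(7);
IColumnGenerator gen = new NullableGenerator(new SequentialCodeGenerator("CUST"), nullRate: 0.1);
for (int i = 0; i < 5; i++)
    Console.WriteLine(gen.Next(rng) ?? "NULL");

public interface IColumnGenerator
{
    // Returns the next value for the column, or null for a SQL NULL.
    object? Next(Random rng);
}

// Counter-based values are unique by construction, so no collision check is needed.
public sealed class SequentialCodeGenerator : IColumnGenerator
{
    private readonly string _prefix;
    private long _counter;

    public SequentialCodeGenerator(string prefix) => _prefix = prefix;

    public object? Next(Random rng) => $"{_prefix}-{_counter++:D6}";
}

// Decorator that applies a null rate on top of any other generator.
public sealed class NullableGenerator : IColumnGenerator
{
    private readonly IColumnGenerator _inner;
    private readonly double _nullRate;

    public NullableGenerator(IColumnGenerator inner, double nullRate)
    {
        _inner = inner;
        _nullRate = nullRate;
    }

    public object? Next(Random rng) => rng.NextDouble() < _nullRate ? null : _inner.Next(rng);
}
```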

5 — Key Mapping & FK Assignment

While generating parent rows, store the mapping from synthetic natural key to DB PK (or generate the PK on the fly). For large volumes, avoid holding the full mapping in memory: either stream with deterministic generation, or keep a bucketed mapping in an embedded or external key-value store (e.g., RocksDB or Redis) that child generation can query.

Example deterministic mapping:

Use a seeded RNG + stable ordering to derive parent PK for child index i so lookup requires no state: ParentId = DeterministicId(seed, tableName, index % parentCount).
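A stateless variant of this idea derives the key from a hash instead of an RNG, as in the sketch below. It assumes GUID primary keys; Keys.DeterministicId is a hypothetical helper, and the resulting GUID does not carry RFC 4122 version bits. SHA-256 is used because string.GetHashCode is randomized per process on .NET Core and later, so it cannot be replayed.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

const int seed = 42;
const long parentCount = 1_000;
long childIndex = 123_456;

// The child computes its parent's key directly; no lookup table is kept.
Guid parentId = Keys.DeterministicId(seed, "Customer", childIndex % parentCount);
Console.WriteLine(parentId == Keys.DeterministicId(seed, "Customer", 456)); // True

public static class Keys
{
    public static Guid DeterministicId(int seed, string tableName, long index)
    {
        // Invariant formatting keeps the input bytes identical on every machine.
        string input = FormattableString.Invariant($"{seed}|{tableName}|{index}");
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(input)); // .NET 5 or later
        return new Guid(hash.AsSpan(0, 16));
    }
}
```

The parent table must use the same function for its own PKs, with index running from 0 to parentCount - 1, so that every derived ParentId exists.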

6 — Bulk Load

Write generated rows to CSV/NDJSON and use DB bulk loaders:

Postgres: COPY ... FROM STDIN
SQL Server: BULK INSERT or SqlBulkCopy

Use temporary staging schema to reduce locks and then INSERT INTO target SELECT ....
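For Postgres, Npgsql exposes COPY through a binary importer, which avoids writing an intermediate CSV file. The sketch below assumes a recent Npgsql version with the async importer API and a hypothetical staging.customer (id bigint, email text) table; PostgresBulkLoader is an illustrative name.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Npgsql;
using NpgsqlTypes;

public static class PostgresBulkLoader
{
    public static async Task CopyCustomersAsync(string connStr, IEnumerable<(long Id, string Email)> rows)
    {
        await using var conn = new NpgsqlConnection(connStr);
        await conn.OpenAsync();

        await using var writer = await conn.BeginBinaryImportAsync(
            "COPY staging.customer (id, email) FROM STDIN (FORMAT BINARY)");

        foreach (var (id, email) in rows)
        {
            await writer.StartRowAsync();
            await writer.WriteAsync(id, NpgsqlDbType.Bigint);
            await writer.WriteAsync(email, NpgsqlDbType.Text);
        }

        // Disposing the writer without completing it cancels the import.
        await writer.CompleteAsync();
    }
}
```

After the copy completes, a single INSERT INTO target SELECT from the staging table moves the rows across.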

7 — Enforce Business Rules

For rules like totals, either:

Generate child lines first and derive parent totals; or
Generate parents with placeholders and adjust parents later with updates based on children.

Prefer child-first for aggregates (generate lines then compute parent totals) if possible.
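A small sketch of the child-first approach: lines are generated first and the parent total is derived from them, so the invariant holds by construction. The OrderLine, Order and OrderBuilder types are illustrative, not part of the fabric's model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var rng = new Random(42);
var lines = Enumerable.Range(0, 3)
    .Select(_ => new OrderLine(
        Quantity: rng.Next(1, 5),
        UnitPrice: Math.Round((decimal)(rng.NextDouble() * 100), 2)))
    .ToList();

var order = OrderBuilder.Build(orderId: 1, lines);
Console.WriteLine(order.Total == lines.Sum(l => l.Amount)); // True

public record OrderLine(int Quantity, decimal UnitPrice)
{
    public decimal Amount => Quantity * UnitPrice;
}

public record Order(long Id, decimal Total);

public static class OrderBuilder
{
    // Total is computed from the lines, never generated independently.
    public static Order Build(long orderId, IReadOnlyCollection<OrderLine> lines) =>
        new Order(orderId, lines.Sum(l => l.Amount));
}
```

Using decimal rather than double keeps the sum exact, which matters if a check constraint compares the stored total with the line amounts.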

8 — Post-Checks and Repair

Run validation checks:

FK integrity
Unique constraints satisfied
Check constraints satisfied
Statistical distribution checks (mean, variance)

If any check fails, either repair automatically (re-generate the offending rows) or fail the job with diagnostics.
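The first two checks can be sketched in memory as below; PostChecks is a hypothetical helper and the data is sample input. At real volumes the same checks are more likely to run in the database, for example as an anti-join for orphans and a GROUP BY ... HAVING COUNT(*) > 1 for duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var parentIds = new HashSet<long> { 1, 2, 3 };
var childFks = new long[] { 1, 3, 3, 7 };
var emails = new[] { "a@x.test", "b@x.test", "a@x.test" };

Console.WriteLine(string.Join(",", PostChecks.Orphans(childFks, parentIds))); // 7
Console.WriteLine(string.Join(",", PostChecks.Duplicates(emails)));           // a@x.test

public static class PostChecks
{
    // FK values that do not match any parent key.
    public static List<T> Orphans<T>(IEnumerable<T> fkValues, ISet<T> parentKeys) =>
        fkValues.Where(v => !parentKeys.Contains(v)).Distinct().ToList();

    // Values that occur more than once in a column that should be unique.
    public static List<T> Duplicates<T>(IEnumerable<T> values) =>
        values.GroupBy(v => v).Where(g => g.Count() > 1).Select(g => g.Key).ToList();
}
```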

Algorithms and Techniques

Graph Walk Generation

Topological sort tables; for each table:

Determine count.
For i in 1..count:
- Generate PK (or request identity).
- Generate non-FK columns.
- If FK to already-generated parent(s), choose parent via distribution (uniform, Zipfian).
- Attach row to batch.

For cycles, use deferred constraints or iterative fill.

Sampling Distributions

Support distributions to mimic real data:

Uniform: equal probability
Zipfian / Power-law: common in product sales (a few head items and a long tail)
Normal: for metrics like order value
Poisson: for event counts (orders per customer)

Implement RNG with seeded source so sampling is repeatable.
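As an example of a repeatable skewed distribution, the sketch below samples Zipfian ranks by precomputing a cumulative distribution and binary-searching it. ZipfSampler is an illustrative name; the table costs O(n) memory, which is fine for catalogs of modest size but not for billions of items.

```csharp
using System;

var zipf = new ZipfSampler(itemCount: 1000, exponent: 1.0);
var rng = new Random(42);
var hits = new int[1000];
for (int i = 0; i < 100_000; i++)
    hits[zipf.Next(rng)]++;

// Counts fall off roughly as 1/rank: a heavy head and a long tail.
Console.WriteLine($"rank 1: {hits[0]}, rank 10: {hits[9]}, rank 100: {hits[99]}");

public sealed class ZipfSampler
{
    private readonly double[] _cdf;

    public ZipfSampler(int itemCount, double exponent)
    {
        if (itemCount <= 0) throw new ArgumentOutOfRangeException(nameof(itemCount));

        _cdf = new double[itemCount];
        double sum = 0;
        for (int rank = 1; rank <= itemCount; rank++)
        {
            sum += 1.0 / Math.Pow(rank, exponent);
            _cdf[rank - 1] = sum;
        }
        for (int i = 0; i < itemCount; i++)
            _cdf[i] /= sum; // normalize so the last entry is 1.0
    }

    // Returns a zero-based rank; 0 is the most popular item.
    public int Next(Random rng)
    {
        double u = rng.NextDouble();
        int idx = Array.BinarySearch(_cdf, u);
        if (idx < 0) idx = ~idx; // first entry greater than u
        return Math.Min(idx, _cdf.Length - 1);
    }
}
```

The same cumulative-table technique works for any weighted parent selection: replace the 1/rank weights with weights derived from a parent attribute.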

Uniqueness and Collisions

Uniqueness generator must check collisions. For high-volume uniqueness, use deterministic namespace + counter (e.g., CUST-{seed}-{i}) instead of random to avoid costly checks.

Referential Selection Strategies

When picking parent for child:

Uniform: pick random parent ID
Weighted: weight by a parent attribute (e.g., customers with higher spend receive more orders)
Time-aware: parents created earlier favored for older children

Domain Awareness Examples

E-Commerce: customers → orders → orderlines. Orders have Total = sum(orderline.amount) + taxes + shipping. Generate orderlines with price, quantity, tax and then compute order total.
Banking: accounts → transactions. Keep the running balance consistent by ordering transactions by timestamp and computing the cumulative sum; insert synthetic overdraft events.
Inventory: stock movements must preserve Stock >= 0 unless negative allowed; generate receipts and adjustments in logical order.
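The banking case can be sketched as follows: transactions are generated in any order, then sorted by timestamp before the balance is accumulated. Txn and Ledger are illustrative names, and the amounts and opening balance are made-up sample values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var rng = new Random(42);
var start = new DateTime(2024, 1, 1, 0, 0, 0, DateTimeKind.Utc);
var raw = Enumerable.Range(0, 5)
    .Select(_ => (At: start.AddMinutes(rng.Next(0, 10_000)), Amount: (decimal)rng.Next(-200, 300)))
    .ToList();

foreach (var t in Ledger.WithRunningBalance(raw, openingBalance: 100m))
    Console.WriteLine($"{t.At:u} {t.Amount,8} {t.Balance,8} {(t.Balance < 0 ? "OVERDRAWN" : "")}");

public record Txn(DateTime At, decimal Amount, decimal Balance);

public static class Ledger
{
    public static List<Txn> WithRunningBalance(
        IEnumerable<(DateTime At, decimal Amount)> txns, decimal openingBalance)
    {
        var result = new List<Txn>();
        decimal balance = openingBalance;

        // OrderBy is a stable sort, so ties keep their generation order.
        foreach (var (at, amount) in txns.OrderBy(t => t.At))
        {
            balance += amount;
            result.Add(new Txn(at, amount, balance));
        }
        return result;
    }
}
```

An inventory generator can reuse the same loop and reject or clamp any movement that would take Stock below zero.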

Angular Admin UI Ideas

UI should let product engineers and QA:

Upload a template or choose schema from DB.
Configure volumes and distributions with sliders.
Map columns to generators via drag-and-drop.
Preview sample rows (10–100 rows) before executing full job.
Schedule recurring refresh jobs and seed-based replay.
Audit view for job results, errors, and sample extracts.

Example components

SchemaExplorerComponent — tree view of tables and columns.
GeneratorMapperComponent — map column → generator.
JobDesignerComponent — volumes, distributions, seed config.
JobMonitorComponent — progress, logs, post-check results.

Security, Compliance & PII Handling

By default generate synthetic PII (no real production values). For anonymized production-derived datasets, ensure proper anonymization and legal approvals.
If mapping production patterns, use only masked/salted values and never store cleartext sensitive values in logs or intermediate files.
For tenant-owned datasets, support BYOK (bring your own key) or tenant-provided salts to make synthetic values tenant-specific.
Maintain audit trails of dataset generation: who ran the job, configuration, seed, and where outputs were written.

Testing Strategy

Unit tests for generator modules and deterministic outputs given known seeds.
Integration tests: run small job end-to-end and validate FK and unique constraints.
Property-based tests: assert invariants like COUNT(orders) == SUM(customer.orderCount) for configured relationships.
Performance tests: measure generation speed, memory usage, bulk load throughput for realistic volumes.
Reproducibility tests: same seed + config → identical outputs (row-level checksums).
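A reproducibility test can be reduced to comparing one digest per run, as in this sketch. Demo.Generate stands in for a real generation run and Checksums is a hypothetical helper; rows are assumed to be serialized in a canonical, ordered text form before hashing.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

string first = Checksums.OfRows(Demo.Generate(seed: 42, count: 1000));
string second = Checksums.OfRows(Demo.Generate(seed: 42, count: 1000));
string other = Checksums.OfRows(Demo.Generate(seed: 43, count: 1000));

Console.WriteLine(first == second); // expected True: same seed and config
Console.WriteLine(first == other);  // expected False: different seed

public static class Demo
{
    // Stand-in generator: each row is rendered as a canonical string.
    public static IEnumerable<string> Generate(int seed, int count)
    {
        var rng = new Random(seed);
        for (int i = 0; i < count; i++)
            yield return $"{i},{rng.Next(1, 1_000_000)}";
    }
}

public static class Checksums
{
    // Streams rows into SHA-256 so large datasets need not fit in memory.
    public static string OfRows(IEnumerable<string> rows)
    {
        using var sha = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);
        foreach (var row in rows)
            sha.AppendData(Encoding.UTF8.GetBytes(row + "\n"));
        return Convert.ToHexString(sha.GetHashAndReset());
    }
}
```

Storing the digest in the job manifest lets a later replay be verified without keeping the original data.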

Performance Considerations

Stream rows to disk to avoid memory blow-up for large volumes.
Use parallelism by partitioning parent IDs; for extremely high scale, distribute generation across worker nodes with coordinated seeds and sharding.
Minimize round-trips: build large batches and use bulk loaders.
For PK lookup mapping, use deterministic functions to avoid huge in-memory maps.
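Parallel generation stays reproducible if each shard owns its own RNG and a disjoint ID range, so thread scheduling cannot affect the values. The sketch below is one way to do that; Shards.SeedFor is a hypothetical helper that mixes the job seed with the shard number, and the per-shard sum stands in for real row generation.

```csharp
using System;
using System.Threading.Tasks;

const int jobSeed = 42;
const int shardCount = 4;
const int rowsPerShard = 250_000;

long[] sums = new long[shardCount];
Parallel.For(0, shardCount, shard =>
{
    var rng = new Random(Shards.SeedFor(jobSeed, shard));
    long firstId = (long)shard * rowsPerShard; // disjoint ID range per shard
    long sum = 0;
    for (long id = firstId; id < firstId + rowsPerShard; id++)
        sum += rng.Next(0, 100); // placeholder for generating the row with this id
    sums[shard] = sum;
});

// The same job seed yields the same per-shard results, whatever the thread timing.
Console.WriteLine(string.Join(", ", sums));

public static class Shards
{
    // Mixes seed and shard with a SplitMix64-style finalizer. HashCode.Combine is
    // avoided on purpose: it is randomized per process and cannot be replayed.
    public static int SeedFor(int jobSeed, int shard)
    {
        unchecked
        {
            ulong z = ((ulong)(uint)jobSeed << 32) | (uint)shard;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
            return (int)(z ^ (z >> 31));
        }
    }
}
```

The shard count becomes part of the job configuration: changing it changes which RNG produces which row, and therefore the output.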

Operational Playbook

Dry-run: always preview small random sample before running full dataset.
Quota management: limit maximum rows per run to protect dev/test environments.
Rollback: generate in dedicated schema or temp tables so teardown is a schema drop.
Job retries: idempotent job design with job manifest ensures safe retry.
Monitoring: CPU, disk I/O, DB write throughput; alert when bulk load stalls.
Data retirement: schedule cleanup jobs to drop generated data after TTL.

Example: Minimal .NET Generator Skeleton

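using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// TableMetadata, IGeneratorFactory and the column generators are assumed to be defined elsewhere.
public abstract class TableGenerator
{
    private const int BatchSize = 10_000;
    private readonly IGeneratorFactory _fact;
    private readonly Random _rng;

    protected TableGenerator(IGeneratorFactory fact, int seed) {
        _fact = fact; _rng = new Random(seed);
    }

    // Implemented per target database (e.g. SqlBulkCopy or COPY).
    protected abstract Task BulkInsert(string tableName, IReadOnlyList<Dictionary<string, object>> batch);

    public async Task GenerateAsync(TableMetadata meta, long count, Func<long, long> pickParent) {
        // Create each column generator once, not once per row; skip the PK so it is not overwritten.
        var gens = meta.Columns
            .Where(c => !c.IsFk && c.Name != meta.PkColumn)
            .Select(c => (c.Name, Gen: _fact.Create(c.Profile)))
            .ToList();
        var batch = new List<Dictionary<string, object>>(BatchSize);
        for (long i = 0; i < count; i++) {
            var row = new Dictionary<string, object>();
            row[meta.PkColumn] = meta.GeneratePk(i);
            foreach (var (name, gen) in gens) {
                row[name] = gen.Next(_rng);
            }
            foreach (var fk in meta.ForeignKeys) {
                // Deterministic selection; this sketch uses one selector for every FK.
                row[fk.Column] = pickParent(i);
            }
            batch.Add(row);
            if (batch.Count >= BatchSize) {
                await BulkInsert(meta.TableName, batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) await BulkInsert(meta.TableName, batch);
    }
}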

Common Pitfalls and Remedies

Memory overflow on mapping large parents: use deterministic mapping or external map store.
Uniqueness collisions at high volume: switch to counter-based generation; avoid random-only approach.
Constraint failures due to triggers or computed columns: inspect triggers and replicate necessary side-effects or use staging then post-process.
Slow bulk insert due to indexes: drop non-essential indexes before the load and rebuild them afterwards, or load into a minimally indexed staging table first.

Summary

A Test Data Fabric transforms seeding from ad-hoc scripts into a repeatable, auditable, domain-aware process. Key takeaways:

Model the schema as a graph and plan generation accordingly.
Use seeded RNGs and manifest storage for reproducible runs.
Provide modular generators (primitive, semantic, domain) and let templates handle the rest.
Respect constraints using parent-first, deferred checks, or two-pass strategies.
Integrate Angular UI for self-service configuration and preview.
Make generation scalable using streaming, bulk loaders, and sharded workers.
Prioritise PII safety and auditability.