Software Architecture/Engineering  

Spec-Driven Development in Secure Research Environments: Why Healthcare Can't Afford to Vibe Code

Spec-driven development (SDD) has entered the mainstream in 2026. Forbes, GitHub, AWS, and Microsoft all position it as a way to improve developer productivity and maintain architectural consistency - mostly framed as an antidote to "vibe coding," where you prompt an AI conversationally and hope the output holds together.

But this framing misses a domain where spec-driven development is not a productivity improvement. It is a regulatory and ethical necessity.

My work sits at the intersection of institutional health data platforms and national research infrastructure in the UK - designing integration systems inside secure research environments where sensitive patient-level data flows between organisations. Every integration carries consequences that go well beyond broken builds.

This experience has shaped a conviction: healthcare research has been practising spec-driven development out of necessity for years - and the methodology that emerges from these projects offers a blueprint the broader industry urgently needs.

What "spec-driven" actually means when the data is sensitive

In mainstream SDD literature, a "spec" typically refers to a PRD, a set of BDD scenarios, or an API schema that guides AI-assisted code generation.

In a secure research environment, the definition is much wider. A spec is not just a technical contract - it is a compliance boundary. It defines:

  • What data fields are permitted to leave a secure environment

  • What metadata format a national platform will accept

  • What transformations are valid when converting between institutional data structures

  • What prompts are safe to send to a language model inside a restricted network

Consider a middleware layer connecting institutional data portals to a national research discovery platform. The spec becomes the only reliable interface between systems that cannot directly inspect each other's internals.

This middleware accepts metadata from different institutional platforms, each with its own data structure, and transforms it into a single nationally compliant format. Without a rigid, versioned spec, the system would silently produce metadata that looked valid but misrepresented the underlying datasets.

Multi-format metadata integration for a national health research platform

The national research gateway aggregates health dataset metadata from multiple institutional platforms across the UK. The middleware between these institutional systems and the national platform required end-to-end ownership - from architecture definition and cross-organisational API negotiations, through to implementation and deployment.

There was no existing codebase and no precedent. The technical approach had to be defined from scratch, stakeholders across three independent organisations with different priorities had to be aligned, and the development methodology had to deliver at pace while meeting the compliance requirements of a secure research environment.

Why SDD was the only option

Three realities shaped the decision:

Real data was off-limits during development. The datasets describe real patient cohorts - sensitive information that cannot be exposed to external tooling or AI agents. Every line of code had to be built against dummy data and example JSON files from the national platform's pre-production environment. The system had to be built against a specification, not by trial and error.

The integration spans organisational boundaries. The middleware consumes data from internal institutional APIs and publishes to the national platform's external gateway API. Neither end is under direct control. The only reliable interface was the API contract - OpenAPI specifications, JSON schema definitions, and agreed-upon data formats.

Multiple institutional platforms have fundamentally different data shapes. One institution stores metadata in a single JSON structure. Another separates descriptive and structural metadata into entirely different formats with different naming conventions. A converter built by intuition would silently produce incorrect output for one while appearing to work for the other.

Cross-organisational alignment

The middleware sits at the intersection of three parties:

  • The Provisioning team (internal, managing data assets and institutional APIs)

  • The National platform team (external, maintaining the gateway API and target schema)

  • The middleware architect - responsible not only for the transformation layer, but for defining the integration standards that all parties would work within

Neither API behaved exactly as documented. Direct technical discussions with both teams surfaced and resolved these gaps - establishing shared expectations that went beyond what Swagger docs could provide. Access to the target JSON schema and test JSON files from the national platform's pre-production environment became the ground truth for every transformation. This decision set the technical direction for the entire integration and ensured that all three organisations were working against a single source of truth.

Specification-first implementation with AI assistance

The primary development environment was Cursor paired with FastAPI and Pydantic, but with strict discipline:

  • The AI agent never saw real patient data. All development used dummy data and pre-production examples - a compliance requirement, not a preference.

  • OpenAPI specifications defined contracts for both upstream and downstream APIs. They were the starting point for implementation, not generated after the fact.

  • Pydantic models served as executable specifications. Every payload was validated at ingestion and output boundaries. If data did not match the spec, the system rejected it explicitly (see the sketch after this list).

  • GitHub Spec Kit provided the scaffolding for this structured approach. In practice, constitution.md, plan.md, and tasks.md files were maintained for each feature - living documents that captured the governing principles, technical plans, and task breakdowns discussed and agreed with researchers and technical teams across the participating institutions. These files became the shared reference point between the developer and the AI agent: every code generation session started from these artifacts, not from a blank prompt.
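
To make the "executable specification" point concrete, here is a minimal Pydantic (v2) sketch of boundary validation. The model and field names are illustrative, not the production schema:

```python
from pydantic import BaseModel, Field, ValidationError

# Illustrative ingress model - real field names follow the agreed spec.
class InstitutionalAsset(BaseModel):
    dataset_name: str = Field(min_length=1)
    description: str
    keywords: list[str] = []

def ingest(raw: dict) -> InstitutionalAsset:
    """Validate at the boundary: reject explicitly, never coerce silently."""
    try:
        return InstitutionalAsset.model_validate(raw)
    except ValidationError as exc:
        raise ValueError(f"Payload violates ingress spec: {exc}") from exc
```

A payload missing a required field fails loudly at ingestion, with field-level detail, rather than flowing downstream in a half-valid state.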

How the Pipeline API embodies spec-driven design

To give a concrete sense of what this looks like in practice, here is how the middleware's pipeline API is structured. Every endpoint, every response, every step is governed by an explicit specification.

The full pipeline runs five discrete steps, each with its own contract:

Fetch → Save → Convert → Validate → Send to the national platform

Each step returns a structured PipelineStepResult - not free-form text, not "it worked." A step has a name, a status (success, failed, or skipped), a human-readable message, and an errors array if applicable. The overall pipeline response wraps all of these into a single PipelineResponse with an overall_status, the elastic_id, and the resulting gateway_dataset_id.

This is spec-driven development in action. Nothing passes through on trust.
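
A minimal sketch of what these response models could look like in Pydantic. The fields named above (name, status, message, errors, overall_status, elastic_id, gateway_dataset_id) come from the description; the `steps` field name is an assumption:

```python
from enum import Enum
from pydantic import BaseModel

class StepStatus(str, Enum):
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"

class PipelineStepResult(BaseModel):
    name: str                      # e.g. "convert"
    status: StepStatus
    message: str                   # human-readable summary
    errors: list[str] = []         # populated only on failure

class PipelineResponse(BaseModel):
    overall_status: StepStatus
    elastic_id: str
    gateway_dataset_id: str | None = None   # absent on failure
    steps: list[PipelineStepResult]         # field name assumed
```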

The pipeline also provides safe testing boundaries by design:

  • Dry run (POST /pipeline/asset-to-gateway/{elastic_id}/dry-run) - queues a background job that runs Fetch, Save, Convert, and Validate but does not send to the national platform. Returns 202 Accepted with a job_id for tracking.

  • Preview (GET /pipeline/asset-to-gateway/{elastic_id}/preview) - fetches and converts the asset, returns the nationally compliant metadata payload without validating or sending. This lets researchers inspect the converted output before committing.

  • Full pipeline (POST /pipeline/asset-to-gateway/{elastic_id}) - runs the entire chain end-to-end.

This layered approach - preview, dry run, full execution - is a direct consequence of spec-driven thinking. In a domain where you cannot afford silent failures, you build in checkpoints at every stage.
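
A route-level sketch of how these three entry points might be declared in FastAPI. The paths and status codes come from the list above; the handler bodies are simplified placeholders, not the production logic:

```python
from fastapi import BackgroundTasks, FastAPI, status

app = FastAPI()

def fetch_and_convert(elastic_id: str) -> dict:
    """Placeholder for the real Fetch + Convert steps."""
    return {"name": f"dataset-{elastic_id}", "tags": []}

@app.get("/pipeline/asset-to-gateway/{elastic_id}/preview")
def preview(elastic_id: str) -> dict:
    # Fetch + Convert only: return the converted payload without
    # validating or sending, so it can be inspected first.
    return fetch_and_convert(elastic_id)

@app.post("/pipeline/asset-to-gateway/{elastic_id}/dry-run",
          status_code=status.HTTP_202_ACCEPTED)
def dry_run(elastic_id: str, background: BackgroundTasks) -> dict:
    # Queue Fetch -> Save -> Convert -> Validate; never send.
    background.add_task(fetch_and_convert, elastic_id)
    return {"job_id": f"job-{elastic_id}"}   # placeholder job tracking

@app.post("/pipeline/asset-to-gateway/{elastic_id}")
def run_full(elastic_id: str) -> dict:
    # All five steps end-to-end, including submission to the gateway.
    payload = fetch_and_convert(elastic_id)
    return {"overall_status": "success",
            "elastic_id": elastic_id,
            "payload": payload}
```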

The development methodology

Each converter component followed a strict progression: establish principles, define the specification, clarify ambiguities, plan, break down tasks, then implement. No step could be skipped.

The constitution.md for the project encoded non-negotiable constraints: no real patient data in AI context, schema validation at both boundaries, all output traceable to a versioned spec. This was not a one-off document - it was referenced by every subsequent step, ensuring that no plan or task could violate the governing principles.
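
A condensed illustration of what such a constitution.md could contain, reconstructed from the constraints just named:

```markdown
# Project Constitution (excerpt)

## Non-negotiable principles
1. No real patient data may appear in any AI context, prompt, or fixture.
2. Every payload is schema-validated at both ingress and egress boundaries.
3. All output must be traceable to a versioned specification.

## Enforcement
- Plans and tasks that conflict with these principles are rejected at review.
```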

From there, the source schema, target schema, and transformation rules were captured in a versioned specification - focused on the what and why, not the implementation.

Example artifact: a versioned SPEC.md checked into the repo:

```markdown
# National Gateway Metadata Converter - Specification (v0.1)

## Purpose
Convert institutional dataset metadata into nationally compliant gateway JSON.

## Scope
- In scope: dataset-level metadata fields and structural metadata mapping
- Out of scope: patient-level data, free-text clinical notes

## Inputs
- **Institution A metadata JSON** (source)
- **Institution B metadata files** (source)

## Output
- **National Gateway Dataset JSON** (target)

## Transformation rules
- `source.dataset_name` → `target.name`
- `source.description` → `target.description`
- `source.keywords[]` → `target.tags[]`
- If `source.publisher` is missing, set `target.publisher` to `"Unknown"` (do not omit)

## Validation
- Reject payloads that fail Pydantic model validation at ingress.
- Reject payloads that fail target schema validation at egress.

## Test fixtures
- `fixtures/gateway/preprod/valid_minimal.json`
- `fixtures/gateway/preprod/valid_full.json`

## Non-negotiables (secure environment constraints)
- No production data in development or AI context
- All mappings must be traceable to this spec and git history
```
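
A converter honouring these rules might look like the following sketch. The mapping logic mirrors the spec above, while the model names are illustrative:

```python
from pydantic import BaseModel

class SourceMetadata(BaseModel):
    dataset_name: str
    description: str
    keywords: list[str] = []
    publisher: str | None = None

class TargetMetadata(BaseModel):
    name: str
    description: str
    tags: list[str]
    publisher: str

def convert(source: SourceMetadata) -> TargetMetadata:
    """Apply the v0.1 transformation rules - nothing more, nothing less."""
    return TargetMetadata(
        name=source.dataset_name,
        description=source.description,
        tags=source.keywords,
        # Per spec: default rather than omit when publisher is missing.
        publisher=source.publisher or "Unknown",
    )
```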

Before any code was generated, a clarification pass surfaced gaps between API documentation and actual behaviour: fields documented as required but sometimes missing, date formats varying between institutions. These ambiguities were resolved before code generation.

The plan.md captured the technical implementation plan - architecture constraints, integration boundaries, and tech stack decisions. The tasks.md then broke this into an ordered, dependency-aware sequence: define source model → define target model → implement transformation → validate → test against pre-production gateway. Both files were living documents, updated as each feature progressed and as requirements evolved from discussions with the national platform and Provisioning teams.

Only then did AI-assisted code generation begin - constrained by the constitution, the clarified spec, and the task breakdown.

This matters for auditability. If an auditor asks "why does this transformation drop field X?" the answer is not "because the AI suggested it." The answer is traceable through the artifact chain: the constitution defined that no optional fields should be silently dropped, the spec documented that field X maps to target field Y, the clarification pass recorded that field X is absent in one institution and should use a default value. Every decision has a paper trail.

Validation through the national platform's pre-production environment

The national platform's pre-production environment - a fully functional gateway API replica that accepts test submissions without affecting the live catalogue - proved essential.

This caught edge cases that schema validation alone could not surface: fields that were technically valid JSON but semantically rejected by the platform's pipeline - wrong date formats, unexpected null handling, missing optional-but-expected fields.

The combination of static specs (Pydantic models, OpenAPI contracts) and a dynamic spec (the pre-production environment) is what made SDD robust in practice.

In the pipeline itself, this is reflected in the external validation step. Before anything reaches the national platform, the converted metadata passes through an external validation service. If it fails, the pipeline returns structured errors - not a generic 500. The researcher can see exactly which field failed and why.

Technical challenges and the decisions that resolved them

The upstream API gap. The Provisioning team's API did not always return metadata as documented. Because the middleware validates incoming data against Pydantic models before transformation, discrepancies were caught immediately. Rather than silently working around them, specific, actionable feedback went back to the Provisioning team - which led them to correct their own API inconsistencies. A good example of how architectural decisions upstream can improve the quality of systems you do not directly control.

The second institution divergence. Extending the system to support a second institutional platform revealed a fundamentally different data shape, requiring a new source Pydantic model and separate transformation path - but the target spec did not change. The modular architecture established from the start meant onboarding a new institution without disrupting the existing pipeline. This design decision directly reduced the time and risk of scaling the system to additional data providers.

The AI boundary constraint. The AI agent could only work with specs, schemas, and example data - never production reality. The quality of its output was directly proportional to the quality of the specifications provided. This reinforced a principle worth advocating broadly: the discipline of specification is the prerequisite for responsible AI use in sensitive domains.

The diff tracking requirement. The middleware stores a diff_summary JSON field that captures what changed between the original asset metadata and the converted nationally compliant metadata - renamed fields, type transformations, structural changes, added or removed fields. This provides researchers with an auditable record of exactly what the transformation did, and without a spec-first architecture, this kind of traceability would be nearly impossible to retrofit.
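
As an illustration, a simplified diff summary for flat payloads could be computed like this. The real implementation would also need to handle nested structures, and the output shape here is an assumption:

```python
def diff_summary(original: dict, converted: dict,
                 renames: dict[str, str]) -> dict:
    """Summarise what a flat transformation did, for the audit record."""
    summary: dict = {"renamed": {}, "retyped": {}, "added": [], "removed": []}
    for src, dst in renames.items():
        if src in original and dst in converted:
            summary["renamed"][src] = dst
            if type(original[src]) is not type(converted[dst]):
                summary["retyped"][dst] = {
                    "from": type(original[src]).__name__,
                    "to": type(converted[dst]).__name__,
                }
    mapped = set(renames.values())
    summary["added"] = [k for k in converted
                        if k not in original and k not in mapped]
    summary["removed"] = [k for k in original
                          if k not in converted and k not in renames]
    return summary

# diff_summary({"dataset_name": "X"},
#              {"name": "X", "publisher": "Unknown"},
#              {"dataset_name": "name"})
# -> renamed dataset_name -> name, added ["publisher"], nothing removed
```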

Architecture overview

[Architecture diagram: institutional platform APIs → middleware pipeline → national gateway API]

Tech stack: FastAPI, Pydantic, OpenAPI/Swagger, Docker, Elasticsearch, Cursor + Spec Kit, RabbitMQ, PostgreSQL.

Why healthcare's version of SDD is different

The mainstream SDD conversation focuses on developer productivity. In healthcare research, the failure modes are fundamentally different:

Data integrity failures are silent and compounding. A transformation that drops a sensitivity flag does not throw an error - it produces output that misclassifies a variable, propagating to a national catalogue where researchers make decisions based on incorrect metadata.

Reproducibility depends on traceable specifications. If a pipeline transforms input in ways that are not formally specified, the transformation itself becomes a hidden variable. Specs here are not about code quality - they are about scientific integrity.

Multi-party systems require shared contracts. Each participating institution maintains independent systems. There is no shared monorepo. The spec is the only reliable coordination mechanism.

Security boundaries are non-negotiable. Data cannot leave the environment without explicit approval. Systems must be predictable by specification, not by testing against production data.

How this can work in secure environments - and how to adopt it

The experience of building these systems across organisational boundaries points to a set of principles that should guide any organisation adopting AI-assisted development - not just in healthcare, but in any domain where data integrity, compliance, or multi-party collaboration matters.

In practice, this approach allowed the system to onboard multiple Trusted Research Environments with fundamentally different data structures without breaking existing integrations - significantly reducing the risk and complexity of scaling the platform to new institutional partners.

Treat specs as system interfaces, not documentation. A specification should be an enforceable contract between services - not a document that drifts from reality within weeks. Specifications should be executable and validated at runtime, not archived in Confluence or Jira.

Design AI interactions around structured methodology. Force developers to articulate intent, define constraints, and clarify ambiguities before generating code. Tools like Spec Kit can help, but the principle matters more than the tooling - any structured progression from intent to implementation will outperform conversational prompting in regulated environments.

Make transformation logic explicit and versioned. Any system that converts data between formats should treat the transformation spec as a first-class artifact. This is doubly true when compliance is involved. In the pipeline described above, this is why the diff_summary exists - every transformation is traceable.

Build observability into the spec layer. Logging, audit trails, and correlation IDs should be part of the specification, not added after incidents reveal gaps. In a well-designed pipeline, every job has a job_id, timestamps for when it was triggered and finished, structured error details, and a full step-by-step status trail.

Layer your testing boundaries. The preview → dry run → full execution pattern is not just convenient. It is a fundamental part of operating safely in a domain where mistakes compound silently. Any system handling sensitive data should offer these checkpoints.

Specification as responsibility

In healthcare and research, specification is not an optimisation. It is a responsibility - the mechanism through which we ensure that systems handling sensitive data behave predictably, that transformations are traceable, that AI tools operate within defined boundaries, and that independent organisations can collaborate without ambiguity.

The most robust systems are not the ones with the most advanced tooling. They are the ones where every boundary is governed by an explicit, versioned, testable specification - and where the technical leadership invested in defining those boundaries before writing a single line of code.

As AI-assisted development accelerates, the practitioners best positioned to lead this transition are those who have already operated under the constraints that SDD formalises - constraints that healthcare research has enforced for years. The question for the broader industry is not whether to adopt spec-driven development. The question is whether your domain can afford not to.