Introduction
Natural-language analytics promises a simple idea: “ask” your data and get an answer. In practice, letting a large language model (LLM) talk to a database is risky: schemas are messy, permissions vary by tenant, joins are brittle, and a single unbounded query can melt your warehouse. The way forward is to treat Text-to-SQL as an operational system, not a demo. That system constrains what the model can do, verifies each step, and leaves a trail you can replay. This article lays out a production pattern for LLMs with SQL—covering architecture, safety, evaluation, and a real deployment—so teams can move from pretty prototypes to governed self-serve analytics.
Why naïve Text-to-SQL fails
LLMs are great at language, not at your organization’s semantics. They hallucinate table names, misread dimensions, over-aggregate, or ignore row-level policies. Even when the SQL runs, it may be expensive, stale, or wrong: cross-joining huge fact tables, skipping required filters (e.g., tenant_id), or querying deprecated sources. Without guardrails, you trade analyst bottlenecks for incident tickets.
An operating model that works
Successful setups split responsibility:
The model proposes a query plan—not freeform SQL. The plan encodes the intent (“daily active users by plan for last 30 days”), referenced entities, required filters, and expected shape of the result.
A validator enforces policy: allowed tables, row/column permissions, mandatory predicates (tenant, time, soft-delete), cost caps (row estimates, bytes scanned), and freshness SLA.
A compiler renders SQL from the approved plan using trusted templates and parameterization.
The runtime executes the query in a sandbox, attaches a receipt (query_id, bytes_scanned, result_rowcount), and returns both the payload and evidence (source tables, lineage).
Post-checks verify the result (NULL/duplicate guards, unit consistency) and down-rank or block if anomalies appear.
Think “propose → validate → compile → execute → verify,” with receipts at each step.
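As a minimal sketch, that loop might look like the following, where `propose`, `validate`, `compile_sql`, `execute`, and `verify` are hypothetical stand-ins for the components described above:

```python
from dataclasses import dataclass

# Illustrative, stripped-down types; real plans and receipts carry far more metadata.
@dataclass
class Plan:
    intent: str
    entities: list
    filters: dict

@dataclass
class Receipt:
    query_id: str
    bytes_scanned: int
    result_rowcount: int

def answer(question, propose, validate, compile_sql, execute, verify):
    """propose -> validate -> compile -> execute -> verify, failing closed."""
    plan = propose(question)              # LLM emits a structured plan, not SQL
    problems = validate(plan)             # policy: allowlist, mandatory filters, cost caps
    if problems:
        return {"status": "rejected", "problems": problems}
    sql, params = compile_sql(plan)       # trusted templates + parameterization
    rows, receipt = execute(sql, params)  # sandboxed, read-only execution
    anomalies = verify(rows, plan)        # post-checks: NULL/duplicate/unit guards
    status = "flagged" if anomalies else "ok"
    return {"status": status, "rows": rows, "receipt": receipt}
```

The point is the shape: each stage returns evidence the next stage can check, and a rejection at any gate short-circuits before the warehouse is touched.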
Architecture and context: what the model must know (and not know)
Schema catalog. Provide a compact, curated view: tables, columns, types, PK/FK, sample values, and business descriptions. Exclude internal and sensitive columns.
Semantic layer. Where possible, expose governed models (views or metrics) instead of raw tables. LLMs are far more accurate over a tidy semantic layer than over a raw lake.
Row-level security (RLS). Enforce per-tenant and per-role filters at the database, not in the prompt. The LLM should never see rows it cannot see.
Time & freshness. Supply table update timestamps and SLAs so the model can decline stale data (“inventory snapshot older than 24h”).
Examples, not answers. Provide few-shot pairs (question ↔ approved SQL) for tricky joins, but keep them schema-anchored and up to date.
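One way to serialize a curated catalog entry for the model's context is shown below; the field names and the compact rendering are illustrative, not a standard:

```python
# Illustrative curated-catalog entry; sensitive columns are simply absent.
CATALOG = {
    "mrr_daily": {
        "description": "Daily MRR per account; one row per account per day.",
        "columns": {
            "date":       {"type": "date",    "sample": "2024-05-01"},
            "account_id": {"type": "string",  "fk": "dim_account.account_id"},
            "plan_name":  {"type": "string",  "sample": "enterprise"},
            "mrr_delta":  {"type": "decimal", "note": ">0 expansion, <0 contraction"},
        },
        "freshness_sla_hours": 24,
        "last_updated": "2024-05-02T03:00:00Z",
    }
}

def render_for_prompt(catalog: dict) -> str:
    """Compact one-line-per-column rendering to keep token cost low."""
    lines = []
    for table, meta in catalog.items():
        lines.append(f"{table}: {meta['description']}")
        for col, info in meta["columns"].items():
            lines.append(f"  {col} ({info['type']})")
    return "\n".join(lines)
```

Keeping the rendering terse matters: a bloated catalog crowds out the question and the few-shot examples.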
Safety and cost controls (non-negotiable)
Read-only by default. DDL/DML blocked; separate identity and network path for reads.
Allowlist. Only approved tables/views are addressable; everything else is invisible.
Mandatory predicates. Auto-inject tenant and time filters; fail closed if the plan omits them.
Bounded results. Require LIMIT + ORDER BY for explorations; paginate for larger results.
Static analysis. Reject cartesian joins, unqualified SELECT *, ambiguous columns; cap estimated cost (bytes/rows).
Explain-plan gates. Parse EXPLAIN to block full scans of known heavy tables without partition predicates.
Secrets isolation. The model never holds credentials; a broker signs queries and redacts literals from logs when necessary.
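A toy version of the static-analysis gate is sketched below. It uses naive string heuristics where a production validator would parse the SQL into an AST; the mandatory-column list is an example:

```python
import re

MANDATORY = ["tenant_id", "date"]  # predicates every query must reference

def static_checks(sql: str) -> list:
    """Reject obviously unsafe SQL; fail closed. Real systems inspect a parsed AST."""
    problems = []
    s = sql.lower()
    if re.search(r"\bselect\s+\*", s):
        problems.append("unqualified SELECT *")
    if re.search(r"\bcross\s+join\b", s):
        problems.append("cartesian join")
    if re.search(r"\b(insert|update|delete|drop|alter|create)\b", s):
        problems.append("write statement on read-only path")
    for col in MANDATORY:
        if col not in s:
            problems.append(f"missing mandatory predicate: {col}")
    if "limit" not in s:
        problems.append("missing LIMIT for exploration")
    return problems
```

Regexes like these are brittle (comments, quoted strings, dialect quirks), which is exactly why the plan-then-compile contract is preferable: validated plans mean the compiler never emits SQL these checks would catch.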
How the prompts should change (less “magic,” more contracts)
Ditch “write a SQL query that…”; adopt a contract:
Query plan (model output)
intent: concise natural language (what business question)
entities: tables/views and join keys
filters: normalized constraints (tenant, time, segment)
metrics: aggregates and windowing
expected_shape: columns, types, and grain
checks: invariants (e.g., totals ≥ 0; proportions in [0,1])
The compiler maps this plan onto SQL templates you control. This keeps the model's reasoning short and checkable, and it makes evaluation deterministic: the same plan always compiles to the same SQL.
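A compiler stub illustrating the idea: one vetted template per supported query shape, with the model's plan filling only the slots the validator has approved. The template, allowlist, and plan fields here are illustrative:

```python
# One trusted template per query shape; the model never writes SQL directly.
# Identifiers can't be bound as parameters, so the template fixes the structure
# and the allowlist restricts which names may fill the slots.
TEMPLATE = (
    "SELECT {group_by}, {metric_expr} AS {metric_name}\n"
    "FROM {view}\n"
    "WHERE tenant_id = :tenant_id AND date BETWEEN :start AND :end\n"
    "GROUP BY {group_by}\n"
    "ORDER BY {group_by}\n"
    "LIMIT :max_rows"
)

ALLOWED_VIEWS = {"mrr_daily", "fct_product_events"}

def compile_plan(plan: dict) -> str:
    view = plan["entities"][0]
    if view not in ALLOWED_VIEWS:
        raise ValueError(f"view not on allowlist: {view}")
    m = plan["metrics"][0]
    return TEMPLATE.format(
        group_by=", ".join(plan["group_by"]),
        metric_expr=f"{m['agg']}({m['expr']})",
        metric_name=m["name"],
        view=view,
    )
```

Note that the mandatory tenant and time predicates live in the template itself, so a plan cannot omit them even if the validator misses it.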
Evaluation you can trust
Golden questions. A representative suite of business questions with ground-truth SQL and answer snapshots. Any schema/prompt/model change must pass.
Semantic diffs. Compare proposed plans to goldens; tolerate alias differences but flag entity or filter divergences.
Right-for-the-right-reason. Require minimal lineage spans (which tables/columns contributed); penalize answers pulled from deprecated sources even if numerically close.
Error taxonomy. Track failure modes: missing tenant filter, wrong grain, incorrect join, stale source, cost violation. Fix the system—not just the prompt.
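A semantic diff can be as simple as comparing normalized plan fields so that cosmetic differences (case, ordering) are ignored while entity and filter divergences surface. The normalization rules below are illustrative:

```python
def normalize(plan: dict) -> dict:
    """Strip cosmetic variation so only semantic divergence remains."""
    return {
        "entities": sorted(e.lower() for e in plan.get("entities", [])),
        "filters": {k.lower(): v for k, v in sorted(plan.get("filters", {}).items())},
        "metrics": sorted(
            (m["agg"].upper(), m["expr"].lower()) for m in plan.get("metrics", [])
        ),
    }

def semantic_diff(proposed: dict, golden: dict) -> list:
    """Return the plan fields where the proposal diverges from the golden."""
    a, b = normalize(proposed), normalize(golden)
    return [field for field in a if a[field] != b[field]]
```

An empty diff passes; a non-empty one names the failure mode directly (wrong entity, dropped filter, changed aggregate), which feeds the error taxonomy.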
Observability and receipts
Every request emits a trace capturing: catalog version, plan, validator outcomes, compiled SQL hash, EXPLAIN summary, execution stats (rows, bytes, time), sample of results, and lineage (tables/columns). Attach the query_id so analysts can reproduce in the warehouse UI. This makes audits—and debugging—fast.
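The receipt itself can be a plain structured log record, keyed so an analyst can replay the query in the warehouse UI. Field names below mirror the list above; the helper and its signature are illustrative:

```python
import hashlib
import json
import time

def make_receipt(plan, sql, stats, lineage, catalog_version):
    """Everything needed to audit or replay one answer, in one record."""
    return {
        "catalog_version": catalog_version,
        "plan": plan,
        "sql_hash": hashlib.sha256(sql.encode()).hexdigest(),
        "query_id": stats["query_id"],           # warehouse-native ID for replay
        "bytes_scanned": stats["bytes_scanned"],
        "result_rowcount": stats["rowcount"],
        "duration_ms": stats["duration_ms"],
        "lineage": lineage,                      # tables/columns that contributed
        "emitted_at": time.time(),
    }

def emit(receipt, sink):
    """Append-only JSON lines keep receipts greppable and replayable."""
    sink.write(json.dumps(receipt, default=str) + "\n")
```

Hashing the compiled SQL (rather than logging it raw) lets you deduplicate and diff across runs while keeping literals out of logs when redaction policy requires it.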
Real-world deployment: self-serve revenue analytics
Context.
A SaaS company wanted product managers and CSMs to ask questions like “weekly expansions by plan in EMEA” without waiting on data engineering. The schema spanned billing, product events, and CRM, with strict tenant and region controls.
Design.
A semantic layer exposed governed views: fct_billing, dim_account, fct_product_events, and a metrics view mrr_daily.
The catalog included descriptions, PK/FK, freshness, and sample values per column.
The contract forced tenant and region filters; the compiler injected WHERE tenant_id = :tenant AND region IN (:allowed_regions) AND date BETWEEN :start AND :end.
The validator rejected queries that bypassed mrr_daily when the question referenced revenue (preventing double-counting from raw invoices).
Receipts included Snowflake QUERY_ID, bytes scanned, result rowcount, lineage (view → base tables), and the catalog version hash.
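The injected-predicate step from the design above can be sketched as a fail-closed rewrite; the parameter names follow the WHERE clause quoted earlier, and the containment check is deliberately crude (a real compiler matches on the parsed predicate, not a substring):

```python
# Predicates the compiler must guarantee regardless of what the plan says.
REQUIRED = {
    "tenant": "tenant_id = :tenant_id",
    "region": "region IN (:allowed_regions)",
    "time":   "date BETWEEN :start AND :end",
}

def inject_predicates(where_clauses: list) -> list:
    """Append any required predicate the plan omitted; never trust the model to include them."""
    present = " ".join(where_clauses)
    out = list(where_clauses)
    for name, clause in REQUIRED.items():
        key = clause.split()[0]  # crude: first token ("tenant_id", "region", "date")
        if key not in present:
            out.append(clause)
    return out
```

Injection, rather than rejection, is the right default here because these filters are never optional; the validator still fails closed on anything it cannot rewrite safely.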
Outcomes (six weeks).
Median time from question to validated answer: seconds (down from days).
Wrong-join incidents: near zero (down from frequent), owing to compiler templates and goldens.
Warehouse cost: −22% per query via partition predicates and full-scan gates.
Trust: PMs could click “View lineage” to see exactly which sources fed the metric.
Implementation starter (you can adapt today)
Bundle (YAML)
bundle_id: "text2sql.v2"
purpose: "Answer analytic questions from governed views; enforce tenant/region/time filters."
allowlist:
  views: ["mrr_daily", "fct_product_events", "dim_account", "fct_billing"]
mandatory_predicates:
  - name: tenant
    template: "tenant_id = :tenant_id"
  - name: time
    template: "date BETWEEN :start AND :end"
cost_caps:
  bytes_scanned_mb: 512
  est_rows: 5000000
validators:
  - no_select_star
  - require_partition_filter_on("mrr_daily", "date")
  - block_cross_join
compiler:
  template_dir: "sql_templates/"
monitoring:
  log: ["plan", "explain_summary", "rows", "bytes", "duration_ms", "query_id", "lineage"]
Plan (model output → compiled)
{
  "intent": "Weekly MRR expansion by plan for EMEA in the last 8 weeks",
  "entities": ["mrr_daily", "dim_account"],
  "join_keys": [{"left": "mrr_daily.account_id", "right": "dim_account.account_id"}],
  "filters": {"region": ["EMEA"], "date_range": "last_8_weeks"},
  "metrics": [{"name": "expansion_mrr", "agg": "SUM", "expr": "mrr_delta", "where": "mrr_delta > 0"}],
  "group_by": ["week", "plan_name"],
  "expected_shape": [{"col": "week", "type": "date"}, {"col": "plan_name", "type": "string"}, {"col": "expansion_mrr", "type": "decimal"}],
  "checks": ["expansion_mrr >= 0"]
}
Compiler renders a parameterized SQL from trusted templates; runtime executes; receipts are logged for replay.
Practical tips & pitfalls
Prefer views over raw tables. Move business logic to the semantic layer; keep the LLM’s job small.
Keep examples fresh. Outdated few-shots teach the wrong joins. Align them with your goldens.
Normalize synonyms. A thin ontology (“customer=account”, “revenue=MRR/ARR”) cuts ambiguity dramatically.
Page results. First answer should be small and quick; offer “expand” only after user confirmation.
Teach refusal. If freshness or eligibility fails, the right answer is “cannot answer with governed data”—with a hint (e.g., “inventory snapshot older than SLA”).
Close the loop. Let users mark “correct/incorrect” with a reason; feed that back to goldens and the compiler’s templates.
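The synonym tip above can be as literal as a lookup table applied before planning; the entries here are examples, not a complete ontology, and real systems also handle multi-word terms and punctuation:

```python
# Thin, hand-maintained ontology mapping business vocabulary to governed names.
SYNONYMS = {
    "customer": "account",
    "customers": "accounts",
    "client": "account",
    "revenue": "mrr",
}

def normalize_question(question: str) -> str:
    """Map business terms onto governed entity names before the model plans."""
    words = question.lower().split()  # crude word-level split; ignores punctuation
    return " ".join(SYNONYMS.get(w, w) for w in words)
```

Running this before the planner means the few-shot examples and the catalog only ever need to speak the governed vocabulary.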
Conclusion
LLMs can make data genuinely self-serve—if you design the system around contracts, catalogs, and controls. Have the model propose a plan, not raw SQL. Validate against policy, compile with templates, execute in a sandbox, and ship every answer with receipts and lineage. Do this, and “ask the data” stops being a risky parlor trick and becomes a reliable, auditable capability your product and business teams can trust.