
LLMs: Build an AI Healthcare App with Streamlit and GPT-5 (In-Depth)


⚠️ Scope & safety: This is an education & navigation app—not a diagnostic or prescribing tool. It should help users understand symptoms, prepare for appointments, and find credible resources. Always add clear disclaimers, emergency routing, and guardrails that block diagnostic language or medication directives.


0) What We’re Building

A privacy-aware Streamlit app that:

  • Collects user concerns with explicit consent and client-side redaction.

  • Uses a prompt contract to keep outputs safe (education only), structured, and auditable.

  • Optionally grounds answers with a curated knowledge base (CDC/WHO/NICE) via policy-aware retrieval and minimal-span citations.

  • Flags red-flag symptoms and escalates to emergency guidance when appropriate.

  • Logs anonymized telemetry (opt-in) and enforces no-PHI storage defaults.

  • Ships with tests, CI pack replays, feature flags, canary/rollback, and basic cost & latency dashboards.


1) Architecture at a Glance

streamlit_app/
├─ app.py                     # UI & orchestration
├─ contracts.py               # Prompt contract(s) & output schemas
├─ redact.py                  # Client-side PII/PHI redaction
├─ llm.py                     # GPT-5 client wrapper (timeouts, retries)
├─ retrieval/
│  ├─ kb_index.py            # Curated KB index (eligible sources only)
│  └─ cite.py                # Minimal-span citation extraction
├─ safety/
│  ├─ red_flags.yaml         # Emergency indicators & copy
│  └─ validator.py           # Schema, refusal, discrepancy checks
├─ eval/
│  ├─ traces/*.json          # Golden traces (anonymized scenarios)
│  └─ replay.py              # CI pack replays & gates
├─ telemetry/
│  └─ logger.py              # Opt-in anonymized metrics
├─ requirements.txt
└─ Dockerfile

Data flow

  1. User input → consent check → redaction (client side where possible).

  2. Optional retrieval from curated KB with eligibility filters (jurisdiction, language, freshness).

  3. Build Context Pack (atomic, timestamped claims + source IDs).

  4. Call GPT-5 with a strict prompt contract (education scope, JSON schema).

  5. Run validators (schema, citations, red-flag escalation, uncertainty/abstention).

  6. Render guidance + follow-ups + citations; record opt-in metrics.


2) The Prompt Contract (Your Primary Safety System)

Why: Models are fluent, not safe by default. The contract encodes scope, style, refusals, output shape, and escalation rules.

# contracts.py
SYSTEM_CONTRACT = """
You are a cautious healthcare education assistant. You provide general information and navigation—not diagnosis or prescriptions.
Rules:
- Use ONLY the provided context (user input + KB excerpts). If insufficient, ask targeted follow-ups.
- Surface emergency red flags; advise immediate local emergency care when applicable.
- Avoid definitive diagnoses, dosing, or treatment directives.
- Use inclusive, plain language. Keep uncertainty explicit.
Output JSON ONLY:
{"summary":"...", "advice":"...", "red_flags":["..."], "followups":["..."], "citations":["..."], "uncertainty":0.0}
"""

Design notes

  • Minimal & testable: Fewer rules, each measurable (e.g., “no diagnosis terms present”).

  • Versioned: contract.health.v1.2.0 with changelog & CI tests.

  • Abstention: If required fields (onset, duration, severity, age band, pregnancy, meds/allergies) are missing, app must prompt—not guess.
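The abstention rule can be enforced with a simple intake-completeness check before any model call; the field names below mirror the list above but the function itself is an illustrative sketch:

```python
# Fields that must be present before the model is allowed to answer.
REQUIRED_FIELDS = ["onset", "duration", "severity", "age_band", "pregnancy", "meds_allergies"]

def missing_fields(intake: dict) -> list[str]:
    """Return intake fields that are absent or empty, so the app can ask
    targeted follow-ups instead of letting the model guess."""
    return [f for f in REQUIRED_FIELDS if not intake.get(f)]

# Example: only onset and severity supplied -> prompt for the rest.
gaps = missing_fields({"onset": "3 days ago", "severity": "mild"})
# gaps == ["duration", "age_band", "pregnancy", "meds_allergies"]
```

If `gaps` is non-empty, render them as follow-up questions and skip the GPT-5 call entirely.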


3) Privacy, Consent, & Redaction

Consent UX

  • Explicit checkbox with scope of processing: education only, no storage of PHI, you may stop anytime.

  • Disable submit until consent is given.

Redaction (client-side first)

  • Naïve patterns (emails, phones, street addresses, MRNs).

  • Replace with placeholders ([EMAIL], [PHONE], [ID]).

  • Encourage users not to enter names or exact addresses.

# redact.py (excerpt)
import re

PATTERNS = [
  (r"\b[\w\.-]+@[\w\.-]+\.\w{2,}\b", "[EMAIL]"),
  (r"\b(?:\+?\d{1,2}\s?)?(?:\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})\b", "[PHONE]"),
  # Anchor on a street suffix so phrases like "3 days" aren't swallowed.
  (r"\b\d{1,5}\s+[A-Za-z0-9.\s]{3,}?(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Lane|Drive)\b", "[ADDRESS]"),
  (r"\b(?:MRN|Patient ID)[:\s]*[A-Za-z0-9-]+\b", "[ID]"),
]

def redact(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = re.sub(pattern, placeholder, text)
    return text

Production: pair pattern redaction with fielded entry (age range, sex at birth, pregnancy yes/no, chronic conditions) to reduce free-text risks.
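Fielded entry can be as small as a structured-intake type that serializes to prompt text, so free-text exposure stays minimal. The field names and rendering here are assumptions, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Intake:
    """Structured, low-risk fields collected via dropdowns instead of free text."""
    age_range: str            # e.g. "30-39"
    sex_at_birth: str         # e.g. "female"
    pregnant: bool
    chronic_conditions: list[str]

    def to_prompt(self) -> str:
        # Render only coarse, non-identifying attributes for the model.
        conds = ", ".join(self.chronic_conditions) or "none reported"
        return (f"Age range: {self.age_range}. Sex at birth: {self.sex_at_birth}. "
                f"Pregnant: {'yes' if self.pregnant else 'no'}. "
                f"Chronic conditions: {conds}.")

intake = Intake("30-39", "female", False, ["asthma"])
# intake.to_prompt() -> "Age range: 30-39. Sex at birth: female. Pregnant: no. Chronic conditions: asthma."
```

Because the values come from dropdowns, nothing in this string can leak a name, address, or ID.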


4) Retrieval & Grounding (Optional but Recommended)

Goal: Improve reliability by citing credible sources. Keep set small and curated (CDC, WHO, NICE, national guidelines). Store:

  • source_id, title, published_at, jurisdiction, language, snippet, url.

Eligibility before similarity

  • Only include sources matching user jurisdiction/language.

  • Enforce freshness windows (e.g., < 24 months) unless guideline is evergreen.

  • Prefer primary guidance > blogs or news.

# retrieval/kb_index.py (conceptual)
def search_kb(query, jurisdiction="US", language="en"):
    eligible = filter_by_policy(KB, jurisdiction, language, max_age_days=730)
    hits = bm25_or_embeddings(eligible, query)
    return topk(hits, k=6)

# retrieval/cite.py
def make_context_pack(user_text, kb_hits):
    claims = []
    for h in kb_hits:
        span = minimal_span(h.snippet, query=user_text)
        claims.append({
          "id": h.source_id, "text": span, "effective_date": h.published_at,
          "tier": h.tier, "url": h.url
        })
    return {"task":"edu_guidance", "claims":claims}

Minimal-span citations keep quotes tight and auditable.
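A naive minimal_span can pick the single sentence in the snippet with the most query-word overlap; this is a sketch, and production systems would use proper span selection:

```python
import re

def minimal_span(snippet: str, query: str) -> str:
    """Return the sentence from the snippet that shares the most words with
    the query, keeping the quoted evidence as short as possible."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", snippet.strip())
    def overlap(sentence: str) -> int:
        return len(query_words & set(re.findall(r"\w+", sentence.lower())))
    return max(sentences, key=overlap)

span = minimal_span(
    "A dry cough with mild fever is common. Most cases resolve on their own. Page last reviewed in 2024.",
    "dry cough for 3 days, mild fever",
)
# span == "A dry cough with mild fever is common."
```

Tight spans also make the validator's job easier: a quoted sentence either appears verbatim in the source or it doesn't.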


5) Red Flags & Escalation

Maintain a red_flags.yaml with symptom patterns and copy:

- name: "Chest pain, severe or crushing"
  patterns: ["severe chest pain", "crushing chest pain", "pain radiating to arm/jaw"]
  action: "Call your local emergency number now."
- name: "Stroke signs"
  patterns: ["face drooping", "slurred speech", "weakness on one side"]
  action: "Call emergency services immediately."
- name: "Suicidal ideation"
  patterns: ["want to harm myself", "suicidal"]
  action: "If in immediate danger, contact local emergency services now."

Flow: pre-screen user text → if any match, short-circuit model call and show emergency guidance (plus international helpline links where applicable).
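detect_red_flags (imported in app.py from safety/validator.py) can be a case-insensitive substring scan over the YAML entries; two entries are inlined below for illustration, while the real app would load red_flags.yaml:

```python
# Inlined for illustration; the app would load these from safety/red_flags.yaml.
RED_FLAGS = [
    {"name": "Chest pain, severe or crushing",
     "patterns": ["severe chest pain", "crushing chest pain"],
     "action": "Call your local emergency number now."},
    {"name": "Stroke signs",
     "patterns": ["face drooping", "slurred speech"],
     "action": "Call emergency services immediately."},
]

def detect_red_flags(text: str) -> list[dict]:
    """Return every red-flag entry whose patterns appear in the user text."""
    lowered = text.lower()
    return [flag for flag in RED_FLAGS
            if any(p in lowered for p in flag["patterns"])]

hits = detect_red_flags("I have crushing chest pain and slurred speech")
# hits contains both the chest-pain and stroke entries
```

Substring matching is deliberately conservative: a false positive shows emergency guidance, a false negative lets an emergency reach the model, so err toward broad patterns.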


6) The Streamlit App (Orchestrated)

# app.py (condensed for length)
import streamlit as st
from contracts import SYSTEM_CONTRACT
from redact import redact
from retrieval.kb_index import search_kb
from retrieval.cite import make_context_pack
from safety.validator import validate_health_response, detect_red_flags
from llm import call_gpt5

st.set_page_config(page_title="Healthcare Navigator (Education)", layout="centered")
st.title("🩺 AI Healthcare Navigator (Education Only)")
st.caption("Not medical advice. For emergencies, contact local emergency services.")

with st.sidebar:
    mode = st.selectbox("Focus", ["Symptoms", "Medication info", "General education"])
    consent = st.checkbox("I consent to processing for educational guidance only.")
    use_kb = st.checkbox("Use curated health sources (citations)", value=True)
    telemetry = st.checkbox("Share anonymized metrics (no raw text)")

user_raw = st.text_area("Describe your concern (avoid names/IDs):", height=140, placeholder="Dry cough for 3 days...")

if st.button("Get guidance", type="primary"):
    if not consent:
        st.warning("Please provide consent to proceed.")
        st.stop()

    # Emergency pre-screen
    emerg = detect_red_flags(user_raw)
    if emerg:
        st.error("⚠️ Potential emergency")
        for e in emerg:
            st.markdown(f"- **{e['name']}** — {e['action']}")
        st.stop()

    user = redact(user_raw)[:4000]
    kb_hits = search_kb(user) if use_kb else []
    pack = make_context_pack(user, kb_hits) if kb_hits else {"claims":[]}

    payload = {
      "system": SYSTEM_CONTRACT,
      "user": f"Mode: {mode}\nUser (redacted): {user}\nContext Pack: {pack}"
    }

    result = call_gpt5(payload)   # returns dict per schema
    ok, issues = validate_health_response(result, require_citations=use_kb)
    if not ok:
        st.warning("The assistant couldn’t produce a safe, structured response.")
        st.json({"issues": issues})
        st.stop()

    # Render
    st.subheader("Summary")
    st.write(result["summary"])
    if result["red_flags"]:
        st.subheader("⚠️ Potential red flags")
        for rf in result["red_flags"]:
            st.markdown(f"- **{rf}**")
    st.subheader("Next steps")
    st.write(result["advice"])
    if result["followups"]:
        st.subheader("Follow-up questions")
        for q in result["followups"]:
            st.markdown(f"- {q}")
    if use_kb and pack["claims"]:
        st.subheader("Citations")
        for c in pack["claims"]:
            st.markdown(f"- [{c['id']}] {c['text']} ({c['effective_date']})")

LLM wrapper (llm.py) should add timeouts, retry with jitter, and require response_format=json.
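A sketch of that wrapper, with the transport injectable so the retry logic stays unit-testable; the real transport would wrap your GPT-5 SDK call with JSON response formatting and the given timeout (how it does so depends on your client library, not on this code):

```python
import json
import random
import time

def call_gpt5(payload: dict, transport=None, retries: int = 3, timeout: float = 30.0) -> dict:
    """Call the model with retries and jittered backoff; require JSON output.

    `transport` takes (payload, timeout) and returns the raw response string.
    In production it wraps your GPT-5 client; in tests it can be a stub.
    """
    if transport is None:
        raise RuntimeError("Wire up a real transport, e.g. your OpenAI client call.")
    last_error = None
    for attempt in range(retries):
        try:
            raw = transport(payload, timeout=timeout)
            return json.loads(raw)          # enforce JSON-only responses
        except Exception as exc:            # retry any transport/parse error
            last_error = exc
            time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.1))  # backoff + jitter
    raise RuntimeError(f"GPT-5 call failed after {retries} attempts: {last_error}")
```

Injecting the transport means CI can exercise the retry and JSON-enforcement paths without network access.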


7) Validators & Guardrails (Pre-Display)

# safety/validator.py (conceptual)
PROHIBITED = ["diagnose", "take X mg", "prescribe", "start antibiotic", "stop your medication"]

def validate_health_response(resp:dict, require_citations:bool)->tuple[bool,list]:
    issues=[]
    # schema
    for k in ["summary","advice","red_flags","followups","uncertainty"]:
        if k not in resp: issues.append(f"Missing field: {k}")
    # style/scope
    text = (resp.get("summary","") + resp.get("advice","")).lower()
    if any(p in text for p in PROHIBITED):
        issues.append("Contains diagnostic/prescribing language.")
    # uncertainty bounds
    u = resp.get("uncertainty", 0.0)
    if not (0.0 <= u <= 1.0): issues.append("Uncertainty out of range.")
    # citations (if KB used)
    if require_citations and not resp.get("citations"):
        issues.append("Missing citations when KB enabled.")
    return (len(issues)==0, issues)

If checks fail, do not show the model output; fall back to static educational copy or ask the user targeted follow-ups.


8) Evaluation & CI: Golden Traces + Pack Replays

Create ~100 anonymized scenarios:

{
  "trace_id":"T-acute-cough-003",
  "user":"Dry cough 3 days, mild fever 38C, no chest pain, non-smoker, age 35",
  "expectations":{
    "must_not_diagnose":true,
    "should_include_followups":["duration","fever","chest pain"],
    "should_not_trigger_emergency":true,
    "max_uncertainty":0.8
  }
}

Pack replay (pre-merge CI):

  • Run traces through the contract with/without KB.

  • Assert: schema valid, no prohibited language, follow-ups present when fields missing, emergency routing correct.

  • Gate on adherence ≥ 0.98, p95 latency ≤ SLO, and $/outcome ≤ budget.

  • Canary new prompts/contracts to 5–10% of traffic; rollback if adherence dips ≥ 2 points.
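eval/replay.py can turn those trace expectations into assertions; the checker below mirrors the trace fields shown above, and the replayed `response` would come from an actual contract run:

```python
def check_trace(trace: dict, response: dict, emergency_triggered: bool) -> list[str]:
    """Compare one replayed response against a golden trace's expectations."""
    exp, failures = trace["expectations"], []
    text = (response.get("summary", "") + " " + response.get("advice", "")).lower()
    if exp.get("must_not_diagnose") and "diagnos" in text:
        failures.append("diagnostic language")
    for topic in exp.get("should_include_followups", []):
        if not any(topic in q.lower() for q in response.get("followups", [])):
            failures.append(f"missing follow-up: {topic}")
    if exp.get("should_not_trigger_emergency") and emergency_triggered:
        failures.append("spurious emergency routing")
    if response.get("uncertainty", 0.0) > exp.get("max_uncertainty", 1.0):
        failures.append("uncertainty above trace limit")
    return failures
```

CI then aggregates failures across all traces and gates the merge on the adherence threshold above.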


9) Cost & Latency Engineering

  • Token budgets: header ≤ 250, context ≤ 1200 (after compression), generation ≤ 300.

  • Compression with guarantees: convert KB pages into atomic claims (single fact + timestamp + URL).

  • Caching: template prompts, retrieval hits, deterministic responses (low temp); show hit-rate × tokens saved.

  • Routing: For benign Q&A, you can front a small draft model (speculative decoding) to prefill tokens that GPT-5 verifies for 1.5–2.5× speedups.

Dashboard: cost per successful answer, p50/p95 latency, adherence, abstention quality, cache hits.
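The caching bullet can be as simple as a keyed memo with hit counters, which makes the dashboard's hit-rate figure directly measurable; the key derivation here is one plausible choice, not a prescribed one:

```python
import hashlib
import json

class PromptCache:
    """Memoize deterministic (low-temperature) responses and track savings."""

    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def key(self, payload: dict) -> str:
        # Stable hash of the payload so identical prompts share one entry.
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def get_or_call(self, payload: dict, fn):
        k = self.key(payload)
        if k in self.store:
            self.hits += 1
            return self.store[k]
        self.misses += 1
        self.store[k] = fn(payload)
        return self.store[k]

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Multiply `hit_rate()` by average tokens per call to get the tokens-saved number for the dashboard.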


10) Deployment & Privacy Posture

Dockerfile (sketch):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Privacy defaults

  • No raw text logs. Only store derived metrics if user opts in.

  • Short retention. Ephemeral session state; purge caches nightly.

  • Config flags: SAFE_MODE=true, STORE_PHI=false (and actually enforce).

  • Secrets: .env via a secret manager; never commit keys.
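"Actually enforce" can mean a guard on the logging path that strips raw text whenever the flags say no; the flag names come from the bullet above, while the guard itself is a sketch:

```python
import os

def phi_storage_allowed() -> bool:
    """Read SAFE_MODE/STORE_PHI and default to the safest interpretation."""
    safe_mode = os.getenv("SAFE_MODE", "true").lower() == "true"
    store_phi = os.getenv("STORE_PHI", "false").lower() == "true"
    return store_phi and not safe_mode

def log_event(event: dict) -> dict:
    # Drop any raw-text field unless PHI storage is explicitly allowed.
    if not phi_storage_allowed():
        event = {k: v for k, v in event.items() if k != "raw_text"}
    return event  # hand off to the opt-in metrics sink
```

Putting the check inside the logger, rather than at call sites, means a forgotten flag fails safe instead of leaking.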

Regulatory fit

  • If you move toward PHI: secure enclave/VPC, BAA with vendor, encryption at rest/in transit, access logs, minimum necessary data, DPIAs (GDPR), and a formal safety review of prompts, validators, and data flows.


11) UX That Builds Trust

  • Front-loaded disclaimer + emergency banner.

  • Plain language and short paragraphs; avoid medical jargon; define terms when needed.

  • Follow-ups as single, specific questions.

  • Citations expandable; link to official guidance.

  • Localization: translate copy and KB; adjust emergency numbers & health-system terms by region.

  • Accessibility: high contrast, keyboard navigation, ARIA labels.


12) Common Failure Modes & Fixes

  • Model “diagnoses” → tighten prohibited list, add evaluator prompt, reject on violation, retrain with negative examples.

  • Hallucinated citations → require minimal-span quotes from curated KB only; reject if spans don’t match source text.

  • Over-abstention → add pattern-matched safe advice for common benign cases; still avoid diagnosis.

  • Latency spikes → shrink context via compression; apply speculative decoding; cache retrieval; cap generation tokens.


Full Single-File Demo (for quick starts)

If you want one file to try locally, use the earlier app.py you have; then modularize into the structure above as you harden.


Conclusion

Building a healthcare education app with Streamlit + GPT-5 is less about clever prompts and more about governance: explicit contracts, curated context with citations, emergency routing, privacy by default, and measurable adherence. Pair a tight prompt contract with policy-aware retrieval, validators, and CI pack replays, and you’ll ship something that’s helpful, safe, and auditable—while staying fast and affordable. When (or if) you expand toward clinical features, the same bones—contracts, eligibility filters, provenance, and refusal paths—are what let you scale responsibly.