LLMs That Fix Things: Field Service Assistants from Triage to Wrench Time

Introduction

Most “AI in the enterprise” stories end at a summary. Field service needs more: a customer has no heat, a controller is throwing codes, a technician is 40 minutes away, and the business bleeds money every hour equipment is down. Large Language Models (LLMs) are finally useful here—not as chatty bots, but as operational assistants that turn symptoms into a triage plan, select eligible evidence, and drive actions like parts ordering, slot booking, and work-order updates with auditable receipts. This article lays out a production pattern for LLMs in field service, then walks through a real deployment for an HVAC fleet.

Why field service is an LLM-native problem

Service conversations are messy (customer phrasing, photos, error codes), knowledge is scattered (OEM PDFs, tribal notes, maintenance logs), and outcomes require coordinated steps—diagnose → parts → schedule → fix → verify. LLMs excel at stitching heterogeneous signals into a concise plan if you give them the right constraints: what data is eligible, which tools are permitted, and what a valid plan must look like. The win isn’t eloquence; it’s fewer truck rolls, higher first-time fix rate, and shorter time-to-restore.

The operating pattern: from symptom to action

An LLM service assistant should produce a contract-bound artifact, not prose. For example:

  • Triage card with {symptom, likely_cause, confidence, evidence_spans, required_parts, safety_notes, next_action}

  • Scheduler request with {site_id, duration_estimate, skill, window, parts_on_hand?}

  • Work-order delta with {tasks, torque/spec refs, QC checks, photos required}

The runtime—not the model—executes actions via typed tools (e.g., OrderPart, ScheduleVisit, UpdateWorkOrder) and attaches receipts (PO ID, calendar event ID, WO ID). If a safety or policy gate fails (e.g., missing lockout/tagout confirmation), the plan is blocked or downgraded to “advise only.”
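
To make "contract-bound" concrete, here is a minimal sketch of a triage-card type and a runtime-side gate. The field names mirror the list above; the gate logic and the lockout flag are illustrative assumptions, not a specific product's schema.

from dataclasses import dataclass
from typing import List, Literal

@dataclass
class TriageCard:
    symptom: str
    likely_cause: str
    confidence: float                          # model-reported, 0.0-1.0
    evidence_spans: List[str]                  # minimal citations, e.g. "manual:RTU-X p.44 L12-L29"
    required_parts: List[str]
    safety_notes: str
    next_action: Literal["collect", "order_parts", "schedule"]

def gate(card: TriageCard, lockout_confirmed: bool) -> str:
    """The model proposes; the runtime decides whether the plan may execute."""
    if not card.evidence_spans:
        return "blocked: no citations"
    if "lockout" in card.safety_notes.lower() and not lockout_confirmed:
        return "downgraded: advise_only"       # safety/policy gate failed
    return f"approved: {card.next_action}"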

Architecture you can operate

Retrieval. Index only eligible sources: current manuals by model/version, certified troubleshooting trees, prior WO notes for that asset, telemetry windows, and inventory by depot. Each snippet carries freshness, license, and asset/model tags to prevent cross-model contamination.
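
A sketch of retrieval-as-eligibility, assuming each snippet carries the tags described above (source, asset model, license flag, capture time); the field names are illustrative.

from datetime import datetime, timedelta, timezone

def eligible(snippet: dict, asset_model: str, max_age_hours: int = 72) -> bool:
    if snippet["source"] == "telemetry":
        age = datetime.now(timezone.utc) - snippet["captured_at"]
        if age > timedelta(hours=max_age_hours):      # enforce the telemetry window
            return False
    if snippet.get("asset_model") not in (None, asset_model):
        return False                                  # block cross-model contamination
    return snippet.get("license_ok", False)

candidate_snippets = [
    {"source": "manual", "asset_model": "RTU-X", "license_ok": True, "text": "..."},
    {"source": "manual", "asset_model": "RTU-Y", "license_ok": True, "text": "..."},
]
context = [s for s in candidate_snippets if eligible(s, asset_model="RTU-X")]  # keeps only the RTU-X span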

Reasoning. Keep it short and checkable. The assistant fills the triage/scheduler/work-order schemas with a one-sentence rationale and at most two minimal-span citations. Uncertainty triggers “collect more evidence” (photo, make/model plate, thermometer reading) before scheduling.
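
One way to encode that uncertainty rule, assuming a confidence threshold (the 0.7 here is an arbitrary placeholder):

def route_next_action(card: dict, threshold: float = 0.7) -> dict:
    # Low confidence or missing citations forces an evidence-collection step
    # instead of a parts order or a booking.
    if card["confidence"] < threshold or not card["evidence_spans"]:
        return {"next_action": "collect",
                "request": "photo of the data plate and a supply-air temperature reading"}
    return {"next_action": card["next_action"]}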

Execution. Tools run through a broker with least privilege, idempotency keys, and rate limits. The model never sees raw credentials. Every tool returns a receipt pinned into the case timeline.
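
A minimal broker sketch along these lines; the tool allow-list matches the examples above, while the signature, idempotency-key scheme, and in-memory cache are assumptions.

import uuid

ALLOWED_TOOLS = {"OrderPart", "ScheduleVisit", "UpdateWorkOrder", "RequestPhoto"}
_receipts_by_key: dict = {}

def call_tool(incident_id: str, tool: str, args: dict, run) -> dict:
    """`run` is the actual integration call; the model never sees credentials."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"{tool} is not permitted by this bundle")
    key = f"{incident_id}:{tool}:{sorted(args.items())}"
    if key in _receipts_by_key:                    # retried calls return the same receipt
        return _receipts_by_key[key]
    receipt = {"receipt_id": str(uuid.uuid4()), "tool": tool, "result": run(**args)}
    _receipts_by_key[key] = receipt                # caller pins this into the case timeline
    return receipt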

Governance. Versioned “service bundles” combine prompts, retrieval policy, tool permissions, safety gates, and golden tests (e.g., gas leak scenarios). Releases canary on a slice of sites and auto-rollback on safety or KPI regression.
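
A rollback trigger expressed as code rather than prose, assuming the canary KPI is sampled against the pre-release baseline; the 1.3x factor and 60-minute window echo the example bundle later in this article.

def should_rollback(canary_repeat_rate: float, baseline_repeat_rate: float,
                    minutes_observed: int) -> bool:
    # Mirrors "repeat_visit_rate > baseline*1.3 for 60m"; thresholds are illustrative.
    if minutes_observed < 60:
        return False                 # not enough canary data yet
    return canary_repeat_rate > baseline_repeat_rate * 1.3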

Real-world deployment: multi-site HVAC operator

Context.
A regional operator managed 3,800 rooftop units across 220 locations. Average first-time fix was 62%; mis-diagnoses created repeat visits and emergency parts shipments.

Design.

  • Data eligibility. Only OEM PDFs (2021+), site-specific maintenance logs, last 72h telemetry, and inventory snapshots. Contractor chat logs and user forums were ineligible.

  • Artifacts.

    • Triage card forced a single next action: collect, order_parts, or schedule.

    • Scheduler request computed duration and skill (Level 1/2), and checked parts_on_hand before suggesting a window.

    • WO delta listed torque specs from the manual span and a compact QC checklist.

  • Tools. OrderPart(sku, depot) → po_id, ScheduleVisit(site, tech_level, window) → event_id, UpdateWO(wo_id, tasks) → commit_id, RequestPhoto(prompt) → asset_url.

  • Guardrails. Gas/combustion or refrigerant alerts forced human approval; no scheduling was allowed until a make/model span was cited (see the sketch after this list).
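
A sketch of those two guardrails as executable checks; the span prefixes and alert tags are simplified assumptions, not the operator's actual encoding.

def can_schedule(card: dict) -> tuple[bool, str]:
    # Guardrail 1: a make/model citation (OCR'd data plate or model-matched manual span).
    if not any(s.startswith(("plate:", "manual:")) for s in card["evidence_spans"]):
        return False, "blocked: cite a make/model span before scheduling"
    # Guardrail 2: safety-critical categories require a human countersignature.
    if {"gas", "combustion", "refrigerant"} & set(card.get("alert_tags", [])):
        return False, "hold: human approval required"
    return True, "ok to schedule"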

Operations.
The assistant ran in the dispatcher console and mobile tech app. For “no-heat” calls, it pulled telemetry (supply/return air deltas) and the latest service note, produced a triage card with two spans (manual page for pressure switch; telemetry window) and proposed order_parts before booking. If depots lacked stock, it suggested cross-ship from a nearby site with ETA.
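
The parts-first decision in that flow can be sketched as below, assuming a simple inventory shape with per-depot stock and transit times (all names illustrative):

def propose_parts_plan(sku: str, local_depot: dict, nearby_depots: list) -> dict:
    if local_depot["stock"].get(sku, 0) > 0:
        return {"action": "order_parts", "depot": local_depot["id"], "eta_hours": 0}
    # No local stock: cross-ship from the nearest site that has the part.
    for depot in sorted(nearby_depots, key=lambda d: d["transit_hours"]):
        if depot["stock"].get(sku, 0) > 0:
            return {"action": "cross_ship", "depot": depot["id"],
                    "eta_hours": depot["transit_hours"]}
    return {"action": "collect", "note": "no stock found; confirm part number before ordering"}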

Outcomes (10 weeks).

  • First-time fix: 62% → 78% (+16 pts).

  • Repeat truck rolls: −28%.

  • Emergency freight cost: −24% (pre-positioned parts).

  • Dispatcher handle time: −35% per incident.

  • Audit: zero findings—every scheduled job had span-level citations and purchase/order receipts.

Incident & rollback.
A mis-tagged inventory feed briefly showed phantom stock. Rollout monitoring caught the rising "parts missing at arrival" rate; the release rolled back, and the allocator switched to a conservative policy (schedule only when stock is confirmed by two sources).
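
That conservative policy reduces to a small predicate, assuming two independent inventory signals (a depot feed and a recent cycle count; both names are illustrative):

def stock_confirmed(sku: str, depot_feed: dict, cycle_count: dict) -> bool:
    # Schedule only when both sources agree the part is actually on hand.
    return depot_feed.get(sku, 0) > 0 and cycle_count.get(sku, 0) > 0

ScheduleVisit is only proposed when this returns True; otherwise the plan falls back to order_parts or a cross-ship.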

Implementation starter (drop-in patterns)

Service bundle (YAML)

bundle_id: "hvac_service.v3"
purpose: "Diagnose HVAC incidents and coordinate parts+schedule; never bypass safety gates."
retrieval_policy:
  include: ["manuals://oem/*@2021+","telemetry://last72h","workorders://site/*","inventory://depot/*"]
  conflicts: ["prefer:asset_model_match","prefer:newer"]
model_contracts:
  triage_card: ["symptom","likely_cause","confidence","evidence_spans","required_parts","safety_notes","next_action"]
  scheduler_request: ["site_id","tech_level","duration_estimate","window","parts_on_hand"]
  wo_delta: ["tasks","spec_refs","qc_checks","photos_required"]
tools_allowed: ["OrderPart","ScheduleVisit","UpdateWO","RequestPhoto"]
guardrails:
  - "require_make_model_span_before_schedule"
  - "combustion_or_refrigerant -> human_approval"
tests:
  golden_sets: ["no_heat_pswitch","iced_coil","blower_fail","sensor_drift","inventory_miss"]
rollout:
  canary_pct: 10
  rollback_triggers: ["repeat_visit_rate > baseline*1.3 for 60m"]

Triage card (JSON)

{
  "incident_id":"INC-88213",
  "symptom":"No heat; supply air 62°F while setpoint 72°F",
  "likely_cause":"Pressure switch stuck open",
  "confidence":0.78,
  "evidence_spans":["manual:RTU-X p.44 L12-L29","telemetry:delta_T 10:12-10:27"],
  "required_parts":["PS-RTU-X-02"],
  "safety_notes":"Lockout/tagout; verify gas shutoff.",
  "next_action":"order_parts"
}
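
Contract check (Python sketch)

A minimal validation pass before the card is accepted, using the required fields from the bundle above; a real deployment would more likely use JSON Schema, and the checks here are illustrative.

REQUIRED = ["symptom", "likely_cause", "confidence", "evidence_spans",
            "required_parts", "safety_notes", "next_action"]

def validate_triage_card(card: dict) -> list:
    errors = [f"missing field: {k}" for k in REQUIRED if k not in card]
    if not 0.0 <= card.get("confidence", -1.0) <= 1.0:
        errors.append("confidence must be in [0, 1]")
    if card.get("next_action") not in {"collect", "order_parts", "schedule"}:
        errors.append("next_action must be collect | order_parts | schedule")
    if not card.get("evidence_spans"):
        errors.append("at least one evidence span is required")
    return errors            # an empty list means the card can proceed to gating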

Practical tips & pitfalls

  • Photos beat prose. Prompt for a photo of the data plate; OCR to lock make/model before recommending parts.

  • Token discipline. Keep outputs schema-first and short; route common cases to a small model, escalate only on uncertainty.

  • Human in the loop where it matters. Safety-critical work (gas, refrigerant) requires countersignature; encode this as a guardrail, not a guideline.

  • Close the loop. Post-repair telemetry and a quick QC checklist validate the fix and become training data—don’t leave learning on the table.

  • Cost is a risk. Cap retries, set per-incident token budgets, and cache common spans such as manual sections by model SKU; a small caching sketch follows this list.
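
A caching-and-budget sketch, assuming lru_cache as a stand-in for a real cache and a placeholder retrieval lookup; the budget figure is an arbitrary example.

from functools import lru_cache

TOKEN_BUDGET_PER_INCIDENT = 20_000                     # example figure; tune per deployment

def fetch_manual_section(model_sku: str, section: str) -> str:
    """Stand-in for the retrieval index lookup."""
    return f"[manual text for {model_sku} / {section}]"

@lru_cache(maxsize=512)
def manual_spans(model_sku: str, section: str) -> str:
    # The same manual sections recur across incidents for a given SKU, so cache them.
    return fetch_manual_section(model_sku, section)

def within_budget(tokens_used: int, next_call_estimate: int) -> bool:
    return tokens_used + next_call_estimate <= TOKEN_BUDGET_PER_INCIDENT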

What to build first

Pick one symptom class (“no heat” or “fan won’t start”) and one product family. Index just the current OEM manuals and the last 72 hours of telemetry. Ship the triage card + OrderPart + ScheduleVisit. Run one week in advice mode; then enable gated execution with a 10% canary. Expand to more SKUs only after first-time fix moves in the right direction.

Conclusion

Real-life applications of LLMs succeed when they produce governed artifacts and drive actions with receipts. In field service, that means concise triage, verified parts, smart scheduling, and QC that proves the result. Treat retrieval as eligibility, reasoning as a contract, execution as sandboxed tools, and governance as code—and you’ll convert AI from nicer chat to fewer truck rolls, faster restores, and happier customers.