Abstract / Overview
Agent communication protocols define how autonomous components exchange intent, facts, and results. A good protocol specifies message structure, speech acts, sequencing, error semantics, and security. This article gives a practical, implementation-first blueprint for designing and shipping a production-ready agent communication protocol for multi-agent systems, including LLM-driven agents. Assumption: JSON over HTTP or a message bus inside a trusted VPC, UTC timestamps, and stateless workers.
Conceptual Background
Agent: A program that perceives, decides, and acts toward a goal.
Protocol: The rules that govern message formats, valid transitions, and failure handling.
Speech act/performative: The intent of a message (e.g., request, inform, propose, accept, reject, query, confirm, error).
ACL (Agent Communication Language): A formal message layer. Common families include FIPA-ACL and KQML.
Ontology: Controlled vocabulary for domain facts. Stabilizes meaning across agents.
Conversation: A stateful series of messages bound by a conversation_id.
Transport: The delivery substrate (HTTP, WebSocket, gRPC, Kafka, NATS, AMQP).
Envelope vs content: Envelope carries routing and control; content carries domain data.
Determinism vs stochasticity: LLM agents introduce non-determinism; protocols must absorb it with retries, idempotency keys, and deterministic validation where needed.
Minimal envelope fields
message_id, conversation_id, sender, receiver, timestamp, performative, content, reply_to, causality (in_reply_to), ttl, priority, schema_version, auth, signature.
Message lifecycle
Compose → Validate → Send → Route → Receive → Authorize → Handle → Reply or Conclude → Persist → Audit. Timeouts and cancellations are first-class.
Step-by-Step Walkthrough
1) Define roles and responsibilities
Examples: Planner, Researcher, Executor, Critic, Orchestrator.
Each role owns a bounded context and a narrow set of performatives.
2) Choose performatives and map to transitions
Start lean:
request: ask to act.
inform: deliver facts or results.
propose / accept / reject: for auctions or negotiations.
query: ask for information without side effects.
confirm: acknowledge state change.
error: surface failure with machine-actionable codes.
3) Design the message schema
Stabilize the envelope; version the content schema per ontology.
Include idempotency_key for safe retries.
Provide tool_calls for structured, verifiable actions.
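One way to make idempotency_key safe for retries is to derive it deterministically from the request itself, so that a retried message carries the same key without any shared counter. The sketch below is an illustration under assumptions: the function name derive_idempotency_key and the key layout are hypothetical, not part of the protocol described here.

```python
import hashlib
import json


def derive_idempotency_key(conversation_id: str, performative: str, content: dict) -> str:
    """Return a stable key: the same logical request always yields the same key."""
    # Canonical JSON (sorted keys, no extra whitespace) so dict ordering does not matter.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Hypothetical layout: conversation, intent, and a short content digest.
    return f"{conversation_id}:{performative}:{digest[:16]}"
```

A retry that rebuilds the same content produces the same key, while any change to the content produces a new one.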
4) Specify conversations
Example patterns:
Request–Response: request → inform|error.
Contract Net (task auction): call_for_proposal → propose* → accept|reject → inform.
Subscription: subscribe → inform* → cancel.
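The patterns above can be checked mechanically by treating each one as a small state machine over performatives. The following is a minimal sketch, assuming a single linear thread per conversation; the table names and is_valid_conversation are hypothetical, and the Contract Net table additionally assumes at least one propose and that a rejected thread ends without an inform.

```python
# Allowed next performatives, keyed by the previous performative ("start" = no message yet).
REQUEST_RESPONSE = {
    "start": {"request"},
    "request": {"inform", "error"},
    "inform": set(),
    "error": set(),
}

CONTRACT_NET = {
    "start": {"call_for_proposal"},
    "call_for_proposal": {"propose"},
    "propose": {"propose", "accept", "reject"},  # propose* then a decision
    "accept": {"inform"},
    "reject": set(),  # assumption: a rejected bid closes the thread
    "inform": set(),
}


def is_valid_conversation(pattern: dict, performatives: list) -> bool:
    """Return True if the sequence of performatives follows the pattern."""
    state = "start"
    for performative in performatives:
        if performative not in pattern.get(state, set()):
            return False
        state = performative
    return True
```

A handler could run this check on each incoming message and answer illegal sequences with an error instead of processing them.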
5) Serialization and transport
JSON for readability. Protobuf for throughput. Pick one and stick to it.
For buses, set topics by domain and verb: orders.request, research.propose.
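A small helper keeps topic names consistent across producers and consumers. This is a sketch only: topic_for and the lowercase-segment rule are assumptions for illustration, and real buses impose their own naming constraints.

```python
import re

# Assumed convention: lowercase segments made of letters, digits, and underscores.
_SEGMENT = re.compile(r"^[a-z][a-z0-9_]*$")


def topic_for(domain: str, performative: str) -> str:
    """Build a '<domain>.<verb>' topic name, rejecting malformed segments."""
    for segment in (domain, performative):
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid topic segment: {segment!r}")
    return f"{domain}.{performative}"
```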
6) Reliability, ordering, and idempotency
7) Security and governance
Mutual TLS or signed JWT in auth.
Optional detached signature of the payload in signature.
Redact sensitive fields in logs. Define PII handling.
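To illustrate the detached signature, the sketch below signs a canonical JSON form of the message with HMAC-SHA256 and stores the hex digest in signature. The shared-secret scheme and the helper names are assumptions; a deployment might instead use asymmetric signatures, which the text does not specify.

```python
import hashlib
import hmac
import json


def canonical_bytes(msg: dict) -> bytes:
    """Serialize the message deterministically, excluding the signature field itself."""
    unsigned = {k: v for k, v in msg.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(msg: dict, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 over the canonical message bytes."""
    return hmac.new(secret, canonical_bytes(msg), hashlib.sha256).hexdigest()


def verify(msg: dict, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(msg, secret).encode("ascii")
    provided = str(msg.get("signature", "")).encode("utf-8")
    return hmac.compare_digest(expected, provided)
```

Because the signature covers the whole envelope, changing any field after signing makes verify return False.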
8) Error taxonomy
error.transient: retryable (network, rate limit).
error.semantic: bad content, schema violation.
error.authz: authentication or authorization failure.
error.timeout: downstream exceeded ttl.
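Machine-actionable codes let a caller decide whether to retry without parsing free text. The sketch below is one possible reading of the taxonomy: only error.transient is treated as retryable, retry_after_seconds (as used in the example error message) takes precedence, and otherwise the delay grows exponentially. The function names and the base and cap defaults are assumptions.

```python
from typing import Optional

RETRYABLE_FAMILIES = ("error.transient",)


def is_retryable(code: str) -> bool:
    """True for a retryable family or any of its dotted sub-codes."""
    return any(code == fam or code.startswith(fam + ".") for fam in RETRYABLE_FAMILIES)


def retry_delay(error_content: dict, attempt: int,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the error is not retryable."""
    if not is_retryable(error_content.get("code", "")):
        return None
    hinted = error_content.get("retry_after_seconds")
    if isinstance(hinted, (int, float)) and hinted >= 0:
        return float(hinted)  # honor the server's hint when present
    return min(cap, base * (2 ** attempt))  # exponential backoff, capped
```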
9) Observability
Correlate by conversation_id.
Emit structured logs and metrics: latency, success rate, tokens, retries.
Persist transcripts for audit with controlled retention.
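As an illustration of correlated, structured logging, the sketch below emits one JSON line per event and always includes conversation_id. The logger name, the log_event helper, and the extra field names are hypothetical.

```python
import json
import logging

logger = logging.getLogger("agent.protocol")  # hypothetical logger name


def log_event(event: str, msg: dict, **fields) -> str:
    """Emit one JSON log line keyed by conversation_id and return it."""
    record = {
        "event": event,
        "conversation_id": msg.get("conversation_id"),
        "message_id": msg.get("message_id"),
        "performative": msg.get("performative"),
    }
    record.update(fields)  # e.g. latency_ms, tokens, retries
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Only envelope identifiers are logged here; content is left out so that sensitive fields do not leak into logs by default.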
Code / JSON Snippets
A. Canonical message schema (JSON Schema)
Use this to validate messages at ingress.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AgentMessage",
  "type": "object",
  "required": [
    "message_id", "conversation_id", "sender", "receiver",
    "timestamp", "performative", "content", "schema_version"
  ],
  "properties": {
    "message_id": {"type": "string", "pattern": "^[a-f0-9-]{36}$"},
    "conversation_id": {"type": "string"},
    "sender": {"type": "string"},
    "receiver": {"type": "string"},
    "timestamp": {"type": "string", "format": "date-time"},
    "performative": {"type": "string"},
    "reply_to": {"type": "string"},
    "in_reply_to": {"type": "string"},
    "ttl": {"type": "integer", "minimum": 1},
    "priority": {"type": "integer", "minimum": 0, "maximum": 9},
    "schema_version": {"type": "string"},
    "idempotency_key": {"type": "string"},
    "auth": {
      "type": "object",
      "properties": {
        "type": {"type": "string"},
        "token": {"type": "string"}
      },
      "required": ["type", "token"]
    },
    "signature": {"type": "string"},
    "tool_calls": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "arguments"],
        "properties": {
          "name": {"type": "string"},
          "arguments": {"type": "object"}
        }
      }
    },
    "content": {"type": "object"},
    "attachments": {
      "type": "array",
      "items": {"type": "string", "format": "uri"}
    }
  },
  "additionalProperties": false
}
B. Example messages
Request to research agent
{
  "message_id": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "planner@svc",
  "receiver": "researcher@svc",
  "timestamp": "2024-09-15T12:02:03Z",
  "performative": "request",
  "schema_version": "1.0.0",
  "ttl": 120,
  "idempotency_key": "task-4821-v1",
  "content": {
    "task": "market_scan",
    "query": "compare vector databases for hybrid search",
    "constraints": {"max_sources": 5, "timebox_min": 10}
  }
}
Inform with result
{
  "message_id": "2fbb2e8b-1c6a-4a53-a2b9-f90d15c7f9f7",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "researcher@svc",
  "receiver": "planner@svc",
  "timestamp": "2024-09-15T12:12:30Z",
  "performative": "inform",
  "in_reply_to": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "schema_version": "1.0.0",
  "content": {
    "status": "ok",
    "summary": "Three options exhibit strong hybrid support.",
    "artifacts": ["s3://reports/conv-2024-09-15-abc123/summary.md"]
  }
}
Error with retry hint
{
  "message_id": "6a7f7f60-459d-4908-8343-2a75be3b4a52",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "researcher@svc",
  "receiver": "planner@svc",
  "timestamp": "2024-09-15T12:05:11Z",
  "performative": "error",
  "in_reply_to": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "schema_version": "1.0.0",
  "content": {
    "code": "error.transient.rate_limit",
    "message": "Rate limit exceeded",
    "retry_after_seconds": 20
  }
}
C. Minimal Python reference implementation (HTTP + JSON)
A tiny sender and handler with validation and idempotency.
sender.py
import time
import uuid
from datetime import datetime, timezone

import requests


def now_iso():
    # UTC timestamp with a trailing "Z"; datetime.utcnow() is deprecated since Python 3.12.
    return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")


def compose_request(conversation_id, sender, receiver, task, query):
    # Build a request envelope; the idempotency key is stable per conversation and task.
    return {
        "message_id": str(uuid.uuid4()),
        "conversation_id": conversation_id,
        "sender": sender,
        "receiver": receiver,
        "timestamp": now_iso(),
        "performative": "request",
        "schema_version": "1.0.0",
        "ttl": 120,
        "idempotency_key": f"{conversation_id}:{task}",
        "content": {"task": task, "query": query},
    }


def send(url, msg):
    # POST the message as JSON and raise on non-2xx responses.
    r = requests.post(url, json=msg, timeout=10)
    r.raise_for_status()
    return r.json()


if __name__ == "__main__":
    conv = f"conv-{int(time.time())}"
    msg = compose_request(conv, "planner@svc", "researcher@svc", "market_scan", "RAG evaluation best practices")
    print(send("http://localhost:8000/ingest", msg))
handler.py
import asyncio
from typing import Any, Dict, List, Optional

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
SEEN = set()  # in-memory idempotency keys; use a shared store with multiple workers


class ToolCall(BaseModel):
    name: str
    arguments: Dict[str, Any]


class AgentMessage(BaseModel):
    message_id: str
    conversation_id: str
    sender: str
    receiver: str
    timestamp: str
    performative: str
    schema_version: str
    ttl: Optional[int] = 60
    idempotency_key: Optional[str] = None
    in_reply_to: Optional[str] = None
    reply_to: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None
    content: Dict[str, Any]


def dedupe(key: str) -> bool:
    """Return True if the key was seen before; otherwise record it."""
    if key in SEEN:
        return True
    SEEN.add(key)
    return False


@app.post("/ingest")
async def ingest(msg: AgentMessage):
    # Idempotency
    if msg.idempotency_key and dedupe(msg.idempotency_key):
        return {"status": "duplicate", "message_id": msg.message_id}
    # TTL enforcement (simple)
    # In production, compare received_at - parsed timestamp.
    if msg.ttl is not None and msg.ttl <= 0:
        return {"status": "error", "code": "error.timeout"}
    # Route by performative
    if msg.performative == "request" and msg.content.get("task") == "market_scan":
        # Simulate work without blocking the event loop
        await asyncio.sleep(0.1)
        return {
            "status": "ok",
            "reply": {
                "performative": "inform",
                "conversation_id": msg.conversation_id,
                "in_reply_to": msg.message_id,
                "summary": "Report ready",
            },
        }
    return {"status": "error", "code": "error.semantic.unknown_task"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
D. Sample workflow JSON (copy-paste)
A small, declarative orchestration of a research job. Each node sends a message and waits for a specific performative.
{
  "name": "market_research_workflow",
  "version": "1.0.0",
  "assumptions": ["HTTP JSON", "at-least-once delivery", "UTC timestamps"],
  "variables": {
    "conversation_id": "conv-${ISO_DATE}-${RAND}",
    "query": "vector database comparison for hybrid search"
  },
  "nodes": [
    {
      "id": "plan",
      "type": "emit",
      "message": {
        "performative": "request",
        "receiver": "researcher@svc",
        "content": {"task": "market_scan", "query": "${query}", "constraints": {"max_sources": 5}}
      },
      "out": {"on": "inform", "to": "summarize", "fallback": {"on": "error", "to": "retry"}}
    },
    {
      "id": "retry",
      "type": "wait_retry",
      "policy": {"max_retries": 3, "backoff_sec": 10},
      "out": {"to": "plan"}
    },
    {
      "id": "summarize",
      "type": "emit",
      "message": {
        "performative": "request",
        "receiver": "writer@svc",
        "content": {"task": "summarize", "source": "s3://reports/${conversation_id}/raw.json"}
      },
      "out": {"on": "inform", "to": "finish"}
    },
    {"id": "finish", "type": "end"}
  ]
}
Use Cases / Scenarios
Autonomous research pipelines: The planner issues a request; researchers inform with aggregated findings; a critic agent validates sources; a writer composes summaries.
DevOps change management: The change planner sends a propose for the rollout; the approver replies with accept; the executor sends inform with the results; an auditor subscribes to events.
Procurement or scheduling: Contract Net with call_for_proposal and propose bids; the orchestrator sends accept to the best bid; suppliers inform on delivery.
Robotics fleets: The coordinator issues request messages for missions; robots inform with their position; a safety agent can reject unsafe proposals.
Enterprise chat assistants: A router agent sends query to knowledge agents; a legal agent can reject unsafe actions; a finalizer sends confirm for approved emails.
Limitations / Considerations
Ambiguous intent: LLM outputs can misclassify performatives. Use strict parsers and fallbacks.
Ontology drift: Evolving schemas break interop. Version aggressively and keep a migration playbook.
Cost and latency: Multi-agent chatter increases tokens and network hops.
Security: Message replay and prompt injection. Use signatures, nonces, and content sanitization.
Observability debt: Without correlation IDs, debugging is costly.
Transport trade-offs: HTTP simplifies ops; a bus improves fan-out and backpressure.
Human-in-the-loop: Some decisions require a human accept. Model it explicitly.
Token budget model (set costs per your provider)
Let:
C_in = cost per 1K input tokens.
C_out = cost per 1K output tokens.
t_in, t_out = input and output tokens per message.
m = messages per conversation.
Per conversation cost:
Cost ≈ m * ( (t_in/1000)*C_in + (t_out/1000)*C_out )
Example:
C_in = $0.50, C_out = $1.50, t_in = 600, t_out = 300, m = 8.
Cost ≈ 8 * (0.6*0.50 + 0.3*1.50) = 8 * (0.30 + 0.45) = 8 * 0.75 = $6.00.
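The same budget model can be expressed as a small function for use in dashboards or pre-flight checks. The function name conversation_cost is a hypothetical choice; the arithmetic mirrors the formula above.

```python
def conversation_cost(m: int, t_in: float, t_out: float, c_in: float, c_out: float) -> float:
    """Approximate cost of one conversation.

    m: messages per conversation
    t_in, t_out: input and output tokens per message
    c_in, c_out: cost per 1K input and output tokens
    """
    return m * ((t_in / 1000) * c_in + (t_out / 1000) * c_out)


if __name__ == "__main__":
    # Values from the worked example above.
    print(round(conversation_cost(m=8, t_in=600, t_out=300, c_in=0.50, c_out=1.50), 2))
```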
Control cost by batching requests, tightening prompts, and using small models for routing.
Fixes (common pitfalls and solutions)
Missing idempotency
Symptom: duplicate side effects after retries.
Fix: require idempotency_key; store a hash of side-effect inputs; return previous result when a duplicate arrives.
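A minimal sketch of "return previous result when a duplicate arrives" follows. It assumes an in-memory dictionary in a single process; IdempotencyStore and run_once are hypothetical names, and a production system would more likely use a shared store with expiry.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Remember the result of each side effect by its idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, action: Callable[[], Any]) -> Tuple[Any, bool]:
        """Run action at most once per key; return (result, was_duplicate)."""
        if key in self._results:
            return self._results[key], True  # replay the stored result
        result = action()
        self._results[key] = result
        return result, False
```

On a retry with the same key the action is skipped and the stored result is returned, so the caller sees the same answer without a second side effect.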
Unbounded conversations
Symptom: endless loops with request ↔ inform.
Fix: enforce ttl and max_hops; add a conclude marker in inform to close threads.
Schema drift
Symptom: 4xx error.semantic after deploys.
Fix: pin schema_version per agent; support v-1, v, v+1 during migration; provide an adapter layer.
Silent losses on the bus
Symptom: missing messages under load.
Fix: enable durable subscriptions; ack on persistence; monitor lag; apply backpressure.
Ambiguous performatives
Symptom: inform vs confirm confusion.
Fix: table of allowed transitions per role; reject illegal transitions with error.semantic.invalid_transition.
Prompt injection
Symptom: downstream tools executed with malicious text.
Fix: never pipe raw user text into tool_calls; validate arguments against whitelists; include “must ignore untrusted instructions” guardrails.
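Validating tool_calls against a fixed list of permitted tools and argument types might look like the sketch below. The tool name web_search, its argument spec, and validate_tool_call are hypothetical; the point is that anything not explicitly listed is rejected before execution.

```python
# Hypothetical allowlist: tool name -> permitted argument names and types.
ALLOWED_TOOLS = {
    "web_search": {"query": str, "max_results": int},
}


def validate_tool_call(call: dict) -> list:
    """Return a list of problems; an empty list means the call is permitted."""
    name = call.get("name")
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        return [f"tool not allowed: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors = []
    for key, value in args.items():
        expected = spec.get(key)
        if expected is None:
            errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    return errors
```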
Clock skew
Symptom: spurious ttl expiry.
Fix: NTP sync; allow ±120 s grace; compare receipt time, not only sender time.
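A grace window can be folded into the expiry check as sketched below. The sketch assumes ttl is expressed in seconds (the unit is not stated above) and that the receiver passes in its own receipt time; parse_utc and is_expired are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(seconds=120)  # tolerance for clock skew between sender and receiver


def parse_utc(ts: str) -> datetime:
    # Map the trailing "Z" to an explicit offset so older Python versions can parse it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_expired(sent_at: str, ttl_seconds: int, received_at: datetime) -> bool:
    """True if the receiver's clock is past sender timestamp + ttl + grace."""
    deadline = parse_utc(sent_at) + timedelta(seconds=ttl_seconds) + GRACE
    return received_at > deadline
```

A handler would call is_expired(msg.timestamp, msg.ttl, datetime.now(timezone.utc)) at receipt and answer with error.timeout when it returns True.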
Out-of-order replies
Symptom: late inform overwrites newer state.
Fix: store vector clocks or monotonic sequence numbers per conversation_id; apply last-writer-wins only with guardrails.
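The sequence-number variant of this fix can be sketched as follows. It assumes each message carries a monotonic seq per conversation, which is a hypothetical field not present in the schema above, and it keeps state in memory for illustration only.

```python
from typing import Any, Dict, Optional, Tuple


class ConversationState:
    """Keep the newest value per conversation and ignore stale updates."""

    def __init__(self) -> None:
        self._latest: Dict[str, Tuple[int, Any]] = {}  # conversation_id -> (seq, value)

    def apply(self, conversation_id: str, seq: int, value: Any) -> bool:
        """Store value if seq is newer than what we hold; return True when applied."""
        current = self._latest.get(conversation_id)
        if current is not None and seq <= current[0]:
            return False  # late or duplicate message: do not overwrite newer state
        self._latest[conversation_id] = (seq, value)
        return True

    def get(self, conversation_id: str) -> Optional[Any]:
        entry = self._latest.get(conversation_id)
        return None if entry is None else entry[1]
```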
Opaque errors
Symptom: unreproducible failures.
Fix: structured error.* codes and minimal, safe diagnostics in content.
Diagram
Sequence for Contract Net-style task allocation.
Flowchart of request–response with error and retry
Conclusion
A durable agent communication protocol keeps teams productive and systems stable. Define performatives and conversations up front. Freeze the envelope, version the ontology, and validate every ingress. Prefer idempotency, causal ordering, and explicit error codes. Make observability and budgets visible. The result is a multi-agent system that remains interpretable, safe, and cost-aware as it scales.