Agent Communication Protocol: A Practical Guide for Multi-Agent Systems

Rohit Gupta
Aug 25

Abstract / Overview

Agent communication protocols define how autonomous components exchange intent, facts, and results. A good protocol specifies message structure, speech acts, sequencing, error semantics, and security. This article gives a practical, implementation-first blueprint for designing and shipping a production-ready agent communication protocol for multi-agent systems, including LLM-driven agents. Assumptions: JSON over HTTP or a message bus inside a trusted VPC, UTC timestamps, and stateless workers.

Conceptual Background

Agent: A program that perceives, decides, and acts toward a goal.
Protocol: The rules that govern message formats, valid transitions, and failure handling.
Speech act/performative: The intent of a message (e.g., request, inform, propose, accept, reject, query, confirm, error).
ACL (Agent Communication Language): A formal message layer. Common families:
- FIPA ACL: Standard performatives and conversation semantics.
- KQML: Knowledge-level messaging with performatives and ontologies.
Ontology: Controlled vocabulary for domain facts. Stabilizes meaning across agents.
Conversation: A stateful series of messages bound by a conversation_id.
Transport: The delivery substrate (HTTP, WebSocket, gRPC, Kafka, NATS, AMQP).
Envelope vs content: Envelope carries routing and control; content carries domain data.
Determinism vs stochasticity: LLM agents introduce non-determinism; protocols must absorb it with retries, idempotency keys, and deterministic validation where needed.

Minimal envelope fields

message_id, conversation_id, sender, receiver, timestamp, performative, content, reply_to, in_reply_to (the causal link to the message being answered), ttl, priority, schema_version, auth, signature.

Message lifecycle

Compose → Validate → Send → Route → Receive → Authorize → Handle → Reply or Conclude → Persist → Audit. Timeouts and cancellations are first-class.

Step-by-Step Walkthrough

1) Define roles and responsibilities

Examples: Planner, Researcher, Executor, Critic, Orchestrator.
Each role owns a bounded context and a narrow set of performatives.
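
One way to make the "narrow set of performatives" enforceable is a lookup table consulted before a message is sent. The sketch below is illustrative only: the role-to-performative assignments and the helper name may_send are assumptions, not something the protocol prescribes.

```python
# Hypothetical role-to-performative table; the assignments are illustrative only.
ROLE_PERFORMATIVES = {
    "Planner": {"request", "query", "accept", "reject"},
    "Researcher": {"inform", "propose", "error"},
    "Executor": {"inform", "confirm", "error"},
    "Critic": {"inform", "reject", "error"},
    "Orchestrator": {"request", "query", "accept", "reject", "confirm"},
}


def may_send(role: str, performative: str) -> bool:
    """Return True if the role is allowed to emit the performative."""
    return performative in ROLE_PERFORMATIVES.get(role, set())


print(may_send("Planner", "request"))    # True
print(may_send("Researcher", "accept"))  # False
```

Unknown roles get an empty set, so the check fails closed.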

2) Choose performatives and map to transitions

Start lean:
- request: ask to act.
- inform: deliver facts or results.
- propose / accept / reject: for auctions or negotiations.
- query: ask for information without side effects.
- confirm: acknowledge state change.
- error: surface failure with machine-actionable codes.

3) Design the message schema

Stabilize the envelope; version the content schema per ontology.
Include idempotency_key for safe retries.
Provide tool_calls for structured, verifiable actions.
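
As a rough sketch of these three points, the helper below wraps versioned content in a fixed envelope and derives idempotency_key by hashing the conversation ID together with a canonical form of the content. The hashing scheme and the function names are assumptions made for illustration; any key that stays stable across retries of the same logical request would serve.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def derive_idempotency_key(conversation_id: str, content: dict) -> str:
    """Hash the conversation and canonicalized content so retries reuse the key."""
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{conversation_id}|{canonical}".encode("utf-8"))
    return digest.hexdigest()


def make_envelope(conversation_id, sender, receiver, performative, content,
                  tool_calls=None, schema_version="1.0.0"):
    """Build a message: stable envelope fields around versioned content."""
    msg = {
        "message_id": str(uuid.uuid4()),  # unique per attempt
        "conversation_id": conversation_id,
        "sender": sender,
        "receiver": receiver,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "performative": performative,
        "schema_version": schema_version,
        "idempotency_key": derive_idempotency_key(conversation_id, content),
        "content": content,
    }
    if tool_calls:
        msg["tool_calls"] = tool_calls  # list of {"name": ..., "arguments": {...}}
    return msg
```

Two attempts at the same request then differ in message_id but share idempotency_key, which is what lets a receiver recognize the retry.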

4) Specify conversations

Example patterns (Contract Net and Subscription add call_for_proposal, subscribe, and cancel to the lean set above):
- Request–Response: request → inform|error.
- Contract Net (task auction): call_for_proposal → propose* → accept|reject → inform.
- Subscription: subscribe → inform* → cancel.
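
These patterns can be written down as small transition tables and checked mechanically. The sketch below is one possible encoding, under stated assumptions: it follows a single linear thread (one bidder in the Contract Net case), requires at least one propose before accept or reject, and treats reject as closing the thread. The names PATTERNS and validate_trace are hypothetical.

```python
# Allowed next performatives per pattern, keyed by the previous performative.
# None marks the start of a conversation; an empty set means the thread is closed.
PATTERNS = {
    "request_response": {
        None: {"request"},
        "request": {"inform", "error"},
        "inform": set(),
        "error": set(),
    },
    "contract_net": {
        None: {"call_for_proposal"},
        "call_for_proposal": {"propose"},
        "propose": {"propose", "accept", "reject"},
        "accept": {"inform"},
        "reject": set(),
        "inform": set(),
    },
    "subscription": {
        None: {"subscribe"},
        "subscribe": {"inform", "cancel"},
        "inform": {"inform", "cancel"},
        "cancel": set(),
    },
}


def validate_trace(pattern: str, performatives: list) -> bool:
    """Check that a sequence of performatives is a legal walk through the pattern."""
    table = PATTERNS[pattern]
    previous = None
    for performative in performatives:
        if performative not in table.get(previous, set()):
            return False
        previous = performative
    return True


print(validate_trace("request_response", ["request", "inform"]))   # True
print(validate_trace("request_response", ["request", "confirm"]))  # False
```

The same table can drive runtime checks: look up the last performative seen for a conversation_id and reject anything not in the allowed set.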

5) Serialization and transport

JSON for readability. Protobuf for throughput. Pick one and stick to it.
For buses, name topics by domain and verb: orders.request, research.propose.

6) Reliability, ordering, and idempotency

Use at-least-once delivery with deduplication on message_id or idempotency_key.
Enforce causal ordering per conversation_id.
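
A minimal in-memory sketch of both rules follows. It assumes each message carries a per-conversation sequence number in a field called seq, which is not part of the envelope described in this article, and it approximates causal ordering with strict in-order release per conversation_id. A production version would persist this state rather than hold it in process memory.

```python
class ConversationInbox:
    """Deduplicate by message_id and release messages in seq order per conversation."""

    def __init__(self):
        self.seen_ids = set()  # message_ids already accepted
        self.next_seq = {}     # conversation_id -> next expected seq
        self.pending = {}      # conversation_id -> {seq: message} waiting for a gap to fill

    def accept(self, msg: dict) -> list:
        """Return the messages that are now deliverable, in order."""
        if msg["message_id"] in self.seen_ids:
            return []  # duplicate from an at-least-once redelivery
        self.seen_ids.add(msg["message_id"])

        conv = msg["conversation_id"]
        expected = self.next_seq.get(conv, 0)
        if msg["seq"] < expected:
            return []  # stale: this position was already delivered
        buffer = self.pending.setdefault(conv, {})
        buffer[msg["seq"]] = msg

        ready = []
        while expected in buffer:  # drain every message that is now contiguous
            ready.append(buffer.pop(expected))
            expected += 1
        self.next_seq[conv] = expected
        return ready
```

A message that arrives early is held until the gap before it is filled; a redelivered message is dropped without side effects.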

7) Security and governance

Use mutual TLS at the transport layer, or carry a signed JWT in auth.
Optionally, put a detached signature of the payload in signature.
Redact sensitive fields in logs. Define PII handling.
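
The signature and redaction steps might look like the sketch below. It uses HMAC-SHA256 with a shared secret as a stand-in; that choice, the canonical JSON form, and the SENSITIVE_FIELDS list are assumptions for illustration. Agents that do not share a secret would more likely use an asymmetric signature scheme instead.

```python
import hashlib
import hmac
import json

SENSITIVE_FIELDS = {"auth", "signature"}  # hypothetical redaction list


def canonical_bytes(msg: dict) -> bytes:
    """Serialize everything except the signature itself in a stable byte form."""
    unsigned = {k: v for k, v in msg.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(msg: dict, key: bytes) -> dict:
    """Return a copy of the message with a hex HMAC-SHA256 in the signature field."""
    signed = dict(msg)
    signed["signature"] = hmac.new(key, canonical_bytes(msg), hashlib.sha256).hexdigest()
    return signed


def verify(msg: dict, key: bytes) -> bool:
    """Recompute the HMAC and compare in constant time."""
    expected = hmac.new(key, canonical_bytes(msg), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg.get("signature", ""))


def redact_for_log(msg: dict) -> dict:
    """Replace sensitive top-level fields before the message reaches a log sink."""
    return {k: ("[REDACTED]" if k in SENSITIVE_FIELDS else v) for k, v in msg.items()}
```

Because the signature covers the whole envelope except the signature field, changing the receiver or content after signing makes verify return False.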

8) Error taxonomy

error.transient: retryable (network, rate limit).
error.semantic: bad content, schema violation.
error.authz: authentication or authorization failure.
error.timeout: downstream exceeded ttl.
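
Because the codes are hierarchical, a caller can decide whether to retry from the prefix alone. The sketch below treats only error.transient codes as retryable, honors a retry_after_seconds hint when the error content carries one, and otherwise falls back to exponential backoff. The retry limit, the base delay, and the function name are assumptions, not part of the taxonomy.

```python
def retry_delay(error_content: dict, attempt: int, max_retries: int = 3,
                base_seconds: float = 1.0):
    """Return seconds to wait before retrying, or None if no retry should happen.

    `attempt` is the number of retries already made (0 for the first failure).
    """
    code = error_content.get("code", "")
    retryable = code == "error.transient" or code.startswith("error.transient.")
    if not retryable or attempt >= max_retries:
        return None
    hint = error_content.get("retry_after_seconds")
    if hint is not None:
        return float(hint)  # the server's hint wins over local backoff
    return base_seconds * (2 ** attempt)  # exponential backoff: 1, 2, 4, ...


print(retry_delay({"code": "error.transient.rate_limit", "retry_after_seconds": 20}, 0))  # 20.0
print(retry_delay({"code": "error.semantic"}, 0))  # None
```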

9) Observability

Correlate by conversation_id.
Emit structured logs and metrics: latency, success rate, tokens, retries.
Persist transcripts for audit with controlled retention.

Code / JSON Snippets

A. Canonical message schema (JSON Schema)

Use this to validate messages at ingress.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AgentMessage",
  "type": "object",
  "required": [
    "message_id", "conversation_id", "sender", "receiver",
    "timestamp", "performative", "content", "schema_version"
  ],
  "properties": {
    "message_id": {"type": "string", "pattern": "^[a-f0-9-]{36}$"},
    "conversation_id": {"type": "string"},
    "sender": {"type": "string"},
    "receiver": {"type": "string"},
    "timestamp": {"type": "string", "format": "date-time"},
    "performative": {"type": "string"},
    "reply_to": {"type": "string"},
    "in_reply_to": {"type": "string"},
    "ttl": {"type": "integer", "minimum": 1},
    "priority": {"type": "integer", "minimum": 0, "maximum": 9},
    "schema_version": {"type": "string"},
    "idempotency_key": {"type": "string"},
    "auth": {
      "type": "object",
      "properties": {
        "type": {"type": "string"},
        "token": {"type": "string"}
      },
      "required": ["type", "token"]
    },
    "signature": {"type": "string"},
    "tool_calls": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "arguments"],
        "properties": {
          "name": {"type": "string"},
          "arguments": {"type": "object"}
        }
      }
    },
    "content": {"type": "object"},
    "attachments": {
      "type": "array",
      "items": {"type": "string", "format": "uri"}
    }
  },
  "additionalProperties": false
}

B. Example messages

Request to research agent

{
  "message_id": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "planner@svc",
  "receiver": "researcher@svc",
  "timestamp": "2024-09-15T12:02:03Z",
  "performative": "request",
  "schema_version": "1.0.0",
  "ttl": 120,
  "idempotency_key": "task-4821-v1",
  "content": {
    "task": "market_scan",
    "query": "compare vector databases for hybrid search",
    "constraints": {"max_sources": 5, "timebox_min": 10}
  }
}

Inform with result

{
  "message_id": "2fbb2e8b-1c6a-4a53-a2b9-f90d15c7f9f7",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "researcher@svc",
  "receiver": "planner@svc",
  "timestamp": "2024-09-15T12:12:30Z",
  "performative": "inform",
  "in_reply_to": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "schema_version": "1.0.0",
  "content": {
    "status": "ok",
    "summary": "Three options exhibit strong hybrid support.",
    "artifacts": ["s3://reports/conv-2024-09-15-abc123/summary.md"]
  }
}

Error with retry hint

{
  "message_id": "6a7f7f60-459d-4908-8343-2a75be3b4a52",
  "conversation_id": "conv-2024-09-15-abc123",
  "sender": "researcher@svc",
  "receiver": "planner@svc",
  "timestamp": "2024-09-15T12:05:11Z",
  "performative": "error",
  "in_reply_to": "7e1a6b3e-3a74-4c7a-8c6f-2e21b8f59f44",
  "schema_version": "1.0.0",
  "content": {
    "code": "error.transient.rate_limit",
    "message": "Rate limit exceeded",
    "retry_after_seconds": 20
  }
}

C. Minimal Python reference implementation (HTTP + JSON)

A tiny sender and handler with validation and idempotency.

sender.py

import uuid, time, requests
from datetime import datetime, timezone

def now_iso():
    # Timezone-aware UTC; datetime.utcnow() is deprecated as of Python 3.12.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def compose_request(conversation_id, sender, receiver, task, query):
    return {
        "message_id": str(uuid.uuid4()),
        "conversation_id": conversation_id,
        "sender": sender,
        "receiver": receiver,
        "timestamp": now_iso(),
        "performative": "request",
        "schema_version": "1.0.0",
        "ttl": 120,
        "idempotency_key": f"{conversation_id}:{task}",
        "content": {"task": task, "query": query}
    }

def send(url, msg):
    r = requests.post(url, json=msg, timeout=10)
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    conv = "conv-{}".format(int(time.time()))
    msg = compose_request(conv, "planner@svc", "researcher@svc", "market_scan", "RAG evaluation best practices")
    print(send("http://localhost:8000/ingest", msg))

handler.py

from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional, List, Dict
import uvicorn, time

app = FastAPI()
SEEN = set()

class ToolCall(BaseModel):
    name: str
    arguments: Dict[str, object]

class AgentMessage(BaseModel):
    message_id: str
    conversation_id: str
    sender: str
    receiver: str
    timestamp: str
    performative: str
    schema_version: str
    ttl: Optional[int] = 60
    idempotency_key: Optional[str] = None
    in_reply_to: Optional[str] = None
    reply_to: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None
    content: Dict[str, object]

def dedupe(key: str) -> bool:
    if key in SEEN: 
        return True
    SEEN.add(key)
    return False

@app.post("/ingest")
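def ingest(msg: AgentMessage):
    # Plain def: FastAPI runs it in a worker thread, so the blocking sleep below does not stall the event loop.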
    # Idempotency
    if msg.idempotency_key and dedupe(msg.idempotency_key):
        return {"status": "duplicate", "message_id": msg.message_id}

    # TTL enforcement (simple): reject a non-positive ttl, including 0.
    # In production, compare received_at - parsed timestamp against ttl.
    if msg.ttl is not None and msg.ttl <= 0:
        return {"status": "error", "code": "error.timeout"}

    # Route by performative
    if msg.performative == "request" and msg.content.get("task") == "market_scan":
        # Simulate work
        time.sleep(0.1)
        return {
            "status": "ok",
            "reply": {
                "performative": "inform",
                "conversation_id": msg.conversation_id,
                "in_reply_to": msg.message_id,
                "summary": "Report ready"
            }
        }

    return {"status": "error", "code": "error.semantic.unknown_task"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
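
The handler's TTL comment points at a fuller check: compare the receipt time with the parsed sender timestamp. A minimal sketch of that comparison follows, assuming ISO 8601 UTC timestamps as used throughout this article. The function name is_expired and the grace_seconds parameter are hypothetical; the grace allowance exists to absorb clock skew between sender and receiver.

```python
from datetime import datetime, timezone


def is_expired(timestamp: str, ttl_seconds: int, received_at: datetime,
               grace_seconds: int = 0) -> bool:
    """True if the message is older than ttl (plus grace) at the moment of receipt."""
    # fromisoformat() only accepts a trailing "Z" from Python 3.11, so normalize it.
    sent_at = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if sent_at.tzinfo is None:
        sent_at = sent_at.replace(tzinfo=timezone.utc)  # the protocol assumes UTC
    age = (received_at - sent_at).total_seconds()
    return age > ttl_seconds + grace_seconds


received = datetime(2024, 9, 15, 12, 4, 4, tzinfo=timezone.utc)
print(is_expired("2024-09-15T12:02:03Z", 120, received))                     # True (121 s old)
print(is_expired("2024-09-15T12:02:03Z", 120, received, grace_seconds=120))  # False
```

In the handler, received_at would be datetime.now(timezone.utc) captured as early as possible at ingress.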

D. Sample workflow JSON (copy-paste)

A small, declarative orchestration of a research job. Each node sends a message and waits for a specific performative.

{
  "name": "market_research_workflow",
  "version": "1.0.0",
  "assumptions": ["HTTP JSON", "at-least-once delivery", "UTC timestamps"],
  "variables": {
    "conversation_id": "conv-${ISO_DATE}-${RAND}",
    "query": "vector database comparison for hybrid search"
  },
  "nodes": [
    {
      "id": "plan",
      "type": "emit",
      "message": {
        "performative": "request",
        "receiver": "researcher@svc",
        "content": {"task": "market_scan", "query": "${query}", "constraints": {"max_sources": 5}}
      },
      "out": {"on": "inform", "to": "summarize", "fallback": {"on": "error", "to": "retry"}}
    },
    {
      "id": "retry",
      "type": "wait_retry",
      "policy": {"max_retries": 3, "backoff_sec": 10},
      "out": {"to": "plan"}
    },
    {
      "id": "summarize",
      "type": "emit",
      "message": {
        "performative": "request",
        "receiver": "writer@svc",
        "content": {"task": "summarize", "source": "s3://reports/${conversation_id}/raw.json"}
      },
      "out": {"on": "inform", "to": "finish"}
    },
    {"id": "finish", "type": "end"}
  ]
}
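
To show how such a document could be executed, here is a small interpreter sketch. The semantics are inferred from the JSON above and are assumptions: the first node is the entry point, an emit node follows out.on or its fallback depending on the reply performative, and a wait_retry node counts attempts against max_retries. The send callback is a stand-in for the real transport, and the demo workflow is a trimmed copy with the same node shapes.

```python
from string import Template


def substitute(value, variables):
    """Recursively replace ${name} placeholders in strings; unknown names are left as-is."""
    if isinstance(value, str):
        return Template(value).safe_substitute(variables)
    if isinstance(value, dict):
        return {k: substitute(v, variables) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute(v, variables) for v in value]
    return value


def run_workflow(workflow, send, variables, max_steps=20):
    """Walk the node graph; `send` takes a message dict and returns the reply performative."""
    nodes = {n["id"]: n for n in workflow["nodes"]}
    current = workflow["nodes"][0]["id"]
    retries = 0
    visited = []
    for _ in range(max_steps):
        node = nodes[current]
        visited.append(current)
        if node["type"] == "end":
            return visited
        if node["type"] == "emit":
            reply = send(substitute(node["message"], variables))
            out = node["out"]
            fallback = out.get("fallback")
            if reply == out["on"]:
                current = out["to"]
            elif fallback and reply == fallback["on"]:
                current = fallback["to"]
            else:
                raise RuntimeError(f"unexpected reply {reply!r} at node {current!r}")
        elif node["type"] == "wait_retry":
            retries += 1
            if retries > node["policy"]["max_retries"]:
                raise RuntimeError("retry budget exhausted")
            current = node["out"]["to"]  # a real runner would sleep backoff_sec here
        else:
            raise ValueError(f"unknown node type {node['type']!r}")
    raise RuntimeError("max_steps exceeded")


workflow = {
    "nodes": [
        {"id": "plan", "type": "emit",
         "message": {"performative": "request", "receiver": "researcher@svc",
                     "content": {"task": "market_scan", "query": "${query}"}},
         "out": {"on": "inform", "to": "finish",
                 "fallback": {"on": "error", "to": "retry"}}},
        {"id": "retry", "type": "wait_retry",
         "policy": {"max_retries": 3, "backoff_sec": 10}, "out": {"to": "plan"}},
        {"id": "finish", "type": "end"},
    ]
}
replies = iter(["error", "inform"])  # stub transport: fail once, then succeed
sent = []


def fake_send(message):
    sent.append(message)
    return next(replies)


print(run_workflow(workflow, fake_send, {"query": "vector databases"}))
# ['plan', 'retry', 'plan', 'finish']
```

The max_steps guard is there so a malformed graph cannot loop forever.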

Use Cases / Scenarios

Autonomous research pipelines: Planner issues a request; researchers inform with aggregated findings; a critic agent validates sources; a writer composes summaries.
DevOps change management: Change planner sends propose for a rollout; approver replies accept; executor sends inform with results; auditor subscribes to events.
Procurement or scheduling: Contract Net with call_for_proposal and propose bids; orchestrator sends accept to the best bid; suppliers inform on delivery.
Robotics fleets: Coordinator issues request missions; robots inform position; a safety agent can reject unsafe proposals.
Enterprise chat assistants: Router agent sends query to knowledge agents; legal agent can reject unsafe actions; finalizer sends confirm for approved emails.

Limitations / Considerations

Ambiguous intent: LLM outputs can misclassify performatives. Use strict parsers and fallbacks.
Ontology drift: Evolving schemas break interop. Version aggressively and keep a migration playbook.
Cost and latency: Multi-agent chatter increases tokens and network hops.
Security: Message replay and prompt injection are the main threats. Use signatures, nonces, and content sanitization.
Observability debt: Without correlation IDs, debugging is costly.
Transport trade-offs: HTTP simplifies ops; a bus improves fan-out and backpressure.
Human-in-the-loop: Some decisions require a human accept. Model it explicitly.
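
For the ambiguous-intent point, a strict parser with a fallback could be sketched as follows. It accepts only an exact, known performative after trimming whitespace, quotes, and a trailing period, and returns an error message rather than guessing. The error code error.semantic.unknown_performative and both function names are hypothetical.

```python
from typing import Optional

PERFORMATIVES = {"request", "inform", "propose", "accept",
                 "reject", "query", "confirm", "error"}


def parse_performative(raw: str) -> Optional[str]:
    """Normalize a model-produced label; return None unless it is an exact known performative."""
    label = raw.strip().strip("\"'`.").lower()
    return label if label in PERFORMATIVES else None


def classify_or_fallback(raw: str) -> dict:
    """Accept a clean label, otherwise surface a machine-actionable error instead of guessing."""
    label = parse_performative(raw)
    if label is None:
        return {
            "performative": "error",
            "content": {"code": "error.semantic.unknown_performative", "message": raw[:80]},
        }
    return {"performative": label}


print(parse_performative(' "Inform". '))                 # inform
print(parse_performative("I think this is a request"))  # None
```

Rejecting free-form sentences outright keeps the non-determinism of the model from leaking into routing decisions.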

Token budget model (set costs per your provider)

Let:

C_in = cost per 1K input tokens.
C_out = cost per 1K output tokens.
t_in, t_out = average input and output tokens per message.
m = messages per conversation.

Per conversation cost:

Cost ≈ m * ( (t_in/1000)*C_in + (t_out/1000)*C_out )

Example:

C_in = $0.50, C_out = $1.50, t_in = 600, t_out = 300, m = 8.
Cost ≈ 8 * (0.6*0.50 + 0.3*1.50) = 8 * (0.30 + 0.45) = 8 * 0.75 = $6.00.

Control cost by batching requests, tightening prompts, and using small models for routing.
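
The budget formula translates directly into a helper that can be dropped into a planning script. The function name conversation_cost is an assumption; the parameters mirror the symbols defined above, and the call reproduces the worked example.

```python
def conversation_cost(m, t_in, t_out, c_in, c_out):
    """Approximate cost of one conversation; c_in and c_out are prices per 1K tokens."""
    return m * ((t_in / 1000) * c_in + (t_out / 1000) * c_out)


print(round(conversation_cost(m=8, t_in=600, t_out=300, c_in=0.50, c_out=1.50), 2))  # 6.0
```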

Fixes (common pitfalls and solutions)

- Missing idempotency
  Symptom: duplicate side effects after retries.
  Fix: require idempotency_key; store a hash of side-effect inputs; return previous result when a duplicate arrives.
- Unbounded conversations
  Symptom: endless loops with request ↔ inform.
  Fix: enforce ttl and max_hops; add a conclude marker in inform to close threads.
- Schema drift
  Symptom: 4xx error.semantic after deploys.
  Fix: pin schema_version per agent; support v-1, v, v+1 during migration; provide an adapter layer.
- Silent losses on the bus
  Symptom: missing messages under load.
  Fix: enable durable subscriptions; ack on persistence; monitor lag; apply backpressure.
- Ambiguous performatives
  Symptom: inform vs confirm confusion.
  Fix: table of allowed transitions per role; reject illegal transitions with error.semantic.invalid_transition.
- Prompt injection
  Symptom: downstream tools executed with malicious text.
  Fix: never pipe raw user text into tool_calls; validate arguments against whitelists; include “must ignore untrusted instructions” guardrails.
- Clock skew
  Symptom: spurious ttl expiry.
  Fix: NTP sync; allow ±120 s grace; compare receipt time, not only sender time.
- Out-of-order replies
  Symptom: late inform overwrites newer state.
  Fix: store vector clocks or monotonic sequence numbers per conversation_id; apply last-writer-wins only with guardrails.
- Opaque errors
  Symptom: unreproducible failures.
  Fix: structured error.* codes and minimal, safe diagnostics in content.
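
As an illustration of the prompt-injection fix, tool calls can be checked against an allowlist (whitelist) before anything executes. The tool name web_search, its argument rules, and the helper validate_tool_call are hypothetical; the sketch checks only names and value constraints, and a fuller version would also enforce required arguments.

```python
# Hypothetical allowlist: tool name -> permitted argument names and their validators.
ALLOWED_TOOLS = {
    "web_search": {
        "query": lambda v: isinstance(v, str) and 0 < len(v) <= 200,
        # bool is a subclass of int in Python, so exclude it explicitly.
        "max_results": lambda v: isinstance(v, int) and not isinstance(v, bool) and 1 <= v <= 10,
    },
}


def validate_tool_call(call: dict) -> list:
    """Return a list of problems; an empty list means the call may be executed."""
    spec = ALLOWED_TOOLS.get(call.get("name"))
    if spec is None:
        return [f"tool not allowed: {call.get('name')!r}"]
    problems = []
    for arg, value in call.get("arguments", {}).items():
        if arg not in spec:
            problems.append(f"unexpected argument: {arg}")
        elif not spec[arg](value):
            problems.append(f"invalid value for {arg}")
    return problems


print(validate_tool_call({"name": "web_search", "arguments": {"query": "hybrid search", "max_results": 5}}))
# []
print(validate_tool_call({"name": "shell", "arguments": {"cmd": "rm -rf /"}}))
# ["tool not allowed: 'shell'"]
```

A non-empty result maps naturally onto an error.semantic reply, so the sender learns why the call was refused.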

Diagram

[Figure: sequence diagram for Contract Net-style task allocation.]

[Figure: flowchart of request–response with error and retry.]

Conclusion

A durable agent communication protocol keeps teams productive and systems stable. Define performatives and conversations up front. Freeze the envelope, version the ontology, and validate every ingress. Prefer idempotency, causal ordering, and explicit error codes. Make observability and budgets visible. The result is a multi-agent system that remains interpretable, safe, and cost-aware as it scales.