Overview
The article describes best practices for designing, building, evaluating, and refining tools that large-language-model (LLM) agents use. Tools here mean deterministic software (APIs, SDKs, services) that agents can call. The piece emphasizes that tools must be designed for agents, rather than just wrapping APIs. It covers:
How to prototype tools
How to run systematic evaluations of tools (realistic tasks, metrics)
How to collaborate with agents (e.g., using Anthropic’s Claude) to improve tools
Key principles of tool design: choosing the right tools to build, clear namespacing, meaningful tool responses, token and context efficiency, and prompt-engineering of tool descriptions/specs
Context / Conceptual Background
Definitions and setting:
Tool vs Agent: Agents are non-deterministic (LLMs) and may choose to call tools among other actions. Tools are deterministic, predictable modules that the agent calls when needed. (Anthropic)
Model Context Protocol (MCP): An open protocol through which agents connect to servers that expose (“load”) many tools. (Anthropic)
Tools are only useful if agents can reliably choose when and how to use them; designing tools for that requires different trade-offs than for purely software-developer endpoints. (Anthropic)
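To make the distinction concrete, below is a minimal sketch of what a tool typically looks like from the agent's side: a name, a natural-language description, and an input schema, backed by an ordinary deterministic function. The tool name, fields, and handler are hypothetical, not taken from the article.

```python
# Hypothetical tool definition as the agent sees it: name, description,
# and a JSON Schema for inputs. The deterministic handler sits behind it.
forecast_tool = {
    "name": "weather_get_forecast",
    "description": "Get the daily forecast for a city. Temperatures are in Celsius.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "days": {"type": "integer", "description": "Number of days to forecast, 1-7"},
        },
        "required": ["city"],
    },
}

def get_forecast(city: str, days: int = 1) -> dict:
    """Deterministic handler the agent's tool call is routed to."""
    # A real tool would call a weather service; this stub just echoes inputs.
    return {"city": city, "days": days, "forecast": ["sunny"] * days}
```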
Step-by-Step Walkthrough of Building + Maintaining Agent Tools
Below are steps distilled from Anthropic’s procedure.
Phase | What to Do | Key Tips |
---|---|---|
Prototype | Build an early version of the tools. Give the agent documentation for the APIs/SDKs the tools wrap. Serve the tools from a local MCP server or a desktop extension and test them in Claude Code or Claude Desktop (a minimal server sketch follows this table). (Anthropic) | Write LLM-friendly documentation. Try the tools on realistic workflows to catch rough edges early. (Anthropic) |
Evaluation | Define evaluation tasks grounded in real-world use and run agents against them with the tools (see the evaluation-loop sketch after this table). Score with verifiable metrics, and collect transcripts and reasoning chains. Use agents to help analyze performance. (Anthropic) | Avoid trivial tasks; include complexity (multiple tool calls, realistic context). Measure accuracy, tool-call behavior, token usage, and errors. (Anthropic) |
Agent Collaboration | Let agents help refactor and improve the tools: feed evaluation transcripts to Claude Code and apply its suggestions. Use held-out test sets to prevent overfitting. (Anthropic) | Take agent feedback seriously, including what it leaves out. Keep tool descriptions, behavior, and naming aligned. (Anthropic) |
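A minimal sketch of the prototyping step, assuming the official MCP Python SDK (`mcp` package) and its FastMCP quickstart interface; the `ticket_search` tool and its stubbed backend are hypothetical.

```python
# Minimal local MCP server exposing one prototype tool for testing in
# Claude Code / Claude Desktop. Assumes the official MCP Python SDK
# (`pip install mcp`); the FastMCP interface may differ between versions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-tools")

@mcp.tool()
def ticket_search(query: str, limit: int = 10) -> list[dict]:
    """Search support tickets by keyword and return at most `limit` matches."""
    # Stubbed backend; a real prototype would wrap the ticketing API here.
    return [{"id": f"T-{i}", "title": f"Match {i} for '{query}'"} for i in range(limit)]

if __name__ == "__main__":
    mcp.run()
```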
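And a sketch of the evaluation loop. This is not Anthropic's harness; the transcript fields (`final_answer`, `tool_calls`, `total_tokens`) and the `run_agent` / `check_answer` callables are assumptions standing in for your own agent runner and graders.

```python
# Illustrative evaluation loop: run each task, let the agent use the tools,
# then record verifiable outcomes alongside cost and error metrics.
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    task_id: str
    correct: bool
    tool_calls: int
    tokens_used: int
    errors: list[str] = field(default_factory=list)

def run_eval(tasks, run_agent, check_answer) -> list[EvalResult]:
    """`run_agent(prompt)` returns a transcript dict; `check_answer(task, answer)`
    verifies the final answer against the task's ground truth."""
    results = []
    for task in tasks:
        transcript = run_agent(task["prompt"])
        results.append(EvalResult(
            task_id=task["id"],
            correct=check_answer(task, transcript["final_answer"]),
            tool_calls=len(transcript["tool_calls"]),
            tokens_used=transcript["total_tokens"],
            errors=[c["error"] for c in transcript["tool_calls"] if c.get("error")],
        ))
    return results
```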
Principles / Best Practices
From Anthropic’s experience:
Choose tools carefully
Don’t build tools that merely wrap API endpoints if they don’t match how agents actually need to solve tasks.
Tools should reduce the burden on agents (e.g., limit context size, avoid returning irrelevant data).
Consolidate functionality: group operations into tools that match common workflows. (Anthropic)
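For example, a scheduling workflow might be consolidated into a single tool rather than three thin endpoint wrappers the agent must orchestrate itself. This is a hypothetical sketch; the in-memory `fake_calendar` stands in for a real calendar API.

```python
# One workflow-level tool instead of separate list_availability / create_event
# wrappers. The in-memory `fake_calendar` replaces a real calendar backend.
fake_calendar = {"free_slots": ["2025-09-15T10:00"], "events": []}

def schedule_meeting(attendees: list[str], duration_minutes: int, topic: str) -> dict:
    """Find a common free slot for `attendees` and book it in one call."""
    slots = fake_calendar["free_slots"]              # real tool: availability lookup
    if not slots:
        return {"scheduled": False,
                "reason": "No common slot found; try fewer attendees or a shorter meeting."}
    event = {"start": slots[0], "title": topic, "attendees": attendees,
             "duration_minutes": duration_minutes}
    fake_calendar["events"].append(event)            # real tool: event creation
    return {"scheduled": True, "start": event["start"], "attendees": attendees}
```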
Namespacing
Tool names should make clear what domain or service they belong to (e.g. “asana_search”, “jira_search”).
Use consistent prefix/suffix systems.
This helps the agent pick the correct tool and avoids confusion between similar tools. (Anthropic)
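As a small illustration (tool names are hypothetical), a consistent service + resource + action scheme keeps related tools distinguishable at a glance:

```python
# Consistent service_resource_action naming; the agent can tell at a glance
# which system each tool touches and what it does.
TOOL_NAMES = [
    "asana_projects_search",
    "asana_tasks_create",
    "jira_issues_search",
    "jira_issues_comment",
]
```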
Meaningful Responses
Return relevant, semantically useful data.
Avoid opaque identifiers; give names, labels, images, etc.
Optionally allow different verbosity/detail levels (concise vs detailed). (Anthropic)
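A hypothetical sketch of a response-shaping tool: it returns human-readable fields rather than opaque IDs and lets the caller pick a verbosity level.

```python
# Return names and labels instead of bare identifiers, and offer a
# concise/full detail switch so the agent only pulls what it needs.
USERS = {"u_93af": {"name": "Dana Reyes", "email": "dana@example.com",
                    "plan": "enterprise", "signup_date": "2023-04-02"}}

def user_lookup(user_id: str, detail: str = "concise") -> dict:
    """Look up a user. `detail` is 'concise' (name + plan) or 'full'."""
    user = USERS.get(user_id)
    if user is None:
        return {"error": f"No user with id '{user_id}'. Call user_search to find valid IDs."}
    if detail == "concise":
        return {"id": user_id, "name": user["name"], "plan": user["plan"]}
    return {"id": user_id, **user}
```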
Token / Context Efficiency
Tools that return large blocks of data or full datasets waste agent context.
Use pagination, filtering, and range selection. Truncate judiciously.
Error messages should be actionable. (Anthropic)
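A sketch of a context-frugal search tool (hypothetical data and limits): it filters server-side, paginates, truncates long fields, and returns errors that tell the agent what to do next.

```python
# Filter before returning, paginate, truncate long fields, and make the
# error message actionable instead of a bare failure code.
LOGS = [{"level": "error" if i % 7 == 0 else "info",
         "message": f"request {i} " + "x" * 500} for i in range(1000)]
MAX_MESSAGE_CHARS = 200

def logs_search(level: str, page: int = 1, page_size: int = 20) -> dict:
    if page_size > 100:
        return {"error": "page_size must be <= 100; request further pages instead."}
    matches = [log for log in LOGS if log["level"] == level]
    start = (page - 1) * page_size
    items = [{"level": m["level"],
              "message": m["message"][:MAX_MESSAGE_CHARS]}   # truncate long fields
             for m in matches[start:start + page_size]]
    return {"items": items, "page": page, "total_matches": len(matches),
            "has_more": start + page_size < len(matches)}
```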
Prompt / Spec Engineering
Tool descriptions and specs matter: they are loaded directly into the agent’s context.
Be explicit: what inputs are expected, what outputs delivered. Parameter names should be unambiguous.
Data formats and schemas matter (JSON, Markdown, etc.) — agent performance may depend on these. (Anthropic)
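A hypothetical tool spec illustrating these points: unambiguous parameter names, explicit formats, and a documented output shape, since all of this text ends up in the agent's context.

```python
# Every name and description here is prompt engineering: the agent reads
# this spec verbatim when deciding whether and how to call the tool.
report_tool_spec = {
    "name": "analytics_reports_generate",
    "description": ("Generate a usage report for one workspace. Returns Markdown "
                    "with a summary table followed by one row per day."),
    "input_schema": {
        "type": "object",
        "properties": {
            "workspace_id": {"type": "string",
                             "description": "Workspace identifier, e.g. 'ws_1234'"},
            "start_date": {"type": "string",
                           "description": "Inclusive start date, ISO 8601 (YYYY-MM-DD)"},
            "end_date": {"type": "string",
                         "description": "Inclusive end date, ISO 8601 (YYYY-MM-DD)"},
            "response_format": {"type": "string", "enum": ["markdown", "json"],
                                "description": "markdown reads better; json is easier to post-process"},
        },
        "required": ["workspace_id", "start_date", "end_date"],
    },
}
```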
Use-Cases / Scenarios
The practices apply in contexts such as:
Customer support agents: fetching logs, tickets, and user context.
Scheduling assistants: combining calendar, availability, and meeting tool calls.
Internal tools dashboards: letting agents pull data from microservices.
Search/knowledge base agents: retrieving precise content, avoiding overwhelming context.
Limitations / Considerations
Agent hallucinations are still possible. Even with perfect tools, agents may choose the wrong tools or misinterpret outputs.
Evaluations may overfit: internal test sets may not represent all real-world edge cases.
Cost trade-offs: tool creation, maintenance, and evaluation (compute, human oversight) cost resources.
Tool complexity vs simplicity: simpler tools may generalize better; over-specialization may limit reuse.
Common Pitfalls & Fixes
Pitfall | Fix / Mitigation |
---|---|
Tools returning too much data | Add filtering, pagination, and concise modes. |
Tool names are vague or overlapping | Use good namespacing; pick clear prefixes/suffixes. |
Unclear spec/parameter names | Enforce strict schemas; name inputs unambiguously. |
Evaluations too simplistic | Use realistic, multi-step, multi-tool tasks and held-out test sets (see the split sketch after this table). |
Agents don’t call tools (or misuse them) | Analyze reasoning logs; refine descriptions; provide examples. |
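For the held-out test sets mentioned above, a simple split (illustrative only) is enough: iterate on the tools against the dev portion and report final numbers only on the held-out portion.

```python
# Split evaluation tasks once, refine tools on dev_set, and score the
# finished tools on held_out_set to detect overfitting to the eval.
import random

def split_tasks(tasks: list[dict], held_out_fraction: float = 0.3, seed: int = 0):
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]   # (dev_set, held_out_set)
```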
Conclusion
Designing tools for LLM agents requires rethinking traditional API/SDK design. Success depends on:
Iteration: prototyping, evaluating, refining with real tasks
Clarity: in tool naming, specifications, and responses
Efficiency: limiting context/token overload
Alignment: letting agents help evaluate and improve tools
If you build tools with these principles in mind, agents will use them more reliably and efficiently.