Overview
The article describes best practices for designing, building, evaluating, and refining tools that large-language-model (LLM) agents use. Tools here mean deterministic software (APIs, SDKs, services) that agents can call. The piece emphasizes that tools must be designed for agents, rather than just wrapping APIs. It covers:
How to prototype tools
How to run systematic evaluations of tools (realistic tasks, metrics)
How to collaborate with agents (e.g., using Anthropic’s Claude) to improve tools
Key principles of tool design: choosing the right tools to build, clear namespacing, meaningful tool responses, token and context efficiency, and prompt-engineering of tool descriptions/specs
Context / Conceptual Background
Definitions and setting:
Tool vs Agent: Agents are non-deterministic (LLMs) and may choose to call tools among other actions. Tools are deterministic, predictable modules that the agent calls when needed. (Anthropic)
Model Context Protocol (MCP): An open protocol through which agents connect to servers that expose (“load”) many tools. (Anthropic)
Tools are only useful if agents can reliably choose when and how to use them; designing tools for that requires different trade-offs than for purely software-developer endpoints. (Anthropic)
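To make the distinction concrete, below is a minimal sketch of what a tool typically looks like from the agent's side: a name, a natural-language description, and an input schema, backed by an ordinary deterministic function. The tool name, fields, and handler are hypothetical, not taken from the article.

```python
# Hypothetical tool definition as the agent sees it: name, description,
# and a JSON Schema for inputs. The deterministic handler sits behind it.
forecast_tool = {
    "name": "weather_get_forecast",
    "description": "Get the daily forecast for a city. Temperatures are in Celsius.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "days": {"type": "integer", "description": "Number of days to forecast, 1-7"},
        },
        "required": ["city"],
    },
}

def get_forecast(city: str, days: int = 1) -> dict:
    """Deterministic handler the agent's tool call is routed to."""
    # A real tool would call a weather service; this stub just echoes inputs.
    return {"city": city, "days": days, "forecast": ["sunny"] * days}
```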
Step-by-Step Walkthrough of Building + Maintaining Agent Tools
Below are steps distilled from Anthropic’s procedure.
Phase | What to Do | Key Tips |
---|---|---|
Prototype | Build an early version of the tools. Give the agent documentation for the APIs/SDKs the tools wrap. Serve the tools from a local MCP server or a desktop extension and test them in Claude Code or Claude Desktop (a minimal server sketch follows this table). (Anthropic) | Write LLM-friendly documentation. Try the tools on realistic workflows to catch rough edges early. (Anthropic) |
Evaluation | Define evaluation tasks grounded in real-world use and run agents against them with the tools (see the evaluation-loop sketch after this table). Score with verifiable metrics, and collect transcripts and reasoning chains. Use agents to help analyze performance. (Anthropic) | Avoid trivial tasks; include complexity (multiple tool calls, realistic context). Measure accuracy, tool-call behavior, token usage, and errors. (Anthropic) |
Agent Collaboration | Let agents help refactor and improve the tools: feed evaluation transcripts to Claude Code and apply its suggestions. Use held-out test sets to prevent overfitting. (Anthropic) | Take agent feedback seriously, including what it leaves out. Keep tool descriptions, behavior, and naming aligned. (Anthropic) |
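A minimal sketch of the prototyping step, assuming the official MCP Python SDK (`mcp` package) and its FastMCP quickstart interface; the `ticket_search` tool and its stubbed backend are hypothetical.

```python
# Minimal local MCP server exposing one prototype tool for testing in
# Claude Code / Claude Desktop. Assumes the official MCP Python SDK
# (`pip install mcp`); the FastMCP interface may differ between versions.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-tools")

@mcp.tool()
def ticket_search(query: str, limit: int = 10) -> list[dict]:
    """Search support tickets by keyword and return at most `limit` matches."""
    # Stubbed backend; a real prototype would wrap the ticketing API here.
    return [{"id": f"T-{i}", "title": f"Match {i} for '{query}'"} for i in range(limit)]

if __name__ == "__main__":
    mcp.run()
```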
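And a sketch of the evaluation loop. This is not Anthropic's harness; the transcript fields (`final_answer`, `tool_calls`, `total_tokens`) and the `run_agent` / `check_answer` callables are assumptions standing in for your own agent runner and graders.

```python
# Illustrative evaluation loop: run each task, let the agent use the tools,
# then record verifiable outcomes alongside cost and error metrics.
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    task_id: str
    correct: bool
    tool_calls: int
    tokens_used: int
    errors: list[str] = field(default_factory=list)

def run_eval(tasks, run_agent, check_answer) -> list[EvalResult]:
    """`run_agent(prompt)` returns a transcript dict; `check_answer(task, answer)`
    verifies the final answer against the task's ground truth."""
    results = []
    for task in tasks:
        transcript = run_agent(task["prompt"])
        results.append(EvalResult(
            task_id=task["id"],
            correct=check_answer(task, transcript["final_answer"]),
            tool_calls=len(transcript["tool_calls"]),
            tokens_used=transcript["total_tokens"],
            errors=[c["error"] for c in transcript["tool_calls"] if c.get("error")],
        ))
    return results
```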
Principles / Best Practices
From Anthropic’s experience:
Choose tools carefully
Don’t build tools that merely wrap API endpoints if they don’t match how agents actually need to solve tasks.
Tools should reduce the burden on agents (e.g., limit context size, avoid returning irrelevant data).
Consolidate functionality: group operations into tools that match common workflows. (Anthropic)
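For example, a scheduling workflow might be consolidated into a single tool rather than three thin endpoint wrappers the agent must orchestrate itself. This is a hypothetical sketch; the in-memory `fake_calendar` stands in for a real calendar API.

```python
# One workflow-level tool instead of separate list_availability / create_event
# wrappers. The in-memory `fake_calendar` replaces a real calendar backend.
fake_calendar = {"free_slots": ["2025-09-15T10:00"], "events": []}

def schedule_meeting(attendees: list[str], duration_minutes: int, topic: str) -> dict:
    """Find a common free slot for `attendees` and book it in one call."""
    slots = fake_calendar["free_slots"]              # real tool: availability lookup
    if not slots:
        return {"scheduled": False,
                "reason": "No common slot found; try fewer attendees or a shorter meeting."}
    event = {"start": slots[0], "title": topic, "attendees": attendees,
             "duration_minutes": duration_minutes}
    fake_calendar["events"].append(event)            # real tool: event creation
    return {"scheduled": True, "start": event["start"], "attendees": attendees}
```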
Namespacing
Tool names should make clear what domain or service they belong to (e.g. “asana_search”, “jira_search”).
Use consistent prefix/suffix systems.
This helps the agent pick the correct tool and avoids confusion between similar tools. (Anthropic)
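As a small illustration (tool names are hypothetical), a consistent service + resource + action scheme keeps related tools distinguishable at a glance:

```python
# Consistent service_resource_action naming; the agent can tell at a glance
# which system each tool touches and what it does.
TOOL_NAMES = [
    "asana_projects_search",
    "asana_tasks_create",
    "jira_issues_search",
    "jira_issues_comment",
]
```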
Meaningful Responses
Return relevant, semantically useful data.
Avoid opaque identifiers; give names, labels, images, etc.
Optionally allow different verbosity/detail levels (concise vs detailed). (Anthropic)
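A hypothetical sketch of a response-shaping tool: it returns human-readable fields rather than opaque IDs and lets the caller pick a verbosity level.

```python
# Return names and labels instead of bare identifiers, and offer a
# concise/full detail switch so the agent only pulls what it needs.
USERS = {"u_93af": {"name": "Dana Reyes", "email": "dana@example.com",
                    "plan": "enterprise", "signup_date": "2023-04-02"}}

def user_lookup(user_id: str, detail: str = "concise") -> dict:
    """Look up a user. `detail` is 'concise' (name + plan) or 'full'."""
    user = USERS.get(user_id)
    if user is None:
        return {"error": f"No user with id '{user_id}'. Call user_search to find valid IDs."}
    if detail == "concise":
        return {"id": user_id, "name": user["name"], "plan": user["plan"]}
    return {"id": user_id, **user}
```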
Token / Context Efficiency
Tools that return large blocks of data or full datasets waste agent context.
Use pagination, filtering, and range selection. Truncate judiciously.
Error messages should be actionable. (Anthropic)
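A sketch of a context-frugal search tool (hypothetical data and limits): it filters server-side, paginates, truncates long fields, and returns errors that tell the agent what to do next.

```python
# Filter before returning, paginate, truncate long fields, and make the
# error message actionable instead of a bare failure code.
LOGS = [{"level": "error" if i % 7 == 0 else "info",
         "message": f"request {i} " + "x" * 500} for i in range(1000)]
MAX_MESSAGE_CHARS = 200

def logs_search(level: str, page: int = 1, page_size: int = 20) -> dict:
    if page_size > 100:
        return {"error": "page_size must be <= 100; request further pages instead."}
    matches = [log for log in LOGS if log["level"] == level]
    start = (page - 1) * page_size
    items = [{"level": m["level"],
              "message": m["message"][:MAX_MESSAGE_CHARS]}   # truncate long fields
             for m in matches[start:start + page_size]]
    return {"items": items, "page": page, "total_matches": len(matches),
            "has_more": start + page_size < len(matches)}
```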
Prompt / Spec Engineering
Tool descriptions and specs matter: they are loaded directly into the agent’s context.
Be explicit: what inputs are expected, what outputs delivered. Parameter names should be unambiguous.
Data formats and schemas matter (JSON, Markdown, etc.) — agent performance may depend on these. (Anthropic)
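A hypothetical tool spec illustrating these points: unambiguous parameter names, explicit formats, and a documented output shape, since all of this text ends up in the agent's context.

```python
# Every name and description here is prompt engineering: the agent reads
# this spec verbatim when deciding whether and how to call the tool.
report_tool_spec = {
    "name": "analytics_reports_generate",
    "description": ("Generate a usage report for one workspace. Returns Markdown "
                    "with a summary table followed by one row per day."),
    "input_schema": {
        "type": "object",
        "properties": {
            "workspace_id": {"type": "string",
                             "description": "Workspace identifier, e.g. 'ws_1234'"},
            "start_date": {"type": "string",
                           "description": "Inclusive start date, ISO 8601 (YYYY-MM-DD)"},
            "end_date": {"type": "string",
                         "description": "Inclusive end date, ISO 8601 (YYYY-MM-DD)"},
            "response_format": {"type": "string", "enum": ["markdown", "json"],
                                "description": "markdown reads better; json is easier to post-process"},
        },
        "required": ["workspace_id", "start_date", "end_date"],
    },
}
```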
Use-Cases / Scenarios
The practices apply in contexts such as:
Customer support agents: fetching logs, tickets, and user context.
Scheduling assistants: combining calendar, availability, and meeting tool calls.
Internal tools dashboards: letting agents pull data from microservices.
Search/knowledge base agents: retrieving precise content, avoiding overwhelming context.
Limitations / Considerations
Agent hallucinations are still possible. Even with perfect tools, agents may choose the wrong tools or misinterpret outputs.
Evaluations may overfit: internal test sets may not represent all real-world edge cases.
Cost trade-offs: tool creation, maintenance, and evaluation (compute, human oversight) cost resources.
Tool complexity vs simplicity: simpler tools may generalize better; over-specialization may limit reuse.
Common Pitfalls & Fixes
Pitfall | Fix / Mitigation |
---|---|
Tools returning too much data | Add filtering, pagination, and concise modes. |
Tool names are vague or overlapping | Use good namespacing; pick clear prefixes/suffixes. |
Unclear spec/parameter names | Enforce strict schemas; name inputs unambiguously. |
Evaluations too simplistic | Use realistic, multi-step, multi-tool tasks and held-out test sets (see the split sketch after this table). |
Agents don’t call tools (or misuse them) | Analyze reasoning logs; refine descriptions; provide examples. |
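For the held-out test sets mentioned above, a simple split (illustrative only) is enough: iterate on the tools against the dev portion and report final numbers only on the held-out portion.

```python
# Split evaluation tasks once, refine tools on dev_set, and score the
# finished tools on held_out_set to detect overfitting to the eval.
import random

def split_tasks(tasks: list[dict], held_out_fraction: float = 0.3, seed: int = 0):
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]   # (dev_set, held_out_set)
```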
Conclusion
Designing tools for LLM agents requires rethinking traditional API/SDK design. Success depends on:
Iteration: prototyping, evaluating, refining with real tasks
Clarity: in tool naming, specifications, and responses
Efficiency: limiting context/token overload
Alignment: letting agents help evaluate and improve tools
If you build tools with these principles in mind, agents will use them more reliably and efficiently.