Most MCP servers fail in production not because of the protocol — but because of how tools are designed. Vague descriptions make agents pick the wrong tool. Flat data dumps bloat the context window. Unguarded write operations let agents modify production data without confirmation. And poor error messages leave agents guessing instead of recovering. This guide covers four patterns we use to build MCP servers that agents actually use correctly: writing descriptions that program agent behavior, keeping responses out of the context window, making write operations safe with two-phase confirmation, and designing errors that help agents recover.
You built an MCP server. Your AI agent connects to it. It discovers the tools. And then it calls the wrong one. Or it calls the right one with the wrong arguments. Or it calls the right one correctly — and writes bad data to your customer's SAP instance because nobody asked the user to confirm first.
This is where most MCP projects go sideways. The protocol works fine. The server runs fine. The tools are technically functional. But the agent doesn't use them the way you intended — because you designed the tools for a developer, not for an LLM.
An MCP tool is a user interface for AI. The "user" is a language model that reads your tool names, descriptions, and schemas to decide what to call and how. If those descriptions are vague, the agent guesses. If the schema is ambiguous, the agent hallucinates arguments. If the error messages are meant for humans, the agent can't recover.
After building MCP servers for enterprise systems like SAP, Coupa, Salesforce, and Dynamics 365, we've developed a set of patterns that consistently produce reliable agent behavior. Here's what we've learned.
Why do AI agents call the wrong MCP tools?
The number one reason agents call the wrong tool is that they have too many tools with overlapping descriptions.
When your MCP server exposes 25+ tools, the LLM has to read every name and description in its context window just to pick one. If two tools sound similar — say get_supplier and coupa_api_query with resource="suppliers" — the agent has to guess which one is right. Sometimes it guesses wrong. Sometimes it picks the more generic one when the specific one would have been better. Sometimes it calls both.
This is documented in research on MCP server composition, which found that most tool-calling mistakes happen because agents are forced to choose from too many overlapping definitions.
That doesn't mean you need fewer tools — production setups genuinely need a lot of them. It means you need better descriptions for them.
What a bad tool description looks like:
"Query SAP data."
This tells the agent nothing. What data? Which resource? When should I use this instead of another tool?
What a good tool description looks like:
"List suppliers from SAP with filtering, sorting, and pagination. Returns supplier records with: name, number, status, address, contacts, payment terms. Use filter to narrow results (e.g. by status, name, on-hold). Results are written to filesystem sandbox as CSV. Returns file paths, not the data itself — you MUST read the files afterward."
This description does five things: tells the agent what the tool does, what data comes back, how to customize the query, what format the response comes in, and what to do after calling it. The agent doesn't need to guess any of these.
The rule: A tool description is an instruction manual for the LLM. If a developer reading it would need to ask a follow-up question, the description isn't complete enough. Be specific about what the tool does, what it returns, and when to use it instead of another tool.
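To make the "complete enough" test concrete, here's a rough lint you could run over your descriptions. It's a sketch, not a real tool: the heuristics (checking for return fields, a format, and a next step) and the example strings are illustrative.

```python
# Illustrative lint: a "complete" description says what comes back,
# in what format, and what the agent should do after calling.
BAD = "Query SAP data."
GOOD = (
    "List suppliers from SAP with filtering, sorting, and pagination. "
    "Returns supplier records with: name, number, status, address, "
    "contacts, payment terms. Results are written to filesystem sandbox "
    "as CSV. Returns file paths, not the data itself -- you MUST read "
    "the files afterward."
)

def description_is_complete(description: str) -> bool:
    d = description.lower()
    mentions_returns = "returns" in d
    mentions_format = any(fmt in d for fmt in ("csv", "markdown", "inline", "json"))
    mentions_next_step = "read" in d or "confirm" in d
    return mentions_returns and mentions_format and mentions_next_step
```

A check like this won't catch every vague description, but it flags the most common omission: descriptions that explain the action while saying nothing about the response.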
How should you write MCP tool descriptions for enterprise systems?
Tool descriptions for enterprise systems need more detail than descriptions for simple APIs. Enterprise APIs have quirks — unusual field names, non-obvious ID formats, overlapping resources — and your descriptions need to pre-empt the confusion before the agent encounters it.
Here's a pattern we follow for every enterprise MCP tool:
Start with the action and the resource. "List suppliers from SAP with filtering, sorting, and pagination." The agent immediately knows what this tool does and which system it targets.
List the key fields that come back. "Returns supplier records with: name, number, status, address, contacts, payment terms." This prevents the agent from calling the tool and then being surprised by the response format.
Explain input parameters in the schema, not the description. The tool description tells the agent when and why to use the tool. The parameter schemas tell it how. Keep these separate — the MCP specification itself recommends this separation.
Include disambiguation. If two tools could serve the same request, the description should say when to use this one and when to use the other. For example: "Use this for common supplier queries. For custom filtering or unusual parameter combinations, use the generic API query tool instead."
Specify the response format and post-call instructions. This is the most frequently missed detail. If your tool writes results to files instead of returning them inline, say so explicitly: "Results are written to filesystem sandbox as CSV. Returns file paths, not the data itself — you MUST read the files afterward."
For write operations, the description should include the full expected workflow — not just what the tool does, but what the agent should do next. More on this below.
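Putting the pattern together, a full tool definition might look like the sketch below. The tool name, fields, and parameter details are illustrative; the name/description/inputSchema shape matches how MCP servers list tools, with inputSchema expressed as JSON Schema.

```python
# Sketch of a tool definition following the pattern above.
# The description says WHEN and WHY; the schema says HOW.
LIST_SUPPLIERS_TOOL = {
    "name": "sap_list_suppliers",
    "description": (
        # Action + resource, return fields, disambiguation,
        # response format, post-call instructions -- in that order.
        "List suppliers from SAP with filtering, sorting, and pagination. "
        "Returns supplier records with: name, number, status, address, "
        "contacts, payment terms. Use this for common supplier queries; "
        "for unusual parameter combinations use the generic API query tool "
        "instead. Results are written to filesystem sandbox as CSV. Returns "
        "file paths, not the data itself -- you MUST read the files afterward."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            # Per-parameter detail lives here, not in the description.
            "filter": {
                "type": "string",
                "description": "Narrow results, e.g. by status, name, or on-hold flag.",
            },
            "sort": {
                "type": "string",
                "enum": ["name", "number", "status"],
                "description": "Field to sort results by.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum records to return (default 50).",
            },
        },
    },
}
```

Note the split: nothing in the description repeats parameter-level detail, and nothing in the schema explains when to pick this tool over another.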
Should your MCP server return data directly or write it to files?
This depends on how much data you're returning.
If the tool returns a single record or a short answer, return it directly in the tool response. The agent gets it instantly, no extra steps.
But if the tool can return 50, 100, or 500 records — which is common when querying enterprise systems like SAP or Coupa — returning that data directly is a problem. It fills up the agent's context window with raw data, leaving less room for reasoning. The agent slows down. It starts losing track of what it was doing. And the cost per query goes up because you're processing thousands of tokens of CSV data through the LLM.
We use a file-based response pattern for any tool that can return more than a handful of records. Instead of returning the data, the tool writes it to the filesystem — typically as CSV for tabular data and markdown for single records — and returns only the file paths.
The response the agent sees:
"47 suppliers found. Files written to /workspace/sap-list_suppliers-1738.../ — suppliers_filtered.csv — index.txt"
The agent then reads the files it needs. If it only needs the first 10 rows, it reads those. If it needs to cross-reference with another query, it can read both files without either one polluting the context window.
The index.txt pattern: For every query, we write a small index file that summarizes what was returned — record count, column names, filter parameters, and file paths. The agent reads this first to orient itself, then decides which files to open. Think of it as a table of contents for the query results.
When NOT to use files: Single record lookups, confirmation messages, error messages, and anything under ~10 rows. For these, inline responses are simpler and faster.
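A minimal sketch of the file-based response pattern, assuming a filesystem sandbox the agent can read. The directory layout, file names, and index format are illustrative (here a temp directory stands in for the sandbox).

```python
import csv
import time
from pathlib import Path
from tempfile import mkdtemp

def write_query_results(tool_name: str, rows: list[dict], filters: str) -> str:
    """Write query results to a per-query directory; return only paths.
    Assumes at least one row (inline responses handle the empty case)."""
    out_dir = Path(mkdtemp(prefix=f"{tool_name}-{int(time.time())}-"))
    csv_path = out_dir / "results.csv"
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # index.txt: a table of contents the agent reads first to orient itself.
    (out_dir / "index.txt").write_text(
        f"records: {len(rows)}\n"
        f"columns: {', '.join(rows[0].keys())}\n"
        f"filters: {filters}\n"
        f"files: {csv_path}\n"
    )
    # The tool response carries paths and a count -- never the data itself.
    return f"{len(rows)} records found. Files written to {out_dir}/ -- results.csv -- index.txt"
```

The agent sees a one-line summary, reads index.txt, and opens results.csv only if the task requires it.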
How do you make MCP write operations safe?
This is probably the most important section in this article.
When your MCP server can create, update, or delete records in an enterprise system, you need a safety mechanism. AI agents are non-deterministic. They sometimes misinterpret user intent. They sometimes hallucinate arguments. And enterprise data — suppliers in Coupa, employees in Workday, customers in SAP — is not something you want an AI modifying without explicit human confirmation.
Most MCP servers don't address this. The agent calls create_user and the user gets created. If the agent got the arguments wrong, too bad — the record is already in the system.
We use a two-phase confirmation pattern that separates preparation from execution:
Phase 1 — Prepare. The agent calls a prepare_create_user or prepare_update_user tool. This tool validates the inputs, checks for duplicates (is this login already taken? is this email in use?), and generates a human-readable preview of what will happen. It does NOT write anything to the enterprise system. Instead, it returns a preview and a single-use confirmation token.
Between phases — Human review. The agent presents the preview to the user. "Here's the user that will be created: login jsmith, email jsmith@acme.com, roles: Buyer, Expense User. Confirm?" The user reviews and either approves or rejects.
Phase 2 — Confirm or cancel. If approved, the agent calls confirm_action with the token. The tool executes the write. If rejected, the agent calls cancel_action and the token is discarded. Tokens are single-use — once confirmed, they can't be replayed.
Stale data detection. For updates, the prepare step snapshots the current record. When the agent calls confirm, the tool re-fetches the record and compares it to the snapshot. If someone else modified the record between prepare and confirm, the operation is rejected and the agent is told to re-prepare. This prevents the agent from overwriting changes made by other users.
Why this matters for enterprise systems specifically: Some enterprise platforms like Coupa use PUT for all updates, not PATCH. If you send a PUT with missing fields, those fields get wiped to null. Our prepare step automatically merges the proposed changes onto the current record, so the PUT payload is always complete. The user never has to worry about accidentally blanking a field they didn't mention.
The workflow rules live in the tool description. The prepare tool's description includes explicit instructions: "NEVER call confirm_action without showing the preview to the user first. If the user says NO or changes the request, cancel ALL pending tokens FIRST, then re-prepare." This programs the agent's behavior through metadata — no custom agent framework needed.
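The moving parts above (single-use tokens, full-payload merge for PUT, stale-data snapshots) can be sketched in a few functions. This is a toy model against an in-memory backend; tool names, the token store, and the fake data are illustrative, not our production code.

```python
import copy
import secrets

# Stand-in for the enterprise system and the server's pending-token store.
BACKEND = {"users": {47: {"login": "jsmith", "email": "jsmith@acme.com", "roles": ["Buyer"]}}}
PENDING: dict[str, dict] = {}

def prepare_update_user(user_id: int, changes: dict) -> dict:
    """Phase 1: validate, snapshot, build a complete payload, return preview + token."""
    current = BACKEND["users"].get(user_id)
    if current is None:
        return {"error": f"User {user_id} not found. Verify the ID first."}
    payload = {**current, **changes}  # merge so a PUT never blanks unmentioned fields
    token = secrets.token_hex(8)
    PENDING[token] = {
        "user_id": user_id,
        "snapshot": copy.deepcopy(current),  # for stale-data detection
        "payload": payload,
    }
    preview = ", ".join(f"{k}: {current.get(k)} -> {v}" for k, v in changes.items())
    return {"preview": f"Update user {user_id}: {preview}. Confirm?", "token": token}

def confirm_action(token: str) -> dict:
    """Phase 2: execute the write, once, only if the record hasn't changed."""
    pending = PENDING.pop(token, None)  # single-use: token is consumed either way
    if pending is None:
        return {"error": "Unknown or already-used token. Re-run the prepare step."}
    live = BACKEND["users"][pending["user_id"]]
    if live != pending["snapshot"]:  # someone else modified the record
        return {"error": "Record changed since prepare. Re-prepare with fresh data."}
    BACKEND["users"][pending["user_id"]] = pending["payload"]
    return {"result": f"User {pending['user_id']} updated."}

def cancel_action(token: str) -> dict:
    PENDING.pop(token, None)
    return {"result": "Cancelled. Token discarded."}
```

The human-review step happens between the two calls: the agent shows the preview to the user and only calls confirm_action on explicit approval.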
What makes a good MCP error message?
Here's something most developers get wrong: MCP error messages aren't for humans. They're for the AI agent.
When a tool call fails, the agent receives the error message and decides what to do next. If the message says "Error 422" — the agent has nothing to work with. If it says "Login 'jsmith' is already taken by user ID 47. Choose a different login." — the agent knows exactly how to recover: ask the user for a different login, then retry.
Every error message from an enterprise MCP server should include three things:
What happened. "Coupa API returned 403 Forbidden."
Why it happened. "The API client does not have permission for this resource."
What to do about it. "Verify the OAuth2 client scopes include the required permissions."
For validation errors, be specific about which field failed and what the expected format is. For auth errors, point toward the likely configuration problem. For timeouts, suggest whether the query might be too complex or the server unresponsive.
The agent treats your error messages as instructions. Write them accordingly.
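One way to enforce the three-part shape is to centralize it in a formatter. The status codes and recovery hints below are illustrative examples, not an exhaustive mapping for any particular API.

```python
# what-happened is built from system + status; why and fix come from a table.
RECOVERY_HINTS = {
    403: ("The API client does not have permission for this resource.",
          "Verify the OAuth2 client scopes include the required permissions."),
    404: ("The requested record does not exist or the ID format is wrong.",
          "Re-check the ID with the corresponding list tool, then retry."),
    422: ("One or more fields failed server-side validation.",
          "Fix the fields named in the details and retry with corrected values."),
    504: ("The query did not complete in time.",
          "Narrow the filter or reduce the page size, then retry."),
}

def format_agent_error(system: str, status: int, detail: str = "") -> str:
    why, fix = RECOVERY_HINTS.get(
        status, ("The cause is unknown.", "Report the full status to the user.")
    )
    message = f"{system} API returned {status}. {why} {fix}"
    return f"{message} Details: {detail}" if detail else message
```

Every tool then returns errors through this one function, so no code path can emit a bare "Error 422" the agent can't act on.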
How many tools should an MCP server expose?
There's no magic number, but the pattern matters more than the count.
Docker's MCP catalog team uses the concept of a "tool budget" — the number of tools an agent can handle effectively before performance degrades. Their recommendation: design tools around use cases, not API endpoints.
A bad MCP server takes 15 REST endpoints and creates 15 tools. The agent sees 15 options with overlapping functionality and has to reason about which one to use.
A good MCP server groups related operations into focused tools that match how users think about the task. "List suppliers" and "Get supplier by ID" are two tools because they represent two distinct user intents. But "Get supplier addresses" and "Get supplier contacts" might be better as part of the "Get supplier" tool that returns all sub-resources together — with addresses and contacts written as separate files the agent can read selectively.
For simpler systems, we typically end up with 8–15 tools. For our enterprise MCP servers, it's usually between 25–45 tools: a handful of read tools (list, get by ID, generic query) for each scope (suppliers, customers, invoices, orders, inventory, etc.), write tools with the two-phase pattern (prepare, confirm, cancel), and sometimes a few specialized tools for common workflows.
The key metric is hit rate — does the agent consistently pick the correct tool for a given user request? If it doesn't, you either have too many overlapping tools or your descriptions aren't clear enough. Testing this systematically before production is essential.
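Hit-rate testing can be as simple as a table of (request, expected tool) pairs run against whatever picks the tool. In the toy harness below a keyword stub stands in for the LLM; in practice you'd run the real agent against your actual tool list and descriptions. All names here are hypothetical.

```python
def hit_rate(pick_tool, cases) -> float:
    """Fraction of test requests for which the expected tool was chosen."""
    hits = sum(1 for request, expected in cases if pick_tool(request) == expected)
    return hits / len(cases)

def stub_pick_tool(request: str) -> str:
    # Stand-in for the LLM's tool choice -- replace with a real agent call.
    r = request.lower()
    if "supplier" in r and "create" in r:
        return "prepare_create_supplier"
    if "supplier" in r:
        return "list_suppliers"
    return "generic_api_query"

CASES = [
    ("Show me all active suppliers", "list_suppliers"),
    ("Create a supplier called Acme", "prepare_create_supplier"),
    ("Pull raw invoice headers via the API", "generic_api_query"),
]
```

A falling hit rate after adding a tool is a strong signal that its description overlaps with an existing one.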
A checklist before you ship your MCP server to production
Before any MCP server goes to production, run through these questions:
Tool descriptions: Can an LLM reading only the tool names and descriptions correctly pick the right tool for any given user request? Are there two tools that sound similar enough to cause confusion? Does every description specify what the tool returns and what format it's in?
Write safety: Can the agent create, update, or delete records without human confirmation? If yes, add a confirmation step. Enterprise data isn't something you want AI agents modifying unilaterally.
Response size: Could any tool return more data than fits comfortably in a context window? If a query could return hundreds of records, use file-based responses instead of inline data.
Error messages: Do your error messages tell the agent what went wrong AND how to fix it? Test by intentionally triggering errors and reading the messages as if you were an LLM with no other context.
Tool count: Does every tool earn its place, or do several overlap in functionality? If the agent's hit rate drops as the list grows, combine related operations into intent-level tools or use server composition to split across focused servers.
Parameter schemas: Do all required fields have clear descriptions? Do enums have explicit allowed values? Are field format requirements specified (e.g., "10-digit KUNNR" for SAP customer numbers)?
FAQ
Why do AI agents call the wrong MCP tools?
Agents pick the wrong tool when descriptions are vague or when too many tools have overlapping functionality. The fix is writing precise descriptions that tell the agent exactly when to use each tool, what it returns, and how the response is formatted. For enterprise systems with many similar operations, include explicit disambiguation in each description.
How should I write MCP tool descriptions?
Start with the action and resource ("List suppliers from SAP"), list key return fields, specify the response format (inline vs files), and include post-call instructions. Keep parameter-level detail in the schema, not the description. For write operations, embed the full expected workflow in the description — including what to show the user before confirming.
Should MCP servers return data directly or write to files?
For small responses (single records, short answers), return directly. For queries that can return dozens or hundreds of records, write results to files and return file paths. This prevents context window bloat and lets the agent read only what it needs. Include an index.txt file that summarizes the query results so the agent can orient itself.
How do you prevent AI agents from writing bad data through MCP?
Use a two-phase confirmation pattern: the agent calls a prepare tool that validates inputs and returns a preview, then the user reviews and approves before the agent calls a confirm tool to execute. This prevents the agent from writing incorrect data and gives users visibility into what's about to happen. For updates, snapshot the record at prepare time and check for concurrent changes before confirming.
How many tools should an MCP server have?
Design tools around user intents, not API endpoints. Simple MCP servers work well with 8–15 tools, but enterprise systems usually require 25+. The key metric is hit rate — whether the agent consistently picks the correct tool. If it doesn't, your tools are either too numerous, too overlapping, or not described clearly enough.