LLMs Don’t Actually “Call” Tools

Here is a fact that surprises most people: LLMs cannot run code. They cannot call APIs. They cannot read files. All they do is predict the next token — they output raw text, one piece at a time.

So how do they “use tools”?

The trick: structured wishes

During training, models are fine-tuned on augmented datasets that include tool call examples. The training pipeline works roughly like this:

  1. Take existing text corpora
  2. Use prompts to generate synthetic tool calls within that text — e.g., [Calculator(400 / 1400)] -> 0.29
  3. Filter: only keep examples where the tool call improved prediction of subsequent tokens
  4. Fine-tune the model on this augmented dataset

This is the approach behind Toolformer — the model learns “when I’m about to produce a factual number, inserting a calculator call here makes my next tokens more accurate.” Over many such examples, it develops the instinct: when in doubt, call a tool rather than guess.

The result? Instead of guessing an answer, a trained model outputs something like:

<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>

The model does not run this. It just writes it.

The loop: five steps, not one

A host application — ordinary software sitting between the model and the outside world — orchestrates the full cycle:

  1. App sends tool definitions + user message. The tool schemas (name, description, JSON Schema for parameters) are injected into the system prompt as tokens — they consume context window space just like any other text.
  2. Model analyzes whether calling a tool would help fulfill the request.
  3. Model outputs a structured tool_calls array — potentially multiple calls in parallel, not just one at a time.
  4. App executes each function locally, captures the results.
  5. App sends results back to the model as new messages, referencing each call by ID. The model reads them and generates a final response.

To the end user, it looks like the AI “called a function.” In reality, the AI wrote a structured wish, and the system granted it.
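In code, the host side of this loop is ordinary glue. Here is a minimal Python sketch: the model call is stubbed out, and the tool registry and message shapes are invented for illustration.

```python
# A sketch of the host-side loop. `fake_model` stands in for a real LLM API
# call; the tool registry and message format are illustrative only.
import json
import re

def get_weather(city):
    return {"city": city, "forecast": "sunny"}  # stub for a real API call

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    # A real model would predict tokens; here we emit a canned tool call,
    # then a final answer once a tool result is present in the messages.
    if messages[-1]["role"] == "tool":
        return "It is sunny in Tokyo."
    return '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'

def run_turn(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        output = fake_model(messages)
        match = re.search(r"<tool_call>(.*?)</tool_call>", output, re.DOTALL)
        if not match:
            return output  # plain text: the model is done
        call = json.loads(match.group(1))
        result = TOOLS[call["name"]](**call["arguments"])  # app executes locally
        messages.append({"role": "assistant", "content": output})
        messages.append({"role": "tool", "content": json.dumps(result)})

print(run_turn("What's the weather in Tokyo?"))
```

The model only ever produces and consumes text; every call, execution, and result round-trip happens in this loop.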

One subtlety worth noting: since tool definitions are injected as tokens, every tool you register costs context window space — even if the model never calls it. This becomes important later.

This is the foundation. Everything that follows is about making this loop better.


Stage 1: Functions Hardcoded on the Server

The earliest tool-use systems were dead simple. You, the developer, would:

  1. Write a Python function (e.g., def get_weather(city): ...)
  2. Describe that function in the system prompt — “You have access to get_weather(city). Call it when the user asks about weather.”
  3. Build a parser to catch the model’s output and route it to your function
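Those three steps amount to very little code. A sketch, with an invented call syntax and a stubbed weather function:

```python
# A Stage 1 parser sketch: catch a freeform call like "get_weather(Tokyo)"
# in the model's output and route it to a local function. The call syntax
# here is invented for illustration.
import re

def get_weather(city):
    return f"Weather in {city}: sunny"  # stub for a real lookup

FUNCTIONS = {"get_weather": get_weather}

def dispatch(model_output):
    match = re.search(r"(\w+)\(([^)]*)\)", model_output)
    if not match:
        return None  # no function call found; treat as plain text
    name, raw_args = match.groups()
    if name not in FUNCTIONS:
        return None  # model invented a function we never defined
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return FUNCTIONS[name](*args)

print(dispatch("I should check: get_weather(Tokyo)"))
```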

This worked. But it had problems:

  • Every tool lived in your codebase. Want to add a database lookup? Write the function, redeploy.
  • The system prompt grew with every tool. Ten tools meant a wall of text describing each one’s name, parameters, and usage rules — all eating into the model’s context window.
  • No sharing. Your weather function was yours alone. Another developer building a similar app had to write their own from scratch.

Still, this was the moment LLMs went from “chat assistants” to “agents that can act.” OpenAI formalized this as Function Calling in June 2023, giving it a structured JSON schema instead of freeform text in the system prompt. But the architecture remained the same: functions lived on your server, described in your prompt.
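The structured format looked roughly like this (modeled on OpenAI's public Function Calling schema; the weather tool itself is illustrative):

```python
# Roughly the shape of a Function Calling tool definition: a name, a
# description, and a standard JSON Schema for the parameters.
import json

get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {  # plain JSON Schema
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Tokyo"},
        },
        "required": ["city"],
    },
}

print(json.dumps(get_weather_schema, indent=2))
```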


Stage 2: Tool Servers — Moving Tools Out

The next step was obvious: why keep tools in your app? If a tool is just a function with inputs and outputs, it can live anywhere. Run it as a separate service. Call it over HTTP.

This gave rise to tool servers — standalone services that expose tools via APIs. Your app no longer needs to contain the tool code. It just needs to know the tool’s URL, its schema, and how to call it.
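A sketch of the idea in Python, using only the standard library. The endpoint path and payloads are invented, since at this stage every tool server defined its own format:

```python
# A toy standalone tool server plus a client calling it over HTTP. The
# /tools/get_weather endpoint and its JSON shapes are made up for
# illustration -- pre-standard tool servers each had their own format.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class ToolHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/tools/get_weather":
            result = {"city": body["city"], "forecast": "sunny"}  # stubbed tool
        else:
            result = {"error": "unknown tool"}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), ToolHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The AI app holds no tool code -- just the URL, the schema, and how to call it.
req = Request(
    f"http://127.0.0.1:{server.server_port}/tools/get_weather",
    data=json.dumps({"city": "Tokyo"}).encode(),
    headers={"Content-Type": "application/json"},
)
result = json.loads(urlopen(req).read())
server.shutdown()
print(result)
```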

Benefits:

  • Separation of concerns. The AI app handles conversation. The tool server handles actions.
  • Reusability. One tool server can serve many AI apps.
  • Independent scaling. A heavy tool (like image generation) can run on beefy hardware while the chat app stays lightweight.

But there was no standard. Every tool server had its own API format, its own auth scheme, its own way of describing capabilities. Connecting your AI app to three different tool servers meant writing three different integrations.


Stage 3: MCP — A Universal Plug for Tools

In late 2024, Anthropic released the Model Context Protocol (MCP). The idea: a single, open standard for connecting AI models to tools, data sources, and external systems.

Think of it like USB for AI. Before USB, every device had its own cable; USB replaced them all with one universal connector. MCP does the same thing for tools — one protocol, any tool, any model.

An MCP server exposes:

  • Tools — functions the model can call
  • Resources — data the model can read (files, database rows, API responses, documentation)
  • Prompts — reusable prompt templates that can embed domain knowledge, task workflows, and usage guidelines

This means an MCP server is not just a function registry. It can carry rich context: a database MCP server can include a schema overview and query best practices; a codebase MCP server can expose architecture docs alongside the file-reading tools. The model receives both the capability and the knowledge needed to use it correctly.

Any AI app that speaks MCP can connect to any MCP server. A developer builds a Slack MCP server once, and every MCP-compatible AI app can use it. No custom integration needed.
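On the wire, MCP is JSON-RPC 2.0. A sketch of the two core tool messages, with an illustrative Slack tool (message shapes follow the published protocol; consult the spec for the full schema):

```python
# Sketch of MCP's wire format: JSON-RPC 2.0 requests for discovering and
# invoking tools. The send_message tool and its arguments are illustrative.
import json

# Ask the server what tools it exposes.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Invoke one of them by name with structured arguments.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "send_message",
        "arguments": {"channel": "#general", "text": "Build finished."},
    },
}

print(json.dumps(list_request))
print(json.dumps(call_request))
```

Because every server speaks this same format, one client implementation covers them all.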

MCP gained traction fast. GitHub, Postgres, Slack, Google Drive — community-built MCP servers popped up for everything. The ecosystem exploded.

But then came a new problem.


Stage 4: The Context Window Problem

MCP made it easy to add tools. Too easy.

Each MCP server comes with instructions: “Here is what I can do. Here is how to call me. Here are the rules.” And since MCP servers can also carry resources and domain knowledge, the payload per server can be substantial. When you connect five MCP servers, the model’s context window fills up with tool descriptions, domain docs, usage guidelines, and server metadata — even when most of them are irrelevant to the current task.

Imagine a developer asking the AI to fix a bug in a Python file. The model’s context already contains instructions for:

  • A Slack MCP server (not needed right now)
  • A database MCP server (not needed right now)
  • A Jira MCP server (not needed right now)
  • A browser automation MCP server (not needed right now)
  • A file system MCP server (finally, the one it needs)

Four of the five server payloads — the bulk of the injected tool context — are wasted on capabilities the model will not use for this task. And context window space is precious — it is the model’s working memory. Every token spent on irrelevant tool descriptions is a token not available for understanding the actual problem.

This is the scaling curse of MCP: the more capable your agent becomes, the less room it has to think.


Stage 5: Skills — Tools on Demand, Knowledge on Demand

The solution? Don’t load everything upfront. Load what you need, when you need it.

This is the idea behind skills. A skill is a bundle of tools, instructions, and domain knowledge that gets injected into the model’s context only when triggered. Think of it as a lazy-loaded capability — and more than that, a lazy-loaded expert.

The key insight: a skill is not just a tool schema. It is a workflow guide.

A Jira skill, for example, doesn’t just say “here is how to call the Jira API.” It says:

  • When creating a bug ticket, always include steps to reproduce and the affected version.
  • Search for an existing ticket before creating a new one.
  • Link the new ticket to its parent epic.
  • Use the fix-version field only after confirming with the project lead.

This is embedded know-how — the kind of knowledge a senior engineer carries in their head. The skill packages both the tool access and the procedural knowledge for using it correctly.

A full skill payload typically contains:

  • Tool schemas — the API calls or functions the model can make
  • Domain knowledge — facts about the system, constraints, naming conventions
  • Task workflows — step-by-step guidance for common operations (e.g., “how to do a code review”, “how to open a PR”)
  • Examples — sample inputs and outputs that show correct behavior
  • Error handling — what to do when things go wrong

Instead of permanently stuffing the context with all of this, you register a skill:

  • Name: jira
  • Trigger: When the user mentions Jira, tickets, or sprint planning
  • Payload: Tool schemas + domain knowledge + task workflows + examples

When the model encounters a Jira-related request, the skill activates and its full payload enters the context. When the task is done, the payload leaves. The context stays clean.
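A sketch of that lazy-loading logic; the trigger keywords and payloads are invented:

```python
# A toy skill registry: payloads enter the context only when a trigger
# matches the user's message, and stay out otherwise. All names and
# payloads are illustrative.
class SkillRegistry:
    def __init__(self):
        self.skills = {}  # name -> (trigger keywords, payload)

    def register(self, name, triggers, payload):
        self.skills[name] = (triggers, payload)

    def activate(self, user_message):
        """Return only the payloads whose triggers match this message."""
        text = user_message.lower()
        return {
            name: payload
            for name, (triggers, payload) in self.skills.items()
            if any(word in text for word in triggers)
        }

registry = SkillRegistry()
registry.register("jira", ["jira", "ticket", "sprint"],
                  "Jira tool schemas + workflows + examples")
registry.register("slack", ["slack", "channel"],
                  "Slack tool schemas + workflows + examples")

# Only the Jira payload is injected; the Slack payload never touches context.
active = registry.activate("Create a Jira ticket for the login bug")
print(active)
```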

Skills solve the scaling problem:

| Approach | Context cost             | 10 tools | 100 tools   |
|----------|--------------------------|----------|-------------|
| Raw MCP  | All tools always loaded  | Heavy    | Unworkable  |
| Skills   | Only active tools loaded | Light    | Still light |

Now an agent can have access to hundreds of capabilities without choking its context window. The shift is from “know everything” to “know how to find what you need — and know how to use it once you find it.”


Stage 6: One Agent = Memory + Skills

With skills in place, we can define what a modern AI agent actually is:

An agent = a model + memory + a set of skills

  • Model: The LLM brain that reasons and generates text
  • Memory: Persistent knowledge that survives across conversations — user preferences, project context, past decisions
  • Skills: On-demand capabilities that activate when needed — each carrying its own tools, domain knowledge, and task workflows
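The equation above maps directly onto a data structure. A sketch (real agent frameworks differ in the details):

```python
# "Agent = model + memory + skills" as a data structure. Field names and
# the model identifier are illustrative.
from dataclasses import dataclass, field

@dataclass
class Agent:
    model: str                                    # the LLM that reasons
    memory: dict = field(default_factory=dict)    # persists across conversations
    skills: dict = field(default_factory=dict)    # name -> payload, loaded on demand

    def remember(self, key, value):
        self.memory[key] = value

agent = Agent(model="some-llm")
agent.remember("preferred_language", "TypeScript")
agent.skills["jira"] = "lazy-loaded Jira payload"
print(agent.memory["preferred_language"])
```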

Memory is what makes an agent yours. Without memory, every conversation starts from zero. With memory, the agent remembers that you prefer TypeScript over JavaScript, that your database is Postgres, that your team uses conventional commits.

Skills are what make an agent capable and knowledgeable. Without skills, the agent can only chat. With skills, it can commit code, manage PRs, query databases, browse the web — and it knows how to do each of those things correctly, not just that the tools exist. A commit skill doesn’t just provide git access; it carries the team’s commit message conventions, branching strategy, and review checklist.

This is a clean, scalable architecture. One agent, focused context, relevant tools loaded on demand, persistent knowledge.

But one agent has limits.


Stage 7: Teams of Agents

Some tasks are too big for one agent. Not because the model isn’t smart enough, but because:

  • Context windows are finite. A single agent working on a large codebase will eventually run out of room.
  • Sequential work is slow. One agent reading 50 files, then planning, then coding, then testing — that is a long chain.
  • Different tasks need different focus. Research requires broad exploration. Coding requires deep file-level attention. Testing requires running commands and reading output. One agent juggling all three loses focus.

The answer: teams of agents, each with their own context window, their own memory, and their own skills.

A typical team might look like:

| Agent      | Role                                  | Skills                        |
|------------|---------------------------------------|-------------------------------|
| Lead       | Coordinates work, synthesizes results | Task management, messaging    |
| Researcher | Explores the codebase, reads docs     | Search, web browsing          |
| Coder      | Writes and edits code                 | File editing, code generation |
| Tester     | Runs tests, reports results           | Test execution, log analysis  |

Each agent runs as an independent process. They don’t share a context window. They each start clean and focused on their assigned task.

But this creates a new problem: how do they talk to each other?


Stage 8: The Communication Problem

The moment you have multiple agents, you need answers to hard questions:

  • How does Agent A tell Agent B what to do?
  • How does Agent B report results back?
  • How do they avoid stepping on each other’s work?
  • How does the team know when everything is done?

Different systems answer these questions differently. Here are the main approaches that exist today.

Approach 1: Shared Conversation Thread

Used by: Microsoft AutoGen

The simplest model. All agents share one message thread. Every agent can see what every other agent has said. A selector (either round-robin or model-based) picks who speaks next.

Agent A: "I found the bug in auth.py line 42."
Agent B: "I'll write a fix."
Agent C: "I'll write a test for the fix."

Pros: Simple. Full visibility. Easy to debug.

Cons: Context grows fast. With 5 agents and 100 messages, every agent carries the full conversation — even the parts that don’t concern it. Expensive and wasteful at scale.
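A sketch of the pattern with a round-robin selector and canned replies; note that every agent is handed the entire thread:

```python
# Shared-thread coordination in miniature: one message list, a round-robin
# selector picking the next speaker. Agent replies are canned for illustration.
from itertools import cycle

thread = [{"speaker": "user", "text": "Fix the auth bug."}]
agents = ["Finder", "Coder", "Tester"]
replies = {
    "Finder": "I found the bug in auth.py line 42.",
    "Coder": "I'll write a fix.",
    "Tester": "I'll write a test for the fix.",
}

selector = cycle(agents)  # round-robin; frameworks can also use an LLM selector
for _ in range(3):
    speaker = next(selector)
    # Every agent sees the ENTIRE thread -- the cost of this pattern: the
    # visible context grows with every message, relevant or not.
    visible_context = list(thread)
    thread.append({"speaker": speaker, "text": replies[speaker]})

print(len(thread))
```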

Approach 2: LLM-Driven Delegation

Used by: CrewAI

Agents don’t share a thread. Instead, when Agent A needs help, it uses a delegation tool — essentially asking the LLM to compose a message to a specific teammate.

Agent A calls: delegate_work(task="fix the bug", coworker="Coder", context="bug is in auth.py line 42")

The framework routes this to the Coder agent, which works on it and returns a result.

Pros: Natural. Agents decide when and what to delegate.

Cons: Unpredictable. The LLM decides when to delegate, so you can’t guarantee it will. No structured lifecycle — you can’t track whether a delegated task is pending, in-progress, or done.

Approach 3: State Graph

Used by: LangGraph

The developer defines a graph where each node is an agent and each edge is an allowed transition. State flows through the graph as typed data. Agents hand off control explicitly using Command objects that specify both a state update and the next agent to run.

Researcher → Planner → Coder → Tester → (back to Planner if tests fail)

Pros: Deterministic. Auditable. You know exactly what happens and in what order.

Cons: Rigid. You must design the graph upfront. Agents can’t self-organize or adapt the workflow at runtime.

Approach 4: Task List + Direct Messages

Used by: Claude Code Agent Teams

Each team has a shared task list stored on disk. Agents claim tasks, work on them, and mark them done. For direct communication, agents use a message tool to send messages to specific teammates.

Lead creates task: "Fix auth bug" → status: pending
Coder claims task → status: in_progress
Coder finishes → status: completed → sends message to Lead: "Fixed. See commit abc123."
Lead reads message, checks task list, assigns next work.

Tasks support dependencies — Task B can be blocked by Task A, and automatically unblocks when A completes.
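A sketch of the task-list mechanics, including automatic unblocking; statuses and fields are illustrative:

```python
# A toy shared task list: tasks carry a lifecycle status, and a blocked
# task unblocks automatically when its dependency completes.
tasks = {
    "A": {"desc": "Fix auth bug", "status": "pending", "blocked_by": None},
    "B": {"desc": "Write regression test", "status": "blocked", "blocked_by": "A"},
}

def claim(task_id):
    assert tasks[task_id]["status"] == "pending"
    tasks[task_id]["status"] = "in_progress"

def complete(task_id):
    tasks[task_id]["status"] = "completed"
    for t in tasks.values():  # unblock anything waiting on this task
        if t["blocked_by"] == task_id and t["status"] == "blocked":
            t["status"] = "pending"

claim("A")       # Coder picks up the bug fix
complete("A")    # ...and finishes it; Task B becomes claimable
print(tasks["B"]["status"])
```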

Pros: Structured and flexible. Tasks track lifecycle. Agents work independently without a shared context. The lead doesn’t bottleneck communication — agents can message each other directly.

Cons: Requires discipline. Agents must remember to update task status. An agent that forgets to mark a task done can block the whole team.

Approach 5: Queue-Based Actor Model with Mention Parsing

Used by: TinyClaw

What if agents could coordinate using natural language mentions — just like humans do on Slack or Discord?

TinyClaw takes this approach. There is no central orchestrator. Instead, agents communicate by writing mention tags directly in their responses:

I fixed the bug in auth.ts.
[@reviewer: Please review the changes in auth.ts for security issues.]
[@tester: Run the test suite after the review is done.]

A parser extracts these tags, and the system enqueues internal messages to the target agents via a shared SQLite queue. Each agent has its own isolated workspace and processes messages sequentially through a promise chain — but different agents run in parallel.

The coordination relies on a pending counter. When an agent mentions a teammate, the counter increments. When a teammate finishes (with no further mentions), it decrements. When it hits zero, the conversation is complete and all responses are aggregated.

User: "@dev fix the auth bug"
  → Coder (team leader) receives message     pending: 1
  → Coder responds, mentions @reviewer       pending: 1
  → Reviewer responds, mentions @tester      pending: 1
  → Tester responds (no mentions)            pending: 0 → COMPLETE
  → Aggregated response sent back to user

Text written outside mention tags becomes shared context — delivered to all mentioned agents. This lets an agent broadcast background information while sending targeted instructions:

Sprint ends Friday. 3 open bugs remain.
[@coder: Focus on the auth bug first.]
[@reviewer: Prioritize any open PRs.]

Both agents receive the shared context plus their specific message.
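The parsing step is simple to sketch; the tag format follows the examples above:

```python
# Mention parsing in miniature: extract [@agent: ...] tags, and treat
# everything outside the tags as shared context for all mentioned agents.
import re

def parse_mentions(response):
    tags = re.findall(r"\[@(\w+):\s*([^\]]+)\]", response)
    shared = re.sub(r"\[@\w+:[^\]]+\]", "", response).strip()
    return shared, {agent: text.strip() for agent, text in tags}

shared, targeted = parse_mentions(
    "Sprint ends Friday. 3 open bugs remain.\n"
    "[@coder: Focus on the auth bug first.]\n"
    "[@reviewer: Prioritize any open PRs.]"
)
print(shared)
print(targeted)
```

In the real system, each extracted tag becomes an enqueued message in the SQLite queue; here we only show the split between shared context and targeted instructions.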

Pros: No orchestrator needed. Natural language mentions feel intuitive. SQLite queue is durable (survives crashes, supports retries). Supports fan-out, backflow, and cross-talk — all through the same queue-and-counter mechanism.

Cons: Agents must follow the mention format correctly. No structured task lifecycle — you can’t query whether a piece of work is “pending” or “in progress” like you can with a task list.

Approach 6: An Open Protocol — Google’s A2A

Announced: April 2025 | Governed by: Linux Foundation (since June 2025)

All the approaches above work within a single framework. But what if your agents are built by different companies, running on different servers, using different models?

Google’s Agent-to-Agent (A2A) protocol tackles this. It is an open standard for agents to discover each other, authenticate, and exchange work — across organizational boundaries.

The key concepts:

  1. Agent Cards — Every agent publishes a JSON file at /.well-known/agent.json describing what it can do, like a machine-readable resume.
  2. Tasks — Structured work units that move through a lifecycle: submitted → working → completed / failed.
  3. Messages and Parts — The communication units within a task, supporting text, images, audio, and other modalities.
  4. Streaming — Long-running tasks use Server-Sent Events (SSE) for real-time progress updates.
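A sketch of an Agent Card; field names are approximate, so consult the A2A specification for the authoritative schema:

```python
# A sketch of an A2A Agent Card -- the machine-readable resume published
# at /.well-known/agent.json. Field names are approximate and the agent
# itself is hypothetical.
import json

agent_card = {
    "name": "code-review-agent",
    "description": "Reviews pull requests for style and security issues.",
    "url": "https://agents.example.com/review",  # hypothetical endpoint
    "capabilities": {"streaming": True},         # supports SSE progress updates
    "skills": [
        {"id": "review-pr", "description": "Review a pull request"},
    ],
}

print(json.dumps(agent_card, indent=2))
```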

A2A is designed to complement MCP, not replace it:

  • MCP = how an agent connects to tools (vertical)
  • A2A = how an agent connects to other agents (horizontal)

As of early 2026, A2A has strong enterprise backing (AWS, Microsoft, Salesforce, SAP) but MCP has won more grassroots developer adoption. The two may converge over time — tools and agents are just different points on the same spectrum.

A Note on Hybrid Systems

Real-world systems rarely follow a single pattern. They mix and match — deterministic routing for known workflows, LLM-driven delegation for open-ended tasks, and shared infrastructure for discovery and auth. OpenClaw is one example: it uses a gateway for agent routing (LangGraph-like deterministic control), LLM-driven session spawning for dynamic collaboration (CrewAI-like flexibility), and a shared gateway layer for authentication and discovery. As agent architectures mature, expect more hybrids like this rather than pure implementations of any single approach.

Comparing the Approaches

| Approach | Orchestration | Context | Task Tracking | Flexibility | Best For |
|---|---|---|---|---|---|
| Shared Thread (AutoGen) | Selector (round-robin / LLM) | Shared across all agents | None built-in | Moderate — selector can adapt | Brainstorming, group discussions |
| LLM Delegation (CrewAI) | LLM decides when to delegate | Isolated per agent | None built-in | High — agents self-organize | Open-ended exploration |
| State Graph (LangGraph) | Developer-defined graph | Shared typed state | Implicit in graph transitions | Low — graph is fixed | Deterministic pipelines |
| Task List + DMs (Claude Code) | Lead assigns via task list | Isolated per agent | Structured lifecycle (pending → done) | High — tasks adapt at runtime | Complex multi-step projects |
| Queue + Mentions (TinyClaw) | No orchestrator — mention-based | Shared context + targeted messages | Pending counter only | High — agents self-coordinate | Decentralized async workflows |
| A2A Protocol (Google) | Client agent initiates tasks | Isolated across organizations | Structured lifecycle (submitted → done) | Moderate — protocol-defined | Cross-org agent interop |

The Full Picture

Let’s zoom out and see how all the pieces fit together:

┌──────────────────────────────────────────────────┐
│                    Agent Team                    │
│                                                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐      │
│  │ Agent A  │   │ Agent B  │   │ Agent C  │      │
│  │ (Lead)   │   │ (Coder)  │   │ (Tester) │      │
│  │          │   │          │   │          │      │
│  │ Memory   │   │ Memory   │   │ Memory   │      │
│  │ Skills   │   │ Skills   │   │ Skills   │      │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘      │
│       │              │              │            │
│       └──────── Communication ──────┘            │
│          (Messages / Tasks / A2A)                │
└───────────────────────┬──────────────────────────┘
                        │
                 Skills activate
                   on demand
                        │
              ┌─────────┼─────────┐
              │         │         │
          ┌───┴───┐ ┌───┴───┐ ┌───┴───┐
          │  MCP  │ │  MCP  │ │  MCP  │
          │Server │ │Server │ │Server │
          │(Slack)│ │ (DB)  │ │ (Git) │
          └───────┘ └───────┘ └───────┘

The evolution, in one line:

Hardcoded functions → Tool servers → MCP (standard plug) → Skills (on-demand loading) → Agents (memory + skills) → Teams (multiple agents) → Communication protocols (how they coordinate)

Each layer solved the previous layer’s problem. Each layer created a new one. And that is how we got here.


The LLM is still just predicting the next token. Everything else — tools, skills, memory, teams, protocols — is scaffolding we built around that one simple trick.


Sources