
The Knowing-Doing Gap in LLM Tool Use

A builder-facing reading of arXiv:2605.14038: tool necessity should be measured per model, and many tool-use failures happen after the model internally represents the need for a tool but before it emits the tool-call action.

2026-05-18 · 9 min read
Tool Use · Agent Reliability · LLM Agents · Runtime Design · Evaluation · Interpretability

The fragile part of tool use is not always the tool.

Sometimes the model has enough internal signal to know that it should use a calculator, search API, database, browser, or system tool. Then, at the moment when that awareness has to become an actual tool call, the behavior fails.

That is the central lesson of “Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use” by Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, and Soheil Feizi.

For agent builders, the paper is useful because it moves the conversation from a vague complaint — “models do not know when to use tools” — into a sharper diagnosis:

Tool-use reliability has two problems: recognizing tool necessity, and translating that recognition into the tool-call action.

Those are not the same problem.


Tool necessity should be model-adaptive

Many evaluations treat tool necessity as a static property of the query.

A weather question needs an external source. A paraphrase request does not. A large arithmetic expression probably needs a calculator. This works for obvious examples, but it becomes too crude once models have different capability boundaries.

A query can be tool-unnecessary for a stronger model and tool-necessary for a weaker one. The same factual question might be inside one model's reliable knowledge boundary and outside another's. A multiplication problem that one model solves consistently may be a coin flip for a smaller model.

The paper's first important move is to define tool necessity relative to the tested model's empirical behavior.

For a model f and query x, the authors run the model without tools for N = 10 independent inferences at temperature 0.7. If the model answers correctly across all runs, the query is treated as tool-unnecessary for that model. If it fails at least once, the query is treated as tool-necessary for that model.

This is intentionally stricter than “can the model get it right once?” The definition asks whether the model can do the task reliably without external help.
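The labeling rule can be sketched in a few lines. This is only an illustration: `sample_answer` is a hypothetical callable standing in for one no-tool inference at temperature 0.7, and exact string match is my assumption for grading, not necessarily what the paper uses.

```python
from typing import Callable


def label_tool_necessity(
    sample_answer: Callable[[str], str],  # hypothetical: one no-tool sample at T=0.7
    query: str,
    gold: str,
    n_runs: int = 10,
) -> str:
    """Return 'unnecessary' only if every no-tool run is correct."""
    for _ in range(n_runs):
        # Exact match is an assumption; a real grader is usually more lenient.
        if sample_answer(query).strip() != gold.strip():
            return "necessary"  # a single failure is enough
    return "unnecessary"


# Toy stand-ins for a model: one always right, one wrong on every third call.
def reliable(_query: str) -> str:
    return "56"


class Flaky:
    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, _query: str) -> str:
        self.calls += 1
        return "54" if self.calls % 3 == 0 else "56"


print(label_tool_necessity(reliable, "7 * 8", "56"))  # unnecessary
print(label_tool_necessity(Flaky(), "7 * 8", "56"))   # necessary
```

The flaky model is right most of the time and still gets the "necessary" label, which is the point of the all-runs criterion.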

That makes the label less universal, but more operational. In production, the relevant question is not whether some ideal judge thinks a task “needs a tool.” It is whether this deployed model, under this runtime, can safely answer without one.


The mismatch is large

The authors test four open models:

  • Qwen3-8B
  • Qwen3-4B
  • Llama-3.1-8B-Instruct
  • Llama-3.2-3B-Instruct

They evaluate two domains:

  • Arithmetic: 4,000 generated problems ranging from simple addition/subtraction to multi-digit multiplication, modulo, parentheses, precedence chains, and long arithmetic chains.
  • TruthfulQA: 817 factual QA instances.

After assigning model-specific tool-necessity labels, they compare those labels with the model's actual tool-call behavior. The tool is a calculator for arithmetic and a search API for factual QA.

The mismatch is not a rounding error:

  • Arithmetic: 26.5% to 54.0% necessity-action mismatch.
  • TruthfulQA: 30.8% to 41.8% necessity-action mismatch.

The failure can point in either direction:

  • Necessary but not called: the model should use a tool but answers directly.
  • Unnecessary but called: the model could answer directly but invokes a tool anyway.

Both matter. Under-calling tools causes hallucination, stale knowledge, and preventable arithmetic errors. Over-calling tools adds latency, cost, operational noise, and sometimes extra attack surface.
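To keep the two directions separate in an eval, a breakdown along these lines is enough. It is a sketch with made-up records; the `(tool_necessary, tool_called)` pair format is my own convention, not the paper's.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def mismatch_breakdown(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """records: (tool_necessary, tool_called) pairs for one model and one domain."""
    counts: Counter = Counter()
    for necessary, called in records:
        if necessary and not called:
            counts["under_call"] += 1   # should have used the tool, answered directly
        elif called and not necessary:
            counts["over_call"] += 1    # could have answered directly, called anyway
        else:
            counts["aligned"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records")
    return {k: counts[k] / total for k in ("aligned", "under_call", "over_call")}


# Illustrative records only, not data from the paper.
records = [(True, True), (True, False), (False, True), (False, False), (True, False)]
print(mismatch_breakdown(records))
# {'aligned': 0.4, 'under_call': 0.4, 'over_call': 0.2}
```

The total mismatch rate is `under_call + over_call`; reporting only that sum hides which direction dominates.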

A useful detail is that the dominant failure mode is model- and domain-dependent. For example, the paper reports Qwen3-8B overusing tools on arithmetic, while Qwen3-4B and both Llama models show stronger arithmetic underuse. That is a warning against global knobs like “make this model call tools more often.” The needed policy may differ by model, domain, and tool class.


The paper's key split: cognition vs. execution

The most interesting part is not only that models mismatch. It is where the mismatch comes from.

The authors decompose tool use into two stages:

  1. Cognition: does the model's internal state encode that a tool is necessary?
  2. Execution: does the model actually emit the tool-triggering action?

To study cognition, they train linear probes on hidden states to distinguish model-adaptive tool-necessary samples from tool-unnecessary samples. To study execution, they train another probe to predict whether the model will actually call the tool.

The result is subtle and important: both signals are often linearly decodable.

So the model is not always internally blind. In many cases, its hidden states contain usable information about whether a query falls outside its own no-tool capability boundary. The problem is that this information does not reliably become action.

The paper reports that the cognition direction and the execution direction can become near-orthogonal in the late-layer, last-token regime — exactly the region that drives the next generated action.

That is the “knowing-doing gap.”

The model can carry a representation that points toward “this needs a tool,” while the representation that controls “I am about to call a tool” points somewhere else. The awareness exists, but it is not wired cleanly into the behavior.
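For intuition about what "near-orthogonal directions" means, here is a toy comparison. It is not the paper's method: I use a difference-of-means direction as a simpler stand-in for a trained linear probe, and the "hidden states" are synthetic Gaussians built so that necessity and the call decision live in different dimensions. The near-zero cosine is therefore true by construction and says nothing about real models; the sketch only shows how the comparison is computed.

```python
import math
import random


def mean_diff_direction(states, labels):
    """Direction from the label-0 class mean to the label-1 class mean."""
    dim = len(states[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for h, y in zip(states, labels):
        counts[y] += 1
        for i, v in enumerate(h):
            sums[y][i] += v
    return [sums[1][i] / counts[1] - sums[0][i] / counts[0] for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Synthetic stand-in for last-token hidden states (assumption, not real activations):
# "tool is necessary" depends on dimension 0, "tool was called" on dimension 1.
rng = random.Random(0)
states = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
necessary = [int(h[0] > 0) for h in states]
called = [int(h[1] > 0) for h in states]

cognition_dir = mean_diff_direction(states, necessary)
execution_dir = mean_diff_direction(states, called)

# Close to 0 here because the two labels were generated independently.
print(round(cosine(cognition_dir, execution_dir), 3))
```

If the two directions were well aligned, the cosine would be close to 1, and information about necessity would translate directly into the call decision.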

For people building agents, this is a big distinction. If the failure is only recognition, you improve classifiers, prompting, self-assessment, or calibration. If the failure is recognition-to-action translation, you also need runtime mechanisms that make the recognized need executable.


Explicit self-assessment is not a clean fix

A tempting solution is to ask the model first:

“Do you need a tool? Answer yes or no.”

The appendix tests a version of this. The results are a little awkward, in the useful way.

Explicit yes/no self-assessment has worse alignment with the capability-grounded tool-necessity labels. In one reported case, Llama-3.1-8B-Instruct answers “no” for every TruthfulQA sample, producing an undefined Matthews correlation coefficient (MCC), because the metric's denominator is zero when every prediction falls in a single class. The paper also reports that adding this explicit self-assessment step changes eventual tool-calling behavior substantially, for up to nearly 50% of samples for Qwen3-8B on arithmetic.

That does not mean self-assessment prompts are useless. It means they are intervention prompts, not neutral measurement tools.

Once a model says “yes” or “no” in the context, it may become more consistent with that commitment, but that does not prove the commitment reflected the original internal decision process. For production systems, “ask the model whether it needs a tool” can be a policy component, but it should be evaluated as a behavior-changing component, not treated as ground truth.
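The undefined MCC in the all-“no” case follows directly from the formula. A small sketch, with invented confusion-matrix counts rather than the paper's numbers:

```python
import math
from typing import Optional


def mcc(tp: int, tn: int, fp: int, fn: int) -> Optional[float]:
    """Matthews correlation coefficient; None when the denominator is zero."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        # Undefined: a whole row or column of the confusion matrix is empty.
        return None
    return (tp * tn - fp * fn) / math.sqrt(denom)


# Self-assessment says "no" for every sample: there are no predicted positives,
# so tp + fp == 0 and the metric cannot be computed.
print(mcc(tp=0, tn=60, fp=0, fn=40))    # None

# A self-assessment that actually varies gives a defined score.
print(mcc(tp=40, tn=40, fp=10, fn=10))  # 0.6
```

Returning `None` instead of 0 is a deliberate choice here: a constant answer is not the same as an uninformative one, and an eval report should show the difference.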


Builder takeaways

1. Evaluate tool policy per model, not per benchmark label

A static label such as “requires search” is too coarse for many tasks. For agent deployments, the more useful measurement is model-specific:

  • Which queries can this model answer reliably without tools?
  • Which queries become unstable across repeated no-tool attempts?
  • Which domains show overuse vs. underuse?

This matters when swapping models. A router tuned for a 70B model may be unsafe for an 8B model. A policy tuned for arithmetic may not transfer to factual QA.

2. Log the negative space: missing tool calls

Many agent traces make tool calls visible but do not make omitted tool calls visible. That hides half the problem.

A useful eval harness should record cases where a model answered directly even though the model-specific capability test would have labeled the query tool-necessary. These “should-have-called” failures are the ones most likely to become confident wrong answers.

3. Treat tool calling as an action layer, not just a reasoning skill

If the cognition-to-action transition is a major failure point, then tool reliability should not be left entirely to natural-language reasoning.

Possible design responses include:

  • external tool routers trained per model and domain;
  • uncertainty gates for factual and time-sensitive questions;
  • forced calculator paths above arithmetic complexity thresholds;
  • policy checks that inspect planned answers before finalization;
  • evals that separate “recognized need” from “executed call”;
  • fine-tuning or auxiliary losses that reward correct translation from need representation into tool action.

The common theme: build a bridge from awareness to execution. Do not assume the model will always build that bridge by itself.
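As one example of such a bridge, here is a rough sketch of a forced calculator path. The complexity score and the threshold of 5 are placeholders I made up; in practice the threshold would be set per model from the no-tool reliability data.

```python
import re


def arithmetic_complexity(expr: str) -> int:
    """Crude score: number of operators plus digit count of the longest operand."""
    operators = re.findall(r"[+\-*/%]", expr)
    operands = re.findall(r"\d+", expr)
    longest = max((len(o) for o in operands), default=0)
    return len(operators) + longest


def must_use_calculator(expr: str, threshold: int = 5) -> bool:
    """Runtime gate: above the threshold, the calculator call is not optional."""
    return arithmetic_complexity(expr) >= threshold


print(must_use_calculator("7 + 8"))            # False
print(must_use_calculator("4821 * 977 % 13"))  # True
```

The gate runs outside the model, so it does not depend on the model turning its own recognition into a tool call.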

4. Avoid one global “tool aggressiveness” knob

The paper's per-model results show different error shapes. Some models over-call; others under-call; the pattern can change across domains.

A single scalar policy — “be more willing to use tools” — will fix some errors while creating others. Better policies should be conditional: model, task family, confidence, tool cost, safety risk, and freshness requirements all matter.
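A conditional policy can be as simple as a lookup keyed by model and task family. Everything in this sketch is hypothetical: the model names, the task families, the thresholds, and the `confidence` input, which I assume comes from some external estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    force_tool: bool = False
    min_confidence_to_skip: float = 0.5  # below this confidence, call the tool


# Hypothetical entries; real values would come from per-model, per-domain evals.
POLICIES = {
    ("small-model", "arithmetic"): ToolPolicy(min_confidence_to_skip=0.9),
    ("small-model", "fresh_facts"): ToolPolicy(force_tool=True),
    ("large-model", "arithmetic"): ToolPolicy(min_confidence_to_skip=0.6),
}
DEFAULT_POLICY = ToolPolicy()


def should_call_tool(model: str, family: str, confidence: float) -> bool:
    policy = POLICIES.get((model, family), DEFAULT_POLICY)
    return policy.force_tool or confidence < policy.min_confidence_to_skip


# Same query confidence, different decision depending on the model.
print(should_call_tool("small-model", "arithmetic", 0.8))    # True
print(should_call_tool("large-model", "arithmetic", 0.8))    # False
print(should_call_tool("small-model", "fresh_facts", 0.99))  # True
```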

5. Be careful with closed-model conclusions

The paper relies heavily on probing hidden states, so the mechanistic diagnosis applies most directly to models where such internals are accessible. The authors explicitly note that this makes the probing part inapplicable to closed-source frontier systems such as GPT or Gemini.

The behavioral lesson is still relevant, but the hidden-state claim should not be overgeneralized beyond the studied setup.


A practical mental model

A tool-using agent needs three gates:

  1. Capability boundary: can I answer this reliably from my own weights and context?
  2. Need recognition: do my internal representations or external router mark this as tool-needed?
  3. Action commitment: do I actually emit the tool call before answering?

Most simple tool-use demos collapse these gates into one. Production systems should not.

The paper's contribution is to show why that collapse is dangerous. The model may pass gate two and still fail gate three. It may know enough to hesitate, but not enough to act.
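Keeping the gates separate also makes failures easier to name. A minimal sketch, assuming three booleans are available per query: the capability label from repeated no-tool runs, a recognition signal from a probe or an external router (hypothetical here), and whether a tool call was actually emitted.

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "ok"
    OVER_CALL = "over_call"
    MISSED_RECOGNITION = "missed_recognition"  # fails at gate two
    KNOWING_DOING_GAP = "knowing_doing_gap"    # passes gate two, fails gate three


def diagnose(reliable_without_tool: bool, need_recognized: bool, tool_called: bool) -> Diagnosis:
    if reliable_without_tool:
        return Diagnosis.OVER_CALL if tool_called else Diagnosis.OK
    if tool_called:
        return Diagnosis.OK
    # Tool was needed but not called: separate "never noticed" from "noticed, did not act".
    return Diagnosis.KNOWING_DOING_GAP if need_recognized else Diagnosis.MISSED_RECOGNITION


print(diagnose(False, True, False).value)   # knowing_doing_gap
print(diagnose(False, False, False).value)  # missed_recognition
print(diagnose(True, False, True).value)    # over_call
```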

That should sound familiar to anyone who has watched an agent confidently answer from stale memory while a search tool sat one token away.


What I would change in an agent harness after reading this

If I were designing an agent runtime around this lesson, I would add a small “tool necessity audit” loop:

  • Build model-specific no-tool capability sets for common task families.
  • Track actual tool-call behavior against those sets.
  • Split metrics into under-call and over-call, not just total tool accuracy.
  • Add mandatory-tool rules for high-risk domains: time-sensitive facts, arithmetic above a threshold, user/account state, filesystem state, external APIs.
  • Store postmortems for missing tool calls as first-class reliability failures.
  • Test model upgrades against the old tool policy before switching traffic.
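The last bullet can be turned into a small regression check. This is a sketch with an invented trace format and a 5-point tolerance I picked arbitrarily. One thing to keep in mind: because necessity is model-adaptive, the `tool_necessary` labels in the new traces have to be recomputed for the new model, not copied from the old one.

```python
from collections import defaultdict


def under_call_rate_by_family(traces):
    """Share of tool-necessary queries answered without a tool, per task family."""
    needed = defaultdict(int)
    missed = defaultdict(int)
    for t in traces:
        if t["tool_necessary"]:
            needed[t["family"]] += 1
            if not t["tool_called"]:
                missed[t["family"]] += 1
    return {fam: missed[fam] / needed[fam] for fam in needed}


def regressions(old_traces, new_traces, tolerance=0.05):
    """Task families where the new model under-calls noticeably more than the old one."""
    old = under_call_rate_by_family(old_traces)
    new = under_call_rate_by_family(new_traces)
    return sorted(f for f in new if new[f] > old.get(f, 0.0) + tolerance)


def trace(family, necessary, called):
    return {"family": family, "tool_necessary": necessary, "tool_called": called}


# Illustrative traces only.
old_traces = [trace("arithmetic", True, True)] * 3 + [trace("arithmetic", True, False)]
old_traces += [trace("facts", True, True)] * 2
new_traces = [trace("arithmetic", True, True)] * 3 + [trace("arithmetic", True, False)]
new_traces += [trace("facts", True, True), trace("facts", True, False)]

print(regressions(old_traces, new_traces))  # ['facts']
```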

None of this is glamorous. It is plumbing. But agent reliability is often plumbing wearing a lab coat.

The nice thing about this paper is that it gives that plumbing a sharper conceptual handle: the problem is not only knowing when a tool is needed. The problem is making knowing operational.


Caveats

A few boundaries are worth keeping explicit:

  • The model-adaptive necessity definition uses N = 10 and temperature 0.7; other settings could shift the labels and mismatch rates.
  • The domains are arithmetic and TruthfulQA-style factual QA, not the full mess of long-horizon browser or software-engineering agents.
  • The mechanistic probe analysis requires hidden states, so it is not directly available for closed commercial models.
  • “Linearly decodable” does not mean the model consciously knows in a human sense. It means the information is recoverable from internal representations by a probe.

Even with those caveats, the paper lands a strong practical point: reliable tool use is not just about giving models better tools or better instructions. It is about ensuring that internal recognition reliably crosses the final gap into action.

For agents, “I knew I should have checked” is not a defense. It is the bug.


Source

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, and Soheil Feizi. “Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use.” arXiv:2605.14038.