Teaching an SLM to call tools
A base-instruct SLM, handed a function schema, wants to explain. It wants to ask clarifying questions. It wants to narrate. What it does not want to do is emit a structured <tool_call> JSON blob. Here is the three-step recipe — SFT on format, DPO against bad calls, evaluation on BFCL — that turns a 10% tool-caller into a 79% one.
Why base SLMs fail at tool calling
Hand a well-trained instruction model a system prompt like “You have access to the function get_weather(city, date). Emit a tool call when appropriate.” — and watch it politely ask clarifying questions instead. This is not a bug. It is doing exactly what its training incentivised.
Instruction-tuned base models (Llama-3-Instruct, Qwen2.5-Instruct, Phi-4-mini-instruct) are SFT'd on millions of helpful-natural-language examples and DPO'd on preferences that reward polite, informative prose. The target distribution is assistant-speak: paragraphs, bullet lists, gentle hedges. Tool calls — terse, angle-bracketed, JSON-shaped — are structurally out-of-distribution for that model. The closest pattern in its training prior is probably an inline code block, which is a different token sequence entirely.
The failure mode has a signature. Given a tool-eligible prompt, a base instruct model typically:
- Acknowledges the request in prose (“Sure, I can help with that!”).
- Asks a clarifying question about some missing argument it could have guessed.
- Sometimes invents a fake tool-call-looking string inside a markdown code block — formally correct syntax, wrong wrapper tokens, unparseable by the host.
Chat template tokens as the boundary
The second reason tool calling needs fine-tuning: tool tokens are chat-template-specific. Different model families use wildly different conventions, and the difference is not cosmetic — each family was pretrained with its own special tokens, and only those tokens have been allocated their own embedding.
Llama 3.1's template, for example, favours a pythonic call terminated by its dedicated <|eom_id|> token:

```text
get_weather(
    city="Tokyo",
    date="2026-04-19"
)<|eom_id|>
```

Qwen 2.5 instead wraps a JSON object in <tool_call> tags:

```text
<tool_call>
{"name": "get_weather",
 "arguments": {"city": "Tokyo"}}
</tool_call>
```
Phi-3 uses yet another shape — an XML-ish <|tool|>…<|/tool|> block with named attributes. Mistral's instruct-v3 uses a [TOOL_CALLS] prefix. Every family disagrees, and every family's tokenizer has dedicated IDs for its own delimiters.
The practical consequence: you do not “teach tool calling.” You teach tool calling in this specific model's chat template. If you train on Qwen's <tool_call> schema and then deploy to Llama, the model emits a string that looks right to a human but is not a single special token to the tokenizer — it tokenises as six or seven regular subwords, and the host serving layer will not detect it as a tool call.
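To see why the wrapper matters to the host, here is a minimal sketch of host-side detection, assuming Qwen-style `<tool_call>` delimiters (real serving layers typically match on the template's dedicated token IDs rather than strings; `extract_tool_calls` is a hypothetical helper, not any library's API):

```python
import json
import re

# Hypothetical host-side extractor for Qwen-style <tool_call> delimiters.
# This string-level sketch only illustrates why the wrapper must match
# exactly: the same JSON in a markdown code block is invisible to it.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return parsed tool calls, or [] if the wrapper tokens are absent."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # wrapper present but body unparseable
    return calls

# A completion in the trained-for template parses...
ok = extract_tool_calls(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'
)
# ...but the same JSON inside a markdown code block does not.
missed = extract_tool_calls(
    '```json\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n```'
)
```

The `missed` case is exactly the base-model failure mode from earlier: formally correct syntax, wrong wrapper, zero tool calls detected.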
Dataset design — three classes of examples
A tool-use fine-tuning dataset needs three classes, and the third is the one every team forgets until their deployed model starts calling tools at random.
- Good calls — the full pipeline. User prompt, tool call, tool response, final natural-language answer. The model learns to emit a structured call, wait for a response, and integrate the result. This is the SFT-positive class.
- Bad calls — valid-schema calls with wrong arguments. Same shape as a good call, but the extracted city is misspelled, the date is left in natural language, or a required argument is dropped. This is the class DPO will use as rejected.
- “No tool needed” negatives — prompts where the right behaviour is to answer in prose. “What's 2+2?” A greeting. A philosophical question. The model learns when not to reach for a tool.
Skip class (3) and you ship a model that over-triggers — it sees a function schema in the system prompt and reflexively emits a call for every user turn, including “hi” and “thanks.” This failure is visible within the first ten minutes of any real deployment, and the fix is retroactive and expensive. Add the negatives up-front.
- good calls · SFT-positive — full pipeline: call, tool response, final answer
- bad calls · DPO-rejected partners — valid schema, wrong arguments, paired with a good-call chosen
- no-tool-needed · prose-only responses, tool schema still in context
Exact ratios vary by dataset (xLAM uses the APIGen pipeline at arXiv 2406.18518; Microsoft's slm-finetuning-for-function-calling draws from glaive-function-calling-v2 with its own scenario taxonomy). The ratios are less important than the presence of all three — a dataset with zero negatives trains a model that cannot say “no tool is needed here.”
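Concretely, one row from each class might look like the following sketch. The chat-format field names and the `get_weather` schema are illustrative, not drawn from any specific dataset:

```python
# Hypothetical rows for the three dataset classes, in generic chat format.
SYSTEM = "Available tools: get_weather(city: str, date: str)"

# (1) Good call: the full pipeline, used as-is for SFT.
good_call = {
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What's the weather in Tokyo tomorrow?"},
        {"role": "assistant", "content":
            '<tool_call>{"name": "get_weather", '
            '"arguments": {"city": "Tokyo", "date": "2026-04-19"}}</tool_call>'},
        {"role": "tool", "content": '{"temp_c": 18, "conditions": "clear"}'},
        {"role": "assistant", "content":
            "Tomorrow in Tokyo: clear skies, around 18 °C."},
    ]
}

# (2) Bad call: same prompt, valid schema, sloppy arguments —
#     paired with the good call's assistant turn as a DPO pair.
dpo_pair = {
    "prompt": [good_call["messages"][0], good_call["messages"][1]],
    "chosen": good_call["messages"][2]["content"],
    "rejected":
        '<tool_call>{"name": "get_weather", '
        '"arguments": {"city": "Tokio", "date": "tomorrow"}}</tool_call>',
}

# (3) No-tool-needed negative: schema still in context, prose is correct.
negative = {
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What's 2+2?"},
        {"role": "assistant", "content": "2 + 2 = 4. No tool needed for that."},
    ]
}
```

Note that class (2) reuses class (1)'s prompt and chosen completion — the pairing on a shared prompt is what makes it usable for DPO later.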
Format tuning first, then argument tuning
The headline number from every tool-use paper is the gap between SFT-only and SFT + DPO. The three numbers in the hook viz (~10% / ~45% / ~79%) are illustrative of the typical BFCL-style progression — a base instruct model that never learned the tool-call chat template scores near zero; SFT on format gets the model to emit parseable schema most of the time; a DPO polish on matched (good, bad) pairs closes most of the remaining gap. Exact scores vary by model family, task mix, and BFCL version — the shape of the curve is the generalisable lesson.
Read those two jumps as two different learning problems. The first (10% → 45%) is format acquisition: the model learns to emit the chat-template tokens at all, in roughly the right places, with roughly the right JSON shape around them. That is the job of SFT. Once the tokens are in-distribution, a greedy decode from the model will produce a parseable tool call most of the time.
The second jump (45% → 79%) is argument precision. The model already knows to emit <tool_call>. What it does not reliably do is extract “Tokyo” as exactly “Tokyo” (not “Tokio”), or convert “tomorrow” to an ISO-8601 date. SFT with cross-entropy loss is surprisingly weak at this — the model gets partial credit for predicting a nearby token, and “Tokio” is token-by-token extremely close to “Tokyo” in the model's embedding space. The per-token loss signal simply does not penalise the substitution hard enough.
DPO fixes this by re-framing the learning problem as a pairwise preference. The model is shown two completions on the same prompt — one with the right argument, one with the wrong one — and asked to prefer the former. The difference is no longer per-token; it is per-completion. The model learns to prefer “Tokyo” over “Tokio” because the entire call with “Tokyo” was marked preferred, not because each token was individually corrected.
- SFT — cross-entropy on tokens. Pushes the next-token distribution toward the labelled completion. Great at structural patterns: tags in the right place, JSON shapes that close, argument keys spelled correctly. Weak at value precision because the substitution loss is locally small.
- DPO — pairwise preference log-likelihood. Pushes the model to prefer whole completions. Great at value precision — the right city, the ISO-8601 date, the dollar amount without the unit glued on — because the preference is at completion granularity. Revisit the dpo lesson for the derivation.
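The completion-granularity point can be made concrete with a toy computation of the DPO loss on one pair, given summed sequence log-probs under the policy and the frozen reference (the β value and the log-probs below are made up for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (margin_w - margin_l)),
    where each margin is the policy-vs-reference log-ratio of the WHOLE
    completion. Per-token closeness of "Tokio" to "Tokyo" never enters:
    only the summed sequence log-probs do."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up sequence log-probs: the policy initially prefers the sloppy call.
loss_before = dpo_loss(logp_w=-42.0, logp_l=-41.0,
                       ref_logp_w=-42.0, ref_logp_l=-42.0)

# After training lifts the whole chosen completion and pushes the whole
# rejected one down, the loss shrinks.
loss_after = dpo_loss(logp_w=-40.0, logp_l=-44.0,
                      ref_logp_w=-42.0, ref_logp_l=-42.0)
```

The gradient of this loss acts on the completion-level margin, which is why a one-token substitution like Tokio/Tokyo gets fully penalised rather than partially credited.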
This two-stage recipe generalises beyond tool calling. It is the same structure as the reasoning-model recipe (SFT-warm-up → RL-on-verifiable-rewards) and the safety fine-tuning recipe (SFT → DPO on harm-pair preferences). SFT for form; preference optimisation for substance.
DPO against bad calls — the preference pair
What does a DPO preference pair look like in the tool-use setting? The chosen and rejected completions share the same prompt and the same chat template. They differ only in the content inside the <tool_call> blob.
The shared prompt:

```text
user: "What's the weather in Tokyo tomorrow?"
system: available tools: get_weather(city: str, date: str)
```

The chosen completion — argument-precise:

```text
<tool_call>
{"name": "get_weather", "arguments":
 {"city": "Tokyo", "date": "2026-04-19"}}
</tool_call>
```

The rejected completion — argument-sloppy:

```text
<tool_call>
{"name": "get_weather", "arguments":
 {"city": "Tokio", "date": "tomorrow"}}
</tool_call>
```

Both completions are valid JSON. Both use the correct chat-template tags. Neither will make the SFT loss noticeably unhappy. Only the DPO objective — which compares the log-probability of the chosen completion against the rejected one, under the policy versus the reference — sees the substitution as a meaningful preference signal. Over a few thousand such pairs, the model internalises that Tokyo beats Tokio, ISO dates beat natural-language dates, and explicit currency codes beat "30 dollars" as a string.
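For reference, the DPO objective in its standard form (as derived in the dpo lesson):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```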
With y_w = argument-precise call, y_l = argument-sloppy call. The objective pushes π to assign higher log-probability to the whole chosen sequence than to the whole rejected one, relative to the frozen SFT reference. No reward model, no PPO loop. See the dpo lesson for the full derivation.
Evaluation: BFCL and the specialization tax
The Berkeley Function Calling Leaderboard (BFCL) is the de facto evaluation for tool use. It grades a model on several categories — simple single calls, parallel calls, multi-turn conversations, AST-level correctness, Java / JavaScript / Python dialect variants, and a “relevance” category that checks whether the model correctly declines to call a tool when it shouldn't.
The relevance category is where the “no tool needed” negatives pay off. A model that never saw negatives in training typically scores 95%+ on simple calls and ~10% on relevance — it calls tools on everything, including prompts the benchmark explicitly designed to be tool-irrelevant. The Microsoft recipe specifically targets this split.
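A sketch of how a relevance-style split might be scored (not the official BFCL harness — just the idea): on tool-irrelevant prompts, a completion passes iff it contains no tool-call wrapper. Delimiters assume a Qwen-style template; the completions are made up to mimic a model trained without no-tool negatives:

```python
# Toy relevance scorer: tool-irrelevant prompts should get prose answers.
def is_tool_call(completion: str) -> bool:
    return "<tool_call>" in completion

# Made-up completions from an over-triggering model: it reflexively
# calls get_weather on a greeting and on mental arithmetic.
completions = {
    "hi": '<tool_call>{"name": "get_weather", "arguments": {}}</tool_call>',
    "thanks!": "You're welcome!",
    "what's 2+2?": '<tool_call>{"name": "get_weather", "arguments": {}}</tool_call>',
}

# Fraction of tool-irrelevant prompts answered in prose.
relevance_score = sum(
    not is_tool_call(c) for c in completions.values()
) / len(completions)
```

Here only one of three tool-irrelevant prompts gets a prose answer — the signature of a dataset with no class-(3) negatives.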
The broader resources worth bookmarking are the BFCL leaderboard (updated monthly, with per-category splits) and the Microsoft SLM fine-tuning repo (a reproducible Phi-3-mini pipeline with dataset generation, SFT, and DPO scripts). Between those two, you can reproduce a 45% → 79% curve on a single A100 in a weekend.