Teaching an SLM to call tools
A base-instruct SLM, handed a function schema, wants to explain. It wants to ask clarifying questions. It wants to narrate. What it does not want to do is emit a structured <tool_call> JSON blob. Here is the three-step recipe — SFT on format, DPO against bad calls, evaluation on BFCL — that turns a 10% tool-caller into a 79% one.
Why base SLMs fail at tool calling
Hand a well-trained instruction model a system prompt like “You have access to the function get_weather(city, date). Emit a tool call when appropriate.” — and watch it politely ask clarifying questions instead. This is not a bug. It is doing exactly what its training incentivised.
Instruction-tuned base models (Llama-3-Instruct, Qwen2.5-Instruct, Phi-4-mini-instruct) are SFT'd on millions of helpful-natural-language examples and DPO'd on preferences that reward polite, informative prose. The target distribution is assistant-speak: paragraphs, bullet lists, gentle hedges. Tool calls — terse, angle-bracketed, JSON-shaped — are structurally out-of-distribution for that model. The closest pattern in its training prior is probably an inline code block, which is a different token sequence entirely.
The failure mode has a signature. Given a tool-eligible prompt, a base instruct model typically:
- Acknowledges the request in prose (“Sure, I can help with that!”).
- Asks a clarifying question about some missing argument it could have guessed.
- Sometimes invents a fake tool-call-looking string inside a markdown code block — formally correct syntax, wrong wrapper tokens, unparseable by the host.
Chat template tokens as the boundary
The second reason tool calling needs fine-tuning: tool tokens are chat-template-specific. Different model families use wildly different conventions, and the difference is not cosmetic — each family was pretrained with its own special tokens, and only those tokens have been allocated their own embedding.
Llama 3.1's template, for example, favours a pythonic call terminated by its dedicated <|eom_id|> token:

```text
get_weather(
    city="Tokyo",
    date="2026-04-19"
)<|eom_id|>
```

Qwen 2.5 instead wraps a JSON object in <tool_call> tags:

```text
<tool_call>
{"name": "get_weather",
 "arguments": {"city": "Tokyo"}}
</tool_call>
```
Phi-3 uses yet another shape — an XML-ish <|tool|>…<|/tool|> block with named attributes. Mistral's instruct-v3 uses a [TOOL_CALLS] prefix. Every family disagrees, and every family's tokenizer has dedicated IDs for its own delimiters.
The practical consequence: you do not “teach tool calling.” You teach tool calling in this specific model's chat template. If you train on Qwen's <tool_call> schema and then deploy to Llama, the model emits a string that looks right to a human but is not a single special token to the tokenizer — it tokenises as six or seven regular subwords, and the host serving layer will not detect it as a tool call.
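To see why the wrapper matters to the host, here is a minimal sketch of host-side detection, assuming Qwen-style `<tool_call>` delimiters (real serving layers typically match on the template's dedicated token IDs rather than strings; `extract_tool_calls` is a hypothetical helper, not any library's API):

```python
import json
import re

# Hypothetical host-side extractor for Qwen-style <tool_call> delimiters.
# This string-level sketch only illustrates why the wrapper must match
# exactly: the same JSON in a markdown code block is invisible to it.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return parsed tool calls, or [] if the wrapper tokens are absent."""
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # wrapper present but body unparseable
    return calls

# A completion in the trained-for template parses...
ok = extract_tool_calls(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Tokyo"}}</tool_call>'
)
# ...but the same JSON inside a markdown code block does not.
missed = extract_tool_calls(
    '```json\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n```'
)
```

The `missed` case is exactly the base-model failure mode from earlier: formally correct syntax, wrong wrapper, zero tool calls detected.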
Dataset design — three classes of examples
A tool-use fine-tuning dataset needs three classes, and the third is the one every team forgets until their deployed model starts calling tools at random.
- Good calls — the full pipeline. User prompt, tool call, tool response, final natural-language answer. The model learns to emit a structured call, wait for a response, and integrate the result. This is the SFT-positive class.
- Bad calls — valid-schema calls with wrong arguments. Same shape as a good call, but the extracted city is misspelled, the date is left in natural language, or a required argument is dropped. This is the class DPO will use as rejected.
- “No tool needed” negatives — prompts where the right behaviour is to answer in prose. “What's 2+2?” A greeting. A philosophical question. The model learns when not to reach for a tool.
Skip class (3) and you ship a model that over-triggers — it sees a function schema in the system prompt and reflexively emits a call for every user turn, including “hi” and “thanks.” This failure is visible within the first ten minutes of any real deployment, and the fix is retroactive and expensive. Add the negatives up-front.
- good calls · SFT-positive — full pipeline: call, tool response, final answer
- bad calls · DPO-rejected partners — valid schema, wrong arguments, paired with a good-call chosen
- no-tool-needed · prose-only responses, tool schema still in context
Exact ratios vary by dataset (xLAM uses the APIGen pipeline at arXiv 2406.18518; Microsoft's slm-finetuning-for-function-calling draws from glaive-function-calling-v2 with its own scenario taxonomy). The ratios are less important than the presence of all three — a dataset with zero negatives trains a model that cannot say “no tool is needed here.”
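Concretely, one row from each class might look like the following sketch. The chat-format field names and the `get_weather` schema are illustrative, not drawn from any specific dataset:

```python
# Hypothetical rows for the three dataset classes, in generic chat format.
SYSTEM = "Available tools: get_weather(city: str, date: str)"

# (1) Good call: the full pipeline, used as-is for SFT.
good_call = {
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What's the weather in Tokyo tomorrow?"},
        {"role": "assistant", "content":
            '<tool_call>{"name": "get_weather", '
            '"arguments": {"city": "Tokyo", "date": "2026-04-19"}}</tool_call>'},
        {"role": "tool", "content": '{"temp_c": 18, "conditions": "clear"}'},
        {"role": "assistant", "content":
            "Tomorrow in Tokyo: clear skies, around 18 °C."},
    ]
}

# (2) Bad call: same prompt, valid schema, sloppy arguments —
#     paired with the good call's assistant turn as a DPO pair.
dpo_pair = {
    "prompt": [good_call["messages"][0], good_call["messages"][1]],
    "chosen": good_call["messages"][2]["content"],
    "rejected":
        '<tool_call>{"name": "get_weather", '
        '"arguments": {"city": "Tokio", "date": "tomorrow"}}</tool_call>',
}

# (3) No-tool-needed negative: schema still in context, prose is correct.
negative = {
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What's 2+2?"},
        {"role": "assistant", "content": "2 + 2 = 4. No tool needed for that."},
    ]
}
```

Note that class (2) reuses class (1)'s prompt and chosen completion — the pairing on a shared prompt is what makes it usable for DPO later.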
Format tuning first, then argument tuning
The headline number from every tool-use paper is the gap between SFT-only and SFT + DPO. The three numbers in the hook viz (~10% / ~45% / ~79%) are illustrative of the typical BFCL-style progression — a base instruct model that never learned the tool-call chat template scores near zero; SFT on format gets the model to emit parseable schema most of the time; a DPO polish on matched (good, bad) pairs closes most of the remaining gap. Exact scores vary by model family, task mix, and BFCL version — the shape of the curve is the generalisable lesson.
Read those two jumps as two different learning problems. The first (10% → 45%) is format acquisition: the model learns to emit the chat-template tokens at all, in roughly the right places, with roughly the right JSON shape around them. That is the job of SFT. Once the tokens are in-distribution, a greedy decode from the model will produce a parseable tool call most of the time.
The second jump (45% → 79%) is argument precision. The model already knows to emit <tool_call>. What it does not reliably do is extract “Tokyo” as exactly “Tokyo” (not “Tokio”), or convert “tomorrow” to an ISO-8601 date. SFT with cross-entropy loss is surprisingly weak at this — the model gets partial credit for predicting a nearby token, and “Tokio” is token-by-token extremely close to “Tokyo” in the model's embedding space. The per-token loss signal simply does not penalise the substitution hard enough.
DPO fixes this by re-framing the learning problem as a pairwise preference. The model is shown two completions on the same prompt — one with the right argument, one with the wrong one — and asked to prefer the former. The difference is no longer per-token; it is per-completion. The model learns to prefer “Tokyo” over “Tokio” because the entire call with “Tokyo” was marked preferred, not because each token was individually corrected.
- SFT — cross-entropy on tokens. Pushes the next-token distribution toward the labelled completion. Great at structural patterns: tags in the right place, JSON shapes that close, argument keys spelled correctly. Weak at value precision because the substitution loss is locally small.
- DPO — pairwise preference log-likelihood. Pushes the model to prefer whole completions. Great at value precision — the right city, the ISO-8601 date, the dollar amount without the unit glued on — because the preference is at completion granularity. Revisit the dpo lesson for the derivation.
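The completion-granularity point can be made concrete with a toy computation of the DPO loss on one pair, given summed sequence log-probs under the policy and the frozen reference (the β value and the log-probs below are made up for illustration):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (margin_w - margin_l)),
    where each margin is the policy-vs-reference log-ratio of the WHOLE
    completion. Per-token closeness of "Tokio" to "Tokyo" never enters:
    only the summed sequence log-probs do."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up sequence log-probs: the policy initially prefers the sloppy call.
loss_before = dpo_loss(logp_w=-42.0, logp_l=-41.0,
                       ref_logp_w=-42.0, ref_logp_l=-42.0)

# After training lifts the whole chosen completion and pushes the whole
# rejected one down, the loss shrinks.
loss_after = dpo_loss(logp_w=-40.0, logp_l=-44.0,
                      ref_logp_w=-42.0, ref_logp_l=-42.0)
```

The gradient of this loss acts on the completion-level margin, which is why a one-token substitution like Tokio/Tokyo gets fully penalised rather than partially credited.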
This two-stage recipe generalises beyond tool calling. It is the same structure as the reasoning-model recipe (SFT-warm-up → RL-on-verifiable-rewards) and the safety fine-tuning recipe (SFT → DPO on harm-pair preferences). SFT for form; preference optimisation for substance.
DPO against bad calls — the preference pair
What does a DPO preference pair look like in the tool-use setting? The chosen and rejected completions share the same prompt and the same chat template. They differ only in the content inside the <tool_call> blob.
The shared prompt:

```text
user: "What's the weather in Tokyo tomorrow?"
system: available tools: get_weather(city: str, date: str)
```

The chosen completion — argument-precise:

```text
<tool_call>
{"name": "get_weather", "arguments":
 {"city": "Tokyo", "date": "2026-04-19"}}
</tool_call>
```

The rejected completion — argument-sloppy:

```text
<tool_call>
{"name": "get_weather", "arguments":
 {"city": "Tokio", "date": "tomorrow"}}
</tool_call>
```

Both completions are valid JSON. Both use the correct chat-template tags. Neither will make the SFT loss noticeably unhappy. Only the DPO objective — which compares the log-probability of the chosen completion against the rejected one, under the policy versus the reference — sees the substitution as a meaningful preference signal. Over a few thousand such pairs, the model internalises that Tokyo beats Tokio, ISO dates beat natural-language dates, and explicit currency codes beat "30 dollars" as a string.
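For reference, the DPO objective in its standard form (as derived in the dpo lesson):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```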
With y_w = argument-precise call, y_l = argument-sloppy call. The objective pushes π to assign higher log-probability to the whole chosen sequence than to the whole rejected one, relative to the frozen SFT reference. No reward model, no PPO loop. See the dpo lesson for the full derivation.
Evaluation: BFCL and the specialization tax
The Berkeley Function Calling Leaderboard (BFCL) is the de facto evaluation for tool use. It grades a model on several categories — simple single calls, parallel calls, multi-turn conversations, AST-level correctness, Java / JavaScript / Python dialect variants, and a “relevance” category that checks whether the model correctly declines to call a tool when it shouldn't.
The relevance category is where the “no tool needed” negatives pay off. A model that never saw negatives in training typically scores 95%+ on simple calls and ~10% on relevance — it calls tools on everything, including prompts the benchmark explicitly designed to be tool-irrelevant. The Microsoft recipe specifically targets this split.
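A sketch of how a relevance-style split might be scored (not the official BFCL harness — just the idea): on tool-irrelevant prompts, a completion passes iff it contains no tool-call wrapper. Delimiters assume a Qwen-style template; the completions are made up to mimic a model trained without no-tool negatives:

```python
# Toy relevance scorer: tool-irrelevant prompts should get prose answers.
def is_tool_call(completion: str) -> bool:
    return "<tool_call>" in completion

# Made-up completions from an over-triggering model: it reflexively
# calls get_weather on a greeting and on mental arithmetic.
completions = {
    "hi": '<tool_call>{"name": "get_weather", "arguments": {}}</tool_call>',
    "thanks!": "You're welcome!",
    "what's 2+2?": '<tool_call>{"name": "get_weather", "arguments": {}}</tool_call>',
}

# Fraction of tool-irrelevant prompts answered in prose.
relevance_score = sum(
    not is_tool_call(c) for c in completions.values()
) / len(completions)
```

Here only one of three tool-irrelevant prompts gets a prose answer — the signature of a dataset with no class-(3) negatives.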
The broader resources worth bookmarking are the BFCL leaderboard (updated monthly, with per-category splits) and the Microsoft SLM fine-tuning repo (a reproducible Phi-3-mini pipeline with dataset generation, SFT, and DPO scripts). Between those two, you can reproduce a 45% → 79% curve on a single A100 in a weekend.