Two fine-tunes, one merge — no gradient step
You have a base model — call it Llama-3-8B. Someone on HuggingFace fine-tuned it on math and released Llama-3-8B-math. Someone else fine-tuned it on code and released Llama-3-8B-code. You want a model that is good at both. Option one: spin up a GPU, collect a blended math+code dataset, fine-tune again, pay the bill. Option two: download both checkpoints, do a few numpy-level operations on the weights, save the result. No training. No GPU bill.
The second option works — well enough that about half of the models at the top of the 2024-2025 HuggingFace open LLM leaderboard were merges rather than fresh fine-tunes. This lesson is about the three merge recipes that made that possible: task arithmetic, TIES, and DARE — and about the empirical geometry that makes any of it work at all.
The mergekit library standardised the recipes into YAML configs, and the HF leaderboard filled with models whose spec was literally other people's weights, blended.

Why merging works at all — linear mode connectivity
Here is the empirical fact the whole field rests on. Take two fine-tunes θ_A and θ_B of the same base. Walk along the straight line between them in weight space, θ(t) = (1−t)·θ_A + t·θ_B, and evaluate loss at each point. If you did this for two independently trained models, you'd see a big hill of loss between them — a ridge separating two different solutions. For two fine-tunes of the same base, you see a basin instead: loss stays roughly flat all the way across. The two solutions are linearly connected.
Why this matters for merging: if two fine-tunes live in the same basin, then points between them (or near the base) also live in the basin. You can do weight-space arithmetic without falling off a cliff. The merged model behaves like a model you could, in principle, have trained.
Task vectors — the composable delta
Once you accept that fine-tunes are close to the base, a natural object appears: the task vector. Ilharco et al. defined it as the difference between the fine-tuned and base weights: τ = θ_ft − θ_base.
τ has the same shape as θ_base — it's a vector in weight space that points from the base toward a task-competent region. And it composes: you can add task vectors together, scale them, subtract them (unlearning), and negate them.
“Task arithmetic” in the strict sense is just scaled addition: θ_merged = θ_base + α·τ_A + β·τ_B. Sliders, not recipes. It's the baseline every other merge method is measured against. Play with it in the hook viz — notice how α and β let you tilt the merged point along each task-vector axis.
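A minimal numpy sketch of task vectors and scaled addition. The arrays here are tiny random stand-ins for real checkpoints, and the variable names are this sketch's, not any library's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-ins for full checkpoints: a shared base and two fine-tunes
# that (per linear mode connectivity) sit close to it in weight space.
theta_base = rng.normal(size=8)
theta_math = theta_base + rng.normal(scale=0.1, size=8)  # hypothetical math fine-tune
theta_code = theta_base + rng.normal(scale=0.1, size=8)  # hypothetical code fine-tune

# Task vectors: the delta from base to each fine-tune.
tau_math = theta_math - theta_base
tau_code = theta_code - theta_base

# Task arithmetic: scaled addition. alpha and beta are the "sliders".
alpha, beta = 0.6, 0.6
theta_merged = theta_base + alpha * tau_math + beta * tau_code

# Sanity check: alpha=1, beta=0 recovers the math fine-tune exactly.
assert np.allclose(theta_base + 1.0 * tau_math + 0.0 * tau_code, theta_math)
```

On a real model you would run the same arithmetic tensor-by-tensor over the two state dicts instead of over one flat array.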
Linear add and its failure modes
The simplest merge is θ_merged = θ_base + τ_A + τ_B — just sum the task vectors and add them to the base. It takes one line of code. It often does worse than either fine-tune alone on their respective tasks. Why?
Because task vectors interfere. At some parameter i, τ_A wants to push it positive and τ_B wants to push it negative. The sum cancels both contributions. You end up with a parameter near the base value — which means neither task's signal at that parameter survives. Do this at enough parameters and both tasks degrade.
In the hook viz, linear add sends the merged point to θ_base + τ_A + τ_B. Task A's bar drops from 85 alone to ~60 merged; Task B drops from 82 to 55. The merged model is worse at both tasks than either fine-tune was at its own task. This is the problem every other merge method tries to solve.
Compute τ_A = θ_A − θ_base and τ_B = θ_B − θ_base. Set θ_merged = θ_base + τ_A + τ_B. Save and ship.
Cost: zero gradient steps. Runtime: seconds on a CPU.
At parameters where τ_A and τ_B have opposite signs, their sum is smaller than either alone — sometimes near-zero. Each task loses its specialisation signal at those parameters.
The interference grows with the number of merged tasks. Two fine-tunes is manageable; five is usually catastrophic.
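The cancellation is easy to see on toy numbers. A sketch, with values invented to force sign conflicts at three of four parameters:

```python
import numpy as np

# Toy deltas that agree in sign at the first parameter and conflict at the rest.
tau_a = np.array([1.0, -0.7,  0.4, -0.9])
tau_b = np.array([1.0,  0.7, -0.4,  0.9])

# Linear add: one line, no gradients.
merged = tau_a + tau_b

# Where signs agree, signal accumulates; where they conflict, it cancels
# and the merged parameter stays at its base value.
print(merged)  # [2. 0. 0. 0.]
```

Three of four parameters end up carrying neither task's signal, which is exactly the degradation the benchmark bars show.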
TIES — trim, elect, merge
Yadav et al.'s TIES (TrIm, Elect Sign, Merge) is the obvious fix once you diagnose the interference. Three filters, applied in order, before summing.
- Trim. For each task vector, keep only the top-k% of components by magnitude; zero the rest. Typical k is 20. The small-magnitude components were mostly noise anyway — fine-tune deltas are heavy-tailed, and the few large-magnitude components in the tail carry most of the signal.
- Elect sign. At each parameter, sum the signed magnitudes across trimmed task vectors and take the sign of that sum — the side with more total mass wins, not the side with more votes. Zero out any contribution whose sign disagrees with the elected direction. (In the 2-task case this collapses to “sign of the larger-magnitude contributor”; with three or more tasks, mass and majority can disagree.)
- Merge. Average the surviving contributions (the ones that passed both filters). Scale by λ and add to the base.
Formally, θ_merged = θ_base + λ·τ_TIES, where τ_TIES averages, at each parameter, the surviving contributions — a component survives if (a) it was kept by the trim step and (b) its sign agrees with the elected sign at that parameter. λ is a global scale, typically 0.7–1.0.
The election step is the conceptual move. Linear add cancels τ_A and τ_B when they disagree. TIES picks a winner. The result is a shorter merged task vector — you kept fewer components — but its direction is less corrupted by interference. In the hook viz, TIES recovers Task A to ~78 and Task B to ~76, against linear add's 60 and 55.
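The three steps fit in a short numpy function. This is a sketch of the algorithm as described above, not mergekit's implementation; `k`, `lam`, and the function name are this sketch's choices:

```python
import numpy as np

def ties_merge(taus, k=0.2, lam=1.0):
    """Trim, elect sign, merge. `taus` has shape (n_tasks, n_params)."""
    taus = np.asarray(taus, dtype=float)
    n_tasks, n_params = taus.shape

    # 1. Trim: per task vector, keep only the top-k fraction by magnitude.
    trimmed = np.zeros_like(taus)
    n_keep = max(1, int(k * n_params))
    for i in range(n_tasks):
        idx = np.argsort(np.abs(taus[i]))[-n_keep:]
        trimmed[i, idx] = taus[i, idx]

    # 2. Elect sign: the side with more total signed mass wins each parameter.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Merge: average the contributions that survived trimming AND agree
    #    with the elected sign; scale the result by lam.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid divide-by-zero
    return lam * np.where(agree, trimmed, 0.0).sum(axis=0) / counts
```

The merged model is then θ_base + ties_merge([τ_A, τ_B, ...]), applied tensor-by-tensor on a real checkpoint.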
DARE — drop and rescale
Yu et al.'s DARE (Drop And REscale) looks reckless at first. Take a task vector. Randomly set 90% of its components to zero. Multiply the survivors by 10× to compensate. Use the result as if it were the original task vector — add it to base, merge with other DARE'd task vectors, whatever.
In expectation, E[τ_DARE] = τ — the 1/(1−p) rescaling is chosen precisely to make the dropped vector equal the original in expectation, component by component.
Why this works at all, given that you've thrown away 90% of the delta: fine-tune task vectors are redundant. The same specialisation signal is spread across many parameters, so any single parameter can be dropped without losing the capability. This is dropout's intuition applied to an already-trained delta rather than to activations during training.
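The drop-and-rescale step itself is a few lines. A sketch under the description above (the function name and signature are this sketch's, not Yu et al.'s code):

```python
import numpy as np

def dare(tau, p=0.9, rng=None):
    """Drop each component with probability p; rescale survivors by 1/(1-p)."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(tau.shape) >= p          # each survives with prob 1-p
    return np.where(keep, tau, 0.0) / (1.0 - p)

# Unbiasedness: a component survives with probability (1-p) and is then worth
# tau_i / (1-p), so E[dare(tau)] = tau component by component.
```

With p = 0.9, every survivor is multiplied by 10, matching the "drop 90%, rescale 10×" recipe.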
DARE's other use case is combining with other merge methods. dare_ties in mergekit means: DARE each task vector first, then do TIES. The dropped sparse vectors have fewer sign conflicts by construction, so TIES has less interference to resolve. The two methods compose cleanly; Yu et al. report the combination outperforms either alone.
When to merge vs when to fine-tune from scratch
Here is the rule of thumb worth memorising. Merging composes existing specialisations cheaply. It produces a model that is good at the union of what the input models were good at. It cannot produce a capability that none of the input models had.
Merge when:
- You want capability A and capability B, and fine-tunes for both already exist (or are cheap to produce separately).
- You're shipping on a CPU and can't afford a training run.
- You're exploring which specialisations compose well. Merges are cheap enough to try hundreds of combinations in an afternoon.
Fine-tune when:
- You want capability A and something new that doesn't exist as a fine-tune yet. Merging can't invent capability.
- Your target domain has shifted (new vocabulary, new style) — a fresh fine-tune on in-domain data will outperform any merge of stale checkpoints.
- The base model itself is the wrong starting point (different architecture, different tokenizer) — no amount of merging can cross that gap.
Three methods, one mental model
All three methods are operations on task vectors in weight space. Linear add is the naive baseline. TIES filters out the components most likely to interfere before summing. DARE sparsifies each task vector independently so their conflict rate drops by construction.
In the hook viz: linear add sent the merged arrow to θ_base + τ_A + τ_B and cratered both benchmark scores. Task arithmetic let you tune the direction with sliders. TIES kept a cleaner subset and recovered to 78/76. DARE threw away 90% of the components with minimal loss — a redundant delta, made sparse, rescaled in expectation. You could compose them: DARE each vector, then TIES the sparsified results, which is mergekit's dare_ties method.
Three tiers. Three ways to test the same ideas.
Recall checks the facts. Apply runs the merge on new numbers. Reason asks about scenarios the lesson didn't cover — you'll have to transfer the mechanism.