Two fine-tunes, one merge — no gradient step
You have a base model — call it Llama-3-8B. Someone on HuggingFace fine-tuned it on math and released Llama-3-8B-math. Someone else fine-tuned it on code and released Llama-3-8B-code. You want a model that is good at both. Option one: spin up a GPU, collect a blended math+code dataset, fine-tune again, pay the bill. Option two: download both checkpoints, do a few numpy-level operations on the weights, save the result. No training. No GPU bill.
The second option works — well enough that about half of the models at the top of the 2024-2025 HuggingFace open LLM leaderboard were merges rather than fresh fine-tunes. This lesson is about the three merge recipes that made that possible: task arithmetic, TIES, and DARE — and about the empirical geometry that makes any of it work at all.
The mergekit library standardised the recipes into YAML configs, and the HF leaderboard filled with models whose spec was literally other people's weights, blended.

Why merging works at all — linear mode connectivity
Here is the empirical fact the whole field rests on. Take two fine-tunes θ_A and θ_B of the same base. Walk along the straight line between them in weight space, θ(t) = (1−t)·θ_A + t·θ_B, and evaluate loss at each point. If you did this for two independently trained models, you'd see a big hill of loss between them — a ridge separating two different solutions. For two fine-tunes of the same base, you see a basin instead: loss stays roughly flat all the way across. The two solutions are linearly connected.
Why this matters for merging: if two fine-tunes live in the same basin, then points between them (or near the base) also live in the basin. You can do weight-space arithmetic without falling off a cliff. The merged model behaves like a model you could, in principle, have trained.
Task vectors — the composable delta
Once you accept that fine-tunes are close to the base, a natural object appears: the task vector. Ilharco et al. defined it as the difference between the fine-tuned and base weights: τ = θ_ft − θ_base.
τ has the same shape as θ_base — it's a vector in weight space that points from the base toward a task-competent region. And it composes: you can add task vectors together, scale them, subtract them (unlearning), and negate them.
“Task arithmetic” in the strict sense is just scaled addition: θ_merged = θ_base + α·τ_A + β·τ_B. Sliders, not recipes. It's the baseline every other merge method is measured against. Play with it in the hook viz — notice how α and β let you tilt the merged point along each task-vector axis.
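A minimal numpy sketch of task vectors and scaled addition. The arrays here are tiny random stand-ins for real checkpoints, and the variable names are this sketch's, not any library's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-ins for full checkpoints: a shared base and two fine-tunes
# that (per linear mode connectivity) sit close to it in weight space.
theta_base = rng.normal(size=8)
theta_math = theta_base + rng.normal(scale=0.1, size=8)  # hypothetical math fine-tune
theta_code = theta_base + rng.normal(scale=0.1, size=8)  # hypothetical code fine-tune

# Task vectors: the delta from base to each fine-tune.
tau_math = theta_math - theta_base
tau_code = theta_code - theta_base

# Task arithmetic: scaled addition. alpha and beta are the "sliders".
alpha, beta = 0.6, 0.6
theta_merged = theta_base + alpha * tau_math + beta * tau_code

# Sanity check: alpha=1, beta=0 recovers the math fine-tune exactly.
assert np.allclose(theta_base + 1.0 * tau_math + 0.0 * tau_code, theta_math)
```

On a real model you would run the same arithmetic tensor-by-tensor over the two state dicts instead of over one flat array.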
Linear add and its failure modes
The simplest merge is θ_merged = θ_base + τ_A + τ_B — just sum the task vectors and add them to the base. It takes one line of code. It often does worse than either fine-tune alone on their respective tasks. Why?
Because task vectors interfere. At some parameter i, τ_A wants to push it positive and τ_B wants to push it negative. The sum cancels both contributions. You end up with a parameter near the base value — which means neither task's signal at that parameter survives. Do this at enough parameters and both tasks degrade.
In the hook viz, linear add sends the merged point to θ_base + τ_A + τ_B. Task A's bar drops from 85 alone to ~60 merged; Task B drops from 82 to 55. The merged model is worse at both tasks than either fine-tune was at its own task. This is the problem every other merge method tries to solve.
Compute τ_A = θ_A − θ_base and τ_B = θ_B − θ_base. Set θ_merged = θ_base + τ_A + τ_B. Save and ship.
Cost: zero gradient steps. Runtime: seconds on a CPU.
At parameters where τ_A and τ_B have opposite signs, their sum is smaller than either alone — sometimes near-zero. Each task loses its specialisation signal at those parameters.
The interference grows with the number of merged tasks. Two fine-tunes is manageable; five is usually catastrophic.
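The cancellation is easy to see on toy numbers. A sketch, with values invented to force sign conflicts at three of four parameters:

```python
import numpy as np

# Toy deltas that agree in sign at the first parameter and conflict at the rest.
tau_a = np.array([1.0, -0.7,  0.4, -0.9])
tau_b = np.array([1.0,  0.7, -0.4,  0.9])

# Linear add: one line, no gradients.
merged = tau_a + tau_b

# Where signs agree, signal accumulates; where they conflict, it cancels
# and the merged parameter stays at its base value.
print(merged)  # [2. 0. 0. 0.]
```

Three of four parameters end up carrying neither task's signal, which is exactly the degradation the benchmark bars show.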
TIES — trim, elect, merge
Yadav et al.'s TIES (TrIm, Elect Sign, Merge) is the obvious fix once you diagnose the interference. Three filters, applied in order, before summing.
- Trim. For each task vector, keep only the top-k% of components by magnitude; zero the rest. Typical k is 20. The small-magnitude components were mostly noise anyway — fine-tune deltas are heavy-tailed, and the few large-magnitude components in the tail carry most of the signal.
- Elect sign. At each parameter, sum the signed magnitudes across trimmed task vectors and take the sign of that sum — the side with more total mass wins, not the side with more votes. Zero out any contribution whose sign disagrees with the elected direction. (In the 2-task case this collapses to “sign of the larger-magnitude contributor”; with three or more tasks, mass and majority can disagree.)
- Merge. Average the surviving contributions (the ones that passed both filters). Scale by λ and add to the base.
Formally, θ_merged = θ_base + λ·τ_TIES, where τ_TIES averages, at each parameter, the surviving contributions — a component survives if (a) it was kept by the trim step and (b) its sign agrees with the elected sign at that parameter. λ is a global scale, typically 0.7–1.0.
The election step is the conceptual move. Linear add cancels τ_A and τ_B when they disagree. TIES picks a winner. The result is a shorter merged task vector — you kept fewer components — but its direction is less corrupted by interference. In the hook viz, TIES recovers Task A to ~78 and Task B to ~76, against linear add's 60 and 55.
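The three steps fit in a short numpy function. This is a sketch of the algorithm as described above, not mergekit's implementation; `k`, `lam`, and the function name are this sketch's choices:

```python
import numpy as np

def ties_merge(taus, k=0.2, lam=1.0):
    """Trim, elect sign, merge. `taus` has shape (n_tasks, n_params)."""
    taus = np.asarray(taus, dtype=float)
    n_tasks, n_params = taus.shape

    # 1. Trim: per task vector, keep only the top-k fraction by magnitude.
    trimmed = np.zeros_like(taus)
    n_keep = max(1, int(k * n_params))
    for i in range(n_tasks):
        idx = np.argsort(np.abs(taus[i]))[-n_keep:]
        trimmed[i, idx] = taus[i, idx]

    # 2. Elect sign: the side with more total signed mass wins each parameter.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Merge: average the contributions that survived trimming AND agree
    #    with the elected sign; scale the result by lam.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)  # avoid divide-by-zero
    return lam * np.where(agree, trimmed, 0.0).sum(axis=0) / counts
```

The merged model is then θ_base + ties_merge([τ_A, τ_B, ...]), applied tensor-by-tensor on a real checkpoint.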
DARE — drop and rescale
Yu et al.'s DARE (Drop And REscale) looks reckless at first. Take a task vector. Randomly set 90% of its components to zero. Multiply the survivors by 10× to compensate. Use the result as if it were the original task vector — add it to base, merge with other DARE'd task vectors, whatever.
In expectation, E[τ_DARE] = τ — the 1/(1−p) rescaling is chosen precisely to make the dropped vector equal the original in expectation, component by component.
Why this works at all, given that you've thrown away 90% of the delta: fine-tune task vectors are redundant. The same specialisation signal is spread across many parameters, so any single parameter can be dropped without losing the capability. This is dropout's intuition applied to an already-trained delta rather than to activations during training.
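The drop-and-rescale step itself is a few lines. A sketch under the description above (the function name and signature are this sketch's, not Yu et al.'s code):

```python
import numpy as np

def dare(tau, p=0.9, rng=None):
    """Drop each component with probability p; rescale survivors by 1/(1-p)."""
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(tau.shape) >= p          # each survives with prob 1-p
    return np.where(keep, tau, 0.0) / (1.0 - p)

# Unbiasedness: a component survives with probability (1-p) and is then worth
# tau_i / (1-p), so E[dare(tau)] = tau component by component.
```

With p = 0.9, every survivor is multiplied by 10, matching the "drop 90%, rescale 10×" recipe.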
DARE's other use case is combining with other merge methods. dare_ties in mergekit means: DARE each task vector first, then do TIES. The dropped sparse vectors have fewer sign conflicts by construction, so TIES has less interference to resolve. The two methods compose cleanly; Yu et al. report the combination outperforms either alone.
When to merge vs when to fine-tune from scratch
Here is the rule of thumb worth memorising. Merging composes existing specialisations cheaply. It produces a model that is good at the union of what the input models were good at. It cannot produce a capability that none of the input models had.
Merge when:
- You want capability A and capability B, and fine-tunes for both already exist (or are cheap to produce separately).
- You're shipping on a CPU and can't afford a training run.
- You're exploring which specialisations compose well. Merges are cheap enough to try hundreds of combinations in an afternoon.
Fine-tune when:
- You want capability A and something new that doesn't exist as a fine-tune yet. Merging can't invent capability.
- Your target domain has shifted (new vocabulary, new style) — a fresh fine-tune on in-domain data will outperform any merge of stale checkpoints.
- The base model itself is the wrong starting point (different architecture, different tokenizer) — no amount of merging can cross that gap.
Three methods, one mental model
All three methods are operations on task vectors in weight space. Linear add is the naive baseline. TIES filters out the components most likely to interfere before summing. DARE sparsifies each task vector independently so their conflict rate drops by construction.
In the hook viz: linear add sent the merged arrow to θ_base + τ_A + τ_B and cratered both benchmark scores. Task arithmetic let you tune the direction with sliders. TIES kept a cleaner subset and recovered to 78/76. DARE threw away 90% of the components with minimal loss — a redundant delta, made sparse, rescaled in expectation. You could compose them: DARE each vector, then TIES the sparsified results, which is mergekit's dare_ties method.
Three tiers. Three ways to test the same ideas.
Recall checks the facts. Apply runs the merge on new numbers. Reason asks about scenarios the lesson didn't cover — you'll have to transfer the mechanism.