Microscale
Act VI · Making It Yours
lesson frameworks · 14 min · 60 xp

Fine-tuning frameworks

Unsloth · Axolotl · LLaMA-Factory · TRL · torchtune · MLX-LM — and how to pick

Which wrench do you actually pick up?

You've learned about LoRA, QLoRA, DPO, GRPO, and the specialization recipes. None of that matters until you actually run a training job. In April 2026 there are six serious fine-tuning frameworks — each one solving a different slice of the problem. Pick the wrong one and you'll waste a week. Pick the right one and you'll ship a specialist by the weekend.

This lesson does two things. First, it walks through the six frameworks in depth — what they're for, how they work under the hood, what their benchmarks actually mean. Second, it gives you a decision tree so that given any reasonable scenario, you can pick in 30 seconds.

MMXXVI
historical note
2021–2023 · Before the framework Cambrian
Fine-tuning in the early 2020s meant hand-writing PyTorch training loops with HuggingFace Transformers, bolting on DeepSpeed for memory savings, debugging NCCL errors, and waiting a week for a reasonable result on a single 8B model. The first wave of frameworks (DeepSpeed, Accelerate, Trainer) reduced boilerplate but left the hard parts — quantization, LoRA mechanics, preference optimization — for you to figure out. Then QLoRA landed (Dettmers 2023) and suddenly 7B fine-tuning became feasible on a consumer GPU. The framework race that followed is what we're about to walk through.

How the six frameworks fit together

A critical distinction: some of these tools stack rather than compete. Unsloth rewrites the kernel layer but hands off the trainer to TRL. LLaMA-Factory is a UI that sits on top of both. Axolotl wraps TRL and adds YAML. torchtune is the PyTorch-native alternative that skips the HF stack entirely. MLX-LM-LoRA is the Apple Silicon path that sidesteps the whole CUDA world.

the six
Unsloth
by unslothai
Pick when: One GPU, need max throughput per VRAM dollar
strengths
  • 3–5× faster training, 30–90% less VRAM
  • Custom Triton kernels for RoPE, MLP, cross-entropy, attention
  • Single-GPU champion — squeezes maximum out of one card
  • Integrates cleanly with HF TRL for the actual training loop
  • 2026 MoE kernels: 12× faster, 35% less VRAM for MoE models
weaknesses
  • Multi-GPU is paid (Unsloth Pro)
  • Most gains are in LoRA/QLoRA — less dramatic for full FT
  • Kernel is Hopper/Ada-focused; older hardware gets smaller gains
hardware: NVIDIA
multi-GPU: no (free tier)
8B QLoRA bench: 3.2 h
algorithms: 8
install: pip install unsloth
SFT · DPO · ORPO · KTO · GRPO · QLoRA · LoRA · continued pretraining

Unsloth — the kernel king

Unsloth's value proposition is blunt: same math, different kernels, dramatically faster. Under the hood, it rewrites the training hot path in Triton (OpenAI's GPU kernel DSL), fusing and specializing operations that stock PyTorch would dispatch as separate kernels.
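A toy illustration of what fusion buys, in plain Python rather than Unsloth's actual Triton: a naive cross-entropy materializes the full log-softmax vector, while a fused one only ever needs two scalars per position.

```python
import math

def naive_cross_entropy(logits, target):
    # Two passes plus a full intermediate: materialize log-softmax, then index.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    log_softmax = [x - lse for x in logits]   # O(vocab) intermediate
    return -log_softmax[target]

def fused_cross_entropy(logits, target):
    # Same math, but the log-softmax vector is never materialized:
    # only the log-sum-exp scalar and the target logit are needed.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]
```

At a vocabulary of 100k+ tokens, never materializing that per-position intermediate (and its gradient) is where a large share of the memory saving on the loss comes from.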

◆ paper
Unsloth: Fast and Memory-Efficient Fine-Tuning of LLMs
Daniel Han, Michael Han, Unsloth team · 2024
Unsloth is published as a series of engineering blog posts and kernel releases rather than a single academic paper. The benchmarks are reproducible on HuggingFace Open-LLM-Leaderboard hardware and the code is open source. See github.com/unslothai/unsloth.
unsloth — minimal example
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

# 1. Load base with QLoRA in one call.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen3-4B-Instruct",
    max_seq_length = 4096,
    load_in_4bit = True,              # NF4 quantization
)

# 2. Attach LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",  # ← Unsloth's checkpointing
)

# 3. Hand off to TRL for the actual training loop.
trainer = SFTTrainer(
    model = model, tokenizer = tokenizer,
    train_dataset = my_dataset,
    args = SFTConfig(
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3,
    ),
)
trainer.train()

Axolotl — the production YAML

Axolotl's proposition is different: instead of optimising kernels, optimise reproducibility. A training run is a YAML file. The YAML is checked into git. Every parameter is explicit. Two colleagues running the same YAML on different machines get the same result.

axolotl — config.yaml
base_model: Qwen/Qwen3-4B-Instruct
model_type: AutoModelForCausalLM

load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: my/function-calling-data
    type: chat_template

sequence_len: 4096
learning_rate: 2e-4
num_epochs: 3
micro_batch_size: 4
gradient_accumulation_steps: 4
bf16: auto
flash_attention: true

# Multi-GPU
deepspeed: configs/ds_zero3.json
# or use fsdp: full_shard auto_wrap

Run with accelerate launch -m axolotl.cli.train config.yaml. Multi-GPU distribution is handled by the YAML reference to DeepSpeed or FSDP — not by your Python code.

LLaMA-Factory — the beginner-friendly UI

LLaMA-Factory is the fastest way to do your first fine-tune. You install it, launch llamafactory-cli webui, open your browser, select a model, upload a dataset, pick a training mode, and click start. The UI (LlamaBoard) is backed by the same Python engine you'd call directly — you can dump the config as YAML and re-run it later headless.

Crucially, LLaMA-Factory detects Unsloth and uses it as a backend automatically when available. You get Unsloth's speed (3.4 hours on the same benchmark, within ~6% of raw Unsloth) with zero kernel configuration.

TRL — the reference implementation

HuggingFace TRL is where every new preference optimization method debuts. When a paper drops a new technique — DPO, GRPO, ORPO, KTO, CPO, RLOO, XPO, Dr. GRPO, DAPO — the first production-quality implementation is almost always in TRL within weeks. Reading TRL's source is how you understand what these methods actually do beyond the paper's pseudocode.

◆ paper
TRL: Transformer Reinforcement Learning
von Werra et al. · 2020
Originally released to reproduce OpenAI's early RLHF results on preference fine-tuning. Now the canonical library for preference-based training. Maintained by HuggingFace.
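Reading the source pays off quickly because the core math is small. Below is a hand-rolled scalar sketch of the DPO objective; TRL's real DPOTrainer computes the same quantity batched over tensors, with log-probs coming from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are sequence log-probs under the policy (pi_*) and the
    # frozen reference (ref_*). The implicit reward of each response
    # is beta * (log pi - log ref).
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): near zero when the policy already prefers
    # the chosen response, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probs the loss sits at log 2 ≈ 0.693, and it falls as the policy's margin for the chosen response grows relative to the reference.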

torchtune — PyTorch-native, QAT, and the mobile path

torchtune is PyTorch's official fine-tuning library. It doesn't depend on HuggingFace Transformers; every model is implemented natively. This is less convenient for most users but unlocks two things nothing else in the ecosystem does well: quantization-aware training and ExecuTorch export.

◆ paper
Quantization-Aware Training for Large Language Models with PyTorch
Kim et al. · 2024 · PyTorch Engineering Blog
Shows that QAT (fine-tuning with simulated quantization) recovers ~96% of the HellaSwag accuracy loss and ~68% of the perplexity loss that post-training quantization causes on Llama 3. The produced model is the same size as with PTQ but materially better — at the cost of running the quantization path during fine-tuning.
post-training quantization (PTQ)
  • Train in FP16; quantize after training.
  • Simple and fast, with no retraining cost.
  • But the weights weren't trained to tolerate the rounding, so quality drops measurably.
  • Llama-3 8B @ 4-bit: +0.35 perplexity

quantization-aware training (QAT)
  • Fine-tune with simulated quantization in the forward pass.
  • Straight-through estimator for the backward pass.
  • Weights learn to be rounding-robust; quality is preserved.
  • Llama-3 8B @ 4-bit: +0.11 perplexity
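The whole QAT trick fits in a few lines. Below is a toy sketch with a symmetric 4-bit grid and a made-up least-squares objective; real QAT (torchtune with torchao) applies the same idea per layer inside the model's forward pass.

```python
def fake_quantize(w, bits=4):
    # Symmetric fake-quant: snap each weight to a signed (2^bits)-level
    # grid scaled by the current max magnitude, then return floats.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in w) / qmax
    return [round(x / scale) * scale for x in w]

def qat_step(w, target, lr=0.2):
    # Forward uses the quantized weights; backward pretends the rounding
    # has derivative 1 (straight-through estimator), so the float weights
    # receive the gradient computed at the quantized point.
    w_q = fake_quantize(w)
    grad = [2 * (q - t) / len(w) for q, t in zip(w_q, target)]  # d(MSE)/dw_q
    return [x - lr * g for x, g in zip(w, grad)]

w = [0.8, -0.3, 0.05, 1.0]        # toy "weights"
target = [0.5, -0.5, 0.25, -0.25]  # toy objective: match this vector
for _ in range(300):
    w = qat_step(w, target)
```

The float "shadow" weights keep absorbing gradients, but because every forward pass sees the rounded values, they settle at points that survive quantization.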

MLX-LM-LoRA — the Mac path

If you're on Apple Silicon — and for SLM fine-tuning you often should be, because a maxed-out M-series Ultra ships with hundreds of gigabytes of unified memory, more than any consumer GPU's VRAM — you want MLX-LM-LoRA.

MLX is Apple's ground-up array framework built specifically for unified memory. There is no PCIe tax, no CPU↔GPU copy step, no CUDA dependency. MLX-LM-LoRA sits on top and provides a training library that supports 12 training algorithms: SFT, DPO, CPO, ORPO, GRPO, GSPO, Dr. GRPO, DAPO, Online DPO, XPO, RLHF, and PPO. That's more algorithms than TRL.

MMXXVI
historical note
2023–2026 · MLX's quiet dominance on Mac
Apple released MLX in late 2023 alongside the M3 chip launch. The community around it grew rapidly. By 2025, mlx-lm-lora was being used in production by Apple's own research team, IBM, Bosch, Red Hat, Daimler Truck, and Mercedes-Benz Group for on-device research. The mlx-tune package then added an Unsloth-compatible API so PyTorch-trained users could move across without rewriting their training loops. Mac fine-tuning went from “basically impossible” to “the most ergonomic option” in roughly eighteen months.
mlx-lm-lora — on a Mac
# Install
pip install mlx-lm-lora

# Convert a HuggingFace model to MLX format
python -m mlx_lm.convert \
  --hf-path Qwen/Qwen3-4B-Instruct \
  --mlx-path ./mlx-qwen3-4b \
  -q --q-bits 4

# SFT + LoRA
python -m mlx_lm_lora.sft \
  --model ./mlx-qwen3-4b \
  --data ./my-data.jsonl \
  --lora-rank 16 \
  --iters 1000

# DPO with preference pairs
python -m mlx_lm_lora.dpo \
  --model ./mlx-qwen3-4b \
  --data ./preference-pairs.jsonl \
  --beta 0.1

# GRPO for verifiable rewards (math, code)
python -m mlx_lm_lora.grpo \
  --model ./mlx-qwen3-4b \
  --data ./math-problems.jsonl \
  --reward-fn ./my_reward.py
the 30-second decision tree
What's your situation?

What about the new frameworks?

The space moves. Two 2026 entrants worth tracking:

◆ paper
Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51× Speedup over Unsloth
Anonymous et al. · 2026 · arXiv preprint
arxiv:2601.02609
Claims a further 3.51× speedup over Unsloth via more aggressive kernel fusion and compile-time graph optimisation. The benchmarks are preliminary and hardware-specific; I'd wait for third-party reproduction before adopting in production. The direction is clear — Unsloth's kernel advantage will continue to be narrowed by new entrants.

Unsloth itself keeps releasing new kernels. The 2026 release brought MoE-specific fused kernels, 3× faster training with smart packing, and auto-tuning for batch size vs sequence length. Expect another step-change every six months.
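"Smart packing" rewards a quick mental model: rather than padding every short example out to the context length, several are concatenated into one row. Below is a greedy first-fit sketch of the core idea; real implementations also mask attention across sequence boundaries.

```python
def pack(lengths, max_len):
    # Greedy first-fit decreasing: drop each sequence into the first
    # row that still has room, opening a new row when none does.
    rows = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])
    return rows

def pad_waste(rows, max_len):
    # Fraction of the token grid spent on padding.
    total = len(rows) * max_len
    return (total - sum(map(sum, rows))) / total

lengths = [900, 700, 512, 512, 256, 100]
packed = pack(lengths, 1024)
unpacked = [[n] for n in lengths]   # one sequence per padded row
```

On this toy batch, packing collapses six padded rows into three nearly full ones, cutting padding waste from roughly 52% to 3%: fewer wasted FLOPs per step.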

The practical recipe — which to install today

Given all of this, here are my honest 2026 defaults if you're starting fresh:

  • Mac with 32+ GB unified memory: install mlx-lm-lora. Train and serve on the same machine.
  • One NVIDIA GPU, want to go fast: install unsloth + trl. Use Unsloth's FastLanguageModel loader, hand to TRL trainers.
  • One NVIDIA GPU, first time: install llama-factory. Web UI, Unsloth under the hood.
  • Multi-GPU, production: install axolotl. YAML-driven, DeepSpeed or FSDP backend.
  • Targeting mobile deployment: install torchtune + torchao. QAT fine-tune → ExecuTorch export → ship.
  • Implementing a new algorithm from a paper: read trl source directly.
comprehension check
comprehension · 1 / 4

What does Unsloth actually rewrite versus what does it hand off to TRL?