Act IX · Ship It
lesson mlx · 7 min · 35 xp

MLX-LM on Mac

The Apple Silicon path

MLX — the Mac-native path

Apple's MLX framework exists because NVIDIA's CUDA stack doesn't run on Apple Silicon. But the absence is also an opportunity — Macs have a secret weapon that data center GPUs don't: unified memory. The CPU and GPU share the same physical memory. No CPU↔GPU transfers, no PCIe bottleneck, no “copy the model into VRAM” step.

On an M3 Max with 64 GB of unified memory, a 30B model at 4-bit quantization fits entirely in “GPU memory” — because all memory is GPU memory. A data center A100 with 80 GB of HBM has far more raw bandwidth (~2 TB/s) but only modestly more capacity, and the card alone costs ~$10k. A well-specced MacBook Pro costs ~$4k and can serve a 70B Q4 model at useful tokens/sec.
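A back-of-envelope sizing check makes the fit claims concrete. At 4 bits, each parameter is half a byte; the 15% overhead factor below is an assumption, not a measured number:

```python
def q4_footprint_gb(params_billion, overhead=1.15):
    # 4-bit weights ≈ 0.5 bytes per parameter; the 15% overhead factor
    # is an assumption covering quantization scales, higher-precision
    # embeddings, and a little KV-cache headroom.
    return params_billion * 0.5 * overhead

print(round(q4_footprint_gb(30), 1))  # ~17 GB: easy fit in 64 GB
print(round(q4_footprint_gb(70), 1))  # ~40 GB: still fits
```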

MLX was released in December 2023 by Awni Hannun and the Apple ML Research team as a deliberately NumPy-ish array library — if you can write mx.matmul(a, b) you can write MLX — with a PyTorch-style mlx.nn module layered on top. Decode is dominated by memory bandwidth, not FLOPs, so the interesting comparison isn't TFLOPs but the roofline: an M3 Max at ~400 GB/s and an M2/M3 Ultra at ~800 GB/s sit in the same zone as a consumer RTX 4090 (~1008 GB/s), which is why a 70B Q4 model on an Ultra streams tokens at a human-readable pace instead of lurching. No other machine in this price-and-portability class pulls it off: a Threadripper workstation isn't portable, and a gaming laptop's mobile RTX 4090 is capped at 16 GB of VRAM, too small to load the weights at all, let alone stream them.
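The roofline intuition reduces to simple arithmetic: every decoded token must stream all the weights through the memory system once, so bandwidth divided by weight size is a hard ceiling on tokens/sec. A sketch using the chip figures above (real throughput lands well below the ceiling because of kernel overheads, activations, and the KV cache):

```python
def decode_ceiling_tok_s(params_billion, bits_per_weight, bandwidth_gb_s):
    # Each generated token reads every weight once, so the ceiling is
    # memory bandwidth divided by the size of the quantized weights.
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb

print(decode_ceiling_tok_s(4, 4, 400))   # Qwen3-4B Q4 on an M3 Max: 200.0
print(decode_ceiling_tok_s(70, 4, 800))  # 70B Q4 on an Ultra: ~23
```

The observed ~80 tok/sec for Qwen3-4B Q4 against a 200 tok/sec ceiling is the usual pattern: the roofline bounds the shape of the curve, not the exact number.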

mlx-lm — the library and CLI
# Install
pip install mlx-lm

# Convert a HuggingFace model to MLX 4-bit
python -m mlx_lm.convert \
  --hf-path Qwen/Qwen3-4B-Instruct \
  --quantize --q-bits 4 \
  --mlx-path ./mlx-qwen3-4b

# Chat with it
python -m mlx_lm.generate \
  --model ./mlx-qwen3-4b \
  --prompt "What is GQA?" \
  --max-tokens 200

# Serve an OpenAI-compatible HTTP endpoint
python -m mlx_lm.server \
  --model ./mlx-qwen3-4b \
  --port 8080

# Fine-tune with LoRA on your own data
# (--data expects a directory containing train.jsonl and valid.jsonl)
python -m mlx_lm.lora \
  --model ./mlx-qwen3-4b \
  --train --data ./data \
  --iters 1000
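The LoRA step needs training data on disk first. A minimal sketch that writes a chat-format dataset — the `{"messages": ...}` schema and the `train.jsonl`/`valid.jsonl` directory layout follow mlx-lm's documented data formats, but verify the details against your installed version:

```python
import json
from pathlib import Path

# A tiny chat-format dataset; mlx_lm.lora also accepts {"text": ...}
# and {"prompt": ..., "completion": ...} records (check your version).
examples = [
    {"messages": [
        {"role": "user", "content": "What is GQA?"},
        {"role": "assistant",
         "content": "Grouped-query attention: several query heads share one KV head."},
    ]},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
# mlx_lm.lora reads train.jsonl (and valid.jsonl) from the --data directory.
for name in ("train.jsonl", "valid.jsonl"):
    with open(data_dir / name, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```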

M3 Max bandwidth: 400 GB/s
M2 Ultra unified memory: 192 GB
Qwen3-4B Q4 decode: ~80 tok/sec
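Because mlx_lm.server speaks the OpenAI chat-completions wire format, any OpenAI client can point at it. A dependency-free sketch against the server command above — the URL and port match that example, and the "model" field is a placeholder, since the server already has a model loaded:

```python
import json
from urllib import request

def build_payload(prompt, max_tokens=200):
    # OpenAI chat-completions request body; "model" is a placeholder
    # because the local server serves whatever it was started with.
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, url="http://localhost:8080/v1/chat/completions"):
    # POST to the locally running mlx_lm.server endpoint and pull the
    # assistant text out of the standard choices[0].message shape.
    req = request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```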

Why MLX beats llama.cpp on Apple Silicon

llama.cpp has a Metal backend and runs fine on Apple Silicon. MLX is usually 20–40% faster for the same model at the same quantization. The reasons:

  • Metal kernels hand-tuned by Apple's ML team for the specific matrix units on each chip generation.
  • Zero-copy unified memory: MLX arrays live directly in the shared memory pool, so CPU and GPU kernels operate on the same buffers without staging copies.
  • Lazy evaluation — MLX builds a compute graph before launching kernels, allowing fusion and scheduling optimizations that per-operation interpreted frameworks can't do.

The tradeoff: MLX is Apple-only. On Linux or Windows, llama.cpp and Ollama are still your path.