MLX — the Mac-native path
Apple's MLX framework exists because NVIDIA's CUDA stack doesn't run on Apple Silicon. That absence is also an opportunity: Macs have a secret weapon that data center GPUs don't, unified memory. The CPU and GPU share the same physical memory, so there are no CPU↔GPU transfers, no PCIe bottleneck, and no "copy the model into VRAM" step.
On an M3 Max with 64 GB unified memory, a 30B model with 4-bit quantization fits entirely in "GPU memory" because all memory is GPU memory. A data center A100 with 80 GB of HBM has far more raw bandwidth, but it costs ~$10k. A MacBook Pro costs ~$4k and can serve a 70B Q4 model (roughly 35 GB of weights) at useful tokens/sec.
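Those footprint numbers are easy to sanity-check: quantized weight size is just parameter count times bits per parameter. A back-of-envelope sketch (the function name `quantized_gb` is mine, and it deliberately ignores KV cache and quantization-scale overhead):

```python
def quantized_gb(params_billion: float, bits: int) -> float:
    """Rough weight footprint in GB for a quantized model.

    Ignores the KV cache, activations, and per-group scale/zero-point
    overhead, so real memory usage runs somewhat higher.
    """
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9

print(quantized_gb(30, 4))  # 15.0 GB: comfortable in 64 GB unified memory
print(quantized_gb(70, 4))  # 35.0 GB: still fits, with room for the KV cache
```

The same arithmetic explains why an 8 GB gaming GPU can't even hold a 30B Q4 model, let alone run it.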
MLX was released in December 2023 by Awni Hannun and the Apple ML Research team as a deliberately NumPy-ish array library (if you can write mx.matmul(a, b), you can write MLX) with a PyTorch-style nn module layered on top. The decode story is dominated by memory bandwidth, not FLOPs, so the interesting comparison isn't TFLOPs but the roofline: an M3 Max at ~400 GB/s and an M2/M3 Ultra at ~800 GB/s sit in the same zone as a consumer RTX 4090 (~1008 GB/s), which is why a 70B Q4 model on an Ultra actually streams tokens at a human-readable pace instead of lurching. No other laptop architecture can load the weights at all, let alone stream them: a gaming laptop's mobile RTX 4090 is capped at 16 GB of VRAM, and a Threadripper workstation has the RAM but not the bandwidth or the portability.
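The roofline claim can be made concrete: in decode, every weight byte must be streamed once per token, so bandwidth divided by model size gives a throughput ceiling. A back-of-envelope sketch using the bandwidth figures quoted above (it ignores KV-cache reads and kernel launch overhead, so real throughput lands below these ceilings):

```python
def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper-bound decode rate: every weight is read once per token."""
    return bandwidth_gb_s / weights_gb

WEIGHTS_70B_Q4 = 35.0  # 70B params at ~4 bits/param

for chip, bw in [("M3 Max", 400), ("M2/M3 Ultra", 800), ("RTX 4090", 1008)]:
    ceiling = decode_tokens_per_sec(bw, WEIGHTS_70B_Q4)
    print(f"{chip}: ~{ceiling:.0f} tok/s ceiling")
```

That works out to ceilings of roughly 11, 23, and 29 tokens/sec respectively, which matches the intuition that an Ultra streams a 70B Q4 model at a readable pace while a Max merely limps.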
```bash
# Install
pip install mlx-lm

# Convert a HuggingFace model to MLX 4-bit
python -m mlx_lm.convert \
    --hf-path Qwen/Qwen3-4B-Instruct \
    --quantize --q-bits 4 \
    --mlx-path ./mlx-qwen3-4b

# Chat with it
python -m mlx_lm.generate \
    --model ./mlx-qwen3-4b \
    --prompt "What is GQA?" \
    --max-tokens 200

# Serve an OpenAI-compatible HTTP endpoint
python -m mlx_lm.server \
    --model ./mlx-qwen3-4b \
    --port 8080

# Fine-tune with LoRA on your own data
python -m mlx_lm.lora \
    --model ./mlx-qwen3-4b \
    --train --data ./data.jsonl \
    --iters 1000
```
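Because the server speaks the OpenAI chat-completions wire format, any OpenAI-compatible client can talk to it. A minimal standard-library sketch (the port matches the server command above; `build_payload` and `chat` are names I'm introducing, and this of course assumes the server is running locally):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 200) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the prompt to mlx_lm.server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("What is GQA?")` returns the model's answer; point `base_url` at another machine on your LAN and the Mac becomes a drop-in local inference box.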
Why MLX beats llama.cpp on Apple Silicon
llama.cpp has a Metal backend and runs fine on Apple Silicon. MLX is usually 20–40% faster for the same model at the same quantization. The reasons:
- Metal kernels hand-tuned by Apple's ML team for the specific matrix units on each chip generation.
- Unified memory allocation — MLX arrays are allocated directly in memory visible to both CPU and GPU, so there is no staging copy before a kernel launch.
- Lazy evaluation — MLX builds a compute graph before launching kernels, allowing fusion and scheduling optimizations that per-operation interpreted frameworks can't do.
The tradeoff: MLX is Apple-only. On Linux or Windows, llama.cpp and Ollama are still your path.
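The lazy-evaluation point deserves a concrete picture: MLX records operations into a graph and only materializes results when evaluation is forced (mx.eval, or inspecting a value). A toy pure-Python sketch of that deferral (the `Lazy` class is entirely hypothetical, not MLX's implementation, and it omits fusion itself):

```python
class Lazy:
    """Toy deferred computation: record ops now, run them on demand."""

    def __init__(self, fn, inputs=()):
        self.fn, self.inputs = fn, inputs
        self._value, self._evaluated = None, False

    def eval(self):
        # Walking the whole graph here, before running anything, is
        # where a real framework gets to fuse and schedule kernels.
        if not self._evaluated:
            args = [x.eval() if isinstance(x, Lazy) else x for x in self.inputs]
            self._value = self.fn(*args)
            self._evaluated = True
        return self._value

def add(a, b):
    return Lazy(lambda x, y: x + y, (a, b))

def mul(a, b):
    return Lazy(lambda x, y: x * y, (a, b))

# Nothing executes while the graph is built...
graph = mul(add(2, 3), 4)
# ...until evaluation is forced, mirroring mx.eval():
print(graph.eval())  # 20
```

An eager per-operation framework has to launch a kernel at each `add` and `mul` call; seeing the whole graph first is what makes fusion and smarter scheduling possible.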