Ollama — 60 seconds to your first local LLM
Ollama is a Go application that wraps llama.cpp with a model registry, a REST API, automatic GPU detection, and Modelfile-based configuration. It's the fastest path from “I have nothing installed” to “I'm serving an OpenAI-compatible local LLM.”
Under the hood, ollama serve is a Go REST façade that spawns and talks to llama.cpp's llama-server process — the same C++ binary Georgi Gerganov's project ships — with a model registry layered on top. Its opinionated default quantization is Q4_K_M for every pulled model: ~2.5× smaller than fp16 with only a ~0.01 perplexity penalty on Llama-class 7Bs, which is why a Qwen3-4B download lands at 2.6 GB instead of 8 GB. Real numbers on hardware you probably own: an M3 Max with the Metal backend decodes Llama-3.1-8B Q4_K_M at roughly 55–70 tok/s for a single stream, an RTX 4090 with CUDA hits ~130 tok/s, and a mid-tier M2 (24 GB) lands around 25 tok/s — all speeds that feel instantaneous to a single human reader but collapse hard once you fan out to concurrent users.
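The 2.6 GB figure falls out of simple arithmetic. Q4_K_M averages roughly 4.85 bits per weight (an approximate figure — the exact average varies with the per-tensor mix), versus 16 bits for fp16:

```python
# Back-of-envelope model size under quantization.
# bits_per_weight ~4.85 for Q4_K_M is an approximate average; the real
# GGUF file also carries embeddings and metadata, hence the small gap
# between this estimate and the observed 2.6 GB download.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16 = model_size_gb(4.0, 16.0)    # ~8.0 GB
q4km = model_size_gb(4.0, 4.85)    # ~2.4 GB
print(f"fp16: {fp16:.1f} GB, Q4_K_M: {q4km:.1f} GB")
```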
The Modelfile — your model's manifest
A Modelfile is Ollama's equivalent of a Dockerfile. It declares a model: its base, its parameters, its system prompt, its template.
```
FROM qwen3:4b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """You are a concise technical assistant. Answer in 2-3 sentences unless asked for detail."""
```
Save as Modelfile, run ollama create my-assistant -f Modelfile, and your configured variant is available at http://localhost:11434 via both the native Ollama API and the OpenAI-compatible endpoint.
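End to end, that workflow is a handful of shell commands (the base model is pulled automatically during create if it isn't already cached; the file contents mirror the example above):

```shell
# Write the Modelfile from the example above.
cat > Modelfile <<'EOF'
FROM qwen3:4b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """You are a concise technical assistant. Answer in 2-3 sentences unless asked for detail."""
EOF

# Build the variant and chat with it (requires a running ollama serve):
# ollama create my-assistant -f Modelfile
# ollama run my-assistant "Explain GQA"
```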
Calling it from Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # not validated, can be anything
)

response = client.chat.completions.create(
    model="my-assistant",
    messages=[{"role": "user", "content": "Explain GQA"}],
)
print(response.choices[0].message.content)
```

Any tool that speaks the OpenAI API speaks Ollama. LangChain, LlamaIndex, LiteLLM, Pydantic-AI, the raw openai package — all work by changing one URL.
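If you use the native API instead, note that /api/generate streams newline-delimited JSON rather than SSE: each line carries a text fragment under "response" plus a "done" flag. A minimal parser sketch (the sample lines below are illustrative, not captured server output):

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate 'response' fragments from an Ollama NDJSON stream."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("done"):  # final line carries stats, no text
            break
        out.append(chunk.get("response", ""))
    return "".join(out)

# Illustrative stream, shaped like /api/generate output:
sample = [
    '{"response": "Grouped-query ", "done": false}',
    '{"response": "attention...", "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # Grouped-query attention...
```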
When Ollama is wrong
Ollama is wonderful for local development and small-team on-prem deployments. It is not the right tool for high-throughput production serving. Under load, llama.cpp's multi-stream handling is meaningfully slower than vLLM's PagedAttention + continuous batching. For production serving to 100+ concurrent users, use vLLM. For one user, one conversation, one Modelfile: Ollama is unbeatable.
The concrete failure mode to watch for: once you exceed ~4 concurrent streams, Ollama's throughput curve goes flat, because llama.cpp's server schedules batches by re-running prefill work that vLLM would amortize with PagedAttention's shared-prefix blocks. You'll see P99 latency tail into multi-second territory on a machine that single-streams at 60 tok/s. That's the signal to graduate: if you can serve it with Ollama and be happy, do — your users get a 4 GB download and a localhost URL; if you need batching, paged attention, or multi-LoRA fan-out, the next two lessons are the answer.
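A back-of-envelope model makes the plateau concrete. It uses the article's ~60 tok/s single-stream figure and assumes aggregate throughput stops growing at ~4 streams; the hard-plateau shape is an idealization, not a measurement:

```python
def per_user_tok_s(users: int, single_stream: float = 60.0,
                   plateau_streams: int = 4) -> float:
    """Idealized: aggregate throughput grows linearly up to the
    plateau, then stays fixed while users split it evenly."""
    aggregate = single_stream * min(users, plateau_streams)
    return aggregate / users

for n in (1, 4, 16):
    rate = per_user_tok_s(n)
    # latency for a typical 256-token reply at this concurrency
    print(f"{n:>2} users: {rate:5.1f} tok/s each, "
          f"256-token reply in {256 / rate:.1f} s")
```

At 16 users each stream gets 15 tok/s, so a 256-token reply takes ~17 s — the multi-second tail latency described above, on a box that single-streams at 60 tok/s.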