Ollama — 60 seconds to your first local LLM
Ollama is a Go application that wraps llama.cpp with a model registry, a REST API, automatic GPU detection, and Modelfile-based configuration. It's the fastest path from “I have nothing installed” to “I'm serving an OpenAI-compatible local LLM.”
Under the hood, ollama serve is a Go REST façade that spawns and talks to llama.cpp's llama-server process — the same C++ binary Georgi Gerganov's project ships — with a model registry layered on top. Its opinionated default quantization is Q4_K_M for every pulled model: ~2.5× smaller than fp16 with only a ~0.01 perplexity penalty on Llama-class 7Bs, which is why a Qwen3-4B download lands at 2.6 GB instead of 8 GB. Real numbers on hardware you probably own: an M3 Max with the Metal backend decodes Llama-3.1-8B Q4_K_M at roughly 55–70 tok/s for a single stream, an RTX 4090 with CUDA hits ~130 tok/s, and a mid-tier M2 (24 GB) lands around 25 tok/s — all speeds that feel instantaneous to a single human reader but collapse hard once you fan out to concurrent users.
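The 2.6 GB figure falls out of simple arithmetic. Q4_K_M averages roughly 4.85 bits per weight (an approximate figure — the exact average varies with the per-tensor mix), versus 16 bits for fp16:

```python
# Back-of-envelope model size under quantization.
# bits_per_weight ~4.85 for Q4_K_M is an approximate average; the real
# GGUF file also carries embeddings and metadata, hence the small gap
# between this estimate and the observed 2.6 GB download.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16 = model_size_gb(4.0, 16.0)    # ~8.0 GB
q4km = model_size_gb(4.0, 4.85)    # ~2.4 GB
print(f"fp16: {fp16:.1f} GB, Q4_K_M: {q4km:.1f} GB")
```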
The Modelfile — your model's manifest
A Modelfile is Ollama's equivalent of a Dockerfile. It declares a model: its base, its parameters, its system prompt, its template.
```
FROM qwen3:4b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """You are a concise technical assistant. Answer in 2-3 sentences unless asked for detail."""
```
Save as Modelfile, run ollama create my-assistant -f Modelfile, and your configured variant is available at http://localhost:11434 via both the native Ollama API and the OpenAI-compatible endpoint.
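End to end, that workflow is a handful of shell commands (the base model is pulled automatically during create if it isn't already cached; the file contents mirror the example above):

```shell
# Write the Modelfile from the example above.
cat > Modelfile <<'EOF'
FROM qwen3:4b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"
SYSTEM """You are a concise technical assistant. Answer in 2-3 sentences unless asked for detail."""
EOF

# Build the variant and chat with it (requires a running ollama serve):
# ollama create my-assistant -f Modelfile
# ollama run my-assistant "Explain GQA"
```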
Calling it from Python
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # not validated, can be anything
)

response = client.chat.completions.create(
    model="my-assistant",
    messages=[{"role": "user", "content": "Explain GQA"}],
)
print(response.choices[0].message.content)
```

Any tool that speaks the OpenAI API speaks Ollama. LangChain, LlamaIndex, LiteLLM, Pydantic-AI, the raw openai package — all work by changing one URL.
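If you use the native API instead, note that /api/generate streams newline-delimited JSON rather than SSE: each line carries a text fragment under "response" plus a "done" flag. A minimal parser sketch (the sample lines below are illustrative, not captured server output):

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate 'response' fragments from an Ollama NDJSON stream."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        if chunk.get("done"):  # final line carries stats, no text
            break
        out.append(chunk.get("response", ""))
    return "".join(out)

# Illustrative stream, shaped like /api/generate output:
sample = [
    '{"response": "Grouped-query ", "done": false}',
    '{"response": "attention...", "done": false}',
    '{"done": true}',
]
print(collect_stream(sample))  # Grouped-query attention...
```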
When Ollama is wrong
Ollama is wonderful for local development and small-team on-prem deployments. It is not the right tool for high-throughput production serving. Under load, llama.cpp's multi-stream handling is meaningfully slower than vLLM's PagedAttention + continuous batching. For production serving to 100+ concurrent users, use vLLM. For one user, one conversation, one Modelfile: Ollama is unbeatable.
The concrete failure mode to watch for: once you exceed ~4 concurrent streams, Ollama's throughput curve goes flat, because llama.cpp's server schedules batches by re-running prefill work that vLLM would amortize with PagedAttention's shared-prefix blocks. You'll see P99 latency tail into multi-second territory on a machine that single-streams at 60 tok/s. That's the signal to graduate: if you can serve it with Ollama and be happy, do — your users get a 4 GB download and a localhost URL; if you need batching, paged attention, or multi-LoRA fan-out, the next two lessons are the answer.
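A back-of-envelope model makes the plateau concrete. It uses the article's ~60 tok/s single-stream figure and assumes aggregate throughput stops growing at ~4 streams; the hard-plateau shape is an idealization, not a measurement:

```python
def per_user_tok_s(users: int, single_stream: float = 60.0,
                   plateau_streams: int = 4) -> float:
    """Idealized: aggregate throughput grows linearly up to the
    plateau, then stays fixed while users split it evenly."""
    aggregate = single_stream * min(users, plateau_streams)
    return aggregate / users

for n in (1, 4, 16):
    rate = per_user_tok_s(n)
    # latency for a typical 256-token reply at this concurrency
    print(f"{n:>2} users: {rate:5.1f} tok/s each, "
          f"256-token reply in {256 / rate:.1f} s")
```

At 16 users each stream gets 15 tok/s, so a 256-token reply takes ~17 s — the multi-second tail latency described above, on a box that single-streams at 60 tok/s.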