Install
The official one-liner installs the ollama binary, creates a system user, drops a systemd unit, and (if it can detect one) wires up your GPU:
curl -fsSL https://ollama.com/install.sh | sh
On a fresh Debian 12 or Ubuntu 24.04 box this is everything. The installer detects NVIDIA cards via nvidia-smi and AMD cards via ROCm; if it finds neither, you get CPU-only inference, which still works for the small models. The systemd unit is enabled and started automatically:
systemctl status ollama
sudo journalctl -u ollama -f
Ollama listens on 127.0.0.1:11434 by default. The HTTP API is the only interface — the ollama CLI is just a client.
Pull and run your first model
Models live in the Ollama library at ollama.com/library. Each tag specifies a parameter count and quantization. A reasonable starting point on a machine with ~8 GB of VRAM (or ~8 GB of free RAM for CPU):
ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain what an SNI callback does, in two sentences."
First run downloads the weights (~2 GB for a 3B model at Q4_K_M, the default quant). After that, model load is seconds and inference starts streaming immediately.
Rule of thumb at Q4_K_M quantization: a model needs roughly 0.6 GB of VRAM per billion parameters, plus a bit for KV cache. A 3B model fits comfortably on 6 GB cards; 8B needs ~6 GB; 14B wants 12 GB+; 70B wants two 24 GB cards or aggressive quantization. Running ollama ps shows what's loaded and how much memory each model is using.
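If you'd rather not do the arithmetic in your head, here is a minimal sketch of that rule of thumb. The function name and the 1 GB KV-cache allowance are my own assumptions (the "bit for KV cache" depends heavily on context length), so treat the output as a ballpark, not a guarantee.

```python
def estimate_vram_gb(params_billion: float, kv_cache_gb: float = 1.0) -> float:
    """Ballpark VRAM need at Q4_K_M: ~0.6 GB per billion parameters.

    kv_cache_gb is an assumed flat allowance for the KV cache; real usage
    grows with context length and parallel slots.
    """
    return round(0.6 * params_billion + kv_cache_gb, 1)


if __name__ == "__main__":
    # Model sizes mentioned in the text, in billions of parameters.
    for size in (3, 8, 14, 70):
        print(f"{size}B: ~{estimate_vram_gb(size)} GB")
```

Compare the result against what ollama ps reports once the model is actually loaded.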
The HTTP API
Two endpoints cover 95% of usage. Both stream by default and accept "stream": false for buffered responses:
# Completion-style
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Write a Bash one-liner to find files larger than 1GB.",
  "stream": false
}'
# Chat-style (multi-turn)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "system", "content": "You are a terse Linux assistant."},
    {"role": "user", "content": "How do I list TCP listeners?"}
  ],
  "stream": false
}'
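When you leave streaming on, /api/generate sends one JSON object per line, with the text fragment in "response" and "done": true on the last one. A minimal sketch of consuming that from Python with only the standard library follows. The helper names iter_tokens and stream_generate are mine, not part of Ollama, and it assumes the server is on localhost:11434 with llama3.2:3b already pulled.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response body."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # final object of the stream


def stream_generate(prompt: str, model: str = "llama3.2:3b",
                    base_url: str = "http://localhost:11434") -> Iterator[str]:
    """POST a prompt and yield the reply as it arrives (streaming is the default)."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # An HTTPResponse iterates line by line, which matches the wire format.
        yield from iter_tokens(resp)


if __name__ == "__main__":
    for token in stream_generate("Write a Bash one-liner to find files larger than 1GB."):
        print(token, end="", flush=True)
    print()
```

For /api/chat the shape is the same, except the fragment sits under "message" and then "content" instead of "response".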
There's also an OpenAI-compatible shim at /v1/chat/completions, /v1/completions, and /v1/embeddings. Any client that takes an OpenAI base_url can point at Ollama and just work:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, not used by the server
)
resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hi"}],
)
print(resp.choices[0].message.content)
Embeddings
For RAG / semantic search you want a small, fast embedding model rather than a chat model. The two I default to:
ollama pull nomic-embed-text # 768 dim, ~270MB, very fast
ollama pull mxbai-embed-large # 1024 dim, ~670MB, slightly better recall
curl http://localhost:11434/api/embeddings -d '{
"model": "nomic-embed-text",
"prompt": "the quick brown fox"
}'
Pipes straight into pgvector or any other vector store. The 768/1024 dim sizes match common vector(n) column definitions.
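If you just want to sanity-check the vectors before wiring up a database, here is a rough sketch that embeds two strings and compares them with cosine similarity. It assumes the same local server and the /api/embeddings call shown above, whose response carries the vector under "embedding". The function names are hypothetical helpers, not part of any Ollama client.

```python
import json
import math
import urllib.request
from typing import List, Sequence


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> List[float]:
    """Fetch one embedding vector from the local Ollama server."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    v1 = embed("the quick brown fox")
    v2 = embed("a fast auburn fox")
    print(len(v1), round(cosine_similarity(v1, v2), 3))
```

The length check matters in practice: vectors from nomic-embed-text and mxbai-embed-large have different dimensions and can't be compared or stored in the same column.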
Exposing it to other machines
Binding to 127.0.0.1 is the safe default. If you want other machines on your LAN (or Tailscale net) to hit it, override OLLAMA_HOST via a systemd drop-in — don't edit the shipped unit, it gets overwritten on upgrade:
sudo systemctl edit ollama
That opens an editor. Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Then sudo systemctl restart ollama. A quick rundown:
- OLLAMA_HOST: bind interface and port. Use 0.0.0.0:11434 for all interfaces, or a specific IP. There is no auth; firewall it or front it with Caddy + basic auth if it's not on a private network.
- OLLAMA_ORIGINS: CORS whitelist. * is fine on a private net; restrict to your origins otherwise.
- OLLAMA_KEEP_ALIVE: how long an idle model stays in VRAM. Default is 5 minutes. Set higher if you reuse the same model often; 0 unloads immediately after each request.
- OLLAMA_NUM_PARALLEL: concurrent requests per model. Each parallel slot consumes additional KV cache memory.
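After changing OLLAMA_HOST it's worth confirming from another machine that the port is actually reachable. A small sketch, assuming the server answers GET /api/version with a JSON object containing "version"; the function name and the LAN address in the example are placeholders of mine.

```python
import json
import urllib.request
from typing import Optional


def ollama_version(base_url: str, timeout: float = 3.0) -> Optional[str]:
    """Return the server's version string, or None if it can't be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a reply that isn't valid JSON.
        return None


if __name__ == "__main__":
    host = "http://192.168.1.50:11434"  # hypothetical LAN address, replace with yours
    version = ollama_version(host)
    print(f"reachable, version {version}" if version else "not reachable")
```

If this prints "not reachable" from the LAN but works on the box itself, the usual suspects are a firewall rule or the drop-in not being picked up (check systemctl show ollama for the Environment line).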
A nicer chat UI
If you want a ChatGPT-style web UI without writing anything, Open WebUI is the path of least resistance — runs in a single container and points at your Ollama instance:
docker run -d --restart unless-stopped \
-p 3000:8080 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Visit http://localhost:3000, create the first user (which becomes admin), pick your model.
Where models live
Default storage path is /usr/share/ollama/.ollama/models (owned by the ollama user). On a server with a small root partition, move it before pulling anything large:
sudo systemctl stop ollama
sudo mkdir -p /data/ollama-models
sudo chown -R ollama:ollama /data/ollama-models
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_MODELS=/data/ollama-models"
sudo systemctl start ollama
Already-pulled models can be moved with rsync -a before restarting.
Troubleshooting
- "CUDA out of memory" after the model loads fine. KV cache for long contexts adds up. Drop the context window with "options": {"num_ctx": 4096} in your API call, or pick a smaller model.
- GPU not detected. Check nvidia-smi works as the ollama user (sudo -u ollama nvidia-smi). On a fresh install you may need the proprietary driver and a reboot. The Ollama log will say "no compatible GPUs were discovered".
- First token takes ~10s on a small model. The model isn't loaded yet. Subsequent requests within OLLAMA_KEEP_ALIVE are instant. For interactive use, increase OLLAMA_KEEP_ALIVE.
- Slow on CPU. Make sure you have a :q4_K_M, :q4_0, or smaller quant; the unquantized :fp16 tags are huge and very slow without a GPU. ollama show llama3.2:3b tells you the quant of what you pulled.
- Want to test the OpenAI shim is alive? Run curl http://localhost:11434/v1/models; it should return JSON listing your local models.
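Two of the fixes above, a smaller context window and a longer keep-alive, can be applied per request rather than server-wide. Here is a sketch of the request body for /api/chat. The "options" field with num_ctx is the one quoted in the list; the per-request "keep_alive" field is, as far as I know, accepted by the native API with a duration string, and the chat_body helper with its defaults is purely illustrative.

```python
import json
from typing import Dict, List


def chat_body(messages: List[Dict[str, str]], model: str = "llama3.2:3b",
              num_ctx: int = 4096, keep_alive: str = "30m") -> dict:
    """Build an /api/chat request body with a capped context and a longer keep-alive."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # smaller context means a smaller KV cache
        "keep_alive": keep_alive,         # how long the model stays loaded after this call
    }


if __name__ == "__main__":
    body = chat_body([{"role": "user", "content": "How do I list TCP listeners?"}])
    # Pipe this into: curl http://localhost:11434/api/chat -d @-
    print(json.dumps(body, indent=2))
```

A per-request value only affects that call, so it's handy for one interactive client without changing the default for everything else that talks to the server.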