Self-host LLMs on Linux with Ollama

Install

The official one-liner installs the ollama binary, creates a system user, drops a systemd unit, and (if it can detect one) wires up your GPU:

curl -fsSL https://ollama.com/install.sh | sh

On a fresh Debian 12 or Ubuntu 24.04 box, that is all you need. The installer detects NVIDIA cards via nvidia-smi and AMD cards via ROCm; if it finds neither, you get CPU-only inference, which is still usable for the small models. The systemd unit is enabled and started automatically; check on it with:

systemctl status ollama
sudo journalctl -u ollama -f

Ollama listens on 127.0.0.1:11434 by default. The HTTP API is the only interface — the ollama CLI is just a client.

Pull and run your first model

Models live in the Ollama library at ollama.com/library. Each tag specifies a parameter count and quantization. A reasonable starting point on a machine with ~8 GB of VRAM (or ~8 GB of free RAM for CPU):

ollama pull llama3.2:3b
ollama run llama3.2:3b "Explain what an SNI callback does, in two sentences."

First run downloads the weights (~2 GB for a 3B model at Q4_K_M, the default quant). After that, the model loads in seconds and output streams as soon as generation starts.

Picking a model size

Rule of thumb at Q4_K_M quantization: a model needs roughly 0.6 GB of VRAM per billion parameters, plus a bit for KV cache. A 3B model fits comfortably on 6 GB cards; 8B needs ~6 GB; 14B wants 12 GB+; 70B wants two 24 GB cards or aggressive quantization. Run ollama ps to see what's loaded and how much memory each model is using.
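The rule of thumb is easy to turn into a quick calculator. This is a minimal sketch: the 0.6 GB-per-billion factor comes from the text above, while the flat 1 GB KV cache allowance and the function names are assumptions made for illustration, not anything Ollama reports.

```python
def estimate_vram_gb(params_billion: float,
                     kv_cache_gb: float = 1.0,
                     gb_per_billion: float = 0.6) -> float:
    """Rough VRAM estimate at Q4_K_M: weights plus a KV cache allowance.

    kv_cache_gb is an assumed flat allowance; real usage grows with
    context length and the number of parallel slots.
    """
    return params_billion * gb_per_billion + kv_cache_gb


def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the estimate fits within the given amount of VRAM."""
    return estimate_vram_gb(params_billion) <= vram_gb


for size in (3, 8, 14, 70):
    print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB")
```

With the assumed 1 GB allowance this gives about 5.8 GB for an 8B model and about 43 GB for 70B, which lines up with the card sizes quoted above. Treat it as a first filter and confirm with ollama ps once the model is loaded.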

The HTTP API

Two endpoints cover most day-to-day usage. Both stream newline-delimited JSON by default and accept "stream": false for a single buffered response:

# Completion-style
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Write a Bash one-liner to find files larger than 1GB.",
  "stream": false
}'

# Chat-style (multi-turn)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    {"role": "system", "content": "You are a terse Linux assistant."},
    {"role": "user", "content": "How do I list TCP listeners?"}
  ],
  "stream": false
}'
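The curl examples above turn streaming off. When streaming is left on, /api/generate sends one JSON object per line, each carrying a "response" fragment, with "done": true on the last one. A minimal sketch of consuming that from Python with only the standard library follows; the function names are illustrative, and error handling is left out.

```python
import json
import urllib.request
from typing import Iterable, Iterator

OLLAMA_URL = "http://localhost:11434/api/generate"


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # final object; nothing useful follows


def stream_generate(model: str, prompt: str) -> Iterator[str]:
    """POST to /api/generate and yield fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The response object iterates line by line.
        yield from iter_tokens(resp)
```

With the server running, something like `for piece in stream_generate("llama3.2:3b", "Say hi"): print(piece, end="", flush=True)` should print text as it is generated. Keeping the parsing in iter_tokens separate from the network call makes it easy to test against canned lines.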

There's also an OpenAI-compatible shim at /v1/chat/completions, /v1/completions, and /v1/embeddings. Any client that takes an OpenAI base_url can point at Ollama and just work:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the client, not used by the server
)

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hi"}]
)
print(resp.choices[0].message.content)

Embeddings

For RAG / semantic search you want a small, fast embedding model rather than a chat model. Two reasonable defaults:

ollama pull nomic-embed-text       # 768 dim, ~270MB, very fast
ollama pull mxbai-embed-large      # 1024 dim, ~670MB, slightly better recall

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "the quick brown fox"
}'

The returned vector can go straight into pgvector or any other vector store. Declare the column as vector(768) or vector(1024) to match the model's output dimension.
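For a sense of how the embeddings get used before a vector store is involved, here is a small sketch that calls the same /api/embeddings endpoint and ranks documents by cosine similarity in plain Python. The helper names (embed, cosine_similarity, rank) are made up for this example; a real deployment would let the vector store do the ranking.

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embeddings"


def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Return the embedding vector for one piece of text."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        EMBED_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float],
         docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Sort documents by similarity to the query, best match first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Typical use would be to embed each document once, store the vectors, then call `rank(embed(query), stored_vectors)` at query time.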

Exposing it to other machines

Binding to 127.0.0.1 is the safe default. If you want other machines on your LAN (or Tailscale net) to hit it, override OLLAMA_HOST via a systemd drop-in — don't edit the shipped unit, it gets overwritten on upgrade:

sudo systemctl edit ollama

That opens an editor. Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"

Then sudo systemctl restart ollama. A quick rundown:

OLLAMA_HOST — bind interface and port. Use 0.0.0.0:11434 for all interfaces, or a specific IP. There is no auth; firewall it or front it with Caddy + basic auth if it's not on a private network.
OLLAMA_ORIGINS — CORS whitelist. * is fine on a private net; restrict to your origins otherwise.
OLLAMA_KEEP_ALIVE — how long an idle model stays in VRAM. Default is 5 minutes. Set higher if you reuse the same model often; 0 unloads immediately after each request.
OLLAMA_NUM_PARALLEL — concurrent requests per model. Each parallel slot consumes additional KV cache memory.
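After changing OLLAMA_HOST, it is worth confirming from a second machine that the port is actually reachable. The sketch below asks the /api/tags endpoint for the list of local models; the helper names and the example address are placeholders, so substitute the server's real IP or hostname.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_remote_models(host: str, port: int = 11434,
                       timeout: float = 5.0) -> list[str]:
    """Return the models a remote Ollama instance has pulled.

    Raises urllib.error.URLError if the host is unreachable, which
    usually points at the bind address or a firewall rule.
    """
    url = f"http://{host}:{port}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))
```

For example, `list_remote_models("192.168.1.50")` (a hypothetical LAN address) should return something like the names shown by ollama list on the server. A timeout or connection refusal means the bind or firewall change has not taken effect.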

A nicer chat UI

If you want a ChatGPT-style web UI without writing anything, Open WebUI is the path of least resistance — runs in a single container and points at your Ollama instance:

docker run -d --restart unless-stopped \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

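Visit http://localhost:3000, create the first user (which becomes admin), and pick your model. The container reaches Ollama through the Docker bridge rather than loopback, so Ollama has to listen on an address the bridge can reach (for example OLLAMA_HOST=0.0.0.0:11434); with the default 127.0.0.1 bind, Open WebUI will not be able to connect.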

Where models live

Default storage path is /usr/share/ollama/.ollama/models (owned by the ollama user). On a server with a small root partition, move it before pulling anything large:

sudo systemctl stop ollama
sudo mkdir -p /data/ollama-models
sudo chown -R ollama:ollama /data/ollama-models

sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_MODELS=/data/ollama-models"

sudo systemctl start ollama

Already-pulled models can be copied across while the service is stopped, before the final start: sudo rsync -a /usr/share/ollama/.ollama/models/ /data/ollama-models/
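Before copying, it can help to check how much space the existing models take and whether the target has room. This is a rough sketch using only the standard library; the function names and the 10% headroom are arbitrary choices for illustration, and it needs to run as a user that can read the models directory.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Total size of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def has_room(src: str, dst: str, headroom: float = 0.1) -> bool:
    """True if dst's filesystem can hold src plus some headroom."""
    needed = dir_size_bytes(src) * (1 + headroom)
    return shutil.disk_usage(dst).free >= needed
```

For example, `has_room("/usr/share/ollama/.ollama/models", "/data/ollama-models")` returns True when the copy should fit.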

Troubleshooting

"CUDA out of memory" after the model loads fine. KV cache for long contexts adds up. Drop the context window with "options": {"num_ctx": 4096} in your API call, or pick a smaller model.
GPU not detected. Check nvidia-smi works as the ollama user (sudo -u ollama nvidia-smi). On a fresh install you may need the proprietary driver and a reboot. The Ollama log will say "no compatible GPUs were discovered".
First token takes ~10s on a small model. The model isn't loaded yet, and that delay is the load from disk into memory. Subsequent requests within the OLLAMA_KEEP_ALIVE window skip the load. For interactive use, increase OLLAMA_KEEP_ALIVE.
Slow on CPU. Make sure you have a :q4_K_M, :q4_0, or smaller quant; the unquantized :fp16 tags are huge and very slow without a GPU. ollama show llama3.2:3b tells you the quant of what you pulled.
Want to test the OpenAI shim is alive? curl http://localhost:11434/v1/models — should return JSON listing your local models.
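Two of the fixes above, the num_ctx override and a longer keep-alive, can also be applied per request rather than server-wide. The sketch below builds an /api/chat payload with both; "options" and "keep_alive" are request fields the API accepts, while the helper names and the 30m value are only illustrative.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_message: str,
                       num_ctx: int = 4096,
                       keep_alive: str = "30m") -> dict:
    """Build a non-streaming chat request with a capped context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # smaller context, smaller KV cache
        "keep_alive": keep_alive,         # keep the model loaded after this call
    }


def chat(payload: dict) -> str:
    """Send the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Calling `chat(build_chat_payload("llama3.2:3b", "How do I list TCP listeners?"))` against a running server returns the reply as a string. A per-request keep_alive overrides the server default for that model only, which is handy when a single client needs a model to stay warm.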