Local LLM models are huge. Multi-gigabyte files scattered across default cache folders, downloaded separately on every machine, with no integrity verification. It drove me nuts. If I’m going to commit to immutable dev environments, models shouldn’t be the exception living in some uncontrolled ~/.cache somewhere.

This post covers how I turned that mess into reproducible, content-addressed model management using Nix primitives.

Starting Point: llama-swap

When I was getting SmolLM3 running locally, I started with Ollama. It works, but I quickly wanted more control. Tweaking parameters like context size means creating a Modelfile and rebuilding — it defaults to 32K, but most models support 128K. More friction than I wanted.

That led me to llama-swap - a model management proxy that sits in front of llama.cpp. All the knobs are right there: context size, sampling parameters, quantization settings, flash attention - everything declarative in the model configuration.

It solves a practical problem: you want multiple models available, but you can’t load them all simultaneously. When a request comes in, llama-swap loads the requested model, unloads idle ones based on TTL, and routes everything through an OpenAI-compatible API, hot-swapping models without restarting services.

I started using it immediately. But now I had the problem I opened with - models downloading to whatever cache folder llama.cpp wanted. Not acceptable.

Organizing the Chaos

My first step toward sanity was creating a model catalog — a central place defining how to run each model. This lives in lib/models.nix:

models = {
  "Qwen3-Coder-30B-Q8-256K" = {
    hf = "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL";
    ctxSize = 262144;
    flashAttn = false;
    extraArgs = [
      "--jinja"
      "-ngl 99"
      "--temp 0.7"
      "--top-p 0.8"
    ];
  };
};

Each entry captures:

  • The HuggingFace identifier (org/repo:quantization)
  • Context size, flash attention settings
  • Sampling parameters and extra args

This brought order. Instead of remembering command-line flags, I had a single source of truth. The Home Manager module reads this catalog and templates it into llama-swap’s YAML configuration:

Qwen3-Coder-30B-Q8-256K:
  cmd: /nix/store/.../llama-server --port ${PORT} -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL
    --ctx-size 262144 --metrics --jinja -ngl 99 --temp 0.7 --top-p 0.8
  ttl: 300

Each Nix attribute becomes the right command-line flag. I never write YAML directly — the module handles that. (And yes, llama-server itself comes from the Nix store — same reproducibility guarantees.)
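To make that concrete, here is a simplified sketch of the kind of mapping the module performs; the mkCmd name, the exact flag handling, and the use of pkgs.llama-cpp are illustrative, not the module's actual code:

# Illustrative: turn catalog attributes into llama-server flags for llama-swap's cmd field
mkCmd = model:
  lib.concatStringsSep " " ([
    (lib.getExe' pkgs.llama-cpp "llama-server")
    "--port \${PORT}"   # escaped so llama-swap sees a literal ${PORT}
    "-hf ${model.hf}"
    "--ctx-size ${toString model.ctxSize}"
    "--metrics"
  ]
  ++ lib.optional model.flashAttn "--flash-attn"
  ++ model.extraArgs);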

But there was still a problem. The hf field just tells llama.cpp to download from HuggingFace on demand. Every machine downloads its own copy. No sharing. No integrity verification. No reproducibility. The models existed, but outside my Nix infrastructure.

The Promotion Workflow

What I needed was a workflow that lets me experiment freely — download models from HuggingFace, try them out — and then promote the ones that stick as daily drivers to proper Nix derivations.

The ggufs Section

The catalog gained a second section called ggufs — these are GGUF model files that have been promoted. Each entry here was generated by the promotion script I’ll show next:

ggufs = {
  "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL" = {
    file = "Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf";
    sha256 = "sha256-yGetwqX4I85k9aoHF688S2YJWcPxbCPF9w97+Lp5Xig=";
  };

  # Large models ship pre-split from upstream (HuggingFace file size limits)
  "unsloth/gpt-oss-120b-GGUF:Q8_K_XL" = {
    files = [
      {
        name = "UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf";
        sha256 = "sha256-6xaAL+71gKlKiGLmct8pX2j8hITSzM8OElOfA4YUR2g=";
      }
      {
        name = "UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00002-of-00002.gguf";
        sha256 = "sha256-KwCVJR07HPmkyp1viieTcVQi+Q6UaMorPe73ZqNoptk=";
      }
    ];
  };

  # Vision-language models include projection files
  "unsloth/Qwen3-VL-4B-Thinking-GGUF:Q8_K_XL" = {
    file = "Qwen3-VL-4B-Thinking-UD-Q8_K_XL.gguf";
    sha256 = "sha256-o9dojgU94bVPq1jH3Fk+ZvQQq9SBDbt9w/+3xYDzelY=";
    mmproj = {
      file = "mmproj-F16.gguf";
      sha256 = "sha256-cjVPzT/HWTW4TnRcpJLW543QA7taAg1xspbnZQkmrIc=";
    };
  };
};

The Fallback Logic

The Home Manager module’s fetchGGUF function checks if a model is in the ggufs section. The command builder uses this to decide how to load the model:

gguf = fetchGGUF model.hf;
modelArg = if gguf != null then "-m ${gguf.model}" else "-hf ${model.hf}";

Unpromoted models: not in ggufs, so fetchGGUF returns null and the command uses the -hf flag. llama.cpp downloads the model from HuggingFace on demand.

Promoted models: listed in ggufs with a hash, so fetchGGUF returns a Nix store path and the command uses the -m flag. The file is fetched from the Nix cache and verified against its hash.
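The lookup itself is simple. A minimal sketch, assuming single-file entries only (the real fetchGGUF also handles the split-file and mmproj cases shown above):

fetchGGUF = hfId:
  let
    entry = ggufs.${hfId} or null;
    # "org/repo:QUANT" -> "org/repo"
    repo = builtins.head (lib.splitString ":" hfId);
  in
  if entry == null then null
  else {
    model = pkgs.fetchurl {
      url = "https://huggingface.co/${repo}/resolve/main/${entry.file}";
      inherit (entry) sha256;
    };
  };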

This means I can experiment with new models immediately - just add them to models and llama.cpp handles the download. When I’m ready to make it permanent, I promote it.

The nixify-model Script

I wrote nixify-model to handle the promotion step — taking a model I’ve been using and making it permanent:

nixify-model "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL"

The script:

  1. Finds the downloaded file by pattern matching the HuggingFace identifier
  2. Computes the SHA256 hash
  3. Adds it to the local Nix store
  4. Signs it with my cache’s secret key
  5. Copies it to my cache server
  6. Outputs the catalog entry to paste into models.nix

    Add to lib/models.nix ggufs section:

    "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL" = {
      file = "Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf";
      sha256 = "sha256-yGetwqX4I85k9aoHF688S2YJWcPxbCPF9w97+Lp5Xig=";
    };
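If you want to replicate this, the interesting part is a handful of nix commands. Here is a hypothetical sketch packaged with writeShellApplication, skipping the pattern-matching step and taking the file path directly; the key file path and cache URL are placeholders, not my real setup:

nixify-model = pkgs.writeShellApplication {
  name = "nixify-model";
  runtimeInputs = [ pkgs.nix ];
  text = ''
    model_file="$1"   # path to the GGUF llama.cpp already downloaded

    # Hash the file and add it to the local Nix store
    sri_hash=$(nix hash file "$model_file")
    store_path=$(nix store add-file "$model_file")

    # Sign with the cache's secret key and push to the cache server
    nix store sign --key-file /etc/nix/cache-priv-key.pem "$store_path"
    nix copy --to "ssh://cache.home" "$store_path"

    # Emit the catalog entry to paste into lib/models.nix
    echo "file = \"$(basename "$model_file")\";"
    echo "sha256 = \"$sri_hash\";"
  '';
};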

The Cache Server Payoff

I run a Nix cache on my home server. Once a model is promoted and pushed there, every machine benefits. When home-manager switch runs on any machine:

  1. Nix sees the SHA256 hash from the catalog
  2. Checks configured substituters (including my cache)
  3. Finds the file in my local cache
  4. Downloads at LAN speed, not HuggingFace speed

A 40GB model that took 20 minutes from HuggingFace now takes 2 minutes from my home server. And it’s verified - if the hash doesn’t match, Nix rejects it. This is the payoff that makes all the Nix ceremony worth it.
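On the consuming side there is nothing model-specific to set up; machines just need to trust the cache as a substituter. Something along these lines, where the URL and public key are placeholders for my actual server:

nix.settings = {
  substituters = [ "https://cache.example.home" ];
  trusted-public-keys = [ "cache.example.home:<public-key>" ];
};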

Expanding Beyond LLMs

With LLM models handled, I wanted the same setup for speech-to-text and text-to-speech. Whisper handles transcription, Kokoro handles voice synthesis — both have model files that need the same management.

But there’s a catch: these aren’t llama.cpp models. They’re separate server processes with their own binaries. I couldn’t just add them to the models section.

The ggufs section expanded to include their models:

# Whisper models
"ggerganov/whisper.cpp:large-v3-turbo" = {
  file = "ggml-large-v3-turbo.bin";
  sha256 = "sha256-H8cPd0046xaZk6w5Huo1fvR8iHV+9y7llDh5t+jivGk=";
};

# Silero VAD model for whisper-cpp
"ggml-org/whisper-vad:silero-v6.2.0" = {
  file = "ggml-silero-v6.2.0.bin";
  sha256 = "sha256-KqJpt4XutTqCmDogUB3ffB2cSOM6tjpBORrGyff7aYc=";
};

And the Home Manager module gained proxyModels — a way to manage non-llama.cpp servers through llama-swap. Same model catalog, same promotion workflow, different execution:

proxyModels.whisper = {
  package = pkgs.whisper-cpp;
  binary = "whisper-server";
  port = 9233;
  checkEndpoint = "/v1/audio/transcriptions/";
  hf = "ggerganov/whisper.cpp:large-v3-turbo";
  vadModel = "ggml-org/whisper-vad:silero-v6.2.0";
  group = "always-on";
  extraArgs = [ "--convert" "--vad" ];
};

proxyModels.kokoro = {
  package = pkgs.kokoro-fastapi;
  binary = "kokoro-server";
  port = 8880;
  checkEndpoint = "/health";
  useModelArg = false;  # model bundled in package
  group = "always-on";
  aliases = [ "tts" "kokoro-tts" ];
};

The always-on group keeps these services running persistently while LLM models swap in and out. The result: a single local endpoint handling chat completions, audio transcriptions, and speech synthesis - all with OpenAI-compatible APIs.

The Home Manager Module

The llama-swap Home Manager module ties everything together. It:

  • Reads the model catalog
  • Resolves ggufs entries via fetchGGUF
  • Falls back to HuggingFace for unpromoted models
  • Generates llama-swap YAML configuration
  • Sets up systemd services (Linux) or launchd agents (macOS)

Host configurations become simple model selections:

services.llama-swap = {
  enable = true;
  models = modelsLib.toLlamaSwapModels (modelsLib.selectModels [
    "SmolLM3-3B-Q4-64K-KVQ8"
    "Qwen3-Coder-30B-Q4-128K-KVQ8"
  ]);
};

Different machines select different subsets. Same catalog, same hashes, same infrastructure.
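selectModels itself is little more than an attrset lookup; a minimal equivalent, assuming models is the catalog from lib/models.nix (toLlamaSwapModels then reshapes each selected entry into the module's option format):

selectModels = names: lib.getAttrs names models;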

Full Circle

What started with default cache folders evolved into a reproducible model management system:

  1. Experiment freely: Add models to the catalog, llama.cpp downloads on demand
  2. Promote the ones that stick: nixify-model moves daily drivers into Nix infrastructure
  3. Share everywhere: Cache server serves all machines at LAN speed
  4. Verify always: SHA256 hashes ensure integrity

The model catalog and promotion workflow aren’t LLM-specific. They work for any large model files - Whisper, VAD models, TTS models, whatever needs managing.

The full configuration lives at github.com/qmx/dotfiles.