Escaping the Dumb Zone with RLMs
Back in October I wrote about small models being the future. The thesis: you don’t need a trillion parameters for most tasks. I want to get to a point where a collection of small models - 30B or less - running locally can handle my coding workflow.
But there’s a catch that hits all models, big and small: context windows.
128K or 200K context windows sound great on paper. In practice, you burn through that budget fast. Once you’re past ~40% capacity, you’re in the dumb zone - the model starts drifting, hallucinating, forgetting its own instructions. This isn’t a small model problem. It’s how attention works.
I’ve been digging into Recursive Language Models lately. It’s research, not production-ready, but I think the pattern might finally change this.
The Pattern
Traditional LLMs are passive receivers - you dump a huge document into the prompt and pray the model finds what it needs. The entire corpus competes for attention.
RLMs flip this. Context becomes a variable the model can query, not a payload it has to swallow.
Instead of stuffing a massive document directly into the prompt, RLMs store it in an external environment - a REPL that persists across reasoning steps. The model does what any good developer does: it greps. It peeks at structure, runs regex searches across millions of tokens, pulls in what it needs when it needs it.
The REPL runs in a RestrictedPython sandbox - safe execution with a curated set of tools. The model gets access to re, json, math, datetime, and collections. Enough to slice, search, and transform data. Not enough to escape the sandbox.
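Here's roughly what that wiring could look like - my sketch, not recursive-llm's actual internals; the module whitelist, the guard choices, and the FINAL() convention are assumptions:

# Sketch of a restricted execution step: compile the model's snippet with
# RestrictedPython, expose only `context`, a module whitelist, and FINAL().
import re, json, math, datetime, collections
from RestrictedPython import compile_restricted
from RestrictedPython.Eval import default_guarded_getitem
from RestrictedPython.Guards import safe_builtins, safer_getattr
from RestrictedPython.PrintCollector import PrintCollector

ALLOWED_MODULES = {"re": re, "json": json, "math": math,
                   "datetime": datetime, "collections": collections}

class Final(Exception):                        # raised when the snippet calls FINAL()
    def __init__(self, answer):
        self.answer = answer

def FINAL(answer):
    raise Final(answer)

def _guarded_import(name, *args, **kwargs):    # only the whitelist is importable
    if name in ALLOWED_MODULES:
        return ALLOWED_MODULES[name]
    raise ImportError(f"import of {name!r} is not allowed")

def run_snippet(source: str, context: str):
    """Execute one model-written snippet; return (final_answer, printed_output)."""
    code = compile_restricted(source, filename="<model>", mode="exec")
    env = {
        "__builtins__": dict(safe_builtins, __import__=_guarded_import),
        "context": context,                    # the full corpus, queryable as a string
        "FINAL": FINAL,
        "_print_": PrintCollector,             # collects print() output
        "_getattr_": safer_getattr,
        "_getitem_": default_guarded_getitem,
        "_getiter_": iter,
        **ALLOWED_MODULES,
    }
    try:
        exec(code, env)
    except Final as f:
        return f.answer, ""
    collector = env.get("_print")              # set if the snippet printed anything
    return None, collector() if collector else ""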
A typical interaction looks like this:
# Model writes code to explore context
import re
matches = re.findall(r'def \w+\(', context)
print(f"Found {len(matches)} function definitions")
# When it has the answer
FINAL("The module contains 47 functions, primarily focused on...")
The model iterates - run code, see results, refine approach - until it calls FINAL() with the answer. Each iteration only sends the code and results, not the entire context.
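The driving loop can be small. A sketch, not the recursive-llm source - `chat` stands in for whatever LLM client you use, and `run_snippet` for a sandboxed executor like the one above:

# Hypothetical driver loop: ask the model for a snippet, execute it, feed the
# output back, repeat until FINAL() is called. Note that only code and output
# ever enter the history - `context` never does.
from typing import Callable, Optional, Tuple

SYSTEM = ("You have a REPL with a `context` variable (a large string). "
          "Write Python to explore it. Available: re, json, math, datetime, "
          "collections. Call FINAL(answer) when you have the answer.")

def answer(question: str, context: str,
           chat: Callable[[list], str],                    # returns a code snippet
           run_snippet: Callable[[str, str], Tuple[Optional[str], str]],
           max_iterations: int = 10) -> str:
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": question}]
    for _ in range(max_iterations):
        code = chat(history)                               # model writes code
        final, output = run_snippet(code, context)         # sandboxed execution
        if final is not None:                              # snippet called FINAL()
            return final
        history.append({"role": "assistant", "content": code})
        history.append({"role": "user", "content": f"Output:\n{output}"})
    return "No answer within the iteration budget."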
When a problem is too big for one pass, RLMs partition the context and delegate to sub-agents. Each sub-agent handles a chunk, reports back, and the root model synthesizes.
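In sketch form - the chunking strategy and names here are mine, not the paper's API - the recursion might look like:

# Hypothetical recursive step: split an oversized context into chunks, have
# sub-calls report on their chunk, then answer over the reports, not the raw text.
from typing import Callable

def recursive_answer(question: str, context: str,
                     ask: Callable[[str, str], str],   # single-pass QA over a small context
                     chunk_size: int = 50_000,
                     depth: int = 0, max_depth: int = 3) -> str:
    if len(context) <= chunk_size or depth >= max_depth:
        return ask(question, context)                  # base case: fits in one pass
    chunks = [context[i:i + chunk_size]
              for i in range(0, len(context), chunk_size)]
    reports = [recursive_answer(f"Report anything relevant to: {question}",
                                chunk, ask, chunk_size, depth + 1, max_depth)
               for chunk in chunks]
    # The root call only ever sees the sub-agents' summaries.
    return ask(question, "\n\n".join(reports))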
This adds latency. Sequential REPL operations take time compared to single-pass inference. For interactive chat, that hurts. For batch processing codebases? Worth it.
Traditional approach - you send everything:
System: You are a helpful assistant.
User: Here is a codebase (95,000 tokens of code...)
Question: How many API endpoints are defined?
The model receives ~95K tokens. Most of it irrelevant to the question. Attention dilutes across the entire corpus.
RLM approach - you send instructions:
System: You have access to a REPL with a `context` variable containing
the codebase. Use Python to explore it. Call FINAL() with your answer.
Available: re, json, math, datetime, collections
Variable: context (str, 95000 tokens)
User: How many API endpoints are defined?
The model receives ~500 tokens of instructions. The 95K token codebase exists as a Python variable it can query. First iteration might be:
import re
endpoints = re.findall(r'@app\.(get|post|put|delete)\(', context)
FINAL(f"Found {len(endpoints)} API endpoints")
Three lines of code. Maybe 50 tokens generated. The model never loaded the full context into its attention window - it searched it from outside.
The recursive-llm README reports 2-3k tokens per query vs 95k+ for the direct approach. That’s not a rounding error. That’s a different architecture.
Already Shipping
RLMs formalize what production systems already discovered empirically.
Claude Code uses exactly this architecture. CLAUDE.md files load upfront for critical context. Everything else gets retrieved just-in-time via glob and grep. No vector databases, no embeddings, no RAG - just raw string matching. An Anthropic engineer admitted on Hacker News they “just grep your repo line by line.”
The sub-agent pattern runs deep. When you ask Claude Code to explore a codebase, it doesn’t try to hold everything in working memory. It spawns an Explore agent with its own context window. That agent greps, reads files, builds understanding, then reports back a summary. The parent orchestrator never sees the raw file contents - just the distilled findings.
Same with planning. A Plan agent thinks through architecture, identifies files to modify, considers trade-offs. It returns a structured plan. The orchestrator decides whether to proceed, spawn workers, or ask clarifying questions. Each agent operates in isolation, communicates through summaries, and dies when done.
Anthropic’s context management post shows their “context editing” - pruning conversation history as it grows. Old tool calls get summarized. Redundant information gets dropped. The result: 84% token reduction in 100-turn evaluations without degrading task performance.
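The mechanics are simple enough to sketch - this is my toy version of the idea, not Anthropic's implementation:

# Toy context editing: keep the most recent turns verbatim, collapse older
# tool results to short stubs so they stop eating the window.
def edit_context(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    cutoff = len(messages) - keep_recent
    edited = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg.get("role") == "tool":
            stub = f"[tool result elided ({len(msg['content'])} chars)]"
            edited.append({**msg, "content": stub})      # old tool output becomes a stub
        else:
            edited.append(msg)                           # recent turns stay verbatim
    return edited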
OpenAI’s and Google’s Deep Research systems use the same orchestrator + sub-agent delegation pattern. The industry converged on this before the RLM paper gave it a name. That’s not coincidence - it’s the only architecture that scales.
The Research
The RLM paper formalizes the pattern. The benchmarks used frontier-scale models (including Qwen3-Coder-480B), and the results are promising:
- Long context retrieval: On OOLONG (132K tokens), GPT-5-mini with RLM scaffolding outperformed standard GPT-5 by 33% at roughly the same cost
- Code understanding: Standard GPT-5 hit 24% accuracy on CodeQA; RLM-wrapped hit 62%
- Scaling: Standard models degrade around 150K tokens, but RLMs have handled 10M+ tokens without performance loss
Math reasoning still lags - models aren’t trained to leverage REPL scaffolding for symbolic computation yet. But for code and retrieval tasks, the architecture works.
Where This Goes
The tools are pluggable. Current implementations use basic regex, but nothing stops you from giving the model access to ast-grep for structural code search, or tree-sitter for proper parsing. Raw grep finds def foo( - ast-grep finds “all functions that take a callback and don’t handle errors.” Same architecture, sharper tools.
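Handing the model a structural-search helper could be as simple as shelling out to the ast-grep CLI - a sketch, assuming ast-grep is installed and on PATH, with an illustrative pattern:

# Sketch of a pluggable structural-search tool: expose this in the REPL and the
# model can query syntax instead of strings.
import subprocess

def ast_search(pattern: str, path: str = ".", lang: str = "python") -> str:
    """Return ast-grep's matches for a structural pattern, e.g. 'def $FUNC($$$ARGS): $$$BODY'."""
    result = subprocess.run(
        ["ast-grep", "run", "--pattern", pattern, "--lang", lang, path],
        capture_output=True, text=True,
    )
    return result.stdout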
Despite the frontier-scale benchmarks, recursive-llm has solid local model support via LiteLLM:
rlm = RLM(model="ollama/llama3.2")
The library supports a two-model strategy - use a capable model for root decisions and a smaller one for recursive exploration:
rlm = RLM(
    model="ollama/llama3.2",
    recursive_model="ollama/llama2",
    max_iterations=10,
    max_depth=3
)
Fragility is real - each step only knows the current state, so one bad transition can cascade. But the token savings might make smaller models viable for tasks they’d otherwise choke on. If RLM scaffolding squeezes 33% more performance out of a frontier model, maybe it does something similar for a 30B running locally.
The Python REPL is the obvious choice - models are trained on tons of Python, and RestrictedPython provides some sandboxing. But it’s bolting safety onto a language that wasn’t designed for it. What if the execution environment itself was rethought?
I keep wondering about Prolog. RLMs are fundamentally about querying context, and Prolog is a query language at its core. Pattern matching via unification seems like a natural fit for searching codebases. Datalog, a restricted subset of Prolog, already powers serious code analysis tools - GitHub’s CodeQL uses it to find security vulnerabilities across millions of repositories.
Imagine a Prolog-based RLM doing the same context search:
% Context is asserted as chunks: chunk(Id, Text)
% Find chunks matching a pattern (re_match/2, SWI-Prolog library(pcre))
?- chunk(Id, Text), re_match("@app\\.(get|post)", Text).
% Slice a window of chunks around a region of interest
?- findall(Line, (chunk(N, Line), N > 100, N < 200), Slice).
Declarative, constrained by design, no side effects. The model would describe what to find, not how to find it.
For sandboxing, I’m curious about WebAssembly - memory-safe, deterministic, no escape hatch by design. What if you ran a Prolog interpreter inside WASI? Query language for the task, sandboxed runtime for containment. Runtimes like Wasmtime and Wasmer are already battle-tested. The model generates Prolog queries, a WASI-sandboxed interpreter runs them against the context. No compilation step, just interpretation within guaranteed bounds.
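Mechanically, the host side isn't much code. A sketch with the Wasmtime Python bindings - prolog.wasm and the argv convention are stand-ins, since no packaged build is assumed here; only the Wasmtime wiring is real:

# Sketch: run a WASI-compiled Prolog interpreter under Wasmtime.
from wasmtime import Engine, Store, Module, Linker, WasiConfig

engine = Engine()
linker = Linker(engine)
linker.define_wasi()                                  # provide the WASI imports

module = Module.from_file(engine, "prolog.wasm")      # hypothetical WASI build

wasi = WasiConfig()
wasi.argv = ["prolog", "query.pl"]                    # how the generated query gets in
wasi.preopen_dir("./context", "/context")             # the only filesystem it can see
wasi.inherit_stdout()                                 # results come back on stdout

store = Store(engine)
store.set_wasi(wasi)

instance = linker.instantiate(store, module)
instance.exports(store)["_start"](store)              # run to completion inside the sandbox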
There’s an irony in current RLM implementations: the GPU sits idle while the interpreter chugs through regex on the CPU. The context is already in VRAM for inference - why ship it to CPU for search? GPU-accelerated Datalog is an active research area. VFLog reports 200x speedups over CPU-based engines using column-oriented storage. Boolean Matrix Logic Programming gets 1-4 orders of magnitude improvement by casting Datalog as matrix ops. The memory bandwidth gap is brutal: an H100 pushes 3.35 TB/s versus ~576 GB/s for top-end CPUs. wasm-vk already compiles WASM to Vulkan compute shaders. The WASI-gfx proposal is working on GPU compute for WASI.
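To make the “Datalog as matrix ops” point concrete, here's the toy version - the classic reachability program as boolean matrix algebra, with numpy standing in for the GPU:

# Toy "Datalog as matrix ops": the program
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
# becomes repeated boolean matrix multiplication of the adjacency matrix,
# iterated to a fixpoint. On a GPU the same step is one batched matmul.
import numpy as np

edge = np.zeros((4, 4), dtype=bool)
edge[0, 1] = edge[1, 2] = edge[2, 3] = True                       # a tiny edge relation

path = edge.copy()
while True:
    step = (path.astype(np.uint8) @ edge.astype(np.uint8)) > 0    # AND-OR as matmul
    new_path = path | step
    if np.array_equal(new_path, path):                            # fixpoint: no new facts
        break
    path = new_path

print(np.argwhere(path))                                          # every derived path(X, Y) fact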
The Python REPL works today. But I suspect the more interesting future lies in execution environments designed for the task rather than borrowed from general-purpose programming.
Prime Intellect and MIT researchers are calling RLMs the next major milestone after Chain of Thought. We spent years racing to make models bigger. RLMs suggest another path - smarter context management that benefits any model, but especially unlocks smaller ones. The dumb zone isn’t a fundamental limit. It’s a design choice we can engineer around.