Escaping the Dumb Zone with RLMs
Back in October I wrote about small models being the future. The thesis: you don’t need a trillion parameters for most tasks. I want to get to a point where a collection of small models - 30B or less - running locally can handle my coding workflow.
But there’s a catch that hits all models, big and small: context windows.
128K or 200K context windows sound great on paper. In practice, you burn through that budget fast. Once you’re past ~40% capacity, you’re in the dumb zone - the model starts drifting, hallucinating, forgetting its own instructions. This isn’t a small model problem. It’s how attention works.
I’ve been digging into Recursive Language Models lately. It’s research, not production-ready, but I think the pattern might finally change this.
The Pattern
Traditional LLMs are passive receivers - you dump a huge document into the prompt and pray the model finds what it needs. The entire corpus competes for attention.
RLMs flip this. Context becomes a variable the model can query, not a payload it has to swallow.
Instead of stuffing a massive document directly into the prompt, RLMs store it in an external environment - a REPL that persists across reasoning steps. The model does what any good developer does: it greps. It peeks at structure, runs regex searches across millions of tokens, pulls in what it needs when it needs it.
The REPL runs in a RestrictedPython sandbox - safe execution with a curated set of tools. The model gets access to re, json, math, datetime, and collections. Enough to slice, search, and transform data. Not enough to escape the sandbox.
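Here's roughly what that wiring could look like - my sketch, not recursive-llm's actual internals; the module whitelist, the guard choices, and the FINAL() convention are assumptions:

# Sketch of a restricted execution step: compile the model's snippet with
# RestrictedPython, expose only `context`, a module whitelist, and FINAL().
import re, json, math, datetime, collections
from RestrictedPython import compile_restricted
from RestrictedPython.Eval import default_guarded_getitem
from RestrictedPython.Guards import safe_builtins, safer_getattr
from RestrictedPython.PrintCollector import PrintCollector

ALLOWED_MODULES = {"re": re, "json": json, "math": math,
                   "datetime": datetime, "collections": collections}

class Final(Exception):                        # raised when the snippet calls FINAL()
    def __init__(self, answer):
        self.answer = answer

def FINAL(answer):
    raise Final(answer)

def _guarded_import(name, *args, **kwargs):    # only the whitelist is importable
    if name in ALLOWED_MODULES:
        return ALLOWED_MODULES[name]
    raise ImportError(f"import of {name!r} is not allowed")

def run_snippet(source: str, context: str):
    """Execute one model-written snippet; return (final_answer, printed_output)."""
    code = compile_restricted(source, filename="<model>", mode="exec")
    env = {
        "__builtins__": dict(safe_builtins, __import__=_guarded_import),
        "context": context,                    # the full corpus, queryable as a string
        "FINAL": FINAL,
        "_print_": PrintCollector,             # collects print() output
        "_getattr_": safer_getattr,
        "_getitem_": default_guarded_getitem,
        "_getiter_": iter,
        **ALLOWED_MODULES,
    }
    try:
        exec(code, env)
    except Final as f:
        return f.answer, ""
    collector = env.get("_print")              # set if the snippet printed anything
    return None, collector() if collector else ""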
A typical interaction looks like this:
# Model writes code to explore context
import re
matches = re.findall(r'def \w+\(', context)
print(f"Found {len(matches)} function definitions")
# When it has the answer
FINAL("The module contains 47 functions, primarily focused on...")
The model iterates - run code, see results, refine approach - until it calls FINAL() with the answer. Each iteration only sends the code and results, not the entire context.
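The driving loop can be small. A sketch, not the recursive-llm source - `chat` stands in for whatever LLM client you use, and `run_snippet` for a sandboxed executor like the one above:

# Hypothetical driver loop: ask the model for a snippet, execute it, feed the
# output back, repeat until FINAL() is called. Note that only code and output
# ever enter the history - `context` never does.
from typing import Callable, Optional, Tuple

SYSTEM = ("You have a REPL with a `context` variable (a large string). "
          "Write Python to explore it. Available: re, json, math, datetime, "
          "collections. Call FINAL(answer) when you have the answer.")

def answer(question: str, context: str,
           chat: Callable[[list], str],                    # returns a code snippet
           run_snippet: Callable[[str, str], Tuple[Optional[str], str]],
           max_iterations: int = 10) -> str:
    history = [{"role": "system", "content": SYSTEM},
               {"role": "user", "content": question}]
    for _ in range(max_iterations):
        code = chat(history)                               # model writes code
        final, output = run_snippet(code, context)         # sandboxed execution
        if final is not None:                              # snippet called FINAL()
            return final
        history.append({"role": "assistant", "content": code})
        history.append({"role": "user", "content": f"Output:\n{output}"})
    return "No answer within the iteration budget."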
When a problem is too big for one pass, RLMs partition the context and delegate to sub-agents. Each sub-agent handles a chunk, reports back, and the root model synthesizes.
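In sketch form - the chunking strategy and names here are mine, not the paper's API - the recursion might look like:

# Hypothetical recursive step: split an oversized context into chunks, have
# sub-calls report on their chunk, then answer over the reports, not the raw text.
from typing import Callable

def recursive_answer(question: str, context: str,
                     ask: Callable[[str, str], str],   # single-pass QA over a small context
                     chunk_size: int = 50_000,
                     depth: int = 0, max_depth: int = 3) -> str:
    if len(context) <= chunk_size or depth >= max_depth:
        return ask(question, context)                  # base case: fits in one pass
    chunks = [context[i:i + chunk_size]
              for i in range(0, len(context), chunk_size)]
    reports = [recursive_answer(f"Report anything relevant to: {question}",
                                chunk, ask, chunk_size, depth + 1, max_depth)
               for chunk in chunks]
    # The root call only ever sees the sub-agents' summaries.
    return ask(question, "\n\n".join(reports))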
This adds latency. Sequential REPL operations take time compared to single-pass inference. For interactive chat, that hurts. For batch processing codebases? Worth it.
Traditional approach - you send everything:
System: You are a helpful assistant.
User: Here is a codebase (95,000 tokens of code...)
Question: How many API endpoints are defined?
The model receives ~95K tokens. Most of it irrelevant to the question. Attention dilutes across the entire corpus.
RLM approach - you send instructions:
System: You have access to a REPL with a `context` variable containing
the codebase. Use Python to explore it. Call FINAL() with your answer.
Available: re, json, math, datetime, collections
Variable: context (str, 95000 tokens)
User: How many API endpoints are defined?
The model receives ~500 tokens of instructions. The 95K token codebase exists as a Python variable it can query. First iteration might be:
import re
endpoints = re.findall(r'@app\.(get|post|put|delete)\(', context)
FINAL(f"Found {len(endpoints)} API endpoints")
Three lines of code. Maybe 50 tokens generated. The model never loaded the full context into its attention window - it searched it from outside.
The recursive-llm README reports 2-3k tokens per query vs 95k+ for the direct approach. That’s not a rounding error. That’s a different architecture.
Already Shipping
RLMs formalize what production systems already discovered empirically.
Claude Code uses exactly this architecture. CLAUDE.md files load upfront for critical context. Everything else gets retrieved just-in-time via glob and grep. No vector databases, no embeddings, no RAG - just raw string matching. An Anthropic engineer admitted on Hacker News they “just grep your repo line by line.”
The sub-agent pattern runs deep. When you ask Claude Code to explore a codebase, it doesn’t try to hold everything in working memory. It spawns an Explore agent with its own context window. That agent greps, reads files, builds understanding, then reports back a summary. The parent orchestrator never sees the raw file contents - just the distilled findings.
Same with planning. A Plan agent thinks through architecture, identifies files to modify, considers trade-offs. It returns a structured plan. The orchestrator decides whether to proceed, spawn workers, or ask clarifying questions. Each agent operates in isolation, communicates through summaries, and dies when done.
Anthropic’s context management post shows their “context editing” - pruning conversation history as it grows. Old tool calls get summarized. Redundant information gets dropped. The result: 84% token reduction in 100-turn evaluations without degrading task performance.
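The mechanics are simple enough to sketch - this is my toy version of the idea, not Anthropic's implementation:

# Toy context editing: keep the most recent turns verbatim, collapse older
# tool results to short stubs so they stop eating the window.
def edit_context(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    cutoff = len(messages) - keep_recent
    edited = []
    for i, msg in enumerate(messages):
        if i < cutoff and msg.get("role") == "tool":
            stub = f"[tool result elided ({len(msg['content'])} chars)]"
            edited.append({**msg, "content": stub})      # old tool output becomes a stub
        else:
            edited.append(msg)                           # recent turns stay verbatim
    return edited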
OpenAI’s and Google’s Deep Research systems use the same orchestrator + sub-agent delegation pattern. The industry converged on this before the RLM paper gave it a name. That’s not coincidence - it’s the only architecture that scales.
The Research
The RLM paper formalizes the pattern. The benchmarks used frontier-scale models (including Qwen3-Coder-480B), and the results are promising:
- Long context retrieval: On OOLONG (132K tokens), GPT-5-mini with RLM scaffolding outperformed standard GPT-5 by 33% at roughly the same cost
- Code understanding: Standard GPT-5 hit 24% accuracy on CodeQA; RLM-wrapped hit 62%
- Scaling: Standard models degrade around 150K tokens, but RLMs have handled 10M+ tokens without performance loss
Math reasoning still lags - models aren’t trained to leverage REPL scaffolding for symbolic computation yet. But for code and retrieval tasks, the architecture works.
Where This Goes
The tools are pluggable. Current implementations use basic regex, but nothing stops you from giving the model access to ast-grep for structural code search, or tree-sitter for proper parsing. Raw grep finds def foo( - ast-grep finds “all functions that take a callback and don’t handle errors.” Same architecture, sharper tools.
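Handing the model a structural-search helper could be as simple as shelling out to the ast-grep CLI - a sketch, assuming ast-grep is installed and on PATH, with an illustrative pattern:

# Sketch of a pluggable structural-search tool: expose this in the REPL and the
# model can query syntax instead of strings.
import subprocess

def ast_search(pattern: str, path: str = ".", lang: str = "python") -> str:
    """Return ast-grep's matches for a structural pattern, e.g. 'def $FUNC($$$ARGS): $$$BODY'."""
    result = subprocess.run(
        ["ast-grep", "run", "--pattern", pattern, "--lang", lang, path],
        capture_output=True, text=True,
    )
    return result.stdout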
Despite the frontier-scale benchmarks, recursive-llm has solid local model support via LiteLLM:
rlm = RLM(model="ollama/llama3.2")
The library supports a two-model strategy - use a capable model for root decisions and a smaller one for recursive exploration:
rlm = RLM(
    model="ollama/llama3.2",
    recursive_model="ollama/llama2",
    max_iterations=10,
    max_depth=3
)
Fragility is real - each step only knows the current state, so one bad transition can cascade. But the token savings might make smaller models viable for tasks they’d otherwise choke on. If RLM scaffolding squeezes 33% more performance out of a frontier model, maybe it does something similar for a 30B running locally.
The Python REPL is the obvious choice - models are trained on tons of Python, and RestrictedPython provides some sandboxing. But it’s bolting safety onto a language that wasn’t designed for it. What if the execution environment itself was rethought?
I keep wondering about Prolog. RLMs are fundamentally about querying context, and Prolog is a query language at its core. Pattern matching via unification seems like a natural fit for searching codebases. Datalog, a restricted subset of Prolog, already powers serious code analysis tools - GitHub’s CodeQL uses it to find security vulnerabilities across millions of repositories.
Imagine a Prolog-based RLM doing the same context search:
% Context is asserted as chunks: chunk(Id, Text)
% Find chunks matching a pattern (re_match/2, SWI-Prolog library(pcre))
?- chunk(Id, Text), re_match("@app\\.(get|post)", Text).
% Slice a window of chunks around a region of interest
?- findall(Line, (chunk(N, Line), N > 100, N < 200), Slice).
Declarative, constrained by design, no side effects. The model would describe what to find, not how to find it.
For sandboxing, I’m curious about WebAssembly - memory-safe, deterministic, no escape hatch by design. What if you ran a Prolog interpreter inside WASI? Query language for the task, sandboxed runtime for containment. Runtimes like Wasmtime and Wasmer are already battle-tested. The model generates Prolog queries, a WASI-sandboxed interpreter runs them against the context. No compilation step, just interpretation within guaranteed bounds.
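Mechanically, the host side isn't much code. A sketch with the Wasmtime Python bindings - prolog.wasm and the argv convention are stand-ins, since no packaged build is assumed here; only the Wasmtime wiring is real:

# Sketch: run a WASI-compiled Prolog interpreter under Wasmtime.
from wasmtime import Engine, Store, Module, Linker, WasiConfig

engine = Engine()
linker = Linker(engine)
linker.define_wasi()                                  # provide the WASI imports

module = Module.from_file(engine, "prolog.wasm")      # hypothetical WASI build

wasi = WasiConfig()
wasi.argv = ["prolog", "query.pl"]                    # how the generated query gets in
wasi.preopen_dir("./context", "/context")             # the only filesystem it can see
wasi.inherit_stdout()                                 # results come back on stdout

store = Store(engine)
store.set_wasi(wasi)

instance = linker.instantiate(store, module)
instance.exports(store)["_start"](store)              # run to completion inside the sandbox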
There’s an irony in current RLM implementations: the GPU sits idle while the interpreter chugs through regex on the CPU. The context is already in VRAM for inference - why ship it to CPU for search? GPU-accelerated Datalog is an active research area. VFLog reports 200x speedups over CPU-based engines using column-oriented storage. Boolean Matrix Logic Programming gets 1-4 orders of magnitude improvement by casting Datalog as matrix ops. The memory bandwidth gap is brutal: an H100 pushes 3.35 TB/s versus ~576 GB/s for top-end CPUs. wasm-vk already compiles WASM to Vulkan compute shaders. The WASI-gfx proposal is working on GPU compute for WASI.
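To make the “Datalog as matrix ops” point concrete, here's the toy version - the classic reachability program as boolean matrix algebra, with numpy standing in for the GPU:

# Toy "Datalog as matrix ops": the program
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
# becomes repeated boolean matrix multiplication of the adjacency matrix,
# iterated to a fixpoint. On a GPU the same step is one batched matmul.
import numpy as np

edge = np.zeros((4, 4), dtype=bool)
edge[0, 1] = edge[1, 2] = edge[2, 3] = True                       # a tiny edge relation

path = edge.copy()
while True:
    step = (path.astype(np.uint8) @ edge.astype(np.uint8)) > 0    # AND-OR as matmul
    new_path = path | step
    if np.array_equal(new_path, path):                            # fixpoint: no new facts
        break
    path = new_path

print(np.argwhere(path))                                          # every derived path(X, Y) fact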
The Python REPL works today. But I suspect the more interesting future lies in execution environments designed for the task rather than borrowed from general-purpose programming.
Prime Intellect and MIT researchers are calling RLMs the next major milestone after Chain of Thought. We spent years racing to make models bigger. RLMs suggest another path - smarter context management that benefits any model, but especially unlocks smaller ones. The dumb zone isn’t a fundamental limit. It’s a design choice we can engineer around.