MIT’s Recursive Language Models: A Systems Approach to 10M-Token Contexts

MIT CSAIL researchers are proposing a different answer to the long-context problem in large language models: don’t keep stretching the context window—change how the system uses it.

Their new Recursive Language Models (RLMs) framework treats long prompts as an external environment, not something that must fit wholesale into a transformer’s context. Instead of feeding millions of tokens directly to a model, RLMs let the model programmatically inspect and decompose that text from within a code environment, recursively calling itself on only the relevant snippets.

The result, in reported benchmarks, is the ability to handle inputs in the 10-million-token range without catastrophic context degradation—using existing models, without retraining. For AI engineers and enterprise architects wrestling with codebase-wide analysis, contract review, or multi-hop reasoning over huge corpora, RLMs sketch out a systems-level alternative to ever-larger context windows and lossy summarization.

The limits of traditional context windows

Long-context capabilities in frontier models have improved, but two constraints remain. The first is a hard ceiling: a transformer can accept only a fixed number of tokens per forward pass, even with architectural and positional-encoding tricks. The second is what practitioners increasingly call “context rot”: the tendency for model performance to degrade as prompts grow longer and more complex.

The MIT team frames the core research question as whether it is possible to increase the effective context size of general-purpose LLMs by orders of magnitude without retraining them. That requirement is driven by enterprise use cases where documents, logs, or codebases span millions of tokens and cannot be easily sharded into independent problems.

According to co-author Alex Zhang, simply scaling up the context window runs into statistical and practical limits. He points to an “entropy argument” implying that as the effective context window grows, the amount of data required to properly train a model to use that context grows exponentially. In other words, training a monolithic model that reliably reasons over millions of tokens is not just an engineering challenge; it has unfavorable data complexity properties as well.

Current workarounds often compress history: systems summarize older messages, documents, or code sections to free up space. That is acceptable when earlier information can be coarsened without affecting the answer. But many tasks require random access to precise fragments: auditing a specific function definition in a large repository, finding a particular covenant in a contract stack, or answering a question that depends on a detail buried 8 million tokens back. Summarization can obliterate exactly the information that matters.

RLMs respond by reframing the challenge as a systems problem rather than a modeling problem: keep the data outside the model, but make it inspectable with code.

From out-of-core algorithms to recursive language models

The conceptual inspiration for RLMs comes from classical “out-of-core” algorithms. Historically, when datasets were larger than main memory, algorithms read from disk in chunks, performing computation over manageable slices while treating external storage as the true home of the data.

RLMs apply the same pattern to LLM inference. Instead of pouring a massive prompt into the model’s context, the full text is loaded as a string variable in a Python runtime. The model is told about the existence and coarse properties of this variable—for example, the total character count—but it does not receive the contents by default.

Within that environment, the LLM behaves as a programmer. It writes code to interact with the external text: slicing, searching, and filtering as needed. Using ordinary Python operations—string methods, regular expressions, loops—it can “peek” into specific regions and pull only the necessary chunks into its active context window for detailed reasoning.
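
As a rough illustration of that "peeking" pattern, the sketch below holds a large string in a variable, exposes only its length as metadata, and pulls small windows out of it with slicing and a regular expression. The variable name `context`, the helper `windows_around`, and the synthetic log text are assumptions made for this example, not names from the MIT implementation.

```python
import re

# Stand-in for a very large prompt. In an RLM this string would live in the
# runtime's memory, never in the model's context window.
context = "\n".join(
    f"line {i}: routine log entry" if i != 70_000
    else f"line {i}: ERROR payment service timeout"
    for i in range(100_000)
)

# What the model is told up front: coarse metadata only.
metadata = {"total_chars": len(context)}

# Peek 1: slice a small window from the start to learn the format.
head = context[:60]

# Peek 2: search for a pattern and pull only a short window around each hit.
def windows_around(pattern: str, text: str, radius: int = 40) -> list[str]:
    """Return short snippets surrounding each regex match in `text`."""
    return [
        text[max(0, m.start() - radius): m.end() + radius]
        for m in re.finditer(pattern, text)
    ]

snippets = windows_around(r"ERROR", context)
```

Only `head` and `snippets` would be candidates for loading into a model call; the remaining text stays in the environment.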

To the outside world, this is still wrapped as a familiar API: you send a string, you get an answer. But inside the wrapper, the model’s interaction with the data looks much closer to an out-of-core processing pipeline than a single forward pass over a giant sequence.

Inside the RLM architecture

MIT’s experiments typically use a two-agent hierarchy:

Root language model: a highly capable model such as GPT-5 that orchestrates the process. It plans how to attack the task, writes and iterates on Python code, and manages data flow in a REPL-like environment.
Recursive language model: a faster, cheaper worker model that processes specific text slices. The root model spawns sub-calls to this worker when it identifies relevant spans in the external prompt.

The workflow looks roughly like this:

The system receives an oversized input—say, an 8–10 million token book, an enormous log bundle, or a giant document collection. This entire payload is stored as a variable in the environment’s memory, not in the model’s context window.
The root model is prompted with metadata (such as total length) and the user’s task (“summarize each chapter”, “find all references to clause X and analyze them”, etc.).
The root writes Python code to explore the text. For example, it might use regular expressions to find markers like “Chapter 1” or “financial results”, scan for headings, or split the text into sections by delimiter.
When the code identifies a relevant region, the system extracts only that substring and feeds it into the worker model (or the root model itself) in a normal, bounded-context call.
The root aggregates these local results, optionally recursing into subproblems, and finally synthesizes a global answer.
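
The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the MIT implementation: `run_rlm`, `llm_query`, `worker_llm`, and the `observation`/`answer` variables are hypothetical names, and the root model is replaced by a scripted stand-in that replays two hand-written code turns so the example runs without any API.

```python
import re
from typing import Callable, Optional

def worker_llm(prompt: str) -> str:
    """Hypothetical bounded-context worker call (stubbed so the sketch runs)."""
    return f"[analysis of {len(prompt)} chars]"

def run_rlm(task: str, context: str, root_model: Callable[[str], str],
            max_turns: int = 5) -> Optional[str]:
    """Drive a root model that writes code against an environment holding `context`."""
    # The payload lives in the environment, not in any prompt.
    env = {"context": context, "llm_query": worker_llm, "re": re,
           "observation": "", "answer": None}
    # The root only sees the task plus coarse metadata.
    message = f"Task: {task}\n`context` is a str with {len(context)} chars."
    for _ in range(max_turns):
        code = root_model(message)
        exec(code, env)  # assumption: a real deployment would sandbox this
        if env["answer"] is not None:
            return env["answer"]
        message = str(env["observation"])  # feed back what the code found
    return None

def make_scripted_root() -> Callable[[str], str]:
    """Stand-in for the root LLM: replays two hand-written code turns."""
    turns = iter([
        "hits = [m.start() for m in re.finditer(r'clause X', context)]\n"
        "observation = f'{len(hits)} hits'",
        "parts = [llm_query(context[max(0, h - 50):h + 50]) for h in hits]\n"
        "answer = ' | '.join(parts)",
    ])
    return lambda message: next(turns)

document = ("preamble " * 100 + "see clause X for terms. "
            + "filler " * 100 + "clause X applies.")
result = run_rlm("find all references to clause X and analyze them",
                 document, make_scripted_root())
```

In this sketch the worker only ever receives slices of at most 100 characters, while `document` itself is never placed in a prompt.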

In a book-level example, the root might first write a loop that finds chapter boundaries. It would then recursively call a worker model to summarize each chapter independently, before producing a higher-level synthesis. At no point does any single model call see the entire book.
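
A minimal sketch of the book example, assuming chapters start on lines of the form "Chapter N". The `summarize` function is a hypothetical stand-in for a worker model call; here it simply echoes the first few words so the code runs offline.

```python
import re

def summarize(text: str) -> str:
    """Hypothetical worker call; here it just echoes the first four words."""
    return " ".join(text.split()[:4]) + " ..."

def split_chapters(book: str) -> list[tuple[str, str]]:
    """Split on lines like 'Chapter 3' and return (title, body) pairs."""
    marks = list(re.finditer(r"^Chapter \d+.*$", book, flags=re.MULTILINE))
    chapters = []
    for i, m in enumerate(marks):
        # A chapter body runs until the next chapter heading or the end of the book.
        end = marks[i + 1].start() if i + 1 < len(marks) else len(book)
        chapters.append((m.group(0).strip(), book[m.end():end].strip()))
    return chapters

book = (
    "Chapter 1\nIt was a dark night.\n"
    "Chapter 2\nThe sun rose over the hills.\n"
)

# One bounded call per chapter, then one more call over the short summaries.
chapter_summaries = {title: summarize(body) for title, body in split_chapters(book)}
synthesis = summarize(" ".join(chapter_summaries.values()))
```

Each `summarize` call sees a single chapter or the list of short summaries, never the whole book.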

Because the prompt lives in the execution environment rather than in the transformer’s context window, the practical input size becomes limited by system memory and I/O, not by the LLM’s positional encoding or context limit. Critically, RLMs are implemented as a wrapper over existing models. For applications that today make direct LLM API calls, the RLM can be integrated as a drop-in replacement: the interface to application code need not change.

For practitioners, the MIT team has made the implementation available on GitHub, enabling direct experimentation with the framework.

How RLMs actually reason over massive prompts

The power of RLMs depends on the model’s ability to perform non-trivial decomposition. Zhang emphasizes that “most complex tasks can be decomposed into smaller, ‘local’ sub-tasks,” but finding a good decomposition is itself hard: the model must either be trained to do it or arrive at it as emergent behavior.

In practice, this involves several layers of reasoning:

Task planning: The root model must first infer an overall strategy: for instance, “identify structural boundaries, then process each unit independently,” or “locate all mentions of a specific entity, then trace relationships across mentions.”
Data navigation: Using Python, the model writes code to navigate the external text. This can include scanning for section headers, matching patterns, or calculating indices and offsets.
Selective loading: Rather than streaming large blocks, the root aims to bring only small, task-relevant fragments into context: a section, a function definition, a contract clause, or a small cluster of paragraphs.
Recursive calls: When a local fragment is itself complex, the system can recursively apply the same strategy—treat a subsection as an environment, write code to split it further, and delegate to worker calls.
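
The recursive step in particular can be illustrated with a short divide-and-delegate function. This is a sketch under simplifying assumptions: the split rule is fixed (halve at a line boundary) rather than chosen by a root model, and `worker_llm` is a hypothetical stub that counts a term instead of calling a model.

```python
def worker_llm(fragment: str, term: str) -> int:
    """Hypothetical bounded-context call; here it just counts a term."""
    return fragment.count(term)

def solve(text: str, term: str, max_chars: int = 1_000) -> int:
    """Recursively split `text` until each piece fits the worker's budget."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return worker_llm(text, term)  # base case: small enough to read directly
    # Prefer a line boundary at or before the midpoint so no line is cut in half.
    cut = text.rfind("\n", 0, len(text) // 2 + 1)
    if cut <= 0:
        cut = len(text) // 2  # no usable boundary: fall back to a hard split
    # Recurse on both halves and combine the local results.
    return solve(text[:cut], term, max_chars) + solve(text[cut:], term, max_chars)

lines = ["alpha beta"] * 300 + ["needle here"] + ["gamma"] * 300 + ["needle again"]
text = "\n".join(lines)
total = solve(text, "needle", max_chars=500)
```

Counting is used only because local results combine trivially by addition; tasks whose partial answers interact would need a smarter combine step.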

Because each sub-call is bounded by an ordinary context window, the familiar failure modes of long, monolithic prompts—loss of attention to early details, degradation of accuracy as length increases—are largely avoided. The global behavior instead depends on whether the root can write effective navigation and decomposition code.

Viewed through a systems lens, RLMs turn LLMs into agents that combine two roles: statistical language modeling and program synthesis for data access. The model is no longer solely responsible for “remembering” all prior tokens; instead, it decides how and when to read from external memory.

Benchmarking RLMs at 10M+ tokens

To test this architecture, the MIT team evaluated RLMs against base models and other agentic approaches, including CodeAct and generic summary agents, across several long-context benchmarks.

On BrowseComp-Plus—a benchmark with inputs between 6 and 11 million tokens—standard base models “failed completely,” with scores reported at 0%. By contrast, an RLM powered by GPT-5 achieved 91.33%. This substantially outperformed a Summary Agent baseline (70.47%) and Apple’s CodeAct (51%).

The framework also showed advantages on tasks with high computational complexity. On OOLONG-Pairs, a reasoning benchmark whose difficulty scales quadratically with input length, base GPT-5 essentially collapsed, with a score of 0.04%. The corresponding RLM reached an F1 score of 58%, indicating that the recursive, selective access pattern enables the model to maintain useful precision and recall where a single-pass approach does not.
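
To make "scales quadratically" concrete, the arithmetic below assumes the task requires considering pairs of entries in the input, as the benchmark's name suggests. With n entries there are n(n - 1)/2 unordered pairs. The function name and the sample sizes are illustrative only.

```python
from math import comb

def pair_count(n_entries: int) -> int:
    """Number of unordered pairs among n entries: n * (n - 1) / 2."""
    return comb(n_entries, 2)

# Ten times more entries means roughly a hundred times more pairs to consider.
growth = {n: pair_count(n) for n in (10, 100, 1_000)}
```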

In code understanding, evaluated with the CodeQA benchmark, the RLM boosted performance from 24% for base GPT-5 to 62%—more than a twofold improvement. For engineers, this hints at practical viability for repository-scale code analysis, where naive long-context prompting has struggled.

Across these evaluations, a consistent pattern emerges: as task complexity and input length grow, base GPT-5’s performance decays sharply, illustrating context rot. The RLM’s performance, however, remains comparatively stable and superior for contexts longer than roughly 16,000 tokens.

While specific numbers are benchmark-dependent, the qualitative result is clear: under tests that require reasoning over millions of tokens or dense interactions that grow with input length, RLMs avoid the catastrophic failures seen in monolithic long-context use.

Cost, stability, and operational trade-offs

Recursive workflows introduce more moving parts: code execution, multiple sub-calls, and potentially deep recursion. That raises natural questions around latency, cost, and stability in production settings.

According to the reported experiments, RLMs often match or even beat alternative approaches on average cost, despite their complexity. On BrowseComp-Plus, for example, the RLM was up to three times cheaper than a summarization-based baseline. The likely reason is that well-targeted, local calls avoid repeated summarization and reprocessing of large chunks, while focusing compute on relevant segments only.

However, the cost distribution is “long-tailed.” The researchers note that while median runs are efficient, some trajectories become expensive: the model may get stuck in loops, over-verify results, or spawn many unnecessary sub-calls. Behavior also varies by model family. GPT-5, used as a root, tended to be conservative with its recursive invocations, whereas the open-source Qwen3-Coder model sometimes attempted thousands of sub-calls for relatively simple problems.

For practitioners, this means that while RLMs can be cost-effective on average, guardrails are currently essential. Zhang cautions that “today, you likely will have to implement your own guardrails and logic to control RLM behavior.” This includes:

Budgeting the number of recursive calls or total tokens used.
Detecting and breaking out of loops or redundant verification cycles.
Instrumenting and logging RLM trajectories for observability and debugging.
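
One way such guardrails might look in practice is sketched below. It is an assumption-laden example, not part of the RLM framework: the `Guardrail` class, its default limits, and the rough four-characters-per-token estimate are all hypothetical choices.

```python
import hashlib
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("rlm.guardrails")

class BudgetExceeded(RuntimeError):
    """Raised when a trajectory exhausts its budget or appears to loop."""

@dataclass
class Guardrail:
    max_calls: int = 50
    max_tokens: int = 200_000
    max_repeats: int = 2  # identical sub-prompts allowed before treating it as a loop
    calls: int = 0
    tokens: int = 0
    seen: dict = field(default_factory=dict)

    def check(self, prompt: str) -> None:
        """Call before every sub-call; raises instead of letting cost run away."""
        est_tokens = len(prompt) // 4  # crude assumption: about 4 chars per token
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            raise BudgetExceeded("repeated identical sub-call (possible loop)")
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded("recursive call budget exhausted")
        if self.tokens + est_tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        self.calls += 1
        self.tokens += est_tokens
        # Log every sub-call so trajectories can be inspected afterwards.
        logger.info("sub-call %d, ~%d tokens so far", self.calls, self.tokens)

def guarded_call(guard: Guardrail, worker: Callable[[str], str], prompt: str) -> str:
    """Run a worker call only if the guardrail allows it."""
    guard.check(prompt)
    return worker(prompt)
```

A caller would create one `Guardrail` per request and route every sub-call through `guarded_call`, catching `BudgetExceeded` to return a partial answer or an error.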

Looking ahead, the researchers suggest that future models could be trained to manage their own compute budgets more effectively, internalizing cost-awareness into their planning. Companies such as Prime Intellect are exploring ways to integrate RLM concepts into the training process itself, potentially reducing these long-tail behaviors. For now, though, operational controls remain the responsibility of the system designer.

Where RLMs fit in the enterprise stack

For enterprise teams, the key appeal of RLMs is pragmatic: they are designed as a wrapper around existing language models and present a familiar string-in/string-out interface. That makes them viable as a drop-in replacement in many applications rather than a wholesale architectural rewrite.

RLMs target exactly the kind of long-horizon tasks that stress conventional LLM deployments:

Codebase analysis: Understanding cross-cutting concerns, refactoring plans, or bug origins across millions of tokens of source and documentation.
Legal and compliance review: Tracing obligations, exceptions, and interdependencies across large contract bundles or regulatory texts.
Complex multi-step reasoning: Answering questions that require weaving together information scattered across vast text corpora, logs, or reports.

Zhang notes that RLMs remain “extremely useful for chatbots (think long chat histories)” but also “argue for an alternative way of using LMs.” Rather than relying solely on retrieval-augmented generation (RAG) with vector stores and embeddings, RLMs offer a complementary approach where the model itself orchestrates structured exploration of a known, large text environment.

Importantly, Zhang does not position RLMs as a drop-in replacement for RAG. Instead, he suggests they “work in tandem with standard retrieval methods like RAG; they do not serve as a replacement, and can be used in different settings or together.” For example, RAG might be used to pull in relevant documents from a large corpus, while RLM logic manages deep, structured analysis of a selected subset that is still too large for a single context window.
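
The division of labor described here could be wired together roughly as follows. Everything in this sketch is a stand-in: `retrieve` is a toy word-overlap ranker where a real RAG stack would use embeddings and a vector store, `rlm_analyze` is a hypothetical placeholder for an RLM call, and the three-document corpus is invented for illustration.

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by words shared with the query."""
    terms = set(query.lower().split())

    def score(name: str) -> int:
        return len(terms & set(corpus[name].lower().split()))

    ranked = sorted(corpus, key=score, reverse=True)
    return [name for name in ranked[:k] if score(name) > 0]

def rlm_analyze(task: str, context: str) -> str:
    """Hypothetical stand-in for an RLM call over the selected subset."""
    return f"{task!r} analyzed over {len(context)} chars"

def answer(task: str, corpus: dict[str, str], k: int = 2) -> str:
    selected = retrieve(task, corpus, k)                      # step 1: narrow the corpus
    context = "\n\n".join(corpus[name] for name in selected)  # step 2: build the environment text
    return rlm_analyze(task, context)                         # step 3: deep analysis of the subset

corpus = {
    "msa.txt": "The indemnity clause caps liability at fees paid.",
    "nda.txt": "Confidential information must be returned on request.",
    "menu.txt": "Friday lunch is soup and salad.",
}
selected = retrieve("indemnity liability limits", corpus)
report = answer("indemnity liability limits", corpus)
```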

For architects deciding where to invest, the implication is that long-context capability need not come solely from bigger transformer contexts. System-level orchestration—treating prompts as environments, using code as a control interface, and delegating recursively—can unlock multi-million-token reasoning today, using existing foundation models.

Open questions and practical next steps

While the reported results are strong, several open issues will matter for teams considering adoption:

Decomposition robustness: RLMs assume that a capable root model can reliably learn or infer effective decomposition strategies. How stable is this behavior across domains, and how brittle is it under distribution shift?
Tooling and observability: Productionizing RLMs will likely require mature tooling: tracing, visualization of recursive call trees, and policy frameworks for cost and safety.
Security and governance: Running arbitrary, model-generated Python code around sensitive data raises questions about sandboxing, access control, and compliance, especially in regulated industries.
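
As a small illustration of the sandboxing concern, the sketch below runs a code string in a separate Python interpreter with a wall-clock limit. It is a starting point only and an assumption of this example, not a description of how the RLM implementation executes code: process isolation plus a timeout is not a security boundary, and production use would likely add containers, restricted file and network access, and resource limits.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run model-generated code in a separate interpreter with a time limit.

    Returns (return_code, stdout). A return code of -1 means the run timed out.
    """
    try:
        proc = subprocess.run(
            # -I starts Python in isolated mode (ignores environment variables
            # and the user site directory).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return (-1, "timed out")
    return (proc.returncode, proc.stdout.strip())
```

For example, `run_untrusted("print(2 + 2)")` returns the child's exit code and its printed output, while a snippet that never terminates is cut off after `timeout_s` seconds.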

Nonetheless, the framework’s core proposition is clear: by treating massive text inputs as an environment that models can program against, Recursive Language Models sidestep the fundamental scaling limits of monolithic context and mitigate context rot at very large scales. For engineers and architects facing 10M-token problems today, RLMs offer a concrete, code-centric path to push beyond current context ceilings—without waiting for the next generation of ultra-long-context models to be trained from scratch.

Cary Huang
