Google researchers are proposing a different way to train AI systems for complex, long-horizon tasks—one that doesn’t revolve around endlessly sampling the next token. Their new technique, called internal reinforcement learning (internal RL), shifts the focus from what a model says to how it thinks internally, by directly steering its hidden activations toward high-level, multi-step solutions.
Instead of treating reasoning as a long sequence of token predictions, internal RL uses a separate controller network to nudge an LLM’s internal state so it settles into a useful, abstract plan. The base model then fills in the concrete steps it already “knows” from pretraining. Early experiments suggest this might offer a scalable path to AI agents and robots that can plan and act over long horizons without constant human guidance.
The problem with next-token RL for long-horizon reasoning
Modern large language models are optimized around next-token prediction. Even when reinforcement learning is added on top—for example, to improve reasoning—training still typically explores by tweaking token-level outputs. For long-horizon problems with sparse rewards, the Google team argues this is the wrong level of abstraction.
Because LLMs are autoregressive, they generate one token at a time and explore by perturbing that next token or action. This works for short, local decisions, but breaks down when the solution requires a coordinated sequence of many steps. In long-horizon tasks, reward signals are rare and delayed, so the chance of stumbling into a successful sequence through random token-level exploration is extremely low—on the order of one in a million, according to the researchers.
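To get a feel for how quickly flat exploration runs out of luck, here is a back-of-the-envelope calculation. The numbers are illustrative assumptions, not values from the paper: 4 possible actions per step, a 10-step task, exactly one rewarded sequence, and uniformly random exploration.

```python
def random_success_probability(num_actions: int, horizon: int) -> float:
    """Chance that uniform random exploration hits the single rewarded sequence."""
    return (1.0 / num_actions) ** horizon


# Assumed toy setting: 4 actions per step, 10 steps, one correct sequence.
p = random_success_probability(num_actions=4, horizon=10)
print(f"success probability per episode: {p:.2e}")             # 9.54e-07
print(f"expected episodes until first success: {1 / p:,.0f}")  # 1,048,576
```

Even in this tiny setting the odds land at roughly one in a million, and every additional step multiplies the number of candidate sequences by another factor of four.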
The issue is not just that models “get confused,” but where they get confused. As co-author Yanick Schimpf explained to VentureBeat, in a 20-step task an agent can either get lost in the fine-grained details of a single step or lose track of the overall goal altogether. If exploration is happening at the level of individual tokens, the agent may never reliably lock onto the higher-level strategy needed to complete the task.
Schimpf describes what’s needed as “goal-oriented exploration” for problems with abstract structure: solve the task at a high level first, commit to a plan, and then let the detailed execution follow. Without this high-level commitment, the agent risks “getting lost in one of the reasoning steps” and failing to complete the broader workflow.
This motivation aligns with longstanding work in hierarchical reinforcement learning (HRL), where complex behaviors are decomposed into temporally extended actions, or options. Instead of managing behavior as a flat chain of primitive actions (or tokens), HRL seeks a hierarchy of subroutines representing different phases of a solution. But discovering meaningful subroutines has proven difficult in practice: many HRL methods converge to “degenerate options” that don’t map to useful behaviors, and even modern algorithms like GRPO (Group Relative Policy Optimization) struggle in environments that demand robust long-horizon planning.
From hierarchical RL to internal RL: what actually changes?
Internal RL builds on the same intuition as hierarchical RL—operate at higher levels of abstraction—but applies it inside the model rather than in its external action space. The key observation from the Google team is that advanced autoregressive models already exhibit rich internal structure: they seem to internally represent multi-step plans, even when they only output text or simple actions token by token.
That internal structure lives in the model’s residual stream—the evolving vector of activations that flows through its layers. The internal RL approach attempts to harness this by introducing a second network, a metacontroller, that learns to manipulate those internal activations instead of the explicit outputs.
Crucially, this is not a hand-crafted hierarchy. There are no manually defined subroutines or options. Instead, the metacontroller learns to induce useful, high-level internal states that correspond to progress on the task. The base model is left to realize these states as concrete action or token sequences.
In that sense, internal RL reframes the HRL idea: rather than imposing a hierarchy on the output space, the hierarchy emerges in the space of internal representations. The model still outputs tokens or actions one step at a time, but the exploration and credit assignment are now centered on abstract internal decisions about where in the solution space to go next.
Technically, the internal RL setup adds an “internal neural network controller” wrapped around the middle layers of a pre-existing autoregressive model. This metacontroller takes as input the current internal state of the base model (its activations) and outputs adjustments to those activations in the residual stream.
The effect is a gentle push: the metacontroller nudges the base model into a particular region of its internal state space that corresponds to a useful high-level plan. The base model—unchanged in its architecture and next-token prediction objective—then naturally generates the low-level sequence of actions or tokens consistent with that plan, because it has already seen and encoded such sequences during its original pretraining.
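The steering mechanism described above can be sketched in a few lines of NumPy. This is a toy illustration, not the researchers' architecture: the two weight matrices stand in for the frozen lower and upper halves of a pretrained model, and the latent choice `z`, the layer sizes, and the `metacontroller` function are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT = 8, 2  # hypothetical sizes, chosen for readability

# Stand-ins for the frozen lower and upper halves of a pretrained model.
W_lower = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_upper = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

# Metacontroller weights (the only part that would be trained).
W_ctrl = rng.normal(size=(D_MODEL + D_LATENT, D_MODEL)) * 0.1


def metacontroller(h, z):
    """Read the current activations h and a latent choice z, return an offset."""
    return np.tanh(np.concatenate([h, z]) @ W_ctrl)


def forward(x, z=None):
    h = np.tanh(x @ W_lower)           # mid-layer residual activations
    if z is not None:
        h = h + metacontroller(h, z)   # additive nudge to the residual stream
    return h @ W_upper                 # scores over next tokens or actions


x = rng.normal(size=D_MODEL)
base_out = forward(x)                             # unsteered behavior
steered_out = forward(x, z=np.array([1.0, 0.0]))  # same input, different "plan"
print(np.round(steered_out - base_out, 3))
```

The base weights are never modified. The only thing that changes between the two calls is the offset added at the middle layer, which is enough to shift the scores the upper half produces.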
Training the metacontroller does not rely on human labels. Instead, the researchers use a self-supervised framework: the system looks at full trajectories of behavior and works backward to infer which latent high-level intent would best explain the observed sequence. Through internal RL, the metacontroller is updated to favor those latent choices that lead to successful outcomes, effectively learning its own abstract action space.
During this phase, the training objective shifts away from next-token prediction toward discovering high-level internal actions that reliably lead to reward. The base model’s token-level knowledge remains intact; what changes is how that knowledge is accessed and organized by the metacontroller.
For practitioners, a useful analogy is temperature in LLM decoding. Today, tasks like code generation force a trade-off: low temperature yields syntactic correctness but little creativity, while high temperature encourages exploration but risks syntax errors and incoherent logic. Schimpf suggests internal RL offers a different path: explore in the abstract space of logic and method composition, but keep the low-temperature, high-fidelity base distribution for tokens. The model can then try different high-level approaches to a coding problem without corrupting syntax and structure at the token level.
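For readers who have not looked at temperature closely, the snippet below shows the standard temperature-scaled softmax used in decoding. The three logits are made-up scores for three candidate tokens; only the shape of the effect matters.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


logits = [4.0, 2.0, 1.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=2.0)
print("T=0.2:", np.round(cold, 3))  # nearly all mass on the top token
print("T=2.0:", np.round(hot, 3))   # mass spread across all three tokens
```

Raising the temperature is the only exploration knob available at the token level, and it spreads probability onto tokens the model itself considers poor. Exploring over internal plans instead leaves this distribution sharp.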
Two training setups: frozen base vs. joint optimization
The Google team studied two ways to apply the metacontroller:
- Frozen base model: Pretrain the base autoregressive model on a behavioral dataset, then freeze its parameters. Only the metacontroller is trained afterward to steer the residual stream.
- Joint training: Optimize the base model and metacontroller together from scratch, updating parameters of both networks simultaneously.
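The difference between the two setups comes down to which parameters receive updates. The sketch below makes that concrete on a toy linear model. It is an assumption-laden illustration: a squared-error loss stands in for the reinforcement learning objective, the controller is reduced to a single offset vector, and the joint branch starts from the same weights rather than from scratch.

```python
import numpy as np


def train_step(W_base, c_ctrl, h, target, lr=0.05, freeze_base=True):
    """One gradient step on 0.5 * ||(h + c_ctrl) @ W_base - target||^2."""
    steered = h + c_ctrl               # controller offset added to activations
    err = steered @ W_base - target    # prediction error, shape (3,)
    grad_c = W_base @ err              # gradient w.r.t. the controller offset
    grad_W = np.outer(steered, err)    # gradient w.r.t. the base weights
    new_c = c_ctrl - lr * grad_c       # the controller is always updated
    new_W = W_base if freeze_base else W_base - lr * grad_W
    return new_W, new_c


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))            # toy "base model" weights
c = np.zeros(4)                        # toy "metacontroller" offset
h = rng.normal(size=4)                 # toy mid-layer activations
target = np.array([1.0, 0.0, 0.0])

W_frozen, c_frozen = train_step(W, c, h, target, freeze_base=True)
W_joint, c_joint = train_step(W, c, h, target, freeze_base=False)
print("base changed (frozen):", not np.array_equal(W_frozen, W))  # False
print("base changed (joint): ", not np.array_equal(W_joint, W))   # True
```

In the frozen setup the controller has to work with whatever the base model already represents. In the joint setup both sets of parameters move at once, so the representation the controller is trying to steer keeps shifting underneath it.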
Both setups share the same broad goal—discover high-level internal actions that support long-horizon decision-making—but they differ in how much they rely on a stable, pretrained representation.
The frozen-base configuration leans heavily on the idea that the model already “knows” how to perform the necessary low-level behaviors. Internal RL then becomes a way of selecting and sequencing those behaviors by choosing appropriate internal states. Joint training, by contrast, attempts to learn both the lower-level behavior and the high-level control at once.
In practice, the results strongly favored the frozen-base approach. When both networks were co-trained from scratch, the system struggled to develop meaningful abstractions: the metacontroller did not converge on useful internal options, and the hierarchical structure the method aims for failed to materialize. Only when the base model was kept fixed did the metacontroller reliably discover high-level switches aligned with genuine task subgoals.
Benchmark environments: grid worlds and quadruped control
To test whether internal RL delivers practical benefits, Google’s researchers evaluated it in hierarchical environments designed to be difficult for traditional RL methods. They focused on two families of tasks with sparse rewards and long action sequences:
- Discrete grid world: A structured, symbolic environment requiring an agent to navigate and solve tasks using long sequences of decisions.
- Continuous control with a quadrupedal “ant” robot: A physics-based control problem where a four-legged agent must coordinate many joints over time to achieve a goal, again with sparse feedback.
These environments are deliberately hostile to naive exploration. Standard baselines, including the GRPO algorithm and CompILE, failed to learn the tasks within a million episodes. The difficulty arises from credit assignment: with rewards so sparse and horizons so long, it is extremely hard to determine which individual low-level actions contributed to eventual success.
Internal RL, by contrast, achieved high success rates with far fewer training episodes. By moving exploration to the high-level goal space, the metacontroller drastically reduced the effective search space. Instead of sampling over an enormous number of token- or action-level sequences, it chose among a smaller set of internal high-level options, each of which triggered a full sequence of low-level behaviors in the base model.
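The size of that reduction is easy to underestimate. The count below compares the two search spaces under assumed numbers that are not taken from the paper: 4 primitive actions over a 20-step horizon, versus 3 high-level options chosen at each of 4 subgoal boundaries.

```python
def count_sequences(choices_per_decision: int, num_decisions: int) -> int:
    """Number of distinct sequences when each decision has a fixed number of choices."""
    return choices_per_decision ** num_decisions


flat = count_sequences(choices_per_decision=4, num_decisions=20)     # primitive actions
abstract = count_sequences(choices_per_decision=3, num_decisions=4)  # high-level options
print(f"flat action sequences:   {flat:,}")      # 1,099,511,627,776
print(f"abstract plan sequences: {abstract:,}")  # 81
print(f"reduction factor:        {flat / abstract:.1e}")
```

With only a few dozen abstract plans to try, a sparse reward is enough to tell good plans from bad ones, which is the intuition behind the sample-efficiency gap reported above.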
This change made credit assignment tractable. The system could associate success or failure with a relatively small number of abstract choices rather than millions of low-level variations. In the frozen-base setup, the metacontroller even learned to align its internal switching mechanism almost perfectly with the ground-truth boundaries between subgoals—discovering natural “checkpoints” in the task without any human-provided labels.
Why the ‘frozen model + metacontroller’ recipe matters
The finding that a frozen base model outperforms joint training has direct implications for how practitioners might apply internal RL in real-world systems.
First, it supports a modular pipeline: train a strong base model on large-scale behavioral or language data, then specialize it for long-horizon reasoning by learning a metacontroller on top, without disturbing the base weights. This is appealing in enterprise settings where base models are expensive to retrain and must remain stable for safety and compliance reasons.
Second, it suggests that high-level reasoning capabilities may be more about accessing and organizing what the model already represents than about fundamentally changing the model itself. If the base model encodes rich patterns of behavior and reasoning, a relatively small, focused controller may be enough to unlock them for complex agentic tasks.
Finally, the failure of joint training to discover useful abstractions underscores a familiar challenge in hierarchical RL: simultaneously learning low-level skills and high-level control often leads to entangled, unstable representations. Internal RL’s success with a frozen base hints that decoupling these stages—pretraining first, then learning internal control—may be a more reliable path to hierarchy in practice.
Implications for reasoning models, robotics, and multimodal agents
Many current “reasoning” models emphasize explicit chain-of-thought: verbose intermediate steps in natural language that make the reasoning process more transparent. Google’s work points in a different direction: internal reasoning that never needs to be written out as tokens may be both feasible and more efficient.
As Schimpf notes, their study contributes to a growing body of evidence that internal reasoning can outperform token-based approaches, especially for long-horizon tasks. The metacontroller’s “silent thoughts” operate on internal activations, not text, and can be decoupled from specific input or output modalities. That decoupling could be especially important for multimodal AI systems that must jointly reason over language, perception, and control signals.
For autonomous agents and robotics, the potential upside is clear. If an internal controller can set abstract goals—such as subgoals in a navigation or manipulation task—while the base model handles the granular control signals or textual outputs, then agents could plan further ahead with fewer samples and less human intervention. This is directly relevant to enterprise ambitions around autonomous code agents, workflow orchestrators, and physical robots operating in complex environments.
The research does not claim to have solved long-horizon reasoning in open-ended real-world settings, nor does it provide production-ready recipes for every domain. Its experiments are confined to controlled benchmark environments, and the source description does not cover results beyond them. But the conceptual shift is notable: instead of asking models to speak their reasoning into existence token by token, internal RL tries to tap into what they already represent internally and steer that process.
If this paradigm scales, prompting strategies and decoding tricks may matter less than our ability to interface with a model’s hidden state. For technical leaders, that raises a new class of design questions: how to structure behavioral pretraining, how to expose internal state to controllers safely, and how to evaluate internal reasoning that may never surface as text. The answers will shape how far long-horizon AI agents—and the systems built on top of them—can ultimately go.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.





