Why AI Agents Need Hierarchical Memory to Cut Costs

The Memory Problem Hitting Enterprise AI

Here’s a scenario playing out in companies across the tech world right now: you build an AI assistant that customers actually want to use. It handles support tickets, answers questions, remembers preferences. Then, after a few weeks of use, something breaks. The answers get worse. Response times crawl. Your costs spike. What went wrong?

The culprit isn’t the model itself; it’s how the system remembers. Most agent memory structures simply weren’t built for a sustained, multi-session workload in which users stay engaged across weeks or months of conversation. This is the core challenge that xMemory, a new technique developed by researchers at King’s College London and The Alan Turing Institute, was built to solve.

So what exactly is going wrong with standard approaches? And why does hierarchical memory cut token costs nearly in half while actually improving answer quality? Let’s unpack this.

Why standard RAG collapses in long conversations

If you’ve worked with LLM applications, you’ve likely used retrieval-augmented generation, or RAG. The standard approach works like this: store past dialogues, retrieve a fixed number of top matches based on embedding similarity, then concatenate them into the context window to generate answers.

This works beautifully for large document databases where the retrieved content is highly diverse. But here’s the catch: an AI agent’s memory isn’t a diverse database. It’s a continuous, tightly correlated stream of conversation. The chunks are frequently near-duplicates, and they are heavily interconnected through references, ellipsis, and timeline dependencies.

Consider a user who has mentioned oranges, mandarins, and citrus fruits across different conversations. Traditional RAG treats all of these as semantically close. “If retrieval collapses onto whichever cluster is densest in embedding space, the agent may get many highly similar passages about preference, while missing the category facts needed to answer the actual query,” Lin Gui, co-author of the paper, explained to VentureBeat.
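To make the failure concrete, here is a minimal sketch of fixed top-k retrieval over a toy memory. The two-dimensional "embeddings" are hand-made values chosen for illustration, not output from a real embedding model, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_retrieve(query_vec, memory, k=3):
    """Return the k stored texts whose embeddings are closest to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, passages):
    """Concatenate retrieved passages into the context, as standard RAG does."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Toy memory with assumed 2-D embeddings (illustrative values only).
memory = [
    ("User said they like oranges.", [0.90, 0.10]),
    ("User said they love mandarins.", [0.88, 0.12]),
    ("User said they enjoy citrus fruit.", [0.86, 0.14]),
    ("Mandarins and oranges are both citrus.", [0.50, 0.50]),  # the category fact
]
query_vec = [1.0, 0.0]  # assumed embedding of the question below

passages = top_k_retrieve(query_vec, memory, k=3)
print(build_prompt("Is the user's favorite fruit a citrus?", passages))
```

In this toy setup all three slots go to near-duplicate preference statements, and the one category fact the question needs never reaches the prompt.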

Teams often try to fix this with post-retrieval pruning or compression, filtering out what looks like noise. But these methods assume that retrieved passages are diverse and that irrelevant noise can be cleanly separated from useful facts. In conversational memory, that assumption breaks. Human dialogue is “temporally entangled,” as the researchers put it: later turns lean on earlier ones, so a passage that looks redundant in isolation may carry the reference or timeline detail that another passage depends on. Pruning tools can therefore delete important context, leaving the AI without the thread it needs to reason accurately.

That’s the core problem: standard RAG wasn’t built for this. And the fixes most teams reach for actually make things worse.

Decoupling Conversations Into a Hierarchy

The researchers propose a fundamentally different approach. Instead of matching user queries directly against raw, overlapping chat logs, the system organizes conversation into a hierarchical structure. This “decoupling to aggregation” strategy separates the conversation stream into distinct, standalone semantic components, then aggregates them into higher-level themes.

When the AI needs to recall information, it searches top-down through the hierarchy, from themes to semantics to raw snippets. This limits redundancy: two dialogue snippets with similar embeddings won’t be retrieved together when they’ve been assigned to different semantic components.
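A rough sketch of that top-down search might look like the following. The class names, the dot-product scoring, and the toy vectors are all assumptions made for illustration; the article does not describe xMemory's actual data structures or scoring.

```python
from dataclasses import dataclass, field

def dot(a, b):
    """Dot product, used here as a stand-in for embedding similarity."""
    return sum(x * y for x, y in zip(a, b))

@dataclass
class Semantic:
    fact: str
    vec: list
    snippets: list = field(default_factory=list)  # raw dialogue backing this fact

@dataclass
class Theme:
    name: str
    vec: list
    semantics: list = field(default_factory=list)

def top_down_search(query_vec, themes, n_themes=2, n_semantics=1):
    """Pick the best themes first, then the best semantic components in each."""
    hits = []
    best_themes = sorted(themes, key=lambda t: dot(query_vec, t.vec), reverse=True)[:n_themes]
    for theme in best_themes:
        best = sorted(theme.semantics, key=lambda s: dot(query_vec, s.vec), reverse=True)
        for sem in best[:n_semantics]:
            # Snippets are reached only through a selected component, so a
            # look-alike snippet filed under an unselected component stays out.
            hits.append((theme.name, sem.fact, sem.snippets))
    return hits

# Hypothetical memory with hand-made vectors.
themes = [
    Theme("food preferences", [1.0, 0.0], [
        Semantic("User prefers citrus fruit.", [0.9, 0.1],
                 ["I love mandarins.", "Oranges are my favorite."]),
        Semantic("User dislikes bananas.", [0.2, 0.1], ["Bananas are too mushy."]),
    ]),
    Theme("food facts", [0.6, 0.4], [
        Semantic("Mandarins are a citrus fruit.", [0.7, 0.3], ["Aren't mandarins citrus?"]),
    ]),
    Theme("billing", [0.0, 1.0], [
        Semantic("User is on the annual plan.", [0.0, 0.9], ["I switched to annual billing."]),
    ]),
]

hits = top_down_search([1.0, 0.2], themes)
for theme_name, fact, snippets in hits:
    print(theme_name, "|", fact, "|", snippets)
```

Because selection happens at the theme and component level before any raw text is touched, the result covers two different topics instead of three variations of the same preference statement.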

Raw messages to themes: the four-level structure

Enter xMemory, a framework that combines structured memory management with an adaptive, top-down search strategy. The system continuously organizes the raw conversation stream into a four-level hierarchy.

At the base are raw messages: the actual user inputs and agent responses. Contiguous blocks of these messages get summarized into “episodes,” much like condensing a week’s worth of related chats into a single summary.

From episodes, the system distills reusable facts, which the framework calls semantics. This is the step that disentangles core, long-term knowledge from repetitive chat logs. Those semantics then get grouped into high-level themes to make them easily searchable.
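One way to picture the four levels is as nested records, each pointing down to the level that supports it. The sketch below is only a data-model illustration under that assumption; the field names and the `raw_evidence` helper are hypothetical and not taken from xMemory.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:            # level 1: raw user input or agent response
    role: str
    text: str

@dataclass
class Episode:            # level 2: summary of a contiguous block of messages
    summary: str
    messages: List[Message] = field(default_factory=list)

@dataclass
class Semantic:           # level 3: reusable fact distilled from episodes
    fact: str
    episodes: List[Episode] = field(default_factory=list)

@dataclass
class Theme:              # level 4: group of related facts
    name: str
    semantics: List[Semantic] = field(default_factory=list)

def raw_evidence(theme: Theme) -> List[str]:
    """Walk from a theme down to the raw messages that support it."""
    return [m.text for s in theme.semantics for e in s.episodes for m in e.messages]

# Hypothetical example data.
msgs = [Message("user", "I love mandarins."), Message("agent", "Noted, mandarins it is.")]
episode = Episode("User talked about favorite fruit.", msgs)
semantic = Semantic("User prefers citrus fruit.", [episode])
theme = Theme("food preferences", [semantic])

print(raw_evidence(theme))
```

Keeping the downward links means a retriever can answer from the short fact alone and only follow the chain to raw messages when it needs the detail.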

Here’s the key part: xMemory uses an objective function to continually optimize how it groups these items. The objective keeps categories from becoming too bloated, which slows down search, or too fragmented, which weakens the model’s ability to aggregate evidence and answer questions.
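The article does not give the objective itself, so the following is purely a toy stand-in that shows the kind of trade-off involved. The cost function is invented for illustration: group size is penalized quadratically (bloat) and each extra group pays a flat fee (fragmentation).

```python
def grouping_cost(themes, fragment_penalty=4.0):
    """Toy cost, NOT xMemory's objective: big groups and many groups both cost."""
    bloat = sum(len(facts) ** 2 for facts in themes)   # grows fast with group size
    fragmentation = fragment_penalty * len(themes)     # grows with number of groups
    return bloat + fragmentation

facts = [f"fact-{i}" for i in range(8)]

one_big = [facts]                                        # everything in one theme
singletons = [[f] for f in facts]                        # every fact is its own theme
balanced = [facts[i:i + 2] for i in range(0, 8, 2)]      # four themes of two facts

for name, grouping in [("one_big", one_big), ("singletons", singletons), ("balanced", balanced)]:
    print(name, grouping_cost(grouping))
```

Under this made-up cost, the balanced grouping scores lower than either extreme, which is the behavior the paragraph above describes: neither one bloated category nor a scatter of tiny ones.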

The reported results bear this out. xMemory drops token usage from over 9,000 to roughly 4,700 tokens per query on certain tasks, a reduction of nearly half, while actually improving answer quality and long-range reasoning across various LLMs.
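As a quick back-of-the-envelope check on those figures, the snippet below computes the per-query saving and what it could mean for a monthly bill. The token counts come from the paragraph above (treating "over 9,000" as 9,000); the query volume and per-token price are placeholders for illustration only, so substitute your own numbers.

```python
def token_savings(before: int, after: int) -> float:
    """Fraction of per-query tokens saved."""
    return 1 - after / before

def monthly_input_cost(tokens_per_query: int, queries: int, usd_per_million: float) -> float:
    """Input-token spend for a month of traffic."""
    return tokens_per_query * queries / 1_000_000 * usd_per_million

savings = token_savings(9_000, 4_700)
print(f"Tokens saved per query: {savings:.1%}")

# Placeholder traffic and price, not real figures.
QUERIES_PER_MONTH = 1_000_000
USD_PER_MILLION_TOKENS = 3.00

before_cost = monthly_input_cost(9_000, QUERIES_PER_MONTH, USD_PER_MILLION_TOKENS)
after_cost = monthly_input_cost(4_700, QUERIES_PER_MONTH, USD_PER_MILLION_TOKENS)
print(f"Before: ${before_cost:,.0f}  After: ${after_cost:,.0f}")
```

The saving works out to just under 48 percent per query, and because input cost scales linearly with tokens, the same ratio carries over to the bill at any traffic level or price.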

Uncertainty Gating—retrieving only what matters

When xMemory receives a prompt, it performs a top-down retrieval across the hierarchy. It starts at the theme and semantic levels, selecting a diverse, compact set of relevant facts. This matters because real-world queries often require gathering information across multiple topics or chaining connected facts together for complex, multi-hop reasoning.

Once it has this high-level skeleton of facts, the system controls redundancy through what the researchers call “Uncertainty Gating.” It only drills down to pull finer, raw evidence at the episode or message level if that specific detail measurably decreases the model’s uncertainty. It stops expanding when it detects that adding more detail no longer helps answer the question.
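The gating idea can be sketched as a simple loop. Everything here is an assumption for illustration: `gated_expand`, the `min_gain` threshold, the stop-at-first-unhelpful-candidate rule, and especially `toy_uncertainty`, which stands in for a real model-based uncertainty estimate by counting which needed facts are still missing.

```python
def gated_expand(skeleton, candidates, uncertainty, min_gain=0.5):
    """Add raw evidence only while each piece lowers uncertainty by at least min_gain."""
    context = list(skeleton)
    current = uncertainty(context)
    for evidence in candidates:          # assumed to be pre-ranked by similarity
        trial = uncertainty(context + [evidence])
        if current - trial < min_gain:
            break                        # more detail no longer helps: stop expanding
        context.append(evidence)
        current = trial
    return context

# Toy uncertainty: how many kinds of fact the question needs but the context lacks.
NEEDED = {"preference", "category"}

def toy_uncertainty(context):
    covered = {tag for _, tag in context}
    return len(NEEDED - covered)

skeleton = [("User prefers citrus fruit.", "preference")]
candidates = [
    ("Mandarins are a citrus fruit.", "category"),
    ("I love mandarins!", "preference"),
    ("Oranges are my favorite.", "preference"),
]

context = gated_expand(skeleton, candidates, toy_uncertainty)
for text, tag in context:
    print(tag, "|", text)
```

In this toy run the category fact is admitted because it removes the remaining uncertainty, while the two extra preference snippets are left out even though they would rank as highly similar. In a real system the uncertainty function would be far more expensive than this counter, so how often it gets called is itself a design consideration.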

Think of it this way: “Semantic similarity is a candidate-generation signal; uncertainty is a decision signal,” Gui said. “Similarity tells you what is nearby. Uncertainty tells you what is actually worth paying for in the prompt budget.”

This is the part that directly impacts your token costs. Rather than blindly pulling the top-k most similar chunks, the system intelligently decides whether each additional piece of context actually moves the needle on answer quality.

What This Means for Your AI Applications

If you’re building AI agents today, this matters directly. The difference between flat memory designs (like MemGPT, which logs raw dialogue or minimally processed traces) and hierarchical designs (like xMemory) isn’t just architectural—it’s financial.

Flat approaches capture everything, but they accumulate massive redundancy as history grows longer. Every retrieved chunk costs tokens. Structured systems like A-MEM and MemoryOS organize memories into hierarchies, but they still rely on raw or minimally processed text as their primary retrieval unit, often pulling in extensive, bloated contexts. They also depend heavily on LLM-generated memory records with strict schema constraints—a single formatting deviation can cause memory failure.

xMemory addresses these limitations through its optimized memory construction scheme, hierarchical retrieval, and dynamic restructuring as memory grows larger.

When to choose xMemory over standard RAG

According to Gui, “xMemory is most compelling where the system needs to stay coherent across weeks or months of interaction.”

Customer support agents benefit greatly. They must remember stable user preferences, past incidents, and account-specific context without repeatedly pulling up near-duplicate support tickets. Personalized coaching is another ideal use case, requiring the AI to separate enduring traits from momentary states across long relationships.

If your agent only needs to recall information within a single session or short window, standard RAG might still serve you well. But for persistent, multi-session deployments where context quality matters and costs are a concern, hierarchical memory isn’t just an optimization—it’s becoming a necessity.

The Takeaway for Building Smarter Agents

Here’s what to take away from this: standard retrieval wasn’t built for the continuous, correlated nature of conversational memory. The fixes teams commonly apply—post-retrieval pruning, compression—assume diversity that doesn’t exist in dialogue data. They end up cutting vital context.

The path forward is structural. By organizing conversation into semantic hierarchies—raw messages, episodes, facts, and themes—you enable intelligent, top-down retrieval that pulls only what actually reduces uncertainty. This approach cuts token costs roughly in half while improving the coherence and reasoning quality of your AI agents.

For developers building persistent AI assistants, this isn’t a niche optimization—it’s a foundational shift in how memory should work. The agents that get this right will be the ones that scale.

Cary Huang

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.