Running large language models (LLMs) at 100K–200K+ context is still brutally expensive. Even architectures that use sparse attention hit new bottlenecks once sequence lengths grow into the hundreds of thousands of tokens. A new technique from researchers at Tsinghua University and Z.ai, called IndexCache, targets one of those bottlenecks inside DeepSeek Sparse Attention (DSA) models and delivers substantial speedups without sacrificing quality.
IndexCache removes up to 75% of redundant computation in the DSA indexer path and, at 200,000 tokens, delivers up to 1.82x faster time-to-first-token and 1.48x faster generation throughput on real models such as GLM-4.7 Flash. Early tests on the 744B-parameter GLM-5 show similar gains at production scale.
This article unpacks what’s actually happening under the hood, how IndexCache differs from KV cache compression, and what it means for engineering teams operating long-context LLM services.
Why long-context attention still hurts at inference time
Self-attention is the workhorse of modern transformers: every token attends to every preceding token to compute the next-step representation. In the vanilla formulation, compute and memory scale quadratically with sequence length L (O(L²)). At context windows of 100K–200K tokens, this quadratic scaling becomes the dominant factor in both latency and cost.
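To make the quadratic term concrete, here is a back-of-the-envelope calculation of how the attention score matrix grows with sequence length. The function name and the fp16 materialization assumption are illustrative, not from the paper:

```python
# Back-of-the-envelope: the L x L attention score matrix grows
# quadratically with sequence length, per head and per layer.
def score_matrix_entries(seq_len: int) -> int:
    """Number of query-key score entries for one head at one layer."""
    return seq_len * seq_len

for L in (8_000, 100_000, 200_000):
    entries = score_matrix_entries(L)
    # If the full matrix were materialized at fp16 (2 bytes/entry):
    gib = entries * 2 / 1024**3
    print(f"L={L:>7,}: {entries:.2e} entries (~{gib:,.1f} GiB at fp16)")
```

Doubling the context from 100K to 200K tokens quadruples this cost, which is why dense attention becomes the dominant factor at these lengths.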
To make such workloads viable, many architectures move to sparse attention. Instead of attending to the full prefix, each query token selects a smaller, relevant subset of keys/values to attend to. DeepSeek Sparse Attention (DSA), introduced in the DeepSeek-V3.2 family and used in newer DeepSeek and GLM models, is a concrete realization of this idea.
DSA augments each transformer layer with a lightweight “lightning indexer” module. For a given layer, the indexer scores all preceding tokens and decides which subset matters most. The core attention mechanism then runs only on those selected tokens. The result: the heavy attention computation is reduced from quadratic to effectively linear in sequence length, dramatically improving speed and resource usage compared with dense attention.
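The exact indexer architecture is not detailed here, but the score-then-select pattern it implements can be sketched in a few lines. The scoring network itself is abstracted into an input array; `select_topk_indices` is a hypothetical helper name:

```python
import numpy as np

def select_topk_indices(index_scores: np.ndarray, k: int) -> np.ndarray:
    """Given indexer scores over all preceding tokens, keep the top-k.

    `index_scores` has shape (num_queries, seq_len); the real DSA
    indexer produces these with a lightweight scoring network, which
    is abstracted away here.
    """
    k = min(k, index_scores.shape[-1])
    # argpartition avoids a full O(L log L) sort per query.
    topk = np.argpartition(-index_scores, k - 1, axis=-1)[:, :k]
    return np.sort(topk, axis=-1)  # sorted indices for cache-friendly gathers

# Toy example: 2 queries over an 8-token prefix, keeping 3 tokens each.
rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 8))
indices = select_topk_indices(scores, k=3)
print(indices.shape)  # (2, 3)
```

The core attention kernel then gathers only the selected keys/values, which is where the quadratic-to-linear reduction comes from.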
However, even this optimized setup hides a cost center that becomes visible only when you push the context far enough.
While DSA’s core attention path is linear, its indexer path isn’t free. At every layer, the indexer still needs to look at all preceding tokens to decide which ones to keep. That selection step scales quadratically with sequence length, even if the computation is lighter than full attention.
For modest sequence lengths, the indexer overhead is tolerable. But as you move to very long contexts—think 200K tokens and beyond—the accumulated cost of running these indexers at every layer becomes dominant, particularly during the prefill stage (processing the input prompt before generating the first token). The researchers found that this indexer work can “skyrocket” with longer contexts, slowing models even when the core attention is already sparsified.
In other words, DSA solves the main attention bottleneck but leaves a second-order bottleneck in place: repeated indexer computations across layers that are highly redundant.
Key insight: adjacent layers mostly care about the same tokens
The research team observed an empirical regularity in DSA models: adjacent layers tend to select very similar sets of important tokens. When they examined the subsets that different layers picked, they found that neighboring layers shared between 70% and 100% of their selected tokens.
This suggests that, for long-context workloads, much of the indexer’s work is recomputing almost the same answer over and over again as you move up the stack of transformer layers. That redundancy is exactly what IndexCache exploits.
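The overlap statistic behind this observation is straightforward to compute. A minimal sketch, with hypothetical layer index sets standing in for real indexer outputs:

```python
def index_overlap(indices_a: set[int], indices_b: set[int]) -> float:
    """Fraction of layer A's selected tokens that layer B also selects."""
    if not indices_a:
        return 1.0
    return len(indices_a & indices_b) / len(indices_a)

# Toy example: two adjacent layers each keeping 8 of 100 prefix tokens.
layer_3 = {2, 7, 15, 31, 42, 57, 80, 99}
layer_4 = {2, 7, 15, 31, 42, 57, 81, 99}  # differs in one token
print(index_overlap(layer_3, layer_4))  # 0.875
```

Measured across adjacent layer pairs in real DSA models, the researchers report this kind of overlap landing between 70% and 100%.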
Instead of treating each layer as independently deciding its token subset, the technique asks: if layers mostly agree on what matters, can some of them simply reuse the decisions made earlier without re-running the scoring logic?
How IndexCache works: F layers, S layers, and index reuse

IndexCache restructures the role of the indexer across layers by introducing two layer types:
- Full (F) layers: These retain their indexers. They compute fresh token importance scores and select the subset of tokens to be attended to.
- Shared (S) layers: These drop the indexer entirely. Instead of re-scoring, they reuse the indices cached from the most recent preceding F layer.
During inference, the model’s behavior becomes conditional on the layer type:
- On an F layer, the indexer runs, selects important tokens, and stores the indices.
- On an S layer, the model skips indexer computation and directly uses the cached indices from the last F layer.
Because adjacent layers tend to select highly overlapping token subsets, this reuse introduces minimal loss in relevance while eliminating a large fraction of indexer compute. In practice, the researchers report that they can safely remove about 75% of the indexers without degrading downstream performance in long-context benchmarks.
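The control flow above is the essence of the technique. Here is a minimal simulation in which layer internals are stubbed out and only the F/S caching logic is real; `run_layers` and its callbacks are illustrative names, not the released API:

```python
# Minimal simulation of cross-layer index reuse. Layer internals are
# stubbed out; only the F/S caching control flow is modeled.

def run_layers(layer_types, compute_indices, run_attention):
    """layer_types: e.g. ["F", "S", "S", "F", ...].

    compute_indices(layer_idx) -> selected token indices (F layers only).
    run_attention(layer_idx, indices) -> per-layer attention work (stubbed).
    Returns how many times the indexer actually ran.
    """
    cached = None
    indexer_calls = 0
    for i, kind in enumerate(layer_types):
        if kind == "F":                # Full layer: run the indexer, refresh cache
            cached = compute_indices(i)
            indexer_calls += 1
        assert cached is not None, "first layer must be an F layer"
        run_attention(i, cached)       # S layers reuse the cached indices as-is
    return indexer_calls

# Keeping 1 indexer in 4 mirrors the ~75% removal ratio the paper reports.
types = ["F", "S", "S", "S"] * 8      # 32 layers, 8 surviving indexers
calls = run_layers(types, lambda i: [0, 1, 2], lambda i, idx: None)
print(calls)  # 8
```

With this layout, 24 of 32 indexer invocations disappear per forward pass, which is exactly the redundant compute IndexCache targets.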
Crucially, IndexCache is not another KV cache trick. KV cache optimization works by compressing or sharing the key/value tensors themselves to reduce memory usage. IndexCache, by contrast, attacks the computation in the indexer. As co-author Yushi Bai emphasizes, it “reduces computation rather than just memory footprint” and is complementary to KV cache compression. Teams can layer both techniques for additional benefit.
Training-free vs. training-aware deployment approaches

The researchers describe two ways to bring IndexCache into real models, depending on whether you can modify training or not. Both approaches currently apply to architectures using DSA, such as recent DeepSeek and GLM series models.
Training-free, greedy layer selection
For teams using off-the-shelf DSA models where full retraining is infeasible, IndexCache offers a training-free deployment path. The key idea is to determine which layers should be F layers and which can be S layers via a calibration procedure.
The method uses a “greedy layer selection” algorithm:
- Run a small calibration dataset through the model.
- Greedily search over layer configurations (which layers keep their indexers), removing as many indexers as possible while output quality on the calibration set is preserved.
This search happens without updating model weights. Empirical results show that this greedy algorithm can remove roughly 75% of indexers while matching the original model’s downstream performance on long-context benchmarks. That makes it attractive for production teams that want a drop-in optimization with no retraining loop.
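The paper's exact search procedure is not spelled out here, but the generic greedy pattern it describes looks roughly like this. `eval_quality` stands in for a calibration-set evaluation, and the names and stopping rule are assumptions:

```python
def greedy_select_shared_layers(num_layers, eval_quality, tolerance):
    """Greedily convert F layers to S layers while calibration quality holds.

    eval_quality(layer_types) -> score on a calibration set (higher is better).
    This is a generic sketch of the greedy pattern, not the paper's exact
    algorithm. Layer 0 must stay F: an S layer needs a preceding F layer
    whose cached indices it can reuse.
    """
    types = ["F"] * num_layers
    baseline = eval_quality(types)
    improved = True
    while improved:
        improved = False
        best_i, best_score = None, baseline - tolerance
        for i in range(1, num_layers):
            if types[i] != "F":
                continue
            trial = types.copy()
            trial[i] = "S"          # try dropping this layer's indexer
            score = eval_quality(trial)
            if score >= best_score:  # acceptable quality, best so far
                best_i, best_score = i, score
        if best_i is not None:
            types[best_i] = "S"
            improved = True
    return types

# Toy objective: only the indexers at layers 0 and 4 actually matter.
def toy_quality(types):
    return 50.0 - sum(1.0 for i in (0, 4) if types[i] == "S")

layout = greedy_select_shared_layers(8, toy_quality, tolerance=0.5)
print(layout)  # ['F', 'S', 'S', 'S', 'F', 'S', 'S', 'S']
```

On the toy objective, the search converges to keeping exactly the two indexers that matter, removing 75% of them, the same ratio the paper reports on real benchmarks.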
Training-aware, multi-layer distillation loss
For teams that are pre-training or substantially fine-tuning their own DSA-based foundation models, there is a deeper training-aware option. Here, the model is explicitly optimized to support cross-layer index sharing.
The training-aware approach introduces a “multi-layer distillation loss” during training. This loss encourages each retained (F-layer) indexer to learn a token subset that remains highly relevant for the stack of S layers that reuse its indices. Over time, the F layers converge to a kind of consensus view of which tokens matter not just for themselves, but for all subsequent layers they serve.
While the paper does not provide full architectural or loss-function details, the key point for practitioners is that there is a path to bake IndexCache's sharing assumptions directly into training, potentially unlocking even cleaner trade-offs at very large scale.
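Since the loss formula is not published, here is one plausible form such a multi-layer distillation objective could take, written as a sketch: the retained F-layer indexer's score distribution is pulled toward the token-importance distributions of every layer that will reuse its indices. All names and the KL formulation are assumptions:

```python
import numpy as np

def multi_layer_distill_loss(f_scores, teacher_scores_per_layer):
    """One plausible multi-layer distillation objective (details are not
    published). The retained F-layer indexer (student) is trained so its
    token-score distribution covers the importance distributions of every
    layer it will serve (teachers)."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(f_scores)  # student: the retained indexer's distribution
    loss = 0.0
    for t in teacher_scores_per_layer:
        q = softmax(t)
        # KL(q || p): penalize the student for missing tokens a served layer needs
        loss += np.sum(q * (np.log(q + 1e-9) - np.log(p + 1e-9)))
    return loss / len(teacher_scores_per_layer)
```

When the served layers already agree with the F layer, this loss is near zero; disagreement on which tokens matter is what gets penalized, pushing the F layers toward the consensus view described above.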
Measured speedups on GLM-4.7 Flash and GLM-5
The team evaluated IndexCache on real deployed models, not just toy setups. On the 30B-parameter GLM-4.7 Flash model, they compared the baseline DSA implementation with one where 75% of the indexers were removed using the training-free method.
At a 200K-token context length, they observed:
- Prefill latency dropped from 19.5 seconds to 10.7 seconds—a 1.82x speedup in time-to-first-token.
- Decode throughput per request improved from 58 tokens/s to 86 tokens/s—a 1.48x speedup at the same context.
- When server memory was saturated with concurrent requests, total decode throughput increased by up to 51%.
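The reported latency and throughput numbers are consistent with the headline speedup factors; a quick arithmetic check:

```python
# Sanity-checking the reported GLM-4.7 Flash figures at 200K context.
prefill_speedup = 19.5 / 10.7   # baseline seconds / optimized seconds
decode_speedup = 86 / 58        # optimized tok/s / baseline tok/s

print(f"prefill: {prefill_speedup:.2f}x")  # 1.82x
print(f"decode:  {decode_speedup:.2f}x")   # 1.48x
```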
Preliminary experiments on the production-scale 744B-parameter GLM-5 model showed similar directional results. Removing 75% of indexers with the training-free method yielded at least a 1.3x speedup on contexts above 100K tokens, while maintaining nearly the same quality on long-context tasks.
These numbers are especially relevant for operators running high-throughput RAG systems, document analytics, and complex agent pipelines, where long contexts are the norm rather than the exception.
Accuracy and reasoning: what do you give up?
Speedups are only meaningful if they don’t quietly erode model quality. The researchers report that, with 75% of indexers removed using the training-free approach, the 30B GLM-4.7 Flash model essentially matches the original DSA baseline on long-context benchmarks.
The baseline model scored 50.2 on average across these benchmarks, while the IndexCache-optimized model scored 49.9—well within typical variance for LLM evaluations. On AIME 2025, a demanding math reasoning benchmark, the optimized model actually did slightly better, scoring 92.6 versus the baseline’s 91.0.
For the 744B GLM-5 tests, the paper notes that long-context task quality remained “nearly identical” after indexer removal at 100K+ token contexts. The data presented focuses primarily on overall averages rather than per-task breakdowns, but the headline takeaway is clear: IndexCache delivers significant latency and throughput gains without observable degradation in the reported evals.
That said, the published results focus on the tested workloads and benchmarks. If you operate highly specialized domains or safety-critical systems, you would still want to validate behavior under your own distributions before rolling this out widely.
Practical deployment considerations for engineering teams

For teams that want to adopt IndexCache via the training-free path, the workflow looks manageable but not entirely “flip-a-switch.” The greedy layer-selection algorithm depends strongly on the calibration data you feed it.
Co-author Yushi Bai recommends using domain-specific calibration sets so that the discovered pattern of F and S layers reflects real production workloads. For example, if your primary use case is legal document review, the calibration corpus should mirror those documents rather than generic web text. The better the calibration data, the more likely you’ll maintain quality while maximizing indexer removal.
Once the calibration step is done, integration is relatively straightforward. The team has released open-source patches on GitHub for major inference engines, including vLLM and SGLang. With those patches, you can:
- Apply IndexCache to existing DSA-based models in your stack.
- Configure F/S layer layouts based on calibration outputs.
- Enable the optimization with minimal configuration changes to your serving infrastructure.
From an operations standpoint, the expected impact is directly on ROI and user experience. Bai notes that for long-context workloads—RAG pipelines, document analysis, agentic systems—the team observes at least ~20% reduction in deployment cost and similar improvements in user-perceived latency. For very short context tasks, the benefit is more modest, around 5%, reflecting the fact that indexer overhead is less dominant when sequences are small.
IndexCache therefore makes the most sense in environments where long-context traffic is significant and latency budgets are tight.
What IndexCache signals about future model design
Beyond the immediate performance wins, IndexCache reflects a broader shift in how foundation models are being architected. Rather than treating inference-time constraints as an afterthought, the work assumes those constraints from the start and designs mechanisms to exploit structural redundancies—here, cross-layer similarity in token importance.
The authors argue that future models are likely to be architected with inference constraints in mind by default. That means designing not just for parameter count and training scalability, but also for predictable, efficient throughput and latency under realistic, long-context workloads.
For ML engineers and infra teams, the implication is that techniques like sparse attention, index reuse, KV cache optimization, and similar structural tricks will become first-class design elements rather than optional optimizations. As context windows continue to grow, the models that win in production may be those that can aggressively reuse work—across tokens, layers, and even requests—without sacrificing the behavior that users rely on.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.