IndexCache: How Reusing Sparse Attention Indices Speeds Up Long-Context LLM Inference
Running large language models (LLMs) at 100K–200K+ context is still brutally expensive. Even architectures that use sparse attention hit new bottlenecks once sequence lengths grow into the hundreds of thousands of tokens. A new technique from researchers at Tsinghua University…
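
The excerpt above only names the core idea, so here is a minimal sketch of what "reusing sparse attention indices" could look like at decode time, assuming the cached index is a top-k selection over the KV cache that is recomputed only every few steps and reused in between. All names here (`topk_indices`, `sparse_attention`, `refresh_every`) and the refresh policy are hypothetical illustrations, not the paper's actual mechanism.

```python
import numpy as np

def topk_indices(q, K, k):
    """Score every cached key against the current query and keep the top-k.

    Only the ranking matters here, so the scores are left unscaled.
    This is the expensive step: it scans the entire KV cache.
    """
    scores = K @ q                              # (seq_len,) attention scores
    return np.argpartition(scores, -k)[-k:]    # indices of the k largest

def sparse_attention(q, K, V, idx):
    """Attend only over the k cached key/value rows selected by idx."""
    scores = (K[idx] @ q) / np.sqrt(q.shape[0])  # scaled dot-product scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ V[idx]                      # (d,) attention output

def decode(qs, K, V, k=64, refresh_every=4):
    """Decode a sequence of query vectors, recomputing the sparse index
    only every `refresh_every` steps and reusing the cached index otherwise.
    (Hypothetical policy; the actual IndexCache reuse rule is not given
    in the excerpt.)
    """
    outputs, idx = [], None
    for step, q in enumerate(qs):
        if step % refresh_every == 0:
            idx = topk_indices(q, K, k)          # full KV-cache scan (amortized)
        outputs.append(sparse_attention(q, K, V, idx))  # cheap: k rows only
    return np.stack(outputs)
```

Under this assumption, the O(seq_len) scoring pass over the full KV cache is amortized across `refresh_every` decode steps, while each individual step's attention touches only k cached rows — which is where a speedup at 100K+ token contexts would come from.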