Cutting LLM Costs with Semantic Caching: Architecture, Threshold Tuning, and Invalidation in Production

Production LLM usage has a way of quietly turning into a line item that finance starts asking about. One team saw its LLM API bill growing 30% month-over-month, even though traffic wasn’t climbing at the same pace. A closer look at 100,000 queries revealed the main culprit: users repeatedly asking the same questions, just phrased differently.

Exact-match caching barely helped, capturing only 18% of redundant calls. By moving to semantic caching—matching queries by meaning rather than surface text—the team lifted cache hit rate to 67% and cut LLM API costs by 73%, while also reducing average latency by 65%.

This article walks through how they did it: the architecture of semantic caching, how to tune similarity thresholds safely, how to invalidate cached responses, and what the real production numbers look like.

From exploding bills to semantic caching

The initial symptom was familiar: a growing LLM line item with no obvious spike in user volume or new features. When the team dug into logs, common customer-service questions showed up over and over:

  • “What’s your return policy?”
  • “How do I return something?”
  • “Can I get a refund?”

All of these queries hit the LLM separately and produced nearly identical answers. Every time, the application paid full API cost and incurred full LLM latency.

They first tried the simplest lever: exact-match caching keyed by the raw query string. But real users rarely repeat the exact same wording. In a sample of 100,000 production queries, they found:

  • 18% were exact duplicates.
  • 47% were semantically similar to previous queries (same intent, different phrasing).
  • 35% were genuinely new.

Exact-match caching could only capture that 18%. The 47% of semantically similar queries still triggered full LLM calls, even though equivalent responses already existed.

To unlock that missed opportunity, they switched the cache key from text equality to semantic similarity. That change alone took the cache hit rate from 18% to 67%, yielding a 73% drop in LLM spend.

Why exact-match caching is structurally limited

Most teams start with a standard pattern: hash the input text and use it as a cache key. It’s simple, fast, and works well when inputs repeat verbatim. But as soon as you expose an LLM to end users, you inherit natural language variability: synonyms, reordered phrases, different levels of detail.

In the observed workload, fewer than one-fifth of queries were exact duplicates, but almost half were clear rephrasings of prior queries. That 47% of semantically similar queries, out of reach for an exact-match cache, represents pure inefficiency: paying repeatedly for responses that are, for practical purposes, the same answer.

For any LLM-heavy system serving FAQs, support, or search-like use cases, expecting users to hit exact string matches is unrealistic. Exact-match caching will remain bounded by that reality, no matter how aggressively you tune TTLs or cache sizes.
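
To make the limitation concrete, here is a minimal sketch of the hash-the-input pattern. The normalization step, the `exact_key` and `lookup` names, and the placeholder answer are illustrative assumptions, not the team's code.

```python
import hashlib

def exact_key(query: str) -> str:
    """Exact-match cache key: a hash of the lightly normalized query text."""
    normalized = " ".join(query.lower().split())  # lowercase, collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A plain dict stands in for the cache backend (assumption).
cache: dict[str, str] = {}
cache[exact_key("What's your return policy?")] = "<cached answer about returns>"

def lookup(query: str):
    """Return the cached response, or None on a miss."""
    return cache.get(exact_key(query))

# Same wording with different casing and spacing still hits...
print(lookup("what's your  RETURN policy?") is not None)
# ...but a rephrasing with the same intent misses, because the hash differs.
print(lookup("How do I return something?") is None)
```

Normalization recovers trivial variations, but no amount of string cleanup maps a rephrased question onto the same key.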

Semantic caching addresses this by making the cache aware of intent. Instead of asking “have I seen this string?” it asks “have I seen a question like this before?”

Semantic caching architecture

The core idea of semantic caching is simple: represent queries as vectors in an embedding space and retrieve cached responses by nearest-neighbor search instead of string equality.

A basic semantic cache implementation, as deployed by the team, looks like this at a high level:

  • Embedding model: Converts the incoming query text into a vector.
  • Vector store: Stores query embeddings and supports similarity search (e.g., FAISS, Pinecone).
  • Response store: Stores the actual responses and metadata (e.g., Redis, DynamoDB).

On a cache lookup, the flow is:

  1. Embed the incoming query.
  2. Search the vector store for the most similar existing query.
  3. If similarity exceeds a configured threshold, return the cached response.
  4. Otherwise, fall through to the LLM, then store the new query–response pair.
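
The four steps above can be sketched in a few dozen lines. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, cosine similarity is assumed as the similarity measure, two in-memory lists play the roles of the vector store and response store, and `call_llm` is a hypothetical callable supplied by the caller.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

@dataclass
class SemanticCache:
    embed: Callable[[str], Counter]
    threshold: float
    vectors: list = field(default_factory=list)    # role of the vector store
    responses: list = field(default_factory=list)  # role of the response store

    def lookup(self, query: str):
        """Steps 1-3: embed, find the nearest stored query, apply the threshold."""
        q = self.embed(query)
        best_sim, best_idx = 0.0, -1
        for i, v in enumerate(self.vectors):  # brute-force nearest-neighbor search
            sim = cosine(q, v)
            if sim > best_sim:
                best_sim, best_idx = sim, i
        if best_idx >= 0 and best_sim >= self.threshold:
            return self.responses[best_idx], best_sim
        return None, best_sim

    def store(self, query: str, response: str) -> None:
        self.vectors.append(self.embed(query))
        self.responses.append(response)

def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from the cache when possible; otherwise call the LLM and store (step 4)."""
    cached, _ = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.store(query, response)
    return response
```

In a real deployment the brute-force loop would be replaced by an approximate nearest-neighbor index, and the two lists by the vector and response stores named above.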

The key shift is in what counts as a “hit.” Instead of equality of hash(query_text), the system checks proximity in embedding space. Queries like “How do I return something?” and “What is your return policy?” become neighbors, allowing one answer to serve both.

This architecture is not drastically more complex than a traditional cache, but it introduces a new dimension of tuning: deciding how similar two queries must be before you trust that they share the same intent.

Tuning similarity thresholds by query type

The similarity threshold is the critical control knob in semantic caching. It governs the trade-off between:

  • Precision: When the cache returns a response, how often is it actually correct?
  • Recall: Of all similar-intent queries, how many are served from the cache?

Initially, the team set a single global threshold of 0.85, assuming that queries scoring 0.85 or higher in similarity would generally be “the same question.” That assumption broke down quickly.

One concrete failure mode looked like this:

  • New query: “How do I cancel my subscription?”
  • Cached query: “How do I cancel my order?”
  • Similarity: 0.87

Here, the embedding model scored the two texts as similar, but the required actions, and therefore the correct answers, were different. Because 0.87 exceeds the 0.85 threshold, the lookup counted as a cache hit and returned an incorrect response.

Through experimentation and annotation (described in the next section), they discovered that no single global threshold worked well across all query types. Different categories had different tolerance for error:

  • FAQ-style questions: Optimal threshold around 0.94 – wrong answers directly damage trust.
  • Product searches: Optimal threshold around 0.88 – near-misses are more acceptable.
  • Support queries: Optimal threshold around 0.92 – need balance between coverage and accuracy.
  • Transactional queries: Optimal threshold around 0.97 – extremely low error tolerance.

To operationalize this, they added a query classifier and an adaptive cache that selects thresholds per query type. FAQs use a stricter cutoff than search; transactional flows are stricter still. This per-category tuning was essential to keep the false-positive rate (wrong cached answers) at an acceptable 0.8% while still achieving large cost savings.
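
A minimal sketch of per-category threshold selection follows. The threshold values come from the list above. The keyword classifier, the category keys, and the strictest-threshold fallback for unknown categories are assumptions for illustration, since the article does not describe how the team's classifier works.

```python
# Thresholds reported in the article, keyed by hypothetical category names.
THRESHOLDS = {
    "faq": 0.94,
    "product_search": 0.88,
    "support": 0.92,
    "transactional": 0.97,
}
# Assumption: an unrecognized category falls back to the strictest threshold.
DEFAULT_THRESHOLD = max(THRESHOLDS.values())

# Hypothetical keyword rules; a production classifier would likely be a trained model.
KEYWORDS = [
    ("transactional", ("cancel", "charge", "payment")),
    ("support", ("error", "not working", "broken")),
    ("product_search", ("find", "looking for", "show me")),
]

def classify(query: str) -> str:
    """Return the first category whose keywords appear in the query, else 'faq'."""
    text = query.lower()
    for category, words in KEYWORDS:
        if any(w in text for w in words):
            return category
    return "faq"

def is_hit(query: str, similarity: float) -> bool:
    """Accept the nearest cached entry only if it clears the category's threshold."""
    threshold = THRESHOLDS.get(classify(query), DEFAULT_THRESHOLD)
    return similarity >= threshold

# The failure case above: 0.87 passed the global 0.85 cutoff but is rejected here.
print(is_hit("How do I cancel my subscription?", 0.87))
```

Note that the 0.87 cancel-subscription match would be rejected under the FAQ, support, or transactional threshold, so the result does not hinge on which of those categories the classifier picks.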

How to tune thresholds: a practical methodology

Thresholds weren’t guessed; they were derived from labeled data and precision/recall analysis. The process the team used can be applied to other LLM workloads:

  1. Sample query pairs by similarity band.
    They sampled 5,000 pairs of queries spanning similarity scores from 0.80 to 0.99. This ensured coverage of the “grey zone” where intent may or may not match.
  2. Collect human labels for intent.
    Annotators labeled each pair as “same intent” or “different intent.” To control for subjectivity, three annotators reviewed each pair, with majority vote determining the final label.
  3. Compute precision and recall for candidate thresholds.
    For a given threshold T, a pair is treated as a “hit” if its similarity ≥ T. Using the ground truth labels, they computed:
      • Precision: among pairs above the threshold, what fraction truly had the same intent?
      • Recall: among all “same intent” pairs, what fraction fell above the threshold?
    By sweeping different thresholds, they obtained precision/recall curves for each query type.
  4. Pick thresholds based on the cost of errors.
    The last piece is business logic: deciding whether you care more about avoiding wrong answers (precision) or maximizing cache hits (recall).

For example:

  • FAQ questions: They prioritized precision. A threshold of 0.94 yielded 98% precision—i.e., 98% of cached hits were truly same-intent pairs—at the cost of some missed cache opportunities.
  • Search queries: They leaned toward recall. Missing a cache hit here just costs money; it rarely breaks user trust. A threshold of 0.88 provided better coverage of same-intent pairs while keeping accuracy acceptable.

This approach grounds threshold selection in both empirical behavior of the embedding model and the real-world cost profile of mistakes in your application.
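
The methodology above can be sketched as a threshold sweep over labeled pairs. The function names and the six sample pairs are made up for illustration and are not the team's data. The selection rule (lowest threshold that meets a precision target) is one reasonable reading of "prioritize precision", not a documented procedure.

```python
def majority_label(votes: list[bool]) -> bool:
    """Final label for a pair: True if most annotators said 'same intent'."""
    return sum(votes) > len(votes) / 2

def precision_recall(pairs: list[tuple[float, bool]], threshold: float):
    """pairs holds (similarity, same_intent); a pair is a 'hit' if similarity >= threshold."""
    tp = sum(1 for s, same in pairs if s >= threshold and same)
    fp = sum(1 for s, same in pairs if s >= threshold and not same)
    fn = sum(1 for s, same in pairs if s < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when there are no hits
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(pairs, candidates, min_precision):
    """Lowest candidate meeting the precision target. Recall never rises as the
    threshold rises, so this is also the qualifying candidate with the best recall."""
    for t in sorted(candidates):
        p, r = precision_recall(pairs, t)
        if p >= min_precision:
            return t, p, r
    return None

# Synthetic labeled pairs, for illustration only.
pairs = [(0.99, True), (0.95, True), (0.93, False),
         (0.90, True), (0.86, False), (0.82, True)]

for t in (0.85, 0.90, 0.94):
    p, r = precision_recall(pairs, t)
    print(f"T={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Running the sweep separately on each query category's labeled pairs yields the per-category curves; a precision-first category would pass a high `min_precision`, while a recall-leaning one would accept a lower target.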

Latency overhead and end-to-end performance

Semantic caching adds work to every request: embedding the query and searching the vector index. The question for production systems is whether that extra work pays off once cache hits and misses are accounted for.

In this deployment, the team measured the following p50 and p99 latencies:

Operation          | Latency (p50) | Latency (p99)
Query embedding    | 12ms          | 28ms
Vector search      | 8ms           | 19ms
Total cache lookup | 20ms          | 47ms
LLM API call       | 850ms         | 2400ms

On a cache miss, the system now pays the 20ms lookup overhead plus the LLM call: roughly 870ms instead of 850ms at p50. But with a 67% cache hit rate, the aggregate effect is strongly positive:

  • Before semantic cache: 100% of queries × 850ms ≈ 850ms average.
  • After semantic cache:
    • 33% of queries × 870ms (miss path)
    • 67% of queries × 20ms (hit path)

That works out to roughly 300ms average latency, a 65% reduction in end-to-end response time, despite adding work on every request. (The estimate uses the p50 figures as a stand-in for per-path averages.)
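
The arithmetic can be checked with a short calculation. The function names are illustrative; the inputs are the p50 figures from the table, used here as approximate per-path averages.

```python
def expected_latency_ms(hit_rate: float, lookup_ms: float, llm_ms: float) -> float:
    """Average latency when every request pays the lookup and only misses pay the LLM call."""
    hit_path = lookup_ms
    miss_path = lookup_ms + llm_ms
    return hit_rate * hit_path + (1 - hit_rate) * miss_path

def break_even_hit_rate(lookup_ms: float, llm_ms: float) -> float:
    """Hit rate at which the cache neither helps nor hurts average latency.
    Derived from lookup + (1 - h) * llm = llm, which gives h = lookup / llm."""
    return lookup_ms / llm_ms

before = 850.0
after = expected_latency_ms(hit_rate=0.67, lookup_ms=20, llm_ms=850)
reduction = 1 - after / before

print(f"after: {after:.1f} ms, reduction: {reduction:.1%}")
print(f"break-even hit rate: {break_even_hit_rate(20, 850):.1%}")
```

With these numbers the break-even hit rate is a little over 2%, which is why the lookup overhead is easy to justify whenever LLM latency dominates.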

The main takeaway is that for workloads where LLM calls dominate latency, a small embedding + vector search overhead is usually negligible on hits and tolerable on misses, provided hit rate is high enough.

Keeping cached answers fresh: invalidation strategies

Reducing cost and latency is only useful if answers remain correct. Over time, product details change, pricing updates, and policies are revised. Without a coherent invalidation strategy, semantic caching will eventually serve outdated or incorrect responses.

The team combined three strategies to manage staleness:

1. Time-based TTLs by content type

They assigned different time-to-live values based on how frequently the underlying information changes. For example:

  • Pricing: TTL on the order of hours (e.g., 4 hours) due to frequent updates.
  • Product info: ~1 day TTL for daily refreshes.
  • Policies: ~1 week TTL, as these change less often.
  • General FAQs: ~2 weeks TTL for very stable information.

This ensures that even if nothing else triggers invalidation, no answer lives indefinitely.

2. Event-based invalidation on content updates

When a known piece of content changes—such as a pricing table or policy page—the system proactively invalidates related cache entries. This requires tracking which cached queries depended on which content IDs, then purging those entries when the underlying data is updated.

This strategy quickly removes obviously stale answers after product or content changes, rather than waiting for TTLs to expire.
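
One way to track those dependencies is a reverse index from content ID to cache keys. The class name, method names, and content IDs below are hypothetical, and a dict stands in for the response store.

```python
from collections import defaultdict

class DependencyIndex:
    """Tracks which cached entries were generated from which content IDs."""

    def __init__(self):
        self._by_content = defaultdict(set)  # content_id -> cache keys
        self._by_key = defaultdict(set)      # cache key -> content_ids

    def register(self, cache_key: str, content_ids: list[str]) -> None:
        """Record the content a cached response depended on, at store time."""
        for cid in content_ids:
            self._by_content[cid].add(cache_key)
            self._by_key[cache_key].add(cid)

    def on_content_updated(self, content_id: str, cache: dict) -> set:
        """Purge every entry that depended on content_id; return the purged keys."""
        stale = self._by_content.pop(content_id, set())
        for key in stale:
            cache.pop(key, None)
            # Drop the purged key from any other content it was linked to.
            for cid in self._by_key.pop(key, set()):
                if cid != content_id:
                    self._by_content[cid].discard(key)
        return stale

cache = {"q1": "<answer>", "q2": "<answer>", "q3": "<answer>"}
index = DependencyIndex()
index.register("q1", ["pricing-table"])
index.register("q2", ["pricing-table", "returns-policy"])
index.register("q3", ["returns-policy"])

print(sorted(index.on_content_updated("pricing-table", cache)))
```

In a full system the purge would also remove the corresponding query embeddings from the vector store, so that stale entries can no longer be matched.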

3. Staleness detection via semantic comparison

Some responses can drift without an explicit content update event. To catch these cases, they periodically sampled cached entries and re-ran the original queries against current data or logic. They then compared the old and new responses semantically.

If the similarity between the cached answer and the fresh answer dropped below a threshold (e.g., 0.90), the entry was invalidated. This helped detect and clean up subtle drifts that TTLs and event-based rules missed.

Together, these three mechanisms—TTL, event-based invalidation, and semantic staleness checks—kept correctness high while still allowing the cache to aggressively reuse prior work.

What worked in production—and what to avoid

After three months of running semantic caching in production, the team recorded the following changes:

Metric                                    | Before     | After        | Change
Cache hit rate                            | 18%        | 67%          | +272%
LLM API costs                             | $47K/month | $12.7K/month | -73%
Average latency                           | 850ms      | 300ms        | -65%
False-positive rate (wrong cached answers) | N/A        | 0.8%         |
Customer complaints (wrong answers)       | Baseline   | +0.3%        | Minimal increase

The 0.8% false-positive rate—cases where the cache returned an answer that turned out to be semantically incorrect—was concentrated near the similarity thresholds, where queries were just similar enough to pass the cutoff but differed subtly in intent. For their use case, this trade-off was acceptable given the cost and latency gains.

Along the way, several pitfalls emerged:

  • Avoid a single global threshold. Different query types have very different error tolerances. Use per-category thresholds, tuned with precision/recall analysis.
  • Don’t assume you can skip embeddings on hits. Embedding and similarity search are how you discover hits in the first place; this work is intrinsic to semantic caching.
  • Don’t ship without invalidation. A semantic cache without a clear invalidation strategy will inevitably serve stale responses. Build TTLs, event hooks, and some form of freshness checking early.
  • Don’t cache every response. Certain classes of responses are unsafe to cache, such as highly personalized outputs, time-sensitive answers, or transactional confirmations. The team used rules to exclude responses containing personal information, time-sensitive content, or transactional flows from the cache.
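
The exclusion rules in the last bullet might look like the following sketch. The article does not describe the team's actual rules, so the patterns, keyword list, and category name here are illustrative assumptions, and real PII detection would need to be far more thorough.

```python
import re

# Illustrative patterns only (assumptions).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"\b\d{13,16}\b"),             # long digit runs, e.g. card-like numbers
]
TIME_SENSITIVE = re.compile(
    r"\b(today|tomorrow|right now|currently|in stock|as of)\b", re.IGNORECASE
)
UNCACHEABLE_CATEGORIES = {"transactional"}

def is_cacheable(query: str, response: str, category: str) -> bool:
    """Return False for responses that should never be written to the cache."""
    if category in UNCACHEABLE_CATEGORIES:
        return False
    text = f"{query}\n{response}"
    if any(p.search(text) for p in PII_PATTERNS):
        return False
    if TIME_SENSITIVE.search(text):
        return False
    return True

print(is_cacheable("What is your return policy?", "Items can be returned by mail.", "faq"))
print(is_cacheable("Is it in stock today?", "Yes.", "product_search"))
```

Applying the check at write time, before a response is stored, keeps unsafe entries out of the cache entirely rather than filtering them on lookup.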

For teams operating production LLM workloads, semantic caching is a pragmatic, high-ROI pattern rather than a theoretical optimization. In this case, it delivered a 73% cost reduction and a 65% latency improvement with only a marginal increase in user complaints, provided thresholds were tuned carefully and staleness was managed proactively.

The main investments are in embedding infrastructure, vector search, query classification, and a disciplined approach to threshold tuning and invalidation. For many systems, those investments are quickly repaid by tangible savings and more responsive user experiences.
