Key Takeaway
Semantic caching using embedding similarity can achieve cache hit rates of 30-50% for typical LLM applications while keeping quality degradation below user-noticeable thresholds. This guide covers five caching patterns, from exact match to hierarchical two-tier routing, with implementation examples, cache key design, invalidation strategies, and hit rate benchmarks.
Prerequisites
- An AI inference endpoint in production with observable request patterns
- Redis, Memcached, or equivalent in-memory cache infrastructure
- Access to an embedding model for semantic caching (any text embedding API)
- Understanding of your application's freshness requirements (how stale can cached responses be?)
- Cost tracking to measure the ROI of caching implementation
Why AI Caching Is Different
Traditional web caching matches exact request keys: the same URL with the same parameters returns the same cached response. AI caching faces two challenges that make exact matching insufficient. First, semantically equivalent queries may have different surface forms ('What is the capital of France?' and 'France capital city?' should return the same cached answer). Second, AI outputs are often non-deterministic, meaning identical inputs may produce different (but equally valid) outputs, making cache validation more nuanced.
The reward for solving these challenges is significant. LLM API calls are expensive (dollars per thousand requests at frontier model pricing) and slow (seconds of latency). A cache hit eliminates both the cost and the latency, providing a cached response in milliseconds for zero API cost. Even modest cache hit rates of 20-30% translate to meaningful cost reduction and latency improvement.
Pattern 1: Exact Match Cache
Exact match caching is the simplest pattern: hash the normalized input and look up the hash in a key-value store. If found, return the cached response. If not, call the model and cache the result. This works well for deterministic queries with structured inputs (e.g., classification of product descriptions, extraction from invoices) but poorly for free-form queries where users phrase the same question differently.
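The pattern can be sketched in a few lines. This is a minimal in-process version; the normalization rules (trimming, lowercasing, whitespace collapsing) are illustrative assumptions you should tune to your input format, and a production deployment would back the map with Redis or similar.

```typescript
import { createHash } from "crypto";

// Minimal exact-match cache sketch: normalize, hash, look up.
class ExactMatchCache {
  private store = new Map<string, string>();

  private key(input: string): string {
    // Normalization is an assumption — adjust to your input format.
    const normalized = input.trim().toLowerCase().replace(/\s+/g, " ");
    return createHash("sha256").update(normalized).digest("hex");
  }

  get(input: string): string | undefined {
    return this.store.get(this.key(input));
  }

  set(input: string, response: string): void {
    this.store.set(this.key(input), response);
  }
}
```

Because the key is a hash of the normalized input, trivial formatting differences (extra whitespace, casing) still hit the cache, but any semantic rephrasing misses — which is exactly the limitation that motivates Pattern 2.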
Pattern 2: Semantic Cache
Semantic caching embeds each query using a text embedding model, then searches for cached queries whose embeddings are similar above a threshold. When a match is found, the cached response is returned. The critical parameter is the similarity threshold: too high (e.g., 0.99) and you get few hits; too low (e.g., 0.85) and you return responses for queries that are semantically different enough to warrant a fresh response. Start at 0.95 and tune based on quality feedback.
/**
* Semantic cache for LLM responses.
*
* Uses embedding similarity to match semantically
* equivalent queries and return cached responses.
*/
interface CacheEntry {
query: string;
embedding: number[];
response: string;
model: string;
timestamp: number;
hitCount: number;
}
interface CacheStats {
hits: number;
misses: number;
evictions: number;
avgSimilarity: number;
}
class SemanticCache {
private entries: CacheEntry[] = [];
private stats: CacheStats = {
hits: 0,
misses: 0,
evictions: 0,
avgSimilarity: 0,
};
constructor(
private readonly similarityThreshold: number = 0.95,
private readonly maxEntries: number = 10000,
private readonly ttlMs: number = 3600000, // 1 hour
) {}
private cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
lookup(queryEmbedding: number[]): string | null {
const now = Date.now();
let bestScore = 0;
let bestEntry: CacheEntry | null = null;
for (const entry of this.entries) {
// Skip expired entries
if (now - entry.timestamp > this.ttlMs) continue;
const similarity = this.cosineSimilarity(
queryEmbedding,
entry.embedding,
);
if (similarity > bestScore) {
bestScore = similarity;
bestEntry = entry;
}
}
if (bestScore >= this.similarityThreshold && bestEntry) {
this.stats.hits++;
bestEntry.hitCount++;
return bestEntry.response;
}
this.stats.misses++;
return null;
}
store(
query: string,
embedding: number[],
response: string,
model: string,
): void {
// Evict oldest if at capacity
if (this.entries.length >= this.maxEntries) {
this.entries.sort((a, b) => a.timestamp - b.timestamp);
this.entries.shift();
this.stats.evictions++;
}
this.entries.push({
query,
embedding,
response,
model,
timestamp: Date.now(),
hitCount: 0,
});
}
get hitRate(): number {
const total = this.stats.hits + this.stats.misses;
return total > 0 ? this.stats.hits / total : 0;
}
invalidateByModel(model: string): number {
const before = this.entries.length;
this.entries = this.entries.filter((e) => e.model !== model);
return before - this.entries.length;
}
}

Pattern 3: Prompt Prefix Cache
Prompt prefix caching is a provider-level optimization available from Anthropic, OpenAI, and other LLM providers. When your requests share a common prefix (system prompt, few-shot examples, retrieved context), the provider caches the prefix computation and charges reduced rates for the cached portion on subsequent requests. This is particularly effective for applications with long, stable system prompts that are reused across many requests. Anthropic's implementation charges write costs for the initial cache and read costs (significantly lower) for subsequent uses.
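As a sketch of what enabling this looks like, Anthropic's Messages API marks cacheable prefix blocks with a `cache_control` field on the system prompt. The shape below follows Anthropic's documented format, but field names and pricing are provider-specific and change over time, so verify against your provider's current documentation; the model id and prompt text are placeholders.

```typescript
// Request body sketch for Anthropic-style prefix caching.
// The long, stable system prompt is marked cacheable; only the
// short per-request user message varies between calls.
const requestBody = {
  model: "your-model-id", // placeholder — use your provider's model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a support assistant. (long, stable instructions...)",
      cache_control: { type: "ephemeral" }, // marks this block as cacheable
    },
  ],
  messages: [{ role: "user", content: "How do I reset my password?" }],
};
```

The key design constraint is prefix stability: anything above the `cache_control` marker must be byte-identical across requests for the provider cache to hit, so keep volatile content (timestamps, user data) out of the cached prefix.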
Pattern 4: Embedding Cache
Embedding generation is a hidden cost center in RAG applications. Every document chunk, user query, and index rebuild triggers embedding API calls. An embedding cache stores computed embeddings keyed by a content hash, so the same text is never embedded twice. This is especially valuable during development (frequent index rebuilds) and for applications that process overlapping documents across multiple pipelines.
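A content-hash embedding cache is a thin wrapper around the embedding call. In this sketch, `embed` stands in for any text embedding API (the async signature is an assumption); the cache keys on a SHA-256 of the raw text, so identical chunks are embedded exactly once.

```typescript
import { createHash } from "crypto";

// Embedding cache keyed by content hash: the same text is
// never sent to the embedding API twice.
class EmbeddingCache {
  private cache = new Map<string, number[]>();
  misses = 0; // API calls actually made

  constructor(private embed: (text: string) => Promise<number[]>) {}

  async get(text: string): Promise<number[]> {
    const key = createHash("sha256").update(text).digest("hex");
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached;
    this.misses++;
    const embedding = await this.embed(text);
    this.cache.set(key, embedding);
    return embedding;
  }
}
```

For persistence across index rebuilds (where this pattern pays off most), swap the in-memory Map for a key-value store keyed by the same content hash.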
Pattern 5: Hierarchical Cache
Hierarchical caching uses a fast, cheap model as a first-tier cache for a slower, expensive model. Common questions are answered by the cheap model, and only queries that the cheap model cannot handle confidently are forwarded to the expensive model. The confidence threshold for forwarding is the key parameter. This pattern combines cost reduction with quality preservation, because the expensive model is only used where it adds value.
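The routing logic reduces to a single threshold check. In this sketch, both models are assumed to return an answer with a self-reported confidence in [0, 1]; real systems typically derive confidence from logprobs, a classifier, or a verifier rather than trusting the model's own estimate.

```typescript
interface ModelAnswer {
  text: string;
  confidence: number; // assumed to be in [0, 1]
}

// Two-tier routing: answer with the cheap model when it is
// confident, forward to the expensive model otherwise.
async function routeQuery(
  query: string,
  cheapModel: (q: string) => Promise<ModelAnswer>,
  expensiveModel: (q: string) => Promise<ModelAnswer>,
  confidenceThreshold = 0.8, // the key tuning parameter
): Promise<ModelAnswer> {
  const cheap = await cheapModel(query);
  if (cheap.confidence >= confidenceThreshold) return cheap;
  return expensiveModel(query);
}
```

Raising the threshold trades cost for quality: more traffic reaches the expensive model, but fewer marginal answers ship from the cheap one. Log which tier answered each query so you can measure that trade-off directly.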
Cache Invalidation Strategies
Cache invalidation for AI responses is more complex than traditional caching because the correct response can change without any change to the input. Model updates change what the correct output is. Data freshness requirements mean a cached response may be stale even if the query is identical. Knowledge cutoff dates limit the accuracy of cached factual responses. Design your invalidation strategy around these three dimensions: time-based expiry (TTL), model-version invalidation (clear cache when model changes), and content-based invalidation (clear cache when underlying data changes).
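Two of the three dimensions can be handled structurally rather than with explicit cache clears. One common approach (a sketch, with illustrative names) is to embed the model version and a data-version tag in the cache key, so a model rollout or data refresh makes old entries unreachable without a sweep, and to check TTL at read time:

```typescript
// Composite key: changing the model version or data version
// implicitly invalidates old entries — their keys stop matching.
function cacheKey(
  queryHash: string,
  modelVersion: string,
  dataVersion: string,
): string {
  return `${modelVersion}:${dataVersion}:${queryHash}`;
}

// Time-based expiry, checked at read time.
function isFresh(timestampMs: number, ttlMs: number, now = Date.now()): boolean {
  return now - timestampMs <= ttlMs;
}
```

The cost of this approach is that unreachable entries linger until evicted or expired, so pair composite keys with a TTL or size-based eviction rather than relying on them alone.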
Never cache responses to queries about real-time data (stock prices, weather, live scores), and never cache queries whose correct answer drifts over time unless you have a robust invalidation strategy. The most dangerous cache bug is serving confidently stale information that appears correct to the user. When in doubt, set a shorter TTL and accept the higher cache miss rate.
| Pattern | Hit Rate | Implementation Effort | Best For | Watch Out For |
|---|---|---|---|---|
| Exact Match | 5-15% | Low -- hash + key-value lookup | Structured inputs, classification, extraction | Low hit rate for free-form queries |
| Semantic Cache | 20-50% | Medium -- requires embedding + similarity search | Search queries, Q&A, support chat | Threshold tuning, false positives at low thresholds |
| Prompt Prefix | N/A (reduces per-token cost) | Low -- provider feature, minimal code changes | Long system prompts, few-shot examples | Provider-specific, cache write costs on first use |
| Embedding Cache | 60-90% (for repeated content) | Low -- content hash + storage | RAG pipelines, index rebuilds, evaluation suites | Storage costs for large embedding vectors |
| Hierarchical | 40-70% (queries handled by cheap model) | High -- requires two-model architecture and routing | High-volume with mixed complexity queries | Quality degradation if cheap model threshold is too aggressive |
Version History
1.0.0 · 2026-03-01
- Initial release with five caching patterns for AI inference
- Semantic cache implementation in TypeScript with cosine similarity
- Cache invalidation strategies for model updates, TTL, and data freshness
- Pattern comparison table with hit rates and implementation effort
- Caching readiness checklist