Key Takeaway
Semantic caching using embedding similarity can achieve cache hit rates of 30-50% for typical LLM applications while keeping quality degradation below user-noticeable thresholds. This guide covers five caching patterns, from exact match to hierarchical two-tier routing, with implementation examples, cache key design, invalidation strategies, and hit rate benchmarks.
Prerequisites
- An AI inference endpoint in production with observable request patterns
- Redis, Memcached, or equivalent in-memory cache infrastructure
- Access to an embedding model for semantic caching (any text embedding API)
- Understanding of your application's freshness requirements (how stale can cached responses be?)
- Cost tracking to measure the ROI of caching implementation
Why AI Caching Is Different
Traditional web caching matches exact request keys: the same URL with the same parameters returns the same cached response. AI caching faces two challenges that make exact matching insufficient. First, semantically equivalent queries may have different surface forms ('What is the capital of France?' and 'France capital city?' should return the same cached answer). Second, AI outputs are often non-deterministic, meaning identical inputs may produce different (but equally valid) outputs, making cache validation more nuanced.
The reward for solving these challenges is significant. LLM API calls are expensive (dollars per thousand requests at frontier model pricing) and slow (seconds of latency). A cache hit eliminates both the cost and the latency, providing a cached response in milliseconds for zero API cost. Even modest cache hit rates of 20-30% translate to meaningful cost reduction and latency improvement.
Pattern 1: Exact Match Cache
Exact match caching is the simplest pattern: hash the normalized input and look up the hash in a key-value store. If found, return the cached response. If not, call the model and cache the result. This works well for deterministic queries with structured inputs (e.g., classification of product descriptions, extraction from invoices) but poorly for free-form queries where users phrase the same question differently.
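The pattern can be sketched in a few lines. This is a minimal in-process version; the normalization rules (trimming, lowercasing, whitespace collapsing) are illustrative assumptions you should tune to your input format, and a production deployment would back the map with Redis or similar.

```typescript
import { createHash } from "crypto";

// Minimal exact-match cache sketch: normalize, hash, look up.
class ExactMatchCache {
  private store = new Map<string, string>();

  private key(input: string): string {
    // Normalization is an assumption — adjust to your input format.
    const normalized = input.trim().toLowerCase().replace(/\s+/g, " ");
    return createHash("sha256").update(normalized).digest("hex");
  }

  get(input: string): string | undefined {
    return this.store.get(this.key(input));
  }

  set(input: string, response: string): void {
    this.store.set(this.key(input), response);
  }
}
```

Because the key is a hash of the normalized input, trivial formatting differences (extra whitespace, casing) still hit the cache, but any semantic rephrasing misses — which is exactly the limitation that motivates Pattern 2.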
Pattern 2: Semantic Cache
Semantic caching embeds each query using a text embedding model, then searches for cached queries whose embeddings are similar above a threshold. When a match is found, the cached response is returned. The critical parameter is the similarity threshold: too high (e.g., 0.99) and you get few hits; too low (e.g., 0.85) and you return responses for queries that are semantically different enough to warrant a fresh response. Start at 0.95 and tune based on quality feedback.
/**
* Semantic cache for LLM responses.
*
* Uses embedding similarity to match semantically
* equivalent queries and return cached responses.
*/
interface CacheEntry {
query: string;
embedding: number[];
response: string;
model: string;
timestamp: number;
hitCount: number;
}
interface CacheStats {
hits: number;
misses: number;
evictions: number;
avgSimilarity: number;
}
class SemanticCache {
private entries: CacheEntry[] = [];
private stats: CacheStats = {
hits: 0,
misses: 0,
evictions: 0,
avgSimilarity: 0,
};
constructor(
private readonly similarityThreshold: number = 0.95,
private readonly maxEntries: number = 10000,
private readonly ttlMs: number = 3600000, // 1 hour
) {}
private cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
lookup(queryEmbedding: number[]): string | null {
const now = Date.now();
let bestScore = 0;
let bestEntry: CacheEntry | null = null;
for (const entry of this.entries) {
// Skip expired entries
if (now - entry.timestamp > this.ttlMs) continue;
const similarity = this.cosineSimilarity(
queryEmbedding,
entry.embedding,
);
if (similarity > bestScore) {
bestScore = similarity;
bestEntry = entry;
}
}
if (bestScore >= this.similarityThreshold && bestEntry) {
this.stats.hits++;
bestEntry.hitCount++;
return bestEntry.response;
}
this.stats.misses++;
return null;
}
store(
query: string,
embedding: number[],
response: string,
model: string,
): void {
// Evict oldest if at capacity
if (this.entries.length >= this.maxEntries) {
this.entries.sort((a, b) => a.timestamp - b.timestamp);
this.entries.shift();
this.stats.evictions++;
}
this.entries.push({
query,
embedding,
response,
model,
timestamp: Date.now(),
hitCount: 0,
});
}
get hitRate(): number {
const total = this.stats.hits + this.stats.misses;
return total > 0 ? this.stats.hits / total : 0;
}
invalidateByModel(model: string): number {
const before = this.entries.length;
this.entries = this.entries.filter((e) => e.model !== model);
return before - this.entries.length;
}
}

Pattern 3: Prompt Prefix Cache
Prompt prefix caching is a provider-level optimization available from Anthropic, OpenAI, and other LLM providers. When your requests share a common prefix (system prompt, few-shot examples, retrieved context), the provider caches the prefix computation and charges reduced rates for the cached portion on subsequent requests. This is particularly effective for applications with long, stable system prompts that are reused across many requests. Anthropic's implementation charges write costs for the initial cache and read costs (significantly lower) for subsequent uses.
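As a sketch of what enabling this looks like, Anthropic's Messages API marks cacheable prefix blocks with a `cache_control` field on the system prompt. The shape below follows Anthropic's documented format, but field names and pricing are provider-specific and change over time, so verify against your provider's current documentation; the model id and prompt text are placeholders.

```typescript
// Request body sketch for Anthropic-style prefix caching.
// The long, stable system prompt is marked cacheable; only the
// short per-request user message varies between calls.
const requestBody = {
  model: "your-model-id", // placeholder — use your provider's model id
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a support assistant. (long, stable instructions...)",
      cache_control: { type: "ephemeral" }, // marks this block as cacheable
    },
  ],
  messages: [{ role: "user", content: "How do I reset my password?" }],
};
```

The key design constraint is prefix stability: anything above the `cache_control` marker must be byte-identical across requests for the provider cache to hit, so keep volatile content (timestamps, user data) out of the cached prefix.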
Pattern 4: Embedding Cache
Embedding generation is a hidden cost center in RAG applications. Every document chunk, user query, and index rebuild triggers embedding API calls. An embedding cache stores computed embeddings keyed by a content hash, so the same text is never embedded twice. This is especially valuable during development (frequent index rebuilds) and for applications that process overlapping documents across multiple pipelines.
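A content-hash embedding cache is a thin wrapper around the embedding call. In this sketch, `embed` stands in for any text embedding API (the async signature is an assumption); the cache keys on a SHA-256 of the raw text, so identical chunks are embedded exactly once.

```typescript
import { createHash } from "crypto";

// Embedding cache keyed by content hash: the same text is
// never sent to the embedding API twice.
class EmbeddingCache {
  private cache = new Map<string, number[]>();
  misses = 0; // API calls actually made

  constructor(private embed: (text: string) => Promise<number[]>) {}

  async get(text: string): Promise<number[]> {
    const key = createHash("sha256").update(text).digest("hex");
    const cached = this.cache.get(key);
    if (cached !== undefined) return cached;
    this.misses++;
    const embedding = await this.embed(text);
    this.cache.set(key, embedding);
    return embedding;
  }
}
```

For persistence across index rebuilds (where this pattern pays off most), swap the in-memory Map for a key-value store keyed by the same content hash.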
Pattern 5: Hierarchical Cache
Hierarchical caching uses a fast, cheap model as a first-tier cache for a slower, expensive model. Common questions are answered by the cheap model, and only queries that the cheap model cannot handle confidently are forwarded to the expensive model. The confidence threshold for forwarding is the key parameter. This pattern combines cost reduction with quality preservation, because the expensive model is only used where it adds value.
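The routing logic reduces to a single threshold check. In this sketch, both models are assumed to return an answer with a self-reported confidence in [0, 1]; real systems typically derive confidence from logprobs, a classifier, or a verifier rather than trusting the model's own estimate.

```typescript
interface ModelAnswer {
  text: string;
  confidence: number; // assumed to be in [0, 1]
}

// Two-tier routing: answer with the cheap model when it is
// confident, forward to the expensive model otherwise.
async function routeQuery(
  query: string,
  cheapModel: (q: string) => Promise<ModelAnswer>,
  expensiveModel: (q: string) => Promise<ModelAnswer>,
  confidenceThreshold = 0.8, // the key tuning parameter
): Promise<ModelAnswer> {
  const cheap = await cheapModel(query);
  if (cheap.confidence >= confidenceThreshold) return cheap;
  return expensiveModel(query);
}
```

Raising the threshold trades cost for quality: more traffic reaches the expensive model, but fewer marginal answers ship from the cheap one. Log which tier answered each query so you can measure that trade-off directly.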
Cache Invalidation Strategies
Cache invalidation for AI responses is more complex than traditional caching because the correct response can change without any change to the input. Model updates change what the correct output is. Data freshness requirements mean a cached response may be stale even if the query is identical. Knowledge cutoff dates limit the accuracy of cached factual responses. Design your invalidation strategy around these three dimensions: time-based expiry (TTL), model-version invalidation (clear cache when model changes), and content-based invalidation (clear cache when underlying data changes).
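Two of the three dimensions can be handled structurally rather than with explicit cache clears. One common approach (a sketch, with illustrative names) is to embed the model version and a data-version tag in the cache key, so a model rollout or data refresh makes old entries unreachable without a sweep, and to check TTL at read time:

```typescript
// Composite key: changing the model version or data version
// implicitly invalidates old entries — their keys stop matching.
function cacheKey(
  queryHash: string,
  modelVersion: string,
  dataVersion: string,
): string {
  return `${modelVersion}:${dataVersion}:${queryHash}`;
}

// Time-based expiry, checked at read time.
function isFresh(timestampMs: number, ttlMs: number, now = Date.now()): boolean {
  return now - timestampMs <= ttlMs;
}
```

The cost of this approach is that unreachable entries linger until evicted or expired, so pair composite keys with a TTL or size-based eviction rather than relying on them alone.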
Never cache responses to queries about real-time data (stock prices, weather, live scores), and never cache queries whose correct answer drifts over time unless you have a robust invalidation strategy. The most dangerous cache bug is serving confidently stale information that appears correct to the user. When in doubt, set a shorter TTL and accept the higher cache miss rate.
| Pattern | Hit Rate | Implementation Effort | Best For | Watch Out For |
|---|---|---|---|---|
| Exact Match | 5-15% | Low -- hash + key-value lookup | Structured inputs, classification, extraction | Low hit rate for free-form queries |
| Semantic Cache | 20-50% | Medium -- requires embedding + similarity search | Search queries, Q&A, support chat | Threshold tuning, false positives at low thresholds |
| Prompt Prefix | N/A (reduces per-token cost) | Low -- provider feature, minimal code changes | Long system prompts, few-shot examples | Provider-specific, cache write costs on first use |
| Embedding Cache | 60-90% (for repeated content) | Low -- content hash + storage | RAG pipelines, index rebuilds, evaluation suites | Storage costs for large embedding vectors |
| Hierarchical | 40-70% (queries handled by cheap model) | High -- requires two-model architecture and routing | High-volume with mixed complexity queries | Quality degradation if cheap model threshold is too aggressive |
Version History
1.0.0 · 2026-03-01
- Initial release with five caching patterns for AI inference
- Semantic cache implementation in TypeScript with cosine similarity
- Cache invalidation strategies for model updates, TTL, and data freshness
- Pattern comparison table with hit rates and implementation effort
- Caching readiness checklist