What is context caching?
Context caching is a technique that allows AI providers to reuse previously processed input data across multiple requests, significantly reducing cost and latency. When you send the same large block of text, such as a system prompt or reference document, to an AI model repeatedly, context caching lets the provider skip reprocessing that content and instead reuse the computed results from the first time it was seen.
Why Context Caching Matters
Every time you send a request to a language model, the model processes your entire input from scratch. For a short prompt, this is fine. But many real-world applications include substantial context that stays the same across many requests: a long system prompt with detailed instructions, a large reference document, or a set of few-shot examples.
Without caching, you pay the full token price and incur the full processing latency while the model re-reads that identical content on every single request. Context caching eliminates this redundancy.
How It Works Under the Hood
To understand context caching, it helps to know what happens during inference. When a language model processes your input, it converts each token into internal representations called [key-value (KV) pairs]. These KV pairs are computed layer by layer through the model's transformer architecture and are stored in what is called the [KV cache].
The KV cache is what allows the model to generate tokens efficiently, as each new token can attend to the cached representations of all previous tokens without recomputing them. Context caching extends this concept across requests: if the beginning of your prompt is identical to a previous request, the provider can reuse the KV cache from that earlier computation.
This means the model only needs to process the new, unique portion of your input. The cached prefix is essentially free in terms of computation.
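The prefix-reuse idea can be sketched in a few lines. This is a toy model only: real providers cache the transformer's KV tensors on their servers, and here a list of strings stands in for those tensors purely to show the prefix-matching logic.

```python
def expensive_encode(tokens):
    """Stand-in for the per-token KV computation done at inference time."""
    return [f"kv({t})" for t in tokens]

class PrefixCache:
    def __init__(self):
        # Maps a token-prefix tuple to its already-computed KV states.
        self._store = {}

    def process(self, tokens):
        # Find the longest previously seen prefix of this request.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._store:
                hit = n
                break
        cached = self._store[tuple(tokens[:hit])] if hit else []
        # Only the suffix after the cached prefix needs fresh computation.
        kv = cached + expensive_encode(tokens[hit:])
        # Remember every prefix of this request for future hits.
        for n in range(hit + 1, len(tokens) + 1):
            self._store[tuple(tokens[:n])] = kv[:n]
        return kv, hit  # hit = number of tokens served from cache

cache = PrefixCache()
_, hits_first = cache.process(["sys", "prompt", "query-A"])
_, hits_second = cache.process(["sys", "prompt", "query-B"])
print(hits_first, hits_second)  # prints "0 2": the shared prefix is reused
```

The second request pays only for its unique final token, which is exactly the economics the providers expose.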
Provider Implementations
Different providers approach context caching differently:
[Google's Context Caching] (available with Gemini models) is an explicit caching system. You create a named cache by sending your content to a caching endpoint, receive a cache identifier, and then reference that cache in subsequent requests. Cached content can include text, images, and other supported modalities. You control the cache's time-to-live (TTL), and you are charged a reduced rate for cached tokens. This approach gives you direct control over what is cached and for how long.
[OpenAI's Prompt Caching] works automatically. When you send a request that shares a prefix with a recent request, OpenAI's infrastructure automatically detects the overlap and reuses cached computations. There is no explicit cache creation step. This makes it effortless to use but gives you less direct control. Cached input tokens are billed at a 50% discount. Prompt caching kicks in for prompts longer than 1,024 tokens, and the cache typically persists for 5-10 minutes of inactivity.
[Anthropic's Prompt Caching] for Claude uses explicit cache control markers. You add a cache_control field to specific content blocks in your request to indicate what should be cached. Cached content is charged at a reduced read rate on subsequent requests. The first request incurs a slightly higher write cost to establish the cache. Cache lifetimes are managed by the provider, typically persisting for at least 5 minutes.
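As a concrete look at the explicit-marker style, here is an illustrative request body using an Anthropic-style cache_control block. The field layout follows the pattern described above, but the model name and prompt text are placeholders, and this builds the JSON body only rather than calling any API.

```python
STATIC_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 100  # large, stable prefix

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-example-model",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                # Marks this block as cacheable: the first request pays the
                # higher write rate, and later requests repeating it
                # byte-for-byte pay the reduced cached-read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content goes after the cached block, never before it.
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_request("Where is my order?")
```

Note that only the stable system block carries the marker; the per-user message stays outside it so the cached prefix is identical across requests.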
Cost Savings
The financial impact of context caching can be substantial. Typical savings include:
[Cached input tokens cost 50-90% less] than uncached tokens, depending on the provider. If your application sends a 10,000-token system prompt with every request, caching that prompt means you pay a fraction of the original cost for those tokens on every subsequent call.
[Latency reductions of 30-80%] for the cached portion, since the model skips the computation for already-processed tokens. For applications where response time matters, this is a significant improvement.
For a concrete example: imagine a customer service bot that includes a 5,000-token system prompt with company policies, product information, and response guidelines. If this bot handles 10,000 requests per day, context caching could reduce your input token costs for that prefix by 50-90%, potentially saving hundreds or thousands of dollars monthly.
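The arithmetic behind that example is worth making explicit. The price and discount below are illustrative assumptions, not any provider's actual rates:

```python
PROMPT_TOKENS = 5_000        # cached system-prompt prefix
REQUESTS_PER_DAY = 10_000
PRICE_PER_MTOK = 3.00        # assumed $ per million uncached input tokens
CACHE_DISCOUNT = 0.75        # assumed 75% off cached reads (within the 50-90% range)

daily_prefix_tokens = PROMPT_TOKENS * REQUESTS_PER_DAY        # 50M tokens/day
daily_cost_uncached = daily_prefix_tokens / 1_000_000 * PRICE_PER_MTOK
daily_cost_cached = daily_cost_uncached * (1 - CACHE_DISCOUNT)
monthly_savings = (daily_cost_uncached - daily_cost_cached) * 30

print(f"prefix cost/day without caching: ${daily_cost_uncached:.2f}")  # $150.00
print(f"prefix cost/day with caching:    ${daily_cost_cached:.2f}")    # $37.50
print(f"estimated monthly savings:       ${monthly_savings:.2f}")      # $3375.00
```

Even at these modest assumed rates, the shared prefix alone accounts for thousands of dollars per month.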
When Caching Is Most Useful
Context caching delivers the greatest benefit in specific scenarios:
[Long, repeated system prompts]: If every request to your application includes the same detailed instructions, caching that prompt is an obvious win.
[Reference documents]: Applications that include a large document (a user manual, a codebase, a legal contract) as context for every query benefit enormously from caching.
[Few-shot examples]: If you include the same set of examples in every prompt to guide the model's behavior, those examples are ideal for caching.
[Conversational applications]: In multi-turn conversations, the history grows with each turn. Caching the earlier turns means the model only processes the newest messages, rather than reprocessing the entire conversation from scratch on every message.
[Batch-like workloads]: When you are running many similar requests that share a common prefix, such as processing different customer records against the same analysis prompt, caching the shared prompt saves cost on every request.
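The batch-like case comes down to how you assemble the prompt string. A minimal sketch, in which the rubric text and customer records are invented placeholders:

```python
ANALYSIS_PROMPT = (
    "You are a risk analyst. Score the customer record at the end of this "
    "message against the following rubric: ..."  # imagine a long rubric here
)

def build_prompt(record: str) -> str:
    # Shared, static content first; per-record data last, so every request
    # starts with the same cacheable prefix.
    return f"{ANALYSIS_PROMPT}\n\n--- RECORD ---\n{record}"

records = ["id=1 balance=200", "id=2 balance=900", "id=3 balance=50"]
prompts = [build_prompt(r) for r in records]

# All prompts begin identically, so after the first request the provider can
# serve the ANALYSIS_PROMPT portion from cache for the rest of the batch.
assert all(p.startswith(ANALYSIS_PROMPT) for p in prompts)
```

Inlining the record before the rubric would break the shared prefix and forfeit the cache on every request.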
Implementation Considerations
When implementing context caching, keep these points in mind:
[Cache ordering matters]: Caching works on prefixes. The cached content must come at the beginning of your prompt. If you rearrange the order so that variable content comes first, caching will not work.
[Minimum size thresholds]: Most providers require a minimum number of tokens for caching to activate. OpenAI requires at least 1,024 tokens; Google and Anthropic have similar thresholds. Very short prompts will not benefit from caching.
[Cache expiration]: Caches do not persist forever. They typically expire after minutes of inactivity. For explicit caching systems like Google's, you can set a custom TTL. Design your application to gracefully handle cache misses.
[Cost of cache creation]: With explicit caching systems, there may be a one-time cost to write to the cache that is higher than the normal input rate. This is amortized over subsequent cached reads.
[Cache invalidation]: If you need to update your system prompt or reference documents, cached content becomes stale. You will need to create a new cache or wait for the old one to expire.
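The ordering and minimum-size points above can be checked with a rough guard that estimates whether two requests share a prefix long enough for caching to engage. The 1,024-token floor is OpenAI's published minimum; the ~4-characters-per-token conversion is a crude assumption used only for this sketch.

```python
def shared_prefix_chars(a: str, b: str) -> int:
    """Number of leading characters the two prompts have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def likely_cacheable(a: str, b: str, min_tokens: int = 1024) -> bool:
    estimated_tokens = shared_prefix_chars(a, b) / 4  # crude chars-per-token guess
    return estimated_tokens >= min_tokens

static = "x" * 8_000  # stands in for a long system prompt
good = likely_cacheable(static + " question one", static + " question two")
bad = likely_cacheable("question one " + static, "question two " + static)
print(good, bad)  # prints "True False": variable-first prompts diverge immediately
```

Running a check like this over a sample of real traffic is a quick way to spot prompts that accidentally put variable content ahead of the static block.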
Automatic vs Explicit Caching
The choice between automatic and explicit caching often depends on your provider, but understanding the tradeoff is useful:
[Automatic caching] (OpenAI's approach) requires no code changes. If your prompts naturally share prefixes, you get caching for free. The downside is less control and visibility.
[Explicit caching] (Google's and Anthropic's approach) requires you to specify what should be cached. This gives you control over cache lifetime and content but requires intentional implementation.
Context caching is one of those optimizations that can meaningfully reduce costs and improve performance with relatively little effort. If your application sends repeated content to AI models, it is worth implementing.