Module 09 · Lesson 03

Prompt Caching & Cost Engineering

Reading time: 20 minutes
Track: Advanced Claude Setup · Universal
Prerequisites: Module 09 · Lesson 02

The cost problem with Claude at scale

Individual sessions with Claude are inexpensive. Production applications that send the same 50,000-token system prompt on every single request are not.

If your biotech runs a regulatory drafting assistant and every API call includes a 50k-token system prompt with relevant guidance documents, ICH sections, and output templates, you are paying for those 50,000 tokens to be processed on every request. At $5/MTok for Opus 4.8 input, that's $0.25 per call — before any actual user content. Ten employees using it 20 times per day is $50/day, $18,250/year, for tokens that never change.
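The arithmetic behind those figures can be reproduced in a few lines. This is only a sketch using the numbers quoted above; the constant names are illustrative, and the yearly figure assumes usage on all 365 days.

```python
# Figures from the paragraph above: 50k-token prompt, $5/MTok input,
# 10 employees making 20 calls each per day.
PROMPT_TOKENS = 50_000
PRICE_PER_MTOK = 5.00
CALLS_PER_DAY = 10 * 20

cost_per_call = PROMPT_TOKENS / 1_000_000 * PRICE_PER_MTOK
cost_per_day = cost_per_call * CALLS_PER_DAY
cost_per_year = cost_per_day * 365  # assumes use on every calendar day

print(f"per call ${cost_per_call:.2f}, per day ${cost_per_day:.2f}, "
      f"per year ${cost_per_year:,.2f}")
```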

Prompt caching eliminates most of that cost.

01 · How prompt caching works

Prompt caching lets Claude reuse a previously processed prompt prefix rather than reprocessing it from scratch on each request.

When you mark a section of your prompt with a cache breakpoint, Claude checks whether that prefix has been processed recently. If it has (cache hit), it reads the cached version at 10% of the normal input cost. If it hasn't (cache miss), it processes it normally and stores the prefix at 125% of the normal cost.

The math:

Cache write: 1.25x base input price
Cache read: 0.1x base input price (90% discount)

Worked example — Opus 4.8:

Scenario	Cost
100k tokens, no caching, 10 calls	100k × 10 × $5/MTok = $5.00 total
100k tokens, first call writes cache	100k × $6.25/MTok = $0.625
100k tokens, next 9 calls hit cache	100k × 9 × $0.50/MTok = $0.45
Total with caching	$1.075 — 78.5% savings

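The savings grow with the number of cache hits. The one-time write premium is spread over more calls, so as long as the cache stays warm the effective discount on the cached prefix approaches 90%.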
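The worked example generalizes to a small cost model. The sketch below assumes the first call writes the cache and every later call hits it; the function name `cached_prefix_cost` and its parameters are illustrative, not part of any SDK.

```python
def cached_prefix_cost(tokens, calls, base_price_per_mtok,
                       write_mult=1.25, read_mult=0.10):
    """Return (uncached_cost, cached_cost) in dollars for a repeated prefix.

    Assumes one cache write on the first call and a cache hit on every
    later call (the cache never expires between calls).
    """
    mtok = tokens / 1_000_000
    uncached = mtok * calls * base_price_per_mtok
    write = mtok * base_price_per_mtok * write_mult
    reads = mtok * (calls - 1) * base_price_per_mtok * read_mult
    return uncached, write + reads


# The Opus 4.8 example from the table: 100k tokens, 10 calls, $5/MTok.
uncached, cached = cached_prefix_cost(100_000, 10, 5.00)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}, savings {savings:.1%}")
```

Raising `calls` shows the savings climbing toward the 90% read discount, since the single write is amortized over more hits.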

02 · Cache TTL options

Caches have two time-to-live (TTL) options:

TTL	Default?	Write cost multiplier	Use when
5 minutes	Yes	1.25x	High-frequency use; cache refreshes automatically at no cost if accessed within 5 minutes
1 hour	Optional (set `ttl: "1h"`)	2.0x	Less frequent access; extended reasoning tasks; applications with longer gaps between requests

To use 1-hour caching:

system=[
    {
        "type": "text",
        "text": "Your static system prompt content...",
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
]

Biotech application: Regulatory drafting applications typically have users making requests every few minutes during a session, then going dark for hours. Use 5-minute TTL for interactive sessions. If you're running overnight batch document processing, 1-hour TTL prevents cache expiry during the run.
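A rough way to compare the two TTLs is to express the cached prefix's cost in units of one uncached pass. The sketch below rests on two simplifying assumptions that are not from the lesson: requests are evenly spaced, and every cache hit refreshes the TTL for both options. The function name is hypothetical.

```python
def prefix_cost_multiplier(n_requests, gap_minutes, ttl_minutes,
                           write_mult, read_mult=0.10):
    """Total prefix cost over n_requests, in units of one uncached pass.

    Assumptions (illustrative): evenly spaced requests, and each cache
    hit refreshes the TTL.
    """
    if n_requests < 1:
        return 0.0
    if gap_minutes < ttl_minutes:
        # First request writes; the rest arrive before expiry and read.
        return write_mult + (n_requests - 1) * read_mult
    # Every request arrives after expiry and pays the write price again.
    return n_requests * write_mult


# 12 requests spaced 15 minutes apart.
five_min = prefix_cost_multiplier(12, 15, ttl_minutes=5, write_mult=1.25)
one_hour = prefix_cost_multiplier(12, 15, ttl_minutes=60, write_mult=2.0)
print(f"5-minute TTL: {five_min:.2f}x, 1-hour TTL: {one_hour:.2f}x")
```

With gaps longer than five minutes, the 5-minute TTL pays the write premium on every request, so the higher 1-hour write cost is recovered after the first hit.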

03 · Cache breakpoint placement rules

This is where most teams make mistakes. The rules are precise:

Rule 1: Place cache_control on the last block that stays identical across requests.

If your prompt has three parts — (A) static system prompt, (B) user-specific context, (C) user question — the cache breakpoint goes at the end of A, not at the end of B or C.

# Correct placement
system = [
    {
        "type": "text",
        "text": "Static guidance: ICH sections, output templates, constraints...",
        "cache_control": {"type": "ephemeral"}  # ← breakpoint here
    },
    {
        "type": "text",
        "text": f"User context: {user_name}, project: {project_id}"  # ← NO breakpoint
    }
]

Rule 2: Static content comes first.

Content order in the cache hierarchy: tools → system → messages. Within a section, static content must come before dynamic content. If dynamic content appears before the breakpoint, the cache never hits because the prefix changes on every request.

The classic mistake:

from datetime import datetime

# WRONG: timestamp changes every request, so the cache never hits
system = [
    {"type": "text", "text": f"Session started: {datetime.now()}"},  # changes every call
    {"type": "text", "text": "Static 50k-token system prompt...", "cache_control": {"type": "ephemeral"}}
]

# CORRECT: static content before breakpoint, dynamic after
system = [
    {"type": "text", "text": "Static 50k-token system prompt...", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": f"Session started: {datetime.now()}"}  # after breakpoint
]
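One way to catch this mistake before deployment is to build the system blocks twice with different dynamic values and compare a hash of everything up to the last breakpoint. This is a local approximation only: how the API actually keys its cache is internal, and this sketch ignores tool definitions, which precede the system prompt in the prefix. The helper names are hypothetical.

```python
import hashlib
import json


def cached_prefix_digest(system_blocks):
    """Hash every block up to and including the last one with cache_control.

    Returns None when no block carries a breakpoint.
    """
    last = max((i for i, block in enumerate(system_blocks)
                if "cache_control" in block), default=None)
    if last is None:
        return None
    payload = json.dumps(system_blocks[: last + 1], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


STATIC = {"type": "text", "text": "Static 50k-token system prompt...",
          "cache_control": {"type": "ephemeral"}}


def wrong_order(timestamp):
    return [{"type": "text", "text": f"Session started: {timestamp}"}, STATIC]


def correct_order(timestamp):
    return [STATIC, {"type": "text", "text": f"Session started: {timestamp}"}]


# A stable prefix yields the same digest for different timestamps.
wrong_stable = (cached_prefix_digest(wrong_order("09:00"))
                == cached_prefix_digest(wrong_order("09:01")))
correct_stable = (cached_prefix_digest(correct_order("09:00"))
                  == cached_prefix_digest(correct_order("09:01")))
print(f"wrong order stable: {wrong_stable}, correct order stable: {correct_stable}")
```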

Rule 3: Up to 4 cache breakpoints per request.

You can cache different sections that change at different frequencies. Tool definitions rarely change (cache them longest). User context changes per session. User messages change per turn.

# Three-tier caching for a document assistant
system = [
    {
        "type": "text",
        "text": "Tool guidance and base instructions...",
        "cache_control": {"type": "ephemeral", "ttl": "1h"}  # breakpoint 1 — changes rarely
    },
    {
        "type": "text",
        "text": "Current project context and uploaded documents...",
        "cache_control": {"type": "ephemeral"}  # breakpoint 2 — changes per session
    },
    {
        "type": "text",
        "text": f"User: {user_name}, Role: {user_role}"  # breakpoint 3 would go here if needed
    }
]
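Because the limit applies to the whole request, a small pre-flight check can count breakpoints across tools, system blocks, and message content before sending. The sketch below is a hypothetical helper, not an SDK function, and it assumes the request parts are plain dicts shaped like the examples in this lesson.

```python
MAX_BREAKPOINTS = 4  # limit stated in Rule 3


def count_breakpoints(tools=(), system=(), messages=()):
    """Count cache_control markers across tools, system, and message blocks."""
    count = sum(1 for tool in tools if "cache_control" in tool)
    count += sum(1 for block in system if "cache_control" in block)
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content has no blocks
            count += sum(1 for block in content
                         if isinstance(block, dict) and "cache_control" in block)
    return count


def check_breakpoints(**request_parts):
    """Return the breakpoint count, or raise if it exceeds the limit."""
    n = count_breakpoints(**request_parts)
    if n > MAX_BREAKPOINTS:
        raise ValueError(f"{n} cache breakpoints; the limit is {MAX_BREAKPOINTS}")
    return n


example_system = [
    {"type": "text", "text": "Base instructions...",
     "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    {"type": "text", "text": "Project context...",
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "User: example, Role: reviewer"},
]
n_breakpoints = check_breakpoints(system=example_system)
```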

Minimum token thresholds:

Sonnet 4.6: 1,024 tokens minimum to be eligible for caching
Opus 4.8, Haiku 4.5: 4,096 tokens minimum

Below these thresholds, the cache breakpoint is ignored and content is processed normally.
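Since an undersized prefix fails silently, it can help to check eligibility explicitly. The sketch below hard-codes the thresholds listed above; the dictionary keys are display names rather than API model IDs, and the token count is assumed to come from elsewhere (for example, the usage fields of an earlier response).

```python
# Thresholds as listed in this lesson; keys are display names, not model IDs.
MIN_CACHEABLE_TOKENS = {
    "Sonnet 4.6": 1024,
    "Opus 4.8": 4096,
    "Haiku 4.5": 4096,
}


def is_cacheable(model_name, prefix_tokens):
    """True if a prefix of this size meets the model's caching minimum."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_name]


print(is_cacheable("Sonnet 4.6", 3000))  # meets the 1,024-token minimum
print(is_cacheable("Opus 4.8", 3000))    # below the 4,096-token minimum
```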

04 · Monitoring cache performance

Every API response includes cache usage fields:

response.usage.cache_creation_input_tokens  # tokens written to cache this call
response.usage.cache_read_input_tokens      # tokens read from cache this call
response.usage.input_tokens                 # uncached input tokens (after breakpoint)

Track these in production. If cache_read_input_tokens is near zero on a high-frequency application, something is wrong with your breakpoint placement — probably dynamic content before the cache point.

Target: for a well-structured application with a large static system prompt, cache hit rate should be >80% after warmup.
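The lesson does not pin down how hit rate is computed; one reasonable reading is the share of input tokens served from cache. The sketch below uses that token-weighted definition as an assumption, and stands in a `SimpleNamespace` for a real `response.usage` object.

```python
from types import SimpleNamespace


def cache_hit_rate(usage):
    """Fraction of a response's input tokens that were read from cache."""
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    uncached = usage.input_tokens or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Stand-in for response.usage on a warm request (values are made up).
warm = SimpleNamespace(cache_read_input_tokens=50_000,
                       cache_creation_input_tokens=0,
                       input_tokens=400)
# Stand-in for the first request, which writes the cache.
cold = SimpleNamespace(cache_read_input_tokens=0,
                       cache_creation_input_tokens=50_000,
                       input_tokens=400)

warm_rate = cache_hit_rate(warm)
cold_rate = cache_hit_rate(cold)
print(f"warm: {warm_rate:.1%}, cold: {cold_rate:.1%}")
```

Averaging this value over a window of requests, rather than reading single responses, gives a steadier signal to alert on.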

05 · Pre-warming the cache

For production services with strict latency requirements, pre-warm the cache before users arrive:

# Fire before traffic starts (e.g., at deployment, or at session initialization)
prewarm = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=0,  # required — no output needed, just warm the cache
    system=[
        {
            "type": "text",
            "text": "Your static system prompt...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "warmup"}],
)
# Response will have empty content and zero output token charges
# Subsequent requests with the same system prompt get immediate cache hits

max_tokens: 0 is not compatible with streaming. Use it for non-streaming pre-warm requests only.

06 · Batch API for non-real-time workloads

For batch processing (overnight document analysis, bulk classification, non-interactive report generation), the Batch API provides 50% cost reduction on all token charges. It also supports extended output up to 300k tokens per request using the output-300k-2026-03-24 beta header on Opus and Sonnet models.

Batch API is appropriate when:

Latency doesn't matter (results can be retrieved hours later)
Volume is high enough that 50% savings is meaningful
Tasks don't require real-time back-and-forth

Combine Batch API with prompt caching for maximum cost reduction on large, repetitive document processing workloads.
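A combined setup might look like the sketch below: one batch request per document, all sharing the same cached system prefix. The helper `build_batch_requests` and the shape of `documents` are hypothetical; the model ID is the one used earlier in this lesson. Whether every request in a batch actually hits the cache is worth verifying from the usage fields rather than assuming.

```python
def build_batch_requests(documents, static_prompt,
                         model="claude-opus-4-8", max_tokens=1024):
    """Build one batch request per document with a shared cached prefix.

    documents: dict mapping a caller-chosen ID to document text
    (an illustrative shape, not one required by the API).
    """
    shared_system = [{
        "type": "text",
        "text": static_prompt,
        # 1-hour TTL, as suggested earlier for long batch runs
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }]
    return [
        {
            "custom_id": doc_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": shared_system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]


batch_requests = build_batch_requests(
    {"doc-001": "First document text...", "doc-002": "Second document text..."},
    "Static guidance: ICH sections, output templates, constraints...",
)
# With the Anthropic Python SDK, this list would be submitted roughly as:
#   batch = client.messages.batches.create(requests=batch_requests)
```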

07 · Knowledge check

Q1. Your application sends a 60,000-token static system prompt on every API call using Sonnet 4.6 ($3/MTok input). Without caching, what is the input cost for 100 calls? With caching (assuming cache hits from call 2 onwards)?

a) Without: $18.00 / With: $4.03 (write $2.25 + 99 reads at $0.018 each)
b) Without: $18.00 / With: approximately $2.01 (write $0.225 + 99 reads at $0.018 each)
c) Without: $6.00 / With: $0.99
d) Without: $18.00 / With: $9.00

Q2. Your system prompt has three parts: (A) 40k tokens of static guidance, (B) a 200-token user name and role, (C) a 100-token session timestamp. Where should you place the cache breakpoint?

a) At the end of part C: cache the full prompt
b) At the end of part A: cache only the static content before any dynamic content
c) At the end of part B: cache static guidance and user identity
d) No breakpoint: session timestamps prevent caching from working at all

Q3. After deploying your caching implementation, you check cache_read_input_tokens in production and it shows near zero. What is the most likely cause?

a) The cache TTL is too short; switch to 1-hour TTL
b) Dynamic content (timestamp or user ID) is placed before the cache breakpoint, invalidating it on every request
c) The system prompt is under the minimum token threshold for caching
d) Streaming mode is incompatible with prompt caching

Answers: Q1: b · Q2: b · Q3: b

End of Lesson 03.