KV Cache: Long Horizon Tasks

What the cache looks like across an agent loop

t=0 load

AGENTS.md system prompt tool definitions

✓ Heavy prefix loaded once, cached on first pass, free on every loop after

t=1

prefix ✓ + user task

✓ Cache hit on prefix. Only the new task tokens are processed.

t=2

prefix ✓ task ✓ + tool_call: read_file()

✓ Append-only. Cache grows forward cleanly.

t=3

prefix ✓ task ✓ tool_call ✓ + tool_result: 8,000 tokens

⚠ Raw output appended. Cache grows fast, fine for now.

t=4 ✗ wrong

prefix ✓ task ✓ tool_result DELETED + next step

✗ Agent deletes raw output to "save space." Sequence breaks. Full miss from t=3 forward.

t=4 ✓ right

prefix ✓ task ✓ tool_call ✓ [ref#1: summary] + next step

✓ Verbose output replaced with placeholder. Sequence intact. Cache preserved.

The core agent rule: never delete from history mid-loop. Replace verbose tool outputs with a compact tag like [Tool result masked: ref#1, 47 rows returned]. The timeline stays intact, VRAM pressure drops, and the model still knows what happened at each step.

Across multiple loops, what it looks like in code

# Loop 1, prefix cached on first pass context: [AGENTS.md][system][tools][user_task] ← all cached # Loop 2, appends cleanly context: [prefix✓][task✓][tool_call][tool_result_raw] # Loop 3, agent removes tool_result to slim context context: [prefix✓][task✓][tool_call][ GAP ] ← sequence broken result: everything after the gap recomputes from scratch # Loop 3, correct context: [prefix✓][task✓][tool_call][ref#1] ← placeholder, intact result: cache hit, fast, model knows ref#1 = the file read

The edit problem, why "fixing" a message is expensive

baseline

msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✓ msg 5 ✓

✓ Full thread cached. Every new message is near-instant.

edit msg 2

msg 1 ✓ msg 2 ✗ msg 3 ✗ msg 4 ✗ msg 5 ✗

✗ Edit at position 2 invalidates msgs 3–5. Everything downstream recomputes.

edit msg 4

msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✗ msg 5 ✗

⚠ Editing later is cheaper, but still invalidates everything after it.

append instead

msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✓ msg 5 ✓ correction +

✓ Entire history stays cached. Only the correction tokens are new.

Editing earlier = more damage. An edit at message 2 of a 20-message thread invalidates 18 messages worth of cached vectors. An edit at message 19 invalidates 1. But neither is free. The only zero-cost correction is a new message at the bottom.

The branch problem, editing creates a fork

# Original thread msg1 → msg2 → msg3 → msg4 → msg5 ← all cached # You edit msg2 and regenerate msg1 → msg2' → msg3' → msg4' → msg5' ← new branch, zero cache overlap # The original thread is gone. # The model now lives in a parallel timeline. # If you reference a decision made in the original thread, # the model has no idea what you're talking about. # With long-horizon coding tasks this compounds fast: # architectural decisions, variable names, API contracts # agreed in the old branch are invisible in the new one.

How rules, skills, and AGENTS.md interact with the cache

ideal order

AGENTS.md SKILL.md system prompt tools → user message

✓ Static prefix. Cached once, reused across every request in the session.

common mistake

timestamp at top AGENTS.md SKILL.md user message

✗ One dynamic token at position 1 invalidates AGENTS.md, SKILL.md, everything. Full miss every single request.

mid-task skill

AGENTS.md ✓ SKILL_A ✓ + SKILL_B loaded now

⚠ Adding a new skill mid-thread appends cleanly. Cache hit on everything before, only SKILL_B is new. Fine.

skill reorder

AGENTS.md ✓ SKILL_B first SKILL_A second

✗ Swapping skill order across sessions changes token positions. No prefix match. Miss every time.

edit AGENTS.md

AGENTS.md v2 SKILL.md tools user msg

✗ Any edit to AGENTS.md between sessions changes the prefix. Full re-ingestion next load, even for a one-word change.

AGENTS.md is your most expensive asset. If it's 2,000 tokens and loads on every request, keeping it byte-for-byte identical means you pay to process it once, then it's free. Every edit, however small, costs a full re-ingestion next time. Treat it like a database schema: change it intentionally, not casually.

Optimal prefix order, most to least stable

[1] AGENTS.md / global rules ← never changes mid-session [2] SKILL.md files ← stable, fixed order, always [3] tool definitions ← stable per session [4] background docs / context ← stable once loaded [5] conversation history ← grows forward only (append-only) [6] user message ← dynamic, always at the end [7] timestamps / session IDs ← inside user message only, never above

Every action and its cache impact

Action	Cache impact	Fix
Edit message N in thread	Invalidates all tokens from N to end. Full recompute of tail.	Append correction as new message at bottom
Switch threads for related task	100% miss. Separate threads share zero cache state.	Stay in one thread per task until done
Switch models mid-conversation	100% miss. Different architecture = different cache entirely.	Start a fresh thread if you need a different model
Inject timestamp into system prompt	Every request is a miss. Token 1 changes = everything recomputes.	Move dynamic values to the user message (end)
Agent deletes tool output mid-loop	Sequence gap. Everything after the deleted block recomputes.	Replace with compact summary tag, never delete
Edit AGENTS.md between sessions	Next session re-ingests entire prefix from edit point forward.	Batch edits. Treat it like a schema, not a scratchpad.
Load skills in different order	Different token positions = no prefix match. Miss every session.	Fix the load order. Most stable skill first, always.
Re-attach same file mid-thread	File reprocessed at new position. Old cached version stale.	Attach all files at thread start. Don't re-attach.
Reopen old thread after days	Cache evicted from server. Full re-ingestion on first message.	Keep sessions short. Summarise before closing.
Serialise JSON keys inconsistently	Same data, different token sequence = no prefix match.	Sort all serialised data deterministically before sending