KV Cache: Long Horizon Tasks

What happens to your cache across agent loops, user edits, and persistent rules

What the cache looks like across an agent loop
t=0 load
AGENTS.md system prompt tool definitions
✓ Heavy prefix loaded once, cached on first pass, reused on every loop after.
t=1
prefix ✓ + user task
✓ Cache hit on prefix. Only the new task tokens are processed.
t=2
prefix ✓ task ✓ + tool_call: read_file()
✓ Append-only. Cache grows forward cleanly.
t=3
prefix ✓ task ✓ tool_call ✓ + tool_result: 8,000 tokens
⚠ Raw output appended. Cache grows fast, fine for now.
t=4 ✗ wrong
prefix ✓ task ✓ tool_result DELETED + next step
✗ Agent deletes raw output to "save space." Sequence breaks. Everything from the t=3 result onward is a miss and recomputes.
t=4 ✓ right
prefix ✓ task ✓ tool_call ✓ [ref#1: summary] + next step
✓ Verbose output replaced with placeholder. Sequence intact. Cache preserved.
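The core agent rule: never delete from history mid-loop. Replace verbose tool outputs with a compact tag like [Tool result masked: ref#1, 47 rows returned]. The timeline stays intact, VRAM pressure drops, and the model still knows what happened at each step. Swap each output once and keep the tag byte-identical afterward: the swap itself changes the tokens at that position, so only a stable placeholder keeps later loops hitting the cache.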
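One way to picture the masking rule is a helper that overwrites a tool message in place instead of removing it. This is a minimal sketch: the Message class, the mask_tool_result function, and the archive dictionary are hypothetical names made up for illustration, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


def mask_tool_result(history: list[Message], index: int, ref_id: int, summary: str) -> str:
    """Replace a verbose tool output in place with a compact tag.

    The message keeps its slot and role, so the step order is unchanged.
    Returns the raw output so the caller can store it outside the context.
    """
    msg = history[index]
    if msg.role != "tool":
        raise ValueError(f"message {index} is not a tool result")
    raw = msg.content
    msg.content = f"[Tool result masked: ref#{ref_id}, {summary}]"
    return raw


history = [
    Message("system", "AGENTS.md + tool definitions"),
    Message("user", "Count the rows in orders.csv"),
    Message("assistant", "tool_call: read_file('orders.csv')"),
    Message("tool", "id,total\n" + "1,9.99\n" * 47),
]

# Keep the raw output outside the prompt, keyed by its reference number.
archive = {1: mask_tool_result(history, 3, ref_id=1, summary="47 rows returned")}

print(len(history))        # 4: nothing was deleted
print(history[3].content)  # [Tool result masked: ref#1, 47 rows returned]
```

The raw output is still available through the archive if a later step needs it, while the context only carries the short tag.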
Across multiple loops, what it looks like in code
# Loop 1, prefix cached on first pass
context: [AGENTS.md][system][tools][user_task]   ← all cached

# Loop 2, appends cleanly
context: [prefix✓][task✓][tool_call][tool_result_raw]

# Loop 3, agent removes tool_result to slim context
context: [prefix✓][task✓][tool_call][ GAP ]   ← sequence broken
result: everything after the gap recomputes from scratch

# Loop 3, correct
context: [prefix✓][task✓][tool_call][ref#1]   ← placeholder, intact
result: cache hit, fast, model knows ref#1 = the file read
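The loops above can be mimicked with a toy prefix matcher. It assumes the cache reuses work only for the leading run of segments that is identical to the previous request, which is how prefix caching is generally described. The segment names are placeholders, cached_prefix_len is a hypothetical helper, and real caches match on tokens or token blocks rather than whole segments.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


sent_loop2 = ["AGENTS.md", "system", "tools", "user_task", "tool_call", "tool_result_raw"]

# Append-only: the whole previous request is a prefix of the new one.
loop3_append = sent_loop2 + ["next_step"]

# Deletion: the slot that held the tool result now holds something else.
loop3_deleted = sent_loop2[:5] + ["next_step"]

# Placeholder already in place on the previous request, then append.
sent_masked = sent_loop2[:5] + ["ref#1"]
loop4_masked = sent_masked + ["next_step"]

print(cached_prefix_len(sent_loop2, loop3_append))   # 6
print(cached_prefix_len(sent_loop2, loop3_deleted))  # 5
print(cached_prefix_len(sent_masked, loop4_masked))  # 6
```

In this model, a stable placeholder behaves like any other appended segment: once it has been sent, later loops match straight through it.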
The edit problem, why "fixing" a message is expensive
baseline
msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✓ msg 5 ✓
✓ Full thread cached. Every new message is near-instant.
edit msg 2
msg 1 ✓ msg 2 ✗ msg 3 ✗ msg 4 ✗ msg 5 ✗
✗ Edit at position 2 invalidates msgs 3–5. Everything downstream recomputes.
edit msg 4
msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✗ msg 5 ✗
⚠ Editing later is cheaper, but still invalidates everything after it.
append instead
msg 1 ✓ msg 2 ✓ msg 3 ✓ msg 4 ✓ msg 5 ✓ correction +
✓ Entire history stays cached. Only the correction tokens are new.
Editing earlier = more damage. An edit at message 2 of a 20-message thread invalidates the cached vectors for the 18 messages after it, and message 2 itself has to be reprocessed. An edit at message 19 invalidates 1. Neither is free. The only correction that invalidates nothing is a new message at the bottom.
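The arithmetic is simple enough to check directly. The function below is a hypothetical helper that counts whole messages and ignores how many tokens each one holds.

```python
def downstream_invalidated(thread_len: int, edited_pos: int) -> int:
    """Messages after the edited one (1-indexed) whose cached state is lost."""
    if not 1 <= edited_pos <= thread_len:
        raise ValueError("edited_pos must be between 1 and thread_len")
    return thread_len - edited_pos


print(downstream_invalidated(20, 2))   # 18
print(downstream_invalidated(20, 19))  # 1
print(downstream_invalidated(5, 2))    # 3 (msgs 3-5)
```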
The branch problem, editing creates a fork
# Original thread
msg1 → msg2 → msg3 → msg4 → msg5        ← all cached

# You edit msg2 and regenerate
msg1 → msg2' → msg3' → msg4' → msg5'    ← new branch, no cache overlap after msg1

# The original thread is gone.
# The model now lives in a parallel timeline.
# If you reference a decision made in the original thread,
# the model has no idea what you're talking about.

# With long-horizon coding tasks this compounds fast:
# architectural decisions, variable names, API contracts
# agreed in the old branch are invisible in the new one.
How rules, skills, and AGENTS.md interact with the cache
ideal order
AGENTS.md SKILL.md system prompt tools user message
✓ Static prefix. Cached once, reused across every request in the session.
common mistake
timestamp at top AGENTS.md SKILL.md user message
✗ One dynamic token at position 1 invalidates AGENTS.md, SKILL.md, everything. Full miss every single request.
mid-task skill
AGENTS.md ✓ SKILL_A ✓ + SKILL_B loaded now
⚠ Adding a new skill mid-thread appends cleanly. Cache hit on everything before, only SKILL_B is new. Fine.
skill reorder
AGENTS.md ✓ SKILL_B first SKILL_A second
✗ Swapping skill order across sessions changes token positions. No prefix match. Miss every time.
edit AGENTS.md
AGENTS.md v2 SKILL.md tools user msg
✗ Any edit to AGENTS.md between sessions changes the prefix. Full re-ingestion next load, even for a one-word change.
AGENTS.md is your most expensive asset. If it's 2,000 tokens and loads on every request, keeping it byte-for-byte identical means you pay to process it once and reuse the cached result afterward. Every edit, however small, costs a full re-ingestion next time. Treat it like a database schema: change it intentionally, not casually.
Optimal prefix order, most to least stable
[1] AGENTS.md / global rules      ← never changes mid-session
[2] SKILL.md files                ← stable, fixed order, always
[3] tool definitions              ← stable per session
[4] background docs / context     ← stable once loaded
[5] conversation history          ← grows forward only (append-only)
[6] user message                  ← dynamic, always at the end
[7] timestamps / session IDs      ← inside user message only, never above
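The ordering above might be assembled along these lines. This is a sketch under stated assumptions: build_prompt and fingerprint are hypothetical helpers, skills are sorted by name as a stand-in for "fixed order", and the prompt is treated as plain text rather than any particular API's message format.

```python
import hashlib
from datetime import datetime, timezone


def build_prompt(agents_md: str, skills: dict[str, str], tool_defs: str,
                 docs: str, history: list[str], user_text: str,
                 now: datetime) -> tuple[str, str]:
    """Return (static_prefix, dynamic_tail), ordered most to least stable."""
    static_parts = [agents_md]
    # Assumption: sorting by name stands in for "fixed order, always".
    static_parts += [skills[name] for name in sorted(skills)]
    static_parts += [tool_defs, docs]
    prefix = "\n\n".join(static_parts)

    # The timestamp lives inside the user message, below everything stable.
    user_message = f"[{now.isoformat()}] {user_text}"
    tail = "\n\n".join(history + [user_message])
    return prefix, tail


def fingerprint(text: str) -> str:
    """Short hash for checking that the prefix is byte-for-byte unchanged."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


t1 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 1, 9, 5, tzinfo=timezone.utc)

# Same skills passed in a different dict order, different timestamps.
p1, tail1 = build_prompt("RULES", {"b": "SKILL_B", "a": "SKILL_A"},
                         "TOOLS", "DOCS", [], "start", t1)
p2, tail2 = build_prompt("RULES", {"a": "SKILL_A", "b": "SKILL_B"},
                         "TOOLS", "DOCS", [tail1, "assistant: ok"], "next", t2)

print(fingerprint(p1) == fingerprint(p2))  # True: static prefix is identical
print(tail2.startswith(tail1))             # True: history only grew forward
```

Logging the prefix fingerprint on every request is a cheap way to notice when something dynamic has leaked above the conversation.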
Every action and its cache impact
| Action | Cache impact | Fix |
|---|---|---|
| Edit message N in thread | Invalidates all tokens from N to end. Full recompute of tail. | Append correction as new message at bottom. |
| Switch threads for related task | 100% miss. Separate threads share zero cache state. | Stay in one thread per task until done. |
| Switch models mid-conversation | 100% miss. Different architecture = different cache entirely. | Start a fresh thread if you need a different model. |
| Inject timestamp into system prompt | Every request is a miss. Token 1 changes = everything recomputes. | Move dynamic values to the user message (end). |
| Agent deletes tool output mid-loop | Sequence gap. Everything after the deleted block recomputes. | Replace with compact summary tag, never delete. |
| Edit AGENTS.md between sessions | Next session re-ingests entire prefix from edit point forward. | Batch edits. Treat it like a schema, not a scratchpad. |
| Load skills in different order | Different token positions = no prefix match. Miss every session. | Fix the load order. Most stable skill first, always. |
| Re-attach same file mid-thread | File reprocessed at new position. Old cached version stale. | Attach all files at thread start. Don't re-attach. |
| Reopen old thread after days | Cache evicted from server. Full re-ingestion on first message. | Keep sessions short. Summarise before closing. |
| Serialise JSON keys inconsistently | Same data, different token sequence = no prefix match. | Sort all serialised data deterministically before sending. |
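For the last row, Python's standard json module can produce a deterministic serialisation. A minimal sketch, where canonical_json is a hypothetical wrapper name and the tool-call payload is invented for illustration:

```python
import json


def canonical_json(obj) -> str:
    """Serialise with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


a = {"name": "read_file", "args": {"path": "a.txt", "mode": "r"}}
b = {"args": {"mode": "r", "path": "a.txt"}, "name": "read_file"}

print(json.dumps(a) == json.dumps(b))          # False: key order leaks into the text
print(canonical_json(a) == canonical_json(b))  # True
print(canonical_json(a))  # {"args":{"mode":"r","path":"a.txt"},"name":"read_file"}
```

sort_keys applies to nested objects as well, so the same data serialises to the same text regardless of how the dictionaries were built.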