KV Cache: Long Horizon Tasks
What happens to your cache across agent loops, user edits, and persistent rules
Agent loops
User edits
Rules / AGENTS.md
Full impact table
What the cache looks like across an agent loop
t=0 load
AGENTS.md
system prompt
tool definitions
✓ Heavy prefix loaded once, cached on first pass, free on every loop after
t=1
prefix ✓
+
user task
✓ Cache hit on prefix. Only the new task tokens are processed.
t=2
prefix ✓
task ✓
+
tool_call: read_file()
✓ Append-only. Cache grows forward cleanly.
t=3
prefix ✓
task ✓
tool_call ✓
+
tool_result: 8,000 tokens
⚠ Raw output appended. Cache grows fast, fine for now.
t=4 ✗ wrong
prefix ✓
task ✓
tool_result DELETED
+
next step
✗ Agent deletes raw output to "save space." Sequence breaks. Everything from the deleted block forward is a cache miss.
t=4 ✓ right
prefix ✓
task ✓
tool_call ✓
[ref#1: summary]
+
next step
✓ Verbose output replaced with placeholder. Sequence intact. Cache preserved.
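The core agent rule: never delete from history mid-loop. Replace verbose tool outputs with a compact tag like [Tool result masked: ref#1, 47 rows returned]. Swap the tag in once and keep it byte-identical afterwards, because any change at that position forces a recompute from there forward. The timeline stays intact, VRAM pressure drops, and the model still knows what happened at each step.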
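The masking rule can be sketched in a few lines. This is a minimal illustration, not any framework's API: the `Message` class, the `mask_tool_results` helper, and the `max_chars` threshold are all hypothetical names chosen for this example, and the tag reports a line count rather than rows.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", "assistant", or "tool" (assumed roles)
    content: str


def mask_tool_results(history, max_chars=200):
    """Return a new history where long tool outputs become short placeholders.

    No message is ever removed, so the number and order of steps is unchanged.
    The tag is derived only from the message itself, so re-running the function
    on the same history produces the same text every time.
    """
    masked = []
    ref = 0
    for msg in history:
        if msg.role == "tool":
            ref += 1  # every tool result gets a stable reference number
            if len(msg.content) > max_chars:
                lines = msg.content.count("\n") + 1
                tag = f"[Tool result masked: ref#{ref}, {lines} lines returned]"
                msg = replace(msg, content=tag)
        masked.append(msg)
    return masked
```

Because the placeholder is deterministic, the masked history stays stable across later loops instead of shifting each time the context is rebuilt.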
Across multiple loops, what it looks like in code
# Loop 1, prefix cached on first pass
context: [AGENTS.md][system][tools][user_task] ← all cached
# Loop 2, appends cleanly
context: [prefix✓][task✓][tool_call][tool_result_raw]
# Loop 3, agent removes tool_result to slim context
context: [prefix✓][task✓][tool_call][ GAP ] ← sequence broken
result: everything after the gap recomputes from scratch
# Loop 3, correct
context: [prefix✓][task✓][tool_call][ref#1] ← placeholder, intact
result: cache hit, fast, model knows ref#1 = the file read
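The same timeline can be modelled as a toy prefix comparison. This is a sketch under a simplifying assumption: it compares whole labelled segments, whereas a real cache matches on token sequences. The `reused_segments` helper and the segment labels are hypothetical.

```python
def reused_segments(previous, current):
    """Count the leading segments that are identical in both contexts.

    Reuse stops at the first position where the two sequences differ.
    """
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


loop2 = ["AGENTS.md", "system", "tools", "user_task", "tool_call", "tool_result_raw"]

# Append-only: all six earlier segments are reused.
appended = loop2 + ["next_step"]
print(reused_segments(loop2, appended))    # 6

# Tool result deleted: reuse stops where the sequence first differs.
deleted = loop2[:5] + ["next_step"]
print(reused_segments(loop2, deleted))     # 5

# Placeholder already in place: later loops append after it and reuse it.
masked = loop2[:5] + ["ref#1"]
next_loop = masked + ["next_step"]
print(reused_segments(masked, next_loop))  # 6
```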
The edit problem, why "fixing" a message is expensive
baseline
msg 1 ✓
msg 2 ✓
msg 3 ✓
msg 4 ✓
msg 5 ✓
✓ Full thread cached. Every new message is near-instant.
edit msg 2
msg 1 ✓
msg 2 ✗
msg 3 ✗
msg 4 ✗
msg 5 ✗
✗ Edit at position 2 invalidates msg 2 and everything after it. Msgs 2–5 all recompute.
edit msg 4
msg 1 ✓
msg 2 ✓
msg 3 ✓
msg 4 ✗
msg 5 ✗
⚠ Editing later is cheaper, but still invalidates everything after it.
append instead
msg 1 ✓
msg 2 ✓
msg 3 ✓
msg 4 ✓
msg 5 ✓
correction +
✓ Entire history stays cached. Only the correction tokens are new.
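Editing earlier = more damage. An edit at message 2 of a 20-message thread invalidates the cached vectors for message 2 and all 18 messages after it. An edit at message 19 invalidates message 19 and the 1 message after it. Neither is free. The only correction that leaves the existing cache untouched is a new message at the bottom, where you pay only for the correction's own tokens.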
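A rough worked example makes the difference concrete. The token counts below are invented for illustration, and `tokens_to_reprocess` is a hypothetical helper. It counts the edited message itself plus everything after it, since all of that has to be processed again.

```python
def tokens_to_reprocess(message_tokens, edited_index):
    """Tokens processed again when message `edited_index` (1-based) is edited.

    Includes the edited message and every message that follows it.
    """
    if not 1 <= edited_index <= len(message_tokens):
        raise ValueError("edited_index must be between 1 and the thread length")
    return sum(message_tokens[edited_index - 1:])


thread = [500] * 20  # hypothetical: 20 messages of 500 tokens each

print(tokens_to_reprocess(thread, 2))   # 9500
print(tokens_to_reprocess(thread, 19))  # 1000

# Appending a correction instead: only the new message is processed.
correction_tokens = 60                  # hypothetical size of the correction
print(correction_tokens)                # 60
```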
The branch problem, editing creates a fork
# Original thread
msg1 → msg2 → msg3 → msg4 → msg5 ← all cached
# You edit msg2 and regenerate
msg1 → msg2' → msg3' → msg4' → msg5' ← new branch, cache shared only up to msg1
# The original thread is gone.
# The model now lives in a parallel timeline.
# If you reference a decision made in the original thread,
# the model has no idea what you're talking about.
# With long-horizon coding tasks this compounds fast:
# architectural decisions, variable names, API contracts
# agreed in the old branch are invisible in the new one.
How rules, skills, and AGENTS.md interact with the cache
ideal order
AGENTS.md
SKILL.md
system prompt
tools
→
user message
✓ Static prefix. Cached once, reused across every request in the session.
common mistake
timestamp at top
AGENTS.md
SKILL.md
user message
✗ One dynamic token at position 1 invalidates AGENTS.md, SKILL.md, everything. Full miss every single request.
mid-task skill
AGENTS.md ✓
SKILL_A ✓
+
SKILL_B loaded now
⚠ Adding a new skill mid-thread appends cleanly. Cache hit on everything before, only SKILL_B is new. Fine.
skill reorder
AGENTS.md ✓
SKILL_B first
SKILL_A second
✗ Swapping skill order across sessions changes token positions. The prefix match ends right after AGENTS.md, so every skill and everything below it is a miss.
edit AGENTS.md
AGENTS.md v2
SKILL.md
tools
user msg
✗ Any edit to AGENTS.md between sessions changes the prefix. Full re-ingestion next load, even for a one-word change.
AGENTS.md is your most expensive asset. If it's 2,000 tokens and loads on every request, keeping it byte-for-byte identical means you pay to process it once, then it's free. Every edit, however small, costs a full re-ingestion next time. Treat it like a database schema: change it intentionally, not casually.
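One way to notice accidental prefix changes is to fingerprint the static parts and compare the result between sessions. This is a sketch, assuming the prefix parts are available as strings; `prefix_fingerprint` is a hypothetical helper, not part of any agent framework.

```python
import hashlib


def prefix_fingerprint(*parts: str) -> str:
    """Hash the static prefix parts in order.

    Any byte-level change, or any reordering of parts, yields a different digest.
    """
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length prefix keeps the boundaries between parts unambiguous.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


before = prefix_fingerprint("Always write tests.", "SKILL A", "SKILL B")
after_edit = prefix_fingerprint("Always write tests!", "SKILL A", "SKILL B")
after_swap = prefix_fingerprint("Always write tests.", "SKILL B", "SKILL A")

print(before == after_edit)  # False: a one-character edit changes the prefix
print(before == after_swap)  # False: so does a reorder
```

Logging the fingerprint at session start gives a cheap signal that the next request will re-ingest the prefix.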
Optimal prefix order, most to least stable
[1] AGENTS.md / global rules ← never changes mid-session
[2] SKILL.md files ← stable, fixed order, always
[3] tool definitions ← stable per session
[4] background docs / context ← stable once loaded
[5] conversation history ← grows forward only (append-only)
[6] user message ← dynamic, always at the end
[7] timestamps / session IDs ← inside user message only, never above
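The ordering above could be enforced by a small assembly function. This is a sketch under stated assumptions: `build_context` and its parameters are hypothetical, each part is treated as a plain string, and sorting skills by name is just one way to get a fixed order (the list above asks only that the order never change).

```python
from datetime import datetime, timezone


def build_context(agents_md, skills, tools, docs, history, user_text, now=None):
    """Assemble context segments from most stable to least stable.

    `skills` maps a skill name to its text; names are sorted so the order is
    identical no matter how the mapping was built.
    """
    now = now or datetime.now(timezone.utc)
    stable = [agents_md]
    stable += [skills[name] for name in sorted(skills)]
    stable += [tools, docs]
    # The timestamp goes inside the final user message, never above the prefix.
    user_message = f"{user_text}\n[sent at {now.isoformat()}]"
    return stable + list(history) + [user_message]
```

Two requests that differ only in timestamp, or in the order the skills were registered, then produce identical segments everywhere except the final user message.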
Every action and its cache impact
| Action | Cache impact | Fix |
|---|---|---|
| Edit message N in thread | Invalidates all tokens from N to end. Full recompute of tail. | Append correction as new message at bottom |
| Switch threads for related task | 100% miss. Separate threads share zero cache state. | Stay in one thread per task until done |
| Switch models mid-conversation | 100% miss. Different architecture = different cache entirely. | Start a fresh thread if you need a different model |
| Inject timestamp into system prompt | Every request is a miss. Token 1 changes = everything recomputes. | Move dynamic values to the user message (end) |
| Agent deletes tool output mid-loop | Sequence gap. Everything after the deleted block recomputes. | Replace with compact summary tag, never delete |
| Edit AGENTS.md between sessions | Next session re-ingests entire prefix from edit point forward. | Batch edits. Treat it like a schema, not a scratchpad. |
| Load skills in different order | Different token positions = no prefix match. Miss every session. | Fix the load order. Most stable skill first, always. |
| Re-attach same file mid-thread | File reprocessed at new position. Old cached version stale. | Attach all files at thread start. Don't re-attach. |
| Reopen old thread after days | Cache evicted from server. Full re-ingestion on first message. | Keep sessions short. Summarise before closing. |
| Serialise JSON keys inconsistently | Same data, different token sequence = no prefix match. | Sort all serialised data deterministically before sending |
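For the last row, deterministic serialisation in Python can be as simple as the sketch below. The `canonical_json` wrapper is a hypothetical name; `sort_keys` and `separators` are standard `json.dumps` parameters.

```python
import json


def canonical_json(value) -> str:
    """Serialise with sorted keys and fixed separators.

    Equal data always produces the same string, at every nesting level.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


a = {"b": 1, "a": {"d": 2, "c": 3}}
b = {"a": {"c": 3, "d": 2}, "b": 1}

print(canonical_json(a))                       # {"a":{"c":3,"d":2},"b":1}
print(canonical_json(a) == canonical_json(b))  # True
```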