What it does. Reuses the parts of every prompt that do not change from turn to turn (system prompt, conversation history, tool schemas) so your provider charges the cheap “cached read” rate instead of the full input rate.
Who it is for. Anyone running long-lived agents in chat. The page leads with plain English; a developer reference table follows.
LLM providers charge 10–20x less for cached prompt reads than for cache writes. That difference only shows up when the prompt prefix sent to the provider is byte-for-byte identical to the one sent on the previous turn. Comis treats cache stability as a first-class goal: every feature in this document exists to protect that stability across the full lifetime of a conversation.
A 76-call Claude Opus session in production achieved a 16.9x cache read/write ratio: 94% of input tokens served from cache at $0.50/MTok instead of $15/MTok uncached. (Marketing benchmark; see the README for the exact session.) On a typical Claude Sonnet long-running channel agent, that ratio cuts the daily cost from roughly $26 uncached to $5 cached for the same conversation volume, just by keeping the prefix stable.
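The 94% figure follows from the 16.9x ratio if you assume every input token in the session was billed as either a cache read or a cache write. A minimal TypeScript sketch of that arithmetic (the function name is illustrative and not part of Comis):

```typescript
// Fraction of input tokens served from cache, given a read/write token ratio.
// Assumption: every input token is billed as either a cache read or a cache write.
function cachedReadFraction(readWriteRatio: number): number {
  if (readWriteRatio < 0) throw new RangeError("ratio must be non-negative");
  return readWriteRatio / (readWriteRatio + 1);
}

const fraction = cachedReadFraction(16.9); // 16.9 / 17.9
const percent = Math.round(fraction * 100); // 94
```

If a session also carries uncached input tokens, the true cached share is lower than this estimate.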
README and marketing copy used to advertise “20 cache optimizations.” The shipped count is 15+ specific optimizations across the executor, context engine, and provider adapters. They are listed below.

The 15+ optimizations at a glance

Each row is one named mechanism in code. The “Why” column explains what it prevents.
| # | Optimization | What it does | Why it matters |
| --- | --- | --- | --- |
| 1 | Adaptive TTL escalation | Starts every session at “short” (5m); promotes to “long” (1h) only after 3+ turns of confirmed cache reads | Avoids paying 1h-write premium on sessions that would have ended in minutes |
| 2 | Fast-path large-write escalation | If turn 1 writes more than 20K cache tokens, escalates to “long” immediately on turn 2 | Big system prompts deserve long retention up front |
| 3 | Prefix-instability detection | If cache reads stay flat at the system-prompt baseline for 5 consecutive turns, forces TTL back to “short” | Stops bleeding 1h writes when something keeps invalidating the prefix |
| 4 | Cache fence (lastBreakpointIndex) | Tracks the index of the highest cache_control marker; layers below the fence are skipped on subsequent passes | Prevents context-engine layers from rewriting cached bytes and blowing the cache |
| 5 | Sub-agent spawn staggering | Concurrent sub-agents are started with offsets so they reuse the parent’s cached prefix instead of all writing fresh entries | Concurrent fan-out without the N× write cost |
| 6 | Two-phase cache-break detection | Pre-call snapshot of system + tools + metadata; post-call AND-threshold (>5% drop AND >2K tokens) attributes the cause | Differentiates real breaks from normal cache-write noise |
| 7 | Cache-break diff writer | When a real break is detected, writes a JSONL diff to ~/.comis/cache-breaks/ showing exactly what changed | Lets you debug “why did my cache invalidate?” after the fact |
| 8 | Cache priming for sub-agents | Sub-agent spawn packets carry the parent’s cached prefix so the child starts warm | Child gets the parent’s deal without the parent paying twice |
| 9 | Cache-write suppression on sub-agents | Sub-agents read the parent’s cache but do not write fresh long-TTL entries of their own | No double-billing the same prefix |
| 10 | Per-zone retention | The recent-message zone always uses “short”; older zones get “long” after escalation | Recent content churns; older content is stable, so each gets the right TTL |
| 11 | Anthropic ephemeral cache | cache_control: { type: "ephemeral", ttl: "5m" or "1h" } markers placed at the right block boundaries | Native Anthropic cache, used by 100% of Claude calls |
| 12 | Gemini CachedContent | Explicit CachedContent API with SHA-256 content hashing + 50% TTL refresh on active sessions | Gemini’s per-object cache lifecycle, opt-in via geminiCache.enabled |
| 13 | OpenAI completion-storage flag | store: true on Responses-API requests when storeCompletions is set | Lets OpenAI keep generated outputs available for replay/debugging |
| 14 | Microcompaction at write time | Tool results above the per-tool inline threshold are saved to disk and replaced with a tiny reference at the moment they are written | Keeps verbose tool output out of every future turn’s prefix |
| 15 | Schema stripping | After discover_tools injects a schema block, it is replaced in history with a one-line summary on subsequent turns | Recovers ~1.7K tokens per discovery, which would otherwise sit in the cached prefix forever |
| 16 | Tool deferral with discovery | Non-essential tool schemas are removed from the tools array and re-injected only when the agent calls discover_tools | Saves ~81% of the tokens the LLM would spend on tool definitions every turn |
The mechanisms in the table are spread across several locations: packages/agent/src/executor/adaptive-cache-retention.ts (rows 1–3, 10), packages/agent/src/context-engine/context-engine.ts (row 4), packages/agent/src/spawn/lifecycle-hooks.ts (rows 5, 8–9), packages/agent/src/executor/cache-break-detection.ts and cache-break-diff-writer.ts (rows 6–7), packages/agent/src/executor/gemini-cache-manager.ts (row 12), packages/agent/src/executor/stream-wrappers/request-body-injector.ts (rows 11, 13), and packages/skills/src/builtin/... (rows 14–16 wire into the executor). The rest of this page explains the most user-visible of these optimizations in detail.
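The AND-threshold in row 6 is not covered in detail on this page, so here is a rough sketch of how such a check could look. It assumes the drop is measured in cache-read tokens between two consecutive turns; the function and constant names are hypothetical and do not come from cache-break-detection.ts.

```typescript
const MIN_DROP_RATIO = 0.05; // ">5% drop"
const MIN_DROP_TOKENS = 2000; // ">2K tokens"

// Returns true only when BOTH conditions hold, so small absolute dips on large
// prefixes and large relative dips on tiny prefixes are treated as noise.
function isCacheBreak(prevReadTokens: number, currReadTokens: number): boolean {
  if (prevReadTokens <= 0) return false; // nothing cached yet, nothing to break
  const drop = prevReadTokens - currReadTokens;
  return drop / prevReadTokens > MIN_DROP_RATIO && drop > MIN_DROP_TOKENS;
}
```

Requiring both conditions is what separates a real break from ordinary turn-to-turn variation: a 10% drop on a 10K prefix is only 1K tokens and is ignored, as is a 3K drop on a 100K prefix.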

How context grows (and why it costs money)

Each conversation turn adds content to the context window your agent sends to the LLM. Here is what fills it:
  • Conversation turns. Every user message and assistant reply is stored in a JSONL session file and replayed on the next call.
  • Tool results. Every tool invocation appends a result block. A bash tool returning 50K characters of log output becomes a permanent fixture unless Comis intervenes.
  • System prompt. Identity files (AGENTS.md), security rules, skill manifests, and workspace instructions are assembled once and sent every turn.
  • Memory retrieval. Before each call, the RAG layer fetches up to five memory entries and injects them as additional context.
  • JIT skill guides. Verbose operational guides for specific tools are withheld from the system prompt and injected on first use.
Comis never silently deletes conversation messages. The full history is always persisted to disk. What changes between turns is how much of that history the context engine chooses to forward to the model.

Adaptive retention

When Comis places a cache_control marker on a prompt block, it signals the provider to cache that block. But there are two TTLs to choose from: short (5 minutes) and long (1 hour). Writing at 1h TTL on a cold session wastes money if the cache entry expires before it is ever read. Adaptive retention solves this. Here is what is actually happening under the hood:
  1. Cold start. The first call writes at "short" (5m) TTL. This minimizes write cost on sessions that may not continue.
  2. Escalation check. After each turn, Comis records the cacheReadTokens returned by the provider. Once confirmed reads arrive, the session is “warm.”
  3. Promotion. After 3 turns of confirmed cache reads (or immediately on turn 2 if the first turn wrote more than 20K cache tokens — a signal of a large system prompt), retention escalates to "long" (1h).
  4. Prefix instability detection. If cache reads stay at or below the system prompt baseline for 5 consecutive turns, the prefix is unstable and further 1h writes would go to waste. Comis forces retention back to "short" until the prefix stabilizes.
Sub-agents use static "short" retention. They complete in under 60 seconds and never accumulate enough cache reads to justify adaptive tracking overhead.
The config key adaptiveCacheRetention (default true) controls this behavior. Set it to false to use the static value from cacheRetention instead.
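The four steps above can be sketched as a small state machine. This is an illustration under stated assumptions, not the code in adaptive-cache-retention.ts: the class and field names are hypothetical, a "confirmed read" is taken to mean any turn with cacheReadTokens > 0, and the instability check takes precedence over promotion.

```typescript
type Retention = "short" | "long";

interface TurnUsage {
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

const WARM_TURNS_TO_PROMOTE = 3;
const FAST_PATH_WRITE_TOKENS = 20000;
const FLAT_TURNS_TO_DEMOTE = 5;

class AdaptiveRetention {
  private retention: Retention = "short"; // cold start
  private turn = 0;
  private warmTurns = 0; // turns with any confirmed cache read
  private flatTurns = 0; // consecutive turns at or below the baseline

  constructor(private readonly systemPromptBaseline: number) {}

  // Record the usage reported for the turn that just finished and return
  // the retention to use for the next call.
  record(usage: TurnUsage): Retention {
    this.turn += 1;
    if (usage.cacheReadTokens > 0) this.warmTurns += 1;
    if (usage.cacheReadTokens <= this.systemPromptBaseline) this.flatTurns += 1;
    else this.flatTurns = 0;

    if (this.flatTurns >= FLAT_TURNS_TO_DEMOTE) {
      this.retention = "short"; // prefix looks unstable
    } else if (this.turn === 1 && usage.cacheWriteTokens > FAST_PATH_WRITE_TOKENS) {
      this.retention = "long"; // fast path: large first write
    } else if (this.warmTurns >= WARM_TURNS_TO_PROMOTE) {
      this.retention = "long"; // warmed up
    }
    return this.retention;
  }
}
```

In this sketch a demoted session is promoted again as soon as reads grow past the baseline, which is one reading of "until the prefix stabilizes".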

Cache fence

The context engine pipeline modifies conversation history before each LLM call. Without coordination, those modifications would land inside the cached prefix and invalidate it — defeating the purpose of caching. The cache fence prevents this. After each call, Comis records the message index of the highest cache_control breakpoint as the fence index. Any context-engine layer that would normally modify content at or below that index skips those messages instead. Concretely:
  • The observation masker (which replaces old tool results with placeholders) skips messages at or below the fence.
  • Schema stripping (which strips verbose discovery schemas from session history) skips messages at or below the fence.
  • Microcompaction (which offloads oversized tool results to disk) skips messages at or below the fence.
The fence index is stored in pre-trim space and correctly translated after the history window trims old messages from the front of the session, so protection survives across turns with long histories.
The cache fence is provider-specific. It applies to Anthropic sessions. Gemini uses explicit CachedContent API instead (see the Gemini section in the configuration table).
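As an illustration of the fence idea, the sketch below applies a history-modifying layer only above the fence and translates a pre-trim fence index after messages are trimmed from the front. The Message shape and both function names are assumptions made for this example, and using -1 to mean "no protected messages" is a convention chosen here.

```typescript
interface Message {
  role: string;
  content: string;
}

// Apply a transform only to messages strictly above the fence index.
// Messages at or below the fence are returned untouched so their bytes stay cached.
function applyLayerAboveFence(
  messages: Message[],
  fenceIndex: number,
  transform: (m: Message) => Message,
): Message[] {
  return messages.map((m, i) => (i <= fenceIndex ? m : transform(m)));
}

// Convert a fence index recorded before trimming into an index in the
// trimmed array. Returns -1 when every fenced message was trimmed away.
function translateFence(preTrimFence: number, trimmedFromFront: number): number {
  return Math.max(-1, preTrimFence - trimmedFromFront);
}
```

For example, a fence recorded at index 10 maps to index 6 after four messages are trimmed from the front.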

Microcompaction

Microcompaction runs at write time — the moment a tool result is appended to the session file, not at the start of the next turn. This is different from the LLM compaction layer, which runs during context assembly. When a tool result exceeds the per-tool inline threshold, Comis:
  1. Saves the full result to a JSON file on disk.
  2. Writes a compact reference into the session in its place: [Tool result offloaded to disk: ...] with a head/tail preview (1,500 and 500 characters respectively).
  3. The agent can recover the full content at any time using the read tool with the disk path.
Per-tool inline thresholds:
| Tool | Inline limit |
| --- | --- |
| Default tools | 8,000 chars |
| MCP tools (mcp__*) | 15,000 chars |
| read (file read) | 15,000 chars |
| Hard cap (any tool) | 100,000 chars |
Results at or below the threshold are never offloaded. Results that are already offloaded are skipped on subsequent passes. The observation masker recognizes the [Tool result offloaded to disk: prefix and also skips those entries.
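The threshold lookup and the head/tail preview can be sketched as pure functions. This is a simplified illustration: writing the file to disk is omitted, the text that follows the [Tool result offloaded to disk: prefix is an assumption (the page elides it), and treating the hard cap as an upper bound on a hypothetical per-agent override is also an assumption.

```typescript
const OFFLOAD_PREFIX = "[Tool result offloaded to disk:";
const HEAD_CHARS = 1500;
const TAIL_CHARS = 500;
const HARD_CAP_CHARS = 100000;

// Per-tool inline limit. `override` is a hypothetical per-agent setting;
// it is clamped to the hard cap (an assumption about what the cap means).
function inlineLimitFor(toolName: string, override?: number): number {
  const base = toolName.indexOf("mcp__") === 0 || toolName === "read" ? 15000 : 8000;
  return Math.min(override === undefined ? base : override, HARD_CAP_CHARS);
}

// Returns the text to store in the session for a tool result.
// Saving the full result to `diskPath` is left to the caller in this sketch.
function microcompact(toolName: string, result: string, diskPath: string): string {
  if (result.indexOf(OFFLOAD_PREFIX) === 0) return result; // already offloaded
  if (result.length <= inlineLimitFor(toolName)) return result; // small enough to stay inline
  const head = result.slice(0, HEAD_CHARS);
  const tail = result.slice(-TAIL_CHARS);
  // Reference layout after the prefix is illustrative only.
  return `${OFFLOAD_PREFIX} ${diskPath}]\n${head}\n...\n${tail}`;
}
```

Because the function is idempotent on already-offloaded text, it can run on every write without double-wrapping a reference.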

JIT guide injection

Verbose operational guides for specific tools — like the workspace customization guide for agents_manage, or the task delegation section — would consume significant tokens if included in the system prompt for every turn. Most turns never use these tools. JIT guide injection defers them. The guides sit in memory, indexed by tool name. When a tool produces its first successful result in a session, Comis appends the corresponding guide as a text block at the end of that result:
---
[Tool Guide - shown once per session]
<guide content here>
---
The deliveredGuides set tracks which guides have been delivered in the current session. Once delivered, the guide is never injected again. On session reset, the set is cleared.
If a tool call errors on its first invocation, the guide is not consumed: the delivery slot stays open so a successful retry can fire it. Earlier versions consumed the slot on a failed call, a subtle bug that caused some guides to silently disappear.
Two guide sources exist independently:
  • Tool guides — per-tool operational instructions (keyed by exact tool name).
  • System prompt guides — deferred system prompt sections triggered by a tool name, such as the privileged tools section which fires once on the first use of any of 10 privileged tools.
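The delivery rules described above (once per session, only on success, cleared on reset) can be sketched as follows. The GuideInjector class and its method names are hypothetical; only the deliveredGuides set and the guide block layout come from this page.

```typescript
class GuideInjector {
  private deliveredGuides = new Set<string>();

  // `guides` maps an exact tool name to its guide text.
  constructor(private readonly guides: Map<string, string>) {}

  // Returns the tool result, with the guide appended on the first successful call.
  decorate(toolName: string, result: string, isError: boolean): string {
    const guide = this.guides.get(toolName);
    if (isError || guide === undefined || this.deliveredGuides.has(toolName)) {
      return result; // errors do not consume the delivery slot
    }
    this.deliveredGuides.add(toolName);
    return `${result}\n---\n[Tool Guide - shown once per session]\n${guide}\n---`;
  }

  // Called on session reset so guides can be delivered again.
  reset(): void {
    this.deliveredGuides.clear();
  }
}
```

Marking a guide as delivered only after the error check is what keeps a failed first call from swallowing it.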

Tool deferral and parallelism

Deferral

With 50+ tools available, sending every schema to the LLM on every call wastes ~81% of the tokens spent on tool definitions. Comis defers non-essential tools using a BM25-backed discovery tool (discover_tools). Deferred tools are removed from the tools parameter entirely. Their schemas are stored in a DiscoveryTracker. When the agent needs a deferred tool, it calls discover_tools with a natural-language query. Comis runs BM25 search over the deferred set and re-injects matching tool schemas mid-turn. Tool deferral is configurable per-agent:
| deferredTools.mode | Behavior |
| --- | --- |
| "auto" (default) | Rule + budget heuristics decide what to defer |
| "always" | All non-core tools are deferred |
| "never" | Deferral is disabled |
Use deferredTools.neverDefer and deferredTools.alwaysDefer to pin specific tools regardless of mode.
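To make the discovery step concrete, here is a compact BM25 ranking over a set of deferred tools. It is a textbook BM25 sketch, not the Comis implementation: the DeferredTool shape, the tokenizer, indexing name plus description, and the k1 = 1.2 and b = 0.75 defaults are all assumptions.

```typescript
interface DeferredTool {
  name: string;
  description: string;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Rank deferred tools against a natural-language query; returns matching
// tool names, best match first. Tools with no matching term are dropped.
function discoverTools(query: string, deferred: DeferredTool[], k1 = 1.2, b = 0.75): string[] {
  const docs = deferred.map((t) => tokenize(`${t.name} ${t.description}`));
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(docs.length, 1);
  const terms = tokenize(query);

  const scored = deferred.map((tool, i) => {
    let score = 0;
    for (const term of terms) {
      const tf = docs[i].filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = docs.filter((d) => d.indexOf(term) !== -1).length;
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * docs[i].length) / avgLen));
    }
    return { name: tool.name, score };
  });

  return scored
    .filter((s) => s.score > 0)
    .sort((x, y) => y.score - x.score)
    .map((s) => s.name);
}
```

A caller would then re-inject the schemas for the returned names into the tools parameter; how many results to keep is a policy choice this sketch leaves open.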

Parallelism

The SDK can execute multiple tool calls concurrently within a single turn. This is safe for read-only tools but unsafe for tools that modify files, databases, or external services — ordering conflicts can corrupt state. Comis classifies each tool as read-only or mutating. Read-only tools (file reads, web fetches, memory searches) run concurrently. Mutating tools share an async mutex and run one at a time, even when the SDK fires them in parallel. The classification is a static set; MCP tools default to the serialized path.
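One way to picture the serialized path is a promise-chain mutex shared by every tool that is not in the read-only set. The tool names in the set and the executeTool wrapper are hypothetical; the sketch only demonstrates the pattern described above.

```typescript
// Minimal async mutex: each task starts only after the previous one settles.
class AsyncMutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow the outcome so one failure does not block later tasks.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Static classification (names are illustrative). Anything not listed,
// including MCP tools, falls through to the serialized path.
const READ_ONLY_TOOLS = new Set<string>(["read", "web_fetch", "memory_search"]);

function isReadOnly(toolName: string): boolean {
  return READ_ONLY_TOOLS.has(toolName);
}

const mutatingToolMutex = new AsyncMutex();

function executeTool<T>(toolName: string, run: () => Promise<T>): Promise<T> {
  return isReadOnly(toolName) ? run() : mutatingToolMutex.runExclusive(run);
}
```

Defaulting unknown tools to the mutex is the conservative choice: a misclassified read-only tool only loses some concurrency, while a misclassified mutating tool could corrupt state.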

Schema stripping

When discover_tools loads a set of tools into the active context, it returns a <functions> block containing their full JSON schemas. Those schemas can run to about 1,720 tokens per discovery result. Once the schemas are loaded into the tools parameter for subsequent turns, the verbose block in the conversation history is pure redundancy. After each turn, Comis scans the session history for discover_tools tool results containing <functions> blocks and replaces them with compact summaries:
[Discovery loaded: 4 tool(s) are now callable]
- tool_name_1
- tool_name_2
- tool_name_3
- tool_name_4
Already-stripped results (prefixed [Discovery loaded:) are detected and skipped. Results without a <functions> block pass through unchanged. Stripping respects the cache fence: entries at or below the fence index are not touched.
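The stripping pass can be sketched as a pure function over the history. The ToolResultEntry shape is hypothetical, and the sketch assumes each schema sits in its own <function>…</function> element containing JSON with a name field; the real block layout may differ.

```typescript
interface ToolResultEntry {
  toolName: string;
  content: string;
}

const STRIPPED_PREFIX = "[Discovery loaded:";

// Pull tool names out of a <functions> block (layout is an assumption).
function extractToolNames(content: string): string[] {
  const names: string[] = [];
  const pattern = /<function>([\s\S]*?)<\/function>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    try {
      const schema = JSON.parse(match[1]) as { name?: unknown };
      if (typeof schema.name === "string") names.push(schema.name);
    } catch {
      // Skip entries that are not valid JSON rather than guessing.
    }
  }
  return names;
}

function stripDiscoverySchemas(entries: ToolResultEntry[], fenceIndex: number): ToolResultEntry[] {
  return entries.map((entry, i) => {
    if (i <= fenceIndex) return entry; // cached bytes stay untouched
    if (entry.toolName !== "discover_tools") return entry;
    if (entry.content.indexOf(STRIPPED_PREFIX) === 0) return entry; // already stripped
    if (entry.content.indexOf("<functions>") === -1) return entry; // nothing to strip
    const names = extractToolNames(entry.content);
    const header = `${STRIPPED_PREFIX} ${names.length} tool(s) are now callable]`;
    const summary = [header].concat(names.map((n) => `- ${n}`)).join("\n");
    return { toolName: entry.toolName, content: summary };
  });
}
```

Returning a new array and leaving fenced entries as the same objects keeps the pass easy to reason about: anything at or below the fence is provably unchanged.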

MCP disconnect cleanup

MCP servers can connect and disconnect at runtime. When a server disconnects, any tools it provided are no longer callable. If those tools remain registered in the DiscoveryTracker, the agent may attempt to call them and receive confusing errors. Comis subscribes to two MCP lifecycle events:
  • mcp:server:disconnected — removes all tools from the named server from every active session’s tracker.
  • mcp:server:tools_changed — removes only the specific tools that were removed (not the entire server).
This cleanup runs automatically. You do not need to configure it. The agent’s active tool set reflects the current state of connected MCP servers within one turn of a disconnect event.
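The difference between the two events can be sketched with a small tracker. The event names come from this page; the payload shapes, the tracker's internal map, and the handler function are assumptions for illustration and may not match the real DiscoveryTracker.

```typescript
// Hypothetical tracker: maps each discovered tool to the MCP server that provides it.
class ToolTracker {
  private toolToServer = new Map<string, string>();

  register(server: string, tool: string): void {
    this.toolToServer.set(tool, server);
  }

  has(tool: string): boolean {
    return this.toolToServer.has(tool);
  }

  removeServer(server: string): void {
    this.toolToServer.forEach((owner, tool) => {
      if (owner === server) this.toolToServer.delete(tool);
    });
  }

  removeTools(tools: string[]): void {
    tools.forEach((tool) => this.toolToServer.delete(tool));
  }
}

// Assumed payload shapes for the two lifecycle events.
type McpLifecycleEvent =
  | { type: "mcp:server:disconnected"; server: string }
  | { type: "mcp:server:tools_changed"; server: string; removed: string[] };

// Apply one event to the tracker of every active session.
function handleMcpEvent(event: McpLifecycleEvent, trackers: ToolTracker[]): void {
  for (const tracker of trackers) {
    if (event.type === "mcp:server:disconnected") tracker.removeServer(event.server);
    else tracker.removeTools(event.removed);
  }
}
```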

Configuration

These are the config keys most relevant to cache behavior. All live under the per-agent config block in your config.yaml.

Cache retention

| Key | Default | What it controls |
| --- | --- | --- |
| cacheRetention | "long" | Base TTL hint: "none", "short" (5m), or "long" (1h) |
| adaptiveCacheRetention | true | Whether to escalate from short to long after warm-up turns |
| cacheBreakpointStrategy | "single" | Breakpoint placement: "single", "multi-zone", or "auto" |
| cacheRetentionOverrides | {} | Per-model-prefix overrides, e.g. "claude-haiku": "none" |
| advancedCacheOptimization.enableRecentZonePromotion | true | Promote recent-zone breakpoint from 5m to 1h in slow channels |

Gemini explicit cache

| Key | Default | What it controls |
| --- | --- | --- |
| geminiCache.enabled | false | Enable Gemini CachedContent API |
| geminiCache.maxActiveCaches | 20 | Max simultaneous cached content objects per agent |
For Gemini agents, cacheRetention has no effect. Gemini uses the CachedContent API exclusively, managed by GeminiCacheManager. TTL is fixed at 3600 seconds with a 50%-interval refresh on active sessions.
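One plausible reading of the "50%-interval refresh" is that an active session's cache object has its TTL renewed once half of the 3600-second lifetime has elapsed. The sketch below encodes that reading only; the function name and the handling of expired objects are assumptions and are not taken from GeminiCacheManager.

```typescript
const GEMINI_CACHE_TTL_SECONDS = 3600;
const REFRESH_FRACTION = 0.5;

// True when an active session's cache object is past the halfway point of its
// TTL but has not yet expired. Expired objects would need to be re-created
// rather than refreshed (an assumption of this sketch).
function shouldRefreshCache(lastRefreshedAtMs: number, nowMs: number, sessionActive: boolean): boolean {
  if (!sessionActive) return false;
  const ageSeconds = (nowMs - lastRefreshedAtMs) / 1000;
  return ageSeconds >= GEMINI_CACHE_TTL_SECONDS * REFRESH_FRACTION && ageSeconds < GEMINI_CACHE_TTL_SECONDS;
}
```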

Context engine

| Key | Default | What it controls |
| --- | --- | --- |
| contextEngine.observationKeepWindow | 25 | Tool results kept with full content (observation masker) |
| contextEngine.observationTriggerChars | 120,000 | Chars before observation masking activates |
| contextEngine.compactionCooldownTurns | 5 | Turns between LLM compaction triggers |
| contextEngine.compactionPrefixAnchorTurns | 2 | Head turns preserved during compaction for cache prefix stability |
| contextEngine.historyTurns | 15 | Recent user turns kept verbatim (pipeline mode) |

Tool lifecycle

| Key | Default | What it controls |
| --- | --- | --- |
| toolLifecycle.enabled | true | Whether unused tools are demoted after N turns |
| toolLifecycle.demotionThreshold | 20 | Turns of non-use before a tool is schema-stripped and deferred |

Compaction (session)

| Key | Default | What it controls |
| --- | --- | --- |
| session.compaction.reserveTokens | 16,384 | Tokens reserved for summary during SDK auto-compaction |
| session.compaction.keepRecentTokens | 32,768 | Recent-message tokens kept after SDK auto-compaction |

Tips for long-running agents

  • If your agent drifts after many turns, the observation masker may be replacing tool results with placeholders before the model has used them. Try raising contextEngine.observationKeepWindow from 25 to 35 or 40.
  • If compaction runs too frequently, raise contextEngine.compactionCooldownTurns. The default of 5 turns is conservative. Sessions with slow-moving conversations (e.g., a Telegram channel where messages arrive every 10 minutes) can tolerate 10–15 turns between compactions.
  • If you see high cache-write costs on Haiku or Sonnet with small prompts, set adaptiveCacheRetention: false and cacheRetention: "short". This prevents the system from writing expensive 1h TTL entries on sessions that complete in under 5 minutes.
  • If your agent has many MCP tools, leave deferredTools.mode at "auto" and use deferredTools.neverDefer to pin the tools your agent uses on almost every turn. This avoids a discover_tools round-trip for high-frequency tools.
  • If a tool produces very large results (build logs, full file contents, API responses), the default 8K microcompaction threshold will offload them automatically. The agent can still read the full content on demand. If the offloaded path is causing confusion, raise maxToolResultChars per-agent to keep more content inline.
  • For Gemini agents, enable geminiCache once your system prompt stabilizes. The CachedContent API gives explicit per-object lifecycle control and is more predictable than Anthropic's prefix-based cache for long-lived agents.