As conversations grow longer, they eventually become too large for your agent to handle efficiently. The default DAG (LCD) engine keeps the whole conversation losslessly and zoomably compresses the oldest history under a token budget; the opt-in pipeline engine instead processes every conversation through ten layers before each AI call, dropping and masking old content as it goes — contextEngine.version defaults to "dag", and you set it to "pipeline" for the simpler engine. Either way, your agent never runs out of room to think.
The lossless DAG (LCD) engine, introduced in the v2.12 “Lossless Context DAG” release, is the default: contextEngine.version defaults to "dag", and setting it to "pipeline" opts into the simpler engine. DAG stores every message losslessly. At the end of a turn, when context utilization crosses contextThreshold (default 0.75 × the turn’s effective budget window, that is, the reconciled context window under any capability-class cap), it collapses the oldest out-of-tail chunk into a leaf summary using the same three-level escalation as pipeline mode (normal → aggressive → a deterministic count-only floor that always fits). It then assembles the context under the model’s token budget, always keeping the verbatim fresh tail and never deleting anything from the store. The multi-tier condensed summary hierarchy is part of DAG too: when enough same-depth summaries accumulate (≥ condensedMinFanout, default 4), they fold into one deeper condensed summary. Every summary is rendered honestly, with depth, descendant_count, time-range, and trust=untrusted markers, an expand footer, and a taint-wrapped body. The on-demand ctx_* recall tools (ctx_search, ctx_inspect, ctx_expand) let the agent drill back into a compressed region. See DAG mode below for the mechanics and Context Management for the engine overview.

How the context engine works

Every time your agent is about to call the AI provider, the context engine reviews the full conversation and prepares it so that the most relevant content fits within the model’s context window. Think of it like an editor preparing a manuscript — the editor removes rough drafts, trims to the most relevant chapters, masks old footnotes, summarizes if the manuscript is still too long, and restores key references at the end. All of this happens automatically in the background, and you do not need to configure or monitor it.

The ten layers (pipeline mode, the opt-in engine)

These ten layers run in pipeline mode — the opt-in engine you select with contextEngine.version: "pipeline" (the default is "dag"). Here is what happens during each step, in order:

1. Clearing scratch paper

The AI’s internal reasoning notes (called “thinking blocks”) accumulate over a long conversation. This step removes thinking blocks from older messages, keeping only the last 10 turns of reasoning by default. Your agent’s actual responses are never affected — only the behind-the-scenes notes are cleared.
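A minimal sketch of this step, assuming a hypothetical message shape (a dict with a role and a list of typed content blocks) and treating each assistant message as one turn. Neither the shape nor the function name comes from Comis itself.

```python
from copy import deepcopy

def strip_old_thinking(messages, keep_turns=10):
    """Return a copy of `messages` in which only the last `keep_turns`
    assistant messages keep their thinking blocks (hypothetical shape)."""
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(assistant_idx[-keep_turns:]) if keep_turns > 0 else set()
    cleaned = []
    for i, msg in enumerate(messages):
        msg = deepcopy(msg)  # never mutate the caller's conversation
        if msg["role"] == "assistant" and i not in keep:
            # Drop only the reasoning notes; the visible reply blocks stay.
            msg["content"] = [b for b in msg["content"] if b["type"] != "thinking"]
        cleaned.append(msg)
    return cleaned
```

Only blocks typed as thinking are dropped, so the text of older responses is untouched.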

2. Stripping reasoning tags

Some AI providers wrap their reasoning in XML-style tags (e.g., <thinking> blocks). This step strips those tags from the conversation so they do not accumulate and waste context space. Unlike step 1, which removes entire thinking blocks from older messages, this step removes the tag markup itself while preserving any meaningful content outside the tags. It runs on every turn regardless of model settings.

3. Keeping recent messages

Your most recent messages are always kept in full. By default, the last 15 user turns are retained (you can set different limits per channel type — for example, fewer turns in a busy group chat). Older messages are still stored on disk and in long-term memory; they are simply not sent to the AI each time.
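The windowing rule can be sketched as follows. This is an illustration under assumptions: the message shape and the function name are hypothetical, and the overrides argument mirrors the per-channel limits described above.

```python
def window_history(messages, history_turns=15, overrides=None, channel_type=None):
    """Keep everything from the Nth-most-recent user message onward."""
    # A per-channel override wins over the global default when present.
    limit = max(1, (overrides or {}).get(channel_type, history_turns))
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(user_idx) <= limit:
        return list(messages)  # conversation is still within the window
    return messages[user_idx[-limit]:]
```

For example, with overrides of {"dm": 10, "group": 5}, a group chat would keep five user turns while any other channel type falls back to the default.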

4. Removing superseded content

The dead content evictor scans for tool results that have been superseded by later calls. It replaces them with compact placeholder text, keeping the conversation lean without losing track of what happened. Specifically, it removes:
  • Re-read files — When the same file path is read again, earlier reads are replaced (e.g., [file_read result for src/app.ts superseded by later read -- use session_search to recover]).
  • Re-run commands — When the same command is executed again, earlier results are replaced.
  • Re-fetched URLs — When the same URL is fetched again, earlier results are replaced.
  • Old images — Earlier image_analyze calls are replaced when newer ones exist.
  • Stale errors — Tool errors older than evictionMinAge turns (default 15) are removed.
This step runs after history windowing and before observation masking. It never mutates the original conversation array — it produces a new, cleaned copy.
Even after eviction, the original content is still on disk. You can recover it using the session_search tool to find the full text of any superseded result.
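The re-read case can be sketched like this. It is a simplified illustration that handles file reads only; the message fields (tool, path, content) are assumptions, while the placeholder wording follows the example given above.

```python
def evict_superseded_reads(messages):
    """Return a new list in which every file_read result except the latest
    one per path is replaced by a compact placeholder."""
    latest = {}
    for i, m in enumerate(messages):
        if m.get("tool") == "file_read":
            latest[m["path"]] = i  # the last occurrence wins
    cleaned = []
    for i, m in enumerate(messages):
        if m.get("tool") == "file_read" and latest[m["path"]] != i:
            note = (f"[file_read result for {m['path']} superseded by later read "
                    "-- use session_search to recover]")
            cleaned.append({**m, "content": note})  # copy, do not mutate
        else:
            cleaned.append(m)
    return cleaned
```

The input list is left unchanged, matching the rule that eviction produces a cleaned copy.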

5. Masking old tool results

When your agent uses tools (web searches, file reads, command outputs), the results can be large. This step replaces old tool results — beyond the last 25 tool uses — with lightweight placeholders. It only activates when the conversation exceeds 120,000 characters, so shorter conversations are never affected. Certain important tools (memory lookups, file reads, and session search results) are never masked.
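A rough sketch of the masking rule, under several assumptions: the message shape is hypothetical, the placeholder text is invented for illustration, and the protected tool names (memory_search, file_read, session_search) are a guess at how "memory lookups, file reads, and session search results" map to tool ids.

```python
def mask_old_tool_results(messages, keep_window=25, trigger_chars=120_000,
                          protected=("memory_search", "file_read", "session_search")):
    """Mask tool results older than the last `keep_window` tool uses, but only
    once the whole conversation exceeds `trigger_chars` characters."""
    total_chars = sum(len(m["content"]) for m in messages)
    if total_chars <= trigger_chars:
        return list(messages)  # short conversations are never affected
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    recent = set(tool_idx[-keep_window:])
    masked = []
    for i, m in enumerate(messages):
        if m["role"] == "tool" and i not in recent and m["tool"] not in protected:
            masked.append({**m, "content": f"[{m['tool']} result masked]"})
        else:
            masked.append(m)
    return masked
```

The sketch ignores the separate deactivation threshold (observationDeactivationChars) listed in the configuration table.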

6. Saving large results to disk

At write time, unusually large tool results are saved to disk and replaced with compact references in the conversation. Your agent can still access the full content through file reads if needed, but the conversation stays lean.

7. Summarizing as a last resort

If the conversation still exceeds 85% of the context window after all the previous steps, the AI summarizes older messages. This step has a three-level fallback:
  • Full summary — A structured summary preserving key details and decisions.
  • Simplified summary — If the full summary fails, a best-effort summary that excludes oversized content.
  • Count-only note — If all else fails, a simple note recording how many messages were condensed (guaranteed to succeed).
A cooldown prevents this from re-triggering too frequently (default: 5 turns between summaries).
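The fallback chain and the trigger condition can be sketched as below. The two summarizer callables, the function names, and the wording of the count-only note are placeholders, not Comis APIs.

```python
def summarize_with_fallback(messages, full_summary, simple_summary):
    """Try the full summary, then the simplified one, then a count-only note."""
    for attempt in (full_summary, simple_summary):
        try:
            return attempt(messages)
        except Exception:
            continue  # fall through to the next, cheaper level
    # The last level needs no model call, so it cannot fail.
    return f"[{len(messages)} earlier messages were condensed]"

def should_compact(used_tokens, window_tokens, turns_since_last,
                   threshold=0.85, cooldown_turns=5):
    """Summarize only above the threshold and outside the cooldown."""
    return used_tokens > threshold * window_tokens and turns_since_last >= cooldown_turns
```

The count-only level is what makes the chain total: even if both model-backed summaries fail, the step still returns something that fits.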

8. Restoring critical context

After summarization, the context engine restores critical context that your agent needs to continue working:
  • Your agent’s instructions (from AGENTS.md) are re-injected.
  • Recently accessed files are restored so the agent remembers what it was working on.
  • A resume instruction helps the agent pick up exactly where it left off.

What gets preserved

Throughout this process, Comis ensures that your agent never loses anything important:
  • Key facts — Important details like names, dates, preferences, and decisions are saved as semantic memories before any summarization.
  • Conversation summary — A condensed version of the conversation is preserved, like a set of meeting notes.
  • Recent messages — Your most recent messages are always kept intact so the agent has full context for what you are currently discussing.
  • Full tool results on disk — Thanks to the disk-saving step, large tool results are always recoverable from disk, even when they are replaced by compact references in the conversation.
After any summarization, your agent’s critical instructions are re-injected into the context so it never forgets who it is or how it should behave.
The context engine does not delete anything from your session storage. It optimizes what gets sent to the AI each turn. You can always review the full conversation history through the web dashboard.

Cost profile and cache savings

Every token in the context window costs money. Here is what that looks like in practice from a production session running a 7-agent NVDA stock analysis pipeline:
| Operation | Cost | Notes |
| --- | --- | --- |
| First “Hello” | $0.206 | Cold cache: system prompt write |
| Second message (arxiv fetch + summary) | $0.052 | Warm cache: 75% cheaper |
| Subsequent messages | $0.02–0.05 | Stable cache reads |
| 7-agent pipeline (full NVDA analysis) | ~$1.20 | 4 analysts + 4 debate rounds + trader verdict |
| Ad-hoc sub-agent query | $0.053 | Single sub-agent spawn |
Without the context engine, the same session would cost 3–5x more — old tool results accumulating in context, system prompt cache misses on every channel switch, and no compaction of growing conversation history.

Cache-stable prompt architecture

Anthropic’s prompt caching can cut costs by 7.5x — but only if the system prompt stays identical across turns. Most frameworks break this by embedding timestamps, message IDs, or channel metadata directly in the system prompt. Comis separates content into two zones:
| Zone | Content | Cache behavior |
| --- | --- | --- |
| System prompt (static) | Identity, personality, workspace files, tool definitions, security rules | Cached: paid once at the write rate, then a fraction on every subsequent call |
| Dynamic preamble (per-turn) | Timestamp, sender metadata, channel context, RAG results, active skills, trust entries | Prepended to the user message; never invalidates the cache prefix |
Six categories of content were identified and relocated to the dynamic preamble: date/time, inbound message metadata, channel context, RAG memory results, active skill content, and sender trust entries. Each of these would invalidate the entire cache prefix if left inline.

Multi-provider cache pricing

Cache pricing varies by provider. Comis tracks per-provider costs using the pi-ai SDK’s pricing catalog:
| Provider | Cache read rate | Cache write rate | Net savings pattern |
| --- | --- | --- | --- |
| Anthropic | 10% of input | 125% of input | Turn 1: net cost (write premium); Turn 2+: net savings (read discount) |
| OpenAI | 50% of input | Same as input | Immediate 50% savings on cached tokens; no write premium |
| Google | 25% of input | Same as input | Immediate 75% savings on cached tokens; no write premium |
The savedVsUncached metric is positive when read discounts exceed write premiums (typically from turn 2 onward), and negative on the first turn when the cache write premium has not yet been offset.
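One plausible way to compute such a metric from the rates in the table is shown below. The formula and names are an assumption for illustration; the document does not show how Comis or the pi-ai catalog actually derive savedVsUncached.

```python
# (read rate, write rate) as fractions of the normal input price, from the table.
CACHE_RATES = {
    "anthropic": (0.10, 1.25),
    "openai": (0.50, 1.00),
    "google": (0.25, 1.00),
}

def saved_vs_uncached(provider, input_price, cache_read_tokens, cache_write_tokens):
    """Read discounts minus write premiums, in the same unit as input_price."""
    read_rate, write_rate = CACHE_RATES[provider]
    read_savings = cache_read_tokens * input_price * (1 - read_rate)
    write_premium = cache_write_tokens * input_price * (write_rate - 1)
    return read_savings - write_premium
```

With Anthropic rates, a first turn that only writes the cache yields a negative value, and a later turn that only reads it yields a positive one, which matches the sign behavior described above.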

MCP tool deferral

With many tools registered — some from MCP servers with verbose descriptions — tool definitions alone can consume a significant portion of the context window. When total tool tokens exceed 10% of the context window, Comis defers verbose MCP tool descriptions behind a lightweight discovery tool. Recently-used tools stay visible; rarely-used tools become discoverable on demand.

Token budget algebra

Every LLM call computes available history tokens using:
Available = Window - SystemPrompt - OutputReserve - SafetyMargin - ContextRotBuffer
The context rot buffer (25% of window) accounts for degraded attention quality in long contexts. This prevents the agent from using tokens that the model cannot effectively attend to — avoiding the “lost in the middle” problem while saving money on tokens that would be wasted anyway.
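A worked example of the formula, using made-up numbers for the window, system prompt, output reserve, and safety margin (only the 25% rot buffer comes from the text):

```python
def available_history_tokens(window, system_prompt, output_reserve,
                             safety_margin, rot_fraction=0.25):
    """Available = Window - SystemPrompt - OutputReserve - SafetyMargin - ContextRotBuffer."""
    rot_buffer = int(window * rot_fraction)
    return max(0, window - system_prompt - output_reserve - safety_margin - rot_buffer)

# Illustrative values: a 200,000-token window with a 20,000-token system prompt.
print(available_history_tokens(200_000, 20_000, 8_192, 2_000))  # prints 119808
```

Here the rot buffer alone reserves 50,000 tokens, so a little under 60% of the window remains for history.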

Three-tier budget guard

Every agent has three budget caps checked before each LLM call:
| Scope | Default | Purpose |
| --- | --- | --- |
| Per-execution | 2M tokens | Prevent runaway single calls |
| Per-hour | 10M tokens | Rate limiting |
| Per-day | 100M tokens | Daily cost ceiling |
If the estimated token count exceeds any cap, the call is rejected with a diagnostic error before any money is spent.
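A minimal sketch of such a pre-call check. The exception type, the usage dictionary, and the way running totals are tracked are assumptions; only the three default caps come from the table.

```python
BUDGET_CAPS = {"execution": 2_000_000, "hour": 10_000_000, "day": 100_000_000}

class BudgetExceeded(Exception):
    """Raised before the provider is called, so no tokens are billed."""

def check_budget(estimated_tokens, usage, caps=BUDGET_CAPS):
    """`usage` maps each scope to the tokens already consumed in that scope."""
    for scope, cap in caps.items():
        projected = usage.get(scope, 0) + estimated_tokens
        if projected > cap:
            raise BudgetExceeded(f"{scope} cap exceeded: {projected} > {cap} tokens")
```

Checking the projected total rather than the current total is what lets the call be rejected before it is made.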

Observability

Every pipeline run logs structured metrics so you can see exactly what the context engine did:
{
  "tokensLoaded": 28959,
  "tokensEvicted": 0,
  "tokensMasked": 0,
  "tokensCompacted": 0,
  "thinkingBlocksRemoved": 7,
  "budgetUtilization": 0.22,
  "rereadCount": 0,
  "sessionDepth": 98,
  "sessionToolResults": 50,
  "layerCount": 6,
  "durationMs": 1
}
budgetUtilization of 0.22 means only 22% of available context is in use — the rest is headroom. Events fire on every significant action (context:masked, context:compacted, context:evicted, context:reread), feeding into the observability dashboard for real-time cost monitoring.

Advanced compaction strategies

Beyond the ten pipeline layers, Comis includes three additional strategies that improve compaction quality and cache efficiency.

Middle-out compaction

When the LLM compaction layer (step 7) needs to summarize older messages, it does not simply summarize everything and keep the tail. Instead, it preserves a configurable number of user-turn cycles at the head of the conversation (the “prefix anchor”, default 2 turns) to maintain prompt cache stability. This creates a three-zone layout:
  1. Preserved head — The first few turns stay untouched, keeping the cache prefix valid
  2. Compaction summary — Middle content is summarized by the compaction model
  3. Preserved tail — Recent turns (from the history window) remain in full
This approach prevents cache breaks when new turns are added, saving significant costs on long conversations. Configure with contextEngine.compactionPrefixAnchorTurns (default 2, range 0-10).
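The three-zone split can be sketched as follows. The function name and message shape are hypothetical, and the tail size stands in for the history window.

```python
def split_zones(messages, prefix_anchor_turns=2, tail_turns=15):
    """Split into (preserved head, middle to summarize, preserved tail),
    cutting only at user-message boundaries."""
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(user_idx) <= prefix_anchor_turns + tail_turns:
        return [], [], list(messages)  # nothing to compact yet
    head_end = user_idx[prefix_anchor_turns]  # start of the first non-anchored turn
    tail_start = user_idx[-tail_turns]        # start of the oldest kept recent turn
    return messages[:head_end], messages[head_end:tail_start], messages[tail_start:]
```

Only the middle slice would be handed to the compaction model; the head stays byte-identical, which is what keeps the cached prefix valid.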

Adaptive cache retention

Adaptive cache retention is a session-level adaptive TTL that optimizes cache slot usage. New sessions start with a short cache TTL (5 minutes). As a session proves active over more turns, the TTL is automatically promoted to a longer duration (1 hour). This saves costs on cold-start conversations, because abandoned sessions do not occupy expensive long-TTL cache slots. It is controlled by the adaptiveCacheRetention boolean on the agent config (default: true).

Microcompaction guard

The microcompaction guard is a write-time check that intercepts oversized tool results before they are saved to the session transcript. Per-tool inline thresholds ensure that no single tool result bloats the conversation:
  • Default tools: 8,000 characters
  • MCP tools: 15,000 characters
  • file_read: 15,000 characters
  • Hard cap: 100,000 characters (truncated)
Oversized results are saved to disk as JSON files with compact inline references in the conversation. This is different from step 5 (masking old tool results), which operates at read time — the microcompaction guard operates at write time, preventing large results from ever entering the conversation file.
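A sketch of the write-time decision, using the thresholds listed above. The save callback, the reference format, and the function names are hypothetical. The 100,000-character hard cap is not modeled because the text does not say where the truncation is applied.

```python
import json

DEFAULT_LIMIT = 8_000
MCP_LIMIT = 15_000
PER_TOOL_LIMITS = {"file_read": 15_000}

def inline_limit(tool_name, is_mcp=False):
    """Pick the inline character threshold for a tool."""
    if tool_name in PER_TOOL_LIMITS:
        return PER_TOOL_LIMITS[tool_name]
    return MCP_LIMIT if is_mcp else DEFAULT_LIMIT

def guard_tool_result(tool_name, result, save, is_mcp=False):
    """Return the text to write into the transcript. `save` is a caller-supplied
    function that persists a JSON payload and returns a reference to it."""
    if len(result) <= inline_limit(tool_name, is_mcp):
        return result  # small enough to stay inline
    ref = save(json.dumps({"tool": tool_name, "content": result}))
    return f"[{tool_name} result ({len(result)} chars) saved to {ref}]"
```

Because the check happens before the transcript write, the oversized text never enters the conversation file at all.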

DAG mode (the default engine)

The DAG (LCD) engine, introduced in the v2.12 “Lossless Context DAG” release, is the default: contextEngine.version defaults to "dag", and setting it to "pipeline" opts into the simpler engine. It stores every message losslessly and assembles the context under the model’s token budget: it resolves the ordered context view (raw messages plus any leaf summaries) and evicts the oldest out-of-tail history when the budget is exceeded, always keeping the verbatim fresh tail. Leaf summarization is automatic: when context utilization crosses contextThreshold (default 0.75 × the turn’s effective budget window) at the end of a turn, the oldest out-of-tail chunk is summarized into a leaf summary and slotted back into the context view in its original position. The multi-tier condensed summary hierarchy (summaries of summaries) is live too: when enough same-depth summaries accumulate (≥ condensedMinFanout, default 4), they fold into one deeper condensed summary, and every summary is rendered with honest lossiness markers (see Honest presentation of summaries below). Nothing is ever deleted from the store. Both eviction and summarization operate on the assembled context only, so the underlying messages remain recoverable through the on-demand ctx_* recall tools described later in this section.

What is DAG mode

DAG mode stores every message and tool result losslessly and rebuilds the model-facing context each turn from the ordered context view: every surviving message, tool call, and tool result is preserved and paired by id, alongside a verbatim fresh tail of the most recent steps. When context utilization crosses contextThreshold at the end of a turn, the oldest out-of-tail chunk is summarized into a leaf summary and slotted back into the context view in its original position. Only whole tool-use/tool-result steps are summarized, so a pair is never split, and the fresh tail is always protected. When the assembled history would still exceed the model’s token budget, the oldest evictable history is dropped to fit. Summarization and eviction operate on the assembled context only: the underlying messages stay in the lossless store, so compressed detail remains recoverable. When enough same-depth summaries accumulate (≥ condensedMinFanout, default 4), the DAG rolls them up into one deeper condensed summary, forming a zoomable leaf→condensed hierarchy where each level covers a broader span. Every summary is rendered into context with honest lossiness markers (see Honest presentation of summaries). The current release does leaf summarization, multi-tier condensation, budget-bounded assembly, and on-demand recall through the ctx_* tools (see Recall tools).

How DAG compaction works

When context utilization exceeds contextThreshold (default 0.75 of the turn’s effective budget window), the DAG engine compresses older content:
  1. Leaf summaries — Groups of raw messages are summarized into concise leaf summaries. Each leaf summary covers a chunk of conversation.
  2. Condensed summaries — When enough leaf summaries accumulate, they are merged into higher-level condensed summaries that cover broader spans of conversation.
A three-tier escalation ensures summaries always fit within their target size:
  • Normal — Structured summary with key details preserved.
  • Aggressive — Compact summary that prioritizes brevity.
  • Truncation — A bounded count-only note (marker-prefixed, guaranteed to fit and to shrink below the source it replaces).
The most recent turns (the “fresh tail”, default 8 turns) are always protected from compaction, ensuring your agent has full, uncompressed access to the latest exchanges.
One size guard applies inside the protected fresh tail: a single message whose text alone approaches the model’s effective context window (an oversized tool result, or a pasted document far larger than the window) is bounded at assembly, with head and tail preserved around an honest truncation marker, so that one message can never overflow the window or permanently wedge the session into context_exhausted. Everything below the cap passes through verbatim, and the full content always remains in the lossless store (recoverable via ctx_expand). The cap scales with the model’s effective window (up to 100,000 characters per message on large windows).
The condensed hierarchy is depth-aware: a depth-0 leaf summarizes raw messages, a depth-1 condensed summary summarizes depth-0 leaves once at least condensedMinFanout (default 4) of them accumulate, and so on. Higher depth means broader coverage but more compression loss.
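The depth-aware fold can be illustrated with a small sketch. Several things here are assumptions: the summary row shape, the choice to fold all same-depth summaries at once, and the stand-in summarize function (a real engine would call the summarizer model).

```python
def fold_summaries(summaries, min_fanout=4,
                   summarize=lambda texts: " / ".join(texts)):
    """Fold same-depth summaries into one deeper summary whenever at least
    `min_fanout` of them have accumulated. Returns a new list."""
    min_fanout = max(2, min_fanout)  # folding a single summary would never end
    result = list(summaries)
    changed = True
    while changed:
        changed = False
        for depth in sorted({s["depth"] for s in result}):
            idx = [i for i, s in enumerate(result) if s["depth"] == depth]
            if len(idx) >= min_fanout:
                group = [result[i] for i in idx]
                merged = {
                    "depth": depth + 1,
                    "descendant_count": sum(s["descendant_count"] for s in group),
                    "text": summarize([s["text"] for s in group]),
                }
                first = idx[0]
                result = [s for i, s in enumerate(result) if i not in idx]
                result.insert(first, merged)  # keep the original position
                changed = True
                break
    return result
```

Four depth-0 leaves covering five messages each would fold into a single depth-1 summary with a descendant count of 20, while three leaves would be left alone.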

Deferred (background) compaction

By default the leaf and condense passes run out-of-band — they never block the turn. When a turn finishes, the agent’s reply is delivered immediately; the summarization work then runs in the background on a per-conversation single-flight serializer, so a compaction can never run concurrently with (or race) the next turn’s write into the same conversation. This keeps response latency flat even when a large session crosses the compaction threshold: summarizing old history is amortized, not paid synchronously on the turn that happened to tip the budget. This is controlled by contextEngine.deferCompaction (default true). Setting it to false runs the leaf and condense passes inline at the end of the turn (the pre-v2.12 behaviour) — deterministic and useful for tests, at the cost of adding the summarization time to that turn’s latency. The serializer interlock (one compaction per conversation at a time) holds in both modes, so the lossless store is never written by two passes at once.
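The single-flight behavior can be approximated with one lock per conversation. This is a sketch using Python's asyncio, not the engine's actual serializer; the class and job names are invented for illustration.

```python
import asyncio
from collections import defaultdict

class ConversationSerializer:
    """Runs at most one job per conversation at a time."""
    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    async def run(self, conversation_id, job):
        async with self._locks[conversation_id]:
            return await job()

async def demo():
    serializer = ConversationSerializer()
    log = []

    async def compaction():
        log.append("compact:start")
        await asyncio.sleep(0.01)  # stands in for the summarizer call
        log.append("compact:end")

    async def next_turn_write():
        log.append("write:start")
        log.append("write:end")

    # The reply has already been delivered; compaction runs in the background.
    task = asyncio.create_task(serializer.run("conv-1", compaction))
    await asyncio.sleep(0)  # let the background task take the lock
    await serializer.run("conv-1", next_turn_write)  # waits for compaction
    await task
    return log

print(asyncio.run(demo()))
```

The next turn's write queues behind the in-flight compaction instead of interleaving with it, while jobs for other conversation ids would use different locks and proceed independently.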

Honest presentation of summaries

A summary is a lossy, possibly-untrusted compression of earlier turns, so the DAG never slips one into context disguised as a normal message. Each summary the engine assembles is rendered with explicit honesty markers and carried untrusted by role:
  • Lossiness markers — a trusted header announces depth, descendant_count (how many underlying messages it covers), and the ISO time-range it spans, plus trust=untrusted. These values come from the stored summary row, not from the summary text, so a poisoned body cannot forge them.
  • Expand footer — an “Expand for details about:” line states what was compressed (the descendant count, depth, and time-range), making clear the detail exists losslessly in the store even though it is not inline.
  • Taint-wrapped body — the summary text itself is wrapped as untrusted external content (the same taint primitive used for any non-system input), so injected instructions inside a summarized message are not treated as commands.
  • Uncertainty clause — in dag mode the system prompt adds a Compressed context section instructing the model to treat summaries as compressed recall cues rather than proof: do not assert exact commands, SHAs, paths, numbers, or timestamps from a summary, and prefer newer verbatim evidence when a summary and a recent message conflict.
This is the honest presentation contract: the model always knows which parts of its context are compressed, how lossy they are, and that it should verify exact details against fresher evidence rather than trust a summary blindly.
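To make the contract concrete, here is a hypothetical rendering function. The exact header layout, the wrapper tags, and the row field names are invented for illustration; only the marker names and the “Expand for details about:” footer come from the description above.

```python
def render_summary(row):
    """Render a stored summary row with a header built from trusted metadata,
    a taint-wrapped body, and an expand footer."""
    span = f"{row['start']}..{row['end']}"
    header = (f"[summary depth={row['depth']} "
              f"descendant_count={row['descendant_count']} "
              f"time-range={span} trust=untrusted]")
    # The body is treated as untrusted external content, never as instructions.
    body = f"<untrusted>\n{row['text']}\n</untrusted>"
    footer = (f"Expand for details about: {row['descendant_count']} messages, "
              f"depth {row['depth']}, {span}")
    return "\n".join([header, body, footer])
```

Because the header and footer are built from the stored row rather than parsed out of the summary text, a body that claims a different depth or count cannot change what the markers report.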

Recall tools

In DAG mode (the default), the in-session ctx_* tools let the agent drill back into a compressed region of this conversation: ctx_search (full-text/regex search across messages and summaries), ctx_inspect (inspect a summary’s coverage), and ctx_expand (rehydrate a region to its underlying messages). They are active only in DAG mode, are never-export, and are distinct from cross-session recall (memory_search, session_search). In the opt-in pipeline mode the agent instead uses session_search over the raw session history. See Built-in Tools and Platform Tools for the full tool reference.

Switching modes

DAG is the default. contextEngine.version accepts "pipeline" and "dag"; omit it (or set "dag") for the lossless LCD engine, or set "pipeline" to opt into the simpler sequential-layer engine:
~/.comis/config.yaml
agents:
  default:
    contextEngine:
      version: "pipeline"   # opt into the simpler engine; omit (or "dag") for the default lossless LCD engine
contextEngine.version is operator-only — an agent cannot switch its own engine mode via the config.patch RPC.

Per-(tenant, agent, session) isolation

DAG history is scoped by tenant, agent, and session — not by conversation alone. Every read the assembler performs is filtered on all three, so two agents that happen to share the same (tenant, user, channel) conversation key cannot recover each other’s compressed history: agent A never sees agent B’s summaries or raw messages, even within one conversation. If the scope cannot be fully resolved (a missing agent or tenant id), the engine fails closed — it reads no history for that turn rather than risk a cross-agent leak (the verbatim fresh tail still ships, so the live turn is never broken), and logs a warning so the misconfiguration is visible.
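The fail-closed read can be sketched as a filter that refuses to run with an incomplete scope. The row fields and function name are assumptions, and a list stands in for the store.

```python
import logging

logger = logging.getLogger("context_engine")

def read_scoped_history(rows, tenant_id, agent_id, session_id):
    """Return only rows matching the full (tenant, agent, session) scope.
    If any part of the scope is missing, return nothing (fail closed)."""
    if not (tenant_id and agent_id and session_id):
        logger.warning("incomplete DAG scope; reading no history for this turn")
        return []
    scope = (tenant_id, agent_id, session_id)
    return [r for r in rows
            if (r["tenant_id"], r["agent_id"], r["session_id"]) == scope]
```

Returning an empty history on a missing id trades some recall for safety: the turn still runs on the fresh tail, but no other agent's rows can leak in.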
The cross-agent isolation adds an agent_id column to the DAG full-text search index. Because the search index is created once on a fresh database (there is no migration — sessions start fresh on the LCD engine by design), a pre-existing development ~/.comis database created before this release may need to be wiped to pick up the isolated index. This affects only local dev databases that predate the LCD engine; a fresh install needs nothing.

Configuration

All context engine settings live under agents.*.contextEngine in your config file. The defaults work well for most setups — you typically do not need to change anything.
| Option | Type | Default | What it does |
| --- | --- | --- | --- |
| contextEngine.enabled | boolean | true | Master toggle for the context engine |
| contextEngine.version | string | "dag" | Context engine mode: "dag" (default, the lossless LCD engine) or "pipeline" (opt-in sequential-layer engine) |
| contextEngine.thinkingKeepTurns | number | 10 | Recent turns that keep AI reasoning notes (1-50) |
| contextEngine.compactionModel | string | "" | Model used for summarization (empty = runtime-resolved default) |
| contextEngine.evictionMinAge | number | 15 | Minimum turn age before stale errors are evicted (3-50) |
| contextEngine.historyTurns | number | 15 | Recent user turns to keep in full (3-100) |
| contextEngine.historyTurnOverrides | record | | Per-channel turn limits (e.g., { dm: 10, group: 5 }) |
| contextEngine.observationKeepWindow | number | 25 | Recent tool uses that keep full results (1-50) |
| contextEngine.observationTriggerChars | number | 120000 | Character threshold before old tool results are masked (50K-1M) |
| contextEngine.compactionCooldownTurns | number | 5 | Turns to wait before re-triggering summarization (1-50) |
| contextEngine.compactionPrefixAnchorTurns | number | 2 | User turns preserved at head for cache prefix stability (0-10) |
| contextEngine.outputEscalation.enabled | boolean | true | Allow escalating output token budget when compaction occurs |
| contextEngine.outputEscalation.escalatedMaxTokens | number | 32768 | Maximum output tokens after escalation (4096-128000) |
| contextEngine.observationDeactivationChars | number | 80000 | Character threshold to deactivate observation masking (20K-500K) |
| contextEngine.ephemeralKeepWindow | number | 10 | Recent ephemeral tool results to preserve from masking (1-50) |
| contextEngine.contextThreshold | number | 0.75 | DAG mode (live): context-utilization fraction (of the turn’s effective budget window, i.e. the reconciled context window under any capability-class cap) that triggers a leaf-summarization pass at the end of a turn (0.1-0.95) |
| contextEngine.freshTailTurns | number | 8 | DAG mode (live): most-recent steps (assistant + tool round-trips) always kept verbatim and never summarized/evicted (1-50) |
| contextEngine.leafChunkTokens | number | 20000 | DAG mode (live): token cap for the oldest out-of-tail chunk summarized into one leaf summary (1K-100K). Clamped at runtime to the resolved summarizer model’s window (the smaller of its configured window and the probed served window when the summarizer runs on the served-bound provider) minus the summary target, template overhead, and previous-summary size, so a small compaction summarizer or a served-bound primary is never fed an over-window chunk. A single message larger than the clamped cap is replaced by a bounded deterministic extraction (no LLM call) |
| contextEngine.leafTargetTokens | number | 1200 | DAG mode (live): target token size for a leaf summary (96-5000) |
| contextEngine.deferCompaction | boolean | true | DAG mode (live): run the afterTurn leaf + condense passes in the background on the per-conversation serializer (never blocking the turn). false runs them inline (deterministic, for tests) |
| contextEngine.summarizerSpend.maxTokensPerTenantPerHour | number | 500000 | DAG mode (live): per-tenant rolling-hour ceiling on summarizer (input+output) tokens; over the cap the summarizer is bypassed and assembly degrades to truncation-only. 0 disables the hourly cap |
| contextEngine.summarizerSpend.maxTokensPerTenantPerDay | number | 5000000 | DAG mode (live): per-tenant rolling-day summarizer token ceiling. 0 disables the daily cap |
| contextEngine.summarizerBreaker.failureThreshold | number | 5 | DAG mode (live): consecutive summarizer failures before the circuit breaker opens → truncation-only assembly |
| contextEngine.summarizerBreaker.resetTimeoutMs | number | 60000 | DAG mode (live): how long the summarizer breaker stays open before a half-open trial |
| contextEngine.summarizerBreaker.halfOpenTimeoutMs | number | 30000 | DAG mode (live): half-open trial window for the summarizer breaker |
| contextEngine.compaction.summarizerFallbackProviders | string[] | [] | DAG mode (live): ordered list of fallback summarizer provider/model ids tried in turn when the primary summarizer fails (SUM-03); the per-tenant breaker + spend caps still bind, and only an all-providers-exhausted failure trips the breaker. [] (default) = no failover: primary only, byte-identical to prior behavior |
~/.comis/config.yaml
agents:
  default:
    contextEngine:
      enabled: true
      # version: "dag"  # the default lossless LCD engine; set "pipeline" for the simpler engine
      thinkingKeepTurns: 10
      historyTurns: 15
      historyTurnOverrides:
        dm: 10
        group: 5
      observationKeepWindow: 25
      observationTriggerChars: 120000
      compactionCooldownTurns: 5
      compactionPrefixAnchorTurns: 2  # Turns preserved at head for cache prefix stability
      compactionModel: ""  # empty = runtime-resolved default
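The summarizerBreaker options in the table above describe a conventional circuit breaker. The sketch below shows the general pattern with the documented defaults; it is not the engine's implementation, the class and method names are invented, and the half-open trial window (halfOpenTimeoutMs) is not modeled.

```python
import time

class SummarizerBreaker:
    """Opens after N consecutive failures, then allows a trial call once the
    reset timeout has elapsed."""
    def __init__(self, failure_threshold=5, reset_timeout_ms=60_000,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_ms / 1000
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """True if the summarizer may be called right now."""
        if self.opened_at is None:
            return True
        # Open: permit a half-open trial only after the reset timeout.
        return self.clock() - self.opened_at >= self.reset_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

While allow() returns False, a caller would skip the summarizer and fall back to truncation-only assembly, as the table describes.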
For DAG mode configuration fields (compaction thresholds, summary tokens, recall limits), the full list of options, and validation rules, see the Config Reference.
You can trigger compaction manually with the /compact command. This is useful if your conversation feels sluggish and you want to free up space without waiting for the automatic threshold.
The session.compaction settings (softThresholdRatio, hardThresholdRatio, etc.) control the memory flush — extracting facts and summaries into long-term storage before the context engine summarizes older messages. These settings are separate from the context engine configuration above. See Sessions and the Config Reference for details.

Sessions

How conversations are created and managed.

Memory

Where compaction summaries are stored for long-term recall.

Config Reference

Full configuration reference for all context engine options.