contextEngine.version defaults to "dag", and you set it to "pipeline" for
the simpler engine. Either way, your agent never runs out of room to think.
The lossless DAG (LCD) engine (the v2.12 “Lossless Context DAG” release) is
the default engine —
contextEngine.version defaults to "dag" (set
"pipeline" to opt into the simpler engine). DAG
stores every message losslessly and automatically summarizes the oldest
out-of-tail history when context utilization crosses contextThreshold (default
0.75 × the turn’s effective budget window — the reconciled context window under
any capability-class cap) at the end of a turn: it collapses the oldest chunk
into a leaf summary via the same three-level escalation as pipeline mode
(normal → aggressive → a deterministic count-only floor that always fits), then
assembles the context under the model’s token budget — always keeping the
verbatim fresh tail and never deleting anything from the store. The multi-tier
condensed summary hierarchy is now part of DAG too: when enough same-depth
summaries accumulate (≥condensedMinFanout, default 4) they fold into one deeper
condensed summary, and every summary is rendered honestly
(depth/descendant_count/time-range/trust=untrusted markers + an expand
footer, body taint-wrapped). The on-demand ctx_* recall tools (ctx_search,
ctx_inspect, ctx_expand) let the agent drill back into a compressed region.
See DAG mode below for the mechanics and
Context Management for the engine overview.How the context engine works
Every time your agent is about to call the AI provider, the context engine reviews the full conversation and prepares it so that the most relevant content fits within the model’s context window. Think of it like an editor preparing a manuscript — the editor removes rough drafts, trims to the most relevant chapters, masks old footnotes, summarizes if the manuscript is still too long, and restores key references at the end. All of this happens automatically in the background, and you do not need to configure or monitor it.The ten layers (pipeline mode, the opt-in engine)
These ten layers run in pipeline mode — the opt-in engine you select withcontextEngine.version: "pipeline" (the default is "dag"). Here is what happens
during each step, in order:
1. Clearing scratch paper
The AI’s internal reasoning notes (called “thinking blocks”) accumulate over a long conversation. This step removes thinking blocks from older messages, keeping only the last 10 turns of reasoning by default. Your agent’s actual responses are never affected — only the behind-the-scenes notes are cleared.2. Stripping reasoning tags
Some AI providers wrap their reasoning in XML-style tags (e.g.,<thinking> blocks). This step strips those tags from the conversation so they do not accumulate and waste context space. Unlike step 1 which removes entire thinking blocks from older messages, this step removes the tag markup itself while preserving any meaningful content outside the tags. It runs on every turn regardless of model settings.
3. Keeping recent messages
Your most recent messages are always kept in full. By default, the last 15 user turns are retained (you can set different limits per channel type — for example, fewer turns in a busy group chat). Older messages are still stored on disk and in long-term memory; they are simply not sent to the AI each time.4. Removing superseded content
The dead content evictor scans for tool results that have been superseded by later calls. It replaces them with compact placeholder text, keeping the conversation lean without losing track of what happened. Specifically, it removes:- Re-read files — When the same file path is read again, earlier reads are
replaced (e.g.,
[file_read result for src/app.ts superseded by later read -- use session_search to recover]). - Re-run commands — When the same command is executed again, earlier results are replaced.
- Re-fetched URLs — When the same URL is fetched again, earlier results are replaced.
- Old images — Earlier
image_analyzecalls are replaced when newer ones exist. - Stale errors — Tool errors older than
evictionMinAgeturns (default 15) are removed.
Even after eviction, the original content is still on disk. You can recover it
using the
session_search tool to find the full text of any superseded result.5. Masking old tool results
When your agent uses tools (web searches, file reads, command outputs), the results can be large. This step replaces old tool results — beyond the last 25 tool uses — with lightweight placeholders. It only activates when the conversation exceeds 120,000 characters, so shorter conversations are never affected. Certain important tools (memory lookups, file reads, and session search results) are never masked.6. Saving large results to disk
At write time, unusually large tool results are saved to disk and replaced with compact references in the conversation. Your agent can still access the full content through file reads if needed, but the conversation stays lean.7. Summarizing as a last resort
If the conversation still exceeds 85% of the context window after all the previous steps, the AI summarizes older messages. This step has a three-level fallback:- Full summary — A structured summary preserving key details and decisions.
- Simplified summary — If the full summary fails, a best-effort summary that excludes oversized content.
- Count-only note — If all else fails, a simple note recording how many messages were condensed (guaranteed to succeed).
8. Restoring critical context
After summarization, the context engine restores critical context that your agent needs to continue working:- Your agent’s instructions (from AGENTS.md) are re-injected.
- Recently accessed files are restored so the agent remembers what it was working on.
- A resume instruction helps the agent pick up exactly where it left off.
What gets preserved
Throughout this process, Comis ensures that your agent never loses anything important:- Key facts — Important details like names, dates, preferences, and decisions are saved as semantic memories before any summarization.
- Conversation summary — A condensed version of the conversation is preserved, like a set of meeting notes.
- Recent messages — Your most recent messages are always kept intact so the agent has full context for what you are currently discussing.
- Full tool results on disk — Thanks to the disk-saving step, large tool results are always recoverable from disk, even when they are replaced by compact references in the conversation.
The context engine does not delete anything from your session storage. It
optimizes what gets sent to the AI each turn. You can always review the full
conversation history through the web dashboard.
Cost profile and cache savings
Every token in the context window costs money. Here is what that looks like in practice from a production session running a 7-agent NVDA stock analysis pipeline:| Operation | Cost | Notes |
|---|---|---|
| First “Hello” | $0.206 | Cold cache — system prompt write |
| Second message (arxiv fetch + summary) | $0.052 | Warm cache — 75% cheaper |
| Subsequent messages | $0.02–0.05 | Stable cache reads |
| 7-agent pipeline (full NVDA analysis) | ~$1.20 | 4 analysts + 4 debate rounds + trader verdict |
| Ad-hoc sub-agent query | $0.053 | Single sub-agent spawn |
Cache-stable prompt architecture
Anthropic’s prompt caching can cut costs by 7.5x — but only if the system prompt stays identical across turns. Most frameworks break this by embedding timestamps, message IDs, or channel metadata directly in the system prompt. Comis separates content into two zones:| Zone | Content | Cache behavior |
|---|---|---|
| System prompt (static) | Identity, personality, workspace files, tool definitions, security rules | Cached — paid once at the write rate, then a fraction on every subsequent call |
| Dynamic preamble (per-turn) | Timestamp, sender metadata, channel context, RAG results, active skills, trust entries | Prepended to the user message — never invalidates the cache prefix |
Multi-provider cache pricing
Cache pricing varies by provider. Comis tracks per-provider costs using the pi-ai SDK’s pricing catalog:| Provider | Cache read rate | Cache write rate | Net savings pattern |
|---|---|---|---|
| Anthropic | 10% of input | 125% of input | Turn 1: net cost (write premium); Turn 2+: net savings (read discount) |
| OpenAI | 50% of input | Same as input | Immediate 50% savings on cached tokens; no write premium |
| 25% of input | Same as input | Immediate 75% savings on cached tokens; no write premium |
savedVsUncached metric is positive when read discounts exceed write premiums (typically from turn 2 onward), and negative on the first turn when the cache write premium has not yet been offset.
MCP tool deferral
With many tools registered — some from MCP servers with verbose descriptions — tool definitions alone can consume a significant portion of the context window. When total tool tokens exceed 10% of the context window, Comis defers verbose MCP tool descriptions behind a lightweight discovery tool. Recently-used tools stay visible; rarely-used tools become discoverable on demand.Token budget algebra
Every LLM call computes available history tokens using:Three-tier budget guard
Every agent has three budget caps checked before each LLM call:| Scope | Default | Purpose |
|---|---|---|
| Per-execution | 2M tokens | Prevent runaway single calls |
| Per-hour | 10M tokens | Rate limiting |
| Per-day | 100M tokens | Daily cost ceiling |
Observability
Every pipeline run logs structured metrics so you can see exactly what the context engine did:budgetUtilization of 0.22 means only 22% of available context is in use — the rest is headroom. Events fire on every significant action (context:masked, context:compacted, context:evicted, context:reread), feeding into the observability dashboard for real-time cost monitoring.
Advanced compaction strategies
Beyond the ten pipeline layers, Comis includes three additional strategies that improve compaction quality and cache efficiency.Middle-out compaction
When the LLM compaction layer (step 7) needs to summarize older messages, it does not simply summarize everything and keep the tail. Instead, it preserves a configurable number of user-turn cycles at the head of the conversation (the “prefix anchor”, default 2 turns) to maintain prompt cache stability. This creates a three-zone layout:- Preserved head — The first few turns stay untouched, keeping the cache prefix valid
- Compaction summary — Middle content is summarized by the compaction model
- Preserved tail — Recent turns (from the history window) remain in full
contextEngine.compactionPrefixAnchorTurns (default 2, range 0-10).
Adaptive cache retention
Session-level adaptive TTL that optimizes cache slot usage. New sessions start with a short cache TTL (5 minutes). As the session proves active with more turns, the TTL automatically promotes to a longer duration (1 hour). This saves costs on cold-start conversations — abandoned sessions do not waste expensive long-TTL cache slots. Controlled by theadaptiveCacheRetention boolean on agent config (default: true).
Microcompaction guard
A write-time guard that intercepts oversized tool results before they are saved to the session transcript. Per-tool inline thresholds ensure that no single tool result bloats the conversation:- Default tools: 8,000 characters
- MCP tools: 15,000 characters
- file_read: 15,000 characters
- Hard cap: 100,000 characters (truncated)
DAG mode (the default engine)
What is DAG mode
DAG mode stores every message and tool result losslessly and rebuilds the model-facing context each turn from the ordered context view — every surviving message, tool call, and tool result preserved and paired by id, plus a verbatim fresh tail of the most recent steps. When context utilization crossescontextThreshold at the end of a turn, the oldest out-of-tail chunk is
summarized into a leaf summary (whole tool-use/tool-result steps only — a pair
is never split) and slotted back into the context view in its original position;
the fresh tail is always protected. When the assembled history would still exceed
the model’s token budget, the oldest evictable history is dropped to fit.
Summarization and eviction operate on the assembled context only: the underlying
messages stay in the lossless store, so compressed detail remains recoverable.
When enough same-depth summaries accumulate (≥condensedMinFanout, default 4),
the DAG rolls them up into one deeper condensed summary — a zoomable
leaf→condensed hierarchy where each level covers a broader span. Every summary is
rendered into context with honest lossiness markers (see
Honest presentation of summaries). Only the
on-demand ctx_* recall tools (the design described below) still land in a later
phase; the current release does leaf summarization, multi-tier condensation, and
budget-bounded assembly.
How DAG compaction works
When the conversation tokens exceed the context threshold (default 75% of the model’s token budget), the DAG engine compresses older content:- Leaf summaries — Groups of raw messages are summarized into concise leaf summaries. Each leaf summary covers a chunk of conversation.
- Condensed summaries — When enough leaf summaries accumulate, they are merged into higher-level condensed summaries that cover broader spans of conversation.
- Normal — Structured summary with key details preserved.
- Aggressive — Compact summary that prioritizes brevity.
- Truncation — A bounded count-only note (marker-prefixed, guaranteed to fit and to shrink below the source it replaces).
context_exhausted. Everything below the cap passes through verbatim, and
the full content always remains in the lossless store (recoverable via
ctx_expand). The cap scales with the model’s effective window (up to 100,000
characters per message on large windows).
The condensed hierarchy is depth-aware: a depth-0 leaf summarizes raw
messages, a depth-1 condensed summary summarizes depth-0 leaves once at least
condensedMinFanout (default 4) of them accumulate, and so on. Higher depth means
broader coverage but more compression loss.
Deferred (background) compaction
By default the leaf and condense passes run out-of-band — they never block the turn. When a turn finishes, the agent’s reply is delivered immediately; the summarization work then runs in the background on a per-conversation single-flight serializer, so a compaction can never run concurrently with (or race) the next turn’s write into the same conversation. This keeps response latency flat even when a large session crosses the compaction threshold: summarizing old history is amortized, not paid synchronously on the turn that happened to tip the budget. This is controlled bycontextEngine.deferCompaction (default true). Setting it
to false runs the leaf and condense passes inline at the end of the turn
(the pre-v2.12 behaviour) — deterministic and useful for tests, at the cost of
adding the summarization time to that turn’s latency. The serializer interlock
(one compaction per conversation at a time) holds in both modes, so the lossless
store is never written by two passes at once.
Honest presentation of summaries
A summary is a lossy, possibly-untrusted compression of earlier turns, so the DAG never slips one into context disguised as a normal message. Each summary the engine assembles is rendered with explicit honesty markers and carried untrusted by role:- Lossiness markers — a trusted header announces
depth,descendant_count(how many underlying messages it covers), and the ISO time-range it spans, plustrust=untrusted. These values come from the stored summary row, not from the summary text, so a poisoned body cannot forge them. - Expand footer — an “Expand for details about:” line states what was compressed (the descendant count, depth, and time-range), making clear the detail exists losslessly in the store even though it is not inline.
- Taint-wrapped body — the summary text itself is wrapped as untrusted external content (the same taint primitive used for any non-system input), so injected instructions inside a summarized message are not treated as commands.
- Uncertainty clause — in dag mode the system prompt adds a Compressed context section instructing the model to treat summaries as compressed recall cues rather than proof: do not assert exact commands, SHAs, paths, numbers, or timestamps from a summary, and prefer newer verbatim evidence when a summary and a recent message conflict.
Recall tools
In DAG mode (the default), the in-sessionctx_* tools let the agent drill back
into a compressed region of this conversation: ctx_search (full-text/regex
search across messages and summaries), ctx_inspect (inspect a summary’s
coverage), and ctx_expand (rehydrate a region to its underlying messages). They
are active only in DAG mode, never-export, and distinct from cross-session
recall (memory_search, session_search). In the opt-in pipeline mode the agent
instead uses session_search over the raw session history.
See Built-in Tools and
Platform Tools for the full tool
reference documentation.
Switching modes
DAG is the default.contextEngine.version accepts "pipeline" and "dag";
omit it (or set "dag") for the lossless LCD engine, or set "pipeline" to opt
into the simpler sequential-layer engine:
~/.comis/config.yaml
contextEngine.version is operator-only — an agent cannot switch its own engine
mode via the config.patch RPC.
Per-(tenant, agent, session) isolation
DAG history is scoped by tenant, agent, and session — not by conversation alone. Every read the assembler performs is filtered on all three, so two agents that happen to share the same(tenant, user, channel) conversation key cannot
recover each other’s compressed history: agent A never sees agent B’s summaries or
raw messages, even within one conversation. If the scope cannot be fully resolved
(a missing agent or tenant id), the engine fails closed — it reads no history
for that turn rather than risk a cross-agent leak (the verbatim fresh tail still
ships, so the live turn is never broken), and logs a warning so the
misconfiguration is visible.
The cross-agent isolation adds an
agent_id column to the DAG full-text search
index. Because the search index is created once on a fresh database (there is no
migration — sessions start fresh on the LCD engine by design), a pre-existing
development ~/.comis database created before this release may need to be wiped
to pick up the isolated index. This affects only local dev databases that predate
the LCD engine; a fresh install needs nothing.Configuration
All context engine settings live underagents.*.contextEngine in your config
file. The defaults work well for most setups — you typically do not need to
change anything.
| Option | Type | Default | What it does |
|---|---|---|---|
contextEngine.enabled | boolean | true | Master toggle for the context engine |
contextEngine.version | string | "dag" | Context engine mode: "dag" (default, the lossless LCD engine) or "pipeline" (opt-in sequential-layer engine) |
contextEngine.thinkingKeepTurns | number | 10 | Recent turns that keep AI reasoning notes (1-50) |
contextEngine.compactionModel | string | "" | Model used for summarization (empty = runtime-resolved default) |
contextEngine.evictionMinAge | number | 15 | Minimum turn age before stale errors are evicted (3-50) |
contextEngine.historyTurns | number | 15 | Recent user turns to keep in full (3-100) |
contextEngine.historyTurnOverrides | record | — | Per-channel turn limits (e.g., { dm: 10, group: 5 }) |
contextEngine.observationKeepWindow | number | 25 | Recent tool uses that keep full results (1-50) |
contextEngine.observationTriggerChars | number | 120000 | Character threshold before old tool results are masked (50K-1M) |
contextEngine.compactionCooldownTurns | number | 5 | Turns to wait before re-triggering summarization (1-50) |
contextEngine.compactionPrefixAnchorTurns | number | 2 | User turns preserved at head for cache prefix stability (0-10) |
contextEngine.outputEscalation.enabled | boolean | true | Allow escalating output token budget when compaction occurs |
contextEngine.outputEscalation.escalatedMaxTokens | number | 32768 | Maximum output tokens after escalation (4096-128000) |
contextEngine.observationDeactivationChars | number | 80000 | Character threshold to deactivate observation masking (20K-500K) |
contextEngine.ephemeralKeepWindow | number | 10 | Recent ephemeral tool results to preserve from masking (1-50) |
contextEngine.contextThreshold | number | 0.75 | DAG mode (live): context-utilization fraction (of the turn’s effective budget window — the reconciled context window under any capability-class cap) that triggers a leaf-summarization pass at the end of a turn (0.1-0.95) |
contextEngine.freshTailTurns | number | 8 | DAG mode (live): most-recent steps (assistant + tool round-trips) always kept verbatim and never summarized/evicted (1-50) |
contextEngine.leafChunkTokens | number | 20000 | DAG mode (live): token cap for the oldest out-of-tail chunk summarized into one leaf summary (1K-100K); clamped at runtime to the resolved summarizer model’s window — the smaller of its configured window and the probed served window when the summarizer runs on the served-bound provider — minus the summary target, template overhead, and previous-summary size, so a small compaction summarizer or a served-bound primary is never fed an over-window chunk; a single message larger than the clamped cap is replaced by a bounded deterministic extraction (no LLM call) |
contextEngine.leafTargetTokens | number | 1200 | DAG mode (live): target token size for a leaf summary (96-5000) |
contextEngine.deferCompaction | boolean | true | DAG mode (live): run the afterTurn leaf + condense passes in the background on the per-conversation serializer (never blocking the turn). false runs them inline (deterministic, for tests) |
contextEngine.summarizerSpend.maxTokensPerTenantPerHour | number | 500000 | DAG mode (live): per-tenant rolling-hour ceiling on summarizer (input+output) tokens; over the cap the summarizer is bypassed and assembly degrades to truncation-only. 0 disables the hourly cap |
contextEngine.summarizerSpend.maxTokensPerTenantPerDay | number | 5000000 | DAG mode (live): per-tenant rolling-day summarizer token ceiling. 0 disables the daily cap |
contextEngine.summarizerBreaker.failureThreshold | number | 5 | DAG mode (live): consecutive summarizer failures before the circuit breaker opens → truncation-only assembly |
contextEngine.summarizerBreaker.resetTimeoutMs | number | 60000 | DAG mode (live): how long the summarizer breaker stays open before a half-open trial |
contextEngine.summarizerBreaker.halfOpenTimeoutMs | number | 30000 | DAG mode (live): half-open trial window for the summarizer breaker |
contextEngine.compaction.summarizerFallbackProviders | string[] | [] | DAG mode (live): ordered list of fallback summarizer provider/model ids tried in turn when the primary summarizer fails (SUM-03); the per-tenant breaker + spend caps still bind, and only an all-providers-exhausted failure trips the breaker. [] (default) = no failover — primary only, byte-identical to prior behavior |
~/.comis/config.yaml
For DAG mode configuration fields (compaction thresholds, summary tokens,
recall limits), see the Config Reference.
The
session.compaction settings (softThresholdRatio, hardThresholdRatio,
etc.) control the memory flush — extracting facts and summaries into
long-term storage before the context engine summarizes older messages. These
settings are separate from the context engine configuration above. See
Sessions and the
Config Reference for details.Sessions
How conversations are created and managed.
Memory
Where compaction summaries are stored for long-term recall.
Config Reference
Full configuration reference for all context engine options.
