How resilience works
The resilience stack is organized in layers from inner (closest to the LLM call) to outer (system-wide). Each layer has a specific scope and escalation path. The sections below describe each subsystem in order from the innermost layer (individual prompt calls) to the outermost (system-wide sweeps).
Prompt timeout
Every LLM call has a prompt-level deadline that prevents hung prompts from blocking the agent indefinitely. For primary calls the deadline is a stall budget: it resets on activity (stream text/thinking deltas, throttled ~1/s, and tool completions), so the call is aborted only when NO activity occurs for the configured window. A separate makespan ceiling (promptTimeoutMs × stallCeilingMultiplier, default ×10) bounds the total turn
even while it is still streaming, so a runaway generation cannot reset the
stall budget forever. On abort, the agent automatically retries with a shorter
timeout or falls back to an alternate model.
- Primary calls: 180,000 ms (3 minutes) stall budget by default; makespan ceiling at ×10
- Retry and fallback calls: 60,000 ms (1 minute) by default — whole-turn, not stall-based (no reset on activity)
- What the operator sees: An `execution:prompt_timeout` event in logs (carrying which limit fired, the binding knob, and elapsed time), followed by an automatic retry or model fallback
- What the user sees: A slightly delayed response (the retry is transparent to the user)
The timeouts are configured through `promptTimeoutMs` in the agent’s config. See the
Config YAML Reference for details.
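The interaction between the stall budget and the makespan ceiling can be sketched as a small deadline tracker. This is a minimal illustration under stated assumptions, not the Comis implementation: the `PromptDeadline` class, its method names, and the explicit `now` arguments are hypothetical, and only the defaults (180,000 ms, ×10, roughly one counted delta per second) come from the text above.

```typescript
// Illustrative sketch only: PromptDeadline and its members are hypothetical names.
type TimeoutReason = "stall" | "makespan" | null;

class PromptDeadline {
  private readonly startedAt: number;
  private readonly promptTimeoutMs: number;
  private readonly stallCeilingMultiplier: number;
  private readonly throttleMs: number;
  private lastActivityAt: number;

  constructor(
    startedAt: number,
    promptTimeoutMs = 180_000,
    stallCeilingMultiplier = 10,
    throttleMs = 1_000,
  ) {
    this.startedAt = startedAt;
    this.promptTimeoutMs = promptTimeoutMs;
    this.stallCeilingMultiplier = stallCeilingMultiplier;
    this.throttleMs = throttleMs;
    this.lastActivityAt = startedAt;
  }

  // Stream deltas reset the stall budget at most once per throttle window;
  // tool completions always reset it.
  noteActivity(now: number, kind: "delta" | "tool"): void {
    if (kind === "delta" && now - this.lastActivityAt < this.throttleMs) return;
    this.lastActivityAt = now;
  }

  // Returns which limit fired, or null if the call may continue.
  check(now: number): TimeoutReason {
    const ceilingMs = this.promptTimeoutMs * this.stallCeilingMultiplier;
    if (now - this.startedAt >= ceilingMs) return "makespan";
    if (now - this.lastActivityAt >= this.promptTimeoutMs) return "stall";
    return null;
  }
}
```

Passing timestamps in explicitly keeps the sketch deterministic. A call that keeps streaming never trips the stall check, but it still hits the makespan ceiling after 1,800,000 ms with the defaults.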
Execution timeout
The execution timeout is a 600-second (10-minute) backstop on the entire agent pipeline. It fires only when everything else has failed: prompt timeouts, retries, and fallback models have all been exhausted or are taking too long.
- Default: 600,000 ms (10 minutes), hardcoded
- What the operator sees: An `execution:aborted` event with reason `pipeline_timeout`
- What the user sees: A static error message delivered directly to the channel (no LLM call is made)
Sub-agent watchdog
When a parent agent spawns a sub-agent, a watchdog timer starts automatically. The timer uses a dynamic formula to calculate the deadline:

`Timeout = min(max_steps × perStepTimeoutMs, maxRunTimeoutMs)`

If `max_steps` is not set for the sub-agent, the watchdog falls back to
`maxRunTimeoutMs` directly. When the deadline passes, the watchdog force-fails
the run and sends a failure notification to the channel.
- Default maxRunTimeoutMs: 600,000 ms (10 minutes)
- Default perStepTimeoutMs: 60,000 ms (1 minute per step)
- What the operator sees: A `Sub-agent watchdog timeout` warning in logs
- What the user sees: A failure notification on the channel explaining that the sub-agent task did not complete in time
maxRunTimeoutMs and perStepTimeoutMs are configurable under
security.agentToAgent.subagentContext. See the
Config YAML Reference for details.
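The watchdog formula can be read as a small pure function. This is a minimal sketch: `watchdogTimeoutMs` and the `WatchdogConfig` shape are hypothetical names chosen for illustration, and only the two default values come from this page.

```typescript
// Hypothetical config shape mirroring the two documented knobs.
interface WatchdogConfig {
  perStepTimeoutMs: number;
  maxRunTimeoutMs: number;
}

// Defaults as documented: 1 minute per step, 10 minutes per run.
const WATCHDOG_DEFAULTS: WatchdogConfig = {
  perStepTimeoutMs: 60_000,
  maxRunTimeoutMs: 600_000,
};

// Timeout = min(max_steps x perStepTimeoutMs, maxRunTimeoutMs),
// falling back to maxRunTimeoutMs when max_steps is not set.
function watchdogTimeoutMs(
  maxSteps: number | undefined,
  cfg: WatchdogConfig = WATCHDOG_DEFAULTS,
): number {
  if (maxSteps === undefined) return cfg.maxRunTimeoutMs;
  return Math.min(maxSteps * cfg.perStepTimeoutMs, cfg.maxRunTimeoutMs);
}
```

With the defaults, `watchdogTimeoutMs(3)` yields 180,000 ms, while any sub-agent with ten or more steps is capped at 600,000 ms.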
Failure notification
When a sub-agent fails (from a watchdog timeout, execution error, or ghost sweep), a failure notification is delivered to the channel. This notification uses static text only: it never makes an LLM call and never exposes raw error details to the user. The notification is delivered through the parent agent’s announcement system with a 30-second timeout. If the announcement does not complete within 30 seconds, delivery falls back to sending the message directly to the channel, bypassing the parent entirely.
- What the operator sees: Log entries for the failure and the notification delivery path (announcement or direct send)
- What the user sees: A brief message indicating the sub-agent task did not complete successfully
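The announce-then-fall-back behavior can be sketched with `Promise.race`. This is an illustration under assumptions, not the actual delivery code: `deliverWithFallback`, `announce`, and `sendDirect` are hypothetical names, and the sketch deliberately leaves outright announcement errors to the caller, since this page only specifies what happens on timeout.

```typescript
// Hypothetical sketch: try the parent's announcement path, and if it has not
// completed within timeoutMs, send the static message directly instead.
async function deliverWithFallback(
  announce: () => Promise<void>,
  sendDirect: () => Promise<void>,
  timeoutMs = 30_000,
): Promise<"announcement" | "direct"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  try {
    // Whichever settles first wins; a rejected announce() propagates to the caller.
    const outcome = await Promise.race([
      announce().then(() => "ok" as const),
      timeout,
    ]);
    if (outcome === "ok") return "announcement";
  } finally {
    clearTimeout(timer);
  }

  // Timed out: bypass the parent and deliver straight to the channel.
  await sendDirect();
  return "direct";
}
```

The return value names the path that was taken, which corresponds to the delivery-path log entry an operator would look for.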
Provider health monitor
The provider health monitor aggregates failures across all agents to detect provider-wide outages. It activates automatically with no configuration required. The monitor triggers when either condition is met within a 60-second window:
- 2 or more agents experience failures, or
- 3 consecutive failures from any single agent

When either condition is met, the monitor emits a `provider:degraded` event. While the
provider is degraded:
- LLM-dependent operations are skipped to avoid wasting tokens on a down provider
- Failure notifications use direct channel send instead of LLM-based responses
- Dead-letter queue entries accumulate for later retry
When the provider is healthy again, the monitor emits a `provider:recovered` event. The dead-letter queue automatically drains on
recovery.
- What the operator sees: `provider:degraded` and `provider:recovered` events in logs
- What the user sees: During degradation, static error messages instead of AI-generated responses. Normal service resumes automatically on recovery.
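The two trigger conditions can be sketched as a sliding-window check. This is one possible reading, written as a minimal illustration: `ProviderHealthSketch` and its methods are hypothetical names, and the choice to reset an agent's streak on success is an assumption about what "consecutive" means here.

```typescript
// Hypothetical sketch of the degradation trigger; not the Comis monitor itself.
class ProviderHealthSketch {
  private readonly windowMs = 60_000;
  private failures: { agentId: string; at: number }[] = [];
  private streaks = new Map<string, number>();

  // Assumption: a success ends that agent's run of consecutive failures.
  recordSuccess(agentId: string): void {
    this.streaks.set(agentId, 0);
  }

  // Returns true when this failure would put the provider in "degraded".
  recordFailure(agentId: string, now: number): boolean {
    // Keep only failures inside the 60-second window, then add this one.
    this.failures = this.failures.filter((f) => now - f.at < this.windowMs);
    this.failures.push({ agentId, at: now });

    const streak = (this.streaks.get(agentId) ?? 0) + 1;
    this.streaks.set(agentId, streak);

    const distinctAgents = new Set(this.failures.map((f) => f.agentId)).size;
    const inWindowForAgent = this.failures.filter((f) => f.agentId === agentId).length;
    // Only count consecutive failures that also fall inside the window.
    const consecutiveInWindow = Math.min(streak, inWindowForAgent);

    return distinctAgents >= 2 || consecutiveInWindow >= 3;
  }
}
```

Two different agents failing within the window trip the check immediately, whereas a single agent needs three failures in a row.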
Dead-letter queue
When an announcement delivery fails (for example, the channel is temporarily unreachable), the message is saved to a persistent dead-letter queue backed by a JSONL file at `~/.comis/dead-letters.jsonl`.
- Retry policy: Up to 5 retries per entry
- Maximum age: 1 hour — entries older than this are expired and discarded
- Automatic drain: On `provider:recovered` events, the queue drains automatically, delivering all queued messages
- What the operator sees: `announcement:dead_lettered` events when entries are queued, `announcement:dead_letter_delivered` events when they are successfully delivered on retry
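The retry and expiry rules can be sketched as a single drain pass over the queue. This is a minimal illustration under assumptions: the `DeadLetter` field names, `drainDeadLetters`, and the synchronous `deliver` callback are hypothetical, and the on-disk JSONL record layout is not specified on this page, so file I/O is left out.

```typescript
// Hypothetical entry shape; the real JSONL record layout is not documented here.
interface DeadLetter {
  id: string;
  queuedAt: number; // epoch ms when the entry was dead-lettered
  attempts: number; // retries already made
  payload: string;
}

const MAX_RETRIES = 5;         // up to 5 retries per entry
const MAX_AGE_MS = 3_600_000;  // entries older than 1 hour are discarded

// One drain pass: returns the entries that should stay queued.
function drainDeadLetters(
  entries: DeadLetter[],
  now: number,
  deliver: (entry: DeadLetter) => boolean,
): DeadLetter[] {
  const remaining: DeadLetter[] = [];
  for (const entry of entries) {
    if (now - entry.queuedAt > MAX_AGE_MS) continue;   // expired
    if (entry.attempts >= MAX_RETRIES) continue;       // retries exhausted
    if (deliver(entry)) continue;                      // delivered on retry
    remaining.push({ ...entry, attempts: entry.attempts + 1 });
  }
  return remaining;
}
```

A real drain would be asynchronous and would rewrite the JSONL file afterwards; the synchronous callback here only keeps the policy easy to follow.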
Ghost sweep
The ghost sweep catches sub-agent runs that are stuck in a “running” state past their expected lifetime. This can happen if the process crashes during execution or an unexpected error leaves a run in an inconsistent state.
The grace period is `maxRunTimeoutMs` + 120 seconds (120,000 ms). Any run still
marked as “running” past this deadline is force-failed at ERROR level. A failure
notification is delivered to the channel.
- Default grace period: 720,000 ms (12 minutes) with default `maxRunTimeoutMs`
- What the operator sees: A `Ghost run detected and force-failed` error in logs
- What the user sees: A failure notification on the channel
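The sweep condition reduces to a filter over run records. This is a minimal sketch: `RunRecord`, `findGhostRuns`, and the status values are hypothetical names, and only the 120,000 ms margin and the default `maxRunTimeoutMs` come from this page.

```typescript
// Hypothetical run record; only "running" matters for the sweep.
interface RunRecord {
  id: string;
  status: "running" | "completed" | "failed";
  startedAt: number; // epoch ms
}

const GHOST_MARGIN_MS = 120_000; // fixed margin added to maxRunTimeoutMs

// Returns runs still marked "running" past maxRunTimeoutMs + 120 s.
function findGhostRuns(
  runs: RunRecord[],
  now: number,
  maxRunTimeoutMs = 600_000,
): RunRecord[] {
  const graceMs = maxRunTimeoutMs + GHOST_MARGIN_MS; // 720,000 ms by default
  return runs.filter((r) => r.status === "running" && now - r.startedAt > graceMs);
}
```

Because the grace period always exceeds `maxRunTimeoutMs`, the watchdog gets the first chance to fail a run, and the sweep only catches what it missed.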
Batcher timeout
The batcher timeout is a 30-second timeout on announcement delivery through the parent agent’s announcement system. If the announcement does not complete within 30 seconds, delivery falls back to sending the message directly to the channel. This timeout is automatic and requires no configuration. It ensures that a slow or unresponsive announcement system does not delay failure notifications to users.
- Default: 30,000 ms (30 seconds), automatic
- What the operator sees: A log entry indicating the fallback from announcement to direct channel send
- What the user sees: The same message, delivered through the fallback path — no visible difference
Summarization spend governance (DAG mode)
When the DAG context engine summarizes old history, the summarizer is an LLM call, so it is governed like any other spend boundary. Two independent guards bound it, both per tenant:
- Rolling token cap. A per-tenant rolling-window ceiling on summarizer input+output tokens: `summarizerSpend.maxTokensPerTenantPerHour` (default 500,000) and `summarizerSpend.maxTokensPerTenantPerDay` (default 5,000,000), well below the primary per-hour (10M) and per-day (100M) execution budgets because summarization is a background seam. A ceiling of `0` disables that window.
- Circuit breaker. A `summarizerBreaker` opens after `failureThreshold` consecutive summarizer failures (default 5) and stays open for `resetTimeoutMs` (default 60s) before a half-open trial, reusing the standard circuit-breaker config.
- What the operator sees: a content-free `WARN` (with `errorKind` `resource` for the spend cap, `dependency` for the open breaker, plus a `hint` and `durationMs`) and a `context:dag_degraded` event carrying the reason (`spend_cap` or `breaker_open`): ids, reason, and timing only, never message content.
- What the user sees: nothing different. The turn completes; older history is carried as a truncation rather than a richer summary until the window resets or the breaker closes.
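How the two guards combine can be sketched as one per-tenant gate that is consulted before each summarizer call. This is an illustrative sketch, not the code in `summarizer-spend-breaker.ts`: the `SummarizerGuardSketch` class and its methods are hypothetical, and the order in which the guards are checked is an assumption. Only the defaults and the two reason strings come from this page.

```typescript
type DegradeReason = "spend_cap" | "breaker_open" | null;

const HOUR_MS = 3_600_000;
const DAY_MS = 86_400_000;

// Hypothetical per-tenant guard; keep one instance per tenant id.
class SummarizerGuardSketch {
  maxTokensPerHour = 500_000;
  maxTokensPerDay = 5_000_000;
  failureThreshold = 5;
  resetTimeoutMs = 60_000;

  private usage: { at: number; tokens: number }[] = [];
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  // Returns why summarization must degrade to truncation, or null to proceed.
  check(now: number): DegradeReason {
    // Open breaker blocks calls until resetTimeoutMs has passed; after that
    // the next call is the half-open trial.
    if (this.openedAt !== null && now - this.openedAt < this.resetTimeoutMs) {
      return "breaker_open";
    }
    const used = (windowMs: number): number =>
      this.usage
        .filter((u) => now - u.at < windowMs)
        .reduce((sum, u) => sum + u.tokens, 0);
    // A ceiling of 0 disables that window.
    if (this.maxTokensPerHour > 0 && used(HOUR_MS) >= this.maxTokensPerHour) return "spend_cap";
    if (this.maxTokensPerDay > 0 && used(DAY_MS) >= this.maxTokensPerDay) return "spend_cap";
    return null;
  }

  // Record input+output tokens of a successful call and close the breaker.
  recordSuccess(now: number, tokens: number): void {
    this.usage.push({ at: now, tokens });
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // A failed half-open trial reopens the breaker for another resetTimeoutMs.
  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAt = now;
  }
}
```

When `check` returns a reason, the caller would skip the LLM call and fall back to truncation; the same reason string is what the `context:dag_degraded` event carries.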
Emergency-fallback summaries are marked and taint-escaped
A summary produced by the truncation-only degrade path (no LLM) is flagged in its trusted header with an unspoofable `fallback=emergency-truncation` marker, so
the model is told honestly that the summary is a degraded emergency truncation
rather than a full summary. The marker is driven only by the real stored
fallback flag and lives outside the untrusted-content region, and the summary
body itself stays taint-wrapped — so a poisoned summarized message can neither
forge the marker nor inject instructions through it. (See
Honest presentation of summaries.)
Fail-closed session rollover
When a session’s scope is ambiguous or malformed, the DAG ingest path refuses the write rather than silently merging history across sessions. Failing closed here prevents one session’s compressed history from leaking into another. The refusal is logged and emitted as a `context:dag_degraded` event with the reason
fail_closed_rollover, so an operator can see (and fix) the misconfiguration —
again, ids and reason only, never content.
Reference: circuit breakers, retries, fallbacks
Four circuit breakers run inside Comis, each with its own scope:

| Breaker | Scope | File |
|---|---|---|
| Provider health breaker | Per LLM provider — opens after consecutive failures, blocks all calls during cooldown | packages/agent/src/safety/circuit-breaker.ts |
| Tool retry breaker | Per tool — classifies tool errors and decides whether to retry, fall back, or surface the error | packages/agent/src/safety/tool-retry-breaker.ts |
| Layer circuit breaker | Per context-engine layer — disables a layer for the rest of the session after 3 consecutive failures | packages/agent/src/context-engine/context-engine.ts |
| Summarizer spend breaker | Per tenant — bounds DAG summarization LLM spend with a rolling per-tenant token cap and opens after consecutive summarizer failures; degrades to truncation-only assembly (no LLM call) | packages/agent/src/safety/summarizer-spend-breaker.ts |

Retries and fallbacks are layered on top of the breakers:
- SDK retries: exponential-backoff retries inside the LLM provider SDK for transient HTTP errors (rate limits, 5xx, network timeouts).
- Per-model fallback chain: if the primary model fails after retries,
Comis tries the next model in
`modelFailover.fallbackModels`. Configured on Models.
- Image model fallback (`packages/agent/src/model/image-router.ts`): primary → secondary →
tertiary vision-capable models, so you never lose image understanding to a
single provider’s outage.
Recovery is also automatic: when the provider health monitor emits
provider:recovered, the dead-letter queue drains and the next normal
request closes the breaker.
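The per-model fallback chain can be sketched as a loop over an ordered model list. This is a minimal illustration, not Comis code: `callWithFallback` and the `call` callback are hypothetical names, and the sketch assumes SDK-level retries have already happened inside `call` before it rejects.

```typescript
// Hypothetical sketch: try the primary model first, then each fallback in order.
async function callWithFallback<T>(
  models: string[],
  call: (model: string) => Promise<T>,
): Promise<{ model: string; result: T }> {
  let lastError: unknown;
  for (const model of models) {
    try {
      // Assumption: call() already includes the SDK's own backoff retries.
      return { model, result: await call(model) };
    } catch (err) {
      lastError = err; // remember the failure and move to the next model
    }
  }
  // Every model failed (or none were configured): surface the last error.
  throw lastError ?? new Error("no models configured");
}
```

A caller would pass the primary model followed by the entries of `modelFailover.fallbackModels`, and the returned `model` field shows which one actually answered.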
Configuration
Most resilience features activate automatically with sensible defaults. The configurable settings live in `~/.comis/config.yaml`.
The `contextEngine.deferCompaction`
toggle that backgrounds compaction is documented in full on the
Compaction and
Config Reference
pages.
See the Config YAML Reference for the full list of
all configuration options with types, defaults, and validation rules.
Related
- Safety: Budget, circuit breaker, and step limits.
- Subagent Lifecycle: Spawn packets, lifecycle hooks, and end reasons.
- Troubleshooting: Solutions for resilience-related issues.
- Config YAML Reference: Full configuration reference for all settings.
