Observability

A fleet-wide bug must be diagnosable from one structured artifact with one command in under five minutes. This page documents the observability foundations. It covers the trajectory layer, the 55-entry bridge mapping, lifecycle envelopes, forensic INFO promotions, the dedup detector, boot invariants, the alert budget, step: discipline, log rotation, the memory & recall diagnostics (recall-trace artifact, recall events, degradation signals), and the bundle export entry point. If you are new to Comis observability, start here; then follow the cross-links for operator workflows.

Incident Bundle

One-command bundle export — comis trace export and /export-trajectory.

Trace CLI

Copy-pasteable examples for every comis trace subcommand.

1. Trace Propagation

Every inbound message receives a traceId at channel ingress: before the queue, before the agent, before delivery. The trace ID flows through the entire pipeline via AsyncLocalStorage (the trace-logger mixin in packages/daemon/src/observability/trace-logger.ts), so every Pino log line emitted during that turn carries the same traceId automatically.

Practical consequence: because daemon.log is JSON lines, grep '"traceId":"<id>"' ~/.comis/logs/daemon.log returns the complete causal chain for a single message (channel ingress, queue enqueue, agent execution, and outbound delivery) without any cross-file correlation.

The on-wire field is normalized.metadata.traceId; the helper getMessageTraceId() reads it. All 10 channel adapters wrap their dispatch loops in runWithContext, and the orchestrator’s adapter.onMessage handler applies a defense-in-depth second wrap so no adapter can silently skip propagation.
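The propagation mechanism can be sketched with Node's AsyncLocalStorage. This is a minimal illustration, not the contents of trace-logger.ts: the TraceContext shape, traceMixin, and logLine are hypothetical names, and only runWithContext and traceId come from the text above.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context shape; the real context likely carries more fields.
interface TraceContext {
  traceId: string;
}

const traceStore = new AsyncLocalStorage<TraceContext>();

// Run fn so that everything it calls, including async continuations,
// can read ctx from the store.
function runWithContext<T>(ctx: TraceContext, fn: () => T): T {
  return traceStore.run(ctx, fn);
}

// Mixin-style helper: the fields to merge into every log line.
function traceMixin(): { traceId?: string } {
  const ctx = traceStore.getStore();
  return ctx ? { traceId: ctx.traceId } : {};
}

// Stand-in for a logger call somewhere deep in the pipeline.
function logLine(msg: string): Record<string, unknown> {
  return { msg, ...traceMixin() };
}

const inside = runWithContext({ traceId: "t-123" }, () => logLine("Message enqueued"));
const outside = logLine("no active turn");
```

Here inside carries the traceId without any argument threading, while outside has none. That is why the wrap has to happen at every adapter.onMessage registration site.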

Architecture invariant: test/architecture/trace-propagation.test.ts asserts that every adapter.onMessage(...) registration site is wrapped in runWithContext. This test is shrink-only — violations are caught at CI time, not at incident time.

2. Trajectory Layer

Every agent session writes a *.trajectory.jsonl file co-located with the SDK transcript. This is the structured artifact the design goal refers to. Key properties:

Schema-versioned — traceSchema: "comis-trajectory", schemaVersion: 1. Additive changes (new optional fields, new event types) stay on version 1.
Content-free at runtime — digests and structural fields only; no prompt text, response text, or tool-result bodies at recording time. Content is synthesized at bundle-export time from the SDK session JSONL.
Multi-source discriminator — source: "runtime" | "transcript" | "export". Runtime events carry "runtime". The bundle exporter synthesizes "transcript" events from the session JSONL and "export" events for bundle-level summaries.
Causality DAG — each event carries entryId (per-event UUID), monotonic seq, optional sourceSeq, and optional parentEntryId. The session branch is reconstructed leaf-to-root at export time.
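The leaf-to-root reconstruction can be sketched as a walk over parentEntryId links. This is an assumption about how the exporter might do it, not its actual code; reconstructBranch and the sample events are hypothetical.

```typescript
interface TrajectoryEvent {
  entryId: string;
  seq: number;
  parentEntryId?: string;
  type: string;
}

// Walk parent links from a leaf back to the root, then reverse to root-first order.
function reconstructBranch(events: TrajectoryEvent[], leafId: string): TrajectoryEvent[] {
  const byId = new Map<string, TrajectoryEvent>();
  for (const e of events) byId.set(e.entryId, e);

  const branch: TrajectoryEvent[] = [];
  const seen = new Set<string>(); // guards against a malformed parent cycle
  let current = byId.get(leafId);
  while (current && !seen.has(current.entryId)) {
    seen.add(current.entryId);
    branch.push(current);
    current = current.parentEntryId ? byId.get(current.parentEntryId) : undefined;
  }
  return branch.reverse();
}

const events: TrajectoryEvent[] = [
  { entryId: "a", seq: 1, type: "session.started" },
  { entryId: "b", seq: 2, parentEntryId: "a", type: "queue.enqueued" },
  { entryId: "c", seq: 3, parentEntryId: "a", type: "queue.enqueued" }, // sibling branch
  { entryId: "d", seq: 4, parentEntryId: "b", type: "session.ended" },
];
const branch = reconstructBranch(events, "d").map((e) => e.entryId);
```

Starting from leaf "d", the walk visits d, b, a and skips the sibling "c", so only the events on the selected branch survive.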

3. Lifecycle Envelopes

Three sentinel events fire once per session to bracket the trajectory with context:

Envelope	Fires	Carries	Once-per-session
`trace.metadata`	After `session.started`	harness, model, config, plugins, skills, prompting, redaction snapshot	Yes (latch in `pi-event-bridge.ts`)
`trace.artifacts`	Before `session.ended`	finalStatus, abort, timeout, token usage, prompt-cache hit rate, compaction count, lastToolError	Yes (in `comis-session-manager.ts`)
`trace.truncated`	When file fills	droppedEvents, droppedEventBytes, limitBytes	At most once per file (then writer halts)

trace.metadata answers “what configuration ran this session?” trace.artifacts answers “how did it end and what did it cost?” trace.truncated answers “was any data lost?” — a non-zero droppedEvents value means the trajectory file was capped before the session ended. The bundle exporter reads trace.metadata to populate metadata.json and trace.artifacts to populate artifacts.json.

4. Bridge Mapping (55 entries)

The trajectory bus-bridge maps 55 event types from the typed event bus to trajectory events (up from 18 previously — a ~3× coverage improvement).

Architecture invariant: test/architecture/trajectory-event-types-known.test.ts asserts the disjoint-set invariant (no overlap between bridge-mapped types and direct-emit types) and a bridge count ≥ 45. Current count: 55.

Category	Bridge prefix	Event types
Queue	`queue.*`	enqueued, dequeued, overflow, coalesced
Delivery	`delivery.*`	retry, retry_exhausted, markdown_fallback
Execution	`execution.*`	aborted, budget_warning, prompt_timeout, output_escalated, replay_recovered
Security	`security.*`, `sender.blocked`	injection_detected, memory_tainted, warn, sender_blocked
MCP	`mcp.*`	disconnected, reconnecting, reconnect_failed, reconnected, tools_changed
Channel	`channel.*`	health_changed, lifecycle
Compaction	`compaction.*`	started, flush, recommended
Context	`context.*`	evicted, masked, reread, overflow, integrity, rehydrated
Approval	`approval.*`	requested, resolved
Dedup	`dedup.duplicate_inbound`	duplicate_inbound
Health	`health.budget_exceeded`	budget_exceeded
Session	(direct emit — carve-out)	session.started, session.ended, session.transcript.entry, trace.metadata, trace.artifacts, trace.truncated, trace.write_failures

The queue.enqueued event — which carries sessionKey, mode, and queueDepth — is the trajectory signal that makes the 2026-05-24 duplicate-adapter bug visible from a single jq query.
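The disjoint-set invariant amounts to a set-intersection check. The sketch below uses small illustrative subsets rather than the real lists, and the colon-to-dot rename is an assumption generalized from the two documented pairs (dedup:duplicate_inbound and health:budget_exceeded); toTrajectoryType and overlap are hypothetical helpers.

```typescript
// Illustrative subsets only; the authoritative lists live in the bridge and its architecture test.
const bridgeMapped = new Set<string>([
  "queue.enqueued",
  "dedup.duplicate_inbound",
  "health.budget_exceeded",
]);
const directEmit = new Set<string>(["session.started", "session.ended", "trace.metadata"]);

// Assumed naming rule: bus events use "namespace:name", trajectory types use "namespace.name".
function toTrajectoryType(busEvent: string): string {
  return busEvent.replace(":", ".");
}

// Every type present in both sets; an empty result means the invariant holds.
function overlap(a: Set<string>, b: Set<string>): string[] {
  return Array.from(a).filter((t) => b.has(t));
}

const violations = overlap(bridgeMapped, directEmit);
```

A test built this way fails as soon as a type is added to both sets, which keeps each trajectory event type tied to exactly one emit path.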

Credential Broker Events (broker:*)

The credential broker emits 7 typed events. No secret value appears in any event payload. All events carry sessionId and timestamp. Every log line emitted by the broker also carries step (pipeline stage), traceId, and agentId for structured correlation.

Event	Key payload fields	Description
`broker:session_opened`	`sessionId`, `agentId`, `host`, `presetId?`	Driven-CLI session established; single-use proxy token issued
`broker:session_closed`	`sessionId`, `agentId`, `durationMs`, `reason`	Session torn down (`teardown` or `error`)
`broker:request`	`sessionId`, `host`, `path`, `method`	HTTP CONNECT request received from driven CLI
`broker:injected`	`sessionId`, `host`, `ruleKind`	Credential successfully injected into the request
`broker:denied`	`sessionId`, `host`, `reason`, `statusCode`	Request denied fail-closed (see reason codes below)
`broker:credential_unavailable`	`sessionId`, `secretRef`, `agentId`	SecretManager returned undefined; 502 returned, request never forwarded
`broker:egress_blocked`	`sessionId`, `targetHostHash`	Direct egress attempt blocked; `targetHostHash` is SHA-256 hex — the plaintext host is never logged

broker:denied reason codes:

Reason	HTTP Status	Meaning
`bad_token`	407	Missing, forged, or consumed single-use proxy token
`no_binding`	403	Host not in any configured binding
`path_policy`	403	Host known but path denied by `pathPolicy`
`malformed_request`	400	HTTP request could not be parsed
`body_too_large`	413	Request body exceeds 10 MiB
`ws_upgrade_not_supported`	501	WebSocket upgrade (not supported in this release)

The secret:accessed event (emitted during broker request handling alongside the broker:* events) logs each SecretManager resolution: fields secretName, agentId, outcome (success / not_found), timestamp. Redaction-by-construction: hosts using query-param injection (setParam rule kind) never emit a full URL in logs or events. The broker:egress_blocked event carries only targetHostHash (SHA-256 hex) — never the plaintext blocked host. Every failure log carries err, errorKind, and a non-empty hint.
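The targetHostHash field can be illustrated with Node's built-in crypto module. This is a sketch under assumptions: hashTargetHost and egressBlockedEvent are hypothetical names, and whether the real broker normalizes the host (for example by lowercasing) before hashing is not stated in this page.

```typescript
import { createHash } from "node:crypto";

// Hash the host so events can be correlated without revealing the host itself.
function hashTargetHost(host: string): string {
  return createHash("sha256").update(host, "utf8").digest("hex");
}

// Hypothetical payload builder; field names follow the event table above.
function egressBlockedEvent(sessionId: string, host: string) {
  return {
    type: "broker:egress_blocked",
    sessionId,
    targetHostHash: hashTargetHost(host),
    timestamp: Date.now(),
  };
}

const evt = egressBlockedEvent("s-1", "blocked.example.com");
```

The payload contains a 64-character hex digest and never the plaintext host, so two blocked attempts against the same host can still be grouped by hash.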

Source: packages/infra/src/credential-broker/broker-events.ts (event schema) + packages/core/src/event-bus/events-infra.ts (type declarations)

Credential Broker observability details →

5. Defense-in-depth Bounding

Every payload that enters the trajectory recorder passes through limitTrajectoryPayloadValue, which enforces hard size limits before writing. Every truncation leaves a structured sentinel so consumers know exactly what was lost.

Bound	Limit	Sentinel on hit
Per-string chars	32,768	`{ truncated: true, reason: "trajectory-field-size-limit", originalChars, limitChars }`
Per-array items	64	Items beyond cap dropped
Per-object keys	64	Keys beyond cap dropped
Recursion depth	6	Deeper levels dropped
Per-event bytes	256 KB	Event dropped, `droppedEvents++`
File soft cap	10 MB	Final `trace.truncated` event, recorder closes
File hard cap	50 MB	Writer halts, `droppedEvents` incremented
Active writers (process-wide)	100 (LRU)	Oldest writer evicted
Circular reference	n/a	`{ truncated: true, reason: "trajectory-circular-reference" }`

The MAX_TRAJECTORY_WRITERS = 100 constant is exported from packages/observability/src/trajectory/runtime.ts:116.
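The per-value bounds in the table can be sketched as one recursive function. This is not the real limitTrajectoryPayloadValue: limitValue is a hypothetical stand-in, and two behaviors are assumptions (an oversized string is replaced entirely by its sentinel, and depth is counted from 0 at the root).

```typescript
const LIMITS = { stringChars: 32_768, arrayItems: 64, objectKeys: 64, depth: 6 };

function limitValue(value: unknown, depth = 0, seen = new Set<object>()): unknown {
  if (typeof value === "string") {
    if (value.length <= LIMITS.stringChars) return value;
    return {
      truncated: true,
      reason: "trajectory-field-size-limit",
      originalChars: value.length,
      limitChars: LIMITS.stringChars,
    };
  }
  if (value === null || typeof value !== "object") return value;
  if (seen.has(value)) return { truncated: true, reason: "trajectory-circular-reference" };
  if (depth >= LIMITS.depth) return undefined; // deeper levels dropped

  seen.add(value);
  let out: unknown;
  if (Array.isArray(value)) {
    // Items beyond the cap are dropped.
    out = value.slice(0, LIMITS.arrayItems).map((v) => limitValue(v, depth + 1, seen));
  } else {
    const obj: Record<string, unknown> = {};
    // Keys beyond the cap are dropped.
    for (const key of Object.keys(value).slice(0, LIMITS.objectKeys)) {
      obj[key] = limitValue((value as Record<string, unknown>)[key], depth + 1, seen);
    }
    out = obj;
  }
  seen.delete(value); // only ancestors count as circular, not repeated siblings
  return out;
}
```

The per-event byte limit and the file caps sit outside this function, since they apply to the serialized event and the file rather than to a single value.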

6. Forensic INFO Promotions

Seven forensic events, each firing O(1) times per turn, are promoted from DEBUG to INFO so production daemons at logLevel: info retain the signal needed to diagnose the 2026-05-24 duplicate-adapter incident class. Before this change, Message enqueued was level: 20 (DEBUG). A production daemon at logLevel: info would have silently discarded it, making the duplicate-enqueue symptom invisible without enabling debug logging first.

Event	Source module	Carries
Adapter registered	`packages/orchestrator/src/channel-manager.ts`	adapterId, channelType, handlersBefore, handlersAfter
Message enqueued	`packages/orchestrator/src/queue/command-queue.ts`	channelType, mode, queueDepth, messageId
Message dequeued	`packages/orchestrator/src/queue/command-queue.ts`	channelType, waitTimeMs
Execution started	`packages/agent/src/executor/pi-executor/pi-executor.ts`	agentId, sessionKey, traceId
Execution complete	`packages/agent/src/executor/executor-post-execution.ts`	finalStatus, durationMs, cacheHitRate, sessionCacheSavingsRate
Memory store complete	`packages/memory/src/sqlite-memory-adapter.ts`	durationMs, op, hasEmbedding, memoryType
Outbound message	8 channel adapters	channelType, messageId, deliveryStatus

Per-turn INFO count grows from ~5 to ~10 lines (bounded). The architecture test test/architecture/forensic-events-info-level.test.ts (15 tests) enforces shrink-only — no forensic event may regress to DEBUG without failing CI.

7. Duplicate-Inbound Detector

The dedup detector runs synchronously in the inbound hot path to catch the same messageId being processed more than once within a short window — the defining symptom of the 2026-05-24 duplicate-adapter incident class. Implementation: a bounded LRU at 1024 entries, 10 s window (windowMs = 10_000). On a duplicate messageId within the window, it:

Emits dedup:duplicate_inbound { messageId, channelType, chatId, firstSeenAt, duplicateAt, deltaMs, source } on the typed event bus.
Logs a WARN with errorKind: "internal" and hint: "Same messageId processed twice; check channel adapter handler list and queue mode".
Does not suppress processing — the duplicate continues through the pipeline so the full symptom is visible in the trajectory.

The dedup:duplicate_inbound event is bridge entry #54, mapped to the dedup.duplicate_inbound trajectory event type. At 300 msg/s (10× expected production load), overhead is sub-microsecond per check (measured in dedup-detector.perf.test.ts). Operator query:

jq 'select(.event == "dedup:duplicate_inbound")' ~/.comis/logs/daemon.log
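The detector's core can be sketched as a bounded map keyed by messageId. This is a simplified illustration, not the real implementation: DuplicateInboundDetector and its method names are hypothetical, and eviction here is oldest-inserted-first, which approximates the LRU described above.

```typescript
interface DuplicateInfo {
  messageId: string;
  firstSeenAt: number;
  duplicateAt: number;
  deltaMs: number;
}

class DuplicateInboundDetector {
  // Map preserves insertion order, so the first key is always the oldest entry.
  private readonly seen = new Map<string, number>();
  private readonly maxEntries: number;
  private readonly windowMs: number;

  constructor(maxEntries = 1024, windowMs = 10_000) {
    this.maxEntries = maxEntries;
    this.windowMs = windowMs;
  }

  // Returns duplicate details when messageId was seen within the window, else null.
  check(messageId: string, now: number = Date.now()): DuplicateInfo | null {
    const firstSeenAt = this.seen.get(messageId);
    if (firstSeenAt !== undefined && now - firstSeenAt <= this.windowMs) {
      return { messageId, firstSeenAt, duplicateAt: now, deltaMs: now - firstSeenAt };
    }
    // New or expired id: re-insert so it becomes the newest entry.
    this.seen.delete(messageId);
    this.seen.set(messageId, now);
    if (this.seen.size > this.maxEntries) {
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    return null;
  }
}
```

The caller would emit the event and the WARN when check returns a value, then continue processing the message either way, matching the no-suppression rule above.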

8. Boot Invariants

On every daemon startup, before traffic is accepted, a daemon:startup_invariants INFO record fires with 8 fields describing the wiring state:

Field	What it measures
`adaptersByChannelType`	Count of adapters per channel type
`handlersPerAdapter`	Count of `onMessage` handlers per adapter
`pluginRegistryCount`	Total registered plugins
`channelRegistryCount`	Total registered channels
`depSlotConsistency`	Whether the deprecated `adaptersList` slot is absent
`agentCount`	Total configured agents
`toolCatalogSize`	Total available tools
`mcpServerCount`	Total MCP servers

If handlersPerAdapter[<type>] > 1, a WARN fires immediately:

{
  "level": 40,
  "msg": "Duplicate adapter registration detected",
  "hint": "Duplicate adapter registration detected",
  "errorKind": "config",
  "handlersPerAdapter": { "telegram": 2 }
}

This WARN fires before saveLastKnownGood and before the daemon begins accepting messages. In the 2026-05-24 incident, the duplicate adapter had been in production for days; the boot invariant would have surfaced it at the next restart with zero traffic impact.

If you see a handlersPerAdapter WARN at boot, do not proceed to production. The daemon will process every inbound message twice, producing duplicate AI responses. Check your setup-channels-runtime.ts wiring before accepting traffic.
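The duplicate-handler check reduces to scanning handlersPerAdapter for counts above one. A minimal sketch, assuming the field is a plain map from channel type to handler count; findDuplicateHandlers and checkStartupInvariants are hypothetical names, and the record shape mirrors the WARN example above without the hint field.

```typescript
interface StartupWarning {
  level: 40;
  msg: string;
  errorKind: "config";
  handlersPerAdapter: Record<string, number>;
}

// Channel types whose adapter has more than one onMessage handler.
function findDuplicateHandlers(handlersPerAdapter: Record<string, number>): string[] {
  return Object.keys(handlersPerAdapter).filter((type) => (handlersPerAdapter[type] ?? 0) > 1);
}

// Build the WARN record when any adapter is double-registered, else return null.
function checkStartupInvariants(handlersPerAdapter: Record<string, number>): StartupWarning | null {
  if (findDuplicateHandlers(handlersPerAdapter).length === 0) return null;
  return {
    level: 40,
    msg: "Duplicate adapter registration detected",
    errorKind: "config",
    handlersPerAdapter,
  };
}

const warning = checkStartupInvariants({ telegram: 2, discord: 1 });
```

A deploy script could run an equivalent check against the boot log and refuse to route traffic when the result is non-null.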

9. step: Discipline

Every known pipeline stage emits at least one log line carrying step: "<stage>", enforced by test/architecture/pipeline-step-coverage.test.ts.

Why it matters: before this change, only 3% of daemon.log lines (15 of 480 sampled) carried a step: field, so filtering by pipeline stage required reading every log line. The target is ≥ 50% coverage.

The authoritative stage token map lives in test/architecture/pipeline-step-coverage.test.ts; consult it when adding new log call sites. Operator filter pattern:

jq 'select(.step == "queue-enqueue")' ~/.comis/logs/daemon.log

Other useful stage tokens: "channels-inbound", "queue-dequeue", "agent-execute", "delivery-outbound".

The architecture test is shrink-only — adding a new stage without a step:-tagged emit fails CI.

10. Session Index

The session index at ~/.comis/logs/session-index.YYYY-MM-DD.jsonl is an append-only lightweight index of session lifecycle events — the primary scan target for comis trace --since 10m --where error. Three event kinds:

// session_started — one per session open
{ "traceSchema": "comis-session-index", "schemaVersion": 1,
  "event": "session_started", "ts": "...", "sessionId": "...",
  "sessionKey": "...", "channelType": "telegram", "chatId": "...",
  "agentId": "...", "traceIds": ["..."], "source": "runtime" }

// turn_completed — one per agent turn. stopReason is the reliable per-turn
// SDK stop signal ("error" marks an aborted call); finishReason appears only
// once the execution-level disposition has settled away from "stop"
// (e.g. "context_exhausted"), so degraded turns are greppable from this index.
{ "event": "turn_completed", "ts": "...", "sessionId": "...",
  "traceId": "...", "durationMs": 0, "inputTokens": 0, "outputTokens": 0,
  "stopReason": "error", "finishReason": "context_exhausted",
  "lastError": null, "source": "runtime" }

// session_ended — one per session close
{ "event": "session_ended", "ts": "...", "sessionId": "...",
  "endReason": "success", "turnCount": 0, "totalTokens": 0,
  "source": "runtime" }

Every row carries two additive optional provenance fields (added on schemaVersion: 1, so older readers ignore them):

source — "runtime" for production rows (shown above), "test" for rows written by a VITEST/NODE_ENV=test process, "bench" for harness-injected rows.
synthetic — true on test/bench/harness rows so they are self-identifying; absent on production rows. comis trace / obs.* exclude synthetic: true rows by default (a row counts as synthetic only when synthetic === true).

As a safety guard, appendSessionIndexEntry throws if a test process (VITEST=true or NODE_ENV=test) ever targets the real ~/.comis — a test run can never silently pollute production telemetry. Tests must write to a tmp dir, where their rows are stamped source: "test", synthetic: true. Date-rolled files honor the same observability.logRotation policy as daemon.log. For full query examples see the Trace CLI reference.
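A reader of the index might filter rows as sketched below. The row type lists only fields shown in this section; parseIndex, productionRows, and degradedTurns are hypothetical helpers, and the stopReason value "stop" in the sample is illustrative rather than taken from the schema.

```typescript
interface SessionIndexRow {
  event: "session_started" | "turn_completed" | "session_ended";
  sessionId: string;
  source?: "runtime" | "test" | "bench";
  synthetic?: boolean;
  stopReason?: string;
  finishReason?: string;
}

// Parse JSONL text, skipping blank lines.
function parseIndex(jsonl: string): SessionIndexRow[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SessionIndexRow);
}

// A row is synthetic only when synthetic === true; an absent field means production.
function productionRows(rows: SessionIndexRow[]): SessionIndexRow[] {
  return rows.filter((row) => row.synthetic !== true);
}

// Degraded turns: stopReason "error", or a settled finishReason other than "stop".
function degradedTurns(rows: SessionIndexRow[]): SessionIndexRow[] {
  return rows.filter(
    (row) =>
      row.event === "turn_completed" &&
      (row.stopReason === "error" ||
        (row.finishReason !== undefined && row.finishReason !== "stop")),
  );
}

const sample = [
  '{"event":"turn_completed","sessionId":"s1","stopReason":"error","finishReason":"context_exhausted","source":"runtime"}',
  '{"event":"turn_completed","sessionId":"s2","stopReason":"stop","source":"runtime"}',
  '{"event":"turn_completed","sessionId":"s3","stopReason":"error","source":"test","synthetic":true}',
].join("\n");
const degraded = degradedTurns(productionRows(parseIndex(sample)));
```

Only s1 survives both filters: s2 finished normally, and s3 is excluded as a synthetic test row even though it also reports an error.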

11. Bundle Export

When a user reports an issue, the operator workflow is:

comis trace --message-id <uuid> or comis trace --chat <chatId> --tail to locate the session.
comis trace export <sessionId> to produce a self-contained bundle directory.
Share the bundle with the diagnosing engineer.

Users can also trigger export via the /export-trajectory slash command in a direct message (or in a group channel, where the result is DM’d to the owner).

Privacy: bundle contents reflect the raw session and runtime trajectory at export time. Redaction applies platform-aware patterns (Telegram chat IDs, JWTs, AWS keys, URL userinfo, basic-auth, cookie headers, emails), substitutes paths, and omits identified PII fields — but redaction is heuristic. Always treat exported bundles as containing sensitive content; share only with authorized engineers, prefer DM/secure channels, and delete after triage.

Full bundle workflow, directory shape (8 files), and redaction policy (platform-aware-v1) are documented in:

Incident Bundle

Bundle export workflow, 8-file directory shape, and redaction policy.

12. Log Rotation

All 5 observability streams honor the observability.logRotation config block:

Stream	File pattern
Daemon log	`~/.comis/logs/daemon.log`
Cache trace	`~/.comis/logs/cache-trace.jsonl`
Config audit	`~/.comis/logs/config-audit.jsonl`
Session index	`~/.comis/logs/session-index.YYYY-MM-DD.jsonl`
Trajectory	`~/.comis/workspace/sessions/**/*.trajectory.jsonl`

Defaults:

Setting	Default	Notes
`maxSizeBytes`	`52428800` (50 MB)	Per-file size trigger
`maxFiles`	`5`	Rotated copies to keep per stream
`maxAgeDays`	`30`	Delete rotated files older than this
`compressAged`	`true`	gzip rotated files

Storage budget: 5 streams × 5 files × 50 MB = 1.25 GB worst-case. With gzip compression, expect ~300 MB in practice. Operator action content (viewing, tuning, disk-constrained deployments) is in:

Logging

Log rotation policy, viewing logs, configuring log levels per module.

13. Alert Budget

The rate-aggregator subscribes to health/safety events on the typed event bus. When a per-errorKind threshold is exceeded in a sliding window, it emits health:budget_exceeded { kind, count, windowMs } exactly once until the window slides past — then re-arms (once-per-window latch). health:budget_exceeded is bridge entry #55, mapped to the health.budget_exceeded trajectory event. Supported errorKind values: config, auth, timeout, internal, network, quota, resource, policy, agent, external.

Configurable via observability.alertBudget in your YAML config. Defaults are conservative — most production daemons will not see this fire under normal operation. Tune thresholds downward on high-traffic deployments where per-event noise is expected.

Query:

jq 'select(.event == "health:budget_exceeded")' ~/.comis/logs/daemon.log
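The once-per-window latch can be sketched as a sliding-window counter per errorKind. This is an illustration under assumptions, not the rate-aggregator's code: AlertBudget is a hypothetical class, a single threshold is shared by all kinds, and the latch re-arms lazily on the next recorded event once the count has fallen back to the threshold or below.

```typescript
type BudgetExceeded = { kind: string; count: number; windowMs: number };

class AlertBudget {
  private readonly hits = new Map<string, number[]>();
  private readonly latched = new Set<string>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Record one event of the given errorKind; returns an alert at most once per breach.
  record(kind: string, now: number): BudgetExceeded | null {
    // Keep only the timestamps that are still inside the sliding window.
    const recent = (this.hits.get(kind) ?? []).filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.hits.set(kind, recent);

    if (recent.length <= this.threshold) {
      this.latched.delete(kind); // the window slid past the burst: re-arm
      return null;
    }
    if (this.latched.has(kind)) return null; // already alerted for this breach
    this.latched.add(kind);
    return { kind, count: recent.length, windowMs: this.windowMs };
  }
}
```

With a threshold of 2 and a 1 s window, the third timeout inside the window produces one alert, the fourth is suppressed, and a fresh burst several seconds later alerts again.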

14. Worked Example — 2026-05-24 Incident Replay

This section shows how the four observability signals surface the 2026-05-24 duplicate-adapter bug in under five minutes. Setup: fresh daemon, setup-channels-runtime.ts:217 regression re-introduced (the same Telegram adapter passed into both deps.adapters and deps.channelRegistry). Boot the daemon, send one Telegram message.

Signal 1 — Boot WARN

{
  "level": 30,
  "msg": "daemon:startup_invariants",
  "handlersPerAdapter": { "telegram": 2 },
  "adaptersByChannelType": { "telegram": 1 }
}
{
  "level": 40,
  "msg": "Duplicate adapter registration detected",
  "hint": "Duplicate adapter registration detected",
  "errorKind": "config"
}

Root cause visible at boot — before any user traffic.

Signal 2 — Two `queue.enqueued` trajectory events

jq 'select(.type == "queue.enqueued")' ~/.comis/workspace/sessions/default/**/*.trajectory.jsonl

Returns two entries with queueDepth: 1 and queueDepth: 2, same sessionKey, same timestamp window — the duplicate processing visible in the structured artifact.

Signal 3 — dedup:duplicate_inbound

{
  "level": 40,
  "event": "dedup:duplicate_inbound",
  "messageId": "d9f5e08e",
  "channelType": "telegram",
  "deltaMs": 1,
  "source": "pipeline",
  "errorKind": "internal"
}

Fires within 1 ms of the duplicate inbound. deltaMs: 1 matches the original incident’s two Message enqueued lines at 06:01:47.385 and 06:01:47.386.

Signal 4 — comis trace

node packages/cli/dist/cli.js trace --message-id d9f5e08e

01:47.377  channel     inbound          chat=678314278
01:47.385  queue       enqueued         queueDepth=1   mode=steer+followup
01:47.386  queue       enqueued         queueDepth=2   mode=steer+followup  ⚠ DUPLICATE
01:47.386  dedup       duplicate_inbound source=pipeline  deltaMs=1
01:47.576  agent       execution        traceId=...     (run 1)
02:03.492  agent       execution        traceId=...     (run 2)              ⚠ DUPLICATE
02:03.616  channel     outbound         msgId=5785

All four signals — in a single terminal session. Time from user complaint to identified root cause: under 5 minutes.

15. Memory & Recall Diagnostics

Recall is the path that decides which memories enter the prompt. When recall surprises you — the wrong memory injected, or recall feeling slow — these are the signals that explain it. The runbook for acting on them lives in Troubleshooting → “Why did recall pick X / why is recall slow?”; this section documents the artifacts and events those workflows read.

15.1 The recall-trace artifact (opt-in)

The recall trace is a per-recall JSONL artifact that records the ranking preview for each recall: which lanes matched, the fused order, the pre/post-rerank scores, the recency/temporal/proof/trust score components, and the include/exclude reason for every candidate. It is the sibling of the cache trace — same writer family, same rotation policy — but opt-in.

Property	Value
Config key	`diagnostics.recallTrace.enabled`
Default	`false` — opt-in (the only diagnostics writer that ships off; `trajectory` and `cacheTrace` default `true`)
Default path	`~/.comis/logs/recall-trace.jsonl` (override via `diagnostics.recallTrace.filePath`; `~` expansion supported)
File cap	`52428800` (50 MB), via `diagnostics.recallTrace.maxFileBytes`
Env hard-off	`COMIS_DISABLE_RECALL_TRACE` disables the writer regardless of config
Sanitization	Always full-sanitized. Every payload runs through `sanitizeForPersistence` (bound → sanitize → redact) before it touches disk

It is OFF by default because it records per-recall ranking previews you only want captured during a focused debug session. Enable it for a session, reproduce the surprising recall, then turn it back off:

# config.yaml
diagnostics:
  recallTrace:
    enabled: true          # opt-in (default false) — redacted, bounded JSONL
    # filePath: ~/.comis/logs/recall-trace.jsonl   # optional override
    # maxFileBytes: 52428800                        # 50 MB cap (default)

There is no raw-content toggle. Unlike cacheTrace (which has includeMessages / includeSystem / includePrompt opt-ins), recallTrace has no such field. Every payload is full-sanitized before disk — query text, memory bodies, secrets, and absolute paths never reach the file. This is a deliberate security property, not a missing feature: there is no supported way to persist raw recall content. Do not expect a flag to disable sanitization — adding one would be a security regression.

Read the trace with comis memory recall-trace <session> — --format json for the full per-record ranking breakdown, the table view for correlation keys. The memory.recall_trace response also reports tracingEnabled (the recorder gate) and, on an empty result, a hint distinguishing “recorder disabled — set diagnostics.recallTrace.enabled: true” from “enabled but no traces matched this selector yet” — an empty result is never silent about why.

15.2 Recall events (direct-emit — not bridge-mapped)

Eight typed events report recall, curation, and context-engine activity on the event bus. Each is emitted directly at a single canonical site; they are not part of the §4 bridge mapping (the disjoint-set invariant in test/architecture/trajectory-event-types-known.test.ts explicitly allowlists them as not-trajectory-mapped, so the 55-entry bridge count is unaffected). Every payload is counts, booleans, and IDs only, never query text, memory bodies, or entity names (AGENTS.md §2.7). The per-recall ranking detail lives in the opt-in recall-trace artifact above, not on the bus.

Event	Fires	Key fields
`memory:recalled`	Once per recall, after fuse/rerank/score	`lanes`, `ftsCandidates`, `vectorCandidates`, `entityCandidates`, `finalCount`, `rerankerAvailable`, `durationMs`
`memory:reranked`	When a rerank stage ran (rerank opt-in)	`candidateCount`, `hitCount`, `rerankerAvailable`, `timedOut`, `fellBack`, `durationMs`
`memory:entities_linked`	After an entity resolve-and-link pass (memory-review job)	`entityCount`, `newEntities`, `durationMs`
`memory:consolidated`	After a consolidation run (consolidation job)	`clustersProcessed`, `observationsCreated`, `dedupHits`, `durationMs`
`context:evicted`	When the LCD/DAG assembler evicts the oldest out-of-tail history to fit the token budget	`agentId`, `sessionKey`, `evictedCount`, `evictedChars`, `categories`, `timestamp`
`context:dag_compacted`	After a DAG compaction pass updates the summary hierarchy (DAG mode)	`conversationId`, `agentId`, `sessionKey`, `leafSummariesCreated`, `condensedSummariesCreated`, `maxDepthReached`, `totalSummariesCreated`, `durationMs`, `timestamp`
`context:dag_expanded`	When an in-session expansion tool (`ctx_search`/`ctx_inspect`/`ctx_expand`) recovers compressed detail (DAG mode)	`conversationId`, `agentId`, `sessionKey`, `tool`, `recoveredCount`, `durationMs`, `timestamp`
`context:mode_switched`	On an actual `pipeline`⇄`dag` engine direction change	`from`, `to`, `fullImport`, `importedCount`, `durationMs`

memory:recalled.finalCount == 0 means recall returned nothing for that turn. memory:reranked.fellBack / .timedOut flag the graceful-degradation paths (see below). context:evicted and context:dag_compacted both fire in DAG mode (the default engine) — context:dag_compacted.durationMs reports the real wall-clock of each compaction pass; context:dag_expanded reports an in-session zoom into the DAG (counts/durationMs only — never message or summary content). context:mode_switched fires only on a real engine-mode direction change and reports the one-time reconciliation importedCount; the mid-conversation pipeline⇄dag reconciliation that emits it lands in a later phase, so that one event is currently dormant.

15.3 No-silent-degradation signals (`errorKind` + `hint`)

Recall degrades gracefully, and every degradation path is explicit — it emits a WARN carrying both errorKind and hint (the same contract as the rest of the daemon; see Logging → Error Classification). There is no silent fallback: if recall quietly drops to a cheaper path, there is a log line that says so and why.

Degradation	What happened	Surfaced as
Reranker unavailable	The cross-encoder reranker was not built/loaded (e.g. rerank off, or the GGUF failed to load) — recall uses the fusion-ranked order	WARN with `errorKind` + `hint`; `memory:reranked.fellBack: true`, `rerankerAvailable: false`
Reranker timeout	The reranker exceeded its budget (`rag.rerank.timeoutMs`, default 800 ms) — recall falls back to the fusion-ranked order	WARN with `errorKind` + `hint`; `memory:reranked.timedOut: true`, `fellBack: true`
Vector index unavailable	The vector lane could not run — recall falls back to FTS-only	WARN with `errorKind` + `hint`; `memory:recalled.vectorCandidates: 0`
Candidate missing embedding	A candidate had no embedding to score on the vector lane — it is handled, not dropped silently	WARN with `errorKind` + `hint`
Invalid extraction / consolidation	An extraction or consolidation step produced an unusable result — skipped with an explicit signal	WARN with `errorKind` + `hint`
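The reranker rows of the table can be read off the memory:reranked flags mechanically. The sketch below is one possible consumer, not part of Comis: the outcome labels, classifyRerank, and fallbackRate are hypothetical, and only the three boolean field names come from the event table.

```typescript
interface RerankedEvent {
  rerankerAvailable: boolean;
  timedOut: boolean;
  fellBack: boolean;
}

type RerankOutcome = "ok" | "reranker-timeout" | "reranker-unavailable" | "fallback-other";

// Map the event flags onto the degradation rows documented above.
function classifyRerank(e: RerankedEvent): RerankOutcome {
  if (e.timedOut) return "reranker-timeout";
  if (!e.rerankerAvailable) return "reranker-unavailable";
  return e.fellBack ? "fallback-other" : "ok";
}

// Share of rerank events that fell back to the fusion-ranked order.
function fallbackRate(events: RerankedEvent[]): number {
  if (events.length === 0) return 0;
  return events.filter((e) => e.fellBack).length / events.length;
}

const observed: RerankedEvent[] = [
  { rerankerAvailable: true, timedOut: false, fellBack: false },
  { rerankerAvailable: true, timedOut: true, fellBack: true },
  { rerankerAvailable: false, timedOut: false, fellBack: true },
  { rerankerAvailable: true, timedOut: false, fellBack: false },
];
const outcomes = observed.map(classifyRerank);
const rate = fallbackRate(observed);
```

Two of the four sample events fell back, one through a timeout and one because the reranker was unavailable, giving a fallback rate of 0.5.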

For aggregate health (rather than a single recall), use the comis memory stats recall-counter overlay: lane usage, rerank-fallback rate, consolidation throughput, and recall hit-rate. A climbing rerank-fallback rate, for instance, suggests the reranker is timing out or unavailable across the fleet, not just on one turn. The overlay is best-effort: a daemon that has not wired the counters still renders base stats.

# Aggregate recall health (lane usage, rerank-fallback rate, hit-rate)
node packages/cli/dist/cli.js memory stats

# Per-recall ranking detail for a session (enable recallTrace first)
node packages/cli/dist/cli.js memory recall-trace <session> --format json

See the CLI reference for the full comis memory surface and the Troubleshooting runbook for the diagnosis flow.

16. Existing Tracking Surfaces (pre-existing, preserved for completeness)

The following tracking surfaces predate the observability work above and continue to operate as before. They are referenced by the web dashboard at /web-dashboard/observability.

Token Tracking

Every agent execution counts input and output tokens per model and per agent. Token counts feed billing estimation and are exposed on the observability RPC contracts (packages/core/src/api-contracts/observability.ts).

Billing Estimation

Token counts are converted to estimated dollar costs using known provider pricing. Estimates are approximations — actual provider invoices may differ due to rounding, pricing changes, or promotional credits.

Cache Savings Tracking

The savedVsUncached metric shows net cache savings per execution. Embedding cache hits (persistent L2 cache) further reduce costs by avoiding API calls entirely. See Compaction for per-provider cache pricing.

Latency Recording

Every execution is timed: LLM call time, tool execution time, and total execution time. Breakdown helps identify whether latency is in the AI provider, tool execution, or routing.

Diagnostic Collection

System-level metrics collected every 30 seconds: event loop delay, heap/RSS memory, CPU utilization. Surfaced on the /health HTTP endpoint. See Monitoring.

Channel Activity Tracking

Message counts, delivery success rate, and error rates per channel. Useful for identifying which platforms are active and whether any channel is experiencing reliability issues.

Delivery Tracing

Follows each outgoing message through formatting, chunking, and platform confirmation. When a user reports “my message did not arrive,” delivery tracing shows exactly where it got stuck.

Context Engine Metrics

Per-turn context engine statistics: tokens loaded/masked, compaction events, budget utilization, cache hit/write/miss rates. High masking rates indicate active token savings; frequent compaction may indicate unusually long conversations.

Gemini Cache Infrastructure

Comis supports Gemini explicit CachedContent caching (Google AI Studio only) with a guaranteed 90% discount on cached input tokens. Lifecycle operations: Create, Reuse, Invalidate, Dedup, Refresh, Eviction, Session Dispose, Shutdown Dispose, Orphan Cleanup. See the config reference for the full geminiCache schema. Minimum cacheable thresholds: Gemini Flash (2.5, 3) → 1,024 tokens; Gemini Flash (2.0) → 2,048 tokens; Gemini Pro (2.5, 3) → 4,096 tokens.

Cache Observability Metrics

Breakpoint budget auditing (INFO before each Anthropic API call), placement results (DEBUG when placed > 0), cache fence warnings (WARN when no cache_control marker on conversations with ≥ 10 messages), per-execution cache hit rate (added to “Execution complete” log), session cache savings rate, and thinking-block cleaner cache fence impact. See the Web Dashboard Observability page for visual charts.

17. Related Pages

Logging

Log rotation policy, viewing logs, log levels per module.

Incident Bundle

Bundle export workflow and redaction policy.

Trace CLI

comis trace subcommands with copy-pasteable examples.

Monitoring

Health checks and threshold alerts.

Web Dashboard Observability

Visual dashboard for token, billing, and delivery data.

Agent Safety

Budget limits, token caps, and cost controls.
