step: discipline, log rotation, the memory & recall diagnostics (recall-trace artifact, recall events, degradation signals), and the bundle export entry point. If you are new to Comis observability, start here; then follow the cross-links for operator workflows.
Incident Bundle
One-command bundle export: comis trace export and /export-trajectory.
Trace CLI
Copy-pasteable examples for every comis trace subcommand.
1. Trace Propagation
Every inbound message receives a traceId at channel ingress: before the queue, before the agent, before delivery. The trace ID flows through the entire pipeline via AsyncLocalStorage (the trace-logger mixin in packages/daemon/src/observability/trace-logger.ts), so every Pino log line emitted during that turn carries the same traceId automatically.
Practical consequence: grep "traceId=<id>" ~/.comis/logs/daemon.log returns the complete causal chain for a single message — channel ingress, queue enqueue, agent execution, and outbound delivery — without any cross-file correlation.
The on-wire field is normalized.metadata.traceId; the helper getMessageTraceId() reads it. All 10 channel adapters wrap their dispatch loops in runWithContext, and the orchestrator’s adapter.onMessage handler applies a defense-in-depth second wrap so no adapter can silently skip propagation.
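As a minimal sketch of the propagation mechanism described above, the snippet below uses Node's AsyncLocalStorage to make a trace ID ambient for everything called inside one wrapper. The names runWithTrace, logRecord, and handleInbound are hypothetical stand-ins; the real runWithContext and trace-logger mixin may have different signatures and carry more fields.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context shape; the real trace-logger.ts may carry more fields.
interface TraceContext {
  traceId: string;
}

type TraceLogRecord = { msg: string; traceId: string | undefined };

const traceStore = new AsyncLocalStorage<TraceContext>();

// Illustrative stand-in for runWithContext: everything called inside fn,
// including awaited continuations, sees the same TraceContext.
function runWithTrace<T>(traceId: string, fn: () => T): T {
  return traceStore.run({ traceId }, fn);
}

// Illustrative logger mixin: stamps the ambient traceId on each record.
function logRecord(msg: string): TraceLogRecord {
  return { msg, traceId: traceStore.getStore()?.traceId };
}

// Two log calls separated by an async boundary share one traceId.
async function handleInbound(): Promise<TraceLogRecord[]> {
  const records = [logRecord("Message enqueued")];
  await Promise.resolve();
  records.push(logRecord("Execution started"));
  return records;
}
```

Calling runWithTrace("abc", handleInbound) yields records that all carry traceId "abc", while logRecord called outside any wrapper carries no traceId. That gap is what a second, defense-in-depth wrap is meant to close.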
Architecture invariant: test/architecture/trace-propagation.test.ts asserts that every adapter.onMessage(...) registration site is wrapped in runWithContext. This test is shrink-only, so violations are caught at CI time, not at incident time.
2. Trajectory Layer
Every agent session writes a *.trajectory.jsonl file co-located with the SDK transcript. This is the structured artifact the design goal refers to.
Key properties:
- Schema-versioned: traceSchema: "comis-trajectory", schemaVersion: 1. Additive changes (new optional fields, new event types) stay on version 1.
- Content-free at runtime: digests and structural fields only; no prompt text, response text, or tool-result bodies at recording time. Content is synthesized at bundle-export time from the SDK session JSONL.
- Multi-source discriminator: source: "runtime" | "transcript" | "export". Runtime events carry "runtime". The bundle exporter synthesizes "transcript" events from the session JSONL and "export" events for bundle-level summaries.
- Causality DAG: each event carries entryId (per-event UUID), monotonic seq, optional sourceSeq, and optional parentEntryId. The session branch is reconstructed leaf-to-root at export time.
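The leaf-to-root reconstruction can be sketched as follows. This is an illustration under assumptions, not the exporter's code: the TrajectoryEvent interface keeps only the causality fields named above, and reconstructBranch is a hypothetical helper name.

```typescript
// Only the causality fields named above; real events carry more.
interface TrajectoryEvent {
  entryId: string;
  seq: number;
  parentEntryId?: string;
}

// Walk parentEntryId links from a leaf back to the root, then reverse so the
// result reads root-to-leaf. A visited set guards against malformed cycles.
function reconstructBranch(events: readonly TrajectoryEvent[], leafId: string): TrajectoryEvent[] {
  const byId = new Map<string, TrajectoryEvent>();
  for (const event of events) byId.set(event.entryId, event);

  const branch: TrajectoryEvent[] = [];
  const seen = new Set<string>();
  let cursor: string | undefined = leafId;
  while (cursor !== undefined && !seen.has(cursor)) {
    const event: TrajectoryEvent | undefined = byId.get(cursor);
    if (event === undefined) break; // dangling parent: stop at the gap
    seen.add(cursor);
    branch.push(event);
    cursor = event.parentEntryId;
  }
  return branch.reverse();
}
```

Events on sibling branches (same parent, different leaf) are left out of the result, which is why the walk starts at the leaf and not at the root.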
3. Lifecycle Envelopes
Three sentinel events fire once per session to bracket the trajectory with context:
| Envelope | Fires | Carries | Once-per-session |
|---|---|---|---|
| trace.metadata | After session.started | harness, model, config, plugins, skills, prompting, redaction snapshot | Yes (latch in pi-event-bridge.ts) |
| trace.artifacts | Before session.ended | finalStatus, abort, timeout, token usage, prompt-cache hit rate, compaction count, lastToolError | Yes (in comis-session-manager.ts) |
| trace.truncated | When file fills | droppedEvents, droppedEventBytes, limitBytes | At most once per file (then writer halts) |
trace.metadata answers “what configuration ran this session?” trace.artifacts answers “how did it end and what did it cost?” trace.truncated answers “was any data lost?” — a non-zero droppedEvents value means the trajectory file was capped before the session ended.
The bundle exporter reads trace.metadata to populate metadata.json and trace.artifacts to populate artifacts.json.
4. Bridge Mapping (55 entries)
The trajectory bus-bridge maps 55 event types from the typed event bus to trajectory events (up from 18 previously, a ~3× coverage improvement).
Architecture invariant: test/architecture/trajectory-event-types-known.test.ts asserts the disjoint-set invariant (no overlap between bridge-mapped types and direct-emit types) and a bridge count ≥ 45. Current count: 55.
| Category | Bridge prefix | Event types |
|---|---|---|
| Queue | queue.* | enqueued, dequeued, overflow, coalesced |
| Delivery | delivery.* | retry, retry_exhausted, markdown_fallback |
| Execution | execution.* | aborted, budget_warning, prompt_timeout, output_escalated, replay_recovered |
| Security | security.*, sender.blocked | injection_detected, memory_tainted, warn, sender_blocked |
| MCP | mcp.* | disconnected, reconnecting, reconnect_failed, reconnected, tools_changed |
| Channel | channel.* | health_changed, lifecycle |
| Compaction | compaction.* | started, flush, recommended |
| Context | context.* | evicted, masked, reread, overflow, integrity, rehydrated |
| Approval | approval.* | requested, resolved |
| Dedup | dedup.duplicate_inbound | duplicate_inbound |
| Health | health.budget_exceeded | budget_exceeded |
| Session | (direct emit — carve-out) | session.started, session.ended, session.transcript.entry, trace.metadata, trace.artifacts, trace.truncated, trace.write_failures |
The queue.enqueued event, which carries sessionKey, mode, and queueDepth, is the trajectory signal that makes the 2026-05-24 duplicate-adapter bug visible from a single jq query.
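The disjoint-set invariant and the count floor described in this section amount to a small set check. The sketch below is an assumption about the shape of that check, not the contents of trajectory-event-types-known.test.ts; checkBridgeInvariant is a hypothetical name and the inputs are plain string lists.

```typescript
// Returns human-readable violations; an empty array means the invariant holds.
function checkBridgeInvariant(
  bridgeMapped: readonly string[],
  directEmit: readonly string[],
  minBridgeCount = 45,
): string[] {
  const bridge = new Set(bridgeMapped);
  const direct = new Set(directEmit);
  const violations: string[] = [];

  // Disjointness: no event type may be both bridge-mapped and direct-emit.
  for (const eventType of bridge) {
    if (direct.has(eventType)) violations.push(`overlap: ${eventType}`);
  }
  // Shrink-only floor: the number of distinct bridge entries may not drop below it.
  if (bridge.size < minBridgeCount) {
    violations.push(`bridge count ${bridge.size} is below floor ${minBridgeCount}`);
  }
  return violations;
}
```

Keeping the floor (45) below the current count (55) leaves room to retire a few mappings deliberately without loosening the disjointness rule.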
Credential Broker Events (broker:*)
The credential broker emits 7 typed events. No secret value appears in any event payload. All events carry sessionId and timestamp. Every log line emitted by the broker also carries step (pipeline stage), traceId, and agentId for structured correlation.
| Event | Key payload fields | Description |
|---|---|---|
| broker:session_opened | sessionId, agentId, host, presetId? | Driven-CLI session established; single-use proxy token issued |
| broker:session_closed | sessionId, agentId, durationMs, reason | Session torn down (teardown or error) |
| broker:request | sessionId, host, path, method | HTTP CONNECT request received from driven CLI |
| broker:injected | sessionId, host, ruleKind | Credential successfully injected into the request |
| broker:denied | sessionId, host, reason, statusCode | Request denied fail-closed (see reason codes below) |
| broker:credential_unavailable | sessionId, secretRef, agentId | SecretManager returned undefined; 502 returned, request never forwarded |
| broker:egress_blocked | sessionId, targetHostHash | Direct egress attempt blocked; targetHostHash is SHA-256 hex, so the plaintext host is never logged |

Reason codes for broker:denied:

| Reason | HTTP Status | Meaning |
|---|---|---|
| bad_token | 407 | Missing, forged, or consumed single-use proxy token |
| no_binding | 403 | Host not in any configured binding |
| path_policy | 403 | Host known but path denied by pathPolicy |
| malformed_request | 400 | HTTP request could not be parsed |
| body_too_large | 413 | Request body exceeds 10 MiB |
| ws_upgrade_not_supported | 501 | WebSocket upgrade (not supported in this release) |
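A consumer of broker:denied events might encode the reason table as a typed lookup so that an unexpected reason or a mismatched status stands out. This is a sketch for consumers, not the broker's own schema; DENY_STATUS, isDenyReason, and isConsistentDenial are hypothetical names.

```typescript
// Reason-to-status pairs copied from the table above.
const DENY_STATUS = {
  bad_token: 407,
  no_binding: 403,
  path_policy: 403,
  malformed_request: 400,
  body_too_large: 413,
  ws_upgrade_not_supported: 501,
} as const;

type DenyReason = keyof typeof DENY_STATUS;

function isDenyReason(value: string): value is DenyReason {
  return Object.prototype.hasOwnProperty.call(DENY_STATUS, value);
}

// True when a broker:denied payload pairs a known reason with its documented status.
function isConsistentDenial(event: { reason: string; statusCode: number }): boolean {
  return isDenyReason(event.reason) && DENY_STATUS[event.reason] === event.statusCode;
}
```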
The secret:accessed event (emitted during broker request handling alongside the broker:* events) logs each SecretManager resolution: fields secretName, agentId, outcome (success / not_found), timestamp.
Redaction-by-construction: hosts using query-param injection (setParam rule kind) never emit a full URL in logs or events. The broker:egress_blocked event carries only targetHostHash (SHA-256 hex) — never the plaintext blocked host.
Every failure log carries err, errorKind, and a non-empty hint.
Source: packages/infra/src/credential-broker/broker-events.ts (event schema) + packages/core/src/event-bus/events-infra.ts (type declarations)
5. Defense-in-depth Bounding
Every payload that enters the trajectory recorder passes through limitTrajectoryPayloadValue, which enforces hard size limits before writing. Every truncation leaves a structured sentinel so consumers know exactly what was lost.
| Bound | Limit | Sentinel on hit |
|---|---|---|
| Per-string chars | 32,768 | { truncated: true, reason: "trajectory-field-size-limit", originalChars, limitChars } |
| Per-array items | 64 | Items beyond cap dropped |
| Per-object keys | 64 | Keys beyond cap dropped |
| Recursion depth | 6 | Deeper levels dropped |
| Per-event bytes | 256 KB | Event dropped, droppedEvents++ |
| File soft cap | 10 MB | Final trace.truncated event, recorder closes |
| File hard cap | 50 MB | Writer halts, droppedEvents incremented |
| Active writers (process-wide) | 100 (LRU) | Oldest writer evicted |
| Circular reference | n/a | { truncated: true, reason: "trajectory-circular-reference" } |
The MAX_TRAJECTORY_WRITERS = 100 constant is exported from packages/observability/src/trajectory/runtime.ts:116.
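To make the per-value bounds in the table concrete, here is a sketch of a recursive limiter covering the string, array, key, depth, and circular-reference rows. It is not limitTrajectoryPayloadValue itself: boundValue and LIMITS are hypothetical names, the byte-level and file-level caps are out of scope, and how the real function counts depth is an assumption (here the root container is depth 0).

```typescript
const LIMITS = { stringChars: 32_768, arrayItems: 64, objectKeys: 64, depth: 6 };

function boundValue(value: unknown, depth = 0, ancestors = new Set<object>()): unknown {
  if (typeof value === "string") {
    // Oversized strings are replaced by a sentinel describing what was lost.
    return value.length > LIMITS.stringChars
      ? {
          truncated: true,
          reason: "trajectory-field-size-limit",
          originalChars: value.length,
          limitChars: LIMITS.stringChars,
        }
      : value;
  }
  if (typeof value !== "object" || value === null) return value;
  // A container that is its own ancestor is a cycle, not merely a shared reference.
  if (ancestors.has(value)) return { truncated: true, reason: "trajectory-circular-reference" };
  if (depth >= LIMITS.depth) return undefined; // deeper levels are dropped

  ancestors.add(value);
  let result: unknown;
  if (Array.isArray(value)) {
    result = value
      .slice(0, LIMITS.arrayItems)
      .map((item) => boundValue(item, depth + 1, ancestors));
  } else {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value).slice(0, LIMITS.objectKeys)) {
      out[key] = boundValue(child, depth + 1, ancestors);
    }
    result = out;
  }
  ancestors.delete(value);
  return result;
}
```

Tracking ancestors instead of every visited object means the same object referenced from two sibling fields is copied twice and not misreported as circular.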
6. Forensic INFO Promotions
Seven forensic events that fire at most O(1) per turn are promoted from DEBUG to INFO so production daemons at logLevel: info retain the signal needed to diagnose the 2026-05-24 duplicate-adapter incident class.
Before this change, Message enqueued was level: 20 (DEBUG). A production daemon at logLevel: info would have silently discarded it — making the duplicate-enqueue symptom invisible without enabling debug logging first.
| Event | Source module | Carries |
|---|---|---|
| Adapter registered | packages/orchestrator/src/channel-manager.ts | adapterId, channelType, handlersBefore, handlersAfter |
| Message enqueued | packages/orchestrator/src/queue/command-queue.ts | channelType, mode, queueDepth, messageId |
| Message dequeued | packages/orchestrator/src/queue/command-queue.ts | channelType, waitTimeMs |
| Execution started | packages/agent/src/executor/pi-executor/pi-executor.ts | agentId, sessionKey, traceId |
| Execution complete | packages/agent/src/executor/executor-post-execution.ts | finalStatus, durationMs, cacheHitRate, sessionCacheSavingsRate |
| Memory store complete | packages/memory/src/sqlite-memory-adapter.ts | durationMs, op, hasEmbedding, memoryType |
| Outbound message | 8 channel adapters | channelType, messageId, deliveryStatus |
Per-turn INFO count grows from ~5 to ~10 lines (bounded). The architecture test test/architecture/forensic-events-info-level.test.ts (15 tests) enforces shrink-only: no forensic event may regress to DEBUG without failing CI.
7. Duplicate-Inbound Detector
The dedup detector runs synchronously in the inbound hot path to catch the same messageId being processed more than once within a short window, which is the defining symptom of the 2026-05-24 duplicate-adapter incident class.
Implementation: a bounded LRU at 1024 entries, 10 s window (windowMs = 10_000). On a duplicate messageId within the window, it:
- Emits dedup:duplicate_inbound { messageId, channelType, chatId, firstSeenAt, duplicateAt, deltaMs, source } on the typed event bus.
- Logs a WARN with errorKind: "internal" and hint: "Same messageId processed twice; check channel adapter handler list and queue mode".
- Does not suppress processing: the duplicate continues through the pipeline so the full symptom is visible in the trajectory.
The dedup:duplicate_inbound event is bridge entry #54, mapped to the dedup.duplicate_inbound trajectory event type.
At 300 msg/s (10× expected production load), overhead is sub-microsecond per check (measured in dedup-detector.perf.test.ts).
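A bounded, windowed detector of this kind can be sketched with a Map used as an LRU. The class below is illustrative only: DuplicateInboundDetector and DuplicateHit are hypothetical names, the caller supplies the clock, and whether the real window comparison is strict or inclusive is an assumption. The sketch only reports the hit; emitting the event and the WARN is left to the caller, and processing is never suppressed.

```typescript
interface DuplicateHit {
  messageId: string;
  firstSeenAt: number;
  duplicateAt: number;
  deltaMs: number;
}

class DuplicateInboundDetector {
  // messageId -> firstSeenAt; Map iteration order doubles as recency order.
  private readonly seen = new Map<string, number>();
  private readonly maxEntries: number;
  private readonly windowMs: number;

  constructor(maxEntries = 1024, windowMs = 10_000) {
    this.maxEntries = maxEntries;
    this.windowMs = windowMs;
  }

  // Returns a hit when messageId was already seen inside the window, else undefined.
  check(messageId: string, now: number): DuplicateHit | undefined {
    const firstSeenAt = this.seen.get(messageId);
    if (firstSeenAt !== undefined && now - firstSeenAt < this.windowMs) {
      // Refresh recency but keep the original timestamp so deltaMs stays meaningful.
      this.seen.delete(messageId);
      this.seen.set(messageId, firstSeenAt);
      return { messageId, firstSeenAt, duplicateAt: now, deltaMs: now - firstSeenAt };
    }
    // New or expired: insert at the most-recent end, evicting the oldest entry if full.
    this.seen.delete(messageId);
    this.seen.set(messageId, now);
    if (this.seen.size > this.maxEntries) {
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) this.seen.delete(oldest);
    }
    return undefined;
  }
}
```

Each check is a handful of Map operations, so the cost per message is constant regardless of how full the table is.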
Operator query:
8. Boot Invariants
On every daemon startup, before traffic is accepted, a daemon:startup_invariants INFO record fires with 8 fields describing the wiring state:
| Field | What it measures |
|---|---|
| adaptersByChannelType | Count of adapters per channel type |
| handlersPerAdapter | Count of onMessage handlers per adapter |
| pluginRegistryCount | Total registered plugins |
| channelRegistryCount | Total registered channels |
| depSlotConsistency | Whether the deprecated adaptersList slot is absent |
| agentCount | Total configured agents |
| toolCatalogSize | Total available tools |
| mcpServerCount | Total MCP servers |
When handlersPerAdapter[<type>] > 1, a WARN fires immediately:
This check runs after saveLastKnownGood and before the daemon begins accepting messages. In the 2026-05-24 incident, the duplicate adapter had been in production for days; the boot invariant would have surfaced it at the next restart with zero traffic impact.
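The handler-count check itself is small. The sketch below assumes handlersPerAdapter is a plain map from channel type to handler count, which matches the handlersPerAdapter[<type>] notation above but is not confirmed by it; findOverRegisteredAdapters is a hypothetical name.

```typescript
// Returns the channel types whose adapter has more than one onMessage handler,
// sorted so the resulting WARN is stable across restarts.
function findOverRegisteredAdapters(handlersPerAdapter: Record<string, number>): string[] {
  return Object.entries(handlersPerAdapter)
    .filter(([, count]) => count > 1)
    .map(([channelType]) => channelType)
    .sort();
}
```

An empty result means the wiring is healthy; any entry is a candidate for the duplicate-adapter class of bug and is worth a WARN before traffic is accepted.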
9. step: Discipline
Every known pipeline stage emits at least one log line carrying step: "<stage>", enforced by test/architecture/pipeline-step-coverage.test.ts.
Why it matters: before this change, only 3% of daemon.log lines (15 of 480 sampled) carried a step: field. Filtering by pipeline stage required reading every log line. The target is ≥ 50% coverage.
The authoritative stage token map lives in test/architecture/pipeline-step-coverage.test.ts — consult it when adding new log call sites.
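Assuming daemon.log lines are JSON records (Pino's default output), filtering by stage and measuring step coverage could look like the sketch below. The helper names parseLogLines, filterByStep, and stepCoverage are hypothetical, and the coverage definition (share of parseable lines with a string step field) is one reading of the 3% and 50% figures above.

```typescript
type ParsedLogLine = Record<string, unknown>;

function parseLogLines(jsonLines: readonly string[]): ParsedLogLine[] {
  const records: ParsedLogLine[] = [];
  for (const line of jsonLines) {
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        records.push(parsed as ParsedLogLine);
      }
    } catch {
      // Skip lines that are not valid JSON (for example, a partially written tail).
    }
  }
  return records;
}

// Keep only the records emitted by one pipeline stage.
function filterByStep(jsonLines: readonly string[], step: string): ParsedLogLine[] {
  return parseLogLines(jsonLines).filter((record) => record["step"] === step);
}

// Fraction of parseable records that carry a step field (0 when there are none).
function stepCoverage(jsonLines: readonly string[]): number {
  const records = parseLogLines(jsonLines);
  if (records.length === 0) return 0;
  return records.filter((record) => typeof record["step"] === "string").length / records.length;
}
```

On the sample cited above, 15 tagged lines out of 480 gives 15 / 480 = 0.03125, which rounds to the quoted 3%.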
Operator filter pattern:
Stage tokens include "channels-inbound", "queue-dequeue", "agent-execute", and "delivery-outbound".
The architecture test is shrink-only: adding a new stage without a step:-tagged emit fails CI.
10. Session Index
The session index at ~/.comis/logs/session-index.YYYY-MM-DD.jsonl is an append-only, lightweight index of session lifecycle events and the primary scan target for comis trace --since 10m --where error.
Three event kinds:
schemaVersion: 1, so older readers ignore them):
- source: "runtime" for production rows (shown above), "test" for rows written by a VITEST/NODE_ENV=test process, "bench" for harness-injected rows.
- synthetic: true on test/bench/harness rows so they are self-identifying; absent on production rows.

comis trace / obs.* exclude synthetic: true rows by default (a row counts as synthetic only when synthetic === true).
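The strict synthetic === true rule matters because a missing, false, or mistyped value must still count as a production row. A reader-side filter under that rule might look like this; excludeSynthetic is a hypothetical helper, not the comis trace implementation.

```typescript
// Drops only rows explicitly stamped synthetic: true. Rows where the field is
// absent, false, or not a boolean are kept, matching the strict-equality rule.
function excludeSynthetic<T extends { synthetic?: unknown }>(rows: readonly T[]): T[] {
  return rows.filter((row) => row.synthetic !== true);
}
```

Filtering on strict equality keeps older rows, which predate the field, visible by default.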
appendSessionIndexEntry throws if a test process
(VITEST=true or NODE_ENV=test) ever targets the real ~/.comis — a test run
can never silently pollute production telemetry. Tests must write to a tmp dir,
where their rows are stamped source: "test", synthetic: true.
Date-rolled files honor the same observability.logRotation policy as daemon.log.
For full query examples see the Trace CLI reference.
11. Bundle Export
When a user reports an issue, the operator workflow is:
- comis trace --message-id <uuid> or comis trace --chat <chatId> --tail to locate the session.
- comis trace export <sessionId> to produce a self-contained bundle directory.
- Share the bundle with the diagnosing engineer.
The bundle can also be exported with the /export-trajectory slash command in a direct message (or in a group channel, where the result is DM’d to the owner).
Full bundle workflow, directory shape (8 files), and redaction policy (platform-aware-v1) are documented in:
Incident Bundle
Bundle export workflow, 8-file directory shape, and redaction policy.
12. Log Rotation
All 5 observability streams honor the observability.logRotation config block:
| Stream | File pattern |
|---|---|
| Daemon log | ~/.comis/logs/daemon.log |
| Cache trace | ~/.comis/logs/cache-trace.jsonl |
| Config audit | ~/.comis/logs/config-audit.jsonl |
| Session index | ~/.comis/logs/session-index.YYYY-MM-DD.jsonl |
| Trajectory | ~/.comis/workspace/sessions/**/*.trajectory.jsonl |

| Setting | Default | Notes |
|---|---|---|
| maxSizeBytes | 52428800 (50 MB) | Per-file size trigger |
| maxFiles | 5 | Rotated copies to keep per stream |
| maxAgeDays | 30 | Delete rotated files older than this |
| compressAged | true | gzip rotated files |
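The defaults above can be written out as a typed policy object, which also makes the byte arithmetic checkable (50 × 1024 × 1024 = 52428800). The two helper functions are one plausible reading of the Notes column and are assumptions, not the daemon's rotation code: needsRotation treats the size trigger as inclusive, and rotatedCopiesToDelete assumes rotated copies are listed newest first.

```typescript
interface LogRotationPolicy {
  maxSizeBytes: number;
  maxFiles: number;
  maxAgeDays: number;
  compressAged: boolean;
}

const DEFAULT_LOG_ROTATION: LogRotationPolicy = {
  maxSizeBytes: 50 * 1024 * 1024, // 52428800
  maxFiles: 5,
  maxAgeDays: 30,
  compressAged: true,
};

// Assumed size trigger: rotate once the active file reaches maxSizeBytes.
function needsRotation(currentSizeBytes: number, policy: LogRotationPolicy = DEFAULT_LOG_ROTATION): boolean {
  return currentSizeBytes >= policy.maxSizeBytes;
}

// Assumed retention: rotatedAgesDays lists rotated copies newest first; a copy is
// deleted when it falls outside the maxFiles newest or is older than maxAgeDays.
function rotatedCopiesToDelete(
  rotatedAgesDays: readonly number[],
  policy: LogRotationPolicy = DEFAULT_LOG_ROTATION,
): number[] {
  const doomed: number[] = [];
  rotatedAgesDays.forEach((ageDays, index) => {
    if (index >= policy.maxFiles || ageDays > policy.maxAgeDays) doomed.push(index);
  });
  return doomed;
}
```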
Logging
Log rotation policy, viewing logs, configuring log levels per module.
13. Alert Budget
The rate-aggregator subscribes to health/safety events on the typed event bus. When a per-errorKind threshold is exceeded in a sliding window, it emits health:budget_exceeded { kind, count, windowMs } exactly once until the window slides past — then re-arms (once-per-window latch).
health:budget_exceeded is bridge entry #55, mapped to the health.budget_exceeded trajectory event.
Supported errorKind values: config, auth, timeout, internal, network, quota, resource, policy, agent, external.
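The once-per-window latch can be sketched as a per-kind sliding window with a re-arm rule. This is an illustration under assumptions: AlertBudget and BudgetExceeded are hypothetical names, a single threshold is shared by all kinds (the real aggregator is described as per-errorKind), the caller supplies the clock, and re-arming happens lazily on the next recorded event.

```typescript
interface BudgetExceeded {
  kind: string;
  count: number;
  windowMs: number;
}

class AlertBudget {
  private readonly hits = new Map<string, number[]>();
  private readonly latched = new Set<string>();
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one event of the given kind at time `now` (ms). Returns a payload
  // the first time the count inside the window exceeds the threshold.
  record(kind: string, now: number): BudgetExceeded | undefined {
    const recent = (this.hits.get(kind) ?? []).filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.hits.set(kind, recent);

    if (recent.length <= this.threshold) {
      this.latched.delete(kind); // the window slid past the burst: re-arm
      return undefined;
    }
    if (this.latched.has(kind)) return undefined; // already reported for this burst
    this.latched.add(kind);
    return { kind, count: recent.length, windowMs: this.windowMs };
  }
}
```

With a threshold of 2 and a 1000 ms window, the third event inside the window produces one payload, later events in the same burst stay silent, and a fresh burst after the window has passed produces another.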
Configurable via observability.alertBudget in your YAML config. Defaults are conservative; most production daemons will not see this fire under normal operation. Tune thresholds downward on high-traffic deployments where per-event noise is expected.
14. Worked Example: 2026-05-24 Incident Replay
This section shows how the four observability signals surface the 2026-05-24 duplicate-adapter bug in under five minutes. Setup: fresh daemon, setup-channels-runtime.ts:217 regression re-introduced (the same Telegram adapter passed into both deps.adapters and deps.channelRegistry). Boot the daemon, send one Telegram message.
Signal 1 — Boot WARN
Signal 2 — Two queue.enqueued trajectory events
queueDepth: 1 and queueDepth: 2, same sessionKey, same timestamp window — the duplicate processing visible in the structured artifact.
Signal 3 — dedup:duplicate_inbound
deltaMs: 1 matches the original incident’s two Message enqueued lines at 06:01:47.385 and 06:01:47.386.
Signal 4 — comis trace
15. Memory & Recall Diagnostics
Recall is the path that decides which memories enter the prompt. When recall surprises you (the wrong memory injected, or recall feeling slow), these are the signals that explain it. The runbook for acting on them lives in Troubleshooting → “Why did recall pick X / why is recall slow?”; this section documents the artifacts and events those workflows read.
15.1 The recall-trace artifact (opt-in)
The recall trace is a per-recall JSONL artifact that records the ranking preview for each recall: which lanes matched, the fused order, the pre/post-rerank scores, the recency/temporal/proof/trust score components, and the include/exclude reason for every candidate. It is the sibling of the cache trace (same writer family, same rotation policy) but opt-in.
| Property | Value |
|---|---|
| Config key | diagnostics.recallTrace.enabled |
| Default | false — opt-in (the only diagnostics writer that ships off; trajectory and cacheTrace default true) |
| Default path | ~/.comis/logs/recall-trace.jsonl (override via diagnostics.recallTrace.filePath; ~ expansion supported) |
| File cap | 52428800 (50 MB), via diagnostics.recallTrace.maxFileBytes |
| Env hard-off | COMIS_DISABLE_RECALL_TRACE disables the writer regardless of config |
| Sanitization | Always full-sanitized. Every payload runs through sanitizeForPersistence (bound → sanitize → redact) before it touches disk |
comis memory recall-trace <session> — --format json for the full per-record ranking breakdown, the table view for correlation keys. The memory.recall_trace response also reports tracingEnabled (the recorder gate) and, on an empty result, a hint distinguishing “recorder disabled — set diagnostics.recallTrace.enabled: true” from “enabled but no traces matched this selector yet” — an empty result is never silent about why.
15.2 Recall events (direct-emit — not bridge-mapped)
The typed events below report recall, curation, and context activity on the event bus. They are emitted directly at a single canonical site each; they are not part of the §4 bridge mapping (the disjoint-set invariant in test/architecture/trajectory-event-types-known.test.ts explicitly allowlists them as not-trajectory-mapped, so the 55-entry bridge count is unaffected).
Every payload is counts, booleans, and IDs only — never query text, memory bodies, or entity names (AGENTS.md §2.7). The per-recall ranking detail lives in the opt-in recall-trace artifact above, not on the bus.
| Event | Fires | Key fields |
|---|---|---|
| memory:recalled | Once per recall, after fuse/rerank/score | lanes, ftsCandidates, vectorCandidates, entityCandidates, finalCount, rerankerAvailable, durationMs |
| memory:reranked | When a rerank stage ran (rerank opt-in) | candidateCount, hitCount, rerankerAvailable, timedOut, fellBack, durationMs |
| memory:entities_linked | After an entity resolve-and-link pass (memory-review job) | entityCount, newEntities, durationMs |
| memory:consolidated | After a consolidation run (consolidation job) | clustersProcessed, observationsCreated, dedupHits, durationMs |
| context:evicted | When the LCD/DAG assembler evicts the oldest out-of-tail history to fit the token budget | agentId, sessionKey, evictedCount, evictedChars, categories, timestamp |
| context:dag_compacted | After a DAG compaction pass updates the summary hierarchy (DAG mode) | conversationId, agentId, sessionKey, leafSummariesCreated, condensedSummariesCreated, maxDepthReached, totalSummariesCreated, durationMs, timestamp |
| context:dag_expanded | When an in-session expansion tool (ctx_search/ctx_inspect/ctx_expand) recovers compressed detail (DAG mode) | conversationId, agentId, sessionKey, tool, recoveredCount, durationMs, timestamp |
| context:mode_switched | On an actual pipeline⇄dag engine direction change | from, to, fullImport, importedCount, durationMs |
memory:recalled.finalCount == 0 means recall returned nothing for that turn. memory:reranked.fellBack / .timedOut flag the graceful-degradation paths (see below). context:evicted and context:dag_compacted both fire in DAG mode (the default engine) — context:dag_compacted.durationMs reports the real wall-clock of each compaction pass; context:dag_expanded reports an in-session zoom into the DAG (counts/durationMs only — never message or summary content). context:mode_switched fires only on a real engine-mode direction change and reports the one-time reconciliation importedCount; the mid-conversation pipeline⇄dag reconciliation that emits it lands in a later phase, so that one event is currently dormant.
15.3 No-silent-degradation signals (errorKind + hint)
Recall degrades gracefully, and every degradation path is explicit — it emits a WARN carrying both errorKind and hint (the same contract as the rest of the daemon; see Logging → Error Classification). There is no silent fallback: if recall quietly drops to a cheaper path, there is a log line that says so and why.
| Degradation | What happened | Surfaced as |
|---|---|---|
| Reranker unavailable | The cross-encoder reranker was not built/loaded (e.g. rerank off, or the GGUF failed to load) — recall uses the fusion-ranked order | WARN with errorKind + hint; memory:reranked.fellBack: true, rerankerAvailable: false |
| Reranker timeout | The reranker exceeded its budget (rag.rerank.timeoutMs, default 800 ms) — recall falls back to the fusion-ranked order | WARN with errorKind + hint; memory:reranked.timedOut: true, fellBack: true |
| Vector index unavailable | The vector lane could not run — recall falls back to FTS-only | WARN with errorKind + hint; memory:recalled.vectorCandidates: 0 |
| Candidate missing embedding | A candidate had no embedding to score on the vector lane — it is handled, not dropped silently | WARN with errorKind + hint |
| Invalid extraction / consolidation | An extraction or consolidation step produced an unusable result — skipped with an explicit signal | WARN with errorKind + hint |
The comis memory stats recall-counter overlay summarizes these signals: lane usage, rerank-fallback rate, consolidation throughput, and recall hit-rate. A climbing rerank-fallback rate, for instance, says the reranker is timing out across the fleet, not just on one turn. The overlay is best-effort: a daemon that has not wired the counters still renders base stats.
See the comis memory surface and the Troubleshooting runbook for the diagnosis flow.
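A rerank-fallback rate of the kind mentioned above can be derived from memory:reranked payloads. The sketch below is an assumption about how such a rate might be computed, not the overlay's implementation; RerankedSample keeps only two of the documented fields and summarizeRerank is a hypothetical name.

```typescript
// Subset of the memory:reranked payload used for the rates.
interface RerankedSample {
  fellBack: boolean;
  timedOut: boolean;
}

interface RerankSummary {
  total: number;
  fallbackRate: number; // share of rerank stages that fell back to fusion order
  timeoutRate: number; // share of rerank stages that exceeded the budget
}

function summarizeRerank(samples: readonly RerankedSample[]): RerankSummary {
  const total = samples.length;
  if (total === 0) return { total: 0, fallbackRate: 0, timeoutRate: 0 };
  const fellBack = samples.filter((sample) => sample.fellBack).length;
  const timedOut = samples.filter((sample) => sample.timedOut).length;
  return { total, fallbackRate: fellBack / total, timeoutRate: timedOut / total };
}
```

Comparing the two rates separates the degradation paths: a fallback rate well above the timeout rate points at an unavailable reranker, while two similar rates point at the timeout budget.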
16. Existing Tracking Surfaces (pre-existing, preserved for completeness)
The following tracking surfaces predate the observability work above and continue to operate as before. They are referenced by the web dashboard at /web-dashboard/observability.
Token Tracking
Every agent execution counts input and output tokens per model and per agent. Token counts feed billing estimation and are exposed on the observability RPC contracts (packages/core/src/api-contracts/observability.ts).
Billing Estimation
Token counts are converted to estimated dollar costs using known provider pricing. Estimates are approximations; actual provider invoices may differ due to rounding, pricing changes, or promotional credits.
Cache Savings Tracking
The savedVsUncached metric shows net cache savings per execution. Embedding cache hits (persistent L2 cache) further reduce costs by avoiding API calls entirely. See Compaction for per-provider cache pricing.
Latency Recording
Every execution is timed: LLM call time, tool execution time, and total execution time. The breakdown helps identify whether latency is in the AI provider, tool execution, or routing.
Diagnostic Collection
System-level metrics are collected every 30 seconds: event loop delay, heap/RSS memory, CPU utilization. They are surfaced on the /health HTTP endpoint. See Monitoring.
Channel Activity Tracking
Message counts, delivery success rate, and error rates per channel. Useful for identifying which platforms are active and whether any channel is experiencing reliability issues.
Delivery Tracing
Follows each outgoing message through formatting, chunking, and platform confirmation. When a user reports “my message did not arrive,” delivery tracing shows exactly where it got stuck.
Context Engine Metrics
Per-turn context engine statistics: tokens loaded/masked, compaction events, budget utilization, cache hit/write/miss rates. High masking rates indicate active token savings; frequent compaction may indicate unusually long conversations.
Gemini Cache Infrastructure
Comis supports Gemini explicit CachedContent caching (Google AI Studio only) with a guaranteed 90% discount on cached input tokens. Lifecycle operations: Create, Reuse, Invalidate, Dedup, Refresh, Eviction, Session Dispose, Shutdown Dispose, Orphan Cleanup. See the config reference for the full geminiCache schema.
Minimum cacheable thresholds: Gemini Flash (2.5, 3) → 1,024 tokens; Gemini Flash (2.0) → 2,048 tokens; Gemini Pro (2.5, 3) → 4,096 tokens.
Cache Observability Metrics
Breakpoint budget auditing (INFO before each Anthropic API call), placement results (DEBUG when placed > 0), cache fence warnings (WARN when there is no cache_control marker on conversations with ≥ 10 messages), per-execution cache hit rate (added to “Execution complete” log), session cache savings rate, and thinking-block cleaner cache fence impact. See the Web Dashboard Observability page for visual charts.
17. Related Pages
Logging
Log rotation policy, viewing logs, log levels per module.
Incident Bundle
Bundle export workflow and redaction policy.
Trace CLI
comis trace subcommands with copy-pasteable examples.
Monitoring
Health checks and threshold alerts.
Web Dashboard Observability
Visual dashboard for token, billing, and delivery data.
Agent Safety
Budget limits, token caps, and cost controls.
