Startup Issues
Bootstrap failed: Config file not found
- Check that the config file exists:
- If the file is missing, restore from the last-known-good backup:
- If using pm2, re-run setup to regenerate the ecosystem config with the correct path:
- If using systemd, verify COMIS_CONFIG_PATHS is set in /etc/comis/env
~/.comis/config.last-good.yaml as a recovery option.
Secrets bootstrap failed
- Check that SECRETS_MASTER_KEY is set in your environment or .env file:
- If the key was lost, remove the secrets database to start fresh:
- Restart the daemon — it will create a new secrets database
- Re-add any secrets that were stored (API keys configured via the secrets system)
Secret decryption failed
- If you changed the master key, restore the original key value
- If the original key is lost, remove the secrets database and re-create:
- Restart the daemon and re-add your secrets
SecretRef resolution failed
What happened: The config references a secret (using the $secret:name syntax) that does not exist in the secrets store.
How to fix:
- Check your config.yaml for $secret: references
- For each reference, make sure the secret is stored: either add it via the API or replace the reference with the actual value
- Restart the daemon
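To audit the references in bulk, a small script can list every $secret: name in the config text and compare it with the names you know are stored. This is a minimal sketch: the name grammar in the regular expression and the sample YAML are assumptions for illustration, not the daemon's actual parser.

```python
import re

# Assumption: a reference is "$secret:" followed by letters, digits,
# underscores, dots, or hyphens. The real grammar may differ.
SECRET_REF = re.compile(r"\$secret:([A-Za-z0-9_.-]+)")


def find_secret_refs(config_text: str) -> list[str]:
    """Return the distinct secret names referenced, in first-seen order."""
    seen: dict[str, None] = {}
    for match in SECRET_REF.finditer(config_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


def missing_secrets(config_text: str, stored_names: set[str]) -> list[str]:
    """Return referenced names that are not in the set of stored names."""
    return [name for name in find_secret_refs(config_text) if name not in stored_names]


# Hypothetical config fragment, used only to exercise the functions.
sample = """
channels:
  telegram:
    botToken: $secret:telegram_token
  discord:
    botToken: $secret:discord_token
"""

print(missing_secrets(sample, {"telegram_token"}))  # ['discord_token']
```

Any name the script reports is one to add to the secrets store or replace with a literal value before restarting.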
Permission corrections on data directory
~/.comis/ have permissions that are too open. It attempted to fix them automatically.
How to fix:
Set restrictive permissions manually:
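To see which paths are affected before changing anything, you can walk the data directory and flag entries that grant any access to group or other. This is a sketch under an assumption: it treats "too open" as any group/other permission bit, which may be stricter or looser than the check the daemon performs.

```python
import os
import stat


def too_open(path: str) -> bool:
    """True if group or other hold any permission bit on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def find_open_paths(root: str) -> list[str]:
    """Return every directory and file under root that is too open."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk visits each directory once, so checking dirpath covers all dirs.
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        flagged.extend(path for path in candidates if too_open(path))
    return sorted(flagged)
```

Calling find_open_paths(os.path.expanduser("~/.comis")) returns the paths to tighten; an empty list means nothing under the directory is readable by group or other.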
Port already in use (EADDRINUSE)
- Find what is using the port:
- If it is an old Comis process, stop it:
Or:
- If another application needs port 4766, change the Comis gateway port in config.yaml:
- Restart the daemon
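If you want to confirm whether the port is free before restarting, a bind probe gives a quick yes/no answer. This is a minimal sketch; probing 127.0.0.1 is an assumption, since the gateway may be configured to listen on a different address.

```python
import errno
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; EADDRINUSE means something else already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other errors (e.g. permission denied) are not "in use"
        return False
```

port_in_use(4766) returning False after you stop the old process means the daemon should be able to bind on the next start.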
Config validation failed: Unrecognized key(s) in object
'secrets' or a similar unrecognized-key name.)
What happened: Your config.yaml contains one or more configuration keys that were removed in v1.5. The security section uses strict schema validation: any unrecognized key causes an immediate boot failure. There is no silent fallback.
The three keys removed in v1.5 are:
- oauth.storage (under security:)
- security.secrets.enabled
- COMIS_DISABLE_ENCRYPTED_SECRETS (in ~/.comis/.env)
- Detect legacy keys in your config and env:
- Remove each legacy key from the file it appears in:
  - Delete the oauth.storage: line under security: in config.yaml
  - Delete the security.secrets.enabled: line from config.yaml
  - Delete the COMIS_DISABLE_ENCRYPTED_SECRETS= line from .env
- Add the replacement key security.storage (if not already present) to ~/.comis/config.yaml:
- Verify boot:
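The detection step can also be scripted against an already-parsed config. This is a sketch with two assumptions: oauth.storage is nested as security → oauth → storage, and the config has been loaded into a plain dict by whatever YAML loader you use (the loader is not part of this example).

```python
# Assumption: "oauth.storage (under security:)" means security.oauth.storage.
LEGACY_CONFIG_PATHS = [
    ("security", "oauth", "storage"),
    ("security", "secrets", "enabled"),
]
LEGACY_ENV_VAR = "COMIS_DISABLE_ENCRYPTED_SECRETS"


def has_path(tree: object, path: tuple[str, ...]) -> bool:
    """True if every key in path exists, walking nested dicts from the root."""
    node = tree
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


def legacy_config_keys(config: dict) -> list[str]:
    """Return the dotted names of removed keys still present in the config."""
    return [".".join(path) for path in LEGACY_CONFIG_PATHS if has_path(config, path)]


def legacy_env_lines(env_text: str) -> list[int]:
    """Return 1-based line numbers in a .env file that assign the removed variable."""
    hits = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out assignments are harmless
        if stripped.removeprefix("export ").lstrip().startswith(LEGACY_ENV_VAR + "="):
            hits.append(number)
    return hits
```

An empty result from both functions means none of the three removed keys remain; the replacement key security.storage is deliberately not flagged.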
Daemon fails to start: ERR_ACCESS_DENIED in startup logs
node --permission (the Node.js permission model), which categorically disables fd-based fs APIs (fsync, fchmod, fchown) at the process level. On pre-guard releases, calling fsyncSync on the data-dir lock file during boot threw ERR_ACCESS_DENIED before any log line was emitted, causing every start attempt to fail.
This behavior is Linux-only and occurs only when security.permission.enableNodePermissions: true is set or the systemd unit passes --permission in ExecStart.
Is this still a problem? Current Comis versions guard all fd-API call sites via isFsyncDisabledByPermissionModel, so the daemon will not crash on this. If you are seeing this error, you may be running an older release or a custom Node.js build that raises ERR_ACCESS_DENIED for a different reason.
How to fix:
- Check your Comis version:
Ensure you are on v1.4 or later (the guard was introduced in v1.4).
- Check if --permission is active:
If you did not intend to enable the permission model, remove security.permission.enableNodePermissions: true from config.yaml and remove --permission from your systemd unit.
- If you want to keep --permission: Upgrade to v1.4 or later. The daemon handles the fd-API disablement gracefully (credential writes are best-effort durability; this is expected behavior).
Channel Issues
Telegram enabled but no bot token configured
Set botToken in channels.telegram config or TELEGRAM_BOT_TOKEN env var
How to fix:
- Get a bot token from @BotFather on Telegram
- Add it to your config:
Or set the environment variable TELEGRAM_BOT_TOKEN
- Restart the daemon
Discord enabled but no bot token configured
Set botToken in channels.discord config or DISCORD_BOT_TOKEN env var
How to fix:
- Get a bot token from the Discord Developer Portal
- Add it to your config:
Or set the environment variable DISCORD_BOT_TOKEN
- Make sure you have enabled the Message Content Intent in the Discord Developer Portal under Bot settings
- Restart the daemon
Discord credential validation failed
Verify DISCORD_BOT_TOKEN is valid in Discord Developer Portal
How to fix:
- Go to the Discord Developer Portal
- Select your application and navigate to Bot
- Verify the token matches what is in your config
- Check that the Message Content Intent is enabled (under Privileged Gateway Intents)
- If the token was reset, copy the new token and update your config
- Restart the daemon
Slack credential validation failed
- Go to your Slack API dashboard
- Verify the Bot User OAuth Token matches your config
- Check that the Signing Secret is correct
- Make sure the bot has the required scopes (chat:write, channels:history, etc.)
- Restart the daemon
WhatsApp or Signal connection failed
- Verify your WhatsApp Business API credentials
- Check that the phone number ID and access token are correct
- Make sure the WhatsApp Business account is active
- Check that signal-cli is installed and running on your server
- Verify the Signal phone number is registered with signal-cli
- Test signal-cli independently:
- Restart the daemon
Runtime Issues
Gateway token auto-generated (ephemeral)
Set GATEWAY_TOKEN_<ID> in environment or secrets store for persistence
What happened: No gateway token was configured, so the daemon created a temporary one. This token will be different after every restart, meaning you will need to re-authenticate each time.
How to fix:
Set a permanent token in your config.yaml:
Shutdown timeout exceeded, forcing exit
- Check the logs before the timeout to see which component was still running
- If this happens frequently, increase the timeout:
- If a specific component is always slow, investigate that component (check channel connections, agent executions in progress)
Unhandled promise rejection (non-fatal)
- Check the error details in the log line — it includes the rejection reason
- If it mentions a specific tool or skill, check that tool’s configuration
- If it persists, this may be a bug — check for updates or report the issue
Configuration Issues
Config patch rate limit exceeded
- Wait a moment and retry the config change
- If you need to make many changes at once, batch them into a single config.apply call instead of multiple config.patch calls
Approvals config warning
- Review the approvals section in your config.yaml
- Make sure all referenced policy names match defined policies
- Restart the daemon
Daemon starts but agents don't respond
"Comis daemon started" but agents ignore incoming messages.
How to fix:
- Check agent routing: make sure each agent’s bindings match the channels you are messaging from:
- Check agent status: the agent might be suspended. Check in the web dashboard or logs.
- Check budget: the agent’s token or cost budget might be exhausted. See Agent Safety.
- Check model provider: verify the API key for the configured model provider is valid
- Check logs: set daemon.logLevels.agent: "debug" to see detailed agent processing logs
Resilience Issues
See Resilience Architecture for how these systems work together.
Prompt timeout -- agent falling back to alternate model
promptTimeoutMs) or the makespan ceiling (promptTimeoutMs × stallCeilingMultiplier, a streaming runaway). The agent automatically retried with a shorter timeout or fell back to an alternate model. comis explain <sessionKey> names which limit fired and the binding knob.
How to fix:
- Check the provider’s status page for ongoing outages
- If the model legitimately needs more silent time (e.g., slow local prefill before the first token), increase promptTimeoutMs:
- Review modelFailover.fallbackModels for the agent to ensure fallback models are configured
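To get a feel for how the two limits interact when you raise promptTimeoutMs, the arithmetic can be sketched as follows. The function name, the interpretation of promptTimeoutMs as a limit on silent time, and the example values are assumptions for illustration; comis explain remains the authoritative source for which limit actually fired.

```python
from typing import Optional


def fired_limit(
    silent_ms: int,
    total_ms: int,
    prompt_timeout_ms: int,
    stall_ceiling_multiplier: float,
) -> Optional[str]:
    """Name the limit a prompt would hit, or None if it is within both.

    silent_ms: time since the last streamed token (assumed meaning).
    total_ms:  time since the prompt started.
    """
    makespan_ceiling_ms = prompt_timeout_ms * stall_ceiling_multiplier
    if silent_ms >= prompt_timeout_ms:
        return "promptTimeoutMs"
    if total_ms >= makespan_ceiling_ms:
        return "makespan ceiling"
    return None


# Hypothetical values: a 60 s timeout with a multiplier of 3 gives a 180 s ceiling.
print(fired_limit(silent_ms=61_000, total_ms=61_000, prompt_timeout_ms=60_000, stall_ceiling_multiplier=3))
print(fired_limit(silent_ms=5_000, total_ms=200_000, prompt_timeout_ms=60_000, stall_ceiling_multiplier=3))
```

Because the ceiling is a multiple of promptTimeoutMs, raising the timeout raises the ceiling too, so a streaming runaway is also allowed to run longer.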
Sub-agent watchdog timeout
maxRunTimeoutMs. The watchdog force-failed the run and sent a failure notification to the channel.
How to fix:
- Check the sub-agent task: was it genuinely too complex, or did the LLM provider hang?
- If the task is legitimately long, increase maxRunTimeoutMs:
- Check for provider slowness via provider:degraded events in logs
Provider degraded -- LLM calls skipped
- Check the provider’s status page for ongoing outages
- Wait for automatic recovery: the system emits a provider:recovered event when the provider comes back
- Check the dead-letter queue for missed announcements (~/.comis/dead-letters.jsonl)
Ghost run detected and force-failed
maxRunTimeoutMs + 120 seconds). The ghost sweep force-failed it and delivered a failure notification to the channel.
How to fix:
- This usually indicates a process crash during execution
- Check for out-of-memory (OOM) or unhandled errors around the same time in logs
- The failure notification was already delivered to the channel — no action needed for the user
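When reading timestamps in the logs, it helps to know when the ghost sweep becomes eligible to act on a run. The sketch below only restates the maxRunTimeoutMs + 120 seconds threshold; the function names and the example value are hypothetical, and the sweep's actual scheduling is not modeled.

```python
GHOST_GRACE_MS = 120_000  # the "+ 120 seconds" from the threshold


def ghost_threshold_ms(max_run_timeout_ms: int) -> int:
    """Age at which a still-open run counts as a ghost."""
    return max_run_timeout_ms + GHOST_GRACE_MS


def is_ghost(started_at_ms: int, now_ms: int, max_run_timeout_ms: int) -> bool:
    """True once the run's age exceeds the ghost threshold."""
    return now_ms - started_at_ms > ghost_threshold_ms(max_run_timeout_ms)


# Hypothetical 10-minute maxRunTimeoutMs: the threshold is 12 minutes.
print(ghost_threshold_ms(600_000))  # 720000
```

A run that is force-failed at this later threshold, rather than by the watchdog at maxRunTimeoutMs, is consistent with the process not being alive to run the watchdog.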
Dead-letter entries accumulating
~/.comis/dead-letters.jsonl). Retry will be attempted up to 5 times.
How to fix:
- Check channel connectivity: is the bot still connected to the channel?
- Verify the bot token is still valid
- Entries auto-expire after 1 hour
- On provider recovery the dead-letter queue drains automatically
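To check whether the queue is actually draining, you can count the entries in the file at intervals. This is a minimal sketch that assumes only what the .jsonl extension implies, one JSON value per line; it makes no assumption about the fields inside each entry.

```python
import json
from pathlib import Path


def count_dead_letters(path: Path) -> tuple[int, int]:
    """Return (parseable_entries, unparseable_lines) for a JSON Lines file."""
    parsed = broken = 0
    if not path.exists():
        return (0, 0)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                parsed += 1
            except json.JSONDecodeError:
                broken += 1
    return (parsed, broken)
```

Running count_dead_letters(Path.home() / ".comis" / "dead-letters.jsonl") a few minutes apart shows whether the count is falling; a count that keeps climbing points back at channel connectivity or the bot token.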
Execution pipeline timeout -- canned error sent to user
- Check if the model was slow: look for execution:prompt_timeout events before this
- This is a backstop: if it fires regularly, investigate per-call prompt timeouts first
- Consider adding fallback models via modelFailover.fallbackModels in the agent config
Memory & Recall Issues
Recall is the path that selects which memories enter the prompt. When it surprises you, you do not have to read code; two questions cover almost every case. (The artifacts and events behind this runbook are documented in Observability → Memory & Recall Diagnostics.)
Why did recall pick X? (or miss the obvious memory)
- Enable the recall trace (opt-in, default off). Add to config.yaml, then reload:
- Reproduce the surprising recall (send the message that triggered it).
- Read the per-memory ranking breakdown with --format json:
The JSON record shows, for each candidate: which lanes matched (FTS / vector / entity), the fused order, the pre- and post-rerank scores, the recency / temporal / proof / trust score components, and the include/exclude reason for every memory. The table view (omit --format json) shows just the correlation keys (trace, session, finalCount, ts) for locating the right record first.
- Cross-reference the trust filter. If the memory you expected is excluded, the reason field will say why: most often a trust-level filter (includeTrustLevels) rather than a low score.
Set recallTrace.enabled back to false once you have the answer; it is a debug-session writer, not a steady-state one.
Why is recall slow?
- Is the cross-encoder reranker on? rag.rerank.enabled is opt-in (default false) precisely because the reranker adds latency. If you enabled it, that is the most likely source:
Disable it briefly to confirm the latency disappears. On timeout the reranker already falls back to the fusion-ranked order (memory:reranked.timedOut: true), so correctness is preserved, but the wait still happens up to timeoutMs.
- First-call reranker model load. The reranker is a local GGUF loaded on first use. The first recall after a restart (with rerank on) pays a one-time model-load cost; steady-state recalls do not. A single slow recall right after startup is expected.
- Vector index availability. If the vector lane cannot run, recall falls back to FTS-only and logs a WARN with errorKind + hint. Check for it:
A vec→FTS fallback shows as memory:recalled.vectorCandidates: 0.
comis memory stats recall-counter overlay surfaces the rerank-fallback rate, lane usage, consolidation throughput, and recall hit-rate; a climbing fallback rate means the reranker is timing out across the fleet, not just on one turn:
comis memory subcommand.
Credential Broker
The credential broker intercepts HTTPS requests from driven-CLI spawns, injects the real API key, and forwards the request upstream, all without the key ever entering the sandbox. When something goes wrong, the broker emits a broker:denied or broker:credential_unavailable event and returns a specific HTTP status code. Use the playbooks below to diagnose each failure mode.
Broker request fails with 407 (bad_token)
broker:denied with reason: "bad_token".
Diagnose:
- The CLI was not launched through the broker (HTTPS_PROXY env var not set to the broker socket).
- The session was torn down before the request completed.
- A token was replayed (single-use invariant: each token can only be consumed once).
Broker returns 403 (no_binding or path_policy)
Two broker:denied reasons produce 403:
- no_binding: the requested host has no matching hostRules entry in any binding. Add a binding or use a built-in preset (anthropic, finnhub).
- path_policy: the host is allow-listed but the path is blocked by the binding’s pathPolicy glob.
Resolution for no_binding: Add a hostRules entry for the target host in executor.broker.bindings, or use a matching preset.
Resolution for path_policy: Review the pathPolicy glob on the binding. Ensure the requested path matches the allowed pattern (e.g., /v1/*).
Broker returns 502 (credential_unavailable)
Broker returns 501 (WebSocket upgrade)
broker:denied reason ws_upgrade_not_supported. This is an intentional fail-closed guard: WebSocket credential injection is not supported in this release. The error is returned explicitly rather than hanging silently.
Resolution: Use HTTPS (not WebSocket) for API calls that require credential injection. WS credential injection is on the roadmap.
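When all you have is the HTTP status a CLI reported, the status codes in the playbooks above can be turned into a quick lookup. This is a sketch that only restates those headings as data; the dictionary and function names are hypothetical, and credential_unavailable is reported through its own event rather than as a broker:denied reason.

```python
# Failure label -> HTTP status, as listed in the broker playbooks.
BROKER_FAILURES = {
    "bad_token": 407,
    "no_binding": 403,
    "path_policy": 403,
    "credential_unavailable": 502,  # surfaced via broker:credential_unavailable
    "ws_upgrade_not_supported": 501,
}


def failures_for_status(status: int) -> list[str]:
    """Return the failure labels that map to the given HTTP status."""
    return sorted(label for label, code in BROKER_FAILURES.items() if code == status)


print(failures_for_status(403))  # ['no_binding', 'path_policy']
```

A 403 is the only ambiguous status, so for that one the reason field on the broker:denied event is what distinguishes a missing binding from a blocked path.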
Tracing a full broker request lifecycle
step (pipeline stage), traceId, and agentId. To trace a full request:
broker:session_opened → broker:request → broker:injected → (upstream response) → broker:session_closed
If broker:denied appears in place of broker:injected, the reason field identifies the failure mode (see the sections above).
Filter by pipeline stage:
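If you have the logs exported to a file, the same filtering can be done offline. This is a sketch under stated assumptions: the logs are JSON Lines, each record carries the step and traceId fields named above, and the event name lives in a field called event (that field name, and the step values in the usage note, are hypothetical).

```python
import json
from typing import Iterable, Iterator, Optional

EXPECTED_ORDER = [
    "broker:session_opened",
    "broker:request",
    "broker:injected",
    "broker:session_closed",
]


def parse_records(log_lines: Iterable[str]) -> Iterator[dict]:
    """Yield each line that parses as a JSON object; skip everything else."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def filter_records(log_lines: Iterable[str], **fields: str) -> list[dict]:
    """Keep records whose fields all equal the given values, e.g. traceId="abc"."""
    return [r for r in parse_records(log_lines) if all(r.get(k) == v for k, v in fields.items())]


def first_deviation(records: list[dict]) -> Optional[str]:
    """Return the first event that departs from the expected order, else None.

    A trace that simply stops early is not reported as a deviation.
    """
    for expected, record in zip(EXPECTED_ORDER, records):
        if record.get("event") != expected:
            return record.get("event")
    return None
```

For example, first_deviation(filter_records(lines, traceId="abc")) returns "broker:denied" when a denial took the place of broker:injected for that trace, and filter_records(lines, step="inject") narrows the output to a single pipeline stage.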
Upstream API returns 401 (key injected but rejected)
- Verify the secretRef key in the binding matches the stored secret name (secrets list).
- Check the stored secret value and update it with:
- Check the injection rule kind: the anthropic preset injects x-api-key (raw, not Bearer). For OpenAI-compatible APIs use a custom binding with format: bearer.
- Look for broker:injected in logs to confirm injection actually occurred (vs. a 403 or 502 before it reached upstream).
