Troubleshooting

This page covers the most common issues you might encounter with Comis. Each entry includes the exact error message you will see, what caused it, and step-by-step instructions to fix it.

Use your browser’s find (Ctrl+F / Cmd+F) to search for the error message you are seeing.

Startup Issues

Bootstrap failed: Config file not found

Error message:

FATAL: Bootstrap failed: Config file not found

What happened: The daemon cannot find your configuration file at the expected path.

How to fix:

Check that the config file exists:
```
ls -la ~/.comis/config.yaml
```
If the file is missing, restore from the last-known-good backup:
```
cp ~/.comis/config.last-good.yaml ~/.comis/config.yaml
```
If using pm2, re-run setup to regenerate the ecosystem config with the correct path:
```
node packages/cli/dist/cli.js pm2 setup
```
If using systemd, verify COMIS_CONFIG_PATHS is set in /etc/comis/env

The daemon automatically saves a last-known-good config on every successful startup. Look for ~/.comis/config.last-good.yaml as a recovery option.
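The check-and-restore steps above can be combined into one guarded helper. This is a minimal sketch, not a Comis command: the function name restore_config_if_missing is hypothetical, and it assumes the default ~/.comis layout described above.

```shell
# Hypothetical helper (not part of the Comis CLI): restore config.yaml from
# the last-known-good copy only when the live config is missing.
# Usage: restore_config_if_missing [data_dir]   (defaults to ~/.comis)
restore_config_if_missing() {
  data_dir="${1:-$HOME/.comis}"
  live_config="$data_dir/config.yaml"
  backup_config="$data_dir/config.last-good.yaml"

  if [ -f "$live_config" ]; then
    echo "config present: $live_config"
    return 0
  fi
  if [ ! -f "$backup_config" ]; then
    echo "no backup found: $backup_config" >&2
    return 1
  fi
  # -p preserves mode and timestamps so restrictive permissions carry over.
  cp -p "$backup_config" "$live_config" && echo "restored $live_config from backup"
}
```

Because the helper never overwrites an existing config.yaml, it is safe to run before every manual restart.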

Secrets bootstrap failed

Error message:

Secrets bootstrap failed: ...

What happened: The master encryption key used to protect stored secrets is missing or invalid.

How to fix:

Check that SECRETS_MASTER_KEY is set in your environment or .env file:
```
grep SECRETS_MASTER_KEY ~/.comis/.env
```
If the key was lost, remove the secrets database to start fresh:
```
rm ~/.comis/secrets.db
```
Restart the daemon — it will create a new secrets database
Re-add any secrets that were stored (API keys configured via the secrets system)
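The grep in the first step prints the key value to the terminal. As an alternative, here is a minimal sketch that reports only where the key is defined. The function name master_key_source is hypothetical, and the accepted .env line format (KEY=value, optionally prefixed with export) is an assumption.

```shell
# Hypothetical helper: report whether SECRETS_MASTER_KEY is defined, without
# echoing its value. Checks the current shell environment first, then an env file.
# Usage: master_key_source [env_file]   (defaults to ~/.comis/.env)
master_key_source() {
  env_file="${1:-$HOME/.comis/.env}"
  if [ -n "${SECRETS_MASTER_KEY:-}" ]; then
    echo "environment"
  elif [ -f "$env_file" ] && grep -Eq '^(export[[:space:]]+)?SECRETS_MASTER_KEY=.+' "$env_file"; then
    echo "$env_file"
  else
    echo "missing"
    return 1
  fi
}
```

A result of "missing" corresponds to the lost-key case above; any other result means the key is defined but may still be invalid.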

Secret decryption failed

Error message:

Secret decryption failed: ...

What happened: The secrets database exists but cannot be decrypted, usually because the master key has changed since the secrets were stored.

How to fix:

If you changed the master key, restore the original key value
If the original key is lost, remove the secrets database and re-create:
```
rm ~/.comis/secrets.db
```
Restart the daemon and re-add your secrets

SecretRef resolution failed

Error message:

SecretRef resolution failed: ...

What happened: Your configuration references a secret (using $secret:name syntax) that does not exist in the secrets store.

How to fix:

Check your config.yaml for $secret: references
For each reference, make sure the secret is stored — either add it via the API or replace the reference with the actual value
Restart the daemon
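To enumerate the references mentioned in the first step, a small helper can extract every name that follows $secret: in the config. This is a sketch under assumptions: the function name list_secret_refs is hypothetical, and the set of characters allowed in a secret name (letters, digits, underscore, dot, hyphen) is a guess.

```shell
# Hypothetical helper: list the distinct secret names referenced in a config
# file via the $secret:name syntax. The allowed name characters are an assumption.
# Usage: list_secret_refs [config_file]   (defaults to ~/.comis/config.yaml)
list_secret_refs() {
  config_file="${1:-$HOME/.comis/config.yaml}"
  grep -o '\$secret:[A-Za-z0-9_.-]*' "$config_file" \
    | sed 's/^\$secret://' \
    | sort -u
}
```

Each printed name can then be checked by hand against the secrets that are actually stored.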

Permission corrections on data directory

Warning message:

Permission corrections on data directory

What happened: The daemon detected that files in ~/.comis/ have permissions that are too open. It attempted to fix them automatically.

How to fix:

Set restrictive permissions manually:

chmod -R 700 ~/.comis/

This ensures only your user can read, write, and access the data directory.
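To see which paths are too open before or after running chmod, the following sketch lists anything that grants a permission bit to group or others. The function name list_open_paths is hypothetical; it uses only POSIX find primaries and skips symbolic links, whose own mode bits are not meaningful.

```shell
# Hypothetical helper: print every path under the data directory that grants
# any read, write, or execute bit to group or others.
# Usage: list_open_paths [data_dir]   (defaults to ~/.comis)
list_open_paths() {
  data_dir="${1:-$HOME/.comis}"
  find "$data_dir" ! -type l \
    \( -perm -040 -o -perm -020 -o -perm -010 \
       -o -perm -004 -o -perm -002 -o -perm -001 \) -print
}
```

Empty output means nothing under the directory is accessible to other users.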

Port already in use (EADDRINUSE)

Error message:

EADDRINUSE: address already in use :::4766

What happened: Another process is already using port 4766, which the gateway needs.

How to fix:

Find what is using the port:
```
lsof -i :4766
```

If it is an old Comis process, stop it:

pm2 stop comis

Or:

kill <PID from lsof output>

If another application needs port 4766, change the Comis gateway port in config.yaml:
```
gateway:
  port: 4767   # or any available port
```
Restart the daemon

Config validation failed: Unrecognized key(s) in object

Error message:

Config validation failed: security: Unrecognized key(s) in object: 'oauth'

(The exact key name in the message varies: you may also see 'secrets' or a similar unrecognized-key name.)

What happened: Your config.yaml contains one or more configuration keys that were removed in v1.5. The security section uses strict schema validation, so any unrecognized key causes an immediate boot failure. There is no silent fallback.

The three keys removed in v1.5 are:

oauth.storage (under security:)
security.secrets.enabled
COMIS_DISABLE_ENCRYPTED_SECRETS (in ~/.comis/.env)

How to fix:

Detect legacy keys in your config and env:

grep -n "oauth\.storage\|secrets\.enabled" ~/.comis/config.yaml
grep -n "COMIS_DISABLE_ENCRYPTED_SECRETS" ~/.comis/.env

Remove each legacy key from the file it appears in:
- Delete the oauth.storage: line under security: in config.yaml
- Delete the security.secrets.enabled: line from config.yaml
- Delete the COMIS_DISABLE_ENCRYPTED_SECRETS= line from .env
Add the replacement key security.storage (if not already present):
~/.comis/config.yaml
```
security:
  storage: encrypted   # encrypted (default) | file | env
```

Verify boot:

node packages/cli/dist/cli.js secrets list

For a full migration walkthrough, see Secrets — Migrating from Pre-v1.5 Configuration.
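The detection grep above matches the dotted spelling of the legacy keys. Since the error message names a key such as 'oauth' inside the security section, the keys may also appear in nested YAML form. The sketch below checks both spellings; the function name find_legacy_keys is hypothetical, and matching any indented oauth: or secrets: key is a heuristic that can produce false positives.

```shell
# Hypothetical helper: flag lines that may be pre-v1.5 legacy keys, in either
# dotted form (oauth.storage) or nested YAML form (an indented "oauth:" or
# "secrets:" key). Matches are candidates for manual review, not proof.
# Usage: find_legacy_keys [config_file] [env_file]
find_legacy_keys() {
  config_file="${1:-$HOME/.comis/config.yaml}"
  env_file="${2:-$HOME/.comis/.env}"
  status=1   # becomes 0 as soon as any candidate line is found
  if [ -f "$config_file" ] && grep -nHE \
      'oauth\.storage|secrets\.enabled|^[[:space:]]+(oauth|secrets):' "$config_file"; then
    status=0
  fi
  if [ -f "$env_file" ] && grep -nH 'COMIS_DISABLE_ENCRYPTED_SECRETS' "$env_file"; then
    status=0
  fi
  return "$status"
}
```

The helper exits with status 0 when at least one candidate line is printed, so it can gate a deployment script.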

Daemon fails to start: ERR_ACCESS_DENIED in startup logs

Error message (in logs):

Error [ERR_ACCESS_DENIED]: Access to this API has been restricted

Error [ERR_ACCESS_DENIED]: ... is disabled when permission model is enabled

What happened: The daemon is running under node --permission (the Node.js permission model), which categorically disables fd-based fs APIs (fsync, fchmod, fchown) at the process level. On pre-guard releases, calling fsyncSync on the data-dir lock file during boot threw ERR_ACCESS_DENIED before any log line was emitted, causing every start attempt to fail.

This behavior is Linux-only and occurs only when security.permission.enableNodePermissions: true is set or the systemd unit passes --permission in ExecStart.

Is this still a problem? Current Comis versions guard all fd-API call sites via isFsyncDisabledByPermissionModel, so the daemon will not crash on this. If you are seeing this error, you may be running an older release or a custom Node.js build that raises ERR_ACCESS_DENIED for a different reason.

How to fix:

Check your Comis version:
```
node packages/cli/dist/cli.js --version
```
Ensure you are on v1.4 or later (the guard was introduced in v1.4).
Check if --permission is active:
```
ps aux | grep 'node.*--permission'
```
If you did not intend to enable the permission model, remove security.permission.enableNodePermissions: true from config.yaml and remove --permission from your systemd unit.
If you want to keep --permission: upgrade to v1.4 or later. The daemon handles the fd-API disablement gracefully (credential writes are best-effort durability; this is expected behavior).

For full detail on the fd-API impact, see Node Permissions — Production fd-API Disablement.
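The version check in the steps above can be scripted. This is a minimal sketch: version_at_least is a hypothetical helper, it relies on sort -V (available in GNU coreutils and recent BSD sort), and it assumes the version string is a plain dotted number with an optional leading "v".

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# A leading "v" is stripped from both arguments before comparing.
# Usage: version_at_least <installed_version> <required_version>
version_at_least() {
  have="${1#v}"
  want="${2#v}"
  # Version-sort the pair; if the required version sorts first (or they are
  # equal), the installed version is new enough.
  lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
  [ "$lowest" = "$want" ]
}
```

For example, version_at_least 1.3.9 1.4 fails while version_at_least 1.10.0 1.4 succeeds. Feeding it the output of the --version command assumes that command prints only the version string, which is not confirmed here.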

Channel Issues

Telegram enabled but no bot token configured

Error message:

Telegram enabled but no bot token configured

Hint from the daemon: Set botToken in channels.telegram config or TELEGRAM_BOT_TOKEN env var

How to fix:

Get a bot token from @BotFather on Telegram

Add it to your config:

channels:
  telegram:
    enabled: true
    botToken: "your-bot-token"

Or set the environment variable TELEGRAM_BOT_TOKEN

Restart the daemon

Discord enabled but no bot token configured

Error message:

Discord enabled but no bot token configured

Hint from the daemon: Set botToken in channels.discord config or DISCORD_BOT_TOKEN env var

How to fix:

Get a bot token from the Discord Developer Portal

Add it to your config:

channels:
  discord:
    enabled: true
    botToken: "your-bot-token"

Or set the environment variable DISCORD_BOT_TOKEN

Make sure you have enabled the Message Content Intent in the Discord Developer Portal under Bot settings
Restart the daemon

Discord credential validation failed

Error message:

Discord credential validation failed

Hint from the daemon: Verify DISCORD_BOT_TOKEN is valid in Discord Developer Portal

How to fix:

Go to the Discord Developer Portal
Select your application and navigate to Bot
Verify the token matches what is in your config
Check that the Message Content Intent is enabled (under Privileged Gateway Intents)
If the token was reset, copy the new token and update your config
Restart the daemon

Slack credential validation failed

Error message:

Slack credential validation failed

How to fix:

Go to your Slack API dashboard
Verify the Bot User OAuth Token matches your config
Check that the Signing Secret is correct
Make sure the bot has the required scopes (chat:write, channels:history, etc.)
Restart the daemon

WhatsApp or Signal connection failed

Error messages:

WhatsApp credential validation failed

Signal connection validation failed

How to fix for WhatsApp:

Verify your WhatsApp Business API credentials
Check that the phone number ID and access token are correct
Make sure the WhatsApp Business account is active

How to fix for Signal:

Check that signal-cli is installed and running on your server
Verify the Signal phone number is registered with signal-cli
Test signal-cli independently:
```
signal-cli -u +1234567890 receive
```
Restart the daemon

Runtime Issues

Gateway token auto-generated (ephemeral)

Warning message:

Gateway token auto-generated (ephemeral -- will be lost on restart)

Hint from the daemon: Set GATEWAY_TOKEN_<ID> in environment or secrets store for persistence

What happened: No gateway token was configured, so the daemon created a temporary one. This token will be different after every restart, meaning you will need to re-authenticate each time.

How to fix:

Set a permanent token in your config.yaml:

gateway:
  tokens:
    - id: "admin"
      secret: "your-secure-token-minimum-32-characters-long"
      scopes: ["rpc", "ws", "admin"]

The token secret must be at least 32 characters long.
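One way to produce a secret that satisfies the 32-character minimum is to read random bytes from the kernel and hex-encode them. This is a sketch, not a Comis command: generate_gateway_token is a hypothetical name, and hex is chosen only because it avoids characters that need quoting in YAML.

```shell
# Generate a random hex token from /dev/urandom.
# Usage: generate_gateway_token [num_bytes]   (default 32 bytes = 64 hex chars)
generate_gateway_token() {
  num_bytes="${1:-32}"
  # od prints the bytes as space-separated hex pairs; tr strips the spacing.
  head -c "$num_bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

Paste the output into the secret field above, or store it wherever you keep other credentials.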

Shutdown timeout exceeded, forcing exit

Error message:

Shutdown timeout exceeded, forcing exit

What happened: The daemon could not stop all components cleanly within the 30-second timeout. This can happen if an agent is in the middle of a long execution or a channel connection is hanging.

How to fix:

Check the logs before the timeout to see which component was still running

If this happens frequently, increase the timeout:

daemon:
  shutdownTimeoutMs: 60000   # 60 seconds instead of 30

If a specific component is always slow, investigate that component (check channel connections, agent executions in progress)

Unhandled promise rejection (non-fatal)

Warning message:

Unhandled promise rejection (non-fatal)

What happened: An unexpected error occurred in an asynchronous operation, but the daemon caught it and continued running. This does not affect normal operation but may indicate a bug in a tool or skill.

How to fix:

Check the error details in the log line — it includes the rejection reason
If it mentions a specific tool or skill, check that tool’s configuration
If it persists, this may be a bug — check for updates or report the issue

Configuration Issues

Config patch rate limit exceeded

Error message:

Config patch rate limit exceeded

What happened: Too many configuration changes were sent in rapid succession via the RPC API. The daemon rate-limits config patches to prevent accidental flooding.

How to fix:

Wait a moment and retry the config change
If you need to make many changes at once, batch them into a single config.apply call instead of multiple config.patch calls
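For scripted clients, "wait a moment and retry" is usually implemented as exponential backoff. The sketch below is generic shell, not a Comis feature: retry_with_backoff is a hypothetical helper, and the command it wraps (whatever you use to send the config change) is left to the caller.

```shell
# Hypothetical helper: run a command, retrying with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # double the wait before the next attempt
    attempt=$((attempt + 1))
  done
}
```

A call such as retry_with_backoff 4 2 your_patch_command waits 2, 4, and 8 seconds between attempts, where your_patch_command is a placeholder for your own client invocation.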

Approvals config warning

Warning message about approvals configuration.

What happened: The approvals section in your configuration has an issue, typically a referenced approval policy that does not match any defined policy.

How to fix:

Review the approvals section in your config.yaml
Make sure all referenced policy names match defined policies
Restart the daemon

Daemon starts but agents don't respond

No specific error message: the daemon shows "Comis daemon started" but agents ignore incoming messages.

How to fix:

Check agent routing — make sure each agent’s bindings match the channels you are messaging from:
```
agents:
  default:
    bindings:
      - channel: telegram
```
Check agent status — the agent might be suspended. Check in the web dashboard or logs.
Check budget — the agent’s token or cost budget might be exhausted. See Agent Safety.
Check model provider — verify the API key for the configured model provider is valid
Check logs — set daemon.logLevels.agent: "debug" to see detailed agent processing logs

Resilience Issues

See Resilience Architecture for how these systems work together.

Prompt timeout -- agent falling back to alternate model

Log event:

execution:prompt_timeout

What happened: An LLM call exceeded its prompt-level deadline: either the stall budget (no stream/tool activity for promptTimeoutMs) or the makespan ceiling (promptTimeoutMs × stallCeilingMultiplier, which catches a streaming runaway). The agent automatically retried with a shorter timeout or fell back to an alternate model. Running comis explain <sessionKey> reports which limit fired and which setting controls it.

How to fix:

Check the provider’s status page for ongoing outages
If the model legitimately needs more silent time (e.g., slow local prefill before the first token), increase promptTimeoutMs:
```
agents:
  default:
    promptTimeout:
      promptTimeoutMs: 300000   # 5 minutes
```
Review modelFailover.fallbackModels for the agent to ensure fallback models are configured
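Raising promptTimeoutMs also raises the makespan ceiling, because the ceiling is the product of the two settings described above. The arithmetic can be checked with a one-line sketch. makespan_ceiling_ms is a hypothetical helper, the multiplier is assumed to be an integer, and the value 2 used in the example is illustrative only because the default is not stated here.

```shell
# Worked arithmetic: makespan ceiling = promptTimeoutMs x stallCeilingMultiplier.
# Usage: makespan_ceiling_ms <promptTimeoutMs> <stallCeilingMultiplier>
makespan_ceiling_ms() {
  echo $(( $1 * $2 ))
}

# Example with an illustrative multiplier of 2:
# makespan_ceiling_ms 300000 2   prints 600000 (10 minutes)
```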

Sub-agent watchdog timeout

Log message:

Sub-agent watchdog timeout

What happened: A sub-agent ran longer than maxRunTimeoutMs. The watchdog force-failed the run and sent a failure notification to the channel.

How to fix:

Check the sub-agent task — was it genuinely too complex, or did the LLM provider hang?

If the task is legitimately long, increase maxRunTimeoutMs:

security:
  agentToAgent:
    subagentContext:
      maxRunTimeoutMs: 900000   # 15 minutes

Check for provider slowness via provider:degraded events in logs

Provider degraded -- LLM calls skipped

Log event:

provider:degraded

What happened: Multiple agents hit failures within a short window. The provider health monitor flagged the provider as degraded. LLM-dependent operations are being skipped.

How to fix:

Check the provider’s status page for ongoing outages
Wait for automatic recovery — the system emits a provider:recovered event when the provider comes back
Check the dead-letter queue for missed announcements (~/.comis/dead-letters.jsonl)

Ghost run detected and force-failed

Log message:

Ghost run detected and force-failed

What happened: A sub-agent was still marked “running” past the grace period (maxRunTimeoutMs + 120 seconds). The ghost sweep force-failed it and delivered a failure notification to the channel.

How to fix:

This usually indicates a process crash during execution
Check for out-of-memory (OOM) or unhandled errors around the same time in logs
The failure notification was already delivered to the channel — no action needed for the user
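When correlating log timestamps, it helps to know how long after a run started the ghost sweep can fire. Using the grace period stated above, the deadline can be computed as follows. ghost_deadline_ms is a hypothetical helper, not a Comis command.

```shell
# Ghost-sweep deadline = maxRunTimeoutMs + 120 s grace period (120000 ms).
# Usage: ghost_deadline_ms <maxRunTimeoutMs>
ghost_deadline_ms() {
  echo $(( $1 + 120000 ))
}

# Example: with maxRunTimeoutMs set to 900000 (15 minutes),
# ghost_deadline_ms 900000   prints 1020000 (17 minutes)
```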

Dead-letter entries accumulating

Log event:

announcement:dead_lettered

What happened: Announcement delivery failed and entries were saved to the dead-letter queue (~/.comis/dead-letters.jsonl). Retry will be attempted up to 5 times.

How to fix:

Check channel connectivity — is the bot still connected to the channel?
Verify the bot token is still valid
Entries auto-expire after 1 hour
On provider recovery the dead-letter queue drains automatically
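To tell whether entries are accumulating or draining, count the lines in the queue file at intervals. This is a minimal sketch: dead_letter_count is a hypothetical helper, and it relies only on the JSONL convention of one entry per line, not on any field inside the entries.

```shell
# Hypothetical helper: count non-empty entries in the dead-letter queue file.
# Prints 0 when the file is absent or empty.
# Usage: dead_letter_count [queue_file]   (defaults to ~/.comis/dead-letters.jsonl)
dead_letter_count() {
  queue_file="${1:-$HOME/.comis/dead-letters.jsonl}"
  if [ -f "$queue_file" ]; then
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c . "$queue_file" || true
  else
    echo 0
  fi
}
```

A count that keeps rising across several checks points at a channel or token problem rather than a transient provider outage.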

Execution pipeline timeout -- canned error sent to user

Log event:

execution:aborted  reason=pipeline_timeout

What happened: The entire agent execution exceeded the 600-second pipeline timeout. A static error message was sent to the user (no LLM call).

How to fix:

Check if the model was slow — look for execution:prompt_timeout events before this
This is a backstop — if it fires regularly, investigate per-call prompt timeouts first
Consider adding fallback models via modelFailover.fallbackModels in the agent config

Memory & Recall Issues

Recall is the path that selects which memories enter the prompt. When it surprises you, you do not have to read code — two questions cover almost every case. (The artifacts and events behind this runbook are documented in Observability → Memory & Recall Diagnostics.)

Why did recall pick X? (or miss the obvious memory)

Symptom: A recall injected a memory you did not expect, or omitted one you did, and you want to see the ranking that produced that result.

How to diagnose:

Enable the recall trace (opt-in, default off). Add to config.yaml, then reload:

diagnostics:
  recallTrace:
    enabled: true        # opt-in — redacted, bounded (50 MB) JSONL

Reproduce the surprising recall (send the message that triggered it).
Read the per-memory ranking breakdown with --format json:
```
node packages/cli/dist/cli.js memory recall-trace <session> --format json
```
The JSON record shows, for each candidate: which lanes matched (FTS / vector / entity), the fused order, the pre- and post-rerank scores, the recency / temporal / proof / trust score components, and the include/exclude reason for every memory. The table view (omit --format json) shows just the correlation keys (trace, session, finalCount, ts) for locating the right record first.
Cross-reference the trust filter. If the memory you expected is excluded, the reason field will say why — most often a trust-level filter (includeTrustLevels) rather than a low score.

The recall-trace payload is full-sanitized — it records the ranking structure and scores, never raw memory bodies or query text. There is no toggle to include raw content; see Observability → Memory & Recall Diagnostics.

Turn recallTrace.enabled back to false once you have the answer — it is a debug-session writer, not a steady-state one.

Why is recall slow?

Symptom: Recall adds noticeable latency to a turn.

How to diagnose, in order:

Is the cross-encoder reranker on? rag.rerank.enabled is opt-in (default false) precisely because the reranker adds latency. If you enabled it, that is the most likely source:
```
agents:
  default:
    rag:
      rerank:
        enabled: true       # the latency source — disable to A/B
        timeoutMs: 800       # the per-recall budget; fallback fires past this
```
Disable it briefly to confirm the latency disappears. On timeout the reranker already falls back to the fusion-ranked order (memory:reranked.timedOut: true), so correctness is preserved — but the wait still happens up to timeoutMs.
First-call reranker model load. The reranker is a local GGUF loaded on first use. The first recall after a restart (with rerank on) pays a one-time model-load cost; steady-state recalls do not. A single slow recall right after startup is expected.
Vector index availability. If the vector lane cannot run, recall falls back to FTS-only and logs a WARN with errorKind + hint. Check for it:
```
grep '"errorKind"' ~/.comis/logs/daemon.log | grep -i "rerank\|vec\|embed"
```
A vec→FTS fallback shows as memory:recalled.vectorCandidates: 0.

Aggregate health: the comis memory stats recall-counter overlay surfaces the rerank-fallback rate, lane usage, consolidation throughput, and recall hit-rate — a climbing fallback rate means the reranker is timing out across the fleet, not just on one turn:

node packages/cli/dist/cli.js memory stats

The overlay is best-effort; a daemon that has not wired the counters still renders base stats. See the CLI reference for every comis memory subcommand.

Credential Broker

The credential broker intercepts HTTPS requests from driven-CLI spawns, injects the real API key, and forwards the request upstream — all without the key ever entering the sandbox. When something goes wrong, the broker emits a broker:denied or broker:credential_unavailable event and returns a specific HTTP status code. Use the playbooks below to diagnose each failure mode.

Broker request fails with 407 (bad_token)

What happened: The driven-CLI’s proxy token was missing, forged, or already consumed. Each token is single-use; one is issued per driven-CLI spawn. The broker emits broker:denied with reason: "bad_token".

Diagnose:

grep 'broker:denied' ~/.comis/logs/daemon.log | grep bad_token | tail -20

Common causes:

The CLI was not launched through the broker (HTTPS_PROXY env var not set to the broker socket).
The session was torn down before the request completed.
A token was replayed (single-use invariant — each token can only be consumed once).

Resolution: Ensure the CLI is launched via the daemon’s driven-CLI spawn path, not invoked manually without the broker environment.

Broker returns 403 (no_binding or path_policy)

What happened: Two broker:denied reasons produce 403:

no_binding — the requested host has no matching hostRules entry in any binding. Add a binding or use a built-in preset (anthropic, finnhub).
path_policy — the host is allow-listed but the path is blocked by the binding’s pathPolicy glob.

Diagnose:

grep 'broker:denied' ~/.comis/logs/daemon.log | grep -E '"reason":"(no_binding|path_policy)"' | tail -10

Resolution for no_binding: Add a hostRules entry for the target host in executor.broker.bindings, or use a matching preset.

Resolution for path_policy: Review the pathPolicy glob on the binding. Ensure the requested path matches the allowed pattern (e.g., /v1/*).
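When several denial reasons are mixed in the log, a tally shows which failure mode dominates. The sketch below assumes one JSON object per log line containing a "reason":"..." field, which is the shape the grep patterns above rely on. broker_denied_summary is a hypothetical helper.

```shell
# Hypothetical helper: tally broker:denied log lines by reason.
# Output: "<count> <reason>" per line, most frequent first.
# Usage: broker_denied_summary [log_file]   (defaults to ~/.comis/logs/daemon.log)
broker_denied_summary() {
  log_file="${1:-$HOME/.comis/logs/daemon.log}"
  grep 'broker:denied' "$log_file" \
    | grep -o '"reason":"[^"]*"' \
    | sed 's/^"reason":"//; s/"$//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

A log dominated by bad_token points at how the CLI is launched, while no_binding or path_policy points at the binding configuration.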

Broker returns 502 (credential_unavailable)

What happened: The broker could not resolve the secretRef from SecretManager. The broker emits broker:credential_unavailable and returns 502. The request is never forwarded upstream.

Diagnose:

grep 'broker:credential_unavailable' ~/.comis/logs/daemon.log | tail -10
# Check that the secret exists:
node packages/cli/dist/cli.js secrets list

Resolution: Ensure the secretRef in the binding exactly matches the key name returned by secrets list. If the secret is missing, add it:

node packages/cli/dist/cli.js secrets set YOUR_SECRET_KEY

Broker returns 501 (WebSocket upgrade)

What happened: A WebSocket upgrade from the credentialed sandbox returned 501 with broker:denied reason ws_upgrade_not_supported. This is an intentional fail-closed guard: WebSocket credential injection is not supported in this release. The error is returned explicitly rather than hanging silently.

Resolution: Use HTTPS (not WebSocket) for API calls that require credential injection. WS credential injection is on the roadmap.

Tracing a full broker request lifecycle

Every broker log line carries step (pipeline stage), traceId, and agentId. To trace a full request:

# Find the traceId from a session_opened event for your agent:
grep 'broker:session_opened' ~/.comis/logs/daemon.log | grep '"agentId":"<your-agent>"' | tail -5

# Follow all events for that trace:
grep '"traceId":"<your-trace-id>"' ~/.comis/logs/daemon.log | jq -c '{step, event: (.event // .type), reason: .reason}'

Expected sequence for a successful injection:

broker:session_opened → broker:request → broker:injected → (upstream response) → broker:session_closed

If broker:denied appears in place of broker:injected, the reason field identifies the failure mode (see the sections above).

Filter by pipeline stage:

jq 'select(.step == "broker-inject")' ~/.comis/logs/daemon.log

Upstream API returns 401 (key injected but rejected)

What happened: The broker injected a credential and forwarded the request, but the upstream API returned 401. The secret value may be wrong or the injection rule kind may not match what the API expects.

Diagnose:

Verify the secretRef key in the binding matches the stored secret name (secrets list).

Check the stored secret value — update it with:

node packages/cli/dist/cli.js secrets set YOUR_SECRET_KEY

Check the injection rule kind — the anthropic preset injects x-api-key (raw, not Bearer). For OpenAI-compatible APIs use a custom binding with format: bearer.
Look for broker:injected in logs to confirm injection actually occurred (vs. a 403 or 502 before it reached upstream).
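The difference between the two injection styles comes down to which HTTP header carries the key. The sketch below illustrates the two header shapes. It is not the broker's implementation: auth_header is a hypothetical helper and the label raw is used here only to name the x-api-key style.

```shell
# Illustration only: how the same key is presented under the two styles.
#   raw    -> x-api-key: <key>              (style used by the anthropic preset)
#   bearer -> Authorization: Bearer <key>   (style used by OpenAI-compatible APIs)
# Usage: auth_header <raw|bearer> <key>
auth_header() {
  case "$1" in
    raw)    printf 'x-api-key: %s\n' "$2" ;;
    bearer) printf 'Authorization: Bearer %s\n' "$2" ;;
    *)      echo "unknown format: $1" >&2; return 2 ;;
  esac
}
```

If the upstream API expects one header and the binding injects the other, the key reaches the API but is ignored, which produces exactly this 401.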

Full Credential Broker reference →

Related Pages

Daemon

Startup, shutdown, and recovery

Logging

How to view and understand logs

FAQ

Common questions about running Comis

Verification

Diagnostic steps for a new installation