comis fleet surfaces the
daemon-wide pattern, then comis explain root-causes the worst
session it points at.
Recommended models (the receipts)
The numbers below are measured, not asserted. Reproduce them with the in-reposcripts/bench-small-model/ harness (standalone — drives models through Ollama’s
OpenAI-compatible API with a Comis-style prompt and tools, no daemon needed) and the
daemon-routed live tier under test/live/ (test/live/scenarios/local/local-models.test.ts).
| Model | Verdict | Measured receipt |
|---|---|---|
qwen3.6:35b | Recommended | 9/9 bench scenarios passed; security floor 100% injection-resisted / 0% secret-leaked / 0% over-refused — even bare-prompt (no guardrail instructions); comprehension 8/8 |
qwen3.6:27b | Works — one caveat | Security floor holds (100/0/0, fair and bare prompts alike); comprehension 7/8 — fails the one-step natural-language-to-orchestration-graph instruction that 35b passes |
gemma4:31b | Not recommended | Runaway generation: 16× latency and 3.6× tokens vs qwen3.6; one scenario ran 810 s / 56 K tokens (correctness 6/6 — the failure is runaway, not wrong answers) |
gemini-flash-lite | Frontier control | 100% across the matrix at ~2,240 ms — the calibration baseline for what “good” looks like |
qwen3.6:35b is the recommended local executive. The 9/9 is a bench-scenario count (the
full failure-taxonomy scenario matrix), not an orchestration-node count — and the security
floor held even with a bare prompt carrying no guardrail instructions at all.
qwen3.6:27b is a viable smaller alternative. Its security floor is identical to 35b. The
one measured cliff is comprehension 7/8: it fails the single-step “turn this natural-language
request into a multi-agent graph” instruction that 35b passes. The v2.19 orchestration repair
loop — type-aware validation of graph instructions, recovery of malformed completion markers,
and tolerance for every common instruction form the model emits — is what makes 27b
orchestration usable in practice.
gemma4:31b answers correctly (6/6) but generates without restraint: 16× the latency and
3.6× the tokens of qwen3.6, with one scenario running 810 seconds and 56 K tokens. The makespan
ceiling (stallCeilingMultiplier, below) exists for exactly this failure class, but the better
fix is not running the model.
For the full security-hardened local configuration, see the
Recommended Secure qwen3.6 Configuration.
Tool-calling models only
Comis requires native tool-calling. There is no XML-tag tool-call fallback by design — a model that cannot emit structured tool calls cannot drive the agent loop, and a prose-parsing fallback would reintroduce the malformed-call failure class the platform measures against. Pick tool-calling-capable tags (every model in the table above qualifies; check the Ollama model card for “tools” support before adopting a new one).Context window setup
The served-window mechanics are documented once, canonically, in the Local model context window section of the config reference: how the boot probe reconciles the Ollama-servednum_ctx with the configured
contextWindow, the OLLAMA_CONTEXT_LENGTH environment variable and Modelfile
PARAMETER num_ctx knobs for raising the served window, the double-cap chain when both the
served window and a capability cap clamp, the VRAM caveat, and the boot WARN. Start there —
this page does not restate those instructions.
What the canonical section does not carry is the quantization lever: quantized weights and
a quantized KV cache trade GPU memory for a larger affordable num_ctx. A 4-bit weight quant
(the q4_K_M-class tags most Ollama models default to) frees VRAM relative to 8-bit or fp16
weights, and a quantized KV cache (Ollama’s OLLAMA_KV_CACHE_TYPE=q8_0, where supported,
roughly halves KV memory vs the f16 default) frees memory that scales with context length —
together they let the same GPU serve a meaningfully larger window. The trade is model quality
and (for the KV cache) a small accuracy risk on long contexts, so move one step at a time and
re-run the bench. Raising num_ctx without the VRAM headroom triggers exactly the
OOM-or-thrash caveat documented at the
canonical anchor.
Capacity knobs
contextEngine.budget.effectiveContextCapSmall/effectiveContextCapNano— the unconditional class caps (defaults 32000 / 16000;0= uncapped): the effective window for a small/nano-class model never exceeds them regardless of what the model serves. See Capacity Cap.providers.entries.<id>.capabilities.capabilityClass— the override escape hatch: pins the capability class (scaffold level, security level, and the per-class window cap) when the resolver heuristic guesses wrong. The served-window section documents when the pin — not the budget caps — is the lever that actually binds the window.providers.entries.<id>.capabilities.probeServedWindow— the boot-probe opt-out (setfalsewhen Ollama is offline at daemon start). Documented with a YAML example at the served-window section.
Latency knobs
agents.<id>.promptTimeout.promptTimeoutMsis a stall budget, not a wall clock: the deadline resets on stream activity (text and thinking deltas, throttled to about once per second) and on tool completions, so only a fully silent gap kills the turn. It needs to cover the longest silent stretch — in practice the prefill before the first token, which on a loaded local GPU can exceed the 180000 ms default. Raise it for local deployments (e.g.300000). See Prompt Timeout.agents.<id>.promptTimeout.stallCeilingMultiplieris the makespan ceiling: a turn is aborted atpromptTimeoutMs × stallCeilingMultiplier(default 10) even while still streaming. This bounds streaming runaways that would otherwise reset the stall budget forever — the gemma4:31b 810-second receipt above is why this knob exists. Same table.providers.entries.<id>.timeoutMsis config-echo only — it is not enforced on completion calls. The completion deadline lives onagents.<id>.promptTimeout; setting a non-default provider value emits a one-time boot WARN naming the real knob (quoted in the WARN list below).agents.<id>.operationModels.<op>.timeoutoverrides the deadline per operation type (the key istimeout, nottimeoutMs). See Operation Model Overrides.
Schema hygiene (GBNF)
llama.cpp-family servers compile tool schemas into GBNF grammars and reject JSON-Schema keywords the grammar cannot express. Comis ships agbnf tool-schema profile
(models[].comisCompat.toolSchemaProfile: "gbnf") that applies structural, removal-only
transforms before the schemas reach the provider. Providers with type: "ollama" enable it
automatically (an explicit toolSchemaProfile value always wins); LM Studio, llama-server
(llama.cpp), and vLLM endpoints opt in explicitly per model with the same key. If a provider
still rejects a schema at grammar-compile time, the 400 is classified
(tool_schema_unsupported), Comis strips pattern/format from the offending tools and
retries exactly once per session before failing honestly — and comis explain names the
offending tool. The enum row, auto-detection Note, and a worked YAML example live in the
providers section of the config reference (the comisCompat model
table).
What each boot WARN means
One line each — the shipped message, what it tells you, and the fix. “Ollama served context window below configured” (WARN, once per provider per boot): the probe found Ollama serving a smallernum_ctx than your configured contextWindow, so agents
on that provider silently run with the smaller window. Fix by raising the served window
(OLLAMA_CONTEXT_LENGTH=<configured> ollama serve, or Modelfile PARAMETER num_ctx <configured>), or pin probeServedWindow: false to skip the probe. Full mechanics at the
served-window section.
“Boot viable-floor check: effective window below minViable — agent will degrade on real
turns (WARN-only, boot continues)” (WARN, per infeasible agent): the five-term floor
(bootstrapTotalTokens + toolSchemaTokens + outputHeadroomFloor + freshTailReserve + safetyMargin) does not fit the effective window; the hint spells every term’s value and names
the knob for the binding window source. When toolSchemaTokens dominates, the lever is the
tool surface, not the window: the small class defers cold tools behind a 24-tool active
ceiling via discover_tools, so pin capabilityClass or disable unused MCP servers / builtin
tool groups. Equation and worked example at the
served-window section.
“providers.timeoutMs is config-echo only — completion deadline lives on
agents.promptTimeout” (WARN, once per provider per boot, only for non-default values): you
set a provider timeoutMs expecting it to govern completions — it does not. Remove the
provider key or tune agents.<id>.promptTimeout.promptTimeoutMs
(Prompt Timeout).
“GBNF tool-schema transforms applied for local provider” (INFO, once per provider per
boot): the gbnf profile transformed tool schemas — the line carries tool and keyword counts
and names only, never schema bodies. This is healthy, expected behavior on Ollama providers,
not a problem to fix.
“Context window reconciled (served or capability cap bound)” (INFO, once per session, on
the first reconciled turn): something smaller than the configured window — the served
num_ctx or a capability cap — is the effective window for this session. Expected on capped
or served-bound models; a session whose configured window simply wins logs no reconcile line
at all.
Compaction on small models
Compaction triggers ratio against the turn’s effective budget window (the reconciled window under the class cap), not the model’s configuredcontextWindow — so a capped or
served-bound small model compacts when its real window fills, not 4× late. And compaction
input spans are clamped to the resolved summarizer’s window, taken as the smaller of its
configured contextWindow and the probed served window whenever the summarizer runs on the
provider the served value was probed from: a small operationModels.compaction summarizer
(say 8K) — or the primary model itself when Ollama serves less than its configured window —
is never fed an over-window chunk. Oversized backlogs split across multiple bounded passes,
and the un-summarized remainder stays in context, never dropped. A summarizer on a different
provider keeps its own full window (a local served bound never clamps a cloud summarizer).
One honest exception: summarizerFallbackProviders models that take over after a primary
summarizer failure are not window-checked — an over-window span on a smaller fallback
degrades through the escalation ladder to a bounded summary, never data loss. The governing
knobs are the contextThreshold and leafChunkTokens
rows in the Context Engine section;
pick the summarizer model per agent via
operationModels.compaction.
Diagnosing a degraded local session
Start fleet-wide:comis fleet --since 24 surfaces the
config_posture:served_below_configured finding (the count of providers serving below their
configured window at the latest boot — servedBelowConfiguredCount on the
obs.fleet.health RPC), the degraded-by-cause
breakdown (context_exhausted, output_starved, timeout, narration_stall), and
breaker/cost rollups. See comis fleet.
One cause is specific to small/nano models: narration_stall — the model announced
an action (“Now let me run the tool:”) but never emitted the tool call, and the one bounded
continuation re-prompt the platform injects did not recover a real answer. The turn delivers
the narration but is recorded degraded with this named cause instead of a false-clean
success. Frontier/mid models are never re-prompted by this guard.
Then go single-session: comis explain <sessionKey> returns the numbers-backed verdict naming
the binding knob — for context_exhausted sessions the contextBudget section reports
windowCapSource (served, the contextEngine.budget.* cap, or the capabilityClass pin)
with the assembled-vs-window totals; for timeout sessions the verdict names the stall budget
or the makespan ceiling with the configured and elapsed numbers. See
obs.explain.