- Accuracy (cross-judged). A cross-judged baseline measured under the disclosed protocol and graded by two independent LLM judges. These are Comis’s sole end-to-end QA-accuracy numbers.
- Mechanical, keyless, at $0. A set of structural gate deltas for the structural tracks (KG / reasoning / query understanding) — a lane surfaces a linked doc, a write lands at the right trust tier, an off-knob is byte-identical. These run with no answer model, no judge, no key, no cost. They are not QA-accuracy lifts.
Conflict of interest, disclosed. Comis authored this benchmark.
Vendor-reported competitor numbers are non-comparable across protocols (the
judge model alone can swing a memory score from ~49% to ~94%); competitors are
invited to reproduce on their own harness. No superiority claim over any
competitor is made here — that would require a number measured under this
protocol, surviving a cross-judge spread, with the competitor re-run under the
same protocol, and that operator-costed re-run is deferred (see below).
- Retrieval recall@k / MRR — an internal regression proxy. It scores whether the recall pipeline surfaces the gold-labelled memories for a question. It is the gate every recall change is re-run against.
- End-to-end QA accuracy — the apples-to-apples number. It drives recalled context through an answer model, grades correctness with a category-specific LLM judge, and reports overall plus per-category accuracy. This is the metric comparable to what other memory systems publish.
Datasets
Two standard long-term-memory datasets drive both harnesses:
- LongMemEval: multi-session question-answering over long conversation histories. Each haystack session is ingested as one dated document. The per-turn has_answer evidence flags are stripped before ingestion (an eval-integrity strip: the gold label must never reach the stored content).
- LoCoMo: long conversational memory with D<session>:<dialogue> evidence references. The loader excludes the category-5 adversarial items and parses the D<sess>:<dia> evidence into dialogue ids. Each session is ingested as one dated document.
MemoryEntry id.
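The two loader steps described above (stripping the gold flag and parsing LoCoMo evidence references) can be sketched as follows. This is a minimal sketch under stated assumptions: the type and function names (`Turn`, `stripHasAnswer`, `parseLocomoEvidence`, `keepLocomoItem`) and every field other than `has_answer` are hypothetical, and the real loader may differ.

```typescript
// Hypothetical turn shape; only the has_answer flag is named in the text above.
interface Turn {
  role: string;
  content: string;
  has_answer?: boolean;
}

// Eval-integrity strip: drop the per-turn gold flag before ingestion,
// so the gold label can never reach the stored content.
function stripHasAnswer(turns: Turn[]): Array<Omit<Turn, "has_answer">> {
  return turns.map(({ has_answer: _gold, ...rest }) => rest);
}

interface DialogueRef {
  session: number;
  dialogue: number;
}

// Parse a LoCoMo "D<session>:<dialogue>" evidence reference.
// Returns null for anything that does not match the pattern.
function parseLocomoEvidence(ref: string): DialogueRef | null {
  const match = /^D(\d+):(\d+)$/.exec(ref.trim());
  if (match === null) return null;
  return { session: Number(match[1]), dialogue: Number(match[2]) };
}

// Category 5 is the adversarial category the loader excludes.
function keepLocomoItem(item: { category: number }): boolean {
  return item.category !== 5;
}
```

Returning `null` for a malformed reference, rather than throwing, keeps a single bad evidence string from aborting a whole ingest; whether the real loader does the same is not stated here.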
Licensing and no vendoring
The full datasets are not vendored in the repository (LoCoMo is CC BY-NC 4.0). Only two tiny, neutral placeholder fixtures ship in-repo so the structural unit tests can run in default CI. The full haystacks are operator-provided out-of-band, and the harness never auto-downloads datasets or models. This is a supply-chain invariant: a fresh checkout pulls down no benchmark corpus and no model weights.
Retrieval recall@k harness (a regression proxy)
The retrieval harness lives in retrieval-harness.bench.test.ts. It
is the regression proxy — every recall change re-runs it and reads the
recall@k / MRR delta. It is not a published score.
What it does:
- Env-gated behind COMIS_BENCH via describe.skipIf(!COMIS_BENCH). The default pnpm test and pnpm validate runs skip it entirely: no dataset weight, no model weight, no cost in CI.
- Ingests into a real, isolated SQLite store: a fresh temp-directory SqliteMemoryAdapter (trust level learned, never ~/.comis), so the run cannot touch a live memory store.
- Runs the live MemoryRecall pipeline per question (the same search -> fuse -> rerank -> score -> trust-filter -> dedup pipeline production uses) and scores recall@k / MRR with the shared scorer against the gold-mapped ids.
- Asserts only structural invariants: recall@1 >= 0, recall@5 >= recall@1, mrr >= 0. It never asserts a machine-dependent accuracy floor. The recall@k / MRR number is printed via console.log, not baked into an assertion.
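The scoring and the structural invariants in the list above can be sketched as follows. This is a minimal sketch, not the shared scorer itself: it assumes recall@k is the fraction of gold ids found in the top k and MRR is the reciprocal rank of the first gold hit, and the names `QuestionRun`, `recallAtK`, `reciprocalRank`, and `scoreRuns` are hypothetical.

```typescript
// One scored question: ranked recall ids and the gold-mapped ids (hypothetical shape).
interface QuestionRun {
  ranked: string[];
  gold: string[];
}

// Fraction of gold ids that appear in the top k results.
function recallAtK(run: QuestionRun, k: number): number {
  if (run.gold.length === 0) return 0;
  const top = run.ranked.slice(0, k);
  const hits = run.gold.filter((id) => top.includes(id)).length;
  return hits / run.gold.length;
}

// 1 / rank of the first gold hit, or 0 when no gold id was recalled.
function reciprocalRank(run: QuestionRun): number {
  const idx = run.ranked.findIndex((id) => run.gold.includes(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

function scoreRuns(runs: QuestionRun[]): { recall1: number; recall5: number; mrr: number } {
  return {
    recall1: mean(runs.map((r) => recallAtK(r, 1))),
    recall5: mean(runs.map((r) => recallAtK(r, 5))),
    mrr: mean(runs.map(reciprocalRank)),
  };
}

// Toy data, not a benchmark result.
const scores = scoreRuns([
  { ranked: ["a", "b", "c"], gold: ["b", "z"] },
  { ranked: ["x", "y"], gold: ["x"] },
]);
console.log(scores); // { recall1: 0.5, recall5: 0.75, mrr: 0.75 }
```

Under this definition recall@5 can never fall below recall@1, because the top-5 window contains the top-1 window. That is why the invariant is safe to assert on any machine while the printed value is not.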
Commands
Default run (the suite is skipped):.. escape is rejected). The datasets are placed out-of-band; nothing is
downloaded.
The FTS-only disclosure
With no LLAMA_MODEL_PATH, the store runs FTS5-only, so the reported
recall@k reflects lexical-only retrieval — the printed flag reads
vectorLane: false. Supplying the embedding model activates the vector lane;
supplying the reranker model adds the cross-encoder lift. Read the
vectorLane / rerank flags on the printed line so a lexical-only number is
never mistaken for semantic recall.
End-to-end QA + LLM-judge accuracy (the apples-to-apples number)
The QA harness lives in qa-judge-harness.bench.test.ts.
This is the metric comparable to what other memory systems publish. It ingests
the datasets once, then for each question:
- Recall the context for the question through the live pipeline.
- Answer with an answer model at temperature 0.
- Judge correctness with a category-specific judge model at temperature 0, using LongMemEval-paper-derived per-category rubrics (the rubric is placed first in the prompt, ahead of any dataset content).
- Parse the judge verdict with a total parser: a verdict it cannot parse is counted as invalid, never as a wrong answer.
- Aggregate overall and per-category accuracy using the invalid-excluded denominator, correct / (total - invalid), matching the reference Hindsight runner’s accounting.
- Build a reproducible, secret-free report and write it to a confined path.
- Gated behind COMIS_BENCH plus a nested gate on the answer/judge model env: it runs only when you supply both model configurations. The default test run skips it.
- Structural assertions only: 0 <= overall <= 100, validTotal === total - invalid, per-category bounds. There is no hard accuracy floor baked in.
- A not.toMatch(/apiKey|sk-|Bearer/) assertion proves the written report is secret-free.
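The total parser and the invalid-excluded denominator from the steps above can be sketched as follows. This is a minimal sketch: the yes/no verdict vocabulary is an assumption (the judge's actual output format is not documented here), and `parseVerdict` and `aggregate` are hypothetical names.

```typescript
type Verdict = "correct" | "wrong" | "invalid";

// Total parser: every input maps to exactly one outcome and nothing throws.
// Assumption: the judge answers with a leading "yes" or "no".
function parseVerdict(raw: string): Verdict {
  const text = raw.trim().toLowerCase();
  if (/^yes\b/.test(text)) return "correct";
  if (/^no\b/.test(text)) return "wrong";
  return "invalid"; // unparseable is never counted as a wrong answer
}

interface Accuracy {
  total: number;
  invalid: number;
  validTotal: number;
  correct: number;
  accuracy: number; // percentage over the invalid-excluded denominator
}

function aggregate(verdicts: Verdict[]): Accuracy {
  const total = verdicts.length;
  const invalid = verdicts.filter((v) => v === "invalid").length;
  const correct = verdicts.filter((v) => v === "correct").length;
  const validTotal = total - invalid;
  // correct / (total - invalid); an all-invalid run reports 0 rather than NaN.
  const accuracy = validTotal === 0 ? 0 : (100 * correct) / validTotal;
  return { total, invalid, validTotal, correct, accuracy };
}

// Toy verdicts, not benchmark data.
const summary = aggregate(["yes", "No.", "maybe?", "YES, correct"].map(parseVerdict));
console.log(summary.validTotal, summary.correct); // 3 2
```

Excluding invalid verdicts from the denominator means a judge formatting failure lowers the sample size, not the score, which is why the report carries both `invalid` and `validTotal`.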
Operator environment
All values are provided out-of-band. The page lists the variable names and placeholders only, never a real key, token, or tokened model URL:
| Variable | Role |
|---|---|
| COMIS_BENCH | Master gate (set to 1 to enable the harness) |
| COMIS_BENCH_ANSWER_PROVIDER | Answer model provider (placeholder, e.g. <provider>) |
| COMIS_BENCH_ANSWER_MODEL | Answer model id (placeholder, e.g. <model>) |
| COMIS_BENCH_ANSWER_API_KEY | Answer model credential (env-var name only) |
| COMIS_BENCH_JUDGE_PROVIDER | Judge model provider (placeholder) |
| COMIS_BENCH_JUDGE_MODEL | Judge model id (placeholder) |
| COMIS_BENCH_JUDGE_API_KEY | Judge model credential (env-var name only) |
| COMIS_BENCH_DATA | Optional: absolute path to the full haystack directory |
| LLAMA_MODEL_PATH | Optional: absolute path to the embedding GGUF (vector lane) |
| LLAMA_RERANKER_MODEL_PATH | Optional: absolute path to the reranker GGUF (rerank lift) |
The recall defaults the QA harness uses are maxResults 5, minScore 0.1, trust levels system and learned,
with reranking default-off in the bench unless a model path is supplied. Those
defaults are recorded in the report (below) so a run is reproducible.
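The gating implied by the variable table can be sketched as a pure function over an environment map. This is a minimal sketch under stated assumptions: the function names (`benchEnabled`, `qaGateOpen`, `laneFlags`) are hypothetical, treating only the literal value `1` as enabled is an assumption, and so is requiring all six answer/judge variables for the nested gate.

```typescript
type Env = Record<string, string | undefined>;

// Variables the nested QA gate is assumed to require (names from the table above).
const QA_REQUIRED = [
  "COMIS_BENCH_ANSWER_PROVIDER",
  "COMIS_BENCH_ANSWER_MODEL",
  "COMIS_BENCH_ANSWER_API_KEY",
  "COMIS_BENCH_JUDGE_PROVIDER",
  "COMIS_BENCH_JUDGE_MODEL",
  "COMIS_BENCH_JUDGE_API_KEY",
] as const;

function isSet(env: Env, name: string): boolean {
  return (env[name] ?? "").trim() !== "";
}

// Master gate: the table says COMIS_BENCH is set to 1 to enable the harness.
function benchEnabled(env: Env): boolean {
  return env["COMIS_BENCH"] === "1";
}

// Nested gate: the QA harness runs only when both model configurations are supplied.
function qaGateOpen(env: Env): boolean {
  return benchEnabled(env) && QA_REQUIRED.every((name) => isSet(env, name));
}

// Optional local models decide which lanes are lit on the printed line.
function laneFlags(env: Env): { vectorLane: boolean; rerank: boolean } {
  return {
    vectorLane: isSet(env, "LLAMA_MODEL_PATH"),
    rerank: isSet(env, "LLAMA_RERANKER_MODEL_PATH"),
  };
}
```

Passing the environment in as an argument (for example `qaGateOpen(process.env)` in Node) keeps the gate testable without mutating the real process environment.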
Judges and prompts (cross-judge >= 2)
The judge model is the single biggest source of non-comparable memory-benchmark numbers — the same answers have been reported at ~49% under an independent judge and ~94% self-judged. So no per-category number is trusted until two independent judges agree within tolerance (|A - B| <= 5.0 points).
- Judge A: OpenAI gpt-4o-2024-11-20.
- Judge B: OpenAI gpt-4.1.
- The rubric is placed first in the prompt, ahead of any dataset content, and is the LongMemEval-paper-derived per-category rubric (see the QA harness section above), so grading is category-appropriate, not one generic yes/no.
- A verdict the total parser cannot read is counted invalid, never wrong; both committed judge manifests are real graded runs (invalid: 0, validTotal: 135).
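The tolerance rule above (a number is trusted only when |A - B| <= 5.0 points) can be sketched as a small check. This is a minimal sketch with hypothetical function names; the inputs in the usage lines are the published Judge A / Judge B percentages from the baseline table on this page.

```typescript
const TOLERANCE_POINTS = 5.0;

// Absolute spread between two judges, rounded to one decimal for display.
function crossJudgeSpread(judgeA: number, judgeB: number): number {
  return Math.round(Math.abs(judgeA - judgeB) * 10) / 10;
}

// A cell survives only if the two judges agree within tolerance (inclusive).
function survivesCrossJudge(
  judgeA: number,
  judgeB: number,
  tolerance: number = TOLERANCE_POINTS,
): boolean {
  return Math.abs(judgeA - judgeB) <= tolerance;
}

console.log(survivesCrossJudge(60.0, 65.0)); // true  (multi-session, spread 5.0)
console.log(survivesCrossJudge(30.0, 45.0)); // false (single-session-preference, spread 15.0)
console.log(crossJudgeSpread(71.1, 73.3)); // 2.2   (overall)
```

The comparison is inclusive, so a spread of exactly 5.0 still counts as stable, which matches how the multi-session and temporal-reasoning rows are marked.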
Independence caveat (honest). Judge A and Judge B are two different OpenAI
models — this is a cross-MODEL check, not cross-PROVIDER. The intended
Anthropic judge failed on a same-provider answer+judge pairing
(the model resolved — the failure was the pairing, not resolution),
so a true cross-provider judge is still deferred. A same-provider pair is
a weaker independence test than cross-provider, so treat “survives” as
necessary-not-sufficient evidence.
The cross-judged baseline (the measured accuracy)
These are the only end-to-end QA-accuracy numbers Comis publishes: the measured baseline, each cell graded by both judges. Every number here is read back from the committed manifest benchmarks/results/2026-05-31-j1-baseline/
(cross-judge-spread.md, qa-report.judge-a.json, qa-report.judge-b.json,
retrieval-metrics.json, run-provenance.json).
| Metric | n | Judge A (gpt-4o) | Judge B (gpt-4.1) | Spread \|A-B\| | Cross-judge stable? |
|---|---|---|---|---|---|
| Overall (incl. locomo) | 135 | 71.1 | 73.3 | 2.2 | yes |
| LongMemEval-only overall | 120 | 68.3 | 70.0 | 1.7 | yes |
| knowledge-update | 20 | 75.0 | 75.0 | 0.0 | yes |
| multi-session | 20 | 60.0 | 65.0 | 5.0 | yes |
| temporal-reasoning | 20 | 45.0 | 40.0 | 5.0 | yes (weakest stable) |
| single-session-preference | 20 | 30.0 | 45.0 | 15.0 | NO — unstable |
| locomo (comparability-only) | 15 | 93.3 | 100.0 | 6.7 | no — never headlined |
| control (filesystem baseline) | 135 | 52.6 | 36.3 | 16.3 | context only |
retrieval-metrics.json:
recall@1 0.5734 / recall@3 0.7827 / recall@5 0.8450 / MRR 0.7883,
with the vector lane and the on-device reranker both lit (the production default,
not FTS-only).
Cost / latency (GAP-REPORT.md
§1): approx 15.5k tokens/query; end-to-end P50 6.25s / P95 9.97s.
How to read this table honestly:
- single-session-preference (30 vs 45 = 15 pt) does not survive the cross-judge spread: the two judges disagree sharply on what counts as a correct preference answer. It is a real-but-judge-noisy signal; do not read it as a precise figure.
- LoCoMo is comparability-only and is never headlined. Its score is wildly judge-dependent (under one judge a trivial filesystem baseline scores higher than the recall system); the instability itself is the reason it stays comparability-only.
- The control row is the Letta-style filesystem baseline, not Comis’s score. It is the baseline-is-not-weak proof: on LongMemEval the control trails Comis substantially under both judges (52.6 / 36.3 vs 71.1 / 73.3), so recall earns its keep on the real benchmark.
- Subset honesty. QA accuracy is measured on a disclosed category-stratified subset: 20 LongMemEval items per category (120 of 500, all 6 categories) + 15 LoCoMo QA = 135 per judge pass, selected first-N-per-category in file order (deterministic). Full-set sequential grading measured ~13-24h across both judge passes (genuinely infeasible in one session), so the J1 protocol’s disclosed-subset fallback applies (full is the default; silent subsetting is forbidden). Retrieval recall@k is on the full 500+10 set. The full published head-to-head on the complete set is the proving run.
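The first-N-per-category selection described in the subset note can be sketched as a single pass in file order. This is a minimal sketch: `firstNPerCategory` and the item shape are hypothetical, and the real selector may handle more fields.

```typescript
// Deterministic stratified subset: keep the first n items of each category,
// preserving file order. No randomness, so a rerun selects the same items.
function firstNPerCategory<T extends { category: string }>(items: T[], n: number): T[] {
  const taken = new Map<string, number>();
  const selected: T[] = [];
  for (const item of items) {
    const count = taken.get(item.category) ?? 0;
    if (count < n) {
      selected.push(item);
      taken.set(item.category, count + 1);
    }
  }
  return selected;
}

// Toy items, not dataset content.
const subset = firstNPerCategory(
  [
    { id: "a1", category: "a" },
    { id: "b1", category: "b" },
    { id: "a2", category: "a" },
    { id: "a3", category: "a" },
    { id: "b2", category: "b" },
  ],
  2,
);
console.log(subset.map((item) => item.id)); // [ 'a1', 'b1', 'a2', 'b2' ]
```

With 6 categories at n = 20 this yields 120 LongMemEval items, and adding the 15 LoCoMo questions gives the 135 per judge pass quoted above.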
What the structural tracks provably do (keyless, at $0)
This section is deliberately separate from the accuracy table above. The four tracks each ship a mechanical / structural gate delta: a lane surfaces a linked doc, a write lands at the right trust tier, a ranking knob reorders a candidate, an off-knob is byte-identical. Each is measured keyless, with no answer model, no judge, no key, no cost.
Each delta below is mechanical, keyless, $0, not a QA-accuracy lift. A “+1
linked-doc recall delta” is not “+1% accuracy.” The end-to-end QA-accuracy
lift for each track is honestly deferred to the operator-costed re-run (the
reproduce-via-the-gate panel).
Quoting a rank-delta as an accuracy percentage is exactly the fabrication this
benchmark forbids.
| Track | Measured delta (mechanical, keyless, $0) | Manifest |
|---|---|---|
| KG graph-spread lane | linked-doc recall delta +1 (OFF: linked doc absent -> ON: surfaced purely by the KG edge) | graph-spread-contribution-report.json |
| KG trust-first invalidation | 100% (2/2) older-high-trust-wins via the real upsertTriple | trust-first-kg-invalidation-report.json |
| Reasoning inductive write | written at learned, capped <= learned (0 system-trust inductive rows) | reasoning-write-correctness-report.json |
| IQ MMR diversity | diverse-doc rank OFF 3 -> ON 2 (diversityRankLift 1); lambda=1.0 byte-identical to OFF | mmr-diversity-report.json |
| IQ intent reweight | temporal candidate rank OFF 2 -> ON 1 (reweightRankLift 1) | intent-reweight-report.json |
| IQ NL temporal-range | in-window precision OFF 0.5 -> ON 1.0; unparseable query -> no filter (byte-identity) | temporal-range-report.json |
default-off-byte-identity-report.json).
**The proving machine, at 0 — the costed pass fills the real cross-judged
numbers.
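The MMR row in the table above (a diverse document moving from rank 3 to rank 2, and lambda=1.0 being byte-identical to OFF) can be illustrated with textbook maximal marginal relevance. This is a sketch of the standard MMR formula, not Comis’s implementation: `mmrRerank`, the candidate shape, and the toy similarity values are all assumptions chosen for illustration, not manifest data.

```typescript
interface Candidate {
  id: string;
  relevance: number;
}

type Similarity = (a: string, b: string) => number;

// Standard MMR: repeatedly pick the candidate maximizing
//   lambda * relevance - (1 - lambda) * (max similarity to anything already picked).
// Ties go to the earlier candidate, so lambda = 1 reproduces a relevance-sorted input.
function mmrRerank(candidates: Candidate[], sim: Similarity, lambda: number): Candidate[] {
  const remaining = [...candidates];
  const selected: Candidate[] = [];
  while (remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((cand, i) => {
      const penalty = selected.reduce((max, s) => Math.max(max, sim(cand.id, s.id)), 0);
      const score = lambda * cand.relevance - (1 - lambda) * penalty;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(...remaining.splice(bestIdx, 1));
  }
  return selected;
}

// Toy data: B is a near-duplicate of A; C is the diverse document.
const offOrder: Candidate[] = [
  { id: "A", relevance: 0.9 },
  { id: "B", relevance: 0.85 },
  { id: "C", relevance: 0.8 },
];
const toySim: Similarity = (a, b) => ([a, b].sort().join("|") === "A|B" ? 0.95 : 0.1);

console.log(mmrRerank(offOrder, toySim, 1.0).map((c) => c.id)); // [ 'A', 'B', 'C' ]
console.log(mmrRerank(offOrder, toySim, 0.5).map((c) => c.id)); // [ 'A', 'C', 'B' ]
```

At lambda = 1.0 the diversity term is multiplied by zero, so the output order equals the input order; that is the property a byte-identity check on the off-knob relies on.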
What the shipped capabilities provably do (keyless, at $0)
Like the structural-tracks section above, this is deliberately separate from the accuracy table. This set ships six new capabilities (a per-user profile, a per-channel relationship model, an ask-your-memory tool, a query-conditional usefulness reorder, a recall loop that learns which memories prove useful, and principled ranking decay of stale memories), each TDD-green. Every number below is a structural invariant of the shipped code, measured keyless, with no answer model, no judge, no key, no cost.
Every row below is mechanical, keyless, $0, not a QA-accuracy lift. The one
measured learning signal is a recall-score lift, not a QA-accuracy
percentage. The per-capability costed QA-accuracy lift and the competitor
head-to-head are the honestly-deferred operator-costed re-run (see the
reproduce-via-the-gate panel
and the consolidated re-prove manifest,
2026-06-01-phase113-reprove).
| Capability (shipped) | Measured delta (mechanical, keyless, $0) | Manifest |
|---|---|---|
| Per-user profile | typed per-user records round-trip 4/4; an external-trust upsert is rejected (0 rows); (tenant, agent, user) isolation; recall stays LLM-free | claim1-prefix-typing-report.json |
| Per-channel relationship model (ships default-off / dormant) | directional A→B and B→A stored as two distinct edges; the sign-off gate holds (enabled-but-unsigned ⇒ 0 reads, null block) | claim4-signoff-gate-report.json |
| Ask-your-memory tool (opt-in / default-off) | recall stays LLM-free (0 model calls on read); citations are a subset of the recalled ids; mandatory abstention on empty recall | consolidated re-prove GATE-REPORT.md |
| Query-conditional usefulness reorder | a memory used for intent X ranks 1 vs 2 for an X- vs Y-query (perIntentRankLift 1); citation→usefulness accrual; default-off byte-identical | claim1-per-intent-bucket-report.json |
| Learning-to-rank (trust frozen) | an opt-in loop learns which memories prove useful and bounded-tunes recall ranking from that signal; trust stays frozen under tuning; default-off byte-identical | claim1-bandit-rank-lift-report.json |
| The one measured learning signal | bandit recall-score lift +0.1 over 5 episodes (goldScoreLift 0.1, measured-positive); the gold’s rank position is flat on the keyless lane (rankLift 0) — never rounded into “+0.1% accuracy” | claim1-bandit-rank-lift-report.json |
| Principled ranking decay (eviction dormant / default-off) | old + unused rank lower (old/unused factor 0.553 < fresh 0.995; gap 0.441); decay ranks, never gates; byte-identical at neutral; dormant footprint 5→5 | claim2-deterministic-decay-report.json |
2026-06-01-phase113-reprove.
The competitive head-to-head (the costed re-run, finally measured)
The operator-costed competitor head-to-head that the section below frames as “reproduce via the gate” has now been run. This is the result, and it is a cross-judged accuracy number: the second real end-to-end QA-accuracy set after the measured baseline, kept clearly separate from the mechanical $0 deltas above. Every cell is read back from the committed manifest benchmarks/results/2026-06-02-phase114-prove2/
(head-to-head-report.json, GATE-REPORT.md, run-provenance.json).
The protocol matches the credibility contract: the competitor is re-run by us
under one protocol (same answer model + the same two judges), scored by two
independent judges, and a number stands only if it survives the cross-judge
spread. The answer model is claude-sonnet-4-6 (temp 0); the judges are
gpt-4o-2024-11-20 (the LongMemEval reference, the headline) and claude-sonnet-4-6
(the cross-judge). Cost and latency are recorded; COI is disclosed.
No superiority claim — Comis TIED mem0. Comis and mem0 score identically
(both 7/8). At this best-effort N=8 the two are statistically
indistinguishable on accuracy — a one-question difference is 12.5 points. The
honest framing is “competitive-with mem0 / at-$0-on-device,” never a superiority
claim (binding constraint #8). The differentiator is cost / latency / locality,
not a quality edge.
head-to-head-report.json:
| System | Judge 1 (gpt-4o) | Judge 2 (claude) | Spread \|1-2\| | vs control | Note |
|---|---|---|---|---|---|
| Comis (as-shipped recall) | 87.5% (7/8) | 87.5% | 0.0 (survives) | +37.5 pt | LLM-free recall, $0 on-device |
| mem0 (mem0ai 2.0.4, re-run by us) | 87.5% (7/8) | 87.5% | 0.0 (survives) | +37.5 pt | paid LLM fact-extraction at ingest (~53 min / 8 items) |
| letta-fs control (full-haystack dump) | 50.0% (4/8) | 50.0% | 0.0 (survives) | — | the honesty anchor |
- Comis is competitive with mem0, not ahead of it. Both answer 7/8; the cross-judge spread is 0.0 (the two judges agreed perfectly). At N=8 this is a tie, not a win.
- Both clear the full-dump control by +37.5 pt — so the bench is discriminating (ranked recall / extracted memory scores far above a naive full-context dump). This is the proof the comparison is meaningful, not a saturated bench where everything scores the same.
- The real difference is the production economics. Comis recall is **LLM-free and runs on-device at 0-on-device” axis.
capability-lift-report.json:
Comis’s as-shipped recall scores 98.0% (gpt-4o) / 94.0% (claude) (cross-judge
spread 4.0, survives) on a 50-item LongMemEval+LoCoMo mix. The two recall-config
capabilities (intent-reweight, forget) produce recall byte-identical to
baseline on all 50 questions -> +0.0 pt measured QA-lift (p=1.000). This is the
measure-first outcome: no recall-config capability showed a measured QA-lift, so
the activation phase that follows flips nothing by default.
Honest scope. N=8 and N=50 are best-effort operator-costed samples (real API
spend on the operator’s funded keys this run), not the definitive scale — the
committed harness +
mem0-runner.py + prove2-sample.json reproduce and extend
them. Zep, Hindsight, and Mnemosyne were skip-with-disclosure (not wired to the
protocol this run) — never a fabricated cell; the discriminated-union competitor
adapters make a fabricated competitor number structurally impossible. The earlier
“reproduce via the gate” framing below remains the path to extend this run to a
larger N or more competitors.
What is measured at $0 vs the operator-costed head-to-head
Two things are measured at $0 (keyless): the mechanical gate deltas above, and the proving-machine mechanism. The competitor head-to-head (the “Comis vs another memory system” accuracy comparison) is the operator-costed re-run, and its first measured result is the head-to-head above (best-effort N=8, cross-judged). To extend it to a larger N, or to the skip-with-disclosure competitors (Zep / Hindsight / Mnemosyne), you run it yourself: it needs answer + judge model credentials and the competitor systems installed, none of which is a Comis dependency. There is no fabricated competitor cell anywhere on this page (mirroring the fabricatedNumber: false discipline).
Run it yourself:
Operator-costed head-to-head (reproduce it yourself)
Reproducible report
buildBenchmarkReport produces a comparable, secret-free record of each run. It
captures:
- The run configuration: the extraction, answer, judge, embedding, and reranker model roles recorded as {provider, modelId} only. Credentials are structurally omitted: the input config is never spread into the report, so no key, token, or base URL can reach the written file.
- The dataset version as a sha256 over the dataset bytes.
- The recall defaults the run used.
- The results, carrying both the invalid count and the validTotal denominator.
- A timestamp and the harness version.
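The "structurally omitted" idea in the list above can be sketched as follows. This is a minimal sketch, not the real buildBenchmarkReport: the function name `buildReportSketch`, the input and output shapes, and the field names are assumptions. The point it illustrates is that each role is rebuilt from two named fields instead of spreading the input object, so a credential has no path into the output.

```typescript
import { createHash } from "node:crypto";

type RoleName = "extraction" | "answer" | "judge" | "embedding" | "reranker";

// Hypothetical input: the role config may carry secrets.
interface RoleInput {
  provider: string;
  modelId: string;
  apiKey?: string;
  baseUrl?: string;
}

interface ReportInput {
  roles: Record<RoleName, RoleInput>;
  datasetBytes: Uint8Array;
  recallDefaults: { maxResults: number; minScore: number; trustLevels: string[] };
  results: { total: number; invalid: number; correct: number };
  harnessVersion: string;
}

function buildReportSketch(input: ReportInput, now: Date = new Date()) {
  const roles = {} as Record<RoleName, { provider: string; modelId: string }>;
  for (const name of Object.keys(input.roles) as RoleName[]) {
    const role = input.roles[name];
    // Copy two named fields only; the input object is never spread.
    roles[name] = { provider: role.provider, modelId: role.modelId };
  }
  return {
    roles,
    datasetSha256: createHash("sha256").update(input.datasetBytes).digest("hex"),
    recallDefaults: { ...input.recallDefaults },
    results: { ...input.results, validTotal: input.results.total - input.results.invalid },
    timestamp: now.toISOString(),
    harnessVersion: input.harnessVersion,
  };
}
```

Because the omission is structural rather than a redaction pass, adding a new secret field to the input config later cannot leak it: only `provider` and `modelId` are ever read.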
Recall-trace quality view
For diagnosing a recall regression rather than scoring it, the offline analyzer analyzeRecallTrace folds the recall-trace JSONL into a quality view. It reports:
- rerank-lift-realized — the fraction of reranked recalls where the cross-encoder actually changed the ranking order (a rescale that preserves order counts as no lift).
- trust-filtered and deduped rates.
- per-factor score-factor distributions (recency, temporal, proof, trust).
- per-lane totals (FTS, vector, entity, temporal, causal) and degradation counts.
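The rerank-lift-realized metric in the list above can be sketched as follows. This is a minimal sketch: the trace shape (`before` / `after` id lists) and the function names are hypothetical, and the actual recall-trace JSONL schema is not documented here.

```typescript
// One reranked recall: result ids in rank order before and after the cross-encoder.
interface RerankedRecall {
  before: string[];
  after: string[];
}

// True only when the ranking order itself changed. A pure rescale of scores
// leaves the id order untouched, so it counts as no lift.
function orderChanged(recall: RerankedRecall): boolean {
  if (recall.before.length !== recall.after.length) return true;
  return recall.before.some((id, i) => id !== recall.after[i]);
}

// Fraction of reranked recalls where the reranker actually reordered results.
function rerankLiftRealized(recalls: RerankedRecall[]): number {
  if (recalls.length === 0) return 0;
  return recalls.filter(orderChanged).length / recalls.length;
}

// Toy traces, not real recall data.
const lift = rerankLiftRealized([
  { before: ["a", "b", "c"], after: ["a", "b", "c"] }, // rescale only
  { before: ["a", "b", "c"], after: ["b", "a", "c"] }, // reordered
  { before: ["x", "y"], after: ["x", "y"] },
  { before: ["p"], after: ["p"] },
]);
console.log(lift); // 0.25
```

Comparing ids rather than scores is what makes a rescale that preserves order count as zero lift.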
How this compares to Hindsight (honestly)
This is the crux, and it is easy to get subtly wrong.
Recall@k and QA accuracy measure different things; do not equate them.
Hindsight publishes end-to-end QA accuracy, graded by a paper-derived
category-specific LLM judge. It reads the datasets’ gold evidence ids but does
not score retrieval recall@k. Comis’s recall@k harness is an internal
regression proxy; Comis’s QA + judge harness is the apples-to-apples
number that matches what Hindsight publishes.
correct / (total - invalid) denominator, the same
two-judge spread) — exactly what was done for mem0 and is the path to
extend to Hindsight.
Vendor-reported figures graded by a different judge are non-comparable. What this
page documents is the methodology that makes the comparison fair when you run it,
and the recorded report configuration so the run conditions are known on both
sides.
Reproduce it
Copy-paste blocks. Nothing here auto-downloads a dataset or a model: you place the datasets out-of-band and point at local model files. The high-level entry points (the script’s own documented modes):
bench-memory entry points
*_API_KEY values are read from your shell environment — pass the env-var
name, never an inline secret. The written report records only
{provider, modelId} per role, so no credential reaches disk.
For the recall defaults the QA harness uses, see the
config reference.
Provenance
Every number on this page traces to the committed run manifest benchmarks/results/2026-05-31-j1-baseline/run-provenance.json
(and, for the structural-track deltas, the dated phase manifests linked in their table):
| Field | Value |
|---|---|
| Commit | af64462f |
| Node | v22.21.1 |
| Harness | phase-89-v1 |
| Datasets | LongMemEval xiaowu0162/longmemeval (longmemeval_s.json, 500) · LoCoMo snap-research/locomo (locomo10.json, 10) |
| Answer model | anthropic claude-sonnet-4-6 (temperature 0) |
| Judges | openai gpt-4o-2024-11-20 + openai gpt-4.1 (cross-model, not cross-provider) |
| Embedding | local nomic-embed-text-v1.5 |
| Reranker | local bge-reranker-v2-m3 |
{provider, modelId} per role — credentials are
structurally omitted, so no key, token, or base URL reaches the written report.
Conflict of interest: Comis authored this benchmark; vendor-reported competitor
numbers are non-comparable across protocols, and competitors are invited to
reproduce on their own harness.