Memory benchmarks

What this page is. This page describes how Comis measures the quality of its agent memory and how you can reproduce those measurements yourself. The posture is deliberate: an honest, reproducible methodology is the differentiator, not a single headline number. There are now real numbers to cite, and they come in two clearly separate kinds that must never be blended:

Accuracy (cross-judged). A cross-judged baseline measured under the disclosed protocol and graded by two independent LLM judges. Together with the best-effort competitive head-to-head reported further down, these are Comis’s only end-to-end QA-accuracy numbers.
Mechanical, keyless, at $0. A set of structural gate deltas for the structural tracks (KG / reasoning / query understanding) — a lane surfaces a linked doc, a write lands at the right trust tier, an off-knob is byte-identical. These run with no answer model, no judge, no key, no cost. They are not QA-accuracy lifts.

The competitor head-to-head (“vs another memory system”) is the operator-costed re-run. Its first measured result, a best-effort N=8 cross-judged comparison against mem0, is reported further down; extending it stays reproducible via the gate, and no cell is ever fabricated. The harnesses still assert only structural invariants: the published numbers come only from committed, re-runnable manifests, each linked below.

Conflict of interest, disclosed. Comis authored this benchmark. Vendor-reported competitor numbers are non-comparable across protocols (the judge model alone can swing a memory score from ~49% to ~94%); competitors are invited to reproduce on their own harness. No superiority claim over any competitor is made here. Such a claim would require a number measured under this protocol, surviving a cross-judge spread, with the competitor re-run under the same protocol. The one competitor re-run so far (mem0, N=8) ended in a tie, and the remaining competitors are deferred (see below).

Two distinct things are measured, and they are not interchangeable:

Retrieval recall@k / MRR — an internal regression proxy. It scores whether the recall pipeline surfaces the gold-labelled memories for a question. It is the gate every recall change is re-run against.
End-to-end QA accuracy — the apples-to-apples number. It drives recalled context through an answer model, grades correctness with a category-specific LLM judge, and reports overall plus per-category accuracy. This is the metric comparable to what other memory systems publish.

For how recall itself works (the pipeline, the lanes, the scoring factors), see Memory.

Datasets

Two standard long-term-memory datasets drive both harnesses:

LongMemEval — multi-session question-answering over long conversation histories. Each haystack session is ingested as one dated document. The per-turn has_answer evidence flags are stripped before ingestion (an eval-integrity strip — the gold label must never reach the stored content).
LoCoMo — long conversational memory with D<session>:<dialogue> evidence references. The loader excludes the category-5 adversarial items and parses the D<sess>:<dia> evidence into dialogue ids. Each session is ingested as one dated document.
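The two loader steps described above (stripping the per-turn gold flag and parsing LoCoMo evidence references) can be sketched as below. This is a minimal illustration, not the Comis loader: the `Turn` shape, `stripEvidenceFlags`, and `parseEvidenceRef` are hypothetical names, and only the `has_answer` field and the `D<sess>:<dia>` reference format come from this page.

```typescript
// Hypothetical turn shape: only `has_answer` is named on this page.
interface Turn {
  role: string;
  content: string;
  has_answer?: boolean;
}

// Eval-integrity strip: drop the gold flag so it can never reach stored content.
function stripEvidenceFlags(turns: Turn[]): Array<Omit<Turn, "has_answer">> {
  return turns.map(({ has_answer: _gold, ...rest }) => rest);
}

interface DialogueRef {
  session: number;
  dialogue: number;
}

// Parse a LoCoMo evidence reference such as "D3:12"; returns null when malformed.
function parseEvidenceRef(ref: string): DialogueRef | null {
  const m = /^D(\d+):(\d+)$/.exec(ref.trim());
  return m ? { session: Number(m[1]), dialogue: Number(m[2]) } : null;
}
```

Returning `null` for a malformed reference, rather than throwing, lets a loader count and report unparseable evidence instead of aborting the whole ingest.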

For both datasets the gold-evidence to memory-id mapping is recorded at ingestion time — the dataset reference is the lookup key, never an id, so a gold reference resolves to the actual stored MemoryEntry id.
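One way to picture that mapping is a small lookup keyed by the dataset reference and filled in as each document is stored. The sketch below is an assumption-laden illustration: `GoldMap`, `record`, and `resolve` are hypothetical names, and the real harness may structure this differently.

```typescript
// Hypothetical helper: the key is the dataset's own reference, never a store id.
type DatasetRef = string;

class GoldMap {
  private readonly byRef = new Map<DatasetRef, string>();

  // Called once per ingested document with the id the store actually assigned.
  record(ref: DatasetRef, memoryId: string): void {
    this.byRef.set(ref, memoryId);
  }

  // Resolve gold references to stored ids; unresolved refs are reported, not guessed.
  resolve(refs: DatasetRef[]): { ids: string[]; missing: DatasetRef[] } {
    const ids: string[] = [];
    const missing: DatasetRef[] = [];
    for (const ref of refs) {
      const id = this.byRef.get(ref);
      if (id === undefined) missing.push(ref);
      else ids.push(id);
    }
    return { ids, missing };
  }
}
```

Surfacing `missing` separately keeps a broken mapping visible: a gold reference that never resolved should show up as a loader problem, not silently lower recall.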

Licensing and no vendoring

The full datasets are not vendored in the repository (LoCoMo is CC BY-NC 4.0). Only two tiny, neutral placeholder fixtures ship in-repo so the structural unit tests can run in default CI. The full haystacks are operator-provided out-of-band, and the harness never auto-downloads datasets or models. This is a supply-chain invariant: a fresh checkout pulls down no benchmark corpus and no model weights.

Retrieval recall@k harness (a regression proxy)

The retrieval harness lives in retrieval-harness.bench.test.ts. It is the regression proxy — every recall change re-runs it and reads the recall@k / MRR delta. It is not a published score. What it does:

Env-gated behind COMIS_BENCH via describe.skipIf(!COMIS_BENCH). The default pnpm test and pnpm validate runs skip it entirely — no dataset weight, no model weight, no cost in CI.
Ingests into a real, isolated SQLite store — a fresh temp-directory SqliteMemoryAdapter (trust level learned, never ~/.comis), so the run cannot touch a live memory store.
Runs the live MemoryRecall pipeline per question (the same search -> fuse -> rerank -> score -> trust-filter -> dedup pipeline production uses) and scores recall@k / MRR with the shared scorer against the gold-mapped ids.
Asserts only structural invariants — recall@1 >= 0, recall@5 >= recall@1, mrr >= 0. It never asserts a machine-dependent accuracy floor. The recall@k / MRR number is printed via console.log, not baked into an assertion.
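For readers unfamiliar with the two metrics, the scoring step can be sketched as below. This is a generic formulation, not the shared scorer itself: it assumes recall@k means the fraction of gold ids found in the top k, that ranked ids are unique (the pipeline dedups), and the function names are hypothetical.

```typescript
interface Scored {
  recallAt1: number;
  recallAt5: number;
  mrr: number;
}

// Fraction of gold ids that appear in the top k (assumes unique ranked ids).
function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  if (gold.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / gold.size;
}

// 1 / rank of the first gold hit, or 0 when no gold id was recalled.
function reciprocalRank(ranked: string[], gold: Set<string>): number {
  const idx = ranked.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Average the per-question scores over the whole run.
function scoreRun(questions: Array<{ ranked: string[]; gold: string[] }>): Scored {
  const n = Math.max(questions.length, 1);
  let r1 = 0;
  let r5 = 0;
  let rr = 0;
  for (const q of questions) {
    const gold = new Set(q.gold);
    r1 += recallAtK(q.ranked, gold, 1);
    r5 += recallAtK(q.ranked, gold, 5);
    rr += reciprocalRank(q.ranked, gold);
  }
  return { recallAt1: r1 / n, recallAt5: r5 / n, mrr: rr / n };
}
```

Under this definition the structural invariants hold by construction: widening k can only add hits, so recall@5 is never below recall@1, and every score is non-negative.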

Commands

Default run (the suite is skipped):

pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts
# -> 1 file skipped, 4 tests skipped (describe.skipIf(!COMIS_BENCH) fires)

Gated, FTS-only (no models needed — the honest default lane):

COMIS_BENCH=1 pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts
# prints, for example:
# BENCH recall@k/MRR {"recallAt1":0.667,...} vectorLane: false rerank: false

The printed line above is illustrative over the tiny vendored fixtures; it is not a benchmark result. It is computed over three placeholder questions, so treat it as a smoke signal that the harness ran, nothing more.

With the vector and rerank lanes (operator-provided local model files):

COMIS_BENCH=1 \
  LLAMA_MODEL_PATH=/abs/path/to/embedding-model.gguf \
  LLAMA_RERANKER_MODEL_PATH=/abs/path/to/bge-reranker-v2-m3.gguf \
  pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts
# vectorLane: true rerank: true -- recall@k now reflects vector + cross-encoder retrieval

Against the full operator-placed haystack:

COMIS_BENCH=1 COMIS_BENCH_DATA=/abs/path/to/datasets \
  pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts
# reads $COMIS_BENCH_DATA/longmemeval.json + $COMIS_BENCH_DATA/locomo.json

Each resolved dataset path is confined under the base directory before any read (a .. escape is rejected). The datasets are placed out-of-band; nothing is downloaded.
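The confinement rule can be illustrated with a small path check. This is a POSIX-style sketch under stated assumptions: `confine` is a hypothetical helper, it does not follow symlinks, and a real harness would more likely lean on the platform's path library than on string handling.

```typescript
// Hypothetical helper: join a relative dataset path under a base directory,
// rejecting absolute paths and any ".." that would climb above the base.
function confine(baseDir: string, relative: string): string {
  if (relative.startsWith("/")) {
    throw new Error(`absolute path rejected: ${relative}`);
  }
  const kept: string[] = [];
  for (const seg of relative.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") {
      // Popping past the base would escape it, so reject instead.
      if (kept.length === 0) throw new Error(`path escapes base: ${relative}`);
      kept.pop();
    } else {
      kept.push(seg);
    }
  }
  if (kept.length === 0) throw new Error(`empty dataset path: ${relative}`);
  return `${baseDir.replace(/\/+$/, "")}/${kept.join("/")}`;
}
```

The check runs before any read, so a crafted dataset name cannot steer the loader to a file outside the operator-supplied directory.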

The FTS-only disclosure

With no LLAMA_MODEL_PATH, the store runs FTS5-only, so the reported recall@k reflects lexical-only retrieval — the printed flag reads vectorLane: false. Supplying the embedding model activates the vector lane; supplying the reranker model adds the cross-encoder lift. Read the vectorLane / rerank flags on the printed line so a lexical-only number is never mistaken for semantic recall.
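The relationship between the two model-path variables and the printed flags can be sketched as a pure function of the environment. This is an assumption about the shape of the logic, not the harness code: `laneFlags` and `describeLanes` are hypothetical names, and the sketch treats any non-empty path as "supplied".

```typescript
type Env = Record<string, string | undefined>;

interface LaneFlags {
  vectorLane: boolean;
  rerank: boolean;
}

// A lane is considered lit only when its model path is set and non-empty.
function laneFlags(env: Env): LaneFlags {
  const has = (name: string): boolean => (env[name] ?? "").trim().length > 0;
  return {
    vectorLane: has("LLAMA_MODEL_PATH"),
    rerank: has("LLAMA_RERANKER_MODEL_PATH"),
  };
}

// Human-readable label to print next to a recall@k number.
function describeLanes(flags: LaneFlags): string {
  if (!flags.vectorLane && !flags.rerank) return "lexical-only (FTS5)";
  const parts = ["FTS5"];
  if (flags.vectorLane) parts.push("vector");
  if (flags.rerank) parts.push("rerank");
  return parts.join(" + ");
}
```

Printing the label beside every number is the point of the disclosure: the reader never has to infer which lanes produced a given recall@k.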

End-to-end QA + LLM-judge accuracy (the apples-to-apples number)

The QA harness lives in qa-judge-harness.bench.test.ts. This is the metric comparable to what other memory systems publish. It ingests the datasets once, then for each question:

Recall the context for the question through the live pipeline.
Answer with an answer model at temperature 0.
Judge correctness with a category-specific judge model at temperature 0, using LongMemEval-paper-derived per-category rubrics (the rubric is placed first in the prompt, ahead of any dataset content).
Parse the judge verdict with a total parser — a verdict it cannot parse is counted as invalid, never as a wrong answer.
Aggregate overall and per-category accuracy using the invalid-excluded denominator, correct / (total - invalid), matching the reference Hindsight runner’s accounting.
Build a reproducible, secret-free report and write it to a confined path.
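The parse and aggregate steps are where the accounting is easiest to get wrong, so a small sketch may help. It assumes the judge replies with a leading yes/no-style word, which is an assumption about the reply format; `parseVerdict` and `aggregate` are hypothetical names. The denominator is the one stated above, correct / (total - invalid).

```typescript
type Verdict = "correct" | "wrong" | "invalid";

// Total parser: every input maps to exactly one Verdict and nothing throws.
// The accepted vocabulary is an assumption about the judge's reply format.
function parseVerdict(raw: string): Verdict {
  const t = raw.trim().toLowerCase();
  if (/^(yes|correct)\b/.test(t)) return "correct";
  if (/^(no|incorrect|wrong)\b/.test(t)) return "wrong";
  return "invalid";
}

interface Accuracy {
  total: number;
  invalid: number;
  validTotal: number;
  correct: number;
  accuracy: number; // percentage over the invalid-excluded denominator
}

function aggregate(verdicts: Verdict[]): Accuracy {
  const total = verdicts.length;
  const invalid = verdicts.filter((v) => v === "invalid").length;
  const correct = verdicts.filter((v) => v === "correct").length;
  const validTotal = total - invalid;
  const accuracy = validTotal === 0 ? 0 : (100 * correct) / validTotal;
  return { total, invalid, validTotal, correct, accuracy };
}
```

Keeping `invalid` as its own verdict means a judge formatting failure shrinks the denominator instead of being booked as a wrong answer, and the invalid count stays visible in the report.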

Gating and assertions:

Gated behind COMIS_BENCH plus a nested gate on the answer/judge model env — it runs only when you supply both model configurations. The default test run skips it.
Structural assertions only — 0 <= overall <= 100, validTotal === total - invalid, per-category bounds. There is no hard accuracy floor baked in.
A not.toMatch(/apiKey|sk-|Bearer/) assertion proves the written report is secret-free.
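The nested gate can be pictured as a pure function of the environment. This is a minimal sketch, assuming all three variables per role are required and that any non-empty `COMIS_BENCH` opens the master gate; `roleConfigured` and `qaHarnessEnabled` are hypothetical names, not the harness's API.

```typescript
type Env = Record<string, string | undefined>;

const ROLE_VARS = ["PROVIDER", "MODEL", "API_KEY"] as const;

// A role counts as configured only when every one of its variables is non-empty.
function roleConfigured(env: Env, role: "ANSWER" | "JUDGE"): boolean {
  return ROLE_VARS.every((v) => (env[`COMIS_BENCH_${role}_${v}`] ?? "") !== "");
}

// Master gate first, then the nested answer/judge gate.
function qaHarnessEnabled(env: Env): boolean {
  return (
    Boolean(env["COMIS_BENCH"]) &&
    roleConfigured(env, "ANSWER") &&
    roleConfigured(env, "JUDGE")
  );
}
```

With only `COMIS_BENCH=1` set, the function returns false, which mirrors the behaviour described above: the QA harness is skipped unless both model configurations are supplied.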

Operator environment

All values are provided out-of-band — the page lists the variable names and placeholders only, never a real key, token, or tokened model URL:

| Variable | Role |
| --- | --- |
| `COMIS_BENCH` | Master gate (set to `1` to enable the harness) |
| `COMIS_BENCH_ANSWER_PROVIDER` | Answer model provider (placeholder, e.g. `<provider>`) |
| `COMIS_BENCH_ANSWER_MODEL` | Answer model id (placeholder, e.g. `<model>`) |
| `COMIS_BENCH_ANSWER_API_KEY` | Answer model credential (env-var name only) |
| `COMIS_BENCH_JUDGE_PROVIDER` | Judge model provider (placeholder) |
| `COMIS_BENCH_JUDGE_MODEL` | Judge model id (placeholder) |
| `COMIS_BENCH_JUDGE_API_KEY` | Judge model credential (env-var name only) |
| `COMIS_BENCH_DATA` | Optional: absolute path to the full haystack directory |
| `LLAMA_MODEL_PATH` | Optional: absolute path to the embedding GGUF (vector lane) |
| `LLAMA_RERANKER_MODEL_PATH` | Optional: absolute path to the reranker GGUF (rerank lift) |

The harness uses production-representative recall defaults so the accuracy number reflects the shipped behaviour — recency 0.2, temporal 0.2, proof 0.1, trust 0.1, maxResults 5, minScore 0.1, trust levels system and learned, with reranking default-off in the bench unless a model path is supplied. Those defaults are recorded in the report (below) so a run is reproducible.

Judges and prompts (cross-judge >= 2)

The judge model is the single biggest source of non-comparable memory-benchmark numbers — the same answers have been reported at ~49% under an independent judge and ~94% self-judged. So no per-category number is trusted until two independent judges agree within tolerance (|A - B| <= 5.0 points).

Judge A: OpenAI gpt-4o-2024-11-20.
Judge B: OpenAI gpt-4.1.
The rubric is placed first in the prompt, ahead of any dataset content, and is the LongMemEval-paper-derived per-category rubric (see the QA harness section above) — so grading is category-appropriate, not one generic yes/no.
A verdict the total parser cannot read is counted invalid, never wrong; both committed judge manifests are real graded runs (invalid: 0, validTotal: 135).
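The tolerance rule is simple arithmetic, sketched below with the four per-category scores from the baseline table on this page. `spread` and `survives` are hypothetical names; the 5.0-point tolerance and the scores are taken from this page.

```typescript
const TOLERANCE = 5.0; // points, from |A - B| <= 5.0

interface Cell {
  metric: string;
  judgeA: number;
  judgeB: number;
}

// Round to one decimal so floating-point noise does not leak into the report.
function spread(cell: Cell): number {
  return Math.round(Math.abs(cell.judgeA - cell.judgeB) * 10) / 10;
}

function survives(cell: Cell): boolean {
  return spread(cell) <= TOLERANCE;
}

// Per-category cells as published in the cross-judged baseline table.
const baselineCells: Cell[] = [
  { metric: "knowledge-update", judgeA: 75.0, judgeB: 75.0 },
  { metric: "multi-session", judgeA: 60.0, judgeB: 65.0 },
  { metric: "temporal-reasoning", judgeA: 45.0, judgeB: 40.0 },
  { metric: "single-session-preference", judgeA: 30.0, judgeB: 45.0 },
];

const stable = baselineCells.filter(survives).map((c) => c.metric);
// stable holds the first three categories; single-session-preference (spread 15.0) is excluded
```

A spread of exactly 5.0 still survives because the rule is inclusive, which is why multi-session and temporal-reasoning count as stable.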

Independence caveat (honest). Judge A and Judge B are two different OpenAI models — this is a cross-MODEL check, not cross-PROVIDER. The intended Anthropic judge failed on a same-provider answer+judge pairing (the model resolved — the failure was the pairing, not resolution), so a true cross-provider judge is still deferred. A same-provider pair is a weaker independence test than cross-provider, so treat “survives” as necessary-not-sufficient evidence.

The cross-judged baseline (the measured accuracy)

These are the baseline end-to-end QA-accuracy numbers Comis publishes, each cell graded by both judges. The only other end-to-end accuracy set on this page is the best-effort competitive head-to-head further down. Every number here is read back from the committed manifest benchmarks/results/2026-05-31-j1-baseline/ (cross-judge-spread.md, qa-report.judge-a.json, qa-report.judge-b.json, retrieval-metrics.json, run-provenance.json).

| Metric | n | Judge A (gpt-4o) | Judge B (gpt-4.1) | Spread \|A-B\| | Cross-judge stable? |
| --- | --- | --- | --- | --- | --- |
| Overall (incl. locomo) | 135 | 71.1 | 73.3 | 2.2 | yes |
| LongMemEval-only overall | 120 | 68.3 | 70.0 | 1.7 | yes |
| knowledge-update | 20 | 75.0 | 75.0 | 0.0 | yes |
| multi-session | 20 | 60.0 | 65.0 | 5.0 | yes |
| temporal-reasoning | 20 | 45.0 | 40.0 | 5.0 | yes (weakest stable) |
| single-session-preference | 20 | 30.0 | 45.0 | 15.0 | NO (unstable) |
| locomo (comparability-only) | 15 | 93.3 | 100.0 | 6.7 | no (never headlined) |
| control (filesystem baseline) | 135 | 52.6 | 36.3 | 16.3 | context only |

Retrieval (FULL set, 500 LongMemEval + 10 LoCoMo), from retrieval-metrics.json: recall@1 0.5734 / recall@3 0.7827 / recall@5 0.8450 / MRR 0.7883, with the vector lane and the on-device reranker both enabled (the production default, not FTS-only).

Cost / latency (GAP-REPORT.md §1): approx 15.5k tokens/query; end-to-end P50 6.25s / P95 9.97s.

How to read this table honestly:

single-session-preference (30 vs 45 = 15pt) does not survive the cross-judge spread — the two judges disagree sharply on what counts as a correct preference answer. It is a real-but-judge-noisy signal; do not read it as a precise figure.
LoCoMo is comparability-only and is never headlined. Its score is wildly judge-dependent (under one judge a trivial filesystem baseline scores higher than the recall system) — the instability itself is the reason it stays comparability-only.
The control row is the Letta-style filesystem baseline, not Comis’s score. It is the baseline-is-not-weak proof: on LongMemEval the control trails Comis substantially under both judges (52.6 / 36.3 vs 71.1 / 73.3), so recall earns its keep on the real benchmark.
Subset honesty. QA accuracy is measured on a disclosed category-stratified subset — 20 LongMemEval items per category (120 of 500, all 6 categories) + 15 LoCoMo QA = 135 per judge pass, selected first-N-per-category in file order (deterministic). Full-set sequential grading measured ~13-24h across both judge passes — genuinely infeasible in one session — so the J1 protocol’s disclosed-subset fallback applies (full is the default; silent subsetting is forbidden). Retrieval recall@k is on the full 500+10 set. The full published head-to-head on the complete set is the proving run.
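The first-N-per-category rule is deterministic because it depends only on file order. A minimal sketch, with `firstNPerCategory` and the `Item` shape as hypothetical names:

```typescript
interface Item {
  id: string;
  category: string;
}

// Walk the items in file order and keep the first n seen in each category.
function firstNPerCategory<T extends Item>(items: T[], n: number): T[] {
  const taken = new Map<string, number>();
  const picked: T[] = [];
  for (const item of items) {
    const count = taken.get(item.category) ?? 0;
    if (count < n) {
      picked.push(item);
      taken.set(item.category, count + 1);
    }
  }
  return picked;
}
```

Because there is no sampling or shuffling, two operators with the same dataset file select the same subset, which is what makes a disclosed subset reproducible rather than cherry-picked.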

What the structural tracks provably do (keyless, at $0)

This section is deliberately separate from the accuracy table above. The four tracks each ship a mechanical / structural gate delta — a lane surfaces a linked doc, a write lands at the right trust tier, a ranking knob reorders a candidate, an off-knob is byte-identical. Each is measured keyless, with no answer model, no judge, no key, no cost.

Each delta below is mechanical, keyless, $0 — not a QA-accuracy lift. A “+1 linked-doc recall delta” is not “+1% accuracy.” The end-to-end QA-accuracy lift for each track is honestly deferred to the operator-costed re-run (the reproduce-via-the-gate panel). Quoting a rank-delta as an accuracy percentage is exactly the fabrication this benchmark forbids.

| Track | Measured delta (mechanical, keyless, $0) | Manifest |
| --- | --- | --- |
| KG graph-spread lane | linked-doc recall delta +1 (OFF: linked doc absent -> ON: surfaced purely by the KG edge) | `graph-spread-contribution-report.json` |
| KG trust-first invalidation | 100% (2/2) older-high-trust-wins via the real `upsertTriple` | `trust-first-kg-invalidation-report.json` |
| Reasoning inductive write | written at `learned`, capped <= `learned` (0 `system`-trust inductive rows) | `reasoning-write-correctness-report.json` |
| IQ MMR diversity | diverse-doc rank OFF 3 -> ON 2 (`diversityRankLift` 1); lambda=1.0 byte-identical to OFF | `mmr-diversity-report.json` |
| IQ intent reweight | temporal candidate rank OFF 2 -> ON 1 (`reweightRankLift` 1) | `intent-reweight-report.json` |
| IQ NL temporal-range | in-window precision OFF 0.5 -> ON 1.0; unparseable query -> no filter (byte-identity) | `temporal-range-report.json` |
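To see why an MMR knob at lambda=1.0 can be byte-identical to OFF, consider the textbook maximal-marginal-relevance loop. The sketch below is the generic algorithm on toy data, not Comis's implementation; the `Candidate` shape, the vectors, and the relevance values are invented for illustration.

```typescript
interface Candidate {
  id: string;
  relevance: number;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Greedy MMR: score = lambda * relevance - (1 - lambda) * max similarity to picks.
function mmr(cands: Candidate[], lambda: number): string[] {
  const remaining = [...cands];
  const selected: Candidate[] = [];
  while (remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const c = remaining[i];
      const redundancy =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(c.vector, s.vector)));
      const score = lambda * c.relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected.map((c) => c.id);
}

// Toy data: "b" is a near-duplicate of "a"; "c" is less relevant but diverse.
const toy: Candidate[] = [
  { id: "a", relevance: 0.9, vector: [1, 0] },
  { id: "b", relevance: 0.8, vector: [1, 0.1] },
  { id: "c", relevance: 0.7, vector: [0, 1] },
];
```

With `lambda = 1.0` the redundancy term is multiplied by zero, so `mmr(toy, 1.0)` returns plain relevance order (a, b, c). With `lambda = 0.5` the near-duplicate is penalised and the diverse document moves from rank 3 to rank 2 (a, c, b), the same kind of rank shift the table row reports.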

Default-OFF byte-identity. Every structural factor is default-off and, when off, byte-identical to the baseline shipping config: no silent behaviour change, zero category regression (default-off-byte-identity-report.json).

The proving machine, at $0. The per-release gate runs the cross-judge spread keyless on injected verdicts: 3 of 4 categories survive the 5pt tolerance; the 15pt preference category does not (disclosed, consistent with the baseline above) ([cross-judge-spread.json](https://github.com/comisai/comis/blob/main/benchmarks/results/2026-06-01-phase104-prove/cross-judge-spread.json)). This proves the mechanism at $0; the costed pass fills the real cross-judged numbers.

What the shipped capabilities provably do (keyless, at $0)

Like the structural-tracks section above, this is deliberately separate from the accuracy table. This set ships six new capabilities — a per-user profile, a per-channel relationship model, an ask-your-memory tool, a query-conditional usefulness reorder, a recall loop that learns which memories prove useful, and principled ranking decay of stale memories — each TDD-green. Every number below is a structural invariant of the shipped code, measured keyless, with no answer model, no judge, no key, no cost.

Every row below is mechanical, keyless, $0 — not a QA-accuracy lift. The one measured learning signal is a recall-score lift, not a QA-accuracy percentage. The per-capability costed QA-accuracy lift and the competitor head-to-head are the honestly-deferred operator-costed re-run (see the reproduce-via-the-gate panel and the consolidated re-prove manifest, 2026-06-01-phase113-reprove).

| Capability (shipped) | Measured delta (mechanical, keyless, $0) | Manifest |
| --- | --- | --- |
| Per-user profile | typed per-user records round-trip 4/4; an external-trust upsert is rejected (0 rows); (tenant, agent, user) isolation; recall stays LLM-free | `claim1-prefix-typing-report.json` |
| Per-channel relationship model (ships default-off / dormant) | directional A→B and B→A stored as two distinct edges; the sign-off gate holds (enabled-but-unsigned ⇒ 0 reads, null block) | `claim4-signoff-gate-report.json` |
| Ask-your-memory tool (opt-in / default-off) | recall stays LLM-free (0 model calls on read); citations are a subset of the recalled ids; mandatory abstention on empty recall | consolidated re-prove `GATE-REPORT.md` |
| Query-conditional usefulness reorder | a memory used for intent X ranks 1 vs 2 for an X- vs Y-query (`perIntentRankLift` 1); citation→usefulness accrual; default-off byte-identical | `claim1-per-intent-bucket-report.json` |
| Learning-to-rank (trust frozen) | an opt-in loop learns which memories prove useful and bounded-tunes recall ranking from that signal; trust stays frozen under tuning; default-off byte-identical | `claim1-bandit-rank-lift-report.json` |
| The one measured learning signal | bandit recall-score lift +0.1 over 5 episodes (`goldScoreLift` 0.1, measured-positive); the gold’s rank position is flat on the keyless lane (`rankLift` 0); never rounded into “+0.1% accuracy” | `claim1-bandit-rank-lift-report.json` |
| Principled ranking decay (eviction dormant / default-off) | old + unused rank lower (old/unused factor 0.553 < fresh 0.995; gap 0.441); decay ranks, never gates; byte-identical at neutral; dormant footprint 5→5 | `claim2-deterministic-decay-report.json` |

Default-OFF byte-identity. As with the structural tracks, every capability factor is default-off and, when off, byte-identical to the shipping config — no silent behaviour change, zero category regression. The costed per-capability QA-accuracy lift and the competitor head-to-head are not measured here — they are the operator-costed re-run consolidated in 2026-06-01-phase113-reprove.

The competitive head-to-head (the costed re-run, finally measured)

The operator-costed competitor head-to-head, which the section below frames as “reproduce via the gate”, has now been run. The result is a cross-judged accuracy number: the second real end-to-end QA-accuracy set after the measured baseline, kept clearly separate from the mechanical $0 deltas above. Every cell is read back from the committed manifest benchmarks/results/2026-06-02-phase114-prove2/ (head-to-head-report.json, GATE-REPORT.md, run-provenance.json). The protocol matches the credibility contract: the competitor is re-run by us under one protocol (the same answer model and the same two judges), and a number stands only if it survives the cross-judge spread. The answer model is claude-sonnet-4-6 (temp 0); the judges are gpt-4o-2024-11-20 (the LongMemEval reference, the headline) and claude-sonnet-4-6 (the cross-judge). Cost and latency are recorded; COI is disclosed.

No superiority claim — Comis TIED mem0. Comis and mem0 score identically (both 7/8). At this best-effort N=8 the two are statistically indistinguishable on accuracy — a one-question difference is 12.5 points. The honest framing is “competitive-with mem0 / at-$0-on-device,” never a superiority claim (binding constraint #8). The differentiator is cost / latency / locality, not a quality edge.

Head-to-head (N=8 LongMemEval, cross-judged, spread 0.0 on every cell) — each row read from head-to-head-report.json:

| System | Judge 1 (gpt-4o) | Judge 2 (claude) | Spread \|1-2\| | vs control | Note |
| --- | --- | --- | --- | --- | --- |
| Comis (as-shipped recall) | 87.5% (7/8) | 87.5% | 0.0 (survives) | +37.5 pt | LLM-free recall, $0 on-device |
| mem0 (`mem0ai 2.0.4`, re-run by us) | 87.5% (7/8) | 87.5% | 0.0 (survives) | +37.5 pt | paid LLM fact-extraction at ingest (~53 min / 8 items) |
| letta-fs control (full-haystack dump) | 50.0% (4/8) | 50.0% | 0.0 (survives) | n/a | the honesty anchor |

How to read this honestly:

Comis is competitive with mem0, not ahead of it. Both answer 7/8; the cross-judge spread is 0.0 (the two judges agreed perfectly). At N=8 this is a tie, not a win.
Both clear the full-dump control by +37.5 pt — so the bench is discriminating (ranked recall / extracted memory scores far above a naive full-context dump). This is the proof the comparison is meaningful, not a saturated bench where everything scores the same.
The real difference is the production economics. Comis recall is LLM-free and runs on-device at $0 (local embed + rerank); mem0 spent real OpenAI fact-extraction across ~53 minutes of ingest for the same 8 items. Equal answer quality, very different cost: the “competitive-with / at-$0-on-device” axis.
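The arithmetic behind "a tie, not a win" is worth spelling out, since at N=8 a single question moves the score by 12.5 points. The sketch below uses only the counts from the head-to-head table; `pct` and `withinOneQuestion` are hypothetical helper names, and the one-question threshold is a reading aid, not a significance test.

```typescript
// Percentage correct for a run of n questions.
function pct(correct: number, n: number): number {
  return (100 * correct) / n;
}

// Two systems differing by less than one question's worth cannot be separated at this N.
function withinOneQuestion(a: number, b: number, n: number): boolean {
  return Math.abs(a - b) < 100 / n;
}

const N = 8;
const comisPct = pct(7, N); // 87.5
const mem0Pct = pct(7, N); // 87.5
const controlPct = pct(4, N); // 50
const pointsPerQuestion = 100 / N; // 12.5
const liftOverControl = comisPct - controlPct; // 37.5
const tied = withinOneQuestion(comisPct, mem0Pct, N); // true
```

The same helper shows why the control comparison is more informative: a 37.5-point gap is three questions wide, while the Comis and mem0 scores are identical.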

Per-capability QA-lift (N=50 mix, cross-judged; measure-first) — capability-lift-report.json: Comis’s as-shipped recall scores 98.0% (gpt-4o) / 94.0% (claude) (cross-judge spread 4.0, survives) on a 50-item LongMemEval+LoCoMo mix. The two recall-config capabilities (intent-reweight, forget) produce recall byte-identical to baseline on all 50 questions -> +0.0 pt measured QA-lift (p=1.000). This is the measure-first outcome: no recall-config capability showed a measured QA-lift, so the activation phase that follows flips nothing by default.

Honest scope. N=8 and N=50 are best-effort operator-costed samples (real API spend on the operator’s funded keys this run), not the definitive scale — the committed harness + mem0-runner.py + prove2-sample.json reproduce and extend them. Zep, Hindsight, and Mnemosyne were skip-with-disclosure (not wired to the protocol this run) — never a fabricated cell; the discriminated-union competitor adapters make a fabricated competitor number structurally impossible. The earlier “reproduce via the gate” framing below remains the path to extend this run to a larger N or more competitors.
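The "structurally impossible" claim rests on a discriminated union: a competitor cell is either measured, and then must carry its counts and manifest, or skipped, and then carries no number at all. The sketch below illustrates the idea only; the field names and `renderCell` are hypothetical and are not taken from the Comis adapters.

```typescript
// A cell is either measured (with provenance) or skipped (with a reason).
// There is no variant that holds a score without counts and a manifest.
type CompetitorCell =
  | { kind: "measured"; system: string; correct: number; total: number; manifest: string }
  | { kind: "skipped"; system: string; reason: string };

function renderCell(cell: CompetitorCell): string {
  switch (cell.kind) {
    case "measured": {
      const score = ((100 * cell.correct) / cell.total).toFixed(1);
      return `${cell.system}: ${score}% (${cell.correct}/${cell.total})`;
    }
    case "skipped":
      return `${cell.system}: skipped (${cell.reason})`;
  }
}

const cells: CompetitorCell[] = [
  { kind: "measured", system: "mem0", correct: 7, total: 8, manifest: "head-to-head-report.json" },
  { kind: "skipped", system: "Zep", reason: "not wired to the protocol this run" },
];
```

Because the percentage is derived from `correct` and `total` at render time, there is nowhere to type in a free-standing score, and the compiler rejects a skipped cell that tries to carry one.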

What is measured at $0 vs the operator-costed head-to-head

Two things are measured at $0 (keyless): the mechanical gate deltas above, and the proving-machine mechanism. The competitor head-to-head (the “Comis vs another memory system” accuracy comparison) is the operator-costed re-run, and its first measured result is the head-to-head above (best-effort N=8, cross-judged). Extending it to a larger N, or to the skip-with-disclosure competitors (Zep / Hindsight / Mnemosyne), needs answer + judge model credentials and the competitor systems installed, none of which is a Comis dependency. There is no fabricated competitor cell anywhere on this page (mirroring the fabricatedNumber: false discipline). To run it yourself:

Operator-costed head-to-head (reproduce it yourself)

# 1. Populate the git-ignored operator env (NEVER committed):
cp scripts/bench-memory.env.example scripts/bench-memory.env
#    Fill COMIS_BENCH_ANSWER_* + COMIS_BENCH_JUDGE_* (run a SECOND judge pass
#    for the cross-judge spread). Values are env-var names, never inline keys.

# 2. Install the competitor systems (operator/external -- NEVER a Comis dependency):
#    mem0:  MEM0_API_KEY + the mem0ai package
#    zep:   ZEP_API_KEY  + the @getzep/zep-js package
#    others: clone + build the sibling repositories

# 3. Run the per-release continuous gate (appends the real dated row under
#    benchmarks/results/history/):
scripts/bench-memory.sh gate

The costed re-run is now done and cross-judged (the head-to-head above), and its honest outcome is “competitive-with mem0 / at-$0-on-device” — a tie, never a superiority headline. The gate command above stays the path to extend it to a larger N or more competitors.

Reproducible report

buildBenchmarkReport produces a comparable, secret-free record of each run. It captures:

The run configuration — the extraction, answer, judge, embedding, and reranker model roles recorded as {provider, modelId} only. Credentials are structurally omitted: the input config is never spread into the report, so no key, token, or base URL can reach the written file.
The dataset version as a sha256 over the dataset bytes.
The recall defaults the run used.
The results, carrying both the invalid count and the validTotal denominator.
A timestamp and the harness version.
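"Structurally omitted" can be read as: the report is assembled by copying named fields, never by spreading the input config. The sketch below illustrates that pattern under stated assumptions. It is not `buildBenchmarkReport`; every type and function name here is hypothetical, and the dataset hash is passed in precomputed to keep the example dependency-free.

```typescript
interface RoleInput {
  provider: string;
  modelId: string;
  apiKey?: string; // present on the input, never copied to the report
  baseUrl?: string; // likewise never copied
}

interface RoleRecord {
  provider: string;
  modelId: string;
}

// Copy the two public fields by name; never spread the input object.
function toRecord(role: RoleInput): RoleRecord {
  return { provider: role.provider, modelId: role.modelId };
}

interface ReportInput {
  roles: Record<string, RoleInput>;
  datasetSha256: string;
  recallDefaults: Record<string, number>;
  results: { total: number; invalid: number; correct: number };
  harnessVersion: string;
}

function buildReportSketch(input: ReportInput, now: Date = new Date()) {
  const roles: Record<string, RoleRecord> = {};
  for (const [name, role] of Object.entries(input.roles)) {
    roles[name] = toRecord(role);
  }
  const { total, invalid, correct } = input.results;
  return {
    roles,
    datasetSha256: input.datasetSha256,
    recallDefaults: { ...input.recallDefaults },
    results: { total, invalid, correct, validTotal: total - invalid },
    timestamp: now.toISOString(),
    harnessVersion: input.harnessVersion,
  };
}
```

Serialising the result and checking it against a pattern such as `/apiKey|sk-|Bearer/` then acts as a second line of defence: the allow-list construction keeps secrets out, and the assertion proves it on every run.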

Because the config, the dataset hash, and the denominator are all recorded, two runs are comparable across code changes — and a future apples-to-apples comparison against another system’s published figures is a fair one, run on the same datasets with the same accounting.

Recall-trace quality view

For diagnosing a recall regression rather than scoring it, the offline analyzer analyzeRecallTrace folds the recall-trace JSONL into a quality view. It reports:

rerank-lift-realized — the fraction of reranked recalls where the cross-encoder actually changed the ranking order (a rescale that preserves order counts as no lift).
trust-filtered and deduped rates.
per-factor score-factor distributions (recency, temporal, proof, trust).
per-lane totals (FTS, vector, entity, temporal, causal) and degradation counts.

It reads ids, numeric breakdowns, and closed-union reason codes only — never memory bodies — so it is safe to run over a captured trace. It is the diagnostic companion to the recall@k harness: when a recall@k delta moves, the quality view explains which stage moved it.
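The rerank-lift-realized figure, for example, reduces to an order comparison over ids. The sketch below assumes a trace row exposes the id order before and after reranking; the `RerankTrace` shape and both function names are hypothetical, and the real JSONL schema may differ.

```typescript
// Hypothetical trace row: candidate ids before and after the cross-encoder.
interface RerankTrace {
  before: string[];
  after: string[];
}

// A rescale that leaves the id order untouched counts as no lift.
function orderChanged(t: RerankTrace): boolean {
  return (
    t.before.length !== t.after.length ||
    t.before.some((id, i) => id !== t.after[i])
  );
}

// Fraction of reranked recalls where the order actually changed.
function rerankLiftRealized(traces: RerankTrace[]): number {
  if (traces.length === 0) return 0;
  return traces.filter(orderChanged).length / traces.length;
}
```

Only ids are compared, which matches the analyzer's stated constraint: nothing in this calculation needs a memory body.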

How this compares to Hindsight (honestly)

This is the crux, and it is easy to get subtly wrong.

Recall@k and QA accuracy measure different things — do not equate them. Hindsight publishes end-to-end QA accuracy, graded by a paper-derived category-specific LLM judge. It reads the datasets’ gold evidence ids but does not score retrieval recall@k. Comis’s recall@k harness is an internal regression proxy; Comis’s QA + judge harness is the apples-to-apples number that matches what Hindsight publishes.

So the fair comparison is Comis’s QA + judge accuracy against Hindsight’s published QA accuracy — both end-to-end, both judged, both on the same datasets, both using the invalid-excluded denominator. Comis’s recall@k number is a different measurement (retrieval only, no generation, no judge) and is not comparable to Hindsight’s scoreboard. Comis now has published QA-accuracy numbers — the cross-judged baseline above (overall 71.1 / 73.3) and the competitive head-to-head (competitive-with mem0 at N=8) — but this page still states no comparative result against Hindsight: Hindsight was skip-with-disclosure in that run, not re-run under this protocol. A fair side-by-side requires the other system re-run under this exact protocol (same datasets, same correct / (total - invalid) denominator, the same two-judge spread) — exactly what was done for mem0 and is the path to extend to Hindsight. Vendor-reported figures graded by a different judge are non-comparable. What this page documents is the methodology that makes the comparison fair when you run it, and the recorded report configuration so the run conditions are known on both sides.

Reproduce it

Copy-paste blocks. Nothing here auto-downloads a dataset or a model — you place the datasets out-of-band and point at local model files. The high-level entry points (the script’s own documented modes):

bench-memory entry points

pnpm bench:memory dry                  # keyless smoke over the vendored fixtures ($0)
pnpm bench:memory retrieval            # recall@k / MRR (keyless; lights vector/rerank if model paths set)
pnpm bench:memory qa                   # end-to-end QA + judge (requires answer + judge env)
scripts/bench-memory.sh suite <tier>   # one SUITE tier -> committed secret-free report
scripts/bench-memory.sh head-to-head   # the proving machine, keyless at $0
scripts/bench-memory.sh gate           # the per-release regression gate

The lower-level Vitest invocations the modes wrap (useful for running one harness in isolation):

Retrieval recall@k, FTS-only (no models):

COMIS_BENCH=1 \
  pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts

Retrieval recall@k, full haystack plus the vector and rerank lanes:

COMIS_BENCH=1 \
  COMIS_BENCH_DATA=/abs/path/to/datasets \
  LLAMA_MODEL_PATH=/abs/path/to/embedding-model.gguf \
  LLAMA_RERANKER_MODEL_PATH=/abs/path/to/bge-reranker-v2-m3.gguf \
  pnpm vitest run packages/agent/src/memory/benchmark/retrieval-harness.bench.test.ts

End-to-end QA + LLM-judge accuracy (answer and judge models required):

COMIS_BENCH=1 \
  COMIS_BENCH_ANSWER_PROVIDER=<provider> COMIS_BENCH_ANSWER_MODEL=<model> COMIS_BENCH_ANSWER_API_KEY=$ANSWER_KEY \
  COMIS_BENCH_JUDGE_PROVIDER=<provider> COMIS_BENCH_JUDGE_MODEL=<model> COMIS_BENCH_JUDGE_API_KEY=$JUDGE_KEY \
  COMIS_BENCH_DATA=/abs/path/to/datasets \
  pnpm vitest run packages/agent/src/memory/benchmark/qa-judge-harness.bench.test.ts

The *_API_KEY values are read from your shell environment: reference a variable that is already set (as with $ANSWER_KEY and $JUDGE_KEY above) rather than typing a secret inline. The written report records only {provider, modelId} per role, so no credential reaches disk. For the recall defaults the QA harness uses, see the config reference.

Provenance

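The baseline numbers on this page trace to the committed run manifest benchmarks/results/2026-05-31-j1-baseline/run-provenance.json. The structural-track and capability deltas trace to the dated phase manifests named in their tables, and the head-to-head numbers trace to benchmarks/results/2026-06-02-phase114-prove2/. The baseline run's provenance: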

| Field | Value |
| --- | --- |
| Commit | `af64462f` |
| Node | v22.21.1 |
| Harness | `phase-89-v1` |
| Datasets | LongMemEval `xiaowu0162/longmemeval` (`longmemeval_s.json`, 500) · LoCoMo `snap-research/locomo` (`locomo10.json`, 10) |
| Answer model | anthropic `claude-sonnet-4-6` (temperature 0) |
| Judges | openai `gpt-4o-2024-11-20` + openai `gpt-4.1` (cross-model, not cross-provider) |
| Embedding | local `nomic-embed-text-v1.5` |
| Reranker | local `bge-reranker-v2-m3` |

The run config records only {provider, modelId} per role — credentials are structurally omitted, so no key, token, or base URL reaches the written report. Conflict of interest: Comis authored this benchmark; vendor-reported competitor numbers are non-comparable across protocols, and competitors are invited to reproduce on their own harness.
