Benchmark Suite
This page explains the benchmark surfaces. Use Lerim Results for first-party numbers and Market Comparison for source-backed market rows.
Surfaces
| Surface | What it measures | What it does not measure | Current public artifact |
|---|---|---|---|
| LongMemEval-S retrieval | Whether Lerim retrieves the session containing gold answer evidence | Generated answer quality or official LongMemEval QA accuracy | benchmarks/results/raw/longmemeval-hybrid-full/report.json and benchmarks/results/raw/longmemeval-lexical-full/report.json |
| Context budget | Token count of selected top-K sessions compared with replaying the whole haystack, always shown with recall | Dollar savings or answer quality | benchmarks/results/raw/context-budget-hybrid-full/report.json |
| Retrieval latency | Local search latency over LongMemEval-S session records | Hosted throughput, concurrent load, or ingestion speed | benchmarks/results/raw/retrieval-latency-longmemeval/report.json |
| Trace ingestion cost/performance | Wall-clock ingestion time, measured LLM calls, and context DB file growth for public source-session traces | Extraction quality, answer quality, or dollar cost when provider usage data is unavailable | benchmarks/results/raw/trace-ingestion-cost-longmemeval-s-sample/report.json |
| MCP integration | Config writer coverage, stdio MCP tools/context probes, trace-submit idempotency, extraction-probe accounting, and selected installed-client acceptances | Completed-session capture for every client, organic production traces, or extraction quality when the extraction probe has 0 acceptances | benchmarks/results/raw/mcp-integration-full/report.json and benchmarks/results/raw/mcp-gemini-live-tool-call/report.json |
| Extraction quality | Lerim's trace-to-context extraction behavior on labeled source-session cases | Market comparison, because competitors have not been run on this private eval | benchmarks/results/raw/extraction-minimax-m27-full-47/report.json |
| False-positive extraction | Whether Lerim avoids durable records on cases labeled as having no durable signal | Market comparison or general retrieval quality | benchmarks/results/raw/false-positive-extraction-minimax-m27-negative-cases/report.json |
| Imported market baselines | Source-backed external numbers normalized or cited for comparison | Fresh local competitor reruns unless explicitly stated | benchmarks/results/raw/*-baseline/ when available |
LongMemEval-S
LongMemEval is a long-term memory benchmark for chat assistants. Lerim currently
uses LongMemEval-S, the smaller standard setting compared with LongMemEval-M.
The public runner uses longmemeval_s_cleaned.json from
xiaowu0162/longmemeval-cleaned, snapshot
98d7416c24c778c2fee6e6f3006e7a073259d48f.
The LongMemEval paper distinguishes two history sizes:
| Setting | Meaning in these docs | Approximate size |
|---|---|---|
| LongMemEval-S | The smaller standard setting Lerim currently uses | about 115k tokens per question |
| LongMemEval-M | The larger setting with many more sessions per problem | about 1.5M tokens per question |
The paper does not spell out the letter S in prose. In these docs, read it as
the smaller LongMemEval setting, not as a short or synthetic benchmark invented
by Lerim.
Lerim's LongMemEval-S artifact is retrieval-only:
- Index one retrievable unit per haystack session.
- Search with the question text.
- Compare retrieved session IDs with answer_session_ids.
- Report R@K, NDCG@10, and MRR.
Do not call these official QA scores. The retrieval run does not call an LLM judge and does not score generated answers.
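The three reported metrics can be sketched as follows. This is a minimal illustration, not Lerim's runner: the function names are hypothetical, relevance is treated as binary (a session is either in answer_session_ids or not), and R@K is assumed to mean the fraction of gold sessions found in the top K. The actual runner may use a different recall variant.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold sessions found in the top-k retrieved IDs (assumed variant)."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def mrr(retrieved: Sequence[str], gold: Set[str]) -> float:
    """Reciprocal rank of the first gold session, or 0.0 if none is retrieved."""
    for rank, session_id in enumerate(retrieved, start=1):
        if session_id in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: gain 1 for a gold session, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, session_id in enumerate(retrieved[:k], start=1)
        if session_id in gold
    )
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical question: two gold sessions, one of which is retrieved at rank 2.
retrieved = ["s3", "s1", "s7"]
gold = {"s1", "s9"}

print(recall_at_k(retrieved, gold, 1))  # 0.0
print(recall_at_k(retrieved, gold, 3))  # 0.5
print(mrr(retrieved, gold))             # 0.5
print(ndcg_at_k(retrieved, gold, 10))
```

Per-question values like these would then be averaged over the question set to produce the reported numbers.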
Context Budget
The context-budget runner asks how much source-session text Lerim selects after retrieval. It compares full haystack tokens with the tokens in Lerim's top-1, top-3, top-5, top-10, and top-20 retrieved sessions. A context-budget number is only meaningful when shown with recall.
This is a context-selection diagnostic on the same LongMemEval-S 500-question retrieval run. It does not replace LongMemEval-S retrieval, does not call an LLM judge, and does not claim actual dollar savings. It answers a narrower engineering question: if the downstream agent used Lerim's retrieved sessions as context, how much of the original haystack would be sent forward, and did that smaller context still include the answer-bearing session?
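As a rough sketch of that comparison for one question, assuming per-session token counts are already available: the names below (`BudgetRow`, `context_budget`) and the recall variant are illustrative assumptions, and the tokenizer used by the real runner is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass
class BudgetRow:
    k: int
    selected_tokens: int
    haystack_tokens: int
    recall: float  # always reported next to the token numbers

    @property
    def fraction_of_haystack(self) -> float:
        return self.selected_tokens / self.haystack_tokens


def context_budget(
    session_tokens: Dict[str, int],
    ranked_ids: Sequence[str],
    gold_ids: Set[str],
    ks: Sequence[int] = (1, 3, 5, 10, 20),
) -> List[BudgetRow]:
    """Compare top-k selected tokens with the full haystack for one question."""
    haystack_tokens = sum(session_tokens.values())
    rows = []
    for k in ks:
        top = ranked_ids[:k]
        selected = sum(session_tokens[session_id] for session_id in top)
        recall = len(set(top) & gold_ids) / len(gold_ids) if gold_ids else 0.0
        rows.append(BudgetRow(k, selected, haystack_tokens, recall))
    return rows


# Hypothetical three-session haystack; the answer-bearing session is "a".
tokens = {"a": 100, "b": 300, "c": 600}
rows = context_budget(tokens, ranked_ids=["b", "a", "c"], gold_ids={"a"}, ks=(1, 3))
for row in rows:
    print(row.k, row.selected_tokens, row.fraction_of_haystack, row.recall)
```

In this toy case the top-1 selection sends only 30% of the haystack forward but misses the gold session, which is why a token reduction is not reported without its recall.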
Trace Ingestion Cost/Performance
The trace-ingestion cost/performance runner measures the write path, not the retrieval path. It takes public LongMemEval-S haystack sessions, normalizes them through Lerim's generic trace envelope, then sends them through the same DSPy trace-ingestion path used by Lerim.
The current public artifact is a small sample, not a full-suite result. It reports:
- ingestion wall-clock time per trace
- measured LLM calls per trace
- context SQLite file-size growth after schema initialization
- whether provider cost is available
Cost is not inferred from fixed stages or pricing guesses. In the current
artifact, cost is not available because Lerim records LLM call counts but does
not yet expose provider token usage or billed cost for model calls.
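The shape of such a measurement could look like the sketch below. It is an assumption-heavy illustration rather than Lerim's runner: `measure_ingestion`, the `ingest` callback contract (it returns the number of LLM calls it made), the `records` table, and the `fake_ingest` stub are all hypothetical stand-ins for the real DSPy ingestion path.

```python
import os
import sqlite3
import tempfile
import time
from typing import Callable, Iterable


def measure_ingestion(
    db_path: str,
    traces: Iterable[str],
    ingest: Callable[[sqlite3.Connection, str], int],
) -> dict:
    """Time each trace, count LLM calls, and measure SQLite file growth."""
    baseline_bytes = os.path.getsize(db_path)  # size after schema initialization
    per_trace = []
    for trace in traces:
        conn = sqlite3.connect(db_path)
        start = time.perf_counter()
        llm_calls = ingest(conn, trace)  # hypothetical: returns its LLM call count
        conn.commit()
        seconds = time.perf_counter() - start
        conn.close()
        per_trace.append({"seconds": seconds, "llm_calls": llm_calls})
    return {
        "per_trace": per_trace,
        "db_growth_bytes": os.path.getsize(db_path) - baseline_bytes,
        # Left as None rather than estimated when provider usage is unavailable.
        "provider_cost_usd": None,
    }


def fake_ingest(conn: sqlite3.Connection, trace: str) -> int:
    """Stub that stores the trace and pretends two LLM calls were made."""
    conn.execute("INSERT INTO records(body) VALUES (?)", (trace,))
    return 2


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "context.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE records(id INTEGER PRIMARY KEY, body TEXT)")
    conn.commit()
    conn.close()
    report = measure_ingestion(path, ["x" * 20000, "y" * 20000], fake_ingest)

print(report["db_growth_bytes"], [row["llm_calls"] for row in report["per_trace"]])
```

Keeping `provider_cost_usd` as `None` mirrors the rule above: a missing cost stays missing instead of being filled in from pricing guesses.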
Extraction
The extraction eval measures trace-to-context behavior: durable-record precision, required concept coverage, faithfulness, evidence validity, and negative precision. The current public artifact is an aggregate-only diagnostic report from an internal 47-case eval. Competitor extraction scores are not available yet because no competitor has been run on the same private traces with the same labels and judge.
The false-positive extraction diagnostic is a narrower slice of that same
47-case eval. It filters to cases labeled negative, where the target behavior
is zero durable records. It reports negative precision, false-positive case
count, and durable records created on negative cases. This is useful as an
engineering guardrail because a memory system can look strong on retrieval while
still saving too much temporary or source-derivable context.
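A minimal sketch of how those three numbers relate, assuming negative precision means the share of negative-labeled cases that produced zero durable records. The field names (`label`, `durable_records`) and the function name are hypothetical, and the real eval may define the metric differently.

```python
from typing import Dict, List, Optional, Union

Case = Dict[str, Union[str, int]]


def false_positive_summary(cases: List[Case]) -> Dict[str, Optional[float]]:
    """Summarize extraction behavior on cases labeled as having no durable signal."""
    negatives = [case for case in cases if case["label"] == "negative"]
    false_positive_cases = [case for case in negatives if case["durable_records"] > 0]
    clean_cases = len(negatives) - len(false_positive_cases)
    return {
        "negative_cases": len(negatives),
        "false_positive_cases": len(false_positive_cases),
        "durable_records_on_negatives": sum(case["durable_records"] for case in negatives),
        # Assumed definition: clean negative cases / all negative cases.
        "negative_precision": clean_cases / len(negatives) if negatives else None,
    }


# Hypothetical labeled cases: four negatives (one violated) and one positive.
cases = [
    {"label": "negative", "durable_records": 0},
    {"label": "negative", "durable_records": 0},
    {"label": "negative", "durable_records": 2},
    {"label": "negative", "durable_records": 0},
    {"label": "positive", "durable_records": 3},
]
summary = false_positive_summary(cases)
print(summary["negative_precision"])  # 0.75
```

Reporting the record count alongside the case count distinguishes one case that over-extracts heavily from many cases that each leak a single record.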
Reporting Rule
Do not edit benchmark numbers by hand. Rerun the benchmark, update
report.json, regenerate report.md, and then update public docs from those
artifacts.