Benchmarks

This is the public benchmark hub for Lerim.

The rule is simple: public numbers must point to raw artifacts or cited external sources. Generated report copies are kept for auditability, but raw report.json files are the source of truth for Lerim numbers.

Launch-grade benchmark artifacts should be rerun from a clean commit and pass the clean/tracked public benchmark gate. The v0.3.0 public artifacts passed that release gate. Any future artifact with `git_dirty: true` should be treated as pre-release evidence until it is rerun from a clean commit.
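As a rough illustration of the `git_dirty` rule above, the sketch below flags artifacts that still need a rerun. It assumes `git_dirty` is a top-level boolean in `report.json`; the function name `needs_clean_rerun` and the handling of a missing flag are hypothetical choices, not Lerim's actual release gate.

```python
import json
from pathlib import Path


def needs_clean_rerun(report: dict) -> bool:
    """Return True when a parsed report must be rerun from a clean commit.

    Assumption: `git_dirty` is a top-level boolean key in report.json.
    A missing flag is treated as dirty, since a clean commit cannot be confirmed.
    """
    return bool(report.get("git_dirty", True))


def needs_clean_rerun_file(path: Path) -> bool:
    """Load a report.json file from disk and apply the same check."""
    with path.open(encoding="utf-8") as handle:
        return needs_clean_rerun(json.load(handle))
```

A clean flag is necessary but not sufficient here: the artifact would still have to pass the clean/tracked public benchmark gate.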

Start Here

| Page | Use it for |
| --- | --- |
| Benchmark Suite | Plain-English explanation of each benchmark surface and boundary |
| Lerim Results | Detailed Lerim-only benchmark results, raw artifact references, commands, and boundaries |
| Market Comparison | Lerim vs other memory systems, with normalized rows, cited external numbers, and watchlist rows kept separate |

Generated reports live under benchmarks/results/reports/ as audit copies. Use the pages listed above for the public reading path; use generated reports when you need to trace a table back to raw artifacts.

Raw artifacts are tracked in this repo under benchmarks/results/raw/. Generated audit copies are tracked under benchmarks/results/reports/. Those paths must be included in the release commit before public benchmark links are treated as launch-grade evidence.
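One way to check the tracking requirement above before a release is to ask git whether anything under the two result paths is untracked or modified. This is a minimal sketch using the standard `git status --porcelain` output; the helper names are hypothetical and it is not part of Lerim's tooling.

```python
import subprocess

# The two result paths named in the text above.
BENCHMARK_PATHS = ["benchmarks/results/raw/", "benchmarks/results/reports/"]


def dirty_paths(porcelain_output: str) -> list[str]:
    """Return the paths listed in `git status --porcelain` (v1) output.

    Each line is a two-character status code, a space, then the path.
    Rename entries ("old -> new") are returned unsplit.
    """
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def uncommitted_benchmark_paths(repo_root: str = ".") -> list[str]:
    """List untracked or modified files under the benchmark result paths."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", *BENCHMARK_PATHS],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return dirty_paths(result.stdout)
```

An empty list from `uncommitted_benchmark_paths()` would suggest that both directories are fully committed at the current HEAD.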

Artifact Map

| Path | Purpose | How to read it |
| --- | --- | --- |
| `docs/benchmarks/index.md` | Public benchmark hub | Start here |
| `docs/benchmarks/benchmark-suite.md` | Benchmark surface explanations | Use when learning what each benchmark means |
| `docs/benchmarks/lerim-results.md` | Public Lerim-only results | Use for first-party claims |
| `docs/benchmarks/market-comparison.md` | Public market comparison | Use for competitor/market claims |
| `benchmarks/lerim_evidence/` | Lerim benchmark runners | Code that produces Lerim numbers |
| `benchmarks/competitors/` | Source-backed competitor importers | Competitor evidence normalization, not product code |
| `benchmarks/results/raw/` | Raw benchmark artifacts | Numeric source of truth |
| `benchmarks/results/reports/` | Generated Markdown reports | Audit copies generated from raw artifacts |

Do not edit numbers by hand in docs. Change the runner, rerun it, then update the generated artifacts.

Current Evidence

| Surface | Current evidence | Where to read |
| --- | --- | --- |
| LongMemEval-S retrieval | Full 500-question retrieval-only runs for hybrid and lexical modes | Lerim Results |
| Context budget | Full 500-question context-selection run using a Hugging Face tokenizer | Lerim Results |
| Retrieval latency | Partial local scale run on LongMemEval-S sessions | Lerim Results |
| Trace ingestion cost/performance | Small public-trace sample with measured LLM calls and unavailable-cost disclosure | Lerim Results |
| MCP integration | Config validation, local stdio MCP probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and a Gemini CLI live tool-call acceptance artifact | Lerim Results |
| Extraction quality | Aggregate-only 47-case diagnostic report from a `MiniMax-M2.7` agent artifact judged by `MiniMax-M2.5`; not launch-grade | Lerim Results |
| Market comparison | Source-backed market table with comparable and not-yet-comparable rows separated | Market Comparison |

Surface Map

| Surface | Public question answered | Current evidence | Not proven |
| --- | --- | --- | --- |
| LongMemEval-S retrieval | Can Lerim find answer-bearing sessions? | Full 500-question retrieval-only run | Answer generation or official LongMemEval QA accuracy |
| Context budget | How much context does Lerim select after retrieval? | Same 500 LongMemEval-S questions, Hugging Face tokenizer counts, recall shown beside reduction | Dollar cost savings, answer quality, or a replacement for the retrieval benchmark |
| Retrieval latency | How fast is local search on this machine? | Local timings over LongMemEval-S sessions | Hosted/server load performance |
| Trace ingestion cost/performance | How much time, LLM-call count, and local DB growth does the write path use? | Small LongMemEval-S public-trace sample through DSPy ingestion | Extraction quality, answer quality, or dollar cost when provider usage is unavailable |
| MCP integration | Does Lerim's config and MCP plumbing work? | Config validation, local stdio tools/context probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and Gemini CLI live tool-call acceptance | Autonomous live tool use by every external client or successful trace-submit extraction in this artifact |
| Extraction | Can Lerim extract durable records from source sessions? | Aggregate-only report from one 47-case internal eval | Launch-grade public claim or market comparison |
| Market comparison | How does Lerim compare to alternatives? | Market table with source/provenance per row | Full same-boundary market ranking |

Reporting Rules

- `report.json` is the numeric source of truth for Lerim rows.
- Use `predictions.jsonl` for per-question benchmark rows.
- Use `details.jsonl` for integration probe rows.
- Do not publish partial slices as final benchmark results.
- Do not call retrieval-only scores official LongMemEval QA scores.
- Do not use context-budget numbers without recall.
- Do not reuse retrieval numbers as extraction-quality numbers.
- Treat Lerim's trace-to-context extraction eval as first-party/private until a competitor runner feeds the same traces into another system and scores its saved memories with the same labels and judge.
- Do not publish competitor numbers without matching metric boundaries and source-backed provenance.
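Two of the rules above lend themselves to a small sketch: reading per-row `.jsonl` artifacts one JSON object per line, and refusing to render a context-budget number unless recall comes with it. The function names, parameters, and output wording below are hypothetical; the actual schemas of `predictions.jsonl` and `details.jsonl` are not specified here.

```python
import json
from typing import Optional


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines text into a list of rows, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def context_budget_row(reduction_pct: float, recall: Optional[float]) -> str:
    """Format a context-budget figure, always paired with its recall.

    Raises ValueError when recall is missing, mirroring the rule that
    context-budget numbers must not be used without recall.
    """
    if recall is None:
        raise ValueError("context-budget numbers must be reported with recall")
    return f"context reduction {reduction_pct:.1f}% at recall {recall:.3f}"
```

Placing the guard in the formatter rather than in the docs means a missing recall value fails loudly at report-generation time instead of slipping into a published table.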