Benchmarks
This is the public benchmark hub for Lerim.
The rule is simple: public numbers must point to raw artifacts or cited external
sources. Generated report copies are kept for auditability, but raw report.json
files are the source of truth for Lerim numbers.
Launch-grade benchmark artifacts should be rerun from a clean commit and pass
the clean/tracked public benchmark gate. The v0.3.0 public artifacts passed
that release gate. Any future artifact recorded with git_dirty: true should be
treated as pre-release evidence until it is rerun from a clean commit.
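As a minimal sketch of the git_dirty rule above, the check below reads a report.json file and accepts it only when it explicitly records a clean working tree. It assumes git_dirty is a top-level boolean key in report.json; the function names and the directory walk are hypothetical and are not the project's actual release gate.

```python
import json
from pathlib import Path


def is_launch_grade(report_path: Path) -> bool:
    """Return True only when the report explicitly records a clean tree.

    Assumption: `git_dirty` is a top-level boolean in report.json.
    A missing key is treated as dirty so unknown provenance never passes.
    """
    report = json.loads(report_path.read_text(encoding="utf-8"))
    return report.get("git_dirty", True) is False


def pre_release_reports(raw_dir: Path) -> list[Path]:
    """List every report.json under raw_dir that fails the check above."""
    return sorted(
        path for path in raw_dir.rglob("report.json") if not is_launch_grade(path)
    )
```

Calling pre_release_reports(Path("benchmarks/results/raw")) would list the artifacts that still need a rerun from a clean commit, under the assumptions stated above.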
Start Here
| Page | Use it for |
|---|---|
| Benchmark Suite | Plain-English explanation of each benchmark surface and boundary |
| Lerim Results | Detailed Lerim-only benchmark results, raw artifact references, commands, and boundaries |
| Market Comparison | Lerim vs other memory systems, with normalized rows, cited external numbers, and watchlist rows kept separate |
Raw artifacts are tracked in this repo under benchmarks/results/raw/, and
generated audit copies are tracked under benchmarks/results/reports/.
Use the pages above for the public reading path; use the generated reports when
you need to trace a table back to raw artifacts.
Both paths must be included in the release commit before public benchmark links
are treated as launch-grade evidence.
Artifact Map
| Path | Purpose | How to read it |
|---|---|---|
| docs/benchmarks/index.md | Public benchmark hub | Start here |
| docs/benchmarks/benchmark-suite.md | Benchmark surface explanations | Use when learning what each benchmark means |
| docs/benchmarks/lerim-results.md | Public Lerim-only results | Use for first-party claims |
| docs/benchmarks/market-comparison.md | Public market comparison | Use for competitor/market claims |
| benchmarks/lerim_evidence/ | Lerim benchmark runners | Code that produces Lerim numbers |
| benchmarks/competitors/ | Source-backed competitor importers | Competitor evidence normalization, not product code |
| benchmarks/results/raw/ | Raw benchmark artifacts | Numeric source of truth |
| benchmarks/results/reports/ | Generated Markdown reports | Audit copies generated from raw artifacts |
Do not edit numbers by hand in docs. Change the runner, rerun it, then update the generated artifacts.
Current Evidence
| Surface | Current evidence | Where to read |
|---|---|---|
| LongMemEval-S retrieval | Full 500-question retrieval-only runs for hybrid and lexical modes | Lerim Results |
| Context budget | Full 500-question context-selection run using a Hugging Face tokenizer | Lerim Results |
| Retrieval latency | Partial local scale run on LongMemEval-S sessions | Lerim Results |
| Trace ingestion cost/performance | Small public-trace sample with measured LLM calls and unavailable-cost disclosure | Lerim Results |
| MCP integration | Config validation, local stdio MCP probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and a Gemini CLI live tool-call acceptance artifact | Lerim Results |
| Extraction quality | Aggregate-only 47-case diagnostic report from a MiniMax-M2.7 agent artifact judged by MiniMax-M2.5; not launch-grade | Lerim Results |
| Market comparison | Source-backed market table with comparable and not-yet-comparable rows separated | Market Comparison |
Surface Map
| Surface | Public question answered | Current evidence | Not proven |
|---|---|---|---|
| LongMemEval-S retrieval | Can Lerim find answer-bearing sessions? | Full 500-question retrieval-only run | Answer generation or official LongMemEval QA accuracy |
| Context budget | How much context does Lerim select after retrieval? | Same 500 LongMemEval-S questions, Hugging Face tokenizer counts, recall shown beside reduction | Dollar cost savings, answer quality, or a replacement for the retrieval benchmark |
| Retrieval latency | How fast is local search on this machine? | Local timings over LongMemEval-S sessions | Hosted/server load performance |
| Trace ingestion cost/performance | How much time, LLM-call count, and local DB growth does the write path use? | Small LongMemEval-S public-trace sample through DSPy ingestion | Extraction quality, answer quality, or dollar cost when provider usage is unavailable |
| MCP integration | Does Lerim's config and MCP plumbing work? | Config validation, local stdio tools/context probes, trace-submit idempotency, 0 trace-submit extraction acceptances, and Gemini CLI live tool-call acceptance | Autonomous live tool use by every external client or successful trace-submit extraction in this artifact |
| Extraction quality | Can Lerim extract durable records from source sessions? | Aggregate-only report from one 47-case internal eval | Launch-grade public claim or market comparison |
| Market comparison | How does Lerim compare to alternatives? | Market table with source/provenance per row | Full same-boundary market ranking |
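The context-budget row reports recall beside reduction because a selector that keeps almost nothing can show a large reduction while dropping the answer. The sketch below illustrates one way to compute the two numbers together. The QuestionRow fields and the token-weighted definition of reduction are assumptions made for illustration, not the schema or formula used by the Lerim runner.

```python
from dataclasses import dataclass


@dataclass
class QuestionRow:
    """Hypothetical per-question row; field names are illustrative only."""

    full_tokens: int       # tokens in all candidate context for the question
    selected_tokens: int   # tokens kept after retrieval and selection
    answer_selected: bool  # True if an answer-bearing session was kept


def summarize(rows: list[QuestionRow]) -> dict[str, float]:
    """Return reduction and recall together so neither is quoted alone."""
    if not rows:
        raise ValueError("at least one row is required")
    total_full = sum(row.full_tokens for row in rows)
    if total_full == 0:
        raise ValueError("full_tokens must sum to a positive number")
    total_selected = sum(row.selected_tokens for row in rows)
    return {
        # Assumed definition: share of candidate tokens that were not selected.
        "reduction": 1.0 - total_selected / total_full,
        # Share of questions whose answer-bearing session survived selection.
        "recall": sum(row.answer_selected for row in rows) / len(rows),
    }
```

Returning both values from one function makes it harder to publish a reduction figure without the recall that qualifies it.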
Reporting Rules
- report.json is the numeric source of truth for Lerim rows.
- Use predictions.jsonl for per-question benchmark rows.
- Use details.jsonl for integration probe rows.
- Do not publish partial slices as final benchmark results.
- Do not call retrieval-only scores official LongMemEval QA scores.
- Do not use context-budget numbers without recall.
- Do not reuse retrieval numbers as extraction-quality numbers.
- Treat Lerim's trace-to-context extraction eval as first-party/private until a competitor runner feeds the same traces into another system and scores its saved memories with the same labels and judge.
- Do not publish competitor numbers without matching metric boundaries and source-backed provenance.
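One way to apply the first two rules is to recompute an aggregate from the per-question rows and confirm it agrees with report.json before quoting it. The sketch below assumes each predictions.jsonl row carries a boolean hit field and that report.json stores the aggregate under a recall key. Both names are hypothetical placeholders, so substitute the fields the real artifacts use.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    rows = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def recall_from_rows(rows: list[dict], key: str = "hit") -> float:
    """Fraction of rows where the (hypothetical) `hit` field is truthy."""
    if not rows:
        raise ValueError("no per-question rows to aggregate")
    return sum(1 for row in rows if row[key]) / len(rows)


def matches_report(
    report: dict, rows: list[dict], metric: str = "recall", tol: float = 1e-9
) -> bool:
    """True when the report's aggregate agrees with the per-question rows."""
    return abs(report[metric] - recall_from_rows(rows)) <= tol
```

If the check fails, the rules above imply fixing the runner and rerunning it rather than adjusting either file by hand.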