Evaluate Extraction Quality

Do not evaluate Lerim by the number of memories created.

Evaluate whether the compiler produced a small set of useful, supported, non-duplicate context records:

precision
usefulness
evidence coverage
duplicate rate
scope compatibility
alignment with the expected record kind
future reuse
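
Several of the criteria above can be approximated with a deterministic scorer. The following is a minimal sketch, not Lerim's own scorer: the `Record` shape, the `normalize` helper, and the `score` function are hypothetical names introduced for illustration, and exact text matching stands in for whatever matching a real judge would apply. Usefulness, scope compatibility, and future reuse are left out because they need a judge or human labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical shape for one context record (not Lerim's schema)."""
    text: str
    kind: str
    evidence: tuple = ()  # references back into the source trace


def normalize(text: str) -> str:
    # Case- and whitespace-insensitive key used to detect exact duplicates.
    return " ".join(text.lower().split())


def score(extracted, expected):
    """Return precision, duplicate rate, evidence coverage, and kind alignment."""
    names = ("precision", "duplicate_rate", "evidence_coverage", "kind_alignment")
    if not extracted:
        # Nothing extracted: every ratio is reported as 0.0 in this sketch.
        return dict.fromkeys(names, 0.0)

    expected_kinds = {normalize(r.text): r.kind for r in expected}

    # Keep the first record seen for each normalized text.
    first = {}
    for r in extracted:
        first.setdefault(normalize(r.text), r)

    matched = {k: r for k, r in first.items() if k in expected_kinds}
    aligned = sum(r.kind == expected_kinds[k] for k, r in matched.items())

    return {
        # Share of unique extracted records that match a labeled record.
        "precision": len(matched) / len(first),
        # Share of extracted records that repeat an earlier one.
        "duplicate_rate": 1 - len(first) / len(extracted),
        # Share of extracted records that cite at least one piece of evidence.
        "evidence_coverage": sum(bool(r.evidence) for r in extracted) / len(extracted),
        # Share of matched records whose kind equals the labeled kind.
        "kind_alignment": aligned / len(matched) if matched else 0.0,
    }


extracted = [
    Record("Use uv for installs", "preference", ("turn-3",)),
    Record("use uv  for installs", "preference", ("turn-9",)),  # duplicate
    Record("Tests live in tests/unit", "fact"),                 # no evidence
    Record("User said hello", "fact", ("turn-1",)),             # not durable
]
expected = [
    Record("Use uv for installs", "preference"),
    Record("Tests live in tests/unit", "convention"),
]
result = score(extracted, expected)
```

Note that none of these numbers rewards volume: adding more records can only lower precision or raise the duplicate rate unless the new records are labeled as expected.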

Keep extraction eval data separate from the public package unless the traces are small, sanitized examples. A publishable eval needs:

source-session trace files
labels for expected durable records and no-signal cases
a runner that feeds traces into Lerim
a judge or deterministic scorer with saved raw outputs
sanitized public reports that exclude raw private trace text
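
The pieces listed above might fit together roughly as follows. This is a sketch under stated assumptions: `run_eval`, the `extract` callable, the `*.txt` trace layout, and the output file names are all hypothetical, and `extract` stands in for whatever entry point is actually used to feed traces into Lerim. It shows raw outputs being saved for audit while the report keeps counts only.

```python
import json
from pathlib import Path


def run_eval(trace_dir, labels, extract, out_dir):
    """Run `extract` over each trace; write raw outputs and a sanitized report.

    `extract` is a hypothetical callable (trace text -> list of record texts).
    `labels` maps a trace file name to its expected record texts; an empty
    list marks a no-signal case.
    """
    out_dir = Path(out_dir)
    raw_dir = out_dir / "raw"  # private: contains extracted record text
    raw_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for trace_path in sorted(Path(trace_dir).glob("*.txt")):
        expected = set(labels.get(trace_path.name, []))
        records = list(extract(trace_path.read_text(encoding="utf-8")))

        # Save the raw output so every score can be audited later.
        raw = {"trace": trace_path.name, "records": records}
        (raw_dir / f"{trace_path.stem}.json").write_text(
            json.dumps(raw, indent=2), encoding="utf-8"
        )

        rows.append({
            "trace": trace_path.name,
            "no_signal": not expected,
            "extracted": len(records),
            "expected": len(expected),
            "matched": len(expected & set(records)),
        })

    # The public report holds counts only, never trace or record text.
    report = {"traces": rows}
    (out_dir / "report.json").write_text(
        json.dumps(report, indent=2), encoding="utf-8"
    )
    return report
```

Keeping `raw/` and `report.json` in separate locations makes it straightforward to publish only the report. If trace file names themselves are sensitive, they would need to be replaced with opaque identifiers before publishing.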

The public benchmark reports in this repo are generated artifacts. The private source traces and judge details are intentionally not shipped with the package.