LongMemEval_s retrieval slice
This is a small retrieval-only proof using the public LongMemEval longmemeval_s.json split.
It is not a leaderboard claim. It is a product-facing sanity check for one question:
Can openclaw-mem retrieve the right source sessions from a long conversational haystack when scored against session-level gold labels?
Setup
- Dataset variant: `longmemeval_s.json`
- Sample size: 20 examples
- Sampling goal: mixed `question_type` coverage
- Ingestion unit: one haystack session per retrievable episodic event
- Query: question text only
- Gold labels: `answer_session_ids`
- Metrics: session recall@1 / recall@3 / recall@5 and MRR
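The session-level metrics listed above can be sketched as follows. This is a minimal illustration, not the harness itself. It assumes recall@k is scored as a hit when any gold session appears in the top k (the source does not say whether multi-session questions use any-hit or fractional recall), and the helper names `hit_at_k`, `reciprocal_rank`, and `summarize` are hypothetical.

```python
from typing import Iterable, Sequence


def hit_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold session id is in the top-k results, else 0.0.

    Assumption: any-hit scoring. A fractional variant would divide the
    number of gold sessions found by the number of gold sessions.
    """
    gold_set = set(gold)
    return 1.0 if any(sid in gold_set for sid in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Iterable[str]) -> float:
    """Return 1/rank of the first gold session, or 0.0 if none is retrieved."""
    gold_set = set(gold)
    for rank, sid in enumerate(ranked, start=1):
        if sid in gold_set:
            return 1.0 / rank
    return 0.0


def summarize(runs, ks=(1, 3, 5)) -> dict:
    """Average the per-question scores over (ranked_ids, gold_ids) pairs."""
    n = len(runs)
    out = {f"recall@{k}": sum(hit_at_k(r, g, k) for r, g in runs) / n for k in ks}
    out["mrr"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return out


# Toy data: ranked session ids per question, paired with answer_session_ids.
runs = [
    (["s3", "s1", "s7"], ["s3"]),        # gold session at rank 1
    (["s2", "s9", "s4"], ["s4", "s8"]),  # first gold session at rank 3
    (["s5", "s6", "s0"], ["s1"]),        # gold session never retrieved
]
metrics = summarize(runs)
print(metrics)
```

With 20 examples, any-hit scoring can only produce recall values in steps of 0.05, which is consistent with the figures reported in the results table, though that alone does not confirm which variant the harness used.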
Question-type distribution:
| question_type | examples |
|---|---|
| knowledge-update | 4 |
| multi-session | 4 |
| single-session-assistant | 3 |
| single-session-preference | 3 |
| single-session-user | 3 |
| temporal-reasoning | 3 |
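One way to draw a small slice with mixed `question_type` coverage is a round-robin over the types. The sketch below is an assumption about how such a slice could be built, not a description of how this one was drawn. The function name `sample_mixed` is hypothetical, and the records are synthetic stand-ins that carry only a `question_type` field.

```python
from collections import Counter, defaultdict


def sample_mixed(records, n):
    """Pick up to n records by cycling through question types in sorted order."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["question_type"]].append(rec)

    order = sorted(buckets)
    picked = []
    depth = 0  # index of the record taken from each bucket in this round
    while len(picked) < n:
        progressed = False
        for qtype in order:
            if len(picked) < n and depth < len(buckets[qtype]):
                picked.append(buckets[qtype][depth])
                progressed = True
        if not progressed:  # every bucket is exhausted
            break
        depth += 1
    return picked


# Synthetic records: five per type, using the six types from the table above.
types = [
    "knowledge-update", "multi-session", "single-session-assistant",
    "single-session-preference", "single-session-user", "temporal-reasoning",
]
records = [{"question_id": f"{t}-{i}", "question_type": t}
           for t in types for i in range(5)]

slice_20 = sample_mixed(records, 20)
counts = Counter(rec["question_type"] for rec in slice_20)
print(dict(counts))
```

With six types and a budget of 20, a sorted round-robin yields a 4/4/3/3/3/3 split, which happens to match the distribution above.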
Results
| lane | recall@1 | recall@3 | recall@5 | MRR |
|---|---|---|---|---|
| lexical session baseline | 0.80 | 0.85 | 0.85 | 0.8375 |
| openclaw-mem raw FTS | 0.70 | 0.85 | 0.95 | 0.7950 |
| openclaw-mem vector | 0.65 | 0.90 | 0.90 | 0.7583 |
| openclaw-mem hybrid | 0.80 | 0.95 | 1.00 | 0.8767 |
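The useful signal is the shape, not the headline number: raw FTS is a reasonable negative/control lane (it trails the lexical baseline at recall@1 but passes it at recall@5), vector helps at recall@3 (0.90 versus 0.85 for raw FTS), and hybrid ties the lexical baseline at recall@1 while giving the best recall@3, recall@5, and MRR on this slice.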
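The source does not describe how the hybrid lane combines its lexical and vector rankings. As a generic illustration of why a fused ranking can outperform either input, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique. It is a stand-in, not openclaw-mem's actual method, and the function name `rrf_fuse` and the constant `k=60` are assumptions made for the example.

```python
def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked lists of session ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per session id, so a session that
    ranks well in more than one list rises above one that appears only once.
    """
    scores = {}
    for ranking in rankings:
        for rank, sid in enumerate(ranking, start=1):
            scores[sid] = scores.get(sid, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda sid: (-scores[sid], sid))[:top_n]


fts_ranking = ["s1", "s2", "s3"]     # toy lexical (FTS) result order
vector_ranking = ["s3", "s1", "s4"]  # toy vector result order

fused = rrf_fuse([fts_ranking, vector_ranking])
print(fused)
```

In this toy case the sessions found by both lanes (`s1` and `s3`) move ahead of the sessions found by only one, which is the kind of effect that would lift recall@3 and recall@5 relative to the individual lanes.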
What this proves
- The `longmemeval_s.json` schema is usable for session-level retrieval testing because it exposes `answer_session_ids`.
- openclaw-mem can run a local retrieval-only harness over a bounded LongMemEval_s slice.
- The hybrid retrieval path beats a simple lexical session baseline on recall@3, recall@5, and MRR for this slice.
What this does not prove
- It does not prove full LongMemEval performance.
- It does not score answer generation or QA correctness.
- It does not claim the 20-example slice is statistically representative.
- It does not compare against tuned external retrieval systems.
Artifact
Machine-readable metrics:
Harness safety note
During harness development, a CLI provenance issue was found: for nested `episodes` commands, `--db` must be passed after the final action subcommand, for example:
openclaw-mem episodes embed --db ./isolated.sqlite --json
In the unsafe shape below, the `--db` value can be overwritten by nested parser defaults, so this shape should not be used for isolated benchmark harnesses:
openclaw-mem episodes --db ./isolated.sqlite embed --json
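The overwrite can be reproduced with a toy parser. The sketch below assumes an `argparse`-style CLI in which both the `episodes` group and its `embed` action declare `--db` with a default. It is not openclaw-mem's real parser, and `toy-mem` and `DEFAULT_DB` are hypothetical names. In CPython's `argparse`, a subparser parses into a fresh namespace whose values, defaults included, are then copied over the parent's, so a `--db` given before the action is lost. This is worth re-checking on the interpreter the harness actually runs on.

```python
import argparse

DEFAULT_DB = "./default.sqlite"  # hypothetical default, for illustration only


def build_parser() -> argparse.ArgumentParser:
    root = argparse.ArgumentParser(prog="toy-mem")
    groups = root.add_subparsers(dest="group", required=True)

    # The group level accepts --db ...
    episodes = groups.add_parser("episodes")
    episodes.add_argument("--db", default=DEFAULT_DB)
    actions = episodes.add_subparsers(dest="action", required=True)

    # ... and so does the action level, with its own default.
    embed = actions.add_parser("embed")
    embed.add_argument("--db", default=DEFAULT_DB)
    embed.add_argument("--json", action="store_true")
    return root


parser = build_parser()

# Safe shape: --db comes after the final action subcommand.
safe = parser.parse_args(
    ["episodes", "embed", "--db", "./isolated.sqlite", "--json"])

# Unsafe shape: --db comes before the action, so the action-level default
# is copied over the value parsed at the group level.
unsafe = parser.parse_args(
    ["episodes", "--db", "./isolated.sqlite", "embed", "--json"])

print("safe:  ", safe.db)
print("unsafe:", unsafe.db)
```

In this toy setup the unsafe invocation silently resolves to the default database path, which is exactly the failure an isolated benchmark harness needs to rule out.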
The final harness used action-local --db arguments and checked every embedding receipt against the persisted row delta in the same SQLite file.
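A receipt-versus-row-delta check of this kind might look like the sketch below. It is a minimal illustration under stated assumptions: the table name `episode_embeddings`, the receipt field `embedded`, and the helper names are all hypothetical, and the embed step is simulated with direct inserts rather than a real CLI call.

```python
import sqlite3

TABLE = "episode_embeddings"  # hypothetical table name


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    # The table name is interpolated, so only pass trusted, fixed names.
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def receipt_matches_delta(conn, table, rows_before, receipt) -> bool:
    """Compare the count a receipt claims with the rows actually persisted."""
    delta = count_rows(conn, table) - rows_before
    return delta == receipt["embedded"]  # hypothetical receipt field


# An in-memory database stands in for the isolated benchmark file.
conn = sqlite3.connect(":memory:")
conn.execute(f'CREATE TABLE "{TABLE}" (episode_id TEXT PRIMARY KEY, vector BLOB)')
rows_before = count_rows(conn, TABLE)

# Stand-in for the embed step: a real harness would run the CLI against the
# same SQLite file and parse its JSON output into `receipt`.
conn.executemany(
    f'INSERT INTO "{TABLE}" VALUES (?, ?)',
    [("e1", b"\x00"), ("e2", b"\x01"), ("e3", b"\x02")],
)
conn.commit()
receipt = {"embedded": 3}

ok = receipt_matches_delta(conn, TABLE, rows_before, receipt)
print("receipt matches row delta:", ok)
```

If the CLI had written to a different database because of the argument-order issue, the receipt would still report embedded rows while the delta in the isolated file stayed at zero, so the check would fail loudly instead of passing silently.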