Thought-links — Observational Memory × LongMemEval¶
This page connects two design/benchmark references to concrete constraints in openclaw-mem.
Sources (trusted by CK):

- Mastra — Announcing Observational Memory: https://mastra.ai/blog/observational-memory
- LongMemEval (ICLR 2025): https://github.com/xiaowu0162/LongMemEval

Additional trusted references (for lifecycle/decay):

- Cepeda et al. (2006) — Distributed Practice in Verbal Recall Tasks: A Review and Quantitative Synthesis (Psychological Bulletin): https://doi.org/10.1037/0033-2909.132.3.354
- Megiddo & Modha (2003) — ARC: A Self-Tuning, Low Overhead Replacement Cache: https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf (used as an engineering analogy: retention should be driven by recency + frequency, not timestamps alone)
0) Local thought-link notes (project receipts)¶
- 2026-03-04 — Context Budget Sidecar (tool output offload + soft compaction continuity)
    - docs/archive/thought-links/2026-03-04_context-budget-sidecar-openclaw-token-cost.md
    - Spec: docs/specs/context-budget-sidecar-v0.md
- 2026-03-30 — memvid guarded adoption (portable capsule + seal/verify/diff/inspect/export-canonical, not stack replacement)
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_memvid_guarded-adoption_roi-cut_and_capsule-slice.md
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_capsule-diff_chosen-over-restore.md
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_canonical-restore-contract-brief.md
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_export-canonical-contract-brief.md
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_export-canonical-writer_landed.md
    - ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_canonical-operator-packet_and_slowdiff_landed.md
    - Product doc: portable-pack-capsules.md
1) memvid / portable capsule pattern → what we take and what we refuse¶
Source (external; product/reference only, not an authority surface):
- memvid/memvid: https://github.com/memvid/memvid
What we take (portable pattern):

- a portable capsule is a useful product surface when operators want to move, archive, or re-check a bounded memory artifact
- an explicit seal / verify lifecycle is worth copying
- a read-only diff is a safer next step than jumping straight to import/merge

What we refuse:

- treating a portable capsule as the new canonical governed store
- collapsing provenance/trust tiers/graph families into one generalized blob
- shipping restore/merge theater before canonical observation fidelity exists in the artifact contract
How it lands in openclaw-mem:
- tools/pack_capsule.py provides seal, verify, and diff
- diff is intentionally read-only and audit-first
- future restore/import work is deferred until a stronger canonical artifact contract exists
2) Observational Memory → design constraints we adopt¶
What we take (pattern, not branding):

- Text-first derived memory layer: a compact “observation log” that’s easy to diff/debug.
- Stable two-block context:
    1. OBSERVATIONS (stable prefix)
    2. RAW BUFFER (recent turns)
- Scheduled compression (observer) + infrequent garbage collection (reflector).
- Explicit priority levels (“log levels”) to make governance obvious.

Why it fits openclaw-mem:

- We already treat memory as a designed interface (provenance + trust tiers + citations + receipts). A derived observation log can be another auditable artifact — not a magical hidden embedding blob.
- It aligns with local-first ops: deterministic storage + optional LLM assist.

Non-negotiables (openclaw-mem flavor):

- Derived artifacts must be reproducible and bounded.
- Compression must be fail-open (a bad compressor cannot break ingest/recall).
- Anything committed/shared must remain redaction-safe (aggregate-only receipts by default).
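The two-block window above can be sketched as follows. This is a minimal illustration, not the shipped packer; the function name, character-based budget, and oldest-first trimming policy are our assumptions:

```python
# Two-block context assembly: a stable OBSERVATIONS prefix followed by a
# RAW BUFFER of recent turns, trimmed oldest-first to fit a budget.

def pack_two_block(observations: list[str], raw_turns: list[str],
                   budget_chars: int) -> str:
    header = "## OBSERVATIONS\n" + "\n".join(observations)
    turns = list(raw_turns)
    while turns:
        tail = "## RAW BUFFER\n" + "\n".join(turns)
        if len(header) + 1 + len(tail) <= budget_chars:
            return header + "\n" + tail
        turns.pop(0)  # drop the oldest raw turn first; keep the stable prefix intact
    return header[:budget_chars]
```

Keeping the OBSERVATIONS prefix byte-stable across turns is what makes the window cache-friendly and easy to diff.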
3) LongMemEval → benchmark strategy constraints we adopt¶
LongMemEval tests long-term interactive memory across categories that map well to our roadmap:

- Information Extraction → capture + recall stability
- Multi-Session Reasoning → context packing and cross-session continuity
- Knowledge Updates → overwrite / correction handling
- Temporal Reasoning → timestamps, “what was true when?”
- Abstention → don’t hallucinate when memory isn’t present
What changes in our benchmarking because of this:
- Report metrics overall and by question_type (category breakdown is not optional).
- Prefer ablation-style arms that isolate mechanisms:
- importance-gated ingest (our current Phase A/B proxy)
- observational compression (stable log text)
- (later) live adapter chaining: openclaw-mem → memory backend
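The per-category reporting rule above can be sketched as follows; the result-record shape (`question_type`, `correct`) is our assumption, with the category names following LongMemEval:

```python
# Report accuracy overall AND per question_type -- never overall-only.
from collections import defaultdict

def breakdown(results: list[dict]) -> dict:
    buckets = defaultdict(list)
    for r in results:
        buckets[r["question_type"]].append(r["correct"])
    report = {qt: sum(v) / len(v) for qt, v in buckets.items()}
    report["overall"] = sum(r["correct"] for r in results) / len(results)
    return report
```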
4) Concrete implementation hooks (where this lands)¶
docs/architecture.md:

- Context Packer includes an observational-memory mode variant (two-block window).

openclaw-memory-bench (tooling repo):

- retrieval reports should include a per-question-type breakdown
- the compare runner should support an observational compression arm (derived dataset) as a cheap, reproducible proxy.
5) What we deliberately do not claim (yet)¶
- We do not claim SoTA LongMemEval scores.
- We do not claim observational compression beats retrieval.
- We only claim what we can reproduce with artifacts (manifests + receipts + compare reports).
6) OpenClaw SuperMemory (SQLite FTS) → ops + safety takeaways¶
Source (external, medium trust; small repo, concept clear):
- openclaw-supermemory: https://github.com/yedanyagamiai/openclaw-supermemory
What we take:
- Local-first lexical fallback: SQLite FTS5 (BM25 scoring) is a solid “zero-embedding / zero-provider” baseline for recall + debugging.
- Strict config contract: additionalProperties: false in plugin schema reduces silent misconfig during cron/long-run ops.
- Anti-echo hygiene: explicitly tag injected context blocks (e.g. <supermemory-context>…</supermemory-context>) and strip them during capture to avoid infinite self-ingest loops.
- Ops-first tools: a memory_profile-style command (counts, categories, size, recent) is disproportionately useful for diagnosing drift.
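The anti-echo hygiene rule can be sketched with a simple regex strip; the `<supermemory-context>` tag follows the example above, while the function name is ours:

```python
# Strip injected context blocks before capture so they are never re-ingested
# (breaks the infinite self-ingest loop described above).
import re

INJECTED = re.compile(r"<supermemory-context>.*?</supermemory-context>", re.DOTALL)

def strip_injected(text: str) -> str:
    return INJECTED.sub("", text).strip()
```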
What to watch:

- Pure FTS is weaker for multilingual/semantic recall (especially Chinese) unless tokenization is addressed.
- Auto-capture heuristics must be fail-open and deduped to prevent spammy memory growth.
Actionable roadmap hooks for openclaw-mem:
- Add a profile/stats surface (similar to our label-distribution receipts, but queryable on demand).
- Add an explicit injected-context marker + ignore-list in capture/harvest.
- Add an optional FTS5 lexical fallback lane for --no-embed runs.
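The FTS5 lexical fallback lane is small enough to sketch directly; the table and column names are placeholders, and we use an in-memory database for illustration:

```python
# SQLite FTS5 lexical lane: zero-embedding, zero-provider recall baseline.
# bm25() gives ascending relevance scores (lower = better), so ORDER BY works.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE mem USING fts5(body)")
con.executemany("INSERT INTO mem(body) VALUES (?)", [
    ("retry the flaky deploy with exponential backoff",),
    ("grocery list: eggs, milk",),
])
rows = con.execute(
    "SELECT body FROM mem WHERE mem MATCH ? ORDER BY bm25(mem) LIMIT 5",
    ("deploy backoff",),
).fetchall()
```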
7) QMD (hybrid local search engine) → retrieval router + benchmark hooks¶
Source (external; high concept clarity):
- tobi/qmd: https://github.com/tobi/qmd
What it is (in one line):
A local “docs-first” search engine for markdown/transcripts that does FTS5 (BM25 scoring) + vector + (optional) query expansion + LLM reranking, with agent-friendly --json/--files outputs and an MCP surface.
How it relates to us:
- As a retrieval backend, QMD is best seen as an alternative to a pure vector store like memory-lancedb.
- As a system component, it can be a supplement to openclaw-mem (we still need capture/governance/receipts/importance; QMD doesn’t replace that).
What we take (replicable modules):

- Hybrid candidate generation: lexical anchors first (FTS; BM25-scored), then semantic recall.
- Fusion: RRF-style merging is a pragmatic default.
- Budgeting: keep a small candidate set (top-N) before reranking.
- Agent I/O contract: stable JSON/file outputs + multi-get for “fetch the actual evidence”.
Quality-first hybrid design we adopt (for openclaw-mem + memory backends):
- Stage 1: QMD/FTS5 for exact anchors (names, APIs, error strings, dates)
- Stage 2: LanceDB vector search for paraphrase recall
- Stage 3: rerank only when needed (ambiguity/close scores) + cap budgets (must first, then nice with a cap)
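The RRF-style fusion mentioned above (merging the Stage 1 and Stage 2 rankings) can be sketched as follows; `k=60` is the conventional constant and the doc ids are placeholders:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document;
# documents ranked well by multiple lanes float to the top.

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score calibration between lanes, which is why it is a pragmatic default when mixing BM25 and vector distances.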
Benchmarking hooks (where this lands):
- openclaw-memory-bench: add a QMD adapter and a “hybrid router” arm so we can compare:
- QMD-only vs LanceDB-only vs Hybrid (QMD→LanceDB fallback)
- metrics: hit/recall + p95 latency + must-coverage gate
What to watch (risks):

- Local GGUF model downloads + rerank latency can be heavy; quality-first is fine, but we need hard caps and a clear “disable rerank” path.
- “Docs-first” indexing is great for markdown, but we must ensure redaction-safe exports when sourcing from private session transcripts.
8) OpenViking (context database / filesystem paradigm) → observability + layered loading reference¶
Source (external; concept clarity high):
- volcengine/OpenViking: https://github.com/volcengine/OpenViking
What it is (in one line): A “context database” for agents that models resources + memory + skills as a virtual filesystem (URI + directories), with layered context loading (L0/L1/L2) and observable retrieval trajectories.
What we take (design patterns, not adoption commitment):

- Filesystem-as-context mental model: context should be browsable and targetable (by scope/path), not just a flat embedding blob.
- Layer contract (L0/L1/L2):
    - L0: ultra-short abstract for fast filtering
    - L1: overview + navigation ("how to get details")
    - L2: original detail, loaded only when necessary
- Retrieval observability: a first-class “trajectory/trace” for why something was retrieved (debuggable receipts).
- Typed lanes: distinguishing Resource / Memory / Skill as separate context types aligns with our governance goals.
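The L0/L1/L2 layer contract can be sketched as a record type where the cheap layers are materialized and the heavy L2 detail stays behind a loader; the class and field names here are ours, not OpenViking's:

```python
# Layered context entry: L0/L1 are always present, L2 loads on demand.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ContextEntry:
    uri: str
    l0_abstract: str                              # ultra-short, for fast filtering
    l1_overview: str                              # navigation: how to get details
    _l2_loader: Callable[[], str] = field(repr=False, default=lambda: "")

    def load_detail(self) -> str:                 # L2: loaded only when necessary
        return self._l2_loader()
```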
How it relates to openclaw-mem:
- openclaw-mem remains the governance/control-plane (importance, trust tiers, redaction, receipts, packing policy).
- OpenViking is a strong reference for how to make context structured, layered, and observable.
Scope note (CK decision):

- Treat OpenViking as a thought-link only for now (we are not committing to it as a backend/adapter arm yet).
9) Reference-based decay ("forgetting curve") → lifecycle governance hook¶
Key takeaway:

- Retention should be governed by use (recency/frequency), not a fixed “delete after N days since write” rule.
How this maps to openclaw-mem:
- Track last_used_at (ref) for durable records.
- Update ref only when a record is actually used (default: included in the final pack bundle with a citation), not when it’s merely preloaded.
- Apply archive-first lifecycle management (soft delete) so mistakes are reversible.
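A use-driven retention score (the ARC analogy above: recency plus frequency, not write timestamps) might look like the sketch below. The half-life, log-frequency weighting, and threshold are illustrative assumptions, not a tuned policy:

```python
# Score records by recency of last use AND use count, then archive
# (soft delete) the lowest-scoring ones so mistakes stay reversible.
import math

def retention_score(last_used_days_ago: float, use_count: int,
                    half_life_days: float = 30.0) -> float:
    recency = 0.5 ** (last_used_days_ago / half_life_days)  # exponential decay
    frequency = math.log1p(use_count)                       # diminishing returns
    return recency * (1.0 + frequency)

def to_archive(records: list[dict], threshold: float = 0.1) -> list[str]:
    return [r["id"] for r in records
            if retention_score(r["last_used_days_ago"], r["use_count"]) < threshold]
```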
Trusted references:

- Cepeda et al. (2006) — distributed practice / spaced repetition: https://doi.org/10.1037/0033-2909.132.3.354
- ARC cache replacement (engineering analogy: recency + frequency beats timestamps): https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf

Untrusted inspiration (idea source; treat as a field note):

- X thread (xiyu): https://x.com/ohxiyu/status/2022924956594806821
10) MCP Tool Search (Claude Code) → dynamic discovery + “Skill Card / Manual” split¶
Source (external; concept clarity high):

- 好豪 — "MCP Tool Search: How Claude Code Ends the Token Consumption Explosion" (Chinese-language post): https://haosquare.com/mcp-tool-search-claude-code/

Core idea (portable pattern):

- Don’t preload the whole “tool dictionary” (all schemas) into context.
- Keep a small always-on core set.
- Everything else is discover → inspect → execute (search first; load details only when needed).

Why it matters to openclaw-mem (and our workflow design):

- SOP/skills behave like tools: as the library grows, “stuff all SOPs into the prompt” becomes a self-inflicted context bomb.
- This complements our layered-loading references (e.g., OpenViking L0/L1/L2):
    - Skill Card = L0/L1 (tiny, searchable): when to use, outputs, risks, keywords.
    - Skill Manual/Templates = L2 (heavy, deferred): step-by-step SOP, checklists, examples.
Actionable roadmap hooks (candidates):
- Add a lexical index lane (FTS5; BM25 scoring) for skill cards / SOP cards so agents can search first and only load the manual they need.
- Add a minimal “skill discovery” contract:
- naming conventions (regex-friendly)
- keywords/anti-keywords
- explicit outputs + receipt rules
- Provide a small helper surface (CLI or adapter) that returns top-N card matches as JSON, then fetches the chosen manual on demand.
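The card-match helper could look like the sketch below. The card schema (`name`, `keywords`, `anti_keywords`) and the keyword-overlap scoring are our assumptions; a real lane would likely sit on FTS5 instead:

```python
# Score tiny skill cards by keyword overlap, honor anti-keywords,
# and return the top-N names as stable JSON for agent consumption.
import json

def match_cards(query: str, cards: list[dict], n: int = 3) -> str:
    terms = set(query.lower().split())
    scored = []
    for card in cards:
        hits = terms & {k.lower() for k in card["keywords"]}
        anti = terms & {k.lower() for k in card.get("anti_keywords", [])}
        if hits and not anti:
            scored.append((len(hits), card["name"]))
    scored.sort(reverse=True)
    return json.dumps([name for _, name in scored[:n]])
```

The manual (L2) is then fetched on demand for the chosen card only, keeping the always-on context small.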
11) Trait / interface-first (systems kernel mindset) → contracts over vibes¶
Source (external; concept clarity high):
- theonlyhennygod/zeroclaw: https://github.com/theonlyhennygod/zeroclaw
What we take (portable pattern):

- Treat core subsystems (provider/channel/memory/tools) as interfaces with explicit contracts.
- Prefer fail-fast validation for configs and outputs (surface misconfig early).
- Keep operator surfaces machine-readable (stable JSON) so cron/receipts don’t depend on prompt parsing.
How this maps to openclaw-mem:
- “Memory governance” is our control-plane; backends remain swappable behind adapters.
- Roadmap candidates: strict config (additionalProperties:false), stable JSON schemas for receipts, and a profile/stats surface.
12) PAI (continuous learning + self-upgrade loop) → "learning records" as a first-class memory type¶
Source (external; concept clarity high):

- Daniel Miessler — Personal AI Infrastructure (PAI): https://github.com/danielmiessler/Personal_AI_Infrastructure
- v3.0 notes (self-upgrade loop, constraint extraction, drift prevention): https://raw.githubusercontent.com/danielmiessler/Personal_AI_Infrastructure/main/Releases/v3.0/README.md

What we take (portable patterns, not code):

- Structured reflections (not just free-form notes): mistakes → fixes → recurring themes.
- Mining the loop outputs: cluster repeated failure modes and turn them into targeted upgrades.
- Constraint extraction + drift prevention: treat “rules” as extractable artifacts and re-check them before/after producing outputs.
How we go beyond it (openclaw-mem flavor):
- Governance-first: every learning record gets provenance + trust tier + redaction rules by default.
- Importance-aware learnings: learning records can be auto-labeled (must_remember/nice_to_have/ignore) using our importance pipeline.
- Receipts: the learning loop must emit aggregate, diffable receipts (counts, top recurring error patterns, and what changed).
Concrete integration plan (scope-safe):
- Keep runtime hooks/handlers (e.g. .learnings/ writing) outside openclaw-mem core.
- Add a learning-record ingestion + query surface inside openclaw-mem:
- ingest .learnings/{LEARNINGS,ERRORS,FEATURE_REQUESTS}.md (or JSONL) into the warm SQLite ledger
- make them searchable + packable with citations
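The ingest half of this plan can be sketched as follows. The one-bullet-per-record file layout and the two-column ledger schema are assumptions for illustration, not the real contract:

```python
# Parse "- " bullets out of a .learnings markdown file and write them
# into a warm SQLite ledger, returning the number of records ingested.
import sqlite3

def ingest_learnings(markdown: str, kind: str, con: sqlite3.Connection) -> int:
    con.execute("CREATE TABLE IF NOT EXISTS learnings(kind TEXT, body TEXT)")
    records = [line[2:].strip() for line in markdown.splitlines()
               if line.startswith("- ")]
    con.executemany("INSERT INTO learnings(kind, body) VALUES (?, ?)",
                    [(kind, r) for r in records])
    return len(records)
```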
Risk to watch (and mitigation):

- Infinite self-ingest loops (context blocks re-captured as learnings).
- Mitigate with explicit injected-context markers + ignore-lists (see the SuperMemory takeaways above).
13) Lossless Context Management (LCM) / lossless-claw → fresh-tail protection + provenance + “expand” tooling reference¶
Source (external; concept clarity high):
- martian-engineering/lossless-claw: https://github.com/martian-engineering/lossless-claw
- LCM paper: https://voltropy.com/LCM
What it is (in one line): A pluggable context engine for OpenClaw that stores all session messages in SQLite, compacts via a summary DAG, and provides tools to grep/describe/expand compacted history.
What we take (portable patterns):

- Protected “fresh tail”: always keep the last N raw messages un-compacted for continuity.
- Evictable prefix: fill the remaining budget with older summaries; drop the oldest first.
- Provenance by construction: summaries link back to source messages; expansion is possible.
- Ops safety belts: best-effort compaction with a deterministic fallback so the loop doesn’t stall.
How it maps to openclaw-mem (without adopting an engine fork):
- Our Context Packer can adopt the same assembly policy (fresh tail + evictable prefix) even if we don’t own compaction.
- We should treat a pack as a hybrid text + JSON object (stable anchors) with explicit provenance (recordRef) and trace receipts.
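The fresh-tail + evictable-prefix assembly policy can be sketched as follows (a character-based budget and these function names are our simplifications, not the engine's):

```python
# Protect the last N raw messages, then fill the remaining budget with
# summaries newest-first; older summaries are the first to be evicted.

def assemble(summaries: list[str], messages: list[str],
             fresh_n: int, budget_chars: int) -> list[str]:
    tail = messages[-fresh_n:]                      # protected fresh tail
    used = sum(len(m) for m in tail)
    prefix: list[str] = []
    for s in reversed(summaries):                   # walk summaries newest-first
        if used + len(s) > budget_chars:
            break                                   # everything older is evicted
        prefix.append(s)
        used += len(s)
    return list(reversed(prefix)) + tail            # chronological order
```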
See also:
- docs/context-pack.md (ContextPack v1 direction)
- docs/architecture.md (Context Packer)