Thought-links — Observational Memory × LongMemEval

This page connects two design/benchmark references to concrete constraints in openclaw-mem.

Sources (trusted by CK):
  • Mastra, "Announcing Observational Memory": https://mastra.ai/blog/observational-memory
  • LongMemEval (ICLR 2025): https://github.com/xiaowu0162/LongMemEval

Additional trusted references (for lifecycle/decay):
  • Cepeda et al. (2006), "Distributed Practice in Verbal Recall Tasks: A Review and Quantitative Synthesis" (Psychological Bulletin): https://doi.org/10.1037/0033-2909.132.3.354
  • Megiddo & Modha (2003), "ARC: A Self-Tuning, Low Overhead Replacement Cache": https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf (Used as an engineering analogy: retention should be driven by recency + frequency, not timestamps alone.)

  • 2026-03-04: Context Budget Sidecar (tool output offload + soft compaction continuity)
      • docs/archive/thought-links/2026-03-04_context-budget-sidecar-openclaw-token-cost.md
      • Spec: docs/specs/context-budget-sidecar-v0.md
  • 2026-03-30: memvid guarded adoption (portable capsule + seal/verify/diff/inspect/export-canonical, not stack replacement)
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_memvid_guarded-adoption_roi-cut_and_capsule-slice.md
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_capsule-diff_chosen-over-restore.md
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_canonical-restore-contract-brief.md
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_export-canonical-contract-brief.md
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_export-canonical-writer_landed.md
      • ../openclaw-async-coding-playbook/projects/openclaw-mem/TECH_NOTES/2026-03-30_canonical-operator-packet_and_slowdiff_landed.md
      • Product doc: portable-pack-capsules.md

1) memvid / portable capsule pattern → what we take and what we refuse

Source (external; product/reference only, not an authority surface):
  • memvid/memvid: https://github.com/memvid/memvid

What we take (portable pattern):
  • A portable capsule is a useful product surface when operators want to move, archive, or re-check a bounded memory artifact.
  • An explicit seal / verify lifecycle is worth copying.
  • A read-only diff is a safer next step than jumping straight to import/merge.

What we refuse:
  • Treating a portable capsule as the new canonical governed store.
  • Collapsing provenance/trust tiers/graph families into one generalized blob.
  • Shipping restore/merge theater before canonical observation fidelity exists in the artifact contract.

How it lands in openclaw-mem:
  • tools/pack_capsule.py provides seal, verify, and diff.
  • diff is intentionally read-only and audit-first.
  • Future restore/import work is deferred until a stronger canonical artifact contract exists.
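The seal / verify / diff lifecycle can be illustrated with a minimal sketch. It assumes a capsule is a JSON-serializable dict of records keyed by id; the function names and capsule layout are hypothetical and are not the interface of tools/pack_capsule.py.

```python
import hashlib
import json


def _digest(records: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def seal(records: dict) -> dict:
    """Wrap records with a content digest (hypothetical capsule layout)."""
    return {"records": records, "sha256": _digest(records)}


def verify(capsule: dict) -> bool:
    """Recompute the digest and compare it with the sealed value."""
    return capsule.get("sha256") == _digest(capsule.get("records", {}))


def diff(old: dict, new: dict) -> dict:
    """Read-only comparison: reports differences, mutates neither capsule."""
    a, b = old["records"], new["records"]
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Because diff only reports, an operator can audit two capsules without any risk of merging them by accident.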

2) Observational Memory → design constraints we adopt

What we take (pattern, not branding):
  • Text-first derived memory layer: a compact “observation log” that’s easy to diff/debug.
  • Stable two-block context:
      1) OBSERVATIONS (stable prefix)
      2) RAW BUFFER (recent turns)
  • Scheduled compression (observer) + infrequent garbage-collection (reflector).
  • Explicit priority levels (“log levels”) to make governance obvious.
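A minimal sketch of the two-block assembly, assuming observations and raw turns are already plain strings. The function name, the raw_keep parameter, and the block labels are illustrative and not taken from Mastra or openclaw-mem.

```python
def build_context(observations: list[str], raw_turns: list[str], raw_keep: int = 4) -> str:
    """Assemble a stable OBSERVATIONS prefix followed by a RAW BUFFER of recent turns."""
    # Only the last raw_keep turns stay raw; older turns are assumed to be
    # represented in the observation log already.
    recent = raw_turns[-raw_keep:] if raw_keep > 0 else []
    lines = ["OBSERVATIONS", *observations, "", "RAW BUFFER", *recent]
    return "\n".join(lines)
```

The prefix depends only on the observation log, so it stays byte-identical while new turns arrive in the raw buffer. That is what makes the artifact easy to diff.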

Why it fits openclaw-mem:
  • We already treat memory as a designed interface (provenance + trust tiers + citations + receipts). A derived observation log can be another auditable artifact, not a magical hidden embedding blob.
  • It aligns with local-first ops: deterministic storage + optional LLM assist.

Non-negotiables (openclaw-mem flavor):
  • Derived artifacts must be reproducible and bounded.
  • Compression must be fail-open (a bad compressor cannot break ingest/recall).
  • Anything committed/shared must remain redaction-safe (aggregate-only receipts by default).
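The fail-open and bounded requirements can be sketched as a wrapper around any compressor. This is an illustration under assumed names (compress_fail_open, max_items), not the project's implementation.

```python
from typing import Callable


def compress_fail_open(
    turns: list[str],
    compressor: Callable[[list[str]], list[str]],
    max_items: int = 20,
) -> tuple[list[str], bool]:
    """Return (observations, degraded); a failing compressor never propagates."""
    try:
        observations = compressor(turns)
        if not isinstance(observations, list):
            raise TypeError("compressor must return a list of strings")
        degraded = False
    except Exception:
        # Fail open: fall back to the raw turns so ingest/recall keeps working.
        observations = list(turns)
        degraded = True
    # Bounded: keep at most max_items entries, preferring the most recent.
    bounded = observations[-max_items:] if max_items > 0 else []
    return bounded, degraded
```

The degraded flag gives receipts something aggregate to count without exposing the content that failed to compress.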

3) LongMemEval → benchmark strategy constraints we adopt

LongMemEval tests long-term interactive memory across categories that map well to our roadmap:
  • Information Extraction → capture + recall stability
  • Multi-Session Reasoning → context packing and cross-session continuity
  • Knowledge Updates → overwrite / correction handling
  • Temporal Reasoning → timestamps, “what was true when?”
  • Abstention → don’t hallucinate when memory isn’t present

What changes in our benchmarking because of this:
  • Report metrics overall and by question_type (category breakdown is not optional).
  • Prefer ablation-style arms that isolate mechanisms:
      • importance-gated ingest (our current Phase A/B proxy)
      • observational compression (stable log text)
      • (later) live adapter chaining: openclaw-mem → memory backend
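A per-category report can be as small as the sketch below. It assumes each result row carries a question_type string and a boolean correct field; that row shape is an assumption for illustration, not the openclaw-memory-bench report format.

```python
from collections import defaultdict


def accuracy_by_question_type(results: list[dict]) -> dict:
    """Accuracy per question_type plus an 'overall' entry."""
    totals: dict = defaultdict(int)
    hits: dict = defaultdict(int)
    for row in results:
        qtype = row["question_type"]
        totals[qtype] += 1
        if row["correct"]:
            hits[qtype] += 1
    report = {qtype: hits[qtype] / totals[qtype] for qtype in sorted(totals)}
    n = sum(totals.values())
    report["overall"] = sum(hits.values()) / n if n else 0.0
    return report
```

Reporting both views matters because a strong overall number can hide a weak category such as abstention.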

4) Concrete implementation hooks (where this lands)

  • docs/architecture.md:
      • Context Packer includes an observational-memory mode variant (two-block window).
  • openclaw-memory-bench (tooling repo):
      • Retrieval reports should include a per-question-type breakdown.
      • The compare runner should support an observational compression arm (derived dataset) as a cheap, reproducible proxy.

5) What we deliberately do not claim (yet)

  • We do not claim SoTA LongMemEval scores.
  • We do not claim observational compression beats retrieval.
  • We only claim what we can reproduce with artifacts (manifests + receipts + compare reports).

6) OpenClaw SuperMemory (SQLite FTS) → ops + safety takeaways

Source (external, medium trust; small repo, concept clear):
  • openclaw-supermemory: https://github.com/yedanyagamiai/openclaw-supermemory

What we take:
  • Local-first lexical fallback: SQLite FTS5 (BM25 scoring) is a solid “zero-embedding / zero-provider” baseline for recall + debugging.
  • Strict config contract: additionalProperties: false in plugin schema reduces silent misconfig during cron/long-run ops.
  • Anti-echo hygiene: explicitly tag injected context blocks (e.g. <supermemory-context>…</supermemory-context>) and strip them during capture to avoid infinite self-ingest loops.
  • Ops-first tools: a memory_profile-style command (counts, categories, size, recent) is disproportionately useful for diagnosing drift.
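A minimal sketch of the lexical fallback using Python's sqlite3 module, assuming the bundled SQLite was compiled with FTS5 (common, but not guaranteed). The table and column names are hypothetical.

```python
import sqlite3


def build_index(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 index over (record_id, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # record_id is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE mem USING fts5(record_id UNINDEXED, body)")
    conn.executemany("INSERT INTO mem (record_id, body) VALUES (?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[str]:
    """Return record ids ranked by BM25 (bm25() is lower for better matches)."""
    cur = conn.execute(
        "SELECT record_id FROM mem WHERE mem MATCH ? ORDER BY bm25(mem) LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cur]
```

The query string is interpreted with FTS5 query syntax, so raw user input containing quotes or operators should be sanitized or quoted before it reaches MATCH.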

What to watch:
  • Pure FTS is weaker for multilingual/semantic recall (esp. Chinese) unless tokenization is addressed.
  • Auto-capture heuristics must be fail-open and deduped to prevent spammy memory growth.

Actionable roadmap hooks for openclaw-mem:
  • Add a profile/stats surface (similar to our label-distribution receipts, but queryable on demand).
  • Add an explicit injected-context marker + ignore-list in capture/harvest.
  • Add an optional FTS5 lexical fallback lane for --no-embed runs.
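The injected-context marker hook can be sketched with a single regular expression. The tag name follows the example quoted earlier in this section; the function name is hypothetical, and a real capture path would likely take the tag list from config.

```python
import re

# Non-greedy match with DOTALL so multi-line blocks are removed one by one.
_CONTEXT_BLOCK = re.compile(
    r"<supermemory-context>.*?</supermemory-context>", re.DOTALL
)


def strip_injected_context(text: str) -> str:
    """Remove injected context blocks before a turn is captured."""
    return _CONTEXT_BLOCK.sub("", text).strip()
```

Running this before capture means memory that was injected into the prompt is never written back as if it were new user content.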

7) QMD (hybrid local search engine) → retrieval router + benchmark hooks

Source (external; high concept clarity):
  • tobi/qmd: https://github.com/tobi/qmd

What it is (in one line): A local “docs-first” search engine for markdown/transcripts that does FTS5 (BM25 scoring) + vector + (optional) query expansion + LLM reranking, with agent-friendly --json/--files outputs and an MCP surface.

How it relates to us:
  • As a retrieval backend, QMD is best seen as an alternative to a pure vector store like memory-lancedb.
  • As a system component, it can be a supplement to openclaw-mem (we still need capture/governance/receipts/importance; QMD doesn’t replace that).

What we take (replicable modules):
  • Hybrid candidate generation: lexical anchors first (FTS; BM25-scored), then semantic recall.
  • Fusion: RRF-style merging is a pragmatic default.
  • Budgeting: keep a small candidate set (top-N) before reranking.
  • Agent I/O contract: stable JSON/file outputs + multi-get for “fetch the actual evidence”.

Quality-first hybrid design we adopt (for openclaw-mem + memory backends):
  • Stage 1: QMD/FTS5 for exact anchors (names, APIs, error strings, dates)
  • Stage 2: LanceDB vector search for paraphrase recall
  • Stage 3: rerank only when needed (ambiguity/close scores) and cap budgets (pack must_remember items first, then nice_to_have items up to a cap)
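RRF-style merging of the lexical and vector candidate lists can be sketched as follows. The constant k = 60 is the value commonly used for reciprocal rank fusion; the function name and the top_n cap are illustrative and not taken from QMD.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]
```

RRF uses only ranks, so BM25 scores and vector distances never have to be put on a common scale before merging.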

Benchmarking hooks (where this lands):
  • openclaw-memory-bench: add a QMD adapter and a “hybrid router” arm so we can compare:
      • QMD-only vs LanceDB-only vs Hybrid (QMD→LanceDB fallback)
      • metrics: hit/recall + p95 latency + must-coverage gate

What to watch (risks):
  • Local GGUF model downloads + rerank latency can be heavy; quality-first is fine, but we need hard caps and a clear “disable rerank” path.
  • “Docs-first” indexing is great for markdown, but we must ensure redaction-safe exports when sourcing from private session transcripts.

8) OpenViking (context database / filesystem paradigm) → observability + layered loading reference

Source (external; concept clarity high):
  • volcengine/OpenViking: https://github.com/volcengine/OpenViking

What it is (in one line): A “context database” for agents that models resources + memory + skills as a virtual filesystem (URI + directories), with layered context loading (L0/L1/L2) and observable retrieval trajectories.

What we take (design patterns, not adoption commitment):
  • Filesystem-as-context mental model: context should be browsable and targetable (by scope/path), not just a flat embedding blob.
  • Layer contract (L0/L1/L2):
      • L0: ultra-short abstract for fast filtering
      • L1: overview + navigation ("how to get details")
      • L2: original detail, loaded only when necessary
  • Retrieval observability: a first-class “trajectory/trace” for why something was retrieved (debuggable receipts).
  • Typed lanes: distinguishing Resource / Memory / Skill as separate context types aligns with our governance goals.
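The layer contract and the retrieval trace can be illustrated together. This is a sketch under assumed names (LayeredEntry, retrieve, the mem:// URIs); it is not OpenViking's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LayeredEntry:
    uri: str
    l0: str                     # ultra-short abstract for fast filtering
    l1: str                     # overview + navigation
    load_l2: Callable[[], str]  # deferred loader for the original detail


def retrieve(entries: list[LayeredEntry], keyword: str, max_detail: int = 1):
    """Filter on L0, expand at most max_detail hits to L2, and record a trace."""
    results, trace = [], []
    expanded = 0
    needle = keyword.lower()
    for entry in entries:
        if needle not in entry.l0.lower():
            trace.append((entry.uri, "skipped at L0"))
        elif expanded < max_detail:
            results.append((entry.uri, "L2", entry.load_l2()))
            trace.append((entry.uri, "expanded to L2"))
            expanded += 1
        else:
            results.append((entry.uri, "L1", entry.l1))
            trace.append((entry.uri, "served from L1"))
    return results, trace
```

The trace is the debuggable receipt: it records, per entry, which layer answered and why the heavier layer was or was not loaded.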

How it relates to openclaw-mem:
  • openclaw-mem remains the governance/control-plane (importance, trust tiers, redaction, receipts, packing policy).
  • OpenViking is a strong reference for how to make context structured, layered, and observable.

Scope note (CK decision):
  • Treat OpenViking as thought-link only for now (we are not committing to it as a backend/adapter arm yet).

9) Reference-based decay ("forgetting curve") → lifecycle governance hook

Key takeaway:
  • Retention should be governed by use (recency/frequency), not a fixed “delete after N days since write” rule.

How this maps to openclaw-mem:
  • Track last_used_at (ref) for durable records.
  • Update ref only when a record is actually used (default: included in the final pack bundle with a citation), not when it’s merely preloaded.
  • Apply archive-first lifecycle management (soft delete) so mistakes are reversible.
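A minimal sketch of use-driven retention, assuming records live in a dict keyed by id and carry a last_used_at datetime. The archived flag and both function names are hypothetical.

```python
from datetime import datetime, timedelta


def mark_used(records: dict, cited_ids: list[str], now: datetime) -> None:
    """Refresh last_used_at only for records cited in the final pack bundle."""
    for rid in cited_ids:
        if rid in records:
            records[rid]["last_used_at"] = now


def archive_stale(records: dict, now: datetime, max_idle: timedelta) -> list[str]:
    """Soft delete: flag idle records as archived instead of removing them."""
    archived = []
    for rid, rec in records.items():
        if not rec.get("archived") and now - rec["last_used_at"] > max_idle:
            rec["archived"] = True  # reversible: the record itself is kept
            archived.append(rid)
    return sorted(archived)
```

Preloaded-but-uncited records are never passed to mark_used, so they keep aging toward archive even if they are loaded often.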

Trusted references:
  • Cepeda et al. (2006) distributed practice / spaced repetition: https://doi.org/10.1037/0033-2909.132.3.354
  • ARC cache replacement (engineering analogy: recency+frequency beats timestamps): https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf

Untrusted inspiration (idea source; treat as a field note):
  • X thread (xiyu): https://x.com/ohxiyu/status/2022924956594806821

10) MCP Tool Search (Claude Code) → dynamic discovery + “Skill Card / Manual” split

Source (external; concept clarity high):
  • 好豪, "MCP Tool Search: How Claude Code Ends the Token-Consumption Explosion" (article in Chinese): https://haosquare.com/mcp-tool-search-claude-code/

Core idea (portable pattern):
  • Don’t preload the whole “tool dictionary” (all schemas) into context.
  • Keep a small always-on core set.
  • Everything else is discover → inspect → execute (search first; load details only when needed).

Why it matters to openclaw-mem (and our workflow design):
  • SOP/skills behave like tools: when the library grows, “stuff all SOPs into prompt” becomes a self-inflicted context bomb.
  • This complements our layered-loading references (e.g., OpenViking L0/L1/L2):
      • Skill Card = L0/L1 (tiny, searchable): when to use, outputs, risks, keywords.
      • Skill Manual/Templates = L2 (heavy, deferred): step-by-step SOP, checklists, examples.

Actionable roadmap hooks (candidates):
  • Add a lexical index lane (FTS5; BM25 scoring) for skill cards / SOP cards so agents can search first and only load the manual they need.
  • Add a minimal “skill discovery” contract:
      • naming conventions (regex-friendly)
      • keywords/anti-keywords
      • explicit outputs + receipt rules
  • Provide a small helper surface (CLI or adapter) that returns top-N card matches as JSON, then fetches the chosen manual on demand.
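The discover step can be sketched with plain keyword overlap standing in for the FTS5 lane. The card fields (name, keywords, anti_keywords, manual_path) are an assumed shape for illustration, not an existing contract.

```python
import json


def search_cards(cards: list[dict], query: str, top_n: int = 3) -> str:
    """Return the top-N matching skill cards as a JSON string."""
    terms = set(query.lower().split())
    scored = []
    for card in cards:
        if terms & {k.lower() for k in card.get("anti_keywords", [])}:
            continue  # any anti-keyword hit excludes the card
        score = len(terms & {k.lower() for k in card.get("keywords", [])})
        if score:
            scored.append((score, card["name"], card["manual_path"]))
    # Best score first; ties are broken by name so the output is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    matches = [
        {"name": name, "score": score, "manual_path": path}
        for score, name, path in scored[:top_n]
    ]
    return json.dumps(matches)
```

Only the small cards are scanned here. The caller reads manual_path for the chosen match and loads that one manual on demand.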

11) Trait / interface-first (systems kernel mindset) → contracts over vibes

Source (external; concept clarity high):
  • theonlyhennygod/zeroclaw: https://github.com/theonlyhennygod/zeroclaw

What we take (portable pattern):
  • Treat core subsystems (provider/channel/memory/tools) as interfaces with explicit contracts.
  • Prefer fail-fast validation for configs and outputs (surface misconfig early).
  • Keep operator surfaces machine-readable (stable JSON) so cron/receipts don’t depend on prompt parsing.

How this maps to openclaw-mem:
  • “Memory governance” is our control-plane; backends remain swappable behind adapters.
  • Roadmap candidates: strict config (additionalProperties:false), stable JSON schemas for receipts, and a profile/stats surface.
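Strict config with fail-fast validation can be sketched in a few lines of standard-library Python. The key names and types below are a hypothetical schema; a real plugin would more likely express this as JSON Schema with additionalProperties set to false.

```python
# Hypothetical schema: allowed keys and their expected types.
ALLOWED_KEYS = {"db_path": str, "max_results": int}


def validate_config(config: dict) -> dict:
    """Reject unknown keys and wrong types before anything starts running."""
    unknown = sorted(set(config) - set(ALLOWED_KEYS))
    if unknown:
        # Equivalent in spirit to additionalProperties: false.
        raise ValueError(f"unknown config keys: {unknown}")
    for key, expected in ALLOWED_KEYS.items():
        if key in config and not isinstance(config[key], expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    return config
```

A misspelled key such as max_result fails at load time instead of being ignored silently during a long cron run.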

12) PAI (continuous learning + self-upgrade loop) → "learning records" as a first-class memory type

Source (external; concept clarity high):
  • Daniel Miessler, Personal AI Infrastructure (PAI): https://github.com/danielmiessler/Personal_AI_Infrastructure
  • v3.0 notes (self-upgrade loop, constraint extraction, drift prevention): https://raw.githubusercontent.com/danielmiessler/Personal_AI_Infrastructure/main/Releases/v3.0/README.md

What we take (portable patterns, not code):
  • Structured reflections (not just free-form notes): mistakes → fixes → recurring themes.
  • Mining the loop outputs: cluster repeated failure modes and turn them into targeted upgrades.
  • Constraint extraction + drift prevention: treat “rules” as extractable artifacts and re-check them before/after producing outputs.

How we go beyond it (openclaw-mem flavor):
  • Governance-first: every learning record gets provenance + trust tier + redaction rules by default.
  • Importance-aware learnings: learning records can be auto-labeled (must_remember/nice_to_have/ignore) using our importance pipeline.
  • Receipts: the learning loop must emit aggregate, diffable receipts (counts, top recurring error patterns, and what changed).

Concrete integration plan (scope-safe):
  • Keep runtime hooks/handlers (e.g. .learnings/ writing) outside openclaw-mem core.
  • Add a learning-record ingestion + query surface inside openclaw-mem:
      • ingest .learnings/{LEARNINGS,ERRORS,FEATURE_REQUESTS}.md (or JSONL) into the warm SQLite ledger
      • make them searchable + packable with citations

Risk to watch (and mitigation):
  • Infinite self-ingest loops (context blocks re-captured as learnings).
  • Mitigate with explicit injected-context markers + ignore-lists (see SuperMemory takeaways above).

13) Lossless Context Management (LCM) / lossless-claw → fresh-tail protection + provenance + “expand” tooling reference

Source (external; concept clarity high):
  • martian-engineering/lossless-claw: https://github.com/martian-engineering/lossless-claw
  • LCM paper: https://voltropy.com/LCM

What it is (in one line): A pluggable context engine for OpenClaw that stores all session messages in SQLite, compacts via a summary DAG, and provides tools to grep/describe/expand compacted history.

What we take (portable patterns):
  • Protected “fresh tail”: always keep the last N raw messages un-compacted for continuity.
  • Evictable prefix: fill remaining budget with older summaries; drop oldest first.
  • Provenance by construction: summaries link back to source messages; expansion is possible.
  • Ops safety belts: best-effort compaction with deterministic fallback so the loop doesn’t stall.

How it maps to openclaw-mem (without adopting an engine fork):
  • Our Context Packer can adopt the same assembly policy (fresh tail + evictable prefix) even if we don’t own compaction.
  • We should treat a pack as a hybrid text + JSON object (stable anchors) with explicit provenance (recordRef) and trace receipts.
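The assembly policy (fresh tail + evictable prefix) can be sketched as below. Word count stands in for token cost, and the function and parameter names are illustrative; this is neither lossless-claw's code nor the Context Packer's.

```python
def _cost(text: str) -> int:
    # Word count is a crude stand-in for a real token counter.
    return len(text.split())


def assemble(summaries: list[str], messages: list[str], budget: int, fresh_tail: int = 2) -> list[str]:
    """summaries are ordered oldest to newest; the fresh tail is always kept."""
    tail = messages[-fresh_tail:] if fresh_tail > 0 else []
    remaining = budget - sum(_cost(m) for m in tail)
    kept: list[str] = []
    # Walk summaries newest-first so the oldest are the ones evicted.
    for summary in reversed(summaries):
        cost = _cost(summary)
        if cost > remaining:
            break
        kept.append(summary)
        remaining -= cost
    return list(reversed(kept)) + tail
```

The tail is protected even when it alone exceeds the budget. That is a deliberate choice in this sketch: continuity of the last turns takes priority over older summaries.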

See also:
  • docs/context-pack.md (ContextPack v1 direction)
  • docs/architecture.md (Context Packer)