Roadmap¶

This roadmap translates design principles from personal AI infrastructure into concrete, testable work items for openclaw-mem.

Guiding stance: ship early, stay local-first, and keep every change non-destructive, observable, and rollbackable.

Status tags used here: DONE / PARTIAL / ROADMAP.

Operator-first / product view: - OpenClaw user improvement roadmap (product-facing): Go →

Principles (what we optimize for)¶

Sidecar-first, optional slot owner: openclaw-mem remains the ops sidecar by default. We may additionally ship an optional slot backend (openclaw-mem-engine) to replace memory-lancedb when enabled — still rollbackable via a one-line slot switch.
Fail-open by default: memory helpers should not break ingest or the agent loop.
Non-destructive writes: never overwrite operator-authored fields; only fill missing values.
Upgrade-safe: user-owned data/config is stable across versions.
Receipts over vibes: every automation path should emit a measurable summary.
Trust-aware by design: treat skill/web/tool outputs as untrusted by default until promoted by an explicit policy; preserve provenance so packing/retrieval can make safer choices.

2026-02 Pilot execution order (two pillars, incremental)¶

To keep scope controlled for the current pilot:

Pillar A — build now (implementation + receipts)¶

Harden pack --trace into an explicit contract (openclaw-mem.pack.trace.v1) with schema tests.
Enforce citation/rationale coverage for included items (must stay at zero-missing).
Keep budget policy minimal (single budgetTokens cap); track budget-driven exclusions.
Run counterfactual benchmark arm A0 (baseline pack behavior) vs A1 (contract enforcement) in openclaw-memory-bench.
Promotion gate to default behavior requires: schema pass, determinism pass, and reviewed real-run receipts.

Pillar B — spec now, implement later¶

Define learning-record schema, lifecycle states, and benchmark preregistration only.
No runtime rollout before Pillar A promotion gate + soak evidence.

Change-control guardrail¶

This pilot step updates docs/specs and benchmark plans only.
No live OpenClaw config or cron schedule changes are included in this step.

Now (next milestones)¶

0) OpenClaw Mem Engine (optional memory slot backend)¶

Status: DONE (M1 shipped).

Goal: replace memory-lancedb with a slot backend that supports hybrid recall (FTS + vector), scopes, and auditable policies.
Why: the official backend currently uses LanceDB mostly as a basic vector store; it doesn’t expose hybrid/FTS/index lifecycle/versioning.
Design doc: OpenClaw Mem Engine →

Add-on (critical UX win, no local LLM): - Docs memory: index operator-authored repos (DECISIONS / roadmaps / specs) as a recall surface and include it as a cold lane. - Spec: Docs memory hybrid search v0 (maintainer archive) - Operational note: scoped retrieval starvation in the plugin was hardened in Slice 1 (ef614f4) via bounded overfetch before scope filtering. - Slice 2 first cut now lands repo-allowlist pushdown in openclaw-mem docs search plus plugin-side repo pushdown wiring, while retaining residual plugin filtering as defense-in-depth. - Deferred optimization item: shrink reliance on scoped overfetch only after the broader docs-cold-lane / memory-engine development line reaches a later stable stage; treat this as optimization-phase work, not near-term delivery pressure. - Spec: Docs cold lane scope pushdown v1 (maintainer archive)

Acceptance criteria: - Slot switch + rollback is one line (plugins.slots.memory). - memory_store/memory_recall/memory_forget emit JSON receipts (filters, latency, counts). - M1 delivers a “concept → decisions/preferences” golden set where hybrid beats vector-only. - Scope hardening + receipt legibility are operator-debuggable (scopeFallbackSuppressed, whySummary, whyTheseIds). - Step4 Working Set rollout wiring exists behind config (workingSet.enabled) with rollbackable kill switch. Current evaluation status: frozen / default-off after A/B review found no measured reply-quality lift over baseline recall. - Optimize-assist canary readiness now has a read-only advisory surface (optimize canary-advisory) so future cron reports can state which lifecycle/soft-archive features are enableable, monitor-only, or not ready, with reasons and receipt evidence refs.

1.7) Graphic Memory consumption (triggered preflight → pack integration)¶

Status: DONE (pack integration shipped on stable main; graph-aware synthesis preference and protected tail now apply inside ordinary pack, while triggered preflight remains additive and fail-open).

Problem: Graphic Memory had working auto-capture and graph preflight, but it was not yet routinely consumed in doc/decision/dependency lookup flows.
Shipped slice:
openclaw-mem pack --use-graph=off|auto|on
deterministic Stage 0/1/2 trigger envelope in --trace (stage0, stage1, probe, trigger_reason, probe_decision)
graph preflight integration stays fail-open and does not break baseline pack behavior
graph-derived candidate inclusion can consume query-plane provenance quality with deterministic include/exclude reasons
policy/usage receipts stay bounded via graph_provenance_policy, policy_surface, and lifecycle_shadow
fresh closure packet: docs/2026-03-31_graph-consumption-closure.md

Artifacts: - Spec: docs/specs/graphic-memory-preflight-trigger-policy.md

Acceptance criteria: - pack behavior unchanged when graph is OFF. - In --use-graph=auto, trigger is deterministic + traceable (--trace shows trigger reason). - Auto graph scope stays conservative: unresolved scope degrades to baseline-only instead of cross-project promotion. - Auto graph latency is policy-governed: allow/degrade/skip is receipted and can suppress graph bundle composition. - Graph failures are fail-open and never break pack. - Graph-derived candidate injection can consume query-plane provenance quality with deterministic include/exclude reasons (structured provenance gate + fail-open receipts). - Ordinary pack can prefer a covering synthesis card over raw covered refs without requiring --use-graph=on. - Golden regression scenarios exist for pack policy, protected tail, and graph-auto trigger behavior.

1.7a) Graphic Memory query plane (operator-facing graph interface)¶

Status: DONE (operator query plane shipped on stable main; closure revalidated with fresh query-plane receipts on 2026-03-31).

Problem: operators need a practical query layer over stable topology + runtime drift + provenance, but today those relationships are scattered across YAML, cron state, and receipts.
Decision: keep repo-backed topology as the source of truth; add a derived query plane under openclaw-mem.
Target architecture:
source of truth = structured files (YAML / markdown / receipts)
derived cache = SQLite graph tables
first shippable slice = YAML-only query helper for one-hop operator questions
Shipped slice (v1.1.0):
deterministic query-plane foundation module + SQLite refresh contract
graph query commands for upstream / downstream / lineage / writers / filter
graph query drift --live-json <path> --db <path> for stable-topology vs runtime-state checks
provenance integration deepened: query edges/groups now carry normalized provenance_ref objects, plus bounded graph query subgraph --require-structured-provenance filtering for pack-safe consumption
Initial operator questions:
what depends on this node?
what does this node feed/write?
which jobs write this artifact?
which jobs are background but not human-facing?
where does graph truth drift from live state?

Artifacts: - Spec: docs/specs/graphic-memory-query-plane-v0.md

Acceptance criteria: - YAML remains the editable truth; the derived graph is rebuildable/disposable. - A-fast can ship bounded query value before SQLite lands. - A-deep installs deterministic refresh + drift/provenance boundaries. - Runtime graph failures remain fail-open and do not break baseline memory/pack flows.

1.7b) Automatic topology seed (repo map → topology YAML)¶

Status: ROADMAP.

Problem: our topology surfaces are still curated/demo-first; new repos/jobs/artifacts don’t automatically appear unless a human updates topology files.
Goal: ship a deterministic extractor that can generate a minimal, reviewable topology seed from the workspace + cron registry.
Non-goals: no LLM extraction, no implicit trust promotion, and no silent overwrites of operator-authored topology.

Plan (v0): 1) Build a topology-seed from deterministic sources: - OpenClaw cron registry (job ids, schedules, delivery targets) - maintainer cron job specs (operator archive) - workspace repo roots (git + directory metadata only) 2) Output a small YAML/JSON file + receipt (counts, provenance groups). 3) Optional: “suggest-only” diff against a curated topology file.

Acceptance criteria: - One command can regenerate the seed deterministically and produce a receipt. - Seed output is provenance-first and safe to commit (no secrets, no raw content).

Artifacts: - Spec: docs/specs/topology-auto-extract-v0.md

1.7bb) Verbatim semantic lane (episodic evidence recall)¶

Status: DONE (v1.4.0 first production slice).

Problem: openclaw-mem had strong governance/pack posture, but weaker raw semantic recall over episodic evidence than purpose-built memory products.
Shipped slice:
openclaw-mem episodes embed to build/search-refresh embeddings over redacted episodic_events.search_text
openclaw-mem episodes search --mode lexical|hybrid|vector for bounded episodic evidence recall
optional --trace receipts showing FTS/vector/fused rankings
query-side --query-en assist without introducing a second canonical episodic text plane
Non-goals preserved:
no durable-memory auto-promotion
no Working Set source-corpus inversion
no route-auto default behavior change in this slice

Artifacts: - Reference: docs/verbatim-semantic-lane.md - Spec: Verbatim semantic lane v0 (maintainer archive)

Acceptance criteria: - episodic hybrid recall is additive, read-only, scope-aware, and redaction-safe - embedding refresh is deterministic via search_text_hash - lexical fallback remains usable when vector lane is unavailable

1.7c) Compiled synthesis layer (selected refs → maintained synthesis cards)¶

Status: PARTIAL (graph synth compile / graph synth stale / graph synth refresh / graph synth recommend / deterministic graph lint / optimize governor-review / Phase-1 dream-lite apply plan|verify, Phase-2/3/4 dream-lite apply run|rollback|verify --since, and Phase-5 rehearsal-only dream-lite director observe|stage|checkpoint|apply shipped; graph preflight and graph pack now prefer fresh synthesis cards; deterministic review/contradiction signals now surface in stale/lint; governed-autonomy contract and canary apply-readiness contract are documented; real authority-file mutation remains unshipped).

Problem: Graphic Memory can capture refs and build bounded preflight/query bundles, but it still has to re-derive many high-value cross-source conclusions from scratch.
Goal: add a small, provenance-carrying compiled synthesis layer that turns selected refs into reusable synthesis cards with a stale/lint loop.
Non-goals: no graph DB, no UI/Obsidian dependency, no automatic wiki-writing loop, and no topology-source-of-truth changes.

Plan (v0): 1) Reuse existing selection surfaces (graph index / graph preflight / explicit record refs) as inputs. 2) Add graph synth compile to emit a bounded synthesis-card receipt (+ optional Markdown materialization). 3) Add graph synth stale and deterministic graph lint checks. 4) Shipped in the graph-preflight lane: prefer fresh synthesis cards before replaying many covered raw refs. 5) Shipped in the graph-pack lane: when explicit refs are covered by a fresh synthesis card, prefer the card and surface the preference receipt. 6) Shipped in main pack --use-graph: record graph-consumption receipts and elide raw L1 lines already covered by preferred synthesis cards in the combined graph-aware bundle. 7) Shipped in cmd_hybrid: prefer fresh synthesis cards in top results when they cover multiple high-ranked raw hits, with explicit graph-consumption receipts. 8) Shipped in graph synth refresh: replay the old card selection, emit a fresh replacement card, and mark the old card as superseded with lifecycle receipts. 9) Shipped in graph lint: coverage pressure / candidateCardSuggestions using scope + repeated-keyword clusters for uncovered areas not yet covered by active synthesis cards. 10) Shipped in search: prefer fresh synthesis cards in top results when multiple matched raw hits are covered by the same card, with graph-consumption receipts. 11) Later, extend synthesis-card preference more broadly in other pack/retrieval lanes where it remains truthful. 12) Governance follow-through: keep Dream Lite as zero-write, then add a governor-review surface before any apply path. - scout/helper lanes may inspect and packetize recommendations only - judgment and any future write authority must remain explicit and governed 13) Apply-readiness follow-through: define and wire a Phase-1 compiled-synthesis apply-planning canary before any write lane. - shipped Phase 1: dream-lite apply plan|verify admits only governor-approved refresh_card and emits writes_performed=0 receipts - shipped Phase 1b: dream-lite director observe|stage|checkpoint produces instruction-candidate / staged-patch / checkpoint packets only - first future wet-run canary should admit only refresh_card - compile_new_card stays out of auto-apply in v0 - dry-run / before-after receipts / rollback artifact are mandatory

Acceptance criteria: - A user can compile a reusable synthesis card from bounded refs with provenance. - Staleness is detectable without an LLM. - Graph failures remain fail-open and do not break baseline preflight/pack flows.

Artifacts: - Spec: docs/specs/graphic-memory-compiled-synthesis-v0.md

1.7d) Temporal fact materialized view (KG facts without a new truth owner)¶

Status: SHIPPED v1.9.26 (CLI lane implemented; no Gateway, cron, memory backend, or prompt-injection topology change).

Problem: operators need a default experience for "what is currently true about X, how did that truth evolve, and which receipts support it?" without treating graph/wiki output as a new source of truth.
Decision: model KG facts as a materialized view over Store evidence, surfaced through Pack and proven by Observe receipts. The view is rebuildable and additive; it is not a hidden memory owner.
V0 scope:
explicit assertions only; no automatic LLM extraction
single-subject current truth + timeline + evidence pack
controlled predicate registry
deterministic lint for dangling sources, unknown predicates, interval conflicts, stale facts, and over-confident source tiers
ContextPack-compatible graph fact pack
visible graph fact route receipt when the fact view is used
review-only graph fact propose / measure-extraction extraction assist with writes_performed=false
Non-goals:
graph DB migration
multi-hop inference
free-form predicates
"compiled wiki" as an authority surface

Acceptance criteria: - A valid sourced assertion appears in current and timeline. - Dangling sources and unresolved overlapping contradictions fail lint. - Supersession and invalidation update timeline/current truth deterministically. - graph fact pack is byte-stable under the same inputs, budget-bounded, and cites every included fact. - Dropping the derived cache and rebuilding produces the same stable ids and current-truth set.

Artifacts: - Spec: docs/specs/temporal-fact-materialized-view-v0.md - Dev phase map/backlog: docs/specs/temporal-fact-materialized-view-dev-phases-v0.md - Public docs: docs/temporal-facts.md - Ops skill card: skills/temporal-facts.ops.md

1.6) Sunrise rollout (Stage A→B→C)¶

Status: PARTIAL (Stage A running; Stage B/C pending).

Positioning: - rollout ladder for the shipped writeback / recall path - operational follow-through, not the default next standalone product blade

Stage A: background writeback cron (no slot switch)
Stage B: daily canary slot switch + golden-set recall check
Stage C: live switch with auto-downgrade guard

Acceptance criteria: - Stage A runs stably for 3 days: missingIds=0, error_count=0. - Stage B canary passes 3 consecutive days: engine recall returns receipts with policyTier + ftsTop/vecTop and no tool errors. - Stage C is only enabled after A+B are green.

1.6a) Read-only lane enforcement ladder (sidecar-first deployments)¶

Status: ROADMAP.

Positioning: - platform / deployment guardrail, not the default next product-core blade - tracked partly in openclaw-ops because the end-state depends on runtime / host enforcement beyond openclaw-mem core ownership

Problem: in sidecar-only deployments, “read-only lanes” are still mostly a prompt/runner discipline; exec is the main escape hatch.
Goal: make read-only posture enforceable (tool surface + script-only exec + sandbox) so we can expand unattended coverage safely.

Phase plan (suggested): - Phase 0: prompt-layer read-only card + silent-on-green (today) - Phase 1: tool allowlists deny memory writes + file writes; scripts-first jobs - Phase 2: sandbox exec (script-only wrapper + OS-level restrictions) - Phase 3: widen coverage + add sunrise watchers for each new surface

Acceptance criteria: - A cron lane can prove “read-only” via receipts (allowed scripts + expected state paths only). - Rollback is one command (disable the job / remove profile) and restores baseline behavior.

Artifacts: - Ops backlog: maintainer archive (not part of the public evaluator path)

1.5) Writeback + recall policy loop (M1.5)¶

Status: DONE (proof-first closure slice landed on 2026-04-06).

Add a bounded openclaw-mem writeback-lancedb path that pushes graded metadata from SQLite into LanceDB by row ID.
Default recall policy for memory_recall is fail-open:
must_remember + nice_to_have
+unknown
+ignore
Receipt must expose policyTier used (must+nice, must+nice+unknown, must+nice+unknown+ignore) for diagnostics.
fresh closure packet: docs/2026-04-06_writeback-recall-policy-loop-closure.md

Acceptance criteria: - A smoke writeback run updates importance, importance_label, scope, trust_tier, category only when missing. - Empty-policy recall returns ignore tier and still yields results if any memory exists. - receipts include both engine and writeback summaries.

1.5a) Self-optimizing memory loop (shadow/recommendation-first)¶

Status: DONE (shadow-only review loop shipped on stable main; apply path remains explicit future work, not part of this slice).

Problem: the memory layer can capture and recall, but does not yet systematically learn from repeated misses, user corrections, low-value recalls, or strong evidence that certain memories should be promoted/demoted/merged.
Decision: add a conservative loop:
observe
propose
verify
optionally apply (later, low-risk only)
v0 posture:
recommendation/shadow mode only
no autonomous prompt rewriting
no silent deletion
no hidden config mutation

Shipped v0.1 slice: - openclaw-mem optimize review / optimize consolidation-review (zero-write observer/reporters) - bounded source-of-truth scan (observations, default limit 1000) - low-risk signals: staleness, duplication, bloat, weakly-connected candidates, repeated no-result memory_recall miss patterns, and importance drift spot-checks (score-vs-label mismatch, missing/unparseable metadata, conservative high-risk under-label detections) - outputs structured report openclaw-mem.optimize.review.v0 with recommendations (no mutation)

Artifacts: - Spec: Self-optimizing memory loop v0 (maintainer archive)

Acceptance criteria: - proposal generation does not mutate source truth by default - proposal receipts are inspectable and bounded - the loop is fail-open; disabling it preserves current behavior - only low-risk metadata changes are even considered for future auto-apply

1) Importance grading rollout (MVP v1)¶

Status: DONE (baseline shipped; operator-curated benchmark packet landed on 2026-03-31).

[x] Canonical detail_json.importance object + thresholds
[x] Deterministic heuristic-v1 + unit tests
[x] Feature flag for autograde: OPENCLAW_MEM_IMPORTANCE_SCORER=heuristic-v1
[x] Ingest wiring: only fill missing importance; never overwrite; fail-open
[x] CLI override: --importance-scorer {heuristic-v1|heuristic_v1|off} for ingest/harvest (env fallback remains)
[x] E2E safety belt: prove flag-off = no change; flag-on fills missing; fail-open doesn’t break ingest
[x] Ingest summary (text + JSON) with at least:
total_seen, graded_filled, skipped_existing, skipped_disabled, scorer_errors, label_counts
[ ] Small before/after benchmark set (operator-rated precision on must_remember + spot-check ignore)
Pointer: docs/thought-links.md; rerank proof-of-concept notes remain in the maintainer archive.

Acceptance criteria: - Turning the feature on/off is a one-line env var change. - E2E tests cover overwrite-prevention and fail-open behavior. - Each ingest run produces a machine-readable summary suitable for trend tracking.

2) Formalize memory tiers (hot / warm / cold)¶

Goal: turn the implicit pipeline into an explicit policy.

Hot = observations JSONL (minutes)
Warm = SQLite ledger (hours → days)
Cold = durable summaries / curated files (weeks → months)

Add-on (debuggability + governance): - Episodic events ledger (append-only session timeline; summary-first; scope-isolated) - Spec: Episodic events ledger v0 (maintainer archive)

Deliverables: - A short spec of promotion rules (what moves up tiers, and why) - A default operator workflow: search → timeline → get → store/promote - A future record-kind hardening note for durable event | fact | plan semantics so forward-looking intent is not collapsed into ordinary fact storage

Acceptance criteria: - Operators can explain where a fact lives and how it got there. - Promotions are auditable and reversible.

3) Memory lifecycle (reference-based decay + archive-first)¶

Goal: keep memory high-signal over long horizons by applying use-based retention (recency/frequency), not “age-based deletion”.

Core idea: - Track reference events for durable records (when a record actually influences packed context). - Apply decay/archival based on last_used_at (ref) and priority tiers. - Default to soft archive, not hard delete (fail-safe while recall is imperfect).

Proposed fields (upgrade-safe; store in detail_json.lifecycle first): - priority: P0|P1|P2 - P0 = never auto-archive (identity/safety/operator invariants) - P1 = long-lived preferences/decisions - P2 = short-lived context - last_used_at: timestamp of last real use (see definition below) - used_count: optional, monotonic count of use events - (optional) archived_at / state=archived

Definition: what counts as “used” (avoid gaming the signal) - Does NOT count: bulk preload / “always include these memories”. - Counts (cheap default): a record is selected into the final pack bundle (has a citation like obs:<id>). - Optional later: track a weaker last_retrieved_at for candidates vs last_included_at for final bundle inclusion.

Receipts (non-negotiable) - pack --trace should be able to list which recordRefs were “refreshed” this run. - Daily lifecycle job should emit an aggregate-only receipt (archived counts by tier/trust/importance).

Acceptance criteria: - Ref updates are auditable (receipt + trace), and do not require guessing “importance at write time”. - Archive is reversible; no hard delete in MVP. - Trust tier remains independent: “used often” does not automatically become trusted.

Current optimize-side shipment (governed, bounded): - optimize review surfaces signals.soft_archive_candidates (read-only proposals). - optimize evolution-review emits set_soft_archive_candidate items with proposal-first posture. - optimize governor-review requires explicit --approve-soft-archive for approval. - optimize assist-apply can apply only governor-approved soft-archive items using reversible lifecycle metadata writes (soft_archive_candidate, archived_at, archive_reason_code) with apply-time protection rechecks and no hard delete. - optimize verifier-bundle now reports per-family applied-action accounting (including set_soft_archive_candidate) and asserts no-hard-delete row-count invariants alongside rollback replay checks; optimize posture-review now surfaces the latest verifier family counts for canary-readiness audits.

Next (engineering epics)¶

4) Provenance + trust tiers (defense-in-depth)¶

Goal: make retrieval and packing trust-aware, so “helpful but hostile” content doesn’t become durable state by accident.

Deliverables: - A minimal provenance schema for each record (e.g., source, producer, optional url/tool_name, timestamps) - A simple trust tier field (e.g., trusted | untrusted | quarantined) with sane defaults: - tool/web/skill captures start as untrusted - operator-authored notes/promotions can mark trusted - Promotion/quarantine rules that are explicit and auditable (receipts)

Acceptance criteria: - Default packing/retrieval can prefer trusted without breaking existing flows. - Operators can explain why a record was included (provenance + trust tier).

5) Context Packer (lean prompt build)¶

Goal: for each request, locally build a small, high-signal context bundle instead of shipping the whole session history.

Boundary reminder: graph is a pack enhancement lane, not a parallel memory owner. If graph cannot improve the bounded bundle while staying fail-open, it should not expand the product surface.

Deliverables: - A packing spec (inputs, budgets, citations, redaction rules) including trust gating - A stable ContextPack output contract (hybrid text + JSON) for injection + ops tooling: - openclaw-mem.context-pack.v1 - See: docs/context-pack.md - Status: shipped baseline in openclaw-mem pack as context_pack; future changes should extend compatibly - pack CLI (or equivalent) that outputs: - a short “relevant state” section - bounded summaries of the top-K relevant durable facts/tasks - citations back to record ids / URLs (no private paths) - trust tier and provenance hints (enough for audits; not noisy) - Layer contract (L0/L1/L2) for pack inputs/outputs: - L0 abstract for fast filtering - L1 overview as the default bundle payload - L2 detail only on-demand + strictly bounded - Retrieval trajectory receipts (--trace): pack must be debuggable (why included/excluded). - Include a minimal JSON schema (v1) so we can diff behavior over time and compare arms in benchmarks. - A cheap retrieval baseline without embeddings (FTS + heuristics) - Optional: embedding-based rerank as an opt-in layer - A bounded counterexample / dissent quota policy hook so packs can preserve at least one meaningful contradiction when conflict exists, rather than returning only reinforcing evidence

Trace receipt schema (v1, redaction-safe)¶

When openclaw-mem pack --trace is used, it should be able to emit a JSON receipt like:

{
  "kind": "openclaw-mem.pack.trace.v1",
  "ts": "2026-02-15T00:00:00Z",
  "version": {
    "openclaw_mem": "1.x",
    "schema": "v1"
  },
  "query": {
    "text": "…",
    "scope": "(optional scope tag or project id)",
    "intent": "(optional: lookup|plan|debug|write|research)"
  },
  "budgets": {
    "budgetTokens": 1200,
    "maxItems": 12,
    "maxL2Items": 2,
    "niceCap": 100
  },
  "lanes": [
    {
      "name": "hot",
      "source": "session/recent",
      "searched": true,
      "notes": "recent turns only"
    },
    {
      "name": "warm",
      "source": "sqlite-ledger",
      "searched": true,
      "retrievers": [
        { "kind": "fts5", "topK": 50 }
      ]
    },
    {
      "name": "cold",
      "source": "curated-summaries",
      "searched": false
    }
  ],
  "candidates": [
    {
      "id": "rec:123",
      "type": "memory|resource|skill|decision|digest",
      "layer": "L0|L1|L2",
      "importance": "must_remember|nice_to_have|ignore|unknown",
      "trust": "trusted|untrusted|quarantined|unknown",
      "scores": { "fts": 12.3, "semantic": null, "rrf": null },
      "decision": {
        "included": true,
        "reason": ["high_score", "must_remember", "within_budget"],
        "caps": { "niceCapHit": false, "l2CapHit": false }
      },
      "citations": {
        "url": null,
        "recordRef": "(stable ref; no private paths)"
      }
    }
  ],
  "output": {
    "includedCount": 8,
    "excludedCount": 42,
    "l2IncludedCount": 1,
    "citationsCount": 2
  },
  "timing": {
    "durationMs": 83
  }
}

Notes: - Do not include raw content, absolute local paths, or secrets. - It must be stable enough to diff across versions and to support openclaw-memory-bench policy comparisons.

Hybrid upgrade (quality-first, later within this epic): - Add a retrieval router that can combine multiple backends: - Lexical (SQLite FTS5; BM25 scoring; QMD-style) - Semantic (vector store; e.g. LanceDB) - Default policy (quality-first): 1) lexical anchors (fast + precise) 2) semantic fallback (paraphrase recall) 3) rerank only when needed + strict top-N candidate budgets - Keep outputs auditable: every packed fact must carry provenance + citations/ids.

Acceptance criteria: - For a sample of real requests, packing reduces prompt size materially while keeping answer quality stable. - Output is deterministic enough to debug (receipts + JSON summary).

6) Graph semantic memory (idea → project matching)¶

Status: PARTIAL (v0 idea→project slice shipped via graph match + graph health; deeper typed graph/schema work remains roadmap).

Goal: represent projects/decisions/concepts as typed entities + edges so we can recommend work with path justification.

Deliverables: - Minimal entity/edge schema (typed) - Ingest adapter that builds a graph view from: - digests, scout reports, decisions - v0 automation surfaces (dev): - [x] graph index / graph pack / graph export (graph-first index + packing + export) - [x] graph preflight (deterministic recall pack preflight) - [x] graph capture-git (commit capture) - [x] graph capture-md (index-only markdown capture) - [x] graph auto-status and env toggles (OPENCLAW_MEM_GRAPH_AUTO_RECALL, OPENCLAW_MEM_GRAPH_AUTO_CAPTURE, OPENCLAW_MEM_GRAPH_AUTO_CAPTURE_MD) - [x] graph match (bounded idea/query → top projects → explanation path) - [x] graph health (freshness / staleness / node-count summary for canary use) - Next-value layer: - compiled synthesis cards + stale/lint loop over selected refs - Spec: docs/specs/graphic-memory-compiled-synthesis-v0.md - Query path (current/target): - idea/query → top projects → explanation path - Storage posture: - stay with portable / derived graph artifacts first (SQLite + receipts + optional Markdown materialization) - defer dedicated graph-store evaluation until compiled synthesis and query quality prove the need

Acceptance criteria: - Given an idea, we can point to 3–10 candidate projects/tasks with a human-readable justification path. - High-value repeated cross-source conclusions can be reused as fresh synthesis cards instead of being re-derived every time.

7) User/System separation (upgrade-safe operator state)¶

Deliverables: - Clear boundary of user-owned vs system-owned files/config - Schema versioning + migration notes (compat layer for old records)

Acceptance criteria: - Upgrades do not rewrite operator state. - Old DB/records remain readable.

8) Observability & hooks (receipts everywhere)¶

Deliverables: - Standardized run summaries for ingest/harvest/triage - Drift detection for label distribution (e.g., must_remember suddenly spikes) - Compaction receipts (future): capture before_compaction/after_compaction lifecycle events into the sidecar ledger so operators can audit “what got summarized” vs “what stayed hot”. - Manual /compact flush hook (upstream, future): when an operator triggers /compact, run a pre-compaction memory flush first (configurable), then compact. This reduces “oops I compacted before writing durable notes”.

Acceptance criteria: - Any automated path can be validated via logs + JSON summary.

9) Feedback loop (operator corrections → better behavior)¶

Deliverables: - Minimal manual override flow (mark/adjust importance) - Track correction counts + scorer error counts

Learning records (self-improvement loop; PAI-inspired, openclaw-mem-governed):
A structured record type (warm tier) that can store:
- mistakes / incidents (what happened)
- resolution / mitigation (what to do next time)
- tags (tool/provider/project)
- provenance + trust tier + redaction posture
Ingestion path (local-first, idempotent):
- import from .learnings/ markdown templates (or a JSONL variant)
- emit a receipt: total imported, new vs duplicate, top recurring patterns
Retrieval path:
- allow pack to include the top-N relevant learning records when the query matches an error/tool/workflow

Acceptance criteria: - Operators can correct mistakes and see the system behave differently afterward. - Import is idempotent (re-running doesn’t spam duplicates). - Learning-record outputs are redaction-safe by default (aggregate receipts; content only on explicit request).

10) Pruning-safe capture profiles (future)¶

Goal: make OpenClaw session pruning safer by ensuring important tool outputs remain retrievable locally.

Deliverables: - Capture profiles that are safe by default: - metadata-only (always safe) - summary-only (current default) - head-tail (bounded content) - Explicit allowlist/denylist support per tool, with redaction on. - Shipped hardening follow-up (2026-04-27): bounded black-box runtime checks for sidecar/plugin tool-result summaries now run against the shared secret-detector corpus to ensure high-risk patterns are redacted while benign docs text remains visible. - Shipped hardening follow-up (2026-04-27): tool_result_persist now has a tiny fake-API end-to-end harness (extensions/openclaw-mem/toolResultPersistE2E.test.mjs) that asserts emitted episodic tool.result JSONL lines are redacted/non-leaking and still preserve benign utility text. - Shipped hardening follow-up (2026-04-27, stdout/stderr): the same e2e harness now verifies stdout/stderr-style payload summaries collapse to bounded result captured (output redacted) posture and that no secret-like needles appear anywhere in emitted tool.result JSONL rows. - Shipped hardening follow-up (2026-04-27, structured JSON stdout/stderr): structured JSON-style tool outputs containing stdout/stderr fields now must collapse to bounded redacted-output posture end-to-end, while benign JSON/docs payload snippets without output fields are verified to stay informative. - Shipped hardening follow-up (2026-04-27, escaped output-key docs prose): a negative-control e2e now proves benign docs text that mentions JSON-escaped key strings like \"stdout\" / \"stderr\" remains informative end-to-end and does not falsely collapse to redacted-output posture. - Shipped hardening follow-up (2026-04-27, malformed-json array-first boundary): malformed JSON-like payload coverage now includes top-level, nested object/array, and root array-first ([) boundary cases, asserting quoted "stdout"/"stderr" terms inside prose string content stay informative while true malformed key-like output fields with full OUTPUT_FIELD_KEYS parity (stdout, stderr, raw_stdout, raw_stderr, tool_output, command_output) still collapse.

Acceptance criteria: - Operators can enable aggressive pruning without losing the ability to recover key tool outcomes from openclaw-mem. - Focused verification stays cheap and deterministic: - uv run --group dev python -m pytest tests/test_plugin_episodic_summary_runtime.py tests/test_plugin_episodic_spool.py tests/test_episodic_secret_detection.py -q - node --test extensions/openclaw-mem/toolResultSummary.test.mjs - node --experimental-strip-types --test extensions/openclaw-mem/toolResultPersistE2E.test.mjs (includes plain + structured JSON stdout/stderr collapse, escaped output-key docs negative-control, malformed-json top-level+nested+array-first boundary assertions with full OUTPUT_FIELD_KEYS parity (stdout, stderr, raw_stdout, raw_stderr, tool_output, command_output), no-leak assertions, and benign JSON/docs non-overblock assertions)

11) Contract hardening (interface-first) — stable schemas + fail-fast validation¶

Goal: reduce “silent drift” by treating CLI outputs + configs as interfaces with explicit contracts.

Deliverables: - Stable JSON output schemas (v0) for key operator surfaces: - harvest --json summary (total_seen, graded_filled, skipped_existing, ...) - triage --json (needs_attention, found_new, ...) - pack --trace receipt (openclaw-mem.pack.trace.v1) - Schema tests (unit-level) that verify: - required keys exist - types are stable - unknown keys are either rejected (strict) or explicitly tolerated (documented) - Strict config contract where feasible: - plugin config schema uses additionalProperties: false (or equivalent) to surface misconfig early - profile / stats surface (DONE): - openclaw-mem profile --json for deterministic ops snapshots (counts, importance distribution, recent rows, embedding stats)

Acceptance criteria: - A breaking shape change fails tests before release. - Cron/ops can rely on JSON outputs without regex-parsing or brittle prompt assumptions.

Later (optional, higher ambition)¶

Hybrid improvements: rerank / eval harnesses
Additional scorers (LLM-assisted grading as opt-in, with strict cost caps)
Optional protocol adapters (e.g., MCP-compatible surfaces) without losing local-first defaults

Thought links (design references)¶

These are projects we referenced and actually used to shape features or architecture.

Daniel Miessler — Personal AI Infrastructure (PAI): https://github.com/danielmiessler/Personal_AI_Infrastructure
Used as an architectural checklist (memory tiers, hooks, user/system separation, continuous improvement).
Hao Square — MCP Tool Search for Claude Code: https://haosquare.com/mcp-tool-search-claude-code/
Used to justify the “card → manual” split and dynamic discovery pattern for SOP/skills (context-size friendly).
tobi/qmd: https://github.com/tobi/qmd
Used to shape our hybrid retrieval direction (FTS5 (BM25 scoring) + vectors + fusion + rerank) and the benchmarking plan for a “retrieval router” arm.
1Password — From magic to malware: How OpenClaw's agent skills become an attack surface: https://1password.com/blog/from-magic-to-malware-how-openclaws-agent-skills-become-an-attack-surface
Used to motivate provenance + trust tiers and “trust-aware” context packing (helpful content can still be hostile).
thedotmack/claude-mem: https://github.com/thedotmack/claude-mem
Strong early inspiration for an agent memory layer design; we credit it explicitly (see ACKNOWLEDGEMENTS.md).
volcengine/OpenViking: https://github.com/volcengine/OpenViking
Used as a design reference for layered context loading (L0/L1/L2) and retrieval observability (trajectory/trace). Thought-link only; not a backend commitment.
martian-engineering/lossless-claw (LCM / lossless context engine): https://github.com/martian-engineering/lossless-claw
Used as a design reference for fresh-tail protection, provenance-first summarization, and “expand for details” tooling. Thought-link only; we are not committing to an engine fork.
Reference-based decay / archive-first lifecycle (trusted background + field note):
Cepeda et al. (2006) distributed practice / spaced repetition: https://doi.org/10.1037/0033-2909.132.3.354
ARC cache replacement (recency+frequency): https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf
X thread (untrusted inspiration): https://x.com/ohxiyu/status/2022924956594806821 ecency+frequency): https://www.usenix.org/legacy/publications/library/proceedings/fast03/tech/full_papers/megiddo/megiddo.pdf
X thread (untrusted inspiration): https://x.com/ohxiyu/status/2022924956594806821