Field notes · The architecture, long form

The Agent That Doesn't Forget

Every agent demo is a sprint. Real work is a marathon. The distance between them is memory — and almost nobody engineers it. Here is the two-layer architecture I'm building so a long, hard task survives the run: virtual sessions for continuity, RLM for deep reasoning, and a control plane that writes a receipt for both.

There's a failure mode of AI agents that never makes it into a screenshot.

Hour three of a long task, the agent confidently redoes something it already finished. Or it contradicts a decision it made an hour ago. Or it re-reads the same file for the fifth time, having lost the note it wrote itself about what was in it. It didn't crash. It didn't error. It forgot — and it forgot silently.

This is the single biggest reason "autonomous agents" demo beautifully and disappoint in production. A demo is short enough to fit in one context window. A real task — refactor this subsystem, triage this incident, work through this spec — is not. And the moment a task outgrows the window, something has to give.

Why agents forget

A model has a bounded context window. When a long conversation approaches it, the usual fix is compaction: summarize the history, drop the detail, continue. It works once. The problem is what happens the third and fourth time.

Compaction is lossy and it is silent. Each pass throws away detail the model judged less important at that moment — and "less important right now" is exactly how a constraint set six hours ago, or a dead-end already explored, quietly disappears. Compress a compression of a compression and task state doesn't just shrink; it drifts. The agent ends up confidently wrong about its own history.

Most frameworks hand this problem straight back to the model: "summarize yourself and keep going." Which means the most dangerous failure in long-horizon work is both invisible (no error is raised) and unrecoverable (the dropped detail is gone). That's not a prompting bug. That's a missing engineering layer.

Continuity is not a property you can prompt for. It's a data structure you have to own.

Two things a long task actually needs

Before reaching for a fix, it's worth separating two needs people constantly conflate — because they require completely different machinery:

"Just use a bigger context window" fails both. "Just summarize and continue" sacrifices the first to fake the second. The honest answer is two layers, each doing one job well.

Layer one: virtual sessions — continuity as a structure

The conceptual move is small and it changes everything: separate the logical identity of the task from the physical session of the model.

The mechanism, exactly as it runs in the harness today:

[Figure 1: Physical sessions are disposable; the virtual session is the spine]
Virtual session vsession-… · one stable task identity across every roll-over
▸ concrete S1 · continuationIndex 0 · compact, compact → threshold → ROLL OVER
▸ concrete S2 · continuationIndex 1 · compact, compact → threshold → ROLL OVER
▸ concrete S3 · continuationIndex 2 · fresh context, work continues
Each roll-over carries the WORKING SET → goal · decisions · files · blockers
FIG. 1 — at the compaction threshold, the task rolls into a fresh session under the same virtual id, carrying a bounded working set instead of a degraded one.
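
The rollover in Fig. 1 can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the harness's actual code: the type names (`VirtualSession`, `ConcreteSession`, `WorkingSet`), the method names, and the constant `COMPACTION_THRESHOLD` are hypothetical. Only the continuation index and the working-set fields (goal, decisions, files, blockers) come from the figure.

```rust
/// Bounded state carried across a rollover (fields named after Fig. 1).
#[derive(Debug, Clone, PartialEq, Default)]
pub struct WorkingSet {
    pub goal: String,
    pub decisions: Vec<String>,
    pub files: Vec<String>,
    pub blockers: Vec<String>,
}

/// One disposable physical session of the model.
#[derive(Debug, Clone, PartialEq)]
pub struct ConcreteSession {
    pub continuation_index: u32,
    pub compactions: u32,
}

/// The stable logical identity of the task.
#[derive(Debug, Clone, PartialEq)]
pub struct VirtualSession {
    pub id: String,
    pub current: ConcreteSession,
    pub working_set: WorkingSet,
}

/// Hypothetical threshold: the figure shows two compactions before rollover.
pub const COMPACTION_THRESHOLD: u32 = 2;

impl VirtualSession {
    pub fn new(id: &str, goal: &str) -> Self {
        VirtualSession {
            id: id.to_string(),
            current: ConcreteSession { continuation_index: 0, compactions: 0 },
            working_set: WorkingSet { goal: goal.to_string(), ..WorkingSet::default() },
        }
    }

    /// Record one compaction of the current physical session.
    pub fn record_compaction(&mut self) {
        self.current.compactions += 1;
    }

    /// Per-turn check. At the threshold, swap in a fresh physical session and
    /// return true. The virtual id and the working set are left untouched;
    /// only the concrete session changes.
    pub fn maybe_roll_over(&mut self) -> bool {
        if self.current.compactions < COMPACTION_THRESHOLD {
            return false;
        }
        self.current = ConcreteSession {
            continuation_index: self.current.continuation_index + 1,
            compactions: 0,
        };
        true
    }
}
```

The design choice worth noticing is that the working set is a field of the virtual session, not of the concrete one, so discarding a physical session cannot discard task state.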

// the design philosophy

Four principles fall out of this, and they're the whole point:

// honest status — because this whole project is about that

live-wired: The rollover check runs on every turn in the reference deployment. The mechanism, the working set, the guards, the receipts: all shipped, all tested without a model call.
proof gated: But daily chat rarely compacts twice, so a real rollover hasn't fired in production yet. The end-to-end forced-rollover proof (fresh session, same virtual id, working-set injection, no unsafe rewrite, traced through to final delivery) is a published gate, not a claim. The honest gap is the point.

But continuity isn't intelligence

Here's the trap. Virtual sessions keep a task alive and coherent. They do not make the model think harder. And a lot of long tasks are hard not because they're long, but because the evidence is enormous:

A bounded working set is precisely the wrong tool here. You don't want to summarize the evidence away — you want to compute over it: filter, chunk, cross-check, map, reduce. That's a different engine.

Layer two: RLM — reasoning as a bounded engine

This is where Recursive Language Models come in. The idea: instead of stuffing everything into one prompt, the root model writes code in a REPL-like loop — it keeps variables and files across iterations, it calls a typed sub-model (predict(...)) recursively for bounded sub-work, and it keeps going until it explicitly SUBMITs a structured answer. Every iteration, subcall, tool call, and token is captured in a trace.

Why it fits evidence-heavy work: programmatic map/reduce over huge inputs, persistent intermediate state, typed recursive subcalls, and an auditable reasoning trace instead of a black-box monologue. It is, in effect, a reasoning memory — the way the virtual session is a continuity memory.
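
To make the loop concrete, here is a minimal sketch of a bounded iterate/predict/SUBMIT cycle. It illustrates the idea only; it is not the project's engine, which the essay says is not built yet. Every name here (`Job`, `Step`, `TraceEvent`, `run_job`, the two budget parameters) is an assumption, and the sub-model is stood in for by a plain function pointer.

```rust
use std::collections::HashMap;

/// What the root step decides at the end of one iteration.
pub enum Step {
    Continue,
    Submit(String),
}

/// One captured event. A real trace would also carry tool calls and tokens.
#[derive(Debug, Clone, PartialEq)]
pub enum TraceEvent {
    Iteration(u32),
    Subcall { input: String, output: String },
    Submitted(String),
    BudgetExhausted,
}

/// State that persists across iterations, plus the subcall budget and trace.
pub struct Job {
    pub vars: HashMap<String, String>,
    pub trace: Vec<TraceEvent>,
    subcalls_left: u32,
    predict_fn: fn(&str) -> String,
}

impl Job {
    /// Bounded sub-model call: returns None once the subcall budget is spent.
    pub fn predict(&mut self, input: &str) -> Option<String> {
        if self.subcalls_left == 0 {
            return None;
        }
        self.subcalls_left -= 1;
        let output = (self.predict_fn)(input);
        self.trace.push(TraceEvent::Subcall {
            input: input.to_string(),
            output: output.clone(),
        });
        Some(output)
    }
}

/// Run `root_step` until it submits or the iteration budget runs out.
pub fn run_job(
    predict_fn: fn(&str) -> String,
    max_iterations: u32,
    max_subcalls: u32,
    mut root_step: impl FnMut(&mut Job, u32) -> Step,
) -> (Option<String>, Vec<TraceEvent>) {
    let mut job = Job {
        vars: HashMap::new(),
        trace: Vec::new(),
        subcalls_left: max_subcalls,
        predict_fn,
    };
    for i in 0..max_iterations {
        job.trace.push(TraceEvent::Iteration(i));
        if let Step::Submit(answer) = root_step(&mut job, i) {
            job.trace.push(TraceEvent::Submitted(answer.clone()));
            return (Some(answer), job.trace);
        }
    }
    job.trace.push(TraceEvent::BudgetExhausted);
    (None, job.trace)
}
```

A caller would pass a step closure that handles one chunk of evidence per iteration, keeps a running result in `vars`, and returns `Step::Submit` once the chunks are exhausted. Running out of budget is an explicit, traced outcome here, not a silent stop.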

Full disclosure, in this project's house style: RLM here is a design note, not yet built. What follows is the blueprint — and the entire value of the blueprint is the boundary it draws.

The boundary is the design

The temptation, once you have a recursive reasoning engine, is to let it run the show. That's the mistake. The rule that makes this engineering-grade rather than a clever toy:

The harness is the outer durable control plane. RLM is an inner bounded reasoning engine — a worker job, not an autonomous agent.
[Figure 2: Two layers; the boundary is the product]
AGENT HARNESS · durable control plane
▸ lifecycle & virtual-session rollover
▸ durable queue · leases · watchdogs
▸ permissions · fail-closed gates
▸ human steering · /stop · /steer
▸ receipts for every step
▸ side effects (the only layer that has them)
RLM WORKER JOB · reasoning engine
▸ iterate in a REPL loop
▸ predict() typed subcalls
▸ persist vars & files
▸ map / reduce over evidence
▸ structured result + full trace
RLM returns result + trace → the harness executes side effects, not RLM
✗ RLM cannot touch: gateway control · auth/identity · destructive fs · external send · spend · live cutover
untrusted model code → staged workspace · declared tools only · network off by default · every call traced
FIG. 2 — RLM is a sandboxed worker job inside the harness, not an agent above it. It reasons; the harness acts.
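
One way to picture that boundary: the worker job may only propose effects, and the harness decides, fail-closed, which ones it will carry out, writing a receipt either way. The sketch below is an assumption-heavy illustration, not the design note's job spec. `ProposedEffect`, `Policy`, `Receipt`, and `apply_result` are hypothetical names, and `Executed` only records the decision; nothing is actually performed.

```rust
/// Effects a worker job may propose. It never executes them itself.
#[derive(Debug, Clone, PartialEq)]
pub enum ProposedEffect {
    WriteStagedFile { path: String },
    DeleteFile { path: String },
    ExternalSend { to: String },
    Spend { cents: u64 },
}

/// Structured result returned by the job (trace omitted for brevity).
pub struct JobResult {
    pub answer: String,
    pub proposed: Vec<ProposedEffect>,
}

/// The harness writes one receipt per proposed effect, allowed or not.
#[derive(Debug, Clone, PartialEq)]
pub enum Receipt {
    Executed(ProposedEffect),
    Denied { effect: ProposedEffect, reason: String },
}

/// Hypothetical policy: writes are allowed only under the staged workspace.
pub struct Policy {
    pub staged_root: String,
}

impl Policy {
    /// Fail-closed: anything not explicitly allowed falls through to a denial.
    fn check(&self, effect: &ProposedEffect) -> Result<(), String> {
        match effect {
            ProposedEffect::WriteStagedFile { path }
                if path.starts_with(self.staged_root.as_str()) && !path.contains("..") =>
            {
                Ok(())
            }
            ProposedEffect::WriteStagedFile { .. } => {
                Err("write outside the staged workspace".to_string())
            }
            _ => Err("effect class not permitted for worker jobs".to_string()),
        }
    }
}

/// The harness, not the job, turns proposals into receipts.
pub fn apply_result(policy: &Policy, result: &JobResult) -> Vec<Receipt> {
    result
        .proposed
        .iter()
        .map(|effect| match policy.check(effect) {
            Ok(()) => Receipt::Executed(effect.clone()),
            Err(reason) => Receipt::Denied { effect: effect.clone(), reason },
        })
        .collect()
}
```

The wildcard arm is the point: a new effect variant added later is denied until someone writes an explicit rule for it.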

Two memories, one long task

Now put both layers together and a complex, multi-day task stops being a gamble.

The virtual session is the spine — the continuity memory: what the task is, where we are, what we decided, what's blocked. It survives compaction, process death, and restarts, because it's a durable structure with receipts, not a context window.

When a turn hits something that needs deep evidence work, it dispatches an RLM worker job — the reasoning memory: bounded iteration and subcall state over large evidence, sandboxed and budget-capped. Its structured result flows back into the working set as a decision, a validation, or an artifact reference. The continuity memory records the conclusion; the reasoning memory keeps the receipts of how it got there.

[Figure 3: Two memories for one long task]
CONTINUITY MEMORY · virtual session · working set
▸ what the task is
▸ where we are · what's decided
▸ open files · blockers
survives compaction · death · restart
REASONING MEMORY · RLM trace · vars · subcalls
▸ iterate over huge evidence
▸ typed recursive subcalls
▸ structured result + trace
bounded · sandboxed · budgeted
RLM result → folded back into the working set: conclusion stored in continuity · how-we-got-there kept in the trace
FIG. 3 — continuity keeps the task coherent; reasoning goes deep when it has to. Each feeds the other, and the harness receipts both.
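
The fold-back step can be sketched as follows, assuming a hypothetical result shape. The three outcome kinds (decision, validation, artifact reference) come from the text; the type names, the `trace_id` pointer, and the rule that a failed validation becomes a blocker are illustrative assumptions only.

```rust
use std::collections::BTreeMap;

/// The three ways a worker result can land in the working set.
#[derive(Debug, Clone, PartialEq)]
pub enum RlmOutcome {
    Decision { key: String, value: String },
    Validation { claim: String, holds: bool },
    ArtifactRef { path: String },
}

/// Structured result plus a pointer to the full trace (hypothetical id).
pub struct RlmResult {
    pub outcome: RlmOutcome,
    pub trace_id: String,
}

/// A stored conclusion and where to find how it was reached.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub value: String,
    pub trace_id: String,
}

#[derive(Debug, Default)]
pub struct WorkingSet {
    pub decisions: BTreeMap<String, Entry>,
    pub blockers: Vec<String>,
    pub files: Vec<String>,
}

impl WorkingSet {
    /// Fold one worker result into continuity state. Only the conclusion is
    /// stored here; the reasoning stays in the trace behind `trace_id`.
    pub fn fold(&mut self, result: RlmResult) {
        let trace_id = result.trace_id;
        match result.outcome {
            // Last write wins: a newer decision replaces the older one.
            RlmOutcome::Decision { key, value } => {
                self.decisions.insert(key, Entry { value, trace_id });
            }
            // Assumption: a failed validation becomes an open blocker,
            // and a passing one clears a matching blocker.
            RlmOutcome::Validation { claim, holds } => {
                self.blockers.retain(|b| b != &claim);
                if !holds {
                    self.blockers.push(claim);
                }
            }
            // Artifact references are recorded once.
            RlmOutcome::ArtifactRef { path } => {
                if !self.files.contains(&path) {
                    self.files.push(path);
                }
            }
        }
    }
}
```

Keeping only a trace id in the working set is what keeps it bounded: the conclusion travels across rollovers, while the evidence trail stays addressable without being carried.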

"But isn't that what a vector database is for?"

It's the first objection every engineer raises, and it deserves a straight answer. Long-term memory systems — vector databases like Qdrant, and the agent-memory layers built on top of them (openclaw-mem, MemPalace, Mem0, Letta/MemGPT-style archives) — are genuinely useful. They are also solving a different problem, and they do not fix the forgetting this essay is about.

Here's the distinction that matters. A vector memory answers "what do I know from before?" — it retrieves the top-k semantically similar fragments from everything the agent has ever seen. The working set answers "where am I right now in this task?" — the current goal, the live decisions, the open blockers. Recall is about similarity; continuity is about state. One genuinely cannot stand in for the other.
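
A toy contrast shows the difference. This sketch is not how Qdrant or any of the named memory layers work internally; it is a hand-rolled cosine top-k over made-up two-dimensional embeddings, set beside a plain key-value working set. All names and vectors are hypothetical.

```rust
use std::collections::HashMap;

/// A remembered fragment with a toy embedding.
pub struct Fragment {
    pub text: String,
    pub embedding: Vec<f32>,
}

/// Cosine similarity between two vectors (0.0 if either has zero length).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Recall: the k fragments most similar to the query, best first.
pub fn recall_top_k<'a>(store: &'a [Fragment], query: &[f32], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = store
        .iter()
        .map(|f| (cosine(&f.embedding, query), f.text.as_str()))
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, text)| text).collect()
}

/// Continuity: authoritative current state, last write wins.
#[derive(Default)]
pub struct WorkingSet {
    decisions: HashMap<String, String>,
}

impl WorkingSet {
    pub fn decide(&mut self, key: &str, value: &str) {
        self.decisions.insert(key.to_string(), value.to_string());
    }

    pub fn current(&self, key: &str) -> Option<&str> {
        self.decisions.get(key).map(|v| v.as_str())
    }
}
```

With a store holding an early "use sqlite" fragment and a later "switch to postgres" fragment, a query that happens to sit closest to the first returns the superseded decision, because similarity has no notion of which one is still in force. `current("db")` after two `decide` calls returns only the latest value.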

Try to use retrieval as continuity and the seams show immediately:

None of this is an argument against long-term memory — Agent Harness Core integrates it, externally and fail-open, with citations. It's an argument against asking it to do a job it was never shaped for. The two memories we opened with run this task; long-term memory is a third, orthogonal concern — knowledge across tasks — and a serious long-task agent needs all three:

[Figure 4: Three layers; complements, not substitutes]
KNOWLEDGE MEMORY · "what do I know from before?"
▸ vector / semantic recall
▸ Qdrant · openclaw-mem · MemPalace · Mem0 · Letta
▸ scope: ACROSS tasks & time
▸ top-k similarity · fail-open
✗ without it: amnesiac across sessions, re-learns prefs & lessons
CONTINUITY MEMORY · "where am I now in this task?"
▸ virtual session · working set
▸ goal · decisions · blockers
▸ authoritative current state
▸ scope: WITHIN the task
▸ last-write-wins · receipted
✗ without it: forgets mid-task the moment context compacts
REASONING MEMORY · "how do I think over this pile?"
▸ RLM · sandboxed worker job
▸ iterate · predict() subcalls
▸ vars · files · full trace
▸ scope: RIGHT NOW
▸ bounded · budgeted
✗ without it: shallow on big evidence, summarizes away the answer
FIG. 4 — three orthogonal memory concerns. Retrieval recalls the past; the working set holds the present; RLM reasons over the pile in front of you.

Pull any one and you get a characteristic failure. Drop long-term memory and the agent is an amnesiac across sessions, re-learning your preferences and last week's lessons every time. Drop the virtual session and it forgets within the task the moment context compacts — the failure we started with. Drop RLM and it goes shallow on the evidence-heavy steps, summarizing away the very thing it needed to compute over.

Retrieval recalls the past. The working set holds the present. RLM reasons over the pile in front of you. They are complements, not substitutes — swap any one for another and you've made a category error.

What "engineering grade" actually means

It's a phrase people throw around, so let me make it concrete. A long-task agent is engineering-grade when these properties hold across both layers:

Long-task capability isn't one clever feature. It's these five properties holding while the model threads churn underneath and the reasoning goes deep on demand. The model is a contractor. The harness is the general contractor that keeps the receipts — and a task you can leave running for a week is the deliverable.

// the honest ledger for this essay

shipped: Virtual sessions, working-set rollover, guards, and receipts; live-wired into every turn; tested without a model call.
gated: The end-to-end forced-rollover proof under live traffic is a named, published release gate, not yet earned.
design note: RLM integration. The boundary, the job spec, the failure classes, and the phased plan are specified in the repo; the engine is not built yet.
integrated: Long-term memory. openclaw-mem hooks with vector recall over imported embeddings, external and fail-open with citations; full autonomous-graph parity is itself a tracked gate.

That split — shipped vs. gated vs. designed — is deliberate, and it's the same discipline the rest of this project runs on. In a world where anyone can generate an impressive architecture diagram in an afternoon, the differentiator isn't the diagram. It's whether each box points at code, a receipt, or an honest "not yet."

An agent you can leave running on something hard is worth more than one that dazzles for five minutes. That's the whole bet.

And yes: like the code it describes, this essay was drafted in pairing with the same family of models the harness runs. Welcome to 2026.


Agent Harness Core: a self-hosted AI agent runtime in Rust. Six dependencies. No async runtime. 500+ tests without a model call. Every step gated, every step receipted. Pre-release, Windows-first, dual-licensed MIT/Apache-2.0.