There's a failure mode of AI agents that never makes it into a screenshot.
Hour three of a long task, the agent confidently redoes something it already finished. Or it contradicts a decision it made an hour ago. Or it re-reads the same file for the fifth time, having lost the note it wrote itself about what was in it. It didn't crash. It didn't error. It forgot — and it forgot silently.
This is the single biggest reason "autonomous agents" demo beautifully and disappoint in production. A demo is short enough to fit in one context window. A real task — refactor this subsystem, triage this incident, work through this spec — is not. And the moment a task outgrows the window, something has to give.
Why agents forget
A model has a bounded context window. When a long conversation approaches that limit, the usual fix is compaction: summarize the history, drop the detail, continue. It works once. The problem is what happens the third and fourth time.
Compaction is lossy and it is silent. Each pass throws away detail the model judged less important at that moment — and "less important right now" is exactly how a constraint set six hours ago, or a dead-end already explored, quietly disappears. Compress a compression of a compression and task state doesn't just shrink; it drifts. The agent ends up confidently wrong about its own history.
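A toy model makes the compounding concrete. Assume, purely for illustration (the figure is not measured from any real system), that each compaction pass independently keeps any given detail with probability p. Then survival across n passes is:

```latex
P(\text{detail survives } n \text{ passes}) = p^{n},
\qquad \text{e.g. } p = 0.9,\ n = 4:\quad 0.9^{4} \approx 0.656
```

Under that assumption, roughly a third of the details present before the first pass are gone after the fourth, and nothing in the transcript signals which ones.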
Most frameworks hand this problem straight back to the model: "summarize yourself and keep going." Which means the most dangerous failure in long-horizon work is both invisible (no error is raised) and unrecoverable (the dropped detail is gone). That's not a prompting bug. That's a missing engineering layer.
Continuity is not a property you can prompt for. It's a data structure you have to own.
Two things a long task actually needs
Before reaching for a fix, it's worth separating two needs people constantly conflate — because they require completely different machinery:
- Continuity — staying coherent about the task across time: across model-session churn, across process restarts, across days. Knowing what the task is, what's been decided, what's blocked, where we are.
- Reasoning depth — thinking hard over evidence too large for one prompt: a 500-page spec, 12,000 lines of CI logs, a change whose blast radius spans a whole codebase.
"Just use a bigger context window" fails both. "Just summarize and continue" sacrifices the first to fake the second. The honest answer is two layers, each doing one job well.
Layer one: virtual sessions — continuity as a structure
The conceptual move is small and it changes everything: separate the logical identity of the task from the physical session of the model.
- A concrete session is one model thread. It is disposable. It can be compacted, it can die, it can be replaced.
- A virtual session is a stable id that spans many concrete sessions. It is the task's spine — the thing that persists while the model threads underneath it churn.
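The split between the two kinds of session can be sketched as a pair of types. This is a minimal illustration, not the harness's actual code: `VirtualSession`, `ConcreteSession`, `replace_active`, and every field name are hypothetical.

```rust
/// Stable identity of the task; survives any number of model threads.
/// All names in this sketch are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VirtualSessionId(pub String);

/// One disposable model thread.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ConcreteSession {
    pub thread_id: u64,
    /// Position of this thread in the virtual session's chain (0 = first).
    pub continuation_index: u32,
}

/// The task's spine: a stable id plus the chain of threads that served it.
#[derive(Debug)]
pub struct VirtualSession {
    pub id: VirtualSessionId,
    chain: Vec<ConcreteSession>,
    next_thread_id: u64,
}

impl VirtualSession {
    pub fn new(id: &str) -> Self {
        let first = ConcreteSession { thread_id: 0, continuation_index: 0 };
        VirtualSession {
            id: VirtualSessionId(id.to_string()),
            chain: vec![first],
            next_thread_id: 1,
        }
    }

    /// The thread currently doing the work.
    pub fn active(&self) -> &ConcreteSession {
        self.chain.last().expect("chain always holds at least one session")
    }

    /// Replace the active thread; the virtual id does not change.
    pub fn replace_active(&mut self) -> &ConcreteSession {
        let next = ConcreteSession {
            thread_id: self.next_thread_id,
            continuation_index: self.active().continuation_index + 1,
        };
        self.next_thread_id += 1;
        self.chain.push(next);
        self.active()
    }

    /// Number of concrete sessions that have served this task so far.
    pub fn chain_len(&self) -> usize {
        self.chain.len()
    }
}
```

The point of the shape is that nothing outside `VirtualSession` needs to know which thread is live: callers hold the stable id, and the chain underneath can grow.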
The mechanism, exactly as it runs in the harness today:
- Count only the compactions that actually cost you — the ones where the compaction succeeded and rewrote the active context. Not every turn. Only the lossy ones.
- Past a threshold, stop degrading. Instead of compressing a fifth time, roll over: open a fresh concrete session under the same virtual id, continuation index incremented.
- Carry forward a bounded, explicit, structured working set — goal and completion criteria, active plan refs, decisions, recent files, validation results, open blockers, a continuation note — injected into the fresh session as a first-class "working-set continuity" block.
- Guard the in-flight work. Rollover refuses to rewrite a queue item that's already been prepared or leased by a runtime worker — so a clean restart never corrupts a turn that's mid-execution.
- Respect task boundaries. /new is a hard boundary: a new task does not inherit the previous task's working set. Continuity is scoped to the task, not leaked across tasks.
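The five steps above can be sketched as a small policy object. Treat it as a reading aid under stated assumptions: the type and method names are invented for illustration, the working-set fields simply mirror the list above, and "past a threshold" is interpreted here as "lossy count has reached the threshold".

```rust
/// Hypothetical names throughout; a sketch of the policy, not the harness's code.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct WorkingSet {
    pub goal: String,
    pub completion_criteria: Vec<String>,
    pub active_plan_refs: Vec<String>,
    pub decisions: Vec<String>,
    pub recent_files: Vec<String>,
    pub validation_results: Vec<String>,
    pub open_blockers: Vec<String>,
    pub continuation_note: String,
}

/// State of the queue item the session is attached to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueueItemState { Pending, Prepared, Leased }

#[derive(Debug, PartialEq)]
pub enum RolloverOutcome {
    /// Below the threshold: keep using the current concrete session.
    NotNeeded,
    /// In-flight work is protected; retry once the item is released.
    Refused,
    /// Fresh concrete session under the same virtual id.
    RolledOver { continuation_index: u32, carried: WorkingSet },
}

#[derive(Debug)]
pub struct TaskSession {
    pub virtual_id: String,
    pub continuation_index: u32,
    pub lossy_compactions: u32,
    pub working_set: WorkingSet,
}

impl TaskSession {
    pub fn new(virtual_id: &str, goal: &str) -> Self {
        TaskSession {
            virtual_id: virtual_id.to_string(),
            continuation_index: 0,
            lossy_compactions: 0,
            working_set: WorkingSet { goal: goal.to_string(), ..WorkingSet::default() },
        }
    }

    /// Count a compaction only if it succeeded and rewrote the active context.
    pub fn record_compaction(&mut self, succeeded: bool, rewrote_context: bool) {
        if succeeded && rewrote_context {
            self.lossy_compactions += 1;
        }
    }

    /// Roll over once the lossy count reaches `threshold` (assumed semantics),
    /// unless the queue item is already prepared or leased.
    pub fn maybe_roll_over(&mut self, threshold: u32, item: QueueItemState) -> RolloverOutcome {
        if self.lossy_compactions < threshold {
            return RolloverOutcome::NotNeeded;
        }
        if item != QueueItemState::Pending {
            return RolloverOutcome::Refused;
        }
        self.continuation_index += 1;
        self.lossy_compactions = 0;
        RolloverOutcome::RolledOver {
            continuation_index: self.continuation_index,
            carried: self.working_set.clone(),
        }
    }

    /// `/new`: a hard boundary. The old session is consumed and nothing
    /// from its working set is inherited.
    pub fn new_task(self, virtual_id: &str, goal: &str) -> TaskSession {
        TaskSession::new(virtual_id, goal)
    }
}
```

Note that a refused rollover leaves the counter untouched, so the next attempt after the lease is released still triggers.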
// the design philosophy
Four principles fall out of this, and they're the whole point:
- Explicit over implicit. Continuity is a structure the harness owns and writes down — not a summary the model hopes it kept. You can read the working set. You can diff it. You can audit what the agent believes about the task right now.
- Bounded over unbounded. The working set is size-bounded by design. That keeps the prompt prefix stable and cache-friendly, and — crucially — it means compression doesn't compound. You reset to a clean, bounded state instead of summarizing a summary.
- Durable and auditable. Every rollover writes a receipt. The continuity chain — which concrete session, which index, what was carried — is reconstructable after the fact. Continuity you can't audit is just hope with extra steps.
- The model is a contractor, not the owner of its own memory. The harness owns the task's spine. The model does a turn and hands back work. If it dies, the task doesn't.
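Two of these principles, bounded and auditable, are easy to show in miniature. The sketch below is an assumption-laden illustration: `Bounded`, `RolloverReceipt`, and `reconstruct_chain` are hypothetical names, and evicting the oldest entry is just one possible bounding policy (a real harness might prefer explicit pruning for decisions).

```rust
use std::collections::VecDeque;

/// A size-bounded list: pushing past the cap evicts the oldest entry,
/// so carried state cannot grow without limit. (Hypothetical type.)
#[derive(Debug, Clone)]
pub struct Bounded<T> {
    cap: usize,
    items: VecDeque<T>,
}

impl<T> Bounded<T> {
    pub fn new(cap: usize) -> Self {
        Bounded { cap, items: VecDeque::new() }
    }

    /// Append an item; returns the evicted entry, if any.
    pub fn push(&mut self, item: T) -> Option<T> {
        self.items.push_back(item);
        if self.items.len() > self.cap { self.items.pop_front() } else { None }
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn iter(&self) -> impl Iterator<Item = &T> {
        self.items.iter()
    }
}

/// One record per rollover, appended to a ledger. (Hypothetical shape.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RolloverReceipt {
    pub virtual_id: String,
    pub from_index: u32,
    pub to_index: u32,
    pub carried_fields: Vec<String>,
}

/// Rebuild the continuation chain for one task from an append-only ledger.
/// Returns None if the receipts for that task do not link up.
pub fn reconstruct_chain(ledger: &[RolloverReceipt], virtual_id: &str) -> Option<Vec<u32>> {
    let mut chain = vec![0u32];
    for r in ledger.iter().filter(|r| r.virtual_id == virtual_id) {
        if r.from_index != *chain.last()? || r.to_index != r.from_index + 1 {
            return None;
        }
        chain.push(r.to_index);
    }
    Some(chain)
}
```

A chain that fails to reconstruct is itself a useful signal: it means a rollover happened without a receipt, which is exactly the situation the principles rule out.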
But continuity isn't intelligence
Here's the trap. Virtual sessions keep a task alive and coherent. They do not make the model think harder. And a lot of long tasks are hard not because they're long, but because the evidence is enormous:
- Read a 500-page spec and find the three places it contradicts itself.
- Triage a flaky CI failure whose cause is scattered across long logs and many files.
- Trace the blast radius of a change through a codebase before you touch shared runtime, schema, or queue behavior.
- QA a generated artifact — a report, a spreadsheet, a benchmark — against its own source data.
A bounded working set is precisely the wrong tool here. You don't want to summarize the evidence away — you want to compute over it: filter, chunk, cross-check, map, reduce. That's a different engine.
Layer two: RLM — reasoning as a bounded engine
This is where Recursive Language Models come in. The idea: instead of stuffing everything into one prompt, the root model writes code in a REPL-like loop — it keeps variables and files across iterations, it calls a typed sub-model (predict(...)) recursively for bounded sub-work, and it keeps going until it explicitly SUBMITs a structured answer. Every iteration, subcall, tool call, and token is captured in a trace.
Why it fits evidence-heavy work: programmatic map/reduce over huge inputs, persistent intermediate state, typed recursive subcalls, and an auditable reasoning trace instead of a black-box monologue. It is, in effect, a reasoning memory — the way the virtual session is a continuity memory.
Full disclosure, in this project's house style: RLM here is a design note, not yet built. What follows is the blueprint — and the entire value of the blueprint is the boundary it draws.
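Since RLM is unbuilt, the following is only a guess at the loop's skeleton, written to make the moving parts visible. Everything in it is hypothetical: `Step`, `TraceEvent`, and `run_rlm` are invented names, the "root model" and "sub-model" are plain closures, and recursion depth is left out (a `predict` closure could itself call `run_rlm`).

```rust
use std::collections::HashMap;

/// What the root model's code asks for in one iteration. (Hypothetical shape.)
pub enum Step {
    /// Store an intermediate value under a variable name.
    Set(String, String),
    /// Ask the sub-model a bounded question; the reply lands in `var`.
    Predict { var: String, prompt: String },
    /// Finish with a structured answer.
    Submit(String),
}

#[derive(Debug, PartialEq)]
pub enum TraceEvent {
    Iteration(usize),
    Subcall { prompt: String, reply: String },
    Submitted(String),
}

#[derive(Debug, PartialEq)]
pub enum RunError {
    BudgetExhausted,
}

/// Drive the loop until SUBMIT or until `max_iterations` is spent.
/// Variables persist across iterations; every step lands in the trace.
pub fn run_rlm<R, P>(
    mut root: R,
    mut predict: P,
    max_iterations: usize,
) -> (Result<String, RunError>, Vec<TraceEvent>)
where
    R: FnMut(&HashMap<String, String>) -> Step,
    P: FnMut(&str) -> String,
{
    let mut vars: HashMap<String, String> = HashMap::new();
    let mut trace = Vec::new();
    for i in 0..max_iterations {
        trace.push(TraceEvent::Iteration(i));
        match root(&vars) {
            Step::Set(name, value) => {
                vars.insert(name, value);
            }
            Step::Predict { var, prompt } => {
                let reply = predict(&prompt);
                trace.push(TraceEvent::Subcall { prompt, reply: reply.clone() });
                vars.insert(var, reply);
            }
            Step::Submit(answer) => {
                trace.push(TraceEvent::Submitted(answer.clone()));
                return (Ok(answer), trace);
            }
        }
    }
    (Err(RunError::BudgetExhausted), trace)
}
```

A map/reduce over log chunks would look like a root closure that issues one `Predict` per chunk, reads the replies back out of the variables, and then returns `Submit` with the aggregate. A run that never submits ends as `BudgetExhausted` with its trace intact.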
The boundary is the design
The temptation, once you have a recursive reasoning engine, is to let it run the show. That's the mistake. The rule that makes this engineering-grade rather than a clever toy:
The harness is the outer durable control plane. RLM is an inner bounded reasoning engine — a worker job, not an autonomous agent.
- The harness owns lifecycle, the durable queue, leases, watchdogs, receipts, permissions, human steering, context rollover — and every side effect that touches the real world.
- RLM owns bounded reasoning over staged evidence, recursive subcalls, and a structured result plus trace. It owns no live side effects.
- RLM may recommend; only the harness executes — with a receipt. The reasoning engine never directly controls the gateway, never changes auth or identity, never does destructive filesystem work, never pushes, never sends an external message, never spends.
- Model-written code is untrusted. Least privilege by default: a staged workspace, only declared files and tools, network denied unless explicitly allowed, every host call recorded in the trace.
- Failures normalize into harness-visible classes — budget exhausted, sandbox fatal, tool-policy denied, output invalid, partial success — so an RLM run is just another durable, classifiable, recoverable job attempt, not a mystery.
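The recommend/execute split and the failure classes can be sketched in a few types. This is an illustration of the boundary, not the harness's implementation: `Recommendation`, `Receipt`, `Harness`, and the prefix-based permission check are all hypothetical, and the check is deliberately naive (it does not normalize paths).

```rust
/// Side effects the reasoning engine may ask for but never perform. (Hypothetical.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Recommendation {
    WriteFile { path: String },
    GitPush { remote: String },
    SendMessage { to: String },
}

/// The harness-visible failure classes named in the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RlmFailure {
    BudgetExhausted,
    SandboxFatal,
    ToolPolicyDenied,
    OutputInvalid,
    PartialSuccess,
}

/// One ledger entry per recommendation, whether or not it was executed.
#[derive(Debug, PartialEq, Eq)]
pub struct Receipt {
    pub seq: u64,
    pub action: Recommendation,
    pub executed: bool,
}

pub struct Harness {
    /// Only writes under this prefix are permitted; everything else is denied.
    staged_prefix: String,
    pub ledger: Vec<Receipt>,
}

impl Harness {
    pub fn new(staged_prefix: &str) -> Self {
        Harness { staged_prefix: staged_prefix.to_string(), ledger: Vec::new() }
    }

    /// Fail-closed gate: anything not explicitly allowed is refused.
    /// Naive prefix match for illustration; a real check would normalize paths.
    fn permits(&self, rec: &Recommendation) -> bool {
        match rec {
            Recommendation::WriteFile { path } => path.starts_with(self.staged_prefix.as_str()),
            _ => false,
        }
    }

    /// The only place a side effect could happen; always leaves a receipt.
    pub fn handle(&mut self, rec: Recommendation) -> Result<u64, RlmFailure> {
        let executed = self.permits(&rec);
        let seq = self.ledger.len() as u64;
        // A real harness would perform the side effect here when `executed` is true.
        self.ledger.push(Receipt { seq, action: rec, executed });
        if executed { Ok(seq) } else { Err(RlmFailure::ToolPolicyDenied) }
    }
}
```

Denials are receipted too, so the ledger shows what the reasoning engine asked for as well as what actually ran.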
Two memories, one long task
Now put both layers together and a complex, multi-day task stops being a gamble.
The virtual session is the spine — the continuity memory: what the task is, where we are, what we decided, what's blocked. It survives compaction, process death, and restarts, because it's a durable structure with receipts, not a context window.
When a turn hits something that needs deep evidence work, it dispatches an RLM worker job — the reasoning memory: bounded iteration and subcall state over large evidence, sandboxed and budget-capped. Its structured result flows back into the working set as a decision, a validation, or an artifact reference. The continuity memory records the conclusion; the reasoning memory keeps the receipts of how it got there.
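The hand-off between the two memories might look like the sketch below. It assumes a hypothetical `RlmResult` enum with the three result kinds named above and a trimmed-down `WorkingSet`; the trace id is an invented pointer to wherever the full reasoning trace is stored.

```rust
/// Structured result of an RLM worker job. (Hypothetical shape.)
pub enum RlmResult {
    Decision(String),
    Validation { check: String, passed: bool },
    ArtifactRef(String),
}

/// Trimmed-down working set: each entry keeps the conclusion plus a trace id.
#[derive(Debug, Default)]
pub struct WorkingSet {
    pub decisions: Vec<(String, String)>,
    pub validation_results: Vec<(String, bool, String)>,
    pub artifact_refs: Vec<(String, String)>,
}

impl WorkingSet {
    /// Record the conclusion only; `trace_id` points at the full trace kept elsewhere.
    pub fn absorb(&mut self, result: RlmResult, trace_id: &str) {
        let trace = trace_id.to_string();
        match result {
            RlmResult::Decision(text) => self.decisions.push((text, trace)),
            RlmResult::Validation { check, passed } => {
                self.validation_results.push((check, passed, trace))
            }
            RlmResult::ArtifactRef(path) => self.artifact_refs.push((path, trace)),
        }
    }
}
```

The working set stays small because it stores a conclusion and a pointer, never the evidence or the intermediate reasoning.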
"But isn't that what a vector database is for?"
It's the first objection every engineer raises, and it deserves a straight answer. Long-term memory systems — vector databases like Qdrant, and the agent-memory layers built on top of them (openclaw-mem, MemPalace, Mem0, Letta/MemGPT-style archives) — are genuinely useful. They are also solving a different problem, and they do not fix the forgetting this essay is about.
Here's the distinction that matters. A vector memory answers "what do I know from before?" — it retrieves the top-k semantically similar fragments from everything the agent has ever seen. The working set answers "where am I right now in this task?" — the current goal, the live decisions, the open blockers. Recall is about similarity; continuity is about state. One genuinely cannot stand in for the other.
Try to use retrieval as continuity and the seams show immediately:
- Similarity is not state. The decision you made an hour ago — "we ruled out approach X because of constraint Y" — may not be the most semantically similar chunk to your current query, so it isn't retrieved, and the agent cheerfully re-walks the dead end. Continuity needs the authoritative current state present deterministically, not the probabilistically-nearest fragment.
- An archive has no "now." A vector store is a pile of past fragments. Even with perfect recall you get relevant snippets, not a coherent, bounded picture of the live task. The working set is a single mutable source of truth — a different shape of data than an embedding index.
- Old truths and new truths look equally relevant. In a long task, decisions get reversed. A vector store happily returns both the superseded decision and its replacement — both score as "relevant" — and the model can't reliably tell which is current. The working set is overwritten: last-write-wins, not retrieve-everything.
- It isn't even in the loop when forgetting happens. The silent loss occurs inside a session's context-window compaction. A memory layer fires on a query; it never learns that a compaction just dropped the constraint you needed. The virtual session is triggered by the compaction event — a completely different control point.
- Recall is untrusted; state must be trusted. Retrieved memory is bounded, fuzzy, fail-open context — it can be stale, off-topic, or even poisoned, so the harness treats it as untrusted input. You do not want the agent's notion of "what I decided" to be a fuzzy retrieval that might be wrong. Authoritative task state has to be harness-owned and deterministic.
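The "old truths and new truths" problem is easy to reproduce with a toy. The sketch below is not any real vector database's API: `cosine`, `top_k`, and `TaskState` are hypothetical, and the two-dimensional embeddings are hand-picked so that a superseded decision and its replacement both sit close to the query.

```rust
use std::collections::HashMap;

/// Cosine similarity of two equal-length vectors (0.0 if either has zero norm).
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Toy vector memory: returns the k most similar fragments, current or not.
pub fn top_k<'a>(store: &'a [(Vec<f32>, &'a str)], query: &[f32], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &str)> = store
        .iter()
        .map(|(v, text)| (cosine(v, query), *text))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, text)| text).collect()
}

/// Toy task state: one slot per topic, overwritten on every new decision.
#[derive(Debug, Default)]
pub struct TaskState {
    slots: HashMap<String, String>,
}

impl TaskState {
    /// Last write wins: the previous decision for this topic is replaced.
    pub fn decide(&mut self, topic: &str, decision: &str) {
        self.slots.insert(topic.to_string(), decision.to_string());
    }

    pub fn current(&self, topic: &str) -> Option<&str> {
        self.slots.get(topic).map(|s| s.as_str())
    }
}
```

Query the store near the "approach" topic and both the old and the new decision come back as relevant; ask `TaskState` for the same topic and there is exactly one answer, the latest.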
None of this is an argument against long-term memory: Agent Harness Core integrates it, externally and fail-open, with citations. It's an argument against asking it to do a job it was never shaped for. The two memories we opened with run this task; long-term memory is a third, orthogonal concern (knowledge across tasks), and a serious long-task agent needs all three.
Pull any one and you get a characteristic failure. Drop long-term memory and the agent is an amnesiac across tasks, re-learning your preferences and last week's lessons every time. Drop the virtual session and it forgets within the task the moment context compacts, which is the failure we started with. Drop RLM and it goes shallow on the evidence-heavy steps, summarizing away the very thing it needed to compute over.
Retrieval recalls the past. The working set holds the present. RLM reasons over the pile in front of you. They are complements, not substitutes — swap any one for another and you've made a category error.
What "engineering grade" actually means
It's a phrase people throw around, so let me make it concrete. A long-task agent is engineering-grade when these properties hold across both layers:
- Durable — it survives process death and reboots without losing or duplicating work.
- Bounded — budgets, leases, caps, and depth limits mean nothing runs away, neither a chatty session nor a recursive reasoner.
- Gated — fail-closed permissions and an explicit side-effect boundary mean nothing dangerous happens by accident.
- Receipted — every continuity rollover and every reasoning subcall writes an auditable record. If it isn't in a ledger, it didn't happen.
- Recoverable — any step can be reconstructed, resumed, or replayed from those receipts.
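Of the five, "bounded" is the one that most directly constrains a recursive reasoner, so here is a minimal sketch of a shared budget with a depth limit. The names (`Budget`, `BudgetError`, `reason`) and the halving strategy are assumptions made for illustration only.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetError {
    DepthExceeded,
    SubcallsExhausted,
}

/// Shared cap across one reasoning run. (Hypothetical type.)
#[derive(Debug)]
pub struct Budget {
    pub max_depth: u32,
    pub subcalls_left: u32,
}

impl Budget {
    /// Charge one subcall at `depth`; refuse instead of running away.
    pub fn charge(&mut self, depth: u32) -> Result<(), BudgetError> {
        if depth > self.max_depth {
            return Err(BudgetError::DepthExceeded);
        }
        if self.subcalls_left == 0 {
            return Err(BudgetError::SubcallsExhausted);
        }
        self.subcalls_left -= 1;
        Ok(())
    }
}

/// Toy recursive reasoner: halves the work until a chunk is small enough,
/// and returns how many leaf chunks it processed.
pub fn reason(chunk_len: usize, depth: u32, budget: &mut Budget) -> Result<u32, BudgetError> {
    budget.charge(depth)?;
    if chunk_len <= 4 {
        return Ok(1);
    }
    let half = chunk_len / 2;
    let left = reason(half, depth + 1, budget)?;
    let right = reason(chunk_len - half, depth + 1, budget)?;
    Ok(left + right)
}
```

Because every call charges the same budget before doing any work, the worst case is known up front: the run either finishes within its caps or stops with a classifiable error.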
Long-task capability isn't one clever feature. It's these five properties holding while the model threads churn underneath and the reasoning goes deep on demand. The model is a contractor. The harness is the general contractor that keeps the receipts — and a task you can leave running for a week is the deliverable.
// the honest ledger for this essay
That split — shipped vs. gated vs. designed — is deliberate, and it's the same discipline the rest of this project runs on. In a world where anyone can generate an impressive architecture diagram in an afternoon, the differentiator isn't the diagram. It's whether each box points at code, a receipt, or an honest "not yet."
And yes — like the code it describes, this essay was drafted in pair with the same family of models it runs. Welcome to 2026.
Agent Harness Core: a self-hosted AI agent runtime in Rust. Six dependencies. No async runtime. 500+ tests without a model call. Every step gated, every step receipted. Pre-release, Windows-first, dual-licensed MIT/Apache-2.0.