Agent Harness Core — An engineering-grade agent harness for long tasks

01 · The Problem

Agents don't crash on long tasks. They forget.

A demo fits in one context window. Real work — refactor this subsystem, triage this incident, work through this spec — does not. The moment a task outgrows the window, something quietly breaks.

Hour three of a long task, the agent confidently redoes finished work, or contradicts a decision it made an hour ago. It didn't error. It forgot — and it forgot silently.

The usual fix is compaction: summarize the history, drop the detail, continue. It works once. But compaction is lossy and silent — compress a compression a few times and task state doesn't just shrink, it drifts. Most frameworks hand this straight back to the model ("summarize yourself and keep going"), which makes the most dangerous failure in long-horizon work both invisible and unrecoverable. That's not a prompting bug. It's a missing engineering layer.

02 · The Design

Two memories for one long task.

Continuity and reasoning are different problems, and conflating them is why "bigger context window" and "just summarize" both fail. Agent Harness Core gives a long task two distinct, bounded, receipted memory layers.

// virtual session — continuity memory

A stable virtual session id spans many disposable model sessions — the task's spine.
Past a compaction threshold, the runtime stops degrading and rolls over into a fresh session.
It carries a bounded working set — goal, decisions, open files, validation, blockers — not a summary of a summary.
Guards refuse to rewrite prepared or leased in-flight work; /new is a hard task boundary.
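
The rollover described in the list above can be sketched in a few lines. This is a minimal illustration, not the project's code: the type names, the idea of counting compactions against the threshold, and the choice to refuse the rollover outright while work is in flight are all assumptions made for the example.

```rust
/// Bounded working set carried across a rollover. The fields mirror the list
/// above; the type itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkingSet {
    pub goal: String,
    pub decisions: Vec<String>,
    pub open_files: Vec<String>,
    pub validation: Vec<String>,
    pub blockers: Vec<String>,
}

impl WorkingSet {
    /// Keep only the newest `max` entries per list, so the set cannot grow
    /// without limit across rollovers.
    pub fn bounded(mut self, max: usize) -> Self {
        for list in [
            &mut self.decisions,
            &mut self.open_files,
            &mut self.validation,
            &mut self.blockers,
        ] {
            let excess = list.len().saturating_sub(max);
            list.drain(..excess);
        }
        self
    }
}

#[derive(Debug, PartialEq)]
pub enum Rollover {
    Continue,
    RolledOver { model_session: u64 },
    RefusedInFlight,
}

pub struct VirtualSession {
    pub virtual_id: String, // stable for the whole task
    pub model_session: u64, // disposable, replaced on rollover
    pub compactions: u32,   // assumed unit for the compaction threshold
    pub working_set: WorkingSet,
}

impl VirtualSession {
    /// Roll into a fresh model session once the threshold is reached, unless
    /// prepared or leased work is still in flight (assumed guard behaviour).
    pub fn maybe_roll_over(&mut self, threshold: u32, in_flight: bool, max_items: usize) -> Rollover {
        if self.compactions < threshold {
            return Rollover::Continue;
        }
        if in_flight {
            return Rollover::RefusedInFlight;
        }
        self.model_session += 1;
        self.compactions = 0;
        self.working_set = self.working_set.clone().bounded(max_items);
        Rollover::RolledOver { model_session: self.model_session }
    }
}
```

The point of the shape: `virtual_id` never changes, `model_session` does, and what crosses the boundary is a capped structure rather than prose summarizing prose.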

// RLM — reasoning memory

Some long tasks are hard because of sheer evidence volume: a 500-page spec, 12k log lines, a codebase blast radius.
A Recursive Language Model reasons in a REPL loop with persistent vars, typed recursive subcalls, and a full trace.
It runs as a sandboxed worker job — staged evidence, declared tools, network off by default.
It returns a structured result and owns no side effects of its own.

The harness is the outer durable control plane. RLM is an inner bounded reasoning engine — a worker job, not an agent above it.
It reasons; the harness acts — and receipts both.
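
Since the RLM engine is a design note rather than shipped code, the following is only a sketch of what the boundary could look like. Every name here (`RlmJobSpec`, `RlmWorker`, `max_depth`, the `Denied` variants) is hypothetical; the only properties taken from the text are staged evidence, declared tools, network off by default, and a trace of every subcall.

```rust
/// Hypothetical job spec for a sandboxed reasoning worker.
#[derive(Debug, Clone, PartialEq)]
pub struct RlmJobSpec {
    pub staged_evidence: Vec<String>,
    pub declared_tools: Vec<String>,
    pub network: bool,
    pub max_depth: u32, // assumed bound on recursive subcalls
}

impl RlmJobSpec {
    /// Network access is off unless a caller flips it explicitly.
    pub fn new(staged_evidence: Vec<String>, declared_tools: Vec<String>) -> Self {
        RlmJobSpec { staged_evidence, declared_tools, network: false, max_depth: 2 }
    }
}

#[derive(Debug, PartialEq)]
pub enum Denied {
    UndeclaredTool,
    DepthExceeded,
}

pub struct RlmWorker {
    spec: RlmJobSpec,
    pub trace: Vec<String>,
}

impl RlmWorker {
    pub fn new(spec: RlmJobSpec) -> Self {
        RlmWorker { spec, trace: Vec::new() }
    }

    /// Authorize one recursive subcall and record the verdict in the trace,
    /// whether it was allowed or denied.
    pub fn subcall(&mut self, depth: u32, tool: &str) -> Result<(), Denied> {
        let verdict = if depth > self.spec.max_depth {
            Err(Denied::DepthExceeded)
        } else if !self.spec.declared_tools.iter().any(|t| t == tool) {
            Err(Denied::UndeclaredTool)
        } else {
            Ok(())
        };
        self.trace.push(format!("depth={} tool={} verdict={:?}", depth, tool, verdict));
        verdict
    }
}
```

The worker only authorizes and records; nothing in it performs a side effect, which matches the split the text draws between reasoning and acting.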

That is the difference between an agent that dazzles for five minutes and one you can leave running for a week: continuity is a durable structure, deep reasoning is a bounded sandbox, and every rollover and subcall is auditable.

Honest status, in this project's house style: the virtual-session mechanism is live-wired into every turn and tested without a model call, but the end-to-end forced-rollover proof under live traffic is a published gate, not yet earned. RLM integration is a design note in the repo — boundary, job spec, and failure classes specified; the engine is not built yet. Labeled, not blurred.

03 · The Product

The operations layer your agent framework forgot.

Agent Harness Core is not another prompt-orchestration library. It sits around your agents — ingress, permissions, durable queuing, concurrency, delivery, long-task continuity, audit, recovery — and answers the questions frameworks leave to you:

"The task outlived its context window. Now what?"

A stable virtual session rolls the turn into a fresh model session past a compaction threshold, carrying a bounded working set — goal, decisions, open files, blockers — forward. Long tasks survive compaction instead of drifting. → working-set continuity · /new = task boundary

"The process died mid-turn. Now what?"

Durable queue + completed-turn recovery. The turn is not lost. → pending.jsonl, leases, receipts
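
One plausible reading of lease-based recovery, sketched in memory. The state names, the `recover` and `lease_next` helpers, and the use of plain integer timestamps are assumptions for illustration; the real queue persists to `pending.jsonl`, which this sketch leaves out.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TurnState {
    Pending,
    Leased { until: u64 },
    Completed,
}

pub struct Turn {
    pub id: String,
    pub state: TurnState,
}

/// After a restart, return turns whose lease has expired to `Pending`.
/// Completed turns are never touched, so finished work is not redone.
pub fn recover(turns: &mut [Turn], now: u64) -> Vec<String> {
    let mut requeued = Vec::new();
    for turn in turns.iter_mut() {
        if let TurnState::Leased { until } = turn.state {
            if until <= now {
                turn.state = TurnState::Pending;
                requeued.push(turn.id.clone());
            }
        }
    }
    requeued
}

/// Lease the first pending turn for `ttl` time units.
pub fn lease_next(turns: &mut [Turn], now: u64, ttl: u64) -> Option<String> {
    let turn = turns.iter_mut().find(|t| t.state == TurnState::Pending)?;
    turn.state = TurnState::Leased { until: now + ttl };
    Some(turn.id.clone())
}
```

A worker that dies mid-turn simply stops renewing its lease; the next process to call `recover` picks the turn back up.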

"Who is allowed to talk to my agent?"

Fail-closed allow-lists per user, chat, channel, and guild. Unknown senders never reach the model. → admin / limited / open-limited tiers
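
Fail-closed admission is easy to state in code: the lookup returns a tier or nothing, and nothing means the message is dropped before any model call. A minimal sketch follows; the `Scope`, `Tier`, and `AllowList` names are hypothetical, and what each tier actually permits is not modeled.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Scope {
    User,
    Chat,
    Channel,
    Guild,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Admin,
    Limited,
    OpenLimited,
}

#[derive(Default)]
pub struct AllowList {
    entries: HashMap<(Scope, String), Tier>,
}

impl AllowList {
    pub fn allow(&mut self, scope: Scope, id: &str, tier: Tier) {
        self.entries.insert((scope, id.to_string()), tier);
    }

    /// Fail closed: an unknown sender gets `None`, never a default tier.
    pub fn admit(&self, scope: Scope, id: &str) -> Option<Tier> {
        self.entries.get(&(scope, id.to_string())).copied()
    }
}
```

There is deliberately no fallback branch: adding one is the only way to make this open by accident.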

"What exactly did it do at 03:12?"

Append-only JSONL logs, receipts, transcripts and trajectories for everything. One command rebuilds the causal chain. → trace <id>
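
Rebuilding a causal chain from an append-only log comes down to following parent links backwards from the record in question. The sketch below assumes each record carries an `id` and an optional `parent`; those field names are hypothetical, JSONL parsing is omitted, and this is not a description of how `trace <id>` is implemented.

```rust
use std::collections::HashMap;

/// One already-parsed log record (field names are assumptions).
pub struct Record {
    pub id: String,
    pub parent: Option<String>,
    pub event: String,
}

/// Walk parent links from `leaf` to the root and return the chain in
/// causal order, oldest first. Unknown ids yield an empty chain.
pub fn causal_chain(log: &[Record], leaf: &str) -> Vec<String> {
    let by_id: HashMap<&str, &Record> = log.iter().map(|r| (r.id.as_str(), r)).collect();
    let mut chain = Vec::new();
    let mut cursor = by_id.get(leaf).copied();
    while let Some(rec) = cursor {
        chain.push(format!("{} {}", rec.id, rec.event));
        if chain.len() >= log.len() {
            break; // every record visited; also stops on cyclic parent links
        }
        cursor = rec.parent.as_deref().and_then(|p| by_id.get(p).copied());
    }
    chain.reverse();
    chain
}
```

Because the log is append-only, the chain for a given id does not change once its records are written, which is what makes a 03:12 question answerable later.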

"How do I stop a runaway turn?"

Scoped cancellation honored by the runtime poll loop — the turn, the job, or everything in the session. → /stop · /stop turn · /stop job <id>
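
A rough sketch of the two halves: parsing the scope out of the command, and a poll loop that checks a flag between steps. The mapping of bare `/stop` to "everything in the session" is an assumption read off the ordering above, and `StopScope`, `parse_stop`, and `run_steps` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, PartialEq)]
pub enum StopScope {
    Session, // assumed meaning of a bare "/stop"
    Turn,
    Job(String),
}

/// Parse "/stop", "/stop turn", or "/stop job <id>". Anything else is rejected.
pub fn parse_stop(cmd: &str) -> Option<StopScope> {
    let mut parts = cmd.split_whitespace();
    if parts.next()? != "/stop" {
        return None;
    }
    match (parts.next(), parts.next(), parts.next()) {
        (None, _, _) => Some(StopScope::Session),
        (Some("turn"), None, _) => Some(StopScope::Turn),
        (Some("job"), Some(id), None) => Some(StopScope::Job(id.to_string())),
        _ => None,
    }
}

/// A poll loop that honors cancellation between steps and reports how many
/// steps it completed before stopping.
pub fn run_steps(cancel: &AtomicBool, steps: usize) -> usize {
    let mut done = 0;
    for _ in 0..steps {
        if cancel.load(Ordering::SeqCst) {
            break;
        }
        done += 1; // one unit of work would run here
    }
    done
}
```

Cancellation is cooperative in this shape: the flag is only observed at poll points, so a step that is already running finishes before the loop exits.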

# build and verify — no model account needed
cargo test                                      # 500+ tests, zero model calls
cargo run -p agent-harness-cli -- doctor

# smoke the full pipeline offline with the bundled fake Codex app-server
cargo run -p agent-harness-cli -- channel-run-once --codex-exe tools\agent-fake-codex-app-server\fake-codex-app-server.cmd ...

# then go live: Telegram / Discord ingress → durable queue → Codex/OpenRouter → receipts
cargo run -p agent-harness-cli -- enable-check --target-home C:\path\to\.agent-harness

04 · The Craft

Boring code. Radical accountability.

The craftsmanship challenge: how much operational rigor fits into a codebase one person can fully audit?

Six dependencies. Total. serde, serde_json, ureq, rusqlite, ring, base64. No tokio, no clap, no framework. Synchronous Rust you can read in an afternoon.

Receipts over trust. Two-phase persistence: intent written before side effects, results after. If it isn't in a ledger, it didn't happen.
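
Two-phase persistence in miniature: write the intent, run the side effect, write the outcome. The sketch keeps the ledger in memory and uses hypothetical names (`Entry`, `with_receipts`, `unresolved`); a real ledger would be flushed to disk before the effect runs.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Intent { id: String },
    Outcome { id: String, ok: bool },
}

/// Record intent, run the effect, then record the outcome.
pub fn with_receipts<F>(ledger: &mut Vec<Entry>, id: &str, effect: F) -> bool
where
    F: FnOnce() -> bool,
{
    ledger.push(Entry::Intent { id: id.to_string() });
    let ok = effect();
    ledger.push(Entry::Outcome { id: id.to_string(), ok });
    ok
}

/// Intents with no matching outcome: the work a recovery pass must inspect,
/// because the process may have died between the two writes.
pub fn unresolved(ledger: &[Entry]) -> Vec<String> {
    let mut open: Vec<String> = Vec::new();
    for entry in ledger {
        match entry {
            Entry::Intent { id } => open.push(id.clone()),
            Entry::Outcome { id, .. } => open.retain(|o| o != id),
        }
    }
    open
}
```

An intent without an outcome is not an error in itself; it is a precise pointer to where the process stopped.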

Fail-closed everything. No allow-list match → no model access. Missing credentials fail at preflight, not mid-turn. Unknown config keys refuse to boot.
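
What a fail-closed preflight might look like, reduced to its two checks. The key names below are invented for the example and are not the project's real configuration keys.

```rust
/// Hypothetical key names, for illustration only.
const KNOWN_KEYS: [&str; 3] = ["allow_list", "queue_dir", "model_token"];
const REQUIRED_KEYS: [&str; 1] = ["model_token"];

#[derive(Debug, PartialEq)]
pub enum BootError {
    UnknownKey(String),
    MissingCredential(String),
}

/// Refuse to boot on any unknown key, then on any missing or empty
/// required credential. Nothing is silently ignored or defaulted.
pub fn preflight(config: &[(&str, &str)]) -> Result<(), BootError> {
    for (key, _) in config {
        if !KNOWN_KEYS.contains(key) {
            return Err(BootError::UnknownKey(key.to_string()));
        }
    }
    for required in REQUIRED_KEYS {
        let present = config.iter().any(|(k, v)| *k == required && !v.is_empty());
        if !present {
            return Err(BootError::MissingCredential(required.to_string()));
        }
    }
    Ok(())
}
```

Rejecting unknown keys catches the typo that would otherwise leave a safety setting at its default without anyone noticing.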

Long tasks that don't drift. A stable virtual session spans many concrete model sessions, rolling over past a compaction threshold with a bounded working set — and guarding in-flight queued work from unsafe rewrite.

Invariants as a CLI command. invariants and schema-registry are executable contracts, not wiki pages — the test suite enforces them.

Encrypted vault. PBKDF2-HMAC-SHA256 + ChaCha20-Poly1305, repo-local. vault-get reports presence and length — never plaintext.

MCP, natively. In-process Rust MCP server scaffold with per-agent tool allow-lists — plus hash-pinned tool descriptions, because tool poisoning is a real attack class.
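
Hash pinning in outline: record a digest of each tool description at approval time, and refuse the tool if the description later differs. To stay dependency-free, the sketch uses FNV-1a, which is not cryptographic and is a stand-in only; a real pin would use a cryptographic digest such as SHA-256. `ToolPins` and `PinError` are hypothetical names.

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a. NOT collision-resistant; a placeholder for a real digest.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in bytes {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

#[derive(Debug, PartialEq)]
pub enum PinError {
    NotPinned,
    DescriptionChanged,
}

#[derive(Default)]
pub struct ToolPins {
    pins: HashMap<String, u64>,
}

impl ToolPins {
    /// Pin the description an operator has reviewed and approved.
    pub fn pin(&mut self, tool: &str, description: &str) {
        self.pins.insert(tool.to_string(), fnv1a64(description.as_bytes()));
    }

    /// Accept a tool only if it is pinned and its description is unchanged.
    pub fn verify(&self, tool: &str, description: &str) -> Result<(), PinError> {
        match self.pins.get(tool) {
            None => Err(PinError::NotPinned),
            Some(pinned) if *pinned == fnv1a64(description.as_bytes()) => Ok(()),
            Some(_) => Err(PinError::DescriptionChanged),
        }
    }
}
```

The description is what the model reads, so a server that edits it after approval is changing the model's instructions; the pin turns that into a refusal instead of a surprise.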

Model-agnostic by one command. Codex app-server executes; /model openrouter/anthropic/claude-sonnet-4 reroutes a conversation without redeploying.

Memory, governed. openclaw-mem integration is external, contract-tested, and fail-open — citations in, silent corruption out.

05 · The Verdict

Reviewed by an AI. Scored without mercy.

Before it was a product page, this was an experiment. In a single day, an AI architect — Claude (Fable 5) — reviewed this repo against OpenClaw (378k★) and Hermes (191k★), wrote the overtaking roadmap, watched it get built, then re-scored it harder, splitting every score into mechanism (code + tests exist) and proven (survived live traffic). The point isn't the speed. It's that an independent frontier model audited the engineering — and the gaps are published, not hidden.

Claude (Fable 5) · AI agent-harness architect · two review rounds, 2026-06-12 · opinions are the model's own

"OpenClaw sells breadth. Hermes sells engineering. Agent Harness Core sells a small thing you can trust."
— round 1, comparative review

"By engineering-discipline density per thousand lines of code, this is now first of the three. … Mechanisms ready. Evidence en route."
— round 2, after the staging pass

Dimension	OpenClaw 378k★	Hermes 191k★	AHC mechanism	AHC proven
Concurrency & throughput	4.5	3.5	4.0	3.5
Persistence & data integrity	3.0	4.5	4.5	4.0
Error handling & recovery	3.5	4.0	4.5	3.5
Supervision & operations	4.0	3.5	4.0	2.5
Security	2.0	3.5	4.5	4.0
Observability & debuggability	3.5	3.0	5.0	4.5
Token / resource efficiency	3.0	4.5	4.5	4.0
Extensibility & ecosystem	5.0	4.5	2.5	2.0
Testing & quality engineering	3.0	3.5	4.5	4.0
Maturity & community	4.5	4.5	2.0	2.0

Mechanism = code + staging tests exist, per the repo's own four-tier status ledger. Proven = validated under live traffic. Highlighted rows mark where the review awards AHC the mechanism-level lead: four strict wins (error recovery, security, observability, testing) plus persistence, where equal numbers hide a receipts-audit edge over Hermes. Ties against an incumbent's live-proven score — supervision (4.0 vs OpenClaw's 4.0) and token efficiency (4.5 vs Hermes's 4.5) — deliberately stay unhighlighted: proven outranks mechanism on equal scores. Scores are one AI reviewer's opinion (Claude, Fable 5), from public documentation for OpenClaw/Hermes and full source access for AHC.
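
The highlighting rule is mechanical, so it can be checked against the table. The sketch below copies the OpenClaw, Hermes, and AHC mechanism columns from the table above and separates strict wins from ties at the top score; the function names are ours, and the persistence tiebreak (the receipts-audit edge) is a reviewer judgment that the numbers alone do not capture.

```rust
/// (dimension, OpenClaw, Hermes, AHC mechanism), copied from the table above.
pub static SCORES: [(&str, f32, f32, f32); 10] = [
    ("Concurrency & throughput", 4.5, 3.5, 4.0),
    ("Persistence & data integrity", 3.0, 4.5, 4.5),
    ("Error handling & recovery", 3.5, 4.0, 4.5),
    ("Supervision & operations", 4.0, 3.5, 4.0),
    ("Security", 2.0, 3.5, 4.5),
    ("Observability & debuggability", 3.5, 3.0, 5.0),
    ("Token / resource efficiency", 3.0, 4.5, 4.5),
    ("Extensibility & ecosystem", 5.0, 4.5, 2.5),
    ("Testing & quality engineering", 3.0, 3.5, 4.5),
    ("Maturity & community", 4.5, 4.5, 2.0),
];

/// Dimensions where the AHC mechanism score beats both incumbents outright.
pub fn strict_wins() -> Vec<&'static str> {
    SCORES
        .iter()
        .filter(|(_, openclaw, hermes, ahc)| ahc > openclaw && ahc > hermes)
        .map(|(dimension, ..)| *dimension)
        .collect()
}

/// Dimensions where the AHC mechanism score only equals the best incumbent.
pub fn ties_at_top() -> Vec<&'static str> {
    SCORES
        .iter()
        .filter(|(_, openclaw, hermes, ahc)| *ahc == openclaw.max(*hermes))
        .map(|(dimension, ..)| *dimension)
        .collect()
}
```

Run against the table, `strict_wins` yields error handling, security, observability, and testing, while `ties_at_top` yields persistence, supervision, and token efficiency, which matches the four wins and three ties described above.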

Full disclosure, by design: the roadmap this project implemented was written by the same AI that scored it. That self-referential loop is precisely why the scoring is split into mechanism vs proven, why pending items get zero credit, and why every claim must point at a receipt, a test, or a ledger entry in the repo. The honesty is the methodology — and the gap between the two columns is not the reviewer's hedge; it is the project's own published TODO list.

06 · The Proving Ground

Don't believe the scores. Watch us earn them.

Everything still standing between "mechanism" and "proven" is published as a roadmap of public proofs. Each one lands as receipts in the repo — JSONL you can read, not claims you have to trust.

PROOF 01

Forced-Rollover Continuity

The signature design, proven end to end: force the compaction threshold, verify a fresh session under the same virtual id, the working set injected, no unsafe rewrite of in-flight work, traced through to final delivery. The unique advantage, earned not asserted.

MECHANISM LIVE-WIRED · PROOF GATED

PROOF 02

Reboot-Proof Supervision

Service-wrapper registration, kill-a-loop auto-restart with backoff, crash-loop breaker that DMs the operator, and canary deploys with automatic rollback. The harness that notices it is sick — and tells you.
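
Restart-with-backoff and a crash-loop breaker are small enough to sketch. The doubling schedule, the sliding window, and every name and parameter below are assumptions for illustration; the operator DM and the service wrapper are out of scope.

```rust
/// Exponential backoff: base * 2^attempt, capped at `cap_secs`.
pub fn backoff_secs(attempt: u32, base_secs: u64, cap_secs: u64) -> u64 {
    let factor = 1u64 << attempt.min(32);
    base_secs.saturating_mul(factor).min(cap_secs)
}

/// Trips when too many crashes land inside one sliding time window.
pub struct CrashLoopBreaker {
    window_secs: u64,
    max_crashes: usize,
    crashes: Vec<u64>,
}

impl CrashLoopBreaker {
    pub fn new(window_secs: u64, max_crashes: usize) -> Self {
        CrashLoopBreaker { window_secs, max_crashes, crashes: Vec::new() }
    }

    /// Record a crash at time `now` (seconds). Returns true when the breaker
    /// trips, which is the point to stop restarting and alert the operator.
    pub fn record_crash(&mut self, now: u64) -> bool {
        let window = self.window_secs;
        self.crashes.retain(|t| now.saturating_sub(*t) < window);
        self.crashes.push(now);
        self.crashes.len() >= self.max_crashes
    }
}
```

Backoff handles the transient failure; the breaker handles the one that is not transient, where restarting again only burns resources.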

MECHANISM SHIPPED · DRILL PENDING

PROOF 03

30-Day Unattended SLO

≥99.5% delivery. Zero silent failures. Every failure reconstructable from receipts within five minutes, via one trace command. The north star — the clock starts at supervision cutover.
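
The delivery criterion reduces to integer arithmetic. The sketch assumes the ratio is delivered messages over attempted messages, which the text does not spell out, and `meets_slo` is a hypothetical helper rather than a project command.

```rust
/// True when delivery is at least 99.5% and no failure was silent.
/// delivered / attempted >= 0.995 is checked as
/// delivered * 1000 >= attempted * 995 to avoid floating point.
pub fn meets_slo(delivered: u64, attempted: u64, silent_failures: u64) -> bool {
    attempted > 0 && silent_failures == 0 && delivered * 1000 >= attempted * 995
}
```

With made-up numbers: 995 delivered out of 1,000 attempted passes, 994 does not, and a single silent failure fails the gate regardless of the ratio.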

CLOCK NOT YET STARTED

seven-day shadow-queue parity · CI pipeline · deterministic simulation (seeded crash/interleave replays) · sanitized real-traffic replay corpus · real Codex wire fixtures · openclaw-mem ContextPack fixture exchange · network advisory audit · first tagged release

Evidence accruing, honestly: the reference deployment now has tens of thousands of delivered messages and hundreds of completed model turns on record, with fewer than two dozen dead-letters — all from known transient provider stream disconnects, each retried before dead-lettering. That is real soak, not a passed gate: the formal forced-rollover, shadow-parity, and 30-day SLO clocks above stay open until their exact criteria are met.

Progress is tracked item-by-item in the repo's roadmap & backlog with a four-tier honesty vocabulary — staging-tested · pending live gate · pending fixtures · deferred by policy — and a named remaining gate for every single item. Few projects of any size publish their uncertainty this precisely. That, too, is the product.