An agent you can leave running on something hard.
Because continuity and reasoning are separate, receipted layers.
A self-hosted AI agent runtime in Rust that treats long-task survival as an engineering problem. A stable virtual session keeps a task coherent across context compaction, process death, and restarts; deep evidence-work runs as bounded, sandboxed reasoning jobs — and every step writes an auditable receipt. Independently reviewed by a frontier AI, and honestly labeled mechanism vs. proven throughout.
A demo fits in one context window. Real work — refactor this subsystem, triage this incident, work through this spec — does not. The moment a task outgrows the window, something quietly breaks.
The usual fix is compaction: summarize the history, drop the detail, continue. It works once. But compaction is lossy and silent — compress a compression a few times and task state doesn't just shrink, it drifts. Most frameworks hand this straight back to the model ("summarize yourself and keep going"), which makes the most dangerous failure in long-horizon work both invisible and unrecoverable. That's not a prompting bug. It's a missing engineering layer.
Continuity and reasoning are different problems, and conflating them is why "bigger context window" and "just summarize" both fail. Agent Harness Core gives a long task two distinct, bounded, receipted memory layers.
That is the difference between an agent that dazzles for five minutes and one you can leave running for a week: continuity is a durable structure, deep reasoning is a bounded sandbox, and every rollover and subcall is auditable.
Honest status, in this project's house style: the virtual-session mechanism is live-wired into every turn and tested without a model call, but the end-to-end forced-rollover proof under live traffic is a published gate, not yet earned. RLM integration is a design note in the repo — boundary, job spec, and failure classes specified; the engine is not built yet. Labeled, not blurred.
Agent Harness Core is not another prompt-orchestration library. It sits around your agents — ingress, permissions, durable queuing, concurrency, delivery, long-task continuity, audit, recovery — and answers the questions frameworks leave to you:
A stable virtual session rolls the turn into a fresh model session past a compaction threshold, carrying a bounded working set — goal, decisions, open files, blockers — forward. Long tasks survive compaction instead of drifting. → working-set continuity · /new = task boundary
Durable queue + completed-turn recovery. The turn is not lost. → pending.jsonl, leases, receipts
Fail-closed allow-lists per user, chat, channel, and guild. Unknown senders never reach the model. → admin / limited / open-limited tiers
Append-only JSONL logs, receipts, transcripts and trajectories for everything. One command rebuilds the causal chain. → trace <id>
Scoped cancellation honored by the runtime poll loop — the turn, the job, or everything in the session. → /stop · /stop turn · /stop job <id>
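As a rough illustration of scoped cancellation, the sketch below parses the three `/stop` forms into a scope and records it in a flag set that a poll loop could check between steps. The type and function names (`StopScope`, `CancelState`, `parse_stop`) are hypothetical and not taken from the repo, and mapping a bare `/stop` to "everything in the session" is an assumption about the intended semantics.

```rust
use std::collections::HashSet;

/// Hypothetical scopes for /stop, /stop turn, and /stop job <id>.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StopScope {
    Session,     // assumption: a bare /stop cancels everything in the session
    Turn,        // only the current turn
    Job(String), // one background job, by id
}

/// Parse a chat command into a scope. Anything malformed yields None,
/// so an unrecognized command never cancels work by accident.
pub fn parse_stop(cmd: &str) -> Option<StopScope> {
    let mut parts = cmd.split_whitespace();
    if parts.next()? != "/stop" {
        return None;
    }
    match (parts.next(), parts.next(), parts.next()) {
        (None, _, _) => Some(StopScope::Session),
        (Some("turn"), None, _) => Some(StopScope::Turn),
        (Some("job"), Some(id), None) => Some(StopScope::Job(id.to_string())),
        _ => None,
    }
}

/// Cancellation flags that a runtime poll loop would consult between steps.
#[derive(Debug, Default)]
pub struct CancelState {
    session: bool,
    turn: bool,
    jobs: HashSet<String>,
}

impl CancelState {
    /// Record a cancellation request for the given scope.
    pub fn request(&mut self, scope: StopScope) {
        match scope {
            StopScope::Session => self.session = true,
            StopScope::Turn => self.turn = true,
            StopScope::Job(id) => {
                self.jobs.insert(id);
            }
        }
    }

    /// True if the current turn should stop at its next poll.
    pub fn turn_cancelled(&self) -> bool {
        self.session || self.turn
    }

    /// True if the named job should stop at its next poll.
    pub fn job_cancelled(&self, id: &str) -> bool {
        self.session || self.jobs.contains(id)
    }
}
```

In this sketch a session-wide stop implies both narrower scopes, while stopping one job leaves the turn and every other job untouched.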
    # build and verify: no model account needed
    cargo test    # 500+ tests, zero model calls
    cargo run -p agent-harness-cli -- doctor

    # smoke the full pipeline offline with the bundled fake Codex app-server
    cargo run -p agent-harness-cli -- channel-run-once --codex-exe tools\agent-fake-codex-app-server\fake-codex-app-server.cmd ...

    # then go live: Telegram / Discord ingress → durable queue → Codex/OpenRouter → receipts
    cargo run -p agent-harness-cli -- enable-check --target-home C:\path\to\.agent-harness
The craftsmanship challenge: how much operational rigor fits into a codebase one person can fully audit?
Before it was a product page, this was an experiment. In a single day, an AI architect — Claude (Fable 5) — reviewed this repo against OpenClaw (378k★) and Hermes (191k★), wrote the overtaking roadmap, watched it get built, then re-scored it harder, splitting every score into mechanism (code + tests exist) and proven (survived live traffic). The point isn't the speed. It's that an independent frontier model audited the engineering — and the gaps are published, not hidden.
"OpenClaw sells breadth. Hermes sells engineering. Agent Harness Core sells a small thing you can trust."
"By engineering-discipline density per thousand lines of code, this is now first of the three. … Mechanisms ready. Evidence en route."
| Dimension | OpenClaw 378k★ | Hermes 191k★ | AHC mechanism | AHC proven |
|---|---|---|---|---|
| Concurrency & throughput | 4.5 | 3.5 | 4.0 | 3.5 |
| Persistence & data integrity | 3.0 | 4.5 | 4.5 | 4.0 |
| Error handling & recovery | 3.5 | 4.0 | 4.5 | 3.5 |
| Supervision & operations | 4.0 | 3.5 | 4.0 | 2.5 |
| Security | 2.0 | 3.5 | 4.5 | 4.0 |
| Observability & debuggability | 3.5 | 3.0 | 5.0 | 4.5 |
| Token / resource efficiency | 3.0 | 4.5 | 4.5 | 4.0 |
| Extensibility & ecosystem | 5.0 | 4.5 | 2.5 | 2.0 |
| Testing & quality engineering | 3.0 | 3.5 | 4.5 | 4.0 |
| Maturity & community | 4.5 | 4.5 | 2.0 | 2.0 |
Mechanism = code + staging tests exist, per the repo's own four-tier status ledger. Proven = validated under live traffic. Highlighted rows mark where the review awards AHC the mechanism-level lead: four strict wins (error recovery, security, observability, testing) plus persistence, where equal numbers hide a receipts-audit edge over Hermes. Ties against an incumbent's live-proven score — supervision (4.0 vs OpenClaw's 4.0) and token efficiency (4.5 vs Hermes's 4.5) — deliberately stay unhighlighted: proven outranks mechanism on equal scores. Scores are one AI reviewer's opinion (Claude, Fable 5), from public documentation for OpenClaw/Hermes and full source access for AHC.
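The highlighting rule can be checked mechanically against the table. The sketch below copies the OpenClaw, Hermes, and AHC mechanism columns and lists the dimensions where the mechanism score is strictly above both incumbents, and separately those where it only ties the better incumbent. The function names are illustrative. The persistence row's "receipts-audit edge" is a reviewer judgment that the numbers alone cannot reproduce, so it appears here only as a tie.

```rust
/// Rows of the review table: (dimension, OpenClaw, Hermes, AHC mechanism).
pub const SCORES: [(&str, f32, f32, f32); 10] = [
    ("Concurrency & throughput", 4.5, 3.5, 4.0),
    ("Persistence & data integrity", 3.0, 4.5, 4.5),
    ("Error handling & recovery", 3.5, 4.0, 4.5),
    ("Supervision & operations", 4.0, 3.5, 4.0),
    ("Security", 2.0, 3.5, 4.5),
    ("Observability & debuggability", 3.5, 3.0, 5.0),
    ("Token / resource efficiency", 3.0, 4.5, 4.5),
    ("Extensibility & ecosystem", 5.0, 4.5, 2.5),
    ("Testing & quality engineering", 3.0, 3.5, 4.5),
    ("Maturity & community", 4.5, 4.5, 2.0),
];

/// Dimensions where the AHC mechanism score is strictly above both incumbents.
pub fn strict_wins() -> Vec<&'static str> {
    let mut wins = Vec::new();
    for &(name, openclaw, hermes, ahc) in SCORES.iter() {
        if ahc > openclaw && ahc > hermes {
            wins.push(name);
        }
    }
    wins
}

/// Dimensions where the AHC mechanism score equals the better incumbent score.
pub fn ties_with_best_incumbent() -> Vec<&'static str> {
    let mut ties = Vec::new();
    for &(name, openclaw, hermes, ahc) in SCORES.iter() {
        if ahc == openclaw.max(hermes) {
            ties.push(name);
        }
    }
    ties
}
```

Run against the table, `strict_wins()` yields the four rows named above (error recovery, security, observability, testing), and `ties_with_best_incumbent()` yields persistence, supervision, and token efficiency.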
Everything still standing between "mechanism" and "proven" is published as a roadmap of public proofs. Each one lands as receipts in the repo — JSONL you can read, not claims you have to trust.
The signature design, proven end to end: force the compaction threshold, verify a fresh session under the same virtual id, the working set injected, no unsafe rewrite of in-flight work, traced through to final delivery. The unique advantage, earned not asserted.
MECHANISM LIVE-WIRED · PROOF GATED
Service-wrapper registration, kill-a-loop auto-restart with backoff, crash-loop breaker that DMs the operator, and canary deploys with automatic rollback. The harness that notices it is sick, and tells you.
MECHANISM SHIPPED · DRILL PENDING
≥99.5% delivery. Zero silent failures. Every failure reconstructable from receipts within five minutes, via one trace command. The north star: the clock starts at supervision cutover.
CLOCK NOT YET STARTED
Evidence accruing, honestly: the reference deployment now has tens of thousands of delivered messages and hundreds of completed model turns on record, with fewer than two dozen dead-letters, all from known transient provider stream disconnects, each retried before dead-lettering. That is real soak, not a passed gate: the formal forced-rollover, shadow-parity, and 30-day SLO clocks above stay open until their exact criteria are met.
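To make the forced-rollover criteria concrete, here is a minimal sketch of what such a check might exercise: a stable virtual id, a fresh model session past the threshold, a bounded working set carried forward, and no rollover while a turn is in flight. All names (`VirtualSession`, `WorkingSet`, `Rollover`, `MAX_ITEMS`) and the token-count threshold are assumptions for illustration, not the repo's actual types or policy.

```rust
/// Hypothetical cap on entries per list so the carried state stays bounded.
pub const MAX_ITEMS: usize = 8;

/// Working set carried across a rollover (fields named after the text above).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WorkingSet {
    pub goal: String,
    pub decisions: Vec<String>,
    pub open_files: Vec<String>,
    pub blockers: Vec<String>,
}

/// Drop the oldest entries, keeping at most MAX_ITEMS of the most recent.
fn keep_recent(list: &mut Vec<String>) {
    if list.len() > MAX_ITEMS {
        let excess = list.len() - MAX_ITEMS;
        list.drain(..excess);
    }
}

impl WorkingSet {
    /// Return a copy of the working set with every list trimmed to the cap.
    pub fn bounded(mut self) -> Self {
        keep_recent(&mut self.decisions);
        keep_recent(&mut self.open_files);
        keep_recent(&mut self.blockers);
        self
    }
}

#[derive(Debug)]
pub struct VirtualSession {
    pub virtual_id: String, // stable for the life of the task
    pub model_session: u64, // replaced on every rollover
    pub tokens_used: usize,
    pub working_set: WorkingSet,
    pub turn_in_flight: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Rollover {
    NotNeeded,
    Deferred, // assumption: never roll over underneath an in-flight turn
    Rolled { from: u64, to: u64 },
}

impl VirtualSession {
    /// Roll into a fresh model session once usage reaches the threshold.
    pub fn maybe_rollover(&mut self, threshold: usize) -> Rollover {
        if self.tokens_used < threshold {
            return Rollover::NotNeeded;
        }
        if self.turn_in_flight {
            return Rollover::Deferred;
        }
        let from = self.model_session;
        self.model_session += 1;
        self.tokens_used = 0;
        self.working_set = self.working_set.clone().bounded();
        Rollover::Rolled { from, to: self.model_session }
    }
}
```

A forced-rollover test in this shape would push `tokens_used` past the threshold, then assert that `virtual_id` is unchanged, `model_session` advanced, and the goal survived in the carried working set.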
Progress is tracked item-by-item in the repo's roadmap & backlog with a four-tier honesty vocabulary — staging-tested · pending live gate · pending fixtures · deferred by policy — and a named remaining gate for every single item. Few projects of any size publish their uncertainty this precisely. That, too, is the product.
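One way to picture the four-tier vocabulary is as a closed enum plus a check that every backlog item names its remaining gate. The sketch below is illustrative only: `Tier`, `BacklogItem`, and `missing_gates` are hypothetical names, and the repo's actual ledger format is not shown here.

```rust
/// The four status tiers named in the text above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    StagingTested,
    PendingLiveGate,
    PendingFixtures,
    DeferredByPolicy,
}

impl Tier {
    pub const ALL: [Tier; 4] = [
        Tier::StagingTested,
        Tier::PendingLiveGate,
        Tier::PendingFixtures,
        Tier::DeferredByPolicy,
    ];

    /// Human-readable label, spelled as in the text above.
    pub fn label(self) -> &'static str {
        match self {
            Tier::StagingTested => "staging-tested",
            Tier::PendingLiveGate => "pending live gate",
            Tier::PendingFixtures => "pending fixtures",
            Tier::DeferredByPolicy => "deferred by policy",
        }
    }

    /// Parse a label back into a tier; unknown labels are rejected.
    pub fn parse(label: &str) -> Option<Tier> {
        Tier::ALL.iter().copied().find(|t| t.label() == label.trim())
    }
}

/// Hypothetical backlog entry: a name, a tier, and the gate still to pass.
pub struct BacklogItem {
    pub name: String,
    pub tier: Tier,
    pub remaining_gate: String,
}

/// Names of items that fail the "named remaining gate for every item" rule.
pub fn missing_gates(items: &[BacklogItem]) -> Vec<&str> {
    items
        .iter()
        .filter(|item| item.remaining_gate.trim().is_empty())
        .map(|item| item.name.as_str())
        .collect()
}
```

Because the enum is closed, a status outside the four tiers cannot be represented, and `missing_gates` returning an empty list is the condition the paragraph above describes.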