Field notes · The experiment, long form

Engineering Is No Longer the Moat.
I Have the Receipts.

What a one-day experiment with an AI architect taught me about the future of software — and what it means for every PM, engineer, and open-source maintainer reading this.

On June 12th, an AI reviewed my solo side project against two of the most popular AI agent platforms on GitHub. One has 378,000 stars. The other has 191,000. Mine had four commits and zero stars.

In the morning, the AI ranked us last on almost everything that matters operationally.

By midnight — the same day — it scored us first or tied on five of ten core engineering dimensions.

Nothing about those two giants changed in between. What changed was twelve hours of AI-paired engineering on my side.

I've been a software PM long enough to have shipped roadmaps that took quarters. This one took a day. I haven't stopped thinking about what that means — and I don't think you should either.

The experiment

Here's what actually happened, hour by hour.

JUNE 12, 2026: ONE DAY, FOUR ACTS
MORNING · The brutal review. 3 harnesses compared. We rank last on ops.
MIDDAY · The roadmap. ~50 items, 10 dimensions, SLOs + stop-the-line.
EVENING · The build. Mechanisms shipped. 207 tests, zero model calls.
MIDNIGHT · The harsher re-review. Dual scores: mechanism vs. proven. Zero credit if pending.
Reviewer: Claude (Fable 5) · Subject: Agent Harness Core vs OpenClaw 378k★ vs Hermes 191k★
FIG. 1 — one day, four acts.

Morning. I asked an AI architect to do an impartial, brutally honest comparative review of three agent harnesses: OpenClaw (378k★), Hermes Agent by Nous Research (191k★), and my project, Agent Harness Core: a self-hosted Rust runtime for AI agents. Think of it as the operations layer most agent frameworks forget — durable queues, fail-closed permissions, and an auditable receipt for every single step an agent takes.
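To make "fail-closed permissions" and "a receipt for every single step" concrete, here is a minimal sketch of the idea, not the project's actual code. The names `Gate`, `Receipt`, and `Decision` are hypothetical, and the allow-list model is an assumption about how such a gate could work.

```rust
use std::collections::HashSet;

/// Outcome of a permission check (hypothetical type, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    Allowed,
    Denied,
}

/// One audit record per attempted step, written whether or not the step ran.
#[derive(Debug, Clone, PartialEq)]
struct Receipt {
    step: u64,
    action: String,
    decision: Decision,
}

/// Fail-closed gate: an action passes only if it is explicitly allow-listed.
struct Gate {
    allowed: HashSet<String>,
    receipts: Vec<Receipt>,
    next_step: u64,
}

impl Gate {
    fn new(allowed: &[&str]) -> Self {
        Gate {
            allowed: allowed.iter().map(|a| a.to_string()).collect(),
            receipts: Vec::new(),
            next_step: 1,
        }
    }

    /// Checks the action and appends a receipt either way.
    /// Anything not on the allow-list is denied by default.
    fn check(&mut self, action: &str) -> Decision {
        let decision = if self.allowed.contains(action) {
            Decision::Allowed
        } else {
            Decision::Denied
        };
        self.receipts.push(Receipt {
            step: self.next_step,
            action: action.to_string(),
            decision,
        });
        self.next_step += 1;
        decision
    }
}
```

The point of the design is that the default branch is the denial: a missing or misspelled rule blocks the step instead of letting it through, and the denied attempt still leaves a receipt.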

The architect: Claude's Fable 5 — the frontier model many in the industry currently call the smartest AI available to the public. If you're going to get judged, get judged by the toughest judge you can find.

The verdict was uncomfortable. Best-in-trio security posture and observability. Worst-in-trio supervision, ecosystem, maturity. The AI did not flatter me. I have the report.

Midday. The same AI wrote the overtaking roadmap: roughly fifty work items across ten engineering dimensions. Not vague directions — SRE-grade SLOs with a stop-the-line rule, shadow-mode database migration, canary deploys with automatic rollback, a deterministic-simulation testing strategy borrowed from the FoundationDB school, an invariants catalog as an executable CLI command.
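An SLO with a stop-the-line rule can be sketched in a few lines. This is an illustration under assumptions: the 99.5% delivery target is the one quoted for the 30-day proof, while the `Window` type, the basis-point encoding, and the reading of "stop the line" as "halt when the target is missed" are hypothetical.

```rust
/// Delivery counts over one SLO window (hypothetical shape).
struct Window {
    delivered: u64,
    total: u64,
}

/// State of the line under the assumed stop-the-line rule.
#[derive(Debug, PartialEq)]
enum Line {
    Running,
    Stopped,
}

/// Returns true when delivered/total meets the target.
/// The target is in basis points (9_950 = 99.50%), so the comparison
/// uses integer math and avoids floating-point rounding.
fn meets_slo(w: &Window, target_bp: u64) -> bool {
    if w.total == 0 {
        return true; // assumption: an empty window violates nothing
    }
    (w.delivered as u128) * 10_000 >= (w.total as u128) * (target_bp as u128)
}

/// Assumed rule: the line stops as soon as the window misses the target.
fn stop_the_line(w: &Window, target_bp: u64) -> Line {
    if meets_slo(w, target_bp) {
        Line::Running
    } else {
        Line::Stopped
    }
}
```

With a target of 9_950 basis points, 995 deliveries out of 1,000 keeps the line running and 994 out of 1,000 stops it.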

Evening. We implemented essentially all of it at mechanism level. Dead-letter retry policies. End-to-end trace IDs. Admission control. An encrypted secrets vault. MCP tool-description pinning, because tool poisoning is a real attack class now. 207 tests passing, without a single model call.
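Tool-description pinning is the least familiar item on that list, so here is a rough sketch of the idea. It is not the project's implementation: `ToolPins` and `PinCheck` are hypothetical names, pinning on first sight is an assumption (a real system might require operator approval), and the sketch stores the full description text where a production version would more likely store a cryptographic digest.

```rust
use std::collections::HashMap;

/// Result of checking a tool description against its pin (hypothetical type).
#[derive(Debug, PartialEq)]
enum PinCheck {
    Pinned,    // first sighting: the description was recorded
    Unchanged, // matches the pinned description
    Changed,   // differs from the pin: the caller should refuse the tool
}

/// Remembers the description each tool had when it was first seen.
struct ToolPins {
    pins: HashMap<String, String>,
}

impl ToolPins {
    fn new() -> Self {
        ToolPins { pins: HashMap::new() }
    }

    /// Compares the offered description with the pinned one.
    /// Assumption: an unknown tool is pinned on first use.
    fn check(&mut self, tool: &str, description: &str) -> PinCheck {
        if let Some(pinned) = self.pins.get(tool) {
            return if pinned.as_str() == description {
                PinCheck::Unchanged
            } else {
                PinCheck::Changed
            };
        }
        self.pins.insert(tool.to_string(), description.to_string());
        PinCheck::Pinned
    }
}
```

The defence is simple: a tool whose description is silently rewritten after approval no longer matches its pin, so the change is caught before the new text ever reaches the model.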

Midnight. I asked the AI to review everything again — and to be harsher. It came back with a dual scoring system: "mechanism" scores for code that exists and passes staging tests, and "proven" scores for what has actually survived live traffic. Anything pending got zero credit.
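The dual-track rule is easy to state in code. The sketch below is only an illustration of "pending gets zero credit": the essay does not publish the reviewer's actual rubric, so the `Claim` type, the point values, and the simple summation are all assumptions.

```rust
/// Evidence behind a single claim (hypothetical vocabulary).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Evidence {
    Pending,       // identified only: zero credit on both tracks
    StagingTested, // code exists and passes staging tests
    LiveProven,    // has survived live traffic
}

/// One scored claim with an illustrative point value.
struct Claim {
    points: f64,
    evidence: Evidence,
}

/// The two scores reported side by side.
#[derive(Debug, PartialEq)]
struct DualScore {
    mechanism: f64,
    proven: f64,
}

/// Sums credit per track. Live-proven work counts on both tracks,
/// staging-tested work counts on the mechanism track only,
/// and pending work counts nowhere.
fn score(claims: &[Claim]) -> DualScore {
    let mut total = DualScore { mechanism: 0.0, proven: 0.0 };
    for claim in claims {
        match claim.evidence {
            Evidence::Pending => {}
            Evidence::StagingTested => total.mechanism += claim.points,
            Evidence::LiveProven => {
                total.mechanism += claim.points;
                total.proven += claim.points;
            }
        }
    }
    total
}
```

Under this rule the proven score can never exceed the mechanism score, which is why the gap between the two columns is the honest measure of what still needs calendar time.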

THE VERDICT: SCORED BY CLAUDE (FABLE 5) · 2026-06-12

DIMENSION                     OPENCLAW 378k★   HERMES 191k★   AHC·MECHANISM   AHC·PROVEN
Concurrency & throughput      4.5              3.5            4.0             3.5
Persistence & integrity       3.0              4.5            4.5*            4.0
Error handling & recovery     3.5              4.0            4.5*            3.5
Supervision & operations      4.0              3.5            4.0             2.5
Security                      2.0              3.5            4.5*            4.0
Observability                 3.5              3.0            5.0*            4.5
Token / resource efficiency   3.0              4.5            4.5             4.0
Testing & quality             3.0              3.5            4.5*            4.0
Ecosystem                     5.0              4.5            2.5             2.0
Maturity & community          4.5              4.5            2.0             2.0

AHC·MECHANISM = code + tests exist. AHC·PROVEN = survived live traffic.
* Highlighted: review-judged mechanism-level leads. 4 strict wins + persistence, where equal numbers hide a receipts-audit edge. Ties against live-proven scores (supervision vs OpenClaw, token vs Hermes) stay plain: proven outranks mechanism on equal numbers.
FIG. 2 — the dual-track scorecard. pending = zero credit.

Mechanism-level: we tied or beat both giants on persistence, error recovery, security, observability, and testing discipline.

Proven-level: we're still behind. And that gap is the most interesting part of this whole story.

The part that should bother you

Let me say the quiet part loudly.

Months of engineering throughput just became a one-day commodity. Queue architectures, supervision trees, audit systems, migration strategies — the stuff that used to take senior teams entire quarters — can now be specified in the morning and exist by dinner.

But here's what the AI's harsher second review made unmissable. There are things no model can compress:

What AI compresses, and what it can't

COMPRESSED: months → hours (1 day)
▸ Queue & supervision architecture
▸ Audit trails, trace IDs, vaults
▸ Migration & rollback strategy
▸ Test scaffolds & invariants
▸ The gap between "identified" and "code + tests exist"

INCOMPRESSIBLE (calendar time)
▸ A 7-day soak takes 7 days
▸ A 30-day SLO takes 30 days
▸ Trust & third-party scrutiny
▸ Real users, real edge cases
▸ Community, which accrues at relationship speed

engineering = table stakes · community = the moat
FIG. 3 — the asymmetry this whole experiment rests on.

So here's my thesis, and the reason this project exists as a public experiment:

If engineering is now table stakes, then community is the moat.

Those 378,000 stars on the incumbent? The day after AI parity, they stop measuring engineering quality. They start measuring something rarer: gravity. Distribution. The accumulated trust of thousands of people who chose to depend on something together.

That can't be generated. It has to be earned — in calendar time, in public.

What this means if you ship software for a living

Three takeaways I'm acting on as a PM:

1. Roadmaps should become experiments with public evidence. We published ours as a "Proving Ground": three falsifiable proofs — seven days of shadow-migration parity with zero divergence, reboot-proof supervision with receipts, a 30-day unattended SLO. Not promises. Proofs, landing as machine-readable ledgers anyone can audit. If your roadmap can be replicated by a competitor's AI in a weekend, the only durable thing on it is the evidence you generate while running it.

THE PROVING GROUND: DON'T BELIEVE THE SCORES. WATCH US EARN THEM.

PROOF 01 · 7-Day Shadow Parity. Dual-write every turn to the new SQLite lane. Seven live days of zero divergence, receipt-compared, before cutover. Status: SHADOW LIVE-GATED
PROOF 02 · Reboot-Proof Supervision. Auto-restart with backoff, crash-loop breaker that DMs the operator, canary deploys with automatic rollback, all with receipts. Status: MECHANISM · DRILL PENDING
PROOF 03 · 30-Day Unattended SLO. ≥99.5% delivery. Zero silent failures. Every failure reconstructable from receipts in 5 minutes, with one trace command. Status: CLOCK NOT YET STARTED

Proofs land as machine-readable ledgers in the repo: JSONL you can read, not claims you have to trust.
FIG. 4 — the roadmap, rewritten as falsifiable proofs.

2. The scarce skill has moved. It's no longer "can your team build it." It's "can your team specify it precisely and verify it honestly." The most valuable artifacts we produced that day weren't code — they were the invariants catalog, the SLO definitions, and a four-tier honesty vocabulary that labels every claim as staging-tested, pending live gate, pending fixtures, or deferred by policy. Specification and verification are the new senior engineering.

3. Honesty is a feature — maybe the feature. Full disclosure, because it matters: the AI that scored our work is the same AI that wrote our roadmap. A self-referential loop — which is exactly why we split the scoring into mechanism vs. proven, gave zero credit to anything unverified, and published the methodology. In a world where anyone can generate impressive-looking claims in an afternoon, auditable honesty becomes the differentiator. Our product literally writes a receipt for everything. So does our marketing.
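To show what "falsifiable" means for the first proof in takeaway 1, here is a rough sketch of a shadow-parity check: compare the receipts written by the primary lane with those written by the shadow lane and list every turn that diverges. The `TurnReceipt` shape and both function names are hypothetical, and the sketch covers only the comparison, not the seven days of live traffic the proof requires.

```rust
use std::collections::HashMap;

/// Receipt of one turn as written by one storage lane (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct TurnReceipt {
    turn_id: u64,
    payload: String,
}

/// Returns the ids of primary turns whose shadow receipt is missing or different.
/// Receipts that exist only in the shadow lane are not flagged by this sketch.
fn divergent_turns(primary: &[TurnReceipt], shadow: &[TurnReceipt]) -> Vec<u64> {
    let shadow_by_id: HashMap<u64, &TurnReceipt> =
        shadow.iter().map(|r| (r.turn_id, r)).collect();
    let mut divergent = Vec::new();
    for p in primary {
        match shadow_by_id.get(&p.turn_id) {
            Some(s) if s.payload == p.payload => {} // lanes agree on this turn
            _ => divergent.push(p.turn_id),         // missing or mismatched
        }
    }
    divergent
}

/// Assumed cutover gate: zero divergence across the compared receipts.
fn ready_for_cutover(primary: &[TurnReceipt], shadow: &[TurnReceipt]) -> bool {
    divergent_turns(primary, shadow).is_empty()
}
```

A single mismatched or missing receipt is enough to fail the gate, which is what makes the claim checkable by anyone holding the two ledgers.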

The trust dilemma (pick a side)

One more thing before the open questions — the fork this experiment quietly forces on everyone reading it:

Either the smartest publicly available AI just produced a credible engineering assessment — in which case the scorecard stands, and a zero-star side project really did reach mechanism parity with hundred-thousand-star platforms in a single day.

Or even the most capable AI on the market can't be trusted with a bounded, checkable, receipts-on-the-table engineering review — in which case, why are we comfortable deploying AI on decisions that are harder, fuzzier, and far less verifiable than this one? Medical triage. Legal research. Your company's quarterly strategy.

There is no comfortable third door. And if your instinct is "it depends on whether you can verify the claims" — congratulations, you've just articulated this project's entire design philosophy. That's why every step writes a receipt.

The open questions (this is where you come in)

I genuinely don't know the answers to these, and I'd rather argue about them than pretend I do:

The experiment is live and running in public. The scores, the brutal reviews, the unproven gaps — all of it is published, receipts included.

The star button is, quite literally, the dependent variable of this experiment. Either outcome proves something.

And yes, this essay was pair-drafted with the same AI that reviewed the code. Welcome to 2026.


Agent Harness Core: a self-hosted AI agent runtime in Rust. Six dependencies. No async runtime. 207 tests without a model call. Every step gated, every step receipted. Pre-release, Windows-first, dual-licensed MIT/Apache-2.0.