On June 12th, an AI reviewed my solo side project against two of the most popular AI agent platforms on GitHub. One has 378,000 stars. The other has 191,000. Mine had four commits and zero stars.
In the morning, the AI ranked us last on almost everything that matters operationally.
By midnight — the same day — it scored us first or tied on five of ten core engineering dimensions.
Nothing about those two giants changed in between. What changed was twelve hours of AI-paired engineering on my side.
I've been a software PM long enough to have shipped roadmaps that took quarters. This one took a day. I haven't stopped thinking about what that means — and I don't think you should either.
The experiment
Here's what actually happened, hour by hour.
Morning. I asked an AI architect to do an impartial, brutally honest comparative review of three agent harnesses: OpenClaw (378k★), Hermes Agent by Nous Research (191k★), and my project, Agent Harness Core: a self-hosted Rust runtime for AI agents. Think of it as the operations layer most agent frameworks forget — durable queues, fail-closed permissions, and an auditable receipt for every single step an agent takes.
The architect: Claude's Fable 5 — the frontier model many in the industry currently call the smartest AI available to the public. If you're going to get judged, get judged by the toughest judge you can find.
The verdict was uncomfortable. Best-in-trio security posture and observability. Worst-in-trio supervision, ecosystem, maturity. The AI did not flatter me. I have the report.
Midday. The same AI wrote the overtaking roadmap: roughly fifty work items across ten engineering dimensions. Not vague directions — SRE-grade SLOs with a stop-the-line rule, shadow-mode database migration, canary deploys with automatic rollback, a deterministic-simulation testing strategy borrowed from the FoundationDB school, an invariants catalog as an executable CLI command.
Evening. We implemented essentially all of it at mechanism level. Dead-letter retry policies. End-to-end trace IDs. Admission control. An encrypted secrets vault. MCP tool-description pinning, because tool poisoning is a real attack class now. 207 tests passing, without a single model call.
Midnight. I asked the AI to review everything again — and to be harsher. It came back with a dual scoring system: "mechanism" scores for code that exists and passes staging tests, and "proven" scores for what has actually survived live traffic. Anything pending got zero credit.
Mechanism-level: we tied or beat both giants on persistence, error recovery, security, observability, and testing discipline.
Proven-level: we're still behind. And that gap is the most interesting part of this whole story.
The part that should bother you
Let me say the quiet part loudly.
Months of engineering throughput just became a one-day commodity. Queue architectures, supervision trees, audit systems, migration strategies — the stuff that used to take senior teams entire quarters — can now be specified in the morning and exist by dinner.
But here's what the AI's harsher second review made unmissable. There are things no model can compress:
- A seven-day shadow soak takes seven days. Physics doesn't read prompts.
- A 30-day unattended reliability SLO takes thirty days.
- Trust, third-party scrutiny, real users hitting real edge cases — these accrue at human speed.
- And community? Community accrues at relationship speed — the slowest and most valuable speed there is.
So here's my thesis, and the reason this project exists as a public experiment:
If engineering is now table stakes, then community is the moat.
Those 378,000 stars on the incumbent? The day after AI parity, they stop measuring engineering quality. They start measuring something rarer: gravity. Distribution. The accumulated trust of thousands of people who chose to depend on something together.
That can't be generated. It has to be earned — in calendar time, in public.
What this means if you ship software for a living
Three takeaways I'm acting on as a PM:
1. Roadmaps should become experiments with public evidence. We published ours as a "Proving Ground": three falsifiable proofs — seven days of shadow-migration parity with zero divergence, reboot-proof supervision with receipts, a 30-day unattended SLO. Not promises. Proofs, landing as machine-readable ledgers anyone can audit. If your roadmap can be replicated by a competitor's AI in a weekend, the only durable thing on it is the evidence you generate while running it.
2. The scarce skill has moved. It's no longer "can your team build it." It's "can your team specify it precisely and verify it honestly." The most valuable artifacts we produced that day weren't code — they were the invariants catalog, the SLO definitions, and a four-tier honesty vocabulary that labels every claim as staging-tested, pending live gate, pending fixtures, or deferred by policy. Specification and verification are the new senior engineering.
3. Honesty is a feature — maybe the feature. Full disclosure, because it matters: the AI that scored our work is the same AI that wrote our roadmap. A self-referential loop — which is exactly why we split the scoring into mechanism vs. proven, gave zero credit to anything unverified, and published the methodology. In a world where anyone can generate impressive-looking claims in an afternoon, auditable honesty becomes the differentiator. Our product literally writes a receipt for everything. So does our marketing.
The trust dilemma (pick a side)
One more thing before the open questions — the fork this experiment quietly forces on everyone reading it:
Either the smartest publicly available AI just produced a credible engineering assessment — in which case the scorecard stands, and a zero-star side project really did reach mechanism parity with hundred-thousand-star platforms in a single day.
Or even the most capable AI on the market can't be trusted with a bounded, checkable, receipts-on-the-table engineering review — in which case, why are we comfortable deploying AI on decisions that are harder, fuzzier, and far less verifiable than this one? Medical triage. Legal research. Your company's quarterly strategy.
There is no comfortable third door. And if your instinct is "it depends on whether you can verify the claims" — congratulations, you've just articulated this project's entire design philosophy. That's why every step writes a receipt.
The open questions (this is where you come in)
I genuinely don't know the answers to these, and I'd rather argue about them than pretend I do:
- If AI collapses engineering cost to near zero, what's the last moat? Community? Distribution? Taste? Regulatory position? Proprietary data? Something else?
- When stars stop signaling engineering quality, what should we measure instead? Soak time? Incident transparency? Receipt density?
- And the one that keeps me up at night: if a solo PM with an AI can reach mechanism parity with hundred-thousand-star projects in a day — what exactly are we paying for when we fund fifty-person platform teams?
The experiment is live and running in public. The scores, the brutal reviews, the unproven gaps — all of it is published, receipts included.
★ Star the repo The long-task architecture → The experiment, at a glance →
And yes — this essay was drafted in pair with the same AI that reviewed the code. Welcome to 2026.
Agent Harness Core: a self-hosted AI agent runtime in Rust. Six dependencies. No async runtime. 207 tests without a model call. Every step gated, every step receipted. Pre-release, Windows-first, dual-licensed MIT/Apache-2.0.