Harness engineering: leveraging Codex in an agent-first world

A case study in agentic engineering at scale: for five months an OpenAI team shipped an internal product with zero manually written lines of code.

Every line (application logic, tests, CI, docs, observability and tooling) was written by Codex, across roughly a million lines and 1,500 merged PRs from a small team.

They estimate a 10x speedup over hand-written code.

“Humans steer. Agents execute.”
“Agent legibility is the goal”

Review is almost entirely agent-to-agent:

“Humans may review pull requests, but aren’t required to. Over time, we’ve pushed almost all review effort towards being handled agent-to-agent.”

The engineering role shifts to building the harness: making the app bootable per git worktree, giving the agent access to a browser, and letting it read and search logs. Making the system observable to the agent allows the team to enforce rules like “ensure service startup completes in under 800ms” and “no span in these four critical user journeys exceeds two seconds”.
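To make those two rules concrete, here is a minimal Python sketch of a budget check that an agent or CI job could run against collected timings. The `Span` shape, the `check_budgets` name and the journey names are all hypothetical and used only for illustration; the post does not say how the team stores or queries its traces.

```python
from dataclasses import dataclass

# Budgets quoted in the post; everything else in this sketch is an assumption.
STARTUP_BUDGET_MS = 800   # "service startup completes in under 800ms"
SPAN_BUDGET_MS = 2_000    # "no span ... exceeds two seconds"


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified trace span: one timed step in a user journey."""
    journey: str
    name: str
    duration_ms: float


def check_budgets(startup_ms, spans, critical_journeys):
    """Return human-readable violations; an empty list means all budgets hold."""
    violations = []
    # "Under 800ms" is strict, so exactly 800ms already counts as a violation.
    if startup_ms >= STARTUP_BUDGET_MS:
        violations.append(
            f"startup took {startup_ms:.0f}ms (budget: under {STARTUP_BUDGET_MS}ms)"
        )
    for span in spans:
        # Only spans inside the named critical journeys are held to the budget.
        if span.journey in critical_journeys and span.duration_ms > SPAN_BUDGET_MS:
            violations.append(
                f"{span.journey}/{span.name} took {span.duration_ms:.0f}ms "
                f"(budget: {SPAN_BUDGET_MS}ms)"
            )
    return violations


if __name__ == "__main__":
    sample = [
        Span("checkout", "load_cart", 2400),  # over budget, critical journey
        Span("settings", "save", 5000),       # slow, but not a critical journey
    ]
    for line in check_budgets(650, sample, {"checkout"}):
        print(line)
```

Returning plain-text violations rather than raising on the first failure gives an agent the full list to act on in a single pass.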

Another lesson is about context. They tried the “one big AGENTS.md” approach and it failed: it crowds out the task, rots instantly, and can’t be mechanically verified. The fix is to treat AGENTS.md as a ~100-line table of contents, not an encyclopedia, with a structured docs/ directory as the system of record and CI linters enforcing that it stays cross-linked and fresh. Give the agent a map, not a 1,000-page manual.
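A CI linter of this kind could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the team's actual tooling: the hard 100-line cap, the Markdown link syntax, the `lint_agents_md` name and the rule that every file under docs/ must be linked directly from AGENTS.md are all choices made for this example. It checks size and cross-linking only; a freshness check would need a definition of "stale" that the post does not give.

```python
import re
from pathlib import Path

# Assumption: a hard cap standing in for the post's "~100-line" guideline.
MAX_AGENTS_LINES = 100

# Captures the target of a Markdown link such as [Title](docs/file.md),
# stopping before any #fragment.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)")


def lint_agents_md(repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    text = (repo_root / "AGENTS.md").read_text(encoding="utf-8")

    # Rule 1: AGENTS.md stays a short table of contents.
    line_count = len(text.splitlines())
    if line_count > MAX_AGENTS_LINES:
        problems.append(
            f"AGENTS.md has {line_count} lines (limit {MAX_AGENTS_LINES})"
        )

    # Rule 2: every relative link in AGENTS.md points at a real file.
    linked = set()
    for target in LINK_RE.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this sketch
        path = repo_root / target
        if path.is_file():
            linked.add(path.resolve())
        else:
            problems.append(f"AGENTS.md links to missing file: {target}")

    # Rule 3: every doc under docs/ is reachable from the table of contents.
    for doc in sorted((repo_root / "docs").rglob("*.md")):
        if doc.resolve() not in linked:
            rel = doc.relative_to(repo_root).as_posix()
            problems.append(f"{rel} is not linked from AGENTS.md")

    return problems


if __name__ == "__main__":
    import sys

    found = lint_agents_md(Path("."))
    for problem in found:
        print(problem)
    sys.exit(1 if found else 0)
```

Exiting non-zero on any problem lets the script fail a CI job directly, which is what makes the rule mechanically verifiable rather than a convention.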