Harness Engineering Is Not Context Engineering
Over the past year, “context engineering” became the phrase everyone uses when talking about building with agents. The idea is straightforward: give the model the right information and the output improves. Curate the docs, manage memory, control what gets injected. That helps, but it only addresses what the agent knows. It says nothing about what the agent can do, or what happens when it does the wrong thing.
Context engineering is like giving a new hire a perfect onboarding document. Useful, necessary, but still not enough. That hire also needs boundaries, a codebase that makes sense, a CI pipeline that catches mistakes, an architecture they can navigate without asking someone every few minutes. That second layer is what I think of as harness engineering.
In 2025, a small team at OpenAI started with an empty repository. Everything (application code, CI, docs) was written by Codex agents. The humans didn’t write code. They built the environment.

The first lesson wasn’t about model quality. The bottleneck was the environment. The agents didn’t lack the ability to write code. They lacked structure, tools, feedback, clear constraints. The engineering question shifted from “what should we prompt?” to “what capability is missing, and how do we make it visible and enforceable?”

They wired Chrome DevTools into the runtime so agents could see the UI and reproduce bugs. They spun up isolated observability stacks per task with logs, metrics, and traces. Now a prompt like “startup should complete under 800ms” was measurable, not aspirational. Architecture rules were enforced mechanically: dependency directions validated, violations caught before merge.

One detail that matters: linter errors were written to teach. Every failure message doubled as context for the next attempt. The system wasn’t just blocking mistakes. It was training the agent while it worked.

They also moved knowledge into the repository itself. Instead of a giant instruction file, a small AGENTS.md pointed to deeper sources of truth (design docs, architecture maps, execution plans, quality grades), all versioned. A background agent scanned for stale docs and opened cleanup PRs.

Context engineering would have stopped at the instruction file. Harness engineering is everything around it: constraints, feedback loops, observability, enforcement.
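The “errors that teach” idea can be sketched with a toy dependency-direction check. The layer names, their ordering, and the regex-based scan below are illustrative assumptions, not the team’s actual tooling:

```python
"""Sketch of a dependency-direction lint whose failure messages teach.

Assumptions: a three-layer architecture (ui -> domain -> infra) and a
naive regex scan over top-level imports; both are illustrative only.
"""
import re

# Allowed direction of dependencies: earlier layers may import later ones.
LAYERS = ["ui", "domain", "infra"]

def check_imports(module_layer: str, source: str) -> list[str]:
    """Return a teaching-style error for each import that points the wrong way."""
    errors = []
    rank = LAYERS.index(module_layer)
    for match in re.finditer(r"^\s*from (\w+)", source, re.MULTILINE):
        imported = match.group(1)
        if imported in LAYERS and LAYERS.index(imported) < rank:
            errors.append(
                f"{module_layer} must not import {imported}: dependencies flow "
                f"{' -> '.join(LAYERS)}. Fix: move the shared code into a lower "
                f"layer, or invert the dependency so {imported} depends on "
                f"{module_layer} instead."
            )
    return errors

# infra importing ui points against the allowed direction, so it is flagged.
for error in check_imports("infra", "from ui import dashboard\n"):
    print(error)
```

The point is the message shape: each failure names the rule, the allowed direction, and a concrete next step, so the error itself becomes context for the agent’s next attempt.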
One thing showed up quickly. Agents replicate patterns that already exist, good or bad. Drift is inevitable. The team used to spend Fridays cleaning what they called “AI slop,” about a day a week of cleanup. That does not scale. So they encoded their standards directly into the repo: golden rules, quality grades, recurring background checks. Agents scan for deviations, open small refactoring PRs, and most get auto-merged. It’s basically garbage collection for code quality. Human taste captured once, enforced continuously.
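That garbage-collection loop can be sketched as a background check that scans for deviations from a golden rule and describes the small fix a PR would make. The rule (no wildcard imports) and the throwaway repo are hypothetical, and the PR plumbing is left out:

```python
"""Sketch of a background quality check: scan, flag, propose a small fix.

Assumptions: the golden rule below and the demo repo are illustrative;
a real harness would turn findings into small PRs and auto-merge safe ones.
"""
import tempfile
from pathlib import Path

GOLDEN_RULE = "no wildcard imports"

def scan(repo: Path) -> list[tuple[Path, int, str]]:
    """Return (file, line_no, suggested_fix) for each deviation found."""
    findings = []
    for file in repo.rglob("*.py"):
        for i, line in enumerate(file.read_text().splitlines(), start=1):
            if line.strip().startswith("from ") and line.rstrip().endswith("import *"):
                module = line.split()[1]
                findings.append(
                    (file, i, f"replace 'from {module} import *' with explicit names")
                )
    return findings

# Demo on a throwaway repo containing one violation.
repo = Path(tempfile.mkdtemp())
(repo / "util.py").write_text("from os.path import *\n")
for path, line_no, fix in scan(repo):
    print(f"{path.name}:{line_no}: {GOLDEN_RULE} -> {fix}")
```

Each finding is small and mechanical on purpose: the resulting PRs stay reviewable, which is what makes auto-merging most of them defensible.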
That’s the biggest difference between the two. Context engineering asks what the agent should see. Harness engineering asks what the system should prevent, measure, and correct. In practice it lives in linters that block architectural drift, in CI gates that reject entropy, in metrics that let agents verify their own work, in feedback loops that keep the system stable over thousands of changes.
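The “metrics that let agents verify their own work” point can be made concrete with a toy budget check in the spirit of the 800ms startup target above. Here `start_app` and its 10ms sleep are stand-ins for a real startup path:

```python
"""Sketch of a measurable budget check: it turns 'startup should complete
under 800ms' into a pass/fail signal an agent can read back.

Assumptions: start_app is a placeholder for the real startup entry point."""
import time

STARTUP_BUDGET_MS = 800

def start_app() -> None:
    # Placeholder for the real startup work being measured.
    time.sleep(0.01)

def check_startup_budget() -> tuple[bool, float]:
    """Measure one startup and compare it against the budget."""
    t0 = time.perf_counter()
    start_app()
    elapsed_ms = (time.perf_counter() - t0) * 1000
    return elapsed_ms <= STARTUP_BUDGET_MS, elapsed_ms

ok, elapsed = check_startup_budget()
print(f"startup {elapsed:.0f}ms (budget {STARTUP_BUDGET_MS}ms): {'PASS' if ok else 'FAIL'}")
```

The value is the closed loop: the same number the prompt references is the number the harness reports, so the agent can check its own change instead of asking a human whether it worked.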
Context gives information. Harness creates a place where the agent doesn’t need supervision. If you’re building with agents, watch where things fail. Wrong output once? Probably context. Slow degradation over weeks? That’s a harness problem. Make architecture enforceable, not just documented. Write errors that teach the fix. Give agents visibility into logs and metrics. Treat your repo knowledge like a product: version it, maintain it, let agents keep it fresh.
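The pointer-file idea mentioned earlier might look like the sketch below; every path and category in it is hypothetical, not the team’s actual layout:

```markdown
# AGENTS.md

Start here, then follow the pointers. Do not duplicate their content.

- Architecture map: docs/architecture.md (layer rules, dependency directions)
- Design docs: docs/design/ (one file per subsystem)
- Execution plans: docs/plans/ (current work, in priority order)
- Quality grades: docs/quality.md (per-module grades, updated by background checks)

All of these are versioned. If one looks stale, open a cleanup PR.
```

Keeping the top-level file small and pointing at versioned sources of truth is what lets a background agent detect staleness and fix it, instead of one giant instruction file rotting silently.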
At this point, no one doubts that agents can produce production-quality code. The real question is whether your system can absorb what they produce without someone constantly watching.
Context helps the model think. Harness is what keeps the system from drifting.
