Agentic Engineering, Part 3: Team Adoption, Leverage, and the New Engineering Bar

Wed Jun 17, 2026


This is Part 3 of a three-part series on building an AI development harness. Part 1 covered the harness as a concept. Part 2 covered feedback loops, hooks, and verification.

Part 1 was about one engineer and one harness. Part 2 was about making that harness produce evidence instead of asking for trust. This post is about the team. The moment more than one person depends on the harness, it stops being a personal productivity trick and turns into shared infrastructure, which changes who maintains it and how much the work of improving it is worth.

In Part 1 I called the model the engine and the harness the car you build around it. That image was about one driver. At team scale the better image is roads. A road is unglamorous to build and nearly invisible once it exists, but every trip across it is faster, and the next road you pave makes every trip after it cheaper still. That is what a team’s harness becomes, and the claim I want to defend here is that paving those roads is now some of the most valuable engineering work on the roadmap.

From individual usage to team capability

In Part 1 I described the engineer who, through trial and error, builds a personal workflow that loads the right context and catches the wrong outputs. That engineer is real and valuable, but their workflow does not transfer. It lives in their head, their shell history, and a few aliases in their dotfiles. When they go on vacation, the workflow goes with them. When a new hire joins, they start from zero. A harness that lives in one person is not leverage. It is a single point of failure that happens to be productive.

The shift from individual usage to team capability is the shift from “I have a good workflow” to “the repo has a good workflow.” Concretely, four things have to leave the individual and move into the project:

Shared workflows: the named paths a task takes from request to merge, checked into the repo where every engineer and every agent can see them, not reinvented per person.
Shared standards: one AGENTS.md, one set of conventions, one answer to “how do we do this here,” so two engineers pointing an agent at the same task get the same shape of result.
Shared verification expectations: an agreed bar for what evidence a change ships with, so “looks done” and “is verified” stop being personal judgment calls.
Shared learning from failed runs: when an agent run goes sideways, the fix lands in the harness where it protects everyone, instead of in one person’s memory where it protects no one.

The test for whether you have made this jump is simple. Could a new engineer, on their first day, point an agent at a real task and get a result that looks like the team’s work, without a senior engineer whispering context over their shoulder? If the answer is no, the capability still lives in people. The goal of everything below is to move it into the repo.

Now the papercuts are worth fixing

For most of my career, building internal tooling needed a business case. Someone had to feel a problem often enough, and painfully enough, that a week of engineering time spent smoothing it out was clearly worth it. Most papercuts never cleared that bar. They were annoying, but not annoying enough to justify the build, so they stayed.

AI changed that calculation. Work that previously required a clear, painful, repeated problem to justify the investment is now worth doing much earlier, because the cost of building the tool has dropped and the benefit compounds across every engineer and every future agent run. The bar for “worth building” fell, and a whole category of papercuts moved above the line.

Here is a small example that makes the math concrete. Merge conflicts in pull requests are a tax everyone pays. Most of them are not hard. Nine times out of ten the right resolution is obvious, and often it is just taking the later change. But the work to resolve one is a pile of low-value steps: grab the branch, pull down the upstream branch, recreate the merge, walk each conflict, commit, push. A handful of clicks and a context switch, repeated on every stale PR. The occasional conflict is genuinely tricky, but those are rare. The common case is several minutes of minutiae with no programmatic shortcut.

So we built one. We added a GitHub Action that lets an agent take a shot at the conflicts. It recreates the merge, resolves each conflict, pushes the result, and writes a short explanation for every resolution it made. Nine times out of ten it does exactly what was needed, and several minutes of nobody’s-favorite work disappear. When it hits one of the rare tricky cases, the per-conflict explanations tell the human exactly where to look, so even the miss is cheaper than starting cold. (Those explanations are a verification artifact, in the Part 2 sense: the agent showing its work so a reviewer can form an opinion in seconds.)
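To make the shape of that output concrete, here is a minimal sketch of the resolve-and-explain step. It is not the Action described above. It assumes standard two-way git conflict markers (no diff3 base section), and every name in it (Resolution, resolve_conflicts, take_incoming) is hypothetical. The choose callback stands in for the agent's judgment; the trivial policy here only shows where that judgment plugs in.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    index: int        # which conflict in the file, counting from 1
    kept: str         # "ours" or "theirs"
    explanation: str  # one line a reviewer can scan


def resolve_conflicts(text, choose):
    """Resolve two-way git conflict hunks in `text`.

    `choose(ours, theirs)` returns a (kept, explanation) pair, where
    kept is "ours" or "theirs". Returns the merged text plus one
    Resolution per conflict.
    """
    out, resolutions = [], []
    ours, theirs = [], []
    state = "clean"  # clean -> ours -> theirs -> clean
    for line in text.splitlines():
        if state == "clean" and line.startswith("<<<<<<< "):
            state, ours, theirs = "ours", [], []
        elif state == "ours" and line == "=======":
            state = "theirs"
        elif state == "theirs" and line.startswith(">>>>>>> "):
            kept, why = choose(ours, theirs)
            out.extend(ours if kept == "ours" else theirs)
            resolutions.append(Resolution(len(resolutions) + 1, kept, why))
            state = "clean"
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
        else:
            out.append(line)
    if state != "clean":
        raise ValueError("unterminated conflict block")
    return "\n".join(out), resolutions


def take_incoming(ours, theirs):
    # Placeholder policy: always keep the incoming side.
    return "theirs", "kept the incoming side; the two sides edit the same lines"


conflicted = "a\n<<<<<<< HEAD\nx = 1\n=======\nx = 2\n>>>>>>> feature\nb"
merged, notes = resolve_conflicts(conflicted, take_incoming)
for note in notes:
    print(f"conflict {note.index}: kept {note.kept} ({note.explanation})")
```

The per-conflict record is the part worth keeping even if the resolver changes, because it is what lets a reviewer check a resolution in seconds.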

A few years ago, this tool was not worth building. The time to write it, against the few minutes it saved per conflict, did not pay back. Now an agent can build most of it in an afternoon, and the saved minutes accrue to every engineer on every conflicted PR, forever. That is the shift. The best AI work on a team may not be feature work. It may be improving the system that ships feature work.
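The payback argument is simple arithmetic, sketched below. The numbers are illustrative assumptions, not measurements: 40 hours stands in for "a week of engineering time," 4 hours for "an afternoon," and 5 minutes for the time saved per conflict.

```python
def break_even_uses(build_hours, minutes_saved_per_use):
    """Number of uses before time saved equals time spent building."""
    return build_hours * 60 / minutes_saved_per_use


# Illustrative figures only.
before = break_even_uses(build_hours=40, minutes_saved_per_use=5)
after = break_even_uses(build_hours=4, minutes_saved_per_use=5)
print(before, after)  # 480.0 48.0
```

Under those assumptions the tool has to handle 480 conflicts to pay for itself at the old build cost and 48 at the new one, and the count is shared across the whole team rather than one engineer.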

Improving the harness as engineering output

If the harness is shared infrastructure, then improving it is engineering work, and it should be planned, reviewed, and credited like any other engineering work. The failure mode I see most often is that harness improvements happen in the cracks. Someone notices the agent keeps tripping on the same thing, fixes it on a Friday afternoon, and never tells anyone. The fix is real, but it is invisible. It does not show up in standup, it does not get reviewed, and nobody learns from it. Treating harness work as side-of-desk tinkering is how teams underinvest in the thing with the highest return.

The fix is to make harness improvements legible. They are diffs. They go through review. They appear on the board. And there are more of them available than most teams realize:

Better local commands: the acme test / acme migrate style CLI from Part 2, so every lifecycle action is one predictable command instead of a paragraph of README prose.
Better test runners and CI summaries: output an agent can parse, with the failure stated specifically and the recovery commands next to it.
Better repo maps and docs: markdown in the repo describing the invariants and the seams, so the agent loads context instead of guessing at it.
Better review templates: a standard PR shape that asks for the diff, the evidence, and the verification artifacts in the same place every time.
Better session analysis: the “session file in, harness improvement out” loop from Part 2, run as a habit rather than an incident response.
Better permissions and hooks: the pre-tool and post-tool hooks that keep the agent on the supported path and give it clean feedback when it strays.

None of these is glamorous, and that is the point. Each one is a small patch with a compounding payoff, because it is reused on every future run by every engineer and every agent. The merge-conflict Action from the last section is exactly this kind of work. It was a few hundred lines, it was reviewed like any PR, and it now pays back on every conflicted branch the team produces. That is the template: small, reviewable, compounding.
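As one illustration of the CI summary item in the list above, here is a minimal sketch of a failure report an agent can parse. The FailureSummary shape and its field names are assumptions made for this example. The commands reuse the acme CLI named earlier, and the failure text is invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FailureSummary:
    check: str        # which CI check failed
    what_failed: str  # the failure, stated specifically
    recovery_commands: list = field(default_factory=list)  # what to run next


def render(summary):
    """Serialize the summary as JSON so an agent can read it without scraping logs."""
    return json.dumps(asdict(summary), indent=2)


summary = FailureSummary(
    check="unit tests",
    what_failed="orders table is missing a column the tests expect",
    recovery_commands=["acme migrate", "acme test"],
)
print(render(summary))
```

The design choice is that the recovery commands travel with the failure, so the agent does not have to infer the next step from a stack trace.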

Training engineers on flows

Part 1 ended with a piece of advice: pick one painful, repeated workflow and make it agent-friendly end to end. That advice scales to a team, and the unit that scales is the flow. A flow is a named, repeatable path through a common kind of work, with its own context, its own commands, its own verification, and its own review gate. “Fix a bug” is a vague request. The bug-fix flow is a specific path: reproduce it with a failing test, find the cause, make the test pass, confirm nothing else broke, write the PR with the repro attached.

The reason flows are the right unit for team adoption is that they are teachable and they are checkable. You can onboard an engineer onto a flow in an afternoon. You can look at a PR and tell whether the flow was followed. And once a flow is written down, an agent can run it as reliably as a person can, because the flow is exactly the context and the steps the agent needs.

The flows worth standardizing first are the ones every team already does by hand:

Bug-fix flow: reproduce with a failing test, fix, confirm green, attach the repro.
Test-generation flow: identify the gap, write the cases, show coverage moved.
Refactor flow: change structure with the test suite as the safety net, prove behavior is unchanged.
PR-review flow: a standard pass over diff, scope, evidence, and risk before a human looks.
Incident-analysis flow: correlate logs and alerts, build a timeline, draft the writeup.
Migration flow: plan the steps, run them against a local copy, capture before-and-after row counts.

You do not standardize all six at once. Pick the one that hurts most and write it down: the context, the commands, the verification, the review gate. Run an agent through it, watch where it stumbles (the session file will tell you), and fix the path. When that flow is solid, the second one is easier, because the patterns and the harness pieces carry over. This is the same incremental loop from Part 1, now producing assets the whole team uses instead of one person’s habit.
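Writing a flow down can be as small as a record with those four fields. The sketch below is one possible shape, not a prescribed format. The Flow class, the BUG_FIX instance, and the evidence strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str
    context: tuple       # docs the agent loads first
    commands: tuple      # supported commands, in order
    verification: tuple  # evidence the PR must carry
    review_gate: str     # what the human reviewer checks


BUG_FIX = Flow(
    name="bug-fix",
    context=("AGENTS.md",),
    commands=("acme test",),
    verification=("failing test that reproduces the bug", "green test run"),
    review_gate="repro attached and the fix matches the cause",
)


def missing_evidence(flow, attached):
    """Return the verification items the PR has not attached yet."""
    return [item for item in flow.verification if item not in attached]


print(missing_evidence(BUG_FIX, {"green test run"}))
# ['failing test that reproduces the bug']
```

Because the verification list is data, checking whether a flow was followed becomes a check a script can run, which is what makes the flow auditable from the PR alone.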

Using an Agent SDK as a maturity level

Everything so far can be built with config files and shell scripts. An AGENTS.md, a CLI like acme, a few hooks, and some markdown docs will take a team a long way, and for most teams that is the right place to start.

Config files have a ceiling, though. A markdown instruction is a suggestion the agent usually follows. A hook is a blunt allow-or-block gate. We hit that ceiling with the merge-conflict bot. As a GitHub Action it was fine, but the moment we wanted it to pause before force-pushing to a protected branch, wait for a human to approve, and then continue from where it left off, a shell script and a hook had no way to say “stop here, ask a person, resume with their answer.” That is the kind of thing an agent SDK is for. Building the harness programmatically, with something like the Claude Agent SDK, turns the conventions you hope the agent honors into code you control: custom tools instead of everything routed through the shell, permissions decided per tool and per context, lifecycle hooks that can inspect and transform rather than just allow or block, context assembled at runtime, durable session logs captured as a first-class output, and human approval gates at the exact points that need them.
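The control flow a shell script cannot express (stop, ask a person, resume with their answer) can be sketched in plain Python with a generator. This is deliberately not the Claude Agent SDK API, and none of these names come from it. ApprovalRequest, resolve_and_push, and run are hypothetical, and a real gate would need to persist state so the pause survives a process restart.

```python
from dataclasses import dataclass


@dataclass
class ApprovalRequest:
    action: str
    reason: str


def resolve_and_push(branch, protected):
    """A run that yields when it needs a human and resumes with the answer."""
    log = [f"resolved conflicts on {branch}"]
    if protected:
        # Execution pauses here until the driver sends back a decision.
        approved = yield ApprovalRequest(
            action=f"force-push {branch}",
            reason="branch is protected",
        )
        if not approved:
            log.append("push skipped: approval denied")
            return log
    log.append(f"pushed {branch}")
    return log


def run(gen, ask):
    """Drive the run, routing each ApprovalRequest to `ask` and feeding the answer back."""
    try:
        request = next(gen)
        while True:
            request = gen.send(ask(request))
    except StopIteration as done:
        return done.value


print(run(resolve_and_push("main", protected=True), ask=lambda req: False))
# ['resolved conflicts on main', 'push skipped: approval denied']
```

The point of the sketch is the seam: the run itself decides where a human is needed, and the driver decides how to reach one.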

The framing that matters is that this is a maturity level, not a starting point. You earn your way here. A team that reaches for the SDK before it has stable flows and a clean CLI is building a sophisticated machine on a foundation that still wobbles. Get the config-file harness working first. When you find yourself fighting the limits of markdown and hooks, as we did, that friction is the signal the SDK will pay for itself.

What to measure

If the harness is a product, it needs metrics, and the metrics fall into two buckets. The first is delivery health: are you shipping good software faster? These are the outcomes teams have measured for years, and the DORA metrics are a reasonable starting frame.

Cycle time: from “work started” to “merged and deployed.” If the harness is working, this falls.
Review burden: how much human time each change costs to review. Verification artifacts should push this down by shifting the reviewer’s question from “does this work?” to “does this look right?”
Rework: how often a merged change has to be revisited or reverted. A spike here usually means verification is too thin.
Defects: what reaches production. The harness should bend this down, not up. If speed comes at the cost of defects, the verification half of the harness is not doing its job.

The second bucket is harness health, and this is the one most teams do not measure at all. It is specifically about the agent’s experience of the system:

Agent completion rate: how often an agent finishes a task without a human having to take over.
Human intervention points: where, repeatedly, a person has to step in. Each recurring one is a harness gap with a name.
Failed-run categories: classify the failures (missing context, missing tool, unclear error, flaky test) so the pattern is visible instead of anecdotal.
System improvements created: how many harness improvements the team shipped this period.

In practice this is a small weekly habit, not a dashboard project. Once a week, pull the session files from the runs that needed a human, drop each failure into one of the buckets above, and look at which bucket is biggest. That bucket is next week’s harness work. Most weeks it is something dull and specific. The first time we ran this, the top bucket was “the agent could not tell which of three config files was the authoritative one,” which turned into a single line in AGENTS.md and then stopped costing us runs.
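The weekly pass needs very little code. A minimal sketch, assuming each run has already been reduced by hand to a record with a completed flag and a failure category; the record shape and the weekly_triage name are assumptions made for this example.

```python
from collections import Counter


def weekly_triage(runs):
    """Summarize a week of agent runs.

    Each run is a dict like {"completed": bool, "failure": str or None},
    where failure is one of the categories above, e.g. "missing context".
    """
    total = len(runs)
    completed = sum(1 for run in runs if run["completed"])
    failures = Counter(run["failure"] for run in runs if not run["completed"])
    top = failures.most_common(1)
    return {
        "completion_rate": completed / total if total else 0.0,
        "failures": dict(failures),
        "next_harness_work": top[0][0] if top else None,
    }


week = [
    {"completed": True, "failure": None},
    {"completed": True, "failure": None},
    {"completed": False, "failure": "missing context"},
    {"completed": False, "failure": "missing context"},
    {"completed": False, "failure": "flaky test"},
]
report = weekly_triage(week)
print(report["completion_rate"], report["next_harness_work"])
# 0.4 missing context
```

The classification itself stays a human judgment made while reading the session files; the script only does the counting.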

That weekly count of system improvements created is the metric I would fight hardest to keep. It makes “did we make the harness better” something the team is accountable for, not just “did we use it.” A team that ships features but never improves the harness is consuming the leverage other people built and adding none. Measuring it turns the central argument of this post into something a team actually tracks.

The new expectations for engineers

All of this adds up to a change in what good looks like. For a long time, a lot of an engineer’s value sat in their hands and their head: how fast they could type the right code, and how much context about the system they could hold at once. Agents are good at both of those now. The code gets written and the context gets loaded. The value moves to the things agents are not good at, and those things are the new bar.

The engineer who clears it is the one who:

Defines better work: turns a fuzzy request into a precise task with acceptance criteria an agent and a reviewer can both check.
Builds better feedback loops: makes failure fast, specific, and locally reproducible, so the agent can correct itself instead of waiting on a human.
Reviews more rigorously: reads for intent and risk, not just for whether it compiles, because the agent already handled “compiles.”
Improves the harness: treats the system that ships work as part of their job, and leaves it better than they found it.
Creates leverage for the team: measures themselves partly by how much faster everyone else got, not only by what they personally shipped.
Raises quality, not just speed: uses the time the agent gave back to make the change safer and the system cleaner, rather than just shipping more of the same.

This is a real reframe, and it is uncomfortable if you built your identity on being the fastest typist or the person who knew where all the bodies were buried. The good news is that it rewards judgment, taste, and care, which are the parts of the job that were always the most interesting. The highest-leverage engineer on an agentic team is not the one who ships the most features. It is the one who makes the whole team ship better.

The ideal flow

Put it all together and a single clean loop falls out. It is worth walking through once, because it ties the whole series into one picture.

A task enters the system, defined well, with acceptance criteria a person and an agent can both check.
The harness loads context: the docs, the repo map, the conventions, the relevant flow.
The agent plans and implements, using the supported commands instead of guessing at the underlying tools.
Hooks observe and verify as it works, keeping it on the supported path and feeding it clean, specific failures when it strays.
A human reviews intent and risk, with the diff, the evidence, and the verification artifacts already assembled for them.
Session analysis improves the next run: the friction from this run becomes a harness patch that protects everyone on the run after it.

Every part of that loop is something a team can build today. None of it waits on a smarter model. The loop gets better not because the engine improves, but because the team keeps tuning the car.

That is the series. Part 1 argued the system around the agent matters more than the agent. Part 2 built the feedback loops, hooks, and verification that turn agent work into a question of evidence. This post was about the team that does both at once, and what it makes worth building.

The model is the engine. The harness is the car. The roads are what a team paves together, and every one of them makes the whole team arrive sooner. Go build the roads.