Agentic Engineering, Part 2: Feedback Loops, Hooks, and Verification

Thu Jun 04, 2026 · 3127 words

This is Part 2 of a three-part series on building an AI development harness. Part 1 covered the harness as a concept. Part 3 will cover team adoption and the new engineering bar.

[Image: human hands using calipers to measure a small robot, with a magnifying loupe and gauges nearby]

In Part 1, I argued that the system around the agent matters more than the agent itself. This post is about the part of that system that does the most work: the feedback loops. Without them, the agent is writing code in a vacuum. With them, the agent has roughly the same internal conversation a careful engineer would have with themselves before opening a PR.

Trust is not the goal

A question I hear a lot is “do I trust the agent yet?”. I think this is the wrong question. Trust does not scale, does not transfer between people, and does not survive the next model release. The better question is “what evidence has the agent produced that this change is safe?”.

Evidence comes from a few places:

Tests that pass and exercise the change.
Type checks that compile cleanly.
Reproducible local steps that show the change working end to end.
A diff a human can read in a reasonable amount of time.
Verification artifacts (covered later in this post).

If those signals are in place and the agent has shown its work, the merge decision is the same kind of decision you would make about any pull request from any contributor. Evidence the agent has produced is a better basis for that decision than your opinion of whether the model is “smart enough”.

What an agent can actually act on

Feedback is only useful if the agent can act on it. A good rule of thumb: write your error output as if the reader has never seen this codebase before and is trying to figure out what to do in the next thirty seconds.

Four properties matter:

Fast: The feedback loop has to fit inside the agent’s working window. Thirty-minute test suites mean thirty-minute iterations.
Specific: “Build failed” gives the agent nothing to do. “Migration M0042 failed at step 3: column users.email_verified does not exist” gives it a starting point.
Machine-readable: Exit codes, JSON output, structured log lines. The agent can grep, parse, and branch on these. It cannot reliably parse decorative ASCII tables.
Locally reproducible: If the failure only happens in CI, the agent has to wait for CI on every attempt or, worse, ask a human to run it. If the same command produces the same output locally, the loop tightens by an order of magnitude.
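Here is a minimal sketch of what “specific” and “machine-readable” can look like together in a shell tool. The function names, the JSON keys, and the `next` field are illustrative placeholders rather than part of any real CLI. The point is one parseable line on stderr plus a distinct exit code.

```bash
# Escape backslashes and double quotes so a message can sit inside a JSON string.
# Enough for single-line messages; control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Print one JSON line on stderr and return a distinct exit code.
# $1 = exit code, $2 = error kind, $3 = detail, $4 = suggested next command
report_failure() {
  printf '{"status":"failed","kind":"%s","detail":"%s","next":"%s"}\n' \
    "$(json_escape "$2")" "$(json_escape "$3")" "$(json_escape "$4")" >&2
  return "$1"
}
```

A tool would call it as `report_failure 2 migration "M0042 failed at step 3" "acme migrate status"`, and the caller can branch on the return code without reading any prose.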

A useful exercise: take the error output for a real failure mode in your codebase and ask whether a new hire could read it, run one or two commands, and get unstuck. If the answer is no, the agent will struggle in the same place. Fix it for the human and the agent gets the upgrade for free.

Building `acme`

The fastest way to make this concrete is to build it. I’m going to walk through a small CLI called acme. It is the lifecycle tool for a hypothetical web app, also called acme. The CLI has five verbs:

acme test
acme build
acme migrate
acme serve
acme stop

That is the whole surface area. Everything else the agent might want to do gets composed out of those five.

Five verbs, not fifty

The small number is deliberate. A CLI with fifty subcommands forces the agent to either memorize the full surface or rediscover it each session via --help. Both of those waste turns. Five verbs map to the lifecycle steps any web app engineer already has in their head: run the tests, build the artifact, apply schema changes, start the app, stop the app.

When you find yourself reaching for a sixth verb, ask whether it belongs as a flag on an existing verb instead. acme serve --port 4000 is better than acme serve-on-port. The agent does not need a new verb to learn. It needs the existing verbs to behave predictably.

Designing the output

Now the part that matters most. When acme migrate runs cleanly, the output should look something like this:

$ acme migrate
applying M0041_add_user_metadata ... ok (212ms)
applying M0042_add_email_verified ... ok (87ms)
applying M0043_backfill_email_verified ... ok (3142ms)

3 migrations applied. database at revision M0043.

Three things to notice. Each migration is on its own line with a clear status, so the agent can grep for ok or FAILED without parsing prose. The final line states the post-condition explicitly, so the agent does not have to infer the resulting database state from the absence of an error. Timings are present but as a tail field, so they stay out of the way of the rest of the output.

When acme migrate fails, the output should look something like this:

$ acme migrate
applying M0041_add_user_metadata ... ok (212ms)
applying M0042_add_email_verified ... FAILED

error: M0042_add_email_verified failed at step 3
  reason: column users.email already exists
  database state: M0041 applied, M0042 partially applied
  current revision: M0041 (forward), M0042 (dirty)

next steps:
  - run `acme migrate status` to inspect the dirty migration
  - run `acme migrate repair M0042` to roll back the partial apply
  - then re-run `acme migrate`

exit code: 2 (migration failure, see `acme help exit-codes`)

Every line in there is useful. The reason tells the agent what to fix. The database state line tells it what the world looks like right now. The next steps block is an explicit menu of recovery options. The exit code is documented elsewhere so the agent can branch on it.

The most important line is database state. Without it, the agent has to guess whether its next attempt is going to make things worse. With it, the agent has a model of the world it can reason about.
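To show why this format is cheap to consume, here is a sketch of a summarizer that needs only grep and sed on the output shown above. `summarize_migrate` is a hypothetical helper, and “not reported” is a placeholder for runs that print no database state line.

```bash
# Read `acme migrate` output on stdin and print a one-line summary.
summarize_migrate() {
  local output ok failed state
  output=$(cat)
  # grep -c exits non-zero when the count is 0, so guard it with `|| true`.
  ok=$(printf '%s\n' "$output" | grep -c '^applying .* \.\.\. ok' || true)
  failed=$(printf '%s\n' "$output" | grep -c '^applying .* \.\.\. FAILED' || true)
  # The state line only appears on failure; keep whatever follows the label.
  state=$(printf '%s\n' "$output" | sed -n 's/^ *database state: //p')
  printf 'ok=%s failed=%s state=%s\n' "$ok" "$failed" "${state:-not reported}"
}
```

Piping the failing run above through it yields the counts and the state line in one place, which is roughly what the agent has to extract on every attempt.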

Telling the agent the tool exists

A tool the agent has to discover is a tool the agent will not use reliably. The fix is one block of text in AGENTS.md (or CLAUDE.md, or CODEX.md, depending on which harness you are running; the file name varies but the job is the same).

Here is the entry for acme:

## acme

The `acme` CLI is the only supported way to run lifecycle operations on this
app. Do not run `npm run`, `yarn`, `node server.js`, or `psql` directly.

Verbs:
  acme test      run the test suite
  acme build     produce a deployable artifact in ./dist
  acme migrate   apply pending database migrations
  acme serve     start the app on port 3000 (use --port to override)
  acme stop      stop a running `acme serve` process

Exit codes:
  0   success
  1   generic failure (see stderr)
  2   migration failure (db may be in dirty state; see `acme migrate status`)
  3   port in use (try `acme stop` first)

When a verb fails, read the full error output before retrying. The error
includes a "next steps" section with the right recovery commands.

That is it. The agent reads AGENTS.md on session start, sees acme, and knows the verbs, the exit codes, and the meta-rule about reading the error before retrying. It no longer has to discover any of this by trial.

The “do not run npm / yarn / node / psql directly” line is also very important. Without it, the agent will sometimes shortcut around acme and call the underlying tools, which defeats the purpose of building the CLI. Tell it what not to do, and it usually listens.
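The exit-code table is what makes branching possible. As a rough sketch, a wrapper around acme could turn each documented code into a hint. Only the codes come from the entry above; the function names and hint wording are assumptions.

```bash
# Translate acme's documented exit codes into the next thing to try.
acme_hint() {
  case "$1" in
    0) printf 'ok\n' ;;
    2) printf 'migration failure: run `acme migrate status`\n' ;;
    3) printf 'port in use: run `acme stop`\n' ;;
    *) printf 'generic failure: read stderr before retrying\n' ;;
  esac
}

# Run any acme verb, print a hint on stderr if it failed, and pass the
# original exit code through unchanged.
run_acme() {
  local code=0
  acme "$@" || code=$?
  [ "$code" -eq 0 ] || acme_hint "$code" >&2
  return "$code"
}
```

The wrapper does not hide anything: the exit code the caller sees is still acme's own.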

Reading the session file

Now the loop closes. After the agent has used acme on a real task, the session file is the artifact that tells you whether your design held up.

Here is a real-looking excerpt from a session where the agent struggled:

turn 14: assistant
  tool: bash
  command: acme migrate
  result: exit 2; M0042 failed; database state M0041 applied, M0042 dirty

turn 15: assistant
  tool: bash
  command: acme migrate
  result: exit 2; M0042 failed; database state M0041 applied, M0042 dirty

turn 16: assistant
  tool: bash
  command: acme migrate
  result: exit 2; M0042 failed; database state M0041 applied, M0042 dirty

turn 17: assistant
  tool: bash
  command: acme migrate --force
  result: exit 1; unknown flag: --force

turn 18: assistant
  tool: bash
  command: acme migrate status
  result: exit 0; M0042 partially applied; use `acme migrate repair M0042`

The agent burned four turns before reading the next steps section of the error and finding the recovery path. Two harness fixes fall out of this:

1. The error output already mentions acme migrate status, but the agent did not read that far before retrying. Move the recovery commands higher in the error block, immediately after the FAILED line.
2. Make acme migrate automatically print the output of acme migrate status when it fails into a dirty state. The recovery command does not benefit from being a separate step.

There is a measurable reason fix #1 matters. Horowitz and Plonsky’s 2025 paper, LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns, tested GPT-4o mini and Gemini-1.5 Flash on 100-trial decision tasks and found that “LLMs exhibit strong recency biases, unlike humans, who appear to respond in more sophisticated ways.” Their models matched a pure-recency strategy on 78–91% of trials, against 67% for humans. The practical takeaway for error output: an agent reading a failure weights whatever is closest to the failure signal much more heavily than what appears later. Recovery commands placed right next to the FAILED line have the best chance of being acted on.

Neither of those fixes is a model improvement. Both are small patches to a CLI tool. And both make every future agent run on a failed migration faster.

This is the pattern. Session file in, harness improvement out.
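The retry loop in that excerpt is easy to spot mechanically. Here is a sketch that flags commands re-run back to back with the same failing result. It assumes the simplified `command:` / `result: exit N; ...` transcript shape shown above, not a raw session log, and `find_repeated_failures` is a hypothetical name.

```bash
# Read a transcript on stdin; report commands repeated with an identical
# non-zero result on consecutive turns.
find_repeated_failures() {
  awk '
    /^ *command: / { sub(/^ *command: /, ""); cmd = $0 }
    /^ *result: exit / {
      sub(/^ *result: /, ""); res = $0
      failed = (res !~ /^exit 0/)
      if (failed && cmd == prev_cmd && res == prev_res) {
        if (++repeats[cmd] == 1) order[++n] = cmd   # remember first-seen order
      }
      prev_cmd = cmd; prev_res = res
    }
    END {
      for (i = 1; i <= n; i++)
        printf "%s: %d repeat(s)\n", order[i], repeats[order[i]]
    }
  '
}
```

Run against turns 14 to 18 above, it reports two repeats of acme migrate and nothing for the --force attempt, which failed differently.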

When you do not own the tool

The acme example was easy because we wrote it. We controlled the verbs, the output, and the exit codes. Most real harnesses also wrap tools we did not write. The interesting question is how to give the agent a clean feedback loop on those.

The GitHub CLI, gh, is a good example. It is enormously useful for opening PRs, reading issues, and querying the GitHub API. Two of its default behaviors are bad for agents.

The first is auth. When the token has expired, the agent runs gh pr create and sees something like:

To get started with GitHub CLI, please run:  gh auth login
Alternatively, populate the GH_TOKEN environment variable with a GitHub API authentication token.

The agent will sometimes try gh auth login, which opens a browser the agent cannot see. It will sometimes guess at a GH_TOKEN environment variable that has not been set in this environment. The right behavior is for the harness to either inject a token from a local secret store, or stop the session with a clear instruction to the operator.

A pre-tool hook is how you do that. The hook intercepts any gh invocation, checks auth state, and dispatches to one of two paths.

#!/usr/bin/env bash
# .claude/hooks/before-gh.sh
# Runs before any `gh` command. Exits non-zero to block the call
# and present a clean error in place of gh's default message.

if gh auth status >/dev/null 2>&1; then
  exit 0  # already authenticated, let the command through
fi

# Not authenticated. Try the local auth helper if the operator has one set up.
if [ -x "$HOME/.local/bin/gh-auth-from-keychain" ]; then
  if "$HOME/.local/bin/gh-auth-from-keychain" >/dev/null 2>&1; then
    exit 0  # helper injected a fresh token, let the command through
  fi
fi

# No helper, or helper failed. Stop and tell the agent and operator what to do.
cat <<EOF >&2
gh is not authenticated and no auth helper succeeded.

Operator action required:
  Run \`gh auth login\` in another terminal, then resume this session.
  Or set up the helper script at ~/.local/bin/gh-auth-from-keychain.

Agent: do not retry \`gh\` commands until the operator confirms auth is set up.
EOF
exit 2  # Claude Code blocks the tool call only on exit code 2 and feeds stderr back to the agent

The error message at the bottom is the important part. It tells the agent what went wrong (gh is not authenticated), what cannot be solved inside the session (operator action required), and what the agent should do next (do not retry). The agent now gets structured feedback in place of gh’s default nudge to log in.

The second problem with gh is noise. gh pr view 1234 --json requires a field list, and a request for everything available returns dozens of fields. The agent typically needs four or five. A post-tool hook can filter the JSON down to the fields the agent actually uses, which reduces both token usage and the chance of the agent latching onto an irrelevant field.

The pattern across both fixes is the same. When you cannot change the tool, change the layer between the agent and the tool. Hooks are how you do that.
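When the harness can shape the request, a simpler variant of the same idea is to narrow what gets asked for instead of filtering the response. gh's --json flag takes a comma-separated field list, so a small wrapper can pin it. This is not a post-tool hook, only a sketch of the same layer-in-between idea; the wrapper name `pr_summary` and the choice of these five fields are assumptions.

```bash
# Fields the agent is assumed to need; adjust per workflow.
PR_FIELDS="number,title,state,headRefName,url"

# Fetch a compact JSON view of one pull request.
# $1 = PR number, URL, or branch name (anything `gh pr view` accepts)
pr_summary() {
  gh pr view "$1" --json "$PR_FIELDS"
}
```

Documenting `pr_summary` in AGENTS.md in place of the raw gh call keeps the agent on the narrow path.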

Have the agent read its own session

The most underused artifact in agent work is the session file. Most agent harnesses produce one. The example below uses Claude Code, which writes a structured log of what the agent did, what tools it called, what it got back, and what it did next. Codex, Cursor, and other agentic IDEs produce equivalent logs in their own locations. Reading those files is where harness improvements come from.

Claude Code stores them here:

~/.claude/projects/<encoded-project-path>/<session-id>.jsonl

One JSON event per line. User messages, assistant messages, tool uses, tool results, all in order.
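Before handing a session to a reviewer, it helps to find the newest file and get a rough sense of its size. The sketch below uses only ls and grep. It assumes tool calls appear as objects containing `"type": "tool_use"`, which is worth confirming against one of your own session files; both function names are hypothetical.

```bash
# Print the most recently modified session file in a directory.
# $1 = the per-project session directory
latest_session() {
  ls -t "$1"/*.jsonl 2>/dev/null | head -n 1
}

# Count events (lines) and lines that contain at least one tool call.
# $1 = path to a .jsonl session file
session_stats() {
  local events tool_lines
  events=$(grep -c '' "$1" || true)
  tool_lines=$(grep -c '"type": *"tool_use"' "$1" || true)
  printf 'events=%s tool_use_lines=%s\n' "$events" "$tool_lines"
}
```

A session with hundreds of tool-call lines for a small change is usually a sign that the friction review is worth running.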

You do not have to read these by hand. Hand the session file to a fresh agent and ask it to find the friction points. Here is a prompt that works well:

Read the session file at <path>.

Identify the three biggest friction points in this session. Look for:
  - places where the agent retried the same failing command multiple times
  - places where it grepped or read its way around for many turns to find
    something that should have been obvious
  - places where it guessed at a path, a command name, or a flag
  - places where a human had to intervene to unblock it

For each friction point, propose one concrete change to the harness:
  - a new CLI verb or flag
  - a new line in AGENTS.md
  - a new pre-tool or post-tool hook
  - a new doc in the repo

Output as a markdown list. One short paragraph per finding. Reference the
turn numbers from the session file. Be specific. Do not propose model or
prompt changes; only harness changes.

The last line of the prompt is the one that matters most. Without it, the agent will sometimes propose “be more careful” as a fix, which is not a fix. Forcing the recommendations to land on harness artifacts (a verb, a doc line, a hook) produces a punch list of small, well-scoped improvements.

The same agent can implement some of those improvements in the same session, especially the docs and AGENTS.md lines. Bigger changes (a new verb, a new hook) go into the backlog.

Make this a habit, not an incident response. The agent that just finished its work is the cheapest reviewer of the session you will ever have. Use it.

Verification artifacts

Tests and type checks tell you the code does what it claims to do. They do not tell you the change looks right, or that the user flow does what the team wanted. For anything that touches a UI, an API contract, or a real workload, the green CI run leaves a gap. Verification artifacts close that gap.

An artifact is something the agent produced and attached to its work that a reviewer can look at in seconds and form an opinion. The shape depends on what changed.

Web UI changes: A screenshot of the new state, ideally next to the old state. The agent runs the app locally, drives it with Playwright or Puppeteer, takes the shots, and references them in the PR description.
End-to-end flows: A short screen recording. Five to fifteen seconds of the agent driving through the feature: log in, click the new button, see the new modal, click submit, see the success state. ffmpeg and a headless browser are enough on the web side. For mobile, xcrun simctl io booted recordVideo for the iOS Simulator or adb shell screenrecord for Android does the same job.
API contracts: A captured request and response, run against the local server. curl -i is sufficient. The reviewer sees the contract change instead of inferring it from the handler code.
Performance: Before and after numbers from a benchmark or a representative query, with the command that produced them. The agent runs the benchmark on main, then on the branch, and posts both. The reviewer does not have to re-run anything to evaluate the claim.
Migrations or data transforms: A row count before and after, plus a small sample of representative rows. The reviewer sees that the data ended up where it should be.

The principle is the same in every case. If a human reviewer would want to inspect the result, the agent should produce that inspection on the reviewer’s behalf.

A few practical notes on making this work:

Make the artifact location predictable: Pick a directory like .verification/ or artifacts/ and have the agent always write there. The reviewer always knows where to look.
Include the command that produced the artifact: Saving a screenshot is useful. Saving a screenshot next to the exact Playwright script that produced it means the reviewer can re-run the capture against their own checkout.
Do not require the artifacts to be perfect: A blurry screenshot of the right thing is more valuable than a polished screenshot of the wrong thing. The goal is to shift the reviewer’s question from “does this work?” to “does this look right?”.
Attach them to the PR: Most code hosts let you embed images and short videos directly in the description or in a comment. An artifact that lives only in the agent’s session is one the reviewer will never find.
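For the API contract case, the first two notes can be combined in a few lines. This is a sketch under assumptions: `capture_api` is a hypothetical helper, `ARTIFACT_DIR` is an invented override, and the endpoint in the usage line is made up.

```bash
# Capture a request/response pair into a predictable artifact directory.
# $1 = artifact name, $2 = URL on the local server
capture_api() {
  local name="$1" url="$2" dir="${ARTIFACT_DIR:-.verification}"
  mkdir -p "$dir"
  # Save the exact command first so a reviewer can re-run the capture.
  printf 'curl -i -s %s\n' "$url" > "$dir/$name.cmd"
  # -i keeps the status line and headers; -s drops the progress meter.
  curl -i -s "$url" > "$dir/$name.http"
}
```

With acme serve running, `capture_api users-list http://localhost:3000/api/users` would leave a `.http` file and a matching `.cmd` file under .verification/, ready to paste into the PR.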

There is a deeper reason this matters more for agents than for humans. When a human engineer ships a change, they have already seen the result on their screen. They have already clicked the button. Their confidence and the reviewer’s confidence are roughly the same, because the engineer has already done what the reviewer would do. An agent has no eyes. The artifact is how the agent proves it did the looking on the reviewer’s behalf.

What’s next

That is the verification half of the harness. The next post in the series is about what changes when an entire team works this way. The short version: the highest-leverage engineers stop measuring themselves by features shipped, and start measuring themselves by how much faster the rest of the team can ship. The harness becomes a product. Improving it becomes some of the most valuable work on the roadmap.

Next in this series: Part 3 — Team Adoption, Leverage, and the New Engineering Bar. What changes when an entire team works this way, and why improving the harness becomes some of the highest-leverage work on the roadmap.