Agentic Engineering, Part 1: Building your AI harness

Fri May 22, 2026 · 1875 words

[Image: human hands assembling a small robot]

This is Part 1 of a three-part series on building an AI development harness. Part 2 covers feedback loops, hooks, and verification. Part 3 covers team adoption and the new engineering bar.

For the last year or so, most of the AI development conversation I’ve heard has been about models and coding assistants. Which model is smartest. Which IDE has the best autocomplete. Which prompt got the best result. That focus is too narrow. The model matters, but the system around the model matters more.

A great agent dropped into a messy codebase with unclear requirements, slow tests, tribal knowledge, and no verification loop will produce “AI slop”. A decent agent operating inside a well-designed development system can produce reliable work.

This post is about what that harness actually is, why it matters, and how to start building one.

The problem with ad hoc AI usage

Most teams I’ve seen using AI today are doing it ad hoc. Someone copies a function into a chat window, asks for a refactor, pastes the result back. Someone else uses an agent to write tests, but only the tests for the one ticket they’re working on. A third person has set up a beautiful local workflow that no one else on the team knows about. The results are wildly inconsistent.

The pattern that emerges has three problems:

Prompting is inconsistent: Two engineers ask for “the same thing” and get different results because the context they load, the constraints they mention, and the verification they apply all vary. The output quality tracks the engineer, not the model.

Results depend too much on individual skill: The engineers getting good results from AI are not necessarily the most senior. They are the ones who have, through trial and error, built a personal workflow that loads the right context and catches the wrong outputs. That workflow does not transfer.

Agents fail when the system relies on tribal knowledge: If a junior engineer needs to know “talk to Alice before touching the billing module” to do a task safely, an agent has no way to know that. The agent will write code that compiles, passes tests, and breaks something Alice would have caught. The codebase was implicitly relying on a human escalation path that the agent does not have.

In all three cases, the failure mode is the same: the development environment is not set up to give the agent what it needs. We are blaming the model for failing a test the system never gave it a fair chance to pass.

What I mean by “harness”

A harness is the set of things around the agent that make its work reliable and verifiable. I think of it as having these parts:

Context: the docs, code maps, examples, and prior decisions the agent can load to understand the work.
Tools: the commands, scripts, and integrations the agent can call to read state and make changes.
Workflow: the structured path a task takes from “request” to “done,” with the steps the agent is expected to follow.
Evidence: the tests, type checks, linters, and other automated signals that confirm the work is correct.
Human review: the point where a person inspects the change, checks the evidence, and decides whether it is safe to merge.
Session capture: a durable record of what the agent did, what it tried, and what the outcome was, so the next run can be better.
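Session capture is the part that tends to have the least obvious shape, so here is a minimal sketch of one. It assumes an append-only JSON Lines file is a good enough store. The `SessionRecord` fields and the `log_session` and `load_sessions` helpers are names I made up for illustration, not something a particular agent tool provides.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class SessionRecord:
    """One agent run: what was asked, what was tried, how it ended."""

    task: str
    commands_run: list[str]
    outcome: str  # hypothetical vocabulary, e.g. "passed", "failed", "abandoned"
    notes: str = ""
    finished_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def log_session(record: SessionRecord, log_path: Path) -> None:
    """Append the record as one JSON line so earlier runs are never overwritten."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")


def load_sessions(log_path: Path) -> list[dict]:
    """Read every captured run back, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

The point of the append-only file is that the next run, or the next human, can read what earlier runs tried before starting again.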

None of these parts is the model. If the model is the engine, then the harness is the car. You can put a Formula 1 engine in a shopping cart and still not win any races. And the good news is that the car can be built now. None of the above depends on a breakthrough in model capability.

Designing for agentic returns

Once you start thinking about agents as participants in your development process, you start noticing that some codebases are simply more agent-ready than others. The properties that help are the same properties that have always made codebases good to work in. They just matter more now, because the agent has less ability to compensate.

Clear architecture boundaries

Agents do better when modules have obvious responsibilities and obvious seams. A monolith where everything imports everything is just as bad for an agent as it is for a new hire. Worse, actually. At least the new hire can ask someone.

Executable developer workflows

“Run the app locally” should be one command. “Run the tests” should be one command. “Reset to a clean state” should be one command. If the steps are written in a README in prose, the agent has to interpret them. If the steps are in a Makefile or a script, the agent can just run them. Cut down on non-determinism.
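A Makefile is the usual answer. For teams that would rather stay in one language, the same idea can be sketched as a tiny task runner. The three commands in `TASKS` are placeholders I picked for illustration (a Docker Compose app tested with pytest), so swap in whatever your project actually runs.

```python
import subprocess
import sys

# Placeholder commands, assumed for illustration only.
TASKS: dict[str, list[str]] = {
    "run": ["docker", "compose", "up", "--build"],
    "test": [sys.executable, "-m", "pytest", "-q"],
    "reset": ["docker", "compose", "down", "--volumes"],
}


def run_task(name: str, tasks: dict[str, list[str]] = TASKS) -> int:
    """Run one named task and return its exit code (2 for an unknown task)."""
    command = tasks.get(name)
    if command is None:
        print(f"unknown task {name!r}; choose from {sorted(tasks)}", file=sys.stderr)
        return 2
    return subprocess.run(command).returncode


def main(argv: list[str]) -> int:
    """Entry point: expects exactly one task name after the script name."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} <{'|'.join(sorted(TASKS))}>", file=sys.stderr)
        return 2
    return run_task(argv[1])
```

Saved as, say, `tasks.py` with `sys.exit(main(sys.argv))` at the bottom, this gives the agent `python tasks.py test` and a meaningful exit code, with nothing left to interpret.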

Fast tests

A test suite that takes thirty minutes is a thirty-minute round-trip on every agent iteration. A test suite that takes thirty seconds means the agent can try, fail, learn, and try again in the time it used to take to run once. In practice this usually means shifting verification left: where a unit test can cover the same behavior as an integration test, prefer the unit test, because it minimizes test setup and I/O.
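The arithmetic is worth writing out. This back-of-the-envelope sketch assumes the test run dominates each iteration, so the time the agent spends editing code is ignored unless you pass it in. The function name is mine.

```python
def iterations_per_hour(suite_seconds: float, agent_seconds: float = 0.0) -> float:
    """How many try-and-verify cycles fit into one hour."""
    cycle_seconds = suite_seconds + agent_seconds
    if cycle_seconds <= 0:
        raise ValueError("a cycle must take a positive amount of time")
    return 3600 / cycle_seconds


slow = iterations_per_hour(30 * 60)  # thirty-minute suite -> 2.0
fast = iterations_per_hour(30)       # thirty-second suite -> 120.0
print(f"slow: {slow:.0f}/hour, fast: {fast:.0f}/hour, ratio: {fast / slow:.0f}x")
```

Under that assumption, the fast suite gives the agent sixty times as many chances to be wrong and find out.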

Strong typing

Types catch a category of errors that, for an agent, would otherwise turn into “code compiles, runs, and silently does the wrong thing.” Type errors are also one of the cleanest forms of feedback an agent can act on: machine-readable, specific, and local to the change.
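Here is a small sketch of what that feedback looks like in practice. It assumes a static checker such as mypy or pyright runs as part of verification. The `Invoice` type and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    amount_cents: int


def find_invoice(invoices: list[Invoice], invoice_id: str) -> Optional[Invoice]:
    """Return the matching invoice, or None when there is no match."""
    for invoice in invoices:
        if invoice.invoice_id == invoice_id:
            return invoice
    return None


def amount_due_cents(invoices: list[Invoice], invoice_id: str) -> int:
    """Look up an invoice and return its amount, failing loudly if it is absent."""
    invoice = find_invoice(invoices, invoice_id)
    # If this check is removed, a static type checker flags the attribute
    # access below, because `invoice` may be None at that point.
    if invoice is None:
        raise KeyError(invoice_id)
    return invoice.amount_cents
```

Without the `Optional` annotation, the missing-invoice case would only show up at runtime, and only if a test happened to exercise it.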

Repeatable local environments

If “works on my machine” is a real risk, the agent’s machine is one more machine where things might not work. Containers, lockfiles, and pinned dependencies pay off even more when an agent is the one running them. Consider giving the agent a cloud “devbox” to run in, with access to as much of the development environment as possible. See Stripe’s “Minions”.

Agent-readable docs

Not Confluence pages buried behind SSO. Markdown in the repo, next to the code, describing the things the code itself does not make obvious. Why this module exists. What the invariants are. Who the consumers are. The agent can read these. The agent cannot read the conversation in Slack from eighteen months ago that explains why the field is nullable.
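One way to keep this from decaying is to check for it. The sketch below assumes a convention that every top-level module directory carries its own `README.md`. That convention and the function name are mine, not a standard.

```python
from pathlib import Path


def modules_missing_docs(src_root: Path, doc_name: str = "README.md") -> list[Path]:
    """Return top-level module directories under src_root that lack doc_name."""
    missing = []
    for child in sorted(src_root.iterdir()):
        # Skip files, hidden directories, and private or cache directories.
        if not child.is_dir() or child.name.startswith((".", "_")):
            continue
        if not (child / doc_name).is_file():
            missing.append(child)
    return missing
```

Run in CI, a non-empty result can fail the build, which turns "document your module" from a request into a check.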

The pattern here is that good engineering practice and good agentic-returns engineering are mostly the same thing.

A concrete example

One team I worked with had an ETL pipeline running in AWS Glue. The pipeline worked, but the development loop was hostile to almost everything an agent needs.

The code was awkward to version control. Getting local code into Glue was only a half-step above copying and pasting code into a text box in the AWS console. It was unfriendly to CI/CD, unfriendly to source control, and hard to test with confidence before deployment. You could write code locally, but you could not really know whether it would work until it was pushed into the cloud. That is close to the least agent-friendly setup possible.

So we changed the harness. We dropped Glue and moved the job into a plain Docker container running on ECS. Locally, the agent could run the same container that would be shipped to AWS. It could connect to a local database that mimicked production and a local data warehouse. It could run type checks, unit tests, and end-to-end tests. The dependencies were the same locally and remotely because the artifact was the container.

The feedback loop went from twenty or thirty minutes, two humans, and a pull request just to find out whether the thing worked, to ten or twenty seconds with no human in the loop. The shipped code had higher confidence because there were fewer differences between local and production. More importantly, the agent could check its own work instead of waiting for a person to tell it whether the cloud run failed.

The human work was designing and building a harness in which the agent could produce high-confidence output (code, config).

The agentic SDLC

Once the harness exists, the agent can participate at more than one point in the software lifecycle. This is the part that I think most teams have not fully internalized yet. They use AI for “writing code,” but with the right scaffolding the model can participate across the whole SDLC:

Planning: turning a fuzzy request into a list of concrete tasks with acceptance criteria.
Design: sketching the approach, naming the trade-offs, identifying the files that will change.
Implementation: writing the code.
Verification: running the tests, running the type checker, running custom checks, summarizing failures.
Review: generating a self-critique, surfacing risks, drafting the PR description.
Deployment: assembling release notes, checking migration order, flagging operational concerns.
Operations: triaging alerts, correlating logs, drafting runbooks from incidents.
Retrospective: analyzing what went well and what failed, and proposing improvements to the harness itself.

Most teams I talk to are doing implementation and a little bit of review. The rest is still done the old way, or not done at all. Faster code is the smallest part of what is possible here. The bigger gain is getting the agent to participate across every step, with the right context and the right verification at each one.

You do not have to build this all at once. You probably should not. Write down which lifecycle steps your harness supports today, then pick the next missing one that hurts.
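Writing it down can be as simple as this sketch. The step names come from the list above. The `supported` set is a made-up example of a team that only has implementation and review covered.

```python
LIFECYCLE = [
    "planning",
    "design",
    "implementation",
    "verification",
    "review",
    "deployment",
    "operations",
    "retrospective",
]


def unsupported_steps(supported: set[str]) -> list[str]:
    """List the lifecycle steps the harness does not cover yet, in lifecycle order."""
    unknown = supported - set(LIFECYCLE)
    if unknown:
        raise ValueError(f"not lifecycle steps: {sorted(unknown)}")
    return [step for step in LIFECYCLE if step not in supported]


supported = {"implementation", "review"}  # hypothetical team
print(unsupported_steps(supported))
# ['planning', 'design', 'verification', 'deployment', 'operations', 'retrospective']
```

The code only lists the gaps. Deciding which one hurts most is still a human call.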

The first maturity step

If you read all of the above and think “this is a lot,” good. It is a lot. Start small: pick one painful, repeated workflow and make it agent-friendly end to end. Designing a perfect harness up front is the wrong goal.

Pick something concrete. “Fixing a flaky test.” “Adding a new field to a CRUD endpoint.” “Bumping a dependency and chasing the type errors.” Something you do often enough that the cost of investing in it pays back, and contained enough that you can finish.

Then, for that one workflow:

Write down the context the agent needs: What files, what docs, what conventions, what gotchas. Put it in the repo.
Make the steps executable: Local commands to run the relevant tests, regenerate types, reset state. One command each.
Add verification: A test run, a type check, whatever a reasonable human would do to convince themselves the change is safe.
Decide the review gate: What does a human look at, and when. Write that down too.
Let the agent operate inside that path: Then observe what fails. Then improve the path. Then run it again.
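The "add verification" step above can start as one script that runs every check and reports the result in a form the agent can act on. This is a minimal sketch. The check names and commands you pass in are up to you (for example a pytest run and a type-checker run, both assumptions about your stack), and `CheckResult`, `run_checks`, and `summarize` are hypothetical names.

```python
import subprocess
import textwrap
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output_tail: str  # last few lines of output, enough to act on


def run_checks(checks: dict[str, list[str]], tail_lines: int = 20) -> list[CheckResult]:
    """Run every check, even after a failure, so one pass shows all problems."""
    results = []
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        lines = (proc.stdout + proc.stderr).strip().splitlines()
        tail = "\n".join(lines[-tail_lines:])
        results.append(CheckResult(name, proc.returncode == 0, tail))
    return results


def all_passed(results: list[CheckResult]) -> bool:
    return all(result.passed for result in results)


def summarize(results: list[CheckResult]) -> str:
    """One status line per check, with indented output under each failure."""
    lines = []
    for result in results:
        status = "PASS" if result.passed else "FAIL"
        lines.append(f"{status} {result.name}")
        if not result.passed and result.output_tail:
            lines.append(textwrap.indent(result.output_tail, "    "))
    return "\n".join(lines)
```

The same summary serves the review gate: the human sees which checks ran and what they said, without rerunning anything.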

This is boring, undemoable work. But it pays off because every fix to the path is reused on the next run. And once you have one path working, the second path is easier, because the same patterns carry over.

The teams that get the most out of AI in the next few years will be the ones that did this boring work, over and over, until the harness around their agents was good enough that the agents could be trusted with real engineering tasks. Clever prompts alone will not get them there.

The model is the engine. Go build the car.

Next in this series: Part 2 — Feedback Loops, Hooks, and Verification. How the harness makes agent work safer, more reliable, and easier to improve over time.