I have been working for weeks on a harness for more efficient agentic AI coding, pushing models through longer and longer autonomous sessions. The result is compound-agent: an open-source npm package that installs a development environment for agentic software work. It is still a work in progress, but it is already in production and already boosting several of my projects. I am trying to fold into it all the applied-AI research I find and everything my own experience has taught me.
The thesis behind it is simple: coding agents are limited less by raw model quality than by the environment we drop them into.
Give an agent a messy repository, weak feedback, and no persistent memory, and it will look unreliable. Give it a codebase that remembers decisions, exposes clear task boundaries, and enforces verification, and the same class of model becomes much more useful. The goal is to make the repository itself do more of the remembering, organizing, and checking: externalize the context and rely less on prompts or on any single agent, so that work can be long, autonomous, and reliable.
This is not just my intuition. OpenAI, Anthropic, and Cursor each ran large-scale agentic coding projects in early 2026 and pointed in the same direction: the bottleneck is not code generation. It is environment design, task decomposition, and feedback loops that let agents work reliably. (OpenAI, Anthropic, Cursor)
From two constraints to three requirements
I think that current agents have two structural constraints that shape everything else.
The first is limited working memory. An agent only sees the slice of the codebase that fits into the current context window. A human developer builds an internal map of a project over weeks. An agent starts each session with whatever context was prepared for it.
The second is amnesia. When the session ends, the memory is gone. Decisions, failed approaches, naming conventions, and hard-won fixes disappear unless they were written somewhere durable.
Therefore, if we want to reduce the impact of these constraints, we need to make agents interchangeable between sessions. The state has to live in the repository. This pushed me to define three requirements.
Memory. Projects should accumulate knowledge, not just code. Decisions, lessons, and conventions need to be stored where the next session can retrieve them.
Feedback loops. Agents need mechanical ways to know whether they are right. Tests, linting, type checks, and review gates cannot stay informal if the output is going to scale.
Navigable structure. Agents lose a lot of performance on orientation. Tasks need explicit dependencies, clear entry points, and scopes small enough to reason about in one pass.
Building these three ideas into the repository itself, rather than only relying on prompt engineering, removes the failures that limited memory and amnesia cause: repeated mistakes, weak local reasoning, and too much time spent rediscovering project conventions.
What compound-agent Installs
compound-agent packages those three requirements into one system.
At the base is Beads, a git-backed task tracker that stores work items directly in the repository. compound-agent uses these as scoped units of work with explicit dependencies. That gives the system a dependency graph and a reliable answer to a basic question: what is actually ready to be worked on right now? This is the navigable structure requirement.
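As a sketch of what that query amounts to, here is a minimal "ready to work on" computation over a task graph. The `Task` shape and field names are illustrative, not Beads' actual schema:

```typescript
// Illustrative task shape: not Beads' real data model.
interface Task {
  id: string;
  status: "open" | "closed";
  dependsOn: string[];
}

// A task is ready when it is open and every dependency is closed.
function readyTasks(tasks: Task[]): Task[] {
  const byId = new Map(tasks.map((t) => [t.id, t]));
  return tasks.filter(
    (t) =>
      t.status === "open" &&
      t.dependsOn.every((dep) => byId.get(dep)?.status === "closed")
  );
}

const graph: Task[] = [
  { id: "t1", status: "closed", dependsOn: [] },
  { id: "t2", status: "open", dependsOn: ["t1"] },
  { id: "t3", status: "open", dependsOn: ["t2"] },
];

readyTasks(graph).map((t) => t.id); // only "t2": t1 is done, t3 is blocked
```

The useful property is that "what should run next" becomes a pure function of repository state, not something an agent has to reconstruct from prose.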
Semantic memory
Above that is semantic memory. Early on I stored everything in one bucket (lessons, documentation, research) and retrieval was noisy. An agent looking for a quick pattern fix would pull in architecture documentation. An agent needing design context would get session notes. Splitting memory into two systems made retrieval much sharper.
The first is lessons: things the agent learns during work. Mistakes, corrections, patterns, preferred approaches. These are stored in git-tracked JSONL, indexed in SQLite, and retrieved through a mix of semantic similarity and keyword search. A new session pulls in relevant lessons before writing any code.
A real lesson entry from the project's own development:
```json
{
  "id": "L821b491a",
  "type": "lesson",
  "insight": "Inlining phase instructions in a workflow orchestrator causes context drift under compaction. Delegate each phase to its own slash command so instructions are re-injected fresh.",
  "tags": ["compaction", "workflow", "architecture"],
  "severity": "high"
}
```
The second is knowledge. I point it at the documentation folder (specs, architecture decisions, research papers) and it chunks, embeds, and indexes everything. When an agent needs context about how a system was designed or why a decision was made, it queries the knowledge base instead of guessing from the code.
A knowledge chunk looks like this:
```json
{
  "filePath": "docs/ARCHITECTURE-V2.md",
  "text": "The three layers are not independent: memory without feedback loops is unreliable. Feedback without navigable structure fires blindly.",
  "contentHash": "a4f2e8..."
}
```
Having both indexed with a mix of semantic similarity and keyword matching means retrieval is fast and relevant. Lessons work as a scratch pad for the codebase, like my personal notebook sitting on my desk. Knowledge works as a queryable library. Together they give each new session a real starting point instead of a blank slate.
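The retrieval idea can be sketched as a blend of a vector-similarity score and a keyword-overlap score. The weights and scoring below are illustrative, not the package's actual ranking:

```typescript
// Illustrative entry shape for an indexed lesson or knowledge chunk.
interface Entry {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors (guarding against zero norms).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Fraction of query terms that appear verbatim in the text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/);
  const body = text.toLowerCase();
  return terms.filter((t) => body.includes(t)).length / terms.length;
}

// Blend both signals; 70/30 is an illustrative weighting.
function rank(queryText: string, queryVec: number[], entries: Entry[]): Entry[] {
  const score = (e: Entry) =>
    0.7 * cosine(queryVec, e.embedding) + 0.3 * keywordScore(queryText, e.text);
  return [...entries].sort((x, y) => score(y) - score(x));
}
```

The keyword component matters for exact identifiers (error codes, function names) that embeddings tend to smear together.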
Workflow and orchestration
On top of those two layers is the workflow. I found early on that stuffing all phase instructions into one prompt caused context drift. The agent would gradually lose track of earlier instructions as the conversation grew. So each phase runs as its own slash command, re-injecting fresh instructions every time. Five phases:
- Spec Dev: develop unambiguous requirements through dialogue.
- Plan: build a technical plan enriched by memory search.
- Work: execute the plan with test-driven development.
- Review: run specialized verification across the changes.
- Compound: extract what was learned and write it back into lessons, or propose stronger automatic checks, such as new lint rules.
These phases can run manually, but the point is to chain them. The simplest way is /compound:cook-it, which runs all five in one command.
For multiple tasks, ca loop generates a script that picks them up in dependency order and runs each one through the full cycle in its own fresh session. If one fails, it retries. If review finds problems, they get fed back to the implementer. The loop keeps running until all tasks are closed or something needs human input.
Importantly, this is plain code orchestrating agents, not an AI orchestrating other AIs. The generated script controls the flow, decides what runs next, handles retries and failure. The agents only execute within well-defined boundaries. That is what makes it more reliable in practice: the orchestration is deterministic and each session is independent.
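A minimal sketch of that deterministic loop, assuming a `runSession` callback that executes one task in a fresh session and reports pass or fail (both the callback and `maxRetries` are hypothetical names, not the generated script's API):

```typescript
type Result = "pass" | "fail";

interface LoopTask {
  id: string;
  dependsOn: string[];
}

// Plain code controls the flow: pick the next unblocked task, run it in a
// fresh session, retry on failure, stop when done or when a human is needed.
function runLoop(
  tasks: LoopTask[],
  runSession: (id: string) => Result,
  maxRetries = 2
): string[] {
  const done = new Set<string>();
  const log: string[] = [];
  while (done.size < tasks.length) {
    const next = tasks.find(
      (t) => !done.has(t.id) && t.dependsOn.every((d) => done.has(d))
    );
    if (!next) break; // nothing unblocked: surface to a human
    let ok = false;
    for (let attempt = 0; attempt <= maxRetries && !ok; attempt++) {
      ok = runSession(next.id) === "pass"; // each attempt is a fresh session
    }
    if (!ok) break; // retries exhausted: stop rather than thrash
    done.add(next.id);
    log.push(next.id);
  }
  return log;
}
```

Every decision point here is ordinary control flow, which is exactly why the orchestration stays deterministic.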
Verification and hooks
I do not trust a first pass from any model. That is the core assumption behind all these workflows. The system runs multiple specialized reviewers on every change, and can check results across different models from different providers. A finding that one model missed gets caught by another. The point is not process for its own sake. The point is to force a clean loop where each cycle leaves the repository in a better state for the next one.
Tying all of this together is a set of automatic hooks wired into the agent's lifecycle. Relevant lessons are loaded when a session starts. Uncaptured lessons are flagged when it ends. The agent does not need to remember to do any of this. The repository handles it.
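Conceptually, the hooks behave like a small event registry wired once into the repository. The event names and handlers below are illustrative, not Claude Code's actual hook configuration:

```typescript
type LifecycleEvent = "sessionStart" | "sessionEnd";
type Handler = (ctx: { notes: string[] }) => void;

const hooks = new Map<LifecycleEvent, Handler[]>();

// Register a handler for a lifecycle event.
function on(event: LifecycleEvent, handler: Handler): void {
  hooks.set(event, [...(hooks.get(event) ?? []), handler]);
}

// Fire all handlers for an event.
function emit(event: LifecycleEvent, ctx: { notes: string[] }): void {
  for (const h of hooks.get(event) ?? []) h(ctx);
}

// The repository wires these once; no agent has to remember them.
on("sessionStart", (ctx) => ctx.notes.push("loaded relevant lessons"));
on("sessionEnd", (ctx) => ctx.notes.push("flagged uncaptured lessons"));
```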
What It Looked Like In Practice
I used compound-agent to build drift-watch, a separate tool for tracking behavioral drift in AI agents across sessions.
The idea behind it is simple: the more I have autonomous tools building for me, the less I will be able to spot behavioral drift on my own. On top of that, testing and linting practices were historically designed around human behavior. Agentic AI is a different flavor, and I find it essential to keep observing how models behave over time to catch emergent but potentially detrimental patterns.
While building this second project, the useful part was not "one command and the whole product shipped itself." The useful part was that the work stayed legible from beginning to end. The architect decomposed the project into tasks sized for a single work cycle. ca loop ran them in dependency order, each in its own session. The whole build spanned multiple sessions with no shared context between them.
The loop surfaced real issues along the way. Review found missing error handling in the event pipeline, a database migration gap between two modules, and weak test coverage on the session parser. I fixed those on a short second run.
This build was quite autonomous. I was actually away and came back only to have a look once all the verification gates passed. For me, this was not blind trust. I have built more and more systems this way for my job. Atomic task design keeps each session focused, strong specifications constrain the work, and layered review catches problems before they compound. That is what let a multi-hour build I did not supervise end in a working, reviewed codebase.
That is the honest state of the system today. It is already useful for turning a specification into structured implementation work and for surfacing defects early. It does not decide what to build, and long autonomous runs are not a fully solved problem yet. I am still optimizing it.
What I Am Working On Now
The current bottleneck is not code generation. It is specifying the system well and evaluating the result rigorously. I am currently adding stronger architectural preparation and scenario-based testing inside the loop.
The idea is to run representative scenarios against candidate changes before accepting them, and reject what does not improve the result. The hard part is defining scenarios that are precise enough to catch regressions but general enough to survive many iterations. I think this same harness is generalizable beyond coding to other knowledge work workflows, but that is a topic for another post.
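A sketch of that acceptance rule, assuming each scenario can score both a baseline and a candidate variant (the `Scenario` shape and the rule itself are illustrative, not the implemented design):

```typescript
// Illustrative scenario: scores a variant, higher is better.
interface Scenario {
  name: string;
  run: (variant: "baseline" | "candidate") => number;
}

// Accept a candidate change only if no scenario regresses
// and at least one scenario improves over the baseline.
function acceptCandidate(scenarios: Scenario[]): boolean {
  let improved = false;
  for (const s of scenarios) {
    const base = s.run("baseline");
    const cand = s.run("candidate");
    if (cand < base) return false; // any regression rejects the change
    if (cand > base) improved = true;
  }
  return improved;
}
```

The hard part, as noted above, is not this gate but writing scenarios that stay meaningful across many iterations.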
Getting Started
If you want to try compound-agent, the current entry point is the npm package:
```shell
pnpm add -D compound-agent
npx ca setup
```
It expects Node.js 20+ and installs the local memory stack, including the embedding model. Task orchestration depends on Beads, so bd also needs to be available in the repository. The tool is under active development.
After setup, the best starting point is /compound:spec-dev or /compound:architect. Pick a project you want to build and start iterating with Claude to define the scope, requirements, and task breakdown. That is where the system starts paying off.
Today the deepest integration is with Claude Code, though I am pushing for integration with other agentic CLIs. The code and architecture docs live at NathanDela/compound-agent.
The project is built by agentic AI, so do not hesitate to use an agent to explore it.