
Building & Tools · March 2026

Building with Two AI Agents

What happens when Claude Code and Codex work on the same projects, how the first version broke, and why the fix is turning into an operating layer rather than a clever workflow.

01

The Setup

I use two AI coding agents on the same projects. Claude Code (Anthropic, Opus 4) runs via terminal CLI. OpenAI Codex runs via its own CLI. Both work on the same deliverables: FTI consulting decks, personal site features, data tools. Sometimes in the same day. Sometimes on the same file.

This is not the marketing version of “multi-agent development.” There is no orchestrator. No shared memory bus. No elegant handoff protocol. There are two independent agents, two terminal windows, and me in the middle trying to keep the work coherent.

On March 24, 2026, I did an audit. I found three problems that had been compounding silently for months.

02

Two Brains, Zero Shared Memory

30+ Claude skill files · well-organized, indexed
12 Codex skill files · overlapping topics, separate
0 shared between them · no reads, no writes, no bridge

Each agent had its own private skill library. Claude maintained 30+ skill files in ~/Desktop/~Working/skills/ — well-organized, indexed, covering everything from SVG chart patterns to Excel formatting rules. Codex had 12 skills in ~/.codex/skills/ with overlapping topics but a completely separate file structure.

They never read each other’s work. Codex could catch a bug on a deck export and fix it, but Claude would repeat the exact same bug the next session because it had no idea the fix existed. Knowledge was being created in both systems and retained in neither — at least not in a way that crossed the gap.

Figure 1

Agent Knowledge Silos: Two Libraries, Zero Overlap

[Diagram: CLAUDE CODE, ~/Desktop/~Working/skills/, 30+ skill files (design-craft, color-and-layout, svg-charts, testing-ai-output, cross-model-review, selvin-validation, ai-adoption, build-journal, excel-formatting, split-sync-paths) vs CODEX, ~/.codex/skills/, 12 skill files (deck-export, pptx-tables, folder-cleanup, cpi-module, excel-templates, image-optimization). 0 shared between them: no reads, no writes, no bridge. March 24, 2026 — before the protocol.]

Claude Code maintained 30+ well-indexed skill files. Codex had 12 in a separate directory with overlapping topics. Neither agent ever read the other’s library. A bug fixed by one agent would be repeated by the other in the next session.

03

The Learning Loop That Never Learned

89 signal entries · signals.jsonl over 3 months
0 useful events captured · every entry: event = '?'
6 manual learnings · frozen since January 2

We had built an automated system to capture learning signals. The architecture looked right on paper: roi_tracker.py wrote to signals.jsonl after every session. A synthesis script was supposed to bridge those signals into a learnings database. A correction detector scanned transcripts for patterns.

It produced 89 entries over months of use. Every single one had event: '?'. The field meant to capture what happened was blank on all records. The correction detector found almost nothing. The learnings database hadn’t been meaningfully updated since January 2.

A telemetry pipeline that runs but captures nothing is worse than no pipeline — it creates the illusion of learning while nothing is actually retained.
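The failure mode is mundane. A hypothetical minimal reproduction (the real roi_tracker.py internals are not shown here, and `log_signal` is an invented name): when a logging function has a placeholder default and no call site ever passes the field, the placeholder ships on every record without anything ever erroring.

```python
import json

LOG_PATH = "signals.jsonl"

def log_signal(event="?", **fields):
    """Append one signal entry. If no caller ever passes `event`,
    the '?' placeholder ships silently on every record."""
    entry = {"event": event, **fields}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Every call site passed session metadata but never the event itself:
log_signal(session="deck-export", duration_min=42)
```

Nothing crashes, the file grows, and the dashboard shows activity. The bug is only visible if someone reads the data.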

Figure 2

Signal Quality Over Time: 89 Entries, Zero Useful Events

[Chart: all 89 entries in signals.jsonl, January–March 2026, plotted at zero event quality. Every entry: event = '?' (no data captured). signals.jsonl — months of automated collection producing nothing.]

The roi_tracker.py script wrote 89 entries to signals.jsonl over three months. Every single one had the event field set to '?' — the field meant to capture what actually happened was blank on every record. The correction detector found almost nothing, and the learnings database had been frozen since January 2.

04

The Folder Archaeology Problem

The Tara arbitration deck folder was the worst case study. The Clean/ folder — the delivery folder, the thing that goes to the client — had 26 items: 4 Office lock files, 4 versioned editable PowerPoints (V1–V4), a BACKUP copy, multiple NATIVE versions, dated exports. You could not look at it and know which file was the right one.

Then there were the duplicated build environments. Three separate deck/ directories: deck/, deck_codex_v1/, deck_codex_v2/. Each with its own node_modules — 243MB of duplicated dependencies.

No way to tell what was canonical. No record of what changed between V2 and V3. No note about which agent built which version. Every new session started with archaeology: “which file is the latest? what did the last session do? what broke?”

Figure 3

Tara Folder Cleanup: Before vs After the Protocol

Folder size: 370MB before → 125MB after (-66%)
Files in Clean/: 26 → 2 (-92%)
Lock files: 4 → 0 (gone)
node_modules dirs: 3 → 0 (gone)

The Tara arbitration deck folder was the messiest workspace: 26 mixed items in Clean/, 3 separate node_modules directories eating 243MB, 4 Office lock files. After applying the protocol: 2 canonical files in Clean/, everything else archived or deleted. The unversioned filename IS the latest.

After the cleanup

Can you find the latest?

Before: No. After: Yes, instantly. The unversioned filename IS the latest.

Build history

14 historical entries reconstructed from timestamps. Explicitly marked as inferred context.

05

The Fix: A Protocol, Not a Database

The fix was not more automation. It was not a shared database or an API or a synchronization service. It was a protocol — a set of rules both agents follow, embedded in the folder structure itself.

Figure 4

The Four-Layer Protocol

1. Unified Skills Library · one library, both agents read and write
2. Build Journals · append-only _BUILD_LOG.md per project
3. Workspace Hygiene · _WORKSPACE.md defines canonical files
4. Skill Extraction · project learnings become portable skills
Each layer narrows toward portable knowledge.

The fix was not more automation. It was a protocol — a set of rules both agents follow, embedded in the folder structure itself. Layer 1 unifies the knowledge. Layer 2 captures what happened. Layer 3 prevents clutter. Layer 4 turns project lessons into cross-project skills.

Layer 1: Unified Skills Library

One library; both agents read and write to it. I created AGENTS.md (Codex's instruction file) pointing to the same skills directory, and a bridge document defines who does what.

Layer 2: Build Journals

Every project gets _BUILD_LOG.md — an append-only log. Both agents MUST read before starting and MUST write an entry before ending. This is the cross-agent memory. No separate telemetry system — the memory is inline with the work.
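The read-then-append discipline is simple enough to sketch. This is a hypothetical helper, not the actual tooling; the entry format (date, agent, summary, skill-candidate flag) is an assumption about what a journal entry might contain:

```python
from datetime import date
from pathlib import Path

def read_log(project_dir):
    """The MUST-read step: load the full journal before starting work."""
    log = Path(project_dir) / "_BUILD_LOG.md"
    return log.read_text() if log.exists() else ""

def append_entry(project_dir, agent, summary, skill_candidate=False):
    """The MUST-write step. Append-only: never rewrite history, only add to it."""
    log = Path(project_dir) / "_BUILD_LOG.md"
    entry = (
        f"\n## {date.today().isoformat()} | {agent}\n"
        f"{summary}\n"
        f"Skill candidate: {'Yes' if skill_candidate else 'No'}\n"
    )
    with open(log, "a") as f:
        f.write(entry)
```

Because the journal lives next to the deliverables, either agent (or a human) can open the folder and reconstruct the session history without any external system.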

Layer 3: Workspace Hygiene

Every project gets _WORKSPACE.md — defines what's canonical, where things go, cleanup rules. The unversioned filename IS the latest. Old versions go to _archive/. Lock files get deleted on sight.
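The hygiene rules are mechanical enough to automate. A sketch under two assumptions drawn from the rules above: Office lock files start with `~$`, and versioned copies carry a `_V<n>` suffix (`tidy` is an invented name):

```python
import re
from pathlib import Path

def tidy(clean_dir):
    """Apply the hygiene rules: delete lock files on sight,
    move versioned copies to _archive/, leave canonical files alone."""
    clean = Path(clean_dir)
    archive = clean / "_archive"
    for p in list(clean.iterdir()):
        if not p.is_file():
            continue
        if p.name.startswith("~$"):        # Office lock file: delete on sight
            p.unlink()
        elif re.search(r"_V\d+", p.stem):  # versioned copy: archive it
            archive.mkdir(exist_ok=True)
            p.rename(archive / p.name)
    # Whatever survives is canonical: the unversioned filename IS the latest.
    return sorted(p.name for p in clean.iterdir() if p.is_file())
```

Run against a folder like the old Tara Clean/, this would leave only the unversioned deliverables behind.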

Layer 4: Skill Extraction

When a build log entry says "Skill candidate: Yes", the next agent extracts it into a proper skill file. This is how individual project learnings become portable knowledge.
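Finding the extraction backlog is a plain text scan, which is part of why the protocol needs no custom tooling. A sketch, assuming the "Skill candidate: Yes" phrasing appears verbatim on its own line in each journal:

```python
from pathlib import Path

def skill_candidates(workspace_root):
    """Scan every project's _BUILD_LOG.md for entries flagged for extraction.
    Returns (log path, line number) pairs for the next agent to work through."""
    hits = []
    for log in Path(workspace_root).rglob("_BUILD_LOG.md"):
        for i, line in enumerate(log.read_text().splitlines(), start=1):
            if line.strip() == "Skill candidate: Yes":
                hits.append((str(log), i))
    return hits
```

Any agent can run this at session start to see what is waiting to be promoted.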

The key principle: the agent that does the work writes the memory. No intermediary scripts. No deferred synthesis. No batch processing of signals into insights. You finished building the deck? Write what you built, what broke, and what the next agent should know. That entry sits in the same folder as the work.

What we killed

signals.jsonl · 89 useless entries — deleted
synthesize_signals.py · the bridge that didn't bridge — deleted
session-learnings-capture.py · the detector that didn't detect — deleted
session-learnings.json · frozen since January — archived
06

What The System Actually Is

The more technical description is that it is a lightweight agent operating layer. Not an autonomous swarm. Not a central orchestrator. An operating layer. It separates routing, execution, memory, and promotion into distinct surfaces that can evolve independently.

The control plane lives in the instruction files and shared skills. That is where agent roles, quality bars, tool preferences, and failure modes are encoded. The execution plane is the actual work: code edits, exports, tests, deck builds, document rendering. The memory plane is local and explicit: _BUILD_LOG.md, _WORKSPACE.md, and canonical filenames. The promotion plane is what turns a one-project lesson into a reusable skill that changes future behavior across the workspace.

Figure 6A

The Operating Layer in One View

Visible surfaces

Command Center
Open Loops
Context Pack

The pages that make today legible.

Control layer

Skills
Agent roles
Quality rules

The judgment layer that routes and constrains the work.

Canonical substrate

_BUILD_LOG.md
_WORKSPACE.md
project docs

Local records that survive the session.

Promotion lane

Skill candidates
shared defaults
new playbooks

What turns one project lesson into future leverage.

Loop: route → build → challenge → promote

The interface is a view over the records. It is not a replacement for them.

Figure 5

Protocol as Operating Loop

Task Intake (prompt, project context, canonical target) → Capability Routing (pick agent, tools, and validation mode) → Execution (code, docs, exports, tests, fixes) → Inline Memory (_BUILD_LOG.md + _WORKSPACE.md) → Promotion (skill candidate → shared rule) → Quality Gates (review, tests, provenance, health checks). Work updates memory; memory updates future work.

The important shift is that the protocol is no longer just a handoff ritual. It becomes a loop: routing determines who works, work produces artifacts, artifacts generate inline memory, memory is promoted into reusable skills, and those skills change how the next task gets routed.

Control Plane

Shared skills, bridge rules, project instructions, and agent strengths. This is where routing logic and durable judgment live.

Execution Plane

The agent session doing the work: editing files, running tests, exporting artifacts, cleaning folders, fixing regressions.

Memory Plane

Local project state captured inline: what is canonical, what changed, what broke, what the next session must know.

Promotion Plane

The mechanism that upgrades a local lesson into a shared skill so future sessions start with better defaults.

07

How It Becomes A Growing Ecosystem

Right now the system is durable. The next step is to make it more adaptive. That means adding explicit surfaces for agent capability discovery, promotion queues, and health checks, so the protocol does not just preserve knowledge but also reallocates work, catches drift, and improves its own routing over time.

1. Agent Registry

A living registry of agent strengths, weak spots, preferred tools, and reliability by task type. Not just "Claude researches" and "Codex builds" but which agent is best for deck QA, docx rendering, browser automation, spreadsheet generation, and cross-model review.

2. Project Capability Manifest

A more structured project contract on top of _WORKSPACE.md: canonical outputs, test commands, safe write zones, review requirements, and which artifacts count as source of truth.

3. Promotion Queue

Every "Skill candidate: Yes" entry becomes an explicit backlog item rather than a polite suggestion. That creates measurable promotion latency: how long it takes for local pain to become reusable judgment.

4. Quality Gates

Cross-model review, acceptance-criteria generation, provenance checks, rendering checks, and file-health checks become standard gates that attach to task classes rather than ad hoc review requests.

5. Protocol Health Metrics

Coverage metrics such as build-log compliance, unresolved-open-item count, duplicate-fix recurrence, canonical-path drift, and stale-skill detection. The system should tell you when it is decaying.
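The first of these metrics, build-log compliance, is cheap to compute today. A hypothetical sketch, assuming a project counts as compliant when its _BUILD_LOG.md exists and is non-empty (`build_log_compliance` is an invented name, not part of the existing tooling):

```python
from pathlib import Path

def build_log_compliance(workspace_root):
    """Fraction of projects whose _BUILD_LOG.md exists and has content.
    A falling number means the protocol is decaying."""
    projects = [p for p in Path(workspace_root).iterdir() if p.is_dir()]
    if not projects:
        return 1.0  # vacuously compliant: nothing to track yet
    compliant = sum(
        1 for p in projects
        if (p / "_BUILD_LOG.md").exists()
        and (p / "_BUILD_LOG.md").stat().st_size > 0
    )
    return compliant / len(projects)
```

The same shape works for the other metrics: each is a small scan over explicit files, never a query against hidden telemetry.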

6. Generated Views from Explicit Memory

Dashboards are still useful, but they should compile from build logs and workspace contracts, never from hidden telemetry. The UI becomes a view over explicit records, not a replacement for them.

Technical north star

The system should act less like two isolated chat sessions and more like a modular environment with explicit memory, explicit routing, explicit quality gates, and explicit promotion of new knowledge.

Put differently: build logs are the event stream, skills are the durable rules, workspace files are the project contract, and the next layer is health monitoring that detects when the ecosystem stops learning.

April 2026 update

The next step now has a name: Jenn OS. Not a grand autonomous swarm, and not a replacement for the protocol described above. A local operating layer.

The useful v1 scope is smaller than the rhetoric. Morning brief, project context packs, open-loop tracking, and closeout that writes memory at the correct layer. In other words: take the protocol and give it a daily product surface.

08

Lessons

1. Telemetry pipelines are fragile

They break silently and nobody checks. Inline-with-the-work logging — build journals — is more durable because the agent that writes the entry is the same one doing the work. There is no gap between “something happened” and “something was recorded.” The record is a side effect of the work, not a separate system that observes it.

2. Two agents need a protocol, not a database

The solution was not a shared database or an API or a vector store. It was a markdown file in each project folder that both agents read and write to. The protocol is simple enough that any LLM can follow it without custom tooling. Simple is durable. Sophisticated breaks.

3. The folder IS the interface

When the canonical output has no version suffix and lives in Clean/, you don’t need a dashboard to find it. The folder structure communicates state. It answers “what’s the latest?” without opening any file. This is more legible to humans and agents alike than any metadata database.

4. Kill dead infrastructure aggressively

The learning loop had been running for months, creating the comfortable feeling that "we're capturing data." We weren't: 89 entries, every event field blank. Running infrastructure that produces nothing is worse than having nothing — it absorbs the attention that would otherwise go toward noticing the gap. Delete it. Build something that works.

5. Skills over memories

Per-session memories are ephemeral. They describe what happened in a specific context. Skills are portable. They encode judgment that applies across contexts. The goal is not “remember what happened in session X” — it’s “encode the judgment so the next session starts smarter, regardless of which agent runs it.”

How this session worked

This entire infrastructure rebuild was done in one session: 3 background agents running in parallel (skill files, Codex bridge, Tara cleanup), 1 direct thread killing the learning loop. Six tasks tracked, all completed.

The agents didn’t step on each other because the tasks were independent. That’s the same principle we’re encoding for future work: read before you start, write when you’re done, don’t touch what someone else is touching.

multi-agent · developer-tools · infrastructure · protocols · workflow · Claude-Code · Codex
Jenn Musings · jennumanzor.com