
Building & Tools · March 2026

Building with Two AI Agents

What happens when Claude Code and Codex work on the same projects, how the first version broke, and why the fix is turning into an operating layer rather than a clever workflow.

01

The Setup

I use two AI coding agents on the same projects. Claude Code (Anthropic, Opus 4) runs via terminal CLI. OpenAI Codex runs via its own CLI. Both work on the same deliverables: FTI consulting decks, personal site features, data tools. Sometimes in the same day. Sometimes on the same file.

This is not the marketing version of “multi-agent development.” There is no orchestrator. No shared memory bus. No elegant handoff protocol. There are two independent agents, two terminal windows, and me in the middle trying to keep the work coherent.

On March 24, 2026, I did an audit. I found three problems that had been compounding silently for months.

02

Two Brains, Zero Shared Memory

30+ Claude skill files · well-organized, indexed
12 Codex skill files · overlapping topics, separate
0 shared between them · no reads, no writes, no bridge

Each agent had its own private skill library. Claude maintained 30+ skill files in ~/Desktop/~Working/skills/ — well-organized, indexed, covering everything from SVG chart patterns to Excel formatting rules. Codex had 12 skills in ~/.codex/skills/ with overlapping topics but a completely separate file structure.

They never read each other’s work. Codex could catch a bug on a deck export and fix it, but Claude would repeat the exact same bug the next session because it had no idea the fix existed. Knowledge was being created in both systems and retained in neither — at least not in a way that crossed the gap.

Figure 1

Agent Knowledge Silos: Two Libraries, Zero Overlap

[Diagram: CLAUDE CODE, ~/Desktop/~Working/skills/, 30+ skill files (design-craft, color-and-layout, svg-charts, testing-ai-output, cross-model-review, selvin-validation, ai-adoption, build-journal, excel-formatting, split-sync-paths) vs CODEX, ~/.codex/skills/, 12 skill files (deck-export, pptx-tables, folder-cleanup, cpi-module, excel-templates, image-optimization). 0 shared between them: no reads, no writes, no bridge. March 24, 2026 — before the protocol.]

Claude Code maintained 30+ well-indexed skill files. Codex had 12 in a separate directory with overlapping topics. Neither agent ever read the other’s library. A bug fixed by one agent would be repeated by the other in the next session.

03

The Learning Loop That Never Learned

89 signal entries · signals.jsonl over 3 months
0 useful events captured · every entry: event = '?'
6 manual learnings · frozen since January 2

We had built an automated system to capture learning signals. The architecture looked right on paper: roi_tracker.py wrote to signals.jsonl after every session. A synthesis script was supposed to bridge those signals into a learnings database. A correction detector scanned transcripts for patterns.

It produced 89 entries over months of use. Every single one had event: '?'. The field meant to capture what happened was blank on all records. The correction detector found almost nothing. The learnings database hadn’t been meaningfully updated since January 2.

A telemetry pipeline that runs but captures nothing is worse than no pipeline — it creates the illusion of learning while nothing is actually retained.
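The failure mode is mundane. A hypothetical minimal reproduction (the real roi_tracker.py internals are not shown here, and `log_signal` is an invented name): when a logging function has a placeholder default and no call site ever passes the field, the placeholder ships on every record without anything ever erroring.

```python
import json

LOG_PATH = "signals.jsonl"

def log_signal(event="?", **fields):
    """Append one signal entry. If no caller ever passes `event`,
    the '?' placeholder ships silently on every record."""
    entry = {"event": event, **fields}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Every call site passed session metadata but never the event itself:
log_signal(session="deck-export", duration_min=42)
```

Nothing crashes, the file grows, and the dashboard shows activity. The bug is only visible if someone reads the data.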

Figure 2

Signal Quality Over Time: 89 Entries, Zero Useful Events

[Chart: all 89 entries in signals.jsonl, January–March 2026, plotted at zero event quality. Every entry: event = '?' (no data captured). signals.jsonl — months of automated collection producing nothing.]

The roi_tracker.py script wrote 89 entries to signals.jsonl over three months. Every single one had the event field set to '?' — the field meant to capture what actually happened was blank on every record. The correction detector found almost nothing, and the learnings database had been frozen since January 2.

04

The Folder Archaeology Problem

The Tara arbitration deck folder was the worst case study. The Clean/ folder — the delivery folder, the thing that goes to the client — had 26 items: 4 Office lock files, 4 versioned editable PowerPoints (V1–V4), a BACKUP copy, multiple NATIVE versions, dated exports. You could not look at it and know which file was the right one.

Then there were the duplicated build environments. Three separate deck/ directories: deck/, deck_codex_v1/, deck_codex_v2/. Each with its own node_modules — 243MB of duplicated dependencies.

No way to tell what was canonical. No record of what changed between V2 and V3. No note about which agent built which version. Every new session started with archaeology: “which file is the latest? what did the last session do? what broke?”

Figure 3

Tara Folder Cleanup: Before vs After the Protocol

Folder size: 370MB before → 125MB after (-66%)
Files in Clean/: 26 → 2 (-92%)
Lock files: 4 → 0 (gone)
node_modules dirs: 3 → 0 (gone)

The Tara arbitration deck folder was the messiest workspace: 26 mixed items in Clean/, 3 separate node_modules directories eating 243MB, 4 Office lock files. After applying the protocol: 2 canonical files in Clean/, everything else archived or deleted. The unversioned filename IS the latest.

After the cleanup

Can you find the latest?

Before: No. After: Yes, instantly. The unversioned filename IS the latest.

Build history

14 historical entries reconstructed from timestamps. Explicitly marked as inferred context.

05

The Fix: A Protocol, Not a Database

The fix was not more automation. It was not a shared database or an API or a synchronization service. It was a protocol — a set of rules both agents follow, embedded in the folder structure itself.

Figure 4

The Four-Layer Protocol

1. Unified Skills Library · one library, both agents read and write
2. Build Journals · append-only _BUILD_LOG.md per project
3. Workspace Hygiene · _WORKSPACE.md defines canonical files
4. Skill Extraction · project learnings become portable skills
Each layer narrows toward portable knowledge.

The fix was not more automation. It was a protocol — a set of rules both agents follow, embedded in the folder structure itself. Layer 1 unifies the knowledge. Layer 2 captures what happened. Layer 3 prevents clutter. Layer 4 turns project lessons into cross-project skills.

Layer 1: Unified Skills Library

One library; both agents read and write to it. I created AGENTS.md (Codex's instruction file) pointing to the same skills directory, and a bridge document defines who does what.

Layer 2: Build Journals

Every project gets _BUILD_LOG.md — an append-only log. Both agents MUST read before starting and MUST write an entry before ending. This is the cross-agent memory. No separate telemetry system — the memory is inline with the work.
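The read-then-append discipline is simple enough to sketch. This is a hypothetical helper, not the actual tooling; the entry format (date, agent, summary, skill-candidate flag) is an assumption about what a journal entry might contain:

```python
from datetime import date
from pathlib import Path

def read_log(project_dir):
    """The MUST-read step: load the full journal before starting work."""
    log = Path(project_dir) / "_BUILD_LOG.md"
    return log.read_text() if log.exists() else ""

def append_entry(project_dir, agent, summary, skill_candidate=False):
    """The MUST-write step. Append-only: never rewrite history, only add to it."""
    log = Path(project_dir) / "_BUILD_LOG.md"
    entry = (
        f"\n## {date.today().isoformat()} | {agent}\n"
        f"{summary}\n"
        f"Skill candidate: {'Yes' if skill_candidate else 'No'}\n"
    )
    with open(log, "a") as f:
        f.write(entry)
```

Because the journal lives next to the deliverables, either agent (or a human) can open the folder and reconstruct the session history without any external system.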

Layer 3: Workspace Hygiene

Every project gets _WORKSPACE.md — defines what's canonical, where things go, cleanup rules. The unversioned filename IS the latest. Old versions go to _archive/. Lock files get deleted on sight.
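The hygiene rules are mechanical enough to automate. A sketch under two assumptions drawn from the rules above: Office lock files start with `~$`, and versioned copies carry a `_V<n>` suffix (`tidy` is an invented name):

```python
import re
from pathlib import Path

def tidy(clean_dir):
    """Apply the hygiene rules: delete lock files on sight,
    move versioned copies to _archive/, leave canonical files alone."""
    clean = Path(clean_dir)
    archive = clean / "_archive"
    for p in list(clean.iterdir()):
        if not p.is_file():
            continue
        if p.name.startswith("~$"):        # Office lock file: delete on sight
            p.unlink()
        elif re.search(r"_V\d+", p.stem):  # versioned copy: archive it
            archive.mkdir(exist_ok=True)
            p.rename(archive / p.name)
    # Whatever survives is canonical: the unversioned filename IS the latest.
    return sorted(p.name for p in clean.iterdir() if p.is_file())
```

Run against a folder like the old Tara Clean/, this would leave only the unversioned deliverables behind.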

Layer 4: Skill Extraction

When a build log entry says "Skill candidate: Yes", the next agent extracts it into a proper skill file. This is how individual project learnings become portable knowledge.
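Finding the extraction backlog is a plain text scan, which is part of why the protocol needs no custom tooling. A sketch, assuming the "Skill candidate: Yes" phrasing appears verbatim on its own line in each journal:

```python
from pathlib import Path

def skill_candidates(workspace_root):
    """Scan every project's _BUILD_LOG.md for entries flagged for extraction.
    Returns (log path, line number) pairs for the next agent to work through."""
    hits = []
    for log in Path(workspace_root).rglob("_BUILD_LOG.md"):
        for i, line in enumerate(log.read_text().splitlines(), start=1):
            if line.strip() == "Skill candidate: Yes":
                hits.append((str(log), i))
    return hits
```

Any agent can run this at session start to see what is waiting to be promoted.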

The key principle: the agent that does the work writes the memory. No intermediary scripts. No deferred synthesis. No batch processing of signals into insights. You finished building the deck? Write what you built, what broke, and what the next agent should know. That entry sits in the same folder as the work.

What we killed

signals.jsonl · 89 useless entries — deleted
synthesize_signals.py · the bridge that didn't bridge — deleted
session-learnings-capture.py · the detector that didn't detect — deleted
session-learnings.json · frozen since January — archived
06

What The System Actually Is

The more technical description is that it is a lightweight agent operating layer. Not an autonomous swarm. Not a central orchestrator. An operating layer. It separates routing, execution, memory, and promotion into distinct surfaces that can evolve independently.

The control plane lives in the instruction files and shared skills. That is where agent roles, quality bars, tool preferences, and failure modes are encoded. The execution plane is the actual work: code edits, exports, tests, deck builds, document rendering. The memory plane is local and explicit: _BUILD_LOG.md, _WORKSPACE.md, and canonical filenames. The promotion plane is what turns a one-project lesson into a reusable skill that changes future behavior across the workspace.

Figure 6A

The Operating Layer in One View

Visible surfaces

Command Center
Open Loops
Context Pack

The pages that make today legible.

Control layer

Skills
Agent roles
Quality rules

The judgment layer that routes and constrains the work.

Canonical substrate

_BUILD_LOG.md
_WORKSPACE.md
project docs

Local records that survive the session.

Promotion lane

Skill candidates
shared defaults
new playbooks

What turns one project lesson into future leverage.

Loop: route → build → challenge → promote

The interface is a view over the records. It is not a replacement for them.

Figure 5

Protocol as Operating Loop

Task Intake (prompt, project context, canonical target) → Capability Routing (pick agent, tools, and validation mode) → Execution (code, docs, exports, tests, fixes) → Inline Memory (_BUILD_LOG.md + _WORKSPACE.md) → Promotion (skill candidate → shared rule) → Quality Gates (review, tests, provenance, health checks). Work updates memory; memory updates future work.

The important shift is that the protocol is no longer just a handoff ritual. It becomes a loop: routing determines who works, work produces artifacts, artifacts generate inline memory, memory is promoted into reusable skills, and those skills change how the next task gets routed.

Control Plane

Shared skills, bridge rules, project instructions, and agent strengths. This is where routing logic and durable judgment live.

Execution Plane

The agent session doing the work: editing files, running tests, exporting artifacts, cleaning folders, fixing regressions.

Memory Plane

Local project state captured inline: what is canonical, what changed, what broke, what the next session must know.

Promotion Plane

The mechanism that upgrades a local lesson into a shared skill so future sessions start with better defaults.

07

How It Becomes A Growing Ecosystem

Right now the system is durable. The next step is to make it more adaptive. That means adding explicit surfaces for agent capability discovery, promotion queues, and health checks, so the protocol does not just preserve knowledge but also reallocates work, catches drift, and improves its own routing over time.

1. Agent Registry

A living registry of agent strengths, weak spots, preferred tools, and reliability by task type. Not just "Claude researches" and "Codex builds" but which agent is best for deck QA, docx rendering, browser automation, spreadsheet generation, and cross-model review.

2. Project Capability Manifest

A more structured project contract on top of _WORKSPACE.md: canonical outputs, test commands, safe write zones, review requirements, and which artifacts count as source of truth.

3. Promotion Queue

Every "Skill candidate: Yes" entry becomes an explicit backlog item rather than a polite suggestion. That creates measurable promotion latency: how long it takes for local pain to become reusable judgment.

4. Quality Gates

Cross-model review, acceptance-criteria generation, provenance checks, rendering checks, and file-health checks become standard gates that attach to task classes rather than ad hoc review requests.

5. Protocol Health Metrics

Coverage metrics such as build-log compliance, unresolved-open-item count, duplicate-fix recurrence, canonical-path drift, and stale-skill detection. The system should tell you when it is decaying.
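The first of these metrics, build-log compliance, is cheap to compute today. A hypothetical sketch, assuming a project counts as compliant when its _BUILD_LOG.md exists and is non-empty (`build_log_compliance` is an invented name, not part of the existing tooling):

```python
from pathlib import Path

def build_log_compliance(workspace_root):
    """Fraction of projects whose _BUILD_LOG.md exists and has content.
    A falling number means the protocol is decaying."""
    projects = [p for p in Path(workspace_root).iterdir() if p.is_dir()]
    if not projects:
        return 1.0  # vacuously compliant: nothing to track yet
    compliant = sum(
        1 for p in projects
        if (p / "_BUILD_LOG.md").exists()
        and (p / "_BUILD_LOG.md").stat().st_size > 0
    )
    return compliant / len(projects)
```

The same shape works for the other metrics: each is a small scan over explicit files, never a query against hidden telemetry.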

6. Generated Views from Explicit Memory

Dashboards are still useful, but they should compile from build logs and workspace contracts, never from hidden telemetry. The UI becomes a view over explicit records, not a replacement for them.

Technical north star

The system should act less like two isolated chat sessions and more like a modular environment with explicit memory, explicit routing, explicit quality gates, and explicit promotion of new knowledge.

Put differently: build logs are the event stream, skills are the durable rules, workspace files are the project contract, and the next layer is health monitoring that detects when the ecosystem stops learning.

April 2026 update

The next step now has a name: Jenn OS. Not a grand autonomous swarm, and not a replacement for the protocol described above. A local operating layer.

The useful v1 scope is smaller than the rhetoric. Morning brief, project context packs, open-loop tracking, and closeout that writes memory at the correct layer. In other words: take the protocol and give it a daily product surface.

08

Lessons

1. Telemetry pipelines are fragile

They break silently and nobody checks. Inline-with-the-work logging — build journals — is more durable because the agent that writes the entry is the same one doing the work. There is no gap between “something happened” and “something was recorded.” The record is a side effect of the work, not a separate system that observes it.

2. Two agents need a protocol, not a database

The solution was not a shared database or an API or a vector store. It was a markdown file in each project folder that both agents read and write to. The protocol is simple enough that any LLM can follow it without custom tooling. Simple is durable. Sophisticated breaks.

3. The folder IS the interface

When the canonical output has no version suffix and lives in Clean/, you don’t need a dashboard to find it. The folder structure communicates state. It answers “what’s the latest?” without opening any file. This is more legible to humans and agents alike than any metadata database.

4. Kill dead infrastructure aggressively

The learning loop had been running for months, creating the comfortable feeling that "we're capturing data." We weren't: 89 entries, every event field blank. Running infrastructure that produces nothing is worse than having nothing — it absorbs the attention that would otherwise go toward noticing the gap. Delete it. Build something that works.

5. Skills over memories

Per-session memories are ephemeral. They describe what happened in a specific context. Skills are portable. They encode judgment that applies across contexts. The goal is not “remember what happened in session X” — it’s “encode the judgment so the next session starts smarter, regardless of which agent runs it.”

How this session worked

This entire infrastructure rebuild was done in one session: 3 background agents running in parallel (skill files, Codex bridge, Tara cleanup), 1 direct thread killing the learning loop. Six tasks tracked, all completed.

The agents didn’t step on each other because the tasks were independent. That’s the same principle we’re encoding for future work: read before you start, write when you’re done, don’t touch what someone else is touching.

multi-agent · developer-tools · infrastructure · protocols · workflow · Claude-Code · Codex
Jenn Musings · jennumanzor.com