Workflow essay · Technology & Intelligence
Updated March 2026 · Comparative analysis + architecture design

The Models Are Interchangeable. The Architecture Isn't.

A visible wave of multi-model tools landed across March 2026. Most of them treat cross-model as a feature. The better pattern treats it as a constraint: no model is both the builder and the judge.

Seven setups compared across five dimensions. Three models with zero role overlap. One architecture where the intelligence lives in the loop and the skills, not in the model.

01

The Landscape

Across March 2026, OpenAI shipped an official plugin for using Codex from inside Claude Code, Every Inc open-sourced a 12-subagent review system, Y Combinator's CEO published his role-based specialist stack, and two independent projects shipped adversarial debate loops between Claude and Codex.

The question isn't whether to use multiple models. That's settled. The question is whether you're using them as a feature (call another model from your primary tool) or as a constraint: no model touches two roles in the same deliverable.

That distinction is the difference between tooling and architecture.

02

Seven Setups, Five Dimensions

I compared seven multi-model setups on five dimensions that actually matter for output quality: role separation (is the builder different from the reviewer?), learning system (do lessons persist between sessions?), adversarial depth (does the review check provenance and arithmetic, or just code patterns?), independence (genuinely different models, or same model with different prompts?), and portability (does the system work if you swap out a model?).

Figure 1

Seven Multi-Model Setups, Five Dimensions

Setup               Role Sep.  Learning  Adv. Depth  Independence  Portability
codex-plugin-cc     2/5        0/5       2/5         3/5           1/5
Compound Eng.       1/5        4/5       3/5         1/5           3/5
gstack              2/5        1/5       2/5         2/5           2/5
adversarial-spec    3/5        0/5       4/5         4/5           1/5
adversarial-review  3/5        0/5       4/5         4/5           1/5
Everything-CC       1/5        3/5       2/5         1/5           2/5
Jenn's setup        5/5        5/5       5/5         4/5           4/5

Scored on five dimensions: role separation (builder ≠ challenger), learning system (do lessons persist?), adversarial depth (provenance, arithmetic, or just code review?), independence (genuinely different models vs same model with different prompts), portability (works across tools or locked to one). Jenn’s setup scores highest because the architecture was designed around these constraints, not retrofitted.

codex-plugin-cc (OpenAI, Dominik Kundel): three slash commands, review, adversarial review, and rescue (task delegation). Clean plumbing, no learning system. The strategic read: Claude Code holds a commanding developer lead, and OpenAI is walking Codex through the competitor's front door rather than waiting for developers to switch.

Compound Engineering (Every Inc, Dan Shipper): a Plan/Work/Review/Compound loop with 12 subagents reviewing code from different perspectives: security, performance, maintainability, architecture, adversarial failure scenarios. The Compound step captures learnings for future sessions. Strongest learning system in the comparison. Limitation: all 12 subagents are Claude with different prompts, which is parallel review, not independence.

gstack (Garry Tan): a role-based operating stack that turns Claude Code into a virtual team of specialists: CEO review, design review, QA, security, release, and more. It is strong on operational breadth and portability because the same stack can install across multiple agent hosts. Its limitation is that specialization here is mostly a prompting and workflow pattern, not a structurally separate builder/judge system.

Everything-Claude-Code (Affaan Mustafa): a broader agent harness organized around skills, memory, security, and research-first development across Claude Code, Codex, Cursor, and beyond. It gets credit for reusable method and for treating the operating layer as a system. But it is still more of a unified harness than an independence architecture, which is why it scores higher on learning than on role separation.

adversarial-spec (zscole) and adversarial-review (alecnielsen): two independent projects that run Claude and Codex in adversarial debate loops. adversarial-spec has an anti-rubber-stamp mechanism: if a model agrees too quickly, it must explain why. Both get the independence right but have no learning layer.

03

Why Role Separation Matters

A single model producing and validating its own output is a writer proofreading their own essay. They read what they meant to write, not what they actually wrote. Two models with different training data create genuine tension: when they converge, confidence goes up; when they diverge, you've found the interesting part.

But two models isn't enough. If Claude dispatches work to Codex, Claude frames the prompt. That framing biases the output. The independence is compromised at the instruction boundary.

Three models, three roles, zero overlap: ChatGPT writes the acceptance criteria (it defines "correct"). Codex builds the implementation (it doesn't see Claude's context). Claude validates against the criteria, not against the original requirements. The model that specs is not the model that builds, and neither is the model that validates.
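The three-role split can be sketched as a pipeline where each role is a separate function and no role sees another role's context. This is a minimal illustration with stub functions standing in for real API calls; the function names and string formats are my own, not part of the actual setup.

```python
# Three roles, zero overlap. Each function stands in for a call to a
# different model; none of them shares context with the others.

def chatgpt_spec(requirements: str) -> list[str]:
    """Spec writer: turns raw requirements into acceptance criteria."""
    return [f"AC{i}: {line.strip()}"
            for i, line in enumerate(requirements.splitlines(), 1)]

def codex_build(criteria: list[str]) -> str:
    """Builder: sees only the criteria, never the original conversation."""
    return "implementation covering " + ", ".join(c.split(":")[0] for c in criteria)

def claude_validate(artifact: str, criteria: list[str]) -> dict:
    """Validator: checks the artifact against the criteria, not the requirements."""
    return {c: c.split(":")[0] in artifact for c in criteria}

requirements = "parse the log file\nemit daily totals"
criteria = chatgpt_spec(requirements)         # role 1: defines "correct"
artifact = codex_build(criteria)              # role 2: builds, blind to role 1's chat
report = claude_validate(artifact, criteria)  # role 3: judges against criteria only
```

The point of the shape is that swapping any single model out leaves the other two untouched: each role's only interface is the criteria list or the artifact.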

Figure 2

From Single-Model to Three-Model Role Separation

Most people: Claude builds → Claude reviews → ships (same model, same blind spots)
codex-plugin-cc: Claude builds → Codex reviews → ships (two models, but one always leads)
Jenn's operating loop: ChatGPT specs → Codex builds → Claude validates → skills (three models, zero overlap, structural separation)

The single-model flow is a writer proofreading their own essay. The two-model flow adds a second reader but the writer still sets the agenda. The three-model flow ensures the spec-writer, builder, and validator never share a context window. Independence is structural, not aspirational.

04

The Skills Are the Moat

Here is the part nobody wants to hear: the model is a replaceable component. What isn't replaceable is the 30+ skill files that encode judgment from real project failures, the provenance-aware adversarial review that checks whether a cited paper actually says what you claim, and the resolve step that forces you to record what happened to each finding instead of silently discarding it.

Those skills are files. They're portable. Any model can read them. Claude reads them well because its instruction-following is strong. Codex reads them well because it can load them from the shared directory. A future model from Google or Meta could read them too.
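Because the skills are plain files, loading them is model-agnostic by construction. Here is a minimal sketch of what that looks like; the directory layout, file extension, and prompt framing are assumptions for illustration, not the essay's actual conventions.

```python
# Model-agnostic skills loader: the same preamble can be handed to Claude,
# Codex, or any future model, because it is just concatenated text files.
from pathlib import Path

def load_skills(skill_dir: str) -> str:
    """Concatenate every skill file in the directory into one context block."""
    parts = []
    for path in sorted(Path(skill_dir).glob("*.md")):
        parts.append(f"## {path.stem}\n{path.read_text()}")
    return "\n\n".join(parts)

def build_prompt(task: str, skill_dir: str) -> str:
    """Prepend the shared skills preamble to the task, whatever the model."""
    return f"{load_skills(skill_dir)}\n\n# Task\n{task}"
```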

Figure 3

What Transfers Between Models vs. What Doesn’t

PORTABLE
Skills library (30+ files)          95%
Output contract format              100%
Resolve log (verdicts)              100%
Build journal protocol              100%
Adversarial prompts                 85%
Routing decision tree               90%
Devil-advocate personas             70%

MODEL-LOCKED
Model tendency (explore/execute)    0%
Native tool access (MCP)            15%

Seven of nine components are 70%+ portable across models. The two non-portable items — model tendency and native tool access — are exactly what you’d expect: weights-level behavior and platform-specific APIs. Everything else is files, prompts, and protocols that any model can read.

The only things that don't transfer are model-level tendencies (Claude leans toward thoroughness, Codex leans toward exploration) and platform-specific tool access (MCP servers, helper scripts). Everything else (the output contracts, the resolve logs, the routing rules, the build journals) is infrastructure that works regardless of which model you're running.

This is why the copy-paste workflow matters. When I paste the same prompt into both Claude and Codex independently, I'm preserving the independence that makes cross-model comparison meaningful. If Claude dispatches the prompt to Codex, Claude's framing contaminates the output. The human is the routing layer: the person who decides the prompt before either model sees it.
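The copy-paste discipline can be stated as code: the human fixes the prompt once, and each model receives identical text with no model in the dispatch path. `ask_claude` and `ask_codex` are hypothetical stand-ins for the real calls.

```python
# Human-as-router: both models see the same human-authored prompt, so
# neither model's framing can contaminate the other's output.

def ask_claude(prompt: str) -> str:
    return f"claude-answer({len(prompt)} chars)"   # stub for a real API call

def ask_codex(prompt: str) -> str:
    return f"codex-answer({len(prompt)} chars)"    # stub for a real API call

def independent_run(prompt: str) -> dict:
    """Dispatch the identical prompt to each model; no model frames the other."""
    return {"claude": ask_claude(prompt), "codex": ask_codex(prompt)}

results = independent_run("Estimate churn for the Q3 cohort.")
# Convergence raises confidence; divergence marks the interesting part.
converged = results["claude"] == results["codex"]
```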

05

The Operating Loop

The architecture is five steps: Route, Build, Challenge, Resolve, Promote. Every non-trivial deliverable runs through all five. Trivial work skips to Build and ships.
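The branching described above can be reduced to a control-flow skeleton. The step bodies here are stubs and the triviality test is a placeholder; the point is only the shape: trivial work routes straight to Build and ships, everything else runs the full cycle.

```python
# Five-step loop skeleton: Route, Build, Challenge, Resolve, Promote.
# Function names and the triviality heuristic are illustrative only.

def is_trivial(task: str) -> bool:
    """Stand-in for the Route decision: typo-level work needs no challenge."""
    return "typo" in task.lower()

def run_loop(task: str) -> list[str]:
    trace = ["route", "build"]
    if is_trivial(task):
        trace.append("ship")                 # skips adversarial review
        return trace
    trace += ["challenge", "resolve", "promote"]
    return trace
```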

Figure 4

The Operating Loop: Builder → Challenger → Memory

01 ROUTE: owner + risk
02 BUILD: agent + contract (/codex:rescue)
03 CHALLENGE: adversarial / parallel (/challenge)
04 RESOLVE: accept / reject
05 PROMOTE: log + skills (skills/)
Builder → challenger → memory: every non-trivial deliverable runs the full loop.

Every non-trivial deliverable runs through all five steps. The RESOLVE step — where findings get accepted, rejected, or flagged as unresolved — is the one nobody else builds. It’s also the one that makes the system honest.

Route decides who builds, what kind of challenge the output needs, and whether this task needs challenge at all. Not everything does: a typo fix ships without adversarial review. But anything with numbers, claims, or methodology choices routes to challenge.

Build produces the deliverable with an output contract: what was produced, what the challenger should verify, what assumptions were made, what's not in scope. The contract is the spec the challenger reviews against; without it, the review is unfocused.
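One way to make the output contract concrete is a small record the builder emits alongside the deliverable, which the challenger then reviews against. The field names below are illustrative, not the contract format the essay's setup actually uses.

```python
# A minimal output contract: the four sections named in the essay
# (produced, verify, assumptions, out of scope) as a typed record.
from dataclasses import dataclass, field

@dataclass
class OutputContract:
    produced: str                                      # what was produced
    verify: list[str]                                  # what the challenger checks
    assumptions: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

contract = OutputContract(
    produced="churn model v2 + scoring script",
    verify=["arithmetic in the cohort table", "provenance of the 2024 baseline"],
    assumptions=["Q3 data is complete"],
    out_of_scope=["dashboard refresh"],
)
```

Making the contract a structured value rather than free text is what gives the challenge step a fixed target: the review is scoped to `verify`, and anything in `out_of_scope` is not a finding.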

Challenge uses the tool that matches the routing decision. /challenge for provenance and arithmetic. ask_chatgpt for an independent read. compare_approaches for parallel generation. /codex:adversarial-review for lightweight code review.

Resolve is the step nobody else builds. Every finding gets a verdict: accepted (fix it), rejected (explain why it's wrong), or unresolved (the models disagree and you can't tell who's right). Unresolved findings don't disappear; they get flagged in the build log for the next session.
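A resolve log can be sketched as an append-only list where every finding must carry one of the three verdicts, and unresolved entries are explicitly carried forward rather than dropped. The record shape is an assumption for illustration.

```python
# Resolve log: every challenger finding gets an explicit verdict, and
# unresolved entries survive to the next session instead of vanishing.

VERDICTS = {"accepted", "rejected", "unresolved"}

def resolve(finding: str, verdict: str, note: str, log: list[dict]) -> None:
    """Record a verdict; anything outside the three allowed values is an error."""
    assert verdict in VERDICTS, f"unknown verdict: {verdict}"
    log.append({"finding": finding, "verdict": verdict, "note": note})

def carry_forward(log: list[dict]) -> list[dict]:
    """Unresolved findings get flagged for the next session's build log."""
    return [e for e in log if e["verdict"] == "unresolved"]

log: list[dict] = []
resolve("cited paper does not state the 40% figure", "accepted", "fixed citation", log)
resolve("model disagreement on discount rate", "unresolved", "flag for next session", log)
```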

Promote captures what survived. Project changes go to the build log. Durable lessons become skill files. The system gets smarter because the memory layer is explicit, not inferred.

Figure 5A

What Actually Compounds in a Multi-Model System

Human routing

The person sets the prompt before any model sees it. That preserves the independence the comparison depends on.

Separate model roles

Spec, build, and validation stay structurally distinct. No model gets to be both author and judge.

Proof loop

Route, build, challenge, resolve, and promote turn a one-off answer into a checkable workflow.

Local records

Skills, contracts, resolve logs, and build journals survive the session. That is the real compounding asset.

Portable moat: skills · output contracts · resolve logs · build journals

The models do the work. The files make the system compound.

06

What I Actually Use

Figure 5

Routing: Which Model for Which Task?

Task shape                 Builder  Challenger                Why
Analysis / deck / numbers  Claude   /challenge + ask_chatgpt  Heavy challenge needed
Code from spec             Codex    /challenge                Role separation
Pipeline / data system     Codex    3-model AC                Highest rigor
Quick bug fix              Claude   /codex:review             Light, fast
Classification / scoring   Both     compare_approaches        Test dependence

The routing question isn’t “which model is better.” It’s “which model is already loaded for this task, and does this task need exploration or execution?” Exploration = Codex’s natural tendency. Execution = Claude with skills loaded.

The routing isn't about which model is better. Claude is better at frontend because I've loaded 12 frontend skills, a defect log, spacing tokens, and years of corrections into its context. Codex is better at concept development because its workflow defaults to 80% planning. Strip the configuration from both and the raw model gap is much smaller than it feels.

The real question is: does this task need exploration or execution? Exploration (concept development, requirements gathering, "what should this be?") benefits from Codex's natural tendency to propose options before committing. Execution (build to spec, apply known patterns, follow rules) benefits from Claude's strength at following layered instructions without skipping steps.
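The routing table in Figure 5 reduces to a small decision function. This is my reading of the table as code, not a published spec; the task-shape labels and the exploration heuristic are illustrative.

```python
# Routing as a lookup plus an exploration-vs-execution check.
# Labels mirror the routing table; the mapping itself is a sketch.

def route(task_shape: str) -> dict:
    table = {
        "analysis":       {"builder": "claude", "challenge": "heavy"},
        "code-from-spec": {"builder": "codex",  "challenge": "standard"},
        "pipeline":       {"builder": "codex",  "challenge": "3-model-ac"},
        "bug-fix":        {"builder": "claude", "challenge": "light"},
        "classification": {"builder": "both",   "challenge": "compare"},
    }
    return table[task_shape]

def needs_exploration(task: str) -> bool:
    """Exploration suits Codex's tendency; execution suits Claude with skills loaded."""
    return any(k in task.lower() for k in ("what should", "concept", "requirements"))
```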

Both can get better at the other's strength. The skills are files. The prompts are portable. The model is the least important variable in the system.

07

The Bottom Line

Multi-model isn't about having more models. It's about making sure no model is both the builder and the judge. Most setups in the March 2026 wave treat cross-model as a feature: call another model from your primary tool. The better pattern treats it as a constraint that shapes the entire workflow.

The competitive advantage isn't which model you use. It's the skills library you've built and the loop you run them through. The models are interchangeable. The architecture isn't.

The person doing the routing is the architect. The models are the builders. Most multi-model setups skip the architect and let one model orchestrate the others, which defeats the purpose of using multiple models. If you want genuine independence, the human decides the prompt before either model sees it.

That is the architecture. Everything else is just tooling.

April 2026 update

This argument is now being turned into Jenn OS v1. The practical translation is not a bigger dashboard and not another chat wrapper. It is a local operating layer with one narrow loop: open work, prepare the right context, build with visible contracts, and close out into durable memory that both agents can reuse.