Workflow essay · Technology & Intelligence
Updated March 2026 · Comparative analysis + architecture design

The Models Are Interchangeable. The Architecture Isn't.

A visible wave of multi-model tools landed across March 2026. Most of them treat cross-model as a feature. The better pattern treats it as a constraint: no model is both the builder and the judge.

Seven setups compared across five dimensions. Three models with zero role overlap. One architecture where the intelligence lives in the loop and the skills, not in the model.

01

The Landscape

Across March 2026, OpenAI shipped an official plugin for using Codex from inside Claude Code, Every Inc open-sourced a 12-subagent review system, Y Combinator's CEO published his role-based specialist stack, and two independent projects shipped adversarial debate loops between Claude and Codex.

The question isn't whether to use multiple models. That's settled. The question is whether you're using them as a feature (call another model from your primary tool) or as a constraint: no model touches two roles in the same deliverable.

That distinction is the difference between tooling and architecture.

02

Seven Setups, Five Dimensions

I compared seven multi-model setups on five dimensions that actually matter for output quality: role separation (is the builder different from the reviewer?), learning system (do lessons persist between sessions?), adversarial depth (does the review check provenance and arithmetic, or just code patterns?), independence (genuinely different models, or same model with different prompts?), and portability (does the system work if you swap out a model?).

Figure 1

Seven Multi-Model Setups, Five Dimensions

Setup               Role Sep.  Learning  Adv. Depth  Independence  Portability
codex-plugin-cc     2/5        0/5       2/5         3/5           1/5
Compound Eng.       1/5        4/5       3/5         1/5           3/5
gstack              2/5        1/5       2/5         2/5           2/5
adversarial-spec    3/5        0/5       4/5         4/5           1/5
adversarial-review  3/5        0/5       4/5         4/5           1/5
Everything-CC       1/5        3/5       2/5         1/5           2/5
Jenn's setup        5/5        5/5       5/5         4/5           4/5

Scored on five dimensions: role separation (builder ≠ challenger), learning system (do lessons persist?), adversarial depth (provenance, arithmetic, or just code review?), independence (genuinely different models vs same model with different prompts), portability (works across tools or locked to one). Jenn’s setup scores highest because the architecture was designed around these constraints, not retrofitted.

codex-plugin-cc (OpenAI, Dominik Kundel): three slash commands, review, adversarial review, and rescue (task delegation). Clean plumbing, no learning system. The strategic read: Claude Code holds a commanding developer lead, and OpenAI is walking Codex through the competitor's front door rather than waiting for developers to switch.

Compound Engineering (Every Inc, Dan Shipper): a Plan/Work/Review/Compound loop with 12 subagents reviewing code from different perspectives: security, performance, maintainability, architecture, adversarial failure scenarios. The Compound step captures learnings for future sessions. Strongest learning system in the comparison. Limitation: all 12 subagents are Claude with different prompts, which is parallel review, not independence.

gstack (Garry Tan): a role-based operating stack that turns Claude Code into a virtual team of specialists: CEO review, design review, QA, security, release, and more. It is strong on operational breadth and portability because the same stack can install across multiple agent hosts. Its limitation is that specialization here is mostly a prompting and workflow pattern, not a structurally separate builder/judge system.

Everything-Claude-Code (Affaan Mustafa): a broader agent harness organized around skills, memory, security, and research-first development across Claude Code, Codex, Cursor, and beyond. It gets credit for reusable method and for treating the operating layer as a system. But it is still more of a unified harness than an independence architecture, which is why it scores higher on learning than on role separation.

adversarial-spec (zscole) and adversarial-review (alecnielsen): two independent projects that run Claude and Codex in adversarial debate loops. adversarial-spec has an anti-rubber-stamp mechanism: if a model agrees too quickly, it must explain why. Both get the independence right but have no learning layer.

03

Why Role Separation Matters

A single model producing and validating its own output is a writer proofreading their own essay. They read what they meant to write, not what they actually wrote. Two models with different training data create genuine tension: when they converge, confidence goes up; when they diverge, you've found the interesting part.

But two models isn't enough. If Claude dispatches work to Codex, Claude frames the prompt. That framing biases the output. The independence is compromised at the instruction boundary.

Three models, three roles, zero overlap: ChatGPT writes the acceptance criteria (it defines "correct"). Codex builds the implementation (it doesn't see Claude's context). Claude validates against the criteria, not against the original requirements. The model that specs is not the model that builds, and neither is the model that validates.
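The three-role split can be sketched as a pipeline where each role is a separate function and no role sees another role's context. This is a minimal illustration with stub functions standing in for real API calls; the function names and string formats are my own, not part of the actual setup.

```python
# Three roles, zero overlap. Each function stands in for a call to a
# different model; none of them shares context with the others.

def chatgpt_spec(requirements: str) -> list[str]:
    """Spec writer: turns raw requirements into acceptance criteria."""
    return [f"AC{i}: {line.strip()}"
            for i, line in enumerate(requirements.splitlines(), 1)]

def codex_build(criteria: list[str]) -> str:
    """Builder: sees only the criteria, never the original conversation."""
    return "implementation covering " + ", ".join(c.split(":")[0] for c in criteria)

def claude_validate(artifact: str, criteria: list[str]) -> dict:
    """Validator: checks the artifact against the criteria, not the requirements."""
    return {c: c.split(":")[0] in artifact for c in criteria}

requirements = "parse the log file\nemit daily totals"
criteria = chatgpt_spec(requirements)         # role 1: defines "correct"
artifact = codex_build(criteria)              # role 2: builds, blind to role 1's chat
report = claude_validate(artifact, criteria)  # role 3: judges against criteria only
```

The point of the shape is that swapping any single model out leaves the other two untouched: each role's only interface is the criteria list or the artifact.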

Figure 2

From Single-Model to Three-Model Role Separation

Most people: Claude builds → Claude reviews → ships (same model, same blind spots)
codex-plugin-cc: Claude builds → Codex reviews → ships (two models, but one always leads)
Jenn's operating loop: ChatGPT specs → Codex builds → Claude validates → skills (three models, zero overlap, structural separation)

The single-model flow is a writer proofreading their own essay. The two-model flow adds a second reader but the writer still sets the agenda. The three-model flow ensures the spec-writer, builder, and validator never share a context window. Independence is structural, not aspirational.

04

The Skills Are the Moat

Here is the part nobody wants to hear: the model is a replaceable component. What isn't replaceable is the 30+ skill files that encode judgment from real project failures, the provenance-aware adversarial review that checks whether a cited paper actually says what you claim, and the resolve step that forces you to record what happened to each finding instead of silently discarding it.

Those skills are files. They're portable. Any model can read them. Claude reads them well because its instruction-following is strong. Codex reads them well because it can load them from the shared directory. A future model from Google or Meta could read them too.
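Because the skills are plain files, loading them is model-agnostic by construction. Here is a minimal sketch of what that looks like; the directory layout, file extension, and prompt framing are assumptions for illustration, not the essay's actual conventions.

```python
# Model-agnostic skills loader: the same preamble can be handed to Claude,
# Codex, or any future model, because it is just concatenated text files.
from pathlib import Path

def load_skills(skill_dir: str) -> str:
    """Concatenate every skill file in the directory into one context block."""
    parts = []
    for path in sorted(Path(skill_dir).glob("*.md")):
        parts.append(f"## {path.stem}\n{path.read_text()}")
    return "\n\n".join(parts)

def build_prompt(task: str, skill_dir: str) -> str:
    """Prepend the shared skills preamble to the task, whatever the model."""
    return f"{load_skills(skill_dir)}\n\n# Task\n{task}"
```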

Figure 3

What Transfers Between Models vs. What Doesn’t

PORTABLE
Skills library (30+ files)          95%
Output contract format              100%
Resolve log (verdicts)              100%
Build journal protocol              100%
Adversarial prompts                 85%
Routing decision tree               90%
Devil-advocate personas             70%

MODEL-LOCKED
Model tendency (explore/execute)    0%
Native tool access (MCP)            15%

Seven of nine components are 70%+ portable across models. The two non-portable items — model tendency and native tool access — are exactly what you’d expect: weights-level behavior and platform-specific APIs. Everything else is files, prompts, and protocols that any model can read.

The only things that don't transfer are model-level tendencies (Claude leans toward thoroughness, Codex leans toward exploration) and platform-specific tool access (MCP servers, helper scripts). Everything else (the output contracts, the resolve logs, the routing rules, the build journals) is infrastructure that works regardless of which model you're running.

This is why the copy-paste workflow matters. When I paste the same prompt into both Claude and Codex independently, I'm preserving the independence that makes cross-model comparison meaningful. If Claude dispatches the prompt to Codex, Claude's framing contaminates the output. The human is the routing layer: the person who decides the prompt before either model sees it.
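The copy-paste discipline can be stated as code: the human fixes the prompt once, and each model receives identical text with no model in the dispatch path. `ask_claude` and `ask_codex` are hypothetical stand-ins for the real calls.

```python
# Human-as-router: both models see the same human-authored prompt, so
# neither model's framing can contaminate the other's output.

def ask_claude(prompt: str) -> str:
    return f"claude-answer({len(prompt)} chars)"   # stub for a real API call

def ask_codex(prompt: str) -> str:
    return f"codex-answer({len(prompt)} chars)"    # stub for a real API call

def independent_run(prompt: str) -> dict:
    """Dispatch the identical prompt to each model; no model frames the other."""
    return {"claude": ask_claude(prompt), "codex": ask_codex(prompt)}

results = independent_run("Estimate churn for the Q3 cohort.")
# Convergence raises confidence; divergence marks the interesting part.
converged = results["claude"] == results["codex"]
```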

05

The Operating Loop

The architecture is five steps: Route, Build, Challenge, Resolve, Promote. Every non-trivial deliverable runs through all five. Trivial work skips to Build and ships.
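The branching described above can be reduced to a control-flow skeleton. The step bodies here are stubs and the triviality test is a placeholder; the point is only the shape: trivial work routes straight to Build and ships, everything else runs the full cycle.

```python
# Five-step loop skeleton: Route, Build, Challenge, Resolve, Promote.
# Function names and the triviality heuristic are illustrative only.

def is_trivial(task: str) -> bool:
    """Stand-in for the Route decision: typo-level work needs no challenge."""
    return "typo" in task.lower()

def run_loop(task: str) -> list[str]:
    trace = ["route", "build"]
    if is_trivial(task):
        trace.append("ship")                 # skips adversarial review
        return trace
    trace += ["challenge", "resolve", "promote"]
    return trace
```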

Figure 4

The Operating Loop: Builder → Challenger → Memory

01 ROUTE: owner + risk
02 BUILD: agent + contract (/codex:rescue)
03 CHALLENGE: adversarial / parallel (/challenge)
04 RESOLVE: accept / reject
05 PROMOTE: log + skills (skills/)
Builder → challenger → memory: every non-trivial deliverable runs the full loop.

Every non-trivial deliverable runs through all five steps. The RESOLVE step — where findings get accepted, rejected, or flagged as unresolved — is the one nobody else builds. It’s also the one that makes the system honest.

Route decides who builds, what kind of challenge the output needs, and whether this task needs challenge at all. Not everything does: a typo fix ships without adversarial review. But anything with numbers, claims, or methodology choices routes to challenge.

Build produces the deliverable with an output contract: what was produced, what the challenger should verify, what assumptions were made, what's not in scope. The contract is the spec the challenger reviews against; without it, the review is unfocused.
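One way to make the output contract concrete is a small record the builder emits alongside the deliverable, which the challenger then reviews against. The field names below are illustrative, not the contract format the essay's setup actually uses.

```python
# A minimal output contract: the four sections named in the essay
# (produced, verify, assumptions, out of scope) as a typed record.
from dataclasses import dataclass, field

@dataclass
class OutputContract:
    produced: str                                      # what was produced
    verify: list[str]                                  # what the challenger checks
    assumptions: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

contract = OutputContract(
    produced="churn model v2 + scoring script",
    verify=["arithmetic in the cohort table", "provenance of the 2024 baseline"],
    assumptions=["Q3 data is complete"],
    out_of_scope=["dashboard refresh"],
)
```

Making the contract a structured value rather than free text is what gives the challenge step a fixed target: the review is scoped to `verify`, and anything in `out_of_scope` is not a finding.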

Challenge uses the tool that matches the routing decision. /challenge for provenance and arithmetic. ask_chatgpt for an independent read. compare_approaches for parallel generation. /codex:adversarial-review for lightweight code review.

Resolve is the step nobody else builds. Every finding gets a verdict: accepted (fix it), rejected (explain why it's wrong), or unresolved (the models disagree and you can't tell who's right). Unresolved findings don't disappear; they get flagged in the build log for the next session.
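A resolve log can be sketched as an append-only list where every finding must carry one of the three verdicts, and unresolved entries are explicitly carried forward rather than dropped. The record shape is an assumption for illustration.

```python
# Resolve log: every challenger finding gets an explicit verdict, and
# unresolved entries survive to the next session instead of vanishing.

VERDICTS = {"accepted", "rejected", "unresolved"}

def resolve(finding: str, verdict: str, note: str, log: list[dict]) -> None:
    """Record a verdict; anything outside the three allowed values is an error."""
    assert verdict in VERDICTS, f"unknown verdict: {verdict}"
    log.append({"finding": finding, "verdict": verdict, "note": note})

def carry_forward(log: list[dict]) -> list[dict]:
    """Unresolved findings get flagged for the next session's build log."""
    return [e for e in log if e["verdict"] == "unresolved"]

log: list[dict] = []
resolve("cited paper does not state the 40% figure", "accepted", "fixed citation", log)
resolve("model disagreement on discount rate", "unresolved", "flag for next session", log)
```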

Promote captures what survived. Project changes go to the build log. Durable lessons become skill files. The system gets smarter because the memory layer is explicit, not inferred.

Figure 5A

What Actually Compounds in a Multi-Model System

Human routing

The person sets the prompt before any model sees it. That preserves the independence the comparison depends on.

Separate model roles

Spec, build, and validation stay structurally distinct. No model gets to be both author and judge.

Proof loop

Route, build, challenge, resolve, and promote turn a one-off answer into a checkable workflow.

Local records

Skills, contracts, resolve logs, and build journals survive the session. That is the real compounding asset.

Portable moat: skills · output contracts · resolve logs · build journals

The models do the work. The files make the system compound.

06

What I Actually Use

Figure 5

Routing: Which Model for Which Task?

Task shape                 Builder  Challenger                Why
Analysis / deck / numbers  Claude   /challenge + ask_chatgpt  Heavy challenge needed
Code from spec             Codex    /challenge                Role separation
Pipeline / data system     Codex    3-model AC                Highest rigor
Quick bug fix              Claude   /codex:review             Light, fast
Classification / scoring   Both     compare_approaches        Test dependence

The routing question isn’t “which model is better.” It’s “which model is already loaded for this task, and does this task need exploration or execution?” Exploration = Codex’s natural tendency. Execution = Claude with skills loaded.

The routing isn't about which model is better. Claude is better at frontend because I've loaded 12 frontend skills, a defect log, spacing tokens, and years of corrections into its context. Codex is better at concept development because its workflow defaults to 80% planning. Strip the configuration from both and the raw model gap is much smaller than it feels.

The real question is: does this task need exploration or execution? Exploration (concept development, requirements gathering, "what should this be?") benefits from Codex's natural tendency to propose options before committing. Execution (build to spec, apply known patterns, follow rules) benefits from Claude's strength at following layered instructions without skipping steps.
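The routing table in Figure 5 reduces to a small decision function. This is my reading of the table as code, not a published spec; the task-shape labels and the exploration heuristic are illustrative.

```python
# Routing as a lookup plus an exploration-vs-execution check.
# Labels mirror the routing table; the mapping itself is a sketch.

def route(task_shape: str) -> dict:
    table = {
        "analysis":       {"builder": "claude", "challenge": "heavy"},
        "code-from-spec": {"builder": "codex",  "challenge": "standard"},
        "pipeline":       {"builder": "codex",  "challenge": "3-model-ac"},
        "bug-fix":        {"builder": "claude", "challenge": "light"},
        "classification": {"builder": "both",   "challenge": "compare"},
    }
    return table[task_shape]

def needs_exploration(task: str) -> bool:
    """Exploration suits Codex's tendency; execution suits Claude with skills loaded."""
    return any(k in task.lower() for k in ("what should", "concept", "requirements"))
```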

Both can get better at the other's strength. The skills are files. The prompts are portable. The model is the least important variable in the system.

07

The Bottom Line

Multi-model isn't about having more models. It's about making sure no model is both the builder and the judge. Most setups in the March 2026 wave treat cross-model as a feature: call another model from your primary tool. The better pattern treats it as a constraint that shapes the entire workflow.

The competitive advantage isn't which model you use. It's the skills library you've built and the loop you run them through. The models are interchangeable. The architecture isn't.

The person doing the routing is the architect. The models are the builders. Most multi-model setups skip the architect and let one model orchestrate the others, which defeats the purpose of using multiple models. If you want genuine independence, the human decides the prompt before either model sees it.

That is the architecture. Everything else is just tooling.

April 2026 update

This argument is now being turned into Jenn OS v1. The practical translation is not a bigger dashboard and not another chat wrapper. It is a local operating layer with one narrow loop: open work, prepare the right context, build with visible contracts, and close out into durable memory that both agents can reuse.