The Harness and the Appliance: Claude Code vs Codex

2026-05-26 - 20 min read

Daniel Young

Founder, DRYCodeWorks

Every AI dev tool ships with the same demo. The hard question is what compounds. Why we picked Claude Code over Codex — and the surface that earned it.

An accreted wooden workbench with notebooks and instruments beside a clean counter with a single sealed white device

Claude Code is Anthropic's command-line AI coding agent. It runs in your terminal, reads and edits files in your project, and exposes a programmable surface — skills (committed prompts), plugins (shareable bundles), MCP servers (external integrations), hooks (event-driven scripts), and per-directory settings. Everything that customizes Claude Code lives in committable files. See the official docs.

Codex is OpenAI's AI coding tool. See openai.com/codex for current details. This post compares our actual usage of both — interface differences we observed in our workflow, not capability claims about either product.

The fleet morning

Seven-thirty. Coffee cooling beside the keyboard. Three Ghostty terminal windows tiled on one screen: two client monorepos and an orchestrator session. Each window dropped into its own directory and, because of how .claude/settings.local.json scopes plugins per directory, each window loaded a different set of MCP servers and skills automatically. No manual configuration step. No toggling. The directory was the scope.

From the orchestrator, we typed /dispatch. One invocation. The dispatch skill gathered the relevant context — current task, client preamble, the state.md entry from last session — assembled a self-contained prompt for each thread, and launched each in its own Ghostty window. The orchestrator stayed up front. The workers each had a clean slate: a fresh session carrying exactly what it needed to do its job, nothing else from the orchestrator's growing context bleeding in.

In the background, yesterday's PRs were being landed. Auto-merge was enabled. A pr-comment-resolver skill was reading review comments and resolving them autonomously. Downstream workflows were being watched. All of this on the laptop, in parallel sessions opened the day before. Not a scheduled job. Not CI doing the heavy lifting. Just agents running in their own windows, checked on when needed.

None of that is impressive in the abstract. Parallel terminals are not a new idea. What makes it work is that none of it is bespoke. Every skill, plugin, and agent prompt lives in a git repo. The team pulls a tag and inherits the entire setup. That is the thing we want to talk about.

Required to reproduce this scene: macOS, Ghostty (or any terminal that accepts AppleScript window-launch), Claude Code ≥ v2.1.150 (pinned at publish time), a marketplace plugin repo, and .claude/settings.local.json scoping enabled.

A wooden desk in morning light: a laptop with three tiled terminal windows, a coffee mug, and a small notebook

What you just saw, and why it works

The scene above runs on a small set of Claude Code primitives. Skills are committed prompts — plain text files, versioned, reviewable, mutable via pull request. Subagents are the unit of delegated work: an independent session carrying a self-contained prompt, producing its artifact, and closing. Hooks fire on session events, injecting context or triggering cleanup. MCP servers bridge from the session outward to external systems. Slash commands are the user-facing interface to all of it. Each one of these is a file. Each file is reviewable.

A skill on your laptop helps you. That's useful, but it's the smallest possible return on the investment. A skill in a shared marketplace repo helps the team — six months from now, after the original author has moved on to the next project, when the session that produced the skill has long since closed. The unit of shareable setup is the plugin, not the conversation.

Knowledge compounds the same way. Per-project memory files, KB markdown entries, hooks that inject procedures and constraints on session start — the next session inherits what the last session learned. Nothing important relies on the same conversation staying open. The continuity lives in the repository.

What follows is the three-pillar version of how this works in practice. Pillar one is the plugin marketplace pattern — how we structure shareable skills and per-directory scoping. Pillar two is dispatch and the landing agent — the orchestration thesis in concrete form. Pillar three is knowledge that accretes — how corrections and procedures accumulate into something the next session can use. After those three pillars, a Codex coda: observed interface differences, stated plainly, with no capability claims. Then the close: why all of this only compounds if you commit, and what commitment produces on a programmable surface versus an appliance.

Pillar 1: the plugin marketplace pattern

dry-claude-plugin is one git repo containing N plugins. The team installs the marketplace once. Per-plugin enablement is local to each directory via .claude/settings.local.json. There is no central configuration that touches every session at once — each directory opts into exactly the plugins it needs.

The workspace root might look like this:

~/work/.claude/settings.local.json (kitchen-sink workspace root)json

{
"plugins": ["drycodeworks", "dry-internal", "dry-harness"]
}

{
"plugins": ["drycodeworks", "dry-internal", "dry-harness"]
}

A per-client subdirectory scopes down further:

~/work/example-client/.claude/settings.local.json (per-client subdirectory)json

{
"plugins": ["drycodeworks", "example-client"]
}

{
"plugins": ["drycodeworks", "example-client"]
}

Where you see example-client in a JSON or shell snippet, replace it with your own per-client plugin name.

The version tag convention is <plugin>/0.2.2. Cache lives at ~/.claude/plugins/cache/dry-claude-plugin/<plugin>/<version>/. Bumping a tag rolls everyone forward in one git pull and checkout — deterministic, no per-machine negotiation. One commit in the marketplace repo; every session that pulls the tag inherits the change.

What this buys you works at three levels. First, the team-coordination level: a new skill goes from "a useful prompt in a scratchpad" to "everyone has it on Monday" in one commit. There's no second distribution step. Second, the noise level: per-client plugins keep the MCP servers and skills for one client out of every unrelated session. A client with a bespoke data pipeline skill should not be visible to sessions working on something else entirely. Third, the process level: encoded conventions — procedures, preflight checks, agent prompts — propagate as code review, not as Slack messages. Someone asks "why does the dispatch skill do this?" and the answer is in the commit history.

Per-client plugins are DRYCodeWorks's delivery toolkit, not client-owned deliverables — same way an editor config or shell alias isn't a client deliverable.

There are real costs here, and we want to be honest about them.

Setup cost. You're maintaining a small private package registry. The first week is unglamorous: marketplace manifest, plugin scaffolding, the cache-refresh gotcha (/plugin update while on a WIP branch caches the wrong contents). None of it is hard, but all of it is plumbing you own.
Team-onboarding cost. Each new teammate needs the marketplace installed, plugins enabled per-directory, and an understanding of why their setup is structured this way. Plain Claude Code is faster to onboard; the marketplace pays off once you've crossed from "a handful of skills" to "skills we rely on daily." We have not measured this rigorously — it's an observation, not a metric.
Scoping footgun. .claude/settings.local.json scoping is silent when wrong. A typo in a plugin name disables the plugin without warning. Engineers who don't internalize the scoping model will think "the skill is missing" when in fact the scope is off.

In our usage of Codex, we did not find an interface for per-directory plugin scoping or shareable plugin packaging — extensions are configured per-session in the UI rather than committed to a repo. That's an interface observation, not a claim about Codex's roadmap.

For the practitioner's view of how individual skills get encoded — the building-block layer beneath this marketplace pattern — see Building Custom Claude Code Skills for Infrastructure Automation.

Pillar 2: dispatch and the landing agent

Dispatch is the first anchor skill in this pattern. Here is its frontmatter as committed in the marketplace:

dispatch SKILL.md (frontmatter)yaml

---
name: dispatch
description: >
Orchestrator prompt generator and session launcher. Use when dispatching work
to independent Claude sessions from an orchestrator session. Gathers task context,
assembles self-contained prompts with client preambles and work approach instructions,
and launches in new Ghostty terminal windows. Triggers: dispatch, orchestrate,
launch session, generate prompt, hand off, delegate, spawn session, send to another
session, parallel sessions.
---

---
name: dispatch
description: >
Orchestrator prompt generator and session launcher. Use when dispatching work
to independent Claude sessions from an orchestrator session. Gathers task context,
assembles self-contained prompts with client preambles and work approach instructions,
and launches in new Ghostty terminal windows. Triggers: dispatch, orchestrate,
launch session, generate prompt, hand off, delegate, spawn session, send to another
session, parallel sessions.
---

Dispatch lives in the orchestrator session. When you invoke /dispatch, it gathers the current context — client preamble, work approach instructions, the active task, any relevant state — and assembles a self-contained prompt. That prompt is then launched in a new Ghostty terminal. The new session arrives with everything it needs to do its job and nothing it doesn't. The orchestrator's context stays clean. No information from parallel threads bleeds back in unless you explicitly request a summary.

The assembled prompt is a file-like artifact. Here's a sanitized example of what dispatch generates:

text

Client preamble: [generic project-context paragraph — no client name]
Task: refactor the metrics-aggregation pipeline to use the new batched writer
State: see state.md (current branch, last deploy, open blockers)
Work approach:
1. Read state.md and procedures.md before starting
2. Branch via gt; one focused change per commit
3. Run typecheck + tests before pushing
4. Submit via gt submit when complete
Completion criteria: all tests pass; PR opened; deploy watched

Client preamble: [generic project-context paragraph — no client name]
Task: refactor the metrics-aggregation pipeline to use the new batched writer
State: see state.md (current branch, last deploy, open blockers)
Work approach:
1. Read state.md and procedures.md before starting
2. Branch via gt; one focused change per commit
3. Run typecheck + tests before pushing
4. Submit via gt submit when complete
Completion criteria: all tests pass; PR opened; deploy watched

The leverage is that prompts are code. The encoding happens once. After that, every dispatch invocation for a similar task inherits the same approach without reconstruction from memory. The prompt is reviewed in a PR before it reaches production. It can be diffed, reverted, and improved exactly like the application code it coordinates.

The second anchor skill is the landing agent. Here is a sanitized interface fragment:

landing-agent.yaml (sanitized interface fragment)yaml

---
name: landing-agent
description: |
Drive a PR from review-ready to landed-in-main with the
downstream workflow reported green. Generic decision flow;
the per-client agents specialize the details.
---

decision_flow:
- enable auto-merge
- dispatch comment-resolver
- poll for formal review approval
- watch the downstream workflow

model_tier_split:
polling_and_orchestration: lighter model
code_edits_via_comment_resolver: heavier model

refusal_conditions:
- no merge if commits were dropped
- no rollback without explicit authorization
- no privileged-environment changes without explicit authorization

---
name: landing-agent
description: |
Drive a PR from review-ready to landed-in-main with the
downstream workflow reported green. Generic decision flow;
the per-client agents specialize the details.
---

decision_flow:
- enable auto-merge
- dispatch comment-resolver
- poll for formal review approval
- watch the downstream workflow

model_tier_split:
polling_and_orchestration: lighter model
code_edits_via_comment_resolver: heavier model

refusal_conditions:
- no merge if commits were dropped
- no rollback without explicit authorization
- no privileged-environment changes without explicit authorization

The pattern the landing agent implements is worth naming explicitly.

Subagent isolation is the structural choice. The landing agent runs in its own session, which means the orchestrator's context never accumulates the noise of watching a CI queue. When the landing agent finishes, it surfaces a result — green or blocked — and closes. That isolation is what lets the orchestrator stay useful for the next decision rather than getting weighted down with polling logs.
Model-tier orchestration is the efficiency choice. The landing agent splits its own workload: a lighter model handles mechanical polling and state tracking, while a heavier model is dispatched for code edits when the comment-resolver surfaces a change request. The heavier model is invoked only when the task actually demands it.
Guardrails are first-class. The refusal conditions above are not commentary — they're part of the skill definition. The agent will not merge if commits were dropped, will not roll back without authorization, will not touch privileged environments without explicit instruction. Encoding these at the skill level means they can't be overridden by a careless dispatch prompt.

Together, dispatch and the landing agent represent the orchestration thesis in its most concrete form. One skill launches sessions; the other drives them to completion. Dispatch is the upstream verb; the landing agent is the downstream verb. Between them, the orchestrator session keeps its context clean — it doesn't need to know the mechanics of watching a PR land any more than it needs to know how a test runner works. The session that dispatched the landing agent can move on to the next task without waiting.

This is why they pair. The orchestration model is not a queue — it's parallel, context-isolated, model-efficient, and guardrailed by construction. The orchestrator issues intentions; the agents handle execution. Neither has to know the other's internal state.

The first time this paid off concretely: a Monday spent planning a couple dozen parallelizable tasks, queueing them up, and shipping them all in parallel. Not in sequence — in parallel, in their own sessions, each carrying a self-contained prompt, each producing its artifact. The orchestrator session stayed in front of us, ready for the next decision. That was the moment we realized agentic coding could be a profound force multiplier — not because any individual task ran faster, but because the number of independent tasks in flight was no longer bounded by how many we could hold in working memory.

In our Codex sessions, we did not find an exposed primitive for subagent dispatch or for launching parallel sibling sessions — we approached coordination as a sequence of conversations. That's an interface difference we observed; it's not a claim about whether Codex could expose those primitives in a future release. The surface we observed made coordination a manual, sequential operation. The Claude Code harness exposes subagent invocation as a first-class primitive — both Claude-internal (Agent/Task tools) and shell-level (launching new sessions in new terminals) — and the patterns above are built on top of that.

Pillar 3: knowledge that accretes

The session that ran dispatch this morning is not the same session that will run it tomorrow. But tomorrow's session will know what today's session learned. That's not an accident — it's a design, and it depends on three layers that work together.

Skills — the first layer

A skill is a committed prompt. It gets versioned, reviewed, and mutated like any other code. When a procedure changes — because we learned the old one had an edge case — the skill file changes, goes through review, and the updated version reaches every session that pulls the tag. No one has to remember to tell the next session. The correction is in the artifact.

Memory — the second layer

Per-project files, committed to the repo, loaded by hook on session start. Two types: feedback_* files record rules derived from corrections and confirmations — things sessions should do differently than the default. reference_* files record durable facts — cluster topology, environment quirks, things that don't change often but are wrong to rediscover. Here's an anonymized example:

memory/feedback_no_unsolicited_refactor.mdmarkdown

---
name: no-unsolicited-refactor
description: When asked to fix a bug, don't refactor surrounding code.
metadata:
type: feedback
---

When asked to fix a specific bug, don't refactor adjacent code.

**Why:** Past sessions expanded a one-line fix into a 200-line
restructure; the bigger diff hid the actual fix.

**How to apply:** Stay scoped. The smallest change that
resolves the bug, with a test that asserts the bug is gone.

---
name: no-unsolicited-refactor
description: When asked to fix a bug, don't refactor surrounding code.
metadata:
type: feedback
---

When asked to fix a specific bug, don't refactor adjacent code.

**Why:** Past sessions expanded a one-line fix into a 200-line
restructure; the bigger diff hid the actual fix.

**How to apply:** Stay scoped. The smallest change that
resolves the bug, with a test that asserts the bug is gone.

KB and source-of-truth — the third layer

procedures.md, state.md, and feedback.md live in a knowledge-base directory, loaded by hook when a session starts in the relevant project directory. Current branch, last deploy, open blockers, standing procedures — all of it injected at session start, not reconstructed from conversation.

Each correction becomes a memory file. Each repeated task becomes a skill. Each cross-session insight becomes a KB entry. Six months in, a new session starts with hundreds of inherited constraints and procedures. The next conversation begins where the last one ended — not because we remembered to re-explain everything, but because the context was committed. The session reads its briefing from the repo, the same way it reads code.

The April 2026 token drought was where this paid off under pressure. When per-session usage tightened, we couldn't afford to keep dragging dead context into every conversation. So we did the work: reorganized the plugin set so each scope loaded only what it needed, started explicitly invoking subagents for narrow tasks instead of inflating the orchestrator, and right-sized prompts to the actual job. ccusage was the diagnostic — a small CLI that surfaces your token usage patterns over time so you can see which sessions and skills are actually expensive. The pressure forced a discipline we should have had anyway, and the artifacts of that discipline — the lighter plugin scopes, the subagent invocation patterns, the right-sized prompts — are still paying us back months later, long after the rate limits eased.

Our experience building with Codex showed that Codex also exposes memory; we did not find an interface for memory as a controllable, version-controlled, reviewable file artifact in a git repo. We chose the file-based path because we wanted to grep and review it. We wanted to know exactly what a session was being told before it started, and we wanted to be able to change it with a pull request.

The Codex coda

Three contrasts we noticed across the post, gathered in one place. Each one is something we observed in our usage, not a claim about Codex's roadmap.

Negative space, used differently

One tactile observation before the table: the two CLIs use vertical space very differently. Codex narrates every step in the foreground — every Explored, every Search, every approval review, every "(no output)", every "Waited for background terminal" — gets its own line, and many lines are repetitions of the same command. Here's a session of ours, with paths and identifiers redacted:

Codex CLI output showing repeated commands, approval-review lines, and 'no output' / 'waited for background terminal' lines consuming substantial vertical space. Sensitive identifiers redacted in grey blocks.

The "Automatic approval review approved" line repeats. The Auto-reviewer approved codex to run pkill -f "find ..." line repeats the exact command Codex is already running. "(no output)" gets its own line. "Waited for background terminal" gets its own line with the command spelled out again. That's a lot of real estate burned on telling the driver what the driver already knows. Claude Code makes a different choice — subagent and tool detail collapse into background panels (Ctrl+O expands them on demand) so the foreground stays the conversation that actually matters.

The speed trap

A related observation: Codex feels faster, and we think that's a red herring. The visible activity stream creates a sense of progress — tokens are visibly flying, commands are visibly running. Claude Code is more honest about when the agent is spinning its wheels on something, which can feel slower even when the work is more deliberate. After April 2026 — when a Claude demand surge tightened per-session usage and inference visibly slowed — we noticed a wave of users perceive Codex as faster on that basis alone. But:

Speed has never been more important than correctness, and the agentic era makes that ratio more lopsided, not less. The difference between 100,000 lines of code delivered in 10 minutes versus 100,000 lines delivered in eight hours is marginal when the eight-hour version is largely correct and the 10-minute version requires cycles to fix bugs, creates data incorrectness, produces toil to sus out the inconsistencies, builds the wrong thing. The shortened wall-clock is paid back in downstream debugging, rework, and trust erosion.

We'd rather wait. Claude Code at the harness layer feels more deliberate because it is more deliberate — and more surgical at the points where we actually need to intervene.

Honest caveats and what we use Claude Code for

Our comparison is biased. Codex doesn't support some of the semantics our Claude Code stack relies on, so we couldn't fully replicate the setup. We still delegate well-scoped tasks to Codex when the spec is tight and we're okay babysitting a bit. We also use Claude Code for much more than dispatching agentic work — as a thought partner on design docs and architecture, as a debugger, to build diagrams, to comb through logs, to query databases in different environments. The polish across the surface — /slash commands, the skills UI, the plugins UI, slash invocations like /reload-plugins, the way Ctrl+O reveals subagent detail without dragging it into the foreground — keeps paying us back across all of that. We don't have that level of confidence in Codex yet. Part of that is familiarity. Part of it is that Codex hasn't prioritized giving us the information we need at the points where we actually need to engage with the agent.

The three contrasts, side by side

Topic	In our usage of Codex	In our usage of Claude Code
Extensibility	Per-session configuration through the UI; we did not find a path to commit extensions and share them across teammates via pull request.	Plugin marketplace pattern — a git repo with versioned tags any teammate can `git fetch && git checkout` into their environment.
Knowledge persistence	Service-side memory; we did not find a way to open, grep, or version-control the rules a session was carrying.	File-based memory and KB markdown loaded by hook at session start; every rule is a file we can read, diff, and review in git.
Concurrency	Conversation-as-unit; running parallel work required hand-coordinating multiple conversations and re-introducing context across them.	Subagent dispatch (Agent/Task tools) and shell-level new-session launching as first-class primitives.

When we tried to share an extension across teammates in Codex, we found the path was per-session configuration through the UI rather than a committed file in a shared repo. The extension existed for one user in one session; bringing a teammate into the same workflow meant walking them through the same configuration steps manually. Onboarding the second person required a configuration walkthrough, not a pull request. Our chosen path — a marketplace repo with a scoped settings.local.json — made sharing a plugin the same as merging a PR. The plugin exists in git history; any question about why it works the way it does has a traceable answer. Both approaches are valid; our team's workflow made the file-based approach the right call for where we are.

Our path through Codex on memory was useful, but service-side — memory that persisted across sessions in a way we couldn't directly inspect or modify with standard tooling. We didn't find a way to open the memory, read its current contents, grep it for a specific rule, or version it alongside the code it governed. For our use case, that mattered: we want to see exactly what constraints a session carries before it starts work, and we want to change those constraints through review, not through prompting. The difference is not capability — it's visibility and controllability.

We did not find a way to dispatch parallel Codex sessions without handling the coordination by hand — opening multiple conversations, copying relevant context into each, and manually tracking which thread was responsible for which work. That's a workable pattern for two threads; at four or five it becomes a coordination job in its own right. Claude Code's harness exposes a primitive (subagent dispatch and shell-level new-session launching) that takes the coordination pattern off our plate. The orchestrator issues an intent; the harness handles the launch.

If you want a paved road, the appliance wins. If you want to compound your tooling, the programmable surface is the move. Both are defensible. We made our call.

What compounds

In a year, every skill we write, every memory we save, and every hook we add still works. The surface accretes. Skills we wrote in the first month are still running. Memory entries from corrections in early sessions shape how later sessions behave. Hooks that inject procedures on startup have saved us from repeating the same orientation steps dozens of times.

But the accretion is not free, and it is not automatic. It depends on a choice that has nothing to do with the tool itself: the choice to commit. Not in the git sense only, but in the investment sense — the decision to encode a procedure instead of just executing it once, to write a memory file instead of trusting the next session to figure it out again, to build the plugin instead of running the ad-hoc prompt. That choice is harder and slower than the alternative. Every time you make it, you're trading speed now for leverage later. The payoff is downstream, and it only shows up if you keep making the choice.

Why it only compounds on a programmable surface

The compounding argument is not unique to Claude Code. Commitment compounds on any sufficiently rich tool. Vim. Emacs. A specific cloud platform. A deeply-learned ORM. Users who invest in these tools accumulate leverage the casual user doesn't have. That is real, and it's not our claim.

Our claim is sharper: commitment compounds more on a programmable, version-controlled surface — because the artifacts your commitment produces are reusable. A skill is a committable file. A memory entry is grep-able. A plugin is shareable. The compounding doesn't live only in your head.

On an appliance, commitment compounds inside the user. The super-user is real — someone who knows every product shortcut, every prompt pattern that gets the best results, every edge case to avoid. That expertise is genuinely valuable. But it doesn't transfer. Not to a teammate starting their first session. Not to next quarter's project. Not to a fresh conversation. The leverage accumulated by the expert exists only in the expert.

The elite-engineer pattern is a deep, stable toolkit that produces durable artifacts. The practices are committed; the environment is versioned; the procedures are encoded. The knowledge doesn't disappear when the engineer goes on vacation, or when the project rotates, or when a new hire joins and needs to get up to speed. The harness is worth choosing because the things you commit to it outlast the session that produced them. The expertise becomes infrastructure, not institutional memory.

If you want to build this

If you want to actually build this — a private plugin marketplace your team can pull a tag from — we wrote a step-by-step guide. It walks you through the marketplace repo skeleton, per-directory scoping with .claude/settings.local.json, tag-based rollout, and the cache-refresh footgun we already hit so you don't have to. The working setup is small enough to commit to in one sitting; the polish takes longer.

Read it: Running a private Claude Code plugin marketplace.