Claude Code vs Codex CLI: when to use which

Two CLIs, one pipeline, zero loyalty

I don't have a favorite agent. I have tasks that need to get done. Over the past four months I've been running Claude Code and OpenAI's Codex CLI through the exact same Zowl pipelines on the same repos, tracking success rates, token usage, and how much cleanup I need to do in the morning. This is what I found.

Not what the marketing pages say. Not what Twitter influencers claim after one demo. What actually happens when you point these things at real work and go to sleep.

The setup

Both agents ran through Zowl on my M2 MacBook Pro. Same PRDs, same repos, same acceptance criteria. I logged every run for about six weeks across three projects: a Next.js SaaS app, a CLI tool in Rust, and an Express API with a Postgres backend.

For Claude Code, the invocation looks like this:

claude -p "$(cat task.md)" --output-format stream-json

For Codex CLI:

codex --quiet --full-auto "$(cat task.md)"

One immediate gotcha: argument passing works differently between them. Claude Code takes the prompt via -p and gives you structured output with --output-format. Codex uses positional arguments and the --quiet flag to suppress interactive prompts. If you're scripting these into a pipeline, you'll hit this friction from day one.
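
One way to keep that friction out of the rest of a pipeline is a thin wrapper that knows both calling conventions. This is a minimal sketch: run_agent and DRY_RUN are hypothetical names, and the flags are just the ones shown above, so check them against whatever versions you have installed.

```shell
# run_agent AGENT TASK_FILE
# Hypothetical wrapper that hides the per-agent argument differences.
# Set DRY_RUN=1 to print the command instead of running it.
run_agent() {
  local agent="$1" task_file="$2" prompt
  prompt="$(cat "$task_file")" || return 1
  case "$agent" in
    # Claude Code: prompt goes to -p, output format is a separate flag.
    claude) ${DRY_RUN:+echo} claude -p "$prompt" --output-format stream-json ;;
    # Codex: prompt is positional, flags come first.
    codex)  ${DRY_RUN:+echo} codex --quiet --full-auto "$prompt" ;;
    *)      echo "unknown agent: $agent" >&2; return 2 ;;
  esac
}
```

With this in place, a pipeline step calls run_agent claude task.md or run_agent codex task.md and never touches the flags directly.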

Understanding existing codebases

Claude Code wins here. It's not even close.

When I handed both agents a task like "add rate limiting to the /api/upload endpoint following the existing pattern in /api/auth," Claude Code actually went and read the auth middleware, understood the pattern, and replicated it. It picked up on the project's error response format, the logging style, even the way we structured middleware chains.

Codex would sometimes do this. But more often it'd generate a perfectly reasonable rate limiter that looked nothing like the rest of the codebase. Technically correct, stylistically alien. You end up with two different patterns living side by side, which is worse than either one.

On the Next.js project specifically, Claude Code understood App Router conventions better. It knew where to put server actions, how to handle revalidatePath, and when to use "use client" vs keeping things on the server. Codex would occasionally spit out Pages Router patterns mixed in, which is a painful thing to debug at 7am.

Greenfield speed

Codex is faster on fresh projects. I'll give it that.

When the task was "create a new CLI tool that does X" or "scaffold a REST API with these endpoints," Codex would finish in 60-70% of the time Claude Code took. It's more aggressive about generating code, less careful about reading existing context (because there isn't much to read).

For one task, building a simple Rust CLI that converts CSV to JSON with some filtering options, Codex nailed it in about 4 minutes. Claude Code took closer to 7. Both outputs were correct. Both compiled. But Codex got there faster because it didn't spend time exploring a codebase that barely existed yet.
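
If you want to collect timings like these yourself, a small wrapper around each run is enough. The sketch below is an assumption about how you might do it, not Zowl's logging: timed_run and its CSV layout (label, exit code, seconds) are made up for illustration.

```shell
# timed_run LOG_FILE LABEL COMMAND [ARGS...]
# Runs the command, appends "label,exit_code,seconds" to LOG_FILE,
# and returns the command's own exit code.
timed_run() {
  local log="$1" label="$2" start end status=0
  shift 2
  start="$(date +%s)"
  "$@" || status=$?
  end="$(date +%s)"
  printf '%s,%s,%s\n' "$label" "$status" "$((end - start))" >> "$log"
  return "$status"
}
```

Wrapping both agents' invocations this way on the same task gives you comparable rows in one file, and the exit code column doubles as a crude success-rate signal.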

Where each one falls apart

Honest section. Both of these tools have failure modes that'll bite you.

Claude Code's stdin piping quirk. If you pipe a large PRD in through stdin instead of passing it as the -p argument, the content can get truncated. I learned this the hard way when tasks kept failing because the agent only received half the PRD. The fix is simple (pass the prompt to -p with command substitution), but I wasted two nights of pipeline runs before I figured it out:

# This can truncate on large inputs
cat long-task.md | claude -p -

# This works reliably
claude -p "$(cat long-task.md)" --output-format stream-json
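
Command substitution has a ceiling of its own, since operating systems cap how large a single command-line argument can be. A cheap pre-flight check catches both an empty PRD and an oversized one. This is a sketch under assumptions: prd_fits_in_arg is a hypothetical helper, and the 100000-byte default is a deliberately conservative guess, not a documented limit of either CLI.

```shell
# prd_fits_in_arg FILE [MAX_BYTES]
# Succeeds when FILE is non-empty and no larger than MAX_BYTES.
# MAX_BYTES defaults to 100000, an assumed safe ceiling.
prd_fits_in_arg() {
  local file="$1" limit="${2:-100000}" size
  [ -s "$file" ] || return 1            # missing or empty PRD
  size=$(( $(wc -c < "$file") ))        # arithmetic strips wc's padding
  [ "$size" -le "$limit" ]
}
```

Running prd_fits_in_arg long-task.md before the agent call turns a half-delivered PRD into an immediate, visible failure instead of a wasted night.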

Codex's sandbox constraints. With --full-auto, Codex runs inside a sandbox that restricts network access. If your task involves installing dependencies or hitting external APIs during development, it'll fail silently or produce code that assumes packages exist without installing them. You can work around it by installing dependencies before the agent starts, but it adds friction:

# Pre-install deps before Codex runs
npm install && codex --quiet --full-auto "$(cat task.md)"
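
Because the failure is silent, it can help to make the missing-dependency case loud before the agent starts. A minimal sketch, assuming a Node project: deps_ready is a hypothetical helper that only checks whether node_modules exists, not whether its contents are complete.

```shell
# deps_ready [PROJECT_DIR]
# Succeeds when the project has no package.json (nothing to install)
# or already has a node_modules directory.
deps_ready() {
  local dir="${1:-.}"
  [ -f "$dir/package.json" ] || return 0
  [ -d "$dir/node_modules" ]
}
```

A step can then run deps_ready . || npm install before handing the task to Codex, so the install happens outside the sandbox only when it's actually needed.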

Claude Code on huge files. If a task requires modifying a file that's 2000+ lines, Claude Code sometimes loses track of where it is. It'll make the right change but in the wrong location, or it'll accidentally duplicate a section. Splitting large files before the agent touches them helps.
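
Finding those files ahead of time is easy to script. The sketch below is one hypothetical way to do it: over_line_limit is a made-up helper that prints every file longer than a given line count, so you can split them before a run.

```shell
# over_line_limit MAX_LINES FILE...
# Prints each FILE that has more than MAX_LINES lines.
over_line_limit() {
  local max="$1" f lines
  shift
  for f in "$@"; do
    lines=$(( $(wc -l < "$f") ))
    if [ "$lines" -gt "$max" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, over_line_limit 2000 src/*.ts lists the candidates worth splitting, using the 2000-line threshold mentioned above.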

Codex on multi-step tasks. Anything that requires "do A, then use A's output to do B" is risky with Codex. It tends to treat multi-step PRDs as a single blob and sometimes skips the dependency between steps. Claude Code handles sequential reasoning much better.
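
One way to take the dependency problem out of the agent's hands is to split the PRD into one file per step and enforce the order yourself. This is a sketch, not how Zowl does it: run_steps is a hypothetical helper, and the runner it takes is assumed to be any command or function that accepts a single file path.

```shell
# run_steps RUNNER STEP_FILE...
# Calls RUNNER once per step file, in order, and stops at the first
# failure so a later step never runs against a missing prerequisite.
run_steps() {
  local runner="$1" step
  shift
  for step in "$@"; do
    if ! "$runner" "$step"; then
      echo "step failed: $step" >&2
      return 1
    fi
  done
}
```

The agent only ever sees one step at a time, so "do A, then use A's output to do B" becomes two small tasks with the ordering enforced by the shell.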

My actual workflow

After six weeks of data, here's how I actually split them:

I use Claude Code for:

Pre-check steps: analyzing the codebase, identifying what files need to change, validating the approach
Complex modifications: tasks that touch 3+ existing files and need to respect existing patterns
Validation steps: reviewing generated code, running the test suite with context about what was supposed to change
Refactoring: anything where understanding the existing code matters more than writing new code

I use Codex for:

Greenfield implementations: new utilities, new endpoints, new components that don't depend heavily on existing code
Boilerplate generation: repetitive tasks where speed matters and stylistic consistency doesn't (tests, migrations, config files)
Simple, well-scoped tasks: one file in, one file out, clear acceptance criteria

The split is roughly 60/40 Claude/Codex across my pipelines.
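
Written down as a routing rule, the split above looks roughly like this. It's a sketch under assumptions: pick_agent and the task-kind labels are hypothetical names, and the fallback reuses the "3+ existing files" threshold from the list.

```shell
# pick_agent KIND [FILES_TOUCHED]
# Prints "claude" or "codex" for a task, following the split described above.
pick_agent() {
  local kind="$1" files_touched="${2:-0}"
  case "$kind" in
    pre-check|validate|refactor) echo claude ;;
    greenfield|boilerplate)      echo codex ;;
    *)
      # Unlabeled task: fall back to how many existing files it touches.
      if [ "$files_touched" -ge 3 ]; then echo claude; else echo codex; fi ;;
  esac
}
```

So pick_agent refactor prints claude, and pick_agent modify 1 prints codex. The labels matter less than having the rule in one place where you can adjust it as your own data comes in.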

The gotcha nobody talks about

Here's my spiciest take: the model matters less than the PRD.

I've seen Claude Code fail on a task well within its capabilities because the PRD was vague. I've seen Codex crush a complex task because the PRD was excellent. The variance from PRD quality dwarfs the variance between models. People spend hours debating which agent is "better" when they should be spending those hours writing clearer task descriptions.

A well-written PRD with acceptance criteria, file locations, and explicit scope will get good results from either agent. A sloppy PRD will get sloppy results from both.
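
A quick lint can catch the sloppiest PRDs before any agent sees them. The sketch below assumes a convention the post doesn't specify: that a PRD has lines or headings starting with "Acceptance criteria", "Files", and "Scope". Both those section names and prd_missing_sections itself are hypothetical.

```shell
# prd_missing_sections FILE
# Prints one line per expected section that FILE lacks.
# Returns 0 when all sections are present, 1 otherwise.
prd_missing_sections() {
  local file="$1" section missing=0
  for section in "Acceptance criteria" "Files" "Scope"; do
    # Match the name at line start, optionally after markdown '#' marks.
    if ! grep -qi "^#* *$section" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

It only checks that the sections exist, not that they say anything useful, but that alone filters out the "fix the upload thing" class of task description.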

Command examples from a real pipeline

Here's a stripped-down version of a Zowl pipeline that mixes both agents:

pipeline:
  - step: pre-check
    agent: claude-code
    command: |
      claude -p "Review the codebase at ./src/api and list all
      existing middleware patterns. Output as JSON." \
        --output-format stream-json

  - step: implement
    agent: codex
    command: |
      codex --quiet --full-auto "Create a new rate limiting
      middleware following the patterns described in:
      {{steps.pre-check.output}}"

  - step: validate
    agent: claude-code
    command: |
      claude -p "Review the diff in ./src/api/middleware/rateLimit.ts
      against our existing patterns. Check for: consistent error
      format, proper TypeScript types, no new lint warnings." \
        --output-format stream-json

Claude analyzes. Codex builds. Claude reviews. Each agent doing what it's best at.

Back to nightloop.sh

When I first built Zowl (back when it was a bash script called nightloop.sh), it could only run one agent type. Everything went through Claude because that's all I had access to at the time. The moment Codex CLI dropped, I started testing it on the same tasks and immediately saw where each one shined. I've detailed the history of how this evolved in another post.

That's when I added per-step agent selection to Zowl. You pick the right tool for each step in the pipeline, not one agent for everything. It sounds obvious, but most orchestration tools still assume you're locked into a single provider. Nah. The whole point is that you shouldn't have to choose. Use both. Let the task decide. For more on how to structure these decisions, see how to stop babysitting Claude Code. And if you want to try this workflow yourself, check out Zowl.