Claude Code vs Codex CLI: when to use which
I've run both through the same pipeline on the same tasks. Here's what actually happened, not what the marketing pages say.
Two CLIs, one pipeline, zero loyalty
I don't have a favorite agent. I have tasks that need to get done. Over the past four months I've been running Claude Code and OpenAI's Codex CLI through the exact same Zowl pipelines on the same repos, tracking success rates, token usage, and how much cleanup I need to do in the morning. This is what I found.
Not what the marketing pages say. Not what Twitter influencers claim after one demo. What actually happens when you point these things at real work and go to sleep.
The setup
Both agents ran through Zowl on my M2 MacBook Pro. Same PRDs, same repos, same acceptance criteria. I logged every run over a period of about six weeks across three projects: a Next.js SaaS app, a CLI tool in Rust, and an Express API with a Postgres backend.
For Claude Code, the invocation looks like this:
claude -p "$(cat task.md)" --output-format stream-json
For Codex CLI:
codex --quiet --full-auto "$(cat task.md)"
One immediate gotcha: argument passing works differently between them. Claude Code takes the prompt via -p and gives you structured output with --output-format. Codex uses positional arguments and the --quiet flag to suppress interactive prompts. If you're scripting these into a pipeline, you'll hit this friction from day one.
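One way to keep that friction out of the rest of a script is a thin wrapper. This is a minimal sketch, not how Zowl does it: run_agent and the agent labels are names I made up for illustration, and it only uses the flags shown above.

```shell
# Hypothetical wrapper: one call signature for both CLIs.
# Usage: run_agent <claude-code|codex> <task-file>
run_agent() {
  agent="$1"
  task_file="$2"
  # Read the whole task file; abort if it cannot be read.
  prompt="$(cat "$task_file")" || return 1
  case "$agent" in
    claude-code)
      # Claude Code: prompt via -p, structured output via --output-format.
      claude -p "$prompt" --output-format stream-json
      ;;
    codex)
      # Codex CLI: prompt as a positional argument.
      codex --quiet --full-auto "$prompt"
      ;;
    *)
      echo "unknown agent: $agent" >&2
      return 2
      ;;
  esac
}
```

With that in place, the rest of the pipeline can call run_agent codex task.md or run_agent claude-code task.md without caring which flags go where.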
Understanding existing codebases
Claude Code wins here. It's not even close.
When I handed both agents a task like "add rate limiting to the /api/upload endpoint following the existing pattern in /api/auth," Claude Code actually went and read the auth middleware, understood the pattern, and replicated it. It picked up on the project's error response format, the logging style, even the way we structured middleware chains.
Codex would sometimes do this. But more often it'd generate a perfectly reasonable rate limiter that looked nothing like the rest of the codebase. Technically correct, stylistically alien. You end up with two different patterns living side by side, which is worse than either one.
On the Next.js project specifically, Claude Code understood App Router conventions better. It knew where to put server actions, how to handle revalidatePath, and when to use "use client" vs keeping things on the server. Codex would occasionally spit out Pages Router patterns mixed in, which is a painful thing to debug at 7am.
Greenfield speed
Codex is faster on fresh projects. I'll give it that.
When the task was "create a new CLI tool that does X" or "scaffold a REST API with these endpoints," Codex would finish in 60-70% of the time Claude Code took. It's more aggressive about generating code, less careful about reading existing context (because there isn't much to read).
For one task, building a simple Rust CLI that converts CSV to JSON with some filtering options, Codex nailed it in about 4 minutes. Claude Code took closer to 7. Both outputs were correct. Both compiled. But Codex got there faster because it didn't spend time exploring a codebase that barely existed yet.
Where each one falls apart
Honest section. Both of these tools have failure modes that'll bite you.
Claude Code's stdin piping quirk. If you're piping large PRDs through stdin instead of using -p, you can hit issues with content getting truncated. I learned this the hard way when tasks kept failing because the agent only received half the PRD. The fix is simple (use -p with command substitution), but I wasted two nights of pipeline runs before I figured it out:
# This can truncate on large inputs
cat long-task.md | claude -p -
# This works reliably
claude -p "$(cat long-task.md)" --output-format stream-json
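A cheap pre-flight check would have saved me those two nights. The sketch below only verifies that the task file is non-empty and under a size budget before it gets passed as an argument. check_task_file and MAX_PROMPT_BYTES are hypothetical names, and the 100000-byte default is my own conservative guess, not a documented limit of either CLI.

```shell
# Hypothetical pre-flight check: fail loudly before the agent ever starts.
# MAX_PROMPT_BYTES is an assumed budget; tune it for your shell and OS.
check_task_file() {
  task_file="$1"
  limit="${MAX_PROMPT_BYTES:-100000}"
  # -s is true only if the file exists and is not empty.
  [ -s "$task_file" ] || { echo "empty or missing: $task_file" >&2; return 1; }
  # Wrap wc in arithmetic expansion to strip the padding some platforms add.
  size=$(( $(wc -c < "$task_file") ))
  if [ "$size" -gt "$limit" ]; then
    echo "task file too large ($size bytes): $task_file" >&2
    return 1
  fi
}

# Usage (not run here):
#   check_task_file long-task.md && claude -p "$(cat long-task.md)" --output-format stream-json
```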
Codex's sandbox constraints. Codex runs in a sandboxed environment by default with --full-auto, which means network access is restricted. If your task involves installing dependencies or hitting external APIs during development, it'll fail silently or produce code that assumes packages exist without installing them. You can work around it, but it adds friction:
# Pre-install deps before Codex runs
npm install && codex --quiet --full-auto "$(cat task.md)"
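Since my three projects use different toolchains, the pre-install step can be picked per repo. A rough sketch, assuming lockfile and manifest names are enough to tell the projects apart. preinstall_cmd is a made-up helper, and npm ci and cargo fetch are my substitutions for the plain npm install above.

```shell
# Hypothetical helper: print the dependency pre-install command for a repo.
preinstall_cmd() {
  dir="$1"
  if [ -f "$dir/package-lock.json" ]; then
    echo "npm ci"          # reproducible install from the lockfile
  elif [ -f "$dir/package.json" ]; then
    echo "npm install"
  elif [ -f "$dir/Cargo.toml" ]; then
    echo "cargo fetch"     # download crates so the build works offline
  else
    echo ":"               # shell no-op: nothing to pre-install
  fi
}

# Usage (not run here):
#   $(preinstall_cmd .) && codex --quiet --full-auto "$(cat task.md)"
```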
Claude Code on huge files. If a task requires modifying a file that's 2000+ lines, Claude Code sometimes loses track of where it is. It'll make the right change but in the wrong location, or it'll accidentally duplicate a section. Splitting large files before the agent touches them helps.
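To find the files worth splitting before a run, something like this works. It's a sketch: list_large_files is a hypothetical helper, the extension list reflects my three projects, and the 2000-line default is just the threshold where I started seeing problems.

```shell
# Hypothetical helper: list source files at or above a line-count threshold.
# Usage: list_large_files <dir> [max_lines]
list_large_files() {
  dir="$1"
  max_lines="${2:-2000}"
  find "$dir" -type f \
    \( -name '*.ts' -o -name '*.tsx' -o -name '*.js' -o -name '*.rs' \) \
    ! -path '*/node_modules/*' |
  while IFS= read -r file; do
    # Arithmetic expansion strips the padding some wc versions emit.
    lines=$(( $(wc -l < "$file") ))
    if [ "$lines" -ge "$max_lines" ]; then
      printf '%s\t%s\n' "$lines" "$file"
    fi
  done
}
```

Running list_large_files ./src before the pipeline starts prints one line per oversized file, line count first.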
Codex on multi-step tasks. Anything that requires "do A, then use A's output to do B" is risky with Codex. It tends to treat multi-step PRDs as a single blob and sometimes skips the dependency between steps. Claude Code handles sequential reasoning much better.
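One workaround is to stop handing Codex the whole blob and feed it one step at a time, stopping at the first failure. A minimal sketch under that assumption: run_steps and codex_step are hypothetical names, and the step file names in the usage line are invented.

```shell
# Hypothetical runner: execute step files in order, stop at the first failure.
# Usage: run_steps <runner-function> <step-file>...
run_steps() {
  runner="$1"
  shift
  for step_file in "$@"; do
    echo "running $step_file" >&2
    "$runner" "$step_file" || {
      echo "failed at $step_file, skipping the rest" >&2
      return 1
    }
  done
}

# One Codex invocation per step, so step B runs against the repo
# state that step A left behind.
codex_step() {
  codex --quiet --full-auto "$(cat "$1")"
}

# Usage (not run here):
#   run_steps codex_step steps/01-schema.md steps/02-endpoint.md
```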
My actual workflow
After six weeks of data, here's how I actually split them:
I use Claude Code for:
- Pre-check steps: analyzing the codebase, identifying what files need to change, validating the approach
- Complex modifications: tasks that touch 3+ existing files and need to respect existing patterns
- Validation steps: reviewing generated code, running the test suite with context about what was supposed to change
- Refactoring: anything where understanding the existing code matters more than writing new code
I use Codex for:
- Greenfield implementations: new utilities, new endpoints, new components that don't depend heavily on existing code
- Boilerplate generation: repetitive tasks where speed matters and stylistic consistency doesn't (tests, migrations, config files)
- Simple, well-scoped tasks: one file in, one file out, clear acceptance criteria
The split is roughly 60/40 Claude/Codex across my pipelines.
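Written down as a routing rule, the split above looks roughly like this. The task-kind labels are my own shorthand for the bullets, not anything either CLI or Zowl defines, and choose_agent is a hypothetical function.

```shell
# Hypothetical routing rule: map a task kind to the agent I'd pick for it.
choose_agent() {
  case "$1" in
    pre-check|complex-modification|validation|refactor)
      echo "claude-code"   # understanding existing code matters most
      ;;
    greenfield|boilerplate|simple)
      echo "codex"         # speed matters more than matching existing style
      ;;
    *)
      echo "unknown task kind: $1" >&2
      return 1
      ;;
  esac
}
```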
The gotcha nobody talks about
Here's my spiciest take: the model matters less than the PRD.
I've seen Claude Code fail on a task well within its abilities because the PRD was vague. I've seen Codex crush a complex task because the PRD was excellent. The variance from PRD quality dwarfs the variance between models. People spend hours debating which agent is "better" when they should be spending those hours writing clearer task descriptions.
A well-written PRD with acceptance criteria, file locations, and explicit scope will get good results from either agent. A sloppy PRD will get sloppy results from both.
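You can catch the sloppiest PRDs before they cost you a run. This sketch assumes the PRD has lines that start with "Acceptance criteria", "Files", and "Scope", optionally as markdown headings. Those section names and lint_prd itself are my own convention for illustration, not a standard.

```shell
# Hypothetical PRD lint: fail if any expected section heading is missing.
lint_prd() {
  prd="$1"
  missing=0
  for section in "Acceptance criteria" "Files" "Scope"; do
    # Match the section name at the start of a line, with optional '#' marks.
    if ! grep -qi "^#* *$section" "$prd"; then
      echo "missing section: $section" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here):
#   lint_prd task.md && claude -p "$(cat task.md)" --output-format stream-json
```

It's a crude prefix match, so it won't judge whether the acceptance criteria are any good. It only catches the PRDs that don't have any.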
Command examples from a real pipeline
Here's a stripped-down version of a Zowl pipeline that mixes both agents:
pipeline:
  - step: pre-check
    agent: claude-code
    command: |
      claude -p "Review the codebase at ./src/api and list all
      existing middleware patterns. Output as JSON." \
        --output-format stream-json
  - step: implement
    agent: codex
    command: |
      codex --quiet --full-auto "Create a new rate limiting
      middleware following the patterns described in:
      {{steps.pre-check.output}}"
  - step: validate
    agent: claude-code
    command: |
      claude -p "Review the diff in ./src/api/middleware/rateLimit.ts
      against our existing patterns. Check for: consistent error
      format, proper TypeScript types, no new lint warnings." \
        --output-format stream-json
Claude analyzes. Codex builds. Claude reviews. Each agent doing what it's best at.
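If you want the same hand-off without an orchestrator, the only tricky part is substituting one step's output into the next step's prompt. Here's a sketch of that substitution in plain shell. It is not Zowl's implementation: render_prompt is a hypothetical helper, and it does a literal replace through awk so that characters like & or backslashes in the output survive untouched.

```shell
# Hypothetical helper: replace every literal occurrence of a placeholder
# in a prompt template with a value (for example, a previous step's output).
# Usage: render_prompt <template> <placeholder> <value>
render_prompt() {
  TEMPLATE="$1" PLACEHOLDER="$2" VALUE="$3" awk 'BEGIN {
    t = ENVIRON["TEMPLATE"]
    p = ENVIRON["PLACEHOLDER"]
    v = ENVIRON["VALUE"]
    out = ""
    # index() is a plain substring search, so nothing is treated as a regex.
    while ((i = index(t, p)) > 0) {
      out = out substr(t, 1, i - 1) v
      t = substr(t, i + length(p))
    }
    printf "%s%s\n", out, t
  }'
}

# Usage (not run here), with $patterns holding the pre-check step's output:
#   prompt="$(render_prompt "$template" '{{steps.pre-check.output}}' "$patterns")"
#   codex --quiet --full-auto "$prompt"
```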
Back to nightloop.sh
When I first built Zowl (back when it was a bash script called nightloop.sh), it could only run one agent type. Everything went through Claude because that's all I had access to at the time. The moment Codex CLI dropped, I started testing it on the same tasks and immediately saw where each one shined. I've detailed the history of how this evolved in another post.
That's when I added per-step agent selection to Zowl. You pick the right tool for each step in the pipeline, not one agent for everything. It sounds obvious, but most orchestration tools still assume you're locked into a single provider. Nah. The whole point is that you shouldn't have to choose. Use both. Let the task decide. For more on how to structure these decisions, see how to stop babysitting Claude Code. And if you want to try this workflow yourself, check out Zowl.