
Claude Code vs Codex CLI: A Same-Repo Workflow Test

A practical comparison of Claude Code and Codex CLI for real repository work, focused on context handling, patches, tests, review, and supervision cost.

Divyanshu Singh Chouhan
14 min read · 2,855 words

Quick decision table

If you want the short answer first, decide by supervision style, not by model branding.

| If this is what you need | Claude Code | Codex CLI |
| --- | --- | --- |
| Safer starting posture in an unfamiliar repo | Better fit. Its docs center CLAUDE.md guidance and explicit permission controls before wider autonomy. | Good fit if you deliberately choose a tighter approval mode before you start editing. |
| One predictable repo contract file | Good if your team is willing to standardize on CLAUDE.md. | Better fit if your team already wants AGENTS.md as the repo contract. |
| Deliberate local review before commit | Possible, but the review loop is less central to the docs than permissions and memory. | Strong fit. Codex CLI documents /review as a dedicated local review pass on a selected diff. |
| Explicit parallel-agent workflows only when asked | Supported, but Claude Code leans more naturally toward richer persistent helper setup. | Strong fit. OpenAI documents subagents as opt-in and says they should be used only when you explicitly ask. |
| Biggest limitation to keep in mind | Installation guidance moved quickly in 2026. Anthropic docs and the GitHub README must be read together so you do not follow stale npm advice. | “Codex” now spans CLI, app, and cloud surfaces. If you do not keep the comparison CLI-only, your conclusion will be sloppy. |

My working judgment is intentionally limited. Based on the docs plus the bounded fixture below, Claude Code looks attractive when your first concern is repo trust and permission friction. Codex CLI looks attractive when your first concern is keeping operator control and local review visible.

The wrong comparison wastes your time

Most posts compare these tools like they are chatbots wearing different logos. That is not how real repo work feels.

In a real repository, the cost is decided by a shorter loop:

  1. Load the repo's instructions.
  2. Read only the files that matter.
  3. Make the smallest patch that solves the task.
  4. Run the checks that actually matter.
  5. Return a diff a human can approve quickly.

That loop is what I care about when I teach engineering learners to use coding agents. A tool that sounds sharp in conversation but widens the patch, invents verification, or ignores local rules has not saved you effort. It has only moved the work from typing to supervision.

If your real problem is weak task framing, read Prompt Engineering Essentials — Beyond the Polite Question before you compare another agent. A precise prompt will not rescue a broken repository, but it does make the evaluation fairer.

What the official product docs actually support

On the Claude side, Anthropic documents persistent project guidance through CLAUDE.md, configurable permissions, and subagents, and covers installation in its getting-started docs. The public repository README for Claude Code also matters because it now flags npm installation as deprecated.

On the Codex side, OpenAI documents AGENTS.md, subagents, the CLI overview, and CLI features including /review. The open-source Codex repository and its release notes are the other pages worth checking before you trust any dated comparison.

That overlap is real. Both tools live in the terminal. Both can read and edit code. Both can be guided by repo-specific instruction files. Both now move fast enough that old blog posts rot quickly.

The useful differences show up in how they want you to supervise them.

Instruction files matter more than model arguments

Anthropic's memory docs say Claude Code starts sessions fresh and relies on CLAUDE.md for durable instructions. OpenAI's Codex docs say Codex reads AGENTS.md files before doing any work and layers global guidance with project-specific overrides.

That is not a naming detail. It changes how your repo teaches the agent what "good" looks like.

This is the kind of contract I want either tool to inherit:

  • Only touch files directly required for the task.
  • Do not rename public API fields unless explicitly asked.
  • Run npm test and npm run lint before finalizing.
  • If a check fails for unrelated reasons, report that clearly.
  • Do not add dependencies unless the task requires them.
  • Explain the diff in terms of user-visible behavior.
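
As a concrete sketch, here is how that contract might read as a CLAUDE.md or AGENTS.md file. The wording below is hypothetical and not taken from either tool's documentation, so adjust it to your own repository's rules.

```markdown
<!-- Hypothetical repo contract; usable as CLAUDE.md or AGENTS.md content -->
# Agent contract for this repository

## Scope
- Only touch files directly required for the task.
- Do not rename public API fields unless explicitly asked.
- Do not add dependencies unless the task requires them.

## Verification
- Run `npm test` and `npm run lint` before finalizing.
- If a check fails for unrelated reasons, report that clearly instead of patching around it.

## Reporting
- Explain the diff in terms of user-visible behavior.
- List every file changed and every command run.
```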

Without this layer, the comparison turns noisy fast. You are no longer testing Claude Code or Codex CLI. You are testing how well each one guesses your unstated standards.

In the ABCsteps curriculum, this is where many beginners lose the thread. They ask for a feature, skip the repo rules, then act surprised when the agent behaves like a generic assistant. I keep repeating the same teaching point: vague repos produce expensive diffs.

Claude Code tends to help when you want the repo to slow the agent down

Claude Code's documented shape is conservative in a useful way.

Anthropic's docs put CLAUDE.md at the center of persistent project guidance. Its permissions docs describe a ladder where reads are easier than edits or command execution, and its subagent docs describe more specialized helpers with their own prompts and tool access.
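
If you want to encode part of that ladder up front, Claude Code can read project permission rules from a checked-in .claude/settings.json file. The snippet below is a minimal sketch in the documented allow/deny style; the specific rule strings are my own illustrations for this fixture, so confirm the exact syntax against Anthropic's current permissions docs before relying on it.

```json
{
  "permissions": {
    "allow": [
      "Bash(npm test)",
      "Bash(npm run lint)"
    ],
    "deny": [
      "Bash(rm:*)",
      "Read(./.env)"
    ]
  }
}
```

The point is not these particular rules. It is that the repository, rather than the individual session, decides what the agent may do without asking.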

That combination usually benefits teams that care about controlled expansion of scope. If you are entering an unfamiliar monorepo, touching billing code, or working with a weak test suite, this posture makes sense. The repo can teach the tool first, and the tool can earn more freedom later.

The trade-off is that Claude Code can feel heavier if all you wanted was a narrow local patch on a well-understood service. Strong memory and richer helper setup are good only when the repo deserves that weight.

Codex CLI tends to help when you want operator control to stay visible

Codex CLI's docs feel more explicit about operator choices.

OpenAI's AGENTS.md guide explains the load order clearly: global instructions, then project-level files from repo root down to the current directory. Its subagent docs are even clearer for supervision: Codex should use subagents only when you explicitly ask for subagents or parallel work. The CLI features page also documents /review as a dedicated reviewer that reads a chosen diff and reports findings without modifying the working tree.

That makes Codex CLI especially good for engineers who want the workflow to stay inspectable. The operator chooses how much freedom to allow, whether to delegate, and whether to run a separate review pass before commit.

I like that posture for active repo work because it keeps one question visible at all times: did the agent do exactly the work I meant, or did I accidentally authorize more than I had reviewed?

The fairest test is one bounded task in one repo

If you want a decision you can trust, run both tools on the same task in the same repository with the same repo instructions.

The fixture I used was deliberately tiny:

  • package.json
  • src/users.js
  • test/users.test.js
  • scripts/lint.js
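
To make the fixture concrete, here is a minimal sketch of the baseline src/users.js behavior described in the methodology below: accept name and email, reject an invalid email, and return 201 for a valid user. The exact shape, a pure function returning a status and body rather than a live HTTP handler, is my assumption rather than a published fixture.

```js
// src/users.js (hypothetical baseline for the fixture)
// A pure function returning { status, body } so Node's built-in test runner
// can exercise it without starting a server.
function createUser(input = {}) {
  const { name, email } = input;

  if (typeof name !== 'string' || name.trim() === '') {
    return { status: 400, body: { error: 'name is required' } };
  }
  if (typeof email !== 'string' || !email.includes('@')) {
    return { status: 400, body: { error: 'email must be valid' } };
  }

  return { status: 201, body: { user: { name, email } } };
}

module.exports = { createUser };
```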

The task:

  • Add optional phone support to createUser.
  • Reject non-string phone values.
  • Add one targeted test.
  • Do not touch unrelated files.

The exact prompt for both tools was:

You are in a tiny fixture repo. Implement only this bounded task: add optional phone support to createUser. Accept phone when it is absent or a string; reject non-string phone with status 400 and error 'phone must be a string'. Preserve existing behavior. Add one targeted test. Run npm test and npm run lint. Final response must list files changed and commands run.

Expected working set: src/users.js and test/users.test.js.

Expected verification: npm test and npm run lint, with the exact pass/fail result reported.

A narrow patch should add one optional phone field to the returned user only when it is present, one validation branch in createUser, and one targeted test that sends a numeric phone value and expects a 400 response. If the tool rewrites unrelated validation helpers, reformats the whole route, or changes the response contract, the run is already more expensive than it looks.
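
Against that baseline, an acceptable patch is roughly the sketch below. This is my own illustration of the target diff, not either tool's actual output; the status code and error string come from the prompt.

```js
// src/users.js after the patch (sketch): one guard, one optional field
function createUser(input = {}) {
  const { name, email, phone } = input;

  // Existing validation preserved from the baseline.
  if (typeof name !== 'string' || name.trim() === '') {
    return { status: 400, body: { error: 'name is required' } };
  }
  if (typeof email !== 'string' || !email.includes('@')) {
    return { status: 400, body: { error: 'email must be valid' } };
  }

  // New branch: phone is optional, but if present it must be a string.
  if (phone !== undefined && typeof phone !== 'string') {
    return { status: 400, body: { error: 'phone must be a string' } };
  }

  const user = { name, email };
  if (phone !== undefined) {
    user.phone = phone; // copy phone only when it is present
  }
  return { status: 201, body: { user } };
}

module.exports = { createUser };
```

```js
// test/users.test.js addition (sketch): the one targeted test the task asked for
const { test } = require('node:test');
const assert = require('node:assert');
const { createUser } = require('../src/users');

test('rejects a non-string phone with 400', () => {
  const result = createUser({ name: 'Ada', email: 'ada@example.com', phone: 12345 });
  assert.strictEqual(result.status, 400);
  assert.strictEqual(result.body.error, 'phone must be a string');
});
```

Anything much larger than this, in either file, is a signal to slow the review down.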

Then score both runs against the same questions:

| Question | Good run | Costly run |
| --- | --- | --- |
| Did the patch stay narrow? | Only the route, type, and test changed. | It refactored helpers, renamed fields, or reformatted unrelated files. |
| Did verification stay honest? | The tool reports the exact commands run and whether they passed. | The tool implies verification without naming commands or hides unrelated failures. |
| Did it follow the repo contract? | No new dependencies, no extra abstractions, no architecture drift. | It adds structure you never asked for. |
| Would you merge faster? | Diff is readable in minutes. | Diff creates detective work. |

This is the point where I need to be strict as an editor: without two completed transcripts from the same repo task, any stronger winner claim would be theater. So the honest article is not "Claude wins" or "Codex wins." The honest article is: each tool gives you a different supervision surface, and the repo decides whether that surface helps or hurts.

For this article, I ran that fixture on May 7, 2026 instead of pretending the comparison could be decided from docs alone.

Methodology:

| Item | Value |
| --- | --- |
| Fixture | Local, git-ignored JavaScript repo with one createUser function, Node's built-in test runner, and a simple lint script |
| Claude Code | 2.1.126, run with the sonnet alias and low effort |
| Codex CLI | 0.128.0-alpha.1, run with gpt-5.4 and low reasoning |
| Starting constraints | Same prompt, same fixture, same expected commands: npm test and npm run lint |
| Scoring rubric | Files touched, diff size, tests passing, commands named, scope control, review burden |
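
The "simple lint script" does not need to be a real linter for a bounded test like this. Something like the sketch below is enough to give npm run lint a genuine pass or fail signal; it is a hypothetical stand-in, not the script either tool actually saw.

```js
// scripts/lint.js (hypothetical stand-in for the fixture's lint step)
// Fails if any source or test file contains a leftover console.log call
// or a merge-conflict marker.
const fs = require('node:fs');
const path = require('node:path');

let failures = 0;
for (const dir of ['src', 'test']) {
  for (const file of fs.readdirSync(dir)) {
    if (!file.endsWith('.js')) continue;
    const target = path.join(dir, file);
    const text = fs.readFileSync(target, 'utf8');
    if (text.includes('console.log(') || text.includes('<<<<<<<')) {
      console.error(`lint: disallowed pattern in ${target}`);
      failures += 1;
    }
  }
}

process.exit(failures > 0 ? 1 : 0);
```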

Reproducible evidence block:

| Evidence item | What was held constant |
| --- | --- |
| Fixture behavior | Existing createUser accepted name and email, rejected invalid email, and returned 201 for valid users |
| Prompt | The exact prompt above was used for both tools |
| Claude command shape | claude -p --permission-mode bypassPermissions --model sonnet --effort low ... inside the fixture copy |
| Codex command shape | codex --model gpt-5.4 -c model_reasoning_effort="low" -s workspace-write -a never exec ... inside the fixture copy |
| Required commands | Each tool had to run npm test and npm run lint |
| Audit surface | git diff --stat, changed files, final command claims, and a rerun of both commands after completion |

| Same fixture result | Claude Code | Codex CLI |
| --- | --- | --- |
| Files touched | src/users.js, test/users.test.js | src/users.js, test/users.test.js |
| Diff size | 20 insertions, 6 deletions | 25 insertions, 4 deletions |
| Tests after run | 4 passing tests | 3 passing tests |
| Verification named | npm test, npm run lint | npm test, npm run lint |
| Scope behavior | Stayed inside the task, added both reject and accept-phone tests | Stayed inside the task, added the requested reject-phone test only |
| Review burden | Slightly broader test coverage, still easy to review | Narrower behavioral surface, very easy to review |

Compact transcript evidence:

| Tool | Final verification statement |
| --- | --- |
| Claude Code | "All 4 tests pass and lint is clean." |
| Codex CLI | "Verification: both commands passed." |

Compact diff evidence:

| Tool | Source change | Test change |
| --- | --- | --- |
| Claude Code | Added non-string phone guard, built a user object, and copied phone only when present | Added one reject test and one accept-string test |
| Codex CLI | Added non-string phone guard, built a user object, and copied phone only when present | Added one reject test only |

That result does not prove one tool is universally better. It does prove the right comparison method. Claude Code gave me one extra positive-path test, which I liked. Codex CLI gave me the narrower task completion, which I also liked. In a teaching repo, I would accept either patch after review. In a production repo, I would choose based on the team rule: do we reward extra coverage, or do we reward the smallest acceptable diff?

The practical limitation on the Claude side in this run was breadth: adding an accept-string test was helpful, but it was beyond the one targeted rejection test I asked for. The practical limitation on the Codex side was the opposite: the patch was tighter, but it did not prove the positive phone path with a separate test. That is the kind of trade-off a human reviewer can actually use.

To verify this in your own repo in under two minutes, do not ask "which tool is smarter?" Copy one small task, use the same prompt for both tools, then count: files changed, tests added, commands run, unrelated edits, and how long the diff takes you to review.

This is also why I do not want ABCsteps articles to pretend certainty where there is only a small sample. A useful comparison is not louder certainty. It is a reusable evaluation standard that a reader can run in their own repository.

What observed evidence is strong enough to trust

If you run the phone task properly, do not stop at "it worked." Capture evidence that another engineer could audit:

  • Files touched.
  • Exact verification commands run.
  • Whether the tool asked for broader access.
  • Whether the patch widened beyond the requested scope.
  • Whether the final explanation matched the diff.

That evidence matters more than polished terminal output.

When I review learner repos, the strongest agent runs are usually boring. Three or four files changed. One test was added. The explanation matched the code. The commands were named plainly. Boring is good here. Boring means the human reviewer stayed in control.

My founder rule is simple: if I cannot explain the diff to a learner in two minutes, the agent did not finish the job. It merely produced more material for a human to audit. That is why I reward narrow patches, named checks, and honest limits more than dramatic terminal output.

In learner reviews, the recurring mistake is not that students choose the wrong model. It is that they accept a large patch because the final message sounds confident. I ask them to slow down and count the proof: which files changed, which commands ran, which behavior changed, and what new review burden appeared. That habit matters more than the brand name of the agent.

Practical limitations that should affect your choice

Claude Code has one operational wrinkle worth naming clearly. As of May 7, 2026, Anthropic's getting-started docs still include npm uninstall guidance, while the GitHub README says npm installation is deprecated and recommends the native installer paths instead. That is not a product flaw by itself, but it is a reminder to pin setup advice to the date and source.

Codex CLI has a different problem: naming sprawl. The OpenAI docs now cover Codex CLI, app, cloud tasks, and other related surfaces. If someone says "Codex can do this" without specifying which surface they mean, the comparison has already gone soft.

Version velocity also matters. On May 7, 2026, the latest visible stable Codex release on GitHub was 0.121.0, published on April 15, 2026. Anthropic's public Claude Code release notes point readers back to the repository changelog for the most current detail. These version pages do not tell you which tool is better. They tell you the operational surface is changing often enough that old advice expires fast.

So which one should you choose?

Start your evaluation with Claude Code if your team needs stronger repo memory, a more conservative opening posture, and clearer friction before changes spread.

Start your evaluation with Codex CLI if your team wants instruction layering through AGENTS.md, explicit operator intent around delegation, and a built-in local review pass before code leaves the machine.

From the tiny same-task run above, I would summarize the observed difference more carefully: Claude Code was slightly more coverage-generous; Codex CLI was slightly more minimal. That is a useful signal, not a final law.

Choose neither by default if your repository still lacks instruction discipline and reliable checks. In that state, both tools will magnify ambiguity.

My own view is narrow on purpose. I do not reward a coding agent for sounding intelligent. I reward it for returning a patch I can trust after five minutes of review.

What to do next

Run the same phone task in both tools this week. Keep the repo instructions identical, start with the stricter workflow in each tool, and score both results on patch narrowness, verification honesty, and merge confidence.

Then apply that thinking in Lesson 11, where LLMs are introduced as practical engineering tools rather than magic. If you want the model-layer comparison above the workflow layer, read Claude Opus 4.7 vs GPT-5.5 — Which AI Coding Agent Should Developers Choose?


Apply this hands-on · Module C

AI Products Are API Systems

Lesson 11 introduces LLMs as practical engineering tools. This article helps learners compare two coding-agent CLI workflows without confusing model quality, tool surface, and repo-review behavior.


#coding-agents #claude-code #codex-cli #developer-tools #agent-workflows

Divyanshu Singh Chouhan

Founder, ABCsteps Technologies

Founder of ABCsteps Technologies. Building a 20-lesson AI engineering course that teaches AI, ML, cloud, and full-stack development through written lessons and real projects.