Here’s a sentence I never expected to write: I used one AI to snitch on another AI’s code, then used the first AI to argue back.
We’re building learn.cloudyeti.io, a certification exam prep platform. React frontend, AWS Lambda backend, DynamoDB, Stripe payments. Claude Code has been the primary dev agent for months. It wrote the backend handlers, generated 1,500+ exam questions, wired up Stripe, set up the deployment pipeline. The whole thing.
Before launch, I had a thought: what if Claude Code left bugs it can’t see? Your own builder has blind spots. So I pointed OpenAI’s Codex agent at the codebase and told it to play security auditor.
It came back with seven findings. Three marked “Critical.”
The Report Card
Codex flagged a mix of real problems and overreactions:
- No session ownership verification on quiz endpoints
- Answer leakage through the results endpoint
- Free-tier daily limit completely bypassed
- Paid features not gated server-side
- Using a dev stage name in production
- Stripe test/live key separation
- No webhook idempotency
Seven findings sounds bad. But here’s the thing about AI auditors: they don’t know your product. They pattern-match against security checklists. Some of these were legitimate. Some were noise.
The One That Actually Mattered
Finding #3 was a real bug, and it was embarrassingly simple. The anonymous user auth system generates IDs with the prefix `anonymous-`:
```python
# auth.py
import uuid

user_id = f"anonymous-{uuid.uuid4()}"
```
But the daily rate limit check in the quiz handler looked for a different prefix:
```python
# quiz.py
if user_id.startswith('anon_'):
    ...  # enforce daily question limit
```
`'anonymous-xxx'.startswith('anon_')` is `False`. Always. The entire daily limit for free users was dead code. Every anonymous user got unlimited access, and (worse) authenticated free-tier users were actually more restricted than anonymous ones. Log out, get more questions. Great business model.
Claude Code wrote both files. It never caught the mismatch. Codex spotted it in seconds.
Pushing Back on the Rest
But we didn’t just accept all seven findings and start panic-fixing. That’s where the human part matters.
Session ownership (#1): Codex called this Critical because anyone with a session UUID could hit another user’s quiz endpoints. True. But session IDs are UUIDv4s, which carry 122 bits of randomness (roughly 5 × 10^36 possible values). You’re not guessing someone’s session ID; it’s about the same entropy as most API tokens. Worth fixing, not worth losing sleep over. We downgraded it to High.
Answer leakage (#2): Codex flagged the results endpoint for showing all answers after quiz completion. For a proctored exam, that’s a disaster. For a study app, that’s literally the feature. The auto-complete behavior on active sessions was a minor design quirk, not a vulnerability. Downgraded to Medium.
Paid feature gating (#4): Fair point in theory. But the “gated” content (explanations) was already visible in results responses for all users. The barn door was already open. Medium.
Dev stage naming (#5 and #6): Codex flagged us for running “dev” as production. Guilty as charged. But dev IS the production stage. It’s a naming convention choice, not a security hole. We merged these two findings into one note and moved on.
Webhook idempotency (#7): The Stripe webhook handler sets a boolean to true and checks a list before appending. Running it twice does nothing harmful. Textbook idempotency, just not explicitly labeled as such. Low priority.
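In sketch form, the pattern Codex flagged looks something like this (hypothetical names and record shape, not the actual handler):

```python
# Hypothetical sketch of the webhook's write path -- the record shape and
# names are assumptions, not the real learn.cloudyeti.io handler.

def apply_purchase(user_record: dict, exam_id: str) -> dict:
    # Setting a flag to True is naturally idempotent: a replayed webhook
    # just sets it to True again.
    user_record["is_premium"] = True

    # Checking membership before appending makes a second delivery of the
    # same event a no-op.
    purchased = user_record.setdefault("purchased_exams", [])
    if exam_id not in purchased:
        purchased.append(exam_id)

    return user_record
```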
What We Actually Fixed
Two things. That’s it.
Fix 1: Changed `anon_` to `anonymous-` in quiz.py. Two lines of code. The daily limit now actually works.
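In sketch form, it’s just the quiz.py check from earlier with the prefix auth.py actually generates:

```python
# quiz.py (after the fix) -- the check now matches the "anonymous-" prefix
if user_id.startswith('anonymous-'):
    ...  # enforce daily question limit
```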
Fix 2: Added a `_verify_session_owner()` helper that checks the Firebase UID against the session’s owner for authenticated users and skips the check for anonymous sessions (where the UUID itself acts as a bearer token). Wired it into `get_current_question`, `submit_answer`, and `get_results`.
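A minimal sketch of that helper, assuming a session record with a `user_id` field and an exception the handlers map to a 403 (both names are hypothetical):

```python
# Hypothetical sketch -- field names, the exception type, and the exact
# signature are assumptions, not the actual code.

class ForbiddenError(Exception):
    """Mapped to an HTTP 403 by the Lambda handlers."""

def _verify_session_owner(session: dict, caller_uid: str | None) -> None:
    owner_id = session["user_id"]

    # Anonymous sessions: the unguessable session UUID is itself the
    # bearer token, so there is no separate identity to check against.
    if owner_id.startswith("anonymous-"):
        return

    # Authenticated sessions: the Firebase UID from the verified ID token
    # must match the UID stored on the session when it was created.
    if caller_uid != owner_id:
        raise ForbiddenError("Session does not belong to this user")
```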
All 55 tests updated and passing. Deployed. Done.
We didn’t touch the other five findings. Not because they’re wrong, but because they’re low-ROI for a pre-launch MVP. Two fixes gave us roughly 80% of the risk reduction for about 20% of the effort. The rest goes on the backlog.
What I Took Away From This
Don’t blindly trust any single agent. Codex screamed “Critical” three times. One was a launch blocker. One was a medium-priority design choice. One was barely worth discussing. If we’d treated every finding as equally urgent, we’d have burned hours refactoring things that don’t matter yet and possibly introduced new bugs in the process.
The best use of a second agent isn’t building. It’s breaking. Claude Code built the codebase and never noticed `anon_` vs `anonymous-` across two files it wrote. A fresh set of (artificial) eyes caught it immediately. Agents develop the same kind of tunnel vision that human developers do.
Severity calibration still needs a human. An agent calling UUIDs “bearer secrets” sounds alarming until you think about 122-bit entropy for two seconds. AI auditors are great at surface-area coverage. They’re bad at “does this actually matter for this product?”
And honestly, the whole exercise took maybe 30 minutes and cost almost nothing. A cheap second opinion that caught a real bug before real users hit it. That prefix mismatch would have been a genuinely embarrassing production issue: anonymous users with unlimited access, signed-in free users getting throttled.
Two agents. Seven findings. Two fixes. One bug that actually mattered. Ship it.