I went hands-on with ChatGPT Codex and the vibe was not good - here's what happened

Aleksandra Konoplia/Getty Images

ZDNET’s key takeaways

ChatGPT Codex wrote code and saved me time.
It also created a serious bug, but it was able to recover.
Codex is still based on the GPT-4 LLM architecture.

Well, vibe coding this is not. I found the experience to be slow, cumbersome, stressful, and incomplete. But it all worked out in the end.

ChatGPT Codex is ChatGPT’s agentic tool dedicated to code writing and modification. It can access your GitHub repository, make changes, and issue pull requests. You can then review the results and decide whether or not to incorporate them.

My primary development project is a PHP and JavaScript-based WordPress plugin for site security. There’s a main plugin available for free, and some add-on plugins that enhance the capabilities of the core plugin. My private development repo contains all of this, as well as some maintenance plugins I rely on for user support.

This repo contains 431 files. This is the first time I’ve attempted to get an AI to work across my entire ecosystem of plugins in a private repository. I previously used Google’s Jules to add a feature to the core plugin, but because it only had access to the core plugin’s open source repository, it couldn’t take into account the entire ecosystem of products.

Early last week, I decided to give ChatGPT Codex a run at my code. Then this happened.

GPT-5 released

On Thursday, GPT-5 slammed into the AI world like a freight train. Initially, OpenAI tried to force everyone to use the new model. Subsequently, they added legacy model support when many of their customers went ballistic.

I ran GPT-5 against my set of programming tests, and it failed half of them. So, I was particularly curious about whether Codex still supported the GPT-4 architecture or would force developers into GPT-5.

When I queried Codex five days after GPT-5 launched, the AI responded that it was still based on “OpenAI’s GPT-4 architecture.”

Screenshot by David Gewirtz/ZDNET

I took two things from that:

OpenAI isn’t ready to move Codex coding to GPT-5 (which, recall, failed half my tests).
The results, conclusions, and screenshots I took of my Codex tests are still valid, since Codex is still based on GPT-4.

With that, here is the result of my still-very-much-not-GPT-5 look at ChatGPT Codex.

Getting started

My first step was asking ChatGPT Codex to examine the codebase. I used the Ask mode of Codex, which does analysis, but doesn’t actually change any code.

Screenshot by David Gewirtz/ZDNET

I was hoping for an analysis as deep and comprehensive as the one I received from ChatGPT Deep Research a few months ago, but what I got back was much less complete.

Screenshot by David Gewirtz/ZDNET

I found a more effective approach was to ask Codex to do a quick security audit and let me know if there were any issues. Here’s how I prompted it.

Identify any serious security concerns. Ignore plugins Anyone With Link, License Fixer, and Settings Nuker. Anyone With Link is in the very early stages of coding, and is not ready for code review. License Fixer and Settings Nuker are specialty plugins that do not need a security audit.

Codex identified three main areas for improvement.

Screenshot by David Gewirtz/ZDNET

All three areas were valid, although I am not prepared to modify the serialization data structure at this time, because I’m saving that for a whole preferences overhaul. The $_POST concern is already handled, though by an approach Codex didn’t notice.

The third area — the nonce and cross-site request forgery (CSRF) risk — was something worth changing right away. While access to the user interface for the plugin is assumed to be determined by login role, the plugins themselves don’t explicitly check that the person submitting the plugin settings for action is allowed to do so.
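For readers who want to picture that kind of fix, here is a minimal sketch of the general pattern: check the user's permission first, then verify a request-specific token (a nonce) before acting on a settings submission. It is written in Python purely for illustration. In a real WordPress plugin, this job is normally done in PHP with functions such as current_user_can() and wp_verify_nonce(). The secret key, function names, user structure, and the "save_settings" action label below are all hypothetical, and this is not the code Codex generated.

```python
import hashlib
import hmac

# Hypothetical per-site secret; a real site would never hard-code this.
SECRET_KEY = b"example-secret"


def make_nonce(user_id: int, action: str) -> str:
    """Create a token tied to one user and one action."""
    message = f"{user_id}:{action}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def handle_settings_save(user: dict, action: str, submitted_nonce: str) -> str:
    """Reject the request unless the user is allowed AND the nonce matches."""
    # Step 1: explicit permission check, not just "they reached the page".
    if "manage_options" not in user["capabilities"]:
        return "forbidden"

    # Step 2: nonce check, so a forged cross-site request fails.
    expected = make_nonce(user["id"], action)
    if not hmac.compare_digest(expected, submitted_nonce):
        return "invalid nonce"

    # Only now is it safe to act on the submitted settings.
    return "saved"


admin = {"id": 1, "capabilities": {"manage_options"}}
subscriber = {"id": 2, "capabilities": {"read"}}

print(handle_settings_save(admin, "save_settings", make_nonce(1, "save_settings")))
print(handle_settings_save(subscriber, "save_settings", make_nonce(2, "save_settings")))
print(handle_settings_save(admin, "save_settings", "bogus"))
```

The point of doing both checks is that each covers a different gap: the permission check stops a logged-in user who lacks the right role, and the nonce stops a request that an authorized user never meant to send.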

That’s what I decided to invite Codex to fix.

Fixing the code

Next up, I instructed Codex to make fixes in the code. I changed the setting from Ask mode to Code mode so the AI would actually attempt changes. As with ChatGPT Agent, Codex spins up a virtual terminal to do some of its work.

Screenshot by David Gewirtz/ZDNET

When the process completed, Codex showed a diff (the difference between original and to-be-modified code).
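If you have never read a diff, the sketch below shows roughly what one looks like, using Python's standard difflib module. The before and after snippets and the file names are made up for illustration and have nothing to do with the plugin's actual code. Lines starting with a minus sign come from the original, lines starting with a plus sign are the proposed replacement, and unmarked lines are unchanged context.

```python
import difflib

# Hypothetical "before" and "after" versions of a small function.
original = [
    "def save_settings(data):\n",
    "    store(data)\n",
]
proposed = [
    "def save_settings(data, user):\n",
    "    check_permission(user)\n",
    "    store(data)\n",
]

# unified_diff yields the same style of output GitHub shows in a pull request.
diff_lines = list(
    difflib.unified_diff(original, proposed, fromfile="before.py", tofile="after.py")
)
print("".join(diff_lines))
```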

Screenshot by David Gewirtz/ZDNET

I was heartened to see that the changes were quite surgical. Codex didn’t try to rewrite large sections of the plugin; it just modified the small areas that needed improvement.

In a few areas, it dug in and changed a few more lines, but those changes were still pretty specific to the original prompt.

At one point, I was curious to know why it added a new foreach loop to iterate over an array, so I asked.

Screenshot by David Gewirtz/ZDNET

As you can see above, I got back a fairly clear response on its reasoning. It made sense, so I moved on, continuing to review Codex’s proposed changes.

All told, Codex proposed making changes to nine separate files. Once I was satisfied with the changes, I clicked Create PR. That creates a pull request, which is how any GitHub user suggests changes to a codebase. Once the PR is created, the project owner (me, in this case) has the option to approve those changes, which adds them into the actual code.

It’s a good mechanism, and Codex does a clean job of working within GitHub’s environment.

Screenshot by David Gewirtz/ZDNET

Once I was convinced the changes were good, I merged Codex’s work back into the main codebase.

Screenshot by David Gewirtz/ZDNET

Houston, we have a problem

I brought the changes down from GitHub to my test machine and tried to run the now-modified plugin. Wait for it…

Screenshot by David Gewirtz/ZDNET

Yeah. That’s not what’s supposed to happen. To be fair, I’ve generated my own share of error screens just like that, so I can’t really get angry at the AI.

Instead, I took a screenshot of the error and passed it to Codex, along with a prompt telling Codex, “Selective Content plugin now fails after making changes you suggested. Here are the errors.”

It took the AI three minutes to suggest a fix, which it presented to me in a new diff.

Screenshot by David Gewirtz/ZDNET

I merged that change into the codebase, once again brought it down to my test server, and it worked. Crisis averted.

No vibe, no flow

When I’m not in a rush and I have the time, coding can provide a very pleasant state of mind. I get into a sort of flow with the language, the machine, and what seems like a connection between my fingers and the computer’s CPU. Not only is it a lot of fun, but it can also be emotionally transcendent.

Working with ChatGPT Codex was not fun. It wasn’t hateful. It just wasn’t fun. It felt more like exchanging emails with a particularly recalcitrant contractor than having a meeting of the minds with a coding buddy.

Codex provided its responses in about 10 or 15 minutes, whereas the same code would probably have taken me a few hours.

Would I have created the same bug as Codex? Probably not. As part of the process of thinking through that algorithm, I most likely would have avoided the mistake Codex made. But I undoubtedly would have created a few more bugs based on mistyping or syntax errors.

To be fair, had I introduced the same bug as Codex did, it would have taken me considerably longer than three minutes to find and fix it. Add another hour or so at least.

So Codex did the job, but I wasn’t in flow. Normally, when I code and I’m inside a particular file or subsystem, I do a lot of work in that area. It’s like cleaning day. If you’re cleaning one part of the bathroom, you might as well clean all of it.

But Codex clearly works best with small, simple instructions. Give it one class of change, and work through that one change before introducing new factors. Like I said, it does work and it is a useful tool. But using it definitely felt like more of a chore than programming normally does, even though it saved me a lot of time.

I don’t have tangible test results, but after testing Google’s Jules in May and ChatGPT’s Codex now, I get the impression that Jules is able to get a deeper understanding of the code. At this point, I can’t really support that assertion with a lot of data; it’s just an impression.

I’m going to try running another project through Jules. It will be interesting to see if Codex changes much once OpenAI feels safe enough to incorporate GPT-5. Let’s keep in mind that OpenAI eats its own dog food with Codex, meaning it uses Codex to build its code. They might have seen the same iffy results I found in my tests. They might be waiting until GPT-5 has baked for a bit longer.

Have you tried using AI coding tools like ChatGPT Codex or Google’s Jules in your development workflow? What kinds of tasks did you throw at them? How well did they perform? Did you feel like the process helped you work more efficiently? Did it slow you down and take you out of your coding flow?

Do you prefer giving your tools small, surgical jobs, or are you looking for an agent that can handle big-picture architecture and reasoning? Let us know in the comments below.

