OpenAI just released GPT-5.4 and Codex 5.4, and the early reactions from developers are unusually strong.
Some report that it solved bugs earlier models failed to fix after dozens of prompts. Others describe it refactoring entire codebases in a single run. A few say it has already become their daily driver for coding and knowledge work.
But none of those anecdotes capture the real significance of this release.
GPT-5.4 matters because it reveals the next stage of AI system design.
For the past two years, language models have been optimized primarily to answer questions. GPT-5.4 looks increasingly like something else: a system designed to execute work.
Computer use. Tool orchestration. Long-context reasoning. Mid-response steering.
Individually, these features are incremental improvements. Together, they form the foundation of something more ambitious: an operating environment for autonomous AI agents.
The agentic era is beginning to take shape.
One of the clearest signals from GPT-5.4 is the consolidation of OpenAI’s model lineup.
Earlier models in the GPT-5 family had distinct roles. GPT-5.2 emphasized reasoning tasks, while GPT-5.3-Codex focused on software development.
GPT-5.4 merges these capabilities into a single system capable of reasoning, coding, browsing, and orchestrating tools.
This may seem like a small product decision, but it reflects a deeper architectural shift.
For the past several years, building AI workflows meant orchestrating multiple models. Developers routed requests between specialized systems depending on the task. One model for planning, another for coding, another for retrieval, another for multimodal perception.
GPT-5.4 suggests a different future.
Instead of stitching together multiple models, developers can increasingly rely on one general system that handles the entire workflow.
If that trend continues, specialized models like Codex may eventually disappear, replaced by unified systems capable of performing many types of work simultaneously.
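In code, the old orchestration pattern is roughly a routing table, and a unified model collapses that table to a single entry. A minimal sketch of the contrast, with made-up model names:

```python
# Sketch of the multi-model orchestration pattern described above.
# The model names are invented for illustration; real routers also
# handle retries, fallbacks, and cost-aware selection.

SPECIALISTS = {
    "planning":  "planner-model",
    "coding":    "code-model",
    "retrieval": "retrieval-model",
    "vision":    "multimodal-model",
}

def route(task_type: str, unified: bool = False) -> str:
    """Pick the model that should handle this task type."""
    if unified:
        return "general-model"      # one model for the whole workflow
    return SPECIALISTS[task_type]   # stitch together specialists

print(route("coding"))                 # code-model
print(route("coding", unified=True))   # general-model
```

The point of the sketch is what disappears: with a unified model, the routing layer, and the glue code that passes state between specialists, is no longer part of the application.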
Perhaps the most important capability improvement in GPT-5.4 is its support for native computer use.
The model can interact directly with browsers and operating systems, allowing it to execute tasks inside real software environments.
Benchmarks provide some early evidence of progress. GPT-5.4 reaches roughly 75 percent on OSWorld-Verified, a benchmark that measures how effectively AI systems can control real operating systems.
This matters because operating systems are chaotic environments.
Interfaces change. Pages load unpredictably. Actions fail. Feedback loops are incomplete.
Traditional automation tools rely on brittle scripts that break when anything changes. Language models offer something fundamentally different: the ability to reason through unexpected conditions.
If the OSWorld results translate to real workflows, GPT-5.4 may represent the moment where browser and desktop automation becomes genuinely agentic rather than scripted.
One of the most interesting technical details in GPT-5.4 is the introduction of tool search.
OpenAI reports that this feature reduces prompt tokens by nearly half while maintaining accuracy.
At first glance this appears to be a simple optimization. In reality, it addresses one of the most fundamental bottlenecks in agent architecture.
Most tool-calling systems load every tool definition into the prompt. As the number of tools grows, the prompt becomes increasingly large and inefficient.
Tool search changes that dynamic.
Instead of including all tools in the prompt, the model retrieves relevant tools dynamically when needed. In effect, this functions like lazy loading for tools.
The implications are significant.
Agents can now operate across large ecosystems of tools without exhausting the context window. This allows systems to scale beyond small curated toolsets toward environments resembling full software ecosystems.
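OpenAI has not published how tool search works internally, but the general idea can be sketched. The toy retriever below uses naive keyword overlap in place of whatever retrieval OpenAI actually uses, and all of the tool names are invented:

```python
# "Lazy loading" for tools: instead of putting every tool definition
# into the prompt, retrieve only the few most relevant ones.
# The retriever here is naive keyword overlap; a real system would
# likely use embeddings. Tool names are illustrative, not OpenAI's.

TOOL_REGISTRY = {
    "read_file":  "Read the contents of a file from disk",
    "write_file": "Write text content to a file on disk",
    "run_tests":  "Run the project's test suite and report failures",
    "web_search": "Search the web and return result snippets",
    "send_email": "Send an email to a recipient with subject and body",
}

def search_tools(task: str, k: int = 2) -> list[str]:
    """Return the k tool names whose descriptions best match the task."""
    task_words = set(task.lower().split())
    scored = [
        (len(task_words & set(desc.lower().split())), name)
        for name, desc in TOOL_REGISTRY.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]

def build_prompt(task: str) -> str:
    """Only the retrieved tool definitions enter the prompt."""
    tools = search_tools(task)
    defs = "\n".join(f"- {name}: {TOOL_REGISTRY[name]}" for name in tools)
    return f"Task: {task}\nAvailable tools:\n{defs}"

print(build_prompt("run the test suite and report failures"))
```

Note that the prompt size now grows with the number of *relevant* tools rather than the total number of registered tools, which is what makes large tool ecosystems viable.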
In other words, GPT-5.4 moves closer to something that behaves like a runtime environment for AI agents.
Benchmarks for GPT-5.4 reveal a pattern that differs from earlier model releases.
Instead of dramatic improvements on a single task, the gains appear distributed across multiple dimensions.
On MMMU-Pro, the model reaches around 81 percent, reflecting improved multimodal perception. The system also supports images up to 10 megapixels, enabling more detailed visual analysis.
On SWE-Bench Pro, GPT-5.4 performs roughly on par with GPT-5.3-Codex, suggesting coding ability itself has not dramatically increased.
However, the model introduces a fast mode that runs roughly 1.5 times faster, implying improvements in reasoning efficiency rather than simply scaling model size.
Meanwhile, GPT-5.4 scores 83 percent on GDPval, a benchmark designed to measure performance on economically valuable knowledge work.
That metric is particularly interesting. It attempts to capture something closer to real professional tasks rather than academic reasoning problems.
Across these benchmarks, a pattern emerges.
The improvements are not focused on making the model better at answering trivia questions. They are focused on making it more capable of executing real workflows.
Another subtle but important shift appears in pricing.
GPT-5.4 is more expensive per token than GPT-5.2. However, it is also significantly more token-efficient, meaning many tasks require fewer tokens to complete.
This reflects a broader change in how AI performance is measured.
For years, model improvements were framed primarily in terms of scale: more parameters, larger context windows, higher benchmark scores.
Increasingly, the relevant metric is cost per completed task.
A model that requires fewer steps to solve a problem can be more economically valuable even if each individual token costs more.
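The arithmetic is worth making explicit. With invented prices and token counts, chosen only to illustrate the trade-off:

```python
# Illustrative arithmetic (made-up prices and token counts): a model
# with a higher per-token price can still be cheaper per completed
# task if it finishes the task in fewer tokens.

def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Cost in dollars to complete one task."""
    return price_per_mtok * tokens_per_task / 1_000_000

older = cost_per_task(price_per_mtok=5.0, tokens_per_task=120_000)  # $0.60
newer = cost_per_task(price_per_mtok=8.0, tokens_per_task=60_000)   # $0.48

print(f"older model: ${older:.2f} per task")
print(f"newer model: ${newer:.2f} per task")
```

Here the newer model charges 60 percent more per token yet completes the task 20 percent cheaper, because it needs half the tokens.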
This shift toward efficiency may explain why recent models often feel faster and more decisive even when benchmark improvements appear modest.
The most revealing signals often come from how developers actually use a model.
Several early testers report that GPT-5.4 behaves differently from earlier GPT-5 models. It plans more effectively, communicates reasoning more clearly, and handles complex debugging tasks with fewer iterations.
One developer reported that GPT-5.4 solved a bug that GPT-5.3-Codex had failed to resolve after dozens of attempts.
Another described a massive automated refactor involving more than a thousand tool invocations and over one hundred thousand lines of generated code.
The resulting system did not fully run afterward, but the architectural restructuring was described as surprisingly coherent.
This illustrates the current frontier of AI coding.
Models can increasingly restructure complex systems, even if the final execution still requires human verification.
Testing GPT-5.4 reveals interesting differences between reasoning levels.
Medium reasoning often begins executing immediately without planning, behaving like a fast coding assistant.
Higher reasoning levels tend to generate a plan before acting.
The highest reasoning modes spend more time gathering context, sometimes exploring files outside the current working directory before writing code.
These differences increasingly resemble different agent behaviors rather than simple thinking intensities.
In other words, reasoning modes are starting to look like configurable strategies for how an AI agent approaches a task.
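One way to picture this: each reasoning level selects a different sequence of agent phases. The phase lists below mirror the behaviors described above, but the structure is my own framing, not an official API:

```python
# Reasoning levels modeled as configurable agent strategies.
# The phases correspond to observed behaviors (act immediately,
# plan first, explore context first); this mapping is illustrative.

STRATEGIES = {
    "medium": ["execute"],                           # act immediately
    "high":   ["plan", "execute"],                   # plan, then act
    "max":    ["gather_context", "plan", "execute"], # explore first
}

def run_task(task: str, effort: str) -> list[str]:
    """Return the sequence of phases the agent would perform."""
    return [f"{phase}: {task}" for phase in STRATEGIES[effort]]

for effort in ("medium", "high", "max"):
    print(effort, "->", run_task("fix failing test", effort))
```

Seen this way, choosing a reasoning level is less like turning a "thinking" dial and more like selecting a workflow.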
One of my own experiments involves generating large voxel environments in a single prompt.
The first voxel pagoda garden scene generated by Codex 5.4 crossed the 100,000-voxel threshold.
This is the first time a GPT-series model has reached that scale in my one-shot voxel generation tests.
Other frontier models have recently crossed similar thresholds.
That may sound like a niche milestone, but it reveals something interesting about model capability.
Large voxel scenes require consistent reasoning across tens of thousands of structured elements. Generating them successfully suggests models are becoming increasingly capable of maintaining coherence across large structured outputs.
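Scoring a test like this can be done mechanically. A sketch, assuming the model's output has been parsed into (x, y, z, color) tuples; both the format and the duplicate-based coherence proxy are my own choices, not part of any standard benchmark:

```python
# Scoring a one-shot voxel generation, assuming the output has been
# parsed into (x, y, z, color) tuples. Duplicate positions serve as a
# cheap proxy for lost coherence across a large structured output.

def score_scene(voxels: list[tuple[int, int, int, str]]) -> dict:
    """Count unique voxel positions and flag duplicates."""
    positions = [(x, y, z) for x, y, z, _ in voxels]
    unique = set(positions)
    return {
        "total": len(voxels),
        "unique": len(unique),
        "duplicates": len(voxels) - len(unique),
        "crossed_100k": len(unique) >= 100_000,
    }

# Tiny synthetic example with one duplicated position.
demo = [(0, 0, 0, "red"), (1, 0, 0, "green"), (0, 0, 0, "blue")]
print(score_scene(demo))
```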
Despite the progress, GPT-5.4 is not dominant across every domain.
Frontend design quality has improved, but models like Claude Opus and Gemini Pro still appear slightly stronger in visual taste.
Some developers also report that GPT-5.4 can be more aggressive in generating solutions, occasionally introducing subtle errors that require correction.
These limitations are reminders that the technology remains imperfect, even as it becomes dramatically more capable.
The most important takeaway from GPT-5.4 is not any single benchmark.
It is the direction of the architecture.
The pieces are beginning to converge:
computer use, tool ecosystems, unified reasoning and coding, steerable execution, and large context windows.
Together they form the foundation of systems that can plan tasks, execute them across software environments, and verify results.
In other words, systems that behave less like chatbots and more like autonomous workers.
GPT-5.4 may not yet be a full agent operating system.
But it looks increasingly like the first model designed to run one.