Building more with GPT-5.1-Codex-Max

The frontier of AI-assisted development has just taken a significant leap forward. OpenAI has unveiled GPT‑5.1-Codex-Max, a new frontier agentic coding model designed to be a more intelligent, efficient, and capable partner for software engineers. This isn’t just an incremental update; it represents a fundamental shift in how AI can handle long-running, complex development tasks.

Built on an enhanced reasoning model, GPT‑5.1-Codex-Max is engineered for the entire development lifecycle. It’s faster, smarter, and, crucially, more token-efficient, promising real-world cost savings for developers. The model is now available across the Codex ecosystem—in the CLI, IDE extensions, cloud, and code review tools—with API access on the horizon.

Unlocking Project-Scale Work with “Compaction”

The standout feature is its native ability to handle "long-running, detailed work." It is OpenAI's first model natively trained to operate across multiple context windows, through a process called compaction. This allows it to work coherently over millions of tokens in a single task by intelligently pruning its history while preserving critical context.

This breakthrough unlocks capabilities previously hindered by context limits:
* Project-scale refactors
* Deep, multi-hour debugging sessions
* Extended agent loops that can run for days

In practice, the model automatically compacts its session as it nears its context window limit, freeing up space to continue without losing progress. Internally, OpenAI has observed GPT‑5.1-Codex-Max working independently on tasks for over 24 hours, persistently iterating and fixing failures until it delivers a successful result.
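
OpenAI has not published the mechanics of compaction, but conceptually it behaves like an agent loop that, on nearing the context limit, replaces older turns with a summary while keeping the original task and the most recent work verbatim. The sketch below illustrates that idea; the types, thresholds, and helper functions (callModel, summarize) are assumptions for illustration, not the Codex API.

```typescript
// Hypothetical sketch of a compaction-style agent loop.
// Names, thresholds, and helpers here are illustrative, not the Codex implementation.

type Turn = { role: "user" | "assistant" | "tool"; content: string };

const MAX_TOKENS = 400_000;        // assumed context budget
const COMPACTION_THRESHOLD = 0.9;  // compact when ~90% full (assumption)

function estimateTokens(turns: Turn[]): number {
  // Rough heuristic: ~4 characters per token.
  return Math.ceil(turns.reduce((n, t) => n + t.content.length, 0) / 4);
}

async function runLongTask(
  task: string,
  callModel: (history: Turn[]) => Promise<Turn>,
  summarize: (turns: Turn[]) => Promise<string>
): Promise<Turn[]> {
  let history: Turn[] = [{ role: "user", content: task }];

  while (true) {
    // When the session nears the window limit, prune older turns into a summary
    // while keeping the original task and the most recent work intact.
    if (estimateTokens(history) > MAX_TOKENS * COMPACTION_THRESHOLD) {
      const keepRecent = history.slice(-20); // keep the latest turns verbatim
      const summary = await summarize(history.slice(1, -20));
      history = [
        history[0],
        { role: "assistant", content: `Compacted context: ${summary}` },
        ...keepRecent,
      ];
    }

    const next = await callModel(history);
    history.push(next);
    if (next.content.includes("TASK_COMPLETE")) return history; // illustrative stop condition
  }
}
```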

In this example, GPT‑5.1-Codex-Max is independently refactoring the Codex CLI open-source repository. As the session approaches the model's context window limit, it automatically compacts the session to free up space and continue the task without losing progress. The video has been trimmed and sped up for clarity.

Enhanced Performance and Efficiency

Trained on real-world software engineering tasks—PR creation, code review, frontend coding, Q&A—the model shows substantial gains on frontier coding evaluations. It also brings practical improvements, such as being the first OpenAI model natively trained to operate effectively in Windows environments.

Benchmark Performance:
* SWE-Lancer IC SWE: 79.9% accuracy (vs. 66.3% for GPT-5.1-Codex)
* Terminal-Bench 2.0: 58.1% accuracy (vs. 52.8%)

Perhaps more impactful is its improved token efficiency. On SWE-bench Verified, GPT‑5.1-Codex-Max achieves better performance than its predecessor while using 30% fewer "thinking" tokens. A new "Extra High" reasoning effort is introduced for non-latency-sensitive tasks, but "medium" remains the recommended daily driver. The efficiency gains translate directly into lower costs for developers; for example, the model produces comparably high-quality frontend designs at a significantly lower token cost than its predecessor.
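
To make the efficiency claim concrete, here is a rough back-of-the-envelope calculation; the token counts and per-token price below are placeholders, not published figures.

```typescript
// Back-of-the-envelope savings from 30% fewer thinking tokens.
// PRICE_PER_MILLION and the baseline token count are placeholders, not OpenAI pricing.

const PRICE_PER_MILLION = 10;             // hypothetical $ per 1M thinking tokens
const baselineThinkingTokens = 2_000_000; // hypothetical usage by the predecessor on one task

const maxThinkingTokens = baselineThinkingTokens * (1 - 0.3); // 30% fewer at comparable quality

const baselineCost = (baselineThinkingTokens / 1_000_000) * PRICE_PER_MILLION; // $20.00
const maxCost = (maxThinkingTokens / 1_000_000) * PRICE_PER_MILLION;           // $14.00

console.log(`Savings per task: $${(baselineCost - maxCost).toFixed(2)}`); // "$6.00"
```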

[Interactive demo tabs: CartPole, Solar system sandbox, Kanban, Snell's law; each compares output from GPT-5.1-Codex-Max and GPT-5.1-Codex]

Prompt: Generate a single self-contained browser app that renders an interactive CartPole RL sandbox with canvas graphics, a tiny policy-gradient controller, metrics, and an SVG network visualizer.
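
For readers unfamiliar with the terminology in that prompt, a "tiny policy-gradient controller" usually means a REINFORCE-style update over a small softmax policy. The sketch below shows one such update step for CartPole's 4-dimensional state and 2 actions; it is illustrative and not taken from either model's generated app.

```typescript
// Minimal REINFORCE-style update for CartPole (4 state dimensions, 2 actions).
// A linear policy with a softmax over two logits; all names and constants are illustrative.

type Step = { state: number[]; action: number; reward: number };

const weights = [new Array(4).fill(0), new Array(4).fill(0)]; // one weight row per action
const LEARNING_RATE = 0.01;
const DISCOUNT = 0.99;

function policy(state: number[]): number[] {
  const logits = weights.map(row => row.reduce((sum, w, i) => sum + w * state[i], 0));
  const maxLogit = Math.max(...logits);
  const exps = logits.map(l => Math.exp(l - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total); // action probabilities
}

function reinforceUpdate(episode: Step[]): void {
  // Compute discounted returns from the end of the episode backwards.
  let g = 0;
  const returns = new Array(episode.length).fill(0);
  for (let t = episode.length - 1; t >= 0; t--) {
    g = episode[t].reward + DISCOUNT * g;
    returns[t] = g;
  }
  // Gradient ascent on log pi(action | state) weighted by the return.
  episode.forEach((step, t) => {
    const probs = policy(step.state);
    for (let a = 0; a < weights.length; a++) {
      const indicator = a === step.action ? 1 : 0;
      for (let i = 0; i < step.state.length; i++) {
        weights[a][i] += LEARNING_RATE * returns[t] * (indicator - probs[a]) * step.state[i];
      }
    }
  });
}
```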

A Focus on Safety and Trust

With greater capability comes a heightened focus on safety. The model’s long-horizon reasoning improves its performance in areas like cybersecurity. While it does not yet reach “High” capability under OpenAI’s Preparedness Framework, it is their most capable cybersecurity model to date.

Proactive safeguards are in place:
* Dedicated cybersecurity monitoring to detect and disrupt malicious activity.
* A secure sandboxed default environment for Codex, with restricted file writes and no network access unless explicitly enabled (a rough sketch of this kind of policy follows the list).
* An iterative deployment approach, learning from real-world use to update safeguards while preserving defensive tools like vulnerability scanning.
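
As a rough illustration of the sandboxed default described in the second bullet, the policy can be modeled as data that denies writes outside the workspace and blocks network access unless the user opts in. The interface and field names below are hypothetical, not Codex's actual configuration schema.

```typescript
// Hypothetical model of a sandbox policy like the one described above.
// Field names and defaults are illustrative; this is not Codex's real config format.

interface SandboxPolicy {
  workspaceRoot: string;       // the only directory the agent may write to
  allowNetwork: boolean;       // network access stays off unless explicitly enabled
  allowedWritePaths: string[]; // write access restricted to these paths
}

const defaultPolicy: SandboxPolicy = {
  workspaceRoot: "/workspace/project",
  allowNetwork: false, // default: no outbound network access
  allowedWritePaths: ["/workspace/project"],
};

function canWrite(policy: SandboxPolicy, path: string): boolean {
  return policy.allowedWritePaths.some(allowed => path.startsWith(allowed));
}

function canFetch(policy: SandboxPolicy): boolean {
  return policy.allowNetwork; // must be explicitly enabled by the user
}

// A user who needs network access opts in explicitly:
const withNetwork: SandboxPolicy = { ...defaultPolicy, allowNetwork: true };
```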

A crucial reminder for developers: as agents become more autonomous, human oversight remains essential. Codex provides terminal logs and cites its actions to facilitate review. It should be treated as a powerful additional reviewer, not a replacement for human judgment before deploying to production.

Availability and Impact

GPT‑5.1-Codex-Max is available now within Codex for ChatGPT Plus, Pro, Business, Edu, and Enterprise plans, and replaces GPT‑5.1-Codex as the default model. It’s important to note this is a specialized, agentic coding model, distinct from the general-purpose GPT‑5.1, and is recommended for use within Codex or similar coding environments.

The internal impact at OpenAI speaks volumes: 95% of their engineers use Codex weekly, and those engineers have shipped roughly 70% more pull requests since adopting it. This model represents a tangible step toward supercharged engineering productivity, pushing the frontier of what AI agents can accomplish. The era of your AI pair programmer handling a multi-day, complex refactor is no longer science fiction—it’s here.