OpenAI’s latest model, GPT-5.4, is explicitly pitched as more than another incremental chatbot upgrade. With native computer-use capabilities, long-context planning, and deep spreadsheet integrations for finance teams, OpenAI is aiming this release at sustained, multi-step professional workflows rather than one-off answers.
The model ships in two variants—GPT-5.4 Thinking and GPT-5.4 Pro—targets both developers and enterprise knowledge workers, and arrives just days after the company launched GPT-5.3 Instant. It also comes with a noticeable price premium, which OpenAI argues is justified by higher capability and better token efficiency.
From GPT-5.3 to GPT-5.4: what’s actually new?
GPT-5.4 is offered in two tiers. GPT-5.4 Thinking is aimed at advanced reasoning and general-purpose use, while GPT-5.4 Pro is reserved for the heaviest, most complex workloads. Both variants are available through OpenAI’s paid API and its Codex development environment. On the ChatGPT side, GPT-5.4 Thinking is included for Plus subscribers and above, while Pro and Enterprise users get access to GPT-5.4 Pro. Free ChatGPT users may be auto-routed to GPT-5.4 in some cases but cannot select it directly.
OpenAI emphasizes efficiency as a core differentiator. In at least one benchmark involving tool search across a large set of tools, the system configuration using GPT-5.4 reduced token usage by 47% without sacrificing accuracy compared with a naive setup that exposed all tools directly. The company is careful to clarify that this 47% figure is tied to that specific evaluation, not a blanket claim that all tasks are nearly halved in token count, but it underscores a design goal: reduce the overhead of complex workflows.
The other headline additions are native computer use, improved tool orchestration, and long-context support up to 1 million tokens in the API and Codex, with differential pricing beyond 272,000 tokens. Taken together, these features are positioned as infrastructure for agents that keep state, work across applications, and return output closer to real office deliverables—like models, decks, and structured analysis—rather than just text responses.
Native computer use: what it means for agentic workflows

The most consequential technical change in GPT-5.4 is OpenAI’s new “native” computer-use mode, available via Codex and the API. Instead of limiting the model to calling tools or producing code that a human later runs, GPT-5.4 can directly operate a computer environment: issuing mouse and keyboard actions in response to screenshots and generating code for libraries such as Playwright to drive applications.
OpenAI backs the claim that this is more than a UI wrapper with a set of benchmarks focused on agentic tasks. On BrowseComp, which measures persistent web browsing to locate hard-to-find information, GPT-5.4 improves by 17 percentage points over GPT-5.2, and GPT-5.4 Pro reaches 89.3%, which OpenAI describes as a new state of the art. On OSWorld-Verified—desktop navigation using screenshots plus mouse and keyboard—the model achieves 75.0% success, substantially higher than GPT-5.2’s 47.3% and above reported human performance at 72.4%.
Other benchmarks show incremental but notable gains in web interaction and screenshot-based navigation, including 67.3% success on WebArena-Verified and 92.8% on Online-Mind2Web using screenshots alone. These results collectively support OpenAI’s positioning of GPT-5.4 as a model that can reliably keep a multi-step workflow going across interfaces and web properties, not just complete a single reasoning step.
Computer use is also tied to improvements in multimodal understanding. On MMMU-Pro, a test of multimodal, multi-discipline understanding, GPT-5.4 scores 81.2% without tools, up from GPT-5.2’s 79.5%, and OpenAI notes this is achieved using fewer “thinking tokens.” On OmniDocBench, which measures document understanding, its average error falls from 0.140 with GPT-5.2 to 0.109 with GPT-5.4, and the model adds support for higher-fidelity image inputs, up to an “original” detail level at 10.24 million pixels.
For software and data teams exploring agents that interact with real systems—filling forms, extracting data from dashboards, or navigating complex web apps—these benchmarks point to a model that is more competitive with human performance in the narrow tasks tested, and in at least one case surpasses it. However, OpenAI’s own framing suggests these are building blocks, not a turnkey replacement for human operations: the focus is on making multi-step, stateful automation more feasible, not on delivering full autonomy.
Tool search and orchestration: cutting the prompt “tax”

As more tools and services are wired into AI systems, a practical problem emerges: feeding every tool definition into every prompt inflates token costs, increases latency, and pollutes context. OpenAI describes this as a “tax” on requests in growing tool ecosystems.
GPT-5.4 introduces tool search in the API as a structural response. Instead of sending full definitions for every tool, developers provide a lightweight list plus a search capability. The model decides when it needs more detail and only then retrieves the full definition. In OpenAI’s example using 250 tasks from Scale’s MCP Atlas benchmark, with 36 MCP servers available, this design cut total token consumption by 47% while keeping accuracy constant versus the naive “all tools in context” approach.
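The deferred-definition pattern behind tool search can be illustrated with a toy registry. This is a sketch of the idea, not OpenAI's implementation: only short summaries travel with every prompt, and a full schema is retrieved when a search matches. The tool names and the `tool_index`/`search_tools` helpers are assumptions for illustration.

```python
# Toy registry: short summaries are always in context; full JSON schemas
# are fetched only on demand via a retrieval step.

TOOLS = {
    "get_quote": {
        "summary": "Fetch the latest price for a ticker",
        "definition": {  # full schema, kept out of the prompt by default
            "name": "get_quote",
            "parameters": {"ticker": {"type": "string"}},
        },
    },
    "dcf_model": {
        "summary": "Run a discounted cash flow model for a company",
        "definition": {
            "name": "dcf_model",
            "parameters": {"ticker": {"type": "string"},
                           "discount_rate": {"type": "number"}},
        },
    },
}

def tool_index() -> list[dict]:
    """Lightweight list sent with every request: names and summaries only."""
    return [{"name": n, "summary": t["summary"]} for n, t in TOOLS.items()]

def search_tools(query: str) -> list[dict]:
    """Retrieval step: return full definitions only for matching tools."""
    q = query.lower()
    return [t["definition"] for n, t in TOOLS.items()
            if q in n.lower() or q in t["summary"].lower()]
```

With dozens of tools, the token savings come from the gap between the size of the index and the size of all full definitions; the model only pays for the definitions it actually needs.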
For engineering teams responsible for cost control and latency, this matters. Tool usage often scales faster than prompt complexity: as more internal services, data sources, and workflows become tools, the naive approach becomes less tenable. By shifting tool selection into a retrieval step, GPT-5.4 aims to keep orchestration manageable even as the tool graph grows.
The model’s broader positioning reinforces this direction. OpenAI repeatedly ties GPT-5.4 to “longer, multi-step workflows”—suggesting usage patterns where an agent may call many tools, maintain state, and retry operations. Reduced overhead per step, both in tokens and orchestration complexity, is crucial if such systems are to be economically viable in production.
Coding, /fast mode, and Codex workflows
On the development side, GPT-5.4 is pitched as combining the coding strengths of GPT-5.3-Codex with stronger tool use and computer-use capabilities, particularly for tasks that are not single-shot. OpenAI reports that GPT-5.4 matches or outperforms GPT-5.3-Codex on SWE-Bench Pro, a benchmark for software engineering tasks, while also offering lower latency across reasoning efforts.
Codex gains additional “workflow-level knobs,” most notably a /fast mode that OpenAI describes as providing up to 1.5× faster performance on supported models, including GPT-5.4, without changing the underlying model weights or intelligence. For teams integrating AI into CI pipelines, test generation, or interactive coding assistants, this kind of latency control is a practical lever for adapting performance to the use case.
OpenAI is also shipping an experimental Codex skill dubbed “Playwright (Interactive)” that showcases how code generation and computer use can work together in practice—for example, visually debugging web or Electron apps and testing an app while it is being built. This is framed as a demonstration of what the new computer-use abilities can enable in developer tooling rather than as a production-ready feature.
For software teams, the picture that emerges is of a model and toolchain optimized for longer-lived coding workflows—debugging, test runs, UI interactions—rather than just generating a single code snippet. The benchmarks and features focus more on reliability and workflow integration than on raw code generation novelty.
Finance-focused Excel and Sheets integrations

Alongside GPT-5.4, OpenAI is introducing a suite of secure, finance-oriented products in ChatGPT designed for enterprises and financial institutions. The centerpiece is ChatGPT for Excel and Google Sheets (beta), which effectively embeds GPT-5.4 inside spreadsheets to build, analyze, and update complex financial models within existing formulas and structures.
OpenAI pairs these spreadsheet integrations with new ChatGPT app connections intended to unify market, company, and internal data in a single workflow, explicitly naming FactSet, MSCI, Third Bridge, and Moody’s as integrated data providers. It also introduces reusable “Skills” that encapsulate recurring finance tasks such as earnings previews, comparables analysis, discounted cash flow (DCF) modeling, and investment memo drafting.
To support the finance pitch, OpenAI cites an internal investment banking benchmark in which performance rose from 43.7% with GPT-5 to 88.0% with GPT-5.4 Thinking. In a separate internal benchmark of spreadsheet modeling mirroring junior investment banking analyst work, GPT-5.4 achieved a mean score of 87.5%, compared with 68.4% for GPT-5.2.
External testers quoted by OpenAI reinforce this framing: Daniel Swiecki of Walleye Capital reports a 30 percentage point improvement in accuracy on internal finance and Excel evaluations, and links this to greater automation in model updates and scenario analysis. Brendan Foody, CEO of Mercor, calls GPT-5.4 the best model the company has tested and notes it leads Mercor’s APEX-Agents benchmark for professional services deliverables such as slide decks, financial models, and legal analysis.
The net message for finance and operations leaders is that GPT-5.4 is being tuned and packaged specifically for spreadsheet-heavy workflows, with integrations and reusable patterns intended to move the model closer to day-to-day analytical tasks rather than abstract language capabilities.
Benchmarks vs. professional work: where GPT-5.4 stands
Beyond narrow technical benchmarks, OpenAI highlights evaluations that attempt to resemble real-world knowledge work. On GDPval, an assessment spanning “well-specified knowledge work” across 44 occupations, GPT-5.4 is reported to match or exceed industry professionals in 83.0% of comparisons, an increase from 71.0% for GPT-5.2.
OpenAI points to specific improvements in failure-prone artifacts such as structured tables, formulas, narrative coherence, and design quality. On a set of presentation tasks, human raters preferred GPT-5.4’s slide decks 68.0% of the time over GPT-5.2’s, citing better aesthetics, greater visual variety, and more effective use of image generation.
These benchmarks are still internal or curated rather than industry standards, and OpenAI does not position them as definitive measures of human equivalence. However, they indicate a direction: performance assessments that are closer to what a junior analyst or associate would actually produce—spreadsheets, narratives, decks—rather than synthetic puzzles. For organizations evaluating GPT-5.4, they provide some evidence that the model is being optimized for the shape of real deliverables, not just token-level accuracy.
Reliability, hallucinations, and risk for enterprise use
OpenAI characterizes GPT-5.4 as its “most factual” model to date and anchors this to de-identified user prompts where earlier models were flagged for factual errors. On this dataset, GPT-5.4’s individual claims are 33% less likely to be false than GPT-5.2’s, and its full responses are 18% less likely to contain any error at all.
For enterprise deployments, this kind of reliability improvement is at least as important as raw capability. Reduced hallucination rates lower the cost of error handling: fewer corrections from human reviewers, fewer retries from agents, and fewer workflow reruns. Combined with the token and tool efficiencies described elsewhere, GPT-5.4 is positioned as a model that makes it cheaper—financially and operationally—to loop toward a correct result.
OpenAI does not claim hallucinations are solved; the improvements are framed as percentage reductions rather than eliminations. But the combination of higher accuracy, better document handling, and stronger alignment with professional-style outputs is clearly pitched at organizations that have been cautious about letting models into higher-stakes workflows.
Pricing, long context, and how GPT-5.4 fits the wider model market
GPT-5.4 ships under two API names: gpt-5.4 for the Thinking variant and gpt-5.4-pro for the higher-end Pro model. Pricing is:
- GPT-5.4: $2.50 per 1 million input tokens, $15 per 1 million output tokens
- GPT-5.4 Pro: $30 per 1 million input tokens, $180 per 1 million output tokens
- Batch and Flex processing: half rate; priority processing: 2× rate
GPT-5.4 thus sits toward the higher end of the market for API-accessible models, though still below OpenAI’s own GPT-5.2 Pro and well under the top-priced frontier offerings from some competitors. OpenAI also stresses that GPT-5.4 often uses fewer “reasoning tokens” for comparable tasks, an efficiency argument that partially offsets the higher per-token price in scenarios where workflows genuinely consume fewer tokens.
A key pricing nuance is long-context usage. In the API and Codex, GPT-5.4 supports up to 1 million tokens of input context, but any portion above 272,000 tokens is billed at double the normal rate. Codex defaults to compacting prompts down to 272,000 tokens; the higher long-context rate only applies when developers raise that compaction limit and actually send larger prompts. The maximum output remains 128,000 tokens, in line with previous models.
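The billing rule above is easy to model: input tokens up to the 272,000-token threshold are billed at the base rate, and anything beyond it at double. A back-of-envelope calculator, using the published GPT-5.4 (Thinking) rates:

```python
# Cost model for GPT-5.4 long-context pricing: input beyond 272,000 tokens
# is billed at 2x the base input rate. Rates are the published API prices.

BASE_INPUT_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PER_M = 15.00       # USD per 1M output tokens
LONG_CONTEXT_THRESHOLD = 272_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    normal = min(input_tokens, LONG_CONTEXT_THRESHOLD)
    long_ctx = max(input_tokens - LONG_CONTEXT_THRESHOLD, 0)
    cost = (normal * BASE_INPUT_PER_M
            + long_ctx * 2 * BASE_INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000
    return round(cost, 4)

# A 500k-token prompt with a 20k-token answer:
# 272k at base rate + 228k at double rate + 20k of output.
print(request_cost(500_000, 20_000))  # prints 2.12
```

The same shape of calculation, with halved rates, applies to Batch and Flex processing; the point is that the long-context surcharge only bites when prompts actually exceed the compaction default.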
Compared to a broad set of competitors—from Qwen to Gemini, Claude, ERNIE, DeepSeek, and others—the baseline cost of GPT-5.4 and GPT-5.4 Pro is higher than many “flash” or mid-tier models but below the most expensive frontier systems. OpenAI’s stated rationale for the premium is threefold: better performance on complex tasks (including coding, computer use, research, advanced document generation, and tool use), research advances aligned with its roadmap, and more efficient reasoning behavior.
For teams evaluating GPT-5.4, the practical question is whether those reported efficiency and accuracy gains translate into lower end-to-end costs for a given workflow compared with cheaper but less capable models.
What GPT-5.4 signals about OpenAI’s strategy
Across the release, GPT-5.4 is framed as a step away from answer-centric chatbots and toward “sustained professional workflows.” The unifying themes—native computer use, tool search, long context, spreadsheet-native finance skills, and reduced factual errors—point toward a model designed to sit at the center of agentic systems that resemble junior knowledge workers.
The emphasis on token efficiency and orchestration is particularly notable. Rather than just scaling model size or raw reasoning performance, OpenAI is targeting the operational frictions that emerge when AI is woven into real processes: prompt bloat, tool sprawl, retries, and error correction. If those frictions can be reduced, higher prices per token may be acceptable where the total cost of a workflow comes down.
For software developers and AI platform teams, GPT-5.4 offers more robust building blocks for agents that act across applications and the web, with new knobs for performance and cost. For finance and operations leaders, the Excel and Sheets integrations, paired with finance-specific skills and benchmarks, point clearly toward automated or semi-automated analytical work.
OpenAI does not claim these systems are fully autonomous or risk-free. Instead, GPT-5.4 is presented as a more capable and reliable component for orchestrated workflows, leaving organizations to decide where to place it within their existing controls, approvals, and human review loops.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.