
Artificial Analysis Redefines AI Intelligence: From Test Scores to Real-World Work

Artificial Analysis has significantly reworked how it measures AI capability, shifting its closely watched Intelligence Index away from traditional academic-style tests and toward benchmarks that ask a more direct question: can these systems actually do economically valuable work? For enterprises and developers choosing between rapidly evolving frontier models, this change reframes what “best model” really means in 2026.

The Intelligence Index v4.0: What Changed and Why It Matters

The new Intelligence Index v4.0 is a structural reset, not a minor tune-up. Artificial Analysis now aggregates performance across ten evaluations grouped into four equally weighted pillars: Agents, Coding, Scientific Reasoning, and General Knowledge. That balance is designed to give buyers a more complete picture of a model’s strengths and weaknesses, rather than rewarding narrow excellence on a few saturated leaderboards.
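
To see how an equally weighted pillar structure behaves, consider the sketch below. The pillar names come from the index itself; the per-evaluation scores and the simple within-pillar averaging are illustrative assumptions, not Artificial Analysis's published aggregation code.

```python
# Illustrative equal-weight aggregation across the four pillars. The scores
# and within-pillar averaging are assumptions, not Artificial Analysis's formula.

def intelligence_index(pillars: dict[str, list[float]]) -> float:
    """Average each pillar's evaluations, then weight the four pillars equally."""
    pillar_means = [sum(scores) / len(scores) for scores in pillars.values()]
    return sum(pillar_means) / len(pillar_means)

example = {
    "Agents":               [48.0, 52.0],          # hypothetical scores (0-100)
    "Coding":               [55.0, 47.0],
    "Scientific Reasoning": [11.5, 40.0],          # research-level tests stay low
    "General Knowledge":    [54.0, 12.0, 42.0],
}
print(f"Index: {intelligence_index(example):.1f}")  # lands around 40
```

Because every pillar counts the same, a model cannot buy a high composite score with one saturated leaderboard; weakness in any pillar drags the overall number down.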

Three familiar benchmarks—MMLU-Pro, AIME 2025, and LiveCodeBench—have been removed. These tests have appeared for years in model launch decks and blog posts as shorthand for "intelligence." But as leading systems mastered them, they lost their discriminative power: when most frontier models score above 90 percent on a given test, it ceases to be a useful signal for procurement decisions.

Artificial Analysis explicitly responds to this saturation problem by “making the curve harder to climb.” On the new v4.0 scale, top models now score 50 or below, compared to around 73 under the previous index. The recalibration restores headroom so that future improvements show up meaningfully in the numbers, rather than being compressed into an indistinguishable band at the top.

Under the updated methodology, OpenAI’s GPT-5.2 with extended reasoning effort takes the top overall slot, followed closely by Anthropic’s Claude Opus 4.5 and Google’s Gemini 3 Pro. GPT-5.2 is positioned by OpenAI as its most capable model series for “professional knowledge work,” while Anthropic’s Claude Opus 4.5 pulls ahead on SWE-Bench Verified, a demanding benchmark for software engineering tasks. For enterprise buyers, this split highlights an emerging reality: no single model cleanly dominates across all categories that matter for production deployments.

The underlying shift was succinctly captured by researcher Aravind Sundar, who noted that intelligence is now being measured “less by recall and more by economically useful action.” For organizations, that reframing is critical: the ability to recite facts or ace competition math is less important than reliably delivering work products that integrate into real workflows.

From Exams to Output: GDPval-AA and “Can This Model Do the Job?”

The centerpiece of the overhaul is GDPval-AA, a new evaluation built on OpenAI’s GDPval dataset. Instead of puzzle-like exam questions, GDPval-AA challenges models with real-world tasks spanning 44 occupations and nine major industries. The deliverables mirror what knowledge workers actually produce: documents, slide decks, diagrams, spreadsheets, and multimedia content.

Artificial Analysis runs these tasks in an “agentic” configuration using its reference harness, Stirrup. Models receive shell access and web browsing capabilities, reflecting how they are increasingly deployed in enterprises—as tools that can call other tools, navigate systems, and assemble composite outputs, rather than simply answer a prompt in one shot.
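
Stirrup's internals are not spelled out here, so the loop below is only a generic sketch of the agentic pattern described: the model alternates between tool calls (shell, browsing) and a final deliverable. All of the names (model.step, run_shell, fetch_page) are hypothetical stand-ins.

```python
# Generic agentic tool loop in the spirit described above. Stirrup's actual
# interface is not public here; model.step, run_shell, and fetch_page are
# hypothetical stand-ins.
import subprocess
from urllib.request import urlopen

def run_shell(cmd: str) -> str:
    """Sandboxed shell tool: run a command and return combined output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def fetch_page(url: str) -> str:
    """Minimal web-browsing tool: fetch a page's raw text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def agent_loop(model, task: str, max_steps: int = 20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model.step(history)        # model chooses a tool call or finishes
        if action.kind == "final":
            return action.content           # the composite deliverable
        if action.kind == "shell":
            history.append({"role": "tool", "content": run_shell(action.command)})
        elif action.kind == "browse":
            history.append({"role": "tool", "content": fetch_page(action.url)})
    return None                             # step budget exhausted
```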

Performance is scored through blind pairwise comparisons of outputs, with Elo ratings frozen at the time of evaluation to stabilize the index. In this environment, OpenAI's GPT-5.2 with extended reasoning leads with an Elo rating of 1442. Anthropic's Claude Opus 4.5 (non-thinking variant) follows at 1403, with Claude Sonnet 4.5 at 1259.
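
The pairwise setup maps naturally onto the standard Elo update, sketched below. The K-factor and starting ratings here are conventional chess-style defaults, not Artificial Analysis's disclosed parameters.

```python
# Standard Elo update from blind pairwise comparisons. K=32 is a conventional
# choice, not Artificial Analysis's parameter.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A's output is preferred over B's under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return the pair of updated ratings after one blind comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# At the reported ratings (1442 vs 1403), the higher-rated model's output
# would be expected to win a blind comparison only ~56% of the time:
print(f"{expected_score(1442, 1403):.2f}")
```

That 56% figure is a useful intuition check: a 39-point Elo gap near the top of the leaderboard means the race is tight, not settled.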

On the underlying GDPval evaluation, OpenAI reports that GPT-5.2 beat or tied top human professionals on 70.9% of well-specified tasks, and “outperforms industry professionals” across the 44 occupations included. OpenAI cites customers such as Notion, Box, Shopify, Harvey, and Zoom as observing strong long-horizon reasoning and tool-usage performance.

For enterprise decision-makers, the significance is twofold:

First, GDPval-AA collapses the distance between benchmark and production reality. Instead of extrapolating from abstract test scores to business value, buyers get a direct measure of how often a model can generate outputs that resemble what internal teams already create.

Second, the evaluation surface makes it easier to map capability to specific roles and workflows. Because the tasks are drawn from defined occupations and industries, technology leaders can more readily ask, “How close is this model to replacing, augmenting, or reshaping this job family?” without relying entirely on vendor narratives.

The philosophical move is clear: benchmarks are no longer just about whether a model looks “smart” on paper, but whether it can stand in as a productive—if still supervised—participant in economically meaningful work.

Physics and the Ceiling of Current AI: Lessons from CritPT


While GDPval-AA emphasizes economic productivity, another new evaluation, CritPT, highlights how far even frontier models remain from genuine scientific reasoning. Developed by more than 50 active physics researchers from over 30 leading institutions, CritPT focuses on research-level tasks in modern physics, including condensed matter, quantum physics, and astrophysics.

Rather than short exercises, CritPT consists of 71 composite challenges that simulate complete, entry-level research projects: the kind of problems junior graduate students might receive as warm-ups from a principal investigator. Each task is hand-curated to resist simple pattern-matching and is paired with a machine-verifiable answer, making it difficult for models to succeed via superficial heuristics.
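
CritPT's actual grading code is not detailed here, but "machine-verifiable" typically implies programmatic checks such as relative-tolerance comparison for numeric results. A minimal sketch of that pattern, assuming numeric final answers:

```python
# Illustration of machine-verifiable grading via relative-tolerance
# comparison; this is an assumed pattern, not CritPT's actual verifier.
import math

def verify_numeric(submitted: float, reference: float, rel_tol: float = 1e-3) -> bool:
    """Accept an answer if it matches the reference within 0.1% relative error."""
    return math.isclose(submitted, reference, rel_tol=rel_tol)

assert verify_numeric(6.582e-16, 6.5821e-16)   # hbar in eV*s, close enough
assert not verify_numeric(6.0e-16, 6.582e-16)  # off by ~9%, rejected
```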

On this benchmark, the scores are strikingly low. GPT-5.2 with extended reasoning currently leads the CritPT leaderboard with a score of just 11.5%, followed by Google's Gemini 3 Pro Preview and Anthropic's Claude Opus 4.5 Thinking. The takeaway is not that progress has stalled, but that headline-grabbing performance on consumer-facing tests masks serious limitations in deep, multi-step scientific reasoning.

For enterprises, CritPT serves less as a day-to-day selection tool and more as a ceiling indicator. It underscores that while AI systems may be ready to assist with documentation, code refactoring, data summaries, or routine analysis, they are far from autonomously solving open-ended research problems—even in domains where strong intuition and theoretical understanding are critical.

This gap has practical implications for R&D-heavy organizations. AI can be expected to accelerate literature review, basic calculations, and initial modeling, but the CritPT results caution against assuming that frontier models can reliably handle the core creative and conceptual work of scientific discovery without extensive human oversight.

Measuring Hallucinations, Not Just Accuracy: AA-Omniscience

Another major addition to the index, AA-Omniscience, tackles one of the thorniest issues in enterprise AI adoption: not just how accurate a model is, but how often it fabricates plausible-sounding but incorrect information—and whether it can recognize the limits of its own knowledge.

AA-Omniscience tests models across 6,000 questions drawn from 42 economically relevant topics, covering six domains: Business, Health, Law, Software Engineering, Humanities & Social Sciences, and Science/Engineering/Mathematics. The evaluation produces an Omniscience Index that explicitly rewards precise knowledge while penalizing hallucinated responses, providing a more nuanced view than accuracy alone.
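
Artificial Analysis's exact formula is not reproduced here, but a scoring rule with the stated properties is easy to sketch: reward correct answers, penalize wrong ones, and treat abstentions as neutral. With counts chosen to mirror the Gemini figures reported below, this assumed rule lands near the published index score.

```python
# Assumed scoring rule with the properties described above; not Artificial
# Analysis's published formula.

def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """+1 per correct answer, -1 per hallucination, 0 per abstention."""
    total = correct + incorrect + abstained
    return 100.0 * (correct - incorrect) / total

def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Of the questions the model did not get right, how often did it guess
    wrong instead of abstaining? (One plausible reading of the metric.)"""
    return incorrect / (incorrect + abstained)

# 6,000 questions; 54% accuracy, yet the model guesses on nearly everything
# it does not know:
print(omniscience_index(correct=3240, incorrect=2430, abstained=330))  # 13.5
print(hallucination_rate(incorrect=2430, abstained=330))               # ~0.88
```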

The findings are uncomfortable but operationally important: high accuracy does not guarantee low hallucination. Models that answer correctly more often may also be more willing to guess when uncertain, which can be riskier than a model that is slightly less accurate overall but more conservative about answering outside its competence.

On the Omniscience Index, Google’s Gemini 3 Pro Preview leads with a score of 13, followed by Claude Opus 4.5 Thinking and Gemini 3 Flash Reasoning at 10. However, when separated into raw accuracy and hallucination rate, the picture becomes more complex.

By accuracy alone, Google's two Gemini models top the chart at 54% and 51%, with Anthropic's Claude Opus 4.5 Thinking at 43%. But both Gemini models also exhibit relatively high hallucination rates of 88% and 85%. Anthropic's Claude Sonnet 4.5 Thinking and Claude Opus 4.5 Thinking show lower hallucination rates of 48% and 58%, respectively, while OpenAI's GPT-5.1 with high reasoning effort records a 51% hallucination rate, the second-lowest among tested models.

Within the overall Intelligence Index v4.0, both Omniscience Accuracy and Hallucination Rate each carry a 6.25% weighting. For organizations in regulated or high-stakes environments, this structure surfaces an essential trade-off: the most aggressive, “always answer” models may excel on some productivity metrics but introduce unacceptable compliance or safety risks if left unsupervised.
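
In back-of-the-envelope terms, the two Omniscience components together control 12.5 points of a 100-point composite. The normalization below (inverting the hallucination rate so that fewer hallucinations score higher) is an assumption for illustration, not the published method.

```python
# Contribution of the two Omniscience components at 6.25% each; the inversion
# of hallucination rate into a 0-100 score is an assumed normalization.
accuracy_score = 54.0                 # from the accuracy figures above
hallucination_score = 100.0 - 88.0    # fewer hallucinations = higher score
contribution = 0.0625 * accuracy_score + 0.0625 * hallucination_score
print(f"{contribution:.2f} of 12.5 possible index points")  # 4.12
```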

How OpenAI, Google, and Anthropic Now Compare Under the New Lens

The timing of this benchmark overhaul coincides with a particularly intense phase of the AI arms race. OpenAI, Google, and Anthropic have each launched major new models within weeks of each other, and Google’s Gemini 3 still holds many top positions on LMArena, another widely referenced leaderboard for large language models.

Google’s November release of Gemini 3 reportedly prompted OpenAI to declare a “code red” internally to improve ChatGPT, concentrating resources on closing the perceived gap. OpenAI is under considerable pressure to justify its roughly $500 billion valuation and more than $1.4 trillion in planned spending, with its GPT model family at the center of that investment thesis. OpenAI executives have said they expected to exit this code-red phase by January.

Anthropic, for its part, responded with Claude Opus 4.5 on November 24, posting an 80.9% accuracy score on SWE-Bench Verified and reclaiming the coding benchmark lead from both OpenAI’s GPT-5.1-Codex-Max and Google’s Gemini 3. That release marked Anthropic’s third major model launch in two months, and it has since attracted multi-billion-dollar investments from Microsoft and Nvidia, pushing its valuation to about $350 billion.

Within Artificial Analysis’s new framework, these competitive dynamics become more granular. GPT-5.2’s lead on the Intelligence Index is anchored in its broader performance across professional tasks, especially under extended reasoning. Anthropic shows strong results in coding benchmarks and competitive standings on scientific reasoning tests. Google’s Gemini line surfaces as particularly strong on raw factual accuracy, while the Omniscience evaluation flags higher hallucination rates that enterprises will need to manage.

For AI-savvy developers, the result is a more differentiated map of the frontier: no model sits on an uncontested summit. Instead, strengths cluster by workload type—agentic, coding-heavy, research-oriented, or knowledge-intensive—encouraging multi-model strategies and workload-specific evaluation rather than treating “#1 overall” as a sufficient purchasing signal.

Inside Artificial Analysis’s Methodology: Independence, Fairness, and Definitions

Artificial Analysis emphasizes that its evaluations are run independently under a standardized methodology that aims for both fairness and real-world relevance. The organization reports that, for the Intelligence Index, it can estimate a 95% confidence interval of less than ±1 percentage point based on experiments with more than 10 repeats on certain models. For enterprises, that kind of stability matters: it reduces the risk that selection decisions hinge on statistical noise.
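
That claim is easy to sanity-check: with n repeated runs, a 95% confidence interval on the mean is roughly the mean plus or minus 1.96 standard errors. The run scores below are hypothetical.

```python
# Sanity check on the quoted confidence interval; these run scores are
# hypothetical, not Artificial Analysis's data.
import math
import statistics

runs = [49.1, 50.3, 49.8, 50.6, 49.5, 50.0, 49.9, 50.4, 49.7, 50.2]  # 10 repeats
mean = statistics.mean(runs)
sem = statistics.stdev(runs) / math.sqrt(len(runs))   # standard error of the mean
half_width = 1.96 * sem                               # 95% CI half-width
print(f"{mean:.1f} +/- {half_width:.2f}")             # well under 1 point
```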

The published methodology also clarifies several definitions that are increasingly important for buyers navigating a crowded ecosystem of providers, endpoints, and licensing models:

• An endpoint is defined as a hosted instance of a model accessible via an API. The same base model may therefore appear in multiple forms across different providers.

• A provider is any company hosting and exposing access to one or more model endpoints or systems.

• The organization distinguishes between “open weights” models—which have released model weights publicly—and truly open-source models. Many popular “open” LLMs use licenses that do not align with traditional open-source software definitions, a nuance that matters for enterprises with strict legal or compliance requirements.

To standardize cost and throughput comparisons, Artificial Analysis uses OpenAI tokens, as measured with the tiktoken package, as a common unit across providers. That normalization helps buyers compare models without getting lost in incompatible tokenization schemes.
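
In practice, that normalization amounts to counting every provider's text with the same tokenizer. The sketch below uses tiktoken's cl100k_base encoding; the price per million tokens is hypothetical.

```python
# Normalizing cost comparisons with OpenAI's tiktoken tokenizer, as the
# article describes. The price per million tokens is hypothetical.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common OpenAI encoding

def cost_in_openai_tokens(text: str, usd_per_million_tokens: float) -> float:
    """Count tokens with a single tokenizer so prices are comparable
    across providers with different native tokenization schemes."""
    n_tokens = len(enc.encode(text))
    return n_tokens * usd_per_million_tokens / 1_000_000

sample = "Quarterly revenue grew 12% year over year, driven by cloud services."
print(cost_in_openai_tokens(sample, usd_per_million_tokens=3.00))
```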

Artificial Analysis also notes that the Intelligence Index is a text-only, English-language suite. Image inputs, speech, and multilingual performance are benchmarked separately. Enterprises with heavy non-English workloads or multimodal requirements should therefore treat the index as a core but not complete signal when assessing fit.

What Enterprise Leaders Should Take Away for 2026 Buying Decisions


For enterprise technology leaders and AI-focused developers, the Intelligence Index v4.0 offers a more actionable set of signals than earlier benchmark compilations—but it also demands more nuanced interpretation.

First, treat the overall index as a starting point, not an endpoint. Because Artificial Analysis now weights Agents, Coding, Scientific Reasoning, and General Knowledge equally, a model that ranks highly overall might be overkill—or misaligned—for a narrow use case. For example, a company prioritizing code migration might care more about SWE-Bench Verified performance and coding-related agent benchmarks than about research physics scores.

Second, incorporate hallucination metrics into risk assessments. The AA-Omniscience findings show that top accuracy does not automatically mean lower overall risk. In healthcare, finance, or law, a model that occasionally says “I don’t know” may be preferable to one that answers confidently but sometimes invents facts. The explicit weighting of both accuracy and hallucination rate within the index helps teams formalize that trade-off.

Third, map GDPval-AA tasks to your own workflows. Because GDPval-AA is built from tasks across specific occupations and industries, it can serve as a bridge between lab performance and line-of-business reality. Teams should ask how closely their internal tasks resemble those in GDPval-AA—both in format (documents, spreadsheets, slides) and in complexity—and where additional internal evaluation is needed.

Finally, recognize the limits highlighted by CritPT. If your AI strategy involves advanced R&D or scientific modeling, the modest CritPT scores are a reminder that today’s models function best as accelerators and assistants, not autonomous principal investigators. Human expertise and oversight remain central, particularly where errors carry high experimental or safety costs.

Initial reactions to Artificial Analysis’s overhaul have been largely positive, especially from observers who welcome reduced benchmark saturation and a stronger focus on agent performance and real-world tasks. Some commentators have gone further, predicting that an upcoming wave of models will quickly surpass today’s leaders and make debates over current rankings moot.

Whether or not such predictions materialize, the direction of travel is clear. The industry is moving beyond boasting about exam-style scores to asking a more consequential question: not just “how smart does this model look on tests?” but “what work can it reliably do, at what risk level, and under what supervision?” In 2026, that is the question enterprises will increasingly be judged on when they choose which AI systems to put into production.
