Nvidia’s Cosmos Reason 2 and Nemotron Upgrades Push ‘Physical AI’ Beyond the Chatbox

Nvidia is using CES 2026 to sharpen its pitch for what CEO Jensen Huang has called the “age of physical AI” – systems where agentic models don’t just reason over text or code, but drive robots, vehicles and other devices in the real world. The company’s latest releases center on two pillars: Cosmos Reason 2, an upgraded vision-language model (VLM) for embodied reasoning, and new members of the Nemotron family aimed at speech, retrieval and safety.

From chatbots to ‘physical AI’: Nvidia’s new focus

Nvidia continues to provide large language models for traditional software use cases, but its roadmap is increasingly organized around AI that operates as agents in both digital and physical environments. At CES 2026, the company announced new models specifically designed to move AI beyond chat interfaces and into physical settings, where perception, planning and safety become first-class concerns.

Kari Briski, Nvidia’s vice president for generative AI software, framed the moment as a turning point for robotics. She argued that the field is moving from narrowly scoped specialist robots toward what she described as “generalist specialist systems” – robots that combine broad foundational knowledge with deep, task-specific skills. According to Briski, Nvidia sees reasoning-capable models as a key ingredient in enabling machines to operate in “the unpredictable physical world.”

This shift matters for AI developers and technical leaders because it redefines what “production AI” looks like. Instead of a single LLM powering a chatbot or coding assistant, Nvidia is advocating for systems-of-models that can perceive, decide and act in complex environments, while drawing on shared infrastructure, datasets and tooling.

Inside Cosmos Reason 2: a VLM for embodied reasoning

The headline release for physical AI is Cosmos Reason 2, Nvidia’s latest vision-language model tailored for embodied reasoning. It succeeds Cosmos Reason 1, which introduced a two-dimensional ontology for reasoning about physical situations and currently leads Hugging Face’s physical reasoning for video leaderboard.

Cosmos Reason 2 keeps that ontology as its backbone, but emphasizes two themes relevant for enterprise deployments:

1. Greater customization flexibility for enterprises. While Nvidia has not disclosed architectural details beyond the ontology, the company is positioning Reason 2 as more adaptable for specific applications. For AI teams, this suggests a model intended to be tuned or integrated into domain-specific workflows – for example, a robotics fleet manager, a warehouse automation system, or an industrial inspection pipeline – rather than a one-size-fits-all perception model.

2. Enabling physical agents to plan next actions. Nvidia says Cosmos Reason 2 lets embodied agents plan their next steps, much as software-based AI agents reason through digital workflows today. In practice, the model is designed not only to interpret visual scenes and answer questions, but also to inform action selection based on its understanding of the environment, such as where a robot should move next or how it should manipulate an object.

Other VLMs, including Google’s PaliGemma and Mistral’s Pixtral Large, can process visual inputs, but Nvidia is explicitly highlighting reasoning as a differentiator. Not all commercially available VLMs are optimized for the type of physical reasoning required in robotics, where understanding causality, affordances, and multi-step tasks is often more important than generic image captioning or classification.

For developers, the practical implication is that Cosmos Reason 2 is meant to sit closer to the “brains” of a robot or embodied agent: ingesting video or image data, interpreting what is happening, and suggesting or informing subsequent actions in real time or near-real time.
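To make the perceive, reason and act loop described above concrete, here is a minimal sketch in Python. It does not use any Nvidia API: `Observation`, `plan_next_action`, `apply_action` and `run_episode` are hypothetical names, and the rule-based planner is only a stand-in for whatever a reasoning VLM would return for a given scene.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Simplified stand-in for what a VLM might extract from a camera frame."""
    objects_on_table: List[str]
    held_object: Optional[str] = None


def plan_next_action(obs: Observation, goal_object: str) -> str:
    """Hypothetical planner: map the current scene to one next action."""
    if obs.held_object == goal_object:
        return "place_in_bin"
    if obs.held_object is not None:
        return "put_down"  # hands are full with the wrong object
    if goal_object in obs.objects_on_table:
        return f"pick_up:{goal_object}"
    return "search"


def apply_action(obs: Observation, action: str) -> Observation:
    """Toy world model: return the scene that results from an action."""
    if action.startswith("pick_up:"):
        target = action.split(":", 1)[1]
        remaining = [o for o in obs.objects_on_table if o != target]
        return Observation(remaining, target)
    if action == "put_down":
        return Observation(obs.objects_on_table + [obs.held_object], None)
    if action == "place_in_bin":
        return Observation(obs.objects_on_table, None)
    return obs


def run_episode(obs: Observation, goal_object: str, max_steps: int = 5) -> List[str]:
    """Alternate between planning and acting until the task ends."""
    actions: List[str] = []
    for _ in range(max_steps):
        action = plan_next_action(obs, goal_object)
        actions.append(action)
        if action in ("place_in_bin", "search"):
            break
        obs = apply_action(obs, action)
    return actions


if __name__ == "__main__":
    start = Observation(objects_on_table=["wrench"], held_object="cup")
    print(run_episode(start, goal_object="wrench"))
```

In a real system the planner would be a model call on live video and the world update would come from sensors, but the control flow of observing, choosing one action, acting and observing again would look broadly similar.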

Cosmos, GR00T, and Nemotron: Nvidia’s physical AI stack

Cosmos Reason 2 is part of a broader set of Nvidia models aimed at physical AI and agentic behavior. On the perception and embodiment side, Nvidia has its Cosmos line for robotics and GR00T, an open vision-language-action (VLA) model. A VLA model does not just see and describe a scene; it is designed to connect perception to actions.

Alongside these, Nvidia offers its Nemotron models as agentic AI foundations. Nemotron is positioned as the reasoning and orchestration layer that can sit above or alongside perception and control models, enabling higher-level task planning and decision-making.

Briski emphasized that Nvidia’s roadmap for these models follows “the same pattern of assets across all of our open models.” In her view, building specialized AI agents – whether as part of a digital workforce or as physical robots and autonomous vehicles – requires more than a single model:

Compute to train and simulate the world, supporting development and validation of agents in rich environments.
Data as the “fuel” that allows models to learn and improve, including large, diverse and open datasets.
Open libraries and training scripts so developers can purpose-build AI for their own applications rather than treat models as black boxes.
Blueprints and deployment examples to show how systems-of-models can be wired together in real-world scenarios.

Nvidia’s argument to enterprises is that open models across robotics, perception, language, and agentic reasoning can form a shared ecosystem. Data and training improvements in one area – say, better multilingual retrieval or safer data handling – can benefit agents that operate in both digital workflows and physical machines.

Cosmos Transfer: simulating the physical world for robots

Complementing Cosmos Reason 2, Nvidia also released a new version of Cosmos Transfer, a model focused on generating training simulations for robots. Earlier versions were pitched as making simulated robot training more realistic, with the goal of improving how well policies trained in simulation transfer to the real world.

For teams building robotics systems, this matters because collecting large amounts of real-world interaction data can be expensive, slow, and risky. A simulation-generating model like Cosmos Transfer is intended to:

Create diverse, high-fidelity training scenarios.
Expose robots to edge cases and rare events that are hard to capture in the wild.
Reduce iteration time by allowing policies to be tested and refined in virtual environments before being deployed on hardware.

Paired with Cosmos Reason 2, Cosmos Transfer fits into Nvidia’s view of a full physical AI pipeline: simulate the world, train robotic behaviors and reasoning skills in those simulations, then deploy embodied agents that can reason and plan in real environments using the same ontology and perception stack.
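The idea of exposing a robot to varied scenarios and rare events before deployment can be illustrated with a small, hand-written scenario sampler. This is a sketch of plain domain randomization, not of how Cosmos Transfer works internally (Nvidia describes it as a generative model). The `Scenario` fields, the value ranges and the `sample_scenarios` function are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical scenario parameters; a real simulator would expose many more.
LIGHTING = ["bright", "dim", "backlit"]
RARE_EVENTS = ["dropped_pallet", "person_crossing", "sensor_glare"]


@dataclass(frozen=True)
class Scenario:
    lighting: str
    floor_friction: float
    obstacle_count: int
    rare_event: Optional[str]


def sample_scenarios(n: int, rare_event_rate: float = 0.2, seed: int = 0) -> List[Scenario]:
    """Sample n randomized scenarios, injecting rare events at a chosen rate."""
    rng = random.Random(seed)  # seeded so a training run can be reproduced
    scenarios = []
    for _ in range(n):
        rare = rng.choice(RARE_EVENTS) if rng.random() < rare_event_rate else None
        scenarios.append(
            Scenario(
                lighting=rng.choice(LIGHTING),
                floor_friction=round(rng.uniform(0.3, 1.0), 2),
                obstacle_count=rng.randint(0, 10),
                rare_event=rare,
            )
        )
    return scenarios


if __name__ == "__main__":
    for scenario in sample_scenarios(3):
        print(scenario)
```

Raising `rare_event_rate` above what occurs naturally is one simple way to oversample edge cases that would be hard to capture in real-world data collection.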

Nemotron Speech, RAG, and Safety: adding speech, retrieval and guardrails

On the agentic AI side, Nvidia’s Nemotron line is expanding beyond core reasoning models. Following the December release of Nemotron 3, the latest version of its agentic reasoning models, the company announced three new additions: Nemotron Speech, Nemotron RAG, and Nemotron Safety.

Nemotron Speech. In a blog post, Nvidia described Nemotron Speech as providing “real-time low-latency speech recognition for live captions and speech AI applications,” claiming it is 10 times faster than other speech models. For AI developers, this indicates a focus on ultra-low-latency audio pipelines, suitable for interactive agents that must respond to spoken commands or transcribe speech in real time – a critical requirement for voice-controlled robots or on-device assistants.

Nemotron RAG. Nvidia characterizes Nemotron RAG as a two-part system: an embedding model and a rerank model, both with the ability to understand images. This multimodal capability is intended to give data agents richer context and more accurate retrieval across text and visual information.

According to Briski, Nemotron RAG performs strongly on the Massive Multilingual Text Embedding Benchmark (MMTEB), while requiring less compute and memory. For technical decision-makers, the key points are:

It targets high-throughput scenarios that must handle many queries with low latency.
It is designed to support strong multilingual performance, which is often a requirement in global deployments.
Its image-aware embeddings and reranking are meant to help agents make sense of mixed text-visual corpora.
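The two-part design described above, an embedding model followed by a rerank model, is a common retrieval pattern and can be sketched without any specific model. In the Python example below, a bag-of-words vector stands in for the embedding model and a term-coverage score stands in for the reranker. Both are toy, text-only assumptions, and the function names (`embed`, `retrieve`, `rerank_score`, `search`) are hypothetical rather than part of Nemotron RAG.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: List[str], k: int) -> List[str]:
    """Stage 1: cheap similarity search that keeps the top-k candidates."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def rerank_score(query: str, doc: str) -> float:
    """Stage 2 stand-in: fraction of distinct query terms found in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def search(query: str, docs: List[str], k: int = 3, top_n: int = 1) -> List[str]:
    """Retrieve k candidates, then rerank them and return the best top_n."""
    candidates = retrieve(query, docs, k)
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


if __name__ == "__main__":
    corpus = [
        "robot arm calibration guide",
        "warehouse robot safety checklist",
        "quarterly sales report",
    ]
    print(search("robot safety", corpus, k=2))
```

The split exists because the first stage has to be cheap enough to run over the whole corpus, while the second stage can afford a more expensive comparison on a short candidate list.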

Nemotron Safety. The third addition, Nemotron Safety, focuses on detecting sensitive data so AI agents do not inadvertently expose personally identifiable information (PII). In a physical AI context, this is relevant wherever agents interact with environments that may contain personal data – for example, cameras in public or semi-public spaces, or systems that process user documents.

While Nvidia has not detailed the underlying methods, the positioning is clear: Nemotron Safety acts as a guardrail layer that flags or filters sensitive content before it is surfaced or acted upon by other components in an agentic system.
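As a rough illustration of what a guardrail layer that flags or filters sensitive content can look like, the sketch below redacts two kinds of PII with regular expressions before text is passed on. Nvidia has not described how Nemotron Safety works, so this is not its method. The patterns, labels and the `redact_pii` function are assumptions, and simple regexes like these will miss many real-world formats.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately simple patterns; production detectors are model-based
# or use far more thorough rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, List[str]]:
    """Replace detected PII with a label and report which types were found."""
    found: List[str] = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


if __name__ == "__main__":
    print(redact_pii("Contact jane@example.com or 555-123-4567."))
```

Returning the list of detected types alongside the cleaned text lets a downstream component decide whether to pass the content on, log the event, or block the response entirely.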

What this means for AI and robotics builders

Nvidia’s announcements at CES 2026 underscore a strategy that blends open models, simulation tools, and specialized components for speech, retrieval, and safety into a unified story about physical AI.

For AI developers and technical decision-makers, several practical implications emerge from the information Nvidia has shared:

Systems-of-models are the new baseline. Nvidia is advocating for architectures where perception (Cosmos), reasoning and orchestration (Nemotron), simulation (Cosmos Transfer), and safety (Nemotron Safety) are distinct but interoperable layers, rather than a single monolithic model.
Reasoning is moving into perception stacks. By emphasizing embodied reasoning in Cosmos Reason 2 and highlighting leadership on a physical reasoning leaderboard, Nvidia is signaling that future robotics perception stacks will be evaluated not just on recognition accuracy but on their ability to support multi-step, cause-aware decision-making.
Multimodal and multilingual capabilities are table stakes for agents. Nemotron RAG’s strong multilingual performance and image-aware embeddings, combined with VLM-based perception and speech recognition, point toward agents that can operate across languages and modalities as a default expectation.
Safety and data sensitivity are built-in concerns. With Nemotron Safety, Nvidia is acknowledging that as agents move closer to end users and physical spaces, preventing unintentional disclosure of sensitive data is a core requirement, not an afterthought.

Nvidia has not yet disclosed details such as exact model sizes, training datasets, and specific deployment patterns. Nonetheless, the direction is clear: Nvidia wants its open models and tooling to be the default foundation for enterprises building both digital agents and physically embodied AI systems.

For teams planning their next-generation robotics platforms or agentic applications, the combination of Cosmos Reason 2, Cosmos Transfer, and the expanded Nemotron family represents a more complete stack, spanning simulation, perception, speech, retrieval, and safety, and aimed at moving AI beyond the chatbox and into the physical world.

Cary Huang
