OpenAI has spent years pushing the frontier of large language models. Internally, though, one of its most consequential AI deployments isn’t a new model at all. It’s a data agent, built by just two engineers in three months—70% of its code written by AI—that now serves more than 4,000 of the company’s roughly 5,000 employees every day.
Instead of combing through tens of thousands of datasets and writing SQL by hand, staff across finance, product, engineering, and non-technical functions now ask plain‑English questions in tools like Slack and receive charts, dashboards, and in‑depth analysis in minutes. For data leaders and engineering managers, OpenAI’s experience offers a concrete blueprint—and a set of cautions—for what it actually takes to stand up an AI data agent at enterprise scale.
The scale of OpenAI’s data challenge
OpenAI’s motivation for building an in‑house data agent is rooted in the size and complexity of its data environment. The company operates a data platform spanning more than 600 petabytes across 70,000 datasets. Even for experienced data scientists, simply locating the correct table can consume hours.
The internal Data Platform team, led by head of data infrastructure Emma Tang, sits within infrastructure and oversees big data systems, streaming, and the data tooling layer. That tooling has a broad audience: according to Tang, out of about 5,000 OpenAI employees, more than 4,000 use the team’s data tools.
Previously, a typical workflow for a finance analyst—such as comparing revenue across geographies and customer cohorts—meant hunting through those 70,000 datasets, inspecting schemas, writing and validating SQL, and then iterating on queries. The process could easily absorb several hours before any analysis even started.
The agent fundamentally changes that starting point. A finance analyst now types a question in plain English into Slack and receives a finished chart within minutes. OpenAI’s internal estimates, shared with VentureBeat, suggest the tool saves two to four hours of work per query. But Tang emphasizes that the more important impact is qualitative: the agent surfaces analyses that employees would not have attempted at all under the old workflow.
A plain‑English interface to 600 PB: How employees actually use it

The agent is built on OpenAI’s GPT‑5.2 model and is embedded into the tools where employees already work: Slack, a web interface, IDEs, the Codex command‑line interface, and OpenAI’s internal ChatGPT app. The interaction pattern is intentionally simple: ask a question in natural language and receive structured output—charts, dashboards, or long‑form analytical writeups.
Crucially, this interface is not reserved for specialists. According to Tang’s team, engineers, growth teams, product managers, and non‑technical staff “who may not know all the ins and outs of the company data systems and table schemas” can now pull sophisticated insights themselves. This reduces the back‑and‑forth between business users and data teams and shifts more effort toward interpreting results instead of retrieving them.
The agent’s design turns the corporate data warehouse into something closer to a conversational surface. Users do not need to know that hundreds of petabytes sit beneath that surface, or how those tables are structured. They only need to express the business question clearly enough for the agent to translate it into the right set of queries.
From finance to latency debugging: One cross‑company agent
Unlike many enterprise AI bots that are scoped to a single function (finance, HR, support), OpenAI’s agent operates horizontally across the organization.
Finance uses it for revenue breakdowns across geographies and customer cohorts. A finance analyst can request a comparison in plain text and receive not only numerical output but visualizations and dashboards built on top of vetted tables.
The agent also handles more complex, multi‑step diagnostic work. Tang described a case where discrepancies appeared between two dashboards tracking ChatGPT Plus subscriber growth. The agent identified and surfaced the differences, “stack rank by stack rank,” across five distinct factors that explained the mismatch. Tang noted that for a human analyst, reaching that level of explanation might take hours or days; the agent completed it in minutes.
Product managers use it to analyze feature adoption patterns, while engineers lean on it for operational and performance questions—such as whether a specific ChatGPT component is slower than the day before and which latency components contribute to the change. The agent can compare past periods and break down the drivers of variance from a single prompt.
Organizationally, OpenAI rolled out the agent department by department, curating context and memory for each group. Over time, though, all this context feeds into a shared system. That allows senior leaders and cross‑functional teams to combine sales data, engineering metrics, and product analytics within a single query—something Tang calls a “really unique feature” of the approach.
Codex at the core: context, enrichment, and code generation

The hardest technical problem in the system is not generating SQL; it is reliably finding the right tables among 70,000 candidates. Tang calls table discovery “the biggest problem with this agent.” OpenAI addresses it by placing Codex—its AI coding agent—at the center of the architecture.
Codex plays three distinct roles:
First, it acts as the primary access layer: users interact with the data agent through Codex via the Model Context Protocol (MCP). Second, the team used Codex to write over 70% of the agent’s own code, which helped two engineers ship a production‑grade system in roughly three months.
The third role is the most unusual: a daily asynchronous process where Codex inspects important data tables and their associated pipeline code. From that code, Codex infers each table’s upstream and downstream dependencies, ownership, granularity, join keys, and similar tables. OpenAI refers to this as “Codex Enrichment.”
The team prompts Codex with instructions to analyze the code, extracts the relevant metadata from its responses, and persists that metadata into a database. Later, when a user asks a question about a metric such as revenue, the agent queries a vector database to find the tables that Codex has already linked to that concept.
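The enrichment step can be sketched as a parse-and-persist loop: prompt the model for structured metadata about a table, parse its answer, and store the record for later retrieval. Everything below is illustrative; the field names, table names, and JSON shape are assumptions, since OpenAI has not published this code:

```python
import json
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Hypothetical record for one enriched table (field names are illustrative)."""
    name: str
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    owner: str = ""
    granularity: str = ""
    join_keys: list = field(default_factory=list)
    similar_tables: list = field(default_factory=list)

def parse_enrichment(raw_json: str) -> TableMetadata:
    """Parse the model's JSON answer into a metadata record before persisting it."""
    data = json.loads(raw_json)
    return TableMetadata(
        name=data["table"],
        upstream=data.get("upstream", []),
        downstream=data.get("downstream", []),
        owner=data.get("owner", ""),
        granularity=data.get("granularity", ""),
        join_keys=data.get("join_keys", []),
        similar_tables=data.get("similar_tables", []),
    )

# A fake model response, standing in for Codex's analysis of pipeline code.
raw = json.dumps({
    "table": "finance.revenue_daily",
    "upstream": ["billing.events_raw"],
    "downstream": ["dash.revenue_exec"],
    "owner": "finance-data",
    "granularity": "one row per customer per day",
    "join_keys": ["customer_id", "date"],
})
record = parse_enrichment(raw)
print(record.name, record.join_keys)
```

In the real system, these records would then be embedded and indexed so that a question about "revenue" retrieves the tables already linked to that concept.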
Codex Enrichment is one of six context layers the agent can draw on. The other five are:
- Basic schema metadata.
- Curated expert descriptions of key tables and dashboards.
- Institutional knowledge mined from Slack, Google Docs, and Notion.
- A learning memory that stores corrections from prior conversations.
- Fallback access to live queries against the warehouse when no prior mapping exists.
To avoid amplifying noise, the team also tiers historical query patterns. Routine, exploratory queries—“select * limit 10”—are deprioritized, while canonical dashboards and executive‑level reports, where analysts have already invested in defining the “right” view of a metric, are tagged as sources of truth.
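The tiering idea can be sketched as a simple scoring heuristic. The tier numbers, source labels, and regex below are all invented for illustration; the real signal presumably comes from the metadata of OpenAI's BI and query-history systems:

```python
import re

def tier_query(sql: str, source: str) -> int:
    """Assign a trust tier to a historical query (heuristics are illustrative).
    Tier 2: canonical dashboards and exec reports; tier 1: ordinary analyst
    queries; tier 0: throwaway exploration such as `select * limit 10`."""
    normalized = " ".join(sql.lower().split())
    exploratory = re.search(r"select \*.*limit \d+", normalized) is not None
    if source in {"exec_report", "canonical_dashboard"}:
        return 2  # analysts already invested in defining the "right" view here
    if exploratory:
        return 0  # deprioritize: likely someone poking at a table
    return 1
```

A tiered history like this lets the agent weight canonical definitions of a metric above ad-hoc exploration when deciding which past queries to imitate.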
Designing agent behavior: Overconfidence, context, and ‘thinking time’
Even with multiple context layers, Tang reports that the agent’s biggest behavioral challenge is one that will be familiar to anyone deploying large language models: overconfidence.
Left to its own devices, the model tends to quickly pick a table it believes is correct and proceed with analysis, even when that table is not the right choice. To counter this, the team leaned heavily on prompt engineering aimed at slowing the agent down.
They crafted prompts that explicitly instruct the model to remain in a discovery phase longer: gathering alternatives, comparing possible tables, and validating them before committing to a single source. The prompt resembles guidance to a junior human analyst, emphasizing checks against multiple sources and explicit validation steps before “creating actual data.”
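A prompt in this spirit might look like the following. This is a paraphrase of the behavior Tang describes, not OpenAI's actual prompt text:

```python
# Illustrative system-prompt fragment: force a discovery phase before SQL.
DISCOVERY_PROMPT = """\
Before writing any SQL, stay in a discovery phase:
1. List at least three candidate tables that could answer the question.
2. Compare their schemas, granularity, and freshness.
3. Validate your chosen table against a second source (a canonical
   dashboard or a known-good report) before committing to it.
Only after these checks may you generate the final query."""
```

The structure mirrors how one would brief a junior analyst: enumerate alternatives, compare, validate, and only then produce data.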
Evaluation work produced another counterintuitive finding: more context is not always better. It can be tempting to “dump everything in” and assume performance will improve, Tang notes, but internal evaluations showed the opposite. Curated, accurate, and smaller context windows outperformed large, noisy ones. This has direct implications for teams planning to connect agents to broad swaths of documents and logs without careful curation.
To build user trust, the agent exposes its reasoning and intermediate steps. It streams its progress in real time, shows which tables it chose and why, and links to raw query results. Users can interrupt mid‑analysis to redirect it, and the system checkpoints state so work can resume after failures. At the end of each task, the model also performs a self‑evaluation, effectively answering: how well did this go? Tang says the model is “fairly good” at assessing its own output quality, adding another signal for the team and users.
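The checkpoint-and-resume behavior can be sketched with a class that persists state after every step. Class and field names are hypothetical, and the self-evaluation here is a trivial stand-in for asking the model to grade its own output:

```python
import json
import os
import tempfile

class AgentRun:
    """Minimal checkpointing sketch: persist state after each step so a run
    can resume after a failure (all names are illustrative)."""
    def __init__(self, path):
        self.path = path
        self.state = {"steps": [], "done": False}

    def record(self, step: str):
        """Append a step and write the full state to disk (the checkpoint)."""
        self.state["steps"].append(step)
        with open(self.path, "w") as f:
            json.dump(self.state, f)

    @classmethod
    def resume(cls, path):
        """Reload a previous run's state if a checkpoint file exists."""
        run = cls(path)
        if os.path.exists(path):
            with open(path) as f:
                run.state = json.load(f)
        return run

    def self_evaluate(self) -> str:
        # Stand-in for the end-of-task self-assessment the real agent performs.
        return "ok" if self.state["steps"] else "incomplete"

# Usage: record two steps, then simulate resuming after a failure.
path = os.path.join(tempfile.mkdtemp(), "run.json")
run = AgentRun(path)
run.record("chose table finance.revenue_daily")
run.record("ran comparison query")
resumed = AgentRun.resume(path)
print(resumed.state["steps"], resumed.self_evaluate())
```

Streaming the contents of `state["steps"]` to the user as they accumulate is, in miniature, the transparency mechanism the article describes.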
Security and guardrails: Simple controls that work
On safety and governance, Tang describes a pragmatic approach: “dumb guardrails” that are straightforward but strict.
The agent operates as an interface layer on top of existing data systems and inherits each user’s permissions. It always uses the user’s personal token, meaning it can only access what that user is already authorized to see. The agent is excluded from public channels and runs only in private channels or direct interfaces, reducing accidental data exposure.
Write access is tightly constrained. When the agent needs to write data, it does so only into a temporary test schema that is periodically wiped and cannot be shared. The system is deliberately prevented from writing directly into production systems or making arbitrary changes to infrastructure.
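"Dumb guardrails" of this kind reduce to a few blunt checks before any query runs. The rules below are a sketch of the policies described above; the function, schema name, and keyword heuristic are invented here, and the real checks live in OpenAI's infrastructure:

```python
WRITE_KEYWORDS = ("insert", "create", "drop", "update", "delete")
TEST_SCHEMA = "tmp_agent_scratch"  # hypothetical name for the wipeable test schema

def authorize(query: str, channel_is_private: bool, target_schema=None, user_token=None) -> bool:
    """Simple, strict pre-flight checks in the spirit of the article's guardrails.
    The query itself would execute with `user_token`, so the agent inherits the
    caller's permissions; here we only verify a token is present."""
    if not user_token:
        return False  # no personal credential, no query
    if not channel_is_private:
        return False  # never respond in public channels
    # Crude keyword scan for writes (a real system would parse the SQL).
    is_write = any(kw in query.lower() for kw in WRITE_KEYWORDS)
    if is_write and target_schema != TEST_SCHEMA:
        return False  # writes allowed only into the temporary test schema
    return True
```

The appeal of rules this simple is that they are easy to audit: each check maps one-to-one onto a stated policy, with no model judgment in the loop.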
User feedback forms a key part of the safety loop. Employees can flag incorrect results, which the team then investigates. Combined with the model’s own self‑evaluation, this creates an iterative process to refine prompts, context, and guardrails over time.
Looking ahead, Tang notes that OpenAI intends to move toward a multi‑agent architecture, where specialized agents can monitor and assist one another. That shift has not yet happened, but she says the current, single‑agent design has already taken them “pretty far.”
Why OpenAI isn’t selling it—and what you can reuse

Despite the clear commercial potential, OpenAI says it has no plans to productize this specific internal data agent. Instead, the company’s strategy is to expose the underlying building blocks so enterprises can construct their own.
Tang emphasizes that the team relied entirely on externally available APIs and models: the Responses API, the Evals API, and GPT‑5.2 without any fine‑tuning. In other words, from OpenAI’s perspective, “you can definitely build this” with the publicly accessible components.
This stance aligns with OpenAI’s broader enterprise push. In early February, the company launched OpenAI Frontier, an end‑to‑end platform for building and managing AI agents in enterprise environments. It has signed consulting firms including McKinsey, Boston Consulting Group, Accenture, and Capgemini to help sell and implement the platform. Separately, AWS and OpenAI are co‑developing a Stateful Runtime Environment for Amazon Bedrock that echoes some of the persistent‑context capabilities used in OpenAI’s own agent. And Apple has integrated Codex directly into Xcode.
Internally, Codex is now deeply embedded into OpenAI’s engineering workflows. OpenAI told VentureBeat that 95% of its engineers use Codex and that Codex reviews all pull requests before merging. Since the start of the year, Codex’s global weekly active user base has tripled, surpassing one million, with overall usage growing more than fivefold.
Usage is also broadening beyond code. Tang observes that “Codex isn’t even a coding tool anymore.” Non‑technical teams use it to organize thoughts, create slides, and generate daily summaries. One engineering manager, she says, has Codex review her notes each morning, identify key tasks, aggregate relevant Slack messages and DMs, and draft responses—effectively operating on her behalf across communication workflows.
Lessons for data leaders: Governance as the unsexy prerequisite
For organizations considering their own AI data agents, Tang’s main advice is not about model choice or exotic prompts. It is about data foundations.
“This is not sexy,” she notes, but effective data agents depend on strong data governance. Data must be clean and sufficiently annotated, and there must be clearly defined sources of truth that the agent can crawl and trust. The agent in no way replaces underlying infrastructure; it relies on existing storage, compute, orchestration, and business intelligence layers to function.
Instead, the agent serves as a new entry point to those systems—one that is more autonomous and more accessible to non‑specialists. For data leaders, that means investments in catalogs, lineage, documentation, and access controls are prerequisites, not afterthoughts, if they want agents to behave reliably.
Tang also offers a strategic warning: she expects a widening gap between companies that adopt such tools and those that do not. Organizations that move early, she argues, will “see the benefits very rapidly,” while laggards will “fall behind.” In her view, the pace at which OpenAI can operate has already accelerated, even if that acceleration still lags the company’s own ambitions.
For today’s data and engineering leaders, OpenAI’s experience suggests a practical roadmap: start from existing APIs and models, lean on AI to build the agent itself, focus heavily on table discovery and context curation, keep guardrails simple but strict, and invest continuously in data governance. The payoff, if OpenAI’s internal deployment is any guide, is an organization where thousands of employees can treat the data warehouse as a conversational partner instead of a distant, specialist‑only resource.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.