OpenAI’s Responses API Update Turns AI Agents Into Persistent, Shell-Backed Workers

OpenAI is repositioning its platform from “just a model” to a full execution environment for autonomous agents. With new capabilities in its Responses API — server-side compaction, hosted shell containers, and support for the open Skills standard — the company is targeting a core enterprise pain point: getting long-running, reliable AI workflows into production without building a bespoke orchestration stack.

For engineering leaders and architects, these changes shift the questions from “How do we give an agent tools and memory?” to “How do we govern, secure, and scale shell-backed digital workers that can run for hours and manipulate real systems?”

The core problem: context, clutter, and fragile long-running agents

Traditional agent implementations have struggled with what many teams experience as “context amnesia.” As an agent calls tools, runs code, and exchanges messages, its conversation history grows until it hits the model’s token limit. At that point, developers typically truncate older messages, often discarding precisely the “reasoning trail” the agent needs to complete a task.

In real projects, this shows up as agents that perform well on short tasks but drift, hallucinate, or silently fail during multi-hour jobs such as complex ETL flows, multi-step research, or iterative data cleanup. The burden of managing this state has historically fallen on custom logic: proprietary summarizers, ad hoc database snapshots, and brittle heuristics about what to retain.

OpenAI’s latest Responses API update targets this bottleneck directly, with a server-side mechanism designed to keep agents coherent over hours or days of work, and a managed compute layer that removes the need to build and secure your own code execution sandboxes.

Server-side Compaction: from chat history to durable working memory

Server-side Compaction is OpenAI’s answer to the token-limit problem for long-running agents. Instead of naïvely chopping off the oldest messages, the platform can compact history: the model summarizes its past actions into a compressed state that preserves essential context while discarding noise.
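The difference between truncation and compaction can be illustrated with a small client-side sketch. This is not OpenAI's implementation, which runs server-side and is not described in detail here. The `Message` type, the four-characters-per-token estimate, and the `toy_summarize` helper are all assumptions made for illustration; a real system would ask a model to write the summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str
    content: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def total_tokens(history: List[Message]) -> int:
    return sum(estimate_tokens(m.content) for m in history)


def truncate(history: List[Message], budget: int) -> List[Message]:
    """Naive policy: drop the oldest messages until the history fits."""
    kept = list(history)
    while len(kept) > 1 and total_tokens(kept) > budget:
        kept.pop(0)
    return kept


def compact(
    history: List[Message],
    budget: int,
    summarize: Callable[[List[Message]], str],
    keep_recent: int = 2,
) -> List[Message]:
    """Replace older messages with one summary message when over budget."""
    if total_tokens(history) <= budget or len(history) <= keep_recent:
        return list(history)
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = Message("system", summarize(older))
    return [summary] + recent


def toy_summarize(messages: List[Message]) -> str:
    # Hypothetical placeholder: keeps the first sentence of each message.
    first_sentences = [m.content.split(".")[0] for m in messages]
    return "Summary of earlier steps: " + "; ".join(first_sentences)
```

With truncation, an early instruction such as "Never delete rows from the orders table" disappears as soon as the oldest message is dropped. With compaction, it survives inside the summary message that replaces the older turns.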

This moves the agent from acting like a short-term conversational assistant to behaving more like a persistent process. Early data from e-commerce platform Triple Whale illustrates the impact: its agent, Moby, successfully handled a session involving roughly 5 million tokens and 150 tool calls without a measurable drop in accuracy. That scale would be unwieldy to manage with homegrown truncation policies.

For enterprise teams, the implications are concrete:

Longer-lived workflows: Multi-hour or potentially multi-day tasks become more realistic without constant manual supervision or external state orchestration.
Less custom glue code: You can lean on the built-in compaction rather than implementing your own history-refresh and summarization services for each agent.
More predictable behavior: Because the platform maintains continuity of intent and key facts, the risk of agents “forgetting” critical constraints mid-run is reduced.

The tradeoff for technical leaders is less about raw capability and more about governance: once long-lived memory is available by default, questions of retention policies, observability, and audit become central platform design topics.

Hosted Shell Containers: OpenAI as your agent’s terminal

The second major change is the introduction of a Shell Tool with container_auto, which provisions an OpenAI-hosted Debian 12 environment per agent. This is a significant step beyond a sandboxed “code interpreter” concept: each agent gets a full-featured terminal environment.

Key facets of this environment include:

Multiple native runtimes: Agents can execute code using Python 3.11, Node.js 22, Java 17, Go 1.23, and Ruby 3.1 without you standing up separate services.
Persistent storage: A mounted directory at /mnt/data allows agents to generate, save, and later download artifacts — logs, reports, transformed datasets, intermediate models, or configuration files.
Networking capabilities: Agents can access the internet from within the container to install libraries or interact with third-party APIs, subject to configured policies.
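To make the persistent storage point concrete, here is a minimal sketch of the kind of script an agent might run inside such a container: clean a few records, then save the result and a small report as downloadable artifacts. The `/mnt/data` path comes from the description above. The `AGENT_DATA_DIR` override, the file names, and the record fields are hypothetical, chosen only so the sketch also runs outside the hosted environment.

```python
import csv
import json
import os
from pathlib import Path


def data_dir() -> Path:
    # /mnt/data is the mount described above; AGENT_DATA_DIR is a
    # hypothetical override for running the sketch on a local machine.
    return Path(os.environ.get("AGENT_DATA_DIR", "/mnt/data"))


def clean_rows(rows):
    """Drop rows without an id and normalize the amount to a float."""
    cleaned = []
    for row in rows:
        if not row.get("id"):
            continue
        cleaned.append({"id": row["id"], "amount": float(row["amount"])})
    return cleaned


def write_artifacts(rows, out_dir: Path) -> dict:
    """Write the cleaned data and a small JSON report; return the report."""
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "cleaned.csv"
    with csv_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    report = {"rows": len(rows), "total": sum(r["amount"] for r in rows)}
    (out_dir / "report.json").write_text(json.dumps(report, indent=2))
    return report
```

Inside the container, a call such as `write_artifacts(clean_rows(raw_rows), data_dir())` would leave `cleaned.csv` and `report.json` in the mounted directory for later download.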

For data and platform engineers, this managed shell effectively becomes a generic, OpenAI-hosted ETL and automation runner. Instead of building and securing custom containers, job runners, and middleware for each new agent workload, teams can delegate “run this code, store these outputs, talk to these APIs” to OpenAI’s environment, using the model itself to orchestrate the sequence of operations.

This rebalances responsibilities: OpenAI provides the “computer” (the OS, runtimes, and networked storage), while your team focuses on describing the procedure and enforcing organizational controls. The arrangement can substantially reduce infrastructure overhead, but it also increases your dependency on OpenAI as a compute substrate.

Skills and a shared standard: portable procedures across ecosystems

The third pillar of the update is support for the emerging “Skills” standard for agents. Both OpenAI and Anthropic now recognize skills defined via a SKILL.md manifest: a Markdown file with YAML frontmatter that describes how an agent should perform a specific operation.
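As a rough illustration of the format, the sketch below embeds a small example manifest and splits it into frontmatter fields and a Markdown body. The skill itself is invented, and the `name` and `description` fields are assumptions, since the field list is not spelled out above. The parser handles only flat `key: value` pairs; a real loader should use a proper YAML parser.

```python
from typing import Dict, Tuple

# Hypothetical SKILL.md content, used only to exercise the parser below.
EXAMPLE_SKILL = """---
name: csv-cleanup
description: Remove empty rows from a CSV file and report how many were dropped.
---

# CSV cleanup

1. Load the file the user points to.
2. Drop rows where every field is empty.
3. Save the result and report the number of rows removed.
"""


def parse_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split a SKILL.md document into (frontmatter fields, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' for frontmatter")
    try:
        # Index of the line that closes the frontmatter block.
        end = next(
            i for i, line in enumerate(lines[1:], start=1)
            if line.strip() == "---"
        )
    except StopIteration:
        raise ValueError("missing closing '---' for frontmatter") from None
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and YAML comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value pair: {line!r}")
        fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body
```

Calling `parse_skill(EXAMPLE_SKILL)` returns the metadata as a dictionary and the instructions as plain Markdown, which is the split that lets a host list available skills cheaply and load the full procedure only when needed.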

In practice, a skill packages procedural knowledge and configuration — the “how” of a task — into a reusable, versioned unit. Because OpenAI and Anthropic have converged on the same open format, the same skill definition can, in principle, be reused across multiple tools and environments such as VS Code, Cursor, or other platforms that adopt the standard.

The impact of this portability is visible in the open source agent ecosystem. The AI agent OpenClaw adopted the SKILL.md manifest and folder-based packaging, allowing it to reuse a body of skills originally built for Anthropic’s Claude. Community hubs like ClawHub now host more than 3,000 skills ranging from smart home automation to complex enterprise workflow orchestration.

Because OpenClaw supports multiple models — including OpenAI’s GPT-5 series and local Llama instances — developers can write skills that are not tightly bound to a single vendor. Skills become an asset class: portable, versioned, and increasingly ecosystem-neutral instructions that encapsulate domain expertise and workflow logic.

Early customer results point the same way. Enterprise AI search startup Glean reports that tool accuracy improved from 73% to 85% after it adopted OpenAI’s Skills framework, which suggests that well-structured, reusable procedures can materially improve tool selection and execution quality.

OpenAI vs. Anthropic: two paths to the same Skills standard

Despite adopting a common skills format, OpenAI and Anthropic are pursuing different strategies around it, aimed at distinct enterprise priorities.

OpenAI is positioning the Responses API as a “programmable substrate” optimized for developer speed and performance. By bundling server-side compaction, hosted Debian 12 shells, networking, and skills execution into a single interface, it offers a turnkey path to building long-running agents. A skill is not just read; it is executed inside a managed container with built-in state management and network policies.

This vertically integrated approach is tailored for teams that value:

Deep control and customization: You can define bespoke skills, run multi-language code, and rely on long-lived state in one environment.
High-performance execution: Workloads that push into millions of tokens and numerous tool calls can remain coherent due to compaction.

Anthropic, by contrast, emphasizes what could be called an “expertise marketplace.” Its strength lies in a mature directory of pre-packaged partner playbooks with vendors such as Atlassian, Figma, and Stripe. For organizations that want out-of-the-box connectivity to popular SaaS platforms, this marketplace-style strategy provides an immediate catalog of workflows built and maintained with partners.

Both companies sit atop the same open skills standard, but OpenAI leans into a high-performance, customizable runtime, while Anthropic prioritizes breadth of ready-made integrations.

Enterprise architecture and security implications

For enterprise engineering teams, these capabilities materially simplify the technical scaffolding needed to move from experimental chatbots to production-grade workflows.

On the architecture side:

State management becomes built-in: Server-side compaction reduces the need for custom history trimming and summarization services, especially for multi-hour jobs.
Skills enable modular IP: Teams can encapsulate procedural knowledge, configuration, and domain-specific behavior into reusable skill packages that can be shared across internal projects and, where appropriate, across platforms.
Infrastructure demands shrink: Hosted shells and persistent storage mean fewer bespoke containers and job runners to build and maintain, particularly for code-heavy tasks like ETL and data transformation.

However, these benefits introduce new security and governance responsibilities. Giving an AI model access to a shell and network is a meaningful escalation in capability, with corresponding risk. OpenAI addresses some of this with mechanisms such as Domain Secrets and organization-level allowlists, so that agents can call APIs without exposing raw credentials in model context.
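The general idea behind an allowlist combined with out-of-context secrets can be sketched as follows. This is not OpenAI's Domain Secrets mechanism, whose interface is not described here. The domain names, the `secrets` mapping, and both function names are hypothetical. The point is only that the destination is checked first, and the credential is attached by the platform rather than ever being shown to the model.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; real entries would come from organization policy.
ALLOWED_DOMAINS = {"api.example.com", "pypi.org"}


def is_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow only https URLs whose host is an allowed domain or a subdomain of one."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or not host:
        return False
    return any(host == d or host.endswith("." + d) for d in allowed)


def build_headers(url: str, secrets: dict) -> dict:
    """Attach a credential only after the destination passes the allowlist."""
    if not is_allowed(url):
        raise PermissionError(f"destination not allowed: {url}")
    token = secrets.get(urlsplit(url).hostname)
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Matching on the parsed hostname, rather than on a substring of the URL, is what keeps look-alike hosts such as `api.example.com.evil.test` from slipping through.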

Yet as skills become easier to create, share, and import, SecOps and platform teams must anticipate new attack surfaces:

Malicious or compromised skills: Poorly vetted skills can introduce prompt injection vectors or exfiltration paths, especially when combined with shell and network access.
Authorization and auditing: The critical design questions become “Which users or services are allowed to invoke which skills?” and “How do we audit the files and artifacts produced in /mnt/data?” rather than “How do we wire up a terminal?”
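Both questions lend themselves to simple, explicit controls. The sketch below pairs a deny-by-default policy for skill invocation with a manifest that records the size and SHA-256 digest of every file an agent leaves behind. The caller names, skill names, and policy structure are hypothetical, and the manifest function works on any directory you point it at, such as a downloaded copy of the artifact folder.

```python
import hashlib
from pathlib import Path

# Hypothetical policy: which caller may invoke which skill.
SKILL_POLICY = {
    "etl-service": {"csv-cleanup", "report-builder"},
    "support-bot": {"ticket-summary"},
}


def may_invoke(caller: str, skill: str, policy=SKILL_POLICY) -> bool:
    """Deny by default; allow only skills explicitly granted to the caller."""
    return skill in policy.get(caller, set())


def artifact_manifest(root: Path) -> list:
    """List every file under root with its size and SHA-256 digest."""
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": digest,
        })
    return entries
```

Storing the manifest alongside each run gives auditors a fixed record to compare against later, so unexpected or modified files stand out.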

The net result is a shift from building low-level plumbing to enforcing high-level policy over a powerful, vendor-managed execution layer.

Choosing a platform: performance substrate or plug-and-play marketplace?

For technical decision-makers, the convergence on the shared Skills standard (agentskills.io) simplifies at least one dimension: skills authored today are not inherently locked to a single vendor. The more difficult choice is which orchestration environment best matches your organization’s priorities.

Based on the announced capabilities:

Use OpenAI’s Responses API when you need heavy-duty, stateful execution. If your roadmap includes long-running agents that must handle millions of tokens, orchestrate complex tool chains, and execute arbitrary code in multiple runtimes, OpenAI’s integrated stack functions as a “high-performance OS” for the skills standard.
Use Anthropic when your strategy centers on fast access to established SaaS ecosystems. If your primary need is to plug into a rich set of prebuilt partner playbooks (e.g., Atlassian, Figma, Stripe) with minimal custom development, Anthropic’s marketplace approach to skills may align better with your priorities.

Because the underlying skills format is shared, organizations can, over time, maintain a portfolio of skills that outlives any single platform choice. Architecturally, this moves the industry away from vendor-specific “prompt spaghetti” and toward shared, versioned, and portable definitions of how digital work should be done.

For enterprises, the opportunity — and the challenge — is to treat these skills, shells, and compaction mechanisms not as experimental features, but as foundational building blocks for the next generation of production automation. The more powerful the substrate becomes, the more critical careful design, governance, and security model choices will be.

Cary Huang

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.