Why Harness Engineering, Not Just Better Models, Will Decide Which AI Agents Reach Production

As large language models (LLMs) get more capable, many teams assume that upgrading to the latest model will automatically make their AI agents production-ready. Harrison Chase, co-founder and CEO of LangChain, argues the opposite: model quality is necessary but no longer sufficient. The real differentiator, especially for long-running, tool-using agents, is “harness engineering” — how you structure, constrain, and support the model so it can reliably do multi-step work.

In a recent episode of VentureBeat’s Beyond the Pilot podcast, Chase describes how harnesses are evolving from simple wrappers around a model into full-fledged runtimes for autonomous assistants. Built correctly, these harnesses let an LLM plan, call tools, manage context, and track work over time. Built poorly, they lead to brittle demos that fall apart before they reach real users.

From prompt wrappers to full agent harnesses

Traditional “harnesses” around machine learning models were largely defensive. They constrained what a model could do, limited looping behavior, and focused on containing risks: no uncontrolled tool calls, no unbounded execution, and tight integration paths into existing applications.

Agent-style systems flip this emphasis. Instead of just shielding the model, harnesses for AI agents must enable it to interact more independently and perform long-running tasks. This includes letting the LLM:

Run in loops instead of one-off calls
Call tools (code interpreters, shell commands, external APIs)
Plan and re-plan over longer horizons

Chase describes this shift as an extension of context engineering. Where early LLM applications centered on crafting static prompts, modern harnesses expose more control to the model itself. The harness becomes the system that decides how the model sees the world and how it can act within it, not just what initial instructions it is given.

In this view, the harness is not decoration around a strong model. It is the operational substrate that determines whether the model’s capabilities can be translated into something that runs continuously, handles edge cases, and stays coherent over dozens or hundreds of steps.

What OpenClaw reveals about "letting it rip"

Chase’s framing around harnesses also colors his read of OpenAI’s acquisition of OpenClaw, a system that achieved viral traction by leaning into aggressive autonomy. He argues that its popularity was driven by a willingness to “let it rip” in ways that major labs typically avoid.

The implication for practitioners is that there is a tension between autonomy and safety in harness design. A permissive harness can unlock surprising capabilities — and social buzz — by giving the LLM broad control over context and tools. But the same qualities that make a system exciting to power users make it harder to harden for enterprise environments.

Chase questions whether bringing such a system into a more conservative organization actually moves it closer to a safe, enterprise-ready product. The same freedoms that helped OpenClaw go viral may need to be carefully constrained or rethought when compliance, observability, and failure modes are in scope.

For engineering leaders, this highlights a central question: how much initiative should your agent harness grant the LLM, and what controls do you need around that initiative so the system is both effective and deployable?

Why loops, tools, and long horizons are harder than they look

On paper, the idea of letting an LLM run in a loop and call tools is straightforward: call the model, inspect its output, decide whether to call a tool or stop, repeat. In practice, Chase stresses, doing this reliably has been extremely hard.

For a significant stretch, models were simply “below the threshold of usefulness” for this style of interaction. Developers had to work around the limitations with explicit graphs and manually defined chains of steps. AutoGPT — once the fastest-growing GitHub project — is a prominent cautionary tale. Its architecture resembles what many teams are building today, but the underlying models weren’t good enough at the time to support dependable looping behavior. The result: impressive demos, but lack of reliability, and its rapid fade from use.

As models improve, the threshold shifts. It is now viable to construct environments where agents can:

Run in loops without getting stuck or drifting too quickly
Plan across many steps instead of just a handful
Incrementally refine their own context and strategy

Crucially, better models don’t eliminate the need for a harness — they make the harness more important. When a model cannot reliably run in a loop, you can’t meaningfully improve the harness because you can’t exercise it at full strength. Once the model crosses that usability threshold, you can start systematically refining planning strategies, context policies, and tooling interfaces.

Inside LangChain’s Deep Agents harness

LangChain’s answer to these needs is Deep Agents, which Chase describes as a customizable general-purpose harness for agents. Built on LangChain and LangGraph, it’s designed to give teams a structured runtime in which an LLM can plan, execute, and coordinate work.

Key capabilities of this harness include:

Planning support: Structures for multi-step plan creation and execution so the agent can think in terms of tasks and sequences rather than single prompts.
Virtual filesystem: A way for agents to read, write, and organize artifacts, enabling them to “write things down” and refer back to them over time.
Context and token management: Mechanisms to control what information is visible to the LLM at each step and keep token usage in check.
Code execution: Integration with code interpreters and related tools so the agent can compute, transform data, and verify results programmatically.
Skills and memory: Structures for reusable capabilities and longer-term knowledge, moving beyond flat tool lists.

Deep Agents also supports subagents that can run in parallel. Each subagent can be specialized with its own tools and configuration, and their work is context-isolated: the details of a subtask don’t clutter the main agent’s context window. Instead, larger intermediate contexts are compressed into a single summarized result before being passed back.

In all cases, agents have access to a filesystem-like abstraction, enabling them to create and maintain to-do lists and intermediate artifacts. Chase characterizes this as giving the LLM a place to “write its thoughts down as it goes along,” so it can track progress as it advances from step two to step three to step four in a process that may have hundreds of steps.

Context engineering: what the model actually sees

Chase reduces context engineering to a simple but often overlooked question: what is the LLM actually seeing? This is distinct from what developers see in their IDE or dashboards. The harness must decide, at every step, which pieces of information are surfaced to the model and in what format.

He argues that when agents fail, it is typically because they don’t have the right context; when they succeed, they do. From that angle, context engineering is about “bringing the right information in the right format to the LLM at the right time.”

Several patterns follow from this:

LLM-driven context control: The trend in harness design is to give the model more control over what context it sees and when. Rather than a static prompt, the agent can decide when to retrieve, summarize, or discard information.
Compact, modular system prompts: Instead of loading everything into one large, static system prompt, Chase advocates for a smaller foundation plus dynamically loaded “skills.” If the agent needs to do X, it reads the skill for X; for Y, it reads the skill for Y.
Intentional compaction: Harnesses should be amenable to the model deciding when to compress or compact context. The LLM can choose advantageous points in a workflow to summarize, reducing token usage while preserving coherence.

This requires visibility into the agent’s internal traces. When developers can inspect those traces, they can reconstruct the model’s perspective: what system prompt it received, how that prompt was built, what tools were available, and how tool responses were presented back to the model. That, in turn, guides better context policies and harness refinements.

Skills, code, and sandboxes: giving agents the right tools

For agent builders, another dimension of harness design is the tool surface area. Chase emphasizes the benefits of giving agents access to code interpreters and BASH tools, which significantly increase flexibility. With code execution, an agent can offload precise calculations, data transformations, and validation tasks to a deterministic environment instead of relying purely on generative reasoning.

He draws a distinction between loading “tools” up front and exposing “skills” that can be read when needed. Skills encapsulate behavior or knowledge that the agent can pull into its context on demand, rather than being embedded in a monolithic system prompt. This keeps the core prompt smaller and more general while allowing specialized capabilities to be attached as the situation requires.

Chase also points to code sandboxes as an emerging focus area. While the article does not elaborate deeply, the direction is clear: as agents gain more execution power, harnesses will need controlled environments where they can run code safely, manage side effects, and interact with resources without jeopardizing the broader system.

Why observability and UX will make or break production agents

Looking ahead, Chase highlights two areas that will shape which agent systems actually make it into production: observability and user experience.

On the observability side, traces are central. To evolve a harness, teams need detailed records of agent behavior: prompts, intermediate decisions, tool calls, and context snapshots. This lets engineers diagnose why an agent went off track, correlate failures with missing or malformed context, and iteratively harden the system.

On the UX side, agents that run for longer intervals — or even continuously — will demand different user interfaces than today’s chat-centric patterns. Interfaces will likely need to expose plans, progress, and state over time, not just inputs and outputs. Chase suggests that a new breed of UX will emerge around supervising, steering, and collaborating with long-running agents.

Taken together, these themes reinforce his core argument: better models open the door to more ambitious agents, but it is harness engineering — the runtime, context policies, tool orchestration, observability, and UX — that will decide which of those agents actually reach and survive in production.

For teams building LLM-based systems, this means shifting attention from one-off prompt engineering toward designing robust harnesses that let models operate autonomously yet predictably, with the right context, the right tools, and the right guardrails at every step.

Cary Huang

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.