Why Better LLMs Aren’t Enough: LangChain’s Harrison Chase on Harness Engineering and Deep Agents

Scaling from a clever LLM demo to a reliable, long‑running AI agent in production is less about squeezing a few more percentage points out of the underlying model and more about how you “harness” that model. That’s the core argument LangChain co‑founder and CEO Harrison Chase makes in a recent episode of VentureBeat’s Beyond the Pilot podcast, where he frames the emerging discipline of “harness engineering” as the next layer of context engineering for agents.

Chase’s thesis: as models become capable of running loops, calling tools, and planning over long horizons, the surrounding systems must evolve just as aggressively—or those gains never translate into production‑grade agents.

The shift from better models to better harnesses

Chase describes harness engineering as a natural extension of context engineering—traditionally focused on how prompts, retrieval, and system context are shaped for a single call. With agents, the harness becomes the environment in which the LLM plans, acts, and iterates over time.

Historically, “harnesses” around models often existed to constrain behavior: prevent infinite loops, avoid uncontrolled tool use, and keep generations short and safe. For agentic systems, Chase argues the trend is almost the opposite: give the LLM more control over what it sees, what it does, and how it manages its own context over longer tasks.

According to Chase, modern harnesses must allow an LLM to:

Run in loops rather than a single request–response cycle
Call tools as needed, including code execution and shell utilities
Manage its own context window and decide what to keep or compact
Operate as a “long‑running, more autonomous assistant” rather than a one‑off chatbot

In other words, model quality is now a prerequisite—but by itself, insufficient. Without a sophisticated harness, even state‑of‑the‑art models will struggle to perform multi‑step, stateful tasks reliably.

Lessons from AutoGPT: capability thresholds and the cost of premature loops

To illustrate the difference between architecture and capability, Chase points back to AutoGPT, which briefly became the fastest‑growing GitHub project of its time. Architecturally, he notes, AutoGPT looked a lot like many of today’s agent frameworks: it ran the model in loops, called tools, and tried to pursue multi‑step goals autonomously.

The problem was that, at the time, models were “below the threshold of usefulness” for that style of interaction. They couldn’t reliably operate in a loop, so the architecture didn’t translate into stable behavior. As a result, interest faded quickly, despite the hype.

Before current‑generation models, teams worked around these limitations by hard‑coding flows as graphs and chains. Instead of letting an LLM plan its next action, developers encoded the sequence of steps themselves. That made systems more predictable but also less flexible—and crucially, it meant you couldn’t use the model to meaningfully improve the harness, because the model wasn’t capable enough inside it.

Now that models have crossed that usefulness threshold, Chase argues, the situation is different. Teams can finally construct environments where models can:

Plan over longer horizons
Run multiple steps in sequence without constant human oversight
Iterate on their own behavior within a well‑designed harness

That shift, in his view, is what enables a new generation of serious agents—provided the harness design keeps pace.

Inside LangChain’s Deep Agents: a general-purpose harness

Tracking progress and maintaining coherence over long tasks

Chase positions LangChain’s Deep Agents as a response to this new environment: a customizable, general‑purpose harness built on top of LangChain and LangGraph, designed specifically for long‑running, tool‑using agents. While he doesn’t frame it as a silver bullet, the components he highlights map directly onto recurring pain points AI teams encounter when trying to move agents from prototype to production.

Key capabilities he calls out include:

Planning: The harness gives agents explicit planning capabilities, so they can break a high‑level goal into steps and update that plan as they proceed.
Virtual filesystem: Agents have access to a file system where they can create and manage artifacts—effectively externalizing state rather than relying purely on the LLM’s short‑term context.
Context and token management: The harness manages what information is surfaced to the model at each step and how larger histories are compressed, with an eye toward token efficiency.
Code execution and tools: Built‑in support for code execution and shell (BASH) tools gives agents more flexible ways to act on the world, not just reason about it.
Skills and memory: Agents can rely on reusable “skills” and memory functions instead of a single massive system prompt.

Chase emphasizes that all agents in this framework can effectively “write down their thoughts” as they go. That might look like to‑do lists, intermediate notes, or incremental outputs stored in the virtual filesystem. The important part is that when the agent moves from step two to step three—or step four of a 200‑step process—it has a structured way to:

Track its progress
Recall what has already been done
Remain coherent with the original goal

From a harness‑engineering perspective, this written trail is not just logging; it is a core mechanism for preserving coherence and enabling long tasks to stay on track.

Subagents, isolation, and token efficiency

Another design choice Chase highlights is the use of subagents. A primary agent can delegate parts of a task to specialized subagents, each with its own tools and configuration. These subagents can operate in parallel, but their work is context‑isolated.

That isolation matters for two reasons:

Focus: The main agent’s context is not polluted with every intermediate detail from every subtask.
Efficiency: Large subtask histories are compressed into a single result before being handed back, conserving tokens and keeping the overall context manageable.

In practice, this lets teams model complex workflows—research, analysis, data transformation, code changes—without overwhelming the primary agent with internal noise. The harness ensures that for each decision point, the LLM sees only what’s most relevant.

Chase connects this back to the broader goal: creating an environment where the LLM can act more autonomously, but within guardrails that preserve coherence, cost control, and interpretability.

Skills vs. tools: a more modular approach to context

Beyond the mechanics of loops and subagents, Chase argues that teams need to think differently about how capabilities are surfaced to an agent. Rather than loading all instructions, tools, and behaviors into an ever‑growing system prompt, he suggests a skills‑based approach.

In this model, the system prompt contains a concise core foundation—who the agent is, what it’s responsible for, and how it should generally behave. Then, when the agent needs to perform a specific task (say, “X” or “Y”), it dynamically loads the relevant “skill” definition.

Chase’s framing: instead of hard‑coding everything up front, the agent can ask, “If I need to do X, let me read the skill for X. If I need to do Y, let me read the skill for Y.” This keeps the always‑visible context smaller and pushes specificity into modular, on‑demand components.

For AI engineers, this reinforces a key harness‑engineering principle: context is not just about retrieval over external data; it’s also about how the agent accesses its own behaviors and instructions. A skills model can make those behaviors more composable, auditable, and easier to evolve without rewriting monolithic prompts.

Context engineering: what the LLM sees vs. what you see

Chase describes context engineering in deceptively simple terms: it’s about “what the LLM is seeing.” That view often diverges sharply from what developers assume the model sees at any given point.

He argues that when agents fail, it is usually because they lack the right context; when they succeed, it’s because they have it in the right format at the right time. To close that gap, teams need deep visibility into the agent’s inner loop, including:

How the system prompt is constructed (static vs. dynamically populated)
What tools the agent has access to and how those tools are presented
How tool call responses are formatted and surfaced back to the model
What intermediate state (plans, notes, partial outputs) is or is not included in the next step’s context

Chase points out that analyzing agent traces helps developers “put themselves in the AI’s mindset.” By replaying interactions and inspecting prompts, tool calls, and responses, teams can see the world as the model sees it and iteratively refine the harness.

In this view, traces and observability are not just debugging tools; they are core ingredients for building an agent that “actually works” at production scale.

OpenClaw, ‘letting it rip,’ and the safety–enterprise gap

Chase also comments on OpenAI’s acquisition of OpenClaw, an agent product that gained viral traction. In his telling, its success stemmed from a willingness to “let it rip” in ways that major labs typically avoid—loosening constraints and allowing more autonomy in how the agent operated.

He questions, however, whether acquiring such a product meaningfully moves OpenAI closer to a safe, enterprise‑ready version of that experience. The implication is that the qualities that made OpenClaw viral—aggressive autonomy and minimal friction—may be at odds with the requirements of risk‑sensitive deployments.

For enterprise and platform teams, the takeaway is not that aggressive autonomy is impossible, but that the harness must be deliberate: the more power and control you hand to the agent, the more rigorous your design around observability, context, and tool access needs to be.

What this means for teams building production agents

Chase’s framing has several concrete implications for AI engineers and technical product leaders:

Treat the harness as a first‑class system: Model choice matters, but the surrounding planning, context management, file systems, and tool orchestration often determine whether an agent can handle real workloads.
Design for long‑running behavior from day one: If the product vision involves multi‑step or continuous agents, build in mechanisms for progress tracking, written “thoughts,” and state externalization early.
Invest in observability and traces: The only way to improve context engineering is to understand what the LLM is actually seeing at each step.
Modularize instructions into skills: Move away from enormous, static system prompts and toward skills or similar abstractions that can be loaded as needed.
Be explicit about autonomy levels: Decide how much freedom your agent has to “let it rip” and align harness design, safety constraints, and user experience with that choice.

Chase also hints at adjacent trends he sees as important for agent builders: code sandboxes as a key runtime environment, new user experiences for agents that run on long intervals or continuously, and the central role of LangGraph and LangChain in LangChain’s own stack, with Deep Agents layered on top.

For teams feeling the limits of prompt‑only approaches or single‑call workflows, his message is straightforward: better models open the door, but harness engineering determines whether your agents can walk through it—and keep going.

Cary Huang

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.