How a New Agentic AI Framework Helps Enterprises Choose Between Training Models and Training Tools

Enterprises exploring agentic AI now face an overwhelming number of frameworks, tools, and orchestration patterns. The choice is no longer simply “which large language model (LLM)?” but “what kind of agent architecture are we building, where do we invest training budget, and which parts should stay modular and swappable?”

A new multi-institution research study proposes a practical framework to navigate this complexity. Rather than treating agentic AI as a pure model-selection problem, it breaks the landscape down into a set of adaptation strategies that clarify when you should train the core agent itself and when you should instead train or tune the tools around it.

For enterprise AI architects, ML engineers, and technical leaders, this framework turns a confusing ecosystem into a structured decision space: a way to map business needs, risk appetite, and budget constraints onto concrete architectural choices.

From Model Selection to Architecture Decisions

The study’s core contribution is conceptual: it reframes agentic AI design as a choice between adapting the agent (the model that reasons and plans) and adapting the tools (retrievers, memories, sub-agents, and other components the agent calls).

This shift matters because it aligns directly with the decisions enterprises must make:

Where to spend training budget – on expensive, monolithic model training, or on smaller, cheaper tool components.
How much modularity to preserve – whether to keep the core model general-purpose and surround it with specialized tools, or to embed capabilities directly into the agent.
What tradeoffs to accept – between cost, flexibility, generalization, and operational risk.

Instead of treating every problem as a candidate for a custom fine-tuned model, the framework encourages enterprises to think in terms of system architecture: a stable core agent plus an ecosystem of tools, tuned at different layers depending on the use case.

Agent vs. Tool Adaptation: The Two Axes of Design

The framework introduces two key dimensions:

Agent adaptation – updating the parameters or policies of the foundation model itself.
Tool adaptation – leaving the core model “frozen” and instead training or optimizing the tools it uses.

Agent adaptation changes the agent’s internal behavior. This typically involves fine-tuning or reinforcement learning (RL) so the model better aligns with specific tasks. It is similar to rewiring the agent’s “brain”: once changed, those behaviors are baked directly into the model.

Tool adaptation instead focuses on the environment around the model. The core LLM stays fixed; surrounding components—search retrievers, memory modules, or dedicated sub-agents—are trained or tuned to work better with that model. This approach lets the overall system evolve without retraining the expensive foundation model.

The study further refines these two axes into four concrete strategies that map directly to design decisions engineers face when building agentic systems.

The Four Adaptation Strategies: A1, A2, T1, T2

Splitting each dimension by where the training signal comes from yields four distinct strategies:

A1: Tool execution signaled (agent adaptation)

Here, the agent learns from verifiable tool execution. The model is trained using feedback from tools themselves—for example, whether generated code compiles and runs or whether a query executes successfully against a database.

The feedback signal is objective and often binary (run vs. crash, success vs. failure).
The agent internalizes the mechanics of calling tools correctly, such as generating syntactically valid SQL or usable Python.
This paradigm is especially effective for stable, verifiable domains like coding or structured query generation.

The study highlights DeepSeek-R1 as an instance of this pattern: a model trained with reinforcement learning against a sandboxed code execution environment. The model learns to produce working code because success is directly and mechanically verifiable.
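A minimal sketch of what an A1-style reward could look like for SQL generation, assuming an in-memory SQLite database stands in for the sandbox. The function name `execution_reward` and the example schema are illustrative assumptions, not taken from the study.

```python
import sqlite3


def execution_reward(sql: str, schema: str) -> float:
    """Return 1.0 if the query executes against a fresh sandbox database, else 0.0."""
    conn = sqlite3.connect(":memory:")  # throwaway sandbox, nothing persists
    try:
        conn.executescript(schema)      # build the sandbox tables
        conn.execute(sql).fetchall()    # run the model-generated query
        return 1.0
    except (sqlite3.Error, sqlite3.Warning):
        return 0.0                      # syntax errors, missing tables, and similar failures
    finally:
        conn.close()


# Hypothetical schema, used only for illustration.
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"

print(execution_reward("SELECT id, total FROM orders WHERE total > 100", SCHEMA))
print(execution_reward("SELEC id FROM orders", SCHEMA))
```

The signal is binary and needs no human labeling, which is what makes this kind of feedback practical for reinforcement learning in verifiable domains.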

A2: Agent output signaled (agent adaptation)

In A2, the agent is optimized based on the quality of its final answer, regardless of the intermediate steps or number of tool calls. Here, the reward function evaluates the correctness of the outcome, not the correctness of each action.

The agent learns to orchestrate multiple tools to arrive at a correct result.
Intermediate sequences (how many searches, which documents, what chain-of-thought) are left for the model to discover and optimize.
This makes A2 suitable for complex, multi-step workflows where the end result is what matters.

Search-R1 exemplifies this category. It performs multi-step retrieval to answer questions but only receives a reward when the final answer is correct. Over time, it learns more effective search and reasoning strategies, without being explicitly told how to sequence those steps.
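To make the contrast with A1 concrete, here is a small sketch of an outcome-only reward. The `Trajectory` type and the exact-match scoring are assumptions chosen for illustration; the study does not prescribe this particular reward function.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One episode: the tool calls the agent made and the answer it produced."""
    tool_calls: list[str] = field(default_factory=list)
    final_answer: str = ""


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not matter.
    return " ".join(text.lower().split())


def outcome_reward(trajectory: Trajectory, gold_answer: str) -> float:
    """Score only the final answer; the intermediate tool calls are ignored."""
    return 1.0 if normalize(trajectory.final_answer) == normalize(gold_answer) else 0.0


one_search = Trajectory(["search('capital of France')"], "Paris")
three_searches = Trajectory(
    ["search('France')", "search('French capital')", "search('Paris')"], " paris "
)
wrong = Trajectory(["search('France')"], "Lyon")

print(outcome_reward(one_search, "Paris"))
print(outcome_reward(three_searches, "Paris"))
print(outcome_reward(wrong, "Paris"))
```

Both correct trajectories earn the same reward despite taking different paths, which is the defining property of A2: how to get there is left to the model.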

T1: Agent-agnostic tools (tool adaptation)

T1 tools are trained independently and can be “plugged in” to any sufficiently capable frozen LLM.

Classic dense retrievers in retrieval-augmented generation (RAG) fit this pattern.
A retriever might be trained on generic search data with no knowledge of the specific LLM that will consume its outputs.
A powerful frozen agent can still use such a retriever effectively to fetch relevant context.

This is the most modular approach: tools are developed as standalone components and then wired into different systems as needed.
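The plug-in property can be sketched in a few lines. Note the assumptions: `ToyRetriever` uses bag-of-words cosine similarity as a stand-in for a real dense retriever, and the two "agents" are stub functions rather than actual LLMs. All names here are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

Agent = Callable[[str], str]  # assumption: any frozen LLM wrapped as prompt -> text


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyRetriever:
    """Built with no knowledge of which agent will consume its results."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents
        self.vectors = [Counter(doc.lower().split()) for doc in documents]

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = Counter(query.lower().split())
        ranked = sorted(
            range(len(self.documents)),
            key=lambda i: cosine(q, self.vectors[i]),
            reverse=True,
        )
        return [self.documents[i] for i in ranked[:k]]


def answer_with_context(agent: Agent, retriever: ToyRetriever, question: str) -> str:
    context = "\n".join(retriever.retrieve(question))
    return agent(f"Context: {context}\nQuestion: {question}")


def agent_a(prompt: str) -> str:  # stub standing in for one frozen LLM
    return "[model A] " + prompt.splitlines()[0]


def agent_b(prompt: str) -> str:  # stub standing in for a different frozen LLM
    return "[model B] " + prompt.splitlines()[0]


retriever = ToyRetriever([
    "Refunds are issued within 14 days of purchase",
    "Servers are patched every Tuesday night",
])
print(answer_with_context(agent_a, retriever, "how are refunds issued"))
print(answer_with_context(agent_b, retriever, "how are refunds issued"))
```

The same retriever instance serves both agents unchanged, so either side can be replaced without retraining the other.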

T2: Agent-supervised tools (tool adaptation)

T2 tools are trained for a specific frozen agent, using supervision signals derived from that agent’s outputs. The result is a symbiotic relationship: the tool learns to provide exactly what the agent needs to succeed.

The study’s key example is the s3 framework:

A small “searcher” model is trained to retrieve documents.
A large, frozen “reasoner” LLM then attempts to answer a question using those documents.
The searcher receives rewards based on whether the reasoner answers correctly.

Over time, the searcher adapts to fill the reasoner’s knowledge gaps, optimizing retrieval for that particular core model.
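The three steps above can be sketched as a reward function in which the searcher is scored through the frozen reasoner. This is a simplified illustration under stated assumptions: `searcher_reward` and `stub_reasoner` are hypothetical names, the reasoner is a stub rather than an LLM, and exact match is only one possible correctness check. It is not the s3 implementation.

```python
from typing import Callable

# Assumption: the frozen reasoner is exposed as (question, documents) -> answer.
Reasoner = Callable[[str, list[str]], str]


def searcher_reward(
    question: str,
    gold_answer: str,
    retrieved_docs: list[str],
    frozen_reasoner: Reasoner,
) -> float:
    """Reward the searcher only if the frozen reasoner answers correctly with its documents."""
    answer = frozen_reasoner(question, retrieved_docs)
    return 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0


def stub_reasoner(question: str, docs: list[str]) -> str:
    # Stand-in for a large frozen LLM: it answers only when the evidence is present.
    for doc in docs:
        if "Ottawa" in doc:
            return "Ottawa"
    return "unknown"


question = "What is the capital of Canada?"
helpful = ["Ottawa is the capital city of Canada."]
unhelpful = ["Toronto is the largest city in Canada."]

print(searcher_reward(question, "Ottawa", helpful, stub_reasoner))
print(searcher_reward(question, "Ottawa", unhelpful, stub_reasoner))
```

Only the searcher's parameters would be updated from this signal; the reasoner is called but never trained.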

In practice, complex enterprise systems may blend these modes. A deep research pipeline, for example, might use T1-style generic retrievers, T2-style specialized search agents tuned to a particular LLM, and A1-style reasoning models trained with execution feedback—all orchestrated into one system.
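As a compact reference, the four strategies can be written down as a small lookup table. The field names and wording below are an illustrative summary of the descriptions above, not a data model defined by the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    code: str      # label used in the framework
    adapts: str    # "agent" or "tool"
    signal: str    # where the training signal comes from
    example: str   # example named in this article


STRATEGIES = [
    Strategy("A1", "agent", "verifiable tool execution", "DeepSeek-R1"),
    Strategy("A2", "agent", "quality of the agent's final answer", "Search-R1"),
    Strategy("T1", "tool", "trained independently of any agent", "dense retrievers for RAG"),
    Strategy("T2", "tool", "outputs of a specific frozen agent", "s3 searcher"),
]


def keeps_core_frozen(strategy: Strategy) -> bool:
    # Tool adaptation leaves the core model's weights untouched.
    return strategy.adapts == "tool"


frozen_core = [s.code for s in STRATEGIES if keeps_core_frozen(s)]
print(frozen_core)  # ['T1', 'T2']
```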

Cost, Generalization, and Modularity: What the Data Shows

For enterprise decision-makers, the key question is not just “what’s theoretically elegant?” but “what’s cost-effective, robust, and maintainable?” The study surfaces several important tradeoffs across the four strategies.

Cost vs. flexibility

Agent adaptation (A1/A2) gives maximum flexibility because it rewires the agent itself, but it is expensive:

Search-R1 (A2) needed 170,000 examples to internalize search capabilities.
This implies large compute budgets and the need for substantial, high-quality datasets.
The payoff is that such agents can be much more efficient at inference time, since they can often be smaller than generalist models while performing strongly on their target tasks.

Tool adaptation (T1/T2) is typically far more data- and compute-efficient:

The s3 system (T2) trained its small searcher on only 2,400 examples, roughly 70x fewer than Search-R1, yet achieved comparable performance on the evaluated tasks.
The tradeoff is higher inference-time cost and latency, because s3 relies on coordination between the small searcher and a larger reasoning model.
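The "roughly 70x" figure follows directly from the two example counts quoted above. A quick check, with variable names chosen here for illustration:

```python
search_r1_examples = 170_000  # A2: Search-R1 training examples, as reported above
s3_examples = 2_400           # T2: s3 searcher training examples, as reported above

ratio = search_r1_examples / s3_examples
print(f"{ratio:.1f}x")  # 70.8x
```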

Generalization

Another key finding involves generalization across domains.

A1/A2 agents risk overfitting to their training tasks. Search-R1, for instance, excelled on the tasks it was built for, but when evaluated on specialized medical question answering, it reached only 71.8% accuracy.
By contrast, the T2-based s3 system—combining a general-purpose frozen agent with a trained retrieval tool—achieved 76.6% accuracy on the same medical tasks.

This suggests that combining a broad, frozen model with specialized tools can preserve general world knowledge while still optimizing for specific domains. However, the framework also warns that T1/T2 strategies are only as strong as the underlying frozen agent. If the core model fundamentally cannot handle a specialized task, no amount of tool tuning will fix that.

Modularity

Modularity is where T1/T2 approaches clearly shine.

With tool-centric designs, components can be hot-swapped. For example, the Memento framework optimizes a memory module that retrieves past cases. If requirements change—say, new compliance constraints—you can update or replace just the memory component without touching the core planner or reasoner.
A1/A2 systems, in contrast, tend to be monolithic. Adding a new skill via fine-tuning can trigger catastrophic forgetting, where performance on previously learned abilities (for example, math) degrades because internal weights were overwritten to improve coding or other specialized skills.
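The hot-swap idea can be illustrated with a small sketch in which the memory component is replaced while the planner stays untouched. Everything here is hypothetical: the `CaseMemory` interface, both memory classes, and the stub planner are assumptions for illustration and do not reflect Memento's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class CaseMemory(Protocol):
    def recall(self, task: str) -> list[str]: ...


class KeywordMemory:
    """Returns past cases that share at least one word with the task."""

    def __init__(self, cases: list[str]) -> None:
        self.cases = cases

    def recall(self, task: str) -> list[str]:
        words = set(task.lower().split())
        return [c for c in self.cases if words & set(c.lower().split())]


class ComplianceMemory:
    """Wraps another memory and drops cases containing restricted terms."""

    def __init__(self, inner: CaseMemory, banned_terms: list[str]) -> None:
        self.inner = inner
        self.banned_terms = banned_terms

    def recall(self, task: str) -> list[str]:
        return [
            c for c in self.inner.recall(task)
            if not any(term in c.lower() for term in self.banned_terms)
        ]


@dataclass
class AgentSystem:
    planner: Callable[[str, list[str]], str]  # frozen core, never retrained here
    memory: CaseMemory                        # swappable component

    def run(self, task: str) -> str:
        return self.planner(task, self.memory.recall(task))


def frozen_planner(task: str, cases: list[str]) -> str:
    return f"{task} | using {len(cases)} past case(s)"


base = KeywordMemory([
    "refund approved for customer with ssn on file",
    "refund denied after 30 days",
])
system = AgentSystem(planner=frozen_planner, memory=base)
before = system.run("process refund")

# New compliance rule: swap only the memory component.
system.memory = ComplianceMemory(base, banned_terms=["ssn"])
after = system.run("process refund")
print(before)
print(after)
```

The planner function is the same object before and after the swap, which is the practical meaning of modularity in a tool-centric design.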

For enterprises that need long-lived systems supporting evolving requirements, this modularity argument can be just as important as raw task performance.

A Practical Ladder for Enterprise Strategy

The study proposes treating these four strategies as a progressive ladder—a roadmap for how enterprises can mature their agentic AI capabilities while managing risk and cost.

1. Start with T1 (agent-agnostic tools)

For most organizations, the recommended entry point is simple: take a strong, frozen LLM (such as a leading commercial model) and surround it with off-the-shelf tools.

Use generic dense retrievers for RAG.
Connect to external systems via standardized connectors such as Model Context Protocol (MCP) interfaces.
Avoid training any models at this stage.

This is ideal for prototyping, broad internal use cases, and applications where requirements are still moving. It delivers significant value with minimal investment and operational complexity.

2. Move to T2 (agent-supervised tools) for efficiency and specialization

Once pain points emerge—such as an LLM struggling to use generic tools effectively—the next step is not necessarily to retrain the core model. Instead, organizations can:

Train small, specialized sub-agents (e.g., searchers, memory managers) that learn from the frozen LLM’s behavior.
Use these sub-agents to filter, rank, or format data in a way that best supports the main agent.
Apply this approach to proprietary enterprise data and high-volume, cost-sensitive workloads.

This preserves the robustness and generality of the core model, while improving performance and cost profile for specific workflows.

3. Use A1 (tool execution signaled) for technical specialization

When the bottleneck is the model’s ability to operate tools correctly—such as generating valid code, crafting precise SQL, or calling proprietary APIs—A1 becomes attractive.

Train smaller models with verifiable execution feedback for your specific technical domains.
Turn these into specialists that deeply understand the syntax and mechanics of your tools.
Then, plug those specialists back into a broader system, for example as T1-style plugins to a generalist LLM.

This approach creates highly capable technical agents without requiring you to retrain your entire general-purpose model stack.

4. Reserve A2 (agent output signaled) as a last resort

The study characterizes A2—large-scale, end-to-end training of monolithic agents on complex strategies and self-correction—as the “nuclear option.”

It is resource-intensive, requiring large datasets and significant compute.
It is most suitable when you truly need a model that internally encodes complex decision-making strategies rather than relying on external tools.
For standard enterprise applications, the authors suggest this level of customization is rarely necessary.

In other words, most organizations can get very far, often far enough, without ever training their own large agentic model end to end.
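The ladder can be read as a simple decision rule keyed on the current bottleneck. The sketch below encodes the four rungs as described above; the `Bottleneck` categories and the `recommend` function are a paraphrase for illustration, not an algorithm given in the study.

```python
from enum import Enum


class Bottleneck(Enum):
    NONE_YET = "prototyping, or requirements are still moving"
    TOOL_FIT = "the frozen LLM struggles to use generic tools effectively"
    TOOL_MECHANICS = "the model cannot operate tools correctly (code, SQL, APIs)"
    INTERNAL_STRATEGY = "the model must internally encode complex decision-making"


# One rung per bottleneck, in the order the ladder recommends climbing.
LADDER = {
    Bottleneck.NONE_YET: "T1",
    Bottleneck.TOOL_FIT: "T2",
    Bottleneck.TOOL_MECHANICS: "A1",
    Bottleneck.INTERNAL_STRATEGY: "A2",
}


def recommend(bottleneck: Bottleneck) -> str:
    """Return the strategy the ladder suggests for a given bottleneck."""
    return LADDER[bottleneck]


for b in Bottleneck:
    print(f"{recommend(b)}: {b.value}")
```

In practice the choice also depends on budget and risk appetite, so a table like this is a starting point for discussion rather than a substitute for judgment.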

What This Means for Enterprise AI Roadmaps

The broader shift reflected in this framework is away from chasing a single, giant “perfect” model and toward building smart ecosystems of tools around a stable core.

For enterprise AI leaders, that translates into several practical takeaways:

Treat the core LLM as a long-lived platform, not a constantly retrained artifact.
Invest incrementally in tools and sub-agents (T1/T2) where they clearly address bottlenecks.
Use A1 and A2 selectively for high-value, tightly scoped domains where internalizing capabilities into the model is justified.

Ultimately, the study’s message is that the best path to agentic AI is not always to build a bigger brain. It is often more effective, and more economical, to give a capable brain better tools, tuned thoughtfully to your organization’s needs and constraints.

Cary Huang

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.