Inside LinkedIn’s Next-Gen Recommender: Why Prompting Failed and Small, Distilled Models Won

LinkedIn has spent more than 15 years building large-scale AI-powered recommendation systems for jobs, people, and content. As the company moved to design a “next-gen” recommendation stack, it confronted a question that many machine learning leaders are asking: should you simply prompt a powerful off-the-shelf LLM, or invest in custom models tailored to your domain?

For Erran Berger, VP of product engineering at LinkedIn, the answer was clear. In a recent episode of the Beyond the Pilot podcast, he explained that prompting alone was “a non-starter” for the company’s next generation of recommenders. Instead, LinkedIn built a system around a detailed product policy, a large teacher model, and a set of distilled, smaller models tuned for both policy adherence and click prediction.

The result is not just a new recommendation stack, but a reusable “cookbook” for how LinkedIn approaches AI products more broadly—spanning model design, evaluation, and cross-functional collaboration.

Why prompting wasn’t enough for LinkedIn’s recommendations

When LinkedIn set out to upgrade its recommendations for job seekers and recruiters, the team considered how best to leverage the wave of powerful general-purpose LLMs. Berger’s assessment: prompting such models directly for real-time recommendations would not meet LinkedIn’s bar for accuracy, latency, or efficiency.

“There was just no way we were gonna be able to do that through prompting,” he said. “We didn’t even try that for next-gen recommender systems because we realized it was a non-starter.”

For a production recommender that interprets job queries, candidate profiles, and job descriptions in real time, several constraints collide:

Tight latency budgets: Recommendations have to appear fast enough to feel instantaneous to users.
High-volume traffic: LinkedIn serves recommendations across a massive user base, amplifying the cost of per-request compute.
Predictable behavior: Recommendations must be aligned with product policy and business goals, not just “plausible” text outputs.

Prompting a general-purpose LLM buys flexibility at the expense of latency, control, and cost. For LinkedIn’s scenario (ranking and matching items, optimizing for engagement signals, and enforcing nuanced policies), that trade was not acceptable. Instead of relying on ad hoc prompts, the team chose to encode its product rules and preferences directly into models via fine-tuning and distillation.

This decision set the stage for a new architecture: start with a large model to learn LinkedIn’s product policy deeply, then compress that knowledge into smaller, more efficient models optimized for real-world recommender workloads.

From massive teacher to compact student: how LinkedIn structured its LLM stack

The first step in LinkedIn’s approach was to build a large, policy-aware teacher model. The team began with a 7-billion-parameter LLM and fine-tuned it to reflect how LinkedIn believes job and candidate matching should work.

Key elements of that process:

Start from a big model: The 7B-parameter teacher had enough capacity to internalize LinkedIn’s nuanced criteria for evaluating job–profile pairs.
Ground it in policy: Rather than relying on emergent behavior from generic training data, the model was driven by a carefully crafted product policy document and a curated dataset.
Use it as a foundation: This large model was never intended as the final production workhorse. It served as the “source of truth” to supervise smaller, more efficient models.

Once trained, this policy-focused teacher became the reference model from which subsequent teacher and student models were derived. Over time, LinkedIn moved from that 7B teacher down to models in the hundreds of millions of parameters—much better suited to real-time recommendation scenarios.

According to Berger, this hierarchy of models, all linked back to a shared policy and evaluation framework, now acts as a repeatable “cookbook” that LinkedIn can apply to other AI products.

Building the product policy and ‘golden’ dataset

Central to LinkedIn’s system is not just the model architecture, but the detailed product policy and evaluation methodology that guide it.

Working closely with the product management team, engineers helped transform domain expertise into a structured document: a 20–30 page product policy that scores job descriptions and member profiles “across many dimensions.” This document captures how LinkedIn wants its recommender to judge the quality and relevance of job–profile pairs.

Berger emphasized that this was not a one-shot effort. “We did many, many iterations on this,” he said. The policy evolved as teams refined the way they expressed and operationalized product requirements.
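
One way to picture a policy that scores pairs “across many dimensions” is as a weighted rubric. The sketch below is purely illustrative: the dimension names, the weights, and the 0 to 4 rating scale are assumptions made for the example, not details of LinkedIn’s policy document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str          # hypothetical dimension name
    weight: float      # relative importance; assumed, not from the source
    max_score: int = 4 # assumed rating scale of 0..4


# Hypothetical rubric; the real policy is a 20-30 page document.
RUBRIC = [
    Dimension("title_match", 0.4),
    Dimension("skills_overlap", 0.4),
    Dimension("seniority_fit", 0.2),
]


def overall_score(ratings: dict[str, int], rubric=RUBRIC) -> float:
    """Combine per-dimension ratings into a single score in [0, 1]."""
    total_weight = sum(d.weight for d in rubric)
    score = 0.0
    for d in rubric:
        rating = ratings[d.name]
        if not 0 <= rating <= d.max_score:
            raise ValueError(f"{d.name} rating {rating} out of range")
        # Normalize each rating to [0, 1] before weighting.
        score += d.weight * rating / d.max_score
    return score / total_weight
```

Writing the rubric down as data makes each policy iteration a small, reviewable change that product managers and engineers can both read.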

Alongside the policy, LinkedIn assembled a “golden dataset” containing thousands of pairs of queries and profiles. This dataset served as a benchmark for how the model should behave. The team then leveraged ChatGPT during data generation and experimentation, using it to:

Apply the policy to the golden pairs through prompting.
Have the model learn to score pairs according to the policy.
Generate a much larger synthetic dataset aligned with those scoring patterns.

This synthetic expansion step allowed LinkedIn to train the 7B-parameter teacher model on a far richer set of examples than the original golden set alone would permit—while still being anchored to human-validated policy judgments.
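
The golden-set-then-expand workflow can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_labeler` is a stand-in for an LLM prompted with the policy, and the agreement threshold that gates the expansion is a design choice added for the example, not something the article describes.

```python
from typing import Callable, Iterable

Pair = tuple[str, str]            # (query, profile) text
Labeler = Callable[[Pair], int]   # returns a policy score, e.g. 0..4


def agreement(golden: dict[Pair, int], labeler: Labeler) -> float:
    """Fraction of golden pairs where the labeler matches the human score."""
    hits = sum(1 for pair, score in golden.items() if labeler(pair) == score)
    return hits / len(golden)


def expand(golden: dict[Pair, int], unlabeled: Iterable[Pair],
           labeler: Labeler, min_agreement: float = 0.9) -> list[tuple[Pair, int]]:
    """Label new pairs only if the labeler reproduces the golden set well enough."""
    if agreement(golden, labeler) < min_agreement:
        raise RuntimeError("labeler disagrees with golden set; revise prompt or policy")
    return [(pair, labeler(pair)) for pair in unlabeled]


# Toy stand-in for an LLM prompted with the policy (hypothetical logic).
def toy_labeler(pair: Pair) -> int:
    query, profile = pair
    return 4 if query.lower() in profile.lower() else 0


golden = {
    ("data engineer", "Senior Data Engineer, Spark"): 4,
    ("nurse", "Backend developer, Go"): 0,
}
synthetic = expand(golden, [("designer", "Product Designer at Acme")], toy_labeler)
```

Checking the labeler against human-validated pairs before trusting it on unlabeled data is what keeps the larger synthetic set anchored to the policy.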

The result is a model that doesn’t just approximate “relevance” in a generic sense, but encodes LinkedIn’s specific view of what makes a job recommendation high-quality and appropriate.

Multi-teacher distillation: combining policy and click prediction

A single policy-aware teacher, however, is not enough for a full recommender system. As Berger points out, “At the end of the day, it’s a recommender system, and we need to do some amount of click prediction and personalization.”

To address that, LinkedIn introduced an additional teacher model, creating a multi-teacher distillation setup:

Teacher 1: A 7B-parameter model focused on product policy—what “good” matches look like according to LinkedIn’s rules and preferences.
Teacher 2: A second model oriented toward click prediction—capturing behavioral signals that indicate what users are likely to engage with.

These two teacher models then supervised the training of a much smaller, roughly 1.7-billion-parameter student model used for training and experimentation. The student was run through “many, many training runs” and was optimized “at every point” to minimize quality loss, according to Berger.

The multi-teacher approach gave LinkedIn several advantages:

Affinity to policy: The student inherits a strong alignment with the product policy teacher, ensuring recommendations adhere to LinkedIn’s standards.
Click prediction: By also learning from the click-focused teacher, the student can optimize for real engagement metrics.
Modularity: Policy and click objectives can be iterated independently at the teacher level, while the student consumes both as combined supervision.

Berger likens this to a chat agent trained by two different teachers: one ensures factual accuracy; the other shapes tone and style. Those are “very different, yet critical, objectives.” By mixing them, you can produce a model that balances both—and continue improving each teacher independently.

“By now mixing them, you get better outcomes, but also iterate on them independently,” he said. “That was a breakthrough for us.”
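
To make the idea concrete, here is a minimal sketch of a two-teacher distillation loss in plain Python. It is a generic textbook-style formulation, not LinkedIn’s training objective: the function names, the `alpha` mixing weight, and the temperature are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def binary_cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a soft target probability and a prediction."""
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))


def multi_teacher_loss(student_policy_logits, teacher_policy_logits,
                       student_click_prob, teacher_click_prob,
                       alpha=0.5, temperature=2.0):
    """Weighted sum of a policy-distillation term and a click-distillation term."""
    p_teacher = softmax(teacher_policy_logits, temperature)
    p_student = softmax(student_policy_logits, temperature)
    policy_term = kl_divergence(p_teacher, p_student)   # match the policy teacher
    click_term = binary_cross_entropy(teacher_click_prob, student_click_prob)
    return alpha * policy_term + (1 - alpha) * click_term
```

Because each teacher contributes its own term, either teacher can be retrained or swapped without touching the other, and `alpha` controls how much weight the student gives to policy adherence versus predicted engagement.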

Why small, distilled models beat large, prompted ones in production

LinkedIn’s experience highlights a pattern many infrastructure teams are discovering: large, generic models are ideal for exploration and data generation, while smaller, distilled models are better suited for high-throughput, latency-sensitive production systems.

In this case, starting from a 7B-parameter teacher and distilling down to students with hundreds of millions of parameters offered a practical balance between capability and cost. Berger notes that the team optimized the student “at every point” to preserve as much quality as possible relative to the teachers.

From a recommender system architect’s perspective, the advantages of this strategy are clear:

Latency: Smaller models are faster to run, especially when executed many times per user session.
Throughput and cost: Reduced parameter counts scale better under LinkedIn-level traffic.
Operational control: Distilled models are trained on exactly the blend of signals (policy, clicks) you care about, rather than emergent behaviors of a general LLM.

At the same time, because these compact models are derived from rich, policy-anchored teachers, they can maintain “a lot of affinity” to the behaviors LinkedIn wants—rather than sacrificing quality for speed.
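
A back-of-the-envelope calculation shows why parameter count matters so much at serving time. The sketch uses the common rule of thumb that a dense transformer forward pass costs roughly 2 FLOPs per parameter per token. The 300M student size is a hypothetical placeholder, since the article says only “hundreds of millions,” and real latency also depends on hardware, batching, and sequence length.

```python
def forward_flops_per_token(params: float) -> float:
    """Rule of thumb for dense transformers: ~2 FLOPs per parameter per token."""
    return 2.0 * params


TEACHER_PARAMS = 7e9   # teacher size stated in the article
STUDENT_PARAMS = 3e8   # hypothetical; the article gives no exact student size

ratio = forward_flops_per_token(TEACHER_PARAMS) / forward_flops_per_token(STUDENT_PARAMS)
print(f"teacher/student compute per token: ~{ratio:.1f}x")
# prints: teacher/student compute per token: ~23.3x
```

Under these assumptions the student needs roughly one twentieth of the teacher’s compute per token, a gap that compounds when a model is invoked many times per user session.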

The underlying technique—multi-teacher distillation, guided by a strong evaluation process—has become a reusable pattern inside the company. As Berger put it, “Adopting this eval process end to end will drive substantial quality improvement of the likes we probably haven’t seen in years here at LinkedIn.”

Changing how product managers and ML engineers collaborate

The modeling breakthrough also forced a change in how teams inside LinkedIn work together. A central theme in Berger’s account is the importance of “anchoring on a product policy” and using an iterative evaluation process to continually refine it.

Historically, product managers focused on strategy and user experience, while ML engineers handled modeling choices and iteration. With policy-driven AI systems, that division no longer works. The quality of the teacher model—and by extension the entire recommender—depends on how well product expertise is translated into explicit, testable guidelines.

“Getting a really, really good product policy” is key, Berger said. That means:

Capturing nuanced domain knowledge in a single, shared document.
Iterating the policy as teams observe model behavior and evaluation results.
Using that policy as the blueprint for teacher model training.

According to Berger, this has “very different” implications for how product managers and ML engineers collaborate. Instead of PMs defining high-level goals and leaving implementation details to engineering, both groups now co-own the design of the teacher model and its evaluation criteria.

“How product managers work with machine learning engineers now is very different from anything we’ve done previously,” he said. “It’s now a blueprint for basically any AI products we do at LinkedIn.”

Operational lessons: evals, pluggable pipelines, and debugging

Beyond the model architecture and team structure, LinkedIn’s experience also surfaces several operational lessons for R&D velocity and system robustness.

In the podcast, Berger highlights that LinkedIn:

Optimized R&D for speed: The company tuned “every step of the R&D process to support velocity,” aiming to see real results in days or even hours instead of weeks. Although the specific techniques aren’t detailed, the outcome is a tighter loop between policy changes, model training, and evaluation.
Invested in pluggable, experimental pipelines: Teams are encouraged to build pipelines that make it easy to swap in different models and configurations. This supports rapid experimentation with teacher and student variants while preserving a consistent evaluation framework.
Kept traditional debugging in the loop: Despite the sophistication of LLMs and distillation techniques, LinkedIn still relies on “traditional engineering debugging.” Understanding failure modes, inspecting examples, and tightening specs remain core to shipping robust recommenders.

These practices complement the model and policy work, enabling LinkedIn to evolve its systems quickly without losing control over behavior or quality.
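
A pluggable pipeline of the kind described above might look like the following sketch, in which any scorer registered under a name is run against the same evaluation set. The registry, the scorer names, and the toy scoring logic are all hypothetical; the podcast does not describe LinkedIn’s actual tooling.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, int]        # (query, profile, expected policy score)
Scorer = Callable[[str, str], int]

REGISTRY: Dict[str, Scorer] = {}


def register(name: str):
    """Decorator that adds a scorer to the registry under the given name."""
    def wrap(fn: Scorer) -> Scorer:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("keyword_baseline")
def keyword_baseline(query: str, profile: str) -> int:
    # Toy scorer: full marks if the query text appears in the profile.
    return 4 if query.lower() in profile.lower() else 0


@register("always_zero")
def always_zero(query: str, profile: str) -> int:
    # Degenerate scorer, useful as a sanity-check floor.
    return 0


def evaluate(name: str, eval_set: List[Example]) -> float:
    """Run one registered scorer against the shared eval set; return accuracy."""
    scorer = REGISTRY[name]
    correct = sum(1 for q, p, expected in eval_set if scorer(q, p) == expected)
    return correct / len(eval_set)


EVAL_SET = [
    ("data engineer", "Senior Data Engineer", 4),
    ("nurse", "Backend developer", 0),
]
results = {name: evaluate(name, EVAL_SET) for name in REGISTRY}
```

Keeping the eval set fixed while the scorers vary is what makes results comparable across teacher and student variants.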

What other ML and recommender teams can take away

While LinkedIn’s scale and resources are unusual, several principles from this effort are broadly applicable to ML engineers, recommender system architects, and AI product leaders:

Don’t expect prompting alone to solve structured recommendation problems. For latency- and cost-sensitive recommenders that must obey strict product rules, tailored models often beat ad hoc prompts to generic LLMs.
Treat product policy as a first-class artifact. A well-iterated policy, co-authored with product managers, can anchor datasets, teacher models, and evals.
Use large models as teachers, not necessarily as serving endpoints. Start big to internalize complex signals and then distill down to smaller, deployable models.
Consider multi-teacher distillation when you have multiple objectives. Separating concerns—e.g., policy alignment and click prediction—at the teacher level can yield better, more controllable students.
Align org structure with your modeling approach. Changing how PMs and ML engineers collaborate may be as important as changing your architectures.

For LinkedIn, the combination of a rigorous product policy, a multi-teacher distillation setup, and a tight eval loop represents a step change in recommender quality. As Berger summarized, adopting this eval process end to end is expected to drive “substantial quality improvement” of a kind the company has not seen in years.

Those interested in a deeper dive into LinkedIn’s approach—including more on their R&D pipeline design and debugging practices—can watch the full Beyond the Pilot episode or subscribe via Spotify and Apple Podcasts.

Cary Huang
