🔍 LinkedIn's LLM-Powered Semantic Search Architecture

LinkedIn Engineering   ⭐⭐⭐⭐⭐   27,232 chars
#LLM #Search #SemanticSearch #Embeddings #Ranking #Inference

Why This Matters

LinkedIn rebuilt its search infrastructure around LLMs and reports serving millions of queries per second. It is one of the most detailed write-ups of a production LLM search system I've seen.

Key Innovation: LLMs are not used only for semantic understanding. The system is a complete pipeline: query understanding, embedding-based retrieval, cross-encoder ranking, and multi-teacher distillation that yields a 0.6B-parameter SLM reported to outperform its 7B teacher.

Architecture Overview

Three-stage pipeline:

  1. Query Understanding: Fine-tuned LLMs (1.5B to 4B parameters) for intent classification, facet extraction, and profile-aware rewriting
  2. Embedding-Based Retrieval (EBR): GPU-powered exhaustive vector search on CUDA, using a dual-tower bi-encoder
  3. Cross-Encoder Ranking: Small Language Model (0.6B params) for final relevance scoring
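
To make stage 2 concrete, here is a minimal sketch of exhaustive (brute-force) retrieval over bi-encoder embeddings. It runs on the CPU in plain Python with toy 3-dimensional vectors, whereas the production system does this on GPUs. The function names, the toy data, and the choice of cosine similarity are assumptions for illustration, not details from the write-up.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity.

    Assumes a non-zero vector; a zero vector would divide by zero.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def exhaustive_search(query_vec, doc_vecs, k):
    """Score every document against the query and return the top-k (doc_id, score)."""
    q = normalize(query_vec)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical embeddings standing in for the document tower's output.
docs = {
    "job_a": [1.0, 0.0, 0.0],
    "job_b": [0.0, 1.0, 0.0],
    "job_c": [0.7, 0.7, 0.0],
}

# Hypothetical query-tower output; no index is consulted, every doc is scored.
top = exhaustive_search([1.0, 0.1, 0.0], docs, k=2)
print([doc_id for doc_id, _ in top])
```

Exhaustive search trades compute for recall: there is no approximate index to tune or rebuild, which is one plausible reason to run it on GPUs where the scoring is a dense matrix product.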

Key Technical Insights

1. Multi-Teacher Distillation

They distill a 7B-parameter teacher model into a 0.6B student that is small enough to serve millions of queries per second. The student learns from multiple teachers (relevance and engagement prediction) using a KL-divergence loss.
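
One way to read "multiple teachers with a KL divergence loss" is a weighted sum of KL(teacher || student) terms, one per objective. The sketch below assumes that reading. The per-objective student heads, the weights, and the temperature are all hypothetical choices for illustration, not values from the write-up.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def multi_teacher_loss(student_logits, teacher_logits, weights, temperature=1.0):
    """Weighted sum of KL(teacher || student), one term per teacher."""
    loss = 0.0
    for name, t_logits in teacher_logits.items():
        p_teacher = softmax(t_logits, temperature)
        p_student = softmax(student_logits[name], temperature)
        loss += weights[name] * kl_divergence(p_teacher, p_student)
    return loss


# Hypothetical logits: a 3-grade relevance head and a binary engagement head.
teachers = {"relevance": [2.0, 0.5, -1.0], "engagement": [0.2, 1.5]}
student = {"relevance": [1.5, 0.7, -0.8], "engagement": [0.0, 1.0]}
weights = {"relevance": 0.7, "engagement": 0.3}  # assumed mixing weights

loss = multi_teacher_loss(student, teachers, weights, temperature=2.0)
print(f"distillation loss: {loss:.6f}")
```

The loss is zero only when the student reproduces every teacher's distribution, so the weights decide which objective the student sacrifices first when it cannot match both.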

2. Context Compression Tricks

  • Job Summarization: Use a 1.7B LLM to summarize job descriptions offline (median 900 tokens → concise summaries)
  • Embedding Compression: Replace most job description text with a single-token embedding from an encoder LLM
  • Result: roughly 76x throughput improvement (290 → 22,000 items/sec/GPU) with minimal quality loss

3. LLM-Based Quality Measurement

They use LLM judges to grade search results at scale: tens of millions of query-document pairs daily. These judges are trained to align with product managers' "golden grades", with agreement measured by Cohen's Kappa (≥ 0.8).
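
Cohen's Kappa corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e). Below is a small sketch of the unweighted form applied to made-up grades. The grade scale and both label lists are hypothetical, and whether the write-up uses the unweighted or a weighted variant is not stated.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's Kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must grade the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades on a 0-2 scale for ten query-document pairs.
golden = [2, 2, 1, 0, 0, 1, 2, 0, 1, 1]  # product manager grades
judge = [2, 2, 1, 0, 0, 1, 2, 0, 1, 0]   # LLM judge grades

kappa = cohens_kappa(golden, judge)
print(f"kappa = {kappa:.4f}, meets 0.8 bar: {kappa >= 0.8}")
```

In this toy case the raters agree on 9 of 10 items (p_o = 0.9) with chance agreement p_e = 0.33, giving a kappa of about 0.85.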

4. Inference Efficiency Techniques

  • Model pruning: Remove MLP neurons, attention heads, and full transformer layers
  • Workload-aware request management: Prevents any single workload from overwhelming LLM capacity
  • Score caching and a ranking-depth controller
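
Two of these ideas are easy to sketch in isolation. The write-up does not describe how either is implemented, so everything below is an assumption: a per-workload cap on in-flight requests as one possible form of workload-aware request management, and a dictionary keyed on (query, doc_id) as one possible score cache. All class and method names are hypothetical, and the code is single-threaded for clarity.

```python
class WorkloadLimiter:
    """Caps in-flight requests per workload so no workload can take all capacity."""

    def __init__(self, limits):
        self.limits = dict(limits)
        self.in_flight = {name: 0 for name in limits}

    def try_acquire(self, workload):
        """Admit the request if the workload is under its cap, else reject it."""
        if self.in_flight[workload] >= self.limits[workload]:
            return False
        self.in_flight[workload] += 1
        return True

    def release(self, workload):
        """Mark one request of this workload as finished."""
        self.in_flight[workload] = max(0, self.in_flight[workload] - 1)


class ScoreCache:
    """Memoizes ranker scores so a repeated (query, doc) pair skips inference."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.misses = 0

    def score(self, query, doc_id):
        key = (query, doc_id)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.score_fn(query, doc_id)
        return self.cache[key]


# Hypothetical workloads sharing one LLM serving pool.
limiter = WorkloadLimiter({"search": 2, "batch_eval": 1})
admitted = [limiter.try_acquire("search") for _ in range(3)]

# Stand-in scorer; a real one would call the ranking model.
cache = ScoreCache(lambda query, doc_id: float(len(query) + len(doc_id)))
first = cache.score("ml engineer", "job_a")
second = cache.score("ml engineer", "job_a")
print(admitted, cache.misses)
```

A production version would also need thread safety, cache eviction, and invalidation when the model or the document changes.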

Metrics That Matter

Setup                            NDCG@10   Throughput (items/sec/GPU)
SLM with raw text                0.9432    290
Pruned SLM + summarized text     0.9218    2,200
SLM with embedding compression   0.9239    22,000
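
The trade-off in these rows is easier to see as ratios against the raw-text baseline. The short script below only does arithmetic on the three rows above; the variable and function names are mine.

```python
# (setup, NDCG@10, throughput in items/sec/GPU), copied from the table above.
rows = [
    ("SLM with raw text", 0.9432, 290),
    ("Pruned SLM + summarized text", 0.9218, 2200),
    ("SLM with embedding compression", 0.9239, 22000),
]


def compare_to_baseline(rows):
    """Return (setup, speedup, NDCG delta) for each row after the first."""
    _, base_ndcg, base_tput = rows[0]
    return [
        (name, tput / base_tput, ndcg - base_ndcg) for name, ndcg, tput in rows[1:]
    ]


results = compare_to_baseline(rows)
for name, speedup, delta in results:
    print(f"{name}: {speedup:.1f}x throughput, NDCG@10 change {delta:+.4f}")
```

By these figures, summarization with pruning gives about a 7.6x speedup and embedding compression about a 75.9x speedup, each at a cost of roughly 0.02 NDCG@10.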

Lessons for Building Production LLM Systems

  1. Distillation is essential: Large teachers → small efficient students for production
  2. Context compression matters: Summaries + embeddings dramatically reduce inference cost
  3. Quality measurement at scale: LLM judges can evaluate millions of results daily
  4. Multi-objective training: Train for both relevance AND engagement