🔍 LinkedIn's LLM-Powered Semantic Search Architecture
#LLM
#Search
#SemanticSearch
#Embeddings
#Ranking
#Inference
Why This Matters
LinkedIn rebuilt its search stack around LLMs while serving millions of queries per second. It is one of the most detailed write-ups of a production LLM search system I've seen.
Key Innovation: LLMs are not used only for semantic understanding. The full pipeline covers query understanding, embedding-based retrieval, cross-encoder ranking, and multi-teacher distillation, ending in a 0.6B-parameter SLM that outperforms the 7B teacher.
Architecture Overview
Three-stage pipeline:
- Query Understanding: Fine-tuned LLMs (1.5B to 4B parameters) for intent classification, facet extraction, and profile-aware rewriting
- Embedding-Based Retrieval (EBR): Exhaustive (brute-force) vector search on GPUs via CUDA, using a dual-tower bi-encoder
- Cross-Encoder Ranking: Small Language Model (0.6B params) for final relevance scoring
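To make the retrieval stage concrete, here is a minimal CPU sketch of exhaustive embedding search: every document vector is scored against the query vector and the top k are kept, with no approximate index. The toy vectors, the `exhaustive_top_k` name, and the use of cosine similarity are assumptions for illustration; the production system runs this on GPUs and its exact scoring function is not described here.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def exhaustive_top_k(query_vec, doc_vecs, k):
    """Score every document against the query (no ANN index) and keep the k best."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Toy 3-dimensional embeddings standing in for the two towers' outputs.
docs = {
    "job_a": [0.9, 0.1, 0.0],
    "job_b": [0.0, 1.0, 0.0],
    "job_c": [0.7, 0.7, 0.0],
}
top = exhaustive_top_k([1.0, 0.2, 0.0], docs, k=2)
print(top)
```

In a dual-tower setup the document vectors can be computed offline, so only the query tower runs at request time; the cross-encoder then rescores the short list this step returns.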
Key Technical Insights
1. Multi-Teacher Distillation
They distill a 7B-parameter teacher model into a 0.6B student that can serve millions of queries per second. The student learns from multiple teachers (relevance + engagement prediction) by minimizing a KL divergence loss against the teachers' output distributions.
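A rough sketch of what a multi-teacher KL objective can look like, in plain Python. It assumes the student has one output head per teacher, and the head names, weights, and temperature are hypothetical values chosen for illustration, not LinkedIn's actual configuration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_loss(student_heads, teacher_heads, weights, temperature=2.0):
    """Weighted sum of KL(teacher || student), one term per teacher/head pair."""
    loss = 0.0
    for name, teacher_logits in teacher_heads.items():
        teacher = softmax(teacher_logits, temperature)
        student = softmax(student_heads[name], temperature)
        loss += weights[name] * kl_divergence(teacher, student)
    return loss

# Hypothetical logits for one query-document pair.
teachers = {"relevance": [2.0, 0.5, -1.0], "engagement": [0.3, -0.3]}
student = {"relevance": [1.5, 0.7, -0.5], "engagement": [0.1, -0.1]}
weights = {"relevance": 0.7, "engagement": 0.3}  # assumed mixing weights

loss = multi_teacher_loss(student, teachers, weights)
perfect = multi_teacher_loss(teachers, teachers, weights)
print(loss, perfect)
```

The loss reaches zero only when every student head reproduces its teacher's distribution, which is what lets one small model absorb both objectives.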
2. Context Compression Tricks
- Job Summarization: Use a 1.7B LLM to summarize job descriptions offline (median 900 tokens → concise summaries)
- Embedding Compression: Replace most job description text with a single-token embedding from an encoder LLM
- Result: roughly 76x higher throughput (290 → 22,000 items/sec/GPU) for a small NDCG@10 drop (0.9432 → 0.9239)
3. LLM-Based Quality Measurement
They use LLM judges to grade search results at scale: tens of millions of query-document pairs daily. The judges are trained to align with product managers' "golden grades" (Cohen's Kappa ≥ 0.8).
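For reference, Cohen's Kappa measures agreement between two raters after subtracting the agreement expected by chance. The sketch below computes it for a judge against golden grades; the grade lists are made-up toy data, not LinkedIn's.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement).

    Undefined (division by zero) when both raters always give one identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy grades on a 0-2 relevance scale (illustrative only).
golden = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2]
judge = [2, 2, 1, 0, 1, 2, 0, 1, 1, 2]

kappa = cohens_kappa(golden, judge)
print(round(kappa, 3), kappa >= 0.8)
```

Here the judge disagrees on one of ten items, which gives 0.9 raw agreement but a Kappa of about 0.85 once chance agreement (0.34) is removed.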
4. Inference Efficiency Techniques
- Model pruning: Remove MLP neurons, attention heads, and full transformer layers
- Workload-aware request management: Prevents any single workload from overwhelming LLM capacity
- Score caching and ranking-depth controller
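The serving-side ideas in this list can be sketched in a few lines. Everything below is a hypothetical illustration: the class names, the LRU eviction policy, the per-workload in-flight caps, and the linear depth policy are assumptions, since the notes above do not specify how LinkedIn implements them.

```python
from collections import OrderedDict

class ScoreCache:
    """LRU cache keyed by (query, doc_id) so repeated pairs skip the ranker."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_score(self, query, doc_id, score_fn):
        key = (query, doc_id)
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        score = score_fn(query, doc_id)
        self._items[key] = score
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return score

class WorkloadLimiter:
    """Caps in-flight requests per workload so one caller cannot take all capacity."""
    def __init__(self, max_in_flight):
        self.max_in_flight = dict(max_in_flight)
        self.in_flight = {name: 0 for name in max_in_flight}

    def try_acquire(self, workload):
        if self.in_flight[workload] >= self.max_in_flight[workload]:
            return False
        self.in_flight[workload] += 1
        return True

    def release(self, workload):
        self.in_flight[workload] -= 1

def ranking_depth(candidates, load_fraction, max_depth=100, min_depth=20):
    """Assumed policy: rank fewer candidates as load rises from 0.0 to 1.0."""
    depth = int(max_depth - (max_depth - min_depth) * load_fraction)
    return candidates[:max(min_depth, min(max_depth, depth))]

# Usage with a stand-in ranker that records how often it is called.
calls = []
def fake_ranker(query, doc_id):
    calls.append((query, doc_id))
    return 0.5

cache = ScoreCache(capacity=2)
cache.get_or_score("ml engineer", "job_a", fake_ranker)
cache.get_or_score("ml engineer", "job_a", fake_ranker)  # served from cache

limiter = WorkloadLimiter({"job_search": 1, "people_search": 2})
first = limiter.try_acquire("job_search")
second = limiter.try_acquire("job_search")  # rejected: cap of 1 reached
```

All three pieces trade a little freshness or recall for predictable GPU load, which matters more as the ranker gets more expensive per item.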
Metrics That Matter
| Setup | NDCG@10 | Throughput (items/sec/GPU) |
|---|---|---|
| SLM with raw-text | 0.9432 | 290 |
| Pruned SLM + Summarized Text | 0.9218 | 2,200 |
| SLM with Embedding Compression | 0.9239 | 22,000 |
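
A quick check of the trade-off using only the figures in the table, with the raw-text SLM as the baseline. The variable names are mine.

```python
# (NDCG@10, throughput in items/sec/GPU), copied from the table above.
rows = {
    "SLM with raw-text": (0.9432, 290),
    "Pruned SLM + Summarized Text": (0.9218, 2200),
    "SLM with Embedding Compression": (0.9239, 22000),
}

base_ndcg, base_tput = rows["SLM with raw-text"]
summary = {}
for name, (ndcg, tput) in rows.items():
    speedup = tput / base_tput
    ndcg_drop_pct = 100 * (base_ndcg - ndcg) / base_ndcg
    summary[name] = (speedup, ndcg_drop_pct)
    print(f"{name}: {speedup:.1f}x throughput, {ndcg_drop_pct:.2f}% NDCG@10 drop")
```

Relative to the baseline, summarization plus pruning gives about 7.6x throughput and embedding compression about 75.9x, each at a relative NDCG@10 cost of roughly 2%.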
Lessons for Building Production LLM Systems
- Distillation is essential: Large teachers → small efficient students for production
- Context compression matters: Summaries + embeddings dramatically reduce inference cost
- Quality measurement at scale: LLM judges can evaluate millions of results daily
- Multi-objective training: Train for both relevance AND engagement