🔍 LinkedIn's LLM-Powered Semantic Search Architecture

LinkedIn Engineering ⭐⭐⭐⭐⭐

#LLM #Search #SemanticSearch #Embeddings #Ranking #Inference

Why This Matters

LinkedIn rebuilt its entire search infrastructure around LLMs, serving millions of queries per second. This is one of the most detailed write-ups of a production LLM search system I've seen.

Key Innovation: They don't just use LLMs for semantic understanding. They've built a complete pipeline covering query understanding, embedding-based retrieval, cross-encoder ranking, and multi-teacher distillation, yielding a 0.6B-parameter SLM that outperforms the 7B teacher.

Architecture Overview

Three-stage pipeline:

Query Understanding: Fine-tuned LLMs (1.5B to 4B parameters) for intent classification, facet extraction, and profile-aware rewriting
Embedding-Based Retrieval (EBR): GPU-powered exhaustive vector search on CUDA, using a dual-tower bi-encoder
Cross-Encoder Ranking: Small Language Model (0.6B params) for final relevance scoring
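
To make the retrieval stage concrete, here is a minimal sketch of exhaustive (brute-force) vector search over precomputed embeddings. It is a CPU stand-in written with NumPy, not LinkedIn's CUDA implementation; the function names, the cosine-similarity scoring, and the toy vectors are assumptions for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def exhaustive_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int):
    """Score the query against every document (no approximate index), return top-k.

    query_vec:  shape (dim,), output of the query tower (assumed precomputed)
    doc_matrix: shape (num_docs, dim), outputs of the document tower
    """
    q = l2_normalize(query_vec[None, :])[0]
    d = l2_normalize(doc_matrix)
    scores = d @ q                              # one similarity per document
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k candidates
    top = top[np.argsort(-scores[top])]         # order them best-first
    return top, scores[top]


# Toy example: three 2-d "documents" and one query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ids, sims = exhaustive_top_k(np.array([1.0, 0.0]), docs, k=2)
print(ids.tolist(), sims.round(3).tolist())
```

Because the two towers encode queries and documents independently, document embeddings can be computed offline and only the query is embedded at request time. The matrix product is the part a GPU would parallelize.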

Key Technical Insights

1. Multi-Teacher Distillation

They distill a 7B parameter teacher model into a 0.6B student that can serve millions of queries per second. The student learns from multiple teachers (relevance + engagement prediction) using KL divergence loss.
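
A minimal sketch of what a multi-teacher KL distillation loss could look like. The summary does not say how the teachers are combined, so this assumes one student output head per teacher and a weighted sum of per-teacher KL terms; the head names, weights, and temperature are hypothetical.

```python
import numpy as np


def softmax(logits, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q), averaged over the batch dimension."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


def multi_teacher_distillation_loss(student_logits, teacher_logits, weights,
                                    temperature: float = 2.0) -> float:
    """Weighted sum of KL(teacher || student), one term per teacher.

    student_logits / teacher_logits: dict mapping a head name (assumed here:
    "relevance", "engagement") to logits of shape (batch, num_classes).
    """
    total = 0.0
    for name, weight in weights.items():
        teacher_probs = softmax(teacher_logits[name], temperature)
        student_probs = softmax(student_logits[name], temperature)
        total += weight * kl_divergence(teacher_probs, student_probs)
    return total


teachers = {"relevance": [[2.0, 0.5, -1.0]], "engagement": [[0.1, 0.9]]}
student = {"relevance": [[1.0, 0.8, -0.5]], "engagement": [[0.4, 0.6]]}
loss = multi_teacher_distillation_loss(
    student, teachers, {"relevance": 0.5, "engagement": 0.5})
print(loss)
```

The loss is zero when the student matches every teacher's distribution and grows as they diverge. In training, it would be minimized with respect to the student's parameters only.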

2. Context Compression Tricks

Job Summarization: Use a 1.7B LLM to summarize job descriptions offline (median 900 tokens → concise summaries)
Embedding Compression: Replace most job description text with a single-token embedding from an encoder LLM
Result: roughly 76x throughput improvement (290 → 22,000 items/sec/GPU) with minimal quality loss

3. LLM-Based Quality Measurement

They use LLM judges to grade search results at scale: tens of millions of query-document pairs daily. These judges are trained to align with product manager "golden grades" (Cohen's Kappa ≥ 0.8).
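
For reference, Cohen's Kappa measures agreement between two raters after correcting for the agreement expected by chance. The sketch below computes it for a judge against golden grades; the grade scale and the sample labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must grade the same non-empty item list")
    n = len(labels_a)
    # Fraction of items where the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each rater graded at random with their own
    # marginal grade frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both raters always use the same single grade
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades on a 0-2 scale for ten query-document pairs.
golden = [2, 2, 1, 0, 0, 1, 2, 0, 1, 2]
judge = [2, 2, 1, 0, 0, 1, 2, 0, 2, 2]
print(round(cohens_kappa(golden, judge), 3))  # 0.846
```

Here the raters agree on 9 of 10 items (0.90 observed) while 0.35 agreement is expected by chance, giving (0.90 - 0.35) / (1 - 0.35) ≈ 0.846, which would clear a 0.8 threshold.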

4. Inference Efficiency Techniques

Model pruning: Remove MLP neurons, attention heads, and full transformer layers
Workload-aware request management: Prevents any single workload from overwhelming LLM capacity
Score caching and ranking-depth controller
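
The summary does not describe how workload-aware request management is implemented. One simple way to keep a single workload from monopolizing shared LLM capacity is a per-workload cap on in-flight requests, sketched below. The class name, workload names, and limits are all hypothetical.

```python
import threading


class WorkloadLimiter:
    """Caps concurrent in-flight requests per workload (illustrative only)."""

    def __init__(self, limits: dict):
        self._limits = dict(limits)
        self._in_flight = {name: 0 for name in limits}
        self._lock = threading.Lock()

    def try_acquire(self, workload: str) -> bool:
        """Admit the request if its workload is under its cap, else reject."""
        with self._lock:
            if self._in_flight[workload] >= self._limits[workload]:
                return False
            self._in_flight[workload] += 1
            return True

    def release(self, workload: str) -> None:
        """Call when a previously admitted request finishes."""
        with self._lock:
            if self._in_flight[workload] > 0:
                self._in_flight[workload] -= 1


limiter = WorkloadLimiter({"online_ranking": 2, "offline_judging": 1})
print(limiter.try_acquire("online_ranking"))   # True
print(limiter.try_acquire("online_ranking"))   # True
print(limiter.try_acquire("online_ranking"))   # False: cap of 2 reached
print(limiter.try_acquire("offline_judging"))  # True: separate budget
```

A rejected request could be queued, retried, or served from a cheaper fallback; which of these a production system picks is a design decision outside what the summary covers.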

Metrics That Matter

Setup	NDCG@10	Throughput (items/sec/GPU)
SLM with Raw Text	0.9432	290
Pruned SLM + Summarized Text	0.9218	2,200
SLM with Embedding Compression	0.9239	22,000
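
For readers unfamiliar with the quality metric in the table, NDCG@10 compares the discounted gain of the top 10 results, in the order the system ranked them, with the gain of the best possible ordering. The sketch below uses the common exponential-gain formulation; the summary does not say which variant LinkedIn uses, and the grades are made up.

```python
import math


def dcg_at_k(relevances, k: int) -> float:
    """Discounted cumulative gain: higher grades count more near the top."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k: int = 10) -> float:
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of results in the order the ranker returned them.
ranked_grades = [3, 2, 3, 0, 1, 2]
print(ndcg_at_k(ranked_grades, k=10))
```

A score of 1.0 means the ranker already returned results in the ideal order, so the gap between 0.9432 and 0.9239 in the table reflects a small loss in ordering quality.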

Lessons for Building Production LLM Systems

Distillation is essential: Large teachers → small efficient students for production
Context compression matters: Summaries + embeddings dramatically reduce inference cost
Quality measurement at scale: LLM judges can evaluate millions of results daily
Multi-objective training: Train for both relevance AND engagement
