LLM Architecture Gallery
This page collects the architecture figures and fact sheets from Sebastian Raschka's LLM comparison articles in one place, as a reference for how different large language models are built.
Key Content
- Architecture Panels: Visual breakdowns of major LLM architectures
- Fact Sheets: Scale, context length, license, decoder type, and attention mechanism for each model
- Model Comparisons: From GPT-2 (2019) to the latest releases (2025-2026)
- Physical Poster: Available to order via Redbubble
Highlights:
- DeepSeek V3: 671B total, 37B active, Sparse MoE, MLA attention
- Llama 4 Maverick: 400B total, 17B active, 1M token context
- Qwen3: Multiple variants, from a 235B MoE (22B active) down to a 0.6B dense model
- Gemma 3: 27B with sliding-window/global attention hybrid
- OpenAI o1/o3: Reasoning-tuned models; OpenAI has not published their architecture details
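The "total" versus "active" figures in the highlights can be made concrete with a little arithmetic. The sketch below uses the DeepSeek V3 numbers quoted above; the helper name `active_fraction` is an illustrative assumption and does not come from the gallery.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a sparse MoE model's parameters used per token.

    Both arguments are parameter counts in billions.
    """
    if total_b <= 0:
        raise ValueError("total_b must be positive")
    if not 0 <= active_b <= total_b:
        raise ValueError("active_b must lie between 0 and total_b")
    return active_b / total_b


# DeepSeek V3 figures as listed in the highlights: 671B total, 37B active.
deepseek_v3 = active_fraction(total_b=671, active_b=37)
print(f"DeepSeek V3 activates {deepseek_v3:.1%} of its parameters per token")
```

A dense model has a fraction of 1.0 by this measure, which is why MoE models can carry a very large total parameter count while keeping per-token compute closer to that of a much smaller dense model.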
Why This Is Valuable
This is a compact visual reference for comparing LLM architectures side by side. It traces the shift from dense models such as GPT-2 to sparse Mixture-of-Experts (MoE) designs, covers the main attention variants (multi-head attention, MHA; grouped-query attention, GQA; multi-head latent attention, MLA), and shows how Meta, Google, DeepSeek, and OpenAI make different trade-offs.
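One trade-off behind the choice of attention mechanism is the size of the key/value cache at inference time. The following is a rough sketch, assuming a hypothetical 32-layer model with 32 query heads of dimension 128 and 16-bit cache values; the layer count, head counts, and the MLA latent size of 512 are illustrative assumptions, not figures from the gallery or from any specific model.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of keys and values cached per token for MHA- or GQA-style attention."""
    # The factor 2 covers one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def mla_cache_bytes_per_token(n_layers, latent_dim, bytes_per_value=2):
    """Bytes cached per token when each layer stores one shared compressed latent."""
    return n_layers * latent_dim * bytes_per_value


# Hypothetical configuration, chosen for round numbers (not a real model).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=N_HEADS, head_dim=HEAD_DIM)
gqa = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)
mla = mla_cache_bytes_per_token(N_LAYERS, latent_dim=512)

for name, size in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions, sharing each key/value head across four query heads (GQA) cuts the cache to a quarter of the MHA size, and caching a single compressed latent per layer (the idea behind MLA) shrinks it further. Real MLA implementations may cache additional components, so the last figure should be read as a simplification.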
Useful for: ML engineers, researchers, and anyone wanting to understand the architectural differences between major LLM releases.