🚀 New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI
Source: blogs.nvidia.com | March 11, 2026
📌 Key Highlights
- 120B parameters with only 12B active at inference (MoE architecture)
- 1M-token context window: prevents goal drift in multi-agent workflows
- 5x higher throughput than the previous Nemotron Super model
- Hybrid Architecture: Mamba layers (4x efficiency) + Transformer layers (reasoning)
- Latent MoE: Activates 4 expert specialists for the cost of 1
- Multi-Token Prediction: 3x faster inference
- NVFP4 precision on Blackwell: 4x faster than FP8 on Hopper
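To make the "120B parameters, 12B active" point concrete, here is a minimal sketch of generic top-k mixture-of-experts routing. It is not NVIDIA's Latent MoE implementation, whose details are not given here. The toy experts, router scores, and function names are all hypothetical. Only the 120B and 12B figures come from the highlights above.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, scores, k):
    """Weighted sum of the k selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(scores, k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 2.0]  # hypothetical router scores for one token
y = moe_forward(1.0, experts, scores, k=2)

# Share of parameters active per token, using the figures from the highlights.
active_fraction = 12 / 120
```

With these toy values, experts 1 and 3 tie for the top score, so each gets weight 0.5 and the other two are skipped entirely. That skipping is what keeps the active parameter count at roughly a tenth of the total.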
🎯 The Two Constraints
1. Context explosion: Multi-agent workflows generate 15x more tokens than standard chat
2. Thinking tax: Using large models for every subtask is too expensive
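As a rough worked example of the context-explosion constraint, the sketch below applies the 15x multiplier to a baseline chat size and checks the result against the 1M-token window. The baseline sizes and function names are hypothetical, chosen only for illustration. The multiplier and window size are the figures quoted in this summary.

```python
CONTEXT_WINDOW = 1_000_000    # tokens, from the highlights above
MULTI_AGENT_MULTIPLIER = 15   # "15x more tokens than standard chat"

def multi_agent_tokens(chat_tokens: int) -> int:
    """Estimate tokens generated when a chat-sized task runs as a multi-agent workflow."""
    return chat_tokens * MULTI_AGENT_MULTIPLIER

def fits_in_context(chat_tokens: int) -> bool:
    """True if the estimated multi-agent token volume fits in a single context window."""
    return multi_agent_tokens(chat_tokens) <= CONTEXT_WINDOW

# Hypothetical baselines: a 20,000-token chat and a 100,000-token chat.
small = multi_agent_tokens(20_000)    # 300,000 tokens
large = multi_agent_tokens(100_000)   # 1,500,000 tokens
```

Under these assumptions the smaller workflow fits in one window with room to spare, while the larger one overflows even a 1M-token context.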
🏢 Enterprise Adoption
- AI-Native: Perplexity, CodeRabbit, Factory, Greptile, Edison Scientific
- Enterprise: Amdocs, Palantir, Cadence, Dassault Systèmes, Siemens
- Cloud: Google Cloud Vertex AI, Oracle Cloud, AWS (coming soon), Azure (coming soon)
- Inference: Cloudflare, Fireworks AI, DeepInfra, Baseten, Together AI
📊 Performance
- Top spot on Artificial Analysis for efficiency and openness
- Powers the NVIDIA AI-Q research agent to the #1 position on both DeepResearch Bench and DeepResearch Bench II
🔓 Open Weights
Released under a permissive license. The complete methodology is published:
- 10+ trillion tokens of pre- and post-training data
- 15 training environments for RL
- Evaluation recipes
- Available on: build.nvidia.com, Perplexity, OpenRouter, Hugging Face
💡 Why It Matters
Nemotron 3 Super is designed for complex subtasks inside multi-agent systems:
- Software development: Load an entire codebase into context at once
- Financial analysis: Load thousands of pages into context, eliminating re-reasoning
- Cybersecurity: High-accuracy tool calling for autonomous security orchestration