Phi-4-reasoning-vision: Compact Multimodal Reasoning
5/5 — Best-in-class efficiency for multimodal reasoning
🎯 Core Value: A 15B-parameter open-weight model that pushes the Pareto frontier of accuracy versus compute cost, staying competitive with models that require 10x more compute.
Key Specifications
- Parameters: 15B (open-weight)
- Training Data: 200B tokens (vs 1T+ for competitors)
- Context Window: 128K tokens
- Availability: Microsoft Foundry, HuggingFace, GitHub
Architecture Insights
Mid-Fusion Architecture
Mid-fusion was chosen over early fusion as a practical trade-off between performance and resource requirements:
- Pretrained vision encoder → visual tokens → projected into LLM embedding space
- Leverages components already trained on trillions of tokens
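The data flow in the bullets above can be sketched in a few lines of dependency-free Python. This is a toy illustration only: the single linear projection, the tiny dimensions, and the choice to place visual tokens before the text embeddings are all assumptions made for the example, not details confirmed for Phi-4-reasoning-vision (real projectors are often small MLPs). The function names `project_visual_tokens` and `build_llm_input` are hypothetical.

```python
import random

def project_visual_tokens(visual_tokens, weight, bias):
    """Linearly map each visual token (dim d_vision) to the LLM embedding dim.

    `weight` is stored as d_llm rows, each of length d_vision.
    """
    projected = []
    for token in visual_tokens:
        row = [
            sum(t * w for t, w in zip(token, weight_row)) + b
            for weight_row, b in zip(weight, bias)
        ]
        projected.append(row)
    return projected

def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Mid-fusion: projected visual tokens share one sequence with text embeddings."""
    # Assumption: visual tokens are simply prepended to the text embeddings.
    return project_visual_tokens(visual_tokens, weight, bias) + text_embeddings

# Toy sizes, chosen only for illustration.
D_VISION, D_LLM = 4, 6
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM
visual_tokens = [[rng.uniform(-1, 1) for _ in range(D_VISION)] for _ in range(3)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(5)]

sequence = build_llm_input(visual_tokens, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 8 tokens, each of dimension 6
```

The point of the sketch is that the vision encoder and the LLM stay separate, pretrained components; only the projection has to bridge the two embedding spaces.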
Vision Encoder: SigLIP-2 Naflex
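Key finding: Dynamic-resolution vision encoders perform best, especially on high-resolution data. In the table below, raising the token budget from 2048 to 3600 nearly doubles the ScreenSpot-Pro score (9.2 to 17.5), while MathVista and ScreenSpot change only slightly.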
| Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro |
|---|---|---|---|---|
| Dynamic resolution (3600) | 3600 | 44.9 | 79.7 | 17.5 |
| Dynamic resolution (2048) | 2048 | 45.2 | 81.5 | 9.2 |
| Multi-crop with S2 | 2048 | 43.4 | 79.1 | 10.6 |
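To make the "Max Tokens" column concrete, here is a minimal sketch of how a dynamic-resolution encoder might fit an image into a token budget while keeping its aspect ratio. It is not the actual SigLIP-2 NaFlex preprocessing: the function `patch_grid`, the 16-pixel patch size, and the scaling rule are assumptions made for illustration.

```python
import math

def patch_grid(width, height, max_tokens, patch_size=16):
    """Return (cols, rows) of a patch grid that fits within max_tokens.

    Images that already fit keep their native grid; larger images are
    scaled down uniformly so the aspect ratio is roughly preserved.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    if cols * rows <= max_tokens:
        return cols, rows
    # Scale both sides by the same factor so cols * rows lands near the budget.
    scale = math.sqrt(max_tokens / (cols * rows))
    cols = max(1, math.floor(cols * scale))
    rows = max(1, math.floor(rows * scale))
    # Extreme aspect ratios can still overflow after clamping to 1; trim the long side.
    if cols * rows > max_tokens:
        if cols > rows:
            cols = max_tokens // rows
        else:
            rows = max_tokens // cols
    return cols, rows

for budget in (3600, 2048):
    cols, rows = patch_grid(1920, 1080, budget)
    print(budget, (cols, rows), cols * rows)
```

Under these assumptions a 1920x1080 screenshot keeps noticeably more patches at a 3600-token budget than at 2048, which is one plausible reason a larger budget helps most on benchmarks with dense, high-resolution screens.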
Training Data Strategy
Three data sources:
- Open-source datasets: Meticulously filtered and improved
- High-quality internal data: Domain-specific
- Targeted acquisitions: Curated external data
Key insight: Only 200B training tokens were needed (vs. 1T+ for Qwen2.5-VL, Kimi-VL, and Gemma 3).
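As a quick sanity check on the scale of that gap, the arithmetic below uses only the two figures quoted above. Because "1T+" is a lower bound, the computed share is an upper bound on Phi-4-reasoning-vision's relative data budget; the variable names are illustrative.

```python
# Figures as quoted in this summary.
phi_tokens = 200 * 10**9          # 200B tokens
competitor_tokens_min = 10**12    # "1T+" treated as a lower bound

share = phi_tokens / competitor_tokens_min
print(f"At most {share:.0%} of the competitors' training tokens")  # At most 20%
```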
Performance Highlights
- Competitive with much larger models on general VLM tasks
- Excels at math and science reasoning
- Strong computer use / GUI grounding capabilities
- Runs on modest hardware
Why It Matters
Phi-4-reasoning-vision shows that efficient models can match or exceed larger counterparts through careful architecture choices, rigorous data curation, and a deliberate mix of reasoning and non-reasoning training data.