Source: Microsoft Research Blog | Date: March 4, 2026

Phi-4-reasoning-vision: Compact Multimodal Reasoning

⭐⭐⭐⭐⭐ 5/5 — Best-in-class efficiency for multimodal reasoning

🎯 Core Value: A 15B-parameter open-weight model that pushes the Pareto frontier of accuracy versus compute cost, staying competitive with models that require 10x more compute.

Key Specifications

Architecture Insights

Mid-Fusion Architecture

Mid-fusion was chosen over early fusion as a practical trade-off between performance and resource requirements.
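To make the mid-fusion idea concrete, here is a minimal sketch, assuming the common setup in which a separately pretrained vision encoder produces patch features that are projected into the language model's embedding space and placed alongside the text embeddings. The function names, the tiny dimensions, and the single linear projector are all hypothetical choices for illustration; this is not the model's actual code.

```python
import random

def linear(rows, weight):
    """Multiply row vectors (n x d_in) by a weight matrix (d_in x d_out)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]

def mid_fusion_inputs(patch_features, text_embeddings, projector):
    """Project vision-encoder outputs into the language model's embedding
    space, then prepend them to the text embeddings (hypothetical helper)."""
    visual_tokens = linear(patch_features, projector)
    return visual_tokens + text_embeddings

# Illustrative sizes only (assumptions, far smaller than any real model).
VISION_DIM, LM_DIM = 4, 6
rng = random.Random(0)

patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(3)]   # 3 image patches
text_embeddings = [[rng.random() for _ in range(LM_DIM)] for _ in range(2)]      # 2 text tokens
projector = [[rng.random() for _ in range(LM_DIM)] for _ in range(VISION_DIM)]   # VISION_DIM x LM_DIM

fused = mid_fusion_inputs(patch_features, text_embeddings, projector)
print(len(fused), len(fused[0]))  # sequence length and embedding width seen by the language model
```

The point of the sketch is that the language model only ever sees one sequence of same-width vectors; the vision encoder stays a separate, reusable component, which is where the resource savings relative to early fusion would come from under this assumption.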

Vision Encoder: SigLIP-2 NaFlex

Key finding: Dynamic resolution vision encoders perform best, especially on high-resolution data.

Method                     Max Tokens  MathVista  ScreenSpot  ScreenSpot-Pro
Dynamic resolution (3600)  3600        44.9       79.7        17.5
Dynamic resolution (2048)  2048        45.2       81.5        9.2
Multi-crop with S2         2048        43.4       79.1        10.6
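The "Max Tokens" column can be read as a cap on visual tokens per image. The sketch below shows one plausible way a dynamic-resolution encoder could fit an image into such a budget: keep the aspect ratio and downscale until the patch grid is small enough. The function name, the 16-pixel patch size, and the shrink loop are assumptions for illustration; the actual SigLIP-2 NaFlex preprocessing may differ.

```python
import math

def fit_to_token_budget(height, width, patch_size=16, max_tokens=3600):
    """Return (new_height, new_width, num_tokens) for an image that is
    downscaled, preserving aspect ratio, until its patch grid fits the budget.
    Images already within budget are only padded up to whole patches."""
    def grid(scale):
        rows = max(1, math.ceil(height * scale / patch_size))
        cols = max(1, math.ceil(width * scale / patch_size))
        return rows, cols

    rows, cols = grid(1.0)
    if rows * cols > max_tokens:
        # Start from the area ratio, then shrink slightly until the grid fits.
        scale = math.sqrt(max_tokens / (rows * cols))
        rows, cols = grid(scale)
        while rows * cols > max_tokens:
            scale *= 0.99
            rows, cols = grid(scale)
    return rows * patch_size, cols * patch_size, rows * cols

# A 1080 x 1920 screenshot under the two budgets from the table.
for budget in (3600, 2048):
    print(budget, fit_to_token_budget(1080, 1920, max_tokens=budget))
```

Under these assumptions a larger budget lets a full-HD screenshot keep more of its detail, which is consistent with the 3600-token setting scoring highest on ScreenSpot-Pro in the table.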

Training Data Strategy

Three data sources:
  • Open-source datasets: Meticulously filtered and improved
  • High-quality internal data: Domain-specific
  • Targeted acquisitions: Curated external data

Key insight: Only 200B training tokens were needed, versus 1T+ for Qwen2.5VL, Kimi-VL, and Gemma3.

Performance Highlights

Why It Matters

Phi-4-reasoning-vision shows that efficient models can match or exceed larger counterparts through careful architecture choices, rigorous data curation, and a deliberate mix of reasoning and non-reasoning training data.
