Source: Microsoft Research Blog | Date: March 4, 2026

Phi-4-reasoning-vision: Compact Multimodal Reasoning

⭐⭐⭐⭐⭐ 5/5 — Best-in-class efficiency for multimodal reasoning

🎯 Core Value: A 15B-parameter open-weight model that pushes the Pareto frontier of accuracy versus compute cost, staying competitive with models that require 10x more compute.

Key Specifications

Parameters: 15B (open-weight)
Training Data: 200B tokens (vs 1T+ for competitors)
Context Window: 128K tokens
Availability: Microsoft Foundry, HuggingFace, GitHub

Architecture Insights

Mid-Fusion Architecture

The team chose mid-fusion over early fusion as a practical trade-off between performance and resource requirements:

Pretrained vision encoder → visual tokens → projected into LLM embedding space
Leverages components already trained on trillions of tokens
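
The mid-fusion flow above can be illustrated with a toy sketch. It is not the model's actual implementation: the dimensions, the single linear projector, the random inputs, and the helper names (make_linear, project, fuse) are all hypothetical, and placing visual tokens before text tokens is an assumption made only for illustration.

```python
import random

def make_linear(d_in, d_out, seed=0):
    """Build a toy linear projector (hypothetical stand-in for the real one)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias

def project(tokens, weight, bias):
    """Map each visual token from the encoder width to the LLM embedding width."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

def fuse(text_embeds, visual_embeds):
    """Join both modalities into one sequence (ordering is an assumption)."""
    return visual_embeds + text_embeds

# Toy sizes, chosen only to keep the example small.
D_VISION, D_MODEL = 8, 16
rng = random.Random(1)
visual_tokens = [[rng.random() for _ in range(D_VISION)] for _ in range(4)]
text_embeds = [[rng.random() for _ in range(D_MODEL)] for _ in range(3)]

weight, bias = make_linear(D_VISION, D_MODEL)
sequence = fuse(text_embeds, project(visual_tokens, weight, bias))
print(len(sequence), len(sequence[0]))  # 7 tokens, each of width 16
```

The point of the sketch is that only the projector is new: the vision encoder and the LLM are both reused as pretrained components, and the projected visual tokens enter the LLM as ordinary sequence positions.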

Vision Encoder: SigLIP-2 Naflex

Key finding: Dynamic resolution vision encoders perform best, especially on high-resolution data.

Method	Max Tokens	MathVista	ScreenSpot	ScreenSpot-Pro
Dynamic resolution (3600)	3600	44.9	79.7	17.5
Dynamic resolution (2048)	2048	45.2	81.5	9.2
Multi-crop with S2	2048	43.4	79.1	10.6
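
One way to see why the token budget matters on high-resolution inputs is to estimate how far an image must be downscaled to fit it. The sketch below is an illustration under stated assumptions, not the encoder's actual resizing algorithm: the 16-pixel patch size, the aspect-preserving shrink loop, and the function name visual_token_count are all hypothetical.

```python
import math

def visual_token_count(width, height, max_tokens, patch=16):
    """Estimate visual tokens for an image under a token budget.

    Assumes one token per patch x patch tile and an aspect-preserving
    downscale when the native resolution exceeds the budget.
    """
    def count(scale):
        cols = max(1, math.ceil(width * scale / patch))
        rows = max(1, math.ceil(height * scale / patch))
        return cols * rows

    native = count(1.0)
    if native <= max_tokens:
        return native  # image fits at native resolution

    # Start from the area-based estimate, then shrink until it fits.
    scale = math.sqrt(max_tokens / native)
    while count(scale) > max_tokens:
        scale *= 0.99
    return count(scale)

for budget in (2048, 3600):
    print(budget, visual_token_count(1920, 1080, budget))
```

Under these assumptions a full-HD screenshot needs far more tokens at native resolution than either budget allows, so a larger budget keeps more detail. That is consistent with the 3600-token setting scoring highest on ScreenSpot-Pro in the table, while the 2048-token setting scores slightly higher on MathVista and ScreenSpot.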

Training Data Strategy

Three data sources:

Open-source datasets: Meticulously filtered and improved
High-quality internal data: Domain-specific
Targeted acquisitions: Curated external data

Key insight: Only 200B training tokens were needed, versus 1T+ for Qwen2.5-VL, Kimi-VL, and Gemma 3.

Performance Highlights

Competitive with much larger models on general VLM tasks
Excels at math and science reasoning
Strong computer use / GUI grounding capabilities
Runs on modest hardware

Why It Matters

Phi-4-reasoning-vision shows that an efficient model can match or exceed larger counterparts through careful architecture choices, rigorous data curation, and a deliberate mix of reasoning and non-reasoning training data.
