Keeping 20,000 GPUs healthy

December 28, 2025 • Modal

scaling infrastructure gpu reliability

Modal runs a globally distributed, autoscaling GPU worker pool by sourcing compute from all cloud giants: AWS, GCP, Azure, OCI. They've scaled to well over 20,000 concurrent GPUs, and launched over four million cloud instances. At this scale, you see almost every GPU reliability problem there is.

Instance Type Testing and Selection

The hyperscalers are significantly differentiated at the instance type level:

Cloud A - Simplest and most reliable instance launch API (99.6% boot success), but H100s perform 50% worse on StableDiffusion text2img compared with C and D
Cloud C - Ran H100s too hot (over 90°C for months in 2025). FLOP/s performance starts to degrade in the mid-70s Celsius. Has 228MiB more reserved H100 memory
Cloud D - A10s have frequent hardware-side clock slowdowns (HW_SLOWDOWN and HW_POWER_BRAKE). One US region has more frequent uncorrectable ECC errors. Has the best price/performance

Key Benchmark: SXM H100 vs PCIe H100 - torch_matmul is 67.5% slower, h2d_bw_pageable is 174% worse on PCIe

Machine Images

Machine image quality has significant implications for reliability and performance. Modal keeps images consistent across its multi-cloud compute pool and fresh with the latest NVIDIA driver (580.95.05). They run DCGM and custom GPU tests at the end of every image build before promotion.

Key differentiator: Hyperscalers are clearly ahead of neoclouds (Lambda Labs, Nebius) here - very few neoclouds support image customization, and their instance startup is slower (5+ minutes vs 2-3 minutes).

Instance Boot

Tradeoff between thoroughness and speed. Deepest generic check (dcgmi diag --run 4) takes ~1 hour. Shallowest (dcgmi diag --run 1) takes at least a minute. "Testing hardware on boot is likely redundant with healthchecking already performed by the cloud provider."

Production issue: Cloud C's L4s flake at CUDA initialization in 0.1% of cases - application code must use cuInit retries.
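The retry pattern mentioned above might look roughly like the sketch below. It is not Modal's implementation: the function name, attempt count, and backoff values are hypothetical, and the initializer is passed in as a callable so the logic can be exercised without a GPU. The only CUDA facts assumed are that cuInit takes a flags argument that must be 0 and returns a CUresult, where 0 means success.

```python
import time
from typing import Callable

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def init_cuda_with_retries(
    cu_init: Callable[[int], int],
    attempts: int = 3,            # hypothetical default, not from the source
    base_delay_s: float = 0.5,    # hypothetical default, not from the source
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call cu_init(0) until it succeeds; return the number of attempts used.

    Raises RuntimeError if every attempt returns a non-zero CUresult.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_result = None
    for attempt in range(1, attempts + 1):
        last_result = cu_init(0)  # cuInit's flags argument must be 0
        if last_result == CUDA_SUCCESS:
            return attempt
        if attempt < attempts:
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(
        f"cuInit failed after {attempts} attempts (last CUresult={last_result})"
    )
```

On a machine with the NVIDIA driver installed, the callable could be the real driver entry point, for example `ctypes.CDLL("libcuda.so.1").cuInit`. Injecting it keeps the retry logic independent of how CUDA is loaded.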

Lifetime Management

Passive Healthchecking

Non-invasive, read-only. Running dcgmi periodically and checking dmesg provides 80% of passive healthchecking wins. dcgmi reports uncorrectable ECC errors, thermal violations, sync boost violations, hardware slowdowns, and excessive temperatures (>88°C).
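As a rough illustration of the dmesg and temperature side of passive checking, the sketch below scans kernel log text for NVIDIA Xid error lines and flags GPUs above the 88°C threshold mentioned above. The source does not say which dmesg patterns Modal looks for, so matching Xid lines is an assumption here. The function names and the HealthEvent type are hypothetical, and inputs are plain strings and dicts so nothing here needs root or a GPU.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# NVIDIA driver errors appear in the kernel log as lines such as:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

TEMP_LIMIT_C = 88  # threshold quoted in the text above


@dataclass(frozen=True)
class HealthEvent:
    kind: str    # "xid" or "temperature"
    detail: str  # human-readable description for logs


def scan_dmesg(dmesg_text: str) -> List[HealthEvent]:
    """Return one event per Xid line found in the given kernel log text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                HealthEvent("xid", f"{match['pci']} code={match['code']}")
            )
    return events


def check_temperatures(
    temps_c: Dict[int, float], limit_c: float = TEMP_LIMIT_C
) -> List[HealthEvent]:
    """Return one event per GPU index whose temperature exceeds limit_c."""
    return [
        HealthEvent("temperature", f"gpu={idx} temp={temp:.0f}C")
        for idx, temp in sorted(temps_c.items())
        if temp > limit_c
    ]
```

In practice the log text would come from reading the kernel log periodically and the temperatures from DCGM or a similar source; both checks are read-only, which is what makes them safe to run alongside user workloads.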

Active Healthchecking

Requires exclusive lock on GPUs. Following SemiAnalysis's ClusterMAX expectations, each GPU node gets deep active checking at least weekly:

NVIDIA DCGM diag level 2
GPUBurn/GPU-fryer to validate GPU won't fail under load
Local NCCL all-reduce tests to validate NVLink/NVSwitch/NVLink SHARP performance
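Because active checks need exclusive access to the GPUs, something has to decide which nodes are due. A minimal sketch of that decision, assuming a hypothetical mapping from node name to last active-check time (the names and data shape are illustrative, not Modal's scheduler):

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

ACTIVE_CHECK_INTERVAL = timedelta(days=7)  # "at least weekly" from the text


def nodes_due_for_active_check(
    last_checked: Dict[str, Optional[datetime]],
    now: datetime,
    interval: timedelta = ACTIVE_CHECK_INTERVAL,
) -> List[str]:
    """Return nodes whose last active check is missing or older than interval.

    Never-checked nodes come first, then the most overdue nodes.
    """
    due = [
        node
        for node, checked_at in last_checked.items()
        if checked_at is None or now - checked_at >= interval
    ]
    # Sort key: never-checked (False) before checked (True), then oldest first.
    return sorted(
        due,
        key=lambda node: (
            last_checked[node] is not None,
            last_checked[node] or now,
        ),
    )
```

A real scheduler would also need to drain or lock each node before running the checks listed above, since they cannot share the GPUs with user workloads.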

Observability

The dashboard gives every container a view of GPU reliability through four metrics: memory usage, utilization, temperature, and power usage. GPU health events are piped into container logs.

The Big Picture

It's underappreciated how unreliable GPUs are. From Meta's Llama 3 paper: "GPU issues are the largest category, accounting for 58.7% of all unexpected issues." CPUs were the problem only 0.5% of the time.