Keeping 20,000 GPUs healthy
Modal runs a globally distributed, autoscaling GPU worker pool that sources compute from the major cloud providers: AWS, GCP, Azure, and OCI. The pool has scaled to well over 20,000 concurrent GPUs and has launched over four million cloud instances. At this scale, you see almost every GPU reliability problem there is.
Instance Type Testing and Selection
The hyperscalers are significantly differentiated at the instance type level:
- Cloud A - Simplest and most reliable instance launch API (99.6% boot success), but H100s perform 50% worse on StableDiffusion text2img compared with C and D
- Cloud C - Ran H100s too hot (over 90°C for months in 2025). FLOP/s performance degrades starting at mid-70s Celsius. Has 228MiB more reserved H100 memory
- Cloud D - A10s have frequent hardware-side clock slowdowns (HW_SLOWDOWN and HW_POWER_BRAKE). One US region has more frequent uncorrectable ECC errors. Has the best price/performance
Machine Images
Machine image quality has significant implications for reliability and performance. Modal keeps images consistent across its multi-cloud compute pool and current with the latest NVIDIA driver (580.95.05). DCGM and custom GPU tests run at the end of every image build, before the image is promoted.
Key differentiator: hyperscalers differ clearly from neoclouds (Lambda Labs, Nebius) here. Very few neoclouds support image customization, and their instance startup is slower (5+ minutes vs 2-3 minutes).
Instance Boot
There is a tradeoff between thoroughness and speed. The deepest generic check (dcgmi diag --run 4) takes ~1 hour. The shallowest (dcgmi diag --run 1) takes at least a minute. "Testing hardware on boot is likely redundant with healthchecking already performed by the cloud provider."
Production issue: Cloud C's L4s flake at CUDA initialization in 0.1% of cases, so application code must retry cuInit.
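A minimal sketch of such a retry wrapper, assuming the caller supplies a function that behaves like the CUDA driver's cuInit (takes a flags argument, returns a CUresult where 0 means success). The name init_cuda_with_retries, the attempt count, and the backoff schedule are illustrative assumptions, not Modal's implementation.

```python
import time
from typing import Callable, Optional

CUDA_SUCCESS = 0  # CUresult value that the CUDA driver API uses for success


def init_cuda_with_retries(
    cu_init: Callable[[int], int],
    attempts: int = 5,            # assumed retry budget, tune per workload
    base_delay_s: float = 0.5,    # assumed initial backoff
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call cu_init(0) until it succeeds or the attempt budget is spent.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    last_code: Optional[int] = None
    for attempt in range(1, attempts + 1):
        last_code = cu_init(0)  # cuInit requires flags == 0
        if last_code == CUDA_SUCCESS:
            return attempt
        if attempt < attempts:
            # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError(
        f"cuInit failed after {attempts} attempts (last CUresult={last_code})"
    )
```

On a GPU host, cu_init could plausibly be ctypes.CDLL("libcuda.so.1").cuInit. Injecting it as a parameter keeps the retry logic testable on machines without a GPU.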
Lifetime Management
Passive Healthchecking
Non-invasive, read-only. Running dcgmi periodically and checking dmesg provides 80% of passive healthchecking wins. dcgmi reports uncorrectable ECC errors, thermal violations, sync boost violations, hardware slowdowns, and excessive temperatures (>88°C).
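As a rough illustration of a read-only check, the sketch below scans dmesg text for NVIDIA Xid events and flags GPUs above the 88°C limit quoted above. Scanning for Xid lines is an assumption about what "checking dmesg" involves, and passive_check, PassiveReport, and the input shapes are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TEMP_LIMIT_C = 88  # "excessive temperature" threshold quoted in the text

# The NVIDIA kernel driver logs Xid events in roughly this shape:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


@dataclass
class PassiveReport:
    xid_events: List[Tuple[str, int]] = field(default_factory=list)
    overheated_gpus: List[int] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        return not self.xid_events and not self.overheated_gpus


def passive_check(dmesg_text: str, temps_c: Dict[int, float]) -> PassiveReport:
    """Inspect already-collected dmesg text and per-GPU temperatures.

    Nothing here touches the GPU, so it is safe to run alongside workloads.
    """
    report = PassiveReport()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            report.xid_events.append((match.group(1), int(match.group(2))))
    report.overheated_gpus = sorted(
        gpu for gpu, temp in temps_c.items() if temp > TEMP_LIMIT_C
    )
    return report
```

The temperatures would come from whatever telemetry source is already polled (dcgmi or similar). The function only interprets the data, which keeps collection and policy separate.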
Active Healthchecking
Requires an exclusive lock on the GPUs. Following SemiAnalysis's ClusterMAX expectations, each GPU node gets deep active checking at least weekly:
- NVIDIA DCGM diag level 2
- GPUBurn/GPU-fryer to validate that the GPU won't fail under load
- Local NCCL all-reduce tests to validate NVLink/NVSwitch/NVLink SHARP performance
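The weekly cadence can be sketched as a small scheduling helper that picks the nodes whose last deep check is at least seven days old. The function name, the node identifiers, and the shape of the last_checked mapping are assumptions made for illustration. The source does not describe how Modal schedules these checks.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

MAX_AGE = timedelta(days=7)  # "at least weekly" from the text above
_NEVER = datetime.min.replace(tzinfo=timezone.utc)  # sort key for unchecked nodes


def nodes_due_for_active_check(
    last_checked: Dict[str, Optional[datetime]],
    now: datetime,
) -> List[str]:
    """Return node ids that need a deep active check, most overdue first.

    A value of None means the node has never been checked.
    Timestamps are expected to be timezone-aware.
    """
    due = [
        node
        for node, checked_at in last_checked.items()
        if checked_at is None or now - checked_at >= MAX_AGE
    ]
    # Never-checked nodes sort first, then the oldest check, then by node id.
    return sorted(due, key=lambda node: (last_checked[node] or _NEVER, node))
```

Because active checks need an exclusive lock on the GPUs, a scheduler built on this would presumably drain a node of workloads before running the diagnostics on it.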
Observability
The dashboard gives every container a view of GPU reliability through four metrics: memory usage, utilization, temperature, and power usage. GPU health events are piped into container logs.
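One possible way to collect the same four metrics on a node is to query nvidia-smi and parse its CSV output, as in the sketch below. The source does not say how Modal gathers them, so treat this as an assumption. parse_gpu_metrics and read_gpu_metrics are hypothetical helper names.

```python
import subprocess
from typing import Dict, List, Optional

# nvidia-smi --query-gpu field names for the four dashboard metrics
FIELDS = ["memory.used", "utilization.gpu", "temperature.gpu", "power.draw"]


def parse_gpu_metrics(csv_text: str) -> List[Dict[str, Optional[float]]]:
    """Parse `--format=csv,noheader,nounits` output, one dict per GPU.

    Values that are not numeric (for example "[N/A]") become None.
    """
    rows: List[Dict[str, Optional[float]]] = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        row: Dict[str, Optional[float]] = {}
        for name, value in zip(FIELDS, values):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


def read_gpu_metrics() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi on the local host; requires an NVIDIA driver install."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-gpu={','.join(FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_metrics(result.stdout)
```

With nounits, memory is reported in MiB, utilization in percent, temperature in °C, and power in watts.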