Rust threads on the GPU
Overview
VectorWare announces the first implementation of Rust's std::thread on the GPU. This milestone makes it possible to write complex GPU programs using familiar Rust abstractions, and it opens up the existing Rust ecosystem (Rayon, Tokio) to GPU computing.
The Problem: Mismatch Between Programming and Execution Models
CPU Model
Programs begin on a single thread and spawn additional threads as needed. Each thread runs independently.
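As a minimal sketch of the CPU model described above, the following Rust function starts on one thread and spawns more on request. The name `spawn_workers` and the squaring workload are illustrative assumptions, not code from the announcement.

```rust
use std::thread;

/// Spawns `n` threads from the calling thread and collects each result.
/// `spawn_workers` is a hypothetical name used only for this illustration.
fn spawn_workers(n: usize) -> Vec<usize> {
    // Execution starts on one thread; concurrency appears only where we ask for it.
    let handles: Vec<thread::JoinHandle<usize>> = (0..n)
        .map(|id| thread::spawn(move || id * id))
        .collect();

    // Joining in spawn order keeps the results ordered by thread id.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `spawn_workers(4)` returns `[0, 1, 4, 9]`: each thread ran independently, and the spawning thread decided when to wait for them.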
GPU Model
GPU programs consist of kernels launched with thousands of instances running in parallel. The entry point is a function that "looks like it runs once but runs thousands of times."
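    // CUDA kernel: written like one function call, executed by thousands of threads
    __global__ void scale(float* data, int n) {
        // Each thread derives its own global index from its block and thread IDs.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Guard against threads past the end of the array (the grid may exceed n).
        if (i < n) {
            data[i] *= 2.0f;
        }
    }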
Why This Matters
The mismatch makes GPU programming hard because:
- Concurrency is implicit, not explicit
- Programmer must manually uphold invariants (indexing, avoiding race conditions)
- Compiler cannot enforce GPU-specific invariants
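For contrast, here is a minimal CPU-side sketch of the same doubling operation in which concurrency is explicit and the compiler checks that no two threads touch the same element. The function name `scale_in_parallel` and the chunking scheme are assumptions made for illustration; the announcement does not show this code.

```rust
use std::thread;

/// Doubles every element of `data` using up to `workers` scoped threads.
/// `scale_in_parallel` is a hypothetical name used only for this illustration.
fn scale_in_parallel(data: &mut [f32], workers: usize) {
    if data.is_empty() {
        return;
    }
    // Ceiling division so every element lands in exactly one chunk.
    let workers = workers.max(1);
    let chunk_len = (data.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // `chunks_mut` hands out non-overlapping `&mut [f32]` slices, so the
        // borrow checker rejects any attempt to share an element between threads.
        for chunk in data.chunks_mut(chunk_len) {
            scope.spawn(move || {
                for value in chunk.iter_mut() {
                    *value *= 2.0;
                }
            });
        }
        // All spawned threads are joined automatically when the scope ends.
    });
}
```

Unlike the kernel, there is no hand-computed index here: the slice bounds and the disjointness of the chunks are enforced by the type system rather than by convention.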
Why Not Map std::thread to GPU Lanes?
GPU "threads" (lanes) within warps are not CPU threads:
- GPU lane = SIMD lane, not independent execution context
- Lanes within a warp advance together in lockstep
- Mapping thread::spawn to lanes would cause divergence
- Hardware would serialize different paths, negating concurrency benefit
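The cost of divergence can be sketched with a toy model. The code below is an assumption-laden illustration, not a hardware simulator: it treats a warp as one shared instruction stream and counts steps, with `lockstep_steps` and `independent_steps` as hypothetical helper names.

```rust
use std::collections::BTreeSet;

/// Toy model: `lane_paths[i]` is the branch taken by lane `i`, and
/// `path_cost[p]` is the number of steps path `p` needs.
/// Assumes every entry of `lane_paths` is a valid index into `path_cost`.
fn lockstep_steps(lane_paths: &[usize], path_cost: &[u64]) -> u64 {
    // Lanes share one instruction stream, so each distinct path is executed
    // in turn while the lanes on other paths sit idle.
    let distinct: BTreeSet<usize> = lane_paths.iter().copied().collect();
    distinct.iter().map(|&path| path_cost[path]).sum()
}

/// Steps needed if every lane were a truly independent thread:
/// all paths run at once, so the slowest path decides.
fn independent_steps(lane_paths: &[usize], path_cost: &[u64]) -> u64 {
    lane_paths
        .iter()
        .map(|&path| path_cost[path])
        .max()
        .unwrap_or(0)
}
```

With four lanes on four different paths of 10 steps each, `lockstep_steps` gives 40 while `independent_steps` gives 10. When all lanes take the same path, both give 10, which is why lanes suit uniform work but not independently spawned tasks.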
Solution: Map to Warps, Not Lanes
Key insight: Warps can behave like CPU threads:
- Each warp has its own program counter
- Each warp has its own register file
- GPU scheduler switches between warps (like OS scheduler)
- Anything a CPU thread can do, a warp can do
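To make the warp-as-thread idea concrete, here is a small sketch of the kind of code it is meant to support: each spawned closure runs a loop whose length depends on its input, so every thread follows its own control flow. This is ordinary `std::thread` code that runs on the CPU; that each closure would occupy its own warp on the GPU is the mapping described above, and the function names are illustrative assumptions.

```rust
use std::thread;

/// Counts Collatz steps until `n` reaches 1 (returns 0 for `n` of 0 or 1).
/// The loop length depends on the input, so threads diverge from each other.
fn collatz_steps(mut n: u64) -> u32 {
    let mut steps = 0;
    while n > 1 {
        n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
        steps += 1;
    }
    steps
}

/// Runs one thread per starting value; `steps_per_thread` is a hypothetical name.
fn steps_per_thread(starts: &[u64]) -> Vec<u32> {
    // Each thread keeps its own program counter and loop state.
    let handles: Vec<thread::JoinHandle<u32>> = starts
        .iter()
        .map(|&start| thread::spawn(move || collatz_steps(start)))
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

`steps_per_thread(&[1, 6, 7, 27])` returns `[0, 8, 16, 111]`. The four threads finish after very different amounts of work, which is harmless when each has its own execution context.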
What This Unlocks
Supporting std::thread enables major portions of the Rust ecosystem on the GPU:
- Rayon - widely used data-parallelism library built on thread pools
- Tokio - async runtime (previously demonstrated async on GPU)
- Many parallelism libraries that depend on std::thread
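To show why std::thread is the foundation these libraries need, here is a deliberately small thread pool built only from the standard library. It is a sketch under stated assumptions: `MiniPool` and `pooled_sum_of_squares` are hypothetical names, and real libraries such as Rayon use far more sophisticated scheduling than this single shared queue.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// A minimal fixed-size pool: workers pull jobs from one shared channel.
struct MiniPool {
    sender: Option<mpsc::Sender<Job>>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl MiniPool {
    fn new(size: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        // Always start at least one worker so queued jobs can make progress.
        let workers = (0..size.max(1))
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock guard is released at the end of this statement,
                    // so other workers can receive while this job runs.
                    let next = receiver.lock().unwrap().recv();
                    match next {
                        Ok(job) => job(),
                        Err(_) => break, // Channel closed: no more work.
                    }
                })
            })
            .collect();
        MiniPool { sender: Some(sender), workers }
    }

    fn execute<F: FnOnce() + Send + 'static>(&self, job: F) {
        let sender = self.sender.as_ref().expect("pool already shut down");
        sender.send(Box::new(job)).expect("all workers have exited");
    }

    /// Closes the queue and waits for every worker to finish its jobs.
    fn join(mut self) {
        drop(self.sender.take());
        for worker in self.workers.drain(..) {
            worker.join().expect("worker thread panicked");
        }
    }
}

/// Squares 1..=n on the pool and adds up the results.
fn pooled_sum_of_squares(n: u64, workers: usize) -> u64 {
    let pool = MiniPool::new(workers);
    let (tx, rx) = mpsc::channel();
    for i in 1..=n {
        let tx = tx.clone();
        pool.execute(move || tx.send(i * i).unwrap());
    }
    drop(tx); // Keep only the senders owned by queued jobs.
    pool.join();
    rx.iter().sum()
}
```

`pooled_sum_of_squares(10, 4)` returns 385. Everything here rests on `thread::spawn`, channels, and a mutex, so a platform that provides those primitives can in principle host libraries layered on top of them.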
Significance
This is a significant step toward "GPU-native software" where GPU code looks like ordinary Rust. The goal is for the Rust compiler to reason about the same invariants in both CPU and GPU environments, with concurrency explicit rather than implicit.