RocksDB Development Finds a CPU Bug
This is the story of how a RocksDB unit test I added four years ago, a mini-stress test you might call it, revealed a novel hardware bug in a newer CPU. It was scary enough to be assigned a "high severity" CVE.
Background: Unique Identifiers
About four years ago, we added unique identifiers to SST files to give them stable identifiers across different filesystems for caching purposes. Part of the motivation was to stop depending on the OS filesystem to provide file identifiers that are unique and never recycled.
High Quality Randomness
Using large random numbers (e.g. 128 bits) as persistent identifiers is safer and more predictable than relying on filesystem uniqueness guarantees. We used a quasi-random approach that combines multiple entropy sources.
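To illustrate the idea of combining several entropy sources, here is a minimal sketch. It is not RocksDB's actual scheme: the names `Id128`, `Mix64`, and `MakeId` are hypothetical, and the choice of sources (a hardware-backed random callback, clocks, the thread ID, and a per-process counter) is an assumption made for the example. The point it tries to show is that when one source misbehaves, the others can still keep identifiers distinct.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical 128-bit identifier type, for illustration only.
struct Id128 {
  uint64_t hi;
  uint64_t lo;
};

// SplitMix64 finalizer: a bijective 64-bit mixing function.
inline uint64_t Mix64(uint64_t x) {
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// hw_random stands in for a source such as std::random_device.
// It is a parameter so a faulty source can be simulated.
inline Id128 MakeId(const std::function<uint32_t()>& hw_random) {
  // Sources that do not depend on the hardware random generator.
  static std::atomic<uint64_t> counter{0};
  static const uint64_t process_seed = Mix64(static_cast<uint64_t>(
      std::chrono::system_clock::now().time_since_epoch().count()));

  const uint64_t r_hi = hw_random();
  const uint64_t r_lo = hw_random();
  const uint64_t r = (r_hi << 32) | r_lo;
  const uint64_t t = static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
  const uint64_t tid =
      std::hash<std::thread::id>{}(std::this_thread::get_id());

  Id128 id;
  // hi: hardware randomness mixed with time and thread identity.
  id.hi = Mix64(r ^ Mix64(t) ^ Mix64(tid + 1));
  // lo: a bijection of the counter for a fixed process_seed, so it stays
  // distinct within a process (until the 64-bit counter wraps) even if
  // hw_random always returns 0.
  id.lo = Mix64(process_seed ^
                counter.fetch_add(1, std::memory_order_relaxed));
  return id;
}
```

A caller might pass a lambda that wraps a `thread_local std::random_device`. Passing a lambda that always returns 0 simulates a broken source and shows that the `lo` half still differs between calls.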
Trust But Verify
Unit tests used many threads to create thousands of unique identifiers and verified their uniqueness. For a high-quality source, the probability of any duplicate 128-bit IDs among thousands is negligible, so an observed duplicate points to a defect in the source rather than bad luck.
That's Weird
The test based on std::random_device failed, once. Then it failed again about a month later. No failures for four years, then two failures in two months. Both failed test jobs had run on the same type of hardware, though in completely different data centers.
Root Cause Analysis
The RDSEED instruction on this type of AMD processor would return 0 along with a "success" indication far more often than chance would allow, but only on some cores and only under "complex micro-architectural conditions reproducible under memory-load." AMD quickly acknowledged the issue and announced planned mitigations, including a CPU microcode update.
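One defensive pattern against this failure mode is to treat a "successful" zero as suspect and retry. The sketch below is an assumption-laden illustration, not AMD's mitigation and not RocksDB code: `SeedSource` and `CheckedSeed` are hypothetical names, and the source is modeled as a callback that reports success and writes a 32-bit value, loosely mirroring how RDSEED signals success. Rejecting zero costs almost nothing, since a genuinely random 32-bit value is zero only about once in 4 billion draws.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical model of a hardware seed source: returns true on reported
// success and writes the value through the pointer.
using SeedSource = std::function<bool(uint32_t*)>;

// Tries up to max_attempts times to obtain a nonzero seed. A reported
// success with value 0 is treated as a failure and retried. Returns false
// if no usable value was obtained, so the caller can fall back to another
// entropy source.
inline bool CheckedSeed(const SeedSource& source, uint32_t* out,
                        int max_attempts = 8) {
  for (int i = 0; i < max_attempts; ++i) {
    uint32_t value = 0;
    if (source(&value) && value != 0) {
      *out = value;
      return true;
    }
  }
  return false;
}
```

Returning failure instead of looping forever lets the caller decide what to do when the source appears stuck, for example by combining other entropy sources.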
Key Takeaways
- Test what you depend on.
- Have redundancies and/or sanity checks for what you depend on.
- Even CPUs can have bugs, usually flaky individual units but occasionally a bug affecting all units.