RocksDB Development Finds a CPU Bug

Tags: databases, hardware, testing. Source: rocksdb.org via Lobsters

This is the story of how a RocksDB unit test I added four years ago, a mini-stress test you might call it, revealed a novel hardware bug in a newer CPU. It was scary enough to be assigned a "high severity" CVE.

Background: Unique Identifiers

About four years ago, we added unique identifiers to SST files so that each file keeps a stable identity across different filesystems, mainly for caching purposes. Part of the motivation was to stop depending on the OS filesystem to provide file identifiers that are unique and never recycled.

High Quality Randomness

Using large random numbers (e.g. 128 bits) as persistent identifiers is safer and more predictable than relying on filesystem uniqueness guarantees. The team used a quasi-random approach that combines multiple entropy sources.
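As a rough illustration of combining several entropy sources into one 128-bit identifier, here is a minimal C++ sketch. It is not RocksDB's actual scheme: the names Id128, Mix64 and MakeId128 are hypothetical, and the choice of sources (std::random_device, a monotonic clock, the thread id, and a process-wide counter) is an assumption made for the example.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>
#include <utility>

// Hypothetical 128-bit identifier: {high 64 bits, low 64 bits}.
using Id128 = std::pair<std::uint64_t, std::uint64_t>;

// splitmix64-style mixer. It is a bijection on 64-bit values, so distinct
// inputs always map to distinct outputs.
inline std::uint64_t Mix64(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Assumed design for illustration only: XOR hardware-backed randomness with
// mixed "weak" sources, so that a misbehaving random source alone does not
// collapse every identifier to the same value within one process.
inline Id128 MakeId128() {
  static std::atomic<std::uint64_t> counter{0};
  std::random_device rd;  // each call yields an unsigned int

  const std::uint64_t r_hi = (static_cast<std::uint64_t>(rd()) << 32) | rd();
  const std::uint64_t r_lo = (static_cast<std::uint64_t>(rd()) << 32) | rd();

  const auto now = std::chrono::steady_clock::now().time_since_epoch().count();
  const std::uint64_t tid =
      std::hash<std::thread::id>{}(std::this_thread::get_id());
  const std::uint64_t seq = counter.fetch_add(1, std::memory_order_relaxed);

  const std::uint64_t hi =
      r_hi ^ Mix64(static_cast<std::uint64_t>(now) ^ Mix64(tid));
  const std::uint64_t lo = r_lo ^ Mix64(seq);
  return {hi, lo};
}
```

In this sketch, even if the random source returned 0 on every call, the low half would still differ between calls in the same process because the counter is mixed in. That only helps within one process, which is why a high quality random source remains the primary defense.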

Trust But Verify

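Unit tests used many threads to create thousands of unique identifiers and verified that no two were equal. For a high quality source, the probability of any duplicate among thousands of 128-bit IDs is negligible: by the birthday bound, the chance of a collision among n uniformly random 128-bit values is about n^2 / 2^129, which for n = 10,000 (an example figure) is roughly 1.5 x 10^-31. A failure of such a test therefore points to a defect in the source, not to bad luck.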
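A test of this shape could look roughly like the following C++ sketch. It is not the actual RocksDB unit test: CountDuplicates, RandomDeviceId and AllZeroId are hypothetical names, and the generator passed in is assumed to be safe to call from several threads at once.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical 128-bit identifier: {high 64 bits, low 64 bits}.
using Id128 = std::pair<std::uint64_t, std::uint64_t>;

inline std::uint64_t Next64(std::random_device& rd) {
  return (static_cast<std::uint64_t>(rd()) << 32) | rd();
}

// Example generator built only on std::random_device (one device per thread).
inline Id128 RandomDeviceId() {
  thread_local std::random_device rd;
  const std::uint64_t hi = Next64(rd);
  const std::uint64_t lo = Next64(rd);
  return {hi, lo};
}

// Stand-in for a broken source that always "succeeds" with zero.
inline Id128 AllZeroId() { return {0, 0}; }

// Runs `gen` on `num_threads` threads, `ids_per_thread` times each, and
// returns how many generated IDs duplicated an earlier one.
template <typename Gen>
std::size_t CountDuplicates(Gen gen, std::size_t num_threads,
                            std::size_t ids_per_thread) {
  std::vector<std::vector<Id128>> per_thread(num_threads);
  std::vector<std::thread> workers;
  workers.reserve(num_threads);
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&per_thread, &gen, t, ids_per_thread] {
      auto& out = per_thread[t];  // each thread writes only its own slot
      out.reserve(ids_per_thread);
      for (std::size_t i = 0; i < ids_per_thread; ++i) out.push_back(gen());
    });
  }
  for (auto& w : workers) w.join();

  // Merge after all threads finish, counting IDs that were already seen.
  std::set<Id128> seen;
  std::size_t duplicates = 0;
  for (const auto& ids : per_thread) {
    for (const auto& id : ids) {
      if (!seen.insert(id).second) ++duplicates;
    }
  }
  return duplicates;
}
```

With a healthy source, CountDuplicates(RandomDeviceId, 4, 1000) is expected to return 0, while the all-zero stand-in makes every ID after the first a duplicate. Collecting into per-thread vectors and merging afterwards keeps the generation loop free of locks, so the threads contend on the random source and not on the test harness.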

That's Weird

The test based on std::random_device failed once. Then it failed again about a month later: no failures for four years, then two failures in two months. Both failing test jobs had run on the same type of hardware, though in completely different data centers.

Root Cause Analysis

On this type of processor, the RDSEED instruction returned 0 while reporting "success" far more often than chance would allow, but only on some cores and only under "complex micro-architectural conditions reproducible under memory-load." AMD quickly acknowledged the issue and announced planned mitigations, including a CPU microcode update.
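The symptom described here, a zero value paired with a success status, can be checked for with a simple counter. The C++ sketch below is hypothetical and portable: it takes any callable that returns an optional value (empty meaning "the source reported failure") and does not call RDSEED itself. On x86 one could wrap an intrinsic such as _rdseed32_step from <immintrin.h> in such a callable; that wrapper is left out because it depends on compiler flags and hardware support. The threshold in LooksSuspicious is an arbitrary illustrative choice, not a rigorous statistical test.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <optional>

// Tallies for a run of draws from a random source.
struct ZeroStats {
  std::size_t successes = 0;  // draws where the source reported success
  std::size_t zeros = 0;      // successful draws whose value was exactly 0
};

// `source()` must return an optional-like value: empty on reported failure,
// otherwise the random value that was produced.
template <typename Source>
ZeroStats CountZeroSuccesses(Source source, std::size_t attempts) {
  ZeroStats stats;
  for (std::size_t i = 0; i < attempts; ++i) {
    auto value = source();
    if (!value.has_value()) continue;  // honest failure: caller may retry
    ++stats.successes;
    if (*value == 0) ++stats.zeros;
  }
  return stats;
}

// Expected number of zeros from a uniform source of `value_bits`-bit values:
// successes * 2^(-value_bits).
inline double ExpectedZeros(const ZeroStats& stats, int value_bits) {
  return std::ldexp(static_cast<double>(stats.successes), -value_bits);
}

// Crude heuristic (assumed threshold): flag the source when zeros exceed
// ten times the expectation plus a small allowance.
inline bool LooksSuspicious(const ZeroStats& stats, int value_bits) {
  return static_cast<double>(stats.zeros) >
         10.0 * ExpectedZeros(stats, value_bits) + 3.0;
}
```

For 32-bit values, a million successful draws should contain about 0.0002 zeros on average, so even a handful of zeros stands out. A check like this only catches the zero-value symptom; a uniqueness test over full identifiers is the broader safety net.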

Key Takeaways

Rating: ★★★★☆ - A fascinating real-world story of hardware bug discovery through rigorous testing.