Fixing eBPF Spinlock Issues in the Linux Kernel

Source: rovarma.com | Author: Ritesh Oedayrajsingh Varma | Date: March 2026
Tags: Linux Kernel, eBPF, Debugging, Performance

Summary

A deep dive into fixing system freezes caused by eBPF spinlock contention in the Linux kernel. The Superluminal CPU profiler was triggering periodic whole-system freezes, and the article documents the debugging effort that traced them deep into kernel internals.

Key Findings

  • Root Cause: Sampling NMIs (non-maskable interrupts) contending for the eBPF ring buffer spinlock
  • Freeze Duration: Exactly 250ms, matching RES_DEF_TIMEOUT, the default timeout in the kernel's rqspinlock code
  • Minimal Repro: Only 20 lines of eBPF code were needed to reproduce the issue
  • Outcome: Traced the freeze to a bug in the kernel's rqspinlock (resilient queued spinlock) implementation, which was introduced in Linux 6.15
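
As a quick check of the 250ms figure: as we understand the kernel headers, RES_DEF_TIMEOUT is defined as a quarter of NSEC_PER_SEC. The sketch below mirrors those constants locally in plain C so the arithmetic can be verified; the timeout_ms() helper is a hypothetical addition for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the constants, for illustration only (not kernel headers). */
#define NSEC_PER_SEC 1000000000ULL         /* nanoseconds per second */
#define NSEC_PER_MSEC 1000000ULL           /* nanoseconds per millisecond */
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 4) /* rqspinlock default timeout: 0.25 s */

/* Hypothetical helper: convert a nanosecond timeout to whole milliseconds. */
static uint64_t timeout_ms(uint64_t timeout_ns)
{
    return timeout_ns / NSEC_PER_MSEC;
}
```

timeout_ms(RES_DEF_TIMEOUT) evaluates to 250, which lines up with the observed freeze length.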

Technical Details

The issue occurred because:

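  1. eBPF programs call bpf_ringbuf_reserve(), which acquires the ring buffer's spinlock
  2. NMIs cannot be disabled (unlike regular interrupts), so a sampling NMI can fire while the interrupted code on that CPU holds the lock
  3. If the NMI's eBPF program then tries to reserve space in the same ring buffer, the acquisition cannot succeed on that CPU and waits out the 250ms timeout
  4. When this happens on multiple CPUs at once, the whole system freezes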
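
The sequence above can be modelled in userspace. The sketch below is a hypothetical toy, not the kernel's rqspinlock: it is a plain test-and-set lock whose acquire gives up after a timeout, and simulate_nmi_reentry() plays both roles on one thread, first as the interrupted code holding the lock and then as the "NMI handler" trying to take it again. Because the holder cannot run until the handler returns, the second attempt can only end by timing out. The real rqspinlock is a queued lock that, as we understand it, also tracks held locks to catch some deadlocks early; this model keeps only the timeout fallback.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime and CLOCK_MONOTONIC */
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Toy timeout-bounded spinlock (hypothetical; not the kernel's rqspinlock). */
typedef struct {
    atomic_flag locked;
} toy_lock;

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Spin until the lock is taken or timeout_ns elapses.
 * Returns 0 on success, -ETIMEDOUT if the deadline passes first. */
static int toy_lock_acquire(toy_lock *l, uint64_t timeout_ns)
{
    uint64_t deadline = now_ns() + timeout_ns;
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        if (now_ns() >= deadline)
            return -ETIMEDOUT;
    }
    return 0;
}

static void toy_lock_release(toy_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Single-CPU model of the NMI scenario. Returns the "NMI handler's" result
 * and stores how long it spun in *waited_ns. */
static int simulate_nmi_reentry(uint64_t timeout_ns, uint64_t *waited_ns)
{
    toy_lock lock = { ATOMIC_FLAG_INIT };

    /* Interrupted code takes the lock (always succeeds: it starts unlocked). */
    (void)toy_lock_acquire(&lock, timeout_ns);

    /* "NMI" arrives on the same CPU and tries to take the same lock.
     * Nothing can release it in the meantime, so this spins to the timeout. */
    uint64_t start = now_ns();
    int rc = toy_lock_acquire(&lock, timeout_ns);
    *waited_ns = now_ns() - start;

    /* Handler returns; the interrupted code resumes and unlocks. */
    toy_lock_release(&lock);
    return rc;
}
```

With a 20ms timeout, simulate_nmi_reentry(20000000, &waited) returns -ETIMEDOUT after spinning for at least 20ms; with a 250ms timeout, every such collision would stall the CPU for a quarter of a second.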

Debugging Techniques

  • A physical machine was required (the issue did not reproduce in a VM)
  • Kernel debugging with gdb over a serial port
  • Binary search through the eBPF code to find a minimal repro
  • Reading the kernel source to understand the lock implementation
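
The binary-search step can be sketched as a search for the shortest prefix of the program that still freezes the machine. This is a simplification, assuming the program can be cut into ordered pieces and that keeping more pieces never makes the freeze go away. The reproduces callback is hypothetical and stands in for a build-and-run test on the physical machine; threshold_repro() is a toy stand-in so the search can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical predicate: nonzero if a build containing only the first n
 * pieces of the program still reproduces the freeze. */
typedef int (*repro_fn)(size_t n, void *ctx);

/* Smallest n in [1, total] with reproduces(n) true, assuming the predicate
 * is monotone and reproduces(total) holds. Needs O(log total) test runs. */
static size_t smallest_reproducing_prefix(size_t total, repro_fn reproduces, void *ctx)
{
    assert(total >= 1);
    size_t lo = 1, hi = total; /* invariant: the answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (reproduces(mid, ctx))
            hi = mid;     /* culprit is within the first mid pieces */
        else
            lo = mid + 1; /* more than mid pieces are needed */
    }
    return hi;
}

/* Toy context for exercising the search (hypothetical). */
struct toy_repro {
    size_t culprit; /* piece whose inclusion triggers the freeze */
    size_t calls;   /* number of "test runs" performed */
};

/* Toy predicate: the freeze reproduces once the culprit piece is included. */
static int threshold_repro(size_t n, void *ctx)
{
    struct toy_repro *t = ctx;
    t->calls++;
    return n >= t->culprit;
}
```

With 40 pieces and a toy culprit at piece 13, the search returns 13 after five calls to the predicate rather than up to 40. Real minimization often removes arbitrary subsets rather than suffixes, so this prefix bisection is only the simplest version of the idea.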

Why This Matters

This is a great example of how subtle interactions inside the kernel can cause hard-to-debug issues. The fix involved working with upstream Linux kernel developers to address the underlying rqspinlock problem.