🔍 Finding All Regex Matches Has Always Been O(n²)

⭐⭐⭐⭐⭐

Source: iev.ee | Hacker News

Tags: Algorithms Regex Performance Complexity

        核心洞察: 自1970年代以来，每个正则表达式引擎在寻找所有匹配时都有二次复杂度问题——即使是专门避免回溯的引擎(如 RE2、rust regex crate)也是如此，这个问题被学术界忽视了50年
    

The Problem

Every regex engine that advertises linear-time matching—RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode—means linear time for a single match. The moment you call find_iter or FindAll, that guarantee is gone.

The worst case: pattern .*a|b against a haystack of n b's:

At each position, the engine tries .*a first: scan entire remaining haystack looking for a, find none, fail
Then the b branch matches a single character
Advance one position, repeat
Total: n + (n-1) + (n-2) + ... = O(n²) work

Performance Impact

Input Size	Normal Mode	Hardened Mode	Speedup
1,000	0.7ms	28μs	25x
10,000	73ms	303μs	241x
50,000	1.8s	1.6ms	1,125x

Why This Went Unnoticed

Almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". Part of the reason is that the theory of regexes boils everything down to a single yes/no question—which throws away nearly everything that matters in practice: where matches are, how long they are, and how many there are.

The Solution: RE# Engine

The author built RE#, the first regex engine that can find all matches in two passes regardless of pattern or input:

Reverse DFA marks where matches could start
Forward DFA resolves the longest match at each marked position
Both directions scanned before any match reported

Hardened Mode

For untrusted patterns, a hardened mode guarantees linear time with O(n × S) scan (S = number of active DFA states), at the cost of 3-20x constant factor slowdown.

Explored from Hacker News (news.ycombinator.com) | 2026-03-24