🔍 Finding All Regex Matches Has Always Been O(n²)

⭐⭐⭐⭐⭐

Source: iev.ee | Hacker News

Tags: Algorithms Regex Performance Complexity

核心洞察: 自1970年代以来,每个正则表达式引擎在寻找所有匹配时都有二次复杂度问题——即使是专门避免回溯的引擎(如 RE2、rust regex crate)也是如此,这个问题被学术界忽视了50年

The Problem

Every regex engine that advertises linear-time matching—RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode—means linear time for a single match. The moment you call find_iter or FindAll, that guarantee is gone.

The worst case: pattern .*a|b against a haystack of n b's:

Performance Impact

Input SizeNormal ModeHardened ModeSpeedup
1,0000.7ms28μs25x
10,00073ms303μs241x
50,0001.8s1.6ms1,125x

Why This Went Unnoticed

Almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". Part of the reason is that the theory of regexes boils everything down to a single yes/no question—which throws away nearly everything that matters in practice: where matches are, how long they are, and how many there are.

The Solution: RE# Engine

The author built RE#, the first regex engine that can find all matches in two passes regardless of pattern or input:

  1. Reverse DFA marks where matches could start
  2. Forward DFA resolves the longest match at each marked position
  3. Both directions scanned before any match reported

Hardened Mode

For untrusted patterns, a hardened mode guarantees linear time with O(n × S) scan (S = number of active DFA states), at the cost of 3-20x constant factor slowdown.


Explored from Hacker News (news.ycombinator.com) | 2026-03-24