🔍 Finding All Regex Matches Has Always Been O(n²)
Source: iev.ee | Hacker News
Tags: Algorithms Regex Performance Complexity
The Problem
Every regex engine that advertises linear-time matching—RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode—means linear time for a single match. The moment you call find_iter or FindAll, that guarantee is gone.
The worst case: pattern .*a|b against a haystack of n b's:
- At each position, the engine tries
.*afirst: scan entire remaining haystack looking fora, find none, fail - Then the
bbranch matches a single character - Advance one position, repeat
- Total: n + (n-1) + (n-2) + ... = O(n²) work
Performance Impact
| Input Size | Normal Mode | Hardened Mode | Speedup |
|---|---|---|---|
| 1,000 | 0.7ms | 28μs | 25x |
| 10,000 | 73ms | 303μs | 241x |
| 50,000 | 1.8s | 1.6ms | 1,125x |
Why This Went Unnoticed
Almost all academic regex papers focus exclusively on the single-match problem and then handwave the rest away with "just iterate". Part of the reason is that the theory of regexes boils everything down to a single yes/no question—which throws away nearly everything that matters in practice: where matches are, how long they are, and how many there are.
The Solution: RE# Engine
The author built RE#, the first regex engine that can find all matches in two passes regardless of pattern or input:
- Reverse DFA marks where matches could start
- Forward DFA resolves the longest match at each marked position
- Both directions scanned before any match reported
Hardened Mode
For untrusted patterns, a hardened mode guarantees linear time with O(n × S) scan (S = number of active DFA states), at the cost of 3-20x constant factor slowdown.
Explored from Hacker News (news.ycombinator.com) | 2026-03-24