🔍 Finding All Regex Matches Has Always Been O(n²)
Summary
Search a document for a pattern and it takes a second. Search one a hundred times larger and it doesn't take a hundred seconds — it can take almost three hours. Every regex engine, in every language, has had this problem since the 1970s, and nobody fixed it.
The Problem
Every regex engine that advertises linear-time matching — RE2, Go's regexp, rust's regex crate, .NET's NonBacktracking mode — means linear time for a single match. The moment you call find_iter or FindAll, that guarantee is gone.
Classic Example
Pattern: .*a|b, Haystack: n b's
- At each position, engine tries
.*afirst: scan entire remaining haystack looking for 'a', find none, fail - Then branch
bmatches a single character - Advance one position, repeat
- Total work: n + (n-1) + (n-2) + ... = O(n²)
Why It Went Unnoticed
- Almost all academic regex papers focus exclusively on the single-match problem
- Theory reduces regex to "match or no match" — clean for theorems, but throws away where matches are, how long, and how many
- Once reduced to yes/no, the all-matches problem simply disappears from view
Backtracking Is Even Worse
With backtracking, a user-supplied pattern and a 50-character input can take longer than the heat death of the universe — exponential time. Thompson published the NFA construction that avoids this in 1968. That's nearly 60 years of a solved problem being actively unsolved at scale.
Existing Solutions
Aho-Corasick (1975)
Solved linear-time matching for fixed strings only. Builds a trie from patterns, adds failure links, scans input once. Cannot handle regex.
Hyperscan/Vectorscan
True linear-time all-matches regex engine using "earliest match" semantics. Changes results — reports match the moment DFA enters match state instead of longest.
RE#: Two Pass Solution
RE# is the first regex engine that can find all matches in two passes, regardless of pattern or input, without altering semantics:
- Reverse DFA marks where matches could start
- Forward DFA resolves the longest match at each marked position
Performance Results
| Input Size | Normal | Hardened | Speedup |
|---|---|---|---|
| 1,000 | 0.7ms | 28μs | 25x |
| 10,000 | 73ms | 303μs | 241x |
| 50,000 | 1.8s | 1.6ms | 1,125x |
Key Insight
The quadratic blowup requires a pathological pattern AND a structured input that's long enough. Take a pattern like [A-Z][a-z]+: every match starts at an uppercase letter and ends when engine sees something that isn't lowercase. No ambiguity, no quadratic case possible.