Performance Tips

Notes on how to improve performance.

Configure the fast-fail lookahead length

Before running the capturing DFA engine, Specregex looks ahead a few characters to check for any immediate failures. You can configure this lookahead length based on your regex and data; we typically recommend 4-8. A longer lookahead lets fewer false positives through to the full engine, but the pre-check itself does more work.

Examples
| Regex | Suggested length | Explanation |
| --- | --- | --- |
| `[a-z][a-z][0-9]z.*` | 4 | More than 4 does nothing due to the nonspecific `.*` |
| `ab.*ab` | 2 | More than 2 does nothing due to the `.*` |
| `.......z` | 8 | Fewer than 8 filters out nothing, because only the eighth character is constrained. Even a long lookahead typically saves time versus the full engine if it can filter out a significant fraction (even, say, 20%) of potential matches. Unless your data is replete with zs, such as zzzzzzzzzzzzazzzzzzzzzz, this is likely worthwhile. |
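The trade-off can be sketched in plain Python. This is an illustration of the idea only, not Specregex's implementation or API: `PREFIX_CLASSES`, `passes_precheck`, and `match_at` are hypothetical names, and Python's `re` module stands in for the full engine.

```python
import re
import string

# Stand-in for the full engine (assumption: Python's re, not Specregex).
FULL_PATTERN = re.compile(r"[a-z][a-z][0-9]z.*")

# One set of allowed characters per position of the fixed prefix
# [a-z][a-z][0-9]z. Nothing after position 4 is constrained (.*).
PREFIX_CLASSES = [
    set(string.ascii_lowercase),
    set(string.ascii_lowercase),
    set(string.digits),
    {"z"},
]

def passes_precheck(text, start=0, length=4):
    """Return False if the first `length` characters rule out a match."""
    # Asking for more than 4 positions adds nothing for this pattern.
    classes = PREFIX_CLASSES[:length]
    if len(text) - start < len(classes):
        return False  # too short to contain the required prefix
    return all(text[start + i] in allowed for i, allowed in enumerate(classes))

def match_at(text, start=0, length=4):
    """Run the cheap pre-check first, then the full pattern."""
    if not passes_precheck(text, start, length):
        return None
    return FULL_PATTERN.match(text, start)
```

With `length=2` the input `abXz` passes the pre-check and is only rejected by the full pattern (a false positive); with `length=4` the pre-check rejects it directly.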

Avoid redundant capture groups

There is a small performance cost associated with each capture group in the regex.

Commonly, regexes use alternation to provide different ways of matching the same “things”. For example, you might have several different formats for matching an address.

In such cases, many applications use capture groups naively and rely on application logic after the regex search to determine which group(s) contain the actual data. As well as being annoying to write, this is inefficient.

Instead, either:

- Use a branch reset group to make each “thing” you want to match always map to the same underlying capture group.
- Use named capture groups, giving the same name to each of the things you want to alias. This ultimately has the same effect.

Example

The regex (a)|(b)|(c) has three capture groups, but only one of them can ever be involved in the match.
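As a rough sketch of the application logic this forces, here is the three-group form in Python's standard `re` module (used only as a stand-in for illustration; `matched_value` is a hypothetical helper, not part of Specregex):

```python
import re

# Three capture groups, of which at most one participates in any match.
NAIVE = re.compile(r"(a)|(b)|(c)")

def matched_value(text):
    """Return (group_number, value) for whichever group matched."""
    m = NAIVE.search(text)
    if m is None:
        return None
    # Application logic: scan every group to find the one that is set.
    for index in range(1, NAIVE.groups + 1):
        if m.group(index) is not None:
            return index, m.group(index)
    return None
```

The caller has to know about all three group numbers even though they carry the same kind of data.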

Using a branch reset group, we can have (?|(a)|(b)|(c)). This regex only has one capture group, and it will match a, b, or c.

This example is somewhat contrived, since we could simply have written (a|b|c). In more complicated situations with nested groups, however, such a trivial solution is not possible and you can use a branch reset group to avoid “divergence” of capture group numberings.

Alternatively, you can use repeated names, such as in (?<foo>a)|(?<foo>b). This regex has a single group, “foo”, which matches a or b, similarly to the earlier example. Unlike the earlier example, when using named groups in this fashion you are not constrained by ordering of the groups (the names can appear in any order on either side of the OR).
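To make the numbering divergence and the ordering point more concrete, here is a sketch with two hypothetical date formats, `2024-05` and `05/2024`. Python's standard `re` module is used only to show the naive numbering; it supports neither branch reset groups nor repeated names, so those variants are described in the text after the code rather than run.

```python
import re

# Naive form: four groups. The year is in group 1 or group 4,
# the month in group 2 or group 3, depending on the alternative.
DATES = re.compile(r"(\d{4})-(\d{2})|(\d{2})/(\d{4})")

def year_and_month(text):
    """Normalize the diverging group numbers into (year, month)."""
    m = DATES.search(text)
    if m is None:
        return None
    if m.group(1) is not None:
        # First alternative matched: groups 1 and 2 are set.
        return m.group(1), m.group(2)
    # Second alternative matched: groups 3 and 4 are set.
    return m.group(4), m.group(3)
```

Assuming the syntax described above, a branch reset group around these alternatives would reduce this to two groups, but the year would be group 1 in the first branch and group 2 in the second, because the fields appear in a different order. Repeated names, as in `(?<year>\d{4})-(?<month>\d{2})|(?<month>\d{2})/(?<year>\d{4})`, avoid that constraint.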