Regular Expression Implementations (was: Re: OR, quantifier support in Lua patterns)

sur-behoffski

2018-10-11 00:27:23 UTC

I am not sure why everybody seems to believe that (non-PCRE) regular
expression engine has to be complex. See
https://swtch.com/~rsc/regexp/regexp1.html for implementation in less than
400 lines of C (probably less rewritten in a "modern" style).

Searching is easy... until you want to start offering things that the user
highly desires, such as non-exponential running times and/or modest memory
footprint.

The exponential-running-time case is particularly nasty... you basically
want to build a tool for which there is no input data that can cause
problems, but (of course) this is impossible: There is always someone's
data set that just happens to trigger weaknesses in your code. This is
why DFA engines are considered superior: Quickly running out of memory
lets the user and/or the OS invoke handlers to decide how to proceed...
whereas the best you can do for an NFA is impose a timeout, which can be
somewhere between guesswork and magic.

It gets harder, if, like GNU Grep, you want to be "the fastest Grep in the
West". You have to start incorporating string searching algorithms such as
the Tuned Boyer-Moore algorithm (or a variant), Knuth-Morris-Pratt's
multiple-parallel-keyword search, as well as using Spencer's regex package
(always as a precursor check, plus an engine-of-last-resort if other
engines can't support features), and PCRE, if selected by the caller.

The desire for speed also reaches down into aligning file buffering to align
with underlying OS buffers, minimising virtual-memory overheads. This
technique gains performance for file-based searching; it does not help speed
processing for dynamic strings allocated on-the-fly.

Grep also supports (at the Unicode character level) internationalisation and
localisation, so C code internal representations such as "wchar" and "UTF-8"
can also complicate things (e.g. "[[=a=]]" in a multi-byte locale... some
matching characters in the class are one octet, others are locale-dependent...
perhaps three octets in UTF-8).

-----

Looking more closely at grouping/parenthesis "()" and alternation "|":
Tricky cases such as "(^abc|pqr|xyz$)" can appear, and different users may
have different desires as to whether certain characters, such as the anchors
"^" and "$", are special, or are merely simple literals in this context.

-----

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software