Item 15: Know the precedence of regular expression operators.

The "expression" in "regular expression" is there because regular expressions are constructed and parsed using grammatical rules similar to those used for arithmetic expressions. Although regular expressions serve a very different purpose, understanding the similarities between the two will help you write better regular expressions, and hence better Perl.

Regular expressions in Perl are made up of atoms. Atoms are connected by operators like repetition, sequence, and alternation. Most regular expression atoms are single-character matches. For example:

a

Matches the letter a.

\$

Matches the character $—backslash escapes metacharacters.

\n

Matches newline.

[a-z]

Matches a lowercase letter.

.

Matches any character except \n.

\1

Matches contents of first memory—arbitrary length.

There are also special "zero-width" atoms. For example:

\b

Word boundary: a transition between \w and \W, in either direction.

^

Matches start of a string.

\Z

Matches end of a string or before newline at end.
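The zero-width atoms above can be tried out with a short script. This is only a sketch: the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "cat concat\n";   # sample string, chosen for illustration

# \b requires a word boundary on both sides, so the cat inside concat is skipped.
my @standalone = $text =~ /\bcat\b/g;

# ^ anchors at the start of the string.
my $at_start = ($text =~ /^cat/) ? 1 : 0;

# \Z matches at the end of the string, or just before a trailing newline.
my $at_end = ($text =~ /concat\Z/) ? 1 : 0;

# Zero-width atoms consume no characters: the captured text is still just "cat".
my ($matched) = $text =~ /(\bcat\b)/;

printf "%d standalone match(es); matched text: %s\n", scalar(@standalone), $matched;
```

The anchors constrain where a match may occur, but they contribute nothing to the matched text itself.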

Atoms are modified and/or joined together by regular expression operators. As in arithmetic expressions, there is an order of precedence among these operators:

Regular expression operator precedence

Precedence   Operator                    Description
Highest      (), (?:), etc.              Parentheses and other grouping operators
             ?, +, *, {m,n}, +?, etc.    Repetition
             ^abc                        Sequence (see below)
Lowest       |                           Alternation

Fortunately, there are only four precedence levels—imagine if there were as many as there are for arithmetic expressions! Parentheses and the other grouping operators [1] have the highest precedence.

[1] A multitude of new grouping operators were introduced in Perl 5.

A repetition operator binds tightly to its argument, which is either a single atom or a grouping operator:

ab*c

Matches ac, abc, abbc, abbbc, etc.

abc*

Matches ab, abc, abcc, abccc, etc.

ab(c)*

Same thing, and memorizes the c actually matched.

ab(?:c)*

Same thing, but doesn't memorize the c.

abc{2,4}

Matches abcc, abccc, abcccc.

(abc)*

Matches empty string, abc, abcabc, etc.; memorizes abc.
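As a rough check of how tightly repetition binds, the sketch below anchors each pattern with \A and \z so that it must match the whole test string. The test strings and variable names are illustrative assumptions, not part of the original examples.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a string it should match in full.
my @cases = (
    [ qr/\Aab*c\z/,     'abbbc'  ],   # * applies to b only
    [ qr/\Aabc*\z/,     'ab'     ],   # * applies to c only
    [ qr/\Aabc{2,4}\z/, 'abccc'  ],   # {2,4} applies to c only
    [ qr/\A(abc)*\z/,   'abcabc' ],   # * applies to the whole group
);

my @results;
for my $case (@cases) {
    my ($re, $text) = @$case;
    push @results, ($text =~ $re ? 1 : 0);
    print "$text ", ($results[-1] ? "matches" : "does not match"), " $re\n";
}

# abc* is not (abc)*: the ungrouped form cannot match two copies of abc.
my $ungrouped_matches_twice = ('abcabc' =~ /\Aabc*\z/) ? 1 : 0;
```

Only the parenthesized form repeats the whole sequence; in the other patterns the repetition operator sees just the single atom to its left.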

Placing two atoms side by side is called sequence. Sequence is a kind of operator, even though it is written without punctuation. This is similar to the invisible multiplication operator in a mathematical expression like y = ax + b. To illustrate this, let's suppose that sequence were actually represented with the character "•". Then the above examples would look like:

a•b*•c

Matches ac, abc, abbc, abbbc, etc.

a•b•c*

Matches ab, abc, abcc, abccc, etc.

a•b•(c)*

Same thing, and memorizes the c actually matched.

a•b•(?:c)*

Same thing, but doesn't memorize the c.

a•b•c{2,4}

Matches abcc, abccc, abcccc.

(a•b•c)*

Matches empty string, abc, abcabc, etc.; memorizes abc.

The last entry in the precedence chart is alternation. Let's continue to use the "•" notation for a moment:

e•d|j•o

Matches ed or jo.

(e•d)|(j•o)

Same thing.

e•(d|j)•o

Matches edo or ejo.

e•d|j•o{1,3}

Matches ed, jo, joo, jooo.

The zero-width atoms, for example, ^ and \b, group in the same way as other atoms:

^e•d|j•o$

Matches ed at beginning, jo at end.

^(e•d|j•o)$

Matches exactly ed or jo.
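The difference between the two anchored patterns is easy to see by running both against a few strings. The following is a minimal sketch; the word list and the hash names are assumptions made for the example.

```perl
use strict;
use warnings;

my $loose = qr/^ed|jo$/;     # parsed as (^ed)|(jo$)
my $tight = qr/^(ed|jo)$/;   # both anchors apply to both alternatives

my (%loose_hit, %tight_hit);
for my $s (qw(ed jo edward banjo joe)) {
    $loose_hit{$s} = ($s =~ $loose) ? 1 : 0;
    $tight_hit{$s} = ($s =~ $tight) ? 1 : 0;
    printf "%-7s loose=%d tight=%d\n", $s, $loose_hit{$s}, $tight_hit{$s};
}
```

The loose pattern accepts edward (it starts with ed) and banjo (it ends with jo), while the parenthesized pattern accepts only the exact strings ed and jo.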

It's easy to forget about precedence. Removing excess parentheses is a noble pursuit, especially within regular expressions, but be careful not to remove too many:

/^Sender|From:\s+(.*)/;

WRONG. This would match either of these lines:

X-Not-Really-From: faker
Senderella is misspelled

The pattern was meant to match Sender: and From: lines in a mail header, but it actually matches something somewhat different. Here it is with some parentheses added to clarify the precedence:

/(^Sender)|(From:\s+(.*))/;

Adding a pair of parentheses, or perhaps memory-free parentheses (?:…), fixes the problem:

/^(Sender|From):\s+(.*)/;

$1 contains Sender or From.

$2 has the data.

/^(?:Sender|From):\s+(.*)/;

$1 contains the data.
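To compare the broken and repaired patterns side by side, one could run them over a few header-like lines. This is a sketch only: the two addresses are hypothetical, and the array names are chosen for the example.

```perl
use strict;
use warnings;

my @lines = (
    'From: joe@example.com',       # hypothetical address
    'Sender: list@example.com',    # hypothetical address
    'X-Not-Really-From: faker',
    'Senderella is misspelled',
);

my (@wrong, @right);
for my $line (@lines) {
    # Alternation splits the whole pattern: (^Sender) or (From:\s+(.*)).
    push @wrong, $line if $line =~ /^Sender|From:\s+(.*)/;

    # Grouping limits the alternation to the header name.
    push @right, $1 if $line =~ /^(?:Sender|From):\s+(.*)/;
}

print "wrong pattern matched ", scalar(@wrong), " lines\n";
print "right pattern captured: @right\n";
```

The unparenthesized pattern accepts all four lines, while the grouped pattern accepts only the two real headers and captures their data in $1.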

Double-quote interpolation

Perl regular expressions are subject to the same kind of interpolation that double-quoted strings are. [2] Interpolated variables and string escapes like \U and \Q are not regular expression atoms and are never seen by the regular expression parser. Interpolation takes place in a single pass that occurs before a regular expression is parsed:

[2] Well, more or less. The $ anchor receives special treatment so that it is not always interpreted as a scalar variable prefix.

/te(st)/;

Matches test in $_.

/\Ute(st)/;

Matches TEST.

/\Qte(st)/;

Matches te(st).

$x = 'test';

/$x*/;

Matches tes, test, testt, etc.

/test*/;

Same thing as /$x*/.
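The effect of the interpolation pass can be checked with explicit test strings instead of $_. In this sketch the variable names are illustrative, and each \U or \Q is closed with \E so that it covers only the text shown.

```perl
use strict;
use warnings;

my $x = 'test';

# \U upper-cases the pattern text first, so the parser sees /TE(ST)/.
my $upper_hits_TEST = ('TEST' =~ /\Ute(st)\E/) ? 1 : 0;
my $upper_hits_test = ('test' =~ /\Ute(st)\E/) ? 1 : 0;

# \Q backslashes the parentheses first, so the parser sees /te\(st\)/.
my $quoted_hits_literal = ('te(st)' =~ /\Qte(st)\E/) ? 1 : 0;
my $quoted_hits_test    = ('test'   =~ /\Qte(st)\E/) ? 1 : 0;

# $x is pasted in as text, so /$x*/ is /test*/ and the * binds only to the final t.
my $star_hits_tes         = ('tes'      =~ /\A$x*\z/)     ? 1 : 0;
my $star_hits_testtest    = ('testtest' =~ /\A$x*\z/)     ? 1 : 0;
my $grouped_hits_testtest = ('testtest' =~ /\A(?:$x)*\z/) ? 1 : 0;

print "star matches tes: $star_hits_tes, testtest: $star_hits_testtest\n";
```

Wrapping the variable in (?:...) is what turns its contents into a single unit that the * can repeat.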

Double-quote interpolation and the separate regular expression parsing phase combine to produce a number of common "gotchas." For example, here's what can happen if you forget that an interpolated variable is not an atom:

Read a pattern into $pat and match two consecutive occurrences of it.

chomp($pat = <STDIN>);

For example, bob.

print "matched\n" if /$pat{2}/;

WRONG: this is /bob{2}/.

print "matched\n" if /($pat){2}/;

RIGHT: this is /(bob){2}/.

print "matched\n" if /$pat$pat/;

Brute force way.

In this example, if the user types in bob, the first regular expression will match bobb, because the contents of $pat are expanded before the regular expression is interpreted.

All three regular expressions in this example have another potential pitfall. Suppose the user types in the string "hello :-)". This will generate a fatal run-time error. The result of interpolating this string into /($pat){2}/ is /(hello :-)){2}/, which, aside from being nonsense, has unbalanced parentheses.
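Both problems can be reproduced without reading from STDIN. The sketch below hard-codes the sample inputs and builds the pattern text by hand so that the expansion is visible. The variable names are assumptions, and the eval wrapper is just one way to keep the fatal error from ending the script.

```perl
use strict;
use warnings;

my $pat = 'bob';                      # stands in for the line read from STDIN

# Build the pattern text by hand to make the expansion visible.
my $wrong_text = $pat . '{2}';        # bob{2}, which is what /$pat{2}/ becomes
my $right_text = "($pat)" . '{2}';    # (bob){2}, which is what /($pat){2}/ becomes

my $wrong_hits_bobb   = ('bobb'   =~ /\A$wrong_text\z/) ? 1 : 0;
my $wrong_hits_bobbob = ('bobbob' =~ /\A$wrong_text\z/) ? 1 : 0;
my $right_hits_bobbob = ('bobbob' =~ /\A$right_text\z/) ? 1 : 0;

# A pattern with an unbalanced parenthesis dies when it is compiled,
# so wrap the match in eval to catch the error instead of crashing.
my $bad   = 'hello :-)';
my $ok    = eval { my $hit = 'anything' =~ /($bad){2}/; 1 };
my $error = $ok ? '' : $@;

print "error: $error" if $error;
```

Catching the error only keeps the program alive; it does not make the user's input match literally.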

If you don't want special characters like parentheses, asterisks, periods, and so forth interpreted as regular expression metacharacters, use the quotemeta operator or the quotemeta escape, \Q. Both quotemeta and \Q put a backslash in front of any character that isn't a letter, digit, or underscore:

chomp($pat = <STDIN>);

For example, hello :-).

$quoted = quotemeta $pat;

Now hello\ \:\-\).

print "matched\n" if /($quoted){2}/;

"Safe" to match now.

print "matched\n" if /(\Q$pat\E){2}/;

Another approach.

As with seemingly everything else pertaining to regular expressions, tiny errors in quoting metacharacters can result in strange bugs:

print "matched\n" if /(\Q$pat){2}/;

WRONG: without \E, the \Q also quotes the ){2} that follows it, so the pattern becomes /(hello\ \:\-\)\)\{2\}/, which has an unmatched (.
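The quoting variants can be compared directly. This sketch hard-codes the sample input instead of reading STDIN, and the variable names and the eval wrapper are assumptions made for the example.

```perl
use strict;
use warnings;

my $pat    = 'hello :-)';        # stands in for the line read from STDIN
my $quoted = quotemeta $pat;     # hello\ \:\-\)
my $text   = $pat x 2;           # two consecutive copies of the raw input

my $via_quotemeta = ($text =~ /($quoted){2}/)  ? 1 : 0;
my $via_q_and_e   = ($text =~ /(\Q$pat\E){2}/) ? 1 : 0;

# Without \E, the \Q runs to the end of the pattern and also quotes "){2}".
my $ok    = eval { my $hit = $text =~ /(\Q$pat){2}/; 1 };
my $error = $ok ? '' : $@;

print "quotemeta: $via_quotemeta, \\Q...\\E: $via_q_and_e\n";
print "missing \\E: $error" if $error;
```

With the closing \E in place, both approaches match the doubled input; without it, the closing parenthesis and the repetition count are quoted too, and the pattern no longer compiles.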

