Structure of Regular Expressions






Element

An element can be any of the following:

  • An ORDINARY CHARACTER, which matches the same character in the target sequence.

  • A WILDCARD CHARACTER, '.', which matches any character in the target sequence except a newline.

  • A BRACKET EXPRESSION, of the form "[expr]", which matches a single character or a COLLATING ELEMENT in the target sequence that is also in the set defined by the expression expr, or of the form "[^expr]", which matches a single character or a collating element in the target sequence that is not in the set defined by the expression expr. In either case, the expression expr can consist of any combination of any number of each of the following:

    - An INDIVIDUAL CHARACTER, which adds that character to the set defined by expr

    - A CHARACTER RANGE, of the form "ch1-ch2", which adds all the characters represented by values in the closed range [ch1, ch2] to the set defined by expr

    - A CHARACTER CLASS, of the form "[:name:]", which adds all the characters in the named class to the set defined by expr

    - A COLLATING SYMBOL, of the form "[.elt.]", which adds the collation element elt to the set defined by expr

    - An EQUIVALENCE CLASS, of the form "[=elt=]", which adds the collating elements that are equivalent to elt to the set defined by expr

  • An ANCHOR, either '^' or '$', which matches the beginning or the end of the target sequence, respectively.

  • A CAPTURE GROUP, of the form "(subexpression)" (written as "\(subexpression\)" in BRE and grep), which matches the sequence of characters in the target sequence matched by the subexpression between the delimiters.

  • An IDENTITY ESCAPE, of the form "\k", which matches the character k in the target sequence.

For example:

  • "a" matches the target sequence "a" but does not match any of the target sequences "B", "b", or "c".

  • "." matches all the target sequences "a", "B", "b", and "c".

  • "[b-z]" matches the target sequences "b" and "c" but does not match the target sequence "a" or the target sequence "B".

  • "[[:lower:]]" matches the target sequences "a", "b", and "c" but does not match the target sequence "B".

  • "(a)" matches the target sequence "a" and associates capture group 1 with the subsequence "a" but does not match any of the target sequences "B", "b", or "c".
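The examples above can be checked mechanically. The following is a minimal sketch, assuming the C++11 <regex> library and its default ECMAScript grammar; matches_all and check_elements are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
// The grammar defaults to ECMAScript, which is also std::regex's default.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

// Replays the element examples from the text.
void check_elements() {
    assert(matches_all("a", "a") && !matches_all("a", "B"));      // ordinary character
    assert(matches_all(".", "B") && !matches_all(".", "\n"));     // wildcard
    assert(matches_all("[b-z]", "c") && !matches_all("[b-z]", "a"));
    assert(matches_all("[^b-z]", "a") && !matches_all("[^b-z]", "b"));
    assert(matches_all("[[:lower:]]", "a") && !matches_all("[[:lower:]]", "B"));
    assert(matches_all("\\.", ".") && !matches_all("\\.", "a"));  // identity escape

    // A capture group records the text that its subexpression matched.
    std::smatch m;
    const std::string target = "a";
    assert(std::regex_match(target, m, std::regex("(a)")) && m[1] == "a");
}
```

Calling check_elements() from a test driver runs every assertion; std::regex_match is used rather than std::regex_search because the examples describe matches against the entire target sequence.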

In ECMAScript, BRE, and grep, an element can also be

  • A BACK REFERENCE, of the form "\d", as well as "\dd" in ECMAScript, which matches a sequence of characters in the target sequence that is the same as the sequence of characters matched by the Nth CAPTURE GROUP, where N is the value represented by the decimal digit d or by the decimal digits dd.

For example:

  • "(a)\1" matches the target sequence "aa" because the first, and only, capture group matches the initial sequence "a", and the back reference \1 then matches the final sequence "a".
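As a rough illustration, the back reference can be exercised under both the ECMAScript and the BRE spellings. This sketch assumes the C++11 <regex> library, where the BRE grammar is selected with std::regex::basic; the helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

void check_back_references() {
    // ECMAScript: the group matches the first "a", \1 matches the second.
    assert(matches_all("(a)\\1", "aa"));
    assert(!matches_all("(a)\\1", "ab"));
    // The back reference repeats the matched text, not the subexpression.
    assert(matches_all("(.)\\1", "bb") && !matches_all("(.)\\1", "bc"));
    // BRE writes the capture group with escaped parentheses.
    assert(matches_all("\\(a\\)\\1", "aa", std::regex::basic));
}
```

Note that each backslash in the regular expression is doubled in the C++ string literal.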

In ECMAScript, an element can also be any of the following:

  • A NONCAPTURE GROUP, of the form "(?:subexpression)"; the group matches the sequence of characters in the target sequence matched by the subexpression between the delimiters.

  • A limited FILE FORMAT ESCAPE, of the form "\f", "\n", "\r", "\t", or "\v"; these match a form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.

  • A POSITIVE ASSERT, of the form "(?=subexpression)", which matches the sequence of characters in the target sequence matched by the subexpression between the delimiters but does not change the match position in the target sequence.

  • A NEGATIVE ASSERT, of the form "(?!subexpression)", which matches any sequence of characters in the target sequence that does not match the subexpression between the delimiters but does not change the match position in the target sequence.

  • A HEXADECIMAL ESCAPE SEQUENCE, of the form "\xhh"; the sequence matches a character in the target sequence whose representation is the value represented by the two hexadecimal digits hh.

  • A UNICODE ESCAPE SEQUENCE, of the form "\uhhhh", which matches a character in the target sequence whose representation is the value represented by the four hexadecimal digits hhhh.

  • A CONTROL ESCAPE SEQUENCE, of the form "\ck", which matches the control character named by the character k.

  • A WORD BOUNDARY ASSERT, of the form "\b", which matches if the current position in the target sequence is immediately after a word boundary.

  • A NEGATIVE WORD BOUNDARY ASSERT, of the form "\B"; the assert matches if the current position in the target sequence is not immediately after a word boundary.

  • A DSW CHARACTER ESCAPE, of the form "\d", "\D", "\s", "\S", "\w", "\W", which provides a short name for a character class.

For example:

  • "(?:a)" matches the target sequence "a".

  • "(?:a)\1" is invalid, because there is no capture group 1.

  • "(?=a)a" matches the target sequence "a". The assert matches the initial sequence "a" in the target sequence, and the final "a" in the regular expression matches the initial sequence "a" in the target sequence.

  • "(?!a)a" does not match the target sequence "a"; nor does it match any other target sequence.

  • "a\b." matches the target sequence "a!" but does not match the target sequence "ab".

  • "a\B." matches the target sequence "ab" but does not match the target sequence "a!".
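The ECMAScript-only elements can be tried out in the same way. The sketch below assumes the C++11 <regex> library and an ASCII-compatible character set (so that the value 0x41 is 'A'); matches_all, is_invalid, and check_ecmascript_elements are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

// Hypothetical helper: true when compiling `pattern` throws std::regex_error.
bool is_invalid(const std::string& pattern) {
    try {
        std::regex re(pattern);
    } catch (const std::regex_error&) {
        return true;
    }
    return false;
}

void check_ecmascript_elements() {
    assert(matches_all("(?:a)", "a"));
    assert(is_invalid("(?:a)\\1"));       // there is no capture group 1
    assert(matches_all("(?=a)a", "a"));   // the assert consumes nothing
    assert(!matches_all("(?!a)a", "a"));
    assert(matches_all("a\\b.", "a!") && !matches_all("a\\b.", "ab"));
    assert(matches_all("a\\B.", "ab") && !matches_all("a\\B.", "a!"));
    assert(matches_all("\\x41", "A") && matches_all("\\u0041", "A"));
    assert(matches_all("\\d\\s\\w", "1 a"));  // digit, space, word character
}
```

The control escape is left out of the sketch; only the forms shown here are exercised.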

In awk, an element can also be one of the following:

  • A FILE FORMAT ESCAPE, of the form "\\", "\a", "\b", "\f", "\n", "\r", "\t", or "\v"; these match a backslash, alert, backspace, form feed, newline, carriage return, horizontal tab, and vertical tab, respectively, in the target sequence.

  • An OCTAL ESCAPE SEQUENCE, of the form "\ooo", which matches a character in the target sequence whose representation is the value represented by the one, two, or three octal digits ooo.
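A small sketch of the awk escapes follows, assuming the C++11 <regex> library (grammar flag std::regex::awk) and an ASCII-compatible character set, in which octal 101 is the letter 'A'. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`
// under the awk grammar.
bool matches_awk(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern, std::regex::awk));
}

void check_awk_escapes() {
    assert(matches_awk("a\\tb", "a\tb"));  // \t matches a horizontal tab
    assert(matches_awk("\\\\", "\\"));     // \\ matches a single backslash
    assert(matches_awk("\\101", "A"));     // octal 101 is 'A' in ASCII
}
```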

Repetition

Any element other than a POSITIVE ASSERT, a NEGATIVE ASSERT, or an ANCHOR can be followed by a repetition count. The most general form of a repetition count is "{min,max}" (written as "\{min,max\}" in BRE and grep). An element followed by this form of repetition count matches at least min and no more than max successive occurrences of sequences that match the element.

For example:

  • "a{2,3}" matches the target sequence "aa" and the target sequence "aaa" but not the target sequence "a" or the target sequence "aaaa".

A repetition count can also take one of the following forms:

  • "{min}" (written as "\{min\}" in BRE and grep), which is equivalent to "{min,min}"

  • "{min,}" (written as "\{min,\}" in BRE and grep), which is equivalent to "{min,unbounded}"

  • "*", which is equivalent to "{0,unbounded}"

For example:

  • "a{2}" matches the target sequence "aa" but not the target sequence "a" or "aaa".

  • "a{2,}" matches the target sequence "aa", the target sequence "aaa", and so on, but does not match the target sequence "a".

  • "a*" matches the target sequence "", the target sequence "a", the target sequence "aa", and so on.
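The brace forms and "*" can be sketched as follows, assuming the C++11 <regex> library. The patterns are written with no whitespace inside the braces, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

void check_counts() {
    // {min,max}
    assert(matches_all("a{2,3}", "aa") && matches_all("a{2,3}", "aaa"));
    assert(!matches_all("a{2,3}", "a") && !matches_all("a{2,3}", "aaaa"));
    // {min}
    assert(matches_all("a{2}", "aa") && !matches_all("a{2}", "aaa"));
    // {min,}
    assert(matches_all("a{2,}", "aaaa") && !matches_all("a{2,}", "a"));
    // * accepts zero or more occurrences, including the empty sequence.
    assert(matches_all("a*", "") && matches_all("a*", "aa"));
    // BRE escapes the braces.
    assert(matches_all("a\\{2,3\\}", "aaa", std::regex::basic));
}
```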

For all grammars except BRE and grep, a repetition count can also take one of the following forms:

  • "?", which is equivalent to "{0,1}"

  • "+", which is equivalent to "{1,unbounded}"

For example:

  • "a?" matches the target sequence "" and the target sequence "a" but not the target sequence "aa".

  • "a+" matches the target sequence "a", the target sequence "aa", and so on, but not the target sequence "".
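A short sketch of "?" and "+", assuming the C++11 <regex> library. The last two checks illustrate why these forms are unavailable in BRE: there, '+' is treated as an ordinary character. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

void check_optional_and_plus() {
    assert(matches_all("a?", "") && matches_all("a?", "a"));
    assert(!matches_all("a?", "aa"));
    assert(matches_all("a+", "a") && matches_all("a+", "aa"));
    assert(!matches_all("a+", ""));
    // In BRE, "a+" is the letter 'a' followed by a literal plus sign.
    assert(matches_all("a+", "a+", std::regex::basic));
    assert(!matches_all("a+", "aa", std::regex::basic));
}
```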

All the previous repetition counts apply a greedy repetition, which matches as many characters as possible in the target sequence. In ECMAScript, all the forms of repetition count can be followed by the character '?' to specify a nongreedy repetition, which matches as few characters as possible in the target sequence.

For example:

  • "(a+)a*" matches the target sequence "aa" and associates capture group 1 with the entire target sequence because the element inside the capture group ("a+") uses a greedy match.

  • "(a+?)a*" matches the target sequence "aa" and associates capture group 1 with the initial subsequence "a" because the element inside the capture group ("a+?") uses a nongreedy match.
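The difference between the two examples shows up in the contents of capture group 1. A minimal sketch, assuming the C++11 <regex> library with its default ECMAScript grammar; group1 and check_greediness are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: text of capture group 1 when `pattern` matches the
// whole of `target`, or "" when there is no match.
std::string group1(const std::string& pattern, const std::string& target) {
    std::smatch m;
    if (!std::regex_match(target, m, std::regex(pattern))) return "";
    return m[1].str();
}

void check_greediness() {
    // Greedy: a+ takes both characters and leaves nothing for a*.
    assert(group1("(a+)a*", "aa") == "aa");
    // Nongreedy: a+? takes one character and a* takes the rest.
    assert(group1("(a+?)a*", "aa") == "a");
}
```

Both patterns match the same target sequence; only the division of the text between the group and the trailing "a*" changes.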

Concatenation

Regular expression elements, with or without repetition counts, can be concatenated to form longer regular expressions. Such a concatenated regular expression matches a target sequence that is a concatenation of sequences matched by the individual elements.

For example:

  • "a{2,3}c" matches the target sequence "aac" and the target sequence "aaac" but does not match the target sequence "ac" or the target sequence "aaaac".

  • "ab{2,3}c" matches the target sequence "abbc" and the target sequence "abbbc" but does not match the target sequence "ababc".

  • "(ab){2,3}c" matches the target sequence "ababc" and the target sequence "abababc" but does not match the target sequence "abbc".
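These concatenation examples can be replayed with a sketch like the following, which assumes the C++11 <regex> library and writes the repetition counts without whitespace. It also shows that a repetition count binds to the single element before it unless a group is used. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

void check_concatenation() {
    assert(matches_all("a{2,3}c", "aac") && matches_all("a{2,3}c", "aaac"));
    assert(!matches_all("a{2,3}c", "ac") && !matches_all("a{2,3}c", "aaaac"));
    // The count applies to 'b' alone ...
    assert(matches_all("ab{2,3}c", "abbc") && matches_all("ab{2,3}c", "abbbc"));
    assert(!matches_all("ab{2,3}c", "ababc"));
    // ... unless "ab" is grouped into a single element.
    assert(matches_all("(ab){2,3}c", "ababc"));
    assert(matches_all("(ab){2,3}c", "abababc"));
    assert(!matches_all("(ab){2,3}c", "abbc"));
}
```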

Alternation

For all the regular expression grammars except BRE and grep, a concatenated regular expression can be followed by the character '|' and another concatenated regular expression, which can be followed by another '|' and another concatenated regular expression, and so on. Such an expression matches any target sequence that matches one or more of the concatenated regular expressions.

For example:

  • "ab|cd" matches the target sequence "ab" and the target sequence "cd" but does not match the target sequence "abd" or the target sequence "acd".
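A brief sketch of alternation, assuming the C++11 <regex> library. The final two checks illustrate the BRE restriction: with std::regex::basic, '|' is taken as an ordinary character. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar = std::regex::ECMAScript) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

void check_alternation() {
    assert(matches_all("ab|cd", "ab") && matches_all("ab|cd", "cd"));
    assert(!matches_all("ab|cd", "abd") && !matches_all("ab|cd", "acd"));
    assert(matches_all("ab|cd", "cd", std::regex::extended));
    // BRE has no alternation, so the pattern is five ordinary characters.
    assert(!matches_all("ab|cd", "cd", std::regex::basic));
    assert(matches_all("ab|cd", "ab|cd", std::regex::basic));
}
```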

In grep and egrep, a newline character ('\n') can be used to separate alternations.[2]

[2] The UNIX utilities grep and egrep can take a file as the source of the regular expression that they try to match. In that case, each line in the file is a separate concatenated regular expression.
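The newline form can be sketched as follows, assuming the C++11 <regex> library, where these grammars are selected with std::regex::grep and std::regex::egrep. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target,
                 std::regex::flag_type grammar) {
    return std::regex_match(target, std::regex(pattern, grammar));
}

void check_newline_alternation() {
    // Each line of the pattern is a separate alternative.
    const std::string pattern = "ab\ncd";
    assert(matches_all(pattern, "ab", std::regex::egrep));
    assert(matches_all(pattern, "cd", std::regex::egrep));
    assert(matches_all(pattern, "cd", std::regex::grep));
    assert(!matches_all(pattern, "abcd", std::regex::egrep));
}
```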

When more than one of the concatenated regular expressions in an alternation could match part of the target sequence, ECMAScript chooses the first one that matches as the match; the other regular expression grammars choose the one that results in the longest match.

For example:

  • "(a|ab).*" matches the target sequence "abc". In ECMAScript, the capture group is associated with the initial sequence "a" because it matched the first element in the alternation. Under the other grammars, the capture group is associated with the initial sequence "ab" because it gave the longest match in the alternation.
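The first-match and longest-match rules can be observed with a sketch along these lines, assuming the C++11 <regex> library. The capture-group check is made only for ECMAScript; sub-match results under the other grammars are left unasserted because library implementations may differ in how they report them. The longest-match rule is shown instead on the overall match of "a|ab". All helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: text of the first search hit of `pattern` in
// `target`, or "" when nothing matches.
std::string first_hit(const std::string& pattern, const std::string& target,
                      std::regex::flag_type grammar) {
    std::smatch m;
    if (!std::regex_search(target, m, std::regex(pattern, grammar))) return "";
    return m[0].str();
}

// Hypothetical helper: text of capture group 1 for a whole-target match,
// or "" when there is no match.
std::string group1(const std::string& pattern, const std::string& target,
                   std::regex::flag_type grammar) {
    std::smatch m;
    if (!std::regex_match(target, m, std::regex(pattern, grammar))) return "";
    return m[1].str();
}

void check_alternation_choice() {
    // ECMAScript takes the first alternative that works.
    assert(group1("(a|ab).*", "abc", std::regex::ECMAScript) == "a");
    assert(first_hit("a|ab", "abc", std::regex::ECMAScript) == "a");
    // The ERE grammar prefers the alternative giving the longest match.
    assert(first_hit("a|ab", "abc", std::regex::extended) == "ab");
}
```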

Subexpression

A subexpression is a concatenated regular expression in BRE and grep and an alternation in the other regular expression grammars. This is where the specification for regular expressions becomes recursive. In particular, as we saw earlier, a capture group can hold a subexpression. This makes it possible to nest subexpressions to create rather complicated (and potentially unreadable) regular expressions.

For example:

  • "(a(.*)d)" matches the target sequence "abcd", associates capture group 1 with the text "abcd", and associates capture group 2 with the text "bc".[3]

    [3] Capture group 1 begins with the first '(', and capture group 2 begins with the second.

  • "(a(.*)d)\1" matches the target sequence "abcdabcd". It associates capture group 1 with the initial text "abcd", and the back reference matches the corresponding text at the end of the target text.

  • "(a(.*)d)\2" matches the target sequence "abcdbc". It associates capture group 2 with the first occurrence of "bc", and the back reference matches the corresponding text at the end of the target text.
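The numbering of nested capture groups and the two back references can be checked with a sketch like this one, assuming the C++11 <regex> library and its default ECMAScript grammar. The helper names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `target`.
bool matches_all(const std::string& pattern, const std::string& target) {
    return std::regex_match(target, std::regex(pattern));
}

void check_nesting() {
    // Groups are numbered by the position of their opening parenthesis.
    std::smatch m;
    const std::string target = "abcd";
    assert(std::regex_match(target, m, std::regex("(a(.*)d)")));
    assert(m[1] == "abcd" && m[2] == "bc");

    // \1 repeats the outer group, \2 repeats the inner group.
    assert(matches_all("(a(.*)d)\\1", "abcdabcd"));
    assert(matches_all("(a(.*)d)\\2", "abcdbc"));
    assert(!matches_all("(a(.*)d)\\2", "abcdabcd"));
}
```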


