Java's Regex Flavor







java.util.regex is powered by a Traditional NFA, so the rich set of lessons from Chapters 4, 5, and 6 apply. The table on the facing page summarizes its metacharacters. Certain aspects of the flavor are modified by a variety of match modes, turned on via flags to the various methods and factories, or turned on and off via (?mods-mods) and (?mods-mods:⋯) modifiers embedded within the regular expression itself. The modes are listed in the table on page 368.

Overview of Sun's java.util.regex Flavor

Character Shorthands [➀]
  ☞ 115  (c)  \a [\b] \e \f \n \r \t \0octal \x## \u#### \cchar

Character Classes and Class-Like Constructs
  ☞ 118  (c)  Classes: [⋯] [^⋯] (may contain class set operators ☞ 125)
  ☞ 119       Almost any character: dot (various meanings, changes with modes)
  ☞ 120  (c)  Class shorthands: [➁] \w \d \s \W \D \S
  ☞ 121  (c)  Unicode properties and blocks: [➂] \p{Prop} \P{Prop}

Anchors and Other Zero-Width Tests
  ☞ 370       Start of line/string: ^ \A
  ☞ 370       End of line/string: $ \z \Z
  ☞ 130       Start of current match: \G
  ☞ 133       Word boundary: [➃] \b \B
  ☞ 133       Lookaround: [➄] (?=⋯) (?!⋯) (?<=⋯) (?<!⋯)

Comments and Mode Modifiers
  ☞ 135       Mode modifiers: (?mods-mods) Modifiers allowed: x d s m i u
  ☞ 135       Mode-modified spans: (?mods-mods:⋯)
  ☞ 368  (c)  Comments: From # until newline (only when enabled) [➅]
  ☞ 113  (c)  Literal-text mode: [➆] \Q⋯\E

Grouping and Capturing
  ☞ 137       Capturing parentheses: (⋯) \1 \2 ...
  ☞ 137       Grouping-only parentheses: (?:⋯)
  ☞ 139       Atomic grouping: (?>⋯)
  ☞ 139       Alternation: |
  ☞ 141       Greedy quantifiers: * + ? {n} {n,} {x,y}
  ☞ 141       Lazy quantifiers: *? +? ?? {n}? {n,}? {x,y}?
  ☞ 142       Possessive quantifiers: *+ ++ ?+ {n}+ {n,}+ {x,y}+

(c) may also be used within a character class        [➀]⋯[➆] see text

The following notes augment the table above:

  • [➀] \b is a character shorthand for backspace only within a character class. Outside of a character class, \b matches a word boundary (☞ 133).

    The table shows "raw" backslashes, not the doubled backslashes required when regular expressions are provided as Java string literals. For example, \n in the table must be written as "\\n" as a Java string. See "Strings as Regular Expressions" (☞ 101).

    \x## allows exactly two hexadecimal digits, e.g., \xFCber matches 'über'.

    \u#### allows exactly four hexadecimal digits, e.g., \u00FCber matches 'über', and \u20AC matches '€'.

    \0octal requires a leading zero, followed by one to three octal digits.

    \cchar is case sensitive, blindly XORing the ordinal value of the following character with 0x40. This bizarre behavior means that, unlike any other flavor I've ever seen, \cA and \ca are different. Use uppercase letters to get the traditional meaning of \x01. As it happens, \ca is the same as \x21, matching '!'.

  • [➁] \w, \d, and \s (and their uppercase counterparts) match only ASCII characters, and don't include the other alphanumerics, digits, or whitespace in Unicode. That is, \d is exactly the same as [0-9], \w is the same as [0-9a-zA-Z_], and \s is the same as [ \t\n\f\r\x0B] (\x0B is the little-used ASCII VT character). For full Unicode coverage, you can use Unicode properties (☞ 121): use \p{L} for \w, use \p{Nd} for \d, and use \p{Z} for \s. (Use the \P{} version of each for \W, \D, and \S.)

  • [➂] \p{} and \P{} support Unicode properties and blocks, and some additional "Java properties." Unicode scripts are not supported. Details follow on the facing page.

  • [➃] The \b and \B word boundary metacharacters' idea of a "word character" is not the same as that of \w and \W. The word boundaries understand the properties of Unicode characters, while \w and \W match only ASCII characters.

  • [➄] Lookahead constructs can employ arbitrary regular expressions, but lookbehind is restricted to subexpressions whose possible matches are finite in length. This means, for example, that ? is allowed within lookbehind, but * and + are not. See the description in Chapter 3, starting on page 133.

  • [➅] # sequences are taken as comments only under the influence of the x modifier, or when the Pattern.COMMENTS option is used (☞ 368). (Don't forget to add newlines to multiline string literals, as in the example on page 401.) Unescaped ASCII whitespace is ignored. Note: unlike most regex engines that support this type of mode, comments and free whitespace are recognized within character classes.

  • [➆] \Q⋯\E has always been supported, but its use entirely within a character class was buggy and unreliable until Java 1.6.
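Several of the points in the first two notes can be checked directly. The following is a minimal sketch rather than anything from the original text: the class name ShorthandNotesSketch and its method names are hypothetical, and it assumes the default match modes (no flags).

```java
import java.util.regex.Pattern;

public class ShorthandNotesSketch {
    // Regex backslashes are doubled because each regex is a Java string literal.

    // Two-digit hex escape: the regex \xFCber against the text "über".
    static boolean hexEscape() {
        return Pattern.matches("\\xFCber", "\u00FCber");
    }

    // Leading-zero octal escape: octal 101 is decimal 65, the letter A.
    static boolean octalEscape() {
        return Pattern.matches("\\0101", "A");
    }

    // Control escapes are case sensitive: uppercase gives the traditional
    // control character, lowercase 'a' (0x61) XOR 0x40 gives 0x21, '!'.
    static boolean controlUpper() {
        return Pattern.matches("\\cA", "\u0001");
    }

    static boolean controlLower() {
        return Pattern.matches("\\ca", "!");
    }

    // ASCII-only shorthand versus the Unicode property, tested on one character.
    static boolean shorthandDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    static boolean propertyDigit(String s) {
        return Pattern.matches("\\p{Nd}", s);
    }

    public static void main(String[] args) {
        String arabicIndicThree = "\u0663"; // a non-ASCII decimal digit
        System.out.println(hexEscape());
        System.out.println(octalEscape());
        System.out.println(controlUpper() + " " + controlLower());
        System.out.println(shorthandDigit(arabicIndicThree) + " " + propertyDigit(arabicIndicThree));
    }
}
```

With no flags in effect, the shorthand \d is expected to reject the non-ASCII digit while \p{Nd} accepts it, which is the distinction the second note draws.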

The java.util.regex Match and Regex Modes

Compile-Time Option        (?mode)  Description
Pattern.UNIX_LINES         d        Changes how dot and the line anchors (^, $, \Z) match (☞ 370)
Pattern.DOTALL             s        Causes dot to match any character (☞ 111)
Pattern.MULTILINE          m        Expands where ^ and $ can match (☞ 370)
Pattern.COMMENTS           x        Free-spacing and comment mode (☞ 72) (Applies even inside character classes)
Pattern.CASE_INSENSITIVE   i        Case-insensitive matching for ASCII characters
Pattern.UNICODE_CASE       u        Case-insensitive matching for non-ASCII characters
Pattern.CANON_EQ                    Unicode "canonical equivalence" match mode (different encodings of the same character match as identical ☞ 108)
Pattern.LITERAL                     Treat the regex argument as plain, literal text instead of as a regular expression
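As an illustration of the two ways to enable a mode, the sketch below applies the same modes once as compile-time flags and once as embedded modifiers. It is not from the original text; the class name ModeFlagsSketch, the method names, and the sample strings are assumptions chosen for the example.

```java
import java.util.regex.Pattern;

public class ModeFlagsSketch {
    private static final String LOWER = "\u00FCber"; // "über"
    private static final String UPPER = "\u00DCBER"; // "ÜBER"

    // CASE_INSENSITIVE alone covers only ASCII letters.
    static boolean asciiOnly() {
        return Pattern.compile(LOWER, Pattern.CASE_INSENSITIVE).matcher(UPPER).matches();
    }

    // Adding UNICODE_CASE extends case folding to non-ASCII letters.
    static boolean withUnicodeCase() {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(LOWER, flags).matcher(UPPER).matches();
    }

    // The same two modes, turned on from within the regex itself.
    static boolean embedded() {
        return Pattern.matches("(?iu)" + LOWER, UPPER);
    }

    // A mode-modified span: only "ab" is case insensitive, "c" is not.
    static boolean span(String text) {
        return Pattern.matches("(?i:ab)c", text);
    }

    // COMMENTS mode: whitespace and # comments in the regex are ignored.
    static boolean commented(String text) {
        String regex = "\\d+   # one or more digits\n"
                     + "-      # a hyphen\n"
                     + "\\d+   # more digits\n";
        return Pattern.compile(regex, Pattern.COMMENTS).matcher(text).matches();
    }

    // LITERAL mode: the dot is plain text, not a metacharacter.
    static boolean literal(String text) {
        return Pattern.compile("a.c", Pattern.LITERAL).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiOnly() + " " + withUnicodeCase() + " " + embedded());
        System.out.println(span("ABc") + " " + span("ABC"));
        System.out.println(commented("12-34"));
        System.out.println(literal("a.c") + " " + literal("abc"));
    }
}
```

Flags suit patterns whose modes are fixed at the call site, while embedded modifiers travel with the regex string, which helps when the pattern comes from configuration or user input.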


8.1.1. Java Support for \p{⋯} and \P{⋯}

The \p{⋯} and \P{⋯} constructs support Unicode properties and blocks, as well as special "Java" character properties. Unicode support is as of Unicode Version 4.0.0. (Java 1.4.2's support is only as of Unicode Version 3.0.0.)

Unicode properties

Unicode properties are referenced via short names such as \p{Lu}. (See the list on page 122.) One-letter property names may omit the braces: \pL is the same as \p{L}. The long names such as \p{Lowercase_Letter} are not supported.

In Java 1.5 and earlier, the Pi and Pf properties are not supported, and as such, characters with that property are not matched by \p{P}. (Java 1.6 supports them.)

The "other stuff" property \p{C} doesn't match code points matched by the "unassigned code points" property \p{Cn}.

The \p{L&} composite property is not supported.

The pseudo-property \p{all} is supported and is equivalent to (?s:.). The \p{assigned} and \p{unassigned} pseudo-properties are not supported, but you can use \P{Cn} and \p{Cn} instead.

Unicode blocks

Unicode blocks are supported, requiring an 'In' prefix. See page 402 for version-specific details on how block names can appear within \p{⋯} and \P{⋯}.

For backward compatibility, two Unicode blocks whose names changed between Unicode Versions 3.0 and 4.0 are accessible by either name as of Java 1.5. The extra non-Unicode-4.0 names Combining Marks for Symbols and Greek can now be used in addition to the Unicode 4.0 standard names Combining Diacritical Marks for Symbols and Greek and Coptic.

A Java 1.4.2 bug involving the Arabic Presentation Forms-B and Latin Extended-B block names has been fixed as of Java 1.5 (☞ 403).
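The property and block forms described in this section can be tried with a few one-character tests. This is a hedged sketch, not code from the original text: the class name UnicodePropertySketch and its helper names are hypothetical, and the Greek block is used only because its 'In'-prefixed name is short.

```java
import java.util.regex.Pattern;

public class UnicodePropertySketch {
    // One-letter property without braces.
    static boolean letterShort(String s) {
        return Pattern.matches("\\pL", s);
    }

    // The same property written with braces.
    static boolean letterBraced(String s) {
        return Pattern.matches("\\p{L}", s);
    }

    // Two-letter short name: uppercase letter.
    static boolean upper(String s) {
        return Pattern.matches("\\p{Lu}", s);
    }

    // Block test: the 'In' prefix is required for block names.
    static boolean inGreek(String s) {
        return Pattern.matches("\\p{InGreek}", s);
    }

    // Negated block test using the uppercase P form.
    static boolean notInGreek(String s) {
        return Pattern.matches("\\P{InGreek}", s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // Latin small letter e with acute
        String alpha = "\u03B1";  // Greek small letter alpha
        System.out.println(letterShort(eAcute) + " " + letterBraced(eAcute));
        System.out.println(upper("A") + " " + upper("a"));
        System.out.println(inGreek(alpha) + " " + notInGreek("a"));
    }
}
```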

Special Java character properties

Starting in Java 1.5.0, the \p{⋯} and \P{⋯} constructs include support for the non-deprecated isSomething methods in java.lang.Character. To access the method functionality within a regex, replace the method name's leading 'is' with 'java', and use that within \p{⋯} or \P{⋯}. For example, characters matched by java.lang.Character.isSomething can be matched from within a regex by \p{javaSomething}.
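To make the naming rule concrete, the sketch below compares one such property against the method it is derived from. The choice of Character.isLowerCase (and so \p{javaLowerCase}) is an assumption made for this illustration, as are the class and method names; any other non-deprecated isSomething method should follow the same pattern.

```java
import java.util.regex.Pattern;

public class JavaPropertySketch {
    // "isLowerCase" with its leading "is" replaced by "java".
    private static final Pattern LOWER = Pattern.compile("\\p{javaLowerCase}");

    // True when the regex property and the Character method agree for ch.
    static boolean agrees(char ch) {
        boolean viaRegex = LOWER.matcher(String.valueOf(ch)).matches();
        boolean viaMethod = Character.isLowerCase(ch);
        return viaRegex == viaMethod;
    }

    // Checks agreement over the Latin-1 range, U+0000 through U+00FF.
    static boolean agreesOnLatin1() {
        for (char ch = 0; ch <= 0xFF; ch++) {
            if (!agrees(ch)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(LOWER.matcher("a").matches());
        System.out.println(LOWER.matcher("A").matches());
        System.out.println(agreesOnLatin1());
    }
}
```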

Unicode Line Terminators

In traditional pre-Unicode regex flavors, a newline (ASCII LF character) is treated specially by dot, ^, $, and \Z. In Java, most Unicode line terminators (☞ 109) also receive this special treatment.

Java normally considers the following as line terminators:

Character Codes    Nicknames     Description
U+000A             LF \n         ASCII Line Feed ("newline")
U+000D             CR \r         ASCII Carriage Return
U+000D U+000A      CR/LF \r\n    ASCII Carriage Return / Line Feed sequence
U+0085             NEL           Unicode NEXT LINE
U+2028             LS            Unicode LINE SEPARATOR
U+2029             PS            Unicode PARAGRAPH SEPARATOR


The characters and situations that are treated specially by dot, ^, $, and \Z change depending on which match modes (☞ 368) are in effect:

Match Mode    Affects     Description
UNIX_LINES    ^ . $ \Z    Revert to traditional newline-only line-terminator semantics.
MULTILINE     ^ $         Add embedded line terminators to list of locations after which ^ and before which $ can match.
DOTALL        .           Line terminators no longer special to dot; it matches any character.
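The effect of these modes on dot can be seen by matching a single line-terminator character under each one. This is a small sketch with a hypothetical class name (DotModesSketch) and helper; the sample characters are taken from the line-terminator list above.

```java
import java.util.regex.Pattern;

public class DotModesSketch {
    // True when a lone dot, compiled with the given flags, matches the whole text.
    static boolean dotMatches(String text, int flags) {
        return Pattern.compile(".", flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String ls = "\u2028"; // Unicode LINE SEPARATOR

        // Default: dot refuses every recognized line terminator.
        System.out.println(dotMatches(ls, 0));
        System.out.println(dotMatches("\r", 0));

        // DOTALL: nothing is special to dot.
        System.out.println(dotMatches(ls, Pattern.DOTALL));
        System.out.println(dotMatches("\n", Pattern.DOTALL));

        // UNIX_LINES: only the newline remains special.
        System.out.println(dotMatches("\r", Pattern.UNIX_LINES));
        System.out.println(dotMatches("\n", Pattern.UNIX_LINES));
    }
}
```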


The two-character CR/LF line-terminator sequence deserves special mention. By default, when the full complement of line terminators is recognized (that is, when UNIX_LINES is not used), a CR/LF sequence is treated as an atomic unit by the line-boundary metacharacters, and they can't match between the sequence's two characters.

For example, $ and \Z can normally match just before a line terminator. LF is a line terminator, but $ and \Z can match before a string-ending LF only when it is not part of a CR/LF sequence (that is, when the LF is not preceded by a CR).

This extends to $ and ^ in MULTILINE mode, where ^ can match after an embedded CR only when that CR is not followed by a LF, and $ can match before an embedded LF only when that LF is not preceded by a CR.

To be clear, DOTALL has no effect on how CR/LF sequences are treated (DOTALL affects only dot, which always considers characters individually), and UNIX_LINES removes the issue altogether (it renders CR and all the other non-newline line terminators unspecial).
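The CR/LF behavior described above can be observed by listing the offsets at which a line-boundary metacharacter matches. The following is a minimal sketch; the class name LineBoundarySketch, the positions helper, and the sample strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBoundarySketch {
    // Collects the start offset of every match of a zero-width pattern.
    static List<Integer> positions(Pattern pattern, String text) {
        List<Integer> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        String crlfAtEnd = "abc\r\n"; // CR at offset 3, LF at offset 4
        String crlfInside = "a\r\nb"; // CR at offset 1, LF at offset 2

        // Default mode: $ matches before the CR/LF pair and at the very end,
        // but not between the CR and the LF.
        System.out.println(positions(Pattern.compile("$"), crlfAtEnd)); // [3, 5]

        // UNIX_LINES: only LF is a line terminator, so $ matches just before it.
        System.out.println(positions(Pattern.compile("$", Pattern.UNIX_LINES), crlfAtEnd)); // [4, 5]

        // MULTILINE: ^ matches after the whole CR/LF pair, never inside it.
        System.out.println(positions(Pattern.compile("^", Pattern.MULTILINE), crlfInside)); // [0, 3]

        // MULTILINE: $ matches before the pair and at the end of the string.
        System.out.println(positions(Pattern.compile("$", Pattern.MULTILINE), crlfInside)); // [1, 4]
    }
}
```

Offset 4 in the first string and offset 2 in the second fall between the CR and the LF, and they appear only when UNIX_LINES makes the CR an ordinary character.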


