Java's Regex Flavor

java.util.regex is powered by a Traditional NFA, so the rich set of lessons from Chapters 4, 5, and 6 applies. The figure on the facing page summarizes its metacharacters. Certain aspects of the flavor are modified by a variety of match modes, turned on via flags to the various methods and factories, or turned on and off via (?mods-mods) and (?mods-mods:⋯) modifiers embedded within the regular expression itself. The modes are listed on page 368.

Overview of Sun's java.util.regex Flavor

Character Shorthands [➀]
  ☞ 115   [(c)] \a [\b] \e \f \n \r \t \0octal \x## \u#### \cchar

Character Classes and Class-Like Constructs
  ☞ 118   Classes: [⋯] [^⋯] (may contain class set operators ☞ 125)
  ☞ 119   Almost any character: dot (various meanings, changes with modes)
  ☞ 120   Class shorthands: [➁] \w \d \s \W \D \S
  ☞ 121   Unicode properties and blocks: [➂] \p{Prop} \P{Prop}

Anchors and Other Zero-Width Tests
  ☞ 370   Start of line/string: ^ \A
  ☞ 370   End of line/string: $ \z \Z
  ☞ 130   Start of current match: \G
  ☞ 133   Word boundary: [➃] \b \B
  ☞ 133   Lookaround: [➄] (?=⋯) (?!⋯) (?<=⋯) (?<!⋯)

Comments and Mode Modifiers
  ☞ 135   Mode modifiers: (?mods-mods) Modifiers allowed: x d s m i u
  ☞ 135   Mode-modified spans: (?mods-mods:⋯)
  ☞ 368   Comments: From # until newline (only when enabled) [➅]
  ☞ 113   Literal-text mode: [➆] \Q⋯\E

Grouping and Capturing
  ☞ 137   Capturing parentheses: (⋯) \1 \2 ...
  ☞ 137   Grouping-only parentheses: (?:⋯)
  ☞ 139   Atomic grouping: (?>⋯)
  ☞ 139   Alternation: |
  ☞ 141   Greedy quantifiers: * + ? {n} {n,} {x,y}
  ☞ 141   Lazy quantifiers: *? +? ?? {n}? {n,}? {x,y}?
  ☞ 142   Possessive quantifiers: *+ ++ ?+ {n}+ {n,}+ {x,y}+

[(c)] may also be used within a character class. [➀]⋯[➆] see text.

These notes augment the overview above:

  • [➀] \b is a character shorthand for backspace only within a character class. Outside of a character class, \b matches a word boundary (☞ 133).

    The table shows "raw" backslashes, not the doubled backslashes required when regular expressions are provided as Java string literals. For example, \n in the table must be written as "\\n" as a Java string. See "Strings as Regular Expressions" (☞ 101).

    \x## allows exactly two hexadecimal digits, e.g., \xFCber matches 'über'.

    \u#### allows exactly four hexadecimal digits, e.g., \u00FCber matches 'über', and \u20AC matches '€'.

    \0octal requires a leading zero, followed by one to three octal digits.

    \cchar is case sensitive, blindly xoring the ordinal value of the following character with 0x40. This bizarre behavior means that, unlike any other flavor I've ever seen, \cA and \ca are different. Use uppercase letters to get the traditional meaning: \cA is \x01. As it happens, \ca is the same as \x21, matching '!'.

  • [➁] \w, \d, and \s (and their uppercase counterparts) match only ASCII characters, and don't include the other alphanumerics, digits, or whitespace in Unicode. That is, \d is exactly the same as [0-9], \w is the same as [0-9a-zA-Z_], and \s is the same as [ \t\n\f\r\x0B] (\x0B is the little-used ASCII VT character). For full Unicode coverage, you can use Unicode properties (☞ 121): use \p{L} for \w, use \p{Nd} for \d, and use \p{Z} for \s. (Use the \P{} version of each for \W, \D, and \S.)

  • [➂] \p{} and \P{} support Unicode properties and blocks, and some additional "Java properties." Unicode scripts are not supported. Details follow on the facing page.

  • [➃] The \b and \B word boundary metacharacters' idea of a "word character" is not the same as that of \w and \W. The word boundaries understand the properties of Unicode characters, while \w and \W match only ASCII characters.

  • [➄] Lookahead constructs can employ arbitrary regular expressions, but lookbehind is restricted to subexpressions whose possible matches are finite in length. This means, for example, that ? is allowed within lookbehind, but * and + are not. See the description in Chapter 3, starting on page 133.

  • [➅] # sequences are taken as comments only under the influence of the x modifier, or when the Pattern.COMMENTS option is used (☞ 368). (Don't forget to add newlines to multiline string literals, as in the example on page 401.) Unescaped ASCII whitespace is ignored. Note: unlike most regex engines that support this type of mode, Java recognizes comments and free whitespace within character classes as well.

  • [➆] \Q⋯\E has always been supported, but its use entirely within a character class was buggy and unreliable until Java 1.6.
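Several of these notes are easy to check by hand. The following is only a minimal sketch, assuming a reasonably recent JDK; the class name FlavorNotesDemo and its method names are hypothetical, invented for this illustration. It covers the doubled backslashes needed in string literals, the case sensitivity of \cchar, the ASCII-only default of \d, and a \Q⋯\E literal span.

```java
import java.util.regex.Pattern;

class FlavorNotesDemo {
    // The regex \n is written "\\n" in a Java string literal.
    static boolean newlineShorthand() {
        return Pattern.matches("a\\nb", "a\nb");
    }

    // Two-digit hex and four-digit Unicode escapes both name U+00FC here.
    static boolean hexEscape()     { return Pattern.matches("\\xFCber", "\u00FCber"); }
    static boolean unicodeEscape() { return Pattern.matches("\\u00FCber", "\u00FCber"); }

    // \cA is the traditional Control-A (\x01).
    static boolean upperControl() { return Pattern.matches("\\cA", "\u0001"); }

    // \ca is 0x61 xor 0x40 = 0x21, which is '!'.
    static boolean lowerControl() { return Pattern.matches("\\ca", "!"); }

    // U+0663 is ARABIC-INDIC DIGIT THREE: not matched by the default \d,
    // but matched by the Unicode property \p{Nd}.
    static boolean asciiDigit()   { return Pattern.matches("\\d", "\u0663"); }
    static boolean unicodeDigit() { return Pattern.matches("\\p{Nd}", "\u0663"); }

    // Inside \Q...\E the plus sign is an ordinary character.
    static boolean literalSpan() { return Pattern.matches("\\Q1+1=2\\E", "1+1=2"); }

    public static void main(String[] args) {
        System.out.println("\\n shorthand:   " + newlineShorthand());
        System.out.println("\\xFC escape:    " + hexEscape());
        System.out.println("\\cA is \\x01:    " + upperControl());
        System.out.println("\\ca is '!':     " + lowerControl());
        System.out.println("\\d on U+0663:   " + asciiDigit());
        System.out.println("\\p{Nd} on same: " + unicodeDigit());
        System.out.println("\\Q...\\E span:   " + literalSpan());
    }
}
```

The word-boundary and lookbehind notes are left out of the sketch on purpose, since their behavior has varied between Java releases.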

The java.util.regex Match and Regex Modes

Compile-Time Option        (?mode)  Description
Pattern.UNIX_LINES         d        Changes how dot and ^ match (☞ 370)
Pattern.DOTALL             s        Causes dot to match any character (☞ 111)
Pattern.MULTILINE          m        Expands where ^ and $ can match (☞ 370)
Pattern.COMMENTS           x        Free-spacing and comment mode (☞ 72) (applies even inside character classes)
Pattern.CASE_INSENSITIVE   i        Case-insensitive matching for ASCII characters
Pattern.UNICODE_CASE       u        Case-insensitive matching for non-ASCII characters
Pattern.CANON_EQ                    Unicode "canonical equivalence" match mode (different encodings of the same character match as identical ☞ 108)
Pattern.LITERAL                     Treat the regex argument as plain, literal text instead of as a regular expression
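As a rough illustration of how these modes are selected, the sketch below turns the same mode on three ways: as a flag to Pattern.compile, as an embedded (?i) modifier, and as a mode-modified span. It also touches the COMMENTS and LITERAL flags. The class name MatchModesDemo and its method names are hypothetical; only the java.util.regex.Pattern calls and constants are real API.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // Mode turned on with a flag to the factory: applies to the whole regex.
    static boolean viaFlag() {
        return Pattern.compile("hello", Pattern.CASE_INSENSITIVE)
                      .matcher("HeLLo").matches();
    }

    // The same mode turned on from within the regex.
    static boolean viaModifier() {
        return Pattern.matches("(?i)hello", "HeLLo");
    }

    // Mode-modified span: only "ell" is matched case-insensitively.
    static boolean viaSpan(String text) {
        return Pattern.matches("h(?i:ell)o", text);
    }

    // U+00E9 and U+00C9 are lowercase and uppercase e-acute.
    // CASE_INSENSITIVE alone covers ASCII only; UNICODE_CASE extends it.
    static boolean accented(int flags) {
        return Pattern.compile("\u00e9", flags).matcher("\u00c9").matches();
    }

    // COMMENTS: whitespace and #-comments are ignored. Each comment needs
    // a newline to end it, so the newlines are written explicitly.
    static boolean freeSpacing() {
        Pattern p = Pattern.compile(
            "\\d+   # one or more digits\n" +
            "-      # a literal hyphen\n" +
            "\\d+   # more digits\n",
            Pattern.COMMENTS);
        return p.matcher("555-1234").matches();
    }

    // LITERAL: the dot is plain text, not a metacharacter.
    static boolean literal(String text) {
        return Pattern.compile("a.c", Pattern.LITERAL).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(viaFlag() + " " + viaModifier());
        System.out.println(viaSpan("hELLo") + " " + viaSpan("HELLO"));
        System.out.println(accented(Pattern.CASE_INSENSITIVE));
        System.out.println(accented(Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE));
        System.out.println(freeSpacing());
        System.out.println(literal("a.c") + " " + literal("abc"));
    }
}
```

Flags are combined with the bitwise-or operator, as in the UNICODE_CASE call above. CANON_EQ and LITERAL have no embedded-modifier letter, so they can be selected only through flags.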

8.1.1. Java Support for \p{⋯} and \P{⋯}

The \p{⋯} and \P{⋯} constructs support Unicode properties and blocks, as well as special "Java" character properties. Unicode support is as of Unicode Version 4.0.0. (Java 1.4.2's support is only as of Unicode Version 3.0.0.)

Unicode properties

Unicode properties are referenced via short names such as \p{Lu}. (See the list on page 122.) One-letter property names may omit the braces: \pL is the same as \p{L}. The long names such as \p{Lowercase_Letter} are not supported.

In Java 1.5 and earlier, the Pi and Pf properties are not supported, and as such, characters with that property are not matched by \p{P}. (Java 1.6 supports them.)

The "other stuff" property \p{C} doesn't match code points matched by the "unassigned code points" property \p{Cn}.

The \p{L&} composite property is not supported.

The pseudo-property \p{all} is supported and is equivalent to (?s:.). The \p{assigned} and \p{unassigned} pseudo-properties are not supported, but you can use \P{Cn} and \p{Cn} instead.
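The short property names described above can be tried out with a few one-character matches. This is only a sketch; the class name UnicodePropertyDemo and its method names are hypothetical, and it sticks to properties whose behavior is stable across Java versions.

```java
import java.util.regex.Pattern;

class UnicodePropertyDemo {
    // Short name in braces: U+00C9 (uppercase e-acute) is an uppercase letter.
    static boolean shortName() {
        return Pattern.matches("\\p{Lu}", "\u00c9");
    }

    // One-letter names may omit the braces: \pL is the same as \p{L}.
    // The subject contains U+00EF (i with diaeresis).
    static boolean noBraces() {
        return Pattern.matches("\\pL+", "na\u00efve");
    }

    // \P{...} is the negated form: a digit is not a letter.
    static boolean negated() {
        return Pattern.matches("\\P{L}", "7");
    }

    // \P{Cn} stands in for the unsupported \p{assigned}.
    static boolean assigned() {
        return Pattern.matches("\\P{Cn}", "A");
    }

    public static void main(String[] args) {
        System.out.println(shortName());
        System.out.println(noBraces());
        System.out.println(negated());
        System.out.println(assigned());
    }
}
```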

Unicode blocks

Unicode blocks are supported, requiring an 'In' prefix. See page 402 for version-specific details on how block names can appear within \p{⋯} and \P{⋯}.

For backward compatibility, two Unicode blocks whose names changed between Unicode Versions 3.0 and 4.0 are accessible by either name as of Java 1.5. The extra non-Unicode-4.0 names Combining Marks for Symbols and Greek can now be used in addition to the Unicode 4.0 standard names Combining Diacritical Marks for Symbols and Greek and Coptic.

A Java 1.4.2 bug involving the Arabic Presentation Forms-B and Latin Extended-B block names has been fixed as of Java 1.5 (☞ 403).

Special Java character properties

Starting in Java 1.5.0, the \p{⋯} and \P{⋯} constructs include support for the non-deprecated isSomething methods in java.lang.Character. To access the method functionality within a regex, replace the method name's leading 'is' with 'java', and use that within \p{⋯} or \P{⋯}. For example, characters matched by java.lang.Character.isJavaIdentifierStart can be matched from within a regex by \p{javaJavaIdentifierStart}.
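The 'In' prefix for blocks and the 'java' prefix for Character methods might be exercised as follows. This is a sketch rather than a complete survey: the class name BlockAndJavaPropertyDemo and its method names are hypothetical, and the checks are limited to one block and two Character methods.

```java
import java.util.regex.Pattern;

class BlockAndJavaPropertyDemo {
    // Block names take an 'In' prefix. U+03B1 is GREEK SMALL LETTER ALPHA.
    static boolean inGreek(String ch) {
        return Pattern.matches("\\p{InGreek}", ch);
    }

    // Character.isJavaIdentifierStart becomes \p{javaJavaIdentifierStart}.
    // Returns true when the regex and the method agree about the character.
    static boolean agreesWithMethod(char c) {
        boolean viaRegex = Pattern.matches("\\p{javaJavaIdentifierStart}",
                                           String.valueOf(c));
        return viaRegex == Character.isJavaIdentifierStart(c);
    }

    // Character.isLowerCase becomes \p{javaLowerCase}.
    static boolean javaLowerCase(String ch) {
        return Pattern.matches("\\p{javaLowerCase}", ch);
    }

    public static void main(String[] args) {
        System.out.println(inGreek("\u03b1") + " " + inGreek("a"));
        for (char c : new char[] {'a', '_', '$', '1', ' '}) {
            System.out.println("'" + c + "' agrees: " + agreesWithMethod(c));
        }
        System.out.println(javaLowerCase("q") + " " + javaLowerCase("Q"));
    }
}
```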

Unicode Line Terminators

In traditional pre-Unicode regex flavors, a newline (ASCII LF character) is treated specially by dot, ^, $, and \Z. In Java, most Unicode line terminators (☞ 109) also receive this special treatment.

Java normally considers the following as line terminators:

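Character Codes   Nicknames    Description
U+000A            LF \n        ASCII Line Feed ("newline")
U+000D            CR \r        ASCII Carriage Return
U+000D U+000A     CR/LF \r\n   ASCII Carriage Return / Line Feed sequence
U+0085            NEL          Unicode Next Line
U+2028            LS           Unicode Line Separator
U+2029            PS           Unicode Paragraph Separator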

The characters and situations that are treated specially by dot, ^, $, and \Z change depending on which match modes (☞ 368) are in effect:

Match Mode    Affects     Description
UNIX_LINES    ^ . $ \Z    Revert to traditional newline-only line-terminator semantics.
MULTILINE     ^ $         Add embedded line terminators to list of locations after which ^ and before which $ can match.
DOTALL        .           Line terminators no longer special to dot; it matches any character.

The two-character CR/LF line-terminator sequence deserves special mention. By default, when the full complement of line terminators is recognized (that is, when UNIX_LINES is not used), a CR/LF sequence is treated as an atomic unit by the line-boundary metacharacters, and they can't match between the sequence's two characters.

For example, $ and \Z can normally match just before a line terminator. LF is a line terminator, but $ and \Z can match before a string-ending LF only when it is not part of a CR/LF sequence (that is, when the LF is not preceded by a CR).

This extends to $ and ^ in MULTILINE mode, where ^ can match after an embedded CR only when that CR is not followed by a LF, and $ can match before an embedded LF only when that LF is not preceded by a CR.

To be clear, DOTALL has no effect on how CR/LF sequences are treated (DOTALL affects only dot, which always considers characters individually), and UNIX_LINES removes the issue altogether (it renders CR and all the other non-newline line terminators unspecial).
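The CR/LF behavior described above can be observed by listing the offsets at which ^ and $ match. The following is a minimal sketch; the class name LineTerminatorDemo and the helpers positions and dotMatches are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminatorDemo {
    // Offsets at which a zero-width regex such as ^ or $ matches in text.
    static List<Integer> positions(String regex, int flags, String text) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    // Does dot match the given single character under the given flags?
    static boolean dotMatches(int flags, String ch) {
        return Pattern.compile("a.b", flags).matcher("a" + ch + "b").matches();
    }

    public static void main(String[] args) {
        String crlfAtEnd = "abc\r\n";   // offsets: a=0 b=1 c=2 CR=3 LF=4
        String crlfInside = "a\r\nb";   // offsets: a=0 CR=1 LF=2 b=3

        // Default: $ matches before the CR/LF pair and at the very end,
        // but not between the CR and the LF.
        System.out.println(positions("$", 0, crlfAtEnd));                    // [3, 5]

        // UNIX_LINES: only LF is special, so $ matches just before it.
        System.out.println(positions("$", Pattern.UNIX_LINES, crlfAtEnd));   // [4, 5]

        // MULTILINE: ^ matches after the whole CR/LF pair, not after the CR.
        System.out.println(positions("^", Pattern.MULTILINE, crlfInside));   // [0, 3]

        // MULTILINE: $ matches before the CR, not before the LF.
        System.out.println(positions("$", Pattern.MULTILINE, crlfInside));   // [1, 4]

        // Dot: CR is off limits by default, but not under UNIX_LINES.
        System.out.println(dotMatches(0, "\r"));                   // false
        System.out.println(dotMatches(Pattern.UNIX_LINES, "\r"));  // true
        System.out.println(dotMatches(Pattern.UNIX_LINES, "\n"));  // false
        System.out.println(dotMatches(Pattern.DOTALL, "\n"));      // true
    }
}
```

Collecting match offsets with find() works here because Matcher advances past an empty match before searching again, so each qualifying position is reported once.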
