REGULAR EXPRESSIONS





Regular expressions are a large part of the Perl language and they are used just about everywhere. In addition to the =~ and !~ operators, regular expressions are also supported by split(), grep(), and other functions, and they become a standard part of just about every text processing script you write. The regular expression system in Perl is one of the contributing factors to why Perl is so popular when it comes to processing information.

Python's regular expression system is not such an integral part of the language. There is no built-in regular expression operator; instead we have to use a module from the standard library to provide the facilities. Although we have access to the same basic methods of matching and replacing regular expression information, they need to be accessed through a series of functions or classes rather than operators.

The Python re module replaces the older (and now obsolete) regex module as the regular expression engine. The main benefit of the re module, aside from some improvements in both speed and memory footprint, is that it works very much like the regular expression parser in Perl, supporting largely the same semantics and regular expression language, but without the operator interface to the regular expression engine.

For cross reference, the character sequences, expressions and other characters supported by the re module in Python are shown in Figure. We'll look at specific solutions for working with regular expressions in Python shortly.

Figure Character sequences recognized by the re module in Python
Character sequence Description
text Matches the string text
. Matches any character except newline (unless the s flag is in use)
^ Matches the start of a string
$ Matches the end of a string
* Matches zero or more repeats of the preceding expression, matching as many repetitions as possible
*? Matches zero or more repeats of the preceding expression, matching as few repetitions as possible
+ Matches one or more repeats of the preceding expression, matching as many repetitions as possible
+? Matches one or more repeats of the preceding expression, matching as few repetitions as possible
? Matches zero or one repeats of the preceding expression, matching as many repetitions as possible
?? Matches zero or one repeats of the preceding expression, matching as few repetitions as possible
{m,n} Matches from m to n repetitions of the preceding expression, matching as many repetitions as possible
{m,n}? Matches from m to n repetitions of the preceding expression, matching as few repetitions as possible
[…] Matches a set of characters; for example [a-zA-Z] or [,./;']
[^…] Matches the characters not in the set
A|B Matches either expression A or B
(…) Expression group
\number Matches the text in expression group number
\A Matches only at the start of the string
\b Matches a word boundary
\B Matches not a word boundary
\d Match any decimal digit – equivalent to r'[0-9]'
\D Match any non-digit character – equivalent to r'[^0-9]'
\s Match any whitespace (space, tab, newline, carriage-return, form feed, vertical tab)
\S Match any non-whitespace character
\w Match any alphanumeric character
\W Match any non-alphanumeric character
\Z Match only the end of the string
\\ Match backslash
(?iLmsux) The group matches the empty string, with the iLmsux characters setting the corresponding regular expression options as defined in Figure. Note that in Python these flags apply to the entire expression, not just the portion that follows, so the group is best placed at the start of the pattern. Use this form when you want the effects to be defined by the search expression, rather than by the regular expression function
(?:…) Matches the expression defined between the parentheses, but does not populate the grouping table
(?P<name>…) Matches the expression defined between the parentheses, but the matched expression is also available as a symbolic group identified by name. Note that the group still populates the normal group match variables. To refer to a group by name, supply it directly to the match.end() or match.group() methods, or use \g<name> in a replacement string
(?P=name) Matches whatever text was matched by the earlier named group
(?#…) Introduces a comment – the contents of the parentheses are ignored
(?=…) Matches if the text supplied matches next, without consuming any text. This allows you to look ahead in an expression, without affecting the rest of the regular expression parsing. For example Martin (?=Brown) will only match "Martin" if it's immediately followed by "Brown"
(?!…) Matches only if the specified expression doesn't match next (the opposite of (?=…))
(?<=…) Matches if the current position in the string is preceded by the supplied text, with the preceding text ending at the current position. For example, (?<=abc)def will match the 'def' in 'abcdef'. The supplied expression must match a fixed number of characters, so abc and a|b are allowed but a* is not
(?<!…) Matches if the current position in the string is not preceded by the specified match (opposite of (?<=…))
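
A few of the less familiar sequences from the table can be tried out directly. The following is a minimal sketch rather than an exhaustive tour; the sample strings and the group name word are made up for illustration.

```python
import re

# Lookahead: "Martin" matches only when " Brown" follows, and the
# lookahead itself consumes no text.
ahead = re.search(r'Martin(?= Brown)', 'Martin Brown')
no_ahead = re.search(r'Martin(?= Brown)', 'Martin Smith')

# Lookbehind: "def" matches only when it is preceded by "abc".
behind = re.search(r'(?<=abc)def', 'abcdef')

# Named group plus a reference back to it: finds a doubled word.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'it was the the best')

print(ahead.group())            # the matched text excludes " Brown"
print(no_ahead)                 # None, because the lookahead failed
print(behind.group(), behind.start())
print(doubled.group('word'))
```

Note that the lookahead and lookbehind text never appears in the result of group(); only the text outside those constructs is consumed.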

Some general notes that apply to all Python regular expressions:

  • Regular expressions should be supplied as raw strings using the r prefix, as in r'\d+' – this ensures that the string is interpreted as seen, rather than being run through the normal parser for escape sequences that is applied to all other single, double and triple quoted strings.

  • Note that expression groups are available using the old style, and now deprecated in Perl, form of \number as opposed to the newer Perl format of $number.
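
The effect of the raw string prefix is easiest to see with \b, which means "backspace" to the Python string parser but "word boundary" to the re module. This short sketch uses a made-up sample string:

```python
import re

text = 'the cat sat on the mat'

# In a normal string '\b' is a backspace character, so this pattern
# looks for backspace + "cat" + backspace and finds nothing.
plain = re.search('\bcat\b', text)

# In a raw string the backslash reaches the re module untouched,
# where \b is treated as a word boundary.
raw = re.search(r'\bcat\b', text)

# Group references inside a pattern use the \number form.
repeated = re.search(r'(at) s\1', text)

print(plain)             # None
print(raw.group())
print(repeated.group())
```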

Most regular expression functions also accept a number of flags, much like the Perl expressions, as listed here in Figure:

Flags supported by regular expression processes
Flag Description
I or IGNORECASE Ignores the case of expression and match text
L or LOCALE Use locale settings for determining the \b, \B and \w, \W sequences
M or MULTILINE Makes ^ and $ apply to the start and end of lines rather than strings in multiline strings
S or DOTALL Forces . to match all characters, including newline
X or VERBOSE Allows regular expressions to ignore unescaped whitespace and comments
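
As a rough illustration of how these flags are supplied, the sketch below passes them as the flags argument and combines two of them with the | operator. The sample text and patterns are assumptions chosen only to show each flag's effect.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: "line" matches whatever its case.
any_case = re.findall(r'line', text, re.I)

# The same flag set from inside the pattern with (?i).
inline = re.findall(r'(?i)line', text)

# MULTILINE: ^ matches at the start of every line, not just the string.
first_words = re.findall(r'^\w+', text, re.M)

# DOTALL: . also matches the newlines between the lines.
without_s = re.search(r'First.*third', text)
with_s = re.search(r'First.*third', text, re.S)

# VERBOSE: whitespace and comments in the pattern are ignored.
# Flags are combined with the | operator.
verbose = r'''
    (\w+)    # a word ...
    \s+      # ... followed by whitespace ...
    line     # ... and the literal text "line"
'''
labelled = re.findall(verbose, text, re.X | re.I)

print(any_case)
print(first_words)
print(without_s, with_s is not None)
print(labelled)
```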

You'll notice that there is no equivalent of the /g option, which in Perl forces matches to occur globally – this is because Python's substitution functions replace every occurrence by default and instead accept an optional fourth argument, count, which restricts the number of modifications to the given count value. See the Substitution section for examples.

Basic searches/matches

You can perform a basic search using the re.search() function:

search(pattern, string [, flags])

This is functionally equivalent to the Perl:

$string =~ /pattern/flags;

For example, to determine whether the string 'cat' is in the string $string in Perl we'd use:

if ($string =~ /cat/) { ... }

In Python this translates to:

if re.search(r'cat', string): ...

The search() function actually returns a MatchObject if the search was successful, or None if it wasn't. See the Using MatchObjects section for more information.

The match() function is a more restrictive version of search(), only matching expressions at the beginning of the supplied string, rather than anywhere within the string. You can use it within a script to force a match to start at the beginning of a string, irrespective of a user-supplied regular expression. Otherwise it provides no advantage over preceding the search expression with \A.
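
The difference between the two functions can be sketched as follows, using the same sample string as the other examples in this section:

```python
import re

string = 'the cat sat on the mat'

# search() finds "cat" anywhere in the string.
found = re.search(r'cat', string)

# match() only looks at the start of the string, so "cat" fails ...
at_start = re.match(r'cat', string)

# ... while "the" succeeds.
leading = re.match(r'the', string)

# search() with \A behaves like match().
anchored = re.search(r'\Acat', string)

# Both functions return None on failure, so test before using the result.
if found:
    print('found at', found.start())
print(at_start, anchored)
print(leading.group())
```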

Extracting matched components

If you want to extract every piece of text that matches an expression then the simplest method is to use the findall() function – this returns a list of the matches in a given string, rather than a MatchObject. If the expression contains no groups, each item in the list is the full matched text; if it contains a single group, each item is the text matched by that group; and if it contains multiple groups, each item is a tuple containing the text for each group. For example:

>>> import re
>>> string = 'the cat sat on the mat at ten'
>>> print(re.findall(r'at',string))
['at', 'at', 'at']
>>> print(re.findall(r'm.*?\b',string))
['mat']
>>> print(re.findall(r'(at)',string))
['at', 'at', 'at']
>>> print(re.findall(r'(cat)(.*?)(mat)',string))
[('cat', ' sat on the ', 'mat')]

This makes the findall() function the equivalent of the Perl statements:

@matches = ($string =~ m/at/g);

or

($animal, $doing, $where) = ($string =~ m/(cat)(.*?)(mat)/);

Using MatchObjects

The search() and match() functions return MatchObjects which contain information both about the contents of the matched groups and about the locations within the original string at which the matches occurred. It's probably easiest to think of a MatchObject as a super variable containing the information that Perl populates through the $number, @- (@LAST_MATCH_START) and @+ (@LAST_MATCH_END) built-in variables. I've listed the methods and attributes available to the MatchObject indicated by m in Figure.

Methods for a given MatchObject
MatchObject method Description
m.group([group, …]) Returns the matched text for the supplied group or groups, identified by index number or name. A single group is returned as a string and multiple groups are returned as a tuple. If no group is given then the entire match is returned
m.groups([default]) Returns a tuple containing the text matched by all the groups in the pattern. If supplied then default is the value returned for those groups that did not take part in the match. The default value for default is None
m.groupdict([default]) Returns a dictionary containing all the named subgroups of the match. If supplied then default is the value returned for those groups that did not take part in the match. The default value for default is None
m.start([group]) Returns the start location of the specified group, or the start location of the entire match
m.end([group]) Returns the end location of the specified group, or the end location of the entire match
m.span([group]) Returns a two element tuple equivalent to (m.start(group), m.end(group)) for a given group or the entire matched expression
m.pos The value of pos as passed to the match() or search() function
m.endpos The value of endpos as passed to the match() or search() function
m.re The regular expression object that created this MatchObject
m.string The string supplied to the match() or search() function

For example the Perl fragment:

$datetime = 'The date and time is 11/2/01 16:12:01 from MET';
$datetime =~ m'((\d+)/(\d+)/(\d+)) ((\d+):(\d+):(\d+))';
$date = $1;
$time = $5;
($day, $month, $year) = ($2, $3, $4);

can be re-written in Python as:

import re
datetime = 'The date and time is 11/2/01 16:12:01 from MET'
# search() is needed here: the date is not at the start of the string
dtmatch = re.search(
    r'((\d+)/(\d+)/(\d+)) ((\d+):(\d+):(\d+))', datetime)
date = dtmatch.group(1)
time = dtmatch.group(5)
(day, month, year) = dtmatch.group(2,3,4)
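
The position and dictionary methods from the table can be explored on the same string. The sketch below is illustrative only: the group names date, time and zone are made up, and the optional zone group is included purely to show what happens to a group that takes no part in the match.

```python
import re

datetime_text = 'The date and time is 11/2/01 16:12:01 from MET'
m = re.search(
    r'(?P<date>\d+/\d+/\d+) (?P<time>\d+:\d+:\d+)(?P<zone> UTC)?',
    datetime_text)

# Offsets of the whole match and of an individual group.
whole_span = m.span()
date_span = m.span('date')
time_start = m.start('time')

# Slicing the original string with those offsets gives the same text
# as group().
same_text = m.string[m.start():m.end()] == m.group()

# Groups that did not take part in the match come back as None,
# or as the supplied default.
all_groups = m.groups()
with_default = m.groups('')
by_name = m.groupdict()

print(whole_span, date_span, time_start)
print(all_groups)
print(by_name['date'], by_name['time'], by_name['zone'])
```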

Substitution

The sub() function performs the same operation as the s/// substitution operator in Perl. The basic format for the command is:

sub(pattern, replace, string [, count])

We can therefore rewrite the Perl fragment:

$text = 'the cat sat on the mat';
$text =~ s/cat/slug/;

with:

text = 'the cat sat on the mat'
text = re.sub(r'cat', 'slug', text)

Note once again that the text or variable that you perform the substitution on is not modified in place; we must re-assign the result of the function to the original variable to make the change.

The replacement can also contain group references in the form \n where n is the group number. The Perl fragment below converts an international date (yyyymmdd) into a British date:

$date = '20010416';
$date =~ s/(\d{4})(\d{2})(\d{2})/$3.$2.$1/;

Which we can rewrite in Python as:

date = '20010416'
date = re.sub(r'(\d{4})(\d{2})(\d{2})', r'\3.\2.\1', date)
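
Replacement strings containing group references are best written as raw strings too, since a sequence such as \3 in a normal Python string is an octal character escape. The following sketch compares the forms; the group names year, month and day are assumptions added for illustration.

```python
import re

date = '20010416'
pattern = r'(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'

# Raw replacement string with numbered group references.
numbered = re.sub(pattern, r'\3.\2.\1', date)

# The same substitution using named references.
named = re.sub(pattern, r'\g<day>.\g<month>.\g<year>', date)

# Without the r prefix, '\3' is the character with code 3, so the
# groups are never inserted.
broken = re.sub(pattern, '\3.\2.\1', date)

print(numbered)
print(named)
print(repr(broken))
```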

The replace argument will also accept a function which will be supplied a single MatchObject argument. For example the Perl statement below is used to replace sequences of the form %xx as used in URLs with their single character equivalent:

$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

We can rewrite this in Python as:

value = re.sub(r'%([a-fA-F0-9][a-fA-F0-9])',
               lambda x: chr(int(x.group(1), 16)), value)
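
When the replacement logic grows beyond a single expression, a named function can be supplied in place of the lambda. This is a minimal sketch; the function name decode_hex and the sample value are made up for the example.

```python
import re

def decode_hex(match):
    """Return the character for the two hex digits captured in group 1."""
    return chr(int(match.group(1), 16))

value = 'name%3DMartin%20Brown'

# The function is called once per match with the MatchObject as its
# only argument; its return value is used as the replacement text.
decoded = re.sub(r'%([a-fA-F0-9][a-fA-F0-9])', decode_hex, value)

print(decoded)
```

For real URL handling in Python 3, the standard library's urllib.parse.unquote() function is likely a better choice than a hand-written expression.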

To get the number of substitutions that took place, use the subn() function, which returns a tuple containing the substituted text and the number of substitutions. For example:

text = 'the cat sat on the mat'
(text, subs) = re.subn(r'at', 'ow', text)

sets subs to three.

As previously mentioned, there is no /g flag to force substitutions to take place globally within a given string. In fact, by default all of Python's substitution functions replace all occurrences. To limit the number of modifications made, use the optional fourth argument, count. For example, to change only the first occurrence of "at" to "oelacanth":

text = 'the cat sat on the mat'
text = re.sub(r'at', 'oelacanth', text, 1)

which produces "the coelacanth sat on the mat."

Using compiled regular expressions

The compile() function compiles a regular expression into a regular expression object in much the same way as the qr{} operator in Perl. The basic format of the function is:

compile(str [, flags])

Where str is the regular expression you want to use and flags are any of the flags listed in Figure that you want to apply to the new object. The new object has methods with the same name and purpose as the main functions in the re module.

For example we can rewrite the Python fragment:

text = 'the cat sat on the mat'
text = re.sub(r'at', 'oelacanth', text, 1)

as:

text = 'the cat sat on the mat'
cvanimal = re.compile(r'at')
text = cvanimal.sub('oelacanth', text, 1)

The benefit of the compiled regular expression object is that we can use it many times on different strings without incurring the additional compilation overhead on the regular expression itself. This is particularly useful when processing log or other text files.
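
A typical use might look like the sketch below, where one compiled object is applied to every line of a log. The log format, the sample lines and the variable names are all invented for the example.

```python
import re

# Hypothetical log lines: client address, method, path and status code.
log_lines = [
    '10.0.0.1 GET /index.html 200',
    '10.0.0.2 GET /missing.html 404',
    '10.0.0.1 POST /form.cgi 500',
]

# Compile once, outside the loop ...
request = re.compile(r'(\S+) (GET|POST) (\S+) (\d{3})')

# ... then reuse the object for every line.
errors = []
for line in log_lines:
    m = request.match(line)
    if m and m.group(4) != '200':
        errors.append((m.group(1), m.group(3)))

print(errors)
```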

The other methods supported by a regular expression object are listed in Figure.

Methods/attributes for a regular expression object
Regular expression method Description
r.search(string [, pos [, endpos]]) Identical to the search() function, but allows you to specify the start and end points for the search
r.match(string [, pos [, endpos]]) Identical to the match() function, but allows you to specify the start and end points for the search
r.split(string [, max]) Identical to the split() function
r.findall(string) Identical to the findall() function
r.sub(replace, string [, count]) Identical to the sub() function
r.subn(replace, string [, count]) Identical to the subn() function
r.flags The flags supplied when the object was created
r.groupindex A dictionary mapping the symbolic group names defined by r'(?P<name>…)' to group numbers
r.pattern The pattern used when the object was created
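
The sketch below exercises a few of these methods and attributes, including the pos and endpos arguments that the module-level functions do not offer. The patterns and sample strings are assumptions for illustration.

```python
import re

text = 'the cat sat on the mat'
word = re.compile(r'(?P<first>\w)\w*')

# pos: begin searching at offset 4, skipping "the ".
m = word.search(text, 4)

# endpos: treat the string as if it ended at offset 6 ("the ca").
short = word.search(text, 4, 6)

# split() with a limit on the number of splits made.
comma = re.compile(r'\s*,\s*')
fields = comma.split('perl , python,ruby', maxsplit=1)

print(m.group(), m.span(), m.pos)
print(short.group(), short.endpos)
print(fields)
print(word.pattern)
print(dict(word.groupindex))
```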

Escaping strings

The quotemeta() function in Perl escapes all the characters in a string that may be interpreted as regular expression character sequences, allowing you to use the string directly as a literal within a regular expression pattern. The Python escape() function in the re module performs a similar operation:

$string = quotemeta($expr)

is therefore equivalent to:

string = re.escape(expr)
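
To see why escaping matters, consider searching for text that happens to contain regular expression characters. In this sketch the strings are made up; the exact form of the escaped string can vary between Python versions, so it is printed rather than assumed.

```python
import re

expr = '1+1=2 (approx.)'
text = 'We know that 1+1=2 (approx.) holds'

# Used directly, the + and the parentheses are treated as regular
# expression syntax, so the literal text is not found.
unescaped = re.search(expr, text)

# After escaping, every special character is matched literally.
escaped = re.search(re.escape(expr), text)

print(unescaped)
print(re.escape(expr))
print(escaped.group())
```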
