Make regular expressions efficient.





Item 22: Make regular expressions efficient.

Although Perl's regular expression engine contains many optimizations for efficiency, it's possible—and easy at times—to write matches and substitutions that run much slower than they should.

Efficiency may not always be your primary objective. In fact, efficiency should rarely be a primary objective in software development. Generally, a programmer's first priority should be to develop adequate, robust solutions to problems. It doesn't hurt, though, to keep efficiency in mind.

Let's look at a few common issues for regular expressions in Perl. The list below is by no means exhaustive, but it's a start, and it should get you headed in the right direction.

Compile once with /o

The regular expressions for most pattern matches and substitutions are compiled into Perl's internal form only once—at compile time, along with the rest of the surrounding statements:

The pattern /\bmagic_word\b/ is compiled only once, since it remains constant. The compiled form is then used over and over again at run time.

foreach (@big_long_list) { 
  $count += /\bmagic_word\b/; 
}

Count occurrences of magic_word in @big_long_list.

When a pattern contains interpolated variables, however, Perl recompiles it every time it is used:

The pattern /\b$magic\b/ is recompiled every time it is used in a match, since it contains an interpolated variable.

print "give me the magic word: "; 
chomp($magic = <STDIN>); 
foreach (@big_long_list) { 
  $count += /\b$magic\b/; 
}

Count occurrences of the magic word in @big_long_list.

The reason for this behavior is that the variables making up the pattern might have changed since the last time the pattern was compiled, and thus the pattern itself might be different. Perl makes this assumption to be safe, but such recompilation is often unnecessary. In many cases, like the /\b$magic\b/ example above, variables are used to construct a pattern that will remain the same throughout the execution of the program containing it. To recompile such a pattern each time it is used in a match is grossly wasteful. This problem arises often, and naturally there is a feature in Perl to help you solve it. Perl's /o ("compile once") flag causes a regular expression containing variables to be compiled only once—the first time it is encountered at run time:

Use /o to compile patterns only once.

The pattern /\b$magic\b/o is compiled on the first iteration of the foreach loop, using whatever the value of $magic is at that time. The pattern is never compiled again, even if the value of $magic changes.

print "give me the magic word: "; 
chomp($magic = <STDIN>); 
foreach (@big_long_list) { 
  $count += /\b$magic\b/o; 
}

Count occurrences of the magic word in @big_long_list—note added /o.

The /o flag also works for substitutions. Note that the replacement string in the substitution continues to work as it normally does—it can vary from match to match:

print "give me the magic word: "; 
chomp($magic = <STDIN>); 
foreach (@big_long_list) { 
  s/\b$magic\b/rand_word()/eo; 
}

Replace occurrences of $magic with something returned by rand_word(). See also examples at end of Item 29.

Don't use match variables

I mentioned in Item 16 that the match variables ($`, $&, and $') impose a speed penalty on your Perl programs. Whenever a Perl program uses one of the match variables, Perl has to keep track of the values of the match variables for every single pattern match in the program.

Don't use match variables ($`, $&, $') if speed is important.

Using a match variable anywhere in your program activates a feature that makes copies of the match ($&), before-match ($`) and after-match ($') strings for every single match in the program.

$_ = "match variable"; 
/.*/; 
print "Gratuitious use of a $&\n";

Uh-oh: We activated the match variable feature.

while (<>) { 
  push @merlyn, $_ if /\bmerlyn\b/; 
}

This now runs slower because of the use of $& above!

Perl isn't smart enough to know which pattern(s) the match variables might be referring to, so Perl sets up the values of the match variables every time it does a pattern match. This results in a lot of extra copying and unnecessary shuffling around of bytes.

Fortunately, the penalty isn't that severe. In most cases (particularly if some I/O is involved, as above), your program will run only slightly slower, if at all. In test cases designed to spot the penalty, the extra time consumed can range from a few percent to 30 to 40 percent. Jeffrey Friedl reports a contrived test case in which the run time with a match variable present was 700 times longer, but it is unlikely you will face a situation like this.

Avoid unnecessary alternation

Alternation in regular expressions is generally slow. Because of the way the regular expression engine in Perl works, each time an alternative in a regular expression fails to match, the engine has to " backtrack" (see the next subheading) in the string and try the next alternative:

The pattern match below finds a word boundary, then tries to match george. If that fails, it backs up to the boundary and tries to match jane. If that fails, it tries judy, then elroy. If a match is found, it looks for another word boundary.

while (<>) { 
  print if 
    /\b(george|jane|judy|elroy)\b/; 
}

There are some instances in which alternation is completely unnecessary. In these cases, it is usually vastly slower than the correct alternative. The classic mistake is using alternation instead of a character class:

Don't use alternation (a|b|c ) instead of a character class ([abc] ).

Using an alternation instead of a character class can impose a tremendous speed penalty on a pattern match.

while (<>) { 
  push @var, m'((?:\$|@|%|&)\w+)'g; 
}

Look for Perl variable-namelike things. Single quote delimiters turn off variable interpolation inside pattern.

while (<>) { 
  push @var, m'([$@%&]\w+)'g; 
}

Look for Perl variable-namelike things. This is about four times faster than the version using alternation.

Avoid unnecessary backtracking

Perl's procedural regular expression engine (see Item 17) works by stepping through a compiled version of a pattern, in effect using it as if it were a little program trying to match pieces of text:

  • When you write a sequence, you are creating instructions that mean "try to match this, followed by that, followed by . . ."

  • When you write an alternation, you are creating instructions that mean "try to match this first; if that doesn't work, back up and try to match that; if that doesn't work . . ." and so on.

  • When you use a repetition operator like + or *, you are instructing the engine to "try to find as many of these in a row as you can."

Consider the pattern /\b(\w*t|\w*d)\b/, which looks for words ending in either t or d. Each time you use this pattern, the engine will look for a word boundary. It will then do the first alternation, looking for as many word characters in a row as possible. Then it looks for a t. Hmm—it won't find one, because it already read all the word characters. So it will have to back up a character. If that character is a t, that's great—now it can look for a word boundary, and then it's all done. Otherwise, if there was no match, the engine keeps backing up and trying to find a t. If it runs all the way back to the initial word boundary, then the engine tries the second half of the alternation, looking for a d at the end.

You can see that this is a very complicated process. Well, the regular expression engine is meant to do complicated work, but this particular pattern makes that work much more complicated than it has to be.

An obvious shortcoming is that if the engine starts out at the beginning of a word that ends in d, it has to go all the way to the end and back searching fruitlessly for a t before it even starts looking for a d. We can definitely fix this. Let's get rid of the alternation:

/\b\w*[td]\b/

This is an improvement. Now, the engine will scan the length of the word only once, regardless of whether it ends in t, d, or something else.

We still haven't addressed the general backtracking issue. Notice that there is no need for the regular expression engine to continue backtrack-ing more than a single character back from the end of a word. If that character isn't a t or d, there's no point in continuing, because even if we did find one earlier in the string it wouldn't be at the end of the word.

There's no way to force Perl to change this backtracking behavior (at least not so far as I know), but you can approach the problem in a slightly different manner. Ask yourself: "If I were looking for words ending in t or d, what would I be looking for?" More than likely, you'd be looking at the ends of words. You'd be looking for something like:

/[td]\b/

Now, this is interesting. This little regular expression does everything that the other two do, even though it may not be obvious at first. But think about it. To the left of the t or d there will be zero or more \w characters. We don't care what sort of \w characters they are; so, tautologically if you will, once we have a t or d to the left of a word boundary, we have a word ending in t or d.

Naturally, this little regular expression runs much faster than either of the two above—about twice as fast, more or less. Obviously there's not much backtracking, because the expression matches only a single character!

Use memory-free parentheses

If you are using parentheses for grouping alone, you won't need a copy of what the parentheses matched. You can save the time required to make the copy by using Perl's memory-free parentheses:

Use memory-free parentheses (?:…) to speed up pattern matches.

There's no point in memorizing the contents of the inner parentheses in this pattern, so if you want to save a little time, use memory-free parentheses.

($host) = m/(\w+(\.\w+)*)/;

Find hostname-like thing (foo.bar.com) and put it into $host.

($host) = m/(\w+(?:\.\w+)*)/;

Same thing, but no memory for the inner parens.

The time saved isn't generally all that great, and memory-free parentheses don't exactly improve readability. But sometimes, every little bit of speed helps!

See Item 16 for more about memory-free parentheses.

Benchmark your regular expressions

As with many other things in Perl, one of the best ways to determine how to make a pattern match run quickly is to write several alternative implementations and benchmark them.

Let's use the Benchmark module (see Item 37) to see how much of a difference those memory-free parentheses above really make:

Time your regular expressions with Benchmark.

use Benchmark; 
@data = <>;

Read some data. (I used 1,000 lines of an HTTP access log.)

my $host; 
timethese (100, 
 { mem => q{ 
  for (@data) { 
   ($host) = m/(\w+(\.\w+)+)/; } 
  },

The test code goes in an eval string (see Item 54).

 memfree => q{ 
  for (@data) { 
   ($host) = m/(\w+(?:\.\w+)+)/; } 
  } 
 } 
);

Some more test code.

The results:

Benchmark: timing 100 iterations of mem, memfree... 
       mem: 12 secs (12.23 usr  0.00 sys = 12.23 cpu) 
   memfree: 11 secs (10.64 usr  0.00 sys = 10.64 cpu)

Not bad: it takes about 15 percent longer to run the version without the memory-free parentheses.


     Python   SQL   Java   php   Perl 
     game development   web development   internet   *nix   graphics   hardware 
     telecommunications   C++ 
     Flash   Active Directory   Windows