Item 20: Avoid using regular expressions for simple string operations.

Regular expressions are wonderful, but they are not the most efficient way to perform all string operations. Although regular expressions can be used to perform string operations like extracting substrings and translating characters, they are better suited for more complex operations. Simple string operations in Perl should be handled by special-purpose operators like index, rindex, substr, and tr///.

Bear in mind that all regular expression matches, even simple ones, have to manipulate memory variables such as $1, $2, and $&. If all you need is a comparison or a substring, manipulating memory variables is a waste of time. For this reason, if no other, you should prefer special-purpose string operators to regular expression matches whenever possible.

Compare strings with string comparison operators

If you have two strings to compare for equality, use string comparison operators, not regular expressions:

do_it() if $answer eq 'yes';

Fastest way to compare strings for equality.

The string comparison operators are at least twice as fast as regular expression matches:

do_it() if $answer =~ /^yes$/;

Slower.

do_it() if $answer =~ /yes/;

Even slower, and probably wrong: without anchors it also matches strings such as "my eyes hurt".

A few more complex comparisons are also faster if you avoid regular expressions:

do_it() if lc($answer) eq 'yes';

Faster.

do_it() if $answer =~ /^yes$/i;

Slower.
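The timing claims above are easy to check on your own machine. The following is a minimal sketch using the core Benchmark module; the sub names and the sample value of $answer are illustrative assumptions, and the ratios you see will depend on your Perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $answer = 'yes';

# One comparison per sub; the names are illustrative only.
sub by_eq      { return $answer eq 'yes' }
sub by_regex   { return $answer =~ /^yes$/ }
sub by_lc_eq   { return lc($answer) eq 'yes' }
sub by_regex_i { return $answer =~ /^yes$/i }

# Run each candidate for at least one CPU second and print a
# table of rates and relative speeds.
cmpthese(-1, {
    'eq'       => \&by_eq,
    'regex'    => \&by_regex,
    'lc + eq'  => \&by_lc_eq,
    'regex /i' => \&by_regex_i,
});
```

Because each sub does so little work, the cost of the sub call itself is part of every measurement, so the real gap between the operators is somewhat larger than the table suggests.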

Find substrings with index and rindex

The index operator locates an occurrence of a shorter string in a longer string. The rindex operator locates the rightmost occurrence, still counting character positions from the left:

$_ = "It's a Perl Perl Perl Perl World.";
$left = index $_, 'Perl'; 
$right = rindex $_, 'Perl';

7

22

The index operator is very fast—it uses a Boyer-Moore algorithm for its searches. Perl will also compile index-like regular expressions into Boyer-Moore searches. You could write:

Continued from above:

/Perl/; 
$left = length $`;

Slow, with a gratuitous use of $`.

or, avoiding the use of $` (see Item 16 and Item 21):

$perl = 'Perl'; 
/$perl/og; 
$left = pos($_) - length($perl);

Yes, the pos operator does have uses. This is still slow, though.

However, the overhead associated with using a regular expression match makes index several times faster than m// for short strings.
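The index operator also takes an optional third argument, the position at which to start searching, and returns -1 when nothing is found. That is enough to walk through every occurrence without a regular expression. A minimal sketch follows; the helper name all_positions is hypothetical.

```perl
use strict;
use warnings;

# Return the position of every non-overlapping occurrence of
# $needle in $haystack. The helper name is illustrative only.
sub all_positions {
    my ($haystack, $needle) = @_;
    return if $needle eq '';    # an empty needle would never advance
    my @found;
    my $pos = 0;
    while (($pos = index($haystack, $needle, $pos)) >= 0) {
        push @found, $pos;
        $pos += length $needle; # resume the search after this match
    }
    return @found;
}

my @where = all_positions("It's a Perl Perl Perl Perl World.", 'Perl');
print "@where\n";   # prints "7 12 17 22"
```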

Extract and modify substrings with substr

The substr operator extracts a portion of a string, given a starting position and (optional) length:

$str = "It's a Perl World.";
 
print substr($str, 7, 4), "\n"; 
print substr($str, 7), "\n";

Perl

Perl World.

The substr operator is much faster than a regular expression written to do the same thing (also see Item 19):

Continued from above:

print +($str =~ /^.{7}(.{4})/), "\n";

Perl — but yuck!

The nifty thing about substr is that you can make replacements with it by using it on the left side of an expression. The text referred to by substr is replaced by the string value of the right-hand side:

Continued from above:

substr($str, 7, 4) = "Small";

It's a Small World.
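Two other forms of substr are worth knowing. A fourth argument supplies the replacement directly, and the call returns the text that was replaced; a negative offset counts back from the end of the string. A short sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $str = "It's a Perl World.";

# Four-argument substr replaces in place and returns the old text.
my $removed = substr($str, 7, 4, 'Small');
print "$removed\n";   # prints "Perl"
print "$str\n";       # prints "It's a Small World."

# A negative offset counts from the end of the string.
my $tail = substr($str, -6);
print "$tail\n";      # prints "World."
```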

You can combine index and substr to perform s///-like substitutions, but in this case s/// is usually faster:

$str = "It's a Perl World.";
 
substr($str, index($str, 'Perl'), 4) = 
  "Mad Mad Mad Mad";

It's a Mad Mad Mad Mad World.

$str =~ s/Perl/Mad Mad Mad Mad/;

Less noisy, and probably faster.
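If you do combine index and substr, check the result of index first. It returns -1 when the substring is absent, and substr treats -1 as "the last character", so an unchecked call would quietly overwrite the end of the string. The sketch below wraps the check in a hypothetical helper, replace_first. One thing in its favor: the search text is taken literally, so there are no regular expression metacharacters to escape.

```perl
use strict;
use warnings;

# Replace the first literal occurrence of $old in $str with $new.
# The helper name is illustrative, not part of Perl.
sub replace_first {
    my ($str, $old, $new) = @_;
    my $pos = index $str, $old;
    return $str if $pos < 0;                  # not found: leave unchanged
    substr($str, $pos, length $old) = $new;   # lvalue substr does the splice
    return $str;
}

print replace_first("It's a Perl World.", 'Perl', 'Mad Mad Mad Mad'), "\n";
# prints "It's a Mad Mad Mad Mad World."

print replace_first('2 + 2', '+', 'plus'), "\n";
# prints "2 plus 2"; the "+" needs no escaping
```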

You can also do other lvalue-ish things with a substr, such as binding it to substitutions or tr///:

$str = "It's a Perl World.";
 
substr($str, index($str, 'Perl'), 4) 
  =~ tr/a-z/A-Z/;

It's a PERL World.

Transliterate characters with tr///

Although it is possible to perform character-level substitutions with regular expressions, the tr/// operator provides a much more efficient mechanism:

Use tr///, not regular expressions, to transliterate characters.

$_ = "secret message";
 
tr/n-za-m/a-z/;

frperg zrffntr—string "rot13" encoded.

@h{'n'..'z','a'..'m'} = ('a'..'z'); 
s/([a-z])/$h{$1}/g;

Over 20 times slower, not counting initializing the hash!
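The same idea extends to mixed-case text by listing both ranges. The sketch below wraps it in a small sub, whose name rot13 is our own choice, that works on a copy of its argument. Because rot13 is its own inverse, applying the sub twice returns the original string.

```perl
use strict;
use warnings;

# Rotate letters by 13 places, leaving everything else alone.
# Works on a copy, so the caller's string is not modified.
sub rot13 {
    my ($text) = @_;
    $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $text;
}

print rot13('secret message'), "\n";          # prints "frperg zrffntr"
print rot13(rot13('secret message')), "\n";   # prints "secret message"
```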

The tr/// operator has other uses as well. It is the fastest way to count characters in a string, and it can also squeeze runs of repeated characters or delete characters outright:

$digits = tr/0-9//;

Count digits in $_, fast.

tr/ \n\r\t\f/ /s;

Repeated whitespace becomes single space.

$_ = "Totally\r\nDOS\r\n"; 
tr/\r//d;

Convert DOS text file to Unix.
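The examples above all operate on $_. To apply the same operations to a named variable, bind tr/// with =~. Here is a sketch with three hypothetical helper subs, each working on a copy of its argument:

```perl
use strict;
use warnings;

# Count digits. With an empty replacement list and no /d,
# tr/// only counts and leaves the string unchanged.
sub count_digits {
    my ($text) = @_;
    return ($text =~ tr/0-9//);
}

# Turn each run of whitespace into a single space.
sub squeeze_space {
    my ($text) = @_;
    $text =~ tr/ \n\r\t\f/ /s;
    return $text;
}

# Delete carriage returns, converting DOS line endings to Unix.
sub dos_to_unix {
    my ($text) = @_;
    $text =~ tr/\r//d;
    return $text;
}

print count_digits('a1b22c333'), "\n";            # prints "6"
print squeeze_space("too   many\t\tspaces"), "\n"; # prints "too many spaces"
```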