Regex-Related Perlisms







A variety of general Perl concepts pertain to our study of regular expressions. The next few sections discuss:

  • Context: An important concept in Perl is that many functions and operators respond to the context they're used in. For example, Perl expects a scalar value as the conditional of a while loop, but a list of values as the arguments to a print statement. Since Perl allows expressions to "respond" to the context in which they're used, identical expressions in each case might produce wildly different results.

  • Dynamic Scope: Most programming languages support the concept of local and global variables, but Perl provides an additional twist with something known as dynamic scoping. Dynamic scoping temporarily "protects" a global variable by saving a copy of its value and automatically restoring it later. It's an intriguing concept that's important for us because it affects $1 and other match-related variables.

Expression Context

The notion of context is important throughout Perl, and in particular, to the match operator. An expression might find itself in one of three contexts, list, scalar, or void, indicating the type of value expected from the expression. Not surprisingly, a list context is one where a list of values is expected of an expression. A scalar context is one where a single value is expected. These two are very common and of great interest to our use of regular expressions. Void context is one in which no value is expected.

Consider the two assignments:

    $s = expression one;
    @a = expression two;

Because $s is a simple scalar variable (it holds a single value, not a list), it expects a simple scalar value, so the first expression, whatever it may be, finds itself in a scalar context. Similarly, because @a is an array variable and expects a list of values, the second expression finds itself in a list context. Even though the two expressions might be exactly the same, they might return completely different values, and cause completely different side effects while they're at it. Exactly what happens depends on each expression.

For example, the localtime function, if used in a list context, returns a list of values representing the current year, month, date, hour, etc. But if used in a scalar context, it returns a textual version of the current time along the lines of 'Mon Jan 20 22:05:15 2003'.

As another example, an I/O operator such as <MYDATA> returns the next line of the file in a scalar context, but returns a list of all (remaining) lines in a list context.

Like localtime and the I/O operator, many Perl constructs respond to their context. The regex operators do as well: the match operator m/⋯/, for example, sometimes returns a simple true/false value, and sometimes a list of certain match results. All the details are found later in this chapter.
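As a small illustration of context sensitivity, the following sketch calls localtime with the same argument in both contexts. The variable names are chosen here purely for illustration, and the exact text of the scalar result depends on the local time zone.

```perl
use strict;
use warnings;

my @fields = localtime(0);   # list context: a list of numeric fields
my $text   = localtime(0);   # scalar context: one human-readable string

print "list context gave ", scalar(@fields), " values\n";
print "scalar context gave '$text'\n";
```

The expression on the right is identical in both assignments; only the kind of variable on the left differs.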

Contorting an expression

Not all expressions are natively context-sensitive, so Perl has rules about what happens when a general expression is used in a context that doesn't exactly match the type of value the expression normally returns. To make the square peg fit into a round hole, Perl "contorts" the value to make it fit. If a scalar value is returned in a list context, Perl makes a list containing the single value on the fly. Thus, @a = 42 is the same as @a = (42).

On the other hand, there's no general rule for converting a list to a scalar. If a literal list is given, such as with

    $var = ($this, &is, 0xA, 'list');

the comma-operator returns the last element, 'list', for $var. If an array is given, as with $var = @array, the length of the array is returned.
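The three cases just described can be checked with a short sketch. The variable names and the numeric list are illustrative assumptions, not taken from the text, and the "void" warning category is switched off around the literal list only because Perl would otherwise warn about the discarded constants.

```perl
use strict;
use warnings;

my @a = 42;                  # scalar value in list context: same as (42)

my $last;
{
    no warnings 'void';      # the discarded constants would otherwise warn
    $last = (10, 20, 30);    # comma operator in scalar context: last element
}

my @array = ('x', 'y', 'z');
my $count = @array;          # array in scalar context: its length

print "a has ", scalar(@a), " element; last=$last; count=$count\n";
# prints "a has 1 element; last=30; count=3"
```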

Some words used to describe how other languages deal with this issue are cast, promote, coerce, and convert, but I feel they are a bit too consistent (boring?) to describe Perl's attitude in this respect, so I use "contort."

Dynamic Scope and Regex Match Effects

Perl's two types of storage (global and private variables) and its concept of dynamic scoping are important to understand in their own right, but are of particular interest to our study of regular expressions because of how after-match information is made available to the rest of the program. The next sections describe these concepts, and their relation to regular expressions.

Global and private variables

On a broad scale, Perl offers two types of variables: global and private. Private variables are declared using my(⋯). Global variables are not declared, but just pop into existence when you use them. Global variables are always visible from anywhere and everywhere within the program, while private variables are visible, lexically, only to the end of their enclosing block. That is, the only Perl code that can directly access the private variable is the code that falls between the my declaration and the end of the block of code that encloses the my.

The use of global variables is normally discouraged, except for special cases, such as the myriad of special variables like $1, $_, and @ARGV. Regular user variables are global unless declared with my, even if they might "look" private. Perl allows the names of global variables to be partitioned into groups called packages, but the variables are still global. A global variable $Debug within the package Acme::Widget has a fully qualified name of $Acme::Widget::Debug, but no matter how it's referenced, it's still the same global variable. If you use strict;, all (non-special) globals must either be referenced via fully qualified names, or via a name declared with our (our declares a name, not a new variable; see the Perl documentation for details).
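A minimal sketch of these rules follows. It reuses the package name and $Debug variable from the paragraph above, while the $private variable and the values assigned are assumptions made only for the example.

```perl
use strict;
use warnings;

package Acme::Widget;
our $Debug = 0;                 # declares a name for the global $Acme::Widget::Debug

package main;

{
    my $private = 'visible only until the end of this block';
    $Acme::Widget::Debug = 1;   # fully qualified name: allowed under strict
}
# $private is out of reach here, but the global keeps its new value.
print "Debug is $Acme::Widget::Debug\n";   # prints "Debug is 1"
```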

Dynamically scoped values

Dynamic scoping is an interesting concept that few programming languages provide. We'll see the relevance to regular expressions soon, but in a nutshell, you can have Perl save a copy of the value of a global variable that you intend to modify within a block, and restore the original copy automatically at the time when the block ends. Saving a copy is called creating a new dynamic scope, or localizing.

One reason that you might want to do this is to temporarily update some kind of global state that's maintained in a global variable. Let's say that you're using a package, Acme::Widget, and it provides a debugging flag via the global variable $Acme::Widget::Debug. You can temporarily ensure that debugging is turned on with code like:

    {
        local($Acme::Widget::Debug) = 1; # Ensure it's turned on
        # work with Acme::Widget while debugging is on
    }
    # $Acme::Widget::Debug is now back to whatever it had been before

It's that extremely ill-named function local that creates a new dynamic scope. Let me say up front that the call to local does not create a new variable. local is an action, not a declaration. Given a global variable, local does three things:

  • 1. Saves an internal copy of the variable's value

  • 2. Copies a new value into the variable (either undef, or a value assigned to the local)

  • 3. Slates the variable to have its original value restored when execution runs off the end of the block enclosing the local

This means that "local" refers only to how long any changes to the variable will last. The localized value lasts as long as the enclosing block is executing. Even if a subroutine is called from within that block, the localized value is seen. (After all, the variable is still a global variable.) The only difference from a non-localized global variable is that when execution of the enclosing block finally ends, the previous value is automatically restored.
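The save, replace, and restore steps can be watched in a short sketch. The variable $Level and the subroutine show are hypothetical names introduced only for this illustration.

```perl
use strict;
use warnings;

our $Level = 'outer';
sub show { return $Level }    # reads the global, wherever it is called from

my @seen;
push @seen, show();           # 'outer'
{
    local $Level = 'inner';   # old value saved, new value installed
    push @seen, show();       # 'inner': the called subroutine sees the localized value
}
push @seen, show();           # 'outer' again: restored when the block ended

print "@seen\n";              # prints "outer inner outer"
```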

An automatic save and restore of a global variable's value is pretty much all there is to local. For all the misunderstanding that has accompanied local, it's no more complex than the "Equivalent Meaning" snippet in the figure "The Meaning of local" illustrates.

As a matter of convenience, you can assign a value to local ($SomeVar), which is exactly the same as assigning to $SomeVar in place of the undef assignment. Also, the parentheses can be omitted to force a scalar context.

The Meaning of local

Normal Perl:

    {
        local($SomeVar); # save copy

        $SomeVar = 'My Value';
          ⋮
    }
    # Value automatically restored

Equivalent Meaning:

    {
        my $TempCopy = $SomeVar;
        $SomeVar = undef;

        $SomeVar = 'My Value';
          ⋮
        $SomeVar = $TempCopy;
    }

As a practical example, consider having to call a function in a poorly written library that generates a lot of "Use of uninitialized value" warnings. You use Perl's -w option, as all good Perl programmers should, but the library author apparently didn't. You are exceedingly annoyed by the warnings, but if you can't change the library, what can you do short of giving up -w altogether? Well, you could set a local value of $^W, the in-code warnings flag (the variable name ^W can be either the two characters, caret and 'W', or an actual control-W character):

    {
        local $^W = 0; # Ensure warnings are off.
        UnrulyFunction(⋯);
    }
    # Exiting the block restores the original value of $^W.

The call to local saves an internal copy of the value of the global variable $^W, whatever it might be. Then that same $^W receives the new value of zero that we immediately scribble in. When UnrulyFunction is executing, Perl checks $^W and sees the zero we wrote, so doesn't issue warnings. When the function returns, our value of zero is still in effect.

So far, everything appears to work just as if local isn't used. However, when the block is exited right after the subroutine returns, the original value of $^W is restored. Your change of the value was local, in time, to the life of the block. You'd get the same effect by making and restoring a copy yourself, as in the figure "The Meaning of local", but local conveniently takes care of it for you.

For completeness, let's consider what happens if I use my instead of local. Using my creates a new variable with an initially undefined value. It is visible only within the lexical block it is declared in (that is, visible only by the code written between the my and the end of the enclosing block). It does not change, modify, or in any other way refer to or affect other variables, including any global variable of the same name that might exist. The newly created variable is not visible elsewhere in the program, including from within UnrulyFunction. In our example snippet, the new $^W is immediately set to zero but is never again used or referenced, so it's pretty much a waste of effort. (While executing UnrulyFunction and deciding whether to issue warnings, Perl checks the unrelated global variable $^W.)
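The contrast between my and local can be sketched with an ordinary global variable. The names $Flag and peek are hypothetical and stand in for $^W and UnrulyFunction, which are left out here to keep the example self-contained.

```perl
use strict;
use warnings;

our $Flag = 'global';
sub peek { return $main::Flag }        # always reads the global variable

my @seen;
{
    my $Flag = 'private';              # a brand-new lexical; the global is untouched
    push @seen, peek();                # 'global'
}
{
    local $main::Flag = 'localized';   # a temporary value for the global itself
    push @seen, peek();                # 'localized'
}
push @seen, peek();                    # 'global'

print "@seen\n";                       # prints "global localized global"
```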

A better analogy: clear transparencies

A useful analogy for local is that it provides a clear transparency (like those used with an overhead projector) over a variable on which you scribble your own changes. You (and anyone else that happens to look, such as subroutines and signal handlers) will see the new values. They shadow the previous value until the point in time that the block is finally exited. At that point, the transparency is automatically removed, in effect, removing any changes that might have been made since the local.

This analogy is actually much closer to reality than saying "an internal copy is made." Using local doesn't actually make a copy, but instead puts your new value earlier in the list of those checked whenever a variable's value is accessed (that is, it shadows the original). Exiting a block removes any shadowing values added since the block started. Values are added manually, with local, but here's the whole reason we've been looking at localization: regex side-effect variables have their values dynamically scoped automatically.

Regex side effects and dynamic scoping

What does dynamic scoping have to do with regular expressions? A lot. A number of variables like $& (refers to the text matched) and $1 (refers to the text matched by the first parenthesized subexpression) are automatically set as a side effect of a successful match. They are discussed in detail in the next section. These variables have their value dynamically scoped automatically upon entry to every block.

To see the benefit of this design choice, realize that each call to a subroutine involves starting a new block, which means a new dynamic scope is created for these variables. Because the values before the block are restored when the block exits (that is, when the subroutine returns), the subroutine can't change the values that the caller sees.

As an example, consider:

    if ( m/(⋯)/ )
    {
         DoSomeOtherStuff();
         print "the matched text was $1.\n";
    }

Because the value of $1 is dynamically scoped automatically upon entering each block, this code snippet neither cares, nor needs to care, whether the function DoSomeOtherStuff changes the value of $1 or not. Any changes to $1 by the function are contained within the block that the function defines, or perhaps within a sub-block of the function. Therefore, they can't affect the value this snippet sees with the print after the function returns.
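A runnable version of this situation might look like the following sketch. The body of DoSomeOtherStuff is an assumption made for illustration (the text leaves it unspecified); it simply performs a match of its own so that it sets $1.

```perl
use strict;
use warnings;

# Hypothetical body: all that matters is that it performs its own match.
sub DoSomeOtherStuff {
    "inner text" =~ m/(\w+)/;           # sets $1 to 'inner' within the subroutine
    return $1;
}

my @out;
if ("outer text" =~ m/(\w+)/) {
    push @out, DoSomeOtherStuff();      # 'inner'
    push @out, $1;                      # still 'outer' after the call returns
}

print "@out\n";                         # prints "inner outer"
```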

The automatic dynamic scoping is helpful even when not so apparent:

    if ($result =~ m/ERROR=(.*)/) {
       warn "Hey, tell $Config{perladmin} about $1!\n";
    }

The standard library module Config defines an associative array %Config, of which the member $Config{perladmin} holds the email address of the local Perlmaster. This code could be very surprising if $1 were not automatically dynamically scoped, because %Config is actually a tied variable. That means any reference to it involves a behind-the-scenes subroutine call, and the subroutine within Config that fetches the appropriate value when $Config{⋯} is used invokes a regex match. That match lies between your match and your use of $1, so if $1 were not dynamically scoped, it would be destroyed before you used it. As it is, any changes in $1 during the $Config{⋯} processing are safely hidden by dynamic scoping.

Dynamic scoping versus lexical scoping

Dynamic scoping provides many rewards if used effectively, but haphazard dynamic scoping with local can create a maintenance nightmare, as readers of a program find it difficult to understand the increasingly complex interactions among the lexically dispersed uses of local, subroutine calls, and references to localized variables.

As I mentioned, the my(⋯) declaration creates a private variable with lexical scope. A private variable's lexical scope is the opposite of a global variable's global scope, but it has little to do with dynamic scoping (except that you can't local the value of a my variable). Remember, local is just an action, while my is both an action and, importantly, a declaration.

Special Variables Modified by a Match

A successful match or substitution sets a variety of global, read-only variables that are always automatically dynamically scoped. These values never change if a match attempt is unsuccessful, and are always set when a match is successful. When appropriate, they are set to the empty string (a string with no characters in it), or undefined (a "no value" value, similar to, yet testably distinct from, an empty string). The table "Example Showing After-Match Special Variables" shows examples.

In more detail, here are the variables set after a match:


$&

A copy of the text successfully matched by the regex. This variable (along with $` and $', described next) is best avoided for performance reasons. (See the discussion on page 356.) $& is never undefined after a successful match, although it can be an empty string.

Example Showing After-Match Special Variables

After the match of

                                    12               2 3   4     4 31
    "Pi is 3.14159, roughly" =~ m/\b((tasty|fattening)|(\d+(\.\d*)?))\b/;

the following special variables are given the values shown.

    Variable  Meaning                                        Value
    $`        Text before match                              Pi•is•
    $&        Text matched                                   3.14159
    $'        Text after match                               ,•roughly
    $1        Text matched within 1st set of parentheses     3.14159
    $2        Text matched within 2nd set of parentheses     undef
    $3        Text matched within 3rd set of parentheses     3.14159
    $4        Text matched within 4th set of parentheses     .14159
    $+        Text from highest-numbered $1, $2, etc.        .14159
    $^N       Text from most recently closed $1, $2, etc.    3.14159
    @-        Array of match-start indices into target text  (6, 6, undef, 6, 7)
    @+        Array of match-end indices into target text    (13, 13, undef, 13, 13)
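The values in the table can be reproduced with a short script. This is only a sketch: the variable names are illustrative assumptions, and $`, $&, and $' are deliberately left out because of the performance concern mentioned above.

```perl
use strict;
use warnings;

my $target = "Pi is 3.14159, roughly";
my ($one, $two, $three, $four, $highest, $recent, $start, $end);

if ($target =~ m/\b((tasty|fattening)|(\d+(\.\d*)?))\b/) {
    ($one, $two, $three, $four) = ($1, $2, $3, $4);
    $highest = $+;                      # highest-numbered group that took part
    $recent  = $^N;                     # most recently closed group
    ($start, $end) = ($-[0], $+[0]);    # offsets of the overall match
}

print "\$1=$one \$3=$three \$4=$four\n";
print "\$2 is ", (defined $two ? "'$two'" : 'undef'), "\n";
print "\$+=$highest \$^N=$recent start=$start end=$end\n";
```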



$`

A copy of the target text in front of (to the left of) the match's start. When used in conjunction with the /g modifier, you might wish $` to be the text from the start of the match attempt, but it's the text from the start of the whole string, each time. $` is never undefined after a successful match.


$'

A copy of the target text after (to the right of) the successfully matched text. $' is never undefined after a successful match. After a successful match, the string "$`$&$'" is always a copy of the original target text. (Actually, if the original target is undefined, but the match successful (unlikely, but possible), "$`$&$'" would be an empty string, not undefined. This is the only situation where the two differ.)


$1, $2, $3, etc.

The text matched by the 1st, 2nd, 3rd, etc., set of capturing parentheses. (Note that $0 is not included here; it is a copy of the script name and not related to regular expressions.) These are guaranteed to be undefined if they refer to a set of parentheses that doesn't exist in the regex, or to a set that wasn't actually involved in the match.

These variables are available after a match, including in the replacement operand of s/⋯/⋯/. They can also be used within the code parts of an embedded-code or dynamic-regex construct (☞327). Otherwise, it makes little sense to use them within the regex itself. (That's what \1 and friends are for.) See "Using $1 Within a Regex?" on page 303.

Note the difference between (\w+) and (\w)+. With the former, $1 holds all the text matched by \w+. With the latter, the parentheses capture anew on each iteration of the plus, so $1 ends up holding only the last character matched. Also, note the difference between (x)? and (x?). With the former, the parentheses and what they enclose are optional, so $1 would be either 'x' or undefined. But with (x?), the parentheses enclose a match; what is optional are the contents. If the overall regex matches, the contents matches something, although that something might be the nothingness x? allows. Thus, with (x?) the possible values of $1 are 'x' and an empty string. The following table shows some examples:

    Sample Match          Resulting $1    Sample Match              Resulting $1
    "::" =~ m/:(A?):/     empty string    "::" =~ m/:(\w*):/        empty string
    "::" =~ m/:(A)?:/     undefined       "::" =~ m/:(\w)*:/        undefined
    ":A:" =~ m/:(A?):/    A               ":Word:" =~ m/:(\w*):/    Word
    ":A:" =~ m/:(A)?:/    A               ":Word:" =~ m/:(\w)*:/    d


When adding parentheses just for capturing, as was done here, the decision of which to use is dependent only upon the semantics you want. In these examples, since the added parentheses have no effect on the overall match (they all match the same text), the only difference among them is in the side effect of how $1 is set.
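The distinction between an undefined $1 and an empty one is easy to test for directly. In this sketch, the helper describe is a hypothetical name introduced only to label the three possible outcomes the way the table does.

```perl
use strict;
use warnings;

# Hypothetical helper: report the state of $1 after a match.
sub describe {
    my ($string, $regex) = @_;
    return 'no match'     unless $string =~ $regex;
    return 'undefined'    unless defined $1;
    return 'empty string' if $1 eq '';
    return $1;
}

print describe('::',     qr/:(A?):/),   "\n";   # empty string
print describe('::',     qr/:(A)?:/),   "\n";   # undefined
print describe(':Word:', qr/:(\w*):/),  "\n";   # Word
print describe(':Word:', qr/:(\w)*:/),  "\n";   # d
```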


$+

This is a copy of the highest numbered $1, $2, etc. explicitly set during the match. This might be useful after something like

    $url =~ m{
       href \s* = \s*     # Match the "href = " part, then the value ...
       (?: "([^"]*)"      # a double-quoted value, or ...
         | '([^']*)'      # a single-quoted value, or ...
         | ([^'"<>]+) )   # an unquoted value.
    }ix;

to access the value of the href. Without $+, you would have to check each of $1, $2, and $3 and use the one that's not undefined.

If there are no capturing parentheses in the regex (or none are used during the match), it becomes undefined.
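A sketch of how $+ saves the three-way check follows. The subroutine name href_value and the sample tags are assumptions made for illustration; the regex is the one shown above, with its comments removed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href value whichever quoting style was used.
sub href_value {
    my ($tag) = @_;
    return undef unless $tag =~ m{
        href \s* = \s*
        (?: "([^"]*)"
          | '([^']*)'
          | ([^'"<>]+) )
    }ix;
    return $+;    # whichever of $1, $2, or $3 took part in the match
}

print href_value('<a href="one.html">'), "\n";   # one.html
print href_value("<a href='two.html'>"), "\n";   # two.html
print href_value('<a href=three.html>'), "\n";   # three.html
```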


$^N

A copy of the most-recently-closed $1, $2, etc. explicitly set during the match (i.e., the $1, $2, etc., associated with the final closing parenthesis). If there are no capturing parentheses in the regex (or none used during the match), it becomes undefined. A good example of its use is given starting on page 344.


@- and @+

These are arrays of starting and ending offsets (string indices) into the target text. They might be a bit confusing to work with, due to their odd names. The first element of each refers to the overall match. That is, the first element of @-, accessed with $-[0], is the offset from the beginning of the target string to where the match started. Thus, after

    $text = "Version 6 coming soon?";

and a successful match of the '6' in that string (with m/\d+/, for example), the value of $-[0] is 8, indicating that the match started eight characters into the target string. (In Perl, indices are counted starting at zero.)

The first element of @+, accessed with $+[0], is the offset to the end of the match. With this example, it contains 9, indicating that the overall match ended nine characters from the start of the string. So, using them together, substr($text, $-[0], $+[0] - $-[0]) is the same as $& if $text has not been modified, but doesn't have the performance penalty that $& has (☞356). Here's an example showing a simple use of @-:

    1 while $line =~ s/\t/' ' x (8 - $-[0] % 8)/e;

Given a line of text, it replaces tabs with the appropriate number of spaces. (The snippet does not produce correct results with a wide character, which is one character but takes up two spaces, nor with some Unicode renditions of accented characters like à (☞107).)

Subsequent elements of each array are the starting and ending offsets for captured groups. The pair $-[1] and $+[1] are the offsets into the target text where $1 was taken, $-[2] and $+[2] for $2, and so on.
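The offsets and the tab-expansion idiom can be tried together in a short sketch. The use of m/\d+/ and the sample $line are assumptions chosen for illustration; the substitution itself is the one shown above.

```perl
use strict;
use warnings;

my $text = "Version 6 coming soon?";
my ($start, $end, $matched);
if ($text =~ m/\d+/) {
    ($start, $end) = ($-[0], $+[0]);                   # 8 and 9
    $matched = substr($text, $start, $end - $start);   # '6', without using $&
}
print "start=$start end=$end matched=$matched\n";

# Expand tabs to eight-column tab stops, one tab per iteration.
my $line = "a\tbc\td";
1 while $line =~ s/\t/' ' x (8 - $-[0] % 8)/e;
print length($line), "\n";                             # prints 17
```

Each pass of the loop replaces the leftmost remaining tab, so $-[0] always reflects the column position after earlier tabs have already been expanded.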


$^R

This variable holds the resulting value of the most recently executed embedded-code construct, except that an embedded-code construct used as the if of a (? if then | else) conditional does not set $^R. When used within a regex (within the code parts of embedded-code and dynamic-regex constructs; ☞327), it is automatically localized to each part of the match, so values of $^R set by code that gets "unmatched" due to backtracking are properly forgotten. Put another way, it has the "most recent" value with respect to the match path that got the engine to the current location.

When a regex is applied repeatedly with the /g modifier, each iteration sets these variables afresh. That's why, for instance, you can use $1 within the replacement operand of s/⋯/⋯/g and have it represent a new slice of text with each match.

Using $1 within a regex?

The Perl man page makes a concerted effort to point out that \1 is not available as a backreference outside of a regex. (Use the variable $1 instead.) The variable $1 refers to a string of static text matched during some previously completed successful match. On the other hand, \1 is a true regex metacharacter that matches text similar to that matched within the first parenthesized subexpression at the time that the regex-directed NFA reaches the \1. What \1 matches might change over the course of an attempt as the NFA tracks and backtracks in search of a match.

The opposite question is whether $1 and other after-match variables are available within a regex operand. They are commonly used within the code parts of embedded-code and dynamic-regex constructs (☞327), but otherwise make little sense within a regex. A $1 appearing in the "regex part" of a regex operand is treated exactly like any other variable: its value is interpolated before the match or substitution operation even begins. Thus, as far as the regex is concerned, the value of $1 has nothing to do with the current match, but rather is left over from some previous match.
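The difference can be seen by applying two nearly identical regexes to the same string. This is a sketch with made-up sample strings: the first match exists only to leave 'b' behind in $1.

```perl
use strict;
use warnings;

"abc" =~ m/(b)/ or die "setup match failed";   # leaves $1 holding 'b'

# $1 is interpolated before the match begins, so this regex is really (\w)b.
my $with_variable = "xx yb" =~ m/(\w)$1/ ? $1 : undef;   # matches 'yb'; $1 is now 'y'

# \1 is a backreference resolved during the match: a word character repeated.
my $with_backref  = "xx yb" =~ m/(\w)\1/ ? $1 : undef;   # matches 'xx'; $1 is now 'x'

print "$with_variable $with_backref\n";   # prints "y x"
```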


