Matching Text with Regular Expressions

Perl uses regular expressions in many ways, the simplest being to check if a regex matches text (or some part thereof) held in a variable. This snippet checks the string held in variable $reply and reports whether it contains only digits:

    if ($reply =~ m/^[0-9]+$/) {
        print "only digits\n";
    } else {
        print "not only digits\n";
    }

The mechanics of the first line might seem a bit strange: the regular expression is ^[0-9]+$, while the surrounding m/⋯/ tells Perl what to do with it. The m means to attempt a regular expression match, while the slashes delimit the regex itself. The preceding =~ links m/⋯/ with the string to be searched, in this case the contents of the variable $reply.

In many situations the m is optional, so this example can also appear as $reply =~ /^[0-9]+$/, which some readers with past Perl experience may find to be more natural. Personally, I feel the m is descriptive, so I tend to use it.

Don't confuse =~ with = or ==. The operator == tests whether two numbers are the same. (The operator eq, as we will soon see, is used to test whether two strings are the same.) The = operator is used to assign a value to a variable, as with $celsius = 20. Finally, =~ links a regex search with the target string to be searched. In the example, the search is m/^[0-9]+$/ and the target is $reply. Other languages approach this differently, and we'll see examples in the next chapter.
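As a small illustration of the four operators just described, here is a sketch that stores the outcome of each test as 1 or 0. The variable names other than $celsius are made up for this example and are not part of the chapter's program.

```perl
use strict;
use warnings;

my $celsius = 20;                                      # = assigns a value

my $same_number = ($celsius == 20.0)        ? 1 : 0;   # == compares as numbers
my $same_string = ($celsius eq "20.0")      ? 1 : 0;   # eq compares as strings
my $all_digits  = ($celsius =~ m/^[0-9]+$/) ? 1 : 0;   # =~ applies a regex

print "== : $same_number\n";
print "eq : $same_string\n";
print "=~ : $all_digits\n";
```

The numeric test succeeds because 20 and 20.0 are the same number, while the string test fails because the text "20" differs from the text "20.0".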

It might be convenient to read =~ as "matches," such that

    if ($reply =~ m/^[0-9]+$/)

becomes:

    if the text contained in the variable $reply matches the regex ^[0-9]+$,
    then ...

The whole result of $reply =~ m/^[0-9]+$/ is a true value if the regex ^[0-9]+$ matches the string held in $reply, a false value otherwise. The if uses this true or false value to decide which message to print.

Note that a test such as $reply =~ m/[0-9]+/ (the same as before except the wrapping caret and dollar have been removed) would be true if $reply contained at least one digit anywhere. The surrounding ^⋯$ ensures that the entire $reply contains only digits.
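The difference between the anchored and unanchored tests can be tried directly. The following is a minimal sketch; the helper names has_digit and only_digits and the sample strings are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub has_digit   { my ($text) = @_; return $text =~ m/[0-9]+/   ? 1 : 0; }
sub only_digits { my ($text) = @_; return $text =~ m/^[0-9]+$/ ? 1 : 0; }

for my $reply ("2024", "room 101", "abc") {
    printf "%-10s has_digit=%d only_digits=%d\n",
        "'$reply'", has_digit($reply), only_digits($reply);
}
```

Only the first sample passes both tests; "room 101" contains digits but is not made up entirely of them.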

Let's combine the last two examples. We'll prompt the user to enter a value, accept that value, and then verify it with a regular expression to make sure it's a number. If it is, we calculate and display the Fahrenheit equivalent. Otherwise, we issue a warning message:

    print "Enter a temperature in Celsius:\n";
    $celsius = <STDIN>; # this reads one line from the user
    chomp($celsius);          # this removes the ending newline from $celsius

    if ( $celsius =~ m/^[0-9]+$/) {
        $fahrenheit = ($celsius * 9 / 5) + 32; # calculate Fahrenheit
        print "$celsius C is $fahrenheit F\n";
    } else {
        print "Expecting a number, so I don't understand \"$celsius\".\n";
    }

Notice in the last print how we escaped the quotes to be printed, to distinguish them from the quotes that delimit the string? As with literal strings in most languages, there are occasions to escape some items, and this is very similar to escaping a metacharacter in a regex. The relationship between a string and a regex isn't quite as important with Perl, but is extremely important with languages like Java, Python, and the like. The section "A short aside: metacharacters galore" (☞ 44) discusses this in a bit more detail. (One notable exception is VB.NET, which requires '""' rather than '\"' to get a double quote into a string literal.)

If we put this program into the file c2f, we might run it and see:

    % perl -w c2f
    Enter a temperature in Celsius:
    22
    22 C is 71.599999999999994316 F

Oops. As it turns out (at least on some systems), Perl's simple print is not always so good when it comes to floating-point numbers.

I don't want to get bogged down describing all the details of Perl in this chapter, so I'll just say without further comment that you can use printf ("print formatted") to make this look better:

    printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;

The printf function is similar to the C language's printf, or the format of Pascal, Tcl, elisp, and Python. It doesn't change the values of the variables, but merely how they are displayed. The result is now much nicer:

    Enter a temperature in Celsius:
    22
    22.00 C is 71.60 F
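To see that formatting affects only what is displayed, here is a minimal sketch using sprintf, which formats like printf but returns the text instead of printing it. The variable $shown is a name chosen for this illustration.

```perl
use strict;
use warnings;

my $celsius    = 22;
my $fahrenheit = ($celsius * 9 / 5) + 32;

# sprintf formats like printf, but returns the text instead of printing it.
my $shown = sprintf("%.2f", $fahrenheit);

print "formatted: $shown\n";
# The variable itself still holds the value that was computed.
print "unchanged: ", ($fahrenheit == ($celsius * 9 / 5) + 32 ? "yes" : "no"), "\n";
```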

Toward a More Real-World Example

Let's extend this example to allow negative and fractional temperature values. The math part of the program is fine: Perl normally makes no distinction between integers and floating-point numbers. We do, however, need to modify the regex to let negative and floating-point values pass. We can insert a leading -? to allow a leading minus sign. In fact, we may as well make that [-+]? to allow a leading plus sign, too.

To allow an optional decimal part, we add (\.[0-9]*)?. The escaped dot matches a literal period, so \.[0-9]* matches a period followed by any number of optional digits. Since \.[0-9]* is enclosed by (⋯)?, the whole subexpression becomes optional. Putting this all together, we get

    if ($celsius =~ m/^[-+]?[0-9]+(\.[0-9]*)?$/) {

as our check line. It allows numbers such as 32, -3.723, and +98.6. It is actually not quite perfect: it doesn't allow a number that begins with a decimal point (such as .357). Of course, the user can just add a leading zero to allow it to match (e.g., 0.357), so I don't consider it a major shortcoming. This floating-point problem can have some interesting twists, and I look at it in detail in Chapter 5 (☞ 194).
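A quick way to get a feel for what this expression accepts is to run it over a handful of candidates. The sketch below wraps the regex ^[-+]?[0-9]+(\.[0-9]*)?$ in a helper called looks_like_number, a hypothetical name used only here; the sample inputs are likewise made up.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the number check discussed in the text.
sub looks_like_number {
    my ($text) = @_;
    return $text =~ m/^[-+]?[0-9]+(\.[0-9]*)?$/ ? 1 : 0;
}

for my $candidate ("32", "-3.723", "+98.6", "7.", ".357", "1.2.3", "abc") {
    printf "%-8s %s\n", $candidate,
        looks_like_number($candidate) ? "accepted" : "rejected";
}
```

Note that "7." is accepted because the digits after the period are optional, while ".357" is rejected because at least one digit is required before it.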

Side Effects of a Successful Match

Let's extend the example further to allow someone to enter a value in either Fahrenheit or Celsius. We'll have the user append a C or F to the temperature entered. To let this pass our regular expression, we can simply add [CF] after the expression to match a number, but we still need to change the rest of the program to recognize which kind of temperature was entered, and to compute the other.

In Chapter 1, we saw how some versions of egrep support \1, \2, \3, etc. as metacharacters to refer to the text matched by parenthesized subexpressions earlier within the regex (☞ 21). Perl and most other modern regex-endowed languages support these as well, but also provide a way to refer to the text matched by parenthesized subexpressions from code outside of the regular expression, after a match has been successfully completed.

We'll see examples of how other languages do this in the next chapter (☞ 137), but Perl provides the access via the variables $1, $2, $3, etc., which refer to the text matched by the first, second, third, etc., parenthesized subexpression. As odd as it might seem, these are variables. The variable names just happen to be numbers. Perl sets them every time the application of a regex is successful.

To summarize, use the metacharacter \1 within the regular expression to refer to some text matched earlier during the same match attempt, and use the variable $1 in subsequent code to refer to that same text after the match has been successfully completed.
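The two forms can be seen side by side in a doubled-word check. This is only a sketch: the helper name doubled_word and the sample sentence are invented for the illustration, and the regex is a simplified one that considers ASCII letters only.

```perl
use strict;
use warnings;

# \1 inside the regex refers to what the first parentheses matched
# earlier in the same match attempt; $1 is read afterwards, in code.
sub doubled_word {
    my ($string) = @_;
    if ($string =~ m/\b([a-z]+) +\1\b/i) {
        return $1;
    }
    return;   # no doubled word found
}

my $text = "this is the the only copy";   # made-up sample text
print "doubled word: ", doubled_word($text), "\n";
```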

To keep the example uncluttered and focus on what's new, I'll remove the fractional-value part of the regex for now, but we'll return to it again soon. So, to see $1 in action, compare:

    $celsius =~ m/^[-+]?[0-9]+[CF]$/
    $celsius =~ m/^([-+]?[0-9]+)([CF])$/

Do the added parentheses change the meaning of the expression? Well, to answer that, we need to know whether they provide grouping for star or other quantifiers, or provide an enclosure for |. The answer is no on both counts, so what matches remains unchanged. However, they do enclose two subexpressions that match "interesting" parts of the string we are checking. As the figure "Capturing parentheses" illustrates, $1 will receive the number entered, and $2 will receive the C or F entered. Referring to the flowchart of the temperature-conversion program's logic flow, we see that this allows us to easily decide how to proceed after the match.

Figure: Capturing parentheses

Figure: Temperature-conversion program's logic flow

Example: Temperature-conversion program

    print "Enter a temperature (e.g., 32F, 100C):\n";
    $input = <STDIN>; # This reads one line from the user.
    chomp($input);          # This removes the ending newline from $input.
    if ($input =~ m/^([-+]?[0-9]+)([CF])$/)
    {
        # If we get in here, we had a match. $1 is the number, $2 is "C" or "F".
        $InputNum = $1; # Save to named variables to make the ...
        $type = $2; # ... rest of the program easier to read.

        if ($type eq "C") { # 'eq' tests if two strings are equal
            # The input was Celsius, so calculate Fahrenheit
            $celsius = $InputNum;
            $fahrenheit = ($celsius * 9 / 5) + 32;
        } else {
            # If not "C", it must be an "F", so calculate Celsius
            $fahrenheit = $InputNum;
            $celsius = ($fahrenheit - 32) * 5 / 9;
        }
        # At this point we have both temperatures, so display the results:
        printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;
    } else {
        # The initial regex did not match, so issue a warning.
        print "Expecting a number followed by \"C\" or \"F\",\n";
        print "so I don't understand \"$input\".\n";
    }

If the program shown above is named convert, we can use it like this:

    % perl -w convert
    Enter a temperature (e.g., 32F, 100C):
    39F
    3.89 C is 39.00 F
    % perl -w convert
    Enter a temperature (e.g., 32F, 100C):
    39C
    39.00 C is 102.20 F
    % perl -w convert
    Enter a temperature (e.g., 32F, 100C):
    oops
    Expecting a number followed by "C" or "F",
    so I don't understand "oops".

Intertwined Regular Expressions

With advanced programming languages like Perl, regex use can become quite intertwined with the logic of the rest of the program. For example, let's make three useful changes to our program: allow floating-point numbers as we did earlier, allow for the f or c entered to be lowercase, and allow spaces between the number and letter. Once all these changes are done, input such as '98.6•f' will be allowed.

Earlier, we saw how we can allow floating-point numbers by adding (\.[0-9]*)? to the expression:

    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)([CF])$/)

Notice that it is added inside the first set of parentheses. Since we use that first set to capture the number to compute, we want to make sure that they capture the fractional portion as well. However, the added set of parentheses, even though ostensibly used only to group for the question mark, also has the side effect of capturing into a variable. Since the opening parenthesis of the pair is the second (from the left), it captures into $2. This is illustrated in the figure "Nesting parentheses."

Figure: Nesting parentheses

The figure illustrates how closing parentheses nest with opening ones. Adding a set of parentheses earlier in the expression doesn't influence the meaning of [CF] directly, but it does so indirectly because the parentheses surrounding it have now become the third pair. Becoming the third pair means that we need to change the assignment to $type to refer to $3 instead of $2 (but see the sidebar on non-capturing parentheses for an alternative approach).

Next, allowing spaces between the number and letter is easier. We know that an unadorned space in a regex requires exactly one space in the matched text, so •* (a space followed by a star) can be used to allow any number of spaces (but still not require any):

    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?) *([CF])$/)

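This does give a limited amount of flexibility to the user of our program, but since we are trying to make something useful in the real world, let's construct the regex to also allow for other kinds of whitespace as well. Tabs, for instance, are quite common. Writing ⇥* (a tab followed by a star), of course, doesn't allow for spaces, so we need a character class that matches either one: [•⇥]*. ❖ Compare that with (•*|⇥*) and see whether you can recognize how they are fundamentally different. After considering this, check your thoughts against the quiz answer later in this section.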

In this book, spaces and tabs are easy to notice because of the • and ⇥ typesetting conventions I've used. Unfortunately, it is not so on-screen. If you see something like [ ]*, you can guess that it is probably a space and a tab, but you can't be sure until you check. For convenience, Perl regular expressions provide the \t metacharacter. It simply matches a tab; its only benefit over a literal tab is that it is visually apparent, so I use it in my expressions. Thus, [•⇥]* becomes [•\t]*. Some other Perl convenience metacharacters are \n (newline), \f (ASCII formfeed), and \b (backspace). Well, actually, \b is a backspace in some situations, but in others, it matches a word boundary. How can it be both? The next section tells us.

A short aside: metacharacters galore

We saw \n in earlier examples, but in those cases, it was in a string, not a regular expression. Like most languages, Perl strings have metacharacters of their own, and these are completely distinct from regular expression metacharacters. It is a common mistake for new programmers to get them confused. (VB.NET is a notable language that has very few string metacharacters.) Some of these string metacharacters conveniently look exactly the same as some comparable regex metacharacters. You can use the string metacharacter \t to get a tab into your string, while you can use the regex metacharacter \t to insert a tab-matching element into your regex.

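Non-Capturing Parentheses: (?:⋯)

In the figure "Nesting parentheses," we use the parentheses of the (\.[0-9]*)? part for their grouping property, so that the question mark applies to the whole of \.[0-9]* and makes it optional. Still, as a side effect, the text matched within these parentheses is captured and saved to $2, which we don't use. Perl, and recently some other regex flavors, do provide a way to group without this side effect. Rather than using (⋯), which group and capture, you can use the special notation (?:⋯), which group but do not capture. With this notation, the "opening parenthesis" is the three-character sequence (?:, which certainly looks odd. This use of '?' has no relation to the "optional" ? metacharacter. (Peek ahead to page 90 for a note about why this odd notation was chosen.)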

So, the whole expression becomes:

    if ($input =~ m/^([-+]?[0-9]+(?:\.[0-9]*)?)([CF])$/)

Now, even though the parentheses surrounding [CF] are ostensibly the third set, the text they match goes to $2 since, for counting purposes, the (?:⋯) set doesn't, well, count.

The benefits of this are twofold. One is that by avoiding the unnecessary capturing, the match process is more efficient (efficiency is something we'll look at in great detail in Chapter 6). Another is that, overall, using exactly the type of parentheses needed for each situation may be less confusing later to someone reading the code who might otherwise be left wondering about the exact nature of each set of parentheses.

On the other hand, the notation is somewhat unsightly, and perhaps makes the expression more difficult to grasp at a glance. Are the benefits worth it? Well, personally, I tend to use exactly the kind of parentheses I need, but in this particular case, it's probably not worth the confusion. For example, efficiency isn't really an issue since the match is done just once (as opposed to being done repeatedly in a loop).

Throughout this chapter, I'll tend to use (⋯) even when I don't need their capturing, just for their visual clarity.
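The effect on numbering can be checked by asking Perl for the list of captured texts, which a match returns when used in list context. The following is a sketch only; the two helper names and the sample input "98.6F" are assumptions made for the illustration.

```perl
use strict;
use warnings;

# In list context, a successful match returns the captured texts ($1, $2, ...).
sub capture_with_grouping {
    my ($input) = @_;
    return $input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)([CF])$/;
}
sub capture_without_grouping {
    my ($input) = @_;
    return $input =~ m/^([-+]?[0-9]+(?:\.[0-9]*)?)([CF])$/;
}

my @with    = capture_with_grouping("98.6F");
my @without = capture_without_grouping("98.6F");

print "capturing:     ", join(", ", @with), "\n";      # the unit is the third item
print "non-capturing: ", join(", ", @without), "\n";   # the unit is the second item
```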


The similarity is convenient, but I can't stress enough how important it is to maintain the distinction between the different types of metacharacters. It may not seem important for such a simple example as \t, but as we'll later see when looking at numerous different languages and tools, knowing which metacharacters are being used in each situation is extremely important.

Quiz Answer

Answer to the question on page 44

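How do [•⇥]* and (•*|⇥*) compare? (•*|⇥*) allows either •* or ⇥* to match, which allows either some spaces (or nothing) or some tabs (or nothing). It doesn't, however, allow a combination of spaces and tabs.

On the other hand, [•⇥]* matches [•⇥] any number of times. With a string such as '⇥••' it matches three times, a tab the first time and spaces the rest. [•⇥]* is logically equivalent to (•|⇥)*.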
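The difference described in this answer can be confirmed with a short experiment. In this sketch the expressions are wrapped in ^⋯$ so that each must account for the whole sample string, \t stands in for the tab, and the helper names are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: each regex must match the entire string.
sub class_matches       { my ($t) = @_; return $t =~ m/^[ \t]*$/   ? 1 : 0; }
sub alternation_matches { my ($t) = @_; return $t =~ m/^( *|\t*)$/ ? 1 : 0; }

for my $sample ("   ", "\t\t", "\t  ") {
    (my $shown = $sample) =~ s/\t/\\t/g;   # make tabs visible when printing
    printf "%-8s class=%d alternation=%d\n", "'$shown'",
        class_matches($sample), alternation_matches($sample);
}
```

Only the last sample, a tab followed by two spaces, separates the two: the character class accepts the mixture, while the alternation does not.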

We have already seen multiple sets of metacharacters conflict. In Chapter 1, while working with egrep, we generally wrapped our regular expressions in single quotes. The whole egrep command line is written at the command-shell prompt, and the shell recognizes several of its own metacharacters. For example, to the shell, the space is a metacharacter that separates the command from the arguments and the arguments from each other. With many shells, single quotes are metacharacters that tell the shell to not recognize other shell metacharacters in the text between the quotes. (DOS uses double quotes.)

Using the quotes for the shell allows us to use spaces in our regular expression. Without the quotes, the shell would interpret the spaces in its own way instead of passing them through to egrep to interpret in its way. Many shells also recognize metacharacters such as $, *, ?, and so on, all characters that we are likely to want to use in a regex.

Now, all this talk about other shell metacharacters and Perl's string metacharacters has nothing to do with regular expressions themselves, but it has everything to do with using regular expressions in real situations. As we move through this book, we'll see numerous (sometimes complex) situations where we need to take advantage of multiple levels of simultaneously interacting metacharacters.

And what about this \b business? This is a regex thing: in Perl regular expressions, \b normally matches a word boundary, but within a character class, it matches a backspace. A word boundary would make no sense as part of a class, so Perl is free to let it mean something else. The warnings in the first chapter about how a character class's "sub language" is different from the main regex language certainly apply to Perl (and every other regex flavor as well).
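Both meanings of \b can be observed in a few lines. This is a minimal sketch with invented helper names and sample strings; note that inside a double-quoted Perl string, \b is also a backspace, which is how the test data gets one.

```perl
use strict;
use warnings;

# Inside a character class, \b matches a backspace character.
sub class_has_backspace { my ($text) = @_; return $text =~ m/[\b]/    ? 1 : 0; }
# Outside a class, \b matches at a word boundary.
sub has_word_cat        { my ($text) = @_; return $text =~ m/\bcat\b/ ? 1 : 0; }

my $with_backspace = "ab\bc";   # in a double-quoted string, \b is a backspace

print "class, backspace present: ", class_has_backspace($with_backspace), "\n";
print "class, plain text:        ", class_has_backspace("abc"), "\n";
print "boundary, 'a cat sat':    ", has_word_cat("a cat sat"), "\n";
print "boundary, 'concatenate':  ", has_word_cat("concatenate"), "\n";
```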

Generic "whitespace" with \s

While discussing whitespace, we left off with [•\t]*. This is fine, but many regex flavors provide a useful shorthand: \s. While it looks similar to something like \t, which simply represents a literal tab, the metacharacter \s is a shorthand for a whole character class that matches any "whitespace character." This includes (among others) space, tab, newline, and carriage return. With our example, the newline and carriage return don't really matter one way or the other, but typing \s* is easier than [•\t]*. After a while, you get used to seeing it, and \s* becomes easy to read even in complex regular expressions.

Our test now looks like:

    $input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/

Lastly, we want to allow a lowercase letter as well as uppercase. This is as easy as adding the lowercase letters to the class: [CFcf]. However, I'd like to show another way as well:

    $input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i

The added i is called a modifier, and placing it after the m/⋯/ instructs Perl to do the match in a case-insensitive manner. It's not actually part of the regex, but part of the m/⋯/ syntactic packaging that tells Perl what you want to do (apply a regex), and which regex to do it with (the one between the slashes). We've seen this type of thing before, with egrep's -i option (☞ 15).

It's a bit too cumbersome to say "the i modifier" all the time, so normally "/i" is used even though you don't add an extra / when actually using it. This /i notation is one way to specify modifiers in Perl; in the next chapter, we'll see other ways to do it in Perl, and also how other languages allow for the same functionality. We'll also see other modifiers as we move along, including /g ("global match") and /x ("free-form expressions") later in this chapter.

Well, we've made a lot of changes. Let's try the new program:

    % perl -w convert
    Enter a temperature (e.g., 32F, 100C):
    32 f
    0.00 C is 32.00 F
    % perl -w convert
    Enter a temperature (e.g., 32F, 100C):
    50 c
    10.00 C is 50.00 F

Oops! Did you notice that in the second try we thought we were entering 50° Celsius, yet it was interpreted as 50° Fahrenheit? Looking at the program's logic, do you see why?

Let's look at that part of the program again:

    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i)
    {
        ...
        $type = $3; # save to a named variable to make rest of program more readable

        if ($type eq "C") { # 'eq' tests if two strings are equal
            ...

Although we modified the regex to allow a lowercase f, we neglected to update the rest of the program appropriately. As it is now, if $type isn't exactly 'C', we assume the user entered Fahrenheit. Since we now also allow 'c' to mean Celsius, we need to update the $type test:

    if ($type eq "C" or $type eq "c") {

Actually, since this is a book on regular expressions, perhaps I should use:

    if ($type =~ m/c/i) {

In either case, it now works as we want. The final program is shown below. These examples show how the use of regular expressions can become intertwined with the rest of the program.

Temperature-conversion program: final listing
    print "Enter a temperature (e.g., 32F, 100C):\n";
    $input = <STDIN>;     # This reads one line from the user.
    chomp($input);              # This removes the ending newline from $input.

    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i)
    {
        # If we get in here, we had a match. $1 is the number, $3 is "C" or "F".
        $InputNum = $1;    # Save to named variables to make the ...
        $type = $3;        # ... rest of the program easier to read.
        if ($type =~ m/c/i) {     # Is it "c" or "C"?
            # The input was Celsius, so calculate Fahrenheit
            $celsius = $InputNum;
            $fahrenheit = ($celsius * 9 / 5) + 32;
        } else {
            # If not "C", it must be an "F", so calculate Celsius
            $fahrenheit = $InputNum;
            $celsius = ($fahrenheit - 32) * 5 / 9;
        }
        # At this point we have both temperatures, so display the results:
        printf "%.2f C is %.2f F\n", $celsius, $fahrenheit;
    } else {
        # The initial regex did not match, so issue a warning.
        print "Expecting a number followed by \"C\" or \"F\",\n";
        print "so I don't understand \"$input\".\n";
    }
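Because the listing reads from STDIN, it is awkward to try many inputs quickly. One possible approach, sketched here under the assumption that the same logic is moved into a subroutine, is shown below; the name convert_temperature and the sample inputs are hypothetical and not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the logic of the final listing.
# Returns (celsius, fahrenheit), or an empty list if the input is not understood.
sub convert_temperature {
    my ($input) = @_;
    if ($input =~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
        my ($number, $type) = ($1, $3);   # copy before running another match
        if ($type =~ m/c/i) {
            return ($number, ($number * 9 / 5) + 32);   # input was Celsius
        }
        return (($number - 32) * 5 / 9, $number);       # input was Fahrenheit
    }
    return;
}

for my $input ("50 c", "32 f", "98.6F", "oops") {
    my @result = convert_temperature($input);
    if (@result) {
        printf "%-6s => %.2f C is %.2f F\n", $input, @result;
    } else {
        print "$input => not understood\n";
    }
}
```

Copying $1 and $3 into named variables before the second match matters here, since a later successful match sets those variables again.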

Intermission

Although we have spent much of this chapter coming up to speed with Perl, we've encountered a lot of new information about regexes:

  • 1. Most tools have their own particular flavor of regular expressions. Perl's appear to be of the same general type as egrep's, but have a richer set of metacharacters. Many other languages, such as Java, Python, the .NET languages, and Tcl, have flavors similar to Perl's.

  • 2. Perl can check a string in a variable against a regex using the construct $variable =~ m/regex/. The m indicates that a match is requested, while the slashes delimit (and are not part of) the regular expression. The whole test, as a unit, is either true or false.

  • 3. The concept of metacharacters (characters with special interpretations) is not unique to regular expressions. As discussed earlier about shells and double-quoted strings, multiple contexts often vie for interpretation. Knowing the various contexts (shell, regex, and string, among others), their metacharacters, and how they can interact becomes more important as you learn and use Perl, PHP, Java, Tcl, GNU Emacs, awk, Python, or other advanced languages. (And of course, within regular expressions, character classes have their own mini language with a distinct set of metacharacters.)

  • 4. Among the more useful shorthands that Perl and many other flavors of regex provide (some of which we haven't seen yet) are:

        \t    a tab character
        \n    a newline character
        \r    a carriage-return character
        \s    matches any "whitespace" character (space, tab, newline, formfeed, and such)
        \S    anything not \s
        \w    [a-zA-Z0-9_] (useful as in \w+, ostensibly to match a word)
        \W    anything not \w, i.e., [^a-zA-Z0-9_]
        \d    [0-9], i.e., a digit
        \D    anything not \d, i.e., [^0-9]

  • 5. The /i modifier makes the test case-insensitive. Although written in prose as "/i", only "i" is actually appended after the match operator's closing delimiter.

  • 6. The somewhat unsightly (?:⋯) non-capturing parentheses can be used for grouping without capturing.

  • 7. After a successful match, Perl provides the variables $1, $2, $3, etc., which hold the text matched by their respective parenthesized subexpressions in the regex. In concert with these variables, you can use a regex to pluck information from a string. (Other languages provide the same type of information in other ways; we'll see many examples in the next chapter.)

    Subexpressions are numbered by counting open parentheses from the left, starting with one. Subexpressions can be nested, as in ([-+]?[0-9]+(\.[0-9]*)?). Raw (⋯) parentheses can be intended for grouping only, but as a byproduct, they still capture into one of the special variables.


