Marking Up a Web Page






Marking Up a Web Page

Problem

You want to display a web page'for example, a search result'with certain words highlighted.

Solution

Build an array replacement for each word you want to highlight. Then, chop up the page into "HTML elements" and "text between HTML elements" and apply the replacements to just the text between HTML elements. Figure applies highlighting in the HTML in $body to the words found in $words.

Marking up a web page

$body = '
<p>I like pickles and herring.</p>

<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

I have a herringbone-patterned toaster cozy.

<herring>Herring is not a real HTML element!</herring>
';

$words = array('pickle','herring');
$replacements = array();
foreach ($words as $i => $word) {
    $replacements[] = "<span class='word-$i'>$word</span>";
}

// Split up the page into chunks delimited by a
// reasonable approximation of what an HTML element
// looks like.
$parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                    $body,
                    -1,  // Unlimited number of chunks
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $part) {
    // Skip if this part is an HTML element
    if (isset($part[0]) && ($part[0] == '<')) { continue; }
    // Wrap the words with <span/>s
    $parts[$i] = str_replace($words, $replacements, $part);
}

// Reconstruct the body
$body = implode('',$parts);

print $body;
?>

Discussion

Figure prints:

<p>I like <span class='word-0'>pickle</span>s and <span class='word-1'>herring</span>.
</p>

<a href="pickle.php"><img src="pickle.jpg"/>A <span class='word-0'>pickle</span>
picture</a>

I have a <span class='word-1'>herring</span>bone-patterned toaster cozy.

<herring>Herring is not a real HTML element!</herring>

Each of the words in $words (pickle and herring) has been wrapped with a <span/> that has a specific class attribute. Use a CSS stylesheet to attach particular display attributes to these classes, such as a bright yellow background or a border.

The regular expression in Figure chops up $body into a series of chunks delimited by HTML elements. This lets us just replace the text between HTML elements and leaves HTML elements or attributes alone whose values might contain a search term. The regular expression does a pretty good job of matching HTML elements, but if you have some particularly crazy, malformed markup with mismatched or unescaped quotes, it might get confused.

Because str_replace( ) is case sensitive, only strings that exactly match words in $words are replaced. The last Herring in Figure doesn't get highlighted because it begins with a capital letter. To do case-insensitive matching, we need to switch from str_replace( ) to regular expressions. (We can't use str_ireplace( ) because the replacement has to preserve the case of what matched.) Figure shows the altered code that uses regular expressions to do the replacement.

Marking up a web page with regular expressions

<?php
$body = '
<p>I like pickles and herring.</p>

<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

I have a herringbone-patterned toaster cozy.

<herring>Herring is not a real HTML element!</herring>
';

$words = array('pickle','herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    $patterns[] = '/' . preg_quote($word) .'/i';
    $replacements[] = "<span class='word-$i'>\\0</span>";
}

// Split up the page into chunks delimited by a
// reasonable approximation of what an HTML element
// looks like.
$parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                    $body,
                    -1,  // Unlimited number of chunks
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $part) {
    // Skip if this part is an HTML element
    if (isset($part[0]) && ($part[0] == '<')) { continue; }
    // Wrap the words with <span/>s
    $parts[$i] = preg_replace($patterns, $replacements, $part);
}

// Reconstruct the body
$body = implode('',$parts);

print $body;
?>

The two differences in Figure are that it builds a $patterns array in the loop at the top and it uses the preg_replace( ) (with the $patterns array) instead of str_replace( ). The i at the end of each element in $patterns makes the match case insensitive. The \\0 in the replacement preserves the case in the replacement with the case of what it matched.

Switching to regular expressions also makes it easy to prevent substring matching. In both Figure and Figure, the herring in herringbone gets highlighted. To prevent this, change $patterns[] = '/' . preg_quote($word) .'/i'; in Figure to $patterns[] = '/\b' . preg_quote($word) .'\b/i';. The additional \b items in the pattern tell preg_replace( ) only to match a word if it stands on its own.

See Also

Documentation on str_replace( ) at http://www.php.net/str_replace, on str_ireplace( ) at http://www.php.net/str_ireplace, on preg_replace( ) at http://www.php.net/preg_replace, and on preg_split( ) at http://www.php.net/preg_split.



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows