July 12, 2011, 10:47 a.m.
posted by stackme
Marking Up a Web Page
Build a replacement for each word you want to highlight. Then split the page into "HTML elements" and "text between HTML elements", and apply the replacements only to the text between HTML elements. The figure "Marking up a web page" applies highlighting in the HTML in $body to the words found in $words.
Marking up a web page
<p>I like <span class='word-0'>pickle</span>s and <span class='word-1'>herring</span>. </p>
<a href="pickle.php"><img src="pickle.jpg"/>A <span class='word-0'>pickle</span> picture</a>
I have a <span class='word-1'>herring</span>bone-patterned toaster cozy.
<herring>Herring is not a real HTML element!</herring>
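A minimal sketch of the approach described above, written to produce output like the sample shown. The function name highlight_words(), the sample $body, and the tag-splitting pattern are assumptions made for illustration, not necessarily what the original listing used.

```php
<?php
// Hypothetical helper: wrap each word in $words with a <span> whose class
// carries the word's index, touching only the text between HTML elements.
function highlight_words($body, $words)
{
    $replacements = array();
    foreach ($words as $i => $word) {
        $replacements[] = "<span class='word-$i'>$word</span>";
    }

    // Split into tags and the text between them. The capture group plus
    // PREG_SPLIT_DELIM_CAPTURE keeps the tags themselves in the result.
    // The pattern is an approximation of what an HTML element looks like.
    $tag = '~(<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>)~';
    $parts = preg_split($tag, $body, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Leave HTML elements (and their attributes) untouched.
        if ($part !== '' && $part[0] === '<') {
            continue;
        }
        // Case-sensitive replacement in plain text only.
        $parts[$i] = str_replace($words, $replacements, $part);
    }

    // Reassemble the page.
    return implode('', $parts);
}

// Sample input chosen to resemble the output above (an assumption).
$body = <<<'HTML'
<p>I like pickles and herring.</p>
<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>
I have a herringbone-patterned toaster cozy.
<herring>Herring is not a real HTML element!</herring>
HTML;

print highlight_words($body, array('pickle', 'herring'));
```

The href and src attributes keep their literal "pickle" because the loop skips every chunk that starts with <.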
Each of the words in $words (pickle and herring) has been wrapped with a <span/> that has a specific class attribute. Use a CSS stylesheet to attach particular display attributes to these classes, such as a bright yellow background or a border.
The regular expression in the figure "Marking up a web page" chops $body into a series of chunks delimited by HTML elements. This lets us replace only the text between HTML elements and leave alone the elements themselves, including any attributes whose values might contain a search term. The regular expression does a pretty good job of matching HTML elements, but particularly crazy, malformed markup with mismatched or unescaped quotes might confuse it.
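To make the chunking concrete, here is a small sketch of what preg_split() returns when asked to keep the delimiters. The helper name split_on_tags() and the pattern are assumptions for illustration; the pattern treats quoted attribute values as units so that a > inside quotes does not end the tag.

```php
<?php
// Hypothetical helper: split markup into tags and the text between them.
function split_on_tags($html)
{
    // A tag is "<", then any mix of double-quoted strings, single-quoted
    // strings, or characters that are not quotes or ">", then ">".
    $tag = '~(<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>)~';
    return preg_split($tag, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
}

$parts = split_on_tags('<a href="pickle.php">A pickle picture</a>');
print_r($parts);
// $parts holds five chunks: an empty string, the opening <a ...> tag,
// the text "A pickle picture", the closing </a> tag, and an empty string.
```

Only the middle chunk is text, so it is the only place a replacement would be applied. The empty strings at either end are harmless because they are neither tags nor text containing a search term.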
Because str_replace() is case sensitive, only strings that exactly match words in $words are replaced. The last Herring in the output above doesn't get highlighted because it begins with a capital letter. To do case-insensitive matching, we need to switch from str_replace() to regular expressions. (We can't use str_ireplace() because the replacement has to preserve the case of what matched.) The figure "Marking up a web page with regular expressions" shows the altered code that uses regular expressions to do the replacement.
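A short sketch of why neither plain string function is enough here. The sample text and variable names are chosen for illustration.

```php
<?php
$text = 'Herring is not a real HTML element!';
$replacement = "<span class='word-1'>herring</span>";

// Case-sensitive: "Herring" does not match "herring", so nothing changes.
$sensitive = str_replace('herring', $replacement, $text);

// Case-insensitive: the word is found, but the fixed replacement string
// overwrites the capital H with a lowercase h.
$insensitive = str_ireplace('herring', $replacement, $text);

print $sensitive . "\n";
print $insensitive . "\n";
```

The second result highlights the word but silently changes "Herring" to "herring", which is the problem a backreference in preg_replace() avoids.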
Marking up a web page with regular expressions
The two differences in the figure "Marking up a web page with regular expressions" are that it builds a $patterns array in the loop at the top and that it calls preg_replace() (with the $patterns array) instead of str_replace(). The i at the end of each element in $patterns makes the match case insensitive. The \\0 in the replacement stands for the text that the pattern matched, so the highlighted word keeps its original case.
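A sketch of the regular-expression variant under the same assumptions as before: the function name highlight_words_ci() and the tag-splitting pattern are illustrative. Passing '/' as the second argument to preg_quote() is an addition here, so that a search term containing the delimiter is escaped too.

```php
<?php
// Hypothetical helper: case-insensitive highlighting that preserves the
// case of whatever matched.
function highlight_words_ci($body, $words)
{
    $patterns = array();
    $replacements = array();
    foreach ($words as $i => $word) {
        // The trailing i makes the match case insensitive.
        $patterns[] = '/' . preg_quote($word, '/') . '/i';
        // "\\0" is a backslash followed by 0: the whole matched text.
        $replacements[] = "<span class='word-$i'>\\0</span>";
    }

    // Same approximate tag pattern as in the string-based version.
    $tag = '~(<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>)~';
    $parts = preg_split($tag, $body, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        // Skip HTML elements.
        if ($part !== '' && $part[0] === '<') {
            continue;
        }
        $parts[$i] = preg_replace($patterns, $replacements, $part);
    }

    return implode('', $parts);
}

print highlight_words_ci(
    '<herring>Herring is not a real HTML element!</herring>',
    array('pickle', 'herring')
);
```

The capitalized Herring in the text is now wrapped and keeps its capital H, while the <herring> tags are still skipped.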
Switching to regular expressions also makes it easy to prevent substring matching. In both figures, the herring in herringbone gets highlighted. To prevent this, change $patterns[] = '/' . preg_quote($word) .'/i'; in the regular-expression version to $patterns[] = '/\b' . preg_quote($word) .'\b/i';. The additional \b items in the pattern tell preg_replace() to match a word only if it stands on its own.
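A small sketch of the word-boundary change in isolation. The helper name word_patterns() and the sample sentence are assumptions for illustration.

```php
<?php
// Hypothetical helper: build whole-word, case-insensitive patterns.
function word_patterns($words)
{
    $patterns = array();
    foreach ($words as $word) {
        // \b matches only at a boundary between a word character and a
        // non-word character (or the start or end of the string).
        $patterns[] = '/\b' . preg_quote($word, '/') . '\b/i';
    }
    return $patterns;
}

$patterns = word_patterns(array('pickle', 'herring'));
$replacements = array(
    "<span class='word-0'>\\0</span>",
    "<span class='word-1'>\\0</span>",
);

$text = 'Herring, pickles, and a herringbone cozy.';
$result = preg_replace($patterns, $replacements, $text);
print $result . "\n";
```

Only the standalone Herring is wrapped. Note the side effect: because \b requires the word to stand alone, the plural pickles is no longer highlighted either, so whether to add the boundaries depends on which of the two behaviors matters more.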
Documentation on str_replace() at http://www.php.net/str_replace, on str_ireplace() at http://www.php.net/str_ireplace, on preg_replace() at http://www.php.net/preg_replace, and on preg_split() at http://www.php.net/preg_split.