Start Your Engines!






Start Your Engines!

Let's see how much I can milk this engine analogy. The whole point of having an engine is so that you can get from Point A to Point B without doing much work. The engine does the work for you so you can relax and enjoy the sound system. The engine's primary task is to turn the wheels, and how it does that isn't really a concern of yours. Or is it?

Two Kinds of Engines

Well, what if you had an electric car? They've been around for a long time, but they aren't as common as gas cars because they're hard to design well. If you had one, though, you would have to remember not to put gas in it. If you had a gasoline engine, well, watch out for sparks! An electric engine more or less just runs, but a gas engine might need some babysitting. You can get much better performance just by changing little things like your spark plug gaps, air filter, or brand of gas. Do it wrong and the engine's performance deteriorates, or, worse yet, it stalls.

Each engine might do its work differently, but the end result is that the wheels turn. You still have to steer properly if you want to get anywhere, but that's an entirely different issue.

New Standards

Let's stoke the fire by adding another variable: the California Emissions Standards.[[

Come to think of it, I bet that an electric engine can qualify for the standard without much changethe standard just "blesses" the clean results that are already par for the course. The gas engine, on the other hand, needs some major tweaking and a bit of re-tooling before it can qualify. Owners of this kind of engine need to pay particular care to what they feed ituse the wrong kind of gas and you're in big trouble.

The impact of standards

Better pollution standards are a good thing, but they require that the driver exercise more thought and foresight (well, at least for gas engines). Frankly, however, the standard doesn't impact most people since all the other states still do their own thing and don't follow California's standard.

So, you realize that these four types of engines can be classified into three groups (the two kinds for gas, and electric in general). You know about the differences, and that in the end they all still turn the wheels. What you don't know is what the heck this has to do with regular expressions! More than you might imagine.

Regex Engine Types

There are two fundamentally different types of regex engines: one called "DFA" (the electric engine of our story) and one called "NFA" (the gas engine). The details of what NFA and DFA mean follow shortly (☞ 156), but for now just consider them names, like Bill and Ted. Or electric and gas.

Both engine types have been around for a long time, but like its gasoline counterpart, the NFA type seems to be used more often. Tools that use an NFA engine include the .NET languages, PHP, Ruby, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Some systems have a multi-engine hybrid system, using the most appropriate engine for the job (or even one that swaps between engines for different parts of the same regex, as needed to get the best combination of features and speed). Figure lists a few common programs and the regex engine that most versions use. If your favorite program is not in the list, the section "Testing the Engine Type" on the next page can help you find out which it is.

Some Tools and Their Regex Engines

Engine type

Programs

DFA

awk (most versions), egrep(most versions), flex, lex, MySQL, Procmail

Traditional NFA

GNU Emacs, Java, grep(most versions), less, more, .NET languages, PCRE library, Perl, PHP (all three regex suites), Python, Ruby, sed (most versions), vi

POSIX NFA

mawk, Mortice Kern Systems' utilities, GNU Emacs (when requested)

Hybrid NFA/DFA

GNU awk, GNU grep/egrep, Tcl


As Chapter 3 illustrated, 20 years of development with both DFAs and NFAs resulted in a lot of needless variety. Things were dirty. The POSIX standard came in to clean things up by clearly specifying not only which metacharacters and features an engine should support, as mentioned in the previous chapter, but also exactly the results you could expect from them. Superficial details aside, the DFAs (our electric engines) were already well suited to adhere to this new standard, but the kind of results an NFA traditionally provided were different, so changes were needed. As a result, broadly speaking, there are three types of regex engines:

  • DFA (POSIX or notsimilar either way)

  • Traditional NFA (most common: Perl, .NET, PHP, Java, Python, ...)

  • POSIX NFA

Here, we use "POSIX" to refer to the match semanticsthe expected operation of a regex that the POSIX standard specifies (which we'll get to later in this chapter). Don't confuse this use of "POSIX" with uses surrounding regex features introduced in that same standard (☞ 127). Many programs support the features without supporting the full match semantics.

Old (and minimally featured) programs like egrep, awk, and lex were normally implemented with the electric DFA engine, so the new standard primarily just confirmed the status quono big changes. However, there were some gas-powered versions of these programs which had to be changed if they wanted to be POSIX-compliant. The gas engines that passed the California Emission Standards tests(POSIX NFA) were fine in that they produced results according to the standard, but the necessary changes only increased how fickle they were to proper tuning. Where before you might get by with slightly misaligned spark plugs, you now have absolutely no tolerance. Gasoline that used to be "good enough" now causes knocks and pings. But, so long as you know how to maintain your baby, the engine runs smoothly and cleanly.

From the Department of Redundancy Department

At this point, I'll ask you to go back and review the story about engines. Every sentence there rings with some truth about regular expressions. A second reading should raise some questions. Particularly, what does it mean that an electric DFA regex engine more or less "just runs?" What affects a gas-powered NFA? How can I tune my regular expressions to run as I want on an NFA? What special concerns does an emissions-controlled POSIX NFA have? What's a "stalled engine" in the regex world?

Testing the Engine Type

Because the type of engine used in a tool influences the type of features it can offer, and how those features appear to work, we can often learn the type of engine a tool has merely by checking to see how it handles a few test expressions. (After all, if you can't tell the difference, the difference doesn't matter.) At this point in the book, I wouldn't expect you to understand why the following test results indicate what they do, but I want to offer these tests now so that if your favorite tool is not listed in Figure, you can investigate before continuing with this and the subsequent chapters.

Traditional NFA or not?

The most commonly used engine is a Traditional NFA, and spotting it is easy. First, are lazy quantifiers (☞ 141) supported? If so, it's almost certainly a Traditional NFA. As we'll see, lazy quantifiers are not possible with a DFA, nor would they make any sense with a POSIX NFA. However, to make sure, simply apply the regex to the string 'nfa•not'if only 'nfa' matches, it's a Traditional NFA. If the entire 'nfa•not' matches, it's either a POSIX NFA or a DFA.

DFA or POSIX NFA?

Differentiating between a POSIX NFA and a DFA is sometimes just as simple. Capturing parentheses and backreferences are not supported by a DFA, so that can be one hint, but there are systems that are a hybrid mix between the two engine types, and so may end up using a DFA if there are no capturing parentheses.

Here's a simple test that can tell you a lot. Apply to a string like '=XX====================', as with this egrep command:

    echo =XX========================================= | egrep 'X(.+)+X'

If it takes a long time to finish, it's an NFA (and if not a Traditional NFA as per the test in the previous section, it must be a POSIX NFA). If it finishes quickly, it's either a DFA or an NFA with some advanced optimization. Does it display a warning message about a stack overflow or long match aborted? If so, it's an NFA.



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows