Compiling Regular Expressions







Problem

You have a handful of regular expressions to execute as quickly as possible over many different strings. Performance is of the utmost importance.

Solution

The best way to do this task is to use compiled regular expressions. However, there are some drawbacks to using this technique, which we will examine.

There are two ways to compile regular expressions. The easiest way is to use the RegexOptions.Compiled enumeration value in the Options parameter of the static Match or Matches methods on the Regex class:

	Match theMatch = Regex.Match(source, pattern, RegexOptions.Compiled);

	MatchCollection theMatches = Regex.Matches(source, pattern, RegexOptions.Compiled);
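When one expression is run over many strings, another option is to construct a single Regex instance with RegexOptions.Compiled and reuse it. The following is a minimal sketch of that idea, not a definitive implementation; the class name CompiledRegexDemo, the method CountMatches, and the sample strings are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledRegexDemo
{
    // Constructed once with RegexOptions.Compiled, then reused for every string.
    private static readonly Regex NameRegex =
        new Regex("NAME", RegexOptions.Compiled);

    // Hypothetical helper: totals the matches found across all source strings.
    public static int CountMatches(string[] sources)
    {
        int total = 0;
        foreach (string source in sources)
        {
            total += NameRegex.Matches(source).Count;
        }
        return total;
    }

    public static void Main()
    {
        string[] sources =
        {
            "Get the NAME from this text.",
            "No match here.",
            "NAME and NAME again."
        };

        // Prints "Total matches: 3" (1 + 0 + 2).
        Console.WriteLine("Total matches: " + CountMatches(sources));
    }
}
```

Holding the instance in a static readonly field keeps the one-time compilation cost tied to the first use of the class rather than to each call site.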

If more than a few expressions will be compiled and/or the expressions need to be shared across applications, consider precompiling all of these expressions into their own assembly. Do this by using the static CompileToAssembly method on the Regex class. The following method accepts an assembly name and compiles two simple regular expressions into this assembly:

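	public static void CreateRegExDLL(string assmName)
	{
	    // Each entry supplies: pattern, options, generated type name,
	    // namespace of the generated type, and whether the type is public.
	    RegexCompilationInfo[] RE = new RegexCompilationInfo[2]
	        {new RegexCompilationInfo("PATTERN", RegexOptions.Compiled,
	                                  "CompiledPATTERN", "Chapter_Code", true),
	         new RegexCompilationInfo("NAME", RegexOptions.Compiled,
	                                  "CompiledNAME", "Chapter_Code", true)};

	    // Name the assembly; do not include the .dll extension.
	    System.Reflection.AssemblyName aName =
	         new System.Reflection.AssemblyName( );
	    aName.Name = assmName;

	    // Compile both expressions into the named assembly.
	    Regex.CompileToAssembly(RE, aName);
	}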

Now that the expressions are compiled to an assembly, the assembly can be added as a reference to your project and used as follows:

	Chapter_Code.CompiledNAME CN = new Chapter_Code.CompiledNAME( );
	Match mName = CN.Match("Get the NAME from this text.");
	Console.WriteLine("mName.Value = " + mName.Value);

This code displays the following text:

	mName.Value = NAME

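Note that the CreateRegExDLL method can be run as part of your build process. The resulting assembly can then be shipped with your application. Be aware that Regex.CompileToAssembly is supported only on the .NET Framework; on .NET Core and .NET 5 or later it throws a PlatformNotSupportedException.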

Discussion

Compiling a regular expression allows it to run faster. To understand why, consider what happens when an expression is run against a string. If the expression is not compiled, the regular expression engine converts it to a series of internal codes; it is not converted to MSIL. As the expression runs against a string, the engine interprets these codes. This can be slow, especially when the source string is very large or the expression is complex.

There is a class of scenarios in which the performance of an uncompiled regular expression is unacceptable but a compiled one performs acceptably. In those cases, you can compile the expression so that it is converted directly to a series of MSIL instructions that perform the pattern matching for that specific regular expression. Once the Just-In-Time (JIT) compiler has run on this MSIL, the instructions are converted to machine code, which allows extremely fast execution of the pattern against a string.

There are two drawbacks to using the RegexOptions.Compiled enumeration value to compile regular expressions. The first is that an expression runs very slowly the first time it is used with the Compiled flag, because of the compilation step. Fortunately, this is a one-time expense, since every unique expression is compiled only once. The second drawback is that an in-memory assembly is generated to hold the IL, and an assembly can never be unloaded from an appdomain, so the garbage collector cannot reclaim it. If large numbers of expressions are compiled, a large amount of memory will be consumed and never released. So use this technique wisely.
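The trade-off between the one-time compilation cost and the faster steady-state matching can be observed with a simple timing harness. The following is a rough sketch under stated assumptions, not a rigorous benchmark: the class RegexTimingSketch, the method TimeMatches, the pattern, and the iteration count are all hypothetical, and the measured times will vary by machine and runtime version.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTimingSketch
{
    // Hypothetical helper: builds a Regex with the given options, runs it
    // 'iterations' times, and returns the total elapsed milliseconds.
    // Construction is timed too, so any compilation cost is included.
    public static long TimeMatches(string pattern, RegexOptions options,
                                   string input, int iterations,
                                   out int matchCount)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Regex re = new Regex(pattern, options);

        matchCount = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (re.IsMatch(input))
            {
                matchCount++;
            }
        }

        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string pattern = @"\bN[A-Z]+E\b";   // illustrative pattern only
        string input = "Get the NAME from this text.";
        int count;

        // A small run is dominated by setup cost; a large run by matching.
        foreach (int iterations in new int[] { 1, 1000000 })
        {
            long interpreted = TimeMatches(pattern, RegexOptions.None,
                                           input, iterations, out count);
            long compiled = TimeMatches(pattern, RegexOptions.Compiled,
                                        input, iterations, out count);

            Console.WriteLine(iterations + " iteration(s): interpreted = " +
                              interpreted + " ms, compiled = " +
                              compiled + " ms");
        }
    }
}
```

Timing a single iteration alongside a large run gives a rough view of where the compilation cost is paid and where it is recovered.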

Compiling regular expressions into their own assembly immediately gives you two benefits. First, precompiled expressions do not require any extra time to be compiled while your application is running. Second, they are in their own assembly and therefore can be used by other applications.

Consider precompiling regular expressions and placing them in their own assembly rather than using the RegexOptions.Compiled flag.


To compile one or more expressions into an assembly, use the static CompileToAssembly method of the Regex class. First, create a RegexCompilationInfo array and fill it with RegexCompilationInfo objects, one per expression. Next, describe the assembly in which the expressions will live: create an instance of the AssemblyName class using the default constructor and give it a name (do not include the .dll file extension in the name; it is added automatically). Finally, call the CompileToAssembly method, passing the RegexCompilationInfo array and the AssemblyName object as arguments.

In our example, this assembly is placed in the same directory that the executable was launched from.


See Also

See the ".NET Framework Regular Expressions" and "AssemblyName Class" topics in the MSDN documentation.


