Advanced Language Tools

If you have a background in parsing theory, you may know that neither regular expressions nor string splitting is powerful enough to handle more complex language grammars. Roughly, regular expressions don't have the stack "memory" required by true grammars, so they cannot support arbitrary nesting of language constructs (nested if statements in a programming language, for instance). From a theoretical perspective, regular expressions are intended to handle just the first stage of parsing: separating text into components, otherwise known as lexical analysis. Language parsing requires more.
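To make the distinction concrete, here is a minimal sketch, assuming a tiny arithmetic notation invented for illustration. The names tokenize and nesting_depth are hypothetical. A regular expression does the lexical analysis, but checking arbitrarily nested parentheses takes an explicit stack:

```python
import re

def tokenize(text):
    """Lexical analysis: split text into integers and single-character symbols."""
    return re.findall(r"\d+|\S", text)

def nesting_depth(tokens):
    """Return the deepest parenthesis nesting; raise ValueError if unbalanced.

    The explicit stack is the "memory" that a regular expression lacks.
    """
    stack = []
    deepest = 0
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
            deepest = max(deepest, len(stack))
        elif tok == ")":
            if not stack:
                raise ValueError("unmatched ')'")
            stack.pop()
    if stack:
        raise ValueError("unmatched '('")
    return deepest

tokens = tokenize("(1 + (2 * 3))")
print(tokens)
print(nesting_depth(tokens))
```

The regular expression only identifies the pieces; deciding whether those pieces nest correctly is the parser's job.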

In most applications, the Python language itself can replace custom languages and parsers: user-entered code can be passed to Python for evaluation with tools such as eval and exec. By augmenting the system with custom modules, user code in this scenario has access to both the full Python language and any application-specific extensions required. In a sense, such systems embed Python in Python. Since this is a common application of Python, we'll revisit this approach later in this chapter.
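As a rough sketch of this idea, the following runs user-supplied code strings in a namespace that holds one application-specific extension. The area function and the code strings are hypothetical, made up for illustration only:

```python
import math

def area(radius):
    """Hypothetical application-specific extension exposed to user code."""
    return math.pi * radius ** 2

# Names visible to the user's code: the extension plus the math module.
# Python adds its built-ins to this dictionary automatically.
namespace = {"area": area, "math": math}

# eval evaluates a single expression string and returns its value.
result = eval("area(2) + math.sqrt(16)", namespace)

# exec runs statements; names they assign are stored in the namespace.
exec("total = sum(area(r) for r in range(3))", namespace)

print(result)
print(namespace["total"])
```

Keep in mind that eval and exec run whatever code they are given with the full power of Python, so this approach suits only code strings that come from a trusted source.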

For some sophisticated language analysis tasks, though, a full-blown parser may still be required. Since Python is built for integrating C tools, we can write integrations to traditional parser generator systems such as yacc and bison, tools that create parsers from language grammar definitions. Better yet, we could use an integration that already exists: interfaces to such common parser generators are freely available in the open source domain (run a web search in Google for up-to-date details and links).

Python-specific parsing systems also are accessible from Python's web site. Among them, the kwParsing system is a parser generator written in Python, and the SPARK toolkit is a lightweight system that employs the Earley algorithm to work around technical problems with LALR parser generation (if you don't know what that means, you probably don't need to care). Since all of these are complex tools, though, we'll skip their details in this text. Consult Python's web site for information on parser generator tools available for use in Python programs.

Even more demanding language analysis tasks require techniques developed in artificial intelligence research, such as semantic analysis and machine learning. For instance, the Natural Language Toolkit, or NLTK, is an open source suite of Python libraries and programs for symbolic and statistical natural language processing. It applies linguistic techniques to textual data, and it can be used in the development of natural language recognition software and systems.

Don't Reinvent the Wheel

Speaking of parser generators, to use some of these tools in Python programs, you'll need an extension module that integrates them. The first step in such scenarios should always be to see whether the extension already exists in the public domain. Especially for common tools like these, chances are that someone else has already written an integration that you can use off-the-shelf instead of writing one from scratch.

Of course, not everyone can donate all their extension modules to the public domain, but there's a growing library of available components that you can pick up for free and a community of experts to query. Visit for links to Python software resources. With roughly one million Python users out there as I write this book, much can be found in the prior-art department.

Of special interest to this chapter, also see Yet Another Python Parser System (YAPPS). YAPPS is a parser generator written in Python. It uses supplied grammar rules to generate human-readable Python code that implements a recursive descent parser. The parsers generated by YAPPS look much like (and are inspired by) the hand-coded expression parsers shown in the next section. YAPPS creates LL(1) parsers, which are not as powerful as LALR parsers but are sufficient for many language tasks. For more on YAPPS, search the Web at large.
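For a feel of the recursive descent, LL(1) style, here is a small hand-written sketch. It is not YAPPS output, and the grammar, class, and function names are assumptions chosen for illustration. Each grammar rule becomes one method, and the parser decides what to do next by peeking at a single token:

```python
import re

class Parser:
    """Sketch of a recursive descent parser for this assumed grammar:

    expr   -> term (('+' | '-') term)*
    term   -> factor (('*' | '/') factor)*
    factor -> NUMBER | '(' expr ')'
    """

    def __init__(self, text):
        self.tokens = re.findall(r"\d+|\S", text)
        self.pos = 0

    def peek(self):
        # One token of lookahead: the "1" in LL(1).
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        tok = self.advance()
        if tok == "(":
            value = self.expr()          # recursion handles nesting
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdecimal():
            return int(tok)
        raise SyntaxError("unexpected token: %r" % (tok,))

def parse(text):
    """Parse and evaluate an arithmetic expression string."""
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected token: %r" % (parser.peek(),))
    return value

print(parse("2 * (3 + 4) - 1"))
```

Because each method consults only the next token before choosing a branch, the grammar must be written so that one token is always enough to decide, which is the main restriction of LL(1) parsing.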
