April 21, 2011, 1:24 a.m.
posted by bruce
Advanced Language Tools
If you have a background in parsing theory, you may know that neither regular expressions nor string splitting is powerful enough to handle more complex language grammars. Roughly, regular expressions don't have the stack "memory" required by true grammars, so they cannot support arbitrary nesting of language constructs (nested if statements in a programming language, for instance). From a theoretical perspective, regular expressions are intended to handle just the first stage of parsing: separating text into components, otherwise known as lexical analysis. Language parsing requires more.
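To make the limitation concrete, here is a small illustrative sketch (not from the original text): a regular expression can only match parentheses nested to some fixed depth baked into the pattern, while a trivial counter, standing in for a parser's stack, handles any depth. The pattern and function names here are invented for the example.

```python
import re

# This pattern handles at most two levels of nesting; deeper input defeats it.
two_levels = re.compile(r'^\((?:[^()]|\([^()]*\))*\)$')

def balanced(text):
    """Check arbitrarily nested parentheses with a simple depth counter."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(two_levels.match('(a(b)c)')))      # True: two levels, within reach
print(bool(two_levels.match('(a(b(c)d)e)')))  # False: three levels are too deep
print(balanced('(a(b(c)d)e)'))                # True: the counter handles any depth
```

You could extend the regex to cover three levels, then four, and so on, but never arbitrary depth; that is exactly the "memory" a true grammar requires.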
In most applications, the Python language itself can replace custom languages and parsers: user-entered code can be passed to Python for evaluation with tools such as eval and exec. By augmenting the system with custom modules, user code in this scenario has access to both the full Python language and any application-specific extensions required. In a sense, such systems embed Python in Python. Since this is a common application of Python, we'll revisit this approach later in this chapter.
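A minimal sketch of the idea: expose an application-specific function (the name double here is invented for illustration) in a namespace dictionary, then run user-entered strings against it with eval for expressions and exec for statements. Note that this is not a security boundary; eval and exec run with the full power of Python.

```python
# Hypothetical application extension made available to user code
def double(x):
    return x * 2

# Namespace dictionary: user code sees the names we place here
scope = {'double': double}

# eval handles user-entered expressions and returns their value
result = eval('double(21)', scope)
print(result)                 # 42

# exec handles user-entered statements; bindings land in the namespace
exec('answer = double(21) + 1', scope)
print(scope['answer'])        # 43
```

In effect, the namespace dictionary is the application's "grammar": whatever functions and variables it contains become the vocabulary of the embedded user-code language.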
For some sophisticated language analysis tasks, though, a full-blown parser may still be required. Since Python is built for integrating C tools, we can write integrations to traditional parser generator systems such as yacc and bison, tools that create parsers from language grammar definitions. Better yet, we could use an integration that already exists: interfaces to such common parser generators are freely available in the open source domain (run a web search for up-to-date details and links).
Python-specific parsing systems are also accessible from Python's web site. Among them, the kwParsing system is a parser generator written in Python, and the SPARK toolkit is a lightweight system that employs the Earley algorithm to work around technical problems with LALR parser generation (if you don't know what that means, you probably don't need to care). Since all of these are complex tools, though, we'll skip their details in this text. Consult http://www.python.org for information on parser generator tools available for use in Python programs.
Even more demanding language analysis tasks require techniques developed in artificial intelligence research, such as semantic analysis and machine learning. For instance, the Natural Language Toolkit, or NLTK, is an open source suite of Python libraries and programs for symbolic and statistical natural language processing. It applies linguistic techniques to textual data, and it can be used in the development of natural language recognition software and systems.
Of special interest to this chapter, also see Yet Another Python Parser System (YAPPS). YAPPS is a parser generator written in Python. It uses supplied grammar rules to generate human-readable Python code that implements a recursive descent parser. The parsers generated by YAPPS look much like (and are inspired by) the hand-coded expression parsers shown in the next section. YAPPS creates LL(1) parsers, which are not as powerful as LALR parsers but are sufficient for many language tasks. For more on YAPPS, see http://theory.stanford.edu/~amitp/Yapps or search the Web at large.
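To show the flavor of such recursive descent parsers, here is a hand-written sketch (mine, not YAPPS output) for a small LL(1) arithmetic grammar: one method per grammar rule, with a single token of lookahead deciding which branch to take. The Parser class and its method names are invented for this example.

```python
import re

# Grammar (LL(1)):
#   expr   -> term   (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'

def tokenize(text):
    """Lexical analysis: split input into numbers and operator tokens."""
    return re.findall(r'\d+|[()+\-*/]', text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        """One token of lookahead -- the '1' in LL(1)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        cur = self.peek()
        if expected is not None and cur != expected:
            raise SyntaxError('expected %r, got %r' % (expected, cur))
        self.pos += 1
        return cur

    def expr(self):
        value = self.term()
        while self.peek() in ('+', '-'):
            op = self.eat()
            value = value + self.term() if op == '+' else value - self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ('*', '/'):
            op = self.eat()
            value = value * self.factor() if op == '*' else value / self.factor()
        return value

    def factor(self):
        if self.peek() == '(':
            self.eat('(')
            value = self.expr()     # recursion handles arbitrary nesting
            self.eat(')')
            return value
        return int(self.eat())

print(Parser('2 + 3 * (4 - 1)').expr())   # 11
```

The recursive call inside factor is what regular expressions cannot express: each nested parenthesis pushes another frame onto the call stack, which serves as the parser's memory.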