NLTK Tutorial: Regular Expressions

Steven Bird


Table of Contents
Introduction
Simple Regular Expressions
The Wildcard
Optionality
Repeatability
Choices
More Complex Regular Expressions
Ranges
Complementation
Common Special Symbols
Advanced Regular Expressions
Python Interface

Introduction

Note

This tutorial provides a gentle introduction to regular expressions illustrated with examples from language processing. For a more concise introduction please see the Python documentation re -- Regular expression operations.

When written language is stored in a machine it is normally represented as a sequence (or string) of characters. Individual words are strings. A list of words is a string. Entire texts are also strings, including special characters for space and newline. Strings are sometimes formatted, such as a "date string" like 2002-06-23. Whole texts may be formatted, such as an email message with header fields followed by the message body. Texts may contain "markup", such as <abbrev>Phila</abbrev>, which provides information about the interpretation or presentation of some piece of text. Thus in language processing, strings are ubiquitous, and they often contain important structure.

Most language processing is performed above the level of characters. For instance, parsing a sentence is a process that operates at the level of complete words. What kinds of processing are performed at the character level? Perhaps word games are the most familiar example of such processing. In completing a crossword we may want to know which 3-letter English words end with the letter c (e.g. "arc"). We might want to know how many words can be formed from the letters: a, c, e, o, and n (e.g. "ocean"). We may want to find out which unique English word contains the substring gnt (left as an exercise for the reader). In all these examples, we are considering which word - drawn from a large set of candidates - matches a given pattern. There are many more serious uses of this so-called pattern matching. We may want to break a text into its component words, a process known as tokenization (see the introduction). For simple tokenization we want to search for locations in the text string containing whitespace (space, tab, or newline) or certain punctuation symbols, and break the text strings into word strings at these points. Alternatively, we may want to compute the relative frequency of particular letters in the text in order to guess the language of the text. For this we must total up the number of matches for each character of interest.

So far we have seen elementary examples of pattern matching, the matching of individual characters. More often we are interested in matching sequences of characters. For example, part of the operation of a naive spell-checker could be to remove a word-final s, in case it is a plural, and see if the putative singular form exists in the dictionary. For this we must locate s and remove it, but only if it precedes a word boundary. This requires matching a pattern consisting of two characters.

Beyond this pattern matching on the content of a text, we often want to process the formatting and markup of a text. We may want to check the formatting of a document (e.g. to ensure that every sentence begins with a capital letter) or to reformat a document (e.g. replacing sequences of space characters with a single space). We may want to find all date strings and extract the year. We may want to extract all words contained inside the <abbrev></abbrev> markup in order to construct a list of abbreviations.

Processing the content, format and markup of strings is a central task in most kinds of NLP. The most widespread method for string processing uses regular expressions, the topic of this tutorial.


Simple Regular Expressions

In this section we will see the building blocks for simple regular expressions, along with a selection of linguistic examples.


The Wildcard

The "." symbol matches any character. For example, the regular expression s.ng matches the following English words: sang, sing, song, and sung.

We can also use this symbol for counting characters. For instance ....zy matches six-letter words that end in zy. The pattern ....berry finds words like cranberry.

Note

Note that the . symbol matches exactly one character, and must be repeated for as many characters as should be matched. To match a variable number of characters we must use notation for optionality.


Optionality

The "?" symbol indicates that the immediately preceding expression is optional. The expression colou?r matches both British and American spellings, colour and color. The expression that precedes the ? may be punctuation, such as an optional hyphen. For instance e-?mail matches both e-mail and email.


Repeatability

The "+" symbol indicates that the immediately preceding expression is repeatable, up to an arbitrary number of times. For example, the expression coo+l matches cool, coool, and so on. This symbol is particularly effective when combined with the . symbol. For example, f.+f matches all words that begin and end with the letter f (e.g. "foolproof"). The expression .+ed finds words that potentially have the past-tense -ed suffix.

The "*" symbol indicates that the immediately preceding expression is both optional and repeatable. For example .*gnt.* matches all words that contain gnt.


Choices

Patterns using the wildcard symbol are very effective, but there are many instances where we want to limit the set of characters that the wildcard can match. In such cases we can use the [] notation. For example, we can match only the vowels using [aeiou]. The expression p[aeiou]t matches the words: pat, pet, pit, pot, and put.

We can combine the [] notation with our notation for repeatability. For example, expression p[aeiou]+t matches the words listed above, along with: peat, poet, and pout.

Note

Note that the order of vowels in the regular expression is insignificant, and we would have had the same result with the expression p[uoiea]+t. Thus, inside these brackets, the characters are interpreted not as a string but as a set of choices.

Often the choices we want to describe cannot be expressed at the level of individual characters. For example, if we were processing the output of a tagger to extract noun phrases, we might want to find all nouns (NN.*), adjectives (JJ.*), determiners (DT) and cardinals (CD), while excluding all other word types (e.g. verbs VB.*). It is possible, using a single regular expression, to search for this set of candidates using the choice operator | as follows: NN.*|JJ.*|DT|CD.

As another example of multi-character choices, suppose that we wanted to create a program to simplify English prose, replacing rare words (like habitation) with a more frequent, synonymous word (like home). In this situation, we need to map from a potentially large set of words to an individual word. We can match the set of words using the choice operator. In the case of the word home, we would want to match the regular expression dwelling|domicile|abode|habitation.

Note

Note that the choice operator has wide scope, so that abc|def is a choice between abd and def, and not between abced and abdef. The latter choice must be written using parentheses: ab(c|d)ed.


More Complex Regular Expressions

In this section we will cover operators which can be used to construct more powerful and useful regular expressions.


Ranges

In section the section called Choices we saw how the [] notation could be used to express a set of choices between individual characters. Instead of listing each character, it is also possible to express a range of characters, using the - operator. For example, [a-z] matches any lowercase letter.

As expected, ranges can be combined with other operators. For example [A-Z][a-z]* matches words that have an initial capital letter followed by any number of lowercase letters. The expression 20[0-4][0-9] matches years in the range 2000 to 2049.

Ranges can be combined, e.g. [a-zA-Z] which matches any lowercase or uppercase letter. The expression [b-df-hj-np-tv-z]+ matches words consisting only of consonants (e.g. "pygmy").


Complementation

The notation [^aeiou] - much simpler than [b-df-hj-np-tv-z]+... Also for ranges [^a-f].


Advanced Regular Expressions

greedy vs non-greedy matching

zero-width assertions

more special symbols: \b etc


Python Interface

How to do regexps in Python - give a couple of code samples for the above prose illustrations. Use /usr/dict/words - local copy.


# Load the regular expression module:
>>> from re import *
# Read a big list of words:
>>> words = open('/home/sb/nltk/data/words', 'r').read()
# How many words are there?
>>> len(words)
409093
# Compile a regexp for 3-letter words ending in c
>>> r = compile(r'^..c$', MULTILINE)
# Find all matches
>>> print r.findall(words)
['arc', 'Doc', 'Lac', 'Mac', 'Vic']