Parentheses used in a regular expression not only group elements of that expression together, but also designate any matches found for that group as tokens. You can use tokens to match other parts of the same text. One advantage of using tokens is that they remember what they matched, so you can recall and reuse matched text in the process of searching or replacing.
Each token in the expression is assigned a number, starting from 1, going from left to right. To make a reference to a token later in the expression, refer to it using a backslash followed by the token number. For example, when referencing a token generated by the third set of parentheses in the expression, use \3.
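As a small illustration of left-to-right numbering, the following sketch uses two tokens and refers back to both of them, in reverse order, to find four-letter palindromes. The sample text and variable names are made up for this illustration and do not come from the examples that follow.

```matlab
% Hypothetical sample text, chosen only for illustration.
str = 'abba noon deed test';

% Token 1 is the first (\w), token 2 is the second (\w).
% \2\1 then requires the same two characters again, in reverse order.
expr = '(\w)(\w)\2\1';

[mat, tok] = regexp(str, expr, 'match', 'tokens');

% mat holds the matched text; tok{k} holds the tokens of the k-th match.
disp(mat)
disp(tok{1})
```

The word test is not matched, because its third character differs from its second.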
As a simple example, if you wanted to search for identical sequential letters in a character array, you could capture the first letter as a token and then search for a matching character immediately afterwards. In the expression shown below, the (\S) phrase creates a token whenever regexp matches any nonwhitespace character in the character array. The second part of the expression, '\1', looks for a second instance of the same character immediately following the first:
poe = ['While I nodded, nearly napping, ' ...
       'suddenly there came a tapping,'];
[mat,tok,ext] = regexp(poe, '(\S)\1', 'match', ...
               'tokens', 'tokenExtents');
mat
mat =
    'dd'    'pp'    'dd'    'pp'
The tokens returned in cell array tok are:
'd', 'p', 'd', 'p'
Starting and ending indices for each token in poe are:
11 11, 26 26, 35 35, 57 57
For another example, capture pairs of matching HTML tags (e.g., <a> and </a>) and the text between them. The expression used for this example is
expr = '<(\w+).*?>.*?</\1>';
The first part of the expression, '<(\w+)', matches an opening angle bracket (<) followed by one or more alphabetic, numeric, or underscore characters. The enclosing parentheses capture the characters following the opening angle bracket as a token.
The second part of the expression, '.*?>.*?', matches the remainder of this HTML tag (characters up to the >), and any characters that may precede the next opening angle bracket.
The last part, '</\1>', matches all characters in the ending HTML tag. This tag is composed of the sequence </tag>, where tag is whatever characters were captured as a token.
hstr = '<!comment><a name="752507"></a><b>Default</b><br>';
expr = '<(\w+).*?>.*?</\1>';
[mat,tok] = regexp(hstr, expr, 'match', 'tokens');
mat{:}
ans =
    <a name="752507"></a>
ans =
    <b>Default</b>
tok{:}
ans =
    'a'
ans =
    'b'
Here is an example of how tokens are assigned values. Suppose that you are going to search the following text:
andy ted bob jim andrew andy ted mark
You choose to search the above text with the following search pattern:
and(y|rew)|(t)e(d)
This pattern has three parenthetical expressions that generate tokens. When you finally perform the search, the following tokens are generated for each match.
Match | Token 1 | Token 2 |
---|---|---|
andy | y | |
ted | t | d |
andrew | rew | |
andy | y | |
ted | t | d |
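The search described above can be run directly. The following is a minimal sketch that prints each match together with whatever tokens regexp returns for it; the variable names are chosen for this illustration, and the loop makes no assumption about how many tokens each match produces.

```matlab
str  = 'andy ted bob jim andrew andy ted mark';
expr = 'and(y|rew)|(t)e(d)';

[mat, tok] = regexp(str, expr, 'match', 'tokens');

% Print each match followed by the tokens that were returned for it.
for k = 1:numel(mat)
    fprintf('%-8s %s\n', mat{k}, strjoin(tok{k}, ', '));
end
```

The words bob, jim, and mark satisfy neither alternative of the pattern, so they produce no match and no tokens.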
Only the highest level parentheses are used. For example, if the search pattern and(y|rew) finds the text andrew, token 1 is assigned the value rew. However, if the search pattern (and(y|rew)) is used, token 1 is assigned the value andrew.
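A quick way to check this is to compare the first token produced by the two patterns. This sketch inspects only token 1 of the first match and makes no claim about any further tokens in the output; the variable names are illustrative.

```matlab
str = 'andrew';

% Parentheses around the alternation only.
tokInner = regexp(str, 'and(y|rew)', 'tokens');

% An additional pair of parentheses around the whole pattern.
tokOuter = regexp(str, '(and(y|rew))', 'tokens');

% Token 1 is the first entry in the token list of the first match.
fprintf('and(y|rew)   -> token 1 = %s\n', tokInner{1}{1});
fprintf('(and(y|rew)) -> token 1 = %s\n', tokOuter{1}{1});
```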
For those tokens specified in the regular expression that have no match in the text being evaluated, regexp and regexpi return an empty character vector ('') as the token output, and an extent that marks the position in the string where the token was expected.
The example shown here executes regexp on a character vector specifying the path returned from the MATLAB® tempdir function. The regular expression expr includes six token specifiers, one for each piece of the path. The third specifier [a-z]+ has no match in the character vector because this part of the path, Profiles, begins with an uppercase letter:
chr = tempdir
chr =
    C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\
expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];
[tok, ext] = regexp(chr, expr, 'tokens', 'tokenExtents');
When a token is not found in the text, regexp returns an empty character vector ('') as the token and a numeric array with the token extent. The first number of the extent is the string index that marks where the token was expected, and the second number of the extent is equal to one less than the first.
In the case of this example, the empty token is the third specified in the expression, so the third token returned is empty:
tok{:}
ans =
    'C:'    'WINNT'    ''    'bpascal'    'LOCALS~1'    'Temp'
The third token extent returned in the variable ext has the starting index set to 10, which is where the nonmatching term, Profiles, begins in the path. The ending extent index is set to one less than the starting index, or 9:
ext{:}
ans =
     1     2
     4     8
    10     9
    19    25
    27    34
    36    39
When using tokens in replacement text, reference them using $1, $2, etc. instead of \1, \2, etc.
This example captures two tokens and reverses their order. The first, $1, is 'Norma Jean' and the second, $2, is 'Baker'. Note that regexprep returns the modified text, not a vector of starting indices.
regexprep('Norma Jean Baker', '(\w+\s\w+)\s(\w+)', '$2, $1')
ans = Baker, Norma Jean
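The same technique works with more than two tokens. As a further sketch, the following reorders the three parts of a date; the input value and variable names are hypothetical and serve only to show $1, $2, and $3 used together.

```matlab
% Hypothetical input: a date written as year-month-day.
dateIn = '2024-03-15';

% Three tokens: year ($1), month ($2), day ($3).
expr = '(\d{4})-(\d{2})-(\d{2})';

% Rebuild the text as day/month/year by reordering the tokens.
dateOut = regexprep(dateIn, expr, '$3/$2/$1');
disp(dateOut)
```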
If you use a lot of tokens in your expressions, it may be helpful to assign them names rather than having to keep track of which token number is assigned to which token.
To name a token, write (?<name>expr) in place of (expr). When referencing a named token within the expression, use the syntax \k<name> instead of the numeric \1, \2, etc.:
poe = ['While I nodded, nearly napping, ' ...
       'suddenly there came a tapping,'];
regexp(poe, '(?<anychar>.)\k<anychar>', 'match')
ans =
    'dd'    'pp'    'dd'    'pp'
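A named back reference can replace a numbered one wherever the numbered form is used. As a sketch, the HTML tag expression from earlier in this topic can be rewritten with a token named tag; the name is an arbitrary choice made for this illustration.

```matlab
hstr = '<!comment><a name="752507"></a><b>Default</b><br>';

% (?<tag>\w+) captures the tag name; \k<tag> requires the same
% name to appear in the closing tag.
expr = '<(?<tag>\w+).*?>.*?</\k<tag>>';

mat = regexp(hstr, expr, 'match');
mat{:}
```

The matches are the same as those found by the numbered version, '<(\w+).*?>.*?</\1>'.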
Named tokens can also be useful in labeling the output from the MATLAB regular expression functions. This is especially true when you are processing many pieces of text.
For example, parse different parts of street addresses from several character vectors. A short name is assigned to each token in the expression:
chr1 = '134 Main Street, Boulder, CO, 14923';
chr2 = '26 Walnut Road, Topeka, KA, 25384';
chr3 = '847 Industrial Drive, Elizabeth, NJ, 73548';
p1 = '(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive))';
p2 = '(?<city>[A-Z][a-z]+)';
p3 = '(?<state>[A-Z]{2})';
p4 = '(?<zip>\d{5})';
expr = [p1 ', ' p2 ', ' p3 ', ' p4];
As the following results demonstrate, you can make your output easier to work with by using named tokens:
loc1 = regexp(chr1, expr, 'names')
loc1 =
     adrs: '134 Main Street'
     city: 'Boulder'
    state: 'CO'
      zip: '14923'
loc2 = regexp(chr2, expr, 'names')
loc2 =
     adrs: '26 Walnut Road'
     city: 'Topeka'
    state: 'KA'
      zip: '25384'
loc3 = regexp(chr3, expr, 'names')
loc3 =
     adrs: '847 Industrial Drive'
     city: 'Elizabeth'
    state: 'NJ'
      zip: '73548'
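When there are many pieces of text, the same expression can be applied in a loop and the fields read by name. The following sketch repeats the three addresses and the expression from above so that it runs on its own; the variables addresses and cities are introduced only for this illustration.

```matlab
addresses = {'134 Main Street, Boulder, CO, 14923', ...
             '26 Walnut Road, Topeka, KA, 25384', ...
             '847 Industrial Drive, Elizabeth, NJ, 73548'};

% Same expression as above, written out in full.
expr = ['(?<adrs>\d+\s\S+\s(Road|Street|Avenue|Drive)), ' ...
        '(?<city>[A-Z][a-z]+), (?<state>[A-Z]{2}), (?<zip>\d{5})'];

cities = cell(size(addresses));
for k = 1:numel(addresses)
    % Each call returns a structure whose fields are the token names.
    loc = regexp(addresses{k}, expr, 'names');
    cities{k} = loc.city;
    fprintf('%s (%s) %s\n', loc.city, loc.state, loc.zip);
end
```

Reading loc.city or loc.zip keeps the code independent of the order in which the tokens appear in the expression.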