In a dynamic expression, you can make the pattern that you want regexp to match dependent on the content of the input text. In this way, you can more closely match varying input patterns in the text being parsed. You can also use dynamic expressions in replacement terms for use with the regexprep function. This gives you the ability to adapt the replacement text to the parsed input.
You can include any number of dynamic expressions in the match_expr or replace_expr arguments of these commands:
regexp(text, match_expr)
regexpi(text, match_expr)
regexprep(text, match_expr, replace_expr)
To see why dynamic expressions are useful, consider the following regexprep command, which uses a static replacement expression. It correctly replaces the term internationalization with its abbreviated form, i18n. However, to use it on a different term such as globalization, you have to use a different replacement expression:
match_expr = '(^\w)(\w*)(\w$)';
replace_expr1 = '$118$3';
regexprep('internationalization', match_expr, replace_expr1)
ans = i18n
replace_expr2 = '$111$3';
regexprep('globalization', match_expr, replace_expr2)
ans = g11n
Using a dynamic expression ${num2str(length($2))} enables you to base the replacement expression on the input text so that you do not have to change the expression each time. This example uses the dynamic replacement syntax ${cmd}.
match_expr = '(^\w)(\w*)(\w$)';
replace_expr = '$1${num2str(length($2))}$3';
regexprep('internationalization', match_expr, replace_expr)
ans = i18n
regexprep('globalization', match_expr, replace_expr)
ans = g11n
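As a further illustration of the ${cmd} replacement syntax, the following sketch capitalizes the first letter of each word. The input text and the choice of upper are assumptions made here for illustration; they are not part of the example above.

```matlab
% Token 1 is the first letter of each word; token 2 is the rest of the word.
% ${upper($1)} is evaluated once per match; $2 is inserted unchanged.
titled = regexprep('dynamic replacement text', '(\w)(\w*)', '${upper($1)}$2')
```

Because the command is evaluated separately for every match, the same replacement expression works for any number of words in the input.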
When parsed, a dynamic expression must correspond to a complete, valid regular expression. In addition, dynamic match expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of the expression, and one for the complete match. The parentheses that enclose dynamic expressions do not create a capturing group.
There are three forms of dynamic expressions that you can use in match expressions, and one form for replacement expressions, as described in the following sections.
The (??expr) operator parses expression expr, and inserts the results back into the match expression. MATLAB® then evaluates the modified match expression.
Here is an example of the type of expression that you can use with this operator:
chr = {'5XXXXX', '8XXXXXXXX', '1X'}; regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once');
The purpose of this particular command is to locate a series of X characters in each of the character vectors stored in the input cell array. Note, however, that the number of Xs varies in each character vector. If the count did not vary, you could use the expression X{n} to indicate that you want to match n of these characters. But a constant value of n does not work in this case.
The solution used here is to capture the leading count number (e.g., the 5 in the first character vector of the cell array) in a token, and then to use that count in a dynamic expression. The dynamic expression in this example is (??X{$1}), where $1 is the value captured by the token \d+. The operator {$1} makes a quantifier of that token value. Because the expression is dynamic, the same pattern works on all three of the input vectors in the cell array. With the first input character vector, regexp looks for five X characters; with the second, it looks for eight, and with the third, it looks for just one:
regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once')
ans = '5XXXXX' '8XXXXXXXX' '1X'
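The same idea applies to length-prefixed fields packed into one character vector. The sketch below is an illustration only; the record layout (a single-digit length followed by that many characters) and the variable names are assumptions, not part of the example above.

```matlab
% Each field starts with a one-digit length, followed by that many characters.
% (\d) captures the length; (??.{$1}) is rebuilt as .{3}, .{2}, .{5} per match.
record = '3abc2de5hello';
fields = regexp(record, '(\d)(??.{$1})', 'match')
```

Here the quantifier is recomputed for each match, so fields of different lengths are split with a single expression.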
MATLAB uses the (??@cmd) operator to include the results of a MATLAB command in the match expression. This command must return a term that can be used within the match expression.
For example, use the dynamic expression (??@fliplr($1)) to locate a palindrome, “Never Odd or Even”, that has been embedded into a larger character vector.
First, create the input string. Make sure that all letters are lowercase, and remove all nonword characters.
chr = lower(...
  'Find the palindrome Never Odd or Even in this string');
chr = regexprep(chr, '\W*', '')
chr = findthepalindromeneveroddoreveninthisstring
Locate the palindrome within the character vector using the dynamic expression:
palchr = regexp(chr, '(.{3,}).?(??@fliplr($1))', 'match')
palchr = 'neveroddoreven'
The dynamic expression reverses the order of the letters that make up the character vector, and then attempts to match as much of the reversed-order vector as possible. This requires a dynamic expression because the value for $1 relies on the value of the token (.{3,}).
Dynamic expressions in MATLAB have access to the currently active workspace. This means that you can change any of the functions or variables used in a dynamic expression just by changing variables in the workspace. Repeat the last command of the example above, but this time define the function to be called within the expression using a function handle stored in the base workspace:
fun = @fliplr;
palchr = regexp(chr, '(.{3,}).?(??@fun($1))', 'match')
palchr = 'neveroddoreven'
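Any command that returns text usable as a pattern can take the place of fliplr. As a minimal sketch (the input text and variable names are assumptions chosen for illustration), the following call looks for a lowercase word that is immediately repeated in uppercase:

```matlab
% ([a-z]+) captures a lowercase word as token 1.
% (??@upper($1)) calls upper on token 1 and matches the returned text.
txt = 'stop STOP go on';
shout = regexp(txt, '([a-z]+)\s(??@upper($1))', 'match')
```

Only 'stop STOP' satisfies the expression, because 'go' is followed by 'on' rather than by 'GO'.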
The (?@cmd) operator specifies a MATLAB command that regexp or regexprep is to run while parsing the overall match expression. Unlike the other dynamic expressions in MATLAB, this operator does not alter the contents of the expression it is used in. Instead, you can use this functionality to get MATLAB to report just what steps it is taking as it parses the contents of one of your regular expressions. This functionality can be useful in diagnosing your regular expressions.
The following example parses a word for zero or more characters followed by two identical characters followed again by zero or more characters:
regexp('mississippi', '\w*(\w)\1\w*', 'match')
ans = 'mississippi'
To track the exact steps that MATLAB takes in determining the match, the example inserts a short script (?@disp($1)) in the expression to display the characters that finally constitute the match. Because the example uses greedy quantifiers, MATLAB attempts to match as much of the character vector as possible. So, even though MATLAB finds a match toward the beginning of the string, it continues to look for more matches until it arrives at the very end of the string. From there, it backs up through the letters i then p and the next p, stopping at that point because the match is finally satisfied:
regexp('mississippi', '\w*(\w)(?@disp($1))\1\w*', 'match')
i
p
p
ans = 'mississippi'
Now try the same example again, this time making the first quantifier lazy (*?). Again, MATLAB makes the same match:
regexp('mississippi', '\w*?(\w)\1\w*', 'match')
ans = 'mississippi'
But by inserting a dynamic script, you can see that this time, MATLAB has matched the text quite differently. In this case, MATLAB uses the very first match it can find, and does not even consider the rest of the text:
regexp('mississippi', '\w*?(\w)(?@disp($1))\1\w*', 'match')
m
i
s
ans = 'mississippi'
To demonstrate how versatile this type of dynamic expression can be, consider the next example that progressively assembles a cell array as MATLAB iteratively parses the input text. The (?!) operator found at the end of the expression is actually an empty lookahead operator, and forces a failure at each iteration. This forced failure is necessary if you want to trace the steps that MATLAB is taking to resolve the expression.
MATLAB makes a number of passes through the input text, each time trying another combination of letters to see if a better fit than the last match can be found. On any passes in which no matches are found, the test results in an empty character vector. The dynamic script (?@if(~isempty($&))) serves to omit the empty character vectors from the matches cell array:
matches = {};
expr = ['(Euler\s)?(Cauchy\s)?(Boole)?(?@if(~isempty($&)),' ...
   'matches{end+1}=$&;end)(?!)'];
regexp('Euler Cauchy Boole', expr);
matches
matches = 'Euler Cauchy Boole' 'Euler Cauchy ' 'Euler ' 'Cauchy Boole' 'Cauchy ' 'Boole'
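The same mechanism can keep a simple tally instead of collecting text. The sketch below is an illustration under stated assumptions: the variable names count and hits and the input 'banana' are hypothetical, and the code is meant to be run at the command line or in a script so that count lives in the active workspace.

```matlab
% Increment a workspace counter each time the engine gets past an 'a'.
% The command only updates count; it does not change the match expression.
count = 0;
hits = regexp('banana', 'a(?@count = count + 1;)', 'match');
```

With a pattern this simple there is no backtracking past the command, so count would be expected to equal numel(hits). In patterns that backtrack, the command can run more often than the number of final matches, which is what makes it useful for tracing.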
The operators $& (or the equivalent $0), $`, and $' refer to that part of the input text that is currently a match, all characters that precede the current match, and all characters to follow the current match, respectively. These operators are sometimes useful when working with dynamic expressions, particularly those that employ the (?@cmd) operator.
This example parses the input text looking for the letter g. At each iteration through the text, regexp compares the current character with g, and not finding it, advances to the next character. The example tracks the progress of the scan through the text by marking the current location being parsed with a ^ character.
(The $` and $' operators capture that part of the text that precedes and follows the current parsing location. You need two single-quotation marks ($'') to express the sequence $' when it appears within text.)
chr = 'abcdefghij';
expr = '(?@disp(sprintf(''starting match: [%s^%s]'',$`,$'')))g';
regexp(chr, expr, 'once');
starting match: [^abcdefghij]
starting match: [a^bcdefghij]
starting match: [ab^cdefghij]
starting match: [abc^defghij]
starting match: [abcd^efghij]
starting match: [abcde^fghij]
starting match: [abcdef^ghij]
The ${cmd} operator modifies the contents of a regular expression replacement pattern, making this pattern adaptable to parameters in the input text that might vary from one use to the next. As with the other dynamic expressions used in MATLAB, you can include any number of these expressions within the overall replacement expression.
In the regexprep call shown here, the replacement pattern is '${convertMe($1,$2)}'. In this case, the entire replacement pattern is a dynamic expression:
regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}');
The dynamic expression tells MATLAB to execute a function named convertMe using the two tokens (\d+\.?\d*) and (\w+), derived from the text being matched, as input arguments in the call to convertMe. The replacement pattern requires a dynamic expression because the values of $1 and $2 are generated at runtime.
The following example defines the file named convertMe that converts measurements from imperial units to metric.
function valout = convertMe(valin, units)
switch(units)
    case 'inches'
        fun = @(in)in .* 2.54;
        uout = 'centimeters';
    case 'miles'
        fun = @(mi)mi .* 1.6093;
        uout = 'kilometers';
    case 'pounds'
        fun = @(lb)lb .* 0.4536;
        uout = 'kilograms';
    case 'pints'
        fun = @(pt)pt .* 0.4731;
        uout = 'litres';
    case 'ounces'
        fun = @(oz)oz .* 28.35;
        uout = 'grams';
end
val = fun(str2num(valin));
valout = [num2str(val) ' ' uout];
end
At the command line, call the convertMe function from regexprep, passing in values for the quantity to be converted and the name of the imperial unit:
regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans = This highway is 201.1625 kilometers long
regexprep('This pitcher holds 2.5 pints of water', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans = This pitcher holds 1.1828 litres of water
regexprep('This stone weighs about 10 pounds', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans = This stone weighs about 4.536 kilograms
As with the (??@ ) operator discussed in an earlier section, the ${ } operator has access to variables in the currently active workspace. The following regexprep command uses the array A defined in the base workspace:
A = magic(3)
A =
     8     1     6
     3     5     7
     4     9     2
regexprep('The columns of matrix _nam are _val', ...
          {'_nam', '_val'}, ...
          {'A', '${sprintf(''%d%d%d '', A)}'})
ans = The columns of matrix A are 834 159 672
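Workspace variables can also be combined with captured tokens inside ${ }. In the following sketch, scale is a hypothetical variable name and the input text is invented for illustration; the command converts each captured number, multiplies it by scale, and converts the result back to text.

```matlab
% scale is an ordinary workspace variable (a name chosen for this sketch).
scale = 2;
% Replace every integer in the text with that integer multiplied by scale.
doubled = regexprep('3 apples and 12 pears', '(\d+)', ...
                    '${num2str(str2double($1) * scale)}')
```

Changing the value of scale in the workspace changes the result of the next call without any edit to the replacement expression.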