Patterns
The patterns in the input are written using an extended set of
regular expressions. These are:
- x
- match the character x.
- .
- any character except newline.
- [xyz]
- a character class; in this
case, the pattern matches either an x,
a y or a z.
- [abj-oZ]
- a character class with a range in it; matches an
a, a b, any letter from j through o, or a Z.
- [^A-Z]
- a negated character class,
i.e., any character but those in the class. In this case,
any character except an uppercase letter.
- [^A-Z\n]
- any character except an uppercase letter or a newline.
- [a-z]{+}[0-9]
- a union of character classes,
i.e., any character in either of those classes. In this case,
any lowercase letter or digit.
- [a-z]{-}[aeiouy]
- a subtraction of character classes,
i.e., any character in the first class which is not in the
second class. In this case,
any consonant lowercase letter.
- [^\n]{-}([a-z]{+}[0-9])
- any character except a newline, a lowercase letter or a digit.
Note the use of parentheses in character class operations. Without
parentheses, a digit could have been matched.
- r*
- zero or more r's,
where r is any
regular expression.
- r+
- one or more r's.
- r?
- zero or one r's
(that is, "an optional r").
- r{2,5}
- anywhere from two to five r's.
- r{2,}
- two or more r's.
- r{4}
- exactly four r's.
- {name}
- the expansion of the "name"
definition.
- "[xyz]\"foo"
- the literal string: [xyz]"foo.
- \X
- if X is an a, b,
f, n, r,
t, or v, then the ANSI-C interpretation of \X. Otherwise, a literal X (used to escape
operators such as *).
- \0
- a null character (ASCII code 0).
- \123
- the character with octal value 123.
- \x2a
- the character with hexadecimal value 2a.
- \x{03B2}
- the Unicode character with hexadecimal value 03B2.
- \u03B2
- the Unicode character with hexadecimal value 03B2.
- \u{03B2}
- the Unicode character with hexadecimal value 03B2.
- (r)
- match an r;
parentheses are used to override precedence.
- rs
- the regular expression r
followed by the regular expression s;
called concatenation.
- (b:r)
- in utf8 mode, match an
r where characters
in r
are interpreted as bytes (and not Unicode characters).
- (u:r)
- in utf8 mode, match an
r where characters
in r
are interpreted as Unicode characters. Overrides
(b:s) when
it appears in s.
- r|s
- either an r or an s.
- r/s
- an r but only if it
is followed by an s.
The text matched by s
is included when determining whether this rule is the
"longest match", but is then returned to the
input before the action is executed. So the action only
sees the text matched by r.
This type of pattern is called trailing
context. (There are some combinations of r/s that gelex
cannot match correctly, such as zx*/xy.
See gelex's limitations
for details.)
- ^r
- an r, but only at
the beginning of a line (i.e., when just starting to
scan, or right after a newline has been scanned).
- r$
- an r, but only at
the end of a line (i.e., just before a newline).
Equivalent to r/\n.
- Note that gelex's notion of "newline"
is exactly what is interpreted as %N
by the Eiffel compiler that was used to compile gelex;
in particular, on some DOS systems
you must either filter out \r's
in the input yourself, or explicitly use r/\r\n for r$.
- <s>r
- an r, but only in
start condition s
(see discussion about start
conditions for details).
- <s1,s2,s3>r
- same, but in any of start conditions s1, s2,
or s3.
- <*>r
- an r in any start
condition, even an exclusive one.
- <<EOF>>
- an end-of-file.
- <s1,s2><<EOF>>
- an end-of-file when in start condition s1 or s2.
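The character class operators {+} and {-} listed above behave like set union and set difference. The following is a minimal sketch in Python, not gelex code: it models each class as a set of characters and assumes the operators apply left to right when no parentheses are given, which is what the remark about parentheses implies. All variable names are illustrative.

```python
import string

# Model character classes as plain Python sets of characters.
lowercase = set(string.ascii_lowercase)   # [a-z]
digits = set(string.digits)               # [0-9]
vowels = set("aeiouy")                    # [aeiouy]

union = lowercase | digits                # [a-z]{+}[0-9]
consonants = lowercase - vowels           # [a-z]{-}[aeiouy]

# A small sample standing in for "any character except newline".
not_newline = set("ab19Z \t")             # [^\n], restricted to a sample

# [^\n]{-}([a-z]{+}[0-9]): remove letters and digits together.
with_parens = not_newline - (lowercase | digits)

# [^\n]{-}[a-z]{+}[0-9] read left to right (assumption):
# remove letters first, then add the digits back in.
without_parens = (not_newline - lowercase) | digits
```

With the parentheses, the digit 1 is excluded; without them, the union is applied last and the digit is matched again.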
Some notes on patterns
Wherever a character is valid (by itself or inside a
character class), a Unicode character can be used as well. Just
make sure the input file uses the UTF-8 encoding and starts with the
BOM
character. The character set is the set of non-surrogate
valid Unicode characters, except with
(b:r) where it's the
set of bytes (8-bit characters). As a consequence,
. or
[^\n]
is any Unicode character
except newline, and (b:.)
or (b:[^\n])
is any byte between \x00
and \xFF except newline
\x0A.
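The difference between matching Unicode characters and matching bytes can be illustrated by analogy with Python's `re` module, which distinguishes `str` patterns from `bytes` patterns. This is only a sketch of the idea, not gelex behavior: the character β (U+03B2) is one Unicode character but two bytes in UTF-8.

```python
import re

beta = "\u03b2"                     # the Unicode character U+03B2
beta_bytes = beta.encode("utf-8")   # its UTF-8 encoding, two bytes

# Character view, comparable to . in utf8 mode: one dot is enough.
one_char = re.fullmatch(r".", beta)

# Byte view, comparable to (b:.): one dot covers only one byte.
one_byte = re.fullmatch(rb".", beta_bytes)
two_bytes = re.fullmatch(rb"..", beta_bytes)
```

Here `one_char` and `two_bytes` are matches, while `one_byte` is `None` because a single byte-level dot cannot cover a two-byte sequence.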
Note that inside of a character class, all regular expression
operators lose their special meaning except escape (\) and the character class
operators, -, ], and, at the beginning of the
class, ^.
The regular expressions listed above are grouped according to precedence, from
highest precedence at the top to lowest at the bottom. Those
grouped together have equal precedence. For example,
foo|bar*
is the same as:
(foo)|(ba(r*))
since the * operator has
higher precedence than concatenation, and concatenation higher
than alternation (|). This
pattern therefore matches either the string foo
or the string ba followed
by zero-or-more r's. To
match foo or zero-or-more bar's, use:
foo|(bar)*
and to match zero-or-more foo's-or-bar's:
(foo|bar)*
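The three patterns above can be checked with Python's `re` module, which happens to use the same relative precedence for *, concatenation and |. This is a hedged illustration rather than gelex output; the helper name `accepts` is made up for this sketch.

```python
import re

def accepts(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# foo|bar* groups as (foo)|(ba(r*)).
p1 = r"foo|bar*"
# foo|(bar)* matches foo, or zero or more repetitions of bar.
p2 = r"foo|(bar)*"
# (foo|bar)* matches any sequence of foo's and bar's.
p3 = r"(foo|bar)*"

results = {
    "p1 barrr": accepts(p1, "barrr"),    # ba followed by r's
    "p1 barbar": accepts(p1, "barbar"),  # not covered by foo|bar*
    "p2 barbar": accepts(p2, "barbar"),
    "p2 foobar": accepts(p2, "foobar"),  # foo cannot be repeated or mixed
    "p3 foobar": accepts(p3, "foobar"),
}
```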
A negated character class such as the example [^A-Z] above will match a newline
unless \n (or an equivalent
escape sequence) is one of the characters explicitly present in
the negated character class (e.g., [^A-Z\n]).
This is unlike how many other regular expression tools treat
negated character classes, but unfortunately the inconsistency is
historically entrenched. Matching newlines means that a pattern
like [^"]* can match
the entire input unless there's another quote in the input.
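Python's `re` module treats negated character classes the same way, so it can serve as a quick illustration of the effect (a sketch by analogy, not gelex itself). The sample text is invented for this example.

```python
import re

text = 'line one\nline two "quoted"'

# [^"]* does not stop at the newline: it runs up to the first quote.
across_lines = re.match(r'[^"]*', text).group()

# Adding \n to the negated class confines the match to the first line.
single_line = re.match(r'[^"\n]*', text).group()
```

In this sketch `across_lines` spans both lines up to the quote, while `single_line` stops at the end of the first line.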
A rule can have at most one instance of trailing context (the / operator or the $ operator). Start
conditions, ^, and <<EOF>> can
only occur at the beginning of a pattern and, like / and $,
cannot be grouped inside parentheses. A ^
that does not occur at the beginning of a rule, or a $ that does not occur at the end
of a rule, loses its special properties and is treated as a normal
character.
The following are illegal:
foo/bar$
<sc1>foo<sc2>bar
Note that the first of these can be written foo/bar\n. The following will
result in $ or ^ being treated as a normal
character:
foo|(bar$)
foo|^bar
If what's wanted is a foo
or a bar-followed-by-a-newline,
the following could be used (the special |
action is explained in the Actions
section):
foo |
bar$ -- action goes here
A similar trick will work for matching a foo
or a bar-at-the-beginning-of-a-line.
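For readers more familiar with general-purpose regular expression libraries, the intent of "a foo or a bar-followed-by-a-newline" can be approximated in Python with a lookahead. This is only an analogy under stated assumptions: a lookahead keeps the newline out of the matched text, much as trailing context does, but it does not take part in a longest-match rule the way gelex's trailing context does. The function name `scan` is hypothetical.

```python
import re

# bar(?=\n) requires a newline after bar without consuming it.
pattern = re.compile(r"foo|bar(?=\n)")

def scan(text):
    """Return the text matched at the start of the input, or None."""
    m = pattern.match(text)
    return m.group() if m else None

matched_foo = scan("foo bar")    # foo needs no trailing newline
matched_bar = scan("bar\nrest")  # bar is matched, the newline is left in the input
rejected = scan("bar baz")       # bar is not followed by a newline
```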