Actions
Each pattern in a rule has a corresponding action, which can
be any sequence of Eiffel instructions. The pattern ends at the
first non-escaped whitespace character; the remainder of the line
is its action. If the action is empty, then when the pattern is
matched the input token is simply discarded. For example, here is
an excerpt from the specification of a program which deletes all
occurrences of "zap me" from its input:
%%
"zap me"
It will copy all other characters in the input to the output
since they will be matched by the default rule. Here is
a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a
line:
%%
[ \t]+ io.put_character (' ')
[ \t]+$ -- Ignore this token.
If the action begins with a {, it extends to the balancing }
and may span multiple lines. Gelex
recognizes Eiffel strings, character constants and comments, so
braces found within them are not mistaken for the end of the action.
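As a minimal sketch of a multi-line action (the output format is only an illustration, not something prescribed by gelex), the following rule prints each number between braces. The braces inside the string and character constants do not end the action:

```eiffel
%%
[0-9]+	{
			-- The braces in the constants below are skipped by gelex
			-- when it looks for the end of the action.
		io.put_string ("{number: ")
		io.put_string (text)
		io.put_character ('}')
		io.put_character ('%N')
	}
```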
An action consisting solely of a vertical bar | means "same as the action
for the next rule". See the description of reject below for an illustration.
Actions can include arbitrary Eiffel code. Note that if
the Eiffel code contains Unicode characters,
the input file should use the UTF-8 encoding and start with the BOM character.
There are a number
of special features, inherited from class YY_SCANNER, which
can be used in actions:
- append_text_to_string (a_string: STRING_8)
- Append text at end of a_string.
For efficiency reasons, this feature can bypass the call to text
and directly copy the characters from the input buffer.
- append_unicode_text_to_string (a_string: STRING_32)
- Append unicode_text at end of a_string.
For efficiency reasons, this feature can bypass the call to unicode_text
and directly copy the characters from the input buffer.
- append_utf8_text_to_string (a_string: STRING_8)
- Append utf8_text at end of a_string.
For efficiency reasons, this feature can bypass the call to utf8_text
and directly copy the characters from the input buffer.
- append_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)
- Append text_substring at end of a_string.
For efficiency reasons, this feature can bypass the call to text_substring
and directly copy the characters from the input buffer.
- append_unicode_text_substring_to_string (s, e: INTEGER; a_string: STRING_32)
- Append unicode_text_substring at end of a_string.
For efficiency reasons, this feature can bypass the call to unicode_text_substring
and directly copy the characters from the input buffer.
- append_utf8_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)
- Append utf8_text_substring at end of a_string.
For efficiency reasons, this feature can bypass the call to utf8_text_substring
and directly copy the characters from the input buffer.
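The following sketch shows how append_text_to_string might be used to accumulate matched words. The attribute buffer is hypothetical and is assumed to be created by the scanner's creation procedure before scanning starts:

```eiffel
%%
[a-z]+	{
			-- Same effect as `buffer.append_string (text)', but without
			-- necessarily creating the intermediate string returned by `text'.
		append_text_to_string (buffer)
		buffer.append_character (' ')
	}
.|\n	-- Ignore everything else.
%%

	buffer: STRING_8
			-- Words read so far, separated by blanks
			-- (hypothetical attribute, assumed to be created elsewhere)
```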
- column: INTEGER
- Column number of last token read. If it is used in any of
the scanner's actions, the %option
line must be set.
- echo
- Copy text
to the scanner's output file using output.
- Empty_buffer: YY_BUFFER
- Empty input buffer (once function). When input sources
are not known yet at the creation time of a scanner, this
input buffer can be used by default with the creation
routine make_with_buffer.
- flush_input_buffer
- Flush the scanner's internal buffer so that the next time
the scanner attempts to match a token it will first
refill the buffer, unless end of file has been found.
- input_buffer: YY_BUFFER
- Input buffer of the scanner. By default the input buffer
is filled from the standard input. To avoid unexpected
behaviors, the routine set_input_buffer
should be used to switch to other input buffers.
- last_character: CHARACTER_8
- Last character read by read_character.
- last_unicode_character: CHARACTER_32
- Last Unicode character read by read_character.
- last_token: INTEGER
- Code of the last token read. When this attribute is given
a non-negative value, the procedure read_token
stops, giving its caller (e.g. a
parser routine) the opportunity to inspect this code. Each time read_token is
called again, it continues processing tokens from where it
last left off until either last_token
is given a non-negative value again or the end of the
file is reached (yielding the value 0). Negative
values are reserved by read_token to indicate internal errors, which can
occur when reject is called too many times
(and hence nothing can be matched anymore) or when
the option nodefault (or option -s) has been specified
but the default
rule is matched nevertheless.
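As an illustration, the rules below hand token codes back to the caller of read_token. The constants NUMBER and IDENTIFIER and their values are hypothetical; in practice such codes are often generated by geyacc:

```eiffel
%%
[0-9]+		last_token := NUMBER
[a-z]+		set_last_token (IDENTIFIER)
[ \t\n]+	-- No token code is set here, so read_token goes on scanning.
%%

	NUMBER: INTEGER = 258
	IDENTIFIER: INTEGER = 259
			-- Hypothetical token codes
```

The first two actions are two ways of doing the same thing: one assigns the attribute directly, the other goes through set_last_token.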
- less (n:
INTEGER)
- Return all but the first n characters of the
current token back to the input stream, where they will
be rescanned when the scanner looks for the next match. text and text_count are
adjusted appropriately (e.g., text_count
will now be equal to n).
For example, on the input "foobar" the
following will write out "foobarbar":
%%
foobar echo; less (3)
[a-z]+ echo
- An argument of 0 to less will cause
the entire current input string to be scanned again.
Unless the way the scanner subsequently processes its input
has been changed (using set_start_condition,
for example), this will result in an endless loop.
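A sketch of how less (0) might be combined with a change of start condition so that the rescan does not loop. The start condition STR is hypothetical, and the example assumes that start conditions are declared with %x and are available as integer constants (including INITIAL) in the generated class:

```eiffel
%x STR
%%
\"			less (0); set_start_condition (STR)
<STR>\"[^"\n]*\"	echo; set_start_condition (INITIAL)
<STR>\"			echo; set_start_condition (INITIAL)
```

The opening double quote is pushed back and scanned again, this time by the rules of STR, which match the whole string constant when it is terminated on the same line and the lone quote otherwise.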
- line: INTEGER
- Line number of last token read. If it is used in any of
the scanner's actions, the %option
line must be set.
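For instance, a scanner could report where each number was found, as in the sketch below. The message format is only an illustration:

```eiffel
%option line
%%
[0-9]+	{
		io.put_string ("number at line ")
		io.put_integer (line)
		io.put_string (", column ")
		io.put_integer (column)
		io.put_character ('%N')
	}
.|\n	-- Ignore everything else.
```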
- more
- Tell the scanner that the next time it matches a rule,
the corresponding token should be appended onto the
current value of text
rather than replacing it. For example, given the input
"mega-kludge" the following will write
"mega-mega-kludge" to the output:
%%
mega- echo; more
kludge echo
- First "mega-" is matched and echoed to
the output. Then "kludge" is matched,
but the previous "mega-" is still
hanging around at the beginning of text
so the echo
for the "kludge" rule will actually
write "mega-kludge".
- new_file_buffer (a_file:
KI_CHARACTER_INPUT_STREAM): YY_FILE_BUFFER
- Create an input buffer for a_file.
This routine is convenient when used with set_input_buffer.
To be used when a_file contains
ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner is
either using the %option
utf8
or has been manually written to expect sequences of UTF-8 bytes.
- new_string_buffer (a_string:
STRING_8): YY_BUFFER
- Create an input buffer for a_string.
This routine is convenient when used with set_input_buffer.
To be used when a_string contains
ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner is
either using the %option
utf8
or has been manually written to expect sequences of UTF-8 bytes.
- new_unicode_file_buffer (a_file:
KI_CHARACTER_INPUT_STREAM): YY_UNICODE_FILE_BUFFER
- Create a Unicode input buffer for a_file.
This routine is convenient when used with set_input_buffer.
To be used when a_file contains
ISO-8859-1 characters, or when it is using the UTF-8 encoding. The scanner will
receive Unicode characters, not sequences of UTF-8 bytes.
- new_unicode_string_buffer (a_string:
READABLE_STRING_GENERAL): YY_UNICODE_BUFFER
- Create a Unicode input buffer for a_string.
This routine is convenient when used with set_input_buffer.
To be used when a_string contains
ISO-8859-1 or Unicode characters. The scanner will
receive Unicode characters, not sequences of UTF-8 bytes.
- new_utf8_file_buffer (a_file:
KI_CHARACTER_INPUT_STREAM): YY_UTF8_FILE_BUFFER
- Create a UTF-8 input buffer for a_file.
This routine is convenient when used with set_input_buffer.
To be used when a_file contains
ISO-8859-1 characters or when it is using the UTF-8 encoding, and the scanner is
either using the %option
utf8
or has been manually written to expect sequences of UTF-8 bytes.
The scanner will receive sequences of UTF-8 bytes.
- new_utf8_string_buffer (a_string:
READABLE_STRING_GENERAL): YY_UTF8_BUFFER
- Create a UTF-8 input buffer for a_string.
This routine is convenient when used with set_input_buffer.
To be used when a_string contains
ISO-8859-1 or Unicode characters, and the scanner is
either using the %option
utf8
or has been manually written to expect sequences of UTF-8 bytes.
The scanner will receive sequences of UTF-8 bytes.
- output (a_text:
like text)
- Write a_text to
the standard output by default. This behavior can easily
be modified through redefinition.
- pop_start_condition
- Restore previous start condition.
See discussion on start
conditions for further details.
- position: INTEGER
- Position of last token read (i.e. number of characters
from the start of the input source). If it is used in any
of the scanner's actions, the %option
position must be set.
- print_last_token
- Routine called at the end of read_token
when debugging instructions
are enabled. Print to standard error debug information
about the last token read. This routine can be redefined
in descendant classes to print more information. In
particular, the routine token_name
generated by geyacc can be used to make the
debugging output more human-readable.
- push_start_condition (a_start_condition:
INTEGER)
- Add current start condition to a stack and put the scanner in the
corresponding new start condition a_start_condition.
See discussion on start
conditions for further details.
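A minimal sketch of push_start_condition and pop_start_condition working together, assuming a hypothetical exclusive start condition COMMENT declared with %x:

```eiffel
%x COMMENT
%%
"/*"			push_start_condition (COMMENT)
<COMMENT>"*/"		pop_start_condition
<COMMENT>.|\n		-- Skip the content of the comment.
```

Because the previous start condition is kept on the stack, the rule for "*/" does not need to know which start condition was active when the comment began.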
- read_character
- Read the next character from the input stream. Make the
result available in last_character
and last_unicode_character.
For example, the following is one way to eat up C
comments:
%%
"/*" {
		-- `stop' is assumed to be a BOOLEAN declared by the scanner class.
		-- It is reset here so that the next comment is scanned as well.
	from stop := False until stop loop
		from
			read_character
		until
			last_character = '*' or
			last_character = '%/255/'
		loop
			read_character
		end
		if last_character = '*' then
			from
				read_character
			until
				last_character /= '*'
			loop
				read_character
			end
			if last_character = '/' then
				stop := True
			end
		end
		if last_character = '%/255/' then
			io.error.put_string ("EOF in comment%N")
			stop := True
		end
	end
}
- This feature should be used with care since it bypasses
the pattern-matching DFA engine.
Note that if input_buffer
contains Unicode characters
which cannot be represented as 8-bit characters, they
will be replaced by a replacement character specified
in the buffer.
- reject
- Direct the scanner to proceed on to the "second
best" rule which matched the input (or a prefix of
the input). The rule is chosen as described in Matching Rules, and text and text_count
return the appropriate values. It may either be one which
matched as much text as the originally chosen rule but
came later in the gelex input file, or one which
matched less text. For example, the following will both
count the words in the input and call the routine special whenever
"frob" is seen:
%%
frob special; reject
[^ \t\n]+ word_count := word_count + 1
%%
word_count: INTEGER
special do ... end
Without the reject,
any occurrence of "frob" in the input would not be
counted as a word, since the scanner normally executes
only one action per token. Multiple calls to reject
are allowed, each one finding the next best choice to the
currently active rule. For example, when the following
scanner scans the token "abcd", it
will write "abcdabcaba" to the output:
%%
a |
ab |
abc |
abcd echo; reject
.|\n -- Eat up any unmatched character.
- (The first three rules share the fourth's action since
they use the special '|'
action.) reject
is a particularly expensive feature in terms of scanner
performance. If it is used in any of the scanner's
actions, the %option
reject
must be set, and it will slow down all of the
scanner's matching. Furthermore, reject
cannot be used with the %option
full,
and this feature is only available to descendants of
class YY_COMPRESSED_SCANNER_SKELETON.
- set_input_buffer (a_buffer:
like input_buffer)
- Switch the scanner's input buffer so that subsequent
tokens will come from a_buffer.
This routine can be used to continue scanning another
file when the end-of-file has been read, or to deal with
preprocessor instructions such as #include. Its
argument can be the result of one of
the functions new_file_buffer,
new_unicode_file_buffer,
new_utf8_file_buffer,
new_string_buffer,
new_unicode_string_buffer
or new_utf8_string_buffer.
Note that switching input buffers does not change the
start condition of the scanner.
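As a sketch, a routine such as the following could be added to the user code section of the scanner to switch to another file. The routine name scan_file is hypothetical, and the example assumes that class KL_TEXT_INPUT_FILE from the Gobo kernel library is available:

```eiffel
	scan_file (a_filename: STRING_8)
			-- Make subsequent tokens come from the file named `a_filename'.
			-- (Hypothetical helper routine, not part of YY_SCANNER.)
		local
			a_file: KL_TEXT_INPUT_FILE
		do
			create a_file.make (a_filename)
			a_file.open_read
			if a_file.is_open_read then
				set_input_buffer (new_file_buffer (a_file))
			else
				io.error.put_string ("Cannot open file%N")
			end
		end
```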
- set_last_token (a_token:
INTEGER)
- Set last_token
to a_token.
- set_start_condition (a_start_condition:
INTEGER)
- Put the scanner in the corresponding start condition. See
discussion on start
conditions for further details.
- start_condition: INTEGER
- Current start condition. This value can subsequently be
used with set_start_condition
or push_start_condition
to return to that start condition. See discussion on start conditions for
further details.
- terminate
- Terminate the scanner and set last_token
to 0, indicating "all done".
By default, terminate
is also called when an end-of-file is encountered.
- text: STRING_8
- Text of the last token read. This feature is a function
which creates a new string each time it is called.
Actions are hence free to alter the result of text without
damaging the input buffer.
Note that if the input buffer contains Unicode characters
which cannot be represented as 8-bit characters, they
will be replaced by a replacement character specified
in the buffer.
- text_count: INTEGER
- Length of the last token read. This feature is a function
which computes the number of characters matched by the
corresponding pattern. If efficiency is a concern and
this function is called several times in the same action,
its result can be stored in a temporary variable.
- text_item (i: INTEGER): CHARACTER_8
- Character at a given index in text.
For efficiency reasons, this function bypasses the call to
text and
reads the character directly from the input buffer.
- text_substring (s, e: INTEGER): STRING_8
- Substring of text.
This function creates a new string each time it is
called. For efficiency reasons, this function bypasses the
call to text
and creates the substring directly from the input buffer.
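For example, text_substring can be used to drop the delimiters of a token without first building the whole text. The sketch below prints the content of double-quoted strings; the pattern is only an illustration:

```eiffel
%%
\"[^"\n]*\"	{
			-- Keep characters 2 to text_count - 1, i.e. everything
			-- but the enclosing double quotes.
		io.put_string (text_substring (2, text_count - 1))
		io.put_character ('%N')
	}
.|\n		-- Ignore everything else.
```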
- unicode_text: STRING_32
- Text of the last token read. This feature is a function
which creates a new string each time it is called.
Actions are hence free to alter the result of unicode_text without
damaging the input buffer.
Note that if the scanner is written to receive sequences
of UTF-8 bytes, unicode_text
will treat each single
byte as a character. It will not try to decode the UTF-8 bytes
into Unicode characters. Also note that unicode_text
does not contain surrogate or invalid Unicode characters.
- unicode_text_item (i: INTEGER): CHARACTER_32
- Unicode character at a given index in unicode_text.
For efficiency reasons, this function bypasses the call to
unicode_text and
reads the character directly from the input buffer.
- unicode_text_substring (s, e: INTEGER): STRING_32
- Substring of unicode_text.
This function creates a new string each time it is
called. For efficiency reasons, this function bypasses the
call to unicode_text
and creates the substring directly from the input buffer.
- unread_character (c:
CHARACTER_8)
- Put the character c back onto the input
stream. It will be the next character scanned. The
following action will take the current token and cause it
to be rescanned enclosed in parentheses.
{
		-- Save the token first: unread_character alters the input stream.
	a_text := text
	unread_character (')')
	from i := a_text.count until i < 1 loop
		unread_character (a_text.item (i))
		i := i - 1
	end
	unread_character ('(')
}
- Note that since each unread_character
puts the given character back at the beginning of the
input stream, pushing back strings must be done
back-to-front. An important potential problem when using unread_character
is that it alters the input stream. If you need the value
of text
after a call to unread_character
(as in the above example), you must first save it
elsewhere. Finally, note that you cannot put back EOF
(i.e. '%/255/') to attempt to mark the input
stream with an end-of-file.
- unread_unicode_character (c:
CHARACTER_32)
- Put the character c back onto the input
stream. It will be the next character scanned.
The behavior is undefined if c is too large to fit into
input_buffer.
An important potential problem when using unread_unicode_character
is that it alters the input stream. If you need the value
of unicode_text
after a call to unread_unicode_character,
you must first save it elsewhere.
- utf8_text: STRING_8
- UTF-8 representation of unicode_text.
This function creates a new string each time it is
called. For efficiency reasons, this function bypasses the
call to unicode_text
and creates the UTF-8 representation directly from the input buffer.
- utf8_text_substring (s, e: INTEGER): STRING_8
- UTF-8 representation of unicode_text_substring.
This function creates a new string each time it is
called. For efficiency reasons, this function bypasses the
call to unicode_text_substring
and creates the UTF-8 representation directly from the input buffer.
In addition to the above routines which can be called in
semantic actions, the following routines can be called after the
routine read_token has
returned:
- end_of_file: BOOLEAN
- Has the end of input buffer been reached? This is the
case when last_token has
been set to 0.
- scanning_error: BOOLEAN
- Has an error occurred during scanning? This is the case
when last_token has
been given a negative value. It can occur when reject is
called too many times (and hence nothing can be matched anymore) or when
the option nodefault (or option -s) has been specified
but the default
rule is matched nevertheless.
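A caller might drive the scanner with a loop along the following lines. This is only a sketch: the routine name print_tokens is hypothetical, and it assumes that the scanner has already been created and given an input buffer:

```eiffel
	print_tokens (a_scanner: YY_SCANNER)
			-- Print the code of each token read by `a_scanner'.
			-- (Hypothetical client routine.)
		do
			from
				a_scanner.read_token
			until
				a_scanner.end_of_file or a_scanner.scanning_error
			loop
				io.put_integer (a_scanner.last_token)
				io.put_character ('%N')
				a_scanner.read_token
			end
			if a_scanner.scanning_error then
				io.error.put_string ("Scanning error%N")
			end
		end
```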
Furthermore, the following routines are called before or
after every semantic action if the corresponding %option has been specified.
These routines do nothing by default but can be redefined in the
generated scanner class.
- pre_action
- Action executed before every semantic action when %option pre-action has been specified.
- post_action
- Action executed after every semantic action when %option post-action has been specified.
- pre_eof_action
- Action executed before every end-of-file semantic action
(i.e. <<EOF>>) when %option pre-eof-action has been specified.
- post_eof_action
- Action executed after every end-of-file semantic action
(i.e. <<EOF>>) when %option post-eof-action has been specified.
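As an illustration, the scanner description below redefines post_action to count the semantic actions executed. The class name MY_SCANNER and the attribute action_count are hypothetical, and the overall file layout is a sketch under the assumption that the class header is given between %{ and %}:

```eiffel
%{
class MY_SCANNER

inherit

	YY_COMPRESSED_SCANNER_SKELETON
		redefine
			post_action
		end

create

	make
%}

%option post-action

%%
[a-z]+		echo
.|\n		-- Ignore everything else.
%%

feature -- Access

	action_count: INTEGER
			-- Number of semantic actions executed so far
			-- (hypothetical attribute)

feature -- Hooks

	post_action
			-- Count the semantic action which has just been executed.
		do
			action_count := action_count + 1
		end

end
```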