Actions

Each pattern in a rule has a corresponding action, which can be any sequence of Eiffel instructions. The pattern ends at the first non-escaped whitespace character; the remainder of the line is its action. If the action is empty, the input token is simply discarded when the pattern is matched. For example, here is an excerpt from the specification of a program which deletes all occurrences of "zap me" from its input:

%%
"zap me"

It will copy all other characters in the input to the output since they will be matched by the default rule. Here is a program which compresses multiple blanks and tabs down to a single blank, and throws away whitespace found at the end of a line:

%%
[ \t]+     io.put_character (' ')
[ \t]+$    -- Ignore this token.

If the action begins with a {, then the action extends until the balancing } is found, and may span multiple lines. Gelex knows about Eiffel strings, characters and comments and therefore won't be fooled by braces found within them.
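As an illustration of a brace-delimited action, here is a minimal sketch of a rule whose action spans several lines. The attribute digit_count is a hypothetical name introduced for this example; it is assumed to be declared in the user code section of the scanner description. Only features described on this page (output, echo, text_count) are used.

```eiffel
%%
[0-9]+    {
              -- Write each number enclosed in braces. The braces inside
              -- the two strings below do not terminate the action.
              digit_count := digit_count + text_count
              output ("{")
              echo
              output ("}")
          }
%%
    digit_count: INTEGER
```

All other characters are still copied to the output by the default rule, so under these assumptions the input "a12b" would be written as "a{12}b".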

An action consisting solely of a vertical bar | means "same as the action for the next rule". See the description of reject below for an illustration.

Actions can include arbitrary Eiffel code. Note that if the Eiffel code contains Unicode characters, the input file should use the UTF-8 encoding and start with the BOM character.

There are a number of special features, inherited from class YY_SCANNER, which can be used in actions:

append_text_to_string (a_string: STRING_8)
Append text at end of a_string. For efficiency reasons, this feature can bypass the call to text and directly copy the characters from the input buffer.
append_unicode_text_to_string (a_string: STRING_32)
Append unicode_text at end of a_string. For efficiency reasons, this feature can bypass the call to unicode_text and directly copy the characters from the input buffer.
append_utf8_text_to_string (a_string: STRING_8)
Append utf8_text at end of a_string. For efficiency reasons, this feature can bypass the call to utf8_text and directly copy the characters from the input buffer.
append_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)
Append text_substring at end of a_string. For efficiency reasons, this feature can bypass the call to text_substring and directly copy the characters from the input buffer.
append_unicode_text_substring_to_string (s, e: INTEGER; a_string: STRING_32)
Append unicode_text_substring at end of a_string. For efficiency reasons, this feature can bypass the call to unicode_text_substring and directly copy the characters from the input buffer.
append_utf8_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)
Append utf8_text_substring at end of a_string. For efficiency reasons, this feature can bypass the call to utf8_text_substring and directly copy the characters from the input buffer.
column: INTEGER
Column number of last token read. If it is used in any of the scanner's actions, %option line must be set.
echo
Copy text to the scanner's output file using output.
Empty_buffer: YY_BUFFER
Empty input buffer (once function). When input sources are not known yet at the creation time of a scanner, this input buffer can be used by default with the creation routine make_with_buffer.
flush_input_buffer
Flush the scanner's internal buffer so that the next time the scanner attempts to match a token it will first refill the buffer, unless end of file has been found.
input_buffer: YY_BUFFER
Input buffer of the scanner. By default the input buffer is filled from the standard input. To avoid unexpected behaviors, the routine set_input_buffer should be used to switch to other input buffers.
last_character: CHARACTER_8
Last character read by read_character.
last_unicode_character: CHARACTER_32
Last Unicode character read by read_character.
last_token: INTEGER
Code of the last token read. When this attribute is given a non-negative value, the procedure read_token stops, giving its caller (e.g. a parser routine) the opportunity to inspect this code. Each time read_token is called again, it continues processing tokens from where it last left off until either last_token is given a non-negative value again or the end of the file is reached (yielding the value 0). Negative values are reserved by read_token to indicate internal errors, which can occur when reject is called too many times (and hence nothing can be matched anymore) or when the option nodefault (or option -s) has been specified but the default rule is matched nevertheless.
less (n: INTEGER)
Return all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. text and text_count are adjusted appropriately (e.g., text_count will now be equal to n). For example, on the input "foobar" the following will write out "foobarbar":
%%
foobar    echo; less (3)
[a-z]+    echo 
An argument of 0 to less will cause the entire current input string to be scanned again. Unless the way the scanner subsequently processes its input has been changed (using set_start_condition, for example), this will result in an endless loop.
line: INTEGER
Line number of last token read. If it is used in any of the scanner's actions, %option line must be set.
more
Tell the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of text rather than replacing it. For example, given the input "mega-kludge" the following will write "mega-mega-kludge" to the output:
%%
mega-      echo; more
kludge     echo 
First "mega-" is matched and echoed to the output. Then "kludge" is matched, but the previous "mega-" is still hanging around at the beginning of text so the echo for the "kludge" rule will actually write "mega-kludge".
new_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_FILE_BUFFER
Create an input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner is either using the %option utf8 or has been manually written to expect sequences of UTF-8 bytes.
new_string_buffer (a_string: STRING_8): YY_BUFFER
Create an input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner is either using the %option utf8 or has been manually written to expect sequences of UTF-8 bytes.
new_unicode_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_UNICODE_FILE_BUFFER
Create a Unicode input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters, or when it is using the UTF-8 encoding. The scanner will receive Unicode characters, not sequences of UTF-8 bytes.
new_unicode_string_buffer (a_string: READABLE_STRING_GENERAL): YY_UNICODE_BUFFER
Create a Unicode input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 or Unicode characters. The scanner will receive Unicode characters, not sequences of UTF-8 bytes.
new_utf8_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_UTF8_FILE_BUFFER
Create a UTF-8 input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters or when it is using the UTF-8 encoding, and the scanner is either using the %option utf8 or has been manually written to expect sequences of UTF-8 bytes. The scanner will receive sequences of UTF-8 bytes.
new_utf8_string_buffer (a_string: READABLE_STRING_GENERAL): YY_UTF8_BUFFER
Create a UTF-8 input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 or Unicode characters, and the scanner is either using the %option utf8 or has been manually written to expect sequences of UTF-8 bytes. The scanner will receive sequences of UTF-8 bytes.
output (a_text: like text)
Write a_text to the standard output by default. This behavior can easily be modified through redefinition.
pop_start_condition
Restore previous start condition. See discussion on start conditions for further details.
position: INTEGER
Position of last token read (i.e. number of characters from the start of the input source). If it is used in any of the scanner's actions, %option position must be set.
print_last_token
Routine called at the end of read_token when debugging instructions are enabled. Print to standard error debug information about the last token read. This routine can be redefined in descendant classes to print more information. In particular, the routine token_name generated by geyacc can be used to make the debugging output more human-readable.
push_start_condition (a_start_condition: INTEGER)
Add current start condition to a stack and put the scanner in the corresponding new start condition a_start_condition. See discussion on start conditions for further details.
read_character
Read the next character from the input stream. Make the result available in last_character and last_unicode_character. For example, the following is one way to eat up C comments:
%%
"/*"  {
    from until stop loop
        from
            read_character
        until
            last_character = '*' or
            last_character = '%/255/'
        loop
            read_character
        end
        if last_character = '*' then
            from
                read_character
            until
                last_character /= '*'
            loop
                read_character
            end
            if last_character = '/' then
                stop := True
            end
        end
        if last_character = '%/255/' then
            io.error.put_string ("EOF in comment%N")
            stop := True
        end
    end
}
This feature should be used with care since it bypasses the pattern-matching DFA engine. Note that if input_buffer contains Unicode characters which cannot be represented as 8-bit characters, they will be replaced by a replacement character specified in the buffer.
reject
Direct the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described in Matching Rules, and text and text_count return the appropriate values. It may either be one which matched as much text as the originally chosen rule but came later in the gelex input file, or one which matched less text. For example, the following will both count the words in the input and call the routine special whenever "frob" is seen:
%%
frob         special; reject
[^ \t\n]+    word_count := word_count + 1
%%
    word_count: INTEGER
    special do ... end

Without the reject, occurrences of "frob" in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple calls to reject are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token "abcd", it will write "abcdabcaba" to the output:

%%
a        |
ab       |
abc      |
abcd     echo; reject
.|\n     -- Eat up any unmatched character. 
(The first three rules share the fourth's action since they use the special '|' action.) reject is a particularly expensive feature in terms of scanner performance. If it is used in any of the scanner's actions, %option reject must be set, and it will slow down all of the scanner's matching. Furthermore, reject cannot be used with %option full, and it is only available to descendants of class YY_COMPRESSED_SCANNER_SKELETON.
set_input_buffer (a_buffer: like input_buffer)
Switch the scanner's input buffer so that subsequent tokens will come from a_buffer. This routine can be used to continue scanning another file when the end-of-file has been read, or to deal with preprocessor instructions such as #include. Its argument can be the result of one of the functions new_file_buffer, new_unicode_file_buffer, new_utf8_file_buffer, new_string_buffer, new_unicode_string_buffer or new_utf8_string_buffer. Note that switching input buffers does not change the start condition of the scanner.
set_last_token (a_token: INTEGER)
Set last_token to a_token.
set_start_condition (a_start_condition: INTEGER)
Put the scanner in the corresponding start condition. See discussion on start conditions for further details.
start_condition: INTEGER
Current start condition. This value can subsequently be used with set_start_condition or push_start_condition to return to that start condition. See discussion on start conditions for further details.
terminate
Terminate the scanner and set last_token to 0, indicating "all done". By default, terminate is also called when an end-of-file is encountered.
text: STRING_8
Text of the last token read. This feature is a function which creates a new string each time it is called. Actions are hence free to alter the result of text without damaging the input buffer. Note that if input_buffer contains Unicode characters which cannot be represented as 8-bit characters, they will be replaced by a replacement character specified in the buffer.
text_count: INTEGER
Length of the last token read. This feature is a function which computes the number of characters matched by the corresponding pattern. If efficiency is a concern and this function is called several times in the same action, its result can be stored in a temporary variable.
text_item (i: INTEGER): CHARACTER_8
Character at a given index in text. For efficiency reasons, this function bypasses the call to text and reads the character directly from the input buffer.
text_substring (s, e: INTEGER): STRING_8
Substring of text. This function creates a new string each time it is called. For efficiency reasons, this function bypasses the call to text and creates the substring directly from the input buffer.
unicode_text: STRING_32
Text of the last token read. This feature is a function which creates a new string each time it is called. Actions are hence free to alter the result of unicode_text without damaging the input buffer. Note that if the scanner is written to receive sequences of UTF-8 bytes, unicode_text will treat each single byte as a character. It will not try to decode the UTF-8 bytes into Unicode characters. Also note that unicode_text does not contain surrogate or invalid Unicode characters.
unicode_text_item (i: INTEGER): CHARACTER_32
Unicode character at a given index in unicode_text. For efficiency reasons, this function bypasses the call to unicode_text and reads the character directly from the input buffer.
unicode_text_substring (s, e: INTEGER): STRING_32
Substring of unicode_text. This function creates a new string each time it is called. For efficiency reasons, this function bypasses the call to unicode_text and creates the substring directly from the input buffer.
unread_character (c: CHARACTER_8)
Put the character c back onto the input stream. It will be the next character scanned. The following action will take the current token and cause it to be rescanned enclosed in parentheses.
{
    a_text := text
    unread_character (')')
    from i := text_count until i < 1 loop
        unread_character (a_text.item (i))
        i := i - 1
    end
    unread_character ('(')
} 
Note that since each unread_character puts the given character back at the beginning of the input stream, pushing back strings must be done back-to-front. An important potential problem when using unread_character is that it alters the input stream. If you need the value of text after a call to unread_character (as in the above example), you must first save it elsewhere. Finally, note that you cannot put back EOF (i.e. '%/255/') to attempt to mark the input stream with an end-of-file.
unread_unicode_character (c: CHARACTER_32)
Put the character c back onto the input stream. It will be the next character scanned. The behavior is undefined if c is too large to fit into input_buffer. An important potential problem when using unread_unicode_character is that it alters the input stream. If you need the value of unicode_text after a call to unread_unicode_character, you must first save it elsewhere.
utf8_text: STRING_8
UTF-8 representation of unicode_text. This function creates a new string each time it is called. For efficiency reasons, this function bypasses the call to unicode_text and creates the UTF-8 representation directly from the input buffer.
utf8_text_substring (s, e: INTEGER): STRING_8
UTF-8 representation of unicode_text_substring. This function creates a new string each time it is called. For efficiency reasons, this function bypasses the call to unicode_text_substring and creates the UTF-8 representation directly from the input buffer.
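
Several of the features above are typically combined when the scanner feeds a parser. The following excerpt is a minimal sketch of that use, not code taken from the Gobo distribution: the token codes Number_token and Word_token and the attribute last_number are hypothetical names chosen for this example (a parser generator such as geyacc would normally supply the token codes), and the number is assumed to fit in an INTEGER.

```eiffel
%%
[0-9]+       last_token := Number_token; last_number := text.to_integer
[a-zA-Z]+    set_last_token (Word_token)
[ \t\n]+     -- Skip whitespace: no token code is set, so read_token keeps scanning.
%%
    Number_token: INTEGER = 258
    Word_token: INTEGER = 259
    last_number: INTEGER
```

The first two rules give last_token a non-negative value, which makes read_token return to its caller. The third rule leaves last_token alone, so the scanner silently moves on to the next match.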

In addition to the above routines which can be called in semantic actions, the following routines can be called after the routine read_token has returned:

end_of_file: BOOLEAN
Has the end of input buffer been reached? This is the case when last_token has been set to 0.
scanning_error: BOOLEAN
Has an error occurred during scanning? This is the case when last_token has been given a negative value. It can occur when reject is called too many times (and hence nothing can be matched anymore) or when the option nodefault (or option -s) has been specified but the default rule is matched nevertheless.
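
The two queries above are usually tested in the loop that drives the scanner. The routine below is a minimal sketch of such a loop, assuming it is added to the user code section of the scanner class; scan_string is a hypothetical name, and the routine only relies on features described on this page plus the standard io object.

```eiffel
    scan_string (a_string: STRING_8)
            -- Scan `a_string' and print the code of each token found.
            -- (Hypothetical routine, shown for illustration only.)
        do
            set_input_buffer (new_string_buffer (a_string))
            from
                read_token
            until
                end_of_file or scanning_error
            loop
                io.put_integer (last_token)
                io.put_new_line
                read_token
            end
            if scanning_error then
                io.error.put_string ("Scanning error%N")
            end
        end
```

The loop stops either when last_token has been set to 0 (end of input) or when it has been given a negative value (scanning error), which is why both conditions are checked after every call to read_token.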

Furthermore, the following routines are called before or after semantic actions when the corresponding %option has been specified. These routines do nothing by default but can be redefined in the generated scanner class.

pre_action
Action executed before every semantic action when %option pre-action has been specified.
post_action
Action executed after every semantic action when %option post-action has been specified.
pre_eof_action
Action executed before every end-of-file semantic action (i.e. <<EOF>>) when %option pre-eof-action has been specified.
post_eof_action
Action executed after every end-of-file semantic action (i.e. <<EOF>>) when %option post-eof-action has been specified.
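
To show how these hooks fit together, here is a minimal sketch of a scanner description that redefines pre_action. The class name TRACING_SCANNER and the attribute action_count are hypothetical, and the class header assumes that the scanner inherits from YY_COMPRESSED_SCANNER_SKELETON with make as its creation procedure.

```eiffel
%{
class TRACING_SCANNER

inherit

    YY_COMPRESSED_SCANNER_SKELETON
        redefine
            pre_action
        end

create

    make
%}

%option pre-action

%%
[a-z]+    echo
.|\n      -- Discard anything else.
%%

feature -- Tracing

    pre_action
            -- Count one more semantic action about to be executed.
        do
            action_count := action_count + 1
        end

    action_count: INTEGER
            -- Number of semantic actions executed so far

end
```

Without the %option pre-action line the redefined routine would still compile, but according to the description above it would not be called before the actions.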

Copyright © 2000-2019, Eric Bezault
mailto:ericb@gobosoft.com
http://www.gobosoft.com
Last Updated: 29 September 2019
