Gelex: Actions

Actions

Each pattern in a rule has a corresponding action, which can be arbitrary Eiffel instructions. The pattern ends at the first non-escaped whitespace character; the remainder of the line is its action. If the action is empty, then when the pattern is matched the input token is simply discarded. For example, here is an excerpt from the specification of a program that deletes all occurrences of "zap me" from its input:

%%
"zap me"

It will copy all other characters in the input to the output since they will be matched by the default rule. Here is a program which compresses multiple blanks and tabs down to a single blank, and throws away whitespace found at the end of a line:

%%
[ \t]+     io.put_character (' ')
[ \t]+$    -- Ignore this token.

If the action begins with a {, then the action extends until the balancing } is found, and it may span multiple lines. Gelex knows about Eiffel strings, characters and comments and therefore won't be fooled by braces found within them.
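As a minimal sketch of such a multi-line action, the following rule reports every opening brace found in the input. The braces inside the Eiffel string and character constants are not expected to end the action; only the last } does. The rule itself is purely illustrative.

```eiffel
%%
"{"    {
            -- The braces in the string and character constants below
            -- are skipped when gelex looks for the balancing brace.
        io.put_string ("found {")
        io.put_character ('}')
        io.put_character ('%N')
    }
```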

An action consisting solely of a vertical bar | means "same as the action for the next rule". See below for an illustration.

Actions can include arbitrary Eiffel code. Note that if the Eiffel code contains Unicode characters, the input file should use the UTF-8 encoding and start with the BOM character.

There are a number of special features, inherited from class YY_SCANNER, which can be used in actions:

append_text_to_string (a_string: STRING_8)

Append text at end of a_string. For efficiency, this feature can bypass the call to text and directly copy the characters from the input buffer.

append_unicode_text_to_string (a_string: STRING_32)

Append unicode_text at end of a_string. For efficiency, this feature can bypass the call to unicode_text and directly copy the characters from the input buffer.

append_utf8_text_to_string (a_string: STRING_8)

Append utf8_text at end of a_string. For efficiency, this feature can bypass the call to utf8_text and directly copy the characters from the input buffer.

append_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)

Append text_substring at end of a_string. For efficiency, this feature can bypass the call to text_substring and directly copy the characters from the input buffer.

append_unicode_text_substring_to_string (s, e: INTEGER; a_string: STRING_32)

Append unicode_text_substring at end of a_string. For efficiency, this feature can bypass the call to unicode_text_substring and directly copy the characters from the input buffer.

append_utf8_text_substring_to_string (s, e: INTEGER; a_string: STRING_8)

Append utf8_text_substring at end of a_string. For efficiency, this feature can bypass the call to utf8_text_substring and directly copy the characters from the input buffer.
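The append features are convenient when token text is accumulated across several matches. The sketch below collects every lowercase word into a single string, separated by blanks. The attribute words is hypothetical and is assumed to be declared in the scanner class and created before scanning starts.

```eiffel
%%
[a-z]+    append_text_to_string (words); words.append_character (' ')
.|\n      -- Ignore everything else.
%%
        -- Hypothetical attribute; it must be attached to a string
        -- (e.g. in the creation procedure) before the first match.
    words: STRING_8
```

Compared with words.append_string (text), this avoids creating an intermediate string for each token.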

column: INTEGER

Column number of last token read. If it is used in any of the scanner's actions, the %option line will have to be set.

echo

Copy text to the scanner's output file using output.

Empty_buffer: YY_BUFFER

Empty input buffer (once function). When input sources are not known yet at the creation time of a scanner, this input buffer can be used by default with the creation routine make_with_buffer.

flush_input_buffer

Flush the scanner's internal buffer so that the next time the scanner attempts to match a token it will first refill the buffer, unless end of file has been found.

input_buffer: YY_BUFFER

Input buffer of the scanner. By default the input buffer is filled from the standard input. To avoid unexpected behaviors, the routine set_input_buffer should be used to switch to other input buffers.

last_character: CHARACTER_8

Last character read by read_character.

last_unicode_character: CHARACTER_32

Last Unicode character read by read_character.

last_token: INTEGER

Code of the last token read. When this attribute is given a non-negative value the procedure read_token stops, giving its caller (e.g. a parser routine) the opportunity to inspect this code. Each time read_token is called again it continues processing tokens from where it last left off until either last_token is given a non-negative value again or the end of the file is reached (yielding the value 0). Negative values are reserved by read_token to indicate internal errors, which can occur when reject is called too many times (and hence nothing can be matched anymore) or when the option nodefault (or option -s) has been specified but the default rule is matched nevertheless.

less (n: INTEGER)

Return all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. text and text_count are adjusted appropriately (e.g., text_count will now be equal to n). For example, on the input "foobar" the following will write out "foobarbar":

%%
foobar    echo; less (3)
[a-z]+    echo

An argument of 0 to less will cause the entire current input string to be scanned again. Unless the way the scanner subsequently processes its input has been changed (using set_start_condition, for example), this will result in an endless loop.

line: INTEGER

Line number of last token read. If it is used in any of the scanner's actions, the %option line will have to be set.
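As an illustration, the following sketch prefixes each word with the line and column where it was found. It assumes that the option is written in the declarations section, before the first %%.

```eiffel
%option line
%%
[a-z]+    io.put_string (line.out + ":" + column.out + ": " + text + "%N")
.|\n      -- Ignore everything else.
```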

more

Tell the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of text rather than replacing it. For example, given the input "mega-kludge" the following will write "mega-mega-kludge" to the output:

%%
mega-      echo; more
kludge     echo

First "mega-" is matched and echoed to the output. Then "kludge" is matched, but the previous "mega-" is still hanging around at the beginning of text so the echo for the "kludge" rule will actually write "mega-kludge".

new_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_FILE_BUFFER

Create an input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner either uses %option utf8 or has been manually written to expect sequences of UTF-8 bytes.

new_string_buffer (a_string: STRING_8): YY_BUFFER

Create an input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 characters, or when it is using the UTF-8 encoding and the scanner either uses %option utf8 or has been manually written to expect sequences of UTF-8 bytes.

new_unicode_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_UNICODE_FILE_BUFFER

Create a Unicode input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters, or when it is using the UTF-8 encoding. The scanner will receive Unicode characters, not sequences of UTF-8 bytes.

new_unicode_string_buffer (a_string: READABLE_STRING_GENERAL): YY_UNICODE_BUFFER

Create a Unicode input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 or Unicode characters. The scanner will receive Unicode characters, not sequences of UTF-8 bytes.

new_utf8_file_buffer (a_file: KI_CHARACTER_INPUT_STREAM): YY_UTF8_FILE_BUFFER

Create a UTF-8 input buffer for a_file. This routine is convenient when used with set_input_buffer. To be used when a_file contains ISO-8859-1 characters or when it is using the UTF-8 encoding, and the scanner either uses %option utf8 or has been manually written to expect sequences of UTF-8 bytes. The scanner will receive sequences of UTF-8 bytes.

new_utf8_string_buffer (a_string: READABLE_STRING_GENERAL): YY_UTF8_BUFFER

Create a UTF-8 input buffer for a_string. This routine is convenient when used with set_input_buffer. To be used when a_string contains ISO-8859-1 or Unicode characters, and the scanner either uses %option utf8 or has been manually written to expect sequences of UTF-8 bytes. The scanner will receive sequences of UTF-8 bytes.

output (a_text: like text)

Write a_text to the standard output by default. This behavior can easily be modified through redefinition.
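A redefinition could look like the sketch below, which sends everything written by echo to the standard error. The class name STDERR_ECHO is hypothetical, and the overall file layout (class header between %{ and %}, features after the second %%) is assumed from the usual structure of a gelex description file rather than taken from this section.

```eiffel
%{
class STDERR_ECHO

inherit

    YY_COMPRESSED_SCANNER_SKELETON
        redefine
            output
        end

create

    make
%}

%%

.|\n    echo

%%

feature -- Output

    output (a_text: like text)
            -- Write `a_text' to the standard error
            -- instead of the standard output.
        do
            io.error.put_string (a_text)
        end

end
```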

pop_start_condition

Restore previous start condition. See discussion on start conditions for further details.

position: INTEGER

Position of last token read (i.e. number of characters from the start of the input source). If it is used in any of the scanner's actions, the %option position will have to be set.

print_last_token

Routine called at the end of read_token when debugging instructions are enabled. Print to standard error debug information about the last token read. This routine can be redefined in descendant classes to print more information. In particular, the routine token_name generated by geyacc can be used to make the debugging output more human-readable.

push_start_condition (a_start_condition: INTEGER)

Add current start condition to a stack and put the scanner in the corresponding new start condition a_start_condition. See discussion on start conditions for further details.
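A small sketch of how the stack might be used: the scanner enters a COMMENT start condition when it sees "/*" and returns to whatever condition was active before when it sees "*/". The name COMMENT, its %x declaration and the <COMMENT> prefix are assumptions based on the usual start condition syntax, which is described in the start conditions discussion and not in this section.

```eiffel
%x COMMENT
%%
"/*"              push_start_condition (COMMENT)
<COMMENT>"*/"     pop_start_condition
<COMMENT>.|\n     -- Skip the comment body.
```

Using pop_start_condition rather than set_start_condition means the rule does not need to know which start condition was active when the comment began.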

read_character

Read the next character from the input stream. Make the result available in last_character and last_unicode_character. For example, the following is one way to eat up C comments:

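%%
"/*"  {
        -- `stop' is assumed to be a BOOLEAN attribute of the scanner
        -- class. It is reset here so that each comment starts afresh.
    from stop := False until stop loop
            -- Skip characters up to the next '*' or the end of file.
        from
            read_character
        until
            last_character = '*' or
            last_character = '%/255/'
        loop
            read_character
        end
        if last_character = '*' then
                -- Skip any run of '*' characters.
            from
                read_character
            until
                last_character /= '*'
            loop
                read_character
            end
            if last_character = '/' then
                    -- End of comment found.
                stop := True
            end
        end
        if last_character = '%/255/' then
            io.error.put_string ("EOF in comment%N")
            stop := True
        end
    end
}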

This feature should be used with care since it bypasses the pattern-matching DFA engine. Note that if input_buffer contains Unicode characters which cannot be represented as 8-bit characters, they will be replaced by a replacement character specified in the buffer.

reject

Direct the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described in Matching Rules, and text and text_count return the appropriate values. It may either be one which matched as much text as the originally chosen rule but came later in the gelex input file, or one which matched less text. For example, the following will both count the words in the input and call the routine special whenever "frob" is seen:

%%
frob         special; reject
[^ \t\n]+    word_count := word_count + 1
%%
    word_count: INTEGER
    special do ... end

Without the reject, occurrences of "frob" in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple calls to reject are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token "abcd", it will write "abcdabcaba" to the output:

%%
a        |
ab       |
abc      |
abcd     echo; reject
.|\n     -- Eat up any unmatched character.

(The first three rules share the fourth's action since they use the special '|' action.) reject is a particularly expensive feature in terms of scanner performance. If it is used in any of the scanner's actions, the %option reject will have to be set and it will slow down all of the scanner's matching. Furthermore, reject cannot be used with the %option full, and this feature is only available to descendants of class YY_COMPRESSED_SCANNER_SKELETON.

set_input_buffer (a_buffer: like input_buffer)

Switch the scanner's input buffer so that subsequent tokens will come from a_buffer. This routine can be used to continue scanning another file when the end-of-file has been read, or to deal with preprocessor instructions such as #include. Its argument can be the result of one of the functions new_file_buffer, new_unicode_file_buffer, new_utf8_file_buffer, new_string_buffer, new_unicode_string_buffer or new_utf8_string_buffer. Note that switching input buffers does not change the start condition of the scanner.
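For instance, a routine along the following lines could be added to the scanner class to scan the contents of a string. The name scan_string is hypothetical; the sketch only combines set_input_buffer, new_string_buffer, read_token and last_token as described in this section.

```eiffel
scan_string (a_string: STRING_8)
        -- Scan `a_string' with the current scanner.
        -- (Hypothetical helper, not part of YY_SCANNER.)
    do
        set_input_buffer (new_string_buffer (a_string))
        from
            read_token
        until
                -- 0 means end of input, a negative value an error.
            last_token <= 0
        loop
            read_token
        end
    end
```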

set_last_token (a_token: INTEGER)

Set last_token to a_token.

set_start_condition (a_start_condition: INTEGER)

Put the scanner in the corresponding start condition. See discussion on start conditions for further details.

start_condition: INTEGER

Current start condition. This value can subsequently be used with set_start_condition or push_start_condition to return to that start condition. See discussion on start conditions for further details.

terminate

Terminate the scanner and set last_token to 0, indicating "all done". By default, terminate is also called when an end-of-file is encountered.
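The sketch below shows set_last_token and terminate side by side: numbers are handed to the caller of read_token, and the word "quit" stops the scanner before the end of the input. The constant Number_token and its value are hypothetical; in practice token codes usually come from the parser.

```eiffel
%%
[0-9]+    set_last_token (Number_token)
"quit"    terminate
.|\n      -- Ignore everything else.
%%
        -- Hypothetical token code; any positive value would do here.
    Number_token: INTEGER = 258
```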

text: STRING_8

Text of the last token read. This feature is a function which creates a new string each time it is called. Actions are hence free to alter the result of text without damaging the input buffer. Note that if input_buffer contains Unicode characters which cannot be represented as 8-bit characters, they will be replaced by a replacement character specified in the buffer.

text_count: INTEGER

Length of the last token read. This feature is a function which computes the number of characters matched by the corresponding pattern. If efficiency is a concern and this function is called several times in the same action, its result can be stored in a temporary variable.

text_item (i: INTEGER): CHARACTER_8

Character at a given index in text. For efficiency, this function bypasses the call to text and reads the character directly from the input buffer.

text_substring (s, e: INTEGER): STRING_8

Substring of text. This function creates a new string each time it is called. For efficiency, this function bypasses the call to text and creates the substring directly from the input buffer.

unicode_text: STRING_32

Text of the last token read. This feature is a function which creates a new string each time it is called. Actions are hence free to alter the result of unicode_text without damaging the input buffer. Note that if the scanner is written to receive sequences of UTF-8 bytes, unicode_text will treat each single byte as a character. It will not try to decode the UTF-8 bytes into Unicode characters. Also note that unicode_text does not contain surrogate or invalid Unicode characters.

unicode_text_item (i: INTEGER): CHARACTER_32

Unicode character at a given index in unicode_text. For efficiency, this function bypasses the call to unicode_text and reads the character directly from the input buffer.

unicode_text_substring (s, e: INTEGER): STRING_32

Substring of unicode_text. This function creates a new string each time it is called. For efficiency, this function bypasses the call to unicode_text and creates the substring directly from the input buffer.

unread_character (c: CHARACTER_8)

Put the character c back onto the input stream. It will be the next character scanned. The following action will take the current token and cause it to be rescanned enclosed in parentheses.

{
    a_text := text
    unread_character (')')
    from i := text_count until i < 1 loop
        unread_character (a_text.item (i))
        i := i - 1
    end
    unread_character ('(')
}

Note that since each unread_character puts the given character back at the beginning of the input stream, pushing back strings must be done back-to-front. An important potential problem when using unread_character is that it alters the input stream. If you need the value of text after a call to unread_character (as in the above example), you must first save it elsewhere. Finally, note that you cannot put back EOF (i.e. '%/255/') to attempt to mark the input stream with an end-of-file.

unread_unicode_character (c: CHARACTER_32)

Put the character c back onto the input stream. It will be the next character scanned. The behavior is undefined if c is too large to fit into input_buffer. An important potential problem when using unread_unicode_character is that it alters the input stream. If you need the value of unicode_text after a call to unread_unicode_character, you must first save it elsewhere.

utf8_text: STRING_8

UTF-8 representation of unicode_text. This function creates a new string each time it is called. For efficiency, this function bypasses the call to unicode_text and creates the UTF-8 representation directly from the input buffer.

utf8_text_substring (s, e: INTEGER): STRING_8

UTF-8 representation of unicode_text_substring. This function creates a new string each time it is called. For efficiency, this function bypasses the call to unicode_text_substring and creates the UTF-8 representation directly from the input buffer.

In addition to the above routines which can be called in semantic actions, the following routines can be called after the routine read_token has returned:

end_of_file: BOOLEAN: Has the end of input buffer been reached? This is the case when last_token has been set to 0.
scanning_error: BOOLEAN: Has an error occurred during scanning? This is the case when last_token has been given a negative value. It can occur when reject is called too many times (and hence nothing can be matched anymore) or when the option nodefault (or option -s) has been specified but the default rule is matched nevertheless.
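
A client of the scanner might use these two queries as in the following sketch. The routine name process is hypothetical, and the routine is assumed to live in a class other than the scanner itself.

```eiffel
process (a_scanner: YY_SCANNER)
        -- Read tokens from `a_scanner' until the end of its input
        -- is reached or a scanning error occurs.
        -- (Hypothetical client routine.)
    do
        from
            a_scanner.read_token
        until
            a_scanner.end_of_file or a_scanner.scanning_error
        loop
            io.put_string ("token code: " + a_scanner.last_token.out + "%N")
            a_scanner.read_token
        end
        if a_scanner.scanning_error then
            io.error.put_string ("scanning error%N")
        end
    end
```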

Furthermore, the following routines can be called before or after any semantic action if the corresponding %option has been specified. These routines do nothing by default but can be redefined in the generated scanner class.

pre_action: Action executed before every semantic action when %option pre-action has been specified.
post_action: Action executed after every semantic action when %option post-action has been specified.
pre_eof_action: Action executed before every end-of-file semantic action (i.e. <<EOF>>) when %option pre-eof-action has been specified.
post_eof_action: Action executed after every end-of-file semantic action (i.e. <<EOF>>) when %option post-eof-action has been specified.