Geyacc: Semantic Actions

Semantic Actions

The grammar rules for a language determine only the syntax. The semantics are determined by the semantic values associated with various tokens and groupings, and by the actions taken when various groupings are recognized. For example, the calculator calculates properly because the value associated with each expression is the proper number; it adds properly because the action for the grouping X + Y is to add the numbers associated with X and Y.

Actions

An action accompanies a syntactic rule and contains Eiffel code to be executed each time an instance of that rule is recognized. The task of most actions is to compute a semantic value for the grouping built by the rule from the semantic values associated with tokens or smaller groupings.

An action consists of Eiffel instructions surrounded by braces. Geyacc knows about Eiffel strings, characters and comments and therefore won't be fooled by braces found within them. An action can be placed at any position in the rule; it is executed at that position. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and used only for special purposes.

The Eiffel code in an action can refer to the semantic values of the components matched by the rule with the construct $N, which stands for the value of the Nth component. The semantic value for the grouping being constructed is $$. (Geyacc translates both of these constructs into array element references when it copies the actions into the generated parser class.)

Here is a typical example:

exp: ...
    | exp '+' exp
        { $$ := $1 + $3 }

This rule constructs an exp from two smaller exp groupings connected by a plus-sign token. In the action, $1 and $3 refer to the semantic values of the two component exp groupings, which are the first and third symbols on the right hand side of the rule. The sum is stored into $$ so that it becomes the semantic value of the addition-expression just recognized by the rule. If there were a useful semantic value associated with the '+' token, it could be referred to as $2.

Like entities in Eiffel, $$ is initialized to its default value at the beginning of the semantic action. This default value is the same as in Eiffel: 0 for INTEGER, False for BOOLEAN, Void for reference types, etc. Specifying no action for a rule is equivalent to specifying an empty action {}, so the semantic value of such a rule is the default value of its type.

Note that this is a departure from yacc and Bison behavior. If you don't specify an action for a rule, yacc and Bison supply a default: { $$ := $1 }. Thus, the value of the first symbol in the rule becomes the value of the whole rule. Furthermore, there is no meaningful default action for an empty rule in yacc and Bison; every empty rule must have an explicit action unless the rule's value does not matter. The behavior of geyacc was deemed more appropriate in the Eiffel context. In Eiffel, all entities are initialized to their default values. $$ can be considered the Result entity of the semantic action, so it too is initialized to its default value at the beginning of the action. Furthermore, in a typed language such as Eiffel, it is meaningless to use { $$ := $1 } as a default action since there is no guarantee that $$ and $1 have conforming types.
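
Because geyacc supplies no default action, a rule that should simply pass along the value of one of its components has to say so explicitly. Here is a minimal sketch, assuming a hypothetical token NUMBER whose declared type conforms to that of exp:

exp: NUMBER
        { $$ := $1 }
    | '(' exp ')'
        { $$ := $2 }
    ;

Without the first action, the value of an exp built from a NUMBER would be the default value of the type of exp (0 for INTEGER) rather than the number that was read.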

Note that contrary to yacc and Bison, $N with N zero or negative is not allowed in geyacc.

Action Features

Actions can include arbitrary Eiffel code. There are a number of special features, inherited from class YY_PARSER, which can be used in actions:

abort: Stop the current parsing and set syntax_error to true. Do not report an error through report_error.
accept: Stop the current parsing and set syntax_error to false.
clear_token: Discard the look-ahead token. This is useful primarily in error-recovery rule actions.
clear_all: Clear temporary objects so that they can be collected by the garbage collector. This routine is called by parse before exiting. It can be redefined in descendants. By default it clears the internal stacks by calling clear_stacks.
error_count: INTEGER: Each time the geyacc parser detects a syntax error, it increments error_count, which hence contains the number of syntax errors encountered so far during the current parsing. Even when parse returns with syntax_error set to false, error_count may have a non-zero value. This may indeed happen when error recovery was successful.
is_recovering: BOOLEAN: Specify whether the parser is recovering from a syntax error. The parser is in a recovering phase when a syntax error has been detected and the grammar is equipped with error recovery rules. During that phase, syntax errors are not reported anymore. After three syntax errors have been ignored, the parser exits the recovering phase and parsing resumes as if no error had been detected (error_count has been kept up-to-date though). Normal parsing can be immediately resumed by calling recover.
is_suspended: BOOLEAN: Specify whether the parsing has been suspended. The next call to parse will resume parsing in the state where the parser was when it was suspended. Note that a call to abort or accept will force parse to parse from scratch.
last_token: INTEGER: Current look-ahead token. This token is returned by read_token and can be discarded with clear_token when recovering from a syntax error.
last_<type>_value: <TYPE>: Semantic value of the last token read of type TYPE. This value is updated whenever read_token is called.
raise_error: Cause an immediate syntax error. This routine initiates error recovery just as if the parser itself had detected an error; it also calls the error action %error associated with current parsing state or report_error by default.
read_token: The lexical analyzer routine, read_token, recognizes tokens from the input stream and makes them available to the parser in last_token. read_token also updates the semantic value of the last token read in feature last_<type>_value. The routine read_token is called by parse when it needs a new token from the input stream.
recover: Resume generating error messages immediately for subsequent syntax errors. This is useful primarily in error-recovery rule actions.
report_error (a_message: STRING): The geyacc parser detects a parse error or syntax error whenever it reads a token which cannot satisfy any syntax rule. An action in the grammar can also explicitly proclaim an error by calling feature raise_error. The geyacc parser expects to report the error by calling the error action %error associated with current parsing state or the error reporting routine report_error by default. For a parse error, the message is normally "parse error". The default behavior is to print this message on the screen, but report_error can easily be redefined to suit your needs. After report_error or the error action %error returns to parse, the latter will attempt error recovery if you have written suitable error recovery grammar rules. If recovery is impossible, parse will immediately return and syntax_error will be set to true.
suspend: Suspend the current parsing. The next call to parse will resume parsing in the state where the parser was when it was suspended. Note that a call to abort or accept will force parse to parse from scratch.
token_name (a_token: INTEGER): STRING: Name of a token given its code. Useful in debugging instructions to make the output more human-readable.
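
As an illustration of how some of these features might be combined, the following sketch shows a debugging action and an error-recovery rule. It assumes that recovery rules are written with the special error token, as in yacc and Bison; the nonterminals stmt and exp are hypothetical, and io is the standard input/output facility inherited from ANY.

stmt: exp ';'
        {
            debug ("parser")
                    -- Show the current look-ahead token by name
                    -- rather than by numeric code.
                io.put_string (token_name (last_token))
                io.put_new_line
            end
        }
    | error ';'
        {
                -- The faulty statement has been skipped up to
                -- the ';': report later syntax errors again.
            recover
        }
    ;

The first action leaves $$ at its default value, which is enough for a sketch; a real grammar would normally compute a value for stmt there.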

Types of Semantic Values in Actions

In a simple program it may be sufficient to use the same Eiffel type for the semantic values of all constructs. This was true in the RPN and infix calculator examples. However, in most programs, there will be a need for different Eiffel types for different kinds of tokens and groupings. For example, a numeric constant may need type INTEGER or DOUBLE, while a string constant needs type STRING, and a list of identifiers might need type LINKED_LIST[STRING]. To use more than one Eiffel type for semantic values in one parser, choose the types for each symbol (terminal or nonterminal) for which semantic values are used. This is done for tokens with the %token geyacc declaration, and for groupings with the %type geyacc declaration. If the type of a semantic value has not been specified that way, it will by default be detachable ANY.

Each time $$ or $N is used, its Eiffel type is determined by which symbol it refers to in the rule. In this example:

exp: ...
   | exp '+' exp
       { $$ := $1 + $3 }

$$, $1 and $3 all refer to instances of exp, so they all have the Eiffel type declared for the nonterminal symbol exp. If $2 were used, it would have the type declared for the terminal symbol '+'.
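
The types used in such a rule come from declarations in the first section of the grammar file. Here is a sketch of what they might look like, assuming the type is written between angle brackets after %token or %type; NUMBER, IDENTIFIER and identifier_list are hypothetical symbol names:

%token <INTEGER> NUMBER
%token <STRING> IDENTIFIER
%type <INTEGER> exp
%type <LINKED_LIST [STRING]> identifier_list

With these declarations, $$, $1 and $3 in the rule above are all of type INTEGER, while a $N referring to identifier_list would be of type LINKED_LIST [STRING].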

Actions in Mid-Rule

Occasionally it is useful to put an action in the middle of a rule. These actions are written just like usual end-of-rule actions, but they are executed before the parser even recognizes the following components.

A mid-rule action may refer to the components preceding it using $N, but it may not refer to subsequent components because it is run before they are parsed. The mid-rule action itself counts as one of the components of the rule. This makes a difference when there is another action later in the same rule (and usually there is another at the end): you have to count the actions along with the symbols when working out which number N to use in $N.

The mid-rule action can also have a semantic value. The action can set its value with an assignment to $$, and actions later in the rule can refer to the value using $N. The Eiffel type for the semantic value of a mid-rule action is the same type as declared for the full grouping.

There is no way to set the value of the entire rule with a mid-rule action, because assignments to $$ do not have that effect. The only way to set the value for the entire rule is with an ordinary action at the end of the rule.
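
The numbering and typing rules above can be illustrated with a small sketch. It assumes a hypothetical nonterminal pair and a hypothetical token NUMBER, both declared with type INTEGER:

pair: NUMBER
        { $$ := $1 * 10 }
    NUMBER
        { $$ := $2 + $3 }
    ;

The mid-rule action is component number 2, so the second NUMBER is $3. In the mid-rule action, $$ denotes the value of the action itself, which has the type declared for pair, and the final action retrieves it as $2. Only the assignment in the final action sets the value of pair.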

Here is an example from a hypothetical compiler, handling a let statement that looks like let (VARIABLE) STATEMENT and serves to create a variable named VARIABLE temporarily for the duration of STATEMENT. To parse this construct, we must put VARIABLE into the symbol table while STATEMENT is parsed, then remove it afterward. Here is how it is done:

stmt: LET '(' var ')'
        {
            $$ := new_context
            contexts.put ($$)
            $$.declare_variable ($3)
        }
    stmt
        {
            $$ := $6
            contexts.remove ($5)
        }
    ;

As soon as let (VARIABLE) has been recognized, the first action is run. It saves a copy of the current semantic context (the list of accessible variables) as its semantic value. Then it calls declare_variable to add the new variable to that list. Once the first action is finished, the embedded statement stmt can be parsed. Note that the mid-rule action is component number 5, so the stmt is component number 6. After the embedded statement is parsed, its semantic value becomes the value of the entire let statement. Then the semantic value from the earlier action is used to restore the prior list of variables. This removes the temporary let variable from the list so that it won't appear to exist while the rest of the program is parsed.

Taking action before a rule is completely recognized often leads to conflicts since the parser must commit to a parse in order to execute the action. For example, the following two rules, without mid-rule actions, can coexist in a working parser because the parser can shift the open-brace token and look at what follows before deciding whether there is a declaration or not:

compound: '{' declarations statements '}'
    | '{' statements '}'
    ;

But when we add a mid-rule action as follows, the rules become nonfunctional:

compound: { prepare_for_local_variables }
      '{' declarations statements '}'
    | '{' statements '}'
    ;

Now the parser is forced to decide whether to run the mid-rule action when it has read no further than the open-brace. In other words, it must commit to using one rule or the other, without sufficient information to do it correctly. (The open-brace token is what is called the look-ahead token at this time, since the parser is still deciding what to do about it.) You might think that you could correct the problem by putting identical actions into the two rules, like this:

compound: { prepare_for_local_variables }
       '{' declarations statements '}'
    |     { prepare_for_local_variables }
       '{' statements '}'
    ;

But this does not help, because geyacc does not realize that the two actions are identical. (Geyacc never tries to understand the Eiffel code in an action.) If the grammar is such that a declaration can be distinguished from a statement by the first token (which is true in C), then one solution which does work is to put the action after the open-brace, like this:

compound: '{' { prepare_for_local_variables }
      declarations statements '}'
    | '{' statements '}'
    ;

Now the first token of the following declaration or statement, which would in any case tell geyacc which rule to use, can still do so. Another solution is to bury the action inside a nonterminal symbol which serves as a subroutine:

subroutine: -- Empty
        { prepare_for_local_variables }
    ;

compound: subroutine '{' declarations statements '}'
    | subroutine '{' statements '}'
    ;

Now geyacc can execute the action in the rule for subroutine without deciding which rule for compound it will eventually use. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what geyacc actually does to implement mid-rule actions.