Some Simple Examples
First some simple examples to get the flavor of how one uses geyacc: a reverse polish notation calculator, an algebraic (infix) notation calculator and a calculator with memory. Each produces a usable, though limited, interactive desk-top calculator. These examples are simple, but geyacc grammars for real programming languages are written the same way.
The first example is that of a simple double-precision reverse polish notation calculator (a calculator using postfix operators). This example provides a good starting point, since operator precedence is not an issue. The second example will illustrate how operator precedence is handled.
The source code for this calculator can be found in $GOBO/library/parse/example/rpcalc. The geyacc grammar file is named rpcalc.y. The `.y' extension is a convention used for geyacc input files.
Here are the Eiffel and geyacc declarations for the reverse polish notation calculator. Comments follow Eiffel style conventions and have no effect on the description's semantics.
%{
note

	description: "Reverse polish notation calculator"

class RPCALC

inherit

	YY_PARSER_SKELETON

create

	make

%}

%token <DOUBLE> NUM
%type <DOUBLE> exp

%%
-- Grammar rules and actions follow.
The Eiffel declarations subsection specifies the note, class header, inherit and creation clauses of the generated parser class. This class is called RPCALC and inherits its parsing engine from YY_PARSER_SKELETON.
The second subsection, geyacc declarations, provides information to geyacc about the terminal and nonterminal symbols and their types. Each terminal symbol (token) that is not a single-character literal must be declared here. (Single-character literals normally don't need to be declared.) In this example, all the arithmetic operators are designated by single-character literals, so the only terminal symbol that needs to be declared with %token is NUM, the token type for numeric constants. The geyacc construct %type is used for declaring nonterminal symbols, just as %token is used for declaring tokens. We do not have to use %type because nonterminal symbols are normally declared implicitly by the rules that define them. But exp must be declared explicitly so we can specify its value type. The declarations of NUM and exp are augmented with information about their Eiffel type (placed between angle brackets).
Here are the grammar rules for the reverse polish notation calculator:
input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;

exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| exp exp '-'
		{ $$ := $1 - $2 }
	| exp exp '*'
		{ $$ := $1 * $2 }
	| exp exp '/'
		{ $$ := $1 / $2 }
		-- Unary minus
	| exp 'n'
		{ $$ := -$1 }
	;

%%
The groupings of the rpcalc "language" defined here are the expression (given the name exp), the line of input (line), and the complete input transcript (input). Each of these nonterminal symbols has several alternate rules, joined by the | punctuator which is read as "or". The following sections explain what these rules mean.
The semantics of the language is determined by the actions taken when a grouping is recognized. The actions are the Eiffel code that appears inside braces. You must specify these actions in Eiffel, but geyacc provides the means for passing semantic values between the rules. In each action, the pseudo-variable $$ stands for the semantic value for the grouping that the rule is going to construct. Assigning a value to $$ is the main job of most actions. The semantic values of the components of the rule are referred to as $1, $2, and so on.
Consider the definition of input:
input: -- Empty
	| input line
	;
This definition reads as follows: "A complete input is either an empty string, or a complete input followed by an input line". Notice that "complete input" is defined in terms of itself. This definition is said to be left recursive since input always appears as the leftmost symbol in the sequence.
The first alternative is empty because there are no symbols between the colon and the first |; this means that input can match an empty string of input (no tokens). We write the rules this way because it is legitimate to type `Ctrl-d' right after you start the calculator. It's conventional to put an empty alternative first and write the comment -- Empty in it.
The second alternate rule (input line) handles all nontrivial input. It means, "After reading any number of lines, read one more line if possible". The left recursion makes this rule into a loop. Since the first alternative matches empty input, the loop can be executed zero or more times.
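Left recursion is the preferred way of writing such sequences, because an LALR parser can reduce a left recursive rule as soon as each new line has been read, so the parse stack stays at a constant depth no matter how long the input is. A right recursive variant (sketched below for illustration only; it is not part of the actual grammar) would also be accepted by geyacc, but it would force the parser to stack every line of input before any reduction could take place:

	-- Right recursive variant (illustration only):
	input: -- Empty
		| line input
		;

For a calculator reading an unbounded interactive session, the left recursive form is therefore the right choice.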
The parser routine parse continues to process input until a grammatical error is seen or the lexical analyzer says there are no more input tokens; we will arrange for the latter to happen at end of file.
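The generated parser is an ordinary Eiffel class, so running the calculator only requires a root class that creates it and calls parse. The following is a minimal sketch, assuming a hypothetical root class APPLICATION and assuming that syntax_error is exported to clients; make comes from the creation clause shown above, and parse and syntax_error from YY_PARSER_SKELETON:

	class APPLICATION

	create

		execute

	feature

		execute
				-- Create the calculator and parse the standard input.
			local
				parser: RPCALC
			do
				create parser.make
				parser.parse
				if parser.syntax_error then
					io.error.put_string ("Parsing stopped on a syntax error.%N")
				end
			end

	end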
Now consider the definition of line:
line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;
The first alternative is a token which is a newline character; this means that rpcalc accepts a blank line (and ignores it, since there is no action). The second alternative is an expression followed by a newline. This is the alternative that makes rpcalc useful. The semantic value of the exp grouping is the value of $1 because the exp in question is the first symbol in the alternative. The action prints this value, which is the result of the computation the user asked for.
This action is unusual because it does not assign a value to $$. As a consequence, the semantic value associated with the line is initialized to its default value. Since line has not been given a type in the %type section, its default type is detachable ANY, and the semantic value for line is therefore Void.
The exp grouping has several rules, one for each kind of expression. The first rule handles the simplest expressions: those that are just numbers. The second handles an addition-expression, which looks like two expressions followed by a plus-sign. The third handles subtraction, and so on.
exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| exp exp '-'
		{ $$ := $1 - $2 }
	...
	;
We have used | to join all the rules for exp, but we could equally well have written them separately:
exp: NUM
		{ $$ := $1 }
	;

exp: exp exp '+'
		{ $$ := $1 + $2 }
	;

exp: exp exp '-'
		{ $$ := $1 - $2 }
	;

...
Most of the rules have actions that compute the value of the expression in terms of the value of its parts. For example, in the rule for addition, $1 refers to the first component exp and $2 refers to the second one. The third component, '+', has no meaningful associated semantic value, but if it had one you could refer to it as $3. When parse recognizes a sum expression using this rule, the sum of the two subexpressions' values is produced as the value of the entire expression.
The formatting shown here is the recommended convention, but geyacc does not require it. You can add or change whitespace as much as you wish. For example, this:
exp : NUM { $$ := $1 } | exp exp '+' {$$ := $1 + $2 } | ...
means the same thing as this:
exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| ...
The latter, however, is much more readable.
The lexical analyzer's job is low-level parsing: converting characters or sequences of characters into tokens. The geyacc parser gets its tokens by calling the lexical analyzer.
Only a simple lexical analyzer is needed for the RPN calculator. This lexical analyzer skips blanks and tabs, then reads in numbers as DOUBLE and returns them as NUM tokens. Any other character that isn't part of a number is a separate token. Note that the token-code for such a single-character token is the character itself.
The lexical analyzer routine read_token stores into last_token a numeric code which represents a token type. The same text used in geyacc rules to stand for this token type is also an Eiffel expression for the numeric code for the type. This works in two ways. If the token type is a character literal, then its numeric code is the code for that character; the same character literal can be used in the lexical analyzer to express the number. If the token type is an identifier, that identifier is defined by geyacc as an integer constant feature with the appropriate number. In this example, therefore, NUM becomes an integer constant for read_token to use.
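Conceptually, the generated RPCALC class will therefore contain something along the following lines (a sketch only; the numeric value 258 is hypothetical, the actual code is chosen by geyacc):

	feature -- Token codes

		NUM: INTEGER = 258
				-- Token code for numeric constants
				-- (hypothetical value; geyacc picks the actual one)

This is what allows read_token to write last_token := NUM and the parser to match that code against the NUM token declared in the grammar.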
The semantic value of the token (if it has one) is made available in last_<type>_value, where <type> is the type of the token. This is where the geyacc parser will look for it. Since the calculator handles DOUBLE values and expressions, last_double_value will be used.
A token type code of zero is returned if the end-of-file is encountered. (Geyacc recognizes any negative value as indicating that an error occurred in the lexical analysis.)
Here is the code for the lexical analyzer routine:
read_token
		-- Lexical analyzer; returns a double floating point
		-- number on the stack and the token NUM, or the
		-- character read if not a number. Skips all blanks
		-- and tabs, returns 0 for EOF.
	local
		c: CHARACTER
		buffer: STRING
	do
			-- Skip white space.
		from
			if has_pending_character then
				c := pending_character
				has_pending_character := False
			elseif not io.end_of_file then
				io.read_character
				c := io.last_character
			end
		until
			io.end_of_file or else (c /= ' ' and c /= '%T')
		loop
			io.read_character
			c := io.last_character
		end
		if io.end_of_file then
				-- Return end-of-file.
			last_token := 0
		elseif c >= '0' and c <= '9' then
				-- Process numbers.
			last_token := NUM
			from
				create buffer.make (10)
				buffer.append_character (c)
				io.read_character
				c := io.last_character
			until
				io.end_of_file or else (c < '0' or c > '9')
			loop
				buffer.append_character (c)
				io.read_character
				c := io.last_character
			end
			if not io.end_of_file and then c = '.' then
				from
					buffer.append_character ('.')
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else (c < '0' or c > '9')
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
			end
			if not io.end_of_file then
				pending_character := c
				has_pending_character := True
			end
			last_double_value := buffer.to_double
		else
				-- Return single character.
			last_token := c.code
		end
	end
When parse detects a syntax error, it calls the error reporting routine report_error to print an error message (usually but not always "parse error"). It is up to the programmer to redefine report_error when needed.
After report_error returns, the geyacc parser may recover from the error and continue parsing if the grammar contains a suitable error rule. Otherwise, parse terminates with syntax_error set to true. There are no error rules in this example, so any invalid input will cause the calculator program to exit. This is not clean behavior for a real calculator, but it is adequate for a first example.
We now modify rpcalc to handle infix operators instead of postfix. Infix notation involves the concept of operator precedence and the need for parentheses nested to arbitrary depth. Here is the geyacc code for calc.y, an infix desk-top calculator (the source code for this calculator can be found in $GOBO/library/parse/example/calc).
%{
note

	description: "Infix notation calculator"

class CALC

inherit

	YY_PARSER_SKELETON

create

	make

%}

-- geyacc declarations.
%token <DOUBLE> NUM
%left '-' '+'
%left '*' '/'
%left NEG		-- negation--unary minus
%right '^'		-- exponentiation
%type <DOUBLE> exp

-- Grammar follows.
%%

input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;

exp: NUM
		{ $$ := $1 }
	| exp '+' exp
		{ $$ := $1 + $3 }
	| exp '-' exp
		{ $$ := $1 - $3 }
	| exp '*' exp
		{ $$ := $1 * $3 }
	| exp '/' exp
		{ $$ := $1 / $3 }
	| '-' exp %prec NEG
		{ $$ := -$2 }
	| '(' exp ')'
		{ $$ := $2 }
	;

%%

feature

	read_token
		do
			...
		end

	...

end
The routine read_token is the same as before. There are two important new features shown in this code.
In the first section (geyacc declarations), %left declares token types and says they are left-associative operators. The declarations %left and %right (right associativity) take the place of %token which is used to declare a token type name without associativity. (These tokens are single-character literals, which ordinarily don't need to be declared. We declare them here to specify the associativity.)
Operator precedence is determined by the line ordering of the declarations; the higher the line number of the declaration (lower on the page or screen), the higher the precedence. Hence, exponentiation ('^') has the highest precedence; unary minus (NEG) is next, followed by '*' and '/'; '+' and '-' have the lowest precedence.
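These declarations determine how ambiguous inputs are grouped. Since '-' is left-associative, 2 - 3 - 4 is evaluated as (2 - 3) - 4, and since '*' has higher precedence than '+', 2 + 3 * 4 is evaluated as 2 + (3 * 4). A session with calc might therefore look as follows (a sketch, assuming the lexical analyzer described above):

	% calc
	2 - 3 - 4
	-5
	2 + 3 * 4
	14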
The other important new feature is the %prec in the grammar section for the unary minus operator. The %prec simply instructs geyacc that the rule '-' exp has the same precedence as NEG - in this case the next-to-highest.
Up to this point, this manual has not addressed the issue of error recovery - how to continue parsing after the parser detects a syntax error. All we have handled is error reporting with feature report_error. Recall that by default parse terminates after calling report_error. This means that an erroneous input line causes the calculator program to exit. Now we show how to rectify this deficiency.
The geyacc language itself includes the reserved word error, which may be included in the grammar rules. In the example below it has been added to one of the alternatives for line:
line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	| error '\n'
		{ recover }
	;
This addition to the grammar allows for simple error recovery in the event of a parse error. If an expression that cannot be evaluated is read, the error will be recognized by the third rule for line, and parsing will continue. (The report_error routine is still called upon to print its message as well.) The action executes the routine recover, a feature inherited from YY_PARSER_SKELETON; its meaning is that error recovery is complete.
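With this error rule in place, a session might look like the following sketch ("parse error" being the default message printed by report_error): the invalid line triggers the error rule, recover is called, and the calculator goes on reading the next line instead of exiting.

	% calc
	2 + * 3
	parse error
	(2 + 3) * 4
	20
	%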
This form of error recovery deals with syntax errors. There are other kinds of errors; for example, division by zero, which raises an exception that is normally fatal. A real calculator program must handle this kind of exception in a rescue clause and resume parsing input lines; it would also have to discard the rest of the current line of input.
Now that the basics of geyacc have been discussed, it is time to move on to a more advanced problem. The above calculators provided only five functions: +, -, *, / and unary minus. It would be nice to add memory to the calculator, by allowing you to create named variables, store values in them, and use them later. Here is a sample session with the calculator:
% mcalc
pi := 3.141592
3.141592
2 * pi
6.283184
gobo := 2 + 3
5
gobo := gobo * 4
20
gobo
20
%
The source code for this calculator can be found in $GOBO/library/parse/example/mcalc. The geyacc grammar file is named mcalc.y.
Here are the Eiffel and geyacc declarations for the calculator with memory.
%{
note

	description: "Calculator with memory"

class MCALC

inherit

	YY_PARSER_SKELETON
		rename
			make as make_parser_skeleton
		end

create

	make

%}

-- geyacc declarations.
%token <DOUBLE> NUM		-- Double precision number
%token <STRING> VAR		-- Memory name
%type <DOUBLE> exp
%right ASSIGNMENT		-- Assignment sign `:='
%left '-' '+'
%left '*' '/'
%left NEG				-- negation--unary minus
%right '^'				-- exponentiation

-- Grammar follows.
%%
The above grammar allows semantic values to have various types. The allowable types are now DOUBLE (for exp and NUM) and STRING (for memory names VAR).
Here are the grammar rules for the calculator. Most of them are copied directly from calc; two rules, those which mention VAR, are new.
input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	| error '\n'
		{ recover }
	;

exp: NUM
		{ $$ := $1 }
	| VAR
		{ $$ := memory_value ($1) }
	| VAR ASSIGNMENT exp
		{ $$ := $3; set_memory_value ($$, $1) }
	| exp '+' exp
		{ $$ := $1 + $3 }
	| exp '-' exp
		{ $$ := $1 - $3 }
	| exp '*' exp
		{ $$ := $1 * $3 }
	| exp '/' exp
		{ $$ := $1 / $3 }
	| '-' exp %prec NEG
		{ $$ := -$2 }
	| '(' exp ')'
		{ $$ := $2 }
	;

%%
The lexical analyzer must now recognize memory variables, numeric values and arithmetic operators. Memory variables are strings of alphanumeric characters with a leading nondigit character.
read_token
		-- Lexical analyzer; returns a double floating point
		-- number on the stack and the token NUM, a STRING
		-- and the token VAR, a token ASSIGNMENT, or the
		-- character read if not a number. Skips all blanks
		-- and tabs, returns 0 for EOF.
	local
		c: CHARACTER
		buffer: STRING
	do
			-- Skip white space.
		from
			if has_pending_character then
				c := pending_character
				has_pending_character := False
			elseif not io.end_of_file then
				io.read_character
				c := io.last_character
			end
		until
			io.end_of_file or else (c /= ' ' and c /= '%T')
		loop
			io.read_character
			c := io.last_character
		end
		if io.end_of_file then
				-- Return end-of-file.
			last_token := 0
		else
			inspect c
			when '0'..'9' then
					-- Process numbers.
				last_token := NUM
				from
					create buffer.make (10)
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else (c < '0' or c > '9')
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
				if not io.end_of_file and then c = '.' then
					from
						buffer.append_character ('.')
						io.read_character
						c := io.last_character
					until
						io.end_of_file or else (c < '0' or c > '9')
					loop
						buffer.append_character (c)
						io.read_character
						c := io.last_character
					end
				end
				if not io.end_of_file then
					pending_character := c
					has_pending_character := True
				end
				last_double_value := buffer.to_double
			when 'a'..'z', 'A'..'Z' then
					-- Process variables.
				last_token := VAR
				from
					create buffer.make (10)
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else
					not (('a' <= c and c <= 'z') or
						('A' <= c and c <= 'Z') or
						('0' <= c and c <= '9'))
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
				if not io.end_of_file then
					pending_character := c
					has_pending_character := True
				end
				last_string_value := buffer
			when ':' then
				io.read_character
				c := io.last_character
				if not io.end_of_file then
					if c = '=' then
							-- Found ":=".
						last_token := ASSIGNMENT
					else
							-- Return single character.
						last_token := (':').code
						pending_character := c
						has_pending_character := True
					end
				else
						-- Return single character.
					last_token := (':').code
				end
			else
					-- Return single character.
				last_token := c.code
			end
		end
	end
The remaining source code, shown below, takes care of the memory management of mcalc.
feature {NONE} -- Initialization

	make
			-- Create a new calculator with memory.
		do
			make_parser_skeleton
			last_string_value := ""
			create memory_values.make (10)
		end

feature -- Memory management

	memory_value (a_name: STRING): DOUBLE
			-- Value associated with memory a_name;
			-- 0.0 if no value has been stored in a_name yet
		require
			a_name_not_void: a_name /= Void
		do
			if memory_values.has (a_name) then
				Result := memory_values.item (a_name)
			else
				Result := 0.0
			end
		end

	set_memory_value (a_value: DOUBLE; a_name: STRING)
			-- Store a_value into a_name.
		require
			a_name_not_void: a_name /= Void
		do
			memory_values.force (a_value, a_name)
		ensure
			memory_value_set: memory_value (a_name) = a_value
		end

feature {NONE} -- Implementation

	memory_values: DS_HASH_TABLE [DOUBLE, STRING]
			-- Values already stored so far

invariant

	memory_values_not_void: memory_values /= Void

end
Copyright © 1999-2019, Eric Bezault (mailto:ericb@gobosoft.com, http://www.gobosoft.com). Last Updated: 21 September 2019