Some Simple Examples
First some simple examples to get the flavor of how one uses geyacc: a reverse polish notation calculator, an algebraic (infix) notation calculator and a calculator with memory. Each produces a usable, though limited, interactive desk-top calculator. These examples are simple, but geyacc grammars for real programming languages are written the same way.
The first example is that of a simple double-precision reverse polish notation calculator (a calculator using postfix operators). This example provides a good starting point, since operator precedence is not an issue. The second example will illustrate how operator precedence is handled.
The source code for this calculator can be found in $GOBO/library/parse/example/rpcalc. The geyacc grammar file is named rpcalc.y. The `.y' extension is a convention used for geyacc input files.
Here are the Eiffel and geyacc declarations for the reverse polish notation calculator. Comments follow Eiffel style conventions and have no effect on the description's semantics.
%{
note

	description: "Reverse polish notation calculator"

class RPCALC

inherit

	YY_PARSER_SKELETON

create

	make

%}

%token <DOUBLE> NUM
%type <DOUBLE> exp

%%
-- Grammar rules and actions follow.
The Eiffel declarations subsection specifies the note, class header, inherit and creation clauses of the generated parser class. This class is called RPCALC and inherits its parsing engine from YY_PARSER_SKELETON.
The second subsection, geyacc declarations, provides information to geyacc about the terminal and nonterminal symbols and their types. Each terminal symbol (token) that is not a single-character literal must be declared here. (Single-character literals normally don't need to be declared.) In this example, all the arithmetic operators are designated by single-character literals, so the only terminal symbol that needs to be declared with %token is NUM, the token type for numeric constants. The geyacc construct %type is used for declaring nonterminal symbols, just as %token is used for declaring tokens. We do not have to use %type because nonterminal symbols are normally declared implicitly by the rules that define them. But exp must be declared explicitly so we can specify its value type. The declarations of NUM and exp are augmented with information about their Eiffel type (placed between angle brackets).
Here are the grammar rules for the reverse polish notation calculator:
input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;

exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| exp exp '-'
		{ $$ := $1 - $2 }
	| exp exp '*'
		{ $$ := $1 * $2 }
	| exp exp '/'
		{ $$ := $1 / $2 }
		-- Unary minus
	| exp 'n'
		{ $$ := -$1 }
	;

%%
The groupings of the rpcalc "language" defined here are the expression (given the name exp), the line of input (line), and the complete input transcript (input). Each of these nonterminal symbols has several alternate rules, joined by the | punctuator which is read as "or". The following sections explain what these rules mean.
The semantics of the language is determined by the actions taken when a grouping is recognized. The actions are the Eiffel code that appears inside braces. You must specify these actions in Eiffel, but geyacc provides the means for passing semantic values between the rules. In each action, the pseudo-variable $$ stands for the semantic value for the grouping that the rule is going to construct. Assigning a value to $$ is the main job of most actions. The semantic values of the components of the rule are referred to as $1, $2, and so on.
Consider the definition of input:
input: -- Empty
	| input line
	;
This definition reads as follows: "A complete input is either an empty string, or a complete input followed by an input line". Notice that "complete input" is defined in terms of itself. This definition is said to be left recursive since input always appears as the leftmost symbol in the sequence.
The first alternative is empty because there are no symbols between the colon and the first |; this means that input can match an empty string of input (no tokens). We write the rules this way because it is legitimate to type `Ctrl-d' right after you start the calculator. It's conventional to put an empty alternative first and write the comment -- Empty in it.
The second alternate rule (input line) handles all nontrivial input. It means, "After reading any number of lines, read one more line if possible". The left recursion makes this rule into a loop. Since the first alternative matches empty input, the loop can be executed zero or more times.
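Left recursion is the preferred way of writing such sequences, because an LALR parser can reduce a left recursive rule as soon as each new line has been read, so the parse stack stays at a constant depth no matter how long the input is. A right recursive variant (sketched below for illustration only; it is not part of the actual grammar) would also be accepted by geyacc, but it would force the parser to stack every line of input before any reduction could take place:

	-- Right recursive variant (illustration only):
	input: -- Empty
		| line input
		;

For a calculator reading an unbounded interactive session, the left recursive form is therefore the right choice.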
The parser routine parse continues to process input until a grammatical error is seen or the lexical analyzer says there are no more input tokens; we will arrange for the latter to happen at end of file.
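The generated parser is an ordinary Eiffel class, so running the calculator only requires a root class that creates it and calls parse. The following is a minimal sketch, assuming a hypothetical root class APPLICATION and assuming that syntax_error is exported to clients; make comes from the creation clause shown above, and parse and syntax_error from YY_PARSER_SKELETON:

	class APPLICATION

	create

		execute

	feature

		execute
				-- Create the calculator and parse the standard input.
			local
				parser: RPCALC
			do
				create parser.make
				parser.parse
				if parser.syntax_error then
					io.error.put_string ("Parsing stopped on a syntax error.%N")
				end
			end

	end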
Now consider the definition of line:
line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;
The first alternative is a token which is a newline character; this means that rpcalc accepts a blank line (and ignores it, since there is no action). The second alternative is an expression followed by a newline. This is the alternative that makes rpcalc useful. The semantic value of the exp grouping is the value of $1 because the exp in question is the first symbol in the alternative. The action prints this value, which is the result of the computation the user asked for.
This action is unusual because it does not assign a value to $$. As a consequence, the semantic value associated with the line is initialized to its default value. Since line has not been given a type in the %type section, its default type is detachable ANY, and the semantic value for line is therefore Void.
The exp grouping has several rules, one for each kind of expression. The first rule handles the simplest expressions: those that are just numbers. The second handles an addition-expression, which looks like two expressions followed by a plus-sign. The third handles subtraction, and so on.
exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| exp exp '-'
		{ $$ := $1 - $2 }
	...
	;
We have used | to join all the rules for exp, but we could equally well have written them separately:
exp: NUM
		{ $$ := $1 }
	;

exp: exp exp '+'
		{ $$ := $1 + $2 }
	;

exp: exp exp '-'
		{ $$ := $1 - $2 }
	;

...
Most of the rules have actions that compute the value of the expression in terms of the value of its parts. For example, in the rule for addition, $1 refers to the first component exp and $2 refers to the second one. The third component, '+', has no meaningful associated semantic value, but if it had one you could refer to it as $3. When parse recognizes a sum expression using this rule, the sum of the two subexpressions' values is produced as the value of the entire expression.
The formatting shown here is the recommended convention, but geyacc does not require it. You can add or change whitespace as much as you wish. For example, this:
exp : NUM { $$ := $1 } | exp exp '+' {$$ := $1 + $2 } | ...
means the same thing as this:
exp: NUM
		{ $$ := $1 }
	| exp exp '+'
		{ $$ := $1 + $2 }
	| ...
The latter, however, is much more readable.
The lexical analyzer's job is low-level parsing: converting characters or sequences of characters into tokens. The geyacc parser gets its tokens by calling the lexical analyzer.
Only a simple lexical analyzer is needed for the RPN calculator. This lexical analyzer skips blanks and tabs, then reads in numbers as DOUBLE and returns them as NUM tokens. Any other character that isn't part of a number is a separate token. Note that the token-code for such a single-character token is the character itself.
The lexical analyzer routine read_token stores into last_token a numeric code which represents a token type. The same text used in geyacc rules to stand for this token type is also an Eiffel expression for the numeric code for the type. This works in two ways. If the token type is a character literal, then its numeric code is the code for that character; the same character literal can be used in the lexical analyzer to express the number. If the token type is an identifier, that identifier is defined by geyacc as an integer constant feature with the appropriate number. In this example, therefore, NUM becomes an integer constant for read_token to use.
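Conceptually, the generated RPCALC class will therefore contain something along the following lines (a sketch only; the numeric value 258 is hypothetical, the actual code is chosen by geyacc):

	feature -- Token codes

		NUM: INTEGER = 258
				-- Token code for numeric constants
				-- (hypothetical value; geyacc picks the actual one)

This is what allows read_token to write last_token := NUM and the parser to match that code against the NUM token declared in the grammar.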
The semantic value of the token (if it has one) is made available in last_<type>_value, where <type> is the type of the token. This is where the geyacc parser will look for it. Since the calculator handles DOUBLE values and expressions, last_double_value will be used.
A token type code of zero is returned if the end-of-file is encountered. (Geyacc recognizes any negative value as indicating that an error occurred in the lexical analysis.)
Here is the code for the lexical analyzer routine:
read_token
		-- Lexical analyzer; returns a double floating point
		-- number on the stack and the token NUM, or the
		-- character read if not a number. Skips all blanks
		-- and tabs, returns 0 for EOF.
	local
		c: CHARACTER
		buffer: STRING
	do
			-- Skip white space.
		from
			if has_pending_character then
				c := pending_character
				has_pending_character := False
			elseif not io.end_of_file then
				io.read_character
				c := io.last_character
			end
		until
			io.end_of_file or else (c /= ' ' and c /= '%T')
		loop
			io.read_character
			c := io.last_character
		end
		if io.end_of_file then
				-- Return end-of-file.
			last_token := 0
		elseif c >= '0' and c <= '9' then
				-- Process numbers.
			last_token := NUM
			from
				create buffer.make (10)
				buffer.append_character (c)
				io.read_character
				c := io.last_character
			until
				io.end_of_file or else (c < '0' or c > '9')
			loop
				buffer.append_character (c)
				io.read_character
				c := io.last_character
			end
			if not io.end_of_file and then c = '.' then
				from
					buffer.append_character ('.')
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else (c < '0' or c > '9')
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
			end
			if not io.end_of_file then
				pending_character := c
				has_pending_character := True
			end
			last_double_value := buffer.to_double
		else
				-- Return single character.
			last_token := c.code
		end
	end
When parse detects a syntax error, it calls the error reporting routine report_error to print an error message (usually but not always "parse error"). It is up to the programmer to redefine report_error when needed.
After report_error returns, the geyacc parser may recover from the error and continue parsing if the grammar contains a suitable error rule. Otherwise, parse terminates with syntax_error set to true. There are no error rules in this example, so any invalid input will cause the calculator program to exit. This is not clean behavior for a real calculator, but it is adequate for a first example.
We now modify rpcalc to handle infix operators instead of postfix. Infix notation involves the concept of operator precedence and the need for parentheses nested to arbitrary depth. Here is the geyacc code for calc.y, an infix desk-top calculator (the source code for this calculator can be found in $GOBO/library/parse/example/calc).
%{
note

	description: "Infix notation calculator"

class CALC

inherit

	YY_PARSER_SKELETON

create

	make

%}

-- geyacc declarations.
%token <DOUBLE> NUM
%left '-' '+'
%left '*' '/'
%left NEG		-- negation--unary minus
%right '^'		-- exponentiation
%type <DOUBLE> exp

-- Grammar follows.
%%

input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	;

exp: NUM
		{ $$ := $1 }
	| exp '+' exp
		{ $$ := $1 + $3 }
	| exp '-' exp
		{ $$ := $1 - $3 }
	| exp '*' exp
		{ $$ := $1 * $3 }
	| exp '/' exp
		{ $$ := $1 / $3 }
	| '-' exp %prec NEG
		{ $$ := -$2 }
	| '(' exp ')'
		{ $$ := $2 }
	;

%%

feature

	read_token
		do
			...
		end

	...

end
The routine read_token is the same as before. There are two important new features shown in this code.
In the first section (geyacc declarations), %left declares token types and says they are left-associative operators. The declarations %left and %right (right associativity) take the place of %token which is used to declare a token type name without associativity. (These tokens are single-character literals, which ordinarily don't need to be declared. We declare them here to specify the associativity.)
Operator precedence is determined by the line ordering of the declarations; the higher the line number of the declaration (lower on the page or screen), the higher the precedence. Hence, exponentiation ('^') has the highest precedence; unary minus (NEG) is next, followed by '*' and '/'; '+' and '-' have the lowest precedence.
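These declarations determine how ambiguous inputs are grouped. Since '-' is left-associative, 2 - 3 - 4 is evaluated as (2 - 3) - 4, and since '*' has higher precedence than '+', 2 + 3 * 4 is evaluated as 2 + (3 * 4). A session with calc might therefore look as follows (a sketch, assuming the lexical analyzer described above):

	% calc
	2 - 3 - 4
	-5
	2 + 3 * 4
	14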
The other important new feature is the %prec in the grammar section for the unary minus operator. The %prec simply instructs geyacc that the rule '-' exp has the same precedence as NEG - in this case the next-to-highest.
Up to this point, this manual has not addressed the issue of error recovery - how to continue parsing after the parser detects a syntax error. All we have handled is error reporting with feature report_error. Recall that by default parse terminates after calling report_error. This means that an erroneous input line causes the calculator program to exit. Now we show how to rectify this deficiency.
The geyacc language itself includes the reserved word error, which may be included in the grammar rules. In the example below it has been added to one of the alternatives for line:
line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	| error '\n'
		{ recover }
	;
This addition to the grammar allows for simple error recovery in the event of a parse error. If an expression that cannot be evaluated is read, the error will be recognized by the third rule for line, and parsing will continue. (The report_error routine is still called upon to print its message as well.) The action executes the routine recover, a feature inherited from YY_PARSER_SKELETON; its meaning is that error recovery is complete.
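With this error rule in place, a session might look like the following sketch ("parse error" being the default message printed by report_error): the invalid line triggers the error rule, recover is called, and the calculator goes on reading the next line instead of exiting.

	% calc
	2 + * 3
	parse error
	(2 + 3) * 4
	20
	%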
This form of error recovery deals with syntax errors. There are other kinds of errors; for example, division by zero, which raises an exception that is normally fatal. A real calculator program must handle this kind of exception in a rescue clause and resume parsing input lines; it would also have to discard the rest of the current line of input.
Now that the basics of geyacc have been discussed, it is time to move on to a more advanced problem. The above calculators provided only five functions: +, -, *, / and unary minus. It would be nice to add memory to the calculator, by allowing you to create named variables, store values in them, and use them later. Here is a sample session with the calculator:
% mcalc
pi := 3.141592
3.141592
2 * pi
6.283184
gobo := 2 + 3
5
gobo := gobo * 4
20
gobo
20
%
The source code for this calculator can be found in $GOBO/library/parse/example/mcalc. The geyacc grammar file is named mcalc.y.
Here are the Eiffel and geyacc declarations for the calculator with memory.
%{
note

	description: "Calculator with memory"

class MCALC

inherit

	YY_PARSER_SKELETON
		rename
			make as make_parser_skeleton
		end

create

	make

%}

-- geyacc declarations.
%token <DOUBLE> NUM		-- Double precision number
%token <STRING> VAR		-- Memory name
%type <DOUBLE> exp
%right ASSIGNMENT		-- Assignment sign `:='
%left '-' '+'
%left '*' '/'
%left NEG				-- negation--unary minus
%right '^'				-- exponentiation

-- Grammar follows.
%%
The above grammar allows semantic values to have various types. The allowable types are now DOUBLE (for exp and NUM) and STRING (for memory names VAR).
Here are the grammar rules for the calculator. Most of them are copied directly from calc; two rules, those which mention VAR, are new.
input: -- Empty
	| input line
	;

line: '\n'
	| exp '\n'
		{ print ($1); print ("%N") }
	| error '\n'
		{ recover }
	;

exp: NUM
		{ $$ := $1 }
	| VAR
		{ $$ := memory_value ($1) }
	| VAR ASSIGNMENT exp
		{ $$ := $3; set_memory_value ($$, $1) }
	| exp '+' exp
		{ $$ := $1 + $3 }
	| exp '-' exp
		{ $$ := $1 - $3 }
	| exp '*' exp
		{ $$ := $1 * $3 }
	| exp '/' exp
		{ $$ := $1 / $3 }
	| '-' exp %prec NEG
		{ $$ := -$2 }
	| '(' exp ')'
		{ $$ := $2 }
	;

%%
The lexical analyzer must now recognize memory variables, numeric values and arithmetic operators. Memory variables are strings of alphanumeric characters with a leading nondigit character.
read_token
		-- Lexical analyzer; returns a double floating point
		-- number on the stack and the token NUM, a STRING
		-- and the token VAR, a token ASSIGNMENT, or the
		-- character read if not a number. Skips all blanks
		-- and tabs, returns 0 for EOF.
	local
		c: CHARACTER
		buffer: STRING
	do
			-- Skip white space.
		from
			if has_pending_character then
				c := pending_character
				has_pending_character := False
			elseif not io.end_of_file then
				io.read_character
				c := io.last_character
			end
		until
			io.end_of_file or else (c /= ' ' and c /= '%T')
		loop
			io.read_character
			c := io.last_character
		end
		if io.end_of_file then
				-- Return end-of-file.
			last_token := 0
		else
			inspect c
			when '0'..'9' then
					-- Process numbers.
				last_token := NUM
				from
					create buffer.make (10)
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else (c < '0' or c > '9')
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
				if not io.end_of_file and then c = '.' then
					from
						buffer.append_character ('.')
						io.read_character
						c := io.last_character
					until
						io.end_of_file or else (c < '0' or c > '9')
					loop
						buffer.append_character (c)
						io.read_character
						c := io.last_character
					end
				end
				if not io.end_of_file then
					pending_character := c
					has_pending_character := True
				end
				last_double_value := buffer.to_double
			when 'a'..'z', 'A'..'Z' then
					-- Process variables.
				last_token := VAR
				from
					create buffer.make (10)
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				until
					io.end_of_file or else
					not (('a' <= c and c <= 'z') or
						('A' <= c and c <= 'Z') or
						('0' <= c and c <= '9'))
				loop
					buffer.append_character (c)
					io.read_character
					c := io.last_character
				end
				if not io.end_of_file then
					pending_character := c
					has_pending_character := True
				end
				last_string_value := buffer
			when ':' then
				io.read_character
				c := io.last_character
				if not io.end_of_file then
					if c = '=' then
							-- Found ":=".
						last_token := ASSIGNMENT
					else
							-- Return single character.
						last_token := (':').code
						pending_character := c
						has_pending_character := True
					end
				else
						-- Return single character.
					last_token := (':').code
				end
			else
					-- Return single character.
				last_token := c.code
			end
		end
	end
The remaining source code, shown below, takes care of the memory management of mcalc.
feature {NONE} -- Initialization

	make
			-- Create a new calculator with memory.
		do
			make_parser_skeleton
			last_string_value := ""
			create memory_values.make (10)
		end

feature -- Memory management

	memory_value (a_name: STRING): DOUBLE
			-- Value associated with memory a_name;
			-- 0.0 if no value has been stored in a_name yet
		require
			a_name_not_void: a_name /= Void
		do
			if memory_values.has (a_name) then
				Result := memory_values.item (a_name)
			else
				Result := 0.0
			end
		end

	set_memory_value (a_value: DOUBLE; a_name: STRING)
			-- Store a_value into a_name.
		require
			a_name_not_void: a_name /= Void
		do
			memory_values.force (a_value, a_name)
		ensure
			memory_value_set: memory_value (a_name) = a_value
		end

feature {NONE} -- Implementation

	memory_values: DS_HASH_TABLE [DOUBLE, STRING]
			-- Values already stored so far

invariant

	memory_values_not_void: memory_values /= Void

end
Copyright © 1999-2019, Eric Bezault (mailto:ericb@gobosoft.com, http://www.gobosoft.com). Last Updated: 21 September 2019