Introduction to Geyacc

Introduction to Geyacc

This chapter introduces many of the basic concepts without which the details of geyacc will not make sense. If you do not already know how to use yacc or GNU bison, it is suggested to start by reading this chapter carefully.

Languages and Context-Free Grammars

In order for geyacc to parse a language, it must be described by a context-free grammar. This means that you specify one or more syntactic groupings and give rules for constructing them from their parts. For example, in the Eiffel language, one kind of grouping is called an expression. One rule for making an expression might be, "An expression can be made up of a minus sign and another expression". Another would be, "An expression can be an integer". As you can see, rules are often recursive, but there must be at least one rule which leads out of the recursion.

The most common formal system for presenting such rules for humans to read is Backus-Naur Form or BNF, which was developed in order to specify the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to geyacc is essentially machine-readable BNF.

Not all context-free languages can be handled by geyacc, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. Strictly speaking, that is a description of an LR(1) grammar, and LALR(1) involves additional restrictions that are hard to explain simply; but it is rare in actual practice to find an LR(1) grammar that fails to be LALR(1).

In the formal grammatical rules for a language, each kind of syntactic unit or grouping is named by a symbol. Those which are built by grouping smaller constructs according to grammatical rules are called nonterminal symbols; those which can't be subdivided are called terminal symbols or token types. A piece of input corresponding to a single terminal symbol is called a token, and a piece corresponding to a single nonterminal symbol a grouping.

Let's use the Eiffel language as an example of what symbols, terminal and nonterminal, mean. The tokens of Eiffel are identifiers, manifest constants (numeric, character and string), and the various keywords, arithmetic operators and punctuation marks. So the terminal symbols of a grammar for Eiffel include `identifier', `number', `character', `string', plus one symbol for each keyword, operator or punctuation mark: `if', `then', `loop', `end', `class', `inherit', `plus-sign', `open-brace', `close-brace', `comma' and many more. (These tokens can be subdivided into characters, but that is a matter of lexicography, not grammar.)

Here is a simple Eiffel routine subdivided into tokens:

square (x: INTEGER)      -- identifier, open-paren,
                         -- identifier, colon, identifier,
                         -- close-paren
    do                   -- keyword `do'
        foo := x * x     -- identifier, assign-sign,
                         -- identifier, star-sign, identifier
    end                  -- keyword `end'

The syntactic groupings of Eiffel include the expression, the instruction, the assignment, the loop, etc. These are represented in the grammar of Eiffel by nonterminal symbols expression, instruction, assignment and loop. The full grammar uses dozens of additional language constructs, each with its own nonterminal symbol, in order to express the meanings of these four. The example above is a routine definition; it contains one instruction. In the instruction, each x is an expression and so is x *x.

Each nonterminal symbol must have grammatical rules showing how it is made out of simpler constructs. For example, one kind of Eiffel instruction is the assignment instruction; this would be described with a grammar rule which reads informally as follows:

An instruction can be made up of an identifier, an assign-sign (:=) and an expression.

There would be many other rules for instruction, one for each kind of instruction in Eiffel.

One nonterminal symbol must be distinguished as the special one which defines a complete utterance in the language. It is called the start symbol. In a compiler, this means a complete input program. In the Eiffel language, the nonterminal symbol class_declaration plays this role. For example, 1 +2 is a valid Eiffel expression - a valid part of an Eiffel class - but it is not valid as an entire Eiffel class. In the context-free grammar of Eiffel, this follows from the fact that expression is not the start symbol.

The geyacc parser reads a sequence of tokens as its input, and groups the tokens using the grammar rules. If the input is valid, the end result is that the entire token sequence reduces to a single grouping whose symbol is the grammar's start symbol. If we use a grammar for Eiffel, the entire input must be a class_declaration. If not, the parser reports a syntax error.

From Formal Rules to Geyacc Input

A formal grammar is a mathematical construct. To define the language for geyacc, you must write a file expressing the grammar in geyacc syntax: a "geyacc grammar" file.

A nonterminal symbol in the formal grammar is represented in geyacc input as an identifier, like an identifier in Eiffel. By convention, it should be in lower case, such as expression, instruction or declaration.

The geyacc representation for a terminal symbol is also called a token type. Token types as well can be represented as Eiffel-like identifiers. By convention, these identifiers should be upper case to distinguish them from nonterminals: for example, NUMBER, IDENTIFIER or ASSIGN. The terminal symbol error is reserved for error recovery.

A terminal symbol can also be represented as a character literal, just like a C character constant, between single quotes. You should do this whenever a token is just a single character (parenthesis, plus-sign, etc.): use that same character in a literal as the terminal symbol for that token. Note that Unicode characters are supported. Just make sure that the input file is using the UTF-8 encoding and starts with the BOM character.

The grammar rules also have an expression in geyacc syntax. For example, here is the geyacc rule for an Eiffel assignment instruction. The semicolon and the colon are geyacc punctuation used in every rule.

instruction: identifier ASSIGN expression
    ;

Semantic Values

A formal grammar selects tokens only by their classifications: for example, if a rule mentions the terminal symbol `integer constant', it means that any integer constant is grammatically valid in that position. The precise value of the constant is irrelevant to how to parse the input: if x+4 is grammatical then x+1 or x+3989 is equally grammatical.

But the precise value is very important for what the input means once it is parsed. A compiler is useless if it fails to distinguish between 4, 1 and 3989 as constants in the program! Therefore, each token in a geyacc grammar has both a token type and a semantic value.

The token type is a terminal symbol defined in the grammar, such as NUMBER, IDENTIFIER or ','. It tells everything you need to know to decide where the token may validly appear and how to group it with other tokens. The grammar rules know nothing about tokens except their types.

The semantic value has all the rest of the information about the meaning of the token, such as the value of an integer, or the name of an identifier. (A token such as ',' which is just punctuation doesn't need to have any semantic value.)

For example, an input token might be classified as token type NUMBER and have the semantic value 4. Another input token might have the same token type NUMBER but value 3989. When a grammar rule says that NUMBER is allowed, either of these tokens is acceptable because each is a NUMBER. When the parser accepts the token, it keeps track of the token's semantic value.

Each grouping can also have a semantic value as well as its nonterminal symbol. For example, in a calculator, an expression typically has a semantic value that is a number. In a compiler for a programming language, an expression typically has a semantic value that is a tree structure describing the meaning of the expression.

Semantic Actions

In order to be useful, a program must do more than parse input; it must also produce some output based on the input. In a geyacc grammar, a grammar rule can have an action made up of Eiffel instructions. Each time the parser recognizes a match for that rule, the action is executed.

Most of the time, the purpose of an action is to compute the semantic value of the whole construct from the semantic values of its parts. For example, suppose we have a rule which says an expression can be the sum of two expressions. When the parser recognizes such a sum, each of the subexpressions has a semantic value which describes how it was built up. The action for this rule should create a similar sort of value for the newly recognized larger expression. For example, here is a rule that says an expression can be the sum of two subexpressions:

expr: expr '+' expr
        { $$ := $1 + $3 }
    ;

The action says how to produce the semantic value of the sum expression from the values of the two subexpressions.

Generated Parser

When you run geyacc, you give it a geyacc grammar file as input. The output is an Eiffel class equipped with features to parse the language described by the grammar. This class is called a "geyacc parser". Keep in mind that the geyacc utility and the geyacc parser are two distinct programs: the geyacc utility is a program whose output is the geyacc parser that becomes part of your program.

The job of the geyacc parser is to group tokens into groupings according to the grammar rules - for example, to build identifiers and operators into expressions. As it does this, it runs the actions for the grammar rules it uses.

The tokens come from a routine called the lexical analyzer that you must supply in some fashion (such as by writing it using gelex). The geyacc parser calls the lexical analyzer each time it wants a new token. It doesn't know what is "inside" the tokens (though their semantic values may reflect this). Typically the lexical analyzer makes the tokens by parsing characters of text, but geyacc does not depend on this.