Gelex: Start Conditions

Gelex provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with <sc> will only be active when the scanner is in the start condition named sc. For example,

<STRING>[^"]*    {  -- Eat up the string body ...
                    ...
                 }

will be active only when the scanner is in the STRING start condition, and:

<INITIAL,STRING,QUOTE>\.   { -- handle an escape ...
                             ...
                           }

will be active only when the current start condition is either INITIAL, STRING, or QUOTE.

Start conditions are declared in the declarations section of the input using unindented lines beginning with either %s or %x, followed by a whitespace-separated list of names. The former declares inclusive start conditions, the latter exclusive start conditions. Start condition names are case-insensitive and consist of a letter followed by zero or more letters, digits, or underscores. For each start condition, gelex generates an Eiffel integer constant attribute which can be used to refer to it. Start condition names must therefore differ from the other feature names in the generated class.
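
As a minimal sketch of such declarations (the names expect, comment, and str are simply the ones used in the examples of this section, not names required by gelex):

```lex
%s expect
%x comment str
```

The first line declares one inclusive start condition and the second declares two exclusive ones. Rules can then be qualified with <expect>, <comment>, or <str>, and actions can pass the corresponding integer constants to set_start_condition.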

A start condition is activated using the feature set_start_condition. Until set_start_condition is called again, rules with the given start condition will be active and rules with other start conditions will be inactive. If the start condition is inclusive, then rules with no start conditions at all will also be active. If it is exclusive, then only rules qualified with the start condition will be active. The original state, in which only the rules with no start conditions are active, is referred to as the start condition INITIAL.

A set of rules contingent on the same exclusive start condition describes a scanner which is independent of any of the other rules in the gelex input. Because of this, exclusive start conditions make it easy to specify "mini-scanners" which scan portions of the input that are syntactically different from the rest (e.g. strings).

If the distinction between inclusive and exclusive start conditions is still a little vague, here's a simple example illustrating the connection between the two. The set of rules:

%s example
%%
<example>foo    do_something
bar             something_else

is equivalent to:

%x example
%%
<example>foo           do_something
<INITIAL,example>bar   something_else

Without the <INITIAL,example> qualifier, the bar pattern in the second example wouldn't be active (i.e. couldn't match) when in start condition example. If we qualified bar with just <example>, though, it would be active only in example and not in INITIAL. In the first example it is active in both, because there example is an inclusive (%s) start condition.

Also note that the special start condition specifier <*> matches every start condition. Thus, the above example could also have been written:

%x example
%%
<example>foo    do_something
<*>bar          something_else

The default rule (to echo any unmatched character) remains active in start conditions. It is equivalent to:

<*>.|\n         default_action
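
Because the default rule stays active, text that no other rule matches inside an exclusive start condition is echoed rather than silently dropped. One way to avoid that is to give the start condition its own catch-all rule whose action is only a comment. The following is a sketch, and the exclusive start condition skip is a hypothetical name introduced for illustration:

```lex
%x skip
%%
<skip>.|\n      -- Discard anything not matched by another <skip> rule
```

Such a catch-all matches a single character at a time, so longer matches from other <skip> rules still take precedence over it.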

To illustrate the uses of start conditions, here is a scanner which provides two different interpretations of a string like 123.456. By default it will treat it as three tokens: the integer 123, a dot, and the integer 456. But if the string expect-reals appears earlier on the same line, it will treat it as a single token, the real number 123.456:

%s expect
%%
expect-reals              set_start_condition (expect)
<expect>[0-9]+"."[0-9]+   {
                              io.put_string ("Found a real: ")
                              io.put_real (text.to_real)
                              io.put_new_line
                          }
<expect>\n                {
                              -- That's the end of the line, so
                              -- we need another "expect-reals"
                              -- before we'll recognize any more
                              -- reals.
                              set_start_condition (INITIAL)
                          }
[0-9]+                    {
                              io.put_string ("Found an integer: ")
                              io.put_integer (text.to_integer)
                              io.put_new_line
                          }
"."                       io.put_string ("Found a dot"); io.put_new_line

Here is a scanner which recognizes (and discards) C comments while maintaining a count of the current input line.

%x comment
%%
"/*"                    set_start_condition (comment)
<comment>[^*\n]*        -- Eat anything that's not a '*'
<comment>"*"+[^*/\n]*   -- Eat up '*'s not followed by '/'s
<comment>\n             line_nb := line_nb + 1
<comment>"*"+"/"        set_start_condition (INITIAL)
%%
    line_nb: INTEGER
            -- Current line number
    ...

This scanner goes to a bit of trouble to match as much text as possible with each rule. In general, when attempting to write a high-speed scanner, try to match as much as possible in each rule, as it's a big win.

Note that start condition entities in the Eiffel code are integer values and can be stored as such. Thus, the above could be extended in the following fashion:

%x comment foo
%%
"/*"      {
               comment_caller := INITIAL
               set_start_condition (comment)
          }
<foo>"/*" {
               comment_caller := foo
               set_start_condition (comment)
          }
<comment>[^*\n]*         -- Eat anything that's not a '*'
<comment>"*"+[^*/\n]*    -- Eat up '*'s not followed by '/'s
<comment>\n              line_nb := line_nb + 1
<comment>"*"+"/"         set_start_condition (comment_caller)
%%
    line_nb: INTEGER
            -- Current line number

    comment_caller: INTEGER
            -- Last start condition
    ...

Furthermore, you can access the current start condition using the integer function start_condition. For example, the above assignments to comment_caller could instead have been written:

comment_caller := start_condition
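
With start_condition, the two rules that open a comment in the example above can also be merged into one. The following is a sketch that reuses the names comment, foo, and comment_caller from that example:

```lex
<INITIAL,foo>"/*" {
               -- Remember where we came from, whichever of the
               -- two start conditions that was.
               comment_caller := start_condition
               set_start_condition (comment)
          }
```

The closing rule is unchanged: it still calls set_start_condition (comment_caller) to return to the saved start condition.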

In case of nested start conditions, you could also keep track of previous start conditions by pushing them on an integer stack using push_start_condition and then popping them from the stack as you leave the current start condition using pop_start_condition.
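
As a sketch of the stack-based variant, the comment scanner above could drop comment_caller altogether. This assumes that push_start_condition takes the start condition to enter, saves the current one on the stack, and that pop_start_condition restores the most recently saved one, as their flex counterparts do; check the scanner skeleton class before relying on it:

```lex
%x comment foo
%%
<INITIAL,foo>"/*"        push_start_condition (comment)
<comment>[^*\n]*         -- Eat anything that's not a '*'
<comment>"*"+[^*/\n]*    -- Eat up '*'s not followed by '/'s
<comment>\n              line_nb := line_nb + 1
<comment>"*"+"/"         pop_start_condition
%%
    line_nb: INTEGER
            -- Current line number
    ...
```

Under that assumption, the stack keeps working when one start condition is entered from several places, or from within another pushed start condition, without a dedicated attribute for each caller.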

Finally, here's an example of how to match Eiffel-style quoted strings using exclusive start conditions, including expanded escape sequences:

%x str
%%
\"            buffer.wipe_out; set_start_condition (str)
<str>[^%\n"]+ buffer.append_string (unicode_text)
<str>%A       buffer.append_character ('%A')
<str>%B       buffer.append_character ('%B')
<str>%C       buffer.append_character ('%C')
<str>%D       buffer.append_character ('%D')
<str>%F       buffer.append_character ('%F')
<str>%H       buffer.append_character ('%H')
<str>%L       buffer.append_character ('%L')
<str>%N       buffer.append_character ('%N')
<str>%Q       buffer.append_character ('%Q')
<str>%R       buffer.append_character ('%R')
<str>%S       buffer.append_character ('%S')
<str>%T       buffer.append_character ('%T')
<str>%U       buffer.append_character ('%U')
<str>%V       buffer.append_character ('%V')
<str>%%       buffer.append_character ('%%')
<str>%\'      buffer.append_character ('%'')
<str>%\"      buffer.append_character ('%"')
<str>%\(      buffer.append_character ('%(')
<str>%\)      buffer.append_character ('%)')
<str>%<       buffer.append_character ('%<')
<str>%>       buffer.append_character ('%>')
<str>%\/[0-9]+\/   {
        code := text_substring (3, text_count - 1).to_integer
        if code > Maximum_character_code then
            set_start_condition (INITIAL)
            last_token := E_STRERR
        else
            buffer.append_character (code.to_character)
        end
     }
<str>%\r?\n[ \t\r]*%    line_nb := line_nb + 1
<str>\"     {
                 set_start_condition (INITIAL)
                 -- Pass string value to parser.
                 last_value := buffer.twin
                 last_token := E_STRING
            }
<str>.|\n                |
<str>%\r?\n[ \t\r]*      |
<str>%\/([0-9]+(\/)?)?   |
<str><<EOF>>             {   -- Catch-all rules (no backing up)
                              set_start_condition (INITIAL)
                              last_token := E_STRERR
                          }

A complete version of this example can be found in $GOBO/library/lexical/example/eiffel_scanner.

Often, such as in some of the examples above, you wind up writing a whole bunch of rules all preceded by the same start condition(s). Gelex makes this a little easier and cleaner by introducing a notion of start condition scope. A start condition scope begins with:

<SCs>{

where SCs is a comma-separated list of one or more start conditions. Inside the start condition scope, every rule automatically has the prefix <SCs> applied to it, until a } which matches the initial {. So, for example,

<ESC>{
   %N    last_token := ('%N').code
   %T    last_token := ('%T').code
   %B    last_token := ('%B').code
   %U    last_token := ('%U').code
}

is equivalent to:

<ESC>%N    last_token := ('%N').code
<ESC>%T    last_token := ('%T').code
<ESC>%B    last_token := ('%B').code
<ESC>%U    last_token := ('%U').code

Start condition scopes may be nested.
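
As a sketch of how a scope tidies up an earlier example, the C comment scanner shown above could be written as:

```lex
%x comment
%%
"/*"               set_start_condition (comment)
<comment>{
   [^*\n]*         -- Eat anything that's not a '*'
   "*"+[^*/\n]*    -- Eat up '*'s not followed by '/'s
   \n              line_nb := line_nb + 1
   "*"+"/"         set_start_condition (INITIAL)
}
%%
    line_nb: INTEGER
            -- Current line number
    ...
```

The rule for "/*" stays outside the scope because it must be active in INITIAL, not in comment.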