Start Conditions
Gelex provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with <sc> will only be active when the scanner is in the start condition named sc. For example,
<STRING>[^"]*   {
            -- Eat up the string body ...
            ...
        }
will be active only when the scanner is in the STRING start condition, and:
<INITIAL,STRING,QUOTE>\.    {
            -- handle an escape ...
            ...
        }
will be active only when the current start condition is one of INITIAL, STRING, or QUOTE.
Start conditions are declared in the declarations section of the input using unindented lines beginning with either %s or %x followed by a whitespace-separated list of names. The former declares inclusive start conditions, the latter exclusive start conditions. The names of start conditions are case-insensitive and consist of a letter followed by zero or more letters, digits, or underscores. For each start condition gelex generates an Eiffel integer constant attribute which can be used to refer to it. The names of start conditions must therefore differ from the other feature names in the generated class.
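The declaration format described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not gelex's own parser: the helper name parse_declaration is hypothetical, and restricting names to ASCII letters is an assumption.

```python
import re

# Assumed reading of the naming rule: one ASCII letter, then letters,
# digits or underscores.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")

def parse_declaration(line):
    """Parse one '%s' or '%x' declaration line into (kind, names)."""
    # Declarations must start in the first column.
    if not line or line[0].isspace():
        raise ValueError("declaration lines must be unindented")
    directive, *names = line.split()
    if directive not in ("%s", "%x"):
        raise ValueError("not a start condition declaration: " + line)
    kind = "inclusive" if directive == "%s" else "exclusive"
    for name in names:
        if not NAME_RE.match(name):
            raise ValueError("invalid start condition name: " + name)
    # Names are case-insensitive, so normalise them before comparing.
    return kind, [name.upper() for name in names]

print(parse_declaration("%x comment foo"))
```

Upper-casing the names is just one way to make comparisons case-insensitive; it says nothing about how the generated Eiffel constants are spelled.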
A start condition is activated using feature set_start_condition. Until set_start_condition is called again, rules with the given start condition will be active and rules with other start conditions will be inactive. If the start condition is inclusive, then rules with no start conditions at all will also be active. If it is exclusive, then only rules qualified with the start condition will be active. The original state where only the rules with no start conditions are active is referred to as the start condition INITIAL.
A set of rules contingent on the same exclusive start condition describes a scanner which is independent of any of the other rules in the gelex input. Because of this, exclusive start conditions make it easy to specify "mini-scanners" which scan portions of the input that are syntactically different from the rest (e.g. strings).
If the distinction between inclusive and exclusive start conditions is still a little vague, here's a simple example illustrating the connection between the two. The set of rules:
%s example
%%
<example>foo    do_something
bar             something_else
is equivalent to:
%x example
%%
<example>foo            do_something
<INITIAL,example>bar    something_else
Without the <INITIAL,example> qualifier, the bar pattern in the second example wouldn't be active (i.e. couldn't match) when in start condition example. If we just used <example> to qualify bar, though, then it would only be active in example and not in INITIAL, while in the first example it's active in both, because in the first example the example start condition is an inclusive (%s) start condition.
Also note that the special start condition specifier <*> matches every start condition. Thus, the above example could also have been written:
%x example
%%
<example>foo    do_something
<*>bar          something_else
The default rule (to echo any unmatched character) remains active in start conditions. It is equivalent to:
<*>.|\n default_action
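The activation rules and the three equivalent rule sets above can be checked with a small Python model. It is a sketch of the semantics as described in this section, not of gelex's implementation; the function active_rules and the tuple encoding of rules are invented for the illustration, and case-insensitivity of names is ignored.

```python
INITIAL = "INITIAL"

def active_rules(rules, current, exclusive):
    """Return the patterns that can match in start condition `current`.

    rules     -- list of (conditions, pattern); conditions is a tuple of
                 names, () for an unqualified rule, ("*",) for <*>
    exclusive -- set of start conditions declared with %x
    """
    active = []
    for conditions, pattern in rules:
        if "*" in conditions or current in conditions:
            active.append(pattern)
        elif not conditions and current not in exclusive:
            # Unqualified rules are active in INITIAL and in inclusive
            # start conditions, but not in exclusive ones.
            active.append(pattern)
    return active

inclusive_spec = [(("example",), "foo"), ((), "bar")]                    # %s example
exclusive_spec = [(("example",), "foo"), ((INITIAL, "example"), "bar")]  # %x example
star_spec = [(("example",), "foo"), (("*",), "bar")]                     # %x example

for current in (INITIAL, "example"):
    print(current,
          active_rules(inclusive_spec, current, exclusive=set()),
          active_rules(exclusive_spec, current, exclusive={"example"}),
          active_rules(star_spec, current, exclusive={"example"}))
```

For each start condition the three lists printed on a line are identical, which is what "equivalent" means here: bar is active in both INITIAL and example, foo only in example.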
To illustrate the uses of start conditions, here is a scanner which provides two different interpretations of a string like 123.456. By default it will treat it as three tokens, the integer 123, a dot, and the integer 456. But if the string is preceded earlier in the line by the string expect-reals it will treat it as a single token, the real number 123.456:
%s expect
%%
expect-reals    set_start_condition (expect)
<expect>[0-9]+"."[0-9]+ {
            io.put_string ("Found a real: ")
            io.put_real (text.to_real)
            io.put_new_line
        }
<expect>\n      {
                -- That's the end of the line, so
                -- we need another "expect-reals"
                -- before we'll recognize any more
                -- reals.
            set_start_condition (INITIAL)
        }
[0-9]+          {
            io.put_string ("Found an integer: ")
            io.put_integer (text.to_integer)
            io.put_new_line
        }
"."             io.put_string ("Found a dot"); io.put_new_line
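To see how this scanner behaves on concrete input, here is a rough Python simulation of the same five rules. It is not generated by gelex and only approximates it: the longest match among the active rules wins, ties go to the earlier rule, and unmatched characters are silently skipped where the real default rule would echo them. The function scan and the action names are made up for this sketch.

```python
import re

INITIAL, EXPECT = "INITIAL", "expect"

# (start conditions, pattern, action name); () marks an unqualified rule.
RULES = [
    ((), r"expect-reals", "enter_expect"),
    ((EXPECT,), r"[0-9]+\.[0-9]+", "real"),
    ((EXPECT,), r"\n", "leave_expect"),
    ((), r"[0-9]+", "integer"),
    ((), r"\.", "dot"),
]

def scan(text):
    """Tokenise `text`, choosing the longest match among the active rules."""
    condition, pos, tokens = INITIAL, 0, []
    while pos < len(text):
        best = None
        for conditions, pattern, action in RULES:
            # `expect` is inclusive (%s), so unqualified rules stay active in it.
            if conditions and condition not in conditions:
                continue
            match = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and (best is None or match.end() > best[0].end()):
                best = (match, action)
        if best is None:
            pos += 1  # stand-in for the default rule: consume one character
            continue
        match, action = best
        pos = match.end()
        if action == "enter_expect":
            condition = EXPECT
        elif action == "leave_expect":
            condition = INITIAL
        else:
            tokens.append((action, match.group()))
    return tokens

print(scan("123.456\n"))
print(scan("expect-reals 123.456\n123.456\n"))
```

The first call yields an integer, a dot and an integer. In the second call the first line yields a single real, and because the newline resets the start condition to INITIAL, the second line is again split into three tokens.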
Here is a scanner which recognizes (and discards) C comments while maintaining a count of the current input line.
%x comment
%%
"/*"                    set_start_condition (comment)
<comment>[^*\n]*        -- Eat anything that's not a '*'
<comment>"*"+[^*/\n]*   -- Eat up '*'s not followed by '/'s
<comment>\n             line_nb := line_nb + 1
<comment>"*"+"/"        set_start_condition (INITIAL)
%%
    line_nb: INTEGER
            -- Current line number
    ...
This scanner goes to a bit of trouble to match as much text as possible with each rule. In general, when attempting to write a high-speed scanner, try to match as much as possible in each rule, as it's a big win.
Note that start condition entities in the Eiffel code are integer values and can be stored as such. Thus, the above could be extended in the following fashion:
%x comment foo
%%
"/*"            {
            comment_caller := INITIAL
            set_start_condition (comment)
        }
<foo>"/*"       {
            comment_caller := foo
            set_start_condition (comment)
        }
<comment>[^*\n]*        -- Eat anything that's not a '*'
<comment>"*"+[^*/\n]*   -- Eat up '*'s not followed by '/'s
<comment>\n             line_nb := line_nb + 1
<comment>"*"+"/"        set_start_condition (comment_caller)
%%
    line_nb: INTEGER
            -- Current line number
    comment_caller: INTEGER
            -- Last start condition
    ...
Furthermore, you can access the current start condition using the integer function start_condition. For example, the above assignments to comment_caller could instead have been written:
comment_caller := start_condition
In case of nested start conditions, you could also keep track of previous start conditions by pushing them on an integer stack using push_start_condition and then popping them from the stack as you leave the current start condition using pop_start_condition.
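The stack discipline can be pictured with a toy Python model. It assumes flex-like semantics, where pushing saves the current start condition before switching to the new one and popping restores the most recently saved one; the class StartConditions and the integer codes are hypothetical and are not the gelex implementation.

```python
class StartConditions:
    """Toy model of a current start condition plus a stack of saved ones."""

    def __init__(self, initial=0):
        self.current = initial
        self.stack = []

    def set_start_condition(self, condition):
        self.current = condition

    def push_start_condition(self, condition):
        # Assumed semantics: save the current condition, then switch.
        self.stack.append(self.current)
        self.current = condition

    def pop_start_condition(self):
        # Return to the most recently saved condition.
        self.current = self.stack.pop()

INITIAL, FOO, COMMENT = 0, 1, 2   # hypothetical integer codes

sc = StartConditions(INITIAL)
sc.set_start_condition(FOO)
sc.push_start_condition(COMMENT)   # entering a comment from foo
sc.pop_start_condition()           # end of comment: back to foo
print(sc.current == FOO)
```

Compared with the comment_caller attribute, a stack needs no extra attribute per nesting level, so a comment rule can return to whichever start condition it was entered from.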
Finally, here's an example of how to match Eiffel-style quoted strings using exclusive start conditions, including expanded escape sequences:
%x str
%%
\"                  buffer.wipe_out; set_start_condition (str)
<str>[^%\n"]+       buffer.append_string (unicode_text)
<str>%A             buffer.append_character ('%A')
<str>%B             buffer.append_character ('%B')
<str>%C             buffer.append_character ('%C')
<str>%D             buffer.append_character ('%D')
<str>%F             buffer.append_character ('%F')
<str>%H             buffer.append_character ('%H')
<str>%L             buffer.append_character ('%L')
<str>%N             buffer.append_character ('%N')
<str>%Q             buffer.append_character ('%Q')
<str>%R             buffer.append_character ('%R')
<str>%S             buffer.append_character ('%S')
<str>%T             buffer.append_character ('%T')
<str>%U             buffer.append_character ('%U')
<str>%V             buffer.append_character ('%V')
<str>%%             buffer.append_character ('%%')
<str>%\'            buffer.append_character ('%'')
<str>%\"            buffer.append_character ('%"')
<str>%\(            buffer.append_character ('%(')
<str>%\)            buffer.append_character ('%)')
<str>%<             buffer.append_character ('%<')
<str>%>             buffer.append_character ('%>')
<str>%\/[0-9]+\/    {
            code := text_substring (3, text_count - 1).to_integer
            if code > Maximum_character_code then
                set_start_condition (INITIAL)
                last_token := E_STRERR
            else
                buffer.append_character (code.to_character)
            end
        }
<str>%\r?\n[ \t\r]*%    line_nb := line_nb + 1
<str>\"             {
            set_start_condition (INITIAL)
                -- Pass string value to parser.
            last_value := buffer.twin
            last_token := E_STRING
        }
<str>.|\n                   |
<str>%\r?\n[ \t\r]*         |
<str>%\/([0-9]+(\/)?)?      |
<str><<EOF>>        {
                -- Catch-all rules (no backing up)
            set_start_condition (INITIAL)
            last_token := E_STRERR
        }
A complete version of this example can be found in $GOBO/library/lexical/example/eiffel_scanner.
Often, such as in some of the examples above, you wind up writing a whole bunch of rules all preceded by the same start condition(s). Gelex makes this a little easier and cleaner by introducing a notion of start condition scope. A start condition scope begins with:
<SCs>{
where SCs is a comma-separated list of one or more start conditions. Inside the start condition scope, every rule automatically has the prefix <SCs> applied to it, until a } which matches the initial {. So, for example,
<ESC>{
    %N  last_token := ('%N').code
    %T  last_token := ('%T').code
    %B  last_token := ('%B').code
    %U  last_token := ('%U').code
}
is equivalent to:
<ESC>%N     last_token := ('%N').code
<ESC>%T     last_token := ('%T').code
<ESC>%B     last_token := ('%B').code
<ESC>%U     last_token := ('%U').code
Start condition scopes may be nested.
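The rewriting that a start condition scope stands for can be sketched as a simple text transformation in Python. This is a deliberately limited illustration, with a hypothetical helper expand_scope: it handles a single, non-nested scope with one rule per line and no braces inside the actions, and it says nothing about how gelex itself processes scopes.

```python
def expand_scope(lines):
    """Expand one non-nested '<SCs>{ ... }' scope into prefixed rules."""
    header, *body = [line.strip() for line in lines if line.strip()]
    if not (header.startswith("<") and header.endswith(">{")):
        raise ValueError("expected a line of the form <SCs>{")
    if not body or body[-1] != "}":
        raise ValueError("scope is not closed by a matching }")
    prefix = header[:-1]          # drop the trailing '{', keep '<SCs>'
    return [prefix + rule for rule in body[:-1]]

scope = """
<ESC>{
    %N  last_token := ('%N').code
    %T  last_token := ('%T').code
}
"""
for rule in expand_scope(scope.splitlines()):
    print(rule)
```

Each printed rule carries the <ESC> prefix, matching the expanded form of the ESC example above.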
Copyright © 2000-2019, Eric Bezault
mailto:ericb@gobosoft.com
http://www.gobosoft.com
Last Updated: 27 September 2019