Using the XML parser

A first example: counting tags

Let's start with a simple example of an XML parser that counts the number of tags in an input file. This example has two classes: the main class, which opens the file and creates the parser, and a descendant of the callbacks class, which receives events from the parser. The main class connects this event consumer with the parser.

The base class for events is XM_CALLBACKS, which has all features deferred. For this example only a couple of events are needed, so XM_CALLBACKS_NULL, which provides empty bodies for all events, is used for convenience. The events we redefine are the routines called when the parser starts and whenever it encounters a start (opening) XML tag.

class TAGCOUNT_CALLBACKS

inherit
	XM_CALLBACKS_NULL
		redefine on_start, on_start_tag end

create
	make

feature -- Events

	on_start
			-- Reset tag count.
		do
			count := 0
		end

	on_start_tag (a_namespace: STRING; a_prefix: STRING; a_local_part: STRING)
			-- Count start tags.
		do
			count := count + 1
		end

feature -- Access

	count: INTEGER
			-- Number of tags seen.

end

The main class creates the parser. The routines that read the command line and open the file are standard and omitted here, which leaves the main routine that sets up and starts the parser and prints the result:

	parse_stream (a_stream: KI_CHARACTER_INPUT_STREAM)
			-- Parse open stream.
		require
			a_stream_not_void: a_stream /= Void
			a_stream_open: a_stream.is_open_read
		local
			a_parser: XM_PARSER
			a_consumer: TAGCOUNT_CALLBACKS
		do
				-- Create the parser.
				-- It is left in the default state, which means:
				-- ascii only, no external entities or DTDs,
				-- no namespace resolving.
			create {XM_EIFFEL_PARSER} a_parser.make
				-- Create the event consumer that counts start tags.
			create {TAGCOUNT_CALLBACKS} a_consumer.make
			a_parser.set_callbacks (a_consumer)
				-- Parse and display result.
			a_parser.parse_from_stream (a_stream)
			if not a_parser.is_correct then
				error_handler.report_error_message (a_parser.last_error_extended_description)
			else
				error_handler.report_info_message ("Number of tags found: " + a_consumer.count.out)
			end
		end

The full example is available at example/xml/event/tagcount.

Event interfaces

Event interfaces are the lowest level of communication with an XML parser. An event interface is a deferred class containing callback calls. Sources of events, like a parser, have routines to attach a descendant of the event interface.

For each event interface, there is a purely deferred class with the callbacks, from which clients inherit, and a 'source' class, from which event sources such as the parser inherit. For the main XML content events, the event interface is XM_CALLBACKS and the source is XM_CALLBACKS_SOURCE. The source class provides a set_callbacks feature, and the parser inherits from it.

DTD events are covered separately, for parsers that support them, using XM_DTD_CALLBACKS and XM_DTD_CALLBACKS_SOURCE (with set_dtd_callbacks).

The XML parser interface

The public interface of XML parsers is represented in the deferred class XM_PARSER. Parsers are event sources: they inherit from the event source classes to provide set_callbacks and set_dtd_callbacks. An input document is parsed using parse_from_stream and similar features, which accept strings or IO streams from the Gobo Kernel library. Incremental parsing routines are available to parse a document one chunk at a time if the parser supports it, which can be checked with is_incremental.
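As an illustration of the string variant, here is a minimal sketch that reuses the TAGCOUNT_CALLBACKS consumer from the first example. The class name STRING_TAGCOUNT is hypothetical, and parse_from_string is assumed to be the string counterpart of parse_from_stream mentioned above; check the exact name in XM_PARSER.

```eiffel
class STRING_TAGCOUNT
		-- Hypothetical class name; sketch only.

create
	make

feature {NONE} -- Initialization

	make
			-- Count the start tags of a document held in a string.
		local
			a_parser: XM_PARSER
			a_consumer: TAGCOUNT_CALLBACKS
		do
			create {XM_EIFFEL_PARSER} a_parser.make
			create a_consumer.make
			a_parser.set_callbacks (a_consumer)
				-- Assumption: `parse_from_string' is the string
				-- counterpart of `parse_from_stream'.
			a_parser.parse_from_string ("<doc><a/><b/></doc>")
			if a_parser.is_correct then
				print ("Number of tags found: " + a_consumer.count.out + "%N")
			else
				print (a_parser.last_error_extended_description + "%N")
			end
		end

end
```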

Errors can be collected but are also forwarded to the event interface. Because an event filter stream as described below can produce its own errors, not reflected in the event source that is the parser, it may be more sensible in most cases to collect errors downstream.

The parsers have a string mode. Because XML documents can contain Unicode characters that do not fit in most Eiffel compilers' CHARACTER and STRING types, the Gobo Eiffel library provides a Unicode string class that inherits from STRING. The base class of Gobo's Unicode string classes is UC_STRING. There are some subtle issues with polymorphism, for instance:

"hello".append_string (a_uc_string)

will not work because the call target is of dynamic type STRING which does not know about UC_STRING. A library routine, KL_STRING_ROUTINES's appended_string, that copes with polymorphism is provided to replace the original routine. Other polymorphic routines are treated similarly. The Gobo Unicode facilities are described in more detail elsewhere along with the rationale for this design.
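The replacement call might look like the sketch below. It assumes the routine is reached through the STRING_ feature of KL_IMPORTED_STRING_ROUTINES and that UC_STRING offers make_from_string; the class name is hypothetical.

```eiffel
class UNICODE_APPEND_EXAMPLE
		-- Hypothetical class name; sketch only.

inherit
	KL_IMPORTED_STRING_ROUTINES

create
	make

feature {NONE} -- Initialization

	make
			-- Append a UC_STRING to a plain STRING safely.
		local
			a_uc_string: UC_STRING
			a_result: STRING
		do
			create a_uc_string.make_from_string (" world")
				-- Instead of: ("hello").append_string (a_uc_string)
				-- Use the returned string: it may be a new object
				-- rather than the first argument.
			a_result := STRING_.appended_string ("hello", a_uc_string)
			print (a_result.count.out + "%N")
		end

end
```

The important habit is to use the function result rather than rely on the first argument having been modified in place.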

As this potential polymorphism puts a burden of care on the client, and could lead to hard-to-detect problems for unprepared clients, the XML parsers default to a safe mode, where only instances of STRING are produced and a parsing error occurs if the XML input contains characters that do not fit in STRING. If the application input is only ASCII files, nothing else needs to be done. Otherwise, it may be necessary to set the string mode to some other mode, such as producing exclusively UC_STRING descendants, or producing them only when needed. These string mode settings are contained in the class XM_STRING_MODE, a parent of the parser classes.
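Changing the mode could be sketched as follows. The setter name set_string_mode_mixed (produce UC_STRING only when needed) is an assumption about XM_STRING_MODE and should be checked against the class text, as should any sibling setters for the other modes; the class name is hypothetical.

```eiffel
class STRING_MODE_EXAMPLE
		-- Hypothetical class name; sketch only.

create
	make

feature {NONE} -- Initialization

	make
			-- Create a parser that accepts non-ASCII input.
		local
			a_parser: XM_EIFFEL_PARSER
		do
			create a_parser.make
				-- Assumption: XM_STRING_MODE provides this setter,
				-- which makes the parser produce UC_STRING
				-- instances only when STRING is not sufficient.
			a_parser.set_string_mode_mixed
			parser := a_parser
		end

feature -- Access

	parser: XM_PARSER
			-- Configured parser.

end
```

Clients of a parser configured this way should use the KL_STRING_ROUTINES replacements for polymorphic string operations on everything the parser hands them.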

Another setting a parser user may want to change from the default relates to external references, described below.

Concrete parsers

Several concrete parsers are available, which are descendants of this interface. The pure Eiffel parser is XM_EIFFEL_PARSER. The parser making use of and depending on the Expat C library is XM_EXPAT_PARSER. These classes can be created directly.

Because Expat introduces external dependencies in the library, a factory class is available to allow the same client code to work independently of whether or not the Expat parser is compiled in: XM_EXPAT_PARSER_FACTORY. The value of is_expat_available depends on whether Expat is available, and code may portably act accordingly, for instance falling back to the Eiffel parser.
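A fallback of this kind might look like the sketch below. The query name is_expat_available follows the text above, and new_expat_parser is an assumed factory function; verify both against XM_EXPAT_PARSER_FACTORY in your Gobo release. The class name is hypothetical.

```eiffel
class PARSER_CHOICE_EXAMPLE
		-- Hypothetical class name; sketch only.

create
	make

feature {NONE} -- Initialization

	make
			-- Pick the Expat parser when available.
		do
			parser := new_parser
		end

feature -- Access

	parser: XM_PARSER
			-- Parser selected at creation time.

feature {NONE} -- Implementation

	new_parser: XM_PARSER
			-- Expat parser if compiled in, Eiffel parser otherwise.
		local
			a_factory: XM_EXPAT_PARSER_FACTORY
		do
			create a_factory
				-- Assumption: query name as given in the text,
				-- `new_expat_parser' as the factory function.
			if a_factory.is_expat_available then
				Result := a_factory.new_expat_parser
			else
				create {XM_EIFFEL_PARSER} Result.make
			end
		end

end
```

Because the client only ever sees the type XM_PARSER, the rest of the program is unaffected by which branch was taken.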

Event filters and streams

On top of the event interface, the XML library provides a set of filters and a framework for using filters. The filters are arranged in a stream, in a manner similar to the Unix command shell.

Each component of a filter pipe is a descendant of a filter base class, XM_CALLBACKS_FILTER for content events, which has a next attribute. The default implementation of each event simply forwards the event to the next filter, so a filter that uses only a few events needs to redefine only the required routines. Redefinitions are expected to do their processing and then forward the event to the next filter, for instance using Precursor.

The class provides two routines that can be used as creation procedures, make_null and set_next. make_null sets next to a filter that does nothing on each event. This null filter, XM_CALLBACKS_NULL for content events, allows each component of a pipe to be used at any position in the pipe, including at the end, and the next filter to be set when convenient, while maintaining the invariant that next is not Void. set_next sets next to the given filter.
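A small filter following this pattern could be sketched as below. The class name TAGCOUNT_FILTER is hypothetical; the signature of on_start_tag is the one used in the first example, and the creation clause lists the two routines named above.

```eiffel
class TAGCOUNT_FILTER
		-- Hypothetical filter: counts start tags and forwards
		-- every event unchanged.

inherit
	XM_CALLBACKS_FILTER
		redefine
			on_start_tag
		end

create
	make_null, set_next

feature -- Events

	on_start_tag (a_namespace: STRING; a_prefix: STRING; a_local_part: STRING)
			-- Count the tag, then pass the event downstream.
		do
			count := count + 1
			Precursor (a_namespace, a_prefix, a_local_part)
		end

feature -- Access

	count: INTEGER
			-- Number of start tags seen.

end
```

Unlike the TAGCOUNT_CALLBACKS consumer, this version can sit in the middle of a pipe, because all other events fall through to next by default.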

From an Eiffel typing viewpoint, the whole stream has the same type: each filter can be at any position in the pipe. It may be that some filters have extra dependencies (one must come before the other) that are not captured by the static type system. This seems acceptable given the flexibility of the system, and given that many practical filters can indeed be placed anywhere on a pipe. Each filter is a small component with a clear interface, which provides much better encapsulation than some other event filter patterns (such as each stage inheriting from the previous one, with high coupling between components).

Content events

The content events are the core of the XML parser interface. They cover elements and attributes, in addition to less fundamental features such as comments and processing instructions. There are also events called on startup and at the end of parsing.

All events of XM_CALLBACKS that take names of tags or attributes follow the same convention. The signature includes the namespace (a string representing the namespace URI), the name prefix and the local part. The parser is not expected to resolve namespaces; a filter introduced below resolves them and replaces the unresolved namespaces (Void) downstream in the filter pipe. Whether a namespace is set can be checked with has_namespace.

To make the interface consistently simple, it has only atomic events whose parameters are only strings and not data structures. Data structures are built downstream, or as intermediary internal structures of a specific filter. In particular, this means there is one event per attribute.
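To illustrate the one-event-per-attribute design, here is a minimal sketch of a consumer that prints each attribute as it arrives. The class name is hypothetical, and the signature of on_attribute is assumed to follow the naming convention above with the attribute value as an extra string argument.

```eiffel
class ATTRIBUTE_PRINTER
		-- Hypothetical consumer; sketch only.

inherit
	XM_CALLBACKS_NULL
		redefine
			on_attribute
		end

create
	make

feature -- Events

	on_attribute (a_namespace: STRING; a_prefix: STRING; a_local_part: STRING; a_value: STRING)
			-- Called once per attribute of the current start tag.
		do
			print (a_local_part + "=" + a_value + "%N")
		end

end
```

A consumer that needs all attributes of an element together has to collect them itself, which is the kind of intermediary structure the paragraph above leaves to downstream filters.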

Content event filters

A set of standard content event filters is available in the library. There is a factory class XM_CALLBACKS_FILTER_FACTORY with creation routines and convenience routines to build pipes and bind the filters to each other. The filters can also be created directly; the factory is only there for convenience.

The namespace resolving filter XM_NAMESPACE_RESOLVER will be in most standard pipes. It reads XML namespace declaration attributes (these events are not forwarded downstream) and adds a resolved namespace URI to all outgoing names.

For completely correct validation of Unicode character classes in XML names, the filter XM_UNICODE_VALIDATION_FILTER should be used. It is transparent unless an error occurs, in which case it issues an error event, so an error filter should be connected later in the pipe.

The atomicity of content events is not guaranteed by the parser; that is, two or more on_content events may follow each other. The XM_CONTENT_CONCATENATOR filter turns successive content events into one. It will usually be preceded by an XM_NO_COMMENT_FILTER filter, which removes comment events, because otherwise comments placed in the middle of content would still leave several content events separated by comment events.

Without XM_STOP_ON_ERROR_FILTER the event flow may continue after an error. This filter stops all event forwarding from the first error, which it remembers for later use (has_error and last_error). It is useful for most standard pipes: an error condition is better collected here, where it includes errors raised by the preceding filters, than in the parser itself.
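Collecting the error downstream could be sketched as follows. The filter is created directly with make_null, on the assumption that it inherits the creation procedures of XM_CALLBACKS_FILTER; parse_from_string is assumed as in XM_PARSER, and the class name is hypothetical.

```eiffel
class ERROR_FILTER_EXAMPLE
		-- Hypothetical class name; sketch only.

create
	make

feature {NONE} -- Initialization

	make
			-- Parse a malformed document and report the first error.
		local
			a_parser: XM_PARSER
			an_error_filter: XM_STOP_ON_ERROR_FILTER
		do
			create {XM_EIFFEL_PARSER} a_parser.make
				-- Assumption: `make_null' is available as a
				-- creation procedure on this filter.
			create an_error_filter.make_null
			a_parser.set_callbacks (an_error_filter)
				-- Mismatched end tag, so an error is expected.
			a_parser.parse_from_string ("<doc><item></doc>")
			if an_error_filter.has_error then
				print (an_error_filter.last_error + "%N")
			else
				print ("No error%N")
			end
		end

end
```

In a longer pipe the same check applies unchanged, as long as the error filter sits downstream of every filter that may raise errors.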

XM_PRETTY_PRINT_FILTER is a filter that prints out the event stream as an XML document, to the standard output, or a string. It can be placed anywhere in the stream, which may be convenient for debugging.

To produce the output in a tree structure (descendants from XM_NODE), the filter XM_CALLBACKS_TO_TREE_FILTER is used. It expects resolved namespaces.

XM_SHARED_STRINGS_FILTER saves memory and possibly comparison time by making all equal strings point to a single instance. The downstream events must then consider strings immutable. This sharing is across event categories: for example, if a content happens to be the same as an element name, it will be the same string.

To finish this section, here is an example of a filter pipe. It uses the factory class convenience routine callbacks_pipe, which simply binds each filter in an array to the next one and returns the first element:


a_parser: XM_PARSER

a_parser.set_callbacks (callbacks_pipe (
		<< new_namespace_resolver,
			new_tree_builder >>))

In a real program, references may be kept to individual filters, to recover the result or check their state after processing. XM_TREE_CALLBACKS_PIPE provides a standard pipe for tree creation with attributes for the interesting component filters.
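Using the standard tree pipe might look like the sketch below. The feature names start (first filter of the pipe), error (the stop-on-error filter) and document (the resulting tree) are assumptions about XM_TREE_CALLBACKS_PIPE, as are root_element and name on the resulting tree; check them against the class text. The class name is hypothetical.

```eiffel
class TREE_PIPE_EXAMPLE
		-- Hypothetical class name; sketch only.

create
	make

feature {NONE} -- Initialization

	make
			-- Build a tree from a small document and report the outcome.
		local
			a_parser: XM_PARSER
			a_pipe: XM_TREE_CALLBACKS_PIPE
		do
			create {XM_EIFFEL_PARSER} a_parser.make
			create a_pipe.make
				-- Assumption: `start' is the entry point of the pipe.
			a_parser.set_callbacks (a_pipe.start)
			a_parser.parse_from_string ("<doc><item/></doc>")
				-- Assumption: `error' and `document' give access
				-- to the component filters' results.
			if a_pipe.error.has_error then
				print (a_pipe.error.last_error + "%N")
			else
				print (a_pipe.document.root_element.name + "%N")
			end
		end

end
```

The error check comes before any access to the tree, since the tree is only meaningful when the whole pipe completed without error.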

Copyright 2005, Eric Bezault
Last Updated: 7 July 2005