e4Graph and XML
The e4XML library enables XML input to be stored in an e4_Storage
object. The XML input is parsed using James Clark's expat
parser, and stored in nodes and vertices under a given e4_Node
object in an e4_Storage object. This facility is provided by the e4_XMLParser
class.
The e4XML library also provides facilities for producing a character
string representing the XML encoding of a node and all its vertices. This
functionality is provided by the e4_XMLGenerator
class.
The e4_XMLParser Class
The e4XML library provides a class, e4_XMLParser, which can be
used to parse a given string of XML input into an e4Graph graph of objects.
A new instance of e4_XMLParser must be created for each parse; each
instance can be used for only one parse and must be deleted afterwards.
Each instance of e4_XMLParser is associated with an instance of e4_Node.
The association must be formed before parsing can commence, either during
the construction of the parser, or afterwards by assigning a node to be
associated with an existing parser using the SetNode method. The
associated node can later be retrieved using the GetNode method,
and interim parsing state can be queried through the regular operations
on e4_Node instances.
A parse either succeeds or fails. The current state can be queried using
the HasError method which returns true if an error was encountered.
The reason for the error is stored in a NULL terminated string which is
returned by the ErrorString method. The error state can be cleared
using the ClearError method. Clearing the error state does not always
succeed and the parser may be unable to continue parsing. If that happens,
the parser will immediately enter a new error state.
A parse is either in progress or has finished. The current state may
be queried using the Finished method, which returns true when the
parse is finished.
Thus, there are four distinct states:
-
If Finished returns true and HasError returns true, it means
that the parser was given more input than was needed to produce a complete
parse. The input past where the parser considers the parse complete is
treated as illegal.
-
If Finished returns true and HasError returns false, the
parse finished successfully.
-
If Finished returns false and HasError returns true, a parse
error occurred before the end of legal input was parsed.
-
Finally, if Finished and HasError both return false, all
input up to this point was legal and more input is required to finish the
parse.
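The four combinations above can be folded into a single value, which is sometimes easier to switch on than two separate booleans. The following is a minimal sketch and not part of e4XML: the ParseState enum and the classifyParseState function are hypothetical names, and the two arguments stand for the results of Finished and HasError.

```cpp
// Hypothetical names for the four states described above; not part of e4XML.
enum class ParseState {
    TrailingInput,   // finished, with error: input continued past a complete parse
    Succeeded,       // finished, no error
    Failed,          // not finished, with error
    NeedsMoreInput   // not finished, no error
};

// Fold the results of Finished() and HasError() into a single state.
constexpr ParseState classifyParseState(bool finished, bool hasError) {
    return finished ? (hasError ? ParseState::TrailingInput : ParseState::Succeeded)
                    : (hasError ? ParseState::Failed : ParseState::NeedsMoreInput);
}

// Compile-time check of one of the four combinations.
static_assert(classifyParseState(true, false) == ParseState::Succeeded,
              "finished without error means success");
```

With a parser instance p, the state would be obtained as classifyParseState(p.Finished(), p.HasError()).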
Parsing is done through the Parse method, which takes a buffer of
input (not necessarily NULL terminated) and its length. Each call to Parse
advances the parse using the provided input, so the input may be supplied
in several pieces.
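Because Parse advances the parse with whatever input it is given, a program can hand over a large buffer piece by piece. The sketch below illustrates that pattern under stated assumptions: feedInChunks is a hypothetical helper, it only relies on a Parse(char *, size_t) member returning bool as documented for e4_XMLParser, and FakeParser is a stand-in so the example compiles without the e4XML library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: feed a buffer to a parser in pieces of at most
// chunkSize bytes. ParserT is any type offering bool Parse(char *, size_t).
// Returns the number of bytes accepted by successful Parse calls; this is
// less than len if a call returned false.
template <typename ParserT>
std::size_t feedInChunks(ParserT &parser, char *data, std::size_t len,
                         std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::size_t offset = 0;
    while (offset < len) {
        std::size_t n = (len - offset < chunkSize) ? len - offset : chunkSize;
        if (!parser.Parse(data + offset, n)) {
            break;  // stop here so the caller can inspect the error state
        }
        offset += n;
    }
    return offset;
}

// Stand-in used only to exercise the helper without e4XML:
// it accepts any chunk that does not contain a '!' byte.
struct FakeParser {
    std::size_t bytesSeen = 0;
    bool Parse(char *buf, std::size_t len) {
        for (std::size_t i = 0; i < len; ++i) {
            if (buf[i] == '!') return false;
        }
        bytesSeen += len;
        return true;
    }
};
```

After such a loop, the Finished and HasError methods described above tell you whether the parse completed, failed, or still needs more input.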
The XML input is parsed as follows:
-
The parser recognizes the reserved XML open tag __nodebackref__
and treats it specially. It is used to represent circular references in
the input. You should not edit this tag in the XML input or the parser
may operate incorrectly.
-
The XML tag __vertex__ is handled specially when it is parsed by
the e4_XMLParser Parse method. Instead of causing the creation of
a new node, it causes the creation of a new vertex. The tag must have exactly
three attributes, named name, type and value, which
must appear in that order. The name attribute specifies the name
of the vertex to create, the type attribute specifies the type of
the value to be assigned, and the value attribute specifies the
value to be assigned, which is automatically converted from a string
into the appropriate representation. The XML tag __vertex__ must
be empty, that is, there are no separate open and close tags. An example
is <__vertex__ name="color" type="string" value="green"/>. Parsing
this produces a vertex with name color, type string and string
value green. If the value of the type attribute is binary,
the binary data is decoded from the BASE64 string provided as the value
of the value attribute.
-
Any other XML open tag is stored as a vertex named using the tag's name,
with a new instance of e4_Node as its value. If the tag is empty (e.g.
<foo/>) then the new node is empty. Otherwise, any XML input parsed between
the open and close tags (e.g. <foo>...</foo>) is stored in the new node.
-
Tags may have attributes that are treated specially by the parser.
These attributes are used to represent attached markers and circular references,
and you should not edit them in the XML input. The attributes treated specially
by the parser are __nodeid__ and __marker__.
-
Tags may have other attributes associated with them. If there are attributes,
the new node will have a first vertex named __attributes__ whose
value is a node with vertices named according to the attribute names and
whose values are strings containing the attribute values. For example,
<foo color="green" space="55"> will cause the new node to have an __attributes__
vertex whose value is a node containing two vertices; the first vertex
has the name color and string value green, and the second
has the name space and the string value 55.
-
As mentioned above, the XML tag __nodebackref__ is handled specially
when it is parsed by the Parse method. It must have two attributes,
name and id. Instead of creating a new node, the parser creates a vertex
whose name is the value of the name attribute and whose value is the node
that was previously parsed as the nth node, with n being the value of the
id attribute. This allows linear XML input to represent cycles in the graph
of objects being parsed.
-
Character data, i.e. any text appearing between XML tags, is stored in
a vertex named __data__ in the node representing the most nested
open XML tag being parsed currently.
-
An XML comment is stored in a vertex named __comment__ whose string
value is the comment text.
-
Text in a CDATA section is stored in a vertex named __cdata__ whose
value is a node. That node has a single vertex named __data__ whose
string value contains the text of the CDATA section.
-
An XML processing instruction is stored in a vertex named __processinginstruction__
whose value is a node. That node has two vertices named __target__
and __data__ whose string values are the target and data, respectively,
of the processing instruction.
-
An XML declaration is stored in a vertex named __xml__ whose value
is a node. That node optionally contains, depending on which parts of the
XML declaration are present in the input, up to three vertices. A vertex
named __version__ may be present and contain a string value representing
the version of XML that the input conforms to. A vertex named __encoding__
may be present and contain a string value representing the encoding of
the XML input. A vertex named __standalone__ is always present; its
value is an integer denoting whether the XML input is standalone. A value
of -1 denotes that the input did not contain a standalone declaration
part. A value of 0 denotes that the standalone declaration was present
and its value was the string no. A value of 1 denotes that the standalone
declaration was present and its value was the string yes.
-
An XML document type declaration is stored in a vertex named __doctypedecl__
whose value is a node. That node optionally contains, depending on which
parts of the XML document type declaration are present in the input, up
to four vertices. A vertex named __doctypename__ may be present
and contain a string value representing the name of the document type being
declared. A vertex named __sysid__ may be present and contain a
string value representing the system ID part of the document type declaration.
A vertex named __pubid__ may be present and contain a string value
representing the public ID part of the document type declaration.
A vertex named __hasinternalsubset__ is always present; its value
is an integer denoting whether the XML document type declaration has an
internal subset. A value of -1 denotes that no hasinternalsubset
declaration part was present in the input. A value of 0 denotes that a
hasinternalsubset declaration was present and its value was the string no.
A value of 1 denotes that the hasinternalsubset declaration was present
and its value was the string yes.
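The __standalone__ and __hasinternalsubset__ vertices share the same three-valued integer encoding. As a minimal sketch of that encoding, assuming a hypothetical helper named triStateFromDeclaration that receives the declaration text (or a null pointer when the part is absent), the mapping could be written as:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not part of e4XML. Mirrors the integer encoding
// described above: -1 when the declaration part is absent, 0 for "no",
// 1 for "yes". Pass nullptr when the part does not appear in the input.
inline int triStateFromDeclaration(const char *value) {
    if (value == nullptr) {
        return -1;
    }
    // Only the strings "yes" and "no" are covered by the description above.
    assert(std::strcmp(value, "yes") == 0 || std::strcmp(value, "no") == 0);
    return std::strcmp(value, "yes") == 0 ? 1 : 0;
}
```

A program reading the stored integer back can therefore distinguish "not stated" (-1) from an explicit no (0).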
Note that the XML parser may overflow the stack when attempting to parse
deeply nested XML input. Unfortunately this is an artifact of the underlying
parser used, expat. There is no way to fix this problem in programs that use
expat; it must be fixed in expat itself, after which all programs using expat
will automatically be able to handle XML input of arbitrary nesting depth.
The following code snippet shows how the e4_XMLParser class might be
used in a C++ program:
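#include <stdio.h>
#include <stdlib.h>
#include "e4xml.h"
...
e4_XMLParser *parser;
e4_Node n;
char *buf;
size_t len;
...
/* Each parse needs its own parser instance, associated with a node. */
parser = new e4_XMLParser(n);
...
if (!parser->Parse(buf, len)) {
    fprintf(stderr, "Parsing encountered an error: \"%s\"\n",
            parser->ErrorString());
}
/* A parser instance cannot be reused; delete it after the parse. */
delete parser;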
e4_XMLParser Methods and Constructors
e4_XMLParser() |
Constructor. Creates an empty parser that does not have an associated
node. |
e4_XMLParser(e4_Node nn) |
Constructor. Creates a parser associated with the node nn, which
must be a valid node. |
~e4_XMLParser() |
Destructor. Destroys the parser and any associated transient state
information. |
void SetNode(e4_Node nn) |
Associates the node nn with this parser. New input will be stored
in new vertices added to the end of the list of vertices in the node nn. |
bool GetNode(e4_Node &nn) |
Retrieves the associated node in nn, if there is an associated
node. If not, returns false. |
bool Finished() |
Returns true when the parse has finished, false otherwise. Use HasError
to determine whether the parse finished successfully. |
bool HasError() |
Returns true if the parse has encountered an error. |
const char *ErrorString() |
Returns a NULL terminated string describing the error, if any, encountered
by the parser. |
void ClearError() |
Attempts to clear the error situation and recover; may fail, in which
case a subsequent attempt to parse more input will reenter an error state. |
bool Parse(char *buf, size_t len) |
Attempts to parse the XML input in the buffer buf of length len.
If the entire buffer was parsed successfully, the operation returns true.
If an error was encountered, false is returned. |
const unsigned char *Base64_Decode(const char *base64Str,
int *nbytes) |
As a convenience, the parser provides a method to decode a character
string encoded in BASE64 (RFC 1341) into a binary value. The output
argument nbytes contains the length of the byte sequence returned.
The memory occupied by the returned binary value is managed by the parser
and must be copied by your application. |
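To make the BASE64 decoding step concrete, here is an illustrative decoder. It is a sketch written for this document, not the implementation behind Base64_Decode: the function name decodeBase64 is hypothetical, and skipping characters outside the BASE64 alphabet is an assumption about how malformed input is treated. It returns the bytes in a std::vector, so the caller owns the result.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative BASE64 decoder; a sketch, not the e4XML implementation.
// Characters outside the BASE64 alphabet (such as line breaks) are skipped,
// and decoding stops at the first '=' padding character.
inline std::vector<unsigned char> decodeBase64(const std::string &text) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<unsigned char> out;
    unsigned int accumulator = 0;  // most recent bits read from the input
    int bitCount = 0;              // bits in accumulator not yet written out
    for (char c : text) {
        if (c == '=') break;
        std::string::size_type index = alphabet.find(c);
        if (index == std::string::npos) continue;
        accumulator = (accumulator << 6) | static_cast<unsigned int>(index);
        bitCount += 6;
        if (bitCount >= 8) {
            bitCount -= 8;
            out.push_back(static_cast<unsigned char>((accumulator >> bitCount) & 0xFF));
        }
        assert(bitCount >= 0 && bitCount < 8);  // fewer than 8 bits pending
    }
    return out;
}
```

For instance, the string Z3JlZW4= decodes to the five bytes of the word green.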
The e4_XMLGenerator Class
The e4XML library provides a class, e4_XMLGenerator, which can
be used for creating an XML string representing a given node and all its
vertices, recursively. A new instance of e4_XMLGenerator must be
used each time you want to generate XML output from a node.
Each instance of e4_XMLGenerator is associated with an instance
of e4_Node, the node from which XML output is
generated. The association must be formed before XML generation can occur,
by either giving the associated node in the constructor of e4_XMLGenerator,
or by using the SetNode method. The current associated node is returned
by the GetNode method.
XML output generated from the associated node is wrapped by an XML tag
determined either at construction time or set later by using the SetElementName
method. The currently set wrapping XML tag is returned by the GetElementName
method.
XML output is generated with the Generate method, which takes
no arguments and returns the XML output as a NULL terminated string. You
can get the XML output string from a previous invocation of Generate
using the Get method. The memory occupied by the string returned
by Generate and Get is owned by the XML generator; if your
program wishes to keep the value around, it must be copied.
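Since the generator owns the returned string, a program that needs the XML after the generator is deleted has to take a copy. The helper below is a small sketch of that step; copyOwnedString is a hypothetical name, and treating a null pointer as "nothing to copy" follows from Generate being documented to return NULL on failure.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copy a NUL-terminated string whose memory is owned
// elsewhere (for example the pointer returned by Generate or Get) into a
// std::string owned by the caller. Returns false and leaves out untouched
// when the pointer is null.
inline bool copyOwnedString(const char *owned, std::string &out) {
    if (owned == nullptr) {
        return false;
    }
    out.assign(owned);
    return true;
}
```

With a generator gen, a call such as copyOwnedString(gen->Generate(), xml) would leave a private copy in xml that stays valid after gen is deleted.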
The generated XML output string reverses the process of parsing XML
input using the e4_XMLParser class described above.
Specifically:
-
Each vertex whose name is __data__ and whose type is string
causes the generation of XML output containing character data.
-
Each vertex whose name is __cdata__ and whose type is node
causes the generation of an XML CDATA section. Markers attached to the
node are not represented in the generated output.
-
Each vertex whose name is __comment__ and whose type is string
causes the generation of an XML comment.
-
Each vertex whose name is __xml__ and whose type is node
causes the generation of an XML declaration. Markers attached to the node
are not represented in the generated output.
-
Each vertex whose name is __processinginstruction__ and whose type
is node causes the generation of an XML processing instruction.
Markers attached to the node are not represented in the generated output.
-
Each vertex whose name is __doctypedecl__ and whose type is node
causes the generation of an XML document type declaration. Markers attached
to the node are not represented in the generated output.
-
Any other vertex whose type is node and for which output has not
yet been generated in this invocation causes the generation of XML output
containing a pair of XML open and close tags with the name of the vertex,
and between these tags the XML output generated from the vertices of the
node value, recursively. If the node has a first vertex whose name is __attributes__
and whose value is a node all of whose vertices are of type string, then these
are used to generate attributes on the XML open tag for the node. If the node has
parents, its XML open tag is additionally decorated with a __nodeid__
attribute to enable circular references to be resolved when the generated
XML is parsed. The value of the __nodeid__ attribute is a running
integer counter. If the node has attached markers, its XML open tag is
additionally decorated with __marker__ attributes, one for each
marker. These attributes are used to reconstruct the markers when the generated
XML is parsed.
-
Each vertex whose type is not node causes the generation of an empty
__vertex__ tag with three attributes, name, type and value, describing
the vertex. Binary data is encoded in BASE64.
-
If the generator encounters a vertex whose type is node and whose
value is a node for which XML output was already generated during this
invocation, then an empty __nodebackref__ tag is generated with
two attributes, name and id. The value of the name
attribute is the name of the vertex, and the id attribute's value
is the same as the value of the __nodeid__ attribute that was generated
when the XML output was generated for this node. This allows linear XML
output to represent circular e4Graph graph structures.
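The interplay between __nodeid__ and __nodebackref__ can be illustrated with a toy model. The sketch below does not use the e4Graph classes: ToyNode, emitVertices and generateToyXml are hypothetical, only node-valued vertices are modelled, and starting the counter at 0 and writing __nodeid__ on the wrapping element are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a node: an ordered list of (vertex name, child node) pairs.
// This is not the e4_Node class.
struct ToyNode {
    std::vector<std::pair<std::string, ToyNode *> > vertices;
};

// Emit XML for every vertex of node. The first visit to a child writes an
// element with a __nodeid__ taken from a running counter; any later visit
// to the same child writes an empty __nodebackref__ tag instead.
inline void emitVertices(const ToyNode &node, std::map<const ToyNode *, int> &ids,
                         int &nextId, std::string &out) {
    for (const auto &vertex : node.vertices) {
        const std::string &name = vertex.first;
        const ToyNode *child = vertex.second;
        assert(child != nullptr);
        auto seen = ids.find(child);
        if (seen != ids.end()) {
            out += "<__nodebackref__ name=\"" + name + "\" id=\"" +
                   std::to_string(seen->second) + "\"/>";
            continue;
        }
        int id = nextId++;
        ids[child] = id;  // record the id before descending so cycles terminate
        out += "<" + name + " __nodeid__=\"" + std::to_string(id) + "\">";
        emitVertices(*child, ids, nextId, out);
        out += "</" + name + ">";
    }
}

// Wrap the output for root in an element named elementName.
inline std::string generateToyXml(const ToyNode &root, const std::string &elementName) {
    std::map<const ToyNode *, int> ids;
    int nextId = 0;
    ids[&root] = nextId++;  // assumption: the root takes the first id
    std::string out = "<" + elementName + " __nodeid__=\"0\">";
    emitVertices(root, ids, nextId, out);
    out += "</" + elementName + ">";
    return out;
}
```

Because each node's id is recorded before its vertices are visited, a vertex that points back to an ancestor produces a back reference rather than a second copy of the ancestor.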
Note that the generator is very naive and may cause a stack overflow while
descending into the associated node's reachable graph structure. A future
version of the generator may fix this problem by using iteration instead
of recursive descent to visit all reachable nodes and vertices. However,
the generator is guaranteed to finish generating output in bounded time;
it will not recurse infinitely given circular data structures.
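As an illustration of what visiting all reachable nodes by iteration instead of recursive descent could look like, the sketch below walks a toy graph with an explicit stack. It is not taken from the e4XML sources: GraphNode and countReachable are hypothetical, and the visited set plays the role of the bookkeeping that keeps traversal of circular structures bounded.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for a node whose vertices all hold other nodes;
// this is not the e4_Node class.
struct GraphNode {
    std::vector<GraphNode *> children;
};

// Count the nodes reachable from root using an explicit stack rather than
// recursion, so deep graphs do not consume call-stack frames. The visited
// set ensures each node is processed once, even in cyclic graphs.
inline std::size_t countReachable(GraphNode *root) {
    assert(root != nullptr);
    std::set<const GraphNode *> visited;
    std::vector<GraphNode *> pending(1, root);
    while (!pending.empty()) {
        GraphNode *current = pending.back();
        pending.pop_back();
        if (!visited.insert(current).second) {
            continue;  // already processed
        }
        for (GraphNode *child : current->children) {
            if (child != nullptr) pending.push_back(child);
        }
    }
    return visited.size();
}
```

The pending vector grows on the heap, so the nesting depth of the graph is limited by available memory rather than by the size of the call stack.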
The following code snippet shows how you might use the e4_XMLGenerator
class in your C++ programs:
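#include <stdio.h>
#include "e4xml.h"
...
e4_XMLGenerator *gen;
e4_Node n;
char *xml;
...
gen = new e4_XMLGenerator(n, "hello");
...
/* Generate returns NULL on error; the string is owned by the generator. */
xml = gen->Generate();
...
if (xml != NULL) {
    fprintf(stderr, "Generated XML: \"%s\"\n", xml);
}
...
delete gen;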
e4_XMLGenerator Constructors and Methods
e4_XMLGenerator() |
Creates an empty instance. Subsequently you should associate an instance
of e4_Node with this XML generator using the SetNode
method, and a wrapping XML element tag name, using the SetElementName
method. |
e4_XMLGenerator(e4_Node n, char *elementName) |
Creates an instance with an associated instance of e4_Node and a wrapping
XML element tag name. |
~e4_XMLGenerator() |
Destructor. Frees memory associated with this XML generator. |
void SetNode(e4_Node n) |
Sets the associated node for this XML generator. |
void SetElementName(char *elementName) |
Sets the wrapping XML element tag name for this XML generator. |
void SetElementNameAndNode(e4_Node n, char *elementName) |
Sets both the associated node and the wrapping XML element tag name
for this XML generator. |
void GetNode(e4_Node &n) const |
Retrieves the associated node. |
char *GetElementName() const |
Retrieves the wrapping XML element tag name. This retrieves the XML
element tag name string used by the generator itself, not a copy. |
char *Generate() |
Generates and returns the XML output representing the associated node,
wrapped in the wrapping XML tag name. The returned string's memory is owned
by the generator, so your application must copy it if the value should
be preserved. If an error occurs or the generator is not ready to produce
XML (no associated node or wrapping XML tag name was previously given)
then NULL is returned. |
char *Get() |
Retrieves the XML output previously generated by a successful call to
Generate. The returned string's memory is managed by the generator and must
be copied by your application. |
char *Base64_Encode(unsigned char *bytes, int nbytes) const |
As a convenience, the class provides a method to encode binary data
of a given length as a BASE64 character string. The memory occupied by
the returned string is owned by the generator and must be copied by your
application. |
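For readers unfamiliar with BASE64, the encoding step can be sketched as follows. This is an illustrative encoder, not the code behind Base64_Encode: the name encodeBase64 is hypothetical, the result is returned as a std::string owned by the caller, and no line breaks are inserted into the output, which is a simplification.

```cpp
#include <cassert>
#include <string>

// Illustrative BASE64 encoder; a sketch, not the e4XML implementation.
// Every group of up to three input bytes becomes four output characters,
// with '=' padding when the final group is short.
inline std::string encodeBase64(const unsigned char *bytes, int nbytes) {
    assert(nbytes >= 0 && (bytes != nullptr || nbytes == 0));
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (int i = 0; i < nbytes; i += 3) {
        int remaining = nbytes - i;
        // Pack up to three bytes into a 24-bit group, padding with zero bits.
        unsigned int group = static_cast<unsigned int>(bytes[i]) << 16;
        if (remaining > 1) group |= static_cast<unsigned int>(bytes[i + 1]) << 8;
        if (remaining > 2) group |= static_cast<unsigned int>(bytes[i + 2]);
        out += alphabet[(group >> 18) & 0x3F];
        out += alphabet[(group >> 12) & 0x3F];
        out += (remaining > 1) ? alphabet[(group >> 6) & 0x3F] : '=';
        out += (remaining > 2) ? alphabet[group & 0x3F] : '=';
    }
    return out;
}
```

Encoding the five bytes of the word green yields Z3JlZW4=, the kind of string that would appear as the value attribute of a __vertex__ tag of type binary.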