Introduction

This page explains what the e4Graph library is, the features it provides and the concepts it embodies, and shows by example how you would use it in your programs.

What is the e4Graph Library?

Are you a programmer who works with information whose "shape" is irregular, data whose pieces relate to each other in many different and non-uniform ways? Do you need to represent a collection of data and add new connections between any two data items on the fly? Do you need to work efficiently with huge data sets, with guaranteed consistency for your persistent data? Do you need to work with the data over many invocations of your programs, each invocation picking up exactly where the previous one left off? Do you need to access the data from many different programs running on a variety of operating systems and machine architectures? Do you need to extract data from the data sets in XML format, and do you need efficient access to external data available in XML format? If so, e4Graph can help you.

e4Graph is a C++ library that allows programs to store graph-like data persistently and to access and manipulate that data efficiently. With e4Graph, you can arrange your data in the most natural form that reflects the relationships between its parts, rather than having to force it into a table-like format. The e4Graph library allows you to concentrate on the relationships you want to represent, and not on how to store them in a database. You can modify data items, and add and remove connections and relationships between pieces of data on the fly. e4Graph allows you to represent an unlimited number of different connections between pieces of data, and your program can selectively manipulate the data according to the relationships it cares about, not having to know about other connections represented in the data set.

The e4Graph library is free software, provided under a BSD-style license. It is also an open source project, meaning that the source code for the whole library, instructions on how to build and use it, and full documentation are provided in the distribution. Its development is being managed from a web site kindly hosted by the good folks at SourceForge. My employer, Sun Microsystems Inc., has graciously allowed me to release this software freely, and I am solely responsible for its content, function and operation. You are allowed to use the e4Graph library for any purpose, and you can include it in your commercial, shareware or freeware products and projects. The library is provided in source form and, as a convenience, in binary form for several platforms, including Windows 95, Windows 98, Windows NT, Windows 2000, Solaris 2.6 and Linux. It is known to work on various Windows platforms, Linux and Solaris. Ports to other platforms should be easy, as I've made a conscious attempt to write the code portably; if you make a port to a new platform, please email jyl@best.com so that your port can be included in a future distribution. If you run into portability problems, I'd appreciate the opportunity to learn from your experience and to help. I also ask that you send me email if you decide to include the library in your product. This will allow me to better support your needs.

The e4Graph library is provided as is, with no express or implied guarantee of usefulness, operation or correctness. I promise to reply to every email question or request. I wrote this software because I think it's an interesting problem area, so I'm very interested in hearing about how you use it and what problems you run into; unfortunately, I cannot promise commercial-level support.

A Taste of e4Graph

Before I present the details, here is a small example to give you a glimpse of what e4Graph is about. I expect you to know a little bit of C++ in order to understand the example. Explaining C++ is beyond the scope of this document, sorry.

The example shows how a program uses e4Graph to find out whether John's grocery store has any sugar, and if it has, how many pounds are available and for what price. This example dispenses with error checking for the sake of brevity:

e4_Storage s("John's grocery store", E4_METAKIT);
e4_Marker on_hand;
e4_Node items;
e4_Node sugar;
double price_per_pound;
int pounds_in_store;

if (!s.GetMarker("on hand items", on_hand)) {...}
if (!on_hand.GetMarkedNode(items)) {...}
if (!items.GetVertex("sugar", sugar)) {
    fprintf(stderr, "John's store is out of sugar!\n");
    exit(99);
}

if (!sugar.GetVertex("pounds in store", pounds_in_store)) {...}
if (!sugar.GetVertex("price per pound", price_per_pound)) {...}
fprintf(stderr, "John's store has %d pounds of sugar at $%f per pound\n",
        pounds_in_store, price_per_pound);
This is quite straightforward. All of the concepts are explained in much more detail below, but for now it works to think in terms of the data being represented. First, we open the e4Graph data that describes John's grocery store. Then, we obtain a list of items on hand. Next, we obtain all the related data that's available about sugar in John's store. If sugar is not among the items on hand, we print out a message and stop right there. Otherwise, we obtain an integer representing the number of pounds of sugar in the store, and a 64-bit floating point number that tells the price per pound. Finally, we print out a report detailing how many pounds of sugar are available and at what price.

Note that there may be many more items for sale in the store. However, in this case we only care about sugar. Also, the information about sugar may have much more detail, such as brands, links to one or more suppliers, incentive program information (discounts), links to information collected about client purchases of sugar, and so forth. In this example we only care about the number of units available and their price.

This is a very simplistic example that doesn't even begin to show off all the features of e4Graph. It's just a first taste. Hopefully it's sweet and gives you an appetite for more!

Installing and Building e4Graph

You might want to pause in reading this page and find out how to install, build and use e4Graph in your own projects.

Features of e4Graph

Simply stated, the e4Graph library provides your program with the ability to represent and manipulate graph-like data efficiently and to store this data persistently. Your program accesses and manipulates the data according to the relationships it represents, and e4Graph takes care of how to represent the data efficiently and persistently. Data is loaded into the executing program on demand, as a connection from an already loaded item to an on-disk item is followed; this data loading step is transparent to your program, and in-memory storage is recovered automatically when the data is no longer accessible by your program. This allows e4Graph to efficiently manipulate data graphs whose size is orders of magnitude larger than the machine's main memory.
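
To make the on-demand loading idea concrete, here is a minimal C++ sketch. It is not e4Graph's implementation or API: ToyLazyLoader, Fetch and Loads are hypothetical names, and a std::map stands in for the on-disk data. It only illustrates the general technique of reading an item when it is first followed and letting its memory go once the program no longer holds it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy model, NOT the e4Graph API or implementation.
class ToyLazyLoader {
public:
    // `disk` stands in for data held in persistent storage.
    explicit ToyLazyLoader(std::map<int, std::string> disk)
        : disk_(std::move(disk)) {}

    // Return the item with this id, reading it from "disk" only when no
    // copy is currently alive in memory. Returns nullptr for unknown ids.
    std::shared_ptr<const std::string> Fetch(int id) {
        auto hit = cache_.find(id);
        if (hit != cache_.end()) {
            if (auto live = hit->second.lock()) return live;
        }
        auto on_disk = disk_.find(id);
        if (on_disk == disk_.end()) return nullptr;
        ++loads_;
        std::shared_ptr<const std::string> item =
            std::make_shared<std::string>(on_disk->second);
        cache_[id] = item;  // weak reference: does not keep the item alive
        return item;
    }

    // Number of times an item had to be read from "disk".
    int Loads() const { return loads_; }

private:
    std::map<int, std::string> disk_;
    std::map<int, std::weak_ptr<const std::string>> cache_;
    int loads_ = 0;
};

// Usage: two fetches while the item is held cost one load; once the
// program drops the item, the next fetch has to load it again.
void LazyLoadingExample() {
    std::map<int, std::string> disk{{1, "sugar"}, {2, "flour"}};
    ToyLazyLoader loader(disk);
    {
        auto first = loader.Fetch(1);
        auto second = loader.Fetch(1);
        assert(first == second);
        assert(loader.Loads() == 1);
    }  // both references are dropped here, so the memory is recovered
    auto again = loader.Fetch(1);
    assert(again != nullptr && *again == "sugar");
    assert(loader.Loads() == 2);
}
```

Call LazyLoadingExample() from your own main() to run the checks. The weak reference in the cache is the design point: the cache can hand out an item that is still in use, but it never prevents an unused item from being freed.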

e4Graph makes changes to stored data permanent through a commit step. Persistent data cannot be corrupted by a program crash, because intermediate states are represented only in program memory, not in the persistent storage. At all times the persistent data represents the last committed state, while the program's in-memory state represents the state of data accessed or modified since the last commit. The e4Graph library also allows the program to restore its in-memory data to the last committed state. The e4Graph library provides a mechanism that automatically commits the changes made during execution at the time a storage is closed. All storages are automatically closed and optionally committed at normal program termination. This frees you from having to explicitly commit before program termination. Of course, you can still commit the intermediate state at any time during program execution.
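
The commit model described above can be sketched in a few lines of C++. This is a toy under stated assumptions, not the e4Graph API: ToyCommitStore and its methods are hypothetical names, and a second in-memory map stands in for the persistent storage. It shows only the separation between the last committed state and the in-memory working state.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy model, NOT the e4Graph API.
class ToyCommitStore {
public:
    // Change the in-memory working state only.
    void Set(const std::string &key, int value) {
        assert(!key.empty());
        working_[key] = value;
    }

    // Read from the in-memory working state.
    bool Get(const std::string &key, int &value) const {
        return Lookup(working_, key, value);
    }

    // Read what a freshly started program would see after a crash.
    bool GetCommitted(const std::string &key, int &value) const {
        return Lookup(committed_, key, value);
    }

    // Make every change since the last commit permanent.
    void Commit() { committed_ = working_; }

    // Discard every change since the last commit.
    void Restore() { working_ = committed_; }

private:
    static bool Lookup(const std::map<std::string, int> &from,
                       const std::string &key, int &value) {
        auto it = from.find(key);
        if (it == from.end()) return false;
        value = it->second;
        return true;
    }

    std::map<std::string, int> committed_;  // stands in for the on-disk state
    std::map<std::string, int> working_;    // in-memory state
};

// Usage: an uncommitted change is visible in memory but not "on disk",
// and Restore() throws it away.
void CommitExample() {
    ToyCommitStore store;
    int pounds = 0;
    store.Set("pounds in store", 40);
    assert(!store.GetCommitted("pounds in store", pounds));
    store.Commit();
    store.Set("pounds in store", 10);
    assert(store.Get("pounds in store", pounds) && pounds == 10);
    assert(store.GetCommitted("pounds in store", pounds) && pounds == 40);
    store.Restore();
    assert(store.Get("pounds in store", pounds) && pounds == 40);
}
```

Because the committed copy is touched only inside Commit(), a crash at any other point leaves it exactly as the last commit wrote it. How a real database driver makes the commit step itself safe is a separate matter that this sketch does not model.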

The e4Graph library is layered on top of a generalized interface to table-oriented or relational databases, and the library includes a database driver for the Metakit database package. Additional drivers for other database packages can easily be written; send email to jyl@best.com if you are interested in providing a driver for another database. Each database driver provides its own guarantees regarding the consistency of data stored persistently; the Metakit database provides extremely strong consistency guarantees, preventing data corruption even if a program crash happens in the middle of a commit operation.

The e4Graph library allows you to model any kind of relationship between data that can be represented by a directed graph, including circular graphs of connections between data items. e4Graph is unique in that it allows these circular relationships to be represented directly rather than implicitly or through meta-data, as is necessitated by other approaches such as relational databases. A bi-directional link between two data items can be represented by two directed connections between those data items.
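
As a rough illustration of these two points, the following C++ sketch models a directed graph that may contain cycles and represents a bi-directional link as two directed connections. ToyDigraph and its methods are hypothetical names chosen for this example; they are not part of the e4Graph API.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model, NOT the e4Graph API.
class ToyDigraph {
public:
    // Add a one-way connection from `from` to `to`.
    void Connect(const std::string &from, const std::string &to) {
        assert(!from.empty() && !to.empty());
        edges_[from].insert(to);
    }

    // A bi-directional link is simply two directed connections.
    void ConnectBothWays(const std::string &a, const std::string &b) {
        Connect(a, b);
        Connect(b, a);
    }

    bool HasEdge(const std::string &from, const std::string &to) const {
        auto it = edges_.find(from);
        return it != edges_.end() && it->second.count(to) > 0;
    }

    // True if `to` can be reached from `from` by following one or more
    // connections. The visited set keeps the walk finite on cyclic graphs.
    bool Reaches(const std::string &from, const std::string &to) const {
        std::set<std::string> visited;
        std::vector<std::string> pending{from};
        while (!pending.empty()) {
            const std::string current = pending.back();
            pending.pop_back();
            auto it = edges_.find(current);
            if (it == edges_.end()) continue;
            for (const auto &next : it->second) {
                if (next == to) return true;
                if (visited.insert(next).second) pending.push_back(next);
            }
        }
        return false;
    }

private:
    std::map<std::string, std::set<std::string>> edges_;
};

// Usage: a three-item cycle is stored directly as three connections.
void DigraphExample() {
    ToyDigraph g;
    g.Connect("store", "sugar");
    g.Connect("sugar", "supplier");
    g.Connect("supplier", "store");
    assert(g.Reaches("store", "store"));   // the cycle leads back to the start
    assert(!g.HasEdge("sugar", "store"));  // connections are one-way
    g.ConnectBothWays("sugar", "flour");
    assert(g.HasEdge("sugar", "flour") && g.HasEdge("flour", "sugar"));
}
```

Note that Reaches(x, x) is true only when x lies on a cycle, since at least one connection has to be followed.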

Your program can open any number of persistent storages at a given time, up to the resource limitations imposed by the hosting operating system and machine. This means that your program can partition its data into separate persistent storages to divide it into "consistency units". All data stored in a single storage is guaranteed to be consistent, while relationships across storage units may not be consistent at all times. This feature also allows your program to use e4Graph data stored in different databases through a uniform interface.

e4Graph Concepts

The concepts embodied by e4Graph are necessarily related to each other. The description below therefore sometimes mentions a concept before fully explaining it. Later, it loops back to provide a complete explanation.

As already stated, data is organized into persistent storages represented by the C++ class e4_Storage. Each storage has a name and is controlled by a specific database driver; its internal layout follows the rules imposed by that database driver. For storages managed by Metakit, the persistent data is contained in files on your machine's disk, and each storage is named after the file that contains it.

A storage maintains a list of markers, represented by the C++ class e4_Marker. Each marker has a name which is unique within its storage, and marks a node in the stored data. A storage can contain any number of markers, and markers can mark disconnected graphs, which means that it may not be possible to traverse from the node marked by one marker to any other node marked by another marker.

A node is represented by the C++ class e4_Node. Nodes represent clusters of related data items. A node may be marked by zero or more markers, and may contain zero or more vertices. Vertices store data items, and one kind of data item is a node; this allows the formation of graphs of nodes connected by vertices. A node A is a parent of a node B if A contains a vertex whose value is the node B. Thus, a node may also have zero or more parents. A node is unreachable if it has zero parents and is marked by zero markers. Such nodes are automatically and silently removed from the storage, and their memory is recycled. The e4Graph library provides methods for traversing from a node through a vertex, from a node to any of its markers and from a node to any of its parents.
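
The rule for unreachable nodes can be sketched as follows. This is a toy model and not the e4Graph API or its implementation: ToyNodeStore and its methods are hypothetical, nodes are plain integer ids, and removal is triggered by an explicit Sweep() call rather than happening automatically. The sketch also assumes that removing a node removes its vertices, so that its children may become parentless in turn.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy model, NOT the e4Graph API.
class ToyNodeStore {
public:
    // Create an empty node and return its id.
    int NewNode() {
        nodes_[next_id_];
        return next_id_++;
    }

    bool Exists(int node) const { return nodes_.count(node) > 0; }

    // Add a vertex named `name` to `parent` whose value is the node `child`.
    void AddNodeVertex(int parent, const std::string &name, int child) {
        assert(Exists(parent) && Exists(child));
        nodes_[parent].push_back(Vertex{name, child});
    }

    void Mark(const std::string &marker, int node) {
        assert(Exists(node));
        markers_[marker] = node;
    }

    void Unmark(const std::string &marker) { markers_.erase(marker); }

    // Number of distinct nodes containing a vertex whose value is `node`.
    int ParentCount(int node) const {
        int count = 0;
        for (const auto &entry : nodes_) {
            for (const auto &vertex : entry.second) {
                if (vertex.child == node) {
                    ++count;
                    break;
                }
            }
        }
        return count;
    }

    // Remove every node with zero parents and zero markers, repeating
    // until nothing changes. Returns the number of nodes removed.
    int Sweep() {
        int removed = 0;
        bool changed = true;
        while (changed) {
            changed = false;
            for (auto it = nodes_.begin(); it != nodes_.end();) {
                if (ParentCount(it->first) == 0 && !IsMarked(it->first)) {
                    it = nodes_.erase(it);
                    ++removed;
                    changed = true;
                } else {
                    ++it;
                }
            }
        }
        return removed;
    }

private:
    struct Vertex {
        std::string name;
        int child;
    };

    bool IsMarked(int node) const {
        for (const auto &marker : markers_) {
            if (marker.second == node) return true;
        }
        return false;
    }

    std::map<int, std::vector<Vertex>> nodes_;
    std::map<std::string, int> markers_;
    int next_id_ = 0;
};

// Usage: a marked node keeps its children; once the marker is gone, the
// node and then its child are removed.
void NodeStoreExample() {
    ToyNodeStore s;
    const int items = s.NewNode();
    const int sugar = s.NewNode();
    s.Mark("on hand items", items);
    s.AddNodeVertex(items, "sugar", sugar);
    assert(s.Sweep() == 0);
    s.Unmark("on hand items");
    assert(s.Sweep() == 2);
    assert(!s.Exists(items) && !s.Exists(sugar));
}
```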

Vertices are the last concept left to explain. A vertex is a named piece of data. Each vertex is contained within a node, and has a name, type and associated data. A vertex's name need not be unique within its node, and can be an ASCII string of any length greater than zero. Vertices are arranged in a sibling relationship, and the e4Graph library provides methods to traverse from a vertex to its containing node and to the preceding and following vertex within its containing node. A vertex can contain data of a variety of types, including integers, 64-bit floating point numbers, null-terminated strings and binary uninterpreted data of arbitrary size. Additionally, a vertex value can be another node.
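
One way to picture a vertex is as a name paired with a tagged value. The C++17 sketch below models that with std::variant. It is only an illustration: ToyVertex, ToyNode, ToyNodeRef and the "nth occurrence" lookup are hypothetical and do not describe how e4Graph stores or exposes vertices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical toy model, NOT the e4Graph API.
struct ToyNodeRef { int id; };  // stands in for "the value is another node"
using ToyBytes = std::vector<unsigned char>;
using ToyValue = std::variant<int, double, std::string, ToyBytes, ToyNodeRef>;

struct ToyVertex {
    std::string name;  // non-empty; need not be unique within its node
    ToyValue value;
};

class ToyNode {
public:
    // Append a vertex after the existing ones.
    void Append(const std::string &name, ToyValue value) {
        assert(!name.empty());
        vertices_.push_back(ToyVertex{name, std::move(value)});
    }

    // Find the nth (1-based) vertex with this name, since names may
    // repeat. Returns nullptr if there is no such vertex.
    const ToyVertex *Find(const std::string &name, int nth = 1) const {
        for (const auto &vertex : vertices_) {
            if (vertex.name == name && --nth == 0) return &vertex;
        }
        return nullptr;
    }

    // Typed read: succeeds only if the vertex exists and holds a T.
    template <typename T>
    bool Get(const std::string &name, T &out, int nth = 1) const {
        const ToyVertex *vertex = Find(name, nth);
        if (vertex == nullptr) return false;
        const T *held = std::get_if<T>(&vertex->value);
        if (held == nullptr) return false;
        out = *held;
        return true;
    }

    std::size_t Count() const { return vertices_.size(); }

private:
    std::vector<ToyVertex> vertices_;  // sibling order is insertion order
};

// Usage: two vertices may share a name, and a read with the wrong type fails.
void VertexExample() {
    ToyNode sugar;
    sugar.Append("pounds in store", 40);
    sugar.Append("price per pound", 0.25);
    sugar.Append("supplier", std::string("first supplier"));
    sugar.Append("supplier", std::string("second supplier"));

    int pounds = 0;
    double price = 0.0;
    std::string supplier;
    assert(sugar.Get("pounds in store", pounds) && pounds == 40);
    assert(sugar.Get("supplier", supplier, 2) && supplier == "second supplier");
    assert(!sugar.Get("pounds in store", price));  // holds an int, not a double
}
```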

e4Graph supports direct representation of relationships only within a single storage: a marker can only mark a single node, and that node must be within the same storage that contains the marker. All of the vertices within a node are contained within the storage containing the node, as must its parents and all the markers marking this node. Of course, your program is free to implicitly represent relationships between pieces of data stored within different storages. However, e4Graph does not know about these higher level relationships and does not extend any guarantee of consistency to their representation. For example, your program may choose to represent cross-storage relationships as URLs, and provide a mechanism for following such links automatically.
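
A program-level cross-storage link might look like the following C++ sketch. Everything here is an assumption made for illustration: the link format "<storage name>#<marker name>", the ToyLink type, and the ToyStorages map that stands in for a set of open storages. None of it is provided by e4Graph.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical link format, invented for this sketch: "<storage>#<marker>".
struct ToyLink {
    std::string storage;
    std::string marker;
};

// Split a link into its two parts. Returns false if either part is missing.
bool ParseToyLink(const std::string &text, ToyLink &out) {
    const std::string::size_type hash = text.find('#');
    if (hash == std::string::npos || hash == 0 || hash + 1 == text.size()) {
        return false;
    }
    out.storage = text.substr(0, hash);
    out.marker = text.substr(hash + 1);
    return true;
}

// Stand-in for a set of open storages:
// storage name -> (marker name -> id of the marked node).
using ToyStorages = std::map<std::string, std::map<std::string, int>>;

// Follow a link. Returns false if the link is malformed or dangling.
// Nothing keeps the target in place, so callers must handle that case.
bool FollowToyLink(const ToyStorages &open, const std::string &text,
                   int &node) {
    ToyLink link;
    if (!ParseToyLink(text, link)) return false;
    auto storage = open.find(link.storage);
    if (storage == open.end()) return false;
    auto marker = storage->second.find(link.marker);
    if (marker == storage->second.end()) return false;
    node = marker->second;
    return true;
}

// Usage: a link resolves while its target exists and dangles afterwards.
void CrossStorageLinkExample() {
    ToyStorages open;
    open["suppliers"]["sugar suppliers"] = 12;
    int node = -1;
    assert(FollowToyLink(open, "suppliers#sugar suppliers", node));
    assert(node == 12);
    open["suppliers"].erase("sugar suppliers");
    assert(!FollowToyLink(open, "suppliers#sugar suppliers", node));
}
```

The failing second lookup is the point of the example: since the link lives outside any single storage's consistency guarantee, the program that defines it also has to decide what a dangling link means.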

e4Graph and Programming Languages

You use e4Graph through an application programming interface (API) in your favorite programming language. The library comes with a C++ interface, and can be used directly from C++ programs. It also provides a very efficient Tcl binding that allows use of e4Graph in Tcl programs, and I am working on a Java binding through JNI. I welcome use of e4Graph in other programming languages. If you implement a binding to your favorite language, please drop me some email and let me know; I am open to including your binding in a future release. Python and PHP bindings would be really nice to have, too!

The Rest

There's some more documentation:

Acknowledgments

Any piece of software is a labor of love, a birth. Many people are involved directly and indirectly. As the primary author of e4Graph, I've had the fortune to have many friends whose support helped make this all happen. Thank you all!

First and foremost, thanks to my family. I recognize the signs of an obsession; it's when even your eleven-year-old daughter, who doesn't program, knows what e4Graph is and when it's going to be released. :) Thanks to my wife, son and daughter for tolerating my michegas.

Thanks to Jean-Claude Wippler for help in ferreting out many of the ideas and concepts embodied by e4Graph. Jean-Claude also provided excellent support for his exceptional product, the Metakit package. Jean-Claude's patient help made it possible for this software to see the light of day.

Andreas Kupries provided valuable insight into how to generalize an earlier version of this package. At that time it only dealt with trees. Andreas suggested a way to generalize it to handle directed graphs, and as a consequence the package was renamed to e4Graph.

Jeffrey Hobbs of Ajuba Solutions provided extensive help in clarifying the concepts behind the Tcl binding. Without Jeff's help the Tcl binding would not have reached maturity so quickly. Jeff also helped me overcome a couple of tight spots in implementing the Tcl binding.

My employer, Sun Microsystems Inc., deserves special thanks for allowing me to release this hobby project. The idea has been knocking around in my head for a while, and Sun graciously tolerated my after-hours work on its implementation. Sun has no connection whatsoever to this package, and I bear complete and exclusive responsibility for the ideas, concepts and implementation of e4Graph.

Thanks to Jean-Claude Wippler for Metakit, John Ousterhout for Tcl, Sun Microsystems Inc. for Java, and James Clark for expat. The e4Graph library depends on all of these packages.