A Guided Tour of Sample Code


The libGiST package is a C++ library that provides the basic GiST functionality. By itself it does nothing, since some of the important methods are pure virtual. However, with a little coding on your part, you can make it index data in a variety of ways, and support just about any query.

We have provided some examples to get you started.  The RTree directory contains code which uses libGiST as the basis for a simple R-tree-like storage/retrieval system that indexes 2-dimensional box data, and supports natural geographic queries.  The RSTree directory provides the same functionality as the RTree directory, but uses the more efficient (and complex) R*-tree techniques invented by Beckman, et al. [BKSS90]. The BTree directory contains code which uses libGiST to implement a simple B+-tree storage/retrieval system that indexes textual data, to support alphabetic equality and range queries.

As an introduction to libGiST, we will step through the header files for each of these small projects, which will introduce you to increasingly powerful features of libGiST. We will go through the simplest project, RTree, in some detail, and then just describe the features of RSTree and BTree that are different from RTree.  Probably the easiest way to use libGiST is to cannibalize some of this sample code, so it's worth looking at carefully.


RTree

R-trees were first proposed by Guttman [Gut84] to index geographic data (points, lines, and polygons), in order to efficiently find data that satisfy geographic predicates (i.e., key {contains,contained-in,overlaps,equals} constant). The keys in a (2D) R-tree are "bounding boxes", i.e. rectangles which are aligned with the x and y axes. A key signifies that all objects stored below it fall within that geographic area. If you want more detail on R-trees, Guttman's paper makes for fairly accessible reading.

In order to extend libGiST to support R-tree behavior, we need four of the basic GiST methods mentioned in the original GiST paper:

We will also need various utility methods like constructors and destructors for our RTree classes, which inherit most (but not all!) of their behavior from libGiST. We won't talk about the constructors and destructors, which are pretty straightforward, but the rest of the simple utility methods are as follows (organized by file):

A quick look at the .cpp files will show that there's little more in them than what's described above -- the only method of any complexity is the PickSplit algorithm, which is taken directly from Guttman's paper.


RSTree

The R*-tree is very much like the R-tree, having identical keys and supporting identical query predicates. But it contains some optimizations that often make t outperform simple R-trees during lookup, by paying for some additional complexity during insertion. First, it has different Penalty and PickSplit methods. Second, it uses a technique known as Forced Reinsert to keep the tree fuller and better balanced than an R-tree.

Forced reinsert works as follows -- when an entry is inserted into a full node, some percentage of the entries on that node are deleted, and reinserted into the tree. If the node is at a non-leaf level, the reinsertion places the deleted entries back into nodes at that level.  If reinsertion discovers a full node, then it splits it as usual. This sporadic reinsertion of elements accomodates changes in the data distribution as the tree fills in.

Forced reinsert is supported by libGiST, and the RSTree implementation simply takes advantage of it.  The PickSplit and Penalty methods for RSTree are a bit complicated, as proposed by the inventors of the R*-tree.

These are the only significant differences between RSTree and RTree.


BTree

B+-trees have 3 fundamental differences from R-trees:

  1. the keys in a B+-tree are in ascending order from left to right on each page.
  2. range scans in a B+-tree do not traverse the tree; rather, they find the leftmost item and scan across the leaves of the tree until they exceed the right boundary of the range.
  3. in a B+-tree node, there are 1 fewer keys than there are pointers

The first of these two issues are handled in our BTree implementation by informing libGiST that the tree has the IsOrdered property.  This is described below.  The third of these issues is handled via key compression: two virtual methods of GiSTentry -- Compress and Decompress -- were not used in our RTree implementation, but are added for BTree. They allow keys to be compressed on disk to take up less space and improve fanout.  When keys are read off disk, they are decompressed, and before they are written back, they are compressed. The BTree implementation essentially compresses the first key on a node to nothing, which effectively gives the standard B+-tree layout of n-1 keys for n pointers.

We now describe the salient parts of the BTree implementation that make it differ from the RTree and RSTree implementations.  First, of course, the keys in a BTree are different -- in uncompressed format, they are alphabetic ranges of words (our demo works on textual data -- see the BTkey class in BTentry.h). The operators that can appear in BTpredicates are the standard ordering operators: <, <=, =, >=, >; otherwise, BTpredicates are very similar to RTpredicates.  Further differences are described by file:

Otherwise there should be no great surprises in the BTree code.


Writing your own methods

If you've made it this far, you should be ready to develop your own indexing code using libGiST.  If you have ordered data and mostly are supporting range predicates, you should probably start with the BTree sample code and modify it to suit.  If you have unordered data, it's probably best to start with the RTree code, and when you get things working, try adding Forced Resinsert (a la RSTree) and see if it helps.


1In v0.9b1 range queries are handled depth-first even if IsOrdered() returns 1. This is a bug, which we hope to fix in a subsequent release.