Reference Work Entry

Encyclopedia of Database Systems

pp 3585-3591

XML Indexing

  • Xin Luna Dong, AT&T Labs–Research
  • Divesh Srivastava, AT&T Labs–Research

Definition

XML employs an ordered, tree-structured model for representing data. Queries in XML languages like XQuery employ twig queries to match relevant portions of data in an XML database. An XML Index is a data structure that is used to efficiently look up all matches of a fragment of the twig query, where some of the twig query fragment nodes may have been mapped to specific nodes in the XML database.

Historical Background

XML path indexing is related to the problem of join indexing in relational database systems [15] and path indexing in object-oriented database systems (see, e.g., [1,9]). These index structures assume that the schema is homogeneous and known; these assumptions do not hold in general for XML data. The DataGuide [7] was the first path index designed specifically for XML data, where the schema may be heterogeneous and may not even be known.

Foundations

Notation

An XML document d is a rooted, ordered, node-labeled tree, where (i) each node corresponds to an XML element, an XML attribute, or a value; and (ii) each edge corresponds to an element-subelement, element-attribute, element-value, or an attribute-value relationship. Non-leaf nodes in d correspond to XML elements and attributes, and are labeled by the element tags or attribute names, while leaf nodes in d correspond to values. For the example XML document of Fig. 1a, its tree representation is shown in Fig. 1b. Each node is associated with a unique number referred to as its Id, as depicted in Fig. 1b. An XML database D is a set of XML documents.
XML Indexing. Figure 1

(a) An XML database fragment. (b) XML tree (Id numbering).

Queries in XML languages like XPath and XQuery make fundamental use of twig queries to match relevant portions of data in an XML database. A twig query Q is a node- and edge-labeled tree, where (i) nodes are labeled by element tags, attribute names, or values; and (ii) edges are labeled by an XPath axis step, e.g., child, descendant, or following-sibling. For example, the twig query in Fig. 2a corresponds to the path expression /book[./descendant::author[./following-sibling::author[fn="jane"]][ln="poe"]].
XML Indexing. Figure 2

(a) Twig query. (b) Twig query fragments.

Given an XML database D, and a twig query Q, a match of Q in D is identified by a mapping from nodes in Q to nodes in D, such that: (i) Q’s node labels (i.e., element tags, attribute-names and values) are preserved under the mapping; and (ii) Q’s edge labels (i.e., XPath axis steps) are satisfied by the corresponding pair of nodes in D under the mapping. For example, the twig query of Fig. 2a matches a root book element that has a descendant author element that (i) has a child ln element with value poe; and (ii) has a following sibling author element that has a child fn element with value jane. Thus, the book element in Fig. 1b is a match of the twig query in Fig. 2a.

An XML Index I is a data structure that is used to efficiently look up all matches of a fragment of the twig query Q, where some of the twig query fragment nodes may have been mapped to specific nodes in the XML database. Some fragments of the twig query of Fig. 2a are shown in Fig. 2b and include the edges book/descendant::author and author/following-sibling::author, and the path book/descendant::author[ln="poe"]. The matches returned by XML Index lookups on different fragments of a twig query Q can be "stitched together" using query processing algorithms to compute matches to Q.

The following sections describe various techniques that have been proposed in the literature to index node, edge, path and twig fragments of twig queries.

Node Indexes

When the fragment of a twig query Q that needs to be looked up in the index is a single node labeled by an element tag, an attribute name, or a value, a classical inverted index is adequate. This index is constructed by associating each element tag (or attribute name, value) with the list of all the node Ids in the XML database with that element tag (attribute name, value, respectively). For example, given the data of Fig. 1b, the list associated with the element tag author would be [7,27,47], and the list associated with the value jane would be [9,49].
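The inverted index described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular system: the names NODES, build_node_index and NODE_INDEX are hypothetical, and the node list is partial, containing only nodes whose Ids are quoted in this entry.

```python
from collections import defaultdict

# Partial (Id, label) list: only nodes whose Ids are quoted in this entry.
# It is NOT the complete tree of Fig. 1b.
NODES = [
    (1, "book"), (6, "allauthors"),
    (7, "author"), (8, "fn"), (9, "jane"), (12, "ln"),
    (27, "author"), (47, "author"), (49, "jane"),
]

def build_node_index(nodes):
    """Map each label (element tag, attribute name, or value)
    to the sorted list of Ids of the nodes carrying that label."""
    index = defaultdict(list)
    for node_id, label in sorted(nodes):
        index[label].append(node_id)
    return dict(index)

NODE_INDEX = build_node_index(NODES)
print(NODE_INDEX["author"])  # [7, 27, 47]
print(NODE_INDEX["jane"])    # [9, 49]
```

Keeping each list sorted by Id is a common choice because it lets later join steps merge lists in a single pass.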

Positional Numberings, Edge Indexes

Now consider the case when the fragment of a twig query Q that needs to be looked up in the index is an edge labeled by an XPath axis step. For example, find all author nodes that are descendants of the book node with Id = 1. As another example, find all author nodes that are following-siblings of the author node with Id = 7.

Using Node Ids

A simple solution is to use an inverted index that associates each (node1 label, node2 label, XPath axis) triple to the list of all pairs of XML database node Ids that satisfy the specified node labels and axis relationship. For example, given the data of Fig. 1b, the list associated with the triple (book, author, descendant) would be [(1,7),(1,27),(1,47)], and the list associated with the triple (author, author, following-sibling) would be [(7,27),(7,47),(27,47)].
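As a rough sketch of this triple-keyed index, the Python code below materializes the descendant and following-sibling pairs for a partial tree. The tree fragment is an assumption pieced together from the Ids quoted in this entry (book 1, allauthors 6, and the three author nodes), and the names TREE and build_edge_index are hypothetical.

```python
from collections import defaultdict

# node Id -> (label, parent Id). Partial tree assumed from the Ids quoted
# in the text; Ids are assumed to increase in document order.
TREE = {
    1: ("book", None),
    6: ("allauthors", 1),
    7: ("author", 6),
    27: ("author", 6),
    47: ("author", 6),
}

def build_edge_index(tree):
    """Map (label1, label2, axis) to the list of (Id1, Id2) pairs."""
    index = defaultdict(list)
    ids = sorted(tree)
    for n2 in ids:
        label2, parent = tree[n2]
        # descendant axis: pair n2 with every proper ancestor
        anc = parent
        while anc is not None:
            index[(tree[anc][0], label2, "descendant")].append((anc, n2))
            anc = tree[anc][1]
        # following-sibling axis: pair n2 with every earlier sibling
        if parent is not None:
            for n1 in ids:
                if n1 < n2 and tree[n1][1] == parent:
                    key = (tree[n1][0], label2, "following-sibling")
                    index[key].append((n1, n2))
    return {key: sorted(pairs) for key, pairs in index.items()}

EDGE_INDEX = build_edge_index(TREE)
print(EDGE_INDEX[("book", "author", "descendant")])
# [(1, 7), (1, 27), (1, 47)]
print(EDGE_INDEX[("author", "author", "following-sibling")])
# [(7, 27), (7, 47), (27, 47)]
```

Every node contributes one pair per ancestor and one pair per earlier sibling, which is why the materialized lists can grow much faster than the number of nodes.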

In general, the number of entries in this inverted index could be much larger than the number of nodes in the original XML database, especially for XPath axes such as descendant, following-sibling and following. To overcome this limitation, more sophisticated approaches are required. A popular approach has been to (i) use a positional numbering system to identify nodes in an XML database; and (ii) demonstrate that each XPath axis step corresponds to a predicate on the positional numbers of the corresponding nodes. Two such approaches, using Dewey numbering and using Interval numbering, are described next.

Using Dewey Numbering

An elegant solution for edge indexing is to associate each XML node n with its DeweyId, proposed in [14] and obtained as follows: (i) associate each node with a numeric Id ensuring that sibling nodes are given increasing numbers in a left-to-right order; and (ii) the DeweyId of a node is obtained by concatenating the Ids of all nodes along the path from the root node of n’s XML document to n itself (the similarity with the Dewey Decimal System of library classification is the reason for its name). Figure 3a shows the DeweyIds of some nodes in the XML tree, using the numeric Ids associated with those nodes in Fig. 1b. For example, the fn node with Id = 8 has DeweyId = 1.6.7.8.
XML Indexing. Figure 3

(a) Dewey numbering. (b) Interval numbering.

DeweyIds can be used to easily find matches to various XPath axis steps. In particular, node n2 is a descendant of node n1 if and only if n1.DeweyId is a proper prefix of n2.DeweyId. For example, in Fig. 3a, the jane node with DeweyId = 1.6.7.8.9 is a descendant of the author node with DeweyId = 1.6.7. Similarly, node n2 is a child of node n1 if and only if (i) n2 is a descendant of n1; and (ii) n2’s DeweyId extends n1’s DeweyId by one Id. By maintaining DeweyIds of nodes in classical trie data structures, various XPath axis lookups can be done efficiently.
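The two DeweyId tests can be written as simple prefix checks. The sketch below is illustrative only; it assumes DeweyIds are held as tuples of integers, and the helper names parse_dewey, is_descendant and is_child are hypothetical.

```python
def parse_dewey(text):
    """Turn a dotted DeweyId such as '1.6.7' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def is_descendant(n1, n2):
    """n2 is a descendant of n1 iff n1 is a proper prefix of n2."""
    return len(n1) < len(n2) and n2[:len(n1)] == n1

def is_child(n1, n2):
    """n2 is a child of n1 iff n2 extends n1 by exactly one Id."""
    return len(n2) == len(n1) + 1 and n2[:len(n1)] == n1

author = parse_dewey("1.6.7")
fn = parse_dewey("1.6.7.8")
jane = parse_dewey("1.6.7.8.9")

print(is_descendant(author, jane))  # True
print(is_child(author, fn))         # True
print(is_child(author, jane))       # False
```

Comparing components rather than raw strings avoids false matches such as treating 1.6.70 as a descendant of 1.6.7.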

The main limitation of DeweyIds is that their size depends on the depth of the XML tree, and can get quite large. This limitation is overcome by using Interval numbering, described next.

Using Interval Numbering

The position of an XML node n is represented as a 3-tuple: (LeftPos, RightPos, PLeftPos), where (i) numbers are generated in an increasing order by visiting each tree node twice in a left-to-right, depth-first traversal; n.LeftPos is the number generated before visiting any node in n’s subtree and n.RightPos is the number generated after visiting every node in n’s subtree; and (ii) n.PLeftPos is the LeftPos of n’s parent node (0 if n is the root node of its XML document). Figure 3b depicts the LeftPos and RightPos numbers of each node in the XML document.
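The numbering scheme can be sketched as one recursive depth-first traversal that draws numbers from a shared counter. The code below is an illustration under stated assumptions: the function name interval_number is hypothetical, the author subtree shape (fn with value jane, ln with value poe) is inferred from Table 1, and the counter start of 7 with a parent LeftPos of 6 is chosen so that the result lines up with the tuples quoted in this entry.

```python
from itertools import count

def interval_number(node, counter, parent_left=0, out=None):
    """Assign (LeftPos, RightPos, PLeftPos) to every node of a subtree.
    `node` is a (label, [children]) pair; `counter` yields increasing ints."""
    out = [] if out is None else out
    label, children = node
    left = next(counter)              # generated before visiting the subtree
    entry = [label, left, None, parent_left]
    out.append(entry)
    for child in children:
        interval_number(child, counter, left, out)
    entry[2] = next(counter)          # generated after visiting the subtree
    return out

# Assumed subtree of the author node (shape inferred from Table 1).
AUTHOR = ("author", [("fn", [("jane", [])]), ("ln", [("poe", [])])])

ROWS = [tuple(e) for e in interval_number(AUTHOR, count(7), parent_left=6)]
for row in ROWS:
    print(row)
# ('author', 7, 16, 6)
# ('fn', 8, 11, 7)
# ('jane', 9, 10, 8)
# ('ln', 12, 15, 7)
# ('poe', 13, 14, 12)
```

For a whole document one would start with count(1) and the default parent_left of 0 for the root.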

It can be seen that each XPath axis step between a pair of XML database nodes can be tested using a conjunction of equality and inequality predicates on the components of the 3-tuple. In particular, node n2 is a descendant of node n1 if and only if: (i) n1.LeftPos < n2.LeftPos; and (ii) n1.RightPos > n2.RightPos. An element n2 is a child of an element n1 if and only if n1.LeftPos = n2.PLeftPos. An element n2 is a following-sibling of an element n1 if and only if: (i) n1.RightPos < n2.LeftPos; and (ii) n1.PLeftPos = n2.PLeftPos. For example, in Fig. 3b, the jane node with interval number (9,10,8) is a descendant of the author node with interval number (7,16,6).
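These predicates translate directly into code. In the sketch below, the Pos type and the function names are hypothetical; the author and jane tuples are the ones quoted above, while the fn and ln tuples are derived under the assumption that the author node has exactly those two children. All nodes are assumed to belong to the same document.

```python
from typing import NamedTuple

class Pos(NamedTuple):
    left: int    # LeftPos
    right: int   # RightPos
    pleft: int   # PLeftPos (LeftPos of the parent, 0 for a root)

def is_descendant(n1, n2):
    """n2 lies strictly inside the interval of n1."""
    return n1.left < n2.left and n1.right > n2.right

def is_child(n1, n2):
    """The parent pointer of n2 refers to n1."""
    return n1.left == n2.pleft

def is_following_sibling(n1, n2):
    """n2 starts after n1 ends, and both share a parent."""
    return n1.right < n2.left and n1.pleft == n2.pleft

author = Pos(7, 16, 6)   # quoted in the text
jane = Pos(9, 10, 8)     # quoted in the text
fn = Pos(8, 11, 7)       # derived, assuming author has children fn and ln
ln = Pos(12, 15, 7)      # derived, same assumption

print(is_descendant(author, jane))   # True
print(is_child(author, fn))          # True
print(is_following_sibling(fn, ln))  # True
print(is_following_sibling(ln, fn))  # False
```

Each test is a conjunction of comparisons on tuple components, which is what allows an axis step to be answered as a range query over a multidimensional index.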

Thus, the set of 3-tuples corresponding to the interval numbering of nodes of an XML database can be indexed using a 3-dimensional spatial index such as an R-tree, and the different XPath axis steps correspond to different regions within the 3-dimensional space. Variations of this approach have been considered in, e.g., [10,2,8].

Note that for both Dewey numbering and Interval numbering, one would need to leave gaps between numbers to allow for insertions of new nodes in the XML database [5].

Path Indexes

When the fragment of a twig query Q that needs to be looked up in the index is a subpath of a root-to-leaf path in Q, where some (possibly none) of the nodes have been mapped to specific nodes in the XML database, XML path indexes are very useful. The works in the literature have primarily focused on the case where each edge in the subpath of Q is labeled by child, i.e., all matches are subpaths of root-to-leaf paths in an XML document. For example, a path index can be used to efficiently look up all matches to the path fragment author[ln="poe"] of the twig query depicted in Fig. 2a.

A framework by Chen et al. [3] is described next, which covers most existing XML path index structures, and solves the BoundIndex problem.

Problem BoundIndex: Given an XML database D, a subpath query P with k node labels and each edge labeled by child, and a specific database node Id n, return all k-tuples (n1, ..., nk) that identify matches of P in D, rooted at node n.

The framework of [3] requires each node in an XML document to be associated with a unique numeric identifier; this could be, e.g., the Id of the node in Fig. 1b. To create a path index, [3] conceptually separates a path in an XML document into two parts: (i) a schema path, which consists solely of schema components, i.e., element tags and attribute names; and (ii) a leaf value as a string if the path reaches a leaf. Schema paths can be dictionary-encoded using special characters (whose lengths depend on the dictionary size) as designators for the schema components. Most of the works in the literature have focused on indexing XML schema paths (see, e.g., [7,12,4]). Notable exceptions that also consider indexing data values at the leaves of paths include [6,17,3].

In order to solve the BoundIndex problem, one needs to explicitly represent paths that are arbitrary subpaths of the root-to-leaf paths, and associate each such path with the node at which the subpath is rooted. Such a relational representation of all the paths in an XML database is (HeadId, SchemaPath, LeafValue, IdList), where HeadId is the id of the start of the path, and IdList is the list of all node identifiers along the schema path, except for the HeadId. As an example, a fragment of the 4-ary relational representation of the data tree of Fig. 1b is given in Table 1; element tags have been encoded using boldface characters as designators, based on the first character of the tag, except for allauthors which uses U as its designator.
XML Indexing. Table 1

The 4-ary relation for path indexes

HeadId  SchemaPath  LeafValue  IdList
1       B           null       []
1       BT          null       [2]
1       BT          XML        [2]
1       BU          null       [6]
1       BUA         null       [6,7]
1       BUAF        null       [6,7,8]
1       BUAF        jane       [6,7,8]
1       BUAL        null       [6,7,12]
1       BUAL        poe        [6,7,12]
...
6       U           null       []
6       UA          null       [7]
6       UAF         null       [7,8]
6       UAF         jane       [7,8]
6       UAL         null       [7,12]
6       UAL         poe        [7,12]
...

Given the 4-ary relational representation of XML database D, each index in the family of indexes: (i) stores a subset of all possible SchemaPaths in D; (ii) stores a sublist of IdList; and (iii) indexes a subset of the columns HeadId, SchemaPath, and LeafValue.
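To make the 4-ary relation concrete, the sketch below generates its rows for a partial tree by walking from every node up to each of its ancestors. This is an illustration only: the names TREE and path_rows are hypothetical, the tree fragment covers just the nodes that appear in Table 1, and Python's None stands in for null.

```python
# node Id -> (designator, parent Id, leaf value or None).
# Partial tree restricted to the nodes that appear in Table 1.
TREE = {
    1: ("B", None, None),    # book
    2: ("T", 1, "XML"),      # title
    6: ("U", 1, None),       # allauthors
    7: ("A", 6, None),       # author
    8: ("F", 7, "jane"),     # fn
    12: ("L", 7, "poe"),     # ln
}

def path_rows(tree):
    """Return (HeadId, SchemaPath, LeafValue, IdList) rows for every
    downward path in the tree, with and without the leaf value."""
    rows = []
    for end_id in tree:
        leaf = tree[end_id][2]
        schema, below_head = "", []
        cur = end_id
        while cur is not None:            # cur plays the role of HeadId
            designator, parent, _ = tree[cur]
            schema = designator + schema
            rows.append((cur, schema, None, list(below_head)))
            if leaf is not None:
                rows.append((cur, schema, leaf, list(below_head)))
            below_head.insert(0, cur)     # cur joins IdList for higher heads
            cur = parent
    # order by HeadId, SchemaPath, then LeafValue with null first
    return sorted(rows, key=lambda r: (r[0], r[1], r[2] or ""))

ROWS = path_rows(TREE)
for row in ROWS:
    if row[0] == 6:
        print(row)
# (6, 'U', None, [])
# (6, 'UA', None, [7])
# (6, 'UAF', None, [7, 8])
# (6, 'UAF', 'jane', [7, 8])
# (6, 'UAL', None, [7, 12])
# (6, 'UAL', 'poe', [7, 12])
```

A node at depth d contributes rows for each of its d ancestors-or-self, so the full relation is larger than the document; this is the space cost that the index family trades against lookup power.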

Given a query, the index structure probes the indexed columns in (iii) and returns the sublist of IdList stored in the index entries. Many existing indexes fit in this framework, as summarized in Table 2. For example, the DataGuide [7] returns the last Id of the IdList for every prefix of a root-to-leaf path. Similarly, Index Fabric [6] returns the Id of either the root or the leaf element (first or last Id in IdList), given a root-to-leaf path and the value of the leaf element. Finally, the DATAPATHS index [3] is a regular B+-tree index on the concatenation of HeadId, LeafValue and the reverse of SchemaPath (or the concatenation LeafValue⋅HeadId⋅ReverseSchemaPath), where the SchemaPath column stores all subpaths of root-to-leaf paths, and the complete IdList is returned; the DATAPATHS index can solve the BoundIndex problem in one index lookup.
XML Indexing. Table 2

Members of family of path indexes

Index              Subset of SchemaPath        Sublist of IdList      Indexed Columns
Value [11]         paths of length 1           only last Id           SchemaPath, LeafValue
Forward link [11]  paths of length 1           only last Id           HeadId, SchemaPath
DataGuide [7]      root-to-leaf path prefixes  only last Id           SchemaPath
Index Fabric [6]   root-to-leaf paths          only first or last Id  reverse SchemaPath, LeafValue
ROOTPATHS [3]      root-to-leaf path prefixes  full IdList            LeafValue, reverse SchemaPath
DATAPATHS [3]      all paths                   full IdList            LeafValue, HeadId, reverse SchemaPath
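A DATAPATHS-style lookup for the BoundIndex problem can be approximated with a sorted list of composite keys, standing in for the B+-tree. The sketch below makes several simplifying assumptions: the names make_key and bound_lookup are hypothetical, only a handful of rows from Table 1 are loaded, the key order is HeadId, LeafValue, reverse SchemaPath, and a null LeafValue is stored as the empty string so that keys stay comparable.

```python
import bisect

# A few (HeadId, SchemaPath, LeafValue, IdList) rows copied from Table 1.
ROWS = [
    (1, "BUAF", None, [6, 7, 8]),
    (1, "BUAF", "jane", [6, 7, 8]),
    (1, "BUAL", None, [6, 7, 12]),
    (1, "BUAL", "poe", [6, 7, 12]),
    (6, "UAL", None, [7, 12]),
    (6, "UAL", "poe", [7, 12]),
]

def make_key(head, leaf, schema):
    """Composite key: HeadId, LeafValue ('' for null), reversed SchemaPath."""
    return (head, leaf or "", schema[::-1])

# A sorted list stands in for the B+-tree in this sketch.
ENTRIES = sorted((make_key(h, v, s), ids) for h, s, v, ids in ROWS)
KEYS = [key for key, _ in ENTRIES]

def bound_lookup(head, schema, leaf=None):
    """Return the k-tuples matching `schema` rooted at node `head`,
    optionally restricted to a given leaf value, via one key probe."""
    key = make_key(head, leaf, schema)
    lo = bisect.bisect_left(KEYS, key)
    hi = bisect.bisect_right(KEYS, key)
    return [(head, *ids) for _, ids in ENTRIES[lo:hi]]

print(bound_lookup(6, "UAL", "poe"))   # [(6, 7, 12)]
print(bound_lookup(1, "BUAF"))         # [(1, 6, 7, 8)]
print(bound_lookup(6, "UAL", "jane"))  # []
```

Because the head node, the schema path and the optional leaf value together determine the full key, a bound subpath query is answered by a single probe, and the returned tuple prepends the head Id to the stored IdList.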

Twig Indexes

ViST [16] and PRIX [13] are techniques that encode XML documents as sequences, and perform sub-sequence matching to look up all matches to twig queries.

Key Applications

XML Indexing is important for efficient XML query processing, both in relational implementations of XML databases and in native XML databases.

Future Directions

It is important to investigate XML Path Indexes for the case of path queries with edge labels other than child, especially when different edges on a query path have different edge labels. Another extension worth investigating is to identify classes of twig queries that admit efficient XML Twig Indexes.

Data Sets

University of Washington XML Repository: http://www.cs.washington.edu/research/xmldatasets/.

Cross-references

XML Document

XML Tree Pattern, XML Twig Query

XPath/XQuery

XQuery Processors

Copyright information

© Springer Science+Business Media, LLC 2009