Configuring Database Indexes

1. Overview

In this section, we discuss the types of database indexes used by eXist, as well as how they are created, configured and maintained. It assumes readers have a basic understanding of XML and XQuery.

Database indexes are used extensively by eXist to facilitate efficient querying of the database. This is accomplished both by system-generated and user-configured database indexes. The current version of eXist uses three types of indexes:

  1. Structural Indexes: These index the nodal structure, elements (tags), and attributes of documents in a collection.

  2. Fulltext Indexes: These map text tokens to text nodes and attributes of documents in a collection.

  3. Range Indexes: These index specific text nodes and attributes in a collection based on user-configured index paths and selected data types.

These indexes are discussed in detail in the following section.

2. Index Types

2.1. Structural index

This index keeps track of the elements (tags), attributes, and nodal structure for all XML documents in a collection. It is created and maintained automatically in eXist, and can neither be reconfigured nor disabled by the user. The structural index is also required for executing nearly all XPath and XQuery expressions in eXist (with the exception of wildcard-only expressions such as "//*"). This index is stored in the database file elements.dbx.

Technically, the structural index maps every element and attribute qname (or qualified name) in a document collection to a list of <documentId, nodeId> pairs. This mapping is used by the query engine to resolve queries for a given XPath expression.

For example, given the following query:

//book/section

eXist uses two index lookups: the first for the "book" node, and the second for the "section" node. It then computes the structural join between these node sets to determine which "section" elements are in fact children of "book" elements.

2.2. Fulltext Index

This index maps text tokens (or blocks of structured text) to the document text and attribute nodes in which they occur. It is also created and maintained automatically, and stored in words.dbx. You can, however, select specific documents and collections you would like eXist to index by creating a configuration file. A fulltext index can also be configured to select specific parts of a document. For instance, it is possible to include or exclude certain elements or attributes from being indexed. Without any such configuration, eXist defaults to index the entire text. (For more information on user-based configuration, see Configuring Indexes below).

The fulltext Index is also required for eXist's fulltext search extensions. In particular, you can use the following eXist-specific operators and functions that apply a fulltext index:

Check the XQuery Documentation for more information.

It is important to note that, if you have disabled fulltext indexing for certain elements, these operators and functions will also be effectively disabled, and will not return matches. Furthermore, eXist has NO fallback to "brute-force" searching. As a result, eXist will not return results for queries that normally would have results provided fulltext indexing was enabled. Note also that this is in direct contrast to the operation of range indexing, which does fallback to full searching of the document if no range index applies (see below).

2.3. Range index

This index is "type-specific" - meaning it is based on the data type of specific node values in the document. These indexes provide a shortcut for the database to directly select nodes based on these type values. Unlike structural and fulltext Indexes, range indexes can be created and configured directly by the user, and in this sense, they are similar to indexes used by relational databases. eXist therefore does not come installed with configured range indexes. However, configured indexes are created when loading a document, and are automatically maintained during subsequent updates to the document or a part of it.

Range indexes are used when matching or comparing nodes by way of standard XPath operators and functions. Whenever these functions or operators are used, eXist looks for, and implements, any user-defined range index that applies. Unlike fulltext indexing, eXist will return the correct query results even if no range index applies - defaulting to a "brute-force" inspection of the DOM if necessary.

To see how range indexes work, consider the following fragment:

Example: Example: List Entry

                        
<items>
                            <item n="1">
                               <name>Tall Bookcase</name>
                               <price>299.99</price>
                            </item>
                            <item n="2">
                               <name>Short Bookcase</name>
                               <price>199.99</price>
                            </item>
                        </items>
                        

With this short inventory, the <price> elements have dollar values expressed as a floating-point numbers, (e.g. "299.99"), which have an XML Schema Definition (XSD) data type of xs:double. Using this builtin type to define a range index, we can improve the efficiency of searches for price values. (Instructions on how to configure range indexes using configuration files are provided under the Configuring Indexes section below.) During indexing, eXist will apply this data type selection by attempting to cast all <price> values as double floating point numbers, and add appropriate values to the index. Values that cannot be cast as double floating point numbers are therefore ignored. This range index will then be used by any expression that compares <price> to a numeric value - for instance:

//item[price > 100.0]

For non-string data types, the range index also provides the query engine a more efficient method of data conversion. Instead of retrieving the value of each selected element and casting it as a xs:double type, the engine can evaluate the expression by using the range index as a form of lookup index.

The benefits of a range index can apply to string values as well. When working without a range index, eXist employs the fulltext index to scan for the correct nodes - but this index then requires eXist to scan the resulting nodes to filter wrong matches. Using a range index, this extra scan is not required. In addtion, the range index can be used for equality comparisons; "<" (less-than) and ">" (greater-than) comparisons; regular expression searches using the fn:matches() function, or standard string search functions like fn:starts-with() and fn:contains().

To illustrate this functionality, let's return to the previous example. If you define a range index of type xs:string for element <name> , a query on this element to select tall bookcases using fn:matches() will be supported by the following index:

//item[fn:matches(name, '[Tt]all\s[Bb]')]

Another advantage of using the range index for strings is that it can be defined for elements with mixed content. For example, suppose you have the following:

Example: Mixed Content Element

                        

                            <mixed>
                               <span>un</span>
                               <span>even</span>
                            </mixed>
                    

In this case, you can query the database using an index-assisted regular expression that applies across the entire <mixed> node - for instance:

//item[fn:matches(mixed, 'un.*')]

In general, three conditions must be met in order to optimize a search using a range index:

  1. The range index must be defined on all items in the input sequence.

    For example, suppose you have two collections in the database: C1 and C2. If you have a range index defined for collection C1, but your query happens to operate on both C1 and C2, then the range index would NOT be used. The eXist "Query Optimizer" selects an optimization strategy based on the entire input sequence of the query. Since, in this example, since only nodes in C1 have a range index, no range index optimization would be applied.

  2. The index data type (first argument type) must match the test data type (second argument type).

    In other words, with range indexes, there is no promotion of data types (i.e. no data type precedes or replaces another data type). For example, if you defined an index of type xs:double on <price> , a query that compares this element's value with a string literal would not use a range index, for instance:

    //item[price = '1000.0']

    In order to apply the range index, you would need to cast the value as a type xs:double, i.e.:

    //item[price = xs:double($price)] (where $price is any test value)

    Similarly, when we compare xs:double values with xs:integer values, as in, for instance:

    //item[price = 1000]

    the range index would again not be used since the <price> data type differs from the test value type, although this conflict might not seem as obvious as it is with string values.

  3. The right-hand argument has no dependencies on the current context item.

    That is, the test or conditional value must not depend on the value against which it is being tested. For example, range indexes will not be applied given the following expression:

    //item[price = self]

3. Maintaining Indexes and Re-indexing

The current eXist index system automatically maintains and updates indexes defined by the user. You therefore do not need to update an index when you update a database document or collection. eXist will even update indexes following partial document updates via XUpdate or XQuery Update expressions.

The only exception to eXist's automatic update occurs when you add a new index definition to an existing database collection. In this case, the new index settings will ONLY apply to new data added to this collection, or any of its sub-collections, and NOT to previously existing data. To apply the new settings to the entire collection, you need to trigger a "manual reindex" of the collection being updated. You can re-index collections using the Java Admin Client (shown on the right). From the Admin menu, select File»Reindex Collection

4. Configuring Indexes

For later versions of eXist, it is recommended that users configure fulltext and range indexes using collection-specific configuration files. These files are stored as standard XML documents in the system collection: /db/system/config, which can be accessed using the Admin interface or Java Client. In addition to defining settings for indexing collections, the configuration document specifies collection-specific other settings such as triggers or default permissions.

The contents of the system collection (/db/system/config) mirror the hierarchical structure of the main collection. Configurations are therefore "inherited" by descendants in the hierarchy, (i.e. the configuration settings for the child collection are added to or override those set for the parent). It is furthermore possible for each collection in the hierarchy to have its own index creation policy defined by a configuration file. If no collection-specific configuration file is created for any document, the global settings in the main configuration file, conf.xml, will apply by default. That being said, the conf.xml file should only define the default global index creation policy.

To configure indexes for a given collection - for example: /db/foo - you must create a new .xconf configuration file and store it in the system collection (e.g. /db/system/config/db/foo). You can choose any name for this document so long as it has the .xconf extension. Note that since subcollections will inherit the configuration policy of their parent collections, you are not required to specify a configuration for every collection.

Note

You can store only ONE .xconf configuration document per collection in the system collection /db/system/config. For example, the collection /db/system/config/foo would contain one configuration file and/or other subcollections.

Important

To specify a global index configuration that applies to all collections, you must create a file called collection.xconf and store it in the system collection: /db/system/config/db/. (Note that collection.xconf is a reserved filename.)

4.1. Configuration Structure and Syntax

Index configuration files are standard XML documents that have their elements and attributes defined by the eXist namespace:

http://exist-db.org/collection-config/1.0

The following example shows a simple configuration document:

Example: Simple Configuration Document

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:x="http://www.foo.com">
       <fulltext>
           <include path="//item/name"/>
       </fulltext>
       <create path="//item/price" type="xs:double"/>
    </index>
</collection>

All configuration documents have the <collection> root element. These documents also have an <index> element directly below the root element, which encloses the index configuration. Only ONE <index> element is permitted in a document.

In the <index> element are elements that define the fulltext index and range index settings. The fulltext index is defined by the <fulltext> element - along with <include> and <exclude> elements that provide specific index paths. Range indexes are defined by <create> elements. Note that an index element contains only ONE <fulltext> element, but may contain any number of <create> elements as range indexes.

To illustrate how basic configuration document is created, let's suppose we have a collection of XML documents with the following structure:

Example: Sample XML Document

<?xml version="1.0" encoding="UTF-8"?>
<items xmlns:x="http://www.foo.com">
    <item n="1"">
        <name>Red Bicycle</name>
        <price specialprice="false">645.50</price>
        <stock>15</stock>
        <x:rating>8.7</x:rating>
    </item>
</items>

A configuration document for this collection might look like the following:

Example: Configuration Document (collection.xconf)

<collection xmlns="http://exist-db.org/collection-config/1.0">
	<index xmlns:x="http://www.foo.com">
		<fulltext default="none" attributes="false" alphanum="false">
			<include path="//item/name"/>
		</fulltext>
		<create path="//item/@n" type="xs:integer"/>
		<create path="//item/name" type="xs:string"/>
		<create path="//item/stock" type="xs:integer"/>
		<create path="//item/price" type="xs:double"/>
		<create path="//item/prices/@specialprice" type="xs:boolean"/>
                  <create path="//item/x:rating" type="xs:double"/>
	</index>
</collection>

With this example, the fulltext default attribute is set to "none", which disables the default fulltext indexing for all document elements. The exception to this are the name elements using the <include> element.

Notice as well that each <create> element has a path attribute that defines the nodes to which the configuration applies, and which are expressed as index paths. Notice also that each <include> has a path attribute, but not a type attribute. Since range indexes are "type-specific", a node will be ignored if its value cannot be cast to the specified type. For instance, if, in the above example, a price element contains the string "unknown" instead of a double value, the indexer will simply ignore this node. However, the value would still be found through a string comparison. Note that a path component may use a namespace prefix, for which a mapping has to be defined in the enclosing index tag. For example, the document node <x:rating> in the above sample XML uses the namespace x.

Unlike the fulltext index, the range index applies ONLY to the element specified in the index path, and NOT to descendant nodes of that element. Returning to a previous example, consider the following markup:

Example: Mixed Content Element

                        
<mixed>
    <span>un</span>
    <span>even</span>
</mixed>
                    

which has the index definition:

<create path="//item/mixed" type="xs:string"/>

In this case, the string added to the corresponding range index during indexing would be "uneven". In contrast, the fulltext index would (by default) add the strings "un" and "even".

However, you can configure the fulltext index to treat mixed content in the same way as the range index. For this, we use a special type of <include>:

Example: Configure the Fulltext Index for a Mixed Content Element

                        
<fulltext default="none" attributes="false" alphanum="false">
    <include path="//mixed" content="mixed"/>
</fulltext>
                    

The concatenated text nodes of element mixed and all its descendants will be passed to the indexer as one single string. The indexer thus sees and indexes "uneven" as a single token. Accordingly, using this definition, the expression:

//mixed[. &= 'uneven']

will return the mixed node. You can still explicitely query for child and descendant nodes (unless you excluded them from the index).

Although the syntax of index paths appears similar to XPath syntax, it is not the same. Index path syntax is in fact much simpler than XPath, and uses the following components to construct paths:

Important

Please note that index paths are NOT the same as XPath expressions, even though the syntax is similar. The main reason for this difference is that index paths must be evaluated during the indexing phase. During this phase, eXist engine has only minimal structural information about the document, and so the use of XPath expressions is not possible. With improvements made to the search method for future versions of eXist, however, the full use of XPath in configuring indexes may become possible.

4.2. Document Elements and Attributes

The following table summarizes the function and syntax of elements used in a configuration document:

<collection>

Description: Configuration document root element.

Namespace: http://exist-db.org/collection-config/1.0

<index>

Description: Container for the index configuration.

Namespace: Optional collection namespace for prefixed index path components. (e.g. xmlns:x="http://www.foo.com" for the x namespace prefix).

<fulltext>

Description: Specifies the default fulltext indexing settings, and contains <include> and <exclude> elements that specify which document nodes to include or exclude during indexing.

Attributes:

  • default: Sets the default index range for text nodes. All fulltext indexes must have this attribute defined.

      Values
    • all: All test nodes are indexed (default). To exclude specific nodes, use the <exclude> element (see below).

    • none: No text nodes are indexed. To include specific nodes, use the <include> element (see below).

  • attributes: Default setting to include (or exclude) attribute values during the document indexing. Compare this setting with the default setting for text nodes.

      Values
    • true: Attributes are indexed (default).

    • false: Attributes are not indexed.

  • alphanum: Default setting to include (or exclude) alpha-numeric values during the document indexing. These sequences include: numbers, dates, and any alpha-numeric sequence that is not a simple word string.

      Values
    • true: Alpha-numeric strings are indexed (default).

    • false: Alpha-numeric strings are not indexed.

<include>

Description: Specifies a document node to include in fulltext indexing.

Attributes: path (Specifies the document node path).

The following includes the values of all id attributes for all <article> elements.

include path="//article/@id" /

The optional attribute content can be used to create an additional index on the mixed content of a node (see description above):

include path="//mixed" content="mixed"/

<exclude>

Description: Specifies a document node to exclude in fulltext indexing.

Attributes: path (Specifies the document node path).

The following excludes the values of all <url> elements and their descendants.

exclude path="//url" /

<create>

Description: Defines a range index of selected elements or attributes.

Attributes:

  • path: Specifies an index path that defines the nodes to which the configuration applies. A range index is created for all elements or attributes that match this path.

  • type: Defines the range index data type (XSD). It is expected future versions of eXist will expand on the following supported data types.

      Values
    • xs:string

    • xs:integer (NOTE: integers are limited to the range of values that can be represented by a Java signed long).

    • xs:float

    • xs:boolean

5. Debugging the Range Index

It is sometimes a bit difficult to see if a range index is correctly defined or not. The simplest way to get some information on index usage is to set the priority for eXist's standard logger to TRACE. For example, change the <root> category in log4j.xml as follows:

Example: Configure log4j to Display Trace Output

                    
<root>
    <priority value="trace"/>
    <appender-ref ref="console"/>
</root>
                

This enables trace and sends all log output to the console instead of the log files. For expressions that can benefit from a range index, you should now see messages like "Checking if range index can be used ..." or "Using range index for key...".

Another possibility to see what's in your index is to use the util:index-keys function:

Example: Query to List Index Contents

declare namespace f="http://exist-db.org/xquery/test";
declare function f:term-callback($term as xs:string, $data as xs:int+) as element() {
   <entry><term>{$term}</term></entry>
};
util:index-keys(//city/name, "T", util:function("f:term-callback", 2), 1000)

This query will show you all keys (starting with the letter 'T') indexed for the element selected by the path expression //city/name.