Server Configuration

1. Overview

This section deals with the configuration of the eXist server. The main configuration file for eXist is called conf.xml, which is loaded from different directories depending on the server set-up (see Server Deployment for more information).

Specifically, if you run a standalone database server, the conf.xml file located in the root directory of the distribution (as specified by the system property exist.home) will be loaded by default. Note as well that in standalone mode, the server reads server.xml for its configuration values. This file resides in eXist's root directory and is used to control the Jetty server (e.g. port settings), URL forwarding, and the servlets (i.e. WebDAV, XML-RPC and REST services). On the other hand, if eXist is running in a servlet-context, conf.xml is read from the WEB-INF directory of the web application.

Why is the configuration file placed in two separate locations? The reason is that eXist normally has no access to files outside the context in which it is running when it is deployed as part of a web application - i.e. packaged in a .war file. Therefore, when eXist is deployed in this way, the configuration is read from the WEB-INF directory.

2. eXist Configuration: Editing conf.xml

The configuration file conf.xml can be divided into four sections with the following elements:

<db-connection>

Configures the storage back-end.

<serializer>

Default settings for the serializer (external data representation).

<indexer>

Controls the indexing process.

<xupdate>

Configuration options related to XUpdate processing.

The following sections describe the attributes and child elements of the above elements.

2.1. <db-connection>

This element contains basic default storage settings for eXist, including memory and system limits. Only one <db-connection> should be specified. An example configuration for the native back-end is shown below:

Example: Basic <db-connection> Entry

<db-connection database="native" files="data" 
      cacheSize="48M" free_mem_min="5" pageSize="4096">
      <pool min="1" max="15" sync-period="240000" wait-before-shutdown="60000"/>
      <recovery enabled="yes" sync-on-commit="no" group-commit="no" size="100M" 
            journal-dir="webapp/WEB-INF/data"/>
      <watchdog query-timeout="-1" output-size-limit="10000"/>
      <default-permissions collection="0775" resource="0775"/>
</db-connection>

<db-connection> Attributes

database

This attribute selects a database system type. Since relational database back-ends are no longer supported by the current release of eXist, only "native" and "native_cluster" are available.

files

This attribute specifies the directory where the native back-end will keep its database files, and so it is necessary that this directory exists. If a relative path is specified, it will be based on the root directory as defined in the exist.home system property. If this data directory does not have write permissions (see User Authentication and Access Control), eXist will internally switch to read-only mode such that any attempt to change the database will throw an exception.

cacheSize

This attribute sets the maximum amount of main memory used by all page buffers (i.e. assuming all page buffers are at full capacity). The database uses this parameter to calculate the maximum size of each internal cache. You can increase this value if your system allows for greater memory use.

The cacheSize should not be more than half of the size of the JVM heap size (set by the JVM -Xmx parameter).

pageSize

This attribute specifies the size, in bytes, of internal data and B-tree pages. It should be equal to, or a multiple of, the page size used by the filesystem (usually 4096 bytes).

free_mem_min

This attribute sets the minimum amount of free memory (expressed as a percentage of the total memory) that should remain available to the Java virtual machine. During indexing, eXist caches index data in memory to avoid frequent disk look-ups (see Configuring Database Indexes). If the amount of free memory drops below the defined limit, eXist will write all cached data to disk and trigger garbage collection.

If your database files are relatively large (i.e. > 50MB) or if you experience OutOfMemory errors during indexing, you may consider increasing this setting to > 10%.
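
As a rough sketch of the sizing guidance above, the following entry assumes a JVM started with -Xmx512m (a hypothetical heap size, not a value taken from the distribution). The cacheSize of 128M stays below half of that heap, and free_mem_min is raised above 10 as suggested for large databases. All other values are copied from the basic example.

```xml
<!-- Sketch only: assumes the JVM runs with -Xmx512m -->
<db-connection database="native" files="data"
      cacheSize="128M" free_mem_min="15" pageSize="4096">
      <!-- remaining child elements as in the basic example -->
      <pool min="1" max="15" sync-period="240000" wait-before-shutdown="60000"/>
</db-connection>
```

If the heap size changes, cacheSize should be revisited so that it remains at or below half of the new -Xmx value.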

<pool>

These settings control the internal database connection pool.

<pool> Attributes

min | max

These options specify the minimum and maximum size of the connection pool. This pool restricts the number of parallel (basic) operations that can be executed by the database. Settings should be somewhere between 1 and 20. (Please note that this has nothing to do with the HTTP and XML-RPC server settings; these servers have their own connection pools.)

sync-period

This option defines how often the database will flush its internal buffers to disk (in milliseconds). The sync-thread will interrupt normal database operation after the specified time and write all dirty pages to disk.

wait-before-shutdown

This option specifies the maximum amount of time (in milliseconds) that the database will allow for any running processes to complete upon database shutdown.
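
To make the units concrete, a pool entry might look like the sketch below. The numbers are illustrative assumptions rather than recommendations; only the attribute names come from the description above.

```xml
<!-- Illustrative values, not recommendations:
     at most 10 parallel operations,
     buffers flushed every 2 minutes (120000 ms),
     up to 30 seconds (30000 ms) for running processes at shutdown -->
<pool min="1" max="10" sync-period="120000" wait-before-shutdown="30000"/>
```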

<recovery>

This element configures the journaling and recovery of the database. With recovery enabled, the database is able to recover from an unclean database shutdown due to, for example, power failures, OS reboots, and hanging processes. For this to work correctly, all database operations must be logged to a journal file. The location, size and other parameters for this file can be set using the <recovery> element.

<recovery> Attributes

enabled

If this attribute is set to yes, automatic recovery is enabled.

size

This attribute sets the maximum allowed size of the journal file. Once the journal reaches this limit, a checkpoint will be triggered and the journal will be cleaned. However, the database waits for running transactions to return before processing this checkpoint. If one of these transactions writes a lot of data to the journal file, the file will grow until the transaction has completed. Hence, the size limit is not enforced in all cases.

journal-dir

This attribute sets the directory where journal files are to be written. If no directory is specified, journal files are written to the data directory.

sync-on-commit

This attribute determines whether or not to protect the journal during operating system failures. That is, it determines whether the database forces a file-sync on the journal after every commit. If this attribute is set to "yes", the journal is protected against operating system failures. However, this will slow performance - especially on Windows systems. If set to "no", eXist will rely on the operating system to flush out the journal contents to disk. In the worst case scenario, in which there is a complete system failure, some committed transactions might not have yet been written to the journal, and so will be rolled back.
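
The trade-off described above can be sketched as follows. This entry favours durability over write speed by forcing a file-sync after every commit. The journal-dir path is hypothetical, and group-commit is copied unchanged from the basic example because it is not described in this section.

```xml
<!-- Sketch: sync-on-commit="yes" protects committed transactions
     against operating system failures at the cost of slower writes.
     The journal-dir path is a hypothetical location. -->
<recovery enabled="yes" sync-on-commit="yes" group-commit="no" size="100M"
      journal-dir="/var/exist/journal"/>
```

Leaving out journal-dir would place the journal in the data directory, as noted above.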

<watchdog>

This is the global configuration for the query watchdog. The watchdog monitors all query processes, and can terminate any long-running queries if they exceed one of the predefined limits. These limits are as follows:

<watchdog> Attributes

query-timeout

This attribute sets the maximum amount of time (expressed in milliseconds) that a query can take before it is killed.

output-size-limit

This attribute limits the size of XML fragments constructed using XQuery, and thus sets the maximum amount of main memory a query is allowed to use. This limit is expressed as the maximum number of nodes allowed for an in-memory DOM tree. The purpose of this option is to avoid memory shortages on the server in cases where users are allowed to run queries that produce very large output fragments.
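
For illustration, the entry below applies both limits. The values are assumptions chosen for readability. The basic example earlier in this section uses a query-timeout of -1, which appears to mean that no time limit is applied; that reading is an assumption and is not stated in this section.

```xml
<!-- Illustrative limits: stop queries after 60 seconds (60000 ms)
     and cap constructed fragments at 10000 nodes -->
<watchdog query-timeout="60000" output-size-limit="10000"/>
```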

<default-permissions>

Specifies the default permissions for all resources and collections in eXist (see User Authentication and Access Control). When this is not configured, the default "mod" (similar to the Unix "chmod" command) is set to 0775 in both the resource and collection attributes. A different default value may be set for a database instance, and local overrides are also possible.

<security>

The <security> element in the <db-connection> node is used to select the security manager Class and control the database of users and groups.

<security> Attributes

class

This attribute is required, and specifies a Java class name used to implement the org.exist.security.SecurityManager interface, as in the following example:

Example: <security> class Attribute (LDAP)

<security class="org.exist.security.LDAPSecurityManager" />

eXist is distributed with the following built-in security manager implementations:

org.exist.security.XMLSecurityManager

Stores the user information in the database. This is the default manager if the <security> element is not included in <db-connection> .

org.exist.security.LDAPSecurityManager

Retrieves users and groups from an LDAP database. This requires additional configuration parameters, which are described in the LDAP Security Manager documentation.

password-encoding

Password encoding can be set to one of the following types:

  1. plain - Stores passwords as plain text, without any encoding.

  2. md5 (default) - Applies the MD5 algorithm to hash passwords.

  3. simple-md5 - Applies a simplified MD5 algorithm to hash passwords.

password-realm

The realm to use for basic auth or http-digest password challenges.
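
Putting the three attributes together, a <security> entry could look like the sketch below. It names the default manager explicitly and keeps the default password encoding. The password-realm value is a hypothetical placeholder.

```xml
<!-- Sketch: default manager and encoding stated explicitly;
     the password-realm value is a hypothetical placeholder -->
<security class="org.exist.security.XMLSecurityManager"
      password-encoding="md5" password-realm="exist"/>
```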

2.2. <serializer>

The serializer is responsible for serializing XML documents or document fragments back into XML. This configuration element defines default settings for various parameters, which can also be specified programmatically.

<serializer> Attributes

enable-xinclude

This attribute determines whether <xinclude> tags are to be expanded during serialization. Setting the value to "false" will leave <xinclude> tags unexpanded.

enable-xsl

This attribute (when set to "true") tells the serializer to pass its output to an XSL stylesheet when it encounters an XSL processing-instruction at the start of the document.

add-exist-id

This attribute tells the serializer to add debug information to each element expressed as additional attributes. This information includes the internal identifier of the node and source document. These are the accepted values:

  1. all - Adds debug information to every node in the output.

  2. element - Adds debug information to top-level elements only.

  3. none (default) - Disables debugging feature.

indent

By default, the serializer pretty-prints the resulting XML source code. Set this option to "no" to disable pretty-printing.

match-tagging-elements

The database can highlight matches in the text content of a node by tagging the matching text string with <exist:match> . This only works for XPath expressions that use the fulltext index. Set the parameter to "yes" to enable this feature.

match-tagging-attributes

Matches for attribute values can also be tagged using the character sequence "||" to demarcate the matching text string. Since this changes the content of the attribute value, the feature is disabled by default.
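
A complete <serializer> entry combining the attributes above might look like the following sketch. The descriptions above mix "true"/"false" and "yes"/"no" as values; this sketch assumes that "yes"/"no" is accepted for every on/off option, which should be checked against the conf.xml shipped with your release.

```xml
<!-- Sketch: XInclude expansion and pretty-printing on, XSL processing off,
     no debug attributes, match tagging for element text only.
     Assumes "yes"/"no" is accepted for all on/off options. -->
<serializer enable-xinclude="yes" enable-xsl="no" indent="yes"
      add-exist-id="none" match-tagging-elements="yes"
      match-tagging-attributes="no"/>
```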

2.3. <xupdate>

During XUpdates, the database performs a partial reindexing of the document whenever the internal node-id structure has changed. Reindexes can occur quite frequently and slow down the XUpdate process. However, the frequency of reindex runs can be specified (with limitations) in the <xupdate> section.

Furthermore, when nodes are inserted into a document repeatedly, a page fragmentation within the database files can result. A defragmentation run is triggered if this fragmentation exceeds a predefined limit that can be configured here. A typical <xupdate> entry looks like the following:

Example: XUpdate-Options in conf.xml

<xupdate growth-factor="20" allowed-fragmentation="20"
		enable-consistency-checks="no" />

<xupdate> Attributes

growth-factor

Frequent reindexing can be avoided by leaving space between the numeric identifiers assigned to every node. Future insertions will first use these spare identifiers, and therefore the document will not need to be reindexed.

The growth-factor attribute allows the user to specify the number of spare ids to be inserted whenever the node id scheme is recomputed after an XUpdate. As discussed, increasing this setting will result in fewer reindex runs. However, be aware that leaving spare ids limits the maximum size of a document that can be indexed.

allowed-fragmentation

This attribute defines the maximum number of page splits allowed within a document before a defragmentation run is triggered.

enable-consistency-checks

This attribute is for debugging purposes only. If the parameter is set to "yes", a consistency check will be run on modified documents after every XUpdate request. This checks whether the persistent DOM is complete and whether all pointers in the structural index point to valid storage addresses that contain valid nodes.

2.4. <indexer>

This element sets parameters on how XML files are to be indexed by eXist. An example configuration is shown below:

Example: Specifying Indexer-Options in conf.xml

<indexer caseSensitive="no"
	suppress-whitespace="both" index-depth="1"
	tokenizer="org.exist.storage.analysis.SimpleTokenizer"
	validation="no">
	  
    <stopwords file="stopword"/>
    
	<!-- Default index configuration -->
    <index>
        <fulltext default="all" attributes="false">
            <exclude path="/auth"/>
        </fulltext>
    </index>

    <entity-resolver>
	    <catalog file="samples/xcatalog.xml"/>
    </entity-resolver>
</indexer>

<indexer> Attributes

caseSensitive

Specifies whether string comparisons are to be case-sensitive. This option applies to XPath equality tests (i.e. "=" operator), as well as functions such as contains(), starts-with() and ends-with(). Since index look-ups are NEVER case-sensitive, this setting does not apply to operators or functions of the fulltext index (e.g. "&=", "|=", "near()").

suppress-whitespace

Specifies how the <indexer> is to treat whitespace at the start or end of a character sequence. This option ONLY applies to newly stored files, and therefore changing it has no effect on previously stored documents. Possible values for this attribute are:

  1. leading - Suppresses leading whitespace.

  2. trailing - Suppresses trailing whitespace.

  3. both - Suppresses leading and trailing whitespace.

  4. none - Preserves all whitespace.

tokenizer

This attribute invokes the Java class used to tokenize a string into a sequence of single words or tokens, which are stored to the fulltext index. Currently only the SimpleTokenizer is available.

index-depth

This attribute specifies the depth of the DOM index, or the tree level up to which elements will be added to the index. For example, a value of "2" results in the document root node and all its child elements being indexed; a value of "1" only indexes the root node.

The DOM index maps unique node identifiers to the nodes' storage locations in the DOM file. Generating this index is time- and memory-consuming. It is furthermore primarily needed to access nodes by their unique node identifier - for example, when serializing XML data for query results or XUpdate - which are operations not normally considered time-critical. Moreover, most XPath expressions can do without this index since they use short-cuts to access the node directly.

Beginning with version 0.9, only top-level elements are added to the DOM index, whereas attributes and text nodes are always excluded. This results in much smaller index sizes and, consequently, a smaller dom.dbx file size. Usually, setting the index-depth to a value of "2" offers a reasonable compromise of index size and performance. However, if your documents are deeply-structured, you might consider increasing this setting to a level of 3, 4 or 5. For example, if the longest path from the document root to an element node has greater than ten node levels, an index-depth setting of 4 or 5 would probably help to increase overall query performance for some types of queries.

validation

This attribute defines the default setting for the validation of documents by the XML parser. If it is set to "no", documents will never be validated against an existing DTD or schema. A value of "auto" will leave document validation to the SAX parser (i.e. the Xerces parser).

preserve-whitespace-mixed-content

This preserves whitespace for mixed content. The default value is "no".
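
As a sketch of how these attributes combine, the entry below is aimed at deeply structured documents: it raises index-depth to 4, as suggested above for long element paths, and leaves validation to the parser. Child elements are omitted for brevity and the chosen values are illustrative assumptions.

```xml
<!-- Sketch for deeply nested documents; child elements omitted.
     index-depth="4" follows the suggestion for long element paths,
     validation="auto" leaves validation to the SAX parser. -->
<indexer caseSensitive="no"
      suppress-whitespace="both" index-depth="4"
      tokenizer="org.exist.storage.analysis.SimpleTokenizer"
      validation="auto"
      preserve-whitespace-mixed-content="no"/>
```

Because suppress-whitespace only applies to newly stored files, changing it here has no effect on documents that are already in the database.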

<stopwords>

The file attribute of this element points to a file containing a list of stopwords. Note that stopwords are NOT added to the fulltext index.

<index>

This configuration element specifies the default index settings. These settings are applied if a collection is not configured differently in its collection configuration file. For more information, read the Configuring Indexes documentation.

3. Cocoon Sitemap Configuration

Cocoon uses a sitemap XML file called sitemap.xmap to configure the processing pipelines it uses to process HTTP requests. eXist's integration with Cocoon is based entirely on the XML:DB database API, so any other XML:DB-enabled database (e.g. Xindice) can be integrated with Cocoon in the same way.

Beginning with Cocoon version 2.0, pseudo-protocols are supported. Pseudo-protocols allow you to register handlers for special URLs via so-called "source factories". In essence, these protocols specify resources wherever a known protocol such as http:// or file:// is specified in the sitemap. Currently, the distribution defines a pseudo-protocol to access XML:DB-enabled databases.

In eXist, pseudo-protocols are configured in Cocoon's main configuration file WEB-INF/cocoon.xconf. To make use of these protocols, simply specify the correct database driver class, as in the following example:

Example: Defining the XML:DB Database Driver

<source-handler logger="core.source-handler">
        <!-- xmldb pseudo protocol -->
     <protocol 
            class="org.apache.cocoon.components.source.XMLDBSourceFactory" 
            name="xmldb">
        <driver class="org.exist.xmldb.DatabaseImpl" type="exist"/>
        <!-- Add here other XML:DB compliant databases drivers -->
      </protocol>
</source-handler>
          

Once the database driver has been registered with the handler, it is possible to use an XML:DB URI wherever Cocoon expects a URI in its site configuration file sitemap.xmap. For example, to access our collection of Shakespeare plays from the web-browser, and with a stylesheet applied to each document, we could use the following code fragment in the sitemap's processing pipeline:

Example: Using XML:DB URIs in the Sitemap

<!-- apply stylesheet shakes.xsl to all XML documents
in xmldb-collection /db/shakespeare/plays --> 
<map:match pattern="xmldb/db/shakespeare/plays/**.xml">
    <map:generate src="xmldb:exist:///db/shakespeare/plays/{1}.xml"/>
    <map:transform src="xmldb:exist:///db/shakespeare/plays/shakes.xsl"/>
    <map:serialize type="html"/>
</map:match>

The sitemap.xmap delivered with eXist also contains more complex examples.
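
A simpler variant of the Shakespeare pipeline shown earlier is sketched below. It returns the stored documents as plain XML without applying a stylesheet. The "raw/" URL prefix is a hypothetical choice to keep the pattern distinct from the earlier one, and the sketch assumes that a serializer named "xml" is declared in the sitemap's components section.

```xml
<!-- Sketch: serve documents from /db/shakespeare/plays as plain XML.
     The "raw/" prefix is hypothetical; the "xml" serializer is assumed
     to be declared in the sitemap's components section. -->
<map:match pattern="raw/db/shakespeare/plays/**.xml">
    <map:generate src="xmldb:exist:///db/shakespeare/plays/{1}.xml"/>
    <map:serialize type="xml"/>
</map:match>
```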