Fast and Scalable Date Range Searches with Hippo Repository - BloomReach Experience - Open Source CMS

This article covers Hippo CMS version 10. An updated version is available that covers our most recent release.

26-11-2015

Fast and Scalable Date Range Searches with Hippo Repository

Using this feature on a repository that was upgraded from a version older than CMS 7.8.1 requires a rebuild of the Lucene indices.

Jackrabbit has performance problems with date range queries (queries with a comparison constraint on properties of type Date), which become noticeable even for a fairly limited set of documents. The problem is described at make-date-range-queries-in-jackrabbit, including a workaround using a derived data function. See the section at the end of this document for an in-depth technical explanation of the problem in Jackrabbit.

Date Range Constraints

Using XPath, date range constraints are a bit tedious to write. An example of a query using an XPath date range constraint is:

//element(*,custom:document)
          [@custom:date >=xs:dateTime('2009-01-01T03:23:54.234Z') and
           @custom:date <=xs:dateTime('2013-01-01T06:41:30.056Z')]
           order by @custom:date descending

The xs:dateTime(...) constraint fragment is generated from a target timestamp like this:

final Calendar calendar = ...;
String xsDateTimeFormat = session.getValueFactory().
                     createValue(calendar).getString(); 

The constraints above result in range queries with millisecond resolution, which can be very slow or even cause out-of-memory conditions. For date range queries, we strongly encourage you to use one of the supported resolutions mentioned later in this document.

Do not use Date Range Queries without specifying a resolution (explained below).

How to use Fast and Scalable Date Range Queries

Here, we describe Hippo Repository's tooling to avoid slow date range queries. For HST developers, more in-depth information is available at  HST fast date range queries.

Typically, a resolution of days, months or even years is sufficient for date range queries for documents on a website or in the CMS. For example, a visitor wants to limit the search results to documents published between 03-03-2013 and 04-03-2013, or documents last modified in 2012 or 2013, etc.

Hippo Repository indexes all Calendar JCR properties just like Jackrabbit does, but also indexes them at different resolutions to support fast date range queries. The supported fast resolutions are:

  1. year
  2. month
  3. day
  4. hour

The utility class DateTools of the Hippo Repository API simplifies the creation of XPath date range constraints. There are two helper methods that result in fast date range constraints:

String DateTools#getPropertyForResolution(String property,
                  org.hippoecm.repository.util.DateTools.Resolution resolution) 

and 

String DateTools#createXPathConstraint(javax.jcr.Session session,
                  java.util.Calendar calendar,
                  org.hippoecm.repository.util.DateTools.Resolution roundDateBy)

getPropertyForResolution returns the name of the indexed property that holds the (document's) Date property at the desired resolution; this is the property the query evaluates. createXPathConstraint transforms the date to compare against (expressed by calendar) into an XPath fragment, rounded to the desired resolution.

Example: Fast date range with resolution days

Assume that we want to rewrite the XPath query

//element(*,custom:document)
                [@custom:date >=xs:dateTime('2009-01-01T04:04:56.456Z')]
                 order by @custom:date descending

to a fast date range query on resolution Day. This can be done as follows:

Calendar calendar = ...; // 2009-01-01T04:04:56.456Z
// the custom:date property for resolution 'Day'
String xpathProperty = DateTools.getPropertyForResolution("custom:date",
                                        DateTools.Resolution.DAY);
// the xpath constraint for custom:date for resolution 'Day'
String xpathDate = DateTools.createXPathConstraint(session,
                                    calendar ,
                                    DateTools.Resolution.DAY);
// a Java string literal cannot span lines, so the query is concatenated
String xpath = "//element(*,custom:document)"
               + "[@" + xpathProperty + " >= " + xpathDate + "]"
               + " order by @custom:date descending";

The code above results in the following XPath query:

//element(*,custom:document)
                [@custom:date____day >=xs:dateTime('2009-01-01T00:00:00.000Z')]
                 order by @custom:date descending

Note that

  1. The property on which the range query is done has changed into @custom:date____day
  2. The value in xs:dateTime is rounded to day: it now ends with T00:00:00.000Z
  3. The order by is done on the original @custom:date, so the sorting is still done on exact timestamps
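
To make the rewrite tangible without a running repository, here is a minimal sketch in plain Java that reproduces the query string shown above. It is not the DateTools implementation: dayConstraint and buildQuery are hypothetical stand-ins, the sketch assumes that rounding means truncating to the start of the UTC day, and the @custom:date____day property name is simply copied from the example output.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class DayResolutionSketch {

    // Formats an instant the way the queries in this article show it (UTC, millisecond digits).
    private static final DateTimeFormatter XS_DATE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

    // Hypothetical stand-in for createXPathConstraint with Resolution.DAY (assumption:
    // rounding truncates to the start of the UTC day).
    static String dayConstraint(Instant timestamp) {
        Instant startOfDay = timestamp.truncatedTo(ChronoUnit.DAYS);
        return "xs:dateTime('" + XS_DATE_TIME.format(startOfDay) + "')";
    }

    // Assembles the rewritten query; the ____day property name is taken from the example above.
    static String buildQuery(Instant lowerBound) {
        return "//element(*,custom:document)"
                + "[@custom:date____day >= " + dayConstraint(lowerBound) + "]"
                + " order by @custom:date descending";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(Instant.parse("2009-01-01T04:04:56.456Z")));
    }
}
```

In a real application the two DateTools helpers should be used instead, so that the property name and date format always match what the repository indexes.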

 

The rewritten query executes fast and scales to millions of documents. The coarser the resolution, the faster the query becomes. Setting the resolution for a date range query to year typically results in even faster query execution than doing the same search without a range constraint.

Note that a query with a resolution returns at least as many results as the same query without a resolution or with a finer resolution.

This can be understood from the query example above based on resolution Day. In this example, we saw that 

@custom:date >=xs:dateTime('2009-01-01T04:04:56.456Z')

got translated into

@custom:date____day >=xs:dateTime('2009-01-01T00:00:00.000Z')

A document with custom:date = 2009-01-01T03:04:56.456Z (an hour before 2009-01-01T04:04:56.456Z) does match the query based on Day resolution, but not the one without resolution.
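
The difference between the two comparisons can be checked with a small sketch in plain Java. It only models the comparison semantics and does not touch the repository; matchesExact and matchesDay are hypothetical names, and the sketch assumes that day resolution means both sides are truncated to the start of their UTC day.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class ResolutionSupersetSketch {

    // Exact comparison, as in the query without a resolution.
    static boolean matchesExact(Instant documentDate, Instant lowerBound) {
        return !documentDate.isBefore(lowerBound);
    }

    // Day-resolution comparison (assumption: both sides are truncated to the start of the UTC day).
    static boolean matchesDay(Instant documentDate, Instant lowerBound) {
        Instant documentDay = documentDate.truncatedTo(ChronoUnit.DAYS);
        Instant boundDay = lowerBound.truncatedTo(ChronoUnit.DAYS);
        return !documentDay.isBefore(boundDay);
    }

    public static void main(String[] args) {
        Instant lowerBound = Instant.parse("2009-01-01T04:04:56.456Z");
        Instant document = Instant.parse("2009-01-01T03:04:56.456Z");
        System.out.println("exact: " + matchesExact(document, lowerBound)); // exact: false
        System.out.println("day:   " + matchesDay(document, lowerBound));   // day:   true
    }
}
```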

Query support for documents on a specific Year, Month, Day or Hour

The fast scalable range query support described above can (and should) also be used for queries like

  1. All documents last modified in 2013
  2. All documents with publication date in February 2013
  3. All documents modified on 2013-02-03

For queries within a specific year, month, day or hour, the query can be turned into an equals comparison. For example, query 1 can (and should) be written as:

Calendar any2013Date = Calendar.getInstance();
any2013Date.set(Calendar.YEAR, 2013);

// the custom:date property for resolution 'Year'
String xpathProperty = DateTools.getPropertyForResolution("custom:date",
                                        DateTools.Resolution.YEAR);
// the xpath constraint for custom:date for resolution 'Year'
String xpathDate = DateTools.createXPathConstraint(session,
                                    any2013Date,
                                    DateTools.Resolution.YEAR);
// note below a = and not a range!
String xpath = "//element(*,custom:document)"
               + "[@" + xpathProperty + " = " + xpathDate + "]";
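
Why an equals comparison is enough can be illustrated with a short sketch in plain Java. It assumes that year resolution rounds every timestamp down to 1 January, 00:00:00.000 UTC of its year, so all dates within 2013 collapse to a single value; roundToYear is a hypothetical helper, not part of the Hippo Repository API.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

final class YearResolutionSketch {

    // Rounds down to 1 January, 00:00:00.000 UTC of the same year (assumed rounding rule).
    static ZonedDateTime roundToYear(ZonedDateTime timestamp) {
        return timestamp.withZoneSameInstant(ZoneOffset.UTC)
                .withDayOfYear(1)
                .truncatedTo(ChronoUnit.DAYS);
    }

    public static void main(String[] args) {
        ZonedDateTime february = ZonedDateTime.of(2013, 2, 3, 10, 15, 30, 0, ZoneOffset.UTC);
        ZonedDateTime november = ZonedDateTime.of(2013, 11, 20, 23, 59, 59, 0, ZoneOffset.UTC);
        ZonedDateTime nextYear = ZonedDateTime.of(2014, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);

        // Two dates in 2013 round to the same value, so "=" selects the whole year.
        System.out.println(roundToYear(february).equals(roundToYear(november))); // true
        // A date in 2014 rounds to a different value and is excluded.
        System.out.println(roundToYear(february).equals(roundToYear(nextYear))); // false
    }
}
```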

 

Detailed explanation of the Date Range Query performance problem in Jackrabbit

When a Date property in Jackrabbit gets indexed in Lucene, its exact (non-rounded) value is used. Hippo documents store several timestamps as Date properties with millisecond resolution, such as creationDate, lastModifiedDate or publicationDate.

Doing a range query in Lucene results in a query expansion, similar to a BooleanQuery, in which all unique values in the specified range are OR-ed. When the property on which the range query is performed is a timestamp, this results in an OR term per document. Consider 100,000 documents, each with a unique timestamp for creationDate, and a range query with both a lower and an upper bound (a "between"). Jackrabbit treats the < and > constraints as separate ranges, so instead of expanding only the values between the two bounds, each range is expanded separately and together they cover all terms. The result is an OR query with 100,000 terms, which will not perform.
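
The effect of rounding on the number of OR-ed terms can be estimated with a short sketch in plain Java. It does not use Lucene or Jackrabbit; it only counts the distinct values that would have to be expanded, using synthetic timestamps (one document roughly every 61 seconds, an arbitrary choice for illustration). All names in it are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

final class UniqueTermCountSketch {

    // Counts the distinct values that remain after truncating each timestamp to the given unit.
    static int distinctTerms(long[] epochMillis, ChronoUnit unit) {
        Set<Instant> terms = new HashSet<>();
        for (long millis : epochMillis) {
            terms.add(Instant.ofEpochMilli(millis).truncatedTo(unit));
        }
        return terms.size();
    }

    // Synthetic data: one document every 61,001 ms, starting at 2013-01-01T00:00:00Z.
    static long[] syntheticTimestamps(int count) {
        long start = Instant.parse("2013-01-01T00:00:00Z").toEpochMilli();
        long[] result = new long[count];
        for (int i = 0; i < count; i++) {
            result[i] = start + i * 61_001L;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = syntheticTimestamps(100_000);
        // Every millisecond timestamp is unique, so there is one term per document.
        System.out.println("millisecond terms: " + distinctTerms(timestamps, ChronoUnit.MILLIS));
        // At day resolution only one term per calendar day remains.
        System.out.println("day terms:         " + distinctTerms(timestamps, ChronoUnit.DAYS));
    }
}
```

With this synthetic data set, 100,000 millisecond terms shrink to 71 day terms, which is the kind of reduction that makes the expanded query manageable.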

To solve the problem outlined above, Hippo Repository ships out of the box with rounded timestamps, on which you can do very fast, efficient and scalable date range queries.

Note that Lucene 2.9 and higher supports NumericRangeQuery (TrieRangeQuery) as a general solution for fast range queries on, for example, date fields. However, supporting this in Jackrabbit is non-trivial and has not been done (yet). Also, although NumericRangeQuery is far more efficient than a normal range query on dates, it is by far not as efficient as the range queries on resolution supported by Hippo Repository.
