
This article covers a Hippo CMS version 11. There's an updated version available that covers our most recent release.

04-01-2018

Cluster-wide locking with the LockManager service

This feature is available since Hippo CMS 12.1.0, 12.0.3, 11.2.4 and 10.2.8.

As of Hippo CMS 12.1.0, 12.0.3, 11.2.4 and 10.2.8, a new LockManager service provides a more scalable, lightweight and resilient way to ensure sequential process execution across a Hippo CMS cluster. It is a superior alternative to ‘native’ JCR-based locking. The LockManager service can also be used for cluster-wide master selection. For the reason why we introduced this new LockManager, see Rationale behind the new LockManager below.

When to use

Getting hold of a cluster-wide lock typically serves one of three use cases, all of which can easily be achieved with the LockManager.

  1. A task/job that needs to be executed once in the entire cluster
  2. A task/job that needs to be executed on every cluster node, but is not allowed to be done concurrently
  3. Leader Election: At one cluster node a long running job needs to be executed, and if that cluster node dies, another cluster node should take over 

(1) can be achieved by obtaining a cluster-wide lock as described below at Cluster-wide locking with the LockManager. In the // Do work part of the code shown there, the code can check whether it still has to run or whether another cluster node has already executed the task/job. (2) can be achieved by using a cluster-wide lock with waitForLock in LockManagerUtils, see Waiting for a Lock below. Lastly, (3) can be achieved by using (1) in a specific way, see Leader Election with the LockManager below.

Usage in combination with JCR

When you execute JCR-related code after getting hold of the cluster-wide lock, you must almost always first invoke

session.refresh(true|false)

where session is a JCR session. The refresh triggers a cluster-wide JCR sync, making sure your local cluster node is up to date with the latest global cluster changes. If you don't refresh the session, another cluster node that held the lock earlier may already have modified the JCR nodes you are going to touch, possibly resulting in erroneous behavior or in an InvalidItemStateException when trying to persist the JCR node changes. Thus in general:

When using JCR-related code within a cluster-wide synchronized block of code, always start with session.refresh(true|false).

Getting hold of the LockManager

LockManager lockManager = HippoServiceRegistry.getService(LockManager.class);

Cluster-wide locking with the LockManager

When a Lock is obtained, that Lock is tied to the Thread that obtained it and can only be unlocked by the same Thread. The number of invocations of lock(String) must be balanced with invocations of unlock(String), since calling lock(String) multiple times increases the hold count: the Lock is only really freed when the hold count is 0. The following pattern uses a cluster-wide lock on key:

public void run() {
  try (LockResource ignore = lockManager.lock(key)) {
     // session.refresh(true|false) if JCR nodes are involved
     // Do work
  } catch (AlreadyLockedException e) {
     log.info("'{}' is already locked", key, e);
  } catch (LockException e) {
     log.error("Exception while trying to obtain lock", e);
  }
}

The above example uses a try-with-resources construct: LockResource implements the AutoCloseable interface, and its close() frees the lock again. The same example without try-with-resources is as follows:

public void run() {
  boolean locked = false;
  try {
     LockResource resource = lockManager.lock(key);
     locked = true;
     // session.refresh(true|false) if JCR nodes are involved
     // Do work
  } catch (AlreadyLockedException e) {
     log.info("'{}' is already locked", key, e);
  } catch (LockException e) {
     log.error("Exception while trying to obtain lock", e);
  } finally {
     if (locked) {
       lockManager.unlock(key);
     }
  }
}

Note that when key is already locked by another Thread or by another cluster node, the invocation of lock(key) directly results in an AlreadyLockedException. This is different from the behavior of ReentrantLock.lock(), which blocks until the lock is acquired. If you need behavior similar to ReentrantLock.lock() but cluster-wide, you can use LockManagerUtils.waitForLock(LockManager, String, long), and if you need the cluster-wide equivalent of ReentrantLock.tryLock(long, TimeUnit), you can use LockManagerUtils.waitForLock(LockManager, String, long, long).
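Since this section compares the LockManager with ReentrantLock, the hold count rule can be tried out locally with ReentrantLock itself. The sketch below is only a single-JVM analogy and does not use the LockManager: the class name HoldCountAnalogy is made up for illustration, and the non-blocking tryLock() is used to mimic the fail-fast nature of lock(key), which throws an AlreadyLockedException rather than returning false.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Single-JVM analogy only: ReentrantLock stands in for a cluster-wide lock.
class HoldCountAnalogy {

    /** Returns the hold counts seen after two lock() calls, one unlock(), and a second unlock(). */
    static int[] holdCounts() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();                              // hold count 1
        lock.lock();                              // same thread again: hold count 2
        int afterTwoLocks = lock.getHoldCount();
        lock.unlock();                            // hold count 1: still locked
        int afterOneUnlock = lock.getHoldCount();
        lock.unlock();                            // hold count 0: now really freed
        return new int[] {afterTwoLocks, afterOneUnlock, lock.getHoldCount()};
    }

    /** Returns whether another thread can take the lock while one unbalanced hold remains. */
    static boolean otherThreadCanLockWhileHeld() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        lock.lock();
        lock.unlock();                            // unbalanced: one hold remains
        boolean[] acquired = new boolean[1];
        Thread other = new Thread(() -> {
            acquired[0] = lock.tryLock();         // fails fast instead of blocking
            if (acquired[0]) {
                lock.unlock();
            }
        });
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        lock.unlock();                            // balance the remaining hold
        return acquired[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(holdCounts()));
        System.out.println(otherThreadCanLockWhileHeld());
    }
}
```

The second method shows why the calls must be balanced: as long as one hold remains, no other thread can obtain the lock.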

Unlocking with another Thread

As explained above, a created Lock is tied to the Thread that invoked lockManager.lock(key). Unlocking the Lock via the LockManager can only be done if the same Thread that created the Lock invokes lockManager.unlock(key). However, via the LockResource that is returned by lockManager.lock(key), another Thread can unlock the Lock by calling LockResource.close(). Note that while the LockResource may be closed by another Thread, the Lock itself remains tied to the Thread that created it. Therefore the Thread that created the Lock must not terminate before the other Thread completes the process requiring the Lock, as the Lock may otherwise expire prematurely.
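The hand-off described above could look roughly like the following sketch. It runs without a cluster by using small stand-in types: LockException and LockResource here mirror only the calls mentioned in this article, and InMemoryLockManager, its isLocked helper and LockHandOffSketch are hypothetical names introduced for illustration. A real project would obtain the LockManager from the HippoServiceRegistry instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-ins that mirror only the calls used in this article.
class LockException extends Exception {
    LockException(String message) { super(message); }
}

interface LockResource extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical in-memory lock store so the sketch runs without a cluster.
class InMemoryLockManager {
    private final ConcurrentHashMap<String, Thread> owners = new ConcurrentHashMap<>();

    LockResource lock(String key) throws LockException {
        Thread me = Thread.currentThread();
        if (owners.putIfAbsent(key, me) != null) {
            throw new LockException("'" + key + "' is already locked");
        }
        return () -> owners.remove(key, me);  // close() may be called from any thread
    }

    boolean isLocked(String key) {
        return owners.containsKey(key);
    }
}

class LockHandOffSketch {

    /** Locks on the calling thread, releases in a worker thread; returns true if the key ends up free. */
    static boolean lockHereReleaseInWorker(String key) {
        InMemoryLockManager lockManager = new InMemoryLockManager();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            LockResource resource = lockManager.lock(key);  // the lock is tied to this thread
            Future<?> done = worker.submit(() -> {
                try {
                    // Do the work that requires the lock
                } finally {
                    resource.close();                       // released by the worker thread
                }
            });
            done.get();  // the locking thread stays alive until the worker has finished
            return !lockManager.isLocked(key);
        } catch (LockException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(lockHereReleaseInWorker("report-job"));
    }
}
```

The call to done.get() is the important design choice: it keeps the thread that created the lock alive until the worker has closed the LockResource.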

Waiting for a Lock

If you need a task/job to be executed on every cluster node, but the task/job is not allowed to run concurrently on different cluster nodes, you can easily achieve this with the LockManager as follows:

LockManager lockManager = HippoServiceRegistry.getService(LockManager.class);
try {
    LockManagerUtils.waitForLock(lockManager, key, 500);
    // session.refresh(true|false) if JCR nodes are involved
    // Do stuff, then release the lock again (for example with lockManager.unlock(key) as shown earlier)
} catch (LockException | InterruptedException e) {
    // handle exception
}

The above LockManagerUtils.waitForLock waits indefinitely until it succeeds in retrieving a Lock for key. It retries every 500 milliseconds. If you want a timeout for retrieving a Lock, you can use

LockManagerUtils.waitForLock(lockManager, key, 500, 1000 * 60);

where now at most 1 minute is waited. For those familiar with java.util.concurrent.locks.ReentrantLock, the two waitForLock methods above can best be compared to cluster-wide equivalents of ReentrantLock.lock() and ReentrantLock.tryLock(timeout, unit), respectively.
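To make the retry interval and the maximum wait more concrete, here is a minimal sketch of a polling wait. It is not the LockManagerUtils source: PollingWaitSketch and waitFor are hypothetical names, the tryAcquire callback stands in for "call lockManager.lock(key) and report false on AlreadyLockedException", and signalling the timeout with a TimeoutException is a choice made for this sketch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating retry-with-timeout; not the library implementation.
class PollingWaitSketch {

    /**
     * Calls tryAcquire until it returns true, sleeping waitIntervalMillis between attempts.
     * Gives up with a TimeoutException once maxWaitMillis has elapsed.
     */
    static void waitFor(BooleanSupplier tryAcquire, long waitIntervalMillis, long maxWaitMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (!tryAcquire.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("gave up after " + maxWaitMillis + " ms");
            }
            Thread.sleep(waitIntervalMillis);
        }
    }

    /** Simulates a lock that becomes free on the third attempt; returns the attempts made. */
    static int attemptsUntilAcquired() {
        AtomicInteger attempts = new AtomicInteger();
        try {
            waitFor(() -> attempts.incrementAndGet() >= 3, 10, 1000);
        } catch (TimeoutException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return attempts.get();
    }

    /** Simulates a lock that never becomes free; returns true if the wait timed out. */
    static boolean timesOutWhenNeverFree() {
        try {
            waitFor(() -> false, 10, 50);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A shorter interval reacts faster when the lock is freed but polls the lock store more often, so the 500 milliseconds used above is a trade-off rather than a required value.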

Leader Election with the LockManager

With the LockManager, it is relatively easy to achieve Leader Election capabilities; relatively, because Leader Election in a cluster is a non-trivial exercise in general. The crux is having a process running on each cluster node that tries to claim a cluster-wide Lock for the same key. The cluster node that succeeds in claiming the Lock becomes the leader (master). All other cluster nodes keep trying to get hold of the Lock for key, because the current leader can die: if the leader dies, another cluster node becomes leader.

The trickiest part is that a cluster node that has become the leader should gracefully release the Lock for key when that cluster node is taken down. If the leader shuts down without releasing the Lock for key, another cluster node can only become leader after the never-released Lock has expired, which takes at most 1 minute and results in a warning being logged that a lock was never released.

In an upcoming version, we will most likely add support for Leader Election via a HippoServiceRegistry service, so that end projects can use an available service that tells them whether they are the leader (for a certain lock key) or not.
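The approach described above can be sketched as follows. This is an illustration under assumptions, not a supported implementation: the LockManager, LockResource and exception types below are stand-ins that mirror only the calls used in this article, InMemoryLockManager replaces the cluster-wide lock store, two threads play the role of two cluster nodes, and LeaderCandidate and LeaderElectionSketch are hypothetical names. Lock expiry after an ungraceful shutdown is not modelled.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-ins mirroring only the calls used in this article.
class LockException extends Exception {
    LockException(String message) { super(message); }
}

class AlreadyLockedException extends LockException {
    AlreadyLockedException(String message) { super(message); }
}

interface LockResource extends AutoCloseable {
    @Override
    void close();
}

interface LockManager {
    LockResource lock(String key) throws LockException;
}

// Hypothetical in-memory lock store; each thread below plays one cluster node.
class InMemoryLockManager implements LockManager {
    private final ConcurrentHashMap<String, Thread> owners = new ConcurrentHashMap<>();

    @Override
    public LockResource lock(String key) throws LockException {
        Thread me = Thread.currentThread();
        if (owners.putIfAbsent(key, me) != null) {
            throw new AlreadyLockedException("'" + key + "' is already locked");
        }
        return () -> owners.remove(key, me);
    }
}

// Keeps trying to claim the lock; whoever holds it runs leaderWork as the leader.
class LeaderCandidate implements Runnable {
    private final LockManager lockManager;
    private final String key;
    private final Runnable leaderWork;
    private final long retryMillis;

    LeaderCandidate(LockManager lockManager, String key, Runnable leaderWork, long retryMillis) {
        this.lockManager = lockManager;
        this.key = key;
        this.leaderWork = leaderWork;
        this.retryMillis = retryMillis;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try (LockResource ignore = lockManager.lock(key)) {
                leaderWork.run();  // this node is the leader until leaderWork returns
                return;            // leaving the try block releases the lock gracefully
            } catch (AlreadyLockedException e) {
                try {
                    Thread.sleep(retryMillis);  // another node leads: retry later
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            } catch (LockException e) {
                return;            // a real implementation would log and possibly retry
            }
        }
    }
}

class LeaderElectionSketch {

    /** Runs two candidates; returns {max simultaneous leaders, completed leader terms}. */
    static int[] runTwoCandidates() {
        LockManager lockManager = new InMemoryLockManager();
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        AtomicInteger terms = new AtomicInteger();
        Runnable leaderWork = () -> {
            maxActive.accumulateAndGet(active.incrementAndGet(), Math::max);
            try {
                Thread.sleep(50);  // stands in for the long-running leader job
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            active.decrementAndGet();
            terms.incrementAndGet();
        };
        Thread a = new Thread(new LeaderCandidate(lockManager, "leader", leaderWork, 10));
        Thread b = new Thread(new LeaderCandidate(lockManager, "leader", leaderWork, 10));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {maxActive.get(), terms.get()};
    }
}
```

In this sketch the second candidate takes over as soon as the first one finishes its work and the try-with-resources block releases the lock. In a real cluster the leader job would typically run until the node shuts down, and the shutdown hook or service lifecycle would have to stop the job so that the lock is released before the node exits.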

Rationale behind the new LockManager

Using JCR (Apache Jackrabbit) locking for short-lived locks generally doesn't cause problems. If however longer-lived locks are needed, the JCR locking API has limitations which may cause lock timeouts under extreme conditions with concurrent and long-running JCR sessions.

To guarantee proper locking semantics even in such conditions and in a scalable way, we decided to ‘sidestep’ the native JCR locking mechanism and provide an additional LockManager service which doesn’t use or depend on JCR.

We designed this new LockManager service to be more lightweight and easier to use, understand, and manage. Therefore, we decided to replace all usages of the native JCR lock and the JCR-based HippoLock API with this new solution throughout the core of the product.
We also deprecated the HippoLock API to be removed in a future major release (v13 or later).

Because lock management plays a critical role in core product features (workflow, schedulers, replication, relevance etc.), we also decided to backport this new LockManager service and its usages for all currently supported releases. Therefore, the upcoming maintenance releases 12.0.3, 11.2.4, and 10.2.8 will all provide this major technical improvement.

The new LockManager solution uses a dedicated database table which by default will be created automatically at first deployment or upgrade. In case database schema changes are not permitted for the database credentials used by the repository, please read the Upgrade 12.0.2 to 12.1.0 instructions.
