HTML cleaning - BloomReach Experience - Open Source CMS
07-01-2019

HTML cleaning

The contents of HTML fields can be cleaned on both the client side and the server side.

Client-side

Client-side HTML cleaning is done by CKEditor itself. This feature is called Advanced Content Filter (ACF). Each plugin and command added to or removed from CKEditor influences the allowed HTML. For example, when there is no plugin to add an image, <img> tags will be removed automatically. This filtering also applies to attributes, which can, for instance, be allowed or required.

ACF can also be controlled per editor instance via the configuration property extraAllowedContent. Note that since BloomReach Experience Manager 12, extraAllowedContent must be specified in JSON object format. For example:

{
  extraAllowedContent: {q: {}, cite: {classes: 'myclass'}}
}

More information on ACF and how to configure it can be found on the CKEditor documentation website.

Disable client-side HTML cleaning

ACF is enabled by default. To disable ACF, set the CKEditor property allowedContent to true:

ckeditor.config.overlayed.json:

{
  allowedContent: true
}

Server-side

Server-side HTML cleaning is done by an HTML-processor. The HTML-processor checks, cleans, and corrects the output of rich-text fields, and manages internal links and images. It is configured with a whitelist that defines which elements are allowed and which attributes they may contain. Elements and attributes that are not whitelisted are stripped from the output; the text nodes of stripped elements are preserved.
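
To make the whitelist behaviour concrete, here is a minimal sketch in Python using only the standard library. It is not the HTML-processor's actual implementation: the WHITELIST contents, the WhitelistFilter class, and the clean function are hypothetical names chosen for illustration. It only shows the general idea of dropping non-whitelisted elements and attributes while keeping the text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> attribute names allowed on it.
WHITELIST = {
    "p": set(),
    "strong": set(),
    "a": {"href", "title"},
}


class WhitelistFilter(HTMLParser):
    """Drops non-whitelisted elements and attributes but keeps text."""

    def __init__(self, whitelist):
        super().__init__(convert_charrefs=True)
        self.whitelist = whitelist
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.whitelist:
            return  # tag is dropped; its text still arrives via handle_data
        allowed = self.whitelist[tag]
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in allowed
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.whitelist:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text nodes are always preserved (re-escaped for output).
        self.out.append(escape(data, quote=False))


def clean(html, whitelist=WHITELIST):
    """Return html with everything outside the whitelist stripped."""
    parser = WhitelistFilter(whitelist)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

In this sketch, an input such as `<p onclick="x()">Hi <font color="red">there</font></p>` loses the onclick attribute and the font element, but keeps the text "there".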

By default, server-side HTML cleaning also removes any usage of the javascript: protocol and, as of BloomReach Experience Manager 12.0.4/12.1.1, the data: protocol within <a> href and <object> data attributes. This security feature can be disabled by setting the omitJavascriptProtocol configuration property to false (see the Configuration section below).
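
The protocol check described above can be sketched as follows. This is an illustrative approximation only, not the product's code: the function name is_blocked, its parameters, and the whitespace normalisation step are assumptions. It only detects the offending values; how the real HTML-processor rewrites the markup is not specified here.

```python
import re

# As stated in the text: only <a href> and <object data> are checked.
_CHECKED_ATTRIBUTES = {"a": "href", "object": "data"}
_BLOCKED_PROTOCOLS = ("javascript:", "data:")


def is_blocked(tag, attr, value, omit_javascript_protocol=True):
    """Return True if this attribute value should be removed."""
    if not omit_javascript_protocol:
        return False  # security feature switched off
    if _CHECKED_ATTRIBUTES.get(tag.lower()) != attr.lower():
        return False  # not one of the checked element/attribute pairs
    # Assumption: drop whitespace and control characters before comparing,
    # since browsers tolerate some of them inside a URL.
    normalized = re.sub(r"[\s\x00-\x1f]+", "", value).lower()
    return normalized.startswith(_BLOCKED_PROTOCOLS)
```

Passing omit_javascript_protocol=False mirrors setting the omitJavascriptProtocol property to false: nothing is flagged.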

Configuration

A CKEditor field is configured with an HTML-processor by setting the configuration property htmlprocessor.id. This property can either be specified in the cluster.options node of a field of a specific document-type, or globally (i.e. for all formatted and/or richtext fields). The value of this property should correspond to the name of the HTML-processor configuration node as defined in the HTML-processor module, which is located at:

/hippo:configuration/hippo:modules/htmlprocessor/hippo:moduleconfig

By default, the CMS is bootstrapped with the following HTML-processor configurations:

  1. formatted: contains a whitelist of elements used in Formatted fields.
  2. richtext: contains a whitelist of elements used in Rich Text fields and manages internal links and images.
  3. no-filter: contains an empty whitelist but does manage internal links and images when applied to Rich Text fields.

The configuration node of an HTML-processor is of nodetype hipposys:moduleconfig and has the following properties available:

  • charset: the character set of the output. Defaults to UTF-8.
  • serializer: the type of serializer to use. Valid values are pretty, compact, and simple. Defaults to simple.
  • convertLineEndings: whether to convert CRLF to LF when storing HTML, and vice versa when reading HTML. Defaults to true.
  • omitComments: whether to strip comments from the HTML. Defaults to false.
  • omitJavascriptProtocol: whether usages of the javascript: protocol (and, as of 12.0.4/12.1.1, the data: protocol) are removed from the HTML. Defaults to true.
  • filter: whether to apply whitelist filtering. Defaults to true.

Whitelisted HTML elements are defined as child nodes of nodetype hipposys:moduleconfig. The name of each such node corresponds to the name of the whitelisted element. An element node may contain a multi-valued property called attributes that lists the HTML attributes allowed on the element.

The pretty and compact serializers add whitespace characters to the HTML source to make it human-readable. This may result in unwanted spacing around superscripts and subscripts. For this reason, the default serializer is simple.

Disable server-side HTML cleaning

Change the configuration property htmlprocessor.id to no-filter.
