Elasticsearch Character Filters

Elasticsearch Character Filters preprocess the stream of characters (by adding, removing, or changing characters) before it is passed to the Tokenizer. In this tutorial, we’re gonna look at 3 types of Character Filters: HTML Strip, Mapping, and Pattern Replace, which are important building blocks for Custom Analyzers.

1. HTML Strip Character Filter

html_strip character filter can:
– strip out HTML elements (like <b>)
– replace HTML entities with their decoded value (&amp; becomes &).

For example:
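Here is a minimal sketch of such a request, built as a Python dict so it can be inspected without a running cluster. It targets the `_analyze` API with the `keyword` tokenizer, so the filtered text comes back as a single term. The sample text and the helper name `html_strip_request` are assumptions made for illustration.

```python
import json


def html_strip_request(text):
    """Build an _analyze request body that applies the built-in html_strip filter."""
    return {
        # The keyword tokenizer emits the whole filtered text as one term,
        # which makes the effect of the character filter easy to see.
        "tokenizer": "keyword",
        "char_filter": ["html_strip"],
        "text": text,
    }


strip_body = html_strip_request("<p>I&apos;m so <b>happy</b>!</p>")

# Send this JSON as the body of: POST _analyze
print(json.dumps(strip_body, indent=2))
```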

Configuration

escaped_tags: array of HTML tags which should not be stripped.

For example, we want to leave <b> and <p> tags in place:
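A sketch of how that configuration could look, again as a Python dict for the `_analyze` API. It assumes the filter is defined inline in the request; the same definition could instead be registered under `analysis.char_filter` in the index settings. The helper name `html_strip_keep_tags_request` and the sample text are hypothetical.

```python
import json


def html_strip_keep_tags_request(text, keep_tags):
    """Build an _analyze request body with an html_strip filter that keeps some tags."""
    return {
        "tokenizer": "keyword",
        "char_filter": [
            {
                "type": "html_strip",
                # Tag names are given without angle brackets.
                "escaped_tags": list(keep_tags),
            }
        ],
        "text": text,
    }


keep_body = html_strip_keep_tags_request(
    "<p>I&apos;m so <b>happy</b>!</p>", ["b", "p"]
)

# Send this JSON as the body of: POST _analyze
print(json.dumps(keep_body, indent=2))
```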

2. Mapping Character Filter

mapping character filter accepts a map of keys and values. It replaces any string of characters that matches a key with the corresponding value.

– the longest pattern wins (if the original string is javasampleapproach, the javasample key wins over the java key)
– replacements are allowed to be the empty string
– either the mappings or the mappings_path parameter must be provided:
+ mappings: array of mappings (each element having the form key => value)
+ mappings_path: path, either absolute or relative to the config directory, to a UTF-8 encoded text file containing one key => value mapping per line.
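To make the "longest pattern wins" rule concrete, here is a small local emulation in plain Python. It is not Elasticsearch code, only a sketch of the matching behaviour described above; the function name `apply_mappings` and the replacement values `J` and `JS` are made up for the demo.

```python
def apply_mappings(text, mappings):
    """Replace keys with values, preferring the longest key at each position."""
    # Try longer keys first so "javasample" beats "java" at the same position.
    keys = sorted(mappings, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(mappings[key])  # the value may be an empty string
                i += len(key)
                break
        else:
            # No key matches here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = {"java": "J", "javasample": "JS"}
print(apply_mappings("javasampleapproach", rules))  # JSapproach
print(apply_mappings("a-b-c", {"-": ""}))           # abc
```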

For example, we want to replace T_T and LOL with their text equivalents:
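A sketch of a matching `_analyze` request body, expressed as a Python dict. The replacement values `_sad_` and `_laughing_`, the sample sentence, and the helper name `mapping_request` are assumptions for illustration; any text equivalents can be used.

```python
import json


def mapping_request(text, rules):
    """Build an _analyze request body with an inline mapping character filter."""
    return {
        "tokenizer": "standard",
        "char_filter": [
            {
                "type": "mapping",
                # Each rule has the form "key => value".
                "mappings": [f"{key} => {value}" for key, value in rules.items()],
            }
        ],
        "text": text,
    }


emoticon_rules = {"T_T": "_sad_", "LOL": "_laughing_"}
mapping_body = mapping_request("I missed the bus T_T but LOL anyway", emoticon_rules)

# Send this JSON as the body of: POST _analyze
print(json.dumps(mapping_body, indent=2))
```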

3. Pattern Replace Character Filter

pattern_replace character filter uses a regular expression to match characters, then replaces them with the specified replacement string. The replacement string can refer to capture groups in the regular expression.

This filter has some configuration params:
– pattern (required): Java regular expression.
– replacement: the replacement string, which can reference capture groups using the $1..$9 syntax.
– flags: Java regular expression flags (for example, "CASE_INSENSITIVE|COMMENTS").

For example, we want the filter to replace any embedded dashes in a string with underscores:
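One way this could look is sketched below as a Python dict for the `_analyze` API. As an assumption, the pattern only targets dashes that sit between digits (so 123-456-789 would become 123_456_789); the sample text and the helper name `pattern_replace_request` are hypothetical. The last lines check the same regular expression locally with Python's `re` module, which writes the group reference as `\1` where the Java syntax uses `$1`.

```python
import json
import re

# A run of digits followed by a dash, where another digit comes right after the dash.
DASH_PATTERN = r"(\d+)-(?=\d)"


def pattern_replace_request(text):
    """Build an _analyze request body with an inline pattern_replace filter."""
    return {
        "tokenizer": "standard",
        "char_filter": [
            {
                "type": "pattern_replace",
                "pattern": DASH_PATTERN,
                "replacement": "$1_",  # Java-style reference to capture group 1
            }
        ],
        "text": text,
    }


sample = "My credit card is 123-456-789"
pattern_body = pattern_replace_request(sample)

# Send this JSON as the body of: POST _analyze
print(json.dumps(pattern_body, indent=2))

# Local check of the regular expression only, no cluster involved.
local_result = re.sub(DASH_PATTERN, r"\1_", sample)
print(local_result)  # My credit card is 123_456_789
```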


By grokonez | November 14, 2017.

