Elasticsearch Tokenizers – Structured Text Tokenizers

In this tutorial, we're going to look at Structured Text Tokenizers, which are usually used with structured text such as identifiers, email addresses, zip codes, and paths.

I. Keyword Tokenizer

The keyword tokenizer is the simplest tokenizer: it accepts whatever text it is given and outputs the exact same text as a single term.

For example:
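A minimal sketch using Elasticsearch's _analyze API (the sample text here is our own illustration):

POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York City"
}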

Term:

[ New York City ]

Even though the input contains spaces, the whole string is emitted as a single token.

II. Pattern Tokenizer

The pattern tokenizer uses a regular expression either to split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

For example:
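A minimal sketch with the default \W+ pattern (the sample sentence is our own choice):

POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}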

Terms:

[ The, foo_bar_size, s, default, is, 5 ]

Note that the apostrophe is a non-word character, so it acts as a separator too, which is why s appears as its own term.

Configuration

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, combined with | (for example: "CASE_INSENSITIVE|COMMENTS"). See Java's regex Pattern documentation for the full list of flags.
group: the capture group to extract as tokens. Defaults to -1 (split on the pattern rather than capture).

For example, we want to break text into tokens whenever the tokenizer encounters a comma:
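A sketch of the setup (the names my_index, my_analyzer, and my_tokenizer, as well as the sample text, are our own choices):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}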

Terms:

[ comma, separated, values ]

III. Path Tokenizer

The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree.

For example:
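A minimal sketch with the default settings (the sample path is our own illustration):

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}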

Terms:

[ /one, /one/two, /one/two/three ]

Each term is a prefix of the full path, which makes this tokenizer useful for matching files by any of their parent directories.

Configuration

delimiter: character to use as the path separator. Defaults to /.
replacement: optional replacement character to use for the delimiter. Defaults to the delimiter.
buffer_size: number of characters read into the term buffer in a single pass. Defaults to 1024. The term buffer will grow by this size until all the text has been consumed. It is advisable not to change this setting.
reverse: if true, emits the tokens in reverse order. Defaults to false.
skip: number of initial tokens to skip. Defaults to 0.

For example, we configure the tokenizer to split on - characters, replace them with /, and skip the first two tokens:
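A sketch of the setup (the names my_path_index, my_path_analyzer, and my_path_tokenizer, as well as the sample text, are our own choices):

PUT my_path_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_path_analyzer": {
          "tokenizer": "my_path_tokenizer"
        }
      },
      "tokenizer": {
        "my_path_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_path_index/_analyze
{
  "analyzer": "my_path_analyzer",
  "text": "one-two-three-four-five"
}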


Terms:

[ /three, /three/four, /three/four/five ]

The first two components (one and two) are skipped, and each remaining delimiter is replaced with /.

If reverse is true:

Terms:

[ one/two/three/, two/three/, three/ ]


By grokonez | November 11, 2017.

