Elasticsearch Tokenizers – Partial Word Tokenizers

In this tutorial, we're going to look at two tokenizers that can break up text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does 2 things:
– breaks up text into words when it encounters specified characters (whitespace, punctuation…)
– emits N-grams of each word of the specified length (e.g. quick with length = 2 -> [qu, ui, ic, ck])

=> N-grams are like a sliding window of continuous letters.

For example, with the default configuration, it will generate terms using a sliding window (1 character min width, 2 characters max width):
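A minimal sketch with the _analyze API (the sample text "Quick Fox" is our assumption; the original post's input is not shown):

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

Terms:

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

Note that some grams contain whitespace, because token_chars defaults to [] (keep all characters), so the window slides over the whole text.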

Configuration

min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2.
token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don’t belong to:
+ letter (a, b, …)
+ digit (1, 2, …)
+ whitespace (" ", "\n", …)
+ punctuation (!, ", …)
+ symbol ($, %, …)

Defaults to [] (keep all characters).

For example, we will create a tokenizer with a fixed sliding window (width = 3, i.e. min_gram = max_gram = 3) and only two character classes: letter & digit.
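A sketch of the index settings and a test request (the index/analyzer/tokenizer names and the sample text "Spring 5: Framework" are our assumptions):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Spring 5: Framework"
}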

Terms (for our assumed sample text):
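[ Spr, pri, rin, ing, Fra, ram, ame, mew, ewo, wor, ork ]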

We can see that ":" (punctuation) does not appear in the terms, and neither does "5": it is a digit, but the lone "5" is shorter than min_gram, and a gram like "g 5" is no longer possible because it would span whitespace.

II. Edge N-Gram Tokenizer

The edge_ngram tokenizer does 2 things:
– breaks up text into words when it encounters specified characters (whitespace, punctuation…)
– emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word (quick -> [q, qu, qui, quic, quick])

For example, with the default configuration, it will generate terms with minimum length = 1 and maximum length = 2:
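A minimal sketch with the _analyze API (the sample text "Quick Fox" is again our assumption):

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

Terms:

[ Q, Qu ]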

We can see that the default lengths are not very useful: only the first two prefixes of the whole text are emitted. We need to configure more.

Configuration

min_gram: minimum length of characters in a gram. Defaults to 1.
max_gram: maximum length of characters in a gram. Defaults to 2.
token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don’t belong to:
+ letter (a, b, …)
+ digit (1, 2, …)
+ whitespace (" ", "\n", …)
+ punctuation (!, ", …)
+ symbol ($, %, …)

Defaults to [] (keep all characters).

For example, we will create a tokenizer that treats letters and digits as tokens, and produces grams with minimum length 2 and maximum length 8:
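A sketch of the settings and test request (again, the names and the sample text "Spring 5: Framework" are our assumptions):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_tokenizer" }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Spring 5: Framework"
}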


Terms (for our assumed sample text):
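[ Sp, Spr, Spri, Sprin, Spring, Fr, Fra, Fram, Frame, Framew, Framewo, Framewor ]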

We can see that ":" (punctuation) and "5" (shorter than min_gram, and a gram like "g 5" would span whitespace) do not appear, and that the full word "Framework" (length > 8) is not contained in the terms: only its prefixes of up to 8 characters are.

By grokonez | November 10, 2017.

