Elasticsearch Analyzers – Custom Analyzer

In this tutorial, we’re going to look at how to create an Elasticsearch Custom Analyzer.

I. Custom Analyzer

A Custom Analyzer is a combination of:
character filters (optional) -> tokenizer -> token filters (optional)

In accordance with these components, it has the following parameters:
– char_filter (optional): an array of built-in or customised character filters.
– tokenizer (required): a built-in or customised tokenizer (Word Oriented Tokenizers, Partial Word Tokenizers, Structured Text Tokenizers).
– filter (optional): an array of built-in or customised token filters.
– position_increment_gap (optional): when indexing an array of text values, Elasticsearch inserts a fake “gap” between the last term of one value and the first term of the next, so that a phrase query doesn’t match two terms from different array elements. Defaults to 100.

For example, with the array "titles": [ "Java Sample Approach", "Java Technology" ], the “gap” between the term approach and the following term java is position_increment_gap (100 positions by default).
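To see how these parameters fit together, here is a minimal sketch of a custom analyzer declared in index settings. The index name my_index and analyzer name my_custom_analyzer are illustrative, and the component names shown (standard tokenizer, lowercase filter) are built-ins standing in for whatever you configure:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "standard",
          "filter": [ "lowercase" ],
          "position_increment_gap": 100
        }
      }
    }
  }
}
```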

II. Example

We will create a Custom Analyzer that can:
– replace ^^ with _happy_ and T_T with _sad_ using Mapping Character Filter
– split on punctuation characters using Pattern Tokenizer
– lowercase token text using Lowercase Token Filter
– use the pre-defined list of English stop words using Stop Token Filter
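The four requirements above can be expressed as index settings along the following lines. The index name my_index and the component names emoticons, punctuation and english_stop are illustrative; the mapping syntax "^^ => _happy_", the pattern tokenizer, and the "_english_" stop-word list are standard Elasticsearch features:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ "^^ => _happy_", "T_T => _sad_" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

You can then try the analyzer with the _analyze API, for example:

```json
POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am ^^ today, but T_T yesterday!"
}
```

The character filter first rewrites ^^ to _happy_ and T_T to _sad_, the pattern tokenizer splits on spaces and punctuation, and the token filters lowercase the tokens and drop English stop words such as but.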


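To make the pipeline concrete without running a cluster, here is a small Python sketch that simulates the same four stages. It is an approximation, not Elasticsearch itself: the stop-word set below is a hand-picked subset standing in for the full _english_ list, and the split pattern mirrors the punctuation characters used above:

```python
import re

# Stage 1: mapping character filter (^^ -> _happy_, T_T -> _sad_)
CHAR_MAPPINGS = {"^^": "_happy_", "T_T": "_sad_"}

# Abbreviated stand-in for Elasticsearch's "_english_" stop-word list
ENGLISH_STOPWORDS = {"a", "an", "and", "are", "but", "is", "the", "to"}

def analyze(text):
    # 1. Character filter: rewrite emoticons before tokenization
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    # 2. Pattern tokenizer: split on spaces and punctuation, drop empty tokens
    tokens = [t for t in re.split(r"[ .,!?]+", text) if t]
    # 3. Lowercase token filter
    tokens = [t.lower() for t in tokens]
    # 4. Stop token filter: remove English stop words
    return [t for t in tokens if t not in ENGLISH_STOPWORDS]

print(analyze("I am ^^ today, but T_T yesterday!"))
# → ['i', 'am', '_happy_', 'today', '_sad_', 'yesterday']
```

Note that "but" is removed by the stop filter, while the mapped tokens _happy_ and _sad_ pass through the remaining stages unchanged.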
By grokonez | November 15, 2017.
