Elasticsearch Term Level Queries – Fuzzy Query

Searching natural language is imprecise because computers cannot fully comprehend it. A fuzzy query finds terms that need at most a certain number of character edits to match the search term. In this tutorial, we look at how to use the Elasticsearch Fuzzy Query, which measures similarity with the Levenshtein edit distance.

1. Fuzzy Query

A fuzzy query will:
– generate all possible matching terms (GEN_TERMS) within the maximum edit distance specified in fuzziness (for example, sumple -> GEN_TERMS[sample, simple, simply…])
– then check the term dictionary to find out which of the GEN_TERMS actually exist in the index.
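
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not how Elasticsearch is implemented internally: the helper names (edits1, gen_terms, fuzzy_lookup) and the tiny term dictionary are made up for this example, and the sketch assumes lowercase ASCII terms and counts only insertions, deletions and substitutions.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: terms are lowercase ASCII


def edits1(term):
    """Strings reachable from term with one insertion, deletion or substitution.

    Substituting a character with itself is allowed, so the result
    also contains term itself.
    """
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right
                   for c in ALPHABET}
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    return deletes | substitutes | inserts


def gen_terms(term, max_edits):
    """Step 1: every string within max_edits edits of term (GEN_TERMS)."""
    terms = {term}
    for _ in range(max_edits):
        terms |= {edited for t in terms for edited in edits1(t)}
    return terms


def fuzzy_lookup(term, term_dictionary, max_edits):
    """Step 2: keep only the generated terms that exist in the dictionary."""
    return sorted(gen_terms(term, max_edits) & set(term_dictionary))


# Hypothetical term dictionary of an index.
term_dictionary = {"sample", "simple", "simply", "supply", "apple"}

print(fuzzy_lookup("sumple", term_dictionary, 1))  # ['sample', 'simple']
print(fuzzy_lookup("sumple", term_dictionary, 2))  # ['sample', 'simple', 'simply', 'supply']
```

Note how quickly the generated set grows: two edits on a six-letter word already produce well over a hundred thousand candidate strings, which is why a larger fuzziness is more expensive.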

As a simple example, a fuzzy query for a misspelling of “quick” returns the documents that contain “quick”, because that term exists in the term dictionary.
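
A minimal sketch of what such a request body could look like, built as a Python dictionary and printed as JSON. The field name title and the misspelled value quikk are assumptions chosen for illustration; they do not come from a real index.

```python
import json

# Hypothetical field ("title") and misspelled value ("quikk").
# "quikk" is one substitution (k -> c) away from "quick".
query_body = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "quikk"
            }
        }
    }
}

# The printed JSON is what would be sent as the body of a _search request.
print(json.dumps(query_body, indent=2))
```

Because no fuzziness is given here, the default (AUTO) applies; for a 5-character value that means one edit is allowed.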

2. Advanced Settings

fuzziness: maximum edit distance (0..2). Defaults to AUTO.
prefix_length: number of initial characters which will not be “fuzzified”. Defaults to 0.
max_expansions: maximum number of terms that the fuzzy query will expand to. Defaults to 50.

Now let’s see how the fuzziness parameter works:
+ It can be specified as 0, 1 or 2: the maximum number of edits (insertions, deletions or substitutions), based on the Levenshtein edit distance.
+ If fuzziness isn’t specified, the maximum number of edits is derived from the length of the term:
-> 0..2: must match exactly (fuzziness = 0)
-> 3..5: fuzziness = 1
-> larger than 5: fuzziness = 2
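
The length rule above is easy to express as a small function. This is just a sketch of the rule as described in this section; auto_fuzziness is a hypothetical helper name, not part of any Elasticsearch client.

```python
def auto_fuzziness(term):
    """Maximum number of edits allowed for term when fuzziness is not specified."""
    length = len(term)
    if length <= 2:
        return 0  # 0..2 characters: must match exactly
    if length <= 5:
        return 1  # 3..5 characters: one edit allowed
    return 2      # more than 5 characters: two edits allowed


for term in ("is", "fire", "firbaze"):
    print(term, len(term), auto_fuzziness(term))
```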

For example, take a fuzzy query for the value “firbaze” with these settings:

"prefix_length": 3: the matching term must start with “fir”.
"fuzziness": 1: “firbaze” matches “firbase” (1 edit: z -> s), but not “firebase” (2 edits: insert e, then z -> s).

If we change to "fuzziness": 2, or we don’t specify fuzziness (which means 2 edits, because “firbaze” has 7 characters, more than 5), the query also matches the “firebase” term.
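
To check the “firbaze” example by hand, the matching rule can be simulated locally. The sketch below is an assumption-laden model, not Elasticsearch code: levenshtein and fuzzy_matches are hypothetical helpers, and only insertions, deletions and substitutions are counted.

```python
def levenshtein(a, b):
    """Classic edit distance: insertion, deletion and substitution cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(value, term, fuzziness, prefix_length=0):
    """True if term shares the required prefix and is within fuzziness edits."""
    if value[:prefix_length] != term[:prefix_length]:
        return False
    return levenshtein(value, term) <= fuzziness


for fuzziness in (1, 2):
    for term in ("firbase", "firebase"):
        matched = fuzzy_matches("firbaze", term, fuzziness, prefix_length=3)
        print(f"fuzziness={fuzziness} term={term} matched={matched}")
```

With fuzziness 1 only “firbase” is matched; with fuzziness 2 both terms are matched, since both start with “fir”.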

Improve Performance

A fuzzy query with fuzziness = 2 can perform very badly because it may match a very large number of terms. We should therefore use a small fuzziness and/or limit the performance impact with the other parameters:

prefix_length: Most spelling errors occur toward the end of the word, not toward the beginning.
=> the bigger, the faster: requiring an exact prefix significantly reduces the number of terms that have to be considered.

max_expansions: the fuzzy query collects matching terms until it runs out of terms or reaches the max_expansions limit.
=> the smaller, the more meaningful: it caps the total number of terms that the query expands to.

fuzziness: AUTO should generally be the preferred value.
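
Putting these recommendations together, a request body with all three parameters could be built as follows. This is a sketch under assumptions: build_fuzzy_query is a hypothetical helper, and the field name title and the chosen limits are illustrative values, not tuned recommendations.

```python
import json


def build_fuzzy_query(field, value, fuzziness="AUTO",
                      prefix_length=0, max_expansions=50):
    """Build a fuzzy query body; the defaults mirror the ones listed in section 2."""
    if fuzziness not in ("AUTO", 0, 1, 2):
        raise ValueError("fuzziness must be 'AUTO', 0, 1 or 2")
    if prefix_length < 0 or max_expansions < 0:
        raise ValueError("prefix_length and max_expansions must not be negative")
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": value,
                    "fuzziness": fuzziness,
                    "prefix_length": prefix_length,
                    "max_expansions": max_expansions,
                }
            }
        }
    }


# Keep AUTO fuzziness, require an exact 3-character prefix,
# and cap the expansion at 10 terms.
body = build_fuzzy_query("title", "firbaze", prefix_length=3, max_expansions=10)
print(json.dumps(body, indent=2))
```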

By grokonez | November 18, 2017.
