Searching natural language is inherently imprecise because computers cannot fully comprehend natural language. A fuzzy query finds words that need at most a certain number of character modifications to match the search term. In this tutorial, we're going to look at how to use the Elasticsearch Fuzzy Query, which measures similarity using Levenshtein edit distance.
1. Fuzzy Query
The fuzzy query will:
– generate all possible matching terms (GEN_TERMS) within the maximum edit distance specified in fuzziness
(for example, sumple -> GEN_TERMS[sample, simple, simply…])
– then check the term dictionary to find out which of GEN_TERMS actually exist in the index.
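The two steps above can be sketched in a few lines of Python. This is only a conceptual illustration of the description, not Lucene's actual implementation: the function names, the lowercase ASCII alphabet and the tiny term dictionary are all assumptions made for the example.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: terms are lowercase ASCII


def edits1(word):
    """Return every string exactly one insertion, deletion or substitution away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right for c in ALPHABET}
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    return (deletes | substitutes | inserts) - {word}


def gen_terms(word, max_edits):
    """Step 1: generate all candidate terms within max_edits of word (GEN_TERMS)."""
    candidates = {word}
    frontier = {word}
    for _ in range(max_edits):
        frontier = {e for w in frontier for e in edits1(w)} - candidates
        candidates |= frontier
    return candidates


def fuzzy_lookup(word, term_dictionary, max_edits):
    """Step 2: keep only the candidates that exist in the term dictionary."""
    return sorted(gen_terms(word, max_edits) & set(term_dictionary))


# Hypothetical term dictionary of an index
term_dictionary = {"sample", "simple", "simply", "summary", "quick"}

print(fuzzy_lookup("sumple", term_dictionary, 1))  # ['sample', 'simple']
print(fuzzy_lookup("sumple", term_dictionary, 2))  # ['sample', 'simple', 'simply']
```

With one edit, "sumple" reaches "sample" and "simple"; "simply" needs two edits (u -> i and e -> y), so it only appears when the maximum edit distance is 2.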
This is a simple example:
GET javasampleapproach/tutorial/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "quuck"
      }
    }
  }
}
The response contains the document matched through the term “quick” in the term dictionary:
{
  ...
  "hits": {
    "total": 1,
    "max_score": 0.56611896,
    "hits": [
      {
        "_index": "javasampleapproach",
        "_type": "tutorial",
        "_id": "3",
        "_score": 0.56611896,
        "_source": {
          "title": "Angular 4 Firebase Quick Start",
          ...
        }
      }
    ]
  }
}
2. Advanced Settings
– fuzziness: maximum edit distance (0..2). Defaults to AUTO.
– prefix_length: number of initial characters which will not be “fuzzified”. Defaults to 0.
– max_expansions: maximum number of terms that the fuzzy query will expand to. Defaults to 50.
Now let's look at how to use the fuzziness parameter:
+ It can be specified as 0, 1 or 2 (the maximum number of edits: insertions, deletions or substitutions), based on Levenshtein edit distance.
+ If fuzziness isn’t specified (AUTO), the maximum number of edits is derived from the length of the term:
-> 0..2 characters: must match exactly (fuzziness = 0)
-> 3..5 characters: fuzziness = 1
-> more than 5 characters: fuzziness = 2
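The length rule above is easy to express as a small function. A minimal sketch, assuming the thresholds listed above; the name auto_fuzziness is ours and not part of any Elasticsearch API:

```python
def auto_fuzziness(term):
    """Maximum number of edits allowed for a term when fuzziness is AUTO."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1
    return 2


for term in ("to", "quuck", "firbaze"):
    print(term, "->", auto_fuzziness(term))
# to -> 0
# quuck -> 1
# firbaze -> 2
```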
For example:
GET javasampleapproach/tutorial/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "firbaze",
        "fuzziness": 1,
        "prefix_length": 3,
        "max_expansions": 30
      }
    }
  }
}
– "prefix_length": 3: a matching term must start with “fir”.
– "fuzziness": 1: “firbaze” will match “firbase” (1 edit: z -> s), but will not match “firebase” (2 edits: insert e and z -> s).
If we change to "fuzziness": 2, or we don’t specify fuzziness at all (which means "fuzziness": 2, because the length of “firbaze” is 7 > 5), it will also match the “firebase” term:
GET javasampleapproach/tutorial/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "firbaze",
        // "fuzziness": 2,
        "prefix_length": 3,
        "max_expansions": 30
      }
    }
  }
}
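To see how the three parameters interact, here is a rough Python simulation of the queries above. It is a sketch under several assumptions: plain Levenshtein distance on the part after the prefix, candidate terms visited in sorted order, and a made-up term dictionary that happens to contain both “firbase” and “firebase”. The helper names are ours, not Elasticsearch's.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_terms(term_dictionary, value, fuzziness, prefix_length=0, max_expansions=50):
    """Return dictionary terms matching value, honouring the three settings."""
    prefix = value[:prefix_length]
    matches = []
    for term in sorted(term_dictionary):  # assumption: sorted visiting order
        if not term.startswith(prefix):
            continue  # the prefix is never "fuzzified"
        if levenshtein(term[prefix_length:], value[prefix_length:]) <= fuzziness:
            matches.append(term)
            if len(matches) >= max_expansions:
                break  # stop expanding once the limit is reached
    return matches


# Hypothetical term dictionary
terms = {"firbase", "firebase", "fireball", "quick"}

print(fuzzy_terms(terms, "firbaze", fuzziness=1, prefix_length=3, max_expansions=30))
# ['firbase']
print(fuzzy_terms(terms, "firbaze", fuzziness=2, prefix_length=3, max_expansions=30))
# ['firbase', 'firebase']
```

Setting max_expansions=1 in the second call would return only the first matching term, which is how the limit trades completeness for speed.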
Improve Performance
A fuzzy query with fuzziness = 2 could perform very badly because it matches a very large number of terms. So we should use a small fuzziness and/or limit the performance impact with other parameters:
– prefix_length: most spelling errors occur toward the end of the word, not toward the beginning.
=> the bigger, the faster: we can significantly reduce the number of matching terms.
– max_expansions: the fuzzy query will collect matching terms until it runs out of terms or reaches the max_expansions limit.
=> the smaller, the more meaningful: we can limit the total number of options that will be produced.
– fuzziness: AUTO should generally be the preferred value.
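The effect of prefix_length on the amount of work can be illustrated with a toy term dictionary. This is a simplified sketch: the terms and the helper terms_to_check are made up for the example, and it only counts how many terms still need an edit-distance check after the prefix filter.

```python
def terms_to_check(term_dictionary, value, prefix_length):
    """Terms that share the fixed prefix and still need an edit-distance check."""
    prefix = value[:prefix_length]
    return [t for t in sorted(term_dictionary) if t.startswith(prefix)]


# Hypothetical term dictionary
terms = {"angular", "elasticsearch", "fast", "firebase",
         "firefox", "first", "fuzzy", "quick"}

for prefix_length in (0, 1, 3):
    remaining = terms_to_check(terms, "firbaze", prefix_length)
    print(prefix_length, len(remaining), remaining)
# 0 8 ['angular', 'elasticsearch', 'fast', 'firebase', 'firefox', 'first', 'fuzzy', 'quick']
# 1 5 ['fast', 'firebase', 'firefox', 'first', 'fuzzy']
# 3 3 ['firebase', 'firefox', 'first']
```

Even on eight terms, a prefix of three characters removes more than half of the candidates; on a real index with many thousands of terms the reduction is what makes the query faster.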