Mixing exact search with stemming


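When building a search application, stemming is often a must, because a query for skiing should match documents that contain ski or skis. But what if a user wants to search for skiing specifically? The typical approach is to use a multi-field so that the same content is indexed in two different ways: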

response = client.indices.create(
  index: 'index',
  body: {
    settings: {
      analysis: {
        analyzer: {
          english_exact: {
            tokenizer: 'standard',
            filter: [
              'lowercase'
            ]
          }
        }
      }
    },
    mappings: {
      properties: {
        body: {
          type: 'text',
          analyzer: 'english',
          fields: {
            exact: {
              type: 'text',
              analyzer: 'english_exact'
            }
          }
        }
      }
    }
  }
)
puts response

response = client.index(
  index: 'index',
  id: 1,
  body: {
    body: 'Ski resort'
  }
)
puts response

response = client.index(
  index: 'index',
  id: 2,
  body: {
    body: 'A pair of skis'
  }
)
puts response

response = client.indices.refresh(
  index: 'index'
)
puts response

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_exact": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "english_exact"
          }
        }
      }
    }
  }
}

PUT index/_doc/1
{
  "body": "Ski resort"
}

PUT index/_doc/2
{
  "body": "A pair of skis"
}

POST index/_refresh

With this setup, searching for ski on body returns both documents:

response = client.search(
  index: 'index',
  body: {
    query: {
      simple_query_string: {
        fields: [
          'body'
        ],
        query: 'ski'
      }
    }
  }
)
puts response

GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "query": "ski"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
        "value": 2,
        "relation": "eq"
    },
    "max_score": 0.18232156,
    "hits": [
      {
        "_index": "index",
        "_id": "1",
        "_score": 0.18232156,
        "_source": {
          "body": "Ski resort"
        }
      },
      {
        "_index": "index",
        "_id": "2",
        "_score": 0.18232156,
        "_source": {
          "body": "A pair of skis"
        }
      }
    ]
  }
}

On the other hand, searching for ski on body.exact returns only document 1, because the analysis chain of body.exact does not perform stemming.

response = client.search(
  index: 'index',
  body: {
    query: {
      simple_query_string: {
        fields: [
          'body.exact'
        ],
        query: 'ski'
      }
    }
  }
)
puts response

GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body.exact" ],
      "query": "ski"
    }
  }
}

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 0.8025915,
    "hits": [
      {
        "_index": "index",
        "_id": "1",
        "_score": 0.8025915,
        "_source": {
          "body": "Ski resort"
        }
      }
    ]
  }
}

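This is not necessarily easy to expose to end users: the application needs a way to work out whether they want an exact match or a stemmed one, and then direct the query to the appropriate field. And what if only some parts of the query need to match exactly while other parts should still take stemming into account?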
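To make the difficulty concrete, here is a minimal sketch of application-side routing. The helper name build_search_body and the exact: flag are hypothetical and not part of any Elasticsearch client; the sketch only builds the hash that would be passed as body to client.search. Note that it switches the whole query between fields and cannot treat individual terms differently.

```ruby
# Hypothetical helper (not an Elasticsearch API): pick the target field
# from an application-level flag, e.g. an "exact match" checkbox in the UI.
def build_search_body(text, exact: false)
  field = exact ? 'body.exact' : 'body'
  {
    query: {
      simple_query_string: {
        fields: [field],
        query: text
      }
    }
  }
end

# The same user input, routed to the stemmed field or to the exact field.
puts build_search_body('ski')
puts build_search_body('ski', exact: true)
```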

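Fortunately, the query_string and simple_query_string queries have a feature that solves exactly this problem: quote_field_suffix. It tells Elasticsearch that words appearing between quotes are to be redirected to a different field, as shown below: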

response = client.search(
  index: 'index',
  body: {
    query: {
      simple_query_string: {
        fields: [
          'body'
        ],
        quote_field_suffix: '.exact',
        query: '"ski"'
      }
    }
  }
)
puts response

GET index/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "body" ],
      "quote_field_suffix": ".exact",
      "query": "\"ski\""
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 0.8025915,
    "hits": [
      {
        "_index": "index",
        "_id": "1",
        "_score": 0.8025915,
        "_source": {
          "body": "Ski resort"
        }
      }
    ]
  }
}

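In the example above, ski is between quotes, so the quote_field_suffix parameter causes it to be searched on the body.exact field, and only document 1 matches. This lets users mix exact search with stemmed search as they like.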
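As an illustration of mixing the two in a single query, the sketch below builds a request body in which some terms are quoted (and so go to the suffixed field) while the rest stay unquoted (and so are searched on the stemmed body field). The helper name mixed_query_body is hypothetical, and the sketch assumes the terms contain no double quotes of their own. The resulting hash would be passed as body to client.search; no search results are claimed here.

```ruby
# Hypothetical helper: wrap the terms that must match exactly in double
# quotes and leave the remaining terms unquoted.
def mixed_query_body(stemmed_terms, exact_terms, suffix: '.exact')
  quoted = exact_terms.map { |term| %("#{term}") }
  {
    query: {
      simple_query_string: {
        fields: ['body'],
        quote_field_suffix: suffix,
        query: (quoted + stemmed_terms).join(' ')
      }
    }
  }
end

body = mixed_query_body(['resort'], ['ski'])
puts body[:query][:simple_query_string][:query] # prints "ski" resort
```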

If the field selected by quote_field_suffix does not exist, the search falls back to using the default field for the query string.
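The fallback rule can be pictured with a small model. This is an illustrative sketch only, not Elasticsearch's implementation: resolve_field and mapped_fields are hypothetical names, and the model assumes that "default field" means the un-suffixed field listed in fields.

```ruby
# Illustrative model of the routing rule described above.
# mapped_fields stands in for the field names defined in the index mapping.
def resolve_field(field, quoted:, suffix:, mapped_fields:)
  candidate = "#{field}#{suffix}"
  # Quoted terms go to the suffixed field only if that field is mapped.
  quoted && mapped_fields.include?(candidate) ? candidate : field
end

mapped = ['body', 'body.exact']
puts resolve_field('body', quoted: true,  suffix: '.exact',   mapped_fields: mapped) # body.exact
puts resolve_field('body', quoted: true,  suffix: '.missing', mapped_fields: mapped) # body
puts resolve_field('body', quoted: false, suffix: '.exact',   mapped_fields: mapped) # body
```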