More like this query

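The More Like This (MLT) query finds documents that are "like" a given set of documents. To do so, MLT selects a set of representative terms from these input documents, forms a query using these terms, executes the query, and returns the results. The user controls the input documents, how the terms are selected, and how the query is formed.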

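The simplest use case consists of asking for documents that are similar to a provided piece of text. Here, we ask for all movies that have some text similar to "Once upon a time" in their "title" and "description" fields, limiting the number of selected terms to 12.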

resp = client.search(
    query={
        "more_like_this": {
            "fields": [
                "title",
                "description"
            ],
            "like": "Once upon a time",
            "min_term_freq": 1,
            "max_query_terms": 12
        }
    },
)
print(resp)

response = client.search(
  body: {
    query: {
      more_like_this: {
        fields: [
          'title',
          'description'
        ],
        like: 'Once upon a time',
        min_term_freq: 1,
        max_query_terms: 12
      }
    }
  }
)
puts response

const response = await client.search({
  query: {
    more_like_this: {
      fields: ["title", "description"],
      like: "Once upon a time",
      min_term_freq: 1,
      max_query_terms: 12,
    },
  },
});
console.log(response);

GET /_search
{
  "query": {
    "more_like_this" : {
      "fields" : ["title", "description"],
      "like" : "Once upon a time",
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}

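A more complicated use case consists of mixing texts with documents that already exist in the index. In this case, the syntax to specify a document is similar to the one used in the Multi GET API.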

resp = client.search(
    query={
        "more_like_this": {
            "fields": [
                "title",
                "description"
            ],
            "like": [
                {
                    "_index": "imdb",
                    "_id": "1"
                },
                {
                    "_index": "imdb",
                    "_id": "2"
                },
                "and potentially some more text here as well"
            ],
            "min_term_freq": 1,
            "max_query_terms": 12
        }
    },
)
print(resp)

response = client.search(
  body: {
    query: {
      more_like_this: {
        fields: [
          'title',
          'description'
        ],
        like: [
          {
            _index: 'imdb',
            _id: '1'
          },
          {
            _index: 'imdb',
            _id: '2'
          },
          'and potentially some more text here as well'
        ],
        min_term_freq: 1,
        max_query_terms: 12
      }
    }
  }
)
puts response

const response = await client.search({
  query: {
    more_like_this: {
      fields: ["title", "description"],
      like: [
        {
          _index: "imdb",
          _id: "1",
        },
        {
          _index: "imdb",
          _id: "2",
        },
        "and potentially some more text here as well",
      ],
      min_term_freq: 1,
      max_query_terms: 12,
    },
  },
});
console.log(response);

GET /_search
{
  "query": {
    "more_like_this": {
      "fields": [ "title", "description" ],
      "like": [
        {
          "_index": "imdb",
          "_id": "1"
        },
        {
          "_index": "imdb",
          "_id": "2"
        },
        "and potentially some more text here as well"
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

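Finally, users can mix some texts, a chosen set of documents, and also documents that are not necessarily present in the index. To provide documents not present in the index, the syntax is similar to that of artificial documents.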

resp = client.search(
    query={
        "more_like_this": {
            "fields": [
                "name.first",
                "name.last"
            ],
            "like": [
                {
                    "_index": "marvel",
                    "doc": {
                        "name": {
                            "first": "Ben",
                            "last": "Grimm"
                        },
                        "_doc": "You got no idea what I'd... what I'd give to be invisible."
                    }
                },
                {
                    "_index": "marvel",
                    "_id": "2"
                }
            ],
            "min_term_freq": 1,
            "max_query_terms": 12
        }
    },
)
print(resp)

response = client.search(
  body: {
    query: {
      more_like_this: {
        fields: [
          'name.first',
          'name.last'
        ],
        like: [
          {
            _index: 'marvel',
            doc: {
              name: {
                first: 'Ben',
                last: 'Grimm'
              },
              _doc: "You got no idea what I'd... what I'd give to be invisible."
            }
          },
          {
            _index: 'marvel',
            _id: '2'
          }
        ],
        min_term_freq: 1,
        max_query_terms: 12
      }
    }
  }
)
puts response

const response = await client.search({
  query: {
    more_like_this: {
      fields: ["name.first", "name.last"],
      like: [
        {
          _index: "marvel",
          doc: {
            name: {
              first: "Ben",
              last: "Grimm",
            },
            _doc: "You got no idea what I'd... what I'd give to be invisible.",
          },
        },
        {
          _index: "marvel",
          _id: "2",
        },
      ],
      min_term_freq: 1,
      max_query_terms: 12,
    },
  },
});
console.log(response);

GET /_search
{
  "query": {
    "more_like_this": {
      "fields": [ "name.first", "name.last" ],
      "like": [
        {
          "_index": "marvel",
          "doc": {
            "name": {
              "first": "Ben",
              "last": "Grimm"
            },
            "_doc": "You got no idea what I'd... what I'd give to be invisible."
          }
        },
        {
          "_index": "marvel",
          "_id": "2"
        }
      ],
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

How it works

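Suppose we want to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. According to the Lucene scoring formula, this is mostly due to the terms with the highest tf-idf. The terms of the input document with the highest tf-idf are therefore good representatives of that document and can be used in a disjunctive (OR) query to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it (usually with the same analyzer as the field), and then selects the top K terms with the highest tf-idf to form a disjunctive query of these terms.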
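
The selection step can be illustrated with a small, self-contained sketch. It is not Lucene's implementation: the whitespace tokenizer, the smoothed idf formula, and the names `top_k_terms` and `corpus` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def top_k_terms(input_text, corpus, k=3):
    """Rank the terms of input_text by a simple tf-idf and keep the top k."""
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)

    term_freqs = Counter(tokenize(input_text))
    scores = {}
    for term, tf in term_freqs.items():
        df = sum(1 for terms in doc_sets if term in terms)
        # Smoothed idf; the exact formula is an assumption of this sketch.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical mini corpus standing in for the indexed documents.
corpus = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the lazy dog",
    "a quick guide to dragons",
]

terms = top_k_terms("the dragon rider tames the dragon", corpus, k=2)

# The selected terms become the clauses of a disjunctive (OR) query.
query = {"bool": {"should": [{"term": {"description": t}} for t in terms]}}
```

In this toy corpus, "the" appears in most documents, so it ranks below the rarer terms even though it occurs twice in the input text.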

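The fields on which to perform MLT must be indexed and of type text or keyword. Additionally, when using like with documents, either _source must be enabled, or the fields must be stored or store term_vector. To speed up analysis, it can help to store term vectors at index time.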

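For example, if we wish to perform MLT on the "title" and "tags.raw" fields, we can explicitly store their term_vector at index time. We can still perform MLT on the "description" and "tags" fields, as _source is enabled by default, but there will be no speed-up in analysis for these fields.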

resp = client.indices.create(
    index="imdb",
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "yes"
            },
            "description": {
                "type": "text"
            },
            "tags": {
                "type": "text",
                "fields": {
                    "raw": {
                        "type": "text",
                        "analyzer": "keyword",
                        "term_vector": "yes"
                    }
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'imdb',
  body: {
    mappings: {
      properties: {
        title: {
          type: 'text',
          term_vector: 'yes'
        },
        description: {
          type: 'text'
        },
        tags: {
          type: 'text',
          fields: {
            raw: {
              type: 'text',
              analyzer: 'keyword',
              term_vector: 'yes'
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "imdb",
  mappings: {
    properties: {
      title: {
        type: "text",
        term_vector: "yes",
      },
      description: {
        type: "text",
      },
      tags: {
        type: "text",
        fields: {
          raw: {
            type: "text",
            analyzer: "keyword",
            term_vector: "yes",
          },
        },
      },
    },
  },
});
console.log(response);

PUT /imdb
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "term_vector": "yes"
      },
      "description": {
        "type": "text"
      },
      "tags": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "text",
            "analyzer": "keyword",
            "term_vector": "yes"
          }
        }
      }
    }
  }
}

Parameters

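The only required parameter is `like`; all other parameters have sensible defaults. There are three types of parameters: one to specify the document input, another for term selection, and a third for query formation.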

Document input parameters

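`like`	The only required parameter of the MLT query is `like`. It follows a versatile syntax in which the user can specify free-form text and/or one or several documents (see the examples above). The syntax to specify documents is similar to the one used by the Multi GET API. When specifying documents, the text is fetched from `fields` unless overridden in each document request. The text is analyzed by the analyzer at the field, but this can also be overridden. The syntax to override the analyzer at the field is similar to the `per_field_analyzer` parameter of the Term Vectors API. Additionally, artificial documents are supported in order to provide documents that are not necessarily present in the index.
`unlike`	The `unlike` parameter is used in conjunction with `like` so that terms found in a chosen set of documents are not selected. In other words, we could ask for documents `like: "Apple"`, but `unlike: "cake crumble tree"`. The syntax is the same as for `like`.
`fields`	A list of fields to fetch and analyze the text from. Defaults to the `index.query.default_field` index setting, which has a default value of `*`. The `*` value matches all fields eligible for term-level queries, excluding metadata fields.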
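
As a rough illustration of how the three input parameters combine, the sketch below assembles a query body in Python. The helper name `mlt_query` is hypothetical, and the `imdb` index and document ID are borrowed from the earlier examples; only the `like`, `unlike`, and `fields` keys come from the parameter descriptions above.

```python
def mlt_query(like, unlike=None, fields=None):
    """Build a more_like_this query body (hypothetical helper)."""
    body = {"like": like}
    # Optional parameters are only sent when the caller sets them,
    # so the server-side defaults apply otherwise.
    if unlike is not None:
        body["unlike"] = unlike
    if fields is not None:
        body["fields"] = fields
    return {"more_like_this": body}


# Free text and an indexed document mixed in `like`, free text in `unlike`.
query = mlt_query(
    like=["Apple", {"_index": "imdb", "_id": "1"}],
    unlike="cake crumble tree",
    fields=["title", "description"],
)
```

The resulting dictionary could then be passed as the `query` argument of `client.search`, in the same way as the Python examples earlier on this page.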

Term selection parameters

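`max_query_terms`	The maximum number of query terms that will be selected. Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to `25`.
`min_term_freq`	The minimum term frequency below which terms from the input document are ignored. Defaults to `2`.
`min_doc_freq`	The minimum document frequency below which terms from the input document are ignored. Defaults to `5`.
`max_doc_freq`	The maximum document frequency above which terms from the input document are ignored. This can be useful for ignoring highly frequent words such as stop words. Defaults to unbounded (`Integer.MAX_VALUE`, which is `2^31-1` or 2147483647).
`min_word_length`	The minimum word length below which terms are ignored. Defaults to `0`.
`max_word_length`	The maximum word length above which terms are ignored. Defaults to unbounded (`0`).
`stop_words`	An array of stop words. Any word in this set is considered "uninteresting" and ignored. If the analyzer allows for stop words, you might want to tell MLT to explicitly ignore them, since for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
`analyzer`	The analyzer used to analyze the free-form text. Defaults to the analyzer associated with the first field in `fields`.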
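
The filters above can be read as a sequence of checks applied to each candidate term. The following sketch mirrors the documented defaults in plain Python; it is an illustration of the described behavior rather than the actual implementation, and `select_terms` with its dictionary inputs is a hypothetical interface.

```python
def select_terms(term_freqs, doc_freqs, *, min_term_freq=2, min_doc_freq=5,
                 max_doc_freq=2**31 - 1, min_word_length=0, max_word_length=0,
                 stop_words=()):
    """Return the candidate terms that pass every term selection filter.

    term_freqs maps a term to its frequency in the input document;
    doc_freqs maps a term to the number of indexed documents containing it.
    """
    stop = set(stop_words)
    kept = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if tf < min_term_freq:
            continue  # too rare in the input document
        if df < min_doc_freq or df > max_doc_freq:
            continue  # too rare or too common across the index
        if len(term) < min_word_length:
            continue  # too short
        if max_word_length > 0 and len(term) > max_word_length:
            continue  # too long (0 means unbounded)
        if term in stop:
            continue  # explicitly listed as uninteresting
        kept.append(term)
    return sorted(kept)


# Made-up statistics for illustration.
term_freqs = {"dragon": 3, "the": 7, "rider": 1, "castle": 2}
doc_freqs = {"dragon": 40, "the": 9000, "rider": 12, "castle": 3}

strict = select_terms(term_freqs, doc_freqs, max_doc_freq=1000)
relaxed = select_terms(term_freqs, doc_freqs, min_term_freq=1,
                       min_doc_freq=1, stop_words=["the"])
```

With the default thresholds and `max_doc_freq=1000`, only "dragon" survives. Relaxing `min_term_freq` and `min_doc_freq` to 1, as the query examples on this page do for `min_term_freq`, lets short or small inputs contribute more terms.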

Query formation parameters

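`minimum_should_match`	After the disjunctive query has been formed, this parameter controls the number of terms that must match. The syntax is the same as for the minimum should match parameter. Defaults to `"30%"`.
`fail_on_unsupported_field`	Controls whether the query should fail (throw an exception) if any of the specified fields are not of a supported type (`text` or `keyword`). Set this to `false` to ignore the field and continue processing. Defaults to `true`.
`boost_terms`	Each term in the formed query can be further boosted by its tf-idf score. This sets the boost factor to use when this feature is enabled. Defaults to deactivated (`0`). Any other positive value activates term boosting with the given boost factor.
`include`	Specifies whether the input documents should also be included in the search results returned. Defaults to `false`.
`boost`	Sets the boost value of the whole query. Defaults to `1.0`.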
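
To get a feel for what the `"30%"` default means in practice, the sketch below computes how many of the selected terms would have to match. It assumes that a positive percentage is rounded down, as described for the minimum should match parameter; the function name `required_matches` is hypothetical.

```python
def required_matches(num_terms, percent):
    """Number of terms that must match for a positive percentage.

    Assumes the computed value is rounded down to a whole number.
    """
    return (num_terms * percent) // 100


# 12 selected terms, as in the examples on this page.
with_12_terms = required_matches(12, 30)

# 25 selected terms, the default for max_query_terms.
with_25_terms = required_matches(25, 30)
```

Under this assumption, 3 of 12 terms or 7 of 25 terms would need to match for a document to be returned.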

Alternative

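To take more control over the construction of a query for similar documents, it is worth considering writing custom client code to assemble selected terms from an example document into a Boolean query with the desired settings. The logic in more_like_this that selects "interesting" words from a piece of text is also accessible via the Term Vectors API. For example, using the Term Vectors API it would be possible to present users with a selection of topical keywords found in a document's text, allowing them to select words of interest to drill down on, rather than using the more "black-box" approach of matching used by more_like_this.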
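
A minimal sketch of such client code is shown below. It assumes the terms have already been chosen, for example by a user picking keywords from a term vectors response; the helper `bool_query_from_terms` and its parameters are hypothetical and are not part of any client library.

```python
def bool_query_from_terms(terms, field, minimum_should_match="30%", boosts=None):
    """Combine hand-picked terms into a Boolean query (hypothetical helper).

    boosts optionally maps a term to a per-term boost; terms without an
    entry use a neutral boost of 1.0.
    """
    boosts = boosts or {}
    should = [
        {"term": {field: {"value": term, "boost": boosts.get(term, 1.0)}}}
        for term in terms
    ]
    return {
        "bool": {
            "should": should,
            "minimum_should_match": minimum_should_match,
        }
    }


# Terms a user might have picked from a list of suggested keywords.
query = bool_query_from_terms(
    ["dragon", "rider"],
    field="description",
    boosts={"dragon": 2.0},
)
```

Because the query is built on the client, every setting (which terms, which boosts, how many must match) is visible and adjustable, at the cost of doing the term selection yourself.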
