Top hits aggregation

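The top_hits metric aggregator keeps track of the most relevant documents being aggregated. It is intended to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.

We do not recommend using top_hits as a top-level aggregation. If you want to group search hits, use the collapse parameter instead.

The top_hits aggregator can effectively group a result set by certain fields via a bucket aggregator. One or more bucket aggregators determine the properties by which the result set is sliced.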

Options

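  • from - The offset from the first result you want to fetch.
  • size - The maximum number of top matching hits to return per bucket. By default, the top three matching hits are returned.
  • sort - How the top matching hits should be sorted. By default, the hits are sorted by the score of the main query.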
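
As a rough illustration of how these three options fit together, the sketch below builds the body of a top_hits aggregation as a plain Python dictionary. The helper name build_top_hits is an assumption made for this example and is not part of any client library; its defaults mirror the ones described above (three hits, and no explicit sort so that the score of the main query is used).

```python
def build_top_hits(from_=0, size=3, sort=None):
    """Return a top_hits aggregation body as a plain dict.

    Hypothetical helper for illustration only; it is not a client API.
    """
    body = {"from": from_, "size": size}
    # Leaving "sort" out keeps the default ordering by query score.
    if sort is not None:
        body["sort"] = sort
    return {"top_hits": body}


# Most recent document per bucket: one hit, newest date first.
latest_sale = build_top_hits(size=1, sort=[{"date": {"order": "desc"}}])
print(latest_sale)
```

The resulting dictionary can be placed under a bucket aggregation's "aggs" key, as in the examples further down.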

Supported per hit features

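The top_hits aggregation returns regular search hits, so many per hit features can be supported.

If you only need docvalue_fields, size, and sort, then top metrics might be a more efficient choice than the top hits aggregation.

top_hits does not support the rescore parameter. Query rescoring applies only to search hits, not to aggregation results. To change the scores used by aggregations, use a function_score or script_score query.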
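
The sketch below shows one way the main query could be wrapped in a script_score query so that the adjusted score is what top_hits sees. The helper name wrap_in_script_score and the placeholder script "_score * 2" are assumptions chosen for illustration; a real script would depend on your own fields and scoring needs.

```python
def wrap_in_script_score(query, script_source):
    """Wrap a query in a script_score query (hypothetical helper)."""
    return {
        "script_score": {
            "query": query,
            "script": {"source": script_source},
        }
    }


main_query = {"match": {"body": "elections"}}
# Placeholder script: doubles the relevance score of every matching document.
scored_query = wrap_in_script_score(main_query, "_score * 2")
print(scored_query)
```

The wrapped query would then be sent as the "query" of the search request that carries the top_hits aggregation.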

Example

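In the following example we group the sales by type and show the last sale for each type. For each sale, only the date and price fields are included in the source.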

resp = client.search(
    index="sales",
    size=0,
    aggs={
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [
                                "date",
                                "price"
                            ]
                        },
                        "size": 1
                    }
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'sales',
  size: 0,
  body: {
    aggregations: {
      top_tags: {
        terms: {
          field: 'type',
          size: 3
        },
        aggregations: {
          top_sales_hits: {
            top_hits: {
              sort: [
                {
                  date: {
                    order: 'desc'
                  }
                }
              ],
              _source: {
                includes: [
                  'date',
                  'price'
                ]
              },
              size: 1
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "sales",
  size: 0,
  aggs: {
    top_tags: {
      terms: {
        field: "type",
        size: 3,
      },
      aggs: {
        top_sales_hits: {
          top_hits: {
            sort: [
              {
                date: {
                  order: "desc",
                },
              },
            ],
            _source: {
              includes: ["date", "price"],
            },
            size: 1,
          },
        },
      },
    },
  },
});
console.log(response);

POST /sales/_search?size=0
{
  "aggs": {
    "top_tags": {
      "terms": {
        "field": "type",
        "size": 3
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [ "date", "price" ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

Possible response:

{
  ...
  "aggregations": {
    "top_tags": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
          {
             "key": "hat",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 3,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_id": "AVnNBmauCQpcRyxw6ChK",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 200
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "t-shirt",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 3,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_id": "AVnNBmauCQpcRyxw6ChL",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 175
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "bag",
             "doc_count": 1,
             "top_sales_hits": {
                "hits": {
                   "total" : {
                       "value": 1,
                       "relation": "eq"
                   },
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_id": "AVnNBmatCQpcRyxw6ChH",
                         "_source": {
                            "date": "2015/01/01 00:00:00",
                            "price": 150
                         },
                         "sort": [
                            1420070400000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          }
       ]
    }
  }
}
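
To show how such a response might be consumed, the sketch below walks the buckets and collects the source of the first hit in each one. The function name latest_sale_per_type is an assumption for this example, and the sample dictionary is a trimmed copy of the response above that keeps only the keys the function reads.

```python
def latest_sale_per_type(response, terms_name="top_tags", hits_name="top_sales_hits"):
    """Map each bucket key to the _source of its first top hit (hypothetical helper)."""
    result = {}
    for bucket in response["aggregations"][terms_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # a bucket could in principle come back without hits
            result[bucket["key"]] = hits[0]["_source"]
    return result


# Trimmed copy of the response shown above.
sample_response = {
    "aggregations": {
        "top_tags": {
            "buckets": [
                {"key": "hat", "top_sales_hits": {"hits": {"hits": [
                    {"_source": {"date": "2015/03/01 00:00:00", "price": 200}}]}}},
                {"key": "t-shirt", "top_sales_hits": {"hits": {"hits": [
                    {"_source": {"date": "2015/03/01 00:00:00", "price": 175}}]}}},
                {"key": "bag", "top_sales_hits": {"hits": {"hits": [
                    {"_source": {"date": "2015/01/01 00:00:00", "price": 150}}]}}},
            ]
        }
    }
}

latest = latest_sale_per_type(sample_response)
print(latest["hat"]["price"])
```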

Field collapse example

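Field collapsing, or result grouping, is a feature that logically groups a result set and returns the top documents for each group. The ordering of the groups is determined by the relevancy of the first document in each group. In Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits aggregator as a sub-aggregator.

In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage belongs to. By defining a terms aggregator on the domain field, we group the result set of webpages by domain. The top_hits aggregator is then defined as a sub-aggregator, so that the top matching hits are collected per bucket.

A max aggregator is also defined. The terms aggregator's order feature uses it to return the buckets ordered by the relevancy of the most relevant document in each bucket.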

resp = client.search(
    index="sales",
    query={
        "match": {
            "body": "elections"
        }
    },
    aggs={
        "top_sites": {
            "terms": {
                "field": "domain",
                "order": {
                    "top_hit": "desc"
                }
            },
            "aggs": {
                "top_tags_hits": {
                    "top_hits": {}
                },
                "top_hit": {
                    "max": {
                        "script": {
                            "source": "_score"
                        }
                    }
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'sales',
  body: {
    query: {
      match: {
        body: 'elections'
      }
    },
    aggregations: {
      top_sites: {
        terms: {
          field: 'domain',
          order: {
            top_hit: 'desc'
          }
        },
        aggregations: {
          top_tags_hits: {
            top_hits: {}
          },
          top_hit: {
            max: {
              script: {
                source: '_score'
              }
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "sales",
  query: {
    match: {
      body: "elections",
    },
  },
  aggs: {
    top_sites: {
      terms: {
        field: "domain",
        order: {
          top_hit: "desc",
        },
      },
      aggs: {
        top_tags_hits: {
          top_hits: {},
        },
        top_hit: {
          max: {
            script: {
              source: "_score",
            },
          },
        },
      },
    },
  },
});
console.log(response);

POST /sales/_search
{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top_sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit" : {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        }
      }
    }
  }
}

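At the moment the max (or min) aggregator is needed to make sure the buckets from the terms aggregator are ordered according to the score of the most relevant webpage per domain. Unfortunately, the top_hits aggregator cannot yet be used in the order option of the terms aggregator.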
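
The grouping and ordering described in this section can be mimicked in a few lines of plain Python. This is only a client-side sketch of the idea, not how Elasticsearch computes it; the function name collapse_by_field and the flat hit dictionaries with "domain" and "score" keys are assumptions made for the example.

```python
from collections import defaultdict


def collapse_by_field(hits, field, top_n=3):
    """Group hits by a field, keep the best top_n per group, and order
    the groups by the score of their best hit (hypothetical helper)."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[field]].append(hit)

    collapsed = []
    for key, members in groups.items():
        # Best hit first inside each group, like top_hits sorted by score.
        members.sort(key=lambda h: h["score"], reverse=True)
        collapsed.append({
            "key": key,
            "top_hit": members[0]["score"],  # plays the role of the max aggregator
            "hits": members[:top_n],
        })

    # Order groups by their best score, like "order": {"top_hit": "desc"}.
    collapsed.sort(key=lambda g: g["top_hit"], reverse=True)
    return collapsed


pages = [
    {"domain": "a.example", "score": 0.4},
    {"domain": "b.example", "score": 0.9},
    {"domain": "a.example", "score": 0.7},
    {"domain": "b.example", "score": 0.1},
]
groups = collapse_by_field(pages, "domain", top_n=1)
print([g["key"] for g in groups])
```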

top_hits support in a nested or reverse_nested aggregator

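If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator, then nested hits are returned. Nested hits are, in a sense, hidden mini documents that are part of a regular document whose mapping configures a nested field type. The top_hits aggregator can un-hide these documents when it is wrapped in a nested or reverse_nested aggregator. Read more about nested in the nested type mapping.

If a nested type has been configured, a single document is actually indexed as multiple Lucene documents that share the same id. Determining the identity of a nested hit therefore requires more than just the id, which is why nested hits also include their nested identity. The nested identity is kept under the _nested field in the search hit and includes the array field and the offset in that array field to which the nested hit belongs. The offset is zero based.

Let's see how this works with a real sample. Consider the following mapping: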

resp = client.indices.create(
    index="sales",
    mappings={
        "properties": {
            "tags": {
                "type": "keyword"
            },
            "comments": {
                "type": "nested",
                "properties": {
                    "username": {
                        "type": "keyword"
                    },
                    "comment": {
                        "type": "text"
                    }
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'sales',
  body: {
    mappings: {
      properties: {
        tags: {
          type: 'keyword'
        },
        comments: {
          type: 'nested',
          properties: {
            username: {
              type: 'keyword'
            },
            comment: {
              type: 'text'
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "sales",
  mappings: {
    properties: {
      tags: {
        type: "keyword",
      },
      comments: {
        type: "nested",
        properties: {
          username: {
            type: "keyword",
          },
          comment: {
            type: "text",
          },
        },
      },
    },
  },
});
console.log(response);

PUT /sales
{
  "mappings": {
    "properties": {
      "tags": { "type": "keyword" },
      "comments": {                           
        "type": "nested",
        "properties": {
          "username": { "type": "keyword" },
          "comment": { "type": "text" }
        }
      }
    }
  }
}

The comments field is an array that holds nested documents under the product object.

And some documents:

resp = client.index(
    index="sales",
    id="1",
    refresh=True,
    document={
        "tags": [
            "car",
            "auto"
        ],
        "comments": [
            {
                "username": "baddriver007",
                "comment": "This car could have better brakes"
            },
            {
                "username": "dr_who",
                "comment": "Where's the autopilot? Can't find it"
            },
            {
                "username": "ilovemotorbikes",
                "comment": "This car has two extra wheels"
            }
        ]
    },
)
print(resp)

response = client.index(
  index: 'sales',
  id: 1,
  refresh: true,
  body: {
    tags: [
      'car',
      'auto'
    ],
    comments: [
      {
        username: 'baddriver007',
        comment: 'This car could have better brakes'
      },
      {
        username: 'dr_who',
        comment: "Where's the autopilot? Can't find it"
      },
      {
        username: 'ilovemotorbikes',
        comment: 'This car has two extra wheels'
      }
    ]
  }
)
puts response

const response = await client.index({
  index: "sales",
  id: 1,
  refresh: "true",
  document: {
    tags: ["car", "auto"],
    comments: [
      {
        username: "baddriver007",
        comment: "This car could have better brakes",
      },
      {
        username: "dr_who",
        comment: "Where's the autopilot? Can't find it",
      },
      {
        username: "ilovemotorbikes",
        comment: "This car has two extra wheels",
      },
    ],
  },
});
console.log(response);

PUT /sales/_doc/1?refresh
{
  "tags": [ "car", "auto" ],
  "comments": [
    { "username": "baddriver007", "comment": "This car could have better brakes" },
    { "username": "dr_who", "comment": "Where's the autopilot? Can't find it" },
    { "username": "ilovemotorbikes", "comment": "This car has two extra wheels" }
  ]
}

It is now possible to execute the following top_hits aggregation (wrapped in a nested aggregation):

resp = client.search(
    index="sales",
    query={
        "term": {
            "tags": "car"
        }
    },
    aggs={
        "by_sale": {
            "nested": {
                "path": "comments"
            },
            "aggs": {
                "by_user": {
                    "terms": {
                        "field": "comments.username",
                        "size": 1
                    },
                    "aggs": {
                        "by_nested": {
                            "top_hits": {}
                        }
                    }
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'sales',
  body: {
    query: {
      term: {
        tags: 'car'
      }
    },
    aggregations: {
      by_sale: {
        nested: {
          path: 'comments'
        },
        aggregations: {
          by_user: {
            terms: {
              field: 'comments.username',
              size: 1
            },
            aggregations: {
              by_nested: {
                top_hits: {}
              }
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "sales",
  query: {
    term: {
      tags: "car",
    },
  },
  aggs: {
    by_sale: {
      nested: {
        path: "comments",
      },
      aggs: {
        by_user: {
          terms: {
            field: "comments.username",
            size: 1,
          },
          aggs: {
            by_nested: {
              top_hits: {},
            },
          },
        },
      },
    },
  },
});
console.log(response);

POST /sales/_search
{
  "query": {
    "term": { "tags": "car" }
  },
  "aggs": {
    "by_sale": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "by_user": {
          "terms": {
            "field": "comments.username",
            "size": 1
          },
          "aggs": {
            "by_nested": {
              "top_hits": {}
            }
          }
        }
      }
    }
  }
}

Top hits response snippet with a nested hit, which resides in the first slot of the array field comments:

{
  ...
  "aggregations": {
    "by_sale": {
      "by_user": {
        "buckets": [
          {
            "key": "baddriver007",
            "doc_count": 1,
            "by_nested": {
              "hits": {
                "total" : {
                   "value": 1,
                   "relation": "eq"
                },
                "max_score": 0.3616575,
                "hits": [
                  {
                    "_index": "sales",
                    "_id": "1",
                    "_nested": {
                      "field": "comments",  
                      "offset": 0 
                    },
                    "_score": 0.3616575,
                    "_source": {
                      "comment": "This car could have better brakes", 
                      "username": "baddriver007"
                    }
                  }
                ]
              }
            }
          }
          ...
        ]
      }
    }
  }
}

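  • _nested.field - Name of the array field containing the nested hit.
  • _nested.offset - Position of the nested hit in the containing array.
  • _source - Source of the nested hit.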

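If _source is requested, then only the part of the source that belongs to the nested object is returned, not the entire source of the document. Stored fields on the nested inner object level are also accessible via a top_hits aggregator residing in a nested or reverse_nested aggregator.

Only nested hits have a _nested field in the hit; non-nested (regular) hits do not have a _nested field.

If _source is not enabled, the information in _nested can also be used to parse the original source somewhere else.

If multiple levels of nested object types are defined in the mappings, then the _nested information can also be hierarchical, in order to express the identity of nested hits that are two or more layers deep.

In the example below, a nested hit resides in the first slot of the field nested_grand_child_field, which in turn resides in the second slot of the nested_child_field field: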

...
"hits": {
 "total" : {
     "value": 2565,
     "relation": "eq"
 },
 "max_score": 1,
 "hits": [
   {
     "_index": "a",
     "_id": "1",
     "_score": 1,
     "_nested" : {
       "field" : "nested_child_field",
       "offset" : 1,
       "_nested" : {
         "field" : "nested_grand_child_field",
         "offset" : 0
       }
     },
     "_source": ...
   },
   ...
 ]
}
...
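
As a sketch of how the _nested identity could be used to locate a nested object in a copy of the original document, the function below follows the field and offset at each level. The name resolve_nested and the sample document are assumptions for this example, and the code assumes each level's field name is relative to its parent object, as in the snippet above.

```python
def resolve_nested(source, nested):
    """Follow a (possibly hierarchical) _nested identity into a source
    document and return the nested object it points at (hypothetical helper)."""
    current = source
    while nested is not None:
        # "field" names the array, "offset" is the zero-based slot in it.
        current = current[nested["field"]][nested["offset"]]
        nested = nested.get("_nested")
    return current


# Made-up document shaped like the hierarchical example above.
document = {
    "nested_child_field": [
        {"nested_grand_child_field": [{"value": "a"}]},
        {"nested_grand_child_field": [{"value": "b"}, {"value": "c"}]},
    ]
}
identity = {
    "field": "nested_child_field",
    "offset": 1,
    "_nested": {"field": "nested_grand_child_field", "offset": 0},
}

found = resolve_nested(document, identity)
print(found)
```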

Use in pipeline aggregations

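top_hits can be used in pipeline aggregations that consume a single value per bucket, such as bucket_selector, which applies per bucket filtering, similar to using a HAVING clause in SQL. This requires setting size to 1 and providing the right path for the value to be passed to the wrapping aggregator. The latter can be a _source, a _sort, or a _score value. For example: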

resp = client.search(
    index="sales",
    size=0,
    aggs={
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [
                                "date",
                                "price"
                            ]
                        },
                        "size": 1
                    }
                },
                "having.top_salary": {
                    "bucket_selector": {
                        "buckets_path": {
                            "tp": "top_sales_hits[_source.price]"
                        },
                        "script": "params.tp < 180"
                    }
                }
            }
        }
    },
)
print(resp)

const response = await client.search({
  index: "sales",
  size: 0,
  aggs: {
    top_tags: {
      terms: {
        field: "type",
        size: 3,
      },
      aggs: {
        top_sales_hits: {
          top_hits: {
            sort: [
              {
                date: {
                  order: "desc",
                },
              },
            ],
            _source: {
              includes: ["date", "price"],
            },
            size: 1,
          },
        },
        "having.top_salary": {
          bucket_selector: {
            buckets_path: {
              tp: "top_sales_hits[_source.price]",
            },
            script: "params.tp < 180",
          },
        },
      },
    },
  },
});
console.log(response);

POST /sales/_search?size=0
{
  "aggs": {
    "top_tags": {
      "terms": {
        "field": "type",
        "size": 3
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "sort": [
              {
                "date": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [ "date", "price" ]
            },
            "size": 1
          }
        },
        "having.top_salary": {
          "bucket_selector": {
            "buckets_path": {
              "tp": "top_sales_hits[_source.price]"
            },
            "script": "params.tp < 180"
          }
        }
      }
    }
  }
}

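The buckets_path uses the top_hits name top_sales_hits and a keyword for the field providing the aggregate value, namely the _source field price in the example above. Other options include top_sales_hits[_sort], for filtering on the sort value date above, and top_sales_hits[_score], for filtering on the score of the top hit.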
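
To make the effect of the selector concrete, the sketch below applies the same "price of the top hit is below 180" condition on the client side. It only emulates the filtering for illustration and is not how Elasticsearch evaluates bucket_selector; the function name having_top_price_below is an assumption, and the sample buckets reuse the keys and prices from the first example response.

```python
def having_top_price_below(buckets, hits_name, limit):
    """Return the keys of buckets whose single top hit has a price
    below the limit (hypothetical client-side emulation)."""
    kept = []
    for bucket in buckets:
        top = bucket[hits_name]["hits"]["hits"]
        # Mirrors top_sales_hits[_source.price] with size set to 1.
        if top and top[0]["_source"]["price"] < limit:
            kept.append(bucket["key"])
    return kept


buckets = [
    {"key": "hat", "top_sales_hits": {"hits": {"hits": [{"_source": {"price": 200}}]}}},
    {"key": "t-shirt", "top_sales_hits": {"hits": {"hits": [{"_source": {"price": 175}}]}}},
    {"key": "bag", "top_sales_hits": {"hits": {"hits": [{"_source": {"price": 150}}]}}},
]

selected = having_top_price_below(buckets, "top_sales_hits", 180)
print(selected)
```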