Median absolute deviation aggregation
This single-value aggregation approximates the median absolute deviation of its search results.
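Median absolute deviation is a measure of variability. It is a robust statistic, meaning that it is useful for describing data that may have outliers or may not be normally distributed. For such data it can be more descriptive than standard deviation.
It is calculated as the median of each data point's deviation from the median of the entire sample. That is, for a random variable X, the median absolute deviation is median(|median(X) - X_i|).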
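As an illustration of the definition above, here is a minimal sketch of the exact (non-approximate) calculation in plain Python. The function name `median_absolute_deviation` and the sample ratings are hypothetical, chosen only so that the mean is 3; this is not how the aggregation itself is implemented.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the sample median."""
    center = median(values)
    return median(abs(center - v) for v in values)


# Hypothetical one-to-five star ratings with a mean of 3.
ratings = [1, 1, 1, 3, 5, 5, 5]

print(sum(ratings) / len(ratings))         # mean of the ratings
print(median_absolute_deviation(ratings))  # exact MAD of the ratings
```

For this sample the median is 3 and six of the seven ratings sit two stars away from it, so the median absolute deviation is 2.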
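Example
Assume our data represents product reviews on a one to five star scale. Such reviews are usually summarized as a mean, which is easy to understand but doesn't describe the reviews' variability. Estimating the median absolute deviation can provide insight into how much the reviews vary from one another.
In this example we have a product with an average rating of 3 stars. Let's look at the median absolute deviation of its ratings to determine how much they vary.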
resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_average": {
            "avg": {"field": "rating"}
        },
        "review_variability": {
            "median_absolute_deviation": {"field": "rating"}
        },
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_average: {
        avg: { field: 'rating' }
      },
      review_variability: {
        median_absolute_deviation: { field: 'rating' }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_average: {
      avg: { field: "rating" },
    },
    review_variability: {
      median_absolute_deviation: { field: "rating" },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": { "field": "rating" }
    },
    "review_variability": {
      "median_absolute_deviation": { "field": "rating" }
    }
  }
}
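The resulting median absolute deviation of 2 tells us that there is a fair amount of variability in the ratings. Reviewers must have diverse opinions about this product.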
{
  ...
  "aggregations": {
    "review_average": {
      "value": 3.0
    },
    "review_variability": {
      "value": 2.0
    }
  }
}
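Approximation
The naive implementation of calculating median absolute deviation stores the entire sample in memory, so this aggregation instead calculates an approximation. It uses the TDigest data structure to approximate the sample median and the median of deviations from the sample median. For more about the approximation characteristics of TDigests, see "Percentiles are (usually) approximate".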
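The tradeoff between resource usage and accuracy of a TDigest's quantile approximation, and therefore the accuracy of this aggregation's approximation of median absolute deviation, is controlled by the compression parameter. A higher compression setting provides a more accurate approximation at the cost of higher memory usage. For more about the characteristics of the TDigest compression parameter, see "Compression".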
resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating",
                "compression": 100,
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_variability: {
        median_absolute_deviation: {
          field: 'rating',
          compression: 100
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_variability: {
      median_absolute_deviation: {
        field: "rating",
        compression: 100,
      },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "compression": 100
      }
    }
  }
}
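The default compression value for this aggregation is 1000. At this compression level this aggregation is usually within 5% of the exact result, but observed performance will depend on the sample data.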
Script
In the example above, product reviews are on a scale of one to five. If you want to modify them to a scale of one to ten, use a runtime field.
resp = client.search(
    index="reviews",
    filter_path="aggregations",
    size=0,
    runtime_mappings={
        "rating.out_of_ten": {
            "type": "long",
            "script": {
                "source": "emit(doc['rating'].value * params.scaleFactor)",
                "params": {"scaleFactor": 2},
            },
        }
    },
    aggs={
        "review_average": {
            "avg": {"field": "rating.out_of_ten"}
        },
        "review_variability": {
            "median_absolute_deviation": {"field": "rating.out_of_ten"}
        },
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  filter_path: 'aggregations',
  body: {
    size: 0,
    runtime_mappings: {
      'rating.out_of_ten' => {
        type: 'long',
        script: {
          source: "emit(doc['rating'].value * params.scaleFactor)",
          params: { "scaleFactor": 2 }
        }
      }
    },
    aggregations: {
      review_average: {
        avg: { field: 'rating.out_of_ten' }
      },
      review_variability: {
        median_absolute_deviation: { field: 'rating.out_of_ten' }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  filter_path: "aggregations",
  size: 0,
  runtime_mappings: {
    "rating.out_of_ten": {
      type: "long",
      script: {
        source: "emit(doc['rating'].value * params.scaleFactor)",
        params: { scaleFactor: 2 },
      },
    },
  },
  aggs: {
    review_average: {
      avg: { field: "rating.out_of_ten" },
    },
    review_variability: {
      median_absolute_deviation: { field: "rating.out_of_ten" },
    },
  },
});
console.log(response);

GET reviews/_search?filter_path=aggregations
{
  "size": 0,
  "runtime_mappings": {
    "rating.out_of_ten": {
      "type": "long",
      "script": {
        "source": "emit(doc['rating'].value * params.scaleFactor)",
        "params": { "scaleFactor": 2 }
      }
    }
  },
  "aggs": {
    "review_average": {
      "avg": { "field": "rating.out_of_ten" }
    },
    "review_variability": {
      "median_absolute_deviation": { "field": "rating.out_of_ten" }
    }
  }
}
Which will result in:
{
  "aggregations": {
    "review_average": {
      "value": 6.0
    },
    "review_variability": {
      "value": 4.0
    }
  }
}
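Both the mean and the median absolute deviation double because each of them scales linearly when every value is multiplied by a positive constant. The following is a rough sketch of that arithmetic in plain Python, computed exactly rather than with a TDigest. The ratings list and the helper `mad` are hypothetical, and `scale_factor` mirrors the `scaleFactor` script parameter.

```python
from statistics import median


def mad(values):
    """Exact median absolute deviation of a list of numbers."""
    center = median(values)
    return median(abs(center - v) for v in values)


# Hypothetical one-to-five star ratings (mean 3, MAD 2).
ratings = [1, 1, 1, 3, 5, 5, 5]

# Same transformation as the runtime field: rating * scaleFactor.
scale_factor = 2
out_of_ten = [r * scale_factor for r in ratings]

average = sum(out_of_ten) / len(out_of_ten)
variability = mad(out_of_ten)
print(average, variability)
```

For this sample the rescaled mean is 6 and the rescaled median absolute deviation is 4, each exactly twice its one-to-five value.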
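Missing value
The missing parameter defines how documents that are missing a value should be treated. By default, they will be ignored, but it is also possible to treat them as if they had a value.
Let's be optimistic and assume some reviewers loved the product so much that they forgot to give it a rating. We'll assign them five stars.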
resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating",
                "missing": 5,
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_variability: {
        median_absolute_deviation: {
          field: 'rating',
          missing: 5
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_variability: {
      median_absolute_deviation: {
        field: "rating",
        missing: 5,
      },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "missing": 5
      }
    }
  }
}
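To see how substituting a value for missing ratings can shift the statistic, here is a small sketch in plain Python, computed exactly on hypothetical data. `None` stands in for a document without a `rating` field, and the helper names `mad` and `with_missing` are illustrative only, not part of any client library.

```python
from statistics import median


def mad(values):
    """Exact median absolute deviation of a list of numbers."""
    center = median(values)
    return median(abs(center - v) for v in values)


def with_missing(values, missing=None):
    """Drop None entries (the default behavior) or replace them with `missing`."""
    if missing is None:
        return [v for v in values if v is not None]
    return [missing if v is None else v for v in values]


# Hypothetical ratings; None marks a review with no rating.
ratings = [1, 1, 1, 3, 5, 5, 5, None]

ignored = mad(with_missing(ratings))            # missing documents ignored
filled = mad(with_missing(ratings, missing=5))  # missing treated as 5 stars
print(ignored, filled)
```

With this sample, ignoring the unrated review gives a deviation of 2, while counting it as five stars moves the median to 4 and lowers the deviation to 1. The size and direction of the change depend entirely on the data.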