Median absolute deviation aggregation

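This single-value aggregation approximates the median absolute deviation of its search results.

Median absolute deviation is a measure of variability. It is a robust statistic, meaning that it is useful for describing data that may have outliers or may not be normally distributed. For such data, it can be more descriptive than standard deviation.

It is calculated as the median of each data point's deviation from the median of the entire sample. That is, for a random variable X, the median absolute deviation is median(|median(X) - Xi|).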
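
The definition above can be sketched in a few lines of Python. This is a minimal, exact computation over an in-memory list, not the approximation the aggregation itself performs. The function name median_absolute_deviation and the sample values are assumptions made for illustration only.

```python
from statistics import median, stdev


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the sample median."""
    center = median(values)
    return median(abs(center - x) for x in values)


# Hypothetical sample with a single extreme outlier (1000).
sample = [2, 3, 3, 4, 5, 1000]

mad = median_absolute_deviation(sample)
spread = stdev(sample)
print(mad, spread)
```

For this sample the median absolute deviation is 1.0, while the standard deviation is dominated by the single outlier and comes out in the hundreds. This is the sense in which the statistic is robust.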

Example

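Assume our data represents product reviews on a one to five star scale. Such reviews are usually summarized as a mean, which is easy to understand but does not describe the reviews' variability. Estimating the median absolute deviation can provide insight into how much the reviews vary from one another.

In this example we have a product with an average rating of 3 stars. Let's look at the median absolute deviation of its ratings to determine how much they vary.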

resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_average": {
            "avg": {
                "field": "rating"
            }
        },
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating"
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_average: {
        avg: {
          field: 'rating'
        }
      },
      review_variability: {
        median_absolute_deviation: {
          field: 'rating'
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_average: {
      avg: {
        field: "rating",
      },
    },
    review_variability: {
      median_absolute_deviation: {
        field: "rating",
      },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating" 
      }
    }
  }
}

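rating must be a numeric field.

The resulting median absolute deviation of 2 tells us that there is a fair amount of variability in the ratings. Reviewers must have diverse opinions about this product.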

{
  ...
  "aggregations": {
    "review_average": {
      "value": 3.0
    },
    "review_variability": {
      "value": 2.0
    }
  }
}

Approximation

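A naive implementation of median absolute deviation stores the entire sample in memory, so this aggregation instead calculates an approximation. It uses the TDigest data structure to approximate the sample median and the median of deviations from the sample median. For more about the approximation characteristics of TDigests, see Percentiles are (usually) approximate.

The tradeoff between resource usage and accuracy of a TDigest's quantile approximation, and therefore the accuracy of this aggregation's approximation of median absolute deviation, is controlled by the compression parameter. A higher compression setting provides a more accurate approximation at the cost of higher memory usage. For more about the characteristics of the TDigest compression parameter, see Compression.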

resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating",
                "compression": 100
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_variability: {
        median_absolute_deviation: {
          field: 'rating',
          compression: 100
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_variability: {
      median_absolute_deviation: {
        field: "rating",
        compression: 100,
      },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "compression": 100
      }
    }
  }
}

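The default compression value for this aggregation is 1000. At this compression level, the aggregation is usually within 5% of the exact result, but observed performance will depend on the sample data.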

Script

In the example above, product reviews are on a scale of one to five. If you want to modify them to a scale of one to ten, use a runtime field:

resp = client.search(
    index="reviews",
    filter_path="aggregations",
    size=0,
    runtime_mappings={
        "rating.out_of_ten": {
            "type": "long",
            "script": {
                "source": "emit(doc['rating'].value * params.scaleFactor)",
                "params": {
                    "scaleFactor": 2
                }
            }
        }
    },
    aggs={
        "review_average": {
            "avg": {
                "field": "rating.out_of_ten"
            }
        },
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating.out_of_ten"
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  filter_path: 'aggregations',
  body: {
    size: 0,
    runtime_mappings: {
      'rating.out_of_ten' => {
        type: 'long',
        script: {
          source: "emit(doc['rating'].value * params.scaleFactor)",
          params: {
            "scaleFactor": 2
          }
        }
      }
    },
    aggregations: {
      review_average: {
        avg: {
          field: 'rating.out_of_ten'
        }
      },
      review_variability: {
        median_absolute_deviation: {
          field: 'rating.out_of_ten'
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  filter_path: "aggregations",
  size: 0,
  runtime_mappings: {
    "rating.out_of_ten": {
      type: "long",
      script: {
        source: "emit(doc['rating'].value * params.scaleFactor)",
        params: {
          scaleFactor: 2,
        },
      },
    },
  },
  aggs: {
    review_average: {
      avg: {
        field: "rating.out_of_ten",
      },
    },
    review_variability: {
      median_absolute_deviation: {
        field: "rating.out_of_ten",
      },
    },
  },
});
console.log(response);

GET reviews/_search?filter_path=aggregations
{
  "size": 0,
  "runtime_mappings": {
    "rating.out_of_ten": {
      "type": "long",
      "script": {
        "source": "emit(doc['rating'].value * params.scaleFactor)",
        "params": {
          "scaleFactor": 2
        }
      }
    }
  },
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating.out_of_ten"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating.out_of_ten"
      }
    }
  }
}

Which would result in:

{
  "aggregations": {
    "review_average": {
      "value": 6.0
    },
    "review_variability": {
      "value": 4.0
    }
  }
}
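
Both values double because multiplying every rating by a positive factor multiplies the median, and every absolute deviation from it, by the same factor. The sketch below checks this with an exact, in-memory computation. The ratings list is hypothetical, chosen only so that it reproduces an average of 3 and a median absolute deviation of 2. The helper name is likewise an assumption, not part of any client API.

```python
from statistics import mean, median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the sample median."""
    center = median(values)
    return median(abs(center - x) for x in values)


# Hypothetical ratings with an average of 3 and a MAD of 2.
ratings = [1, 1, 1, 3, 5, 5, 5]

# Same rescaling as the runtime field: rating * scaleFactor.
scale_factor = 2
out_of_ten = [r * scale_factor for r in ratings]

print(mean(ratings), median_absolute_deviation(ratings))
print(mean(out_of_ten), median_absolute_deviation(out_of_ten))
```

On the one to ten scale the average becomes 6 and the median absolute deviation becomes 4, in line with the response shown above.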

Missing value

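The missing parameter defines how documents that are missing a value should be treated. By default, they will be ignored, but it is also possible to treat them as if they had a value.

Let's be optimistic and assume some reviewers loved the product so much that they forgot to give it a rating. We'll assign them five stars.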

resp = client.search(
    index="reviews",
    size=0,
    aggs={
        "review_variability": {
            "median_absolute_deviation": {
                "field": "rating",
                "missing": 5
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'reviews',
  body: {
    size: 0,
    aggregations: {
      review_variability: {
        median_absolute_deviation: {
          field: 'rating',
          missing: 5
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "reviews",
  size: 0,
  aggs: {
    review_variability: {
      median_absolute_deviation: {
        field: "rating",
        missing: 5,
      },
    },
  },
});
console.log(response);

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "missing": 5
      }
    }
  }
}
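
To see how much this choice can change the result, the following sketch compares the two behaviors on a small in-memory list. It only imitates the effect of the missing parameter on exact data. The helper apply_missing, the use of None for an absent rating, and the ratings themselves are all assumptions made for illustration.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the sample median."""
    center = median(values)
    return median(abs(center - x) for x in values)


def apply_missing(values, missing=None):
    """Drop None entries by default, or replace them when `missing` is given."""
    if missing is None:
        return [v for v in values if v is not None]
    return [missing if v is None else v for v in values]


# Hypothetical ratings; None marks a review that has no rating.
ratings = [1, 2, 2, 3, None, None, None]

ignored = median_absolute_deviation(apply_missing(ratings))
as_five = median_absolute_deviation(apply_missing(ratings, missing=5))
print(ignored, as_five)
```

With the unrated reviews ignored, the median absolute deviation of this sample is 0.5. Counting them as five stars raises it to 2, so the substituted value should be chosen with care.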