Extended stats aggregation

A multi-value metrics aggregation that computes statistics over numeric values extracted from the aggregated documents.

The extended_stats aggregation is an extended version of the stats aggregation. It adds metrics such as sum_of_squares, variance, std_deviation, and std_deviation_bounds.

Assume the data consists of documents representing students' exam grades (between 0 and 100):

Python:
resp = client.search(
    index="exams",
    size=0,
    aggs={
        "grades_stats": {
            "extended_stats": {
                "field": "grade"
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'exams',
  body: {
    size: 0,
    aggregations: {
      grades_stats: {
        extended_stats: {
          field: 'grade'
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "exams",
  size: 0,
  aggs: {
    grades_stats: {
      extended_stats: {
        field: "grade",
      },
    },
  },
});
console.log(response);

Console:
GET /exams/_search
{
  "size": 0,
  "aggs": {
    "grades_stats": { "extended_stats": { "field": "grade" } }
  }
}

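The aggregation above computes grade statistics over all documents. The aggregation type is extended_stats, and the field setting defines the numeric field of the documents on which the statistics are computed. The request returns the following: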

std_deviation and variance are calculated as population metrics, so they are always the same as std_deviation_population and variance_population respectively.

{
  ...

  "aggregations": {
    "grades_stats": {
      "count": 2,
      "min": 50.0,
      "max": 100.0,
      "avg": 75.0,
      "sum": 150.0,
      "sum_of_squares": 12500.0,
      "variance": 625.0,
      "variance_population": 625.0,
      "variance_sampling": 1250.0,
      "std_deviation": 25.0,
      "std_deviation_population": 25.0,
      "std_deviation_sampling": 35.35533905932738,
      "std_deviation_bounds": {
        "upper": 125.0,
        "lower": 25.0,
        "upper_population": 125.0,
        "lower_population": 25.0,
        "upper_sampling": 145.71067811865476,
        "lower_sampling": 4.289321881345245
      }
    }
  }
}
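
The figures in this response can be reproduced by hand from the two grades, 50 and 100. The following is a minimal Python sketch of that arithmetic, not the Elasticsearch implementation; the function name extended_stats and the dictionary layout are chosen here for illustration, and the sketch assumes at least two input values so that the sampling variance is defined.

```python
import math


def extended_stats(values, sigma=2.0):
    """Recompute the fields shown in the response above (illustrative only)."""
    n = len(values)
    total = sum(values)
    avg = total / n
    # Sum of squared deviations from the mean.
    sq_dev = sum((v - avg) ** 2 for v in values)
    var_pop = sq_dev / n          # population variance: divide by n
    var_samp = sq_dev / (n - 1)   # sampling variance: divide by n - 1
    std_pop = math.sqrt(var_pop)
    std_samp = math.sqrt(var_samp)
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum(v * v for v in values),
        "variance": var_pop,              # same as variance_population
        "variance_population": var_pop,
        "variance_sampling": var_samp,
        "std_deviation": std_pop,         # same as std_deviation_population
        "std_deviation_population": std_pop,
        "std_deviation_sampling": std_samp,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_pop,
            "lower": avg - sigma * std_pop,
            "upper_sampling": avg + sigma * std_samp,
            "lower_sampling": avg - sigma * std_samp,
        },
    }


stats = extended_stats([50.0, 100.0])
print(stats["variance_population"], stats["variance_sampling"])
```

With only two documents the population and sampling figures differ by a factor of two in the variance, which is why the response lists 625.0 next to 1250.0.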

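The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.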
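
As a small illustration of that lookup, the sketch below reads the result out of a response body by its aggregation name. The dictionary is a hand-written, trimmed copy of the example response rather than a live result, and it assumes the response can be indexed like a nested dictionary.

```python
# Hand-written stand-in for a search response, trimmed to a few fields.
resp = {
    "aggregations": {
        "grades_stats": {
            "count": 2,
            "avg": 75.0,
            "std_deviation_bounds": {"upper": 125.0, "lower": 25.0},
        }
    }
}

# The aggregation name used in the request is the key in the response.
grades_stats = resp["aggregations"]["grades_stats"]
print(grades_stats["avg"])
print(grades_stats["std_deviation_bounds"]["upper"])
```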

Standard deviation bounds

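By default, the extended_stats metric returns an object called std_deviation_bounds, which provides an interval of plus or minus two standard deviations from the mean. This can be a useful way to visualize the variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request: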

Python:
resp = client.search(
    index="exams",
    size=0,
    aggs={
        "grades_stats": {
            "extended_stats": {
                "field": "grade",
                "sigma": 3
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'exams',
  body: {
    size: 0,
    aggregations: {
      grades_stats: {
        extended_stats: {
          field: 'grade',
          sigma: 3
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "exams",
  size: 0,
  aggs: {
    grades_stats: {
      extended_stats: {
        field: "grade",
        sigma: 3,
      },
    },
  },
});
console.log(response);

Console:
GET /exams/_search
{
  "size": 0,
  "aggs": {
    "grades_stats": {
      "extended_stats": {
        "field": "grade",
        "sigma": 3
      }
    }
  }
}

sigma controls how many standard deviations above and below the mean are displayed.

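sigma can be any non-negative double, meaning you can request non-integer values such as 1.5. A value of 0 is valid, but it simply returns the average for both the upper and lower bounds.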

The upper and lower bounds are calculated as population metrics, so they are always the same as upper_population and lower_population respectively.
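
The effect of sigma on the bounds can be sketched in a few lines of Python. This is only the arithmetic described above (mean plus or minus sigma population standard deviations); the helper name std_deviation_bounds is chosen for illustration, and the inputs are the avg and std_deviation values from the earlier example response.

```python
def std_deviation_bounds(avg, std_deviation, sigma=2.0):
    """Return the interval avg +/- sigma * std_deviation (illustrative only)."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return {
        "upper": avg + sigma * std_deviation,
        "lower": avg - sigma * std_deviation,
    }


# avg = 75.0 and std_deviation = 25.0 come from the example response.
for sigma in (2, 3, 1.5, 0):
    print(sigma, std_deviation_bounds(75.0, 25.0, sigma))
```

With sigma set to 0 both bounds collapse to the average, matching the behavior described above.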

Standard deviation and bounds require normality

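The standard deviation and its bounds are displayed by default, but they are not applicable to every data set. Your data must be normally distributed for these metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is heavily skewed left or right, the values returned will be misleading.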
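
To make the caveat concrete, the sketch below computes two-sigma bounds for a small, heavily right-skewed data set. The values are made up for illustration. Every value is positive, yet the lower bound comes out negative, and the upper bound still falls short of the largest value, so the interval describes the data poorly.

```python
import statistics

# Hypothetical right-skewed sample: mostly small values plus one large outlier.
values = [1, 1, 1, 2, 2, 3, 40]

avg = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation

lower = avg - 2 * std
upper = avg + 2 * std

print(f"min={min(values)} max={max(values)}")
print(f"avg={avg:.2f} std={std:.2f} lower={lower:.2f} upper={upper:.2f}")
```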

Script

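If you need to aggregate on a value that isn't indexed, use a runtime field. Say we found out that the grades we have been working with are for an exam that was above the level of the students, and we want to "correct" them: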

Python:
resp = client.search(
    index="exams",
    size=0,
    runtime_mappings={
        "grade.corrected": {
            "type": "double",
            "script": {
                "source": "emit(Math.min(100, doc['grade'].value * params.correction))",
                "params": {
                    "correction": 1.2
                }
            }
        }
    },
    aggs={
        "grades_stats": {
            "extended_stats": {
                "field": "grade.corrected"
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'exams',
  body: {
    size: 0,
    runtime_mappings: {
      'grade.corrected' => {
        type: 'double',
        script: {
          source: "emit(Math.min(100, doc['grade'].value * params.correction))",
          params: {
            correction: 1.2
          }
        }
      }
    },
    aggregations: {
      grades_stats: {
        extended_stats: {
          field: 'grade.corrected'
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "exams",
  size: 0,
  runtime_mappings: {
    "grade.corrected": {
      type: "double",
      script: {
        source: "emit(Math.min(100, doc['grade'].value * params.correction))",
        params: {
          correction: 1.2,
        },
      },
    },
  },
  aggs: {
    grades_stats: {
      extended_stats: {
        field: "grade.corrected",
      },
    },
  },
});
console.log(response);

Console:
GET /exams/_search
{
  "size": 0,
  "runtime_mappings": {
    "grade.corrected": {
      "type": "double",
      "script": {
        "source": "emit(Math.min(100, doc['grade'].value * params.correction))",
        "params": {
          "correction": 1.2
        }
      }
    }
  },
  "aggs": {
    "grades_stats": {
      "extended_stats": { "field": "grade.corrected" }
    }
  }
}
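
To see what the runtime field script does to each grade, here is a plain Python equivalent of the expression Math.min(100, doc['grade'].value * params.correction). It is a sketch of the arithmetic only and does not run inside Elasticsearch; the function name corrected is chosen for illustration, and the two grades are the ones from the earlier example.

```python
def corrected(grade, correction=1.2):
    """Scale a grade by the correction factor, capped at 100."""
    return min(100.0, grade * correction)


grades = [50.0, 100.0]
corrected_grades = [corrected(g) for g in grades]
print(corrected_grades)

# The aggregation then runs over the corrected values instead of the raw ones.
print(sum(corrected_grades) / len(corrected_grades))
```

The cap matters for the second document: 100 multiplied by 1.2 would exceed the valid range, so the script emits 100 instead.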

Missing value

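The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but it is also possible to treat them as if they had a value.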

Python:
resp = client.search(
    index="exams",
    size=0,
    aggs={
        "grades_stats": {
            "extended_stats": {
                "field": "grade",
                "missing": 0
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'exams',
  body: {
    size: 0,
    aggregations: {
      grades_stats: {
        extended_stats: {
          field: 'grade',
          missing: 0
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "exams",
  size: 0,
  aggs: {
    grades_stats: {
      extended_stats: {
        field: "grade",
        missing: 0,
      },
    },
  },
});
console.log(response);

Console:
GET /exams/_search
{
  "size": 0,
  "aggs": {
    "grades_stats": {
      "extended_stats": {
        "field": "grade",
        "missing": 0
      }
    }
  }
}

Documents without a value in the grade field fall into the same bucket as documents that have the value 0.
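
The difference between the default behavior and missing: 0 can be sketched as follows. The three documents are hypothetical, and the helper collect_grades is an illustration of the rule described above, not part of any client library.

```python
# Hypothetical documents: the third one has no grade field.
docs = [{"grade": 50.0}, {"grade": 100.0}, {"student": "no grade recorded"}]


def collect_grades(docs, missing=None):
    """Gather grade values; substitute `missing` for absent ones if given."""
    values = []
    for doc in docs:
        if "grade" in doc:
            values.append(doc["grade"])
        elif missing is not None:
            values.append(float(missing))
        # Otherwise the document is ignored, which is the default behavior.
    return values


ignored = collect_grades(docs)
as_zero = collect_grades(docs, missing=0)

print(len(ignored), sum(ignored) / len(ignored))
print(len(as_zero), sum(as_zero) / len(as_zero))
```

Treating the missing grade as 0 raises the count from 2 to 3 and pulls the average down, so the choice of missing value changes every statistic in the response.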