Aggregating data for faster performance

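When you aggregate data, Elasticsearch automatically distributes the calculations across your cluster. You can then feed this aggregated data into the machine learning features instead of raw results, which reduces the volume of data that must be analyzed.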

Requirements

There are a number of requirements for using aggregations in datafeeds.

Aggregations

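Your aggregation must include a date_histogram aggregation or a top-level composite aggregation, which in turn must contain a max aggregation on the time field. This ensures that the aggregated data is a time series and that the timestamp of each bucket is the time of the last record in the bucket.
The time_zone parameter in the date histogram aggregation must be set to UTC, which is the default value.
The name of the aggregation and the name of the field that it operates on need to match. For example, if you use a max aggregation on a time field called responsetime, the name of the aggregation must also be responsetime.
For composite aggregation support, there must be exactly one date_histogram value source. That value source must not be sorted in descending order. Additional composite aggregation value sources are allowed, such as terms.
The size parameter of the non-composite aggregations must match the cardinality of your data. A greater value of the size parameter increases the memory requirement of the aggregation.
If you set the summary_count_field_name property to a non-null value, the anomaly detection job expects to receive aggregated input. The property must be set to the name of the field that contains the count of raw data points that have been aggregated. It applies to all detectors in the job.
The influencers or the partition fields must be included in the aggregation of your datafeed, otherwise they are not included in the job analysis. For more information on influencers, refer to Influencers.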
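
The first and third requirements above can be checked mechanically before a datafeed is created. The following Python function is a minimal sketch, not part of Elasticsearch or its clients: the name has_time_bucketing and the sample flights_aggs dictionary are hypothetical, and the check covers only the date_histogram/composite rule and the max-on-time-field naming rule.

```python
from typing import Any, Dict


def has_time_bucketing(aggs: Dict[str, Any], time_field: str) -> bool:
    """Return True if a date_histogram (at any depth) or a top-level composite
    aggregation contains a max aggregation that is named after, and operates
    on, time_field."""

    def sub_aggs(body: Dict[str, Any]) -> Dict[str, Any]:
        # Elasticsearch accepts both "aggregations" and the short form "aggs".
        return body.get("aggregations", body.get("aggs", {}))

    def has_time_max(body: Dict[str, Any]) -> bool:
        candidate = sub_aggs(body).get(time_field, {})
        return candidate.get("max", {}).get("field") == time_field

    def walk(level: Dict[str, Any], top_level: bool) -> bool:
        for body in level.values():
            is_bucketing = "date_histogram" in body or (
                top_level and "composite" in body
            )
            if is_bucketing and has_time_max(body):
                return True
            if walk(sub_aggs(body), False):
                return True
        return False

    return walk(aggs, True)


# Hypothetical aggregation definition used only for illustration.
flights_aggs = {
    "buckets": {
        "date_histogram": {"field": "time", "fixed_interval": "360s"},
        "aggregations": {"time": {"max": {"field": "time"}}},
    }
}
print(has_time_bucketing(flights_aggs, "time"))  # True
```

A definition without the max sub-aggregation, or with a composite aggregation that is not at the top level, makes the function return False.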

Intervals

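The bucket span of your anomaly detection job must be divisible by the value of the calendar_interval or fixed_interval in your aggregation (with no remainder).
If you specify a frequency for your datafeed, it must be divisible by the calendar_interval or the fixed_interval.
Anomaly detection jobs cannot use date_histogram or composite aggregations with an interval measured in months, because the length of a month is not fixed. They can use weeks or smaller units.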
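
The divisibility rules above amount to simple modular arithmetic. The sketch below assumes intervals are written with a single unit suffix of s, m, h, or d; the helper names to_seconds and is_divisible are hypothetical and not part of any Elastic API.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(interval: str) -> int:
    """Convert a time value such as "360s" or "60m" to seconds."""
    value, unit = interval[:-1], interval[-1]
    if unit not in UNIT_SECONDS or not value.isdigit() or int(value) == 0:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(value) * UNIT_SECONDS[unit]


def is_divisible(larger: str, smaller: str) -> bool:
    """Return True if `larger` is a whole multiple of `smaller`."""
    return to_seconds(larger) % to_seconds(smaller) == 0


print(is_divisible("60m", "360s"))  # True: 3600 / 360 = 10
print(is_divisible("60m", "7m"))    # False: 3600 / 420 leaves a remainder
```

The same check applies to the datafeed frequency: pass the frequency as the first argument and the aggregation interval as the second.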

Limitations

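If your datafeed uses aggregations with nested terms aggregations and model plot is not enabled for the anomaly detection job, neither the Single Metric Viewer nor the Anomaly Explorer can plot and display an anomaly chart. In these cases, an explanatory message is shown instead of the chart.
Your datafeed can contain multiple aggregations, but only the ones with names that match values in the job configuration are fed to the job.
Using scripted metric aggregations is not supported in datafeeds.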

Recommendations

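When your detectors use metric or sum analytical functions, set the date_histogram or composite aggregation interval to a tenth of the bucket span. This creates finer, more granular time buckets, which are ideal for this type of analysis.
When your detectors use count or rare functions, set the interval to the same value as the bucket span.
If you have multiple influencers or partition fields, or if your field cardinality is more than 1000, use composite aggregations.

To determine the cardinality of your data, you can run a search such as the following: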

GET .../_search
{
  "aggs": {
    "service_cardinality": {
      "cardinality": {
        "field": "service"
      }
    }
  }
}
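
The recommendations above can be expressed as two small decision helpers. This is a rough sketch under stated assumptions: the function names and the function_family labels ("metric", "sum", "count", "rare") are hypothetical, and the sample response only mimics the shape that a cardinality aggregation returns under aggregations.<name>.value.

```python
def cardinality_from_response(response: dict, agg_name: str) -> int:
    """Read the value of a cardinality aggregation from a search response."""
    return response["aggregations"][agg_name]["value"]


def should_use_composite(cardinality: int, split_field_count: int) -> bool:
    """Apply the rule: several influencer/partition fields, or cardinality > 1000."""
    return split_field_count > 1 or cardinality > 1000


def suggested_interval_seconds(bucket_span_seconds: int, function_family: str) -> int:
    """Suggest an aggregation interval for a given bucket span."""
    if function_family in ("count", "rare"):
        return bucket_span_seconds        # same value as the bucket span
    if function_family in ("metric", "sum"):
        # A tenth of the bucket span. Integer division may give a value that
        # does not divide the span evenly, so check divisibility afterwards.
        return bucket_span_seconds // 10
    raise ValueError(f"unknown function family: {function_family!r}")


# Hypothetical response for the cardinality search shown above.
sample_response = {"aggregations": {"service_cardinality": {"value": 1500}}}
cardinality = cardinality_from_response(sample_response, "service_cardinality")
print(should_use_composite(cardinality, split_field_count=1))  # True
print(suggested_interval_seconds(3600, "metric"))              # 360
```

With a 60 minute bucket span, the suggested interval for a metric function is 360 seconds, which is the fixed_interval used in the job examples on this page.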

Including aggregations in anomaly detection jobs

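When you create or update an anomaly detection job, you can include aggregated fields in the analysis configuration. In the datafeed configuration object, you can define the aggregations.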

PUT _ml/anomaly_detectors/kibana-sample-data-flights
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",  
      "by_field_name": "airline"  
    }],
    "summary_count_field_name": "doc_count" 
  },
  "data_description": {
    "time_field":"time"  
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "time",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "time": {  
            "max": {"field": "time"}
          },
          "airline": {  
            "terms": {
              "field": "airline",
              "size": 100
            },
            "aggregations": {
              "responsetime": {  
                "avg": {
                  "field": "responsetime"
                }
              }
            }
          }
        }
      }
    }
  }
}

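The `airline`, `responsetime`, and `time` fields are aggregations. Only the aggregated fields defined in the `analysis_config` object are analyzed by the anomaly detection job.
The `summary_count_field_name` property is set to the `doc_count` field, which is an aggregated field that contains the count of the aggregated data points.
The aggregations have names that match the fields that they operate on. The `max` aggregation is named `time` and its field also needs to be `time`.
The `terms` aggregation is named `airline` and its field is also named `airline`.
The `avg` aggregation is named `responsetime` and its field is also named `responsetime`.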
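
To see why the aggregation names matter, it can help to picture how one search response is turned into rows for the job. The sketch below is a simplified illustration and not the datafeed's actual implementation: flatten_buckets and the sample response are hypothetical, although the bucket layout follows the usual shape of date_histogram and terms responses.

```python
def flatten_buckets(response: dict) -> list:
    """Turn nested date_histogram/terms buckets into one record per leaf bucket."""
    records = []
    for time_bucket in response["aggregations"]["buckets"]["buckets"]:
        timestamp = time_bucket["time"]["value"]  # max(time) within the bucket
        for airline_bucket in time_bucket["airline"]["buckets"]:
            records.append({
                "time": timestamp,
                "airline": airline_bucket["key"],
                "responsetime": airline_bucket["responsetime"]["value"],
                # Number of raw documents summarized by this record.
                "doc_count": airline_bucket["doc_count"],
            })
    return records


# Hypothetical response with one time bucket and two airlines.
sample = {
    "aggregations": {
        "buckets": {
            "buckets": [
                {
                    "key": 1700000000000,
                    "doc_count": 5,
                    "time": {"value": 1700000300000.0},
                    "airline": {
                        "buckets": [
                            {"key": "AAL", "doc_count": 3, "responsetime": {"value": 132.5}},
                            {"key": "JAL", "doc_count": 2, "responsetime": {"value": 990.0}},
                        ]
                    },
                }
            ]
        }
    }
}
for record in flatten_buckets(sample):
    print(record)
```

Each record carries the keys `time`, `airline`, `responsetime`, and `doc_count`, which are the names that the `analysis_config` and `data_description` objects refer to.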

Use the following format to define a date_histogram aggregation that buckets by time in your datafeed:

"aggregations": {
  ["bucketing_aggregation": {
    "bucket_agg": {
      ...
    },
    "aggregations": {]
      "date_histogram_aggregation": {
        "date_histogram": {
          "field": "time"
        },
        "aggregations": {
          "timestamp": {
            "max": {
              "field": "time"
            }
          }
          [,"<first_term>": {
            "terms": {...
            }
            [,"aggregations" : {
              [<sub_aggregation>]+
            } ]
          }]
        }
      }
    [}
  }]
}

Composite aggregations

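Composite aggregations are optimized for queries that are either match_all or range filters. Use composite aggregations in your datafeeds for these cases. Other types of queries may cause the composite aggregation to be inefficient.

The following is an example of a job with a datafeed that uses a composite aggregation to bucket the metrics based on time and terms: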

PUT _ml/anomaly_detectors/kibana-sample-data-flights-composite
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field":"time"
  },
  "datafeed_config":{
    "indices": ["kibana-sample-data-flights"],
    "aggregations": {
      "buckets": {
        "composite": {
          "size": 1000,  
          "sources": [
            {
              "time_bucket": {  
                "date_histogram": {
                  "field": "time",
                  "fixed_interval": "360s",
                  "time_zone": "UTC"
                }
              }
            },
            {
              "airline": {  
                "terms": {
                  "field": "airline"
                }
              }
            }
          ]
        },
        "aggregations": {
          "time": {  
            "max": {
              "field": "time"
            }
          },
          "responsetime": { 
            "avg": {
              "field": "responsetime"
            }
          }
        }
      }
    }
  }
}

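The composite `size` sets how many composite buckets each search page returns. A larger `size` means a faster datafeed, but more cluster resources are used when searching.
The required `date_histogram` composite aggregation source (`time_bucket`). Make sure it is named differently than your desired time field.
Instead of using a regular `terms` aggregation, add a composite aggregation `terms` source with the name `airline`. Note that its name is the same as the field.
The required `max` aggregation, whose name is the time field in the job analysis configuration.
The `avg` aggregation is named `responsetime` and its field is also named `responsetime`.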
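
The effect of `size` is easier to see when the paging is written out. A composite aggregation returns its buckets in pages, and each response carries an `after_key` that the next request passes back as `after`. The datafeed handles this paging itself; the sketch below only illustrates the mechanism. The function iter_composite_buckets is hypothetical, it assumes the request body uses the short `aggs` key, and `search` stands for any callable that sends a request body and returns the response as a dictionary.

```python
import copy
from typing import Callable, Iterator


def iter_composite_buckets(
    search: Callable[[dict], dict], body: dict, agg_name: str
) -> Iterator[dict]:
    """Yield every bucket of a composite aggregation, one page at a time."""
    body = copy.deepcopy(body)  # do not modify the caller's request body
    while True:
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            return  # an empty page means there is nothing left to fetch
        # Ask for the page that starts right after the last returned bucket.
        body["aggs"][agg_name]["composite"]["after"] = after_key
```

With a larger `size`, the loop needs fewer round trips to cover the same time range, at the cost of more work per search.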

Use the following format to define a composite aggregation in your datafeed:

"aggregations": {
  "composite_agg": {
    "sources": [
      {
        "date_histogram_agg": {
          "field": "time",
          ...settings...
        }
      },
      ...other valid sources...
    ],
    ...composite agg settings...,
    "aggregations": {
      "timestamp": {
        "max": {
          "field": "time"
        }
      },
      ...other aggregations...
      [
        [,"aggregations" : {
          [<sub_aggregation>]+
        } ]
      }]
    }
  }
}

Nested aggregations

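You can also use complex nested aggregations in datafeeds.

The next example uses the derivative pipeline aggregation to find the first-order derivative of the counter system.network.out.bytes for each value of the field beat.name.

derivative or other pipeline aggregations may not work within composite aggregations. See composite aggregations and pipeline aggregations.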

"aggregations": {
  "beat.name": {
    "terms": {
      "field": "beat.name"
    },
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "5m"
        },
        "aggregations": {
          "@timestamp": {
            "max": {
              "field": "@timestamp"
            }
          },
          "bytes_out_average": {
            "avg": {
              "field": "system.network.out.bytes"
            }
          },
          "bytes_out_derivative": {
            "derivative": {
              "buckets_path": "bytes_out_average"
            }
          }
        }
      }
    }
  }
}
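
As a rough illustration of what the derivative step contributes, the first-order derivative of a bucketed metric is the difference between each bucket's value and the value of the bucket before it. The function below is a sketch with a hypothetical name that ignores gap handling; it is not how Elasticsearch implements the aggregation.

```python
from typing import List, Optional


def first_derivative(bucket_values: List[float]) -> List[Optional[float]]:
    """Return the bucket-to-bucket change; the first bucket has no derivative."""
    if not bucket_values:
        return []
    changes: List[Optional[float]] = [None]
    for previous, current in zip(bucket_values, bucket_values[1:]):
        changes.append(current - previous)
    return changes


# Hypothetical 5 minute averages of system.network.out.bytes for one beat.name.
print(first_derivative([100.0, 150.0, 130.0]))  # [None, 50.0, -20.0]
```

For a steadily growing counter, the derivative turns the raw totals into per-interval changes.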

Single bucket aggregations

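You can also use single bucket aggregations in datafeeds. The following example shows two filter aggregations, each gathering the number of entries for the error field.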

{
  "job_id":"servers-unique-errors",
  "indices": ["logs-*"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "time",
        "fixed_interval": "360s",
        "time_zone": "UTC"
      },
      "aggregations": {
        "time": {
          "max": {"field": "time"}
        },
        "server1": {
          "filter": {"term": {"source": "server-name-1"}},
          "aggregations": {
            "server1_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        },
        "server2": {
          "filter": {"term": {"source": "server-name-2"}},
          "aggregations": {
            "server2_error_count": {
              "value_count": {
                "field": "error"
              }
            }
          }
        }
      }
    }
  }
}

Using `aggregate_metric_double` field type in datafeeds

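It is not currently possible to use aggregate_metric_double type fields in datafeeds without aggregations.

You can use fields with the aggregate_metric_double field type in a datafeed with aggregations. You must retrieve the value_count of the aggregate_metric_double field in an aggregation and then use it as the summary_count_field_name, so that the job receives the correct count for the aggregated value.

In the following example, presum is an aggregate_metric_double type field that has all the possible metrics: [ min, max, sum, value_count ]. To use an avg aggregation on this field, you need to perform a value_count aggregation on presum and then set the field that contains the aggregated values, my_count, as the summary_count_field_name: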

{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      {
        "function": "avg",
        "field_name": "my_avg"
      }
    ],
    "summary_count_field_name": "my_count" 
  },
  "data_description": {
    "time_field": "timestamp"
  },
  "datafeed_config": {
    "indices": [
      "my_index"
    ],
    "datafeed_id": "datafeed-id",
    "aggregations": {
      "buckets": {
        "date_histogram": {
          "field": "timestamp",
          "fixed_interval": "360s",
          "time_zone": "UTC"
        },
        "aggregations": {
          "timestamp": {
            "max": {"field": "timestamp"}
          },
          "my_avg": {
            "avg": {
              "field": "presum"
            }
          },
          "my_count": {
            "value_count": {
              "field": "presum"
            }
          }
        }
      }
    }
  }
}

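The field `my_count` is set as the `summary_count_field_name`. This field contains the aggregated count from the `presum` `aggregate_metric_double` type field, produced by the `value_count` aggregation.
The `avg` aggregation, named `my_avg`, to use on the `presum` `aggregate_metric_double` type field.
The `value_count` aggregation, named `my_count`, on the `presum` `aggregate_metric_double` type field. This aggregated field must be set as the `summary_count_field_name` to make it possible to use the `aggregate_metric_double` type field in another aggregation.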
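
The reason for using `value_count` rather than the document count can be shown with plain arithmetic. Each `aggregate_metric_double` document already summarizes several raw values, so the number of documents in a bucket understates how many data points the average is based on. The sketch below uses hypothetical documents and a hypothetical summarize function to mimic that calculation; it is not Elasticsearch code.

```python
def summarize(docs: list) -> dict:
    """Combine pre-aggregated documents the way avg and value_count would."""
    total = sum(doc["presum"]["sum"] for doc in docs)
    count = sum(doc["presum"]["value_count"] for doc in docs)
    return {
        "my_avg": total / count,   # average over the raw values
        "my_count": count,         # number of raw values summarized
        "doc_count": len(docs),    # number of pre-aggregated documents
    }


# Two hypothetical documents that summarize 3 and 2 raw values respectively.
docs = [
    {"presum": {"min": 5.0, "max": 15.0, "sum": 30.0, "value_count": 3}},
    {"presum": {"min": 20.0, "max": 30.0, "sum": 50.0, "value_count": 2}},
]
print(summarize(docs))  # {'my_avg': 16.0, 'my_count': 5, 'doc_count': 2}
```

Here the bucket holds 2 documents but represents 5 raw values, which is why `my_count` and not `doc_count` is the right summary count.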
