Dissecting data

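Dissect matches a single text field against a defined pattern. A dissect pattern is defined by the parts of the string you want to discard. Paying close attention to each part of a string helps you build successful dissect patterns.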

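If you don't need the power of regular expressions, use dissect patterns instead of grok. Dissect uses a much simpler syntax than grok and is typically faster overall. The syntax for dissect is transparent: tell dissect what you want and it returns those results to you.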

Dissect patterns

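Dissect patterns consist of variables and separators. Anything defined by a percent sign and curly braces, %{}, is considered a variable, such as %{clientip}. You can assign variables to any part of the data in a field, and then return only the parts that you want. Separators are any values between variables, which could be spaces, dashes, or other delimiters.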
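The split between variables and separators can be made visible with a few lines of Python. This is only an illustrative sketch: the function name split_pattern is hypothetical, and the regular expression handles plain %{name} variables only, not the full dissect syntax.

```python
import re

# Matches one dissect variable such as %{clientip}; the name is captured.
VARIABLE = re.compile(r"%\{([^}]*)\}")

def split_pattern(pattern):
    """Return (variables, separators) for a dissect-style pattern string."""
    variables = VARIABLE.findall(pattern)
    # re.split with a capture group alternates literal text and variable names,
    # so the even positions hold the text between variables. Drop empty pieces.
    separators = [s for s in VARIABLE.split(pattern)[0::2] if s]
    return variables, separators

variables, separators = split_pattern("%{clientip} %{ident} %{auth} [%{@timestamp}]")
print(variables)   # the variable names, in order
print(separators)  # the literal text dissect must find between them
```

Everything that is not inside %{} ends up in the separator list, including the single spaces and the square brackets.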

For example, let's say your log data has a message field that looks like this:

"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"

You assign variables to each part of the data to construct a successful dissect pattern. Remember to tell dissect exactly what you want to match on.

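The first part of the data looks like an IP address, so you can assign a variable such as %{clientip}. The next two characters are dashes with a space on either side. You can assign a variable to each dash, or a single variable to represent the dashes and spaces. Next is a set of brackets containing a timestamp. The brackets are separators, so you include them in the dissect pattern. Thus far, the data and the matching dissect pattern look like this: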

247.37.0.0 - - [30/Apr/2020:14:31:22 -0500]

%{clientip} %{ident} %{auth} [%{@timestamp}]

The first line is the first chunk of data from the message field. The second line is the dissect pattern that matches that chunk.

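Using the same logic, you can create variables for the remaining chunks of data. Double quotation marks are separators, so include them in your dissect pattern. The pattern replaces GET with a %{verb} variable, but keeps HTTP as part of the pattern.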

\"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0

"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}

Combining the two patterns results in a dissect pattern that looks like this:

%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}
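To see which value lands in which variable, the combined pattern can be tried against the sample message with a rough emulation in Python. This sketch is not Elasticsearch's dissect implementation: it assumes plain %{name} variables, turns each separator into a literal and each variable into a shortest-possible match, and the function name dissect is reused here only for readability.

```python
import re

def dissect(pattern, text):
    """Approximate a dissect match with a regex; returns a dict or None."""
    # re.split with a capture group alternates separator text and variable names.
    parts = re.split(r"%\{([^}]*)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)         # separator: must match literally
        else:
            regex += "(?P<f%d>.*?)" % i      # variable: shortest run before the next separator
    match = re.fullmatch(regex, text)
    if match is None:
        return None                          # assumption: a failed match yields no fields
    return {parts[i]: match.group("f%d" % i) for i in range(1, len(parts), 2)}

PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
MESSAGE = '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'

fields = dissect(PATTERN, MESSAGE)
for name, value in fields.items():
    print(name, "=", value)
```

A line that lacks any of the separators, such as a message with no square brackets, makes this emulation return None instead of a partial result.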

Now that you have a dissect pattern, how do you test and use it?

Test dissect patterns with Painless

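You can incorporate dissect patterns into Painless scripts to extract data. To test your script, use either the field contexts of the Painless execute API or create a runtime field that includes the script. Runtime fields offer greater flexibility and accept multiple documents, but the Painless execute API is a good option if you don't have write access on the cluster where you're testing the script.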

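For example, test your dissect pattern with the Painless execute API by including your Painless script and a single document that matches your data. Start by indexing the message field as a wildcard data type:

Python: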

resp = client.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "message": {
                "type": "wildcard"
            }
        }
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'my-index',
  body: {
    mappings: {
      properties: {
        message: {
          type: 'wildcard'
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index",
  mappings: {
    properties: {
      message: {
        type: "wildcard",
      },
    },
  },
});
console.log(response);

Console:

PUT my-index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "wildcard"
      }
    }
  }
}

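If you want to retrieve the HTTP response code, add your dissect pattern to a Painless script that extracts the response value. To extract values from a field, use this function: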

`.extract(doc["<field_name>"].value)?.<field_value>`

In this example, message is the <field_name> and response is the <field_value>:

Python:

resp = client.scripts_painless_execute(
    script={
        "source": "\n      String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}').extract(doc[\"message\"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response)); \n    "
    },
    context="long_field",
    context_setup={
        "index": "my-index",
        "document": {
            "message": "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
        }
    },
)
print(resp)

JavaScript:

const response = await client.scriptsPainlessExecute({
  script: {
    source:
      '\n      String response=dissect(\'%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}\').extract(doc["message"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response)); \n    ',
  },
  context: "long_field",
  context_setup: {
    index: "my-index",
    document: {
      message:
        '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0',
    },
  },
});
console.log(response);

Console:

POST /_scripts/painless/_execute
{
  "script": {
    "source": """
      String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
        if (response != null) emit(Integer.parseInt(response)); 
    """
  },
  "context": "long_field", 
  "context_setup": {
    "index": "my-index",
    "document": {          
      "message": """247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0"""
    }
  }
}

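In the script, the emit method returns the extracted value; runtime fields require emit to return values.

Because the response code is an integer, the request uses the long_field context.

The document under context_setup is a sample document that matches your data.

The result includes the HTTP response code: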

{
  "result" : [
    304
  ]
}
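The two statements in the script can be read as "extract if possible, then emit only when something was extracted". The Python sketch below mirrors that control flow under stated assumptions: extract is a hypothetical stand-in built on a hand-written regular expression for this one log layout, it is assumed to return nothing when the line does not match (which is what the ?. operator guards against), and a plain list plays the role of emit.

```python
import re

# Hypothetical stand-in for dissect(...).extract(...): a regex written by hand
# for this one log layout. It returns a dict on a match and None otherwise.
LOG_LINE = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) HTTP/(\S+)" (\S+) (\S+)')

def extract(message):
    m = LOG_LINE.fullmatch(message)
    return {"response": m.group(8)} if m else None

def run_script(message):
    """Mirror the script's two statements; the returned list stands in for emit()."""
    emitted = []
    fields = extract(message)
    # Like ?.response: only read the key when extract() produced a result.
    response = fields.get("response") if fields is not None else None
    if response is not None:
        emitted.append(int(response))  # like emit(Integer.parseInt(response))
    return emitted

good = run_script('247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0')
bad = run_script("not a valid apache log")
print(good, bad)
```

In this sketch a line that does not match simply emits nothing, so the field has no value for that document instead of raising an error.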

Use dissect patterns and scripts in runtime fields

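If you have a working dissect pattern, you can add it to a runtime field to manipulate data. Because runtime fields don't require you to index fields, you have a great deal of flexibility to modify your script and how it functions. If you already tested your dissect pattern with the Painless execute API, you can use that exact Painless script in your runtime field.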

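To start, add the message field as a wildcard type as in the previous section, but also add @timestamp as a date type in case you want to operate on that field for other use cases. If my-index still exists from the previous section, delete it first, because creating an index that already exists returns an error:

Python: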

resp = client.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "@timestamp": {
                "format": "strict_date_optional_time||epoch_second",
                "type": "date"
            },
            "message": {
                "type": "wildcard"
            }
        }
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'my-index',
  body: {
    mappings: {
      properties: {
        "@timestamp": {
          format: 'strict_date_optional_time||epoch_second',
          type: 'date'
        },
        message: {
          type: 'wildcard'
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index",
  mappings: {
    properties: {
      "@timestamp": {
        format: "strict_date_optional_time||epoch_second",
        type: "date",
      },
      message: {
        type: "wildcard",
      },
    },
  },
});
console.log(response);

Console:

PUT /my-index/
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_second",
        "type": "date"
      },
      "message": {
        "type": "wildcard"
      }
    }
  }
}

If you want to extract the HTTP response code using your dissect pattern, you can create a runtime field such as http.response:

Python:

resp = client.indices.put_mapping(
    index="my-index",
    runtime={
        "http.response": {
            "type": "long",
            "script": "\n        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}').extract(doc[\"message\"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response));\n      "
        }
    },
)
print(resp)

JavaScript:

const response = await client.indices.putMapping({
  index: "my-index",
  runtime: {
    "http.response": {
      type: "long",
      script:
        '\n        String response=dissect(\'%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}\').extract(doc["message"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response));\n      ',
    },
  },
});
console.log(response);

Console:

PUT my-index/_mappings
{
  "runtime": {
    "http.response": {
      "type": "long",
      "script": """
        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
        if (response != null) emit(Integer.parseInt(response));
      """
    }
  }
}
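Because the tested script and the runtime field script are meant to be identical, one way to avoid copy-and-paste drift is to keep the script source in a single constant and build both request bodies from it. The sketch below only constructs the JSON bodies; the helper names execute_body and mapping_body are hypothetical, and sending the bodies to a cluster is left out.

```python
import json

PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')

# One Painless source string, shared by both request bodies below.
SCRIPT = (
    "String response=dissect('" + PATTERN + "')"
    '.extract(doc["message"].value)?.response; '
    "if (response != null) emit(Integer.parseInt(response));"
)

def execute_body(sample_message):
    """Body for POST /_scripts/painless/_execute, as in the earlier test request."""
    return {
        "script": {"source": SCRIPT},
        "context": "long_field",
        "context_setup": {"index": "my-index", "document": {"message": sample_message}},
    }

def mapping_body():
    """Body for PUT my-index/_mappings that defines the http.response runtime field."""
    return {"runtime": {"http.response": {"type": "long", "script": SCRIPT}}}

print(json.dumps(mapping_body(), indent=2))
```

With this layout, a change to the pattern is made once and reaches both the test request and the mapping.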

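After mapping the fields you want to retrieve, index a few records from your log data into Elasticsearch. The following request uses the bulk API to index raw log data into my-index:

Python: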

resp = client.bulk(
    index="my-index",
    refresh=True,
    operations=[
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:30:17-05:00",
            "message": "40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:30:53-05:00",
            "message": "232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:31:12-05:00",
            "message": "26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:31:19-05:00",
            "message": "247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:31:22-05:00",
            "message": "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:31:27-05:00",
            "message": "252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"
        },
        {
            "index": {}
        },
        {
            "timestamp": "2020-04-30T14:31:28-05:00",
            "message": "not a valid apache log"
        }
    ],
)
print(resp)

Ruby:

response = client.bulk(
  index: 'my-index',
  refresh: true,
  body: [
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:30:17-05:00',
      message: '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:30:53-05:00',
      message: '232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:31:12-05:00',
      message: '26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:31:19-05:00',
      message: '247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] "GET /french/splash_inet.html HTTP/1.0" 200 3781'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:31:22-05:00',
      message: '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:31:27-05:00',
      message: '252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736'
    },
    {
      index: {}
    },
    {
      timestamp: '2020-04-30T14:31:28-05:00',
      message: 'not a valid apache log'
    }
  ]
)
puts response

JavaScript:

const response = await client.bulk({
  index: "my-index",
  refresh: "true",
  operations: [
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:30:17-05:00",
      message:
        '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:30:53-05:00",
      message:
        '232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:31:12-05:00",
      message:
        '26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:31:19-05:00",
      message:
        '247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] "GET /french/splash_inet.html HTTP/1.0" 200 3781',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:31:22-05:00",
      message:
        '247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:31:27-05:00",
      message:
        '252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    },
    {
      index: {},
    },
    {
      timestamp: "2020-04-30T14:31:28-05:00",
      message: "not a valid apache log",
    },
  ],
});
console.log(response);

Console:

POST /my-index/_bulk?refresh=true
{"index":{}}
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
{"index":{}}
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
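If the raw log lines live in a file or a list, the newline-delimited body for a request like the one above can be generated instead of typed by hand. The following Python sketch is one possible approach, not part of the original example: the function name to_bulk_body is hypothetical, and it assumes the timestamp field is derived from the bracketed time in each line, so a line without brackets gets no timestamp here (unlike the last record above, which carries one).

```python
import json
import re
from datetime import datetime

def to_bulk_body(log_lines):
    """Build an NDJSON bulk body: one action line plus one document line per log line."""
    out = []
    for line in log_lines:
        doc = {"message": line}
        # Copy the bracketed Apache-style time into an ISO 8601 "timestamp" field when present.
        found = re.search(r"\[([^\]]+)\]", line)
        if found:
            parsed = datetime.strptime(found.group(1), "%d/%b/%Y:%H:%M:%S %z")
            doc = {"timestamp": parsed.isoformat(), "message": line}
        out.append(json.dumps({"index": {}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"  # the bulk API expects the body to end with a newline

body = to_bulk_body([
    '40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] "GET /images/hm_bg.jpg HTTP/1.0" 200 24736',
    "not a valid apache log",
])
print(body)
```

Each document is serialized with json.dumps, which takes care of escaping the double quotation marks inside the message.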

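You can define a simple query to search for a specific HTTP response and return all related fields. Use the fields parameter of the search API to retrieve the http.response runtime field:

Python: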

resp = client.search(
    index="my-index",
    query={
        "match": {
            "http.response": "304"
        }
    },
    fields=[
        "http.response"
    ],
)
print(resp)

Ruby:

response = client.search(
  index: 'my-index',
  body: {
    query: {
      match: {
        'http.response' => '304'
      }
    },
    fields: [
      'http.response'
    ]
  }
)
puts response

JavaScript:

const response = await client.search({
  index: "my-index",
  query: {
    match: {
      "http.response": "304",
    },
  },
  fields: ["http.response"],
});
console.log(response);

Console:

GET my-index/_search
{
  "query": {
    "match": {
      "http.response": "304"
    }
  },
  "fields" : ["http.response"]
}

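Alternatively, you can define the same runtime field in the context of a search request. The runtime definition and the script are exactly the same as the ones defined previously in the index mapping. Copy that definition into the runtime_mappings section of the search request and include a query that matches on the runtime field. This query returns the same results as the search query defined previously for the http.response runtime field in your index mappings, but only in the context of this specific search:

Python: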

resp = client.search(
    index="my-index",
    runtime_mappings={
        "http.response": {
            "type": "long",
            "script": "\n        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}').extract(doc[\"message\"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response));\n      "
        }
    },
    query={
        "match": {
            "http.response": "304"
        }
    },
    fields=[
        "http.response"
    ],
)
print(resp)

JavaScript:

const response = await client.search({
  index: "my-index",
  runtime_mappings: {
    "http.response": {
      type: "long",
      script:
        '\n        String response=dissect(\'%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}\').extract(doc["message"].value)?.response;\n        if (response != null) emit(Integer.parseInt(response));\n      ',
    },
  },
  query: {
    match: {
      "http.response": "304",
    },
  },
  fields: ["http.response"],
});
console.log(response);

Console:

GET my-index/_search
{
  "runtime_mappings": {
    "http.response": {
      "type": "long",
      "script": """
        String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
        if (response != null) emit(Integer.parseInt(response));
      """
    }
  },
  "query": {
    "match": {
      "http.response": "304"
    }
  },
  "fields" : ["http.response"]
}

The response includes the single document whose HTTP response code is 304:

{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "D47UqXkBByC8cgZrkbOm",
        "_score" : 1.0,
        "_source" : {
          "timestamp" : "2020-04-30T14:31:22-05:00",
          "message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
        },
        "fields" : {
          "http.response" : [
            304
          ]
        }
      }
    ]
  }
}
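On the client side, the runtime field's values are read from the fields section of each hit, where every field is returned as an array. The Python sketch below works on a trimmed copy of the response above, written as a literal dictionary; the helper name response_codes is hypothetical.

```python
# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "my-index",
                "_id": "D47UqXkBByC8cgZrkbOm",
                "fields": {"http.response": [304]},
            }
        ],
    }
}

def response_codes(search_response, field="http.response"):
    """Collect the field's values from every hit; a hit without the field adds nothing."""
    codes = []
    for hit in search_response["hits"]["hits"]:
        codes.extend(hit.get("fields", {}).get(field, []))
    return codes

codes = response_codes(response)
print(codes)
```

Using get with empty defaults keeps the loop safe for hits where the script emitted no value.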