Find text structure API

Finds the structure of text. The text must contain data that is suitable to be ingested into the Elastic Stack.

Request

POST _text_structure/find_structure

Prerequisites

  • If the Elasticsearch security features are enabled, you must have the monitor_text_structure or monitor cluster privilege to use this API. See Security privileges.

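Description

This API provides a starting point for ingesting data into Elasticsearch in a format that is suitable for subsequent use with other Elastic Stack functionality.

Unlike other Elasticsearch endpoints, the data posted to this endpoint needs to be neither UTF-8 encoded nor in JSON format. It must, however, be text; binary formats are not currently supported.

The response from the API contains:

  • A couple of messages from the beginning of the text.
  • Statistics that reveal the most common values for all fields detected within the text, and basic numeric statistics for numeric fields.
  • Information about the structure of the text, which is useful when you write ingest configurations to index it or similarly formatted text.
  • Appropriate mappings for an Elasticsearch index, which you could use to ingest the text.

All this information can be calculated by the structure finder with no guidance. However, you can optionally override some of the decisions about the text structure by specifying one or more query parameters.

Details of the output can be seen in the examples.

If the structure finder produces unexpected results for some text, specify the explain query parameter. It causes an explanation to appear in the response, which should help in determining why the returned structure was chosen.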
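As a rough illustration of overriding the finder's decisions, the following sketch assembles a request URL with query parameters using only the Python standard library. The helper name build_find_structure_url and the http://localhost:9200 address are assumptions made for illustration; only the endpoint path and the parameter names come from this page.

```python
from urllib.parse import urlencode


def build_find_structure_url(base_url, **params):
    """Return the find_structure URL with the given query parameters.

    Booleans are rendered as lowercase "true"/"false"; parameters set to
    None are omitted so that the structure finder decides them itself.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = str(value)
    url = base_url.rstrip("/") + "/_text_structure/find_structure"
    return url + ("?" + urlencode(query) if query else "")


url = build_find_structure_url(
    "http://localhost:9200",  # assumed local cluster address
    format="delimited",
    delimiter=";",
    explain=True,
)
print(url)
```

Leaving a parameter out (or passing None) keeps the default behavior, in which the structure finder makes the decision without guidance.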

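Query parameters

charset
(Optional, string) The text's character set. It must be a character set that is supported by the JVM that Elasticsearch uses. For example, UTF-8, UTF-16LE, windows-1252, or EUC-JP. If this parameter is not specified, the structure finder chooses an appropriate character set.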
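column_names
(Optional, string) If you have set format to delimited, you can specify the column names in a comma-separated list. If this parameter is not specified, the structure finder uses the column names from the header row of the text. If the text does not have a header row, columns are named "column1", "column2", "column3", and so on.
delimiter
(Optional, string) If you have set format to delimited, you can specify the character used to delimit the values in each row. Only a single character is supported; the delimiter cannot have multiple characters. By default, the API considers the following possibilities: comma, tab, semicolon, and pipe (|). In this default scenario, all rows must have the same number of fields for the delimited format to be detected. If you specify a delimiter, up to 10% of the rows can have a different number of columns than the first row.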
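explain
(Optional, Boolean) If true, the response includes a field named explanation, which is an array of strings that indicate how the structure finder produced its result. The default value is false.
format
(Optional, string) The high-level structure of the text. Valid values are ndjson, xml, delimited, and semi_structured_text. By default, the API chooses the format. In this default scenario, all rows must have the same number of fields for a delimited format to be detected. If format is set to delimited and delimiter is not set, however, the API tolerates up to 5% of rows that have a different number of columns than the first row.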
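grok_pattern
(Optional, string) If you have set format to semi_structured_text, you can specify a Grok pattern that is used to extract fields from every message in the text. The name of the timestamp field in the Grok pattern must match what is specified in the timestamp_field parameter. If that parameter is not specified, the name of the timestamp field in the Grok pattern must match "timestamp". If grok_pattern is not specified, the structure finder creates a Grok pattern.
ecs_compatibility
(Optional, string) The mode of compatibility with ECS-compliant Grok patterns. Use this parameter to specify whether to use ECS Grok patterns instead of legacy ones when the structure finder creates a Grok pattern. Valid values are disabled and v1. The default value is disabled. This setting primarily has an impact when a whole-message Grok pattern such as %{CATALINALOG} matches the input. If the structure finder identifies a common structure but has no idea of its meaning, generic field names such as path, ipaddress, field1, and field2 are used in the grok_pattern output, with the intention that a user who knows the meanings renames these fields before using it.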
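has_header_row
(Optional, Boolean) If you have set format to delimited, you can use this parameter to indicate whether the column names are in the first row of the text. If this parameter is not specified, the structure finder guesses based on the similarity of the first row of the text to other rows.
line_merge_size_limit
(Optional, unsigned integer) The maximum number of characters in a message when lines are merged to form messages while analyzing semi-structured text. The default is 10000. If you have extremely long messages you may need to increase this value, but be aware that this may lead to very long processing times if the way to group lines into messages is misdetected.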
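lines_to_sample

(Optional, unsigned integer) The number of lines to include in the structural analysis, starting from the beginning of the text. The minimum is 2; the default is 1000. If the value of this parameter is greater than the number of lines in the text, the analysis proceeds for all of the lines (as long as there are at least two lines in the text).

The number of lines and the variation of the lines affect the speed of the analysis. For example, if you upload text where the first 1000 lines are all variations on the same message, the analysis will find more commonality than would be seen with a bigger sample. If possible, however, it is more efficient to upload sample text with more variety in the first 1000 lines than to request analysis of 100000 lines to achieve some variety.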

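quote
(Optional, string) If you have set format to delimited, you can specify the character used to quote the values in each row if they contain newlines or the delimiter character. Only a single character is supported. If this parameter is not specified, the default value is a double quote ("). If your delimited text format does not use quoting, a workaround is to set this parameter to a character that does not appear anywhere in the sample.
should_trim_fields
(Optional, Boolean) If you have set format to delimited, you can specify whether values between delimiters should have whitespace trimmed from them. If this parameter is not specified and the delimiter is pipe (|), the default value is true. Otherwise, the default value is false.
timeout
(Optional, time units) Sets the maximum amount of time that the structure analysis may take. If the analysis is still running when the timeout expires, it is stopped. The default value is 25 seconds.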
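timestamp_field

(Optional, string) The name of the field that contains the primary timestamp of each record in the text. In particular, if the text were ingested into an index, this is the field that would be used to populate the @timestamp field.

If format is semi_structured_text, this field must match the name of the appropriate extraction in the grok_pattern. Therefore, for semi-structured text, it is best not to specify this parameter unless grok_pattern is also specified.

For structured text, if you specify this parameter, the field must exist within the text.

If this parameter is not specified, the structure finder decides which field (if any) is the primary timestamp field. For structured text, it is not compulsory to have a timestamp in the text.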

timestamp_format

(Optional, string) The Java time format of the timestamp field in the text.

Only a subset of Java time format letter groups is supported:

  • a
  • d
  • dd
  • EEE
  • EEEE
  • H
  • HH
  • h
  • M
  • MM
  • MMM
  • MMMM
  • mm
  • ss
  • XX
  • XXX
  • yy
  • yyyy
  • zzz

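Additionally, S letter groups (fractional seconds) of length one to nine are supported, provided they occur after ss and are separated from the ss by a period (.), comma (,), or colon (:). Spacing and punctuation are also permitted, with the exception of ?, newline, and carriage return, together with literal text enclosed in single quotes. For example, MM/dd HH.mm.ss,SSSSSS 'in' yyyy is a valid override format.

One valuable use case for this parameter is when the format is semi-structured text, there are multiple timestamp formats in the text, and you know which format corresponds to the primary timestamp, but you do not want to specify the full grok_pattern. Another is when the timestamp format is one that the structure finder does not consider by default.

If this parameter is not specified, the structure finder chooses the best format from a built-in set.

If the special value null is specified, the structure finder does not look for a primary timestamp in the text. When the format is semi-structured text, this results in the structure finder treating the text as single-line messages.

The following table provides the appropriate timestamp_format values for some example timestamps: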

Time format                   Presentation
yyyy-MM-dd HH:mm:ssZ          2019-04-20 13:15:22+0000
EEE, d MMM yyyy HH:mm:ss Z    Sat, 20 Apr 2019 13:15:22 +0000
dd.MM.yy HH:mm:ss.SSS         20.04.19 13:15:22.285

For more information about date and time format syntax, see the Java date/time format documentation.
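Before passing a timestamp_format override, it can help to confirm that a candidate format really matches a sample timestamp. The sketch below does this for the three example timestamps above using Python's datetime.strptime. The Python directives are hand-written approximations of the Java formats and are an assumption of this example: Elasticsearch itself interprets the Java letter groups, not Python directives, and %a and %b assume English day and month names.

```python
from datetime import datetime

# (Java time format from the table, assumed Python analogue, example timestamp)
EXAMPLES = [
    ("yyyy-MM-dd HH:mm:ssZ", "%Y-%m-%d %H:%M:%S%z", "2019-04-20 13:15:22+0000"),
    ("EEE, d MMM yyyy HH:mm:ss Z", "%a, %d %b %Y %H:%M:%S %z", "Sat, 20 Apr 2019 13:15:22 +0000"),
    ("dd.MM.yy HH:mm:ss.SSS", "%d.%m.%y %H:%M:%S.%f", "20.04.19 13:15:22.285"),
]

parsed = []
for java_format, python_format, sample in EXAMPLES:
    # strptime raises ValueError if the sample does not match the format.
    value = datetime.strptime(sample, python_format)
    parsed.append(value)
    print(f"{java_format!r:32} {sample!r:36} -> {value.isoformat()}")
```

The third format carries no time zone, so the parsed value is a naive datetime; this is the situation that the need_client_timezone response field reports.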

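Request body

The text that you want to analyze. It must contain data that is suitable to be ingested into Elasticsearch. It does not need to be in JSON format and it does not need to be UTF-8 encoded. The size is limited to the Elasticsearch HTTP receive buffer size, which defaults to 100 MB.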

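Examples

Ingesting newline-delimited JSON

Suppose you have newline-delimited JSON text that contains information about some books. You can send the contents to the find_structure endpoint: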

Python:
resp = client.text_structure.find_structure(
    text_files=[
        {
            "name": "Leviathan Wakes",
            "author": "James S.A. Corey",
            "release_date": "2011-06-02",
            "page_count": 561
        },
        {
            "name": "Hyperion",
            "author": "Dan Simmons",
            "release_date": "1989-05-26",
            "page_count": 482
        },
        {
            "name": "Dune",
            "author": "Frank Herbert",
            "release_date": "1965-06-01",
            "page_count": 604
        },
        {
            "name": "Dune Messiah",
            "author": "Frank Herbert",
            "release_date": "1969-10-15",
            "page_count": 331
        },
        {
            "name": "Children of Dune",
            "author": "Frank Herbert",
            "release_date": "1976-04-21",
            "page_count": 408
        },
        {
            "name": "God Emperor of Dune",
            "author": "Frank Herbert",
            "release_date": "1981-05-28",
            "page_count": 454
        },
        {
            "name": "Consider Phlebas",
            "author": "Iain M. Banks",
            "release_date": "1987-04-23",
            "page_count": 471
        },
        {
            "name": "Pandora's Star",
            "author": "Peter F. Hamilton",
            "release_date": "2004-03-02",
            "page_count": 768
        },
        {
            "name": "Revelation Space",
            "author": "Alastair Reynolds",
            "release_date": "2000-03-15",
            "page_count": 585
        },
        {
            "name": "A Fire Upon the Deep",
            "author": "Vernor Vinge",
            "release_date": "1992-06-01",
            "page_count": 613
        },
        {
            "name": "Ender's Game",
            "author": "Orson Scott Card",
            "release_date": "1985-06-01",
            "page_count": 324
        },
        {
            "name": "1984",
            "author": "George Orwell",
            "release_date": "1985-06-01",
            "page_count": 328
        },
        {
            "name": "Fahrenheit 451",
            "author": "Ray Bradbury",
            "release_date": "1953-10-15",
            "page_count": 227
        },
        {
            "name": "Brave New World",
            "author": "Aldous Huxley",
            "release_date": "1932-06-01",
            "page_count": 268
        },
        {
            "name": "Foundation",
            "author": "Isaac Asimov",
            "release_date": "1951-06-01",
            "page_count": 224
        },
        {
            "name": "The Giver",
            "author": "Lois Lowry",
            "release_date": "1993-04-26",
            "page_count": 208
        },
        {
            "name": "Slaughterhouse-Five",
            "author": "Kurt Vonnegut",
            "release_date": "1969-06-01",
            "page_count": 275
        },
        {
            "name": "The Hitchhiker's Guide to the Galaxy",
            "author": "Douglas Adams",
            "release_date": "1979-10-12",
            "page_count": 180
        },
        {
            "name": "Snow Crash",
            "author": "Neal Stephenson",
            "release_date": "1992-06-01",
            "page_count": 470
        },
        {
            "name": "Neuromancer",
            "author": "William Gibson",
            "release_date": "1984-07-01",
            "page_count": 271
        },
        {
            "name": "The Handmaid's Tale",
            "author": "Margaret Atwood",
            "release_date": "1985-06-01",
            "page_count": 311
        },
        {
            "name": "Starship Troopers",
            "author": "Robert A. Heinlein",
            "release_date": "1959-12-01",
            "page_count": 335
        },
        {
            "name": "The Left Hand of Darkness",
            "author": "Ursula K. Le Guin",
            "release_date": "1969-06-01",
            "page_count": 304
        },
        {
            "name": "The Moon is a Harsh Mistress",
            "author": "Robert A. Heinlein",
            "release_date": "1966-04-01",
            "page_count": 288
        }
    ],
)
print(resp)

Ruby:
response = client.text_structure.find_structure(
  body: [
    {
      name: 'Leviathan Wakes',
      author: 'James S.A. Corey',
      release_date: '2011-06-02',
      page_count: 561
    },
    {
      name: 'Hyperion',
      author: 'Dan Simmons',
      release_date: '1989-05-26',
      page_count: 482
    },
    {
      name: 'Dune',
      author: 'Frank Herbert',
      release_date: '1965-06-01',
      page_count: 604
    },
    {
      name: 'Dune Messiah',
      author: 'Frank Herbert',
      release_date: '1969-10-15',
      page_count: 331
    },
    {
      name: 'Children of Dune',
      author: 'Frank Herbert',
      release_date: '1976-04-21',
      page_count: 408
    },
    {
      name: 'God Emperor of Dune',
      author: 'Frank Herbert',
      release_date: '1981-05-28',
      page_count: 454
    },
    {
      name: 'Consider Phlebas',
      author: 'Iain M. Banks',
      release_date: '1987-04-23',
      page_count: 471
    },
    {
      name: "Pandora's Star",
      author: 'Peter F. Hamilton',
      release_date: '2004-03-02',
      page_count: 768
    },
    {
      name: 'Revelation Space',
      author: 'Alastair Reynolds',
      release_date: '2000-03-15',
      page_count: 585
    },
    {
      name: 'A Fire Upon the Deep',
      author: 'Vernor Vinge',
      release_date: '1992-06-01',
      page_count: 613
    },
    {
      name: "Ender's Game",
      author: 'Orson Scott Card',
      release_date: '1985-06-01',
      page_count: 324
    },
    {
      name: '1984',
      author: 'George Orwell',
      release_date: '1985-06-01',
      page_count: 328
    },
    {
      name: 'Fahrenheit 451',
      author: 'Ray Bradbury',
      release_date: '1953-10-15',
      page_count: 227
    },
    {
      name: 'Brave New World',
      author: 'Aldous Huxley',
      release_date: '1932-06-01',
      page_count: 268
    },
    {
      name: 'Foundation',
      author: 'Isaac Asimov',
      release_date: '1951-06-01',
      page_count: 224
    },
    {
      name: 'The Giver',
      author: 'Lois Lowry',
      release_date: '1993-04-26',
      page_count: 208
    },
    {
      name: 'Slaughterhouse-Five',
      author: 'Kurt Vonnegut',
      release_date: '1969-06-01',
      page_count: 275
    },
    {
      name: "The Hitchhiker's Guide to the Galaxy",
      author: 'Douglas Adams',
      release_date: '1979-10-12',
      page_count: 180
    },
    {
      name: 'Snow Crash',
      author: 'Neal Stephenson',
      release_date: '1992-06-01',
      page_count: 470
    },
    {
      name: 'Neuromancer',
      author: 'William Gibson',
      release_date: '1984-07-01',
      page_count: 271
    },
    {
      name: "The Handmaid's Tale",
      author: 'Margaret Atwood',
      release_date: '1985-06-01',
      page_count: 311
    },
    {
      name: 'Starship Troopers',
      author: 'Robert A. Heinlein',
      release_date: '1959-12-01',
      page_count: 335
    },
    {
      name: 'The Left Hand of Darkness',
      author: 'Ursula K. Le Guin',
      release_date: '1969-06-01',
      page_count: 304
    },
    {
      name: 'The Moon is a Harsh Mistress',
      author: 'Robert A. Heinlein',
      release_date: '1966-04-01',
      page_count: 288
    }
  ]
)
puts response

JavaScript:
const response = await client.textStructure.findStructure({
  text_files: [
    {
      name: "Leviathan Wakes",
      author: "James S.A. Corey",
      release_date: "2011-06-02",
      page_count: 561,
    },
    {
      name: "Hyperion",
      author: "Dan Simmons",
      release_date: "1989-05-26",
      page_count: 482,
    },
    {
      name: "Dune",
      author: "Frank Herbert",
      release_date: "1965-06-01",
      page_count: 604,
    },
    {
      name: "Dune Messiah",
      author: "Frank Herbert",
      release_date: "1969-10-15",
      page_count: 331,
    },
    {
      name: "Children of Dune",
      author: "Frank Herbert",
      release_date: "1976-04-21",
      page_count: 408,
    },
    {
      name: "God Emperor of Dune",
      author: "Frank Herbert",
      release_date: "1981-05-28",
      page_count: 454,
    },
    {
      name: "Consider Phlebas",
      author: "Iain M. Banks",
      release_date: "1987-04-23",
      page_count: 471,
    },
    {
      name: "Pandora's Star",
      author: "Peter F. Hamilton",
      release_date: "2004-03-02",
      page_count: 768,
    },
    {
      name: "Revelation Space",
      author: "Alastair Reynolds",
      release_date: "2000-03-15",
      page_count: 585,
    },
    {
      name: "A Fire Upon the Deep",
      author: "Vernor Vinge",
      release_date: "1992-06-01",
      page_count: 613,
    },
    {
      name: "Ender's Game",
      author: "Orson Scott Card",
      release_date: "1985-06-01",
      page_count: 324,
    },
    {
      name: "1984",
      author: "George Orwell",
      release_date: "1985-06-01",
      page_count: 328,
    },
    {
      name: "Fahrenheit 451",
      author: "Ray Bradbury",
      release_date: "1953-10-15",
      page_count: 227,
    },
    {
      name: "Brave New World",
      author: "Aldous Huxley",
      release_date: "1932-06-01",
      page_count: 268,
    },
    {
      name: "Foundation",
      author: "Isaac Asimov",
      release_date: "1951-06-01",
      page_count: 224,
    },
    {
      name: "The Giver",
      author: "Lois Lowry",
      release_date: "1993-04-26",
      page_count: 208,
    },
    {
      name: "Slaughterhouse-Five",
      author: "Kurt Vonnegut",
      release_date: "1969-06-01",
      page_count: 275,
    },
    {
      name: "The Hitchhiker's Guide to the Galaxy",
      author: "Douglas Adams",
      release_date: "1979-10-12",
      page_count: 180,
    },
    {
      name: "Snow Crash",
      author: "Neal Stephenson",
      release_date: "1992-06-01",
      page_count: 470,
    },
    {
      name: "Neuromancer",
      author: "William Gibson",
      release_date: "1984-07-01",
      page_count: 271,
    },
    {
      name: "The Handmaid's Tale",
      author: "Margaret Atwood",
      release_date: "1985-06-01",
      page_count: 311,
    },
    {
      name: "Starship Troopers",
      author: "Robert A. Heinlein",
      release_date: "1959-12-01",
      page_count: 335,
    },
    {
      name: "The Left Hand of Darkness",
      author: "Ursula K. Le Guin",
      release_date: "1969-06-01",
      page_count: 304,
    },
    {
      name: "The Moon is a Harsh Mistress",
      author: "Robert A. Heinlein",
      release_date: "1966-04-01",
      page_count: 288,
    },
  ],
});
console.log(response);

Console:
POST _text_structure/find_structure
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}
{"name": "Dune Messiah", "author": "Frank Herbert", "release_date": "1969-10-15", "page_count": 331}
{"name": "Children of Dune", "author": "Frank Herbert", "release_date": "1976-04-21", "page_count": 408}
{"name": "God Emperor of Dune", "author": "Frank Herbert", "release_date": "1981-05-28", "page_count": 454}
{"name": "Consider Phlebas", "author": "Iain M. Banks", "release_date": "1987-04-23", "page_count": 471}
{"name": "Pandora's Star", "author": "Peter F. Hamilton", "release_date": "2004-03-02", "page_count": 768}
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{"name": "A Fire Upon the Deep", "author": "Vernor Vinge", "release_date": "1992-06-01", "page_count": 613}
{"name": "Ender's Game", "author": "Orson Scott Card", "release_date": "1985-06-01", "page_count": 324}
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{"name": "Foundation", "author": "Isaac Asimov", "release_date": "1951-06-01", "page_count": 224}
{"name": "The Giver", "author": "Lois Lowry", "release_date": "1993-04-26", "page_count": 208}
{"name": "Slaughterhouse-Five", "author": "Kurt Vonnegut", "release_date": "1969-06-01", "page_count": 275}
{"name": "The Hitchhiker's Guide to the Galaxy", "author": "Douglas Adams", "release_date": "1979-10-12", "page_count": 180}
{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470}
{"name": "Neuromancer", "author": "William Gibson", "release_date": "1984-07-01", "page_count": 271}
{"name": "The Handmaid's Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
{"name": "Starship Troopers", "author": "Robert A. Heinlein", "release_date": "1959-12-01", "page_count": 335}
{"name": "The Left Hand of Darkness", "author": "Ursula K. Le Guin", "release_date": "1969-06-01", "page_count": 304}
{"name": "The Moon is a Harsh Mistress", "author": "Robert A. Heinlein", "release_date": "1966-04-01", "page_count": 288}
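The newline-delimited body of the console request above can be produced from ordinary Python dictionaries. The following is a minimal sketch using only the standard library; the helper name to_ndjson is hypothetical, and only the first two books are included for brevity.

```python
import json

books = [
    {"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561},
    {"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482},
]


def to_ndjson(records):
    """Serialize each record as one JSON object per line, newline-terminated."""
    return "".join(json.dumps(record) + "\n" for record in records)


body = to_ndjson(books)
print(body, end="")
```

Each record must stay on a single line, which json.dumps guarantees as long as no indent argument is passed.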

If the request does not encounter errors, you receive the following result:

{
  "num_lines_analyzed" : 24, 
  "num_messages_analyzed" : 24, 
  "sample_start" : "{\"name\": \"Leviathan Wakes\", \"author\": \"James S.A. Corey\", \"release_date\": \"2011-06-02\", \"page_count\": 561}\n{\"name\": \"Hyperion\", \"author\": \"Dan Simmons\", \"release_date\": \"1989-05-26\", \"page_count\": 482}\n", 
  "charset" : "UTF-8", 
  "has_byte_order_marker" : false, 
  "format" : "ndjson", 
  "ecs_compatibility" : "disabled", 
  "timestamp_field" : "release_date", 
  "joda_timestamp_formats" : [ 
    "ISO8601"
  ],
  "java_timestamp_formats" : [ 
    "ISO8601"
  ],
  "need_client_timezone" : true, 
  "mappings" : { 
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "author" : {
        "type" : "keyword"
      },
      "name" : {
        "type" : "keyword"
      },
      "page_count" : {
        "type" : "long"
      },
      "release_date" : {
        "type" : "date",
        "format" : "iso8601"
      }
    }
  },
  "ingest_pipeline" : {
    "description" : "Ingest pipeline created by text structure finder",
    "processors" : [
      {
        "date" : {
          "field" : "release_date",
          "timezone" : "{{ event.timezone }}",
          "formats" : [
            "ISO8601"
          ]
        }
      }
    ]
  },
  "field_stats" : { 
    "author" : {
      "count" : 24,
      "cardinality" : 20,
      "top_hits" : [
        {
          "value" : "Frank Herbert",
          "count" : 4
        },
        {
          "value" : "Robert A. Heinlein",
          "count" : 2
        },
        {
          "value" : "Alastair Reynolds",
          "count" : 1
        },
        {
          "value" : "Aldous Huxley",
          "count" : 1
        },
        {
          "value" : "Dan Simmons",
          "count" : 1
        },
        {
          "value" : "Douglas Adams",
          "count" : 1
        },
        {
          "value" : "George Orwell",
          "count" : 1
        },
        {
          "value" : "Iain M. Banks",
          "count" : 1
        },
        {
          "value" : "Isaac Asimov",
          "count" : 1
        },
        {
          "value" : "James S.A. Corey",
          "count" : 1
        }
      ]
    },
    "name" : {
      "count" : 24,
      "cardinality" : 24,
      "top_hits" : [
        {
          "value" : "1984",
          "count" : 1
        },
        {
          "value" : "A Fire Upon the Deep",
          "count" : 1
        },
        {
          "value" : "Brave New World",
          "count" : 1
        },
        {
          "value" : "Children of Dune",
          "count" : 1
        },
        {
          "value" : "Consider Phlebas",
          "count" : 1
        },
        {
          "value" : "Dune",
          "count" : 1
        },
        {
          "value" : "Dune Messiah",
          "count" : 1
        },
        {
          "value" : "Ender's Game",
          "count" : 1
        },
        {
          "value" : "Fahrenheit 451",
          "count" : 1
        },
        {
          "value" : "Foundation",
          "count" : 1
        }
      ]
    },
    "page_count" : {
      "count" : 24,
      "cardinality" : 24,
      "min_value" : 180,
      "max_value" : 768,
      "mean_value" : 387.0833333333333,
      "median_value" : 329.5,
      "top_hits" : [
        {
          "value" : 180,
          "count" : 1
        },
        {
          "value" : 208,
          "count" : 1
        },
        {
          "value" : 224,
          "count" : 1
        },
        {
          "value" : 227,
          "count" : 1
        },
        {
          "value" : 268,
          "count" : 1
        },
        {
          "value" : 271,
          "count" : 1
        },
        {
          "value" : 275,
          "count" : 1
        },
        {
          "value" : 288,
          "count" : 1
        },
        {
          "value" : 304,
          "count" : 1
        },
        {
          "value" : 311,
          "count" : 1
        }
      ]
    },
    "release_date" : {
      "count" : 24,
      "cardinality" : 20,
      "earliest" : "1932-06-01",
      "latest" : "2011-06-02",
      "top_hits" : [
        {
          "value" : "1985-06-01",
          "count" : 3
        },
        {
          "value" : "1969-06-01",
          "count" : 2
        },
        {
          "value" : "1992-06-01",
          "count" : 2
        },
        {
          "value" : "1932-06-01",
          "count" : 1
        },
        {
          "value" : "1951-06-01",
          "count" : 1
        },
        {
          "value" : "1953-10-15",
          "count" : 1
        },
        {
          "value" : "1959-12-01",
          "count" : 1
        },
        {
          "value" : "1965-06-01",
          "count" : 1
        },
        {
          "value" : "1966-04-01",
          "count" : 1
        },
        {
          "value" : "1969-10-15",
          "count" : 1
        }
      ]
    }
  }
}

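num_lines_analyzed indicates how many lines of the text were analyzed.

num_messages_analyzed indicates how many distinct messages the lines contained. For NDJSON, this value is the same as num_lines_analyzed. For other text formats, messages can span several lines.

sample_start reproduces the first two messages in the text verbatim. This may help diagnose parse errors or accidental uploads of the wrong text.

charset indicates the character encoding used to parse the text.

For UTF character encodings, has_byte_order_marker indicates whether the text begins with a byte order marker.

format is one of ndjson, xml, delimited, or semi_structured_text.

ecs_compatibility is either disabled or v1; it defaults to disabled.

timestamp_field names the field considered most likely to be the primary timestamp of each document.

joda_timestamp_formats are used to tell Logstash how to parse timestamps.

java_timestamp_formats are the Java time formats recognized in the time fields. Elasticsearch mappings and ingest pipelines use this format.

If a timestamp format is detected that does not include a time zone, need_client_timezone is true. The server that parses the text must therefore be told the correct time zone by the client.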

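mappings contains some suitable mappings for an index into which the data could be ingested. In this case, the release_date field has been given the date type with the iso8601 format, and an additional @timestamp field of type date has been added.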

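field_stats contains the most common values of each field, plus basic numeric statistics for the numeric page_count field. This information may provide clues that the data needs to be cleaned or transformed prior to use by other Elastic Stack functionality.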
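Once the response has been parsed into a dictionary, the parts described above can be picked out programmatically. The sketch below works on an abridged copy of the example response; the helper names summarize_structure and index_body are hypothetical, and wrapping the suggested mappings in a create-index request body is one possible use rather than something this API does itself.

```python
def summarize_structure(response):
    """Pull the fields discussed above out of a find_structure response dict."""
    properties = response.get("mappings", {}).get("properties", {})
    return {
        "format": response.get("format"),
        "timestamp_field": response.get("timestamp_field"),
        # True means the detected timestamps carry no time zone of their own.
        "need_client_timezone": response.get("need_client_timezone", False),
        "field_types": {name: spec.get("type") for name, spec in properties.items()},
    }


def index_body(response):
    """Build a create-index request body that reuses the suggested mappings."""
    return {"mappings": response["mappings"]}


# Abridged copy of the example response.
sample_response = {
    "format": "ndjson",
    "timestamp_field": "release_date",
    "need_client_timezone": True,
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "author": {"type": "keyword"},
            "name": {"type": "keyword"},
            "page_count": {"type": "long"},
            "release_date": {"type": "date", "format": "iso8601"},
        }
    },
}

summary = summarize_structure(sample_response)
print(summary["field_types"])
```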

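Finding the structure of NYC yellow cab example data

The next example shows how to find the structure of some New York City yellow cab trip data. The first curl command downloads the data, the first 20000 lines of which are then piped into the find_structure endpoint. The lines_to_sample query parameter of the endpoint is set to 20000 to match what is specified in the head command.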

curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head -20000 | curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&lines_to_sample=20000" -T -

The Content-Type: application/json header must be set even though in this case the data is not JSON. (Alternatively, the Content-Type can be set to any other type supported by Elasticsearch, but it must be set.)
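The same request can be prepared from Python with the standard library. The sketch below is an illustration under assumptions: the helper names first_lines and build_request are hypothetical, the cluster address is assumed to be http://localhost:9200, and a tiny in-memory CSV stands in for the downloaded taxi data. The request is only constructed, not sent.

```python
import io
import itertools
import urllib.request


def first_lines(stream, count):
    """Return the first `count` lines of a binary stream, like `head -count`."""
    return b"".join(itertools.islice(stream, count))


def build_request(sample, lines_to_sample, base_url="http://localhost:9200"):
    """Prepare (but do not send) the POST request for a raw text sample."""
    url = f"{base_url}/_text_structure/find_structure?lines_to_sample={lines_to_sample}"
    return urllib.request.Request(
        url,
        data=sample,
        method="POST",
        # Required even though the body is CSV rather than JSON.
        headers={"Content-Type": "application/json"},
    )


# Stand-in for the downloaded CSV; a real run would read the taxi data instead.
csv_stream = io.BytesIO(b"VendorID,passenger_count\n1,1\n2,3\n1,2\n")
sample = first_lines(csv_stream, 3)
request = build_request(sample, lines_to_sample=3)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running cluster.
```

Keeping lines_to_sample equal to the number of lines actually uploaded mirrors the pairing of head -20000 and lines_to_sample=20000 in the curl command.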

If the request does not encounter errors, you receive the following result:

{
  "num_lines_analyzed" : 20000,
  "num_messages_analyzed" : 19998, 
  "sample_start" : "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount\n\n1,2018-06-01 00:15:40,2018-06-01 00:16:46,1,.00,1,N,145,145,2,3,0.5,0.5,0,0,0.3,4.3\n",
  "charset" : "UTF-8",
  "has_byte_order_marker" : false,
  "format" : "delimited", 
  "multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
  "exclude_lines_pattern" : "^\"?VendorID\"?,\"?tpep_pickup_datetime\"?,\"?tpep_dropoff_datetime\"?,\"?passenger_count\"?,\"?trip_distance\"?,\"?RatecodeID\"?,\"?store_and_fwd_flag\"?,\"?PULocationID\"?,\"?DOLocationID\"?,\"?payment_type\"?,\"?fare_amount\"?,\"?extra\"?,\"?mta_tax\"?,\"?tip_amount\"?,\"?tolls_amount\"?,\"?improvement_surcharge\"?,\"?total_amount\"?",
  "column_names" : [ 
    "VendorID",
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "RatecodeID",
    "store_and_fwd_flag",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount"
  ],
  "has_header_row" : true, 
  "delimiter" : ",", 
  "quote" : "\"", 
  "timestamp_field" : "tpep_pickup_datetime", 
  "joda_timestamp_formats" : [ 
    "YYYY-MM-dd HH:mm:ss"
  ],
  "java_timestamp_formats" : [ 
    "yyyy-MM-dd HH:mm:ss"
  ],
  "need_client_timezone" : true, 
  "mappings" : {
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "DOLocationID" : {
        "type" : "long"
      },
      "PULocationID" : {
        "type" : "long"
      },
      "RatecodeID" : {
        "type" : "long"
      },
      "VendorID" : {
        "type" : "long"
      },
      "extra" : {
        "type" : "double"
      },
      "fare_amount" : {
        "type" : "double"
      },
      "improvement_surcharge" : {
        "type" : "double"
      },
      "mta_tax" : {
        "type" : "double"
      },
      "passenger_count" : {
        "type" : "long"
      },
      "payment_type" : {
        "type" : "long"
      },
      "store_and_fwd_flag" : {
        "type" : "keyword"
      },
      "tip_amount" : {
        "type" : "double"
      },
      "tolls_amount" : {
        "type" : "double"
      },
      "total_amount" : {
        "type" : "double"
      },
      "tpep_dropoff_datetime" : {
        "type" : "date",
        "format" : "yyyy-MM-dd HH:mm:ss"
      },
      "tpep_pickup_datetime" : {
        "type" : "date",
        "format" : "yyyy-MM-dd HH:mm:ss"
      },
      "trip_distance" : {
        "type" : "double"
      }
    }
  },
  "ingest_pipeline" : {
    "description" : "Ingest pipeline created by text structure finder",
    "processors" : [
      {
        "csv" : {
          "field" : "message",
          "target_fields" : [
            "VendorID",
            "tpep_pickup_datetime",
            "tpep_dropoff_datetime",
            "passenger_count",
            "trip_distance",
            "RatecodeID",
            "store_and_fwd_flag",
            "PULocationID",
            "DOLocationID",
            "payment_type",
            "fare_amount",
            "extra",
            "mta_tax",
            "tip_amount",
            "tolls_amount",
            "improvement_surcharge",
            "total_amount"
          ]
        }
      },
      {
        "date" : {
          "field" : "tpep_pickup_datetime",
          "timezone" : "{{ event.timezone }}",
          "formats" : [
            "yyyy-MM-dd HH:mm:ss"
          ]
        }
      },
      {
        "convert" : {
          "field" : "DOLocationID",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "PULocationID",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "RatecodeID",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "VendorID",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "extra",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "fare_amount",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "improvement_surcharge",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "mta_tax",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "passenger_count",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "payment_type",
          "type" : "long"
        }
      },
      {
        "convert" : {
          "field" : "tip_amount",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "tolls_amount",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "total_amount",
          "type" : "double"
        }
      },
      {
        "convert" : {
          "field" : "trip_distance",
          "type" : "double"
        }
      },
      {
        "remove" : {
          "field" : "message"
        }
      }
    ]
  },
  "field_stats" : {
    "DOLocationID" : {
      "count" : 19998,
      "cardinality" : 240,
      "min_value" : 1,
      "max_value" : 265,
      "mean_value" : 150.26532653265312,
      "median_value" : 148,
      "top_hits" : [
        {
          "value" : 79,
          "count" : 760
        },
        {
          "value" : 48,
          "count" : 683
        },
        {
          "value" : 68,
          "count" : 529
        },
        {
          "value" : 170,
          "count" : 506
        },
        {
          "value" : 107,
          "count" : 468
        },
        {
          "value" : 249,
          "count" : 457
        },
        {
          "value" : 230,
          "count" : 441
        },
        {
          "value" : 186,
          "count" : 432
        },
        {
          "value" : 141,
          "count" : 409
        },
        {
          "value" : 263,
          "count" : 386
        }
      ]
    },
    "PULocationID" : {
      "count" : 19998,
      "cardinality" : 154,
      "min_value" : 1,
      "max_value" : 265,
      "mean_value" : 153.4042404240424,
      "median_value" : 148,
      "top_hits" : [
        {
          "value" : 79,
          "count" : 1067
        },
        {
          "value" : 230,
          "count" : 949
        },
        {
          "value" : 148,
          "count" : 940
        },
        {
          "value" : 132,
          "count" : 897
        },
        {
          "value" : 48,
          "count" : 853
        },
        {
          "value" : 161,
          "count" : 820
        },
        {
          "value" : 234,
          "count" : 750
        },
        {
          "value" : 249,
          "count" : 722
        },
        {
          "value" : 164,
          "count" : 663
        },
        {
          "value" : 114,
          "count" : 646
        }
      ]
    },
    "RatecodeID" : {
      "count" : 19998,
      "cardinality" : 5,
      "min_value" : 1,
      "max_value" : 5,
      "mean_value" : 1.0656565656565653,
      "median_value" : 1,
      "top_hits" : [
        {
          "value" : 1,
          "count" : 19311
        },
        {
          "value" : 2,
          "count" : 468
        },
        {
          "value" : 5,
          "count" : 195
        },
        {
          "value" : 4,
          "count" : 17
        },
        {
          "value" : 3,
          "count" : 7
        }
      ]
    },
    "VendorID" : {
      "count" : 19998,
      "cardinality" : 2,
      "min_value" : 1,
      "max_value" : 2,
      "mean_value" : 1.59005900590059,
      "median_value" : 2,
      "top_hits" : [
        {
          "value" : 2,
          "count" : 11800
        },
        {
          "value" : 1,
          "count" : 8198
        }
      ]
    },
    "extra" : {
      "count" : 19998,
      "cardinality" : 3,
      "min_value" : -0.5,
      "max_value" : 0.5,
      "mean_value" : 0.4815981598159816,
      "median_value" : 0.5,
      "top_hits" : [
        {
          "value" : 0.5,
          "count" : 19281
        },
        {
          "value" : 0,
          "count" : 698
        },
        {
          "value" : -0.5,
          "count" : 19
        }
      ]
    },
    "fare_amount" : {
      "count" : 19998,
      "cardinality" : 208,
      "min_value" : -100,
      "max_value" : 300,
      "mean_value" : 13.937719771977209,
      "median_value" : 9.5,
      "top_hits" : [
        {
          "value" : 6,
          "count" : 1004
        },
        {
          "value" : 6.5,
          "count" : 935
        },
        {
          "value" : 5.5,
          "count" : 909
        },
        {
          "value" : 7,
          "count" : 903
        },
        {
          "value" : 5,
          "count" : 889
        },
        {
          "value" : 7.5,
          "count" : 854
        },
        {
          "value" : 4.5,
          "count" : 802
        },
        {
          "value" : 8.5,
          "count" : 790
        },
        {
          "value" : 8,
          "count" : 789
        },
        {
          "value" : 9,
          "count" : 711
        }
      ]
    },
    "improvement_surcharge" : {
      "count" : 19998,
      "cardinality" : 3,
      "min_value" : -0.3,
      "max_value" : 0.3,
      "mean_value" : 0.29915991599159913,
      "median_value" : 0.3,
      "top_hits" : [
        {
          "value" : 0.3,
          "count" : 19964
        },
        {
          "value" : -0.3,
          "count" : 22
        },
        {
          "value" : 0,
          "count" : 12
        }
      ]
    },
    "mta_tax" : {
      "count" : 19998,
      "cardinality" : 3,
      "min_value" : -0.5,
      "max_value" : 0.5,
      "mean_value" : 0.4962246224622462,
      "median_value" : 0.5,
      "top_hits" : [
        {
          "value" : 0.5,
          "count" : 19868
        },
        {
          "value" : 0,
          "count" : 109
        },
        {
          "value" : -0.5,
          "count" : 21
        }
      ]
    },
    "passenger_count" : {
      "count" : 19998,
      "cardinality" : 7,
      "min_value" : 0,
      "max_value" : 6,
      "mean_value" : 1.6201620162016201,
      "median_value" : 1,
      "top_hits" : [
        {
          "value" : 1,
          "count" : 14219
        },
        {
          "value" : 2,
          "count" : 2886
        },
        {
          "value" : 5,
          "count" : 1047
        },
        {
          "value" : 3,
          "count" : 804
        },
        {
          "value" : 6,
          "count" : 523
        },
        {
          "value" : 4,
          "count" : 406
        },
        {
          "value" : 0,
          "count" : 113
        }
      ]
    },
    "payment_type" : {
      "count" : 19998,
      "cardinality" : 4,
      "min_value" : 1,
      "max_value" : 4,
      "mean_value" : 1.315631563156316,
      "median_value" : 1,
      "top_hits" : [
        {
          "value" : 1,
          "count" : 13936
        },
        {
          "value" : 2,
          "count" : 5857
        },
        {
          "value" : 3,
          "count" : 160
        },
        {
          "value" : 4,
          "count" : 45
        }
      ]
    },
    "store_and_fwd_flag" : {
      "count" : 19998,
      "cardinality" : 2,
      "top_hits" : [
        {
          "value" : "N",
          "count" : 19910
        },
        {
          "value" : "Y",
          "count" : 88
        }
      ]
    },
    "tip_amount" : {
      "count" : 19998,
      "cardinality" : 717,
      "min_value" : 0,
      "max_value" : 128,
      "mean_value" : 2.010959095909593,
      "median_value" : 1.45,
      "top_hits" : [
        {
          "value" : 0,
          "count" : 6917
        },
        {
          "value" : 1,
          "count" : 1178
        },
        {
          "value" : 2,
          "count" : 624
        },
        {
          "value" : 3,
          "count" : 248
        },
        {
          "value" : 1.56,
          "count" : 206
        },
        {
          "value" : 1.46,
          "count" : 205
        },
        {
          "value" : 1.76,
          "count" : 196
        },
        {
          "value" : 1.45,
          "count" : 195
        },
        {
          "value" : 1.36,
          "count" : 191
        },
        {
          "value" : 1.5,
          "count" : 187
        }
      ]
    },
    "tolls_amount" : {
      "count" : 19998,
      "cardinality" : 26,
      "min_value" : 0,
      "max_value" : 35,
      "mean_value" : 0.2729697969796978,
      "median_value" : 0,
      "top_hits" : [
        {
          "value" : 0,
          "count" : 19107
        },
        {
          "value" : 5.76,
          "count" : 791
        },
        {
          "value" : 10.5,
          "count" : 36
        },
        {
          "value" : 2.64,
          "count" : 21
        },
        {
          "value" : 11.52,
          "count" : 8
        },
        {
          "value" : 5.54,
          "count" : 4
        },
        {
          "value" : 8.5,
          "count" : 4
        },
        {
          "value" : 17.28,
          "count" : 4
        },
        {
          "value" : 2,
          "count" : 2
        },
        {
          "value" : 2.16,
          "count" : 2
        }
      ]
    },
    "total_amount" : {
      "count" : 19998,
      "cardinality" : 1267,
      "min_value" : -100.3,
      "max_value" : 389.12,
      "mean_value" : 17.499898989898995,
      "median_value" : 12.35,
      "top_hits" : [
        {
          "value" : 7.3,
          "count" : 478
        },
        {
          "value" : 8.3,
          "count" : 443
        },
        {
          "value" : 8.8,
          "count" : 420
        },
        {
          "value" : 6.8,
          "count" : 406
        },
        {
          "value" : 7.8,
          "count" : 405
        },
        {
          "value" : 6.3,
          "count" : 371
        },
        {
          "value" : 9.8,
          "count" : 368
        },
        {
          "value" : 5.8,
          "count" : 362
        },
        {
          "value" : 9.3,
          "count" : 332
        },
        {
          "value" : 10.3,
          "count" : 332
        }
      ]
    },
    "tpep_dropoff_datetime" : {
      "count" : 19998,
      "cardinality" : 9066,
      "earliest" : "2018-05-31 06:18:15",
      "latest" : "2018-06-02 02:25:44",
      "top_hits" : [
        {
          "value" : "2018-06-01 01:12:12",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:32:15",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:44:27",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:46:42",
          "count" : 9
        },
        {
          "value" : "2018-06-01 01:03:22",
          "count" : 9
        },
        {
          "value" : "2018-06-01 01:05:13",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:11:20",
          "count" : 8
        },
        {
          "value" : "2018-06-01 00:16:03",
          "count" : 8
        },
        {
          "value" : "2018-06-01 00:19:47",
          "count" : 8
        },
        {
          "value" : "2018-06-01 00:25:17",
          "count" : 8
        }
      ]
    },
    "tpep_pickup_datetime" : {
      "count" : 19998,
      "cardinality" : 8760,
      "earliest" : "2018-05-31 06:08:31",
      "latest" : "2018-06-02 01:21:21",
      "top_hits" : [
        {
          "value" : "2018-06-01 00:01:23",
          "count" : 12
        },
        {
          "value" : "2018-06-01 00:04:31",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:05:38",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:09:50",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:12:01",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:14:17",
          "count" : 10
        },
        {
          "value" : "2018-06-01 00:00:34",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:00:40",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:02:53",
          "count" : 9
        },
        {
          "value" : "2018-06-01 00:05:40",
          "count" : 9
        }
      ]
    },
    "trip_distance" : {
      "count" : 19998,
      "cardinality" : 1687,
      "min_value" : 0,
      "max_value" : 64.63,
      "mean_value" : 3.6521062106210715,
      "median_value" : 2.16,
      "top_hits" : [
        {
          "value" : 0.9,
          "count" : 335
        },
        {
          "value" : 0.8,
          "count" : 320
        },
        {
          "value" : 1.1,
          "count" : 316
        },
        {
          "value" : 0.7,
          "count" : 304
        },
        {
          "value" : 1.2,
          "count" : 303
        },
        {
          "value" : 1,
          "count" : 296
        },
        {
          "value" : 1.3,
          "count" : 280
        },
        {
          "value" : 1.5,
          "count" : 268
        },
        {
          "value" : 1.6,
          "count" : 268
        },
        {
          "value" : 0.6,
          "count" : 256
        }
      ]
    }
  }
}

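num_messages_analyzed is 2 lower than num_lines_analyzed because only data records count as messages. The first line contains the column names, and in this sample the second line is blank.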

Unlike the first example, in this case the format has been identified as delimited.

Because the format is delimited, the column_names field in the output lists the column names in the order they appear in the sample.

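has_header_row indicates that for this sample the column names are in the first row of the sample. (If they were not, it would be a good idea to specify them in the column_names query parameter.)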

The delimiter for this sample is a comma, because it is CSV-formatted text.

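The quote character is the default double quote. (The structure finder does not attempt to deduce any other quote character, so if you have delimited text that is quoted with some other character you must specify it using the quote query parameter.)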

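The timestamp_field has been chosen to be tpep_pickup_datetime. tpep_dropoff_datetime would work just as well, but tpep_pickup_datetime was chosen because it comes first in the column order. If you prefer tpep_dropoff_datetime, force it to be chosen using the timestamp_field query parameter.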
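As a minimal sketch of forcing that choice, the Python snippet below builds a request URL with the timestamp_field query parameter set. The host localhost:9200 follows the curl examples on this page, the lines_to_sample value of 20000 is an assumption about the earlier request, and the helper name build_find_structure_url is hypothetical.

```python
from urllib.parse import urlencode


def build_find_structure_url(host, **params):
    """Return a find_structure URL carrying the given query parameters.

    Hypothetical helper for illustration; it only assembles a string.
    """
    query = urlencode(params)
    return f"http://{host}/_text_structure/find_structure?{query}"


# Assumption: same host as the curl examples, 20000 sampled lines.
url = build_find_structure_url(
    "localhost:9200",
    lines_to_sample=20000,
    timestamp_field="tpep_dropoff_datetime",
)
print(url)
```

The text to analyze would still be sent as the POST body, as in the curl commands.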

joda_timestamp_formats are used to tell Logstash how to parse timestamps.

java_timestamp_formats are the Java time formats recognized in the time fields. Elasticsearch mappings and ingest pipelines use this format.

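The timestamp format in this sample does not specify a timezone, so to convert the timestamps accurately to UTC for storage in Elasticsearch you must supply the timezone they relate to. need_client_timezone will be false for timestamp formats that include the timezone.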
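To illustrate why the timezone matters, here is a small Python sketch that converts one of the zone-less timestamps from field_stats to UTC. The UTC-4 offset is an assumption chosen for illustration (it would correspond to New York daylight time in June 2018); the sample itself does not say which timezone applies, and the helper name to_utc is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the wall-clock times in the sample are 4 hours behind UTC.
client_tz = timezone(timedelta(hours=-4))


def to_utc(local_text, tz):
    """Parse 'YYYY-MM-DD HH:MM:SS' as local time in tz and convert it to UTC."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


utc_value = to_utc("2018-06-01 00:01:23", client_tz)
print(utc_value.isoformat())
```

With a different assumed offset the same input string maps to a different UTC instant, which is why the client timezone has to be supplied.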

Setting the timeout parameter

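If you try to analyze a lot of data, the analysis will take a long time. If you want to limit the amount of processing your Elasticsearch cluster performs for a request, use the timeout query parameter. When the timeout expires, the analysis is aborted and an error is returned. For example, you can replace the 20000 lines in the previous example with 200000 and set a 1 second timeout on the analysis: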

curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head -200000 | curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&lines_to_sample=200000&timeout=1s" -T -

Unless you are using an incredibly fast computer, you will receive a timeout error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "timeout_exception",
        "reason" : "Aborting structure analysis during [delimited record parsing] as it has taken longer than the timeout of [1s]"
      }
    ],
    "type" : "timeout_exception",
    "reason" : "Aborting structure analysis during [delimited record parsing] as it has taken longer than the timeout of [1s]"
  },
  "status" : 500
}

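If you try the example above yourself, you will notice that the overall running time of the curl commands is considerably longer than 1 second. This is because it takes a while to download 200000 lines of CSV from the internet, and the timeout is measured from the time this endpoint starts to process the data.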
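A client that sets timeout may want to tell a timeout apart from other failures. The following Python sketch inspects an error body shaped like the one above; it relies only on the error.type field shown in that response, and the function name is_timeout_error is hypothetical.

```python
import json


def is_timeout_error(response_body):
    """Return True if a JSON response body reports a timeout_exception."""
    payload = json.loads(response_body)
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    return error.get("type") == "timeout_exception"


# Error body abbreviated from the response shown above.
reason = (
    "Aborting structure analysis during [delimited record parsing] "
    "as it has taken longer than the timeout of [1s]"
)
sample = json.dumps({
    "error": {"type": "timeout_exception", "reason": reason},
    "status": 500,
})
print(is_timeout_error(sample))
```

On a timeout, one option is to retry with a smaller lines_to_sample value or a longer timeout.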

Analyzing Elasticsearch log files

This is an example of analyzing an Elasticsearch log file:

curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&ecs_compatibility=disabled" -T "$ES_HOME/logs/elasticsearch.log"

If the request does not encounter errors, the result will look something like this:

{
  "num_lines_analyzed" : 53,
  "num_messages_analyzed" : 53,
  "sample_start" : "[2018-09-27T14:39:28,518][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [165.4gb], net total_space [464.7gb], types [hfs]\n[2018-09-27T14:39:28,521][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]\n",
  "charset" : "UTF-8",
  "has_byte_order_marker" : false,
  "format" : "semi_structured_text", 
  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", 
  "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", 
  "ecs_compatibility" : "disabled", 
  "timestamp_field" : "timestamp",
  "joda_timestamp_formats" : [
    "ISO8601"
  ],
  "java_timestamp_formats" : [
    "ISO8601"
  ],
  "need_client_timezone" : true,
  "mappings" : {
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "loglevel" : {
        "type" : "keyword"
      },
      "message" : {
        "type" : "text"
      }
    }
  },
  "ingest_pipeline" : {
    "description" : "Ingest pipeline created by text structure finder",
    "processors" : [
      {
        "grok" : {
          "field" : "message",
          "patterns" : [
            "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*"
          ]
        }
      },
      {
        "date" : {
          "field" : "timestamp",
          "timezone" : "{{ event.timezone }}",
          "formats" : [
            "ISO8601"
          ]
        }
      },
      {
        "remove" : {
          "field" : "timestamp"
        }
      }
    ]
  },
  "field_stats" : {
    "loglevel" : {
      "count" : 53,
      "cardinality" : 3,
      "top_hits" : [
        {
          "value" : "INFO",
          "count" : 51
        },
        {
          "value" : "DEBUG",
          "count" : 1
        },
        {
          "value" : "WARN",
          "count" : 1
        }
      ]
    },
    "timestamp" : {
      "count" : 53,
      "cardinality" : 28,
      "earliest" : "2018-09-27T14:39:28,518",
      "latest" : "2018-09-27T14:39:37,012",
      "top_hits" : [
        {
          "value" : "2018-09-27T14:39:29,859",
          "count" : 10
        },
        {
          "value" : "2018-09-27T14:39:29,860",
          "count" : 9
        },
        {
          "value" : "2018-09-27T14:39:29,858",
          "count" : 6
        },
        {
          "value" : "2018-09-27T14:39:28,523",
          "count" : 3
        },
        {
          "value" : "2018-09-27T14:39:34,234",
          "count" : 2
        },
        {
          "value" : "2018-09-27T14:39:28,518",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:28,521",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:28,522",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:29,861",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:32,786",
          "count" : 1
        }
      ]
    }
  }
}

This time the format has been identified as semi_structured_text.

The multiline_start_pattern is set on the basis that the timestamp appears in the first line of each multi-line log message.

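A very simple grok_pattern has been created, which extracts the timestamp and any recognizable fields that appear in every analyzed message. In this case the only field recognized beyond the timestamp was the log level.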

The ECS Grok pattern compatibility mode used is either disabled (the default if not specified in the request) or v1.
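To make the role of multiline_start_pattern concrete, the Python sketch below groups raw lines into messages: a line that matches the pattern starts a new message, and any other line is appended to the previous one. The pattern is copied from the output above (with the JSON escaping removed); the log lines are hypothetical, and Python's re module is used as a stand-in for whatever engine ultimately applies the pattern.

```python
import re

# multiline_start_pattern from the output above, JSON escaping removed.
START = re.compile(r"^\[\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}")


def group_messages(lines):
    """Group lines into messages; a line matching START opens a new message."""
    messages = []
    for line in lines:
        if START.match(line) or not messages:
            messages.append(line)
        else:
            # Continuation line (for example a stack trace): attach to the previous message.
            messages[-1] += "\n" + line
    return messages


# Hypothetical sample lines in the style of an Elasticsearch log.
lines = [
    "[2018-09-27T14:39:28,518][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb]",
    "[2018-09-27T14:39:29,859][WARN ][o.e.n.Node               ] [node-0] example warning",
    "java.lang.IllegalStateException: hypothetical continuation line",
    "[2018-09-27T14:39:29,860][INFO ][o.e.n.Node               ] [node-0] starting ...",
]
messages = group_messages(lines)
print(len(messages))
```

Here four lines yield three messages, because the line without a leading timestamp is folded into the message before it.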

Specifying grok_pattern as a query parameter

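If you recognize more fields than the simple grok_pattern that the structure finder produced unaided, you can resubmit the request with a more advanced grok_pattern as a query parameter, and the structure finder will calculate field_stats for your additional fields.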

In the case of the Elasticsearch log, a more complete Grok pattern is \[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}. You can analyze the same text again, submitting this grok_pattern as a query parameter (appropriately URL escaped):

curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&format=semi_structured_text&grok_pattern=%5C%5B%25%7BTIMESTAMP_ISO8601:timestamp%7D%5C%5D%5C%5B%25%7BLOGLEVEL:loglevel%7D%20*%5C%5D%5C%5B%25%7BJAVACLASS:class%7D%20*%5C%5D%20%5C%5B%25%7BHOSTNAME:node%7D%5C%5D%20%25%7BJAVALOGMESSAGE:message%7D" -T "$ES_HOME/logs/elasticsearch.log"
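The escaped value in the command above does not have to be written by hand. As a sketch, Python's standard urllib.parse.quote can produce it from the raw Grok pattern; passing safe=":*" is a choice made here so that the colon and asterisk stay literal, as they do in the command above.

```python
from urllib.parse import quote, unquote

# Raw Grok pattern from the text above (raw strings keep the backslashes).
grok_pattern = (
    r"\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]"
    r"\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}"
)

# Percent-encode everything except unreserved characters, ':' and '*'.
escaped = quote(grok_pattern, safe=":*")

url = (
    "localhost:9200/_text_structure/find_structure"
    "?pretty&format=semi_structured_text&grok_pattern=" + escaped
)
print(url)

# Decoding the escaped value gives back the original pattern.
assert unquote(escaped) == grok_pattern
```

The printed URL can then be passed to curl in place of the hand-escaped one.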

If the request does not encounter errors, the result will look something like this:

{
  "num_lines_analyzed" : 53,
  "num_messages_analyzed" : 53,
  "sample_start" : "[2018-09-27T14:39:28,518][INFO ][o.e.e.NodeEnvironment    ] [node-0] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [165.4gb], net total_space [464.7gb], types [hfs]\n[2018-09-27T14:39:28,521][INFO ][o.e.e.NodeEnvironment    ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]\n",
  "charset" : "UTF-8",
  "has_byte_order_marker" : false,
  "format" : "semi_structured_text",
  "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}",
  "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", 
  "ecs_compatibility" : "disabled", 
  "timestamp_field" : "timestamp",
  "joda_timestamp_formats" : [
    "ISO8601"
  ],
  "java_timestamp_formats" : [
    "ISO8601"
  ],
  "need_client_timezone" : true,
  "mappings" : {
    "properties" : {
      "@timestamp" : {
        "type" : "date"
      },
      "class" : {
        "type" : "keyword"
      },
      "loglevel" : {
        "type" : "keyword"
      },
      "message" : {
        "type" : "text"
      },
      "node" : {
        "type" : "keyword"
      }
    }
  },
  "ingest_pipeline" : {
    "description" : "Ingest pipeline created by text structure finder",
    "processors" : [
      {
        "grok" : {
          "field" : "message",
          "patterns" : [
            "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}"
          ]
        }
      },
      {
        "date" : {
          "field" : "timestamp",
          "timezone" : "{{ event.timezone }}",
          "formats" : [
            "ISO8601"
          ]
        }
      },
      {
        "remove" : {
          "field" : "timestamp"
        }
      }
    ]
  },
  "field_stats" : { 
    "class" : {
      "count" : 53,
      "cardinality" : 14,
      "top_hits" : [
        {
          "value" : "o.e.p.PluginsService",
          "count" : 26
        },
        {
          "value" : "o.e.c.m.MetadataIndexTemplateService",
          "count" : 8
        },
        {
          "value" : "o.e.n.Node",
          "count" : 7
        },
        {
          "value" : "o.e.e.NodeEnvironment",
          "count" : 2
        },
        {
          "value" : "o.e.a.ActionModule",
          "count" : 1
        },
        {
          "value" : "o.e.c.s.ClusterApplierService",
          "count" : 1
        },
        {
          "value" : "o.e.c.s.MasterService",
          "count" : 1
        },
        {
          "value" : "o.e.d.DiscoveryModule",
          "count" : 1
        },
        {
          "value" : "o.e.g.GatewayService",
          "count" : 1
        },
        {
          "value" : "o.e.l.LicenseService",
          "count" : 1
        }
      ]
    },
    "loglevel" : {
      "count" : 53,
      "cardinality" : 3,
      "top_hits" : [
        {
          "value" : "INFO",
          "count" : 51
        },
        {
          "value" : "DEBUG",
          "count" : 1
        },
        {
          "value" : "WARN",
          "count" : 1
        }
      ]
    },
    "message" : {
      "count" : 53,
      "cardinality" : 53,
      "top_hits" : [
        {
          "value" : "Using REST wrapper from plugin org.elasticsearch.xpack.security.Security",
          "count" : 1
        },
        {
          "value" : "adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]",
          "count" : 1
        },
        {
          "value" : "adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]",
          "count" : 1
        },
        {
          "value" : "adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]",
          "count" : 1
        },
        {
          "value" : "adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]",
          "count" : 1
        },
        {
          "value" : "adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]",
          "count" : 1
        },
        {
          "value" : "adding template [.triggered_watches] for index patterns [.triggered_watches*]",
          "count" : 1
        },
        {
          "value" : "adding template [.watch-history-9] for index patterns [.watcher-history-9*]",
          "count" : 1
        },
        {
          "value" : "adding template [.watches] for index patterns [.watches*]",
          "count" : 1
        },
        {
          "value" : "starting ...",
          "count" : 1
        }
      ]
    },
    "node" : {
      "count" : 53,
      "cardinality" : 1,
      "top_hits" : [
        {
          "value" : "node-0",
          "count" : 53
        }
      ]
    },
    "timestamp" : {
      "count" : 53,
      "cardinality" : 28,
      "earliest" : "2018-09-27T14:39:28,518",
      "latest" : "2018-09-27T14:39:37,012",
      "top_hits" : [
        {
          "value" : "2018-09-27T14:39:29,859",
          "count" : 10
        },
        {
          "value" : "2018-09-27T14:39:29,860",
          "count" : 9
        },
        {
          "value" : "2018-09-27T14:39:29,858",
          "count" : 6
        },
        {
          "value" : "2018-09-27T14:39:28,523",
          "count" : 3
        },
        {
          "value" : "2018-09-27T14:39:34,234",
          "count" : 2
        },
        {
          "value" : "2018-09-27T14:39:28,518",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:28,521",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:28,522",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:29,861",
          "count" : 1
        },
        {
          "value" : "2018-09-27T14:39:32,786",
          "count" : 1
        }
      ]
    }
  }
}

The grok_pattern in the output is now the overridden pattern supplied in the query parameter.

The ECS Grok pattern compatibility mode used is either disabled (the default if not specified in the request) or v1.

The returned field_stats include entries for the fields from the overridden grok_pattern.
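Once the response has been parsed as JSON, the field_stats object can be summarized with a few lines of code. The Python sketch below reports the most frequent value for each field, using an abbreviated copy of the field_stats above; the function name top_values is hypothetical.

```python
def top_values(field_stats):
    """Map each field name to (most frequent value, its count)."""
    summary = {}
    for name, stats in field_stats.items():
        hits = stats.get("top_hits", [])
        if hits:
            best = max(hits, key=lambda hit: hit["count"])
            summary[name] = (best["value"], best["count"])
    return summary


# Abbreviated from the field_stats in the response above.
field_stats = {
    "loglevel": {
        "count": 53,
        "cardinality": 3,
        "top_hits": [
            {"value": "INFO", "count": 51},
            {"value": "DEBUG", "count": 1},
            {"value": "WARN", "count": 1},
        ],
    },
    "node": {
        "count": 53,
        "cardinality": 1,
        "top_hits": [{"value": "node-0", "count": 53}],
    },
}
summary = top_values(field_stats)
print(summary)
```

A quick summary like this makes it easy to confirm that the additional fields from the overridden pattern, such as node, were actually extracted.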

URL escaping is hard, so if you are working interactively it is best to use the UI!