Find text structure API

Finds the structure of text. The text must contain data that is suitable to be ingested into the Elastic Stack.
Request
POST _text_structure/find_structure
Prerequisites

- If the Elasticsearch security features are enabled, you must have the monitor_text_structure or monitor cluster privilege to use this API. See Security privileges.
Description

This API provides a starting point for ingesting data into Elasticsearch in a format that is suitable for subsequent use with other Elastic Stack functionality.

Unlike other Elasticsearch endpoints, the data that is posted to this endpoint does not need to be UTF-8 encoded and in JSON format. It must, however, be text; binary formats are not currently supported.

The response from the API contains:

- Some messages from the beginning of the text.
- Statistics that reveal the most common values for all fields detected within the text and basic numeric statistics for numeric fields.
- Information about the structure of the text, which is useful when you write ingest configurations to index it or similarly formatted text.
- Appropriate mappings for an Elasticsearch index, which you could use to ingest the text.

All this information can be calculated by the structure finder with no guidance. However, you can optionally override some of the decisions about the text structure by specifying one or more query parameters.

Details of the output can be seen in the examples.

If the structure finder produces unexpected results for some text, specify the explain query parameter. It causes an explanation to appear in the response, which should help in determining why the returned structure was chosen.
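For instance, enabling explain only requires adding the query parameter to the endpoint URL. A minimal sketch of composing such a URL with Python's standard library (the host and port are illustrative assumptions):

```python
from urllib.parse import urlencode

# Endpoint path from the Request section above; host and port are assumptions.
base = "http://localhost:9200/_text_structure/find_structure"

# explain=true adds an "explanation" array to the response that
# describes how the structure finder arrived at its result.
params = {"explain": "true"}

url = f"{base}?{urlencode(params)}"
print(url)  # http://localhost:9200/_text_structure/find_structure?explain=true
```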
Query parameters

charset
(Optional, string) The text's character set. It must be a character set that is supported by the JVM that Elasticsearch uses. For example, UTF-8, UTF-16LE, windows-1252, or EUC-JP. If this parameter is not specified, the structure finder chooses an appropriate character set.

column_names
(Optional, string) If you have set format to delimited, you can specify the column names in a comma-separated list. If this parameter is not specified, the structure finder uses the column names from the header row of the text. If the text does not have a header row, column names are "column1", "column2", "column3", etc.

delimiter
(Optional, string) If you have set format to delimited, you can specify the character used to delimit the values in each row. Only a single character is supported; the delimiter cannot have multiple characters. By default, the API considers the following possibilities: comma, tab, semi-colon, and pipe (|). In this default scenario, all rows must have the same number of fields for the delimited format to be detected. If you specify a delimiter, up to 10% of the rows can have a different number of columns than the first row.

explain
(Optional, Boolean) If true, the response includes a field named explanation, which is an array of strings that indicate how the structure finder produced its result. The default value is false.
format
(Optional, string) The high level structure of the text. Valid values are ndjson, xml, delimited, and semi_structured_text. By default, the API chooses the format. In this default scenario, all rows must have the same number of fields for a delimited format to be detected. However, if the format is set to delimited and the delimiter is not set, the API tolerates up to 5% of rows that have a different number of columns than the first row.

grok_pattern
(Optional, string) If you have set format to semi_structured_text, you can specify a Grok pattern that is used to extract fields from every message in the text. The name of the timestamp field in the Grok pattern must match what is specified in the timestamp_field parameter. If that parameter is not specified, the name of the timestamp field in the Grok pattern must match "timestamp". If grok_pattern is not specified, the structure finder creates a Grok pattern.

ecs_compatibility
(Optional, string) The compatibility mode for ECS-compatible Grok patterns. Use this parameter to specify whether to use ECS Grok patterns instead of legacy ones when the structure finder creates a Grok pattern. Valid values are disabled and v1. The default value is disabled. This setting primarily has an impact when a whole message Grok pattern such as %{CATALINALOG} matches the input. If the structure finder identifies a common structure but has no idea of the meaning, then generic field names such as path, ipaddress, field1, and field2 are used in the grok_pattern output, with the intention that a user who knows the meanings will rename these fields before using them.
has_header_row
(Optional, Boolean) If you have set format to delimited, you can use this parameter to indicate whether the column names are in the first row of the text. If this parameter is not specified, the structure finder guesses based on the similarity of the first row of the text to other rows.

line_merge_size_limit
(Optional, unsigned integer) The maximum number of characters in a message when lines are merged to form messages while analyzing semi-structured text. The default value is 10000. If you have extremely long messages you may need to increase this, but be aware that this may lead to very long processing times if the way to group lines into messages is misdetected.

lines_to_sample
(Optional, unsigned integer) The number of lines to include in the structural analysis, starting from the beginning of the text. The minimum is 2; the default value is 1000. If the value of this parameter is greater than the number of lines in the text, the analysis proceeds (as long as there are at least two lines in the text) for all of the lines. The number of lines and the variation of the lines affect the speed of the analysis. For example, if you upload text where the first 1000 lines are all variations on the same message, the analysis will find more commonality than would be seen with a bigger sample. If possible, however, it is more efficient to upload sample text with more variety in the first 1000 lines than to request analysis of 100000 lines to achieve some variety.

quote
(Optional, string) If you have set format to delimited, you can specify the character used to quote the values in each row if they contain newlines or the delimiter character. Only a single character is supported. If this parameter is not specified, the default value is a double quote ("). If your delimited text format does not use quoting, a workaround is to set this parameter to a character that does not appear anywhere in the sample.

should_trim_fields
(Optional, Boolean) If you have set format to delimited, you can specify whether values between delimiters should have whitespace trimmed from them. If this parameter is not specified and the delimiter is the pipe character (|), the default value is true. Otherwise, the default value is false.

timeout
(Optional, time units) Sets the maximum amount of time that the structure analysis may take. If the analysis is still running when the timeout expires, it is stopped. The default value is 25 seconds.
timestamp_field
(Optional, string) The name of the field that contains the primary timestamp of each record in the text. In particular, if the text were ingested into an index, this is the field that would be used to populate the @timestamp field.

If the format is semi_structured_text, this field must match the name of the appropriate extraction in the grok_pattern. Therefore, for semi-structured text, it is best not to specify this parameter unless grok_pattern is also specified. For structured text, if you specify this parameter, the field must exist within the text.

If this parameter is not specified, the structure finder decides which field (if any) is the primary timestamp field. For structured text, it is not compulsory to have a timestamp in the text.
timestamp_format
(Optional, string) The Java time format of the timestamp field in the text.

Only a subset of Java time format letter groups are supported:

- a
- d
- dd
- EEE
- EEEE
- H
- HH
- h
- M
- MM
- MMM
- MMMM
- mm
- ss
- XX
- XXX
- yy
- yyyy
- zzz

Additionally, S letter groups (fractional seconds) of length one to nine are supported providing they occur after ss and are separated from the ss by a period (.), comma (,), or colon (:). Spacing and punctuation are also permitted, with the exception of a question mark (?), newline and carriage return, and literal text enclosed in single quotes. For example, MM/dd HH.mm.ss,SSSSSS 'in' yyyy is a valid override format.

One valuable use case for this parameter is when the format is semi-structured text, there are multiple timestamp formats in the text, and you know which format corresponds to the primary timestamp, but you do not want to specify the full grok_pattern. Another is when the timestamp format is one that the structure finder does not consider by default. If this parameter is not specified, the structure finder chooses the best format from a built-in set.

If the special value null is specified, the structure finder will not look for a primary timestamp in the text. When the format is semi-structured text, this will result in the structure finder treating the text as single-line messages.

The following table provides the appropriate timeformat values for some example timestamps:

Timeformat                   Presentation
yyyy-MM-dd HH:mm:ssZ         2019-04-20 13:15:22+0000
EEE, d MMM yyyy HH:mm:ss Z   Sat, 20 Apr 2019 13:15:22 +0000
dd.MM.yy HH:mm:ss.SSS        20.04.19 13:15:22.285

Refer to the Java date/time format documentation for more information about date and time format syntax.
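As a sanity check on the table above, the example timestamps parse with equivalent strptime patterns; the Java-to-strptime mapping used here is an approximation of my own, not something this API defines:

```python
from datetime import datetime

# Assumed strptime equivalents of the Java formats in the table above:
# yyyy->%Y, yy->%y, MM->%m, MMM->%b, dd/d->%d, EEE->%a, HH->%H, mm->%M,
# ss->%S, SSS->%f, Z->%z.
samples = [
    ("2019-04-20 13:15:22+0000", "%Y-%m-%d %H:%M:%S%z"),             # yyyy-MM-dd HH:mm:ssZ
    ("Sat, 20 Apr 2019 13:15:22 +0000", "%a, %d %b %Y %H:%M:%S %z"), # EEE, d MMM yyyy HH:mm:ss Z
    ("20.04.19 13:15:22.285", "%d.%m.%y %H:%M:%S.%f"),               # dd.MM.yy HH:mm:ss.SSS
]

parsed = [datetime.strptime(value, fmt) for value, fmt in samples]
for dt in parsed:
    print(dt.isoformat())
```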
Request body

The text that you want to analyze. It must contain data that is suitable to be ingested into Elasticsearch. It does not need to be in JSON format and it does not need to be UTF-8 encoded. The size is limited to the Elasticsearch HTTP receive buffer size, which defaults to 100 Mb.
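For newline-delimited JSON input, the body is simply one JSON document per line with no enclosing array. A small sketch of building such a body (the records are illustrative):

```python
import json

# Illustrative records; any list of flat JSON objects works.
records = [
    {"name": "Leviathan Wakes", "author": "James S.A. Corey"},
    {"name": "Hyperion", "author": "Dan Simmons"},
]

# NDJSON: one JSON document per line, no enclosing array.
body = "\n".join(json.dumps(r) for r in records)
print(body)
```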
Examples

Ingesting newline-delimited JSON

Suppose you have newline-delimited JSON text that contains information about some books. You can send the contents to the find_structure endpoint:
POST _text_structure/find_structure
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}
{"name": "Dune Messiah", "author": "Frank Herbert", "release_date": "1969-10-15", "page_count": 331}
{"name": "Children of Dune", "author": "Frank Herbert", "release_date": "1976-04-21", "page_count": 408}
{"name": "God Emperor of Dune", "author": "Frank Herbert", "release_date": "1981-05-28", "page_count": 454}
{"name": "Consider Phlebas", "author": "Iain M. Banks", "release_date": "1987-04-23", "page_count": 471}
{"name": "Pandora's Star", "author": "Peter F. Hamilton", "release_date": "2004-03-02", "page_count": 768}
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{"name": "A Fire Upon the Deep", "author": "Vernor Vinge", "release_date": "1992-06-01", "page_count": 613}
{"name": "Ender's Game", "author": "Orson Scott Card", "release_date": "1985-06-01", "page_count": 324}
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{"name": "Foundation", "author": "Isaac Asimov", "release_date": "1951-06-01", "page_count": 224}
{"name": "The Giver", "author": "Lois Lowry", "release_date": "1993-04-26", "page_count": 208}
{"name": "Slaughterhouse-Five", "author": "Kurt Vonnegut", "release_date": "1969-06-01", "page_count": 275}
{"name": "The Hitchhiker's Guide to the Galaxy", "author": "Douglas Adams", "release_date": "1979-10-12", "page_count": 180}
{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470}
{"name": "Neuromancer", "author": "William Gibson", "release_date": "1984-07-01", "page_count": 271}
{"name": "The Handmaid's Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
{"name": "Starship Troopers", "author": "Robert A. Heinlein", "release_date": "1959-12-01", "page_count": 335}
{"name": "The Left Hand of Darkness", "author": "Ursula K. Le Guin", "release_date": "1969-06-01", "page_count": 304}
{"name": "The Moon is a Harsh Mistress", "author": "Robert A. Heinlein", "release_date": "1966-04-01", "page_count": 288}
If the request does not encounter errors, you receive the following result:
{ "num_lines_analyzed" : 24, "num_messages_analyzed" : 24, "sample_start" : "{\"name\": \"Leviathan Wakes\", \"author\": \"James S.A. Corey\", \"release_date\": \"2011-06-02\", \"page_count\": 561}\n{\"name\": \"Hyperion\", \"author\": \"Dan Simmons\", \"release_date\": \"1989-05-26\", \"page_count\": 482}\n", "charset" : "UTF-8", "has_byte_order_marker" : false, "format" : "ndjson", "ecs_compatibility" : "disabled", "timestamp_field" : "release_date", "joda_timestamp_formats" : [ "ISO8601" ], "java_timestamp_formats" : [ "ISO8601" ], "need_client_timezone" : true, "mappings" : { "properties" : { "@timestamp" : { "type" : "date" }, "author" : { "type" : "keyword" }, "name" : { "type" : "keyword" }, "page_count" : { "type" : "long" }, "release_date" : { "type" : "date", "format" : "iso8601" } } }, "ingest_pipeline" : { "description" : "Ingest pipeline created by text structure finder", "processors" : [ { "date" : { "field" : "release_date", "timezone" : "{{ event.timezone }}", "formats" : [ "ISO8601" ] } } ] }, "field_stats" : { "author" : { "count" : 24, "cardinality" : 20, "top_hits" : [ { "value" : "Frank Herbert", "count" : 4 }, { "value" : "Robert A. Heinlein", "count" : 2 }, { "value" : "Alastair Reynolds", "count" : 1 }, { "value" : "Aldous Huxley", "count" : 1 }, { "value" : "Dan Simmons", "count" : 1 }, { "value" : "Douglas Adams", "count" : 1 }, { "value" : "George Orwell", "count" : 1 }, { "value" : "Iain M. Banks", "count" : 1 }, { "value" : "Isaac Asimov", "count" : 1 }, { "value" : "James S.A. Corey", "count" : 1 } ] },
"name" : { "count" : 24, "cardinality" : 24, "top_hits" : [ { "value" : "1984", "count" : 1 }, { "value" : "A Fire Upon the Deep", "count" : 1 }, { "value" : "Brave New World", "count" : 1 }, { "value" : "Children of Dune", "count" : 1 }, { "value" : "Consider Phlebas", "count" : 1 }, { "value" : "Dune", "count" : 1 }, { "value" : "Dune Messiah", "count" : 1 }, { "value" : "Ender's Game", "count" : 1 }, { "value" : "Fahrenheit 451", "count" : 1 }, { "value" : "Foundation", "count" : 1 } ] }, "page_count" : { "count" : 24, "cardinality" : 24, "min_value" : 180, "max_value" : 768, "mean_value" : 387.0833333333333, "median_value" : 329.5, "top_hits" : [ { "value" : 180, "count" : 1 }, { "value" : 208, "count" : 1 }, { "value" : 224, "count" : 1 }, { "value" : 227, "count" : 1 }, { "value" : 268, "count" : 1 }, { "value" : 271, "count" : 1 }, { "value" : 275, "count" : 1 }, { "value" : 288, "count" : 1 }, { "value" : 304, "count" : 1 }, { "value" : 311, "count" : 1 } ] }, "release_date" : { "count" : 24, "cardinality" : 20, "earliest" : "1932-06-01", "latest" : "2011-06-02", "top_hits" : [ { "value" : "1985-06-01", "count" : 3 }, { "value" : "1969-06-01", "count" : 2 }, { "value" : "1992-06-01", "count" : 2 }, { "value" : "1932-06-01", "count" : 1 }, { "value" : "1951-06-01", "count" : 1 }, { "value" : "1953-10-15", "count" : 1 }, { "value" : "1959-12-01", "count" : 1 }, { "value" : "1965-06-01", "count" : 1 }, { "value" : "1966-04-01", "count" : 1 }, { "value" : "1969-10-15", "count" : 1 } ] } } }
Notes on the result:

- For UTF character encodings, has_byte_order_marker indicates whether the text begins with a byte order marker.
- If a timestamp format that does not include a timezone is detected, need_client_timezone is true, because the timezone must be supplied to convert the timestamps to UTC.
Finding the structure of NYC yellow cab example data

The next example shows how it is possible to find the structure of some New York City yellow cab trip data. The first curl command downloads the data, then pipes its first 20000 lines into the find_structure endpoint. The lines_to_sample query parameter of the endpoint is set to 20000 to match what is specified in the head command.
curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head -20000 | curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&lines_to_sample=20000" -T -
The Content-Type: application/json header must be set even though in this case the data is not JSON. (Alternatively the Content-Type can be set to any other type that Elasticsearch supports, but it must be set.)
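Keeping lines_to_sample in sync with the number of lines actually piped in can also be done programmatically; a sketch (the stand-in data and the choice of 20000 are illustrative):

```python
# Emulate `head -20000`: keep only the first N lines of the sample
# and pass the same N as lines_to_sample so the two stay in sync.
N = 20000

csv_text = "\n".join(f"row{i}" for i in range(50000))  # stand-in for the downloaded CSV
sample = "\n".join(csv_text.splitlines()[:N])

query = f"lines_to_sample={N}"
print(query, len(sample.splitlines()))
```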
If the request does not encounter errors, you receive the following result:
{ "num_lines_analyzed" : 20000, "num_messages_analyzed" : 19998, "sample_start" : "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount\n\n1,2018-06-01 00:15:40,2018-06-01 00:16:46,1,.00,1,N,145,145,2,3,0.5,0.5,0,0,0.3,4.3\n", "charset" : "UTF-8", "has_byte_order_marker" : false, "format" : "delimited", "multiline_start_pattern" : "^.*?,\"?\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", "exclude_lines_pattern" : "^\"?VendorID\"?,\"?tpep_pickup_datetime\"?,\"?tpep_dropoff_datetime\"?,\"?passenger_count\"?,\"?trip_distance\"?,\"?RatecodeID\"?,\"?store_and_fwd_flag\"?,\"?PULocationID\"?,\"?DOLocationID\"?,\"?payment_type\"?,\"?fare_amount\"?,\"?extra\"?,\"?mta_tax\"?,\"?tip_amount\"?,\"?tolls_amount\"?,\"?improvement_surcharge\"?,\"?total_amount\"?", "column_names" : [ "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "trip_distance", "RatecodeID", "store_and_fwd_flag", "PULocationID", "DOLocationID", "payment_type", "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount" ], "has_header_row" : true, "delimiter" : ",", "quote" : "\"", "timestamp_field" : "tpep_pickup_datetime", "joda_timestamp_formats" : [ "YYYY-MM-dd HH:mm:ss" ], "java_timestamp_formats" : [ "yyyy-MM-dd HH:mm:ss" ], "need_client_timezone" : true, "mappings" : { "properties" : { "@timestamp" : { "type" : "date" }, "DOLocationID" : { "type" : "long" }, "PULocationID" : { "type" : "long" }, "RatecodeID" : { "type" : "long" }, "VendorID" : { "type" : "long" }, "extra" : { "type" : "double" }, "fare_amount" : { "type" : "double" }, "improvement_surcharge" : { "type" : "double" }, "mta_tax" : { "type" : "double" }, "passenger_count" : { "type" : "long" }, "payment_type" : { "type" : "long" }, "store_and_fwd_flag" : { "type" : "keyword" }, "tip_amount" : { 
"type" : "double" }, "tolls_amount" : { "type" : "double" }, "total_amount" : { "type" : "double" }, "tpep_dropoff_datetime" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" }, "tpep_pickup_datetime" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" }, "trip_distance" : { "type" : "double" } } }, "ingest_pipeline" : { "description" : "Ingest pipeline created by text structure finder", "processors" : [ { "csv" : { "field" : "message", "target_fields" : [ "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count", "trip_distance", "RatecodeID", "store_and_fwd_flag", "PULocationID", "DOLocationID", "payment_type", "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount" ] } }, { "date" : { "field" : "tpep_pickup_datetime", "timezone" : "{{ event.timezone }}", "formats" : [ "yyyy-MM-dd HH:mm:ss" ] } }, { "convert" : { "field" : "DOLocationID", "type" : "long" } }, { "convert" : { "field" : "PULocationID", "type" : "long" } }, { "convert" : { "field" : "RatecodeID", "type" : "long" } }, { "convert" : { "field" : "VendorID", "type" : "long" } }, { "convert" : { "field" : "extra", "type" : "double" } }, { "convert" : { "field" : "fare_amount", "type" : "double" } }, { "convert" : { "field" : "improvement_surcharge", "type" : "double" } }, { "convert" : { "field" : "mta_tax", "type" : "double" } }, { "convert" : { "field" : "passenger_count", "type" : "long" } }, { "convert" : { "field" : "payment_type", "type" : "long" } }, { "convert" : { "field" : "tip_amount", "type" : "double" } }, { "convert" : { "field" : "tolls_amount", "type" : "double" } }, { "convert" : { "field" : "total_amount", "type" : "double" } }, { "convert" : { "field" : "trip_distance", "type" : "double" } }, { "remove" : { "field" : "message" } } ] }, "field_stats" : { "DOLocationID" : { "count" : 19998, "cardinality" : 240, "min_value" : 1, "max_value" : 265, "mean_value" : 150.26532653265312, "median_value" : 148, "top_hits" 
: [ { "value" : 79, "count" : 760 }, { "value" : 48, "count" : 683 }, { "value" : 68, "count" : 529 }, { "value" : 170, "count" : 506 }, { "value" : 107, "count" : 468 }, { "value" : 249, "count" : 457 }, { "value" : 230, "count" : 441 }, { "value" : 186, "count" : 432 }, { "value" : 141, "count" : 409 }, { "value" : 263, "count" : 386 } ] }, "PULocationID" : { "count" : 19998, "cardinality" : 154, "min_value" : 1, "max_value" : 265, "mean_value" : 153.4042404240424, "median_value" : 148, "top_hits" : [ { "value" : 79, "count" : 1067 }, { "value" : 230, "count" : 949 }, { "value" : 148, "count" : 940 }, { "value" : 132, "count" : 897 }, { "value" : 48, "count" : 853 }, { "value" : 161, "count" : 820 }, { "value" : 234, "count" : 750 }, { "value" : 249, "count" : 722 }, { "value" : 164, "count" : 663 }, { "value" : 114, "count" : 646 } ] }, "RatecodeID" : { "count" : 19998, "cardinality" : 5, "min_value" : 1, "max_value" : 5, "mean_value" : 1.0656565656565653, "median_value" : 1, "top_hits" : [ { "value" : 1, "count" : 19311 }, { "value" : 2, "count" : 468 }, { "value" : 5, "count" : 195 }, { "value" : 4, "count" : 17 }, { "value" : 3, "count" : 7 } ] }, "VendorID" : { "count" : 19998, "cardinality" : 2, "min_value" : 1, "max_value" : 2, "mean_value" : 1.59005900590059, "median_value" : 2, "top_hits" : [ { "value" : 2, "count" : 11800 }, { "value" : 1, "count" : 8198 } ] }, "extra" : { "count" : 19998, "cardinality" : 3, "min_value" : -0.5, "max_value" : 0.5, "mean_value" : 0.4815981598159816, "median_value" : 0.5, "top_hits" : [ { "value" : 0.5, "count" : 19281 }, { "value" : 0, "count" : 698 }, { "value" : -0.5, "count" : 19 } ] }, "fare_amount" : { "count" : 19998, "cardinality" : 208, "min_value" : -100, "max_value" : 300, "mean_value" : 13.937719771977209, "median_value" : 9.5, "top_hits" : [ { "value" : 6, "count" : 1004 }, { "value" : 6.5, "count" : 935 }, { "value" : 5.5, "count" : 909 }, { "value" : 7, "count" : 903 }, { "value" : 5, "count" : 889 }, { 
"value" : 7.5, "count" : 854 }, { "value" : 4.5, "count" : 802 }, { "value" : 8.5, "count" : 790 }, { "value" : 8, "count" : 789 }, { "value" : 9, "count" : 711 } ] }, "improvement_surcharge" : { "count" : 19998, "cardinality" : 3, "min_value" : -0.3, "max_value" : 0.3, "mean_value" : 0.29915991599159913, "median_value" : 0.3, "top_hits" : [ { "value" : 0.3, "count" : 19964 }, { "value" : -0.3, "count" : 22 }, { "value" : 0, "count" : 12 } ] }, "mta_tax" : { "count" : 19998, "cardinality" : 3, "min_value" : -0.5, "max_value" : 0.5, "mean_value" : 0.4962246224622462, "median_value" : 0.5, "top_hits" : [ { "value" : 0.5, "count" : 19868 }, { "value" : 0, "count" : 109 }, { "value" : -0.5, "count" : 21 } ] }, "passenger_count" : { "count" : 19998, "cardinality" : 7, "min_value" : 0, "max_value" : 6, "mean_value" : 1.6201620162016201, "median_value" : 1, "top_hits" : [ { "value" : 1, "count" : 14219 }, { "value" : 2, "count" : 2886 }, { "value" : 5, "count" : 1047 }, { "value" : 3, "count" : 804 }, { "value" : 6, "count" : 523 }, { "value" : 4, "count" : 406 }, { "value" : 0, "count" : 113 } ] }, "payment_type" : { "count" : 19998, "cardinality" : 4, "min_value" : 1, "max_value" : 4, "mean_value" : 1.315631563156316, "median_value" : 1, "top_hits" : [ { "value" : 1, "count" : 13936 }, { "value" : 2, "count" : 5857 }, { "value" : 3, "count" : 160 }, { "value" : 4, "count" : 45 } ] }, "store_and_fwd_flag" : { "count" : 19998, "cardinality" : 2, "top_hits" : [ { "value" : "N", "count" : 19910 }, { "value" : "Y", "count" : 88 } ] }, "tip_amount" : { "count" : 19998, "cardinality" : 717, "min_value" : 0, "max_value" : 128, "mean_value" : 2.010959095909593, "median_value" : 1.45, "top_hits" : [ { "value" : 0, "count" : 6917 }, { "value" : 1, "count" : 1178 }, { "value" : 2, "count" : 624 }, { "value" : 3, "count" : 248 }, { "value" : 1.56, "count" : 206 }, { "value" : 1.46, "count" : 205 }, { "value" : 1.76, "count" : 196 }, { "value" : 1.45, "count" : 195 }, { "value" : 
1.36, "count" : 191 }, { "value" : 1.5, "count" : 187 } ] }, "tolls_amount" : { "count" : 19998, "cardinality" : 26, "min_value" : 0, "max_value" : 35, "mean_value" : 0.2729697969796978, "median_value" : 0, "top_hits" : [ { "value" : 0, "count" : 19107 }, { "value" : 5.76, "count" : 791 }, { "value" : 10.5, "count" : 36 }, { "value" : 2.64, "count" : 21 }, { "value" : 11.52, "count" : 8 }, { "value" : 5.54, "count" : 4 }, { "value" : 8.5, "count" : 4 }, { "value" : 17.28, "count" : 4 }, { "value" : 2, "count" : 2 }, { "value" : 2.16, "count" : 2 } ] }, "total_amount" : { "count" : 19998, "cardinality" : 1267, "min_value" : -100.3, "max_value" : 389.12, "mean_value" : 17.499898989898995, "median_value" : 12.35, "top_hits" : [ { "value" : 7.3, "count" : 478 }, { "value" : 8.3, "count" : 443 }, { "value" : 8.8, "count" : 420 }, { "value" : 6.8, "count" : 406 }, { "value" : 7.8, "count" : 405 }, { "value" : 6.3, "count" : 371 }, { "value" : 9.8, "count" : 368 }, { "value" : 5.8, "count" : 362 }, { "value" : 9.3, "count" : 332 }, { "value" : 10.3, "count" : 332 } ] }, "tpep_dropoff_datetime" : { "count" : 19998, "cardinality" : 9066, "earliest" : "2018-05-31 06:18:15", "latest" : "2018-06-02 02:25:44", "top_hits" : [ { "value" : "2018-06-01 01:12:12", "count" : 10 }, { "value" : "2018-06-01 00:32:15", "count" : 9 }, { "value" : "2018-06-01 00:44:27", "count" : 9 }, { "value" : "2018-06-01 00:46:42", "count" : 9 }, { "value" : "2018-06-01 01:03:22", "count" : 9 }, { "value" : "2018-06-01 01:05:13", "count" : 9 }, { "value" : "2018-06-01 00:11:20", "count" : 8 }, { "value" : "2018-06-01 00:16:03", "count" : 8 }, { "value" : "2018-06-01 00:19:47", "count" : 8 }, { "value" : "2018-06-01 00:25:17", "count" : 8 } ] }, "tpep_pickup_datetime" : { "count" : 19998, "cardinality" : 8760, "earliest" : "2018-05-31 06:08:31", "latest" : "2018-06-02 01:21:21", "top_hits" : [ { "value" : "2018-06-01 00:01:23", "count" : 12 }, { "value" : "2018-06-01 00:04:31", "count" : 10 }, { "value" 
: "2018-06-01 00:05:38", "count" : 10 }, { "value" : "2018-06-01 00:09:50", "count" : 10 }, { "value" : "2018-06-01 00:12:01", "count" : 10 }, { "value" : "2018-06-01 00:14:17", "count" : 10 }, { "value" : "2018-06-01 00:00:34", "count" : 9 }, { "value" : "2018-06-01 00:00:40", "count" : 9 }, { "value" : "2018-06-01 00:02:53", "count" : 9 }, { "value" : "2018-06-01 00:05:40", "count" : 9 } ] }, "trip_distance" : { "count" : 19998, "cardinality" : 1687, "min_value" : 0, "max_value" : 64.63, "mean_value" : 3.6521062106210715, "median_value" : 2.16, "top_hits" : [ { "value" : 0.9, "count" : 335 }, { "value" : 0.8, "count" : 320 }, { "value" : 1.1, "count" : 316 }, { "value" : 0.7, "count" : 304 }, { "value" : 1.2, "count" : 303 }, { "value" : 1, "count" : 296 }, { "value" : 1.3, "count" : 280 }, { "value" : 1.5, "count" : 268 }, { "value" : 1.6, "count" : 268 }, { "value" : 0.6, "count" : 256 } ] } } }
Notes on the result:

- Unlike the first example, in this case the format is considered to be delimited.
- The timestamp formats in this example do not specify a timezone, so to accurately convert them to UTC timestamps to store in Elasticsearch, it is necessary to supply the timezone they relate to; need_client_timezone will be false for timestamp formats that include the timezone.
Setting the timeout parameter

If you try to analyze a lot of data, the analysis will take a long time. If you want to limit the amount of processing your Elasticsearch cluster performs for a request, use the timeout query parameter. The analysis is aborted and an error returned when the timeout expires. For example, you can replace 20000 lines in the previous example with 200000 and set a 1 second timeout on the analysis:
curl -s "s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-06.csv" | head -200000 | curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&lines_to_sample=200000&timeout=1s" -T -
Unless you are using an incredibly fast computer, you'll receive a timeout error:
{ "error" : { "root_cause" : [ { "type" : "timeout_exception", "reason" : "Aborting structure analysis during [delimited record parsing] as it has taken longer than the timeout of [1s]" } ], "type" : "timeout_exception", "reason" : "Aborting structure analysis during [delimited record parsing] as it has taken longer than the timeout of [1s]" }, "status" : 500 }
If you try the example above yourself, you will note that the overall running time of the curl commands is considerably longer than 1 second. This is because it takes a while to download 200000 lines of CSV from the internet, and the timeout is measured from the time this endpoint starts to process the data.
Analyzing Elasticsearch log files

This is an example of analyzing an Elasticsearch log file:
curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&ecs_compatibility=disabled" -T "$ES_HOME/logs/elasticsearch.log"
If the request does not encounter errors, the result will look something like this:
{ "num_lines_analyzed" : 53, "num_messages_analyzed" : 53, "sample_start" : "[2018-09-27T14:39:28,518][INFO ][o.e.e.NodeEnvironment ] [node-0] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [165.4gb], net total_space [464.7gb], types [hfs]\n[2018-09-27T14:39:28,521][INFO ][o.e.e.NodeEnvironment ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]\n", "charset" : "UTF-8", "has_byte_order_marker" : false, "format" : "semi_structured_text", "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*", "ecs_compatibility" : "disabled", "timestamp_field" : "timestamp", "joda_timestamp_formats" : [ "ISO8601" ], "java_timestamp_formats" : [ "ISO8601" ], "need_client_timezone" : true, "mappings" : { "properties" : { "@timestamp" : { "type" : "date" }, "loglevel" : { "type" : "keyword" }, "message" : { "type" : "text" } } }, "ingest_pipeline" : { "description" : "Ingest pipeline created by text structure finder", "processors" : [ { "grok" : { "field" : "message", "patterns" : [ "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel}.*" ] } }, { "date" : { "field" : "timestamp", "timezone" : "{{ event.timezone }}", "formats" : [ "ISO8601" ] } }, { "remove" : { "field" : "timestamp" } } ] }, "field_stats" : { "loglevel" : { "count" : 53, "cardinality" : 3, "top_hits" : [ { "value" : "INFO", "count" : 51 }, { "value" : "DEBUG", "count" : 1 }, { "value" : "WARN", "count" : 1 } ] }, "timestamp" : { "count" : 53, "cardinality" : 28, "earliest" : "2018-09-27T14:39:28,518", "latest" : "2018-09-27T14:39:37,012", "top_hits" : [ { "value" : "2018-09-27T14:39:29,859", "count" : 10 }, { "value" : "2018-09-27T14:39:29,860", "count" : 9 }, { "value" : "2018-09-27T14:39:29,858", "count" : 6 }, { "value" : "2018-09-27T14:39:28,523", "count" : 3 }, { "value" : "2018-09-27T14:39:34,234", "count" : 2 }, { "value" : "2018-09-27T14:39:28,518", "count" : 1 
}, { "value" : "2018-09-27T14:39:28,521", "count" : 1 }, { "value" : "2018-09-27T14:39:28,522", "count" : 1 }, { "value" : "2018-09-27T14:39:29,861", "count" : 1 }, { "value" : "2018-09-27T14:39:32,786", "count" : 1 } ] } } }
Notes on the result:

- This time the format is semi_structured_text.
- A very simple grok_pattern has been created; it extracts only the timestamp and loglevel fields from each message.
- The ECS Grok pattern compatibility mode used can be one of the two valid values of ecs_compatibility (disabled or v1).
Specifying grok_pattern as a query parameter

If you recognize more fields than the simple grok_pattern produced by the structure finder unaided, you can resubmit the request specifying a more advanced grok_pattern as a query parameter, and the structure finder will calculate field_stats for your additional fields.

In the case of the Elasticsearch log, a more complete Grok pattern is \[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}. You can analyze the same text again, submitting this grok_pattern as a query parameter (appropriately URL escaped):
curl -s -H "Content-Type: application/json" -XPOST "localhost:9200/_text_structure/find_structure?pretty&format=semi_structured_text&grok_pattern=%5C%5B%25%7BTIMESTAMP_ISO8601:timestamp%7D%5C%5D%5C%5B%25%7BLOGLEVEL:loglevel%7D%20*%5C%5D%5C%5B%25%7BJAVACLASS:class%7D%20*%5C%5D%20%5C%5B%25%7BHOSTNAME:node%7D%5C%5D%20%25%7BJAVALOGMESSAGE:message%7D" -T "$ES_HOME/logs/elasticsearch.log"
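The URL escaping shown in this command can be generated rather than hand-written. A sketch using Python's urllib.parse.quote; keeping : and * unescaped (which matches how the command above is escaped) is my choice of safe characters:

```python
from urllib.parse import quote

grok_pattern = (
    r"\[%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:loglevel} *\]"
    r"\[%{JAVACLASS:class} *\] \[%{HOSTNAME:node}\] %{JAVALOGMESSAGE:message}"
)

# Percent-encode everything except ':' and '*' so the result matches
# the escaped pattern used in the curl command above.
escaped = quote(grok_pattern, safe=":*")
print(escaped)
```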
If the request does not encounter errors, the result will look something like this:
{ "num_lines_analyzed" : 53, "num_messages_analyzed" : 53, "sample_start" : "[2018-09-27T14:39:28,518][INFO ][o.e.e.NodeEnvironment ] [node-0] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [165.4gb], net total_space [464.7gb], types [hfs]\n[2018-09-27T14:39:28,521][INFO ][o.e.e.NodeEnvironment ] [node-0] heap size [494.9mb], compressed ordinary object pointers [true]\n", "charset" : "UTF-8", "has_byte_order_marker" : false, "format" : "semi_structured_text", "multiline_start_pattern" : "^\\[\\b\\d{4}-\\d{2}-\\d{2}[T ]\\d{2}:\\d{2}", "grok_pattern" : "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}", "ecs_compatibility" : "disabled", "timestamp_field" : "timestamp", "joda_timestamp_formats" : [ "ISO8601" ], "java_timestamp_formats" : [ "ISO8601" ], "need_client_timezone" : true, "mappings" : { "properties" : { "@timestamp" : { "type" : "date" }, "class" : { "type" : "keyword" }, "loglevel" : { "type" : "keyword" }, "message" : { "type" : "text" }, "node" : { "type" : "keyword" } } }, "ingest_pipeline" : { "description" : "Ingest pipeline created by text structure finder", "processors" : [ { "grok" : { "field" : "message", "patterns" : [ "\\[%{TIMESTAMP_ISO8601:timestamp}\\]\\[%{LOGLEVEL:loglevel} *\\]\\[%{JAVACLASS:class} *\\] \\[%{HOSTNAME:node}\\] %{JAVALOGMESSAGE:message}" ] } }, { "date" : { "field" : "timestamp", "timezone" : "{{ event.timezone }}", "formats" : [ "ISO8601" ] } }, { "remove" : { "field" : "timestamp" } } ] }, "field_stats" : { "class" : { "count" : 53, "cardinality" : 14, "top_hits" : [ { "value" : "o.e.p.PluginsService", "count" : 26 }, { "value" : "o.e.c.m.MetadataIndexTemplateService", "count" : 8 }, { "value" : "o.e.n.Node", "count" : 7 }, { "value" : "o.e.e.NodeEnvironment", "count" : 2 }, { "value" : "o.e.a.ActionModule", "count" : 1 }, { "value" : "o.e.c.s.ClusterApplierService", "count" : 1 }, { "value" : 
"o.e.c.s.MasterService", "count" : 1 }, { "value" : "o.e.d.DiscoveryModule", "count" : 1 }, { "value" : "o.e.g.GatewayService", "count" : 1 }, { "value" : "o.e.l.LicenseService", "count" : 1 } ] }, "loglevel" : { "count" : 53, "cardinality" : 3, "top_hits" : [ { "value" : "INFO", "count" : 51 }, { "value" : "DEBUG", "count" : 1 }, { "value" : "WARN", "count" : 1 } ] }, "message" : { "count" : 53, "cardinality" : 53, "top_hits" : [ { "value" : "Using REST wrapper from plugin org.elasticsearch.xpack.security.Security", "count" : 1 }, { "value" : "adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]", "count" : 1 }, { "value" : "adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]", "count" : 1 }, { "value" : "adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]", "count" : 1 }, { "value" : "adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]", "count" : 1 }, { "value" : "adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]", "count" : 1 }, { "value" : "adding template [.triggered_watches] for index patterns [.triggered_watches*]", "count" : 1 }, { "value" : "adding template [.watch-history-9] for index patterns [.watcher-history-9*]", "count" : 1 }, { "value" : "adding template [.watches] for index patterns [.watches*]", "count" : 1 }, { "value" : "starting ...", "count" : 1 } ] }, "node" : { "count" : 53, "cardinality" : 1, "top_hits" : [ { "value" : "node-0", "count" : 53 } ] }, "timestamp" : { "count" : 53, "cardinality" : 28, "earliest" : "2018-09-27T14:39:28,518", "latest" : "2018-09-27T14:39:37,012", "top_hits" : [ { "value" : "2018-09-27T14:39:29,859", "count" : 10 }, { "value" : "2018-09-27T14:39:29,860", "count" : 9 }, { "value" : "2018-09-27T14:39:29,858", "count" : 6 }, { "value" : "2018-09-27T14:39:28,523", "count" : 3 }, { "value" : "2018-09-27T14:39:34,234", "count" : 2 }, { "value" : "2018-09-27T14:39:28,518", "count" : 1 
}, { "value" : "2018-09-27T14:39:28,521", "count" : 1 }, { "value" : "2018-09-27T14:39:28,522", "count" : 1 }, { "value" : "2018-09-27T14:39:29,861", "count" : 1 }, { "value" : "2018-09-27T14:39:32,786", "count" : 1 } ] } } }
Notes on the result:

- The grok_pattern in the output is now exactly what was specified in the query parameter.
- The returned field_stats now include entries for the additional class, node, and message fields extracted by the more advanced pattern.
URL escaping is tricky, so if you are working interactively it is best to use the UI!