
Analyze API

Performs analysis on a text string and returns the resulting tokens.

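from elasticsearch import Elasticsearch

# Assumption: a local, unsecured cluster. Adjust the URL and credentials as needed.
client = Elasticsearch("http://localhost:9200")
resp = client.indices.analyze(
    analyzer="standard",
    text="Quick Brown Foxes!",
)
print(resp)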

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'Quick Brown Foxes!'
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "standard",
  text: "Quick Brown Foxes!",
});
console.log(response);

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}

Request

GET /_analyze

POST /_analyze

GET /<index>/_analyze

POST /<index>/_analyze

Prerequisites

If the Elasticsearch security features are enabled, you must have the manage index privilege for the specified index.

Path parameters

<index>

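(Optional, string) Index used to derive the analyzer.

If specified, the analyzer or <field> parameter overrides this value.

If no analyzer or field is specified, the analyze API uses the default analyzer for the index.

If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.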
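
The fallback order described above can be summarized in a short sketch. The Python function below is illustrative only: `resolve_analyzer`, its arguments, and the dictionary shape are hypothetical names introduced here, not part of Elasticsearch or its clients. It simply mirrors the precedence rules stated in this reference.

```python
from typing import Optional


def resolve_analyzer(
    analyzer: Optional[str] = None,
    field: Optional[str] = None,
    field_analyzers: Optional[dict] = None,
    index_default: Optional[str] = None,
) -> str:
    """Return the analyzer name chosen by the documented precedence rules.

    field_analyzers is a hypothetical {field name: analyzer name} view of an
    index mapping; index_default is the index's default analyzer, if any.
    Pass neither when no index is specified.
    """
    # 1. An explicit analyzer overrides everything else.
    if analyzer is not None:
        return analyzer
    # 2. Otherwise use the analyzer from the field mapping, when it has one.
    if field is not None and field_analyzers and field in field_analyzers:
        return field_analyzers[field]
    # 3. Otherwise use the index's default analyzer.
    if index_default is not None:
        return index_default
    # 4. No index, or no default analyzer on the index: standard analyzer.
    return "standard"


mapping = {"obj1.field1": "whitespace"}
print(resolve_analyzer())
print(resolve_analyzer(field="obj1.field1", field_analyzers=mapping))
print(resolve_analyzer(analyzer="keyword", field="obj1.field1", field_analyzers=mapping))
```

The three calls print `standard`, `whitespace`, and `keyword`, one for each level of the fallback chain.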

Query parameters

analyzer

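(Optional, string) The name of the analyzer that should be applied to the provided text. This can be a built-in analyzer or an analyzer configured in the index.

If this parameter is not specified, the analyze API uses the analyzer defined in the field's mapping.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.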

attributes

(Optional, array of strings) Array of token attributes used to filter the output of the explain parameter.

char_filter

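(Optional, array of strings) Array of character filters used to preprocess characters before the tokenizer. See the character filters reference for a list of character filters.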

explain

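(Optional, Boolean) If true, the response includes token attributes and additional details. Defaults to false. [preview] The format of the additional detail information is labelled as experimental in Lucene and may change in the future.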

field

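(Optional, string) Field used to derive the analyzer. To use this parameter, you must specify an index.

If specified, the analyzer parameter overrides this value.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.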

filter

(Optional, array of strings) Array of token filters applied after the tokenizer. See the token filter reference for a list of token filters.

normalizer

(Optional, string) Normalizer used to convert text into a single token. See Normalizers for a list of normalizers.

text

(Required, string or array of strings) Text to analyze. If an array of strings is provided, it is analyzed as a multi-value field.

tokenizer

(Optional, string) Tokenizer used to convert text into tokens. See the tokenizer reference for a list of tokenizers.
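
As a rough illustration of how the parameters above fit together, the sketch below assembles the request path and body for an analyze call. `build_analyze_request` is a hypothetical helper, not a client API. It enforces only the two constraints stated in the parameter list (text is required, and field needs an index) and performs no other validation.

```python
from typing import Optional, Sequence, Union


def build_analyze_request(
    text: Union[str, Sequence[str]],
    index: Optional[str] = None,
    **params,
) -> tuple:
    """Return (path, body) for an analyze request.

    Hypothetical helper: it checks only that text is present and that
    field is not used without an index.
    """
    if not text:
        raise ValueError("text is required")
    if params.get("field") is not None and index is None:
        raise ValueError("field requires an index")
    path = f"/{index}/_analyze" if index else "/_analyze"
    # Drop parameters left as None so the body holds only what was set.
    body = {key: value for key, value in params.items() if value is not None}
    body["text"] = text
    return path, body


path, body = build_analyze_request(
    "this is a test", index="analyze_sample", field="obj1.field1"
)
print(path)
print(body)
```

The resulting path and body correspond to a `GET /analyze_sample/_analyze` request with `field` and `text` in the body.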

Examples

No index specified

You can apply any of the built-in analyzers to the text string without specifying an index.

resp = client.indices.analyze(
    analyzer="standard",
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "standard",
  text: "this is a test",
});
console.log(response);

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}

Array of text strings

If the text parameter is provided as an array of strings, it is analyzed as a multi-value field.

resp = client.indices.analyze(
    analyzer="standard",
    text=[
        "this is a test",
        "the second text"
    ],
)
print(resp)

response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: [
      'this is a test',
      'the second text'
    ]
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "standard",
  text: ["this is a test", "the second text"],
});
console.log(response);

GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}

Custom analyzer

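You can use the analyze API to test a custom transient analyzer built from tokenizers, token filters, and character filters. Token filters are specified with the filter parameter and character filters with the char_filter parameter, as in the two examples that follow: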

resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "lowercase"
    ],
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  text: "this is a test",
});
console.log(response);

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}

resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "lowercase"
    ],
    char_filter=[
        "html_strip"
    ],
    text="this is a <b>test</b>",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    char_filter: [
      'html_strip'
    ],
    text: 'this is a <b>test</b>'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  char_filter: ["html_strip"],
  text: "this is a <b>test</b>",
});
console.log(response);

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        "lowercase",
        {
            "type": "stop",
            "stopwords": [
                "a",
                "is",
                "this"
            ]
        }
    ],
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'lowercase',
      {
        type: 'stop',
        stopwords: [
          'a',
          'is',
          'this'
        ]
      }
    ],
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    "lowercase",
    {
      type: "stop",
      stopwords: ["a", "is", "this"],
    },
  ],
  text: "this is a test",
});
console.log(response);

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}
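
To make the effect of the request above easier to picture, here is a pure-Python emulation of the same pipeline: a whitespace tokenizer followed by a lowercase filter and an inline stop filter. It is a simplified sketch, not Elasticsearch code. The function names are hypothetical, only these two filter kinds are handled, and the choice to keep each surviving token's original position is an assumption about how the real stop filter reports positions.

```python
import re
from typing import List, Union

FilterSpec = Union[str, dict]


def whitespace_tokenize(text: str) -> List[dict]:
    """Split on whitespace, recording offsets and positions per token."""
    return [
        {"token": m.group(), "start_offset": m.start(),
         "end_offset": m.end(), "position": pos}
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]


def apply_filter(tokens: List[dict], spec: FilterSpec) -> List[dict]:
    """Apply one token filter given by name (str) or inline definition (dict)."""
    if spec == "lowercase":
        return [{**t, "token": t["token"].lower()} for t in tokens]
    if isinstance(spec, dict) and spec.get("type") == "stop":
        stopwords = set(spec["stopwords"])
        # Removed tokens leave a gap: surviving tokens keep their positions.
        return [t for t in tokens if t["token"] not in stopwords]
    raise ValueError(f"filter not covered by this sketch: {spec!r}")


def analyze(text: str, filters: List[FilterSpec]) -> List[dict]:
    """Tokenize on whitespace, then run each filter in order."""
    tokens = whitespace_tokenize(text)
    for spec in filters:
        tokens = apply_filter(tokens, spec)
    return tokens


result = analyze(
    "this is a test",
    ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
)
print(result)
# [{'token': 'test', 'start_offset': 10, 'end_offset': 14, 'position': 3}]
```

Only `test` survives, because the other three words are listed as stopwords. Filter order matters: lowercasing first means a stopword list written in lowercase also matches capitalized input.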

Specific index

You can also run the analyze API against a specific index:

resp = client.indices.analyze(
    index="analyze_sample",
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  index: "analyze_sample",
  text: "this is a test",
});
console.log(response);

GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}

The above analyzes the text "this is a test" using the default index analyzer associated with the analyze_sample index. You can also provide an analyzer to use a different analyzer:

resp = client.indices.analyze(
    index="analyze_sample",
    analyzer="whitespace",
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    analyzer: 'whitespace',
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  index: "analyze_sample",
  analyzer: "whitespace",
  text: "this is a test",
});
console.log(response);

GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}

Derive analyzer from a field mapping

The analyzer can be derived from a field mapping, for example:

resp = client.indices.analyze(
    index="analyze_sample",
    field="obj1.field1",
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    field: 'obj1.field1',
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  index: "analyze_sample",
  field: "obj1.field1",
  text: "this is a test",
});
console.log(response);

GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}

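This analyzes the text with the analyzer configured in the mapping for obj1.field1, or with the default index analyzer if the mapping does not configure one.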

Normalizer

You can supply a normalizer for keyword fields, using a normalizer associated with the analyze_sample index:

resp = client.indices.analyze(
    index="analyze_sample",
    normalizer="my_normalizer",
    text="BaR",
)
print(resp)

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    normalizer: 'my_normalizer',
    text: 'BaR'
  }
)
puts response

const response = await client.indices.analyze({
  index: "analyze_sample",
  normalizer: "my_normalizer",
  text: "BaR",
});
console.log(response);

GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}

Alternatively, build a custom transient normalizer from token filters and character filters:

resp = client.indices.analyze(
    filter=[
        "lowercase"
    ],
    text="BaR",
)
print(resp)

response = client.indices.analyze(
  body: {
    filter: [
      'lowercase'
    ],
    text: 'BaR'
  }
)
puts response

const response = await client.indices.analyze({
  filter: ["lowercase"],
  text: "BaR",
});
console.log(response);

GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}
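
The practical difference between a normalizer and an analyzer is that no tokenizer runs, so the whole input becomes a single token. The sketch below emulates that behavior in plain Python. `normalize` is a hypothetical function covering only the lowercase and uppercase filters, and the returned dictionary borrows field names from the analyze response purely for illustration.

```python
def normalize(text: str, filters: list) -> dict:
    """Apply filters to the whole input and return it as one token."""
    token = text
    for name in filters:
        if name == "lowercase":
            token = token.lower()
        elif name == "uppercase":
            token = token.upper()
        else:
            raise ValueError(f"filter not covered by this sketch: {name}")
    # No tokenizer runs, so the input is never split: one token, position 0.
    return {"token": token, "start_offset": 0, "end_offset": len(text), "position": 0}


print(normalize("BaR", ["lowercase"]))
print(normalize("Foo BaR", ["lowercase"])["token"])
```

The second call prints `foo bar` as one token. A whitespace-splitting analyzer would instead have produced two tokens.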

Explain analyze

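To get more advanced details, set explain to true (defaults to false). The response then includes all token attributes for each token. You can filter which token attributes are output by setting the attributes option.

The format of the additional detail information is labelled as experimental in Lucene and may change in the future.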

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "snowball"
    ],
    text="detailed output",
    explain=True,
    attributes=[
        "keyword"
    ],
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'snowball'
    ],
    text: 'detailed output',
    explain: true,
    attributes: [
      'keyword'
    ]
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["snowball"],
  text: "detailed output",
  explain: true,
  attributes: ["keyword"],
});
console.log(response);

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] 
}

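In the request above, attributes is set to ["keyword"] so that only the keyword attribute is output.

The request returns the following result: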

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false 
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false 
      } ]
    } ]
  }
}

In the response, only the keyword attribute is output because the request specified attributes.
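
When reading an explain response, it often helps to compare what each stage emitted. The following sketch walks a "detail" object of the shape shown above and lists the token strings per stage. `tokens_by_stage` is a hypothetical helper, the embedded dictionary is an abbreviated copy of the sample response, and the code assumes the custom-analyzer layout (a tokenizer object plus a tokenfilters array).

```python
from typing import Dict, List

# Abbreviated copy of the "detail" object from the sample response.
detail = {
    "custom_analyzer": True,
    "charfilters": [],
    "tokenizer": {
        "name": "standard",
        "tokens": [
            {"token": "detailed", "position": 0},
            {"token": "output", "position": 1},
        ],
    },
    "tokenfilters": [
        {
            "name": "snowball",
            "tokens": [
                {"token": "detail", "position": 0, "keyword": False},
                {"token": "output", "position": 1, "keyword": False},
            ],
        }
    ],
}


def tokens_by_stage(detail: dict) -> Dict[str, List[str]]:
    """Map each pipeline stage to the token strings it emitted, in order."""
    stages = {}
    tokenizer = detail["tokenizer"]
    stages[f"tokenizer:{tokenizer['name']}"] = [t["token"] for t in tokenizer["tokens"]]
    for token_filter in detail.get("tokenfilters", []):
        stages[f"filter:{token_filter['name']}"] = [
            t["token"] for t in token_filter["tokens"]
        ]
    return stages


print(tokens_by_stage(detail))
# {'tokenizer:standard': ['detailed', 'output'], 'filter:snowball': ['detail', 'output']}
```

Comparing the two lists shows that the snowball filter stemmed "detailed" to "detail" and left "output" unchanged.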

Setting a token limit

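Generating an excessive number of tokens may cause a node to run out of memory. The following setting limits the number of tokens that can be produced:

index.analyze.max_token_count: The maximum number of tokens that can be produced using the _analyze API. The default value is 10000. If more tokens than this limit are generated, an error is thrown. The _analyze endpoint without a specified index always uses 10000 as the limit. This setting lets you control the limit for a specific index: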

resp = client.indices.create(
    index="analyze_sample",
    settings={
        "index.analyze.max_token_count": 20000
    },
)
print(resp)

response = client.indices.create(
  index: 'analyze_sample',
  body: {
    settings: {
      'index.analyze.max_token_count' => 20_000
    }
  }
)
puts response

const response = await client.indices.create({
  index: "analyze_sample",
  settings: {
    "index.analyze.max_token_count": 20000,
  },
});
console.log(response);

PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}

resp = client.indices.analyze(
    index="analyze_sample",
    text="this is a test",
)
print(resp)

response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response

const response = await client.indices.analyze({
  index: "analyze_sample",
  text: "this is a test",
});
console.log(response);

GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}
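
The limit behaves like a counter that fails as soon as one token too many is produced. The sketch below illustrates that idea on the client side. It is not how Elasticsearch implements the check: `count_tokens_within_limit` is a hypothetical function, whitespace splitting stands in for a real analyzer, and the `ValueError` is a placeholder for the server's error.

```python
DEFAULT_MAX_TOKEN_COUNT = 10000  # documented default for index.analyze.max_token_count


def count_tokens_within_limit(
    text: str, max_token_count: int = DEFAULT_MAX_TOKEN_COUNT
) -> int:
    """Count whitespace-separated tokens, failing once the limit is exceeded."""
    count = 0
    for _ in text.split():
        count += 1
        if count > max_token_count:
            raise ValueError(f"token count exceeds the limit of {max_token_count}")
    return count


print(count_tokens_within_limit("this is a test"))  # 4

# A text of 10001 tokens exceeds the default but fits a raised limit of 20000.
long_text = " ".join(["word"] * 10001)
print(count_tokens_within_limit(long_text, max_token_count=20000))  # 10001
```

Calling `count_tokens_within_limit(long_text)` with the default limit raises the error, which mirrors why the example above raises index.analyze.max_token_count to 20000 for analyze_sample.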
