Analyze API

Performs analysis on a text string and returns the resulting tokens.

resp = client.indices.analyze(
    analyzer="standard",
    text="Quick Brown Foxes!",
)
print(resp)
response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'Quick Brown Foxes!'
  }
)
puts response
const response = await client.indices.analyze({
  analyzer: "standard",
  text: "Quick Brown Foxes!",
});
console.log(response);
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}

Request

GET /_analyze

POST /_analyze

GET /<index>/_analyze

POST /<index>/_analyze

Prerequisites
  • If the Elasticsearch security features are enabled, you must have the manage index privilege for the specified index.
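
If you need to grant that privilege, the following is a minimal sketch using the security API; the role name analyze_role and the index pattern are illustrative assumptions, not part of this API.

# Sketch: a hypothetical role granting the manage index privilege on analyze_sample.
# Assumes `client` is an Elasticsearch client instance, as in the examples below.
resp = client.security.put_role(
    name="analyze_role",
    indices=[
        {
            "names": ["analyze_sample"],
            "privileges": ["manage"]
        }
    ],
)
print(resp)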

Path parameters
<index>

(Optional, string) Index used to derive the analyzer.

If specified, the analyzer or field parameter overrides this value.

If no analyzer or field is specified, the analyze API uses the default analyzer for the index.

If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.

Query parameters
analyzer

(Optional, string) The name of the analyzer that should be applied to the provided text. This can be a built-in analyzer, or an analyzer configured in the index.

If this parameter is not specified, the analyze API uses the analyzer defined in the field's mapping.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.

attributes
(Optional, array of strings) Array of token attributes used to filter the output of the explain parameter.
char_filter
(Optional, array of strings) Array of character filters used to preprocess characters before the tokenizer. See Character filters reference for a list of character filters.
explain
(Optional, Boolean) If true, the response includes token attributes and additional details. Defaults to false. [preview] The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.
field

(Optional, string) Field used to derive the analyzer. To use this parameter, you must specify an index.

If specified, the analyzer parameter overrides this value.

If no field is specified, the analyze API uses the default analyzer for the index.

If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.

filter
(Optional, array of strings) Array of token filters to apply after the tokenizer. See Token filter reference for a list of token filters.
normalizer
(Optional, string) Normalizer to use to convert text into a single token. See Normalizers for a list of normalizers.
text
(Required, string or array of strings) Text to analyze. If an array of strings is provided, it is analyzed as a multi-value field.
tokenizer
(Optional, string) Tokenizer to use to convert text into tokens. See Tokenizer reference for a list of tokenizers.

Examples

No index specified

You can apply any of the built-in analyzers to a text string without specifying an index.

resp = client.indices.analyze(
    analyzer="standard",
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  analyzer: "standard",
  text: "this is a test",
});
console.log(response);
GET /_analyze
{
  "analyzer" : "standard",
  "text" : "this is a test"
}

Array of text strings

If the text parameter is provided as an array of strings, it is analyzed as a multi-value field.

resp = client.indices.analyze(
    analyzer="standard",
    text=[
        "this is a test",
        "the second text"
    ],
)
print(resp)
response = client.indices.analyze(
  body: {
    analyzer: 'standard',
    text: [
      'this is a test',
      'the second text'
    ]
  }
)
puts response
const response = await client.indices.analyze({
  analyzer: "standard",
  text: ["this is a test", "the second text"],
});
console.log(response);
GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["this is a test", "the second text"]
}

Custom analyzer

You can use the analyze API to test a custom transient analyzer built from tokenizers, token filters, and char filters. Token filters use the filter parameter:

resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "lowercase"
    ],
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  text: "this is a test",
});
console.log(response);
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "text" : "this is a test"
}
resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "lowercase"
    ],
    char_filter=[
        "html_strip"
    ],
    text="this is a test</b>",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'lowercase'
    ],
    char_filter: [
      'html_strip'
    ],
    text: 'this is a <b>test</b>'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  char_filter: ["html_strip"],
  text: "this is a test</b>",
});
console.log(response);
GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "this is a <b>test</b>"
}

Custom tokenizers, token filters, and character filters can be specified in the request body as follows:

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        "lowercase",
        {
            "type": "stop",
            "stopwords": [
                "a",
                "is",
                "this"
            ]
        }
    ],
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'lowercase',
      {
        type: 'stop',
        stopwords: [
          'a',
          'is',
          'this'
        ]
      }
    ],
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    "lowercase",
    {
      type: "stop",
      stopwords: ["a", "is", "this"],
    },
  ],
  text: "this is a test",
});
console.log(response);
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
  "text" : "this is a test"
}

Specific index

You can also run the analyze API against a specific index:

resp = client.indices.analyze(
    index="analyze_sample",
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  index: "analyze_sample",
  text: "this is a test",
});
console.log(response);
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}

The above will run an analysis on the "this is a test" text, using the default index analyzer associated with the analyze_sample index. An analyzer can also be provided to use a different analyzer:

resp = client.indices.analyze(
    index="analyze_sample",
    analyzer="whitespace",
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    analyzer: 'whitespace',
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  index: "analyze_sample",
  analyzer: "whitespace",
  text: "this is a test",
});
console.log(response);
GET /analyze_sample/_analyze
{
  "analyzer" : "whitespace",
  "text" : "this is a test"
}
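
The examples in this section assume that the analyze_sample index already exists. As a minimal sketch, it could be created with a default index analyzer along these lines; the choice of the simple analyzer as the default is illustrative only.

# Sketch: create analyze_sample with a default index analyzer (illustrative choice).
resp = client.indices.create(
    index="analyze_sample",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "simple"
                }
            }
        }
    },
)
print(resp)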

Derive analyzer from a field mapping

The analyzer can be derived based on a field mapping, for example:

resp = client.indices.analyze(
    index="analyze_sample",
    field="obj1.field1",
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    field: 'obj1.field1',
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  index: "analyze_sample",
  field: "obj1.field1",
  text: "this is a test",
});
console.log(response);
GET /analyze_sample/_analyze
{
  "field" : "obj1.field1",
  "text" : "this is a test"
}

This will result in analysis based on the analyzer configured in the mapping for obj1.field1 (or the default index analyzer if none is configured).
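
For the request above to work, obj1.field1 must be mapped in the analyze_sample index. A minimal sketch of adding such a mapping is shown below; the whitespace analyzer is an illustrative choice.

# Sketch: map obj1.field1 with an explicit analyzer (illustrative choice).
resp = client.indices.put_mapping(
    index="analyze_sample",
    properties={
        "obj1": {
            "properties": {
                "field1": {
                    "type": "text",
                    "analyzer": "whitespace"
                }
            }
        }
    },
)
print(resp)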

Normalizer

A normalizer can be provided for keyword fields with normalizers associated with the analyze_sample index:

resp = client.indices.analyze(
    index="analyze_sample",
    normalizer="my_normalizer",
    text="BaR",
)
print(resp)
response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    normalizer: 'my_normalizer',
    text: 'BaR'
  }
)
puts response
const response = await client.indices.analyze({
  index: "analyze_sample",
  normalizer: "my_normalizer",
  text: "BaR",
});
console.log(response);
GET /analyze_sample/_analyze
{
  "normalizer" : "my_normalizer",
  "text" : "BaR"
}
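
The request above assumes a normalizer named my_normalizer is defined in the settings of analyze_sample. A minimal sketch of such a definition when creating the index from scratch might look like the following; the filter list and the keyword_field name are illustrative assumptions.

# Sketch: define my_normalizer and attach it to a keyword field (illustrative).
resp = client.indices.create(
    index="analyze_sample",
    settings={
        "analysis": {
            "normalizer": {
                "my_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    mappings={
        "properties": {
            "keyword_field": {
                "type": "keyword",
                "normalizer": "my_normalizer"
            }
        }
    },
)
print(resp)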

Or by building a custom transient normalizer out of token filters and char filters:

resp = client.indices.analyze(
    filter=[
        "lowercase"
    ],
    text="BaR",
)
print(resp)
response = client.indices.analyze(
  body: {
    filter: [
      'lowercase'
    ],
    text: 'BaR'
  }
)
puts response
const response = await client.indices.analyze({
  filter: ["lowercase"],
  text: "BaR",
});
console.log(response);
GET /_analyze
{
  "filter" : ["lowercase"],
  "text" : "BaR"
}

Explain analyze

If you want to get more advanced details, set explain to true (defaults to false). It will output all token attributes for each token. You can filter the token attributes you want to output by setting the attributes option.

The format of the additional detail information is labelled as experimental in Lucene and it may change in the future.

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "snowball"
    ],
    text="detailed output",
    explain=True,
    attributes=[
        "keyword"
    ],
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'snowball'
    ],
    text: 'detailed output',
    explain: true,
    attributes: [
      'keyword'
    ]
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["snowball"],
  text: "detailed output",
  explain: true,
  attributes: ["keyword"],
});
console.log(response);
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["snowball"],
  "text" : "detailed output",
  "explain" : true,
  "attributes" : ["keyword"] 
}

Set "keyword" to output only the "keyword" attribute.

The request returns the following result:

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [ {
        "token" : "detailed",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      } ]
    },
    "tokenfilters" : [ {
      "name" : "snowball",
      "tokens" : [ {
        "token" : "detail",
        "start_offset" : 0,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 0,
        "keyword" : false 
      }, {
        "token" : "output",
        "start_offset" : 9,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1,
        "keyword" : false 
      } ]
    } ]
  }
}

Only the "keyword" attribute is output, since "attributes" is specified in the request.
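
When reading this response from a client, the per-token details can be taken straight from the nested detail structure. A small sketch of iterating over the token filter output (field names follow the response shown above):

# Sketch: print the tokens emitted by each token filter in the explain output.
for token_filter in resp["detail"]["tokenfilters"]:
    for token in token_filter["tokens"]:
        print(token_filter["name"], token["token"], token["keyword"])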

Setting a token limit

Generating an excessive number of tokens may cause a node to run out of memory. The following setting allows you to limit the number of tokens that can be produced:

index.analyze.max_token_count
The maximum number of tokens that can be produced using the _analyze API. Defaults to 10000. If more tokens than this limit are generated, an error is thrown. The _analyze endpoint without a specified index always uses 10000 as its limit. This setting allows you to control the limit for a specific index:
resp = client.indices.create(
    index="analyze_sample",
    settings={
        "index.analyze.max_token_count": 20000
    },
)
print(resp)
response = client.indices.create(
  index: 'analyze_sample',
  body: {
    settings: {
      'index.analyze.max_token_count' => 20_000
    }
  }
)
puts response
const response = await client.indices.create({
  index: "analyze_sample",
  settings: {
    "index.analyze.max_token_count": 20000,
  },
});
console.log(response);
PUT /analyze_sample
{
  "settings" : {
    "index.analyze.max_token_count" : 20000
  }
}
resp = client.indices.analyze(
    index="analyze_sample",
    text="this is a test",
)
print(resp)
response = client.indices.analyze(
  index: 'analyze_sample',
  body: {
    text: 'this is a test'
  }
)
puts response
const response = await client.indices.analyze({
  index: "analyze_sample",
  text: "this is a test",
});
console.log(response);
GET /analyze_sample/_analyze
{
  "text" : "this is a test"
}