Analyze API

Performs analysis on a text string and returns the resulting tokens.
Python:
resp = client.indices.analyze(analyzer="standard", text="Quick Brown Foxes!")
print(resp)

Ruby:
response = client.indices.analyze(body: { analyzer: 'standard', text: 'Quick Brown Foxes!' })
puts response

JavaScript:
const response = await client.indices.analyze({ analyzer: "standard", text: "Quick Brown Foxes!" });
console.log(response);

Console:
GET /_analyze
{ "analyzer" : "standard", "text" : "Quick Brown Foxes!" }
Path parameters

- <index>: (Optional, string) Index used to derive the analyzer. If specified, the analyzer or <field> parameter overrides this value. If no analyzer or field is specified, the analyze API uses the default analyzer for the index. If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
Query parameters

- analyzer: (Optional, string) The name of the analyzer that should be applied to the provided text. This can be a built-in analyzer, or an analyzer configured in the index. If this parameter is not specified, the analyze API uses the analyzer defined in the field's mapping. If no field is specified, the analyze API uses the default analyzer for the index. If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
- attributes: (Optional, array of strings) Array of token attributes used to filter the output of the explain parameter.
- char_filter: (Optional, array of strings) Array of character filters used to preprocess characters before the tokenizer. See the character filters reference for a list of character filters.
- explain: (Optional, Boolean) If true, the response includes token attributes and additional details. Defaults to false. [preview] The format of the additional detail information is labeled as experimental in Lucene and may change in the future.
- field: (Optional, string) Field used to derive the analyzer. To use this parameter, you must specify an index. If specified, the analyzer parameter overrides this value. If no field is specified, the analyze API uses the default analyzer for the index. If no index is specified, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
- filter: (Optional, array of strings) Array of token filters to apply after the tokenizer. See the token filters reference for a list of token filters.
- normalizer: (Optional, string) Normalizer used to convert the text into a single token. See Normalizers for a list of normalizers.
- text: (Required, string or array of strings) Text to analyze. If an array of strings is provided, it is analyzed as a multi-value field.
- tokenizer: (Optional, string) Tokenizer used to convert the text into tokens. See the tokenizers reference for a list of tokenizers.
Examples

No index specified

You can apply any of the built-in analyzers to a text string without specifying an index.
Python:
resp = client.indices.analyze(analyzer="standard", text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(body: { analyzer: 'standard', text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ analyzer: "standard", text: "this is a test" });
console.log(response);

Console:
GET /_analyze
{ "analyzer" : "standard", "text" : "this is a test" }
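The API replies with a tokens array. As a sketch of working with that response, the dict below is an illustrative, abridged example of what the standard analyzer returns for this text (shown as a literal so the snippet runs without a cluster), with the token strings pulled out in position order:

```python
# Illustrative _analyze response for "this is a test" with the standard
# analyzer (shape matches the API; a literal is used here so no cluster
# is needed).
resp = {
    "tokens": [
        {"token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
        {"token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1},
        {"token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2},
        {"token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3},
    ]
}

# Extract just the token strings, ordered by position.
terms = [t["token"] for t in sorted(resp["tokens"], key=lambda t: t["position"])]
print(terms)  # ['this', 'is', 'a', 'test']
```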
Array of text strings

If the text parameter is provided as an array of strings, it is analyzed as a multi-value field.
Python:
resp = client.indices.analyze(analyzer="standard", text=["this is a test", "the second text"])
print(resp)

Ruby:
response = client.indices.analyze(body: { analyzer: 'standard', text: ['this is a test', 'the second text'] })
puts response

JavaScript:
const response = await client.indices.analyze({ analyzer: "standard", text: ["this is a test", "the second text"] });
console.log(response);

Console:
GET /_analyze
{ "analyzer" : "standard", "text" : ["this is a test", "the second text"] }
Custom analyzer

You can use the analyze API to test a custom transient analyzer built from tokenizers, token filters, and char filters. Token filters use the filter parameter:
Python:
resp = client.indices.analyze(tokenizer="keyword", filter=["lowercase"], text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(body: { tokenizer: 'keyword', filter: ['lowercase'], text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ tokenizer: "keyword", filter: ["lowercase"], text: "this is a test" });
console.log(response);

Console:
GET /_analyze
{ "tokenizer" : "keyword", "filter" : ["lowercase"], "text" : "this is a test" }
Python:
resp = client.indices.analyze(tokenizer="keyword", filter=["lowercase"], char_filter=["html_strip"], text="this is a <b>test</b>")
print(resp)

Ruby:
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: ['lowercase'],
    char_filter: ['html_strip'],
    text: 'this is a <b>test</b>'
  }
)
puts response

JavaScript:
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["lowercase"],
  char_filter: ["html_strip"],
  text: "this is a <b>test</b>",
});
console.log(response);

Console:
GET /_analyze
{ "tokenizer" : "keyword", "filter" : ["lowercase"], "char_filter" : ["html_strip"], "text" : "this is a <b>test</b>" }
Custom tokenizers, token filters, and character filters can be specified in the request body as follows:
Python:
resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}],
    text="this is a test",
)
print(resp)

Ruby:
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: ['lowercase', { type: 'stop', stopwords: ['a', 'is', 'this'] }],
    text: 'this is a test'
  }
)
puts response

JavaScript:
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: ["lowercase", { type: "stop", stopwords: ["a", "is", "this"] }],
  text: "this is a test",
});
console.log(response);

Console:
GET /_analyze
{ "tokenizer" : "whitespace", "filter" : ["lowercase", {"type": "stop", "stopwords": ["a", "is", "this"]}], "text" : "this is a test" }
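To make the order of operations concrete, here is a plain-Python sketch of what the request above does conceptually: the tokenizer runs first, then the token filters apply in the order listed. This is an illustration of the pipeline order only, not the actual Lucene implementation:

```python
# Conceptual analysis pipeline: tokenizer first, then token filters in order.
def whitespace_tokenizer(text):
    # Split on whitespace, like the whitespace tokenizer.
    return text.split()

def lowercase_filter(tokens):
    # Lowercase each token, like the lowercase token filter.
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords):
    # Drop stopwords, like the custom stop token filter above.
    return [t for t in tokens if t not in stopwords]

tokens = whitespace_tokenizer("this is a test")
tokens = lowercase_filter(tokens)
tokens = stop_filter(tokens, stopwords={"a", "is", "this"})
print(tokens)  # ['test']
```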
Specific index

You can also run the analyze API against a specific index:
Python:
resp = client.indices.analyze(index="analyze_sample", text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(index: 'analyze_sample', body: { text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ index: "analyze_sample", text: "this is a test" });
console.log(response);

Console:
GET /analyze_sample/_analyze
{ "text" : "this is a test" }
The above will run an analysis on the "this is a test" text, using the default index analyzer associated with the analyze_sample index. An analyzer can also be provided to use a different analyzer:
Python:
resp = client.indices.analyze(index="analyze_sample", analyzer="whitespace", text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(index: 'analyze_sample', body: { analyzer: 'whitespace', text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ index: "analyze_sample", analyzer: "whitespace", text: "this is a test" });
console.log(response);

Console:
GET /analyze_sample/_analyze
{ "analyzer" : "whitespace", "text" : "this is a test" }
Derive analyzer from a field mapping

The analyzer can be derived based on a field mapping, for example:
Python:
resp = client.indices.analyze(index="analyze_sample", field="obj1.field1", text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(index: 'analyze_sample', body: { field: 'obj1.field1', text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ index: "analyze_sample", field: "obj1.field1", text: "this is a test" });
console.log(response);

Console:
GET /analyze_sample/_analyze
{ "field" : "obj1.field1", "text" : "this is a test" }
This will cause the analysis to happen based on the analyzer configured in the mapping for obj1.field1 (and if not, the default index analyzer).
Normalizer

A normalizer can be provided for a keyword field with a normalizer associated with the analyze_sample index.
Python:
resp = client.indices.analyze(index="analyze_sample", normalizer="my_normalizer", text="BaR")
print(resp)

Ruby:
response = client.indices.analyze(index: 'analyze_sample', body: { normalizer: 'my_normalizer', text: 'BaR' })
puts response

JavaScript:
const response = await client.indices.analyze({ index: "analyze_sample", normalizer: "my_normalizer", text: "BaR" });
console.log(response);

Console:
GET /analyze_sample/_analyze
{ "normalizer" : "my_normalizer", "text" : "BaR" }
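The my_normalizer referenced above is not defined on this page. As an assumed illustration only, a custom normalizer of that name might be configured in the index settings along these lines (the filter choices and the "bar" field name are hypothetical):

```python
# Hypothetical index body that would define a normalizer named
# "my_normalizer" and attach it to a keyword field. This is an assumed
# example; the real definition on analyze_sample is not shown in the docs.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "my_normalizer": {
                    "type": "custom",
                    "char_filter": [],
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "bar": {"type": "keyword", "normalizer": "my_normalizer"}
        }
    },
}

# Against a live cluster this body would be sent with something like:
# client.indices.create(index="analyze_sample", body=index_body)
print(index_body["settings"]["analysis"]["normalizer"]["my_normalizer"]["filter"])
```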
Or by building a custom transient normalizer out of token filters and char filters.
Python:
resp = client.indices.analyze(filter=["lowercase"], text="BaR")
print(resp)

Ruby:
response = client.indices.analyze(body: { filter: ['lowercase'], text: 'BaR' })
puts response

JavaScript:
const response = await client.indices.analyze({ filter: ["lowercase"], text: "BaR" });
console.log(response);

Console:
GET /_analyze
{ "filter" : ["lowercase"], "text" : "BaR" }
Explain analyze

If you want to get more advanced details, set explain to true (defaults to false). It will output all token attributes for each token. You can filter token attributes you want to output by setting the attributes option.

The format of the additional detail information is labeled as experimental in Lucene and may change in the future.
Python:
resp = client.indices.analyze(
    tokenizer="standard",
    filter=["snowball"],
    text="detailed output",
    explain=True,
    attributes=["keyword"],
)
print(resp)

Ruby:
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: ['snowball'],
    text: 'detailed output',
    explain: true,
    attributes: ['keyword']
  }
)
puts response

JavaScript:
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["snowball"],
  text: "detailed output",
  explain: true,
  attributes: ["keyword"],
});
console.log(response);

Console:
GET /_analyze
{ "tokenizer" : "standard", "filter" : ["snowball"], "text" : "detailed output", "explain" : true, "attributes" : ["keyword"] }
The request returns the following result:
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [
        { "token" : "detailed", "start_offset" : 0, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 0 },
        { "token" : "output", "start_offset" : 9, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "snowball",
        "tokens" : [
          { "token" : "detail", "start_offset" : 0, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 0, "keyword" : false },
          { "token" : "output", "start_offset" : 9, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1, "keyword" : false }
        ]
      }
    ]
  }
}
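The detail object nests one token list per analysis stage. As a sketch, the snippet below walks that structure and prints the tokens emitted by the tokenizer and by each token filter; resp is the result shown above, abridged to the fields the sketch reads:

```python
# The explain response from above, as a Python literal abridged to the
# fields used here, so the sketch runs without a cluster.
resp = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [{"token": "detailed"}, {"token": "output"}],
        },
        "tokenfilters": [
            {
                "name": "snowball",
                "tokens": [
                    {"token": "detail", "keyword": False},
                    {"token": "output", "keyword": False},
                ],
            }
        ],
    }
}

detail = resp["detail"]
# Tokens as produced by the tokenizer, before any token filters run.
tokenizer_tokens = [t["token"] for t in detail["tokenizer"]["tokens"]]
print("tokenizer:", tokenizer_tokens)

# Tokens after each token filter stage, in the order the filters apply.
for f in detail["tokenfilters"]:
    print(f["name"] + ":", [t["token"] for t in f["tokens"]])
```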
Setting a token limit

Generating an excessive number of tokens may cause a node to run out of memory. The following setting allows you to limit the number of tokens that can be generated:
- index.analyze.max_token_count: The maximum number of tokens that can be generated using the _analyze API. Defaults to 10000. If more than this limit of tokens gets generated, an error is thrown. The _analyze endpoint without a specified index will always use 10000 as its limit. This setting allows you to control the limit for a specific index:
Python:
resp = client.indices.create(index="analyze_sample", settings={"index.analyze.max_token_count": 20000})
print(resp)

Ruby:
response = client.indices.create(index: 'analyze_sample', body: { settings: { 'index.analyze.max_token_count' => 20_000 } })
puts response

JavaScript:
const response = await client.indices.create({ index: "analyze_sample", settings: { "index.analyze.max_token_count": 20000 } });
console.log(response);

Console:
PUT /analyze_sample
{ "settings" : { "index.analyze.max_token_count" : 20000 } }
Python:
resp = client.indices.analyze(index="analyze_sample", text="this is a test")
print(resp)

Ruby:
response = client.indices.analyze(index: 'analyze_sample', body: { text: 'this is a test' })
puts response

JavaScript:
const response = await client.indices.analyze({ index: "analyze_sample", text: "this is a test" });
console.log(response);

Console:
GET /analyze_sample/_analyze
{ "text" : "this is a test" }