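Standard analyzer
The standard analyzer is the default analyzer, used when no other analyzer is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.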
Example output
The client snippets assume an already-configured client object.

Python:

    resp = client.indices.analyze(
        analyzer="standard",
        text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    )
    print(resp)

Ruby:

    response = client.indices.analyze(
      body: {
        analyzer: 'standard',
        text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }
    )
    puts response

JavaScript:

    const response = await client.indices.analyze({
      analyzer: "standard",
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    });
    console.log(response);

Console:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
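The behavior above can be approximated offline with a short sketch. This is only an illustration under stated assumptions: the regular expression is a rough stand-in for word-boundary detection on plain English text and is not an implementation of Unicode Standard Annex #29, and the function name `approx_standard_analyze` is hypothetical.

```python
import re

# Rough stand-in for the tokenizer: runs of word characters, optionally
# joined by internal apostrophes (so "dog's" stays a single token).
# This is NOT the UAX #29 algorithm, only an approximation for plain English.
_WORD = re.compile(r"\w+(?:'\w+)*")


def approx_standard_analyze(text: str) -> list[str]:
    """Tokenize, then lowercase each token (tokenizer + lowercase filter)."""
    return [token.lower() for token in _WORD.findall(text)]


if __name__ == "__main__":
    sample = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    print(approx_standard_analyze(sample))
```

For the sample sentence this sketch yields the same terms as listed above; for other scripts or punctuation-heavy text it may diverge from what the real analyzer returns.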
Configuration
The standard analyzer accepts the following parameters:
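max_token_length | The maximum token length. If a token is seen that exceeds this length, it is split at max_token_length intervals. Defaults to 255.
stopwords | A pre-defined stop words list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path | The path to a file containing stop words.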
See the Stop Token Filter for more information about stop word configuration.
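Example configuration
In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes only) and to use the pre-defined list of English stop words: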
Python:

    resp = client.indices.create(
        index="my-index-000001",
        settings={
            "analysis": {
                "analyzer": {
                    "my_english_analyzer": {
                        "type": "standard",
                        "max_token_length": 5,
                        "stopwords": "_english_"
                    }
                }
            }
        },
    )
    print(resp)

    resp1 = client.indices.analyze(
        index="my-index-000001",
        analyzer="my_english_analyzer",
        text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    )
    print(resp1)

Ruby:

    response = client.indices.create(
      index: 'my-index-000001',
      body: {
        settings: {
          analysis: {
            analyzer: {
              my_english_analyzer: {
                type: 'standard',
                max_token_length: 5,
                stopwords: '_english_'
              }
            }
          }
        }
      }
    )
    puts response

    response = client.indices.analyze(
      index: 'my-index-000001',
      body: {
        analyzer: 'my_english_analyzer',
        text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "my-index-000001",
      settings: {
        analysis: {
          analyzer: {
            my_english_analyzer: {
              type: "standard",
              max_token_length: 5,
              stopwords: "_english_",
            },
          },
        },
      },
    });
    console.log(response);

    const response1 = await client.indices.analyze({
      index: "my-index-000001",
      analyzer: "my_english_analyzer",
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    });
    console.log(response1);

Console:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_english_analyzer": {
              "type": "standard",
              "max_token_length": 5,
              "stopwords": "_english_"
            }
          }
        }
      }
    }

    POST my-index-000001/_analyze
    {
      "analyzer": "my_english_analyzer",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
The above example produces the following terms:
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
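To see why "jumped" becomes "jumpe" and "d" while "the" disappears, the two settings can be simulated in plain Python. This is a minimal sketch, assuming a regex stand-in for the real tokenizer and a small illustrative subset of English stop words (the actual `_english_` list is longer); `split_at_intervals`, `approx_analyze`, and `STOPWORDS_SUBSET` are hypothetical names, not part of any Elasticsearch client.

```python
import re

# Rough stand-in for the tokenizer (not the real UAX #29 algorithm).
_WORD = re.compile(r"\w+(?:'\w+)*")

# Illustrative subset only; the real _english_ stop word list is longer.
STOPWORDS_SUBSET = frozenset({"a", "an", "and", "in", "is", "it", "of", "the", "to"})


def split_at_intervals(token, max_len):
    """Cut a token into consecutive pieces of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def approx_analyze(text, max_token_length=255, stopwords=frozenset()):
    """Tokenize, split over-long tokens, lowercase, then drop stop words."""
    terms = []
    for token in _WORD.findall(text):
        for piece in split_at_intervals(token, max_token_length):
            piece = piece.lower()
            if piece not in stopwords:
                terms.append(piece)
    return terms


if __name__ == "__main__":
    sample = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    print(approx_analyze(sample, max_token_length=5, stopwords=STOPWORDS_SUBSET))
```

Note that "dog's" is exactly five characters long, so it survives intact, whereas the six-character "jumped" is cut after the fifth character.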
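Definition
The standard analyzer consists of:

Tokenizer: Standard Tokenizer
Token filters: Lower Case Token Filter, Stop Token Filter (disabled by default)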
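If you need to customize the standard analyzer beyond the configuration parameters, you need to recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in standard analyzer, and you can use it as a starting point: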
Python:

    resp = client.indices.create(
        index="standard_example",
        settings={
            "analysis": {
                "analyzer": {
                    "rebuilt_standard": {
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase"
                        ]
                    }
                }
            }
        },
    )
    print(resp)

Ruby:

    response = client.indices.create(
      index: 'standard_example',
      body: {
        settings: {
          analysis: {
            analyzer: {
              rebuilt_standard: {
                tokenizer: 'standard',
                filter: [
                  'lowercase'
                ]
              }
            }
          }
        }
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "standard_example",
      settings: {
        analysis: {
          analyzer: {
            rebuilt_standard: {
              tokenizer: "standard",
              filter: ["lowercase"],
            },
          },
        },
      },
    });
    console.log(response);
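Since customization usually means adding token filters, one way to keep the starting point reusable is to build the settings from a helper. The sketch below is an assumption-laden illustration: `rebuilt_standard_settings` is a hypothetical helper, and it simply appends any extra filter names after `lowercase`. The `stop` filter used in the usage line is a built-in token filter; the helper itself does not contact a cluster.

```python
def rebuilt_standard_settings(extra_filters=()):
    """Return index settings for a rebuilt standard analyzer.

    Extra token filter names are appended after "lowercase", so they
    operate on already-lowercased tokens.
    """
    return {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", *extra_filters],
                }
            }
        }
    }


if __name__ == "__main__":
    # With no extras this matches the settings shown above.
    print(rebuilt_standard_settings())
    # Hypothetical customization: enable stop word removal as well.
    print(rebuilt_standard_settings(["stop"]))
```

The returned dictionary could then be passed as the `settings` argument of `client.indices.create`, in the same way as the Python example above.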