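Classic tokenizer
The `classic` tokenizer is a grammar-based tokenizer that works well for English-language documents. It uses heuristics to give special treatment to acronyms, company names, email addresses, and internet host names. However, these rules don't always work, and the tokenizer performs poorly for most languages other than English.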
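- It splits words at most punctuation characters and removes the punctuation. However, a dot that is not followed by whitespace is treated as part of the token.
- It splits words at hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split.
- It recognizes email addresses and internet host names as a single token.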
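The three rules above can be approximated in a few lines of Python. This is only a rough sketch for building intuition, not the grammar the tokenizer actually uses: the function name `classic_like_tokens` and the regular expressions are assumptions made for illustration, and real-world input will differ in many edge cases.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the real grammar).
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$")
# Letters/digits, optionally joined by an inner dot or apostrophe.
_WORD = re.compile(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*")


def classic_like_tokens(text):
    """Approximate the documented splitting rules on whitespace-separated chunks."""
    tokens = []
    for chunk in text.split():
        # Drop punctuation at the edges; a trailing dot is followed by
        # whitespace (or end of text), so it is not part of the token.
        core = chunk.strip(".,;:!?\"()[]{}")
        if not core:
            continue
        if _EMAIL.match(core):
            # Email addresses stay in one piece.
            tokens.append(core)
        elif "-" in core and any(ch.isdigit() for ch in core):
            # A hyphenated token containing a digit is kept whole,
            # like a product number.
            tokens.append(core)
        else:
            # Otherwise split at hyphens and other punctuation, keeping
            # inner dots (host names) and apostrophes.
            tokens.extend(_WORD.findall(core))
    return tokens


print(classic_like_tokens("Mail john.doe@example.com about SKU-1234 at www.example.com."))
```

Here `Brown-Foxes` would be split in two because it has no digit, while the hypothetical product code `SKU-1234` stays whole.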
Example output
Python:

    # Assumes `client` is an already-configured Elasticsearch client.
    resp = client.indices.analyze(
        tokenizer="classic",
        text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    )
    print(resp)

Ruby:

    # Assumes `client` is an already-configured Elasticsearch client.
    response = client.indices.analyze(
      body: {
        tokenizer: 'classic',
        text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }
    )
    puts response

JavaScript:

    // Assumes `client` is an already-configured Elasticsearch client.
    const response = await client.indices.analyze({
      tokenizer: "classic",
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    });
    console.log(response);

Console:

    POST _analyze
    {
      "tokenizer": "classic",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

The sentence above produces the following terms:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
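To get a flat list of terms like the one above from an analyze response, the token strings can be pulled out of its `tokens` array. The sketch below assumes the response behaves like a dictionary with a `tokens` list whose entries carry a `token` field; `token_texts` is a hypothetical helper name, and `sample_response` is a hand-written, abbreviated stand-in rather than real cluster output.

```python
def token_texts(analyze_response):
    """Return only the token strings from an analyze-style response."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Hand-written sample covering just the first three tokens (an assumption
# for illustration; a real response contains more entries and fields).
sample_response = {
    "tokens": [
        {"token": "The", "start_offset": 0, "end_offset": 3, "position": 0},
        {"token": "2", "start_offset": 4, "end_offset": 5, "position": 1},
        {"token": "QUICK", "start_offset": 6, "end_offset": 11, "position": 2},
    ]
}

print(token_texts(sample_response))
```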
Configuration
The `classic` tokenizer accepts the following parameters:
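- `max_token_length`: The maximum token length. If a token exceeds this length, it is split at `max_token_length` intervals. Defaults to `255`.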
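The effect of `max_token_length` on a single over-long token can be pictured as fixed-size chunking. This is a minimal sketch of that idea only, not the tokenizer's implementation; `split_by_max_length` is a hypothetical helper, and its default of 255 mirrors the documented default.

```python
def split_by_max_length(token, max_token_length=255):
    """Cut a token into consecutive pieces of at most max_token_length characters."""
    if max_token_length < 1:
        raise ValueError("max_token_length must be at least 1")
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


print(split_by_max_length("jumped", 5))  # ['jumpe', 'd']
print(split_by_max_length("QUICK", 5))   # ['QUICK']
```

Tokens at or below the limit pass through unchanged, so only longer tokens are affected.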
Example configuration
In this example, we configure the `classic` tokenizer with a `max_token_length` of 5 (for demonstration purposes):
Python:

    # Assumes `client` is an already-configured Elasticsearch client.
    # Create an index whose analyzer uses a classic tokenizer capped at 5 characters.
    resp = client.indices.create(
        index="my-index-000001",
        settings={
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "my_tokenizer"
                    }
                },
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "classic",
                        "max_token_length": 5
                    }
                }
            }
        },
    )
    print(resp)

    # Analyze the sample sentence with the custom analyzer.
    resp1 = client.indices.analyze(
        index="my-index-000001",
        analyzer="my_analyzer",
        text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    )
    print(resp1)

Ruby:

    # Assumes `client` is an already-configured Elasticsearch client.
    response = client.indices.create(
      index: 'my-index-000001',
      body: {
        settings: {
          analysis: {
            analyzer: {
              my_analyzer: {
                tokenizer: 'my_tokenizer'
              }
            },
            tokenizer: {
              my_tokenizer: {
                type: 'classic',
                max_token_length: 5
              }
            }
          }
        }
      }
    )
    puts response

    response = client.indices.analyze(
      index: 'my-index-000001',
      body: {
        analyzer: 'my_analyzer',
        text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }
    )
    puts response

JavaScript:

    // Assumes `client` is an already-configured Elasticsearch client.
    const response = await client.indices.create({
      index: "my-index-000001",
      settings: {
        analysis: {
          analyzer: {
            my_analyzer: {
              tokenizer: "my_tokenizer",
            },
          },
          tokenizer: {
            my_tokenizer: {
              type: "classic",
              max_token_length: 5,
            },
          },
        },
      },
    });
    console.log(response);

    const response1 = await client.indices.analyze({
      index: "my-index-000001",
      analyzer: "my_analyzer",
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    });
    console.log(response1);

Console:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "classic",
              "max_token_length": 5
            }
          }
        }
      }
    }

    POST my-index-000001/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }

The example above produces the following terms:
[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]