Pattern analyzer
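The `pattern` analyzer uses a regular expression to split text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to `\W+` (all non-word characters).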
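To make the separator-versus-token distinction concrete, here is a minimal sketch that uses Python's `re` module as a stand-in for the Java regular expression engine. The helper name `split_terms` is hypothetical, and the sketch only mimics the splitting step; it is not how the analyzer itself is implemented.

```python
import re

def split_terms(text, separator_pattern=r"\W+"):
    """Split text wherever separator_pattern matches, dropping empty strings."""
    return [term for term in re.split(separator_pattern, text) if term]

# The pattern describes what lies BETWEEN terms, so \W+ yields the words.
print(split_terms("red, green; blue"))          # ['red', 'green', 'blue']

# A pattern that matches the words themselves leaves only the punctuation.
print(split_terms("red, green; blue", r"\w+"))  # [', ', '; ']
```

The second call shows the typical mistake: a pattern written to match tokens discards them and keeps the separators.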
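Beware of pathological regular expressions
The pattern analyzer uses Java regular expressions.
A badly written regular expression can run very slowly, or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.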
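As a rough illustration of what "pathological" means, the sketch below times a pattern with nested quantifiers against an equivalent flat pattern. It uses Python's `re` module, which is also a backtracking engine, as a stand-in for Java; the names `RISKY`, `SAFER`, and `time_match` are hypothetical, absolute timings depend on the machine, and the sketch does not reproduce a StackOverflowError.

```python
import re
import time

# Nested quantifiers: (a+)+ can partition a run of "a" characters in
# exponentially many ways, and a backtracking engine tries them all
# before it can report that the trailing "b" prevents a match.
RISKY = re.compile(r"(a+)+$")

# Accepts the same strings, but leaves only one way to match a run of "a".
SAFER = re.compile(r"a+$")

def time_match(pattern, text):
    """Return the seconds spent attempting to match pattern at the start of text."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n + "b"  # almost matches, which forces the backtracking
    risky = time_match(RISKY, text)
    safer = time_match(SAFER, text)
    print(f"n={n} risky={risky:.4f}s safer={safer:.6f}s")
```

The time for the nested pattern should grow sharply with each step in `n`, while the flat pattern stays roughly constant. Keeping `n` small is deliberate: much larger values can make the nested pattern take a very long time.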
Example output
Python:

    resp = client.indices.analyze(
        analyzer="pattern",
        text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    )
    print(resp)

Ruby:

    response = client.indices.analyze(
      body: {
        analyzer: 'pattern',
        text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
      }
    )
    puts response

JavaScript:

    const response = await client.indices.analyze({
      analyzer: "pattern",
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    });
    console.log(response);

Console:

    POST _analyze
    {
      "analyzer": "pattern",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Configuration
The `pattern` analyzer accepts the following parameters:
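`pattern`: A Java regular expression. Defaults to `\W+`.
`flags`: Java regular expression flags. Flags should be pipe-separated, for example `"CASE_INSENSITIVE|COMMENTS"`.
`lowercase`: Whether terms should be lowercased. Defaults to `true`.
`stopwords`: A predefined stop word list such as `_english_`, or an array containing a list of stop words. Defaults to `_none_`.
`stopwords_path`: The path to a file containing stop words.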
See the Stop Token Filter for more information about stop word configuration.
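As a sketch of how several of these parameters might be combined, the settings body below sets `pattern`, `flags`, `lowercase`, and `stopwords` together. The analyzer name `my_pattern_analyzer` is hypothetical, and the particular values are chosen only for illustration.

```python
import json

# Hypothetical analyzer name; the keys are the documented pattern analyzer parameters.
settings = {
    "analysis": {
        "analyzer": {
            "my_pattern_analyzer": {
                "type": "pattern",
                "pattern": r"\W+",                     # Java regex matching the separators
                "flags": "CASE_INSENSITIVE|COMMENTS",  # pipe-separated regex flags
                "lowercase": True,                     # lowercase the resulting terms
                "stopwords": "_english_",              # predefined English stop word list
            }
        }
    }
}

# Print the request body that would be sent when creating an index.
print(json.dumps({"settings": settings}, indent=2))
```

With the Python client used elsewhere on this page, a dictionary like this would be passed as `settings=settings` to `client.indices.create`.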
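Example configuration
In this example, we configure the `pattern` analyzer to split email addresses on non-word characters or on underscores (`\W|_`), and to lowercase the result: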
Python:

    resp = client.indices.create(
        index="my-index-000001",
        settings={
            "analysis": {
                "analyzer": {
                    "my_email_analyzer": {
                        "type": "pattern",
                        "pattern": "\\W|_",
                        "lowercase": True
                    }
                }
            }
        },
    )
    print(resp)

    resp1 = client.indices.analyze(
        index="my-index-000001",
        analyzer="my_email_analyzer",
        text="John_Smith@foo-bar.com",
    )
    print(resp1)

Ruby:

    response = client.indices.create(
      index: 'my-index-000001',
      body: {
        settings: {
          analysis: {
            analyzer: {
              my_email_analyzer: {
                type: 'pattern',
                pattern: '\\W|_',
                lowercase: true
              }
            }
          }
        }
      }
    )
    puts response

    response = client.indices.analyze(
      index: 'my-index-000001',
      body: {
        analyzer: 'my_email_analyzer',
        text: 'John_Smith@foo-bar.com'
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "my-index-000001",
      settings: {
        analysis: {
          analyzer: {
            my_email_analyzer: {
              type: "pattern",
              pattern: "\\W|_",
              lowercase: true,
            },
          },
        },
      },
    });
    console.log(response);

    const response1 = await client.indices.analyze({
      index: "my-index-000001",
      analyzer: "my_email_analyzer",
      text: "John_Smith@foo-bar.com",
    });
    console.log(response1);

Console:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_email_analyzer": {
              "type": "pattern",
              "pattern": "\\W|_",
              "lowercase": true
            }
          }
        }
      }
    }

    POST my-index-000001/_analyze
    {
      "analyzer": "my_email_analyzer",
      "text": "John_Smith@foo-bar.com"
    }
The above example produces the following terms:
[ john, smith, foo, bar, com ]
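The same split can be approximated locally without a cluster. The sketch below is a rough emulation in Python, assuming an input address such as `John_Smith@foo-bar.com`; the function name `analyze` is hypothetical, and Python's `re` stands in for the Java regular expression engine.

```python
import re

def analyze(text, pattern=r"\W|_", lowercase=True):
    """Approximate the analyzer: split on pattern, drop empties, optionally lowercase."""
    terms = [term for term in re.split(pattern, text) if term]
    return [term.lower() for term in terms] if lowercase else terms

print(analyze("John_Smith@foo-bar.com"))
# ['john', 'smith', 'foo', 'bar', 'com']

print(analyze("John_Smith@foo-bar.com", lowercase=False))
# ['John', 'Smith', 'foo', 'bar', 'com']
```

Adding `_` to the pattern matters because `\W` treats the underscore as a word character, so `\W` alone would keep `John_Smith` as one term.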
CamelCase tokenizer
The following more complicated example splits CamelCase text into tokens:
Python:

    resp = client.indices.create(
        index="my-index-000001",
        settings={
            "analysis": {
                "analyzer": {
                    "camel": {
                        "type": "pattern",
                        "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                    }
                }
            }
        },
    )
    print(resp)

    resp1 = client.indices.analyze(
        index="my-index-000001",
        analyzer="camel",
        text="MooseX::FTPClass2_beta",
    )
    print(resp1)

Ruby:

    response = client.indices.create(
      index: 'my-index-000001',
      body: {
        settings: {
          analysis: {
            analyzer: {
              camel: {
                type: 'pattern',
                pattern: '([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])'
              }
            }
          }
        }
      }
    )
    puts response

    response = client.indices.analyze(
      index: 'my-index-000001',
      body: {
        analyzer: 'camel',
        text: 'MooseX::FTPClass2_beta'
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "my-index-000001",
      settings: {
        analysis: {
          analyzer: {
            camel: {
              type: "pattern",
              pattern: "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
            },
          },
        },
      },
    });
    console.log(response);

    const response1 = await client.indices.analyze({
      index: "my-index-000001",
      analyzer: "camel",
      text: "MooseX::FTPClass2_beta",
    });
    console.log(response1);

Console:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "camel": {
              "type": "pattern",
              "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
            }
          }
        }
      }
    }

    GET my-index-000001/_analyze
    {
      "analyzer": "camel",
      "text": "MooseX::FTPClass2_beta"
    }
The above example produces the following terms:
[ moose, x, ftp, class, 2, beta ]
The regular expression above is easier to understand as:
      ([^\p{L}\d]+)                 # swallow non letters and numbers,
    | (?<=\D)(?=\d)                 # or non-number followed by number,
    | (?<=\d)(?=\D)                 # or number followed by non-number,
    | (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
      (?=\p{Lu})                    #   followed by upper case,
    | (?<=\p{Lu})                   # or upper case
      (?=\p{Lu}                     #   followed by upper case
        [\p{L}&&[^\p{Lu}]]          #   then lower case
      )
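The alternatives can be tried out locally with a rough Python translation. This is an ASCII-only approximation, not the Java pattern itself: Python's `re` has no `\p{L}`, `\p{Lu}`, or `&&` class intersection, so `[A-Za-z]`, `[A-Z]`, and `[a-z]` stand in for them. The names `CAMEL` and `camel_terms` are hypothetical, and splitting on zero-width matches assumes Python 3.7 or later.

```python
import re

# ASCII-only stand-in for the Java pattern; see the caveats in the lead-in.
CAMEL = re.compile(r"""
      [^A-Za-z\d]+              # swallow non letters and numbers,
    | (?<=\D)(?=\d)             # or non-number followed by number,
    | (?<=\d)(?=\D)             # or number followed by non-number,
    | (?<=[a-z])(?=[A-Z])       # or lower case followed by upper case,
    | (?<=[A-Z])(?=[A-Z][a-z])  # or upper case followed by upper then lower case
""", re.VERBOSE)

def camel_terms(text):
    """Split first, because case marks the boundaries, then lowercase each term."""
    return [term.lower() for term in CAMEL.split(text) if term]

print(camel_terms("MooseX::FTPClass2_beta"))
# ['moose', 'x', 'ftp', 'class', '2', 'beta']
```

Lowercasing has to happen after the split. If the text were lowercased first, the lookarounds that depend on upper case would never match, and `MooseX` or `FTPClass` would stay as single terms.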
Definition
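The `pattern` analyzer consists of a pattern tokenizer followed by a lowercase token filter and a stop token filter (disabled by default).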
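If you need to customize the `pattern` analyzer beyond the configuration parameters, you need to recreate it as a `custom` analyzer and modify it, usually by adding token filters. The following recreates the built-in `pattern` analyzer, and you can use it as a starting point for further customization: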
Python:

    resp = client.indices.create(
        index="pattern_example",
        settings={
            "analysis": {
                "tokenizer": {
                    "split_on_non_word": {
                        "type": "pattern",
                        "pattern": "\\W+"
                    }
                },
                "analyzer": {
                    "rebuilt_pattern": {
                        "tokenizer": "split_on_non_word",
                        "filter": [
                            "lowercase"
                        ]
                    }
                }
            }
        },
    )
    print(resp)

Ruby:

    response = client.indices.create(
      index: 'pattern_example',
      body: {
        settings: {
          analysis: {
            tokenizer: {
              split_on_non_word: {
                type: 'pattern',
                pattern: '\\W+'
              }
            },
            analyzer: {
              rebuilt_pattern: {
                tokenizer: 'split_on_non_word',
                filter: [
                  'lowercase'
                ]
              }
            }
          }
        }
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "pattern_example",
      settings: {
        analysis: {
          tokenizer: {
            split_on_non_word: {
              type: "pattern",
              pattern: "\\W+",
            },
          },
          analyzer: {
            rebuilt_pattern: {
              tokenizer: "split_on_non_word",
              filter: ["lowercase"],
            },
          },
        },
      },
    });
    console.log(response);
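As a sketch of the "adding token filters" step, the settings below extend the rebuilt analyzer with one more filter. The analyzer name `rebuilt_pattern_folded` is hypothetical, and `asciifolding` is picked purely as an illustration of appending a built-in token filter after `lowercase`.

```python
import json

# Same tokenizer and lowercase filter as the rebuilt analyzer, plus one extra
# filter appended for illustration.
settings = {
    "analysis": {
        "tokenizer": {
            "split_on_non_word": {
                "type": "pattern",
                "pattern": r"\W+",  # change the separator pattern here
            }
        },
        "analyzer": {
            "rebuilt_pattern_folded": {  # hypothetical analyzer name
                "tokenizer": "split_on_non_word",
                "filter": ["lowercase", "asciifolding"],  # filters run in list order
            }
        },
    }
}

# Print the settings body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```

Filters are applied in the order listed, so anything that depends on lowercased input belongs after `lowercase`.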