Pattern capture token filter

Unlike the pattern tokenizer, the pattern_capture token filter emits a token for every capture group in the regular expression. Patterns are not anchored to the start and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.
Beware of pathological regular expressions

The pattern capture token filter uses Java regular expressions. A badly written regular expression can run very slowly or even throw a StackOverflowError, causing the node it runs on to exit suddenly. Read more about pathological regular expressions and how to avoid them.
For instance, a pattern like:

"(([a-z]+)(\d*))"

when matched against:

"abc123def456"

would produce the tokens: [ abc123, abc, 123, def456, def, 456 ]
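How the capture groups map to those tokens can be sketched with Python's `re` module (an illustration only: the filter itself uses Java regular expressions, but group semantics are the same for this pattern):

```python
import re

# The same pattern as above; every capture group becomes a candidate token.
pattern = re.compile(r"(([a-z]+)(\d*))")
text = "abc123def456"

tokens = []
for match in pattern.finditer(text):
    # keep each group that captured non-empty text
    tokens.extend(g for g in match.groups() if g)

print(tokens)  # ['abc123', 'abc', '123', 'def456', 'def', '456']
```

The pattern matches twice ("abc123" and "def456"), and each match contributes its outer group plus the letter and digit sub-groups.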
If preserve_original is set to true (the default), then it would also emit the original token: abc123def456.
This is particularly useful for indexing text like camel-case code, e.g. stripHTML, where a user may search for "strip html" or "striphtml":
resp = client.indices.create(
    index="test",
    settings={
        "analysis": {
            "filter": {
                "code": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": [
                        "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                        "(\\d+)"
                    ]
                }
            },
            "analyzer": {
                "code": {
                    "tokenizer": "pattern",
                    "filter": ["code", "lowercase"]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'test',
  body: {
    settings: {
      analysis: {
        filter: {
          code: {
            type: 'pattern_capture',
            preserve_original: true,
            patterns: [
              '(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)',
              '(\\d+)'
            ]
          }
        },
        analyzer: {
          code: {
            tokenizer: 'pattern',
            filter: ['code', 'lowercase']
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "test",
  settings: {
    analysis: {
      filter: {
        code: {
          type: "pattern_capture",
          preserve_original: true,
          patterns: ["(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)", "(\\d+)"],
        },
      },
      analyzer: {
        code: {
          tokenizer: "pattern",
          filter: ["code", "lowercase"],
        },
      },
    },
  },
});
console.log(response);
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "code": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
            "(\\d+)"
          ]
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "pattern",
          "filter": ["code", "lowercase"]
        }
      }
    }
  }
}
When used to analyze the text
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml
this emits the tokens: [ import, static, org, apache, commons, lang, stringescapeutils, string, escape, utils, escapehtml, escape, html ]
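The effect of the code filter's pattern list on a single camel-case token can be approximated in plain Python (a sketch only: Python's re module has no \p{...} property classes, so ASCII ranges stand in for the Unicode properties, and the real filter runs Java regular expressions):

```python
import re

# ASCII stand-ins for the Unicode classes in the filter definition:
# \p{Ll}+ -> [a-z]+,  \p{Lu}\p{Ll}+ -> [A-Z][a-z]+,  \p{Lu}+ -> [A-Z]+
patterns = [re.compile(r"([a-z]+|[A-Z][a-z]+|[A-Z]+)"), re.compile(r"(\d+)")]

def analyze(token, preserve_original=True):
    out = [token] if preserve_original else []
    for pattern in patterns:
        for match in pattern.finditer(token):
            out.extend(g for g in match.groups() if g)
    # the analyzer chain ends with a `lowercase` filter
    return [t.lower() for t in out]

print(analyze("StringEscapeUtils"))
# ['stringescapeutils', 'string', 'escape', 'utils']
```

Each pattern is applied independently to the token, which is why the camel-case sub-words appear alongside the preserved original.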
Another example is analyzing email addresses:
resp = client.indices.create(
    index="test",
    settings={
        "analysis": {
            "filter": {
                "email": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": [
                        "([^@]+)",
                        "(\\p{L}+)",
                        "(\\d+)",
                        "@(.+)"
                    ]
                }
            },
            "analyzer": {
                "email": {
                    "tokenizer": "uax_url_email",
                    "filter": ["email", "lowercase", "unique"]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'test',
  body: {
    settings: {
      analysis: {
        filter: {
          email: {
            type: 'pattern_capture',
            preserve_original: true,
            patterns: [
              '([^@]+)',
              '(\\p{L}+)',
              '(\\d+)',
              '@(.+)'
            ]
          }
        },
        analyzer: {
          email: {
            tokenizer: 'uax_url_email',
            filter: ['email', 'lowercase', 'unique']
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "test",
  settings: {
    analysis: {
      filter: {
        email: {
          type: "pattern_capture",
          preserve_original: true,
          patterns: ["([^@]+)", "(\\p{L}+)", "(\\d+)", "@(.+)"],
        },
      },
      analyzer: {
        email: {
          tokenizer: "uax_url_email",
          filter: ["email", "lowercase", "unique"],
        },
      },
    },
  },
});
console.log(response);
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": ["email", "lowercase", "unique"]
        }
      }
    }
  }
}
When the above analyzer is used on an email address like:

[email protected]

it would produce the following tokens:

[email protected], john-smith_123, john, smith, 123, foo-bar.com, foo, bar, com
Multiple patterns are required to allow overlapping captures, but this also means that the patterns are less dense and easier to understand.
NOTE: All tokens are emitted in the same position, and with the same character offsets. This means, for example, that a match query for [email protected] which uses this analyzer will return documents containing any of these tokens, even when using the and operator. Also, when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

<em>[email protected]</em>

and not:

john-<em>smith</em>[email protected]
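The match-query behaviour described in this note could be exercised with a request along these lines (a sketch only, assuming documents were indexed into test with an email field that uses the analyzer above):

GET test/_search
{
  "query": {
    "match": {
      "email": {
        "query": "[email protected]",
        "operator": "and"
      }
    }
  }
}

Because every captured token shares the position of the original address, the and operator still matches documents that contain any one of the captured tokens.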