Pattern capture token filter

Unlike the pattern tokenizer, the pattern_capture token filter emits a token for every capture group in the regular expression. Patterns are not anchored to the start and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.

Beware of pathological regular expressions

The pattern capture token filter uses Java regular expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.
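As an illustration of the risk, the classic nested-quantifier pattern `(a+)+$` backtracks exponentially on inputs that almost match. The sketch below demonstrates this with Python's `re` engine, which, like Java's, is backtracking-based; the pattern and sizes are illustrative only:

```python
import re
import time

# Nested quantifiers force exponential backtracking when the match fails.
pat = re.compile(r"(a+)+$")

for n in (14, 18, 22):
    s = "a" * n + "b"  # the trailing "b" guarantees the match fails
    start = time.perf_counter()
    assert pat.search(s) is None
    # Each additional "a" roughly doubles the number of backtracking states.
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```

Rewriting the pattern without the nested quantifier (e.g. `a+$`) removes the blow-up entirely.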

For example, a pattern like:

"(([a-z]+)(\d*))"

when matched against:

"abc123def456"

would produce the tokens: [ abc123, abc, 123, def456, def, 456 ]
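The behaviour can be sketched in plain Python with `re.finditer` (a hypothetical simulation for illustration, not the Lucene implementation):

```python
import re

def pattern_capture(token, patterns):
    """Emit one token per capture group of every match of every pattern."""
    tokens = []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            tokens.extend(g for g in match.groups() if g)
    return tokens

print(pattern_capture("abc123def456", [r"(([a-z]+)(\d*))"]))
# ['abc123', 'abc', '123', 'def456', 'def', '456']
```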

If preserve_original is set to true (the default), it will also emit the original token: abc123def456

This is particularly useful for indexing text such as camelCase code, e.g. stripHTML, where a user may search for "strip html" or "striphtml":

resp = client.indices.create(
    index="test",
    settings={
        "analysis": {
            "filter": {
                "code": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": [
                        "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                        "(\\d+)"
                    ]
                }
            },
            "analyzer": {
                "code": {
                    "tokenizer": "pattern",
                    "filter": [
                        "code",
                        "lowercase"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'test',
  body: {
    settings: {
      analysis: {
        filter: {
          code: {
            type: 'pattern_capture',
            preserve_original: true,
            patterns: [
              '(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)',
              '(\\d+)'
            ]
          }
        },
        analyzer: {
          code: {
            tokenizer: 'pattern',
            filter: [
              'code',
              'lowercase'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "test",
  settings: {
    analysis: {
      filter: {
        code: {
          type: "pattern_capture",
          preserve_original: true,
          patterns: ["(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)", "(\\d+)"],
        },
      },
      analyzer: {
        code: {
          tokenizer: "pattern",
          filter: ["code", "lowercase"],
        },
      },
    },
  },
});
console.log(response);
PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "code" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                  "(\\d+)"
               ]
            }
         },
         "analyzer" : {
            "code" : {
               "tokenizer" : "pattern",
               "filter" : [ "code", "lowercase" ]
            }
         }
      }
   }
}

When used to analyze the text:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml

the above analyzer produces the following tokens: [ import, static, org, apache, commons, lang, stringescapeutils, string, escape, utils, escapehtml, escape, html ]
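This analysis chain can be approximated in plain Python. This is a sketch only: the ASCII classes stand in for the Unicode properties `\p{Ll}`/`\p{Lu}` (which Python's `re` module does not support), and `re.split` stands in for the pattern tokenizer:

```python
import re

# ASCII stand-ins for (\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}+) and (\d+)
CODE_PATTERNS = [r"([a-z]+|[A-Z][a-z]+|[A-Z]+)", r"(\d+)"]

def analyze_code(text):
    tokens = []
    for word in re.split(r"\W+", text):           # rough 'pattern' tokenizer
        if not word:
            continue
        emitted = [word]                          # preserve_original
        for pattern in CODE_PATTERNS:
            for match in re.finditer(pattern, word):
                if match.group(1) != word:        # skip captures equal to the whole token
                    emitted.append(match.group(1))
        tokens.extend(t.lower() for t in emitted) # 'lowercase' filter
    return tokens

print(analyze_code("import static org.apache.commons.lang.StringEscapeUtils.escapeHtml"))
```

Running it reproduces the token list above, e.g. `StringEscapeUtils` yields stringescapeutils, string, escape, utils.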

Another example is analyzing email addresses:

resp = client.indices.create(
    index="test",
    settings={
        "analysis": {
            "filter": {
                "email": {
                    "type": "pattern_capture",
                    "preserve_original": True,
                    "patterns": [
                        "([^@]+)",
                        "(\\p{L}+)",
                        "(\\d+)",
                        "@(.+)"
                    ]
                }
            },
            "analyzer": {
                "email": {
                    "tokenizer": "uax_url_email",
                    "filter": [
                        "email",
                        "lowercase",
                        "unique"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'test',
  body: {
    settings: {
      analysis: {
        filter: {
          email: {
            type: 'pattern_capture',
            preserve_original: true,
            patterns: [
              '([^@]+)',
              '(\\p{L}+)',
              '(\\d+)',
              '@(.+)'
            ]
          }
        },
        analyzer: {
          email: {
            tokenizer: 'uax_url_email',
            filter: [
              'email',
              'lowercase',
              'unique'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "test",
  settings: {
    analysis: {
      filter: {
        email: {
          type: "pattern_capture",
          preserve_original: true,
          patterns: ["([^@]+)", "(\\p{L}+)", "(\\d+)", "@(.+)"],
        },
      },
      analyzer: {
        email: {
          tokenizer: "uax_url_email",
          filter: ["email", "lowercase", "unique"],
        },
      },
    },
  },
});
console.log(response);
PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "email" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "([^@]+)",
                  "(\\p{L}+)",
                  "(\\d+)",
                  "@(.+)"
               ]
            }
         },
         "analyzer" : {
            "email" : {
               "tokenizer" : "uax_url_email",
               "filter" : [ "email", "lowercase",  "unique" ]
            }
         }
      }
   }
}
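Once the index exists, the analyzer can be checked directly with the `_analyze` API (the request format follows the console examples above; the sample address is illustrative):

```
GET test/_analyze
{
  "analyzer": "email",
  "text": "john-smith_123@foo-bar.com"
}
```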

When the above analyzer is used on an email address like:

john-smith_123@foo-bar.com

it would produce the following tokens:

john-smith_123@foo-bar.com, john-smith_123,
john, smith, 123, foo-bar.com, foo, bar, com
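As with the camelCase example, the chain can be sketched in plain Python. Again a rough simulation: `[A-Za-z]` approximates `\p{L}`, and token order differs from the Lucene filter, so the result is best compared as a set:

```python
import re

# ASCII stand-in for (\p{L}+) in the second pattern
EMAIL_PATTERNS = [r"([^@]+)", r"([A-Za-z]+)", r"(\d+)", r"@(.+)"]

def analyze_email(token):
    tokens = [token]                      # preserve_original
    for pattern in EMAIL_PATTERNS:
        for match in re.finditer(pattern, token):
            tokens.append(match.group(1))
    seen, out = set(), []                 # 'lowercase' + 'unique' filters
    for t in (t.lower() for t in tokens):
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(analyze_email("john-smith_123@foo-bar.com"))
```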

Multiple patterns are required to allow overlapping captures, but also mean that the patterns are less dense and easier to understand.

NOTE: All tokens are emitted in the same position, and with the same character offsets. This means, for example, that a match query for john-smith_123@foo-bar.com using this analyzer will return documents containing any of these tokens, even when using the and operator. Also, when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

  <em>john-smith_123@foo-bar.com</em>

and not:

  john-<em>smith</em>_123@foo-bar.com