Pattern analyzer

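The pattern analyzer uses a regular expression to split text into terms. The regular expression should match the token separators, not the tokens themselves. The regular expression defaults to \W+ (all non-word characters).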

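Beware of pathological regular expressions

The pattern analyzer uses Java regular expressions.

A badly written regular expression can run very slowly or even throw a StackOverflowError, causing the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.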
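
To make the risk concrete, the following sketch times a classic pathological pattern, (a+)+$, against inputs that almost match. It uses Python's re module rather than the Java engine the analyzer runs on, so it only illustrates the general backtracking behaviour: timings depend on the machine, and the StackOverflowError failure mode is specific to Java. The pattern and the helper name time_match are illustrative assumptions, not part of the analyzer.

```python
import re
import time

# A classic pathological pattern: nested quantifiers give the engine
# exponentially many ways to split a run of "a" characters.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_match(n):
    """Match n 'a' characters followed by 'b'; return (matched, seconds)."""
    text = "a" * n + "b"  # the trailing "b" forces the match to fail
    start = time.perf_counter()
    matched = PATHOLOGICAL.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in (10, 14, 18):
        matched, seconds = time_match(n)
        print(f"n={n:2d} matched={matched} seconds={seconds:.4f}")
```

A pattern without nested quantifiers, such as a+$, rejects the same inputs without this blow-up.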

Example output

resp = client.indices.analyze(
    analyzer="pattern",
    text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
)
print(resp)
response = client.indices.analyze(
  body: {
    analyzer: 'pattern',
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }
)
puts response
const response = await client.indices.analyze({
  analyzer: "pattern",
  text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
});
console.log(response);
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The sentence above would produce the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
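
The behaviour above can be approximated locally without a cluster. The sketch below is not the analyzer's implementation: it uses Python's re module in place of Java regular expressions, and the function name pattern_analyze is hypothetical. It splits on the default \W+ separator, drops empty pieces, and lowercases what remains.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text on every match of pattern and optionally lowercase the pieces."""
    # re.ASCII keeps \W limited to ASCII word characters, which is how
    # Java's \W behaves unless UNICODE_CHARACTER_CLASS is set.
    pieces = re.split(pattern, text, flags=re.ASCII)
    # Separators at the start or end of the text leave empty strings; drop them.
    terms = [piece for piece in pieces if piece]
    return [term.lower() for term in terms] if lowercase else terms


if __name__ == "__main__":
    print(pattern_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
    # ['the', '2', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the apostrophe in "dog's" is a non-word character, which is why "dog" and "s" come out as separate terms.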

Configuration

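The pattern analyzer accepts the following parameters:

pattern

A Java regular expression. Defaults to \W+.

flags

Java regular expression flags. Flags should be pipe-separated, for example "CASE_INSENSITIVE|COMMENTS".

lowercase

Whether terms should be lowercased. Defaults to true.

stopwords

A predefined stop words list such as _english_, or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.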
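
As a sketch of how these parameters fit together, the helper below assembles a settings body for a single pattern analyzer. The helper name pattern_analyzer_settings and the analyzer name my_pattern_analyzer are hypothetical; only the parameter names and example values come from the list above. It builds a plain dictionary and does not contact a cluster.

```python
def pattern_analyzer_settings(name, pattern=r"\W+", flags=(), lowercase=True,
                              stopwords="_none_"):
    """Build an index settings body that defines one pattern analyzer."""
    analyzer = {
        "type": "pattern",
        "pattern": pattern,
        "lowercase": lowercase,
        "stopwords": stopwords,
    }
    if flags:
        # Flags are passed as a single pipe-separated string.
        analyzer["flags"] = "|".join(flags)
    return {"analysis": {"analyzer": {name: analyzer}}}


settings = pattern_analyzer_settings(
    "my_pattern_analyzer",  # hypothetical analyzer name
    pattern=r"\W+",
    flags=["CASE_INSENSITIVE", "COMMENTS"],
    stopwords="_english_",
)
print(settings)
```

The resulting dictionary has the same shape as the settings argument passed to client.indices.create in the Python examples in this document.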

Example configuration

In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lowercase the result.

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_email_analyzer": {
                    "type": "pattern",
                    "pattern": "\\W|_",
                    "lowercase": True
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_email_analyzer",
    text="John_Smith@foo-bar.com",
)
print(resp1)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_email_analyzer: {
            type: 'pattern',
            pattern: '\\W|_',
            lowercase: true
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_email_analyzer',
    text: 'John_Smith@foo-bar.com'
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_email_analyzer: {
          type: "pattern",
          pattern: "\\W|_",
          lowercase: true,
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_email_analyzer",
  text: "John_Smith@foo-bar.com",
});
console.log(response1);
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}

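When specifying the pattern as a JSON string, the backslashes in the pattern need to be escaped.

The above example produces the following terms: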

[ john, smith, foo, bar, com ]
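
The escaping rule can be checked locally. The sketch below decodes a JSON analyzer definition, shows that the doubled backslash becomes a single one, and then applies the decoded pattern with Python's re module as a stand-in for the Java engine. The sample address is assumed here so that the sketch is self-contained; it is chosen to be consistent with the terms listed above.

```python
import json
import re

# Analyzer definition as it appears in JSON: the backslash is doubled.
raw_body = r'{"type": "pattern", "pattern": "\\W|_", "lowercase": true}'

config = json.loads(raw_body)
pattern = config["pattern"]
print(pattern)  # \W|_  (a single backslash once the JSON is decoded)

# Assumed sample input, consistent with the terms listed above.
text = "John_Smith@foo-bar.com"
terms = [t.lower() if config["lowercase"] else t
         for t in re.split(pattern, text, flags=re.ASCII) if t]
print(terms)  # ['john', 'smith', 'foo', 'bar', 'com']
```

Going the other way, json.dumps doubles the backslash again, which is why client libraries that serialize a dictionary for you only need the pattern written as an ordinary string in the host language.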

CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "camel": {
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="camel",
    text="MooseX::FTPClass2_beta",
)
print(resp1)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          camel: {
            type: 'pattern',
            pattern: '([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'camel',
    text: 'MooseX::FTPClass2_beta'
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        camel: {
          type: "pattern",
          pattern:
            "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "camel",
  text: "MooseX::FTPClass2_beta",
});
console.log(response1);
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my-index-000001/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

The above example produces the following terms:

[ moose, x, ftp, class, 2, beta ]

The regular expression above is easier to understand as:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
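
The alternatives can be tried out locally with the following sketch. It is an ASCII-only approximation, not the pattern itself: Python's built-in re module supports neither \p{L} property classes nor && class intersection, so [A-Z] and [a-z] are substituted, and the first alternative is left uncaptured because re.split would otherwise return the captured separators. The names CAMEL_SPLIT and camel_terms are hypothetical.

```python
import re

# ASCII-only approximation of the CamelCase pattern: [A-Z] stands in for
# \p{Lu} and [a-z] for "letters that are not upper case".
CAMEL_SPLIT = re.compile(
    r"""
      [^A-Za-z\d]+                # swallow non letters and numbers,
    | (?<=\D)(?=\d)               # or non-number followed by number,
    | (?<=\d)(?=\D)               # or number followed by non-number,
    | (?<=[a-z])(?=[A-Z])         # or lower case followed by upper case,
    | (?<=[A-Z])(?=[A-Z][a-z])    # or upper case followed by upper then lower case
    """,
    re.VERBOSE,
)


def camel_terms(text):
    """Split text at CAMEL_SPLIT matches and lowercase the pieces."""
    # Splitting on empty (lookaround-only) matches needs Python 3.7 or later.
    return [piece.lower() for piece in CAMEL_SPLIT.split(text) if piece]


if __name__ == "__main__":
    print(camel_terms("MooseX::FTPClass2_beta"))
    # ['moose', 'x', 'ftp', 'class', '2', 'beta']
```

Four of the five alternatives are zero-width lookarounds: they mark a boundary between two characters without consuming either, which is how "FTPClass" is cut into "FTP" and "Class" without losing a letter.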

Definition

The pattern analyzer consists of:

Tokenizer: Pattern Tokenizer
Token Filters: Lower Case Token Filter, Stop Token Filter (disabled by default)

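If you need to customize the pattern analyzer beyond the configuration parameters, you need to recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in pattern analyzer, and you can use it as a starting point for further customization.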

resp = client.indices.create(
    index="pattern_example",
    settings={
        "analysis": {
            "tokenizer": {
                "split_on_non_word": {
                    "type": "pattern",
                    "pattern": "\\W+"
                }
            },
            "analyzer": {
                "rebuilt_pattern": {
                    "tokenizer": "split_on_non_word",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'pattern_example',
  body: {
    settings: {
      analysis: {
        tokenizer: {
          split_on_non_word: {
            type: 'pattern',
            pattern: '\\W+'
          }
        },
        analyzer: {
          rebuilt_pattern: {
            tokenizer: 'split_on_non_word',
            filter: [
              'lowercase'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "pattern_example",
  settings: {
    analysis: {
      tokenizer: {
        split_on_non_word: {
          type: "pattern",
          pattern: "\\W+",
        },
      },
      analyzer: {
        rebuilt_pattern: {
          tokenizer: "split_on_non_word",
          filter: ["lowercase"],
        },
      },
    },
  },
});
console.log(response);
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" 
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}

The default pattern is \W+, which splits on non-word characters; this is where you would change it.

You would add other token filters after lowercase.
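
As a sketch of that customization, the helper below builds the same settings body with additional token filters appended after lowercase. The helper name rebuilt_pattern_settings is hypothetical, and asciifolding is only an illustrative choice of extra filter; the tokenizer and analyzer names are the ones used in the example above. It builds a plain dictionary and does not contact a cluster.

```python
def rebuilt_pattern_settings(extra_filters=()):
    """Settings for a custom analyzer that mirrors the built-in pattern analyzer."""
    return {
        "analysis": {
            "tokenizer": {
                # Change this pattern to split on something other than \W+.
                "split_on_non_word": {"type": "pattern", "pattern": r"\W+"}
            },
            "analyzer": {
                "rebuilt_pattern": {
                    "tokenizer": "split_on_non_word",
                    # Extra token filters run after lowercase, in list order.
                    "filter": ["lowercase", *extra_filters],
                }
            },
        }
    }


# "asciifolding" is only an illustrative choice of additional filter.
settings = rebuilt_pattern_settings(["asciifolding"])
print(settings["analysis"]["analyzer"]["rebuilt_pattern"]["filter"])
# ['lowercase', 'asciifolding']
```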