Pattern tokenizer

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.
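
A rough way to see this default in action is to approximate it with Python's re module (a sketch only; Python's \W is close to, but not identical to, Java's, and the real tokenizer does more than a bare split):

import re

text = "The foo_bar_size's default is 5."

# Split on runs of non-word characters, mimicking the default \W+ pattern.
# Underscores count as word characters, so foo_bar_size survives intact,
# while the apostrophe in "size's" splits it into "size" and "s".
tokens = [t for t in re.split(r"\W+", text) if t]
print(tokens)  # ['The', 'foo_bar_size', 's', 'default', 'is', '5']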

Beware of pathological regular expressions

The pattern tokenizer uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Read more about pathological regular expressions and how to avoid them.
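
To make the danger concrete, the classic catastrophically backtracking pattern (a+)+$ can be timed with Python's re module, whose backtracking engine behaves much like Java's in this respect (a standalone sketch; this pattern is a textbook example, not one taken from this page):

import re
import time

# On "aaa...ab", (a+)+$ forces the engine to try exponentially many ways
# of partitioning the a's between the inner and outer quantifiers before
# it can conclude that the match fails.
pattern = re.compile(r"(a+)+$")

for n in (20, 22, 24):
    subject = "a" * n + "b"
    start = time.perf_counter()
    pattern.match(subject)  # always fails, but only after heavy backtracking
    print(f"n={n}: {time.perf_counter() - start:.2f}s")

# Each increase of n by 2 roughly quadruples the runtime, the hallmark of
# exponential backtracking; modest inputs can pin a CPU for minutes.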

Example output

Python:

resp = client.indices.analyze(
    tokenizer="pattern",
    text="The foo_bar_size's default is 5.",
)
print(resp)

Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: 'pattern',
    text: "The foo_bar_size's default is 5."
  }
)
puts response

JavaScript:

const response = await client.indices.analyze({
  tokenizer: "pattern",
  text: "The foo_bar_size's default is 5.",
});
console.log(response);

Console:

POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}

The above sentence would produce the following terms:

[ The, foo_bar_size, s, default, is, 5 ]

Configuration

The pattern tokenizer accepts the following parameters:

pattern

A Java regular expression. Defaults to \W+.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

group

The capture group to extract as tokens. Defaults to -1 (split). The difference between splitting and capturing is sketched below.
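
The split-versus-capture distinction can be approximated with Python's re module (an analogy only, not the tokenizer's implementation):

import re

text = "comma,separated,values"

# group = -1 (the default): the pattern marks the separators, and the
# text between matches becomes the tokens.
print([t for t in re.split(",", text) if t])
# ['comma', 'separated', 'values']

# group = 1: the first capture group of each match becomes the token,
# and everything outside the matches is discarded.
print(re.findall(r'"([^"]+)"', '"one", "two"'))
# ['one', 'two']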

Example configuration

In this example, we configure the pattern tokenizer to break text into tokens when it encounters commas:

Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": ","
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="comma,separated,values",
)
print(resp1)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: ','
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'comma,separated,values'
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "pattern",
          pattern: ",",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "comma,separated,values",
});
console.log(response1);

Console:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

The above example produces the following terms:

[ comma, separated, values ]

In the next example, we configure the pattern tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes \"). The regex itself looks like this:

"((?:\\"|[^"]|\\")*)"

and reads as follows (a runnable check of this reading follows the list):

  • A literal "
  • Start capturing:

    • A literal \" OR any character other than "
    • Repeat until no more characters match
  • A literal closing "
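
That reading can be checked by running the same expression through Python's re module (Java and Python regex syntax agree for this pattern; collecting group 1 from every match plays the role of the tokenizer's group setting):

import re

# One level of escaping only: this is the pattern exactly as the regex
# engine sees it, before any JSON escaping is applied.
pattern = re.compile(r'"((?:\\"|[^"]|\\")+)"')

text = 'The "value", "value with embedded \\" quote" example'
print(pattern.findall(text))
# ['value', 'value with embedded \\" quote']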

When the pattern is specified in JSON, the " and \ characters need to be escaped, so the pattern ends up looking like:

\"((?:\\\\\"|[^\"]|\\\\\")+)\"
Python:

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
                    "group": 1
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="\"value\", \"value with embedded \\\" quote\"",
)
print(resp1)

Ruby:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: '"((?:\\\"|[^"]|\\\")+)"',
            group: 1
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: '"value", "value with embedded \" quote"'
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "pattern",
          pattern: '"((?:\\\\"|[^"]|\\\\")+)"',
          group: 1,
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: '"value", "value with embedded \\" quote"',
});
console.log(response1);

Console:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}

The above example produces the following two terms:

[ value, value with embedded \" quote ]