Pattern tokenizer
The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
The default pattern is \W+, which splits text whenever it encounters non-word characters.
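The default split-on-\W+ behavior can be approximated in plain Python; this is an illustrative sketch (Java and Python regex syntax agree for this simple pattern), not the tokenizer's actual implementation:

```python
import re

# Split on runs of non-word characters (\W+), as the default
# pattern tokenizer does, and drop any empty strings produced
# by leading/trailing separators.
text = "The foo_bar_size's default is 5."
tokens = [t for t in re.split(r"\W+", text) if t]
print(tokens)  # ['The', 'foo_bar_size', 's', 'default', 'is', '5']
```

Note that the apostrophe and the final period are non-word characters, so `foo_bar_size's` yields two terms and the trailing empty string after `.` is discarded.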
Beware of pathological regular expressions
The pattern tokenizer uses Java Regular Expressions.
A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
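As a minimal sketch of what "pathological" means here (Python's re engine backtracks much like Java's, so the effect reproduces outside the JVM):

```python
import re
import time

# Nested quantifiers in (a+)+ give the engine exponentially many
# ways to partition the input; on a near-miss it tries all of them
# before reporting failure on the trailing 'b'.
pathological = re.compile(r"(a+)+$")
text = "a" * 20 + "b"  # almost matches, forcing a full backtrack

start = time.perf_counter()
result = pathological.search(text)
elapsed = time.perf_counter() - start

print(result)  # None
print(f"{elapsed:.3f}s for only {len(text)} characters")
```

Each additional `a` roughly doubles the running time, which is why such a pattern can stall an analysis chain on long inputs.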
Example output

resp = client.indices.analyze(
    tokenizer="pattern",
    text="The foo_bar_size's default is 5.",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'pattern',
    text: "The foo_bar_size's default is 5."
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "pattern",
  text: "The foo_bar_size's default is 5.",
});
console.log(response);
POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}
The above sentence would produce the following terms:
[ The, foo_bar_size, s, default, is, 5 ]
Configuration
The pattern tokenizer accepts the following parameters:

pattern
    A Java regular expression, defaults to \W+.
flags
    Java regular expression flags. Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS".
group
    Which capture group to extract as tokens. Defaults to -1 (split).
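The group parameter is what switches the tokenizer between its two modes. The helper below, pattern_tokenize, is a hypothetical Python sketch for illustration only (it is not part of any Elasticsearch client, and Python's re stands in for Java regex):

```python
import re

def pattern_tokenize(text, pattern, group=-1):
    """Sketch of the two modes controlled by `group`:
    group == -1 -> the pattern marks separators (split mode);
    group >= 0  -> that capture group of each match is a token."""
    if group == -1:
        return [t for t in re.split(pattern, text) if t]
    return [m.group(group) for m in re.finditer(pattern, text)]

# Split mode: the pattern is the separator.
print(pattern_tokenize("comma,separated,values", ","))
# ['comma', 'separated', 'values']

# Capture mode: the pattern describes the tokens themselves.
print(pattern_tokenize('"a", "b"', r'"([^"]+)"', group=1))
# ['a', 'b']
```

Both modes are demonstrated with real configurations in the examples below.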
Example configuration
In this example, we configure the pattern tokenizer to break text into tokens when it encounters commas:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": ","
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="comma,separated,values",
)
print(resp1)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: ','
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'comma,separated,values'
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "pattern",
          pattern: ",",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: "comma,separated,values",
});
console.log(response1);
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
The above example produces the following terms:
[ comma, separated, values ]
In the next example, we configure the pattern tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes \"). The regex itself looks like this:

"((?:\\"|[^"]|\\")*)"

and reads as follows:
- A literal "
- Start capturing:
  - A literal \" OR any character except "
  - Repeat until no more characters match
- A literal closing "
When the pattern is specified as JSON, the " and \ characters need to be escaped, so the pattern ends up looking like:

\"((?:\\\\\"|[^\"]|\\\\\")+)\"
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
                    "group": 1
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    text="\"value\", \"value with embedded \\\" quote\"",
)
print(resp1)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'pattern',
            pattern: '"((?:\\\"|[^"]|\\\")+)"',
            group: 1
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: '"value", "value with embedded \" quote"'
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "my_tokenizer",
        },
      },
      tokenizer: {
        my_tokenizer: {
          type: "pattern",
          pattern: '"((?:\\\\"|[^"]|\\\\")+)"',
          group: 1,
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_analyzer",
  text: '"value", "value with embedded \\" quote"',
});
console.log(response1);
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
The above example produces the following two terms:
[ value, value with embedded \" quote ]