Keep words token filter
Keeps only tokens contained in a specified list of words.

This filter uses Lucene's KeepWordFilter.

To remove a list of words from a token stream, use the stop token filter.
Example

The following analyze API request uses the keep filter to keep only the fox and dog tokens from the quick fox jumps over the lazy dog.
resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "keep",
            "keep_words": ["dog", "elephant", "fox"]
        }
    ],
    text="the quick fox jumps over the lazy dog",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'keep',
        keep_words: ['dog', 'elephant', 'fox']
      }
    ],
    text: 'the quick fox jumps over the lazy dog'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "keep",
      keep_words: ["dog", "elephant", "fox"],
    },
  ],
  text: "the quick fox jumps over the lazy dog",
});
console.log(response);
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "dog", "elephant", "fox" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}
The filter produces the following tokens:

[ fox, dog ]
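The example's semantics can be sketched outside Elasticsearch as simple set membership: split the text on whitespace, then keep only tokens that appear in the word list. This is a minimal approximation for illustration, not the Lucene implementation; `keep_filter` is a hypothetical helper name.

```python
def keep_filter(tokens, keep_words):
    # Keep only tokens that appear in the keep-word list,
    # preserving the original token order.
    keep = set(keep_words)
    return [t for t in tokens if t in keep]

# Whitespace tokenization, as in the analyze request above
tokens = "the quick fox jumps over the lazy dog".split()
print(keep_filter(tokens, ["dog", "elephant", "fox"]))  # ['fox', 'dog']
```

Note that `elephant` is in the keep list but produces no output token, because it never occurs in the input text.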
Configurable parameters

keep_words
    (Required*, array of strings) List of words to keep. Only tokens that match words in this list are included in the output.
    Either this parameter or keep_words_path must be specified.

keep_words_path
    (Required*, string) Path to a file that contains a list of words to keep. Only tokens that match words in this list are included in the output.
    This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.
    Either this parameter or keep_words must be specified.

keep_words_case
    (Optional, Boolean) If true, lowercase all keep words. Defaults to false.
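The effect of keep_words_case can be sketched as follows. One reasonable reading of "lowercase all keep words" is that matching becomes case-insensitive once the list is lowercased; this sketch encodes that assumption and is an approximation, not Elasticsearch code.

```python
def keep_filter(tokens, keep_words, keep_words_case=False):
    if keep_words_case:
        # Lowercase the keep-word list; compare tokens case-insensitively
        # (assumed interpretation of the keep_words_case parameter).
        keep = {w.lower() for w in keep_words}
        return [t for t in tokens if t.lower() in keep]
    keep = set(keep_words)
    return [t for t in tokens if t in keep]

print(keep_filter(["Dog", "fox", "cat"], ["DOG", "fox"], keep_words_case=True))
# ['Dog', 'fox']
print(keep_filter(["Dog", "fox", "cat"], ["DOG", "fox"]))
# ['fox']
```

With the default keep_words_case=false, "Dog" does not match the keep word "DOG", so only exact matches survive.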
Customize and add to an analyzer

To customize the keep filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses custom keep filters to configure two new custom analyzers:
- standard_keep_word_array, which uses a custom keep filter with an inline array of keep words
- standard_keep_word_file, which uses a custom keep filter with a keep words file
resp = client.indices.create(
    index="keep_words_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_keep_word_array": {
                    "tokenizer": "standard",
                    "filter": ["keep_word_array"]
                },
                "standard_keep_word_file": {
                    "tokenizer": "standard",
                    "filter": ["keep_word_file"]
                }
            },
            "filter": {
                "keep_word_array": {
                    "type": "keep",
                    "keep_words": ["one", "two", "three"]
                },
                "keep_word_file": {
                    "type": "keep",
                    "keep_words_path": "analysis/example_word_list.txt"
                }
            }
        }
    },
)
print(resp)
const response = await client.indices.create({
  index: "keep_words_example",
  settings: {
    analysis: {
      analyzer: {
        standard_keep_word_array: {
          tokenizer: "standard",
          filter: ["keep_word_array"],
        },
        standard_keep_word_file: {
          tokenizer: "standard",
          filter: ["keep_word_file"],
        },
      },
      filter: {
        keep_word_array: {
          type: "keep",
          keep_words: ["one", "two", "three"],
        },
        keep_word_file: {
          type: "keep",
          keep_words_path: "analysis/example_word_list.txt",
        },
      },
    },
  },
});
console.log(response);
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
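The file-based variant expects a UTF-8 file with one keep word per line, as described under keep_words_path. The loading step can be sketched in plain Python (a hypothetical helper mirroring the format of analysis/example_word_list.txt, written to a temporary file for illustration; not how Elasticsearch itself reads the file):

```python
import os
import tempfile

def load_keep_words(path):
    # keep_words_path files are UTF-8 encoded, one word per line
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Stand-in for analysis/example_word_list.txt
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("one\ntwo\nthree\n")
    path = f.name

words = load_keep_words(path)
tokens = [t for t in "one four three".split() if t in set(words)]
print(words)   # ['one', 'two', 'three']
print(tokens)  # ['one', 'three']
os.remove(path)
```

Keeping the word list in a file rather than inline is useful when the list is long or shared across indices, since the inline array is stored in the index settings themselves.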