Keep words token filter

Keeps only tokens contained in a specified word list.

This filter uses Lucene's KeepWordFilter.

To remove a list of words from a token stream, use the stop filter.

Example

The following analyze API request uses the keep filter to keep only the fox and dog tokens from the quick fox jumps over the lazy dog.

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "keep",
            "keep_words": [
                "dog",
                "elephant",
                "fox"
            ]
        }
    ],
    text="the quick fox jumps over the lazy dog",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'keep',
        keep_words: [
          'dog',
          'elephant',
          'fox'
        ]
      }
    ],
    text: 'the quick fox jumps over the lazy dog'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "keep",
      keep_words: ["dog", "elephant", "fox"],
    },
  ],
  text: "the quick fox jumps over the lazy dog",
});
console.log(response);
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "dog", "elephant", "fox" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}

The filter produces the following tokens:

[ fox, dog ]
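
To make the result above easier to follow, here is a minimal Python sketch of the filtering step. It is not Lucene's KeepWordFilter implementation: the function name keep_filter is hypothetical, and str.split() is only a rough stand-in for the whitespace tokenizer.

```python
def keep_filter(tokens, keep_words):
    """Return only the tokens found in keep_words, preserving their order."""
    keep = set(keep_words)  # set lookup keeps the membership test cheap
    return [token for token in tokens if token in keep]


text = "the quick fox jumps over the lazy dog"
tokens = text.split()  # assumption: approximates the whitespace tokenizer
kept = keep_filter(tokens, ["dog", "elephant", "fox"])
print(kept)  # ['fox', 'dog']
```

Note that elephant is in the word list but never appears in the text, so it has no effect on the output.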

Configurable parameters

keep_words

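(Required*, array of strings) List of words to keep. Only tokens that match a word in this list are included in the output.

Either this parameter or keep_words_path must be specified.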

keep_words_path

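(Required*, string) Path to a file that contains a list of words to keep. Only tokens that match a word in this list are included in the output.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.

Either this parameter or keep_words must be specified.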

keep_words_case
(Optional, Boolean) If true, lowercase all keep words. Defaults to false.
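
The file rules for keep_words_path described above (absolute or config-relative path, UTF-8, one word per line) can be sketched in Python as follows. This is an illustration only, not how Elasticsearch loads the file: load_keep_words is a hypothetical helper, the file contents are made up for the demo, and skipping blank lines is an assumption of the sketch.

```python
import tempfile
from pathlib import Path


def load_keep_words(path, config_dir):
    """Read a keep-words file: UTF-8 encoded, one word per line."""
    p = Path(path)
    if not p.is_absolute():
        p = Path(config_dir) / p  # relative paths resolve against the config directory
    words = []
    for line in p.read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word:  # assumption: blank lines are ignored
            words.append(word)
    return words


# Demo with a temporary directory standing in for the config location.
with tempfile.TemporaryDirectory() as config_dir:
    word_file = Path(config_dir) / "analysis" / "example_word_list.txt"
    word_file.parent.mkdir(parents=True)
    word_file.write_text("one\ntwo\nthree\n", encoding="utf-8")
    print(load_keep_words("analysis/example_word_list.txt", config_dir))
```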

Customize and add to an analyzer

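To customize the keep filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses custom keep filters to configure two new custom analyzers:

  • standard_keep_word_array, which uses a custom keep filter with an inline array of keep words
  • standard_keep_word_file, which uses a custom keep filter with a keep words file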
resp = client.indices.create(
    index="keep_words_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_keep_word_array": {
                    "tokenizer": "standard",
                    "filter": [
                        "keep_word_array"
                    ]
                },
                "standard_keep_word_file": {
                    "tokenizer": "standard",
                    "filter": [
                        "keep_word_file"
                    ]
                }
            },
            "filter": {
                "keep_word_array": {
                    "type": "keep",
                    "keep_words": [
                        "one",
                        "two",
                        "three"
                    ]
                },
                "keep_word_file": {
                    "type": "keep",
                    "keep_words_path": "analysis/example_word_list.txt"
                }
            }
        }
    },
)
print(resp)
const response = await client.indices.create({
  index: "keep_words_example",
  settings: {
    analysis: {
      analyzer: {
        standard_keep_word_array: {
          tokenizer: "standard",
          filter: ["keep_word_array"],
        },
        standard_keep_word_file: {
          tokenizer: "standard",
          filter: ["keep_word_file"],
        },
      },
      filter: {
        keep_word_array: {
          type: "keep",
          keep_words: ["one", "two", "three"],
        },
        keep_word_file: {
          type: "keep",
          keep_words_path: "analysis/example_word_list.txt",
        },
      },
    },
  },
});
console.log(response);
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
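
If you create several such analyzers, the settings body above can be assembled programmatically. The following is a minimal sketch, not part of any client library: keep_analyzer_settings is a hypothetical helper, and rejecting requests that set both keep_words and keep_words_path is an assumption based on the "either ... or" wording of the parameter descriptions.

```python
def keep_analyzer_settings(analyzer_name, filter_name,
                           keep_words=None, keep_words_path=None):
    """Build index settings for one standard-tokenizer analyzer with a keep filter."""
    # Assumption: exactly one of the two word sources must be given.
    if (keep_words is None) == (keep_words_path is None):
        raise ValueError("specify exactly one of keep_words or keep_words_path")

    keep_filter = {"type": "keep"}
    if keep_words is not None:
        keep_filter["keep_words"] = list(keep_words)
    else:
        keep_filter["keep_words_path"] = keep_words_path

    return {
        "analysis": {
            "analyzer": {
                analyzer_name: {"tokenizer": "standard", "filter": [filter_name]}
            },
            "filter": {filter_name: keep_filter},
        }
    }


settings = keep_analyzer_settings(
    "standard_keep_word_array", "keep_word_array",
    keep_words=["one", "two", "three"],
)
print(settings)
```

The returned dictionary has the same shape as the settings argument passed to client.indices.create in the request above.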