Stop token filter

Removes stop words from a token stream.

When not customized, the filter removes the following English stop words by default:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

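In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or a file.

The stop filter uses Lucene's StopFilter.

Example

The following analyze API request uses the stop filter to remove the stop words a and the from a quick fox jumps over the lazy dog: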

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'stop'
    ],
    text: 'a quick fox jumps over the lazy dog'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}

The filter produces the following tokens:

[ quick, fox, jumps, over, lazy, dog ]
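
To make the default behavior concrete, the following is a minimal Ruby sketch that models the filter offline. It is not Lucene's StopFilter implementation: the `remove_stop_words` helper is hypothetical, and splitting on whitespace only stands in for the standard tokenizer, which is an assumption that happens to hold for this simple input.

```ruby
require 'set'

# Default English stop words, copied from the list on this page.
ENGLISH_STOP_WORDS = Set.new(%w[
  a an and are as at be but by for if in into is it no not of on or such
  that the their then there these they this to was will with
]).freeze

# Hypothetical helper: drops every token that appears in the stop word set.
def remove_stop_words(tokens, stop_words = ENGLISH_STOP_WORDS)
  tokens.reject { |token| stop_words.include?(token) }
end

# Whitespace splitting is a stand-in for the standard tokenizer (assumption).
tokens = 'a quick fox jumps over the lazy dog'.split
filtered = remove_stop_words(tokens)
puts filtered.inspect
# prints ["quick", "fox", "jumps", "over", "lazy", "dog"]
```

Matching here is exact and case sensitive, which mirrors the default of ignore_case being false.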

Add to an analyzer

The following create index API request uses the stop filter to configure a new custom analyzer.

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'whitespace',
            filter: [
              'stop'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}

Configurable parameters

stopwords

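(Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.

Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for the supported language values and their stop words.

Also accepts an array of stop words.

For an empty list of stop words, use _none_.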
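
The three accepted forms of stopwords can be sketched as filter definitions, written here as Ruby hashes in the style of the client examples on this page. The filter names `thai_stop`, `custom_stop`, and `no_stop` are hypothetical; only the `type` and `stopwords` keys come from the parameter description above.

```ruby
require 'json'

# Hypothetical filter definitions, one per accepted form of `stopwords`.
filters = {
  # A language value selects one of Lucene's predefined lists.
  thai_stop: { type: 'stop', stopwords: '_thai_' },
  # An array supplies an explicit list of stop words.
  custom_stop: { type: 'stop', stopwords: %w[and is the] },
  # _none_ yields an empty stop word list.
  no_stop: { type: 'stop', stopwords: '_none_' }
}

# Serialize to the JSON that would sit under settings.analysis.filter.
filters_json = JSON.pretty_generate(filters)
puts filters_json
```

A hash like this would be placed under `settings.analysis.filter` in a create index request body.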

stopwords_path

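(Optional, string) Path to a file that contains a list of stop words to remove.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.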
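
As an illustration of the expected file format, the sketch below writes a small UTF-8 file with one stop word per line and reads it back. The `load_stop_words` helper is hypothetical and only mimics the format described above. Trimming whitespace and skipping blank lines are choices made for this sketch, not documented behavior of the filter.

```ruby
require 'tempfile'

# Hypothetical helper: reads one stop word per line from a UTF-8 file.
# Stripping whitespace and dropping blank lines are assumptions of this sketch.
def load_stop_words(path)
  File.readlines(path, chomp: true, encoding: 'UTF-8')
      .map(&:strip)
      .reject(&:empty?)
end

# Write an example file: one stop word per line, separated by line breaks.
file = Tempfile.new(['stopwords', '.txt'])
file.write("and\nis\nthe\n")
file.close

stop_words = load_stop_words(file.path)
puts stop_words.inspect
# prints ["and", "is", "the"]

file.unlink
```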

ignore_case

(Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, a stop word of the matches and removes The, THE, or the. Defaults to false.
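
The effect of ignore_case can be modeled with a short Ruby sketch. The `remove_stop_words` helper and its `ignore_case:` keyword are hypothetical. Lowercasing both sides is one simple way to get case-insensitive matching and is not necessarily how Lucene does it.

```ruby
require 'set'

# Hypothetical helper: removes stop words, optionally ignoring case.
def remove_stop_words(tokens, stop_words, ignore_case: false)
  lookup = Set.new(ignore_case ? stop_words.map(&:downcase) : stop_words)
  tokens.reject do |token|
    lookup.include?(ignore_case ? token.downcase : token)
  end
end

tokens = %w[The quick THE fox the dog]

case_sensitive = remove_stop_words(tokens, ['the'])
case_insensitive = remove_stop_words(tokens, ['the'], ignore_case: true)

puts case_sensitive.inspect
# prints ["The", "quick", "THE", "fox", "dog"]
puts case_insensitive.inspect
# prints ["quick", "fox", "dog"]
```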

remove_trailing

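(Optional, Boolean) If true, the last token of a stream is removed if it is a stop word. Defaults to true.

This parameter should be false when using the filter with a completion suggester. This ensures that a query like green a matches and suggests green apple while still removing other stop words.

Customize

To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list: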
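
The green a case can be sketched in Ruby as follows. This is a simplified model under stated assumptions: the `remove_stop_words` helper, its `remove_trailing:` keyword, and the three-word stop list are hypothetical, and the real filter may apply further conditions when deciding whether to keep a trailing token.

```ruby
require 'set'

# Small illustrative stop word list (assumption, not the full default list).
STOP_WORDS = Set.new(%w[a an the]).freeze

# Hypothetical helper: removes stop words, but keeps a trailing stop word
# when remove_trailing is false.
def remove_stop_words(tokens, remove_trailing: true)
  last_index = tokens.length - 1
  tokens.each_with_index.select do |token, index|
    keep_trailing = index == last_index && !remove_trailing
    keep_trailing || !STOP_WORDS.include?(token)
  end.map(&:first)
end

query = 'the green a'.split

default_result = remove_stop_words(query)
suggester_result = remove_stop_words(query, remove_trailing: false)

puts default_result.inspect
# prints ["green"]
puts suggester_result.inspect
# prints ["green", "a"]
```

With the trailing a preserved, a prefix-style suggester still has a partial word to match against apple, while the leading the is removed as usual.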

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}

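You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words and, is, and the: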

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true,
            stopwords: [
              'and',
              'is',
              'the'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}