Stop token filter

Removes stop words from a token stream.

When not customized, the filter removes the following English stop words by default:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

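In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or a file.

The stop filter uses Lucene's StopFilter.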

Example

The following analyze API request uses the stop filter to remove the stop words a and the from a quick fox jumps over the lazy dog:

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "stop"
    ],
    text="a quick fox jumps over the lazy dog",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'stop'
    ],
    text: 'a quick fox jumps over the lazy dog'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["stop"],
  text: "a quick fox jumps over the lazy dog",
});
console.log(response);

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}

The filter produces the following tokens:

[ quick, fox, jumps, over, lazy, dog ]
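
To make the behavior concrete, the following is a minimal Python sketch that mimics the filter on a list of tokens. It is an illustration only, not Lucene's StopFilter implementation: the names DEFAULT_ENGLISH_STOP_WORDS and remove_stop_words are hypothetical, the word set is copied from the default list above, and str.split() stands in for the tokenizer.

```python
# Illustrative only: a simplified stand-in for the stop filter,
# not Lucene's StopFilter. All names here are hypothetical.
DEFAULT_ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
})


def remove_stop_words(tokens, stop_words=DEFAULT_ENGLISH_STOP_WORDS):
    """Return the tokens that are not in stop_words (case-sensitive match)."""
    return [token for token in tokens if token not in stop_words]


# str.split() stands in for the tokenizer (an assumption made for brevity).
tokens = "a quick fox jumps over the lazy dog".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Note that over is kept because it is not in the default English list.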

Add to an analyzer

The following create index API request uses the stop filter to configure a new custom analyzer.

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "stop"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'whitespace',
            filter: [
              'stop'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "whitespace",
          filter: ["stop"],
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}

Configurable parameters

stopwords

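(Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.

Each language value corresponds to a predefined list of stop words in Lucene. For the supported language values and their stop words, see Stop words by language.

Also accepts an array of stop words.

For an empty list of stop words, use _none_.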
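
The three accepted forms of stopwords can be sketched as plain filter definitions. This is a sketch under assumptions: the filter names thai_stop, custom_stop, and no_stop are hypothetical, and the resulting dictionary is meant to be passed as the settings argument of a create index request like the ones shown elsewhere on this page.

```python
import json

# Hypothetical filter names; each entry shows one accepted form of `stopwords`.
stop_filters = {
    # A language value selects a predefined stop word list.
    "thai_stop": {"type": "stop", "stopwords": "_thai_"},
    # An array lists the stop words explicitly.
    "custom_stop": {"type": "stop", "stopwords": ["and", "is", "the"]},
    # _none_ yields an empty stop word list.
    "no_stop": {"type": "stop", "stopwords": "_none_"},
}

# Index settings fragment containing the three filter definitions.
settings = {"analysis": {"filter": stop_filters}}
print(json.dumps(settings, indent=2))
```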

stopwords_path

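(Optional, string) Path to a file that contains a list of stop words to remove.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.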
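
The file layout described above can be sketched in Python as follows. This only illustrates the expected format (UTF-8, one stop word per line); it is not how the filter itself loads the file. The file name example_stopwords.txt and the helper load_stop_words are hypothetical, and skipping blank lines is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Read a UTF-8 file with one stop word per line, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Write a small example file into a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    stop_file = Path(tmp_dir) / "example_stopwords.txt"  # hypothetical name
    stop_file.write_text("and\nis\nthe\n", encoding="utf-8")
    loaded = load_stop_words(stop_file)

print(loaded)
```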

ignore_case

(Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, the stop word the matches and removes The, THE, or the. Defaults to false.
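
The effect of ignore_case can be illustrated with a small Python sketch. The function filter_with_case_option is a hypothetical stand-in for the filter, not its actual implementation, and lowercasing is used here as a simple approximation of case-insensitive matching.

```python
def filter_with_case_option(tokens, stop_words, ignore_case=False):
    """Drop stop words from tokens, optionally ignoring letter case."""
    if ignore_case:
        lowered = {word.lower() for word in stop_words}
        return [token for token in tokens if token.lower() not in lowered]
    return [token for token in tokens if token not in stop_words]


tokens = ["The", "quick", "THE", "fox", "the"]

# Default behavior: only the exact form "the" is removed.
case_sensitive = filter_with_case_option(tokens, {"the"})

# With ignore_case enabled: The, THE, and the are all removed.
case_insensitive = filter_with_case_option(tokens, {"the"}, ignore_case=True)

print(case_sensitive)
print(case_insensitive)
```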

remove_trailing

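(Optional, Boolean) If true, the last token of a stream is removed if it is a stop word. Defaults to true.

Set this parameter to false when using the filter with a completion suggester. This ensures that a query like green a matches and suggests green apple while still removing other stop words.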
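
The following Python sketch illustrates the idea behind remove_trailing under a simplifying assumption: the final token is kept whenever remove_trailing is false, and every other stop word is removed as usual. The function filter_with_trailing_option is hypothetical, and the real filter may take additional details of the input into account.

```python
def filter_with_trailing_option(tokens, stop_words, remove_trailing=True):
    """Drop stop words; optionally keep the final token even if it is one."""
    kept = []
    last_index = len(tokens) - 1
    for index, token in enumerate(tokens):
        is_last = index == last_index
        # A stop word is dropped unless it is the last token and
        # remove_trailing is disabled.
        if token in stop_words and (remove_trailing or not is_last):
            continue
        kept.append(token)
    return kept


stop_words = {"a", "the"}
query = ["the", "green", "a"]

with_trailing_removed = filter_with_trailing_option(query, stop_words)
with_trailing_kept = filter_with_trailing_option(
    query, stop_words, remove_trailing=False
)

print(with_trailing_removed)
print(with_trailing_kept)
```

Keeping the trailing a leaves a prefix that a suggester can still complete to a word such as apple, while the leading the is removed in both cases.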

Customize

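To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list: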

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "my_custom_stop_words_filter"
                    ]
                }
            },
            "filter": {
                "my_custom_stop_words_filter": {
                    "type": "stop",
                    "ignore_case": True
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["my_custom_stop_words_filter"],
        },
      },
      filter: {
        my_custom_stop_words_filter: {
          type: "stop",
          ignore_case: true,
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}

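You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words and, is, and the: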

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "my_custom_stop_words_filter"
                    ]
                }
            },
            "filter": {
                "my_custom_stop_words_filter": {
                    "type": "stop",
                    "ignore_case": True,
                    "stopwords": [
                        "and",
                        "is",
                        "the"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'my_custom_stop_words_filter'
            ]
          }
        },
        filter: {
          my_custom_stop_words_filter: {
            type: 'stop',
            ignore_case: true,
            stopwords: [
              'and',
              'is',
              'the'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["my_custom_stop_words_filter"],
        },
      },
      filter: {
        my_custom_stop_words_filter: {
          type: "stop",
          ignore_case: true,
          stopwords: ["and", "is", "the"],
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}