
Length token filter


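Removes tokens shorter or longer than the specified character lengths. For example, you can use the length filter to exclude tokens shorter than 2 characters and tokens longer than 5 characters.

This filter uses Lucene's LengthFilter.

The length filter removes entire tokens. If you would rather shorten tokens to a specific length, use the truncate filter.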
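The difference between removing and shortening tokens can be illustrated with a minimal pure-Python sketch. The helper names `length_filter` and `truncate_filter` are hypothetical and only mimic the behavior described above; they are not part of Elasticsearch or its clients. Python's `len` counts code points, which may not match Lucene's character count for every script.

```python
def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def truncate_filter(tokens, length):
    """Shorten every token to at most `length` characters; drop nothing."""
    return [t[:length] for t in tokens]


tokens = "the quick brown fox".split()

removed = length_filter(tokens, max_len=4)     # over-long tokens disappear
shortened = truncate_filter(tokens, length=4)  # over-long tokens are cut

print(removed)    # ['the', 'fox']
print(shortened)  # ['the', 'quic', 'brow', 'fox']
```

The first result loses two tokens entirely, while the second keeps all four and only trims them.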

Example

The following analyze API request uses the length filter to remove tokens longer than 4 characters.

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "length",
            "min": 0,
            "max": 4
        }
    ],
    text="the quick brown fox jumps over the lazy dog",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'length',
        min: 0,
        max: 4
      }
    ],
    text: 'the quick brown fox jumps over the lazy dog'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "length",
      min: 0,
      max: 4,
    },
  ],
  text: "the quick brown fox jumps over the lazy dog",
});
console.log(response);

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "length",
      "min": 0,
      "max": 4
    }
  ],
  "text": "the quick brown fox jumps over the lazy dog"
}

The filter produces the following tokens:

[ the, fox, over, the, lazy, dog ]

Add to an analyzer

The following create index API request uses the length filter to configure a new custom analyzer.

resp = client.indices.create(
    index="length_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_length": {
                    "tokenizer": "standard",
                    "filter": [
                        "length"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'length_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_length: {
            tokenizer: 'standard',
            filter: [
              'length'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "length_example",
  settings: {
    analysis: {
      analyzer: {
        standard_length: {
          tokenizer: "standard",
          filter: ["length"],
        },
      },
    },
  },
});
console.log(response);

PUT length_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_length": {
          "tokenizer": "standard",
          "filter": [ "length" ]
        }
      }
    }
  }
}


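Configurable parameters

min: (Optional, integer) Minimum character length of a token. Shorter tokens are excluded from the output. Defaults to 0.
max: (Optional, integer) Maximum character length of a token. Longer tokens are excluded from the output. Defaults to Integer.MAX_VALUE, which is 2^31-1 or 2147483647.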
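A small sketch can make the bounds concrete. It assumes both limits are inclusive, which is consistent with the earlier example, where the 4-character tokens over and lazy survive a max of 4. The function `keep_token` and the constant `INTEGER_MAX_VALUE` are hypothetical names used only for illustration.

```python
INTEGER_MAX_VALUE = 2**31 - 1  # 2147483647, the documented default for max


def keep_token(token, min_len=0, max_len=INTEGER_MAX_VALUE):
    """Return True if the token's length is within the inclusive bounds."""
    return min_len <= len(token) <= max_len


# With the defaults, no realistic token is excluded.
print(keep_token("antidisestablishmentarianism"))  # True

# With min 2 and max 5, only the boundary and interior lengths pass.
for tok in ["a", "ab", "abcde", "abcdef"]:
    print(tok, keep_token(tok, min_len=2, max_len=5))
# a False
# ab True
# abcde True
# abcdef False
```

Because the defaults accept every length, a length filter with no parameters set leaves the token stream unchanged in practice.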

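Customize

To customize the length filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom length filter that removes tokens shorter than 2 characters and tokens longer than 10 characters.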

resp = client.indices.create(
    index="length_custom_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_length_2_to_10_char": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "length_2_to_10_char"
                    ]
                }
            },
            "filter": {
                "length_2_to_10_char": {
                    "type": "length",
                    "min": 2,
                    "max": 10
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'length_custom_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          "whitespace_length_2_to_10_char": {
            tokenizer: 'whitespace',
            filter: [
              'length_2_to_10_char'
            ]
          }
        },
        filter: {
          "length_2_to_10_char": {
            type: 'length',
            min: 2,
            max: 10
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "length_custom_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_length_2_to_10_char: {
          tokenizer: "whitespace",
          filter: ["length_2_to_10_char"],
        },
      },
      filter: {
        length_2_to_10_char: {
          type: "length",
          min: 2,
          max: 10,
        },
      },
    },
  },
});
console.log(response);

PUT length_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_length_2_to_10_char": {
          "tokenizer": "whitespace",
          "filter": [ "length_2_to_10_char" ]
        }
      },
      "filter": {
        "length_2_to_10_char": {
          "type": "length",
          "min": 2,
          "max": 10
        }
      }
    }
  }
}
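One way to check that the custom analyzer behaves as intended is to run sample text through it with the analyze API, passing the index and analyzer name. The following is a sketch under stated assumptions: the helper `analyze_tokens` is hypothetical, `client` is assumed to be an already connected Elasticsearch Python client like the one in the examples above, and the response is assumed to expose a `tokens` list whose entries carry a `token` field.

```python
def analyze_tokens(client, index, analyzer, text):
    """Run `text` through an index's analyzer and return the token strings."""
    resp = client.indices.analyze(index=index, analyzer=analyzer, text=text)
    return [entry["token"] for entry in resp["tokens"]]


# Usage, assuming `client` is connected and length_custom_example exists:
# analyze_tokens(
#     client,
#     "length_custom_example",
#     "whitespace_length_2_to_10_char",
#     "a quick extraordinarily brown fox",
# )
```

With min set to 2 and max set to 10, the tokens a (1 character) and extraordinarily (15 characters) would be expected to be absent from the result.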
