Keyword repeat token filter

Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed.

The keyword_repeat filter assigns keyword tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.

You can use the keyword_repeat filter with a stemmer token filter to output a stemmed and unstemmed version of each token in a stream.

To work properly, the keyword_repeat filter must be listed before any stemmer token filters in the analyzer configuration.

Stemming does not affect all tokens. This means streams could contain duplicate tokens in the same position, even after stemming.

To remove these duplicate tokens, add the remove_duplicates filter after the stemmer filter in the analyzer configuration, as sketched below.
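For example, the resulting filter chain might look like the following, sketched with the elasticsearch-ruby client used elsewhere on this page. This is a minimal illustration only: the index name my-stem-index and analyzer name my_stemming_analyzer are hypothetical, and the generic stemmer filter stands in for any stemmer token filter. The Add to an analyzer section below shows the same pattern with porter_stem.

response = client.indices.create(
  index: 'my-stem-index',          # hypothetical index name
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_stemming_analyzer: {  # hypothetical analyzer name
            tokenizer: 'standard',
            filter: [
              'keyword_repeat',    # first: emit a keyword copy of each token
              'stemmer',           # then: stem only the non-keyword copies
              'remove_duplicates'  # last: drop identical tokens at the same position
            ]
          }
        }
      }
    }
  }
)
puts response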

The keyword_repeat filter uses Lucene's KeywordRepeatFilter.

Example

The following analyze API request uses the keyword_repeat filter to output a keyword and non-keyword version of each token in fox running and jumping.

To return the keyword attribute for these tokens, the analyze API request also includes the following arguments:

  • explain: true
  • attributes: keyword

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'keyword_repeat'
    ],
    text: 'fox running and jumping',
    explain: true,
    attributes: 'keyword'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note that one version of each token has a keyword attribute of true:

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

To stem the non-keyword tokens, add the stemmer filter after the keyword_repeat filter in the previous analyze API request.

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'keyword_repeat',
      'stemmer'
    ],
    text: 'fox running and jumping',
    explain: true,
    attributes: 'keyword'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note the following changes:

  • The non-keyword version of running was stemmed to run.
  • The non-keyword version of jumping was stemmed to jump.

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": ...
      },
      {
        "name": "stemmer",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": false
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": false
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jump",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

However, the keyword and non-keyword versions of fox and and are identical and in the same respective positions.

To remove these duplicate tokens, add the remove_duplicates filter after stemmer in the analyze API request.

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'keyword_repeat',
      'stemmer',
      'remove_duplicates'
    ],
    text: 'fox running and jumping',
    explain: true,
    attributes: 'keyword'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}

The API returns the following response. Note that the duplicate tokens for fox and and have been removed:

Response
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": ...,
    "tokenfilters": [
      {
        "name": "keyword_repeat",
        "tokens": ...
      },
      {
        "name": "stemmer",
        "tokens": ...
      },
      {
        "name": "remove_duplicates",
        "tokens": [
          {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0,
            "keyword": true
          },
          {
            "token": "running",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": true
          },
          {
            "token": "run",
            "start_offset": 4,
            "end_offset": 11,
            "type": "word",
            "position": 1,
            "keyword": false
          },
          {
            "token": "and",
            "start_offset": 12,
            "end_offset": 15,
            "type": "word",
            "position": 2,
            "keyword": true
          },
          {
            "token": "jumping",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": true
          },
          {
            "token": "jump",
            "start_offset": 16,
            "end_offset": 23,
            "type": "word",
            "position": 3,
            "keyword": false
          }
        ]
      }
    ]
  }
}

Add to an analyzer

The following create index API request uses the keyword_repeat filter to configure a new custom analyzer.

This custom analyzer uses the keyword_repeat and porter_stem filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens from the stream.

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_custom_analyzer: {
            tokenizer: 'standard',
            filter: [
              'keyword_repeat',
              'porter_stem',
              'remove_duplicates'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "porter_stem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
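
To verify the new analyzer, you can run the analyze API against the index and name the custom analyzer. This verification request is a sketch following the Ruby examples above, not part of the original create index example:

# Analyze sample text with the custom analyzer defined on my-index-000001.
response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_custom_analyzer',
    text: 'fox running and jumping'
  }
)
puts response

Based on the responses earlier on this page, this should produce the tokens fox, running, run, and, jumping, and jump, with the duplicate fox and and tokens removed.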