Hunspell token filter

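Provides dictionary stemming based on a supplied Hunspell dictionary. The hunspell filter requires one or more language-specific Hunspell dictionaries to be configured.

This filter uses Lucene's HunspellStemFilter.

If one is available, we recommend trying an algorithmic stemmer for your language before using the hunspell token filter. In practice, algorithmic stemmers typically outperform dictionary stemmers. See Dictionary stemmers.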

Configure Hunspell dictionaries

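Hunspell dictionaries are stored in, and detected from, a dedicated hunspell directory on the filesystem: <$ES_PATH_CONF>/hunspell. Each dictionary is expected to have its own directory, named after its associated language and locale (for example, pt_BR or en_GB). This dictionary directory is expected to hold a single .aff file and one or more .dic files, all of which are picked up automatically. For example, the following directory layout defines the en_US dictionary: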

- config
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
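
As a rough pre-flight check, the layout rules above (a single .aff file and at least one .dic file per locale directory) can be sketched in Python. The function name check_hunspell_locale_dir is hypothetical and not part of Elasticsearch; it only inspects file names and does not validate the dictionary contents.

```python
from pathlib import Path


def check_hunspell_locale_dir(locale_dir):
    """Return a list of layout problems for one locale directory (empty if none)."""
    locale_dir = Path(locale_dir)
    if not locale_dir.is_dir():
        return [f"{locale_dir} is not a directory"]

    problems = []
    aff_files = sorted(locale_dir.glob("*.aff"))
    dic_files = sorted(locale_dir.glob("*.dic"))
    # The documented layout expects a single .aff file ...
    if len(aff_files) != 1:
        problems.append(f"expected exactly one .aff file, found {len(aff_files)}")
    # ... and one or more .dic files.
    if not dic_files:
        problems.append("expected at least one .dic file, found none")
    return problems
```

For example, check_hunspell_locale_dir("config/hunspell/en_US") returns an empty list when the directory matches the layout shown above.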

Each dictionary can be configured with one setting:

ignore_case

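(Static, Boolean) If true, dictionary matching is case insensitive. Defaults to false.

This setting can be configured globally in elasticsearch.yml using indices.analysis.hunspell.dictionary.ignore_case.

To configure the setting for a specific locale, use the indices.analysis.hunspell.dictionary.<locale>.ignore_case setting (for example, for the en_US (American English) locale, the setting is indices.analysis.hunspell.dictionary.en_US.ignore_case).

You can also add a settings.yml file under the dictionary directory that holds these settings. This overrides any other ignore_case settings defined in elasticsearch.yml.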
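
To make the two setting names concrete, here is a small Python sketch that builds the global and the locale-specific key and formats each as an elasticsearch.yml line. The helper names ignore_case_setting and yaml_line are hypothetical; only the setting names themselves come from the text above.

```python
PREFIX = "indices.analysis.hunspell.dictionary"


def ignore_case_setting(locale=None):
    """Build the ignore_case setting name, global or for a single locale."""
    if locale is None:
        return f"{PREFIX}.ignore_case"
    return f"{PREFIX}.{locale}.ignore_case"


def yaml_line(locale=None, value=True):
    """Format one elasticsearch.yml line; YAML booleans are lowercase."""
    return f"{ignore_case_setting(locale)}: {str(value).lower()}"


print(yaml_line())          # global setting
print(yaml_line("en_US"))   # setting for the en_US locale only
```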

Example

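The following analyze API request uses the hunspell filter to stem the foxes jumping quickly to the fox jump quick.

The request specifies the en_US locale, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used for the Hunspell dictionary.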

resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "hunspell",
            "locale": "en_US"
        }
    ],
    text="the foxes jumping quickly",
)
print(resp)

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "hunspell",
      locale: "en_US",
    },
  ],
  text: "the foxes jumping quickly",
});
console.log(response);

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hunspell",
      "locale": "en_US"
    }
  ],
  "text": "the foxes jumping quickly"
}

The filter produces the following tokens:

[ the, fox, jump, quick ]

Configurable parameters

dictionary

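(Optional, string or array of strings) One or more .dic files (for example, en_US.dic, my_custom.dic) to use for the Hunspell dictionary.

By default, the hunspell filter uses all .dic files in the <$ES_PATH_CONF>/hunspell/<locale> directory specified using the lang, language, or locale parameter.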

dedup

(Optional, Boolean) If true, duplicate tokens are removed from the filter's output. Defaults to true.

lang

(Required*, string) An alias for the locale parameter.

If this parameter is not specified, the language or locale parameter is required.

language

(Required*, string) An alias for the locale parameter.

If this parameter is not specified, the lang or locale parameter is required.

locale

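(Required*, string) Locale directory used to specify the .aff and .dic files for a Hunspell dictionary. See Configure Hunspell dictionaries.

If this parameter is not specified, the lang or language parameter is required.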

longest_only

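(Optional, Boolean) If true, only the longest stemmed version of each token is included in the output. If false, all stemmed versions of the token are included. Defaults to false.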
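
The parameter rules above can be summarized in a short Python sketch that assembles a filter definition as a plain dict and enforces the "one of lang, language, or locale" requirement on the client side. The helper build_hunspell_filter is hypothetical and not part of any Elasticsearch client; it does not model what Elasticsearch does when more than one of the aliases is set.

```python
def build_hunspell_filter(locale=None, lang=None, language=None,
                          dictionary=None, dedup=True, longest_only=False):
    """Return a hunspell token filter definition as a plain dict."""
    # locale, lang and language are aliases; at least one must be provided.
    aliases = {"locale": locale, "lang": lang, "language": language}
    given = {name: value for name, value in aliases.items() if value is not None}
    if not given:
        raise ValueError("one of locale, lang, or language is required")

    definition = {"type": "hunspell", **given}
    if dictionary is not None:
        # A single .dic file name or a list of .dic file names.
        definition["dictionary"] = dictionary
    # Only emit values that differ from the documented defaults.
    if dedup is not True:
        definition["dedup"] = dedup
    if longest_only:
        definition["longest_only"] = True
    return definition
```

Calling build_hunspell_filter(locale="en_US", dedup=False) yields a dict with the keys type, locale, and dedup, which can be placed under a filter name in an index's analysis settings.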

Customize and add to an analyzer

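To customize the hunspell filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom hunspell filter, my_en_US_dict_stemmer, to configure a new custom analyzer.

The my_en_US_dict_stemmer filter uses a locale of en_US, meaning that the .aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used. The filter also sets the dedup parameter to false, meaning that duplicate tokens added from the dictionary are not removed from the filter's output.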

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "en": {
                    "tokenizer": "standard",
                    "filter": [
                        "my_en_US_dict_stemmer"
                    ]
                }
            },
            "filter": {
                "my_en_US_dict_stemmer": {
                    "type": "hunspell",
                    "locale": "en_US",
                    "dedup": False
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          en: {
            tokenizer: 'standard',
            filter: [
              'my_en_US_dict_stemmer'
            ]
          }
        },
        filter: {
          "my_en_US_dict_stemmer": {
            type: 'hunspell',
            locale: 'en_US',
            dedup: false
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        en: {
          tokenizer: "standard",
          filter: ["my_en_US_dict_stemmer"],
        },
      },
      filter: {
        my_en_US_dict_stemmer: {
          type: "hunspell",
          locale: "en_US",
          dedup: false,
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
      "filter": {
        "my_en_US_dict_stemmer": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": false
        }
      }
    }
  }
}
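
Once the index exists, one way to try the en analyzer is to run text through the analyze API against that index and collect the token strings. The following is a minimal sketch, assuming an already connected Python client object is passed in; the helper name stem_with_index_analyzer is hypothetical.

```python
def stem_with_index_analyzer(client, index, analyzer, text):
    """Analyze text with an analyzer defined on an index; return token strings."""
    # The analyze API returns a "tokens" list; each entry carries the token
    # text under the "token" key, alongside offsets and position.
    resp = client.indices.analyze(index=index, analyzer=analyzer, text=text)
    return [entry["token"] for entry in resp["tokens"]]
```

With a connected client, stem_with_index_analyzer(client, "my-index-000001", "en", "the foxes jumping quickly") returns the stemmed tokens. The exact output depends on the installed en_US dictionary, and because dedup is false it may contain repeated tokens.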

Settings

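In addition to the ignore_case settings, you can configure the following global settings for the hunspell filter using elasticsearch.yml:

indices.analysis.hunspell.dictionary.lazy: (Static, Boolean) If true, the loading of Hunspell dictionaries is deferred until a dictionary is used. If false, the dictionary directory is checked for dictionaries when the node starts, and any dictionaries found are loaded automatically. Defaults to false.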