Fingerprint token filter

Sorts a token stream and removes duplicate tokens, then concatenates the stream into a single output token.

For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:

  1. Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ]
  2. Removes the duplicate instance of the very token.
  3. Concatenates the token stream into a single output token: [ fox quick the very was ]
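
The three steps above can be sketched in plain Python. This is only an illustration of the described behavior, not Lucene's implementation: the function name `fingerprint` is hypothetical, and Python's default string ordering stands in for the filter's sort.

```python
def fingerprint(tokens):
    """Illustrative model of the filter: sort, de-duplicate, concatenate."""
    # Step 1: sort the tokens (Python's default string order is an assumption).
    ordered = sorted(tokens)

    # Step 2: drop duplicates; after sorting, equal tokens are adjacent.
    deduped = []
    for token in ordered:
        if not deduped or deduped[-1] != token:
            deduped.append(token)

    # Step 3: join everything into a single output token.
    return " ".join(deduped)


print(fingerprint(["the", "fox", "was", "very", "very", "quick"]))
# prints "fox quick the very was"
```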

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text, as described in the OpenRefine project.
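
As a rough illustration of the clustering use case, the sketch below groups strings that share the same fingerprint. It is a simplified model under stated assumptions: the helper names `fingerprint` and `cluster_by_fingerprint` are hypothetical, tokens are split on whitespace only, and no lowercasing or punctuation stripping is applied, which a real clustering pipeline may well add.

```python
from collections import defaultdict


def fingerprint(text):
    """Hypothetical helper: sorted, de-duplicated, space-joined tokens."""
    return " ".join(sorted(set(text.split())))


def cluster_by_fingerprint(texts):
    """Group texts whose fingerprints are identical."""
    clusters = defaultdict(list)
    for text in texts:
        clusters[fingerprint(text)].append(text)
    return dict(clusters)


docs = [
    "quick brown fox",
    "fox quick brown",
    "brown fox quick quick",
    "lazy dog",
]
for key, members in cluster_by_fingerprint(docs).items():
    print(key, "->", members)
```

The first three strings differ only in word order and repetition, so they share the key `brown fox quick` and land in the same group.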

This filter uses Lucene's FingerprintFilter.

Example

The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        "fingerprint"
    ],
    text="zebra jumps over resting resting dog",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      'fingerprint'
    ],
    text: 'zebra jumps over resting resting dog'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: ["fingerprint"],
  text: "zebra jumps over resting resting dog",
});
console.log(response);

GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "zebra jumps over resting resting dog"
}

The filter produces the following token:

[ dog jumps over resting zebra ]

Add to an analyzer

The following create index API request uses the fingerprint filter to configure a new custom analyzer.

resp = client.indices.create(
    index="fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_fingerprint": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "fingerprint"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_fingerprint: {
            tokenizer: 'whitespace',
            filter: [
              'fingerprint'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_fingerprint: {
          tokenizer: "whitespace",
          filter: ["fingerprint"],
        },
      },
    },
  },
});
console.log(response);

PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}

Configurable parameters

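max_output_size
(Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. If the concatenated token is longer than this value, no token is output.
separator
(Optional, string) Character used to concatenate the token stream input. Defaults to a space.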
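
To make the interaction of the two parameters concrete, here is a minimal Python sketch of the behavior described above. It is a model under stated assumptions, not Lucene's code: the function name `fingerprint_with_params` is hypothetical, the length check is applied to the fully joined string, and "no token output" is represented as `None`.

```python
def fingerprint_with_params(tokens, max_output_size=255, separator=" "):
    """Illustrative model of the max_output_size and separator parameters."""
    # Sort and de-duplicate, then join with the configured separator.
    joined = separator.join(sorted(set(tokens)))

    # Assumption: an over-long result means no token is emitted at all.
    if len(joined) > max_output_size:
        return None
    return joined


tokens = "zebra jumps over resting resting dog".split()

# A custom separator with a generous size limit.
print(fingerprint_with_params(tokens, max_output_size=100, separator="+"))
# prints "dog+jumps+over+resting+zebra"

# The joined token is 28 characters long, so a limit of 10 yields no token.
print(fingerprint_with_params(tokens, max_output_size=10))
# prints "None"
```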

Customize

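To customize the fingerprint filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.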

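For example, the following request creates a custom fingerprint filter that uses + to concatenate token streams. The filter also limits output tokens to 100 characters or fewer.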

resp = client.indices.create(
    index="custom_fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "fingerprint_plus_concat"
                    ]
                }
            },
            "filter": {
                "fingerprint_plus_concat": {
                    "type": "fingerprint",
                    "max_output_size": 100,
                    "separator": "+"
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'custom_fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_: {
            tokenizer: 'whitespace',
            filter: [
              'fingerprint_plus_concat'
            ]
          }
        },
        filter: {
          fingerprint_plus_concat: {
            type: 'fingerprint',
            max_output_size: 100,
            separator: '+'
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "custom_fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_: {
          tokenizer: "whitespace",
          filter: ["fingerprint_plus_concat"],
        },
      },
      filter: {
        fingerprint_plus_concat: {
          type: "fingerprint",
          max_output_size: 100,
          separator: "+",
        },
      },
    },
  },
});
console.log(response);

PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}