Fingerprint token filter
Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.
For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows (a short sketch of these steps appears after the list):
- Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ]
- Removes a duplicate instance of the very token
- Concatenates the token stream to a single output token: [ fox quick the very was ]
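In pseudocode terms, the three steps amount to a sort, a dedupe, and a join. Here is a minimal Python sketch of that logic (an illustration only, not Lucene's actual implementation; it assumes pre-tokenized input and the default space separator):

# Illustrative sketch of the fingerprint steps above, not Lucene's FingerprintFilter.
tokens = ["the", "fox", "was", "very", "very", "quick"]

# Sort alphabetically, drop duplicates, then join with the default separator (a space).
fingerprint = " ".join(sorted(set(tokens)))

print(fingerprint)  # fox quick the very was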
Output tokens produced by this filter are useful for fingerprinting and clustering a body of text, as described in the OpenRefine project.
This filter uses Lucene's FingerprintFilter.
Example
The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:
Python:

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=["fingerprint"],
    text="zebra jumps over resting resting dog",
)
print(resp)

Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: ['fingerprint'],
    text: 'zebra jumps over resting resting dog'
  }
)
puts response

JavaScript:

const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: ["fingerprint"],
  text: "zebra jumps over resting resting dog",
});
console.log(response);

Console:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["fingerprint"],
  "text": "zebra jumps over resting resting dog"
}
The filter produces the following token:
[ dog jumps over resting zebra ]
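With the Python client, the single concatenated token can be read straight out of the analyze response (a minimal usage sketch; the response follows the standard analyze API shape, a list of token objects under tokens):

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=["fingerprint"],
    text="zebra jumps over resting resting dog",
)
# The fingerprint filter emits exactly one token.
print(resp["tokens"][0]["token"])  # dog jumps over resting zebra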
Add to an analyzer
The following create index API request uses the fingerprint filter to configure a new custom analyzer.
Python:

resp = client.indices.create(
    index="fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_fingerprint": {
                    "tokenizer": "whitespace",
                    "filter": ["fingerprint"]
                }
            }
        }
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_fingerprint: {
            tokenizer: 'whitespace',
            filter: ['fingerprint']
          }
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_fingerprint: {
          tokenizer: "whitespace",
          filter: ["fingerprint"],
        },
      },
    },
  },
});
console.log(response);

Console:

PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": ["fingerprint"]
        }
      }
    }
  }
}
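Once the index exists, you can try the new analyzer by name with the index-scoped analyze API; for example, with the Python client (a usage sketch assuming the fingerprint_example index created above):

# Run the custom analyzer against sample text.
resp = client.indices.analyze(
    index="fingerprint_example",
    analyzer="whitespace_fingerprint",
    text="zebra jumps over resting resting dog",
)
print(resp["tokens"][0]["token"])  # dog jumps over resting zebra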
Configurable parameters
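max_output_size
(Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. Concatenated tokens longer than this length result in no token output.

separator
(Optional, string) Character to use to concatenate the token stream input. Defaults to a space.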
Customize
To customize the fingerprint filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom fingerprint filter that uses + to concatenate the token stream. The filter also limits output tokens to 100 characters or fewer.
Python:

resp = client.indices.create(
    index="custom_fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_": {
                    "tokenizer": "whitespace",
                    "filter": ["fingerprint_plus_concat"]
                }
            },
            "filter": {
                "fingerprint_plus_concat": {
                    "type": "fingerprint",
                    "max_output_size": 100,
                    "separator": "+"
                }
            }
        }
    },
)
print(resp)

Ruby:

response = client.indices.create(
  index: 'custom_fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_: {
            tokenizer: 'whitespace',
            filter: ['fingerprint_plus_concat']
          }
        },
        filter: {
          fingerprint_plus_concat: {
            type: 'fingerprint',
            max_output_size: 100,
            separator: '+'
          }
        }
      }
    }
  }
)
puts response

JavaScript:

const response = await client.indices.create({
  index: "custom_fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_: {
          tokenizer: "whitespace",
          filter: ["fingerprint_plus_concat"],
        },
      },
      filter: {
        fingerprint_plus_concat: {
          type: "fingerprint",
          max_output_size: 100,
          separator: "+",
        },
      },
    },
  },
});
console.log(response);

Console:

PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_": {
          "tokenizer": "whitespace",
          "filter": ["fingerprint_plus_concat"]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
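To confirm the customized behavior, the same sample text can be run through the new analyzer (a hypothetical follow-up request, assuming the custom_fingerprint_example index above was created); the tokens come back joined with + rather than a space:

# Analyze sample text with the custom analyzer defined above.
resp = client.indices.analyze(
    index="custom_fingerprint_example",
    analyzer="whitespace_",
    text="zebra jumps over resting resting dog",
)
print(resp["tokens"][0]["token"])  # dog+jumps+over+resting+zebra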