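Fingerprint analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token. If a stop word list is configured, stop words are also removed.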
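The steps described above can be approximated in a few lines of plain Python. This is only an illustrative sketch, not the analyzer's actual implementation: the regular expression stands in for the standard tokenizer, Unicode decomposition stands in for ASCII folding, and the function names `fold_to_ascii` and `fingerprint` are hypothetical.

```python
import re
import unicodedata


def fold_to_ascii(token: str) -> str:
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str, stopwords=frozenset(), separator: str = " ") -> str:
    """Lowercase, fold, drop stop words, deduplicate, sort, and join."""
    # Assumption: runs of word characters approximate the standard tokenizer.
    tokens = re.findall(r"\w+", text)
    tokens = [fold_to_ascii(token.lower()) for token in tokens]
    tokens = [token for token in tokens if token not in stopwords]
    # A set removes duplicates; sorting gives a stable order for the single output token.
    return separator.join(sorted(set(tokens)))


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```

Because the result no longer depends on word order, case, or repeated words, two differently written versions of the same phrase can collapse to one value, which is what makes the fingerprint useful for clustering.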
Example output
resp = client.indices.analyze(
    analyzer="fingerprint",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp)

response = client.indices.analyze(
  body: {
    analyzer: 'fingerprint',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "fingerprint",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response);

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence produces the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
The fingerprint analyzer accepts the following parameters:
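| separator | The character used to concatenate the terms. Defaults to a space. |
| max_output_size | The maximum token size to emit. Defaults to 255. |
| stopwords | A pre-defined stop words list, such as _english_. |
| stopwords_path | The path to a file containing stop words. |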
See the Stop Token Filter for more information about stop word configuration.
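To show how these parameters fit together, here is a minimal sketch that builds an index settings body as a plain Python dictionary. The helper name `fingerprint_analyzer_settings` is hypothetical, and the `+` separator and size limit of 100 are arbitrary illustration values rather than recommendations.

```python
import json


def fingerprint_analyzer_settings(name, separator=" ", max_output_size=255, stopwords=None):
    """Build a settings body for a configured fingerprint analyzer (hypothetical helper)."""
    analyzer = {
        "type": "fingerprint",
        "separator": separator,
        "max_output_size": max_output_size,
    }
    # Only include a stop word list when one was requested.
    if stopwords is not None:
        analyzer["stopwords"] = stopwords
    return {"settings": {"analysis": {"analyzer": {name: analyzer}}}}


body = fingerprint_analyzer_settings(
    "my_fingerprint_analyzer",
    separator="+",            # arbitrary example value
    max_output_size=100,      # arbitrary example value
    stopwords="_english_",
)
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the request body when creating an index.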
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "type": "fingerprint",
                    "stopwords": "_english_"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_fingerprint_analyzer",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp1)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_fingerprint_analyzer: {
            type: 'fingerprint',
            stopwords: '_english_'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_fingerprint_analyzer',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_fingerprint_analyzer: {
          type: "fingerprint",
          stopwords: "_english_",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_fingerprint_analyzer",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response1);

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]
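Definition
The fingerprint analyzer consists of the standard tokenizer followed by the lowercase, asciifolding, and fingerprint token filters, as reflected in the rebuilt definition at the end of this section.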
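If you need to customize the fingerprint analyzer beyond the configuration parameters, you need to recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in fingerprint analyzer, and you can use it as a starting point for further customization: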
resp = client.indices.create(
    index="fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "rebuilt_fingerprint": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "fingerprint"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          rebuilt_fingerprint: {
            tokenizer: 'standard',
            filter: [
              'lowercase',
              'asciifolding',
              'fingerprint'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        rebuilt_fingerprint: {
          tokenizer: "standard",
          filter: ["lowercase", "asciifolding", "fingerprint"],
        },
      },
    },
  },
});
console.log(response);

PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
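As one possible customization of the rebuilt analyzer, the sketch below adds a stop token filter ahead of the fingerprint filter. It only constructs the settings body as a Python dictionary; the filter name `english_stop` and the helper `rebuilt_fingerprint_with_stop` are hypothetical names chosen for illustration.

```python
import json


def rebuilt_fingerprint_with_stop(stopwords="_english_"):
    """Settings body for the rebuilt analyzer plus a stop filter (hypothetical helper)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Named stop filter; "english_stop" is an illustrative name.
                    "english_stop": {"type": "stop", "stopwords": stopwords}
                },
                "analyzer": {
                    "rebuilt_fingerprint": {
                        "tokenizer": "standard",
                        # Stop words are removed before the terms are sorted and joined.
                        "filter": [
                            "lowercase",
                            "asciifolding",
                            "english_stop",
                            "fingerprint",
                        ],
                    }
                },
            }
        }
    }


print(json.dumps(rebuilt_fingerprint_with_stop(), indent=2))
```

Placing the stop filter after lowercasing and folding means the stop word list is matched against already normalized terms, and placing it before the fingerprint filter keeps the removed words out of the concatenated token.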