Term vectors API
Retrieves information and statistics for terms in the fields of a particular document.
resp = client.termvectors(
    index="my-index-000001",
    id="1",
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
});
console.log(response);

GET /my-index-000001/_termvectors/1
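Request
GET /<index>/_termvectors/<_id>
Description
You can retrieve term vectors for documents stored in the index, or for artificial documents passed in the body of the request.
You can specify the fields you are interested in through the fields parameter, or by adding the fields to the request body.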
resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields="message",
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  fields: 'message'
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: "message",
});
console.log(response);

GET /my-index-000001/_termvectors/1?fields=message
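Fields can be specified using wildcards, similar to the multi match query.
Term vectors are real time by default, not near real time. This can be changed by setting the realtime parameter to false.
You can request three types of values: term information, term statistics, and field statistics. By default, all term information and field statistics are returned for all fields, but term statistics are excluded.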
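Term information
- term frequency in the field (always returned)
- term positions (positions: true)
- start and end offsets (offsets: true)
- term payloads (payloads: true), as base64 encoded bytes
If the requested information was not stored in the index, it is computed on the fly if possible. Term vectors can also be computed for documents that do not exist in the index at all and are instead provided by the user.
Start and end offsets assume UTF-16 encoding. If you want to use these offsets to recover the original text that produced a token, make sure the string you take the substring from is also encoded as UTF-16.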
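The UTF-16 note matters in languages whose strings are not indexed by UTF-16 code units. The following is a minimal Python sketch, not part of the API: the helper name slice_by_utf16_offsets and the emoji sample text are assumptions made for illustration.

```python
def slice_by_utf16_offsets(text: str, start_offset: int, end_offset: int) -> str:
    """Return the substring covered by UTF-16 code-unit offsets (hypothetical helper)."""
    # In UTF-16 little-endian without a BOM, every code unit is exactly 2 bytes.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start_offset:2 * end_offset].decode("utf-16-le")


# Offsets taken from the sample document text used in the examples on this page.
print(slice_by_utf16_offsets("test test test ", 5, 9))  # test

# An emoji outside the Basic Multilingual Plane takes two UTF-16 code units,
# so UTF-16 offsets and Python code point indices diverge after it.
text = "\U0001F600 test test"
print(slice_by_utf16_offsets(text, 3, 7))  # test
print(text[3:7])                           # 'est ' (code point slicing is off by one)
```

For text that contains only Basic Multilingual Plane characters, plain Python slicing gives the same result as the helper.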
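Term statistics
Setting term_statistics to true (default is false) will return:
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)
By default these values are not returned, because term statistics can have a serious performance impact.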
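Field statistics
Setting field_statistics to false (default is true) will omit:
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)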
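To make the definitions of the term and field statistics concrete, here is a small Python sketch that computes them by hand for the two sample text values used in the examples on this page. The analyze function is an assumed stand-in for a whitespace tokenizer followed by a lowercase filter; it is not how Elasticsearch computes these values internally.

```python
from collections import Counter


def analyze(text: str) -> list[str]:
    # Assumed stand-in for a whitespace tokenizer plus a lowercase filter.
    return text.lower().split()


# The "text" field values of the two sample documents.
docs = {"1": "test test test ", "2": "Another test ..."}

# Per-document term frequencies (term_freq).
term_freqs = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}

# Term statistics.
ttf = Counter()       # total term frequency across all documents
doc_freq = Counter()  # number of documents containing each term
for counts in term_freqs.values():
    ttf.update(counts)
    doc_freq.update(counts.keys())

# Field statistics.
doc_count = sum(1 for counts in term_freqs.values() if counts)
sum_doc_freq = sum(doc_freq.values())
sum_ttf = sum(ttf.values())

print(term_freqs["1"]["test"], doc_freq["test"], ttf["test"])  # 3 2 4
print(doc_count, sum_doc_freq, sum_ttf)                        # 2 4 6
```

These numbers agree with the term_freq, doc_freq, ttf, doc_count, sum_doc_freq and sum_ttf values in the sample response for document 1.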
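Terms filtering
With the filter parameter, the returned terms can also be filtered based on their tf-idf scores. This is useful for finding a good characteristic vector of a document. The feature works in a similar way to the second phase of the More Like This query. See the terms filtering example (example 5) for usage.
The following sub-parameters are supported:
| max_num_terms | Maximum number of terms that must be returned per field. Defaults to 25. |
| min_term_freq | Ignore words with less than this frequency in the source document. Defaults to 1. |
| max_term_freq | Ignore words with more than this frequency in the source document. Defaults to unbounded. |
| min_doc_freq | Ignore terms that do not occur in at least this many documents. Defaults to 1. |
| max_doc_freq | Ignore words that occur in more than this many documents. Defaults to unbounded. |
| min_word_length | The minimum word length below which words are ignored. Defaults to 0. |
| max_word_length | The maximum word length above which words are ignored. Defaults to unbounded (0). |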
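The effect of these sub-parameters can be sketched in plain Python. This is an illustration under assumptions, not the server-side implementation: TermStats and filter_terms are hypothetical names, the order in which thresholds and ranking are applied is assumed, and the "tony" and "to" entries are invented values. The other three entries are copied from the terms filtering example response.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    term: str
    term_freq: int
    doc_freq: int
    score: float


def filter_terms(candidates, *, max_num_terms=None,
                 min_term_freq=None, max_term_freq=None,
                 min_doc_freq=None, max_doc_freq=None,
                 min_word_length=None, max_word_length=None):
    """Apply the thresholds, then keep the highest-scoring terms (assumed order)."""
    def within(value, low, high):
        # None means "no limit" on that side.
        return (low is None or value >= low) and (high is None or value <= high)

    kept = [
        c for c in candidates
        if within(c.term_freq, min_term_freq, max_term_freq)
        and within(c.doc_freq, min_doc_freq, max_doc_freq)
        and within(len(c.term), min_word_length, max_word_length)
    ]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept if max_num_terms is None else kept[:max_num_terms]


candidates = [
    TermStats("armored", 1, 27, 9.74725),
    TermStats("industrialist", 1, 88, 8.590818),
    TermStats("stark", 1, 44, 9.272792),
    TermStats("tony", 1, 5000, 4.56),   # invented values for illustration
    TermStats("to", 2, 170000, 1.46),   # invented values for illustration
]

top = filter_terms(candidates, max_num_terms=3, min_term_freq=1, min_doc_freq=1)
print([t.term for t in top])  # ['armored', 'stark', 'industrialist']
```

Passing None for a bound leaves that side unlimited, which mirrors the "unbounded" wording in the table without committing to specific default values.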
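Behaviour
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard that the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, the shard to get the statistics from is selected at random. Use routing only to hit a particular shard.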
Path parameters
- <index>
  (Required, string) Name of the index that contains the document.
- <_id>
  (Optional, string) Unique identifier of the document.
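Query parameters
- fields
  (Optional, string) Comma-separated list or wildcard expressions of fields to include in the statistics.
  Used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.
- field_statistics
  (Optional, Boolean) If true, the response includes the document count, sum of document frequencies, and sum of total term frequencies. Defaults to true.
- offsets
  (Optional, Boolean) If true, the response includes term offsets. Defaults to true.
- payloads
  (Optional, Boolean) If true, the response includes term payloads. Defaults to true.
- positions
  (Optional, Boolean) If true, the response includes term positions. Defaults to true.
- preference
  (Optional, string) Specifies the node or shard the operation should be performed on. Random by default.
- routing
  (Optional, string) Custom value used to route operations to a specific shard.
- realtime
  (Optional, Boolean) If true, the request is real time as opposed to near real time. Defaults to true. See Realtime.
- term_statistics
  (Optional, Boolean) If true, the response includes total term frequency and document frequency. Defaults to false.
- version
  (Optional, Boolean) If true, returns the document version as part of a hit.
- version_type
  (Optional, enum) Specific version type: external, external_gte.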
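As a rough illustration of how the path and query parameters combine into a request line, here is a small Python sketch. The helper termvectors_path is hypothetical and is not part of any Elasticsearch client; it only assembles the URL string and sends nothing.

```python
from urllib.parse import quote, urlencode


def termvectors_path(index, doc_id=None, **params):
    """Build the path and query string of a term vectors request (hypothetical helper)."""
    path = f"/{quote(index, safe='')}/_termvectors"
    if doc_id is not None:
        path += f"/{quote(str(doc_id), safe='')}"

    query = {}
    for name, value in params.items():
        if isinstance(value, bool):
            # Booleans are written as lowercase true/false in the query string.
            query[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Lists such as fields become comma-separated values.
            query[name] = ",".join(value)
        else:
            query[name] = str(value)

    # Keep commas and wildcards readable instead of percent-encoding them.
    return path + ("?" + urlencode(query, safe=",*") if query else "")


print(termvectors_path("my-index-000001", 1, fields="message"))
# /my-index-000001/_termvectors/1?fields=message
print(termvectors_path("my-index-000001", 1, fields=["text", "fullname"],
                       realtime=False, term_statistics=True))
# /my-index-000001/_termvectors/1?fields=text,fullname&realtime=false&term_statistics=true
```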
Examples
Returning stored term vectors
First, we create an index that stores term vectors, payloads, and so on:
resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "store": True,
                "analyzer": "fulltext_analyzer"
            },
            "fullname": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "analyzer": "fulltext_analyzer"
            }
        }
    },
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "type_as_payload"]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        text: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          store: true,
          analyzer: 'fulltext_analyzer'
        },
        fullname: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          analyzer: 'fulltext_analyzer'
        }
      }
    },
    settings: {
      index: {
        number_of_shards: 1,
        number_of_replicas: 0
      },
      analysis: {
        analyzer: {
          fulltext_analyzer: {
            type: 'custom',
            tokenizer: 'whitespace',
            filter: ['lowercase', 'type_as_payload']
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        store: true,
        analyzer: "fulltext_analyzer",
      },
      fullname: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        analyzer: "fulltext_analyzer",
      },
    },
  },
  settings: {
    index: {
      number_of_shards: 1,
      number_of_replicas: 0,
    },
    analysis: {
      analyzer: {
        fulltext_analyzer: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase", "type_as_payload"],
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "fulltext_analyzer"
      },
      "fullname": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer": "fulltext_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "type_as_payload"]
        }
      }
    }
  }
}
Second, we add some documents:
resp = client.index(
    index="my-index-000001",
    id="1",
    document={
        "fullname": "John Doe",
        "text": "test test test "
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    id="2",
    refresh="wait_for",
    document={
        "fullname": "Jane Doe",
        "text": "Another test ..."
    },
)
print(resp1)

response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    fullname: 'John Doe',
    text: 'test test test '
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 2,
  refresh: 'wait_for',
  body: {
    fullname: 'Jane Doe',
    text: 'Another test ...'
  }
)
puts response

const response = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    fullname: "John Doe",
    text: "test test test ",
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  id: 2,
  refresh: "wait_for",
  document: {
    fullname: "Jane Doe",
    text: "Another test ...",
  },
});
console.log(response1);

PUT /my-index-000001/_doc/1
{
  "fullname": "John Doe",
  "text": "test test test "
}

PUT /my-index-000001/_doc/2?refresh=wait_for
{
  "fullname": "Jane Doe",
  "text": "Another test ..."
}
The following request returns all information and statistics for the field text in document 1 (John Doe):
resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=["text"],
    offsets=True,
    payloads=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: ['text'],
    offsets: true,
    payloads: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text"],
  offsets: true,
  payloads: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);

GET /my-index-000001/_termvectors/1
{
  "fields": ["text"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
Response:
{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 6
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            { "position": 0, "start_offset": 0, "end_offset": 4, "payload": "d29yZA==" },
            { "position": 1, "start_offset": 5, "end_offset": 9, "payload": "d29yZA==" },
            { "position": 2, "start_offset": 10, "end_offset": 14, "payload": "d29yZA==" }
          ]
        }
      }
    }
  }
}
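A client usually needs to walk this structure and decode the base64 payloads. The sketch below does so in Python for a trimmed copy of the response above; the variable names are illustrative only. The decoded payload here is the string "word", presumably the token type written by the type_as_payload filter in the analyzer defined earlier.

```python
import base64

# Trimmed copy of the sample response: only the parts used below.
response = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {
                    "term_freq": 3,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 4, "payload": "d29yZA=="},
                        {"position": 1, "start_offset": 5, "end_offset": 9, "payload": "d29yZA=="},
                        {"position": 2, "start_offset": 10, "end_offset": 14, "payload": "d29yZA=="},
                    ],
                }
            }
        }
    }
}

# Original field value of document 1. It is plain ASCII, so Python slicing
# matches the UTF-16 offsets reported in the response.
text = "test test test "

rows = []
for term, info in response["term_vectors"]["text"]["terms"].items():
    for token in info["tokens"]:
        payload = base64.b64decode(token["payload"]).decode("utf-8")
        surface = text[token["start_offset"]:token["end_offset"]]
        rows.append((term, token["position"], surface, payload))

for row in rows:
    print(row)
# ('test', 0, 'test', 'word')
# ('test', 1, 'test', 'word')
# ('test', 2, 'test', 'word')
```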
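Generating term vectors on the fly
Term vectors that are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms have not been explicitly stored in the index. Note that for the field text, the terms are not regenerated.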
resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=["text", "some_field_without_term_vectors"],
    offsets=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: ['text', 'some_field_without_term_vectors'],
    offsets: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text", "some_field_without_term_vectors"],
  offsets: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);

GET /my-index-000001/_termvectors/1
{
  "fields": ["text", "some_field_without_term_vectors"],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
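Artificial documents
Term vectors can also be generated for artificial documents, that is, for documents not present in the index. For example, the following request returns the same results as example 1 (Returning stored term vectors). The mapping used is determined by the index.
If dynamic mapping is turned on (the default), document fields that are not in the original mapping are created dynamically.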
resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    }
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
});
console.log(response);

GET /my-index-000001/_termvectors
{
  "doc": {
    "fullname": "John Doe",
    "text": "test test test"
  }
}
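Per-field analyzer
Additionally, an analyzer different from the one configured on the field can be provided with the per_field_analyzer parameter. This is useful for generating term vectors in any fashion, especially when using artificial documents. When an analyzer is provided for a field that already stores term vectors, the term vectors are regenerated.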
resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
    fields=["fullname"],
    per_field_analyzer={
        "fullname": "keyword"
    },
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    },
    fields: ['fullname'],
    per_field_analyzer: {
      fullname: 'keyword'
    }
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
  fields: ["fullname"],
  per_field_analyzer: {
    fullname: "keyword",
  },
});
console.log(response);

GET /my-index-000001/_termvectors
{
  "doc": {
    "fullname": "John Doe",
    "text": "test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer": {
    "fullname": "keyword"
  }
}
Response:
{
  "_index": "my-index-000001",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
      "field_statistics": {
        "sum_doc_freq": 2,
        "doc_count": 4,
        "sum_ttf": 4
      },
      "terms": {
        "John Doe": {
          "term_freq": 1,
          "tokens": [
            { "position": 0, "start_offset": 0, "end_offset": 8 }
          ]
        }
      }
    }
  }
}
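The single term "John Doe" with offsets 0 to 8 follows from how a keyword-style analyzer treats its input. The Python sketch below contrasts the two behaviours; both functions are simplified stand-ins written for illustration and are not the real analyzers.

```python
import re


def whitespace_lowercase_tokens(text):
    # Stand-in for a whitespace tokenizer plus lowercase filter:
    # one token per run of non-whitespace characters.
    return [
        {"term": m.group().lower(), "position": i,
         "start_offset": m.start(), "end_offset": m.end()}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def keyword_tokens(text):
    # Stand-in for a keyword analyzer: the whole input is one unmodified token.
    # len() counts code points, which equals UTF-16 units for this ASCII input.
    return [{"term": text, "position": 0, "start_offset": 0, "end_offset": len(text)}]


print(whitespace_lowercase_tokens("John Doe"))
# [{'term': 'john', 'position': 0, 'start_offset': 0, 'end_offset': 4},
#  {'term': 'doe', 'position': 1, 'start_offset': 5, 'end_offset': 8}]
print(keyword_tokens("John Doe"))
# [{'term': 'John Doe', 'position': 0, 'start_offset': 0, 'end_offset': 8}]
```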
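Terms filtering
Finally, the returned terms can be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from an artificial document with the given "plot" field value. Notice that the keyword "Tony" and the stop words are not part of the response, because their tf-idf scores must be too low.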
resp = client.termvectors(
    index="imdb",
    doc={
        "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    term_statistics=True,
    field_statistics=True,
    positions=False,
    offsets=False,
    filter={
        "max_num_terms": 3,
        "min_term_freq": 1,
        "min_doc_freq": 1
    },
)
print(resp)

response = client.termvectors(
  index: 'imdb',
  body: {
    doc: {
      plot: 'When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.'
    },
    term_statistics: true,
    field_statistics: true,
    positions: false,
    offsets: false,
    filter: {
      max_num_terms: 3,
      min_term_freq: 1,
      min_doc_freq: 1
    }
  }
)
puts response

const response = await client.termvectors({
  index: "imdb",
  doc: {
    plot: "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
  },
  term_statistics: true,
  field_statistics: true,
  positions: false,
  offsets: false,
  filter: {
    max_num_terms: 3,
    min_term_freq: 1,
    min_doc_freq: 1,
  },
});
console.log(response);

GET /imdb/_termvectors
{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}
Response:
{
  "_index": "imdb",
  "_version": 0,
  "found": true,
  "term_vectors": {
    "plot": {
      "field_statistics": {
        "sum_doc_freq": 3384269,
        "doc_count": 176214,
        "sum_ttf": 3753460
      },
      "terms": {
        "armored": {
          "doc_freq": 27,
          "ttf": 27,
          "term_freq": 1,
          "score": 9.74725
        },
        "industrialist": {
          "doc_freq": 88,
          "ttf": 88,
          "term_freq": 1,
          "score": 8.590818
        },
        "stark": {
          "doc_freq": 44,
          "ttf": 47,
          "term_freq": 1,
          "score": 9.272792
        }
      }
    }
  }
}
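This page does not state how score is calculated. As an observation about the sample numbers only, a classic tf-idf of the form sqrt(term_freq) * (1 + ln((doc_count + 1) / (doc_freq + 1))) reproduces the three scores above closely. The Python sketch below checks this; the function name classic_tfidf and the choice of formula are assumptions, not documented behaviour.

```python
import math


def classic_tfidf(term_freq: int, doc_freq: int, doc_count: int) -> float:
    """Assumed scoring formula: sqrt(tf) times a smoothed inverse document frequency."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log((doc_count + 1) / (doc_freq + 1))
    return tf * idf


doc_count = 176214  # field_statistics.doc_count from the response above

# (term, term_freq, doc_freq, score reported in the response above)
samples = [
    ("armored", 1, 27, 9.74725),
    ("industrialist", 1, 88, 8.590818),
    ("stark", 1, 44, 9.272792),
]

for term, term_freq, doc_freq, reported in samples:
    computed = classic_tfidf(term_freq, doc_freq, doc_count)
    print(f"{term}: computed={computed:.4f} reported={reported}")
```

Under this assumed formula, a term that occurs in many documents gets an inverse document frequency close to 1, which is consistent with stop words dropping out of the top three.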