Term vectors API

Retrieves information and statistics for terms in the fields of a particular document.

resp = client.termvectors(
    index="my-index-000001",
    id="1",
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
});
console.log(response);
GET /my-index-000001/_termvectors/1

Request

GET /<index>/_termvectors/<_id>

Prerequisites

  • If the Elasticsearch security features are enabled, you must have the read index privilege for the target index or index alias.

Description

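You can retrieve term vectors for documents stored in the index, or for artificial documents passed in the body of the request.

You can specify the fields you are interested in through the fields parameter, or by adding the fields to the request body.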

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields="message",
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  fields: 'message'
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: "message",
});
console.log(response);
GET /my-index-000001/_termvectors/1?fields=message

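Fields can be specified using wildcards, similar to the multi match query.

By default, term vectors are real-time, not near real-time. You can change this by setting the realtime parameter to false.

You can request three types of values: term information, term statistics, and field statistics. By default, all term information and field statistics are returned for all fields, but term statistics are excluded.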

Term information

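  • term frequency in the field (always returned)
  • term positions (positions : true)
  • start and end offsets (offsets : true)
  • term payloads (payloads : true), as base64-encoded bytes

If the requested information is not stored in the index, it is computed on the fly where possible. Term vectors can also be computed for documents that do not exist in the index at all and are instead provided by the user.

Start and end offsets assume UTF-16 encoding. If you want to use these offsets to recover the original text that produced a token, make sure the string you take the substring from is also encoded as UTF-16.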
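
The UTF-16 caveat matters in languages whose strings are not indexed by UTF-16 code units. The following is a minimal Python sketch, not part of the API: the helper name slice_by_utf16_offsets and the sample text and offsets are hypothetical, chosen only to show how code-point slicing can drift once a character outside the Basic Multilingual Plane appears.

```python
def slice_by_utf16_offsets(text: str, start: int, end: int) -> str:
    """Return the part of `text` between two UTF-16 code-unit offsets."""
    # utf-16-le uses exactly 2 bytes per code unit and writes no byte-order mark.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start : 2 * end].decode("utf-16-le")


# The emoji takes two UTF-16 code units, so "test" starts at UTF-16 offset 3
# even though it starts at code point index 2 in a Python string.
text = "😀 test"
token = slice_by_utf16_offsets(text, 3, 7)  # slices by UTF-16 code units
naive = text[3:7]                           # slices by code points instead

print(repr(token), repr(naive))
```

For text made only of characters in the Basic Multilingual Plane, both ways of slicing give the same result.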

Term statistics

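Setting term_statistics to true (default is false) returns:

  • total term frequency (how often a term occurs in all documents)
  • document frequency (the number of documents containing the current term)

These values are not returned by default because term statistics can have a serious performance impact.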

Field statistics

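Setting field_statistics to false (default is true) omits:

  • document count (how many documents contain this field)
  • sum of document frequencies (the sum of document frequencies for all terms in this field)
  • sum of total term frequencies (the sum of total term frequencies of each term in this field)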
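
To make the definitions of the term and field statistics concrete, here is a small pure-Python sketch that computes them over a toy two-document corpus. It does not call Elasticsearch. The analyze function (lowercase, then split on whitespace) is a simplified stand-in for a real analyzer, and all names are illustrative assumptions.

```python
from collections import Counter

# Toy corpus: one text field per document.
docs = {
    "1": "test test test ",
    "2": "Another test ...",
}


def analyze(text):
    """Simplified stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


# Per-document term frequencies.
term_freqs = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}

# Term statistics.
doc_freq = Counter()  # number of documents containing each term
ttf = Counter()       # total occurrences of each term across all documents
for counts in term_freqs.values():
    for term, freq in counts.items():
        doc_freq[term] += 1
        ttf[term] += freq

# Field statistics.
doc_count = sum(1 for counts in term_freqs.values() if counts)
sum_doc_freq = sum(doc_freq.values())
sum_ttf = sum(ttf.values())

print("test:", doc_freq["test"], ttf["test"])
print("field:", doc_count, sum_doc_freq, sum_ttf)
```

With this corpus the term "test" has a document frequency of 2 and a total term frequency of 4, and the field has a document count of 2, a sum of document frequencies of 4, and a sum of total term frequencies of 6.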

Terms filtering

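With the filter parameter, the returned terms can also be filtered based on their tf-idf scores. This is useful for finding a good characteristic vector of a document. The feature works in a similar manner to the second phase of the More Like This query. For usage, see the terms filtering example.

The following sub-parameters are supported:

max_num_terms

Maximum number of terms that must be returned per field. Defaults to 25.

min_term_freq

Ignore words with less than this frequency in the source document. Defaults to 1.

max_term_freq

Ignore words with more than this frequency in the source document. Defaults to unbounded.

min_doc_freq

Ignore terms that do not occur in at least this many documents. Defaults to 1.

max_doc_freq

Ignore words that occur in more than this many documents. Defaults to unbounded.

min_word_length

The minimum word length below which words are ignored. Defaults to 0.

max_word_length

The maximum word length above which words are ignored. Defaults to unbounded (0).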
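
The way these thresholds combine can be sketched in plain Python. This is an illustration of the semantics described above, not the server's implementation: the function names are hypothetical, None stands for "unbounded", and a max_word_length of 0 is treated as unbounded as stated in the parameter description.

```python
def passes_filter(term, term_freq, doc_freq,
                  min_term_freq=1, max_term_freq=None,
                  min_doc_freq=1, max_doc_freq=None,
                  min_word_length=0, max_word_length=0):
    """Return True if a term survives the frequency and length thresholds."""
    if term_freq < min_term_freq:
        return False
    if max_term_freq is not None and term_freq > max_term_freq:
        return False
    if doc_freq < min_doc_freq:
        return False
    if max_doc_freq is not None and doc_freq > max_doc_freq:
        return False
    if len(term) < min_word_length:
        return False
    # 0 means "no upper bound" for max_word_length.
    if max_word_length > 0 and len(term) > max_word_length:
        return False
    return True


def top_terms(scored_terms, max_num_terms=25):
    """Keep the highest-scoring (term, score) pairs, at most max_num_terms."""
    ranked = sorted(scored_terms, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_num_terms]


# Hypothetical inputs, for illustration only.
kept = passes_filter("armored", term_freq=1, doc_freq=27)
dropped = passes_filter("to", term_freq=2, doc_freq=150000, max_doc_freq=1000)
best = top_terms([("a", 1.0), ("b", 3.0), ("c", 2.0)], max_num_terms=2)
print(kept, dropped, best)
```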

Behavior

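The term and field statistics are not accurate. Deleted documents are not taken into account, and the information is retrieved only for the shard that holds the requested document. The term and field statistics are therefore useful only as relative measures; the absolute numbers have no meaning in this context. By default, when term vectors are requested for an artificial document, the shard used to gather the statistics is selected at random. Use routing only to hit a particular shard.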

Path parameters

<index>
(Required, string) Name of the index that contains the document.
<_id>
(Optional, string) Unique identifier of the document.

Query parameters

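fields

(Optional, string) Comma-separated list of fields, or wildcard expressions, to include in the statistics.

Used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.

field_statistics
(Optional, Boolean) If true, the response includes the document count, sum of document frequencies, and sum of total term frequencies. Defaults to true.
offsets
(Optional, Boolean) If true, the response includes term offsets. Defaults to true.
payloads
(Optional, Boolean) If true, the response includes term payloads. Defaults to true.
positions
(Optional, Boolean) If true, the response includes term positions. Defaults to true.
preference
(Optional, string) Specifies the node or shard the operation should be performed on. Random by default.
routing
(Optional, string) Custom value used to route operations to a specific shard.
realtime
(Optional, Boolean) If true, the request is real-time as opposed to near real-time. Defaults to true. See Realtime.
term_statistics
(Optional, Boolean) If true, the response includes the total term frequency and the document frequency. Defaults to false.
version
(Optional, Boolean) If true, returns the document version as part of a hit.
version_type
(Optional, enum) Specific version type: external, external_gte.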
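
As an illustration of how these query parameters map onto the request line, the following sketch assembles a term vectors path and query string with the Python standard library. The helper build_termvectors_path is hypothetical and does not send any request; it only shows Boolean values written as true or false and field lists joined with commas.

```python
from urllib.parse import quote, urlencode


def build_termvectors_path(index, doc_id=None, **params):
    """Build '/<index>/_termvectors/<_id>?...' from keyword arguments."""
    path = f"/{quote(index, safe='')}/_termvectors"
    if doc_id is not None:
        path += f"/{quote(str(doc_id), safe='')}"
    query = {}
    for name, value in params.items():
        if isinstance(value, bool):
            query[name] = "true" if value else "false"  # lowercase booleans
        elif isinstance(value, (list, tuple)):
            query[name] = ",".join(value)               # comma-separated list
        else:
            query[name] = value
    if query:
        # Keep commas and wildcards readable instead of percent-encoding them.
        path += "?" + urlencode(query, safe=",*")
    return path


path = build_termvectors_path(
    "my-index-000001", "1", fields=["message"], realtime=False
)
print(path)
```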

Examples

Returning stored term vectors

First, we create an index that stores term vectors, payloads, and so on:

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "store": True,
                "analyzer": "fulltext_analyzer"
            },
            "fullname": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "analyzer": "fulltext_analyzer"
            }
        }
    },
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "type_as_payload"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        text: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          store: true,
          analyzer: 'fulltext_analyzer'
        },
        fullname: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          analyzer: 'fulltext_analyzer'
        }
      }
    },
    settings: {
      index: {
        number_of_shards: 1,
        number_of_replicas: 0
      },
      analysis: {
        analyzer: {
          fulltext_analyzer: {
            type: 'custom',
            tokenizer: 'whitespace',
            filter: [
              'lowercase',
              'type_as_payload'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        store: true,
        analyzer: "fulltext_analyzer",
      },
      fullname: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        analyzer: "fulltext_analyzer",
      },
    },
  },
  settings: {
    index: {
      number_of_shards: 1,
      number_of_replicas: 0,
    },
    analysis: {
      analyzer: {
        fulltext_analyzer: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase", "type_as_payload"],
        },
      },
    },
  },
});
console.log(response);
PUT /my-index-000001
{ "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store" : true,
        "analyzer" : "fulltext_analyzer"
       },
       "fullname": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer" : "fulltext_analyzer"
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Second, we add some documents:

resp = client.index(
    index="my-index-000001",
    id="1",
    document={
        "fullname": "John Doe",
        "text": "test test test "
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    id="2",
    refresh="wait_for",
    document={
        "fullname": "Jane Doe",
        "text": "Another test ..."
    },
)
print(resp1)
response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    fullname: 'John Doe',
    text: 'test test test '
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 2,
  refresh: 'wait_for',
  body: {
    fullname: 'Jane Doe',
    text: 'Another test ...'
  }
)
puts response
const response = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    fullname: "John Doe",
    text: "test test test ",
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  id: 2,
  refresh: "wait_for",
  document: {
    fullname: "Jane Doe",
    text: "Another test ...",
  },
});
console.log(response1);
PUT /my-index-000001/_doc/1
{
  "fullname" : "John Doe",
  "text" : "test test test "
}

PUT /my-index-000001/_doc/2?refresh=wait_for
{
  "fullname" : "Jane Doe",
  "text" : "Another test ..."
}

The following request returns all information and statistics for the field text in document 1 (John Doe):

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text"
    ],
    offsets=True,
    payloads=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text'
    ],
    offsets: true,
    payloads: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text"],
  offsets: true,
  payloads: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);
GET /my-index-000001/_termvectors/1
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response:

{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 6
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4,
              "payload": "d29yZA=="
            },
            {
              "position": 1,
              "start_offset": 5,
              "end_offset": 9,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 10,
              "end_offset": 14,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
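
A short sketch of how a client might walk this response: it decodes each base64 payload and uses the offsets to recover the surface text of every token. The response dictionary below is a trimmed copy of the response above, and source_text is the text value indexed for document 1. Slicing a Python string directly by these offsets is an assumption that only holds here because the text is plain ASCII.

```python
import base64

# Trimmed copy of the response above: only the parts this sketch reads.
response = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {
                    "doc_freq": 2,
                    "ttf": 4,
                    "term_freq": 3,
                    "tokens": [
                        {"position": 0, "start_offset": 0,
                         "end_offset": 4, "payload": "d29yZA=="},
                        {"position": 1, "start_offset": 5,
                         "end_offset": 9, "payload": "d29yZA=="},
                        {"position": 2, "start_offset": 10,
                         "end_offset": 14, "payload": "d29yZA=="},
                    ],
                }
            }
        }
    }
}

source_text = "test test test "  # the indexed value of the text field

rows = []
for term, info in response["term_vectors"]["text"]["terms"].items():
    for token in info["tokens"]:
        payload = base64.b64decode(token["payload"])  # raw payload bytes
        surface = source_text[token["start_offset"]:token["end_offset"]]
        rows.append((term, token["position"], surface, payload))

for row in rows:
    print(row)
```

Each payload decodes to the bytes b"word", which is what one would expect from the type_as_payload filter configured in the index's analyzer.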

Generating term vectors on the fly

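Term vectors that are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms have not been explicitly stored in the index. Note that for the field text, the terms are not regenerated.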

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text",
        "some_field_without_term_vectors"
    ],
    offsets=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text',
      'some_field_without_term_vectors'
    ],
    offsets: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text", "some_field_without_term_vectors"],
  offsets: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);
GET /my-index-000001/_termvectors/1
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Artificial documents

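Term vectors can also be generated for artificial documents, that is, for documents not present in the index. For example, the following request would return the same results as the first example (returning stored term vectors). The mapping used is determined by the index.

If dynamic mapping is turned on (the default), document fields that are not in the original mapping are created dynamically.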

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    }
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
});
console.log(response);
GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  }
}

Per-field analyzer

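Additionally, an analyzer different from the one configured on the field can be provided with the per_field_analyzer parameter. This is useful for generating term vectors in any fashion, especially when using artificial documents. When an analyzer is provided for a field that already stores term vectors, the term vectors are regenerated.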

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
    fields=[
        "fullname"
    ],
    per_field_analyzer={
        "fullname": "keyword"
    },
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    },
    fields: [
      'fullname'
    ],
    per_field_analyzer: {
      fullname: 'keyword'
    }
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
  fields: ["fullname"],
  per_field_analyzer: {
    fullname: "keyword",
  },
});
console.log(response);
GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}

Response:

{
  "_index": "my-index-000001",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
       "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
       },
       "terms": {
          "John Doe": {
             "term_freq": 1,
             "tokens": [
                {
                   "position": 0,
                   "start_offset": 0,
                   "end_offset": 8
                }
             ]
          }
       }
    }
  }
}
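
The single "John Doe" term in this response follows from the keyword analyzer emitting the whole input as one token. The sketch below contrasts that with a whitespace-plus-lowercase tokenization, which approximates the fulltext_analyzer defined earlier. Both functions are simplified stand-ins written for illustration and are not the analyzers Elasticsearch runs.

```python
import re


def whitespace_lowercase(text):
    """Split on whitespace and lowercase, recording position and offsets."""
    return [
        {"term": match.group().lower(), "position": position,
         "start_offset": match.start(), "end_offset": match.end()}
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


def keyword(text):
    """Emit the whole input as a single token."""
    return [{"term": text, "position": 0,
             "start_offset": 0, "end_offset": len(text)}]


fulltext_tokens = whitespace_lowercase("John Doe")
keyword_tokens = keyword("John Doe")
print(fulltext_tokens)
print(keyword_tokens)
```

The whitespace variant yields two terms, "john" (offsets 0 to 4) and "doe" (offsets 5 to 8), while the keyword variant yields one term spanning offsets 0 to 8, as in the response.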

Terms filtering

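Finally, the returned terms can be filtered based on their tf-idf scores. In the example below, we obtain the three most "interesting" keywords from an artificial document with the given "plot" field value. Notice that the keyword "Tony" and the stop words are not part of the response, presumably because their tf-idf scores are too low.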

resp = client.termvectors(
    index="imdb",
    doc={
        "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    term_statistics=True,
    field_statistics=True,
    positions=False,
    offsets=False,
    filter={
        "max_num_terms": 3,
        "min_term_freq": 1,
        "min_doc_freq": 1
    },
)
print(resp)
response = client.termvectors(
  index: 'imdb',
  body: {
    doc: {
      plot: 'When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.'
    },
    term_statistics: true,
    field_statistics: true,
    positions: false,
    offsets: false,
    filter: {
      max_num_terms: 3,
      min_term_freq: 1,
      min_doc_freq: 1
    }
  }
)
puts response
const response = await client.termvectors({
  index: "imdb",
  doc: {
    plot: "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
  },
  term_statistics: true,
  field_statistics: true,
  positions: false,
  offsets: false,
  filter: {
    max_num_terms: 3,
    min_term_freq: 1,
    min_doc_freq: 1,
  },
});
console.log(response);
GET /imdb/_termvectors
{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

Response:

{
   "_index": "imdb",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}
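
This page does not state how the score is computed. The scores in the response are, however, numerically consistent with a classic tf-idf of the form term_freq × (1 + ln((doc_count + 1) / (doc_freq + 1))). The sketch below checks that assumed formula against the three terms; treat it as an inference from the numbers shown, not as a documented definition.

```python
import math


def tf_idf(term_freq, doc_freq, doc_count):
    """Assumed scoring formula: tf * (1 + ln((N + 1) / (df + 1)))."""
    return term_freq * (1.0 + math.log((doc_count + 1) / (doc_freq + 1)))


doc_count = 176214  # field_statistics.doc_count from the response

# (term_freq, doc_freq, reported score) copied from the response.
reported = {
    "armored": (1, 27, 9.74725),
    "industrialist": (1, 88, 8.590818),
    "stark": (1, 44, 9.272792),
}

for term, (term_freq, doc_freq, score) in reported.items():
    computed = tf_idf(term_freq, doc_freq, doc_count)
    print(f"{term}: computed={computed:.5f} reported={score}")
```

Under this formula, terms that occur in few documents score highest.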