Term vectors API

Retrieves information and statistics for terms in the fields of a particular document.

resp = client.termvectors(
    index="my-index-000001",
    id="1",
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
});
console.log(response);

GET /my-index-000001/_termvectors/1

Request

GET /<index>/_termvectors/<_id>

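Prerequisites

If the Elasticsearch security features are enabled, you must have the read index privilege for the target index or index alias.

Description

You can retrieve term vectors for documents stored in the index or for artificial documents passed in the body of the request.

You can specify the fields you are interested in through the fields parameter, or by adding the fields to the request body.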

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields="message",
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  fields: 'message'
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: "message",
});
console.log(response);

GET /my-index-000001/_termvectors/1?fields=message

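Fields can be specified using wildcards, similar to the multi match query.

Term vectors are real-time by default, not near real-time. This can be changed by setting the realtime parameter to false.

You can request three types of values: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields, but term statistics are excluded.

Term information

term frequency in the field (always returned)
term positions (positions : true)
start and end offsets (offsets : true)
term payloads (payloads : true), as base64 encoded bytes

If the requested information wasn't stored in the index, it is computed on the fly if possible. Additionally, term vectors can be computed for documents that do not exist in the index at all and are instead provided by the user.

Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets to get the original text that produced a token, make sure that the string you are taking a substring of is also encoded using UTF-16.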
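
The UTF-16 note matters in languages whose strings are not indexed by UTF-16 code units. As a minimal sketch in Python (whose strings are indexed by code points), the hypothetical helper `slice_by_utf16_offsets` below encodes the text as UTF-16 and slices by code unit. The helper name and the sample text are illustrative assumptions, not part of the API.

```python
def slice_by_utf16_offsets(text: str, start_offset: int, end_offset: int) -> str:
    """Return the substring addressed by UTF-16 code-unit offsets.

    Hypothetical helper for illustration; not part of any Elasticsearch client.
    """
    # "utf-16-le" emits exactly 2 bytes per UTF-16 code unit and no BOM.
    encoded = text.encode("utf-16-le")
    return encoded[2 * start_offset:2 * end_offset].decode("utf-16-le")


# The emoji occupies two UTF-16 code units, so the token "test" spans
# offsets 3..7 in UTF-16 terms, but code points 2..6 in a Python str.
sample_text = "😀 test"
utf16_slice = slice_by_utf16_offsets(sample_text, 3, 7)
naive_slice = sample_text[3:7]
print(repr(utf16_slice), repr(naive_slice))
```

For text that contains only characters from the Basic Multilingual Plane, both approaches give the same result; they diverge once a character needs a surrogate pair.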

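Term statistics

Setting term_statistics to true (default is false) will return

total term frequency (how often a term occurs in all documents)
document frequency (the number of documents containing the current term)

By default these values are not returned, since term statistics can have a serious performance impact.

Field statistics

Setting field_statistics to false (default is true) will omit

document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)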
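
To make the three field statistics more concrete, the sketch below derives two per-document averages from them. The function name `field_averages` and the derived keys are hypothetical; the input keys (`doc_count`, `sum_doc_freq`, `sum_ttf`) are the ones that appear in the example responses on this page.

```python
def field_averages(field_statistics: dict) -> dict:
    """Derive per-document averages from a field_statistics object.

    Hypothetical helper for illustration only.
    """
    doc_count = field_statistics["doc_count"]
    if doc_count <= 0:
        raise ValueError("doc_count must be positive to compute averages")
    return {
        # sum_ttf counts every term occurrence in the field, across documents.
        "avg_terms_per_doc": field_statistics["sum_ttf"] / doc_count,
        # sum_doc_freq counts each distinct term once per containing document.
        "avg_distinct_terms_per_doc": field_statistics["sum_doc_freq"] / doc_count,
    }


averages = field_averages({"sum_doc_freq": 4, "doc_count": 2, "sum_ttf": 6})
print(averages)
```

Because the statistics are gathered per shard, such ratios are best read as relative measures rather than exact index-wide values.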

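Terms filtering

With the filter parameter, the returned terms can also be filtered based on their tf-idf scores. This is useful for finding a good characteristic vector of a document. This feature works in a similar manner to the second phase of the More Like This query. For usage, see the terms filtering example at the end of the Examples section.

The following sub-parameters are supported:

`max_num_terms`	Maximum number of terms that must be returned per field. Defaults to `25`.
`min_term_freq`	Ignore words with less than this frequency in the source doc. Defaults to `1`.
`max_term_freq`	Ignore words with more than this frequency in the source doc. Defaults to unbounded.
`min_doc_freq`	Ignore terms which do not occur in at least this many docs. Defaults to `1`.
`max_doc_freq`	Ignore words which occur in more than this many docs. Defaults to unbounded.
`min_word_length`	The minimum word length below which words will be ignored. Defaults to `0`.
`max_word_length`	The maximum word length above which words will be ignored. Defaults to unbounded (`0`).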
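
A small sketch of assembling the `filter` object in Python follows. The helper `build_terms_filter` is hypothetical and only guards against misspelled sub-parameter names from the table above; it does not call Elasticsearch.

```python
# Sub-parameter names taken from the table above.
ALLOWED_FILTER_KEYS = {
    "max_num_terms",
    "min_term_freq",
    "max_term_freq",
    "min_doc_freq",
    "max_doc_freq",
    "min_word_length",
    "max_word_length",
}


def build_terms_filter(**options: int) -> dict:
    """Return a filter object, rejecting unknown sub-parameters.

    Hypothetical helper for illustration only.
    """
    unknown = set(options) - ALLOWED_FILTER_KEYS
    if unknown:
        raise ValueError(f"unsupported filter sub-parameters: {sorted(unknown)}")
    return dict(options)


# Request body fragment: keep the three highest-scoring terms per field.
body = {
    "term_statistics": True,
    "filter": build_terms_filter(max_num_terms=3, min_term_freq=1, min_doc_freq=1),
}
print(body)
```

Sub-parameters that are left out keep their documented defaults.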

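Behavior

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.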
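
As a hedged sketch of the last point, the wrapper below passes a `routing` value along with an artificial document so that statistics come from the shard that routing value maps to, rather than a randomly selected one. The function name and the routing value `"user-1"` are hypothetical, and `client` is assumed to be an already configured Elasticsearch Python client like the one used in the other examples on this page.

```python
from typing import Any, Mapping


def artificial_termvectors_on_shard(
    client: Any, index: str, doc: Mapping[str, Any], routing: str
) -> Any:
    """Request term vectors for an artificial document with explicit routing.

    Hypothetical wrapper; `client` is assumed to expose termvectors() as in
    the examples on this page.
    """
    return client.termvectors(
        index=index,
        doc=doc,
        routing=routing,  # same routing value, same shard, same statistics source
        term_statistics=True,
    )
```

A call might look like `artificial_termvectors_on_shard(client, "my-index-000001", {"text": "test test test"}, "user-1")`. Repeating the call with the same routing value should draw statistics from the same shard, which keeps the relative measures comparable between requests.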

Path parameters

<index>: (Required, string) Name of the index that contains the document.
<_id>: (Optional, string) Unique identifier of the document.

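Query parameters

fields

(Optional, string) Comma-separated list or wildcard expressions of fields to include in the statistics.

Used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.

field_statistics

(Optional, Boolean) If true, the response includes the document count, sum of document frequencies, and sum of total term frequencies. Defaults to true.

offsets

(Optional, Boolean) If true, the response includes term offsets. Defaults to true.

payloads

(Optional, Boolean) If true, the response includes term payloads. Defaults to true.

positions

(Optional, Boolean) If true, the response includes term positions. Defaults to true.

preference

(Optional, string) Specifies the node or shard the operation should be performed on. Random by default.

routing

(Optional, string) Custom value used to route operations to a specific shard.

realtime

(Optional, Boolean) If true, the request is real-time as opposed to near-real-time. Defaults to true. See Realtime.

term_statistics

(Optional, Boolean) If true, the response includes total term frequency and document frequency. Defaults to false.

version

(Optional, Boolean) If true, returns the document version as part of a hit.

version_type

(Optional, enum) Specific version type: external, external_gte.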

Examples

Returning stored term vectors

First, we create an index that stores term vectors, payloads, and so on.

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "store": True,
                "analyzer": "fulltext_analyzer"
            },
            "fullname": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "analyzer": "fulltext_analyzer"
            }
        }
    },
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "type_as_payload"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        text: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          store: true,
          analyzer: 'fulltext_analyzer'
        },
        fullname: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          analyzer: 'fulltext_analyzer'
        }
      }
    },
    settings: {
      index: {
        number_of_shards: 1,
        number_of_replicas: 0
      },
      analysis: {
        analyzer: {
          fulltext_analyzer: {
            type: 'custom',
            tokenizer: 'whitespace',
            filter: [
              'lowercase',
              'type_as_payload'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        store: true,
        analyzer: "fulltext_analyzer",
      },
      fullname: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        analyzer: "fulltext_analyzer",
      },
    },
  },
  settings: {
    index: {
      number_of_shards: 1,
      number_of_replicas: 0,
    },
    analysis: {
      analyzer: {
        fulltext_analyzer: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase", "type_as_payload"],
        },
      },
    },
  },
});
console.log(response);

PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store" : true,
        "analyzer" : "fulltext_analyzer"
      },
      "fullname": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer" : "fulltext_analyzer"
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Second, we add some documents:

resp = client.index(
    index="my-index-000001",
    id="1",
    document={
        "fullname": "John Doe",
        "text": "test test test "
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    id="2",
    refresh="wait_for",
    document={
        "fullname": "Jane Doe",
        "text": "Another test ..."
    },
)
print(resp1)

response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    fullname: 'John Doe',
    text: 'test test test '
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 2,
  refresh: 'wait_for',
  body: {
    fullname: 'Jane Doe',
    text: 'Another test ...'
  }
)
puts response

const response = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    fullname: "John Doe",
    text: "test test test ",
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  id: 2,
  refresh: "wait_for",
  document: {
    fullname: "Jane Doe",
    text: "Another test ...",
  },
});
console.log(response1);

PUT /my-index-000001/_doc/1
{
  "fullname" : "John Doe",
  "text" : "test test test "
}

PUT /my-index-000001/_doc/2?refresh=wait_for
{
  "fullname" : "Jane Doe",
  "text" : "Another test ..."
}

The following request returns all information and statistics for the field text in document 1 (John Doe):

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text"
    ],
    offsets=True,
    payloads=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text'
    ],
    offsets: true,
    payloads: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text"],
  offsets: true,
  payloads: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);

GET /my-index-000001/_termvectors/1
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response:

{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 6
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4,
              "payload": "d29yZA=="
            },
            {
              "position": 1,
              "start_offset": 5,
              "end_offset": 9,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 10,
              "end_offset": 14,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
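
A response shaped like the one above can be flattened into one record per token. The sketch below is a minimal illustration: the helper name `iter_tokens` is hypothetical, and the sample dictionary is a trimmed copy of the response above. Payloads are base64 encoded, so they are decoded back to bytes.

```python
import base64


def iter_tokens(response: dict, field: str):
    """Yield one flat record per token of `field` in a term vectors response.

    Hypothetical helper for illustration only.
    """
    terms = response["term_vectors"][field]["terms"]
    for term, info in terms.items():
        # "tokens" is only present when positions, offsets or payloads
        # were requested.
        for token in info.get("tokens", []):
            payload = token.get("payload")
            yield {
                "term": term,
                "position": token.get("position"),
                "start_offset": token.get("start_offset"),
                "end_offset": token.get("end_offset"),
                "payload": base64.b64decode(payload) if payload is not None else None,
            }


# Trimmed copy of the response above (first two tokens only).
sample_response = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {
                    "term_freq": 3,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 4, "payload": "d29yZA=="},
                        {"position": 1, "start_offset": 5, "end_offset": 9, "payload": "d29yZA=="},
                    ],
                }
            }
        }
    }
}

tokens = list(iter_tokens(sample_response, "text"))
print(tokens[0])
```

In this sample the payload decodes to the token type that the type_as_payload filter attached at index time.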

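Generating term vectors on the fly

Term vectors that are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms haven't been explicitly stored in the index. Note that for the field text, the terms are not regenerated.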

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text",
        "some_field_without_term_vectors"
    ],
    offsets=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text',
      'some_field_without_term_vectors'
    ],
    offsets: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text", "some_field_without_term_vectors"],
  offsets: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);

GET /my-index-000001/_termvectors/1
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

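Artificial documents

Term vectors can also be generated for artificial documents, that is, for documents not present in the index. For example, the following request would return the same results as the first example (Returning stored term vectors). The mapping used is determined by the index.

If dynamic mapping is turned on (the default), document fields not in the original mapping are created dynamically.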

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    }
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
});
console.log(response);

GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  }
}

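Per-field analyzer

Additionally, an analyzer different from the one configured on the field can be provided by using the per_field_analyzer parameter. This is useful for generating term vectors in any fashion, especially when using artificial documents. When an analyzer is provided for a field that already stores term vectors, the term vectors are regenerated.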

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
    fields=[
        "fullname"
    ],
    per_field_analyzer={
        "fullname": "keyword"
    },
)
print(resp)

response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    },
    fields: [
      'fullname'
    ],
    per_field_analyzer: {
      fullname: 'keyword'
    }
  }
)
puts response

const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
  fields: ["fullname"],
  per_field_analyzer: {
    fullname: "keyword",
  },
});
console.log(response);

GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}

Response:

{
  "_index": "my-index-000001",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
       "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
       },
       "terms": {
          "John Doe": {
             "term_freq": 1,
             "tokens": [
                {
                   "position": 0,
                   "start_offset": 0,
                   "end_offset": 8
                }
             ]
          }
       }
    }
  }
}

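Terms filtering

Finally, the returned terms can be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from the artificial document having the given "plot" field value. Notice that the keyword "Tony" and the stop words are not part of the response, as their tf-idf must be too low.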

resp = client.termvectors(
    index="imdb",
    doc={
        "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    term_statistics=True,
    field_statistics=True,
    positions=False,
    offsets=False,
    filter={
        "max_num_terms": 3,
        "min_term_freq": 1,
        "min_doc_freq": 1
    },
)
print(resp)

response = client.termvectors(
  index: 'imdb',
  body: {
    doc: {
      plot: 'When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.'
    },
    term_statistics: true,
    field_statistics: true,
    positions: false,
    offsets: false,
    filter: {
      max_num_terms: 3,
      min_term_freq: 1,
      min_doc_freq: 1
    }
  }
)
puts response

const response = await client.termvectors({
  index: "imdb",
  doc: {
    plot: "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
  },
  term_statistics: true,
  field_statistics: true,
  positions: false,
  offsets: false,
  filter: {
    max_num_terms: 3,
    min_term_freq: 1,
    min_doc_freq: 1,
  },
});
console.log(response);

GET /imdb/_termvectors
{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

Response:

{
   "_index": "imdb",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}
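
The page does not state how the score is computed. As a sketch under an explicit assumption, one classic tf-idf form, score = term_freq * (1 + ln(doc_count / (doc_freq + 1))), happens to reproduce the three scores in the response above, all of which have term_freq = 1. The function below is that assumed formula, inferred from the numbers only; it is not a confirmed description of what Elasticsearch does internally.

```python
import math


def tfidf_score(term_freq: int, doc_freq: int, doc_count: int) -> float:
    """Assumed tf-idf form; inferred from the sample response, not confirmed."""
    idf = 1.0 + math.log(doc_count / (doc_freq + 1))
    return term_freq * idf


# doc_count and doc_freq values copied from the response above.
DOC_COUNT = 176214
reported = {
    "armored": (27, 9.74725),
    "industrialist": (88, 8.590818),
    "stark": (44, 9.272792),
}

for term, (doc_freq, reported_score) in reported.items():
    computed = tfidf_score(1, doc_freq, DOC_COUNT)
    print(f"{term}: computed={computed:.5f} reported={reported_score}")
```

Under this assumption, rarer terms (lower doc_freq) score higher, which is consistent with common words such as stop words falling outside the top three.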
