Term vectors API

Retrieves information and statistics for terms in the fields of a particular document.

resp = client.termvectors(
    index="my-index-000001",
    id="1",
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
});
console.log(response);
GET /my-index-000001/_termvectors/1

Request

GET /<index>/_termvectors/<_id>

Prerequisites

  • If the Elasticsearch security features are enabled, you must have the read index privilege for the target index or index alias.

Description

You can retrieve term vectors for documents stored in the index or for artificial documents passed in the body of the request.

You can specify the fields you are interested in through the fields parameter, or by adding the fields to the request body.

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields="message",
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  fields: 'message'
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: "message",
});
console.log(response);
GET /my-index-000001/_termvectors/1?fields=message

Fields can be specified using wildcards, similar to the multi match query.

Term vectors are real-time by default, not near real-time. This can be changed by setting the realtime parameter to false.

You can request three types of values: term information, term statistics, and field statistics. By default, all term information and field statistics are returned for all fields, but term statistics are excluded.

Term information

  • term frequency in the field (always returned)
  • term positions (positions: true)
  • start and end offsets (offsets: true)
  • term payloads (payloads: true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be computed on the fly if possible. Additionally, term vectors can be computed for documents that don't even exist in the index, but are instead provided by the user.

Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets in order to retrieve the original text that produced this token, you should make sure that the string you are taking a sub-string of is also encoded using UTF-16.
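A minimal Python sketch of applying UTF-16 offsets correctly (the helper name and sample strings are illustrative, not part of the API):

```python
def substring_by_utf16_offsets(text: str, start: int, end: int) -> str:
    """Return the slice of text addressed by UTF-16 code-unit offsets."""
    encoded = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit
    return encoded[2 * start:2 * end].decode("utf-16-le")

# For ASCII-only text, UTF-16 offsets match Python string indices:
print(substring_by_utf16_offsets("test test test", 5, 9))  # -> test

# A character outside the BMP (e.g. an emoji) occupies two UTF-16 code
# units, so naive Python slicing with the same offsets would be off by one:
print(substring_by_utf16_offsets("😀 test", 3, 7))  # -> test
```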

Term statistics

Setting term_statistics to true (default is false) will return

  • total term frequency (how often a term occurs in all documents)
  • document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

Field statistics

Setting field_statistics to false (default is true) will omit

  • document count (how many documents contain this field)
  • sum of document frequencies (the sum of document frequencies for all terms in this field)
  • sum of total term frequencies (the sum of total term frequencies of each term in this field)
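How these three numbers relate can be sketched for a toy corpus; the token streams below are assumed inputs for illustration, not the output of any particular analyzer:

```python
# Two documents' token streams for a single field.
docs = [
    ["test", "test", "test"],
    ["another", "test", "..."],
]

doc_count = sum(1 for tokens in docs if tokens)           # docs containing the field
sum_doc_freq = sum(len(set(tokens)) for tokens in docs)   # unique (doc, term) pairs
sum_ttf = sum(len(tokens) for tokens in docs)             # total term occurrences

print(doc_count, sum_doc_freq, sum_ttf)  # -> 2 4 6
```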

Terms filtering

With the parameter filter, the terms returned can also be filtered based on their tf-idf scores. This could be useful in order to find out a good characteristic vector of a document. This feature works in a similar manner to the second phase of the More Like This query. See example 5 for usage.

The following sub-parameters are supported:

max_num_terms

Maximum number of terms that must be returned per field. Defaults to 25.

min_term_freq

Ignore words with less than this frequency in the source doc. Defaults to 1.

max_term_freq

Ignore words with more than this frequency in the source doc. Defaults to unbounded.

min_doc_freq

Ignore terms which do not occur in at least this many docs. Defaults to 1.

max_doc_freq

Ignore words which occur in more than this many docs. Defaults to unbounded.

min_word_length

The minimum word length below which words will be ignored. Defaults to 0.

max_word_length

The maximum word length above which words will be ignored. Defaults to unbounded (0).
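To illustrate what these thresholds do, here is a client-side sketch that applies them to a mapping shaped like the API's term_vectors.<field>.terms object; the actual filtering and tf-idf scoring happen inside Elasticsearch, and the sample terms are made up:

```python
def filter_terms(terms, max_num_terms=25, min_term_freq=1, max_term_freq=None,
                 min_doc_freq=1, max_doc_freq=None,
                 min_word_length=0, max_word_length=None):
    """Mimic the filter sub-parameters on a {term: stats} mapping."""
    kept = {
        word: stats for word, stats in terms.items()
        if stats["term_freq"] >= min_term_freq
        and (max_term_freq is None or stats["term_freq"] <= max_term_freq)
        and stats["doc_freq"] >= min_doc_freq
        and (max_doc_freq is None or stats["doc_freq"] <= max_doc_freq)
        and len(word) >= min_word_length
        and (max_word_length is None or len(word) <= max_word_length)
    }
    # Keep the highest-scoring terms, up to max_num_terms.
    ranked = sorted(kept.items(), key=lambda kv: kv[1]["score"], reverse=True)
    return dict(ranked[:max_num_terms])

terms = {
    "armored": {"term_freq": 1, "doc_freq": 27, "score": 9.75},
    "the": {"term_freq": 3, "doc_freq": 170000, "score": 0.1},
}
# A frequent, low-scoring word is dropped by max_doc_freq:
print(list(filter_terms(terms, max_doc_freq=1000)))  # -> ['armored']
```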

Behaviour

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.

Path parameters

<index>
(Required, string) Name of the index that contains the document.
<_id>
(Optional, string) Unique identifier of the document.

Query parameters

fields

(Optional, string) Comma-separated list or wildcard expressions of fields to include in the statistics.

Used as the default list unless a specific field list is provided in the completion_fields or fielddata_fields parameters.

field_statistics
(Optional, Boolean) If true, the response includes the document count, sum of document frequencies, and sum of total term frequencies. Defaults to true.
offsets
(Optional, Boolean) If true, the response includes term offsets. Defaults to true.
payloads
(Optional, Boolean) If true, the response includes term payloads. Defaults to true.
positions
(Optional, Boolean) If true, the response includes term positions. Defaults to true.
preference
(Optional, string) Specifies the node or shard the operation should be performed on. Random by default.
routing
(Optional, string) Custom value used to route operations to a specific shard.
realtime
(Optional, Boolean) If true, the request is real-time as opposed to near-real-time. Defaults to true. See Realtime.
term_statistics
(Optional, Boolean) If true, the response includes term frequency and document frequency. Defaults to false.
version
(Optional, Boolean) If true, returns the document version as part of a hit.
version_type
(Optional, enum) Specific version type: external, external_gte.

Examples

Returning stored term vectors

First, we create an index that stores term vectors, payloads, and so on:

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "store": True,
                "analyzer": "fulltext_analyzer"
            },
            "fullname": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "analyzer": "fulltext_analyzer"
            }
        }
    },
    settings={
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "type_as_payload"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        text: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          store: true,
          analyzer: 'fulltext_analyzer'
        },
        fullname: {
          type: 'text',
          term_vector: 'with_positions_offsets_payloads',
          analyzer: 'fulltext_analyzer'
        }
      }
    },
    settings: {
      index: {
        number_of_shards: 1,
        number_of_replicas: 0
      },
      analysis: {
        analyzer: {
          fulltext_analyzer: {
            type: 'custom',
            tokenizer: 'whitespace',
            filter: [
              'lowercase',
              'type_as_payload'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      text: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        store: true,
        analyzer: "fulltext_analyzer",
      },
      fullname: {
        type: "text",
        term_vector: "with_positions_offsets_payloads",
        analyzer: "fulltext_analyzer",
      },
    },
  },
  settings: {
    index: {
      number_of_shards: 1,
      number_of_replicas: 0,
    },
    analysis: {
      analyzer: {
        fulltext_analyzer: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase", "type_as_payload"],
        },
      },
    },
  },
});
console.log(response);
PUT /my-index-000001
{ "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "store" : true,
        "analyzer" : "fulltext_analyzer"
       },
       "fullname": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads",
        "analyzer" : "fulltext_analyzer"
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Second, we add some documents:

resp = client.index(
    index="my-index-000001",
    id="1",
    document={
        "fullname": "John Doe",
        "text": "test test test "
    },
)
print(resp)

resp1 = client.index(
    index="my-index-000001",
    id="2",
    refresh="wait_for",
    document={
        "fullname": "Jane Doe",
        "text": "Another test ..."
    },
)
print(resp1)
response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    fullname: 'John Doe',
    text: 'test test test '
  }
)
puts response

response = client.index(
  index: 'my-index-000001',
  id: 2,
  refresh: 'wait_for',
  body: {
    fullname: 'Jane Doe',
    text: 'Another test ...'
  }
)
puts response
const response = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    fullname: "John Doe",
    text: "test test test ",
  },
});
console.log(response);

const response1 = await client.index({
  index: "my-index-000001",
  id: 2,
  refresh: "wait_for",
  document: {
    fullname: "Jane Doe",
    text: "Another test ...",
  },
});
console.log(response1);
PUT /my-index-000001/_doc/1
{
  "fullname" : "John Doe",
  "text" : "test test test "
}

PUT /my-index-000001/_doc/2?refresh=wait_for
{
  "fullname" : "Jane Doe",
  "text" : "Another test ..."
}

The following request returns all information and statistics for field text in document 1 (John Doe):

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text"
    ],
    offsets=True,
    payloads=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text'
    ],
    offsets: true,
    payloads: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text"],
  offsets: true,
  payloads: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);
GET /my-index-000001/_termvectors/1
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response

{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 6
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "ttf": 4,
          "term_freq": 3,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4,
              "payload": "d29yZA=="
            },
            {
              "position": 1,
              "start_offset": 5,
              "end_offset": 9,
              "payload": "d29yZA=="
            },
            {
              "position": 2,
              "start_offset": 10,
              "end_offset": 14,
              "payload": "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
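The payload values in the response are base64-encoded bytes. With the type_as_payload token filter used in this example's mapping, each payload carries the token type, so decoding recovers it:

```python
import base64

# "d29yZA==" is the payload attached to each "test" token above.
payload = base64.b64decode("d29yZA==")
print(payload.decode("utf-8"))  # -> word
```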

Generating term vectors on the fly

Term vectors which are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms haven't been explicitly stored in the index. Note that for the field text, the terms are not re-generated.

resp = client.termvectors(
    index="my-index-000001",
    id="1",
    fields=[
        "text",
        "some_field_without_term_vectors"
    ],
    offsets=True,
    positions=True,
    term_statistics=True,
    field_statistics=True,
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  id: 1,
  body: {
    fields: [
      'text',
      'some_field_without_term_vectors'
    ],
    offsets: true,
    positions: true,
    term_statistics: true,
    field_statistics: true
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  id: 1,
  fields: ["text", "some_field_without_term_vectors"],
  offsets: true,
  positions: true,
  term_statistics: true,
  field_statistics: true,
});
console.log(response);
GET /my-index-000001/_termvectors/1
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Artificial documents

Term vectors can also be generated for artificial documents, that is for documents not present in the index. For example, the following request would return the same results as in example 1. The mapping used is determined by the index.

If dynamic mapping is turned on (default), the document fields not in the original mapping will be dynamically created.

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    }
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
});
console.log(response);
GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  }
}
Per-field analyzer

Additionally, a different analyzer than the one at the field may be provided by using the per_field_analyzer parameter. This is useful in order to generate term vectors in any fashion, especially when using artificial documents. When providing an analyzer for a field that already stores term vectors, the term vectors will be re-generated.

resp = client.termvectors(
    index="my-index-000001",
    doc={
        "fullname": "John Doe",
        "text": "test test test"
    },
    fields=[
        "fullname"
    ],
    per_field_analyzer={
        "fullname": "keyword"
    },
)
print(resp)
response = client.termvectors(
  index: 'my-index-000001',
  body: {
    doc: {
      fullname: 'John Doe',
      text: 'test test test'
    },
    fields: [
      'fullname'
    ],
    per_field_analyzer: {
      fullname: 'keyword'
    }
  }
)
puts response
const response = await client.termvectors({
  index: "my-index-000001",
  doc: {
    fullname: "John Doe",
    text: "test test test",
  },
  fields: ["fullname"],
  per_field_analyzer: {
    fullname: "keyword",
  },
});
console.log(response);
GET /my-index-000001/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}

Response

{
  "_index": "my-index-000001",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
       "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
       },
       "terms": {
          "John Doe": {
             "term_freq": 1,
             "tokens": [
                {
                   "position": 0,
                   "start_offset": 0,
                   "end_offset": 8
                }
             ]
          }
       }
    }
  }
}

Terms filtering

Finally, the terms returned can be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from the artificial document having the given "plot" field value. Notice that the keyword "Tony" or any stop words are not part of the response, as their tf-idf is too low.

resp = client.termvectors(
    index="imdb",
    doc={
        "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    term_statistics=True,
    field_statistics=True,
    positions=False,
    offsets=False,
    filter={
        "max_num_terms": 3,
        "min_term_freq": 1,
        "min_doc_freq": 1
    },
)
print(resp)
response = client.termvectors(
  index: 'imdb',
  body: {
    doc: {
      plot: 'When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.'
    },
    term_statistics: true,
    field_statistics: true,
    positions: false,
    offsets: false,
    filter: {
      max_num_terms: 3,
      min_term_freq: 1,
      min_doc_freq: 1
    }
  }
)
puts response
const response = await client.termvectors({
  index: "imdb",
  doc: {
    plot: "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
  },
  term_statistics: true,
  field_statistics: true,
  positions: false,
  offsets: false,
  filter: {
    max_num_terms: 3,
    min_term_freq: 1,
    min_doc_freq: 1,
  },
});
console.log(response);
GET /imdb/_termvectors
{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false,
  "filter": {
    "max_num_terms": 3,
    "min_term_freq": 1,
    "min_doc_freq": 1
  }
}

Response

{
   "_index": "imdb",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}
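Reading such a filtered response back, each surviving term carries a score field, so listing them best-first is a one-line sort over the parsed JSON; the dict literal below reproduces the terms object from the response above:

```python
# Terms as returned under term_vectors.plot.terms in the example response.
terms = {
    "armored": {"doc_freq": 27, "ttf": 27, "term_freq": 1, "score": 9.74725},
    "industrialist": {"doc_freq": 88, "ttf": 88, "term_freq": 1, "score": 8.590818},
    "stark": {"doc_freq": 44, "ttf": 47, "term_freq": 1, "score": 9.272792},
}

# Highest tf-idf score first.
ranked = sorted(terms, key=lambda t: terms[t]["score"], reverse=True)
print(ranked)  # -> ['armored', 'stark', 'industrialist']
```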