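k-nearest neighbor (kNN) search

A k-nearest neighbor (kNN) search finds the k nearest vectors to a query vector, as measured by a similarity metric.

Common use cases for kNN include:

  • Relevance ranking based on natural language processing (NLP) algorithms
  • Product recommendations and recommendation engines
  • Similarity search for images or videos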

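Prerequisites

  • To run a kNN search, you must be able to convert your data into meaningful vector values. You can create these vectors using a natural language processing (NLP) model in Elasticsearch, or generate them outside Elasticsearch. Vectors can be added to documents as dense_vector field values. Queries are represented as vectors with the same dimension.

    Design your vectors so that the closer a document's vector is to a query vector, based on a similarity metric, the better its match.

  • To complete the steps in this guide, you must have the following index privileges:

    • create_index or manage to create an index with a dense_vector field
    • create, index, or write to add data to the index you created
    • read to search the index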

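kNN methods

Elasticsearch supports two methods for kNN search:

  • Approximate kNN, using the knn search option or knn query
  • Exact, brute-force kNN, using a script_score query with a vector function

In most cases, you'll want to use approximate kNN. Approximate kNN offers lower latency at the cost of slower indexing and imperfect accuracy.

Exact, brute-force kNN guarantees accurate results but doesn't scale well with large datasets. With this approach, a script_score query must scan each matching document to compute the vector function, which can result in slow search speeds. However, you can improve latency by using a query to limit the number of matching documents passed to the function. If you filter your data to a small subset of documents, you can get good search performance using this approach.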
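
To make the cost of the exact approach concrete, here is a minimal pure-Python sketch of a brute-force scan. It does not call Elasticsearch; the `brute_force_knn` helper, the `doc_filter` argument, and the sample documents are assumptions made for illustration only. It shows that every document passing the filter is scored, which is why narrowing the matching set helps latency.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(query_vector, docs, k, doc_filter=None):
    """Return the k (doc_id, distance) pairs closest to query_vector.

    docs maps a document id to a dict with a "vector" key. The optional
    doc_filter stands in for the query that limits which documents reach
    the distance function.
    """
    candidates = (
        (doc_id, l2_distance(query_vector, doc["vector"]))
        for doc_id, doc in docs.items()
        if doc_filter is None or doc_filter(doc)
    )
    # Every remaining document is scored, so cost grows with the match count.
    return heapq.nsmallest(k, candidates, key=lambda pair: pair[1])


# Hypothetical sample data, not read from a real index.
docs = {
    "1": {"vector": [1, 5, -20], "file_type": "jpg"},
    "2": {"vector": [42, 8, -15], "file_type": "png"},
    "3": {"vector": [15, 11, 23], "file_type": "jpg"},
}

nearest = brute_force_knn([-5, 9, -12], docs, k=2)
png_only = brute_force_knn(
    [-5, 9, -12], docs, k=2, doc_filter=lambda d: d["file_type"] == "png"
)
```

With the filter in place only one document is scored, so the second call does a third of the distance computations of the first.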

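Approximate kNN

Compared to other types of search, approximate kNN search has specific resource requirements. In particular, all vector data must fit in the node's page cache for it to be efficient. Consult the approximate kNN search tuning guide for important notes on configuration and sizing.

To run an approximate kNN search, use the knn option to search one or more dense_vector fields with indexing enabled.

  1. Explicitly map one or more dense_vector fields. Approximate kNN search requires the following mapping options:

    • A similarity value. This value determines the similarity metric used to score documents based on similarity between the query and document vector. For a list of available metrics, see the similarity parameter documentation. The similarity setting defaults to cosine.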
    resp = client.indices.create(
        index="image-index",
        mappings={
            "properties": {
                "image-vector": {
                    "type": "dense_vector",
                    "dims": 3,
                    "similarity": "l2_norm"
                },
                "title-vector": {
                    "type": "dense_vector",
                    "dims": 5,
                    "similarity": "l2_norm"
                },
                "title": {
                    "type": "text"
                },
                "file-type": {
                    "type": "keyword"
                }
            }
        },
    )
    print(resp)
    response = client.indices.create(
      index: 'image-index',
      body: {
        mappings: {
          properties: {
            "image-vector": {
              type: 'dense_vector',
              dims: 3,
              similarity: 'l2_norm'
            },
            "title-vector": {
              type: 'dense_vector',
              dims: 5,
              similarity: 'l2_norm'
            },
            title: {
              type: 'text'
            },
            "file-type": {
              type: 'keyword'
            }
          }
        }
      }
    )
    puts response
    const response = await client.indices.create({
      index: "image-index",
      mappings: {
        properties: {
          "image-vector": {
            type: "dense_vector",
            dims: 3,
            similarity: "l2_norm",
          },
          "title-vector": {
            type: "dense_vector",
            dims: 5,
            similarity: "l2_norm",
          },
          title: {
            type: "text",
          },
          "file-type": {
            type: "keyword",
          },
        },
      },
    });
    console.log(response);
    PUT image-index
    {
      "mappings": {
        "properties": {
          "image-vector": {
            "type": "dense_vector",
            "dims": 3,
            "similarity": "l2_norm"
          },
          "title-vector": {
            "type": "dense_vector",
            "dims": 5,
            "similarity": "l2_norm"
          },
          "title": {
            "type": "text"
          },
          "file-type": {
            "type": "keyword"
          }
        }
      }
    }
  2. Index your data.

    POST image-index/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "image-vector": [1, 5, -20], "title-vector": [12, 50, -10, 0, 1], "title": "moose family", "file-type": "jpg" }
    { "index": { "_id": "2" } }
    { "image-vector": [42, 8, -15], "title-vector": [25, 1, 4, -12, 2], "title": "alpine lake", "file-type": "png" }
    { "index": { "_id": "3" } }
    { "image-vector": [15, 11, 23], "title-vector": [1, 5, 25, 50, 20], "title": "full moon", "file-type": "jpg" }
    ...
  3. Run the search using the knn option or the knn query (expert case).

    resp = client.search(
        index="image-index",
        knn={
            "field": "image-vector",
            "query_vector": [
                -5,
                9,
                -12
            ],
            "k": 10,
            "num_candidates": 100
        },
        fields=[
            "title",
            "file-type"
        ],
    )
    print(resp)
    response = client.search(
      index: 'image-index',
      body: {
        knn: {
          field: 'image-vector',
          query_vector: [
            -5,
            9,
            -12
          ],
          k: 10,
          num_candidates: 100
        },
        fields: [
          'title',
          'file-type'
        ]
      }
    )
    puts response
    const response = await client.search({
      index: "image-index",
      knn: {
        field: "image-vector",
        query_vector: [-5, 9, -12],
        k: 10,
        num_candidates: 100,
      },
      fields: ["title", "file-type"],
    });
    console.log(response);
    POST image-index/_search
    {
      "knn": {
        "field": "image-vector",
        "query_vector": [-5, 9, -12],
        "k": 10,
        "num_candidates": 100
      },
      "fields": [ "title", "file-type" ]
    }

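The document _score is determined by the similarity between the query and document vector. See similarity for more information on how kNN search scores are computed.

Support for approximate kNN search was added in version 8.0. Before this, dense_vector fields did not support enabling index in the mapping. If you created an index prior to 8.0 containing dense_vector fields, then to support approximate kNN search the data must be reindexed using a new field mapping that sets index: true, which is the default option.

Tune approximate kNN for speed or accuracy

To gather results, the kNN search API finds num_candidates approximate nearest neighbor candidates on each shard. The search computes the similarity of these candidate vectors to the query vector, selecting the k most similar results from each shard. The search then merges the results from each shard to return the global top k nearest neighbors.

You can increase num_candidates for more accurate results at the cost of slower search speeds. A search with a high value for num_candidates considers more candidates from each shard. This takes more time, but the search has a higher probability of finding the true k top nearest neighbors.

Similarly, you can decrease num_candidates for faster searches with potentially less accurate results.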
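
The gather step described above can be pictured with a small sketch. It models only the "select k per shard, then merge" part, not the HNSW candidate collection itself; the shard contents, scores, and function names below are hypothetical.

```python
import heapq


def shard_top_k(candidates, k):
    """Pick the k most similar (score, doc_id) pairs among one shard's candidates."""
    return heapq.nlargest(k, candidates)


def merge_shards(per_shard_results, k):
    """Merge per-shard top-k lists into the global top k."""
    merged = [hit for shard in per_shard_results for hit in shard]
    return heapq.nlargest(k, merged)


# Hypothetical (similarity score, document id) candidates gathered on two shards.
# In a real search each list would hold up to num_candidates entries.
shard_a = [(0.91, "a1"), (0.40, "a2"), (0.75, "a3")]
shard_b = [(0.88, "b1"), (0.95, "b2"), (0.10, "b3")]

k = 2
global_top = merge_shards([shard_top_k(shard_a, k), shard_top_k(shard_b, k)], k)
```

A larger candidate list per shard gives `shard_top_k` more to choose from, which is the intuition behind raising num_candidates for accuracy.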

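Approximate kNN using byte vectors

The approximate kNN search API supports byte value vectors in addition to float value vectors. Use the knn option to search a dense_vector field with element_type set to byte and indexing enabled.

  1. Explicitly map one or more dense_vector fields with element_type set to byte and indexing enabled.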

    resp = client.indices.create(
        index="byte-image-index",
        mappings={
            "properties": {
                "byte-image-vector": {
                    "type": "dense_vector",
                    "element_type": "byte",
                    "dims": 2
                },
                "title": {
                    "type": "text"
                }
            }
        },
    )
    print(resp)
    response = client.indices.create(
      index: 'byte-image-index',
      body: {
        mappings: {
          properties: {
            "byte-image-vector": {
              type: 'dense_vector',
              element_type: 'byte',
              dims: 2
            },
            title: {
              type: 'text'
            }
          }
        }
      }
    )
    puts response
    const response = await client.indices.create({
      index: "byte-image-index",
      mappings: {
        properties: {
          "byte-image-vector": {
            type: "dense_vector",
            element_type: "byte",
            dims: 2,
          },
          title: {
            type: "text",
          },
        },
      },
    });
    console.log(response);
    PUT byte-image-index
    {
      "mappings": {
        "properties": {
          "byte-image-vector": {
            "type": "dense_vector",
            "element_type": "byte",
            "dims": 2
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  2. Index your data, ensuring all vector values are integers within the range [-128, 127].

    resp = client.bulk(
        index="byte-image-index",
        refresh=True,
        operations=[
            {
                "index": {
                    "_id": "1"
                }
            },
            {
                "byte-image-vector": [
                    5,
                    -20
                ],
                "title": "moose family"
            },
            {
                "index": {
                    "_id": "2"
                }
            },
            {
                "byte-image-vector": [
                    8,
                    -15
                ],
                "title": "alpine lake"
            },
            {
                "index": {
                    "_id": "3"
                }
            },
            {
                "byte-image-vector": [
                    11,
                    23
                ],
                "title": "full moon"
            }
        ],
    )
    print(resp)
    response = client.bulk(
      index: 'byte-image-index',
      refresh: true,
      body: [
        {
          index: {
            _id: '1'
          }
        },
        {
          "byte-image-vector": [
            5,
            -20
          ],
          title: 'moose family'
        },
        {
          index: {
            _id: '2'
          }
        },
        {
          "byte-image-vector": [
            8,
            -15
          ],
          title: 'alpine lake'
        },
        {
          index: {
            _id: '3'
          }
        },
        {
          "byte-image-vector": [
            11,
            23
          ],
          title: 'full moon'
        }
      ]
    )
    puts response
    const response = await client.bulk({
      index: "byte-image-index",
      refresh: "true",
      operations: [
        {
          index: {
            _id: "1",
          },
        },
        {
          "byte-image-vector": [5, -20],
          title: "moose family",
        },
        {
          index: {
            _id: "2",
          },
        },
        {
          "byte-image-vector": [8, -15],
          title: "alpine lake",
        },
        {
          index: {
            _id: "3",
          },
        },
        {
          "byte-image-vector": [11, 23],
          title: "full moon",
        },
      ],
    });
    console.log(response);
    POST byte-image-index/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "byte-image-vector": [5, -20], "title": "moose family" }
    { "index": { "_id": "2" } }
    { "byte-image-vector": [8, -15], "title": "alpine lake" }
    { "index": { "_id": "3" } }
    { "byte-image-vector": [11, 23], "title": "full moon" }
  3. Run the search using the knn option, ensuring the query_vector values are integers within the range [-128, 127].

    resp = client.search(
        index="byte-image-index",
        knn={
            "field": "byte-image-vector",
            "query_vector": [
                -5,
                9
            ],
            "k": 10,
            "num_candidates": 100
        },
        fields=[
            "title"
        ],
    )
    print(resp)
    response = client.search(
      index: 'byte-image-index',
      body: {
        knn: {
          field: 'byte-image-vector',
          query_vector: [
            -5,
            9
          ],
          k: 10,
          num_candidates: 100
        },
        fields: [
          'title'
        ]
      }
    )
    puts response
    const response = await client.search({
      index: "byte-image-index",
      knn: {
        field: "byte-image-vector",
        query_vector: [-5, 9],
        k: 10,
        num_candidates: 100,
      },
      fields: ["title"],
    });
    console.log(response);
    POST byte-image-index/_search
    {
      "knn": {
        "field": "byte-image-vector",
        "query_vector": [-5, 9],
        "k": 10,
        "num_candidates": 100
      },
      "fields": [ "title" ]
    }

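Note: In addition to the standard byte array, you can also provide a hex-encoded string value for the query_vector parameter. As an example, the search request above can also be expressed as follows, which would yield the same results: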

resp = client.search(
    index="byte-image-index",
    knn={
        "field": "byte-image-vector",
        "query_vector": "fb09",
        "k": 10,
        "num_candidates": 100
    },
    fields=[
        "title"
    ],
)
print(resp)
response = client.search(
  index: 'byte-image-index',
  body: {
    knn: {
      field: 'byte-image-vector',
      query_vector: 'fb09',
      k: 10,
      num_candidates: 100
    },
    fields: [
      'title'
    ]
  }
)
puts response
const response = await client.search({
  index: "byte-image-index",
  knn: {
    field: "byte-image-vector",
    query_vector: "fb09",
    k: 10,
    num_candidates: 100,
  },
  fields: ["title"],
});
console.log(response);
POST byte-image-index/_search
{
  "knn": {
    "field": "byte-image-vector",
    "query_vector": "fb09",
    "k": 10,
    "num_candidates": 100
  },
  "fields": [ "title" ]
}
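
The hex string in the request above follows from the byte values. The sketch below shows one way to produce and decode such a string in Python, treating each value as a signed two's-complement byte; the helper names are made up for this example.

```python
def bytes_to_hex(vector):
    """Encode signed byte values in [-128, 127] as a hex string."""
    for value in vector:
        if not -128 <= value <= 127:
            raise ValueError(f"{value} is outside the byte range [-128, 127]")
    # Masking with 0xFF maps each signed value to its two's-complement byte.
    return bytes(value & 0xFF for value in vector).hex()


def hex_to_bytes(hex_string):
    """Decode a hex string back into signed byte values."""
    return [b - 256 if b > 127 else b for b in bytes.fromhex(hex_string)]


encoded = bytes_to_hex([-5, 9])
decoded = hex_to_bytes("fb09")
```

Here -5 becomes the byte 0xfb and 9 becomes 0x09, which gives the "fb09" value used as query_vector.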

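Byte quantized kNN search

If you want to provide float vectors but want the memory savings of byte vectors, you can use the quantization feature. Quantization allows you to provide float vectors, but internally they are indexed as byte vectors. Additionally, the original float vectors are still retained in the index.

The default index type for dense_vector is int8_hnsw.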
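
As a rough intuition for what quantization trades away, the following sketch applies plain linear scalar quantization to a few floats. This is an illustration only and is not claimed to be the scheme Elasticsearch or Lucene uses internally; the value range, function names, and sample values are assumptions.

```python
def quantize_int8(vector, lower, upper):
    """Map floats in [lower, upper] onto integers in [-128, 127].

    Plain linear scalar quantization, shown only as an illustration.
    """
    scale = 255 / (upper - lower)
    quantized = []
    for value in vector:
        clipped = min(max(value, lower), upper)
        quantized.append(round((clipped - lower) * scale) - 128)
    return quantized


def dequantize_int8(quantized, lower, upper):
    """Approximate the original floats from their quantized form."""
    step = (upper - lower) / 255
    return [(q + 128) * step + lower for q in quantized]


# Hypothetical values, assumed to lie in the range [-2.0, 2.0].
original = [0.1, -2.0, 0.75, 1.2]
codes = quantize_int8(original, lower=-2.0, upper=2.0)
restored = dequantize_int8(codes, lower=-2.0, upper=2.0)
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

Each value now fits in one byte instead of four, at the price of a small rounding error bounded by half a quantization step.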

To use quantization, you can use the index type int8_hnsw or int4_hnsw object in the dense_vector mapping.

resp = client.indices.create(
    index="quantized-image-index",
    mappings={
        "properties": {
            "image-vector": {
                "type": "dense_vector",
                "element_type": "float",
                "dims": 2,
                "index": True,
                "index_options": {
                    "type": "int8_hnsw"
                }
            },
            "title": {
                "type": "text"
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'quantized-image-index',
  body: {
    mappings: {
      properties: {
        "image-vector": {
          type: 'dense_vector',
          element_type: 'float',
          dims: 2,
          index: true,
          index_options: {
            type: 'int8_hnsw'
          }
        },
        title: {
          type: 'text'
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "quantized-image-index",
  mappings: {
    properties: {
      "image-vector": {
        type: "dense_vector",
        element_type: "float",
        dims: 2,
        index: true,
        index_options: {
          type: "int8_hnsw",
        },
      },
      title: {
        type: "text",
      },
    },
  },
});
console.log(response);
PUT quantized-image-index
{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "element_type": "float",
        "dims": 2,
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      },
      "title": {
        "type": "text"
      }
    }
  }
}
  1. Index your float vectors.

    resp = client.bulk(
        index="quantized-image-index",
        refresh=True,
        operations=[
            {
                "index": {
                    "_id": "1"
                }
            },
            {
                "image-vector": [
                    0.1,
                    -2
                ],
                "title": "moose family"
            },
            {
                "index": {
                    "_id": "2"
                }
            },
            {
                "image-vector": [
                    0.75,
                    -1
                ],
                "title": "alpine lake"
            },
            {
                "index": {
                    "_id": "3"
                }
            },
            {
                "image-vector": [
                    1.2,
                    0.1
                ],
                "title": "full moon"
            }
        ],
    )
    print(resp)
    response = client.bulk(
      index: 'quantized-image-index',
      refresh: true,
      body: [
        {
          index: {
            _id: '1'
          }
        },
        {
          "image-vector": [
            0.1,
            -2
          ],
          title: 'moose family'
        },
        {
          index: {
            _id: '2'
          }
        },
        {
          "image-vector": [
            0.75,
            -1
          ],
          title: 'alpine lake'
        },
        {
          index: {
            _id: '3'
          }
        },
        {
          "image-vector": [
            1.2,
            0.1
          ],
          title: 'full moon'
        }
      ]
    )
    puts response
    const response = await client.bulk({
      index: "quantized-image-index",
      refresh: "true",
      operations: [
        {
          index: {
            _id: "1",
          },
        },
        {
          "image-vector": [0.1, -2],
          title: "moose family",
        },
        {
          index: {
            _id: "2",
          },
        },
        {
          "image-vector": [0.75, -1],
          title: "alpine lake",
        },
        {
          index: {
            _id: "3",
          },
        },
        {
          "image-vector": [1.2, 0.1],
          title: "full moon",
        },
      ],
    });
    console.log(response);
    POST quantized-image-index/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "image-vector": [0.1, -2], "title": "moose family" }
    { "index": { "_id": "2" } }
    { "image-vector": [0.75, -1], "title": "alpine lake" }
    { "index": { "_id": "3" } }
    { "image-vector": [1.2, 0.1], "title": "full moon" }
  2. Run the search using the knn option. When searching, the float vector is automatically quantized to a byte vector.

    resp = client.search(
        index="quantized-image-index",
        knn={
            "field": "image-vector",
            "query_vector": [
                0.1,
                -2
            ],
            "k": 10,
            "num_candidates": 100
        },
        fields=[
            "title"
        ],
    )
    print(resp)
    response = client.search(
      index: 'quantized-image-index',
      body: {
        knn: {
          field: 'image-vector',
          query_vector: [
            0.1,
            -2
          ],
          k: 10,
          num_candidates: 100
        },
        fields: [
          'title'
        ]
      }
    )
    puts response
    const response = await client.search({
      index: "quantized-image-index",
      knn: {
        field: "image-vector",
        query_vector: [0.1, -2],
        k: 10,
        num_candidates: 100,
      },
      fields: ["title"],
    });
    console.log(response);
    POST quantized-image-index/_search
    {
      "knn": {
        "field": "image-vector",
        "query_vector": [0.1, -2],
        "k": 10,
        "num_candidates": 100
      },
      "fields": [ "title" ]
    }

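Since the original float vectors are still retained in the index, you can optionally use them for re-scoring. This means you can search over all the vectors quickly using the int8_hnsw index and then rescore only the top k results. This provides the best of both worlds: fast search and accurate scoring.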

resp = client.search(
    index="quantized-image-index",
    knn={
        "field": "image-vector",
        "query_vector": [
            0.1,
            -2
        ],
        "k": 15,
        "num_candidates": 100
    },
    fields=[
        "title"
    ],
    rescore={
        "window_size": 10,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script": {
                        "source": "cosineSimilarity(params.query_vector, 'image-vector') + 1.0",
                        "params": {
                            "query_vector": [
                                0.1,
                                -2
                            ]
                        }
                    }
                }
            }
        }
    },
)
print(resp)
response = client.search(
  index: 'quantized-image-index',
  body: {
    knn: {
      field: 'image-vector',
      query_vector: [
        0.1,
        -2
      ],
      k: 15,
      num_candidates: 100
    },
    fields: [
      'title'
    ],
    rescore: {
      window_size: 10,
      query: {
        rescore_query: {
          script_score: {
            query: {
              match_all: {}
            },
            script: {
              source: "cosineSimilarity(params.query_vector, 'image-vector') + 1.0",
              params: {
                query_vector: [
                  0.1,
                  -2
                ]
              }
            }
          }
        }
      }
    }
  }
)
puts response
const response = await client.search({
  index: "quantized-image-index",
  knn: {
    field: "image-vector",
    query_vector: [0.1, -2],
    k: 15,
    num_candidates: 100,
  },
  fields: ["title"],
  rescore: {
    window_size: 10,
    query: {
      rescore_query: {
        script_score: {
          query: {
            match_all: {},
          },
          script: {
            source:
              "cosineSimilarity(params.query_vector, 'image-vector') + 1.0",
            params: {
              query_vector: [0.1, -2],
            },
          },
        },
      },
    },
  },
});
console.log(response);
POST quantized-image-index/_search
{
  "knn": {
    "field": "image-vector",
    "query_vector": [0.1, -2],
    "k": 15,
    "num_candidates": 100
  },
  "fields": [ "title" ],
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'image-vector') + 1.0",
            "params": {
              "query_vector": [0.1, -2]
            }
          }
        }
      }
    }
  }
}

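Filtered kNN search

The kNN search API supports restricting the search using a filter. The search will return the top k documents that also match the filter query.

The following request performs an approximate kNN search filtered by the file-type field: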

resp = client.search(
    index="image-index",
    knn={
        "field": "image-vector",
        "query_vector": [
            54,
            10,
            -2
        ],
        "k": 5,
        "num_candidates": 50,
        "filter": {
            "term": {
                "file-type": "png"
            }
        }
    },
    fields=[
        "title"
    ],
    source=False,
)
print(resp)
response = client.search(
  index: 'image-index',
  body: {
    knn: {
      field: 'image-vector',
      query_vector: [
        54,
        10,
        -2
      ],
      k: 5,
      num_candidates: 50,
      filter: {
        term: {
          "file-type": 'png'
        }
      }
    },
    fields: [
      'title'
    ],
    _source: false
  }
)
puts response
const response = await client.search({
  index: "image-index",
  knn: {
    field: "image-vector",
    query_vector: [54, 10, -2],
    k: 5,
    num_candidates: 50,
    filter: {
      term: {
        "file-type": "png",
      },
    },
  },
  fields: ["title"],
  _source: false,
});
console.log(response);
POST image-index/_search
{
  "knn": {
    "field": "image-vector",
    "query_vector": [54, 10, -2],
    "k": 5,
    "num_candidates": 50,
    "filter": {
      "term": {
        "file-type": "png"
      }
    }
  },
  "fields": ["title"],
  "_source": false
}

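The filter is applied during the approximate kNN search to ensure that k matching documents are returned. This contrasts with a post-filtering approach, where the filter is applied after the approximate kNN search completes. Post-filtering has the downside that it sometimes returns fewer than k results, even when there are enough matching documents.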
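
The difference between the two approaches can be seen in a toy model. The sketch below uses an exact search as a stand-in for the approximate one, and the documents and helper names are hypothetical; it only illustrates why post-filtering can come back with fewer than k hits.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, docs, k):
    """Exact top-k by distance; stands in for the approximate search."""
    return sorted(docs, key=lambda d: l2_distance(query_vector, d["vector"]))[:k]


def post_filtered(query_vector, docs, k, keep):
    """Search first, then drop hits that fail the filter."""
    return [d for d in top_k(query_vector, docs, k) if keep(d)]


def filtered_during_search(query_vector, docs, k, keep):
    """Only consider documents that pass the filter while searching."""
    return top_k(query_vector, [d for d in docs if keep(d)], k)


def is_png(doc):
    return doc["file_type"] == "png"


# Hypothetical documents: the two png files are farthest from the query.
docs = [
    {"id": "1", "vector": [1, 1], "file_type": "jpg"},
    {"id": "2", "vector": [2, 2], "file_type": "jpg"},
    {"id": "3", "vector": [8, 8], "file_type": "png"},
    {"id": "4", "vector": [9, 9], "file_type": "png"},
]

after = post_filtered([0, 0], docs, k=2, keep=is_png)
during = filtered_during_search([0, 0], docs, k=2, keep=is_png)
```

Post-filtering finds nothing here because both of the nearest documents are jpg files, while filtering during the search still returns the two png documents.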

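Approximate kNN search and filtering

Unlike conventional query filtering, where more restrictive filters typically lead to faster queries, applying filters in an approximate kNN search with an HNSW index can decrease performance. This is because searching the HNSW graph requires additional exploration to obtain the num_candidates that meet the filter criteria.

To avoid significant performance drawbacks, Lucene implements the following strategies per segment:

  • If the filtered document count is less than or equal to num_candidates, the search bypasses the HNSW graph and uses a brute force search on the filtered documents.
  • While exploring the HNSW graph, if the number of nodes explored exceeds the number of documents that satisfy the filter, the search will stop exploring the graph and switch to a brute force search over the filtered documents.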
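
The two rules above amount to simple comparisons. The following sketch restates them as code; it is a simplified reading of the text, not Lucene's implementation, and the function names and return values are invented for illustration.

```python
def choose_segment_strategy(filtered_doc_count, num_candidates):
    """First rule: decide how to start searching one segment."""
    if filtered_doc_count <= num_candidates:
        # Few enough matches that scoring all of them is cheaper than the graph.
        return "brute_force"
    return "hnsw"


def should_abandon_graph(nodes_explored, filtered_doc_count):
    """Second rule: stop graph exploration once it costs more than a full scan."""
    return nodes_explored > filtered_doc_count


start = choose_segment_strategy(filtered_doc_count=40, num_candidates=50)
fallback = should_abandon_graph(nodes_explored=1200, filtered_doc_count=1000)
```

In both cases the fallback is bounded by the number of filtered documents, which keeps a very restrictive filter from causing unbounded graph exploration.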

Combine approximate kNN with other features

You can perform hybrid retrieval by providing both the knn option and a query:

resp = client.search(
    index="image-index",
    query={
        "match": {
            "title": {
                "query": "mountain lake",
                "boost": 0.9
            }
        }
    },
    knn={
        "field": "image-vector",
        "query_vector": [
            54,
            10,
            -2
        ],
        "k": 5,
        "num_candidates": 50,
        "boost": 0.1
    },
    size=10,
)
print(resp)
response = client.search(
  index: 'image-index',
  body: {
    query: {
      match: {
        title: {
          query: 'mountain lake',
          boost: 0.9
        }
      }
    },
    knn: {
      field: 'image-vector',
      query_vector: [
        54,
        10,
        -2
      ],
      k: 5,
      num_candidates: 50,
      boost: 0.1
    },
    size: 10
  }
)
puts response
const response = await client.search({
  index: "image-index",
  query: {
    match: {
      title: {
        query: "mountain lake",
        boost: 0.9,
      },
    },
  },
  knn: {
    field: "image-vector",
    query_vector: [54, 10, -2],
    k: 5,
    num_candidates: 50,
    boost: 0.1,
  },
  size: 10,
});
console.log(response);
POST image-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "mountain lake",
        "boost": 0.9
      }
    }
  },
  "knn": {
    "field": "image-vector",
    "query_vector": [54, 10, -2],
    "k": 5,
    "num_candidates": 50,
    "boost": 0.1
  },
  "size": 10
}

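This search finds the global top k = 5 vector matches, combines them with the matches from the match query, and finally returns the 10 top-scoring results. The knn and query matches are combined through a disjunction, as if you took a boolean or between them. The top k vector results represent the global nearest neighbors across all index shards.

The score of each hit is the sum of the knn and query scores. You can specify a boost value to give a weight to each score in the sum. In the example above, the scores will be calculated as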

score = 0.9 * match_score + 0.1 * knn_score
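
As a worked illustration of this formula, the sketch below combines two hypothetical hit sets as a disjunction and sums the boosted scores. The scores and the `hybrid_scores` helper are assumptions for the example; a document found by only one side is assumed to receive only that side's contribution.

```python
def hybrid_scores(match_hits, knn_hits, match_boost=0.9, knn_boost=0.1, size=10):
    """Combine query and knn hits as a disjunction and sum boosted scores.

    match_hits and knn_hits map document ids to unboosted scores.
    """
    combined = {}
    for doc_id, score in match_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + match_boost * score
    for doc_id, score in knn_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + knn_boost * score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:size]


# Hypothetical unboosted scores for three documents.
match_hits = {"1": 2.0, "2": 1.0}
knn_hits = {"2": 0.8, "3": 0.5}
ranked = hybrid_scores(match_hits, knn_hits)
```

Document "2" appears in both hit sets, so its score is 0.9 * 1.0 + 0.1 * 0.8, while documents "1" and "3" each get a single term.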

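The knn option can also be used with aggregations. In general, Elasticsearch computes aggregations over all documents that match the search. So for approximate kNN search, aggregations are calculated on the top k nearest documents. If the search also includes a query, then aggregations are calculated on the combined set of knn and query matches.

Perform semantic search

kNN search enables you to perform semantic search by using a previously deployed text embedding model. Instead of literal matching on search terms, semantic search retrieves results based on the intent and the contextual meaning of a search query.

Under the hood, the text embedding NLP model generates a dense vector from the input query string, called model_text, that you provide. Then, it is searched against an index containing dense vectors created with the same text embedding machine learning model. The search results are semantically similar as learned by the model.

To perform semantic search:

  • you need an index that contains the dense vector representation of the input data to search against,
  • you must use the same text embedding model for search that you used to create the dense vectors from the input data,
  • the text embedding NLP model deployment must be started.

Reference the deployed text embedding model or the model deployment in the query_vector_builder object and provide the search query as model_text: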

(...)
{
  "knn": {
    "field": "dense-vector-field",
    "k": 10,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": { 
        "model_id": "my-text-embedding-model", 
        "model_text": "The opposite of blue" 
      }
    }
  }
}
(...)

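The annotated fields in the request above are:

  • text_embedding: The natural language processing task to perform. It must be text_embedding.
  • model_id: The ID of the text embedding model to use to generate the dense vectors from the query string. Use the same model that generated the embeddings from the input text in the index you search against. You can use the value of the deployment_id instead in the model_id argument.
  • model_text: The query string from which the model generates the dense vector representation.

For more information on how to deploy a trained model and use it to create text embeddings, refer to this end-to-end example.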

Search multiple kNN fields

In addition to hybrid retrieval, you can search more than one kNN vector field at a time:

resp = client.search(
    index="image-index",
    query={
        "match": {
            "title": {
                "query": "mountain lake",
                "boost": 0.9
            }
        }
    },
    knn=[
        {
            "field": "image-vector",
            "query_vector": [
                54,
                10,
                -2
            ],
            "k": 5,
            "num_candidates": 50,
            "boost": 0.1
        },
        {
            "field": "title-vector",
            "query_vector": [
                1,
                20,
                -52,
                23,
                10
            ],
            "k": 10,
            "num_candidates": 10,
            "boost": 0.5
        }
    ],
    size=10,
)
print(resp)
response = client.search(
  index: 'image-index',
  body: {
    query: {
      match: {
        title: {
          query: 'mountain lake',
          boost: 0.9
        }
      }
    },
    knn: [
      {
        field: 'image-vector',
        query_vector: [
          54,
          10,
          -2
        ],
        k: 5,
        num_candidates: 50,
        boost: 0.1
      },
      {
        field: 'title-vector',
        query_vector: [
          1,
          20,
          -52,
          23,
          10
        ],
        k: 10,
        num_candidates: 10,
        boost: 0.5
      }
    ],
    size: 10
  }
)
puts response
const response = await client.search({
  index: "image-index",
  query: {
    match: {
      title: {
        query: "mountain lake",
        boost: 0.9,
      },
    },
  },
  knn: [
    {
      field: "image-vector",
      query_vector: [54, 10, -2],
      k: 5,
      num_candidates: 50,
      boost: 0.1,
    },
    {
      field: "title-vector",
      query_vector: [1, 20, -52, 23, 10],
      k: 10,
      num_candidates: 10,
      boost: 0.5,
    },
  ],
  size: 10,
});
console.log(response);
POST image-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "mountain lake",
        "boost": 0.9
      }
    }
  },
  "knn": [ {
    "field": "image-vector",
    "query_vector": [54, 10, -2],
    "k": 5,
    "num_candidates": 50,
    "boost": 0.1
  },
  {
    "field": "title-vector",
    "query_vector": [1, 20, -52, 23, 10],
    "k": 10,
    "num_candidates": 10,
    "boost": 0.5
  }],
  "size": 10
}

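This search finds the global top k = 5 vector matches for image-vector and the global top k = 10 for the title-vector. These top values are then combined with the matches from the match query and the top-10 documents are returned. The multiple knn entries and the query matches are combined through a disjunction, as if you took a boolean or between them. The top k vector results represent the global nearest neighbors across all index shards.

The scoring for a doc with the above configured boosts would be: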

score = 0.9 * match_score + 0.1 * knn_score_image-vector + 0.5 * knn_score_title-vector

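Search kNN with expected similarity

While kNN is a powerful tool, it always tries to return k nearest neighbors. Consequently, when using knn with a filter, you could filter out all relevant documents and only have irrelevant ones left to search. In that situation, knn will still do its best to return k nearest neighbors, even though those neighbors could be far away in the vector space.

To alleviate this worry, there is a similarity parameter available in the knn clause. This value is the required minimum similarity for a vector to be considered a match. The knn search flow with this parameter is as follows:

  • Apply any user provided filter queries
  • Explore the vector space to get k vectors
  • Do not return any vectors that are further away than the configured similarity

similarity is the true similarity before it has been transformed into _score and boost applied.

For each configured similarity, here is the corresponding inverted _score function. This is so if you are wanting to filter from a _score perspective, you can do this minor transformation to correctly reject irrelevant results.

  • l2_norm: sqrt((1 / _score) - 1)
  • cosine: (2 * _score) - 1
  • dot_product: (2 * _score) - 1
  • max_inner_product:

    • _score < 1: 1 - (1 / _score)
    • _score >= 1: _score - 1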
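
The inverse functions listed above translate directly into code. The sketch below wraps them in a single helper; the function name `score_to_similarity` is made up for this example, and the sample scores are arbitrary.

```python
import math


def score_to_similarity(metric, score):
    """Invert a kNN _score back to the underlying similarity value."""
    if metric == "l2_norm":
        return math.sqrt((1 / score) - 1)
    if metric in ("cosine", "dot_product"):
        return (2 * score) - 1
    if metric == "max_inner_product":
        return 1 - (1 / score) if score < 1 else score - 1
    raise ValueError(f"unsupported metric: {metric}")


l2_threshold = score_to_similarity("l2_norm", 0.2)
cosine_threshold = score_to_similarity("cosine", 0.75)
```

For example, if you only want l2_norm hits with a _score of at least 0.2, the matching value to pass as similarity is a distance of 2.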

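Here is an example. In this example, we search for the k nearest neighbors of the given query_vector. However, a filter is applied and the found vectors are required to have at least the provided similarity between them.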

resp = client.search(
    index="image-index",
    knn={
        "field": "image-vector",
        "query_vector": [
            1,
            5,
            -20
        ],
        "k": 5,
        "num_candidates": 50,
        "similarity": 36,
        "filter": {
            "term": {
                "file-type": "png"
            }
        }
    },
    fields=[
        "title"
    ],
    source=False,
)
print(resp)
response = client.search(
  index: 'image-index',
  body: {
    knn: {
      field: 'image-vector',
      query_vector: [
        1,
        5,
        -20
      ],
      k: 5,
      num_candidates: 50,
      similarity: 36,
      filter: {
        term: {
          "file-type": 'png'
        }
      }
    },
    fields: [
      'title'
    ],
    _source: false
  }
)
puts response
const response = await client.search({
  index: "image-index",
  knn: {
    field: "image-vector",
    query_vector: [1, 5, -20],
    k: 5,
    num_candidates: 50,
    similarity: 36,
    filter: {
      term: {
        "file-type": "png",
      },
    },
  },
  fields: ["title"],
  _source: false,
});
console.log(response);
POST image-index/_search
{
  "knn": {
    "field": "image-vector",
    "query_vector": [1, 5, -20],
    "k": 5,
    "num_candidates": 50,
    "similarity": 36,
    "filter": {
      "term": {
        "file-type": "png"
      }
    }
  },
  "fields": ["title"],
  "_source": false
}

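In our data set, the only document with the file type of png has a vector of [42, 8, -15]. The l2_norm distance between [42, 8, -15] and [1, 5, -20] is 41.412, which is greater than the configured similarity of 36. This means that this search will return no hits.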
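
The distance quoted above can be checked by hand. This short sketch recomputes it from the two vectors and compares it with the threshold from the request; the variable names are chosen for this example only.

```python
import math

query_vector = [1, 5, -20]
png_vector = [42, 8, -15]
max_distance = 36  # the similarity value from the request above

# Component differences are 41, 3 and 5, so the squared distance is 1715.
distance = math.sqrt(sum((q - v) ** 2 for q, v in zip(query_vector, png_vector)))
is_match = distance <= max_distance
```

The square root of 1715 is a little over 41.41, so the only png document falls outside the allowed distance of 36 and is rejected.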

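Nested kNN search

It is common for text to exceed a particular model's token limit and require chunking before building the embeddings for individual chunks. When using nested with dense_vector, you can achieve nearest passage retrieval without copying top-level document metadata.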
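
One possible way to prepare such a document is sketched below. The chunking rule (a fixed number of whitespace-separated words, which only approximates a token limit) and the `toy_embed` function are placeholders invented for this example; a real pipeline would call an actual embedding model. The field names mirror the passage example used in this section.

```python
def chunk_words(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)
    ]


def toy_embed(chunk):
    """Placeholder embedding: NOT a real model, just two numeric features."""
    return [float(len(chunk)), float(len(chunk.split()))]


def build_nested_document(full_text, creation_time, max_words, embed=toy_embed):
    """Build one top-level document with one nested entry per chunk."""
    return {
        "full_text": full_text,
        "creation_time": creation_time,
        "paragraph": [
            {"vector": embed(chunk), "text": chunk, "paragraph_id": str(i)}
            for i, chunk in enumerate(chunk_words(full_text, max_words), start=1)
        ],
    }


doc = build_nested_document(
    "first paragraph another paragraph", "2019-05-04", max_words=2
)
```

The top-level metadata (full_text, creation_time) is stored once, while each chunk carries its own vector inside the nested paragraph array.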

Here is a simple passage vectors index that stores vectors and some top-level metadata for filtering.

resp = client.indices.create(
    index="passage_vectors",
    mappings={
        "properties": {
            "full_text": {
                "type": "text"
            },
            "creation_time": {
                "type": "date"
            },
            "paragraph": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "dense_vector",
                        "dims": 2,
                        "index_options": {
                            "type": "hnsw"
                        }
                    },
                    "text": {
                        "type": "text",
                        "index": False
                    }
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'passage_vectors',
  body: {
    mappings: {
      properties: {
        full_text: {
          type: 'text'
        },
        creation_time: {
          type: 'date'
        },
        paragraph: {
          type: 'nested',
          properties: {
            vector: {
              type: 'dense_vector',
              dims: 2,
              index_options: {
                type: 'hnsw'
              }
            },
            text: {
              type: 'text',
              index: false
            }
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "passage_vectors",
  mappings: {
    properties: {
      full_text: {
        type: "text",
      },
      creation_time: {
        type: "date",
      },
      paragraph: {
        type: "nested",
        properties: {
          vector: {
            type: "dense_vector",
            dims: 2,
            index_options: {
              type: "hnsw",
            },
          },
          text: {
            type: "text",
            index: false,
          },
        },
      },
    },
  },
});
console.log(response);
PUT passage_vectors
{
    "mappings": {
        "properties": {
            "full_text": {
                "type": "text"
            },
            "creation_time": {
                "type": "date"
            },
            "paragraph": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "dense_vector",
                        "dims": 2,
                        "index_options": {
                            "type": "hnsw"
                        }
                    },
                    "text": {
                        "type": "text",
                        "index": false
                    }
                }
            }
        }
    }
}

With the above mapping, we can index multiple passage vectors and store the text of each passage alongside them.

resp = client.bulk(
    index="passage_vectors",
    refresh=True,
    operations=[
        {
            "index": {
                "_id": "1"
            }
        },
        {
            "full_text": "first paragraph another paragraph",
            "creation_time": "2019-05-04",
            "paragraph": [
                {
                    "vector": [
                        0.45,
                        45
                    ],
                    "text": "first paragraph",
                    "paragraph_id": "1"
                },
                {
                    "vector": [
                        0.8,
                        0.6
                    ],
                    "text": "another paragraph",
                    "paragraph_id": "2"
                }
            ]
        },
        {
            "index": {
                "_id": "2"
            }
        },
        {
            "full_text": "number one paragraph number two paragraph",
            "creation_time": "2020-05-04",
            "paragraph": [
                {
                    "vector": [
                        1.2,
                        4.5
                    ],
                    "text": "number one paragraph",
                    "paragraph_id": "1"
                },
                {
                    "vector": [
                        -1,
                        42
                    ],
                    "text": "number two paragraph",
                    "paragraph_id": "2"
                }
            ]
        }
    ],
)
print(resp)
response = client.bulk(
  index: 'passage_vectors',
  refresh: true,
  body: [
    {
      index: {
        _id: '1'
      }
    },
    {
      full_text: 'first paragraph another paragraph',
      creation_time: '2019-05-04',
      paragraph: [
        {
          vector: [
            0.45,
            45
          ],
          text: 'first paragraph',
          paragraph_id: '1'
        },
        {
          vector: [
            0.8,
            0.6
          ],
          text: 'another paragraph',
          paragraph_id: '2'
        }
      ]
    },
    {
      index: {
        _id: '2'
      }
    },
    {
      full_text: 'number one paragraph number two paragraph',
      creation_time: '2020-05-04',
      paragraph: [
        {
          vector: [
            1.2,
            4.5
          ],
          text: 'number one paragraph',
          paragraph_id: '1'
        },
        {
          vector: [
            -1,
            42
          ],
          text: 'number two paragraph',
          paragraph_id: '2'
        }
      ]
    }
  ]
)
puts response
const response = await client.bulk({
  index: "passage_vectors",
  refresh: "true",
  operations: [
    {
      index: {
        _id: "1",
      },
    },
    {
      full_text: "first paragraph another paragraph",
      creation_time: "2019-05-04",
      paragraph: [
        {
          vector: [0.45, 45],
          text: "first paragraph",
          paragraph_id: "1",
        },
        {
          vector: [0.8, 0.6],
          text: "another paragraph",
          paragraph_id: "2",
        },
      ],
    },
    {
      index: {
        _id: "2",
      },
    },
    {
      full_text: "number one paragraph number two paragraph",
      creation_time: "2020-05-04",
      paragraph: [
        {
          vector: [1.2, 4.5],
          text: "number one paragraph",
          paragraph_id: "1",
        },
        {
          vector: [-1, 42],
          text: "number two paragraph",
          paragraph_id: "2",
        },
      ],
    },
  ],
});
console.log(response);
POST passage_vectors/_bulk?refresh=true
{ "index": { "_id": "1" } }
{ "full_text": "first paragraph another paragraph", "creation_time": "2019-05-04", "paragraph": [ { "vector": [ 0.45, 45 ], "text": "first paragraph", "paragraph_id": "1" }, { "vector": [ 0.8, 0.6 ], "text": "another paragraph", "paragraph_id": "2" } ] }
{ "index": { "_id": "2" } }
{ "full_text": "number one paragraph number two paragraph", "creation_time": "2020-05-04", "paragraph": [ { "vector": [ 1.2, 4.5 ], "text": "number one paragraph", "paragraph_id": "1" }, { "vector": [ -1, 42 ], "text": "number two paragraph", "paragraph_id": "2" } ] }

The query looks very similar to a typical kNN search:

resp = client.search(
    index="passage_vectors",
    fields=[
        "full_text",
        "creation_time"
    ],
    source=False,
    knn={
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2
    },
)
print(resp)
response = client.search(
  index: 'passage_vectors',
  body: {
    fields: [
      'full_text',
      'creation_time'
    ],
    _source: false,
    knn: {
      query_vector: [
        0.45,
        45
      ],
      field: 'paragraph.vector',
      k: 2,
      num_candidates: 2
    }
  }
)
puts response
const response = await client.search({
  index: "passage_vectors",
  fields: ["full_text", "creation_time"],
  _source: false,
  knn: {
    query_vector: [0.45, 45],
    field: "paragraph.vector",
    k: 2,
    num_candidates: 2,
  },
});
console.log(response);
POST passage_vectors/_search
{
    "fields": ["full_text", "creation_time"],
    "_source": false,
    "knn": {
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2
    }
}

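Note that even though there are 4 vectors in total, the search still returns two documents. kNN search over nested dense_vectors always diversifies the top results across top-level documents. This means that "k" top-level documents are returned, each scored by its nearest passage vector (for example, "paragraph.vector").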

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "passage_vectors",
                "_id": "1",
                "_score": 1.0,
                "fields": {
                    "creation_time": [
                        "2019-05-04T00:00:00.000Z"
                    ],
                    "full_text": [
                        "first paragraph another paragraph"
                    ]
                }
            },
            {
                "_index": "passage_vectors",
                "_id": "2",
                "_score": 0.9997144,
                "fields": {
                    "creation_time": [
                        "2020-05-04T00:00:00.000Z"
                    ],
                    "full_text": [
                        "number one paragraph number two paragraph"
                    ]
                }
            }
        ]
    }
}
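The scores in this response can be reproduced offline. The sketch below is plain Python rather than an Elasticsearch call. It assumes the field uses the default cosine similarity and that the cosine `_score` is computed as (1 + cosine) / 2; the helper name cosine_score is hypothetical. Each top-level document takes the score of its best passage vector.

```python
import math


def cosine_score(query, vector):
    """Assumed Elasticsearch-style cosine score: (1 + cosine) / 2."""
    dot = sum(q * v for q, v in zip(query, vector))
    norms = math.hypot(*query) * math.hypot(*vector)
    return (1 + dot / norms) / 2


# Passage vectors per top-level document, copied from the bulk request.
documents = {
    "1": [[0.45, 45], [0.8, 0.6]],
    "2": [[1.2, 4.5], [-1, 42]],
}
query_vector = [0.45, 45]

# Each top-level document is scored by its best (nearest) passage vector.
doc_scores = {
    doc_id: max(cosine_score(query_vector, vector) for vector in vectors)
    for doc_id, vectors in documents.items()
}
ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in ranked:
    print(doc_id, round(score, 7))
```

Under these assumptions, document 2 is ranked by its second passage vector, [-1, 42], because that passage is much closer to the query than [1.2, 4.5].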

What if you want to filter by some top-level document metadata? You can do this by adding a filter to the knn clause.

The filter always applies to the top-level document metadata. This means you cannot filter based on nested field metadata.

resp = client.search(
    index="passage_vectors",
    fields=[
        "creation_time",
        "full_text"
    ],
    source=False,
    knn={
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2,
        "filter": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            "creation_time": {
                                "gte": "2019-05-01",
                                "lte": "2019-05-05"
                            }
                        }
                    }
                ]
            }
        }
    },
)
print(resp)
response = client.search(
  index: 'passage_vectors',
  body: {
    fields: [
      'creation_time',
      'full_text'
    ],
    _source: false,
    knn: {
      query_vector: [
        0.45,
        45
      ],
      field: 'paragraph.vector',
      k: 2,
      num_candidates: 2,
      filter: {
        bool: {
          filter: [
            {
              range: {
                creation_time: {
                  gte: '2019-05-01',
                  lte: '2019-05-05'
                }
              }
            }
          ]
        }
      }
    }
  }
)
puts response
const response = await client.search({
  index: "passage_vectors",
  fields: ["creation_time", "full_text"],
  _source: false,
  knn: {
    query_vector: [0.45, 45],
    field: "paragraph.vector",
    k: 2,
    num_candidates: 2,
    filter: {
      bool: {
        filter: [
          {
            range: {
              creation_time: {
                gte: "2019-05-01",
                lte: "2019-05-05",
              },
            },
          },
        ],
      },
    },
  },
});
console.log(response);
POST passage_vectors/_search
{
    "fields": [
        "creation_time",
        "full_text"
    ],
    "_source": false,
    "knn": {
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2,
        "filter": {
            "bool": {
                "filter": [
                    {
                        "range": {
                            "creation_time": {
                                "gte": "2019-05-01",
                                "lte": "2019-05-05"
                            }
                        }
                    }
                ]
            }
        }
    }
}

Now the results are filtered on the top-level "creation_time", and only one document falls within that range.

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "passage_vectors",
                "_id": "1",
                "_score": 1.0,
                "fields": {
                    "creation_time": [
                        "2019-05-04T00:00:00.000Z"
                    ],
                    "full_text": [
                        "first paragraph another paragraph"
                    ]
                }
            }
        ]
    }
}
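The effect of the filter can be pictured as a pre-selection over top-level documents. The following plain-Python sketch is only an illustration: the in_range helper is hypothetical and mimics an inclusive range filter on creation_time. Passage vectors of a document that fails the filter never compete for the top k.

```python
from datetime import date

# Top-level metadata of the two example documents.
documents = [
    {"_id": "1", "creation_time": date(2019, 5, 4)},
    {"_id": "2", "creation_time": date(2020, 5, 4)},
]


def in_range(doc, gte, lte):
    """Mimic an inclusive range filter on the top-level creation_time."""
    return gte <= doc["creation_time"] <= lte


# Only documents that pass the filter are eligible for kNN scoring.
eligible = [
    doc["_id"]
    for doc in documents
    if in_range(doc, gte=date(2019, 5, 1), lte=date(2019, 5, 5))
]
print(eligible)
```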

Nested kNN search with inner hits


Additionally, if you want to extract the nearest passage for each matched document, you can supply inner_hits to the knn clause.

When using inner_hits with multiple knn clauses, be sure to specify the inner_hits.name field. Otherwise, a naming clash can occur and cause the search request to fail.

resp = client.search(
    index="passage_vectors",
    fields=[
        "creation_time",
        "full_text"
    ],
    source=False,
    knn={
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2,
        "inner_hits": {
            "_source": False,
            "fields": [
                "paragraph.text"
            ],
            "size": 1
        }
    },
)
print(resp)
const response = await client.search({
  index: "passage_vectors",
  fields: ["creation_time", "full_text"],
  _source: false,
  knn: {
    query_vector: [0.45, 45],
    field: "paragraph.vector",
    k: 2,
    num_candidates: 2,
    inner_hits: {
      _source: false,
      fields: ["paragraph.text"],
      size: 1,
    },
  },
});
console.log(response);
POST passage_vectors/_search
{
    "fields": [
        "creation_time",
        "full_text"
    ],
    "_source": false,
    "knn": {
        "query_vector": [
            0.45,
            45
        ],
        "field": "paragraph.vector",
        "k": 2,
        "num_candidates": 2,
        "inner_hits": {
            "_source": false,
            "fields": [
                "paragraph.text"
            ],
            "size": 1
        }
    }
}

Now the result contains the nearest passage found for each matching document.

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "passage_vectors",
                "_id": "1",
                "_score": 1.0,
                "fields": {
                    "creation_time": [
                        "2019-05-04T00:00:00.000Z"
                    ],
                    "full_text": [
                        "first paragraph another paragraph"
                    ]
                },
                "inner_hits": {
                    "paragraph": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "passage_vectors",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "paragraph",
                                        "offset": 0
                                    },
                                    "_score": 1.0,
                                    "fields": {
                                        "paragraph": [
                                            {
                                                "text": [
                                                    "first paragraph"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "passage_vectors",
                "_id": "2",
                "_score": 0.9997144,
                "fields": {
                    "creation_time": [
                        "2020-05-04T00:00:00.000Z"
                    ],
                    "full_text": [
                        "number one paragraph number two paragraph"
                    ]
                },
                "inner_hits": {
                    "paragraph": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 0.9997144,
                            "hits": [
                                {
                                    "_index": "passage_vectors",
                                    "_id": "2",
                                    "_nested": {
                                        "field": "paragraph",
                                        "offset": 1
                                    },
                                    "_score": 0.9997144,
                                    "fields": {
                                        "paragraph": [
                                            {
                                                "text": [
                                                    "number two paragraph"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
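A client typically needs to pull the passage text out of this structure. The sketch below works on a trimmed copy of the response shown above. The helper name nearest_passages is hypothetical, and it assumes the inner hits are keyed by the nested path "paragraph", as they are in this response where no inner_hits.name was given.

```python
# Trimmed copy of the response above, keeping only the fields that are read.
response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "inner_hits": {"paragraph": {"hits": {"hits": [
                    {
                        "_nested": {"field": "paragraph", "offset": 0},
                        "_score": 1.0,
                        "fields": {"paragraph": [{"text": ["first paragraph"]}]},
                    }
                ]}}},
            },
            {
                "_id": "2",
                "inner_hits": {"paragraph": {"hits": {"hits": [
                    {
                        "_nested": {"field": "paragraph", "offset": 1},
                        "_score": 0.9997144,
                        "fields": {"paragraph": [{"text": ["number two paragraph"]}]},
                    }
                ]}}},
            },
        ]
    }
}


def nearest_passages(resp, inner_hits_name="paragraph"):
    """Return (doc id, passage offset, passage text) for each top-level hit."""
    rows = []
    for hit in resp["hits"]["hits"]:
        # With size 1, the first inner hit is the nearest passage.
        best = hit["inner_hits"][inner_hits_name]["hits"]["hits"][0]
        text = best["fields"]["paragraph"][0]["text"][0]
        rows.append((hit["_id"], best["_nested"]["offset"], text))
    return rows


passages = nearest_passages(response)
print(passages)
```

If you set inner_hits.name, pass that name as inner_hits_name instead of the default.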

Indexing considerations


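For approximate kNN search, Elasticsearch stores the dense vector values of each segment as an HNSW graph. Indexing vectors for approximate kNN search can take substantial time because these graphs are expensive to build. You may need to increase the client request timeout for index and bulk requests. The approximate kNN tuning guide contains important guidance on indexing performance and on how the index configuration can affect search performance.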

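In addition to its search-time tuning parameters, the HNSW algorithm has index-time parameters that trade off between the cost of building the graph, search speed, and accuracy. When setting up the dense_vector mapping, you can use the index_options argument to adjust these parameters: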

resp = client.indices.create(
    index="image-index",
    mappings={
        "properties": {
            "image-vector": {
                "type": "dense_vector",
                "dims": 3,
                "similarity": "l2_norm",
                "index_options": {
                    "type": "hnsw",
                    "m": 32,
                    "ef_construction": 100
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'image-index',
  body: {
    mappings: {
      properties: {
        "image-vector": {
          type: 'dense_vector',
          dims: 3,
          similarity: 'l2_norm',
          index_options: {
            type: 'hnsw',
            m: 32,
            ef_construction: 100
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "image-index",
  mappings: {
    properties: {
      "image-vector": {
        type: "dense_vector",
        dims: 3,
        similarity: "l2_norm",
        index_options: {
          type: "hnsw",
          m: 32,
          ef_construction: 100,
        },
      },
    },
  },
});
console.log(response);
PUT image-index
{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 3,
        "similarity": "l2_norm",
        "index_options": {
          "type": "hnsw",
          "m": 32,
          "ef_construction": 100
        }
      }
    }
  }
}

Limitations for approximate kNN search

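  • When using kNN search in cross-cluster search, the ccs_minimize_roundtrips option is not supported.
  • Elasticsearch uses the HNSW algorithm to support efficient kNN search. Like most kNN algorithms, HNSW is an approximate method that sacrifices result accuracy for improved search speed. This means the results returned are not always the true k nearest neighbors.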

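Approximate kNN search always uses the dfs_query_then_fetch search type in order to gather the global top k matches across shards. You cannot set search_type explicitly when running a kNN search.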

Exact kNN


To run an exact kNN search, use a script_score query with a vector function.

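  1. Explicitly map one or more dense_vector fields. If you don't intend to use the field for approximate kNN, set the index mapping option to false. This can significantly improve indexing speed.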

    resp = client.indices.create(
        index="product-index",
        mappings={
            "properties": {
                "product-vector": {
                    "type": "dense_vector",
                    "dims": 5,
                    "index": False
                },
                "price": {
                    "type": "long"
                }
            }
        },
    )
    print(resp)
    response = client.indices.create(
      index: 'product-index',
      body: {
        mappings: {
          properties: {
            "product-vector": {
              type: 'dense_vector',
              dims: 5,
              index: false
            },
            price: {
              type: 'long'
            }
          }
        }
      }
    )
    puts response
    const response = await client.indices.create({
      index: "product-index",
      mappings: {
        properties: {
          "product-vector": {
            type: "dense_vector",
            dims: 5,
            index: false,
          },
          price: {
            type: "long",
          },
        },
      },
    });
    console.log(response);
    PUT product-index
    {
      "mappings": {
        "properties": {
          "product-vector": {
            "type": "dense_vector",
            "dims": 5,
            "index": false
          },
          "price": {
            "type": "long"
          }
        }
      }
    }
  2. Index your data.

    POST product-index/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "product-vector": [230.0, 300.33, -34.8988, 15.555, -200.0], "price": 1599 }
    { "index": { "_id": "2" } }
    { "product-vector": [-0.5, 100.0, -13.0, 14.8, -156.0], "price": 799 }
    { "index": { "_id": "3" } }
    { "product-vector": [0.5, 111.3, -13.0, 14.8, -156.0], "price": 1099 }
    ...
  3. Use the search API to run a script_score query containing a vector function.

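    To limit the number of matched documents passed to the vector function, we recommend that you specify a filter query in the script_score.query parameter. If needed, you can use a match_all query in this parameter to match all documents. However, matching all documents can significantly increase search latency.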

    resp = client.search(
        index="product-index",
        query={
            "script_score": {
                "query": {
                    "bool": {
                        "filter": {
                            "range": {
                                "price": {
                                    "gte": 1000
                                }
                            }
                        }
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
                    "params": {
                        "queryVector": [
                            -0.5,
                            90,
                            -10,
                            14.8,
                            -156
                        ]
                    }
                }
            }
        },
    )
    print(resp)
    response = client.search(
      index: 'product-index',
      body: {
        query: {
          script_score: {
            query: {
              bool: {
                filter: {
                  range: {
                    price: {
                      gte: 1000
                    }
                  }
                }
              }
            },
            script: {
              source: "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
              params: {
                "queryVector": [
                  -0.5,
                  90,
                  -10,
                  14.8,
                  -156
                ]
              }
            }
          }
        }
      }
    )
    puts response
    const response = await client.search({
      index: "product-index",
      query: {
        script_score: {
          query: {
            bool: {
              filter: {
                range: {
                  price: {
                    gte: 1000,
                  },
                },
              },
            },
          },
          script: {
            source: "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
            params: {
              queryVector: [-0.5, 90, -10, 14.8, -156],
            },
          },
        },
      },
    });
    console.log(response);
    POST product-index/_search
    {
      "query": {
        "script_score": {
          "query" : {
            "bool" : {
              "filter" : {
                "range" : {
                  "price" : {
                    "gte": 1000
                  }
                }
              }
            }
          },
          "script": {
            "source": "cosineSimilarity(params.queryVector, 'product-vector') + 1.0",
            "params": {
              "queryVector": [-0.5, 90.0, -10, 14.8, -156.0]
            }
          }
        }
      }
    }
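What this request asks Elasticsearch to do can be sketched as a brute-force loop. The plain-Python version below is illustrative only: it uses the three sample products from the bulk request, and the helper name cosine_similarity is a local stand-in for the script function of the same purpose. The filter runs first, then every remaining document is scored. Adding 1.0 shifts the cosine range from [-1, 1] to [0, 2], which keeps the score non-negative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


products = {
    "1": {"product-vector": [230.0, 300.33, -34.8988, 15.555, -200.0], "price": 1599},
    "2": {"product-vector": [-0.5, 100.0, -13.0, 14.8, -156.0], "price": 799},
    "3": {"product-vector": [0.5, 111.3, -13.0, 14.8, -156.0], "price": 1099},
}
query_vector = [-0.5, 90.0, -10, 14.8, -156.0]

# Step 1: the filter query narrows the candidate set (price >= 1000).
candidates = {pid: p for pid, p in products.items() if p["price"] >= 1000}

# Step 2: the script scores every remaining document, one by one.
scores = {
    pid: cosine_similarity(query_vector, p["product-vector"]) + 1.0
    for pid, p in candidates.items()
}
ranked_ids = sorted(scores, key=scores.get, reverse=True)
print(ranked_ids, scores)
```

Because step 2 touches every candidate, the cost grows linearly with the number of documents that pass the filter, which is why a selective filter matters for exact kNN.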

Oversampling and rescoring for quantized vectors


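All forms of quantization result in some accuracy loss, and the loss grows as the quantization level increases. Generally, we have found that:

  • int8 requires minimal, if any, rescoring.
  • int4 requires some rescoring for higher accuracy and for larger recall scenarios. Oversampling by 1.5x to 2x generally recovers most of the accuracy loss.
  • bbq requires rescoring except on exceptionally large indices or with models specifically designed for quantization. Oversampling by 3x to 5x is generally sufficient, but vectors with fewer dimensions or vectors that do not quantize well may need more.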

There are two main ways to oversample and rescore. The first is to use the rescore section of the _search request.

Here is an example that uses the top-level knn search to oversample and rescore to rerank the results.

resp = client.search(
    index="my-index",
    size=10,
    knn={
        "query_vector": [
            0.04283529,
            0.85670587,
            -0.51402352,
            0
        ],
        "field": "my_int4_vector",
        "k": 20,
        "num_candidates": 50
    },
    rescore={
        "window_size": 20,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {
                        "match_all": {}
                    },
                    "script": {
                        "source": "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)",
                        "params": {
                            "queryVector": [
                                0.04283529,
                                0.85670587,
                                -0.51402352,
                                0
                            ]
                        }
                    }
                }
            },
            "query_weight": 0,
            "rescore_query_weight": 1
        }
    },
)
print(resp)
const response = await client.search({
  index: "my-index",
  size: 10,
  knn: {
    query_vector: [0.04283529, 0.85670587, -0.51402352, 0],
    field: "my_int4_vector",
    k: 20,
    num_candidates: 50,
  },
  rescore: {
    window_size: 20,
    query: {
      rescore_query: {
        script_score: {
          query: {
            match_all: {},
          },
          script: {
            source: "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)",
            params: {
              queryVector: [0.04283529, 0.85670587, -0.51402352, 0],
            },
          },
        },
      },
      query_weight: 0,
      rescore_query_weight: 1,
    },
  },
});
console.log(response);
POST /my-index/_search
{
  "size": 10, 
  "knn": {
    "query_vector": [0.04283529, 0.85670587, -0.51402352, 0],
    "field": "my_int4_vector",
    "k": 20, 
    "num_candidates": 50
  },
  "rescore": {
    "window_size": 20, 
    "query": {
      "rescore_query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)", 
            "params": {
              "queryVector": [0.04283529, 0.85670587, -0.51402352, 0]
            }
          }
        }
      },
      "query_weight": 0, 
      "rescore_query_weight": 1 
    }
  }
}

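The parameters in this request play the following roles:

  • size: The number of results to return. Only 10 results are returned here, even though the search oversamples by 2x and gathers 20 nearest neighbors.
  • knn.k: The number of results to return from the kNN search. This runs an approximate kNN search with 50 candidates per HNSW graph, uses the quantized vectors, and returns the 20 most similar vectors according to the quantized score. Because this is the top-level knn object, the global top 20 results from all shards are gathered before rescoring. Combined with rescore, this oversamples by 2x: 20 nearest neighbors are gathered according to the quantized score and then rescored with the higher-fidelity float vectors.
  • rescore.window_size: The number of results to rescore. To rescore all results, set this to the same value as k.
  • script.source: The script used to rescore the results. Script score interacts directly with the originally provided float32 vector.
  • query_weight: The weight of the original query. Here the original score is simply discarded.
  • rescore_query_weight: The weight of the rescore query. Here only the rescore query is used.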
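The relationship between size, k, and window_size can be captured in a small helper. The sketch below only builds the request body as a Python dictionary and does not call Elasticsearch. The function name oversampled_search_body and its oversample parameter are hypothetical, and raising num_candidates to at least k reflects the assumption that num_candidates may not be smaller than k.

```python
import math


def oversampled_search_body(query_vector, field, size, oversample, num_candidates):
    """Build a search body that gathers size * oversample neighbors, then rescores them."""
    k = math.ceil(size * oversample)
    return {
        "size": size,
        "knn": {
            "query_vector": query_vector,
            "field": field,
            "k": k,
            # Assumption: num_candidates must be at least k.
            "num_candidates": max(num_candidates, k),
        },
        "rescore": {
            "window_size": k,  # rescore every gathered neighbor
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": f"(dotProduct(params.queryVector, '{field}') + 1.0)",
                            "params": {"queryVector": query_vector},
                        },
                    }
                },
                "query_weight": 0,          # discard the quantized score
                "rescore_query_weight": 1,  # keep only the rescored value
            },
        },
    }


body = oversampled_search_body(
    [0.04283529, 0.85670587, -0.51402352, 0],
    "my_int4_vector",
    size=10,
    oversample=2,
    num_candidates=50,
)
print(body["knn"]["k"], body["rescore"]["window_size"])
```

With size=10 and oversample=2 this reproduces the values used in the request above. A bbq field would call for a larger factor, in line with the 3x to 5x guidance.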

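The second way is to score per shard with the knn query and the script_score query. Generally, this means more rescoring per shard, but it can increase overall recall at the cost of compute.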

resp = client.search(
    index="my-index",
    size=10,
    query={
        "script_score": {
            "query": {
                "knn": {
                    "query_vector": [
                        0.04283529,
                        0.85670587,
                        -0.51402352,
                        0
                    ],
                    "field": "my_int4_vector",
                    "num_candidates": 20
                }
            },
            "script": {
                "source": "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)",
                "params": {
                    "queryVector": [
                        0.04283529,
                        0.85670587,
                        -0.51402352,
                        0
                    ]
                }
            }
        }
    },
)
print(resp)
const response = await client.search({
  index: "my-index",
  size: 10,
  query: {
    script_score: {
      query: {
        knn: {
          query_vector: [0.04283529, 0.85670587, -0.51402352, 0],
          field: "my_int4_vector",
          num_candidates: 20,
        },
      },
      script: {
        source: "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)",
        params: {
          queryVector: [0.04283529, 0.85670587, -0.51402352, 0],
        },
      },
    },
  },
});
console.log(response);
POST /my-index/_search
{
  "size": 10, 
  "query": {
    "script_score": {
      "query": {
        "knn": { 
          "query_vector": [0.04283529, 0.85670587, -0.51402352, 0],
          "field": "my_int4_vector",
          "num_candidates": 20 
        }
      },
      "script": {
        "source": "(dotProduct(params.queryVector, 'my_int4_vector') + 1.0)", 
        "params": {
          "queryVector": [0.04283529, 0.85670587, -0.51402352, 0]
        }
      }
    }
  }
}

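The parameters in this request play the following roles:

  • size: The number of results to return.
  • knn: The knn query that performs the initial search. It is executed on each shard.
  • num_candidates: The number of candidates to use for the initial approximate kNN search. This searches using the quantized vectors and returns the top 20 candidates per shard, which are then scored.
  • script.source: The script used to score the results. Script score interacts directly with the originally provided float32 vector.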
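The extra compute of the per-shard approach can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope comparison based on the descriptions in this section, not a measurement. Both function names and the shard count of 3 are hypothetical, and the per-shard figure is an upper bound because a shard may hold fewer matching vectors.

```python
def rescored_with_top_level_knn(k: int) -> int:
    """Top-level knn plus rescore: only the global top k are rescored."""
    return k


def rescored_with_knn_query(num_candidates: int, num_shards: int) -> int:
    """knn query inside script_score: each shard scores its own candidates."""
    return num_candidates * num_shards


num_shards = 3  # hypothetical shard count for illustration
global_count = rescored_with_top_level_knn(k=20)
per_shard_count = rescored_with_knn_query(num_candidates=20, num_shards=num_shards)
print(global_count, per_shard_count)
```

The larger pool of exactly scored candidates is what can raise recall, and it is also where the additional compute goes.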