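max_chunking_size: (Optional, integer) Specifies the maximum size of a chunk in words. Defaults to 250. This value cannot be higher than 300 or lower than 20 (for the sentence strategy) or 10 (for the word strategy).
overlap: (Optional, integer) Only for the word chunking strategy. Specifies the number of overlapping words between chunks. Defaults to 100. This value cannot be higher than half of max_chunking_size.
sentence_overlap: (Optional, integer) Only for the sentence chunking strategy. Specifies the number of overlapping sentences between chunks. Can be either 1 or 0. Defaults to 1.
strategy: (Optional, string) Specifies the chunking strategy. Can be either sentence or word.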
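The chunking rules listed above interact (the lower bound depends on the strategy, and the word overlap is capped relative to max_chunking_size), so a small client-side check can make them concrete. The sketch below is a hypothetical validator, not part of Elasticsearch or any client library; it assumes a missing strategy means sentence, which the text above does not state, and Elasticsearch applies its own validation with its own error messages.

```python
def validate_chunking_settings(settings: dict) -> list:
    """Return a list of rule violations (empty if the settings look valid).

    Hypothetical helper illustrating the chunking_settings rules; not an Elasticsearch API.
    """
    errors = []
    # Assumption: treat a missing strategy as "sentence"; the docs above do not state a default.
    strategy = settings.get("strategy", "sentence")
    if strategy not in ("sentence", "word"):
        errors.append(f"unknown strategy: {strategy!r}")
        return errors

    # max_chunking_size defaults to 250; its lower bound depends on the strategy.
    max_size = settings.get("max_chunking_size", 250)
    lower = 20 if strategy == "sentence" else 10
    if not lower <= max_size <= 300:
        errors.append(f"max_chunking_size must be between {lower} and 300 for {strategy}")

    if strategy == "word":
        # overlap defaults to 100 and may not exceed half of max_chunking_size.
        overlap = settings.get("overlap", 100)
        if overlap > max_size / 2:
            errors.append("overlap must not exceed half of max_chunking_size")
        if "sentence_overlap" in settings:
            errors.append("sentence_overlap applies only to the sentence strategy")
    else:
        # sentence_overlap defaults to 1 and may only be 0 or 1.
        if settings.get("sentence_overlap", 1) not in (0, 1):
            errors.append("sentence_overlap must be 0 or 1")
        if "overlap" in settings:
            errors.append("overlap applies only to the word strategy")
    return errors
```

For example, `validate_chunking_settings({"strategy": "word", "max_chunking_size": 100, "overlap": 60})` reports that the overlap exceeds half of max_chunking_size, while the same request with `"max_chunking_size": 120` passes.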

service

(Required, string) The type of service supported for the specified task type. In this case, elser.

service_settings

(Required, object) Settings used to install the inference model.

These settings are specific to the elser service.

adaptive_allocations

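(Optional, object) Adaptive allocations configuration object. If enabled, the number of allocations of the model is set based on the current load the process receives. When the load is high, a new model allocation is automatically created (respecting the value of max_number_of_allocations if it is set). When the load is low, a model allocation is automatically removed (respecting the value of min_number_of_allocations if it is set). If adaptive_allocations is enabled, do not set the number of allocations manually.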

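enabled: (Optional, Boolean) If true, adaptive_allocations is enabled. Defaults to false.
max_number_of_allocations: (Optional, integer) Specifies the maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.
min_number_of_allocations: (Optional, integer) Specifies the minimum number of allocations to scale to. If set, it must be greater than or equal to 0. If not defined, the deployment scales down to 0.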
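The scaling behavior described above (add one allocation under high load, remove one under low load, stay within the configured bounds, and fall back to a floor of 0 when no minimum is set) can be modeled as a single step function. This is a toy sketch under stated assumptions: the function name and the "high"/"low" load labels are hypothetical, and it does not reproduce how Elasticsearch actually measures load or decides when to scale.

```python
from typing import Optional


def next_allocation_count(current: int, load: str,
                          min_allocations: Optional[int] = None,
                          max_allocations: Optional[int] = None) -> int:
    """Return the allocation count after one hypothetical scaling step.

    load: "high" adds one allocation, "low" removes one, anything else keeps the count.
    """
    # An undefined minimum lets the deployment scale down to 0.
    floor = 0 if min_allocations is None else min_allocations
    if floor < 0:
        raise ValueError("min_number_of_allocations must be >= 0")
    if max_allocations is not None and max_allocations < floor:
        raise ValueError("max_number_of_allocations must be >= min_number_of_allocations")

    step = {"high": 1, "low": -1}.get(load, 0)
    target = max(current + step, floor)        # never go below the minimum
    if max_allocations is not None:
        target = min(target, max_allocations)  # never exceed the maximum
    return target
```

With the bounds used in the example later on this page (minimum 3, maximum 10), starting at 3 and feeding the loads high, high, low, low, low yields 4, 5, 4, 3, 3: the count never drops below the configured minimum.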

num_allocations

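(Required, integer) The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases throughput. If adaptive_allocations is enabled, do not set this value, because it is set automatically.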

num_threads

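(Required, integer) Sets the number of threads used by each model allocation during inference. This generally increases the speed of each inference request. Inference is a compute-bound process, so the number of threads per allocation must not exceed the number of allocated processors available on each node. Must be a power of 2. The maximum allowed value is 32.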
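The num_allocations and num_threads rules above can be checked together before sending a request. The sketch below is a hypothetical validator: the function name and the processors_per_node argument (which the caller must supply from its own knowledge of the cluster) are assumptions, and Elasticsearch performs its own validation that may report errors differently.

```python
def validate_elser_service_settings(settings: dict, processors_per_node: int) -> list:
    """Return rule violations for an elser service_settings object (empty if valid).

    Hypothetical helper; processors_per_node is supplied by the caller, not read from a cluster.
    """
    errors = []
    adaptive = settings.get("adaptive_allocations", {})
    adaptive_enabled = adaptive.get("enabled", False)

    # num_allocations is set automatically when adaptive allocations are enabled.
    if adaptive_enabled and "num_allocations" in settings:
        errors.append("do not set num_allocations when adaptive_allocations is enabled")
    if not adaptive_enabled and "num_allocations" not in settings:
        errors.append("num_allocations is required when adaptive_allocations is disabled")

    threads = settings.get("num_threads")
    if threads is None:
        errors.append("num_threads is required")
    else:
        # A positive power of two has exactly one bit set.
        if threads < 1 or (threads & (threads - 1)) != 0:
            errors.append("num_threads must be a power of 2")
        if threads > 32:
            errors.append("num_threads must not exceed 32")
        if threads > processors_per_node:
            errors.append("num_threads must not exceed the allocated processors per node")
    return errors
```

Both service_settings objects used in the examples on this page (adaptive allocations with num_threads 1, and num_allocations 1 with num_threads 1) pass this check, while `{"num_allocations": 1, "num_threads": 3}` is rejected because 3 is not a power of 2.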

ELSER service example with adaptive allocations


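When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.

For more information on how to optimize your ELSER endpoints, refer to the ELSER recommendations section in the model documentation. To learn more about model autoscaling, refer to the trained model autoscaling page.

The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type and configure adaptive allocations.

If the ELSER model is not already downloaded, the request below automatically downloads it and then deploys it.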

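# Assumes the official elasticsearch Python client and a cluster reachable at http://localhost:9200
from elasticsearch import Elasticsearch
client = Elasticsearch("http://localhost:9200")
resp = client.inference.put(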
    task_type="sparse_embedding",
    inference_id="my-elser-model",
    inference_config={
        "service": "elser",
        "service_settings": {
            "adaptive_allocations": {
                "enabled": True,
                "min_number_of_allocations": 3,
                "max_number_of_allocations": 10
            },
            "num_threads": 1
        }
    },
)
print(resp)

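// Assumes the official @elastic/elasticsearch client, an ES module (for top-level await), and a cluster at http://localhost:9200
import { Client } from "@elastic/elasticsearch";
const client = new Client({ node: "http://localhost:9200" });
const response = await client.inference.put({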
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  inference_config: {
    service: "elser",
    service_settings: {
      adaptive_allocations: {
        enabled: true,
        min_number_of_allocations: 3,
        max_number_of_allocations: 10,
      },
      num_threads: 1,
    },
  },
});
console.log(response);

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}

ELSER service example without adaptive allocations


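The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type. Refer to the ELSER model documentation for more information.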

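To optimize your ELSER endpoint for ingest, set the number of threads to 1 ("num_threads": 1). To optimize your ELSER endpoint for search, set the number of threads to a value greater than 1.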
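The ingest-versus-search guidance above maps directly onto the num_threads value in a fixed-allocation request body. The sketch below uses two hypothetical helpers, elser_service_settings and elser_request_body, which are not part of any Elasticsearch client; the defaults of one allocation and two search threads are assumptions for illustration, and the search thread count is also held to the power-of-2, maximum-32 rule for num_threads.

```python
import json


def elser_service_settings(optimize_for: str, num_allocations: int = 1,
                           search_threads: int = 2) -> dict:
    """Build fixed-allocation service_settings following the ingest/search guidance."""
    if optimize_for == "ingest":
        threads = 1  # ingest: one thread per allocation
    elif optimize_for == "search":
        # search: more than one thread; num_threads must also be a power of 2 and at most 32
        if search_threads <= 1 or search_threads > 32 or (search_threads & (search_threads - 1)):
            raise ValueError("search_threads must be a power of 2 between 2 and 32")
        threads = search_threads
    else:
        raise ValueError(f"unknown optimization goal: {optimize_for!r}")
    return {"num_allocations": num_allocations, "num_threads": threads}


def elser_request_body(optimize_for: str, **kwargs) -> dict:
    """Wrap the settings in the body sent to PUT _inference/sparse_embedding/<inference_id>."""
    return {"service": "elser", "service_settings": elser_service_settings(optimize_for, **kwargs)}


if __name__ == "__main__":
    print(json.dumps(elser_request_body("ingest"), indent=2))
```

Run as a script, this prints the same JSON body as the console request in the example that follows; passing "search" instead yields num_threads 2 unless another power of 2 is given.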

If the ELSER model is not already downloaded, the request below automatically downloads it and then deploys it.

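# Assumes the official elasticsearch Python client and a cluster reachable at http://localhost:9200
from elasticsearch import Elasticsearch
client = Elasticsearch("http://localhost:9200")
resp = client.inference.put(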
    task_type="sparse_embedding",
    inference_id="my-elser-model",
    inference_config={
        "service": "elser",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1
        }
    },
)
print(resp)

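// Assumes the official @elastic/elasticsearch client, an ES module (for top-level await), and a cluster at http://localhost:9200
import { Client } from "@elastic/elasticsearch";
const client = new Client({ node: "http://localhost:9200" });
const response = await client.inference.put({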
  task_type: "sparse_embedding",
  inference_id: "my-elser-model",
  inference_config: {
    service: "elser",
    service_settings: {
      num_allocations: 1,
      num_threads: 1,
    },
  },
});
console.log(response);

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

Example response

{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

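When using the Kibana Console, you might see a 502 Bad Gateway error in the response. This error usually just reflects a timeout while the model downloads in the background. You can check the download progress in the Machine Learning UI. If you are using the Python client, you can set the timeout parameter to a higher value.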
