Text expansion query


Deprecated in 8.15.0.

This query has been replaced by the sparse vector query.

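The text expansion query uses a natural language processing model to convert the query text into a list of token-weight pairs, which are then used in a query against a sparse vector or rank features field.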
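
Because this query is deprecated in favor of the sparse vector query, existing requests may eventually need rewriting. The sketch below shows one possible translation of a text_expansion clause into a sparse_vector clause. It assumes the sparse_vector query takes `field`, `inference_id`, and `query`, and that an inference endpoint serving the same model already exists. The helper name `to_sparse_vector` and the endpoint ID `my-elser-endpoint` are hypothetical.

```python
def to_sparse_vector(text_expansion_query: dict, inference_id: str) -> dict:
    """Translate a text_expansion clause into a sparse_vector clause.

    Assumption: inference_id names an inference endpoint backed by the same
    model that the text_expansion clause referenced through model_id.
    """
    clause = text_expansion_query["text_expansion"]
    # A text_expansion clause holds one entry: field name -> parameters.
    field, params = next(iter(clause.items()))
    return {
        "sparse_vector": {
            "field": field,
            "inference_id": inference_id,
            "query": params["model_text"],
        }
    }


old_query = {
    "text_expansion": {
        "ml.tokens": {
            "model_id": ".elser_model_2",
            "model_text": "How is the weather in Jamaica?",
        }
    }
}
new_query = to_sparse_vector(old_query, "my-elser-endpoint")
```

Pruning options are deliberately left out of this sketch; check the sparse vector query reference before carrying them over.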

Example request

Python:
resp = client.search(
    query={
        "text_expansion": {
            "<sparse_vector_field>": {
                "model_id": "the model to produce the token weights",
                "model_text": "the query string"
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  body: {
    query: {
      text_expansion: {
        "<sparse_vector_field>": {
          model_id: 'the model to produce the token weights',
          model_text: 'the query string'
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  query: {
    text_expansion: {
      "<sparse_vector_field>": {
        model_id: "the model to produce the token weights",
        model_text: "the query string",
      },
    },
  },
});
console.log(response);

Console:
GET _search
{
   "query":{
      "text_expansion":{
         "<sparse_vector_field>":{
            "model_id":"the model to produce the token weights",
            "model_text":"the query string"
         }
      }
   }
}

Top-level parameters for text_expansion

<sparse_vector_field>
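(Required, object) The name of the field that contains the token-weight pairs the NLP model created from the input text.

Top-level parameters for <sparse_vector_field>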

model_id
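(Required, string) The ID of the model used to convert the query text into token-weight pairs. It must be the same model ID that was used to create the tokens from the input text.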
model_text
(Required, string) The query text you want to use for search.
pruning_config

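(Optional, object) [preview] This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features. Optional pruning configuration. If enabled, non-significant tokens are omitted from the query to improve query performance. Default: disabled.

Parameters for <pruning_config> are: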

tokens_freq_ratio_threshold
(Optional, integer) [preview] Tokens whose frequency is more than tokens_freq_ratio_threshold times higher than the average frequency of all tokens in the specified field are considered outliers and are pruned. This value must be between 1 and 100. Default: 5.
tokens_weight_threshold
(Optional, float) [preview] Tokens whose weight is less than tokens_weight_threshold are considered non-significant and are pruned. This value must be between 0 and 1. Default: 0.4.
only_score_pruned_tokens
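(Optional, boolean) [preview] If true, only the pruned tokens are fed into scoring and the non-pruned tokens are discarded. It is strongly recommended to set this to false for the main query, but it can be set to true for a rescore query to get more relevant results. Default: false.

The default values for tokens_freq_ratio_threshold and tokens_weight_threshold were chosen based on tests with ELSER, where they gave the best results.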
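
The documented ranges and defaults can be checked on the client side before a request is sent. The following is a minimal sketch. The helper name `build_pruning_config` is hypothetical, and the sketch assumes the stated bounds are inclusive.

```python
def build_pruning_config(
    tokens_freq_ratio_threshold: int = 5,
    tokens_weight_threshold: float = 0.4,
    only_score_pruned_tokens: bool = False,
) -> dict:
    """Return a pruning_config object, rejecting out-of-range values.

    Defaults mirror the documented ones; bounds are assumed to be inclusive.
    """
    if not 1 <= tokens_freq_ratio_threshold <= 100:
        raise ValueError("tokens_freq_ratio_threshold must be between 1 and 100")
    if not 0 <= tokens_weight_threshold <= 1:
        raise ValueError("tokens_weight_threshold must be between 0 and 1")
    return {
        "tokens_freq_ratio_threshold": tokens_freq_ratio_threshold,
        "tokens_weight_threshold": tokens_weight_threshold,
        "only_score_pruned_tokens": only_score_pruned_tokens,
    }


default_config = build_pruning_config()
```

The returned dictionary can be placed under the `pruning_config` key of a text_expansion clause.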

Example ELSER query


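The following is an example of a text_expansion query that references the ELSER model to perform semantic search. For a more detailed description of how to perform semantic search with ELSER and the text_expansion query, refer to this tutorial.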

Python:
resp = client.search(
    index="my-index",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "How is the weather in Jamaica?"
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'my-index',
  body: {
    query: {
      text_expansion: {
        'ml.tokens' => {
          model_id: '.elser_model_2',
          model_text: 'How is the weather in Jamaica?'
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "my-index",
  query: {
    text_expansion: {
      "ml.tokens": {
        model_id: ".elser_model_2",
        model_text: "How is the weather in Jamaica?",
      },
    },
  },
});
console.log(response);

Console:
GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_2",
            "model_text":"How is the weather in Jamaica?"
         }
      }
   }
}

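Multiple text_expansion queries can be combined with each other or with other query types. This can be achieved by wrapping them in boolean query clauses and using linear boosting.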

Python:
resp = client.search(
    index="my-index",
    query={
        "bool": {
            "should": [
                {
                    "text_expansion": {
                        "ml.inference.title_expanded.predicted_value": {
                            "model_id": ".elser_model_2",
                            "model_text": "How is the weather in Jamaica?",
                            "boost": 1
                        }
                    }
                },
                {
                    "text_expansion": {
                        "ml.inference.description_expanded.predicted_value": {
                            "model_id": ".elser_model_2",
                            "model_text": "How is the weather in Jamaica?",
                            "boost": 1
                        }
                    }
                },
                {
                    "multi_match": {
                        "query": "How is the weather in Jamaica?",
                        "fields": [
                            "title",
                            "description"
                        ],
                        "boost": 4
                    }
                }
            ]
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'my-index',
  body: {
    query: {
      bool: {
        should: [
          {
            text_expansion: {
              'ml.inference.title_expanded.predicted_value' => {
                model_id: '.elser_model_2',
                model_text: 'How is the weather in Jamaica?',
                boost: 1
              }
            }
          },
          {
            text_expansion: {
              'ml.inference.description_expanded.predicted_value' => {
                model_id: '.elser_model_2',
                model_text: 'How is the weather in Jamaica?',
                boost: 1
              }
            }
          },
          {
            multi_match: {
              query: 'How is the weather in Jamaica?',
              fields: [
                'title',
                'description'
              ],
              boost: 4
            }
          }
        ]
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "my-index",
  query: {
    bool: {
      should: [
        {
          text_expansion: {
            "ml.inference.title_expanded.predicted_value": {
              model_id: ".elser_model_2",
              model_text: "How is the weather in Jamaica?",
              boost: 1,
            },
          },
        },
        {
          text_expansion: {
            "ml.inference.description_expanded.predicted_value": {
              model_id: ".elser_model_2",
              model_text: "How is the weather in Jamaica?",
              boost: 1,
            },
          },
        },
        {
          multi_match: {
            query: "How is the weather in Jamaica?",
            fields: ["title", "description"],
            boost: 4,
          },
        },
      ],
    },
  },
});
console.log(response);

Console:
GET my-index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "text_expansion": {
            "ml.inference.title_expanded.predicted_value": {
              "model_id": ".elser_model_2",
              "model_text": "How is the weather in Jamaica?",
              "boost": 1
            }
          }
        },
        {
          "text_expansion": {
            "ml.inference.description_expanded.predicted_value": {
              "model_id": ".elser_model_2",
              "model_text": "How is the weather in Jamaica?",
              "boost": 1
            }
          }
        },
        {
          "multi_match": {
            "query": "How is the weather in Jamaica?",
            "fields": [
              "title",
              "description"
            ],
            "boost": 4
          }
        }
      ]
    }
  }
}
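
When the set of fields varies, the same bool query can be assembled programmatically. In a bool query, the scores of the matching should clauses are added together, with each clause scaled by its boost, so the boosts act as linear weights. The sketch below builds that structure. The helper name `hybrid_bool_query` and its parameters are hypothetical.

```python
def hybrid_bool_query(
    model_id: str,
    model_text: str,
    expansion_boosts: dict,
    lexical_fields: list,
    lexical_boost: float,
) -> dict:
    """Combine one text_expansion clause per field with a multi_match clause.

    expansion_boosts maps each sparse vector field name to its boost.
    """
    should = [
        {
            "text_expansion": {
                field: {
                    "model_id": model_id,
                    "model_text": model_text,
                    "boost": boost,
                }
            }
        }
        for field, boost in expansion_boosts.items()
    ]
    # The lexical clause searches the plain text fields with its own weight.
    should.append(
        {
            "multi_match": {
                "query": model_text,
                "fields": list(lexical_fields),
                "boost": lexical_boost,
            }
        }
    )
    return {"bool": {"should": should}}


query = hybrid_bool_query(
    model_id=".elser_model_2",
    model_text="How is the weather in Jamaica?",
    expansion_boosts={
        "ml.inference.title_expanded.predicted_value": 1,
        "ml.inference.description_expanded.predicted_value": 1,
    },
    lexical_fields=["title", "description"],
    lexical_boost=4,
)
```

The resulting dictionary matches the request body shown above and can be passed as the `query` argument of a search call.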

This can also be achieved using reciprocal rank fusion (RRF), through an rrf retriever with multiple standard retrievers.

Python:
resp = client.search(
    index="my-index",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "multi_match": {
                                "query": "How is the weather in Jamaica?",
                                "fields": [
                                    "title",
                                    "description"
                                ]
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "text_expansion": {
                                "ml.inference.title_expanded.predicted_value": {
                                    "model_id": ".elser_model_2",
                                    "model_text": "How is the weather in Jamaica?"
                                }
                            }
                        }
                    }
                },
                {
                    "standard": {
                        "query": {
                            "text_expansion": {
                                "ml.inference.description_expanded.predicted_value": {
                                    "model_id": ".elser_model_2",
                                    "model_text": "How is the weather in Jamaica?"
                                }
                            }
                        }
                    }
                }
            ],
            "window_size": 10,
            "rank_constant": 20
        }
    },
)
print(resp)

JavaScript:
const response = await client.search({
  index: "my-index",
  retriever: {
    rrf: {
      retrievers: [
        {
          standard: {
            query: {
              multi_match: {
                query: "How is the weather in Jamaica?",
                fields: ["title", "description"],
              },
            },
          },
        },
        {
          standard: {
            query: {
              text_expansion: {
                "ml.inference.title_expanded.predicted_value": {
                  model_id: ".elser_model_2",
                  model_text: "How is the weather in Jamaica?",
                },
              },
            },
          },
        },
        {
          standard: {
            query: {
              text_expansion: {
                "ml.inference.description_expanded.predicted_value": {
                  model_id: ".elser_model_2",
                  model_text: "How is the weather in Jamaica?",
                },
              },
            },
          },
        },
      ],
      window_size: 10,
      rank_constant: 20,
    },
  },
});
console.log(response);

Console:
GET my-index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "How is the weather in Jamaica?",
                "fields": [
                  "title",
                  "description"
                ]
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "text_expansion": {
                "ml.inference.title_expanded.predicted_value": {
                  "model_id": ".elser_model_2",
                  "model_text": "How is the weather in Jamaica?"
                }
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "text_expansion": {
                "ml.inference.description_expanded.predicted_value": {
                  "model_id": ".elser_model_2",
                  "model_text": "How is the weather in Jamaica?"
                }
              }
            }
          }
        }
      ],
      "window_size": 10,
      "rank_constant": 20
    }
  }
}
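
To make the roles of `rank_constant` and `window_size` more concrete, here is a small sketch of how reciprocal rank fusion combines ranked lists: each document gets 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed. This is an illustration of the general RRF formula, not Elasticsearch's implementation. The function name `rrf_scores` and the document IDs are made up.

```python
from collections import defaultdict


def rrf_scores(rankings, rank_constant=20, window_size=10):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each list is truncated to window_size entries; ranks start at 1.
    Returns (doc_id, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:window_size], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the three standard retrievers.
lexical_hits = ["a", "b", "c"]
title_hits = ["b", "a", "d"]
description_hits = ["b", "c", "a"]

fused = rrf_scores([lexical_hits, title_hits, description_hits])
fused_order = [doc_id for doc_id, _ in fused]
```

With these made-up lists the fused order is b, a, c, d: document b sits at or near the top of all three lists, while d appears in only one.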

Example ELSER query with pruning configuration and rescore


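The following is an extension of the example above that adds a pruning configuration [preview] to the text_expansion query. The pruning configuration identifies non-significant tokens to prune from the query in order to improve query performance.

Token pruning happens at the shard level. While this should result in the same tokens being labeled as insignificant across shards, this is not guaranteed, because it depends on the composition of each shard. Therefore, if you run text_expansion with a pruning_config on a multi-shard index, we strongly recommend adding a rescore function (see Rescore filtered search results) that uses the tokens originally pruned from the query. This helps mitigate any shard-level inconsistency in token pruning and provides better overall relevance.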

Python:
resp = client.search(
    index="my-index",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "How is the weather in Jamaica?",
                "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": False
                }
            }
        }
    },
    rescore={
        "window_size": 100,
        "query": {
            "rescore_query": {
                "text_expansion": {
                    "ml.tokens": {
                        "model_id": ".elser_model_2",
                        "model_text": "How is the weather in Jamaica?",
                        "pruning_config": {
                            "tokens_freq_ratio_threshold": 5,
                            "tokens_weight_threshold": 0.4,
                            "only_score_pruned_tokens": True
                        }
                    }
                }
            }
        }
    },
)
print(resp)

Ruby:
response = client.search(
  index: 'my-index',
  body: {
    query: {
      text_expansion: {
        'ml.tokens' => {
          model_id: '.elser_model_2',
          model_text: 'How is the weather in Jamaica?',
          pruning_config: {
            tokens_freq_ratio_threshold: 5,
            tokens_weight_threshold: 0.4,
            only_score_pruned_tokens: false
          }
        }
      }
    },
    rescore: {
      window_size: 100,
      query: {
        rescore_query: {
          text_expansion: {
            'ml.tokens' => {
              model_id: '.elser_model_2',
              model_text: 'How is the weather in Jamaica?',
              pruning_config: {
                tokens_freq_ratio_threshold: 5,
                tokens_weight_threshold: 0.4,
                only_score_pruned_tokens: true
              }
            }
          }
        }
      }
    }
  }
)
puts response

JavaScript:
const response = await client.search({
  index: "my-index",
  query: {
    text_expansion: {
      "ml.tokens": {
        model_id: ".elser_model_2",
        model_text: "How is the weather in Jamaica?",
        pruning_config: {
          tokens_freq_ratio_threshold: 5,
          tokens_weight_threshold: 0.4,
          only_score_pruned_tokens: false,
        },
      },
    },
  },
  rescore: {
    window_size: 100,
    query: {
      rescore_query: {
        text_expansion: {
          "ml.tokens": {
            model_id: ".elser_model_2",
            model_text: "How is the weather in Jamaica?",
            pruning_config: {
              tokens_freq_ratio_threshold: 5,
              tokens_weight_threshold: 0.4,
              only_score_pruned_tokens: true,
            },
          },
        },
      },
    },
  },
});
console.log(response);

Console:
GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_2",
            "model_text":"How is the weather in Jamaica?",
            "pruning_config": {
               "tokens_freq_ratio_threshold": 5,
               "tokens_weight_threshold": 0.4,
               "only_score_pruned_tokens": false
            }
         }
      }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "text_expansion": {
               "ml.tokens": {
                  "model_id": ".elser_model_2",
                  "model_text": "How is the weather in Jamaica?",
                  "pruning_config": {
                     "tokens_freq_ratio_threshold": 5,
                     "tokens_weight_threshold": 0.4,
                     "only_score_pruned_tokens": true
                  }
               }
            }
         }
      }
   }
}
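
The main query and the rescore query in this example differ only in only_score_pruned_tokens, so both can be generated from one set of thresholds to keep them in sync. A minimal sketch, with a hypothetical helper name `pruned_search_body`:

```python
def pruned_search_body(
    field: str,
    model_id: str,
    model_text: str,
    tokens_freq_ratio_threshold: int = 5,
    tokens_weight_threshold: float = 0.4,
    window_size: int = 100,
) -> dict:
    """Build a search body with a pruned main query and a matching rescore."""

    def clause(only_score_pruned_tokens: bool) -> dict:
        return {
            "text_expansion": {
                field: {
                    "model_id": model_id,
                    "model_text": model_text,
                    "pruning_config": {
                        "tokens_freq_ratio_threshold": tokens_freq_ratio_threshold,
                        "tokens_weight_threshold": tokens_weight_threshold,
                        "only_score_pruned_tokens": only_score_pruned_tokens,
                    },
                }
            }
        }

    return {
        # Main query: score with the tokens that survive pruning.
        "query": clause(False),
        # Rescore: score the top window_size hits with the pruned tokens only.
        "rescore": {
            "window_size": window_size,
            "query": {"rescore_query": clause(True)},
        },
    }


body = pruned_search_body("ml.tokens", ".elser_model_2", "How is the weather in Jamaica?")
```

Assuming `client` is an Elasticsearch Python client instance, the result could be sent with `client.search(index="my-index", **body)`.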

Depending on your data, the text expansion query may be faster with track_total_hits: false.
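
As a sketch of how that setting might be applied, the helper below adds track_total_hits to an existing set of search arguments without mutating the original. The helper name `without_total_hits` is hypothetical, and whether the setting actually speeds up a given query has to be measured on your own data.

```python
def without_total_hits(search_args: dict) -> dict:
    """Return a copy of the search arguments with total-hit tracking disabled."""
    updated = dict(search_args)
    updated["track_total_hits"] = False
    return updated


search_args = {
    "index": "my-index",
    "query": {
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "How is the weather in Jamaica?",
            }
        }
    },
}
fast_args = without_total_hits(search_args)
```

Assuming `client` is an Elasticsearch Python client instance, the request could then be issued as `client.search(**fast_args)`.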