Tutorial: Hybrid search with semantic_text
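The recommended way to use hybrid search in the Elastic Stack is to follow the semantic_text workflow. This tutorial uses the elasticsearch service for demonstration, but you can use any service offered by the Inference API, together with the models it supports.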
PUT semantic-embeddings { "mappings": { "properties": { "semantic_text": { "type": "semantic_text" }, "content": { "type": "text", "copy_to": "semantic_text" } } } }
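If you want to run a search on indices that were populated by web crawlers or connectors, you must update the index mappings of those indices to include the semantic_text field.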
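For readers following along with the Python client, the mapping request above can be sketched as below. This is a minimal sketch, not the official example: the helper names `semantic_embeddings_mappings` and `create_semantic_index` are hypothetical, and `client` is assumed to be an `Elasticsearch` instance like the one used in the other Python snippets in this tutorial.

```python
from typing import Any


def semantic_embeddings_mappings() -> dict[str, Any]:
    """Return the mappings from the PUT semantic-embeddings request."""
    return {
        "properties": {
            # Field that stores the embeddings generated at ingest time.
            "semantic_text": {"type": "semantic_text"},
            # Plain text field for lexical search; copy_to feeds semantic_text.
            "content": {"type": "text", "copy_to": "semantic_text"},
        }
    }


def create_semantic_index(client: Any, index: str = "semantic-embeddings") -> Any:
    """Create the index via client.indices.create and return the response.

    `client` is assumed to be an elasticsearch.Elasticsearch instance.
    """
    return client.indices.create(index=index, mappings=semantic_embeddings_mappings())
```

Keeping the mapping in its own function makes it easy to reuse the same definition when updating the mappings of crawler or connector indices.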
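This tutorial uses the msmarco-passagetest2019-top1000 dataset, which is a subset of the MS MARCO Passage Ranking dataset. It contains 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that dataset and compiled into a tsv file.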
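Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI. After your data is analyzed, click Override settings. Under Edit field names, assign id to the first column and content to the second. Click Apply, then Import. Name the index test-data, and click Import. After the upload is complete, you will see an index named test-data with 182,469 documents.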
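If you would rather script the upload than use the Data Visualizer, the column assignment described above (id first, content second) could be reproduced roughly as follows. This is a sketch under assumptions: the helper name `tsv_to_actions` is hypothetical, the file is assumed to have no header row, and the resulting actions are meant to be passed to `elasticsearch.helpers.bulk`, which is a different ingestion path from the one this tutorial describes.

```python
import csv
from typing import Iterable, Iterator


def tsv_to_actions(lines: Iterable[str], index: str = "test-data") -> Iterator[dict]:
    """Yield one bulk action per TSV row: column 1 is id, column 2 is content."""
    # QUOTE_NONE keeps quotation marks inside passages as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield {"_index": index, "_source": {"id": int(row[0]), "content": row[1]}}


# Usage sketch (the file path is a placeholder, not part of the tutorial):
#   from elasticsearch import helpers
#   with open("<path-to-tsv>", encoding="utf-8", newline="") as f:
#       helpers.bulk(client, tsv_to_actions(f))
```

The generator yields actions lazily, so the whole file never has to be held in memory at once.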
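Reindex the data from the test-data index into the semantic-embeddings index. The data in the content field of the source index is copied into the content field of the destination index. The copy_to parameter set when the index mapping was created ensures that the content is also copied into the semantic_text field.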
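This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed, rather than using the test-data set, reindexing is still required so that the data is processed at ingest time and the embeddings are generated.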
resp = client.reindex( wait_for_completion=False, source={ "index": "test-data", "size": 10 }, dest={ "index": "semantic-embeddings" }, ) print(resp)
const response = await client.reindex({ wait_for_completion: false, source: { index: "test-data", size: 10, }, dest: { index: "semantic-embeddings", }, }); console.log(response);
POST _reindex?wait_for_completion=false { "source": { "index": "test-data", "size": 10 }, "dest": { "index": "semantic-embeddings" } }
The call returns a task ID that you can use to monitor the progress:
resp = client.tasks.get( task_id="<task_id>", ) print(resp)
const response = await client.tasks.get({ task_id: "<task_id>", }); console.log(response);
GET _tasks/<task_id>
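Instead of re-running the request by hand, the task can be polled in a loop. The following is a minimal sketch: `wait_for_task`, `poll_seconds`, and `max_polls` are hypothetical names, `client` is assumed to be the same `Elasticsearch` instance as above, and the `created` and `total` counters are read defensively in case the task status does not include them.

```python
import time
from typing import Any


def wait_for_task(client: Any, task_id: str, poll_seconds: float = 5.0,
                  max_polls: int = 120) -> Any:
    """Poll client.tasks.get until the task reports completed, then return it."""
    for _ in range(max_polls):
        resp = client.tasks.get(task_id=task_id)
        if resp["completed"]:
            return resp
        # Progress counters reported by the reindex task, if present.
        status = resp.get("task", {}).get("status", {})
        print(f"reindexed {status.get('created')} of {status.get('total')} documents")
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

The `max_polls` limit is a design choice to keep the loop from running forever when reindexing a large dataset; raise it or remove it as needed.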
To stop the reindexing task before it finishes, cancel it:
resp = client.tasks.cancel( task_id="<task_id>", ) print(resp)
const response = await client.tasks.cancel({ task_id: "<task_id>", }); console.log(response);
POST _tasks/<task_id>/_cancel
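After reindexing the data into the semantic-embeddings index, you can perform hybrid search by using reciprocal rank fusion (RRF). RRF is a technique that merges the rankings from the semantic and lexical queries, giving more weight to results that rank high in either search. This ensures that the final results are balanced and relevant.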
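To make the fusion step concrete, here is a small stand-alone sketch of the RRF formula: each result list contributes 1 / (rank_constant + rank) to a document's score. It is an illustration, not the Elasticsearch implementation. The function name `rrf_scores` and the document IDs are hypothetical; the default of 60 matches the default `rank_constant` of the Elasticsearch rrf retriever.

```python
from collections import defaultdict


def rrf_scores(rankings: list[list[str]], rank_constant: int = 60) -> dict[str, float]:
    """Fuse ranked ID lists: each list adds 1 / (rank_constant + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


lexical = ["doc_a", "doc_b", "doc_c"]   # hypothetical match-query ranking
semantic = ["doc_a", "doc_c", "doc_d"]  # hypothetical semantic-query ranking
fused = rrf_scores([lexical, semantic])
print(fused)
```

A document ranked first by both queries scores 2/61, roughly 0.0328, which is consistent with the `_score` of the top hit in the example response in this tutorial.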
resp = client.search( index="semantic-embeddings", retriever={ "rrf": { "retrievers": [ { "standard": { "query": { "match": { "content": "How to avoid muscle soreness while running?" } } } }, { "standard": { "query": { "semantic": { "field": "semantic_text", "query": "How to avoid muscle soreness while running?" } } } } ] } }, ) print(resp)
const response = await client.search({ index: "semantic-embeddings", retriever: { rrf: { retrievers: [ { standard: { query: { match: { content: "How to avoid muscle soreness while running?", }, }, }, }, { standard: { query: { semantic: { field: "semantic_text", query: "How to avoid muscle soreness while running?", }, }, }, }, ], }, }, }); console.log(response);
GET semantic-embeddings/_search { "retriever": { "rrf": { "retrievers": [ { "standard": { "query": { "match": { "content": "How to avoid muscle soreness while running?" } } } }, { "standard": { "query": { "semantic": { "field": "semantic_text", "query": "How to avoid muscle soreness while running?" } } } } ] } } }
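The first retriever runs the lexical search: a match query on the content field using the specified phrase.
The second retriever runs the semantic search: a semantic query on the semantic_text field.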
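After the hybrid search runs, the query returns the top 10 documents that match the semantic and lexical search criteria. The results include detailed information about each document: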
{ "took": 107, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 473, "relation": "eq" }, "max_score": null, "hits": [ { "_index": "semantic-embeddings", "_id": "wv65epIBEMBRnhfTsOFM", "_score": 0.032786883, "_rank": 1, "_source": { "semantic_text": { "inference": { "inference_id": "my-elser-endpoint", "model_settings": { "task_type": "sparse_embedding" }, "chunks": [ { "text": "What so many out there do not realize is the importance of what you do after you work out. You may have done the majority of the work, but how you treat your body in the minutes and hours after you exercise has a direct effect on muscle soreness, muscle strength and growth, and staying hydrated. Cool Down. After your last exercise, your workout is not over. The first thing you need to do is cool down. Even if running was all that you did, you still should do light cardio for a few minutes. This brings your heart rate down at a slow and steady pace, which helps you avoid feeling sick after a workout.", "embeddings": { "exercise": 1.571044, "after": 1.3603843, "sick": 1.3281639, "cool": 1.3227621, "muscle": 1.2645415, "sore": 1.2561599, "cooling": 1.2335974, "running": 1.1750668, "hours": 1.1104802, "out": 1.0991782, "##io": 1.0794281, "last": 1.0474665, (...) } } ] } }, "id": 8408852, "content": "What so many out there do not realize is the importance of (...)" } } ] } }
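Because each hit carries the full semantic_text inference payload, the raw response is verbose. A short helper can reduce it to the fields that matter when reading results. This is a sketch only: `summarize_hits` and `preview_chars` are hypothetical names, and `_rank` is read with `.get` in case a response omits it.

```python
from typing import Any


def summarize_hits(response: Any, preview_chars: int = 60) -> list[dict]:
    """Reduce a search response to rank, score, id, and a short content preview."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        rows.append({
            "rank": hit.get("_rank"),
            "score": hit["_score"],
            "id": source["id"],
            # Truncate the passage so each row stays on one line.
            "preview": source["content"][:preview_chars],
        })
    return rows


# Usage sketch, with `resp` being the search response from the request above:
#   for row in summarize_hits(resp):
#       print(row)
```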