Chunking large documents via ingest pipelines and nested vectors for easy passage search

Learn how to chunk large documents in Elasticsearch with ingest pipelines and nested vectors, making passage search with vector search easy.

Vector search is a powerful way to search data by meaning rather than by exact or inexact token-matching techniques. However, the text embedding models that power vector search can only work on short passages of text, on the order of a few sentences, unlike BM25-based techniques that can handle arbitrarily large amounts of text. With Elasticsearch, large documents can now be combined seamlessly with vector search.

How does it work at a high level?

A combination of Elasticsearch capabilities (ingest pipelines, the flexibility of the script processor, and new support for nested documents with dense_vectors) provides a straightforward way to split large documents at ingest time into passages small enough to be processed by a text embedding model, which then generates all the vectors needed to represent the full meaning of the large document.

Ingest your document data as you normally would, add a script processor to your ingest pipeline to break the large text data into an array of sentences or other kinds of chunks, then use a for_each processor to run an inference processor on each chunk. The index mappings define the chunk array as a nested object with a dense_vector mapping as a sub-object, which indexes each vector correctly and makes it searchable.

How to chunk large documents via ingest pipelines and nested vectors

Load a text embedding model

The first thing you will need is a model to create text embeddings from the chunks. You can use whatever you'd like, but this example runs end to end with the all-MiniLM-L6-v2 model. With an Elastic Cloud cluster created or another Elasticsearch cluster ready, we can upload the text embedding model using the eland library.

# Shell variables used by eland_import_hub_model below
# (requires eland with its PyTorch extras: python -m pip install 'eland[pytorch]')
MODEL_ID="sentence-transformers/all-MiniLM-L6-v2"
ELASTIC_PASSWORD="YOURPASSWORD"
CLOUD_ID="YOURCLOUDID"

eland_import_hub_model \
    --cloud-id $CLOUD_ID \
    --es-username elastic \
    --es-password $ELASTIC_PASSWORD \
    --hub-model-id $MODEL_ID \
    --task-type text_embedding \
    --start
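
Note that eland normalizes the Hugging Face model ID on upload: sentence-transformers/all-MiniLM-L6-v2 becomes sentence-transformers__all-minilm-l6-v2, which is the ID used throughout the rest of this walkthrough. If you want to confirm the model is deployed and started before continuing, the trained models stats API will show its deployment state (this verification call is optional and not part of the main flow):

GET _ml/trained_models/sentence-transformers__all-minilm-l6-v2/_stats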

Mappings example

The next step is to prepare the mappings to handle the array of sentence and vector objects that will be created during the ingest pipeline. For this particular text embedding model the dimensions are 384, and dot_product similarity will be used for nearest neighbor calculations. Note that dot_product similarity in Elasticsearch expects vectors normalized to unit length; if your model does not emit normalized embeddings, use cosine instead.

PUT chunker
{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "passages": {
        "type": "nested",
        "properties": {
          "vector": {
            "properties": {
              "predicted_value": {
                "type": "dense_vector",
                "index": true,
                "dims": 384,
                "similarity": "dot_product"
              }
            }
          }
        }
      }
    }
  }
}

Ingest pipeline example

The last preparation step is to define an ingest pipeline that breaks up the body_content field into chunks of text stored in the passages field. This pipeline has two processors. The first, a script processor, breaks up the body_content field into an array of sentences, stored in the passages field, via a regular expression. For further study, read up on advanced regular expression features such as negative and positive lookbehind to understand how the expression tries to split properly on sentence boundaries, avoids splitting on Mr. or Mrs. or Ms., and keeps punctuation together with its sentence. The script also joins sentence chunks together as long as the total string length stays under the parameter passed to the script. The foreach processor that follows then runs the text embedding model on each sentence via an inference processor.

PUT _ingest/pipeline/chunker
{
  "processors": [
    {
      "script": {
        "description": "Chunk body_content into sentences by looking for . followed by a space",
        "lang": "painless",
        "source": """
          // Split body_content into sentences: break on '.', '!' or '?' followed by a space,
          // using negative lookbehind so we do not split after Mr., Ms. or Mrs.
          String[] envSplit = /((?<!M(r|s|rs)\.)(?<=\.) |(?<=\!) |(?<=\?) )/.split(ctx['body_content']);
          ctx['passages'] = new ArrayList();
          int i = 0;
          boolean remaining = true;
          if (envSplit.length == 0) {
            return;
          } else if (envSplit.length == 1) {
            // A single sentence becomes the only passage.
            Map passage = ['text': envSplit[0]];
            ctx['passages'].add(passage);
          } else {
            while (remaining) {
              // Start a passage with the next sentence, then greedily append further
              // sentences while the passage stays under params.model_limit characters.
              Map passage = ['text': envSplit[i++]];
              while (i < envSplit.length && passage.text.length() + envSplit[i].length() < params.model_limit) {
                passage.text = passage.text + ' ' + envSplit[i++];
              }
              if (i == envSplit.length) {
                remaining = false;
              }
              ctx['passages'].add(passage);
            }
          }
          """,
        "params": {
          "model_limit": 400
        }
      }
    },
    {
      "foreach": {
        "field": "passages",
        "processor": {
          "inference": {
            "field_map": {
              "_ingest._value.text": "text_field"
            },
            "model_id": "sentence-transformers__all-minilm-l6-v2",
            "target_field": "_ingest._value.vector",
            "on_failure": [
              {
                "append": {
                  "field": "_source._ingest.inference_errors",
                  "value": [
                    {
                      "message": "Processor 'inference' in pipeline 'ml-inference-title-vector' failed with message '{{ _ingest.on_failure_message }}'",
                      "pipeline": "ml-inference-title-vector",
                      "timestamp": "{{{ _ingest.timestamp }}}"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
}
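
Before indexing real documents, you can dry-run the pipeline with the simulate API. The short body_content below is just made-up sample text; if everything is wired up correctly, the response should show a passages array in which each element carries its text plus a vector.predicted_value produced by the model:

POST _ingest/pipeline/chunker/_simulate
{
  "docs": [
    {
      "_source": {
        "body_content": "This is the first sentence. Here is a second one! And does a third sentence land in the same chunk?"
      }
    }
  ]
}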

Add some documents

Now we can add documents with large amounts of text in body_content and have them automatically chunked, with each chunk of text embedded into a vector by the model.

PUT chunker/_doc/1?pipeline=chunker
{
"title": "Adding passage vector search to Lucene",
"body_content": "Vector search is a powerful tool in the information retrieval tool box. Using vectors alongside lexical search like BM25 is quickly becoming commonplace. But there are still a few pain points within vector search that need to be addressed. A major one is text embedding models and handling larger text input. Where lexical search like BM25 is already designed for long documents, text embedding models are not. All embedding models have limitations on the number of tokens they can embed. So, for longer text input it must be chunked into passages shorter than the model’s limit. Now instead of having one document with all its metadata, you have multiple passages and embeddings. And if you want to preserve your metadata, it must be added to every new document. A way to address this is with Lucene's “join” functionality. This is an integral part of Elasticsearch’s nested field type. It makes it possible to have a top-level document with multiple nested documents, allowing you to search over nested documents and join back against their parent documents. This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch doesn’t support vectors in nested fields. Why not, and what needs to change? The key issue is how Lucene can join back to the parent documents when searching child vector passages. Like with kNN pre-filtering versus post-filtering, when the joining occurs determines the result quality and quantity. If a user searches for the top four nearest parent documents (not passages) to a query vector, they usually expect four documents. But what if they are searching over child vector passages and all four of the nearest vectors are from the same parent document? This would end up returning just one parent document, which would be surprising. This same kind of issue occurs with post-filtering."
}

PUT chunker/_doc/3?pipeline=chunker
{
"title": "Automatic Byte Quantization in Lucene",
"body_content": "While HNSW is a powerful and flexible way to store and search vectors, it does require a significant amount of memory to run quickly. For example, querying 1MM float32 vectors of 768 dimensions requires roughly 1,000,000∗4∗(768+12)=3120000000≈31,000,000∗4∗(768+12)=3120000000bytes≈3GB of ram. Once you start searching a significant number of vectors, this gets expensive. One way to use around 75% less memory is through byte quantization. Lucene and consequently Elasticsearch has supported indexing byte vectors for some time, but building these vectors has been the user's responsibility. This is about to change, as we have introduced int8 scalar quantization in Lucene. All quantization techniques are considered lossy transformations of the raw data. Meaning some information is lost for the sake of space. For an in depth explanation of scalar quantization, see: Scalar Quantization 101. At a high level, scalar quantization is a lossy compression technique. Some simple math gives significant space savings with very little impact on recall. Those used to working with Elasticsearch may be familiar with these concepts already, but here is a quick overview of the distribution of documents for search. Each Elasticsearch index is composed of multiple shards. While each shard can only be assigned to a single node, multiple shards per index gives you compute parallelism across nodes. Each shard is composed as a single Lucene Index. A Lucene index consists of multiple read-only segments. During indexing, documents are buffered and periodically flushed into a read-only segment. When certain conditions are met, these segments can be merged in the background into a larger segment. All of this is configurable and has its own set of complexities. But, when we talk about segments and merging, we are talking about read-only Lucene segments and the automatic periodic merging of these segments. Here is a deeper dive into segment merging and design decisions."
}

PUT chunker/_doc/2?pipeline=chunker
{
"title": "Use a Japanese language NLP model in Elasticsearch to enable semantic searches",
"body_content": "Quickly finding necessary documents from among the large volume of internal documents and product information generated every day is an extremely important task in both work and daily life. However, if there is a high volume of documents to search through, it can be a time-consuming process even for computers to re-read all of the documents in real time and find the target file. That is what led to the appearance of Elasticsearch and other search engine software. When a search engine is used, search index data is first created so that key search terms included in documents can be used to quickly find those documents. However, even if the user has a general idea of what type of information they are searching for, they may not be able to recall a suitable keyword or they may search for another expression that has the same meaning. Elasticsearch enables synonyms and similar terms to be defined to handle such situations, but in some cases it can be difficult to simply use a correspondence table to convert a search query into a more suitable one. To address this need, Elasticsearch 8.0 released the vector search feature, which searches by the semantic content of a phrase. Alongside that, we also have a blog series on how to use Elasticsearch to perform vector searches and other NLP tasks. However, up through the 8.8 release, it was not able to correctly analyze text in languages other than English. With the 8.9 release, Elastic added functionality for properly analyzing Japanese in text analysis processing. This functionality enables Elasticsearch to perform semantic searches like vector search on Japanese text, as well as natural language processing tasks such as sentiment analysis in Japanese. In this article, we will provide specific step-by-step instructions on how to use these features."
}

PUT chunker/_doc/5?pipeline=chunker
{
"title": "We can chunk whatever we want now basically to the limits of a document ingest",
"body_content": """Chonk is an internet slang term used to describe overweight cats that grew popular in the late summer of 2018 after a photoshopped chart of cat body-fat indexes renamed the "Chonk" scale grew popular on Twitter and Reddit. Additionally, "Oh Lawd He Comin'," the final level of the Chonk Chart, was adopted as an online catchphrase used to describe large objects, animals or people. It is not to be confused with the Saturday Night Live sketch of the same name. The term "Chonk" was popularized in a photoshopped edit of a chart illustrating cat body-fat indexes and the risk of health problems for each class (original chart shown below). The first known post of the "Chonk" photoshop, which classifies each cat to a certain level of "chonk"-ness ranging from "A fine boi" to "OH LAWD HE COMIN," was posted to Facebook group THIS CAT IS C H O N K Y on August 2nd, 2018 by Emilie Chang (shown below). The chart surged in popularity after it was tweeted by @dreamlandtea[1] on August 10th, 2018, gaining over 37,000 retweets and 94,000 likes (shown below). After the chart was posted there, it began growing popular on Reddit. It was reposted to /r/Delighfullychubby[2] on August 13th, 2018, and /r/fatcats on August 16th.[3] Additionally, cats were shared with variations on the phrase "Chonk." In @dreamlandtea's Twitter thread, she rated several cats on the Chonk scale (example, shown below, left). On /r/tumblr, a screenshot of a post featuring a "good luck cat" titled "Lucky Chonk" gained over 27,000 points (shown below, right). The popularity of the phrase led to the creation of a subreddit, /r/chonkers,[4] that gained nearly 400 subscribers in less than a month. Some photoshops of the chonk chart also spread on Reddit. For example, an edit showing various versions of Pikachu on the chart posted to /r/me_irl gained over 1,200 points (shown below, left). The chart gained further popularity when it was posted to /r/pics[5] September 29th, 2018."""
}
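
To see how one of these documents was actually split, you can fetch it back with the source filtered down to just the passage text (the document ID here is arbitrary; any of the IDs above works):

GET chunker/_doc/1?_source_includes=passages.text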

Search the documents

To search the data and return the chunk that best matches the query, you can use inner_hits with the knn clause so that the hits output of the query returns only the best-matching chunk of each document.

GET chunker/_search
{
  "_source": false,
  "fields": [
    "title"
  ],
  "knn": {
    "inner_hits": {
      "_source": false,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.vector.predicted_value",
    "k": 1,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Can I use multiple vectors per document now?"
      }
    }
  }
}

This will return the best document and the relevant portion of the larger document's text.

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.75261426,
    "hits": [
      {
        "_index": "chunker",
        "_id": "1",
        "_score": 0.75261426,
        "_ignored": [
          "body_content.keyword",
          "passages.text.keyword"
        ],
        "fields": {
          "title": [
            "Adding passage vector search to Lucene"
          ]
        },
        "inner_hits": {
          "passages": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.75261426,
              "hits": [
                {
                  "_index": "chunker",
                  "_id": "1",
                  "_nested": {
                    "field": "passages",
                    "offset": 3
                  },
                  "_score": 0.75261426,
                  "fields": {
                    "passages": [
                      {
                        "text": [
                          "This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch doesn’t support vectors in nested fields. Why not, and what needs to change? The key issue is how Lucene can join back to the parent documents when searching child vector passages."
                        ]
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

Recap

The approach used here showcases the power of combining different capabilities of Elasticsearch to solve a larger problem.

Ingest pipelines let you preprocess documents before they are indexed. While many processors perform specific targeted tasks, sometimes you need the power of a scripting language to do things like break text into an array of sentences. Because you have access to the document before it is indexed, you can remake your data in almost any way imaginable, as long as all the information you need is in the document itself. The foreach processor lets us wrap something that may run zero to N times without knowing up front how many executions are needed; here, it iterates over as many sentences as we extracted, running the inference processor on each to create the vectors.

The index mappings were prepared to handle the array of text and vector objects, which did not exist in the original document, using nested objects to index the data in a way that lets us search the documents correctly.

Using knn with nested support for vectors allows inner_hits to surface the best-scoring passage of a document, which can substitute for what would typically be done with highlighting in a BM25 query.

Conclusion

Hopefully this showcases Elasticsearch at its best: simply bring your data, and Elasticsearch will make it searchable.

To level up your skills, watch this video on how to implement a recursive chunking strategy.

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.
