Using the `annotated-text` field

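The `annotated-text` field tokenizes text content in the same way as the more common `text` field (see "Limitations" below), and additionally injects any marked-up annotation tokens directly into the search index.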

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "annotated_text"
      }
    }
  }
}

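A mapping like this allows marked-up text, such as Wikipedia articles, to be indexed as both text and structured tokens. Annotations use a Markdown-like syntax in which one or more URL-encoded values are separated by the & symbol.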
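The syntax described above can be produced programmatically. The following is a minimal Python sketch, not part of the plugin: the helper name `annotate` is hypothetical, and it assumes that `urllib.parse.quote_plus` is an acceptable way to URL-encode each annotation value.

```python
from urllib.parse import quote_plus


def annotate(text, values):
    """Wrap `text` in annotation markup carrying one or more URL-encoded values.

    Hypothetical helper for illustration; it does not escape square
    brackets or parentheses that may appear in `text`.
    """
    if not values:
        raise ValueError("at least one annotation value is required")
    # Each value is URL-encoded on its own, then the values are joined with '&'.
    encoded = "&".join(quote_plus(v) for v in values)
    return f"[{text}]({encoded})"


print(annotate("Apple", ["Apple Inc."]))
print(annotate("Jeff Beck", ["Jeff Beck", "Guitarist"]))
```

The two calls print `[Apple](Apple+Inc.)` and `[Jeff Beck](Jeff+Beck&Guitarist)`, which are the marked-up forms used in the examples on this page.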

We can use the `_analyze` API to test how an example annotation would be stored as tokens in the search index:

GET my-index-000001/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}

Response:

{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", 
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

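Note that the whole annotation token `Apple Inc.` is placed in the token stream unchanged, as a single token, at the same position (position 2) as the text token it annotates (`apple`).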
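To make the shared position easier to see, the response above can be grouped by `position`. This is a small Python sketch over a trimmed copy of that response (offsets omitted); the function name `tokens_by_position` is hypothetical.

```python
from collections import defaultdict

# Trimmed copy of the _analyze response shown above (offsets omitted).
analyze_response = {
    "tokens": [
        {"token": "investors", "type": "<ALPHANUM>", "position": 0},
        {"token": "in", "type": "<ALPHANUM>", "position": 1},
        {"token": "Apple Inc.", "type": "annotation", "position": 2},
        {"token": "apple", "type": "<ALPHANUM>", "position": 2},
        {"token": "rejoiced", "type": "<ALPHANUM>", "position": 3},
    ]
}


def tokens_by_position(response):
    """Group (token, type) pairs by their position in the token stream."""
    grouped = defaultdict(list)
    for tok in response["tokens"]:
        grouped[tok["position"]].append((tok["token"], tok["type"]))
    return dict(grouped)


for position, toks in sorted(tokens_by_position(analyze_response).items()):
    print(position, toks)
```

Position 2 is the only entry with two tokens: the annotation `Apple Inc.` and the text token `apple`.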

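We can now search for annotations using regular `term` queries, which do not tokenize the provided search value. Annotations are a more precise way of matching, as this example shows: a search for `Beck` will not match `Jeff Beck`.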

# Example documents
PUT my-index-000001/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"
}

PUT my-index-000001/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"
}

# Example search
GET my-index-000001/_search
{
  "query": {
    "term": {
        "my_field": "Beck" 
    }
  }
}

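In document 1, as well as tokenizing the plain text into single words such as `beck`, we inject the single token value `Beck` at the same position as `beck` in the token stream.
In document 2, note that an annotation can inject multiple tokens at the same position: here we inject both the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables broader positional queries, for example finding mentions of a `Guitarist` near `strat`.
A benefit of searching with these carefully defined annotation tokens is that the query for `Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`.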
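The matching behavior described above can be imitated with a toy model. The Python sketch below is an illustration only: `toy_tokens` and `term_matches` are hypothetical names, the regular expression is an assumed approximation of the annotation syntax, and lowercasing plus word splitting stands in for the real analyzer.

```python
import re
from urllib.parse import unquote_plus

# Assumed approximation of the [text](value&value) annotation markup.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def toy_tokens(marked_up):
    """Return the set of tokens a toy analyzer would index for marked-up text."""
    tokens = set()
    # Annotation values are injected unchanged (after URL decoding).
    for _text, values in ANNOTATION.findall(marked_up):
        tokens.update(unquote_plus(v) for v in values.split("&") if v)
    # The plain text, with the markup stripped, is lowercased and split into words.
    plain = ANNOTATION.sub(lambda m: m.group(1), marked_up)
    tokens.update(re.findall(r"\w+", plain.lower()))
    return tokens


def term_matches(term, marked_up):
    """Exact, untokenized comparison, in the spirit of a term query."""
    return term in toy_tokens(marked_up)


doc1 = "[Beck](Beck) announced a new tour"
doc2 = "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"

print(term_matches("Beck", doc1))  # True: the annotation token Beck was injected
print(term_matches("Beck", doc2))  # False: only jeff, beck, Jeff Beck, Guitarist, ...
print(term_matches("beck", doc2))  # True: the lowercased text token still matches
```

The exact-match comparison is what makes the capitalized annotation token `Beck` distinct from the lowercased text token `beck`.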

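Using = signs in annotation values, for example [Prince](person=Prince), causes the document to be rejected with a parse failure. We hope to have a use for the equals sign in the future, so documents that contain it are actively rejected today.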
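A client-side check can catch such values before indexing. The following Python sketch is an assumption-laden illustration: `annotations_with_equals` is a hypothetical helper, the regular expression only approximates the annotation syntax, and the check looks for a literal `=` in the raw annotation value, which may not mirror the server-side rule exactly.

```python
import re

# Assumed approximation of the [text](value&value) annotation markup.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^)(]*)\)")


def annotations_with_equals(marked_up):
    """Return the raw annotation values that contain a literal '='."""
    offending = []
    for _text, values in ANNOTATION.findall(marked_up):
        offending.extend(v for v in values.split("&") if "=" in v)
    return offending


print(annotations_with_equals("[Prince](person=Prince) played"))
print(annotations_with_equals("[Prince](Prince) played"))
```

The first call reports `person=Prince`; the second returns an empty list.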

Synthetic `_source`

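Synthetic `_source` is Generally Available only for TSDB indices (indices that have `index.mode` set to `time_series`). For other indices, synthetic `_source` is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.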

If a `keyword` sub-field is used, the values are sorted in the same way as a `keyword` field's values. By default, that means sorted with duplicates removed. So:

PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "annotated_text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}

Will become:

{
  "text": [
    "jumped over the lazy dog",
    "the quick brown fox"
  ]
}
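The effect on the array above can be reproduced in a couple of lines of Python. This only models the described outcome (sorted, duplicates removed); it is not how Elasticsearch implements synthetic `_source`, and `synthetic_keyword_values` is a hypothetical name.

```python
def synthetic_keyword_values(values):
    """Model the described outcome: values sorted with duplicates removed."""
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_keyword_values(original))
```

This prints the two distinct values with `jumped over the lazy dog` first, matching the result shown above.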

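Reordering text fields can affect phrase and span queries. See the discussion about `position_increment_gap` for more detail. You can avoid this by making sure the `slop` parameter on phrase queries is lower than the `position_increment_gap`. This is the default.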
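To see why the gap protects against matches that span two array values, the Python sketch below assigns token positions to a multi-valued field. It is a simplified model under stated assumptions: a `position_increment_gap` of 100, whitespace splitting in place of the real analyzer, and a hypothetical helper name `assign_positions`.

```python
POSITION_INCREMENT_GAP = 100  # assumed value of position_increment_gap


def assign_positions(values, gap=POSITION_INCREMENT_GAP):
    """Assign token positions across array values, adding `gap` between values."""
    positions = []
    position = -1
    for i, value in enumerate(values):
        if i > 0:
            # The gap is added once between consecutive values.
            position += gap
        for word in value.lower().split():
            position += 1
            positions.append((word, position))
    return positions


for word, position in assign_positions(
    ["the quick brown fox", "jumped over the lazy dog"]
):
    print(position, word)
```

In this model `fox` sits at position 3 and `jumped` at position 104, so a phrase query with a small `slop` cannot bridge the two values, whichever order they are stored in.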

If the `annotated_text` field sets `store` to `true`, then order and duplicates are preserved.

PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "annotated_text", "store": true }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}

Will become:

{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}
