
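Text type family

The text family includes the following field types:

text, the traditional field type for full-text content such as the body of an email or the description of a product.
match_only_text, a space-optimized variant of text that disables scoring and performs slower on queries that need positions. It is best suited for indexing log messages.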

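Text field type

A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed: before indexing, they are passed through an analyzer that converts the string into a list of individual terms. The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are seldom used for aggregations (although the significant text aggregation is a notable exception).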
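
To make the idea of analysis concrete, here is a minimal sketch in plain Python. It is not the `standard` analyzer; it only imitates two of its visible effects (splitting on non-word characters and lowercasing). The function name `simple_analyze` is hypothetical.

```python
import re


def simple_analyze(text: str) -> list[str]:
    """Rough stand-in for an analyzer: split into word-like tokens and lowercase them."""
    # \w+ matches runs of letters, digits, and underscores; anything else separates tokens.
    return [token.lower() for token in re.findall(r"\w+", text)]


terms = simple_analyze("The QUICK brown fox!")
print(terms)  # ['the', 'quick', 'brown', 'fox']
```

Each resulting term can be matched on its own, which is what makes a search for a single word inside a longer string possible.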

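text fields are best suited for unstructured but human-readable content. If you need to index unstructured machine-generated content, see Mapping unstructured content.

If you need to index structured content such as email addresses, hostnames, status codes, or tags, you should probably use a keyword field instead.

Below is an example of a mapping for a text field: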

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "full_name": {
                "type": "text"
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        full_name: {
          type: 'text'
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      full_name: {
        type: "text",
      },
    },
  },
});
console.log(response);

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "full_name": {
        "type":  "text"
      }
    }
  }
}

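Use a field as both text and keyword

Sometimes it is useful to have both a full-text (text) and a keyword (keyword) version of the same field: one for full-text search and the other for aggregations and sorting. This can be achieved with multi-fields.

Parameters for text fields

The following parameters are accepted by text fields: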

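`analyzer`	The analyzer that should be used for the `text` field, both at index time and at search time (unless overridden by `search_analyzer`). Defaults to the default index analyzer, or the `standard` analyzer.
`eager_global_ordinals`	Should global ordinals be loaded eagerly on refresh? Accepts `true` or `false` (default). Enabling this is a good idea on fields that are frequently used for (significant) terms aggregations.
`fielddata`	Can the field use in-memory fielddata for sorting, aggregations, or scripting? Accepts `true` or `false` (default).
`fielddata_frequency_filter`	Expert settings that control which values are loaded into memory when `fielddata` is enabled. By default, all values are loaded.
`fields`	Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers.
`index`	Should the field be searchable? Accepts `true` (default) or `false`.
`index_options`	What information should be stored in the index for search and highlighting purposes. Defaults to `positions`.
`index_prefixes`	If enabled, term prefixes of between 2 and 5 characters are indexed into a separate field. This allows prefix searches to run more efficiently, at the expense of a larger index.
`index_phrases`	If enabled, two-term word combinations (shingles) are indexed into a separate field. This allows exact phrase queries (no slop) to run more efficiently, at the expense of a larger index. Note that this works best when stopwords are not removed, as phrases containing stopwords will not use the subsidiary field and will fall back to a standard phrase query. Accepts `true` or `false` (default).
`norms`	Whether field length should be taken into account when scoring queries. Accepts `true` (default) or `false`.
`position_increment_gap`	The number of fake term positions that should be inserted between each element of an array of strings. Defaults to the `position_increment_gap` configured on the analyzer, which defaults to `100`. `100` was chosen because it prevents phrase queries with reasonably large slops (less than 100) from matching terms across field values.
`store`	Whether the field value should be stored and retrievable separately from the `_source` field. Accepts `true` or `false` (default).
`search_analyzer`	The `analyzer` that should be used at search time on the `text` field. Defaults to the `analyzer` setting.
`search_quote_analyzer`	The `analyzer` that should be used at search time when a phrase is encountered. Defaults to the `search_analyzer` setting.
`similarity`	Which scoring algorithm or similarity should be used. Defaults to `BM25`.
`term_vector`	Whether term vectors should be stored for the field. Defaults to `no`.
`meta`	Metadata about the field.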
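
As an illustration of how several of these parameters combine, the sketch below builds a create-index request body as a plain Python dictionary. The field name `title` and the chosen values are assumptions made for the example; `min_chars` and `max_chars` are the sub-options of `index_prefixes` and are not described in the list above.

```python
import json

# Hypothetical field "title"; the parameter names come from the list above.
text_field_mapping = {
    "type": "text",
    "analyzer": "standard",        # used at index time and, unless overridden, at search time
    "search_analyzer": "simple",   # overrides `analyzer` at search time
    "index_prefixes": {"min_chars": 2, "max_chars": 5},  # index 2- to 5-character prefixes
    "index_phrases": True,         # index two-term shingles for faster exact phrase queries
    "norms": False,                # ignore field length when scoring
}

create_index_body = {"mappings": {"properties": {"title": text_field_mapping}}}
print(json.dumps(create_index_body, indent=2))
```

The printed JSON has the same shape as the console examples in this page and could be sent as the body of a `PUT <index>` request.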

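Synthetic `_source`

Synthetic _source is Generally Available only for TSDB indices (indices that have index.mode set to time_series). For other indices, synthetic _source is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

text fields support synthetic _source if they have a keyword sub-field that supports synthetic _source, or if the text field sets store to true. Either way, the field may not have copy_to.

If a keyword sub-field is used, the values are sorted in the same way as a keyword field's values. By default, that means sorted with duplicates removed. So: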

resp = client.indices.create(
    index="idx",
    settings={
        "index": {
            "mapping": {
                "source": {
                    "mode": "synthetic"
                }
            }
        }
    },
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    },
)
print(resp)

resp1 = client.index(
    index="idx",
    id="1",
    document={
        "text": [
            "the quick brown fox",
            "the quick brown fox",
            "jumped over the lazy dog"
        ]
    },
)
print(resp1)

const response = await client.indices.create({
  index: "idx",
  settings: {
    index: {
      mapping: {
        source: {
          mode: "synthetic",
        },
      },
    },
  },
  mappings: {
    properties: {
      text: {
        type: "text",
        fields: {
          raw: {
            type: "keyword",
          },
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "idx",
  id: 1,
  document: {
    text: [
      "the quick brown fox",
      "the quick brown fox",
      "jumped over the lazy dog",
    ],
  },
});
console.log(response1);

PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}

Will become:

{
  "text": [
    "jumped over the lazy dog",
    "the quick brown fox"
  ]
}
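
The transformation shown above can be mimicked in a few lines of Python. This is only a model of the visible effect (duplicates dropped, remaining values sorted), not of how Elasticsearch rebuilds `_source` internally; the function name is hypothetical, and Python's string ordering is assumed to agree with keyword ordering for these ASCII values.

```python
def synthetic_keyword_values(values: list[str]) -> list[str]:
    """Model the visible effect only: drop duplicates, then sort what remains."""
    return sorted(set(values))


original = [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog",
]
print(synthetic_keyword_values(original))
# ['jumped over the lazy dog', 'the quick brown fox']
```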

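Reordering text fields can affect phrase and span queries. See the discussion of position_increment_gap for more detail. You can avoid this by making sure the slop parameter on phrase queries is lower than the position_increment_gap. This is the default.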
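
To see why a slop below position_increment_gap keeps phrases from matching across array elements, consider the simplified position model below. The exact position arithmetic is an assumption of this sketch (each term advances the position by one, and the gap is added after every array element), and the sample values and the function name `term_positions` are hypothetical.

```python
def term_positions(values: list[str], position_increment_gap: int = 100) -> dict[str, int]:
    """Assign a position to each term of a multi-valued field (simplified model)."""
    positions: dict[str, int] = {}
    position = -1
    for value in values:
        for term in value.lower().split():
            position += 1
            positions[term] = position  # assumes every term occurs only once
        # Leave a run of fake positions before the next array element.
        position += position_increment_gap
    return positions


positions = term_positions(["John Abraham", "Lincoln Smith"])
print(positions)  # {'john': 0, 'abraham': 1, 'lincoln': 102, 'smith': 103}

# Positions that would have to be skipped for "abraham lincoln" to line up as a phrase.
slop_needed = positions["lincoln"] - positions["abraham"] - 1
print(slop_needed)  # 100
```

Under this model, any slop lower than the gap is too small to bridge two array elements, so the order in which the elements are stored does not change which phrases match.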

If the text field sets store to true, then order and duplicates are preserved.

resp = client.indices.create(
    index="idx",
    settings={
        "index": {
            "mapping": {
                "source": {
                    "mode": "synthetic"
                }
            }
        }
    },
    mappings={
        "properties": {
            "text": {
                "type": "text",
                "store": True
            }
        }
    },
)
print(resp)

resp1 = client.index(
    index="idx",
    id="1",
    document={
        "text": [
            "the quick brown fox",
            "the quick brown fox",
            "jumped over the lazy dog"
        ]
    },
)
print(resp1)

const response = await client.indices.create({
  index: "idx",
  settings: {
    index: {
      mapping: {
        source: {
          mode: "synthetic",
        },
      },
    },
  },
  mappings: {
    properties: {
      text: {
        type: "text",
        store: true,
      },
    },
  },
});
console.log(response);

const response1 = await client.index({
  index: "idx",
  id: 1,
  document: {
    text: [
      "the quick brown fox",
      "the quick brown fox",
      "jumped over the lazy dog",
    ],
  },
});
console.log(response1);

PUT idx
{
  "settings": {
    "index": {
      "mapping": {
        "source": {
          "mode": "synthetic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": { "type": "text", "store": true }
    }
  }
}
PUT idx/_doc/1
{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}

Will become:

{
  "text": [
    "the quick brown fox",
    "the quick brown fox",
    "jumped over the lazy dog"
  ]
}

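`fielddata` mapping parameter

text fields are searchable by default, but by default they are not available for aggregations, sorting, or scripting. If you try to sort, aggregate, or access values from a text field using a script, you will see an exception indicating that field data is disabled by default on text fields. To load field data in memory, set fielddata=true on your field.

Loading field data in memory can consume significant memory.

Field data is the only way to access the analyzed tokens from a full-text field in aggregations, sorting, or scripting. For example, a full-text field value like New York would be analyzed as new and york. To aggregate on these tokens requires field data.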
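
The difference between aggregating on analyzed tokens and aggregating on whole values can be sketched in plain Python. Lowercasing and splitting on whitespace is only a rough stand-in for the real analyzer, and the function names and sample values are hypothetical.

```python
from collections import Counter


def token_buckets(values: list[str]) -> Counter:
    """Count documents per analyzed token, as an aggregation over field data would see them."""
    counts: Counter = Counter()
    for value in values:
        # set() so that a token is counted once per document.
        counts.update(set(value.lower().split()))
    return counts


def keyword_buckets(values: list[str]) -> Counter:
    """Count documents per whole, unanalyzed value, as a keyword field would see them."""
    return Counter(values)


cities = ["New York", "New Jersey", "New York"]
print(token_buckets(cities))    # new: 3, york: 2, jersey: 1
print(keyword_buckets(cities))  # New York: 2, New Jersey: 1
```

The token buckets are rarely what a user wants from a city field, which is one reason aggregations usually belong on a keyword field.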

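Before enabling fielddata

It usually doesn't make sense to enable field data on text fields. Field data is stored on the heap in the field data cache because it is expensive to compute. Computing field data can cause latency spikes, and increased heap usage is a cause of cluster performance issues.

Most users who want to do more with text fields use multi-field mappings, with both a text field for full-text search and an unanalyzed keyword field for aggregations, as follows: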

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "my_field": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        my_field: {
          type: 'text',
          fields: {
            keyword: {
              type: 'keyword'
            }
          }
        }
      }
    }
  }
)
puts response

res, err := es.Indices.Create(
	"my-index-000001",
	es.Indices.Create.WithBody(strings.NewReader(`{
	  "mappings": {
	    "properties": {
	      "my_field": {
	        "type": "text",
	        "fields": {
	          "keyword": {
	            "type": "keyword"
	          }
	        }
	      }
	    }
	  }
	}`)),
)
fmt.Println(res, err)

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      my_field: {
        type: "text",
        fields: {
          keyword: {
            type: "keyword",
          },
        },
      },
    },
  },
});
console.log(response);

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": { 
        "type": "text",
        "fields": {
          "keyword": { 
            "type": "keyword"
          }
        }
      }
    }
  }
}

	Use the `my_field` field for searches.
	Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
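
A search request that follows this division of labor might look like the sketch below, which builds the request body as a Python dictionary. The query text and the aggregation name `my_field_values` are hypothetical; the body is meant for the search API of the index created above.

```python
import json

search_body = {
    # Full-text search runs against the analyzed text field.
    "query": {"match": {"my_field": "quick brown fox"}},
    # Aggregations and sorting use the unanalyzed keyword sub-field.
    "aggs": {"my_field_values": {"terms": {"field": "my_field.keyword"}}},
    "sort": [{"my_field.keyword": "asc"}],
}
print(json.dumps(search_body, indent=2))
```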

Enabling fielddata on `text` fields

You can enable fielddata on an existing text field using the update mapping API as follows:

resp = client.indices.put_mapping(
    index="my-index-000001",
    properties={
        "my_field": {
            "type": "text",
            "fielddata": True
        }
    },
)
print(resp)

response = client.indices.put_mapping(
  index: 'my-index-000001',
  body: {
    properties: {
      my_field: {
        type: 'text',
        fielddata: true
      }
    }
  }
)
puts response

res, err := es.Indices.PutMapping(
	[]string{"my-index-000001"},
	strings.NewReader(`{
	  "properties": {
	    "my_field": {
	      "type": "text",
	      "fielddata": true
	    }
	  }
	}`),
)
fmt.Println(res, err)

const response = await client.indices.putMapping({
  index: "my-index-000001",
  properties: {
    my_field: {
      type: "text",
      fielddata: true,
    },
  },
});
console.log(response);

PUT my-index-000001/_mapping
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

The mapping that you specify for my_field should consist of the existing mapping for that field, plus the fielddata parameter.
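
One way to build such a body is to copy the existing field mapping and add the parameter to the copy. The sketch below assumes the existing mapping is already available as a dictionary (for example, read back from the get mapping API); the helper name `with_fielddata` is hypothetical.

```python
def with_fielddata(existing_field_mapping: dict) -> dict:
    """Return a copy of the field mapping with fielddata enabled, keeping existing parameters."""
    return {**existing_field_mapping, "fielddata": True}


# Assumed existing mapping for my_field: a text field with a keyword sub-field.
existing = {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
update_body = {"properties": {"my_field": with_fielddata(existing)}}
print(update_body)
```

Because the helper returns a new dictionary, the original mapping is left untouched and can be compared against the update before it is sent.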

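`fielddata_frequency_filter` mapping parameter

Field data filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:

The frequency filter allows you to load only terms whose document frequency falls between a min and a max value, each of which can be expressed either as an absolute number (when the number is bigger than 1.0) or as a percentage (for example, 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of documents that have a value for the field, as opposed to all documents in the segment.

Small segments can be excluded completely by specifying the minimum number of documents that the segment should contain with min_segment_size: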

resp = client.indices.create(
    index="my-index-000001",
    mappings={
        "properties": {
            "tag": {
                "type": "text",
                "fielddata": True,
                "fielddata_frequency_filter": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    mappings: {
      properties: {
        tag: {
          type: 'text',
          fielddata: true,
          fielddata_frequency_filter: {
            min: 0.001,
            max: 0.1,
            min_segment_size: 500
          }
        }
      }
    }
  }
)
puts response

res, err := es.Indices.Create(
	"my-index-000001",
	es.Indices.Create.WithBody(strings.NewReader(`{
	  "mappings": {
	    "properties": {
	      "tag": {
	        "type": "text",
	        "fielddata": true,
	        "fielddata_frequency_filter": {
	          "min": 0.001,
	          "max": 0.1,
	          "min_segment_size": 500
	        }
	      }
	    }
	  }
	}`)),
)
fmt.Println(res, err)

const response = await client.indices.create({
  index: "my-index-000001",
  mappings: {
    properties: {
      tag: {
        type: "text",
        fielddata: true,
        fielddata_frequency_filter: {
          min: 0.001,
          max: 0.1,
          min_segment_size: 500,
        },
      },
    },
  },
});
console.log(response);

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 0.001,
          "max": 0.1,
          "min_segment_size": 500
        }
      }
    }
  }
}
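
The min and max rules can be sketched for a single segment as follows. This is a simplified model, not the Elasticsearch implementation: treating the bounds as inclusive is an assumption, min_segment_size is left out, and the function names and sample document frequencies are hypothetical.

```python
def resolve_threshold(value: float, docs_with_value: int) -> float:
    """Values above 1.0 are absolute counts; values up to 1.0 are fractions of docs with a value."""
    return value if value > 1.0 else value * docs_with_value


def filter_terms(doc_freqs: dict[str, int], docs_with_value: int,
                 min_value: float, max_value: float) -> list[str]:
    """Return the terms whose document frequency lies between the resolved bounds."""
    low = resolve_threshold(min_value, docs_with_value)
    high = resolve_threshold(max_value, docs_with_value)
    # Inclusive bounds are an assumption of this sketch.
    return sorted(term for term, df in doc_freqs.items() if low <= df <= high)


# Hypothetical per-segment document frequencies; 10,000 docs have a value for the field.
segment_doc_freqs = {"the": 9500, "elasticsearch": 250, "typo": 2}
loaded = filter_terms(segment_doc_freqs, docs_with_value=10_000, min_value=0.001, max_value=0.1)
print(loaded)  # ['elasticsearch']
```

With min 0.001 and max 0.1 over 10,000 documents, the bounds resolve to roughly 10 and 1,000 documents, so both the very rare and the very common term are left out of memory.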

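Match-only text field type

A variant of text that trades scoring and efficiency of positional queries for space efficiency. This field effectively stores data the same way as a text field that indexes only documents (index_options: docs) and disables norms (norms: false). Term queries perform as fast as, if not faster than, on text fields. However, queries that need positions, such as the match_phrase query, perform slower because they need to look at the _source document to verify whether a phrase matches. All queries return constant scores equal to 1.0.

Analysis is not configurable: text is always analyzed with the default analyzer (standard by default).

Span queries are not supported with this field. Use interval queries instead, or the text field type if you absolutely need span queries.

Other than that, match_only_text supports the same queries as text. Like text, it does not support sorting and has only limited support for aggregations.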

resp = client.indices.create(
    index="logs",
    mappings={
        "properties": {
            "@timestamp": {
                "type": "date"
            },
            "message": {
                "type": "match_only_text"
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'logs',
  body: {
    mappings: {
      properties: {
        "@timestamp": {
          type: 'date'
        },
        message: {
          type: 'match_only_text'
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "logs",
  mappings: {
    properties: {
      "@timestamp": {
        type: "date",
      },
      message: {
        type: "match_only_text",
      },
    },
  },
});
console.log(response);

PUT logs
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "match_only_text"
      }
    }
  }
}
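
Since span queries are unavailable on this field type, a positional search on the `message` field above would use an intervals query. The sketch below builds such a request body as a Python dictionary; the query text and the `ordered` and `max_gaps` values are illustrative assumptions.

```python
import json

intervals_search_body = {
    "query": {
        "intervals": {
            "message": {
                "match": {
                    "query": "connection timed out",  # hypothetical log phrase
                    "ordered": True,   # terms must appear in this order
                    "max_gaps": 0,     # with no other terms in between
                }
            }
        }
    }
}
print(json.dumps(intervals_search_body, indent=2))
```

As described above, queries like this one need positions, so on a match_only_text field they are verified against `_source` and can be slower than on a text field.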

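Parameters for match-only text fields

The following mapping parameters are accepted:

`fields`	Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers.
`meta`	Metadata about the field.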
