Regular expression syntax

A regular expression is a way to match patterns in data using placeholder characters, called operators.

Elasticsearch supports regular expressions in the following queries: regexp and query_string.

Elasticsearch uses Apache Lucene's regular expression engine to parse these queries.

Reserved characters

Lucene's regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:

. ? + * | { } [ ] ( ) " \

Depending on the optional operators enabled, the following characters may also be reserved:

# @ & < >  ~

To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:

\@                  # renders as a literal '@'
\\                  # renders as a literal '\'
"john@smith.com"    # renders as 'john@smith.com'

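The backslash is an escape character in both JSON strings and regular expressions. You need to escape both backslashes in a query, unless you use a language client, which takes care of this. For example, the string a\b needs to be indexed as "a\\b":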

resp = client.index(
    index="my-index-000001",
    id="1",
    document={
        "my_field": "a\\b"
    },
)
print(resp)

response = client.index(
  index: 'my-index-000001',
  id: 1,
  body: {
    my_field: 'a\\b'
  }
)
puts response

const response = await client.index({
  index: "my-index-000001",
  id: 1,
  document: {
    my_field: "a\\b",
  },
});
console.log(response);

PUT my-index-000001/_doc/1
{
  "my_field": "a\\b"
}

This document matches the following regexp query:

resp = client.search(
    index="my-index-000001",
    query={
        "regexp": {
            "my_field.keyword": "a\\\\.*"
        }
    },
)
print(resp)

response = client.search(
  index: 'my-index-000001',
  body: {
    query: {
      regexp: {
        'my_field.keyword' => 'a\\\\.*'
      }
    }
  }
)
puts response

const response = await client.search({
  index: "my-index-000001",
  query: {
    regexp: {
      "my_field.keyword": "a\\\\.*",
    },
  },
});
console.log(response);

GET my-index-000001/_search
{
  "query": {
    "regexp": {
      "my_field.keyword": "a\\\\.*"
    }
  }
}
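
The two escaping layers described above can be sketched in Python. This is only an illustration: `escape_literal` is a hypothetical helper, not part of any Elasticsearch client, and it covers only the always-reserved characters (the optional ones, # @ & < > ~, would need to be added when those operators are enabled). `json.dumps` stands in for the JSON encoding that a language client performs.

```python
import json

# Characters the text above lists as always reserved.
RESERVED = set('.?+*|{}[]()"\\')


def escape_literal(text: str) -> str:
    """Backslash-escape reserved characters (hypothetical helper)."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


# Layer 1: regular expression escaping. The literal prefix a\ becomes a\\
pattern = escape_literal("a\\") + ".*"

# Layer 2: JSON escaping, which a language client applies for you.
body = json.dumps({"query": {"regexp": {"my_field.keyword": pattern}}})

print(pattern)  # a\\.*
print(body)     # {"query": {"regexp": {"my_field.keyword": "a\\\\.*"}}}
```

The second printed line reproduces the four backslashes seen in the console request, which is why the client examples and the raw JSON request describe the same query.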

Standard operators

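Lucene's regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.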

.

Matches any character. For example:

ab.     # matches 'aba', 'abb', 'abz', etc.

?

Repeats the preceding character zero or one times. Often used to make the preceding character optional. For example:

abc?     # matches 'ab' and 'abc'

+

Repeats the preceding character one or more times. For example:

ab+     # matches 'ab', 'abb', 'abbb', etc.

*

Repeats the preceding character zero or more times. For example:

ab*     # matches 'a', 'ab', 'abb', 'abbb', etc.

{}

Minimum and maximum number of times the preceding character can repeat. For example:

a{2}    # matches 'aa'
a{2,4}  # matches 'aa', 'aaa', and 'aaaa'
a{2,}   # matches 'a' repeated two or more times

|

OR operator. The match succeeds if the longest pattern on either the left side OR the right side matches. For example:

abc|xyz  # matches 'abc' and 'xyz'

( … )

Forms a group. You can use a group to treat part of the expression as a single character. For example:

abc(def)?  # matches 'abc' and 'abcdef' but not 'abcd'

[ … ]

Matches one of the characters in the brackets. For example:

[abc]   # matches 'a', 'b', 'c'

Inside the brackets, - indicates a range unless - is the first character or escaped. For example:

[a-c]   # matches 'a', 'b', or 'c'
[-abc]  # '-' is first character. Matches '-', 'a', 'b', or 'c'
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'

A ^ before a character in the brackets negates the character or range. For example:

[^abc]      # matches any character except 'a', 'b', or 'c'
[^a-c]      # matches any character except 'a', 'b', or 'c'
[^-abc]     # matches any character except '-', 'a', 'b', or 'c'
[^abc\-]    # matches any character except 'a', 'b', 'c', or '-'
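
The standard operators can be tried out locally with a rough stand-in. The sketch below uses Python's `re` module, which is a different engine from Lucene's; it is used here only because the standard operators listed above behave the same way in both for these particular examples, provided the whole string is matched with `re.fullmatch`. The `matches` helper and the `EXAMPLES` table are illustrative names, not part of Elasticsearch.

```python
import re

# (pattern, strings that should match, strings that should not)
EXAMPLES = [
    (r"ab.", ["aba", "abz"], ["ab", "abcd"]),
    (r"abc?", ["ab", "abc"], ["abcc"]),
    (r"ab+", ["ab", "abbb"], ["a"]),
    (r"ab*", ["a", "abb"], ["b"]),
    (r"a{2,4}", ["aa", "aaaa"], ["a", "aaaaa"]),
    (r"abc|xyz", ["abc", "xyz"], ["abcxyz"]),
    (r"abc(def)?", ["abc", "abcdef"], ["abcd"]),
    (r"[^a-c]", ["d", "-"], ["a", "ab"]),
]


def matches(pattern: str, term: str) -> bool:
    """Whole-string match, mirroring how Lucene matches a complete term."""
    return re.fullmatch(pattern, term) is not None


for pattern, should_match, should_not in EXAMPLES:
    assert all(matches(pattern, s) for s in should_match)
    assert not any(matches(pattern, s) for s in should_not)
```

Results for patterns that use the optional operators, or for engine-specific features such as PCRE lookarounds, should not be inferred from this stand-in.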

Optional operators

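You can use the flags parameter to enable more optional operators for Lucene's regular expression engine.

To enable multiple operators, use a | separator. For example, a flags value of COMPLEMENT|INTERVAL enables the COMPLEMENT and INTERVAL operators.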
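
As a minimal sketch of how such a flags value might be assembled, the hypothetical helper below builds a regexp query body as a plain Python dictionary. It assumes the expanded form of the regexp query, where the pattern is passed as `value` next to `flags`; the field name and pattern are placeholders.

```python
def regexp_query(field: str, pattern: str, operators: list[str]) -> dict:
    """Build a regexp query body; `regexp_query` is a hypothetical helper."""
    return {
        "query": {
            "regexp": {
                field: {
                    "value": pattern,
                    # Operators are joined with the | separator.
                    "flags": "|".join(operators),
                }
            }
        }
    }


body = regexp_query("my_field.keyword", "foo<1-100>", ["COMPLEMENT", "INTERVAL"])
print(body["query"]["regexp"]["my_field.keyword"]["flags"])  # COMPLEMENT|INTERVAL
```

Note that an empty operator list produces an empty string, which the list of valid values treats as an alias for ALL, so a caller that wants no optional operators should pass NONE explicitly.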

Valid values

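ALL (Default)

Enables all optional operators.

"" (empty string)

Alias for the ALL value.

COMPLEMENT

Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example: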

a~bc   # matches 'adc' and 'aec' but not 'abc'

EMPTY

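Enables the # (empty language) operator. The # operator doesn't match any string, not even an empty string.

If you create regular expressions by programmatically combining values, you can pass # to specify "no string". This lets you avoid accidentally matching empty strings or other unwanted strings. For example: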

#|abc  # matches 'abc' but nothing else, not even an empty string
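
The programmatic case can be sketched as follows. `any_of` is a hypothetical helper that joins alternatives with |; it assumes the alternatives are already valid, escaped patterns and that the EMPTY operator is enabled for the query that uses the result.

```python
def any_of(alternatives: list[str]) -> str:
    """Join already-escaped alternatives with '|'; use '#' when there are none."""
    if not alternatives:
        # '#' matches no string at all, not even the empty string.
        return "#"
    return "|".join(f"({alt})" for alt in alternatives)


print(any_of(["abc", "xyz"]))  # (abc)|(xyz)
print(any_of([]))              # #
```

Without the special case, an empty list would produce an empty pattern, which is the accidental empty-string match that the # operator is meant to avoid.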

INTERVAL

Enables the <> operator. You can use <> to match a numeric range. For example:

foo<1-100>      # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
foo<01-100>     # matches 'foo01', 'foo02' ... 'foo99', 'foo100'

INTERSECTION

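Enables the & operator, which acts as an AND operator. The match succeeds if the patterns on both the left side AND the right side match. For example: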

aaa.+&.+bbb  # matches 'aaabbb'

ANYSTRING

Enables the @ operator. You can use @ to match any entire string.

You can combine the @ operator with the & and ~ operators to create "everything except" logic. For example:

@&~(abc.+)  # matches everything except terms beginning with 'abc'
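
The logic behind the & and ~ examples can be modelled with ordinary boolean operations. The sketch below is not Lucene syntax: it uses Python's `re.fullmatch` as a stand-in for whole-string matching, and the function names are illustrative only.

```python
import re


def full(pattern: str, term: str) -> bool:
    """Whole-string match; DOTALL lets '.' stand for any character."""
    return re.fullmatch(pattern, term, re.DOTALL) is not None


def intersection(left: str, right: str, term: str) -> bool:
    """Models left&right: both patterns must match the whole term."""
    return full(left, term) and full(right, term)


def complement(pattern: str, term: str) -> bool:
    """Models ~(pattern): true when the pattern does not match the term."""
    return not full(pattern, term)


def everything_except_abc_prefix(term: str) -> bool:
    """Models @&~(abc.+): any string, and not one matching abc.+"""
    return full(r".*", term) and complement(r"abc.+", term)


print(intersection(r"aaa.+", r".+bbb", "aaabbb"))  # True
print(everything_except_abc_prefix("abcd"))        # False
print(everything_except_abc_prefix("abc"))         # True
```

The last line shows a detail that is easy to miss: because abc.+ requires at least one character after 'abc', the bare term 'abc' is not excluded.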

NONE

Disables all optional operators.

Unsupported operators

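Lucene's regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line). To match a term, the regular expression must match the entire string.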
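
The practical effect of whole-string matching can be illustrated with a small stand-in. The sketch below uses Python's `re.fullmatch` to model the implicit anchoring; `lucene_style_match` is a hypothetical name, and Python's engine is not Lucene's.

```python
import re


def lucene_style_match(pattern: str, term: str) -> bool:
    """Model Lucene's implicit anchoring with a whole-string match."""
    return re.fullmatch(pattern, term, re.DOTALL) is not None


print(lucene_style_match(r"bar", "foobarbaz"))      # False: not the whole term
print(lucene_style_match(r".*bar.*", "foobarbaz"))  # True: 'contains' needs .* on both sides
print(lucene_style_match(r"foo.*", "foobarbaz"))    # True: prefix match without ^
```

Under this model, a "contains" search is written with .* on both sides, and prefix or suffix matches need .* on one side only.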
