Troubleshooting remote clusters
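You may encounter several issues when setting up a remote cluster for cross-cluster replication or cross-cluster search.
General troubleshooting
Checking whether a remote cluster has connected successfully
A successful call to the cluster settings update API for adding or updating a remote cluster does not necessarily mean the configuration works. Use the remote cluster info API to verify that the local cluster is successfully connected to the remote cluster.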
Python:
resp = client.cluster.remote_info()
print(resp)
Ruby:
response = client.cluster.remote_info
puts response
JavaScript:
const response = await client.cluster.remoteInfo();
console.log(response);
Console:
GET /_remote/info
The API should return "connected" : true. When using API key authentication, it should also return "cluster_credentials": "::es_redacted::".
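The check above can be scripted. The following is a minimal sketch rather than part of any official client: the helper name unconnected_remotes is hypothetical, and it assumes the response is available as a plain dict keyed by remote cluster alias, as in the remote cluster info API output.

```python
def unconnected_remotes(remote_info):
    """Return the aliases of remote clusters that do not report "connected": true."""
    return sorted(
        alias
        for alias, details in remote_info.items()
        if details.get("connected") is not True
    )


# Hand-written sample shaped like a GET /_remote/info response (illustrative values).
sample = {
    "cluster_one": {"connected": True, "mode": "sniff"},
    "cluster_two": {"connected": False, "mode": "proxy"},
}
print(unconnected_remotes(sample))
```

An empty result means every configured remote cluster reports a working connection. It says nothing about authentication or privileges, which are covered later on this page.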
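Enabling the remote cluster server
When using API key authentication, cross-cluster traffic happens on the remote cluster interface instead of the transport interface. The remote cluster interface is not enabled by default, so a node is ready to send outgoing cross-cluster requests but not to accept incoming ones. Ensure the remote cluster server is enabled on every node of the remote cluster. In elasticsearch.yml:
- Set remote_cluster_server.enabled to true.
- Configure the bind and publish address for remote cluster server traffic, for example using remote_cluster.host. Without a configured address, remote cluster traffic may be bound to the local interface, and remote clusters running on other machines can't connect.
- Optionally, configure the remote cluster server port using remote_cluster.port (defaults to 9443).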
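As a rough illustration, the checklist above can be applied to a node's settings once they are loaded into a flat dict. This is a sketch under assumptions: the function names are hypothetical, the dict is assumed to use the dotted setting names from elasticsearch.yml as keys, and a missing remote_cluster.host is only a hint, because the address may be configured through other settings.

```python
def remote_cluster_server_problems(settings):
    """Check a flat dict of elasticsearch.yml settings against the checklist."""
    problems = []
    # The remote cluster server is disabled unless explicitly enabled.
    if str(settings.get("remote_cluster_server.enabled", "false")).lower() != "true":
        problems.append("remote_cluster_server.enabled is not set to true")
    # Without an explicit address, traffic may be bound to the local interface.
    if not settings.get("remote_cluster.host"):
        problems.append("remote_cluster.host is not set")
    return problems


def remote_cluster_port(settings):
    """Return the configured remote cluster server port, or the default 9443."""
    return int(settings.get("remote_cluster.port", 9443))


# Illustrative settings for one node of the remote cluster.
node_settings = {"remote_cluster_server.enabled": True, "remote_cluster.port": 9444}
print(remote_cluster_server_problems(node_settings))
print(remote_cluster_port(node_settings))
```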
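Common issues
The following issues are listed in the order they may occur while setting up a remote cluster.
Remote cluster not reachable
Symptom
A local cluster may be unable to reach a remote cluster for many reasons. For example, the remote cluster server may not be enabled, an incorrect host or port may be configured, or a firewall may be blocking traffic. When a remote cluster is not reachable, check the logs of the local cluster for a connect_exception.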
When the remote cluster is configured using proxy mode:
[2023-06-28T16:36:47,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] connect_exception
When the remote cluster is configured using sniff mode:
[2023-06-28T16:38:37,731][WARN ][o.e.t.SniffConnectionStrategy] [local-node] fetching nodes from external cluster [my] failed
org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] connect_exception
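To separate network problems from Elasticsearch configuration problems, it can help to test plain TCP reachability from a local cluster node. The sketch below uses only the Python standard library; the function name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the address in the example logs above, the call would be can_connect("192.168.0.42", 9443). A True result only shows that something accepts connections on that port. It does not verify TLS trust or authentication.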
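Remote cluster connection is unreliable
Symptom
The local cluster can connect to the remote cluster, but the connection does not work reliably. For example, some cross-cluster requests may succeed while others report connection errors, time out, or appear to be stuck waiting for the remote cluster to respond.
When Elasticsearch detects that the remote cluster connection is not working, it reports the following message in its logs: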
[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
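This message is also logged if the remote cluster node to which Elasticsearch is connected is shut down or restarted.
Note that with some network configurations it can take minutes or hours for the operating system to detect that a connection has stopped working. Until the failure is detected and reported to Elasticsearch, requests involving the remote cluster may time out or appear to be stuck.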
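TLS trust not established
TLS can be misconfigured on the local or the remote cluster. The result is that the local cluster does not trust the certificate presented by the remote cluster.
Symptom
The local cluster logs "failed to establish trust with server":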
[2023-06-29T09:40:55,465][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [192.168.0.42]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-08-16T23:40:55.464275Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([(shared) (with trust configuration: JDK-trusted-certs)]) is not configured to trust that issuer but trusts [97] other issuers
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs "client did not trust this server's certificate":
[2023-06-29T09:40:55,478][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:57305, profile=_remote_cluster}
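Resolution
Read the warning log message on the local cluster carefully to determine the exact cause of the failure. For example:
- Is the remote cluster certificate not signed by a trusted CA? This is the most likely cause.
- Is hostname verification failing?
- Is the certificate expired?
Once you know the cause, you should be able to fix it by adjusting the remote cluster related SSL settings on either the local cluster or the remote cluster.
Often, the issue is on the local cluster. For example, fix it by configuring the necessary trusted CAs (xpack.security.remote_cluster_client.ssl.certificate_authorities).
If you change the elasticsearch.yml file, the associated cluster needs to be restarted for the changes to take effect.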
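The DiagnosticTrustManager warning is long, so a small parser can help answer the first and third questions above. This is a sketch that relies only on phrases visible in the sample log message on this page. The function names are hypothetical, and hostname verification failures are not covered because no sample of that message appears here.

```python
import re
from datetime import datetime


def _parse_utc(text):
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def summarize_trust_failure(message):
    """Extract the issuer and validity findings from a trust failure warning."""
    summary = {}
    issuer = re.search(r"the certificate is issued by \[([^\]]+)\]", message)
    if issuer:
        summary["issuer"] = issuer.group(1)
    # This phrase indicates the issuing CA is missing from the trust configuration.
    summary["untrusted_issuer"] = "is not configured to trust that issuer" in message
    dates = re.search(
        r"valid between \[([^\]]+)\] and \[([^\]]+)\] \(current time is \[([^\]]+)\]",
        message,
    )
    if dates:
        not_before, not_after, now = (_parse_utc(d) for d in dates.groups())
        summary["expired"] = now > not_after
        summary["not_yet_valid"] = now < not_before
    return summary


# Abbreviated from the sample warning shown earlier in this section.
sample_message = (
    "the certificate is valid between [2023-01-29T12:08:37Z] and "
    "[2032-08-29T12:08:37Z] (current time is [2023-08-16T23:40:55.464275Z], "
    "certificate dates are valid); the certificate is issued by "
    "[CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of "
    "the issuing certificate in the certificate chain; this ssl context "
    "([(shared) (with trust configuration: JDK-trusted-certs)]) "
    "is not configured to trust that issuer"
)
print(summarize_trust_failure(sample_message))
```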
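API key authentication issues
Connecting to the transport port when using API key authentication
When using API key authentication, a local cluster should connect to the remote cluster's remote cluster server port (defaults to 9443) instead of the transport port (defaults to 9300). A misconfiguration can lead to several symptoms:
Symptom 1
It is recommended to use different CAs and certificates for the transport interface and the remote cluster server interface. If this recommendation is followed, a remote cluster client node does not trust the server certificate presented by the remote cluster on the transport interface.
The local cluster logs "failed to establish trust with server":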
[2023-06-28T12:48:46,575][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [192.168.0.42]; the server provided a certificate with subject name [CN=transport], fingerprint [c43e628be2a8aaaa4092b82d78f2bc206c492322], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:05:53Z] and [2032-08-29T12:05:53Z] (current time is [2023-06-28T02:48:46.574738Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto Transport CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.remote_cluster_client.ssl (with trust configuration: PEM-trust{/rcs2/ssl/remote-cluster-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto RemoteCluster CA] with fingerprint [ba2350661f66e46c746c1629f0c4b645a2587ff4]
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs "client did not trust this server's certificate":
[2023-06-28T12:48:46,584][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60810, profile=default}
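Symptom 2
The CA and certificate can be shared between the transport and remote cluster server interfaces. Because a remote cluster client does not have a client certificate by default, the server fails to verify the client certificate.
The local cluster logs "Received fatal alert: bad_certificate":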
[2023-06-28T12:43:30,705][WARN ][o.e.t.TcpTransport ] [local-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.84:60738, remoteAddress=/192.168.0.42:9309, profile=_remote_cluster}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
The remote cluster logs "Empty client certificate chain":
[2023-06-28T12:43:30,772][WARN ][o.e.t.TcpTransport ] [remote-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60783, profile=default}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Empty client certificate chain
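Symptom 3
If the remote cluster client is configured for mTLS and provides a valid client certificate, the connection fails because the client does not send the expected authentication header.
The local cluster logs "missing authentication":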
[2023-06-28T13:04:52,710][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: missing authentication credentials for action [cluster:internal/remote_cluster/handshake]
This does not show up in the logs of the remote cluster.
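Symptom 4
If anonymous access is enabled on the remote cluster and it does not require authentication, the local cluster may log one of the following, depending on the privileges of the anonymous user.
If the anonymous user does not have the privileges needed to make a connection, the local cluster logs "unauthorized":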
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: action [cluster:internal/remote_cluster/handshake] is unauthorized for user [anonymous_foo] with effective roles [reporting_user], this action is granted by the cluster privileges [cross_cluster_search,cross_cluster_replication,manage,all]
If the anonymous user has the necessary privileges, for example because it is a superuser, the local cluster logs "requires channel profile to be [_remote_cluster], but got [default]":
[2023-06-28T13:09:52,031][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: java.lang.IllegalArgumentException: remote cluster handshake action requires channel profile to be [_remote_cluster], but got [default]
Resolution
Check the port number and ensure you are connecting to the remote cluster server instead of the transport interface.
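The port check can be made explicit with a small helper. This is a sketch only: the function name is hypothetical, and the default ports 9443 and 9300 are the defaults named on this page, so pass your own values if either interface uses a different port.

```python
def check_remote_address(address, remote_cluster_port=9443, transport_port=9300):
    """Return a warning if address points at the wrong port, else None."""
    # rpartition splits on the last colon, so "[::1]:9443" also works.
    host, _, port_text = address.rpartition(":")
    if not host or not port_text.isdigit():
        return f"cannot parse a port from {address!r}"
    port = int(port_text)
    if port == transport_port:
        return (
            f"{address} uses the transport port {transport_port}; "
            f"API key authentication expects the remote cluster server port "
            f"({remote_cluster_port})"
        )
    if port != remote_cluster_port:
        return f"{address} uses port {port}, expected {remote_cluster_port}"
    return None


print(check_remote_address("192.168.0.42:9300"))
```

Run it against each configured seed or proxy address of the remote cluster. A return value of None means the address uses the expected remote cluster server port.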
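Connecting without a cross-cluster API key
A local cluster uses the presence of a cross-cluster API key to determine the model with which it connects to a remote cluster. If a cross-cluster API key is present, it uses API key based authentication. Otherwise, it uses certificate based authentication. You can check which model is in use with the remote cluster info API on the local cluster: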
Python:
resp = client.cluster.remote_info()
print(resp)
Ruby:
response = client.cluster.remote_info
puts response
JavaScript:
const response = await client.cluster.remoteInfo();
console.log(response);
Console:
GET /_remote/info
The API should return "connected" : true. When using API key authentication, it should also return "cluster_credentials": "::es_redacted::".
{
  "cluster_one" : {
    "seeds" : [
      "127.0.0.1:9443"
    ],
    "connected" : true,
    "num_nodes_connected" : 1,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false,
    "cluster_credentials": "::es_redacted::",
    "mode" : "sniff"
  }
}
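Reading the authentication model off this response can be automated. The sketch below assumes, as described above, that cluster_credentials is present only when API key based authentication is in use. The function name authentication_model is hypothetical, and the input is a plain dict shaped like the response.

```python
def authentication_model(remote_info):
    """Map each remote cluster alias to the authentication model in use."""
    return {
        alias: "api_key" if "cluster_credentials" in details else "certificate"
        for alias, details in remote_info.items()
    }


# Same content as the example response above.
example_response = {
    "cluster_one": {
        "seeds": ["127.0.0.1:9443"],
        "connected": True,
        "num_nodes_connected": 1,
        "max_connections_per_cluster": 3,
        "initial_connect_timeout": "30s",
        "skip_unavailable": False,
        "cluster_credentials": "::es_redacted::",
        "mode": "sniff",
    }
}
print(authentication_model(example_response))
```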
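Besides checking the response of the remote cluster info API, you can also check the logs.
Symptom 1
If no cross-cluster API key is used, the local cluster uses the certificate based authentication method and connects to the remote cluster using the TLS configuration of the transport interface. If the remote cluster has different TLS CAs and certificates for its transport interface and its remote cluster server interface (which is the recommendation), TLS verification fails.
The local cluster logs "failed to establish trust with server":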
[2023-06-28T12:51:06,452][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-06-28T02:51:06.451581Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.transport.ssl (with trust configuration: PEM-trust{/rcs2/ssl/transport-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto Transport CA] with fingerprint [bbe49e3f986506008a70ab651b188c70df104812]
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs "client did not trust this server's certificate":
[2023-06-28T12:52:16,914][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:60981, profile=_remote_cluster}
Symptom 2
Even when TLS verification is not an issue, the connection fails due to missing credentials.
The local cluster logs "Please ensure you have configured remote cluster credentials":
Caused by: java.lang.IllegalArgumentException: Cross cluster requests through the dedicated remote cluster server port require transport header [_cross_cluster_access_credentials] but none found. Please ensure you have configured remote cluster credentials on the cluster originating the request.
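This does not show up in the logs of the remote cluster.
Resolution
Add the cross-cluster API key to the Elasticsearch keystore on every node of the local cluster. Use the nodes reload secure settings API to reload the keystore.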
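After reloading the keystore, it is worth confirming that every node accepted the reload. The sketch below inspects a reload response passed in as a plain dict. It assumes that the response lists nodes under a nodes key and that a node that failed to reload carries a reload_exception entry. That response shape and the function name failed_reloads are assumptions not stated on this page, so verify them against the API reference for your version.

```python
def failed_reloads(reload_response):
    """Return {node name: error} for nodes that report a reload_exception."""
    failures = {}
    for node_id, node in reload_response.get("nodes", {}).items():
        if "reload_exception" in node:
            # Fall back to the node id when no name is reported.
            failures[node.get("name", node_id)] = node["reload_exception"]
    return failures


# Hand-written sample; the node ids, names, and error text are illustrative only.
sample_reload = {
    "nodes": {
        "node-id-1": {"name": "local-node-1"},
        "node-id-2": {
            "name": "local-node-2",
            "reload_exception": {"reason": "illustrative failure"},
        },
    }
}
print(failed_reloads(sample_reload))
```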
Using the wrong API key type
API key based authentication requires a cross-cluster API key. It does not work with REST API keys.
Symptom
The local cluster logs "authentication expected API key type of [cross_cluster]":
[2023-06-28T13:26:53,962][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: authentication expected API key type of [cross_cluster], but API key [agZXJocBmA2beJfq2yKu] has type [rest]
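This does not show up in the logs of the remote cluster.
Resolution
Ask the remote cluster administrator to create and distribute a cross-cluster API key. On every node of the local cluster, replace the existing API key in the Elasticsearch keystore with this cross-cluster API key. Use the nodes reload secure settings API to reload the keystore.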
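Invalid API key
A cross-cluster API key can fail to authenticate, for example when its credentials are incorrect, or when it has been invalidated or has expired.
Symptom
The local cluster logs "unable to authenticate":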
[2023-06-28T13:22:58,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: unable to authenticate user [agZXJocBmA2beJfq2yKu] for action [cluster:internal/remote_cluster/handshake]
The remote cluster logs "Authentication using apikey failed":
[2023-06-28T13:24:38,744][WARN ][o.e.x.s.a.ApiKeyAuthenticator] [remote-node] Authentication using apikey failed - invalid credentials for API key [agZXJocBmA2beJfq2yKu]
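Resolution
Ask the remote cluster administrator to create and distribute a cross-cluster API key. On every node of the local cluster, replace the existing API key in the Elasticsearch keystore with this cross-cluster API key. Use the nodes reload secure settings API to reload the keystore.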
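API key or local user has insufficient privileges
The effective permission for a local user running requests on a remote cluster is determined by the intersection of the cross-cluster API key's privileges and the local user's remote_indices privileges.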
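The intersection can be pictured with plain sets. This is a deliberately simplified model and not how Elasticsearch evaluates privileges internally: it ignores index patterns and privileges that imply other privileges, and the function name and sample data are hypothetical.

```python
def effective_privileges(api_key_privileges, user_remote_privileges):
    """Intersect two {index name: set of privilege names} mappings."""
    effective = {}
    # Only indices granted on both sides can carry any effective privilege.
    for index in api_key_privileges.keys() & user_remote_privileges.keys():
        shared = api_key_privileges[index] & user_remote_privileges[index]
        if shared:
            effective[index] = shared
    return effective


# The API key allows more indices than the user; the user allows more actions.
api_key = {"cd": {"read"}, "logs": {"read"}}
local_user = {"cd": {"read", "view_index_metadata"}}
print(effective_privileges(api_key, local_user))
```

In this sample the user can only read the index cd on the remote cluster: the API key does not grant view_index_metadata, and the user has no remote_indices privileges for logs.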
Symptom
Request failures due to insufficient privileges result in API responses like:
{
"type": "security_exception",
"reason": "action [indices:data/read/search] towards remote cluster is unauthorized for user [foo] with assigned roles [foo-role] authenticated by API key id [agZXJocBmA2beJfq2yKu] of user [elastic-admin] on indices [cd], this action is granted by the index privileges [read,all]"
}
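This does not show up in any logs.
Resolution
- Check that the local user has the necessary remote_indices or remote_cluster privileges. Grant sufficient remote_indices or remote_cluster privileges if necessary.
- If the local privileges are not the issue, ask the remote cluster administrator to create and distribute a cross-cluster API key. On every node of the local cluster, replace the existing API key in the Elasticsearch keystore with this cross-cluster API key. Use the nodes reload secure settings API to reload the keystore.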
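Local user has no remote_indices privileges
This is a special case of insufficient privileges. In this scenario, the local user has no remote_indices privileges at all for the target remote cluster. Elasticsearch can detect this and issue a more explicit error response.
Symptom
This results in API responses like: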
{
"type": "security_exception",
"reason": "action [indices:data/read/search] towards remote cluster [my] is unauthorized for user [foo] with effective roles [] (assigned roles [foo-role] were not found) because no remote indices privileges apply for the target cluster"
}
Resolution
Grant sufficient remote_indices privileges to the local user.
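As an illustration, a role definition that grants remote_indices privileges might be built as follows. This is a sketch: the helper name is hypothetical, the cluster alias my and index cd are taken from the example error messages above, and the clusters, names, and privileges field layout should be checked against the role API reference for your version.

```python
import json


def remote_indices_role(cluster_alias, index_names, privileges):
    """Build a role body that grants index privileges on one remote cluster."""
    return {
        "remote_indices": [
            {
                "clusters": [cluster_alias],
                "names": list(index_names),
                "privileges": list(privileges),
            }
        ]
    }


# Grants read on index "cd" of the remote cluster registered as "my".
role_body = remote_indices_role("my", ["cd"], ["read"])
print(json.dumps(role_body, indent=2))
```

The resulting body would be assigned to a role held by the local user, such as foo-role in the example errors, on the local cluster.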