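Troubleshooting remote clusters
You may encounter several issues when setting up a remote cluster for cross-cluster replication or cross-cluster search.
General troubleshooting
Checking whether a remote cluster has connected successfully
A successful call to the cluster settings update API for adding or updating remote clusters does not necessarily mean the configuration is successful. Use the remote cluster info API to verify that a local cluster is successfully connected to a remote cluster.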
Python:
resp = client.cluster.remote_info()
print(resp)
Ruby:
response = client.cluster.remote_info
puts response
Console:
GET /_remote/info
The API should return "connected" : true. When using API key authentication, it should also return "cluster_credentials": "::es_redacted::".
{
  "cluster_one" : {
    "seeds" : [ "127.0.0.1:9443" ],
    "connected" : true,
    "num_nodes_connected" : 1,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false,
    "cluster_credentials": "::es_redacted::",
    "mode" : "sniff"
  }
}
- "connected" : true means the remote cluster has connected successfully.
- "cluster_credentials", if present, indicates the remote cluster has connected using API key authentication instead of certificate based authentication.
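The two checks described above can be automated against the response body. The following is a minimal sketch, assuming the response has already been fetched and converted to a plain dict; the function name summarize_remote_info and the "api_key" / "certificate" labels are hypothetical and not part of any Elasticsearch client.

```python
def summarize_remote_info(remote_info):
    """Summarize each remote cluster in a GET /_remote/info response body."""
    summary = {}
    for alias, info in remote_info.items():
        summary[alias] = {
            "connected": bool(info.get("connected", False)),
            # The value is redacted, so only the presence of the field matters.
            "auth_model": "api_key" if "cluster_credentials" in info else "certificate",
        }
    return summary


# Sample taken from the response shown in this section.
sample = {
    "cluster_one": {
        "seeds": ["127.0.0.1:9443"],
        "connected": True,
        "num_nodes_connected": 1,
        "max_connections_per_cluster": 3,
        "initial_connect_timeout": "30s",
        "skip_unavailable": False,
        "cluster_credentials": "::es_redacted::",
        "mode": "sniff",
    }
}
print(summarize_remote_info(sample))
```

A cluster reported with "connected": False, or with an unexpected auth_model, is a good starting point for the sections that follow.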
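Enabling the remote cluster server
When using API key authentication, cross-cluster traffic happens on the remote cluster interface instead of the transport interface. The remote cluster interface is not enabled by default. This means a node is not ready to accept incoming cross-cluster requests by default, although it is ready to send outgoing cross-cluster requests. Ensure you have enabled the remote cluster server on every node of the remote cluster. In elasticsearch.yml:
- Set remote_cluster_server.enabled to true.
- Configure the bind and publish address for remote cluster server traffic, for example using remote_cluster.host. Without a configured address, remote cluster traffic may be bound to the local interface, and remote clusters running on other machines cannot connect.
- Optionally, configure the remote server port using remote_cluster.port (defaults to 9443).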
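The checklist above can be expressed as a small validation helper. This is only a sketch: it assumes the node's elasticsearch.yml has already been loaded into a flat dict keyed by setting name, it only looks at remote_cluster.host (other ways of configuring the address are ignored), and the helper names are hypothetical.

```python
DEFAULT_REMOTE_CLUSTER_PORT = 9443  # default stated in this section


def check_remote_cluster_server(settings):
    """Return a list of problems found in a flat dict of elasticsearch.yml settings."""
    problems = []
    enabled = str(settings.get("remote_cluster_server.enabled", "false")).lower()
    if enabled != "true":
        problems.append("remote_cluster_server.enabled is not set to true")
    if not settings.get("remote_cluster.host"):
        # Simplification: only remote_cluster.host is inspected here.
        problems.append("remote_cluster.host is not set; traffic may bind to a local interface")
    return problems


def effective_port(settings):
    """Return the configured remote cluster server port, or the default."""
    return int(settings.get("remote_cluster.port", DEFAULT_REMOTE_CLUSTER_PORT))


node_settings = {
    "remote_cluster_server.enabled": True,
    "remote_cluster.host": "192.168.0.42",
}
print(check_remote_cluster_server(node_settings), effective_port(node_settings))
```

Running the check against every node of the remote cluster helps catch the case where only some nodes accept cross-cluster connections.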
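Common issues
The following issues are listed in the order they may occur while setting up a remote cluster.
Remote cluster not reachable
Symptom
A local cluster may not be able to reach a remote cluster for many reasons. For example, the remote cluster server may not be enabled, an incorrect host or port may be configured, or a firewall may be blocking traffic. When a remote cluster is not reachable, check the logs of the local cluster for a connect_exception.
When the remote cluster is configured using proxy mode: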
[2023-06-28T16:36:47,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] connect_exception
When the remote cluster is configured using sniff mode:
[2023-06-28T16:38:37,731][WARN ][o.e.t.SniffConnectionStrategy] [local-node] fetching nodes from external cluster [my] failed
org.elasticsearch.transport.ConnectTransportException: [][192.168.0.42:9443] connect_exception
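Resolution
- Check that the host and port for the remote cluster are correct.
- Ensure the remote cluster server is enabled on the remote cluster.
- Ensure no firewall is blocking the communication.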
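A quick way to test the host, port, and firewall items above is a plain TCP connection attempt from a local cluster node. The sketch below uses only the Python standard library; the function name can_connect is hypothetical. A successful connection only shows that the port is reachable, it says nothing about TLS or authentication.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, can_connect("192.168.0.42", 9443) run from the machine hosting the local node would exercise the same address that appears in the connect_exception messages above.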
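Remote cluster connection is unreliable
Symptom
The local cluster can connect to the remote cluster, but the connection does not work reliably. For example, some cross-cluster requests may succeed while others report connection errors, time out, or appear to be stuck waiting for the remote cluster to respond.
When Elasticsearch detects that the remote cluster connection is not working, it reports the following message in its logs: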
[2023-06-28T16:36:47,264][INFO ][o.e.t.ClusterConnectionManager] [local-node] transport connection to [{my-remote#192.168.0.42:9443}{...}] closed by remote
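This message is also logged if the remote cluster node to which Elasticsearch is connected shuts down or restarts.
Note that with some network configurations it could take minutes or hours for the operating system to detect that a connection has stopped working. Until the failure is detected and reported to Elasticsearch, requests involving the remote cluster may time out or appear to be stuck.
TLS trust not established
TLS can be misconfigured on the local or the remote cluster. The result is that the local cluster does not trust the certificate presented by the remote cluster.
Symptom
The local cluster logs failed to establish trust with server: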
[2023-06-29T09:40:55,465][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [192.168.0.42]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-08-16T23:40:55.464275Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([(shared) (with trust configuration: JDK-trusted-certs)]) is not configured to trust that issuer but trusts [97] other issuers
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs client did not trust this server's certificate:
[2023-06-29T09:40:55,478][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:57305, profile=_remote_cluster}
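Resolution
Read the warning log message on the local cluster carefully to determine the exact cause of the failure. For example:
- Is the remote cluster certificate not signed by a trusted CA? This is the most likely cause.
- Is hostname verification failing?
- Is the certificate expired?
Once you know the cause, you should be able to fix it by adjusting the remote cluster related SSL settings on either the local cluster or the remote cluster.
Often, the issue is on the local cluster. For example, fix it by configuring the necessary trusted CAs (xpack.security.remote_cluster_client.ssl.certificate_authorities).
If you change the elasticsearch.yml file, the associated cluster needs to be restarted for the changes to take effect.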
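The warning message is long, so it can help to pull out the few fields that answer the questions above. The following sketch extracts them with regular expressions; it assumes the message format matches the examples in this document, and the function name parse_trust_failure is hypothetical.

```python
import re


def parse_trust_failure(message):
    """Extract key facts from a 'failed to establish trust with server' log message."""

    def first(pattern):
        match = re.search(pattern, message)
        return match.group(1) if match else None

    return {
        "subject": first(r"subject name \[([^\]]+)\]"),
        "issuer": first(r"issued by \[([^\]]+)\]"),
        # Name of the SSL context whose trust configuration rejected the certificate.
        "ssl_context": first(r"this ssl context \(\[(\S+)"),
        "dates_valid": "certificate dates are valid" in message,
    }


# Abbreviated excerpt of the log message shown in this section.
excerpt = (
    "failed to establish trust with server at [192.168.0.42]; the server provided "
    "a certificate with subject name [CN=remote_cluster]; (certificate dates are valid); "
    "the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did "
    "not provide a copy of the issuing certificate in the certificate chain; "
    "this ssl context ([(shared) (with trust configuration: JDK-trusted-certs)]) "
    "is not configured to trust that issuer"
)
print(parse_trust_failure(excerpt))
```

Comparing the reported issuer with the CAs configured for the named SSL context usually shows whether a missing trusted CA is the cause.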
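API key authentication issues
Connecting to the transport port when using API key authentication
When using API key authentication, a local cluster should connect to a remote cluster's remote cluster server port (defaults to 9443) instead of the transport port (defaults to 9300). A misconfiguration can lead to a number of symptoms:
Symptom 1
It is recommended to use different CAs and certificates for the transport interface and the remote cluster server interface. If this recommendation is followed, a remote cluster client node does not trust the server certificate presented by a remote cluster on the transport interface.
The local cluster logs failed to establish trust with server: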
[2023-06-28T12:48:46,575][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [192.168.0.42]; the server provided a certificate with subject name [CN=transport], fingerprint [c43e628be2a8aaaa4092b82d78f2bc206c492322], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:05:53Z] and [2032-08-29T12:05:53Z] (current time is [2023-06-28T02:48:46.574738Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto Transport CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.remote_cluster_client.ssl (with trust configuration: PEM-trust{/rcs2/ssl/remote-cluster-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto RemoteCluster CA] with fingerprint [ba2350661f66e46c746c1629f0c4b645a2587ff4]
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs client did not trust this server's certificate:
[2023-06-28T12:48:46,584][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60810, profile=default}
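Symptom 2
The CAs and certificates can be shared between the transport and remote cluster server interfaces. Since a remote cluster client does not have a client certificate by default, the server fails to verify the client certificate.
The local cluster logs Received fatal alert: bad_certificate: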
[2023-06-28T12:43:30,705][WARN ][o.e.t.TcpTransport ] [local-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.84:60738, remoteAddress=/192.168.0.42:9309, profile=_remote_cluster}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
The remote cluster logs Empty client certificate chain:
[2023-06-28T12:43:30,772][WARN ][o.e.t.TcpTransport ] [remote-node] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.42:9309, remoteAddress=/192.168.0.84:60783, profile=default}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Empty client certificate chain
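Symptom 3
If the remote cluster client is configured for mTLS and provides a valid client certificate, the connection fails because the client does not send the expected authentication header.
The local cluster logs missing authentication: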
[2023-06-28T13:04:52,710][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: missing authentication credentials for action [cluster:internal/remote_cluster/handshake]
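This does not show up in the logs of the remote cluster.
Symptom 4
If anonymous access is enabled for the remote cluster and it does not require authentication, then depending on the privileges of the anonymous user, the local cluster may log the following.
If the anonymous user does not have the necessary privileges to make a connection, the local cluster logs unauthorized: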
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: action [cluster:internal/remote_cluster/handshake] is unauthorized for user [anonymous_foo] with effective roles [reporting_user], this action is granted by the cluster privileges [cross_cluster_search,cross_cluster_replication,manage,all]
If the anonymous user has the necessary privileges, for example because it is a superuser, the local cluster logs requires channel profile to be [_remote_cluster], but got [default]:
[2023-06-28T13:09:52,031][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9309][cluster:internal/remote_cluster/handshake]
Caused by: java.lang.IllegalArgumentException: remote cluster handshake action requires channel profile to be [_remote_cluster], but got [default]
Resolution
Check the port number and ensure you are indeed connecting to the remote cluster server instead of the transport interface.
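In the remote cluster log lines shown for these symptoms, the channel description ends with profile=default when the connection arrived on the transport interface and profile=_remote_cluster when it arrived on the remote cluster server. The sketch below extracts that field; it is meant for log lines from the remote cluster only, assumes the Netty4TcpChannel format shown in this document, and uses hypothetical function names.

```python
import re

CHANNEL_RE = re.compile(
    r"localAddress=/(?P<local>[^,]+), remoteAddress=/(?P<remote>[^,]+), "
    r"profile=(?P<profile>[^}\s]+)"
)


def channel_profile(log_line):
    """Return the local address, remote address, and profile of a logged channel."""
    match = CHANNEL_RE.search(log_line)
    return match.groupdict() if match else None


def reached_transport_interface(remote_log_line):
    """True if a remote cluster log line shows a connection on the transport profile."""
    info = channel_profile(remote_log_line)
    return info is not None and info["profile"] == "default"


line = (
    "client did not trust this server's certificate, closing connection "
    "Netty4TcpChannel{localAddress=/192.168.0.42:9309, "
    "remoteAddress=/192.168.0.84:60810, profile=default}"
)
print(channel_profile(line), reached_transport_interface(line))
```

If the function returns True for a connection that was supposed to use API key authentication, the configured port on the local cluster most likely points at the transport interface.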
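Connecting without a cross-cluster API key
A local cluster uses the presence of a cross-cluster API key to determine the model with which it connects to a remote cluster. If a cross-cluster API key is present, it uses API key based authentication. Otherwise, it uses certificate based authentication. You can check which model is being used with the remote cluster info API on the local cluster.
Python:
resp = client.cluster.remote_info()
print(resp)
Ruby:
response = client.cluster.remote_info
puts response
Console:
GET /_remote/info
The API should return "connected" : true. When using API key authentication, it should also return "cluster_credentials": "::es_redacted::".
{
  "cluster_one" : {
    "seeds" : [ "127.0.0.1:9443" ],
    "connected" : true,
    "num_nodes_connected" : 1,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false,
    "cluster_credentials": "::es_redacted::",
    "mode" : "sniff"
  }
}
- "connected" : true means the remote cluster has connected successfully.
- "cluster_credentials", if present, indicates the remote cluster has connected using API key authentication instead of certificate based authentication.
Besides checking the response of the remote cluster info API, you can also check the logs.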
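Symptom 1
If no cross-cluster API key is used, the local cluster uses the certificate based authentication method and connects to the remote cluster using the TLS configuration of the transport interface. If the remote cluster has different TLS CAs and certificates for the transport and remote cluster server interfaces (which is the recommendation), TLS verification fails.
The local cluster logs failed to establish trust with server: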
[2023-06-28T12:51:06,452][WARN ][o.e.c.s.DiagnosticTrustManager] [local-node] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=remote_cluster], fingerprint [529de35e15666ffaa26afa50876a2a48119db03a], no keyUsage and no extendedKeyUsage; the certificate is valid between [2023-01-29T12:08:37Z] and [2032-08-29T12:08:37Z] (current time is [2023-06-28T02:51:06.451581Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate has subject alternative names [DNS:localhost,DNS:localhost6.localdomain6,IP:127.0.0.1,IP:0:0:0:0:0:0:0:1,DNS:localhost4,DNS:localhost6,DNS:localhost.localdomain,DNS:localhost4.localdomain4,IP:192.168.0.42]; the certificate is issued by [CN=Elastic Auto RemoteCluster CA] but the server did not provide a copy of the issuing certificate in the certificate chain; this ssl context ([xpack.security.transport.ssl (with trust configuration: PEM-trust{/rcs2/ssl/transport-ca.crt})]) is not configured to trust that issuer, it only trusts the issuer [CN=Elastic Auto Transport CA] with fingerprint [bbe49e3f986506008a70ab651b188c70df104812]
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
The remote cluster logs client did not trust this server's certificate:
[2023-06-28T12:52:16,914][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [remote-node] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/192.168.0.42:9443, remoteAddress=/192.168.0.84:60981, profile=_remote_cluster}
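Symptom 2
Even if TLS verification is not an issue, the connection fails due to missing credentials.
The local cluster logs Please ensure you have configured remote cluster credentials: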
Caused by: java.lang.IllegalArgumentException: Cross cluster requests through the dedicated remote cluster server port require transport header [_cross_cluster_access_credentials] but none found. Please ensure you have configured remote cluster credentials on the cluster originating the request.
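This does not show up in the logs of the remote cluster.
Resolution
Add the cross-cluster API key to the Elasticsearch keystore on every node of the local cluster. Use the Nodes reload secure settings API to reload the keystore.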
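After the key has been added to each node's keystore, the reload step can be scripted. The sketch below assumes a connected Elasticsearch Python client is passed in, and that the reload response lists each node under "nodes" with a "reload_exception" entry when reloading failed on that node; treat both the response shape and the helper name reload_keystore as assumptions to verify against your client version.

```python
def reload_keystore(client):
    """Call the nodes reload secure settings API and return nodes that reported an error."""
    # Assumption: the keystore entry (for example cluster.remote.<alias>.credentials)
    # has already been added on every node before this call.
    resp = client.nodes.reload_secure_settings()
    failed = {}
    for node_id, node in resp["nodes"].items():
        if "reload_exception" in node:
            failed[node_id] = node["reload_exception"]
    return failed
```

An empty result suggests every node reloaded its keystore; the remote cluster info API can then be used to confirm that "cluster_credentials" is present.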
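Using the wrong API key type
API key based authentication requires cross-cluster API keys. It does not work with REST API keys.
Symptom
The local cluster logs authentication expected API key type of [cross_cluster]: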
[2023-06-28T13:26:53,962][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: authentication expected API key type of [cross_cluster], but API key [agZXJocBmA2beJfq2yKu] has type [rest]
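This does not show up in the logs of the remote cluster.
Resolution
Ask the remote cluster administrator to create and distribute a cross-cluster API key. Replace the existing API key in the Elasticsearch keystore with this cross-cluster API key on every node of the local cluster. Use the Nodes reload secure settings API to reload the keystore.
Invalid API key
A cross-cluster API key can fail to authenticate, for example when its credentials are incorrect, or when it has been invalidated or has expired.
Symptom
The local cluster logs unable to authenticate: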
[2023-06-28T13:22:58,264][WARN ][o.e.t.ProxyConnectionStrategy] [local-node] failed to open any proxy connections to cluster [my]
org.elasticsearch.transport.RemoteTransportException: [remote-node][192.168.0.42:9443][cluster:internal/remote_cluster/handshake]
Caused by: org.elasticsearch.ElasticsearchSecurityException: unable to authenticate user [agZXJocBmA2beJfq2yKu] for action [cluster:internal/remote_cluster/handshake]
The remote cluster logs Authentication using apikey failed:
[2023-06-28T13:24:38,744][WARN ][o.e.x.s.a.ApiKeyAuthenticator] [remote-node] Authentication using apikey failed - invalid credentials for API key [agZXJocBmA2beJfq2yKu]
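Resolution
Ask the remote cluster administrator to create and distribute a cross-cluster API key. Replace the existing API key in the Elasticsearch keystore with this cross-cluster API key on every node of the local cluster. Use the Nodes reload secure settings API to reload the keystore.
API key or local user has insufficient privileges
The effective permission for a local user running requests on a remote cluster is determined by the intersection of the cross-cluster API key's privileges and the local user's remote_indices privileges.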
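The intersection rule can be illustrated with plain sets. This is a deliberately simplified model, not how Elasticsearch evaluates privileges: it assumes exact index names (no wildcards) and treats each privilege name as independent, and the function name effective_privileges is hypothetical.

```python
def effective_privileges(api_key_privs, user_remote_privs):
    """Per-index intersection of two {index_name: set_of_privileges} mappings."""
    effective = {}
    for index in api_key_privs.keys() & user_remote_privs.keys():
        shared = api_key_privs[index] & user_remote_privs[index]
        if shared:
            effective[index] = shared
    return effective


# Hypothetical example: the API key allows reading two indices,
# while the local user's remote_indices privileges cover only one of them.
api_key = {"cd": {"read"}, "logs": {"read"}}
user = {"cd": {"read", "view_index_metadata"}, "metrics": {"read"}}
print(effective_privileges(api_key, user))
```

In this model the user can only read "cd": "logs" is missing from the user's privileges and "metrics" is missing from the API key, so either side can be the reason a request is rejected.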
Symptom
Request failures due to insufficient privileges result in API responses like:
{
"type": "security_exception",
"reason": "action [indices:data/read/search] towards remote cluster is unauthorized for user [foo] with assigned roles [foo-role] authenticated by API key id [agZXJocBmA2beJfq2yKu] of user [elastic-admin] on indices [cd], this action is granted by the index privileges [read,all]"
}
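This does not show up in any logs.
Resolution
- Check that the local user has the necessary remote_indices privileges. Grant sufficient remote_indices privileges if necessary.
- If permission is not an issue locally, ask the remote cluster administrator to create and distribute a cross-cluster API key. Replace the existing API key in the Elasticsearch keystore with this cross-cluster API key on every node of the local cluster. Use the Nodes reload secure settings API to reload the keystore.
Local user has no remote_indices privileges
This is a special case of insufficient privileges. In this case, the local user has no remote_indices privileges at all for the target remote cluster. Elasticsearch can detect this and issue a more explicit error response.
Symptom
This results in API responses like: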
{
"type": "security_exception",
"reason": "action [indices:data/read/search] towards remote cluster [my] is unauthorized for user [foo] with effective roles [] (assigned roles [foo-role] were not found) because no remote indices privileges apply for the target cluster"
}
Resolution
Grant sufficient remote_indices privileges to the local user.
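As an illustration of what such a grant can look like, the sketch below builds a role body with a remote_indices entry for one remote cluster. The cluster alias "my" and index "cd" are taken from the error messages above; the helper name remote_indices_role is hypothetical, and the set of privileges a real workload needs may be larger than the single read privilege shown.

```python
import json


def remote_indices_role(cluster_alias, index_names, privileges):
    """Build a role body granting remote_indices privileges on one remote cluster."""
    return {
        "remote_indices": [
            {
                "clusters": [cluster_alias],
                "names": list(index_names),
                "privileges": list(privileges),
            }
        ]
    }


role_body = remote_indices_role("my", ["cd"], ["read"])
print(json.dumps(role_body, indent=2))
```

A body like this could be sent to the create or update roles API on the local cluster, and the role then assigned to the local user.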