Kerberos authentication error in Spark on K8s mode


Published on 2024-04-18

1. Problem description

While adapting Kerberos for Spark on K8s mode, the following error appeared:

2024-04-11 19:14:13,435 [DEBUG] [main] getting serverKey: dfs.namenode.kerberos.principal conf value: hadoop/_HOST@leo.com principal: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com (org.apache.hadoop.security.SaslRpcClient(org.apache.hadoop.security.SaslRpcClient.getServerPrincipal:444))
2024-04-11 19:14:13,435 [DEBUG] [main] closing ipc connection to 2402:4e00:140b:4c00:0:9bab:29fd:180/2402:4e00:140b:4c00:0:9bab:29fd:180:4007: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com (org.apache.hadoop.ipc.Client(org.apache.hadoop.ipc.Client$Connection.close:1302))
java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:903) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection.access$3900(Client.java:419) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1657) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client.call(Client.java:1473) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client.call(Client.java:1426) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) ~[hadoop-client-api-3.2.2.jar:?]
        at com.sun.proxy.$Proxy27.setSafeMode(Unknown Source) ~[?:?]
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setSafeMode(ClientNamenodeProtocolTranslatorPB.java:700) ~[hadoop-client-api-3.2.2.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_345]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_345]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:424) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) ~[hadoop-client-api-3.2.2.jar:?]
        at com.sun.proxy.$Proxy28.setSafeMode(Unknown Source) ~[?:?]
        at org.apache.hadoop.hdfs.DFSClient.setSafeMode(DFSClient.java:2217) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.hdfs.DistributedFileSystem.setSafeMode(DistributedFileSystem.java:1523) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.spark.deploy.history.FsHistoryProvider.isFsInSafeMode(FsHistoryProvider.scala:1190) ~[spark-core_2.12-3.4.2.jar:3.4.2]
        at org.apache.spark.deploy.history.FsHistoryProvider.isFsInSafeMode(FsHistoryProvider.scala:1183) ~[spark-core_2.12-3.4.2.jar:3.4.2]
        at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:216) ~[spark-core_2.12-3.4.2.jar:3.4.2]
        at org.apache.spark.deploy.history.FsHistoryProvider.start(FsHistoryProvider.scala:396) ~[spark-core_2.12-3.4.2.jar:3.4.2]
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:319) ~[spark-core_2.12-3.4.2.jar:3.4.2]
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) ~[spark-core_2.12-3.4.2.jar:3.4.2]
Caused by: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com
        at org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:458) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:287) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:201) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:511) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:635) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:419) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:836) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:832) ~[hadoop-client-api-3.2.2.jar:?]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_345]
        at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_345]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2065) ~[hadoop-client-api-3.2.2.jar:?]
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:832) ~[hadoop-client-api-3.2.2.jar:?]
        ... 26 more
2024-04-11 19:14:13,436 [DEBUG] [main] IPC Client (1745608181) connection to 2402:4e00:140b:4c00:0:9bab:29fd:180/[2402:4e00:140b:4c00:0:9bab:29fd:180]:4007 from hadoop/leo-9bab29fd@leo.com: closed (org.apache.hadoop.ipc.Client(org.apache.hadoop.ipc.Client$Connection.close:1311))

The error message is clear enough: the expected principal and the actual principal differ, hence the failure.

2. Source code analysis

Following the error stack, we locate the relevant code in the org.apache.hadoop.security.SaslRpcClient class: the error occurs while the saslClient is being created, which ultimately prevents the connection to the saslServer from being established.

The saslClient creation flow is shown below:

org.apache.hadoop.security.SaslRpcClient.saslConnect
	saslAuthType = selectSaslClient(saslMessage.getAuthsList()); // pick the matching auth type from the SASL negotiation message
		iterate over authTypes, each entry denoted authType
			SaslRpcClient.createSaslClient(authType) // create the Kerberos authentication client
				method = AuthMethod.valueOf(authType.getMethod()); // method differs per authentication mechanism
				serverPrincipal = getServerPrincipal(authType); // resolve the principal configured for the server
					krbInfo = SecurityUtil.getKerberosInfo // obtain the Kerberos info from the protocol and the configuration
					serverKey = krbInfo.serverPrincipal(); // config key holding the HDFS server-side principal
					serverPrincipal = new KerberosPrincipal(authType.getProtocol() + "/" + authType.getServerId(), KerberosPrincipal.KRB_NT_SRV_HST).getName();
					confPrincipal = SecurityUtil.getServerPrincipal(conf.get(serverKey), serverAddr.getAddress());
					if (!serverPrincipal.equals(confPrincipal)) { // this is where the error in the log is thrown
			        	throw new IllegalArgumentException(String.format("Server has invalid Kerberos principal: %s, expecting: %s", serverPrincipal, confPrincipal));
			      	}

The crux is the comparison between serverPrincipal and confPrincipal.
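Rewritten as plain Java, the check boils down to roughly the following. This is a condensed sketch based on the call flow above, not the verbatim Hadoop source; the SaslRpcClient fields (protocol, conf, serverAddr) are turned into parameters here so the snippet stands on its own.

import java.io.IOException;
import java.net.InetAddress;
import javax.security.auth.kerberos.KerberosPrincipal;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.SecurityUtil;

public class PrincipalCheckSketch {
    // serverKey is what SecurityUtil.getKerberosInfo(protocol, conf).serverPrincipal() returns;
    // for the NameNode protocol it is "dfs.namenode.kerberos.principal" (see the DEBUG log above).
    static String checkServerPrincipal(String protocolFromAuth, String serverIdFromAuth,
                                       Configuration conf, String serverKey,
                                       InetAddress serverAddr) throws IOException {
        // Principal advertised by the server: protocol + "/" + serverId from the SASL negotiation reply;
        // getName() fills in the default realm from krb5.conf when the name carries none.
        String serverPrincipal = new KerberosPrincipal(
            protocolFromAuth + "/" + serverIdFromAuth,
            KerberosPrincipal.KRB_NT_SRV_HST).getName();

        // Principal the client expects, derived from its own configuration and the server address
        String confPrincipal = SecurityUtil.getServerPrincipal(conf.get(serverKey), serverAddr);

        if (!serverPrincipal.equals(confPrincipal)) {
            throw new IllegalArgumentException(String.format(
                "Server has invalid Kerberos principal: %s, expecting: %s",
                serverPrincipal, confPrincipal));
        }
        return confPrincipal;
    }
}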

2.1.serverPrincipal

serverPrincipal is the principal assembled from information returned by the server side, and it depends on the serverId carried in authType.

Using the watch feature of the Arthas tool, we captured a sample authType, as shown below:

[arthas@1071]$ watch org.apache.hadoop.security.SaslRpcClient selectSaslClient "{params}" -x 3
Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 1) cost in 274 ms, listenerId: 1 method=org.apache.hadoop.security.SaslRpcClient.selectSaslClient location=AtExceptionExit
ts=2024-04-11 20:18:03; [cost=13.770484ms] result=@ArrayList[
        @Object [] [
                @UnmodifiableRandomAccessList[
                        @SaslAuth [
                                method: "TOKEN"
                                mechanism: "DIGEST-MD5"
                                protocol: ""
                                serverId: "default"
                                challenge: "realm=\"default\", nonce=\"5GPSQ+1F67H/3c6WcOAR19PqQ5HEf2qF3n6QNWh\", qop=\"auth\", charset=utf-8, algorithm=md5-sess"
                                serverVersion: 0
                        ],
                        @SaslAuth [
                                method: "KERBEROS"
                                mechanism: "GSSAPI"
                                protocol: "hadoop" 
                                serverId: "leo-2402-4e00-140b-4c01-0-9b9f-83c6-a5e0"
                                serverVersion: 0
                        ]
                ]
        ]
]
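Plugging the KERBEROS entry captured above into the same construction reproduces the serverPrincipal string in isolation. A minimal sketch, assuming a krb5.conf with default_realm = leo.com is available (KerberosPrincipal needs a default realm to complete a name that carries none):

import javax.security.auth.kerberos.KerberosPrincipal;

public class ServerPrincipalDemo {
    public static void main(String[] args) {
        // protocol and serverId as captured by Arthas for the KERBEROS SaslAuth entry
        String protocol = "hadoop";
        String serverId = "leo-2402-4e00-140b-4c01-0-9b9f-83c6-a5e0";

        // KRB_NT_SRV_HST marks a service/host style principal; getName() appends the default
        // realm, printing something like "hadoop/leo-2402-4e00-140b-4c01-0-9b9f-83c6-a5e0@leo.com"
        String serverPrincipal = new KerberosPrincipal(
            protocol + "/" + serverId, KerberosPrincipal.KRB_NT_SRV_HST).getName();
        System.out.println(serverPrincipal);
    }
}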

2.2.confPrincipal

confPrincipal is the principal built from the server's address information; the construction happens in the SecurityUtil.getServerPrincipal method, whose source is shown below:

  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  public static String getServerPrincipal(String principalConfig,
      InetAddress addr) throws IOException {
    String[] components = getComponents(principalConfig);
    if (components == null || components.length != 3
        || !components[1].equals(HOSTNAME_PATTERN)) {
      return principalConfig;
    } else {
      if (addr == null) {
        throw new IOException("Can't replace " + HOSTNAME_PATTERN
            + " pattern since client address is null");
      }
      return replacePattern(components, addr.getCanonicalHostName());
    }
  }

If the dfs.namenode.kerberos.principal configured in hdfs-site contains the "_HOST" placeholder, it is replaced with the canonical hostname of the server address (addr.getCanonicalHostName()); if it does not contain "_HOST", the configured value is used as-is.
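To see how this produces the "expecting: hadoop/2402:...@leo.com" side of the error, here is a minimal sketch. It assumes the client's dfs.namenode.kerberos.principal is hadoop/_HOST@leo.com (as in the DEBUG log) and that the IPv6 address the client dials does not reverse-resolve to the NameNode's hostname:

import java.net.InetAddress;
import org.apache.hadoop.security.SecurityUtil;

public class ConfPrincipalDemo {
    public static void main(String[] args) throws Exception {
        // dfs.namenode.kerberos.principal as configured on the client (from the DEBUG log)
        String principalConfig = "hadoop/_HOST@leo.com";

        // The client reaches the NameNode by IP (the IPv6 literal from the log)
        InetAddress nnAddr = InetAddress.getByName("2402:4e00:140b:4c00:0:9bab:29fd:180");

        // _HOST is replaced with nnAddr.getCanonicalHostName(); without a usable reverse DNS
        // mapping this falls back to the IP text itself, yielding
        // "hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com" -- the "expecting:" value.
        System.out.println(SecurityUtil.getServerPrincipal(principalConfig, nnAddr));
    }
}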

3. Root cause summary

From the analysis above we know that when the saslClient is created to establish a connection with the saslServer, hadoop-common runs a comparison between two principals, serverPrincipal and confPrincipal, and throws the error whenever they differ.

In our setup the HDFS server side is addressed everywhere by hostname, not by IP, so serverPrincipal carries a hostname. The client's hdfs-site, however, configures the NameNode addresses as IPs, so the resulting confPrincipal carries an IP. The two do not match, hence the error.

4. Solution

Change the communication-address-related properties in the client's hdfs-site to use hostnames consistently; an example is given below.
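For example, an illustrative hdfs-site.xml fragment only: the hostname below is a placeholder, the port is taken from the log above, and an HA cluster would instead use the dfs.namenode.rpc-address.<nameservice>.<nn-id> form of the property.

<!-- hdfs-site.xml on the Spark/client side: point RPC addresses at the hostname, not the IP -->
<property>
  <name>dfs.namenode.rpc-address</name>
  <!-- "nn1.leo.com" is a hypothetical NameNode hostname; replace with the real one -->
  <value>nn1.leo.com:4007</value>
</property>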