1. Problem Description
In Spark on K8s mode, while adapting Kerberos, the following error occurs:
2024-04-11 19:14:13,435 [DEBUG] [main] getting serverKey: dfs.namenode.kerberos.principal conf value: hadoop/_HOST@leo.com principal: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com (org.apache.hadoop.security.SaslRpcClient(org.apache.hadoop.security.SaslRpcClient.getServerPrincipal:444))
2024-04-11 19:14:13,435 [DEBUG] [main] closing ipc connection to 2402:4e00:140b:4c00:0:9bab:29fd:180/2402:4e00:140b:4c00:0:9bab:29fd:180:4007: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com (org.apache.hadoop.ipc.Client(org.apache.hadoop.ipc.Client$Connection.close:1302))
java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:903) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.access$3900(Client.java:419) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1657) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1473) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1426) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) ~[hadoop-client-api-3.2.2.jar:?]
at com.sun.proxy.$Proxy27.setSafeMode(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setSafeMode(ClientNamenodeProtocolTranslatorPB.java:700) ~[hadoop-client-api-3.2.2.jar:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_345]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_345]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_345]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_345]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:424) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) ~[hadoop-client-api-3.2.2.jar:?]
at com.sun.proxy.$Proxy28.setSafeMode(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.setSafeMode(DFSClient.java:2217) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.hdfs.DistributedFileSystem.setSafeMode(DistributedFileSystem.java:1523) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.spark.deploy.history.FsHistoryProvider.isFsInSafeMode(FsHistoryProvider.scala:1190) ~[spark-core_2.12-3.4.2.jar:3.4.2]
at org.apache.spark.deploy.history.FsHistoryProvider.isFsInSafeMode(FsHistoryProvider.scala:1183) ~[spark-core_2.12-3.4.2.jar:3.4.2]
at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:216) ~[spark-core_2.12-3.4.2.jar:3.4.2]
at org.apache.spark.deploy.history.FsHistoryProvider.start(FsHistoryProvider.scala:396) ~[spark-core_2.12-3.4.2.jar:3.4.2]
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:319) ~[spark-core_2.12-3.4.2.jar:3.4.2]
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) ~[spark-core_2.12-3.4.2.jar:3.4.2]
Caused by: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com, expecting: hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com
at org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:458) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:287) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:201) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:511) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:635) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:419) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:836) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:832) ~[hadoop-client-api-3.2.2.jar:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_345]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_345]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2065) ~[hadoop-client-api-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:832) ~[hadoop-client-api-3.2.2.jar:?]
... 26 more
2024-04-11 19:14:13,436 [DEBUG] [main] IPC Client (1745608181) connection to 2402:4e00:140b:4c00:0:9bab:29fd:180/[2402:4e00:140b:4c00:0:9bab:29fd:180]:4007 from hadoop/leo-9bab29fd@leo.com: closed (org.apache.hadoop.ipc.Client(org.apache.hadoop.ipc.Client$Connection.close:1311))
The error message is clear: the expected principal and the principal actually presented do not match, hence the failure.
2. Source Code Analysis
Following the error stack, we locate the corresponding source block in the org.apache.hadoop.security.SaslRpcClient class. The error occurs while creating the saslClient, which ultimately prevents a connection to the saslServer from being established.
The saslClient creation flow is as follows:
org.apache.hadoop.security.SaslRpcClient.saslConnect
  saslAuthType = selectSaslClient(saslMessage.getAuthsList()); // select the correct authType from the SASL message
  iterate over authTypes, each one denoted authType:
    SaslRpcClient.createSaslClient(authType) // create the Kerberos authentication client
      method = AuthMethod.valueOf(authType.getMethod()); // the method differs for each authentication mechanism
      serverPrincipal = getServerPrincipal(authType); // obtain the principal configured on the server side
        krbInfo = SecurityUtil.getKerberosInfo // look up the Kerberos info from the protocol and the configuration
        serverKey = krbInfo.serverPrincipal(); // the config key under which the HDFS server principal is stored
        serverPrincipal = new KerberosPrincipal(authType.getProtocol() + "/" + authType.getServerId(), KerberosPrincipal.KRB_NT_SRV_HST).getName();
        confPrincipal = SecurityUtil.getServerPrincipal(conf.get(serverKey), serverAddr.getAddress());
        if (!serverPrincipal.equals(confPrincipal)) { // this is where the error message originates
          throw new IllegalArgumentException(String.format("Server has invalid Kerberos principal: %s, expecting: %s", serverPrincipal, confPrincipal));
        }
In essence, this code compares serverPrincipal against confPrincipal.
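To make the comparison concrete, here is a minimal, self-contained sketch (not the Hadoop code itself) that rebuilds the two principals from the values seen in the log above and compares them the same way. The realm, the hostname-style serverId, and the IPv6 literal are taken from the error message; everything else is illustrative.

import javax.security.auth.kerberos.KerberosPrincipal;

// Minimal sketch, not the Hadoop implementation: rebuilds both principals
// from the values observed in the log and compares them the same way
// SaslRpcClient.getServerPrincipal does.
public class PrincipalMismatchDemo {
    public static void main(String[] args) {
        // serverPrincipal: built from the SaslAuth the server advertises
        // (protocol "hadoop", serverId = the server's hostname).
        String serverPrincipal = new KerberosPrincipal(
                "hadoop/leo-2402-4e00-140b-4c00-0-9bab-29fd-180@leo.com",
                KerberosPrincipal.KRB_NT_SRV_HST).getName();

        // confPrincipal: what the client derives from dfs.namenode.kerberos.principal
        // ("hadoop/_HOST@leo.com") after _HOST is replaced with the canonical host name
        // of the configured NameNode address -- here the raw IPv6 address, because the
        // client config uses an IP with no reverse DNS entry (assumption).
        String confPrincipal = "hadoop/2402:4e00:140b:4c00:0:9bab:29fd:180@leo.com";

        if (!serverPrincipal.equals(confPrincipal)) {
            throw new IllegalArgumentException(String.format(
                    "Server has invalid Kerberos principal: %s, expecting: %s",
                    serverPrincipal, confPrincipal));
        }
    }
}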
2.1.serverPrincipal
serverPrincipal is the principal assembled from information returned by the server side; it depends on the serverId carried in authType.
Using the watch capability of the arthas tool, we captured a sample of authType, shown below:
[arthas@1071]$ watch org.apache.hadoop.security.SaslRpcClient selectSaslClient "{params}" -x 3
Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 1) cost in 274 ms, listenerId: 1
method=org.apache.hadoop.security.SaslRpcClient.selectSaslClient location=AtExceptionExit
ts=2024-04-11 20:18:03; [cost=13.770484ms] result=@ArrayList[
@Object [] [
@UnmodifiableRandomAccessList[
@SaslAuth [
method: "TOKEN"
mechanism: "DIGEST-MD5"
protocol: ""
serverId: "default"
challenge: "realm=\"default\", nonce=\"5GPSQ+1F67H/3c6WcOAR19PqQ5HEf2qF3n6QNWh\", qop=\"auth\", charset=utf-8, algorithm=md5-sess"
serverVersion: 0
],
@SaslAuth [
method: "KERBEROS"
mechanism: "GSSAPI"
protocol: "hadoop"
serverId: "leo-2402-4e00-140b-4c01-0-9b9f-83c6-a5e0"
serverVersion: 0
]
]
]
]
2.2.confPrincipal
confPrincipal is the principal built from the server's address information. The construction happens in the SecurityUtil.getServerPrincipal method, whose source is shown below:
@InterfaceAudience.Public
@InterfaceStability.Evolving
public static String getServerPrincipal(String principalConfig,
    InetAddress addr) throws IOException {
  String[] components = getComponents(principalConfig);
  if (components == null || components.length != 3
      || !components[1].equals(HOSTNAME_PATTERN)) {
    return principalConfig;
  } else {
    if (addr == null) {
      throw new IOException("Can't replace " + HOSTNAME_PATTERN
          + " pattern since client address is null");
    }
    return replacePattern(components, addr.getCanonicalHostName());
  }
}
If the dfs.namenode.kerberos.principal configured in hdfs-site contains the "_HOST" pattern, it is replaced with the canonical host name of the server address (addr.getCanonicalHostName()); if it does not contain "_HOST", the configured value of dfs.namenode.kerberos.principal is used as-is.
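Below is a minimal sketch of this _HOST substitution, assuming the same semantics as SecurityUtil.getServerPrincipal but with getComponents/replacePattern reduced to plain string handling; the sample addresses in main are illustrative only.

import java.io.IOException;
import java.net.InetAddress;

// Minimal sketch of the _HOST substitution: only a principal of the exact form
// "service/_HOST@REALM" is rewritten, using the canonical host name of the
// configured NameNode address.
public class HostPatternDemo {
    static String getServerPrincipal(String principalConfig, InetAddress addr)
            throws IOException {
        String[] components = principalConfig.split("[/@]");
        if (components.length != 3 || !components[1].equals("_HOST")) {
            return principalConfig; // no _HOST pattern, use the configured value as-is
        }
        if (addr == null) {
            throw new IOException("Can't replace _HOST pattern since client address is null");
        }
        // For an IP with no reverse DNS entry, getCanonicalHostName() returns the
        // textual IP itself -- which is how the IPv6 literal ends up in confPrincipal.
        return components[0] + "/" + addr.getCanonicalHostName() + "@" + components[2];
    }

    public static void main(String[] args) throws IOException {
        // Illustrative inputs: a _HOST principal resolved against a loopback address,
        // and a principal without _HOST that is returned unchanged.
        System.out.println(getServerPrincipal("hadoop/_HOST@leo.com",
                InetAddress.getByName("127.0.0.1")));
        System.out.println(getServerPrincipal("hadoop/other-host@leo.com", null));
    }
}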
3. Root Cause Summary
From the above analysis we know that when the saslClient establishes a connection with the saslServer, hadoop-common runs a comparison between two principals, serverPrincipal and confPrincipal; if they differ, the error above is thrown.
In our scenario, the HDFS server side addresses itself by hostname rather than by IP, so serverPrincipal contains the hostname. The client's hdfs-site, however, configures the NameNode addresses as IPs, so the resulting confPrincipal contains the IP. The two do not match, hence the error.
4. Solution
Adapt the communication-address-related properties in the client's hdfs-site so that they consistently use hostnames instead of IPs.
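As an illustration, a client-side hdfs-site.xml fragment might look like the following; the property name is the standard HDFS one, the port is the one seen in the log, but the hostname nn1.leo.com is hypothetical, and in an HA setup the per-NameNode variants of this property (dfs.namenode.rpc-address.<nameservice>.<nn-id>) would be adjusted instead.

<!-- Illustrative client-side hdfs-site.xml fragment; hostname is hypothetical. -->
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>nn1.leo.com:4007</value>
</property>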