[SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics

This patch provides detail on what to do for keytab-less Oozie launches of Spark applications, and adds some debug-level diagnostics of which credentials have been submitted.

Author: Steve Loughran <stevel@hortonworks.com>
Author: Steve Loughran <stevel@apache.org>

Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
Steve Loughran authored on 2016-05-26 13:55:22 -05:00; committed by Tom Graves
parent c76457c8e4
commit 01b350a4f7
3 changed files with 148 additions and 2 deletions

core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala

@@ -17,10 +17,11 @@
package org.apache.spark.deploy
import java.io.{ByteArrayInputStream, DataInputStream}
import java.io.{ByteArrayInputStream, DataInputStream, IOException}
import java.lang.reflect.Method
import java.security.PrivilegedExceptionAction
import java.util.{Arrays, Comparator}
import java.text.DateFormat
import java.util.{Arrays, Comparator, Date}
import scala.collection.JavaConverters._
import scala.concurrent.duration._
@@ -34,6 +35,8 @@ import org.apache.hadoop.fs.FileSystem.Statistics
import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.security.{Credentials, UserGroupInformation}
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier
import org.apache.spark.{SparkConf, SparkException}
import org.apache.spark.annotation.DeveloperApi
@@ -357,6 +360,50 @@ class SparkHadoopUtil extends Logging {
    newConf.setBoolean(confKey, true)
    newConf
  }
  /**
   * Dump the credentials' tokens to string values.
   *
   * @param credentials credentials
   * @return an iterable of string values; if no credentials are passed in, an empty list
   */
  private[spark] def dumpTokens(credentials: Credentials): Iterable[String] = {
    if (credentials != null) {
      credentials.getAllTokens.asScala.map(tokenToString)
    } else {
      Seq()
    }
  }

  /**
   * Convert a token to a string for logging.
   * If it's an abstract delegation token, attempt to unmarshall it and then
   * print more details, including timestamps in human-readable form.
   *
   * @param token token to convert to a string
   * @return a printable string value.
   */
  private[spark] def tokenToString(token: Token[_ <: TokenIdentifier]): String = {
    val df = DateFormat.getDateTimeInstance(DateFormat.SHORT, DateFormat.SHORT)
    val buffer = new StringBuilder(128)
    buffer.append(token.toString)
    try {
      val ti = token.decodeIdentifier
      buffer.append("; ").append(ti)
      ti match {
        case dt: AbstractDelegationTokenIdentifier =>
          // include human times and the renewer, which the HDFS tokens toString omits
          buffer.append("; Renewer: ").append(dt.getRenewer)
          buffer.append("; Issued: ").append(df.format(new Date(dt.getIssueDate)))
          buffer.append("; Max Date: ").append(df.format(new Date(dt.getMaxDate)))
        case _ =>
      }
    } catch {
      case e: IOException =>
        logDebug(s"Failed to decode $token: $e", e)
    }
    buffer.toString
  }
}
object SparkHadoopUtil {

docs/running-on-yarn.md

@@ -476,3 +476,99 @@ If you need a reference to the proper location to put log files in the YARN so t
- In `cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `client` mode, only the Spark executors do.
- The `--files` and `--archives` options support specifying file names with the `#` symbol, similar to Hadoop. For example, you can specify `--files localtest.txt#appSees.txt`; this uploads the file you have locally named `localtest.txt` into HDFS, but it will be linked to by the name `appSees.txt`, and your application should use the name `appSees.txt` to reference it when running on YARN.
- The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
# Running in a Secure Cluster
As covered in [security](security.html), Kerberos is used in a secure Hadoop cluster to
authenticate principals associated with services and clients. This allows clients to
make requests of these authenticated services, and the services to grant rights
to the authenticated principals.
Hadoop services issue *hadoop tokens* to grant access to the services and data.
Clients must first acquire tokens for the services they will access and pass them along with their
application as it is launched in the YARN cluster.
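As an illustration of that flow (not part of this patch; the principal and application names are hypothetical), a launch from a kerberized client usually looks like this:
```bash
# Authenticate as the launching user; Spark uses this Kerberos identity to
# request delegation tokens (HDFS, and potentially Hive/HBase) at submit time.
kinit alice@EXAMPLE.COM

# Submit to YARN; the acquired tokens are passed along with the application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.MyApp \
  my-app.jar
```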
For a Spark application to interact with HDFS, HBase and Hive, it must acquire the relevant tokens
using the Kerberos credentials of the user launching the application
—that is, the principal whose identity will become that of the launched Spark application.
This is normally done at launch time: in a secure cluster Spark will automatically obtain a
token for the cluster's HDFS filesystem, and potentially for HBase and Hive.
An HBase token will be obtained if HBase is on the classpath, the HBase configuration declares
the application is secure (i.e. `hbase-site.xml` sets `hbase.security.authentication` to `kerberos`),
and `spark.yarn.security.tokens.hbase.enabled` is not set to `false`.
Similarly, a Hive token will be obtained if Hive is on the classpath, its configuration
includes a URI of the metadata store in `hive.metastore.uris`, and
`spark.yarn.security.tokens.hive.enabled` is not set to `false`.
If an application needs to interact with other secure HDFS clusters, then
the tokens needed to access these clusters must be explicitly requested at
launch time. This is done by listing them in the `spark.yarn.access.namenodes` property.
```
spark.yarn.access.namenodes hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/
```
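The same property can also be supplied at submit time instead of in `spark-defaults.conf`; a sketch, reusing the example cluster addresses above with a hypothetical application:
```bash
spark-submit \
  --conf spark.yarn.access.namenodes=hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/ \
  --master yarn --deploy-mode cluster --class org.example.MyApp my-app.jar
```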
## Launching your application with Apache Oozie
Apache Oozie can launch Spark applications as part of a workflow.
In a secure cluster, the launched application will need the relevant tokens to access the cluster's
services. If Spark is launched with a keytab, this is automatic.
However, if Spark is to be launched without a keytab, the responsibility for setting up security
must be handed over to Oozie.
The details of configuring Oozie for secure clusters and obtaining
credentials for a job can be found on the [Oozie web site](http://oozie.apache.org/)
in the "Authentication" section of the specific release's documentation.
For Spark applications, the Oozie workflow must be set up for Oozie to request all tokens which
the application needs, including:
- The YARN resource manager.
- The local HDFS filesystem.
- Any remote HDFS filesystems used as a source or destination of I/O.
- Hive —if used.
- HBase —if used.
- The YARN timeline server, if the application interacts with this.
To avoid Spark attempting —and then failing— to obtain Hive, HBase and remote HDFS tokens,
the Spark configuration must be set to disable token collection for the services.
The Spark configuration must include the lines:
```
spark.yarn.security.tokens.hive.enabled false
spark.yarn.security.tokens.hbase.enabled false
```
The configuration option `spark.yarn.access.namenodes` must be unset.
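For example, when the job is launched through an Oozie `spark` action, the same settings can be passed via the action's `spark-opts` element; only the options string is sketched here, since the surrounding workflow definition is site-specific:
```
--conf spark.yarn.security.tokens.hive.enabled=false --conf spark.yarn.security.tokens.hbase.enabled=false
```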
## Troubleshooting Kerberos
Debugging Hadoop/Kerberos problems can be "difficult". One useful technique is to
enable extra logging of Kerberos operations in Hadoop by setting the `HADOOP_JAAS_DEBUG`
environment variable.
```bash
export HADOOP_JAAS_DEBUG=true
```
The JDK classes can be configured to enable extra logging of their Kerberos and
SPNEGO/REST authentication via the system properties `sun.security.krb5.debug`
and `sun.security.spnego.debug`:
```
-Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true
```
All these options can be enabled in the Application Master:
```
spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true
spark.yarn.am.extraJavaOptions -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true
```
Finally, if the log level for `org.apache.spark.deploy.yarn.Client` is set to `DEBUG`, the log
will include a list of all tokens obtained, and their expiry details.
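For example, assuming the stock Log4j setup in `conf/log4j.properties`, adding the following line turns on that logging:
```
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
```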

yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala

@@ -375,6 +375,9 @@ private[spark] class Client(
    val distributedNames = new HashSet[String]
    YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(sparkConf, hadoopConf, credentials)
    YarnSparkHadoopUtil.get.obtainTokenForHBase(sparkConf, hadoopConf, credentials)
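    // At debug level, log the tokens that have been obtained and will be submitted with the application.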
    if (credentials != null) {
      logDebug(YarnSparkHadoopUtil.get.dumpTokens(credentials).mkString("\n"))
    }
    val replication = sparkConf.get(STAGING_FILE_REPLICATION).map(_.toShort)
      .getOrElse(fs.getDefaultReplication(destDir))