---
layout: global
displayTitle: Spark Security
title: Security
---

Spark currently supports authentication via a shared secret. Authentication is turned on via the `spark.authenticate` configuration parameter, which controls whether the Spark communication protocols authenticate using the shared secret. The authentication is a basic handshake that makes sure both sides have the same shared secret and are allowed to communicate; if the shared secrets are not identical, the two sides will not be allowed to communicate. The shared secret is created as follows:

* For Spark on YARN deployments, configuring `spark.authenticate` to `true` will automatically handle generating and distributing the shared secret. Each application will use a unique shared secret.
* For other types of Spark deployments, the Spark parameter `spark.authenticate.secret` should be configured on each of the nodes. This secret will be used by all the Master/Workers and applications (see the sketch below).
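
For example, a non-YARN deployment might pass the secret at submission time. This is a minimal sketch: the master URL, secret value, and example application are placeholders, and in practice the secret should be generated randomly and kept in `spark-defaults.conf` rather than typed on the command line:

```bash
# Enable shared-secret authentication (non-YARN). Every Master, Worker,
# and application must be started with the same secret value.
./bin/spark-submit \
  --master spark://master:7077 \
  --conf spark.authenticate=true \
  --conf spark.authenticate.secret=change-me \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-*.jar \
  100
```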

## Web UI

The Spark UI can also be secured by using `javax.servlet` filters via the `spark.ui.filters` setting. A user may want to secure the UI if it has data that other users should not be allowed to see. The servlet filter specified by the user can authenticate the user, and once the user is logged in, Spark can compare that user against the view ACLs to make sure they are authorized to view the UI. The configs `spark.acls.enable` and `spark.ui.view.acls` control the behavior of the ACLs. Note that the user who started the application always has view access to the UI. On YARN, the Spark UI uses the standard YARN web application proxy mechanism and will authenticate via any installed Hadoop filters.

Spark also supports modify ACLs to control who has access to modify a running Spark application. This includes things like killing the application or a task. This is controlled by the configs `spark.acls.enable` and `spark.modify.acls`. Note that if you are authenticating the web UI, then in order to use the kill button on the web UI, it might be necessary to add the users in the modify ACLs to the view ACLs as well. On YARN, the modify ACLs are passed in and control who has modify access via the YARN interfaces.

Spark allows a set of administrators to be specified in the ACLs who always have view and modify permissions on all applications. This is controlled by the config `spark.admin.acls`. It is useful on a shared cluster where you might have administrators or support staff who help users debug applications.
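
Putting these together, a minimal sketch of an ACL configuration follows; the user names `alice`, `bob`, and `admin`, as well as the application class and jar, are purely illustrative:

```bash
# Submit an application where alice and bob may view the UI, only alice
# may modify (e.g. kill) the application, and admin may always do both.
./bin/spark-submit \
  --conf spark.acls.enable=true \
  --conf spark.ui.view.acls=alice,bob \
  --conf spark.modify.acls=alice \
  --conf spark.admin.acls=admin \
  --class com.example.MyApp \
  myapp.jar
```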

## Event Logging

If your applications are using event logging, the directory where the event logs go (`spark.eventLog.dir`) should be manually created with the proper permissions set on it. If you want those log files secured, the permissions should be set to `drwxrwxrwxt` for that directory. The owner of the directory should be the super user who is running the history server, and the group permissions should be restricted to the super user's group. This allows all users to write to the directory but prevents unprivileged users from removing or renaming a file unless they own the file or directory. The event log files will be created by Spark with permissions such that only the user and group have read and write access.
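
On HDFS this could look like the following sketch, assuming the event log directory is `/spark-events` and the history server runs as the `spark` user (both are assumptions, not defaults):

```bash
# Create the event log directory, owned by the history-server user, with
# mode 1777 (drwxrwxrwxt): world-writable plus the sticky bit, so users
# cannot remove or rename files they do not own.
hadoop fs -mkdir /spark-events
hadoop fs -chown spark:spark /spark-events
hadoop fs -chmod 1777 /spark-events
```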

## Encryption

Spark supports SSL for the Akka and HTTP (broadcast and file server) protocols. However, SSL is not yet supported for the web UI and the block transfer service.

Connection encryption (SSL) configuration is organized hierarchically. The user can configure default SSL settings, which will be used for all the supported communication protocols unless they are overridden by protocol-specific settings. This way the user can easily provide common settings for all the protocols without losing the ability to configure each one individually. The common SSL settings live in the `spark.ssl` namespace of the Spark configuration, while the Akka SSL configuration is under `spark.ssl.akka` and the HTTP (broadcast and file server) SSL configuration is under `spark.ssl.fs`. The full breakdown can be found on the [configuration page](configuration.html).

SSL must be configured on each node, for each component involved in communication over the given protocol.
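
As a sketch of the hierarchy, a `conf/spark-defaults.conf` fragment might look like the following; all paths and passwords are placeholders:

```
# Default SSL settings, applied to every supported protocol:
spark.ssl.enabled              true
spark.ssl.keyStore             /path/to/keystore.jks
spark.ssl.keyStorePassword     keystore-password
spark.ssl.keyPassword          key-password
spark.ssl.trustStore           /path/to/truststore.jks
spark.ssl.trustStorePassword   truststore-password

# Protocol-specific override: keep SSL off for Akka only.
spark.ssl.akka.enabled         false
```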

### YARN mode

The key-store can be prepared on the client side and then distributed to and used by the executors as part of the application. This is possible because the user is able to deploy files before the application is started in YARN, using the `spark.yarn.dist.files` or `spark.yarn.dist.archives` configuration settings. The responsibility for encrypting the transfer of these files lies with YARN and has nothing to do with Spark.

For long-running apps like Spark Streaming apps to be able to write to HDFS, it is possible to pass a principal and keytab to `spark-submit` via the `--principal` and `--keytab` parameters respectively. The keytab passed in will be copied over to the machine running the Application Master via the Hadoop Distributed Cache (securely, if YARN is configured with SSL and HDFS encryption is enabled). The Kerberos login will be periodically renewed using this principal and keytab, and the delegation tokens required for HDFS will be generated periodically, so the application can continue writing to HDFS.
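
For example (a sketch only; the principal, keytab path, and application class/jar are placeholders):

```bash
# Run a long-lived streaming app on a secure YARN cluster. Spark will
# use the principal and keytab to re-login periodically and refresh the
# HDFS delegation tokens.
./bin/spark-submit \
  --master yarn-cluster \
  --principal user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/user.keytab \
  --class com.example.StreamingApp \
  streaming-app.jar
```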

### Standalone mode

The user needs to provide key-stores and configuration options for the master and workers. They have to be set by attaching appropriate Java system properties to the `SPARK_MASTER_OPTS` and `SPARK_WORKER_OPTS` environment variables, or just to `SPARK_DAEMON_JAVA_OPTS`. In this mode, the user may allow the executors to use the SSL settings inherited from the worker which spawned the executor. This can be accomplished by setting `spark.ssl.useNodeLocalConf` to `true`. If that parameter is set, the settings provided by the user on the client side are not used by the executors.
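
A minimal `conf/spark-env.sh` sketch (paths and passwords are placeholders):

```bash
# conf/spark-env.sh
# Pass the same SSL settings to both the master and the workers as
# Java system properties.
SPARK_DAEMON_JAVA_OPTS="-Dspark.ssl.enabled=true \
  -Dspark.ssl.keyStore=/path/to/keystore.jks \
  -Dspark.ssl.keyStorePassword=keystore-password \
  -Dspark.ssl.trustStore=/path/to/truststore.jks \
  -Dspark.ssl.trustStorePassword=truststore-password"
```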

### Preparing the key-stores

Key-stores can be generated with the `keytool` program; see its reference documentation for details. The most basic steps to configure the key-stores and the trust-store for the standalone deployment mode are as follows (a sketch of the corresponding `keytool` invocations appears after the list):

* Generate a key pair for each node
* Export the public key of the key pair to a file on each node
* Import all exported public keys into a single trust-store
* Distribute the trust-store to all the nodes
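
For instance, for a node named `node1` (aliases, file names, and passwords are illustrative):

```bash
# 1. Generate a key pair on the node.
keytool -genkeypair -alias node1 -keyalg RSA -keysize 2048 \
  -keystore node1-keystore.jks -storepass keystore-password

# 2. Export the node's public key (certificate) to a file.
keytool -exportcert -alias node1 -keystore node1-keystore.jks \
  -storepass keystore-password -file node1.cer

# 3. Import every node's certificate into a single trust-store
#    (repeat for node2, node3, ...).
keytool -importcert -alias node1 -file node1.cer \
  -keystore truststore.jks -storepass truststore-password -noprompt

# 4. Distribute truststore.jks to all the nodes.
```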

## Configuring Ports for Network Security

Spark makes heavy use of the network, and some environments have strict requirements for using tight firewall settings. Below are the primary ports that Spark uses for its communication and how to configure those ports.

### Standalone mode only

| From | To | Default Port | Purpose | Configuration Setting | Notes |
|------|----|--------------|---------|-----------------------|-------|
| Browser | Standalone Master | 8080 | Web UI | `spark.master.ui.port` / `SPARK_MASTER_WEBUI_PORT` | Jetty-based. Standalone mode only. |
| Browser | Standalone Worker | 8081 | Web UI | `spark.worker.ui.port` / `SPARK_WORKER_WEBUI_PORT` | Jetty-based. Standalone mode only. |
| Driver / Standalone Worker | Standalone Master | 7077 | Submit job to cluster / Join cluster | `SPARK_MASTER_PORT` | Akka-based. Set to "0" to choose a port randomly. Standalone mode only. |
| Standalone Master | Standalone Worker | (random) | Schedule executors | `SPARK_WORKER_PORT` | Akka-based. Set to "0" to choose a port randomly. Standalone mode only. |

### All cluster managers

| From | To | Default Port | Purpose | Configuration Setting | Notes |
|------|----|--------------|---------|-----------------------|-------|
| Browser | Application | 4040 | Web UI | `spark.ui.port` | Jetty-based |
| Browser | History Server | 18080 | Web UI | `spark.history.ui.port` | Jetty-based |
| Executor / Standalone Master | Driver | (random) | Connect to application / Notify executor state changes | `spark.driver.port` | Akka-based. Set to "0" to choose a port randomly. |
| Driver | Executor | (random) | Schedule tasks | `spark.executor.port` | Akka-based. Set to "0" to choose a port randomly. |
| Executor | Driver | (random) | File server for files and jars | `spark.fileserver.port` | Jetty-based |
| Executor | Driver | (random) | HTTP Broadcast | `spark.broadcast.port` | Jetty-based. Not used by TorrentBroadcast, which sends data through the block manager instead. |
| Executor | Driver | (random) | Class file server | `spark.replClassServer.port` | Jetty-based. Only used in Spark shells. |
| Executor / Driver | Executor / Driver | (random) | Block Manager port | `spark.blockManager.port` | Raw socket via ServerSocketChannel |
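
Where a firewall only admits known ports, the ports listed as "(random)" above can be pinned to fixed values using the configuration settings from the tables. A `conf/spark-defaults.conf` sketch (the port numbers are arbitrary examples, not recommendations):

```
spark.driver.port           7001
spark.fileserver.port       7002
spark.broadcast.port        7003
spark.replClassServer.port  7004
spark.blockManager.port     7005
spark.executor.port         7006
```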

See the [configuration page](configuration.html) for more details on the security configuration parameters, and `org.apache.spark.SecurityManager` for implementation details about security.