spark-instrumented-optimizer/sbin
Xingbo Jiang ef1622899f [SPARK-20989][CORE] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode
## What changes were proposed in this pull request?

In standalone mode, if we enable the external shuffle service by setting `spark.shuffle.service.enabled` to true and then try to start multiple workers on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and then running `sbin/start-slaves.sh`), only one worker launches successfully on each host and the rest fail to launch.
The reason is that the port of the external shuffle service is configured by `spark.shuffle.service.port`, so currently no more than one external shuffle service can be started on each host. In our case, each worker tries to start an external shuffle service, and only one of them succeeds.

We should report an explicit reason for the failure instead of failing silently.
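
The rule this change enforces can be illustrated with a minimal shell sketch. According to the stack trace in the test output, the actual check is a Scala `require` reached from `Worker.main`; the function below, `check_shuffle_conflict`, is a hypothetical stand-in that only mirrors the condition described in the error message, and its name and arguments are assumptions made for illustration.

```shell
# Hypothetical helper, not part of Spark: mirrors the condition described above.
# Usage: check_shuffle_conflict <worker_instances> <shuffle_service_enabled>
check_shuffle_conflict() {
  csc_instances="${1:-1}"
  csc_shuffle_enabled="${2:-false}"
  # Only one external shuffle service can bind the configured port per host,
  # so more than one worker with the service enabled cannot work.
  if [ "$csc_shuffle_enabled" = "true" ] && [ "$csc_instances" -gt 1 ]; then
    echo "Cannot start $csc_instances workers on one host: at most one external shuffle service can run per host." >&2
    echo "Set spark.shuffle.service.enabled to false or SPARK_WORKER_INSTANCES to 1." >&2
    return 1
  fi
  return 0
}

# Example: three workers with the shuffle service enabled are rejected.
check_shuffle_conflict 3 true || echo "conflict detected"
```

Failing before any worker starts, with a message that names both settings, is what turns the silent failure into an actionable one.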

## How was this patch tested?
Manually tested with the following steps:
1. Set `SPARK_WORKER_INSTANCES=3` in `conf/spark-env.sh`;
2. Set `spark.shuffle.service.enabled` to `true` in `conf/spark-defaults.conf`;
3. Run `sbin/start-all.sh`.
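
The reproduction setup can be sketched as a few shell commands. This is a minimal sketch, assuming three worker instances as in the description above; `SPARK_HOME_DEMO` is a hypothetical scratch directory standing in for a real Spark installation, so nothing here touches an existing configuration.

```shell
# Hypothetical scratch directory standing in for a Spark installation.
SPARK_HOME_DEMO="$(mktemp -d)"
mkdir -p "$SPARK_HOME_DEMO/conf"

# Step 1: request more than one worker per host.
echo 'SPARK_WORKER_INSTANCES=3' >> "$SPARK_HOME_DEMO/conf/spark-env.sh"

# Step 2: enable the external shuffle service.
echo 'spark.shuffle.service.enabled true' >> "$SPARK_HOME_DEMO/conf/spark-defaults.conf"

# Step 3, in a real installation, would be to run sbin/start-all.sh
# from the Spark home directory.
```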

Before the change, no error appears on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
However, the web UI shows that only one worker is running.

After the change, you get explicit error messages on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:54 INFO Utils: Successfully started service 'sparkWorker' on port 63354.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:56 INFO Utils: Successfully started service 'sparkWorker' on port 63359.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls to: xxx
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls groups to:
localhost:   17/06/13 23:24:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
localhost:   17/06/13 23:24:59 INFO Utils: Successfully started service 'sparkWorker' on port 63360.
localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
localhost:   	at scala.Predef$.require(Predef.scala:224)
localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
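
The reporting pattern visible in this output (a `failed to launch:` line, an indented excerpt of the log, then a `full log in` pointer) can be sketched in shell as follows. This is an illustration under stated assumptions, not the code from this change: `launch_and_verify` is a hypothetical name, and the one-second grace period and ten-line excerpt are arbitrary values chosen for the sketch.

```shell
# Hypothetical helper, not taken from the Spark scripts.
# Usage: launch_and_verify <logfile> <command> [args...]
# Starts the command in the background, waits briefly, and reports the
# failure with an excerpt of the log if the process has already exited.
launch_and_verify() {
  lav_log="$1"
  shift
  "$@" >> "$lav_log" 2>&1 < /dev/null &
  lav_pid=$!

  # Grace period before checking liveness (arbitrary for this sketch).
  sleep 1

  # kill -0 sends no signal; it only tests whether the process still exists.
  if ! kill -0 "$lav_pid" 2>/dev/null; then
    echo "failed to launch: $*"
    tail -n 10 "$lav_log" | sed 's/^/  /'
    echo "full log in $lav_log"
    return 1
  fi

  # Still running: print the PID so the caller can track the process.
  echo "$lav_pid"
}

# Example call (commented out so that sourcing this file has no side effects):
# launch_and_verify "$(mktemp)" sleep 30
```

Printing the tail of the log next to the failed command saves the user from opening each worker's log file to find out why it exited.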

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18290 from jiangxb1987/start-slave.
2017-06-20 17:17:21 +08:00

| File | Last commit | Date |
| --- | --- | --- |
| slaves.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| spark-config.sh | [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] | 2016-10-21 09:48:24 +01:00 |
| spark-daemon.sh | [SPARK-20989][CORE] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode | 2017-06-20 17:17:21 +08:00 |
| spark-daemons.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| start-all.sh | [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts | 2016-02-26 22:35:12 -08:00 |
| start-history-server.sh | [SPARK-19083] sbin/start-history-server.sh script use of $@ without quotes | 2017-01-06 09:57:49 -08:00 |
| start-master.sh | [SPARK-17944][DEPLOY] sbin/start-* scripts use of hostname -f fail with Solaris | 2016-10-22 09:37:53 +01:00 |
| start-mesos-dispatcher.sh | [SPARK-17944][DEPLOY] sbin/start-* scripts use of hostname -f fail with Solaris | 2016-10-22 09:37:53 +01:00 |
| start-mesos-shuffle-service.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| start-shuffle-service.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| start-slave.sh | [SPARK-11218][CORE] show help messages for start-slave and start-master | 2015-11-09 13:22:05 +01:00 |
| start-slaves.sh | [SPARK-17944][DEPLOY] sbin/start-* scripts use of hostname -f fail with Solaris | 2016-10-22 09:37:53 +01:00 |
| start-thriftserver.sh | [SPARK-17598][SQL][WEB UI] User-friendly name for Spark Thrift Server in web UI | 2016-10-03 10:24:30 +01:00 |
| stop-all.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| stop-history-server.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| stop-master.sh | [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts | 2016-02-26 22:35:12 -08:00 |
| stop-mesos-dispatcher.sh | [SPARK-13414][MESOS] Allow multiple dispatchers to be launched. | 2016-02-20 12:58:47 -08:00 |
| stop-mesos-shuffle-service.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| stop-shuffle-service.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| stop-slave.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |
| stop-slaves.sh | [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts | 2016-02-26 22:35:12 -08:00 |
| stop-thriftserver.sh | [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen) | 2015-11-04 10:49:34 +00:00 |