## What changes were proposed in this pull request?
The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document.
… network timeout]
Author: Jagadeesan <as2@us.ibm.com>
Closes #15042 from jagadeesanas2/SPARK-17449.
## What changes were proposed in this pull request?
Streaming doc correction.
Author: Satendra Kumar <satendra@knoldus.com>
Closes #14996 from satendrakumar06/patch-1.
## What changes were proposed in this pull request?
This pull request adds the ability to access the worker and application UIs through the master UI itself. This helps when accessing the Spark UI while running a Spark cluster in a closed network, e.g. Kubernetes. The cluster admin needs to expose only the Spark master UI; the rest of the UIs can stay in the private network, and the master UI reverse-proxies each connection request to the corresponding resource. It adds the following paths for the worker and application UIs:
WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/
This makes it easy for users to protect access to the Spark master cluster by putting a reverse proxy in front of it, e.g. https://github.com/bitly/oauth2_proxy
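The path layout described above can be sketched as a small helper. This is only an illustration: the function name `proxied_ui_url` and the sample host, port and IDs are hypothetical, and only the `/target/<id>/` layout is taken from the description.

```python
def proxied_ui_url(scheme: str, master_host: str, port: int, target_id: str) -> str:
    """Build the master-relative address of a worker or application UI.

    Hypothetical helper: it mirrors the documented layout
    <http/https>://master-publicIP:<port>/target/<workerID or appID>/
    """
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {scheme}")
    return f"{scheme}://{master_host}:{port}/target/{target_id}/"


# Placeholder values, in the same spirit as the description above.
print(proxied_ui_url("http", "master-publicIP", 8080, "workerID"))
print(proxied_ui_url("https", "master-publicIP", 8443, "appID"))
```

With this layout, only the master address has to be reachable from outside the private network.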
## How was this patch tested?
The functionality has been tested manually, and there is also a unit test for access to the worker UI through the reverse proxy address.
pwendell bomeng BryanCutler can you please review it, thanks.
Author: Gurvinder Singh <gurvinder.singh@uninett.no>
Closes #13950 from gurvindersingh/rproxy.
After change [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update the docs by adding a shuffle service metrics entry to the list of currently supported metrics.
Author: Yangyang Liu <yangyangliu@fb.com>
Closes #14254 from lovexi/yangyang-monitoring-doc.
## What changes were proposed in this pull request?
Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()
## How was this patch tested?
Jenkins
Author: Sean Owen <sowen@cloudera.com>
Closes #14895 from srowen/SPARK-17331.
## What changes were proposed in this pull request?
Allow the user to set the SparkR shell command through `--conf spark.r.shell.command`
## How was this patch tested?
A unit test is added, and the change was also verified manually with
```
bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
```
Author: Jeff Zhang <zjffdu@apache.org>
Closes #14744 from zjffdu/SPARK-17178.
## What changes were proposed in this pull request?
With the new History Server, the summary page loads the application list via the REST API, which makes it very slow or even impossible to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note that this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
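The effect of an optional `limit` parameter can be sketched as follows. This is a minimal model, not the History Server code: the function name `list_applications` is hypothetical, and keeping the first entries in listing order is an assumption.

```python
from typing import List, Optional


def list_applications(apps: List[str], limit: Optional[int] = None) -> List[str]:
    """Return the applications a summary page would show.

    Hypothetical model: when `limit` is omitted every application is
    returned; otherwise only the first `limit` entries are kept
    (which entries are kept is an assumption of this sketch).
    """
    if limit is None:
        return list(apps)
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return apps[:limit]


history = [f"app-{i}" for i in range(10)]
print(len(list_applications(history)))           # no limit: all entries
print(list_applications(history, limit=3))       # only the first three
```

Applications beyond the limit are only hidden from the listing; nothing in the sketch removes them from `history`, matching the note that their UIs stay reachable by App ID.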
## How was this patch tested?
Manual testing and dev/run-tests
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #14835 from ajbozarth/spark17243.
This patch uses the Apache Commons Crypto library to enable shuffle encryption support.
Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>
Closes #8880 from winningsix/SPARK-10771.
## What changes were proposed in this pull request?
Fix minor typos in the Python example code in the streaming programming guide
## How was this patch tested?
N/A
Author: Dmitriy Sokolov <silentsokolov@gmail.com>
Closes #14805 from silentsokolov/fix-typos.
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
## How was this patch tested?
Jenkins tests, including new cases to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes #14663 from srowen/SPARK-17001.
## What changes were proposed in this pull request?
Move Mesos code into a mvn module
## How was this patch tested?
unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #14637 from mgummelt/mesos-module.
## What changes were proposed in this pull request?
Updated links of external dstream projects.
## How was this patch tested?
Just document changes.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #14814 from zsxwing/dstream-link.
## What changes were proposed in this pull request?
Based on #12990 by tankkyo
Since the History Server currently loads all applications' data, it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000)
(This is a "quick fix" to help those running into the problem until an update of how the history server loads app data can be done)
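How the two settings interact can be sketched like this. It is a rough model only: the function name `trim_tasks` is hypothetical, and dropping the oldest entries first is an assumption rather than something the description states.

```python
from typing import List


def trim_tasks(tasks: List[dict], trim_enabled: bool = False,
               retained_tasks: int = 10000) -> List[dict]:
    """Model of `spark.ui.trimTasks` / `spark.ui.retainedTasks`.

    Assumption of this sketch: when trimming is enabled, the most
    recent `retained_tasks` entries are kept and older ones dropped.
    """
    if retained_tasks < 0:
        raise ValueError("retained_tasks must be non-negative")
    if not trim_enabled or len(tasks) <= retained_tasks:
        return list(tasks)
    # Slice from an explicit start index so retained_tasks == 0 yields [].
    return list(tasks[len(tasks) - retained_tasks:])


tasks = [{"taskId": i} for i in range(25000)]
print(len(trim_tasks(tasks)))                      # trimming off: nothing dropped
print(len(trim_tasks(tasks, trim_enabled=True)))   # trimming on: capped at the default
```

Because trimming is off by default, the sketch returns the full list unless the flag is set, matching the stated defaults.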
## How was this patch tested?
Manual testing and dev/run-tests
![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png)
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #14673 from ajbozarth/spark15083.
## What changes were proposed in this pull request?
Collect the GC discussion in one section, and document findings about G1 GC heap region size.
## How was this patch tested?
Jekyll doc build
Author: Sean Owen <sowen@cloudera.com>
Closes #14732 from srowen/SPARK-16320.
## What changes were proposed in this pull request?
This is the documentation for the previously added JDBC writer options.
## How was this patch tested?
A unit test was added in the previous PR.
Author: GraceH <jhuang1@paypal.com>
Closes #14683 from GraceH/jdbc_options.
## What changes were proposed in this pull request?
When `spark.ssl.enabled` is true but `spark.ssl.protocol` is not set, startup fails and throws a meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled` is set.
Improvement: require `spark.ssl.protocol` when initializing the SSLContext; otherwise throw an exception that says so.
Remove the OrElse("default").
Document this requirement in configure.md
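The requirement can be illustrated with a small validation sketch. The function name `validate_ssl_conf` and the dictionary-based configuration are assumptions made for illustration; the error text is copied from the exception shown in the manual test.

```python
def validate_ssl_conf(conf: dict) -> None:
    """Fail fast when SSL is enabled without a protocol.

    Hypothetical stand-in for the check described above; `conf` is a
    plain dict of string settings, not a real SparkConf.
    """
    enabled = conf.get("spark.ssl.enabled", "false").lower() == "true"
    if enabled and not conf.get("spark.ssl.protocol"):
        raise ValueError(
            "spark.ssl.protocol is required when enabling SSL connections."
        )


# SSL disabled: nothing to check.
validate_ssl_conf({})

# SSL enabled with a protocol: accepted.
validate_ssl_conf({"spark.ssl.enabled": "true", "spark.ssl.protocol": "TLSv1.2"})

# SSL enabled without a protocol: rejected with a descriptive message.
try:
    validate_ssl_conf({"spark.ssl.enabled": "true"})
except ValueError as err:
    print(err)
```

Raising a descriptive error at configuration time replaces the silent fallback to a default value.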
## How was this patch tested?
Manual tests:
Build document and check document
Configure `spark.ssl.enabled` only, it throws exception below:
6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mwang); groups with view permissions: Set(); users with modify permissions: Set(mwang); groups with modify permissions: Set()
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285)
at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026)
at org.apache.spark.deploy.master.Master$.main(Master.scala:1011)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Configure both `spark.ssl.enabled` and `spark.ssl.protocol`:
It works fine.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #14674 from wangmiao1981/ssl.
## What changes were proposed in this pull request?
- adds documentation for https://issues.apache.org/jira/browse/SPARK-11714
## How was this patch tested?
Doc change; no test needed.
Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Closes #14667 from skonto/add_doc.
## What changes were proposed in this pull request?
Remove the API doc link for the mapReduceTriplets operator. It has been removed from the latest API, so users who follow the link will not find mapReduceTriplets there; removing the link is better than confusing them.
## How was this patch tested?
Run all the test cases
![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png)
Author: sandy <phalodi@gmail.com>
Closes #14669 from phalodi/SPARK-17089.
## What changes were proposed in this pull request?
Because the README.md file is updated over time, some code snippet outputs are no longer correct for the new README.md file. For example:
```
scala> textFile.count()
res0: Long = 126
```
should be
```
scala> textFile.count()
res0: Long = 99
```
This PR adds comments to point out this problem so that new Spark learners have a correct reference.
It also fixes a small bug: in the current documentation, the outputs of linesWithSpark.count() without and with cache are different (one is 15 and the other is 19)
```
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
...
scala> linesWithSpark.cache()
res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27
scala> linesWithSpark.count()
res8: Long = 19
```
## How was this patch tested?
manual test: run `$ SKIP_API=1 jekyll serve --watch`
Author: linbojin <linbojin203@gmail.com>
Closes #14645 from linbojin/quick-start-documentation.
## What changes were proposed in this pull request?
When documentation is built, it should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head, which may not be valid for the current release. Changed the URLs to point to the right tag in git using `SPARK_VERSION_SHORT`.
…from its own release version] [Streaming programming guide]
Author: Jagadeesan <as2@us.ibm.com>
Closes #14596 from jagadeesanas2/SPARK-12370.
## What changes were proposed in this pull request?
The configuration doc is missing the config option `spark.ui.enabled` (default value is `true`).
I think this option is important because in many cases we would like to turn it off,
so I added it.
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #14604 from WeichenXu123/add_doc_param_spark_ui_enabled.
Before this PR, users had to export an environment variable to specify the Python of the driver and executors, which is not convenient. This PR allows users to specify Python through the configurations "--pyspark-driver-python" and "--pyspark-executor-python".
Manually tested in local and yarn mode for pyspark-shell and pyspark batch mode.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #13146 from zjffdu/SPARK-13081.
## What changes were proposed in this pull request?
Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.
This PR fixes three things below:
- Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.
- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.
- Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match the Scala and Java examples to the Python one (the Python one loads it as `DataFrame` initially).
## How was this patch tested?
N/A
Closes https://github.com/apache/spark/pull/14491
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
Closes #14564 from HyukjinKwon/SPARK-16886.
Docs adjustment to:
- link to other relevant section of docs
- correct a statement that described one value as the only supported value when other values are actually supported
Author: Andrew Ash <andrew@andrewash.com>
Closes #14581 from ash211/patch-10.
## What changes were proposed in this pull request?
Change the remaining percentage to the correct value.
## How was this patch tested?
Manual review
Author: Tao Wang <wangtao111@huawei.com>
Closes #14591 from WangTaoTheTonic/patch-1.
## What changes were proposed in this pull request?
Add a configurable token manager for Spark running on YARN.
### Current Problems ###
1. The supported token providers are hard-coded; currently only hdfs, hbase and hive are supported, and it is impossible for a user to add a new token provider without code changes.
2. This problem also exists in the timely token renewer and updater.
### Changes In This Proposal ###
This proposal addresses the problems mentioned above and makes the current code cleaner and easier to understand. It has 3 main changes:
1. Abstract a `ServiceTokenProvider` as well as a `ServiceTokenRenewable` interface for token providers. Each service that wants to communicate with Spark through tokens needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on.
3. Implement 3 built-in token providers, `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider`, to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default, all the built-in token providers are loaded.
### Behavior Changes ###
For the end user there's no behavior change: we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).
A user-implemented token provider (assume the name of the token provider is "test") that needs to be added to this class should have two configurations:
1. `spark.yarn.security.tokens.test.enabled` set to true
2. `spark.yarn.security.tokens.test.class` set to the fully qualified class name.
So we still keep the same semantics as the current code while adding one new configuration.
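The configuration rules above can be modelled with a short sketch. Everything here is illustrative: `enabled_providers` and the dict-based configuration are hypothetical, `com.example.TestTokenProvider` is a made-up class name, and discovering custom providers through their `.class` key is an assumption of the sketch.

```python
PREFIX = "spark.yarn.security.tokens."
BUILT_IN = ("hdfs", "hive", "hbase")  # loaded by default per the description


def enabled_providers(conf: dict) -> dict:
    """Map each enabled service to its configured class name.

    Built-in services map to None because they need no `.class` entry.
    Custom services are discovered from `<prefix><service>.class` keys
    and are only enabled when `<prefix><service>.enabled` is true.
    """
    services = set(BUILT_IN)
    for key in conf:
        if key.startswith(PREFIX) and key.endswith(".class"):
            services.add(key[len(PREFIX):-len(".class")])

    result = {}
    for service in sorted(services):
        default = "true" if service in BUILT_IN else "false"
        if conf.get(f"{PREFIX}{service}.enabled", default).lower() == "true":
            result[service] = conf.get(f"{PREFIX}{service}.class")
    return result


conf = {
    "spark.yarn.security.tokens.hive.enabled": "false",
    "spark.yarn.security.tokens.test.enabled": "true",
    "spark.yarn.security.tokens.test.class": "com.example.TestTokenProvider",
}
print(enabled_providers(conf))
```

In this model an end user who sets nothing still gets the three built-in providers, which is the "no behavior change" property described above.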
### Current Status ###
- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.
## How was this patch tested?
Unit test and integrated test.
Please suggest and review, any comment is greatly appreciated.
Author: jerryshao <sshao@hortonworks.com>
Closes #14065 from jerryshao/SPARK-16342.
## What changes were proposed in this pull request?
- enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927]
- remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR)
## How was this patch tested?
mesos/spark integration test suite
manual testing
Author: Timothy Chen <tnachen@gmail.com>
Closes #14511 from mgummelt/override-props.
## What changes were proposed in this pull request?
This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.
**Background:** This application-killing was added in 6b5980da79 (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.
**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.
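The decision described above can be written down as a small predicate. This is a sketch under stated assumptions, not the Master's code: the function name `should_kill_application` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption, since the description only says "more than".

```python
def should_kill_application(retry_count: int, running_executors: int,
                            max_executor_retries: int = 10) -> bool:
    """Model of the `spark.deploy.maxExecutorRetries` behavior.

    - A negative limit disables application killing entirely.
    - Applications that still have running executors are never killed.
    - Otherwise the application is killed once the failure count
      reaches the limit (inclusive comparison is an assumption).
    """
    if max_executor_retries < 0:
        return False
    if running_executors > 0:
        return False
    return retry_count >= max_executor_retries


# All executors died repeatedly: killed with the old hardcoded limit of 10.
print(should_kill_application(retry_count=10, running_executors=0))

# Same failures, but the cluster is configured with a negative limit.
print(should_kill_application(retry_count=10, running_executors=0,
                              max_executor_retries=-1))
```

The second call models the spot-instance scenario: the driver survives, so the application is kept alive until new executors become available.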
I'd like to merge this patch into master, branch-2.0, and branch-1.6.
## How was this patch tested?
I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
## What changes were proposed in this pull request?
Links the Spark Mesos Dispatcher UI to the history server UI
- adds spark.mesos.dispatcher.historyServer.url
- explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs
## How was this patch tested?
manual testing
Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Sergiusz Urbaniak <sur@mesosphere.io>
Closes #14414 from mgummelt/history-server.
## What changes were proposed in this pull request?
Fix the broken links in the programming guide of the Graphx Migration and understanding closures
## How was this patch tested?
By running the test cases and checking the links.
Author: Shivansh <shiv4nsh@gmail.com>
Closes #14503 from shiv4nsh/SPARK-16911.
## What changes were proposed in this pull request?
The default value of `spark.sql.broadcastTimeout` is 300s, and this property does not appear in any of the Spark docs. This adds "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
## How was this patch tested?
Not needed.
…ide.md
JIRA_ID:SPARK-16870
Description: the default value of spark.sql.broadcastTimeout is 300s, and this property does not appear in any of the Spark docs. This adds "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
Test:done
Author: keliang <keliang@cmss.chinamobile.com>
Closes #14477 from biglobster/keliang.
## What changes were proposed in this pull request?
In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing. This is not necessary for Scala, so all references to the old API are removed. For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet. Python has not currently implemented the new API.
## How was this patch tested?
built doc locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration
## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.
Author: cody koeninger <cody@koeninger.org>
Closes #14385 from koeninger/SPARK-16312.
## What changes were proposed in this pull request?
Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch.
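The caching idea can be illustrated with a minimal sketch. The class name `IndexCache`, the loader callback and the file path are all hypothetical; the cache here is unbounded for brevity, whereas a real service would presumably need a size limit and eviction.

```python
from typing import Callable, Dict, List


class IndexCache:
    """Remember parsed index information per index file.

    Hypothetical model of the change: the expensive `load_index`
    callback (standing in for opening and reading an index file)
    runs once per file instead of once per block fetch.
    """

    def __init__(self, load_index: Callable[[str], List[int]]) -> None:
        self._load_index = load_index
        self._cache: Dict[str, List[int]] = {}

    def get(self, index_path: str) -> List[int]:
        if index_path not in self._cache:
            self._cache[index_path] = self._load_index(index_path)
        return self._cache[index_path]


file_reads = []

def fake_load_index(path: str) -> List[int]:
    # Stand-in for reading block offsets from disk; records each access.
    file_reads.append(path)
    return [0, 128, 512]

cache = IndexCache(fake_load_index)
for _ in range(1000):               # 1000 block fetches against one index file
    cache.get("shuffle_0_0.index")  # placeholder file name
print(len(file_reads))              # number of times the file was actually read
```

Each fetch after the first is served from memory, which is where the reported reduction in shuffle read time would come from.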
## How was this patch tested?
Tested by running a job on the cluster and the shuffle read time was reduced by 50%.
Author: Sital Kedia <skedia@fb.com>
Closes #12944 from sitalkedia/shuffle_service.
## What changes were proposed in this pull request?
This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.
## How was this patch tested?
Manually tested.
Author: Cheng Lian <lian@databricks.com>
Closes #14368 from liancheng/revise-examples.
## What changes were proposed in this pull request?
Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.
## How was this patch tested?
None
Author: Sun Dapeng <sdp@apache.org>
Closes #14386 from sundapeng/doclink.
## What changes were proposed in this pull request?
New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}
This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/
The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.
I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.
This is blocked on: https://github.com/apache/spark/pull/14167
## How was this patch tested?
- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #14275 from mgummelt/unified-containerizer.
## What changes were proposed in this pull request?
Added a missing keyword to the Java example
## How was this patch tested?
wasn't
Author: Bartek Wiśniewski <wedi@Ava.local>
Closes #14381 from wedi-dev/quickfix/missing_keyword.
## What changes were proposed in this pull request?
Adding a new property to SparkConf called spark.metrics.namespace that allows users to
set a custom namespace for executor and driver metrics in the metrics systems.
By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, often times, users want to be able to track the metrics
across apps for driver and executor metrics, which is hard to do with application ID
(i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
users can set the `spark.metrics.namespace` property to another spark configuration key like
`spark.app.name` which is then used to populate the root namespace of the metrics system
(with the app name in our example). `spark.metrics.namespace` property can be set to any
arbitrary spark property key, whose value would be used to set the root namespace of the
metrics system. Non driver and executor metrics are never prefixed with `spark.app.id`, nor
does the `spark.metrics.namespace` property have any such effect on such metrics.
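The lookup described above can be sketched as follows. This models the prose only: the function name `root_namespace` and the dict-based configuration are hypothetical, and the exact syntax Spark expects for the property value may differ from the bare key used here.

```python
from typing import Optional


def root_namespace(conf: dict) -> Optional[str]:
    """Resolve the root namespace for driver and executor metrics.

    Sketch of the described behavior: without `spark.metrics.namespace`
    the application ID is used; otherwise the property names another
    configuration key whose value becomes the namespace. Returning
    None when that key is unset is an assumption of this sketch.
    """
    key = conf.get("spark.metrics.namespace")
    if key is None:
        return conf.get("spark.app.id")
    return conf.get(key)


conf = {"spark.app.id": "app-20160801-0001", "spark.app.name": "nightly-etl"}
print(root_namespace(conf))   # falls back to the application ID

conf["spark.metrics.namespace"] = "spark.app.name"
print(root_namespace(conf))   # stable across invocations of the same app
```

Using the app name keeps metric paths identical between runs, which is what makes tracking across invocations possible.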
## How was this patch tested?
Added new unit tests, modified existing unit tests.
Author: Mark Grover <mark@apache.org>
Closes #14270 from markgrover/spark-5847.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images which are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`). Setting this flag to true will tell the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes #14348 from philipphoffmann/force-pull-image.
## What changes were proposed in this pull request?
Minor doc fix regarding the spark.speculation.quantile configuration parameter. It incorrectly states it should be a percentage, when it should be a fraction.
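The difference between a fraction and a percentage matters as soon as the value is multiplied by a task count. The sketch below is illustrative only: the function name is hypothetical, 0.75 is used as the default quantile, and rounding down is an assumption about how the threshold is derived.

```python
import math


def min_finished_for_speculation(num_tasks: int, quantile: float = 0.75) -> int:
    """Number of tasks that must finish before speculation is considered.

    `quantile` is a fraction in [0, 1]; passing a percentage such as 75
    is rejected instead of silently producing a nonsensical threshold.
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be a fraction between 0 and 1")
    return math.floor(quantile * num_tasks)


print(min_finished_for_speculation(100))        # 75 of 100 tasks
print(min_finished_for_speculation(100, 0.5))   # 50 of 100 tasks
```

Reading the setting as a percentage would ask for 7500 finished tasks out of 100, so speculation could never start.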
## How was this patch tested?
I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected.
Author: Nicholas Brown <nbrown@adroitdigital.com>
Closes #14352 from nwbvt/master.