Commit graph

2027 commits

Author SHA1 Message Date
Joachim Hereth 01f183e849 Mesos doc fixes
## What changes were proposed in this pull request?

Some link fixes for the documentation [Running Spark on Mesos](https://spark.apache.org/docs/latest/running-on-mesos.html):

* Updated Link to Mesos Frameworks (Projects built on top of Mesos)
* Update Link to Mesos binaries from Mesosphere (former link was redirected to dcos install page)

## How was this patch tested?

Documentation was built and changed page manually/visually inspected.

No code was changed, hence no dev tests.

Since these changes are rather trivial I did not open a new JIRA ticket.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Joachim Hereth <joachim.hereth@numberfour.eu>

Closes #18564 from daten-kieker/mesos_doc_fixes.
2017-07-08 08:32:45 +01:00
Prashant Sharma d0bfc67335 [SPARK-21069][SS][DOCS] Add rate source to programming guide.
## What changes were proposed in this pull request?

SPARK-20979 added a new structured streaming source: Rate source. This patch adds the corresponding documentation to programming guide.

## How was this patch tested?

Tested by running jekyll locally.

Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.
2017-07-07 23:33:12 -07:00
Tathagata Das 0217dfd26f [SPARK-21267][SS][DOCS] Update Structured Streaming Documentation
## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.
2017-07-06 17:28:20 -07:00
jerryshao 5800144a54 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark
Current "--jars (spark.jars)", "--files (spark.files)", "--py-files (spark.submit.pyFiles)" and "--archives (spark.yarn.dist.archives)" only support non-glob path. This is OK for most of the cases, but when user requires to add more jars, files into Spark, it is too verbose to list one by one. So here propose to add glob path support for resources.

Also improving the code of downloading resources.

## How was this patch tested?

UT added, also verified manually in local cluster.

Author: jerryshao <sshao@hortonworks.com>

Closes #18235 from jerryshao/SPARK-21012.
2017-07-06 15:32:49 +08:00
sadikovi 960298ee66 [SPARK-20858][DOC][MINOR] Document ListenerBus event queue size
## What changes were proposed in this pull request?

This change adds a new configuration option `spark.scheduler.listenerbus.eventqueue.size` to the configuration docs to specify the capacity of the spark listener bus event queue. Default value is 10000.

This is doc PR for [SPARK-15703](https://issues.apache.org/jira/browse/SPARK-15703).

I added option to the `Scheduling` section, however it might be more related to `Spark UI` section.

## How was this patch tested?

Manually verified correct rendering of configuration option.

Author: sadikovi <ivan.sadikov@lincolnuni.ac.nz>
Author: Ivan Sadikov <ivan.sadikov@team.telstra.com>

Closes #18476 from sadikovi/SPARK-20858.
2017-07-05 14:40:44 +01:00
Shixiong Zhu 80f7ac3a60 [SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem
## What changes were proposed in this pull request?

Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service.

Credits to wangyum

Closes #18466

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Yuming Wang <wgyumg@gmail.com>

Closes #18467 from zsxwing/SPARK-21253.
2017-06-30 11:02:22 +08:00
jerryshao 9e50a1d37a [SPARK-13669][SPARK-20898][CORE] Improve the blacklist mechanism to handle external shuffle service unavailable situation
## What changes were proposed in this pull request?

Currently we are running into an issue with Yarn work preserving enabled + external shuffle service.
In the work preserving enabled scenario, the failure of NM will not lead to the exit of executors, so executors can still accept and run the tasks. The problem here is when NM is failed, external shuffle service is actually inaccessible, so reduce tasks will always complain about the “Fetch failure”, and the failure of reduce stage will make the parent stage (map stage) rerun. The tricky thing here is Spark scheduler is not aware of the unavailability of external shuffle service, and will reschedule the map tasks on the executor where NM is failed, and again reduce stage will be failed with “Fetch failure”, and after 4 retries, the job is failed. This could also apply to other cluster manager with external shuffle service.

So here the main problem is that we should avoid assigning tasks to those bad executors (where shuffle service is unavailable). Current Spark's blacklist mechanism could blacklist executors/nodes by failure tasks, but it doesn't handle this specific fetch failure scenario. So here propose to improve the current application blacklist mechanism to handle fetch failure issue (especially with external shuffle service unavailable issue), to blacklist the executors/nodes where shuffle fetch is unavailable.

## How was this patch tested?

Unit test and small cluster verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #17113 from jerryshao/SPARK-13669.
2017-06-26 11:14:03 -05:00
Yuming Wang 987eb8fadd [MINOR][DOCS] Add lost <tr> tag for configuration.md
## What changes were proposed in this pull request?

Add lost `<tr>` tag for `configuration.md`.

## How was this patch tested?
N/A

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18372 from wangyum/docs-missing-tr.
2017-06-21 15:30:31 +01:00
Li Yichao d107b3b910 [SPARK-20640][CORE] Make rpc timeout and retry for shuffle registration configurable.
## What changes were proposed in this pull request?

Currently the shuffle service registration timeout and retry has been hardcoded. This works well for small workloads but under heavy workload when the shuffle service is busy transferring large amount of data we see significant delay in responding to the registration request, as a result we often see the executors fail to register with the shuffle service, eventually failing the job. We need to make these two parameters configurable.

## How was this patch tested?

* Updated `BlockManagerSuite` to test registration timeout and max attempts configuration actually works.

cc sitalkedia

Author: Li Yichao <lyc@zhihu.com>

Closes #18092 from liyichao/SPARK-20640.
2017-06-21 21:54:29 +08:00
assafmendelson 66a792cd88 [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table
## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18342 from assafmendelson/spark-21123.
2017-06-19 10:58:58 -07:00
liuzhaokun 0d8604bb84 [SPARK-21126] The configuration which named "spark.core.connection.auth.wait.timeout" hasn't been used in spark
[https://issues.apache.org/jira/browse/SPARK-21126](https://issues.apache.org/jira/browse/SPARK-21126)
The configuration which named "spark.core.connection.auth.wait.timeout" hasn't been used in spark,so I think it should be removed from configuration.md.

Author: liuzhaokun <liu.zhaokun@zte.com.cn>

Closes #18333 from liu-zhaokun/new3.
2017-06-18 08:32:29 +01:00
Yuming Wang 45824fb608 [MINOR][DOCS] Improve Running R Tests docs
## What changes were proposed in this pull request?

Update Running R Tests dependence packages to:
```bash
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 'survival'), repos='http://cran.us.r-project.org')"
```

## How was this patch tested?
manual tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #18271 from wangyum/building-spark.
2017-06-16 11:03:54 +01:00
Michael Gummelt a18d637112 [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core
## What changes were proposed in this pull request?

Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos), may use it.  In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  In order to provider backward compatiblity, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.

Summary:
- Move registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`.  Moved them into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`), instead of service loading, which means that users are not able to implement their own delegation token providers, as they are in the `spark-yarn` module.

- The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token provider implementations.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager`, as well as service loads them from `yarn.security.ServiceCredentialProvider`.

Old Hierarchy:

```
yarn.security.ServiceCredentialProvider (service loaded)
  HadoopFSCredentialProvider
  HiveCredentialProvider
  HBaseCredentialProvider
yarn.security.ConfigurableCredentialManager
```

New Hierarchy:

```
HadoopDelegationTokenManager
HadoopDelegationTokenProvider (not service loaded)
  HadoopFSDelegationTokenProvider
  HiveDelegationTokenProvider
  HBaseDelegationTokenProvider

yarn.security.ServiceCredentialProvider (service loaded)
yarn.security.YARNHadoopDelegationTokenManager
```
## How was this patch tested?

unit tests

Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Dr. Stefan Schimanski <sttts@mesosphere.io>

Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
2017-06-15 11:46:00 -07:00
Felix Cheung 1bf55e396c [SPARK-20980][DOCS] update doc to reflect multiLine change
## What changes were proposed in this pull request?

doc only change

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #18312 from felixcheung/sqljsonwholefiledoc.
2017-06-14 23:08:05 -07:00
Ziyue Huang e6eb02df15 [DOCS] Fix error: ambiguous reference to overloaded definition
## What changes were proposed in this pull request?

`df.groupBy.count()` should be `df.groupBy().count()` , otherwise there is an error :

ambiguous reference to overloaded definition, both method groupBy in class Dataset of type (col1: String, cols: String*) and method groupBy in class Dataset of type (cols: org.apache.spark.sql.Column*)

## How was this patch tested?

```scala
val df = spark.readStream.schema(...).json(...)
val dfCounts = df.groupBy().count()
```

Author: Ziyue Huang <zyhuang94@gmail.com>

Closes #18272 from ZiyueHuang/master.
2017-06-12 10:59:33 +01:00
Michael Gummelt 8da3f7041a [SPARK-21000][MESOS] Add Mesos labels support to the Spark Dispatcher
## What changes were proposed in this pull request?

Add Mesos labels support to the Spark Dispatcher

## How was this patch tested?

unit tests

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #18220 from mgummelt/SPARK-21000-dispatcher-labels.
2017-06-11 09:49:39 +01:00
Corey Woodfield 033839559e Fixed broken link
## What changes were proposed in this pull request?

I fixed some incorrect formatting on a link in the docs

## How was this patch tested?

I looked at the markdown preview before and after, and the link was fixed

Before:
<img width="593" alt="screen shot 2017-06-08 at 6 37 32 pm" src="https://user-images.githubusercontent.com/17733030/26956272-a62cd558-4c79-11e7-862f-9d0e0184b18a.png">
After:
<img width="587" alt="screen shot 2017-06-08 at 6 37 44 pm" src="https://user-images.githubusercontent.com/17733030/26956276-b1135ef6-4c79-11e7-8028-84d19c392fda.png">

Author: Corey Woodfield <coreywoodfield@gmail.com>

Closes #18246 from coreywoodfield/master.
2017-06-09 10:24:49 +01:00
Mark Grover 55b8cfe6e6 [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable
## What changes were proposed in this pull request?

Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't change any out-of-box behavior.

## How was this patch tested?
Running unit tests

Author: Mark Grover <mark@apache.org>
Author: Mark Grover <grover.markgrover@gmail.com>

Closes #18234 from markgrover/spark-19185.
2017-06-08 09:55:43 -07:00
Dongjoon Hyun 3218505a0b [MINOR][DOC] Update deprecation notes on Python/Hadoop/Scala.
## What changes were proposed in this pull request?

We had better update the deprecation notes about Python 2.6, Hadoop (before 2.6.5) and Scala 2.10 in [2.2.0-RC4](http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/) documentation. Since this is a doc only update, I think we can update the doc during publishing.

**BEFORE (2.2.0-RC4)**
    ![before](https://cloud.githubusercontent.com/assets/9700541/26799758/aea0dc06-49eb-11e7-8ca3-ed8ce1cc6147.png)

**AFTER**
    ![after](https://cloud.githubusercontent.com/assets/9700541/26799761/b3fef818-49eb-11e7-83c5-334f0e4768ed.png)

## How was this patch tested?

Manual.
```
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18207 from dongjoon-hyun/minor_doc_deprecation.
2017-06-07 08:50:36 +01:00
jerryshao 06c0544113 [SPARK-20981][SPARKSUBMIT] Add new configuration spark.jars.repositories as equivalence of --repositories
## What changes were proposed in this pull request?

In our use case of launching Spark applications via REST APIs (Livy), there's no way for user to specify command line arguments, all Spark configurations are set through configurations map. For "--repositories" because there's no equivalent Spark configuration, so we cannot specify the custom repository through configuration.

So here propose to add "--repositories" equivalent configuration in Spark.

## How was this patch tested?

New UT added.

Author: jerryshao <sshao@hortonworks.com>

Closes #18201 from jerryshao/SPARK-20981.
2017-06-05 11:06:50 -07:00
zero323 ae33abf71b [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide
## What changes were proposed in this pull request?

- Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
- Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
- Remove bucketing from Unsupported Hive Functionalities.

## How was this patch tested?

Manual tests, docs build.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
2017-05-26 15:01:01 -07:00
Michael Armbrust d935e0a9d9 [SPARK-20844] Remove experimental from Structured Streaming APIs
Now that Structured Streaming has been out for several Spark release and has large production use cases, the `Experimental` label is no longer appropriate.  I've left `InterfaceStability.Evolving` however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.

Author: Michael Armbrust <michael@databricks.com>

Closes #18065 from marmbrus/streamingGA.
2017-05-26 13:33:23 -07:00
Zheng RuiFeng a97c497045 [SPARK-20849][DOC][SPARKR] Document R DecisionTree
## What changes were proposed in this pull request?
1, add an example for sparkr `decisionTree`
2, document it in user guide

## How was this patch tested?
local submit

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18067 from zhengruifeng/dt_example.
2017-05-25 23:00:50 -07:00
Michael Allman c1e7989c4f [SPARK-20888][SQL][DOCS] Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-20888)

## What changes were proposed in this pull request?

Document change of default setting of spark.sql.hive.caseSensitiveInferenceMode configuration key from NEVER_INFO to INFER_AND_SAVE in the Spark SQL 2.1 to 2.2 migration notes.

Author: Michael Allman <michael@videoamp.com>

Closes #18112 from mallman/spark-20888-document_infer_and_save.
2017-05-26 09:25:43 +08:00
jinxing 3f94e64aa8 [SPARK-19659] Fetch big blocks to disk when shuffle-read.
## What changes were proposed in this pull request?

Currently the whole block is fetched into memory(off heap by default) when shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can be large when skew situations. If OOM happens during shuffle read, job will be killed and users will be notified to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting parameter and allocating more memory can resolve the OOM. However the approach is not perfectly suitable for production environment, especially for data warehouse.
Using Spark SQL as data engine in warehouse, users hope to have a unified parameter(e.g. memory) but less resource wasted(resource is allocated but not used). The hope is strong especially when migrating data engine to Spark from another one(e.g. Hive). Tuning the parameter for thousands of SQLs one by one is very time consuming.
It's not always easy to predict skew situations, when happen, it make sense to fetch remote blocks to disk for shuffle-read, rather than kill the job because of OOM.

In this pr, I propose to fetch big blocks to disk(which is also mentioned in SPARK-3019):

1. Track average size and also the outliers(which are larger than 2*avgSize) in MapStatus;
2. Request memory from `MemoryManager` before fetch blocks and release the memory to `MemoryManager` when `ManagedBuffer` is released.
3. Fetch remote blocks to disk when failing acquiring memory from `MemoryManager`, otherwise fetch to memory.

This is an improvement for memory control when shuffle blocks and help to avoid OOM in scenarios like below:
1. Single huge block;
2. Sizes of many blocks are underestimated in `MapStatus` and the actual footprint of blocks is much larger than the estimated.

## How was this patch tested?
Added unit test in `MapStatusSuite` and `ShuffleBlockFetcherIteratorSuite`.

Author: jinxing <jinxing6042@126.com>

Closes #16989 from jinxing64/SPARK-19659.
2017-05-25 16:11:30 +08:00
jinxing 2597674bcc [SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold.
## What changes were proposed in this pull request?

Currently, when number of reduces is above 2000, HighlyCompressedMapStatus is used to store size of blocks. in HighlyCompressedMapStatus, only average size is stored for non empty blocks. Which is not good for memory control when we shuffle blocks. It makes sense to store the accurate size of block when it's above threshold.

## How was this patch tested?

Added test in MapStatusSuite.

Author: jinxing <jinxing6042@126.com>

Closes #18031 from jinxing64/SPARK-20801.
2017-05-22 22:09:49 +08:00
Nick Pentreath be846db48b [SPARK-20506][DOCS] Add HTML links to highlight list in MLlib guide for 2.2
Quick follow up to #17996 - forgot to add the HTML links to the relevant sections of the guide in the highlights list.

## How was this patch tested?

Built docs locally and tested links.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18043 from MLnick/SPARK-20506-2.2-migration-guide-2.
2017-05-22 12:29:29 +02:00
Nick Pentreath b5d8d9ba17 [SPARK-20506][DOCS] 2.2 migration guide
Update ML guide for migration `2.1` -> `2.2` and the previous version migration guide section.

## How was this patch tested?

Build doc locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17996 from MLnick/SPARK-20506-2.2-migration-guide.
2017-05-19 20:51:56 +02:00
liuzhaokun dba2ca2c12 [SPARK-20759] SCALA_VERSION in _config.yml should be consistent with pom.xml
[https://issues.apache.org/jira/browse/SPARK-20759](https://issues.apache.org/jira/browse/SPARK-20759)
SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with pom.xml.

Author: liuzhaokun <liu.zhaokun@zte.com.cn>

Closes #17992 from liu-zhaokun/new.
2017-05-19 15:26:39 +01:00
Yash Sharma 92580bd0ea [DSTREAM][DOC] Add documentation for kinesis retry configurations
## What changes were proposed in this pull request?

The changes were merged as part of - https://github.com/apache/spark/pull/17467.
The documentation was missed somewhere in the review iterations. Adding the documentation where it belongs.

## How was this patch tested?
Docs. Not tested.

cc budde , brkyvz

Author: Yash Sharma <ysharma@atlassian.com>

Closes #18028 from yssharma/ysharma/kinesis_retry_docs.
2017-05-18 11:24:33 -07:00
liuzhaokun 99452df44f [SPARK-20796] the location of start-master.sh in spark-standalone.md is wrong
[https://issues.apache.org/jira/browse/SPARK-20796](https://issues.apache.org/jira/browse/SPARK-20796)
the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh".

Author: liuzhaokun <liu.zhaokun@zte.com.cn>

Closes #18027 from liu-zhaokun/sbin.
2017-05-18 17:44:40 +01:00
Yanbo Liang 697a5e5517 [SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest.
## What changes were proposed in this pull request?
Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```.

## How was this patch tested?
Generate docs and run examples manually, successfully.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17994 from yanboliang/spark-20505.
2017-05-18 11:54:09 +08:00
Andrew Ray 1995417696 [SPARK-20769][DOC] Incorrect documentation for using Jupyter notebook
## What changes were proposed in this pull request?

SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS=notebook from documentation to use pyspark with Jupyter notebook. This patch corrects the documentation error.

## How was this patch tested?

Tested invocation locally with
```bash
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
```

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #18001 from aray/patch-1.
2017-05-17 10:06:01 +01:00
uncleGen c0189abc7c [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
## What changes were proposed in this pull request?

Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.

Changes:
- In this PR, we add a new rule `EliminateEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore watermark in any batch query.

Depends upon:
- [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We can not add this rule into analyzer directly, because streaming query will be copied to `triggerLogicalPlan ` in every trigger, and the rule will be applied to `triggerLogicalPlan` mistakenly.

Others:
- A typo fix in example.

## How was this patch tested?

add new unit test.

Author: uncleGen <hustyugm@gmail.com>

Closes #17896 from uncleGen/SPARK-20373.
2017-05-09 15:08:09 -07:00
jerryshao 829cd7b8b7 [SPARK-20605][CORE][YARN][MESOS] Deprecate not used AM and executor port configuration
## What changes were proposed in this pull request?

After SPARK-10997, client mode Netty RpcEnv doesn't require to start server, so port configurations are not used any more, here propose to remove these two configurations: "spark.executor.port" and "spark.am.port".

## How was this patch tested?

Existing UTs.

Author: jerryshao <sshao@hortonworks.com>

Closes #17866 from jerryshao/SPARK-20605.
2017-05-08 14:27:56 -07:00
Steve Loughran 2cf83c4783 [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access.
## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
2017-05-07 10:15:31 +01:00
Felix Cheung b8302ccd02 [SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example
## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.
2017-05-04 00:27:10 -07:00
MechCoder db2fb84b4a [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2)
Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on #7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <manojkumarsivaraj334@gmail.com>
Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.
2017-05-03 10:58:05 +02:00
Felix Cheung d20a976e89 [SPARK-20192][SPARKR][DOC] SparkR migration guide to 2.2.0
## What changes were proposed in this pull request?

Updating R Programming Guide

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17816 from felixcheung/r22relnote.
2017-05-01 21:03:48 -07:00
郭小龙 10207633 4d99b95ad0 [SPARK-20521][DOC][CORE] The default of 'spark.worker.cleanup.appDataTtl' should be 604800 in spark-standalone.md
## What changes were proposed in this pull request?

Currently, our project needs to be set to clean up the worker directory cleanup cycle is three days.
When I follow http://spark.apache.org/docs/latest/spark-standalone.html, configure the 'spark.worker.cleanup.appDataTtl' parameter, I configured to 3 * 24 * 3600.
When I start the spark service, the startup fails, and the worker log displays the error log as follows:

2017-04-28 15:02:03,306 INFO Utils: Successfully started service 'sparkWorker' on port 48728.
Exception in thread "main" java.lang.NumberFormatException: For input string: "3 * 24 * 3600"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:430)
	at java.lang.Long.parseLong(Long.java:483)
	at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
	at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
	at org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
	at org.apache.spark.deploy.worker.Worker.<init>(Worker.scala:100)
	at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:730)
	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:709)
	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)

**Because we put 7 * 24 * 3600 as a string, forced to convert to the dragon type,  will lead to problems in the program.**

**So I think the default value of the current configuration should be a specific long value, rather than 7 * 24 * 3600,should be 604800. Because it would mislead users for similar configurations, resulting in spark start failure.**

## How was this patch tested?
manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17798 from guoxiaolongzte/SPARK-20521.
2017-04-30 09:06:25 +01:00
Yuhao Yang add9d1bba5 [SPARK-19791][ML] Add doc and example for fpgrowth
## What changes were proposed in this pull request?

Add a new section for fpm
Add Example for FPGrowth in scala and Java

updated: Rewrite transform to be more compact.

## How was this patch tested?

local doc generation.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17130 from hhbyyh/fpmdoc.
2017-04-29 10:51:45 -07:00
wangmiao1981 b28c3bc202 [SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programming guide
## What changes were proposed in this pull request?

Add hyper link in the SparkR programming guide.

## How was this patch tested?

Build doc and manually check the doc link.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17805 from wangmiao1981/doc.
2017-04-29 10:31:01 -07:00
wangmiao1981 7fe8249793 [SPARKR][DOC] Document LinearSVC in R programming guide
## What changes were proposed in this pull request?

add link to svmLinear in the SparkR programming document.

## How was this patch tested?

Build doc manually and click the link to the document. It looks good.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17797 from wangmiao1981/doc.
2017-04-27 22:29:47 -07:00
zero323 ba7666274e [SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide
## What changes were proposed in this pull request?

Add `spark.fpGrowth` to SparkR programming guide.

## How was this patch tested?

Manual tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.
2017-04-27 00:34:20 -07:00
Mark Grover 66636ef0b0 [SPARK-20435][CORE] More thorough redaction of sensitive information
This change does a more thorough redaction of sensitive information from logs and UI
Add unit tests that ensure that no regressions happen that leak sensitive information to the logs.

The motivation for this change was appearance of password like so in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."
`
Previously redaction logic was only checking if the key matched the secret regex pattern, it'd redact it's value. That worked for most cases. However, in the above case, the key (sun.java.command) doesn't tell much, so the value needs to be searched. This PR expands the check to check for values as well.

## How was this patch tested?

New unit tests added that ensure that no sensitive information is present in the event logs or the yarn logs. Old unit test in UtilsSuite was modified because the test was asserting that a non-sensitive property's value won't be redacted. However, the non-sensitive value had the literal "secret" in it which was causing it to redact. Simply updating the non-sensitive property's value to another arbitrary value (that didn't have "secret" in it) fixed it.

Author: Mark Grover <mark@apache.org>

Closes #17725 from markgrover/spark-20435.
2017-04-26 17:06:21 -07:00
anabranch 7a365257e9 [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools
## What changes were proposed in this pull request?

Simple documentation change to remove explicit vendor references.

## How was this patch tested?

NA

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: anabranch <bill@databricks.com>

Closes #17695 from anabranch/remove-vendor.
2017-04-26 09:49:05 +01:00
zero323 df58a95a33 [SPARK-20437][R] R wrappers for rollup and cube
## What changes were proposed in this pull request?

- Add `rollup` and `cube` methods and corresponding generics.
- Add short description to the vignette.

## How was this patch tested?

- Existing unit tests.
- Additional unit tests covering new features.
- `check-cran.sh`.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17728 from zero323/SPARK-20437.
2017-04-25 22:00:45 -07:00
ding 0a7f5f2798 [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.

This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
This PR moves PeriodicGraphCheckpointer.scala from mllib to graphx, moves PeriodicRDDCheckpointer.scala, PeriodicCheckpointer.scala from mllib to core
## How was this patch tested?

unit tests, manual tests
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>

Closes #15125 from dding3/cp2_pregel.
2017-04-25 11:20:32 -07:00
Armin Braun c8f1219510 [SPARK-20455][DOCS] Fix Broken Docker IT Docs
## What changes were proposed in this pull request?

Just added the Maven `test`goal.

## How was this patch tested?

No test needed, just a trivial documentation fix.

Author: Armin Braun <me@obrown.io>

Closes #17756 from original-brownbear/SPARK-20455.
2017-04-25 09:13:50 +01:00
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
郭小龙 10207633 ad290402aa [SPARK-20401][DOC] In the spark official configuration document, the 'spark.driver.supervise' configuration parameter specification and default values are necessary.
## What changes were proposed in this pull request?
Use the REST interface submits the spark job.
e.g.
curl -X  POST http://10.43.183.120:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data'{
    "action": "CreateSubmissionRequest",
    "appArgs": [
        "myAppArgument"
    ],
    "appResource": "/home/mr/gxl/test.jar",
    "clientSparkVersion": "2.2.0",
    "environmentVariables": {
        "SPARK_ENV_LOADED": "1"
    },
    "mainClass": "cn.zte.HdfsTest",
    "sparkProperties": {
        "spark.jars": "/home/mr/gxl/test.jar",
        **"spark.driver.supervise": "true",**
        "spark.app.name": "HdfsTest",
        "spark.eventLog.enabled": "false",
        "spark.submit.deployMode": "cluster",
        "spark.master": "spark://10.43.183.120:6066"
    }
}'

**I hope that make sure that the driver is automatically restarted if it fails with non-zero exit code.
But I can not find the 'spark.driver.supervise' configuration parameter specification and default values from the spark official document.**
## How was this patch tested?

manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17696 from guoxiaolongzte/SPARK-20401.
2017-04-21 20:08:26 +01:00
Hervé 34767997e0 Small rewording about history server use case
Hello
PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful to Yarn or Mesos only.

Author: Hervé <dud225@users.noreply.github.com>

Closes #17709 from dud225/patch-1.
2017-04-21 08:52:18 +01:00
ymahajan bdc6056919 Fixed typos in docs
## What changes were proposed in this pull request?

Typos at a couple of place in the docs.

## How was this patch tested?

build including docs

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: ymahajan <ymahajan@snappydata.io>

Closes #17690 from ymahajan/master.
2017-04-19 20:08:31 -07:00
cody koeninger 71a8e9df12 [SPARK-20036][DOC] Note incompatible dependencies on org.apache.kafka artifacts
## What changes were proposed in this pull request?

Note that you shouldn't manually add dependencies on org.apache.kafka artifacts

## How was this patch tested?

Doc only change, did jekyll build and looked at the page.

Author: cody koeninger <cody@koeninger.org>

Closes #17675 from koeninger/SPARK-20036.
2017-04-19 18:58:58 +01:00
Ji Yan a888fed309 [SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameters into docker when running on mesos with docker containerizer
## What changes were proposed in this pull request?

Allow passing in arbitrary parameters into docker when launching spark executors on mesos with docker containerizer tnachen

## How was this patch tested?

Manually built and tested with passed in parameter

Author: Ji Yan <jiyan@Jis-MacBook-Air.local>

Closes #17109 from yanji84/ji/allow_set_docker_user.
2017-04-16 14:34:12 +01:00
hyukjinkwon bca4259f12 [MINOR][DOCS] JSON APIs related documentation fixes
## What changes were proposed in this pull request?

This PR proposes corrections related to JSON APIs as below:

- Rendering links in Python documentation
- Replacing `RDD` to `Dataset` in programing guide
- Adding missing description about JSON Lines consistently in `DataFrameReader.json` in Python API
- De-duplicating little bit of `DataFrameReader.json` in Scala/Java API

## How was this patch tested?

Manually build the documentation via `jekyll build`. Corresponding snapstops will be left on the codes.

Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17602 from HyukjinKwon/minor-json-documentation.
2017-04-12 09:16:39 +01:00
Lee Dongjin b938438248 [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide
## What changes were proposed in this pull request?

1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
2. Omitted colon in Output Model section.

## How was this patch tested?

None.

Author: Lee Dongjin <dongjin@apache.org>

Closes #17564 from dongjinleekr/feature/fix-programming-guide.
2017-04-12 09:12:14 +01:00
Dongjoon Hyun cde9e32848 [MINOR][DOCS] Update supported versions for Hive Metastore
## What changes were proposed in this pull request?

Since SPARK-18112 and SPARK-13446, Apache Spark starts to support reading Hive metastore 2.0 ~ 2.1.1. This updates the docs.

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #17612 from dongjoon-hyun/metastore.
2017-04-11 19:30:34 -07:00
MirrorZ d11ef3d77e Document Master URL format in high availability set up
## What changes were proposed in this pull request?

Add documentation for adding master url in multi host, port format for standalone cluster with high availability with zookeeper.
Referring documentation [Standby Masters with ZooKeeper](http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper)

## How was this patch tested?

Documenting the functionality already present.

Author: MirrorZ <chandrika3437@gmail.com>

Closes #17584 from MirrorZ/master.
2017-04-11 10:34:39 +01:00
郭小龙 10207633 9e0893b53d [SPARK-20218][DOC][APP-ID] applications//stages' in REST API,add description.
## What changes were proposed in this pull request?

1. '/applications/[app-id]/stages' in rest api.status should add description '?status=[active|complete|pending|failed] list only stages in the state.'

Now the lack of this description, resulting in the use of this api do not know the use of the status through the brush stage list.

2.'/applications/[app-id]/stages/[stage-id]' in REST API,remove redundant description ‘?status=[active|complete|pending|failed] list only stages in the state.’.
Because only one stage is determined based on stage-id.

code:
  GET
  def stageList(QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
    val listener = ui.jobProgressListener
    val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
    val adjStatuses = {
      if (statuses.isEmpty()) {
        Arrays.asList(StageStatus.values(): _*)
      } else {
        statuses
      }
    };

## How was this patch tested?

manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>

Closes #17534 from guoxiaolongzte/SPARK-20218.
2017-04-07 13:03:07 +01:00
Kalvin Chau c8fc1f3bad [SPARK-20085][MESOS] Configurable mesos labels for executors
## What changes were proposed in this pull request?

Add spark.mesos.task.labels configuration option to add mesos key:value labels to the executor.

 "k1:v1,k2:v2" as the format, colons separating key-value and commas to list out more than one.

Discussion of labels with mgummelt at #17404

## How was this patch tested?

Added unit tests to verify labels were added correctly, with incorrect labels being ignored and added a test to test the name of the executor.

Tested with: `./build/sbt -Pmesos mesos/test`

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Kalvin Chau <kalvin.chau@viasat.com>

Closes #17413 from kalvinnchau/mesos-labels.
2017-04-06 09:14:31 +01:00
Tathagata Das 9543fc0e08 [SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState
## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.
2017-04-05 16:03:04 -07:00
Anirudh Ramanathan 11238d4c62 [SPARK-18278][SCHEDULER] Documentation to point to Kubernetes cluster scheduler
## What changes were proposed in this pull request?

Adding documentation to point to Kubernetes cluster scheduler being developed out-of-repo in https://github.com/apache-spark-on-k8s/spark
cc rxin srowen tnachen ash211 mccheah erikerlandson

## How was this patch tested?

Docs only change

Author: Anirudh Ramanathan <foxish@users.noreply.github.com>
Author: foxish <ramanathana@google.com>

Closes #17522 from foxish/upstream-doc.
2017-04-04 10:46:44 -07:00
guoxiaolongzte c95fbea68e [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be [running|s…
…ucceeded|failed|unknown]

## What changes were proposed in this pull request?

'/applications/[app-id]/jobs' in rest api.status should be'[running|succeeded|failed|unknown]'.
now status is '[complete|succeeded|failed]'.
but '/applications/[app-id]/jobs?status=complete' the server return 'HTTP ERROR 404'.
Added '?status=running' and '?status=unknown'.
code :
public enum JobExecutionStatus {
RUNNING,
SUCCEEDED,
FAILED,
UNKNOWN;

## How was this patch tested?

 manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17507 from guoxiaolongzte/SPARK-20190.
2017-04-04 09:56:17 +01:00
Yuhao Yang 4d28e8430d [SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently scala and Java examples are included. Python example will be added after https://github.com/apache/spark/pull/17316

## How was this patch tested?

local doc generation and example execution

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17324 from hhbyyh/imputerdoc.
2017-04-03 11:42:33 +02:00
hyukjinkwon 364b0db753 [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown
# What changes were proposed in this pull request?

It seems there are several non-breaking spaces were inserted into several `.md`s and they look breaking rendering markdown files.

These are different. For example, this can be checked via `python` as below:

```python
>>> " "
'\xc2\xa0'
>>> " "
' '
```

_Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._

I checked the output below in  Sapari and Chrome on Mac OS and, Internal Explorer on Windows 10.

**Before**

![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)

**After**

![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)

## How was this patch tested?

Manually checking.

These instances were found via

```
grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
```

in Mac OS.

It seems there are several instances more as below:

```
./docs/sql-programming-guide.md:        │   ├── ...
./docs/sql-programming-guide.md:        │   │
./docs/sql-programming-guide.md:        │   ├── country=US
./docs/sql-programming-guide.md:        │   │   └── data.parquet
./docs/sql-programming-guide.md:        │   ├── country=CN
./docs/sql-programming-guide.md:        │   │   └── data.parquet
./docs/sql-programming-guide.md:        │   └── ...
./docs/sql-programming-guide.md:            ├── ...
./docs/sql-programming-guide.md:            │
./docs/sql-programming-guide.md:            ├── country=US
./docs/sql-programming-guide.md:            │   └── data.parquet
./docs/sql-programming-guide.md:            ├── country=CN
./docs/sql-programming-guide.md:            │   └── data.parquet
./docs/sql-programming-guide.md:            └── ...
./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
```

These seems generated via `tree` command which inserts non-breaking spaces. They do not look causing any problem for rendering within code blocks and I did not fix it to reduce the overhead to manually replace it when it is overwritten via `tree` command in the future.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17517 from HyukjinKwon/non-breaking-space.
2017-04-03 10:09:11 +01:00
郭小龙 10207633 cf5963c961 [SPARK-20177] Document about compression way has some little detail ch…
…anges.

## What changes were proposed in this pull request?

Document compression way little detail changes.
1.spark.eventLog.compress add 'Compression will use spark.io.compression.codec.'
2.spark.broadcast.compress add 'Compression will use spark.io.compression.codec.'
3,spark.rdd.compress add 'Compression will use spark.io.compression.codec.'
4.spark.io.compression.codec add 'event log describe'.

eg
Through the documents, I don't know  what is compression mode about 'event log'.

## How was this patch tested?

manual tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>

Closes #17498 from guoxiaolongzte/SPARK-20177.
2017-04-01 11:48:58 +01:00
Seigneurin, Alexis (CONT) 669a11b61b [DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation
Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.
2017-03-30 16:12:17 +01:00
Yuming Wang edc87d76ef [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option to configuration.md
## What changes were proposed in this pull request?

Add `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` option to `configuration.md`.
Set `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` can speed up [HadoopMapReduceCommitProtocol.commitJob](https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121) for many output files.

All cloudera's hadoop 2.6.0-cdh5.4.0 or higher versions(see: 1c12361823 and 16b2de2732) and apache's hadoop 2.7.0 or higher versions support this improvement.

More see:

1. [MAPREDUCE-4815](https://issues.apache.org/jira/browse/MAPREDUCE-4815): Speed up FileOutputCommitter#commitJob for many output files.
2. [MAPREDUCE-6406](https://issues.apache.org/jira/browse/MAPREDUCE-6406): Update the default version for the property mapreduce.fileoutputcommitter.algorithm.version to 2.

## How was this patch tested?

Manual test and exist tests.

Author: Yuming Wang <wgyumg@gmail.com>

Closes #17442 from wangyum/SPARK-20107.
2017-03-30 10:39:57 +01:00
Yanbo Liang 1d00761b91 [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc.
Section ```Data type mapping between R and Spark``` was put in the wrong place in SparkR doc currently, we should move it to a separate section.

## What changes were proposed in this pull request?
Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17440 from yanboliang/sparkr-doc.
2017-03-27 17:37:24 -07:00
sureshthalamati c791180705 [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
## What changes were proposed in this pull request?
Currently JDBC data source creates tables in the target database using the default type mapping, and the JDBC dialect mechanism.  If users want to specify different database data type for only some of columns, there is no option available. In scenarios where default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR is to provide a user-defined type mapping for specific columns.

The solution is to allow users to specify database column data type for the create table  as JDBC datasource option(createTableColumnTypes) on write. Data type information can be specified in the same format as table schema DDL format (e.g: `name CHAR(64), comments VARCHAR(1024)`).

All supported target database types can not be specified ,  the data types has to be valid spark sql data types also.  For example user can not specify target database  CLOB data type. This will be supported in the follow-up PR.

Example:
```Scala
df.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc(url, "TEST.DBCOLTYPETEST", properties)
```
## How was this patch tested?
Added new test cases to the JDBCWriteSuite

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
2017-03-23 17:39:33 -07:00
uncleGen facfd60886 [SPARK-20021][PYSPARK] Miss backslash in python code
## What changes were proposed in this pull request?

Add backslash for line continuation in python code.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>

Closes #17352 from uncleGen/python-example-doc.
2017-03-22 11:10:08 +00:00
christopher snow 7620aed828 [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter
## What changes were proposed in this pull request?

API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.

 - [DOCS] was previously: "rank is the number of latent factors in the model."
 - [API] was previously:  "rank - number of features to use"

This change describes rank in both places consistently as:

 - "Number of features to use (also referred to as the number of latent factors)"

Author: Chris Snow <chris.snowuk.ibm.com>

Author: christopher snow <chsnow123@gmail.com>

Closes #17345 from snowch/SPARK-20011.
2017-03-21 13:23:59 +00:00
Tyson Condie c2d1761a57 [SPARK-19906][SS][DOCS] Documentation describing how to write queries to Kafka
## What changes were proposed in this pull request?

Add documentation that describes how to write streaming and batch queries to Kafka.

zsxwing tdas

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #17246 from tcondie/kafka-write-docs.
2017-03-20 17:18:59 -07:00
Sital Kedia 7b5d873aef [SPARK-13369] Add config for number of consecutive fetch failures
The previously hardcoded max 4 retries per stage is not suitable for all cluster configurations. Since spark retries a stage at the sign of the first fetch failure, you can easily end up with many stage retries to discover all the failures. In particular, two scenarios this value should change are (1) if there are more than 4 executors per node; in that case, it may take 4 retries to discover the problem with each executor on the node and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once, but you also cannot afford total cluster downtime. By making this value configurable, cluster managers can tune this value to something more appropriate to their cluster configuration.

Unit tests

Author: Sital Kedia <skedia@fb.com>

Closes #17307 from sitalkedia/SPARK-13369.
2017-03-17 09:33:58 -05:00
uncleGen e29a74d5b1 [DOCS][SS] fix structured streaming python example
## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.
2017-03-12 08:29:37 +00:00
Yong Tang 8f0490e22b [SPARK-17979][SPARK-14453] Remove deprecated SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS
This fix removes deprecated support for config `SPARK_YARN_USER_ENV`, as is mentioned in SPARK-17979.
This fix also removes deprecated support for the following:
```
SPARK_YARN_USER_ENV
SPARK_JAVA_OPTS
SPARK_CLASSPATH
SPARK_WORKER_INSTANCES
```

Related JIRA:
[SPARK-14453]: https://issues.apache.org/jira/browse/SPARK-14453
[SPARK-12344]: https://issues.apache.org/jira/browse/SPARK-12344
[SPARK-15781]: https://issues.apache.org/jira/browse/SPARK-15781

Existing tests should pass.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #17212 from yongtang/SPARK-17979.
2017-03-10 13:34:01 -08:00
Liwei Lin 40da4d181d [SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource
## What changes were proposed in this pull request?

Today, we compare the whole path when deciding if a file is new in the FileSource for structured streaming. However, this would cause false negatives in the case where the path has changed in a cosmetic way (i.e. changing `s3n` to `s3a`).

This patch adds an option `fileNameOnly` that causes the new file check to be based only on the filename (but still store the whole path in the log).

## Usage

```scala
spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...
```
## How was this patch tested?

Added a test case

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17120 from lw-lin/filename-only.
2017-03-09 11:02:44 -08:00
Wenchen Fan d69aeeaff4 [SPARK-19516][DOC] update public doc to use SparkSession instead of SparkContext
## What changes were proposed in this pull request?

After Spark 2.0, `SparkSession` becomes the new entry point for Spark applications. We should update the public documents to reflect this.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16856 from cloud-fan/doc.
2017-03-07 11:32:36 -08:00
VinceShieh 4a9034b173 [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR, String Indexer only supports "skip"/"error" options to deal with unseen records.
But those unseen records might still be useful and user would like to keep the unseen labels in
certain use cases, This PR enables StringIndexer to support keeping unseen labels as
indices [numLabels].

'''Before
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
'''After
support the third option "keep"
StringIndexer().setHandleInvalid("keep")

## How was this patch tested?
Test added in StringIndexerSuite

Signed-off-by: VinceShieh <vincent.xieintel.com>
(Please fill in changes proposed in this fix)

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
2017-03-07 11:24:20 -08:00
jerryshao ba186a841f [MINOR][DOC] Fix doc for web UI https configuration
## What changes were proposed in this pull request?

Doc about enabling web UI https is not correct, "spark.ui.https.enabled" is not existed, actually enabling SSL is enough for https.

## How was this patch tested?

N/A

Author: jerryshao <sshao@hortonworks.com>

Closes #17147 from jerryshao/fix-doc-ssl.
2017-03-03 14:23:31 -08:00
Zhe Sun 0bac3e4cde [SPARK-19797][DOC] ML pipeline document correction
## What changes were proposed in this pull request?
Description about pipeline in this paragraph is incorrect https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works

> If the Pipeline had more **stages**, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

Reason: Transformer could also be a stage. But only another Estimator will invoke an transform call and pass the data to next stage. The description in the document misleads ML pipeline users.

## How was this patch tested?
This is a tiny modification of **docs/ml-pipelines.md**. I jekyll build the modification and check the compiled document.

Author: Zhe Sun <ymwdalex@gmail.com>

Closes #17137 from ymwdalex/SPARK-19797-ML-pipeline-document-correction.
2017-03-03 11:55:57 +01:00
Nick Pentreath 9cca3dbf4a [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS
[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and add code to the examples to illustrate usage.

## How was this patch tested?

Doc and example change only. Build HTML doc locally and verified example code builds, and runs in shell for Scala/Python.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17102 from MLnick/SPARK-19345-coldstart-doc.
2017-03-02 15:51:16 +02:00
Felix Cheung 8d6ef895ee [SPARK-18352][DOCS] wholeFile JSON update doc and programming guide
## What changes were proposed in this pull request?

Update doc for R, programming guide. Clarify default behavior for all languages.

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17128 from felixcheung/jsonwholefiledoc.
2017-03-02 01:02:38 -08:00
Michael McCune bf5987cbe6 [SPARK-19769][DOCS] Update quickstart instructions
## What changes were proposed in this pull request?

This change addresses the renaming of the `simple.sbt` build file to
`build.sbt`. Newer versions of the sbt tool are not finding the older
named file and are looking for the `build.sbt`. The quickstart
instructions for self-contained applications is updated with this
change.

## How was this patch tested?

As this is a relatively minor change of a few words, the markdown was checked for syntax and spelling. Site was built with `SKIP_API=1 jekyll serve` for testing purposes.

Author: Michael McCune <msm@redhat.com>

Closes #17101 from elmiko/spark-19769.
2017-03-01 00:07:16 +01:00
actuaryzhang d743ea4c76 [MINOR][DOC] Update GLM doc to include tweedie distribution
Update GLM documentation to include the Tweedie distribution. #16344

jkbradley yanboliang

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #17103 from actuaryzhang/doc.
2017-02-28 14:43:44 -08:00
Yuming Wang 9b8eca65dc [SPARK-19660][CORE][SQL] Replace the configuration property names that are deprecated in the version of Hadoop 2.6
## What changes were proposed in this pull request?

Replace all the Hadoop deprecated configuration property names according to [DeprecatedProperties](https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html).

except:
https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45
https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614

## How was this patch tested?

Existing tests

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16990 from wangyum/HadoopDeprecatedProperties.
2017-02-28 10:13:42 +00:00
Boaz Mohar 061bcfb869 [MINOR][DOCS] Fixes two problems in the SQL programing guide page
## What changes were proposed in this pull request?

Removed duplicated lines in sql python example and found a typo.

## How was this patch tested?

Searched for other typo's in the page to minimize PR's.

Author: Boaz Mohar <boazmohar@gmail.com>

Closes #17066 from boazmohar/doc-fix.
2017-02-25 11:32:09 -08:00
Ramkumar Venkataraman 1b9ba258e0 [MINOR][DOCS] Fix few typos in structured streaming doc
## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.
2017-02-25 02:18:22 +00:00
Shubham Chopra fa7c582e94 [SPARK-15355][CORE] Proactive block replication
## What changes were proposed in this pull request?

We are proposing addition of pro-active block replication in case of executor failures. BlockManagerMasterEndpoint does all the book-keeping to keep a track of all the executors and the blocks they hold. It also keeps a track of which executors are alive through heartbeats. When an executor is removed, all this book-keeping state is updated to reflect the lost executor. This step can be used to identify executors that are still in possession of a copy of the cached data and a message could be sent to them to use the existing "replicate" function to find and place new replicas on other suitable hosts. Blocks replicated this way will let the master know of their existence.

This can happen when an executor is lost, and would that way be pro-active as opposed be being done at query time.
## How was this patch tested?

This patch was tested with existing unit tests along with new unit tests added to test the functionality.

Author: Shubham Chopra <schopra31@bloomberg.net>

Closes #14412 from shubhamchopra/ProactiveBlockReplication.
2017-02-24 15:40:01 -08:00
uncleGen d027624574 [SPARK-16122][DOCS] application environment rest api
## What changes were proposed in this pull request?

follow up pr of #16949.

## How was this patch tested?

jenkins

Author: uncleGen <hustyugm@gmail.com>

Closes #17033 from uncleGen/doc-restapi-environment.
2017-02-23 17:06:14 -08:00
Kay Ousterhout f87a6a59af [SPARK-19684][DOCS] Remove developer info from docs.
This commit moves developer-specific information from the release-
specific documentation in this repo to the developer tools page on
the main Spark website. This commit relies on this PR on the
Spark website: https://github.com/apache/spark-website/pull/33.

srowen

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #17018 from kayousterhout/SPARK-19684.
2017-02-23 13:27:47 -08:00
Marcelo Vanzin 4661d30b98 [SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM.
Allow an application to use the History Server URL as the tracking
URL in the YARN RM, so there's still a link to the web UI somewhere
in YARN even if the driver's UI is disabled. This is useful, for
example, if an admin wants to disable the driver UI by default for
applications, since it's harder to secure it (since it involves non
trivial ssl certificate and auth management that admins may not want
to expose to user apps).

This needs to be opt-in, because of the way the YARN proxy works, so
a new configuration was added to enable the option.

The YARN RM will proxy requests to live AMs instead of redirecting
the client, so pages in the SHS UI will not render correctly since
they'll reference invalid paths in the RM UI. The proxy base support
in the SHS cannot be used since that would prevent direct access to
the SHS.

So, to solve this problem, for the feature to work end-to-end, a new
YARN-specific filter was added that detects whether the requests come
from the proxy and redirects the client appropriatly. The SHS admin has
to add this filter manually if they want the feature to work.

Tested with new unit test, and by running with the documented configuration
set in a test cluster. Also verified the driver UI is used when it's
enabled.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16946 from vanzin/SPARK-19554.
2017-02-22 14:37:53 -08:00
Yuhao Yang 280afe0ef3 [SPARK-19337][ML][DOC] Documentation and examples for LinearSVC
## What changes were proposed in this pull request?

Documentation and examples (Java, scala, python, R) for LinearSVC

## How was this patch tested?
local doc generation

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #16968 from hhbyyh/mlsvmdoc.
2017-02-21 09:38:14 -08:00
Sean Owen 0e2405490f
[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.
2017-02-16 12:32:45 +00:00
Yun Ni 08c1972a06 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <yunn@uber.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Yunni <Euler57721@gmail.com>

Closes #16715 from Yunni/spark-18080.
2017-02-15 16:26:05 -08:00
Tyson Condie 447b2b5309 [SPARK-19584][SS][DOCS] update structured streaming documentation around batch mode
## What changes were proposed in this pull request?

Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.

zsxwing tdas

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #16918 from tcondie/kafka-docs.
2017-02-14 18:50:14 -08:00
Sunitha Kambhampati 9b5e460a91 [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable api call in the doc
## What changes were proposed in this pull request?

https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
In the doc, the call spark.cacheTable(“tableName”) and spark.uncacheTable(“tableName”) actually needs to be spark.catalog.cacheTable and spark.catalog.uncacheTable

## How was this patch tested?
Built the docs and verified the change shows up fine.

Author: Sunitha Kambhampati <skambha@us.ibm.com>

Closes #16919 from skambha/docChange.
2017-02-13 22:49:29 -08:00
Marcelo Vanzin 0169360ef5 [SPARK-19520][STREAMING] Do not encrypt data written to the WAL.
Spark's I/O encryption uses an ephemeral key for each driver instance.
So driver B cannot decrypt data written by driver A since it doesn't
have the correct key.

The write ahead log is used for recovery, thus needs to be readable by
a different driver. So it cannot be encrypted by Spark's I/O encryption
code.

The BlockManager APIs used by the WAL code to write the data automatically
encrypt data, so changes are needed so that callers can to opt out of
encryption.

Aside from that, the "putBytes" API in the BlockManager does not do
encryption, so a separate situation arised where the WAL would write
unencrypted data to the BM and, when those blocks were read, decryption
would fail. So the WAL code needs to ask the BM to encrypt that data
when encryption is enabled; this code is not optimal since it results
in a (temporary) second copy of the data block in memory, but should be
OK for now until a more performant solution is added. The non-encryption
case should not be affected.

Tested with new unit tests, and by running streaming apps that do
recovery using the WAL data with I/O encryption turned on.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16862 from vanzin/SPARK-19520.
2017-02-13 14:19:41 -08:00
Hervé c5a66356d4 Encryption of shuffle files
Hello

According to my understanding of commits 4b4e329e49 & 8b325b17ec, one may now encrypt shuffle files regardless of the cluster manager in use.

However I have limited understanding of the code, I'm not able to find out whether theses changes also comprise all "temporary local storage, such as shuffle files, cached data, and other application files".

Please feel free to amend or reject my PR if I'm wrong.

dud

Author: Hervé <dud225@users.noreply.github.com>

Closes #16885 from dud225/patch-1.
2017-02-10 17:11:03 +01:00
jerryshao 8e8afb3a34
[SPARK-19545][YARN] Fix compile issue for Spark on Yarn when building against Hadoop 2.6.0~2.6.3
## What changes were proposed in this pull request?

Due to the newly added API in Hadoop 2.6.4+, Spark builds against Hadoop 2.6.0~2.6.3 will meet compile error. So here still reverting back to use reflection to handle this issue.

## How was this patch tested?

Manual verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #16884 from jerryshao/SPARK-19545.
2017-02-10 13:44:26 +00:00
José Hiram Soltren 6287c94f08 [SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted
## What changes were proposed in this pull request?

In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.

In some scenarios, it is better to fail fast, and to simply kill these unreliable resources. This changes proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".

In order to be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend in order to do the actual killing. This also helps to prevent a race which would permit work to begin on a resource (executor or node), between the time the resource is marked for killing and the time at which it is finally killed.

## How was this patch tested?

./dev/run-tests
Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed.

Testing can likely be improved here; suggestions welcome.

Author: José Hiram Soltren <jose@cloudera.com>

Closes #16650 from jsoltren/SPARK-16554-submit.
2017-02-09 12:49:31 -06:00
Marcelo Vanzin 3fc8e8caf8 [SPARK-17874][CORE] Add SSL port configuration.
Make the SSL port configuration explicit, instead of deriving it
from the non-SSL port, but retain the existing functionality in
case anyone depends on it.

The change starts the HTTPS and HTTP connectors separately, so
that it's possible to use independent ports for each. For that to
work, the initialization of the server needs to be shuffled around
a bit. The change also makes it so the initialization of both
connectors is similar, and end up using the same Scheduler - previously
only the HTTP connector would use the correct one.

Also fixed some outdated documentation about a couple of services
that were removed long ago.

Tested with unit tests and by running spark-shell with SSL configs.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16625 from vanzin/SPARK-17874.
2017-02-09 22:06:46 +09:00
Sean Owen e8d3fca450
[SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier
## What changes were proposed in this pull request?

- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16810 from srowen/SPARK-19464.
2017-02-08 12:20:07 +00:00
manugarri 5a0569ce69 [MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming doc
There is a typo in http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream , python example n1 uses `readStream()` instead of `readStream`

Just removed the parenthesis.

Author: manugarri <manuel.garrido.pena@gmail.com>

Closes #16836 from manugarri/fix_kafka_python_doc.
2017-02-07 21:45:57 -08:00
krishnakalyan3 48aafeda7d [SPARK-19386][SPARKR][DOC] Bisecting k-means in SparkR documentation
## What changes were proposed in this pull request?
Update programming guide, example and vignette with Bisecting k-means.

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16767 from krishnakalyan3/bisecting-kmeans.
2017-02-03 12:19:47 -08:00
Zheng RuiFeng 04ee8cf633
[SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning
## What changes were proposed in this pull request?
Fix brokens links in ml-pipeline and ml-tuning
`<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16754 from zhengruifeng/doc_api_fix.
2017-02-01 13:27:20 +00:00
hyukjinkwon f1a1f2607d
[SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
## What changes were proposed in this pull request?

This PR proposes three things as below:

- Support LaTex inline-formula, `\( ... \)` in Scala API documentation
  It seems currently,

  ```
  \( ... \)
  ```

  are rendered as they are, for example,

  <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">

  It seems mistakenly more backslashes were added.

- Fix warnings Scaladoc/Javadoc generation
  This PR fixes t two types of warnings as below:

  ```
  [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
  [warn]   /**
  [warn]   ^
  ```

  ```
  [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
  [warn]  * `${var}`, `${system:var}` and `${env:var}`.
  [warn]      ^
  ```

- Fix Javadoc8 break
  ```
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
  [error]    *                          E.g., {link VectorUDT} for vector features.
  [error]                                            ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
  [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
  [error]                                                  ^
  ```

## How was this patch tested?

Manually via `sbt unidoc` and `jeykil build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16741 from HyukjinKwon/warn-and-break.
2017-02-01 13:26:16 +00:00
gatorsmile c0eda7e87f [SPARK-19396][DOC] JDBC Options are Case In-sensitive
### What changes were proposed in this pull request?
The case are not sensitive in JDBC options, after the PR https://github.com/apache/spark/pull/15884 is merged to Spark 2.1.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16734 from gatorsmile/fixDocCaseInsensitive.
2017-01-30 14:05:53 -08:00
aokolnychyi 3fdce81434 [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide
## What changes were proposed in this pull request?

- A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.

In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references to these examples and does not contain hard-coded snippets.

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as well.

![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #16329 from aokolnychyi/SPARK-16046.
2017-01-24 22:13:17 -08:00
Marcelo Vanzin 8f3f73abc1 [SPARK-19139][CORE] New auth mechanism for transport library.
This change introduces a new auth mechanism to the transport library,
to be used when users enable strong encryption. This auth mechanism
has better security than the currently used DIGEST-MD5.

The new protocol uses symmetric key encryption to mutually authenticate
the endpoints, and is very loosely based on ISO/IEC 9798.

The new protocol falls back to SASL when it thinks the remote end is old.
Because SASL does not support asking the server for multiple auth protocols,
which would mean we could re-use the existing SASL code by just adding a
new SASL provider, the protocol is implemented outside of the SASL API
to avoid the boilerplate of adding a new provider.

Details of the auth protocol are discussed in the included README.md
file.

This change partly undos the changes added in SPARK-13331; AES encryption
is now decoupled from SASL authentication. The encryption code itself,
though, has been re-used as part of this change.

## How was this patch tested?

- Unit tests
- Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
- Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16521 from vanzin/SPARK-19139.
2017-01-24 10:44:04 -08:00
Parag Chaudhari 0ff67a1cf9 [SPARK-14049][CORE] Add functionality in spark history sever API to query applications by end time
## What changes were proposed in this pull request?

Currently, spark history server REST API provides functionality to query applications by application start time range based on minDate and maxDate query parameters, but it  lacks support to query applications by their end time. In this pull request we are proposing optional minEndDate and maxEndDate query parameters and filtering capability based on these parameters to spark history server REST API. This functionality can be used for following queries,
1. Applications finished in last 'x' minutes
2. Applications finished before 'y' time
3. Applications finished between 'x' time to 'y' time
4. Applications started from 'x' time and finished before 'y' time.

For backward compatibility, we can keep existing minDate and maxDate query parameters as they are and they can continue support filtering based on start time range.
## How was this patch tested?

Existing unit tests and 4 new unit tests.

Author: Parag Chaudhari <paragpc@amazon.com>

Closes #11867 from paragpc/master-SHS-query-by-endtime_2.
2017-01-24 08:41:46 -06:00
uncleGen 7c61c2a1c4
[DOCS] Fix typo in docs
## What changes were proposed in this pull request?

Fix typo in docs

## How was this patch tested?

Author: uncleGen <hustyugm@gmail.com>

Closes #16658 from uncleGen/typo-issue.
2017-01-24 11:32:11 +00:00
Yuming Wang c99492141b
[SPARK-19146][CORE] Drop more elements when stageData.taskData.size > retainedTasks
## What changes were proposed in this pull request?

Drop more elements when `stageData.taskData.size > retainedTasks` to reduce the number of times on call drop function.

## How was this patch tested?

Jenkins

Author: Yuming Wang <wgyumg@gmail.com>

Closes #16527 from wangyum/SPARK-19146.
2017-01-23 11:02:22 +00:00
sarutak d50d12b49f
[SPARK-19302][DOC][MINOR] Fix the wrong item format in security.md
## What changes were proposed in this pull request?

In docs/security.md, there is a description as follows.

```
 steps to configure the key-stores and the trust-store for the standalone deployment mode is as
 follows:
 * Generate a keys pair for each node
 * Export the public key of the key pair to a file on each node
 * Import all exported public keys into a single trust-store
```

According to markdown format, the first item should follow a blank line.

## How was this patch tested?

Manually tested.

Following captures are rendered web page before and after fix.

* before
![before](https://cloud.githubusercontent.com/assets/4736016/22136731/b358115c-df19-11e6-8f6c-2f7b65766265.png)

* after
![after](https://cloud.githubusercontent.com/assets/4736016/22136745/c6366ff8-df19-11e6-840d-e7e894218f9c.png)

Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #16653 from sarutak/SPARK-19302.
2017-01-20 14:15:18 +00:00
jerryshao b79cc7ceb4 [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs
## What changes were proposed in this pull request?

`spark.yarn.access.namenodes` configuration cannot actually reflects the usage of it, inside the code it is the Hadoop filesystems we get tokens, not NNs. So here propose to update the name of this configuration, also change the related code and doc.

## How was this patch tested?

Local verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #16560 from jerryshao/SPARK-19179.
2017-01-17 09:30:56 -06:00
Maurus Cuelenaere 3df2d93146
[MINOR][DOC] Document local[*,F] master modes
## What changes were proposed in this pull request?

core/src/main/scala/org/apache/spark/SparkContext.scala contains LOCAL_N_FAILURES_REGEX master mode, but this was never documented, so do so.

## How was this patch tested?

By using the Github Markdown preview feature.

Author: Maurus Cuelenaere <mcuelenaere@gmail.com>

Closes #16562 from mcuelenaere/patch-1.
2017-01-15 11:14:50 +00:00
Bryan Cutler 3bc2eff888 [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts
## What changes were proposed in this pull request?

Adding option in spark-submit to allow overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality.  This will allow all artifact resolution to go through a central managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps.

This change restructures the creation of the IvySettings object in two distinct ways.  First, if the `spark.ivy.settings` option is not defined then `buildIvySettings` will create a default settings instance, as before, with defined repositories (Maven Central) included.  Second, if the option is defined, the ivy settings file will be loaded from the given path and only repositories defined within will be used for artifact resolution.
## How was this patch tested?

Existing tests for default behaviour, Manual tests that load a ivysettings.xml file with local and Nexus repositories defined.  Added new test to load a simple Ivy settings file with a local filesystem resolver.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Ian Hummel <ian@themodernlife.net>

Closes #15119 from BryanCutler/spark-custom-IvySettings.
2017-01-11 11:57:38 -08:00
jerryshao 4239a1081a [SPARK-19021][YARN] Generailize HDFSCredentialProvider to support non HDFS security filesystems
Currently Spark can only get token renewal interval from security HDFS (hdfs://), if Spark runs with other security file systems like webHDFS (webhdfs://), wasb (wasb://), ADLS, it will ignore these tokens and not get token renewal intervals from these tokens. These will make Spark unable to work with these security clusters. So instead of only checking HDFS token, we should generalize to support different DelegationTokenIdentifier.

## How was this patch tested?

Manually verified in security cluster.

Author: jerryshao <sshao@hortonworks.com>

Closes #16432 from jerryshao/SPARK-19021.
2017-01-11 09:24:02 -06:00
Shixiong Zhu bc6c56e940 [SPARK-19140][SS] Allow update mode for non-aggregation streaming queries
## What changes were proposed in this pull request?

This PR allow update mode for non-aggregation streaming queries. It will be same as the append mode if a query has no aggregations.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16520 from zsxwing/update-without-agg.
2017-01-10 17:58:11 -08:00
Peng, Meng 32286ba68a
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
2017-01-10 13:09:58 +00:00
Dongjoon Hyun 923e594844 [SPARK-18941][SQL][DOC] Add a new behavior document on CREATE/DROP TABLE with LOCATION
## What changes were proposed in this pull request?

This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276).

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

**Newly Added Description**
<img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png">

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16400 from dongjoon-hyun/SPARK-18941.
2017-01-07 18:55:01 -08:00
Sean Owen 54138f6e89
[SPARK-19106][DOCS] Styling for the configuration docs is broken
## What changes were proposed in this pull request?

configuration.html section headings were not specified correctly in markdown and weren't rendering, being recognized correctly. Removed extra p tags and pulled level 4 titles up to level 3, since level 3 had been skipped. This improves the TOC.

## How was this patch tested?

Doc build, manual check.

Author: Sean Owen <sowen@cloudera.com>

Closes #16490 from srowen/SPARK-19106.
2017-01-07 19:15:51 +00:00
Tathagata Das b59cddaba0 [SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options
## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.
2017-01-06 11:29:01 -08:00
jerryshao 4a4c3dc9ca [SPARK-19033][CORE] Add admin acls for history server
## What changes were proposed in this pull request?

Current HistoryServer's ACLs is derived from application event-log, which means the newly changed ACLs cannot be applied to the old data, this will become a problem where newly added admin cannot access the old application history UI, only the new application can be affected.

So here propose to add admin ACLs for history server, any configured user/group could have the view access to all the applications, while the view ACLs derived from application run-time still take effect.

## How was this patch tested?

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #16470 from jerryshao/SPARK-19033.
2017-01-06 10:07:54 -06:00
Wenchen Fan cca945b6aa [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
## What changes were proposed in this pull request?

Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for  details.

TODO(for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.

## How was this patch tested?

new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16296 from cloud-fan/create-table.
2017-01-05 17:40:27 -08:00
uncleGen 6873430cb5 [SPARK-19009][DOC] Add streaming rest api doc
## What changes were proposed in this pull request?

add streaming rest api doc

related to pr #16253

cc saturday-shi srowen

## How was this patch tested?

Author: uncleGen <hustyugm@gmail.com>

Closes #16414 from uncleGen/SPARK-19009.
2017-01-04 15:14:51 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Liang-Chi Hsieh 0ac2f1e71f
[MINOR][DOC] Minor doc change for YARN credential providers
## What changes were proposed in this pull request?

The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc is not updated yet.

## How was this patch tested?

N/A. Just doc change.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16444 from viirya/minor-credential-provider-doc.
2017-01-02 14:41:57 +00:00
Liwei Lin 808b84e2de
[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide
## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.
2017-01-02 14:40:06 +00:00
Cheng Lian 871f6114ac [SPARK-19016][SQL][DOC] Document scalable partition handling
## What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mention it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command is to have per-partition information persisted since 2.1.

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.
2016-12-30 14:46:30 -08:00
adesharatushar dba81e1dcd
[SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD
## What changes were proposed in this pull request?

Added missing Java example under section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving consistency of documentation.

## How was this patch tested?

Manual.
Generated docs using command "SKIP_API=1 jekyll build" and verified generated HTML page manually.

The syntax of example has been tested for correctness using sample code on Java1.7 and Spark 2.2.0-SNAPSHOT.

Author: adesharatushar <tushar_adeshara@persistent.com>

Closes #16408 from adesharatushar/streaming-doc-fix.
2016-12-29 22:03:34 +00:00
Tathagata Das 092c6725bf [SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status
## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.
2016-12-28 12:11:25 -08:00
Peng 79ff853631 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

ut will be added soon

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes #15212 from mpjlu/fdr_fwe.
2016-12-28 00:49:36 -08:00
Felix Cheung 2af8b5cffa [DOC][BUILD][MINOR] add doc on new make-distribution switches
## What changes were proposed in this pull request?

add example with `--pip` and `--r` switch as it is actually done in create-release

## How was this patch tested?

Doc only

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16364 from felixcheung/buildguide.
2016-12-27 22:37:37 -08:00
Yuexin Zhang 28ab0ec49f
[SPARK-19006][DOCS] mention spark.kryoserializer.buffer.max must be less than 2048m in doc
## What changes were proposed in this pull request?

On configuration doc page:https://spark.apache.org/docs/latest/configuration.html
We mentioned spark.kryoserializer.buffer.max : Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a "buffer limit exceeded" exception inside Kryo.
from source code, it has hard coded upper limit :
```
val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2))
{ throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " + s"2048 mb, got: + $maxBufferSizeMb mb.") }
```
We should mention "this value must be less than 2048 mb" on the configuration doc page as well.

## How was this patch tested?

None. Since it's minor doc change.

Author: Yuexin Zhang <yxzhang@cloudera.com>

Closes #16412 from cnZach/SPARK-19006.
2016-12-27 20:29:45 +00:00
Dongjoon Hyun ba4468bb24
[SPARK-18923][DOC][BUILD] Support skipping R/Python API docs
## What changes were proposed in this pull request?

We can build Python API docs by `cd ./python/docs && make html for Python` and R API docs by `cd ./R && sh create-docs.sh for R` separately. However, `jekyll` fails in some environments.

This PR aims to support `SKIP_PYTHONDOC` and `SKIP_RDOC` for documentation build in `docs` folder. Currently, we can use `SKIP_SCALADOC` or `SKIP_API`. The reason providing additional options is that the Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Python and R. Specifically, for Python and R,

- Python API docs requires `sphinx`.
- R API docs requires `R` installation and `knitr` (and more others libraries).

In other words, we cannot generate Python API docs without R installation. Also, we cannot generate R API docs without Python `sphinx` installation. If Spark provides `SKIP_PYTHONDOC` and `SKIP_RDOC` like `SKIP_SCALADOC`, it would be more convenient.

## How was this patch tested?

Manual.

**Skipping Scala/Java/Python API Doc Build**
```bash
$ cd docs
$ SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 jekyll build
$ ls api
DESCRIPTION R
```

**Skipping Scala/Java/R API Doc Build**
```bash
$ cd docs
$ SKIP_SCALADOC=1 SKIP_RDOC=1 jekyll build
$ ls api
python
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16336 from dongjoon-hyun/SPARK-18923.
2016-12-21 08:59:38 +00:00
Josh Rosen fa829ce21f [SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors
## What changes were proposed in this pull request?

Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks.

This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.

This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.

## How was this patch tested?

Tested via a new test case in `JobCancellationSuite`, plus manual testing.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16189 from JoshRosen/cancellation.
2016-12-19 18:43:59 -08:00
gatorsmile c0c9e1d27a
[SPARK-18918][DOC] Missing </td> in Configuration page
### What changes were proposed in this pull request?
The configuration page looks messy now, as shown in the nightly build:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html

Starting from the following location:

![screenshot 2016-12-18 00 26 33](https://cloud.githubusercontent.com/assets/11567269/21292396/ace4719c-c4b8-11e6-8dfd-d9ab95be43d5.png)

### How was this patch tested?
Attached is the screenshot generated in my local computer after the fix.
[Configuration - Spark 2.2.0 Documentation.pdf](https://github.com/apache/spark/files/659315/Configuration.-.Spark.2.2.0.Documentation.pdf)

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16327 from gatorsmile/docFix.
2016-12-18 09:02:04 +00:00
Felix Cheung 38fd163d0d [SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg
## What changes were proposed in this pull request?

Reorganizing content (copy/paste)

## How was this patch tested?

https://felixcheung.github.io/sparkr-vignettes.html

Previous:
https://felixcheung.github.io/sparkr-vignettes_old.html

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16301 from felixcheung/rvignettespass2.
2016-12-17 14:37:34 -08:00
Michal Senkyr 836c95b108
[SPARK-18723][DOC] Expanded programming guide information on wholeTex…
## What changes were proposed in this pull request?

Add additional information to wholeTextFiles in the Programming Guide. Also explain partitioning policy difference in relation to textFile and its impact on performance.

Also added reference to the underlying CombineFileInputFormat

## How was this patch tested?

Manual build of documentation and inspection in browser

```
cd docs
jekyll serve --watch
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16157 from michalsenkyr/wholeTextFilesExpandedDocs.
2016-12-16 17:43:39 +00:00
Imran Rashid 93cdb8a7d0 [SPARK-8425][CORE] Application Level Blacklisting
## What changes were proposed in this pull request?

This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application.  Resources are blacklisted based on tasks that fail, in tasksets that eventually complete successfully; they are automatically returned to the pool of active resources based on a timeout.  Full details are available in a design doc attached to the jira.
## How was this patch tested?

Added unit tests, ran them via Jenkins, also ran a handful of them in a loop to check for flakiness.

The added tests include:
- verifying BlacklistTracker works correctly
- verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
- an integration test for the entire scheduler with blacklisting in a few different scenarios

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #14079 from squito/blacklist-SPARK-8425.
2016-12-15 08:29:56 -06:00
Dongjoon Hyun ec0eae4863 [SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding DESCRIPTION file
## What changes were proposed in this pull request?

Since Apache Spark 1.4.0, R API document page has a broken link on `DESCRIPTION file` because Jekyll plugin script doesn't copy the file. This PR aims to fix that.

- Official Latest Website: http://spark.apache.org/docs/latest/api/R/index.html
- Apache Spark 2.1.0-rc2: http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html

## How was this patch tested?

Manual.

```bash
cd docs
SKIP_SCALADOC=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16292 from dongjoon-hyun/SPARK-18875.
2016-12-14 21:29:20 -08:00
Marcelo Vanzin bc59951bab [SPARK-18773][CORE] Make commons-crypto config translation consistent.
This change moves the logic that translates Spark configuration to
commons-crypto configuration to the network-common module. It also
extends TransportConf and ConfigProvider to provide the necessary
interfaces for the translation to work.

As part of the change, I removed SystemPropertyConfigProvider, which
was mostly used as an "empty config" in unit tests, and adjusted the
very few tests that required a specific config.

I also changed the config keys for AES encryption to live under the
"spark.network." namespace, which is more correct than their previous
names under "spark.authenticate.".

Tested via existing unit test.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16200 from vanzin/SPARK-18773.
2016-12-12 16:27:04 -08:00
Bill Chambers 70ffff21f7
[DOCS][MINOR] Clarify Where AccumulatorV2s are Displayed
## What changes were proposed in this pull request?

This PR clarifies where accumulators will be displayed.

## How was this patch tested?

No testing.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bill Chambers <bill@databricks.com>
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #16180 from anabranch/improve-acc-docs.
2016-12-12 13:33:17 +00:00
Dongjoon Hyun f3a3fed76c
[MINOR][DOCS] Remove Apache Spark Wiki address
## What changes were proposed in this pull request?

According to the notice of the following Wiki front page, we can remove the obsolete wiki pointer safely in `README.md` and `docs/index.md`, too. These two lines are the last occurrence of that links.

```
All current wiki content has been merged into pages at http://spark.apache.org as of November 2016.
Each page links to the new location of its information on the Spark web site.
Obsolete wiki content is still hosted here, but carries a notice that it is no longer current.
```

## How was this patch tested?

Manual.

- `README.md`: https://github.com/dongjoon-hyun/spark/tree/remove_wiki_from_readme
- `docs/index.md`:
```
cd docs
SKIP_API=1 jekyll build
```
![screen shot 2016-12-09 at 2 53 29 pm](https://cloud.githubusercontent.com/assets/9700541/21067323/517252e2-be1f-11e6-85b1-2a4471131c5d.png)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16239 from dongjoon-hyun/remove_wiki_from_readme.
2016-12-10 16:40:10 +00:00
Xiangrui Meng d2493a203e [SPARK-18812][MLLIB] explain "Spark ML"
## What changes were proposed in this pull request?

There has been some confusion around "Spark ML" vs. "MLlib". This PR adds some FAQ-like entries to the MLlib user guide to explain "Spark ML" and reduce the confusion.

I check the [Spark FAQ page](http://spark.apache.org/faq.html), which seems too high-level for the content here. So I added it to the MLlib user guide instead.

cc: mateiz

Author: Xiangrui Meng <meng@databricks.com>

Closes #16241 from mengxr/SPARK-18812.
2016-12-09 17:34:52 -08:00
Jacek Laskowski b162cc0c28
[MINOR][CORE][SQL][DOCS] Typo fixes
## What changes were proposed in this pull request?

Typo fixes

## How was this patch tested?

Local build. Awaiting the official build.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #16144 from jaceklaskowski/typo-fixes.
2016-12-09 18:45:57 +08:00
Yanbo Liang 9bf8f3cd4f [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide
## What changes were proposed in this pull request?
* Add all R examples for ML wrappers which were added during 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual example for each algorithm, which will be convenient for users to rerun them.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may different from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```.

## How was this patch tested?
Run all examples manually.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16148 from yanboliang/spark-18325.
2016-12-08 06:19:38 -08:00
sethah 82253617f5 [SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net
## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

## How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16139 from sethah/SPARK-18705.
2016-12-07 19:41:32 -08:00