Commit graph

1948 commits

Author SHA1 Message Date
Liwei Lin 3e01f12828
[DOC][MINOR] Kafka doc: breakup into lines
## Before

![before](https://cloud.githubusercontent.com/assets/15843379/20340231/99b039fe-ac1b-11e6-9ba9-b44582427459.png)

## After

![after](https://cloud.githubusercontent.com/assets/15843379/20340236/9d5796e2-ac1b-11e6-92bb-6da40ba1a383.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15903 from lw-lin/kafka-doc-lines.
2016-11-16 09:51:59 +00:00
Zheng RuiFeng 33be4da539
[SPARK-18427][DOC] Update docs of mllib.KMeans
## What changes were proposed in this pull request?
1,Remove `runs` from docs of mllib.KMeans
2,Add notes for `k` according to comments in sources
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15873 from zhengruifeng/update_doc_mllib_kmeans.
2016-11-15 15:44:50 +01:00
Michael Gummelt d89bfc9230 [SPARK-18232][MESOS] Support CNI
## What changes were proposed in this pull request?

Adds support for CNI-isolated containers

## How was this patch tested?

I launched SparkPi both with and without `spark.mesos.network.name`, and verified the job completed successfully.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #15740 from mgummelt/spark-342-cni.
2016-11-14 23:46:54 -08:00
Zheng RuiFeng c31def1ddc [SPARK-18428][DOC] Update docs for GraphX
## What changes were proposed in this pull request?
1, Add link of `VertexRDD` and `EdgeRDD`
2, Notify in `Vertex and Edge RDDs` that not all methods are listed
3, `VertexID` -> `VertexId`

## How was this patch tested?
No tests, only docs is modified

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15875 from zhengruifeng/update_graphop_doc.
2016-11-14 21:15:39 -08:00
Noritaka Sekiyama 9d07ceee78 [SPARK-18432][DOC] Changed HDFS default block size from 64MB to 128MB
Changed HDFS default block size from 64MB to 128MB.
https://issues.apache.org/jira/browse/SPARK-18432

Author: Noritaka Sekiyama <moomindani@gmail.com>

Closes #15879 from moomindani/SPARK-18432.
2016-11-14 21:07:59 +09:00
Denny Lee b91a51bb23 [SPARK-18426][STRUCTURED STREAMING] Python Documentation Fix for Structured Streaming Programming Guide
## What changes were proposed in this pull request?

Update the python section of the Structured Streaming Guide from .builder() to .builder

## How was this patch tested?

Validated documentation and successfully running the test example.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

'Builder' object is not callable object hence changed .builder() to
.builder

Author: Denny Lee <dennylee@gallifrey.local>

Closes #15872 from dennyglee/master.
2016-11-13 18:10:06 -08:00
Weiqing Yang 3af894511b [SPARK-16759][CORE] Add a configuration property to pass caller contexts of upstream applications into Spark
## What changes were proposed in this pull request?

Many applications take Spark as a computing engine and run on it. This PR adds a configuration property `spark.log.callerContext` that can be used by Spark's upstream applications (e.g. Oozie) to set up their caller contexts into Spark. In the end, Spark will combine its own caller context with the caller contexts of its upstream applications, and write them into Yarn RM log and HDFS audit log.

The audit log has a config to truncate the caller contexts passed in (default 128). The caller contexts will be sent over rpc, so it should be concise. The call context written into HDFS log and Yarn log consists of two parts: the information `A` specified by Spark itself and the value `B` of `spark.log.callerContext` property.  Currently `A` typically takes 64 to 74 characters,  so `B` can have up to 50 characters (mentioned in the doc `running-on-yarn.md`)
## How was this patch tested?

Manual tests. I have run some Spark applications with `spark.log.callerContext` configuration in Yarn client/cluster mode, and verified that the caller contexts were written into Yarn RM log and HDFS audit log correctly.

The ways to configure `spark.log.callerContext` property:
- In spark-defaults.conf:

```
spark.log.callerContext  infoSpecifiedByUpstreamApp
```
- In app's source code:

```
val spark = SparkSession
      .builder
      .appName("SparkKMeans")
      .config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
      .getOrCreate()
```

When running on Spark Yarn cluster mode, the driver is unable to pass 'spark.log.callerContext' to Yarn client and AM since Yarn client and AM have already started before the driver performs `.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")`.

The following  example shows the command line used to submit a SparkKMeans application and the corresponding records in Yarn RM log and HDFS audit log.

Command:

```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
```

Yarn RM log:

<img width="1440" alt="screen shot 2016-10-19 at 9 12 03 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547050/7d2f278c-9649-11e6-9df8-8d5ff12609f0.png">

HDFS audit log:

<img width="1400" alt="screen shot 2016-10-19 at 10 18 14 pm" src="https://cloud.githubusercontent.com/assets/8546874/19547102/096060ae-964a-11e6-981a-cb28efd5a058.png">

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15563 from weiqingy/SPARK-16759.
2016-11-11 18:36:23 -08:00
Junjie Chen 4f15d94cfe [SPARK-13331] AES support for over-the-wire encryption
## What changes were proposed in this pull request?

DIGEST-MD5 mechanism is used for SASL authentication and secure communication. DIGEST-MD5 mechanism supports 3DES, DES, and RC4 ciphers. However, 3DES, DES and RC4 are slow relatively.

AES provide better performance and security by design and is a replacement for 3DES according to NIST. Apache Common Crypto is a cryptographic library optimized with AES-NI, this patch employ Apache Common Crypto as enc/dec backend for SASL authentication and secure channel to improve spark RPC.
## How was this patch tested?

Unit tests and Integration test.

Author: Junjie Chen <junjie.j.chen@intel.com>

Closes #15172 from cjjnjust/shuffle_rpc_encrypt.
2016-11-11 10:37:58 -08:00
Zheng RuiFeng b1033fb745
[MINOR][DOC] Unify example marks
## What changes were proposed in this pull request?
1, `**Example**` => `**Examples**`, because more algos use `**Examples**`.
2,  delete `### Examples` in `Isotonic regression`, because it's not that special in http://spark.apache.org/docs/latest/ml-classification-regression.html
3, add missing marks for `LDA` and other algos.

## How was this patch tested?
No tests for it only modify doc

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15783 from zhengruifeng/doc_fix.
2016-11-08 14:04:07 +00:00
chie8842 ee2e741ac1
[SPARK-13770][DOCUMENTATION][ML] Document the ML feature Interaction
I created Scala and Java example and added documentation.

Author: chie8842 <hayashidac@nttdata.co.jp>

Closes #15658 from hayashidac/SPARK-13770.
2016-11-08 13:45:37 +00:00
fidato 6f3697136a [SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles
## What changes were proposed in this pull request?

This Pull request comprises of the critical bug SPARK-16575 changes. This change rectifies the issue with BinaryFileRDD partition calculations as  upon creating an RDD with sc.binaryFiles, the resulting RDD always just consisted of two partitions only.
## How was this patch tested?

The original issue ie. getNumPartitions on binary Files RDD (always having two partitions) was first replicated and then tested upon the changes. Also the unit tests have been checked and passed.

This contribution is my original work and I licence the work to the project under the project's open source license

srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look .

Author: fidato <fidato.july13@gmail.com>

Closes #15327 from fidato13/SPARK-16575.
2016-11-07 18:41:17 -08:00
Sean Owen dc4c600986 [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
## What changes were proposed in this pull request?

Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15733 from srowen/SPARK-18138.
2016-11-03 17:27:23 -07:00
Liwei Lin 98ede49496
[SPARK-18198][DOC][STREAMING] Highlight code snippets
## What changes were proposed in this pull request?

This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight code snippets in the `Structured Streaming Kafka010 integration doc` and the `Spark Streaming Kafka010 integration doc`.

This patch consists of two commits:
- the first commit fixes only the leading spaces -- this is large
- the second commit adds the highlight instructions -- this is much simpler and easier to review

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshots

**Before**

![snip20161101_3](https://cloud.githubusercontent.com/assets/15843379/19894258/47746524-a087-11e6-9a2a-7bff2d428d44.png)

**After**

![snip20161101_1](https://cloud.githubusercontent.com/assets/15843379/19894324/8bebcd1e-a087-11e6-835b-88c4d2979cfa.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15715 from lw-lin/doc-highlight-code-snippet.
2016-11-02 09:10:34 +00:00
Joseph K. Bradley 91c33a0ca5 [SPARK-18088][ML] Various ChiSqSelector cleanups
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15647 from jkbradley/chisqselector-follow-ups.
2016-11-01 17:00:00 -07:00
Josh Rosen 6e6298154a [SPARK-17350][SQL] Disable default use of KryoSerializer in Thrift Server
In SPARK-4761 / #3621 (December 2014) we enabled Kryo serialization by default in the Spark Thrift Server. However, I don't think that the original rationale for doing this still holds now that most Spark SQL serialization is now performed via encoders and our UnsafeRow format.

In addition, the use of Kryo as the default serializer can introduce performance problems because the creation of new KryoSerializer instances is expensive and we haven't performed instance-reuse optimizations in several code paths (including DirectTaskResult deserialization).

Given all of this, I propose to revert back to using JavaSerializer as the default serializer in the Thrift Server.

/cc liancheng

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14906 from JoshRosen/disable-kryo-in-thriftserver.
2016-11-01 16:23:47 -07:00
Charles Allen e34b4e1267
[SPARK-15994][MESOS] Allow enabling Mesos fetch cache in coarse executor backend
Mesos 0.23.0 introduces a Fetch Cache feature http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of resources specified in command URIs.

This patch:
- Updates the Mesos shaded protobuf dependency to 0.23.0
- Allows setting `spark.mesos.fetcherCache.enable` to enable the fetch cache for all specified URIs. (URIs must be specified for the setting to have any affect)
- Updates documentation for Mesos configuration with the new setting.

This patch does NOT:
- Allow for per-URI caching configuration. The cache setting is global to ALL URIs for the command.

Author: Charles Allen <charles@allen-net.com>

Closes #13713 from drcrallen/SPARK15994.
2016-11-01 13:14:17 +00:00
Dongjoon Hyun 623fc7fc67
[MINOR][DOC] Remove spaces following slashs
## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.
2016-11-01 13:08:49 +00:00
Hossein 2881a2d1d1 [SPARK-17919] Make timeout to RBackend configurable in SparkR
## What changes were proposed in this pull request?

This patch makes RBackend connection timeout configurable by user.

## How was this patch tested?
N/A

Author: Hossein <hossein@databricks.com>

Closes #15471 from falaki/SPARK-17919.
2016-10-30 16:17:23 -07:00
Liwei Lin 505b927cb7
[SPARK-16312][FOLLOW-UP][STREAMING][KAFKA][DOC] Add java code snippet for Kafka 0.10 integration doc
## What changes were proposed in this pull request?

added java code snippet for Kafka 0.10 integration doc

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshot

![kafka-doc](https://cloud.githubusercontent.com/assets/15843379/19826272/bf0d8a4c-9db8-11e6-9e40-1396723df4bc.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15679 from lw-lin/kafka-010-examples.
2016-10-30 09:32:19 +00:00
VinceShieh 0b076d4cb6 [SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
          .setInputCol("feature")
          .setOutputCol("result")
          .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
          .setInputCol("feature")
          .setOutputCol("result")
          .setSplits(splits)
          .setHandleNaN("keep")

## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
2016-10-27 11:52:15 -07:00
cody koeninger 1042325805 [SPARK-17813][SQL][KAFKA] Maximum data per trigger
## What changes were proposed in this pull request?

maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.

## How was this patch tested?

Added unit test

Author: cody koeninger <cody@koeninger.org>

Closes #15527 from koeninger/SPARK-17813.
2016-10-27 10:30:59 -07:00
Felix Cheung 44c8bfda79 [SQL][DOC] updating doc for JSON source to link to jsonlines.org
## What changes were proposed in this pull request?

API and programming guide doc changes for Scala, Python and R.

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15629 from felixcheung/jsondoc.
2016-10-26 23:06:11 -07:00
Alex Bozarth 5d0f81da49
[SPARK-4411][WEB UI] Add "kill" link for jobs in the UI
## What changes were proposed in this pull request?

Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation.

## How was this patch tested?

Manually tested and dev/run-tests

![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>
Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #15441 from ajbozarth/spark4411.
2016-10-26 14:26:54 +02:00
Sean Owen 4ecbe1b92f
[SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
## What changes were proposed in this pull request?

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15382 from srowen/SPARK-17810.
2016-10-24 10:44:45 +01:00
Sandeep Singh bc167a2a53 [SPARK-928][CORE] Add support for Unsafe-based serializer in Kryo
## What changes were proposed in this pull request?
Now since we have migrated to Kryo-3.0.0 in https://issues.apache.org/jira/browse/SPARK-11416, we can gives users option to use unsafe SerDer. It can turned by setting `spark.kryo.useUnsafe` to `true`

## How was this patch tested?
Ran existing tests

```
     Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      basicTypes: Int unsafe:true                    160 /  178         98.5          10.1       1.0X
      basicTypes: Long unsafe:true                   210 /  218         74.9          13.4       0.8X
      basicTypes: Float unsafe:true                  203 /  213         77.5          12.9       0.8X
      basicTypes: Double unsafe:true                 226 /  235         69.5          14.4       0.7X
      Array: Int unsafe:true                        1087 / 1101         14.5          69.1       0.1X
      Array: Long unsafe:true                       2758 / 2844          5.7         175.4       0.1X
      Array: Float unsafe:true                      1511 / 1552         10.4          96.1       0.1X
      Array: Double unsafe:true                     2942 / 2972          5.3         187.0       0.1X
      Map of string->Double unsafe:true             2645 / 2739          5.9         168.2       0.1X
      basicTypes: Int unsafe:false                   211 /  218         74.7          13.4       0.8X
      basicTypes: Long unsafe:false                  247 /  253         63.6          15.7       0.6X
      basicTypes: Float unsafe:false                 211 /  216         74.5          13.4       0.8X
      basicTypes: Double unsafe:false                227 /  233         69.2          14.4       0.7X
      Array: Int unsafe:false                       3012 / 3032          5.2         191.5       0.1X
      Array: Long unsafe:false                      4463 / 4515          3.5         283.8       0.0X
      Array: Float unsafe:false                     2788 / 2868          5.6         177.2       0.1X
      Array: Double unsafe:false                    3558 / 3752          4.4         226.2       0.0X
      Map of string->Double unsafe:false            2806 / 2933          5.6         178.4       0.1X
```

Author: Sandeep Singh <sandeep@techaddict.me>
Author: Sandeep Singh <sandeep@origamilogic.com>

Closes #12913 from techaddict/SPARK-928.
2016-10-22 12:03:37 -07:00
Sean Owen 01b26a0643
[SPARK-17898][DOCS] repositories needs username and password
## What changes were proposed in this pull request?

Document `user:password` syntax as possible means of specifying credentials for password-protected `--repositories`

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15584 from srowen/SPARK-17898.
2016-10-22 09:39:07 +01:00
cody koeninger c9720b2195 [STREAMING][KAFKA][DOC] clarify kafka settings needed for larger batches
## What changes were proposed in this pull request?

Minor doc change to mention kafka configuration for larger spark batches.

## How was this patch tested?

Doc change only, confirmed via jekyll.

The configuration issue was discussed / confirmed with users on the mailing list.

Author: cody koeninger <cody@koeninger.org>

Closes #15570 from koeninger/kafka-doc-heartbeat.
2016-10-21 16:27:19 -07:00
cody koeninger 268ccb9a48 [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream
## What changes were proposed in this pull request?

startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy

assign with specific topicpartitions as a consumer strategy

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15504 from koeninger/SPARK-17812.
2016-10-21 15:55:04 -07:00
Felix Cheung e21e1c946c [SPARK-18013][SPARKR] add crossJoin API
## What changes were proposed in this pull request?

Add crossJoin and do not default to cross join if joinExpr is left out

## How was this patch tested?

unit test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15559 from felixcheung/rcrossjoin.
2016-10-21 12:35:37 -07:00
Mark Grover 2d14ab7e64 [DOCS] Update docs to not suggest to package Spark before running tests.
## What changes were proposed in this pull request?

Update docs to not suggest to package Spark before running tests.

## How was this patch tested?

Not creating a JIRA since this pretty small. We haven't had the need to run mvn package before mvn test since 1.6 at least, or so I am told. So, updating the docs to not be misguiding.

Author: Mark Grover <mark@apache.org>

Closes #15572 from markgrover/doc_update.
2016-10-20 15:30:01 -07:00
Takuya UESHIN 9540357ada
[SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15548 from ueshin/issues/SPARK-17985.
2016-10-19 10:06:43 +01:00
Tommy YU f39852e598 [SPARK-18001][DOCUMENT] fix broke link to SparkDataFrame
## What changes were proposed in this pull request?

In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)"

Link to R DataFrame doesn't work that return
The requested URL /docs/latest/api/R/DataFrame.html was not found on this server.

Correct link is SparkDataFrame.html for spark 2.0

## How was this patch tested?

Manual checked.

Author: Tommy YU <tummyyu@163.com>

Closes #15543 from Wenpei/spark-18001.
2016-10-18 21:15:32 -07:00
Reynold Xin cd662bc7a2 Revert "[SPARK-17985][CORE] Bump commons-lang3 version to 3.5."
This reverts commit bfe7885aee.

The commit caused build failures on Hadoop 2.2 profile:

```
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
[error]       var numBytes = IOUtils.read(gzInputStream, buf)
[error]                              ^
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
[error]         numBytes = IOUtils.read(gzInputStream, buf)
[error]                            ^
```
2016-10-18 13:56:35 -07:00
Weiqing Yang 20dd11096c [MINOR][DOC] Add more built-in sources in sql-programming-guide.md
## What changes were proposed in this pull request?
Add more built-in sources in sql-programming-guide.md.

## How was this patch tested?
Manually.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15522 from weiqingy/dsDoc.
2016-10-18 13:38:14 -07:00
Takuya UESHIN bfe7885aee [SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety, which gets stack sometimes caused by race condition of initializing hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15525 from ueshin/issues/SPARK-17985.
2016-10-18 13:36:00 -07:00
Yu Peng 231f39e3f6 [SPARK-17711] Compress rolled executor log
## What changes were proposed in this pull request?

This PR adds support for executor log compression.

## How was this patch tested?

Unit tests

cc: yhuai tdas mengxr

Author: Yu Peng <loneknightpy@gmail.com>

Closes #15285 from loneknightpy/compress-executor-log.
2016-10-18 13:23:31 -07:00
Reynold Xin 72a6e7a57a Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors"
This reverts commit ed14633414.

The patch merged had obvious quality and documentation issue. The idea is useful, and we should work towards improving its quality and merging it in again.
2016-10-15 22:31:37 -07:00
Zhan Zhang ed14633414 [SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors
## What changes were proposed in this pull request?

Restructure the code and implement two new task assigner.
PackedAssigner: try to allocate tasks to the executors with least available cores, so that spark can release reserved executors when dynamic allocation is enabled.

BalancedAssigner: try to allocate tasks to the executors with more available cores in order to balance the workload across all executors.

By default, the original round robin assigner is used.

We test a pipeline, and new PackedAssigner  save around 45% regarding the reserved cpu and memory with dynamic allocation enabled.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Both unit test in TaskSchedulerImplSuite and manual tests in production pipeline.

Author: Zhan Zhang <zhanzhang@fb.com>

Closes #15218 from zhzhan/packed-scheduler.
2016-10-15 18:45:04 -07:00
Dhruve Ashar a0ebcb3a30
[DOC] Fix typo in sql hive doc
Change is too trivial to file a JIRA.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15485 from dhruve/master.
2016-10-14 17:45:27 +01:00
Imran Rashid 9ce7d3e542 [SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425.

To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making spark fault-tolerant of one-bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).

## How was this patch tested?

Added unit tests, run tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #15249 from squito/taskset_blacklist_only.
2016-10-12 16:43:03 -05:00
cody koeninger c264ef9b19 [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad
## What changes were proposed in this pull request?

Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.

## How was this patch tested?

I built jekyll doc and made sure it looked ok.

Author: cody koeninger <cody@koeninger.org>

Closes #15442 from koeninger/SPARK-17853.
2016-10-12 00:40:47 -07:00
Kousuke Saruta b512f04f8e [SPARK-17880][DOC] The url linking to AccumulatorV2 in the document is incorrect.
## What changes were proposed in this pull request?

In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.

## How was this patch tested?
manual test.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #15439 from sarutak/SPARK-17880.
2016-10-11 22:36:57 -07:00
Alexander Pivovarov 299eb04ba0 Fix hadoop.version in building-spark.md
Couple of mvn build examples use `-Dhadoop.version=VERSION` instead of actual version number

Author: Alexander Pivovarov <apivovarov@gmail.com>

Closes #15440 from apivovarov/patch-1.
2016-10-11 22:31:21 -07:00
hyukjinkwon 0c0ad436ad [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
## What changes were proposed in this pull request?

This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.

This PR includes some changes as below:

  - Unify `Map[String, String]`, `Properties` and `JDBCOptions` in `execution/jdbc` package to `JDBCOptions`.

- Move `batchsize`, `fetchszie`, `driver` and `isolationlevel` options into `JDBCOptions` instance.

- Document `batchSize` and `isolationlevel` with marking both read-only options and write-only options. Also, this includes minor types and detailed explanation for some statements such as url.

- Throw exceptions fast by checking arguments first rather than in execution time (e.g. for `fetchsize`).

- Exclude Spark-only options in connection properties.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15292 from HyukjinKwon/SPARK-17719.
2016-10-10 22:22:41 -07:00
Timothy Chen 29f186bfdf
[SPARK-14082][MESOS] Enable GPU support with Mesos
## What changes were proposed in this pull request?

Enable GPU resources to be used when running coarse grain mode with Mesos.

## How was this patch tested?

Manual test with GPU.

Author: Timothy Chen <tnachen@gmail.com>

Closes #14644 from tnachen/gpu_mesos.
2016-10-10 23:20:15 +02:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp`(configurable via SparkConf), and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.

changes for `SessionCatalog`:

1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
Sean Owen cff5607552 [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
## What changes were proposed in this pull request?

This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.

(I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)

This also adds `sc.stop()` to the quick start guide example.

## How was this patch tested?

Existing tests; _pending_ at least manual verification of the fix.

Author: Sean Owen <sowen@cloudera.com>

Closes #15381 from srowen/SPARK-17707.
2016-10-07 10:31:41 -07:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00
sethah 9df54f5325
[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression
## What changes were proposed in this pull request?
Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training.

## How was this patch tested?
Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15349 from sethah/SPARK-17239.
2016-10-05 18:28:21 +00:00
Sean Owen 1dd68d3827
[SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer in MLlib docs
## What changes were proposed in this pull request?

Move note about labels being +1/-1 in formulation only to be just under the table of formulations.

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15330 from srowen/SPARK-17718.
2016-10-03 18:09:36 +00:00
Jagadeesan a27033c0bb
[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…
## What changes were proposed in this pull request?

To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~

… pandoc]

Author: Jagadeesan <as2@us.ibm.com>

Closes #15309 from jagadeesanas2/SPARK-17736.
2016-10-03 10:46:38 +01:00
Dongjoon Hyun 15e9bbb49e [MINOR][DOC] Add an up-to-date description for default serialization during shuffling
## What changes were proposed in this pull request?

This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark starts to choose Kyro as a default serialization library during shuffling of simple types, arrays of simple types, or string type.

## How was this patch tested?

This is a documentation update.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
2016-09-30 22:05:59 -07:00
Dongjoon Hyun 39eb3bb1ec [SPARK-17412][DOC] All test should not be run by root or any admin user
## What changes were proposed in this pull request?

`FsHistoryProviderSuite` fails if `root` user runs it. The test case **SPARK-3697: ignore directories that cannot be read** depends on `setReadable(false, false)` to make test data files and expects the number of accessible files is 1. But, `root` can access all files, so it returns 2.

This PR adds the assumption explicitly on doc. `building-spark.md`.

## How was this patch tested?

This is a documentation change.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15291 from dongjoon-hyun/SPARK-17412.
2016-09-29 16:01:45 -07:00
José Hiram Soltren 958200497a [DOCS] Reorganize explanation of Accumulators and Broadcast Variables
## What changes were proposed in this pull request?

The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this:

 - Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and
 - Move this section after "Checkpointing".

## How was this patch tested?

Testing: ran

$ SKIP_API=1 jekyll build

, and verified changes in a Web browser pointed at docs/_site/index.html.

Author: José Hiram Soltren <jose@cloudera.com>

Closes #15281 from jsoltren/doc-changes.
2016-09-29 10:18:56 -07:00
Takeshi YAMAMURO b2e9731ca4
[MINOR][DOCS] Fix th doc. of spark-streaming with kinesis
## What changes were proposed in this pull request?
This pr is just to fix the document of `spark-kinesis-integration`.
Since `SPARK-17418` prevented all the kinesis stuffs (including kinesis example code)
from publishing,  `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` does not work.
Instead, it fetches the kinesis jar from the Spark Package.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #15260 from maropu/DocFixKinesis.
2016-09-29 08:26:03 -04:00
Shuai Lin b2a7eedcdd
[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector
## What changes were proposed in this pull request?

A follow up for #14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
2016-09-28 06:12:48 -04:00
Andrew Mills 00be16df64 [Docs] Update spark-standalone.md to fix link
Corrected a link to the configuration.html page, it was pointing to a page that does not exist (configurations.html).

Documentation change, verified in preview.

Author: Andrew Mills <ammills01@users.noreply.github.com>

Closes #15244 from ammills01/master.
2016-09-26 16:41:14 -04:00
Liang-Chi Hsieh 8135e0e5eb [SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing
## What changes were proposed in this pull request?

When reading file stream with non-globbing path, the results return data with all `null`s for the
partitioned columns. E.g.,

    case class A(id: Int, value: Int)
    val data = spark.createDataset(Seq(
      A(1, 1),
      A(2, 2),
      A(2, 3))
    )
    val url = "/tmp/test"
    data.write.partitionBy("id").parquet(url)
    spark.read.parquet(url).show

    +-----+---+
    |value| id|
    +-----+---+
    |    2|  2|
    |    3|  2|
    |    1|  1|
    +-----+---+

    val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
    s.writeStream.queryName("test").format("memory").start()

    sql("SELECT * FROM test").show

    +-----+----+
    |value|  id|
    +-----+----+
    |    2|null|
    |    3|null|
    |    1|null|
    +-----+----+

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #14803 from viirya/filestreamsource-option.
2016-09-26 13:07:11 -07:00
Justin Pihony 50b89d05b7
[SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc
## What changes were proposed in this pull request?

This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save.

## How was this patch tested?

This was tested via unit tests in the JDBCWriteSuite, of which I added one new test to cover this scenario.

## Additional details

rxin This seems to have been most recently touched by you and was also commented on in the JIRA.

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Justin Pihony <justin.pihony@gmail.com>
Author: Justin Pihony <justin.pihony@typesafe.com>

Closes #12601 from JustinPihony/jdbc_reconciliation.
2016-09-26 09:54:22 +01:00
Jeff Zhang f62ddc5983 [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio
## What changes were proposed in this pull request?

Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala).
```
    if (args.isR && clusterManager == YARN) {
      val sparkRPackagePath = RUtils.localSparkRPackagePath
      if (sparkRPackagePath.isEmpty) {
        printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
      }
      val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
      if (!sparkRPackageFile.exists()) {
        printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
      }
      val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString

      // Distribute the SparkR package.
      // Assigns a symbol link name "sparkr" to the shipped package.
      args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")

      // Distribute the R package archive containing all the built R packages.
      if (!RUtils.rPackages.isEmpty) {
        val rPackageFile =
          RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
        if (!rPackageFile.exists()) {
          printErrorAndExit("Failed to zip all the built R packages.")
        }

        val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
        // Assigns a symbol link name "rpkg" to the shipped package.
        args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
      }
    }
```
So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor.  Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster.

## How was this patch tested?

Verify it manually in R Studio using the following code.
```
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)

```

…

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14784 from zjffdu/SPARK-17210.
2016-09-23 11:37:43 -07:00
frreiss 646f383465
[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.
## What changes were proposed in this pull request?

Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8.

## How was this patch tested?

Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser.

Author: frreiss <frreiss@us.ibm.com>

Closes #15005 from frreiss/fred-17421a.
2016-09-22 10:31:15 +01:00
Marcelo Vanzin 2cd1bfa4f0 [SPARK-4563][CORE] Allow driver to advertise a different network address.
The goal of this feature is to allow the Spark driver to run in an
isolated environment, such as a docker container, and be able to use
the host's port forwarding mechanism to be able to accept connections
from the outside world.

The change is restricted to the driver: there is no support for achieving
the same thing on executors (or the YARN AM for that matter). Those still
need full access to the outside world so that, for example, connections
can be made to an executor's block manager.

The core of the change is simple: add a new configuration that tells what's
the address the driver should bind to, which can be different than the address
it advertises to executors (spark.driver.host). Everything else is plumbing
the new configuration where it's needed.

To use the feature, the host starting the container needs to set up the
driver's port range to fall into a range that is being forwarded; this
required the block manager port to need a special configuration just for
the driver, which falls back to the existing spark.blockManager.port when
not set. This way, users can modify the driver settings without affecting
the executors; it would theoretically be nice to also have different
retry counts for driver and executors, but given that docker (at least)
allows forwarding port ranges, we can probably live without that for now.

Because of the nature of the feature it's kinda hard to add unit tests;
I just added a simple one to make sure the configuration works.

This was tested with a docker image running spark-shell with the following
command:

 docker blah blah blah \
   -p 38000-38100:38000-38100 \
   [image] \
   spark-shell \
     --num-executors 3 \
     --conf spark.shuffle.service.enabled=false \
     --conf spark.dynamicAllocation.enabled=false \
     --conf spark.driver.host=[host's address] \
     --conf spark.driver.port=38000 \
     --conf spark.driver.blockManager.port=38020 \
     --conf spark.ui.port=38040

Running on YARN; verified the driver works, executors start up and listen
on ephemeral ports (instead of using the driver's config), and that caching
and shuffling (without the shuffle service) works. Clicked through the UI
to make sure all pages (including executor thread dumps) worked. Also tested
apps without docker, and ran unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15120 from vanzin/SPARK-4563.
2016-09-21 14:42:41 -07:00
VinceShieh 57dc326bd0
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
2016-09-21 10:20:57 +01:00
sandy bbe0b1d623
[SPARK-17575][DOCS] Remove extra table tags in configuration document
## What changes were proposed in this pull request?

Remove extra table tags in configurations document.

## How was this patch tested?

Run all test cases and generate document.

Before with extra tag its look like below
![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png)

![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png)

After removing tags its looks like below

![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png)

![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png)

Author: sandy <phalodi@gmail.com>

Closes #15130 from phalodi/SPARK-17575.
2016-09-17 16:25:03 +01:00
Daniel Darabos 69cb049697
Correct fetchsize property name in docs
## What changes were proposed in this pull request?

Replace `fetchSize` with `fetchsize` in the docs.

## How was this patch tested?

I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #14975 from darabos/patch-3.
2016-09-17 12:28:42 +01:00
Sean Owen dc0a4c9161 [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages
## What changes were proposed in this pull request?

Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects

This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15075 from srowen/SPARK-17445.
2016-09-14 10:10:16 +01:00
Jagadeesan def7c265f5 [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…
## What changes were proposed in this pull request?

The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document.

… network timeout]

Author: Jagadeesan <as2@us.ibm.com>

Closes #15042 from jagadeesanas2/SPARK-17449.
2016-09-14 09:03:16 +01:00
Satendra Kumar 7098a12945 Streaming doc correction.
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Streaming doc correction.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Satendra Kumar <satendra@knoldus.com>

Closes #14996 from satendrakumar06/patch-1.
2016-09-09 19:15:06 +01:00
Gurvinder Singh 92ce8d4849 [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI
## What changes were proposed in this pull request?

This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as

WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/

This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy

## How was this patch tested?

The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.

pwendell bomeng BryanCutler can you please review it, thanks.

Author: Gurvinder Singh <gurvinder.singh@uninett.no>

Closes #13950 from gurvindersingh/rproxy.
2016-09-08 17:20:20 -07:00
Yangyang Liu 5bea8757cc [SPARK-16619] Add shuffle service metrics entry in monitoring docs
After change [SPARK-16405](https://github.com/apache/spark/pull/14080), we need to update docs by adding shuffle service metrics entry in currently supporting metrics list.

Author: Yangyang Liu <yangyangliu@fb.com>

Closes #14254 from lovexi/yangyang-monitoring-doc.
2016-09-01 17:01:16 -07:00
Sean Owen 3893e8c576 [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14895 from srowen/SPARK-17331.
2016-09-01 12:13:07 -07:00
Seigneurin, Alexis (CONT) dd859f95c0 fixed typos
fixed 2 typos

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14877 from aseigneurin/fix-typo-2.
2016-09-01 09:32:05 +01:00
Jeff Zhang fa6347938f [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf
## What changes were proposed in this pull request?

Allow user to set sparkr shell command through --conf spark.r.shell.command

## How was this patch tested?

Unit test is added and also verify it manually through
```
bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14744 from zjffdu/SPARK-17178.
2016-08-31 00:20:41 -07:00
Alex Bozarth f7beae6da0 [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history
## What changes were proposed in this pull request?

With the new History Server the summary page loads the application list via the the REST API, this makes it very slow to impossible to load with large (10K+) application history. This pr fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays, all the Application UI's are still accessible if the user knows the App ID and goes to the Application UI directly.)

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14835 from ajbozarth/spark17243.
2016-08-30 16:33:54 -05:00
Ferdinand Xu 4b4e329e49 [SPARK-5682][CORE] Add encrypted shuffle in spark
This patch is using Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.
2016-08-30 09:15:31 -07:00
Dmitriy Sokolov d4eee9932e [MINOR][DOCS] Fix minor typos in python example code
## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.
2016-08-30 11:23:37 +01:00
Seigneurin, Alexis (CONT) 08913ce000 fixed a typo
idempotant -> idempotent

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #14833 from aseigneurin/fix-typo.
2016-08-29 13:12:10 +01:00
Sean Owen e07baf1412 [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True
## What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

## How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

Author: Sean Owen <sowen@cloudera.com>

Closes #14663 from srowen/SPARK-17001.
2016-08-27 08:48:56 +01:00
Michael Gummelt 8e5475be3c [SPARK-16967] move mesos to module
## What changes were proposed in this pull request?

Move Mesos code into a mvn module

## How was this patch tested?

unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14637 from mgummelt/mesos-module.
2016-08-26 12:25:22 -07:00
Shixiong Zhu 341e0e778d [SPARK-17242][DOCUMENT] Update links of external dstream projects
## What changes were proposed in this pull request?

Updated links of external dstream projects.

## How was this patch tested?

Just document changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14814 from zsxwing/dstream-link.
2016-08-25 21:08:42 -07:00
Alex Bozarth 891ac2b914 [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIData
## What changes were proposed in this pull request?

Based on #12990 by tankkyo

Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000)

(This is a "quick fix" to help those running into the problem until a update of how the history server loads app data can be done)

## How was this patch tested?

Manual testing and dev/run-tests

![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14673 from ajbozarth/spark15083.
2016-08-24 14:39:41 -05:00
Yanbo Liang 45b786aca2 [MINOR][DOC] Fix wrong ml.feature.Normalizer document.
## What changes were proposed in this pull request?
The ```ml.feature.Normalizer``` examples illustrate L1 norm rather than L2, we should correct corresponding document.
![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png)

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14787 from yanboliang/normalizer.
2016-08-24 08:24:16 -07:00
hyukjinkwon 588559911d [MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides
## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.
2016-08-23 21:21:43 +01:00
Sean Owen 342278c09c [SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6
## What changes were proposed in this pull request?

Collect GC discussion in one section, and documenting findings about G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #14732 from srowen/SPARK-16320.
2016-08-22 11:15:53 -07:00
Jagadeesan bd9655063b [SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED OPERATIONS]
Changes in  Spark Stuctured Streaming doc in this link
https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations

Author: Jagadeesan <as2@us.ibm.com>

Closes #14715 from jagadeesanas2/SPARK-17085.
2016-08-22 09:30:31 +01:00
GraceH 4b6c2cbcb1 [SPARK-16968] Document additional options in jdbc Writer
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
This is the document for previous JDBC Writer options.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Unit test has been added in previous PR.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: GraceH <jhuang1@paypal.com>

Closes #14683 from GraceH/jdbc_options.
2016-08-22 09:03:46 +01:00
wm624@hotmail.com e328f577e8 [SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSL
## What changes were proposed in this pull request?

`spark.ssl.enabled`=true, but failing to set `spark.ssl.protocol` will fail and throw meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled`.

Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throws an exception to indicate that.

Remove the OrElse("default").

Document this requirement in configure.md

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual tests:
Build document and check document

Configure `spark.ssl.enabled` only, it throws exception below:
6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(mwang); groups with view permissions: Set(); users  with modify permissions: Set(mwang); groups with modify permissions: Set()
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections.
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285)
	at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026)
	at org.apache.spark.deploy.master.Master$.main(Master.scala:1011)
	at org.apache.spark.deploy.master.Master.main(Master.scala)

Configure `spark.ssl.protocol`  and `spark.ssl.protocol`
It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14674 from wangmiao1981/ssl.
2016-08-21 11:51:46 +01:00
Stavros Kontopoulos b81421afb0 [SPARK-17087][MESOS] Documentation for Making Spark on Mesos honor port restrictions
## What changes were proposed in this pull request?

- adds documentation for https://issues.apache.org/jira/browse/SPARK-11714

## How was this patch tested?
Doc no test needed.

Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>

Closes #14667 from skonto/add_doc.
2016-08-18 12:19:19 +01:00
sandy e28a8c5899 [SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operator
## What changes were proposed in this pull request?

Remove the api doc link for mapReduceTriplets operator because in latest api they are remove so when user link to that api they will not get mapReduceTriplets there so its more good to remove than confuse the user.

## How was this patch tested?
Run all the test cases

![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png)

Author: sandy <phalodi@gmail.com>

Closes #14669 from phalodi/SPARK-17089.
2016-08-16 12:50:55 -07:00
linbojin 6f0988b129 [MINOR][DOC] Correct code snippet results in quick start documentation
## What changes were proposed in this pull request?

As README.md file is updated over time. Some code snippet outputs are not correct based on new README.md file. For example:
```
scala> textFile.count()
res0: Long = 126
```
should be
```
scala> textFile.count()
res0: Long = 99
```
This pr is to add comments to point out this problem so that new spark learners have a correct reference.
Also, fixed a samll bug, inside current documentation, the outputs of linesWithSpark.count() without and with cache are different (one is 15 and the other is 19)
```
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

...

scala> linesWithSpark.cache()
res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27

scala> linesWithSpark.count()
res8: Long = 19
```

## How was this patch tested?

manual test:  run `$ SKIP_API=1 jekyll serve --watch`

Author: linbojin <linbojin203@gmail.com>

Closes #14645 from linbojin/quick-start-documentation.
2016-08-16 11:37:54 +01:00
Jagadeesan e46cb78b3b [SPARK-12370][DOCUMENTATION] Documentation should link to examples …
## What changes were proposed in this pull request?

When documentation is built is should reference examples from the same build. There are times when the docs have links that point to files in the GitHub head which may not be valid on the current release. Changed that in URLs to make them point to the right tag in git using ```SPARK_VERSION_SHORT```

…from its own release version] [Streaming programming guide]

Author: Jagadeesan <as2@us.ibm.com>

Closes #14596 from jagadeesanas2/SPARK-12370.
2016-08-13 11:25:03 +01:00
WeichenXu 91f2735a18 [DOC] add config option spark.ui.enabled into document
## What changes were proposed in this pull request?

The configuration doc lost the config option `spark.ui.enabled` (default value is `true`)
I think this option is important because many cases we would like to turn it off.
so I add it.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14604 from WeichenXu123/add_doc_param_spark_ui_enabled.
2016-08-12 20:10:09 +01:00
hyukjinkwon f4482225c4 [MINOR][DOC] Fix style in examples across documentation
## What changes were proposed in this pull request?

This PR fixes the documentation as below:

  -  Python has 4 spaces and Java and Scala has 2 spaces (See https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).

  - Avoid excessive parentheses and curly braces for anonymous functions. (See https://github.com/databricks/scala-style-guide#anonymous)

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14593 from HyukjinKwon/minor-documentation.
2016-08-12 10:00:58 +01:00
Jeff Zhang 7a9e25c383 [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf…
Before this PR, user have to export environment variable to specify the python of driver & executor which is not so convenient for users. This PR is trying to allow user to specify python through configuration "--pyspark-driver-python" & "--pyspark-executor-python"

Manually test in local & yarn mode for pyspark-shell and pyspark batch mode.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13146 from zjffdu/SPARK-13081.
2016-08-11 20:08:39 -07:00
hyukjinkwon 7186e8c318 [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation
## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.
2016-08-11 11:31:52 +01:00
Andrew Ash 8a6b7037bb Correct example value for spark.ssl.YYY.XXX settings
Docs adjustment to:
- link to other relevant section of docs
- correct statement about the only value when actually other values are supported

Author: Andrew Ash <andrew@andrewash.com>

Closes #14581 from ash211/patch-10.
2016-08-11 11:26:57 +01:00
Tao Wang 7a6a3c3fbc [SPARK-17010][MINOR][DOC] Wrong description in memory management document
## What changes were proposed in this pull request?

change the remain percent to right one.

## How was this patch tested?

Manual review

Author: Tao Wang <wangtao111@huawei.com>

Closes #14591 from WangTaoTheTonic/patch-1.
2016-08-10 22:30:18 -07:00
jerryshao ab648c0004 [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN
## What changes were proposed in this pull request?

Add a configurable token manager for Spark on running on yarn.

### Current Problems ###

1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes.
2. Also this problem exits in timely token renewer and updater.

### Changes In This Proposal ###

In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes:

1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded.

### Behavior Changes ###

For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).

For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations:

1. `spark.yarn.security.tokens.test.enabled` to true
2. `spark.yarn.security.tokens.test.class` to the full qualified class name.

So we still keep the same semantics as current code while add one new configuration.

### Current Status ###

- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.

## How was this patch tested?

Unit test and integrated test.

Please suggest and review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>

Closes #14065 from jerryshao/SPARK-16342.
2016-08-10 15:39:30 -07:00
Timothy Chen eca58755fb [SPARK-16927][SPARK-16923] Override task properties at dispatcher.
## What changes were proposed in this pull request?

- enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927]
- remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR)

## How was this patch tested?

mesos/spark integration test suite
manual testing

Author: Timothy Chen <tnachen@gmail.com>

Closes #14511 from mgummelt/override-props.
2016-08-10 10:11:03 +01:00
Josh Rosen b89b3a5c8e [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980da79 (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
2016-08-09 11:21:45 -07:00
Michael Gummelt 62e6212441 [SPARK-16809] enable history server links in dispatcher UI
## What changes were proposed in this pull request?

Links the Spark Mesos Dispatcher UI to the history server UI

- adds spark.mesos.dispatcher.historyServer.url
- explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs

## How was this patch tested?

manual testing

Author: Michael Gummelt <mgummelt@mesosphere.io>
Author: Sergiusz Urbaniak <sur@mesosphere.io>

Closes #14414 from mgummelt/history-server.
2016-08-09 10:55:33 +01:00
Michael Gummelt 53d1c78779 Update docs to include SASL support for RPC
## What changes were proposed in this pull request?

Update docs to include SASL support for RPC

Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63

## How was this patch tested?

Docs change only

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14549 from mgummelt/sasl.
2016-08-08 16:07:51 -07:00
Shivansh 6c1ecb191b [SPARK-16911] Fix the links in the programming guide
## What changes were proposed in this pull request?

 Fix the broken links in the programming guide of the Graphx Migration and understanding closures

## How was this patch tested?

By running the test cases  and checking the links.

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14503 from shiv4nsh/SPARK-16911.
2016-08-07 09:30:18 +01:00
keliang 1275f64696 [SPARK-16870][DOCS] Summary:add "spark.sql.broadcastTimeout" into docs/sql-programming-gu…
## What changes were proposed in this pull request?
default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned

## How was this patch tested?

not need

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

…ide.md

JIRA_ID:SPARK-16870
Description:default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned
Test:done

Author: keliang <keliang@cmss.chinamobile.com>

Closes #14477 from biglobster/keliang.
2016-08-07 09:28:32 +01:00
Bryan Cutler b1ebe182ca [SPARK-16932][DOCS] Changed programming guide to not reference old accumulator API in Scala
## What changes were proposed in this pull request?

In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing.  This is not necessary for Scala, so all references to the old API are removed.  For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet.  Python has not currently implemented the new API.

## How was this patch tested?
built doc locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
2016-08-07 09:06:59 +01:00
Michael Gummelt 7aaa5a01c1 document that Mesos cluster mode supports python
update docs to be consistent with SPARK-14645 https://issues.apache.org/jira/browse/SPARK-14645

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14514 from mgummelt/fix-docs.
2016-08-07 08:59:04 +01:00
cody koeninger c9f2501af2 [SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration

## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.

Author: cody koeninger <cody@koeninger.org>

Closes #14385 from koeninger/SPARK-16312.
2016-08-05 10:13:32 +01:00
Sital Kedia 9c15d079df [SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetch
## What changes were proposed in this pull request?

Shuffle fetch on large intermediate dataset is slow because the shuffle service open/close the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch

## How was this patch tested?

Tested by running a job on the cluster and the shuffle read time was reduced by 50%.

Author: Sital Kedia <skedia@fb.com>

Closes #12944 from sitalkedia/shuffle_service.
2016-08-04 14:54:38 -07:00
Shuai Lin 36827ddafe [SPARK-16822][DOC] Support latex in scaladoc.
## What changes were proposed in this pull request?

Support using latex in scaladoc by adding MathJax javascript to the js template.

## How was this patch tested?

Generated scaladoc.  Preview:

- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)

- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
2016-08-02 09:14:08 -07:00
Cheng Lian 10e1c0e638 [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings
## What changes were proposed in this pull request?

This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #14368 from liancheng/revise-examples.
2016-08-02 15:02:40 +08:00
Sun Dapeng 2c15323ad0 [SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.md
## What changes were proposed in this pull request?

Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.

## How was this patch tested?

None

Author: Sun Dapeng <sdp@apache.org>

Closes #14386 from sundapeng/doclink.
2016-07-29 06:01:23 -07:00
Michael Gummelt 266b92faff [SPARK-16637] Unified containerizer
## What changes were proposed in this pull request?

New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}

This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/

The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.

I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.

This is blocked on: https://github.com/apache/spark/pull/14167

## How was this patch tested?

- manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
- spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14275 from mgummelt/unified-containerizer.
2016-07-29 05:50:47 -07:00
Bartek Wiśniewski bc4851adeb [MINOR][DOC] missing keyword new
## What changes were proposed in this pull request?

added missing keyword for java example

## How was this patch tested?

wasn't

Author: Bartek Wiśniewski <wedi@Ava.local>

Closes #14381 from wedi-dev/quickfix/missing_keyword.
2016-07-27 10:53:22 -07:00
Mark Grover 70f846a313 [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics
## What changes were proposed in this pull request?
Adding a new property to SparkConf called spark.metrics.namespace that allows users to
set a custom namespace for executor and driver metrics in the metrics systems.

By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, often times, users want to be able to track the metrics
across apps for driver and executor metrics, which is hard to do with application ID
(i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
users can set the `spark.metrics.namespace` property to another spark configuration key like
`spark.app.name` which is then used to populate the root namespace of the metrics system
(with the app name in our example). `spark.metrics.namespace` property can be set to any
arbitrary spark property key, whose value would be used to set the root namespace of the
metrics system. Non driver and executor metrics are never prefixed with `spark.app.id`, nor
does the `spark.metrics.namespace` property have any such affect on such metrics.

## How was this patch tested?
Added new unit tests, modified existing unit tests.

Author: Mark Grover <mark@apache.org>

Closes #14270 from markgrover/spark-5847.
2016-07-27 10:13:15 -07:00
Philipp Hoffmann 0869b3a5f0 [SPARK-15271][MESOS] Allow force pulling executor docker images
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
(`spark.mesos.executor.docker.forcePullImage`). Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #14348 from philipphoffmann/force-pull-image.
2016-07-26 16:09:10 +01:00
Nicholas Brown ba0aade6d5 Fix description of spark.speculation.quantile
## What changes were proposed in this pull request?

Minor doc fix regarding the spark.speculation.quantile configuration parameter.  It incorrectly states it should be a percentage, when it should be a fraction.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

I tried building the documentation but got some unidoc errors.  I also got them when building off origin/master, so I don't think I caused that problem.  I did run the web app and saw the changes reflected as expected.

Author: Nicholas Brown <nbrown@adroitdigital.com>

Closes #14352 from nwbvt/master.
2016-07-25 19:18:27 -07:00
Takeshi YAMAMURO cda4603de3 [SQL][DOC] Fix a default name for parquet compression
## What changes were proposed in this pull request?
This pr is to fix a wrong description for parquet default compression.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14351 from maropu/FixParquetDoc.
2016-07-25 15:08:58 -07:00
Josh Rosen fc17121d59 Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"
This reverts commit 978cd5f125.
2016-07-25 12:43:44 -07:00
Shuai Lin 3b6e1d094e [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc
## What changes were proposed in this pull request?

Fixed several inline formatting in ml features doc.

Before:

<img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png">

After:

<img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png">

## How was this patch tested?

Genetate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser.

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14194 from lins05/fix-docs-formatting.
2016-07-25 20:26:55 +01:00
Philipp Hoffmann 978cd5f125 [SPARK-15271][MESOS] Allow force pulling executor docker images
## What changes were proposed in this pull request?

Mesos agents by default will not pull docker images which are cached
locally already. In order to run Spark executors from mutable tags like
`:latest` this commit introduces a Spark setting
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to
true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous
implementation and Mesos' default
behaviour).

## How was this patch tested?

I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set.

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #13051 from philipphoffmann/force-pull-image.
2016-07-25 20:14:47 +01:00
Felix Cheung b73defdd79 [SPARKR][DOCS] fix broken url in doc
## What changes were proposed in this pull request?

Fix broken url, also,

sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop"
![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

Data type section is in the middle of a list of gapply/gapplyCollect subsections:
![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14329 from felixcheung/rdoclinkfix.
2016-07-25 11:25:41 -07:00
Cheng Lian 53b2456d1d [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding
This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`

  Demonstrates basic Spark SQL features.

- `sql/datasource.py`

  Demonstrates various Spark SQL data sources.

- `sql/hive.py`

  Demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.
2016-07-23 11:41:24 -07:00
Tom Graves 6c56fff118 [SPARK-16650] Improve documentation of spark.task.maxFailures
Clarify documentation on spark.task.maxFailures

No tests run as its documentation

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #14287 from tgravescs/SPARK-16650.
2016-07-22 12:41:38 +01:00
Michael Gummelt 235cb256d0 [SPARK-16194] Mesos Driver env vars
## What changes were proposed in this pull request?

Added new configuration namespace: spark.mesos.env.*

This allows a user submitting a job in cluster mode to set arbitrary environment variables on the driver.
spark.mesos.driverEnv.KEY=VAL will result in the env var "KEY" being set to "VAL"

I've also refactored the tests a bit so we can re-use code in MesosClusterScheduler.

And I've refactored the command building logic in `buildDriverCommand`.  Command builder values were very intertwined before, and now it's easier to determine exactly how each variable is set.

## How was this patch tested?

unit tests

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14167 from mgummelt/driver-env-vars.
2016-07-21 18:29:00 +01:00
Holden Karau 1bf13ba3a2 [MINOR][DOCS][STREAMING] Minor docfix schema of csv rather than parquet in comments
## What changes were proposed in this pull request?

Fix parquet to csv in a comment to match the input format being read.

## How was this patch tested?
N/A (doc change only)

Author: Holden Karau <holden@us.ibm.com>

Closes #14274 from holdenk/minor-docfix-schema-of-csv-rather-than-parquet.
2016-07-21 09:17:38 +01:00
Kishor Patil b9bab4dcf6 [SPARK-15951] Change Executors Page to use datatables to support sorting columns and searching
1. Create the executorspage-template.html for displaying application information in datables.
2. Added REST API endpoint "allexecutors" to be able to see all executors created for particular job.
3. The executorspage.js uses jQuery to access the data from /api/v1/applications/appid/allexecutors REST API, and use DataTable to display executors for the application. It also, generates summary of dead/live and total executors created during life of the application.
4. Similar changes applicable to Executors Page on history server for a given application.

Snapshots for how it looks like now:
<img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png">

New Executors Page screenshot looks like this:
<img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png">

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #13670 from kishorvpatil/execTemplates.
2016-07-20 12:22:43 -05:00
Weiqing Yang 95abbe5377 [SPARK-15923][YARN] Spark Application rest api returns 'no such app: …
## What changes were proposed in this pull request?
Update monitoring.md.

…<appId>'

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14163 from Sherry302/master.
2016-07-20 14:26:26 +01:00
WeichenXu 9674af6f6f [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code
## What changes were proposed in this pull request?

update `refreshTable` API in python code of the sql-programming-guide.

This API is added in SPARK-15820

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14220 from WeichenXu123/update_sql_doc_catalog.
2016-07-19 18:48:41 -07:00
Ahmed Mahran 6caa22050e [MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar
## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.
2016-07-19 12:01:54 +01:00
Cheng Lian 1426a08052 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
## What changes were proposed in this pull request?

This PR moves one and the last hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL".

## How was this patch tested?

Manually verified the generated HTML page.

Author: Cheng Lian <lian@databricks.com>

Closes #14245 from liancheng/minor-scala-example-update.
2016-07-18 23:07:59 -07:00
Felix Cheung 75f0efe74d [SPARKR][DOCS] minor code sample update in R programming guide
## What changes were proposed in this pull request?

Fix code style from ad hoc review of RC4 doc

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14250 from felixcheung/rdocs2rc4.
2016-07-18 16:01:57 -07:00
Narine Kokhlikyan 4167304836 [SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollect
## What changes were proposed in this pull request?

Updates programming guide for spark.gapply/spark.gapplyCollect.

Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality.
Please, let me know if you prefer another example.

## How was this patch tested?
Existing test cases in R

Author: Narine Kokhlikyan <narine@slice.com>

Closes #14090 from NarineK/gapplyProgGuide.
2016-07-16 16:56:16 -07:00
Joseph K. Bradley 5ffd5d3838 [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.
2016-07-15 13:38:23 -07:00
Josh Rosen 972673aca5 [SPARK-16555] Work around Jekyll error-handling bug which led to silent failures
If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail.

This is caused by https://github.com/jekyll/jekyll/issues/5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll.

I tested this manually with

```
rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala
cd docs
SKIP_API=1 jekyll build
echo $?
```

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14209 from JoshRosen/fix-doc-building.
2016-07-14 15:55:36 -07:00
Shivaram Venkataraman 01c4c1fa53 [SPARK-16553][DOCS] Fix SQL example file name in docs
## What changes were proposed in this pull request?

Fixes a typo in the sql programming guide

## How was this patch tested?

Building docs locally

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14208 from shivaram/spark-sql-doc-fix.
2016-07-14 14:19:30 -07:00
Marcelo Vanzin b7b5e17876 [SPARK-16505][YARN] Optionally propagate error during shuffle service startup.
This prevents the NM from starting when something is wrong, which would
lead to later errors which are confusing and harder to debug.

Added a unit test to verify startup fails if something is wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14162 from vanzin/SPARK-16505.
2016-07-14 09:42:32 -05:00
Felix Cheung fb2e8eeb0b [SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation update for code example, code style, and missed reference to "sparkR.init"

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
2016-07-13 15:09:23 -07:00
James Thomas 51a6706b13 [SPARK-16114][SQL] updated structured streaming guide
## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.
2016-07-13 13:26:23 -07:00
sandy bf107f1e65 [SPARK-16438] Add Asynchronous Actions documentation
## What changes were proposed in this pull request?

Add Asynchronous Actions documentation inside action of programming guide

## How was this patch tested?

check the documentation indentation and formatting with md preview.

Author: sandy <phalodi@gmail.com>

Closes #14104 from phalodi/SPARK-16438.
2016-07-13 11:33:46 +01:00
aokolnychyi 772c213ec7 [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated

The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #14119 from aokolnychyi/spark_16303.
2016-07-13 16:12:11 +08:00
Lianhui Wang 5ad68ba5ce [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators.
## What changes were proposed in this pull request?
when query only use metadata (example: partition key), it can return results based on metadata without scanning files. Hive did it in HIVE-1003.

## How was this patch tested?
add unit tests

Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>

Closes #13494 from lianhuiwang/metadata-only.
2016-07-12 18:52:15 +02:00
Xin Ren 05d7151ccb [MINOR][STREAMING][DOCS] Minor changes on kinesis integration
## What changes were proposed in this pull request?

Some minor changes for documentation page "Spark Streaming + Kinesis Integration".

Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.

## How was this patch tested?

Tested manually, on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #14097 from keypointt/kinesisDoc.
2016-07-11 18:09:14 -07:00
Yanbo Liang 2ad031be67 [SPARKR][DOC] SparkR ML user guides update for 2.0
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, so that we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.

## How was this patch tested?
Only docs update, manually check the generated docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14011 from yanboliang/r-user-guide-update.
2016-07-11 14:31:11 -07:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
Xin Ren 9cb1eb7af7 [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding
https://issues.apache.org/jira/browse/SPARK-16381

## What changes were proposed in this pull request?

Update SQL examples and programming guide for R language binding.

Here I just follow example https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction, created a separate R file to store all the example code.

## How was this patch tested?

Manual test on my local machine.
Screenshot as below:

![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14082 from keypointt/SPARK-16381.
2016-07-11 20:05:28 +08:00
Michael Gummelt b1db26acc5 [SPARK-11857][MESOS] Deprecate fine grained
## What changes were proposed in this pull request?

Documentation changes to indicate that fine-grained mode is now deprecated.  No code changes were made, and all fine-grained mode instructions were left in place.  We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle?  One major version?)

Blocked on https://github.com/apache/spark/pull/14059

## How was this patch tested?

Viewed in Github

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14078 from mgummelt/deprecate-fine-grained.
2016-07-08 20:20:26 -07:00
Michael Gummelt 9c041990cf [MESOS] expand coarse-grained mode docs
## What changes were proposed in this pull request?

docs

## How was this patch tested?

viewed the docs in github

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14059 from mgummelt/coarse-grained.
2016-07-06 15:02:45 -07:00
WeichenXu b1310425b3 [DOC][SQL] update out-of-date code snippets using SQLContext in all documents.
## What changes were proposed in this pull request?

I search the whole documents directory using SQLContext, and update the following places:

- docs/configuration.md, sparkR code snippets.
- docs/streaming-programming-guide.md, several example code.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14025 from WeichenXu123/WIP_SQLContext_update.
2016-07-06 10:41:48 -07:00
Sean Owen 18fb57f58a [MINOR][DOCS] Remove unused images; crush PNGs that could use it for good measure
## What changes were proposed in this pull request?

Coincidentally, I discovered that a couple images were unused in `docs/`, and then searched and found more, and then realized some PNGs were pretty big and could be crushed, and before I knew it, had done the same for the ASF site (not committed yet).

No functional change at all, just less superfluous image data.

## How was this patch tested?

`jekyll serve`

Author: Sean Owen <sowen@cloudera.com>

Closes #14029 from srowen/RemoveCompressImages.
2016-07-04 09:21:58 +01:00
WeichenXu 0bd7cd18bc [SPARK-16345][DOCUMENTATION][EXAMPLES][GRAPHX] Extract graphx programming guide example snippets from source files instead of hard code them
## What changes were proposed in this pull request?

I extract 6 example programs from GraphX programming guide and replace them with
`include_example` label.

The 6 example programs are:
- AggregateMessagesExample.scala
- SSSPExample.scala
- TriangleCountingExample.scala
- ConnectedComponentsExample.scala
- ComprehensiveExample.scala
- PageRankExample.scala

All the example code can run using
`bin/run-example graphx.EXAMPLE_NAME`

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14015 from WeichenXu123/graphx_example_plugin.
2016-07-02 16:29:00 +01:00
WeichenXu 192d1f9cf3 [GRAPHX][EXAMPLES] move graphx test data directory and update graphx document
## What changes were proposed in this pull request?

There are two test data files used for graphx examples existing in directory "graphx/data"
I move it into "data/" directory because the "graphx" directory is used for code files and other test data files (such as mllib, streaming test data) are all in there.

I also update the graphx document where reference the data files which I move place.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14010 from WeichenXu123/move_graphx_data_dir.
2016-07-02 08:40:23 +01:00
Nick Pentreath 4a981dc870 [SPARK-15643][DOC][ML] Add breaking changes to ML migration guide
This PR adds the breaking changes from [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) to the migration guide.

## How was this patch tested?

Built docs locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13924 from MLnick/SPARK-15643-migration-guide.
2016-06-30 17:55:14 -07:00
Tathagata Das 5d00a7bc19 [SPARK-16256][DOCS] Fix window operation diagram
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14001 from tdas/SPARK-16256-2.
2016-06-30 14:01:34 -07:00
Tathagata Das 2c3d96134d [SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.
2016-06-29 23:38:19 -07:00
Cheng Lian bde1d6a615 [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin
## What changes were proposed in this pull request?

This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.

## How was this patch tested?

Manually tested.

<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

Author: Cheng Lian <lian@databricks.com>

Closes #13972 from liancheng/include-example-with-labels.
2016-06-29 22:50:53 -07:00
Tathagata Das 64132a14fb [SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide
Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.
2016-06-29 11:45:57 -07:00
jerryshao 272a2f78f3 [SPARK-15990][YARN] Add rolling log aggregation support for Spark on yarn
## What changes were proposed in this pull request?

Yarn supports rolling log aggregation since 2.6, previously log will only be aggregated to HDFS after application is finished, it is quite painful for long running applications like Spark Streaming, thriftserver. Also out of disk problem will be occurred when log file is too large. So here propose to add support of rolling log aggregation for Spark on yarn.

One limitation for this is that log4j should be set to change to file appender, now in Spark itself uses console appender by default, in which file will not be created again once removed after aggregation. But I think lots of production users should have changed their log4j configuration instead of default on, so this is not a big problem.

## How was this patch tested?

Manually verified with Hadoop 2.7.1.

Author: jerryshao <sshao@hortonworks.com>

Closes #13712 from jerryshao/SPARK-15990.
2016-06-29 08:17:27 -05:00
Yanbo Liang 26252f7064 [SPARK-15643][DOC][ML] Update spark.ml and spark.mllib migration guide from 1.6 to 2.0
## What changes were proposed in this pull request?
Update ```spark.ml``` and ```spark.mllib``` migration guide from 1.6 to 2.0.

## How was this patch tested?
Docs update, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13378 from yanboliang/spark-13448.
2016-06-28 11:54:25 -07:00
Yin Huai dd6b7dbe70 [SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide.
## What changes were proposed in this pull request?
This PR makes several updates to SQL programming guide.

Author: Yin Huai <yhuai@databricks.com>

Closes #13938 from yhuai/doc.
2016-06-27 22:44:08 -07:00
GayathriMurali be88383e15 [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer
## What changes were proposed in this pull request?

Made changes to HashingTF,QuantileVectorizer and CountVectorizer

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13745 from GayathriMurali/SPARK-15997.
2016-06-24 13:25:40 +02:00
Ryan Blue 738f134bf4 [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation.
## What changes were proposed in this pull request?

This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.

This changes was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend using it while we can change the behavior for 2.0.0. In practice, the 1.x behavior causes unexpected behavior for users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.

## How was this patch tested?

This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.

Author: Ryan Blue <blue@apache.org>

Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
2016-06-23 14:03:46 -05:00
Felix Cheung b5a997667f [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup
## What changes were proposed in this pull request?

Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.

## How was this patch tested?

unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13838 from felixcheung/rjobgroup.
2016-06-23 09:45:01 -07:00
Kai Jiang 43b04b7ecb [SPARK-15672][R][DOC] R programming guide update
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions

## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">

Author: Kai Jiang <jiangkai@gmail.com>

Closes #13660 from vectorijk/spark-15672-R-guide-update.
2016-06-22 12:50:36 -07:00
Felix Cheung 79aa1d82ca [SQL][DOC] SQL programming guide add deprecated methods in 2.0.0
## What changes were proposed in this pull request?

Doc changes

## How was this patch tested?

manual

liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13827 from felixcheung/sqldocdeprecate.
2016-06-22 10:37:13 +08:00
Yuhao Yang a58f402394 [SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.

## How was this patch tested?

manual review for doc

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13375 from hhbyyh/stopdoc.
2016-06-21 00:47:36 -07:00
Takeshi YAMAMURO 41e0ffb19f [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
## What changes were proposed in this pull request?
Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options.

## How was this patch tested?
N/A

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13797 from maropu/SPARK-15894-2.
2016-06-21 14:27:16 +08:00
Felix Cheung 58f6e27dd7 [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
## What changes were proposed in this pull request?

Update doc as per discussion in PR #13592

## How was this patch tested?

manual

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13799 from felixcheung/rsqlprogrammingguide.
2016-06-21 13:56:37 +08:00
Eric Liang 07367533de [SPARK-16025][CORE] Document OFF_HEAP storage level in 2.0
This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon.

Author: Eric Liang <ekl@databricks.com>

Closes #13744 from ericl/spark-16025.
2016-06-20 21:56:44 -07:00
Cheng Lian 6df8e38860 [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
## What changes were proposed in this pull request?

Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete.

We may also want to add more examples for Scala/Java Dataset typed transformations.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes #13592 from liancheng/sql-programming-guide-2.0.
2016-06-20 14:50:28 -07:00
Felix Cheung 359c2e827d [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates
## What changes were proposed in this pull request?

roxygen2 doc, programming guide, example updates

## How was this patch tested?

manual checks
shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13751 from felixcheung/rsparksessiondoc.
2016-06-20 13:46:24 -07:00
wm624@hotmail.com 5930d7a2e9 [SPARK-16040][MLLIB][DOC] spark.mllib PIC document extra line of refernece
## What changes were proposed in this pull request?

In the 2.0 document, Line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.

There is already "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.".

We should remove the first line, which is consistent with other documents.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual test

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13755 from wangmiao1981/doc.
2016-06-19 20:19:40 +01:00
GayathriMurali af2a4b0826 [SPARK-15129][R][DOC] R API changes in ML
## What changes were proposed in this pull request?

Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs

Author: GayathriMurali <gayathri.m@intel.com>

Closes #13285 from GayathriMurali/SPARK-15129.
2016-06-17 21:10:29 -07:00
Dhruve Ashar f1bf0d2f3a [SPARK-15966][DOC] Add closing tag to fix rendering issue for Spark monitoring
## What changes were proposed in this pull request?
Adds the missing closing tag for spark.ui.view.acls.groups

## How was this patch tested?
I built the docs locally and verified the changed in browser.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
**Before:**
![image](https://cloud.githubusercontent.com/assets/7732317/16135005/49fc0724-33e6-11e6-9390-98711593fa5b.png)

**After:**
![image](https://cloud.githubusercontent.com/assets/7732317/16135021/62b5c4a8-33e6-11e6-8118-b22fda5c66eb.png)

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #13719 from dhruve/doc/SPARK-15966.
2016-06-16 17:46:19 -07:00
WeichenXu 9040d83bc2 [SPARK-15608][ML][EXAMPLES][DOC] add examples and documents of ml.isotonic regression
## What changes were proposed in this pull request?

add ml doc for ml isotonic regression
add scala example for ml isotonic regression
add java example for ml isotonic regression
add python example for ml isotonic regression

modify scala example for mllib isotonic regression
modify java example for mllib isotonic regression
modify python example for mllib isotonic regression

add data/mllib/sample_isotonic_regression_libsvm_data.txt
delete data/mllib/sample_isotonic_regression_data.txt
## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13381 from WeichenXu123/add_isotonic_regression_doc.
2016-06-16 17:35:40 -07:00
Sean Owen 457126e420 [SPARK-15796][CORE] Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
## What changes were proposed in this pull request?

Reduce `spark.memory.fraction` default to 0.6 in order to make it fit within default JVM old generation size (2/3 heap). See JIRA discussion. This means a full cache doesn't spill into the new gen. CC andrewor14

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13618 from srowen/SPARK-15796.
2016-06-16 23:04:10 +02:00
Nirman Narang 04d7b3d2b6 [SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT POINTS.]
Updated the SparkStreaming Doc with some important points.

Author: Nirman Narang <narang@us.ibm.com>

Closes #11114 from nirmannarang/SPARK-7848.
2016-06-15 15:36:31 -07:00
Mortada Mehyar a87a56f5c7 [DOCUMENTATION] fixed typos in python programming guide
## What changes were proposed in this pull request?

minor typo

## How was this patch tested?

minor typo in the doc, should be self explanatory

Author: Mortada Mehyar <mortada.mehyar@gmail.com>

Closes #13639 from mortada/typo.
2016-06-14 09:45:46 +01:00
Sean Owen f51dfe616b [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API
## What changes were proposed in this pull request?

- Deprecate old Java accumulator API; should use Scala now
- Update Java tests and examples
- Don't bother testing old accumulator API in Java 8 (too)
- (fix a misspelling too)

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13606 from srowen/SPARK-15086.
2016-06-12 11:44:33 -07:00
bomeng 50248dcfff [SPARK-15806][DOCUMENTATION] update doc for SPARK_MASTER_IP
## What changes were proposed in this pull request?

SPARK_MASTER_IP is a deprecated environment variable. It is replaced by SPARK_MASTER_HOST according to MasterArguments.scala.

## How was this patch tested?

Manually verified.

Author: bomeng <bmeng@us.ibm.com>

Closes #13543 from bomeng/SPARK-15806.
2016-06-12 14:25:48 +01:00
bomeng 3fd3ee038b [SPARK-15781][DOCUMENTATION] remove deprecated environment variable doc
## What changes were proposed in this pull request?

Like `SPARK_JAVA_OPTS` and `SPARK_CLASSPATH`, we will remove the document for `SPARK_WORKER_INSTANCES` to discourage user not to use them. If they are actually used, SparkConf will show a warning message as before.

## How was this patch tested?

Manually tested.

Author: bomeng <bmeng@us.ibm.com>

Closes #13533 from bomeng/SPARK-15781.
2016-06-12 12:58:34 +01:00
Dongjoon Hyun ad102af169 [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents
## What changes were proposed in this pull request?

This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, this contains some editorial change.

**Fix broken links**
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md

**Fix malformed section header and scala coding style**
  * mllib-linear-methods.md

**Replace indirect forward links with direct one**
  * ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`.)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13608 from dongjoon-hyun/SPARK-15883.
2016-06-11 12:55:38 +01:00
Sean Owen 3761330dd0 [SPARK-15879][DOCS][UI] Update logo in UI and docs to add "Apache"
## What changes were proposed in this pull request?

Use new Spark logo including "Apache" (now, with crushed PNGs). Remove old unreferenced logo files.

## How was this patch tested?

Manual check of generated HTML site and Spark UI. I searched for references to the deleted files to make sure they were not used.

Author: Sean Owen <sowen@cloudera.com>

Closes #13609 from srowen/SPARK-15879.
2016-06-11 12:46:07 +01:00
Mortada Mehyar 675a73715d [DOCUMENTATION] fixed groupby aggregation example for pyspark
## What changes were proposed in this pull request?

fixing documentation for the groupby/agg example in python

## How was this patch tested?

the existing example in the documentation dose not contain valid syntax (missing parenthesis) and is not using `Column` in the expression for `agg()`

after the fix here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
: {'age': 20, 'department': 1, 'expense': 200},
: {'age': 21, 'department': 2, 'expense': 300},
: {'age': 22, 'department': 2, 'expense': 300},
: {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()

+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+

Author: Mortada Mehyar <mortada.mehyar@gmail.com>

Closes #13587 from mortada/groupby_agg_doc_fix.
2016-06-10 00:23:34 -07:00
prabs ca70ab27cc [DOCUMENTATION] Fixed target JAR path
## What changes were proposed in this pull request?

Mentioned Scala version in the sbt configuration file is 2.11, so the path of the target JAR should be `/target/scala-2.11/simple-project_2.11-1.0.jar`

## How was this patch tested?

n/a

Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>

Closes #13554 from prabeesh/master.
2016-06-08 17:22:55 +01:00
Yanbo Liang 6ecedf39b4 [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM.

When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg.

We should output a warning message and clarify in document for this condition.

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12731 from yanboliang/spark-13590.
2016-06-07 15:25:36 -07:00
Marcelo Vanzin 200f01c8fb [SPARK-15760][DOCS] Add documentation for package-related configs.
While there, also document spark.files and spark.jars. Text is the
same as the spark-submit help text with some minor adjustments.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13502 from vanzin/SPARK-15760.
2016-06-07 09:28:39 -07:00
WeichenXu 1e2c931187 [MINOR] fix typo in documents
## What changes were proposed in this pull request?

I use spell check tools checks typo in spark documents and fix them.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13538 from WeichenXu123/fix_doc_typo.
2016-06-07 13:29:27 +01:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Liwei Lin a0eec8e8ff [SPARK-15208][WIP][CORE][STREAMING][DOCS] Update Spark examples with AccumulatorV2
## What changes were proposed in this pull request?

The patch updates the codes & docs in the example module as well as the related doc module:

- [ ] [docs] `streaming-programming-guide.md`
  - [x] scala code part
  - [ ] java code part
  - [ ] python code part
- [x] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`

## How was this patch tested?

Ran the examples and verified results manually.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12981 from lw-lin/accumulatorV2-examples.
2016-06-02 11:07:15 -05:00
WeichenXu 2402b91461 [SPARK-15702][DOCUMENTATION] Update document programming-guide accumulator section
## What changes were proposed in this pull request?

Update document programming-guide accumulator section (scala language)
java and python version, because the API haven't done, so I do not modify them.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13441 from WeichenXu123/update_doc_accumulatorV2_clean.
2016-06-01 12:57:02 -07:00
Matthew Wise 2d34183b27 [DOCS] fix example code issues in documentation
## What changes were proposed in this pull request?

Fixed broken java code examples in streaming documentation

Attn: tdas

Author: Matthew Wise <matthew.rs.wise@gmail.com>

Closes #13388 from mawise/fix_docs_java_streaming_example.
2016-05-30 09:12:02 -05:00
Yanbo Liang a3550e3747 [SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.

Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference.

## How was this patch tested?
Document update, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13262 from yanboliang/spark-15484.
2016-05-27 13:16:22 -07:00
sethah c96244f5ac [SPARK-15186][ML][DOCS] Add user guide for generalized linear regression
## What changes were proposed in this pull request?

This patch adds a user guide section for generalized linear regression and includes the examples from [#12754](https://github.com/apache/spark/pull/12754).

## How was this patch tested?

Documentation only, no tests required.

## Approach

In general, it is a bit unclear what level of detail ought to be included in the user guide since there is a lot of variability within the current user guide. I tried to give a fairly brief mathematical introduction to GLMs, and cover what types of problems they could be used for. Additionally, I included a brief blurb on the IRLS solver. The input/output columns are given in a table as is found elsewhere in the docs (though, again, these appear rather intermittently in the current docs), as well as a table providing the supported families and their link functions.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13139 from sethah/SPARK-15186.
2016-05-27 12:55:48 -07:00
jerryshao 1b98fa2e43 [YARN][DOC][MINOR] Remove several obsolete env variables and update the doc
## What changes were proposed in this pull request?

Remove several obsolete env variables not supported for Spark on YARN now, also updates the docs to include several changes with 2.0.

## How was this patch tested?

N/A

CC vanzin tgravescs

Author: jerryshao <sshao@hortonworks.com>

Closes #13296 from jerryshao/yarn-doc.
2016-05-27 11:31:25 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
felixcheung c82883239e [SPARK-10903] followup - update API doc for SqlContext
## What changes were proposed in this pull request?

Follow up on the earlier PR - in here we are fixing up roxygen2 doc examples.
Also add to the programming guide migration section.

## How was this patch tested?

SparkR tests

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13340 from felixcheung/sqlcontextdoc.
2016-05-26 21:42:36 -07:00
Steve Loughran 01b350a4f7 [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics
This patch provides detail on what to do for keytabless Oozie launches of spark apps, and adds some debug-level diagnostics of what credentials have been submitted

Author: Steve Loughran <stevel@hortonworks.com>
Author: Steve Loughran <stevel@apache.org>

Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
2016-05-26 13:55:22 -05:00
Krishna Kalyan 9082b7968a [SPARK-12071][DOC] Document the behaviour of NA in R
## What changes were proposed in this pull request?

Under Upgrading From SparkR 1.5.x to 1.6.x section added the information, SparkSQL converts `NA` in R to `null`.

## How was this patch tested?

Document update, no tests.

Author: Krishna Kalyan <krishnakalyan3@gmail.com>

Closes #13268 from krishnakalyan3/spark-12071-1.
2016-05-24 22:21:52 -07:00
Holden Karau cd9f16906c [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build insturctions
## What changes were proposed in this pull request?

PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).

## How was this patch tested?

built pydocs locally, tested new user build instructions

Author: Holden Karau <holden@us.ibm.com>

Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
2016-05-24 22:20:00 -07:00
Nick Pentreath 20900e5fec [SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids
This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.

(Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).

Also cleaned up a reference to `mllib` in the ML doc.

## How was this patch tested?
Built and viewed User Guide doc locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
2016-05-24 11:34:06 -07:00
gatorsmile d207716451 [SPARK-15485][SQL][DOCS] Spark SQL Configuration
#### What changes were proposed in this pull request?
So far, the page Configuration in the official documentation does not have a section for Spark SQL.
http://spark.apache.org/docs/latest/configuration.html

For Spark users, the information and default values of these public configuration parameters are very useful. This PR is to add this missing section to the configuration.html.

rxin yhuai marmbrus

#### How was this patch tested?
Below is the generated webpage.
<img width="924" alt="screenshot 2016-05-23 11 35 57" src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png">
<img width="914" alt="screenshot 2016-05-23 11 37 38" src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png">
<img width="923" alt="screenshot 2016-05-23 11 36 11" src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png">
<img width="920" alt="screenshot 2016-05-23 11 36 18" src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png">

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13263 from gatorsmile/configurationSQL.
2016-05-23 21:07:14 -07:00
gatorsmile 6cb8f836da [SPARK-15396][SQL][DOC] It can't connect hive metastore database
#### What changes were proposed in this pull request?
The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.

This PR is to update the document and example for explaining the latest changes in the configuration of default location of database.

Below is the screenshot of the latest generated docs:

<img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">

<img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">

<img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">

No change is made in the R's example.

<img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13225 from gatorsmile/document.
2016-05-21 23:12:27 -07:00
sethah 5e203505f1 [SPARK-15394][ML][DOCS] User guide typos and grammar audit
## What changes were proposed in this pull request?

Correct some typos and incorrectly worded sentences.

## How was this patch tested?

Doc changes only.

Note that many of these changes were identified by whomfire01

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13180 from sethah/ml_guide_audit.
2016-05-19 23:29:37 -07:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
Yuhao Yang 3308a862ba [SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf
## What changes were proposed in this pull request?

We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.

## How was this patch tested?

manual review for doc.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #12957 from hhbyyh/tfidfdoc.
2016-05-17 20:44:19 +02:00
Sean Owen 932d800293 [SPARK-15333][DOCS] Reorganize building-spark.md; rationalize vs wiki
## What changes were proposed in this pull request?

See JIRA for the motivation. The changes are almost entirely movement of text and edits to sections. Minor changes to text include:

- Copying in / merging text from the "Useful Developer Tools" wiki, in areas of
  - Docker
  - R
  - Running one test
- standardizing on ./build/mvn not mvn, and likewise for ./build/sbt
- correcting some typos
- standardizing code block formatting

No text has been removed from this doc; text has been imported from the https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools wiki

## How was this patch tested?

Jekyll doc build and inspection of resulting HTML in browser.

Author: Sean Owen <sowen@cloudera.com>

Closes #13124 from srowen/SPARK-15333.
2016-05-17 16:40:38 +01:00
wm624@hotmail.com 4134ff0c65 [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual compile and test all examples

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12788 from wangmiao1981/example.
2016-05-17 15:20:47 +02:00
wm624@hotmail.com c1836d66bd [SPARK-15305][ML][DOC] spark.ml document Bisectiong k-means has the incorrect format
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
The generated document has the incorrect format for biseckmeans.

![bug](https://cloud.githubusercontent.com/assets/5033592/15233120/d910098a-185a-11e6-901d-44aeafc8a011.jpg)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Fix the formatting.
![fix](https://cloud.githubusercontent.com/assets/5033592/15233136/fce2ccd0-185a-11e6-9ded-14d71da4bdab.jpg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13083 from wangmiao1981/doc.
2016-05-16 08:22:16 +02:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
cody koeninger 89e67d6667 [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #12946 from koeninger/SPARK-15085.
2016-05-11 12:15:41 -07:00
Zheng RuiFeng d88afabdfa [SPARK-15150][EXAMPLE][DOC] Update LDA examples
## What changes were proposed in this pull request?
1,create a libsvm-type dataset for lda: `data/mllib/sample_lda_libsvm_data.txt`
2,add python example
3,directly read the datafile in examples
4,BTW, change to `SparkSession` in `aft_survival_regression.py`

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/lda_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12927 from zhengruifeng/lda_pe.
2016-05-11 12:49:41 +02:00
Nicholas Chammas fafc95af79 [SPARK-15238] Clarify supported Python versions
This PR:
* Clarifies that Spark *does* support Python 3, starting with Python 3.4.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #13017 from nchammas/supported-python-versions.
2016-05-11 11:00:12 +01:00
Zheng RuiFeng 8beae59144 [SPARK-15149][EXAMPLE][DOC] update kmeans example
## What changes were proposed in this pull request?
Python example for ml.kmeans already exists, but not included in user guide.
1,small changes like: `example_on` `example_off`
2,add it to user guide
3,update examples to directly read datafile

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12925 from zhengruifeng/km_pe.
2016-05-11 10:01:43 +02:00
Zheng RuiFeng cef73b5638 [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans
## What changes were proposed in this pull request?

1, add BisectingKMeans to ml-clustering.md
2, add the missing Scala BisectingKMeansExample
3, create a new datafile `data/mllib/sample_kmeans_data.txt`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11844 from zhengruifeng/doc_bkm.
2016-05-11 09:56:36 +02:00
Zheng RuiFeng ad1a8466e9 [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples
## What changes were proposed in this pull request?
1, Add python example for OneVsRest
2, remove args-parsing

## How was this patch tested?
manual tests
`./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12920 from zhengruifeng/ovr_pe.
2016-05-11 09:53:36 +02:00
Holden Karau 488863d873 [SPARK-13382][DOCS][PYSPARK] Update pyspark testing notes in build docs
## What changes were proposed in this pull request?

The current build documents don't specify that for PySpark tests we need to include Hive in the assembly otherwise the ORC tests fail.

## How was the this patch tested?

Manually built the docs locally. Ran the provided build command follow by the PySpark SQL tests.

![pyspark2](https://cloud.githubusercontent.com/assets/59893/13190008/8829cde4-d70f-11e5-8ff5-a88b7894d2ad.png)

Author: Holden Karau <holden@us.ibm.com>

Closes #11278 from holdenk/SPARK-13382-update-pyspark-testing-notes-r2.
2016-05-10 10:29:38 -07:00
Philipp Hoffmann 65b4ab281e [SPARK-15223][DOCS] fix wrongly named config reference
## What changes were proposed in this pull request?

The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.

This commit fixes a remaining reference to the old name in the documentation.

Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.

## How was this patch tested?

no tests

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #13001 from philipphoffmann/patch-3.
2016-05-09 11:02:13 -07:00
Yanbo Liang ee3b171562 [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader
## What changes were proposed in this pull request?
* Since Spark has supported native csv reader, it does not necessary to use the third party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
* Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.

## How was this patch tested?
Offline test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13005 from yanboliang/r-df-examples.
2016-05-09 09:58:36 -07:00
Bryan Cutler 5d188a6970 [DOC][MINOR] Fixed minor errors in feature.ml user guide doc
## What changes were proposed in this pull request?
Fixed some minor errors found when reviewing feature.ml user guide

## How was this patch tested?
built docs locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
2016-05-07 11:20:38 +02:00
Zheng RuiFeng 76ad04d9a0 [SPARK-14512] [DOC] Add python example for QuantileDiscretizer
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12281 from zhengruifeng/discret_pe.
2016-05-06 10:47:13 -07:00
Luciano Resende a03c5e68ab [SPARK-14738][BUILD] Separate docker integration tests from main build
## What changes were proposed in this pull request?

Create a maven profile for executing the docker integration tests using maven
Remove docker integration tests from main sbt build
Update documentation on how to run docker integration tests from sbt

## How was this patch tested?

Manual test of the docker integration tests as in :
mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test

## Other comments

Note that the the DB2 Docker Tests are still disabled as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them on the right level before enabling those tests. They do run ok locally with the updates from PR #12348

Author: Luciano Resende <lresende@apache.org>

Closes #12508 from lresende/docker.
2016-05-06 12:25:45 +01:00
Bryan Cutler cf2e9da612 [SPARK-12299][CORE] Remove history serving functionality from Master
Remove history server functionality from standalone Master.  Previously, the Master process rebuilt a SparkUI once the application was completed which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270).  Keeping this functionality out of the Master will help to simplify the process and increase stability.

Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10991 from BryanCutler/remove-history-master-SPARK-12299.
2016-05-04 14:29:54 -07:00
Dhruve Ashar a45647746d [SPARK-4224][CORE][YARN] Support group acls
## What changes were proposed in this pull request?
Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.

**Changes Proposed in the fix**
Three new corresponding config entries have been added where the user can specify the groups to be given access.

```
spark.admin.acls.groups
spark.modify.acls.groups
spark.ui.view.acls.groups
```

New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.

A generic trait has been introduced to provide the user to group mapping which makes it pluggable to support a variety of mapping protocols - similar to the one used in hadoop. A default unix shell based implementation has been provided.
Custom user to group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping```

**How the patch was Tested**
We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.

Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #12760 from dhruve/impr/SPARK-4224.
2016-05-04 08:45:43 -05:00
Shuai Lin c4e0fde876 [MINOR][DOC] Fixed some python snippets in mllib data types documentation.
## What changes were proposed in this pull request?

Some python snippets is using scala imports and comments.

## How was this patch tested?

Generated the docs locally with `SKIP_API=1 jekyll build` and viewed the changes in the browser.

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #12869 from lins05/fix-mllib-python-snippets.
2016-05-03 18:02:12 -07:00
Sandeep Singh dfd9723dd3 [MINOR][DOCS] Fix type Information in Quick Start and Programming Guide
Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12841 from techaddict/improve_docs_1.
2016-05-03 12:38:21 +01:00
Ben McCann 214d1be4fd Fix reference to external metrics documentation
Author: Ben McCann <benjamin.j.mccann@gmail.com>

Closes #12833 from benmccann/patch-1.
2016-05-01 22:43:28 -07:00
pshearer 0368ff30dd [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13973

Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.

## How was this patch tested?

Manual testing; set IPYTHON=1 and verified that the error message prints.

Author: pshearer <pshearer@massmutual.com>
Author: shearerp <shearerp@umich.edu>

Closes #12528 from shearerp/master.
2016-04-30 10:15:20 +01:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Sean Owen bed0b00202 [SPARK-14882][DOCS] Clarify that Spark can be cross-built for other Scala versions
## What changes were proposed in this pull request?

Add simple clarification that Spark can be cross-built for other Scala versions.

## How was this patch tested?

Automated doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #12757 from srowen/SPARK-14882.
2016-04-28 10:41:15 -07:00
jerryshao 8b44bd52fa [SPARK-6735][YARN] Add window based executor failure tracking mechanism for long running service
This work is based on twinkle-sachdeva 's proposal. In parallel to such mechanism for AM failures, here add similar mechanism for executor failure tracking, this is useful for long running Spark service to mitigate the executor failure problems.

Please help to review, tgravescs sryza and vanzin

Author: jerryshao <sshao@hortonworks.com>

Closes #10241 from jerryshao/SPARK-6735.
2016-04-28 12:38:19 -05:00
Zheng RuiFeng e88476c8c6 [SPARK-14514][DOC] Add python example for VectorSlicer
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12282 from zhengruifeng/vecslicer_pe.
2016-04-26 14:38:29 -07:00
Michael Gummelt 6a7ba1ff74 Fix dynamic allocation docs to address cached data.
## What changes were proposed in this pull request?

Documentation changes

## How was this patch tested?

No tests

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #12664 from mgummelt/fix-dynamic-docs.
2016-04-26 09:31:53 +01:00
Dongjoon Hyun 6ab4d9e0c7 [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?

This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.

- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12649 from dongjoon-hyun/SPARK-14883.
2016-04-24 22:10:27 -07:00
Jacek Laskowski 8df8a81825 [DOCS][MINOR] Screenshot + minor fixes to improve reading for accumulators
## What changes were proposed in this pull request?

Added screenshot + minor fixes to improve reading

## How was this patch tested?

Manual

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12569 from jaceklaskowski/docs-accumulators.
2016-04-24 10:36:33 +01:00
Steve Loughran db7113b1d3 [SPARK-13267][WEB UI] document the ?param arguments of the REST API; lift the…
Add to the REST API details on the ? args and examples from the test suite.

I've used the existing table, adding all the fields to the second table.

see [in the pr](https://github.com/steveloughran/spark/blob/history/SPARK-13267-doc-params/docs/monitoring.md).

There's a slightly more sophisticated option: make the table 3 columns wide, and for all existing entries, have the initial `td` span 2 columns. The new entries would then have an empty 1st column, param in 2nd and text in 3rd, with any examples after a `br` entry.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #11152 from steveloughran/history/SPARK-13267-doc-params.
2016-04-24 10:32:22 +01:00
felixcheung 1b7eab74e6 [SPARK-12148][SPARKR] fix doc after renaming DataFrame to SparkDataFrame
## What changes were proposed in this pull request?

Fixed inadvertent roxygen2 doc changes, added class name change to programming guide
Follow up of #12621

## How was this patch tested?

manually checked

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12647 from felixcheung/rdataframe.
2016-04-23 18:20:31 -07:00
Parth Brahmbhatt 6fdd0e32a6 [SPARK-13988][CORE] Make replaying event logs multi threaded in Histo…ry server to ensure a single large log does not block other logs from being rendered.
## What changes were proposed in this pull request?
The patch makes event log processing multi threaded.

## How was this patch tested?
Existing tests pass, there is no new tests needed to test the functionality as this is a perf improvement. I tested the patch locally by generating one big event log (big1), one small event log(small1) and again a big event log(big2). Without this patch UI does not render any app for almost 30 seconds and then big2 and small1 appears. another 30 second delay and finally big1 also shows up in UI. With this change small1 shows up immediately and big1 and big2 comes up in 30 seconds. Locally it also displays them in the correct order in the UI.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
2016-04-21 06:58:00 -05:00
Sean Owen b4e76a9a3b [SPARK-14742][DOCS] Redirect spark-ec2 doc to new location
## What changes were proposed in this pull request?

Restore `ec2-scripts.md` as a redirect to amplab/spark-ec2 docs

## How was this patch tested?

`jekyll build` and checked with the browser

Author: Sean Owen <sowen@cloudera.com>

Closes #12534 from srowen/SPARK-14742.
2016-04-20 10:46:02 -07:00
Yuhao Yang ed9d803854 [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12454 from hhbyyh/tfdoc.
2016-04-20 11:45:08 +01:00
Reynold Xin 5e92583d38 [SPARK-14667] Remove HashShuffleManager
## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

## How was this patch tested?
Removed some tests related to the old manager.

Author: Reynold Xin <rxin@databricks.com>

Closes #12423 from rxin/SPARK-14667.
2016-04-18 19:30:00 -07:00
Zheng RuiFeng 9bfb35da1e [SPARK-14515][DOC] Add python example for ChiSqSelector
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12283 from zhengruifeng/chi2_pe.
2016-04-18 17:14:22 -07:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Dhruve Ashar f83ba454a5 [SPARK-14572][DOC] Update config docs to allow -Xms in extraJavaOptions
## What changes were proposed in this pull request?
The configuration docs are updated to reflect the changes introduced with [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This allows the user to specify initial heap memory settings through the extraJavaOptions for executor, driver and am.

## How was this patch tested?
The changes are tested in [SPARK-12384](https://issues.apache.org/jira/browse/SPARK-12384). This is just documenting the changes made.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #12333 from dhruve/doc/SPARK-14572.
2016-04-14 10:29:14 -05:00
Yuhao Yang 781df49983 [SPARK-13089][ML] [Doc] spark.ml Naive Bayes user guide and examples
jira: https://issues.apache.org/jira/browse/SPARK-13089

Add section in ml-classification.md for NaiveBayes DataFrame-based API, plus example code (using include_example to clip code from examples/ folder files).

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #11015 from hhbyyh/naiveBayesDoc.
2016-04-13 13:58:35 -07:00
Zheng RuiFeng fcdd69260e [SPARK-14509][DOC] Add python CountVectorizerExample
## What changes were proposed in this pull request?
Add python CountVectorizerExample

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11917 from zhengruifeng/cv_pe.
2016-04-13 13:56:23 -07:00
Dongjoon Hyun 1a0cca1fc8 [MINOR][DOCS] Fix wrong data types in JSON Datasets example.
## What changes were proposed in this pull request?

This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
2016-04-11 09:03:11 +01:00
Zheng RuiFeng adb9d73cd6 [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler
## What changes were proposed in this pull request?
add three python examples

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12063 from zhengruifeng/dct_pe.
2016-04-09 11:25:39 -07:00
Michael Gummelt 30e980ad8e [DOCS][MINOR] Remove sentence about Mesos not supporting cluster mode.
Docs change to remove the sentence about Mesos not supporting cluster mode.

It was not.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #12249 from mgummelt/fix-mesos-cluster-docs.
2016-04-07 17:41:55 -07:00
Malte db75ccb552 Better host description for multi-master mesos
## What changes were proposed in this pull request?

Since not having the correct zk url causes job failure, the documentation should include all parameters

## How was this patch tested?

no tests necessary

Author: Malte <elmalto@users.noreply.github.com>

Closes #12218 from elmalto/patch-1.
2016-04-07 09:16:07 +01:00
Reynold Xin 9ca0760d67 [SPARK-10063][SQL] Remove DirectParquetOutputCommitter
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.

## How was this patch tested?
Removed the related tests also.

Author: Reynold Xin <rxin@databricks.com>

Closes #12229 from rxin/SPARK-10063.
2016-04-07 00:51:45 -07:00