Commit graph

1737 commits

Author SHA1 Message Date
Sean Owen dc4c600986 [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
## What changes were proposed in this pull request?

Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15733 from srowen/SPARK-18138.
2016-11-03 17:27:23 -07:00
Liwei Lin 98ede49496
[SPARK-18198][DOC][STREAMING] Highlight code snippets
## What changes were proposed in this pull request?

This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight code snippets in the `Structured Streaming Kafka010 integration doc` and the `Spark Streaming Kafka010 integration doc`.

This patch consists of two commits:
- the first commit fixes only the leading spaces -- this is large
- the second commit adds the highlight instructions -- this is much simpler and easier to review

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshots

**Before**

![snip20161101_3](https://cloud.githubusercontent.com/assets/15843379/19894258/47746524-a087-11e6-9a2a-7bff2d428d44.png)

**After**

![snip20161101_1](https://cloud.githubusercontent.com/assets/15843379/19894324/8bebcd1e-a087-11e6-835b-88c4d2979cfa.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15715 from lw-lin/doc-highlight-code-snippet.
2016-11-02 09:10:34 +00:00
Joseph K. Bradley 91c33a0ca5 [SPARK-18088][ML] Various ChiSqSelector cleanups
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15647 from jkbradley/chisqselector-follow-ups.
2016-11-01 17:00:00 -07:00
Josh Rosen 6e6298154a [SPARK-17350][SQL] Disable default use of KryoSerializer in Thrift Server
In SPARK-4761 / #3621 (December 2014) we enabled Kryo serialization by default in the Spark Thrift Server. However, I don't think that the original rationale for doing this still holds now that most Spark SQL serialization is performed via encoders and our UnsafeRow format.

In addition, the use of Kryo as the default serializer can introduce performance problems because the creation of new KryoSerializer instances is expensive and we haven't performed instance-reuse optimizations in several code paths (including DirectTaskResult deserialization).

Given all of this, I propose to revert back to using JavaSerializer as the default serializer in the Thrift Server.
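
For users who still prefer Kryo after this change, a minimal sketch of opting back in explicitly (the standard `spark.serializer` key; illustrative, not part of the patch):

```Scala
import org.apache.spark.SparkConf

// Explicitly restore Kryo after the Thrift Server default reverts to JavaSerializer
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```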

/cc liancheng

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14906 from JoshRosen/disable-kryo-in-thriftserver.
2016-11-01 16:23:47 -07:00
Charles Allen e34b4e1267
[SPARK-15994][MESOS] Allow enabling Mesos fetch cache in coarse executor backend
Mesos 0.23.0 introduces a Fetch Cache feature http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of resources specified in command URIs.

This patch:
- Updates the Mesos shaded protobuf dependency to 0.23.0
- Allows setting `spark.mesos.fetcherCache.enable` to enable the fetch cache for all specified URIs. (URIs must be specified for the setting to have any effect; see the sketch below.)
- Updates documentation for Mesos configuration with the new setting.

This patch does NOT:
- Allow for per-URI caching configuration. The cache setting is global to ALL URIs for the command.
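
A minimal sketch of enabling the cache, using the config key named in this patch (the `SparkConf` usage here is illustrative):

```Scala
import org.apache.spark.SparkConf

// Enable the Mesos fetch cache for every URI of the executor command
val conf = new SparkConf()
  .set("spark.mesos.fetcherCache.enable", "true")
```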

Author: Charles Allen <charles@allen-net.com>

Closes #13713 from drcrallen/SPARK15994.
2016-11-01 13:14:17 +00:00
Dongjoon Hyun 623fc7fc67
[MINOR][DOC] Remove spaces following slashes
## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual, using the following commands, because this is a documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.
2016-11-01 13:08:49 +00:00
Hossein 2881a2d1d1 [SPARK-17919] Make timeout to RBackend configurable in SparkR
## What changes were proposed in this pull request?

This patch makes the RBackend connection timeout configurable by the user.

## How was this patch tested?
N/A

Author: Hossein <hossein@databricks.com>

Closes #15471 from falaki/SPARK-17919.
2016-10-30 16:17:23 -07:00
Liwei Lin 505b927cb7
[SPARK-16312][FOLLOW-UP][STREAMING][KAFKA][DOC] Add java code snippet for Kafka 0.10 integration doc
## What changes were proposed in this pull request?

Added a Java code snippet for the Kafka 0.10 integration doc.

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshot

![kafka-doc](https://cloud.githubusercontent.com/assets/15843379/19826272/bf0d8a4c-9db8-11e6-9e40-1396723df4bc.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15679 from lw-lin/kafka-010-examples.
2016-10-30 09:32:19 +00:00
VinceShieh 0b076d4cb6 [SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special value that is commonly treated as invalid, but there are certain cases where NaN values are also meaningful and need special handling. This PR gives users three options for dealing with NaN values: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleNaN to "keep", "skip", or "error" (default) respectively.

**Before:**
```Scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```

**After:**
```Scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
```

## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
2016-10-27 11:52:15 -07:00
cody koeninger 1042325805 [SPARK-17813][SQL][KAFKA] Maximum data per trigger
## What changes were proposed in this pull request?

Adds a maxOffsetsPerTrigger option for rate limiting; the cap is apportioned proportionally across topic-partitions based on their volume.
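
A sketch of how the option could be set on a Kafka stream (topic name and option values are illustrative):

```Scala
// Cap each trigger at roughly 10,000 offsets, split proportionally across topic-partitions
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```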

## How was this patch tested?

Added unit test

Author: cody koeninger <cody@koeninger.org>

Closes #15527 from koeninger/SPARK-17813.
2016-10-27 10:30:59 -07:00
Felix Cheung 44c8bfda79 [SQL][DOC] updating doc for JSON source to link to jsonlines.org
## What changes were proposed in this pull request?

API and programming guide doc changes for Scala, Python and R.

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15629 from felixcheung/jsondoc.
2016-10-26 23:06:11 -07:00
Alex Bozarth 5d0f81da49
[SPARK-4411][WEB UI] Add "kill" link for jobs in the UI
## What changes were proposed in this pull request?

Currently users can kill stages via the web UI but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web UI. This code change is based on #4823 by lianhuiwang and updated to work with the latest code, matching how stages are currently killed. In general I've copied the kill-stage code, warning and note comments and all. I also updated applicable tests and documentation.

## How was this patch tested?

Manually tested and dev/run-tests

![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png)

Author: Alex Bozarth <ajbozart@us.ibm.com>
Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #15441 from ajbozarth/spark4411.
2016-10-26 14:26:54 +02:00
Sean Owen 4ecbe1b92f
[SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
## What changes were proposed in this pull request?

Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15382 from srowen/SPARK-17810.
2016-10-24 10:44:45 +01:00
Sandeep Singh bc167a2a53 [SPARK-928][CORE] Add support for Unsafe-based serializer in Kryo
## What changes were proposed in this pull request?
Now that we have migrated to Kryo 3.0.0 in https://issues.apache.org/jira/browse/SPARK-11416, we can give users the option to use the unsafe-based SerDe. It can be turned on by setting `spark.kryo.useUnsafe` to `true`.
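
A minimal sketch of turning this on, using the config key from the description above:

```Scala
import org.apache.spark.SparkConf

// Use Kryo and opt into its unsafe-based input/output
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.useUnsafe", "true")
```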

## How was this patch tested?
Ran existing tests

```
     Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      basicTypes: Int unsafe:true                    160 /  178         98.5          10.1       1.0X
      basicTypes: Long unsafe:true                   210 /  218         74.9          13.4       0.8X
      basicTypes: Float unsafe:true                  203 /  213         77.5          12.9       0.8X
      basicTypes: Double unsafe:true                 226 /  235         69.5          14.4       0.7X
      Array: Int unsafe:true                        1087 / 1101         14.5          69.1       0.1X
      Array: Long unsafe:true                       2758 / 2844          5.7         175.4       0.1X
      Array: Float unsafe:true                      1511 / 1552         10.4          96.1       0.1X
      Array: Double unsafe:true                     2942 / 2972          5.3         187.0       0.1X
      Map of string->Double unsafe:true             2645 / 2739          5.9         168.2       0.1X
      basicTypes: Int unsafe:false                   211 /  218         74.7          13.4       0.8X
      basicTypes: Long unsafe:false                  247 /  253         63.6          15.7       0.6X
      basicTypes: Float unsafe:false                 211 /  216         74.5          13.4       0.8X
      basicTypes: Double unsafe:false                227 /  233         69.2          14.4       0.7X
      Array: Int unsafe:false                       3012 / 3032          5.2         191.5       0.1X
      Array: Long unsafe:false                      4463 / 4515          3.5         283.8       0.0X
      Array: Float unsafe:false                     2788 / 2868          5.6         177.2       0.1X
      Array: Double unsafe:false                    3558 / 3752          4.4         226.2       0.0X
      Map of string->Double unsafe:false            2806 / 2933          5.6         178.4       0.1X
```

Author: Sandeep Singh <sandeep@techaddict.me>
Author: Sandeep Singh <sandeep@origamilogic.com>

Closes #12913 from techaddict/SPARK-928.
2016-10-22 12:03:37 -07:00
Sean Owen 01b26a0643
[SPARK-17898][DOCS] repositories needs username and password
## What changes were proposed in this pull request?

Document `user:password` syntax as possible means of specifying credentials for password-protected `--repositories`

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15584 from srowen/SPARK-17898.
2016-10-22 09:39:07 +01:00
cody koeninger c9720b2195 [STREAMING][KAFKA][DOC] clarify kafka settings needed for larger batches
## What changes were proposed in this pull request?

Minor doc change to mention kafka configuration for larger spark batches.

## How was this patch tested?

Doc change only, confirmed via jekyll.

The configuration issue was discussed / confirmed with users on the mailing list.

Author: cody koeninger <cody@koeninger.org>

Closes #15570 from koeninger/kafka-doc-heartbeat.
2016-10-21 16:27:19 -07:00
cody koeninger 268ccb9a48 [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream
## What changes were proposed in this pull request?

startingOffsets now accepts specific per-topic-partition offsets as a JSON argument, usable with any consumer strategy.

assign is added as a consumer strategy that takes specific topic-partitions.
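
A sketch combining the two (topic, partition and offset values are illustrative; `-1` means "latest" in the JSON convention used by the Kafka source):

```Scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("assign", """{"topicA":[0,1]}""")                    // consume only these topic-partitions
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1}}""") // per-partition starting offsets
  .load()
```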

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15504 from koeninger/SPARK-17812.
2016-10-21 15:55:04 -07:00
Felix Cheung e21e1c946c [SPARK-18013][SPARKR] add crossJoin API
## What changes were proposed in this pull request?

Add crossJoin and do not default to cross join if joinExpr is left out

## How was this patch tested?

unit test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15559 from felixcheung/rcrossjoin.
2016-10-21 12:35:37 -07:00
Mark Grover 2d14ab7e64 [DOCS] Update docs to not suggest to package Spark before running tests.
## What changes were proposed in this pull request?

Update docs to not suggest to package Spark before running tests.

## How was this patch tested?

Not creating a JIRA since this is pretty small. We haven't had the need to run mvn package before mvn test since 1.6 at least, or so I am told. So, updating the docs to not be misleading.

Author: Mark Grover <mark@apache.org>

Closes #15572 from markgrover/doc_update.
2016-10-20 15:30:01 -07:00
Takuya UESHIN 9540357ada
[SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety: it sometimes gets stuck because of a race condition when initializing a hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15548 from ueshin/issues/SPARK-17985.
2016-10-19 10:06:43 +01:00
Tommy YU f39852e598 [SPARK-18001][DOCUMENT] fix broke link to SparkDataFrame
## What changes were proposed in this pull request?

In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)"

The link to the R DataFrame doesn't work; it returns:
The requested URL /docs/latest/api/R/DataFrame.html was not found on this server.

The correct link is SparkDataFrame.html for Spark 2.0.

## How was this patch tested?

Manual checked.

Author: Tommy YU <tummyyu@163.com>

Closes #15543 from Wenpei/spark-18001.
2016-10-18 21:15:32 -07:00
Reynold Xin cd662bc7a2 Revert "[SPARK-17985][CORE] Bump commons-lang3 version to 3.5."
This reverts commit bfe7885aee.

The commit caused build failures on Hadoop 2.2 profile:

```
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1489: value read is not a member of object org.apache.commons.io.IOUtils
[error]       var numBytes = IOUtils.read(gzInputStream, buf)
[error]                              ^
[error] /scratch/rxin/spark/core/src/main/scala/org/apache/spark/util/Utils.scala:1492: value read is not a member of object org.apache.commons.io.IOUtils
[error]         numBytes = IOUtils.read(gzInputStream, buf)
[error]                            ^
```
2016-10-18 13:56:35 -07:00
Weiqing Yang 20dd11096c [MINOR][DOC] Add more built-in sources in sql-programming-guide.md
## What changes were proposed in this pull request?
Add more built-in sources in sql-programming-guide.md.

## How was this patch tested?
Manually.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15522 from weiqingy/dsDoc.
2016-10-18 13:38:14 -07:00
Takuya UESHIN bfe7885aee [SPARK-17985][CORE] Bump commons-lang3 version to 3.5.
## What changes were proposed in this pull request?

`SerializationUtils.clone()` of commons-lang3 (<3.5) has a bug that breaks thread safety: it sometimes gets stuck because of a race condition when initializing a hash map.
See https://issues.apache.org/jira/browse/LANG-1251.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #15525 from ueshin/issues/SPARK-17985.
2016-10-18 13:36:00 -07:00
Yu Peng 231f39e3f6 [SPARK-17711] Compress rolled executor log
## What changes were proposed in this pull request?

This PR adds support for executor log compression.

## How was this patch tested?

Unit tests

cc: yhuai tdas mengxr

Author: Yu Peng <loneknightpy@gmail.com>

Closes #15285 from loneknightpy/compress-executor-log.
2016-10-18 13:23:31 -07:00
Reynold Xin 72a6e7a57a Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors"
This reverts commit ed14633414.

The patch merged had obvious quality and documentation issue. The idea is useful, and we should work towards improving its quality and merging it in again.
2016-10-15 22:31:37 -07:00
Zhan Zhang ed14633414 [SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors
## What changes were proposed in this pull request?

Restructure the code and implement two new task assigners.
PackedAssigner: tries to allocate tasks to the executors with the fewest available cores, so that Spark can release reserved executors when dynamic allocation is enabled.

BalancedAssigner: tries to allocate tasks to the executors with the most available cores in order to balance the workload across all executors.

By default, the original round-robin assigner is used.

We tested a pipeline, and the new PackedAssigner saves around 45% of the reserved CPU and memory with dynamic allocation enabled.

## How was this patch tested?

Both unit tests in TaskSchedulerImplSuite and manual tests in a production pipeline.

Author: Zhan Zhang <zhanzhang@fb.com>

Closes #15218 from zhzhan/packed-scheduler.
2016-10-15 18:45:04 -07:00
Dhruve Ashar a0ebcb3a30
[DOC] Fix typo in sql hive doc
Change is too trivial to file a JIRA.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15485 from dhruve/master.
2016-10-14 17:45:27 +01:00
Imran Rashid 9ce7d3e542 [SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425.

To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making Spark fault-tolerant to one bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).
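
A minimal configuration sketch, assuming the new settings follow the `spark.blacklist.*` naming documented for later Spark releases (treat the exact keys and values here as illustrative rather than taken from this PR):

```Scala
import org.apache.spark.SparkConf

// Turn on taskset-level blacklisting instead of cranking up spark.task.maxFailures
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.task.maxFailures", "4") // leave at the default; the blacklist handles a bad disk/node
```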

## How was this patch tested?

Added unit tests, run tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #15249 from squito/taskset_blacklist_only.
2016-10-12 16:43:03 -05:00
cody koeninger c264ef9b19 [SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is bad
## What changes were proposed in this pull request?

Documentation fix to make it clear that reusing group id for different streams is super duper bad, just like it is with the underlying Kafka consumer.

## How was this patch tested?

I built jekyll doc and made sure it looked ok.

Author: cody koeninger <cody@koeninger.org>

Closes #15442 from koeninger/SPARK-17853.
2016-10-12 00:40:47 -07:00
Kousuke Saruta b512f04f8e [SPARK-17880][DOC] The url linking to AccumulatorV2 in the document is incorrect.
## What changes were proposed in this pull request?

In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.

## How was this patch tested?
manual test.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #15439 from sarutak/SPARK-17880.
2016-10-11 22:36:57 -07:00
Alexander Pivovarov 299eb04ba0 Fix hadoop.version in building-spark.md
A couple of mvn build examples use `-Dhadoop.version=VERSION` instead of an actual version number.

Author: Alexander Pivovarov <apivovarov@gmail.com>

Closes #15440 from apivovarov/patch-1.
2016-10-11 22:31:21 -07:00
hyukjinkwon 0c0ad436ad [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
## What changes were proposed in this pull request?

This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.

This PR includes some changes as below:

- Unify `Map[String, String]`, `Properties` and `JDBCOptions` in the `execution/jdbc` package to `JDBCOptions`.

- Move the `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into the `JDBCOptions` instance.

- Document `batchSize` and `isolationlevel`, marking which are read-only options and which are write-only options. Also, this includes fixes for minor typos and detailed explanations for some statements such as the url.

- Throw exceptions fast by checking arguments first rather than at execution time (e.g. for `fetchsize`; see the sketch after this list).

- Exclude Spark-only options from the connection properties.
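
A minimal read sketch (connection details are hypothetical) showing options that now flow through a single `JDBCOptions` instance:

```Scala
// fetchsize is validated up front and kept out of the properties handed to the JDBC driver
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.accounts")
  .option("fetchsize", "1000")
  .load()
```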

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15292 from HyukjinKwon/SPARK-17719.
2016-10-10 22:22:41 -07:00
Timothy Chen 29f186bfdf
[SPARK-14082][MESOS] Enable GPU support with Mesos
## What changes were proposed in this pull request?

Enable GPU resources to be used when running coarse grain mode with Mesos.

## How was this patch tested?

Manual test with GPU.

Author: Timothy Chen <tnachen@gmail.com>

Closes #14644 from tnachen/gpu_mesos.
2016-10-10 23:20:15 +02:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.
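
A brief usage sketch of the behaviour described above (the view name is illustrative):

```Scala
// A global temp view is registered under the reserved global_temp database
val df = spark.range(10)
df.createGlobalTempView("view1")

spark.sql("SELECT * FROM global_temp.view1").show()
// Visible from other sessions for the lifetime of the application
spark.newSession().sql("SELECT * FROM global_temp.view1").show()
```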

changes for `SessionCatalog`:

1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
Sean Owen cff5607552 [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
## What changes were proposed in this pull request?

This expands calls to Jetty's simple `ServerConnector` constructor to explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It should otherwise result in exactly the same configuration, because the other args are copied from the constructor that is currently called.
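
A minimal sketch of the kind of constructor call described above (names and arguments are illustrative of Jetty 9's API, not copied from the patch):

```Scala
import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler

val server = new Server()
// Pass an explicit scheduler that creates daemon threads so a lingering Jetty timer
// thread cannot keep the spark-submit JVM alive after the application finishes
val connector = new ServerConnector(
  server,
  null,                                                       // default executor
  new ScheduledExecutorScheduler("ui-jetty-scheduler", true), // daemon = true
  null,                                                       // default ByteBufferPool
  -1, -1,                                                     // default acceptors / selectors
  new HttpConnectionFactory())
```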

(I'm not sure we should change the Hive Thriftserver impl, but I did anyway.)

This also adds `sc.stop()` to the quick start guide example.

## How was this patch tested?

Existing tests; _pending_ at least manual verification of the fix.

Author: Sean Owen <sowen@cloudera.com>

Closes #15381 from srowen/SPARK-17707.
2016-10-07 10:31:41 -07:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of the work, and part of it was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00
sethah 9df54f5325
[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression
## What changes were proposed in this pull request?
Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training.

## How was this patch tested?
Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15349 from sethah/SPARK-17239.
2016-10-05 18:28:21 +00:00
Sean Owen 1dd68d3827
[SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer in MLlib docs
## What changes were proposed in this pull request?

Move note about labels being +1/-1 in formulation only to be just under the table of formulations.

## How was this patch tested?

Doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #15330 from srowen/SPARK-17718.
2016-10-03 18:09:36 +00:00
Jagadeesan a27033c0bb
[SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…
## What changes were proposed in this pull request?

To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~

… pandoc]

Author: Jagadeesan <as2@us.ibm.com>

Closes #15309 from jagadeesanas2/SPARK-17736.
2016-10-03 10:46:38 +01:00
Dongjoon Hyun 15e9bbb49e [MINOR][DOC] Add an up-to-date description for default serialization during shuffling
## What changes were proposed in this pull request?

This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark started to choose Kryo as the default serialization library during shuffling of simple types, arrays of simple types, or string type.

## How was this patch tested?

This is a documentation update.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
2016-09-30 22:05:59 -07:00
Dongjoon Hyun 39eb3bb1ec [SPARK-17412][DOC] All test should not be run by root or any admin user
## What changes were proposed in this pull request?

`FsHistoryProviderSuite` fails if the `root` user runs it. The test case **SPARK-3697: ignore directories that cannot be read** depends on `setReadable(false, false)` to make test data files unreadable and expects the number of accessible files to be 1. But `root` can access all files, so it returns 2.

This PR adds the assumption explicitly to the docs, in `building-spark.md`.

## How was this patch tested?

This is a documentation change.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15291 from dongjoon-hyun/SPARK-17412.
2016-09-29 16:01:45 -07:00
José Hiram Soltren 958200497a [DOCS] Reorganize explanation of Accumulators and Broadcast Variables
## What changes were proposed in this pull request?

The discussion of the interaction of Accumulators and Broadcast Variables should logically follow the discussion on Checkpointing. As currently written, this section discusses Checkpointing before it is formally introduced. To remedy this:

 - Rename this section to "Accumulators, Broadcast Variables, and Checkpoints", and
 - Move this section after "Checkpointing".

## How was this patch tested?

Testing: ran

$ SKIP_API=1 jekyll build

and verified the changes in a web browser pointed at docs/_site/index.html.

Author: José Hiram Soltren <jose@cloudera.com>

Closes #15281 from jsoltren/doc-changes.
2016-09-29 10:18:56 -07:00
Takeshi YAMAMURO b2e9731ca4
[MINOR][DOCS] Fix the doc of spark-streaming with kinesis
## What changes were proposed in this pull request?
This PR just fixes the documentation of `spark-kinesis-integration`.
Since `SPARK-17418` prevented all the Kinesis artifacts (including the Kinesis example code)
from being published, `bin/run-example streaming.KinesisWordCountASL` and `bin/run-example streaming.JavaKinesisWordCountASL` do not work.
Instead, it fetches the Kinesis jar from Spark Packages.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #15260 from maropu/DocFixKinesis.
2016-09-29 08:26:03 -04:00
Shuai Lin b2a7eedcdd
[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector
## What changes were proposed in this pull request?

A follow up for #14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
2016-09-28 06:12:48 -04:00
Andrew Mills 00be16df64 [Docs] Update spark-standalone.md to fix link
Corrected a link to the configuration.html page; it was pointing to a page that does not exist (configurations.html).

Documentation change, verified in preview.

Author: Andrew Mills <ammills01@users.noreply.github.com>

Closes #15244 from ammills01/master.
2016-09-26 16:41:14 -04:00
Liang-Chi Hsieh 8135e0e5eb [SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing
## What changes were proposed in this pull request?

When reading file stream with non-globbing path, the results return data with all `null`s for the
partitioned columns. E.g.,

    case class A(id: Int, value: Int)
    val data = spark.createDataset(Seq(
      A(1, 1),
      A(2, 2),
      A(2, 3))
    )
    val url = "/tmp/test"
    data.write.partitionBy("id").parquet(url)
    spark.read.parquet(url).show

    +-----+---+
    |value| id|
    +-----+---+
    |    2|  2|
    |    3|  2|
    |    1|  1|
    +-----+---+

    val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
    s.writeStream.queryName("test").format("memory").start()

    sql("SELECT * FROM test").show

    +-----+----+
    |value|  id|
    +-----+----+
    |    2|null|
    |    3|null|
    |    1|null|
    +-----+----+

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #14803 from viirya/filestreamsource-option.
2016-09-26 13:07:11 -07:00
Justin Pihony 50b89d05b7
[SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc
## What changes were proposed in this pull request?

This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save.
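
A short sketch (connection details are hypothetical) of the call path this change enables:

```Scala
// format("jdbc") + save() now works; DataFrameWriter.jdbc(...) simply delegates to save()
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.accounts")
  .option("user", "spark")
  .option("password", "secret")
  .save()
```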

## How was this patch tested?

This was tested via unit tests in the JDBCWriteSuite, of which I added one new test to cover this scenario.

## Additional details

rxin This seems to have been most recently touched by you and was also commented on in the JIRA.

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Justin Pihony <justin.pihony@gmail.com>
Author: Justin Pihony <justin.pihony@typesafe.com>

Closes #12601 from JustinPihony/jdbc_reconciliation.
2016-09-26 09:54:22 +01:00
Jeff Zhang f62ddc5983 [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio
## What changes were proposed in this pull request?

Spark will add sparkr.zip to the archives only when it is in YARN mode (SparkSubmit.scala).
```
    if (args.isR && clusterManager == YARN) {
      val sparkRPackagePath = RUtils.localSparkRPackagePath
      if (sparkRPackagePath.isEmpty) {
        printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
      }
      val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
      if (!sparkRPackageFile.exists()) {
        printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
      }
      val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString

      // Distribute the SparkR package.
      // Assigns a symbol link name "sparkr" to the shipped package.
      args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")

      // Distribute the R package archive containing all the built R packages.
      if (!RUtils.rPackages.isEmpty) {
        val rPackageFile =
          RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
        if (!rPackageFile.exists()) {
          printErrorAndExit("Failed to zip all the built R packages.")
        }

        val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
        // Assigns a symbol link name "rpkg" to the shipped package.
        args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
      }
    }
```
So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, I also pass spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.

## How was this patch tested?

Verify it manually in R Studio using the following code.
```
Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
df <- as.DataFrame(mtcars)
head(df)

```

…

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14784 from zjffdu/SPARK-17210.
2016-09-23 11:37:43 -07:00
frreiss 646f383465
[SPARK-17421][DOCS] Documenting the current treatment of MAVEN_OPTS.
## What changes were proposed in this pull request?

Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8.

## How was this patch tested?

Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser.

Author: frreiss <frreiss@us.ibm.com>

Closes #15005 from frreiss/fred-17421a.
2016-09-22 10:31:15 +01:00