Commit graph

1015 commits

Author SHA1 Message Date
Marcelo Vanzin 979bf946d5 [SPARK-20655][CORE] In-memory KVStore implementation.
This change adds an in-memory implementation of KVStore that can be
used by the live UI.

The implementation is not fully optimized, neither for speed nor
space, but should be fast enough for using in the listener bus.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18395 from vanzin/SPARK-20655.
2017-08-08 11:02:54 -07:00
Sean Owen e26dac5feb [SPARK-21415] Triage scapegoat warnings, part 1
## What changes were proposed in this pull request?

Address scapegoat warnings for:
- BigDecimal double constructor
- Catching NPE
- Finalizer without super
- List.size is O(n)
- Prefer Seq.empty
- Prefer Set.empty
- reverse.map instead of reverseMap
- Type shadowing
- Unnecessary if condition.
- Use .log1p
- Var could be val

In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #18635 from srowen/Scapegoat1.
2017-07-18 08:47:17 +01:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
jinxing 58434acdd8 [SPARK-19937] Collect metrics for remote bytes read to disk during shuffle.
In current code(https://github.com/apache/spark/pull/16989), big blocks are shuffled to disk.
This pr proposes to collect metrics for remote bytes fetched to disk.

Author: jinxing <jinxing6042@126.com>

Closes #18249 from jinxing64/SPARK-19937.
2017-06-22 14:10:51 -07:00
Marcelo Vanzin 0cba495120 [SPARK-20641][CORE] Add key-value store abstraction and LevelDB implementation.
This change adds an abstraction and LevelDB implementation for a key-value
store that will be used to store UI and SHS data.

The interface is described in KVStore.java (see javadoc). Specifics
of the LevelDB implementation are discussed in the javadocs of both
LevelDB.java and LevelDBTypeInfo.java.

Included also are a few small benchmarks just to get some idea of
latency. Because they're too slow for regular unit test runs, they're
disabled by default.

Tested with the included unit tests, and also as part of the overall feature
implementation (including running SHS with hundreds of apps).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #17902 from vanzin/shs-ng/M1.
2017-06-06 13:39:10 -05:00
Yanbo Liang 0698e6c88c [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
This reverts commit b8733e0ad9.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17944 from yanboliang/spark-20606-revert.
2017-05-11 14:48:13 +08:00
Sanket 181261a81d [SPARK-20355] Add per application spark version on the history server headerpage
## What changes were proposed in this pull request?

Spark Version for a specific application is not displayed on the history page now. It should be nice to switch the spark version on the UI when we click on the specific application.
Currently there seems to be way as SparkListenerLogStart records the application version. So, it should be trivial to listen to this event and provision this change on the UI.
For Example
<img width="1439" alt="screen shot 2017-04-06 at 3 23 41 pm" src="https://cloud.githubusercontent.com/assets/8295799/25092650/41f3970a-2354-11e7-9b0d-4646d0adeb61.png">
<img width="1399" alt="screen shot 2017-04-17 at 9 59 33 am" src="https://cloud.githubusercontent.com/assets/8295799/25092743/9f9e2f28-2354-11e7-9605-f2f1c63f21fe.png">

{"Event":"SparkListenerLogStart","Spark Version":"2.0.0"}
(Please fill in changes proposed in this fix)
Modified the SparkUI for History server to listen to SparkLogListenerStart event and extract the version and print it.

## How was this patch tested?
Manual testing of UI page. Attaching the UI screenshot changes here

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Sanket <schintap@untilservice-lm>

Closes #17658 from redsanket/SPARK-20355.
2017-05-09 09:30:09 -05:00
Yanbo Liang b8733e0ad9 [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove ML methods we deprecated in 2.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17867 from yanboliang/spark-20606.
2017-05-09 17:30:37 +08:00
Steve Loughran 2cf83c4783 [SPARK-7481][BUILD] Add spark-hadoop-cloud module to pull in object store access.
## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
2017-05-07 10:15:31 +01:00
madhu 9064f1b044 [SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API
## What changes were proposed in this pull request?
Currently cacheTable API only supports MEMORY_AND_DISK. This PR adds additional API to take different storage levels.
## How was this patch tested?
unit tests

Author: madhu <phatak.dev@gmail.com>

Closes #17802 from phatak-dev/cacheTableAPI.
2017-05-05 22:44:03 +08:00
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
hyukjinkwon 35378766ad [SPARK-20343][BUILD] Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build
## What changes were proposed in this pull request?

This PR proposes two things as below:

- Avoid Unidoc build only if Hadoop 2.6 is explicitly set in SBT build

  Due to a different dependency resolution in SBT & Unidoc by an unknown reason, the documentation build fails on a specific machine & environment in Jenkins but it was unable to reproduce.

  So, this PR just checks an environment variable `AMPLAB_JENKINS_BUILD_PROFILE` that is set in Hadoop 2.6 SBT build against branches on Jenkins, and then disables Unidoc build. **Note that PR builder will still build it with Hadoop 2.6 & SBT.**

  ```
  ========================================================================
  Building Unidoc API Documentation
  ========================================================================
  [info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments:  -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive unidoc
  Using /usr/java/jdk1.8.0_60 as default JAVA_HOME.
  ...
  ```

  I checked the environment variables from the logs (first bit) as below:

  - **spark-master-test-sbt-hadoop-2.6** (this one is being failed) - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/lastBuild/consoleFull

  ```
  JAVA_HOME=/usr/java/jdk1.8.0_60
  JAVA_7_HOME=/usr/java/jdk1.7.0_79
  SPARK_BRANCH=master
  AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6   <- I use this variable
  AMPLAB_JENKINS="true"
  ```
  - spark-master-test-sbt-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/lastBuild/consoleFull

  ```
  JAVA_HOME=/usr/java/jdk1.8.0_60
  JAVA_7_HOME=/usr/java/jdk1.7.0_79
  SPARK_BRANCH=master
  AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.7
  AMPLAB_JENKINS="true"
  ```

  - spark-master-test-maven-hadoop-2.6 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/lastBuild/consoleFull

  ```
  JAVA_HOME=/usr/java/jdk1.8.0_60
  JAVA_7_HOME=/usr/java/jdk1.7.0_79
  HADOOP_PROFILE=hadoop-2.6
  HADOOP_VERSION=
  SPARK_BRANCH=master
  AMPLAB_JENKINS="true"
  ```

  - spark-master-test-maven-hadoop-2.7 - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastBuild/consoleFull

  ```
  JAVA_HOME=/usr/java/jdk1.8.0_60
  JAVA_7_HOME=/usr/java/jdk1.7.0_79
  HADOOP_PROFILE=hadoop-2.7
  HADOOP_VERSION=
  SPARK_BRANCH=master
  AMPLAB_JENKINS="true"
  ```

  - PR builder - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75843/consoleFull

  ```
  JENKINS_MASTER_HOSTNAME=amp-jenkins-master
  JAVA_HOME=/usr/java/jdk1.8.0_60
  JAVA_7_HOME=/usr/java/jdk1.7.0_79
  ```

  Assuming from other logs in branch-2.1

    - SBT & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/lastBuild/consoleFull

      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      SPARK_BRANCH=branch-2.1
      AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.6
      AMPLAB_JENKINS="true"
      ```

    - Maven & Hadoop 2.6 against branch-2.1 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-maven-hadoop-2.6/lastBuild/consoleFull

      ```
      JAVA_HOME=/usr/java/jdk1.8.0_60
      JAVA_7_HOME=/usr/java/jdk1.7.0_79
      HADOOP_PROFILE=hadoop-2.6
      HADOOP_VERSION=
      SPARK_BRANCH=branch-2.1
      AMPLAB_JENKINS="true"
      ```

  We have been using the same convention for those variables. These are actually being used in `run-tests.py` script - here https://github.com/apache/spark/blob/master/dev/run-tests.py#L519-L520

- Revert the previous try

  After https://github.com/apache/spark/pull/17651, it seems the build still fails on SBT Hadoop 2.6 master.

  I am unable to reproduce this - https://github.com/apache/spark/pull/17477#issuecomment-294094092 and the reviewer was too. So, this got merged as it looks the only way to verify this is to merge it currently (as no one seems able to reproduce this).

## How was this patch tested?

I only checked `is_hadoop_version_2_6 = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6"` is working fine as expected as below:

```python
>>> import collections
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.6"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
False
>>> os = collections.namedtuple('os', 'environ')(environ={"AMPLAB_JENKINS_BUILD_PROFILE": "hadoop2.7"})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
>>> os = collections.namedtuple('os', 'environ')(environ={})
>>> print(not os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE") == "hadoop2.6")
True
```

I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this.

Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17669 from HyukjinKwon/revert-SPARK-20343.
2017-04-19 12:18:54 +01:00
hyukjinkwon d4f10cbbe1 [SPARK-20343][BUILD] Force Avro 1.7.7 in sbt build to resolve build failure in SBT Hadoop 2.6 master on Jenkins
## What changes were proposed in this pull request?

This PR proposes to force Avro's version to 1.7.7 in core to resolve the build failure as below:

```
[error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData
[error]     writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema))
[error]
```

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/2770/consoleFull

Note that this is a hack and should be removed in the future.

## How was this patch tested?

I only tested this actually overrides the dependency.

I tried many ways but I was unable to reproduce this in my local. Sean also tried the way I did but he was also unable to reproduce this.

Please refer the comments in https://github.com/apache/spark/pull/17477#issuecomment-294094092

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17651 from HyukjinKwon/SPARK-20343-sbt.
2017-04-18 11:05:00 +01:00
jerryshao a4491626ed [SPARK-17019][CORE] Expose on-heap and off-heap memory usage in various places
## What changes were proposed in this pull request?

With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), Spark supports persisting data into off-heap memory, but the usage of on-heap and off-heap memory is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:
1. Spark UI's executor page will display both on-heap and off-heap memory usage.
2. REST request returns both on-heap and off-heap memory.
3. Also this can be gotten from MetricsSystem.
4. Last this usage can be obtained programmatically from SparkListener.

Attach the UI changes:

![screen shot 2016-08-12 at 11 20 44 am](https://cloud.githubusercontent.com/assets/850797/17612032/6c2f4480-607f-11e6-82e8-a27fb8cbb4ae.png)

Backward compatibility is also considered for event-log and REST API. Old event log can still be replayed with off-heap usage displayed as 0. For REST API, only adds the new fields, so JSON backward compatibility can still be kept.
## How was this patch tested?

Unit test added and manual verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #14617 from jerryshao/SPARK-17019.
2017-04-06 13:23:54 -05:00
sethah e8810b73c4
[SPARK-17471][ML] Add compressed method to ML matrices
## What changes were proposed in this pull request?

This patch adds a `compressed` method to ML `Matrix` class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a `columnMajor: Boolean` parameter.

The compressed implementation relies upon two new abstract methods `toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar to the compressed method implemented in the `Vector` class. These methods also allow the layout of the resulting matrix to be specified via the `columnMajor` parameter. More detail on the new methods is given below.
## How was this patch tested?

Added many new unit tests
## New methods (summary, not exhaustive list)

**Matrix trait**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` (abstract) - converts the matrix (either sparse or dense) to dense format
- `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` (abstract) -  converts the matrix (either sparse or dense) to sparse format
- `def toDense: DenseMatrix = toDense(true)`  - converts the matrix (either sparse or dense) to dense format in column major layout
- `def toSparse: SparseMatrix = toSparse(true)` -  converts the matrix (either sparse or dense) to sparse format in column major layout
- `def compressed: Matrix` - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
- `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space representation of this matrix considering only column OR row major, and converts it

**DenseMatrix class**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
- `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the dense matrix to sparse matrix, using the specified layout

**SparseMatrix class**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the sparse matrix to a dense matrix, using the specified layout
- `private[ml] def toSparseMatrix(columnMajors: Boolean): SparseMatrix` - converts the sparse matrix to sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the layout requested does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15628 from sethah/matrix_compress.
2017-03-24 20:32:42 +00:00
Eric Liang 8e558041aa [SPARK-19820][CORE] Add interface to kill tasks w/ a reason
This commit adds a killTaskAttempt method to SparkContext, to allow users to
kill tasks so that they can be re-scheduled elsewhere.

This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.

Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.

cc rxin

In the stage overview UI the reasons are summarized:
![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)

Within the stage UI you can see individual task kill reasons:
![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)

Existing tests, tried killing some stages in the UI and verified the messages are as expected.

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekl@google.com>

Closes #17166 from ericl/kill-reason.
2017-03-23 23:30:44 -07:00
Tyson Condie 746a558de2 [SPARK-19876][SS][WIP] OneTime Trigger Executor
## What changes were proposed in this pull request?

An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.

In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.

## How was this patch tested?

A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.

In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
- The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
- The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
- A OneTime trigger execution that results in an exception being thrown.

marmbrus tdas zsxwing

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17219 from tcondie/stream-commit.
2017-03-23 14:32:05 -07:00
Shixiong Zhu 029e40b412 [SPARK-19874][BUILD] Hide API docs for org.apache.spark.sql.internal
## What changes were proposed in this pull request?

The API docs should not include the "org.apache.spark.sql.internal" package because they are internal private APIs.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #17217 from zsxwing/SPARK-19874.
2017-03-08 23:15:52 -08:00
VinceShieh 4a9034b173 [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR, String Indexer only supports "skip"/"error" options to deal with unseen records.
But those unseen records might still be useful and user would like to keep the unseen labels in
certain use cases, This PR enables StringIndexer to support keeping unseen labels as
indices [numLabels].

'''Before
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
'''After
support the third option "keep"
StringIndexer().setHandleInvalid("keep")

## How was this patch tested?
Test added in StringIndexerSuite

Signed-off-by: VinceShieh <vincent.xieintel.com>
(Please fill in changes proposed in this fix)

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
2017-03-07 11:24:20 -08:00
Imran Rashid 8417a7ae6c [SPARK-19276][CORE] Fetch Failure handling robust to user error handling
## What changes were proposed in this pull request?

Fault-tolerance in spark requires special handling of shuffle fetch
failures.  The Executor would catch FetchFailedException and send a
special msg back to the driver.

However, intervening user code could intercept that exception, and wrap
it with something else.  This even happens in SparkSQL.  So rather than
checking the thrown exception only, we'll store the fetch failure directly
in the TaskContext, where users can't touch it.

## How was this patch tested?

Added a test case which failed before the fix.  Full test suite via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #16639 from squito/SPARK-19276.
2017-03-02 16:46:01 -08:00
Marcelo Vanzin 17d83e1ee5 [SPARK-19652][UI] Do auth checks for REST API access.
The REST API has a security filter that performs auth checks
based on the UI root's security manager. That works fine when
the UI root is the app's UI, but not when it's the history server.

In the SHS case, all users would be allowed to see all applications
through the REST API, even if the UI itself wouldn't be available
to them.

This change adds auth checks for each app access through the API
too, so that only authorized users can see the app's data.

The change also modifies the existing security filter to use
`HttpServletRequest.getRemoteUser()`, which is used in other
places. That is not necessarily the same as the principal's
name; for example, when using Hadoop's SPNEGO auth filter,
the remote user strips the realm information, which then matches
the user name registered as the owner of the application.

I also renamed the UIRootFromServletContext trait to a more generic
name since I'm using it to store more context information now.

Tested manually with an authentication filter enabled.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16978 from vanzin/SPARK-19652.
2017-02-21 16:14:34 -08:00
Sean Owen df3cbe3a33
[SPARK-19550][BUILD][WIP] Addendum: select Java 1.7 for scalac 2.10, still
## What changes were proposed in this pull request?

Go back to selecting source/target 1.7 for Scala 2.10 builds, because the SBT-based build for 2.10 won't work otherwise.

## How was this patch tested?

Existing tests, but, we need to verify this vs what the SBT build would exactly run on Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #16983 from srowen/SPARK-19550.3.
2017-02-19 04:24:11 -08:00
Sean Owen 0e2405490f
[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.
2017-02-16 12:32:45 +00:00
Sean Owen e8d3fca450
[SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier
## What changes were proposed in this pull request?

- Remove support for Hadoop 2.5 and earlier
- Remove reflection and code constructs only needed to support multiple versions at once
- Update docs to reflect newer versions
- Remove older versions' builds and profiles.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16810 from srowen/SPARK-19464.
2017-02-08 12:20:07 +00:00
Bryan Cutler 57d70d26c8 [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
## What changes were proposed in this pull request?

Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
2017-01-31 15:42:36 -08:00
Parag Chaudhari e20d9b1565 [SPARK-19069][CORE] Expose task 'status' and 'duration' in spark history server REST API.
## What changes were proposed in this pull request?

Although Spark history server UI shows task ‘status’ and ‘duration’ fields, it does not expose these fields in the REST API response. For the Spark history server API users, it is not possible to determine task status and duration. Spark history server has access to task status and duration from event log, but it is not exposing these in API. This patch is proposed to expose task ‘status’ and ‘duration’ fields in Spark history server REST API.

## How was this patch tested?

Modified existing test cases in org.apache.spark.deploy.history.HistoryServerSuite.

Author: Parag Chaudhari <paragpc@amazon.com>

Closes #16473 from paragpc/expose_task_status.
2017-01-20 10:49:05 -06:00
Zheng RuiFeng 8ccca9170f [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary
## What changes were proposed in this pull request?

add loglikelihood in GMM.summary

## How was this patch tested?

added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #12064 from zhengruifeng/gmm_metric.
2017-01-19 03:46:37 -08:00
Wenchen Fan 18ee55dd5d [SPARK-19148][SQL] do not expose the external table concept in Catalog
## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path.

This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options.

## How was this patch tested?

new tests in `CatalogSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16528 from cloud-fan/create-table.
2017-01-17 12:54:50 +08:00
saturday_s ce99f51d2e [SPARK-18537][WEB UI] Add a REST api to serve spark streaming information
## What changes were proposed in this pull request?

This PR is an inheritance from #16000, and is a completion of #15904.

**Description**

- Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
- Retrieve the streaming information through StreamingJobProgressListener.

> this api should cover exceptly the same amount of information as you can get from the web interface
> the implementation is base on the current REST implementation of spark-core
> and will be available for running applications only
>
> https://issues.apache.org/jira/browse/SPARK-18537

## How was this patch tested?

Local test.

Author: saturday_s <shi.indetail@gmail.com>
Author: Chan Chor Pang <ChorPang.Chan@access-company.com>
Author: peterCPChan <universknight@gmail.com>

Closes #16253 from saturday-shi/SPARK-18537.
2016-12-22 12:51:37 -08:00
gatorsmile 24c0c94128 [SPARK-18949][SQL] Add recoverPartitions API to Catalog
### What changes were proposed in this pull request?

Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and update the catalog. `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, very hard for me to remember `MSCK` and have no clue what it means)

After the new "Scalable Partition Handling", the table repair becomes much more important for making visible the data in the created data source partitioned table.

Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
```Scala
spark.catalog.recoverPartitions("testTable")
```

### How was this patch tested?
Modified the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16356 from gatorsmile/repairTable.
2016-12-20 23:40:02 -08:00
hyukjinkwon c6b8eb71a9
[SPARK-18842][TESTS][LAUNCHER] De-duplicate paths in classpaths in commands for local-cluster mode to work around the path length limitation on Windows
## What changes were proposed in this pull request?

Currently, some tests are being failed and hanging on Windows due to this problem. For the reason in SPARK-18718, some tests using `local-cluster` mode were disabled on Windows due to the length limitation by paths given to classpaths.

The limitation seems roughly 32K (see the [blog in MS](https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/) and [another reference](https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows)) but in `local-cluster` mode, executors were being launched as processes with the command such as [here](https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea) in (only) tests.

This length is roughly 40K due to the classpaths given to `java` command. However, it seems duplicates are almost half of them. So, if we deduplicate the paths, it seems reduced to roughly 20K with the command, [here](https://gist.github.com/HyukjinKwon/dad0c8db897e5e094684a2dc6a417790).

Maybe, we should consider as some more paths are added in the future but it seems better than disabling all the tests for now with minimised changes.

Therefore, this PR proposes to deduplicate the paths in classpaths in case of launching executors as processes in `local-cluster` mode.

## How was this patch tested?

Existing tests in `ShuffleSuite` and `BroadcastJoinSuite` manually via AppVeyor

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16266 from HyukjinKwon/disable-local-cluster-tests.
2016-12-14 19:24:24 +00:00
Weiqing Yang 9338aa4f89
[SPARK-18697][BUILD] Upgrade sbt plugins
## What changes were proposed in this pull request?

This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
```
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1
org.ow2.asm/asm-commons: 5.0.3 -> 5.1
```
## How was this patch tested?
Pass the Jenkins build.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16223 from weiqingy/SPARK_18697.
2016-12-09 14:13:01 +08:00
Michael Armbrust 70b2bf717d [SPARK-18754][SS] Rename recentProgresses to recentProgress
Based on an informal survey, users find this option easier to understand / remember.

Author: Michael Armbrust <michael@databricks.com>

Closes #16182 from marmbrus/renameRecentProgress.
2016-12-07 15:36:29 -08:00
Sean Owen 4cc8d8906d
Revert "[SPARK-18697][BUILD] Upgrade sbt plugins"
This reverts commit 7f31d378c4.
2016-12-07 09:03:20 +08:00
Weiqing Yang 7f31d378c4
[SPARK-18697][BUILD] Upgrade sbt plugins
## What changes were proposed in this pull request?

This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
```
sbt-assembly: 0.11.2 -> 0.14.3
sbteclipse-plugin: 4.0.0 -> 5.0.1
sbt-mima-plugin: 0.1.11 -> 0.1.12
org.ow2.asm/asm: 5.0.3 -> 5.1
org.ow2.asm/asm-commons: 5.0.3 -> 5.1
```
All other plugins are up-to-date.

## How was this patch tested?
Pass the Jenkins build.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16159 from weiqingy/SPARK-18697.
2016-12-07 06:11:39 +08:00
Tathagata Das bb57bfe97d [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
## What changes were proposed in this pull request?
Here are the major changes in this PR.
- Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
- Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
- Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
- Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.

Implementation details
- Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
- Added the `id` as the new `StreamMetadata`.
- When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
- All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`

TODO
- [x] Test handling of name=null in json generation of StreamingQueryProgress
- [x] Test handling of name=null in json generation of StreamingQueryListener events
- [x] Test python API of runId

## How was this patch tested?
Updated unit tests and new unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16113 from tdas/SPARK-18657.
2016-12-05 18:17:38 -08:00
Shixiong Zhu 246012859f [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException
## What changes were proposed in this pull request?

- Add StreamingQuery.explain and exception to Python.
- Fix StreamingQueryException to not expose `OffsetSeq`.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16125 from zsxwing/py-streaming-explain.
2016-12-05 11:36:11 -08:00
Weiqing Yang 576197320c
[SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven plugins
## What changes were proposed in this pull request?
This PR is to upgrade:
```
   sbt: 0.13.11 -> 0.13.13,
   zinc: 0.3.9 -> 0.3.11,
   maven-assembly-plugin: 2.6 -> 3.0.0
   maven-compiler-plugin: 3.5.1 -> 3.6.
   maven-jar-plugin: 2.6 -> 3.0.2
   maven-javadoc-plugin: 2.10.3 -> 2.10.4
   maven-source-plugin: 2.4 -> 3.0.1
   org.codehaus.mojo:build-helper-maven-plugin: 1.10 -> 1.12
   org.codehaus.mojo:exec-maven-plugin: 1.4.0 -> 1.5.0
```

The sbt release notes since the last version we used are: [v0.13.12](https://github.com/sbt/sbt/releases/tag/v0.13.12)  and [v0.13.13 ](https://github.com/sbt/sbt/releases/tag/v0.13.13).

## How was this patch tested?
Pass build and the existing tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16069 from weiqingy/SPARK-18638.
2016-12-03 10:36:19 +00:00
Reynold Xin c7c7265950 [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16126 from rxin/SPARK-18695.
2016-12-02 21:09:37 -08:00
Reynold Xin d3c90b74ed [SPARK-18663][SQL] Simplify CountMinSketch aggregate implementation
## What changes were proposed in this pull request?
SPARK-18429 introduced count-min sketch aggregate function for SQL, but the implementation and testing is more complicated than needed. This simplifies the test cases and removes support for data types that don't have clear equality semantics:

1. Removed support for floating point and decimal types.

2. Removed the heavy randomized tests. The underlying CountMinSketch implementation already had pretty good test coverage through randomized tests, and the SPARK-18429 implementation is just to add an aggregate function wrapper around CountMinSketch. There is no need for randomized tests at three different levels of the implementations.

## How was this patch tested?
A lot of the change is to simplify test cases.

Author: Reynold Xin <rxin@databricks.com>

Closes #16093 from rxin/SPARK-18663.
2016-12-01 21:38:52 -08:00
Tathagata Das c3d08e2f29 [SPARK-18516][SQL] Split state and progress in streaming
This PR separates the status of a `StreamingQuery` into two separate APIs:
 - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
 - `recentProgress` - an array of statistics about the most recent microbatches that have executed.

A recent progress contains the following information:
```
{
  "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
  "name" : "query-29",
  "timestamp" : 1479705392724,
  "inputRowsPerSecond" : 230.76923076923077,
  "processedRowsPerSecond" : 10.869565217391303,
  "durationMs" : {
    "triggerExecution" : 276,
    "queryPlanning" : 3,
    "getBatch" : 5,
    "getOffset" : 3,
    "addBatch" : 234,
    "walCommit" : 30
  },
  "currentWatermark" : 0,
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-14]]",
    "startOffset" : {
      "topic-14" : {
        "2" : 0,
        "4" : 1,
        "1" : 0,
        "3" : 0,
        "0" : 0
      }
    },
    "endOffset" : {
      "topic-14" : {
        "2" : 1,
        "4" : 2,
        "1" : 0,
        "3" : 0,
        "0" : 1
      }
    },
    "numRecords" : 3,
    "inputRowsPerSecond" : 230.76923076923077,
    "processedRowsPerSecond" : 10.869565217391303
  } ]
}
```

Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique with in the JVM to a `UUID` that is globally unique.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #15954 from marmbrus/queryProgress.
2016-11-29 17:24:17 -08:00
Yanbo Liang c4a7eef0ce [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15913 from yanboliang/spark-18481.
2016-11-26 05:28:41 -08:00
hyukjinkwon 51b1c1551d
[SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before.

This PR roughly fixes several things as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
  [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
  ```

- Fix an exception annotation and remove code backticks in `throws` annotation

  Currently, sbt unidoc with Java 8 complains as below:

  ```
  [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
  [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
  ```

  `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).

- Fix `[[http..]]` to `<a href="http..."></a>`.

  ```diff
  -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
  -   * blog page]].
  +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
  +   * Oracle blog page</a>.
  ```

   `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.

- It seems class can't have `return` annotation. So, two cases of this were removed.

  ```
  [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
  [error]    * return New instance of IsotonicRegression.
  ```

- Fix < to `&lt;` and > to `&gt;` according to HTML rules.

- Fix `</p>` complaint

- Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`.

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15999 from HyukjinKwon/SPARK-3359-errors.
2016-11-25 11:27:07 +00:00
hyukjinkwon c528812ce7
[SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc
## What changes were proposed in this pull request?

This PR proposes/fixes two things.

- Remove many errors to generate javadoc with Java8 from unrecognisable tags, `tparam` and `group`.

  ```
  [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
  [error]   /** group setParam */
  [error]       ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
  [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
  [error]    ^
  ...
  ```

  It does not fully resolve the problem but remove many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them pretty in javadoc in a way of `example` here because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).

- Print `example` in javadoc.
  Currently, there are few `example` tag in several places.

  ```
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
  ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
  ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
  ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
  ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
  ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
  ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
  ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
  ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
  ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
  ```

**Before**

  <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">

**After**
  <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png">

## How was this patch tested?

Maunally tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Note: this does not make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
2016-11-20 09:52:03 +00:00
hyukjinkwon d5b1d5fc80
[SPARK-18445][BUILD][DOCS] Fix the markdown for Note:/NOTE:/Note that/'''Note:''' across Scala/Java API documentation
## What changes were proposed in this pull request?

It seems in Scala/Java,

- `Note:`
- `NOTE:`
- `Note that`
- `'''Note:'''`
- `note`

This PR proposes to fix those to `note` to be consistent.

**Before**

- Scala
  ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)

- Java
  ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)

**After**

- Scala
  ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)

- Java
  ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)

## How was this patch tested?

The notes were found via

```bash
grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note that " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// Note: " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

```bash
grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
grep -E '.scala|.java' | \ # java/scala files
grep -v Suite | \ # exclude tests
grep -v Test | \ # exclude tests
grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
-e 'org.apache.spark.api.java.function' \
-e 'org.apache.spark.api.r' \
...
```

And then fixed one by one comparing with API documentation/access modifiers.

After that, manually tested via `jekyll build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15889 from HyukjinKwon/SPARK-18437.
2016-11-19 11:24:15 +00:00
Josh Rosen 3a710b94b0 [SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?

When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks.

- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty.
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object.
- **String.intern() in JSONProtocol** (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/).

## How was this patch tested?

I ran

```
sc.parallelize(1 to 100000, 100000).count()
```

in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects):

![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png)

Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling):

![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png)

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15743 from JoshRosen/spark-ui-memory-usage.
2016-11-07 16:14:19 -08:00
Shixiong Zhu 7d10631c16 [SPARK-18104][DOC] Don't build KafkaSource doc
## What changes were proposed in this pull request?

Don't need to build doc for KafkaSource because the user should use the data source APIs to use KafkaSource. All KafkaSource APIs are internal.

## How was this patch tested?

Verified manually.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15630 from zsxwing/kafka-unidoc.
2016-10-26 11:16:20 -07:00
Josh Rosen b3b4b95422 [SPARK-18034] Upgrade to MiMa 0.1.11 to fix flakiness
We should upgrade to the latest release of MiMa (0.1.11) in order to include a fix for a bug which led to flakiness in the MiMa checks (https://github.com/typesafehub/migration-manager/issues/115).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15571 from JoshRosen/SPARK-18034.
2016-10-21 11:25:01 -07:00
Tathagata Das 941b3f9aca [SPARK-17731][SQL][STREAMING][FOLLOWUP] Refactored StreamingQueryListener APIs
## What changes were proposed in this pull request?

As per rxin request, here are further API changes
- Changed `Stream(Started/Progress/Terminated)` events to `Stream*Event`
- Changed the fields in `StreamingQueryListener.on***` from `query*` to `event`

## How was this patch tested?
Existing unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15530 from tdas/SPARK-17731-1.
2016-10-18 17:32:16 -07:00
Tathagata Das 7106866c22 [SPARK-17731][SQL][STREAMING] Metrics for structured streaming
## What changes were proposed in this pull request?

Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing

Specifically, this PR adds the following public APIs changes.

### New APIs
- `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)

- `StreamingQueryStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
  - processingRate - Current rate (rows/sec) at which the query is processing data from
                                  all the sources
  - ~~outputRate~~ - *Does not work with wholestage codegen*
  - latency - Current average latency between the data being available in source and the sink writing the corresponding output
  - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
  - sinkStatus: SinkStatus - Current status of the sink
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    - latencies - getOffset, getBatch, full trigger, wal writes
    - timestamps - trigger start, finish, after getOffset, after getBatch
    - numRows - input, output, state total/updated rows for aggregations

- `SourceStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by the source
  - processingRate - Current rate (rows/sec) at which the query is processing data from the source
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger

- Python API for `StreamingQuery.status()`

### Breaking changes to existing APIs
**Existing direct public facing APIs**
- Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
  - Branch 2.0 should have it deprecated, master should have it removed.

**Existing advanced listener APIs**
- `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
   - Earlier StreamingQueryInfo was used only in the advanced listener API, but now it is used in direct public-facing API (StreamingQuery.status)

- Field `queryInfo` in listener events `QueryStarted`, `QueryProgress`, `QueryTerminated` changed have name `queryStatus` and return type `StreamingQueryStatus`.

- Field `offsetDesc` in `SourceStatus` was Option[String], converted it to `String`.

- For `SourceStatus` and `SinkStatus` made constructor private instead of private[sql] to make them more java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` which are harder to accidentally use in Java.

## How was this patch tested?

Old and new unit tests.
- Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
- New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
- New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
- Source-specific tests for making sure input rows are counted are is source-specific test suites.
- Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.

Metrics also manually tested using Ganglia sink

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15307 from tdas/SPARK-17731.
2016-10-13 13:36:26 -07:00
Wenchen Fan 7388ad94d7 [SPARK-17338][SQL][FOLLOW-UP] add global temp view
## What changes were proposed in this pull request?

address post hoc review comments for https://github.com/apache/spark/pull/14897

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15424 from cloud-fan/global-temp-view.
2016-10-11 15:21:28 +08:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database `global_temp`(configurable via SparkConf), and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.

changes for `SessionCatalog`:

1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00
Sean Owen 8e8de0073d
[SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications
## What changes were proposed in this pull request?

Return Iterator of applications internally in history server, for consistency and performance. See https://github.com/apache/spark/pull/15248 for some back-story.

The code called by and calling HistoryServer.getApplicationList wants an Iterator, but this method materializes an Iterable, which potentially causes a performance problem. It's simpler too to make this internal method also pass through an Iterator.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15321 from srowen/SPARK-17671.
2016-10-04 10:29:22 +01:00
Herman van Hovell af6ece33d3 [SPARK-17717][SQL] Add Exist/find methods to Catalog [FOLLOW-UP]
## What changes were proposed in this pull request?
We added find and exists methods for Databases, Tables and Functions to the user facing Catalog in PR https://github.com/apache/spark/pull/15301. However, it was brought up that the semantics of the  `find` methods are more in line a `get` method (get an object or else fail). So we rename these in this PR.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15308 from hvanhovell/SPARK-17717-2.
2016-10-01 00:50:16 -07:00
Herman van Hovell 74ac1c4381 [SPARK-17717][SQL] Add exist/find methods to Catalog.
## What changes were proposed in this pull request?
The current user facing catalog does not implement methods for checking object existence or finding objects. You could theoretically do this using the `list*` commands, but this is rather cumbersome and can actually be costly when there are many objects. This PR adds `exists*` and `find*` methods for Databases, Table and Functions.

## How was this patch tested?
Added tests to `org.apache.spark.sql.internal.CatalogSuite`

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15301 from hvanhovell/SPARK-17717.
2016-09-29 17:56:32 -07:00
Yanbo Liang f7082ac125 [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.
## What changes were proposed in this pull request?
Several performance improvement for ```ChiSqSelector```:
1, Keep ```selectedFeatures``` ordered ascendent.
```ChiSqSelectorModel.transform``` need ```selectedFeatures``` ordered to make prediction. We should sort it when training model rather than making prediction, since users usually train model once and use the model to do prediction multiple times.
2, When training ```fpr``` type ```ChiSqSelectorModel```, it's not necessary to sort the ChiSq test result by statistic.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15277 from yanboliang/spark-17704.
2016-09-29 04:30:42 -07:00
jisookim 90a30f4634 [SPARK-12221] add cpu time to metrics
Currently task metrics don't support executor CPU time, so there's no way to calculate how much CPU time a stage/task took from History Server metrics. This PR enables reporting CPU time.

Author: jisookim <jisookim0513@gmail.com>

Closes #10212 from jisookim0513/add-cpu-time-metric.
2016-09-23 13:43:47 -07:00
Marcelo Vanzin a4aeb7677b [SPARK-17639][BUILD] Add jce.jar to buildclasspath when building.
This was missing, preventing code that uses javax.crypto to properly
compile in Spark.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15204 from vanzin/SPARK-17639.
2016-09-22 21:35:25 -07:00
Gayathri Murali f4f6bd8c98 [SPARK-16240][ML] ML persistence backward compatibility for LDA
## What changes were proposed in this pull request?

Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.

## How was this patch tested?

I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x).  In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15034 from jkbradley/lda-backwards.
2016-09-22 16:34:42 -07:00
Dhruve Ashar 17b72d31e0 [SPARK-17365][CORE] Remove/Kill multiple executors together to reduce RPC call time.
## What changes were proposed in this pull request?
We are killing multiple executors together instead of iterating over expensive RPC calls to kill single executor.

## How was this patch tested?
Executed sample spark job to observe executors being killed/removed with dynamic allocation enabled.

Author: Dhruve Ashar <dashar@yahoo-inc.com>
Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15152 from dhruve/impr/SPARK-17365.
2016-09-22 10:10:37 -07:00
Peng, Meng b366f18496
[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

## How was this patch tested?

Add Scala ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #14597 from mpjlu/fprChiSquare.
2016-09-21 10:17:38 +01:00
sethah 26145a5af9 [SPARK-17163][ML] Unified LogisticRegression interface
## What changes were proposed in this pull request?

Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`.

Marked as WIP because we should discuss the coefficients API in the model. See discussion below.

JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163)

## How was this patch tested?

Merged test suites and added some new unit tests.

## Design

### Switching between binomial and multinomial

We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error.

### coefficients/intercept model API (TODO)

This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points:

* We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double`
* There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts.

Some options:

1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time.

2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy.

If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions.

**Update:** We have chosen option 2

### Threshold/thresholds (TODO)

Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA.

**Update:** Let's leave it for a follow up PR

## Follow up

* Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
* Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543)

Author: sethah <seth.hendrickson16@gmail.com>

Closes #14834 from sethah/SPARK-17163.
2016-09-19 21:33:54 -07:00
Sean Owen 2ad2769548 [SPARK-17406][BUILD][HOTFIX] MiMa excludes fix
## What changes were proposed in this pull request?

Following https://github.com/apache/spark/pull/14969 for some reason the MiMa excludes weren't complete, but still passed the PR builder. This adds 3 more excludes from https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.2/1749/consoleFull

It also moves the excludes to their own Seq in the build, as they probably should have been.
Even though this is merged to 2.1.x only / master, I left the exclude in for 2.0.x in case we back port. It's a private API so is always a false positive.

## How was this patch tested?

Jenkins build

Author: Sean Owen <sowen@cloudera.com>

Closes #15110 from srowen/SPARK-17406.2.
2016-09-15 13:54:41 +01:00
cenyuhai ad79fc0a84 [SPARK-17406][WEB UI] limit timeline executor events
## What changes were proposed in this pull request?
The job page will be too slow to open when there are thousands of executor events(added or removed). I found that in ExecutorsTab file, executorIdToData will not remove elements, it will increase all the time.Before this pr, it looks like [timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png). After this pr, it looks like [timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png)(we can set how many executor events will be displayed)

Author: cenyuhai <cenyuhai@didichuxing.com>

Closes #14969 from cenyuhai/SPARK-17406.
2016-09-15 09:58:53 +01:00
Josh Rosen 7c51b99a42 [SPARK-14818] Post-2.0 MiMa exclusion and build changes
This patch makes a handful of post-Spark-2.0 MiMa exclusion and build updates. It should be merged to master and a subset of it should be picked into branch-2.0 in order to test Spark 2.0.1-SNAPSHOT.

- Remove the ` sketch`, `mllibLocal`, and `streamingKafka010` from the list of excluded subprojects so that MiMa checks them.
- Remove now-unnecessary special-case handling of the Kafka 0.8 artifact in `mimaSettings`.
- Move the exclusion added in SPARK-14743 from `v20excludes` to `v21excludes`, since that patch was only merged into master and not branch-2.0.
- Add exclusions for an API change introduced by SPARK-17096 / #14675.
- Add missing exclusions for the `o.a.spark.internal` and `o.a.spark.sql.internal` packages.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15061 from JoshRosen/post-2.0-mima-changes.
2016-09-12 15:24:33 -07:00
Shivansh e75c162e9e [SPARK-17308] Improved the spark core code by replacing all pattern match on boolean value by if/else block.
## What changes were proposed in this pull request?
Improved the code quality of spark by replacing all pattern match on boolean value by if/else block.

## How was this patch tested?

By running the tests

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14873 from shiv4nsh/SPARK-17308.
2016-09-04 12:39:26 +01:00
Michael Gummelt 8e5475be3c [SPARK-16967] move mesos to module
## What changes were proposed in this pull request?

Move Mesos code into a mvn module

## How was this patch tested?

unit tests
manually submitting a client mode and cluster mode job
spark/mesos integration test suite

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14637 from mgummelt/mesos-module.
2016-08-26 12:25:22 -07:00
jerryshao ab648c0004 [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN
## What changes were proposed in this pull request?

Add a configurable token manager for Spark on running on yarn.

### Current Problems ###

1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes.
2. Also this problem exits in timely token renewer and updater.

### Changes In This Proposal ###

In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes:

1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded.

### Behavior Changes ###

For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).

For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations:

1. `spark.yarn.security.tokens.test.enabled` to true
2. `spark.yarn.security.tokens.test.class` to the full qualified class name.

So we still keep the same semantics as current code while add one new configuration.

### Current Status ###

- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.

## How was this patch tested?

Unit test and integrated test.

Please suggest and review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>

Closes #14065 from jerryshao/SPARK-16342.
2016-08-10 15:39:30 -07:00
Sean Zhong 9d7a47406e [SPARK-16853][SQL] fixes encoder error in DataSet typed select
## What changes were proposed in this pull request?

For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.

### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```

### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14474 from clockfly/SPARK-16853.
2016-08-04 19:45:47 +08:00
petermaxlee c9a6762150 [SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter
## What changes were proposed in this pull request?
It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement unhandledFilters method in HadoopFsRelation.

This is based on rxin's patch (#13901) and adds unit tests.

## How was this patch tested?
Added a new suite FiltersSuite.

Author: petermaxlee <petermaxlee@gmail.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #14120 from petermaxlee/SPARK-16199.
2016-07-11 22:23:32 -07:00
Reynold Xin 52b5bb0b7f [SPARK-16476] Restructure MimaExcludes for easier union excludes
## What changes were proposed in this pull request?
It is currently fairly difficult to have proper mima excludes when we cut a version branch. I'm proposing a small change to take the exclude list out of the exclude function, and put it in a variable so we can easily union excludes.

After this change, we can bump pom.xml version to 2.1.0-SNAPSHOT, without bumping the diff base version. Note that I also deleted all the exclude rules for version 1.x, to cut down the size of the file.

## How was this patch tested?
N/A - this is a build infra change.

Author: Reynold Xin <rxin@databricks.com>

Closes #14128 from rxin/SPARK-16476.
2016-07-10 22:05:16 -07:00
Eric Liang 44c7c62bcf [SPARK-16021] Fill freed memory in test to help catch correctness bugs
## What changes were proposed in this pull request?

This patches `MemoryAllocator` to fill clean and freed memory with known byte values, similar to https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug . Memory filling is flag-enabled in test only by default.

## How was this patch tested?

Unit test that it's on in test.

cc sameeragarwal

Author: Eric Liang <ekl@databricks.com>

Closes #13983 from ericl/spark-16021.
2016-07-06 16:30:25 -07:00
cody koeninger 1f0d021308 [SPARK-16359][STREAMING][KAFKA] unidoc skip kafka 0.10
## What changes were proposed in this pull request?
during sbt unidoc task, skip the streamingKafka010 subproject and filter kafka 0.10 classes from the classpath, so that at least existing kafka 0.8 doc can be included in unidoc without error

## How was this patch tested?
sbt spark/scalaunidoc:doc | grep -i error

Author: cody koeninger <cody@koeninger.org>

Closes #14041 from koeninger/SPARK-16359.
2016-07-05 16:44:15 -07:00
Michael Allman 7dbffcdd6d [SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc
Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-16353

## What changes were proposed in this pull request?

The javadoc options for the java unidoc generation are ignored when generating the java unidoc. For example, the generated `index.html` has the wrong HTML page title. This can be seen at http://spark.apache.org/docs/latest/api/java/index.html.

I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`.

## How was this patch tested?

I ran `docs/jekyll build` and verified that the java unidoc `index.html` has the correct HTML page title.

Author: Michael Allman <michael@videoamp.com>

Closes #14031 from mallman/spark-16353.
2016-07-04 21:16:17 +01:00
cody koeninger dedbceec1e [SPARK-12177][STREAMING][KAFKA] Update KafkaDStreams to new Kafka 0.10 Consumer API
## What changes were proposed in this pull request?

New Kafka consumer api for the released 0.10 version of Kafka

## How was this patch tested?

Unit tests, manual tests

Author: cody koeninger <cody@koeninger.org>

Closes #11863 from koeninger/kafka-0.9.
2016-06-29 23:21:03 -07:00
Dongjoon Hyun 50fdd866b5 [SPARK-16111][SQL][DOC] Hide SparkOrcNewRecordReader in API docs
## What changes were proposed in this pull request?

Currently, Spark Scala/Java API documents shows **org.apache.hadoop.hive.ql.io.orc** package at the top.

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.package
http://spark.apache.org/docs/2.0.0-preview/api/java/index.html

This PR hides `SparkOrcNewRecordReader` from API docs.

## How was this patch tested?

Manual. (`build/sbt unidoc`).

The following is the screenshot after this PR.

**Scala API doc**

![Scala API doc](https://app.box.com/representation/file_version_75673952621/image_2048/1.png?shared_name=2mdqydygs8le6q9x00356898662zjwz6)

**Java API doc**

![Java API doc](https://app.box.com/representation/file_version_75673951725/image_2048/1.png?shared_name=iv23eeqy3avvkqz203v9ygfaqeyml85j)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13914 from dongjoon-hyun/SPARK-16111.
2016-06-27 21:58:16 -07:00
Xiangrui Meng 857ecff1d8 [SPARK-16155][DOC] remove package grouping in Java docs
## What changes were proposed in this pull request?

In 1.4 and earlier releases, we have package grouping in the generated Java API docs. See http://spark.apache.org/docs/1.4.0/api/java/index.html. However, this disappeared in 1.5.0: http://spark.apache.org/docs/1.5.0/api/java/index.html.

Rather than fixing it, I'd suggest removing grouping. Because it might take some time to fix and it is a manual process to update the grouping in `SparkBuild.scala`. I didn't find anyone complaining about missing groups since 1.5.0 on Google.

Manually checked the generated Java API docs and confirmed that they are the same as in master.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13856 from mengxr/SPARK-16155.
2016-06-22 15:52:37 -07:00
Reynold Xin 5a52ba0f95 [SPARK-15851][BUILD] Fix the call of the bash script to enable proper run in Windows
## What changes were proposed in this pull request?
The way bash script `build/spark-build-info` is called from core/pom.xml prevents Spark building on Windows. Instead of calling the script directly we call bash and pass the script as an argument. This enables running it on Windows with bash installed which typically comes with Git.

This brings https://github.com/apache/spark/pull/13612 up-to-date and also addresses comments from the code review.

Closes #13612

## How was this patch tested?
I built manually (on a Mac) to verify it didn't break Mac compilation.

Author: Reynold Xin <rxin@databricks.com>
Author: avulanov <nashb@yandex.ru>

Closes #13691 from rxin/SPARK-15851.
2016-06-15 20:11:23 -07:00
Sean Zhong 6e8cdef0cf [SPARK-15914][SQL] Add deprecated method back to SQLContext for backward source code compatibility
## What changes were proposed in this pull request?

Revert partial changes in SPARK-12600, and add some deprecated method back to SQLContext for backward source code compatibility.

## How was this patch tested?

Manual test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13637 from clockfly/SPARK-15914.
2016-06-14 09:10:27 -07:00
Eric Liang 5bb4564cd4 [SPARK-15881] Update microbenchmark results for WideSchemaBenchmark
## What changes were proposed in this pull request?

These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run.

Author: Eric Liang <ekl@databricks.com>

Closes #13607 from ericl/sc-3538.
2016-06-11 15:26:08 -07:00
Josh Rosen f74b77713e [SPARK-15827][BUILD] Publish Spark's forked sbt-pom-reader to Maven Central
Spark's SBT build currently uses a fork of the sbt-pom-reader plugin but depends on that fork via a SBT subproject which is cloned from https://github.com/scrapcodes/sbt-pom-reader/tree/ignore_artifact_id. This unnecessarily slows down the initial build on fresh machines and is also risky because it risks a build breakage in case that GitHub repository ever changes or is deleted.

In order to address these issues, I have published a pre-built binary of our forked sbt-pom-reader plugin to Maven Central under the `org.spark-project` namespace and have updated Spark's build to use that artifact. This published artifact was built from https://github.com/JoshRosen/sbt-pom-reader/tree/v1.0.0-spark, which contains the contents of ScrapCodes's branch plus an additional patch to configure the build for artifact publication.

/cc srowen ScrapCodes for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13564 from JoshRosen/use-published-fork-of-pom-reader.
2016-06-09 11:04:08 -07:00
Dhruve Ashar fa4bc8ea8b [SPARK-14279][BUILD] Pick the spark version from pom
## What changes were proposed in this pull request?
Change the way spark picks up version information. Also embed the build information to better identify the spark version running.

More context can be found here : https://github.com/apache/spark/pull/12152

## How was this patch tested?
Ran the mvn and sbt builds to verify the version information was being displayed correctly on executing <code>spark-submit --version </code>

![image](https://cloud.githubusercontent.com/assets/7732317/15197251/f7c673a2-1795-11e6-8b2f-88f2a70cf1c1.png)

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #13061 from dhruve/impr/SPARK-14279.
2016-06-06 09:42:50 -07:00
Marcelo Vanzin 57adb77e6b [SPARK-15451][BUILD] Use jdk7's rt.jar when available.
This helps with preventing jdk8-specific calls being checked in,
because PR builders are running the compiler with the wrong settings.

If the JAVA_7_HOME env variable is set, assume it points at
a jdk7 and use its rt.jar when invoking javac. For zinc, just run
it with jdk7, and disable it when building jdk8-specific code.

A big note for sbt usage: adding the bootstrap options forces sbt
to fork the compiler, and that disables incremental compilation.
That means that it's really not convenient to use for normal
development, but should be ok for automated builds.

Tested with JAVA_HOME=jdk8 and JAVA_7_HOME=jdk7:
- mvn + zinc
- mvn sans zinc
- sbt

Verified that in all cases, jdk8-specific library calls fail to
compile.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13272 from vanzin/SPARK-15451.
2016-05-31 16:54:34 -07:00
DB Tsai 21b2605dc4 [SPARK-15413][ML][MLLIB] Change toBreeze to asBreeze in Vector and Matrix
## What changes were proposed in this pull request?

We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13198 from dbtsai/minor.
2016-05-27 14:02:39 -07:00
Yin Huai 3ac2363d75 [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13310 from yhuai/SPARK-15532.
2016-05-26 16:53:31 -07:00
Reynold Xin 361ebc282b [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.

They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat

Backward compatibility is maintained through aliasing.

## How was this patch tested?
Updated relevant test cases too.

Author: Reynold Xin <rxin@databricks.com>

Closes #13311 from rxin/SPARK-15543.
2016-05-25 23:54:24 -07:00
Herman van Hovell 527499b624 [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin
## What changes were proposed in this pull request?
The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).

This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.

## How was this patch tested?
Manually running SBT/Maven builds.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #13299 from hvanhovell/SPARK-15525.
2016-05-25 15:35:38 -07:00
Reynold Xin 45b7557e61 [SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a hivecontext-compatibility module
## What changes were proposed in this pull request?
I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class.

## How was this patch tested?
Tests were moved.

Author: Reynold Xin <rxin@databricks.com>

Closes #13207 from rxin/SPARK-15424.
2016-05-20 22:01:55 -07:00
Davies Liu 8fb1d1c7f3 [SPARK-15357] Cooperative spilling should check consumer memory mode
## What changes were proposed in this pull request?

Since we support forced spilling for Spillable, which only works in OnHeap mode, different from other SQL operators (could be OnHeap or OffHeap), we should considering the mode of consumer before calling trigger forced spilling.

## How was this patch tested?

Add new test.

Author: Davies Liu <davies@databricks.com>

Closes #13151 from davies/fix_mode.
2016-05-18 09:44:21 -07:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
cody koeninger 89e67d6667 [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact
## What changes were proposed in this pull request?
Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions

## How was this patch tested?
Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #12946 from koeninger/SPARK-15085.
2016-05-11 12:15:41 -07:00
hyukjinkwon 3ff012051f [SPARK-15250][SQL] Remove deprecated json API in DataFrameReader
## What changes were proposed in this pull request?

This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.

## How was this patch tested?

Jenkins tests (existing tests should cover this)

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #13040 from HyukjinKwon/SPARK-15250.
2016-05-10 22:21:17 -07:00
Sital Kedia a019e6efb7 [SPARK-14542][CORE] PipeRDD should allow configurable buffer size for…
## What changes were proposed in this pull request?

Currently PipedRDD internally uses PrintWriter to write data to the stdin of the piped process, which by default uses a BufferedWriter of buffer size 8k. In our experiment, we have seen that 8k buffer size is too small and the job spends significant amount of CPU time in system calls to copy the data. We should have a way to configure the buffer size for the writer.

## How was this patch tested?
Ran PipedRDDSuite tests.

Author: Sital Kedia <skedia@fb.com>

Closes #12309 from sitalkedia/bufferedPipedRDD.
2016-05-10 15:28:35 +01:00
Alex Bozarth c3e23bc0c3 [SPARK-10653][CORE] Remove unnecessary things from SparkEnv
## What changes were proposed in this pull request?

Removed blockTransferService and sparkFilesDir from SparkEnv since they're rarely used and don't need to be in stored in the env. Edited their few usages to accommodate the change.

## How was this patch tested?

ran dev/run-tests locally

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #12970 from ajbozarth/spark10653.
2016-05-09 11:51:37 -07:00
Luciano Resende a03c5e68ab [SPARK-14738][BUILD] Separate docker integration tests from main build
## What changes were proposed in this pull request?

Create a maven profile for executing the docker integration tests using maven
Remove docker integration tests from main sbt build
Update documentation on how to run docker integration tests from sbt

## How was this patch tested?

Manual test of the docker integration tests as in :
mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test

## Other comments

Note that the the DB2 Docker Tests are still disabled as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them on the right level before enabling those tests. They do run ok locally with the updates from PR #12348

Author: Luciano Resende <lresende@apache.org>

Closes #12508 from lresende/docker.
2016-05-06 12:25:45 +01:00
Luciano Resende 104430223e [SPARK-14589][SQL] Enhance DB2 JDBC Dialect docker tests
## What changes were proposed in this pull request?

Enhance the DB2 JDBC Dialect docker tests as they seemed to have had some issues on previous merge causing some tests to fail.

## How was this patch tested?

By running the integration tests locally.

Author: Luciano Resende <lresende@apache.org>

Closes #12348 from lresende/SPARK-14589.
2016-05-05 10:54:48 +01:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Jakob Odersky 7226e19067 [SPARK-14511][BUILD] Upgrade genjavadoc to latest upstream
## What changes were proposed in this pull request?
In the past, genjavadoc had issues with package private members which led the spark project to use a forked version. This issue has been fixed upstream (typesafehub/genjavadoc#70) and a release is available for scala versions 2.10, 2.11 **and 2.12**, hence a forked version for spark is no longer necessary.
This pull request updates the build configuration to use the newest upstream genjavadoc.

## How was this patch tested?
The build was run `sbt unidoc`. During the process javadoc emits some errors on the generated java stubs, however these errors were also present before the upgrade. Furthermore, the produced html is fine.

Author: Jakob Odersky <jakob@odersky.com>

Closes #12707 from jodersky/SPARK-14511-genjavadoc.
2016-04-29 10:10:20 +01:00