Commit graph

1724 commits

Author SHA1 Message Date
Xusen Yin b72bb62d42 [SPARK-16447][ML][SPARKR] LDA wrapper in SparkR
## What changes were proposed in this pull request?

Add LDA Wrapper in SparkR with the following interfaces:

- spark.lda(data, ...)

- spark.posterior(object, newData, ...)

- spark.perplexity(object, ...)

- summary(object)

- write.ml(object)

- read.ml(path)

## How was this patch tested?

Test with SparkR unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14229 from yinxusen/SPARK-16447.
2016-08-18 05:33:52 -07:00
Yanbo Liang 4d92af310a [SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR
## What changes were proposed in this pull request?
Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14392 from yanboliang/spark-16446.
2016-08-17 11:18:33 -07:00
wm624@hotmail.com 363793f2bf [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR
## What changes were proposed in this pull request?

Add Isotonic Regression wrapper in SparkR

Wrappers in R and Scala are added.
Unit tests
Documentation

## How was this patch tested?
Manually tested with sudo ./R/run-tests.sh

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14182 from wangmiao1981/isoR.
2016-08-17 06:15:04 -07:00
WeichenXu 3d8bfe7a39 [SPARK-16934][ML][MLLIB] Update LogisticCostAggregator serialization code to make it consistent with LinearRegression
## What changes were proposed in this pull request?

Update LogisticCostAggregator serialization code to make it consistent with #14109

## How was this patch tested?
MLlib 2.0:
![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
2016-08-15 06:38:30 -07:00
Yanbo Liang ddf0d1e3fe [TRIVIAL][ML] Fix LogisticRegression typo in error message.
## What changes were proposed in this pull request?
Fix ```LogisticRegression``` typo in error message.

## How was this patch tested?
Docs change, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14633 from yanboliang/lr-typo.
2016-08-15 10:11:29 +01:00
zero323 0ebf7c1bff [SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize
## What changes were proposed in this pull request?

Replaces the custom choose function with o.a.commons.math3.util.CombinatoricsUtils.binomialCoefficient
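
To illustrate the kind of overflow this change guards against, here is a minimal, self-contained Java sketch. It does not reproduce Spark's actual `getPolySize` code: `naiveChoose` and `exactChoose` are hypothetical names, and `Math.multiplyExact` is used here only as a stand-in for an overflow-aware binomial coefficient.

```java
final class PolySizeSketch {
    // Naive binomial coefficient: both products are computed in int,
    // so the numerator wraps around silently for moderately large inputs.
    static int naiveChoose(int n, int k) {
        int numerator = 1;
        int denominator = 1;
        for (int i = 0; i < k; i++) {
            numerator *= (n - i);
            denominator *= (i + 1);
        }
        return numerator / denominator;
    }

    // Multiplicative formula in long arithmetic. After step i the running
    // value equals C(n - k + i, i), so each division is exact, and
    // Math.multiplyExact throws ArithmeticException instead of wrapping.
    static long exactChoose(int n, int k) {
        long result = 1L;
        for (int i = 1; i <= k; i++) {
            result = Math.multiplyExact(result, (long) (n - k + i)) / i;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(naiveChoose(20, 10)); // wrong: the int numerator overflowed
        System.out.println(exactChoose(20, 10)); // 184756
    }
}
```

The point of the sketch is only that a silent wrap-around yields a plausible-looking but wrong size, whereas an overflow-aware routine either returns the exact value or fails loudly.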

## How was this patch tested?

Spark unit tests

Author: zero323 <zero323@users.noreply.github.com>

Closes #14614 from zero323/SPARK-17027.
2016-08-14 11:59:24 +01:00
Yanbo Liang bbae20ade1 [SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance
## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I observed a 20% performance improvement.
BTW, we should destroy broadcast variable ```compute``` at the end of each iteration.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14621 from yanboliang/spark-17033.
2016-08-12 10:06:17 -07:00
Yanbo Liang d4a9122430 [SPARK-16710][SPARKR][ML] spark.glm should support weightCol
## What changes were proposed in this pull request?
Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14346 from yanboliang/spark-16710.
2016-08-10 10:53:48 -07:00
Yanbo Liang 182e11904b [SPARK-16933][ML] Fix AFTAggregator in AFTSurvivalRegression serializes unnecessary data.
## What changes were proposed in this pull request?
Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109.

## How was this patch tested?
I tested this locally and verified the serialization reduction.

Before patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png)

After patch
![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14519 from yanboliang/spark-16933.
2016-08-09 03:39:57 -07:00
Holden Karau 9216901d52 [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting
## What changes were proposed in this pull request?

Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted, and audit existing whitelistings, removing postfix operators from most places. Some notable places where postfix operation remains are the XML parsing & time units (seconds, millis, etc.), where it arguably can improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.
2016-08-08 15:54:03 -07:00
sethah 1db1c6567b [SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary data
## What changes were proposed in this pull request?
Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization.

In https://github.com/apache/spark/pull/13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark.

## How was this patch tested?

**MLlib**
![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png)

**ML without patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png)

**ML with patch**
![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png)

Author: sethah <seth.hendrickson16@gmail.com>

Closes #14109 from sethah/LIR_serialize.
2016-08-08 00:00:15 -07:00
Yanbo Liang 6cbde337a5 [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests.
## What changes were proposed in this pull request?
This is a follow-up for #14378. When adding ```transformSchema``` to all estimators and transformers, I found that some tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved this part of the work into a separate PR, to make it clearer to review.
The corresponding tests should throw ```IllegalArgumentException``` during schema validation once ```transformSchema``` is added. It is more efficient to throw the exception at the start of ```fit``` or ```transform``` rather than partway through the process.
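
The fail-fast idea described above can be sketched in plain Java. This is only an illustration under simplifying assumptions: the schema is modelled as a map from column name to type name, and `SchemaCheckSketch`, `validateInputColumn` and `fit` are hypothetical names, not Spark's API.

```java
import java.util.Map;

final class SchemaCheckSketch {
    // Hypothetical schema check: the column must exist and have the expected type.
    static void validateInputColumn(Map<String, String> schema, String column, String expectedType) {
        String actualType = schema.get(column);
        if (actualType == null) {
            throw new IllegalArgumentException("Input column " + column + " does not exist.");
        }
        if (!actualType.equals(expectedType)) {
            throw new IllegalArgumentException(
                "Column " + column + " must be of type " + expectedType + " but was " + actualType + ".");
        }
    }

    // Validation runs first, so a bad schema is rejected before any costly work starts.
    static String fit(Map<String, String> schema, String inputCol) {
        validateInputColumn(schema, inputCol, "vector");
        return "fitted"; // placeholder for the expensive training step
    }
}
```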

## How was this patch tested?
Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14455 from yanboliang/transformSchema.
2016-08-05 22:07:59 +01:00
Zheng RuiFeng 0e2e5d7d0b [SPARK-16863][ML] ProbabilisticClassifier.fit check thresholds' length
## What changes were proposed in this pull request?

Add a thresholds length check for classifiers that extend ProbabilisticClassifier

## How was this patch tested?

unit tests and manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
2016-08-04 21:44:54 +01:00
WeichenXu 462784ffad [SPARK-16880][ML][MLLIB] make ann training data persisted if needed
## What changes were proposed in this pull request?

Make sure the ANN layer's input training data is persisted,
so that overhead is avoided if the RDD would otherwise need to be recomputed from lineage.

## How was this patch tested?

Existing Tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14483 from WeichenXu123/add_ann_persist_training_data.
2016-08-04 21:41:35 +01:00
Shuai Lin 36827ddafe [SPARK-16822][DOC] Support latex in scaladoc.
## What changes were proposed in this pull request?

Support using latex in scaladoc by adding MathJax javascript to the js template.

## How was this patch tested?

Generated scaladoc.  Preview:

- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)

- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
2016-08-02 09:14:08 -07:00
Zheng RuiFeng d9e0919d30 [SPARK-16851][ML] Incorrect thresholds length in 'setThresholds()' evokes Exception
## What changes were proposed in this pull request?
Add a length check for the thresholds in the `setThresholds()` method of classification models.

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14457 from zhengruifeng/check_setThresholds.
2016-08-02 07:22:41 -07:00
Shuai Lin 2a0de7dc99 [SPARK-16485][DOC][ML] Remove useless latex in a log message.
## What changes were proposed in this pull request?

Removed useless latex in a log message.

## How was this patch tested?

Check generated scaladoc.

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14380 from lins05/fix-docs-formatting.
2016-08-01 06:54:18 -07:00
WeichenXu bce354c1d4 [SPARK-16696][ML][MLLIB] destroy KMeans bcNewCenters when loop finished and update code where should release unused broadcast/RDD in proper time
## What changes were proposed in this pull request?

Update the handling of unused broadcast variables in KMeans/Word2Vec, using destroy(false) to release memory in time.

In several places destroy() is changed to destroy(false) so that it is called asynchronously, which is better than a blocking call.

Also update bcNewCenters in KMeans so that it is destroyed at the correct time: a list stores all historical `bcNewCenters` generated in each loop iteration, and their release is delayed until the end of the loop.

Fix the TODO "unpersist old indices" in `BisectingKMeans.run` by implementing the pattern "persist the current step's RDD, and unpersist the previous one" in the loop iteration.

## How was this patch tested?

Existing tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14333 from WeichenXu123/broadvar_unpersist_to_destroy.
2016-07-30 08:07:22 -07:00
Sean Owen 0dc4310b47 [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
## What changes were proposed in this pull request?

Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14332 from srowen/SPARK-16694.
2016-07-30 04:42:38 -07:00
Yanbo Liang 0557a45452 [SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake
## What changes were proposed in this pull request?
ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake.
See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
Why did the unit tests not catch this error? Because some estimators/transformers did not call ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. This PR also adds that call to all estimators/transformers that were missing it.

## How was this patch tested?
No new tests, should pass existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14378 from yanboliang/spark-16750.
2016-07-29 04:40:20 -07:00
krishnakalyan3 7e8279fde1 [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc
## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.

## How was this patch tested?

Documentation update

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #13894 from krishnakalyan3/kfold-cv.
2016-07-27 15:37:38 +02:00
Yanbo Liang 3c3371bbd6 [MINOR][ML] Fix some mistake in LinearRegression formula.
## What changes were proposed in this pull request?
Fix some mistake in ```LinearRegression``` formula.

## How was this patch tested?
Documents change, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14369 from yanboliang/LiR-formula.
2016-07-27 11:24:28 +01:00
WeichenXu 4c9695598e [SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid redundant RDD computation
## What changes were proposed in this pull request?

In `LDAOptimizer.submitMiniBatch`, persist `stats: RDD[(BDM[Double], List[BDV[Double]])]`.
Also move the unpersisting of the `expElogbetaBc` broadcast variable later,
so that it is not unpersisted too early,
and change the previous `expElogbetaBc.unpersist()` to `expElogbetaBc.destroy(false)`.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14335 from WeichenXu123/improve_LDA.
2016-07-26 10:41:41 +01:00
WeichenXu ad3708e783 [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6
## What changes were proposed in this pull request?

Change the default of the ANN convergence tolerance param
from 1e-4 to 1e-6

so that it matches the other algorithms in MLlib that use LBFGS as the optimizer.

## How was this patch tested?

Existing Test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14286 from WeichenXu123/update_ann_tol.
2016-07-25 20:00:37 +01:00
WeichenXu 25db51675f [SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug
## What changes were proposed in this pull request?

Rename variables to make the code clearer:
nnz => weightSum
weightSum => totalWeightSum

Add a new member vector `nnz` (not the `nnz` in the previous code, which is renamed to `weightSum`) to count the number of non-zero values in each dimension.
Use this new `nnz` instead of `weightSum` when calculating min/max, which fixes several numerical errors in some extreme cases.

## How was this patch tested?

A new testcase added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14216 from WeichenXu123/multivarOnlineSummary.
2016-07-23 12:32:30 +01:00
Anthony Truchet 0dc79ffd1c [SPARK-16440][MLLIB] Destroy broadcasted variables even on driver
## What changes were proposed in this pull request?
Forgotten broadcasted variables were unpersisted in a previous PR (#14153). This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even on the driver.

## How was this patch tested?
Unit Tests in Word2VecSuite were run locally.

This contribution is done on behalf of Criteo, according to the
terms of the Apache license 2.0.

Author: Anthony Truchet <a.truchet@criteo.com>

Closes #14268 from AnthonyTruchet/SPARK-16440.
2016-07-20 10:39:59 +01:00
Yanbo Liang 670891496a [SPARK-16494][ML] Upgrade breeze version to 0.12
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvements and bug fixes.
One of the biggest features is ```LBFGS-B```, an implementation of ```LBFGS``` with box constraints that is much faster in some special cases.
We would like to implement the Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)), and it requires ```LBFGS-B``` as the optimization solver. So we should bump the breeze dependency version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, refer to the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c

## How was this patch tested?
No new tests, should pass the existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14150 from yanboliang/spark-16494.
2016-07-19 12:31:04 +01:00
WeichenXu 8310c0741c [SPARK-16600][MLLIB] fix some latex formula syntax error
## What changes were proposed in this pull request?

`\partial\x` ==> `\partial x`
`har{x_i}` ==> `hat{x_i}`

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14246 from WeichenXu123/fix_formular_err.
2016-07-19 12:07:40 +01:00
Xin Ren 21a6dd2aef [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
https://issues.apache.org/jira/browse/SPARK-16535

## What changes were proposed in this pull request?

While scanning through the pom.xml files of the sub-projects, I found the warning below (screenshot attached):
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

I removed some of the lines with the groupId definition shown below, and the build on my local machine still succeeds.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).

ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

## How was this patch tested?

I've tested by re-building the project, and build succeeded.

Author: Xin Ren <iamshrek@126.com>

Closes #14189 from keypointt/SPARK-16535.
2016-07-19 11:59:46 +01:00
WeichenXu a529fc9442 [MINOR][TYPO] fix fininsh typo
## What changes were proposed in this pull request?

fininsh => finish

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14238 from WeichenXu123/fix_fininsh_typo.
2016-07-18 09:11:53 +01:00
Reynold Xin 480c870644 [SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java
This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python.

This patch was originally written by HyukjinKwon. Closes #14236.
2016-07-17 22:48:00 -07:00
Sean Owen 5ec0d692b0 [SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.

## How was this patch tested?

Jenkins build of docs

Author: Sean Owen <sowen@cloudera.com>

Closes #14221 from srowen/SPARK-3359.3.
2016-07-16 13:26:58 -07:00
z001qdp 71ad945bbb [SPARK-16426][MLLIB] Fix bug that caused NaNs in IsotonicRegression
## What changes were proposed in this pull request?

Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition.

## How was this patch tested?

Added a unit test.

Author: z001qdp <Nicholas.Eggert@target.com>

Closes #14140 from neggert/SPARK-16426-isotonic-nan.
2016-07-15 12:30:22 +01:00
WeichenXu 252d4f27f2 [SPARK-16500][ML][MLLIB][OPTIMIZER] add LBFGS convergence warning for all used place in MLLib
## What changes were proposed in this pull request?

Add a warning for the following cases when LBFGS training does not actually converge:

1) LogisticRegression
2) AFTSurvivalRegression
3) LBFGS algorithm wrapper in mllib package

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
2016-07-14 09:11:04 +01:00
Joseph K. Bradley a5f51e2162 [SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ML, doc fixes
## What changes were proposed in this pull request?

Fixing issues found during 2.0 API checks:
* GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed
* sqlDataTypes: name does not follow conventions. Do we need to expose it?
* Evaluator: inconsistent doc between evaluate and isLargerBetter
* MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little
* GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name

## How was this patch tested?

Existing unit tests.  Docs generated locally.  (MinMaxScaler is improved a tiny bit.)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14187 from jkbradley/final-api-check-2.0.
2016-07-13 15:40:44 -07:00
Joseph K. Bradley 01f09b1612 [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML
## What changes were proposed in this pull request?

General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

## How was this patch tested?

N/A

Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14147 from jkbradley/experimental-audit.
2016-07-13 12:33:39 -07:00
oraviv ea06e4ef34 [SPARK-16469] enhanced simulate multiply
## What changes were proposed in this pull request?

We have a use case of multiplying very big sparse matrices: a multiplication of about 1000x1000 distributed block matrices, where the simulate multiply step scales like O(n^4) (n being 1000) and takes about 1.5 hours. We modified it slightly to use a classical hashmap, and it now runs in about 30 seconds, at O(n^2).
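
The general idea of swapping a full scan for a hashmap lookup can be sketched as follows. This is a simplified illustration, not the patch itself: blocks are reduced to `{rowIndex, colIndex}` pairs, a left block (i, j) is assumed to pair only with right blocks in row j, and `SimulateMultiplySketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimulateMultiplySketch {
    // Naive matching: every left block scans every right block.
    // With n x n blocks on each side this is n^2 * n^2 comparisons.
    static int countPairsNaive(List<int[]> left, List<int[]> right) {
        int pairs = 0;
        for (int[] l : left) {
            for (int[] r : right) {
                if (l[1] == r[0]) {
                    pairs++;
                }
            }
        }
        return pairs;
    }

    // Indexed matching: group the right blocks by row index once,
    // then each left block does a single hashmap lookup.
    static int countPairsIndexed(List<int[]> left, List<int[]> right) {
        Map<Integer, List<int[]>> rightByRow = new HashMap<>();
        for (int[] r : right) {
            rightByRow.computeIfAbsent(r[0], key -> new ArrayList<>()).add(r);
        }
        int pairs = 0;
        for (int[] l : left) {
            List<int[]> matches = rightByRow.getOrDefault(l[1], Collections.emptyList());
            pairs += matches.size();
        }
        return pairs;
    }
}
```

Both methods find the same pairs; the indexed version simply avoids comparing blocks that can never match, which matters most when the matrices are sparse.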

## How was this patch tested?

We have added a performance test and verified the reduced time.

Author: oraviv <oraviv@paypal.com>

Closes #14068 from uzadude/master.
2016-07-13 14:47:08 +01:00
Sean Owen 51ade51a9f [SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs
## What changes were proposed in this pull request?

Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14153 from srowen/SPARK-16440.
2016-07-13 11:39:32 +01:00
WeichenXu 6cb75db9ab [SPARK-16470][ML][OPTIMIZER] Check linear regression training whether actually reach convergence and add warning if not
## What changes were proposed in this pull request?

`ml.regression.LinearRegression` uses the breeze `LBFGS` and `OWLQN` optimizers for training, but does not check whether the result returned by the optimizer actually reached convergence.

The breeze `LBFGS` and `OWLQN` optimizers can finish iterating in any of the following situations:

1) the max iteration number is reached
2) the function value reaches convergence
3) the objective function stops improving
4) the gradient reaches convergence
5) the search failed (due to some internal numerical error)

I added warning code so that
if the iteration result is (1), (3) or (5) above, a warning is printed with the respective reason string.
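
The decision described above can be sketched as a small Java enum. The names `StopReason` and `shouldWarn` are hypothetical and only mirror the five outcomes listed; they are not breeze's or Spark's actual identifiers.

```java
final class ConvergenceWarningSketch {
    // Hypothetical labels for the five ways the optimizer can stop.
    enum StopReason {
        MAX_ITERATIONS,            // (1)
        FUNCTION_VALUES_CONVERGED, // (2)
        OBJECTIVE_NOT_IMPROVING,   // (3)
        GRADIENT_CONVERGED,        // (4)
        SEARCH_FAILED              // (5)
    }

    // Warn only for outcomes (1), (3) and (5), which do not indicate real convergence.
    static boolean shouldWarn(StopReason reason) {
        switch (reason) {
            case MAX_ITERATIONS:
            case OBJECTIVE_NOT_IMPROVING:
            case SEARCH_FAILED:
                return true;
            default:
                return false;
        }
    }
}
```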

## How was this patch tested?

Manual.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.
2016-07-12 13:04:34 +01:00
WeichenXu fc11c509e2 [MINOR][ML] update comment where is inconsistent with code in ml.regression.LinearRegression
## What changes were proposed in this pull request?

In the `train` method of `ml.regression.LinearRegression`, when handling the case `std(label) == 0`,
the code replaces `std(label)` with `mean(label)`, but the corresponding comment is inconsistent with this. I updated it.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14121 from WeichenXu123/update_lr_comment.
2016-07-12 09:23:59 +01:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
Xusen Yin 255d74fe4a [SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should be aware of empty partitions
## What changes were proposed in this pull request?

tallSkinnyQR of RowMatrix should be aware of empty partitions, which can cause an exception from the Breeze qr decomposition.

See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.

## How was this patch tested?

Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14049 from yinxusen/SPARK-16369.
2016-07-08 14:23:57 +01:00
Xusen Yin 4c6f00d09c [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix
## What changes were proposed in this pull request?

The following Java code fails because of type erasure:

```Java
JavaRDD<Vector> rows = jsc.parallelize(...);
RowMatrix mat = new RowMatrix(rows.rdd());
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
```

We should use retag to restore the type to prevent the following exception:

```Java
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
```
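
The same class of failure can be reproduced in plain Java without Spark. The sketch below is an analogy only: `RetagSketch` and its methods are hypothetical, and the `Class<T>` token merely plays the role that restoring the element type plays in the fix.

```java
import java.lang.reflect.Array;
import java.util.List;

final class RetagSketch {
    // Without element type information only an Object[] can be created.
    // The unchecked cast compiles, but the array is still an Object[] at runtime.
    @SuppressWarnings("unchecked")
    static <T> T[] toArrayErased(List<T> items) {
        Object[] raw = new Object[items.size()];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = items.get(i);
        }
        return (T[]) raw;
    }

    // With a class token the array is created with the correct runtime type.
    @SuppressWarnings("unchecked")
    static <T> T[] toArrayTagged(List<T> items, Class<T> elementType) {
        T[] out = (T[]) Array.newInstance(elementType, items.size());
        return items.toArray(out);
    }
}
```

Assigning the result of `toArrayErased` to a `String[]` variable throws a `ClassCastException` at the call site, because an `Object[]` cannot be cast to a `String[]`; `toArrayTagged` avoids this because the array is built with its real element type.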

## How was this patch tested?

Java unit test

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14051 from yinxusen/SPARK-16372.
2016-07-07 11:28:04 +01:00
tmnd1991 040f6f9f46 [SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
## What changes were proposed in this pull request?
The "test big model load / save" test in Word2VecSuite has lately been resulting in OOM.
Therefore we decided to make the partitioning adaptive (not based on the Spark default "spark.kryoserializer.buffer.max" conf) and then test it with a small buffer size, in order to trigger partitioning without allocating too much memory for the test.

## How was this patch tested?
It was tested running the following unit test:
org.apache.spark.mllib.feature.Word2VecSuite

Author: tmnd1991 <antonio.murgia2@studio.unibo.it>

Closes #13509 from tmnd1991/SPARK-15740.
2016-07-06 12:56:26 -07:00
MechCoder 909c6d812f [SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data
## What changes were proposed in this pull request?

The current tests assume that `impurity.calculate()` returns the variance correctly. It would be better to make the tests independent of this assumption; in other words, to verify that the computed variance equals the variance computed manually on a small tree.

## How was this patch tested?

The patch is a test....

Author: MechCoder <mks542@nyu.edu>

Closes #13981 from MechCoder/dt_variance.
2016-07-06 02:54:44 -07:00
Yuhao Yang 5497242c76 [SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public for loading
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16249
Change visibility of Object ml.clustering.LDA to public for loading, thus users can invoke LDA.load("path").

## How was this patch tested?

Existing unit tests, plus a manual test of loading (a model saved with the current code)

Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13941 from hhbyyh/ldapublic.
2016-07-06 01:30:47 -07:00
Yuhao Yang aa6564f37f [SPARK-14608][ML] transformSchema needs better documentation
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14608
PipelineStage.transformSchema currently has minimal documentation. It should have more to explain it can:
check schema
check parameter interactions

## How was this patch tested?
unit test

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #12384 from hhbyyh/transformSchemaDoc.
2016-06-30 19:34:51 -07:00
zlpmichelle b30a2dc7c5 [SPARK-16241][ML] model loading backward compatibility for ml NaiveBayes
## What changes were proposed in this pull request?

model loading backward compatibility for ml NaiveBayes

## How was this patch tested?

existing ut and manual test for loading models saved by Spark 1.6.

Author: zlpmichelle <zlpmichelle@gmail.com>

Closes #13940 from zlpmichelle/naivebayes.
2016-06-30 00:50:14 -07:00
Mahmoud Rawas 393db655c3 [SPARK-15858][ML] Fix calculating error by tree stack over flow prob…
## What changes were proposed in this pull request?

Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees

## How was this patch tested?

The patch was tested on a production data set (2K rows x 2K features) by training a gradient boosted model without validation with a maxIteration setting of 1000 and then trying to produce the error by tree. The new patch was able to perform the calculation within 30 seconds, while previously it would take hours and then fail.

**PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0

Author: Mahmoud Rawas <mhmoudr@gmail.com>
Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au>

Closes #13624 from mhmoudr/SPARK-15858.master.
2016-06-29 13:12:17 +01:00
Yanbo Liang 0df5ce1bc1 [SPARK-16245][ML] model loading backward compatibility for ml.feature.PCA
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13937 from yanboliang/spark-16245.
2016-06-28 19:53:07 -07:00
Yanbo Liang e158478a9f [SPARK-16242][MLLIB][PYSPARK] Conversion between old/new matrix columns in a DataFrame (Python)
## What changes were proposed in this pull request?
This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame.

## How was this patch tested?
Doctest in python.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13935 from yanboliang/spark-16242.
2016-06-28 06:28:22 -07:00
Yuhao Yang c17b1abff8 [SPARK-16187][ML] Implement util method for ML Matrix conversion in scala/java
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16187
This is to provide conversion utils between old/new matrix columns in a DataFrame, so users can use them to migrate their datasets and pipelines manually.

## How was this patch tested?

java and scala ut

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13888 from hhbyyh/matComp.
2016-06-27 12:27:39 -07:00
José Antonio a3c7b4187b [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException. I have found the bug and tested the solution.
## What changes were proposed in this pull request?

Just adjust the size of an array in line 58 so it does not cause an ArrayIndexOutOfBoundsException in line 66.

## How was this patch tested?

Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results.

line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1
crashes because trueWeights has length "nfeatures + 1" while "x" has length "nfeatures", and they should have the same length.

To fix this just make trueWeights be the same length as x.
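
A minimal sketch of the length mismatch, written in plain Python rather than the Scala of `SVMDataGenerator`. The `dot` helper, the seed, and the sample size are illustrative assumptions, not code from the patch; the helper simply refuses operands of different lengths, standing in for the out-of-bounds read described above.

```python
import random

def dot(x, w):
    """Dot product that rejects operands of different lengths (hypothetical helper)."""
    if len(x) != len(w):
        raise IndexError(f"length mismatch: x has {len(x)}, weights have {len(w)}")
    return sum(a * b for a, b in zip(x, w))

nfeatures = 4
rnd = random.Random(42)
x = [rnd.gauss(0.0, 1.0) for _ in range(nfeatures)]

# Before the fix: the weight vector has one element more than x.
bad_weights = [rnd.gauss(0.0, 1.0) for _ in range(nfeatures + 1)]
# After the fix: the weight vector has the same length as x.
good_weights = [rnd.gauss(0.0, 1.0) for _ in range(nfeatures)]

try:
    dot(x, bad_weights)
    mismatch_detected = False
except IndexError:
    mismatch_detected = True

# Noisy response, mirroring the shape of the expression on line 66.
y = dot(x, good_weights) + rnd.gauss(0.0, 1.0) * 0.1
```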

I have recompiled the project with the change and it is working now:
[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test

And it generates the data successfully now in the specified folder.

Author: José Antonio <joseanmunoz@gmail.com>

Closes #13895 from j4munoz/patch-2.
2016-06-25 09:11:25 +01:00
Yuhao Yang cc6778ee0b [SPARK-16133][ML] model loading backward compatibility for ml.feature
## What changes were proposed in this pull request?

model loading backward compatibility for ml.feature.

## How was this patch tested?

existing ut and manual test for loading 1.6 models.

Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13844 from hhbyyh/featureComp.
2016-06-23 21:50:25 -07:00
Yuhao Yang 14bc5a7f36 [SPARK-16177][ML] model loading backward compatibility for ml.regression
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16177
model loading backward compatibility for ml.regression

## How was this patch tested?

existing ut and manual test for loading 1.6 models.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13879 from hhbyyh/regreComp.
2016-06-23 20:43:19 -07:00
Yuhao Yang 60398dabc5 [SPARK-16130][ML] model loading backward compatibility for ml.classfication.LogisticRegression
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16130
model loading backward compatibility for ml.classification.LogisticRegression

## How was this patch tested?
existing ut and manual test for loading old models.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13841 from hhbyyh/lrcomp.
2016-06-23 11:00:00 -07:00
Xiangrui Meng 65d1f0f716 [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs
## What changes were proposed in this pull request?

Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.

## How was this patch tested?

Manually checked generated APIs.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13859 from mengxr/SPARK-16154.
2016-06-23 08:26:17 -07:00
Xiangrui Meng 00cc5cca45 [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug
## What changes were proposed in this pull request?

We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):

~~~scala
  /** @group setParam */
  @Since("1.6.0")
  @deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
  def setLabelCol(value: String): this.type = set(labelCol, value)
~~~

This unfortunately hit a genjavadoc bug and broke doc generation. This is the generated Java code:

~~~java
  /** @group setParam */
  public  org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value)  { throw new RuntimeException(); }
   *
   * @deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
  */
  public  org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value)  { throw new RuntimeException(); }
~~~

Switching to multiline is a workaround.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13855 from mengxr/SPARK-16153.
2016-06-22 15:50:21 -07:00
Xiangrui Meng 6a6010f001 [MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi
## What changes were proposed in this pull request?

`DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use them. So this PR changes the annotation to `DeveloperApi`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13828 from mengxr/default-readable-should-be-developer-api.
2016-06-22 10:06:43 -07:00
Nick Pentreath 18faa588ca [SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
2016-06-22 10:05:25 -07:00
Holden Karau d281b0bafe [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs
## What changes were proposed in this pull request?

Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable mathjax for PyDoc.

## How was this patch tested?

Built docs locally & PySpark SQL tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
2016-06-22 11:54:49 +02:00
gatorsmile 0e3ce75332 [SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.

Also fix a test case issue in `BroadcastJoinSuite`.

BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13380 from gatorsmile/sqlContextML.
2016-06-21 23:12:08 -07:00
Xiangrui Meng d77c4e6e2e [MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel
## What changes were proposed in this pull request?

Deprecate `labelCol`, which is not used by ChiSqSelectorModel.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
2016-06-21 20:53:38 -07:00
Xiangrui Meng 9493b079a0 [SPARK-16118][MLLIB] add getDropLast to OneHotEncoder
## What changes were proposed in this pull request?

We forgot the getter of `dropLast` in `OneHotEncoder`

## How was this patch tested?

unit test

Author: Xiangrui Meng <meng@databricks.com>

Closes #13821 from mengxr/SPARK-16118.
2016-06-21 15:52:31 -07:00
Xiangrui Meng f4e8c31adf [SPARK-16117][MLLIB] hide LibSVMFileFormat and move its doc to LibSVMDataSource
## What changes were proposed in this pull request?

LibSVMFileFormat implements a data source for the LIBSVM format. However, users do not really need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124.

## How was this patch tested?

Manually checked the generated API doc and tested loading LIBSVM data.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13819 from mengxr/SPARK-16117.
2016-06-21 15:46:14 -07:00
Xiangrui Meng 918c91954f [MINOR][MLLIB] move setCheckpointInterval to non-expert setters
## What changes were proposed in this pull request?

The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13813 from mengxr/checkpoint-non-expert.
2016-06-21 13:35:06 -07:00
Xiangrui Meng 4f83ca1059 [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib
## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13801 from mengxr/SPARK-15177.1.
2016-06-21 08:31:15 -07:00
Nick Pentreath 37494a18e8 [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature
This PR adds missing `Since` annotations to `ml.feature` package.

Closes #8505.

## How was this patch tested?

Existing tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13641 from MLnick/add-since-annotations.
2016-06-21 00:39:47 -07:00
Xiangrui Meng 18a8a9b1f4 [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
## What changes were proposed in this pull request?

Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.

## How was this patch tested?

Unit tests in Scala and Java.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13789 from mengxr/SPARK-16074.
2016-06-20 21:51:02 -07:00
Xiangrui Meng edb23f9e47 [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)
## What changes were proposed in this pull request?

This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.

## How was this patch tested?

doctest in Python

cc: yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #13731 from mengxr/SPARK-15946.
2016-06-17 21:22:29 -07:00
sethah 1f0a46958e [SPARK-16008][ML] Remove unnecessary serialization in logistic regression
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x (for multiclass logistic regression, this number will go up) larger than it should be (in MLlib, for instance, it is 3x smaller).

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization.
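
The effect can be sketched with Python's `pickle` standing in for Spark's serializer. The class names, the toy update rule, and the sizes below are illustrative assumptions and not the real `LogisticAggregator`; the only point is that arrays held as fields travel with the object, while arrays passed as method arguments do not.

```python
import pickle

NUM_FEATURES = 10_000

def accumulate(gradient_sum, features, label, coefficients, features_std):
    # Toy update (not the real logistic loss): scale each feature and weight it by the label.
    for i, value in enumerate(features):
        if features_std[i] != 0.0:
            gradient_sum[i] += label * coefficients[i] * value / features_std[i]

class HeavyAggregator:
    """Stores the two per-feature arrays, so they are serialized with the aggregator."""
    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients
        self.features_std = features_std
        self.gradient_sum = [0.0] * len(coefficients)

    def add(self, features, label):
        accumulate(self.gradient_sum, features, label, self.coefficients, self.features_std)
        return self

class LightAggregator:
    """Receives the two arrays as arguments, so only its own state is serialized."""
    def __init__(self, num_features):
        self.gradient_sum = [0.0] * num_features

    def add(self, features, label, coefficients, features_std):
        accumulate(self.gradient_sum, features, label, coefficients, features_std)
        return self

coefficients = [0.5] * NUM_FEATURES
features_std = [2.0] * NUM_FEATURES
features = [1.0] * NUM_FEATURES

heavy = HeavyAggregator(coefficients, features_std).add(features, 1.0)
light = LightAggregator(NUM_FEATURES).add(features, 1.0, coefficients, features_std)

heavy_size = len(pickle.dumps(heavy))
light_size = len(pickle.dumps(light))
```

Both aggregators end up with the same `gradient_sum`; only the serialized size differs, because the heavy variant carries three arrays of length `NUM_FEATURES` instead of one.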

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed a >2x iteration speedup.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13729 from sethah/lr_improvement.
2016-06-17 09:58:49 -07:00
Dongjoon Hyun 36110a8306 [SPARK-15922][MLLIB] toIndexedRowMatrix should consider the case cols < offset+colsPerBlock
## What changes were proposed in this pull request?

SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.

**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```

**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```
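
A rough Python sketch of the boundary case, assuming the row is rebuilt by copying each block into the column range starting at `offset`. The function name and the data layout are hypothetical and do not mirror the Scala implementation; the sketch only shows why the range has to be clamped to `cols` when the last block is narrower than `colsPerBlock`.

```python
def assemble_row(blocks, cols, cols_per_block):
    """Rebuild one dense row from (block_column_index, values) pieces."""
    row = [0.0] * cols
    for block_col, values in blocks:
        offset = block_col * cols_per_block
        # Clamp the end of the range: the last block may hold fewer columns.
        end = min(cols, offset + cols_per_block)
        if len(values) != end - offset:
            raise ValueError("Vectors must be the same length!")
        row[offset:end] = values
    return row

# 3 columns with a block width of 1024: cols < offset + cols_per_block for the only block.
row = assemble_row([(0, [1.0, 2.0, 3.0])], cols=3, cols_per_block=1024)

# 5 columns split into blocks of 2: the last block holds a single column.
row5 = assemble_row([(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0])], cols=5, cols_per_block=2)
```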

## How was this patch tested?

Pass the Jenkins tests (including the above case)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13643 from dongjoon-hyun/SPARK-15922.
2016-06-16 23:02:46 +02:00
Cheng Lian 9ea0d5e326 [SPARK-15983][SQL] Removes FileFormat.prepareRead
## What changes were proposed in this pull request?

Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.

However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13698 from liancheng/remove-prepare-read.
2016-06-16 10:24:29 -07:00
Reynold Xin 865e7cc38d [SPARK-15979][SQL] Rename various Parquet support classes.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:

1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.

## How was this patch tested?
Renamed test cases as well.

Author: Reynold Xin <rxin@databricks.com>

Closes #13696 from rxin/parquet-rename.
2016-06-15 20:05:08 -07:00
Wojciech Jurczyk 6e0b3d795c [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification
The PR updates outdated scaladocs for the Gini and Entropy classes. Spark has supported multiclass classification since PR #886, but the docs mention only binary classification.

Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>

Closes #11252 from wjur/wjur/docs_multiclass.
2016-06-15 15:58:42 -07:00
Xiangrui Meng 63e0aebe22 [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java)
## What changes were proposed in this pull request?

This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens.

This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.

## How was this patch tested?

Unit tests in Scala and Java.

cc: yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #13662 from mengxr/SPARK-15945.
2016-06-14 18:57:45 -07:00
Liang-Chi Hsieh baa3e633e1 [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
## What changes were proposed in this pull request?

Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13219 from viirya/pyspark-pickler-ml.
2016-06-13 19:59:53 -07:00
hyukjinkwon e3554605b3 [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count
## What changes were proposed in this pull request?

Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:

```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```

Please see [AFTSurvivalRegression.scala#L573-L575](6ecedf39b4/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L573-L575)) as well.

Just to be clear, the Python example `aft_survival_regression.py` seems to use 5 rows. So, if there are more than 5 partitions, it throws the exception above, since some partitions are empty and this results in an incorrectly merged `AFTAggregator`.

Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPUs fails because it creates tasks with some empty partitions under the default configuration (AFAIK, it sets the parallelism level to the number of CPU cores).
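
One way to make a merge tolerate empty partitions can be sketched as follows. `CountingAggregator` is a hypothetical toy class, not the real `AFTAggregator`, and the approach shown (skip zero-count inputs in `merge`, validate the count only when the result is read) is an assumption about the general technique rather than a description of the actual patch.

```python
class CountingAggregator:
    """Toy aggregator that tracks a running loss sum and an instance count."""
    def __init__(self):
        self.total_count = 0
        self.loss_sum = 0.0

    def add(self, loss):
        self.total_count += 1
        self.loss_sum += loss
        return self

    def merge(self, other):
        # An aggregator built from an empty partition contributes nothing,
        # so skip it instead of validating its count.
        if other.total_count != 0:
            self.total_count += other.total_count
            self.loss_sum += other.loss_sum
        return self

    @property
    def loss(self):
        # Validate only when the final value is read.
        if self.total_count <= 0:
            raise ValueError(
                "The number of instances should be greater than 0, "
                f"but got {self.total_count}."
            )
        return self.loss_sum / self.total_count

# Four partitions, two of them empty, as happens with more partitions than rows.
partitions = [[0.5, 1.5], [], [2.0], []]
merged = CountingAggregator()
for part in partitions:
    agg = CountingAggregator()
    for value in part:
        agg.add(value)
    merged.merge(agg)

mean_loss = merged.loss
```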

## How was this patch tested?

A unit test in `AFTSurvivalRegressionSuite.scala` and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #13619 from HyukjinKwon/SPARK-15892.
2016-06-12 14:26:53 -07:00
Davies Liu aec502d911 [SPARK-15654] [SQL] fix non-splitable files for text based file formats
## What changes were proposed in this pull request?

Currently, we always split a file when it is bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly. We should have an API in FileFormat to check whether a file can be split or not.

This PR is based on #13442, closes #13442

## How was this patch tested?

add regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13531 from davies/fix_split.
2016-06-10 14:32:43 -07:00
wangyang 026eb90644 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0
## What changes were proposed in this pull request?

In Scala, immutable.List.length is an expensive operation, so we should
avoid using Seq.length == 0 or Seq.length > 0, and use Seq.isEmpty and Seq.nonEmpty instead.
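
The cost difference can be illustrated with a small singly linked list in Python. The `LinkedList` class and its `visited` counter are hypothetical, written only to show that computing a length walks every node while an emptiness check looks at the head alone.

```python
class Node:
    def __init__(self, head, tail=None):
        self.head = head
        self.tail = tail

class LinkedList:
    """Singly linked list that counts how many nodes each traversal visits."""
    def __init__(self, items=()):
        self.first = None
        for item in reversed(list(items)):
            self.first = Node(item, self.first)
        self.visited = 0

    def length(self):
        # O(n): walks the whole list.
        n, node = 0, self.first
        while node is not None:
            self.visited += 1
            n += 1
            node = node.tail
        return n

    def is_empty(self):
        # O(1): inspects only the head reference.
        return self.first is None

    def non_empty(self):
        return self.first is not None

xs = LinkedList(range(1000))

slow = xs.length() > 0
visited_by_length = xs.visited

xs.visited = 0
fast = xs.non_empty()
visited_by_non_empty = xs.visited
```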

## How was this patch tested?
existing tests

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
2016-06-10 13:10:03 -07:00
Bryan Cutler 7d7a0a5e07 [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula.  This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
2016-06-10 11:27:30 -07:00
yinxusen 87706eb66c [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15793

Word2vec in ML package should have maxSentenceLength method for feature parity.

## How was this patch tested?

Tested with Spark unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #13536 from yinxusen/SPARK-15793.
2016-06-08 09:18:04 +01:00
Yanbo Liang 6ecedf39b4 [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel``` (with the "l-bfgs" solver) and ```LogisticRegressionModel``` without an intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM.

When fitting ```AFTSurvivalRegressionModel``` without an intercept on a dataset with a constant nonzero column, spark.ml produces a different model from R survival::survreg.

We should output a warning message and clarify this condition in the documentation.

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12731 from yanboliang/spark-13590.
2016-06-07 15:25:36 -07:00
Joseph K. Bradley 4c74ee8d8e [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
## What changes were proposed in this pull request?

Made DefaultParamsReadable, DefaultParamsWritable public.  Also added relevant doc and annotations.  Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.

## How was this patch tested?

Wrote example making use of the now-public APIs.  Compiled and ran locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13461 from jkbradley/defaultparamswritable.
2016-06-06 09:49:45 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.
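
The same candidate search can be sketched with Python's `re` module. The sample lines are made up for illustration; the pattern matches what `grep -in ' an [^aeiou]'` looks for, and the third sample shows why the hits still need manual review.

```python
import re

# " an " followed by a non-vowel, case-insensitive like grep -i.
CANDIDATE = re.compile(r" an [^aeiou]", re.IGNORECASE)

lines = [
    "returns an DataFrame of results",   # real typo: should be "a DataFrame"
    "creates an array of weights",       # correct, not reported
    "this is an hour-long job",          # reported, but correct English
]

# (line number, text) pairs, numbered from 1 like grep -n.
hits = [(n, line) for n, line in enumerate(lines, start=1) if CANDIDATE.search(line)]
```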

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Josh Rosen 26c1089c37 [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.
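
For illustration, the same pattern in Python: `functools.reduce` with list concatenation plays the role of `foldLeft` with `++`, and `itertools.chain.from_iterable` plays the role of `flatten`. The sample data is made up.

```python
from functools import reduce
from itertools import chain

lists = [[i, i + 1] for i in range(0, 10, 2)]

# foldLeft-style: every step allocates a new list holding everything seen so far.
folded = reduce(lambda acc, xs: acc + xs, lists, [])

# flatten-style: a single pass that builds one result list.
flattened = list(chain.from_iterable(lists))
```

Both produce the same elements in the same order; the difference is only in how many intermediate lists are allocated along the way.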

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13491 from JoshRosen/foldleft-to-flatmap.
2016-06-05 16:51:00 -07:00
Zheng RuiFeng 372fa61f51 [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi
## What changes were proposed in this pull request?
1, remove comments `:: Experimental ::` for non-experimental API
2, add comments `:: Experimental ::` for experimental API
3, add comments `:: DeveloperApi ::` for developerApi API

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13514 from zhengruifeng/del_experimental.
2016-06-05 11:55:25 -07:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Wenchen Fan 190ff274fd [SPARK-15494][SQL] encoder code cleanup
## What changes were proposed in this pull request?

Our encoder framework has evolved a lot. This PR cleans up the code to make it more readable and to emphasise the concept that an encoder should be used as a container of serde expressions.

1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird, so this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #13269 from cloud-fan/clean-encoder.
2016-06-03 00:43:02 -07:00
Xiangrui Meng e23370ec61 [SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite
## What changes were proposed in this pull request?

andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.

This PR disables the test. I will leave the JIRA open for a proper fix

## How was this patch tested?

No new features.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13478 from mengxr/SPARK-15740.
2016-06-02 17:41:31 -07:00
Yuhao Yang 5855e0057d [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
## What changes were proposed in this pull request?

ml.feature: update the schema check to avoid confusion when users use MLlib.vector as the input type

## How was this patch tested?
existing ut

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13411 from hhbyyh/schemaCheck.
2016-06-02 16:37:01 -07:00
Nick Pentreath ccd298eb67 [MINOR] clean up style for storage param setters in ALS
Clean up style for param setter methods in ALS to match standard style and the other setter in class (this is an artefact of one of my previous PRs that wasn't cleaned up).

## How was this patch tested?
Existing tests - no functionality change.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13480 from MLnick/als-param-minor-cleanup.
2016-06-02 16:33:16 -07:00
Yanbo Liang 07a98ca4ce [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13410 from yanboliang/spark-15587.
2016-06-01 10:49:51 -07:00
Lianhui Wang 6563d72b16 [SPARK-15664][MLLIB] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
## What changes were proposed in this pull request?
If `sparkContext.setCheckpointDir` points to a directory that is not on the default FileSystem, removing a checkpoint file in MLlib throws an exception.
So we should always get the FileSystem from the Path to avoid the wrong-FS problem.
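
The idea can be sketched in Python with `urllib.parse`. The functions, the default URI, and the checkpoint paths below are hypothetical stand-ins, not Hadoop APIs; they only contrast "ask the configuration" with "ask the path".

```python
from urllib.parse import urlparse

DEFAULT_FS = "hdfs://namenode:8020"  # hypothetical default file system

def default_file_system():
    """Mimics asking only the configuration: always the default file system."""
    return DEFAULT_FS

def file_system_for(path):
    """Mimics asking the path: use its own scheme and authority when present."""
    parsed = urlparse(path)
    if parsed.scheme:
        return f"{parsed.scheme}://{parsed.netloc}"
    return DEFAULT_FS

checkpoint = "s3a://bucket/checkpoints/rdd-7"

wrong_fs = default_file_system()          # ignores where the checkpoint lives
right_fs = file_system_for(checkpoint)    # resolves the store the path points at
relative_fs = file_system_for("/tmp/checkpoints/rdd-7")  # no scheme: falls back to the default
```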
## How was this patch tested?
N/A

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13408 from lianhuiwang/SPARK-15664.
2016-06-01 08:30:38 -05:00
Dongjoon Hyun 85d6b0db9f [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable.
## What changes were proposed in this pull request?

This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it where applicable, as follows.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13365 from dongjoon-hyun/SPARK-15618.
2016-05-31 17:40:44 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Zheng RuiFeng 9893dc9757 [SPARK-15610][ML] update error message for k in pca
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...`  ->  `require(k < sources.first().size`

BTW, remove unused import in `ml.ElementwiseProduct`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13356 from zhengruifeng/fix_pca.
2016-05-27 21:57:41 -05:00
DB Tsai 21b2605dc4 [SPARK-15413][ML][MLLIB] Change toBreeze to asBreeze in Vector and Matrix
## What changes were proposed in this pull request?

We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underlying data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, so it will not affect any user's application.

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13198 from dbtsai/minor.
2016-05-27 14:02:39 -07:00
Yanbo Liang a3550e3747 [SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.

Because the section ```Optimization of linear methods``` is intended for developers, I think we should provide a brief introduction to each optimization method, the necessary references, and how it is implemented in Spark. It's not necessary to paste every mathematical formula and derivation here. Developers or users who want to learn more can follow the references.

## How was this patch tested?
Document update, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13262 from yanboliang/spark-15484.
2016-05-27 13:16:22 -07:00
Andrew Or b376a4eabc [HOTFIX] Scala 2.10 compile GaussianMixtureModel
2016-05-27 11:43:01 -07:00
Dongjoon Hyun 4538443e27 [SPARK-15584][SQL] Abstract duplicate code: spark.sql.sources. properties
## What changes were proposed in this pull request?

This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13349 from dongjoon-hyun/SPARK-15584.
2016-05-27 11:10:31 -07:00
Dongjoon Hyun d24e251572 [SPARK-15603][MLLIB] Replace SQLContext with SparkSession in ML/MLLib
## What changes were proposed in this pull request?

This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures.
- ReadWrite.scala
- TreeModels.scala

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13352 from dongjoon-hyun/SPARK-15603.
2016-05-27 11:09:15 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I used a regex to find potentially incorrect lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and reviewed them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Yin Huai 3ac2363d75 [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13310 from yhuai/SPARK-15532.
2016-05-26 16:53:31 -07:00
Sean Owen b0a03feef2 [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations
## What changes were proposed in this pull request?

Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
  * Use in PythonMLlibAPI: Change to using private constructors
  * Streaming algs: No warnings after we un-deprecate the classes
  * Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field

## How was this patch tested?

Existing tests.  Checked for warnings manually.

Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13314 from jkbradley/warning-cleanups.
2016-05-26 14:25:28 -07:00
Villu Ruusmann 6d506c9ae9 [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15
## What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-15523

This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.

## How was this patch tested?

1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.

Author: Villu Ruusmann <villu.ruusmann@gmail.com>

Closes #13297 from vruusmann/update-jpmml.
2016-05-26 08:11:34 -05:00
Reynold Xin 361ebc282b [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.

They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat

Backward compatibility is maintained through aliasing.

## How was this patch tested?
Updated relevant test cases too.

Author: Reynold Xin <rxin@databricks.com>

Closes #13311 from rxin/SPARK-15543.
2016-05-25 23:54:24 -07:00
Gio Borje 589cce93c8 Log warnings for numIterations * miniBatchFraction < 1.0
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. In the best case, all sampled examples are distinct; hence at most 60/100 examples are used.

This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the whole training data set, it would be easier and more intuitive to use `RDD.sample`.
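
The condition behind the warning can be illustrated with a short sketch. This is plain Python rather than Spark code, and the function name `max_fraction_used` is hypothetical; it only computes the best-case bound under the assumption that no example is sampled twice.

```python
def max_fraction_used(num_iterations: int, mini_batch_fraction: float) -> float:
    """Best-case fraction of the training set touched by mini-batch gradient
    descent, assuming every sampled example is distinct (an upper bound)."""
    return min(1.0, num_iterations * mini_batch_fraction)


# 3 iterations at miniBatchFraction = 0.2 give a product of 0.6, which is
# below 1.0, so part of the training set can never be visited.
bound = max_fraction_used(3, 0.2)
needs_warning = 3 * 0.2 < 1.0
```

In practice the covered fraction is usually lower than this bound, because independent samples overlap.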

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <gborje@linkedin.com>

Closes #13265 from Hydrotoast/master.
2016-05-25 16:52:31 -05:00
Nick Pentreath 1cb347fbc4 [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.

We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

Tests N/A.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
2016-05-25 20:41:53 +02:00
lfzCarlosC 02c8072eea [MINOR][MLLIB][STREAMING][SQL] Fix typos
fixed typos for source code for components [mllib] [streaming] and [SQL]

None and obvious.

Author: lfzCarlosC <lfz.carlos@gmail.com>

Closes #13298 from lfzCarlosC/master.
2016-05-25 10:53:57 -07:00
Nick Pentreath 6075f5b4d8 [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

## How was this patch tested?

A little doctest and built API docs locally to check HTML doc generation.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
2016-05-24 10:02:10 +02:00
Yanbo Liang c94b34ebbf [SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is now unset rather than empty. This is consistent with other similar params such as ```weightCol```.
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13129 from yanboliang/spark-15339.
2016-05-19 23:35:20 -07:00
Reynold Xin f2ee0ed4b7 [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified
## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.

This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.

## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.

Author: Reynold Xin <rxin@databricks.com>

Closes #13200 from rxin/SPARK-15075.
2016-05-19 21:53:26 -07:00
Sandeep Singh 01cf649c4f [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13101 from techaddict/SPARK-15296.
2016-05-19 20:38:44 -07:00
Yanbo Liang 6643677817 [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.

## How was this patch tested?
Only API docs change, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13195 from yanboliang/evaluation-doc.
2016-05-19 17:56:21 -07:00
Yanbo Liang f8107c7846 [SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify "summary" was not saved
## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it.
We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW.

## How was this patch tested?
Documentation update, no unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13131 from yanboliang/spark-15341.
2016-05-19 17:54:18 -07:00
Sandeep Singh ef43a5fe51 [SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public
## What changes were proposed in this pull request?
Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg):
`Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML`

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13202 from techaddict/SPARK-15414.
2016-05-19 17:24:42 -07:00
Yanbo Liang 59e6c5560d [SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering
## What changes were proposed in this pull request?
Audit Scala API for ml.clustering.
Fix some incorrect API documentation and update outdated docs.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13148 from yanboliang/spark-15361.
2016-05-19 13:26:41 -07:00
DB Tsai 5255e55c84 [SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala
## What changes were proposed in this pull request?

Add since to ml.stat.MultivariateOnlineSummarizer.scala

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13197 from dbtsai/cleanup.
2016-05-19 13:10:51 -07:00
Yanbo Liang 8ecf7f77b2 [SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification
## What changes were proposed in this pull request?
Audit Scala API for classification. Almost all issues in this section were related to ```MultilayerPerceptronClassifier```.
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter function for ```solver``` and ```stepSize```.
* Make ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13076 from yanboliang/spark-15292.
2016-05-19 10:27:17 -07:00
Yanbo Liang 1052d3644d [SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible
## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.

## How was this patch tested?
Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13149 from yanboliang/spark-15362.
2016-05-19 10:25:33 -07:00
Bryan Cutler b1bc5ebdd5 [DOC][MINOR] ml.feature Scala and Python API sync
## What changes were proposed in this pull request?

I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.

## How was this patch tested?

Built docs locally, ran style checks

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13159 from BryanCutler/ml.feature-api-sync.
2016-05-19 04:48:36 +02:00
Wenchen Fan ebfe3a1f2c [SPARK-15192][SQL] null check for SparkSession.createDataFrame
## What changes were proposed in this pull request?

This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.
2016-05-18 18:06:38 -07:00
Nick Pentreath e8b79afa02 [SPARK-14891][ML] Add schema validation for ALS
This PR adds schema validation to `ml`'s ALS and ALSModel. Previously, no schema validation was performed because `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, without schema validation, if users passed in Long (or Float, etc.) ids, they were silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).
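
The range-checked cast described above can be sketched as follows. This is a plain-Python illustration, not the Spark implementation; the helper name `checked_int_id` is hypothetical, and only the range check mentioned in the description is modeled.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # bounds of a 32-bit signed integer


def checked_int_id(value):
    """Cast a numeric user/item id to int, raising instead of silently
    overflowing when the value is outside the 32-bit integer range."""
    if not (INT_MIN <= value <= INT_MAX):
        raise ValueError(f"id {value!r} is out of integer range")
    return int(value)


ok = checked_int_id(7.0)            # a Float id inside the range is accepted
try:
    checked_int_id(2**31)           # a Long id outside the range is rejected
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
```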

## How was this patch tested?
New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
2016-05-18 21:13:12 +02:00
DLucky 420b700695 [SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points
mateiz srowen

I state that the contribution is my original work and that I license the work to the project under the project's open source license

There were some formatting problems with my last PR. With HyukjinKwon's help I read the guidance, re-checked my code and PR, ran the tests, and am re-submitting the PR here.

Although the related JIRA issue is marked as resolved, I think this change may still relate to it.

## Proposed Change

After picking each new initial center, it is unnecessary to recompute the distances between all the points and the previously chosen centers.
Instead, this change keeps the distance between each point and its closest center, compares it with the distance to the new center, and updates it if the new center is closer.
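
A minimal sketch of this incremental update, in plain Python rather than the MLlib code. The function names are hypothetical and points are simple tuples; the only idea shown is that adding a center needs one distance per point, not one per point per existing center.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_closest(points, closest_sq_dist, new_center):
    """Given each point's squared distance to its closest existing center,
    return the updated distances after adding new_center."""
    return [
        min(d, squared_distance(p, new_center))
        for p, d in zip(points, closest_sq_dist)
    ]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
first_center = (0.0, 0.0)
closest = [squared_distance(p, first_center) for p in points]
# Adding a second center touches each point once.
closest = update_closest(points, closest, (5.0, 4.0))
```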

## Test result

One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706)

I tested the KMeans++ method on a small dataset with 16k points, and the whole KMeans|| on a large one with 240k points.
The data has 4096 features and I varied K from 100 to 500.
The test environment was my 4-machine cluster. I also tested a 3M-point dataset on a larger cluster with 25 machines and got similar results, for which I do not plot the detailed curves. The results of the first two experiments are shown below.

### Local KMeans++ test:

Dataset:4m_ini_center
Data_size:16234
Dimension:4096

Lloyd's Iteration = 10
The y-axis is time in seconds; the x-axis is the value of K.

![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)

![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)

### On a larger dataset

An improvement shown in the graphs but not committed in this PR: in this experiment I also improved the calculation for normalized data (where the distance is converted to the cosine distance). If the data is normalized into (0,1), one optimization in the original version of `util.MLUtils.fastSquaredDistance` has no effect (the precisionBound `2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)` will never be less than `precision` in this case). I therefore designed an early-termination method for comparing two distances (used in `findClosest`). Since that improvement is not included in this PR, refer only to the curves without "normalize" when comparing results.
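
The effect of normalization on the precision bound quoted above can be checked numerically. The sketch below is plain Python, not the MLlib source: it uses only the formula from the text, and it assumes that `EPSILON` is the double-precision machine epsilon, that the threshold `precision` is `1e-6`, and that "normalized" means both vectors have unit norm.

```python
import sys

EPSILON = sys.float_info.epsilon  # assumption: machine epsilon for doubles
PRECISION = 1e-6                  # assumption: threshold the bound is compared to


def precision_bound(norm1: float, norm2: float) -> float:
    """precisionBound from the text, as a function of the two vector norms."""
    sum_squared_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    return 2.0 * EPSILON * sum_squared_norm / (norm_diff * norm_diff + EPSILON)


# Unit-norm vectors: normDiff is 0, so the bound does not depend on the data
# and stays far above the threshold; the fast path is never taken.
unit_bound = precision_bound(1.0, 1.0)

# Vectors with clearly different norms: the bound is tiny.
mixed_bound = precision_bound(1.0, 100.0)
```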

Dataset:4k24
Data_size:243960
Dimension:4096

| Normalize | Enlarge | Initialize | Lloyd's_Iteration |
| --- | --- | --- | --- |
| NO | 1 | 3 | 5 |
| YES | 10000 | 3 | 5 |

Note: the normalized data is enlarged to ensure precision.

Time cost: the x-axis is the value of K, the y-axis is time in seconds.
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)

SE for unnormalized data in the two versions, to confirm correctness:
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)

Here is the SE between normalized data just for reference, it's also correct.
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)

Author: DLucky <mouendless@gmail.com>

Closes #13133 from mouendless/patch-2.
2016-05-18 12:05:21 +01:00
WeichenXu 2f9047b5eb [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
## What changes were proposed in this pull request?

I used IntelliJ IDEA to search for usages of the deprecated `SparkContext.accumulator` in the whole Spark project and updated the code (except the test code for the accumulator method itself).

## How was this patch tested?

Existing unit tests

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
2016-05-18 11:48:46 +01:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
Dongjoon Hyun 9f176dd391 [MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.
## What changes were proposed in this pull request?

According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala`, and `DatasetHolder.scala`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
2016-05-17 20:50:22 +02:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was apparently the problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1. Rename matrix args in BreezeUtil to upper case to match the doc
2. Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
wm624@hotmail.com 354f8f11bd [SPARK-15096][ML] LogisticRegression MultiClassSummarizer numClasses can fail if no valid labels are found
## What changes were proposed in this pull request?

Throw a clearer exception when no valid labels are found, instead of the error raised by calling `max` on an empty collection.
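
The kind of guard described here can be sketched as below. This is a plain-Python illustration of the idea, not the code merged in the PR; the function name `num_classes`, the histogram layout (label index mapped to count), and the "largest label plus one" rule are assumptions made for the example.

```python
def num_classes(histogram: dict) -> int:
    """Number of classes implied by a label histogram (label index -> count).

    Raises a descriptive error for an empty histogram instead of the generic
    failure produced by taking the max of an empty collection.
    """
    if not histogram:
        raise ValueError("No valid labels were found; cannot determine numClasses")
    return max(histogram) + 1  # assumption: labels are 0-based indices


try:
    num_classes({})
    message = None
except ValueError as err:
    message = str(err)
```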

## How was this patch tested?

Add a new unit test, which calls histogram with empty numClasses.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12969 from wangmiao1981/logisticR.
2016-05-14 09:45:56 +01:00
hyukjinkwon 3ded5bc4db [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change default compression for ORC
## What changes were proposed in this pull request?

Currently, Parquet, JSON and CSV data sources have a class for their options (`ParquetOptions`, `JSONOptions` and `CSVOptions`).

It is convenient to manage a source's options by gathering them into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and

This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.

Also, this PR changes the default compression codec for ORC from `NONE` to `SNAPPY`.

## How was this patch tested?

Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13048 from HyukjinKwon/SPARK-15267.
2016-05-13 09:04:37 -07:00
wm624@hotmail.com bdff299f9e [SPARK-14900][ML] spark.ml classification metrics should include accuracy
## What changes were proposed in this pull request?

Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.
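
For reference, the metric itself is the share of predictions that equal their label. The sketch below is plain Python, not the `MulticlassMetrics` implementation, and the function name `accuracy` and its input layout are chosen for illustration only.

```python
def accuracy(predictions_and_labels) -> float:
    """Fraction of (prediction, label) pairs where the prediction is correct."""
    pairs = list(predictions_and_labels)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty input")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# Three of the four predictions match their labels.
score = accuracy([(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)])
```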

## How was this patch tested?

Scala Unit tests in ml.evaluation

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12882 from wangmiao1981/accuracy.
2016-05-13 08:29:37 +01:00
BenFradet 31f1aebbeb [SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label
## What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12467 from BenFradet/SPARK-13961.
2016-05-13 09:08:04 +02:00
sethah 5b849766ab [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.
## What changes were proposed in this pull request?

This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

## How was this patch tested?

Added a unit test to `pyspark.ml.tests`

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12961 from sethah/GLR_summary.
2016-05-13 09:01:20 +02:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and adds dataset.createTempView and dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Liang-Chi Hsieh a5f9fdbba3 [SPARK-15268][SQL] Make JavaTypeInference work with UDTRegistration
## What changes were proposed in this pull request?

We have a private `UDTRegistration` API to register user defined type. Currently `JavaTypeInference` can't work with it. So `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class.

## How was this patch tested?
`VectorUDTSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13046 from viirya/fix-udt-registry-javatypeinference.
2016-05-11 09:31:22 -07:00
Sandeep Singh ed0b4070fb [SPARK-15037][SQL][MLLIB] Use SparkSession instead of SQLContext in Scala/Java TestSuites
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites.
As this PR is already very big, the Python TestSuites are handled in a separate PR.

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12907 from techaddict/SPARK-15037.
2016-05-10 11:17:47 -07:00
dding3 a78fbfa619 [SPARK-15172][ML] Explicitly tell user initial coefficients is ignored when size mismatch happened in LogisticRegression
## What changes were proposed in this pull request?
Explicitly tell the user that the initial coefficients are ignored if their size doesn't match the expected size in LogisticRegression

## How was this patch tested?
local build

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12948 from dding3/master.
2016-05-09 09:43:07 +01:00
Yuhao Yang 68abc1b4e9 [SPARK-14814][MLLIB] API: Java compatibility, docs
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14814
fix a java compatibility function in mllib DecisionTreeModel. As synced in jira, other compatibility issues don't need fixes.

## How was this patch tested?

existing ut

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12971 from hhbyyh/javacompatibility.
2016-05-09 09:08:54 +01:00
Liang-Chi Hsieh 635ef407e1 [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure
## What changes were proposed in this pull request?

We need to use `requiredSchema` in `LibSVMRelation` to project only the required columns when loading data from this data source. Otherwise, selecting only the `features` column causes a failure.

## How was this patch tested?
`LibSVMRelationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12986 from viirya/fix-libsvmrelation.
2016-05-09 15:05:06 +08:00
Burak Köse e20cd9f4ce [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>

Closes #12843 from mengxr/SPARK-14050.
2016-05-06 13:58:12 -07:00
Andrew Or 7f5922aa4a [HOTFIX] Fix MLUtils compile 2016-05-05 16:51:06 -07:00
Jacek Laskowski bbb7773437 [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements
## What changes were proposed in this pull request?

Minor doc and code style fixes

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12928 from jaceklaskowski/SPARK-15152.
2016-05-05 16:34:27 -07:00
Holden Karau 4c0d827cfc [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA"
## What changes were proposed in this pull request?

Copy the package documentation from Scala/Java to Python for ML package and remove beta tags. Not super sure if we want to keep the BETA tag but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).

## How was this patch tested?

Python documentation built locally as HTML and text and verified output.

Author: Holden Karau <holden@us.ibm.com>

Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
2016-05-05 10:52:25 +01:00
Dominik Jastrzębski abecbcd5e9 [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansM…
## What changes were proposed in this pull request?

Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in ML library.

## How was this patch tested?

By running KMeansSuite.

Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>

Closes #12609 from dominik-jastrzebski/master.
2016-05-04 14:25:51 +02:00
Cheng Lian bc3760d405 [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
## What changes were proposed in this pull request?

Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.

This PR brings two benefits:

1. Apparently, it de-duplicates partition value appending logic

2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.

   This is because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.
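
The default-implementation pattern described above can be sketched as a wrapper around a reader function. This is a plain-Python analogy, not the `FileFormat` API: the dictionary-based file representation and the snake_case function names are hypothetical, and the safe-to-unsafe row conversion is not modeled.

```python
def build_reader_with_partition_values(build_reader):
    """Wrap a reader function so that every row it produces gets the file's
    partition values appended. A source that already appends them itself
    would override this and delegate straight to build_reader."""
    def read(file):
        for row in build_reader(file):
            yield tuple(row) + tuple(file["partition_values"])
    return read


def plain_reader(file):
    """A reader that only knows about the data columns stored in the file."""
    return iter(file["rows"])


file = {"rows": [(1, "a"), (2, "b")], "partition_values": ("2016-05-04",)}
rows = list(build_reader_with_partition_values(plain_reader)(file))
```

With this shape, the duplicated appending logic lives in one place and each data source only has to supply `plain_reader`.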

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
2016-05-04 14:16:57 +08:00
yinxusen 2e2a6211c4 [SPARK-14973][ML] The CrossValidator and TrainValidationSplit miss the seed when saving and loading
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14973

Add seed support when saving/loading of CrossValidator and TrainValidationSplit.

## How was this patch tested?

Spark unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #12825 from yinxusen/SPARK-14973.
2016-05-03 14:19:13 -07:00
Holden Karau f10ae4b1e1 [SPARK-6717][ML] Clear shuffle files after checkpointing in ALS
## What changes were proposed in this pull request?

When ALS is run with a checkpoint interval, materialize the current state during the checkpoint and clean up the previous shuffles (non-blocking).

## How was this patch tested?

Existing ALS unit tests, new ALS checkpoint cleanup unit tests added & shuffle files checked after ALS w/checkpointing run.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #11919 from holdenk/SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS.
2016-05-03 00:18:10 -07:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
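
Treating an unset and an empty `weightCol` the same way can be sketched as below. This is plain Python, not the Param code in Spark or PySpark; the helper names are hypothetical, and the fallback of a unit weight per row is an assumption of the example.

```python
def is_weight_col_defined(weight_col) -> bool:
    """True only if a non-empty column name was supplied; None (unset) and
    the empty string are deliberately handled identically."""
    return weight_col is not None and weight_col != ""


def instance_weights(rows, weight_col=None):
    """Per-row weights: read from weight_col when defined, otherwise 1.0."""
    if not is_weight_col_defined(weight_col):
        return [1.0] * len(rows)
    return [float(row[weight_col]) for row in rows]


rows = [{"label": 1.0, "w": 2.0}, {"label": 0.0, "w": 0.5}]
unset = instance_weights(rows)            # weightCol not set
empty = instance_weights(rows, "")        # weightCol set to empty string
weighted = instance_weights(rows, "w")
```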

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Yanbo Liang 19a6d192d5 [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12813 from yanboliang/spark-15030.
2016-04-30 08:37:56 -07:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods that were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Xiangrui Meng 0847fe4eb3 [SPARK-14653][ML] Remove json4s from mllib-local
## What changes were proposed in this pull request?

This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.

## How was this patch tested?

Copied existing unit tests over.

cc: dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12802 from mengxr/SPARK-14653.
2016-04-30 06:30:39 -07:00
Junyang 1192fe4cd2 [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel
## What changes were proposed in this pull request?

This PR fixes the bug that generates infinite distances between word vectors. For example,

Before this PR, we have
```
val synonyms = model.findSynonyms("who", 40)
```
will give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation
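
A minimal sketch of the first point, in plain Python rather than the MLlib code: once every word vector is scaled to unit length, the inner product of two vectors is their cosine similarity, so it stays within [-1, 1] regardless of the original magnitudes. The function names and the toy vectors are illustrative assumptions, not part of this patch.

```python
import math


def normalize(vec):
    """Return vec scaled to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def find_synonyms(word, vectors, num):
    """Rank the other words by cosine similarity to `word`.

    `vectors` maps each word to a raw (unnormalized) vector. Each vector is
    normalized once up front, so each similarity is a single dot product.
    """
    unit = {w: normalize(v) for w, v in vectors.items()}
    query = unit[word]
    scored = [
        (w, sum(a * b for a, b in zip(query, v)))
        for w, v in unit.items()
        if w != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Hypothetical toy vectors with very large magnitudes.
toy_vectors = {
    "king": [3.0e20, 1.0e20],
    "queen": [2.9e20, 1.2e20],
    "apple": [-1.0e20, 4.0e20],
}
synonyms = find_synonyms("king", toy_vectors, 2)
```

Normalizing once per vector, instead of dividing by both norms on every comparison, is what makes the distance calculation simpler.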

## How was this patch tested?

Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
2016-04-30 10:16:35 +01:00
Xiangrui Meng 7fbe1bb24d [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS
## What changes were proposed in this pull request?

As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
2016-04-30 00:41:28 -07:00
Sean Owen 5886b6217b [SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix)
## What changes were proposed in this pull request?

Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.
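
To illustrate why the order of operations matters when values are very large, here is a plain-Python sketch (not the `RowMatrix` code). It contrasts a one-pass formula that subtracts two huge sums with a two-pass version that centers the data on the column means first. Both function names and the sample data are assumptions made for illustration.

```python
def covariance_naive(xs, ys):
    """Sample covariance via sum(x*y) - n*mean_x*mean_y; can lose precision when values are large."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return (sum_xy - n * mean_x * mean_y) / (n - 1)


def covariance_two_pass(xs, ys):
    """Sample covariance computed from deviations around the column means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Values near 1e9: the products are near 1e18, beyond exact double precision.
xs = [1e9 + 1, 1e9 + 2, 1e9 + 3]
ys = [1e9 + 2, 1e9 + 4, 1e9 + 6]
stable = covariance_two_pass(xs, ys)
naive = covariance_naive(xs, ys)
```

The two-pass version costs an extra pass over the data, which is the kind of accuracy versus speed trade-off this partial fix is concerned with.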

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #12779 from srowen/SPARK-14533.2.
2016-04-30 00:15:41 -07:00
Xiangrui Meng 3d09ceeef9 [SPARK-14850][.2][ML] use UnsafeArrayData.fromPrimitiveArray in ml.VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing.

## How was this patch tested?

Existing unit tests.

cc: cloud-fan

Author: Xiangrui Meng <meng@databricks.com>

Closes #12805 from mengxr/SPARK-14850.
2016-04-29 23:51:01 -07:00
Wenchen Fan 43b149fb88 [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

## How was this patch tested?

existing tests and new test suite `UnsafeArraySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12640 from cloud-fan/ml.
2016-04-29 23:04:51 -07:00
Nick Pentreath 90fa2c6e7f [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.

## How was this patch tested?

New test cases in `ALSSuite` and `tests.py`.

cc yanboliang jkbradley sethah rishabhbhardwaj

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12660 from MLnick/SPARK-14412-als-storage-params.
2016-04-29 22:01:41 -07:00
Joseph K. Bradley 1eda2f10d9 [SPARK-14646][ML] Modified Kmeans to store cluster centers with one per row
## What changes were proposed in this pull request?

Modified Kmeans to store cluster centers with one per row

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12792 from jkbradley/kmeans-save-fix.
2016-04-29 16:46:25 -07:00
BenFradet d78fbcc3cc [SPARK-14570][ML] Log instrumentation in Random forests
## What changes were proposed in this pull request?

Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}

## How was this patch tested?

No tests involved since it's logging related.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12536 from BenFradet/SPARK-14570.
2016-04-29 15:42:47 -07:00
Jeff Zhang 775772de36 [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
2016-04-29 10:42:52 -07:00
Joseph K. Bradley f08dcdb8d3 [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary
## What changes were proposed in this pull request?

Deprecated model field in LinearRegressionSummary

Removed unnecessary Since annotations

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12763 from jkbradley/lr-summary-api.
2016-04-29 10:40:00 -07:00
Yanbo Liang 87ac84d437 [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans)
SparkR ```glm``` and ```kmeans``` model persistence.

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>

Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
2016-04-29 09:43:04 -07:00
wm624@hotmail.com b6fa7e5934 [SPARK-14571][ML] Log instrumentation in ALS
## What changes were proposed in this pull request?

Add log instrumentation for parameters:
rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
userCol, itemCol, ratingCol, predictionCol, maxIter,
regParam, nonnegative, checkpointInterval, seed

Add log instrumentation for numUserFeatures and numItemFeatures

## How was this patch tested?

Manual test: set a breakpoint in IntelliJ and run def testALS(). Single-step through the debugger and check that the log method is called.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12560 from wangmiao1981/log.
2016-04-29 16:18:25 +02:00
dding3 6d5aeaae26 [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient
## What changes were proposed in this pull request?

This PR removes duplicate implementation of compute in LogisticGradient class

## How was this patch tested?

unit tests

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12747 from dding3/master.
2016-04-29 10:19:51 +01:00
Sean Owen d1cf320105 [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
## What changes were proposed in this pull request?

Handle case where number of predictions is less than label set, k in nDCG computation
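
The boundary handling can be sketched in plain Python as follows. This is an illustration of nDCG at k with binary relevance, not the MLlib source; the function name and signature are assumptions. The point is that every index is checked against the actual length of the prediction list, so a short list cannot trigger an out-of-range access.

```python
import math


def ndcg_at(predicted, labels, k):
    """nDCG at k with binary relevance.

    `predicted` is a ranked list of item ids, `labels` the relevant ids.
    The loop bound covers whichever of the two is longer (capped at k), and
    each index is checked against the real list lengths before use.
    """
    if k <= 0:
        raise ValueError("ranking position k should be positive")
    label_set = set(labels)
    if not label_set:
        return 0.0
    n = min(max(len(predicted), len(label_set)), k)
    dcg = 0.0
    max_dcg = 0.0
    for i in range(n):
        gain = 1.0 / math.log(i + 2)
        if i < len(predicted) and predicted[i] in label_set:
            dcg += gain
        if i < len(label_set):
            max_dcg += gain
    return dcg / max_dcg
```

For example, `ndcg_at([1, 2], [1, 2, 3], 5)` has fewer predictions than both the label set and k, and returns a value strictly between 0 and 1.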

## How was this patch tested?

New unit test; existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12756 from srowen/SPARK-14886.
2016-04-29 09:21:27 +02:00
Zheng RuiFeng cabd54d931 [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
## What changes were proposed in this pull request?
Per [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate the LogisticRegression and LinearRegression APIs that use SGD.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12596 from zhengruifeng/deprecate_sgd.
2016-04-28 22:44:14 -07:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
Joseph K. Bradley 4f4721a21c [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata
## What changes were proposed in this pull request?

Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
* They first check for metadata.
* If numClasses is not specified in metadata, they identify the largest label value (up to a limit).

This functionality is implemented in a new Classifier.getNumClasses method.

Also
* Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.
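
The fallback described above can be sketched in plain Python. This is not the `Classifier.getNumClasses` implementation: the function name, the default limit of 100, and the error messages are all assumptions chosen for illustration. It prefers metadata when present and otherwise infers the class count from the largest label, refusing to go above a limit.

```python
def get_num_classes(labels, metadata_num_classes=None, max_num_classes=100):
    """Return the number of classes, preferring metadata when it is present.

    Without metadata, labels are assumed to be 0, 1, ..., numClasses - 1, so
    the count is the largest label plus one, rejected above max_num_classes.
    """
    if metadata_num_classes is not None:
        return metadata_num_classes
    if not labels:
        raise ValueError("cannot infer the number of classes from an empty dataset")
    for label in labels:
        if label < 0 or label != int(label):
            raise ValueError(f"labels must be non-negative integers, got {label}")
    num_classes = int(max(labels)) + 1
    if num_classes > max_num_classes:
        raise ValueError(
            f"inferred {num_classes} classes, above the limit of {max_num_classes}; "
            "specify the number of classes via metadata instead"
        )
    return num_classes
```

The limit guards against treating a continuous or mislabeled column as a classification target with thousands of classes.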

## How was this patch tested?

* Unit tests in ClassifierSuite for helper methods
* Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12663 from jkbradley/trees-no-metadata.
2016-04-28 16:20:00 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Yuhao Yang d5ab42ceb9 [SPARK-14916][MLLIB] A more friendly tostring for FreqItemset in mllib.fpm
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14916
FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers.
sample:
{a, b}: 5
{x, y, z}: 4
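
A small Python sketch that produces the sample format above. The helper name is hypothetical and this is not the Scala `toString` itself.

```python
def format_freq_itemset(items, freq):
    """Render an itemset and its frequency as '{a, b}: 5'."""
    return "{" + ", ".join(str(item) for item in items) + "}: " + str(freq)
```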

## How was this patch tested?

existing unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12698 from hhbyyh/freqtos.
2016-04-28 19:52:09 +01:00
Joseph K. Bradley 5ee72454df [SPARK-14852][ML] refactored GLM summary into training, non-training summaries
## What changes were proposed in this pull request?

This splits GeneralizedLinearRegressionSummary into 2 summary types:
* GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
* GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting

This also adds a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset.

The summary no longer provides the model itself as a public val.

Also:
* Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
* Adds hasSummary method.
* Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
* In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
* Pardon the style fixes; that is IntelliJ being obnoxious.

## How was this patch tested?

Existing unit tests + updated test for evaluate and hasSummary

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12624 from jkbradley/model-summary-api.
2016-04-28 11:22:13 -07:00
Liang-Chi Hsieh 7c6937a885 [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
## What changes were proposed in this pull request?

Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.

For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.

We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.

## How was this patch tested?

`UserDefinedTypeSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12259 from viirya/improve-sql-usertype.
2016-04-28 01:14:49 -07:00
Joseph K. Bradley f5ebb18c45 [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage
## What changes were proposed in this pull request?

Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6.  This tends to occur when using a mix of transformers from ml.feature. It is because Scala Arrays are invariant (non-covariant) and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage].  This PR modifies the following to accept subclasses of PipelineStage:
* Pipeline.setStages()
* Params.w()

## How was this patch tested?

Unit test which fails to compile before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12430 from jkbradley/pipeline-setstages.
2016-04-27 16:11:12 -07:00
Yanbo Liang 4672e9838b [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12702 from yanboliang/spark-14899.
2016-04-27 14:08:26 -07:00
Mike Dusenberry 607f50341c [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 19:48:05 +02:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joseph K. Bradley 6c5a837c50 [SPARK-12301][ML] Made all tree and ensemble classes not final
## What changes were proposed in this pull request?

There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms.

This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final.  This matches most other spark.ml algorithms.

Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12711 from jkbradley/final-trees.
2016-04-26 14:44:39 -07:00
Dongjoon Hyun e4f3eec5b7 [SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.save
## What changes were proposed in this pull request?

This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write.
```
- val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
- // TODO: repartition with 1 partition after SPARK-5532 gets fixed
- dataRDD.write.parquet(Loader.dataPath(path))
+ sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12676 from dongjoon-hyun/SPARK-14907.
2016-04-26 13:58:29 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it have no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR changes the usages of `runs` that appear in the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved in #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Andrew Or 2a3d39f48b [MINOR] Follow-up to #12625
## What changes were proposed in this pull request?

That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.

Author: Andrew Or <andrew@databricks.com>

Closes #12686 from andrewor14/visibility.
2016-04-26 11:08:08 -07:00
Reynold Xin 5cb03220a0 [SPARK-14912][SQL] Propagate data source options to Hadoop configuration
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch also propagates the user-specified options into the Hadoop Configuration.

## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.

Author: Reynold Xin <rxin@databricks.com>

Closes #12688 from rxin/SPARK-14912.
2016-04-26 10:58:56 -07:00
Yanbo Liang 92f66331b4 [SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12685 from yanboliang/spark-14313.
2016-04-26 10:30:24 -07:00
BenFradet 2a5c930790 [SPARK-13962][ML] spark.ml Evaluators should support other numeric types for label
## What changes were proposed in this pull request?

Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12500 from BenFradet/SPARK-13962.
2016-04-26 08:55:50 +02:00
Andrew Or 18c2c92580 [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession
## What changes were proposed in this pull request?

In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.

In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.

**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.

## How was this patch tested?

No change in functionality intended.

Author: Andrew Or <andrew@databricks.com>

Closes #12625 from andrewor14/spark-session-refactor.
2016-04-25 20:54:31 -07:00
Yanbo Liang 9cb3ba1013 [SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```

## How was this patch tested?
Add unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12573 from yanboliang/spark-14312.
2016-04-25 14:08:41 -07:00
Yanbo Liang 425f691646 [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3
## What changes were proposed in this pull request?
As discussed in [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm```, so that users can choose the hash method.

Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.
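
For readers unfamiliar with the hashing trick, here is a plain-Python sketch (not the Spark implementation). Each term is hashed to a column index modulo the number of features, and counts are accumulated per index. As a loudly stated assumption, `zlib.crc32` stands in for MurmurHash3 only because it ships with the standard library; the function name and the default of 2^18 features are also illustrative.

```python
import zlib


def hashing_tf(terms, num_features=1 << 18):
    """Map terms to a sparse {index: count} term-frequency vector.

    zlib.crc32 is a stand-in for MurmurHash3. It is deterministic across
    runs and platforms, which is the property the hashing trick needs.
    """
    counts = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts
```

A fixed, explicitly specified hash function matters because a language's built-in hash is not portable. Python's `hash()` on strings, for instance, is randomized per process by default, so the same term could land in different columns across runs.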

## How was this patch tested?
unit tests.

cc jkbradley MLnick

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12498 from yanboliang/spark-10574.
2016-04-25 12:08:43 -07:00
wm624@hotmail.com b50e2eca93 [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture
## What changes were proposed in this pull request?

Add Python API in ML for GaussianMixture

## How was this patch tested?

Added doctests; the test cases are the same as the mllib Python tests.
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12402 from wangmiao1981/gmm.
2016-04-25 10:48:15 -07:00
Zheng RuiFeng e6f954a579 [SPARK-14758][ML] Add checking for StepSize and Tol
## What changes were proposed in this pull request?
add the checking for StepSize and Tol in sharedParams

## How was this patch tested?
Unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12530 from zhengruifeng/ml_args_checking.
2016-04-25 10:30:55 +02:00
Dongjoon Hyun d34d650378 [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
## What changes were proposed in this pull request?

Spark uses the `NewLineAtEofChecker` rule in Scala via ScalaStyle, and most Java code also complies with the rule. This PR aims to enforce the same rule, `NewlineAtEndOfFile`, explicitly via CheckStyle. It also fixes the lint-java errors introduced since SPARK-14465. The items are as follows.

- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?

After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins does not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12632 from dongjoon-hyun/SPARK-14868.
2016-04-24 20:40:03 -07:00
Zheng RuiFeng 86ca8fefc8 [MINOR][ML][MLLIB] Remove unused imports
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12497 from zhengruifeng/del_unused_imports.
2016-04-22 23:20:10 -07:00
Liang-Chi Hsieh 8098f15857 [SPARK-14843][ML] Fix encoding error in LibSVMRelation
## What changes were proposed in this pull request?

We use `RowEncoder` in the libsvm data source to serialize the label and features read from libsvm files. However, the schema passed to this encoder is not correct. As a result, we can't correctly select the `features` column from the DataFrame. We should use the full data schema instead of `requiredSchema` to serialize the data read in, and then do a projection to select the required columns later.

## How was this patch tested?
`LibSVMRelationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12611 from viirya/fix-libsvm.
2016-04-23 01:11:36 +08:00
Zheng RuiFeng 92675471b7 [MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifier
## What changes were proposed in this pull request?
1, fix the indentation
2, add a missing param desc

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12499 from zhengruifeng/fix_doc.
2016-04-22 14:52:37 +01:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.
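
The rule being enforced is that equality and hashing must be defined from the same fields, so that equal objects always hash alike. The sketch below shows the Python analogue (`__eq__` with `__hash__`) rather than the Scala changes in this PR; the `Point` class is hypothetical.

```python
class Point:
    """Value class whose equality and hash are defined from the same fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash((self.x, self.y))
```

If only one of the two were overridden, two equal instances could end up as separate entries in a hash-based set or map.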

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
Yanbo Liang 4e726227a3 [SPARK-14479][ML] GLM supports output link prediction
## What changes were proposed in this pull request?
GLM supports output link prediction.
## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12287 from yanboliang/spark-14479.
2016-04-21 17:31:33 -07:00
Joseph K. Bradley f25a3ea8d3 [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib Vector, Matrix types
## What changes were proposed in this pull request?

For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another.
This PR adds asML, fromML methods for all spark.mllib Vector and Matrix types.

## How was this patch tested?

Unit tests for all conversions

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12504 from jkbradley/linalg-conversions.
2016-04-21 16:50:09 -07:00
Xin Ren 6d1e4c4a65 [SPARK-14569][ML] Log instrumentation in KMeans
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14569

Log instrumentation in KMeans:

- featuresCol
- predictionCol
- k
- initMode
- initSteps
- maxIter
- seed
- tol
- summary

## How was this patch tested?

Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample

Author: Xin Ren <iamshrek@126.com>

Closes #12432 from keypointt/SPARK-14569.
2016-04-21 16:29:39 -07:00
Joseph K. Bradley acc7e592c4 [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
## What changes were proposed in this pull request?

Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale function does.

This PR documents this fact.
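
The difference between the corrected (sample) and population standard deviation can be sketched in plain Python. This is not the StandardScaler code; the function names and the sample column are assumptions for illustration.

```python
import math


def corrected_std(values):
    """Sample standard deviation: sqrt of the unbiased variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def population_std(values):
    """Population standard deviation (divides by n), shown only for contrast."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def scale_to_unit_std(values):
    """Divide each value by the corrected standard deviation of the column."""
    std = corrected_std(values)
    return [v / std for v in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = scale_to_unit_std(column)
```

For this column the squared deviations sum to 32, so the population value is sqrt(32/8) = 2 while the corrected value is sqrt(32/7), which is slightly larger. The gap shrinks as the number of rows grows.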

## How was this patch tested?

doc only

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12519 from jkbradley/scaler-variance-doc.
2016-04-20 11:48:30 -07:00