Commit graph

1683 commits

Author SHA1 Message Date
Xusen Yin 255d74fe4a [SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should aware of empty partition
## What changes were proposed in this pull request?

tallSkinnyQR of RowMatrix should aware of empty partition, which could cause exception from Breeze qr decomposition.

See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.

## How was this patch tested?

Scala unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14049 from yinxusen/SPARK-16369.
2016-07-08 14:23:57 +01:00
Xusen Yin 4c6f00d09c [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix
## What changes were proposed in this pull request?

The following Java code because of type erasing:

```Java
JavaRDD<Vector> rows = jsc.parallelize(...);
RowMatrix mat = new RowMatrix(rows.rdd());
QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
```

We should use retag to restore the type to prevent the following exception:

```Java
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
```

## How was this patch tested?

Java unit test

Author: Xusen Yin <yinxusen@gmail.com>

Closes #14051 from yinxusen/SPARK-16372.
2016-07-07 11:28:04 +01:00
tmnd1991 040f6f9f46 [SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in maven jenkins builds
## What changes were proposed in this pull request?
"test big model load / save" in Word2VecSuite, lately resulted into OOM.
Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test.

## How was this patch tested?
It was tested running the following unit test:
org.apache.spark.mllib.feature.Word2VecSuite

Author: tmnd1991 <antonio.murgia2@studio.unibo.it>

Closes #13509 from tmnd1991/SPARK-15740.
2016-07-06 12:56:26 -07:00
MechCoder 909c6d812f [SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data
## What changes were proposed in this pull request?

The current tests assumes that `impurity.calculate()` returns the variance correctly. It should be better to make the tests independent of this assumption. In other words verify that the variance computed equals the variance computed manually on a small tree.

## How was this patch tested?

The patch is a test....

Author: MechCoder <mks542@nyu.edu>

Closes #13981 from MechCoder/dt_variance.
2016-07-06 02:54:44 -07:00
Yuhao Yang 5497242c76 [SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public for loading
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16249
Change visibility of Object ml.clustering.LDA to public for loading, thus users can invoke LDA.load("path").

## How was this patch tested?

existing ut and manually test for load ( saved with current code)

Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13941 from hhbyyh/ldapublic.
2016-07-06 01:30:47 -07:00
Yuhao Yang aa6564f37f [SPARK-14608][ML] transformSchema needs better documentation
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14608
PipelineStage.transformSchema currently has minimal documentation. It should have more to explain it can:
check schema
check parameter interactions

## How was this patch tested?
unit test

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #12384 from hhbyyh/transformSchemaDoc.
2016-06-30 19:34:51 -07:00
zlpmichelle b30a2dc7c5 [SPARK-16241][ML] model loading backward compatibility for ml NaiveBayes
## What changes were proposed in this pull request?

model loading backward compatibility for ml NaiveBayes

## How was this patch tested?

existing ut and manual test for loading models saved by Spark 1.6.

Author: zlpmichelle <zlpmichelle@gmail.com>

Closes #13940 from zlpmichelle/naivebayes.
2016-06-30 00:50:14 -07:00
Mahmoud Rawas 393db655c3 [SPARK-15858][ML] Fix calculating error by tree stack over flow prob…
## What changes were proposed in this pull request?

What changes were proposed in this pull request?

Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees

## How was this patch tested?

the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail.

**PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0

Author: Mahmoud Rawas <mhmoudr@gmail.com>
Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au>

Closes #13624 from mhmoudr/SPARK-15858.master.
2016-06-29 13:12:17 +01:00
Yanbo Liang 0df5ce1bc1 [SPARK-16245][ML] model loading backward compatibility for ml.feature.PCA
## What changes were proposed in this pull request?
model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
existing ut and manual test for loading models saved by Spark 1.6.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13937 from yanboliang/spark-16245.
2016-06-28 19:53:07 -07:00
Yanbo Liang e158478a9f [SPARK-16242][MLLIB][PYSPARK] Conversion between old/new matrix columns in a DataFrame (Python)
## What changes were proposed in this pull request?
This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame.

## How was this patch tested?
Doctest in python.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13935 from yanboliang/spark-16242.
2016-06-28 06:28:22 -07:00
Yuhao Yang c17b1abff8 [SPARK-16187][ML] Implement util method for ML Matrix conversion in scala/java
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16187
This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually.

## How was this patch tested?

java and scala ut

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13888 from hhbyyh/matComp.
2016-06-27 12:27:39 -07:00
José Antonio a3c7b4187b [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException. I have found the bug and tested the solution.
## What changes were proposed in this pull request?

Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66.

## How was this patch tested?

Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results.

line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1
crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length.

To fix this just make trueWeights be the same length as x.

I have recompiled the project with the change and it is working now:
[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test

And it generates the data successfully now in the specified folder.

Author: José Antonio <joseanmunoz@gmail.com>

Closes #13895 from j4munoz/patch-2.
2016-06-25 09:11:25 +01:00
Yuhao Yang cc6778ee0b [SPARK-16133][ML] model loading backward compatibility for ml.feature
## What changes were proposed in this pull request?

model loading backward compatibility for ml.feature,

## How was this patch tested?

existing ut and manual test for loading 1.6 models.

Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13844 from hhbyyh/featureComp.
2016-06-23 21:50:25 -07:00
Yuhao Yang 14bc5a7f36 [SPARK-16177][ML] model loading backward compatibility for ml.regression
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16177
model loading backward compatibility for ml.regression

## How was this patch tested?

existing ut and manual test for loading 1.6 models.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13879 from hhbyyh/regreComp.
2016-06-23 20:43:19 -07:00
Yuhao Yang 60398dabc5 [SPARK-16130][ML] model loading backward compatibility for ml.classfication.LogisticRegression
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16130
model loading backward compatibility for ml.classfication.LogisticRegression

## How was this patch tested?
existing ut and manual test for loading old models.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #13841 from hhbyyh/lrcomp.
2016-06-23 11:00:00 -07:00
Xiangrui Meng 65d1f0f716 [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs
## What changes were proposed in this pull request?

Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.

## How was this patch tested?

Manually checked generated APIs.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13859 from mengxr/SPARK-16154.
2016-06-23 08:26:17 -07:00
Xiangrui Meng 00cc5cca45 [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug
## What changes were proposed in this pull request?

We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):

~~~scala
  /** group setParam */
  Since("1.6.0")
  deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
  def setLabelCol(value: String): this.type = set(labelCol, value)
~~~

This unfortunately hit a genjavadoc bug and broken doc generation. This is the generated Java code:

~~~java
  /** group setParam */
  public  org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value)  { throw new RuntimeException(); }
   *
   * deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
  */
  public  org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value)  { throw new RuntimeException(); }
~~~

Switching to multiline is a workaround.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13855 from mengxr/SPARK-16153.
2016-06-22 15:50:21 -07:00
Xiangrui Meng 6a6010f001 [MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi
## What changes were proposed in this pull request?

`DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use it. So this PR changes the annotation to `DeveloperApi`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13828 from mengxr/default-readable-should-be-developer-api.
2016-06-22 10:06:43 -07:00
Nick Pentreath 18faa588ca [SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
2016-06-22 10:05:25 -07:00
Holden Karau d281b0bafe [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs
## What changes were proposed in this pull request?

Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.

## How was this patch tested?

Built docs locally & PySpark SQL tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
2016-06-22 11:54:49 +02:00
gatorsmile 0e3ce75332 [SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib
#### What changes were proposed in this pull request?
This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.

Also fix a test case issue in `BroadcastJoinSuite`.

BTW, `SQLContext` is not being used in the `MLlib` test suites.
#### How was this patch tested?
Existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13380 from gatorsmile/sqlContextML.
2016-06-21 23:12:08 -07:00
Xiangrui Meng d77c4e6e2e [MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel
## What changes were proposed in this pull request?

Deprecate `labelCol`, which is not used by ChiSqSelectorModel.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
2016-06-21 20:53:38 -07:00
Xiangrui Meng 9493b079a0 [SPARK-16118][MLLIB] add getDropLast to OneHotEncoder
## What changes were proposed in this pull request?

We forgot the getter of `dropLast` in `OneHotEncoder`

## How was this patch tested?

unit test

Author: Xiangrui Meng <meng@databricks.com>

Closes #13821 from mengxr/SPARK-16118.
2016-06-21 15:52:31 -07:00
Xiangrui Meng f4e8c31adf [SPARK-16117][MLLIB] hide LibSVMFileFormat and move its doc to LibSVMDataSource
## What changes were proposed in this pull request?

LibSVMFileFormat implements data source for LIBSVM format. However, users do not really need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal it to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124.

## How was this patch tested?

Manually checked the generated API doc and tested loading LIBSVM data.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13819 from mengxr/SPARK-16117.
2016-06-21 15:46:14 -07:00
Xiangrui Meng 918c91954f [MINOR][MLLIB] move setCheckpointInterval to non-expert setters
## What changes were proposed in this pull request?

The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13813 from mengxr/checkpoint-non-expert.
2016-06-21 13:35:06 -07:00
Xiangrui Meng 4f83ca1059 [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib
## What changes were proposed in this pull request?

This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.

Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13801 from mengxr/SPARK-15177.1.
2016-06-21 08:31:15 -07:00
Nick Pentreath 37494a18e8 [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature
This PR adds missing `Since` annotations to `ml.feature` package.

Closes #8505.

## How was this patch tested?

Existing tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13641 from MLnick/add-since-annotations.
2016-06-21 00:39:47 -07:00
Xiangrui Meng 18a8a9b1f4 [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API
## What changes were proposed in this pull request?

Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.

## How was this patch tested?

Unit tests in Scala and Java.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13789 from mengxr/SPARK-16074.
2016-06-20 21:51:02 -07:00
Xiangrui Meng edb23f9e47 [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)
## What changes were proposed in this pull request?

This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.

## How was this patch tested?

doctest in Python

cc: yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #13731 from mengxr/SPARK-15946.
2016-06-17 21:22:29 -07:00
sethah 1f0a46958e [SPARK-16008][ML] Remove unnecessary serialization in logistic regression
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?
`LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x (for multiclass logistic regression, this number will go up) larger than it should be (in MLlib, for instance, it is 3x smaller).

This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests of a 4 node cluster (4x48 cores, 4x128 GB RAM). Data set size of 2M rows and 10k features showed >2x iteration speedup.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13729 from sethah/lr_improvement.
2016-06-17 09:58:49 -07:00
Dongjoon Hyun 36110a8306 [SPARK-15922][MLLIB] toIndexedRowMatrix should consider the case cols < offset+colsPerBlock
## What changes were proposed in this pull request?

SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.

**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))):: IndexedRow(2L, new DenseVector(Array(1,2,3))):: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```

**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```

## How was this patch tested?

Pass the Jenkins tests (including the above case)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13643 from dongjoon-hyun/SPARK-15922.
2016-06-16 23:02:46 +02:00
Cheng Lian 9ea0d5e326 [SPARK-15983][SQL] Removes FileFormat.prepareRead
## What changes were proposed in this pull request?

Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source.

However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean.

## How was this patch tested?

Existing tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13698 from liancheng/remove-prepare-read.
2016-06-16 10:24:29 -07:00
Reynold Xin 865e7cc38d [SPARK-15979][SQL] Rename various Parquet support classes.
## What changes were proposed in this pull request?
This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons:

1. These are not optimizer related (i.e. Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes.

## How was this patch tested?
Renamed test cases as well.

Author: Reynold Xin <rxin@databricks.com>

Closes #13696 from rxin/parquet-rename.
2016-06-15 20:05:08 -07:00
Wojciech Jurczyk 6e0b3d795c [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification
The PR changes outdated scaladocs for Gini and Entropy classes. Since PR #886 Spark supports multiclass classification, but the docs tell only about binary classification.

Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com>

Closes #11252 from wjur/wjur/docs_multiclass.
2016-06-15 15:58:42 -07:00
Xiangrui Meng 63e0aebe22 [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java)
## What changes were proposed in this pull request?

This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens.

This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0.

## How was this patch tested?

Unit tests in Scala and Java.

cc: yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #13662 from mengxr/SPARK-15945.
2016-06-14 18:57:45 -07:00
Liang-Chi Hsieh baa3e633e1 [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
## What changes were proposed in this pull request?

Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13219 from viirya/pyspark-pickler-ml.
2016-06-13 19:59:53 -07:00
hyukjinkwon e3554605b3 [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count
## What changes were proposed in this pull request?

Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below:

```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```

Please see [AFTSurvivalRegression.scala#L573-L575](6ecedf39b4/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (L573-L575)) as well.

Just to be clear, the python example `aft_survival_regression.py` seems using 5 rows. So, if there exist partitions more than 5, it throws the exception above since it contains empty partitions which results in an incorrectly merged `AFTAggregator`.

Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with CPUs more than 5 is being failed because it creates tasks with some empty partitions with defualt  configurations (AFAIK, it sets the parallelism level to the number of CPU cores).

## How was this patch tested?

An unit test in `AFTSurvivalRegressionSuite.scala` and manually tested by `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #13619 from HyukjinKwon/SPARK-15892.
2016-06-12 14:26:53 -07:00
Davies Liu aec502d911 [SPARK-15654] [SQL] fix non-splitable files for text based file formats
## What changes were proposed in this pull request?

Currently, we always split the files when it's bigger than maxSplitBytes, but Hadoop LineRecordReader does not respect the splits for compressed files correctly, we should have a API for FileFormat to check whether the file could be splitted or not.

This PR is based on #13442, closes #13442

## How was this patch tested?

add regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13531 from davies/fix_split.
2016-06-10 14:32:43 -07:00
wangyang 026eb90644 [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0
## What changes were proposed in this pull request?

In scala, immutable.List.length is an expensive operation so we should
avoid using Seq.length == 0 or Seq.lenth > 0, and use Seq.isEmpty and Seq.nonEmpty instead.

## How was this patch tested?
existing tests

Author: wangyang <wangyang@haizhi.com>

Closes #13601 from yangw1234/isEmpty.
2016-06-10 13:10:03 -07:00
Bryan Cutler 7d7a0a5e07 [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula.  This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
2016-06-10 11:27:30 -07:00
yinxusen 87706eb66c [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15793

Word2vec in ML package should have maxSentenceLength method for feature parity.

## How was this patch tested?

Tested with Spark unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #13536 from yinxusen/SPARK-15793.
2016-06-08 09:18:04 +01:00
Yanbo Liang 6ecedf39b4 [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference
## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM.

When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg.

We should output a warning message and clarify in document for this condition.

## How was this patch tested?
Document change, no unit test.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12731 from yanboliang/spark-13590.
2016-06-07 15:25:36 -07:00
Joseph K. Bradley 4c74ee8d8e [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public
## What changes were proposed in this pull request?

Made DefaultParamsReadable, DefaultParamsWritable public.  Also added relevant doc and annotations.  Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable,Writable.

## How was this patch tested?

Wrote example making use of the now-public APIs.  Compiled and ran locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13461 from jkbradley/defaultparamswritable.
2016-06-06 09:49:45 -07:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Josh Rosen 26c1089c37 [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns.

This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13491 from JoshRosen/foldleft-to-flatmap.
2016-06-05 16:51:00 -07:00
Zheng RuiFeng 372fa61f51 [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi
## What changes were proposed in this pull request?
1, remove comments `:: Experimental ::` for non-experimental API
2, add comments `:: Experimental ::` for experimental API
3, add comments `:: DeveloperApi ::` for developerApi API

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13514 from zhengruifeng/del_experimental.
2016-06-05 11:55:25 -07:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Wenchen Fan 190ff274fd [SPARK-15494][SQL] encoder code cleanup
## What changes were proposed in this pull request?

Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions.

1. move validation logic to analyzer instead of encoder
2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework.
4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups)

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #13269 from cloud-fan/clean-encoder.
2016-06-03 00:43:02 -07:00
Xiangrui Meng e23370ec61 [SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite
## What changes were proposed in this pull request?

andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed.

This PR disables the test. I will leave the JIRA open for a proper fix

## How was this patch tested?

No new features.

Author: Xiangrui Meng <meng@databricks.com>

Closes #13478 from mengxr/SPARK-15740.
2016-06-02 17:41:31 -07:00
Yuhao Yang 5855e0057d [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type
## What changes were proposed in this pull request?

ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type

## How was this patch tested?
existing ut

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13411 from hhbyyh/schemaCheck.
2016-06-02 16:37:01 -07:00
Nick Pentreath ccd298eb67 [MINOR] clean up style for storage param setters in ALS
Clean up style for param setter methods in ALS to match standard style and the other setter in class (this is an artefact of one of my previous PRs that wasn't cleaned up).

## How was this patch tested?
Existing tests - no functionality change.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13480 from MLnick/als-param-minor-cleanup.
2016-06-02 16:33:16 -07:00
Yanbo Liang 07a98ca4ce [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.

## How was this patch tested?
Exist tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13410 from yanboliang/spark-15587.
2016-06-01 10:49:51 -07:00
Lianhui Wang 6563d72b16 [SPARK-15664][MLLIB] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
## What changes were proposed in this pull request?
if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when removing CheckpointFile in MLlib.
So we should always get the FileSystem from Path to avoid wrong FS problem.
## How was this patch tested?
N/A

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13408 from lianhuiwang/SPARK-15664.
2016-06-01 08:30:38 -05:00
Dongjoon Hyun 85d6b0db9f [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable.
## What changes were proposed in this pull request?

This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13365 from dongjoon-hyun/SPARK-15618.
2016-05-31 17:40:44 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Zheng RuiFeng 9893dc9757 [SPARK-15610][ML] update error message for k in pca
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...`  ->  `require(k < sources.first().size`

BTW, remove unused import in `ml.ElementwiseProduct`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13356 from zhengruifeng/fix_pca.
2016-05-27 21:57:41 -05:00
DB Tsai 21b2605dc4 [SPARK-15413][ML][MLLIB] Change toBreeze to asBreeze in Vector and Matrix
## What changes were proposed in this pull request?

We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application.

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13198 from dbtsai/minor.
2016-05-27 14:02:39 -07:00
Yanbo Liang a3550e3747 [SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS
## What changes were proposed in this pull request?
* Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```.
* Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```.

Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference.

## How was this patch tested?
Document update, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13262 from yanboliang/spark-15484.
2016-05-27 13:16:22 -07:00
Andrew Or b376a4eabc [HOTFIX] Scala 2.10 compile GaussianMixtureModel 2016-05-27 11:43:01 -07:00
Dongjoon Hyun 4538443e27 [SPARK-15584][SQL] Abstract duplicate code: spark.sql.sources. properties
## What changes were proposed in this pull request?

This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables.

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13349 from dongjoon-hyun/SPARK-15584.
2016-05-27 11:10:31 -07:00
Dongjoon Hyun d24e251572 [SPARK-15603][MLLIB] Replace SQLContext with SparkSession in ML/MLLib
## What changes were proposed in this pull request?

This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures.
- ReadWrite.scala
- TreeModels.scala

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13352 from dongjoon-hyun/SPARK-15603.
2016-05-27 11:09:15 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Yin Huai 3ac2363d75 [SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.

## How was this patch tested?
Existing tests.

Author: Yin Huai <yhuai@databricks.com>

Closes #13310 from yhuai/SPARK-15532.
2016-05-26 16:53:31 -07:00
Sean Owen b0a03feef2 [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations
## What changes were proposed in this pull request?

Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
  * Use in PythonMLlibAPI: Change to using private constructors
  * Streaming algs: No warnings after we un-deprecate the classes
  * Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field

## How was this patch tested?

Existing tests.  Checked for warnings manually.

Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #13314 from jkbradley/warning-cleanups.
2016-05-26 14:25:28 -07:00
Villu Ruusmann 6d506c9ae9 [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15
## What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-15523

This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.

## How was this patch tested?

1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.

Author: Villu Ruusmann <villu.ruusmann@gmail.com>

Closes #13297 from vruusmann/update-jpmml.
2016-05-26 08:11:34 -05:00
Reynold Xin 361ebc282b [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.

They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat

Backward compatibility is maintained through aliasing.

## How was this patch tested?
Updated relevant test cases too.

Author: Reynold Xin <rxin@databricks.com>

Closes #13311 from rxin/SPARK-15543.
2016-05-25 23:54:24 -07:00
Gio Borje 589cce93c8 Log warnings for numIterations * miniBatchFraction < 1.0
## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction <1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur each sampling approximately 6 examples each. In the best case, each of the 6 examples are unique; hence 18/100 examples are used.

This may be counter-intuitive to most users and led to the issue during the development of another Spark  ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the training data set, it would be easier and more intuitive to use `RDD.sample`.

## How was this patch tested?

`build/mvn -DskipTests clean package` build succeeds

Author: Gio Borje <gborje@linkedin.com>

Closes #13265 from Hydrotoast/master.
2016-05-25 16:52:31 -05:00
Nick Pentreath 1cb347fbc4 [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.

We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

Tests N/A.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
2016-05-25 20:41:53 +02:00
lfzCarlosC 02c8072eea [MINOR][MLLIB][STREAMING][SQL] Fix typos
fixed typos for source code for components [mllib] [streaming] and [SQL]

None and obvious.

Author: lfzCarlosC <lfz.carlos@gmail.com>

Closes #13298 from lfzCarlosC/master.
2016-05-25 10:53:57 -07:00
Nick Pentreath 6075f5b4d8 [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

## How was this patch tested?

A little doctest and built API docs locally to check HTML doc generation.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
2016-05-24 10:02:10 +02:00
Yanbo Liang c94b34ebbf [SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set rather than empty. This will consistent with other similar params such as ```weightCol```
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13129 from yanboliang/spark-15339.
2016-05-19 23:35:20 -07:00
Reynold Xin f2ee0ed4b7 [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified
## What changes were proposed in this pull request?
Currently SparkSession.Builder use SQLContext.getOrCreate. It should probably the the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.

This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.

## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.

Author: Reynold Xin <rxin@databricks.com>

Closes #13200 from rxin/SPARK-15075.
2016-05-19 21:53:26 -07:00
Sandeep Singh 01cf649c4f [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession
## What changes were proposed in this pull request?
Refactor All Java Tests that use SparkSession, to extend SharedSparkSesion

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13101 from techaddict/SPARK-15296.
2016-05-19 20:38:44 -07:00
Yanbo Liang 6643677817 [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.

## How was this patch tested?
Only API docs change, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13195 from yanboliang/evaluation-doc.
2016-05-19 17:56:21 -07:00
Yanbo Liang f8107c7846 [SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify "summary" was not saved
## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it.
We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW.

## How was this patch tested?
Documentation update, no unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13131 from yanboliang/spark-15341.
2016-05-19 17:54:18 -07:00
Sandeep Singh ef43a5fe51 [SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public
## What changes were proposed in this pull request?
Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg):
`Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML`

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13202 from techaddict/SPARK-15414.
2016-05-19 17:24:42 -07:00
Yanbo Liang 59e6c5560d [SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering
## What changes were proposed in this pull request?
Audit Scala API for ml.clustering.
Fix some wrong API documentations and update outdated one.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13148 from yanboliang/spark-15361.
2016-05-19 13:26:41 -07:00
DB Tsai 5255e55c84 [SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala
## What changes were proposed in this pull request?

Add since to ml.stat.MultivariateOnlineSummarizer.scala

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #13197 from dbtsai/cleanup.
2016-05-19 13:10:51 -07:00
Yanbo Liang 8ecf7f77b2 [SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification
## What changes were proposed in this pull request?
Audit Scala API for classification, almost all issues were related ```MultilayerPerceptronClassifier``` in this section.
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter function for ```solver``` and ```stepSize```.
* Make ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13076 from yanboliang/spark-15292.
2016-05-19 10:27:17 -07:00
Yanbo Liang 1052d3644d [SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible
## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.

## How was this patch tested?
Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13149 from yanboliang/spark-15362.
2016-05-19 10:25:33 -07:00
Bryan Cutler b1bc5ebdd5 [DOC][MINOR] ml.feature Scala and Python API sync
## What changes were proposed in this pull request?

I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.

## How was this patch tested?

Built docs locally, ran style checks

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13159 from BryanCutler/ml.feature-api-sync.
2016-05-19 04:48:36 +02:00
Wenchen Fan ebfe3a1f2c [SPARK-15192][SQL] null check for SparkSession.createDataFrame
## What changes were proposed in this pull request?

This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema.

## How was this patch tested?

new tests in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13008 from cloud-fan/row-encoder.
2016-05-18 18:06:38 -07:00
Nick Pentreath e8b79afa02 [SPARK-14891][ML] Add schema validation for ALS
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).

## How was this patch tested?
New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
2016-05-18 21:13:12 +02:00
DLucky 420b700695 [SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points
mateiz srowen

I state that the contribution is my original work and that I license the work to the project under the project's open source license

There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here.

The related JIRA issue though marked as resolved, this change may relate to it I think.

## Proposed Change

After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones.
Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them.

## Test result

One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706)

I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans|| with a large one with 240k points.
The data has 4096 features and I tunes K from 100 to 500.
The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below

### Local KMeans++ test:

Dataset:4m_ini_center
Data_size:16234
Dimension:4096

Lloyd's Iteration = 10
The y-axis is time in sec, the x-axis is tuning the K.

![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)

![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)

### On a larger dataset

An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results.

Dataset:4k24
Data_size:243960
Dimension:4096

Normlize 	Enlarge 	Initialize 	Lloyd's_Iteration
NO    	1 	         3 	          5
YES 	        10000 	 3 	          5

Notice: the normlized data is enlarged to ensure precision

The cost time: x-for value of K, y-for time in sec
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)

SE for unnormalized data between two version, to ensure the correctness
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)

Here is the SE between normalized data just for reference, it's also correct.
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)

Author: DLucky <mouendless@gmail.com>

Closes #13133 from mouendless/patch-2.
2016-05-18 12:05:21 +01:00
WeichenXu 2f9047b5eb [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
## What changes were proposed in this pull request?

I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)

## How was this patch tested?

Exisiting unit tests

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
2016-05-18 11:48:46 +01:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
Dongjoon Hyun 9f176dd391 [MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.
## What changes were proposed in this pull request?

According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
2016-05-17 20:50:22 +02:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
wm624@hotmail.com 354f8f11bd [SPARK-15096][ML] LogisticRegression MultiClassSummarizer numClasses can fail if no valid labels are found
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Throw better exception when numClasses is empty and empty.max is thrown.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add a new unit test, which calls histogram with empty numClasses.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12969 from wangmiao1981/logisticR.
2016-05-14 09:45:56 +01:00
hyukjinkwon 3ded5bc4db [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change default compression for ORC
## What changes were proposed in this pull request?

Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).

It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and

This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.

Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.

## How was this patch tested?

Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13048 from HyukjinKwon/SPARK-15267.
2016-05-13 09:04:37 -07:00
wm624@hotmail.com bdff299f9e [SPARK-14900][ML] spark.ml classification metrics should include accuracy
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Scala Unit tests in ml.evaluation

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12882 from wangmiao1981/accuracy.
2016-05-13 08:29:37 +01:00
BenFradet 31f1aebbeb [SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label
## What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12467 from BenFradet/SPARK-13961.
2016-05-13 09:08:04 +02:00
sethah 5b849766ab [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.
## What changes were proposed in this pull request?

This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

## How was this patch tested?

Added a unit test to `pyspark.ml.tests`

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12961 from sethah/GLR_summary.
2016-05-13 09:01:20 +02:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Liang-Chi Hsieh a5f9fdbba3 [SPARK-15268][SQL] Make JavaTypeInference work with UDTRegistration
## What changes were proposed in this pull request?

We have a private `UDTRegistration` API to register user defined type. Currently `JavaTypeInference` can't work with it. So `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class.

## How was this patch tested?
`VectorUDTSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13046 from viirya/fix-udt-registry-javatypeinference.
2016-05-11 09:31:22 -07:00
Sandeep Singh ed0b4070fb [SPARK-15037][SQL][MLLIB] Use SparkSession instead of SQLContext in Scala/Java TestSuites
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites
as this PR already very big working Python TestSuites in a diff PR.

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12907 from techaddict/SPARK-15037.
2016-05-10 11:17:47 -07:00
dding3 a78fbfa619 [SPARK-15172][ML] Explicitly tell user initial coefficients is ignored when size mismatch happened in LogisticRegression
## What changes were proposed in this pull request?
Explicitly tell user initial coefficients is ignored if its size doesn't match expected size in LogisticRegression

## How was this patch tested?
local build

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12948 from dding3/master.
2016-05-09 09:43:07 +01:00
Yuhao Yang 68abc1b4e9 [SPARK-14814][MLLIB] API: Java compatibility, docs
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14814
fix a java compatibility function in mllib DecisionTreeModel. As synced in jira, other compatibility issues don't need fixes.

## How was this patch tested?

existing ut

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12971 from hhbyyh/javacompatibility.
2016-05-09 09:08:54 +01:00
Liang-Chi Hsieh 635ef407e1 [SPARK-15211][SQL] Select features column from LibSVMRelation causes failure
## What changes were proposed in this pull request?

We need to use `requiredSchema` in `LibSVMRelation` to project the fetch required columns when loading data from this data source. Otherwise, when users try to select `features` column, it will cause failure.

## How was this patch tested?
`LibSVMRelationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12986 from viirya/fix-libsvmrelation.
2016-05-09 15:05:06 +08:00
Burak Köse e20cd9f4ce [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* covert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>

Closes #12843 from mengxr/SPARK-14050.
2016-05-06 13:58:12 -07:00
Andrew Or 7f5922aa4a [HOTFIX] Fix MLUtils compile 2016-05-05 16:51:06 -07:00
Jacek Laskowski bbb7773437 [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements
## What changes were proposed in this pull request?

Minor doc and code style fixes

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12928 from jaceklaskowski/SPARK-15152.
2016-05-05 16:34:27 -07:00
Holden Karau 4c0d827cfc [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA"
## What changes were proposed in this pull request?

Copy the package documentation from Scala/Java to Python for ML package and remove beta tags. Not super sure if we want to keep the BETA tag but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).

## How was this patch tested?

Python documentation built locally as HTML and text and verified output.

Author: Holden Karau <holden@us.ibm.com>

Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
2016-05-05 10:52:25 +01:00
Dominik Jastrzębski abecbcd5e9 [SPARK-14844][ML] Add setFeaturesCol and setPredictionCol to KMeansM…
## What changes were proposed in this pull request?

Introduction of setFeaturesCol and setPredictionCol methods to KMeansModel in ML library.

## How was this patch tested?

By running KMeansSuite.

Author: Dominik Jastrzębski <dominik.jastrzebski@codilime.com>

Closes #12609 from dominik-jastrzebski/master.
2016-05-04 14:25:51 +02:00
Cheng Lian bc3760d405 [SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations
## What changes were proposed in this pull request?

Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication.

A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`.

Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`.

This PR brings two benefits:

1. Apparently, it de-duplicates partition value appending logic

2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`.

   Because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.
2016-05-04 14:16:57 +08:00
yinxusen 2e2a6211c4 [SPARK-14973][ML] The CrossValidator and TrainValidationSplit miss the seed when saving and loading
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14973

Add seed support when saving/loading of CrossValidator and TrainValidationSplit.

## How was this patch tested?

Spark unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #12825 from yinxusen/SPARK-14973.
2016-05-03 14:19:13 -07:00
Holden Karau f10ae4b1e1 [SPARK-6717][ML] Clear shuffle files after checkpointing in ALS
## What changes were proposed in this pull request?

When ALS is run with a checkpoint interval, during the checkpoint materialize the current state and cleanup the previous shuffles (non-blocking).

## How was this patch tested?

Existing ALS unit tests, new ALS checkpoint cleanup unit tests added & shuffle files checked after ALS w/checkpointing run.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>

Closes #11919 from holdenk/SPARK-6717-clear-shuffle-files-after-checkpointing-in-ALS.
2016-05-03 00:18:10 -07:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Yanbo Liang 19a6d192d5 [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR
## What changes were proposed in this pull request?
* ```RFormula``` supports empty response variable like ```~ x + y```.
* Support formula in ```spark.kmeans``` in SparkR.
* Fix some outdated docs for SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12813 from yanboliang/spark-15030.
2016-04-30 08:37:56 -07:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Xiangrui Meng 0847fe4eb3 [SPARK-14653][ML] Remove json4s from mllib-local
## What changes were proposed in this pull request?

This PR moves Vector.toJson/fromJson to ml.linalg.VectorEncoder under mllib/ to keep mllib-local's dependency minimal. The json encoding is used by Params. So we still need this feature in SPARK-14615, where we will switch to ml.linalg in spark.ml APIs.

## How was this patch tested?

Copied existing unit tests over.

cc; dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12802 from mengxr/SPARK-14653.
2016-04-30 06:30:39 -07:00
Junyang 1192fe4cd2 [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel
## What changes were proposed in this pull request?

This PR fixes the bug that generates infinite distances between word vectors. For example,

Before this PR, we have
```
val synonyms = model.findSynonyms("who", 40)
```
will give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by number of iteration, to be consistent with Google Word2Vec implementation

## How was this patch tested?

Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
2016-04-30 10:16:35 +01:00
Xiangrui Meng 7fbe1bb24d [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS
## What changes were proposed in this pull request?

As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
2016-04-30 00:41:28 -07:00
Sean Owen 5886b6217b [SPARK-14533][MLLIB] RowMatrix.computeCovariance inaccurate when values are very large (partial fix)
## What changes were proposed in this pull request?

Fix for part of SPARK-14533: trivial simplification and more accurate computation of column means. See also https://github.com/apache/spark/pull/12299 which contained a complete fix that was very slow. This PR does _not_ resolve SPARK-14533 entirely.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #12779 from srowen/SPARK-14533.2.
2016-04-30 00:15:41 -07:00
Xiangrui Meng 3d09ceeef9 [SPARK-14850][.2][ML] use UnsafeArrayData.fromPrimitiveArray in ml.VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR uses `UnsafeArrayData.fromPrimitiveArray` to implement `ml.VectorUDT/MatrixUDT` to avoid boxing/unboxing.

## How was this patch tested?

Exiting unit tests.

cc: cloud-fan

Author: Xiangrui Meng <meng@databricks.com>

Closes #12805 from mengxr/SPARK-14850.
2016-04-29 23:51:01 -07:00
Wenchen Fan 43b149fb88 [SPARK-14850][ML] convert primitive array from/to unsafe array directly in VectorUDT/MatrixUDT
## What changes were proposed in this pull request?

This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT.

## How was this patch tested?

existing tests and new test suite `UnsafeArraySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12640 from cloud-fan/ml.
2016-04-29 23:04:51 -07:00
Nick Pentreath 90fa2c6e7f [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.

## How was this patch tested?

New test cases in `ALSSuite` and `tests.py`.

cc yanboliang jkbradley sethah rishabhbhardwaj

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12660 from MLnick/SPARK-14412-als-storage-params.
2016-04-29 22:01:41 -07:00
Joseph K. Bradley 1eda2f10d9 [SPARK-14646][ML] Modified Kmeans to store cluster centers with one per row
## What changes were proposed in this pull request?

Modified Kmeans to store cluster centers with one per row

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12792 from jkbradley/kmeans-save-fix.
2016-04-29 16:46:25 -07:00
BenFradet d78fbcc3cc [SPARK-14570][ML] Log instrumentation in Random forests
## What changes were proposed in this pull request?

Added Instrumentation logging to DecisionTree{Classifier,Regressor} and RandomForest{Classifier,Regressor}

## How was this patch tested?

No tests involved since it's logging related.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12536 from BenFradet/SPARK-14570.
2016-04-29 15:42:47 -07:00
Jeff Zhang 775772de36 [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
2016-04-29 10:42:52 -07:00
Joseph K. Bradley f08dcdb8d3 [SPARK-14984][ML] Deprecated model field in LinearRegressionSummary
## What changes were proposed in this pull request?

Deprecated model field in LinearRegressionSummary

Removed unnecessary Since annotations

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12763 from jkbradley/lr-summary-api.
2016-04-29 10:40:00 -07:00
Yanbo Liang 87ac84d437 [SPARK-14314][SPARK-14315][ML][SPARKR] Model persistence in SparkR (glm & kmeans)
SparkR ```glm``` and ```kmeans``` model persistence.

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Gayathri Murali <gayathri.m.softie@gmail.com>

Closes #12778 from yanboliang/spark-14311.
Closes #12680
Closes #12683
2016-04-29 09:43:04 -07:00
wm624@hotmail.com b6fa7e5934 [SPARK-14571][ML] Log instrumentation in ALS
## What changes were proposed in this pull request?

Add log instrumentation for parameters:
rank, numUserBlocks, numItemBlocks, implicitPrefs, alpha,
userCol, itemCol, ratingCol, predictionCol, maxIter,
regParam, nonnegative, checkpointInterval, seed

Add log instrumentation for numUserFeatures and numItemFeatures

## How was this patch tested?

Manual test: Set breakpoint in intellij and run def testALS(). Single step debugging and check the log method is called.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12560 from wangmiao1981/log.
2016-04-29 16:18:25 +02:00
dding3 6d5aeaae26 [SPARK-14969][MLLIB] Remove duplicate implementation of compute in LogisticGradient
## What changes were proposed in this pull request?

This PR removes duplicate implementation of compute in LogisticGradient class

## How was this patch tested?

unit tests

Author: dding3 <dingding@dingding-ubuntu.sh.intel.com>

Closes #12747 from dding3/master.
2016-04-29 10:19:51 +01:00
Sean Owen d1cf320105 [SPARK-14886][MLLIB] RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException
## What changes were proposed in this pull request?

Handle case where number of predictions is less than label set, k in nDCG computation

## How was this patch tested?

New unit test; existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12756 from srowen/SPARK-14886.
2016-04-29 09:21:27 +02:00
Zheng RuiFeng cabd54d931 [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
## What changes were proposed in this pull request?
According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12596 from zhengruifeng/deprecate_sgd.
2016-04-28 22:44:14 -07:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
Joseph K. Bradley 4f4721a21c [SPARK-14862][ML] Updated Classifiers to not require labelCol metadata
## What changes were proposed in this pull request?

Updated Classifier, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier to not require input column metadata.
* They first check for metadata.
* If numClasses is not specified in metadata, they identify the largest label value (up to a limit).

This functionality is implemented in a new Classifier.getNumClasses method.

Also
* Updated Classifier.extractLabeledPoints to (a) check label values and (b) include a second version which takes a numClasses value for validity checking.

## How was this patch tested?

* Unit tests in ClassifierSuite for helper methods
* Unit tests for DecisionTreeClassifier, RandomForestClassifier, GBTClassifier with toy datasets lacking label metadata

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12663 from jkbradley/trees-no-metadata.
2016-04-28 16:20:00 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
Yuhao Yang d5ab42ceb9 [SPARK-14916][MLLIB] A more friendly tostring for FreqItemset in mllib.fpm
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14916
FreqItemset as the result of FPGrowth should have a more friendly toString(), to help users and developers.
sample:
{a, b}: 5
{x, y, z}: 4

## How was this patch tested?

existing unit tests.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12698 from hhbyyh/freqtos.
2016-04-28 19:52:09 +01:00
Joseph K. Bradley 5ee72454df [SPARK-14852][ML] refactored GLM summary into training, non-training summaries
## What changes were proposed in this pull request?

This splits GeneralizedLinearRegressionSummary into 2 summary types:
* GeneralizedLinearRegressionSummary, which does not store info from fitting (diagInvAtWA)
* GeneralizedLinearRegressionTrainingSummary, which is a subclass of GeneralizedLinearRegressionSummary and stores info from fitting

This also add a method evaluate() which can produce a GeneralizedLinearRegressionSummary on a new dataset.

The summary no longer provides the model itself as a public val.

Also:
* Fixes bug where GeneralizedLinearRegressionTrainingSummary was created with model, not summaryModel.
* Adds hasSummary method.
* Renames findSummaryModelAndPredictionCol -> getSummaryModel and simplifies that method.
* In summary, extract values from model immediately in case user later changes those (e.g., predictionCol).
* Pardon the style fixes; that is IntelliJ being obnoxious.

## How was this patch tested?

Existing unit tests + updated test for evaluate and hasSummary

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12624 from jkbradley/model-summary-api.
2016-04-28 11:22:13 -07:00
Liang-Chi Hsieh 7c6937a885 [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation
## What changes were proposed in this pull request?

Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.

For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.

We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.

## How was this patch tested?

`UserDefinedTypeSuite`

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12259 from viirya/improve-sql-usertype.
2016-04-28 01:14:49 -07:00
Joseph K. Bradley f5ebb18c45 [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage
## What changes were proposed in this pull request?

Pipeline.setStages failed for some code examples which worked in 1.5 but fail in 1.6.  This tends to occur when using a mix of transformers from ml.feature. It is because Java Arrays are non-covariant and the addition of MLWritable to some transformers means the stages0/1 arrays above are not of type Array[PipelineStage].  This PR modifies the following to accept subclasses of PipelineStage:
* Pipeline.setStages()
* Params.w()

## How was this patch tested?

Unit test which fails to compile before this fix.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12430 from jkbradley/pipeline-setstages.
2016-04-27 16:11:12 -07:00
Yanbo Liang 4672e9838b [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12702 from yanboliang/spark-14899.
2016-04-27 14:08:26 -07:00
Mike Dusenberry 607f50341c [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 19:48:05 +02:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joseph K. Bradley 6c5a837c50 [SPARK-12301][ML] Made all tree and ensemble classes not final
## What changes were proposed in this pull request?

There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms.

This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final.  This matches most other spark.ml algorithms.

Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12711 from jkbradley/final-trees.
2016-04-26 14:44:39 -07:00
Dongjoon Hyun e4f3eec5b7 [SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.save
## What changes were proposed in this pull request?

This PR changes `GLMRegressionModel.save` function like the following code that is similar to other algorithms' parquet write.
```
- val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
- // TODO: repartition with 1 partition after SPARK-5532 gets fixed
- dataRDD.write.parquet(Loader.dataPath(path))
+ sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12676 from dongjoon-hyun/SPARK-14907.
2016-04-26 13:58:29 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated  ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Andrew Or 2a3d39f48b [MINOR] Follow-up to #12625
## What changes were proposed in this pull request?

That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes.

Author: Andrew Or <andrew@databricks.com>

Closes #12686 from andrewor14/visibility.
2016-04-26 11:08:08 -07:00
Reynold Xin 5cb03220a0 [SPARK-14912][SQL] Propagate data source options to Hadoop configuration
## What changes were proposed in this pull request?
We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration.

## How was this patch tested?
Used a mock data source implementation to test both the read path and the write path.

Author: Reynold Xin <rxin@databricks.com>

Closes #12688 from rxin/SPARK-14912.
2016-04-26 10:58:56 -07:00
Yanbo Liang 92f66331b4 [SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12685 from yanboliang/spark-14313.
2016-04-26 10:30:24 -07:00
BenFradet 2a5c930790 [SPARK-13962][ML] spark.ml Evaluators should support other numeric types for label
## What changes were proposed in this pull request?

Made BinaryClassificationEvaluator, MulticlassClassificationEvaluator and RegressionEvaluator accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12500 from BenFradet/SPARK-13962.
2016-04-26 08:55:50 +02:00
Andrew Or 18c2c92580 [SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession
## What changes were proposed in this pull request?

In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore.

In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait.

**Reviewers**: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little.

## How was this patch tested?

No change in functionality intended.

Author: Andrew Or <andrew@databricks.com>

Closes #12625 from andrewor14/spark-session-refactor.
2016-04-25 20:54:31 -07:00
Yanbo Liang 9cb3ba1013 [SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```

## How was this patch tested?
Add unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12573 from yanboliang/spark-14312.
2016-04-25 14:08:41 -07:00
Yanbo Liang 425f691646 [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3
## What changes were proposed in this pull request?
As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method.

Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.

## How was this patch tested?
unit tests.

cc jkbradley MLnick

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12498 from yanboliang/spark-10574.
2016-04-25 12:08:43 -07:00
wm624@hotmail.com b50e2eca93 [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture
## What changes were proposed in this pull request?

Add Python API in ML for GaussianMixture

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Add doctest and test cases are the same as mllib Python tests
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12402 from wangmiao1981/gmm.
2016-04-25 10:48:15 -07:00
Zheng RuiFeng e6f954a579 [SPARK-14758][ML] Add checking for StepSize and Tol
## What changes were proposed in this pull request?
add the checking for StepSize and Tol in sharedParams

## How was this patch tested?
Unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12530 from zhengruifeng/ml_args_checking.
2016-04-25 10:30:55 +02:00
Dongjoon Hyun d34d650378 [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
## What changes were proposed in this pull request?

Spark uses `NewLineAtEofChecker` rule in Scala by ScalaStyle. And, most Java code also comply with the rule. This PR aims to enforce the same rule `NewlineAtEndOfFile` by CheckStyle explicitly. Also, this fixes lint-java errors since SPARK-14465. The followings are the items.

- Adds a new line at the end of the files (19 files)
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?

After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins dose not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12632 from dongjoon-hyun/SPARK-14868.
2016-04-24 20:40:03 -07:00
Zheng RuiFeng 86ca8fefc8 [MINOR][ML][MLLIB] Remove unused imports
## What changes were proposed in this pull request?
del unused imports in ML/MLLIB

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12497 from zhengruifeng/del_unused_imports.
2016-04-22 23:20:10 -07:00
Liang-Chi Hsieh 8098f15857 [SPARK-14843][ML] Fix encoding error in LibSVMRelation
## What changes were proposed in this pull request?

We use `RowEncoder` in libsvm data source to serialize the label and features read from libsvm files. However, the schema passed in this encoder is not correct. As the result, we can't correctly select `features` column from the DataFrame. We should use full data schema instead of `requiredSchema` to serialize the data read in. Then do projection to select required columns later.

## How was this patch tested?
`LibSVMRelationSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12611 from viirya/fix-libsvm.
2016-04-23 01:11:36 +08:00
Zheng RuiFeng 92675471b7 [MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifier
## What changes were proposed in this pull request?
1, fix the indentation
2, add a missing param desc

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12499 from zhengruifeng/fix_doc.
2016-04-22 14:52:37 +01:00
Joan bf95b8da27 [SPARK-6429] Implement hashCode and equals together
## What changes were proposed in this pull request?

Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>

Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
2016-04-22 12:24:12 +01:00
Yanbo Liang 4e726227a3 [SPARK-14479][ML] GLM supports output link prediction
## What changes were proposed in this pull request?
GLM supports output link prediction.
## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12287 from yanboliang/spark-14479.
2016-04-21 17:31:33 -07:00
Joseph K. Bradley f25a3ea8d3 [SPARK-14734][ML][MLLIB] Added asML, fromML methods for all spark.mllib Vector, Matrix types
## What changes were proposed in this pull request?

For maintaining wrappers around spark.mllib algorithms in spark.ml, it will be useful to have ```private[spark]``` methods for converting from one linear algebra representation to another.
This PR adds toNew, fromNew methods for all spark.mllib Vector and Matrix types.

## How was this patch tested?

Unit tests for all conversions

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12504 from jkbradley/linalg-conversions.
2016-04-21 16:50:09 -07:00
Xin Ren 6d1e4c4a65 [SPARK-14569][ML] Log instrumentation in KMeans
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14569

Log instrumentation in KMeans:

- featuresCol
- predictionCol
- k
- initMode
- initSteps
- maxIter
- seed
- tol
- summary

## How was this patch tested?

Manually test on local machine, by running and checking output of org.apache.spark.examples.ml.KMeansExample

Author: Xin Ren <iamshrek@126.com>

Closes #12432 from keypointt/SPARK-14569.
2016-04-21 16:29:39 -07:00
Joseph K. Bradley acc7e592c4 [SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
## What changes were proposed in this pull request?

Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does.

This PR documents this fact.

## How was this patch tested?

doc only

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12519 from jkbradley/scaler-variance-doc.
2016-04-20 11:48:30 -07:00
Liwei Lin 17db4bfeaa [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf)
## What changes were proposed in this pull request?

- replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12450 from lw-lin/fix-fs-get.
2016-04-20 11:28:51 +01:00
Cheng Lian 10f273d8db [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178
## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package.

Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153.

## How was this patch tested?

Existing tests.

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
2016-04-19 17:32:23 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Xusen Yin b64482f49f [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14306

Add PySpark OneVsRest save/load supports.

## How was this patch tested?

Test with Python unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12439 from yinxusen/SPARK-14306-0415.
2016-04-18 11:52:29 -07:00
hyukjinkwon 9f678e9754 [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations
## What changes were proposed in this pull request?

This PR removes

- Inappropriate type notations
    For example, from
    ```scala
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
    ...
    ```
    to
    ```scala
    words.foreachRDD { (rdd, time) =>
    ...
    ```

- Extra anonymous closure within functional transformations.
    For example,
    ```scala
    .map(item => {
      ...
    })
    ```

    which can be just simply as below:

    ```scala
    .map { item =>
      ...
    }
    ```

and corrects some obvious style nits.

## How was this patch tested?

This was tested after adding rules in `scalastyle-config.xml`, which ended up with not finding all perfectly.

The rules applied were below:

- For the first correction,

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```

```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added**

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12413 from HyukjinKwon/SPARK-style.
2016-04-16 14:56:23 +01:00
Yanbo Liang 83af297ac4 [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.

## How was this patch tested?
Unit tests.

SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
Species_versicolor  -0.98339  0.072075    -13.644  0
Species_virginica   -1.0075   0.093306    -10.798  0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12393 from yanboliang/spark-13925.
2016-04-15 08:23:51 -07:00
Pravin Gadakh e24923267f [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizer
## What changes were proposed in this pull request?

Removed duplicated generation of `ids` in OnlineLDAOptimizer.

## How was this patch tested?

tested with existing unit tests.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12176 from pravingadakh/SPARK-14370.
2016-04-15 13:08:30 +01:00
DB Tsai 96534aa47c [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local
## What changes were proposed in this pull request?

This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /*  since 1.2.0 */

The BLAS implementation will be copied, and some of the test utilities will be copies as well.

Summary of changes:

1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
  - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR.
  - `UDT` related code was removed, and will use `SPARK-13944` https://github.com/apache/spark/pull/12259  to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - `Since` was removed.
  - `UDT` related code was removed.
  - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
  - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`

## How was this patch tested?

unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12317 from dbtsai/dbtsai-ml-vector.
2016-04-15 01:17:03 -07:00
Fokko Driesprong c80586d9e8 [SPARK-12869] Implemented an improved version of the toIndexedRowMatrix
Hi guys,

I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

Author: Fokko Driesprong <f.driesprong@catawiki.nl>
Author: Fokko Driesprong <fokko@driesprongen.nl>

Closes #10839 from Fokko/master.
2016-04-14 17:32:20 -07:00
Yong Tang 01dd1f5c07 [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes
## What changes were proposed in this pull request?

This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and
parseDouble, for the purpose of robustness and maintainability.

## How was this patch tested?

Existing tests passed.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12360 from yongtang/SPARK-14565.
2016-04-14 17:23:16 -07:00
Joseph K. Bradley bf65c87f70 [SPARK-14618][ML][DOC] Updated RegressionEvaluator.metricName param doc
## What changes were proposed in this pull request?

In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.

## How was this patch tested?

no tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12377 from jkbradley/regeval-doc.
2016-04-14 12:44:59 -07:00
Sean Owen 9fa43a33b9 [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place
## What changes were proposed in this pull request?

Move json4s, breeze dependency declaration into parent

## How was this patch tested?

Should be no functional change, but Jenkins tests will test that.

Author: Sean Owen <sowen@cloudera.com>

Closes #12390 from srowen/SPARK-14612.
2016-04-14 10:48:17 -07:00
Yanbo Liang a91aaf5a8c [SPARK-14375][ML] Unit test for spark.ml KMeansSummary
## What changes were proposed in this pull request?
* Modify ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters.
* Add unit test for spark.ml ```KMeansSummary```.
* Add Since tag.

## How was this patch tested?
unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12254 from yanboliang/spark-14375.
2016-04-13 13:23:10 -07:00
Yanbo Liang 0d17593b32 [SPARK-14461][ML] GLM training summaries should provide solver
## What changes were proposed in this pull request?
GLM training summaries should provide solver.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12253 from yanboliang/spark-14461.
2016-04-13 13:20:29 -07:00
Yanbo Liang b0adb9f543 [SPARK-10386][MLLIB] PrefixSpanModel supports save/load
```PrefixSpanModel``` supports ```save/load```. It's similar with #9267.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10664 from yanboliang/spark-10386.
2016-04-13 13:18:02 -07:00
Yanbo Liang f9d578eaa1 [SPARK-13783][ML] Model export/import for spark.ml: GBTs
## What changes were proposed in this pull request?
* Added save/load for ```GBTClassifier/GBTClassificationModel/GBTRegressor/GBTRegressionModel```.
* Meanwhile, I modified ```EnsembleModelReadWrite.saveImpl/loadImpl``` to support save/load ```treeWeights```.

## How was this patch tested?
Adds standard unit tests for GBT save/load.

cc jkbradley GayathriMurali

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12230 from yanboliang/spark-13783.
2016-04-13 11:31:10 -07:00
Timothy Hunter 1018a1c1eb [SPARK-14568][ML] Instrumentation framework for logistic regression
## What changes were proposed in this pull request?

This adds extra logging information about a `LogisticRegression` estimator when being fit on a dataset. With this PR, you see the following extra lines when running the example in the documentation:

```
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training: numPartitions=1 storageLevel=StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1)
16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): {"regParam":0.3,"elasticNetParam":0.8,"maxIter":10}
...
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numClasses=2
16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numFeatures=692
...
16/04/13 07:19:01 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training finished
```

## How was this patch tested?

This PR was manually tested.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #12331 from thunterdb/1604-instrumentation.
2016-04-13 11:06:42 -07:00
Xiangrui Meng 323e7390a5 Revert "[SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov test"
This reverts commit d2a819a636.
2016-04-13 09:17:46 -07:00
hyukjinkwon 587cd554af [MINOR][SQL] Remove some unused imports in datasources.
## What changes were proposed in this pull request?

It looks several recent commits for datasources (maybe while removing old `HadoopFsRelation` interface) missed removing some unused imports.

This PR removes some unused imports in datasources.

## How was this patch tested?

`sbt scalastyle` and some unit tests for them.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12326 from HyukjinKwon/minor-imports.
2016-04-13 10:20:03 +08:00
Yanbo Liang 111a62474a [SPARK-14147][ML][SPARKR] SparkR predict should not output feature column
## What changes were proposed in this pull request?
SparkR does not support type of vector which is the default type of feature column in ML. R predict also does not output intermediate feature column. So SparkR ```predict``` should not output feature column. In this PR, I only fix this issue for ```naiveBayes``` and ```survreg```. ```kmeans``` has the right code route already and  ```glm``` will be fixed at SparkRWrapper refactor(#12294).

## How was this patch tested?
No new tests.

cc mengxr shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11958 from yanboliang/spark-14147.
2016-04-12 11:34:40 -07:00
Xiangrui Meng 1995c2e648 [SPARK-14563][ML] use a random table name instead of __THIS__ in SQLTransformer
## What changes were proposed in this pull request?

Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems of using `__THIS__` are:

* It doesn't work under HiveContext (in Spark 1.6)
* Race conditions

## How was this patch tested?

* Manual test with HiveContext.
* Added a unit test for `transformSchema` to improve coverage.

cc: yhuai

Author: Xiangrui Meng <meng@databricks.com>

Closes #12330 from mengxr/SPARK-14563.
2016-04-12 11:30:09 -07:00
Yanbo Liang 101663f1ae [SPARK-13322][ML] AFTSurvivalRegression supports feature standardization
## What changes were proposed in this pull request?
AFTSurvivalRegression should support feature standardization, it will improve the convergence rate.
Test the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data which is standard data comes with Survival library in R,
* without standardization(before this PR) -> 74 iterations.
* with standardization(after this PR) -> 38 iterations.

But after this fix, with or without ```standardization``` will converge to the same solution. It means that ```standardization = false``` will run the same code route as ```standardization = true```. Because if the features are not standardized at all, it will result convergency issue when the features have very different scales. This behavior is the same as ML [```LinearRegression``` and ```LogisticRegression```](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247.
cc mengxr
## How was this patch tested?
unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11365 from yanboliang/spark-13322.
2016-04-12 11:27:16 -07:00
Yanbo Liang 75e05a5a96 [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm
* SparkR glm supports families and link functions which match R's signature for family.
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR is focus on glm() and predict(), summary statistics will be done in a separate PR after this get in.
* This PR depends on #12287 which make GLMs support link prediction at Scala side. After that merged, I will add more tests for predict() to this PR.

Unit tests.

cc mengxr jkbradley hhbyyh

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12294 from yanboliang/spark-12566.
2016-04-12 10:51:09 -07:00
Yong Tang da60b34d2f [SPARK-3724][ML] RandomForest: More options for feature subset size.
## What changes were proposed in this pull request?

This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sort", "log2", "onethird". This PR tries to support any given value to allow model search.

In this PR, `featureSubsetStrategy` could be passed with:
a) a real number in the range of `(0.0-1.0]` that represents the fraction of the number of features in each subset,
b)  an integer number (`>0`) that represents the number of features in each subset.

## How was this patch tested?

Two tests `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite` have been updated to check the additional options for params in this PR.
An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #11989 from yongtang/SPARK-3724.
2016-04-12 16:53:26 +02:00
Dongjoon Hyun b0f5497e95 [SPARK-14508][BUILD] Add a new ScalaStyle Rule OmitBracesInCase
## What changes were proposed in this pull request?

According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
  ```
  case: Always omit braces in case clauses.
  ```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.

## How was this patch tested?

Pass the Jenkins tests (including Scala style checking)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12280 from dongjoon-hyun/SPARK-14508.
2016-04-12 00:43:28 -07:00
Wenchen Fan 678b96e77b [SPARK-14535][SQL] Remove buildInternalScan from FileFormat
## What changes were proposed in this pull request?

Now `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for  `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12300 from cloud-fan/remove.
2016-04-11 22:59:42 -07:00
Joseph K. Bradley e9e1adc036 [MINOR][ML] Fixed MLlib build warnings
## What changes were proposed in this pull request?

Fixes to eliminate warnings during package and doc builds.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12263 from jkbradley/warning-cleanups.
2016-04-12 03:24:26 +01:00
Yanbo Liang 3f0f40800b [SPARK-14298][ML][MLLIB] Add unit test for EM LDA disable checkpointing
## What changes were proposed in this pull request?
This is follow up for #12089, add unit test for EM LDA which test disable checkpointing when set ```checkpointInterval = -1```.
## How was this patch tested?
unit test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12286 from yanboliang/spark-14298-followup.
2016-04-11 14:01:05 -07:00
Oliver Pierson 89a41c5b7a [SPARK-13600][MLLIB] Use approxQuantile from DataFrame stats in QuantileDiscretizer
## What changes were proposed in this pull request?
QuantileDiscretizer can return an unexpected number of buckets in certain cases.  This PR proposes to fix this issue and also refactor QuantileDiscretizer to use approxQuantiles from DataFrame stats functions.
## How was this patch tested?
QuantileDiscretizerSuite unit tests (some existing tests will change or even be removed in this PR)

Author: Oliver Pierson <ocp@gatech.edu>

Closes #11553 from oliverpierson/SPARK-13600.
2016-04-11 12:02:48 -07:00
DB Tsai efaf7d1820 [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies.

The previous PR was failing the build because of `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` in mllib-local test was used, similar to sketch.

Thanks.

## How was this patch tested?

Unit tests

mengxr tedyu holdenk

Author: DB Tsai <dbt@netflix.com>

Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
2016-04-11 09:35:47 -07:00
Zheng RuiFeng 643b4e2257 [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans
## What changes were proposed in this pull request?
add the checking for LDA and StreamingKMeans

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12062 from zhengruifeng/initmodel.
2016-04-11 09:33:52 -07:00
Xiangrui Meng 1c751fcf48 [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs
## What changes were proposed in this pull request?

This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.

Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

TODOs:
- [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
- [x] Python
- [x] add a new test to accept Dataset[LabeledPoint]
- [x] remove unused imports of Dataset

## How was this patch tested?

Exiting unit tests with some modifications.

cc: rxin jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #12274 from mengxr/SPARK-14500.
2016-04-11 09:28:28 -07:00
fwang1 f4344582ba [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in ConutVectorizer
## What changes were proposed in this pull request?

Replace sortBy() with top() to calculate the top N frequent words as dictionary.

## How was this patch tested?
existing unit tests.  The terms with same TF would be sorted in descending order. The test would fail if hardcode the terms with same TF the dictionary like "c", "d"...

Author: fwang1 <desperado.wf@gmail.com>

Closes #12265 from lionelfeng/master.
2016-04-10 01:13:25 -07:00
Xiangrui Meng 415446cc9b Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom"
This reverts commit 1598d11bb0.
2016-04-09 14:03:03 -07:00
DB Tsai 1598d11bb0 [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom
## What changes were proposed in this pull request?

In order to separate the linear algebra, and vector matrix classes into a standalone jar, we need to setup the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. Couple platform independent classes will be moved to this package to demonstrate how this work.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>

Closes #12241 from dbtsai/dbtsai-mllib-local-build.
2016-04-09 09:21:12 -07:00
wm624@hotmail.com a9b8b655b2 [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param
## What changes were proposed in this pull request?

CountVectorizerModel has a binary toggle param. This PR is to add binary toggle param for estimator CountVectorizer. As discussed in the JIRA, instead of adding a param into CountVerctorizer, I moved the binary param to CountVectorizerParams. Therefore, the estimator inherits the binary param.

## How was this patch tested?

Add a new test case, which fits the model with binary flag set to true and then check the trained model's all non-zero counts is set to 1.0.

All tests in CounterVectorizerSuite.scala are passed.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12200 from wangmiao1981/binary_param.
2016-04-09 09:57:07 +02:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
Yanbo Liang 56af8e85cc [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint
## What changes were proposed in this pull request?
In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpoint by setting ```checkpointInterval = -1```. But we did not handle this situation for LDA actually, we should fix this bug.
## How was this patch tested?
Existing tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12089 from yanboliang/spark-14298.
2016-04-08 11:49:44 -07:00
Joseph K. Bradley 953ff897e4 [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer
## What changes were proposed in this pull request?

The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).

This PR adds a "deleteLastCheckpoint" option which defaults to false.  This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.

This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.

This also:
* Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
* Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
* Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```

## How was this patch tested?

Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12166 from jkbradley/emlda-save-checkpoint.
2016-04-07 19:48:33 -07:00
Marcelo Vanzin 21d5ca128b [SPARK-14134][CORE] Change the package name used for shading classes.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.

Most changes are just noise to fix the logging configs.

For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11941 from vanzin/SPARK-14134.
2016-04-06 19:33:51 -07:00
sethah bb873754b4 [SPARK-12382][ML] Remove mllib GBT implementation and wrap ml
## What changes were proposed in this pull request?

This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ML.

Primary changes:
* Removed `boost` method in mllib GradientBoostedTrees.scala
* Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib

Other changes:
* Added an `updatePrediction` method in GradientBoostedTrees package. This method is added to provide consistency for methods that build predictions from boosted models. There are several methods that hard code the method of predicting as: sum_{i=1}^{numTrees} (treePrediction*treeWeight). Calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training
* Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part  of `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ML unit tests. They are left in mllib as well so as to not break the API.

## How was this patch tested?

Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12050 from sethah/SPARK-12382.
2016-04-06 17:13:34 -07:00