1091 commits

Author SHA1 Message Date
Xiangrui Meng 33ae7a35da [SPARK-11358][MLLIB] deprecate runs in k-means
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.

cc: srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9322 from mengxr/SPARK-11358.
2015-11-02 13:42:16 -08:00
Yu ISHIKAWA e963070c13 [SPARK-9722] [ML] Pass random seed to spark.ml DecisionTree*
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9402 from yu-iskw/SPARK-9722.
2015-11-01 23:52:50 -08:00
Nakul Jindal 69b9e4b3c2 [SPARK-11385] [ML] foreachActive made public in MLLib's vector API
Made foreachActive public in MLLib's vector API
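
As a rough illustration of what a `foreachActive`-style method does, the sketch below applies a callback to every active entry of a vector: all entries for a dense vector, only the explicitly stored ones for a sparse vector. It is plain Python rather than the MLlib API; the function names and the parallel `indices`/`values` layout for the sparse case are assumptions made for this example.

```python
from typing import Callable, Sequence


def foreach_active_dense(values: Sequence[float],
                         f: Callable[[int, float], None]) -> None:
    """Call f(index, value) for every entry of a dense vector."""
    for i, v in enumerate(values):
        f(i, v)


def foreach_active_sparse(indices: Sequence[int],
                          values: Sequence[float],
                          f: Callable[[int, float], None]) -> None:
    """Call f(index, value) only for the explicitly stored entries."""
    for i, v in zip(indices, values):
        f(i, v)


def sum_of_squares_sparse(indices: Sequence[int],
                          values: Sequence[float]) -> float:
    """Example caller: accumulate a statistic without densifying the vector."""
    total = [0.0]

    def add(_index: int, value: float) -> None:
        total[0] += value * value

    foreach_active_sparse(indices, values, add)
    return total[0]
```

A caller that only needs the stored entries can therefore avoid materializing the zeros of a sparse vector.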

Author: Nakul Jindal <njindal@us.ibm.com>

Closes #9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.
2015-10-30 17:12:24 -07:00
Lewuathe 86d65265fc [SPARK-11207] [ML] Add test cases for solver selection of LinearRegression as followup. This is the follow up work of SPARK-10668.

* Fix minor style issues.
* Add test case for checking whether solver is selected properly.

Author: Lewuathe <lewuathe@me.com>
Author: lewuathe <lewuathe@me.com>

Closes #9180 from Lewuathe/SPARK-11207.
2015-10-30 02:59:05 -07:00
Yanbo Liang fba9e95452 [SPARK-11369][ML][R] SparkR glm should support setting standardize
SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting `standardize`, as defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9331 from yanboliang/spark-11369.
2015-10-28 08:50:21 -07:00
Nakul Jindal 5f1cee6f15 [SPARK-11332] [ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance
WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.

Author: Nakul Jindal <njindal@us.ibm.com>

Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
2015-10-28 01:02:03 -07:00
Xiangrui Meng 82c1c57728 [MINOR][ML] fix compile warns
This fixes some compile time warnings.

Author: Xiangrui Meng <meng@databricks.com>

Closes #9319 from mengxr/mllib-compile-warn-20151027.
2015-10-27 23:41:42 -07:00
Sean Owen 826e1e304b [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
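
For context, a Gaussian density can be evaluated through a matrix $A$ satisfying $A^\top A = \Sigma^{-1}$. The derivation below is standard linear algebra, assuming $\Sigma$ is symmetric positive definite with eigendecomposition $\Sigma = U D U^\top$. It is not taken from the patch, and the symbol $A$ is introduced here only for illustration.

$$
A = D^{-1/2} U^\top
\quad\Longrightarrow\quad
A^\top A = U D^{-1/2} D^{-1/2} U^\top = U D^{-1} U^\top = \Sigma^{-1}
$$

$$
(x - \mu)^\top \Sigma^{-1} (x - \mu) = \lVert A (x - \mu) \rVert_2^2
$$

The order of the factors matters: with $B = U D^{-1/2}$ one gets $B B^\top = \Sigma^{-1}$ but $B^\top B = D^{-1}$, so $\lVert B (x - \mu) \rVert_2^2$ does not equal the quadratic form in general.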

Supersedes https://github.com/apache/spark/pull/9293

Author: Sean Owen <sowen@cloudera.com>

Closes #9309 from srowen/SPARK-11302.2.
2015-10-27 23:07:37 -07:00
Reza Zadeh 8b292b19c9 [SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix
Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.

With a test.
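
The delegation idea can be sketched as follows, assuming that "column similarities" means pairwise cosine similarity between columns. Row indices do not influence that quantity, so an indexed matrix can drop them and reuse the plain row-matrix routine. The function names and the list-of-lists representation are hypothetical, and the real implementation is distributed; this sketch is not.

```python
import math
from typing import Dict, List, Sequence, Tuple


def column_similarities(rows: Sequence[Sequence[float]]) -> Dict[Tuple[int, int], float]:
    """Cosine similarity for every column pair (i, j) with i < j."""
    n_cols = len(rows[0]) if rows else 0
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n_cols)]
    sims: Dict[Tuple[int, int], float] = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # Skip all-zero columns, whose cosine similarity is undefined.
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(r[i] * r[j] for r in rows)
                sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


def indexed_column_similarities(
        indexed_rows: Sequence[Tuple[int, List[float]]]) -> Dict[Tuple[int, int], float]:
    """Drop the row indices and delegate to the row-matrix routine."""
    return column_similarities([vector for _index, vector in indexed_rows])
```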

Author: Reza Zadeh <reza@databricks.com>

Closes #8792 from rezazadeh/colsims.
2015-10-26 22:00:24 -07:00
Sean Owen 3cac6614a4 [SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier

Author: Sean Owen <sowen@cloudera.com>

Closes #9169 from srowen/SPARK-11184.
2015-10-26 21:47:42 -07:00
Jayant Shekar 4e38defae1 [SPARK-6723] [MLLIB] Model import/export for ChiSqSelector
This is a PR for Parquet-based model import/export.

* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite

Author: Jayant Shekar <jayant@user-MBPMBA-3.local>

Closes #6785 from jayantshekhar/SPARK-6723.
2015-10-23 08:45:13 -07:00
Reynold Xin cdea0174e3 [SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package
Author: Reynold Xin <rxin@databricks.com>

Closes #9239 from rxin/types-private.
2015-10-23 00:00:21 -07:00
Xiangrui Meng 45861693be [SPARK-10082][MLLIB] minor style updates for matrix indexing after #8271
* `>=0` => `>= 0`
* print `i`, `j` in the log message

MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #9189 from mengxr/SPARK-10082.
2015-10-20 18:37:29 -07:00
MechCoder da46b77afd [SPARK-10082][MLLIB] Validate i, j in apply DenseMatrices and SparseMatrices
Given row_ind should be less than the number of rows
Given col_ind should be less than the number of cols.

The current code in master gives unpredictable behavior for such cases.
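
A minimal sketch of why the bounds check matters, assuming a column-major dense layout in which element `(i, j)` lives at offset `i + num_rows * j`. The class below is a stand-in written for this example, not the MLlib class. Without the check, an out-of-range `i` can silently alias a different, valid element instead of failing.

```python
from typing import List


class DenseMatrix:
    """Tiny column-major dense matrix used only for illustration."""

    def __init__(self, num_rows: int, num_cols: int, values: List[float]) -> None:
        if len(values) != num_rows * num_cols:
            raise ValueError("values must have num_rows * num_cols entries")
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = values

    def unchecked_apply(self, i: int, j: int) -> float:
        # No validation: i == num_rows aliases element (0, j + 1).
        return self.values[i + self.num_rows * j]

    def apply(self, i: int, j: int) -> float:
        # Validate both indices before computing the flat offset.
        if not 0 <= i < self.num_rows:
            raise IndexError(f"Expected 0 <= i < {self.num_rows}, got i = {i}.")
        if not 0 <= j < self.num_cols:
            raise IndexError(f"Expected 0 <= j < {self.num_cols}, got j = {j}.")
        return self.values[i + self.num_rows * j]
```

For a 2 x 2 matrix, `unchecked_apply(2, 0)` quietly returns the value stored at `(0, 1)`, while `apply(2, 0)` raises an error that reports the offending index.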

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8271 from MechCoder/hash_code_matrices.
2015-10-20 16:35:34 -07:00
Tijo Thomas 9f49895fef [SPARK-10261][DOCUMENTATION, ML] Fixed @Since annotation to ml.evaluation
Author: Tijo Thomas <tijoparacka@gmail.com>
Author: tijo <tijo@ezzoft.com>

Closes #8554 from tijoparacka/SPARK-10261-2.
2015-10-20 16:13:34 -07:00
lewuathe 4c33a34ba3 [SPARK-10668] [ML] Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small

Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>

Closes #8884 from Lewuathe/SPARK-10668.
2015-10-19 10:46:10 -07:00
Luvsandondov Lkhamsuren cca2258685 [SPARK-9963] [ML] RandomForest cleanup: replace predictNodeIndex with predictImpl
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl

Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>

Closes #8609 from lkhamsurenl/SPARK-9963.
2015-10-17 10:07:42 -07:00
Yuhao Yang e1e77b22b3 [SPARK-11029] [ML] Add computeCost to KMeansModel in spark.ml
jira: https://issues.apache.org/jira/browse/SPARK-11029

We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel.
This will be a temp fix until we have proper evaluators defined for clustering.
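
For reference, the k-means cost is usually defined as the sum of squared Euclidean distances from each point to its nearest cluster center. The sketch below computes that quantity on local Python lists; the function name and the data layout are assumptions for this example, not the spark.ml signature.

```python
from typing import Sequence


def compute_cost(points: Sequence[Sequence[float]],
                 centers: Sequence[Sequence[float]]) -> float:
    """Sum of squared distances from each point to its closest center."""

    def squared_distance(p: Sequence[float], c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return sum(min(squared_distance(p, c) for c in centers) for p in points)
```

A lower cost on the same data indicates tighter clusters, which is why the value can serve as a stopgap evaluation metric.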

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>

Closes #9073 from hhbyyh/computeCost.
2015-10-17 10:04:19 -07:00
Burak Yavuz 10046ea76c [SPARK-10599] [MLLIB] Lower communication for block matrix multiplication
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
 - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
 - Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
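
The two ideas above can be sketched on block coordinates alone. This is a simplified local model, not the PR's code: `simulate_multiply`, `target_partitions`, and the modulo-based partition assignment are hypothetical names and choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Coord = Tuple[int, int]


def simulate_multiply(a_blocks: Set[Coord],
                      b_blocks: Set[Coord]) -> Tuple[Dict[Coord, Set[Coord]],
                                                     Dict[Coord, Set[Coord]]]:
    """Map every non-empty block to the output blocks (i, j) it contributes to.

    a_blocks holds (row_block, inner_block) coordinates of A,
    b_blocks holds (inner_block, col_block) coordinates of B.
    A block mapped to an empty set does not need to be shuffled.
    """
    b_cols_by_inner = defaultdict(set)
    for k, j in b_blocks:
        b_cols_by_inner[k].add(j)
    a_rows_by_inner = defaultdict(set)
    for i, k in a_blocks:
        a_rows_by_inner[k].add(i)

    a_dest = {(i, k): {(i, j) for j in b_cols_by_inner.get(k, ())}
              for (i, k) in a_blocks}
    b_dest = {(k, j): {(i, j) for i in a_rows_by_inner.get(k, ())}
              for (k, j) in b_blocks}
    return a_dest, b_dest


def target_partitions(dest_blocks: Set[Coord],
                      num_col_blocks: int,
                      num_partitions: int) -> Set[int]:
    """Collapse destinations to partition ids so each block is sent once per partition."""
    return {(i * num_col_blocks + j) % num_partitions for (i, j) in dest_blocks}
```

Only coordinates are inspected, so the simulation is cheap enough to run on the driver before any block data moves.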

**NOTE**: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was.

Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.

Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s

cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8757 from brkyvz/opt-bmm.
2015-10-16 15:30:07 -07:00
vectorijk 3889b1c7a9 [SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression
The values of the quantile probabilities array should be in the range (0, 1) instead of [0, 1] in `AFTSurvivalRegression.scala`, according to this [discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
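
A minimal sketch of an open-interval check, written as a standalone Python function rather than as a Spark param validator; the function name is hypothetical.

```python
from typing import List, Sequence


def validate_quantile_probabilities(probabilities: Sequence[float]) -> List[float]:
    """Accept only values strictly between 0 and 1 (endpoints and NaN are rejected)."""
    invalid = [p for p in probabilities if not 0.0 < p < 1.0]
    if invalid:
        raise ValueError(f"quantile probabilities must lie in (0, 1), got {invalid}")
    return list(probabilities)
```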

Author: vectorijk <jiangkai@gmail.com>

Closes #9083 from vectorijk/spark-11059.
2015-10-13 15:57:36 -07:00
Xiangrui Meng 2b574f52d7 [SPARK-7402] [ML] JSON SerDe for standard param types
This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
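
Standard JSON has no literal for `NaN` or infinities, which is why floating-point params need special handling. One possible approach, sketched below, writes those values as tagged strings and maps them back on load. The tag strings `"NaN"`, `"Inf"`, and `"-Inf"` and all function names are assumptions for this example and may differ from what the PR does.

```python
import json
import math
from typing import List, Sequence, Union

_TAGS = {"NaN": math.nan, "Inf": math.inf, "-Inf": -math.inf}


def _encode(x: float) -> Union[float, str]:
    """Replace non-finite doubles with string tags; leave finite ones alone."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "Inf" if x > 0 else "-Inf"
    return x


def _decode(v: Union[float, int, str]) -> float:
    """Inverse of _encode; unknown strings raise KeyError."""
    return _TAGS[v] if isinstance(v, str) else float(v)


def dumps_double(x: float) -> str:
    return json.dumps(_encode(x))


def loads_double(s: str) -> float:
    return _decode(json.loads(s))


def dumps_double_array(xs: Sequence[float]) -> str:
    return json.dumps([_encode(x) for x in xs])


def loads_double_array(s: str) -> List[float]:
    return [_decode(v) for v in json.loads(s)]
```

The output stays valid JSON for any input, at the cost of a small type check when decoding.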

Author: Xiangrui Meng <meng@databricks.com>

Closes #9090 from mengxr/SPARK-7402.
2015-10-13 13:24:10 -07:00
Vladimir Vladimirov c1b4ce4326 [SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark

Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>

Closes #8700 from smartkiwi/SPARK-10535_.
2015-10-09 14:16:13 -07:00
Nick Pritchard 5994cfe812 [SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric
Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
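
The approach can be sketched on a small local dataset: compute each entry with `i <= j` once, then mirror it, so the result is symmetric by construction rather than up to floating-point rounding. This is a plain Python sketch of the sample covariance (denominator `n - 1`), not the distributed MLlib code, and the function name is hypothetical.

```python
from typing import List, Sequence


def covariance(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix of the columns of rows (needs at least 2 rows)."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows")
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            cov[i][j] = s / (n - 1)
            # Copy the upper-triangular value into the lower triangle.
            cov[j][i] = cov[i][j]
    return cov
```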

Author: Nick Pritchard <nicholas.pritchard@falkonry.com>

Closes #8940 from pnpritchard/SPARK-10875.
2015-10-08 22:22:20 -07:00
Yanbo Liang 2268356002 [SPARK-7770] [ML] GBT validationTol change to compare with relative or absolute error
GBT now compares the validation error against a tolerance that switches between relative and absolute, where the relative tolerance is relative to the current loss on the training set.
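
One way such a switching tolerance could look is sketched below. Scaling the tolerance by `max(current_error, floor)` makes the test relative while the error is large and effectively absolute once the error falls below the floor. The function name, the argument names, and the `floor` default of 0.01 are assumptions for this example, not values confirmed by the commit message.

```python
def should_stop(best_error: float,
                current_error: float,
                validation_tol: float,
                floor: float = 0.01) -> bool:
    """Stop early when the improvement falls below the scaled tolerance."""
    improvement = best_error - current_error
    return improvement < validation_tol * max(current_error, floor)
```

With a purely relative rule, an error near zero would make the threshold vanish, and training could continue on negligible improvements; the floor avoids that.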

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8549 from yanboliang/spark-7770.
2015-10-08 11:27:46 -07:00
Holden Karau 0903c6489e [SPARK-9718] [ML] linear regression training summary all columns
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
2015-10-08 11:16:20 -07:00
Nathan Howell 1bc435ae3a [SPARK-10064] [ML] Parallelize decision tree bin split calculations
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.

With large feature spaces, the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #8246 from NathanHowell/SPARK-10064.
2015-10-07 17:46:16 -07:00
DB Tsai dd36ec6bc5 [SPARK-10738] [ML] Refactoring Instance out from LOR and LIR, and also cleaning up some code
Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code.

Author: DB Tsai <dbt@netflix.com>

Closes #8853 from dbtsai/refactoring.
2015-10-07 15:56:57 -07:00
Yanbo Liang 7bf07faa71 [SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
Consolidate the Cholesky solvers in WeightedLeastSquares and ALS.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8936 from yanboliang/spark-10490.
2015-10-07 15:50:45 -07:00
Evan Chen da936fbb74 [SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib)
Provide initialModel param for pyspark.mllib.clustering.KMeans

Author: Evan Chen <chene@us.ibm.com>

Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
2015-10-07 15:04:53 -07:00
Marcelo Vanzin 94fc57afdf [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8775 from vanzin/SPARK-10300.
2015-10-07 14:11:21 -07:00
Holden Karau 5be5d24744 [SPARK-9841] [ML] Make clear public
It is currently impossible to clear Param values once set. It would be helpful to be able to.
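
The intended behaviour can be sketched with a small parameter holder in which clearing an explicitly set value makes the default visible again. The class and method names below are hypothetical and only loosely modelled on a params-with-defaults design.

```python
from typing import Any, Dict


class ParamHolder:
    """Keeps user-set values separate from defaults so a value can be cleared."""

    def __init__(self, defaults: Dict[str, Any]) -> None:
        self._defaults = dict(defaults)
        self._values: Dict[str, Any] = {}

    def set(self, name: str, value: Any) -> "ParamHolder":
        self._values[name] = value
        return self

    def clear(self, name: str) -> "ParamHolder":
        # Forget the explicit value; the default (if any) applies again.
        self._values.pop(name, None)
        return self

    def is_set(self, name: str) -> bool:
        return name in self._values

    def get(self, name: str) -> Any:
        if name in self._values:
            return self._values[name]
        return self._defaults[name]
```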

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.
2015-10-07 12:00:56 -07:00
Yin Huai b0baa11d3b [HOT-FIX] Fix style.
https://github.com/apache/spark/pull/8882 broke our build.

Author: Yin Huai <yhuai@databricks.com>

Closes #8964 from yhuai/fixStyle.
2015-10-02 11:23:08 -07:00
Xusen Yin 633aaae0a1 [SPARK-6530] [ML] Add chi-square selector for ml package
See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5742 from yinxusen/SPARK-6530.
2015-10-02 10:25:58 -07:00
Xusen Yin 23a9448c04 [SPARK-5890] [ML] Add feature discretizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).

I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5779 from yinxusen/SPARK-5890.
2015-10-02 10:19:18 -07:00
Rerngvit Yanggratoke 2a717821bb [SPARK-9798] [ML] CrossValidatorModel Documentation Improvements
Document CrossValidatorModel members: bestModel and avgMetrics

Author: Rerngvit Yanggratoke <rerngvit@kth.se>

Closes #8882 from rerngvit/Spark-9798.
2015-10-02 10:15:02 -07:00
Yanbo Liang 2931e89f0c [SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = ""
For some implicit datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string.
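
The rule can be sketched on plain dictionaries: when the rating column name is empty, every observed (user, item) pair gets a rating of 1. The function name and the row representation are assumptions for this example.

```python
from typing import Any, Dict, List, Sequence, Tuple


def extract_ratings(rows: Sequence[Dict[str, Any]],
                    user_col: str,
                    item_col: str,
                    rating_col: str) -> List[Tuple[Any, Any, float]]:
    """Return (user, item, rating) triples, using 1.0 when rating_col is empty."""
    triples = []
    for row in rows:
        rating = float(row[rating_col]) if rating_col != "" else 1.0
        triples.append((row[user_col], row[item_col], rating))
    return triples
```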

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8937 from yanboliang/spark-10736.
2015-09-29 23:58:32 -07:00
y-shimizu 299b439920 [SPARK-10778] [MLLIB] Implement toString for AssociationRules.Rule
I implemented toString for AssociationRules.Rule, formatted like `[x, y] => {z}: 1.0`
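
A small sketch of that format, assuming the bracketed list is the antecedent, the braced list is the consequent, and the trailing number is the confidence. The `Rule` class here is a Python stand-in, not the MLlib class.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[str, ...]
    consequent: Tuple[str, ...]
    confidence: float

    def __str__(self) -> str:
        left = ", ".join(self.antecedent)
        right = ", ".join(self.consequent)
        return "[" + left + "] => {" + right + "}: " + str(self.confidence)


print(Rule(("x", "y"), ("z",), 1.0))  # prints: [x, y] => {z}: 1.0
```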

Author: y-shimizu <y.shimizu0429@gmail.com>

Closes #8904 from y-shimizu/master.
2015-09-27 16:36:03 +01:00
Eric Liang 922338812c [SPARK-9681] [ML] Support R feature interactions in RFormula
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.
2015-09-25 00:43:22 -07:00
Holden Karau d91967e159 [SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use simplified dataframe construction
As introduced in https://issues.apache.org/jira/browse/SPARK-10630, we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
2015-09-23 22:49:08 -07:00
Yanbo Liang 067afb4e9b [SPARK-10699] [ML] Support checkpointInterval can be disabled
Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but they also need a way to disable it. This PR lets users disable checkpointing by setting ```checkpointInterval = -1```.
We also add documentation for GBT ```cacheNodeIds``` to help users understand checkpointing more clearly.
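
The sentinel convention can be sketched as a small predicate, assuming iterations are counted from 1. The function name and the choice to reject other non-positive values are assumptions for this example.

```python
def should_checkpoint(iteration: int, checkpoint_interval: int) -> bool:
    """Return True when the given 1-based iteration should be checkpointed."""
    if checkpoint_interval == -1:
        # -1 is the sentinel that disables checkpointing entirely.
        return False
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be -1 or >= 1")
    return iteration % checkpoint_interval == 0
```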

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8820 from yanboliang/spark-10699.
2015-09-23 16:41:42 -07:00
Yanbo Liang ce2b056d35 [SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression
By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8836 from yanboliang/spark-10686.
2015-09-23 15:26:02 -07:00
sethah 098be27ad5 [SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #8675 from sethah/SPARK-9715.
2015-09-23 15:00:52 -07:00
Yanbo Liang 7104ee0e5d [SPARK-10750] [ML] ML Param validate should print better error information
Currently, when you set an illegal value for params of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an IllegalArgumentException with an incomprehensible error message.
Take ```VectorSlicer.setNames``` as an example:
```scala
val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
// The value of setNames must contain distinct elements, so the next line will throw an exception.
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
```
It will throw IllegalArgumentException as:
```
vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
```
We should distinguish the value of array type from primitive type at Param.validate(value: T), and we will get better error information.
```
vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8863 from yanboliang/spark-10750.
2015-09-22 11:00:33 -07:00
Holden Karau f4a3c4e34c [SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.
2015-09-22 10:19:08 -07:00
Meihua Wu 870b8a2edd [SPARK-10706] [MLLIB] Add java wrapper for random vector rdd
Add java wrapper for random vector rdd

holdenk srowen

Author: Meihua Wu <meihuawu@umich.edu>

Closes #8841 from rotationsymmetry/SPARK-10706.
2015-09-22 11:05:24 +01:00
Feynman Liang aeef44a3e3 [SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance testing
Implementation of significance testing using Streaming API.

Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>

Closes #4716 from feynmanliang/ab_testing.
2015-09-21 13:11:28 -07:00
Meihua Wu 331f0b10f7 [SPARK-9642] [ML] LinearRegression should supported weighted data
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting that accounts for over- or under-sampling.

work in progress.
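
To make the idea of instance weights concrete, the sketch below fits a single-feature weighted least-squares line in closed form. It is a local illustration with hypothetical names, not the distributed solver used by LinearRegression. An integer weight of `k` gives the same fit as repeating that data point `k` times.

```python
from typing import Sequence, Tuple


def weighted_linear_fit(xs: Sequence[float],
                        ys: Sequence[float],
                        ws: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) minimizing sum(w * (y - slope * x - intercept) ** 2)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("x values have no weighted variance")
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```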

Author: Meihua Wu <meihuawu@umich.edu>

Closes #8631 from rotationsymmetry/SPARK-9642.
2015-09-21 12:09:00 -07:00
Holden Karau 20a61dbd9b [SPARK-10626] [MLLIB] create java friendly method for random rdd
SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
2015-09-21 18:53:28 +01:00
lewuathe 0c498717ba [SPARK-10715] [ML] Duplicate initialization flag in WeightedLeastSquare
The initialization flag is set twice in `WeightedLeastSquares#add`.
`initialized` is already set in `init(Int)`.

Author: lewuathe <lewuathe@me.com>

Closes #8837 from Lewuathe/duplicate-initialization-flag.
2015-09-20 16:16:31 -07:00
Sean Owen 1aa9e50256 [SPARK-5905] [MLLIB] Note requirements for certain RowMatrix methods in docs
Note methods that fail for cols > 65535; note that SVD does not require n >= m
CC mengxr

Author: Sean Owen <sowen@cloudera.com>

Closes #8839 from srowen/SPARK-5905.
2015-09-20 16:05:12 -07:00