Document for 1.6 that the summaries mostly ignore the weight column.
To be corrected in 1.7.
CC: mengxr thunterdb
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9927 from jkbradley/linregsummary-doc.
There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType or VectorUDT.
So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType".
This PR aims to fix this, throwing a SparkException when dealing with an unknown column type.
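The failure mode and the fix can be sketched in plain Scala. The type names below are hypothetical stand-ins for the Spark SQL types, and `IllegalArgumentException` is used only to keep the sketch free of Spark dependencies (the PR itself throws a SparkException). The point is the catch-all case that replaces the bare `scala.MatchError` with a descriptive message.

```scala
object AssembleSketch {
  // Hypothetical stand-ins for Spark SQL column types; not the real classes.
  sealed trait ColType
  case object DoubleT extends ColType
  case object BooleanT extends ColType
  case object VectorT extends ColType
  case object StringT extends ColType

  // Describes how a column of the given type would be handled, or throws a
  // descriptive exception instead of letting a scala.MatchError escape.
  def handle(name: String, t: ColType): String = t match {
    case DoubleT  => s"$name: use as-is"
    case BooleanT => s"$name: cast to double"
    case VectorT  => s"$name: append all elements"
    case other =>
      throw new IllegalArgumentException(
        s"Column $name has unsupported data type $other")
  }
}
```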
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#9885 from BenFradet/SPARK-11902.
Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is a param, so we should save it only under ```metadata/``` rather than under both ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` and construct ```mllib.feature.PCAModel``` inside ```transform```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9897 from yanboliang/spark-11912.
I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.)
Added read/write for all 3 Evaluators as well.
CC: mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9848 from jkbradley/cv-io.
```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9839 from yanboliang/standardScaler-refactor.
Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```)
I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem.
CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9851 from jkbradley/tempdir-cleanup.
Add read/write support to the following estimators under spark.ml:
* ChiSqSelector
* PCA
* VectorIndexer
* Word2Vec
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9838 from yanboliang/spark-11829.
Updates:
* Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression
* Switch from hand-written class names in Readers to using getClass
CC: mengxr
CC: yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9829 from jkbradley/ml-io-cleanups.
* add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.*
* define `DefaultParamsReadable/Writable` and use them to save some code
* use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9827 from mengxr/SPARK-11839.
Add read/write support to the following estimators under spark.ml:
* CountVectorizer
* IDF
* MinMaxScaler
* StandardScaler (a little awkward because we store some params in spark.mllib model)
* StringIndexer
Added some necessary methods for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9798 from mengxr/SPARK-6787.
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example code to show:
  * supporting feature interaction in R formula.
  * summary for gaussian GLM model.
  * coefficients for binomial GLM model.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9727 from yanboliang/spark-11684.
jira: https://issues.apache.org/jira/browse/SPARK-11813
I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits.
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.
Currently in the fit of Word2Vec, the closure mainly includes the serialization of Word2Vec and 2 global tables.
The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 vocab
2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.
Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.
Actually there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
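The size estimate above can be turned into a rough vocabulary limit. This is a back-of-the-envelope sketch that uses only the per-word byte counts quoted in the description; the object and method names are made up for illustration, and real serialized sizes will differ somewhat.

```scala
object ClosureSizeSketch {
  // Per-word byte estimates taken from the description above.
  val vocabBytesPerWord: Long = 40L * 2 * 4 // = 320
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose estimated size stays within Int.MaxValue bytes.
  def maxVocab(bytesPerWord: Long): Long = Int.MaxValue / bytesPerWord

  // With vectorSize = 20: vocab serialized vs. vocab left out of the closure.
  val withVocab: Long = maxVocab(vocabBytesPerWord + tableBytesPerWord(20))
  val withoutVocab: Long = maxVocab(tableBytesPerWord(20))
}
```

Under these estimates the limit moves from about 4.5 million words to about 13.4 million words when vectorSize = 20.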
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9803 from hhbyyh/w2vVocab.
Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata.
CC: mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9786 from jkbradley/als-io.
This replaces [https://github.com/apache/spark/pull/9656] with updates.
fayeshine should be the main author when this PR is committed.
CC: mengxr fayeshine
Author: Wenjian Huang <nextrush@163.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9814 from jkbradley/fayeshine-patch-6790.
I have added a unit test for ML's StandardScaler that compares its results with R's output. Please review.
Thanks.
Author: RoyGaoVLIS <roygao@zju.edu.cn>
Closes#6665 from RoyGao/7013.
This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9776 from mengxr/SPARK-11764.
Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.
Moved LogisticRegressionReader/Writer to within LogisticRegressionModel
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9749 from jkbradley/lr-io-2.
This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9755 from jkbradley/transformer-io.
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9751 from mengxr/SPARK-11766.
Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable.
Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9674 from jkbradley/pipeline-io.
Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames. This includes:
* Use libSVM data source for all example codes under examples/ml, and remove unused import.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: On the Java side we should use ```sqlContext.read().format("libsvm").load(path)```, but the API doc and user guides mistakenly use ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9690 from yanboliang/spark-11723.
We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext` and then `SQLContext.getOrCreate` might use the `SparkContext` from previous test suite and hence causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites.
cc: yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes#9677 from mengxr/SPARK-11672.2.
Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.
CC feynmanliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9678 from jkbradley/lda-pipelines-2.
This causes compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested compile locally.) JoshRosen
Author: Xiangrui Meng <meng@databricks.com>
Closes#9644 from mengxr/SPARK-11674.
org.apache.spark.ml.feature.Word2Vec.transform() is very slow. We should not read the broadcast variable for every sentence.
Author: Yuming Wang <q79969786@gmail.com>
Author: yuming.wang <q79969786@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#9592 from 979969786/master.
This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9606 from jkbradley/logreg-io2.
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR.
CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9513 from jkbradley/lda-pipelines.
Implementation of step capability for sliding window function in MLlib's RDD.
Though one can use current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:
Window | Step | Time (s) | Windows produced
------------ | ------------- | ---------- | ----------
128 | 1 | 6.38 | 9999873
128 | 10 | 0.9 | 999988
128 | 100 | 0.41 | 99999
1024 | 1 | 44.67 | 9998977
1024 | 10 | 4.74 | 999898
1024 | 100 | 0.78 | 99990
```
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd = sc.parallelize(1 to 10000000, 10)
rdd.count
val window = 1024
val step = 1
val t = System.nanoTime(); val windows = rdd.sliding(window, step); println(windows.count); println((System.nanoTime() - t) / 1e9)
```
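The "Windows produced" column in the table above is consistent with counting only full windows. A small sketch of that count, assuming trailing partial windows are dropped (the function name is hypothetical and this is not the RDD implementation):

```scala
object SlidingCountSketch {
  // Number of full windows of size `window` taken every `step` elements
  // from `n` elements; trailing partial windows are not counted.
  def windowCount(n: Long, window: Int, step: Int): Long = {
    require(window > 0 && step > 0, "window and step must be positive")
    if (n < window) 0L else (n - window) / step + 1
  }
}
```

For example, `SlidingCountSketch.windowCount(10000000L, 1024, 100)` gives 99990, matching the last row of the table.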
Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
Author: Alexander Ulanov <nashb@yandex.ru>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5855 from avulanov/SPARK-7316-sliding.
Refactoring
* separated overwrite and param save logic in DefaultParamsWriter
* added sparkVersion to DefaultParamsWriter
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9587 from jkbradley/logreg-io.
jira: https://issues.apache.org/jira/browse/SPARK-11069
quotes from jira:
Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
call the Boolean Param "toLowercase"
set default to false (so behavior does not change)
Actually sklearn converts to lowercase before tokenizing too
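The proposed option can be illustrated with a plain-Scala sketch. The function below is a hypothetical stand-in, not the RegexTokenizer implementation; it only shows lowercasing being applied before the split, with the default left at `false` as the proposal suggests.

```scala
object RegexTokenizeSketch {
  // Splits `text` on the regex `pattern`, optionally lowercasing first.
  def tokenize(text: String,
               pattern: String = "\\s+",
               toLowercase: Boolean = false): List[String] = {
    val input = if (toLowercase) text.toLowerCase else text
    input.split(pattern).filter(_.nonEmpty).toList
  }
}
```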
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9092 from hhbyyh/tokenLower.
I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation and spark.ml APIs. I will send further PRs for those later.
https://issues.apache.org/jira/browse/SPARK-6517
- This implementation is based on bisecting k-means clustering.
- It derives from freeman-lab's implementation.
- The basic idea is not changed from the previous version. (#2906)
- However, it is 1000x faster than the previous version through parallel processing.
Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen).
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
Closes#5267 from yu-iskw/new-hierarchical-clustering.
The PMML models currently generated do not specify the PMML version in their root node. This is a problem when using these models in other tools, because they expect the version attribute to be set explicitly. This fix adds the PMML version attribute to the generated models and sets its value to 4.2.
Author: fazlan-nazeem <fazlann@wso2.com>
Closes#9558 from fazlan-nazeem/master.
Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:
```
$DevianceResiduals
 Min        Max
 -0.9509607 0.7291832

$Coefficients
                   Estimate   Std. Error t value   Pr(>|t|)
(Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica  -1.00751   0.09330565 -10.79796 0
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9561 from yanboliang/spark-11494.
Could jkbradley and davies review it?
- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't handle the return value of `describeTopics` in Scala from pyspark directly: `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#8643 from yu-iskw/SPARK-8467-2.
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap
The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~
The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it also causes other issues if we modify the default params.
TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9454 from mengxr/SPARK-11217.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
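The underlying problem is that widening an `Int` hash to a `Long` sign-extends it, so the upper 32 bits are all zeros or all ones rather than hashed bits. One possible way to fill both halves is sketched below; it is an illustration under that assumption and not necessarily the code merged in this PR. The helper name `combine` is made up.

```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

object HashSeedSketch {
  // Packs two 32-bit values into one 64-bit value. Masking the low word
  // prevents sign extension from overwriting the high word.
  def combine(high: Int, low: Int): Long =
    (high.toLong << 32) | (low.toLong & 0xFFFFFFFFL)

  // Hashes the seed twice so that both halves of the result carry hashed bits.
  def hashSeed(seed: Long): Long = {
    val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
    val lowBits = MurmurHash3.bytesHash(bytes)
    val highBits = MurmurHash3.bytesHash(bytes, lowBits)
    combine(highBits, lowBits)
  }
}
```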
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes#8314 from squito/SPARK-10116.
Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9485 from yanboliang/spark-11473.
In file LDAOptimizer.scala:
line 441: since "idx" was never used, replaced the unnecessary zipWithIndex.foreach with foreach.
- nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
+ nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
Author: a1singh <a1singh@ucsd.edu>
Closes#9456 from a1singh/master.
Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9303 from yanboliang/spark-9492.
Currently ```RFormula``` can only handle labels of ```NumericType``` or ```BinaryType``` (casting them to ```DoubleType``` as the label for Linear Regression training). We should also support labels of ```StringType```, which are needed for Logistic Regression (glm with family = "binomial").
For label of ```StringType```, we should use ```StringIndexer``` to transform it to 0-based index.
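As an illustration of mapping string labels to a 0-based index, here is a plain-Scala sketch. It assumes the most frequent label gets index 0 and breaks ties alphabetically to stay deterministic; both choices are assumptions of this sketch, and the function is a hypothetical stand-in rather than ```StringIndexer``` itself.

```scala
object LabelIndexSketch {
  // Maps each distinct label to a 0-based index, most frequent label first;
  // ties are broken alphabetically so the result is deterministic.
  def fitIndex(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, idx) => label -> idx.toDouble }
      .toMap
}
```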
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9302 from yanboliang/spark-11349.
Removed the old private `getModelWeights` function, which has been renamed to `getModelCoefficients`.
Author: DB Tsai <dbt@netflix.com>
Closes#9426 from dbtsai/feature-minor.
mengxr, felixcheung
This pull request just relaxes the type of the prediction/label columns to be float or double. Internally, these columns are cast to double. The other evaluators might need to be changed as well.
Author: Dominik Dahlem <dominik.dahlem@gmail.combination>
Closes#9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.
cc: srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes#9322 from mengxr/SPARK-11358.
Made foreachActive public in MLLib's vector API
Author: Nakul Jindal <njindal@us.ibm.com>
Closes#9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.
…sion as followup. This is the follow up work of SPARK-10668.
* Fix minor style issues.
* Add test case for checking whether solver is selected properly.
Author: Lewuathe <lewuathe@me.com>
Author: lewuathe <lewuathe@me.com>
Closes#9180 from Lewuathe/SPARK-11207.
SparkR glm currently supports:
```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```
We should also support setting standardize, which is defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit)
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9331 from yanboliang/spark-11369.
WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.
Author: Nakul Jindal <njindal@us.ibm.com>
Closes#9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
Author: Sean Owen <sowen@cloudera.com>
Closes#9309 from srowen/SPARK-11302.2.
Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.
With a test.
Author: Reza Zadeh <reza@databricks.com>
Closes#8792 from rezazadeh/colsims.
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier
Author: Sean Owen <sowen@cloudera.com>
Closes#9169 from srowen/SPARK-11184.
This is a PR for Parquet-based model import/export.
* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite
Author: Jayant Shekar <jayant@user-MBPMBA-3.local>
Closes#6785 from jayantshekhar/SPARK-6723.
Given row_ind should be less than the number of rows
Given col_ind should be less than the number of cols.
The current code in master gives unpredictable behavior for such cases.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8271 from MechCoder/hash_code_matrices.
…2 regularization if the number of features is small
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>
Closes#8884 from Lewuathe/SPARK-10668.
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl
Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>
Closes#8609 from lkhamsurenl/SPARK-9963.
jira: https://issues.apache.org/jira/browse/SPARK-11029
We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel.
This will be a temp fix until we have proper evaluators defined for clustering.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
Closes#9073 from hhbyyh/computeCost.
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
- Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
- Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
**NOTE**: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was.
Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#8757 from brkyvz/opt-bmm.
Values of the quantile probabilities array should be in the range (0, 1) instead of [0, 1]
in `AFTSurvivalRegression.scala`, according to this [discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242)
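The difference between the open and closed interval comes down to strict comparisons. A minimal sketch of such a check, with a hypothetical function name and no claim about how the param validator is actually written:

```scala
object QuantileCheckSketch {
  // True only if every probability lies strictly between 0 and 1,
  // i.e. in the open interval (0, 1).
  def validProbabilities(ps: Array[Double]): Boolean =
    ps.forall(p => p > 0.0 && p < 1.0)
}
```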
Author: vectorijk <jiangkai@gmail.com>
Closes#9083 from vectorijk/spark-11059.
This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
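The special handling is needed because JSON numbers cannot represent `NaN` or infinities. One way to deal with that is to encode them as JSON strings, sketched below in plain Scala. The string tokens `"NaN"`, `"Inf"` and `"-Inf"` and the function names are assumptions of this sketch, not a statement of the format the PR writes.

```scala
object DoubleJsonSketch {
  // Encodes a Double as a JSON value; non-finite values become JSON strings
  // because JSON numbers cannot represent them.
  def encode(v: Double): String =
    if (v.isNaN) "\"NaN\""
    else if (v == Double.PositiveInfinity) "\"Inf\""
    else if (v == Double.NegativeInfinity) "\"-Inf\""
    else v.toString

  // Inverse of encode for the values produced above.
  def decode(json: String): Double = json match {
    case "\"NaN\""  => Double.NaN
    case "\"Inf\""  => Double.PositiveInfinity
    case "\"-Inf\"" => Double.NegativeInfinity
    case other      => other.toDouble
  }
}
```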
Author: Xiangrui Meng <meng@databricks.com>
Closes#9090 from mengxr/SPARK-7402.
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
Closes#8700 from smartkiwi/SPARK-10535_.
Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
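Because a covariance matrix is symmetric, only the entries with j >= i need to be computed. A small local sketch of the idea on plain arrays follows; it is not the distributed RowMatrix code, and the function name is hypothetical.

```scala
object SymmetricFillSketch {
  // Sample covariance of row-wise observations: fill the upper triangle
  // (j >= i), then mirror each value into the lower triangle.
  def covariance(rows: Array[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length
    require(n > 1, "need at least two rows")
    val d = rows(0).length
    val mean = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    val cov = Array.ofDim[Double](d, d)
    for (i <- 0 until d; j <- i until d) {
      cov(i)(j) = rows.map(r => (r(i) - mean(i)) * (r(j) - mean(j))).sum / (n - 1)
    }
    // Copy the upper triangle into the lower triangle.
    for (i <- 0 until d; j <- 0 until i) cov(i)(j) = cov(j)(i)
    cov
  }
}
```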
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#8940 from pnpritchard/SPARK-10875.
GBT compares the validation error against a tolerance that switches between a relative and an absolute one, where the relative tolerance is relative to the current loss on the training set.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8549 from yanboliang/spark-7770.
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours.
Author: Nathan Howell <nhowell@godaddy.com>
Closes#8246 from NathanHowell/SPARK-10064.
Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code.
Author: DB Tsai <dbt@netflix.com>
Closes#8853 from dbtsai/refactoring.
Provide initialModel param for pyspark.mllib.clustering.KMeans
Author: Evan Chen <chene@us.ibm.com>
Closes#8967 from evanyc15/SPARK-10779-pyspark-mllib.
It is currently impossible to clear Param values once set. It would be helpful to be able to.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).
I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5779 from yinxusen/SPARK-5890.
For some implicit datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8937 from yanboliang/spark-10736.
I implemented toString for AssociationRules.Rule, format like `[x, y] => {z}: 1.0`
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes#8904 from y-shimizu/master.
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes#8830 from ericl/interaction-2.
As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but they also need a way to disable checkpointing. With this PR, users can disable checkpointing by setting ```checkpointInterval = -1```.
We also add documentation for GBT ```cacheNodeIds``` so that users can understand checkpointing more clearly.
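The intended semantics of the setting can be sketched as a small predicate. This is an illustration only: the function name is hypothetical, and counting iterations from 1 is an assumption of the sketch.

```scala
object CheckpointSketch {
  // -1 disables checkpointing; otherwise checkpoint every `interval` iterations.
  def shouldCheckpoint(iteration: Int, interval: Int): Boolean = {
    require(interval == -1 || interval >= 1, "interval must be -1 or >= 1")
    interval != -1 && iteration > 0 && iteration % interval == 0
  }
}
```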
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8820 from yanboliang/spark-10699.
By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8836 from yanboliang/spark-10686.
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#8675 from sethah/SPARK-9715.
Currently when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an IllegalArgumentException, but with an incomprehensible error message.
Take ```VectorSlicer.setNames``` as an example:
```scala
val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
// The value of setNames must contain distinct elements, so the next line will throw an exception.
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
```
It will throw IllegalArgumentException as:
```
vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;@798256c5.
```
We should distinguish values of array type from values of primitive type in Param.validate(value: T), so that we get a better error message:
```
vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
```
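The difference between the two messages comes from how the value is rendered: calling `toString` on a JVM array prints its type and identity hash, while `mkString` prints its elements. A minimal sketch of the distinction, with hypothetical helper names that are not the `Param.validate` code:

```scala
object ParamMessageSketch {
  // Renders arrays element by element; other values use toString.
  def render(value: Any): String = value match {
    case arr: Array[_] => arr.mkString("[", ",", "]")
    case other         => other.toString
  }

  def invalidMessage(uid: String, param: String, value: Any): String =
    s"$uid parameter $param given invalid value ${render(value)}."
}
```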
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8863 from yanboliang/spark-10750.
NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of training.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.