Commit graph

2132 commits

Author SHA1 Message Date
Holden Karau 4bacddb602 [SPARK-7146][ML] Expose the common params as a DeveloperAPI for other ML developers
## What changes were proposed in this pull request?

Expose the common params from Spark ML as a Developer API.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holdenkarau@google.com>

Closes #18699 from holdenk/SPARK-7146-ml-shared-params-developer-api.
2017-11-05 21:21:12 -08:00
xubo245 7a8412352e [SPARK-22423][SQL] Scala test source files like TestHiveSingleton.scala should be in scala source root
## What changes were proposed in this pull request?

  Scala test source files like TestHiveSingleton.scala should be in the scala source root.

## How was this patch tested?

Just moved Scala files from the java directory to the scala directory.
No new test cases in this PR.

```
	renamed:    mllib/src/test/java/org/apache/spark/ml/util/IdentifiableSuite.scala -> mllib/src/test/scala/org/apache/spark/ml/util/IdentifiableSuite.scala
	renamed:    streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala -> streaming/src/test/scala/org/apache/spark/streaming/JavaTestUtils.scala
	renamed:    streaming/src/test/java/org/apache/spark/streaming/api/java/JavaStreamingListenerWrapperSuite.scala -> streaming/src/test/scala/org/apache/spark/streaming/api/java/JavaStreamingListenerWrapperSuite.scala
	renamed:    sql/hive/src/test/java/org/apache/spark/sql/hive/test/TestHiveSingleton.scala -> sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHiveSingleton.scala
```

Author: xubo245 <601450868@qq.com>

Closes #19639 from xubo245/scalaDirectory.
2017-11-04 11:51:10 +00:00
bomeng aa6db57e39 [SPARK-22399][ML] update the location of reference paper
## What changes were proposed in this pull request?
Update the URL of the reference paper.

## How was this patch tested?
This only changes a comment, so nothing needs testing.

Author: bomeng <bmeng@us.ibm.com>

Closes #19614 from bomeng/22399.
2017-10-31 08:20:23 +00:00
WeichenXu 20eb95e5e9 [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark
## What changes were proposed in this pull request?

Add parallelism support for ML tuning in PySpark.

## How was this patch tested?

Test updated.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19122 from WeichenXu123/par-ml-tuning-py.
2017-10-27 15:19:27 -07:00
WeichenXu 841f1d776f [SPARK-22332][ML][TEST] Fix occasionally failing NaiveBayes unit test (caused by non-deterministic test dataset)
## What changes were proposed in this pull request?

Fix the occasionally failing NaiveBayes unit test:
Set the seed for `BrzMultinomial.sample`, making `generateNaiveBayesInput` output a deterministic dataset.
(If we do not set the seed, the generated dataset is random, the model may exceed the tolerance in the test, and that triggers this failure.)

## How was this patch tested?

Manually ran the tests multiple times and checked that the output models contain the same values each time.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19558 from WeichenXu123/fix_nb_test_seed.
2017-10-25 14:31:36 -07:00
Zheng RuiFeng 673876b7ea [SPARK-22309][ML] Remove unused param in LDAModel.getTopicDistributionMethod
## What changes were proposed in this pull request?
Remove unused param in `LDAModel.getTopicDistributionMethod`

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19530 from zhengruifeng/lda_bc.
2017-10-20 08:28:05 +01:00
Valeriy Avanesov 52facb0062 [SPARK-14371][MLLIB] OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver
## What changes were proposed in this pull request?

As proposed by jkbradley, ```gammat``` is no longer collected to the driver.

## How was this patch tested?
Existing test suite.

Author: Valeriy Avanesov <avanesov@wias-berlin.de>
Author: Valeriy Avanesov <acopich@gmail.com>

Closes #18924 from akopich/master.
2017-10-18 10:46:46 -07:00
Zhenhua Wang 655f6f86f8 [SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0
## What changes were proposed in this pull request?

Currently percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer.

For example, given input data 1 to 10, if a user queries the 10% (or smaller) percentile, it should return 1, because the first value 1 already covers 10%. Currently it returns 2.

Based on the paper, targetError should not be rounded up, and the search index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above.
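
For illustration, a minimal sketch of the case above (assumes a `spark` session; the table and column names are hypothetical):

```scala
// Values 1..10; querying the 10th percentile. Before this fix
// percentile_approx returned 2; after it, any percentile in [0, 1/N]
// returns the first element.
val df = spark.range(1, 11).toDF("v")
df.createOrReplaceTempView("t")
spark.sql("SELECT percentile_approx(v, 0.1) FROM t").show()  // expected: 1
```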

## How was this patch tested?

Added a new test case and fixed existing test cases.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #19438 from wzhfy/improve_percentile_approx.
2017-10-11 00:16:12 -07:00
WeichenXu 3b5c2a84bf [SPARK-21770][ML] ProbabilisticClassificationModel fix corner case: normalization of all-zero raw predictions
## What changes were proposed in this pull request?

Fix a `ProbabilisticClassificationModel` corner case: when normalizing all-zero raw predictions, throw an `IllegalArgumentException` with a description.
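
A minimal sketch of the guard, simplified from the model's in-place normalization (names illustrative):

```scala
// Normalizing an all-zero raw prediction would divide by a zero sum and
// emit NaNs, so the fix rejects that input explicitly.
def normalizeToProbabilitiesInPlace(v: Array[Double]): Unit = {
  val sum = v.sum
  require(sum > 0, "Can't normalize the 0-vector.")
  var i = 0
  while (i < v.length) {
    v(i) /= sum
    i += 1
  }
}
```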

## How was this patch tested?

Test case added.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19106 from WeichenXu123/SPARK-21770.
2017-10-10 08:27:45 +01:00
Nick Pentreath 98057583dd [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods).

The subset methods take a `DataFrame` as input, containing ids in the column specified by the param `userCol` or `itemCol`. The model will generate recommendations for each _unique_ id in this input dataframe.
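
A usage sketch, assuming `model` is a fitted `ALSModel` and `users`/`items` are DataFrames whose `userCol`/`itemCol` columns hold the ids of interest:

```scala
// Top 10 item recommendations for only the users in `users`.
val userRecs = model.recommendForUserSubset(users, 10)
// Top 10 user recommendations for only the items in `items`.
val itemRecs = model.recommendForItemSubset(items, 10)
```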

## How was this patch tested?
New unit tests in `ALSSuite` and Python doctests in `ALS`. Ran updated examples locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18748 from MLnick/als-recommend-df.
2017-10-09 10:42:33 +02:00
Kento NOZAWA 5eacc3bfa9 [SPARK-22156][MLLIB] Fix update equation of learning rate in Word2Vec.scala
## What changes were proposed in this pull request?

The current learning-rate update equation is incorrect when `numIterations` > `1`.
This PR is based on [original C code](https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L393).

cc: mengxr

## How was this patch tested?

manual tests

I modified [this example code](https://spark.apache.org/docs/2.1.1/mllib-feature-extraction.html#example).

### `numIterations=1`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
2 0.175856813788414
0 0.10971353203058243
4 0.09818313270807266
3 0.012947646901011467
9 -0.09881238639354706
```

### `numIterations=5`

#### Code

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("data/mllib/sample_lda_data.txt").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
word2vec.setNumIterations(5)

val model = word2vec.fit(input)

val synonyms = model.findSynonyms("1", 5)

for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}
```

#### Result

```
0 0.9898583889007568
2 0.9808019399642944
4 0.9794934391975403
3 0.9506527781486511
9 -0.9065656661987305
```

Author: Kento NOZAWA <k_nzw@klis.tsukuba.ac.jp>

Closes #19372 from nzw0301/master.
2017-10-07 08:30:48 +01:00
Liang-Chi Hsieh 3ca367083e [SPARK-22001][ML][SQL] ImputerModel can do withColumn for all input columns at one pass
## What changes were proposed in this pull request?

SPARK-21690 makes one-pass `Imputer` by parallelizing the computation of all input columns. When we transform dataset with `ImputerModel`, we do `withColumn` on all input columns sequentially. We can also do this on all input columns at once by adding a `withColumns` API to `Dataset`.

The new `withColumns` API is for internal use only now.
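
A sketch of the bulk projection this enables, assuming the internal helper mirrors `withColumn` but takes the names and columns in bulk, and that `dataset` is a DataFrame with `age` and `income` columns (column names and surrogate values are hypothetical):

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit}

// One bulk projection instead of N sequential withColumn calls, each of
// which triggers its own analysis pass.
val ageSurrogate = 29.5        // e.g., the mean of the "age" column
val incomeSurrogate = 52000.0  // e.g., the median of the "income" column
val imputed = dataset.withColumns(
  Seq("age_imputed", "income_imputed"),
  Seq(coalesce(col("age"), lit(ageSurrogate)),
      coalesce(col("income"), lit(incomeSurrogate))))
```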

## How was this patch tested?

Existing tests for `ImputerModel`'s change. Added tests for `withColumns` API.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19229 from viirya/SPARK-22001.
2017-10-01 10:49:22 -07:00
Sean Owen 576c43fb42 [SPARK-22087][SPARK-14650][WIP][BUILD][REPL][CORE] Compile Spark REPL for Scala 2.12 + other 2.12 fixes
## What changes were proposed in this pull request?

Enable Scala 2.12 REPL. Fix most remaining issues with 2.12 compilation and warnings, including:

- Selecting Kafka 0.10.1+ for Scala 2.12 and patching over a minor API difference
- Fixing lots of "eta expansion of zero arg method deprecated" warnings
- Resolving the SparkContext.sequenceFile implicits compile problem
- Fixing an odd but valid jetty-server missing dependency in hive-thriftserver

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19307 from srowen/Scala212.
2017-09-24 09:40:13 +01:00
WeichenXu f180b65343 [SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param persist/load bug
## What changes were proposed in this pull request?

Currently, param persistence/loading in CrossValidator/TrainValidationSplit is hardcoded, which differs from other ML estimators. This causes a persistence bug for the newly added `parallelism` param.

I refactored the related code to avoid hardcoding param persistence/loading, which at the same time fixes the `parallelism` persistence bug.

This refactoring is very useful because we will add more params in #19208, and hardcoded param persistence/loading makes adding new params very troublesome.

## How was this patch tested?

Test added.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19278 from WeichenXu123/fix-tuning-param-bug.
2017-09-22 18:15:01 -07:00
Zheng RuiFeng a8a5cd24e2 [SPARK-22009][ML] Using treeAggregate improve some algs
## What changes were proposed in this pull request?

I tested on a dataset of about 13M instances and found that using `treeAggregate` gives a speedup in the following algorithms:

|Algs| SpeedUp |
|------|-----------|
|OneHotEncoder| 5% |
|StatFunctions.calculateCov| 7% |
|StatFunctions.multipleApproxQuantiles|  9% |
|RegressionEvaluator| 8% |

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19232 from zhengruifeng/use_treeAggregate.
2017-09-21 20:06:42 +01:00
Zheng RuiFeng b21b806ecc [SPARK-22075][ML] GBTs unpersist datasets cached by Checkpointer
## What changes were proposed in this pull request?
`PeriodicRDDCheckpointer` automatically persists the last 3 datasets passed to `PeriodicRDDCheckpointer.update()`.
In GBTs, the last 3 intermediate RDDs therefore remain cached after `fit()` returns.

## How was this patch tested?
existing tests and local test in spark-shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19288 from zhengruifeng/gbt_unpersist.
2017-09-21 20:05:44 +01:00
Travis Hegner 79a4dab629 [SPARK-21958][ML] Word2VecModel save: transform data in the cluster
## What changes were proposed in this pull request?

Change a data transformation while saving a Word2VecModel to happen with distributed data instead of local driver data.

## How was this patch tested?

Unit tests for the ML sub-component still pass.
Running this patch against v2.2.0 in a fully distributed production cluster allows a 4.0G model to save and load correctly, where it would not do so without the patch.

Author: Travis Hegner <thegner@trilliumit.com>

Closes #19191 from travishegner/master.
2017-09-15 15:17:16 +02:00
Ming Jiang 8d8641f122 [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
## What changes were proposed in this pull request?

Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API

## How was this patch tested?

Added unit test

Author: Ming Jiang <mjiang@fanatics.com>
Author: Ming Jiang <jmwdpk@gmail.com>
Author: jmwdpk <jmwdpk@gmail.com>

Closes #19185 from jmwdpk/SPARK-21854.
2017-09-14 13:53:28 +08:00
Zheng RuiFeng 0fa5b7cacc [SPARK-21690][ML] one-pass imputer
## What changes were proposed in this pull request?
Parallelize the computation across all columns.

performance tests:

|numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
|------|----------|------------|----------|------------|----------|------------|
|1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|0.10525509870000001|0.0499620203|
|10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|0.09255724889999999|0.1573943635|
|100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|0.5557462431|1.7216837982000002|

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18902 from zhengruifeng/parallelize_imputer.
2017-09-13 20:12:21 +08:00
WeichenXu f6c5d8f692 [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag
## What changes were proposed in this pull request?

Add the missing `Since` tag for `setParallelism` in #19110.

## How was this patch tested?

N/A

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19214 from WeichenXu123/minor01.
2017-09-13 09:48:04 +01:00
Zheng RuiFeng c5f9b89dda [SPARK-18608][ML] Fix double caching
## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms involved in this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014
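
A sketch of the corrected persistence check, the pattern the affected algorithms share (`dataset` stands for the training input):

```scala
import org.apache.spark.storage.StorageLevel

// dataset.rdd creates a fresh RDD whose storage level is always NONE, so
// the old check re-cached input that was already cached. Asking the
// Dataset itself for its storage level avoids the double caching.
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
if (handlePersistence) {
  dataset.persist(StorageLevel.MEMORY_AND_DISK)
}
```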

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.
2017-09-12 11:37:05 -07:00
Ajay Saini 720c94fe77 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark
## What changes were proposed in this pull request?

Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.

I am taking over PR #18281 because the original author is busy, but we need to merge this PR soon.
After this is merged, we can close #18281.

## How was this patch tested?

Test suite added.

Author: Ajay Saini <ajays725@gmail.com>
Author: WeichenXu <weichen.xu@databricks.com>

Closes #19110 from WeichenXu123/spark-21027.
2017-09-12 10:02:27 -07:00
Marco Gaido dd78167585 [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.
## What changes were proposed in this pull request?

This PR adds the ClusteringEvaluator Evaluator which contains two metrics:
 - **cosineSilhouette**: the Silhouette measure using the cosine distance;
 - **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics refers to the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms have been thought for a distributed and parallel environment, thus they have reasonable performance, unlike a naive Silhouette implementation following its definition.

## How was this patch tested?

The patch has been tested with the additional unit tests added (comparing the results with the ones provided by [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)).

Author: Marco Gaido <mgaido@hortonworks.com>

Closes #18538 from mgaido91/SPARK-14516.
2017-09-12 17:59:53 +08:00
caoxuewen dc74c0e67d [MINOR][SQL] Remove unused import classes
## What changes were proposed in this pull request?

This PR removes import classes that are unused.

## How was this patch tested?

N/A

Author: caoxuewen <cao.xuewen@zte.com.cn>

Closes #19131 from heary-cao/unuse_import.
2017-09-11 10:09:20 +01:00
Xin Ren 31c74fec24 [SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API
https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test
`./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`

Author: Xin Ren <iamshrek@126.com>
Author: Xin Ren <renxin.ubc@gmail.com>
Author: Xin Ren <keypointt@users.noreply.github.com>

Closes #17451 from keypointt/SPARK-19866.
2017-09-08 12:09:00 -07:00
Bryan Cutler 16c4c03c71 [SPARK-19357][ML] Adding parallel model evaluation in ML tuning
## What changes were proposed in this pull request?
Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid.  The level of parallelism is controlled by a parameter `numParallelEval` used to schedule a number of models to be trained/evaluated so that the jobs can be run concurrently.  This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately.  The default value is `1` which will train/evaluate in serial.
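
A usage sketch; note the parameter shipped under the name `parallelism` (see the SPARK-22060 entry above), so this sketch assumes the final `setParallelism` setter rather than `numParallelEval`:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(2)  // train/evaluate up to 2 models concurrently
```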

## How was this patch tested?
Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
2017-09-06 14:12:27 +02:00
WeichenXu 900f14f6fa [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns
## What changes were proposed in this pull request?

Add test for prediction using the model with all combinations of output columns turned on/off.
Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.

## How was this patch tested?

Test updated.

Author: WeichenXu <weichen.xu@databricks.com>
Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.
2017-09-01 17:32:33 -07:00
Sean Owen 12ab7f7e89 [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

## What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.

- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does _not_ fully enable a Scala 2.12 build:

- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

## How was this patch tested?

Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.

Author: Sean Owen <sowen@cloudera.com>

Closes #18645 from srowen/SPARK-14280.
2017-09-01 19:21:21 +01:00
WeichenXu f5e10a34e6 [SPARK-21862][ML] Add overflow check in PCA
## What changes were proposed in this pull request?

Add an overflow check in PCA; otherwise it can throw a `NegativeArraySizeException` when `k` and `numFeatures` are too large.
The overflow-check formula is here:

## How was this patch tested?

N/A

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19078 from WeichenXu123/SVD_overflow_check.
2017-08-31 16:25:10 -07:00
WeichenXu 96028e36b4 [SPARK-17139][ML][FOLLOW-UP] Add convenient method asBinary for casting to BinaryLogisticRegressionSummary
## What changes were proposed in this pull request?

Add an `asBinary` method to LogisticRegressionSummary for convenient casting to BinaryLogisticRegressionSummary.
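
A usage sketch, assuming `model` is a fitted binary `LogisticRegressionModel`:

```scala
// model.summary is typed as the generic LogisticRegressionTrainingSummary;
// asBinary casts it (and fails if the model is not actually binary).
val binarySummary = model.summary.asBinary
println(binarySummary.areaUnderROC)
```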

## How was this patch tested?

Testcase updated.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19072 from WeichenXu123/mlor_summary_as_binary.
2017-08-31 16:22:40 -07:00
Bryan Cutler 4133c1b0ab [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher
## What changes were proposed in this pull request?

This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.

## How was this patch tested?

Manually ran examples and verified that output is consistent for different APIs

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
2017-08-30 16:00:29 +02:00
Sean Owen 734ed7a7b3 [SPARK-21806][MLLIB] BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
## What changes were proposed in this pull request?

Prepend (0, p) to the precision-recall curve instead of (0, 1), where p matches the precision at the lowest recall point.

## How was this patch tested?

Updated tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #19038 from srowen/SPARK-21806.
2017-08-30 11:36:00 +01:00
Joseph K. Bradley 840ba053b9 [MINOR][ML] Document treatment of instance weights in logreg summary
## What changes were proposed in this pull request?

Add Scaladoc noting that instance weights are currently ignored in the logistic regression summary traits.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #19071 from jkbradley/lr-summary-minor.
2017-08-29 13:01:37 -07:00
Weichen Xu c7270a46fc [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression
## What changes were proposed in this pull request?

Add 4 traits, using the following hierarchy:
- LogisticRegressionSummary
- LogisticRegressionTrainingSummary extends LogisticRegressionSummary
- BinaryLogisticRegressionSummary extends LogisticRegressionSummary
- BinaryLogisticRegressionTrainingSummary extends LogisticRegressionTrainingSummary and BinaryLogisticRegressionSummary

Public methods such as `def summary` only return the trait types listed above.

Then implement 4 concrete classes:
- LogisticRegressionSummaryImpl (multiclass case)
- LogisticRegressionTrainingSummaryImpl (multiclass case)
- BinaryLogisticRegressionSummaryImpl (binary case)
- BinaryLogisticRegressionTrainingSummaryImpl (binary case)

## How was this patch tested?

Existing tests & added tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15435 from WeichenXu123/mlor_summary.
2017-08-28 13:31:01 -07:00
WeichenXu 0456b40508 [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result
## What changes were proposed in this pull request?

Because of numerical error, `MultivariateOnlineSummarizer.variance` can generate a negative variance.

**This is a serious bug: many algorithms in MLlib use a stddev computed from `sqrt(variance)`, which will generate NaN and crash the whole algorithm.**

We can reproduce this bug with the following code:
```scala
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    println(summarizer.variance(0))
```
This PR fixes the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.

## How was this patch tested?

test cases added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
2017-08-28 07:41:42 +01:00
Sean Owen de7af295c2 [MINOR][BUILD] Fix build warnings and Java lint errors
## What changes were proposed in this pull request?

Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19051 from srowen/JavaWarnings.
2017-08-25 16:07:13 +01:00
Yuhao Yang f3676d6391 [SPARK-21108][ML] convert LinearSVC to aggregator framework
## What changes were proposed in this pull request?

Convert LinearSVC to the new aggregator framework.

## How was this patch tested?

existing unit test.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #18315 from hhbyyh/svcAggregator.
2017-08-25 10:22:27 +08:00
Weichen Xu d6b30edd49 [SPARK-12664][ML] Expose probability in mlp model
## What changes were proposed in this pull request?

Modify the MLP model to inherit `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.
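
A sketch of the effect, assuming `train`/`test` are DataFrames with the usual `features`/`label` columns:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3))  // input, hidden, and output layer sizes
val model = mlp.fit(train)
// With this change the transformed output includes a probability column.
model.transform(test).select("probability", "prediction").show()
```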

## How was this patch tested?

Test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.
2017-08-22 21:16:34 -07:00
Yanbo Liang 3429619055 [ML][MINOR] Make sharedParams update.
## What changes were proposed in this pull request?
```sharedParams.scala``` was generated by ```SharedParamsCodeGen```, but it's out of date in master. Someone probably updated ```sharedParams.scala``` manually; this PR fixes the issue.

## How was this patch tested?
Offline check.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #19011 from yanboliang/sharedParams.
2017-08-23 11:06:53 +08:00
Weichen Xu d56c262109 [SPARK-21681][ML] Fix bug where MLOR does not work correctly when featureStd contains zero
## What changes were proposed in this pull request?

Fix a bug where MLOR does not work correctly when featureStd contains zero.

We can reproduce the bug with a dataset whose features include zero variance; it generates a wrong result (all coefficients become 0):
```scala
    // Note: generateMultinomialLogisticInput and seed are test helpers from
    // LogisticRegressionSuite; lit is org.apache.spark.sql.functions.lit.
    val multinomialDatasetWithZeroVar = {
      val nPoints = 100
      val coefficients = Array(
        -0.57997, 0.912083, -0.371077,
        -0.16624, -0.84355, -0.048509)

      val xMean = Array(5.843, 3.0)
      val xVariance = Array(0.6856, 0.0)  // including zero variance

      val testData = generateMultinomialLogisticInput(
        coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

      val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
      df.cache()
      df
    }
```
## How was this patch tested?

testcase added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.
2017-08-22 16:55:34 -07:00
Yanbo Liang c108a5d30e [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization.
## What changes were proposed in this pull request?
MLlib ```LinearRegression/LogisticRegression/LinearSVC``` always standardizes the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently, to get effectively the same objective function when the training dataset is not standardized. We should keep these comments in the code to let developers understand how we handle it correctly.

## How was this patch tested?
Existing tests, only adding some comments in code.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18992 from yanboliang/SPARK-19762.
2017-08-22 08:43:18 +08:00
Nick Pentreath 988b84d7ed [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher
Add Python API for `FeatureHasher` transformer.

## How was this patch tested?

New doc test.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
2017-08-21 14:35:38 +02:00
Cédric Pelvet 73e04ecc4f [MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression
## What changes were proposed in this pull request?

The line `SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)` did not modify the variable `schema`, hence only the last line had any effect. A temporary variable is used to correctly append the two columns `predictionCol` and `probabilityCol`.
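
A minimal sketch of the fix (`SchemaUtils` is Spark ML's internal helper; the probability column type shown is illustrative):

```scala
// Before: the first appendColumn's result was discarded, because
// appendColumn returns a new schema instead of mutating its argument.
val schemaWithPrediction =
  SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
SchemaUtils.appendColumn(schemaWithPrediction, $(probabilityCol), new VectorUDT)
```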

## How was this patch tested?

Manually.

Author: Cédric Pelvet <cedric.pelvet@gmail.com>

Closes #18980 from sharp-pixel/master.
2017-08-20 11:05:54 +01:00
Peng Meng a0345cbebe [SPARK-21680][ML][MLLIB] optimize Vector compress
## What changes were proposed in this pull request?

When using `Vector.compressed` to convert a `Vector` to a `SparseVector`, performance is very low compared with `Vector.toSparse`.
This is because `Vector.compressed` has to scan the values three times, while `Vector.toSparse` only needs two scans.
When the vector is long, there is a significant performance difference between these two methods.

## How was this patch tested?

The existing UT

Author: Peng Meng <peng.meng@intel.com>

Closes #18899 from mpjlu/optVectorCompress.
2017-08-16 19:05:20 +01:00
Nick Pentreath 0bb8d1f30a [SPARK-13969][ML] Add FeatureHasher transformer
This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).

The transformer operates on multiple input columns in one pass. Current behavior is:
* for numerical columns, the values are assumed to be real values and the feature index is `hash(columnName)` while feature value is `feature_value`
* for string columns, the values are assumed to be categorical and the feature index is `hash(column_name=feature_value)`, while feature value is `1.0`
* For hash collisions, feature values will be summed
* `null` (missing) values are ignored

The following dataframe illustrates the basic semantics (a usage sketch follows it):
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```
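
A usage sketch that produces a frame like the one above (assuming `df` holds the input columns shown):

```scala
import org.apache.spark.ml.feature.FeatureHasher

val hasher = new FeatureHasher()
  .setInputCols("int", "double", "float", "stringNum", "string")
  .setOutputCol("features")
  .setNumFeatures(16)  // deliberately small here, to make collisions visible
hasher.transform(df).show(false)
```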

## How was this patch tested?

New unit tests and manual experiments.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18513 from MLnick/FeatureHasher.
2017-08-16 10:54:28 +02:00
Jan Vrsovsky 8321c141f6 [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)
## What changes were proposed in this pull request?

Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon

(Maybe the usage should be forbidden when writing, in a major version change?).

## How was this patch tested?

Manually tested that loading and writing LibSVM files works fine, both with and without the numFeatures option.

Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>

Closes #18872 from ProtD/master.
2017-08-16 08:21:42 +01:00
WeichenXu 07549b20a3 [SPARK-19634][ML] Multivariate summarizer - dataframes API
## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
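
A sketch of the resulting DataFrames API (in `org.apache.spark.ml.stat.Summarizer`), assuming `df` has a vector `features` column:

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Compute only the requested metrics in a single pass over df.
df.select(
    Summarizer.metrics("mean", "variance")
      .summary(col("features")).as("stats"))
  .show(false)

// Or select a single metric directly:
df.select(Summarizer.mean(col("features"))).show(false)
```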

## How was this patch tested?

Testcases added.

## Performance
Resolves several performance issues from #17419; further optimization is pending the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in #18712; thanks liancheng and cloud-fan.

### Performance data

(tested on my laptop, using 2 partitions; tries = 20, warm-up = 10)

The unit of test results is records/milliseconds (higher is better)

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149  | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992  | 4440 | 2328 | 320 | 33
raw RDD | 53931  | 20683 | 3966 | 528 | 53

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
2017-08-16 10:41:05 +08:00
Marcelo Vanzin 3f958a9992 [SPARK-21731][BUILD] Upgrade scalastyle to 0.9.
This version fixes a few issues in the import order checker; it provides
better error messages, and detects more improper ordering (thus the need
to change a lot of files in this patch). The main fix is that it correctly
complains about the order of packages vs. classes.

As part of the above, I moved some "SparkSession" import in ML examples
inside the "$example on$" blocks; that didn't seem consistent across
different source files to start with, and avoids having to add more on/off blocks
around specific imports.

The new scalastyle also seems to have a better header detector, so a few
license headers had to be updated to match the expected indentation.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18943 from vanzin/SPARK-21731.
2017-08-15 13:59:00 -07:00
Peng Meng ca6955858c [SPARK-21638][ML] Fix RF/GBT Warning message error
## What changes were proposed in this pull request?

When training an RF model, there are many warning messages like this:

> WARN  RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.

This warning message is unnecessary, and the numbers in it are not accurate.

Actually, the warning is shown whenever not all nodes can be split in one iteration. In most cases not all nodes can be split in a single iteration, so in most cases the warning is shown on every iteration.

## How was this patch tested?
The existing UT

Author: Peng Meng <peng.meng@intel.com>

Closes #18868 from mpjlu/fixRFwarning.
2017-08-10 21:38:03 +01:00
WeichenXu b35660dd0e [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
## What changes were proposed in this pull request?

Update breeze to 0.13.2 for an emergency bugfix in the strong Wolfe line search:
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.
2017-08-09 14:44:10 +08:00
Ajay Saini fdcee028af [SPARK-21542][ML][PYTHON] Python persistence helper functions
## What changes were proposed in this pull request?

Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.

## How was this patch tested?

Instantiated an estimator with JSON-serializable parameters (e.g., LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure it is the same. This test was done both in the Python REPL and in the unit tests.

Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
2017-08-07 17:03:20 -07:00
Peng Meng 1426eea84c [SPARK-21623][ML] fix RF doc
## What changes were proposed in this pull request?

The comments on parentStats in RF are wrong:
parentStats is not only used in the first iteration; it is used in every iteration for unordered features.

## How was this patch tested?

Author: Peng Meng <peng.meng@intel.com>

Closes #18832 from mpjlu/fixRFDoc.
2017-08-07 11:03:07 +01:00
actuaryzhang 55aa4da285 [SPARK-21622][ML][SPARKR] Support offset in SparkR GLM
## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18831 from actuaryzhang/sparkROffset.
2017-08-06 15:14:12 -07:00
Zheng RuiFeng 253a07e43a [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LinearSVC from HasThreshold
## What changes were proposed in this pull request?
GBTs inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #18612 from zhengruifeng/override_HasXXX.
2017-08-01 21:34:26 +08:00
wangmiao1981 9570e81aa9 [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms
## What changes were proposed in this pull request?

SPARK-20307 added the handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter to the other classification algorithms in SparkR.

This is a follow-up PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18605 from wangmiao1981/class.
2017-07-31 20:37:06 -07:00
Yan Facai (颜发才) a5a3189974 [SPARK-21306][ML] OneVsRest should support setWeightCol
## What changes were proposed in this pull request?

Add a `setWeightCol` method to OneVsRest.

`weightCol` is ignored if the classifier doesn't inherit the HasWeightCol trait.
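
A usage sketch, assuming the training frame carries a `weight` column:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())  // LogisticRegression has HasWeightCol
  .setWeightCol("weight")
```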

## How was this patch tested?

+ [x] add a unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
2017-07-28 10:10:35 +08:00
actuaryzhang ddcd2e8269 [SPARK-19270][ML] Add summary table to GLM summary
## What changes were proposed in this pull request?

Add an R-like summary table to the GLM summary, which includes feature names (if they exist), parameter estimates, standard errors, t-stats, and p-values. This allows Scala users to easily gather these commonly used inference results.
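
A sketch of accessing these results, assuming `model` is a fitted `GeneralizedLinearRegressionModel`:

```scala
val summary = model.summary
// Printing the summary now includes the R-like coefficients table.
println(summary)
// The underlying statistics remain individually accessible:
println(summary.coefficientStandardErrors.mkString(", "))
println(summary.tValues.mkString(", "))
println(summary.pValues.mkString(", "))
```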

srowen yanboliang  felixcheung

## How was this patch tested?
New tests: one for testing the feature names, and one for testing the summary table.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16630 from actuaryzhang/glmTable.
2017-07-27 22:00:59 +08:00
sethah cf29828d72 [SPARK-20988][ML] Logistic regression uses aggregator hierarchy
## What changes were proposed in this pull request?

This change pulls the `LogisticAggregator` class out of LogisticRegression.scala and makes it extend `DifferentiableLossAggregator`. It also changes logistic regression to use the generic `RDDLossFunction` instead of having its own.

Other minor changes:
* L2Regularization accepts `Option[Int => Double]` for features standard deviation
* L2Regularization uses `Vector` type instead of Array
* Some tests added to LeastSquaresAggregator

## How was this patch tested?

Unit test suites are added.

Author: sethah <shendrickson@cloudera.com>

Closes #18305 from sethah/SPARK-20988.
2017-07-26 13:38:53 +02:00
Yuhao Yang ae4ea5fe25 [SPARK-21524][ML] unit test fix: ValidatorParamsSuiteHelpers generates wrong temp files
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-21524

ValidatorParamsSuiteHelpers.testFileMove() generates temp dirs in the wrong place and does not delete them.

ValidatorParamsSuiteHelpers.testFileMove() is invoked by TrainValidationSplitSuite and CrossValidatorSuite. Currently it uses `tempDir` from `TempDirectory`, which unfortunately is never initialized, since the `beforeAll()` of `ValidatorParamsSuiteHelpers` is never invoked.

On my system, it leaves some temp directories in the assembly folder each time I run TrainValidationSplitSuite and CrossValidatorSuite.

## How was this patch tested?
unit test fix

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #18728 from hhbyyh/tempDirFix.
2017-07-26 10:37:48 +01:00
Yanbo Liang 5d1850d4b5 [MINOR][ML] Reorg RFormula params.
## What changes were proposed in this pull request?
There are a few reasons for this reorg:
* Some params are placed in ```RFormulaBase```, while others are placed in ```RFormula```; this is disordered.
* ```RFormulaModel``` should have the params ```handleInvalid```, ```formula``` and ```forceIndexLabel```, so that users can get the invalid-value handling policy, the formula, or whether to force-index the label even when they only have a ```RFormulaModel```. So we need to move these params to ```RFormulaBase```, which is also inherited by ```RFormulaModel```.
* ```RFormulaModel``` should support setting a different ```handleInvalid``` during cross validation.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18681 from yanboliang/rformula-reorg.
2017-07-20 20:07:16 +08:00
Sean Owen d3f4a21196 [SPARK-15526][ML][FOLLOWUP] Make JPMML provided scope to avoid including unshaded JARs, and repromote to compile in MLlib
Following the comment at https://issues.apache.org/jira/browse/SPARK-15526?focusedCommentId=16086106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16086106 -- this change actually needed a little more work to be complete.

This also marks JPMML as `provided` to make sure its JARs aren't included in the `jars` output, but then scopes to `compile` in `mllib`. This is how Guava is handled.

Checked result in `assembly/target/scala-2.11/jars` to verify there are no JPMML jars. Maven and SBT builds still work.

Author: Sean Owen <sowen@cloudera.com>

Closes #18637 from srowen/SPARK-15526.2.
2017-07-18 09:53:51 -07:00
Sean Owen e26dac5feb [SPARK-21415] Triage scapegoat warnings, part 1
## What changes were proposed in this pull request?

Address scapegoat warnings for:
- BigDecimal double constructor
- Catching NPE
- Finalizer without super
- List.size is O(n)
- Prefer Seq.empty
- Prefer Set.empty
- reverse.map instead of reverseMap
- Type shadowing
- Unnecessary if condition.
- Use .log1p
- Var could be val

In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #18635 from srowen/Scapegoat1.
2017-07-18 08:47:17 +01:00
Ajay Saini 7047f49f45 [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidationSplit persistence to PySpark.

## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
2017-07-17 10:07:32 -07:00
Yanbo Liang 69e5282d3c [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid values in both the features and the label column.
#18496 only handled invalid values in the features column. This PR adds handling of invalid values in the label column, plus test cases.

## How was this patch tested?
Add test cases.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18613 from yanboliang/spark-20307.
2017-07-15 20:56:38 +08:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
Zheng RuiFeng d2d2a5de18 [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
## What changes were proposed in this pull request?
1, make HasHandleInvalid support override
2, make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid

## How was this patch tested?
existing tests

[JIRA](https://issues.apache.org/jira/browse/SPARK-18619)

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.
2017-07-12 22:09:03 +08:00
caoxuewen 330bf5c998 [SPARK-20609][MLLIB][TEST] manually cleared 'spark.local.dir' before/after a test in ALSCleanerSuite
## What changes were proposed in this pull request?

This PR is similar to #17869.
Once `spark.local.dir` is set, it could return the same directory across tests unless it is manually cleared before/after each test.
This adds such before/after clearing for each test in ALSCleanerSuite, likewise.

## How was this patch tested?
existing test.

Author: caoxuewen <cao.xuewen@zte.com.cn>

Closes #18537 from heary-cao/ALSCleanerSuite.
2017-07-08 08:34:51 +01:00
wangmiao1981 a7b46c627b [SPARK-20307][SPARKR] SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer
## What changes were proposed in this pull request?

For the randomForest classifier, if the test data contains unseen labels, it throws an error. StringIndexer already has the handleInvalid logic. This patch adds a new method to set the underlying StringIndexer's handleInvalid logic.

This patch should also apply to other classifiers. This PR focuses on the main logic and the randomForest classifier; I will do follow-up PRs for the other classifiers.

## How was this patch tested?

Add a new unit test based on the error case in the JIRA.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18496 from wangmiao1981/handle.
2017-07-07 23:51:32 -07:00
Yan Facai (颜发才) 56536e9992 [SPARK-21285][ML] VectorAssembler reports the column name of unsupported data type
## What changes were proposed in this pull request?
Add the column name to the exception raised for unsupported data types.

## How was this patch tested?
+ [x] pass all tests.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18523 from facaiy/ENH/vectorassembler_add_col.
2017-07-07 18:32:01 +08:00
hyukjinkwon d451b7f43d [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in LibSVMFileFormat and allow multiple input paths for determining numFeatures
## What changes were proposed in this pull request?

This is related with [SPARK-19918](https://issues.apache.org/jira/browse/SPARK-19918) and [SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).

This PR proposes to use `TextFileFormat` and allow multiple input paths (but with a warning) when determining the number of features in LibSVM data source via an extra scan.

There are three points here:

- The main advantage of this change should be to remove file-listing bottlenecks in driver side.

- Another advantage is ones from using `FileScanRDD`. For example, I guess we can use `spark.sql.files.ignoreCorruptFiles` option when determining the number of features.

- We can unify the schema inference code path in text based data sources. This is also a preparation for [SPARK-21289](https://issues.apache.org/jira/browse/SPARK-21289).

## How was this patch tested?

Unit tests in `LibSVMRelationSuite`.

Closes #18288

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18556 from HyukjinKwon/libsvm-schema.
2017-07-07 12:24:03 +08:00
Wenchen Fan 40c7add3a4 [SPARK-20946][SQL] Do not update conf for existing SparkContext in SparkSession.getOrCreate
## What changes were proposed in this pull request?

SparkContext is shared by all sessions; we should not update its conf for only one session.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18536 from cloud-fan/config.
2017-07-07 08:44:31 +08:00
dardelet 4d6d8192c8 [SPARK-21268][MLLIB] Move center calculations to a distributed map in KMeans
## What changes were proposed in this pull request?

The scal() call and the creation of the newCenter vector are done in the driver, after a collectAsMap operation, while they could be done in the distributed RDD.
This PR moves this code before the collectAsMap for more efficiency.

## How was this patch tested?

This was tested manually by running the KMeansExample and verifying that the new code ran without error and gave same output as before.

Author: dardelet <guillaumegorp@gmail.com>
Author: Guillaume Dardelet <dardelet@users.noreply.github.com>

Closes #18491 from dardelet/move-center-calculation-to-distributed-map-kmean.
2017-07-04 17:58:44 +01:00
Thomas Decaux 8ca4ebefa6 [MINOR] Add french stop word "les"
## What changes were proposed in this pull request?

Added "les" as a French stop word (plural of "le").

Author: Thomas Decaux <ebuildy@gmail.com>

Closes #18514 from ebuildy/patch-1.
2017-07-04 12:17:48 +01:00
Ruifeng Zheng e0b047eafe [SPARK-18518][ML] HasSolver supports override
## What changes were proposed in this pull request?
1, make param support non-final with `finalFields` option
2, generate `HasSolver` with `finalFields = false`
3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver`

## How was this patch tested?
existing tests

Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16028 from zhengruifeng/param_non_final.
2017-07-01 15:37:41 +08:00
actuaryzhang 37ef32e515 [SPARK-21275][ML] Update GLM test to use supportedFamilyNames
## What changes were proposed in this pull request?
Update GLM test to use supportedFamilyNames as suggested here:
https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18495 from actuaryzhang/mlGlmTest2.
2017-07-01 14:57:57 +08:00
Yanbo Liang 528c9281ae [ML] Fix scala-2.10 build failure of GeneralizedLinearRegressionSuite.
## What changes were proposed in this pull request?
Fix scala-2.10 build failure of ```GeneralizedLinearRegressionSuite```.

## How was this patch tested?
Build with scala-2.10.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18489 from yanboliang/glr.
2017-06-30 23:25:14 +08:00
actuaryzhang 49d767d838 [SPARK-18710][ML] Add offset in GLM
## What changes were proposed in this pull request?
Add support for offset in GLM. This is useful for at least two reasons:

1. Account for exposure: e.g., when modeling the number of accidents, we may need to use miles driven as an offset to assess factors on frequency (see the sketch below).
2. Test incremental effects of new variables: we can use predictions from the existing model as an offset and run a much smaller model on only the new variables. This avoids re-estimating the large model with all variables (old + new) and can be very important for efficient large-scale analysis.
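
A usage sketch for the exposure case, assuming `df` has a hypothetical `logMiles` column holding log(miles driven):

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Model accident counts with log(miles driven) as a Poisson offset.
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setOffsetCol("logMiles")
val model = glr.fit(df)
```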

## How was this patch tested?
New test.

yanboliang srowen felixcheung sethah

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16699 from actuaryzhang/offset.
2017-06-30 20:02:15 +08:00
Nick Pentreath 70085e83d1 [SPARK-21210][DOC][ML] Javadoc 8 fixes for ML shared param traits
PR #15999 included fixes for doc strings in the ML shared param traits (occurrences of `>` and `>=`).

This PR simply uses the HTML-escaped version of the param doc to embed into the Scaladoc, to ensure that when `SharedParamsCodeGen` is run, the generated javadoc will be compliant for Java 8.

## How was this patch tested?
Existing tests

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18420 from MLnick/shared-params-javadoc8.
2017-06-29 09:51:12 +01:00
Yanbo Liang 0c8444cf6d [SPARK-14657][SPARKR][ML] RFormula w/o intercept should output reference category when encoding string terms
## What changes were proposed in this pull request?

Please see [SPARK-14657](https://issues.apache.org/jira/browse/SPARK-14657) for detail of this bug.
I searched online and tested some other cases, and found that when we fit an R glm model (or other models powered by R formula) w/o intercept on a dataset including string/categorical features, one of the categories in the first categorical feature is used as the reference category, and we will not drop any category for that feature.
I think we should keep consistent semantics between Spark RFormula and R formula.
## How was this patch tested?

Add standard unit tests.

cc mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12414 from yanboliang/spark-14657.
2017-06-29 10:32:32 +08:00
wangmiao1981 53543374ce [SPARK-20906][SPARKR] Constrained Logistic Regression for SparkR
## What changes were proposed in this pull request?

PR https://github.com/apache/spark/pull/17715 added Constrained Logistic Regression to ML. We should add it to SparkR.

## How was this patch tested?

Add new unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18128 from wangmiao1981/test.
2017-06-21 20:42:45 -07:00
actuaryzhang ad459cfb1d [SPARK-20917][ML][SPARKR] SparkR supports string encoding consistent with R
## What changes were proposed in this pull request?

Add `stringIndexerOrderType` to `spark.glm` and `spark.survreg` to support string encoding that is consistent with default R.

## How was this patch tested?
new tests

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18140 from actuaryzhang/sparkRFormula.
2017-06-21 10:35:16 -07:00
Joseph K. Bradley cc67bd5732 [SPARK-20929][ML] LinearSVC should use its own threshold param
## What changes were proposed in this pull request?

LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.

## How was this patch tested?

New unit test to make sure the threshold can be set to any Double value.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
2017-06-19 23:04:17 -07:00
Joseph K. Bradley ff318c0d2f [SPARK-21050][ML] Word2vec persistence overflow bug fix
## What changes were proposed in this pull request?

The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to hit an overflow when calculating the number of partitions for ML persistence.

This modifies the calculations to use Long.

## How was this patch tested?

New unit test.  I verified that the test fails before this patch.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #18265 from jkbradley/word2vec-save-fix.
2017-06-12 14:27:57 -07:00
sethah 1665b5f724 [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code
## What changes were proposed in this pull request?

JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)

The larger changes in this patch are:

* Adds a `DifferentiableLossAggregator` trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
* Adds a `RDDLossFunction` which is intended to be the only implementation of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in a optional regularization loss function for applying the differentiable part of regularization.
* Factors out the regularization from the data part of the cost function, and treats regularization as a separate independent cost function which can be evaluated and added to the data cost function.
* Changes `LinearRegression` to use this new hierarchy as a proof of concept.
* Adds the following new namespaces `o.a.s.ml.optim.loss` and `o.a.s.ml.optim.aggregator`

Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.

**NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.**

BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.
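
For readers, a simplified sketch of the shared trait described above (abridged; not the exact internals):

```scala
trait DifferentiableLossAggregator[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]] {
  self: Agg =>
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected def dim: Int
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  /** Add a single weighted data point to this aggregator. */
  def add(instance: Datum): Agg

  /** Merge another aggregator of the same type (used inside treeAggregate). */
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    var i = 0
    while (i < dim) { gradientSumArray(i) += other.gradientSumArray(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
}
```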

## How was this patch tested?
Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.

* DifferentiableLossAggregatorSuite
* LeastSquaresAggregatorSuite
* RDDLossFunctionSuite
* DifferentiableRegularizationSuite

Below are some performance testing numbers. Run on a 6 node virtual cluster with 44 cores and ~110G RAM, the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now.

**Note:** timings are best of 3 runs.

|    |   numFeatures |   numPoints |   maxIter |   regParam |   elasticNetParam |   SPARK-19762 (sec) |   master (sec) |
|----|---------------|-------------|-----------|------------|-------------------|---------------------|----------------|
|  0 |          5000 |       1e+06 |        30 |       0    |               0   |             129.594 |        131.153 |
|  1 |          5000 |       1e+06 |        30 |       0.1  |               0   |             135.54  |        136.327 |
|  2 |          5000 |       1e+06 |        30 |       0.01 |               0.5 |             135.148 |        129.771 |
|  3 |         50000 |  100000     |        30 |       0    |               0   |             145.764 |        144.096 |

## Follow ups

If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.

Author: sethah <seth.hendrickson16@gmail.com>
Author: sethah <shendrickson@cloudera.com>

Closes #17094 from sethah/ml_aggregators.
2017-06-05 10:32:17 +01:00
Zheng RuiFeng 98b5ccd32b [SPARK-20930][ML] Destroy broadcasted centers after computing cost in KMeans
## What changes were proposed in this pull request?
Destroy the broadcasted centers after computing the cost.
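
The pattern applied by the fix, as a sketch (`sc`, `centers`, `data` and `distanceToNearest` are hypothetical stand-ins):

```scala
val bcCenters = sc.broadcast(centers)
val cost = data
  .map(point => distanceToNearest(point, bcCenters.value)) // uses the broadcast on executors
  .sum()
bcCenters.destroy() // release the broadcast eagerly instead of waiting for GC
```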
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18152 from zhengruifeng/destroy_kmeans_model.
2017-06-05 10:25:09 +01:00
David Eis 96e6ba6c2a [SPARK-20790][MLLIB] Remove extraneous logging in test
## What changes were proposed in this pull request?

Remove extraneous logging.

## How was this patch tested?

Unit tests pass.

Author: David Eis <deis@bloomberg.net>

Closes #18188 from davideis/fix-test.
2017-06-03 09:48:10 +01:00
Yin Huai 0eb1fc6cd5 Revert "[SPARK-20946][SQL] simplify the config setting logic in SparkSession.getOrCreate"
This reverts commit e11d90bf8d.
2017-06-02 15:36:21 -07:00
Wenchen Fan e11d90bf8d [SPARK-20946][SQL] simplify the config setting logic in SparkSession.getOrCreate
## What changes were proposed in this pull request?

The current conf setting logic is a little complex and has duplication, this PR simplifies it.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18172 from cloud-fan/session.
2017-06-02 10:05:05 -07:00
John Compitello 0975019cd4 [SPARK-20109][MLLIB] Rewrote toBlockMatrix method on IndexedRowMatrix
## What changes were proposed in this pull request?

- ~~I added the method `toBlockMatrixDense` to the IndexedRowMatrix class. The current implementation of `toBlockMatrix` is insufficient for users with relatively dense IndexedRowMatrix objects, since it assumes sparsity.~~

EDIT: Ended up deciding that there should be just a single `toBlockMatrix` method, which creates a BlockMatrix whose blocks may be dense or sparse depending on the sparsity of the rows. This method will work better on any current use case of `toBlockMatrix` and doesn't go through `CoordinateMatrix` like the old method.
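
Usage is unchanged; a short sketch (assuming an existing `IndexedRowMatrix` named `irm`):

```scala
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Blocks now come out dense or sparse depending on the sparsity of the rows.
val blockMat: BlockMatrix = irm.toBlockMatrix(rowsPerBlock = 1024, colsPerBlock = 1024)
blockMat.validate() // sanity-check block dimensions
```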

## How was this patch tested?

~~I used the same tests already written for `toBlockMatrix()` to test this method. I also added a new additional unit test for an edge case that was not adequately tested by current test suite.~~

I ran the original `IndexedRowMatrix` tests, plus wrote more to better handle edge cases ignored by original tests.

Author: John Compitello <johnc@broadinstitute.org>

Closes #17459 from johnc1231/johnc-fix-ir-to-block.
2017-06-01 05:42:42 -04:00
David Eis d52f636228 [SPARK-20790][MLLIB] Correctly handle negative values for implicit feedback in ALS
## What changes were proposed in this pull request?

Revert the handling of negative values in ALS with implicit feedback, so that the confidence is the absolute value of the rating and the preference is 0 for negative ratings. This was the original behavior.
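
In other words (a sketch with assumed names, following the implicit-feedback ALS formulation):

```scala
// Confidence grows with the magnitude of the rating; negative ratings express
// zero preference rather than being ignored.
val c1 = alpha * math.abs(rating) // confidence = 1 + c1
val p = if (rating > 0.0) 1.0 else 0.0
```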

## How was this patch tested?

This patch was tested with the existing unit tests and an added unit test to ensure that negative ratings are not ignored.

mengxr

Author: David Eis <deis@bloomberg.net>

Closes #18022 from davideis/bugfix/negative-rating.
2017-05-31 13:52:55 +01:00
Wayne Zhang f47700c9ca [SPARK-14659][ML] RFormula consistent with R when handling strings
## What changes were proposed in this pull request?
When handling strings, the category dropped by RFormula and R are different:
- RFormula drops the least frequent level
- R drops the first level after ascending alphabetical ordering

This PR supports different string ordering types in StringIndexer #17879 so that RFormula can drop the same level as R when handling strings using `stringOrderType = "alphabetDesc"`.
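
A minimal usage sketch (on RFormula the corresponding param is named `stringIndexerOrderType`; column names are hypothetical):

```scala
import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("label ~ country + age")
  .setStringIndexerOrderType("alphabetDesc") // drop the same reference level as R
```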

## How was this patch tested?
new tests

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17967 from actuaryzhang/RFormula.
2017-05-26 10:44:40 +08:00
Yanbo Liang ad09e4ca04 [MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary.
## What changes were proposed in this pull request?
Joint coefficients with intercept for SparkR linear SVM summary.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18035 from yanboliang/svm-r.
2017-05-23 16:16:14 +08:00
Zheng RuiFeng 4be3375835 [SPARK-15767][ML][SPARKR] Decision Tree wrapper in SparkR
## What changes were proposed in this pull request?
support decision tree in R

## How was this patch tested?
added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17981 from zhengruifeng/dt_r.
2017-05-22 10:40:49 -07:00
Ignacio Bermudez 06dda1d58f [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix
## What changes were proposed in this pull request?

When two Breeze SparseMatrices are combined in an operation, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This is inconsistent with the colPtrs data, but Breeze gets away with it by keeping a counter of the valid entries.

In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, which might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using Breeze's counter of active entries.

This conversion is called, at minimum, by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.

See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add

## How was this patch tested?

Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.

Bugfix for https://issues.apache.org/jira/browse/SPARK-20687

Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
Author: Ignacio Bermudez Corrales <icorrales@splunk.com>

Closes #17940 from ghoto/bug-fix/SPARK-20687.
2017-05-22 10:27:28 +01:00
Nick Pentreath 25b4f41d23 [SPARK-20677][MLLIB][ML] Follow-up to ALS recommend-all performance PRs
Small clean ups from #17742 and #17845.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17919 from MLnick/SPARK-20677-als-perf-followup.
2017-05-16 10:59:34 +02:00
Yanbo Liang dbe81633a7 [SPARK-20501][ML] ML 2.2 QA: New Scala APIs, docs
## What changes were proposed in this pull request?
Review new Scala APIs introduced in 2.2.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17934 from yanboliang/spark-20501.
2017-05-15 21:21:54 -07:00
Yanbo Liang d4022d4951 [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
## What changes were proposed in this pull request?
Before 2.2, MLlib's practice was to remove APIs deprecated in the last feature/minor release. But from Spark 2.2 on, we decided to remove deprecated APIs in a major release, so we need to change the corresponding annotations to tell users those will be removed in 3.0.
Meanwhile, this fixes bugs in the ML documents. The original ML docs couldn't show deprecation annotations in ```MLWriter```- and ```MLReader```-related classes; we correct that in this PR.

Before:
![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png)

After:
![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png)

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17946 from yanboliang/spark-20707.
2017-05-16 10:08:23 +08:00
Zheng RuiFeng 9970aa0962 [SPARK-20669][ML] LoR.family and LDA.optimizer should be case insensitive
## What changes were proposed in this pull request?
make param `family` in LoR and `optimizer` in LDA case insensitive

## How was this patch tested?
updated tests

yanboliang

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17910 from zhengruifeng/lr_family_lowercase.
2017-05-15 23:21:44 +08:00
Wayne Zhang af40bb1159 [SPARK-20619][ML] StringIndexer supports multiple ways to order label
## What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL.  For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:
- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.
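
A minimal usage sketch (column names are hypothetical):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setStringOrderType("alphabetAsc") // default remains "frequencyDesc"
```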

## How was this patch tested?
new test

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17879 from actuaryzhang/stringIndexer.
2017-05-12 00:12:47 -07:00
Takeshi Yamamuro 3aa4e464a8 [SPARK-20416][SQL] Print UDF names in EXPLAIN
## What changes were proposed in this pull request?
This PR adds `withName` to `UserDefinedFunction` for printing UDF names in EXPLAIN.

## How was this patch tested?
Added tests in `UDFSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #17712 from maropu/SPARK-20416.
2017-05-11 09:49:05 -07:00
Yanbo Liang 0698e6c88c [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
This reverts commit b8733e0ad9.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17944 from yanboliang/spark-20606-revert.
2017-05-11 14:48:13 +08:00
Yuhao Yang a819dab668 [SPARK-20670][ML] Simplify FPGrowth transform
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20670
As suggested by Sean Owen in https://github.com/apache/spark/pull/17130, the transform code in FPGrowthModel can be simplified.

As I tested on some public datasets (http://fimi.ua.ac.be/data/), the performance of the new transform code is equal to or better than the old implementation.

## How was this patch tested?

Existing unit test.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17912 from hhbyyh/fpgrowthTransform.
2017-05-09 23:39:26 -07:00
Yanbo Liang b8733e0ad9 [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove ML methods we deprecated in 2.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17867 from yanboliang/spark-20606.
2017-05-09 17:30:37 +08:00
Jon McLean be53a78352 [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException
## What changes were proposed in this pull request?

Added a check for the number of defined values.  Previously the argmax function assumed that at least one value was defined if the vector size was greater than zero.

## How was this patch tested?

Tests were added to the existing VectorsSuite to cover this case.

Author: Jon McLean <jon.mclean@atsid.com>

Closes #17877 from jonmclean/vectorArgmaxIndexBug.
2017-05-09 09:47:50 +01:00
Nick Pentreath 10b00abadf [SPARK-20587][ML] Improve performance of ML ALS recommendForAll
This PR is a `DataFrame` version of #17742 for [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), for improving the performance of `recommendAll` methods.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17845 from MLnick/ml-als-perf.
2017-05-09 10:13:15 +02:00
Peng 8079424763 [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll
The recommendForAll method of MLlib ALS is very slow.
GC is a key problem of the current method.
The task uses the following code to keep the temporary result:

```scala
val output = new Array[(Int, (Int, Double))](m*n)
```

Here m = n = 4096 (the default value; there is no method to set it), so output takes about 4k * 4k * (4 + 4 + 8) bytes = 256M. This large allocation causes serious GC problems and frequently leads to OOM.

Actually, we don't need to save the whole temporary result. Suppose we recommend the topK (topK is about 10 or 20) products for each user; then we only need 4k * topK * (4 + 4 + 8) bytes to save the temporary result.

The test environment:
3 workers; each worker has 10 cores, 30G memory, and 1 executor.
The data: 480,000 users and 17,000 items.

| BlockSize     | 1024 | 2048 | 4096 | 8192 |
|---------------|------|------|------|------|
| Old method    | 245s | 332s | 488s | OOM  |
| This solution | 121s | 118s | 117s | 120s |

Tested with the existing unit tests.

Author: Peng <peng.meng@intel.com>
Author: Peng Meng <peng.meng@intel.com>

Closes #17742 from mpjlu/OptimizeAls.
2017-05-09 10:06:48 +02:00
Nick Pentreath 58518d0707 [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
Existing test cases for `recommendForAllX` methods (added in [SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num items` and `k = num items`. Technically we should also test that `k > num items` returns the same results as `k = num items`.

## How was this patch tested?

Updated existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17860 from MLnick/SPARK-20596-als-rec-tests.
2017-05-08 12:45:00 +02:00
Daniel Li 88e6d75072 [SPARK-20484][MLLIB] Add documentation to ALS code
## What changes were proposed in this pull request?

This PR adds documentation to the ALS code.

## How was this patch tested?

Existing tests were used.

mengxr srowen

This contribution is my original work.  I have the license to work on this project under the Spark project’s open source license.

Author: Daniel Li <dan@danielyli.com>

Closes #17793 from danielyli/spark-20484.
2017-05-07 10:09:58 +01:00
Wayne Zhang 0d16faab90 [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column
## What changes were proposed in this pull request?
Bucketizer currently requires input column to be Double, but the logic should work on any numeric data types. Many practical problems have integer/float data types, and it could get very tedious to manually cast them into Double before calling bucketizer. This PR extends bucketizer to handle all numeric types.
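
A minimal usage sketch (column names are hypothetical):

```scala
import org.apache.spark.ml.feature.Bucketizer

// The input column may now be any numeric type (e.g. an integer "age" column).
val bucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setSplits(Array(Double.NegativeInfinity, 18.0, 35.0, 65.0, Double.PositiveInfinity))
```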

## How was this patch tested?
New test.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17840 from actuaryzhang/bucketizer.
2017-05-05 10:23:58 +08:00
Yanbo Liang c5dceb8c65 [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up
## What changes were proposed in this pull request?
Address some minor comments for #17715:
* Put bound-constrained optimization params under expertParams.
* Update some docs.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17829 from yanboliang/spark-20047-followup.
2017-05-04 17:56:43 +08:00
Yan Facai (颜发才) 7f96f2d7f2 [SPARK-16957][MLLIB] Use midpoints for split values.
## What changes were proposed in this pull request?

Use midpoints for split values now; we may later make them weighted.

## How was this patch tested?

+ [x] add unit test.
+ [x] revise Split's unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>
Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
2017-05-03 10:54:40 +01:00
Sean Owen 16fab6b0ef [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release
## What changes were proposed in this pull request?

Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17803 from srowen/SPARK-20523.
2017-05-03 10:18:35 +01:00
wangmiao1981 ee694cdff6 [SPARK-20533][SPARKR] SparkR Wrappers Model should be private and value should be lazy
## What changes were proposed in this pull request?

MultilayerPerceptronClassifierWrapper model should be private.
LogisticRegressionWrapper.scala rFeatures and rCoefficients should be lazy.

## How was this patch tested?

Unit tests.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17808 from wangmiao1981/lazy.
2017-04-29 10:58:48 -07:00
Yuhao Yang add9d1bba5 [SPARK-19791][ML] Add doc and example for fpgrowth
## What changes were proposed in this pull request?

Add a new section for fpm
Add Example for FPGrowth in scala and Java

updated: Rewrite transform to be more compact.

## How was this patch tested?

local doc generation.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17130 from hhbyyh/fpmdoc.
2017-04-29 10:51:45 -07:00
Yanbo Liang 606432a13a
[SPARK-20047][ML] Constrained Logistic Regression
## What changes were proposed in this pull request?
MLlib ```LogisticRegression``` should support bound constrained optimization (only for L2 regularization). Users can add bound constraints to coefficients to make the solver produce solution in the specified range.

Under the hood, we call Breeze [```L-BFGS-B```](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGSB.scala) as the solver for bound constrained optimization. But in the current breeze implementation, there are some bugs in L-BFGS-B, and https://github.com/scalanlp/breeze/pull/633 fixed them. We need to upgrade dependent breeze later, and currently we use the workaround L-BFGS-B in this PR temporary for reviewing.
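
A minimal usage sketch (a hypothetical binary problem with 4 features, forcing all coefficients to be non-negative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Matrices

val lr = new LogisticRegression()
  .setElasticNetParam(0.0) // bound-constrained optimization supports only L2 regularization
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 4, Array.fill(4)(0.0)))
```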

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17715 from yanboliang/spark-20047.
2017-04-27 20:48:43 +00:00
ding 0a7f5f2798 [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains.

This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
This PR moves PeriodicGraphCheckpointer.scala from mllib to graphx, moves PeriodicRDDCheckpointer.scala, PeriodicCheckpointer.scala from mllib to core
## How was this patch tested?

unit tests, manual tests

Author: ding <ding@localhost.localdomain>
Author: dding3 <ding.ding@intel.com>
Author: Michael Allman <michael@videoamp.com>

Closes #15125 from dding3/cp2_pregel.
2017-04-25 11:20:32 -07:00
Yanbo Liang 67eef47acf
[SPARK-20449][ML] Upgrade breeze version to 0.13.1
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.
2017-04-25 17:10:41 +00:00
wangmiao1981 387565cf14 [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
## What changes were proposed in this pull request?

This is a follow-up PR of #17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17754 from wangmiao1981/followup.
2017-04-25 16:30:36 +08:00
Josh Rosen f44c8a843c [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
2017-04-24 21:48:04 -07:00
wm624@hotmail.com 90264aced7 [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
## What changes were proposed in this pull request?

In MultivariateOnlineSummarizer,

`add` and `merge` have check for weights and feature sizes. The checks in LR are redundant, which are removed from this PR.

## How was this patch tested?

Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #17478 from wangmiao1981/logit.
2017-04-24 23:43:06 +08:00
WeichenXu eb00378f0e
[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions, so we need to center the coefficients to get identical results.
But the current implementation centers the `coefficientMatrix` using the global mean of all coefficients.

In fact, the `coefficientMatrix` should be centered on each feature index separately.
According to the MLOR probability distribution function, it can easily be proven that:
if `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K-1) + c }` is an equivalent solution,
where `c` is an arbitrary vector of `numFeatures` dimensions.
Reference: https://core.ac.uk/download/pdf/6287975.pdf

So we need to center the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**
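
An illustrative sketch of per-feature centering (assumed shapes: `coef(k)(j)` is the coefficient of class k, feature j; `numClasses` and `numFeatures` are assumed in scope):

```scala
// Center each feature column separately so every column sums to zero.
for (j <- 0 until numFeatures) {
  var mean = 0.0
  for (k <- 0 until numClasses) mean += coef(k)(j)
  mean /= numClasses
  for (k <- 0 until numClasses) coef(k)(j) -= mean
}
```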

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17706 from WeichenXu123/mlor_center.
2017-04-21 17:58:13 +00:00
Syrux 095d1cb3aa [SPARK-20265][MLLIB] Improve PrefixSpan pre-processing efficiency
## What changes were proposed in this pull request?

Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
The efficiency gain is reflected in the following graph: https://postimg.org/image/9x6ireuvn/

## How was this patch tested?

Using MLlib's existing PrefixSpan tests and tests of my own on the 8 datasets shown in the graph. All
results obtained were strictly the same as with the original implementation (without this change).
dev/run-tests was also run; no errors were found.

Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>

Author: Syrux <pokcyril@hotmail.com>

Closes #17575 from Syrux/SPARK-20265.
2017-04-13 09:44:33 +01:00
hyukjinkwon ceaf77ae43 [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins
## What changes were proposed in this pull request?

This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.

There are several problems with it:

- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".

- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.

  (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))

To complete this automated build, this PR also fixes existing Javadoc breaks, including ones introduced by test code, as described above.

These fixes are similar to instances previously fixed. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013

Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.

## How was this patch tested?

Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.

This was tested via manually adding `time.time()` as below:

```diff
     profiles_and_goals = build_profiles + sbt_goals

     print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
           " ".join(profiles_and_goals))

+    import time
+    st = time.time()
     exec_sbt(profiles_and_goals)
+    print("Elapsed :[%s]" % str(time.time() - st))
```

produces

```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17477 from HyukjinKwon/SPARK-18692.
2017-04-12 12:38:48 +01:00
Benjamin Fradet 0d2b796427 [SPARK-20097][ML] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR
## What changes were proposed in this pull request?

- made `numInstances` public in GLR
- made `degreesOfFreedom` public in LR

## How was this patch tested?

reran the concerned test suites

Author: Benjamin Fradet <benjamin.fradet@gmail.com>

Closes #17431 from BenFradet/SPARK-20097.
2017-04-11 09:12:49 +02:00
Sean Owen a26e3ed5e4 [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems
## What changes were proposed in this pull request?

Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").

The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #17527 from srowen/SPARK-20156.
2017-04-10 20:11:56 +01:00
Vijay Ramesh 261eaf5149 [SPARK-20260][MLLIB] String interpolation required for error message
## What changes were proposed in this pull request?
This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:

```
Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
```
(note the literal `$current` instead of the interpolated value)
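
The fix is simply to add the `s` interpolator (a sketch, not necessarily the exact source line):

```scala
require(current > previous,
  s"indices should be one-based and in ascending order;" +
  s" found current=$current, previous=$previous; line=\"$line\"")
```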

Author: Vijay Ramesh <vramesh@demandbase.com>

Closes #17572 from vijaykramesh/master.
2017-04-09 19:39:09 +01:00
Liang-Chi Hsieh 1a52a62377 [SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation
## What changes were proposed in this pull request?

The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it.

## How was this patch tested?

Python unit test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17494 from viirya/correlation-python-api.
2017-04-07 11:00:10 +02:00
Bryan Cutler e156b5dd39 [SPARK-19953][ML] Random Forest Models use parent UID when being fit
## What changes were proposed in this pull request?

The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator parent UID when being fit.  This change fixes that so the models can properly be identified with their parents.

## How was this patch tested?

Existing tests.

Added check to verify that model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
2017-04-06 09:41:32 +02:00
Yuhao Yang b28bbffbad [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some tests and found the issue: ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
Currently associationRules in FPGrowthModel is a lazy val, so `setMinConfidence` has no impact once associationRules has been computed.

I tried caching the associationRules to avoid re-computation if `minConfidence` is not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern.

## How was this patch tested?

New unit test; I also strengthened the unit test for model save/load to cover the caching mechanism.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17336 from hhbyyh/fpmodelminconf.
2017-04-04 17:51:45 -07:00
Seth Hendrickson a59759e6c0 [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights
## What changes were proposed in this pull request?

This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.

Note: The primary author when this is committed should be sethah since this is taken from his code.

## How was this patch tested?

Existing tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17501 from jkbradley/SPARK-20183.
2017-04-04 17:04:41 -07:00
zero323 b34f7665dd [SPARK-19825][R][ML] spark.ml R API for FPGrowth
## What changes were proposed in this pull request?

Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):

- `spark.fpGrowth` - model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
- unit tests.

## How was this patch tested?

Feature specific unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17170 from zero323/SPARK-19825.
2017-04-03 23:42:04 -07:00
Yuhao Yang 4d28e8430d [SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316

## How was this patch tested?

local doc generation and example execution

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17324 from hhbyyh/imputerdoc.
2017-04-03 11:42:33 +02:00
Bryan Cutler 2a903a1eec [SPARK-19985][ML] Fixed copy method for some ML Models
## What changes were proposed in this pull request?
Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator.  This change fixes these by creating a new instance of the model and explicitly setting values and parent.

## How was this patch tested?
Added `MLTestingUtils.checkCopy` to the offending models to tests to verify the copy is made and parent is set.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
2017-04-03 10:56:54 +02:00
Jacek Laskowski 0197262a35 [DOCS] Docs-only improvements
…adoc

## What changes were proposed in this pull request?

Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).

## How was this patch tested?

Local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #17417 from jaceklaskowski/window-expression-scaladoc.
2017-03-30 16:07:27 +01:00
Bago Amirbekian a5c87707ea [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest
## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest

## How was this patch tested?

unit tests
doctests

Author: Bago Amirbekian <bago@databricks.com>

Closes #17421 from MrBago/chiSquareTestWrapper.
2017-03-28 19:19:16 -07:00
颜发才(Yan Facai) 7d432af8f3 [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
Fix bug: DecisionTreeModel can't recognize Impurity "Gini" when loading

TODO:
+ [x] add unit test
+ [x] fix the bug

Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
2017-03-28 16:14:01 -07:00
sethah be85245a98
[SPARK-17137][ML][WIP] Compress logistic regression coefficients
## What changes were proposed in this pull request?

Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense - whichever requires less memory.

Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.

## How was this patch tested?

Unit tests added.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #17426 from sethah/SPARK-17137.
2017-03-25 17:41:59 +00:00
Nick Pentreath d9f4ce6943 [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark
Add Python wrapper for `Imputer` feature transformer.

## How was this patch tested?

New doc tests and tweak to PySpark ML `tests.py`

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
2017-03-24 08:01:15 -07:00
Timothy Hunter d27daa54bd [SPARK-19636][ML] Feature parity for correlation statistics in MLlib
## What changes were proposed in this pull request?

This patch adds the Dataframes-based support for the correlation statistics found in the `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.

The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
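
A minimal usage sketch (`df` is assumed to have a Vector column named "features"):

```scala
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// method is "pearson" (the default) or "spearman"
val Row(corrMatrix: Matrix) = Correlation.corr(df, "features", "spearman").head
```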

## How was this patch tested?

```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #17108 from thunterdb/19636.
2017-03-23 18:42:13 -07:00
hyukjinkwon aefe798905 [MINOR][BUILD] Fix javadoc8 break
## What changes were proposed in this pull request?

Several javadoc8 breaks have been introduced. This PR proposes fix those instances so that we can build Scala/Java API docs.

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
[error]  * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
[error]                                                             ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
[error]  * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
[error]                                                                                            ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
[error]  *    {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
[error]              ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
[error]  *    {link GroupStateTimeout.EventTimeTimeout}).
[error]              ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
[error]  *           Spark SQL types (see {link Encoder} for more details).
[error]                                          ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
[error]    * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
[error]                            ^
[error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
[error]  * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
[error]           ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
[error]    * See {link GroupState} for more details.
[error]                 ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
[error]           ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
[error]                                     ^
[info] 13 errors
```

```
jekyll 3.3.1 | Error:  Unidoc generation failed
```

## How was this patch tested?

Manually via `jekyll build`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
2017-03-23 08:41:30 +00:00
Joseph K. Bradley ae4b91d1f5 [SPARK-20039][ML] rename ChiSquare to ChiSquareTest
## What changes were proposed in this pull request?

I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17368 from jkbradley/SPARK-20039.
2017-03-21 11:01:25 -07:00
Zheng RuiFeng 63f077fbe5 [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.

## How was this patch tested?
existing tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17369 from zhengruifeng/doc_quantiles_nan.
2017-03-21 08:45:59 -07:00
christopher snow 7620aed828 [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter
## What changes were proposed in this pull request?

API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.

 - [DOCS] was previously: "rank is the number of latent factors in the model."
 - [API] was previously:  "rank - number of features to use"

This change describes rank in both places consistently as:

 - "Number of features to use (also referred to as the number of latent factors)"

Author: Chris Snow <chris.snowuk.ibm.com>

Author: christopher snow <chsnow123@gmail.com>

Closes #17345 from snowch/SPARK-20011.
2017-03-21 13:23:59 +00:00
zero323 bec6b16c19 [SPARK-19899][ML] Replace featuresCol with itemsCol in ml.fpm.FPGrowth
## What changes were proposed in this pull request?

Replaces `featuresCol` `Param` with `itemsCol`. See [SPARK-19899](https://issues.apache.org/jira/browse/SPARK-19899).

## How was this patch tested?

Manual tests. Existing unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17321 from zero323/SPARK-19899.
2017-03-20 10:58:30 -07:00
Joseph K. Bradley 4c3200546c [SPARK-19635][ML] DataFrame-based API for chi square test
## What changes were proposed in this pull request?

Wrapper taking and return a DataFrame

## How was this patch tested?

Copied unit tests from RDD-based API

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17110 from jkbradley/df-hypotests.
2017-03-16 17:10:15 -07:00
Yuhao Yang d647aae278 [SPARK-13568][ML] Create feature transformer to impute missing values
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13568
It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches, where possible existing DataFrame code can be used (e.g. for approximate quantiles etc).

Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
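
A minimal usage sketch (`df` and the column names are hypothetical):

```scala
import org.apache.spark.ml.feature.Imputer

// Replace missing values (NaN by default) in the two columns with their medians.
val imputer = new Imputer()
  .setInputCols(Array("income", "age"))
  .setOutputCols(Array("income_imputed", "age_imputed"))
  .setStrategy("median") // "mean" is the default
val imputed = imputer.fit(df).transform(df)
```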
## How was this patch tested?

new unit tests and manual test

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>

Closes #11601 from hhbyyh/imputer.
2017-03-16 12:49:59 +02:00
Menglong TAN 85941ecf28 [SPARK-11569][ML] Fix StringIndexer to handle null value properly
## What changes were proposed in this pull request?

This PR is to enhance StringIndexer with NULL values handling.

Before this PR, StringIndexer would throw an exception when it encountered NULL values.
With this PR:
- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as unseen labels

BTW, I noticed someone was trying to solve the same problem ( #9920 ) but seems getting no progress or response for a long time. Would you mind to give me a chance to solve it ? I'm eager to help. :-)

## How was this patch tested?

new unit tests

Author: Menglong TAN <tanmenglong@renrenche.com>
Author: Menglong TAN <tanmenglong@gmail.com>

Closes #17233 from crackcell/11569_StringIndexer_NULL.
2017-03-14 07:45:42 -07:00
zero323 d4a637cd46 [SPARK-19940][ML][MINOR] FPGrowthModel.transform should skip duplicated items
## What changes were proposed in this pull request?

This commit moves `distinct` to its intended place to avoid duplicated predictions, and adds a unit test covering the issue.

## How was this patch tested?

Unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17283 from zero323/SPARK-19940.
2017-03-14 07:34:44 -07:00
Asher Krim 5e96a57b2f [SPARK-19922][ML] small speedups to findSynonyms
Currently generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17%.

I wasn't sure if such small changes were worthy of a JIRA, but the guidelines seemed to suggest that that is the preferred approach.

## What changes were proposed in this pull request?

Address a few small issues in the findSynonyms logic:
1) remove usage of ``Array.fill`` to zero out the ``cosineVec`` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed
2) use Floats throughout. The conversion to Doubles before doing the ``priorityQueue`` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC
3) convert the slow ``for(i <- cosVec.indices)`` to an ugly, but faster, ``while`` loop

These efficiencies are really only apparent when working with a large model.

## How was this patch tested?

Existing unit tests + some in-house tests to time the difference

cc jkbradley MLNick srowen

Author: Asher Krim <krim.asher@gmail.com>
Author: Asher Krim <krim.asher@gmail>

Closes #17263 from Krimit/fasterFindSynonyms.
2017-03-14 13:08:11 +00:00
actuaryzhang f6314eab4b [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR.

felixcheung yanboliang

## How was this patch tested?
new test in SparkR

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16729 from actuaryzhang/sparkRTweedie.
2017-03-14 00:50:38 -07:00
Joseph K. Bradley 72c66dbbb4 [MINOR][ML] Improve MLWriter overwrite error message
## What changes were proposed in this pull request?

Give proper syntax for Java and Python in addition to Scala.

## How was this patch tested?

Manually.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17215 from jkbradley/write-err-msg.
2017-03-13 16:30:15 -07:00
Xin Ren 9f8ce4825e [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models
## What changes were proposed in this pull request?

RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.

Below 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`

## How was this patch tested?

Test manually on my local machine.

Author: Xin Ren <iamshrek@126.com>

Closes #17207 from keypointt/SPARK-19282.
2017-03-12 12:15:19 -07:00
Anthony Truchet 9ea201cf64 [SPARK-16440][MLLIB] Ensure broadcasted variables are destroyed even in case of exception
## What changes were proposed in this pull request?

Ensure broadcasted variables are destroyed even in case of exception.
## How was this patch tested?

Word2VecSuite was run locally

Author: Anthony Truchet <a.truchet@criteo.com>

Closes #14299 from AnthonyTruchet/SPARK-16440.
2017-03-08 11:44:25 +00:00
Yanbo Liang 81303f7ca7 [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution.
## What changes were proposed in this pull request?
PySpark ```GeneralizedLinearRegression``` supports tweedie distribution.

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17146 from yanboliang/spark-19806.
2017-03-08 02:09:36 -08:00
Yanbo Liang 1fa58868bc [ML][MINOR] Separate estimator and model params for read/write test.
## What changes were proposed in this pull request?
Since we allow ```Estimator``` and ```Model``` to not always share the same params (see ```ALSParams``` and ```ALSModelParams```), we should pass in test params for the estimator and model separately in the function ```testEstimatorAndModelReadWrite```.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17151 from yanboliang/test-rw.
2017-03-08 02:05:01 -08:00
Asher Krim 56e1bd337c [SPARK-17629][ML] methods to return synonyms directly
## What changes were proposed in this pull request?
Provide methods to return synonyms directly, without wrapping them in a DataFrame.

In performance-sensitive applications (such as user-facing APIs) the round trip to and from DataFrames is costly and unnecessary.

The methods are named ``findSynonymsArray`` to make the return type clear, which also implies a local data structure.
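
A minimal usage sketch (`model` is an already-fitted `org.apache.spark.ml.feature.Word2VecModel`):

```scala
val synonyms: Array[(String, Double)] = model.findSynonymsArray("spark", 5)
synonyms.foreach { case (word, similarity) => println(s"$word: $similarity") }
```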
## How was this patch tested?
updated word2vec tests

Author: Asher Krim <akrim@hubspot.com>

Closes #16811 from Krimit/w2vFindSynonymsLocal.
2017-03-07 20:36:46 -08:00
VinceShieh 4a9034b173 [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR, StringIndexer only supported the "skip"/"error" options to deal with unseen records.
But those unseen records might still be useful, and users would like to keep the unseen labels in
certain use cases. This PR enables StringIndexer to support keeping unseen labels as
indices [numLabels].

Before:
```
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
```
After (supporting the third option "keep"):
```
StringIndexer().setHandleInvalid("keep")
```

## How was this patch tested?
Test added in StringIndexerSuite

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
2017-03-07 11:24:20 -08:00
wm624@hotmail.com 926543664f [SPARK-19382][ML] Test sparse vectors in LinearSVCSuite
## What changes were proposed in this pull request?

Add unit tests for testing SparseVector.

We can't add mixed DenseVector and SparseVector test case, as discussed in JIRA 19382.

```scala
def merge(other: MultivariateOnlineSummarizer): this.type = {
  if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
    require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
      s"Expecting $n but got ${other.n}.")
```

## How was this patch tested?

Unit tests

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Miao Wang <wangmiao1981@users.noreply.github.com>

Closes #16784 from wangmiao1981/bk.
2017-03-06 13:08:59 -08:00
Sue Ann Hong 70f9d7f71c [SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe
## What changes were proposed in this pull request?

This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.

## How was this patch tested?

Unit tests
```
$ build/sbt
> mllib/testOnly *ALSSuite -- -z "recommendFor"
> mllib/testOnly
```

Author: Your Name <you@example.com>
Author: sueann <sueann@databricks.com>

Closes #17090 from sueann/SPARK-19535.
2017-03-05 16:49:31 -08:00
sethah 93ae176e89 [SPARK-19745][ML] SVCAggregator captures coefficients in its closure
## What changes were proposed in this pull request?

JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)

Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this.

We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is.
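
A minimal sketch of the two ideas (assumed names, not the exact aggregator): only the broadcast handle is serialized with the task closure, and the gradient buffer is materialized lazily.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

class AggregatorSketch(numCoefficients: Int, bcCoefficients: Broadcast[Vector])
  extends Serializable {
  // Re-resolved on each executor; never shipped inside the serialized closure.
  @transient private lazy val coefficientsArray: Array[Double] = bcCoefficients.value.toArray
  // Not materialized on the driver before the task is shipped.
  private lazy val gradientSumArray: Array[Double] = new Array[Double](numCoefficients)
}
```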

## How was this patch tested?

This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #17076 from sethah/svc_agg.
2017-03-02 19:38:25 -08:00
Zheng RuiFeng 50c08e82f0 [SPARK-19704][ML] AFTSurvivalRegression should support numeric censorCol
## What changes were proposed in this pull request?
make `AFTSurvivalRegression` support numeric censorCol
## How was this patch tested?
existing tests and added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17034 from zhengruifeng/aft_numeric_censor.
2017-03-02 13:34:04 +02:00
Vasilis Vryniotis 625cfe09e6 [SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS.
## What changes were proposed in this pull request?

The original ALS was performing unnecessary casting of the user and item ids because the protected checkedCast() method required a Double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.

## How was this patch tested?

I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.

Author: Vasilis Vryniotis <bbriniotis@datumbox.com>

Closes #17059 from datumbox/als_casting_fix.
2017-03-02 12:37:42 +02:00
Vasilis Vryniotis 417140e441 [SPARK-19787][ML] Changing the default parameter of regParam.
## What changes were proposed in this pull request?

In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1.

I changed the one of train() method to 0.1 and now it matches the default value which is visible to Spark users. The method is marked with DeveloperApi so it should not affect the users. Whenever we use the particular method we provide all parameters, so the default does not matter. Only exception is the unit-tests on ALSSuite but the change does not break them.

Note: This PR should get the award of the laziest commit in Spark history. Originally I wanted to correct this on another PR but MLnick [suggested](https://github.com/apache/spark/pull/17059#issuecomment-283333572) to create a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close this. :)

## How was this patch tested?

Unit-tests

Author: Vasilis Vryniotis <vvryniotis@hotels.com>

Closes #17121 from datumbox/als_regparam.
2017-03-01 20:55:17 +02:00
Yuhao 0fe8020f3a [SPARK-14503][ML] spark.ml API for FPGrowth
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14503
Function parity: Add FPGrowth and AssociationRules to ML.

design doc: https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub

Currently I make FPGrowthModel a transformer. For each association rule,  it will just examine the input items against antecedents and summarize the consequents.

Update:
Thinking again, FPGrowth is only the algorithm to find the frequent itemsets, and can be replaced by other algorithms. The frequent itemsets are used by AssociationRules to generate the association rules. Then we can use the association rules to predict with other records.

![drawing1](https://cloud.githubusercontent.com/assets/7981698/22489294/76b9302c-e7cb-11e6-8d2d-3fc53f407b2f.png)

**For reviewers**, Let's first decide if the current `transform` function meets your expectation.

Current options:

1. Current implementation: Use Estimator and Transformer pattern in ML, the `transform` function will examine the input items against all the association rules and summarize the consequents. Users can also access frequent items and association rules via other model members.

2. Keep the Estimator and Transformer pattern. But AssociationRulesModel and FPGrowthModel will have empty `transform` function, meaning DataFrame has no change after transform. But users can access frequent items and association rules via other model members.

3. (mentioned by zhengruifeng) Keep the Estimator and Transformer pattern. But `FPGrowthModel` and `AssociationRulesModel` will just return frequent itemsets and association rules DataFrame in the `transform` function. Meaning the resulting DataFrame after `transform` will not be related to the input DataFrame.

4. Discard the Estimator and Transformer pattern. Both FPGrowth and FPGrowthModel will directly extend from PipelineStage, thus we don't need to have a `transform` function.

I'd like to hear more concrete suggestions. I would prefer option 1 or 2.

update 2:

As discussed in the JIRA, we will not expose AssociationRules as a public API for now.
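For context, a minimal usage sketch of option 1 as adopted, assuming an active `spark` session with `import spark.implicits._` (API names per the merged spark.ml code):

```scala
import org.apache.spark.ml.fpm.FPGrowth

val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
  .map(_.split(" "))
  .toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()        // frequent itemsets
model.associationRules.show()    // antecedent -> consequent rules
model.transform(dataset).show()  // per-row predicted consequents
```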

## How was this patch tested?

new unit test suites

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15415 from hhbyyh/mlfpm.
2017-02-28 15:53:41 -08:00
Nick Pentreath b405466513 [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy
This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.

The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible.

The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
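A minimal usage sketch (the column names are illustrative):

```scala
import org.apache.spark.ml.recommendation.ALS

// Drop NaN predictions so evaluators only see known user/item pairs
// during cross-validation; the default "nan" keeps the old behavior.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")
```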
## How was this patch tested?

New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12896 from MLnick/SPARK-14489-als-nan.
2017-02-28 16:17:35 +02:00
sethah 16d8472f74 [SPARK-19746][ML] Faster indexing for logistic aggregator
## What changes were proposed in this pull request?

JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)

The following code is inefficient:

````scala
    val localCoefficients: Vector = bcCoefficients.value

    features.foreachActive { (index, value) =>
      val stdValue = value / localFeaturesStd(index)
      var j = 0
      while (j < numClasses) {
        margins(j) += localCoefficients(index * numClasses + j) * stdValue
        j += 1
      }
    }
````

`localCoefficients(index * numClasses + j)` calls `Vector.apply` which creates a new Breeze vector and indexes that. Even if it is not that slow to create the object, we will generate a lot of extra garbage that may result in longer GC pauses. This is a hot inner loop, so we should optimize wherever possible.
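A sketch of the general fix, not the exact patch: pull the backing array out of the vector once, outside the hot loop, so the loop indexes a plain `Array[Double]`:

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}

// Hypothetical helper: extract the backing array once.
def coefficientsArray(coefficients: Vector): Array[Double] = coefficients match {
  case dv: DenseVector => dv.values   // no copy, direct reference
  case other => other.toArray         // one-time copy for other vector types
}

val localCoefficients = coefficientsArray(Vectors.dense(1.0, 2.0, 3.0, 4.0))
// The hot loop can now read localCoefficients(index * numClasses + j)
// without going through Vector.apply on every access.
```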

## How was this patch tested?

I don't think there's a great way to test this patch. It's purely performance related, so unit tests should guarantee that we haven't made any unwanted changes. Empirically I observed between 10-40% speedups just running short local tests. I suspect the big differences will be seen when large data/coefficient sizes have to pause for GC more often. I welcome other ideas for testing.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #17078 from sethah/logistic_agg_indexing.
2017-02-28 00:34:38 +00:00
hyukjinkwon 4ba9c6c453 [MINOR][BUILD] Fix lint-java breaks in Java
## What changes were proposed in this pull request?

This PR proposes to fix the lint-breaks as below:

```
[ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
[ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
[ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
[ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
[ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
```

## How was this patch tested?

Manually via

```bash
./dev/lint-java
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17072 from HyukjinKwon/java-lint.
2017-02-27 08:44:26 +00:00
Joseph K. Bradley 6ab60542e8 [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower
Add Scaladoc for GeneralizedLinearRegression.linkPower default value

Follow-up to https://github.com/apache/spark/pull/16344

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #17069 from jkbradley/tweedie-comment.
2017-02-25 22:24:08 -08:00
wm624@hotmail.com 1f86e795b8 [SPARK-19616][SPARKR] weightCol and aggregationDepth should be improved for some SparkR APIs
## What changes were proposed in this pull request?

This is a follow-up PR of #16800

While working on SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.

## How was this patch tested?
Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16945 from wangmiao1981/svc.
2017-02-22 11:50:24 -08:00
Zheng RuiFeng bf7bb49778 [SPARK-19679][ML] Destroy broadcasted object without blocking
## What changes were proposed in this pull request?
Destroy broadcasted objects without blocking. Call sites were located with `find mllib -name '*.scala' | xargs -i bash -c 'egrep "destroy" -n {} && echo {}'`.

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17016 from zhengruifeng/destroy_without_block.
2017-02-22 16:36:03 +02:00
Zheng RuiFeng ef3c73535f [SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel
## What changes were proposed in this pull request?
Add missing 'setTopicDistributionCol' for LDAModel
## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #17021 from zhengruifeng/lda_outputCol.
2017-02-22 16:33:14 +02:00
Sean Owen 1487c9af20 [SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features
## What changes were proposed in this pull request?

Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16964 from srowen/SPARK-19534.
2017-02-19 09:42:50 -08:00
Moussa Taifi 21c7d3c31a [MLLIB][TYPO] Replace LeastSquaresAggregator with LogisticAggregator
## What changes were proposed in this pull request?

Replace LeastSquaresAggregator with LogisticAggregator in the require statement of the merge op.

## How was this patch tested?

Simple message fix.

Author: Moussa Taifi <moutai10@gmail.com>

Closes #16903 from moutai/master.
2017-02-18 14:10:21 +00:00
Yun Ni 08c1972a06 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang's PR #15768, resolving conflicts and following the API changes on the Scala side. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <yunn@uber.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Yunni <Euler57721@gmail.com>

Closes #16715 from Yunni/spark-18080.
2017-02-15 16:26:05 -08:00
wm624@hotmail.com 3973403d5d [SPARK-19456][SPARKR] Add LinearSVC R API
## What changes were proposed in this pull request?

Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API.

Marked as WIP, as I am designing unit tests.

## How was this patch tested?

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16800 from wangmiao1981/svc.
2017-02-15 01:15:50 -08:00
sureshthalamati f48c5a57d6 [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
## What changes were proposed in this pull request?
The reason for test failure is that the property “oracle.jdbc.mapDateToTimestamp” set by the test was getting converted into all lower case. Oracle database expects this property in case-sensitive manner.

This test was passing in previous releases because connection properties were sent exactly as the user specified them. A later change that handles all options uniformly in a case-insensitive manner also converted the JDBC connection properties to lower case.

This PR enhances CaseInsensitiveMap to keep track of the original case-sensitive input keys, and uses those when creating the connection properties that are passed to the JDBC connection.

Alternative approach PR https://github.com/apache/spark/pull/16847  is to pass original input keys to JDBC data source by adding check in the  Data source class and handle case-insensitivity in the JDBC source code.
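A minimal sketch of the idea behind the enhancement (illustrative only, not Spark's actual `CaseInsensitiveMap`):

```scala
// Lookups are case-insensitive, but the originally-cased keys are kept
// so they can be handed verbatim to case-sensitive consumers such as
// the Oracle JDBC driver.
class CaseInsensitiveMap[T](val originalMap: Map[String, T]) {
  private val lowerCased = originalMap.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[T] = lowerCased.get(key.toLowerCase)

  /** Keys exactly as the user typed them, for JDBC connection properties. */
  def originalKeys: Set[String] = originalMap.keySet
}

val opts = new CaseInsensitiveMap(Map("oracle.jdbc.mapDateToTimestamp" -> "false"))
opts.get("ORACLE.JDBC.MAPDATETOTIMESTAMP")  // Some("false")
opts.originalKeys                           // Set("oracle.jdbc.mapDateToTimestamp")
```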

## How was this patch tested?
Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran the Docker integration tests on my laptop; all tests passed successfully.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
2017-02-14 15:34:12 -08:00
sueann 3a43ae7c0b [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private
## What changes were proposed in this pull request?
spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.

## How was this patch tested?
```
build/sbt doc  # "millib.clustering" no longer appears in the docs for *LDA* classes
build/sbt compile  # compiles
build/sbt
> mllib/testOnly   # tests pass
```

Author: sueann <sueann@databricks.com>

Closes #16860 from sueann/SPARK-18613.
2017-02-10 11:50:23 -08:00
actuaryzhang 1aeb9f6cba [SPARK-19400][ML] Allow GLM to handle intercept only model
## What changes were proposed in this pull request?
Intercept-only GLM fails for non-Gaussian families because IWLS reduces over an empty array. The following code `val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) }` fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes the issue.
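One way to make the reduction robust to an empty coefficient array, shown as a sketch (the PR's actual fix may differ in detail):

```scala
// reduce throws on empty collections; folding from a neutral element
// handles the intercept-only case, where there are no coefficients.
val oldCoefficients = Array.empty[Double]   // intercept-only model
val maxTolOfCoefficients =
  oldCoefficients.foldLeft(0.0)((acc, x) => math.max(acc, math.abs(x)))
```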

yanboliang srowen imatiach-msft zhengruifeng

## How was this patch tested?
New test for intercept only model.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16740 from actuaryzhang/interceptOnly.
2017-02-08 10:42:20 -08:00
gatorsmile e33aaa2ac5 [SPARK-19397][SQL] Make option names of LIBSVM and TEXT case insensitive
### What changes were proposed in this pull request?
Prior to Spark 2.1, option names were case-sensitive for all formats. Since Spark 2.1, option key names have become case-insensitive, except for the `Text` and `LibSVM` formats. This PR fixes these issues.

Also, add a check to know whether the input option vector type is legal for `LibSVM`.

### How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16737 from gatorsmile/libSVMTextOptions.
2017-02-08 09:33:18 +08:00
gatorsmile 65b10ffb38 [SPARK-19279][SQL] Infer Schema for Hive Serde Tables and Block Creating a Hive Table With an Empty Schema
### What changes were proposed in this pull request?
So far, we allow users to create a table with an empty schema: `CREATE TABLE tab1`. This could break many code paths if we enable it. Thus, we should follow Hive to block it.

For Hive serde tables, some serde libraries require the specified schema and record it in the metastore. To get the list, we need to check `hive.serdes.using.metastore.for.schema`, which contains a list of serdes that require a user-specified schema. The default values are

- org.apache.hadoop.hive.ql.io.orc.OrcSerde
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
- org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
- org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
- org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
- org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
- org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

### How was this patch tested?
Added test cases for both Hive and data source tables

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16636 from gatorsmile/fixEmptyTableSchema.
2017-02-06 13:30:07 +08:00
Asher Krim b3e89802ae [SPARK-19247][ML] Save large word2vec models
## What changes were proposed in this pull request?

* save word2vec models as distributed files rather than as one large datum. Backwards compatibility with the previous save format is maintained by checking for the "wordIndex" column
* migrate the fix for loading large models (SPARK-11994) to ml word2vec

## How was this patch tested?

Tested loading the new and old formats locally

srowen yanboliang MLnick

Author: Asher Krim <akrim@hubspot.com>

Closes #16607 from Krimit/saveLargeModels.
2017-02-05 16:14:07 -08:00
Joseph K. Bradley 1d5d2a9d09 [SPARK-19389][ML][PYTHON][DOC] Minor doc fixes for ML Python Params and LinearSVC
## What changes were proposed in this pull request?

* Removed Since tags in Python Params since they are inherited by other classes
* Fixed doc links for LinearSVC

## How was this patch tested?

* doc tests
* generating docs locally and checking manually

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #16723 from jkbradley/pyparam-fix-doc.
2017-02-02 11:58:46 -08:00
hyukjinkwon f1a1f2607d [SPARK-19402][DOCS] Support LaTex inline formula correctly and fix warnings in Scala/Java APIs generation
## What changes were proposed in this pull request?

This PR proposes three things as below:

- Support LaTex inline-formula, `\( ... \)` in Scala API documentation
  It seems currently,

  ```
  \( ... \)
  ```

  are rendered as they are, for example,

  <img width="345" alt="2017-01-30 10 01 13" src="https://cloud.githubusercontent.com/assets/6477701/22423960/ab37d54a-e737-11e6-9196-4f6229c0189c.png">

  It seems mistakenly more backslashes were added.

- Fix warnings Scaladoc/Javadoc generation
  This PR fixes two types of warnings, as below:

  ```
  [warn] .../spark/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala:335: Could not find any member to link for "UnsupportedOperationException".
  [warn]   /**
  [warn]   ^
  ```

  ```
  [warn] .../spark/sql/core/src/main/scala/org/apache/spark/sql/internal/VariableSubstitution.scala:24: Variable var undefined in comment for class VariableSubstitution in class VariableSubstitution
  [warn]  * `${var}`, `${system:var}` and `${env:var}`.
  [warn]      ^
  ```

- Fix Javadoc8 break
  ```
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictionModel.java:7: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/PredictorParams.java:12: error: reference not found
  [error]    *                          E.g., {link VectorUDT} for vector features.
  [error]                                            ^
  [error] .../spark/mllib/target/java/org/apache/spark/ml/Predictor.java:10: error: reference not found
  [error]  *                       E.g., {link VectorUDT} for vector features.
  [error]                                       ^
  [error] .../spark/sql/hive/target/java/org/apache/spark/sql/hive/HiveAnalysis.java:5: error: reference not found
  [error]  * Note that, this rule must be run after {link PreprocessTableInsertion}.
  [error]                                                  ^
  ```

## How was this patch tested?

Manually via `sbt unidoc` and `jeykil build`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16741 from HyukjinKwon/warn-and-break.
2017-02-01 13:26:16 +00:00
wm624@hotmail.com 9ac05225e8 [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
## What changes were proposed in this pull request?

When KMeans uses initMode = "random" with certain seeds, it is possible that the actual number of clusters does not equal the configured `k`.

In this case, summary(model) returns an error because the number of columns of the coefficient matrix doesn't equal k.

Example:
>  col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0)
>   cols <- as.data.frame(cbind(col1, col2, col3))
>   df <- createDataFrame(cols)
>
>   model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10,  initMode = "random", seed = 22222, tol = 1E-5)
>
> summary(model2)
Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) :
  length of 'dimnames' [2] not equal to array extent
In addition: Warning message:
In matrix(coefficients, ncol = k) :
  data length [9] is not a sub-multiple or multiple of the number of rows [2]

Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix.
## How was this patch tested?

Add unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16666 from wangmiao1981/kmeans.
2017-01-31 21:16:37 -08:00
Bryan Cutler 57d70d26c8 [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
## What changes were proposed in this pull request?

Adding a convenience function to the Python `JavaWrapper` so that it is easy to create a Py4J JavaArray compatible with existing class constructors that take a Scala `Array` as input, removing the need for a separate Java/Python-friendly constructor. The function takes a Java class as input, which Py4J uses to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this, and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
2017-01-31 15:42:36 -08:00
Zheng RuiFeng 42ad93b2c9 [SPARK-19384][ML] forget unpersist input dataset in IsotonicRegression
## What changes were proposed in this pull request?
unpersist the input dataset if `handlePersistence` = true

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16718 from zhengruifeng/isoReg_unpersisit.
2017-01-28 10:18:47 +00:00
wm624@hotmail.com bb1a1fe05e [SPARK-19336][ML][PYSPARK] LinearSVC Python API
## What changes were proposed in this pull request?

Add Python API for the newly added LinearSVC algorithm.

## How was this patch tested?

Add new doc string test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16694 from wangmiao1981/ser.
2017-01-27 16:03:53 -08:00
actuaryzhang 4172ff80dd [SPARK-18929][ML] Add Tweedie distribution in GLM
## What changes were proposed in this pull request?
I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. The currently supported Gaussian, Poisson, and Gamma families are special cases of the Tweedie distribution: https://en.wikipedia.org/wiki/Tweedie_distribution.
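A usage sketch of the proposed API (parameter names as proposed in this PR; check the merged code for the final names):

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.5)   // 0 = Gaussian, 1 = Poisson, 2 = Gamma
  .setLinkPower(0.0)       // 0 corresponds to the log link
```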

yanboliang srowen sethah

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>

Closes #16344 from actuaryzhang/tweedie.
2017-01-26 23:01:13 -08:00
wm624@hotmail.com c0ba284300 [SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkR
## What changes were proposed in this pull request?

Add R wrapper for bisecting Kmeans.

As JIRA is down, I will update the title to link to the corresponding JIRA later.

## How was this patch tested?

Add new unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16566 from wangmiao1981/bk.
2017-01-26 21:01:59 -08:00
WeichenXu 1191fe267d [SPARK-18218][ML][MLLIB] Reduce shuffled data size of BlockMatrix multiplication and solve potential OOM and low parallelism usage problem By split middle dimension in matrix multiplication
## What changes were proposed in this pull request?

### The problem in current block matrix multiplication

As described in JIRA https://issues.apache.org/jira/browse/SPARK-18218, block matrix multiplication in Spark may cause problems. Suppose we have an `M*N` matrix A multiplied by an `N*P` matrix B; when N is much larger than M and P, the following problems may occur:
- when the middle dimension N is too large, it will cause reducer OOM.
- even if OOM does not occur, the parallelism will still be too low.
- when N is much larger than M and P, and matrices A and B have many partitions, there may be too many partitions on the M and P dimensions, which causes a much larger shuffled data size. (I will explain this in detail below.)

### Key point of my improvement

In this PR, I introduce `midDimSplitNum` parameter, and improve the algorithm, to resolve this problem.

In order to understand the improvement in this PR, first let me give a simple case to explain how the current multiplication works and what causes the problems above:

suppose we have block matrix A, containing 200 blocks (`2 numRowBlocks * 100 numColBlocks`), with blocks arranged in 2 rows and 100 cols:
```
A00 A01 A02 ... A0,99
A10 A11 A12 ... A1,99
```
and we have block matrix B, also containing 200 blocks (`100 numRowBlocks * 2 numColBlocks`), with blocks arranged in 100 rows and 2 cols:
```
B00    B01
B10    B11
B20    B21
...
B99,0  B99,1
```
Suppose all blocks in the two matrices are dense for now.
Now we call A.multiply(B), and suppose the generated `resultPartitioner` contains 2 rowPartitions and 2 colPartitions (there can't be more partitions because the result matrix only contains `2 * 2` blocks). The current algorithm contains two shuffle steps:

**step-1**
Step-1 generates 4 reducers, which I tag as reducer-00, reducer-01, reducer-10, and reducer-11, and shuffles data as follows:
```
A00 A01 A02 ... A0,99
B00 B10 B20 ... B99,0    shuffled into reducer-00

A00 A01 A02 ... A0,99
B01 B11 B21 ... B99,1    shuffled into reducer-01

A10 A11 A12 ... A1,99
B00 B10 B20 ... B99,0    shuffled into reducer-10

A10 A11 A12 ... A1,99
B01 B11 B21 ... B99,1    shuffled into reducer-11
```

The shuffling above is a `cogroup` transform; note that each reducer contains **only one group**.

**step-2**
Step-2 does an `aggregateByKey` transform on the result of step-1, also generating 4 reducers, and produces the final result RDD, which contains 4 partitions, each holding one block.

The main problems are in step-1. We have only 4 reducers, but matrices A and B have 400 blocks in total, so the reducer count is obviously too small.
Also, each reducer contains only one group (in the `coGroup` sense), and each group contains 200 blocks. This is terrible because the `coGroup` transform loads each group into memory when computing, so it does not scale at the algorithm level. Suppose matrix A has 10000 column blocks or more instead of 100: then each reducer would load 20000 blocks into memory, easily causing reducer OOM.

This PR tries to resolve the problems described above.
When a matrix A with dimension M * N multiplies a matrix B with dimension N * P, the middle dimension N is the key point. If N is large, the current multiplication implementation works badly.
In this PR, I introduce a `numMidDimSplits` parameter, representing how many splits to cut on the middle dimension N.
Still using the example above, if we set `numMidDimSplits = 10`, we can generate 40 reducers in **step-1**:

Each reducer-ij above is now split into 10 reducers: reducer-ij0, reducer-ij1, ... reducer-ij9, and each reducer receives 20 blocks.
Now the shuffle works as follows:

**reducer-000 to reducer-009**
```
A0,0 A0,10 A0,20 ... A0,90
B0,0 B10,0 B20,0 ... B90,0    shuffled into reducer-000

A0,1 A0,11 A0,21 ... A0,91
B1,0 B11,0 B21,0 ... B91,0    shuffled into reducer-001

A0,2 A0,12 A0,22 ... A0,92
B2,0 B12,0 B22,0 ... B92,0    shuffled into reducer-002

...

A0,9 A0,19 A0,29 ... A0,99
B9,0 B19,0 B29,0 ... B99,0    shuffled into reducer-009
```

**reducer-010 to reducer-019**
```
A0,0 A0,10 A0,20 ... A0,90
B0,1 B10,1 B20,1 ... B90,1    shuffled into reducer-010

A0,1 A0,11 A0,21 ... A0,91
B1,1 B11,1 B21,1 ... B91,1    shuffled into reducer-011

A0,2 A0,12 A0,22 ... A0,92
B2,1 B12,1 B22,1 ... B92,1    shuffled into reducer-012

...

A0,9 A0,19 A0,29 ... A0,99
B9,1 B19,1 B29,1 ... B99,1    shuffled into reducer-019
```

**reducer-100 to reducer-109** and **reducer-110 to reducer-119** are similar to the above, so I omit them.

### API for this optimized algorithm

I add a new API as following:
```
  def multiply(
      other: BlockMatrix,
      numMidDimSplits: Int  // middle-dimension split number, explained above
  ): BlockMatrix
```

### Shuffled data size analysis (compared under the same parallelism)

The optimization has a subtle influence on the total shuffled data size. An appropriate `numMidDimSplits` will significantly reduce the shuffled data size, but too large a value may increase it again. For now I don't want to introduce a formula and make things too complex, so I only use a simple case to illustrate it here:

Suppose we have two same-size square matrices X and Y, both with `16 numRowBlocks * 16 numColBlocks`. X and Y are both dense matrices. Now let me analyze the shuffled data size in the following cases:

**case 1: X and Y both partitioned in 16 rowPartitions and 16 colPartitions, numMidDimSplits = 1**
ShufflingDataSize = (16 * 16 * (16 + 16) + 16 * 16) blocks = 8448 blocks
parallelism = 16 * 16 * 1 = 256 // use the step-1 reducer count as the parallelism, because step-1 costs most of the computation time in this algorithm.

**case 2: X and Y both partitioned in 8 rowPartitions and 8 colPartitions, numMidDimSplits = 4**
ShufflingDataSize = (8 * 8 * (32 + 32) + 16 * 16 * 4) blocks = 5120 blocks
parallelism = 8 * 8 * 4 = 256 // use the step-1 reducer count as the parallelism, because step-1 costs most of the computation time in this algorithm.

**The two cases above both have parallelism = 256.** Case 1 with `numMidDimSplits = 1` is equivalent to the current implementation in mllib, but case 2's shuffled data is 60.6% of case 1's. **This shows that, under the same parallelism, a proper `numMidDimSplits` significantly reduces the shuffled data size.**

## How was this patch tested?

Test suites added.
Running result:
![blockmatrix](https://cloud.githubusercontent.com/assets/19235986/21600989/5e162cc2-d1bf-11e6-868c-0ec29190b605.png)

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15730 from WeichenXu123/optim_block_matrix.
2017-01-26 20:10:17 -08:00
sethah 0e821ec6fa [SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?

The following test will fail on current master

````scala
test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
    val df = Seq(
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
      .map(Tuple1.apply).toDF("features")
    val gm = new GaussianMixture()
    intercept[IllegalArgumentException] {
      gm.fit(df)
    }
  }
````

Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for a high number of features, we should perform an appropriate check to communicate this limitation to users.

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we could set it just below the point of numeric overflow, e.g. `math.sqrt(Integer.MaxValue).toInt` (about 46k), which eliminates the cryptic error. However, in WLS for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit to 4096 features. We may want to keep that convention here for consistency.

## How was this patch tested?
Unit tests in ML and MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16661 from sethah/gmm_high_dim.
2017-01-25 07:12:25 -08:00
Ilya Matiach d9783380ff [SPARK-18036][ML][MLLIB] Fixing decision trees handling edge cases
## What changes were proposed in this pull request?

Decision trees/GBT/RF do not handle edge cases such as constant features or empty features.
In the case of constant features we now choose an arbitrary split instead of failing with a cryptic error message.
In the case of empty features we fail with a better error message stating:
DecisionTree requires number of features > 0, but was given an empty features vector
Instead of the cryptic error message:
java.lang.UnsupportedOperationException: empty.max

## How was this patch tested?

Unit tests are added in the patch for:
DecisionTreeRegressor
GBTRegressor
Random Forest Regressor

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16377 from imatiach-msft/ilmat/fix-decision-tree.
2017-01-24 10:25:12 -08:00
Souljoy Zhuo cca8680047 delete useless var “j”
The var `j` defined in `var j = 0` is unused in `def compress`.

Author: Souljoy Zhuo <zhuoshoujie@126.com>

Closes #16676 from xiaoyesoso/patch-1.
2017-01-24 11:33:17 +00:00
Zheng RuiFeng 49f5b0ae4c [SPARK-17747][ML] WeightCol support non-double numeric datatypes
## What changes were proposed in this pull request?

1, add test for `WeightCol` in `MLTestingUtils.checkNumericTypes`
2, move the datatype cast into `Predictor.fit`, and supply each algorithm's `train()` with the cast DataFrame (a sketch of the cast step follows below)
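
A hedged sketch of the cast step in (2), using an illustrative helper rather than the actual code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Normalize any numeric weight column to Double before handing the
// DataFrame to an algorithm's train().
def castWeightCol(dataset: DataFrame, weightCol: String): DataFrame =
  dataset.withColumn(weightCol, col(weightCol).cast(DoubleType))
```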
## How was this patch tested?

local tests in spark-shell and unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15314 from zhengruifeng/weightCol_support_int.
2017-01-23 17:24:53 -08:00
Ilya Matiach 5b258b8b07 [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case
[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments

## What changes were proposed in this pull request?

Fix a bug in which BisectingKMeans fails with error:
java.util.NoSuchElementException: key not found: 166
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
        at scala.collection.immutable.List.reduceLeft(List.scala:84)
        at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
        at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)

## How was this patch tested?

The dataset was run against the code change to verify that the code works. I will try to add unit tests to the code.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
2017-01-23 13:34:27 -08:00
z001qdp c8aea7445c [SPARK-17455][MLLIB] Improve PAVA implementation in IsotonicRegression
## What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression, which is used under the hood by ml.regression.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.
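For readers unfamiliar with PAVA, here is a compact sketch of the core idea, pooling adjacent blocks whose means violate monotonicity (illustrative only, not the mllib implementation):

```scala
import scala.collection.mutable.ArrayBuffer

// Given y values (with weights) ordered by x, produce the isotonic fit.
def pava(y: Array[Double], w: Array[Double]): Array[Double] = {
  case class Block(var sum: Double, var weight: Double, var size: Int) {
    def mean: Double = sum / weight
  }
  val blocks = ArrayBuffer.empty[Block]
  for (i <- y.indices) {
    blocks += Block(y(i) * w(i), w(i), 1)
    // Pool while the last two blocks are out of order.
    while (blocks.length > 1 && blocks(blocks.length - 2).mean > blocks.last.mean) {
      val top = blocks.remove(blocks.length - 1)
      blocks.last.sum += top.sum
      blocks.last.weight += top.weight
      blocks.last.size += top.size
    }
  }
  blocks.flatMap(b => Array.fill(b.size)(b.mean)).toArray
}

// pava(Array(1.0, 3.0, 2.0, 4.0), Array.fill(4)(1.0))
// pools 3 and 2 into 2.5 -> Array(1.0, 2.5, 2.5, 4.0)
```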
## How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by

``` scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi}
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, x), w) => (y, x, w)}
```

Before this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.35 |
| 200 | 3.14 |
| 400 | 116.10 |
| 800 | 2134225.90 |

After this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.25 |
| 200 | 2.53 |
| 400 | 5.86 |
| 800 | 10.55 |

Benchmarking was also performed with randomly-generated y values, with similar results.

Author: z001qdp <Nicholas.Eggert@target.com>

Closes #15018 from neggert/SPARK-17455-isoreg-algo.
2017-01-23 13:20:52 -08:00
Yuhao 4a11d029dc [SPARK-14709][ML] spark.ml API for linear SVM
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-14709

Provide API for SVM algorithm for DataFrames. As discussed in jira, the initial implementation uses OWL-QN with Hinge loss function.
The API should mimic existing spark.ml.classification APIs.
Currently only Binary Classification is supported. Multinomial support can be added in this or following release.
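A usage sketch mirroring the other spark.ml classifiers (setter names assumed from the shared classification params):

```scala
import org.apache.spark.ml.classification.LinearSVC

val lsvc = new LinearSVC()
  .setMaxIter(100)
  .setRegParam(0.1)
// val model = lsvc.fit(training)  // training: DataFrame with label/features columns
```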
## How was this patch tested?

new unit tests and simple manual test

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15211 from hhbyyh/mlsvm.
2017-01-23 12:18:06 -08:00
actuaryzhang f067acefab [SPARK-19155][ML] Make family case insensitive in GLM
## What changes were proposed in this pull request?
This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`
```
model.getFamily == Binomial.name || model.getFamily == Poisson.name
```

## How was this patch tested?
Update existing tests for 'Poisson' and 'Binomial'.

yanboliang felixcheung imatiach-msft

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16675 from actuaryzhang/family.
2017-01-23 00:53:44 -08:00
Yanbo Liang 0c589e3713 [SPARK-19291][SPARKR][ML] spark.gaussianMixture supports output log-likelihood.
## What changes were proposed in this pull request?
```spark.gaussianMixture``` supports output total log-likelihood for the model like R ```mvnormalmixEM```.

## How was this patch tested?
R unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16646 from yanboliang/spark-19291.
2017-01-21 21:26:14 -08:00
Yanbo Liang 3dcad9fab1 [SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should case insensitive
## What changes were proposed in this pull request?
MLlib ```GeneralizedLinearRegression``` ```family``` and ```link``` should be case insensitive. This is consistent with some other MLlib params such as [```featureSubsetStrategy```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415).

## How was this patch tested?
Update corresponding tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16516 from yanboliang/spark-19133.
2017-01-21 21:15:57 -08:00
Zheng RuiFeng 8ccca9170f [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary
## What changes were proposed in this pull request?

add loglikelihood in GMM.summary

## How was this patch tested?

added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #12064 from zhengruifeng/gmm_metric.
2017-01-19 03:46:37 -08:00
Ilya Matiach fe409f31d9 [SPARK-14975][ML] Fixed GBTClassifier to predict probability per training instance and fixed interfaces
## What changes were proposed in this pull request?

For all of the classifiers in MLLib we can predict probabilities except for GBTClassifier.
Also, all classifiers inherit from ProbabilisticClassifier but GBTClassifier strangely inherits from Predictor, which is a bug.
This change corrects the interface and adds the ability for the classifier to give a probabilities vector.

## How was this patch tested?

The basic ML tests were run after making the changes.  I've marked this as WIP as I need to add more tests.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16441 from imatiach-msft/ilmat/fix-GBT.
2017-01-18 15:33:41 -08:00
uncleGen eefdf9f9dd [SPARK-19227][SPARK-19251] remove unused imports and outdated comments
## What changes were proposed in this pull request?
Remove unused imports and outdated comments, and fix some minor code style issues.

## How was this patch tested?
existing ut

Author: uncleGen <hustyugm@gmail.com>

Closes #16591 from uncleGen/SPARK-19227.
2017-01-18 09:44:32 +00:00
Zheng RuiFeng e7f982b20d [SPARK-18206][ML] Add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR
## What changes were proposed in this pull request?

add instrumentation for MLP,NB,LDA,AFT,GLM,Isotonic,LiR
## How was this patch tested?

local test in spark-shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #15671 from zhengruifeng/lir_instr.
2017-01-17 15:39:51 -08:00
hyukjinkwon 6c00c069e3 [SPARK-3249][DOC] Fix links in ScalaDoc that cause warning messages in sbt/sbt unidoc
## What changes were proposed in this pull request?

This PR proposes to fix ambiguous link warnings by simply making them as code blocks for both javadoc and scaladoc.

```
[warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
...
```

This PR also fixes javadoc8 break as below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                   ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error]  * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error]                                                                                                ^
[info] 3 errors
```

## How was this patch tested?

Manually via `sbt unidoc > output.txt` and the checked it via `cat output.txt | grep ambiguous`

and `sbt unidoc | grep error`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16604 from HyukjinKwon/SPARK-3249.
2017-01-17 12:28:15 +00:00
wm624@hotmail.com 12c8c21608 [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctly
## What changes were proposed in this pull request?

spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE based on the Scala code.

In addition, the `summary` method should bring back the results related to `DistributedLDAModel`.

## How was this patch tested?
Manual tests by comparing with scala example.
Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16464 from wangmiao1981/new.
2017-01-16 06:05:59 -08:00
wm624@hotmail.com 7f24a0b6c3 [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as parameters
## What changes were proposed in this pull request?
spark.kmeans doesn't have an interface to set initSteps, seed, and tol. As the Spark KMeans algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans.

Add missing parameters and corresponding document.

Modified existing unit tests to take additional parameters.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16523 from wangmiao1981/kmeans.
2017-01-12 22:27:57 -08:00
wm624@hotmail.com c983267b08 [SPARK-19110][MLLIB][FOLLOWUP] Add a unit test for testing logPrior and logLikelihood of DistributedLDAModel in MLLIB
## What changes were proposed in this pull request?
#16491 added the fix to mllib and a unit test to ml. This follow-up PR adds unit tests to the mllib suite.

## How was this patch tested?
Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16524 from wangmiao1981/ldabug.
2017-01-12 18:31:57 -08:00
Peng, Meng 32286ba68a [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
2017-01-10 13:09:58 +00:00
Yanbo Liang 3ef6d98a80 [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy `GaussianMixture` implementation from mllib to ml, then we can add new features to it.
Unlike some other algorithms that wrap the ml implementation, I left the mllib `GaussianMixture` untouched, for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml currently does not. (We will definitely add this feature to ml in the future.)

We could work around these issues and make mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.

Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated the job execution. A sketch of the packing idea follows.
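A plain-Scala sketch of the packing trick (illustrative, not the actual ml code): a symmetric n x n matrix is fully determined by its n*(n+1)/2 upper-triangular entries, so only those need to be shuffled.

```scala
// Column-major packing of the upper triangle (entries with i <= j).
def packUpperTriangular(n: Int, m: Array[Double]): Array[Double] = {
  val packed = new Array[Double](n * (n + 1) / 2)
  var idx = 0
  for (j <- 0 until n; i <- 0 to j) {
    packed(idx) = m(i + j * n)
    idx += 1
  }
  packed
}

// Reconstruct the full matrix by mirroring across the diagonal.
def unpackToFull(n: Int, packed: Array[Double]): Array[Double] = {
  val m = new Array[Double](n * n)
  var idx = 0
  for (j <- 0 until n; i <- 0 to j) {
    m(i + j * n) = packed(idx)
    m(j + i * n) = packed(idx)
    idx += 1
  }
  m
}
```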

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
2017-01-09 21:38:46 -08:00
wm624@hotmail.com 036b50347c [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model
## What changes were proposed in this pull request?

While adding DistributedLDAModel training summary for SparkR, I found that the logPrior for original and loaded model is different.
For example, in the test("read/write DistributedLDAModel"), I add the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns a different `logPrior`.

Please refer to #16464 for details.
## How was this patch tested?
Add a new unit test for testing logPrior.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16491 from wangmiao1981/ldabug.
2017-01-07 11:07:49 -08:00
Wenchen Fan b3d39620c5 [SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter
## What changes were proposed in this pull request?

`OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:
1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write`, and remove `def write(row: Row)` and its related converter code. All implementations should just implement `def write(row: InternalRow)`

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16479 from cloud-fan/hive-writer.
2017-01-08 00:42:09 +08:00
sueann d60f6f62d0 [SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes #16480 from sueann/SPARK-18194.
2017-01-06 18:53:16 -08:00
Yanbo Liang dfc4c935ba [MINOR] Correct LogisticRegression test case for probability2prediction.
## What changes were proposed in this pull request?
Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```.

## How was this patch tested?
Change unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16477 from yanboliang/lor-pred.
2017-01-05 18:59:49 -08:00
Niranjan Padmanabhan a1e40b1f5d [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Zheng RuiFeng 7a82505817 [SPARK-19054][ML] Eliminate extra pass in NB
## What changes were proposed in this pull request?
eliminate unnecessary extra pass in NB's train

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16453 from zhengruifeng/nb_getNC.
2017-01-04 11:54:13 +00:00
Weiqing Yang e5c307c50a [MINOR] Add missing sc.stop() to end of examples
## What changes were proposed in this pull request?

Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.

## How was this patch tested?
Pass the build and unit tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16426 from weiqingy/testIssue.
2017-01-03 09:56:42 +00:00
Sean Owen 56d3a7eb83 [SPARK-18808][ML][MLLIB] ml.KMeansModel.transform is very inefficient
## What changes were proposed in this pull request?

mllib.KMeansModel.clusterCentersWithNorm is a method that ends up being called every time `predict` is called on a single vector, which is bad news for how the ml.KMeansModel Transformer works, since it necessarily transforms one vector at a time.

This change makes the model store the vectors with their norms upfront. The extra norms should be small compared to the vectors themselves, and caching them avoids this form of overhead on this and other code paths.
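A plain-Scala sketch of the idea (not the mllib code):

```scala
// Pair each cluster center with its precomputed L2 norm once; the
// norms feed mllib's fast distance bounds, so they should not be
// recomputed on every single-vector predict() call.
class CachedCenters(centers: Array[Array[Double]]) {
  lazy val centersWithNorm: Array[(Array[Double], Double)] =
    centers.map(c => (c, math.sqrt(c.map(x => x * x).sum)))
}
```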

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #16328 from srowen/SPARK-18808.
2016-12-30 10:40:17 +00:00
Ilya Matiach 87bc4112c5 [SPARK-18698][ML] Adding public constructor that takes uid for IndexToString
## What changes were proposed in this pull request?

Based on SPARK-18698, this adds a public constructor that takes a UID for IndexToString.  Other transforms have similar constructors.

## How was this patch tested?

A unit test was added to verify the new functionality.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16436 from imatiach-msft/ilmat/fix-indextostring.
2016-12-29 13:25:49 -08:00
sethah 6a475ae466 [SPARK-17772][ML][TEST] Add test functions for ML sample weights
## What changes were proposed in this pull request?

More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to `MLTestingUtils` that can be reused by various algorithms accepting sample weights. Up to now, there seems to be a few tests that have been implemented commonly:

* Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
* Check that outliers with tiny sample weights do not affect the algorithm's performance

This patch adds an additional test:

* Check that algorithms are invariant to constant scaling of the sample weights. i.e. uniform sample weights with `w_i = 1.0` is effectively the same as uniform sample weights with `w_i = 10000` or `w_i = 0.0001`

The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when [SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478) is implemented.

## How was this patch tested?

This patch only involves modifying test suites.

## Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend `HasWeightCol`. I did not modify these test suites because it will make this patch easier to review, and because they did not duplicate the same tests as the three suites that were modified. If we want to change them later, we can create a JIRA for it now, but it's open for debate.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15721 from sethah/SPARK-17772.
2016-12-28 07:01:14 -08:00
Yanbo Liang 9cff67f346 [MINOR][ML] Correct test cases of LoR raw2prediction & probability2prediction.
## What changes were proposed in this pull request?
Correct test cases of ```LogisticRegression``` raw2prediction & probability2prediction.

## How was this patch tested?
Changed unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16407 from yanboliang/raw-probability.
2016-12-28 01:24:18 -08:00
Peng 79ff853631 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

Unit tests will be added soon.

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes #15212 from mpjlu/fdr_fwe.
2016-12-28 00:49:36 -08:00
Ryan Williams afd9bc1d8a [SPARK-17807][CORE] split test-tags into test-JAR
Remove spark-tags' compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.
2016-12-21 16:37:20 -08:00
Zakaria_Hili 7db09abb01
[SPARK-18356][ML] KMeans should cache RDD before training
## What changes were proposed in this pull request?

At the request of Joseph Bradley, I updated my PR https://github.com/apache/spark/pull/15965 to eliminate the extra fit() method.

jkbradley
## How was this patch tested?
Pass existing tests

Author: Zakaria_Hili <zakahili@gmail.com>
Author: HILI Zakaria <zakahili@gmail.com>

Closes #16295 from ZakariaHili/zakbranch.
2016-12-19 10:30:38 +00:00
Anthony Truchet 9e8a9d7c6a
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?

CostFun used to send a dense vector of zeroes as a closure in a
treeAggregate call. To avoid that, we replace treeAggregate with
mapPartitions + treeReduce, creating the zero vector in place inside
the mapPartitions block.
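
A hedged sketch of the pattern change (names assumed, not the actual CostFun code):

```scala
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.rdd.RDD

// Before, the n-dimensional zero vector travelled inside the closure of
//   data.treeAggregate(Vectors.zeros(n))(seqOp, combOp)
// After, each partition allocates its own accumulator and only the partial
// sums are shipped back through treeReduce.
def gradientSum(data: RDD[(Double, Vector)], n: Int): DenseVector = {
  data.mapPartitions { iter =>
    val acc = Vectors.zeros(n).toDense   // allocated per partition, never serialized from the driver
    iter.foreach { case (_, features) =>
      features.foreachActive { (i, v) => acc.values(i) += v }  // stand-in for the real gradient update
    }
    Iterator(acc)
  }.treeReduce { (a, b) =>
    var i = 0
    while (i < a.values.length) { a.values(i) += b.values(i); i += 1 }
    a
  }
}
```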

## How was this patch tested?

Unit test for module mllib run locally for correctness.

As for performance, we ran a heavy optimization on our production data (50 iterations on 128 MB weight vectors) and saw a significant decrease both in runtime and in containers being killed for lack of off-heap memory.

Author: Anthony Truchet <a.truchet@criteo.com>
Author: sethah <seth.hendrickson16@gmail.com>
Author: Anthony Truchet <AnthonyTruchet@users.noreply.github.com>

Closes #16037 from AnthonyTruchet/ENG-17719-lbfgs-only.
2016-12-13 21:30:57 +00:00
actuaryzhang e57e3938c6
[SPARK-18715][ML] Fix AIC calculations in Binomial GLM
The AIC calculation in Binomial GLM seems to be off when the response or weight is non-integer: the result is different from that in R. This issue arises when one models rates, i.e., the number of successes normalized by the number of trials, and uses the number of trials as weights. In this case, the effective likelihood is weight * label ~ binomial(weight, mu), where weight is the number of trials, weight * label is the number of successes, and mu is the success rate.

srowen sethah yanboliang HyukjinKwon zhengruifeng

## What changes were proposed in this pull request?
I suggest changing the current aic calculation for the Binomial family from
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
}.sum()
```
to the following which generalizes to the case of real-valued response and weights.
```
-2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
  val wt = math.round(weight).toInt
  if (wt == 0) {
    0.0
  } else {
    dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
  }
}.sum()
```
## How was this patch tested?
I will write the unit test once the community wants to include the proposed change. For now, the following modifies existing tests in weighted Binomial GLM to illustrate the issue. The second label is changed from 0 to 0.5.

```
val datasetWithWeight = Seq(
    (1.0, 1.0, 0.0, 5.0),
    (0.5, 2.0, 1.0, 2.0),
    (1.0, 3.0, 2.0, 1.0),
    (0.0, 4.0, 3.0, 3.0)
  ).toDF("y", "w", "x1", "x2")

val formula = (new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label"))
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight).select("features", "label", "w")

val glr = new GeneralizedLinearRegression()
    .setFamily("binomial")
    .setWeightCol("w")
    .setFitIntercept(false)
    .setRegParam(0)

val model = glr.fit(output)
model.summary.aic
```
The AIC from Spark is 17.3227, and the AIC from R is 15.66454.

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16149 from actuaryzhang/aic.
2016-12-13 21:27:29 +00:00
krishnakalyan3 c802ad8718
[SPARK-18628][ML] Update Scala param and Python param to have quotes
## What changes were proposed in this pull request?

Updated Scala params and Python params to have quotes around the options, making them easier for users to read.

## How was this patch tested?

Manually checked the docstrings

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16242 from krishnakalyan3/doc-string.
2016-12-11 09:28:16 +00:00
Michal Senkyr 114324832a
[SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building with Java 8
## What changes were proposed in this pull request?

The API documentation build was failing when using Java 8 due to an incorrect `>` character in Javadoc.

Replace `>` with the literal `&gt;` in Javadoc to allow the build to pass.

## How was this patch tested?

Documentation was built and inspected manually to ensure it still displays correctly in the browser

```
cd docs && jekyll serve
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16201 from michalsenkyr/javadoc8-gt-fix.
2016-12-10 19:54:07 +00:00
Yanbo Liang 97255497d8 [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1
## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```, since it is used when making predictions and should instead be an argument of ```predict```; we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16169 from yanboliang/spark-18326.
2016-12-07 20:23:28 -08:00
actuaryzhang b828027139
[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization
Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits.
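
A hedged sketch of the idea (not the actual GeneralizedLinearRegression code; R's poisson family similarly starts from mustart <- y + 0.1):

```scala
// Keep the initial mean away from zero so the log link stays finite at y == 0.
def initialPoissonMu(y: Double): Double = {
  require(y >= 0.0, "Poisson response must be non-negative")
  y + 0.1 // mu = y alone breaks at y == 0, since log(0) = -Inf
}

val eta0 = math.log(initialPoissonMu(0.0)) // finite, unlike log(0)
```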

## What changes were proposed in this pull request?
Update initialization in Poisson GLM

## How was this patch tested?
Add test in GeneralizedLinearRegressionSuite

srowen sethah yanboliang HyukjinKwon mengxr

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16131 from actuaryzhang/master.
2016-12-07 16:37:25 +08:00
Yanbo Liang 90b59d1bf2 [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.
## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important to R users. Meanwhile, these metrics ignore instance weights (setting them all to 1.0), which will change in a later Spark version. To avoid introducing breaking changes then, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that R users would rarely use.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16117 from yanboliang/spark-18686.
2016-12-07 00:31:11 -08:00
Yuhao fac5b75b74
[SPARK-18374][ML] Incorrect words in StopWords/english.txt
## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all the apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

Add the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. Also remove "won" from the list.

see more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374
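
A small illustration of the mismatch, with an assumed local-mode setup: the default Tokenizer keeps "wouldn't" as a single token, so a stop list containing only the fragments "wouldn" and "t" never filters it.

```scala
import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq("i wouldn't do that").toDF("text")
val tokenized = new Tokenizer().setInputCol("text").setOutputCol("words").transform(df)

// "wouldn't" survives filtering when the stop list only has "wouldn" and "t".
new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
  .transform(tokenized)
  .select("filtered")
  .show(truncate = false)
```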

## How was this patch tested?
existing ut

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.
2016-12-07 05:12:24 +08:00
Zheng RuiFeng bdfe7f6746 [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol
## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`
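
A hedged usage sketch of the new setters (`train`/`test` and column names assumed):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

val ovr = new OneVsRest().setClassifier(new LogisticRegression())
val ovrModel = ovr.fit(train)

// The new setters let a fitted or loaded model be pointed at different columns.
ovrModel.setFeaturesCol("scaledFeatures").setPredictionCol("ovrPrediction")
val predictions = ovrModel.transform(test)
```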

## How was this patch tested?
added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16059 from zhengruifeng/ovrm_setCol.
2016-12-05 00:32:58 -08:00
Reynold Xin c7c7265950 [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT
## What changes were proposed in this pull request?
This patch bumps master branch version to 2.2.0-SNAPSHOT.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16126 from rxin/SPARK-18695.
2016-12-02 21:09:37 -08:00
Yanbo Liang a985dd8e99 [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial."
## What changes were proposed in this pull request?
It's better to fix this issue by providing an option ```type``` that lets users change the ```predict``` output schema; then they could output probabilities, log-space predictions, or original labels. To avoid a breaking API change in 2.1, we revert this change first and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.

## How was this patch tested?
Existing unit tests.

This reverts commit daa975f4bf.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16118 from yanboliang/spark-18291-revert.
2016-12-02 12:16:57 -08:00
Nathan Howell c82f16c15e [SPARK-18658][SQL] Write text records directly to a FileOutputStream
## What changes were proposed in this pull request?

This replaces uses of `TextOutputFormat` with an `OutputStream`, which will either write directly to the filesystem or indirectly via a compressor (if so configured). This avoids intermediate buffering.

The inverse of this (reading directly from a stream) is necessary for streaming large JSON records (when `wholeFile` is enabled) so I wanted to keep the read and write paths symmetric.
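
A hedged sketch of the direct-write pattern (helper name assumed, not the actual TextFileFormat code): write straight to the filesystem stream, and wrap it in a compressor only when the path's extension maps to a codec.

```scala
import java.io.OutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

def openForWrite(path: String, conf: Configuration): OutputStream = {
  val p = new Path(path)
  val raw = p.getFileSystem(conf).create(p)          // direct FS stream
  Option(new CompressionCodecFactory(conf).getCodec(p)) // codec inferred from extension, may be null
    .map(codec => codec.createOutputStream(raw): OutputStream)
    .getOrElse(raw)
}
```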

## How was this patch tested?

Existing unit tests.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #16089 from NathanHowell/SPARK-18658.
2016-12-01 21:40:49 -08:00
wm624@hotmail.com 2eb6764fbb [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support outputting the original label.
## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the index label.

In this PR, original label output is supported and test cases are modified and added. Document is also modified.

## How was this patch tested?

Unit tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15910 from wangmiao1981/audit.
2016-11-30 20:32:17 -08:00
Yanbo Liang 60022bfd65 [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16009 from yanboliang/spark-18318.
2016-11-30 13:21:05 -08:00
Anthony Truchet c5a64d7606
[SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun
## What changes were proposed in this pull request?

Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS.
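
A minimal sketch of the fix pattern (`sc`, `data`, `weights` and `localGradient` are assumed placeholders):

```scala
// bcWeights is created once per objective evaluation; destroying it at the
// end releases the copies on the driver and executors instead of leaking
// one broadcast per L-BFGS iteration.
val bcWeights = sc.broadcast(weights)
val loss = data.map(point => localGradient(bcWeights.value, point)).sum()
bcWeights.destroy()
```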

## How was this patch tested?

Unit tests, plus verification that this fixed fatal memory consumption on Criteo's use cases.

This contribution is made on behalf of Criteo S.A.
(http://labs.criteo.com/) under the terms of the Apache v2 License.

Author: Anthony Truchet <a.truchet@criteo.com>

Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
2016-11-30 10:04:47 +00:00
Yuhao 9b670bcaec [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
## What changes were proposed in this pull request?
Make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check items marked final or sealed to see if they are stable enough to be opened up as APIs.

Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319

## How was this patch tested?
existing ut

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15972 from hhbyyh/experimental21.
2016-11-29 18:46:59 -08:00
Yanbo Liang 95f7985012 [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to subclasses
## What changes were proposed in this pull request?
Mainly two changes:
* Move DT/RF/GBT Param setter methods to subclasses.
* Deprecate corresponding setter methods in the model classes.

See discussion here https://github.com/apache/spark/pull/15913#discussion_r89662469.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16017 from yanboliang/spark-18592.
2016-11-29 11:19:35 -08:00
hyukjinkwon 1a870090e4
[SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for backticks
## What changes were proposed in this pull request?

Currently, a single-line comment does not render backticks as `<code>..</code>` but prints them as they are (`` `..` ``). For example, the line below:

```scala
/** Return an RDD with the pairs from `this` whose keys are not in `other`. */
```

So, we could work around this as below:

```scala
/**
 * Return an RDD with the pairs from `this` whose keys are not in `other`.
 */
```

- javadoc

  - **Before**
    ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png)

  - **After**
    ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png)

- scaladoc (this one looks fine either way)

  - **Before**
    ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png)

  - **After**
    ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png)

I suspect this is related to SPARK-16153 and the genjavadoc issue `typesafehub/genjavadoc#85`.

## How was this patch tested?

I found them via

```
grep -r "\/\*\*.*\`" . | grep .scala
```

and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16050 from HyukjinKwon/javadoc-markdown.
2016-11-29 13:50:24 +00:00
hyukjinkwon f830bb9170
[SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
## What changes were proposed in this pull request?

This PR make `sbt unidoc` complete with Java 8.

This PR roughly includes several fixes as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```diff
  - * A column that will be computed based on the data in a [[DataFrame]].
  + * A column that will be computed based on the data in a `DataFrame`.
  ```

- Fix throws annotations so that they are recognisable in javadoc

- Fix URL links to `<a href="http..."></a>`.

  ```diff
  - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
  + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
  + * Decision tree (Wikipedia)</a> model for regression.
  ```

  ```diff
  -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
  +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
  +   * Receiver operating characteristic (Wikipedia)</a>
  ```

- Fix `<` and `>` to

  - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.

  - Wrap it with `{{{...}}}` to print them in javadoc, or use `{@code ...}` or `{@literal ...}`. Please refer to https://github.com/apache/spark/pull/16013#discussion_r89665558

- Fix `</p>` complaint

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
2016-11-29 09:41:32 +00:00
Yun Ni 05f7c6ffab [SPARK-18408][ML] API Improvements for LSH
## What changes were proposed in this pull request?

(1) Change output schema to `Array of Vector` instead of `Vectors`
(2) Use `numHashTables` as the dimension of Array
(3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH`
(4) Make `randUnitVectors/randCoefficients` private
(5) Make Multi-Probe NN Search and `hashDistance` private for future discussion
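
A hedged usage sketch of the renamed estimator (dataset and column names assumed):

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val lsh = new BucketedRandomProjectionLSH()
  .setNumHashTables(3)        // now also the length of the output Array of Vector
  .setBucketLength(2.0)
  .setInputCol("features")
  .setOutputCol("hashes")
val model = lsh.fit(df)
model.transform(df).select("hashes").show(truncate = false)
```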

Saved for future PRs:
(1) AND-amplification and `numHashFunctions` as the dimension of Vector are saved for a future PR.
(2) `hashDistance` and MultiProbe NN Search needs more discussion. The current implementation is just a backward compatible one.

## How was this patch tested?
Related unit tests are modified to make sure the performance of LSH is preserved and the outputs of the APIs meet expectations.

Author: Yun Ni <yunn@uber.com>
Author: Yunni <Euler57721@gmail.com>

Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
2016-11-28 15:14:46 -08:00
Yanbo Liang c4a7eef0ce [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15913 from yanboliang/spark-18481.
2016-11-26 05:28:41 -08:00
Zakaria_Hili 445d4d9e13
[SPARK-18356][ML] Improve MLKmeans Performance
## What changes were proposed in this pull request?

Spark KMeans fit() doesn't cache the RDD, which generates a lot of warnings:
 WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
So KMeans should cache the internal RDD before calling the MLlib KMeans algorithm; this improved Spark KMeans performance by 14%.
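
A minimal sketch of the caching pattern, with assumed names rather than the exact ml.KMeans code:

```scala
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Persist the converted RDD before training and unpersist afterwards, so the
// iterative MLlib algorithm doesn't recompute the input on every pass.
def fitWithCaching(instances: RDD[Vector], k: Int): KMeansModel = {
  val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  val model = new MLlibKMeans().setK(k).run(instances)
  if (handlePersistence) instances.unpersist()
  model
}
```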

a9cf905cf7

hhbyyh
## How was this patch tested?
Pass Kmeans tests and existing tests

Author: Zakaria_Hili <zakahili@gmail.com>
Author: HILI Zakaria <zakahili@gmail.com>

Closes #15965 from ZakariaHili/zakbranch.
2016-11-25 13:19:26 +00:00
hyukjinkwon 51b1c1551d
[SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?

This PR only tries to fix things that look pretty straightforward and were fixed in previous PRs.

This PR roughly fixes several things as below:

- Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` ``

  ```
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
  [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
  ```

- Fix an exception annotation and remove code backticks in `throws` annotation

  Currently, sbt unidoc with Java 8 complains as below:

  ```
  [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
  [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
  ```

  `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).

- Fix `[[http..]]` to `<a href="http..."></a>`.

  ```diff
  -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
  -   * blog page]].
  +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
  +   * Oracle blog page</a>.
  ```

   `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.

- It seems a class can't have a `return` annotation. So, two cases of this were removed.

  ```
  [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
  [error]    * return New instance of IsotonicRegression.
  ```

- Fix < to `&lt;` and > to `&gt;` according to HTML rules.

- Fix `</p>` complaint

- Exclude tags unrecognisable in javadoc: `constructor`, `todo` and `groupname`.

## How was this patch tested?

Manually tested by `jekyll build` with Java 7 and 8

```
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
```

```
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
```

Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors with Java 8.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15999 from HyukjinKwon/SPARK-3359-errors.
2016-11-25 11:27:07 +00:00
Zheng RuiFeng 2dfabec38c [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and GaussianMixtureModel
## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for BiKModel and GMModel
add `setProbabilityCol` for GMModel

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15957 from zhengruifeng/bikm_set.
2016-11-24 05:46:05 -08:00
Yanbo Liang 982b82e32e [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data
## What changes were proposed in this pull request?
* Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
* Scala/Python GLM summary should throw an exception if users request ```standard error of coefficients, t value and p value``` when the underlying WLS was solved by local "l-bfgs".

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15930 from yanboliang/spark-18501.
2016-11-22 19:17:48 -08:00
sethah e811fbf9ed [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM
## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.

## How was this patch tested?

Unit tests.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15777 from sethah/pyspark_cluster_summaries.
2016-11-21 05:36:49 -08:00