### What changes were proposed in this pull request?
Fix mistakes in comments
### Why are the changes needed?
There are mistakes in comments
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes #27564 from xwu99/fix-mllib-sprand-comment.
Authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Avoid `Iterator.grouped(size: Int)`, which needs to maintain an `ArrayBuffer` of `size` entries;
2. Keep the number of partitions in the curve computation.
### Why are the changes needed?
1. `BinaryClassificationMetrics` tends to fail (OOM) when `grouping = count/numBins` is too large, because `Iterator.grouped(size: Int)` needs to maintain an `ArrayBuffer` with `size` entries; however, in `BinaryClassificationMetrics` we do not need to maintain such a big array;
2. Make the sizes of partitions more even.
This PR makes the metrics computation more stable and a little faster.
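As a rough sketch of the first point (a hypothetical helper, not the actual patch): each bin can be folded into a running sum on the fly instead of buffering `size` elements the way `Iterator.grouped` does.
```scala
// Hypothetical helper: sum each run of `grouping` scores without
// materializing the group in an ArrayBuffer.
def binnedSums(it: Iterator[Double], grouping: Long): Iterator[Double] =
  new Iterator[Double] {
    def hasNext: Boolean = it.hasNext
    def next(): Double = {
      var sum = 0.0
      var n = 0L
      while (n < grouping && it.hasNext) { sum += it.next(); n += 1 }
      sum
    }
  }
```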
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27682 from zhengruifeng/grouped_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When VectorAssembler encounters a NULL with handleInvalid="error", it throws an exception. This exception, though, has a typo making it confusing. Yet apparently, this same exception for NaN values is fine. Fixed it to look like the right one.
### Why are the changes needed?
Encountering this error with such message was very confusing! I hope to save time of fellow engineers by improving it.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
It's just an error message...
Closes #27709 from Saluev/patch-1.
Authored-by: Tigran Saluev <tigran@saluev.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.
### Why are the changes needed?
N/A
### Does this PR introduce any user-facing change?
N/A
### How was this patch tested?
N/A
Closes #27698 from gatorsmile/updateVersion.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Remove unused imports and variables;
2. Use `.iterator` instead of `.view` to avoid IDEA warnings;
3. Remove resolved _TODO_s.
### Why are the changes needed?
cleanup
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27600 from zhengruifeng/nits.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Set the supplied output col name as intended when StringIndexer transforms an input after setOutputCols is used.
### Why are the changes needed?
The output col names are wrong otherwise and downstream pipeline components fail.
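For context, a minimal sketch of the multi-column usage this fixes (hypothetical column names):
```scala
import org.apache.spark.ml.feature.StringIndexer

// With setOutputCols, transform() must emit exactly these column names;
// before the fix, downstream stages looking for "aIdx"/"bIdx" would fail.
val indexer = new StringIndexer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("aIdx", "bIdx"))
```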
### Does this PR introduce any user-facing change?
Yes in the sense that it fixes incorrect behavior, otherwise no.
### How was this patch tested?
Existing tests plus new direct tests of the schema.
Closes #27684 from srowen/SPARK-30939.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This is the very first PR for supporting continuous-distribution feature selectors.
It adds the algorithm to compute the F-value for continuous features and continuous labels. This algorithm will be used for FValueRegressionSelector.
### Why are the changes needed?
Currently, Spark only supports the selection of categorical features, while there are many requirements for the selection of continuous-distribution features.
I will add two new selectors:
1. FValueRegressionSelector for continuous features and continuous labels.
2. ANOVAFValueClassificationSelector for continuous features and categorical labels.
I will use subtasks to add these two selectors:
- add FValueRegressionSelector on the Scala side
  - add FValueRegressionTest, which contains the algorithm to compute the F-value
  - add FValueRegressionSelector using the above algorithm
  - add a common Selector, and make FValueRegressionSelector and ChisqSelector extend it
- add FValueRegressionSelector on the Python side
- add samples and docs
- do the same for ANOVAFValueClassificationSelector
### Does this PR introduce any user-facing change?
Yes.
```
/**
 * @param dataset DataFrame of continuous labels and continuous features.
 * @param featuresCol Name of features column in dataset, of type `Vector` (`VectorUDT`)
 * @param labelCol Name of label column in dataset, of any numerical type
 * @return Array containing the SelectionTestResult for every feature against the label.
 */
SelectionTest.fValueRegressionTest(dataset: Dataset[_], featuresCol: String, labelCol: String)
```
### How was this patch tested?
Add Unit test.
Closes #27623 from huaxingao/spark-30867.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR proposes to throw an exception by default when a user uses an untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).
Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.
### Why are the changes needed?
According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, for `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.
As a result, this can change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.
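A minimal sketch contrasting the two APIs, based on the example above:
```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Untyped UDF (the API gated by this PR): under Scala 2.12 a null input
// is unboxed to the Java default value, so f($"x") yields 0, not null.
val f = udf((x: Int) => x, IntegerType)

// Typed UDF: Spark knows the input type and keeps null inputs null.
val g = udf((x: Int) => x)
```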
### Does this PR introduce any user-facing change?
Yes. Users will now hit an exception when using an untyped UDF.
### How was this patch tested?
Added test and updated some tests.
Closes #27488 from Ngone51/spark_26580_followup.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are three changes in this PR:
1. use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suites (similar to https://github.com/apache/spark/pull/26396)
2. Put common code in ```Summarizer.getRegressionSummarizers``` and ```Summarizer.getClassificationSummarizers```.
3. Move ```MultiClassSummarizer``` from ```LogisticRegression``` to ```ml.stat``` (this seems to be a better place since ```MultiClassSummarizer``` is not only used by ```LogisticRegression``` but also several other classes).
### Why are the changes needed?
Minimize code duplication and improve performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing test suites.
Closes #27555 from huaxingao/spark-30802.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Distributedly gather the `contingency` matrix of each feature;
2. Distributedly compute the results and then collect them back to the driver.
### Why are the changes needed?
The existing impl is not efficient:
1. It directly collects the `contingency` matrices of a subset of features to the driver and computes the corresponding results in one pass;
2. The `contingency` matrix of a feature is of size numDistinctValues x numDistinctLabels, so only 1000 matrices can be collected at a time.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27461 from zhengruifeng/chisq_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
It is said in [LeastSquaresAggregator](12e1bbaddb/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala (L188)) that:
> // do not use tuple assignment above because it will circumvent the transient tag
I then checked this issue with Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_241).
### Why are the changes needed?
Avoid tuple assignment because it will circumvent the transient tag.
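A minimal sketch of the pitfall, with hypothetical names: destructuring a tuple into `@transient lazy val`s makes the compiler store the whole tuple in a hidden field that the annotation does not reach, so the captured values get serialized anyway.
```scala
class Agg(xs: Array[Double]) extends Serializable {
  // Risky: the synthetic tuple field backing this pattern is not transient.
  // @transient private lazy val (sum, cnt) = (xs.sum, xs.length)

  // Safe: assign each value separately so @transient applies to each field.
  @transient private lazy val sum: Double = xs.sum
  @transient private lazy val cnt: Int = xs.length
}
```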
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27523 from zhengruifeng/avoid_tuple_assign_to_transient.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, we add a parameter to the Python function `vector_to_array(col)` that allows converting to a column of arrays of Float (32 bits) on the Scala side, which is mapped to a NumPy array of dtype=float32.
### Why are the changes needed?
In downstream ML training, using float32 instead of float64 (the default) allows a larger batch size, i.e., more data to fit in memory.
### Does this PR introduce any user-facing change?
Yes.
Old: `vector_to_array()` takes only one param
```
df.select(vector_to_array("colA"), ...)
```
New: `vector_to_array()` can take an additional optional param: `dtype` = "float32" (or "float64")
```
df.select(vector_to_array("colA", "float32"), ...)
```
### How was this patch tested?
Unit test in scala.
doctest in python.
Closes #27522 from liangz1/udf-float32.
Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Add ```HasBlockSize``` in shared Params in both Scala and Python.
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Add ```HasBlockSize``` in ALS, so users can specify the blockSize.
Make ```HasBlockSize``` a shared param so both ALS and MLP can use it.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize/getBlockSize```
```ALSModel.setBlockSize/getBlockSize```
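A hedged usage sketch of the new setters (values chosen for illustration):
```scala
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setRank(10)
  .setBlockSize(4096) // size of the stacked blocks used internally
```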
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27501 from huaxingao/spark_30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Revert
#27360 #27396 #27374 #27389
### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets.
The performance test of LogisticRegression (https://github.com/apache/spark/pull/27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression.
LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so we need to revert them to make sure there is no regression.
### Does this PR introduce any user-facing change?
Yes, the newly added param `blockSize` is removed.
### How was this patch tested?
reverted testsuites
Closes #27487 from zhengruifeng/revert_blockify_ii.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The var `negThetaSum` is always used together with `pi`, so we can add them up front.
### Why are the changes needed?
We only need to add one var, `piMinusThetaSum`, instead of both `pi` and `negThetaSum`.
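A minimal sketch with hypothetical values, since the two arrays only ever appear summed in the class score:
```scala
val pi = Array(-0.69, -0.69)          // log class priors
val negThetaSum = Array(-3.2, -2.9)   // per-class sum of log(1 - theta)
// Precompute once instead of adding the two terms at every prediction.
val piMinusThetaSum = Array.tabulate(pi.length)(c => pi(c) + negThetaSum(c))
```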
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27427 from zhengruifeng/nb_predict.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Use blocks instead of vectors for performance improvement;
2. Use Level-2 BLAS;
3. Move standardization of input vectors outside of the gradient computation.
### Why are the changes needed?
1. Less RAM to persist training data (saves ~40%);
2. Faster than the existing impl (30% ~ 102%).
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27396 from zhengruifeng/blockify_lireg.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make ALS/MLP extend ```HasBlockSize```
### Why are the changes needed?
Currently, MLP has its own ```blockSize``` param, we should make MLP extend ```HasBlockSize``` since ```HasBlockSize``` was added in ```sharedParams.scala``` recently.
ALS doesn't have ```blockSize``` param now, we can make it extend ```HasBlockSize```, so user can specify the ```blockSize```.
### Does this PR introduce any user-facing change?
Yes
```ALS.setBlockSize``` and ```ALS.getBlockSize```
```ALSModel.setBlockSize``` and ```ALSModel.getBlockSize```
### How was this patch tested?
Manually tested. Also added doctest.
Closes #27389 from huaxingao/spark-30662.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace deprecated method `computeCost` of `BisectingKMeansModel` by `summary.trainingCost`.
### Why are the changes needed?
The changes eliminate deprecation warnings:
```
BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] assert(model.computeCost(dataset) < 0.1)
BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] assert(model.computeCost(dataset) == summary.trainingCost)
BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
[warn] model.computeCost(dataset)
```
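The replacement pattern, mirroring the asserts quoted above:
```scala
// Before (deprecated since 3.0.0):
assert(model.computeCost(dataset) < 0.1)
// After: the training cost is available on the model summary.
assert(model.summary.trainingCost < 0.1)
```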
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `BisectingKMeansSuite` via:
```
./build/sbt "test:testOnly *BisectingKMeansSuite"
```
Closes #27401 from MaxGekk/kmeans-computeCost-warning.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Use blocks instead of vectors;
2. Use Level-2 BLAS for binary, Level-3 BLAS for multinomial.
### Why are the changes needed?
1. Less RAM to persist training data (saves ~40%);
2. Faster than the existing impl (40% ~ 92%).
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
updated testsuites
Closes #27374 from zhengruifeng/blockify_lor.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Stack input vectors into blocks (like ALS/MLP);
2. Add a new param `blockSize`;
3. Add a new class `InstanceBlock`;
4. Standardize the input outside of the optimization procedure.
### Why are the changes needed?
1. Reduce RAM to persist the training dataset (saves ~40% in test);
2. Use Level-2 BLAS routines (12% ~ 28% faster, without native BLAS).
### Does this PR introduce any user-facing change?
a new param `blockSize`
### How was this patch tested?
existing and updated testsuites
Closes #27360 from zhengruifeng/blockify_svc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove ```numTrees``` in GBT in 3.0.0.
### Why are the changes needed?
Currently, GBT has
```
/**
 * Number of trees in ensemble
 */
@Since("2.0.0")
val getNumTrees: Int = trees.length
```
and
```
/** Number of trees in ensemble */
val numTrees: Int = trees.length
```
I think we should remove one of them. We deprecated it in 2.4.5 via https://github.com/apache/spark/pull/27352.
### Does this PR introduce any user-facing change?
Yes, remove ```numTrees``` in GBT in 3.0.0
### How was this patch tested?
existing tests
Closes #27330 from huaxingao/spark-numTrees.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add a param `bootstrap` to control whether bootstrap samples are used.
### Why are the changes needed?
Currently, RF with numTrees=1 directly builds a tree using the original dataset,
while with numTrees>1 it uses bootstrap samples to build trees.
This design supports training a DecisionTreeModel via the RandomForest impl; however, it is somewhat strange.
In Scikit-Learn, there is a param [bootstrap](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) to control whether bootstrap samples are used.
### Does this PR introduce any user-facing change?
Yes, new param is added
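A hedged usage sketch of the new param:
```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// With bootstrap=false, even numTrees=1 trains on the original dataset.
val rf = new RandomForestClassifier()
  .setNumTrees(1)
  .setBootstrap(false)
```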
### How was this patch tested?
existing testsuites
Closes #27254 from zhengruifeng/add_bootstrap.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
unpersist graph outside checkpointer, like what Pregel does
### Why are the changes needed?
As shown in [SPARK-30503](https://issues.apache.org/jira/browse/SPARK-30503), intermediate edges are not unpersisted.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites and manual test
Closes #27261 from zhengruifeng/lda_checkpointer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Change ```DecisionTreeClassifier``` to ```FMClassifier``` in ```OneVsRest``` setWeightCol test
### Why are the changes needed?
In ```OneVsRest```, if the classifier doesn't support instance weights, the ```OneVsRest``` weightCol is ignored, so the unit test covered one classifier (```LogisticRegression```) that supports instance weights and one classifier (```DecisionTreeClassifier```) that doesn't. Since ```DecisionTreeClassifier``` now supports instance weights, we need to change it to a classifier that doesn't have weight support.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test
Closes #27204 from huaxingao/spark-ovr-minor.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add setInputCol/setOutputCol in OHEModel
### Why are the changes needed?
setInputCol/setOutputCol should be in OHEModel too.
### Does this PR introduce any user-facing change?
Yes.
```OHEModel.setInputCol```
```OHEModel.setOutputCol```
### How was this patch tested?
Manually tested.
Closes #27228 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Add a field `storageLevel` in `PeriodicRDDCheckpointer`;
2. For ml.GBT/ml.RF, set storageLevel=`StorageLevel.MEMORY_AND_DISK`.
### Why are the changes needed?
Intermediate RDDs in ML are cached with storageLevel=StorageLevel.MEMORY_AND_DISK.
PeriodicRDDCheckpointer & PeriodicGraphCheckpointer currently store RDDs with storageLevel=StorageLevel.MEMORY_ONLY; it would be nice to make the checkpointer's storage level configurable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27189 from zhengruifeng/checkpointer_storage.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Change `convertToBaggedRDDSamplingWithReplacement` to attach instance weights;
2. Make RF support weights.
### Why are the changes needed?
`weightCol` is already exposed, but RF does not support weights yet.
### Does this PR introduce any user-facing change?
Yes, new setters
### How was this patch tested?
added testsuites
Closes #27097 from zhengruifeng/rf_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix all the failed tests when enable AQE.
### Why are the changes needed?
Run more tests with AQE to catch bugs, and make it easier to enable AQE by default in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests
Closes #26813 from JkSelf/enableAQEDefault.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
add weight support in BisectingKMeans
### Why are the changes needed?
BisectingKMeans should support instance weighting
### Does this PR introduce any user-facing change?
Yes. BisectingKMeans.setWeight
### How was this patch tested?
Unit test
Closes #27035 from huaxingao/spark_30351.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make Regressors extend abstract class Regressor:
```AFTSurvivalRegression extends Estimator => extends Regressor```
```DecisionTreeRegressor extends Predictor => extends Regressor```
```FMRegressor extends Predictor => extends Regressor```
```GBTRegressor extends Predictor => extends Regressor```
```RandomForestRegressor extends Predictor => extends Regressor```
We will not make ```IsotonicRegression``` extend ```Regressor``` because it is tricky to handle both DoubleType and VectorType.
### Why are the changes needed?
Make class hierarchy consistent for all Regressors
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #27168 from huaxingao/spark-30377.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Delete `NodeIdCache` and use `PeriodicRDDCheckpointer` instead;
2. Reuse the broadcasted `Splits` across the whole training.
### Why are the changes needed?
1. The functionality of `NodeIdCache` and `PeriodicRDDCheckpointer` is highly similar, and the update process of nodeIds is simple; one goal of "Generalize PeriodicGraphCheckpointer for RDDs" in SPARK-5561 was to use the checkpointer in RandomForest;
2. We only need to broadcast `Splits` once.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing testsuites
Closes #27145 from zhengruifeng/del_NodeIdCache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. For primitive types, `Array.fill(n)(0)` -> `Array.ofDim(n)`;
2. For `AnyRef` types, `Array.fill(n)(null)` -> `Array.ofDim(n)`;
3. For primitive types, `Array.empty[XXX]` -> `Array.emptyXXXArray`.
### Why are the changes needed?
`Array.ofDim` avoids redundant assignments;
`Array.emptyXXXArray` avoids creating a new object.
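A minimal illustration of both points:
```scala
// Array.fill allocates and then writes 0 into every slot a second time;
// primitive arrays are already zero-initialized by the JVM.
val a = Array.fill(100)(0)
val b = Array.ofDim[Int](100)

// Array.empty allocates a fresh zero-length array on each call;
// Array.emptyIntArray reuses a cached singleton.
val c = Array.empty[Int]
val d = Array.emptyIntArray
```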
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27133 from zhengruifeng/minor_fill_ofDim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make GBT reuse splits/treePoints for all trees:
1. Reuse splits/treePoints for all trees:
The existing impl finds feature splits and transforms input vectors into treePoints for each tree, while other well-known impls like XGBoost/LightGBM build global splits/binned features and reuse them for all trees.
Note: the sampling rate used in the existing impl to build `splits` is not the param `subsamplingRate` but the output of `RandomForest.samplesFractionForFindSplits`, which depends on `maxBins` and `numExamples`.
Note II: The existing impl does not guarantee that splits are the same across iterations, so this may cause a small difference in convergence.
2. Do not cache input vectors:
The existing impl caches the input twice: 1. `input: RDD[Instance]` is used to compute/update predictions and errors; 2. at each iteration, the input is transformed into bagged points, which are cached during that iteration.
In this PR, `input: RDD[Instance]` is no longer cached, since it is only used three times: 1. compute metadata; 2. find splits; 3. convert to treePoints.
Instead, the treePoints `RDD[TreePoint]` is cached; at each iteration it is converted to bagged points by attaching an extra `labelWithCounts: RDD[(Double, Int)]` containing residual/sampleCount information; this rdd is relatively small (like the cached `norms` in KMeans).
To compute/update predictions and errors, new prediction methods based on binned features are added in `Node`.
### Why are the changes needed?
For performance improvement:
1. 40%~50% faster than the existing impl;
2. Saves 30%~50% RAM.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites & several manual tests in REPL
Closes #27103 from zhengruifeng/gbt_reuse_bagged.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
PySpark UDF to convert MLlib vectors to dense arrays.
Example:
```
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df.select(vector_to_array(col("features")))
```
### Why are the changes needed?
If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, an efficient approach is to do that in JVM. However, it requires PySpark user to write Scala code and register it as a UDF. Often this is infeasible for a pure python project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes #26910 from WeichenXu123/vector_to_array.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
1. Fix `BaggedPoint.convertToBaggedRDD` when `subsamplingRate < 1.0`;
2. Reorganize `RandomForest.runWithMetadata` along the way.
### Why are the changes needed?
In GBT, instance weights are discarded if subsamplingRate<1:
1. `baggedPoint: BaggedPoint[TreePoint]` is used in the tree growth to find the best split;
2. `BaggedPoint[TreePoint]` contains two weights:
```scala
class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
```
3. Only the var `sampleWeight` in `BaggedPoint` is used; the var `weight` in `TreePoint` is never used in finding splits;
4. The method `BaggedPoint.convertToBaggedRDD` was changed in https://github.com/apache/spark/pull/21632; it was only for decision trees, so only the following code path was changed:
```
if (numSubsamples == 1 && subsamplingRate == 1.0) {
  convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
}
```
5. In https://github.com/apache/spark/pull/25926, I made GBT support weights, but only tested it with the default `subsamplingRate == 1`.
GBT with `subsamplingRate < 1` converts treePoints to baggedPoints via
```scala
convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
```
in which the original weights from `weightCol` are discarded and every `sampleWeight` is assigned the default 1.0.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
updated testsuites
Closes #27070 from zhengruifeng/gbt_sampling.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
make FMClassifier/Regressor call super class method extractLabeledPoints
### Why are the changes needed?
code reuse
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes #27093 from huaxingao/spark-FM.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
use `.ml.Summarizer` instead of `.mllib.MultivariateOnlineSummarizer` to avoid computation of unused metrics
### Why are the changes needed?
to avoid computation of unused metrics
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27059 from zhengruifeng/pac_summarizer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Check before caching zippedData (as suggested in https://github.com/apache/spark/pull/26483#issuecomment-569702482).
### Why are the changes needed?
If the `data` is already cached before calling the `run` method of `KMeans`, then `zippedData.persist()` will hurt performance. Hence, persist it conditionally.
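A rough sketch of the conditional persist under a hypothetical setup (assuming a SparkContext `sc`; `data`/`zippedData` echo the names above):
```scala
import org.apache.spark.storage.StorageLevel

// `data` is the user-supplied RDD; `zippedData` the internal one.
val data = sc.parallelize(Seq(1.0, 2.0, 3.0))
val zippedData = data.zipWithIndex()

// Only persist the internal RDD when the caller has not cached the input.
if (data.getStorageLevel == StorageLevel.NONE) {
  zippedData.persist(StorageLevel.MEMORY_AND_DISK)
}
```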
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually.
Closes #27052 from amanomer/29823followup.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams```
### Why are the changes needed?
Make ```MultilayerPerceptronClassificationModel``` extend ```MultilayerPerceptronParams``` to expose the training params, so users can see these params when calling ```extractParamMap```.
### Does this PR introduce any user-facing change?
Yes. The ```MultilayerPerceptronParams``` such as ```seed```, ```maxIter``` ... are available in ```MultilayerPerceptronClassificationModel``` now
### How was this patch tested?
Manually tested ```MultilayerPerceptronClassificationModel.extractParamMap()``` to verify all the new params are there.
Closes #26838 from huaxingao/spark-30144.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add new foreach-like methods: foreach/foreachNonZero;
2. Add iterators: iterator/activeIterator/nonZeroIterator.
### Why are the changes needed?
See the [ticket](https://issues.apache.org/jira/browse/SPARK-30329) for details.
foreach/foreachNonZero: for both convenience and performance (SparseVector.foreach should be faster than the current traversal method).
iterator/activeIterator/nonZeroIterator: add the three iterators so that we can later add/change some impls based on them on both the ml and mllib sides, to avoid vector conversions.
### Does this PR introduce any user-facing change?
Yes, new methods are added
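A rough usage sketch of the added methods, assuming they are accessible from the calling scope:
```scala
import org.apache.spark.ml.linalg.Vectors

val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 0.0))
v.foreachNonZero((i, x) => println(s"index $i -> $x")) // skips stored zeros
val active = v.activeIterator.toSeq   // (index, value) of stored entries
val nonZero = v.nonZeroIterator.toSeq // (index, value) of non-zero entries
```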
### How was this patch tested?
added testsuites
Closes #26982 from zhengruifeng/vector_iter.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add instr.logSumOfWeights in the algos that have weightCol support.
### Why are the changes needed?
Many algorithms support weightCol now. The weight sum is useful info to add to the log.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
manually tested
Closes #26972 from huaxingao/spark-30321.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Refactor `RandomForest.findSplits` by applying `aggregateByKey` instead of `groupByKey`
### Why are the changes needed?
The current impl of `RandomForest.findSplits` uses `groupByKey` to collect the non-zero values for each feature, which is quite dangerous.
After looking into the subsequent split-finding logic, I found that collecting all non-zero values is not necessary; we only need the weightSums of distinct values.
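A hedged sketch of the idea with hypothetical data, assuming a SparkContext `sc`:
```scala
// ((featureIndex, featureValue), weight) pairs for the non-zero entries.
val featureValues = sc.parallelize(Seq(
  ((0, 1.5), 1.0), ((0, 1.5), 2.0), ((0, 2.5), 1.0)))

// groupByKey would materialize every raw value of a feature on one executor;
// aggregateByKey folds weights into one sum per distinct value instead.
val weightSums = featureValues.aggregateByKey(0.0)(_ + _, _ + _)
```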
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27040 from zhengruifeng/rf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Expose predictRaw and predictProbability;
2. Add tests.
### Why are the changes needed?
Single-instance prediction is useful outside of Spark, especially for online prediction.
The current `predict` is exposed, but it is not enough.
### Does this PR introduce any user-facing change?
Yes, new methods are exposed
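A hedged single-instance usage sketch, where `model` stands for a fitted probabilistic classification model:
```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

val features = Vectors.dense(0.5, 1.2)
val raw: Vector = model.predictRaw(features)          // raw scores per class
val prob: Vector = model.predictProbability(features) // normalized probabilities
```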
### How was this patch tested?
added testsuites
Closes #27015 from zhengruifeng/expose_raw_prob.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `MetadataUtils.getNumFeatures` to extract numFeatures.
### Why are the changes needed?
This may avoid a `first`/`head` job if the metadata has an AttributeGroup.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27037 from zhengruifeng/unify_numFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Compute count and quantile in one pass.
### Why are the changes needed?
To avoid an extra pass.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26990 from zhengruifeng/quantile_count_lsh.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Remove unused imports and variables;
2. Remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count;
3. Mark `clusterCentersWithNorm` in KMeansModel transient and lazy, since it is only used in transformation and can be generated directly from the centers.
### Why are the changes needed?
1. Remove unused code;
2. Avoid repeated computation;
3. Reduce the broadcast size.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27014 from zhengruifeng/kmeans_clean_up.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
precompute the `DecisionTreeMetadata` and reuse it for all trees
### Why are the changes needed?
In the existing impl, each `DecisionTreeRegressor` needs a pass over the whole dataset to calculate the same `DecisionTreeMetadata` repeatedly.
In this PR, with the default depth=5, it is about 8% faster than the existing impl.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27011 from zhengruifeng/gbt_reuse_instr_meta.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added test suites
Closes #26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
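For reference, the second-order FM model from the paper above predicts:
```
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```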
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Make LibSVMDataSource attach an AttributeGroup.
### Why are the changes needed?
LibSVMDataSource attaches special metadata to indicate numFeatures:
```scala
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
```
However, all ML impls try to obtain the vector size via AttributeGroup, which cannot use this metadata:
```scala
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._
scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
added tests
Closes #27003 from zhengruifeng/libsvm_attr_group.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
compute the medians/ranges more distributedly
### Why are the changes needed?
It is a bottleneck to collect the whole Array[QuantileSummaries] from the executors,
since a QuantileSummaries is a large object, maintaining arrays of large sizes: 10k (`defaultCompressThreshold`) / 50k (`defaultHeadSize`).
In spark-shell with default params, I processed a dataset with numFeatures=69,200, and the existing impl fails due to OOM.
After this PR, it successfully fits the model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26803 from zhengruifeng/robust_high_dim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Modify the toString in SQLTransformer & VectorSizeHint;
2. Add toString in RegexTokenizer.
### Why are the changes needed?
In SQLTransformer & VectorSizeHint, the `toString` methods directly call getters of params that have no default values.
This causes a `java.util.NoSuchElementException` in the REPL:
```scala
scala> val vs = new VectorSizeHint()
java.util.NoSuchElementException: Failed to find a default value for size
at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:780)
at scala.Option.getOrElse(Option.scala:189)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26999 from zhengruifeng/fix_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as a ml-pipeline component
1. loss function supports: logloss, mse
2. optimizer: GD, adamW
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for estimating CTR.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #26124 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fixed typos in the `docs` directory and in other directories:
1. Found typos in `docs` and applied fixes to files in all directories;
2. Fixed `the the` -> `the`.
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes #26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79
2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1
### Why are the changes needed?
Shouldn't change master.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master
Closes #26915 from wangyum/revert-master.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
- Replace `Seq[String]` by `Seq[_]` in `StopWordsRemoverSuite` because the `String` type is unchecked due to erasure.
- Throw an exception for default case in `MLTest.checkNominalOnDF` because we don't expect other attribute types currently.
- Explicitly cast float to double in `BigDecimal(y)`. This is what the `apply()` method does for `float`s.
- Replace deprecated `verifyZeroInteractions` by `verifyNoInteractions`.
- Equivalent replacement of `\0` by `\u0000` in `CSVExprUtilsSuite`
- Import `scala.language.implicitConversions` in `CollectionExpressionsSuite`, `HashExpressionsSuite` and in `ExpressionParserSuite`.
### Why are the changes needed?
The changes fix compiler warnings showed in the JIRA ticket https://issues.apache.org/jira/browse/SPARK-30170 . Eliminating the warning highlights other warnings which could take more attention to real problems.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `StopWordsRemoverSuite`, `AnalysisExternalCatalogSuite`, `CSVExprUtilsSuite`, `CollectionExpressionsSuite`, `HashExpressionsSuite`, `ExpressionParserSuite` and sub-tests of `MLTest`.
Closes #26799 from MaxGekk/eliminate-warning-2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues.
### Why are the changes needed?
To cross-compile with Scala 2.13 later.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately.
Closes #26826 from srowen/SPARK-30195.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add weight support in KMeans
### Why are the changes needed?
KMeans should support weighting
### Does this PR introduce any user-facing change?
Yes. ```KMeans.setWeightCol```
### How was this patch tested?
Unit Tests
Closes #26739 from huaxingao/spark-29967.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.
### Why are the changes needed?
These both enable building on Scala 2.13.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes #26787 from srowen/SPARK-30158.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds normalization to word vectors when fitting dataset in Word2Vec.
### Why are the changes needed?
Running Word2Vec on some datasets, when numIterations is large, can produce infinity word vectors.
### Does this PR introduce any user-facing change?
Yes. After this patch, Word2Vec won't produce infinity word vectors.
### How was this patch tested?
Manually. This issue is not always reproducible on any dataset. The dataset known to reproduce it is too large (925M) to upload.
```scala
case class Sentences(name: String, words: Array[String])
val dataset = spark.read
.option("header", "true").option("sep", "\t")
.option("quote", "").option("nullValue", "\\N")
.csv("/tmp/title.akas.tsv")
.filter("region = 'US' or language = 'en'")
.select("title")
.as[String]
.map(s => Sentences(s, s.split(' ')))
.persist()
println("Training model...")
val word2Vec = new Word2Vec()
.setInputCol("words")
.setOutputCol("vector")
.setVectorSize(64)
.setWindowSize(4)
.setNumPartitions(50)
.setMinCount(5)
.setMaxIter(30)
val model = word2Vec.fit(dataset)
model.getVectors.show()
```
Before:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-Infinity,-Infin...|
| Talent|[-Infinity,Infini...|
| Hourglass|[2.02805806500023...|
|Nickelodeon's|[-4.2918617120906...|
| Priests|[-1.3570403355926...|
| Religion:|[-6.7049072282803...|
| Bu|[5.05591774315586...|
| Totoro:|[-1.0539840178632...|
| Trouble,|[-3.5363592836003...|
| Hatter|[4.90413981352826...|
| '79|[7.50436471285412...|
| Vile|[-2.9147142985312...|
| 9/11|[-Infinity,Infini...|
| Santino|[1.30005911270850...|
| Motives|[-1.2538958306253...|
| '13|[-4.5040152427657...|
| Fierce|[Infinity,Infinit...|
| Stover|[-2.6326895394029...|
| 'It|[1.66574533864436...|
| Butts|[Infinity,Infinit...|
+-------------+--------------------+
only showing top 20 rows
```
After:
```
Training model...
+-------------+--------------------+
| word| vector|
+-------------+--------------------+
| Unspoken|[-0.0454501919448...|
| Talent|[-0.2657704949378...|
| Hourglass|[-0.1399687677621...|
|Nickelodeon's|[-0.1767119318246...|
| Priests|[-0.0047509293071...|
| Religion:|[-0.0411605164408...|
| Bu|[0.11837736517190...|
| Totoro:|[0.05258282646536...|
| Trouble,|[0.09482011198997...|
| Hatter|[0.06040831282734...|
| '79|[0.04783720895648...|
| Vile|[-0.0017210749210...|
| 9/11|[-0.0713915303349...|
| Santino|[-0.0412711687386...|
| Motives|[-0.0492418706417...|
| '13|[-0.0073119504377...|
| Fierce|[-0.0565455369651...|
| Stover|[0.06938160210847...|
| 'It|[0.01117012929171...|
| Butts|[0.05374567210674...|
+-------------+--------------------+
only showing top 20 rows
```
Closes #26722 from viirya/SPARK-24666-2.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Removed unnecessary persist.
### Why are the changes needed?
The persist in `PythonMLLibAPI.scala` is unnecessary because `run()` of `gmmAlg` later caches the data.
710ddab39e/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (L167-L171)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes #26758 from amanomer/improperPersist.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.
### Why are the changes needed?
Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.
The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.
While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.
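A small before/after sketch, assuming `spark.implicits._` is in scope and `df` is a DataFrame with a `foo` column:
```scala
// Before (generates a deprecation warning under Scala 2.13):
df.select('foo).as('bar)
// After:
df.select($"foo").as("bar") // Column interpolator, or a plain String
```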
### Does this PR introduce any user-facing change?
Should not change behavior.
### How was this patch tested?
Existing tests.
Closes #26748 from srowen/SPARK-29392.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. `predictionCol` in `ml.classification` & `ml.clustering`: add `NominalAttribute`;
2. `rawPredictionCol` in `ml.classification`: add `AttributeGroup` containing vectorsize=`numClasses`;
3. `probabilityCol` in `ml.classification` & `ml.clustering`: add `AttributeGroup` containing vectorsize=`numClasses`/`k`;
4. `leafCol` in GBT/RF: add `AttributeGroup` containing vectorsize=`numTrees`;
5. `leafCol` in DecisionTree: add `NominalAttribute`;
6. `outputCol` in models in `ml.feature`: add `AttributeGroup` containing vectorsize;
7. `outputCol` in `UnaryTransformer`s in `ml.feature`: add `AttributeGroup` containing vectorsize.
### Why are the changes needed?
Appended metadata can be used in downstream ops, like `Classifier.getNumClasses`.
There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` method.
However, there are also many impls that return no metadata in transformation, even though metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred.
### Does this PR introduce any user-facing change?
Yes, add some metadatas in transformed dataset.
### How was this patch tested?
existing testsuites and added testsuites
Closes #26547 from zhengruifeng/add_output_vecSize.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When PCA was first implemented in [SPARK-5521](https://issues.apache.org/jira/browse/SPARK-5521), Matrix.multiply (BLAS.gemv internally) did not support sparse vectors at that time, so this was worked around by applying a sparse matrix multiplication.
Since [SPARK-7681](https://issues.apache.org/jira/browse/SPARK-7681), BLAS.gemv has supported sparse vectors, so we can use Matrix.multiply directly now.
### Why are the changes needed?
For simplicity.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26745 from zhengruifeng/pca_mul.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MNB/CNB/BNB use empty sigma matrix instead of null
### Why are the changes needed?
1. Using an empty sigma matrix simplifies the impl;
2. I am reviewing the FM impl these days; FMModels have optional bias and linear parts. It seems more reasonable to set an optional part to an empty vector/matrix or a zero value than to `null`.
### Does this PR introduce any user-facing change?
yes, sigma from `null` to empty matrix
### How was this patch tested?
updated testsuites
Closes #26679 from zhengruifeng/nb_use_empty_sigma.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Summarizer support more metrics: sum, std
### Why are the changes needed?
Those metrics are widely used; it is convenient to obtain them directly rather than via a conversion.
In `NaiveBayes`: we want the sum of vectors; currently mean & weightSum need to be computed and then multiplied.
In `StandardScaler`/`AFTSurvivalRegression`/`LinearRegression`/`LinearSVC`/`LogisticRegression`: we need to obtain `variance` and then take its sqrt to get the std.
### Does this PR introduce any user-facing change?
yes, new metrics are exposed to end users
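A hedged usage sketch of the new metrics, assuming a DataFrame `df` with a vector column `features`:
```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

val agg = df.select(
  Summarizer.metrics("sum", "std").summary(col("features")).alias("summary"))
```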
### How was this patch tested?
added testsuites
Closes #26596 from zhengruifeng/summarizer_add_metrics.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
MulticlassClassificationEvaluator support hammingLoss
### Why are the changes needed?
1. It is easy to compute hammingLoss based on the confusion matrix;
2. scikit-learn supports it.
### Does this PR introduce any user-facing change?
yes
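A hedged usage sketch of the new metric, where `predictions` stands for the output of a fitted model:
```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("hammingLoss")
val loss = evaluator.evaluate(predictions)
```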
### How was this patch tested?
added testsuites
Closes #26597 from zhengruifeng/multi_class_hamming_loss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Impl Complement Naive Bayes Classifier as a `modelType` option in `NaiveBayes`
### Why are the changes needed?
1. It is a better choice for text classification: [scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes) says that 'CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.'
2. CNB is highly similar to the existing MNB; only a small part of the existing MNB needs to change, so it is an easy win to support CNB.
### Does this PR introduce any user-facing change?
yes, a new `modelType` is supported
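A hedged usage sketch of the new option:
```scala
import org.apache.spark.ml.classification.NaiveBayes

val cnb = new NaiveBayes().setModelType("complement")
```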
### How was this patch tested?
added testsuites
Closes #26575 from zhengruifeng/cnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
A follow-up to rm useless test in VectorUDTSuite
### Why are the changes needed?
Remove a useless test, which is already covered.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes #26620 from yaooqinn/SPARK-29961-f.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a `typeof` function for Spark to get the underlying type of a value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int
-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)
-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>
-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)
```
##### presto
```sql
presto> select typeof(array[1]);
_col0
----------------
array(integer)
(1 row)
```
##### PostgreSQL
```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
pg_typeof
-----------
numeric
numeric
numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597
### Why are the changes needed?
A function like this is good to have, to help us debug, test, develop, ...
### Does this PR introduce any user-facing change?
add a new function
### How was this patch tested?
add ut and example
Closes #26599 from yaooqinn/SPARK-29961.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use JUnit assertions in tests uniformly, not JVM assert() statements.
### Why are the changes needed?
assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
### Does this PR introduce any user-facing change?
No. The assertion logic should be identical.
### How was this patch tested?
Existing tests.
Closes #26581 from srowen/assertToJUnit.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
```LSHModel.approxNearestNeighbors``` sorts the full dataset on the hashDistance in order to find a threshold. This PR uses approxQuantile instead.
### Why are the changes needed?
To improve performance.
### Does this PR introduce any user-facing change?
Yes.
Changed ```LSH``` to make it extend ```HasRelativeError```
```LSH``` and ```LSHModel``` have new APIs ```setRelativeError/getRelativeError```
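A hedged usage sketch of the new param on one of the LSH estimators:
```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setRelativeError(0.001) // precision of the approxQuantile threshold search
```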
### How was this patch tested?
Existing tests. Also added a couple doc test in python to test newly added ```getRelativeError```
Closes #26415 from huaxingao/spark-18409.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
support `modelType` `gaussian`
### Why are the changes needed?
The current modelTypes do not support continuous data.
### Does this PR introduce any user-facing change?
yes, add a `modelType` option
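A hedged usage sketch of the new option:
```scala
import org.apache.spark.ml.classification.NaiveBayes

val gnb = new NaiveBayes().setModelType("gaussian")
```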
### How was this patch tested?
existing testsuites and added ones
Closes #26413 from zhengruifeng/gnb.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add multi-cols support in StopWordsRemover
### Why are the changes needed?
As a basic Transformer, StopWordsRemover should support multi-cols.
Param stopWords can be applied across all columns.
### Does this PR introduce any user-facing change?
```StopWordsRemover.setInputCols```
```StopWordsRemover.setOutputCols```
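A hypothetical usage sketch (column names are assumptions):
```scala
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCols(Array("raw1", "raw2"))
  .setOutputCols(Array("filtered1", "filtered2"))
// the single stopWords param applies across both input columns
```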
### How was this patch tested?
Unit tests
Closes#26480 from huaxingao/spark-29808.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust which RDD is persisted.
### Why are the changes needed?
To fix an improper persist strategy.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26483 from amanomer/SPARK-29823.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Adjust improper RDD unpersist timing.
### Why are the changes needed?
Improper unpersist timing results in wasted memory.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26469 from Icysandwich/SPARK-29844.
Authored-by: DongWang <cqwd123@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1. ML models should extend the toString method to expose basic information. Currently some algorithms (GBT/RF/LoR) have done this, while others have not yet.
2. Add `val numFeatures` in `BisectingKMeansModel`/`GaussianMixtureModel`/`KMeansModel`/`AFTSurvivalRegressionModel`/`IsotonicRegressionModel`.
### Why are the changes needed?
ML models should extend the toString method to expose basic information.
### Does this PR introduce any user-facing change?
yes
### How was this patch tested?
existing testsuites
Closes#26439 from zhengruifeng/models_toString.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Unpersist the intermediate RDD `wordCounts`.
2. If the `dataset` is already persisted, we do not need to persist the RDD `input`.
3. If `minDF` and `maxDF` are both greater than or equal to 1, or both less than 1, we can compare and check them up front.
### Why are the changes needed?
We should unpersist unused RDDs as soon as possible.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#26398 from zhengruifeng/CountVectorizer_unpersist_wordCounts.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
use `ml.Summarizer` instead of `mllib.MultivariateOnlineSummarizer`
### Why are the changes needed?
1. I found that using `ml.Summarizer` is faster than the current implementation;
2. `mllib.MultivariateOnlineSummarizer` maintains all arrays, while `ml.Summarizer` maintains only the necessary arrays;
3. Using `ml.Summarizer` avoids vector conversions to `mllib.Vector` (see the sketch after this list).
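A hypothetical sketch of the pattern (`df` and the column name are assumptions):
```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// ml.Summarizer computes only the requested metric, in a single pass
val Row(maxVec: Vector) = df.select(Summarizer.max(col("features"))).first()
```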
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#26393 from zhengruifeng/maxabs_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR implements ```validateInputType``` in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
### Why are the changes needed?
```UnaryTransformer``` has the method ```validateInputType``` and calls it in ```transformSchema```, but this method is not implemented in ```ElementwiseProduct```, ```Normalizer``` and ```PolynomialExpansion```.
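A minimal sketch of such an override (not the patch's exact code; the message text is an assumption):
```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.DataType

override protected def validateInputType(inputType: DataType): Unit = {
  require(inputType == SQLDataTypes.VectorType,
    s"Input type must be VectorType but got ${inputType.catalogString}.")
}
```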
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#26388 from huaxingao/spark-29746.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1. Change the scope of `ml.SummarizerBuffer` and add a method `createSummarizerBuffer` for it, so it can be used as an aggregator like `MultivariateOnlineSummarizer`;
2. In LoR/AFT/LiR/SVC, use `Summarizer` instead of `MultivariateOnlineSummarizer`.
### Why are the changes needed?
The computation of the summary before the learning iterations is a bottleneck in high-dimension cases, since `MultivariateOnlineSummarizer` computes much more than is needed.
The [ticket](https://issues.apache.org/jira/browse/SPARK-29754) has an example: with `--driver-memory=4G`, LoR will always fail on the KDDA dataset. If we switch to `ml.Summarizer`, then `--driver-memory=3G` is enough to train a model.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites & manual test in REPL
Closes#26396 from zhengruifeng/using_SummarizerBuffer.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Expose the expert param `aggregationDepth` in GMM/GLR.
### Why are the changes needed?
SVC/LoR/LiR/AFT already expose the expert param aggregationDepth to end users. It would be nice to expose it in similar algorithms.
### Does this PR introduce any user-facing change?
Yes, this exposes a new param (usage sketched below).
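A hypothetical usage sketch (values are illustrative):
```scala
import org.apache.spark.ml.clustering.GaussianMixture

// Deeper treeAggregate for high-dimensional data
val gmm = new GaussianMixture().setK(3).setAggregationDepth(3)
```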
### How was this patch tested?
Added Python tests.
Closes#26322 from zhengruifeng/agg_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.
### Why are the changes needed?
A wrong order of assert parameters is confusing when the assert fails and the parameters have a special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
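The fixed call from the example above (expressed in Scala against the same JUnit API) puts the expected value first:
```scala
// Expected first, so the failure output labels the two sides correctly
assertEquals(new CalendarInterval(5, 5, 367200000000L), input1.add(input2))
```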
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes#26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
persist the input if needed
### Why are the changes needed?
Training with a non-cached dataset hurts performance (see the sketch below).
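The usual Spark ML idiom for this, as a sketch (`instances` is the RDD built from the input `dataset`):
```scala
import org.apache.spark.storage.StorageLevel

// Persist only when the caller has not already cached the input
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
// ... run the training iterations over `instances` ...
if (handlePersistence) instances.unpersist()
```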
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#26344 from zhengruifeng/linear_svc_cache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Add a shared param `relativeError`.
2. `Imputer`/`RobustScaler`/`QuantileDiscretizer` extend `HasRelativeError`.
### Why are the changes needed?
It makes sense to expose RelativeError to end users, since it controls both the precision and memory overhead.
`QuantileDiscretizer` had already added this param, while other algorithms have not yet.
### Does this PR introduce any user-facing change?
Yes, the new param is added in `Imputer`/`RobustScaler`.
### How was this patch tested?
existing testsuites
Closes#26305 from zhengruifeng/add_relative_err.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
### What changes were proposed in this pull request?
Add single-column input/output support in OneHotEncoder.
### Why are the changes needed?
Currently, OneHotEncoder only has multi-column support. It makes sense to support a single column as well.
### Does this PR introduce any user-facing change?
Yes
```OneHotEncoder.setInputCol```
```OneHotEncoder.setOutputCol```
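A hypothetical usage sketch (column names are assumptions):
```scala
import org.apache.spark.ml.feature.OneHotEncoder

val encoder = new OneHotEncoder()
  .setInputCol("category")
  .setOutputCol("categoryVec")
```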
### How was this patch tested?
Unit test
Closes#26265 from huaxingao/spark-29565.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
add single-column input/output support in Imputer
### Why are the changes needed?
Currently, Imputer only has multi-column support. This PR adds single-column input/output support.
### Does this PR introduce any user-facing change?
Yes. add single-column input/output support in Imputer
```Imputer.setInputCol```
```Imputer.setOutputCol```
### How was this patch tested?
add unit tests
Closes#26247 from huaxingao/spark-29566.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Remove automatically generated param setters in _shared_params_code_gen.py
### Why are the changes needed?
To keep parity between scala and python
### Does this PR introduce any user-facing change?
Yes
Add some setters in Python ML XXXModels
### How was this patch tested?
unit tests
Closes#26232 from huaxingao/spark-29093.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add weight support for GBTs by sampling data before passing it to trees and then passing weights to the trees.
In summary:
1. Add setters for `minWeightFractionPerNode` & `weightCol` (usage sketched after this list).
2. Update input types in private methods from `RDD[LabeledPoint]` to `RDD[Instance]`:
`DecisionTreeRegressor.train`, `GradientBoostedTrees.run`, `GradientBoostedTrees.runWithValidation`, `GradientBoostedTrees.computeInitialPredictionAndError`, `GradientBoostedTrees.computeError`,
`GradientBoostedTrees.evaluateEachIteration`, `GradientBoostedTrees.boost`, `GradientBoostedTrees.updatePredictionError`
3. Add a new private method `GradientBoostedTrees.computeError(data, predError)` to compute the average error, since the original `predError.values.mean()` does not take weights into account.
4. Add new tests.
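A hypothetical usage sketch of the new setters (the weight column name is an assumption):
```scala
import org.apache.spark.ml.classification.GBTClassifier

val gbt = new GBTClassifier()
  .setWeightCol("weight")
  .setMinWeightFractionPerNode(0.05)
```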
### Why are the changes needed?
GBTs should support sample weights like other algorithms.
### Does this PR introduce any user-facing change?
yes, new setters are added
### How was this patch tested?
existing & added testsuites
Closes#25926 from zhengruifeng/gbt_add_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` only contains the default parameter value. Need to update the parameter maps for these trees.
Same issues in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```
### Why are the changes needed?
Users want to access each individual tree and rebuild the forest from them. This doesn't work because the trees don't carry the correct parameter values.
### Does this PR introduce any user-facing change?
Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values.
### How was this patch tested?
Add tests
Closes#26154 from huaxingao/spark-29232.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss
### Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
### Does this PR introduce any user-facing change?
Yes, this adds a new option ("logloss") and a related param `eps` (usage sketched below).
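A hypothetical usage sketch (the option string follows this PR's description):
```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("logloss")  // option name as given in this PR's description
```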
### How was this patch tested?
Added testsuites and local tests referring to sklearn.
Closes#26135 from zhengruifeng/logloss.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Binarizer supports multiple columns by extending `HasInputCols`/`HasOutputCols`/`HasThreshold`/`HasThresholds`.
### Why are the changes needed?
Similar algorithms in `ml.feature`, like `Bucketizer`/`StringIndexer`/`QuantileDiscretizer`, already support multiple columns.
### Does this PR introduce any user-facing change?
Yes, this adds setters/getters for `thresholds`/`inputCols`/`outputCols` (usage sketched below).
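A hypothetical usage sketch with per-column thresholds (column names and values are assumptions):
```scala
import org.apache.spark.ml.feature.Binarizer

val binarizer = new Binarizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("bin1", "bin2"))
  .setThresholds(Array(0.5, 10.0))
```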
### How was this patch tested?
added suites
Closes#26064 from zhengruifeng/binarizer_multicols.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
get the first row lazily, and reuse it for each vector column.
### Why are the changes needed?
Avoid unnecessary `first` jobs.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites & local tests in repl
Closes#26052 from zhengruifeng/rformula_lazy_row.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
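For illustration, a hypothetical comparison of two non-nested arrays:
```scala
val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Before (removed in Scala 2.13): assert(a.deep == b.deep)
// After, equivalent for non-nested arrays:
assert(a.sameElements(b))
```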
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
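A hypothetical before/after illustrating the change:
```scala
// Before: the body referred to the `Unit` companion object,
// which Scala 2.13 no longer accepts here:
// def callback(): Unit = Unit

// After: use the unit value instead
def callback(): Unit = ()
```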
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
(x: Unit,xs: Unit*)Array[Unit] <and>
(x: Double,xs: Double*)Array[Double] <and>
(x: Float,xs: Float*)Array[Float] <and>
(x: Long,xs: Long*)Array[Long] <and>
(x: Int,xs: Int*)Array[Int] <and>
(x: Char,xs: Char*)Array[Char] <and>
(x: Short,xs: Short*)Array[Short] <and>
(x: Byte,xs: Byte*)Array[Byte] <and>
(x: Boolean,xs: Boolean*)Array[Boolean]
cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
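Concretely, per the example above (`sc` is a SparkContext):
```scala
// Before (fails to compile on 2.13): sc.parallelize(Array((1, 2), (3, 4)))
// After, effectively equivalent:
val rdd = sc.parallelize(Seq((1, 2), (3, 4)))
```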
### Why are the changes needed?
To better cross-build for 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26062 from srowen/SPARK-29401.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.
### Why are the changes needed?
To keep parity between python and scala.
### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```
### How was this patch tested?
add doctest
Closes#25947 from huaxingao/spark-29269.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
- Removal of `private[ml]` modifier from `Regressor`.
- Marking `Regressor` as `DeveloperApi`.
### Why are the changes needed?
Consistency with the rest of ML API as described in [the corresponding JIRA ticket](https://issues.apache.org/jira/browse/SPARK-29363).
### Does this PR introduce any user-facing change?
Yes, as described above.
### How was this patch tested?
Existing tests.
Closes#26033 from zero323/SPARK-29363.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`
### Why are the changes needed?
According to the Jenkins logs, there are 118 warnings about this.
```
grep "is deprecated" ~/consoleText | grep scalatest | wc -l
118
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
After Jenkins passes, we need to check the Jenkins log.
Closes#25982 from dongjoon-hyun/SPARK-29307.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 emits a deprecation warning for procedure-like declarations:
```
def foo() {
...
```
This is equivalent to the following, so should be changed to avoid a warning:
```
def foo(): Unit = {
...
```
### Why are the changes needed?
It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 because back-ports from 3.0 to 2.4 are less likely than from 3.1 to 3.0, for example, which minimizes the downside of touching so many files.
Unfortunately, that makes this quite a big change.
### Does this PR introduce any user-facing change?
No behavior change at all.
### How was this patch tested?
Existing tests.
Closes#25968 from srowen/SPARK-29291.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR regenerate the benchmark results in `core` and `mllib` module in order to compare JDK8/JDK11 result.
### Why are the changes needed?
According to the results, for `PropertiesCloneBenchmark` and `UDTSerializationBenchmark`, JDK11 is slightly faster. In general, there is no regression in JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. Manually run the benchmark.
Closes#25969 from dongjoon-hyun/SPARK-29297.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`.
2. Expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`.
3. Add the metric `explainedVariance` in `RegressionEvaluator` (see the sketch after this list).
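A hypothetical usage sketch of the newly exposed expert params (values are illustrative):
```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}

val binEval = new BinaryClassificationEvaluator().setNumBins(1000)
val regEval = new RegressionEvaluator().setThroughOrigin(true)
```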
### Why are the changes needed?
Existing functionality in mllib.metrics should also be exposed in ml.
### Does this PR introduce any user-facing change?
Yes, this PR adds two expert params and one metric option.
### How was this patch tested?
existing and added tests
Closes#25940 from zhengruifeng/evaluator_add_param.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add common methods to extract instances with weights (sketched below).
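A rough sketch of what such a helper could look like (simplified; the name `extractInstances` and the surrounding `Params` context are assumptions based on the description):
```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}

protected def extractInstances(dataset: Dataset[_]): RDD[Instance] = {
  // Fall back to a weight of 1.0 when no weight column is set
  val w = if (isDefined(weightCol) && $(weightCol).nonEmpty) col($(weightCol)) else lit(1.0)
  dataset.select(col($(labelCol)), w.cast("double"), col($(featuresCol))).rdd.map {
    case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
  }
}
```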
### Why are the changes needed?
Today more and more ML algorithms support weighting; adding this method will simplify their implementations.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#25802 from zhengruifeng/add_extractInstances.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Support for dot product with:
- `ml.linalg.Vector`
- `ml.linalg.Vectors`
- `mllib.linalg.Vector`
- `mllib.linalg.Vectors`
### Why are the changes needed?
Dot product is useful for feature engineering and scoring. BLAS routines are already there; just a wrapper is needed (usage sketched below).
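A hypothetical usage sketch (the method name `dot` is an assumption based on the description):
```scala
import org.apache.spark.ml.linalg.Vectors

val d = Vectors.dense(1.0, 2.0, 3.0).dot(Vectors.dense(4.0, 5.0, 6.0))
// d == 32.0
```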
### Does this PR introduce any user-facing change?
No user facing changes, just some new functionality.
### How was this patch tested?
Tests were written and added to the appropriate `VectorSuites` classes. They can be quickly run with:
```
sbt "mllib-local/testOnly org.apache.spark.ml.linalg.VectorsSuite"
sbt "mllib/testOnly org.apache.spark.mllib.linalg.VectorsSuite"
```
Closes#25818 from phpisciuneri/SPARK-29121.
Authored-by: Patrick Pisciuneri <phpisciuneri@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update breeze dependency to 1.0.
### Why are the changes needed?
Breeze 1.0 supports Scala 2.13 and has a few bug fixes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25874 from srowen/SPARK-28772.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
If threshold < 0, convert implicit 0s to 1, although this will break sparsity.
### Why are the changes needed?
If `threshold < 0`, the current implementation handles sparse vectors incorrectly.
See JIRA [SPARK-29144](https://issues.apache.org/jira/browse/SPARK-29144) and [Scikit-Learn's Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) ('Threshold may not be less than 0 for operations on sparse matrices.') for details.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
added testsuite
Closes#25829 from zhengruifeng/binarizer_throw_exception_sparse_vector.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
1. GMM: obtain the prediction (double) from its probability prediction (vector);
2. GLR: obtain the prediction (double) from its link prediction (double).
### Why are the changes needed?
It avoids predicting twice (see the sketch below).
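A sketch of the GMM case (an assumption, not the patch's exact code):
```scala
import org.apache.spark.ml.linalg.Vector

// The class prediction is just the argmax of the probability vector
// that transform() has already computed, so no second pass is needed
def probabilityToPrediction(probability: Vector): Double = probability.argmax.toDouble
```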
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25815 from zhengruifeng/gmm_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Fitting an ALS model can fail due to nondeterministic input data. Currently the failure surfaces as an ArrayIndexOutOfBoundsException, which does not explain to end users what went wrong in fitting.
This patch catches this exception and rethrows a more explainable one, when the input data is nondeterministic.
Because we may not exactly know the output deterministic level of RDDs produced by user code, this patch also adds a note to Scala/Python/R ALS document about the training data deterministic level.
### Why are the changes needed?
An ArrayIndexOutOfBoundsException was observed while fitting an ALS model. It was caused by a mismatch between in/out user/item blocks when computing ratings.
If the training RDD's output is nondeterministic, then when a fetch failure happens, rerunning part of the training RDD can produce inconsistent user/item blocks.
This patch is needed to notify users about ALS fitting on nondeterministic input.
### Does this PR introduce any user-facing change?
Yes. When fitting ALS model on nondeterministic input data, previously if rerun happens, users would see ArrayIndexOutOfBoundsException caused by mismatch between In/Out user/item blocks.
After this patch, a SparkException with a clearer message will be thrown, wrapping the original ArrayIndexOutOfBoundsException.
### How was this patch tested?
Tested on development cluster.
Closes#25789 from viirya/als-indeterminate-input.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This removes the duplicated dependency which was added by [SPARK-29007](b62ef8f793/mllib/pom.xml (L58-L64)).
### Why are the changes needed?
Maven complains about this kind of duplication. We had better be safe with future Maven versions.
```
$ cd mllib
$ mvn clean package -DskipTests
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-mllib_2.12:jar:3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.spark:spark-streaming_${scala.binary.version}:test-jar -> duplicate declaration of version ${project.version} line 119, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
...
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual check since this is a warning.
```
$ cd mllib
$ mvn clean package -DskipTests
```
Closes#25783 from dongjoon-hyun/SPARK-29007.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch hardens tests to prevent leaking a newly created SparkContext that is created while initializing a StreamingContext. A leaked SparkContext in a test would make most of the following tests fail as well, so this patch applies defensive programming, trying its best to ensure the SparkContext is cleaned up.
### Why are the changes needed?
We got some cases in CI builds where a SparkContext was leaked and other tests were affected by it. Ideally we should isolate the environment among tests if possible.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Modified UTs.
Closes#25709 from HeartSaVioR/SPARK-29007.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
SPARK-22799 added more comprehensive error logic for Bucketizer. This PR updates QuantileDiscretizer to match the new error logic in Bucketizer.
## How was this patch tested?
Add new unit test.
Closes#20442 from huaxingao/spark-23265.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc
Notes:
- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.
### Why are the changes needed?
Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.
### Does this PR introduce any user-facing change?
Yes, in that deprecated items are removed from some public APIs.
### How was this patch tested?
Existing tests.
Closes#25684 from srowen/SPARK-28980.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Add HasNumFeatures on the Scala side, with `1 << 18` as the default value.
### Why are the changes needed?
HasNumFeatures is already added on the Python side; it is reasonable to keep them in sync.
I didn't find other similar places.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing testsuites
Closes#25671 from zhengruifeng/add_HasNumFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When initializing factors in ALS, we should use `mapPartitions` instead of the current `map`, so we can preserve the existing partitioning of the `InBlock` RDD. The `InBlock` RDD is already partitioned by src block id, and we don't change the partitioning when initializing factors.
### Why are the changes needed?
This patch can avoid an unnecessary shuffle after initializing the factors (see the sketch below).
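A simplified sketch of the idea (`initFactorBlock` is a hypothetical helper):
```scala
// Generate factors per partition and declare the partitioning unchanged,
// so the partitioner of the InBlock RDD is preserved and no shuffle follows
val factors = inBlocks.mapPartitions({ iter =>
  iter.map { case (srcBlockId, inBlock) =>
    (srcBlockId, initFactorBlock(inBlock))
  }
}, preservesPartitioning = true)
```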
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It should not change existing tests. It should pass added test that verifies shuffle dependency of factor RDDs.
Closes#25639 from viirya/fix-als-partition.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
The Experimental and Evolving annotations are both (like Unstable) used to express that an API may change. However, there are many things in the code that have been marked that way since as far back as Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations, and remove the `:: Experimental ::` scaladoc tag too, and likewise for Python and R.
The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)
It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.
### Why are the changes needed?
Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25558 from srowen/SPARK-28855.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
In the ALS ML implementation, for the non-implicit case, we checkpoint the RDD of item factors at intervals. Before checkpointing (.checkpoint()) and materializing (.count()) this RDD, it was not persisted, which causes recomputation. In an experiment, there is a performance difference between persisting and not persisting before checkpointing the RDD.
The performance difference is not big, but this change is not big either. The actual performance difference varies depending on the checkpoint interval, the training dataset, etc.
### Why are the changes needed?
Persisting the RDD before checkpointing the RDD of item factors can avoid recomputation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual check RDD recomputation or not.
Taking 30% MovieLens 20M Dataset as training dataset. Setting checkpoint dir for SparkContext. Fitting an ALS model like:
```scala
val als = new ALS()
.setMaxIter(100)
.setCheckpointInterval(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val t0 = System.currentTimeMillis()
val model = als.fit(training)
val t1 = System.currentTimeMillis()
```
Before this patch: 65.386 s
After this patch: 61.022 s
Closes#25576 from viirya/persist-item-factors.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Since the method `labels` is already deprecated, we should update the examples and suites to avoid warnings when compiling Spark:
```
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala:65: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn] .setLabels(labelIndexer.labels)
[warn] ^
[warn] /Users/zrf/Dev/OpenSource/spark/examples/src/main/scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala:68: method labels in class StringIndexerModel is deprecated (since 3.0.0): `labels` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
[warn] .setLabels(labelIndexer.labels)
[warn] ^
```
## How was this patch tested?
existing suites
Closes#25428 from zhengruifeng/del_stringindexer_labels_usage.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Tree-based feature transformation is a widely used feature and is already implemented in many famous libraries, like sklearn/xgboost/lightgbm/catboost, but it is still missing in ML.
The previous discussions and design doc can be found in [SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677), which is the only left subtask in 'GBT improvement umbrella' [SPARK-14047](https://issues.apache.org/jira/browse/SPARK-14047).
This PR adds tree-based feature transformation.
## How was this patch tested?
existing and added suites
Closes#25383 from zhengruifeng/tree_path.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
SparkML writer gets hadoop conf from session state, instead of the spark context.
### Why are the changes needed?
Allow for multiple sessions in the same context that have different hadoop configurations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested in pyspark.ml.tests.test_persistence.PersistenceTest test_default_read_write
Closes#25505 from helenyugithub/SPARK-28776.
Authored-by: heleny <heleny@palantir.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Delete the incorrect method `def setWeightCol(value: Double): this.type = set(threshold, value)` in `LinearSVCModel`
### Why are the changes needed?
`LinearSVCModel` should not provide this setter; moreover, the method is wrongly defined: it is named `setWeightCol` but takes a `Double` and sets `threshold`.
### Does this PR introduce any user-facing change?
yes, a public method is removed
### How was this patch tested?
existing suites
Closes#25510 from zhengruifeng/linearsvc_model_set_weightcol.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Fix dummy tree created in decision tree tests to have actually consistent stats, so that it can be compared in tests more completely. The current one has values for, say, impurity that don't even match internally.
With this, the tests can assert more about stats staying correct after load.
### Why are the changes needed?
Fixes a TODO and improves the test slightly.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests.
Closes#25485 from srowen/SPARK-28434.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The `fit` method in `StringIndexer` sorts the given labels sequentially if there are multiple input columns. When the number of input columns increases, the label-sorting time increases dramatically too, so it is hard to use in practice when dealing with hundreds of input columns.
This patch tries to make the label sorting parallel.
This runs benchmark like:
```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.functions.col
val numCol = 300
val data = (0 to 100).map { i =>
(i, 100 * i)
}
var df = data.toDF("id", "label0")
(1 to numCol).foreach { idx =>
df = df.withColumn(s"label$idx", col("label0") + 1)
}
val inputCols = (0 to numCol).map(i => s"label$i").toArray
val outputCols = (0 to numCol).map(i => s"labelIndex$i").toArray
val t0 = System.nanoTime()
val indexer = new StringIndexer().setInputCols(inputCols).setOutputCols(outputCols).setStringOrderType("alphabetDesc").fit(df)
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) / 1000000000.0 + "s")
```
| numCol | 20 | 50 | 100 | 200 | 300 |
|--:|---|---|---|---|---|
| Before | 9.85 | 28.62 | 64.35 | 167.17 | 431.60 |
| After | 2.44 | 2.71 | 3.34 | 4.83 | 6.90 |
Unit: second
## How was this patch tested?
Passed existing tests. Manually test for performance.
Closes#25442 from viirya/improve_stringindexer2.
Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR fixes typos in comments and replaces explicit types with '<>' for Java 8+.
## How was this patch tested?
Manually tested.
Closes#25338 from younggyuchun/younggyu.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Some cleanup and tiny optimizations:
1. Since the `transformImpl` method on the .mllib side is no longer used on the .ml side, its scope should be limited;
2. In the `hashUDF`, the val `numOfFeatures` is never used;
3. In the UDF, it is inefficient to call param getters (`$(numFeatures)`/`$(binary)`) directly or via the method `indexOf` (`$(numFeatures)`); instead, the getters should be called outside of the UDF, as sketched after this list.
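A sketch of the pattern for point 3 (simplified; `Utils.nonNegativeMod` stands in for the real hashing):
```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.util.Utils

// Read the param once on the driver; the UDF closes over a plain local value
// instead of calling $(numFeatures) for every row
val n = $(numFeatures)
val hashUDF = udf { terms: Seq[String] =>
  terms.map(term => Utils.nonNegativeMod(term.##, n))
}
```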
## How was this patch tested?
existing suites
Closes#25324 from zhengruifeng/hashingtf_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove the redundant and confusing transformImpl method in RF & GBT;
1. In `GBTClassifier` & `RandomForestClassifier`, the real `transform` methods are inherited from `ProbabilisticClassificationModel`, which can deal with multiple output columns.
The `transformImpl` method, which deals with only one column (`predictionCol`), does nothing useful, which is quite confusing.
2. In `GBTRegressor` & `RandomForestRegressor`, `transformImpl` does exactly what the superclass `PredictionModel` does (except model broadcasting), so it can be removed.
## How was this patch tested?
existing suites
Closes#25256 from zhengruifeng/del_ensamble_transformImpl.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Use `log1p(x)` over `log(1+x)` and `expm1(x)` over `exp(x)-1` for accuracy, where possible. This should improve accuracy a tiny bit in ML-related calculations, and shouldn't hurt in any event.
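For tiny x the naive forms lose the result entirely, while the fused forms keep it (illustrative):
```scala
math.log(1 + 1e-18)   // 0.0 (1 + 1e-18 rounds to 1.0 in double precision)
math.log1p(1e-18)     // 1.0E-18
math.exp(1e-18) - 1   // 0.0
math.expm1(1e-18)     // 1.0E-18
```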
## How was this patch tested?
Existing tests.
Closes#25337 from srowen/SPARK-28604.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Imputer currently requires the input column to be Double or Float, but the logic should work on any numeric data type. Many practical problems have integer data types, and it can get very tedious to manually cast them to Double before calling Imputer. This transformer can be extended to handle all numeric types.
## How was this patch tested?
new test
Closes#17864 from actuaryzhang/imputer.
Lead-authored-by: actuaryzhang <actuaryzhang10@gmail.com>
Co-authored-by: Wayne Zhang <actuaryzhang@uber.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Update HashingTF to use the new implementation of MurmurHash3.
Make HashingTF use the old MurmurHash3 when a model from pre-3.0 is loaded.
## How was this patch tested?
Changed existing unit tests. Also added one unit test to make sure HashingTF uses the old MurmurHash3 when a pre-3.0 model is loaded.
Closes#25303 from huaxingao/spark-23469.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Avoid the `.ml.vector => .breeze.vector` conversion in `MaxAbsScaler`,
and reuse the transformation method in `StandardScalerModel`, which can deal with dense and sparse vectors separately.
## How was this patch tested?
existing suites
Closes#25311 from zhengruifeng/maxabs_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
I removed the deprecated `ImageSchema.readImages`.
Moved some useful methods from the class `ImageSchema` into the class `ImageFileFormat`.
In PySpark, I renamed the `ImageSchema` class to `ImageUtils` and kept some useful Python methods in it.
## How was this patch tested?
UT.
Closes#25245 from WeichenXu123/remove_image_schema.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Implement `RobustScaler`
Since the transformation is quite similar to `StandardScaler`, I refactor the transform function so that it can be reused in both scalers.
## How was this patch tested?
existing and added tests
Closes#25160 from zhengruifeng/robust_scaler.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add indexOf method for ml.feature.HashingTF.
## How was this patch tested?
Add Unit test.
Closes#25250 from huaxingao/spark-21481.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
1. Avoid calling param getters in the UDF;
2. For constant dims, precompute the transformed result;
3. For the usual dims, precompute `scale / originalRange(i)` to skip a division.
## How was this patch tested?
existing suites
Closes#25244 from zhengruifeng/minmax_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Because the local default locale isn't among the available locales in `Locale`, when I ran some tests locally with Python code, `StopWordsRemover`-related Python tests hit errors like:
```
Traceback (most recent call last):
File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in test_stopwordsremover
stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
return func(self, **kwargs)
File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
self.uid)
File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1554, in __call__
answer, self._gateway_client, None, self._fqn)
File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
raise converted
pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 parameter locale given invalid value en_TW.'
```
As per HyukjinKwon's advice, instead of setting up a locale to pass the test, it is better to fall back to a workable locale if the system default locale can't be found among the available locales in the JVM. Otherwise, users have to manually change the system locale or access a private property _jvm in PySpark.
## How was this patch tested?
Added test and manual test.
```
scala> val remover = new StopWordsRemover().setInputCol("raw").setOutputCol("filtered")
19/07/14 19:20:03 WARN StopWordsRemover: Default locale set was [en_TW]; however, it was not found in available locales in JVM, falling back to en_US locale. Set param `locale` in order to respect another locale.
```
Closes#25133 from viirya/pytest-default-locale.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Optimize `SparseVector.apply` by avoiding internal conversion.
Since the speedup is significant (2.5X ~ 5X) and this method is widely used in ml, I suggest backporting.
| size| nnz | apply(old) | apply2(new impl) | apply3(new impl with extra range check)|
|------|----------|------------|----------|----------|
|10000000|100|75294|12208|18682|
|10000000|10000|75616|23132|32932|
|10000000|1000000|92949|42529|48821|
## How was this patch tested?
existing tests
using following code to test performance (here the new impl is named `apply2`, and another impl with extra range check is named `apply3`):
```
import scala.util.Random
import org.apache.spark.ml.linalg._
val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
val rng = new Random(123)
val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.take(nnz).sorted
val values = Array.fill(nnz)(rng.nextDouble)
val vec = Vectors.sparse(size, indices, values).toSparse
val tic1 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec(i); i+=1} };
val toc1 = System.currentTimeMillis;
val tic2 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply2(i); i+=1} };
val toc2 = System.currentTimeMillis;
val tic3 = System.currentTimeMillis;
(0 until 100).foreach{ round => var i = 0; var sum = 0.0; while(i < size) {sum+=vec.apply3(i); i+=1} };
val toc3 = System.currentTimeMillis;
println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```
Closes#25178 from zhengruifeng/sparse_vec_apply.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Use `org.apache.spark.mllib.util.TestingUtils` object across `MLLIB` component to compare floating point values in tests.
## How was this patch tested?
`build/mvn test` - existing tests against updated code.
Closes#25191 from eugen-prokhorenko/mllib-testingutils-double-comparison.
Authored-by: Ievgen Prokhorenko <eugen.prokhorenko@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In regression/clustering/OVR/ALS, if an output column name is empty, ignore it. And if all names are empty, log a warning message, then do nothing.
## How was this patch tested?
existing tests
Closes#24793 from zhengruifeng/aft_iso_check_empty_outputCol.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This upgraded to a newer version of Pyrolite. Most updates [1] in the newer version are for .NET. For Java, it includes a bug fix to Unpickler regarding cleaning up the Unpickler memo, and support for protocol 5.
After upgrading, we can remove the fix at SPARK-27629 for the bug in Unpickler.
[1] https://github.com/irmen/Pyrolite/compare/pyrolite-4.23...master
## How was this patch tested?
Manually tested on Python 3.6 in local on existing tests.
Closes#25143 from viirya/upgrade-pyrolite.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
fix typo in spark-28159
`transfromWithMean` -> `transformWithMean`
## How was this patch tested?
existing test
Closes#25129 from zhengruifeng/to_ml_vec_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In both cases, the input `DataFrame` schema must contain only the information that's required for the matrix object, so a vector column in the case of `RowMatrix` and long and vector columns for `IndexedRowMatrix`.
## How was this patch tested?
Unit tests that verify:
- `RowMatrix` and `IndexedRowMatrix` can be created from `DataFrame`s
- If the schema does not match expectations, we throw an `IllegalArgumentException`
Closes#24953 from henrydavidge/row-matrix-df.
Authored-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Make the transforms native in the ml framework to avoid extra conversions.
There are many TODOs in the current ml module, like `// TODO: Make the transformer natively in ml framework to avoid extra conversion.` in ChiSqSelector.
This PR makes ml algorithms no longer need to convert ml vectors to mllib vectors in their transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
## How was this patch tested?
existing testsuites
Closes#24963 from zhengruifeng/to_ml_vector.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
If the input dataset is already cached, then we do not need to cache the internal RDD (as in KMeans).
## How was this patch tested?
existing test
Closes#24919 from zhengruifeng/gmm_fix_double_caching.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Cache the dataset in BisectingKMeans.
Cache the dataset in LDA if the Online solver is chosen.
## How was this patch tested?
existing test
Closes#24920 from zhengruifeng/bikm_cache.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add missing RankingEvaluator
## How was this patch tested?
added testsuites
Closes#24869 from zhengruifeng/ranking_eval.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Provide a way to recursively load data from a datasource.
I added a "recursiveFileLookup" option (usage sketched below).
When the "recursiveFileLookup" option is turned on, partition inferring is turned off and all files from the directory are loaded recursively.
If a datasource explicitly specifies a partitionSpec and the user turns on the "recursiveFileLookup" option, an exception will be thrown.
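A hypothetical usage sketch (`spark` is a SparkSession; the path is an assumption):
```scala
val df = spark.read
  .format("text")
  .option("recursiveFileLookup", "true")
  .load("/data/logs")
```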
## How was this patch tested?
Unit tests.
Closes#24830 from WeichenXu123/recursive_ds.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Modifies the HuberAggregator class so that a copy of the coefficients vector isn't created every time that an instance is added. Follows the approach of LeastSquaresAggregator and uses transient lazy class variable to store the reused quantities. (See https://github.com/apache/spark/pull/14109 for explanation of the use of transient lazy variables)
On the test case in the linked JIRA, this change gives an order-of-magnitude performance improvement, reducing the time taken to fit the model from 540 to 47 seconds.
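A sketch of the pattern (simplified; `bcCoefficients` is the aggregator's broadcast coefficients):
```scala
// Mark the derived array @transient lazy so each executor rebuilds it once
// from the broadcast value instead of serializing a copy into every task
@transient private lazy val coefficientsArray: Array[Double] =
  bcCoefficients.value.toArray
```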
## How was this patch tested?
Existing unit tests.
See https://issues.apache.org/jira/browse/SPARK-28062 for results from running a benchmark script.
Closes#24880 from Andrew-Crosby/spark-28062.
Authored-by: Andrew-Crosby <andrew.crosby@autotrader.co.uk>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
expose more metrics in evaluator: weightedTruePositiveRate/weightedFalsePositiveRate/weightedFMeasure/truePositiveRateByLabel/falsePositiveRateByLabel/precisionByLabel/recallByLabel/fMeasureByLabel
## How was this patch tested?
existing cases and add cases
Closes#24868 from zhengruifeng/multi_class_support_bylabel.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The word2vec logic fails if a corpus has a word with count > 1e9. We should be able to handle very large counts better here by counting with longs.
This takes over https://github.com/apache/spark/pull/24814
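A minimal sketch of the counting change (illustrative, not the actual Word2Vec code): `Int` counters overflow past `Int.MaxValue` (about 2.1e9), so totals are accumulated as `Long`.
```
import scala.collection.mutable

val counts = mutable.HashMap.empty[String, Long]
Seq("spark", "word", "spark").foreach { w =>
  counts(w) = counts.getOrElse(w, 0L) + 1L
}
val totalWordCount: Long = counts.values.sum // safe beyond 2^31 - 1
```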
## How was this patch tested?
Existing tests.
Closes#24893 from srowen/SPARK-28081.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add MultilabelClassificationEvaluator
## How was this patch tested?
added testsuites
Closes#24777 from zhengruifeng/multi_label_eval.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Added support for `*` and `^` operators, along with expressions within parentheses. The new operators just expand to already-supported terms (usage sketch below), such as:
- y ~ a * b = y ~ a + b + a:b
- y ~ (a + b + c)^3 = y ~ a + b + c + a:b + a:c + b:c + a:b:c
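A usage sketch of the expanded operators, with illustrative column names:
```
import org.apache.spark.ml.feature.RFormula

// "y ~ a * b" now expands to "y ~ a + b + a:b"
val formula = new RFormula()
  .setFormula("y ~ a * b")
  .setFeaturesCol("features")
  .setLabelCol("label")
// formula.fit(df).transform(df) then assembles a, b and the a:b interaction
// into `features` (df being a hypothetical DataFrame with columns y, a, b).
```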
## How was this patch tested?
Added new unit tests to RFormulaParserSuite
mengxr yanboliang
Closes#24764 from ozancicek/rformula.
Authored-by: ozan <ozancancicekci@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
add missing since annotation of meanAveragePrecision
## How was this patch tested?
existing tests
Closes#24778 from zhengruifeng/ranking_missing_since.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
compute all metrics with only one pass
## How was this patch tested?
existing tests
Closes#24717 from zhengruifeng/multi_label_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Single-point clusters should have silhouette score of 0, according to the original paper and scikit implementation.
## How was this patch tested?
Existing test suite + new test case.
Closes#24756 from srowen/SPARK-27896.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR wraps all `PrintWriter` usages with `Utils.tryWithResource` to prevent resource leaks.
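A minimal sketch of the pattern, assuming a helper shaped like Spark's internal `Utils.tryWithResource`: the writer is closed even if the block throws.
```
import java.io.PrintWriter

def tryWithResource[R <: AutoCloseable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

tryWithResource(new PrintWriter("out.txt")) { writer =>
  writer.println("hello") // no leak even if this block throws
}
```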
## How was this patch tested?
Existing test
Closes#24739 from wangyum/SPARK-27875.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The failure log format is fixed to match the JDK implementation.
## How was this patch tested?
Manual tests have been done. The new failure log format would look like:
```
java.lang.RuntimeException: Failed to finish the task
    at com.xxx.Test.test(Test.java:106)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:571)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:707)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:979)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
    at org.testng.TestRunner.privateRun(TestRunner.java:648)
    at org.testng.TestRunner.run(TestRunner.java:505)
    at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
    at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
    at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
    at org.testng.SuiteRunner.run(SuiteRunner.java:364)
    at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
    at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
    at org.testng.TestNG.runSuitesSequentially(TestNG.java:1187)
    at org.testng.TestNG.runSuitesLocally(TestNG.java:1116)
    at org.testng.TestNG.runSuites(TestNG.java:1028)
    at org.testng.TestNG.run(TestNG.java:996)
    at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
    at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:123)
Caused by: java.io.FileNotFoundException: File is not found
    at com.xxx.Test.test(Test.java:105)
    ... 24 more
```
Closes#24684 from breakdawn/master.
Authored-by: MJ Tang <mingjtang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
compute AUC on one pass
## How was this patch tested?
existing tests
performance tests:
```
import org.apache.spark.mllib.evaluation._

val scoreAndLabels = sc.parallelize(
  Array.range(0, 100000).map { i => (i.toDouble / 100000, (i % 2).toDouble) }, 4)
scoreAndLabels.persist()
scoreAndLabels.count()

val tic = System.currentTimeMillis
(0 until 100).foreach { i =>
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, 0)
  val auc = metrics.areaUnderROC
  metrics.unpersist()
}
val toc = System.currentTimeMillis
toc - tic
```
|New (ms)|Existing (ms)|
|------|----------|
|87532|103644|
One-pass AUC saves about 16% computation time.
Closes#24648 from zhengruifeng/auc_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Eliminate the unnecessary job to compute SSreg:
compute SSreg based on the summary of the predictions instead.
## How was this patch tested?
existing tests
Closes#24656 from zhengruifeng/RegressionMetrics_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Avoid hardcoded configs in `SparkConf`, `SparkSubmit`, and tests.
## How was this patch tested?
N/A
Closes#24631 from wenxuanguan/minor-fix.
Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]`, as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.
## How was this patch tested?
Existing tests
Closes#24586 from srowen/SPARK-27682.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Test steps to reproduce:
1) bin/spark-shell
```
val dataset = spark.createDataFrame(Seq(
  (0L, 1L, 1.0),
  (1L, 2L, 1.0),
  (3L, 4L, 1.0),
  (4L, 0L, 0.1))).toDF("src", "dst", "weight")
val model = new PowerIterationClustering()
  .setMaxIter(10)
  .setInitMode("degree")
  .setWeightCol("weight")
val prediction = model.assignClusters(dataset).select("id", "cluster")
```
2) Open the Storage tab of the UI. Many RDD blocks remain cached, even after the PIC run has finished.
This PR basically materializes the new graph before unpersisting the old ones.
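A minimal sketch of the idea using plain RDDs (the actual fix operates on GraphX graphs), assuming an active SparkContext `sc`:
```
import org.apache.spark.storage.StorageLevel

val oldRdd = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
val newRdd = oldRdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
newRdd.count()     // materialize the new data first
oldRdd.unpersist() // only then drop the old cached blocks
```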
## How was this patch tested?
Manually tested and existing UTs.
Before patch:
![Screenshot from 2019-05-06 02-53-45](https://user-images.githubusercontent.com/23054875/57201033-daf61b80-6fb0-11e9-97ff-7534909ce2d3.png)
After patch:
![Screenshot from 2019-05-06 03-41-04](https://user-images.githubusercontent.com/23054875/57201043-07aa3300-6fb1-11e9-855b-f63ee18ea371.png)
Closes#24531 from shahidki31/SPARK-27636.
Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Added the method `meanAveragePrecisionAt(k)` to RankingMetrics.
This branch is rebased with squashed commits from https://github.com/apache/spark/pull/24458
## How was this patch tested?
Added code in the existing test RankingMetricsSuite.
Closes#24543 from qb-tarushg/SPARK-27540-REBASE.
Authored-by: qb-tarushg <tarush.grover@quantumblack.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries:
- shade the `META-INF/native/libnetty_*` native libraries when packaging the yarn shuffle service jar. This is required because the netty library loader derives the library name from the shaded package name.
- updated the `org/spark_project` shade package prefix to `org/sparkproject` (i.e. removed the underscore), as the former breaks the netty native lib loading.
This was causing the yarn external shuffle service to fail when spark.shuffle.io.mode=EPOLL.
## How was this patch tested?
Manual tests
Closes#24502 from amuraru/SPARK-27610_master.
Authored-by: Adi Muraru <amuraru@adobe.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Choose the last record in chunks when calculating metrics with downsampling in `BinaryClassificationMetrics`.
## How was this patch tested?
A new unit test is added to verify thresholds from downsampled records.
Closes#24470 from shishaochen/spark-mllib-binary-metrics.
Authored-by: Shaochen Shi <shishaochen@bytedance.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In SPARK-27612, a correctness issue was reported: when protocol 4 is used to pickle Python objects, the unpickled objects were wrong. A temporary fix was proposed by not using the highest protocol.
It was found that Opcodes.MEMOIZE appeared in the opcodes under protocol 4 and was suspected to be the cause of this issue.
A deeper dive found that Opcodes.MEMOIZE stores objects into the internal map of the Unpickler object. We use a single Unpickler object to unpickle serialized Python bytes, and the stored objects interfere with the next round of unpickling if the map is not cleared.
We have two options:
1. Continue to reuse the Unpickler, but call its close after each unpickling.
2. Do not reuse the Unpickler, and create a new Unpickler object for each unpickling.
This patch takes option 1.
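A sketch of option 1, assuming Pyrolite's `Unpickler` API as used by Spark:
```
import net.razorvine.pickle.Unpickler

val unpickler = new Unpickler
def unpickle(bytes: Array[Byte]): AnyRef = {
  try {
    unpickler.loads(bytes)
  } finally {
    unpickler.close() // clears the memo map populated by Opcodes.MEMOIZE
  }
}
```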
## How was this patch tested?
Passing the test added in SPARK-27612 (#24519).
Closes#24521 from viirya/SPARK-27629.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
When the transform(...) method is called on a LinearRegressionModel created directly from coefficients and an intercept, the following exception is encountered.
```
java.util.NoSuchElementException: Failed to find a default value for loss
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params$class.$(params.scala:786)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
```
This is because validateAndTransformSchema() is called during both the training and scoring phases, but the checks against training-related params like loss should really be performed during the training phase only, I think; please correct me if I'm missing anything :)
This issue was first reported for mleap (https://github.com/combust/mleap/issues/455) because when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to deserialize the serialized transformers back into Spark for scoring again, but in that case we no longer have all the training params.
## How was this patch tested?
Added a unit test to check this scenario.
Please let me know if there's anything additional required, this is the first PR that I've raised in this project.
Closes#24509 from ancasarb/linear_regression_params_fix.
Authored-by: asarb <asarb@expedia.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
There is a MemorySink v2 already so v1 can be removed. In this PR I've removed it completely.
What this PR contains:
* V1 memory sink removal
* V2 memory sink renamed to become the only implementation
* Since DSv2 sends exceptions in a chained format (linking them with the cause field), I've made the Python side compliant
* Adapted all the tests
## How was this patch tested?
Existing unit tests.
Closes#24403 from gaborgsomogyi/SPARK-23014.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785
For Spark, it comes up mostly where the code plays fast and loose with generic types, not the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])` because doing so requires inferring the existence of some type that satisfies both. Seems obvious if the generic type is a wildcard, but not technically something Scala likes to let you get away with.
This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic.
Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code.
Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`.
For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately.
One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. This caused a number of small but positive changes to callers that otherwise had to cast the result.
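A small sketch of the two signature shapes being discussed, with illustrative names (not the actual Utils code):
```
// Existential: callers receive an unknown Class[_] and often must cast.
def classForNameOld(name: String): Class[_] = Class.forName(name)

// Generic: Scala fixes T at each call site, so callers need no cast.
def classForNameNew[T](name: String): Class[T] =
  Class.forName(name).asInstanceOf[Class[T]]

val c: Class[String] = classForNameNew[String]("java.lang.String")
```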
In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent.
Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types.
After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings.
## How was this patch tested?
Existing tests.
Closes#24431 from srowen/SPARK-27536.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Kind of related to https://github.com/gatorsmile/spark/pull/5 - let's update genjavadoc to see if it generates fewer spurious javadoc errors to begin with.
## How was this patch tested?
Existing docs build
Closes#24443 from srowen/genjavadoc013.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Otherwise, tests that use tables from multiple sessions will run into issues if they access the same table. The correct location is in shared state.
A couple other minor test improvements.
cc gatorsmile srinathshankar
## How was this patch tested?
Existing unit tests.
Closes#24302 from ericl/test-conflicts.
Lead-authored-by: Eric Liang <ekl@databricks.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix a failure in the Spark image datasource when it encounters some illegal images.
This is related to bugs inside `ImageIO.read`, so the Spark code now adds exception handling around it.
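A minimal sketch of the guard, assuming an invalid image should decode to None rather than fail the task:
```
import java.io.ByteArrayInputStream
import javax.imageio.ImageIO
import scala.util.control.NonFatal

def safeRead(bytes: Array[Byte]): Option[java.awt.image.BufferedImage] =
  try {
    // ImageIO.read returns null for unreadable input and can throw on corrupt data.
    Option(ImageIO.read(new ByteArrayInputStream(bytes)))
  } catch {
    case NonFatal(_) => None
  }
```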
## How was this patch tested?
N/A
Closes#24362 from WeichenXu123/fix_image_ds_bug.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
This should reduce the total runtime of these tests from about 2 minutes to about 25 seconds.
## How was this patch tested?
Existing tests
Closes#24360 from srowen/SpeedQDS.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Loosen some tolerances in the ML regression-related tests, as they seem to account for some of the top slow tests in https://spark-tests.appspot.com/slow-tests
These changes are good for about a 25 second speedup on my laptop.
## How was this patch tested?
Existing tests
Closes#24351 from srowen/SpeedReg.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix build warnings -- see some details below.
But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds", which I'd like to simplify to things like "2.minutes" anyway.
## How was this patch tested?
Existing tests.
Closes#24314 from srowen/SPARK-27404.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove deprecated / no-op mllib.KMeans getRuns, setRuns
mllib.KMeans has getRuns, setRuns methods which haven't done anything since Spark 2.1. They're deprecated, and no-ops, and should be removed for Spark 3.
## How was this patch tested?
Existing tests.
Closes#24320 from srowen/SPARK-27410.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Changes proposed:
- Adding a method to compute the treeAggregate depth required to avoid exceeding the driver max result size (first commit)
- Using it in the computation of the Gramian of RowMatrix (second commit)
Tests:
- Unit-test wise, one unit test checks the behavior of the depth computation method
- Tested at scale on a hadoop cluster by doing PCA on a large dataset (needed depth 3 to succeed)
Debatable choice:
I'm not sure the RDD API is the right place to put the depth computation method. The advantage is that it gives access to the driver max result size and the RDD's number of partitions, to set default arguments for the method. Semantically, though, such a method might belong somewhere like org.apache.spark.util.Utils. A sketch of the heuristic follows.
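A sketch of the heuristic under stated assumptions (the method name and exact formula are illustrative, not the merged API): pick the smallest depth such that the partial results combined at any level stay under the driver max result size.
```
import scala.math.{ceil, log, max}

def requiredTreeAggregateDepth(
    numPartitions: Int,
    objectSizeBytes: Long,
    maxResultSizeBytes: Long): Int = {
  // With scale-way fan-in per level, scale^depth >= numPartitions, and the
  // driver receives about `scale` partial results of objectSizeBytes each.
  val maxFanIn = max(2L, maxResultSizeBytes / max(1L, objectSizeBytes))
  val depth = ceil(log(numPartitions.toDouble) / log(maxFanIn.toDouble)).toInt
  max(1, depth)
}
```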
Closes#23983 from gagafunctor/Heuristic_for_treeAggregate_depth.
Authored-by: Rafael Renaudin <renaudin.rafael@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Followup to PR https://github.com/apache/spark/pull/17085
This PR adds the weight column to the pyspark side, which was already added to the scala API.
The PR also undoes a name change in the scala side corresponding to a change in another similar PR as noted here:
https://github.com/apache/spark/pull/17084#discussion_r259648639
## How was this patch tested?
This patch adds python tests for the changes to the pyspark API.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Closes#24197 from imatiach-msft/ilmat/regressor-eval-python.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The hashSeed method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of ByteBuffer), and they can be excluded from the hash calculation because they don't differentiate inputs.
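A minimal sketch of the fix:
```
import java.nio.ByteBuffer

// Allocate exactly the 8 bytes a Long occupies; the previous code allocated
// 64, leaving 56 always-zero bytes that added nothing to the hash input.
def seedBytes(seed: Long): Array[Byte] =
  ByteBuffer.allocate(8).putLong(seed).array()
```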
## How was this patch tested?
By running the existing tests - XORShiftRandomSuite
Closes#20793 from MaxGekk/hash-buff-size.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The `Saveable` interface introduces `formatVersion`, which is protected and not used anywhere, so this PR proposes to remove it.
## How was this patch tested?
existing tests
Closes#22830 from mgaido91/SPARK-25838.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add 'Recall_at_k' metric to RankingMetrics
## How was this patch tested?
Add test to RankingMetricsSuite.
Closes#23881 from masa3141/SPARK-26981.
Authored-by: masa3141 <masahiro@kazama.tv>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Remove the redundant `get` when getting a value from a `Map` given a key.
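The shape of the cleanup, as a tiny illustrative example:
```
val m = Map("a" -> 1)
val before = m.get("a").getOrElse(0) // redundant Option round-trip
val after  = m.getOrElse("a", 0)     // equivalent, single call
```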
## How was this patch tested?
N/A
Closes#23901 from 10110346/removegetfrommap.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add sample weights to decision trees
## How was this patch tested?
updated testsuites
Closes#23818 from zhengruifeng/py_tree_support_sample_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Comparing whether a Boolean expression is equal to `true` is redundant.
For example, when the datatype of `a` is boolean:
Before: `if (a == true)`
After: `if (a)`
## How was this patch tested?
N/A
Closes#23884 from 10110346/simplifyboolean.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add the reference JAXB implementation for Java 9+ from Glassfish. Right now it's apparently only necessary in MLlib, but this can be expanded later.
## How was this patch tested?
Existing tests particularly PMML-related ones, which use JAXB.
This works on Java 11.
Closes#23890 from srowen/SPARK-26986.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data.
I've closed the PR: https://github.com/apache/spark/pull/16557
as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update.
## How was this patch tested?
I added tests to the metrics and evaluators classes.
Closes#17084 from imatiach-msft/ilmat/binary-evalute.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures from JPMML relating to JAXB when running on Java 11. It's shaded and not a big change, so it should be safe.
## How was this patch tested?
Existing tests.
Closes#23868 from srowen/SPARK-26966.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
expose method `predict` in KMeans/BiKMeans/GMM
## How was this patch tested?
added testsuites
Closes#22087 from zhengruifeng/clu_pre_instance.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This patch aims to address flakiness I've observed in MLEventsSuite in these tests:
* test("pipeline read/write events")
* test("pipeline model read/write events")
The issue is in the "read/write events" tests, which work as follows:
* write
* wait until we see at least 1 write-related SparkListenerEvent
* read
* wait until we see at least 1 read-related SparkListenerEvent
The problem is that the last step does NOT allow any write-related SparkListenerEvents, but some of those events may be delayed enough that they are seen in this last step. We should ideally add logic before "read" to wait until the listener events are cleared/complete. Looking into other SparkListener tests, we need to use `sc.listenerBus.waitUntilEmpty(TIMEOUT)`.
This patch adds the waitUntilEmpty() call.
## How was this patch tested?
It's a test!
Closes#23863 from jkbradley/SPARK-26960.
Authored-by: Joseph K. Bradley <joseph@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Our feature importance calculation is taken from sklearn's one, which has been recently fixed (in https://github.com/scikit-learn/scikit-learn/pull/11176). Citing the description of that PR:
> Because the feature importances are (currently, by default) normalized and then averaged, feature importances from later stages are overweighted.
The PR performs a fix similar to sklearn's: the per-tree normalization of the feature importance is skipped for GBT.
Credits to Daniel Jumper for clearly pointing out the issue and the sklearn PR.
## How was this patch tested?
Modified UT; checked that the computed `featureImportance` in that test is similar to sklearn's (it can't be identical, because the trees may be slightly different).
Closes#23773 from mgaido91/SPARK-26721.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to use `System.nanoTime()` instead of `System.currentTimeMillis()` in measurements of time intervals.
`System.currentTimeMillis()` returns current wallclock time and will follow changes to the system clock. Thus, negative wallclock adjustments can cause timeouts to "hang" for a long time (until wallclock time has caught up to its previous value again). This can happen when ntpd does a "step" after the network has been disconnected for some time. The most canonical example is during system bootup when DHCP takes longer than usual. This can lead to failures that are really hard to understand/reproduce. `System.nanoTime()` is guaranteed to be monotonically increasing irrespective of wallclock changes.
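A minimal sketch of the measurement pattern:
```
// nanoTime deltas are monotonic, so NTP steps or boot-time DHCP delays
// cannot make a measured interval go backwards or hang.
val start = System.nanoTime()
Thread.sleep(10) // stand-in for the work being measured
val elapsedMillis = (System.nanoTime() - start) / 1000000L
```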
## How was this patch tested?
By existing test suites.
Closes#23727 from MaxGekk/system-nanotime.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Make .unpersist(), .destroy() non-blocking by default and adjust callers to request blocking only where important.
This also adds an optional blocking argument to Pyspark's RDD.unpersist(), which never had one.
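A usage sketch, assuming an active SparkContext `sc`:
```
val rdd = sc.parallelize(1 to 10).cache()
rdd.count()
rdd.unpersist() // now non-blocking by default
// Callers for whom timing matters can still opt into blocking:
// rdd.unpersist(blocking = true)
```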
## How was this patch tested?
Existing tests.
Closes#23685 from srowen/SPARK-26771.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>