### What changes were proposed in this pull request?
1. Make `getTopIndices`/`selectIndicesFromPValues` private.
2. Avoid setting `selectionThreshold` in `fit`.
3. Move param checking to `transformSchema`.
### Why are the changes needed?
`getTopIndices`/`selectIndicesFromPValues` should not be exposed to end users;
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #31222 from zhengruifeng/selector_clean_up.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Having a single `UnivariateFeatureSelector` means we don't need three separate feature selectors.
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or, specifying a score function directly:
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
### How was this patch tested?
Add Unit test
Closes #31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Some local variables are declared as `var` but are never reassigned, so they should be declared as `val`. This PR turns these from `var` to `val`, except for `mockito`-related cases.
### Why are the changes needed?
Use `val` instead of `var` when possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #31142 from LuciferYang/SPARK-33346.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Some redundant collection conversions can be removed. For version compatibility, these are cleaned up under the Scala 2.13 profile.
### Why are the changes needed?
Remove redundant collection conversion
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual tests of `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`, and `kafka-0-10` passed in Scala 2.13
Closes #31125 from LuciferYang/SPARK-34068.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Use a temporary model in `OneVsRestModel.transform` to avoid calling the model's setters directly.
### Why are the changes needed?
Params of the model (and its submodels) should not be changed in `transform`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes #31086 from zhengruifeng/ovr_transform_tmp_model.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Improve flaky NaiveBayes test
The current test may sometimes fail under a different BLAS library, due to an absTol check, with errors like:
```
Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...
```
* Change absTol to relTol: an `absTol` of 0.05 allows a large relative difference in some cases (such as comparing 0.1 and 0.05).
* Remove the `exp` when comparing params, since `exp` amplifies the relative error.
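To make the distinction concrete, here is a minimal Python sketch of the two tolerance checks (the helper names are illustrative, not Spark's test utilities):

```python
def within_abs(a, b, tol):
    """Absolute-tolerance check: |a - b| <= tol."""
    return abs(a - b) <= tol

def within_rel(a, b, tol):
    """Relative-tolerance check: |a - b| <= tol * max(|a|, |b|)."""
    return abs(a - b) <= tol * max(abs(a), abs(b))

# For small magnitudes, absTol 0.05 admits a huge relative error:
# 0.10 vs 0.06 differ by 40% relatively, yet pass the absolute check.
print(within_abs(0.10, 0.06, 0.05))  # True  -- absolute check passes
print(within_rel(0.10, 0.06, 0.05))  # False -- relative check catches it
```

A relative check also still accepts legitimately close values such as the 0.7 vs 0.6486 pair from the error message above, while staying meaningful for small probabilities.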
### Why are the changes needed?
Flaky test
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #31004 from WeichenXu123/improve_bayes_tests.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass]
### Why are the changes needed?
Without this change, running tests for Scala 2.13 produces runtime code-generation errors like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests now pass for Scala 2.13
Closes #31018 from koertkuipers/feat-visibility-scala213.
Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added to `NodeData`, which causes tree models trained in 2.4 to fail to load in 3.0/3.1/master.
The field `rawCount` is only used in training, not in `transform`/`predict`/`featureImportance`, so this PR simply sets it to -1L when loading old models.
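The idea can be sketched as follows in Python; the record's fields here are illustrative, not Spark's actual `NodeData` schema:

```python
from dataclasses import dataclass

@dataclass
class NodeData:
    id: int
    prediction: float
    raw_count: int  # added in a later version; absent from old saved models

def load_node(row: dict) -> NodeData:
    # Old files have no 'rawCount'; default it to -1 since it is only used
    # during training, never in transform/predict/featureImportance.
    return NodeData(id=row["id"],
                    prediction=row["prediction"],
                    raw_count=row.get("rawCount", -1))

old_row = {"id": 0, "prediction": 1.5}                    # saved by an old version
new_row = {"id": 0, "prediction": 1.5, "rawCount": 42}    # saved by a new version
print(load_node(old_row).raw_count)  # -1
print(load_node(new_row).raw_count)  # 42
```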
### Why are the changes needed?
To support loading old tree models in 3.0/3.1/master.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuites
Closes #30889 from zhengruifeng/fix_tree_load.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There has been a lot of work on improving ALS's `recommendForAll`.
I found that it may be further optimized by:
1. using GEMV and sharing a pre-allocated buffer per task;
2. using guava's `Ordering` instead of `BoundedPriorityQueue`.
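The top-k selection idea can be sketched in Python with `heapq`; the two helpers stand in for a `BoundedPriorityQueue` and a partial sort in the style of guava's `Ordering.greatestOf`, respectively, and are not Spark's implementation:

```python
import heapq

def top_k_bounded_heap(scores, k):
    """Keep a size-k min-heap while streaming scores (BoundedPriorityQueue-style)."""
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # evict the smallest of the current top-k
    return sorted(heap, reverse=True)

def top_k_partial_sort(scores, k):
    """Select top-k via a single partial-selection pass over the buffer."""
    return heapq.nlargest(k, scores)

scores = [0.3, 0.9, 0.1, 0.7, 0.5]
print(top_k_bounded_heap(scores, 3))  # [0.9, 0.7, 0.5]
print(top_k_partial_sort(scores, 3))  # [0.9, 0.7, 0.5]
```

Both return the same results; the win comes from avoiding per-element queue maintenance when the candidate scores are already materialized in a buffer.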
### Why are the changes needed?
In my tests, using `f2jBLAS.sgemv` is about 2.3x faster than the existing implementation.
|Impl| Master | GEMM | GEMV | GEMV + array aggregator | GEMV + guava ordering + array aggregator | GEMV + guava ordering|
|------|----------|------------|----------|------------|------------|------------|
|Duration|341229|363741|191201|189790|148417|147222|
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30468 from zhengruifeng/als_rec_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Improve LogisticRegression test error tolerance
### Why are the changes needed?
When we switch the BLAS version, some of the tests fail due to overly strict error tolerances.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30587 from WeichenXu123/fix_lor_test.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
1. Directly use float vectors instead of converting them to double vectors; this is about 2x faster than using `vec.axpy`.
2. Mark `wordList` and `wordVecNorms` lazy.
3. Avoid slicing in the computation of `wordVecNorms`.
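The broadcast-size halving comes simply from the element width of 32-bit floats versus 64-bit doubles; a quick illustration using Python's stdlib `array` module:

```python
from array import array

vec = [0.1, 0.2, 0.3, 0.4]
doubles = array('d', vec)  # 8 bytes per element
floats = array('f', vec)   # 4 bytes per element (at reduced precision)

# Storing word vectors as 32-bit floats halves the serialized payload.
print(doubles.itemsize, floats.itemsize)  # 8 4
print(len(bytes(doubles)), len(bytes(floats)))  # 32 16
```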
### Why are the changes needed?
Halves the broadcast size.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30548 from zhengruifeng/w2v_float32_transform.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes #30606 from dongjoon-hyun/SPARK-3.2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
On each call of `transform`, a `head()` job is triggered; this can be skipped by using a lazy val.
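The caching pattern can be sketched in Python with `functools.cached_property`; the class and field names below are hypothetical, not the Imputer's actual members:

```python
from functools import cached_property

class ImputerModelSketch:
    """Hypothetical model: the first transform() pays for a head()-like scan,
    later calls reuse the cached result."""
    def __init__(self, rows):
        self.rows = rows
        self.scans = 0  # counts how many expensive scans actually ran

    @cached_property
    def surrogates(self):
        self.scans += 1          # stands in for the triggered head() job
        return self.rows[0]      # e.g. the single row of surrogate values

    def transform(self, value):
        return self.surrogates if value is None else value

m = ImputerModelSketch(rows=[42])
m.transform(None); m.transform(None); m.transform(7)
print(m.scans)  # 1 -- the expensive scan ran only once
```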
### Why are the changes needed?
avoiding duplicate head() jobs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #30550 from zhengruifeng/imputer_transform.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add array_to_vector function for dataframe column
### Why are the changes needed?
Utility function for array to vector conversion.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
scala unit test & doctest.
Closes #30498 from WeichenXu123/array_to_vec.
Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to do the following:
1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1
2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.)
3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job.
### Why are the changes needed?
Scala 2.13.4 is a maintenance release for 2.13 line and improves JDK 15 support.
- https://github.com/scala/scala/releases/tag/v2.13.4
Also, it improves exhaustivity check.
- https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors)
- https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components)
### Does this PR introduce _any_ user-facing change?
Yes. Although it's only a maintenance version bump, it is still a Scala version change.
### How was this patch tested?
Pass the CIs and do manual testing.
- Scala 2.12 CI jobs (GitHub Action / Jenkins UT / Jenkins K8s IT) to check the validity of the code change.
- Scala 2.13 compilation job to check the compilation.
Closes #30455 from dongjoon-hyun/SCALA_3.13.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Implement a new strategy, `mode`: replace missing values with the most frequent value along each column.
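A minimal Python sketch of the `mode` strategy on a single column (not the distributed implementation):

```python
from collections import Counter

def impute_mode(column):
    """Replace None entries with the most frequent non-missing value."""
    mode, _ = Counter(v for v in column if v is not None).most_common(1)[0]
    return [mode if v is None else v for v in column]

print(impute_mode([1.0, 3.0, None, 3.0, 2.0]))  # [1.0, 3.0, 3.0, 3.0, 2.0]
```

In the distributed setting, the per-column frequency counts are what gets aggregated, which is why the strategy scales well.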
### Why are the changes needed?
It is highly scalable, and has long been available in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer).
### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added
### How was this patch tested?
updated testsuites
Closes #30397 from zhengruifeng/imputer_max_freq.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds new Scala compiler args to `pom.xml` to defend against new unused imports:
- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13
The other file changes remove all unused imports in the Spark code.
### Why are the changes needed?
Clean up the code and guard against new unused imports being introduced.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30351 from LuciferYang/remove-imports-core-module.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors.
### Why are the changes needed?
The performance gain is mainly related to the number of non-zero values (nnz) of a block.
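The blockification-by-memory idea can be sketched as follows; the helper name and the dense-row size estimate are illustrative assumptions, and a sparse implementation would count nnz instead of row length:

```python
def blockify_by_bytes(rows, bytes_per_value, max_block_bytes):
    """Group rows into blocks whose estimated size stays under max_block_bytes."""
    blocks, current, current_bytes = [], [], 0
    for row in rows:
        row_bytes = len(row) * bytes_per_value
        if current and current_bytes + row_bytes > max_block_bytes:
            blocks.append(current)          # flush the full block
            current, current_bytes = [], 0
        current.append(row)
        current_bytes += row_bytes
    if current:
        blocks.append(current)
    return blocks

rows = [[1.0] * 4] * 6            # 6 rows, 32 bytes each at 8 bytes/value
blocks = blockify_by_bytes(rows, 8, 64)
print([len(b) for b in blocks])   # [2, 2, 2] -- two rows fit per 64-byte block
```

A row-count cap would put the same number of rows per block regardless of sparsity; a byte budget keeps block sizes stable whatever the nnz is.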
### Does this PR introduce _any_ user-facing change?
Yes, param `blockSize` -> `maxBlockSizeInMB` in master.
### How was this patch tested?
updated testsuites
Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
In [SPARK-33139] we deprecated `setActiveSession` and `clearActiveSession`. It turns out they are widely used, and after discussion we concluded that, even without that change, the unified view feature works; it is only a risk if a user truly abuses these two APIs. So the PR needs to be reverted.
[SPARK-33139] has two commits, including a follow-up. Revert them both.
### Why are the changes needed?
Revert.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #30367 from leanken/leanken-revert-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declarations in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
This PR is the first part of resolving SPARK-33352:
- For constructor definitions, add `=` to convert to function syntax.
- For method definitions without a return type, add `: Unit =` to convert to function syntax.
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13; this change remains compatible with Scala 2.12.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Optimize `predictQuantiles` by pre-computing an auxiliary variable.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. We can also optimize `predictQuantiles` by pre-computing an auxiliary variable; this is about 56% faster than the existing implementation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes #30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession.
Changes in this PR:
* Add a legacy config `spark.sql.legacy.allowModifyActiveSession` to fall back to the old behavior if users really need to call these two APIs.
* By default, calling these two APIs throws an exception.
* Add two extra internal and private APIs, `setActiveSessionInternal` and `clearActiveSessionInternal`, for current internal usage.
* Change all internal references to the new internal APIs, except for `SQLContext.setActive` and `SQLContext.clearActive`.
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes #30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use a lazy array instead of `var`s for the auxiliary variables in binary LOR.
### Why are the changes needed?
In https://github.com/apache/spark/pull/29255, I made a mistake:
`private var _threshold` and `_rawThreshold` are initialized with the default value of `threshold`. That is because:
1. param `threshold` is set to its default value first;
2. `_threshold` and `_rawThreshold` are initialized based on that default value;
3. param `threshold` is then updated with the value from the estimator by the `copyValues` method:
```
if (map.contains(param) && to.hasParam(param.name)) {
  to.set(param.name, map(param))
}
```
We can update `_threshold` and `_rawThreshold` in `setThreshold` and `setThresholds`, but we cannot update them in `set`/`copyValues`, so their values stay stale until `setThreshold` or `setThresholds` is called.
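One way to make such derived values immune to the generic `set`/`copyValues` path is to recompute them lazily from the current param on every read. A hypothetical Python sketch of that pattern (the names are illustrative, and Spark's actual fix uses lazy re-initialization in Scala):

```python
import math

class BinaryLORSketch:
    """Derived values are recomputed lazily on read, so they can never go
    stale when `threshold` changes through any setter path."""
    def __init__(self, threshold=0.5):
        self._params = {"threshold": threshold}

    def set(self, name, value):
        # Generic setter, standing in for Params.set / copyValues.
        self._params[name] = value

    @property
    def raw_threshold(self):
        # Log-odds of the threshold, derived from the current param value.
        t = self._params["threshold"]
        return math.log(t / (1.0 - t))

m = BinaryLORSketch()
before = m.raw_threshold             # log(1) = 0.0 at threshold 0.5
m.set("threshold", 0.8)              # generic path, no dedicated setter called
print(before, m.raw_threshold > 0)   # 0.0 True -- the derived value followed
```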
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in repl
Closes #30013 from zhengruifeng/lor_threshold_init.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. When `predictionCol` and `quantilesCol` are both set, we only need one prediction for each row: the prediction is just the variable `lambda` in `predictQuantiles`.
2. In the computation of the variable `quantiles` in `predictQuantiles`, a pre-computed vector `val baseQuantiles = $(quantileProbabilities).map(q => math.exp(math.log(-math.log1p(-q)) * scale))` can be reused for each row.
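The reuse can be sketched in Python; the assumption here, implied by the description above, is that each row's quantiles scale the pre-computed base vector by the row's `lambda`:

```python
import math

quantile_probabilities = [0.1, 0.5, 0.9]
scale = 1.3  # hypothetical AFT scale parameter

# Hoisted out of the per-row loop: depends only on probabilities and scale.
base_quantiles = [math.exp(math.log(-math.log1p(-q)) * scale)
                  for q in quantile_probabilities]

def predict_quantiles(lam):
    # Per row, only a scalar multiply per quantile remains.
    return [lam * b for b in base_quantiles]

row_predictions = [predict_quantiles(lam) for lam in (2.0, 5.0)]
```

Without the hoist, the `log`/`log1p`/`exp` chain runs once per quantile per row; with it, once per quantile total.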
### Why are the changes needed?
Avoid redundant computation in `transform`, as was done in `ProbabilisticClassificationModel`, `GaussianMixtureModel`, etc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite
Closes #30000 from zhengruifeng/aft_predict_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.
### Why are the changes needed?
There is a bug: when running
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not hit the following exception, caused by the settings above not being propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
### How was this patch tested?
Added UT to `LibSVMRelationSuite`.
Closes #29984 from MaxGekk/ml-option-propagation.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
RowMatrix contains a computation based on spark.driver.maxResultSize. However, when this value is set to 0, the computation fails (log of 0). The fix is simply to correctly handle this setting, which means unlimited result size, by using a tree depth of 1 in the RowMatrix method.
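A sketch of the guard in Python; the function name and the depth formula are illustrative, and only the `max_result_size == 0` branch reflects the fix described above:

```python
import math

def aggregate_tree_depth(total_result_bytes, max_result_size):
    """Pick a treeAggregate depth bounded by the driver's result-size limit."""
    if max_result_size <= 0:
        # 0 means "unlimited": a flat, depth-1 aggregation is always allowed,
        # and taking log(0) below would be exactly the bug being fixed.
        return 1
    # Otherwise deepen the tree until each wave of results fits the limit.
    return max(1, math.ceil(math.log(total_result_bytes) /
                            math.log(max_result_size)))

print(aggregate_tree_depth(10**9, 0))      # 1 -- unlimited, no log(0) crash
print(aggregate_tree_depth(10**9, 10**6))  # 2
```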
### Why are the changes needed?
Simple bug fix to make several Spark ML functions which use RowMatrix run correctly in this case.
### Does this PR introduce _any_ user-facing change?
No, other than the bug fix itself, of course.
### How was this patch tested?
Existing RowMatrix tests plus a new test.
Closes #29925 from srowen/SPARK-33043.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to promote the stability annotation to `Evolving` for MLEvent traits/classes.
### Why are the changes needed?
The feature was released in Spark 3.0.0, with SPARK-26818 as the last change in Feb. 2020, and hasn't changed in Spark 3.0.1 (no change for more than half a year).
While we should wait a few more minor releases before considering the API stable, it is worth promoting it to `Evolving` so that we clearly state that we support the API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Just changed the annotation, no tests required.
Closes #29887 from HeartSaVioR/SPARK-33011.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR resolves SPARK-32972. A total of 51 failed Scala test cases and 3 failed Java test cases were fixed. The main changes are as follows:
- Specify `Seq` as `scala.collection.Seq` in `case`-match-on-`Seq` and `x.asInstanceOf[Seq[T]]` scenes
- Use `Row.getSeq[T]` instead of `Row.getAs[Seq]`
- Manually call the `toMap` method to convert `MapView` to `Map` in Scala 2.13
- Change the tolerance in the last test to 0.75 to pass `RandomForestRegressorSuite#training with sample weights` in Scala 2.13
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn
```
**Before**
```
[ERROR] Errors:
[ERROR] JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR] JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR] JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...
Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***
```
**After**
```
[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.
```
Closes #29857 from LuciferYang/fix-mllib-2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Update the comment: `Note, the relevant columns must also be set in inputCols` -> `Note, the relevant columns should also be set in inputCols`.
2. Add a check: if some `categoricalCols` are not set in `inputCols`, log a warning.
### Why are the changes needed?
There is no check ensuring that all `categoricalCols` are set in `inputCols`; to keep the existing behavior, the comment is updated accordingly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
repl
Closes #29868 from zhengruifeng/feature_hash_cat_doc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
pre-compute the output indices of numerical columns, instead of computing them on each row.
### Why are the changes needed?
For a numerical column, the output index is a hash of its `col_name`; we can pre-compute it once instead of computing it for each row.
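The pre-computation can be sketched in Python with a toy string hash (not Spark's `FeatureHasher` hash function; all names here are illustrative):

```python
def feature_index(col_name, num_features):
    """Toy stand-in for the hashing trick: stable string hash mod table size."""
    h = 0
    for ch in col_name:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h % num_features

num_features = 1 << 8
numeric_cols = ["age", "height", "weight"]

# Pre-computed once, instead of re-hashing the column name for every row.
col_indices = {c: feature_index(c, num_features) for c in numeric_cols}

def hash_row(row):
    out = {}
    for col, value in row.items():
        idx = col_indices[col]          # O(1) lookup, no hashing per row
        out[idx] = out.get(idx, 0.0) + value
    return out

print(hash_row({"age": 33.0, "height": 1.8, "weight": 72.5}))
```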
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29850 from zhengruifeng/hash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`HashingTF` use `util.collection.OpenHashMap` instead of `mutable.HashMap`
### Why are the changes needed?
According to `util.collection.OpenHashMap`'s doc:
> This map is about 5X faster than java.util.HashMap, while using much less space overhead.
And according to performance tests like [Simple microbenchmarks comparing Scala vs Java mutable map performance](https://gist.github.com/pchiusano/1423303), `mutable.HashMap` may be even less efficient than `java.util.HashMap`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29852 from zhengruifeng/hashingtf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
revert blockify gmm
### Why are the changes needed?
WeichenXu123 and I think we should use memory size instead of the number of rows to blockify instances; if a buffer is large and its size is determined by the number of rows, it should be discarded.
In GMM, we found that the pre-allocated memory may be too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discussion and concluded it is better to revert the GMM blockification.
### Does this PR introduce _any_ user-facing change?
The `blockSize` param added in the master branch will be removed.
### How was this patch tested?
existing testsuites
Closes #29782 from zhengruifeng/unblockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Simplify the aggregation by getting `count` via `summary.count`.
2. Ignore NaN values, like the old implementation:
```
val relativeError = 0.05
val approxQuantile = numNearestNeighbors.toDouble / count + relativeError
val modelDatasetWithDist = modelDataset.withColumn(distCol, hashDistCol)
if (approxQuantile >= 1) {
  modelDatasetWithDist
} else {
  val hashThreshold = modelDatasetWithDist.stat
    .approxQuantile(distCol, Array(approxQuantile), relativeError)
  // Filter the dataset where the hash value is less than the threshold.
  modelDatasetWithDist.filter(hashDistCol <= hashThreshold(0))
}
```
### Why are the changes needed?
simplify the aggregation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29778 from zhengruifeng/lsh_nit.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR fixes code which causes errors when building with sbt and Scala 2.13, such as the following.
```
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = requestOffset < part.untilOffset
[error] [warn]
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = okNext
```
More specifically, this PR fixes:
* Methods that have an empty parameter list but override a method that has no parameter list.
```
override def hasNext(): Boolean = okNext
```
* Methods that have no parameter list but override a method that has an empty parameter list.
```
override def next: (Int, Double) = {
```
* Infix operator expressions where the operator wraps onto the next line.
```
3L * math.min(k, numFeatures) * math.min(k, numFeatures)
3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
+ math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
* math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```
### Why are the changes needed?
For building Spark with sbt and Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After applying this change and #29742, compilation passed with the following command.
```
build/sbt -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```
Closes #29745 from sarutak/fix-code-for-sbt-and-spark-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with LibSVM datasource when both of the following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*`, etc.
The fix is based on another bug fix for CSV/JSON datasources https://github.com/apache/spark/pull/29659.
### Why are the changes needed?
To fix the issue when a query such as the following tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but would end up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
```
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added UT to `LibSVMRelationSuite`.
Closes#29670 from MaxGekk/globbing-paths-when-inferring-schema-ml.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix double caching in KMeans/BiKMeans:
1, let the callers of `runWithWeight` pass whether `handlePersistence` is needed;
2, persist and unpersist inside of `runWithWeight`;
3, persist the `norms` if needed according to the comments;
### Why are the changes needed?
avoid double caching
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#29501 from zhengruifeng/kmeans_handlePersistence.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The strict requirement for the vocabulary to remain non-empty has been removed in this pull request.
Link to the discussion: http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html
### Why are the changes needed?
This smooths out corner cases. Without this change, the user has to manipulate the data in what may be a perfectly valid use case.
Question: should we log a warning when an empty vocabulary is found instead?
### Does this PR introduce _any_ user-facing change?
Possibly a slight change: if someone has put a try-catch in place to detect an empty vocab, that behavior would no longer hold.
### How was this patch tested?
1. Added testcase to `fit` generating an empty vocabulary
2. Added testcase to `transform` with empty vocabulary
Request to review: srowen hhbyyh
Closes#29482 from purijatin/spark_32662.
Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
set params default values in trait Params for feature and tuning in both Scala and Python.
### Why are the changes needed?
Give ML the same default param values between an estimator and its corresponding transformer, and between Scala and Python.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing and modified tests
Closes#29153 from huaxingao/default2.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
for binary `LogisticRegressionModel`:
1, keep variables `_threshold` and `_rawThreshold` instead of computing them on each instance;
2, in `raw2probabilityInPlace`, make use of the characteristic that the sum of probability is 1.0;
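The two optimizations can be sketched in plain Python (names and values are illustrative, not Spark's internals): caching the raw-margin threshold avoids a sigmoid per row, and the second class probability is just the complement of the first.

```python
import math

# cache log(t / (1 - t)) once instead of applying the sigmoid per instance;
# deciding on the raw margin is equivalent to deciding on the probability
threshold = 0.6
raw_threshold = math.log(threshold / (1.0 - threshold))

def predict(raw_margin):
    return 1 if raw_margin > raw_threshold else 0

for raw in (-2.0, 0.0, 0.41, 2.0):
    p1 = 1.0 / (1.0 + math.exp(-raw))  # probability of class 1
    p0 = 1.0 - p1                      # probabilities sum to 1.0, no second sigmoid needed
    assert (p1 > threshold) == (predict(raw) == 1)
    assert abs(p0 + p1 - 1.0) < 1e-12
```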
### Why are the changes needed?
for better performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites and a performance test in the REPL
Closes#29255 from zhengruifeng/pred_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Add training summary to MultilayerPerceptronClassificationModel...
### Why are the changes needed?
so that user can get the training process status, such as loss value of each iteration and total iteration number.
### Does this PR introduce _any_ user-facing change?
Yes
MultilayerPerceptronClassificationModel.summary
MultilayerPerceptronClassificationModel.evaluate
### How was this patch tested?
new tests
Closes#29250 from huaxingao/mlp_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Log param `thresholds` in DT/GBT/FM/LR/MLP
### Why are the changes needed?
The param `thresholds` is logged in NB/RF, but not in other ProbabilisticClassifiers.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#29257 from zhengruifeng/instr.logParams_add_thresholds.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Updates to scalatest 3.2.0. Though it looks large, it is 99% changes to the new location of scalatest classes.
### Why are the changes needed?
3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.
### Does this PR introduce _any_ user-facing change?
No, only affects tests.
### How was this patch tested?
Existing tests.
Closes#29196 from srowen/SPARK-32398.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use a while-loop instead of recursion.
### Why are the changes needed?
3% ~ 10% faster
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#29095 from zhengruifeng/tree_pred_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
set params default values in trait ...Params in both Scala and Python.
I will do this in two PRs. I will change classification, regression, clustering and fpm in this PR. Will change the rest in another PR.
### Why are the changes needed?
Give ML the same default param values between an estimator and its corresponding transformer, and between Scala and Python.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#29112 from huaxingao/set_default.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Small improvement in the error message shown to user https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L537-L538
### Why are the changes needed?
Currently an exception is thrown, but without any specific details on the cause:
```
Caused by: java.lang.IllegalArgumentException: requirement failed
	at scala.Predef$.require(Predef.scala:212)
	at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:508)
	at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure$.fastSquaredDistance(DistanceMeasure.scala:232)
	at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure.isCenterConverged(DistanceMeasure.scala:190)
	at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:336)
	at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:334)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
	at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:334)
	at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:251)
	at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:233)
```
### Does this PR introduce _any_ user-facing change?
Yes, this PR adds an explanatory message to be shown to the user when the requirement check is not met
### How was this patch tested?
manually
Closes#29115 from dzlab/patch/SPARK-32315.
Authored-by: dzlab <dzlabs@outlook.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Same as https://github.com/apache/spark/pull/29078 and https://github.com/apache/spark/pull/28971. This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn` etc) compile under Scala 2.13. As a result, it does not close the JIRA; this also of course does not demonstrate that tests pass yet in 2.13.
Note, this does not fix the `repl` module; that's separate.
### Why are the changes needed?
Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)
Closes#29111 from srowen/SPARK-29292.3.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add training summary for FMClassificationModel...
### Why are the changes needed?
so that user can get the training process status, such as loss value of each iteration and total iteration number.
### Does this PR introduce _any_ user-facing change?
Yes
FMClassificationModel.summary
FMClassificationModel.evaluate
### How was this patch tested?
new tests
Closes#28960 from huaxingao/fm_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
current problems:
```
mlp = MultilayerPerceptronClassifier(layers=[2, 2, 2], seed=123)
model = mlp.fit(df)
path = tempfile.mkdtemp()
model_path = path + "/mlp"
model.save(model_path)
model2 = MultilayerPerceptronClassificationModel.load(model_path)
self.assertEqual(model2.getSolver(), "l-bfgs") # this fails because model2.getSolver() returns 'auto'
model2.transform(df)
# this fails with Exception pyspark.sql.utils.IllegalArgumentException: MultilayerPerceptronClassifier_dec859ed24ec parameter solver given invalid value auto.
```
FMClassifier/Regression and GeneralizedLinearRegression have the same problems.
Here are the root cause of the problems:
1. In HasSolver, both Scala and Python default solver to 'auto'
2. On Scala side, mlp overrides the default of solver to 'l-bfgs', FMClassifier/Regression overrides the default of solver to 'adamW', and glr overrides the default of solver to 'irls'
3. On Scala side, mlp overrides the default of solver in MultilayerPerceptronClassificationParams, so both MultilayerPerceptronClassification and MultilayerPerceptronClassificationModel have 'l-bfgs' as default
4. On Python side, mlp overrides the default of solver in MultilayerPerceptronClassification, so it has default as 'l-bfgs', but MultilayerPerceptronClassificationModel doesn't override the default so it gets the default from HasSolver which is 'auto'. In theory, we don't care about the solver value or any other params values for MultilayerPerceptronClassificationModel, because we have the fitted model already. That's why on Python side, we never set default values for any of the XXXModel.
5. when calling getSolver on the loaded mlp model, it calls this line of code underneath:
```
def _transfer_params_from_java(self):
"""
Transforms the embedded params from the companion Java object.
"""
......
# SPARK-14931: Only check set params back to avoid default params mismatch.
if self._java_obj.isSet(java_param):
value = _java2py(sc, self._java_obj.getOrDefault(java_param))
self._set(**{param.name: value})
......
```
that's why model2.getSolver() returns 'auto'. The code doesn't get the default Scala value (in this case 'l-bfgs') to set to Python param, so it takes the default value (in this case 'auto') on Python side.
6. when calling model2.transform(df), it calls this underneath:
```
def _transfer_params_to_java(self):
"""
Transforms the embedded params to the companion Java object.
"""
......
if self.hasDefault(param):
pair = self._make_java_param_pair(param, self._defaultParamMap[param])
pair_defaults.append(pair)
......
```
Again, it gets the Python default solver which is 'auto', and this caused the Exception
7. Currently, on Scala side, for some of the algorithms, we set default values in the XXXParam, so both estimator and transformer get the default value. However, for some of the algorithms, we only set default in estimators, and the XXXModel doesn't get the default value. On Python side, we never set defaults for the XXXModel. This causes the default value inconsistency.
8. My proposed solution: set default params in XXXParam for both Scala and Python, so both the estimator and transformer have the same default value for both Scala and Python. I currently only changed solver in this PR. If everyone is OK with the fix, I will change all the other params as well.
I hope my explanation makes sense to your folks :)
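The mismatch described above can be sketched in plain Python (dictionaries stand in for ParamMaps; all names are hypothetical, not PySpark internals):

```python
# Java side: solver default overridden to 'l-bfgs', never explicitly set by the user
java_set_params = {}
java_default_params = {"solver": "l-bfgs"}
# Python model never overrides HasSolver's default
py_default_params = {"solver": "auto"}

# _transfer_params_from_java only copies explicitly-set params back:
py_set_params = {k: v for k, v in java_set_params.items()}
solver = py_set_params.get("solver", py_default_params["solver"])
assert solver == "auto"  # bug: the Scala side would report 'l-bfgs'

# proposed fix: declare the default once in the shared ...Params mixin on both
# sides, so estimator and model agree across Scala and Python
py_default_params["solver"] = "l-bfgs"
assert py_set_params.get("solver", py_default_params["solver"]) == "l-bfgs"
```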
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing and new tests
Closes#29060 from huaxingao/solver_parity.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, GeneralizedLinearRegressionSummary computes several statistics in a single pass
2, LinearRegressionSummary uses `metrics.count`
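The single-pass idea can be sketched in plain Python (values are made up): accumulate all sufficient statistics in one scan instead of one scan per statistic.

```python
data = [1.0, 2.0, 3.0, 4.0]
count, total, total_sq = 0, 0.0, 0.0
for x in data:  # one pass over the dataset accumulates everything at once
    count += 1
    total += x
    total_sq += x * x
mean = total / count
variance = total_sq / count - mean * mean  # population variance from the moments
assert (count, mean, variance) == (4, 2.5, 1.25)
```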
### Why are the changes needed?
avoid extra passes on the dataset
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#28990 from zhengruifeng/glr_summary_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Add summary to RandomForestClassificationModel...
### Why are the changes needed?
so user can get a summary of this classification model, and retrieve common metrics such as accuracy, weightedTruePositiveRate, roc (for binary), pr curves (for binary), etc.
### Does this PR introduce _any_ user-facing change?
Yes
```
RandomForestClassificationModel.summary
RandomForestClassificationModel.evaluate
```
### How was this patch tested?
Add new tests
Closes#28913 from huaxingao/rf_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add training summary for LinearSVCModel......
### Why are the changes needed?
so that user can get the training process status, such as loss value of each iteration and total iteration number.
### Does this PR introduce _any_ user-facing change?
Yes
```LinearSVCModel.summary```
```LinearSVCModel.evaluate```
### How was this patch tested?
new tests
Closes#28884 from huaxingao/svc_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Adding support to Association Rules in Spark ml.fpm.
### Why are the changes needed?
Support is an indication of how frequently the itemset of an association rule appears in the database and suggests whether the rules are generally applicable to the dataset. Refer to [wiki](https://en.wikipedia.org/wiki/Association_rule_learning#Support) for more details.
### Does this PR introduce _any_ user-facing change?
Yes. Association Rules now have a support measure
### How was this patch tested?
existing and new unit test
Closes#28903 from huaxingao/fpm.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add a generic ClassificationSummary trait
### Why are the changes needed?
Add a generic ClassificationSummary trait so all the classification models can use it to implement summary.
Currently in classification, we only have summary implemented in ```LogisticRegression```. There are requests to implement summary for ```LinearSVCModel``` in https://issues.apache.org/jira/browse/SPARK-20249 and to implement summary for ```RandomForestClassificationModel``` in https://issues.apache.org/jira/browse/SPARK-23631. If we add a generic ClassificationSummary trait and put all the common code there, we can easily add summary to ```LinearSVCModel``` and ```RandomForestClassificationModel```, and also add summary to all the other classification models.
We can use the same approach to add a generic RegressionSummary trait to regression package and implement summary for all the regression models.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
existing tests
Closes#28710 from huaxingao/summary_trait.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch adds user-specified fold column support to `CrossValidator`. User can assign fold numbers to dataset instead of letting Spark do random splits.
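The fold-column idea can be sketched in pure Python (the `fold` values below stand in for the new `foldCol`; data is made up):

```python
# each row carries a user-assigned fold index instead of being randomly split
rows = [("a", 0), ("b", 1), ("c", 0), ("d", 2), ("e", 1), ("f", 2)]
num_folds = 3

splits = []
for k in range(num_folds):
    training = [r for r, fold in rows if fold != k]    # everything outside fold k
    validation = [r for r, fold in rows if fold == k]  # fold k held out
    splits.append((training, validation))

# every row lands in exactly one validation fold
assert sorted(sum((v for _, v in splits), [])) == sorted(r for r, _ in rows)
```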
### Why are the changes needed?
This gives `CrossValidator` users more flexibility in splitting folds.
### Does this PR introduce _any_ user-facing change?
Yes, a new `foldCol` param is added to `CrossValidator`. User can use it to specify custom fold splitting.
### How was this patch tested?
Added unit tests.
Closes#28704 from viirya/SPARK-31777.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
In LogisticRegression and LinearRegression, if maxIter is set to n, `model.summary.totalIterations` returns n+1 when training runs for the full n iterations. This is because we use `objectiveHistory.length` as totalIterations, but `objectiveHistory` contains the initial state, so `objectiveHistory.length` is 1 larger than the number of training iterations.
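A toy illustration of the off-by-one (loss values are made up):

```python
max_iter = 5
# objectiveHistory records the initial objective plus one entry per iteration
objective_history = [10.0] + [10.0 / (i + 2) for i in range(max_iter)]
assert len(objective_history) == max_iter + 1  # using the length directly over-counts
total_iterations = len(objective_history) - 1  # the fix: exclude the initial state
assert total_iterations == max_iter
```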
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new tests and also modify existing tests
Closes#28786 from huaxingao/summary_iter.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add instance weight support in LinearRegressionSummary
### Why are the changes needed?
LinearRegression and RegressionMetrics support instance weight. We should support instance weight in LinearRegressionSummary too.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new test
Closes#28772 from huaxingao/lir_weight_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add instance weight support in LogisticRegressionSummary
### Why are the changes needed?
LogisticRegression, MulticlassClassificationEvaluator and BinaryClassificationEvaluator support instance weight. We should support instance weight in LogisticRegressionSummary too.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add new tests
Closes#28657 from huaxingao/weighted_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the algorithms that support instance weight, add checks to make sure instance weight is not negative.
### Why are the changes needed?
instance weight has to be >= 0.0
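A minimal sketch of such a check (the helper name is hypothetical, not Spark's API):

```python
def require_non_negative_weight(w):
    # `not w >= 0.0` also rejects NaN, unlike `w < 0.0`
    if not w >= 0.0:
        raise ValueError(f"instance weight must be >= 0.0 but got {w}")
    return w

assert require_non_negative_weight(1.5) == 1.5
try:
    require_non_negative_weight(-1.0)
    rejected = False
except ValueError:
    rejected = True
assert rejected
```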
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested
Closes#28621 from huaxingao/weight_check.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add weight support in ClusteringEvaluator
### Why are the changes needed?
Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.
### Does this PR introduce _any_ user-facing change?
Yes.
ClusteringEvaluator.setWeightCol
### How was this patch tested?
add new unit test
Closes#28553 from huaxingao/weight_evaluator.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add getMetrics in Evaluators to get the corresponding Metrics instance, so users can use it to get any of the metrics scores. For example:
```
val trainer = new LinearRegression
val model = trainer.fit(dataset)
val predictions = model.transform(dataset)
val evaluator = new RegressionEvaluator()
val metrics = evaluator.getMetrics(predictions)
val rmse = metrics.rootMeanSquaredError
val r2 = metrics.r2
val mae = metrics.meanAbsoluteError
val variance = metrics.explainedVariance
```
### Why are the changes needed?
Currently, Evaluator.evaluate only access to one metrics, but most users may need to get multiple metrics. This PR adds getMetrics in all the Evaluators, so users can use it to get an instance of the corresponding Metrics to get any of the metrics they want.
### Does this PR introduce _any_ user-facing change?
Yes. Add getMetrics in Evaluators.
For example:
```
/**
 * Get a RegressionMetrics, which can be used to get any of the regression
 * metrics such as rootMeanSquaredError, meanSquaredError, etc.
 *
 * @param dataset a dataset that contains labels/observations and predictions.
 * @return RegressionMetrics
 */
@Since("3.1.0")
def getMetrics(dataset: Dataset[_]): RegressionMetrics
```
### How was this patch tested?
Add new unit tests
Closes#28590 from huaxingao/getMetrics.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0
```
for (i <- 0 until splits.length) {
  if (splits(i) == -0.0) {
    splits(i) = 0.0  // normalize -0.0 to 0.0 before calling distinct
  }
}
```
### Why are the changes needed?
Fix bug.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
#### Manually test:
~~~scala
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
import spark.implicits._
val df1 = sc.parallelize(a1, 2).toDF("id")
import org.apache.spark.ml.feature.QuantileDiscretizer
val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0)
val model = qd.fit(df1) // will raise error in spark master.
~~~
### Explain
In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code.
`Array.distinct` relies on `elem.hashCode`, which leads to this error.
Test code on distinct
```
import scala.util.Random
val rng = new Random(3)
val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0)
a1.distinct.sorted.foreach(x => print(x.toString + "\n"))
```
Then you will see output like:
```
...
-0.009292684662246975
-0.0033280686465135823
-0.0
0.0
0.0022219556032221366
0.02217419561977274
...
```
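The hashCode mismatch is Scala-specific (CPython hashes 0.0 and -0.0 identically), but the normalization step the fix applies can still be shown in plain Python:

```python
import math

splits = [-0.5, -0.0, 0.0, 0.5]
# normalize every signed zero to +0.0 before deduplicating, as the fix does
normalized = [0.0 if s == 0.0 else s for s in splits]
# no -0.0 survives the normalization
assert all(math.copysign(1.0, s) == 1.0 for s in normalized if s == 0.0)
assert sorted(set(normalized)) == [-0.5, 0.0, 0.5]
```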
Closes#28498 from WeichenXu123/SPARK-31676.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Expose hashFunc property in HashingTF
Some third-party libraries such as mleap need to access it.
See background description here:
https://github.com/combust/mleap/pull/665#issuecomment-621258623
### Why are the changes needed?
See https://github.com/combust/mleap/pull/665#issuecomment-621258623
### Does this PR introduce any user-facing change?
No. Only add a package private constructor.
### How was this patch tested?
N/A
Closes#28413 from WeichenXu123/hashing_tf_expose_hashfunc.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, if blockSize==1, keep original behavior, code path trainOnRows;
3, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path trainOnBlocks
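The blockification idea can be sketched in pure Python (`matvec` is illustrative; the real code hands each stacked block to Level-2 BLAS routines):

```python
# stack blockSize instance vectors into a matrix, so one matrix-vector product
# replaces blockSize separate dot products
block_size = 4
vectors = [[float(i + j) for j in range(3)] for i in range(10)]
blocks = [vectors[i:i + block_size] for i in range(0, len(vectors), block_size)]

def matvec(block, w):  # stands in for a BLAS gemv call on the block
    return [sum(x * c for x, c in zip(row, w)) for row in block]

coef = [1.0, 2.0, 3.0]
margins = [m for b in blocks for m in matvec(b, coef)]
# blockified result matches the per-vector dot products
assert margins == [sum(x * c for x, c in zip(v, coef)) for v in vectors]
assert len(blocks) == 3 and len(blocks[-1]) == 2  # last block holds the remainder
```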
### Why are the changes needed?
performance gain on dense dataset HIGGS:
1, save about 45% RAM;
2, 3X faster with openBLAS
### Does this PR introduce any user-facing change?
add a new expert param `blockSize`
### How was this patch tested?
added testsuites
Closes#27473 from zhengruifeng/blockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message accompanying this exception is incorrect. This PR changes the message so it reports the right columns.
### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.
```
// create a df without vector size
val df = Seq(
(Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")
// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
.setInputCol("n1")
.setSize(1)
.transform(df)
// assemble n1, n2
val output = new VectorAssembler()
.setInputCols(Array("n1", "n2"))
.setOutputCol("features")
.setHandleInvalid("keep")
.transform(hintedDf)
// because only n1 has vector size, the error message should tell us to set vector size for n2 too
output.show()
```
Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```
Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```
This introduces difficulties when I try to resolve this exception, since I do not know which column requires a VectorSizeHint. This is especially troublesome when you have a large number of columns to deal with.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.
Closes#28487 from fan31415/SPARK-31671.
Lead-authored-by: fan31415 <fan12356789@gmail.com>
Co-authored-by: yijiefan <fanyije@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;
### Why are the changes needed?
it will obtain performance gain on dense datasets, such as epsilon
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (~10X speedup)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes#28473 from zhengruifeng/blockify_aft.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ANOVASelector and FValueSelector to PySpark
### Why are the changes needed?
ANOVASelector and FValueSelector have been implemented in Scala. We need to implement these in Python as well.
### Does this PR introduce _any_ user-facing change?
Yes. Add Python version of ANOVASelector and FValueSelector
### How was this patch tested?
new doctest
Closes#28464 from huaxingao/selector_py.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add new param blockSize;
2, add a new class InstanceBlock;
3, if blockSize==1, keep original behavior; if blockSize>1, stack input vectors to blocks (like ALS/MLP);
4, if blockSize>1, standardize the input outside of optimization procedure;
### Why are the changes needed?
it will obtain performance gain on dense datasets, such as `epsilon`
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (up to 6X(squaredError)~12X(huber) speedup)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes#28471 from zhengruifeng/blockify_lir_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`
### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)
### Does this PR introduce _any_ user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes#28458 from zhengruifeng/blockify_lor_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement abstract Selector. Put the common code among ```ANOVASelector```, ```ChiSqSelector```, ```FValueSelector``` and ```VarianceThresholdSelector``` to Selector.
### Why are the changes needed?
code reuse
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#27978 from huaxingao/spark-31127.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, add new param `blockSize`;
2, add a new class InstanceBlock;
3, **if `blockSize==1`, keep original behavior; if `blockSize>1`, stack input vectors to blocks (like ALS/MLP);**
4, if `blockSize>1`, standardize the input outside of optimization procedure;
### Why are the changes needed?
1, reduce the RAM needed to persist the training dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster on dataset `epsilon`)
### Does this PR introduce any user-facing change?
Yes, a new param is added
### How was this patch tested?
existing and added testsuites
Closes#28349 from zhengruifeng/blockify_svc_II.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, make AFT reuse common functions in `ml.optim`, rather than making its own impl.
### Why are the changes needed?
The logic for optimizing AFT is quite similar to that of other algorithms based on `RDDLossFunction`. We should reuse the common functions.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#28404 from zhengruifeng/mv_aft_optim.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add class info output in org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinguish Vectors in ml and mllib
2. Add a unit test
### Why are the changes needed?
The catalogString doesn't distinguish Vectors in ml and mllib when an mllib vector is misused in ml
https://issues.apache.org/jira/browse/SPARK-31400
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test is added
Closes#28347 from TJX2014/master-catalogString-distinguish-Vectors-in-ml-and-mllib.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
apply Lemma 1 in [Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf):
> Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then d(x,c) >= d(x,b);
It can be directly applied in EuclideanDistance, but not in CosineDistance.
However, for CosineDistance we can luckily get a variant in the space of radian/angle.
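Lemma 1 can be demonstrated with a small self-contained sketch in the Euclidean case (plain Python, illustrative only): whenever d(b, c) >= 2 d(x, b), the distance d(x, c) is guaranteed to be at least d(x, b), so it never needs to be computed to rule out center c.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def can_skip(x, b, c):
    """Lemma 1: if d(b, c) >= 2 * d(x, b), center c cannot beat center b."""
    return dist(b, c) >= 2 * dist(x, b)

x, b, c = (0.0, 0.0), (1.0, 0.0), (5.0, 0.0)
skip = can_skip(x, b, c)          # d(b,c) = 4 >= 2 * d(x,b) = 2 -> True
consequence = dist(x, c) >= dist(x, b)  # the guaranteed conclusion
```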
### Why are the changes needed?
It helps improve the performance of prediction and (mostly) training.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27758 from zhengruifeng/km_triangle.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
add a new method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`
### Why are the changes needed?
Similar to the new `test` method in `ChiSquareTest`, it will:
1, support df operation on the returned df;
2, make the driver no longer a bottleneck with a large `numFeatures`
### Does this PR introduce any user-facing change?
Yes, new method added
### How was this patch tested?
existing testsuites
Closes#28270 from zhengruifeng/flatten_anova.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a new method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame`
### Why are the changes needed?
Similar to the new `test` method in `ChiSquareTest`, it will:
1, support df operation on the returned df;
2, make the driver no longer a bottleneck with a large `numFeatures`
### Does this PR introduce any user-facing change?
Yes, add a new method
### How was this patch tested?
existing testsuites
Closes#28268 from zhengruifeng/flatten_fvalue.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
re-implement `keyDistance`:
if both vectors are dense, the new implementation is 9.09x faster;
if both vectors are sparse, the new implementation is 5.66x faster;
if one is dense and the other is sparse, the new implementation is 7.8x faster;
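A minimal sketch of the idea for the sparse/sparse case (illustrative, not Spark's actual `keyDistance`): compute Jaccard distance by a two-pointer merge over the sorted non-zero index arrays, instead of allocating sets and taking intersections.

```python
def jaccard_distance(idx_a, idx_b):
    """Jaccard distance over sorted non-zero index arrays.

    A two-pointer merge avoids the set allocations of a
    set-intersection based implementation."""
    i = j = inter = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            inter += 1; i += 1; j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    union = len(idx_a) + len(idx_b) - inter
    return 1.0 - inter / union

d = jaccard_distance([0, 2, 4, 6], [2, 3, 4])
# intersection = {2, 4} -> 2, union = 5, distance = 1 - 2/5 = 0.6
```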
### Why are the changes needed?
the current implementation, based on set operations, is inefficient
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#28206 from zhengruifeng/minhash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects (`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread-safe; however, they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
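The pattern can be sketched in a few lines (names are illustrative, not Spark's API): the shared object stays stateless, and each caller creates its own function object, which owns the mutable buffer that previously made the shared object unsafe.

```python
# Hedged sketch of the stateless-owner / per-caller-function-object pattern.
class Codec:
    """Stateless, and therefore safe to share across threads."""
    def serializer(self):
        return _Serializer()  # one mutable instance per caller/thread

class _Serializer:
    """Holds mutable state; documented as NOT thread-safe."""
    def __init__(self):
        self._buffer = []
    def to_row(self, value):
        self._buffer.clear()     # reuses its OWN buffer, not a shared one
        self._buffer.append(value)
        return tuple(self._buffer)

codec = Codec()
row = codec.serializer().to_row(42)
```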
### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#28223 from hvanhovell/SPARK-31450.
Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1, remove the newly added method `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]`, because: 1) it is only used in `ChiSqSelector`; 2) since the returned dataframe of `def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame` only contains one row, the results after collecting it back to the driver are similar;
2, add method `def test(dataset: DataFrame, featuresCol: String, labelCol: String, flatten: Boolean): DataFrame` to return the flatten results;
### Why are the changes needed?
1, when getting the returned result dataframe, we may want to filter it like `pValue < 0.1`, but the current returned dataframe is hard to use;
2, the current implementation needs to collect all test results of all columns back to the driver, which is a bottleneck; if we return the flattened dataframe, we no longer need to collect them to the driver;
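The flattening idea can be illustrated without Spark (plain Python, field names are illustrative): instead of a single row holding arrays of per-column results, emit one row per feature so the results can be filtered like any other dataframe.

```python
# Illustrative only: turning one row of per-column result arrays into one
# row per feature, so results can be filtered (e.g. pValue < 0.1) without
# first collecting everything back to the driver.
def flatten(p_values, statistics):
    return [
        {"featureIndex": i, "pValue": p, "statistic": s}
        for i, (p, s) in enumerate(zip(p_values, statistics))
    ]

rows = flatten([0.02, 0.5, 0.07], [10.1, 0.3, 4.2])
selected = [r["featureIndex"] for r in rows if r["pValue"] < 0.1]
# selected == [0, 2]
```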
### Does this PR introduce any user-facing change?
Yes:
1, `def testChiSquare(dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult]` removed;
2, the returned dataframe needs an action to trigger computation;
### How was this patch tested?
updated testsuites
Closes#28176 from zhengruifeng/flatten_chisq.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `FMRegressor`:
- Supporting `org.apache.spark.ml.r.FMRegressorWrapper`.
- `FMRegressionModel` S4 class.
- Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes#27571 from zero323/SPARK-30819.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `LinearRegression`
- Supporting `org.apache.spark.ml.r.LinearRegressionWrapper`.
- `LinearRegressionModel` S4 class.
- Corresponding `spark.lm`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes#27593 from zero323/SPARK-30818.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add `cleanShuffleDependencies` as an experimental developer feature to allow folks to clean up shuffle files more aggressively than we currently do.
### Why are the changes needed?
Dynamic scaling on Kubernetes (introduced in Spark 3) depends on only shutting down executors without shuffle files. However Spark does not aggressively clean up shuffle files (see SPARK-5836) and instead depends on JVM GC on the driver to trigger deletes. We already have a mechanism to explicitly clean up shuffle files from the ALS algorithm where we create a lot of quickly orphaned shuffle files. We should expose this as an advanced developer feature to enable people to better clean-up shuffle files improving dynamic scaling of their jobs on Kubernetes.
### Does this PR introduce any user-facing change?
This adds a new experimental API.
### How was this patch tested?
ALS already used a mechanism like this; this PR re-targets the ALS code to the new interface and is tested with the existing ALS tests.
Closes#28038 from holdenk/SPARK-31208-allow-users-to-cleanup-shuffle-files.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pull request adds SparkR wrapper for `FMClassifier`:
- Supporting `org.apache.spark.ml.r.FMClassifierWrapper`.
- `FMClassificationModel` S4 class.
- Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics.
- Corresponding docs and tests.
### Why are the changes needed?
Feature parity.
### Does this PR introduce any user-facing change?
No (new API).
### How was this patch tested?
New unit tests.
Closes#27570 from zero323/SPARK-30820.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
when the input dataset is sparse, make `ANOVATest` only process non-zero values
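Why skipping the zeros is safe can be seen with a tiny sketch (plain Python, not Spark's sparse vector type): for statistics such as sums and means, the implicit zeros contribute nothing to the numerator, so only the stored entries need to be visited.

```python
# Hedged illustration: with a sparse representation (indices, values, size),
# a statistic like the mean only needs the stored non-zero entries.
def sparse_mean(indices, values, size):
    return sum(values) / size  # implicit zeros add 0 to the numerator

dense = [0.0, 3.0, 0.0, 0.0, 2.0]
m = sparse_mean([1, 4], [3.0, 2.0], 5)
# same result as the dense mean, touching 2 stored values instead of 5
```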
### Why are the changes needed?
for performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27982 from zhengruifeng/anova_sparse.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add a common method `computeChiSq` and reuse it in both `chiSquaredDenseFeatures` and `chiSquaredSparseFeatures`
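A hedged sketch of what such a shared helper could look like (the name mirrors the PR's `computeChiSq`; the signature is illustrative): both the dense and sparse feature paths build (observed, expected) counts, then delegate the statistic itself to one function.

```python
def compute_chi_sq(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

stat = compute_chi_sq([10, 20, 30], [20, 20, 20])
# (10-20)^2/20 + (20-20)^2/20 + (30-20)^2/20 = 5 + 0 + 5 = 10
```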
### Why are the changes needed?
to simplify ChiSq
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#28045 from zhengruifeng/simplify_chisq.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-31223
set the seed in `np.random` when generating test data
### Why are the changes needed?
so the same set of test data can be regenerated later.
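The fix in spirit, sketched with the stdlib generator rather than `np.random` (the seeding behavior is analogous; this is only meant to illustrate reproducibility):

```python
import random

# A fixed seed makes the "random" test data reproducible: regenerating
# with the same seed yields an identical sequence.
random.seed(42)
first = [random.random() for _ in range(3)]
random.seed(42)
second = [random.random() for _ in range(3)]
```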
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#27994 from huaxingao/spark-31223.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, remove unused var `numFeatures`;
2, remove the computation of `numSamples` and `numClasses`, since they can be directly inferred from `counts: OpenHashMap[Double, Long]`
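The inference is immediate once the per-label counts exist, as a plain-Python stand-in for the `counts` map shows (no extra pass over the data is needed):

```python
# Illustrative: both quantities fall straight out of the label->count map
# (the PR's `counts: OpenHashMap[Double, Long]`), with no separate job.
counts = {0.0: 40, 1.0: 35, 2.0: 25}   # label -> number of samples

num_classes = len(counts)              # one entry per distinct label
num_samples = sum(counts.values())     # total over all labels
```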
### Why are the changes needed?
remove an unnecessary job to compute `numSamples` and `numClasses`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27979 from zhengruifeng/anova_followup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
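A minimal sketch of the selection rule (pure Python, not the Spark API): keep column j only if Var(column j) > threshold; the default threshold of 0.0 drops exactly the constant columns.

```python
def variance(xs):
    """Population variance of a column."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def selected_indices(rows, threshold=0.0):
    """Indices of features whose variance exceeds the threshold."""
    cols = list(zip(*rows))
    return [j for j, col in enumerate(cols) if variance(col) > threshold]

data = [(1.0, 7.0, 0.0),
        (2.0, 7.0, 0.0),
        (3.0, 7.0, 0.0)]
kept = selected_indices(data)   # columns 1 and 2 are constant, so only 0 stays
```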
### Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power.
scikit-learn has implemented this selector.
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Add new test suite.
Closes#27954 from huaxingao/variance-threshold.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change part of the level-1 BLAS routines (axpy, dot, scal(double, denseVector)) from the Java implementation to NativeBLAS when the vector size > 256
### Why are the changes needed?
In the current ML BLAS.scala, all level-1 routines are fixed to use the Java implementation. But NativeBLAS (Intel MKL, OpenBLAS) can bring up to 11x performance improvement, based on performance tests that apply direct calls against these methods. We should provide a way to let users take advantage of NativeBLAS for level-1 routines. Here we do it by switching these methods from f2jBLAS to NativeBLAS.
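The dispatch rule can be sketched in a few lines (constant and names are illustrative; the real code calls into native BLAS via JNI, which is why small vectors stay on the JVM path where call overhead would dominate):

```python
NATIVE_L1_THRESHOLD = 256  # illustrative stand-in for nativeL1Threshold

def dot(x, y, native_available=True):
    """Returns (result, path) so the dispatch decision is visible.

    Both branches compute the same value here; in the real code the
    "native" branch would call into MKL/OpenBLAS."""
    use_native = native_available and len(x) > NATIVE_L1_THRESHOLD
    result = sum(a * b for a, b in zip(x, y))
    return result, ("native" if use_native else "f2j")

r1, p1 = dot([1.0] * 512, [2.0] * 512)                          # big -> native
r2, p2 = dot([1.0, 2.0], [3.0, 4.0])                            # small -> f2j
r3, p3 = dot([1.0] * 512, [2.0] * 512, native_available=False)  # fallback
```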
### Does this PR introduce any user-facing change?
Yes, the methods axpy, dot, scal in the level-1 routines will switch to NativeBLAS when the vector has more than nativeL1Threshold (fixed value 256) elements, and will fall back to f2jBLAS if native BLAS is not properly configured in the system.
### How was this patch tested?
Perf test direct calls level-1 routines
Closes#27546 from yma11/SPARK-30773.
Lead-authored-by: yan ma <yan.ma@intel.com>
Co-authored-by: Ma Yan <yan.ma@intel.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add ANOVA Selector
### Why are the changes needed?
Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous features.
https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels.
This PR adds ANOVASelector for continuous features and categorical labels.
### Does this PR introduce any user-facing change?
Yes, add a new Selector.
### How was this patch tested?
add new test suites
Closes#27895 from huaxingao/anova.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
This PR solves the same issue as [pr27919](https://github.com/apache/spark/pull/27919), but changes the file names based on a comment from the previous PR.
### What changes were proposed in this pull request?
Make some of file names the same as class name in R package.
### Why are the changes needed?
Make the file names consistent
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run `./R/run-tests.sh`
Closes#27940 from kevinyu98/spark-30954-r-v2.
Authored-by: Qianyang Yu <qyu@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
remove unused variables;
### Why are the changes needed?
remove unused variables;
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27922 from zhengruifeng/test_cleanup.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930
Remove ML/MLLIB DeveloperApi annotations.
### Why are the changes needed?
The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.
### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.
### How was this patch tested?
existing tests
Closes#27859 from huaxingao/spark-30930.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, compute summary and update distributions in one pass;
2, remove logic related to check `shouldDistributeGaussians`
### Why are the changes needed?
In the current implementation, GMM needs to trigger two jobs per iteration:
1, one to compute summary;
2, if `shouldDistributeGaussians = ((k - 1.0) / k) * numFeatures > 25.0`, trigger another to update distributions;
`shouldDistributeGaussians` is almost true in practice, since numFeatures is likely to be greater than 25.
We can implement the above computation in only one job, by following the logic in `KMeans`: using `reduceByKey` to compute statistics for each center
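The one-pass idea can be sketched with plain Python standing in for `reduceByKey` (structure and names are illustrative, not Spark's GMM internals): accumulate per-cluster (weight, sum) statistics in a single sweep, rather than one job for summaries and a second to update the Gaussians.

```python
def aggregate_by_cluster(assignments):
    """One pass over (cluster, weight, value) triples; a dict plays the
    role of reduceByKey's per-key accumulator."""
    stats = {}  # cluster -> [total_weight, total_weighted_sum]
    for cluster, weight, value in assignments:
        acc = stats.setdefault(cluster, [0.0, 0.0])
        acc[0] += weight
        acc[1] += weight * value
    return {k: (w, s / w) for k, (w, s) in stats.items()}  # (weight, mean)

stats = aggregate_by_cluster([(0, 1.0, 2.0), (0, 1.0, 4.0), (1, 2.0, 3.0)])
# cluster 0: weight 2.0, mean 3.0; cluster 1: weight 2.0, mean 3.0
```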
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27784 from zhengruifeng/gmm_avoid_distri_gaussian.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`ChiSqSelector` depends on `mllib.ChiSqSelectorModel` for the selection logic. This PR removes that dependency.
### Why are the changes needed?
This PR is an intermediate PR. It removes the `ChiSqSelector` dependency on `mllib.ChiSqSelectorModel`. The next subtask will extract the common code between `ChiSqSelector` and `FValueSelector` and put it in an abstract `Selector`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New and existing tests
Closes#27841 from huaxingao/chisq.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Audit the new ML Scala APIs introduced in 3.0 and fix the issues found.
### Why are the changes needed?
### Does this PR introduce any user-facing change?
Yes. Some doc changes
### How was this patch tested?
Existing tests
Closes#27818 from huaxingao/spark-30929.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>