### What changes were proposed in this pull request?
This test fails flakily. I found it failing in 1 out of 80 runs.
```
Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance.
```
Increasing the relative tolerance to 0.2 should reduce the flakiness.
```
0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)|
```
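The arithmetic above can be checked with a small sketch. It assumes the relative-tolerance comparison scales the tolerance by the smaller of the two magnitudes, which is consistent with the numbers in the error message. The helper name `within_rel_tol` is hypothetical and is not part of the test suite.

```python
def within_rel_tol(x: float, y: float, eps: float) -> bool:
    """Return True when |x - y| is below eps times the smaller magnitude.

    Assumption: the tolerance is scaled by min(|x|, |y|). This helper is
    illustrative only and is not the test suite's own code.
    """
    if x == y:
        return True
    return abs(x - y) < eps * min(abs(x), abs(y))


expected = -0.35667494393873245
actual = -0.41914521201224453

diff = abs(expected - actual)  # about 0.062

# 0.15 * 0.3567 is about 0.054, which is smaller than the difference.
print(within_rel_tol(expected, actual, 0.15))
# 0.2 * 0.3567 is about 0.071, which is larger than the difference.
print(within_rel_tol(expected, actual, 0.2))
```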
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #31266 from Loquats/NaiveBayesSuite-reltol.
Authored-by: Andy Zhang <yue.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Increase the timeout for StreamingLinearRegressionSuite to 60s to deflake the test.
### Why are the changes needed?
Reduce merge conflict.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #31248 from liangz1/increase-timeout.
Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
update the ParamValidators of `maxDepth`
### Why are the changes needed?
current impl of tree models only supports maxDepth<=30
### Does this PR introduce _any_ user-facing change?
If `maxDepth`>30, fail quickly
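A fail-fast check of this kind could look like the sketch below. The upper bound of 30 comes from this PR; the lower bound of 0 and the function name `validate_max_depth` are assumptions made for illustration, not the actual `ParamValidators` code.

```python
MAX_SUPPORTED_DEPTH = 30  # upper bound stated in this PR


def validate_max_depth(max_depth: int) -> int:
    """Reject unsupported depths up front instead of failing during training.

    Assumption: 0 is the smallest valid depth. The name is hypothetical.
    """
    if not 0 <= max_depth <= MAX_SUPPORTED_DEPTH:
        raise ValueError(
            f"maxDepth must be in [0, {MAX_SUPPORTED_DEPTH}], got {max_depth}"
        )
    return max_depth
```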
### How was this patch tested?
existing testsuites
Closes #31163 from zhengruifeng/param_maxDepth_upbound.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, make `getTopIndices`/`selectIndicesFromPValues` private;
2, avoid setting `selectionThreshold` in `fit`
3, move param checking to `transformSchema`
### Why are the changes needed?
`getTopIndices`/`selectIndicesFromPValues` should not be exposed to end users;
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #31222 from zhengruifeng/selector_clean_up.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or
numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
### How was this patch tested?
Add Unit test
Closes #31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this PR changes them from `var` to `val`, except for `mockito`-related cases.
### Why are the changes needed?
Use `val` instead of `var` when possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #31142 from LuciferYang/SPARK-33346.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Some redundant collection conversions can be removed. For version compatibility, this PR cleans them up under the Scala 2.13 profile.
### Why are the changes needed?
Remove redundant collection conversions
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed
Closes #31125 from LuciferYang/SPARK-34068.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
use a tmp model in `OneVsRestModel.transform`, to avoid calling the model's setters directly
### Why are the changes needed?
params of model (submodels) should not be changed in transform
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes #31086 from zhengruifeng/ovr_transform_tmp_model.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Improve flaky NaiveBayes test
The current test may sometimes fail under a different BLAS library because of some `absTol` checks, with errors like:
```
Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...
```
* Change absTol to relTol: an `absTol` of 0.05 allows a big difference in some cases (such as comparing 0.1 and 0.05)
* Remove the `exp` when comparing params. The `exp` will amplify the relative error.
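Both points can be illustrated numerically. In the sketch below, the relative difference is scaled by the smaller magnitude (an assumed convention), and the log-space values -3.0 and -3.1 are made up for illustration; they are not taken from the test.

```python
import math


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


def rel_diff(x: float, y: float) -> float:
    # Relative difference scaled by the smaller magnitude (assumed convention).
    return abs(x - y) / min(abs(x), abs(y))


# An absolute difference of 0.05 looks small, yet one value is twice the other.
print(abs_diff(0.1, 0.05), rel_diff(0.1, 0.05))

# Hypothetical log-space parameters with magnitude above 1.
log_a, log_b = -3.0, -3.1
rel_log = rel_diff(log_a, log_b)                      # about 0.033
rel_exp = rel_diff(math.exp(log_a), math.exp(log_b))  # about 0.105

# Exponentiating turns the absolute gap of 0.1 into a relative gap of exp(0.1) - 1.
print(rel_log, rel_exp)
```

For log-space values with magnitude above 1, the same gap is therefore a larger relative error after `exp` than before it, which is why comparing the raw parameters is less fragile.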
### Why are the changes needed?
Flaky test
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #31004 from WeichenXu123/improve_bayes_tests.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass]
### Why are the changes needed?
Without this change when running tests for Scala 2.13 you get runtime code generation errors. These errors look like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests now pass for Scala 2.13
Closes #31018 from koertkuipers/feat-visibility-scala213.
Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added to `NodeData`, which means that a tree model trained in 2.4 cannot be loaded in 3.0/3.1/master.
The field `rawCount` is only used in training, and not in `transform`/`predict`/`featureImportance`, so I just set it to -1L.
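The idea of backfilling a field that older saved models lack can be sketched as follows. The record layout, the keys other than `rawCount`, and the function name `load_node` are hypothetical; only the placeholder value of -1 comes from this PR.

```python
RAW_COUNT_PLACEHOLDER = -1  # value this PR assigns when the field is absent


def load_node(record: dict) -> dict:
    """Return a node record that always carries `rawCount`.

    Records written by an older version lack the field. It is only needed
    during training, so a placeholder is enough for prediction.
    """
    node = dict(record)  # copy, so the stored record is left untouched
    node.setdefault("rawCount", RAW_COUNT_PLACEHOLDER)
    return node


old_node = {"id": 0, "prediction": 1.0}                  # hypothetical 2.4-style record
new_node = {"id": 0, "prediction": 1.0, "rawCount": 42}  # hypothetical 3.x-style record

print(load_node(old_node)["rawCount"], load_node(new_node)["rawCount"])
```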
### Why are the changes needed?
to support loading old tree models in 3.0/3.1/master
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuites
Closes #30889 from zhengruifeng/fix_tree_load.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There has been a lot of work on improving ALS's recommendForAll.
For now, I found that it may be further optimized by
1, using GEMV and sharing a pre-allocated buffer per task;
2, using guava.ordering instead of BoundedPriorityQueue;
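The general shape of the two ideas (one matrix-vector product per user, then a top-k selection that avoids a full sort) can be sketched in plain Python. This is not the Spark implementation: `heapq.nlargest` merely stands in for the guava ordering, the per-task buffer reuse is not shown, and all data below is made up.

```python
import heapq


def gemv(matrix, vector):
    """Dense matrix-vector product: one score per matrix row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def top_k(item_ids, scores, k):
    """Keep the k highest-scoring items without sorting the full list."""
    return heapq.nlargest(k, zip(scores, item_ids))


item_factors = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]  # made-up item factors
item_ids = [10, 11, 12]
user_factor = [1.0, 1.0]

scores = gemv(item_factors, user_factor)  # one score per item
best = top_k(item_ids, scores, 2)         # (score, item id) pairs, best first
print(best)
```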
### Why are the changes needed?
In my test, using `f2jBLAS.sgemv`, it is about 2.3X faster than existing impl.
|Impl| Master | GEMM | GEMV | GEMV + array aggregator | GEMV + guava ordering + array aggregator | GEMV + guava ordering|
|------|----------|------------|----------|------------|------------|------------|
|Duration|341229|363741|191201|189790|148417|147222|
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30468 from zhengruifeng/als_rec_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Improve LogisticRegression test error tolerance
### Why are the changes needed?
When we switch BLAS version, some of the tests will fail due to too strict error tolerance in test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30587 from WeichenXu123/fix_lor_test.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
1, directly use float vectors instead of converting to double vectors, which is about 2x faster than using vec.axpy;
2, mark `wordList` and `wordVecNorms` lazy
3, avoid slicing in computation of `wordVecNorms`
### Why are the changes needed?
halve broadcast size
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30548 from zhengruifeng/w2v_float32_transform.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes #30606 from dongjoon-hyun/SPARK-3.2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
on each call of `transform`, a head() job will be triggered, which can be skipped by using a lazy var.
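The effect of deferring and caching an expensive lookup can be sketched with `functools.cached_property`. The class `ModelSketch` and the callable `fetch_surrogates` are hypothetical stand-ins for the model and its `head()` job; this is not the Spark code.

```python
from functools import cached_property


class ModelSketch:
    """Hypothetical model; `fetch_surrogates` stands in for the head() job."""

    def __init__(self, fetch_surrogates):
        self._fetch_surrogates = fetch_surrogates

    @cached_property
    def surrogates(self):
        # Runs at most once per model instance, on first access.
        return self._fetch_surrogates()

    def transform(self, rows):
        fill = self.surrogates
        return [
            [fill[i] if value is None else value for i, value in enumerate(row)]
            for row in rows
        ]


calls = []


def expensive_fetch():
    calls.append(1)  # count how often the expensive job runs
    return [0.5, 7.0]


model = ModelSketch(expensive_fetch)
first = model.transform([[None, 1.0]])
second = model.transform([[2.0, None]])
print(first, second, len(calls))
```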
### Why are the changes needed?
avoiding duplicate head() jobs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #30550 from zhengruifeng/imputer_transform.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add array_to_vector function for dataframe column
### Why are the changes needed?
Utility function for array to vector conversion.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
scala unit test & doctest.
Closes #30498 from WeichenXu123/array_to_vec.
Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to do the following.
1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1
2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.)
3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job.
### Why are the changes needed?
Scala 2.13.4 is a maintenance release for 2.13 line and improves JDK 15 support.
- https://github.com/scala/scala/releases/tag/v2.13.4
Also, it improves exhaustivity check.
- https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors)
- https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components)
### Does this PR introduce _any_ user-facing change?
Yep. Although it's a maintenance version change, it's a Scala version change.
### How was this patch tested?
Pass the CIs and do the manual testing.
- Scala 2.12 CI jobs(GitHub Action/Jenkins UT/Jenkins K8s IT) to check the validity of code change.
- Scala 2.13 Compilation job to check the compilation
Closes #30455 from dongjoon-hyun/SCALA_3.13.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
impl a new strategy `mode`: replace missing values using the most frequent value along each column.
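As a rough illustration of the strategy, the sketch below fills missing entries with each column's most frequent value. It uses `None` as the missing marker and plain lists as data, which are assumptions for the example; it does not cover tie-breaking or columns that are entirely missing.

```python
from collections import Counter


def column_modes(rows, missing=None):
    """Most frequent non-missing value of each column."""
    modes = []
    for column in zip(*rows):
        counts = Counter(value for value in column if value is not missing)
        modes.append(counts.most_common(1)[0][0])
    return modes


def impute_mode(rows, missing=None):
    """Replace missing entries with the mode of their column."""
    modes = column_modes(rows, missing)
    return [
        [modes[i] if value is missing else value for i, value in enumerate(row)]
        for row in rows
    ]


data = [[1.0, 4.0], [1.0, None], [None, 4.0], [3.0, 5.0]]  # made-up input
filled = impute_mode(data)
print(filled)
```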
### Why are the changes needed?
it is highly scalable, and has been available in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time.
### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added
### How was this patch tested?
updated testsuites
Closes #30397 from zhengruifeng/imputer_max_freq.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds a new Scala compile argument to `pom.xml` to defend against new unused imports:
- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13
The other file changes remove all unused imports in Spark code.
### Why are the changes needed?
Clean up code and add a guarantee to defend against new unused imports
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30351 from LuciferYang/remove-imports-core-module.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors;
### Why are the changes needed?
the performance gain is mainly related to the nnz of block.
### Does this PR introduce _any_ user-facing change?
yes, param blockSize -> maxBlockSizeInMB in master
### How was this patch tested?
updated testsuites
Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
In [SPARK-33139] we defined `setActiveSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion, even without that PR they should work with the unified view feature; they might only be a risk if a user really abuses these two APIs. So reverting the PR is needed.
[SPARK-33139] has two commits, including a follow-up. Revert them both.
### Why are the changes needed?
Revert.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #30367 from leanken/leanken-revert-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
This PR is the first part of resolving SPARK-33352:
- For constructor definitions, add `=` to convert to function syntax
- For method definitions without a return type, add `: Unit =` to convert to function syntax
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, optimize `predictQuantiles` by pre-computing an auxiliary var.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. I find that we can also optimize `predictQuantiles` by pre-computing an auxiliary var.
It is about 56% faster than existing impl.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test to `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes #30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession.
Change of the PR:
* add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users do need to call these two APIs.
* by default, if users call these two APIs, an exception is thrown
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes #30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
use `lazy array` instead of `var` for auxiliary variables in binary lor
### Why are the changes needed?
In https://github.com/apache/spark/pull/29255, I made a mistake:
the `private var _threshold` and `_rawThreshold` are initialized by default values of `threshold`, that is because:
1, param `threshold` is set default value at first;
2, `_threshold` and `_rawThreshold` are initialized based on the default value;
3, param `threshold` is updated by the value from estimator, by `copyValues` method:
```
if (map.contains(param) && to.hasParam(param.name)) {
  to.set(param.name, map(param))
}
```
We can update `_threshold` and `_rawThreshold` in `setThreshold` and `setThresholds`, but we can not update them in `set`/`copyValues` so their values are kept until methods `setThreshold` and `setThresholds` are called.
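The initialization-order problem described above can be reproduced in a few lines. Both classes are hypothetical stand-ins (not the Spark model): the eager one snapshots the default at construction, while the lazy one derives the value on first read, after the generic `set` has already copied the estimator's value.

```python
from functools import cached_property


class EagerModel:
    """Caches the derived value at construction time (the buggy pattern)."""

    def __init__(self):
        self.params = {"threshold": 0.5}                   # default value
        self.cached_threshold = self.params["threshold"]   # snapshot of the default

    def set(self, name, value):
        # Generic setter, as used when params are copied from the estimator.
        self.params[name] = value


class LazyModel:
    """Derives the value on first read instead of at construction time."""

    def __init__(self):
        self.params = {"threshold": 0.5}

    def set(self, name, value):
        self.params[name] = value

    @cached_property
    def cached_threshold(self):
        return self.params["threshold"]


eager = EagerModel()
eager.set("threshold", 0.8)   # the snapshot is not refreshed

lazy = LazyModel()
lazy.set("threshold", 0.8)    # happens before the first read

print(eager.cached_threshold, lazy.cached_threshold)
```

Note that the lazy value is frozen after its first read, so this pattern only helps when all generic `set` calls happen before that read.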
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in repl
Closes #30013 from zhengruifeng/lor_threshold_init.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, when `predictionCol` and `quantilesCol` are both set, we only need one prediction for each row: prediction is just the variable `lambda` in `predictQuantiles`;
2, in the computation of variable `quantiles` in `predictQuantiles`, a pre-computed vector `val baseQuantiles = $(quantileProbabilities).map(q => math.exp(math.log(-math.log1p(-q)) * scale))` can be reused for each row;
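The reuse in point 2 can be sketched as below. The `base_quantiles` expression is transcribed from the Scala snippet above; the assumption that each row's quantiles are its `lambda` multiplied by those shared base values, and all numeric inputs, are illustrative only.

```python
import math


def base_quantiles(quantile_probabilities, scale):
    """Row-independent part of the quantile computation, computed once."""
    return [math.exp(math.log(-math.log1p(-q)) * scale) for q in quantile_probabilities]


def predict_quantiles(lam, base):
    # Assumption: a row's quantiles are its lambda times the shared base values.
    return [lam * b for b in base]


probs = [0.1, 0.5, 0.9]   # made-up quantile probabilities
scale = 1.5               # made-up scale parameter
base = base_quantiles(probs, scale)

row_lambdas = [2.0, 3.0]  # made-up per-row predictions
per_row = [predict_quantiles(lam, base) for lam in row_lambdas]
print(per_row)
```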
### Why are the changes needed?
avoid redundant computation in transform, like what we did in `ProbabilisticClassificationModel`, `GaussianMixtureModel`, etc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite
Closes #30000 from zhengruifeng/aft_predict_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.
### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should be able to read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
### How was this patch tested?
Added UT to `LibSVMRelationSuite`.
Closes #29984 from MaxGekk/ml-option-propagation.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
RowMatrix contains a computation based on spark.driver.maxResultSize. However, when this value is set to 0, the computation fails (log of 0). The fix is simply to correctly handle this setting, which means unlimited result size, by using a tree depth of 1 in the RowMatrix method.
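A sketch of why 0 breaks the computation, and of the guard described above, follows. The depth formula used here (a ratio of logarithms) is an assumption made for illustration and may differ from the actual `RowMatrix` code; the function name and arguments are hypothetical.

```python
import math


def tree_aggregate_depth(num_partitions, aggregated_size_bytes, max_result_size_bytes):
    """Pick an aggregation depth so partial results fit within the driver limit.

    Assumption: depth = ceil(log(partitions) / (log(limit) - log(size))).
    """
    if max_result_size_bytes <= 0:
        # 0 means unlimited, so skip the logarithm entirely.
        return 1
    if max_result_size_bytes <= aggregated_size_bytes:
        raise ValueError("aggregated object does not fit in the result size limit")
    numerator = math.log(num_partitions)
    denominator = math.log(max_result_size_bytes) - math.log(aggregated_size_bytes)
    return max(1, math.ceil(numerator / denominator))


# Without the guard, a limit of 0 would reach math.log(0), which raises ValueError.
unlimited_depth = tree_aggregate_depth(1000, 2 ** 20, 0)
limited_depth = tree_aggregate_depth(1000, 2 ** 20, 2 ** 22)
print(unlimited_depth, limited_depth)
```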
### Why are the changes needed?
Simple bug fix to make several Spark ML functions which use RowMatrix run correctly in this case.
### Does this PR introduce _any_ user-facing change?
Not other than the bug fix of course.
### How was this patch tested?
Existing RowMatrix tests plus a new test.
Closes #29925 from srowen/SPARK-33043.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to promote the stability annotation to `Evolving` for MLEvent traits/classes.
### Why are the changes needed?
The feature was released in Spark 3.0.0, with SPARK-26818 as the last change in Feb. 2020, and hasn't changed in Spark 3.0.1. (There has been no change for more than half a year.)
While we'd better wait for some minor releases before considering the API stable, it would be worth promoting it to Evolving so that we clearly state that we support the API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Just changed the annotation, no tests required.
Closes #29887 from HeartSaVioR/SPARK-33011.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32972. A total of 51 failed Scala test cases and 3 failed Java test cases were fixed. The main changes of this PR are as follows:
- Specify `Seq` as `scala.collection.Seq` in `case` matches on `Seq` and in `x.asInstanceOf[Seq[T]]` casts
- Use `Row.getSeq[T]` instead of `Row.getAs[Seq]`
- Manually call the `toMap` method to convert `MapView` to `Map` in Scala 2.13
- Change the tol in the last test to 0.75 to pass `RandomForestRegressorSuite#training with sample weights` in Scala 2.13
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn
```
**Before**
```
[ERROR] Errors:
[ERROR] JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR] JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR] JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...
Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***
```
**After**
```
[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.
```
Closes #29857 from LuciferYang/fix-mllib-2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, update the comment: `Note, the relevant columns must also be set in inputCols` -> `Note, the relevant columns should also be set in inputCols`;
2, add a check, and if there are `categoricalCols` not set in `inputCols`, log a warning;
### Why are the changes needed?
1, there is no check to make sure `categoricalCols` are all set in `inputCols`; to keep existing behavior, update this comment;
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
repl
Closes #29868 from zhengruifeng/feature_hash_cat_doc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
pre-compute the output indices of numerical columns, instead of computing them on each row.
### Why are the changes needed?
for a numerical column, its output index is a hash of its `col_name`, we can pre-compute it at first, instead of computing it on each row.
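The optimization amounts to hoisting a per-column hash out of the per-row loop, as in the sketch below. `zlib.crc32` is only a stand-in hash chosen for illustration (it is not the hash Spark uses), and the feature-space size, function names, and input rows are made up.

```python
import zlib

NUM_FEATURES = 1 << 18  # hypothetical size of the hashed feature space


def hash_index(name: str, num_features: int = NUM_FEATURES) -> int:
    # Stand-in hash for illustration only; not the hash used by Spark.
    return zlib.crc32(name.encode("utf-8")) % num_features


def transform(rows, numeric_cols):
    # The index depends only on the column name, so compute it once up front.
    col_index = {name: hash_index(name) for name in numeric_cols}
    output = []
    for row in rows:
        features = {}
        for name in numeric_cols:
            idx = col_index[name]  # plain lookup, no hashing per row
            features[idx] = features.get(idx, 0.0) + row[name]
        output.append(features)
    return output


rows = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]  # made-up input
hashed = transform(rows, ["a", "b"])
print(hashed)
```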
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29850 from zhengruifeng/hash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
`HashingTF` uses `util.collection.OpenHashMap` instead of `mutable.HashMap`
### Why are the changes needed?
according to `util.collection.OpenHashMap` 's doc:
> This map is about 5X faster than java.util.HashMap, while using much less space overhead.
according to performance tests like ([Simple microbenchmarks comparing Scala vs Java mutable map performance](https://gist.github.com/pchiusano/1423303)), `mutable.HashMap` may be less efficient than `java.util.HashMap`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29852 from zhengruifeng/hashingtf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
revert blockify gmm
### Why are the changes needed?
WeichenXu123 and I thought we should use memory size instead of number of rows to blockify instances; then if a buffer's size is large and determined by number of rows, we should discard it.
In GMM, we found that the pre-allocated memory may be too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discussion and thought it is better to revert blockify GMM.
### Does this PR introduce _any_ user-facing change?
blockSize added in master branch will be removed
### How was this patch tested?
existing testsuites
Closes #29782 from zhengruifeng/unblockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, simplify the aggregation by getting `count` via `summary.count`
2, ignore nan values like the old impl:
```
val relativeError = 0.05
val approxQuantile = numNearestNeighbors.toDouble / count + relativeError
val modelDatasetWithDist = modelDataset.withColumn(distCol, hashDistCol)
if (approxQuantile >= 1) {
  modelDatasetWithDist
} else {
  val hashThreshold = modelDatasetWithDist.stat
    .approxQuantile(distCol, Array(approxQuantile), relativeError)
  // Filter the dataset where the hash value is less than the threshold.
  modelDatasetWithDist.filter(hashDistCol <= hashThreshold(0))
}
```
### Why are the changes needed?
simplify the aggregation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29778 from zhengruifeng/lsh_nit.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR fixes code which causes errors when building with sbt and Scala 2.13, as follows.
```
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = requestOffset < part.untilOffset
[error] [warn]
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294: method with a single empty parameter list overrides method without any parameter list
[error] [warn] override def hasNext(): Boolean = okNext
```
More specifically, what this PR fixes are
* Methods which have an empty parameter list and override a method which has no parameter list.
```
override def hasNext(): Boolean = okNext
```
* Methods which have no parameter list and override a method which has an empty parameter list.
```
override def next: (Int, Double) = {
```
* Infix operator expressions that wrap across lines at the operator.
```
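// Before: the operator starts the continuation line.
3L * math.min(k, numFeatures) * math.min(k, numFeatures)
+ math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
* math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
// After: the operator ends the preceding line.
3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
math.min(k, numFeatures) + 4L * math.min(k, numFeatures))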
```
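As a minimal sketch of the three patterns in their fixed form (the `CountDown` iterator and `SizeSketch.estimate` are hypothetical names used only for illustration, not code from this PR):

```scala
// Hypothetical iterator, used only to illustrate the override rules.
class CountDown(start: Int) extends Iterator[Int] {
  private var current = start

  // Iterator.hasNext is declared without a parameter list, so the override has none.
  override def hasNext: Boolean = current > 0

  // Iterator.next() is declared with an empty parameter list, so the override keeps it.
  override def next(): Int = {
    val value = current
    current -= 1
    value
  }
}

object SizeSketch {
  // Hypothetical helper: every wrapped line ends with its operator
  // instead of starting the next line with it.
  def estimate(k: Long, numFeatures: Long): Long = {
    val m = math.min(k, numFeatures)
    3L * m * m +
      math.max(math.max(k, numFeatures), 4L * m *
        m + 4L * m)
  }
}
```

For example, `new CountDown(3).toList` yields `List(3, 2, 1)`.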
### Why are the changes needed?
For building Spark with sbt and Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
With this change and #29742 applied, compilation passes with the following command.
```
build/sbt -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```
Closes #29745 from sarutak/fix-code-for-sbt-and-spark-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue with the LibSVM datasource that occurs when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*`, etc.
The fix is based on another bug fix for CSV/JSON datasources https://github.com/apache/spark/pull/29659.
### Why are the changes needed?
To fix the issue where the following query tries to read from a path containing `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but would end up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
```
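For context, a literal `[abc]` in a path has to be backslash-escaped so that globbing treats it as plain text rather than a character class. A minimal sketch of such escaping follows; `GlobEscape` is a hypothetical helper, not part of Spark or Hadoop, and its set of metacharacters is an assumption rather than an exhaustive list:

```scala
// Hypothetical helper, not a Spark or Hadoop API.
object GlobEscape {
  // Characters treated as glob metacharacters in this sketch (assumed, not exhaustive).
  private val metaChars: Set[Char] = Set('[', ']', '{', '}', '*', '?', '\\')

  // Prefix every metacharacter with a backslash so it is matched literally.
  def escape(path: String): String =
    path.flatMap(c => if (metaChars(c)) "\\" + c else c.toString)
}
```

With this sketch, `GlobEscape.escape("/tmp/[abc].csv")` produces the same escaped form that is passed to `load` in the query above.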
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
Added UT to `LibSVMRelationSuite`.
Closes #29670 from MaxGekk/globbing-paths-when-inferring-schema-ml.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix double caching in KMeans/BiKMeans:
1, let the callers of `runWithWeight` pass whether `handlePersistence` is needed;
2, persist and unpersist inside of `runWithWeight`;
3, persist the `norms` if needed according to the comments;
### Why are the changes needed?
avoid double caching
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing testsuites
Closes #29501 from zhengruifeng/kmeans_handlePersistence.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The strict requirement for the vocabulary to remain non-empty has been removed in this pull request.
Link to the discussion: http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html
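A minimal sketch of what counting terms against an empty vocabulary looks like (`EmptyVocabSketch` is a hypothetical stand-in, not the `CountVectorizerModel` implementation): the result is simply an empty set of counts rather than an error.

```scala
// Hypothetical stand-in for term counting, not the Spark implementation.
object EmptyVocabSketch {
  // Returns (vocabulary index -> count) for the tokens that occur in the vocabulary.
  def termCounts(tokens: Seq[String], vocabulary: Array[String]): Map[Int, Double] = {
    val index = vocabulary.zipWithIndex.toMap
    tokens
      .flatMap(t => index.get(t))   // drop tokens that are not in the vocabulary
      .groupBy(identity)
      .map { case (i, hits) => i -> hits.size.toDouble }
  }
}
```

With an empty vocabulary every token is dropped, so the returned map is empty.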
### Why are the changes needed?
This smooths over corner cases. Without it, the user has to manipulate the data even in genuine cases where an empty vocabulary is a perfectly valid outcome.
Question: Should we log a message when an empty vocabulary is found instead?
### Does this PR introduce _any_ user-facing change?
Possibly a slight change: if someone has used a try-catch to detect an empty vocabulary, that behavior will no longer hold.
### How was this patch tested?
1. Added testcase to `fit` generating an empty vocabulary
2. Added testcase to `transform` with empty vocabulary
Request to review: srowen hhbyyh
Closes #29482 from purijatin/spark_32662.
Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Set default param values in trait `Params` for feature and tuning in both Scala and Python.
### Why are the changes needed?
Make ML have the same default param values for an estimator and its corresponding transformer, and also between Scala and Python.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing and modified tests
Closes #29153 from huaxingao/default2.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
for binary `LogisticRegressionModel`:
1, keep variables `_threshold` and `_rawThreshold` instead of computing them on each instance;
2, in `raw2probabilityInPlace`, make use of the characteristic that the sum of probability is 1.0;
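A minimal sketch of both ideas, under the assumption of a standard sigmoid link (`BinaryLogisticSketch` and its method names are hypothetical, not the Spark implementation): the raw threshold is derived once from the probability threshold, and the probability of class 0 is obtained as one minus the probability of class 1.

```scala
// Hypothetical sketch, not the LogisticRegressionModel code.
object BinaryLogisticSketch {
  // Convert a probability threshold into a threshold on the raw margin (logit),
  // so it can be computed once and reused for every instance.
  def rawThreshold(threshold: Double): Double =
    if (threshold == 0.0) Double.NegativeInfinity
    else if (threshold == 1.0) Double.PositiveInfinity
    else math.log(threshold / (1.0 - threshold))

  // Only one sigmoid is evaluated; the other probability follows from p0 + p1 = 1.0.
  def rawToProbability(margin: Double): (Double, Double) = {
    val p1 = 1.0 / (1.0 + math.exp(-margin))
    (1.0 - p1, p1)
  }

  // Predict directly on the margin using the precomputed raw threshold.
  def predict(margin: Double, rawT: Double): Double =
    if (margin > rawT) 1.0 else 0.0
}
```

Comparing the margin against a cached raw threshold avoids recomputing the sigmoid or the threshold for each prediction.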
### Why are the changes needed?
for better performance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite and performance test in REPL
Closes #29255 from zhengruifeng/pred_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Add training summary to MultilayerPerceptronClassificationModel...
### Why are the changes needed?
so that users can get the status of the training process, such as the loss value of each iteration and the total number of iterations.
### Does this PR introduce _any_ user-facing change?
Yes
MultilayerPerceptronClassificationModel.summary
MultilayerPerceptronClassificationModel.evaluate
### How was this patch tested?
new tests
Closes #29250 from huaxingao/mlp_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
logParam `thresholds` in DT/GBT/FM/LR/MLP
### Why are the changes needed?
param `thresholds` is logged in NB/RF, but not in other ProbabilisticClassifier
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #29257 from zhengruifeng/instr.logParams_add_thresholds.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Updates to scalatest 3.2.0. Though it looks large, it is 99% changes to the new location of scalatest classes.
### Why are the changes needed?
3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.
### Does this PR introduce _any_ user-facing change?
No, only affects tests.
### How was this patch tested?
Existing tests.
Closes #29196 from srowen/SPARK-32398.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>