### What changes were proposed in this pull request?
1, add a new param `sorted` in `slice`;
2, in `VectorSlicer`, set `sorted = true` if input indices are ordered.
### Why are the changes needed?
The input indices of VectorSlicer are probably ordered.
VectorSlicer should use this attribute if possible.
I did a simple test and `sorted = true` maybe about 70% faster than existing `slice`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes#31588 from zhengruifeng/vector_slice_for_sorted_indices.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Document `mode` as a supported Imputer strategy in Pyspark docs.
### Why are the changes needed?
Support was added in 3.1, and documented in Scala, but some Python docs were missed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#31883 from srowen/ImputerModeDocs.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently equal and the second instance is replaced to the short code of the first one during serialization.
### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:
```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
def u1(e):
return e[0]
return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
| 1|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3| c4|
+---+----+
|1.0|null|
+---+----+
```
This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113
After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0| 1|
+---+---+
```
### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new UT + manual tests.
Closes#31682 from peter-toth/SPARK-34545-fix-row-comparison.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective techinique than a grid search since min/max points can fall between the grid and never be found. Randomisation is not so restricted although the probability of finding minima/maxima is dependent on the number of attempts.
Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
Closes#31535 from PhillHenry/ParamRandomBuilder.
Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
UserDefinedType and UDTRegistration become public Developer APIs, not package-private to Spark.
### Why are the changes needed?
This proposes to simply open up the UserDefinedType class as a developer API. It was public in 1.x, but closed in 2.x for some possible redesign that does not seem to have happened.
Other libraries have managed to define UDTs anyway by inserting shims into the Spark namespace, and this evidently has worked OK. But package isolation in Java 9+ breaks this.
The logic here is mostly: this is de facto a stable API, so can at least be open to developers with the usual caveats about developer APIs.
Open questions:
- Is there in fact some important redesign that's needed before opening it? The comment to this effect is from 2016
- Is this all that needs to be opened up? Like PythonUserDefinedType?
- Should any of this be kept package-private?
This was first proposed in https://github.com/apache/spark/pull/16478 though it was a larger change, but, the other API issues it was fixing seem to have been addressed already (e.g. no need to return internal Spark types). It was never really reviewed.
My hunch is that there isn't much downside, and some upside, to just opening this as-is now.
### Does this PR introduce _any_ user-facing change?
UserDefinedType becomes visible to developers to subclass.
### How was this patch tested?
Existing tests; there is no change to the existing logic.
Closes#31461 from srowen/SPARK-7768.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This follows up #31160 to update score function in the document.
### Why are the changes needed?
Currently we use `f_classif`, `ch2`, `f_regression`, which sound to me the sklearn's naming. It is good to have it but I think it is nice if we have formal score function name with sklearn's ones.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No, only doc change.
Closes#31531 from viirya/SPARK-34080-minor.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Increase the rel tol from 0.2 to 0.35.
### Why are the changes needed?
Fix flaky test
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT.
Closes#31536 from WeichenXu123/ES-65815.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1, clear predictionCol & probabilityCol, use tmp rawPred col, to avoid potential column conflict;
2, use array instead of map, to keep in line with the python side;
3, simplify transform
### Why are the changes needed?
if input dataset has a column whose name is `predictionCol`,`probabilityCol`,`RawPredictionCol`, transfrom will fail.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes#31472 from zhengruifeng/ovr_submodel_skip_pred_prob.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
`hashDistance` optimization: if two vectors in a pair are the same, directly return 0.0
### Why are the changes needed?
it should be faster than existing impl, because of short-circuit
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31394 from zhengruifeng/min_hash_distance_opt.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Param Validation throw `IllegalArgumentException`
### Why are the changes needed?
Param Validation should throw `IllegalArgumentException` instead of `IllegalStateException`
### Does this PR introduce _any_ user-facing change?
Yes, the type of exception changed
### How was this patch tested?
existing testsuites
Closes#31469 from zhengruifeng/mllib_exceptions.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, update checking of numFeatures;
2, update `toString` to take `names` into account;
### Why are the changes needed?
1, should use `inputAttr.size` instead of `inputAttr.numAttributes` to get numFeatures;
2, this checking is necessary only if `$(indices).nonEmpty`, otherwise, `$(indices).max` will throw exception java.lang.UnsupportedOperationException: empty.max
3, in toString, should add length of names;
### Does this PR introduce _any_ user-facing change?
Yes, toString now count `names`;
### How was this patch tested?
existing testsuites
Closes#31354 from zhengruifeng/vector_slicer_clean.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Use `count` to simplify `find + size(or length)` operation, it's semantically consistent, but looks simpler.
**Before**
```
seq.filter(p).size
```
**After**
```
seq.count(p)
```
### Why are the changes needed?
Code Simpilefications.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31374 from LuciferYang/SPARK-34275.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1, use Guavaording instead of BoundedPriorityQueue;
2, use local variables;
3, avoid conversion: ml.vector -> mllib.vector
### Why are the changes needed?
this pr is about 30% faster than existing impl
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
existing testsuites
Closes#31276 from zhengruifeng/w2v_findSynonyms_opt.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
use GEMV instead of DOT
### Why are the changes needed?
1, better performance, could be 20% faster than existing impl
2, simplify model saving
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31313 from zhengruifeng/random_project_opt.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, update related doc;
2, MatrixFactorizationModel use GEMV;
### Why are the changes needed?
see performance gain in https://github.com/apache/spark/pull/30468
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
existing testsuites
Closes#31279 from zhengruifeng/als_follow_up.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, make `silhouette` a method;
2, change return type of `setDistanceMeasure` to `this.type`;
### Why are the changes needed?
see comments in https://github.com/apache/spark/pull/28590
### Does this PR introduce _any_ user-facing change?
No, 3.1 has not been released
### How was this patch tested?
existing testsuites
Closes#31334 from zhengruifeng/31768-followup.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
determine the numParts by numNodes
### Why are the changes needed?
current model saving may generate too many small files,
a tree model can be too large to single partition (a RandomForestClassificationModel with numTrees=100 and depth=20, its size is 226M)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31090 from zhengruifeng/treemodel_single_part.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This test fails flakily. I found it failing in 1 out of 80 runs.
```
Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance.
```
Increasing relative tolerance to 0.2 should improve flakiness.
```
0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)|
```
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#31266 from Loquats/NaiveBayesSuite-reltol.
Authored-by: Andy Zhang <yue.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Increase the timeout for StreamingLinearRegressionSuite to 60s to deflake the test.
### Why are the changes needed?
Reduce merge conflict.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#31248 from liangz1/increase-timeout.
Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
update the ParamValidators of `maxDepth`
### Why are the changes needed?
current impl of tree models only support maxDepth<=30
### Does this PR introduce _any_ user-facing change?
If `maxDepth`>30, fail quickly
### How was this patch tested?
existing testsuites
Closes#31163 from zhengruifeng/param_maxDepth_upbound.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, make `getTopIndices`/`selectIndicesFromPValues` private;
2, avoid setting `selectionThreshold` in `fit`
3, move param checking to `transformSchema`
### Why are the changes needed?
`getTopIndices`/`selectIndicesFromPValues` should not be exposed to end users;
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#31222 from zhengruifeng/selector_clean_up.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector
### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.
### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures", numTopFeatures=100)
```
Or
numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures", numTopFeatures=100)
```
### How was this patch tested?
Add Unit test
Closes#31160 from huaxingao/UnivariateSelector.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this pr turn these from `var` to `val` except for `mockito` related cases.
### Why are the changes needed?
Use `val` instead of `var` when possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31142 from LuciferYang/SPARK-33346.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are some redundant collection conversion can be removed, for version compatibility, clean up these with Scala-2.13 profile.
### Why are the changes needed?
Remove redundant collection conversion
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed
Closes#31125 from LuciferYang/SPARK-34068.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
use a tmp model in OneVsRestModel.transform, to avoid calling directly setter of model
### Why are the changes needed?
params of model (submodels) should not be changed in transform
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuite
Closes#31086 from zhengruifeng/ovr_transform_tmp_model.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Improve flaky NaiveBayes test
Current test may sometimes fail under different BLAS library. Due to some absTol check. Error like
```
Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...
```
* Change absTol to relTol: The `absTol 0.05` in some cases (such as compare 0.1 and 0.05) is a big difference
* Remove the `exp` when comparing params. The `exp` will amplify the relative error.
### Why are the changes needed?
Flaky test
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#31004 from WeichenXu123/improve_bayes_tests.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass]
### Why are the changes needed?
Without this change when running tests for Scala 2.13 you get runtime code generation errors. These errors look like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests now pass for Scala 2.13
Closes#31018 from koertkuipers/feat-visibility-scala213.
Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added into `NodeData`, which cause that a tree model trained in 2.4 can not be loaded in 3.0/3.1/master;
field `rawCount` is only used in training, and not used in `transform`/`predict`/`featureImportance`. So I just set it to -1L.
### Why are the changes needed?
to support load old tree model in 3.0/3.1/master
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added testsuites
Closes#30889 from zhengruifeng/fix_tree_load.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There were a lot of works on improving ALS's recommendForAll
For now, I found that it maybe futhermore optimized by
1, using GEMV and sharing a pre-allocated buffer per task;
2, using guava.ordering instead of BoundedPriorityQueue;
### Why are the changes needed?
In my test, using `f2jBLAS.sgemv`, it is about 2.3X faster than existing impl.
|Impl| Master | GEMM | GEMV | GEMV + array aggregator | GEMV + guava ordering + array aggregator | GEMV + guava ordering|
|------|----------|------------|----------|------------|------------|------------|
|Duration|341229|363741|191201|189790|148417|147222|
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#30468 from zhengruifeng/als_rec_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Improve LogisticRegression test error tolerance
### Why are the changes needed?
When we switch BLAS version, some of the tests will fail due to too strict error tolerance in test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#30587 from WeichenXu123/fix_lor_test.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
1, directly use float vectors instead of converting to double vectors, this is about 2x faster than using vec.axpy;
2, mark `wordList` and `wordVecNorms` lazy
3, avoid slicing in computation of `wordVecNorms`
### Why are the changes needed?
halve broadcast size
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#30548 from zhengruifeng/w2v_float32_transform.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.
### Why are the changes needed?
Start to prepare Apache Spark 3.2.0.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Pass the CIs.
Closes#30606 from dongjoon-hyun/SPARK-3.2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
on each call of `transform`, a head() job will be triggered, which can be skipped by using a lazy var.
### Why are the changes needed?
avoiding duplicate head() jobs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes#30550 from zhengruifeng/imputer_transform.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add array_to_vector function for dataframe column
### Why are the changes needed?
Utility function for array to vector conversion.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
scala unit test & doctest.
Closes#30498 from WeichenXu123/array_to_vec.
Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes#30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims the followings.
1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1
2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.)
3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job.
### Why are the changes needed?
Scala 2.13.4 is a maintenance release for 2.13 line and improves JDK 15 support.
- https://github.com/scala/scala/releases/tag/v2.13.4
Also, it improves exhaustivity check.
- https://github.com/scala/scala/pull/9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors)
- https://github.com/scala/scala/pull/9147 (Check all bindings exhaustively, e.g. tuples components)
### Does this PR introduce _any_ user-facing change?
Yep. Although it's a maintenance version change, it's a Scala version change.
### How was this patch tested?
Pass the CIs and do the manual testing.
- Scala 2.12 CI jobs(GitHub Action/Jenkins UT/Jenkins K8s IT) to check the validity of code change.
- Scala 2.13 Compilation job to check the compilation
Closes#30455 from dongjoon-hyun/SCALA_3.13.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
impl a new strategy `mode`: replace missing using the most frequent value along each column.
### Why are the changes needed?
it is highly scalable, and had been a function in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for a long time.
### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added
### How was this patch tested?
updated testsuites
Closes#30397 from zhengruifeng/imputer_max_freq.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This pr add a new Scala compile arg to `pom.xml` to defense against new unused imports:
- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13
The other fIles change are remove all unused imports in Spark code
### Why are the changes needed?
Cleanup code and add guarantee to defense against new unused imports
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30351 from LuciferYang/remove-imports-core-module.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors;
### Why are the changes needed?
the performance gain is mainly related to the nnz of block.
### Does this PR introduce _any_ user-facing change?
yes, param blockSize -> blockSizeInMB in master
### How was this patch tested?
updated testsuites
Closes#30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
In [SPARK-33139] we defined `setActionSession` and `clearActiveSession` as deprecated API, it turns out it is widely used, and after discussion, even if without this PR, it should work with unify view feature, it might only be a risk if user really abuse using these two API. So revert the PR is needed.
[SPARK-33139] has two commit, include a follow up. Revert them both.
### Why are the changes needed?
Revert.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#30367 from leanken/leanken-revert-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
this pr is the first part to resolve SPARK-33352:
- For constructors method definition add `=` to convert to function syntax
- For without `return type` methods definition add `: Unit =` to convert to function syntax
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13 and this change should be compatible with Scala 2.12
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, optimize `predictQuantiles` by pre-computing an auxiliary var.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. I find that we can also optimize `predictQuantiles` by pre-computing an auxiliary var.
It is about 56% faster than existing impl.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` to `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, CSVSuite` and to `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes#30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure user can't pollute the SQLConf and SparkSession Context via calling setActiveSession and clearActiveSession.
Change of the PR:
* add legacy config spark.sql.legacy.allowModifyActiveSession to fallback to old behavior if user do need to call these two API.
* by default, if user call these two API, it will throw exception
* add extra two internal and private API setActiveSessionInternal and clearActiveSessionInternal for current internal usage
* change all internal reference to new internal API except for SQLContext.setActive and SQLContext.clearActive
### Why are the changes needed?
Make SQLConf.get reliable and stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test
Closes#30042 from leanken/leanken-SPARK-33139.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
use `lazy array` instead of `var` for auxiliary variables in binary lor
### Why are the changes needed?
In https://github.com/apache/spark/pull/29255, I made a mistake:
the `private var _threshold` and `_rawThreshold` are initialized by defaut values of `threshold`, that is beacuse:
1, param `threshold` is set default value at first;
2, `_threshold` and `_rawThreshold` are initialized based on the default value;
3, param `threshold` is updated by the value from estimator, by `copyValues` method:
```
if (map.contains(param) && to.hasParam(param.name)) {
to.set(param.name, map(param))
}
```
We can update `_threshold` and `_rawThreshold` in `setThreshold` and `setThresholds`, but we can not update them in `set`/`copyValues` so their values are kept until methods `setThreshold` and `setThresholds` are called.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
test in repl
Closes#30013 from zhengruifeng/lor_threshold_init.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1, when `predictionCol` and `quantilesCol` are both set, we only need one prediction for each row: prediction is just the variable `lambda` in `predictQuantiles`;
2, in the computation of variable `quantiles` in `predictQuantiles`, a pre-computed vector `val baseQuantiles = $(quantileProbabilities).map(q => math.exp(math.log(-math.log1p(-q)) * scale))` can be reused for each row;
### Why are the changes needed?
avoid redundant computation in transform, like what we did in `ProbabilisticClassificationModel`, `GaussianMixtureModel`, etc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite
Closes#30000 from zhengruifeng/aft_predict_transform_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Propagate LibSVM options to Hadoop configs in the LibSVM datasource.
### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
The underlying file system will not receive the `conf` options.
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for example, users should read files from Azure Data Lake successfully:
```scala
def hadoopConf1() = Map[String, String](
s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
### How was this patch tested?
Added UT to `LibSVMRelationSuite`.
Closes#29984 from MaxGekk/ml-option-propagation.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>